Read first this page to get up to speed. Then come back.
Welcome back. SNOB is a program written by Chris Wallace and encoding for the Minimum Message Length method for cluster analysis. Bibliography is available from this link. You can possibly start from this paper. On the technical front proceed as follows :
Do it :
mkdir snob/ cd snob/ # # Use every 10th line # perl -ne 'print ((0 == $. % 10) ? $_ : "")' /home/myself/something/carma/carma.dPCA.fluctuations.dat > selected_lines # # Use top 8 PCs (first column is the frame number) # awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9}' selected_lines > selected_lines_and_PCs
Prepare the vset file. Change the number '8' and the number of lines containing the expression 'PC?? 1' to reflect number of PCs selected above.
cat > data.vset data 8 PC01 1 PC02 1 PC03 1 PC04 1 PC05 1 PC06 1 PC07 1 PC08 1 <CTRL-D>
Prepare the header portion of samp file. The number of lines containing '0.000001'
must be equal to number of PCs selected. The last number is number of
frames (use wc selected_lines_and_PCs
to determine).
cat > header data data 0.000001 0.000001 0.000001 0.000001 0.000001 0.000001 0.000001 0.000001 184456 <CTRL-D>
You can now prepare the samp file and the script for SLURM :
cat header selected_lines_and_PCs > data.samp cat > snob.sh #!/bin/tcsh snob-vanilla << eof data.vset data.samp doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 10000 mrep out trep out prclass -1 0 tree stop eof exit <CTRL-D>
When ready submit the job to the cluster :
sbatch -n 1 snob.sh
The job will take a while. It is reading time again …
SNOB will assign each and every frame to at least one cluster. Because this is a proper Bayesian approach, the assignments are probabilistic : what you get for every frame is a set of assignments of the form : “the probability of frame X belonging to cluster A is p1, belonging to cluster B is p2, belonging to cluster C is p3, …”. This, admittedly, is a very satisfying approach to clustering. You can envision a similarly probabilistic approach to calculating, say, representative structures for each cluster, but this may needlessly complicate things for large clusters with plenty of highly probable assignments. So we will keep it simple :
The program starts cycling …
Enter variable-set file name: There being no comms file, input will be taken from StdInput Readvset returns 0 Enter sample file name: Number of active cases = 109819 Begin sort of 109819 cases Finished sort Readsample returns 0 Allocated space 3514796 chars AaAAa Popln 1 on sample 1, 1 leaves,109819 things Cost 5615787.10 Assign mode Partial --- Adjust: Params Tree POP 0 RelAb 1.000 Size109819.0 1 RelAb 1.000 Size109819.0 Aa Firstpop returns 0 Allocated space 3635784 chars S# 0 POP Age# 3 SampSz# 109819.0 RelAb 1.000 Sz 109819.0 Pcost 39.38 Tcost 5615752.22 Total 5615791.60 ??? line 0 >> Cycle 7 Pop 1 1 leaves Cost 5615784.8 >> Sample data >> Adjust PT Assign P # aaAaaaaaaaaaS Cycle 20 Splitting 1 into 2, 3 Ben6409.08 AaaaaaAAAAAAAAAAAAAAS Cycle 41 Splitting 2 into 4, 5 Ben1509.03 AS Cycle 43 Splitting 3 into 6, 7 Ben39307.37 ....
… sometime reaching convergence :
>> Cycle 2047 Pop 1 41 leaves Cost 5538349.7 >> Sample data >> Adjust PT Assign P # aaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaaAaa AaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAa aAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaA aaAaaAaaAaaAaaAaaaAaaaAaaaAaaaAaaaAaaaAaaaaAaaaaAaaaaAaaaaaAaaaaaAaaaaaaAaaaaaaa aAaaaaaaaaaaaAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa Doall ends after 394 cycles
Then comes a similarity table (between the clusters identified), followed by the corresponding tree. You can use the relationships depicted by the tree to form larger clusters (by merging leafs that are highly similar).
Table of class similarities SER 27 38 40 42 43 44 45 46 47 48 49 51 52 53 55 56 58 62 64 65 66 67 68 69 70 71 73 74 76 77 78 79 80 82 83 85 86 87 88 90 38 6.8e-05 40 0.00774 1.9e-04 42 1.6e-06 9.9e-05 0.00148 43 0.03155 0.18453 0.30318 0.13527 44 0.54303 0.00161 0.15843 0.00109 0.20480 45 0.07768 7.8e-04 0.48929 0.06160 0.30695 0.46856 46 0.53300 1.9e-05 0.07499 1.2e-05 0.04584 0.71955 0.24980 47 0.11770 6.1e-05 0.45024 3.7e-05 0.12760 0.56533 0.50343 0.54896 48 9.0e-07 0.00411 8.2e-05 0.36885 0.12174 1.9e-04 0.00214 1.7e-06 6.9e-06 49 2.3e-04 0.44460 0.01681 0.01946 0.35958 0.00612 0.01030 2.8e-04 0.00182 0.08541 51 0.15980 2.5e-05 0.14707 0.01163 0.07607 0.56483 0.66198 0.45696 0.43544 1.1e-04 5.0e-04 52 0.45055 8.9e-06 0.01386 0.00549 0.02407 0.50467 0.20827 0.43904 0.13173 2.4e-05 8.1e-05 0.56352 53 0.48270 6.2e-04 0.11070 4.5e-08 0.11018 0.60969 0.15107 0.59349 0.45978 1.0e-06 0.00252 0.13336 0.14024 55 2.6e-05 0.15319 0.00403 0.01162 0.10411 8.4e-04 0.00172 3.2e-05 2.6e-04 0.02460 0.66735 5.0e-05 6.4e-06 4.3e-04 56 1.3e-05 0.01954 0.01393 0.03611 0.08250 6.5e-04 0.00312 4.0e-05 5.3e-04 0.01757 0.36555 9.4e-05 6.6e-06 3.1e-04 0.63384 58 0.00308 5.3e-06 1.5e-05 0.19178 0.00963 0.01792 0.01537 0.00108 1.8e-04 7.4e-04 3.5e-05 0.02209 0.05935 3.0e-04 8.8e-07 3.5e-07 62 0.01374 0.47201 0.13702 1.7e-04 0.55324 0.07590 0.04801 0.01332 0.04394 0.00248 0.31646 0.00747 0.00292 0.08028 0.11268 0.05034 2.4e-04 64 0.16857 0.08790 0.04371 0.06001 0.38761 0.28769 0.14213 0.11053 0.08163 0.03037 0.06668 0.12699 0.16735 0.14589 0.01228 0.00525 0.13915 0.33923 65 1.4e-05 0.02793 1.4e-07 0.02868 0.13672 6.5e-04 3.8e-04 1.5e-06 7.2e-07 0.04722 0.01088 3.5e-05 2.3e-05 2.6e-05 1.0e-04 5.7e-06 0.00120 0.20719 0.42111 66 0.62276 2.3e-04 0.01155 1.6e-09 0.04352 0.34824 0.04072 0.32006 0.10355 8.9e-08 5.1e-04 0.04226 0.10243 0.65318 7.9e-05 3.3e-05 1.9e-04 0.03518 0.13834 1.7e-05 67 0.45310 0.00844 0.01577 1.2e-05 0.14373 0.39972 0.06381 0.19296 0.08650 1.9e-05 0.00668 0.05648 0.11055 0.47092 0.00109 3.0e-04 0.00247 0.15688 0.33020 0.00172 0.67321 68 1.9e-04 0.29653 7.2e-04 0.12065 0.38964 0.00448 0.00558 1.1e-04 2.8e-04 0.45854 0.32644 4.1e-04 1.2e-04 8.4e-04 0.05410 0.01459 7.3e-04 0.28252 0.20317 0.31803 2.9e-04 0.00622 69 0.00106 0.05648 0.03976 0.26523 0.42991 0.01837 0.06388 0.00235 0.00810 0.35770 0.52435 0.01071 0.00285 0.00305 0.34544 0.41222 0.00218 0.07202 0.06538 0.01103 6.1e-04 0.00419 0.28826 70 0.03767 1.8e-04 0.03505 0.30694 0.09843 0.19134 0.37565 0.07162 0.06269 0.00523 0.00170 0.43644 0.36249 0.01672 1.8e-04 2.4e-04 0.46084 0.00466 0.21016 0.00564 0.00586 0.01864 0.00709 0.02718 71 0.56605 3.6e-04 0.00182 4.9e-05 0.03685 0.38605 0.05266 0.17341 0.03091 1.1e-05 4.5e-04 0.11086 0.43748 0.16851 3.5e-05 9.8e-06 0.03962 0.01647 0.31822 6.1e-04 0.23801 0.37514 9.7e-04 0.00165 0.08185 73 0.16679 5.3e-04 1.7e-04 1.4e-05 0.02117 0.10611 0.00934 0.02555 0.00335 4.7e-06 2.2e-04 0.01624 0.09843 0.04138 1.1e-05 1.7e-06 0.02300 0.01636 0.39977 0.00535 0.09116 0.26427 0.00121 4.7e-04 0.02372 0.58895 74 0.50051 2.8e-04 5.2e-04 7.2e-09 0.02105 0.17660 0.01186 0.08119 0.01016 1.1e-07 2.0e-04 0.01713 0.12031 0.16422 2.0e-05 3.9e-06 0.00159 0.01689 0.20901 1.7e-04 0.44505 0.59787 3.5e-04 2.6e-04 0.00628 0.57951 0.49659 76 0.00104 5.3e-06 3.8e-07 0.04451 0.00250 0.00464 0.00147 1.2e-04 1.1e-05 9.6e-05 7.3e-06 0.00206 0.00915 1.1e-04 8.0e-08 1.2e-08 0.38692 3.0e-04 0.17031 0.00227 1.2e-04 0.00211 2.4e-04 2.7e-04 0.11863 0.03819 0.08566 0.00330 77 4.5e-07 1.4e-09 5.8e-12 0.00176 2.2e-05 2.1e-05 4.4e-06 5.5e-09 3.0e-10 9.1e-08 2.0e-09 2.5e-06 1.9e-05 3.2e-08 2.4e-12 1.9e-13 0.01450 2.4e-06 0.06142 7.2e-05 1.2e-08 1.3e-05 3.3e-07 1.4e-06 0.00528 0.00102 0.01720 2.2e-05 0.40931 78 4.5e-08 9.2e-11 7.6e-13 0.00419 8.3e-06 1.6e-05 4.0e-06 1.2e-10 1.8e-11 5.2e-08 2.9e-10 1.2e-06 1.1e-05 3.5e-09 4.5e-13 4.2e-14 0.03172 3.7e-07 0.04040 5.9e-06 1.0e-10 7.1e-06 3.6e-08 1.3e-06 0.00751 0.00151 0.02273 9.6e-06 0.42636 0.44917 79 0.01346 0.03890 0.00452 2.3e-15 0.09776 0.01583 0.00214 0.00284 0.00239 7.9e-09 0.01673 2.0e-04 7.0e-04 0.03686 0.00947 0.00209 8.9e-07 0.26388 0.09551 5.2e-04 0.07929 0.29497 0.00596 4.8e-04 4.8e-05 0.00864 0.01320 0.04483 7.3e-06 1.9e-08 5.4e-09 80 7.9e-04 0.44031 5.4e-05 4.3e-12 0.09411 0.00208 2.1e-04 8.0e-05 4.9e-05 1.1e-06 0.05547 1.7e-05 6.2e-05 0.00171 0.01401 9.4e-04 3.1e-06 0.43902 0.18028 0.10774 0.00309 0.04758 0.10286 0.00119 3.1e-05 0.00278 0.01399 0.00714 7.1e-05 3.0e-06 1.0e-06 0.12299 82 0.16495 2.6e-04 0.00333 0.03962 0.04511 0.24616 0.09483 0.08956 0.02640 7.0e-04 4.9e-04 0.18757 0.43326 0.04557 2.8e-05 1.3e-05 0.49938 0.00848 0.42341 0.00714 0.04242 0.09899 0.00306 0.00560 0.47724 0.49546 0.33935 0.11874 0.38690 0.04642 0.05670 0.00125 0.00111 83 0.00770 3.0e-06 2.0e-06 6.8e-04 0.00147 0.00852 9.7e-04 7.3e-04 5.9e-05 2.2e-06 1.9e-06 0.00209 0.01741 7.2e-04 3.0e-08 5.0e-09 0.07025 4.9e-04 0.25316 0.00206 0.00125 0.01086 4.9e-05 5.8e-05 0.02913 0.12915 0.40889 0.03530 0.42159 0.35761 0.33899 1.1e-04 5.8e-04 0.31266 85 0.09750 0.00265 0.45353 3.3e-05 0.30476 0.42505 0.28662 0.22975 0.55314 3.3e-05 0.01889 0.11337 0.03453 0.59047 0.00453 0.00478 1.1e-04 0.24604 0.12162 2.2e-05 0.19896 0.23438 0.00301 0.01891 0.02110 0.03243 0.00635 0.02653 1.5e-05 9.5e-10 8.4e-11 0.08778 0.00289 0.01395 5.0e-05 86 0.19564 0.00533 0.03724 7.8e-11 0.13426 0.19849 0.03335 0.11912 0.08653 1.8e-07 0.00977 0.01327 0.02427 0.51195 0.00324 0.00134 2.4e-05 0.21208 0.12700 3.1e-05 0.55588 0.56848 0.00212 0.00186 0.00176 0.06727 0.02825 0.15420 2.2e-05 9.1e-09 2.0e-09 0.37885 0.01605 0.01148 3.1e-04 0.46210 87 3.0e-05 0.63140 4.8e-05 1.0e-10 0.05181 2.9e-04 7.3e-05 5.0e-06 1.0e-05 6.3e-06 0.23285 1.6e-06 1.6e-06 2.4e-04 0.22507 0.02418 6.8e-08 0.20644 0.02336 0.00152 1.6e-04 0.00418 0.05784 0.00672 4.1e-06 8.9e-05 1.5e-04 1.6e-04 2.5e-07 1.2e-10 2.6e-11 0.04000 0.25087 2.6e-05 5.5e-07 0.00113 0.00457 88 2.1e-04 0.59169 0.00209 1.9e-05 0.17144 0.00306 0.00180 1.2e-04 4.4e-04 0.00162 0.69915 8.5e-05 3.0e-05 0.00201 0.59226 0.15882 2.7e-06 0.33576 0.04207 0.00196 6.1e-04 0.00815 0.13091 0.09933 1.7e-04 4.0e-04 2.4e-04 3.0e-04 1.5e-06 3.7e-10 8.3e-11 0.02819 0.09639 1.8e-04 1.2e-06 0.01015 0.01209 0.56265 90 1.0e-09 6.9e-12 3.4e-14 0.00667 1.5e-06 5.6e-07 3.0e-07 9.8e-12 9.1e-13 1.5e-08 3.8e-11 7.1e-08 3.7e-07 7.2e-11 2.1e-14 2.1e-15 0.00534 3.7e-08 0.01596 4.8e-06 6.3e-12 8.9e-08 1.4e-08 2.4e-07 0.00223 2.8e-05 5.7e-04 5.7e-08 0.26222 0.49767 0.08173 1.0e-11 9.6e-09 0.00997 0.04449 2.9e-12 1.0e-11 1.8e-13 2.4e-12 91 5.4e-08 3.8e-09 6.5e-13 4.7e-04 1.5e-05 3.0e-06 6.4e-07 9.7e-10 5.1e-11 5.7e-08 2.5e-09 3.0e-07 1.6e-06 6.8e-09 1.6e-12 7.3e-14 0.00133 4.6e-06 0.03575 1.6e-04 4.3e-09 2.5e-06 6.6e-07 3.4e-07 9.4e-04 8.8e-05 0.00286 2.5e-06 0.10324 0.48256 0.05285 7.6e-09 2.9e-06 0.00886 0.10726 3.1e-10 1.4e-09 1.2e-10 2.9e-10 0.49907 Join Leaf 44 and Leaf 46 into dad 1 at sim 7.196e-01 Join Leaf 49 and Leaf 88 into dad 2 at sim 6.991e-01 Join Dad 2 and Leaf 55 into dad 3 at sim 6.917e-01 Join Leaf 66 and Leaf 67 into dad 4 at sim 6.732e-01 Join Leaf 45 and Leaf 51 into dad 5 at sim 6.620e-01 Join Dad 1 and Leaf 53 into dad 6 at sim 6.444e-01 Join Leaf 38 and Leaf 87 into dad 7 at sim 6.314e-01 Join Dad 4 and Leaf 86 into dad 8 at sim 6.125e-01 Join Dad 6 and Leaf 47 into dad 9 at sim 5.896e-01 Join Leaf 71 and Leaf 73 into dad 10 at sim 5.890e-01 Join Dad 10 and Leaf 74 into dad 11 at sim 5.893e-01 Join Dad 9 and Leaf 85 into dad 12 at sim 5.648e-01 Join Leaf 43 and Leaf 62 into dad 13 at sim 5.532e-01 Join Leaf 27 and Dad 11 into dad 14 at sim 5.285e-01 Join Dad 14 and Dad 8 into dad 15 at sim 5.262e-01 Join Leaf 58 and Leaf 82 into dad 16 at sim 4.994e-01 Join Dad 16 and Leaf 70 into dad 17 at sim 5.157e-01 Join Leaf 90 and Leaf 91 into dad 18 at sim 4.991e-01 Join Leaf 77 and Dad 18 into dad 19 at sim 5.537e-01 Join Dad 15 and Dad 12 into dad 20 at sim 4.970e-01 Join Dad 7 and Dad 3 into dad 21 at sim 4.953e-01 Join Leaf 48 and Leaf 68 into dad 22 at sim 4.585e-01 Join Dad 5 and Leaf 52 into dad 23 at sim 4.521e-01 Join Dad 20 and Dad 23 into dad 24 at sim 4.507e-01 Join Leaf 76 and Leaf 78 into dad 25 at sim 4.264e-01 Join Dad 25 and Leaf 83 into dad 26 at sim 4.513e-01 Join Leaf 64 and Leaf 65 into dad 27 at sim 4.211e-01 Join Leaf 56 and Leaf 69 into dad 28 at sim 4.122e-01 Join Dad 21 and Dad 28 into dad 29 at sim 4.686e-01 Join Dad 26 and Dad 19 into dad 30 at sim 4.114e-01 Join Dad 29 and Dad 13 into dad 31 at sim 3.877e-01 Join Dad 31 and Dad 22 into dad 32 at sim 3.028e-01 Join Dad 24 and Dad 17 into dad 33 at sim 3.005e-01 Join Dad 33 and Dad 27 into dad 34 at sim 2.989e-01 Join Dad 34 and Leaf 40 into dad 35 at sim 2.474e-01 Join Dad 32 and Leaf 80 into dad 36 at sim 2.074e-01 Join Dad 35 and Dad 30 into dad 37 at sim 1.245e-01 Join Dad 36 and Leaf 42 into dad 38 at sim 1.215e-01 Join Dad 37 and Leaf 79 into dad 39 at sim 1.187e-01 Join Dad 39 and Dad 38 into dad 40 at sim 7.433e-02 Binary tree(s) of classes. There are 1 roots 27 .___________________________. | 71 .___________________. |_. |_. | | 73 .___________________| |_____| | | | 74 ._____________________| |_________. | | 66 ._______. | | |_______. | | 67 ._______| |_____________| | | | 86 ._______________| | |_______. 44 ._. | | |_________. | | 46 ._| |_____. | | | | | | 53 .___________| |_____. | | | | | |_________________. 47 ._________________| |_______________| | | | | | 85 ._______________________| | | | | 45 ._________. | | |___________________________________. | |_. 51 ._________| |_| | | | | | 52 ._____________________________________________| | | | | 58 ._______________________________. | | |_. | |_. 82 ._______________________________| |_______________________________| | | | | | 70 ._________________________________| | | | |___. 64 ._____________________________________________________. | | | |_____________| | | 65 ._____________________________________________________| | | | | 40 ._____________________________________________________________________| |___. | | 76 ._________________________________________________. | | |_. | | 78 ._________________________________________________| |_______. | | | | | | 83 .___________________________________________________| |_____________| | | |_. 77 ._____________________________________. | | | |_____________________| | | 90 .___________________________________. | | | |_| | | 91 .___________________________________| | | | | 79 ._____________________________________________________________________________| | | 38 ._____________. | |___________________________. | 87 ._____________| | | |_______________. | 49 .___. | | | |_. | | | 88 .___| |___________________________________| | |_ | |___. | 55 ._____| | | | | | | 56 ._______________________________________________________. | | | |_| |_. | 69 ._______________________________________________________| | | | | | | 43 ._________________________. | | | |___________________________________| |_______. | 62 ._________________________| | | | | | | 48 .___________________________________________. | |___. | |___________________| | | | 68 .___________________________________________| | | | | |___| 80 ._______________________________________________________________________| | | 42 .___________________________________________________________________________|
At its header contains number of clusters and number of frames per cluster. Then follows the real thing : to which cluster and with what probability each frame belongs to :
VanillaSnob-Thing-Report-File Sample data 41 leaf classes. Class serials and sizes: 27 6050.5 38 2929.1 40 555.7 ................................... 90 277.4 91 215.3 + 10 76 (97) 58 ( 2) 82 ( 1) 20 76 (91) 77 ( 4) 83 ( 3) 82 ( 2) 30 76 (92) 77 ( 3) 83 ( 2) 82 ( 2) 40 76 (56) 82 (22) 83 (20) 58 ( 1) 50 82 (57) 76 (28) 83 ( 8) 58 ( 5) 70 ( 1) ................................... 1098180 70 (75) 52 (13) 82 (10) 51 ( 1) 1098190 76 (87) 70 ( 5) 58 ( 4) 82 ( 2) END
For example, the line starting with 1098190
(above), says that the probablity of frame 1098190 belonging to cluster with the identifier 76 is p1=0.87, for the cluster 70 p2=0.05, 58 → 0.04, …
You can now form lists of frames that belong to specific clusters. To collect all frames for which the most probable cluster is 49 use something like :
grep -P '^\s+\d+\s+49 \(' out.trep | awk '{print $1, $2}' > Cluster_01.dat
You can now use carma to prepare a DCD file containing only the selected frames and then proceed with whichever analyses you are up to.
If you only want to select frames that belong to a given cluster with a probability higher than, say, 0.70, you could say something like :
grep -o -P '^\s+\d+\s+49 \(\d+' out.trep | tr -d '(' | awk '{if ($3 > 70) print $1,$2,$3}' > Cluster_01_p_0.70.dat