Read first this page to get up to speed. Then come back.
Welcome back. SNOB is a program written by Chris Wallace and encoding for the Minimum Message Length method for cluster analysis. Bibliography is available from this link. You can possibly start from this paper. On the technical front proceed as follows :
Do it :
mkdir snob/ cd snob/ # # Use every 10th line # perl -ne 'print ((0 == $. % 10) ? $_ : "")' /home/myself/something/carma/carma.dPCA.fluctuations.dat > selected_lines # # Use top 8 PCs (first column is the frame number) # awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9}' selected_lines > selected_lines_and_PCs
Prepare the vset file. Change the number '8' and the number of lines containing the expression 'PC?? 1' to reflect number of PCs selected above.
cat > data.vset data 8 PC01 1 PC02 1 PC03 1 PC04 1 PC05 1 PC06 1 PC07 1 PC08 1 <CTRL-D>
Prepare the header portion of samp file. The number of lines containing '0.000001'
must be equal to number of PCs selected. The last number is number of
frames (use wc selected_lines_and_PCs to determine).
cat > header data data 0.000001 0.000001 0.000001 0.000001 0.000001 0.000001 0.000001 0.000001 184456 <CTRL-D>
You can now prepare the samp file and the script for SLURM :
cat header selected_lines_and_PCs > data.samp cat > snob.sh #!/bin/tcsh snob-vanilla << eof data.vset data.samp doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 200 doall 10000 mrep out trep out prclass -1 0 tree stop eof exit <CTRL-D>
When ready submit the job to the cluster :
sbatch -n 1 snob.sh
The job will take a while. It is reading time again …
SNOB will assign each and every frame to at least one cluster. Because this is a proper Bayesian approach, the assignments are probabilistic : what you get for every frame is a set of assignments of the form : “the probability of frame X belonging to cluster A is p1, belonging to cluster B is p2, belonging to cluster C is p3, …”. This, admittedly, is a very satisfying approach to clustering. You can envision a similarly probabilistic approach to calculating, say, representative structures for each cluster, but this may needlessly complicate things for large clusters with plenty of highly probable assignments. So we will keep it simple :
The program starts cycling …
Enter variable-set file name:
There being no comms file, input will be taken from StdInput
Readvset returns 0
Enter sample file name:
Number of active cases = 109819
Begin sort of 109819 cases
Finished sort
Readsample returns 0
Allocated space 3514796 chars
AaAAa
Popln 1 on sample 1, 1 leaves,109819 things Cost 5615787.10
Assign mode Partial --- Adjust: Params Tree
POP 0 RelAb 1.000 Size109819.0
1 RelAb 1.000 Size109819.0
Aa
Firstpop returns 0
Allocated space 3635784 chars
S# 0 POP Age# 3 SampSz# 109819.0 RelAb 1.000 Sz 109819.0
Pcost 39.38 Tcost 5615752.22 Total 5615791.60
??? line 0
>> Cycle 7 Pop 1 1 leaves Cost 5615784.8
>> Sample data
>> Adjust PT Assign P
# aaAaaaaaaaaaS
Cycle 20 Splitting 1 into 2, 3 Ben6409.08
AaaaaaAAAAAAAAAAAAAAS
Cycle 41 Splitting 2 into 4, 5 Ben1509.03
AS
Cycle 43 Splitting 3 into 6, 7 Ben39307.37
....
… sometime reaching convergence :
>> Cycle 2047 Pop 1 41 leaves Cost 5538349.7 >> Sample data >> Adjust PT Assign P # aaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaaAaa AaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAa aAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaA aaAaaAaaAaaAaaAaaaAaaaAaaaAaaaAaaaAaaaAaaaaAaaaaAaaaaAaaaaaAaaaaaAaaaaaaAaaaaaaa aAaaaaaaaaaaaAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa Doall ends after 394 cycles
Then comes a similarity table (between the clusters identified), followed by the corresponding tree. You can use the relationships depicted by the tree to form larger clusters (by merging leafs that are highly similar).
Table of class similarities
SER 27 38 40 42 43 44 45 46 47 48 49 51 52 53 55 56 58 62 64 65 66 67 68 69 70 71 73 74 76 77 78 79 80 82 83 85 86 87 88 90
38 6.8e-05
40 0.00774 1.9e-04
42 1.6e-06 9.9e-05 0.00148
43 0.03155 0.18453 0.30318 0.13527
44 0.54303 0.00161 0.15843 0.00109 0.20480
45 0.07768 7.8e-04 0.48929 0.06160 0.30695 0.46856
46 0.53300 1.9e-05 0.07499 1.2e-05 0.04584 0.71955 0.24980
47 0.11770 6.1e-05 0.45024 3.7e-05 0.12760 0.56533 0.50343 0.54896
48 9.0e-07 0.00411 8.2e-05 0.36885 0.12174 1.9e-04 0.00214 1.7e-06 6.9e-06
49 2.3e-04 0.44460 0.01681 0.01946 0.35958 0.00612 0.01030 2.8e-04 0.00182 0.08541
51 0.15980 2.5e-05 0.14707 0.01163 0.07607 0.56483 0.66198 0.45696 0.43544 1.1e-04 5.0e-04
52 0.45055 8.9e-06 0.01386 0.00549 0.02407 0.50467 0.20827 0.43904 0.13173 2.4e-05 8.1e-05 0.56352
53 0.48270 6.2e-04 0.11070 4.5e-08 0.11018 0.60969 0.15107 0.59349 0.45978 1.0e-06 0.00252 0.13336 0.14024
55 2.6e-05 0.15319 0.00403 0.01162 0.10411 8.4e-04 0.00172 3.2e-05 2.6e-04 0.02460 0.66735 5.0e-05 6.4e-06 4.3e-04
56 1.3e-05 0.01954 0.01393 0.03611 0.08250 6.5e-04 0.00312 4.0e-05 5.3e-04 0.01757 0.36555 9.4e-05 6.6e-06 3.1e-04 0.63384
58 0.00308 5.3e-06 1.5e-05 0.19178 0.00963 0.01792 0.01537 0.00108 1.8e-04 7.4e-04 3.5e-05 0.02209 0.05935 3.0e-04 8.8e-07 3.5e-07
62 0.01374 0.47201 0.13702 1.7e-04 0.55324 0.07590 0.04801 0.01332 0.04394 0.00248 0.31646 0.00747 0.00292 0.08028 0.11268 0.05034 2.4e-04
64 0.16857 0.08790 0.04371 0.06001 0.38761 0.28769 0.14213 0.11053 0.08163 0.03037 0.06668 0.12699 0.16735 0.14589 0.01228 0.00525 0.13915 0.33923
65 1.4e-05 0.02793 1.4e-07 0.02868 0.13672 6.5e-04 3.8e-04 1.5e-06 7.2e-07 0.04722 0.01088 3.5e-05 2.3e-05 2.6e-05 1.0e-04 5.7e-06 0.00120 0.20719 0.42111
66 0.62276 2.3e-04 0.01155 1.6e-09 0.04352 0.34824 0.04072 0.32006 0.10355 8.9e-08 5.1e-04 0.04226 0.10243 0.65318 7.9e-05 3.3e-05 1.9e-04 0.03518 0.13834 1.7e-05
67 0.45310 0.00844 0.01577 1.2e-05 0.14373 0.39972 0.06381 0.19296 0.08650 1.9e-05 0.00668 0.05648 0.11055 0.47092 0.00109 3.0e-04 0.00247 0.15688 0.33020 0.00172 0.67321
68 1.9e-04 0.29653 7.2e-04 0.12065 0.38964 0.00448 0.00558 1.1e-04 2.8e-04 0.45854 0.32644 4.1e-04 1.2e-04 8.4e-04 0.05410 0.01459 7.3e-04 0.28252 0.20317 0.31803 2.9e-04 0.00622
69 0.00106 0.05648 0.03976 0.26523 0.42991 0.01837 0.06388 0.00235 0.00810 0.35770 0.52435 0.01071 0.00285 0.00305 0.34544 0.41222 0.00218 0.07202 0.06538 0.01103 6.1e-04 0.00419 0.28826
70 0.03767 1.8e-04 0.03505 0.30694 0.09843 0.19134 0.37565 0.07162 0.06269 0.00523 0.00170 0.43644 0.36249 0.01672 1.8e-04 2.4e-04 0.46084 0.00466 0.21016 0.00564 0.00586 0.01864 0.00709 0.02718
71 0.56605 3.6e-04 0.00182 4.9e-05 0.03685 0.38605 0.05266 0.17341 0.03091 1.1e-05 4.5e-04 0.11086 0.43748 0.16851 3.5e-05 9.8e-06 0.03962 0.01647 0.31822 6.1e-04 0.23801 0.37514 9.7e-04 0.00165 0.08185
73 0.16679 5.3e-04 1.7e-04 1.4e-05 0.02117 0.10611 0.00934 0.02555 0.00335 4.7e-06 2.2e-04 0.01624 0.09843 0.04138 1.1e-05 1.7e-06 0.02300 0.01636 0.39977 0.00535 0.09116 0.26427 0.00121 4.7e-04 0.02372 0.58895
74 0.50051 2.8e-04 5.2e-04 7.2e-09 0.02105 0.17660 0.01186 0.08119 0.01016 1.1e-07 2.0e-04 0.01713 0.12031 0.16422 2.0e-05 3.9e-06 0.00159 0.01689 0.20901 1.7e-04 0.44505 0.59787 3.5e-04 2.6e-04 0.00628 0.57951 0.49659
76 0.00104 5.3e-06 3.8e-07 0.04451 0.00250 0.00464 0.00147 1.2e-04 1.1e-05 9.6e-05 7.3e-06 0.00206 0.00915 1.1e-04 8.0e-08 1.2e-08 0.38692 3.0e-04 0.17031 0.00227 1.2e-04 0.00211 2.4e-04 2.7e-04 0.11863 0.03819 0.08566 0.00330
77 4.5e-07 1.4e-09 5.8e-12 0.00176 2.2e-05 2.1e-05 4.4e-06 5.5e-09 3.0e-10 9.1e-08 2.0e-09 2.5e-06 1.9e-05 3.2e-08 2.4e-12 1.9e-13 0.01450 2.4e-06 0.06142 7.2e-05 1.2e-08 1.3e-05 3.3e-07 1.4e-06 0.00528 0.00102 0.01720 2.2e-05 0.40931
78 4.5e-08 9.2e-11 7.6e-13 0.00419 8.3e-06 1.6e-05 4.0e-06 1.2e-10 1.8e-11 5.2e-08 2.9e-10 1.2e-06 1.1e-05 3.5e-09 4.5e-13 4.2e-14 0.03172 3.7e-07 0.04040 5.9e-06 1.0e-10 7.1e-06 3.6e-08 1.3e-06 0.00751 0.00151 0.02273 9.6e-06 0.42636 0.44917
79 0.01346 0.03890 0.00452 2.3e-15 0.09776 0.01583 0.00214 0.00284 0.00239 7.9e-09 0.01673 2.0e-04 7.0e-04 0.03686 0.00947 0.00209 8.9e-07 0.26388 0.09551 5.2e-04 0.07929 0.29497 0.00596 4.8e-04 4.8e-05 0.00864 0.01320 0.04483 7.3e-06 1.9e-08 5.4e-09
80 7.9e-04 0.44031 5.4e-05 4.3e-12 0.09411 0.00208 2.1e-04 8.0e-05 4.9e-05 1.1e-06 0.05547 1.7e-05 6.2e-05 0.00171 0.01401 9.4e-04 3.1e-06 0.43902 0.18028 0.10774 0.00309 0.04758 0.10286 0.00119 3.1e-05 0.00278 0.01399 0.00714 7.1e-05 3.0e-06 1.0e-06 0.12299
82 0.16495 2.6e-04 0.00333 0.03962 0.04511 0.24616 0.09483 0.08956 0.02640 7.0e-04 4.9e-04 0.18757 0.43326 0.04557 2.8e-05 1.3e-05 0.49938 0.00848 0.42341 0.00714 0.04242 0.09899 0.00306 0.00560 0.47724 0.49546 0.33935 0.11874 0.38690 0.04642 0.05670 0.00125 0.00111
83 0.00770 3.0e-06 2.0e-06 6.8e-04 0.00147 0.00852 9.7e-04 7.3e-04 5.9e-05 2.2e-06 1.9e-06 0.00209 0.01741 7.2e-04 3.0e-08 5.0e-09 0.07025 4.9e-04 0.25316 0.00206 0.00125 0.01086 4.9e-05 5.8e-05 0.02913 0.12915 0.40889 0.03530 0.42159 0.35761 0.33899 1.1e-04 5.8e-04 0.31266
85 0.09750 0.00265 0.45353 3.3e-05 0.30476 0.42505 0.28662 0.22975 0.55314 3.3e-05 0.01889 0.11337 0.03453 0.59047 0.00453 0.00478 1.1e-04 0.24604 0.12162 2.2e-05 0.19896 0.23438 0.00301 0.01891 0.02110 0.03243 0.00635 0.02653 1.5e-05 9.5e-10 8.4e-11 0.08778 0.00289 0.01395 5.0e-05
86 0.19564 0.00533 0.03724 7.8e-11 0.13426 0.19849 0.03335 0.11912 0.08653 1.8e-07 0.00977 0.01327 0.02427 0.51195 0.00324 0.00134 2.4e-05 0.21208 0.12700 3.1e-05 0.55588 0.56848 0.00212 0.00186 0.00176 0.06727 0.02825 0.15420 2.2e-05 9.1e-09 2.0e-09 0.37885 0.01605 0.01148 3.1e-04 0.46210
87 3.0e-05 0.63140 4.8e-05 1.0e-10 0.05181 2.9e-04 7.3e-05 5.0e-06 1.0e-05 6.3e-06 0.23285 1.6e-06 1.6e-06 2.4e-04 0.22507 0.02418 6.8e-08 0.20644 0.02336 0.00152 1.6e-04 0.00418 0.05784 0.00672 4.1e-06 8.9e-05 1.5e-04 1.6e-04 2.5e-07 1.2e-10 2.6e-11 0.04000 0.25087 2.6e-05 5.5e-07 0.00113 0.00457
88 2.1e-04 0.59169 0.00209 1.9e-05 0.17144 0.00306 0.00180 1.2e-04 4.4e-04 0.00162 0.69915 8.5e-05 3.0e-05 0.00201 0.59226 0.15882 2.7e-06 0.33576 0.04207 0.00196 6.1e-04 0.00815 0.13091 0.09933 1.7e-04 4.0e-04 2.4e-04 3.0e-04 1.5e-06 3.7e-10 8.3e-11 0.02819 0.09639 1.8e-04 1.2e-06 0.01015 0.01209 0.56265
90 1.0e-09 6.9e-12 3.4e-14 0.00667 1.5e-06 5.6e-07 3.0e-07 9.8e-12 9.1e-13 1.5e-08 3.8e-11 7.1e-08 3.7e-07 7.2e-11 2.1e-14 2.1e-15 0.00534 3.7e-08 0.01596 4.8e-06 6.3e-12 8.9e-08 1.4e-08 2.4e-07 0.00223 2.8e-05 5.7e-04 5.7e-08 0.26222 0.49767 0.08173 1.0e-11 9.6e-09 0.00997 0.04449 2.9e-12 1.0e-11 1.8e-13 2.4e-12
91 5.4e-08 3.8e-09 6.5e-13 4.7e-04 1.5e-05 3.0e-06 6.4e-07 9.7e-10 5.1e-11 5.7e-08 2.5e-09 3.0e-07 1.6e-06 6.8e-09 1.6e-12 7.3e-14 0.00133 4.6e-06 0.03575 1.6e-04 4.3e-09 2.5e-06 6.6e-07 3.4e-07 9.4e-04 8.8e-05 0.00286 2.5e-06 0.10324 0.48256 0.05285 7.6e-09 2.9e-06 0.00886 0.10726 3.1e-10 1.4e-09 1.2e-10 2.9e-10 0.49907
Join Leaf 44 and Leaf 46 into dad 1 at sim 7.196e-01
Join Leaf 49 and Leaf 88 into dad 2 at sim 6.991e-01
Join Dad 2 and Leaf 55 into dad 3 at sim 6.917e-01
Join Leaf 66 and Leaf 67 into dad 4 at sim 6.732e-01
Join Leaf 45 and Leaf 51 into dad 5 at sim 6.620e-01
Join Dad 1 and Leaf 53 into dad 6 at sim 6.444e-01
Join Leaf 38 and Leaf 87 into dad 7 at sim 6.314e-01
Join Dad 4 and Leaf 86 into dad 8 at sim 6.125e-01
Join Dad 6 and Leaf 47 into dad 9 at sim 5.896e-01
Join Leaf 71 and Leaf 73 into dad 10 at sim 5.890e-01
Join Dad 10 and Leaf 74 into dad 11 at sim 5.893e-01
Join Dad 9 and Leaf 85 into dad 12 at sim 5.648e-01
Join Leaf 43 and Leaf 62 into dad 13 at sim 5.532e-01
Join Leaf 27 and Dad 11 into dad 14 at sim 5.285e-01
Join Dad 14 and Dad 8 into dad 15 at sim 5.262e-01
Join Leaf 58 and Leaf 82 into dad 16 at sim 4.994e-01
Join Dad 16 and Leaf 70 into dad 17 at sim 5.157e-01
Join Leaf 90 and Leaf 91 into dad 18 at sim 4.991e-01
Join Leaf 77 and Dad 18 into dad 19 at sim 5.537e-01
Join Dad 15 and Dad 12 into dad 20 at sim 4.970e-01
Join Dad 7 and Dad 3 into dad 21 at sim 4.953e-01
Join Leaf 48 and Leaf 68 into dad 22 at sim 4.585e-01
Join Dad 5 and Leaf 52 into dad 23 at sim 4.521e-01
Join Dad 20 and Dad 23 into dad 24 at sim 4.507e-01
Join Leaf 76 and Leaf 78 into dad 25 at sim 4.264e-01
Join Dad 25 and Leaf 83 into dad 26 at sim 4.513e-01
Join Leaf 64 and Leaf 65 into dad 27 at sim 4.211e-01
Join Leaf 56 and Leaf 69 into dad 28 at sim 4.122e-01
Join Dad 21 and Dad 28 into dad 29 at sim 4.686e-01
Join Dad 26 and Dad 19 into dad 30 at sim 4.114e-01
Join Dad 29 and Dad 13 into dad 31 at sim 3.877e-01
Join Dad 31 and Dad 22 into dad 32 at sim 3.028e-01
Join Dad 24 and Dad 17 into dad 33 at sim 3.005e-01
Join Dad 33 and Dad 27 into dad 34 at sim 2.989e-01
Join Dad 34 and Leaf 40 into dad 35 at sim 2.474e-01
Join Dad 32 and Leaf 80 into dad 36 at sim 2.074e-01
Join Dad 35 and Dad 30 into dad 37 at sim 1.245e-01
Join Dad 36 and Leaf 42 into dad 38 at sim 1.215e-01
Join Dad 37 and Leaf 79 into dad 39 at sim 1.187e-01
Join Dad 39 and Dad 38 into dad 40 at sim 7.433e-02
Binary tree(s) of classes. There are 1 roots
27 .___________________________.
|
71 .___________________. |_.
|_. | |
73 .___________________| |_____| |
| |
74 ._____________________| |_________.
| |
66 ._______. | |
|_______. | |
67 ._______| |_____________| |
| |
86 ._______________| |
|_______.
44 ._. | |
|_________. | |
46 ._| |_____. | |
| | | |
53 .___________| |_____. | |
| | | |_________________.
47 ._________________| |_______________| | |
| | |
85 ._______________________| | |
| |
45 ._________. | |
|___________________________________. | |_.
51 ._________| |_| | |
| | |
52 ._____________________________________________| | |
| |
58 ._______________________________. | |
|_. | |_.
82 ._______________________________| |_______________________________| | |
| | |
70 ._________________________________| | |
| |___.
64 ._____________________________________________________. | | |
|_____________| | |
65 ._____________________________________________________| | |
| |
40 ._____________________________________________________________________| |___.
| |
76 ._________________________________________________. | |
|_. | |
78 ._________________________________________________| |_______. | |
| | | |
83 .___________________________________________________| |_____________| |
| |_.
77 ._____________________________________. | | |
|_____________________| | |
90 .___________________________________. | | |
|_| | |
91 .___________________________________| | |
| |
79 ._____________________________________________________________________________| |
|
38 ._____________. |
|___________________________. |
87 ._____________| | |
|_______________. |
49 .___. | | |
|_. | | |
88 .___| |___________________________________| | |_
| |___. |
55 ._____| | | |
| | |
56 ._______________________________________________________. | | |
|_| |_. |
69 ._______________________________________________________| | | |
| | |
43 ._________________________. | | |
|___________________________________| |_______. |
62 ._________________________| | | |
| | |
48 .___________________________________________. | |___. |
|___________________| | | |
68 .___________________________________________| | | |
| |___|
80 ._______________________________________________________________________| |
|
42 .___________________________________________________________________________|
At its header contains number of clusters and number of frames per cluster. Then follows the real thing : to which cluster and with what probability each frame belongs to :
VanillaSnob-Thing-Report-File
Sample data
41 leaf classes.
Class serials and sizes:
27 6050.5
38 2929.1
40 555.7
...................................
90 277.4
91 215.3
+
10 76 (97) 58 ( 2) 82 ( 1)
20 76 (91) 77 ( 4) 83 ( 3) 82 ( 2)
30 76 (92) 77 ( 3) 83 ( 2) 82 ( 2)
40 76 (56) 82 (22) 83 (20) 58 ( 1)
50 82 (57) 76 (28) 83 ( 8) 58 ( 5) 70 ( 1)
...................................
1098180 70 (75) 52 (13) 82 (10) 51 ( 1)
1098190 76 (87) 70 ( 5) 58 ( 4) 82 ( 2)
END
For example, the line starting with 1098190 (above), says that the probablity of frame 1098190 belonging to cluster with the identifier 76 is p1=0.87, for the cluster 70 p2=0.05, 58 → 0.04, …
You can now form lists of frames that belong to specific clusters. To collect all frames for which the most probable cluster is 49 use something like :
grep -P '^\s+\d+\s+49 \(' out.trep | awk '{print $1, $2}' > Cluster_01.dat
You can now use carma to prepare a DCD file containing only the selected frames and then proceed with whichever analyses you are up to.
If you only want to select frames that belong to a given cluster with a probability higher than, say, 0.70, you could say something like :
grep -o -P '^\s+\d+\s+49 \(\d+' out.trep | tr -d '(' | awk '{if ($3 > 70) print $1,$2,$3}' > Cluster_01_p_0.70.dat