High dimensionality PCA-based cluster analysis using SNOB

Read first this page to get up to speed. Then come back.

Welcome back. SNOB is a program written by Chris Wallace and encoding for the Minimum Message Length method for cluster analysis. Bibliography is available from this link. You can possibly start from this paper. On the technical front proceed as follows :

  • Decide how many principal components you need for the analysis (look here for a lazy way).
  • Decide how many frames you will use, noting the following. Memory usage is no problem with SNOB, you can possibly feed it all 5 million of your frames. But CPU time is. To give you a feeling for that : a test run using 200,000 frames and 10 principal components had been running for ~5 days at the time the power failed (and I didn't want to start all over again). Note also that the amount of CPU time needed is problem-dependant: For well-behaving (Gaussian-like distributed data) things can go much faster. For example, clustering 109819 frames using the top three components of a well-behaved data set took only two hours on a single core. YMMV.


Do it :

 
mkdir snob/
cd snob/
 
#
# Use every 10th line
#
perl -ne 'print ((0 == $. % 10) ? $_ : "")' /home/myself/something/carma/carma.dPCA.fluctuations.dat > selected_lines
 
#
# Use top 8 PCs (first column is the frame number)
#
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9}' selected_lines > selected_lines_and_PCs


Prepare the vset file. Change the number '8' and the number of lines containing the expression 'PC?? 1' to reflect number of PCs selected above.

cat > data.vset
data
8                                  
PC01 1
PC02 1
PC03 1
PC04 1
PC05 1
PC06 1
PC07 1
PC08 1
<CTRL-D>


Prepare the header portion of samp file. The number of lines containing '0.000001' must be equal to number of PCs selected. The last number is number of frames (use wc selected_lines_and_PCs to determine).

cat > header
data
data
 0.000001
 0.000001
 0.000001
 0.000001
 0.000001
 0.000001
 0.000001
 0.000001
184456
<CTRL-D>


You can now prepare the samp file and the script for SLURM :

cat header selected_lines_and_PCs > data.samp
 
cat > snob.sh
#!/bin/tcsh
 
snob-vanilla << eof
data.vset
data.samp
doall 200
doall 200
doall 200
doall 200
doall 200
doall 200
doall 200
doall 200
doall 200
doall 200
doall 10000
mrep out
trep out
prclass -1 0
tree
stop
eof
 
exit
<CTRL-D>


When ready submit the job to the cluster :


sbatch -n 1 snob.sh

The job will take a while. It is reading time again …



Dealing with the output

SNOB will assign each and every frame to at least one cluster. Because this is a proper Bayesian approach, the assignments are probabilistic : what you get for every frame is a set of assignments of the form : “the probability of frame X belonging to cluster A is p1, belonging to cluster B is p2, belonging to cluster C is p3, …”. This, admittedly, is a very satisfying approach to clustering. You can envision a similarly probabilistic approach to calculating, say, representative structures for each cluster, but this may needlessly complicate things for large clusters with plenty of highly probable assignments. So we will keep it simple :



The LOG file (slurm-*.out)

The program starts cycling …

Enter variable-set file name:
There being no comms file, input will be taken from StdInput
Readvset returns 0
Enter sample file name:
Number of active cases = 109819
Begin sort of 109819 cases
Finished sort
Readsample returns 0
Allocated space  3514796 chars
AaAAa
Popln  1 on sample  1,   1 leaves,109819 things  Cost 5615787.10

  Assign mode Partial    --- Adjust: Params Tree
POP    0  RelAb 1.000  Size109819.0
       1  RelAb 1.000  Size109819.0
Aa
Firstpop returns 0
Allocated space  3635784 chars

S#     0   POP  Age#    3  SampSz# 109819.0  RelAb 1.000  Sz 109819.0
Pcost     39.38  Tcost 5615752.22  Total 5615791.60
??? line      0
>>  Cycle        7  Pop  1     1 leaves  Cost 5615784.8
>>  Sample data
>>  Adjust PT  Assign P
# aaAaaaaaaaaaS
Cycle     20  Splitting   1 into   2,   3  Ben6409.08
AaaaaaAAAAAAAAAAAAAAS
Cycle     41  Splitting   2 into   4,   5  Ben1509.03
AS
Cycle     43  Splitting   3 into   6,   7  Ben39307.37

....

… sometime reaching convergence :


>>  Cycle     2047  Pop  1    41 leaves  Cost 5538349.7
>>  Sample data
>>  Adjust PT  Assign P
# aaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaAaaAaa
AaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAa
aAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaAaaA
aaAaaAaaAaaAaaAaaaAaaaAaaaAaaaAaaaAaaaAaaaaAaaaaAaaaaAaaaaaAaaaaaAaaaaaaAaaaaaaa
aAaaaaaaaaaaaAaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Doall ends after 394 cycles

Then comes a similarity table (between the clusters identified), followed by the corresponding tree. You can use the relationships depicted by the tree to form larger clusters (by merging leafs that are highly similar).

 Table of class similarities
 SER      27      38      40      42      43      44      45      46      47      48      49      51      52      53      55      56      58      62      64      65      66      67      68      69      70      71      73      74      76      77      78      79      80      82      83      85      86      87      88      90
  38 6.8e-05
  40 0.00774 1.9e-04
  42 1.6e-06 9.9e-05 0.00148
  43 0.03155 0.18453 0.30318 0.13527
  44 0.54303 0.00161 0.15843 0.00109 0.20480
  45 0.07768 7.8e-04 0.48929 0.06160 0.30695 0.46856
  46 0.53300 1.9e-05 0.07499 1.2e-05 0.04584 0.71955 0.24980
  47 0.11770 6.1e-05 0.45024 3.7e-05 0.12760 0.56533 0.50343 0.54896
  48 9.0e-07 0.00411 8.2e-05 0.36885 0.12174 1.9e-04 0.00214 1.7e-06 6.9e-06
  49 2.3e-04 0.44460 0.01681 0.01946 0.35958 0.00612 0.01030 2.8e-04 0.00182 0.08541
  51 0.15980 2.5e-05 0.14707 0.01163 0.07607 0.56483 0.66198 0.45696 0.43544 1.1e-04 5.0e-04
  52 0.45055 8.9e-06 0.01386 0.00549 0.02407 0.50467 0.20827 0.43904 0.13173 2.4e-05 8.1e-05 0.56352
  53 0.48270 6.2e-04 0.11070 4.5e-08 0.11018 0.60969 0.15107 0.59349 0.45978 1.0e-06 0.00252 0.13336 0.14024
  55 2.6e-05 0.15319 0.00403 0.01162 0.10411 8.4e-04 0.00172 3.2e-05 2.6e-04 0.02460 0.66735 5.0e-05 6.4e-06 4.3e-04
  56 1.3e-05 0.01954 0.01393 0.03611 0.08250 6.5e-04 0.00312 4.0e-05 5.3e-04 0.01757 0.36555 9.4e-05 6.6e-06 3.1e-04 0.63384
  58 0.00308 5.3e-06 1.5e-05 0.19178 0.00963 0.01792 0.01537 0.00108 1.8e-04 7.4e-04 3.5e-05 0.02209 0.05935 3.0e-04 8.8e-07 3.5e-07
  62 0.01374 0.47201 0.13702 1.7e-04 0.55324 0.07590 0.04801 0.01332 0.04394 0.00248 0.31646 0.00747 0.00292 0.08028 0.11268 0.05034 2.4e-04
  64 0.16857 0.08790 0.04371 0.06001 0.38761 0.28769 0.14213 0.11053 0.08163 0.03037 0.06668 0.12699 0.16735 0.14589 0.01228 0.00525 0.13915 0.33923
  65 1.4e-05 0.02793 1.4e-07 0.02868 0.13672 6.5e-04 3.8e-04 1.5e-06 7.2e-07 0.04722 0.01088 3.5e-05 2.3e-05 2.6e-05 1.0e-04 5.7e-06 0.00120 0.20719 0.42111
  66 0.62276 2.3e-04 0.01155 1.6e-09 0.04352 0.34824 0.04072 0.32006 0.10355 8.9e-08 5.1e-04 0.04226 0.10243 0.65318 7.9e-05 3.3e-05 1.9e-04 0.03518 0.13834 1.7e-05
  67 0.45310 0.00844 0.01577 1.2e-05 0.14373 0.39972 0.06381 0.19296 0.08650 1.9e-05 0.00668 0.05648 0.11055 0.47092 0.00109 3.0e-04 0.00247 0.15688 0.33020 0.00172 0.67321
  68 1.9e-04 0.29653 7.2e-04 0.12065 0.38964 0.00448 0.00558 1.1e-04 2.8e-04 0.45854 0.32644 4.1e-04 1.2e-04 8.4e-04 0.05410 0.01459 7.3e-04 0.28252 0.20317 0.31803 2.9e-04 0.00622
  69 0.00106 0.05648 0.03976 0.26523 0.42991 0.01837 0.06388 0.00235 0.00810 0.35770 0.52435 0.01071 0.00285 0.00305 0.34544 0.41222 0.00218 0.07202 0.06538 0.01103 6.1e-04 0.00419 0.28826
  70 0.03767 1.8e-04 0.03505 0.30694 0.09843 0.19134 0.37565 0.07162 0.06269 0.00523 0.00170 0.43644 0.36249 0.01672 1.8e-04 2.4e-04 0.46084 0.00466 0.21016 0.00564 0.00586 0.01864 0.00709 0.02718
  71 0.56605 3.6e-04 0.00182 4.9e-05 0.03685 0.38605 0.05266 0.17341 0.03091 1.1e-05 4.5e-04 0.11086 0.43748 0.16851 3.5e-05 9.8e-06 0.03962 0.01647 0.31822 6.1e-04 0.23801 0.37514 9.7e-04 0.00165 0.08185
  73 0.16679 5.3e-04 1.7e-04 1.4e-05 0.02117 0.10611 0.00934 0.02555 0.00335 4.7e-06 2.2e-04 0.01624 0.09843 0.04138 1.1e-05 1.7e-06 0.02300 0.01636 0.39977 0.00535 0.09116 0.26427 0.00121 4.7e-04 0.02372 0.58895
  74 0.50051 2.8e-04 5.2e-04 7.2e-09 0.02105 0.17660 0.01186 0.08119 0.01016 1.1e-07 2.0e-04 0.01713 0.12031 0.16422 2.0e-05 3.9e-06 0.00159 0.01689 0.20901 1.7e-04 0.44505 0.59787 3.5e-04 2.6e-04 0.00628 0.57951 0.49659
  76 0.00104 5.3e-06 3.8e-07 0.04451 0.00250 0.00464 0.00147 1.2e-04 1.1e-05 9.6e-05 7.3e-06 0.00206 0.00915 1.1e-04 8.0e-08 1.2e-08 0.38692 3.0e-04 0.17031 0.00227 1.2e-04 0.00211 2.4e-04 2.7e-04 0.11863 0.03819 0.08566 0.00330
  77 4.5e-07 1.4e-09 5.8e-12 0.00176 2.2e-05 2.1e-05 4.4e-06 5.5e-09 3.0e-10 9.1e-08 2.0e-09 2.5e-06 1.9e-05 3.2e-08 2.4e-12 1.9e-13 0.01450 2.4e-06 0.06142 7.2e-05 1.2e-08 1.3e-05 3.3e-07 1.4e-06 0.00528 0.00102 0.01720 2.2e-05 0.40931
  78 4.5e-08 9.2e-11 7.6e-13 0.00419 8.3e-06 1.6e-05 4.0e-06 1.2e-10 1.8e-11 5.2e-08 2.9e-10 1.2e-06 1.1e-05 3.5e-09 4.5e-13 4.2e-14 0.03172 3.7e-07 0.04040 5.9e-06 1.0e-10 7.1e-06 3.6e-08 1.3e-06 0.00751 0.00151 0.02273 9.6e-06 0.42636 0.44917
  79 0.01346 0.03890 0.00452 2.3e-15 0.09776 0.01583 0.00214 0.00284 0.00239 7.9e-09 0.01673 2.0e-04 7.0e-04 0.03686 0.00947 0.00209 8.9e-07 0.26388 0.09551 5.2e-04 0.07929 0.29497 0.00596 4.8e-04 4.8e-05 0.00864 0.01320 0.04483 7.3e-06 1.9e-08 5.4e-09
  80 7.9e-04 0.44031 5.4e-05 4.3e-12 0.09411 0.00208 2.1e-04 8.0e-05 4.9e-05 1.1e-06 0.05547 1.7e-05 6.2e-05 0.00171 0.01401 9.4e-04 3.1e-06 0.43902 0.18028 0.10774 0.00309 0.04758 0.10286 0.00119 3.1e-05 0.00278 0.01399 0.00714 7.1e-05 3.0e-06 1.0e-06 0.12299
  82 0.16495 2.6e-04 0.00333 0.03962 0.04511 0.24616 0.09483 0.08956 0.02640 7.0e-04 4.9e-04 0.18757 0.43326 0.04557 2.8e-05 1.3e-05 0.49938 0.00848 0.42341 0.00714 0.04242 0.09899 0.00306 0.00560 0.47724 0.49546 0.33935 0.11874 0.38690 0.04642 0.05670 0.00125 0.00111
  83 0.00770 3.0e-06 2.0e-06 6.8e-04 0.00147 0.00852 9.7e-04 7.3e-04 5.9e-05 2.2e-06 1.9e-06 0.00209 0.01741 7.2e-04 3.0e-08 5.0e-09 0.07025 4.9e-04 0.25316 0.00206 0.00125 0.01086 4.9e-05 5.8e-05 0.02913 0.12915 0.40889 0.03530 0.42159 0.35761 0.33899 1.1e-04 5.8e-04 0.31266
  85 0.09750 0.00265 0.45353 3.3e-05 0.30476 0.42505 0.28662 0.22975 0.55314 3.3e-05 0.01889 0.11337 0.03453 0.59047 0.00453 0.00478 1.1e-04 0.24604 0.12162 2.2e-05 0.19896 0.23438 0.00301 0.01891 0.02110 0.03243 0.00635 0.02653 1.5e-05 9.5e-10 8.4e-11 0.08778 0.00289 0.01395 5.0e-05
  86 0.19564 0.00533 0.03724 7.8e-11 0.13426 0.19849 0.03335 0.11912 0.08653 1.8e-07 0.00977 0.01327 0.02427 0.51195 0.00324 0.00134 2.4e-05 0.21208 0.12700 3.1e-05 0.55588 0.56848 0.00212 0.00186 0.00176 0.06727 0.02825 0.15420 2.2e-05 9.1e-09 2.0e-09 0.37885 0.01605 0.01148 3.1e-04 0.46210
  87 3.0e-05 0.63140 4.8e-05 1.0e-10 0.05181 2.9e-04 7.3e-05 5.0e-06 1.0e-05 6.3e-06 0.23285 1.6e-06 1.6e-06 2.4e-04 0.22507 0.02418 6.8e-08 0.20644 0.02336 0.00152 1.6e-04 0.00418 0.05784 0.00672 4.1e-06 8.9e-05 1.5e-04 1.6e-04 2.5e-07 1.2e-10 2.6e-11 0.04000 0.25087 2.6e-05 5.5e-07 0.00113 0.00457
  88 2.1e-04 0.59169 0.00209 1.9e-05 0.17144 0.00306 0.00180 1.2e-04 4.4e-04 0.00162 0.69915 8.5e-05 3.0e-05 0.00201 0.59226 0.15882 2.7e-06 0.33576 0.04207 0.00196 6.1e-04 0.00815 0.13091 0.09933 1.7e-04 4.0e-04 2.4e-04 3.0e-04 1.5e-06 3.7e-10 8.3e-11 0.02819 0.09639 1.8e-04 1.2e-06 0.01015 0.01209 0.56265
  90 1.0e-09 6.9e-12 3.4e-14 0.00667 1.5e-06 5.6e-07 3.0e-07 9.8e-12 9.1e-13 1.5e-08 3.8e-11 7.1e-08 3.7e-07 7.2e-11 2.1e-14 2.1e-15 0.00534 3.7e-08 0.01596 4.8e-06 6.3e-12 8.9e-08 1.4e-08 2.4e-07 0.00223 2.8e-05 5.7e-04 5.7e-08 0.26222 0.49767 0.08173 1.0e-11 9.6e-09 0.00997 0.04449 2.9e-12 1.0e-11 1.8e-13 2.4e-12
  91 5.4e-08 3.8e-09 6.5e-13 4.7e-04 1.5e-05 3.0e-06 6.4e-07 9.7e-10 5.1e-11 5.7e-08 2.5e-09 3.0e-07 1.6e-06 6.8e-09 1.6e-12 7.3e-14 0.00133 4.6e-06 0.03575 1.6e-04 4.3e-09 2.5e-06 6.6e-07 3.4e-07 9.4e-04 8.8e-05 0.00286 2.5e-06 0.10324 0.48256 0.05285 7.6e-09 2.9e-06 0.00886 0.10726 3.1e-10 1.4e-09 1.2e-10 2.9e-10 0.49907

Join Leaf  44  and  Leaf  46  into dad   1 at sim  7.196e-01
Join Leaf  49  and  Leaf  88  into dad   2 at sim  6.991e-01
Join  Dad   2  and  Leaf  55  into dad   3 at sim  6.917e-01
Join Leaf  66  and  Leaf  67  into dad   4 at sim  6.732e-01
Join Leaf  45  and  Leaf  51  into dad   5 at sim  6.620e-01
Join  Dad   1  and  Leaf  53  into dad   6 at sim  6.444e-01
Join Leaf  38  and  Leaf  87  into dad   7 at sim  6.314e-01
Join  Dad   4  and  Leaf  86  into dad   8 at sim  6.125e-01
Join  Dad   6  and  Leaf  47  into dad   9 at sim  5.896e-01
Join Leaf  71  and  Leaf  73  into dad  10 at sim  5.890e-01
Join  Dad  10  and  Leaf  74  into dad  11 at sim  5.893e-01
Join  Dad   9  and  Leaf  85  into dad  12 at sim  5.648e-01
Join Leaf  43  and  Leaf  62  into dad  13 at sim  5.532e-01
Join Leaf  27  and   Dad  11  into dad  14 at sim  5.285e-01
Join  Dad  14  and   Dad   8  into dad  15 at sim  5.262e-01
Join Leaf  58  and  Leaf  82  into dad  16 at sim  4.994e-01
Join  Dad  16  and  Leaf  70  into dad  17 at sim  5.157e-01
Join Leaf  90  and  Leaf  91  into dad  18 at sim  4.991e-01
Join Leaf  77  and   Dad  18  into dad  19 at sim  5.537e-01
Join  Dad  15  and   Dad  12  into dad  20 at sim  4.970e-01
Join  Dad   7  and   Dad   3  into dad  21 at sim  4.953e-01
Join Leaf  48  and  Leaf  68  into dad  22 at sim  4.585e-01
Join  Dad   5  and  Leaf  52  into dad  23 at sim  4.521e-01
Join  Dad  20  and   Dad  23  into dad  24 at sim  4.507e-01
Join Leaf  76  and  Leaf  78  into dad  25 at sim  4.264e-01
Join  Dad  25  and  Leaf  83  into dad  26 at sim  4.513e-01
Join Leaf  64  and  Leaf  65  into dad  27 at sim  4.211e-01
Join Leaf  56  and  Leaf  69  into dad  28 at sim  4.122e-01
Join  Dad  21  and   Dad  28  into dad  29 at sim  4.686e-01
Join  Dad  26  and   Dad  19  into dad  30 at sim  4.114e-01
Join  Dad  29  and   Dad  13  into dad  31 at sim  3.877e-01
Join  Dad  31  and   Dad  22  into dad  32 at sim  3.028e-01
Join  Dad  24  and   Dad  17  into dad  33 at sim  3.005e-01
Join  Dad  33  and   Dad  27  into dad  34 at sim  2.989e-01
Join  Dad  34  and  Leaf  40  into dad  35 at sim  2.474e-01
Join  Dad  32  and  Leaf  80  into dad  36 at sim  2.074e-01
Join  Dad  35  and   Dad  30  into dad  37 at sim  1.245e-01
Join  Dad  36  and  Leaf  42  into dad  38 at sim  1.215e-01
Join  Dad  37  and  Leaf  79  into dad  39 at sim  1.187e-01
Join  Dad  39  and   Dad  38  into dad  40 at sim  7.433e-02

Binary tree(s) of classes. There are  1 roots
  27 .___________________________.                                                     
                                 |                                                     
  71 .___________________.       |_.                                                   
                         |_.     | |                                                   
  73 .___________________| |_____| |                                                   
                           |       |                                                   
  74 ._____________________|       |_________.                                         
                                   |         |                                         
  66 ._______.                     |         |                                         
             |_______.             |         |                                         
  67 ._______|       |_____________|         |                                         
                     |                       |                                         
  86 ._______________|                       |                                         
                                             |_______.                                 
  44 ._.                                     |       |                                 
       |_________.                           |       |                                 
  46 ._|         |_____.                     |       |                                 
                 |     |                     |       |                                 
  53 .___________|     |_____.               |       |                                 
                       |     |               |       |_________________.               
  47 ._________________|     |_______________|       |                 |               
                             |                       |                 |               
  85 ._______________________|                       |                 |               
                                                     |                 |               
  45 ._________.                                     |                 |               
               |___________________________________. |                 |_.             
  51 ._________|                                   |_|                 | |             
                                                   |                   | |             
  52 ._____________________________________________|                   | |             
                                                                       | |             
  58 ._______________________________.                                 | |             
                                     |_.                               | |_.           
  82 ._______________________________| |_______________________________| | |           
                                       |                                 | |           
  70 ._________________________________|                                 | |           
                                                                         | |___.       
  64 ._____________________________________________________.             | |   |       
                                                           |_____________| |   |       
  65 ._____________________________________________________|               |   |       
                                                                           |   |       
  40 ._____________________________________________________________________|   |___.   
                                                                               |   |   
  76 ._________________________________________________.                       |   |   
                                                       |_.                     |   |   
  78 ._________________________________________________| |_______.             |   |   
                                                         |       |             |   |   
  83 .___________________________________________________|       |_____________|   |   
                                                                 |                 |_. 
  77 ._____________________________________.                     |                 | | 
                                           |_____________________|                 | | 
  90 .___________________________________. |                                       | | 
                                         |_|                                       | | 
  91 .___________________________________|                                         | | 
                                                                                   | | 
  79 ._____________________________________________________________________________| | 
                                                                                     | 
  38 ._____________.                                                                 | 
                   |___________________________.                                     | 
  87 ._____________|                           |                                     | 
                                               |_______________.                     | 
  49 .___.                                     |               |                     | 
         |_.                                   |               |                     | 
  88 .___| |___________________________________|               |                     |_
           |                                                   |___.                 | 
  55 ._____|                                                   |   |                 | 
                                                               |   |                 | 
  56 ._______________________________________________________. |   |                 | 
                                                             |_|   |_.               | 
  69 ._______________________________________________________|     | |               | 
                                                                   | |               | 
  43 ._________________________.                                   | |               | 
                               |___________________________________| |_______.       | 
  62 ._________________________|                                     |       |       | 
                                                                     |       |       | 
  48 .___________________________________________.                   |       |___.   | 
                                                 |___________________|       |   |   | 
  68 .___________________________________________|                           |   |   | 
                                                                             |   |___| 
  80 ._______________________________________________________________________|   |     
                                                                                 |     
  42 .___________________________________________________________________________|     
                                                                                       



The out.trep file

At its header contains number of clusters and number of frames per cluster. Then follows the real thing : to which cluster and with what probability each frame belongs to :


VanillaSnob-Thing-Report-File
Sample  data
 41 leaf classes.
Class serials and sizes:
    27    6050.5
    38    2929.1
    40     555.7

...................................

    90     277.4
    91     215.3

+
      10      76 (97)      58 ( 2)      82 ( 1)
      20      76 (91)      77 ( 4)      83 ( 3)      82 ( 2)
      30      76 (92)      77 ( 3)      83 ( 2)      82 ( 2)
      40      76 (56)      82 (22)      83 (20)      58 ( 1)
      50      82 (57)      76 (28)      83 ( 8)      58 ( 5)      70 ( 1)

...................................

 1098180      70 (75)      52 (13)      82 (10)      51 ( 1)
 1098190      76 (87)      70 ( 5)      58 ( 4)      82 ( 2)
END

For example, the line starting with 1098190 (above), says that the probablity of frame 1098190 belonging to cluster with the identifier 76 is p1=0.87, for the cluster 70 p2=0.05, 58 → 0.04, …

You can now form lists of frames that belong to specific clusters. To collect all frames for which the most probable cluster is 49 use something like :


grep -P '^\s+\d+\s+49 \(' out.trep | awk '{print $1, $2}' > Cluster_01.dat

You can now use carma to prepare a DCD file containing only the selected frames and then proceed with whichever analyses you are up to.



If you only want to select frames that belong to a given cluster with a probability higher than, say, 0.70, you could say something like :


grep -o -P '^\s+\d+\s+49 \(\d+' out.trep | tr -d '(' | awk '{if ($3 > 70) print $1,$2,$3}' > Cluster_01_p_0.70.dat



research/howto/high_dimensionality_cluster_analysis_with_snob.txt · Last modified: 2012/01/31 17:26 (external edit)