research:howto:high_dimensionality_cluster_analysis_based_on_pca [Norma]

- - High dimensionality PCA-based cluster analysis using k-means

High dimensionality PCA-based cluster analysis using k-means

The idea is to automatically perform k-means-based cluster analysis using an arbitrary (large) number of principal components and a large (>50K) number of frames. See the Altis, Otten, Nguyen, Hegger & Stock (2008) paper for details. On the technical front, proceed as follows:

Prelude

* Place a copy of your carma.dPCA.fluctuations.dat or carma.PCA.fluctuations.dat in a clean directory.

* Depending on the number of frames in the file, you'll have to use a subset so that the data set to be analysed contains less than ~50,000 frames (if you have enough physical memory and CPU resources available, you can even go to ~100,000 frames). Use the command

perl -ne 'print ((0 == $. % 100) ? $_ : "")' carma.dPCA.fluctuations.dat > selected.dat

to select every 100th frame.

* Prepare a CSV file with awk :

awk '{print $2,",",$3,",",$4,",",$5,",",$6}' selected.dat > selected.csv

The number of columns selected at this stage determines the number of principal components that you'll use (in this case, 5). See the Altis, Otten, Nguyen, Hegger & Stock (2008) paper for ways of choosing the number of components to use. See this page for a lazy way to do this.

Working interactively on your workstation

* If you do not have java or McKmeans installed on your machine, do that now.

* Run McKmeans interactively :

/usr/local/jre1.6.0_26/bin/java -Xmx4G -jar /usr/local/bin/McKmeans.jar

If you do not have 4Gbytes of physical memory on your machine, change the -Xmx4G flag to, say, -Xmx1G.

* Read-in the file selected.csv

* Go to the 'cluster number estimation' tab and change both 'Number of resamplings per cluster' and Number of k-means restarts' to a smaller number (say, 5), unless you have a small number of frames, or plenty of CPU time to spare (overnight job ? weekly maybe ?). If you suspect that 10 clusters as maximum are not enough, increase it. Run the cluster number estimation.

* When it finishes (which can take a long-long time), write the results to a file, let's say out

* Prepare files suitable for usage with carma -sort :

paste out selected.dat > inter
grep '^0' inter | awk '{print $2}' > Cluster_01.dat
grep '^1' inter | awk '{print $2}' > Cluster_02.dat
...

* You can now continue with preparing a DCD containing only the frames belonging to a given cluster and with whatever other analyses you are up to.

Submitting a job to the cluster

McKmeans can be used in command line mode :

# /usr/local/jre1.6.0_26/bin/java -Djava.awt.headless=true -Xmx4G -jar /usr/local/bin/McKmeans.jar --help

Command line usage
Options
  --infile, -i <arg>        The name of the input file.                                                                                                                        
  --outfile, -o <arg>       The name of the output file.                                                                                               [default clustering.txt]
  -k <arg>                  The number of clusters                                                                                                     [default 2]             
  --maxiter <arg>           The maximum number of iterations allowed.                                                                                  [default 10]            
  --nstart <arg>            The number of K-means restarts. If set > 1 the best result from these repeated runs is reported.                           [default 1]             
  --cne                     Run a cluster number estimation? This is a boolean flag.                                                                                           
  --cnemin <arg>            The minimal number of clusters for the cluster number estimation.                                                          [default 2]             
  --cnemax <arg>            The maximal number of clusters for the cluster number estimation.                                                          [default 10]            
  --cneruns <arg>           The number of repeated runs of clusterings for each partitioning.                                                          [default 10]            
  --cnenstart <arg>         The number of K-means restarts in cluster number estimation. If set > 1 only the best result from these runs is included.  [default 10]            
  --cneoutfile, --co <arg>  The name of the output file for the cluster number estimation.                                                             [default cneresult.txt] 
  --cneplot                 Plot the result of the cluster number estimation to file? This is a boolean flag.                                                                  
  --cneplotfile <arg>       The name of the plot file for boxplots from cluster number estimation.                                                     [default cneplot.svg]

The idea, then, is that you prepare a shell script with your command line flags which you then submit to the cluster via slurm (sbatch). An example of such a script is :

#!/bin/tcsh

/usr/local/jre1.6.0_26/bin/java -Djava.awt.headless=true -Xmx4G -jar /usr/local/bin/McKmeans.jar -i selected.csv -o out >& LOG

exit