The idea is to automatically perform k-means-based cluster analysis using an arbitrary (large) number of principal components and a large (>50K) number of frames. See the Altis, Otten, Nguyen, Hegger & Stock (2008) paper for details. On the technical front, proceed as follows:
* Place a copy of your carma.dPCA.fluctuations.dat
or carma.PCA.fluctuations.dat
in a clean directory.
* Depending on the number of frames in the file, you'll have to use a subset so that the data set to be analysed contains less than ~50,000 frames (if you have enough physical memory and CPU resources available, you can even go to ~100,000 frames). Use the command
perl -ne 'print ((0 == $. % 100) ? $_ : "")' carma.dPCA.fluctuations.dat > selected.dat
to select every 100th frame.
* Prepare a CSV file with awk :
awk '{print $2,",",$3,",",$4,",",$5,",",$6}' selected.dat > selected.csv
The number of columns selected at this stage determines the number of principal components that you'll use (in this case, 5). See the Altis, Otten, Nguyen, Hegger & Stock (2008) paper for ways of choosing the number of components to use. See this page for a lazy way to do this.
* If you do not have java or McKmeans installed on your machine, do that now.
* Run McKmeans interactively :
/usr/local/jre1.6.0_26/bin/java -Xmx4G -jar /usr/local/bin/McKmeans.jar
If you do not have 4Gbytes of physical memory on your machine, change the -Xmx4G
flag to, say, -Xmx1G
.
* Read-in the file selected.csv
* Go to the 'cluster number estimation' tab and change both 'Number of resamplings per cluster' and Number of k-means restarts' to a smaller number (say, 5), unless you have a small number of frames, or plenty of CPU time to spare (overnight job ? weekly maybe ?). If you suspect that 10 clusters as maximum are not enough, increase it. Run the cluster number estimation.
* When it finishes (which can take a long-long time), write the results to a file, let's say out
* Prepare files suitable for usage with carma -sort
:
paste out selected.dat > inter grep '^0' inter | awk '{print $2}' > Cluster_01.dat grep '^1' inter | awk '{print $2}' > Cluster_02.dat ...
* You can now continue with preparing a DCD containing only the frames belonging to a given cluster and with whatever other analyses you are up to.
McKmeans can be used in command line mode :
# /usr/local/jre1.6.0_26/bin/java -Djava.awt.headless=true -Xmx4G -jar /usr/local/bin/McKmeans.jar --help Command line usage Options --infile, -i <arg> The name of the input file. --outfile, -o <arg> The name of the output file. [default clustering.txt] -k <arg> The number of clusters [default 2] --maxiter <arg> The maximum number of iterations allowed. [default 10] --nstart <arg> The number of K-means restarts. If set > 1 the best result from these repeated runs is reported. [default 1] --cne Run a cluster number estimation? This is a boolean flag. --cnemin <arg> The minimal number of clusters for the cluster number estimation. [default 2] --cnemax <arg> The maximal number of clusters for the cluster number estimation. [default 10] --cneruns <arg> The number of repeated runs of clusterings for each partitioning. [default 10] --cnenstart <arg> The number of K-means restarts in cluster number estimation. If set > 1 only the best result from these runs is included. [default 10] --cneoutfile, --co <arg> The name of the output file for the cluster number estimation. [default cneresult.txt] --cneplot Plot the result of the cluster number estimation to file? This is a boolean flag. --cneplotfile <arg> The name of the plot file for boxplots from cluster number estimation. [default cneplot.svg]
The idea, then, is that you prepare a shell script with your command line flags which you then submit to the cluster via slurm (sbatch
). An example of such a script is :
#!/bin/tcsh /usr/local/jre1.6.0_26/bin/java -Djava.awt.headless=true -Xmx4G -jar /usr/local/bin/McKmeans.jar -i selected.csv -o out >& LOG exit