This is the preferred and possibly the easiest way. To calculate the RMSD matrix using only CA atoms, and every 500th structure give:
carma -v -cross -step 500 my.dcd my.psf carma -col - < carma.RMSD.matrix
and view the resulting postscript file. If you wanted to use all heavy atoms of segid “P” and every 250th structure you would have said:
carma -v -cross -segid P -atmid HEAVY -step 250 my.dcd my.psf
Now, lets assume you have a crossDCD.matrix plot like the following and you wish to perform a cluster analysis of the RMSD matrix.
Performing cluster analysis based on a distance matrix has many possible solutions, and not a uniquely correct answer. Have a look at http://www.statmethods.net/advstats/cluster.html for things that can be done with the R package. One possible line of work (which assumes that you can decide at what RMSD cutoff you want to define clusters) is:
First, you make a dendrogram based on these RMSDs using the statistical package R and a hierarchical cluster analysis method, like hclust(). Before starting R, wc your matrix to find out its dimensions. Then:
# # # d -h total 46M -rw-r--r-- 1 glykos glykos 46M Mar 29 13:37 crossDCD.matrix # # # wc -l crossDCD.matrix 2500 crossDCD.matrix # # # R | tee LOG R version 2.8.1 (2008-12-22) Copyright (C) 2008 The R Foundation for Statistical Computing ISBN 3-900051-07-0 R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > A <- matrix(scan("crossDCD.matrix", n = 2500*2500), 2500, 2500, byrow = TRUE) Read 6250000 items > hc <- hclust( as.dist(A), method="average") > plclust(hc) > postscript() > plclust(hc) > dev.off() null device > cutree( hc, h = 1.3) [1] 1 1 1 2 2 2 2 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 [38] 1 1 1 1 1 1 1 2 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 [75] 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 2 1 1 1 1 1 [112] 1 2 1 1 2 2 1 1 1 2 2 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 2 [149] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 [186] 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 [260] 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 [297] 1 1 2 1 1 1 1 1 1 2 2 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 [334] 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 <many lines with cluster assignments> [2332] 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 1 1 [2369] 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 [2406] 1 2 2 3 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 [2443] 1 1 1 1 1 1 2 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [2480] 1 2 2 1 1 1 1 2 2 2 1 1 2 2 1 2 2 1 1 1 1 > rect.hclust( hc, h = 1.3, border="red") > clusters <- cutree( hc, h = 1.3) > as.data.frame(clusters) >q() Save workspace image? [y/n/c]: n
Which gives a dendrogram like:
From this dendrogram you can define a cutoff to find the number of clusters (the expression h = 1.3
in the example shown above). The variable as.data.frame(clusters) gives you something similar to the carma.clusters.dat file, where each frame is assigned to a cluster (per line). This way you can prepare a DCD file that corresponds to each cluster and continue your analysis from there…
Use the crossDCD script (from carma's distribution) to obtain a two-dimensional matrix with all frame-to-frame RMSDs of your trajectory (type 'crossDCD' from the shell for a usage summary). Adjust the step (for crossDCD) in such a way as to have manageable matrix sizes (max ~5000×5000). After the calculation finishes, the matrix will be contained in a file (in the current working directory) named crossDCD.matrix
. You can plot the matrix with carma -col - < crossDCD.matrix
. A typical session would look like this:
# # d -h total 2.0G -r--r--r-- 1 glykos glykos 2.0G Mar 29 13:24 INYWLAH.dcd -r--r--r-- 1 glykos glykos 30K Mar 29 13:24 INYWLAH.psf # # carma -v -w -fit -step 500 INYWLAH.dcd INYWLAH.psf carma v.1.1____________________________________________________________________ 7 CA atoms are declared in the PSF file. It appears that this DCD file contains unit cell information. Number of coordinate sets is 1309323 Starting timestep is 0 Timesteps between sets is 1 Titles : Created by DCD plugin REMARKS Created 10 July, 2010 at 13:31 Number of atoms in sets is 128 Fit DCD frames onto reference frame. Now processing frame 1309500 End of DCD reached. All done. # # # d -h total 2.0G -r--r--r-- 1 glykos glykos 2.0G Mar 29 13:24 INYWLAH.dcd -r--r--r-- 1 glykos glykos 30K Mar 29 13:24 INYWLAH.psf -rw-r--r-- 1 glykos glykos 451K Mar 29 13:28 carma.fit-rms.dat -rw-r----- 1 glykos glykos 420K Mar 29 13:28 carma.fitted.dcd -rw-r--r-- 1 glykos glykos 811 Mar 29 13:28 carma.selected_atoms.psf # # mv carma.fitted.dcd CAs.dcd # mv carma.selected_atoms.psf CAs.psf # rm carma.fit-rms.dat # # d -h total 2.0G -rw-r----- 1 glykos glykos 420K Mar 29 13:28 CAs.dcd -rw-r--r-- 1 glykos glykos 811 Mar 29 13:28 CAs.psf -r--r--r-- 1 glykos glykos 2.0G Mar 29 13:24 INYWLAH.dcd -r--r--r-- 1 glykos glykos 30K Mar 29 13:24 INYWLAH.psf # # # crossDCD Usage : CrossDCD <PSF_file> <DCD_1> <DCD_2> [step] [flags] Summary : Calculate a 2D matrix containing the rms deviations (after least squares fitting) between the frames of two DCD files. If <step> is defined, then instead of using each and every frame of the two DCDs, only consider every <step>th frame. DCD_1 and DCD_2 can well be the one and same file. The matrix will be written to a file ( crossDCD.matrix ). To plot the results use 'carma - < crossDCD.matrix' and view the resulting postscript file. If you want to pass flags to the carma runs (things line -segid or -atomid), you will have to define the <step> (even if it is 1) and then define carma's flags enclosed in double quotes (ie " ...flags..."). # # crossDCD CAs.psf CAs.dcd CAs.dcd DCD_1 and DCD_2 are identical. Will calculate intra-DCD rmsds. DCD_1 contains 2619 frames DCD_2 contains 2619 frames .................................................. .................................................. <many lines with dots> .................................................. .................................................. ................... # # d -h total 2.1G -rw-r----- 1 glykos glykos 420K Mar 29 13:28 CAs.dcd -rw-r--r-- 1 glykos glykos 811 Mar 29 13:28 CAs.psf -r--r--r-- 1 glykos glykos 2.0G Mar 29 13:24 INYWLAH.dcd -r--r--r-- 1 glykos glykos 30K Mar 29 13:24 INYWLAH.psf -rw-r--r-- 1 glykos glykos 46M Mar 29 13:37 crossDCD.matrix # # carma -col - < crossDCD.matrix Will plot the gradient from 0.000000 to 5.753800 using colour postscript. # # wc -l crossDCD.matrix 2619 crossDCD.matrix # # d -h total 2.1G -rw-r----- 1 glykos glykos 420K Mar 29 13:28 CAs.dcd -rw-r--r-- 1 glykos glykos 811 Mar 29 13:28 CAs.psf -r--r--r-- 1 glykos glykos 2.0G Mar 29 13:24 INYWLAH.dcd -r--r--r-- 1 glykos glykos 30K Mar 29 13:24 INYWLAH.psf -rw-r--r-- 1 glykos glykos 41M Mar 29 13:37 carma.stdin.ps -rw-r--r-- 1 glykos glykos 46M Mar 29 13:37 crossDCD.matrix # # # gs carma.stdin.ps GPL Ghostscript 8.62 (2008-02-29) Copyright (C) 2008 Artifex Software, Inc. All rights reserved. This software comes with NO WARRANTY: see the file PUBLIC for details. >>showpage, press <return> to continue<< GS> quit # #