This is an old and well-known method (there is a nefeli's wiki page discussing it here). This method won't do for highly disordered systems, for example, a folding simulation of a flexible peptide.
What we talk about : you have performed two independent simulations of the same system and you obtained the secondary structure distributions for each one of them. Graphically, what you have may look like this (for a case of two folding simulations of the same peptide but with different time steps) :
The question is : how to quantify the similarity or dissimilarity of the two distributions. The weblogo diagrams are not very useful because they squash the information contained in the distributions to just one dimension.
… and then you calculate e.g. the correlation coefficient between the two columns.
Not implemented (yet)
Do we really have to go through the secondary structure assignments ? Why not using directly the coordinates or, better, the dihedral angles ? What we want is a numeric measure of how much the two simulations agree (or, equivalently, disagree). This could possibly go through the calculation of a cross-RMSD matrix comparing the RMSD (either Cartesian or torsion) between all possible pairs of structures from the two trajectories. For this comparison we do not care about time, which means that we only care for the minimum value observed in each column (and/or row) of the matrix. The good thing about that is that we don't really need to calculate the full 2D matrix, and the procedure can, thus, be applied even with large data sets.
It could go like this :
The result is a list of the minimum torsion-RMSDs observed for each structure from the first trajectory (when compared with all structures from the second trajectory). The distribution of these minimums can probably be converted to things like e.g. the percentage of structures that are common in the two trajectories.
You can probably do the same thing using Cartesian coordinates (and may actually already have most of the code needed for that : the crossDCD script).
carma had for some time now an undocumented feature (flag -mm
) to calculate something called 'max-of-mins of an RMSD matrix'. This undocumented feature has been modified to implement the solution discussed in the previous paragraph. An example.
Let's say that the first trajectory is called traj1.dcd
, the second traj2.dcd
and the PSF test.psf
. To calculate the minimal RMSDs for each structure of the traj2.dcd
trajectory when compared with all structures from the traj1.dcd
proceed as follows (the number 11910 below is the number of frames in traj1.dcd
plus one):
# # carma traj1.dcd test.psf Total number of frames (from header) is 11909 # carma traj2.dcd test.psf Total number of frames (from header) is 11899 # # # catdcd4 -o all.dcd traj1.dcd traj2.dcd CatDCD 4.0 dcdplugin) detected standard 32-bit DCD file of native endianness dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later) Opening file 'all.dcd' for writing. dcdplugin) detected standard 32-bit DCD file of native endianness dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later) Opened file 'traj1.dcd' for reading. Read 11909 frames from file traj1.dcd, wrote 11909. dcdplugin) detected standard 32-bit DCD file of native endianness dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later) Opened file 'traj2.dcd' for reading. Read 11899 frames from file traj2.dcd, wrote 11899. Total frames: 23808 Frames written: 23808 CatDCD exited normally. # # # # # carma64 -v -cross -mm -first 11910 all.dcd test.psf carma v.1.7____________________________________________________________________ 16 CA atoms are declared in the PSF file. It appears that this DCD file contains unit cell information. Number of coordinate sets is 23808 Starting timestep is 0 Timesteps between sets is 1 Titles : Created by DCD plugin REMARKS Created 28 September, 2017 at 18:12 Number of atoms in sets is 259 Last frame set to 23808 Calculate max of mins of RMSD matrix of the trajectory. Now processing frame 11910 Max of mins of RMSD matrix is 3.246655. All done in 4.3 minutes. # # # # ls -lF carma.RMSD.mins -rw-rw-r--. 1 glykos glykos 95192 Sep 28 18:17 carma.RMSD.mins # # # wc carma.RMSD.mins 11899 11899 95192 carma.RMSD.mins # # # head carma.RMSD.mins 0.891 1.637 2.034 1.937 1.610 2.332 1.661 2.287 2.112 2.060 # # # plot -h < carma.RMSD.mins
giving this distribution of minimal cross RMSDs (max at ~2.2 Angstrom) :
Two things to note :
The same idea as above but this time with torsion-RMSD (and not Cartesian RMSD). The nice thing here is that the code can be made short and parallel (copy, paste and compile with something like gcc -O2 -fopenmp cross_tRMSD_minima.c -lm
) :
To use this program, run grcarma twice (for the two trajectories) and compute phi/psi angles. Then, concatenate the two files into one (cat traj1.phipsi traj2.phipsi > all.dat
), and run it in parallel mode (on a machine with many cores) :
setenv OMP_NUM_THREADS 32 ./a.out < all.dat > out plot -h < out
The code appears to scale pretty well : On a machine with 48 cores it shows an almost linear speed-up out to 32 cores. The graph of log(cores) vs log(wall clock time in seconds) is :
The actual numbers are : 1 core → 704 seconds, 32 cores → 31 seconds, a speed-up by a factor of ~23.
As before, you must remember that you have to demonstrate that the step used for frame selection is fine enough for the calculation to converge. The time requirements are still important but with the openmp parallelization not as demanding as for the Cartesian calculation above : for the same problem as previously (119,085 and 118,990 frames, a 16-residue peptide) the torsion-based calculation took 51 minutes instead of 10.2 hours.
The last graph (below) is a worked example showing how the distribution changes depending of the time interval selected for taking samples from the trajectories (the intervals are 1 ns, 500 ps and 100 ps for this example, it is a disordered peptide folding simulation) :
Indeed. The graph shown above shows the amount of difference between the two trajectories, but does not answer the initial question : are the two trajectories consistent with each other (and, thus, mergeable) ? In the case of the peptide discussed on this page (see secondary structure graphs at the top of this page), we can possibly escape because the two runs were quite long, 11 μs each. So : we take the first trajectory, divide it into two halves, and compare the two halves using cross t-RMSD. Then we compare the distribution obtained from the two halves of the same trajectory with the distribution obtained by comparing the two trajectories. With a 100 ps step the results are :
What this says is that the two independent trajectories are slightly more alike between them than the two halves of the same trajectory. This more-or-less settles the matter and the two trajectories should be considered mergeable (the reason that the comparison of the two halves is -unexpectedly- slightly worse can be seen in the first figure of this page : the first trajectory has a long stretch of β structure in its first half only).