research:howto:comparing_two_secondary_structure_distributions [Norma]

Attempts to quantify how different (or similar) two trajectories are

[1] eigenvector overlap

This is an old and well-known method (it is discussed on a page of nefeli's wiki). It is not suitable for highly disordered systems, for example a folding simulation of a flexible peptide.

[2] via secondary structure assignments

What we are talking about : you have performed two independent simulations of the same system and obtained the secondary structure distribution for each of them. Graphically, what you have may look like this (for a case of two folding simulations of the same peptide but with different time steps) :

The question is : how do we quantify the similarity or dissimilarity of the two distributions ? The weblogo diagrams are not very useful because they squash the information contained in the distributions to just one dimension.

Putative primitive solution

For each simulation : take all unique STRIDE-derived secondary structure assignments and count how many times each one of them occurs.
You now have two lists (one per simulation), each containing two columns : the first column is a STRIDE assignment, the second is how many times this assignment has been observed in the corresponding simulation.
In the final step you merge the two lists using the STRIDE assignments as indices. An assignment that has not been observed in one of the two simulations is given a count of zero for that simulation.

… and then you calculate e.g. the correlation coefficient between the two columns of counts.

Not implemented (yet)

[3] via cross RMSD matrices (Cartesian or torsion)

Do we really have to go through the secondary structure assignments ? Why not use the coordinates directly or, better, the dihedral angles ? What we want is a numeric measure of how much the two simulations agree (or, equivalently, disagree). This could be obtained from a cross-RMSD matrix holding the RMSD (either Cartesian or torsion) between all possible pairs of structures from the two trajectories. For this comparison we do not care about time, which means that we only care about the minimum value observed in each column (and/or row) of the matrix. The good thing about this is that we never need to store the full 2D matrix (every pair is still compared, but a running minimum per structure is all that has to be kept), so the procedure can be applied even to large data sets.

It could go like this :

Use carma/grcarma to prepare a list of (φ,ψ) pairs for all residues of each trajectory as a function of time.
Take the first set of torsions from the first trajectory and calculate the RMSD from each and every set of torsions from the second trajectory. Write out the minimal value located.
Repeat for all sets of torsions from the first trajectory.

The result is a list of the minimum torsion-RMSDs observed for each structure from the first trajectory (when compared with all structures from the second trajectory). The distribution of these minima can probably be converted to things like e.g. the percentage of structures that are common to the two trajectories.
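One possible way of turning the list of minima into such a percentage is to count how many of them fall below a threshold. The sketch below assumes exactly that ; the function name fraction_within and the cutoff parameter are hypothetical, and the choice of cutoff value is left entirely to the user.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction (between 0 and 1) of the minimal cross-RMSD values that do not
   exceed `cutoff`. The cutoff is a user-chosen, hypothetical threshold in
   the same units as the values themselves. */
double fraction_within(const double *mins, size_t n, double cutoff)
{
    size_t i, hits = 0;
    assert(n > 0);
    for (i = 0; i < n; i++)
        if (mins[i] <= cutoff)
            hits++;
    return (double)hits / (double)n;
}
```

Multiplying the returned value by 100 gives the percentage of structures from the first trajectory that have a counterpart within the cutoff in the second trajectory. The result is not symmetric : swapping the two trajectories will in general give a different number.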

You can probably do the same thing using Cartesian coordinates (and you may actually already have most of the code needed for that : the crossDCD script).

Solution with Cartesian coordinates

carma has for some time had an undocumented feature (flag -mm) that calculates something called 'max-of-mins of an RMSD matrix'. This undocumented feature has been modified to implement the solution discussed in the previous paragraph. An example follows.

Let's say that the first trajectory is called traj1.dcd, the second traj2.dcd and the PSF test.psf. To calculate the minimal RMSD for each structure of traj2.dcd when compared with all structures from traj1.dcd, proceed as follows (the number 11910 below is the number of frames in traj1.dcd plus one, i.e. the position of the first traj2.dcd frame in the concatenated file) :

#
# carma traj1.dcd test.psf
Total number of frames (from header) is 11909


# carma traj2.dcd test.psf
Total number of frames (from header) is 11899


#
#
# catdcd4 -o all.dcd traj1.dcd traj2.dcd
CatDCD 4.0
dcdplugin) detected standard 32-bit DCD file of native endianness
dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later)
Opening file 'all.dcd' for writing.
dcdplugin) detected standard 32-bit DCD file of native endianness
dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later)
Opened file 'traj1.dcd' for reading.
Read 11909 frames from file traj1.dcd, wrote 11909.
dcdplugin) detected standard 32-bit DCD file of native endianness
dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later)
Opened file 'traj2.dcd' for reading.
Read 11899 frames from file traj2.dcd, wrote 11899.
Total frames: 23808
Frames written: 23808
CatDCD exited normally.
#
#
#
# 
# carma64 -v -cross -mm -first 11910 all.dcd test.psf 

carma v.1.7____________________________________________________________________

16  CA   atoms are declared in the PSF file.
It appears that this DCD file contains unit cell information.
Number of coordinate sets is 23808
Starting timestep         is 0
Timesteps between sets    is 1
Titles : 
Created by DCD plugin
REMARKS Created 28 September, 2017 at 18:12
Number of atoms in sets   is 259
Last frame set to 23808
Calculate max of mins of RMSD matrix of the trajectory.
Now processing frame    11910
Max of mins of RMSD matrix is 3.246655.
All done in 4.3 minutes.
#
#
#
# ls -lF carma.RMSD.mins 
-rw-rw-r--. 1 glykos glykos 95192 Sep 28 18:17 carma.RMSD.mins
#
#
# wc carma.RMSD.mins
11899 11899 95192 carma.RMSD.mins
#
#
# head carma.RMSD.mins
  0.891
  1.637
  2.034
  1.937
  1.610
  2.332
  1.661
  2.287
  2.112
  2.060
#
#
# plot -h < carma.RMSD.mins

giving this distribution of minimal cross RMSDs (max at ~2.2 Angstrom) :

Two things to note :

Take care with the amount of calculation involved. To calculate the distribution of minimal cross RMSDs for two trajectories containing 119,085 and 118,990 frames (and for a small 16-residue peptide) took ~10 hours.
The histograms (distributions) obtained do depend on the step (stride) used for the analysis. For the same peptide shown above, repeating the calculation with a 10 times finer step gave the following distribution (max at ~1.88 Angstrom this time) :

Solution with torsion angles

The same idea as above but this time with torsion-RMSD (and not Cartesian RMSD). The nice thing here is that the code can be made short and parallel (copy, paste and compile with something like gcc -O2 -fopenmp cross_tRMSD_minima.c -lm) :

Cross torsion RMSD minima

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
 
#define	MAX_ASSIGN	300000
#define	MAX_RESIDUES	20
 
int	RESIDUES;
 
float	angles[MAX_ASSIGN][2*MAX_RESIDUES+1];
float	res[MAX_ASSIGN];
 
 
int main()
{
  int	nof_assign;
  int	i, k;
  char	line[10000];
  FILE	*stream;
  int	SET2;
  int	resid;
 
  float	d1, d2, d, dd;
  float	a21_a11, a22_a12;
  float	min;
 
  int   nthreads, tid;
 
 
  /* How many residues are there ? First sanity check. */
 
  fgets(line, 9999, stdin);
  RESIDUES = (int)((strlen( line )-10) / 20 + 0.5);
 
  if ( RESIDUES*20+10 != strlen( line ) )
    {
      fprintf(stderr, "\nNot a grcarma produced input file ??? Abort.\n\n");
      exit(1);
    }
 
  if ( RESIDUES > MAX_RESIDUES )
    {
      fprintf(stderr, "\nIncrease MAX_RESIDUES and recompile.\n\n");
      exit(1);
    }
 
  fprintf(stderr, "phi/psi angles for %d residues will be used.\n", RESIDUES );
 
 
 
  /* Get back the data from the first line. fmemopen() is POSIX 2008. This won't work on old machines. */
 
  stream = fmemopen( line, strlen( line ), "r" );
  for ( i=0 ; i < 2*RESIDUES+1 ;  i++ )
  {
    if ( fscanf( stream, "%f", &angles[0][i]) != 1)
      {
        fprintf(stderr, "\nInformative message follows : Panic [1].\n\n");
        exit(1);
      }
  }
  fclose( stream);
 
 
  /* Read input. SET2 will hold the index of the first frame of the second trajectory
     (located at the point where the frame number in the first column decreases). */

  nof_assign = 1;
  SET2 = -1;
  i = 0;
  while ( scanf("%f", &angles[nof_assign][i]) == 1 )
    {
      if ( i == 0 && nof_assign > 1 && angles[nof_assign][i] < angles[nof_assign-1][i] )
        SET2 = nof_assign;

      i++;
      if ( i == RESIDUES*2 + 1 )
        {
          nof_assign++;
          i = 0;
        }
      if ( nof_assign == MAX_ASSIGN-1 )
        {
          fprintf(stderr, "\n\nIncrease MAX_ASSIGN and recompile.\n\n");
          exit(1);
        }
    }


  /* Last sanity checks */

  if ( SET2 < 0 )
    {
      fprintf(stderr, "\nOnly one trajectory found in the input ??? Abort.\n\n");
      exit(1);
    }

  for ( i=0 ; i < nof_assign ; i++ )
    {
      if ( (int)(angles[i][0]) != angles[i][0] )
        {
          fprintf(stderr, "\nInformative message follows : Panic [2].\n\n");
          exit(1);
        }
    }

  fprintf(stderr, "Read %d sets of phi/psi angles.\n", nof_assign );
  fprintf(stderr, "The second set starts at position %d.\n", SET2 );
 
 
  /* Convert everything to radians */
 
  for ( i=0 ; i < nof_assign ; i++ )
  {
    for ( resid = 0 ; resid < RESIDUES ; resid++ )
    {
     angles[i][2*resid+1] *= M_PI / 180.0l;
     angles[i][2*resid+2] *= M_PI / 180.0l;
    }
  }
 
 
 
 
  /* The actual calculation */
 
  for ( i=0 ; i < SET2 ; i++ )
  {
    #pragma omp parallel for private( d, resid, a21_a11, a22_a12, d1, d2, min)
    for ( k=SET2 ; k < nof_assign ; k++ )
      {
 
        d = 0.0;
        for ( resid = 0 ; resid < RESIDUES ; resid++ )
          {
            a21_a11 = angles[k][2*resid+1] - angles[i][2*resid+1] ;
            a22_a12 = angles[k][2*resid+2] - angles[i][2*resid+2] ;
 
            d1 = atan2( (sin(a21_a11)) , (cos(a21_a11)) );
            d2 = atan2( (sin(a22_a12)) , (cos(a22_a12)) );
 
            d += d1*d1 + d2*d2 ;
          }
 
        res[k] = sqrt( d / RESIDUES ) ;
      }
 
    min = 100000000000.0;
    for (  k=SET2 ; k < nof_assign ; k++ )
      if ( res[k] < min )
        min = res[k];
 
    printf("%6.3f\n", min );
 
  }
 
 
}
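
In formula form, the quantity computed by the program above appears to be the following. This is a reading of the code, not something stated elsewhere on this page ; the symbols N (number of residues), i (frame of the first trajectory), k (frame of the second trajectory) and Δ (wrapped angle difference) are introduced here for illustration only.

```latex
% Angle difference wrapped into the interval [-\pi, \pi]
\Delta(\alpha,\beta) = \operatorname{atan2}\bigl(\sin(\alpha-\beta),\ \cos(\alpha-\beta)\bigr)

% Torsion RMSD between frame i of the first and frame k of the second trajectory
d_{ik} = \sqrt{\frac{1}{N}\sum_{r=1}^{N}
         \Bigl[\Delta\bigl(\phi^{(2)}_{k,r},\phi^{(1)}_{i,r}\bigr)^{2}
             + \Delta\bigl(\psi^{(2)}_{k,r},\psi^{(1)}_{i,r}\bigr)^{2}\Bigr]}

% Value written out for frame i of the first trajectory
m_i = \min_{k}\ d_{ik}
```

Two details follow from the code : the sum is divided by the number of residues N (not by the number of angles 2N), and the values written out are in radians because the angles are converted before the calculation.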

To use this program, run grcarma twice (for the two trajectories) and compute phi/psi angles. Then, concatenate the two files into one (cat traj1.phipsi traj2.phipsi > all.dat), and run it in parallel mode (on a machine with many cores) :

setenv OMP_NUM_THREADS 32
./a.out < all.dat > out
plot -h < out
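
The last command is assumed here to draw a histogram of the values in the output file. Where that tool is not available, the same information (including the position of the maximum of the distribution) can be obtained with something like the sketch below ; the function names histogram and histogram_mode, and the choice of lower limit and bin width, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill `bins` with the number of values falling in
   [lo + b*width, lo + (b+1)*width) for b = 0 .. nbins-1.
   Values outside the covered range (and NaNs) are ignored.
   Returns how many values were actually binned. */
size_t histogram(const double *v, size_t n, double lo, double width,
                 size_t *bins, size_t nbins)
{
    size_t i, b, used = 0;
    assert(width > 0.0 && nbins > 0);
    for (b = 0; b < nbins; b++)
        bins[b] = 0;
    for (i = 0; i < n; i++) {
        double pos = (v[i] - lo) / width;
        if (!(pos >= 0.0 && pos < (double)nbins))
            continue;
        bins[(size_t)pos]++;
        used++;
    }
    return used;
}

/* Centre of the most populated bin (the first one in case of ties). */
double histogram_mode(const size_t *bins, size_t nbins, double lo, double width)
{
    size_t b, best = 0;
    for (b = 1; b < nbins; b++)
        if (bins[b] > bins[best])
            best = b;
    return lo + ((double)best + 0.5) * width;
}
```

The position of the maximum depends on the bin width, so the same lower limit and width should be used when comparing distributions obtained from different runs.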

The code appears to scale pretty well : On a machine with 48 cores it shows an almost linear speed-up out to 32 cores. The graph of log(cores) vs log(wall clock time in seconds) is :

The actual numbers are : 1 core → 704 seconds, 32 cores → 31 seconds, a speed-up by a factor of ~23.
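
Worked out from the two timings quoted above (the symbols S for speed-up and E for parallel efficiency are introduced here) :

```latex
S_{32} = \frac{T_{1}}{T_{32}} = \frac{704\ \mathrm{s}}{31\ \mathrm{s}} \approx 22.7
\qquad
E_{32} = \frac{S_{32}}{32} \approx 0.71
```

In other words, each of the 32 cores is used at roughly 70% of the ideal.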

As before, you must remember that you have to demonstrate that the step used for frame selection is fine enough for the calculation to converge. The time requirements are still important but, with the OpenMP parallelization, not as demanding as for the Cartesian calculation above : for the same problem as previously (119,085 and 118,990 frames, a 16-residue peptide) the torsion-based calculation took 51 minutes instead of 10.2 hours.

The last graph (below) is a worked example showing how the distribution changes depending on the time interval selected for taking samples from the trajectories (the intervals are 1 ns, 500 ps and 100 ps for this example, which is a folding simulation of a disordered peptide) :

OK, you got the distributions. So what ?

Indeed. The graph above shows the amount of difference between the two trajectories, but does not answer the initial question : are the two trajectories consistent with each other (and, thus, mergeable) ? In the case of the peptide discussed on this page (see secondary structure graphs at the top of this page), we can possibly get away with it because the two runs were quite long, 11 μs each. So : we take the first trajectory, divide it into two halves, and compare the two halves using cross t-RMSD. Then we compare the distribution obtained from the two halves of the same trajectory with the distribution obtained by comparing the two trajectories. With a 100 ps step the results are :

What this says is that the two independent trajectories are slightly more similar to each other than the two halves of the same trajectory are. This more-or-less settles the matter and the two trajectories should be considered mergeable (the reason that the comparison of the two halves is, unexpectedly, slightly worse can be seen in the first figure of this page : the first trajectory has a long stretch of β structure in its first half only).