The system used for these tests comprised 60,660 atoms in an orthogonal PBC box of dimensions ~124 x 77 x 63 Å, with an inner timestep of 2 fs, nonbonded interactions evaluated every 4 fs, and full electrostatics every 8 fs (script linked below). All runs were allowed to proceed until the timings stabilised. The table below reports days per nanosecond of simulation for each combination indicated.
NAMD script used for these tests
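The full script is not reproduced here; the fragment below is only a minimal sketch of the multiple-timestepping and PME settings implied by the numbers quoted in this section (everything else in the real script is omitted, and the layout is illustrative):

    # Sketch of the timestep/PME settings implied by the text; the actual
    # benchmark script contained many more parameters.
    timestep            2.0   ;# inner step of 2 fs
    nonbondedFreq       2     ;# nonbonded every 2 steps = 4 fs
    fullElectFrequency  4     ;# full electrostatics (PME) every 4 steps = 8 fs
    PME                 yes
    PMEGridSizeX        128   ;# grid quoted below for these runs
    PMEGridSizeY        80
    PMEGridSizeZ        64
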
| Days per nsec | 4 cores | 8 cores | 12 cores | 16 cores |
|---|---|---|---|---|
| NAMD 2.6 TCP | 2.04 | 1.28 | 1.95 | 1.57 |
| NAMD 2.6 UDP | 2.07 | 1.33 | 1.12 | 0.99 |
| NAMD 2.5 UDP | 2.16 | 1.36 | 1.10 | 0.98 |

For the runs shown above, the PME grid was 128x80x64, all cores were used for the FFT, and the nodelist file was a simple per-core list using the eth0 interface:

    group main
    host 10.0.0.11
    host 10.0.0.11
    host 10.0.0.11
    host 10.0.0.11
    host 10.0.0.12
    ...

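As an aside, a hedged sketch of how such a nodelist gets used: charmrun walks through the "host" entries when placing processes, so a quad-core node is listed four times, and the file can be selected explicitly with ++nodelist (otherwise charmrun falls back to a default nodelist file such as ./nodelist or ~/.nodelist). The file name nodelist.eth0 below is purely illustrative.

    # Hypothetical file name; the contents follow the Charm++ nodelist
    # format shown above (one "host" line per process slot on that machine).
    charmrun /usr/local/namd/namd2 +p16 ++nodelist ./nodelist.eth0 equi.namd
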
Keeping the number of cores (16) and the NAMD version (2.6 UDP) constant, we have:

| Modification of above scenario | Days per nsec | nsec per day |
|---|---|---|
| Use eth1 (10.0.1.x) in nodelist but 10.0.0.1 for charmrun | 0.95 | 1.05 |
| Use eth1 (10.0.1.x) in nodelist and 10.0.1.1 for charmrun | 0.96 | 1.04 |
| Use a mix of eth0 and eth1 in nodelist, 10.0.0.1 for charmrun | 0.97 | 1.03 |
| Use the +atm NAMD command-line flag | 0.82 | 1.22 |
| Use the +atm +giga NAMD command-line flags | 0.77 | 1.30 |
| Use the +giga NAMD command-line flag | 0.76 | 1.31 |
| Use +giga, mixed eth0 & eth1, 10.0.0.1 for charmrun | 0.74 | 1.35 |

The current best command line looks like this (with some lingering doubt about the choice of address for ++useip):
charmrun /usr/local/namd/namd2 +p16 +giga ++useip 10.0.0.1 equi.namd
Keeping the above line constant, try different nodelist files and VLAN settings on the switch (the best-performing mixed-interface nodelist is sketched after the table):

| VLANs in use? | Nodelist form | Days per nanosecond |
|---|---|---|
| No | 10.0.0.11, 10.0.0.11, 10.0.0.11, 10.0.0.11, 10.0.0.12, … | 0.77 |
| No | 10.0.1.11, 10.0.1.11, 10.0.1.11, 10.0.1.11, 10.0.1.12, … | 0.77 |
| No | 10.0.0.11, 10.0.1.11, 10.0.0.11, 10.0.1.11, 10.0.0.12, … | 0.74 |
| No | 10.0.1.11, 10.0.0.11, 10.0.1.11, 10.0.0.11, 10.0.1.12, … | 0.74 |

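Written out as an actual nodelist file, the better-performing mixed-interface arrangement from the table would look roughly like this (assuming the comma-separated entries above are simply the host lines of the file):

    group main
    host 10.0.0.11
    host 10.0.1.11
    host 10.0.0.11
    host 10.0.1.11
    host 10.0.0.12
    ...
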
It appears that the absence of VLANs doesn't affect performance significantly, so go back to the two established VLANs (to keep traffic segregated).
While we are at it, do a quick test with all 32 cores to cheer ourselves up:

    # charmrun /usr/local/namd/namd2 +p32 +giga ++useip 10.0.0.1 equi.namd > LOG &
    # days_per_nanosecond LOG
    0.56717 0.522217 0.510001 ...

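The days_per_nanosecond helper used throughout these notes is not shown; assuming it simply pulls the days/ns figures out of the "Info: Benchmark time:" lines that NAMD prints periodically, a rough one-line stand-in would be:

    # Rough stand-in for the helper: print every days/ns value in the log.
    grep "Benchmark time" LOG | awk '{for (i = 2; i <= NF; i++) if ($i == "days/ns") print $(i-1)}'
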
Try some additional NAMD/Charm++ command-line options: +idlepoll (no effect), +eth (no effect), +stacksize (no effect), +LBObjOnly (failed), +truecrash (no effect), and +strategy USE_MESH/USE_GRID.

To recap so far, the following table compares the number of cores against timings and efficiency. Efficiency is defined as 100 × (days/nsec on one core) / (N × days/nsec on N cores); for example, the 4-core entry gives 100 × 7.05 / (4 × 2.07) ≈ 85%.

| Cores | Days per nsec | nsec per day | Efficiency (%) |
|---|---|---|---|
| 1 core | 7.05 | 0.14 | 100% |
| 4 cores | 2.07 | 0.48 | 85% |
| 8 cores | 1.19 | 0.84 | 74% |
| 16 cores | 0.74 | 1.35 | 59% |
| 32 cores | 0.50 | 2.00 | 44% |

Test two different NAMD executables, as provided by the developers, on 16 and 32 cores (measurements in days per nanosecond):

| Days per nsec | 16 cores | 32 cores |
|---|---|---|
| Linux-i686 | 0.74 | 0.51 |
| Linux-amd64 | 0.56 | 0.43 |

Using 16 cores with a one-by-one nodelist file as described here: 0.53 days per nsec.
Start messing with the NAMD run itself. Check PMEProcessors by running NAMD with the command line:
/usr/local/namd/charmrun /usr/local/namd/namd2 +p32 +giga ++useip 10.0.0.1 equi.namd
and vary PMEProcessors, as well as whether the PME cores sit on the same node(s) or on different ones (measurements in days per nsec; the corresponding configuration line is sketched after the table):

| PMEProcessors setting | Days per nsec |
|---|---|
| PMEProcessors 32 (all cores) | 0.44 |
| PMEProcessors 16 (two per node) | 0.43 |
| PMEProcessors 4 (on four different nodes) | 0.45 |

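For reference, PMEProcessors is a configuration-file parameter rather than a command-line flag, so the variation above presumably amounts to editing a single line of equi.namd along these lines (which physical cores it lands on is then determined by the nodelist ordering):

    PMEProcessors  16   ;# limit the PME/FFT work to 16 of the 32 processes
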
Try increasing 'stepsPerCycle' and 'pairlistdist' to improve parallel scaling (a sketch of such a change is given below). Try using '+asyncio +strategy USE_HYPERCUBE'. With all that, the improvement is rather small, ending at about 0.40 days per nanosecond.
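A minimal sketch of what the stepsPerCycle / pairlistdist change might look like in the configuration file; the cutoff and switching values, and the specific numbers, are illustrative assumptions rather than values taken from the benchmark script:

    # Illustrative values only, not the ones used for the timings above.
    cutoff         12.0
    switching      on
    switchdist     10.0
    pairlistdist   14.0   ;# kept larger than the cutoff so pairlists remain valid over a full cycle
    stepsPerCycle  40     ;# pairlists are rebuilt once per cycle, so a longer cycle means fewer rebuilds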