NAMD, 60,000 atoms benchmarks

The test system comprised 60,660 atoms in an orthogonal PBC box of dimensions ~124x77x63, with an inner timestep of 2 fs, nonbonded interactions evaluated every 4 fs, and electrostatics every 8 fs (script included below). Each run was left running until the reported timing stabilised. The table below reports days per nanosecond of simulation for each combination indicated.

NAMD script used for these tests

Version        4 cores   8 cores   12 cores   16 cores
NAMD 2.6 TCP   2.04      1.28      1.95       1.57
NAMD 2.6 UDP   2.07      1.33      1.12       0.99
NAMD 2.5 UDP   2.16      1.36      1.10       0.98

For the runs shown above, the PME grid was 128x80x64, all cores were used for FFT and the nodelist file was a simple core list using the eth0 interface:

group main
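As a side check on the PME grid quoted above (128x80x64 for a box of ~124x77x63), the short sketch below computes the grid spacing along each axis and tests whether each grid size factors into small primes only. It assumes the box dimensions are in angstroms (the text gives no unit) and relies on the common rule of thumb of roughly one grid point per angstrom with FFT-friendly sizes (factors of 2, 3 and 5). The function names are illustrative and are not part of NAMD.

```python
# Box edge lengths and PME grid sizes quoted in the text.
# Assumption: the box dimensions are in angstroms (the text gives no unit).
box = (124.0, 77.0, 63.0)
pme_grid = (128, 80, 64)


def has_only_small_factors(n, primes=(2, 3, 5)):
    """Return True if n factors completely into the given small primes."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def grid_spacing(lengths, grid):
    """Return the distance between neighbouring PME grid points per axis."""
    return tuple(length / points for length, points in zip(lengths, grid))


spacing = grid_spacing(box, pme_grid)
for length, points, step in zip(box, pme_grid, spacing):
    print(f"{length:6.1f} / {points:3d} = {step:.3f} per grid point, "
          f"small factors only: {has_only_small_factors(points)}")
```

Under that assumption, all three grid sizes are at least as large as the corresponding box edge, so the spacing stays just under one unit per grid point.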

Keeping constant the number of cores (16) and the NAMD version (2.6 UDP), we have:

Modification of above scenario                        Days per nsec   nsec per day
Use eth1 (10.0.1.x) in nodelist but for charmrun      0.95            1.05
Use eth1 (10.0.1.x) in nodelist and for charmrun      0.96            1.04
Use a mix of eth0 and eth1 in nodelist, for charmrun  0.97            1.03
Use the +atm namd command-line flag                   0.82            1.22
Use the +atm +giga NAMD command-line flags            0.77            1.30
Use the +giga namd command-line flag                  0.76            1.31
Use +giga, mixed eth0 & eth1, for charmrun            0.74            1.35
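The two columns above are reciprocals of each other, and a days-per-nsec figure can also be turned into wall-clock seconds per integration step. The sketch below shows the arithmetic, assuming the 2 fs inner timestep stated at the top of the page; the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def ns_per_day(days_per_ns):
    """Invert days per nanosecond into nanoseconds per day."""
    return 1.0 / days_per_ns


def seconds_per_step(days_per_ns, timestep_fs=2.0):
    """Wall-clock seconds spent on one integration step."""
    steps_per_ns = FS_PER_NS / timestep_fs  # 500,000 steps at 2 fs
    return days_per_ns * SECONDS_PER_DAY / steps_per_ns


# Timings taken from the tables above (16 cores, NAMD 2.6 UDP).
for label, timing in (("plain eth0 nodelist", 0.99),
                      ("+giga, mixed eth0 & eth1", 0.74)):
    print(f"{label}: {ns_per_day(timing):.2f} nsec/day, "
          f"{seconds_per_step(timing):.4f} s/step")
```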

The best command line so far looks like this (whether ++useip is the right choice is still in doubt):

charmrun /usr/local/namd/namd2 +p16 +giga ++useip equi.namd

Keeping the above line constant, try with different nodelist files and VLAN settings on the switch:

VLANs in use?   Nodelist form   Days per nanosecond
No              ,,,,, …         0.77
No              ,,,,, …         0.77
No              ,,,,, …         0.74
No              ,,,,, …         0.74

It appears that the absence of VLANs doesn't affect performance significantly, so go back to the two established VLANs (to keep traffic segregated).

Since we are here, do a quick test with all 32 cores to cheer up:

# charmrun /usr/local/namd/namd2 +p32 +giga ++useip equi.namd > LOG &
# days_per_nanosecond LOG
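The days_per_nanosecond helper used above is a local script that is not shown on this page. As a rough stand-in, the sketch below pulls the days/ns figure out of the "Info: Benchmark time:" lines that NAMD writes to its log. The exact line layout is an assumption, the sample lines are invented for illustration (they are not taken from the actual LOG), and the function names are hypothetical.

```python
import re

# Assumed layout of a NAMD benchmark line (numbers are illustrative only):
#   Info: Benchmark time: 32 CPUs 0.0864 s/step 0.5 days/ns 50 MB memory
BENCHMARK_RE = re.compile(
    r"Info: Benchmark time: (\d+) CPUs (\S+) s/step (\S+) days/ns"
)


def days_per_ns_values(lines):
    """Return the days/ns figure of every benchmark line, in log order."""
    values = []
    for line in lines:
        match = BENCHMARK_RE.search(line)
        if match:
            values.append(float(match.group(3)))
    return values


def last_days_per_ns(path):
    """Return the last days/ns figure in a log file, or None if absent."""
    with open(path) as handle:
        values = days_per_ns_values(handle)
    return values[-1] if values else None


# Invented sample lines standing in for a real log.
sample = [
    "Info: Benchmark time: 32 CPUs 0.0950 s/step 0.55 days/ns 50 MB memory",
    "ENERGY:     100 ...",
    "Info: Benchmark time: 32 CPUs 0.0864 s/step 0.5 days/ns 50 MB memory",
]
print(days_per_ns_values(sample))
```

Taking the last value rather than the first fits the approach of letting each run continue until the timing stabilises; on a real run one would call last_days_per_ns("LOG").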

Try some additional NAMD parameters: +idlepoll (no effect), +eth (no effect), +stacksize (no effect), +LBObjOnly (failed), +truecrash (no effect), +strategy USE_MESH/USE_GRID. To recap the results so far, the following table lists timings and efficiency against the number of cores. Efficiency is defined as [100 * (days/nsec for one core)] / [n * (days/nsec for n cores)].

Cores      Days per nsec   nsec per day   Efficiency (%)
1 core     7.05            0.14           100%
4 cores    2.07            0.48           85%
8 cores    1.19            0.84           74%
16 cores   0.74            1.35           59%
32 cores   0.50            2.00           44%
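The efficiency column can be recomputed from the days-per-nsec column with the definition given above. The following is a minimal sketch of that arithmetic using the table's own timings; the variable and function names are illustrative.

```python
# Days per nsec from the table above, keyed by number of cores.
days_per_ns = {1: 7.05, 4: 2.07, 8: 1.19, 16: 0.74, 32: 0.50}


def efficiency(cores, timings):
    """Parallel efficiency in percent, relative to the one-core run."""
    return 100 * timings[1] / (cores * timings[cores])


for cores in sorted(days_per_ns):
    speedup = days_per_ns[1] / days_per_ns[cores]
    print(f"{cores:2d} cores: speedup {speedup:5.2f}x, "
          f"efficiency {efficiency(cores, days_per_ns):5.1f}%")
```

For 16 cores the formula gives about 59.5%, which the table lists as 59%; the tabulated percentages match the formula when the fractional part is dropped rather than rounded.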

Figure: NAMD, 60,000 atoms, nsec per day & efficiency

Test two different NAMD executables as provided by the developers, using 16 and 32 cores (measurements in days per nanosecond):

Executable    16 cores   32 cores
Linux-i686    0.74       0.51
Linux-amd64   0.56       0.43

Using 16 cores with a one-by-one nodelist file (as described here): 0.53 days per nsec.

Start tuning the NAMD run itself. Check PMEprocessors first: run NAMD using the command line:

/usr/local/namd/charmrun /usr/local/namd/namd2 +p32 +giga ++useip equi.namd

and vary the number of PMEprocessors and whether they sit on the same node(s) or on different ones (measurements in days per nsec):

Setting                                     Days per nsec
PMEprocessors 32 (all cores)                0.44
PMEprocessors 16 (two per node)             0.43
PMEprocessors 4 (on four different nodes)   0.45

Try increasing 'stepsPerCycle' and 'pairlistdist' to improve parallel scaling. Try using '+asyncio +strategy USE_HYPERCUBE'. With all that, the improvement is rather small, ending at about 0.40 days per nanosecond.

about/benchmarks/namd60k.txt · Last modified: 2009/02/06 14:01 (external edit)