This is a 99,744-atom system with a PME grid of 112x108x108 (the script is included below). For all of the tests that follow, we used the NAMD 2.6 amd64 executable as provided by the NAMD developers.
NAMD script used for these tests
# Input files
#
structure ionized.psf
coordinates heat_out.coor
velocities heat_out.vel
extendedSystem heat_out.xsc
parameters par_all27_prot_na.inp
paraTypeCharmm on
#
# Output files & writing frequency for DCD
# and restart files
#
outputname output/equi_out
binaryoutput off
restartname output/restart
restartfreq 1000
binaryrestart yes
dcdFile output/equi_out.dcd
dcdFreq 200
DCDunitcell on
#
# Frequencies for logs and the xst file
#
outputEnergies 20
outputTiming 200
xstFreq 200
#
# Timestep & friends
#
timestep 2.0
stepsPerCycle 8
nonBondedFreq 2
fullElectFrequency 4
#
# Simulation space partitioning
#
switching on
switchDist 10
cutoff 12
pairlistdist 13.5
#
# Basic dynamics
#
COMmotion no
dielectric 1.0
exclude scaled1-4
1-4scaling 1.0
rigidbonds all
#
# Particle Mesh Ewald parameters.
#
Pme on
PmeGridsizeX 112 # <===== CHANGE ME
PmeGridsizeY 108 # <===== CHANGE ME
PmeGridsizeZ 108 # <===== CHANGE ME
# Pmeprocessors 8
#
# Periodic boundary things
#
wrapWater on
wrapNearest on
wrapAll on
#
# Langevin dynamics parameters
#
langevin on
langevinDamping 1
langevinTemp 298 # <===== Check me
langevinHydrogen on
langevinPiston on
langevinPistonTarget 1.01325
langevinPistonPeriod 200
langevinPistonDecay 100
langevinPistonTemp 298 # <===== Check me
useGroupPressure yes
firsttimestep 26000 # <===== CHANGE ME
run 25000000 ;# <===== CHANGE ME
              | 1 core | 4 cores | 8 cores | 16 cores | 32 cores |
Days per nsec |   8.35 |    2.50 |    1.40 |     0.89 |     0.60 |
nsec per day  |   0.12 |    0.40 |    0.71 |     1.12 |     1.66 |
Efficiency    |   100% |     83% |     75% |      59% |      44% |
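The efficiency row is simply the parallel efficiency, i.e. the speedup relative to the single-core run divided by the number of cores. A minimal Python sketch of that arithmetic, with the days-per-nsec values copied from the table above:

# Speedup and parallel efficiency from the "days per nsec" row above.
days_per_nsec = {1: 8.35, 4: 2.50, 8: 1.40, 16: 0.89, 32: 0.60}
baseline = days_per_nsec[1]
for cores, days in sorted(days_per_nsec.items()):
    speedup = baseline / days        # times faster than the 1-core run
    efficiency = speedup / cores     # fraction of ideal linear scaling
    print(f"{cores:2d} cores: {1.0 / days:4.2f} nsec/day, "
          f"speedup {speedup:5.2f}, efficiency {efficiency:4.0%}")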
Now try the following: instead of filling up all four cores of each node, distribute the work across different nodes (applicable only if fewer than 32 cores are needed). The following .nodelist file is one solution:
Modified .nodelist file
group main
host 10.0.1.11
host 10.0.1.12
host 10.0.1.13
host 10.0.1.14
host 10.0.1.15
host 10.0.1.16
host 10.0.1.17
host 10.0.1.18
host 10.0.0.11
host 10.0.0.12
host 10.0.0.13
host 10.0.0.14
host 10.0.0.15
host 10.0.0.16
host 10.0.0.17
host 10.0.0.18
host 10.0.1.11
host 10.0.1.12
host 10.0.1.13
host 10.0.1.14
host 10.0.1.15
host 10.0.1.16
host 10.0.1.17
host 10.0.1.18
host 10.0.0.11
host 10.0.0.12
host 10.0.0.13
host 10.0.0.14
host 10.0.0.15
host 10.0.0.16
host 10.0.0.17
host 10.0.0.18
Using this nodelist file and repeating the measurements, we have:
              | 1 core | 4 cores | 8 cores | 16 cores | 32 cores |
Days per nsec |   8.35 |    2.25 |    1.19 |     0.84 |     0.65 |
nsec per day  |   0.12 |    0.44 |    0.84 |     1.19 |     1.54 |
Efficiency    |   100% |     93% |     88% |      62% |      40% |
which means that for anything up to and including 16 cores, you are better off with the nodelist file shown above.
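Charmrun assigns processes to hosts in the order the host lines appear in the nodelist file (picked up with something along the lines of charmrun ++nodelist .nodelist +pN namd2 <configfile>), so the placement is controlled entirely by that ordering. A round-robin nodelist like the one above is also easy to generate instead of maintaining by hand; a minimal Python sketch, using the addresses listed above (the output file name and the 32-entry length are just this example's assumptions):

# Write a round-robin Charm++ nodelist: one process per interface/node
# before wrapping around (reproduces the file shown above).
nodes = [f"10.0.1.{i}" for i in range(11, 19)] + \
        [f"10.0.0.{i}" for i in range(11, 19)]

with open(".nodelist", "w") as f:          # file name is an assumption
    f.write("group main\n")
    for rank in range(32):                 # enough entries for 32 cores
        f.write(f"host {nodes[rank % len(nodes)]}\n")

Strictly speaking the repetition is not required, since charmrun cycles through the host lines again when +p exceeds their number; listing all 32 entries just makes the intended placement explicit.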
Finally, we tried filling pairs of cores before moving on to the next node, using a nodelist of the form:
group main
host 10.0.0.11
host 10.0.1.11
host 10.0.0.12
host 10.0.1.12
host 10.0.0.13
host 10.0.1.13
host 10.0.0.14
host 10.0.1.14
host 10.0.0.15
host 10.0.1.15
host 10.0.0.16
host 10.0.1.16
host 10.0.0.17
host 10.0.1.17
host 10.0.0.18
host 10.0.1.18
host 10.0.0.11
host 10.0.1.11
…
This ordering gave worse scaling than the previously mentioned solution.
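For completeness, the same kind of generator gives this "pairs first" ordering if the two address groups are interleaved; a minimal sketch along the same lines (again, the output file name is an assumption):

# Interleave the two address groups so consecutive ranks are paired on
# the same node, per the ordering shown above (which scaled worse here).
group_a = [f"10.0.0.{i}" for i in range(11, 19)]
group_b = [f"10.0.1.{i}" for i in range(11, 19)]
nodes = [ip for pair in zip(group_a, group_b) for ip in pair]

with open(".nodelist", "w") as f:
    f.write("group main\n")
    for rank in range(32):
        f.write(f"host {nodes[rank % len(nodes)]}\n")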