May 17th-20th, 2018

The results first: system with 9418 atoms, 4 fs/step (HMR)

Box/GPU combination                                        Performance (ns/day)
IBM server (32 cores, AMD 6234 @ 2.4 GHz, no CUDA)                           70
Old Q6600 @ 2.4 GHz box + GTX 1050                                          135
n0010 (AMD FX-8150, 8 cores @ 3.6 GHz) + GTX 1050                           195
Scarlet (i7-6800K @ 3.4 GHz + GTX 1070)                                     445
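For reference, an ns/day figure together with the 4 fs (HMR) timestep gives the raw step rate. A quick sketch, using scarlet's 445 ns/day from the table above:

```shell
# steps/s = (ns/day * 1e6 fs/ns) / (timestep_fs * 86400 s/day)
awk -v nsday=445 -v dt=4 'BEGIN { printf "%.0f steps/s\n", nsday * 1e6 / (dt * 86400) }'
```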

And the story:

Had been wondering how much slower NAMD would run with the much cheaper GTX 1050 (2 GB GDDR5) instead of the GTX 1070.
Got a card plus a disk and decided to test it with some boxes from the original cluster setup:

Old box with a Q6600 (Kentsfield) quad-core @ 2.4 GHz

  1. First try was the old n0002 → kept on crashing → cannibalized it.
  2. Second attempt was with the old n0005 → didn't even boot → cannibalized it.
  3. Third attempt was with the old n0004 → installed CentOS 7 → installed NVIDIA drivers plus CUDA → installed NAMD (git) → looks stable, let it run o/n.
  4. See benchmarks above.
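A sketch of what the CentOS 7 + NVIDIA + CUDA install in step 3 amounts to; the repo URL and package names are assumptions (NVIDIA's published rhel7 CUDA repo), not a record of the exact commands used on n0004:

```shell
# Assumed: stock CentOS 7, root shell. Repo/package names are assumptions.
yum -y install epel-release yum-utils kernel-devel dkms
yum-config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
yum -y install cuda-drivers cuda     # driver + toolkit
reboot                               # load the nvidia kernel module
nvidia-smi                           # after reboot: the GTX 1050 should be listed
```

NAMD itself ("git" above) is built separately from its development source tree.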

n0010 (AMD FX-8150, 8 cores @ 3.6 GHz)

  1. Tried to transfer the disk plus the GTX 1050 to the old n0010 (AMD FX-8150, 8 cores @ 3.6 GHz) ⇒ Can't even get to the BIOS? ⇒ Keep trying …
  2. The problem appears to be with the GTX 1050. Changed the card and could boot. Unfortunately, the transferred disk doesn't work, and while trying to re-install CentOS 7 we kept getting continuous crashes. Reduced the CPU frequency to half ⇒ install completed :-( Looks like we will have to under-clock this box.
  3. Set it up at the reduced CPU frequency (the plan is to try going back to normal frequencies during the NAMD tests).
  4. Tried after the install to switch back to the GTX 1050 ⇒ still refuses to boot with it.
  5. Try updating the bloody BIOS … :-( ⇒ Surprisingly, this worked! Boots with the 1050 card.
  6. Installed NVIDIA drivers plus CUDA → installed NAMD (git) → started testing it (after disabling AMD turbo mode, which causes instability).
  7. Long run to see if it is stable …
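The under-clocking and turbo-disabling in steps 2–6 can also be done from Linux rather than the BIOS. A sketch, assuming the standard acpi-cpufreq sysfs knobs are available for this board (they may not be, in which case the BIOS is the only option); the 3.0 GHz cap is illustrative:

```shell
# Disable AMD Turbo Core (boost) via the cpufreq sysfs knob:
echo 0 > /sys/devices/system/cpu/cpufreq/boost
# Cap the FX-8150 below its stock 3.6 GHz (value illustrative):
cpupower frequency-set --max 3.0GHz
# Verify the current limits and boost state:
cpupower frequency-info
```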

Can we move it meaningfully back to the cluster for the test ?

  1. Move it back to the cluster and connect RJ45
  2. Boot it ⇒ It appears that it got the correct IP address. I do not understand why (MAC address?).
  3. Copy /usr/sbin/wulfd from n0011 to n0010. Run "echo READY > /.nodestatus" and "/usr/sbin/wulfd -m" from rc.local ⇒ Works (!?)
  4. Copy ssh keys (user + root) to .ssh/
  5. NFS mount /home from norma _and_ export /home to norma.
  6. Start a job to see if it is stable.
  7. Getting namd segfaults in dmesg : namd2[10658]: segfault at fffffffe178f9f90 ip 0000000000a82ff7 sp 0000000017888280 error 7 in namd2[400000+c4f9000]
  8. Try a slightly older namd executable (the one running on scarlet) …
  9. Still crashing, this time with an interesting message:
    [28802.391337] [Hardware Error]: Corrected error, no action required.
    [28802.391347] [Hardware Error]: CPU:6 (15:1:2) MC5_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x9c00000000020e0f
    [28802.391359] [Hardware Error]: Error Addr: 0x0000000000000021
    [28802.391366] [Hardware Error]: MC5 Error: AG payload array parity error.
    [28802.391374] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)
  10. OK. Try changing the power supply unit ⇒ The segfaults disappeared??? But now we are getting proper NAMD failures: "ERROR: Atoms moving too fast; simulation has become unstable (1 atoms on patch 17 pe 7)". ⇒ Try adding "twoAwayX yes" ⇒ Got a segfault again after ~8 hours. :-(
  11. Keep trying: is it a hardware fault or an issue between the hardware and NAMD?
  12. Test it with CLN025 and a 2 fs step (195 ns/day vs 330 on scarlet) ⇒ Segfault again.
  13. memtester (3G) ⇒ possibly ok, no problems after two cycles.
  14. tuned-adm profile latency-performance ⇒ segfault
  15. Go back to the original PSU and test lowering the CPU frequency ⇒ @ 1.8 GHz (140 ns/day): stable; it looks like a CPU problem :-(
  16. Try 3.0 GHz + AMD turbo mode ⇒ segfaults.
  17. Try 3.0 GHz, disable AMD turbo mode ⇒ Looks stable, but artifacts from adaptive tempering are present.
  18. Try lowering the CPU voltage ⇒ unstable.
  19. Try 2.6 GHz (170 ns/day) ⇒ after 5.5 hours still looking OK. Continue … ⇒ still getting problems from adaptive tempering.
  20. Try 3.2 GHz and set usePMECUDA no (150 ns/day) ⇒ Looks stable. Leave it here.
  21. The question remains : are these problems inherent to using the GTX1050 instead of GTX1070 ?
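The two NAMD-side workarounds tried above (steps 10 and 20) correspond to these configuration lines; a fragment only, not a complete config:

```
# NAMD config fragment for the workarounds tried above
twoAwayX    yes   # force a finer patch decomposition along x (step 10)
usePMECUDA  no    # keep the PME work off the GPU (step 20)
```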

maintenance/may_17th_2018.txt · Last modified: 2018/05/29 20:42 by glykos