maintenance:may_17th_2018 2018/05/29 20:42 maintenance:may_17th_2018 2018/08/03 23:11 current
 +====== Benchmarks : May 17th-20th, 2018 ======
 +
 +
 +\\
 +
 +== The results first : System with 9418 atoms, 4 fs/step (HMR) ==
 +
 +^ Box/GPU combination                                    ^ Performance in ns/day ^ cudaPME OK ? ^ ns/day without cudaPME ^ Local NAMD ? ^
 +| n0011 IBM server (32 cores on an AMD 6234 @ 2.4GHz, no CUDA)  |  **70**        |    -      |          |        |
 +| Old Q6600 @ 2.4GHz box + **GTX 1050**                        |  **140**        |    ✔      |          |        |
 +| n0009 (i7-975, 4 cores @ 3.33 GHz) + **GTX 1050**            |  **210**        |    ✔      |          |    ✔  |
 +| n0010 (AMD FX-8150, 8 cores @ 3.6 GHz) + **GTX 1050**        |    200          |    No      |  **150**  |        |
 +| n0012 (AMD FX-8350, 8 cores @ 4.0 GHz) + **GTX 1050 Ti**      |    240          |    No      |  **200**  |    ✔  |
 +| Scarlet (i7-6800K @ 3.4GHz + **GTX-1070**)                    |  **450**        |    ✔      |          |        |
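To put the table in perspective, here is a minimal Python sketch that expresses each box as a fraction of Scarlet's throughput. The ns/day values are copied from the table; for n0010 and n0012 it assumes that the bold "without cudaPME" figure is the usable one, and the dictionary keys and the helper name `relative_to` are only illustrative.

```python
# ns/day values copied from the results table above.
# Assumption: for n0010 and n0012 the bold "without cudaPME" value is the usable one.
results = {
    "n0011 (32 cores, no CUDA)": 70,
    "Q6600 + GTX 1050": 140,
    "n0009 i7-975 + GTX 1050": 210,
    "n0010 FX-8150 + GTX 1050 (no cudaPME)": 150,
    "n0012 FX-8350 + GTX 1050 Ti (no cudaPME)": 200,
    "Scarlet i7-6800K + GTX 1070": 450,
}


def relative_to(reference, table):
    """Return each box's throughput as a fraction of the reference box."""
    ref = table[reference]
    return {name: value / ref for name, value in table.items()}


fractions = relative_to("Scarlet i7-6800K + GTX 1070", results)

# Print slowest first, with the percentage of Scarlet's ns/day.
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1]):
    print(f"{name:42s} {results[name]:4d} ns/day  {frac:5.0%} of Scarlet")
```

With these numbers the best GTX 1050 box (n0009) reaches a bit under half of Scarlet's throughput, but these are single runs on one 9418-atom system, so the ratios are indicative only.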
 +
 +
 +\\
 +\\
 +
 +== And the story : ==
 +
 +\\
 +
 +I had been wondering how much slower NAMD would run on the much cheaper GTX 1050 (2 GB GDDR5) than on the GTX 1070.\\
 +Got a card plus a disk and decided to test them with some boxes from the original cluster setup :
 +
 +\\
 +------
 +\\
 +
 +**Old box with Q6600 Kentsfield 2.4 GHz quad processors**
 +
 +  - First try was the old n0002 -> kept on crashing -> cannibalized it.
 +  - Second attempt was with the old n0005 -> didn't even boot -> cannibalized it.
 +  - Third attempt was with the old n0004 -> installed CentOS 7 -> installed nvidia drivers plus cuda -> installed NAMD (git) -> Looks stable, let it run overnight.
 +  - See benchmarks above.
 +
 +\\
 +------
 +\\
 +
 +
 +**n0010 (AMD FX-8150, 8 cores @ 3.6 GHz)**
 +
 +  - Try to transfer disk plus GTX1050 to the old **n0010 (AMD FX-8150, 8 cores @ 3.6 GHz)** => Can't even get to BIOS ? => Keep trying ...
 +  - The problem appears to be with the GTX1050. Changed the card and could boot. Unfortunately, the transferred disk doesn't work. Then, while trying to re-install CentOS 7, kept getting crashes. Reduced the CPU frequency to half => install completed :-( Looks like we will have to under-clock this box.
 +  - Try to set it up at the reduced CPU frequency (the plan is to try going back to normal frequencies during the NAMD tests.)
 +  - Tried after install to switch to GTX1050 => still refuses to boot with the GTX1050.
 +  - Try updating the bloody BIOS ... :-( => Surprisingly, this worked ! Boots with the 1050 card.
 +  - Install nvidia drivers plus cuda → install NAMD (git) → Start testing it (**after disabling the AMD turbo mode, it causes instability**).
 +  - Long run to see if it is stable ...
 +
 +\\
 +
 +Can we move it meaningfully back to the cluster for the test ?
 +
 +  - Move it back to the cluster and connect RJ45
 +  - Boot it => It appears that it got the correct IP address. I do not understand why (MAC address ?).
 +  - Copy ''/usr/sbin/wulfd'' from n0011 to n0010. Run it from rc.local with ''echo READY > /.nodestatus'' and ''/usr/sbin/wulfd -m 10.0.0.1'' => Works (!?)
 +  - Copy ssh keys (user + root) to .ssh/
 +  - NFS mount /home **from** norma //and// export /home **to** norma.
 +  - Start a job to see if it is stable.
 +  - Getting namd segfaults in dmesg : ''namd2[10658]: segfault at fffffffe178f9f90 ip 0000000000a82ff7 sp 0000000017888280 error 7 in namd2[400000+c4f9000]''
 +  - Try a slightly older namd executable (the one running on scarlet) ...
 +  - Still crashing, this time with an interesting message :<file>
 +[28802.391337] [Hardware Error]: Corrected error, no action required.
 +[28802.391347] [Hardware Error]: CPU:6 (15:1:2) MC5_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x9c00000000020e0f
 +[28802.391359] [Hardware Error]: Error Addr: 0x0000000000000021
 +[28802.391366] [Hardware Error]: MC5 Error: AG payload array parity error.
 +[28802.391374] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)
 +</file>
 +  - OK. Try changing the power supply unit => The segfaults disappeared ??? But now we are getting proper namd failures ''ERROR: Atoms moving too fast; simulation has become unstable (1 atoms on patch 17 pe 7).'' => Try adding ''twoAwayX                yes'' => Got a segfault again after ~8 hours. :-(
 +  - Keep trying : is it a hardware fault or an issue between the hardware and namd ?
 +  - Test it with CLN025 and a 2fs step (195 ns/day vs 330 on scarlet) => Segfault again.
 +  - memtester (3G) => possibly ok, no problems after two cycles.
 +  - ''tuned-adm profile latency-performance'' => segfault
 +  - Go back to original PSU and test lowering the CPU frequency => @ 1.8 GHz (140 ns/day) : stable, it looks like a CPU problem :-(
 +  - Try 3.0 GHz + AMD turbo mode => segfaults.
 +  - Try 3.0 GHz, disable AMD turbo mode => Looks stable, but artifacts from adaptive tempering are present.
 +  - Try lowering the CPU voltage => unstable.
 +  - Try 2.6 GHz (170 ns/day) => after 5.5 hours still looking ok. Continue ... => still getting problems from adaptive tempering.
 +  - Try 3.2 GHz and set ''usePMECUDA no'' (150 ns/day) => Looks stable. Leave it here.
 +  - The question remains : are these problems inherent to using the GTX1050 instead of GTX1070 ?
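When chasing this kind of instability it helps to pull the relevant events out of the kernel log automatically. The following is a minimal Python sketch, not a tool that was used here: it scans dmesg-style lines for the two patterns quoted above (a ''namd2'' segfault and an ''MCx_STATUS'' machine-check line) and decodes the top flag bits of the status word. The bit positions are the architectural MCA status bits (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57); the regular expressions and the function names `decode_status` and `scan` are assumptions made for this example and only cover the line formats shown in these notes.

```python
import re

# Architectural flag bits of a machine-check MCi_STATUS register.
STATUS_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}

# Patterns written for the two dmesg line formats quoted in these notes.
SEGFAULT_RE = re.compile(r"(\w+)\[(\d+)\]: segfault at ([0-9a-f]+)")
MCA_RE = re.compile(r"MC(\d+)_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)")


def decode_status(value):
    """Map each flag name to True/False for a 64-bit status value."""
    return {name: bool((value >> bit) & 1) for name, bit in STATUS_BITS.items()}


def scan(lines):
    """Return a list of (kind, id, detail) tuples for segfaults and MCA events."""
    events = []
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if m:
            # (kind, program name, pid)
            events.append(("segfault", m.group(1), int(m.group(2))))
            continue
        m = MCA_RE.search(line)
        if m:
            # (kind, bank number, decoded flags)
            events.append(("mce", int(m.group(1)), decode_status(int(m.group(2), 16))))
    return events


# The two lines below are copied from the log excerpts above.
sample = [
    "namd2[10658]: segfault at fffffffe178f9f90 ip 0000000000a82ff7 "
    "sp 0000000017888280 error 7 in namd2[400000+c4f9000]",
    "[28802.391347] [Hardware Error]: CPU:6 (15:1:2) "
    "MC5_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x9c00000000020e0f",
]

events = scan(sample)
for event in events:
    print(event)
```

For the status word quoted above, UC and PCC decode as clear, which agrees with the kernel's own "Corrected error" annotation. In practice the lines would come from something like the output of ''dmesg'' rather than a hard-coded list.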
 +
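As a rough check on how much of the CPU clock actually turns into throughput on n0010, here is a small Python sketch using the two clock/performance pairs quoted in the list above (1.8 GHz at 140 ns/day and 2.6 GHz at 170 ns/day). It assumes the two runs differed only in clock speed, which the notes do not guarantee; the function name `scaling_efficiency` is just illustrative.

```python
# (CPU clock in GHz, ns/day) pairs quoted in the n0010 notes above.
# Assumption: the two runs differ only in CPU clock.
points = [(1.8, 140), (2.6, 170)]


def scaling_efficiency(low, high):
    """Return (relative clock gain, relative performance gain, their ratio)."""
    (f0, p0), (f1, p1) = low, high
    clock_gain = f1 / f0 - 1
    perf_gain = p1 / p0 - 1
    return clock_gain, perf_gain, perf_gain / clock_gain


clock_gain, perf_gain, efficiency = scaling_efficiency(*points)
print(f"clock +{clock_gain:.0%}, throughput +{perf_gain:.0%}, "
      f"efficiency {efficiency:.2f}")
```

With only two data points this is no more than a hint, but it suggests that roughly half of the extra clock speed shows up as extra ns/day, so under-clocking this box costs less than the raw frequency ratio would imply.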
 +
 +
 +\\
 +\\