--- maintenance:may_17th_2018 2018/05/29 20:42
+++ maintenance:may_17th_2018 2020/02/14 16:11 current
@@ Line 1: / Line 1: @@
+====== Benchmarks : May 17th-20th, 2018 ======
+\\
+== The results first : System with 9418 atoms, 4 fs/step (HMR) ==
+\\
+<html><center></html>
+^ Box/GPU combination                                    ^ Performance in ns/day ^ cudaPME OK ? ^ ns/day without cudaPME ^
+| n0011 IBM server (32 cores on an AMD 6234 @ 2.4GHz, no cuda)   |  **70**         |     -       |           |
+| n0007 (Old Q6600 @ 2.4GHz box) + **GTX 1050**                 |  **140**        |     ✔       |           |
+| n0009 (i7-975, 4 cores @ 3.33 GHz) + **GTX 1050**             |  **200**        |     ✔       |           |
+| n0010 (AMD FX-8150, 8 cores @ 3.6 GHz) + **GTX 1050**         |  **200**        |     ✔       |  **150**  |
+| n0012 (AMD FX-8350, 8 cores @ 4.0 GHz) + **GTX 1050 Ti**      |  **230**        |     ✔       |  **200**  |
+| Scarlet (i7-6800K @ 3.4GHz + **GTX-1070**)                    |  **460**        |     ✔       |           |
+| n0013 (i9-9900K @ 3.6GHz + **GTX-1080**)                      |  **560**        |     ✔       |           |
+| n0014 (i9-9900K @ 3.6GHz + **RTX2070S**)                      |  **575**        |     ✔       |           |
+\\
+__**NOTE :** cudaPME is stable and without problems with NAMD v.2.13 with the nvidia 410.78 driver.__
+<html></center></html>
+\\
+\\
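The table is easier to compare once it is reduced to relative numbers. The sketch below is only illustrative: the ns/day values are copied from the table, while the dictionary layout and the helper names (`steps_per_second`, `relative_to`, `cudapme_gain`) are assumptions introduced here. It converts ns/day into MD steps per second at 4 fs/step, expresses each box as a fraction of Scarlet, and computes the gain from cudaPME where both numbers are reported.

```python
# ns/day values copied from the results table (9418 atoms, 4 fs/step).
NS_PER_DAY = {
    "n0011 (no cuda)": 70,
    "n0007 + GTX 1050": 140,
    "n0009 + GTX 1050": 200,
    "n0010 + GTX 1050": 200,
    "n0012 + GTX 1050 Ti": 230,
    "Scarlet + GTX 1070": 460,
    "n0013 + GTX 1080": 560,
    "n0014 + RTX 2070S": 575,
}
# ns/day with cudaPME switched off, only where the table reports it.
NS_PER_DAY_NO_CUDAPME = {
    "n0010 + GTX 1050": 150,
    "n0012 + GTX 1050 Ti": 200,
}

FS_PER_STEP = 4
SECONDS_PER_DAY = 86_400


def steps_per_second(ns_per_day, fs_per_step=FS_PER_STEP):
    """Convert a throughput in ns/day into MD steps per second."""
    fs_per_day = ns_per_day * 1_000_000  # 1 ns = 1e6 fs
    return fs_per_day / fs_per_step / SECONDS_PER_DAY


def relative_to(reference, table=NS_PER_DAY):
    """Throughput of every box as a fraction of the reference box."""
    return {box: ns / table[reference] for box, ns in table.items()}


def cudapme_gain():
    """Ratio of ns/day with cudaPME to ns/day without it."""
    return {box: NS_PER_DAY[box] / off
            for box, off in NS_PER_DAY_NO_CUDAPME.items()}


if __name__ == "__main__":
    rel = relative_to("Scarlet + GTX 1070")
    for box, ns in NS_PER_DAY.items():
        print(f"{box:22s} {ns:4d} ns/day "
              f"{steps_per_second(ns):6.0f} steps/s  {rel[box]:.2f} x Scarlet")
    for box, gain in cudapme_gain().items():
        print(f"{box:22s} cudaPME gain: {gain:.2f}x")
```

With these inputs the GTX 1050 and 1050 Ti boxes come out at roughly 0.30 to 0.50 of Scarlet's throughput, and cudaPME is worth about 1.33x on n0010 and 1.15x on n0012.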
+== And the story : ==
+\\
+I had been wondering how much slower NAMD would run on the much cheaper GTX 1050 (2 GB GDDR5) than on the GTX 1070.\\
+Got a card plus a disk and decided to test them in some boxes from the original cluster setup :
+\\
+------
+\\
+**Old box with Q6600 Kentsfield 2.4 GHz quad processors**
+  - First try was the old n0002 -> kept on crashing -> cannibalized it.
+  - Second attempt was with the old n0005 -> didn't even boot -> cannibalized it.
+  - Third attempt was with the old n0004 -> installed CentOS 7 -> installed nvidia drivers plus cuda -> installed NAMD (git) -> looks stable, left it running overnight.
+  - See benchmarks above.
+\\
+------
+\\
+**n0010 (AMD FX-8150, 8 cores @ 3.6 GHz)**
+  - Try to transfer disk plus GTX1050 to the old **n0010 (AMD FX-8150, 8 cores @ 3.6 GHz)** => Can't even get to BIOS ? => Keep trying ...
+  - The problem appears to be with the GTX1050. Changed the card and could boot. Unfortunately, the transferred disk doesn't work. Then, while trying to re-install CentOS 7, kept getting crashes. Reduced the CPU frequency to half => install completed :-( Looks like we will have to under-clock this box.
+  - Try to set it up at the reduced CPU frequency (the plan is to try going back to normal frequencies during the NAMD tests.)
+  - After the install, tried switching to the GTX1050 => still refuses to boot with it.
+  - Try updating the bloody BIOS ... :-( => Surprisingly, this worked ! Boots with the 1050 card.
+  - Install nvidia drivers plus cuda → install NAMD (git) → Start testing it (**after disabling the AMD turbo mode, which causes instability**).
+  - Long run to see if it is stable ...
+\\
+Can we move it meaningfully back to the cluster for the test ?
+  - Move it back to the cluster and connect RJ45
+  - Boot it => It appears that it got the correct IP address. I do not understand why (MAC address ?).
+  - Copy /usr/sbin/wulfd from n0011 to n0010. Run it from rc.local with ''echo READY > /.nodestatus'' followed by ''/usr/sbin/wulfd -m 10.0.0.1'' => Works (!?)
+  - Copy ssh keys (user + root) to .ssh/
+  - NFS mount /home **from** norma //and// export /home **to** norma.
+  - Start a job to see if it is stable.
+  - Getting namd segfaults in dmesg : ''namd2[10658]: segfault at fffffffe178f9f90 ip 0000000000a82ff7 sp 0000000017888280 error 7 in namd2[400000+c4f9000]''
+  - Try a slightly older namd executable (the one running on scarlet) ...
+  - Still crashing, this time with an interesting message :<file>
+[28802.391337] [Hardware Error]: Corrected error, no action required.
+[28802.391347] [Hardware Error]: CPU:6 (15:1:2) MC5_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x9c00000000020e0f
+[28802.391359] [Hardware Error]: Error Addr: 0x0000000000000021
+[28802.391366] [Hardware Error]: MC5 Error: AG payload array parity error.
+[28802.391374] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)
+</file>
+  - OK. Try changing the power supply unit => The segfaults disappeared ??? But now we are getting proper namd failures ''ERROR: Atoms moving too fast; simulation has become unstable (1 atoms on patch 17 pe 7).'' => Try adding ''twoAwayX                yes'' => Got a segfault again after ~8 hours. :-(
+  - Keep trying : is it a hardware fault or an issue between the hardware and namd ?
+  - Test it with CLN025 and a 2fs step (195 ns/day vs 330 on scarlet) => Segfault again.
+  - memtester (3G) => possibly ok, no problems after two cycles.
+  - ''tuned-adm profile latency-performance'' => segfault
+  - Go back to the original PSU and test lowering the CPU frequency => @ 1.8 GHz (140 ns/day) : stable. It looks like a CPU problem :-(
+  - Try 3.0 GHz + AMD turbo mode => segfaults.
+  - Try 3.0 GHz, disable AMD turbo mode => Looks stable, but artifacts from adaptive tempering are present.
+  - Try lowering the CPU voltage => unstable.
+  - Try 2.6 GHz (170 ns/day) => after 5.5 hours still looking ok. Continue ... => still getting problems from adaptive tempering.
+  - Try 3.2 GHz and set ''usePMECUDA no'' (150 ns/day) => Looks stable. Leave it here.
+  - The question remains : are these problems inherent to using the GTX1050 instead of GTX1070 ?
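
The kernel messages quoted above fall into two kinds: ''namd2'' segfault lines and ''[Hardware Error]'' machine-check lines. Counting them per test configuration could make the hardware-vs-software question easier to follow than reading the log by eye. The sketch below is a minimal illustration, assuming the `dmesg` output has been saved as text; the regular expressions are modelled only on the two samples on this page, and the function name `triage_dmesg` is hypothetical.

```python
import re
from collections import Counter

# Patterns modelled on the two kinds of kernel messages quoted on this page.
# They are assumptions about the general line format, not a full parser.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+)")
MCE_CPU_RE = re.compile(r"\[Hardware Error\]: CPU:(?P<cpu>\d+)")


def triage_dmesg(text):
    """Count segfaults per program and machine-check reports per CPU."""
    segfaults = Counter()
    mce_cpus = Counter()
    for line in text.splitlines():
        match = SEGFAULT_RE.search(line)
        if match:
            segfaults[match.group("prog")] += 1
            continue
        match = MCE_CPU_RE.search(line)
        if match:
            mce_cpus[int(match.group("cpu"))] += 1
    return segfaults, mce_cpus


# The lines below are copied verbatim from the notes above.
SAMPLE = """\
namd2[10658]: segfault at fffffffe178f9f90 ip 0000000000a82ff7 sp 0000000017888280 error 7 in namd2[400000+c4f9000]
[28802.391337] [Hardware Error]: Corrected error, no action required.
[28802.391347] [Hardware Error]: CPU:6 (15:1:2) MC5_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x9c00000000020e0f
[28802.391359] [Hardware Error]: Error Addr: 0x0000000000000021
[28802.391366] [Hardware Error]: MC5 Error: AG payload array parity error.
[28802.391374] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)
"""

if __name__ == "__main__":
    segfaults, mce_cpus = triage_dmesg(SAMPLE)
    print("segfaults per program:", dict(segfaults))
    print("machine-check reports per CPU:", dict(mce_cpus))
```

Running this on saved `dmesg` output after each hardware change (PSU swap, clock change, turbo on or off) would give one pair of counts per configuration to compare.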
+\\
+\\
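
The clock-frequency tests on n0010 also allow a rough look at how throughput follows the CPU clock. The sketch below is a back-of-the-envelope calculation, not a measurement: the 1.8 GHz and 2.6 GHz figures come from the notes, the 3.6 GHz entry **assumes** that the 200 ns/day in the results table was taken at the nominal clock, and the name `scaling_efficiency` is hypothetical.

```python
# (clock in GHz, ns/day) for n0010 with cudaPME enabled.
BASELINE = (1.8, 140)   # reported stable in the notes
MEASURED = {
    2.6: 170,           # from the notes
    3.6: 200,           # ASSUMPTION: table value taken at the nominal clock
}


def scaling_efficiency(ghz, ns_per_day, baseline=BASELINE):
    """Throughput gain divided by clock gain, relative to the baseline.

    1.0 means ns/day grows in proportion to the CPU clock; values below
    1.0 mean the extra clock is only partly converted into throughput.
    """
    base_ghz, base_ns = baseline
    return (ns_per_day / base_ns) / (ghz / base_ghz)


if __name__ == "__main__":
    for ghz, ns in sorted(MEASURED.items()):
        eff = scaling_efficiency(ghz, ns)
        print(f"{ghz:.1f} GHz: {ns} ns/day, efficiency {eff:.2f}")
```

Under these assumptions the efficiency is about 0.84 at 2.6 GHz and about 0.71 at 3.6 GHz, so ns/day rises more slowly than the clock. That would be consistent with the GPU setting much of the pace, although two data points cannot settle it.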