maintenance:may_17th_2018 2018/05/29 20:42 | maintenance:may_17th_2018 2020/02/14 16:11 current | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Benchmarks : May 17th-20th, 2018 ====== | ||
+ | |||
+ | |||
+ | \\ | ||
+ | |||
+ | |||
+ | |||
+ | == The results first : System with 9418 atoms, 4 fs/step (HMR) == | ||
+ | |||
+ | \\ | ||
+ | |||
+ | <html><center></html> | ||
+ | |||
+ | ^ Box/GPU combination ^ Performance in ns/day ^ cudaPME OK ? ^ ns/day without cudaPME ^ | ||
+ | | n0011 IBM server (32 cores on an AMD 6234 @ 2.4 GHz, no CUDA) | **70** | - | | | ||
+ | | n0007 (Old Q6600 @ 2.4GHz box) + **GTX 1050** | **140** | ✔ | | | ||
+ | | n0009 (i7-975, 4 cores @ 3.33 GHz) + **GTX 1050** | **200** | ✔ | | | ||
+ | | n0010 (AMD FX-8150, 8 cores @ 3.6 GHz) + **GTX 1050** | **200** | ✔ | **150** | | ||
+ | | n0012 (AMD FX-8350, 8 cores @ 4.0 GHz) + **GTX 1050 Ti** | **230** | ✔ | **200** | | ||
+ | | Scarlet (i7-6800K @ 3.4GHz + **GTX-1070**) | **460** | ✔ | | | ||
+ | | n0013 (i9-9900K @ 3.6GHz + **GTX-1080**) | **560** | ✔ | | | ||
+ | | n0014 (i9-9900K @ 3.6GHz + **RTX2070S**) | **575** | ✔ | | | ||
+ | \\ | ||
+ | |||
+ | __**NOTE :** cudaPME is stable and problem-free with NAMD v2.13 and the NVIDIA 410.78 driver.__ | ||
+ | |||
+ | |||
+ | <html></center></html> | ||
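To put the table in perspective, the sketch below re-expresses each ns/day figure relative to Scarlet (the GTX 1070 box) and converts it to MD steps per second at the 4 fs/step stated in the heading. The numbers are copied from the table; the shortened machine labels and the function names are illustrative only and not part of the original notes.

```python
# ns/day values copied from the results table (cudaPME enabled where available).
NS_PER_DAY = {
    "n0011 (32-core AMD 6234, no CUDA)": 70,
    "n0007 (Q6600 + GTX 1050)": 140,
    "n0009 (i7-975 + GTX 1050)": 200,
    "n0010 (FX-8150 + GTX 1050)": 200,
    "n0012 (FX-8350 + GTX 1050 Ti)": 230,
    "Scarlet (i7-6800K + GTX 1070)": 460,
    "n0013 (i9-9900K + GTX 1080)": 560,
    "n0014 (i9-9900K + RTX 2070S)": 575,
}

TIMESTEP_FS = 4.0            # 4 fs/step (HMR), from the section heading
SECONDS_PER_DAY = 86_400
REFERENCE = "Scarlet (i7-6800K + GTX 1070)"   # baseline chosen for illustration


def steps_per_second(ns_per_day, timestep_fs=TIMESTEP_FS):
    """Convert a throughput in ns/day to MD steps per wall-clock second."""
    steps_per_day = ns_per_day * 1e6 / timestep_fs   # 1 ns = 1e6 fs
    return steps_per_day / SECONDS_PER_DAY


def relative_to_reference(table=NS_PER_DAY, reference=REFERENCE):
    """Return each machine's throughput as a fraction of the reference box."""
    return {name: value / table[reference] for name, value in table.items()}


if __name__ == "__main__":
    ratios = relative_to_reference()
    for name, value in NS_PER_DAY.items():
        print(f"{name:36s} {value:4d} ns/day  "
              f"{ratios[name]:5.2f}x  {steps_per_second(value):7.0f} steps/s")
```

Read this way, the GTX 1050 boxes reach roughly 30% to 45% of the GTX 1070 throughput, and the GTX 1050 Ti box reaches half of it.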
+ | |||
+ | \\ | ||
+ | \\ | ||
+ | |||
+ | == And the story : == | ||
+ | |||
+ | \\ | ||
+ | |||
+ | Had been wondering how much slower NAMD would run on the much cheaper GTX 1050 (2 GB GDDR5) than on a GTX 1070.\\ | ||
+ | Got a card plus a disk and decided to test them in some boxes from the original cluster setup : | ||
+ | |||
+ | \\ | ||
+ | ------ | ||
+ | \\ | ||
+ | |||
+ | **Old box with Q6600 Kentsfield 2.4 GHz quad processors** | ||
+ | |||
+ | - First try was the old n0002 -> kept on crashing -> cannibalized it. | ||
+ | - Second attempt was with the old n0005 -> didn't even boot -> cannibalized it. | ||
+ | - Third attempt was with the old n0004 -> installed CentOS 7 -> installed NVIDIA drivers plus CUDA -> installed NAMD (git) -> looks stable, left it running overnight. | ||
+ | - See benchmarks above. | ||
+ | |||
+ | \\ | ||
+ | ------ | ||
+ | \\ | ||
+ | |||
+ | |||
+ | **n0010 (AMD FX-8150, 8 cores @ 3.6 GHz)** | ||
+ | |||
+ | - Try to transfer disk plus GTX1050 to the old **n0010 (AMD FX-8150, 8 cores @ 3.6 GHz)** => Can't even get to BIOS ? => Keep trying ... | ||
+ | - The problem appears to be with the GTX1050. Changed the card and could boot. Unfortunately, the transferred disk doesn't work. Then, while trying to re-install CentOS 7, got continuous crashes. Halved the CPU frequency => install completed :-( Looks like we will have to under-clock this box. | ||
+ | - Try to set it up at the reduced CPU frequency (the plan is to try going back to normal frequencies during the NAMD tests.) | ||
+ | - Tried after install to switch to GTX1050 => still refuses to boot with the GTX1050. | ||
+ | - Try updating the bloody BIOS ... :-( => Surprisingly, this worked ! Boots with the 1050 card. | ||
+ | - Install NVIDIA drivers plus CUDA → install NAMD (git) → start testing it (**after disabling the AMD turbo mode, which causes instability**). | ||
+ | - Long run to see if it is stable ... | ||
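Since the turbo mode turned out to matter for stability, it can help to confirm its state from the running system before a long run. The notes do not say how turbo was disabled (the BIOS is just as likely), so the following is only a sketch: it assumes a Linux cpufreq driver that exposes the global `boost` attribute under `/sys/devices/system/cpu/cpufreq/`, which is absent on drivers without boost support. The function name is illustrative.

```python
from pathlib import Path

# Global boost switch exposed by some cpufreq drivers (e.g. acpi-cpufreq).
# Assumption: this file exists on the box; it is missing if the driver
# does not support frequency boost.
BOOST_FILE = Path("/sys/devices/system/cpu/cpufreq/boost")


def boost_state(path=BOOST_FILE):
    """Return 'enabled', 'disabled', or 'unknown' for the CPU boost setting."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable: the driver may not expose boost at all.
        return "unknown"
    return {"1": "enabled", "0": "disabled"}.get(value, "unknown")


if __name__ == "__main__":
    print(f"CPU boost is {boost_state()}")
```

An "unknown" result only means the kernel does not report the setting; in that case the BIOS screen remains the place to check.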
+ | |||
+ | \\ | ||
+ | |||
+ | Can we move it meaningfully back to the cluster for the test ? | ||
+ | |||
+ | - Move it back to the cluster and connect RJ45 | ||
+ | - Boot it => It appears that it got the correct IP address. I do not understand why (MAC address ?). | ||
+ | - Copy /usr/sbin/wulfd from n0011 to n0010. Run it from rc.local with ''echo READY > /.nodestatus'' followed by ''/usr/sbin/wulfd -m 10.0.0.1'' => Works (!?) | ||
+ | - Copy ssh keys (user + root) to .ssh/ | ||
+ | - NFS mount /home **from** norma //and// export /home **to** norma. | ||
+ | - Start a job to see if it is stable. | ||
+ | - Getting namd segfaults in dmesg : ''namd2[10658]: segfault at fffffffe178f9f90 ip 0000000000a82ff7 sp 0000000017888280 error 7 in namd2[400000+c4f9000]'' | ||
+ | - Try a slightly older namd executable (the one running on scarlet) ... | ||
+ | - Still crashing, this time with an interesting message :<file> | ||
+ | [28802.391337] [Hardware Error]: Corrected error, no action required. | ||
+ | [28802.391347] [Hardware Error]: CPU:6 (15:1:2) MC5_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x9c00000000020e0f | ||
+ | [28802.391359] [Hardware Error]: Error Addr: 0x0000000000000021 | ||
+ | [28802.391366] [Hardware Error]: MC5 Error: AG payload array parity error. | ||
+ | [28802.391374] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout) | ||
+ | </file> | ||
+ | - OK. Try changing the power supply unit => The segfaults disappeared ??? But now we are getting proper namd failures ''ERROR: Atoms moving too fast; simulation has become unstable (1 atoms on patch 17 pe 7).'' => Try adding ''twoAwayX yes'' => Got a segfault again after ~8 hours. :-( | ||
+ | - Keep trying : is it a hardware fault or an issue between the hardware and namd ? | ||
+ | - Test it with CLN025 and a 2fs step (195 ns/day vs 330 on scarlet) => Segfault again. | ||
+ | - memtester (3G) => possibly ok, no problems after two cycles. | ||
+ | - ''tuned-adm profile latency-performance'' => segfault | ||
+ | - Go back to original PSU and test lowering the CPU frequency => @ 1.8 GHz (140 ns/day) : stable, it looks like a CPU problem :-( | ||
+ | - Try 3.0 GHz + AMD turbo mode => segfaults. | ||
+ | - Try 3.0 GHz, disable AMD turbo mode => Looks stable, but artifacts from adaptive tempering are present. | ||
+ | - Try lowering the CPU voltage => unstable. | ||
+ | - Try 2.6 GHz (170 ns/day) => after 5.5 hours still looking ok. Continue ... => still getting problems from adaptive tempering. | ||
+ | - Try 3.2 GHz and set ''usePMECUDA no'' (150 ns/day) => Looks stable. Leave it here. | ||
+ | - The question remains : are these problems inherent to using the GTX1050 instead of GTX1070 ? | ||
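During runs like the ones above, two kinds of kernel messages were informative: namd2 segfaults and corrected machine-check ("Hardware Error") reports. The sketch below pulls both out of dmesg-style text so that overnight runs can be compared. The regular expressions are written against the log lines quoted in these notes only; other kernels may format the messages differently, and the function name is illustrative.

```python
import re

# Patterns modelled on the messages quoted above; treat them as assumptions.
SEGFAULT_RE = re.compile(
    r"(?P<prog>[\w.+-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+)"
)
MCE_RE = re.compile(r"\[Hardware Error\]: (?P<msg>.*)")


def scan_dmesg(lines):
    """Split dmesg lines into segfault records and hardware-error messages."""
    segfaults, hw_errors = [], []
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            segfaults.append((match["prog"], int(match["pid"]), match["ip"]))
            continue
        match = MCE_RE.search(line)
        if match:
            hw_errors.append(match["msg"].strip())
    return segfaults, hw_errors


# Sample input copied from the log excerpts in this page.
SAMPLE = """\
namd2[10658]: segfault at fffffffe178f9f90 ip 0000000000a82ff7 sp 0000000017888280 error 7 in namd2[400000+c4f9000]
[28802.391337] [Hardware Error]: Corrected error, no action required.
[28802.391366] [Hardware Error]: MC5 Error: AG payload array parity error.
""".splitlines()

if __name__ == "__main__":
    segfaults, hw_errors = scan_dmesg(SAMPLE)
    print(f"{len(segfaults)} segfault(s), {len(hw_errors)} hardware-error line(s)")
    for prog, pid, ip in segfaults:
        print(f"  {prog} pid={pid} ip={ip}")
```

Segfaults that recur at the same instruction pointer would hint at a software problem, whereas scattered addresses accompanied by machine-check lines are more consistent with the CPU or power problem suspected above.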
+ | |||
+ | |||
+ | |||
+ | \\ | ||
+ | \\ | ||