Hiccup on n0001. No network traffic, the NAMD job apparently still present (but with less overall load on the node), and the job has stopped writing to disk. Stopped and restarted the job → same symptoms again. Rebooted the node → happened again.
OK, have something. dmesg points to the GPU :
NVRM: Xid (0000:08:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
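For watching the kernel log for repeats, the Xid line above can be picked apart with a small script. This is only a sketch: the regular expression is modelled on the single line shown here, and the function and field names are made up for illustration. As far as I can tell from NVIDIA's Xid documentation, Xid 13 is a "Graphics Engine Exception", which can come from hardware, the driver or the application, so on its own it does not prove the card is bad.

```python
import re

# Pattern modelled on the one NVRM line in this log (an assumption:
# other driver versions may format the line differently).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<rest>.*)"
)

def parse_xid(line):
    """Return PCI address, Xid number and trailing hex words, or None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "details": m.group("rest").split(),
    }

line = "NVRM: Xid (0000:08:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000"
event = parse_xid(line)
print(event["pci"], event["xid"], event["details"])
```

Feeding it the output of dmesg line by line and keeping the non-None results gives a list of Xid events per PCI address.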
Start testing hypotheses :
[root@n0001 memtestG80-1.1-linux64]$ ./memtestG80 -b 768 50000
-------------------------------------------------------------
|                      MemtestG80 v1.00                     |
|                                                           |
| Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
|                                                           |
| Defaults: GPU 0, 128MB RAM, 50 test iterations            |
| Amount of tested RAM will be rounded up to nearest 2MB    |
-------------------------------------------------------------

Available flags:
  --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
  --license ,-l : show license terms for this build
  --forcecomm, -f : DO send test results to Stanford (don't prompt)
  --bancomm, -b : DO NOT send test results to Stanford (don't prompt)
  --ramclock X , -r X: Specify RAM clock speed (for returned results) as X MHz
  --coreclock X , -c X: Specify core/ROP clock speed (for returned results) as X MHz

Running 50000 iterations of tests over 768 MB of GPU memory on card 0: GeForce GTX 460

Running memory bandwidth test over 20 iterations of 384 MB transfers...
  Estimated bandwidth 60472.44 MB/s

Test iteration 1 (GPU 0, 768 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (51 ms)
  Memtest86 Walking 8-bit: 0 errors (384 ms)
  True Walking zeros (8-bit): 0 errors (192 ms)
  True Walking ones (8-bit): 0 errors (192 ms)
  Moving Inversions (random): 0 errors (48 ms)
  Memtest86 Walking zeros (32-bit): 0 errors (780 ms)
  Memtest86 Walking ones (32-bit): 0 errors (772 ms)
  Random blocks: 0 errors (348 ms)
....
Test iteration 1047 (GPU 0, 768 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (48 ms)
  Memtest86 Walking 8-bit: 0 errors (384 ms)
  True Walking zeros (8-bit): 0 errors (192 ms)
  True Walking ones (8-bit): 0 errors (192 ms)
  Moving Inversions (random): 0 errors (48 ms)
  Memtest86 Walking zeros (32-bit): 0 errors (768 ms)
  Memtest86 Walking ones (32-bit): 0 errors (768 ms)
  Random blocks: 0 errors (348 ms)
  Memtest86 Modulo-20: 0 errors (3680 ms)
  Logic (one iteration): 0 errors (32 ms)
  Logic (4 iterations): 0 errors (60 ms)
  Logic (shared memory, one iteration): 0 errors (40 ms)
  Logic (shared-memory, 4 iterations): 0 errors (100 ms)
Card looks OK (?).
#!/bin/csh -f
setenv LD_LIBRARY_PATH /usr/local/cuda/lib64:$LD_LIBRARY_PATH
./dgemmSweep 0 1000
exit
giving after 2.5 hours :
Error: cublasDgemm returned an invalid result at location 1322,3409 in iteration 4352 on device 0
Testing device 0: GeForce GTX 460
device = 0
iterSize = 5984
Device 0: i = 128
Device 0: i = 160
...
Device 0: i = 4320
Device 0: i = 4352
8707.007812
ERROR: Failed with device 0.
dgemmSweep FAILED.
Try to verify with memtestG80 as well :
#!/bin/csh -f
./memtestG80 -b 768 10000 >& LOG
exit
Bombs out in 16 minutes :
-------------------------------------------------------------
|                      MemtestG80 v1.00                     |
|                                                           |
| Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
|                                                           |
| Defaults: GPU 0, 128MB RAM, 50 test iterations            |
| Amount of tested RAM will be rounded up to nearest 2MB    |
-------------------------------------------------------------

Available flags:
  --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
  --license ,-l : show license terms for this build
  --forcecomm, -f : DO send test results to Stanford (don't prompt)
  --bancomm, -b : DO NOT send test results to Stanford (don't prompt)
  --ramclock X , -r X: Specify RAM clock speed (for returned results) as X MHz
  --coreclock X , -c X: Specify core/ROP clock speed (for returned results) as X MHz

Running 10000 iterations of tests over 768 MB of GPU memory on card 0: GeForce GTX 460

Running memory bandwidth test over 20 iterations of 384 MB transfers...
  Estimated bandwidth 60235.29 MB/s

Test iteration 1 (GPU 0, 768 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (49 ms)
  Memtest86 Walking 8-bit: 0 errors (384 ms)
  True Walking zeros (8-bit): 0 errors (192 ms)
  True Walking ones (8-bit): 0 errors (192 ms)
  Moving Inversions (random): 0 errors (48 ms)
  Memtest86 Walking zeros (32-bit): 0 errors (776 ms)
  Memtest86 Walking ones (32-bit): 0 errors (776 ms)
  Random blocks: 0 errors (348 ms)
  Memtest86 Modulo-20: 0 errors (3684 ms)
  Logic (one iteration): 0 errors (32 ms)
  Logic (4 iterations): 0 errors (60 ms)
  Logic (shared memory, one iteration): 0 errors (40 ms)
  Logic (shared-memory, 4 iterations): 0 errors (100 ms)
Test iteration 2 (GPU 0, 768 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (48 ms)
...
Test iteration 136 (GPU 0, 768 MiB): 0 errors so far
  Moving Inversions (ones and zeros): 0 errors (48 ms)
  Memtest86 Walking 8-bit: 0 errors (388 ms)
  True Walking zeros (8-bit): 0 errors (192 ms)
  True Walking ones (8-bit): 0 errors (192 ms)
  Moving Inversions (random): 0 errors (48 ms)
  Memtest86 Walking zeros (32-bit): 0 errors (772 ms)
  Memtest86 Walking ones (32-bit): 0 errors (768 ms)
  Random blocks: 0 errors (348 ms)
  Memtest86 Modulo-20: 0 errors (3680 ms)
  Logic (one iteration): 0 errors (32 ms)
  Logic (4 iterations): 0 errors (60 ms)
  Logic (shared memory, one iteration): 768 errors (40 ms)
  Logic (shared-memory, 4 iterations): 1152 errors (100 ms)
Test iteration 137 (GPU 0, 768 MiB): 1920 errors so far
  Moving Inversions (ones and zeros): 2 errors (48 ms)
  Memtest86 Walking 8-bit: 3 errors (384 ms)
  True Walking zeros (8-bit): 4294967292 errors (104 ms)
  True Walking ones (8-bit): 4294967288 errors (0 ms)
  Moving Inversions (random): 4294967295 errors (0 ms)
  Memtest86 Walking zeros (32-bit): 4294967264 errors (0 ms)
  Memtest86 Walking ones (32-bit): 4294967264 errors (1 ms)
  Random blocks: 4294967295 errors (0 ms)
  Memtest86 Modulo-20: 4294967276 errors (0 ms)
  Logic (one iteration): 4294967295 errors (0 ms)
  Logic (4 iterations): 4294967295 errors (0 ms)
  Logic (shared memory, one iteration): 4294967295 errors (0 ms)
  Logic (shared-memory, 4 iterations): 4294967295 errors (0 ms)
Test iteration 138 (GPU 0, 768 MiB): 1823 errors so far
  Moving Inversions (ones and zeros): 4294967295 errors (0 ms)
  Memtest86 Walking 8-bit: 4294967288 errors (0 ms)
  True Walking zeros (8-bit): 4294967288 errors (0 ms)
  True Walking ones (8-bit): 4294967288 errors (0 ms)
  Moving Inversions (random): 4294967295 errors (0 ms)
  Memtest86 Walking zeros (32-bit): 4294967264 errors (0 ms)
  Memtest86 Walking ones (32-bit): 4294967264 errors (0 ms)
  Random blocks: 4294967295 errors (0 ms)
  Memtest86 Modulo-20: 4294967276 errors (0 ms)
  Logic (one iteration): 4294967295 errors (0 ms)
  Logic (4 iterations): 4294967295 errors (0 ms)
  Logic (shared memory, one iteration): 4294967295 errors (0 ms)
  Logic (shared-memory, 4 iterations): 4294967295 errors (0 ms)
...
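A quick sanity check on the absurd counts from iteration 137 onwards: 4294967295 is 2^32 - 1, so these look like small negative values printed as unsigned 32-bit integers. That reading is an assumption about how memtestG80 keeps its counters, not something taken from its source, but the arithmetic fits: adding the iteration-137 counts to the 1920 errors so far, modulo 2^32, gives exactly the 1823 reported at the start of iteration 138. A sketch in Python (the helper names are made up for illustration):

```python
def as_int32(n):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit value."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n

def as_uint32(n):
    """Wrap an integer into the unsigned 32-bit range."""
    return n & 0xFFFFFFFF

# Per-test error counts printed for iteration 137, in log order.
iter137 = [
    2, 3, 4294967292, 4294967288, 4294967295,
    4294967264, 4294967264, 4294967295, 4294967276,
    4294967295, 4294967295, 4294967295, 4294967295,
]

errors_before = 1920  # "1920 errors so far" at the start of iteration 137

signed = [as_int32(c) for c in iter137]
total_after = as_uint32(errors_before + sum(iter137))

print(signed)       # the same counts read as signed 32-bit values
print(total_after)  # running total after iteration 137, wrapped to 32 bits
```

Together with the 0 ms run times, this suggests the tests were no longer really running on the card by that point, rather than the memory suddenly holding billions of bad words.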
Take card out, send it off …