Hiccup on n0001: no network traffic, namd job apparently still present (but with less overall load on the node), and the job stopped writing to disk. Stopped & restarted the job → did it again with the same symptoms. Rebooted the node → happened again.
OK, have something. dmesg points to the GPU :
NVRM: Xid (0000:08:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
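For reference, the Xid line above can be picked apart mechanically. The sketch below is an illustration only: `parse_xid` and `XID_NAMES` are hypothetical names, the regular expression assumes the exact line layout seen in this dmesg output, and the one description listed (13, "Graphics Engine Exception") is taken from NVIDIA's Xid table, which gives both application and hardware faults as possible causes. The trailing hex words are kept as opaque strings because their meaning is not documented here.

```python
import re

# Matches the NVRM Xid line layout seen in this log (an assumption; other
# driver versions may format the line differently):
#   NVRM: Xid (0000:08:00): 13, 0001 00000000 ...
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<rest>.*)"
)

# Only the code that appears in this log is listed.
XID_NAMES = {13: "Graphics Engine Exception"}


def parse_xid(line):
    """Return (pci_address, xid_code, payload_words), or None if not an Xid line."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return m.group("pci"), int(m.group("xid")), m.group("rest").split()


line = "NVRM: Xid (0000:08:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000"
pci, code, words = parse_xid(line)
print(pci, code, XID_NAMES.get(code, "unknown"), words)
```

Since Xid 13 alone does not separate a misbehaving application from a failing card, the hardware tests that follow are still needed.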
Start testing hypotheses :
[root@n0001 memtestG80-1.1-linux64]$ ./memtestG80 -b 768 50000
-------------------------------------------------------------
| MemtestG80 v1.00 |
| |
| Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters] |
| |
| Defaults: GPU 0, 128MB RAM, 50 test iterations |
| Amount of tested RAM will be rounded up to nearest 2MB |
-------------------------------------------------------------
Available flags:
--gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
--license ,-l : show license terms for this build
--forcecomm, -f : DO send test results to Stanford (don't prompt)
--bancomm, -b : DO NOT send test results to Stanford (don't prompt)
--ramclock X , -r X: Specify RAM clock speed (for returned results) as X MHz
--coreclock X , -c X: Specify core/ROP clock speed (for returned results) as X MHz
Running 50000 iterations of tests over 768 MB of GPU memory on card 0: GeForce GTX 460
Running memory bandwidth test over 20 iterations of 384 MB transfers...
Estimated bandwidth 60472.44 MB/s
Test iteration 1 (GPU 0, 768 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (51 ms)
Memtest86 Walking 8-bit: 0 errors (384 ms)
True Walking zeros (8-bit): 0 errors (192 ms)
True Walking ones (8-bit): 0 errors (192 ms)
Moving Inversions (random): 0 errors (48 ms)
Memtest86 Walking zeros (32-bit): 0 errors (780 ms)
Memtest86 Walking ones (32-bit): 0 errors (772 ms)
Random blocks: 0 errors (348 ms)
....
Test iteration 1047 (GPU 0, 768 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (48 ms)
Memtest86 Walking 8-bit: 0 errors (384 ms)
True Walking zeros (8-bit): 0 errors (192 ms)
True Walking ones (8-bit): 0 errors (192 ms)
Moving Inversions (random): 0 errors (48 ms)
Memtest86 Walking zeros (32-bit): 0 errors (768 ms)
Memtest86 Walking ones (32-bit): 0 errors (768 ms)
Random blocks: 0 errors (348 ms)
Memtest86 Modulo-20: 0 errors (3680 ms)
Logic (one iteration): 0 errors (32 ms)
Logic (4 iterations): 0 errors (60 ms)
Logic (shared memory, one iteration): 0 errors (40 ms)
Logic (shared-memory, 4 iterations): 0 errors (100 ms)
Card looks OK (?). Try dgemmSweep instead :
#!/bin/csh -f
setenv LD_LIBRARY_PATH /usr/local/cuda/lib64:$LD_LIBRARY_PATH
./dgemmSweep 0 1000
exit
giving after 2.5 hours :
Error: cublasDgemm returned an invalid result at location 1322,3409 in iteration 4352 on device 0
Testing device 0: GeForce GTX 460
device = 0
iterSize = 5984
Device 0: i = 128
Device 0: i = 160
...
Device 0: i = 4320
Device 0: i = 4352
8707.007812
ERROR: Failed with device 0.
dgemmSweep FAILED.
Try to verify with memtestG80 as well :
#!/bin/csh -f
./memtestG80 -b 768 10000 >& LOG
exit
Bombs out after 16 minutes :
-------------------------------------------------------------
| MemtestG80 v1.00 |
| |
| Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters] |
| |
| Defaults: GPU 0, 128MB RAM, 50 test iterations |
| Amount of tested RAM will be rounded up to nearest 2MB |
-------------------------------------------------------------
Available flags:
--gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
--license ,-l : show license terms for this build
--forcecomm, -f : DO send test results to Stanford (don't prompt)
--bancomm, -b : DO NOT send test results to Stanford (don't prompt)
--ramclock X , -r X: Specify RAM clock speed (for returned results) as X MHz
--coreclock X , -c X: Specify core/ROP clock speed (for returned results) as X MHz
Running 10000 iterations of tests over 768 MB of GPU memory on card 0: GeForce GTX 460
Running memory bandwidth test over 20 iterations of 384 MB transfers...
Estimated bandwidth 60235.29 MB/s
Test iteration 1 (GPU 0, 768 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (49 ms)
Memtest86 Walking 8-bit: 0 errors (384 ms)
True Walking zeros (8-bit): 0 errors (192 ms)
True Walking ones (8-bit): 0 errors (192 ms)
Moving Inversions (random): 0 errors (48 ms)
Memtest86 Walking zeros (32-bit): 0 errors (776 ms)
Memtest86 Walking ones (32-bit): 0 errors (776 ms)
Random blocks: 0 errors (348 ms)
Memtest86 Modulo-20: 0 errors (3684 ms)
Logic (one iteration): 0 errors (32 ms)
Logic (4 iterations): 0 errors (60 ms)
Logic (shared memory, one iteration): 0 errors (40 ms)
Logic (shared-memory, 4 iterations): 0 errors (100 ms)
Test iteration 2 (GPU 0, 768 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (48 ms)
...
Test iteration 136 (GPU 0, 768 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (48 ms)
Memtest86 Walking 8-bit: 0 errors (388 ms)
True Walking zeros (8-bit): 0 errors (192 ms)
True Walking ones (8-bit): 0 errors (192 ms)
Moving Inversions (random): 0 errors (48 ms)
Memtest86 Walking zeros (32-bit): 0 errors (772 ms)
Memtest86 Walking ones (32-bit): 0 errors (768 ms)
Random blocks: 0 errors (348 ms)
Memtest86 Modulo-20: 0 errors (3680 ms)
Logic (one iteration): 0 errors (32 ms)
Logic (4 iterations): 0 errors (60 ms)
Logic (shared memory, one iteration): 768 errors (40 ms)
Logic (shared-memory, 4 iterations): 1152 errors (100 ms)
Test iteration 137 (GPU 0, 768 MiB): 1920 errors so far
Moving Inversions (ones and zeros): 2 errors (48 ms)
Memtest86 Walking 8-bit: 3 errors (384 ms)
True Walking zeros (8-bit): 4294967292 errors (104 ms)
True Walking ones (8-bit): 4294967288 errors (0 ms)
Moving Inversions (random): 4294967295 errors (0 ms)
Memtest86 Walking zeros (32-bit): 4294967264 errors (0 ms)
Memtest86 Walking ones (32-bit): 4294967264 errors (1 ms)
Random blocks: 4294967295 errors (0 ms)
Memtest86 Modulo-20: 4294967276 errors (0 ms)
Logic (one iteration): 4294967295 errors (0 ms)
Logic (4 iterations): 4294967295 errors (0 ms)
Logic (shared memory, one iteration): 4294967295 errors (0 ms)
Logic (shared-memory, 4 iterations): 4294967295 errors (0 ms)
Test iteration 138 (GPU 0, 768 MiB): 1823 errors so far
Moving Inversions (ones and zeros): 4294967295 errors (0 ms)
Memtest86 Walking 8-bit: 4294967288 errors (0 ms)
True Walking zeros (8-bit): 4294967288 errors (0 ms)
True Walking ones (8-bit): 4294967288 errors (0 ms)
Moving Inversions (random): 4294967295 errors (0 ms)
Memtest86 Walking zeros (32-bit): 4294967264 errors (0 ms)
Memtest86 Walking ones (32-bit): 4294967264 errors (0 ms)
Random blocks: 4294967295 errors (0 ms)
Memtest86 Modulo-20: 4294967276 errors (0 ms)
Logic (one iteration): 4294967295 errors (0 ms)
Logic (4 iterations): 4294967295 errors (0 ms)
Logic (shared memory, one iteration): 4294967295 errors (0 ms)
Logic (shared-memory, 4 iterations): 4294967295 errors (0 ms)
...
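A note on the huge counts above: 4294967295 is 2^32 - 1, so these look like small negative numbers printed as unsigned 32-bit values, not real error totals. The sketch below checks that reading against the log itself. Read as signed, the per-test counts of iteration 137 sum to -97, and 1920 - 97 = 1823, which is exactly the running total printed at the start of iteration 138. Together with the 0 ms timings, this suggests the tests stopped running properly after the first genuine errors in iteration 136; that last step is an inference, not something the tool reports. The helper names (`as_signed32`, `first_failure`) are hypothetical, and the regular expressions assume the log layout shown here.

```python
import re

ITER_RE = re.compile(r"Test iteration (\d+) \(GPU \d+, \d+ MiB\): (\d+) errors so far")
TEST_RE = re.compile(r"^\s*(.+?): (\d+) errors \((\d+) ms\)")


def as_signed32(n):
    """Reinterpret an unsigned 32-bit value as signed (assumes the counter wrapped)."""
    return n - (1 << 32) if n >= (1 << 31) else n


def first_failure(lines):
    """Return (iteration, test_name, count) for the first test with errors, else None."""
    iteration = None
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            iteration = int(m.group(1))
            continue
        m = TEST_RE.match(line)
        if m and int(m.group(2)) != 0:
            return iteration, m.group(1), int(m.group(2))
    return None


# A few lines copied from the log above.
sample = [
    "Test iteration 136 (GPU 0, 768 MiB): 0 errors so far",
    "Logic (4 iterations): 0 errors (60 ms)",
    "Logic (shared memory, one iteration): 768 errors (40 ms)",
    "Logic (shared-memory, 4 iterations): 1152 errors (100 ms)",
    "Test iteration 137 (GPU 0, 768 MiB): 1920 errors so far",
]
print(first_failure(sample))

# Per-test counts of iteration 137, in log order.
iter137 = [2, 3, 4294967292, 4294967288, 4294967295, 4294967264, 4294967264,
           4294967295, 4294967276, 4294967295, 4294967295, 4294967295, 4294967295]
delta = sum(as_signed32(n) for n in iter137)
print(delta, 1920 + delta)
```

On the real file, `first_failure(open("LOG"))` would report the first failing test directly, without paging through the clean iterations.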
Take card out, send it off …






