Oct 28th, 2011

Hiccup on n0001. No network traffic; the NAMD job is apparently still present (but with a lower overall load on the node) and has stopped writing to disk. Stop & restart the job → same symptoms again. Reboot node → happened again.

OK, have something. dmesg points to the GPU :

NVRM: Xid (0000:08:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
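
For future incidents, Xid lines like the one above can be pulled out of saved dmesg output automatically. What follows is a minimal sketch and was not part of the original troubleshooting: the function name parse_xid_lines is made up for illustration, and the regular expression assumes the layout of the single line recorded here (NVRM: Xid (PCI address): code, data), which other driver versions may print differently. NVIDIA's Xid documentation lists code 13 as a graphics engine exception, which can come from either software or hardware, so the line alone does not prove the card is bad.

```python
import re

# Pattern assumed from the one dmesg line recorded above; other driver
# versions may use a different layout.
XID_RE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def parse_xid_lines(text):
    """Return (pci_address, xid_code, raw_data) for every Xid line in text."""
    events = []
    for line in text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group(1), int(match.group(2)), match.group(3).strip())
            )
    return events


if __name__ == "__main__":
    sample = (
        "NVRM: Xid (0000:08:00): 13, 0001 00000000 000090c0 "
        "00001b0c 00000000 00000000"
    )
    for pci, code, data in parse_xid_lines(sample):
        print(f"GPU at PCI {pci} reported Xid {code}: {data}")
```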


Start testing hypotheses :

  • Has the CUDA card died (or is it dying)? Proper full shutdown, power-cycle, then stress-test the card for a couple of hours (noting that the NAMD problem appears within minutes of starting the job) :
[root@n0001 memtestG80-1.1-linux64]$ ./memtestG80 -b 768 50000
     -------------------------------------------------------------
     |                      MemtestG80 v1.00                     |
     |                                                           |
     | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
     |                                                           |
     | Defaults: GPU 0, 128MB RAM, 50 test iterations            |
     | Amount of tested RAM will be rounded up to nearest 2MB    |
     -------------------------------------------------------------

      Available flags:
        --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
        --license ,-l : show license terms for this build
        --forcecomm, -f : DO send test results to Stanford  (don't prompt)
        --bancomm, -b : DO NOT send test results to Stanford  (don't prompt)
        --ramclock X , -r X: Specify RAM clock speed (for returned results) as X MHz
        --coreclock X , -c X: Specify core/ROP clock speed (for returned results) as X MHz

Running 50000 iterations of tests over 768 MB of GPU memory on card 0: GeForce GTX 460

Running memory bandwidth test over 20 iterations of 384 MB transfers...
	Estimated bandwidth 60472.44 MB/s

Test iteration 1 (GPU 0, 768 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (51 ms)
	Memtest86 Walking 8-bit: 0 errors (384 ms)
	True Walking zeros (8-bit): 0 errors (192 ms)
	True Walking ones (8-bit): 0 errors (192 ms)
	Moving Inversions (random): 0 errors (48 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (780 ms)
	Memtest86 Walking ones (32-bit): 0 errors (772 ms)
	Random blocks: 0 errors (348 ms)

....

Test iteration 1047 (GPU 0, 768 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (48 ms)
	Memtest86 Walking 8-bit: 0 errors (384 ms)
	True Walking zeros (8-bit): 0 errors (192 ms)
	True Walking ones (8-bit): 0 errors (192 ms)
	Moving Inversions (random): 0 errors (48 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (768 ms)
	Memtest86 Walking ones (32-bit): 0 errors (768 ms)
	Random blocks: 0 errors (348 ms)
	Memtest86 Modulo-20: 0 errors (3680 ms)
	Logic (one iteration): 0 errors (32 ms)
	Logic (4 iterations): 0 errors (60 ms)
	Logic (shared memory, one iteration): 0 errors (40 ms)
	Logic (shared-memory, 4 iterations): 0 errors (100 ms)

Card looks OK (?).


  • Getting interesting. Immediately after stopping the stress-test, start the NAMD job → started OK and seems to continue without problems. Looks like a hiccup that just needed a power cycle after all …
  • No, it is not a hiccup: the NAMD job stopped again after three hours. Reboot node and stress-test the card overnight with dgemmSweep :
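#!/bin/csh -f

# Make the CUDA libraries (cuBLAS) visible. Guard against an unset
# LD_LIBRARY_PATH, which would otherwise abort the script under csh.
if ( $?LD_LIBRARY_PATH ) then
    setenv LD_LIBRARY_PATH /usr/local/cuda/lib64:${LD_LIBRARY_PATH}
else
    setenv LD_LIBRARY_PATH /usr/local/cuda/lib64
endif

# Run the DGEMM stress test on device 0.
./dgemmSweep 0 1000

exit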

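giving after 2.5 hours (the Error line appears first, presumably because stderr is unbuffered while stdout was only flushed at exit) :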

Error: cublasDgemm returned an invalid result at location 1322,3409 in iteration 4352 on device 0
Testing device 0: GeForce GTX 460
device = 0
iterSize = 5984
Device 0: i = 128
Device 0: i = 160
...
Device 0: i = 4320
Device 0: i = 4352
8707.007812
ERROR: Failed with device 0. dgemmSweep FAILED.


Try to verify with memtestG80 as well :

#!/bin/csh -f

./memtestG80 -b 768 10000 >& LOG

exit
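
Rather than reading through LOG by eye, the first failing test can be located with a short script. This is only a sketch under assumptions: the function name first_failure is hypothetical, and the two regular expressions are derived from the memtestG80 output format shown on this page, not from any documented format of the tool.

```python
import re

# Both patterns are assumptions based on the output shown on this page.
ITER_RE = re.compile(r"^Test iteration (\d+) \(GPU (\d+), (\d+) MiB\): (\d+) errors so far")
TEST_RE = re.compile(r"^\s+(.+): (\d+) errors \((\d+) ms\)")


def first_failure(lines):
    """Return (iteration, test_name, error_count) of the first failing test, or None."""
    iteration = None
    for line in lines:
        header = ITER_RE.match(line)
        if header:
            iteration = int(header.group(1))
            continue
        result = TEST_RE.match(line)
        if result and iteration is not None and int(result.group(2)) > 0:
            return iteration, result.group(1), int(result.group(2))
    return None


if __name__ == "__main__":
    sample = [
        "Test iteration 1 (GPU 0, 768 MiB): 0 errors so far",
        "\tRandom blocks: 0 errors (348 ms)",
        "Test iteration 2 (GPU 0, 768 MiB): 0 errors so far",
        "\tRandom blocks: 5 errors (348 ms)",
    ]
    print(first_failure(sample))
```

To use it on the real file, pass the open file object, e.g. first_failure(open("LOG")).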


Bombs out within 16 minutes :

     -------------------------------------------------------------
     |                      MemtestG80 v1.00                     |
     |                                                           |
     | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
     |                                                           |
     | Defaults: GPU 0, 128MB RAM, 50 test iterations            |
     | Amount of tested RAM will be rounded up to nearest 2MB    |
     -------------------------------------------------------------

      Available flags:
        --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
        --license ,-l : show license terms for this build
        --forcecomm, -f : DO send test results to Stanford  (don't prompt)
        --bancomm, -b : DO NOT send test results to Stanford  (don't prompt)
        --ramclock X , -r X: Specify RAM clock speed (for returned results) as X MHz
        --coreclock X , -c X: Specify core/ROP clock speed (for returned results) as X MHz

Running 10000 iterations of tests over 768 MB of GPU memory on card 0: GeForce GTX 460

Running memory bandwidth test over 20 iterations of 384 MB transfers...
	Estimated bandwidth 60235.29 MB/s

Test iteration 1 (GPU 0, 768 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (49 ms)
	Memtest86 Walking 8-bit: 0 errors (384 ms)
	True Walking zeros (8-bit): 0 errors (192 ms)
	True Walking ones (8-bit): 0 errors (192 ms)
	Moving Inversions (random): 0 errors (48 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (776 ms)
	Memtest86 Walking ones (32-bit): 0 errors (776 ms)
	Random blocks: 0 errors (348 ms)
	Memtest86 Modulo-20: 0 errors (3684 ms)
	Logic (one iteration): 0 errors (32 ms)
	Logic (4 iterations): 0 errors (60 ms)
	Logic (shared memory, one iteration): 0 errors (40 ms)
	Logic (shared-memory, 4 iterations): 0 errors (100 ms)

Test iteration 2 (GPU 0, 768 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (48 ms)

...


Test iteration 136 (GPU 0, 768 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (48 ms)
	Memtest86 Walking 8-bit: 0 errors (388 ms)
	True Walking zeros (8-bit): 0 errors (192 ms)
	True Walking ones (8-bit): 0 errors (192 ms)
	Moving Inversions (random): 0 errors (48 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (772 ms)
	Memtest86 Walking ones (32-bit): 0 errors (768 ms)
	Random blocks: 0 errors (348 ms)
	Memtest86 Modulo-20: 0 errors (3680 ms)
	Logic (one iteration): 0 errors (32 ms)
	Logic (4 iterations): 0 errors (60 ms)
	Logic (shared memory, one iteration): 768 errors (40 ms)
	Logic (shared-memory, 4 iterations): 1152 errors (100 ms)

Test iteration 137 (GPU 0, 768 MiB): 1920 errors so far
	Moving Inversions (ones and zeros): 2 errors (48 ms)
	Memtest86 Walking 8-bit: 3 errors (384 ms)
	True Walking zeros (8-bit): 4294967292 errors (104 ms)
	True Walking ones (8-bit): 4294967288 errors (0 ms)
	Moving Inversions (random): 4294967295 errors (0 ms)
	Memtest86 Walking zeros (32-bit): 4294967264 errors (0 ms)
	Memtest86 Walking ones (32-bit): 4294967264 errors (1 ms)
	Random blocks: 4294967295 errors (0 ms)
	Memtest86 Modulo-20: 4294967276 errors (0 ms)
	Logic (one iteration): 4294967295 errors (0 ms)
	Logic (4 iterations): 4294967295 errors (0 ms)
	Logic (shared memory, one iteration): 4294967295 errors (0 ms)
	Logic (shared-memory, 4 iterations): 4294967295 errors (0 ms)

Test iteration 138 (GPU 0, 768 MiB): 1823 errors so far
	Moving Inversions (ones and zeros): 4294967295 errors (0 ms)
	Memtest86 Walking 8-bit: 4294967288 errors (0 ms)
	True Walking zeros (8-bit): 4294967288 errors (0 ms)
	True Walking ones (8-bit): 4294967288 errors (0 ms)
	Moving Inversions (random): 4294967295 errors (0 ms)
	Memtest86 Walking zeros (32-bit): 4294967264 errors (0 ms)
	Memtest86 Walking ones (32-bit): 4294967264 errors (0 ms)
	Random blocks: 4294967295 errors (0 ms)
	Memtest86 Modulo-20: 4294967276 errors (0 ms)
	Logic (one iteration): 4294967295 errors (0 ms)
	Logic (4 iterations): 4294967295 errors (0 ms)
	Logic (shared memory, one iteration): 4294967295 errors (0 ms)
	Logic (shared-memory, 4 iterations): 4294967295 errors (0 ms)

...
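
A note on the huge counts above: 4294967295 is 2^32 - 1, i.e. -1 stored in an unsigned 32-bit counter, and the other large values are likewise small negative numbers (-4, -8, -32, -20). Together with the 0 ms run times, this suggests that from iteration 137 onwards the tests were no longer really running on the card and the numbers are not genuine bit-error counts. That is an interpretation, not something the tool itself reports. The running total fits this reading: 768 + 1152 gives the 1920 shown at the start of iteration 137, and adding the iteration-137 counts modulo 2^32 gives exactly the 1823 shown at the start of iteration 138. A small check, with all numbers copied from the log and the 32-bit counter width being an assumption:

```python
MOD = 2**32  # assumption: the error counters are unsigned 32-bit integers


def as_signed(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit value."""
    return value - MOD if value >= MOD // 2 else value


# "errors so far" at the start of iteration 137 (768 + 1152 from iteration 136)
total_before = 768 + 1152

# Per-test error counts reported during iteration 137, in log order
iteration_137 = [
    2, 3, 4294967292, 4294967288, 4294967295, 4294967264, 4294967264,
    4294967295, 4294967276, 4294967295, 4294967295, 4294967295, 4294967295,
]

total_after = (total_before + sum(iteration_137)) % MOD

print([as_signed(v) for v in iteration_137])
# [2, 3, -4, -8, -1, -32, -32, -1, -20, -1, -1, -1, -1]
print(total_after)
# 1823, matching "Test iteration 138 ... 1823 errors so far"
```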


Conclusion: the card is faulty. Take it out, send it off …



maintenance/oct_28th_2011.txt · Last modified: 2011/10/29 12:38 (external edit)