Sept 10th, 2014

n0001 is dead for all practical purposes (on-board ethernet?). So: shuffle graphics cards n0008 → n0001, n0001 → n0005, n0005 → n0008. Then add n0008 to the CUDA partition, start a job on n0005, leave n0001 idle. The PSU on n0008 is a 600W unit, we'll see how long it lasts …

No. Something else is the matter: it appears that whenever a job uses the GPU that was originally in n0001, the node hosting it fails after a little while (observed on n0001, n0005, n0008). Additionally, days_per_nanosecond shows an increase in 'time per step' just before the node fails. It seems that we have a broken GPU. Should test it with dgemm …
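
The 'time per step' increase seen just before a failure could be watched for automatically. A minimal sketch, assuming the per-step timings have already been pulled out of the job log into a plain list of seconds. The function name `first_slowdown` and its thresholds are hypothetical and are not part of days_per_nanosecond.

```python
from statistics import median


def first_slowdown(step_times, baseline_n=20, window=5, factor=1.5):
    """Return the index of the first step at which the mean of the last
    `window` step times exceeds `factor` times the baseline, or None.

    The baseline is the median of the first `baseline_n` samples.
    All thresholds are illustrative guesses, not measured values.
    """
    if len(step_times) < baseline_n + window:
        return None  # not enough data to judge
    baseline = median(step_times[:baseline_n])
    for end in range(baseline_n + window, len(step_times) + 1):
        recent = step_times[end - window:end]
        if sum(recent) / window > factor * baseline:
            return end - 1  # index of the last step in the offending window
    return None
```

Feeding it the timings of a running job and getting back anything other than `None` would be a hint to drain the node before it falls over. The window and factor would need tuning against a few known-good runs.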

Yep. Running dgemm for 5 minutes was enough to kill the node carrying that specific GPU. Order a GTX 660 …