April 16th, 2009

n0003 followed in the footsteps of n0004 and n0008: yet another power supply failed, which takes us to three failures out of eight nodes in less than three months. Pooh.

The behaviour of slurm is commendable. Two jobs were running on the failed node. When the node stopped responding, slurm set it to 'down' and re-queued both jobs. One of them started immediately on two cores that were not allocated; the other is still waiting for resources. Nice.
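
For the record, here is a toy model of what happened, written in Python. It is only a sketch of the behaviour described above, not how slurm is actually implemented: the `Node` and `Job` classes, the `schedule` and `fail_node` functions, the core counts, the job names and the node name n0001 are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cores: int
    used: int = 0
    state: str = "up"


@dataclass
class Job:
    name: str
    cores: int
    node: Optional[str] = None  # None means the job is pending


def schedule(nodes: List[Node], jobs: List[Job]) -> None:
    """Start each pending job on the first 'up' node with enough free cores."""
    for job in jobs:
        if job.node is not None:
            continue
        for node in nodes:
            if node.state == "up" and node.cores - node.used >= job.cores:
                node.used += job.cores
                job.node = node.name
                break


def fail_node(name: str, nodes: List[Node], jobs: List[Job]) -> None:
    """Mark a node 'down', re-queue its jobs, then try to place them elsewhere."""
    for node in nodes:
        if node.name == name:
            node.state = "down"
            node.used = 0
    for job in jobs:
        if job.node == name:
            job.node = None
    schedule(nodes, jobs)


def demo():
    # Hypothetical layout: n0001 has two free cores, n0003 runs two jobs.
    nodes = [Node("n0001", cores=8, used=6), Node("n0003", cores=8, used=6)]
    jobs = [
        Job("other", cores=6, node="n0001"),
        Job("small", cores=2, node="n0003"),
        Job("large", cores=4, node="n0003"),
    ]
    fail_node("n0003", nodes, jobs)
    return nodes, jobs


if __name__ == "__main__":
    nodes, jobs = demo()
    for job in jobs:
        print(job.name, "->", job.node or "pending")
```

In this made-up layout the two-core job fits on the free cores of n0001 and restarts at once, while the four-core job stays pending until something frees up.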