Sept 18th, 2014

After the addition of the chassis fan on the GPU, n0001 appears to be computationally stable. The ethernet problem, however, persists: even after downgrading the port to 100 Mbps, the tell-tale signs are still there in the slurmctld log:

Sep 17 19:06:13 norma slurmctld[30609]: error: slurm_receive_msgs: Socket timed out on send/recv operation
Sep 17 19:06:13 norma slurmctld[30609]: error: slurm_send_recv_msgs(_send_and_recv_msgs) to n0001: Socket timed out on send/recv operation
Sep 17 19:06:13 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation
Sep 17 19:06:14 norma slurmctld[30609]: error: Node n0001 not responding
Sep 18 04:26:17 norma slurmctld[30609]: error: slurm_receive_msgs: Socket timed out on send/recv operation
Sep 18 04:26:17 norma slurmctld[30609]: error: slurm_send_recv_msgs(_send_and_recv_msgs) to n0001: Socket timed out on send/recv operation
Sep 18 04:26:17 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation
Sep 18 04:26:18 norma slurmctld[30609]: error: Node n0001 not responding
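
To keep track of how often this happens, the "not responding" lines can be pulled out of the controller log with a few lines of Python. This is only a sketch: it assumes the syslog line format shown above, and the names NOT_RESPONDING, not_responding_events and SAMPLE are made up for illustration (they are not part of SLURM). The sample lines are copied from the excerpt above.

```python
import re
from collections import defaultdict

# Assumed line format (taken from the excerpt above):
#   Sep 17 19:06:14 norma slurmctld[30609]: error: Node n0001 not responding
NOT_RESPONDING = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ slurmctld\[\d+\]: "
    r"error: Node (?P<node>\S+) not responding$"
)


def not_responding_events(lines):
    """Return {node: [timestamp, ...]} for every 'not responding' line."""
    events = defaultdict(list)
    for line in lines:
        match = NOT_RESPONDING.match(line.strip())
        if match:
            events[match.group("node")].append(match.group("stamp"))
    return dict(events)


# Hypothetical in-memory sample; in practice the lines would come from
# wherever syslog stores the slurmctld messages on the head node.
SAMPLE = """\
Sep 17 19:06:13 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation
Sep 17 19:06:14 norma slurmctld[30609]: error: Node n0001 not responding
Sep 18 04:26:17 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation
Sep 18 04:26:18 norma slurmctld[30609]: error: Node n0001 not responding
"""

if __name__ == "__main__":
    for node, stamps in not_responding_events(SAMPLE.splitlines()).items():
        print(node, len(stamps), "events:", ", ".join(stamps))
```

Only the final "Node … not responding" line of each burst is counted, so the three socket time-out messages that precede it do not inflate the tally.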
2014/09/18 13:53

Sept 10th, 2014

n0001 is dead for all practical purposes (on-board ethernet?). So: shuffle the graphics cards n0008 → n0001, n0001 → n0005, n0005 → n0008. Then add n0008 to the CUDA partition, start a job on n0005, and leave n0001 idle. The PSU on n0008 is a 600 W unit; we'll see how long it lasts …

No. Something else is the matter: it appears that whenever a job uses the GPU that was originally in n0001, the node hosting it fails after a little while (observed on n0001, n0005, n0008). Additionally, days_per_nanosecond shows an increase in 'time per step' just before the node fails. It seems that we have a broken GPU. Should test it with dgemm …

Yep. Running dgemm for 5 minutes was enough to kill the node carrying that specific GPU. Order a GTX 660 …

2014/09/11 11:40


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)