Sept 18th, 2014

After the addition of the chassis fan on the GPU, n0001 appears to be computationally stable. The Ethernet problem, however, is still present: even after downgrading the port to 100 Mbps, the tell-tale signs still show up in the slurmctld log:

Sep 17 19:06:13 norma slurmctld[30609]: error: slurm_receive_msgs: Socket timed out on send/recv operation
Sep 17 19:06:13 norma slurmctld[30609]: error: slurm_send_recv_msgs(_send_and_recv_msgs) to n0001: Socket timed out on send/recv operation
Sep 17 19:06:13 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation
Sep 17 19:06:14 norma slurmctld[30609]: error: Node n0001 not responding
Sep 18 04:26:17 norma slurmctld[30609]: error: slurm_receive_msgs: Socket timed out on send/recv operation
Sep 18 04:26:17 norma slurmctld[30609]: error: slurm_send_recv_msgs(_send_and_recv_msgs) to n0001: Socket timed out on send/recv operation
Sep 18 04:26:17 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation
Sep 18 04:26:18 norma slurmctld[30609]: error: Node n0001 not responding
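
To keep track of how often this recurs, a small script can pull the "not responding" events out of the log. This is only a rough sketch: the function name and the regular expression are my own, written against the line format shown above, and may need adjusting for other syslog layouts.

```python
import re
from collections import defaultdict

# Pattern written against the slurmctld lines quoted above, e.g.
# "Sep 17 19:06:14 norma slurmctld[30609]: error: Node n0001 not responding"
# (an assumption: other syslog configurations may format the prefix differently).
NOT_RESPONDING = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+slurmctld\[\d+\]:\s+"
    r"error: Node (?P<node>\S+) not responding"
)


def not_responding_events(lines):
    """Return {node: [timestamp strings]} for every 'not responding' line."""
    events = defaultdict(list)
    for line in lines:
        match = NOT_RESPONDING.match(line)
        if match:
            events[match.group("node")].append(match.group("stamp"))
    return dict(events)


# Four of the lines from the excerpt above, used as sample input.
sample = [
    "Sep 17 19:06:13 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation",
    "Sep 17 19:06:14 norma slurmctld[30609]: error: Node n0001 not responding",
    "Sep 18 04:26:17 norma slurmctld[30609]: error: agent/send_recv_msg: n0001: Socket timed out on send/recv operation",
    "Sep 18 04:26:18 norma slurmctld[30609]: error: Node n0001 not responding",
]

for node, stamps in not_responding_events(sample).items():
    print(node, len(stamps), stamps)
```

On a live system the same function can be fed an open file handle instead of the sample list; where the slurmctld messages end up depends on the local syslog configuration.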