New tricks. Nodes that are not running jobs report:
nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
INFO: task tcsh:24370 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tcsh D 0000000000000000 0 24370 24367
 ffff81007acfb8b8 0000000000000086 0000000000000000 0000000000000046
 ffff81007f4ea280 ffff81007acfb848 ffffffff80b9cb00 0000000000000000
 ffff8100785d2000 ffff8100cb0040d0 ffff8100cb95f370 ffff8100cb004330
Call Trace:
 [<ffffffffa0e12e7f>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
 [<ffffffffa0e12eac>] :sunrpc:rpc_wait_bit_killable+0x2d/0x31
 [<ffffffff80431c3b>] __wait_on_bit+0x41/0x70
 [<ffffffffa0e12e7f>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
 .....
I tried restarting NFS, rebooting the nodes, …, all to no avail. Could it be the switch?
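
One rough way to tell a network problem from a server problem is to check, from an affected node, whether the NFS server's TCP ports can be reached at all. The sketch below is only an illustration, not a diagnosis: the helper names (probe, check_nfs_server) are made up for this example, it assumes the mounts go over TCP (a UDP mount would not be covered by this test), and it uses the standard ports 111 (rpcbind) and 2049 (nfs).

```python
import socket

# Standard ports: 111 is rpcbind/portmapper, 2049 is the NFS service itself.
NFS_PORTS = {"rpcbind": 111, "nfs": 2049}


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_nfs_server(host, ports=NFS_PORTS, timeout=3.0):
    """Probe each named port on host and return {name: reachable}."""
    return {name: probe(host, port, timeout) for name, port in ports.items()}
```

Calling check_nfs_server("10.0.0.1") on a node that logs the timeouts, and again on one that does not, gives a first hint: if both ports fail only on some nodes, the path through the switch is a reasonable suspect; if they fail everywhere, the server is the more likely culprit.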