March 11th, 2014

New tricks. Nodes that are not running jobs report:

nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
INFO: task tcsh:24370 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tcsh          D 0000000000000000     0 24370  24367
 ffff81007acfb8b8 0000000000000086 0000000000000000 0000000000000046
 ffff81007f4ea280 ffff81007acfb848 ffffffff80b9cb00 0000000000000000
 ffff8100785d2000 ffff8100cb0040d0 ffff8100cb95f370 ffff8100cb004330
Call Trace:
 [<ffffffffa0e12e7f>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
 [<ffffffffa0e12eac>] :sunrpc:rpc_wait_bit_killable+0x2d/0x31
 [<ffffffff80431c3b>] __wait_on_bit+0x41/0x70
 [<ffffffffa0e12e7f>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
.....

Tried restarting NFS, rebooting the nodes, …, all to no avail. Could the switch be the problem?
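
One way to narrow down whether the network path or the NFS service is at fault is to probe the server's ports from an affected node. The following is only a rough sketch, not something that was run here: the helper names are made up for illustration, the ports 111 (portmapper) and 2049 (nfs) are the usual defaults and are assumed rather than confirmed for this cluster, and it only tests TCP, so a UDP-only mount would need a different check.

```python
import socket

NFS_SERVER = "10.0.0.1"  # server address as it appears in the log above
# Standard port numbers; assumed, not verified against this cluster's setup.
PORTS = {"portmapper": 111, "nfs": 2049}


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def report(host=NFS_SERVER, ports=PORTS):
    """Print one line per service saying whether its port answered."""
    for name, port in ports.items():
        state = "reachable" if port_reachable(host, port) else "NOT reachable"
        print(f"{name} ({host}:{port}): {state}")
```

Calling report() on an idle node and on a node that is running jobs, and comparing the results, might show whether only some nodes lose the path to the server. If both ports answer from every node, the switch becomes a less likely suspect, though intermittent packet loss would not show up in a single probe.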

maintenance/march_11th_2014.txt · Last modified: 2014/03/11 11:22 (external edit)