March 17th, 2014

This time it was disk (port) 1 that was thrown out of the array. The rebuild completed without issues. Are we getting close to a disaster? Do yet another L0 back-up to be on the safe side.

2014/03/18 11:37

March 11th, 2014

New tricks. Nodes that are not running jobs report:

nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
INFO: task tcsh:24370 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tcsh          D 0000000000000000     0 24370  24367
 ffff81007acfb8b8 0000000000000086 0000000000000000 0000000000000046
 ffff81007f4ea280 ffff81007acfb848 ffffffff80b9cb00 0000000000000000
 ffff8100785d2000 ffff8100cb0040d0 ffff8100cb95f370 ffff8100cb004330
Call Trace:
 [<ffffffffa0e12e7f>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
 [<ffffffffa0e12eac>] :sunrpc:rpc_wait_bit_killable+0x2d/0x31
 [<ffffffff80431c3b>] __wait_on_bit+0x41/0x70
 [<ffffffffa0e12e7f>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
.....

Tried restarting NFS, rebooting the nodes, …, all to no avail. The switch?
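
One way to narrow down the "switch or server" question is to probe the NFS server from several nodes at once and compare the results. The following is only a minimal sketch, not what was actually run here: the helper name `tcp_reachable` is made up for illustration, and it assumes the server at 10.0.0.1 accepts TCP connections on the standard NFS port (2049) and the rpcbind/portmapper port (111). If the mounts use UDP only, a failed TCP connect proves nothing.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts and refusals) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 10.0.0.1 is the server named in the kernel messages above.
    # Ports 2049 (nfs) and 111 (rpcbind) are assumptions about this setup.
    server = "10.0.0.1"
    for port, label in ((2049, "nfs"), (111, "rpcbind")):
        state = "reachable" if tcp_reachable(server, port) else "NOT reachable"
        print(f"{server}:{port} ({label}) {state}")
```

If idle nodes on one switch fail the probe while nodes on another switch succeed, that would point towards the switch. A uniform failure everywhere would point back at the server. Either way this is a hint, not a proof.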

2014/03/11 11:22


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)