Jul 3rd, 2011

Bloody UPS problems again (?). Nodes n0003 & n0004 died and then successfully restarted overnight. The logs show no sign of a power failure long enough to be recorded. The other two nodes connected to the same UPSs stayed up and running. The load on all UPSs is the same (~80%). Current working hypothesis is that even very short power disturbances are enough to kill the two GPU-loaded nodes. But why shouldn't this also be the case for n0001 & n0002? :-/

Ignore the above (???). On nodes n0003, n0004 & n0006 the chassis fans are not functional. Could this be it? Will have to wait …

