Jul 3rd, 2011

Bloody UPS problems again (?). Nodes n0003 & n0004 died and then successfully restarted overnight. There is no sign in the logs of a power failure long enough to be recorded. The other two nodes connected to the same UPSs stayed up and running. The load on all UPSs is the same (~80%). Current working hypothesis is that even very short power disturbances are enough to kill the two GPU-loaded nodes. But why shouldn't this also be the case for n0001 & n0002? :-/

Ignore the above (???). The chassis fans on nodes n0003, n0004 & n0006 are not functional. Could this be it? Will have to wait …

2011/07/03 20:51

Jun 27th, 2011

Power failures again, and again nodes not responding to wake-on-LAN. Re-wired the UPSs as follows:

UPS1 head node + switches + DAT tape
UPS2 n0001 + n0005
UPS3 n0002 + n0006
UPS4 n0003 + n0007
UPS5 n0004 + n0008 (sitting behind the cluster)
UPS6 n0009 (sitting to the left of i7)

Following the re-wiring, the UPSs (at full load) stabilized at ~80%. We'll see how this goes. Took the opportunity to clear the head node's CMOS and regain the two 'lost' cores. To wrap this whole UPS story up, I should have bought 700 VA APC units two years ago and not this 'green-friendly' nonsense …
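
For future fault-hunting, the wiring list above can be kept as a small lookup table, so that when a node dies it is quick to see which UPS it hangs off and what else shares that unit. A minimal sketch only: the dictionary simply mirrors the list in this entry, and the names `UPS_WIRING`, `ups_for` and `neighbours` are made up for illustration.

```python
# Hypothetical lookup built from the wiring list in this entry.
# All names here are illustrative; nothing is read from the UPSs themselves.
UPS_WIRING = {
    "UPS1": ["head node", "switches", "DAT tape"],
    "UPS2": ["n0001", "n0005"],
    "UPS3": ["n0002", "n0006"],
    "UPS4": ["n0003", "n0007"],
    "UPS5": ["n0004", "n0008"],
    "UPS6": ["n0009"],
}


def ups_for(load):
    """Return the UPS a given load is wired to, or None if it is not listed."""
    for ups, loads in UPS_WIRING.items():
        if load in loads:
            return ups
    return None


def neighbours(load):
    """Return the other loads sharing a UPS with the given load."""
    ups = ups_for(load)
    if ups is None:
        return []
    return [other for other in UPS_WIRING[ups] if other != load]


if __name__ == "__main__":
    # Example: which loads share a UPS with two of the compute nodes?
    for node in ("n0003", "n0004"):
        print(node, ups_for(node), neighbours(node))
```

With this wiring, n0003 shares UPS4 with n0007 and n0004 shares UPS5 with n0008.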

2011/06/27 19:54


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)