Jun 21st, 2011

Busy day. Installed two GTX460 cards on n0003 & n0004 and, to be on the safe side, replaced their power supplies with two 550W units. Created two new slurm queues (cuda & noncuda), and ran the ApoA1 test on the four cuda Q6660's (getting close to 4.5 ns/day). Unfortunately, with the GPUs under load the UPSs were pushed beyond their capacity, so a new UPS was added, giving two nodes per UPS (except norma and i7, which have their own). Even with the new UPS, however, the load on one of them slightly exceeds 100%, making it marginally stable. It would probably be better to complicate things and have mixed nodes (with and without GPU) on each UPS. This would mean 1+5 on UPS_1, 2+6 on UPS_2, 3+7 on UPS_3 and 4+8 on UPS_4. It will have to wait for the next power failure.
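For reference, the proposed mixed layout can be written down as a small Python sketch. It only encodes the pairing rule from the entry above (node k shares a UPS with node k+4). The node names n0001 to n0008 are an assumption extrapolated from the names used in this log, and the function name is purely illustrative.

```python
# Hypothetical sketch of the proposed mixed UPS layout.
# Assumption: compute nodes are named n0001..n0008, and node k is paired
# with node k+4 so that each UPS carries one node from each half.

def mixed_ups_layout(n_ups=4):
    """Return a dict mapping each UPS name to the pair of nodes it would feed."""
    layout = {}
    for k in range(1, n_ups + 1):
        layout[f"UPS_{k}"] = (f"n{k:04d}", f"n{k + n_ups:04d}")
    return layout


if __name__ == "__main__":
    # Print one line per UPS with the two nodes assigned to it.
    for ups, nodes in mixed_ups_layout().items():
        print(ups, *nodes)
```

norma and i7 are left out on purpose, since they keep their own UPSs.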

2011/06/21 20:12

Jun 20th, 2011

Power failure period again. Replaced the power supply on n0001 and restarted jobs. Lost two cores on the head-node; will have to reset CMOS at the next power failure (sometime today).

2011/06/20 12:11


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)