Feb 18th, 2015

The problems with n0001 (or is it the switch?) continue. The major symptom was that the node hung as soon as a job was started. Stand-alone memory and CPU testing of the node showed no problems. Then the switch ports of n0001 and n0008 were swapped. During the first test, the node hung again. Then, with nothing else changed, it behaved and the job ran without problems. At the next power failure I'll try to cold-start everything in the cluster room.

maintenance/feb_18th_2015.txt · Last modified: 2015/02/18 13:45 (external edit)