February 2nd, 2022 That's new. Network card failed 8-) Thankfully, it was not the motherboard card. Replaced.
Sept 14th, 2020 n0011 dead. Start digging ... :-(
August 29th, 2020 Motherboard failed on head node. Cannibalized n0007. Then SSL was too old. Switched to lighttpd. What a mess ...
Benchmarks : May 17th-20th, 2018 The results first. System with 9418 atoms, 4 fs/step (HMR). Columns are : box/GPU combination, performance in ns/day, cudaPME OK ?, ns/day without cudaPME.
n0011 IBM server (32 cores on an AMD 6234 @ 2.4 GHz, no cuda) : 70, -
n0007 (Old Q6600 @ 2.4 GHz box) + GTX 1050 : 140, ✔
n0009 (i7-975, 4 cores @ 3.33 GHz) + GTX 1050 : 200, ✔
n0010 (AMD FX-8150, 8 c…
August 31st, 2019 The switch is dead. Need new one. Almost all old machines are off, so we could possibly get away with a 12 port device ? Got a TP-LINK Switch 16PORT 10/100/1000 Rackmount TL-SG1016D, will replace it tomorrow ... Done. The old procurve switch is still on active warranty, will send it to HP.
Jun 17th, 2018 Thunderstorm killed one UPS (battery replaced) and created problems booting n0010. :-( More thunderstorms coming soon, hardware check-up time.
Oct 11th, 2017 RAID's port 0 disk replaced. Array rebuilt and verified without problems.
Oct 10th, 2017 Again port 0 kicked-out. Next time the disk must be replaced.
Oct 8th, 2017 Again port 0 gone. Slowly but steadily getting to the stage of replacing the disk.
Sept 28th, 2017
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x39206300.
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x39206316.
3w-9xxx: scsi4: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=0.
3w-9xxx: scsi4: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=0.
sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.
3w-9xxx: scsi4: AEN: INFO (0x04:0x005E): Cache synchronized after power fail:unit=0.
3w-9xxx:…
Sept 24th, 2017 Again RAID's port 0 :
3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
3w-9xxx: scsi4: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x380C828E.
3w-9xxx: scsi4: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=0.
3w-9xxx: scsi4: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.
3w-9xxx: scsi4: AEN: INFO (0x04:0x0005): Rebuild completed:unit=0.
Oct 12-16th, 2016 Continuous power failures. Leave everything off for now ...
Sept 18th, 2016 Motherboard died on norma. Let the cannibalizing period of the cluster begin : take out n0008, move everything from norma (cards, RAID, disks, ...) to n0008, do not forget to re-enable USB, extra memory and SpeedStep through BIOS, put everything back together. After ~3 hours it looks like we are back in business (for now). Restart jobs.
Aug 30th, 2016 Port 2 on RAID lost. Rebuild appears to be successful.
Feb 22nd, 2016 Finally a solution to n0010's instability problems : changed the BIOS settings concerning an 'AMD turbo mode' and made all 8 cores permanently active -> looks stable (tested with a job that previously kept crashing within minutes).
Oct 14th, 2015 Trouble :
Oct 14 12:49:40 norma kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x000D): PCI Abort: clearing.
Oct 14 12:49:40 norma kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x000D): PCI Abort: clearing.
Oct 14 12:49:59 norma kernel: Machine check events logged
Oct 14 16:51:41 norma kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x000C): PCI Parity Error: clearing.
Oct 14 16:51:41 norma kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x000D): PCI Abort: clearing.
Oct 14 16:51:41 norma kernel: 3w-9xxx: scsi4: ERROR: (0x03:0x010D)…
June 1st, 2015 Again lost port 2 disk. About time to replace it ???
May 20th, 2015 Port 2 thrown out of the disk array twice. Rebuild completed successfully the second time, verification in progress ... Time is up for disk 2 ?
May 5th, 2015 Freon added to A/C unit : dramatic improvement. See how long it will last.
Mar 26th, 2015 Cluster overheated. You can tell spring is coming ... :-(
Mar 24th, 2015 Disk kicked-out of n0011's array. Reloading the 'foreign configuration' appears to have fixed the problem (?). It appears that it could have gone much worse : RAID problem
Mar 7th, 2015 After a power failure hard rebooted the switch. Apparently all nodes came up without detectable network problems (?). Try starting a job on n0001 to see if it is stable now : No, still fails immediately.
Feb 18th, 2015 The problems with n0001 (or is it the switch ?) continue. The major symptom was that once a job was started, the node hung. The node was subjected to memory and CPU testing (stand-alone) which showed no problems. Then the switch port was exchanged between n0001 and n0008. During the first test, the node hung again. Then (without changing anything else), it behaved and the job ran without problems. At the next power failure I'll try to cold-start everything in the cluster room.
Feb 6th, 2015 Continuous power failures continue. To top it up, significant icing on the A/C unit. Wait for northerly winds on Saturday ... :-/
Feb 1st-4th, 2015 Continuous power failures due to southerly salty winds and African dust :-)
Sept 17th, 2014 Got a GTX 660 (with a 750W PSU) to replace the broken GPU. Installed it and tested on n0005. It is ~20% slower than a GTX 460 for the given small system but for 200 euros I shouldn't complain. While replacing the card, it became apparent that the problem with the old card is possibly a broken fan. So : placed the 660 on n0005, took the Quadro card out of n0001, placed the old GTX460 in n0001 with the broken fan removed and a chassis fan placed on top of the card :-D Will test it ... -> Surprisi…
Sept 18th, 2014 After the addition of the chassis fan on the GPU, n0001 appears to be computationally stable. The problem with the ethernet, however, is still present : even after downgrading the port to 100 Mbps, the tell-tale signs are still there
Sep 17 19:06:13 norma slurmctld[30609]: error: slurm_receive_msgs: Socket timed out on send/recv operation
Sep 17 19:06:13 norma slurmctld[30609]: error: slurm_send_recv_msgs(_send_and_recv_msgs) to n0001: Socket timed out on send/recv operation
Sep 17 19:06:13 no…
Sept 10th, 2014 n0001 is dead for all practical purposes (on-board ethernet ?). So : shuffle graphics cards n0008 -> n0001, n0001 -> n0005, n0005 -> n0008. Then add n0008 to CUDA partition, start a job on n0005, leave n0001 idle. The PSU on n0008 is a 600W unit, we'll see how long it will last ...
Aug 3rd, 2014 Continuous power failures. Restarting again and again and again ... :-/
Aug 2nd, 2014 Thunderstorms. Went down twice. Leave it down o/n ...
Jun 12th, 2014 Continuing problems : problems with the switch downgrading ports, with the array, with the UPSes, ...
June 4th, 2014 Again lost the array. NAMD restart files and their backups were zeroed, a huge mess. A crontab entry guarantees that this will not happen again :
*/30 * * * * cd /home/glykos/work ; find . -size +50c -name 'restart.vel' -exec cp {} '{}.SAFE' \;
*/30 * * * * cd /home/glykos/work ; find . -size +50c -name 'restart.coor' -exec cp {} '{}.SAFE' \;
*/30 * * * * cd /home/glykos/work ; find . -size +50c -name 'restart.xsc' -exec cp {} '{}.SAFE' \;
*/30 * * * * cd /home/glykos/work ; find . -size +50c -name 're…
May 27th, 2014 Port 0 thrown out of the array this time. Add a NAS with a /home2 on it and only use the server for slurm + perceus ???
May 8th, 2014 Port 2 thrown-out of the array. Rebuild completed without incidents, but with two disks giving us trouble we are getting close to the point of no return.
May 4th, 2014
3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x37F3B5BE.
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x37F1CF00.
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x37F1CC80.
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x37F3B5BE.
3w-9xxx: scsi4: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
March 18th, 2014 The problem with n0001 switch ports re-appeared. Had to downgrade the ports to 100T to boot it. Obviously there are good reasons for choosing five years as the expected lifetime of a cluster ...
Mar 17th, 2014 This time it was disk (port) 1 that was thrown out of the array. Rebuild completed without issues. Are we getting close to a disaster ? Do yet another L0 back-up to be on the safe side.
March 11th, 2014 New tricks. Nodes that are not running jobs report :
nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
nfs: server 10.0.0.1 not responding, timed out
INFO: task tcsh:24370 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tcsh          D 0000000000000000     0 24370  24367
ffff81007acfb8b8 0000000000000086 0000000000000000 0000000000000046
ffff81007f4ea280 ffff81007acfb848 ffffffff80b9cb00 …
March 2nd, 2014 Disk (port) 2 thrown out of array. Rebuild completed without issues. We'll see how it goes (this is one of the old, initially installed disks). It is TS country again, let the cluster sleep for a day ...
Jan 30th, 2014 That's new (on n0001) : PXE-E61 Media Test Failure, Check Cable ... and PXE boot aborted. Checked cables -> probably not the problem. Moved them to new ports -> it booted. Went to the switch and, voila, the respective ports on the switch had been downgraded to 100 and 10T. Forced them both to 100T, imposed flow control, and this time the node booted.
Jan 27th, 2014 Made progress. A/C looks nice, stable and cool without icing for the last 10 days.
Jan 17th, 2014 It took a little longer (like 4 days) but the A/C was 'serviced'. Try it ...
Jan 13th, 2014 Reached the point where the A/C must be serviced-fixed-replaced. Shut down the nodes, give it another try tomorrow, call the service and if all else fails, buy a replacement (?). Leaving o/n didn't work. Try servicing it. Will take a couple of days.
Nov 4th, 2013 Iced-up again. Do the drill.
Oct 27th, 2013 Try yet another setting for the bloody A/C => Had to go to a target temperature of 26 deg C (up 10 deg from the previous setting) to avoid ice build-up within minutes.
Oct 25th & 26th, 2013 Icing on A/C again. Shut down n0001-n0008, let it thaw o/n. Iced again. With ~98% relative humidity possibly not surprising. Too hot, too humid, too windy, too real life, we can't stabilize the bloody thing ... :-/
Oct 11th, 2013 No way to stabilize temperatures at reasonable levels. Everything tried so far made matters worse. Running out of options. This is horrible, but better slow than sorry : reduce load on cluster by placing on hold two very long jobs (which will get restarted once two other shorter jobs are finished).
Oct 9th, 2013 Try again : take them up, restart jobs, watch temperatures. => It will not happen.
Oct 8th, 2013 Sick and tired. Again icing, again over-heating, lost two cores on our way to shutdown ... ;-E Start de-icing procedure again.
Oct 5th, 2013 Fighting to stabilize temperatures. Failing.
Oct 4th, 2013 Double power failure in the morning. The A/C was left in the de-hydrate mode for 24 hours. Try again with cooling mode while watching for icing problems.
Oct 3rd, 2013 Again icing on the A/C, everything auto-shutdown due to overheating :-/ Will have to dry it thoroughly and be around it when starting ... => and to make things even better, central A/C was turned to heating. Lovely. Move the cluster to the roof of the building ? :-D
Oct 2nd, 2013 A very long ~12 hour power failure, and then the next morning, the A/C had a nice coat of ice on it. Disgusting. Switch-off nodes 1-8, switch off A/C, wait to de-ice slowly. => Restart everything late in the afternoon.
Sept 27th, 2013 TS, power down after 63 (!) days. Weather unstable, let it rest o/n ... No, wake them and start a few jobs.
Sept 26th & 27th, 2013 Losing control of the temperatures again : [ Temperatures over the last 300 hours ] Try changing the set-up in the computer room again : almost completely removed the roof panels. => Can't get it to stabilize. I'll burn everything down by the look of it :-/
Sept 18th, 2013 Tried new VNFS capsule with n0010 and is looking good (have to fix NFS mounting of /ibm some time). NAMD-CUDA is a little bit faster. We'll have to wait and see if that fixes the instabilities observed with this node. => Looks stable.
Sept 15th & 16th, 2013 Temperatures high (and the new server with 5 fans blowing hot air into the room doesn't help). Try changing the ventilation set-up and watch temperatures. Well, it is not very surprising that the room heats-up, is it ? Back in 2009 we had 9 machines each burning about ~300 Watts, which gives 3.412*2700 ≈ 9200 BTU/h. Which means that the 9000 BTU unit could possibly manage to keep the room cool. Then we added 4 GPUs on the old nodes, each at ~200W. Then we got n0009 and n0010 with nominal requi…
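For the record, the heat-load arithmetic above can be redone in a line of shell (a rough sketch; the wattages are just the approximate figures quoted in this entry, at 3.412 BTU/h per Watt) :
echo "9 * 300 * 3.412" | bc            # the original nine boxes : ~9212 BTU/h
echo "(9*300 + 4*200) * 3.412" | bc    # after adding the 4 GPUs : ~11942 BTU/h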
Sept 13th, 2013 Replace chassis fans on head node plus n0003 -> OK Take-out quadro card from n0011, place it in n0008, change slurm settings for queues, change VNFS capsule for n0008 -> OK Test the card on n0008 with the 25,000 atom benchmark : 3.03 ns/day without CUDA, 6.44 ns/day with CUDA -> OK
Sept 9th, 10th & 11th, 2013 Received the nvidia K2000D card for the IBM server. Hardware-wise the installation appears to have gone well (the card fits the riser assembly with no problems whatsoever). But then, again, you can't have everything : Booting stops at the initial firmware screen (never gets to the <F2>, <F12>, ..., screen). Will have to make the hardware installation cycle a couple of times, test the card on a different machine, confirm that the problem is due to the card, confirm that it is not an issue with the…
Aug 23rd, 2013 Fighting with slurm on n0011. After changing the executables to the old(er) version, replacing the corresponding libraries, and fixing passwordless ssh to the node, n0011 appears in the sinfo list. Next targets are :
* fix slurm.conf to allow more than one job to run on n0011 simultaneously => Done (see the sketch below)
* update NAMDjob to allow (semi-)automatic usage of the new node => Done
* Use LSI webbios to prepare a RAID 1 (mirroring) => Done
* format n0011's disks and export as /home2 with a mode of 1777 to…
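For the first item above, the slurm.conf change boils down to a node definition plus a partition that allows shared scheduling on n0011. A hypothetical fragment along those lines (the partition name and the oversubscription factor are assumptions, not the actual settings used) :
NodeName=n0011 Procs=48 State=UNKNOWN
PartitionName=ibm Nodes=n0011 Shared=FORCE:4 MaxTime=INFINITE State=UP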
Aug 25th, 2013 Playing with n0011 software-wise, looking reasonably good. Temperatures are up again and I'm not sure why (norma's and n0003's chassis fans are dead, but this shouldn't be enough). Also n0010 died suddenly again. It looks like yet another hardware crisis is at hand ...
Aug 1st & 2nd 2013 Tried an old but recently updated 2.6.18 kernel -> not surprisingly, couldn't get past the kexec stage. So, back to the drawing board. I seem to recall that the correction for the AMD-family-related memory problem was a few lines of code in just one kernel module. So, the question is : can the correction be applied to the corresponding module of our old kernel ?
Jul 31st, 2013 Fail, fail and fail again. Keep trying : => Try to update udev in the VNFS capsule -> Failed again.
Jul 29th & 30th, 2013 Horrible mess. After numerous attempts with chroot's, mkinitramfs, perceus tweaking, etc, it turns out that the broadcom ethernet cards need a firmware file for the kernels I was testing. To make matters worse, the loss of memory mapping persists even with 2.6.32-9. In summary, after ~10 hours we'll have to go back and start with a more recent kernel. :-/
Jul 28th, 2013 * Switched-on external A/C, programmed it to automatically reset itself (off and on again) twice a day. Temperatures improved, watching stability. * Still fighting with applying a new kernel to the VNFS capsule for the new IBM box. Still failing (kernel boots-up, can't get ethernet setup to work correctly (a mess with modules ? depmod -a missing ??)).
Jul 26th, 2013 Temperature alarm, everything went down. A/C unit ??? Restart everything, switch-off external A/C, let it equilibrate, watch temps. n0011 update : I hate getting anywhere near the kernel, but here we are again messing (unsuccessfully) with the VNFS capsules.
Jul 25th, 2013 Arranged power line and place for the IBM box, and got it seated : [ New IBM box] Powered it on, looked good, then tried connecting it to the existing perceus/slurm environment. Surprisingly, it booted, all 48 cores were visible, but you can't have everything :
Jul 22nd, 2013 Power failure o/n. n0009 PSU dead, n0002 PSU dead, rsync (correctly) deleted backup on second disk. All in all, one of those days ... :-E
Jul 23rd, 2013 * PSU on n0009 replaced with a 750W unit. It won't last long (the previous was a 1000W crugar unit). * PSU on n0002 replaced with a 600W unit. We'll see. * Try to dependably get rsync to work. Still failing. * Start thinking about the new IBM x3755 M3 box. We need at least kernel 2.6.32+, and ideally 2.6.37+. But norma only has 2.6.26. We'll have to play with VNFS capsules (and it will not be very nice).
Jul 21st, 2013 Short, continuous power failures due to TSs in the neighborhood. Let it rest o/n.
Jul 14th, 2013 n0001 PSU burned and replaced with a raptor 750W. The PSU that was replaced was claimed to be a 550W unit. How on earth do they calculate the stated wattages on PSUs ??? All-in-all -and after 4.5 years of continuous usage- this Beowulf cluster's weak links were [1] PSUs (x12 replacements and counting), [2] disks (x2 but this was possibly due to a bad batch), [3] GPUs (x1), [4] UPS (all over-rated, no way to keep the machines up for more than 30 seconds). No problems with memories or CPUs.
Jun 28th, 2013 n0002 PSU refusing to cooperate. Sick and tired of replacing PSUs.
Jun 23rd, 2013 82 days without a power failure !?!
May 21st, 2013 PSU on n0003 replaced.
May 9th, 2013 n0003 refuses to boot. PSU again ?
Apr 26th, 2013 Hate it :
3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=0, LBA=0x39E74766.
3w-9xxx: scsi4: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=0.
3w-9xxx: scsi4: AEN: ERROR (0x04:0x003A): Drive power on reset detected:port=0.
3w-9xxx: scsi4: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.
3w-9xxx: scsi4: AEN: INFO (0x04:0x0005): Rebuild completed:unit=0.
Apr 3rd, 2013 Power failure. Took the opportunity to replace PSU on n0007 and chassis fan on n0002. n0003 is giving power-up problems, yet another PSU about to fail ?
Mar 28th, 2013 Circuit breaker blown, power down in the whole cluster room. It looks as if n0007 (which had shown abnormally high temperatures) had died an impressive death. Still looking into it ...
Mar 10th, 2013 Power supply on n0008 died. Replaced it with a 600W unit. Also replaced chassis fan on n0005.
Feb 7th, 2013 A dark, stormy night. We will go down soon enough ... ;-)
Jan 24th, 2013 Impressive long-lasting TS over station. Play safe and let it rest ...
Nov 25th, 2012 No, it wasn't the chassis fan the problem with n0007 ... The power supply ???
Nov 24th, 2012 n0007 hot :
n0006: Core 0: +65.0 C (high = +82.0 C, crit = +100.0 C)
n0006: Core 1: +63.0 C (high = +82.0 C, crit = +100.0 C)
n0006: Core 2: +59.0 C (high = +82.0 C, crit = +100.0 C)
n0006: Core 3: +57.0 C (high = +82.0 C, crit = +100.0 C)
n0007: Core 0: +71.0 C (high = +82.0 C, crit = +100.0 C)
n0007: Core 1: +72.0 C (high = +82.0 C, crit = +100.0 C)
n0007: Core 2: +69.0 C (high = +82.0 C, crit = +100.0 C)
n0007: Core 3: +69.0 C (…
October 29th, 2012 Thunderstorm season ? Leave everything down until electricity stabilizes (possibly on the 31st) ... Took the opportunity and replaced power supply on n0006 (in situ !, good fun). Later in the evening gave-up waiting and re-started jobs. Will probably regret it ...
October 3rd, 2012 Trouble, trouble, trouble ...
Sep 30 06:07:12 norma kernel: 3w-9xxx: scsi4: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
Oct 1 09:14:50 norma kernel: 3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=2, LBA=0x34C8E1C6.
Oct 1 09:14:50 norma kernel: 3w-9xxx: scsi4: AEN: WARNING (0x04:0x0023): Sector repair completed:port=2, LBA=0x34C8E100.
Oct 1 09:14:50 norma kernel: 3w-9xxx: scsi4: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=2.
Oct 1 09:15:22 norma ker…
Sept 26th, 2012 Head node dead, 'bus error' reported for each and every command, array degraded (port 2 this time). Getting worryingly close to serious trouble by the look of it. Rebuild succeeded, restart jobs ...
Aug 28th, 2012 Aha. From /var/log/messages :
Aug 28 12:38:04 norma kernel: Unable to find swap-space signature
with first appearance on August the 7th. Could this be it ? mkswap, swapon to go back to :
Aug 28 18:48:35 norma kernel: Adding 3140696k swap on /dev/sda2. Priority:-1 extents:1 across:3140696k
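The recovery implied above, spelled out (run as root; the device is the one named in the log line) :
mkswap /dev/sda2       # re-create the swap-space signature
swapon /dev/sda2       # enable it again
swapon -s              # should list /dev/sda2, matching the 'Adding ... swap' message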
Aug 19th, 2012 Head node went down at the least opportune moment. Had to wait till the 28th of August for physical access. 3ware showed all ports normal, so it may not have been the disk again. Take everything up again. All restart files from one job were empty. Start fixing things ...
Aug 8th, 2012 So far so good (one verification pass made without problems). And then : process DCDs, build and install rosetta v.3.4 and ambertools v.12, xfs_fsr (to completion), reboot, xfs_check, xfs_repair, level zero backup of / and /boot -> back to safety. Restart jobs.
Aug 7th, 2012 Port 0 of the RAID thrown out again (which brought the system down). Last time it was again port 0, two months ago. Try xfs_repair + rebuild (but the disk will have to be replaced at some point).
Aug 6th, 2012 Tried to get SLURM to work with ccp4i. We got to the stage where the job gets submitted, but : (a) the interface does not pick-up the running job (could cause clashes with file names), (b) ccp4i cannot watch the job's progress. Possibly safer to submit the ccp4ish jobs by hand with sbatch ...
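Submitting 'by hand' would look roughly like this (the script name is a placeholder for whatever run script ccp4i writes for the task) :
sbatch -n 1 -o refine_1.log ./refine_1_run.sh
squeue                 # watch progress from the queue instead of from the interface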
Jun 14th, 2012 RAID looks OK (???) after the rebuild. Start cleaning-up the mess left behind from the failure. Schedule a RAID verification to check again that the disk is indeed usable ...
Jun 13th, 2012 RAID degraded, port 0 (WD-WCASY20895) was lost. This also brought the whole system down. Did an xfs_repair on the degraded array, then re-mounted the (probably faulty) disk and started a re-build to verify that the disk is indeed broken (before physically replacing it).
May 31st, 2012 Continuous power failures. Let it rest ...
May 24th, 2012 All major NAMD executables (CUDA, multicore, TCP) updated to 2.9.
May 20th, 2012 Three power failures within 5 hours. Let it rest ... :-/
May 19th, 2012 Thunderstorms every few hours ... Installed external 2T USB disk and moved accumulated material to it. Main RAID usage reduced to 5% (external disk at 22% with the whole /home copied to it).
May 13th, 2012 Spring thunderstorm season is on ... :-/
Apr 8th, 2012 Two power failures back-to-back, thunderstorm in the vicinity. Let it rest o/n ...
Mar 29th, 2012 Average core temperatures over the last 5000 hours :
Feb 26th, 2012 A hiccup from adaptive tempering again. Following a carefully improvised set of voodoo movements (like ignoring the restart file from the previous tempering run and building one from scratch), order appears to have been restored ... -> No, order hasn't been restored. Keep trying ...
Feb 19th, 2012 All GPU fans adjusted with the exception of the dual GTX295. Whenever fan settings changed (on VGA bios with NiBiTor 6.0.3), the card falls-off the bus (according to dmesg). Leave it alone for the time being. GTX295's fan speed was finally adjusted by directly editing (with hexedit) the bios. The correct byte was identified through comparison with the GTX260 bios. Don't want to have to do this again ...
Feb 24th, 2012 Variation of GPU temperatures (4 x GTX460, GTX260, GTX295, GTX570) versus time over a period of 18 hours. Temperatures as reported by nvidia-smi : [GPU temperatures]
Feb 22nd, 2012 Evolution of the average CPU cores' temperatures (for 'hot' nodes only) over the last 2000 hours : [Temperatures]
Feb 20th, 2012 Zero level backup. UPS box added to support n0010 -> nominally 850 VA, but doesn't look very convincing. We'll see ...
Feb 18th, 2012 Started changing fan speeds on GPUs using nvflash + NiBiTor. Set them to 80% ... n0002 (GTX460) -> Done. n0005 (GTX260) -> Done. n0010 (GTX570) -> Done. n0009 (GTX295) -> Problem. Two bios-es with seemingly different settings (???). Tried, failed, decided to leave alone ...
Feb 13th, 2012 Power failures again. Keep the machine down o/n.
Feb 11th, 2012 Communication problems for slurm-perceus-new node -> fixed (possibly).
Feb 10th, 2012 New node appears to be ready for burn-in. Start a job ...
Feb 9th, 2012 Power failure spree continues ...
Feb 7th, 2012 New box with 8-core AMD + GTX570 arrived. Images and benchmarks will appear in 'Benchmarks'
Feb 6th-7th, 2012 Continuous power failures ... Keep trying, keep failing ...
Feb 1st, 2012 Moved the GTX260 GPU from the head node to n0005 and replaced n0005's PSU with a 650W unit. Take everything up again : Looks OK. ... and then after two hours had this excellent idea of moving the 3ware card to the newly emptied PCI-e slot. Only to discover that this 3ware card can only fit a PCI-X slot. After a bit of cursing, put everything back in their original places, cursed a bit more, take it up, restart jobs. :-/
Jan 31st, 2012 Pipes froze, water started dripping in the terminal room [thankfully not in the cluster room (yet ?)]. To top it up, had three power failures back-to-back. Later in the evening started looking stable again (?) ...
Jan 30th, 2012 (ii) The GTX260 on the head node is wasted. Should it be moved to a compute-only node (n0005) ??? Test it first. * For a small system (5,500 atoms) : GTX460 -> 0.032 days/ns, GTX260 -> 0.034 days/ns. * For a slightly larger system (7,400 atoms) : GTX460 -> 0.043 days/ns, GTX260 -> 0.062 days/ns.
Jan 30th, 2012 PSU on n0004 replaced. Looks OK.
Jan 28th, 2012 n0004 died. Will have to wait till tomorrow. It was the PSU (550W) again ...
Jan 4th, 2012 A/C unit moved. Watch temperatures -> Worse than before ... :-( Average core temperatures vs time over the last 2000 hours. Limits on y range from 49 to 67 degrees. [Average core temperature vs time (cluster-wide)]
Jan 20th, 2012 That smell again ... A PSU about to die. Possibly n0002's (?). Indeed died two hours later. Replace PSU, restart job.
Dec 22nd, 2011 Restart the jobs while waiting for the rain (if not winter) to pass ... Change the A/C setup : make the direction of air stream constant (and pointing to the lower part of the cluster) -> significant difference in average core temperatures (to the better). Wait for possible icing problems -> still looking stable after three days ...
Nov 20th, 2011 Update added 1/1/2012 : Adaptive tempering now stable with CVS-2011-12-12. Adaptive tempering with May's CVS version dumps core. Updated NAMD-CUDA + UDP-NAMD (but not multicore) to the latest (20th of November) CVS version. Well, no : it still dumps core upon initialization, but not consistently (occasionally starts without any apparent problem). Not stable yet ...
Dec 21st, 2011 Overheated again due to ice accumulation. Call for A/C unit servicing .... -> Can't service it because it is raining ... ;-/
Dec 8th, 2011 Norma overheated at ~2200 on the night of the 7th. Restart jobs next morning -> temperatures again high. Open cover of A/C unit -> covered throughout with 0.5cm of ice. Shutdown everything, let the ice thaw slowly ... After ~6 hours : try again (A/C on, restart jobs) -> watch it.
Dec 4th, 2011 First attempt to update NAMDautorestart to include the possibility of an adaptive tempering run. Will have to wait for the next power failure (which won't be long, it's been 12 days now ...).
Dec 3rd, 2011 NAMD CVS 30/11 installed. This contains bug fixes for adaptive tempering from Johan Strumpfer. Start a long run to see how it goes -> So far so good ...
Nov 21st, 2011 Double power failure : 'lost' two cores, network failed. Dump CMOS -> got cores back. Later network also came back-on. Current version of NAMD's adaptive tempering has teething problems. It gives seg faults and restarts (via a restart file) fail on keyword parsing. Having said that, do keep testing it ...
Nov 22nd, 2011 Power keeps on failing twice a day ... :-/
Nov 8th, 2011 Norma is hot, shutting everything down now ... Average core temperature of the active (job-wise) nodes got higher than 63 degrees ... The good news is that monitoring worked. The bad news is that it was triggered. A/C unit appeared to be OK and the room cool. Restart the jobs and schedule an A/C unit servicing ...
Nov 7th, 2011 New GTX460 card for n0001 arrived. Install and start using immediately ...
Oct 28th, 2011 Hiccup on n0001. No network traffic, namd job apparently still present (but with less overall load for node), job stopped writing to the disk. Stop & restart the job -> did it again with same symptoms. Reboot node -> Happened again. OK, have something. dmesg points to the GPU :
Oct 13th, 2011 Ooops. Power failure exactly after writing the restart .vel and .xsc and opening (but not writing) the .coor. The result was an empty restart.coor. Because (through the NAMDautorestart script) I foolishly deleted the restart*.old files, no restart files were available. Because I also foolishly deleted the RUN_??.dcd file, no restart file could be prepared. Nice job, well done ... :-\
Aug 23rd, 2011 Too many short-spaced power failures throughout the day. Let it rest ...
Aug 3rd, 2011 Daily power failures allowed testing the wake-them-up drill. On all four occasions everything went smoothly (from sending the WOL packets, to restarting the running jobs). Which leads us to NMG's corollary to Murphy's ninth law : “Everything will appear to work beautifully until five minutes after you leave for summer vacations.”
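For reference, the 'sending the WOL packets' part of the drill is essentially the following (a sketch only : the interface name and MAC addresses are placeholders, and ether-wake -- which ships with net-tools -- is one of several tools that would do) :
for mac in 00:30:48:00:00:01 00:30:48:00:00:02 ; do
    ether-wake -i eth1 "$mac"
done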
Jul 29th, 2011 Power failure caused two cores on n0002 to disappear. Had enough and rolled back n0002's BIOS to v.1.4, which, however, is not the same release as the one found on the other nodes ... ;-/
Jul 20th, 2011 n0003 & n0004 keep on dying unexpectedly. Is it hardware, is it the UPSs, the new GPUs, their power supplies ? For these two nodes “to be on the safe side replaced their power supplies with two 550W units”. Could this be it ?
Jul 16th, 2011 Power failed. Took the opportunity to empty the nodes' UPS batteries completely (until the UPSs shut down). Will do a couple of complete discharge-recharge cycles with the forthcoming power failures.
Jul 7th, 2011 It was not a problem with fans, it is the bloody green-friendly UPSs : A momentary voltage drop (a couple of seconds) was enough to kill again nodes n0003 & n0004. Replace these UPSs with proper APC 700VA units ?
Jul 4th, 2011 Replaced chassis fans on n0003, n0004, n0006. Removed side panel on n0009, will have to watch temperatures. First impression is that temperatures are looking good. A graph will appear after a week-or-so ...
Jul 3rd, 2011 Bloody UPS problems again (?). Nodes n0003 & n0004 died and then successfully restarted o/n. No sign in the logs of a power failure of sufficient length to be recorded. The other two nodes connected to the same UPSs stayed up-and-running. The load on all UPSs is the same (at ~80%). Current working hypothesis is that even very short power disturbances are sufficient for killing the two GPU-loaded nodes. But why shouldn't this also be the case for n0001 & n0002 ? :-/
Jun 27th, 2011 Again power failures, again nodes not responding to wake-on-LAN. Re-wired UPSs as follows :
UPS1 : head node + switches + DAT tape
UPS2 : n0001 + n0005
UPS3 : n0002 + n0006
UPS4 : n0003 + n0007
UPS5 : n0004 + n0008 (sitting behind the cluster)
UPS6 : n0009 (sitting to the left of i7)
Following the re-wiring, the UPSs (at full load) stabilized at ~80%. We'll see how this goes. Took the opportunity to clear the head node's CMOS and re-gain the two 'lost' cores. To wrap this whole UPS …
Jun 26th, 2011 Oh well. Short power failure killed nodes n0001-n0004. n0002 upon reboot 'lost' two cores (bloody BIOS update to 3.1 and bloody 'green' UPS's). Decided to try my luck remotely via flashbios hoping to go back to BIOS v.1.4. Unfortunately, but not unexpectedly, this didn't work as expected, especially when considering that I used a BIOS dump of a different node. After a short trip to access the cluster physically, and a few problems with MAC addresses, n0002 is up again, but still on BIOS v.3.1. D…
Jun 25th, 2011 TS over station. Five nodes didn't respond to wake-on-LAN, they will have to wait till tomorrow (assuming the power stays on o/n, which appears to be doubtful). Second close-spaced black-out killed another two nodes. The Déjà vu is strong : ”...
Jun 21st, 2011 Busy day. Installed two GTX460 on n0003 & n0004, to be on the safe side replaced their power supplies with two 550W units, created two new slurm queues (cuda & noncuda), and ran the ApoA1 test on the four cuda Q6600's (getting close to 4.5 ns/day). Unfortunately, the load on the UPSs with the GPUs loaded exceeded their capacity, so a new UPS was added leading to a combination with two nodes per UPS (except norma and i7 which have their own). Even with the new UPS, however, the load on one of the…
Jun 20th, 2011 Power failure period again. Replaced power supply on n0001, restarted jobs. Lost two cores on head-node, will have to reset CMOS on the next power failure (sometime today).
Jun 19th, 2011 Installed mcKmeans, R 2.13.0 and JRE 1.6.0_26. n0001's power supply failed (it was one of the old ones and didn't stand the additional load from the GPU).
Jun 17th, 2011 Modify NAMDjob to allow submission of CUDA-enabled jobs and to preferentially fill non-cuda-capable nodes for non-cuda jobs.
Jun 16th, 2011 The two new GTX460 cards arrived and were installed on n0001 and n0002. New nvidia driver (275.09.07) installed cluster-wide (including the VNFS capsule). NAMDjob modified to enable submission of CUDA-enabled NAMD jobs to the default cluster partition. First tests look promising given the cards' price : for small systems, jobs run ~30% faster on one node+CUDA than on two nodes.
Jun 12th, 2011 Power failure (after 19 days). Change i7's UPS, xfs_check RAID.
May 24th, 2011 Broken NFS mounts causing load instabilities ? Restart NFS cluster-wide and wait ... No, this wasn't the culprit. named ? No. portmapper ? No. OK. A power failure to save the day ? :-/ SSH timeouts are the source of the instability in the reported load. So, both NFS and SSH timeouts. Is it the switch ?
May 21st, 2011 It seems that this temperature-monitoring script does work:
May 21 20:02:11 norma logger: [temp] Alert, current average core temperature is 59 deg C
May 21 20:02:11 norma logger: [temp] Temperature alert, taking everything down now ...
May 21 20:02:11 norma logger: [temp] Issued shutdown to n0001
May 21 20:02:11 norma logger: [temp] Issued shutdown to n0002
May 21 20:02:11 norma logger: [temp] Issued shutdown to n0003
May 21 20:02:11 norma logger: [temp] Issued shutdown to n0004
May 21 20:02:1…
May 10th, 2011 UPS on i7 is a problem. Replace it ?
Mar 27th, 2011 NAMDjob_i7 script written to support submission of NAMD+CUDA jobs to the i7 node. Script tested and timings uploaded in the benchmarks page.
Mar 7th, 2011 This is scandalous. 22 days without a power failure ? Disgusting ...
Jan 16th, 2011 Multiple, close-spaced power failures o/n created filesystem issues. Will have to do a proper check next time the power goes down (ie. later today :-) )
Jan 12th, 2011 n0005 died o/n. This is the second time, and, again, it did come back OK. Memory ?
Jan 8th, 2011 Lost internet connectivity from cluster room. Nikos Grigoriadis fixed it and identified the problem as being a loopback. It was probably the second 10/100 hub that was known to be problematic. Took it off.
Jan 5th, 2011 Cron script to update MBG's power failure log written, installed and tested. Link placed in maintenance page. Happy power failures.
Jan 1st, 2011 I do need a 'power_failure' page and a script to automatically update it. Restart jobs.
Dec 30th, 2010 Risk it. Set BIOS setting to restore power after power failure. Set grub timeout to 5 minutes to guard for a quick double hit.
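The grub part, spelled out (grub legacy keeps its timeout in seconds; the file location assumed here is the stock /boot/grub/grub.conf of that era) :
default=0
timeout=300       # wait 5 minutes before booting, in case the power bounces again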
Dec 29th, 2010 Power failure. Every day ? :-( Giving up. Every two hours ?
Dec 28th, 2010 Again power failure. Restart jobs ...
Dec 27th, 2010 First version of a new perl script named 'DCDstitch'. The aim is to automate the post-processing of multiple DCD files from multiple molecular dynamics restarts. The script will hopefully find the number of frames that must be kept from each DCD, remove waters and ions, and paste all DCD together. There are several assumptions involved in the procedure, so you'd better have a look at the script's source code at /usr/bin/DCDstitch
Dec 26th, 2010 Power failure. Restart next day. Bringing-up the cluster automatically didn't work this time (?).
Dec 17th, 2010 Well, five days without a power failure was disgusting. Thankfully, it didn't last longer ... Restart everything.
Dec 11th, 2010 Continuous power failures. Wait ...
Dec 10th, 2010 Power failure.
Dec 8th, 2010 Power failure. Take them up again.
Dec 5th, 2010 TS. All went down. Northerly winds at last.
Dec 3rd, 2010 Southerly salty winds lasted too long for comfort. All went down.
Nov 28th, 2010 Impressive TS over station, all went down. Wait a bit before restarting jobs ...
Nov 12th, 2010 Again power failure. Thankfully, it rained. Lost two cores on head node, had to dump ROM to get them back. Start the whole lot again.
Nov 10th, 2010 Continuous power failures. Southerly winds ... :-(
Oct 28th, 2010 Continuous thunderstorms causing power failures every two hours. Wait. Before restarting jobs, add +idlepoll to the NAMD flags used by the NAMDjob script.
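The NAMDjob change amounts to appending +idlepoll to the namd2 command line the script builds, i.e. something along these lines (path and core count are illustrative, not the script's actual values) :
/usr/local/bin/namd2 +p4 +idlepoll equi.namd >& LOG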
Sept 25th, 2010 n0001 died with no sign of what went wrong, and came back-up without problems. Sigh.
Sept 24th, 2010 Power failure.
Sept 20th, 2010 Power failure.
Sept 14th, 2010 Power failure. Scripts to automatically bring-up the machines worked satisfactorily. Restarted jobs with NAMDautorestart, again without problems.
Sept 8th, 2010 Power supply failed on n0002. Replace and restart job. First use of the NAMDautorestart script: Perl script to prepare files for a restart
#!/usr/bin/perl -w
print "\nChecking files:\n";
if ( -e "equi.namd" ) { print " Found: equi.namd\n"; } else { print " Missing equi.namd. Aborting now.\n"; exit; }
if ( -e "LOG" )       { print " Found: LOG\n"; }       else { print " Missing LOG. Aborting now.\n"; exit; }
if ( -e "output/equi_out.dcd" ) { print "…
Sept 4th, 2010 Power failure, of course.
Aug 13th, 2010 Eleven days without a power failure had been scandalous. Thankfully, didn't last longer. Unfortunately, the i7 node didn't come back. Physical access needed :-/
Aug 2nd, 2010 Power failure on the night of 1st killed two nodes. The rest died on the morning of 2nd. Restarted jobs but started wondering whether 'persisting until prevailing' is a viable option under the circumstances. :-/
Jul 31st, 2010 Power failure. I should be reporting time intervals with 'power available'.
Jul 30th, 2010 Power failure o/n. Restart.
Jul 20th, 2010 Power failure.
Jul 16th, 2010 Power failure o/n. Every day ?
Jul 15th, 2010 Not only on Sundays, then. Restart jobs. Again lost two cores (gone hot-plugged). Will have to re-flash the BIOS after the next power failure :-/
Jul 11th, 2010 Power failure (every Sunday ?). Take them up again.
Jul 5th, 2010 Power supply failed on n0008 (was still on one of the originally provided units). Replaced.
Jul 2nd, 2010 Double power failure (thunderstorm season). Take them up again.
Jun 16th, 2010 Power failure (sans TS, avec too many AC units working simultaneously due to heat wave).
Jun 15th, 2010 Aha, culprit found: n0005 died o/n. Unfortunately, it did come up again so we can't locate the problem. Restart the job hoping that this time it will fail for good.
June 9th, 10th, 2010 And suddenly, NAMD jobs started bombing-out at step (8 x LDBperiod) with the following message from the balancer:
ENERGY:   60240   58.8727   162.3848   79.1835   5.3860   -397348.5272   36382.6634   0.0000
LDB: ============= START OF LOAD BALANCING ============== 2252.97
LDB: ============== END OF LOAD BALANCING =============== 2252.97
ENERGY:   60320   54.5252   157.6152   83.2358   11.1675   -397481.1164   36760.0562 …
May 27th, 2010 Guess what. Power failure ! Should I start posting dates _without_ a power failure ?
May 17th, 2010 Scheduled electric work. All went down and up again smoothly.
May 8th, 2010 A power supply (possibly n0008's) is about to give-up the ghost. Let it burn.
May 4th, 2010 Pilot-induced oscillation (cluster crash): /etc/init.d/network stop typed in the wrong terminal (connected to norma instead of the box I was setting-up), and then forgotten. Few minutes later slurm officially declared the nodes dead, and killed all jobs. It was only fair that I had to restart all jobs late in the evening :-?
Apr 13th, 2010 xplor-nih and vmd-xplor installed.
Apr 7th, 2010 Power failure of course, but this time it had its toll: two cores of the head node went 'hotplugged', never to be seen again. Still working on it ... OK. Updating the BIOS (latest v.3.10) fixed the problem, but I doubt that the problem was indeed BIOS-induced.
Mar 30th, 2010 The new UPS (on which the i7 node is mounted) appears to have failed without obvious reason (all other UPSs were OK). UPS, node & job restarted.
Mar 7th, 2010 Guess what, power failure again. Mean time between power failures is of the order of days. Restart everything.
Mar 4th, 2010 Electrical work. All went down. Restart jobs. Update VNFS capsule to include CUDA stuff for n0009.
Mar 2nd, 2010 CUDA up and running on the new node with not very much trouble (needed kernel module, init script to create the devices _and_ the cuda libraries in /usr/lib64). The only thing left to do is to move everything to the VNFS capsule. Since we were there, took the opportunity to update the nvidia driver + toolkit + SDK on the head node (current is 190.53).
Feb 24th, 2010 New node with i965 extreme and GTX295 arrived. Unfortunately, not with the expected (fast) memory (which is assumed to be on its way). Node installed without problems and benchmarked. Only CUDA left to set-up ...
Feb 13th, 2010 TS OVER STATION. All went down and came back up again gracefully.
Feb 1st, 2010 Power failure (thunderstorm). Take everything up again, restart jobs.
Jan 8th, 2010 Power failure overnight. Everything came back Ok.
Dec 15th, 2009 Take everything down due to scheduled electrical work on the building. Take everything up again (uneventfully) the next day.
Dec 1st, 2009 Power failure again, but this time with a Murphy's touch: the power went down for a period of time long enough to kill the jobs, but not long enough to permanently shut down the nodes. Nodes came up again, slurm restarted the jobs (from their state three weeks ago), and this was when things went to hell in a handbasket: the restart files that would be needed to continue the long jobs were overwritten, making a proper restart impossible. As a result, a --no-requeue flag was added to …
Nov 12th, 2009 Power failure. Take everything up again, restart jobs.
Oct 29th, 2009 It was indeed the same mistake made twice: rebuild with the old disk failed. Machine taken down, disk replaced, rebuild initiated -> this time went OK. Back to safety. Take it up again, send the disk away. At the end of the day everything looking normal. Take the opportunity to set PCI latencies to zero for all network cards.
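Zeroing the PCI latency timers was presumably a setpci affair; a hedged example (the bus address is a placeholder, pick it from lspci) :
lspci | grep -i ethernet              # find the NIC's bus address, e.g. 02:00.0
setpci -s 02:00.0 latency_timer=0     # zero the latency timer
setpci -s 02:00.0 latency_timer       # read it back to confirm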
Oct 28th, 2009 (B) Ok. Assume for starters that the disk is Ok (making the same mistake twice). Then try to rebuild the array with the filesystem alive as follows:
* Use 3dm2 to (a) remove the drive from the unit,
* Rescan the card to make the drive available again,
* Start rebuilding the array at low priority -> see how it goes
Oct 28th, 2009 Hate it:
Port 0  WDC WD5002ABYS-01B1B0  465.76 GB  OK
Port 1  WDC WD5002ABYS-02B1B0  465.76 GB  OK
Port 2  WDC WD5002ABYS-01B1B0  465.76 GB  OK
Port 3  WDC WD5002ABYS-01B1B0  465.76 GB  DEGRADED [Remove Drive]
corresponding to
Oct 27th, 2009 Forty days without a thunderstorm (and without a power failure)!
Oct 24th, 2009 # tail /var/log/messages
Oct 21 00:12:36 norma kernel: 3w-9xxx: scsi4: AEN: INFO (0x04:0x0029): Verify started:unit=0.
Oct 21 05:04:56 norma kernel: 3w-9xxx: scsi4: AEN: INFO (0x04:0x002B): Verify completed:unit=0.
Oct 23 04:04:58 norma kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Oct 23 04:04:58 norma kernel: sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.
Oct 23 04:05:12 norma kernel: 3w-9xxx: scsi4: AEN: INFO …
Sept 24th, 2009 Xdock is the new game in town. Use it to prepare a nice little dockapp that uses warewulf to display the CPU status of the cluster's nodes. Nice and easy: [Screenshot with applet in use] By having wulfd running on the head node, can also watch how relaxed the server is (first column in the applet).
Sept 17th, 2009 Surprise, surprise: power failure again. Looks like the current settings for the UPS can shut down everything gracefully. Still have a problem with the cluster coming up automatically and cleanly (mainly because the power is not restored cleanly, but through bursts of power-on-power-down).
Sept 15th, 2009 RAID is probably Ok: gone through two verifications without incidents.
Sep 10th, 2009 (B) The disk was indeed faulty, but you could only tell after asking the WD diagnostic tool to fill the whole thing with zeros (all other WD-provided tests passed OK). The take-home message from this RAID-disk story is that the controller is dependable: it correctly identified and threw-out of the array a disk that could have passed all the standard vendor-provided tests. So: if the controller says “throw it away”, throw it away. Don't think, don't hesitate, just replace the darn thing and get on w…
Sep 10th, 2009 The attempt to re-build the array with the old disk failed. The complete set of logged error messages concerning the specific drive were:
Sep 09, 2009 08:49.40AM (0x04:0x003A): Drive power on reset detected: port=1
Sep 08, 2009 08:53.47PM (0x04:0x0002): Degraded unit: unit=0, port=1
Sep 08, 2009 05:47.29PM (0x04:0x000B): Rebuild started: unit=0
Sep 08, 2009 05:17.39PM (0x04:0x0042): Primary DCB read error occurred: port=1, error=0x208
Sep 06, 2009 05:13.06AM (0x04:0x0002)…
Sep 6th, 2009
Sep 06, 2009 05:13.06AM (0x04:0x0002): Degraded unit: unit=0, port=1
Sep 06, 2009 05:13.06AM (0x04:0x0023): Sector repair completed: port=1, LBA=0x38688186
Sep 06, 2009 12:10.09AM (0x04:0x0029): Verify started: unit=0
... and that was the end of a lovely Sunday morning.
Sep 8th, 2009 The hardware shop says that the faulty disk passes the Western Digital tests, and thus, that it is not faulty by their standards. Got two new ones, but decided to give the old disk a try first (see below). As if the RAID problems were not enough, the wooden structure supporting the cluster gave way at its base: take it apart, re-enforce the structure, put it back together again. At the end of the day, back to normal, at least structurally. Then: remount the old disk, and start the rebuild.
Aug 31st, 2009 Level 1 xfsdump of head node.
Aug 22nd, 2009 Other than a power failure (fixed remotely), it has been a quiet and (at least hardware-wise) productive summer vacation ...
Aug 26th, 2009 Yet another power failure. Head node went down gracefully. Take-up everything again, start restarting jobs.
Jul 18th, 2009 Attempted to install a crontab job to make weekly back-ups of wiki pages on optical media: Didn't work, can't write multisession DVDs with growisofs.
Jul 10th, 2009 At last a solution to the problem with DCD files appearing corrupted and lagging behind the simulation: the (now obvious) answer is that sync-ing the head node is not enough. The compute nodes (with 4Gb of memory) are caching results for a long-long time before flushing them out (to the head node's disks). Adding a crontab entry that issues a /bin/sync to the nodes every five minutes is the current working solution (which doesn't seem to affect performance).
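The crontab entry in question looks roughly like this (a sketch using pdsh, which is already in use elsewhere on this cluster; node list as in the other scripts) :
# /etc/crontab on the head node : flush the compute nodes' caches every 5 minutes
*/5 * * * * root /usr/bin/pdsh -w n0001,n0002,n0003,n0004,n0005,n0006,n0007,n0008 /bin/sync > /dev/null 2>&1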
Jul 9th, 2009 Power failure again with no obvious reason. This time there were problems: n0007 appears to be dead (no POST), and the server failed to shutdown cleanly. xfs_check repaired one unlinked file coming from slurm. Server boot-up normally (?). The current scenario concerning unclean shutdowns is that the culprit is the powerfail-duration line in /etc/pwrstatd.conf : it seems that whenever powerfail-duration is not zero, the shutdown is not clean. We'll see ...
Jun 28th, 2009 Power failure again, this time sans-thunderstorm. Shutdown appears to have been smooth. Wake-them up ...
June 23rd, 2009 Significant thunderstorm hit the area overnight. Everything went down (gracefully ? doubt it). Take them up again. To be on the safe side, do an xfs_check on the disks: looks good.
Jun 16th, 2009 CHARMM toppar_c35b2_c36a2 installed (this is a version that includes the CMAP correction).
Jun 9th, 2009 Latest version of NAMDjob to submit NAMD jobs to slurm. The main differences from the previous version are * Addition of several network- and cluster-specific command-line flags. * By-passing charmrun when running on a single node (to take full advantage of SMP capabilities).
June 2nd, 2009 Building NAMD version 2.7b1 from source. See this benchmarks page for results. First attempt was with UDP, icc, plus smp. Charm++ was built using
./build charm++ net-linux-x86_64 icc smp ifort -O -DCMK_OPTIMIZE
NAMD was built with
./config Linux-x86_64-icc --charm-arch net-linux-x86_64-ifort-smp-icc
June 4th, 2009 Almost a month with all nodes -plus GPU- fully loaded, stable, and without hiccups (save for the recent n0002 incident). It is official, we have a cluster.
Jun 3rd, 2009 Modified the NAMDjob script to use slurm's --exclusive flag when the number of cores requested is a multiple of four. This is in order not to lose the SMP-related performance. New version of NAMDjob
#!/bin/tcsh -f
renice +19 -p $$ >& /dev/null
#
# Check command line arguments ...
#
if ( $# != 3 ) then
  echo " "
  echo "Usage : NAMDjob <number of cores> <filename of namd script> <log filename>"
  echo " "
  exit
endif
#
# ... presence of script
#
if (! -es $2 ) then
  echo "Missing file ($2) co…
May 20th, 2009 At last, a proper burn-in test of the nodes: all thirty-two cores fully loaded for days. So far, so good. And, now for weeks. Still, so far so good.
May 26th, 2009 Schrodinger package installed (maestro & desmond).
May 19th, 2009 n0002 died, this time it was not the power supply. Memory again ? We'll see ... Take it up again.
May 8th, 2009 Surprise, surprise. Power failure again.
May 3rd, 2009 Level 0 dump of / and /boot. Dump session
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             1.4T   78G  1.3T   6% /
/dev/sda1             479M   12M  442M   3% /boot
none                  128M   36K  128M   1% /tmp
#
# mt -f /dev/st0 status
SCSI 2 tape drive:
File number=0, block number=0, partition=0.
Tape block size 0 bytes. Density code 0x47 (DDS-5 or TR-5).
Soft error count since last status=0
General status bits on (41010000):
BOT ONLINE IM_REP_…
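The dump command itself is not shown above; it would have been along these lines (standard xfsdump options, with assumed session/media labels, to the DAT drive installed on May 2nd) :
xfsdump -l 0 -L root_L0 -M dat72_01 -f /dev/st0 /     # level-0 dump of / to tape
mt -f /dev/st0 rewind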
May 2nd, 2009 Hewlett Packard's DAT 72G installed and tested, together with mt-st.
May 1st, 2009 Black-out again :-? At least found a way to properly check xfs via ubuntu 9.04 (which comes with the 3ware kernel module, xfs, and xfstools). xfs_repair did find (and repair) some problems, but none of the affected files appeared to be critical. Also, I changed my mind about what to do with power failures. The current scenario is: power goes away, everything stays put for 2min (just in case it was a shorty), server issues shutdown to compute nodes, server shuts down, UPS's power stays on so tha…
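The scenario described above, as a shutdown-cascade sketch (script name, delays and the way pwrstatd would invoke it are assumptions) :
#!/bin/sh
sleep 120                                   # ignore failures shorter than ~2 min
/usr/bin/pdsh -w n0001,n0002,n0003,n0004,n0005,n0006,n0007,n0008 /sbin/shutdown -h now
sleep 60                                    # give the compute nodes time to go down
/sbin/shutdown -h now                       # finally take the head node down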
April 28th, 2009 Black-out again ... And this time (after messing with UPS's pwrstatd), the head node did *not* shutdown cleanly. Tried to xfs_check the disk, but couldn't get the filesystem read-only, even from the single-user level.
April 27th, 2009 Power supply replaced on n0003. Hook-it up again.
April 26th, 2009
* Change all BIOSes to no-IDE, no-USB, no-Audio, ...
* Change all BIOSes to 'power on' on power failure
* Change the UPS settings on server: shutdown at 10%, do not switch-off UPS.
* Waiting for the next black-out to test the new settings (won't be long).
April 22nd, 2009 Power failure again (and again, and again). This time, and due to Easter, everything stayed down till the 26th.
April 16th, 2009 n0003 followed the steps of n0004 and n0008: yet another power supply failed (which takes us to three out of eight in less than three months). Pooh. The behaviour of slurm is commendable: two jobs were running on the failed node. When the node stopped responding, slurm set it to 'down', and re-queued the jobs. One of them started immediately on two cores that were not allocated, the other still awaits resources. Nice.
Apr 11th, 2009 Cuda burn-in using a dgemm-based test. Card looks good and ready to go. The script and results are: CUDA burn-in test
# cat script.sh
#!/bin/tcsh -f
date
./dgemmSweep 0 300
date
exit
#
# ./script.sh >& LOG &
#
# head LOG
Fri Apr 10 20:37:22 EEST 2009
Testing device 0: GeForce GTX 260
device = 0 iterSize = 5504
Device 0: i = 128
Device 0: i = 160
Device 0: i = 192
Device 0: i = 224
Device 0: i = 256
Device 0: i = 288
#
# tail LOG
Device 0: i = 5280
Device 0: i = 5312
Device 0: i…
April 14th, 2009 Second black-out within a week. All went down gracefully. Take them up again. I should prepare an MBG power-failure blog.
April 13th, 2009 Replaced n0008's power supply. Node back into service.
April 12th, 2009 Power supply on n0008 failed. That takes us to 2 (out of 8) burned power supplies within two months :-?
Apr 7th, 2009 Power failure o/n. Everything went down gracefully. Take them up again.
Apr 6th, 2009 n0004 back from the shop. It was a power supply failure, even behind a UPS. Pooh.
April 5th, 2009 CUDA installed and tested. This included: driver 180.22, the 2.1 toolkit, and the 2.10.1215.2015 sdk. The remaining question is: who can write code for it ? Details from testing the card
# ./deviceQuery
There is 1 device supporting CUDA
Device 0: "GeForce GTX 260"
  Major revision number:                  1
  Minor revision number:                  3
  Total amount of global memory:          939196416 bytes
  Number of multiprocessors:              24
  Number…
Mar 31st, 2009 n0004 died, and died well (no POST). Infant mortality for a power supply (behind a UPS) ? Went to the shop on the 1st.
Mar 27th, 2009 ... and suddenly four nodes stopped accepting ssh connections, with each attempted connection resulting in yet another process accumulating (which is exactly what happened in the past with n0008). Tried everything I could think of, got nowhere. Finally, gave up and risked /etc/init.d/nfs restart, and voila: all accumulated jobs disappeared and the nodes are back to normal. But are they ? What does NFS do when it is restarted with several open file descriptors over it ? We will have to wait …
Mar 25th, 2009 Thunderstorm caused a momentary power failure which killed the (non-UPSed) nodes 7 & 8. Node n0008 came back immediately. n0007 was left in a miserable non-responding state. The moral is that we need a fourth UPS box. => Which we got and installed on Mar 26th.
Mar 18th, 2009
* Tried -but failed- to convince charmrun-initiated jobs to respond to SIGSTOP and SIGCONT signals from slurm (except when in ++local mode). Undeterred, sched/gang was implemented and a second queue was installed with a time limit of one hour. This configuration awaits testing.
* Installed the pam_slurm module to allow access to nodes only to those users that have active jobs. On my way to this goal, updated slurm, which broke the FORCE:1 shared selection. Downgraded to slurm 1.3.6.
Mar 8th, 2009 Built a 64bit-no-graphics executable of carma (named carma64) using intel's icc-mkl v.11.0. See carma64
Mar 13th, 2009 Second new terminal (with Intel Atom) installed with SL 5.2. Looking good.
March 6th, 2009 Write cache on RAID controller turned-off (danger of filesystem corruption under xfs). Then, on a second thought, it was re-enabled (why do we have a 800VA UPS dedicated to the head node if we can't shutdown cleanly ?). But what about individual disks' caches and xfs' 'metadata' ? Confusing.
Mar 10th, 2009 OpenMPI v.1.2.4 built using Intel's compilers v.11 (using CaOS' cports, very nice).
Mar 9th, 2009 n0008 spiraled down with a large number of accumulating non-completing ssh commands (originating from house-keeping temperature and load monitoring cron scripts). Failed to identify the source of the problem, failed to pass commands in any other way, hard reboot was the (non-)solution. Pooh.
Mar 4th, 2009 NAMDjob, a script to submit NAMD jobs to slurm
#!/bin/tcsh -f
#
# Check command line arguments ...
#
if ( $# != 3 ) then
  echo " "
  echo "Usage : NAMDjob <number of cores> <filename of namd script> <log filename>"
  echo " "
  exit
endif
#
# ... presence of script
#
if (! -es $2 ) then
  echo "Missing file ($2) containing NAMD script ? Abort."
  exit
endif
#
# ... write access
#
touch .test_$$
if (! -e .test_$$ ) then
  echo "No write access in current directory ? Abort."
  exit
endif
/bin/rm -rf .te…
Mar 5th, 2009 Short script to plot NAMD log file from norma. It uses plot. plotnamd
#!/bin/tcsh -f
#
# Check command line arguments ...
#
if ( $# < 1 || $# > 2 ) then
  echo " "
  echo "Usage : plotnamd <log filename> [lines to skip]"
  echo " "
  exit
endif
#
# ... presence of log
#
if (! -es $1 ) then
  echo "Missing log file ? Abort."
  exit
endif
if ( $# == 1 ) then
  set skip = "+0"
else
  set skip = "+$2"
endif
echo "TS vs. TOTAL"
grep -i '^ENERGY:' LOG | tail --lines=$skip | tee /tmp/$$ | awk '{print $2 " "…
Mar 3rd, 2009 Jobs finished, an opportunity to fix slurm: it took all day long but finally it came. The problem was that the hostname 'norma' meant nothing to the nodes (and even if it did, it would have meant the wrong thing, ie. the address of the public ethernet interface). Adding 'norma' (as 10.0.0.1) to the nodes' /etc/hosts file seems to have fixed the problems.
Feb 28th, 2009 Two new boxes destined for user terminals arrived, both based on Intel Atom with 2 Gbytes of memory. Funny how by enabling hyperthreading 2.6.18 sees four (and not just two) cores. SL 5.2 installed on one of them. Network cards based on RealTek chips (and, of course, needed the r8168 driver to get them working). At the end of the day it looks mostly OK.
Feb 27th, 2009 Modeller 9v6 installed.
Feb 22nd, 2009 R version 2.8.1 (2008-12-22) installed.
Feb 19th, 2009 VMD 1.8.6 added. plot written and installed.
Feb 17th, 2009 n0002 back from the shop with a downgraded BIOS and no hardware changes. The current scenario goes like this: during n0002's last visit to the hardware people (for its memory problem), its BIOS was upgraded. The hypothesis is that the new BIOS detected some momentary voltage problem and decided to make two (of the four) cores hotplug-able. After downgrading the BIOS all four cores are again visible. n0002 was hooked up again, and looks normal. Some heavy CPU testing is in order.
Feb 12th, 2009  
Feb 16th, 2009
* n0002 went to the doctor :-?
* Status page and corresponding perl scripts created
Weekly load script
#!/usr/bin/perl -w
$load = `/usr/bin/pdsh -w n0001,n0002,n0003,n0004,n0005,n0006,n0007,n0008 w 2>&1 | grep 'load' | awk '{print \$NF}' | awk '{sum+=\$1} END { print sum/8.0}'`;
open IN, "/srv/www/html/data/pages/status.txt" or die "Can not open status.txt\n";
open OUT, ">/tmp/load_status" or die "Can not open temporary file\n";
$first = 0;
while ( $line = <IN> ) {
…
Feb 14th, 2009
* First long job stopped after ~38 nsec. Start heating for the next one.
* Slurm woes: communication problems (as always).
* n0002 must be cursed: two cores disappeared never to be seen again (and dmesg contains the line 'SMP: Allowing 4 CPUs, 2 hotplug CPUs'). Looks like hardware again.
Feb 9th, 2009 A/C unit installed. Initially recorded temperatures quite good (that is, low). Later in the evening (after equilibration) not brilliant. Run a temperature monitoring script to see how it goes. Needless to say that nefeli's A/C was immediately switched-off with marked results:
Feb 6th, 2009 Pymol 1.2b3pre compiled from the cvs without having to download half the sourceforge. Freemol tools also installed (they include the APBS plugin for pymol).
Jan 28th, 2009
* n0002 is back from the shop. Looks Ok. Re-connect it.
* Setting-up a terminal in the machine room (pentium III at 800MHz ;-) ). Use Caos NSA for it as well to see how it goes.
Feb 4th, 2009 Added autodock & autodock tools. More work with the wiki. Air conditioning unit ordered. First attempt with a temperature monitoring perl script. Alarm (shutting-down) temperature set to 55 degrees C (average over all alive cores). Script placed in /etc/cron.hourly (which is probably not fast enough).
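A minimal shell sketch of the kind of check described above (the real script is perl and lives in /etc/cron.hourly; the field positions assume the 'Core N: +xx.x C' sensors lines shown elsewhere in this log, so treat it as illustrative only) :
#!/bin/sh
NODES=n0001,n0002,n0003,n0004,n0005,n0006,n0007,n0008
LIMIT=55
AVG=`/usr/bin/pdsh -w $NODES sensors 2>/dev/null | grep 'Core ' | \
     awk '{sum += $4; n++} END {if (n) printf "%d", sum/n; else print 0}'`
if [ "$AVG" -ge "$LIMIT" ] ; then
    logger "[temp] Alert, current average core temperature is $AVG deg C"
    /usr/bin/pdsh -w $NODES /sbin/shutdown -h now     # take the compute nodes down
fi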
Jan 29th, 2009 New terminal installed. New 10/100 switch added. False alarm with dcd files appearing corrupted. Try to start a serious namd job and watch temperatures.
Feb 2nd, 2009 Software installation session:
* CNS 1.2.1
* X-PLOR 3.851 (hopefully with large file support)
* CCP4 6.1.0 with phaser 2.1.4
* gromacs and gromacs-mpi, 4.0.3
Jan 31st, 2009 3dm2 RAID manager installed. Autoverify set to on. Self-tests scheduled.
Jan 27th, 2009 LaTex plugin installed on dokuwiki (with difficulty). See the playground for illustrated examples.
Jan 26th, 2009
* Intel's compiler suite installed: icc & ifc for ia32 and intel64 plus the redistributable libraries.
* First hardware problems appeared: n0002 oopses the kernel at random intervals. Looks like a memory problem. First attempt (long shot) was to swap the positions of the two memory banks. Of course it didn't work. Take it to the hardware shop.
Jan 25th, 2009 ntp & ntpd working properly across the cluster. More benchmarks.
Jan 23rd, 2009 Installed Intel's MKL libraries. NAMD optimisation & benchmarks.
Jan 20th, 2009 Installed xfsdump & xfsrestore. Installed UPS monitoring & alarm tool (on head node only).
Jan 16th, 2009 Playing with dokuwiki (it rocks). Modified the theme's footer.html to include graphical output from warewulf showing node state and CPU utilization. Unfortunately this broke the xhtml compliance of dokuwiki.
Jan 19th, 2009 Still adding pages and plugins for the wiki. Changed the 'welcome' page to a more familiar look. Started with benchmarks: netpipe's results on the two network interfaces done.
Jan 15th, 2009 Plenty of wiring, but at the end of the day all nodes powered-up, networked and PXE-capable. First quick tests with NAMD. Playing with dokuwiki.