Sept 9th, 10th & 11th, 2013

Received the nvidia K2000D card for the IBM server. Hardware-wise, installation appears to have gone well (the card fits the riser assembly with no problems whatsoever). But then again, you can't have everything : booting stops at the initial firmware screen (never gets to the <F2>, <F12>, … screen). Will have to repeat the hardware installation a couple of times, test the card on a different machine, confirm that the problem is due to the card, confirm that it is not a matter of the 3rd PSU needing to be operational, update IBM's firmware(?), …

⇒ Remove and re-install card a couple of times → no

⇒ Add 3rd PSU in the game → no

⇒ Will try to (i) disable the 16x slot from the BIOS, (ii) confirm that it boots with the card in place but the slot disabled, (iii) reboot, go to the BIOS, enable the slot, and see if that gets us through.

⇒ Disable from within BIOS the ROM option for PCIe slot 1 → Success, we can boot and see the device.


But, of course, you can't have everything : the kernel module 270.41.06 does not support the K2000D card. Here we go again …

And, even better, the driver version suggested by nvidia (319.49) needs very recent kernels (>3.10 ?). It is getting better and better.
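A quick way to check a kernel release string against a driver's minimum kernel is sketched below. The function names are hypothetical, and the minimum is passed in as a parameter because the actual requirement of 319.49 is not confirmed here (the ">3.10 ?" above is a guess).

```python
import re


def kernel_version_tuple(release):
    """Return the leading numeric part of a kernel release string as a tuple.

    Example input: '3.10.4-300.fc19.x86_64'. A missing patch level counts as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def kernel_at_least(release, minimum):
    """True if the kernel release is at or above 'minimum', e.g. (3, 10)."""
    return kernel_version_tuple(release) >= tuple(minimum)


if __name__ == "__main__":
    import platform

    # (3, 10) is an assumed threshold, not a documented driver requirement.
    print(platform.release(), kernel_at_least(platform.release(), (3, 10)))
```

Tuples compare element by element, so a three-part release is matched correctly against a two-part minimum.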


Give it a try with fc19 kernel + friends :

  • kernel-3.10.4-300.fc19.x86_64.rpm
  • kmod-nvidia-3.10.4-300.fc19.x86_64-319.32-2.fc19.1.x86_64.rpm
  • linux-firmware-20130418-0.1.gitb584174.fc19.noarch.rpm

⇒ OK. Can boot and load the nvidia module. But still not done : there is an incompatibility between the CUDA version and the driver. Tried to get a newer libcuda from nvidia, but then the libc was not compatible. Hate it. Try with the EL6 distribution (and then with the EL5 ?).

Finally, all OK. Can get NAMD to run with '+devices 0,0,0,…'. Unfortunately, it wasn't worth all this effort : for large numbers of cores, the quadro K2000D is actually slowing down the calculation. See this page from benchmarks.
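The '+devices' argument is a comma-separated list of device ids, so pointing every process at the single card means repeating id 0. A minimal sketch of building that string follows; the helper name is hypothetical, and the process count is left as a parameter since the number actually used is not recorded above.

```python
def namd_devices_arg(n_processes, device_id=0):
    """Build a '+devices' argument that repeats one device id n_processes times.

    Hypothetical helper: with a single GPU in the box, every entry is the
    same id (0 by default).
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    return "+devices " + ",".join([str(device_id)] * n_processes)


if __name__ == "__main__":
    # Four processes is only an illustrative count.
    print(namd_devices_arg(4))
```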


2013/09/11 20:18

Aug 23rd, 2013

Fighting with slurm on n0011. After changing the executables to the old(er) version, replacing the corresponding libraries, and fixing passwordless ssh to the node, n0011 appears in the sinfo list.

Next targets are

  • fix slurm.conf to allow more than one job to run on n0011 simultaneously ⇒ Done
  • update NAMDjob to allow (semi)-automatic usage of new node ⇒ Done
  • Use LSI webbios to prepare a RAID 1 (mirroring) ⇒ Done
  • format n0011's disks and export as /home2 with a mode of 1777 to be used as a second writing device ⇒ Done
  • Try to export n0011's RAID to all nodes (not just norma) ⇒ Done
  • Enter n0011 to the various scripts (load, shutdown, …)
  • Find a suitable CUDA-enabled card for the box
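
For the /home2 item in the list above: mode 1777 means world-writable (0777) plus the sticky bit (1000), so that anyone can write but users can only delete their own files, as with /tmp. A small sketch for setting and verifying this is given below. The function names are hypothetical, and the path is a parameter so nothing here assumes how the export is actually mounted.

```python
import os
import stat


def set_mode_1777(path):
    """Set rwxrwxrwx plus the sticky bit on 'path'."""
    os.chmod(path, 0o1777)


def has_mode_1777(path):
    """True if the permission bits of 'path' are exactly 1777."""
    # S_IMODE keeps the permission bits, including setuid/setgid/sticky.
    return stat.S_IMODE(os.stat(path).st_mode) == 0o1777


if __name__ == "__main__":
    import sys

    # Usage: pass the directory to check as the first argument.
    target = sys.argv[1]
    print(target, "OK" if has_mode_1777(target) else "not 1777")
```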

Should it be RAID 0 instead of RAID 1 ???????

2013/08/26 12:33


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)