Aug 25th, 2013

Playing with n0011 software-wise, looking reasonably good.

Temperatures are up again and I'm not sure why (norma's and n0003's chassis fans are dead, but this shouldn't be enough to explain it).

Also n0010 died suddenly again.

It looks like yet another hardware crisis is at hand …

2013/08/25 19:41

Aug 1st & 2nd 2013

Tried an old but recently updated 2.6.18 kernel → not surprisingly, couldn't get past the kexec stage.

So, back to the drawing board. I seem to recall that the correction for the AMD-family-related memory problem was a few lines of code in just one kernel module. So, the question is : can the correction be applied to the corresponding module of our old kernel ?

It was a simple diff (http://us.generation-nt.com/answer/cpu-mtrrs-linux-kernel-help-207801521.html), but the change is within the kernel itself (and not in a module), which means recompiling everything. Bloody hell. The never-ending story …

Tried to find a 2.6.32 kernel with a firmware-independent bnx2 module. Tested kernel-2.6.32-358.el6.x86_64.rpm, kernel-2.6.32-71.el6.x86_64.rpm, kernel-2.6.32-131.21.1.el6.centos.plus.x86_64.rpm, kernel-2.6.32-131.0.15.el6.centos.plus.x86_64.rpm, kernel-2.6.32-71.7.1.el6.centos.plus.x86_64.rpm, kernel-2.6.32-71.el6.x86_64.rpm, kernel-2.6.32.26-175.fc12.x86_64.rpm, kernel-2.6.32.10-44.fc11.x86_64.rpm, kernel-server-2.6.32.8-69mib-1-1mib2010.0.x86_64.rpm. Failed again.

Last-ditch effort ? Get a newer VNFS capsule from the CAOS repository, which is based on 2.6.31.6-2. This version already has a firmware-based bnx2 module. To begin with, test it on an as-is basis. If the memory problem persists, then we might get lucky and throw a new 2.6.32-XXX kernel into this capsule (and then start looking for site-specific changes). Will have to wait till tomorrow …

⇒ This went better. The memory problem persisted and the bnx2 module was not automatically loaded, but after logging in and doing an rmmod/modprobe cycle, the network was up and functioning.
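
For the record, the reload cycle mentioned above can be sketched roughly as follows. This is only an illustration: the exact commands typed on the node were not logged, the helper names `reload_commands` and `reload_module` are made up for this note, and it assumes the module is simply called `bnx2` and that the script runs as root when `dry_run` is off.

```python
import subprocess


def reload_commands(module="bnx2"):
    """Return the two commands of an unload/reload cycle for a kernel module.

    The module name defaults to bnx2 (the NIC driver from this entry);
    that default is an assumption made for illustration.
    """
    return [["rmmod", module], ["modprobe", module]]


def reload_module(module="bnx2", dry_run=True):
    """Print the cycle (dry_run=True) or actually run it (needs root)."""
    for cmd in reload_commands(module):
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a step fails,
            # e.g. if the module is not loaded or is still in use.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    reload_module("bnx2", dry_run=True)
```

With `dry_run=True` nothing is touched, so it is safe to try on a working node first.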

⇒ OK, here we go again : try to place the 2.6.38.6-26 kernel + firmware in the new capsule.

⇒ Finally got somewhere. It boots correctly, but slurm (due to newer version) does not cooperate. Do some benchmarks with NAMD (this is without a CUDA-enabled card).

2013/08/02 17:58


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)