Aug 1st & 2nd 2013

Tried an old but recently updated 2.6.18 kernel → not surprisingly, couldn't get past the kexec stage.

So, back to the drawing board. I seem to recall that the correction for the AMD-family-related memory problem was a few lines of code in just one kernel module. So, the question is : can the correction be applied to the corresponding module of our old kernel ?

It was a simple diff, but the change is within the kernel itself (and not in a module), which means recompiling everything. Bloody hell. The never-ending story …

Tried to find a 2.6.32 kernel with a firmware-independent bnx2 module. Tested kernel-2.6.32-358.el6.x86_64.rpm, kernel-2.6.32-71.el6.x86_64.rpm, kernel-, kernel-, kernel-server-. Failed again.

Last-ditch effort ? Get a newer VNFS capsule from the CAOS repository, which is based on … This version already has a firmware-based bnx2 module. To begin with, test it on an as-is basis. If the memory problem persists, then we might get lucky and drop a new 2.6.32-XXX kernel into this capsule (and then start looking for site-specific changes). Will have to wait till tomorrow …

⇒ This went better. The memory problem persisted and the bnx2 module was not automatically loaded, but after logging in and doing an rmmod-modprobe cycle the network was up and functioning.
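
For the record, the rmmod-modprobe cycle is roughly what the sketch below does. It is only a sketch: the helper names (reload_commands, reload_module) are made up for this note, the module name defaults to bnx2 as above, and the real commands need root on the node. By default it only prints what it would run.

```python
import subprocess


def reload_commands(module="bnx2"):
    """Return the two commands of one unload/reload cycle for a kernel module."""
    return [["rmmod", module], ["modprobe", module]]


def reload_module(module="bnx2", dry_run=True):
    """Print the cycle (dry_run=True) or actually run it (needs root)."""
    for cmd in reload_commands(module):
        if dry_run:
            # Dry run: show the command without touching the system.
            print(" ".join(cmd))
        else:
            # check=True stops the cycle if rmmod or modprobe fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    reload_module()  # dry run: prints the two commands only
```

Calling reload_module(dry_run=False) as root would perform the actual unload and reload; whether this could be automated at boot in the capsule was not tried here.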

⇒ OK, here we go again : try to place kernel + firmware in the new capsule.

⇒ Finally got somewhere. It boots correctly, but slurm (due to its newer version) does not cooperate. Do some benchmarks with NAMD (this is without a CUDA-enabled card).

maintenance/aug_1st_2013.txt · Last modified: 2013/08/02 17:58 (external edit)