Hi. I'm jon.404, a Unix/Linux/Database/Openstack/Kubernetes Administrator, AWS/GCP/Azure Engineer, mathematics enthusiast, and amateur philosopher. This is where I rant about that which upsets me, laugh about that which amuses me, and jabber about that which holds my interest most: *nix.

RIP: bad value: NULL

Posted 09-18-2013 at 08:57 PM by rocket357

Spent a few days debugging what I thought was a bizarre kernel bug on a RHEL 6.3 box. We had crash utilities installed and I'd gotten incredibly comfortable with analyzing kernel crash dumps. I can also say that I'm intimately familiar with a few features of the Linux kernel that I didn't even know existed before.

If you know much about kernel debugging, you probably already know the fix (thanks to the subject line). See, when you analyze a kernel crash dump, you get certain information, such as the register states, backtrace, logs, etc. I'd gotten comfortable with disassembling functions and converting from assembly back to the C source (in my head) to see where the issue was occurring. Oddly, the vast majority of the time the crash occurred in rebalance_domains (2.6.32 kernel). This realization led me down the wrong path...for reasons that will soon be clear.

So, let's figure out what rebalance_domains does...it basically analyzes the load on each core in an SMP system and figures out how to evenly distribute the load across all cores. It's "process load balancing" in the kernel, so to speak. The fact that the crashes tended to occur in this particular function led me to a kernel dev list post about *disabling* this feature, which I thought was pretty neat, and I tried it out.

mkdir /dev/cpuset
mount -t cgroup -ocpuset cpuset /dev/cpuset
echo 0 > /dev/cpuset/cpuset.sched_load_balance

This seemed to work, and I was pretty confident that the workaround was solid...but the question bugged me: why is this necessary? I started digging through the crash dumps and noticed a strange issue. The RIP (the instruction pointer register) sometimes pointed to "half-instructions". I'd disassemble rebalance_domains and see an instruction at rebalance_domains+0x433, but RIP was pointing to rebalance_domains+0x432 (and the instructions on either side were at +0x431 and +0x436, so why is RIP pointing halfway through an instruction?!). We'd already done the obvious (replace the CPUs), so I thought "certainly this isn't a hardware issue! I must be doing something wrong."
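
For what it's worth, the "half-instruction" check boils down to asking whether the RIP offset matches any offset where an instruction starts. Here's a rough sketch of that idea, not something I actually ran at the time: on_boundary is a made-up helper name, and the offsets are just the ones quoted above (in practice they'd come from the disassembly of the function).

```shell
#!/bin/sh
# on_boundary RIP_OFFSET START_OFFSET...
# Hypothetical helper: exits 0 only if RIP_OFFSET equals one of the
# instruction start offsets passed after it, and 1 otherwise.
on_boundary() {
    rip=$(( $1 ))
    shift
    for start in "$@"; do
        if [ "$rip" -eq "$(( $start ))" ]; then
            return 0
        fi
    done
    return 1
}

# Offsets within rebalance_domains, as quoted in the post:
# RIP at +0x432, instructions starting at +0x431, +0x433 and +0x436.
if on_boundary 0x432 0x431 0x433 0x436; then
    echo "RIP is at the start of an instruction"
else
    echo "RIP points into the middle of an instruction"
fi
```

With the numbers from the dump, +0x432 matches none of the instruction starts, which is exactly the oddity that should have set off alarm bells.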

A month later, the server hadn't crashed or caused any issues aside from the occasional service restart. I considered myself lucky that it was just service restarts and not complete server failures. Time passed, and eventually I found this server back on my radar, with instructions to remove the load balancing "fix". Early one morning I logged in and removed the fix:

echo 1 > /dev/cpuset/cpuset.sched_load_balance

A few hours later, we started receiving alerts. A quick analysis revealed "Machine check events" in the logs, so I started reviewing the new crash dumps. As another Red Hat tech and I went through the crash dumps, a pattern started to emerge. CPU 7 was in most of the crash dumps, but not all. CPU 19 was there as well. Then I realized the machine had 12 physical cores. 12 + 7 = 19, so 19 must be CPU 7's hyperthread sibling. Ugh.
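
The sibling arithmetic is simple enough to sketch. This assumes the numbering scheme this box appeared to use, where logical CPU N and logical CPU N + (physical core count) share a core; other machines may enumerate differently. ht_sibling is a made-up helper name. On a live system, the kernel's own view in sysfs is the better source, so the sketch prints that too when the file exists.

```shell
#!/bin/sh
# ht_sibling CPU PHYSICAL_CORES
# Hypothetical helper: assumes logical CPUs N and N + PHYSICAL_CORES
# are the two hyperthreads of the same physical core.
ht_sibling() {
    echo $(( $1 + $2 ))
}

ht_sibling 7 12   # prints 19

# Cross-check against the kernel's topology info, if this machine has a cpu7.
siblings=/sys/devices/system/cpu/cpu7/topology/thread_siblings_list
if [ -r "$siblings" ]; then
    cat "$siblings"
fi
```

If the sysfs file lists both 7 and 19, the two "different" CPUs in the crash dumps are really one physical core.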

An analysis of the most recent crash dump revealed what I'd been overlooking all along:

RIP: bad value: NULL

NULL?!? But RIP can't be NULL! From the Intel Software Developer's Manual Volume 1:

"The EIP (RIP) register cannot be accessed directly by software; it is controlled implicitly by control-transfer instructions (such as JMP, Jcc, CALL, and RET), interrupts, and exceptions. The only way to read the EIP (RIP) register is to execute a CALL instruction and then read the value of the return instruction pointer from the procedure stack."

How the &*(@!#^*!@!!! did RIP get set to NULL if...

/facepalm

Sent the ticket back to the DC to have the CPUs replaced again. As the DC tech put it: "You must be the luckiest man alive to get back-to-back bad CPUs like this. If I were you, I'd play the lottery."

RIP: bad value: NULL
