Published at LXer:
In a couple of fascinating threads on the lkml, Linus Torvalds has been working with several other kernel developers to track down a difficult data corruption bug [story]. Linus posted a test-program that's capable of consistently triggering the data corruption, so it's a matter of time before the bug is found and fixed. "I think the page-writeout is implicated," Linus explains, "because I do seem to need it, but the page-cache flush does seem to make corruption _easier_ to see. I now seem about to trigger it with a 100MB file on a 256MB machine in a minute or so, with this slight modification. I still don't see _why_, though. But maybe smarter people than me can see it." Earlier it was thought that new page balancing code added in the 2.6.19 kernel was to blame, but using Linus' test-program the data corruption has been reported as far back as the 2.6.5 kernel. "It's not actually a new bug at all," suggested Linus, "it's just that the dirty page balancing causes writeback to happen _earlier_, and thus is better able to _show_ a bug that we've likely had for a long long time." Before heading out to dinner to celebrate his birthday, Linus sent out a patch for tracing the areas of the kernel where the corruption bug is happening, "in the hope that somebody else is working on this corruption issue and is interested." He went on to summarize the current status of the debugging effort: "What we need now is actually looking at the source code, and people who understand the VM, I'm afraid. I'm gathering traces now that I have a good test-case. I'll post my trace tools once I've tested that they work, in case others want to help. (And hey, you don't have to be a VM expert to help: this could be a learning experience. However, I'll warn you: this is _the_ most grotty part of the whole kernel. It's not even ugly, it's just damn hard and complex)."