I posted an issue on the ZFS GitHub tracker last November that for some time looked like a software bug, but I have since become unsure of this. This is a problem I have been unable to solve for 7 months now, and by posting here I am looking for new ideas. As you can imagine, I have tried many, many things to locate the defective part in my system, so please read everything first (I know it's a lot), as most things that come to mind have already been tried. Here is the original issue for reference:
https://github.com/zfsonlinux/zfs/issues/3990
The problem surfaces
Let's start at the beginning: we pulled a perfectly working server that was running Windows over Intel ICH RAID and decided to install Proxmox 3 (basically a Debian Wheezy system with a 2.6.32 RHEL6 kernel and virtualization management built in) on it. The specs: ASUS P8H67 motherboard, Intel Core i7-2600 CPU, 4x 8GB DDR3 RAM, 4x Toshiba 2TB HDDs (brand new).
The Proxmox setup created a ZFS RAID10 pool on the four 2TB disks, and while restoring and migrating VMs to this server (basically copying hundreds of gigabytes to the ZFS pool) we noticed irreparable checksum errors during routine scrubs.
Code:
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h4m with 1 errors on Thu Nov  5 21:30:02 2015
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     1
	  mirror-0  ONLINE       0     0     2
	    sdc2    ONLINE       0     0     2
	    sdf2    ONLINE       0     0     2
	  mirror-1  ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/501/vm-501-disk-1.qcow2
First diagnosis
So at first I suspected a memory or disk error, yet running memtest for hours (in both single-core and SMP mode) and checking the disks with smartctl did not reveal any defects. I tried diagnostic tools (like mcelog), but nothing showed up. I also tried many kernel options (mtrr, slub_nomerge, memory_corruption_check, intel_iommu, etc.), flashing a new BIOS, disabling Turbo/EIST/C-states, relaxing the RAM timings (from 9-9-9-24 to 11-11-11-30) and lowering the RAM speed (DDR3-1333, 1066 and 800), but nothing led to a solution. I was so desperate I even tried running the system outside the case.
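For anyone wanting to retrace the kernel-option experiments: these are passed on the kernel command line via GRUB. A sketch of the config fragment (the exact values here are illustrative assumptions; I applied options one at a time, not all together):

```shell
# /etc/default/grub -- illustrative debug parameters (values are my
# assumptions, tried one at a time, never all at once).
# Run `update-grub` and reboot after each edit.
GRUB_CMDLINE_LINUX_DEFAULT="quiet memory_corruption_check=1 slub_nomerge intel_iommu=off"
```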
On the hardware side I tried different SATA cables, a different set of disks, another Sandy Bridge CPU (an i5-2500K), a dedicated PCI-E GPU (Radeon 6870), a stronger PSU and different RAM modules, and only one thing stood out:
if running the system with 2 DIMMs (single or dual channel) there was no corruption (only if running with 3 or 4 DIMMs).
Eliminating hardware
So I started replacing every single piece of this system with newly ordered parts: a new Intel DQ77MK motherboard (one generation newer), a new Intel Core i7-3770 CPU (also one generation newer), and an Adaptec 6805E RAID card to take the Intel ICH controller out of the mix. Needless to say, the checksum errors were still there:
Code:
Linux proxmox 2.6.32-39-pve #1 SMP Fri May 8 11:27:35 CEST 2015 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.4.1-1, ZFS pool version 5000, ZFS filesystem version 5
	NAME                                            STATE     READ WRITE CKSUM
	rpool                                           ONLINE       0     0    35
	  mirror-0                                      ONLINE       0     0    34
	    scsi-SAdaptec_Morphed_JBOD_00FABE6527-part2 ONLINE       0     0    42
	    scsi-SAdaptec_Morphed_JBOD_01E1CE6527-part2 ONLINE       0     0    44
	  mirror-1                                      ONLINE       0     0    36
	    scsi-SAdaptec_Morphed_JBOD_025EDA6527       ONLINE       0     0    48
	    scsi-SAdaptec_Morphed_JBOD_0347E66527       ONLINE       0     0    45
Important discovery: the number of errors correlates with the amount of data copied, and also with the filesystem's checksum type (ZFS reports more errors than Btrfs on the same amount of data).
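To quantify that correlation I watched the CKSUM totals grow between copy rounds. A small helper like this (my own sketch, not from the original thread) sums the CKSUM column of `zpool status` output; here it is fed the sample scrub output from above, but in practice you would pipe in `zpool status rpool`:

```shell
# Sketch of a helper (my own naming) that totals the CKSUM column
# from `zpool status` output, to track error growth after each bulk copy.
sum_cksum() {
  awk '/ONLINE/ { total += $NF } END { print total+0 }'
}

# Fed the sample output from the first scrub; in practice:
#   zpool status rpool | sum_cksum
sum_cksum <<'EOF'
	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     1
	  mirror-0  ONLINE       0     0     2
	    sdc2    ONLINE       0     0     2
	    sdf2    ONLINE       0     0     2
	  mirror-1  ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sde     ONLINE       0     0     0
EOF
# prints 7
```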
Eliminating software
In the following months I tried to replicate the issue on different kernels (2.6.32, 4.2 and 4.4) and ZFS versions (four versions between 0.6.5 and 0.6.5.6), and later on Btrfs and ext4 as well. ZFS and Btrfs showed checksum errors on every combination, yet interestingly ext4 did not at that time (even when verifying the copied files with manual checksumming).
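The manual verification on ext4 was nothing fancy. A minimal sketch (paths here are temporary files for illustration): hash the source, copy it, re-read the copy and compare digests. On healthy hardware this prints "match".

```shell
# Minimal sketch of the manual checksum verification (illustrative
# temp files stand in for the real VM images).
src=$(mktemp)
dst=$(mktemp)
head -c 1048576 /dev/urandom > "$src"   # 1 MiB of test data
cp "$src" "$dst"
sync
# On a real run, drop the page cache first (needs root) so the copy is
# re-read from disk rather than served from RAM:
#   echo 3 > /proc/sys/vm/drop_caches
a=$(sha256sum "$src" | cut -d' ' -f1)
b=$(sha256sum "$dst" | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "match" || echo "MISMATCH"
rm -f "$src" "$dst"
```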
An unreliable computer is born
So I accepted defeat and put the server back in service (with no important data, of course) with LVM/ext4 running on Adaptec HW RAID10. Soon enough MySQL (InnoDB pages carry checksums) reported corrupted pages on disk. I also tried running ZFS for a few weeks with only 2 DIMMs installed (the only configuration that had shown no errors), and lo and behold, there were checksum errors there too, only very rarely (about once every 3 weeks). The server now sits at my home, and I am not sure I trust it to run anything more than Minecraft.
The question
So after 7 months of trying and failing, of ordering and replacing part after part, the question remains: what the hell causes a problem in a system that has had every single part replaced? Is it possible that a Sandy/Ivy Bridge architectural problem lurks beneath the surface unnoticed? Please share your ideas, as I am completely out of them...