I posted an issue on the ZFS GitHub tracker last November that for some time looked like a software bug, but I have since become unsure of this. This is a problem I have been unable to solve for 7 months now, and by posting here I am looking for new ideas. As you can imagine, I have tried many, many things to locate the defective part in my system, so please read everything first (I know it's a lot), as most things that come to mind have already been tried. Here is the original issue for reference:
https://github.com/zfsonlinux/zfs/issues/3990
The problem surfaces
Let's start at the beginning: we pulled a perfectly working server that was running Windows over Intel ICH RAID and decided to install Proxmox 3 (basically a Debian Wheezy system with a 2.6.32 RHEL6 kernel and virtualization management built in) on it. The specs: ASUS P8H67 motherboard, Intel Core i7-2600 CPU, 4x 8GB DDR3 RAM, 4x Toshiba 2TB HDDs (brand new).
The Proxmox setup created a ZFS RAID10 pool on the four 2TB disks, and while restoring and migrating VMs to this server (basically copying hundreds of gigabytes to the ZFS pool) we noticed irreparable checksum errors during routine scrubs.
Code:
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h4m with 1 errors on Thu Nov  5 21:30:02 2015
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     1
	  mirror-0  ONLINE       0     0     2
	    sdc2    ONLINE       0     0     2
	    sdf2    ONLINE       0     0     2
	  mirror-1  ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/501/vm-501-disk-1.qcow2
First diagnosis
So at first I suspected a memory or disk error, yet running memtest for hours (in both single-core and SMP mode) and checking the disks with smartctl did not reveal any defects. I tried diagnostic tools (like mcelog), but nothing showed up. I also tried many kernel options (mtrr, slub_nomerge, memory_corruption_check, intel_iommu, etc.), flashing a new BIOS, disabling Turbo/EIST/C-states, relaxing the RAM timings (from 9-9-9-24 to 11-11-11-30) and lowering the RAM speed (DDR3-1333, 1066 and 800), but nothing led to a solution. I was so desperate I even tried running the system outside the case.
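For anyone wanting to retrace the kernel-option experiments: these are passed on the kernel command line via GRUB. A sketch of the config fragment (the exact values here are illustrative assumptions; I applied options one at a time, not all together):

```shell
# /etc/default/grub -- illustrative debug parameters (values are my
# assumptions, tried one at a time, never all at once).
# Run `update-grub` and reboot after each edit.
GRUB_CMDLINE_LINUX_DEFAULT="quiet memory_corruption_check=1 slub_nomerge intel_iommu=off"
```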
On the hardware side I tried different SATA cables, a different set of disks, another Sandy Bridge CPU (an i5-2500K), a dedicated PCI-E GPU (Radeon 6870), a stronger PSU and different RAM modules, and only one thing stood out:
if running the system with 2 DIMMs (single or dual channel) there was no corruption (only if running with 3 or 4 DIMMs).
Eliminating hardware
So I started replacing every single piece of this system with newly ordered parts: a new Intel DQ77MK motherboard (one generation newer), a new Intel Core i7-3770 CPU (also one generation newer), and an Adaptec 6805E RAID card to take the Intel ICH controller out of the mix. Needless to say, the checksum errors were still there:
Code:
Linux proxmox 2.6.32-39-pve #1 SMP Fri May 8 11:27:35 CEST 2015 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.4.1-1, ZFS pool version 5000, ZFS filesystem version 5
	NAME                                            STATE     READ WRITE CKSUM
	rpool                                           ONLINE       0     0    35
	  mirror-0                                      ONLINE       0     0    34
	    scsi-SAdaptec_Morphed_JBOD_00FABE6527-part2 ONLINE       0     0    42
	    scsi-SAdaptec_Morphed_JBOD_01E1CE6527-part2 ONLINE       0     0    44
	  mirror-1                                      ONLINE       0     0    36
	    scsi-SAdaptec_Morphed_JBOD_025EDA6527       ONLINE       0     0    48
	    scsi-SAdaptec_Morphed_JBOD_0347E66527       ONLINE       0     0    45
Important discovery: the number of errors correlates with the amount of data copied, and also with the filesystem's checksum type (ZFS reports more errors than Btrfs on the same amount of data).
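To quantify that correlation I watched the CKSUM totals grow between copy rounds. A small helper like this (my own sketch, not from the original thread) sums the CKSUM column of `zpool status` output; here it is fed the sample scrub output from above, but in practice you would pipe in `zpool status rpool`:

```shell
# Sketch of a helper (my own naming) that totals the CKSUM column
# from `zpool status` output, to track error growth after each bulk copy.
sum_cksum() {
  awk '/ONLINE/ { total += $NF } END { print total+0 }'
}

# Fed the sample output from the first scrub; in practice:
#   zpool status rpool | sum_cksum
sum_cksum <<'EOF'
	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     1
	  mirror-0  ONLINE       0     0     2
	    sdc2    ONLINE       0     0     2
	    sdf2    ONLINE       0     0     2
	  mirror-1  ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sde     ONLINE       0     0     0
EOF
# prints 7
```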
Eliminating software
In the following months I tried to replicate the issue on different kernels (2.6.32, 4.2 and 4.4) and ZFS versions (four versions between 0.6.5 and 0.6.5.6), and later on Btrfs and ext4 as well. ZFS and Btrfs showed checksum errors on every combination, yet interestingly ext4 did not at that time (even when verifying the copied files with manual checksumming).
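The manual verification on ext4 was nothing fancy. A minimal sketch (paths here are temporary files for illustration): hash the source, copy it, re-read the copy and compare digests. On healthy hardware this prints "match".

```shell
# Minimal sketch of the manual checksum verification (illustrative
# temp files stand in for the real VM images).
src=$(mktemp)
dst=$(mktemp)
head -c 1048576 /dev/urandom > "$src"   # 1 MiB of test data
cp "$src" "$dst"
sync
# On a real run, drop the page cache first (needs root) so the copy is
# re-read from disk rather than served from RAM:
#   echo 3 > /proc/sys/vm/drop_caches
a=$(sha256sum "$src" | cut -d' ' -f1)
b=$(sha256sum "$dst" | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "match" || echo "MISMATCH"
rm -f "$src" "$dst"
```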
An unreliable computer is born
So I accepted defeat and put the server back in service (with no important data, of course) with LVM/ext4 running on Adaptec HW RAID10. Soon enough MySQL (InnoDB pages carry checksums) reported corrupted pages on disk. I also tried running ZFS for a few weeks with only 2 DIMMs installed (the only configuration that had shown no errors), and lo and behold, there were checksum errors there too, only very rarely (about once every 3 weeks). The server now sits at my home, and I am not sure I trust it to run anything more than Minecraft.
The question
So after 7 months of trying and failing, of ordering and replacing part after part, the question remains: what the hell causes a problem in a system that has had every single part replaced? Is it possible that a Sandy/Ivy Bridge architectural problem lurks beneath the surface unnoticed? Please share your ideas, as I am completely out of them...