LinuxQuestions.org
06-01-2016, 10:02 AM   #1
gkovacs
LQ Newbie
 
Registered: Jun 2016
Posts: 3

Rep: Reputation: Disabled
ZFS/Btrfs mysterious checksum errors


I posted an issue on the ZFS GitHub last November that for some time looked like a software bug, but I have since become unsure of that. I have been unable to solve this problem for 7 months now, and by posting here I am looking for new ideas. As you can imagine, I have tried many, many things to locate the defective part in my system, so please try to read everything first (I know it's a lot), as most things that come to mind have already been tried. Here is the original thread for reference: https://github.com/zfsonlinux/zfs/issues/3990

The problem surfaces
Let's start at the beginning: we took a perfectly working server running Windows over Intel ICH RAID out of service and decided to install Proxmox 3 (basically a Debian Wheezy system with a 2.6.32 RHEL6 kernel and virtualization management built in) on it. The specs were: ASUS P8H67 motherboard, Intel Core i7-2600 CPU, 4x 8GB DDR3 RAM, 4x Toshiba 2TB HDD (brand new).

Proxmox setup created a ZFS RAID10 pool on the four 2TB disks, and while restoring and migrating VMs to this server (basically copying hundreds of gigabytes to the ZFS pool) we noticed unrepairable checksum errors during routine scrubs.

Code:
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h4m with 1 errors on Thu Nov  5 21:30:02 2015
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     1
      mirror-0  ONLINE       0     0     2
        sdc2    ONLINE       0     0     2
        sdf2    ONLINE       0     0     2
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //var/lib/vz/images/501/vm-501-disk-1.qcow2
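
When comparing many scrub results over time, it can help to pull the CKSUM column out of the status text automatically. The following is only a minimal sketch: the function name `parse_cksum_counts` and the `SAMPLE` constant are made up for illustration, and it assumes plain integer counters (abbreviated counts such as `1.2K` are not handled).

```python
def parse_cksum_counts(status_text):
    """Return {device_name: cksum_count} from the config table of `zpool status`.

    Hypothetical helper: it only understands the five-column table
    (NAME STATE READ WRITE CKSUM) with plain integer counters.
    """
    header = ["NAME", "STATE", "READ", "WRITE", "CKSUM"]
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == header:
            in_table = True
            continue
        if in_table and not fields:
            break  # a blank line ends the config table
        if in_table and len(fields) >= 5 and fields[4].isdigit():
            counts[fields[0]] = int(fields[4])
    return counts


# Table copied from the status output above.
SAMPLE = """\
    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     1
      mirror-0  ONLINE       0     0     2
        sdc2    ONLINE       0     0     2
        sdf2    ONLINE       0     0     2
      mirror-1  ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
"""

if __name__ == "__main__":
    for name, count in parse_cksum_counts(SAMPLE).items():
        if count:
            print(name, count)
```

Logging these counts after every scrub would make it easier to see whether errors pile up on one mirror or spread across all devices.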
First diagnosis
So first I thought memory or disk error, yet running memtest for hours (in both single core and SMP mode) and checking the disks with smartctl did not show any defects. I tried diagnostic tools (like mcelog), but nothing popped up. I also tried many kernel options (mtrr, slub_nomerge, memory_corruption_check, intel_iommu, etc.), flashing a new BIOS, disabling Turbo/EIST/C-states, and relaxing RAM timings (from 9-9-9-24 to 11-11-11-30) and RAM speeds (DDR3-1333, 1066 and 800), but nothing led to a solution. I was so desperate I even tried running the system out of the case.

On the hardware part I tried with different SATA cables, different set of disks, another Sandy Bridge CPU (an i5-2500K), a dedicated PCI-E GPU (Radeon 6870), a stronger PSU and different RAM modules, and only one thing showed up: if running the system with 2 DIMMs (single or dual channel) there was no corruption (only if running with 3 or 4 DIMMs).

Eliminating hardware
So I started replacing every single piece of this system with newly ordered parts: a new Intel DQ77MK motherboard (one generation newer), a new Intel Core i7-3770 CPU (also one generation newer), and an Adaptec 6805E RAID card to eliminate the Intel ICH controller from the mix. Needless to say, the checksum errors were still there:

Code:
Linux proxmox 2.6.32-39-pve #1 SMP Fri May 8 11:27:35 CEST 2015 x86_64 GNU/Linux
kernel: ZFS: Loaded module v0.6.4.1-1, ZFS pool version 5000, ZFS filesystem version 5

     NAME                                             STATE     READ WRITE CKSUM
     rpool                                            ONLINE       0     0    35
       mirror-0                                       ONLINE       0     0    34
         scsi-SAdaptec_Morphed_JBOD_00FABE6527-part2  ONLINE       0     0    42
         scsi-SAdaptec_Morphed_JBOD_01E1CE6527-part2  ONLINE       0     0    44
       mirror-1                                       ONLINE       0     0    36
         scsi-SAdaptec_Morphed_JBOD_025EDA6527        ONLINE       0     0    48
         scsi-SAdaptec_Morphed_JBOD_0347E66527        ONLINE       0     0    45
Important discovery: the number of errors correlates with the amount of copied data, and also with the filesystem checksum type (ZFS shows more errors than Btrfs on the same amount of data).

Eliminating software
In the following months I tried to replicate the issue on different kernels (2.6.32, 4.2 and 4.4) and ZFS versions (4 versions from 0.6.5 to 0.6.5.6), and later on Btrfs and ext4 as well. All of the ZFS and Btrfs setups showed checksum errors, yet interestingly ext4 didn't at that time (even when checking the copied files with manual checksumming).
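
For anyone wanting to repeat the manual checksumming on a filesystem without built-in checksums (such as ext4), a rough sketch follows. It is not the exact procedure used here: the function names `sha256_of` and `find_mismatches` are illustrative, and SHA-256 is just one reasonable choice of hash. Keep in mind that a file read back right after copying may be served from the page cache rather than from disk, so dropping caches or remounting before verifying is advisable.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large VM images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_dir, dst_dir):
    """Return relative paths whose copy in dst_dir is missing or hashes differently."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    mismatches = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(str(rel))
    return mismatches
```

Calling `find_mismatches("/path/to/source", "/path/to/copy")` (placeholder paths) returns an empty list when every copied file matches its source.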

An unreliable computer is born
So I accepted defeat and put the server back in service (no important data of course) with LVM/ext4 running on Adaptec HW RAID10. Soon enough MySQL (InnoDB has checksumming) showed corrupted pages on disk. I also tried running ZFS for a few weeks with only 2 DIMMs installed in the motherboard (the only config that had shown no errors), and lo and behold, there were checksum errors after all, only very rarely (like once every 3 weeks). The server now sits at my home, and I am not sure I trust it to run anything more than Minecraft.

The question
So after 7 months of trying and failing, ordering and replacing part after part the question remains: what the hell causes a problem in a system that had every single part of it replaced? Is it possible that a Sandy/Ivy Bridge architectural problem lurks beneath the surface unnoticed? Please share your ideas, as I am completely out of them...
 
06-01-2016, 12:18 PM   #2
designator
Member
 
Registered: Jun 2003
Location: California, USA
Distribution: OpenSUSE Tumbleweed
Posts: 219

Rep: Reputation: 37
Take each disk out of RAID and zero it out. dd may find errors that SMART won't pick up. Really sounds like a drive problem since it's the only hardware you did not replace.
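
The zero-and-read-back idea can be sketched roughly as below. This is an illustration, not a replacement for dd: the function name `zero_and_verify` is made up, the sketch targets an ordinary scratch file, and pointing it at a block device would destroy the data on it. The read-back may also be served from the page cache unless caches are dropped first.

```python
import os


def zero_and_verify(path, total_bytes, chunk_size=1 << 20):
    """Write total_bytes of zeros to path, then read back and report bad chunks.

    Returns the byte offsets of chunks that contain any non-zero byte.
    WARNING: overwrites whatever is at `path`.
    """
    zeros = bytes(chunk_size)
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data out of userspace buffers to the device

    bad_offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            if chunk.count(0) != len(chunk):
                bad_offsets.append(offset)
            offset += len(chunk)
    return bad_offsets
```

An empty result means every chunk read back as zeros; any returned offset marks a chunk worth inspecting more closely.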
 
06-01-2016, 12:40 PM   #3
gkovacs
LQ Newbie
 
Registered: Jun 2016
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by designator
Take each disk out of RAID and zero it out. dd may find errors that SMART won't pick up. Really sounds like a drive problem since it's the only hardware you did not replace.
I did replace everything, in fact it was the first thing I did try:

Quote:
On the hardware part I tried with different SATA cables, different set of disks
Not to mention the fact that the errors get created in memory, since they are irreparable errors (they are on both disks of a mirror, in the very same place).
Please read more carefully...
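
The reasoning about irreparable mirror errors can be spelled out as a small sketch. Everything here is illustrative: `classify_block` is a made-up name, and CRC32 stands in for the real filesystem checksum only to keep the example short. The idea is that when both mirror copies are byte-identical yet fail the stored checksum, the data was most likely already wrong before it reached either disk.

```python
import zlib


def classify_block(expected_crc, copy_a, copy_b):
    """Classify a mirrored block from its two on-disk copies and stored checksum.

    CRC32 is only a stand-in for the filesystem's own checksum algorithm.
    """
    ok_a = zlib.crc32(copy_a) == expected_crc
    ok_b = zlib.crc32(copy_b) == expected_crc
    if ok_a and ok_b:
        return "healthy"
    if ok_a or ok_b:
        # One good copy remains, so a mirror can self-heal; this pattern
        # points at a single disk, cable, or controller path.
        return "repairable"
    if copy_a == copy_b:
        # Both copies agree with each other but not with the checksum:
        # the corruption likely happened before the write (memory/CPU path).
        return "unrepairable-identical"
    return "unrepairable-different"
```

For example, `classify_block(zlib.crc32(b"good"), b"g0od", b"g0od")` falls into the "unrepairable-identical" case, which is the pattern described above.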

Last edited by gkovacs; 06-01-2016 at 12:41 PM.
 
06-02-2016, 02:47 PM   #4
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
Quote:
Originally Posted by gkovacs
So first I thought memory or disk error, yet running memtest for hours (in both single core and SMP mode) and checking the disks with smartctl did not show any defects.
This is inconclusive even when running memtest for days, let alone only hours. You don't mention RAM being replaced.

Quote:
and only one thing showed up: if running the system with 2 DIMMs (single or dual channel) there was no corruption (only if running with 3 or 4 DIMMs).
Bad RAM, or the wrong RAM, and more RAM exacerbates the problem. And it could be more than one problem, that in combination result in this corruption: i.e. it could be marginal RAM, and unfiltered power not being run through a UPS, for example.
 
06-02-2016, 03:40 PM   #5
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
OK, so I read the thread https://github.com/zfsonlinux/zfs/issues/3990 and the last entry is March 22; there's still the open-ended question about single channel testing with 2 DIMMs, to determine if the problem is due to the absolute number of modules or the number of modules per channel.

If the problem does not reproduce with FreeBSD on the same hardware, it suggests it's a long-standing problem in the Linux kernel related to memory management, possibly a conditional bug with a fairly narrow set of hardware. Otherwise a bunch of people would be having problems. I think you'll need a write-up, with the upfront summary that the corruption reproduces with various makes/models of manufacturer-specified compatible RAM for the motherboard, but only with 3+ DIMMs. If 2 or fewer DIMMs, no corruption ever.

And then if you can point to the problem not happening with FreeBSD even in the 3+ DIMM case, that's a very powerful argument, and almost certainly you've found a rather significant bug somewhere in the kernel.
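
To keep such a write-up systematic, the test cases could be enumerated up front. The sketch below is only a suggestion: it assumes a dual-channel board with two slots per channel, treats the two channels as interchangeable (mirror-image layouts are skipped), and all names in it are made up for illustration.

```python
from itertools import product

SLOTS_PER_CHANNEL = 2          # assumption: two DIMM slots on each of two channels
OSES = ("Linux", "FreeBSD")    # the two kernels to compare


def dimm_layouts():
    """List (channel_a, channel_b) DIMM counts, skipping empty and mirrored layouts."""
    layouts = []
    for a, b in product(range(SLOTS_PER_CHANNEL + 1), repeat=2):
        if a + b == 0 or a < b:
            continue
        layouts.append((a, b))
    return layouts


def test_plan():
    """Build one record per OS and DIMM layout, ready to be filled in with results."""
    return [
        {
            "os": os_name,
            "channel_a": a,
            "channel_b": b,
            "total_dimms": a + b,
            "max_per_channel": max(a, b),
        }
        for os_name in OSES
        for a, b in dimm_layouts()
    ]


if __name__ == "__main__":
    for case in test_plan():
        print(case)
```

The (2, 0) layout is the single channel, 2 DIMM case still open in the GitHub thread; comparing it against (1, 1) separates "number of modules" from "modules per channel".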
 
06-02-2016, 04:33 PM   #6
chrismurphy
LQ Newbie
 
Registered: Feb 2011
Posts: 18

Rep: Reputation: 1
And where to put the write-up? lkml.org
 
  



Tags
btrfs, checksum, error, wtf, zfs


