LinuxQuestions.org
11-06-2015, 05:46 AM #1
matt2kjones
LQ Newbie
Registered: May 2004
Location: Wales, UK
Distribution: Gentoo
Posts: 5
Kernel issue with raid6


Hello,

I have a raid6 array with a damaged hard drive. However, when a write error occurs on the array, the array doesn't fail the hard drive. Instead, one of two things happens, depending on the kernel version:

With kernel 3.18.12, I/O errors are logged to dmesg and the file being written to the array ends up corrupt. The array does not fail the disk as it should, so I end up with tons of corrupt files.

With any 4.x kernel (I have tried both 4.0.9 and 4.1.12), a write error triggers a kernel oops in dmesg and all I/O to the array hangs. A ton of processes get stuck in state D (uninterruptible sleep), so I have to forcefully reboot the server, and the disks are never marked as failed.
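
When I/O hangs like this, it can help to record which tasks are blocked before forcing the reboot. The Python sketch below is only an illustration (not something from the original setup): it scans /proc/<pid>/stat and lists every process whose state field is D. The helper names parse_stat_state and d_state_processes are hypothetical. As an alternative, writing w to /proc/sysrq-trigger as root (on a kernel built with CONFIG_MAGIC_SYSRQ) makes the kernel dump the stacks of blocked tasks into dmesg.

```python
import os


def parse_stat_state(stat_line: str) -> tuple[str, str]:
    """Return (comm, state) from the contents of one /proc/<pid>/stat file.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so the state is taken from the first field after the *last* ')'.
    """
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    return stat_line[start + 1:end], stat_line[end + 1:].split()[0]


def d_state_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            stuck.append((int(entry), comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

The proc_root parameter exists only so the scan can be pointed at a copy of /proc for testing.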

Here is the dmesg output when a write error occurs on kernel 3.18.12:

Code:
[  172.679073] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 1052672 starting block 5172953088)
[  172.679076] Buffer I/O error on device md4, logical block 5172953088
[  172.679078] Buffer I/O error on device md4, logical block 5172953089
[  172.679078] Buffer I/O error on device md4, logical block 5172953090
[  172.679079] Buffer I/O error on device md4, logical block 5172953091
[  172.679080] Buffer I/O error on device md4, logical block 5172953092
[  172.679081] Buffer I/O error on device md4, logical block 5172953093
[  172.679082] Buffer I/O error on device md4, logical block 5172953094
[  172.679082] Buffer I/O error on device md4, logical block 5172953095
[  172.679083] Buffer I/O error on device md4, logical block 5172953096
[  172.679084] Buffer I/O error on device md4, logical block 5172953097
[  172.983977] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 1576960 starting block 5172953216)
[  173.489071] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 2101248 starting block 5172953344)
[  174.330710] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 2625536 starting block 5172953472)
[  175.123257] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 3149824 starting block 5172953600)
[  175.406390] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 3674112 starting block 5172953728)
[  175.608958] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 4198400 starting block 5172953856)
[  175.968224] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 4722688 starting block 5172953984)
[  176.130072] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 5246976 starting block 5172954112)
[  176.215623] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 25165824 size 6819840 starting block 5172954240)
[  177.925267] EXT4-fs warning: 6 callbacks suppressed
[  177.925270] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 1052672 starting block 5172955136)
[  177.925271] buffer_io_error: 2038 callbacks suppressed
[  177.925272] Buffer I/O error on device md4, logical block 5172955136
[  177.925274] Buffer I/O error on device md4, logical block 5172955137
[  177.925275] Buffer I/O error on device md4, logical block 5172955138
[  177.925276] Buffer I/O error on device md4, logical block 5172955139
[  177.925276] Buffer I/O error on device md4, logical block 5172955140
[  177.925277] Buffer I/O error on device md4, logical block 5172955141
[  177.925278] Buffer I/O error on device md4, logical block 5172955142
[  177.925279] Buffer I/O error on device md4, logical block 5172955143
[  177.925280] Buffer I/O error on device md4, logical block 5172955144
[  177.925280] Buffer I/O error on device md4, logical block 5172955145
[  178.642566] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 1576960 starting block 5172955264)
[  179.078914] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 2101248 starting block 5172955392)
[  179.976324] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 2625536 starting block 5172955520)
[  180.782833] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 3149824 starting block 5172955648)
[  181.333570] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 3674112 starting block 5172955776)
[  181.820475] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 4198400 starting block 5172955904)
[  183.171425] EXT4-fs warning (device md4): ext4_end_bio:317: I/O error -5 writing to inode 361872033 (offset 33554432 size 4722688 starting block 5172956032)
[  183.171428] buffer_io_error: 886 callbacks suppressed
[  183.171429] Buffer I/O error on device md4, logical block 5172956032
[  183.171431] Buffer I/O error on device md4, logical block 5172956033
[  183.171432] Buffer I/O error on device md4, logical block 5172956034
[  183.171433] Buffer I/O error on device md4, logical block 5172956035
[  183.171434] Buffer I/O error on device md4, logical block 5172956036
[  183.171435] Buffer I/O error on device md4, logical block 5172956037
[  183.171436] Buffer I/O error on device md4, logical block 5172956038
[  183.171436] Buffer I/O error on device md4, logical block 5172956039
[  183.171437] Buffer I/O error on device md4, logical block 5172956040
[  183.171438] Buffer I/O error on device md4, logical block 5172956041
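
Because the 3.18.12 errors are reported per inode, the damaged files can be traced from the log. A minimal Python sketch, assuming the warning format shown above (the number after ext4_end_bio: differs between kernels, so it is matched loosely): it collects the affected inode numbers per device and prints a debugfs ncheck command for each, which maps an inode back to a path name. It only prints the commands; reading dmesg may need root if dmesg_restrict is set. The log also says "callbacks suppressed", so rate-limited warnings may make the list incomplete. The function name inodes_with_write_errors is hypothetical.

```python
import re
import subprocess
from collections import defaultdict

# Matches the ext4_end_bio warnings from the log above. The source line
# number after "ext4_end_bio:" (317 here) differs between kernels, so any
# number is accepted.
EXT4_WRITE_ERR = re.compile(
    r"EXT4-fs warning \(device (?P<dev>\S+)\): ext4_end_bio:\d+: "
    r"I/O error -?\d+ writing to inode (?P<inode>\d+)"
)


def inodes_with_write_errors(dmesg_text: str) -> dict[str, set[int]]:
    """Map each ext4 device name to the inodes that hit write errors."""
    hits: dict[str, set[int]] = defaultdict(set)
    for line in dmesg_text.splitlines():
        m = EXT4_WRITE_ERR.search(line)
        if m:
            hits[m.group("dev")].add(int(m.group("inode")))
    return dict(hits)


if __name__ == "__main__":
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for dev, inodes in sorted(inodes_with_write_errors(log).items()):
        for ino in sorted(inodes):
            # debugfs's ncheck request prints the path name(s) for an inode.
            print(f'debugfs -R "ncheck {ino}" /dev/{dev}')
```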

Here is sample dmesg output when a write error occurs on kernel 4.0.9 or 4.1.12:

Code:
[  158.138253] BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
[  158.138391] IP: [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
[  158.138482] PGD 24ff59067 PUD 24fe43067 PMD 0
[  158.138646] Oops: 0000 [#1] SMP
[  158.138758] Modules linked in: ipv6 binfmt_misc joydev x86_pkg_temp_thermal coretemp kvm_intel kvm microcode pcspkr video i2c_i801 thermal acpi_cpufreq fan battery rtc_cmos backlight processor thermal_sys xhci_pci button xts gf128mul aes_x86_64 cbc sha256_generic scsi_transport_iscsi multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod hid_sunplus hid_sony led_class hid_samsung hid_pl hid_petalynx hid_monterey hid_microsoft hid_logitech hid_gyration hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech sl811_hcd usbhid xhci_hcd ohci_hcd uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common megaraid_sas megaraid_mbox megaraid_mm megaraid sx8
[  158.141809]  DAC960 cciss mptsas mptfc scsi_transport_fc mptspi scsi_transport_spi mptscsih mptbase sg
[  158.142226] CPU: 0 PID: 2017 Comm: md4_raid6 Not tainted 4.1.12-gentoo #1
[  158.142272] Hardware name: Supermicro X10SAT/X10SAT, BIOS 2.0 04/21/2014
[  158.142323] task: ffff880254267050 ti: ffff880095afc000 task.ti: ffff880095afc000
[  158.142376] RIP: 0010:[<ffffffffa024cc1f>]  [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
[  158.142493] RSP: 0018:ffff880095affc18  EFLAGS: 00010202
[  158.142554] RAX: 000000000000000d RBX: ffff880095cfac00 RCX: 0000000000000002
[  158.142617] RDX: 000000000000000d RSI: 0000000000000000 RDI: 0000000000001040
[  158.142682] RBP: ffff880095affcf8 R08: 0000000000000003 R09: 00000000cd920408
[  158.142745] R10: 000000000000000d R11: 0000000000000007 R12: 000000000000000d
[  158.142809] R13: 0000000000000000 R14: 000000000000000c R15: ffff8802161f2588
[  158.142873] FS:  0000000000000000(0000) GS:ffff88025ea00000(0000) knlGS:0000000000000000
[  158.142938] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  158.143000] CR2: 0000000000000120 CR3: 0000000253ef4000 CR4: 00000000001406f0
[  158.143062] Stack:
[  158.143117]  0000000000000000 ffff880254267050 00000000000147c0 0000000000000000
[  158.143328]  ffff8802161f25d0 0000000effffffff ffff8802161f3670 ffff8802161f2ef0
[  158.143537]  0000000000000000 0000000000000000 0000000000000000 0000000c00000000
[  158.143747] Call Trace:
[  158.143805]  [<ffffffffa024dea3>] handle_active_stripes.isra.37+0x225/0x2aa [raid456]
[  158.143873]  [<ffffffffa024e31d>] raid5d+0x363/0x40d [raid456]
[  158.143937]  [<ffffffff814315bc>] ? schedule+0x6f/0x7e
[  158.143998]  [<ffffffff81372ae7>] md_thread+0x125/0x13b
[  158.144060]  [<ffffffff81061b00>] ? wait_woken+0x71/0x71
[  158.144122]  [<ffffffff813729c2>] ? md_start_sync+0xda/0xda
[  158.144185]  [<ffffffff81050609>] kthread+0xcd/0xd5
[  158.144244]  [<ffffffff8105053c>] ? kthread_create_on_node+0x16d/0x16d
[  158.144309]  [<ffffffff81434f92>] ret_from_fork+0x42/0x70
[  158.144370]  [<ffffffff8105053c>] ? kthread_create_on_node+0x16d/0x16d
[  158.144432] Code: 8c 0f d0 01 00 00 48 8b 49 10 80 e1 10 74 0d 49 8b 4f 48 80 e1 40 0f 84 c2 0f 00 00 31 c9 41 39 c8 7e 31 48 8b b4 cd 50 ff ff ff <48> 83 be 20 01 00 00 00 74 1a 48 8b be 38 01 00 00 40 80 e7 01
[  158.147700] RIP  [<ffffffffa024cc1f>] handle_stripe+0xdc0/0x1e1f [raid456]
[  158.147801]  RSP <ffff880095affc18>
[  158.147859] CR2: 0000000000000120
[  158.147916] ---[ end trace 536b72bd7c91f068 ]---

Things I have tried:

Disabling queuing on all drives
Disabling the write cache on all drives
Building a minimal kernel with no SATA drivers for any controller other than the ones I'm using
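
The post doesn't say which tools were used for the first two items. For SCSI/SAS-attached disks, one possible route (an assumption here, not necessarily what was done) is sysfs: setting device/queue_depth to 1 limits each disk to one outstanding command, and writing "write through" to the scsi_disk cache_type attribute asks the sd driver to turn off the drive's write cache. This sketch only builds the list of writes so it can be reviewed first; applying it requires root. The helper names planned_writes and apply_writes are hypothetical.

```python
import glob
import os


def planned_writes(sys_root: str = "/sys") -> list[tuple[str, str]]:
    """Return (sysfs path, value) pairs that would disable command queuing
    and the write cache on every sd* disk. Nothing is written here.

    Assumption: the disks are SCSI/SAS-attached and appear as /sys/block/sd*.
    """
    plan = []
    for dev in sorted(glob.glob(os.path.join(sys_root, "block", "sd*"))):
        qd = os.path.join(dev, "device", "queue_depth")
        if os.path.exists(qd):
            # A queue depth of 1 allows only one outstanding command per disk.
            plan.append((qd, "1"))
        pattern = os.path.join(dev, "device", "scsi_disk", "*", "cache_type")
        for ct in sorted(glob.glob(pattern)):
            # "write through" asks the sd driver to disable the write cache.
            plan.append((ct, "write through"))
    return plan


def apply_writes(plan: list[tuple[str, str]]) -> None:
    """Perform the planned writes (needs root)."""
    for path, value in plan:
        with open(path, "w") as f:
            f.write(value)


if __name__ == "__main__":
    for path, value in planned_writes():
        print(f"{value!r} -> {path}")
```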

The drives are connected to two LSI PCI-Express SAS controllers. These controllers don't support hardware RAID and are set up as JBOD.
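
To see which member disk is actually returning errors through these controllers, the SCSI midlayer keeps a per-device ioerr_cnt counter in sysfs (printed in hex, e.g. 0x1a), which roughly counts commands that completed with an error status. A minimal sketch, assuming the disks appear as /sys/block/sd*; the helper name scsi_error_counts is hypothetical, and smartctl gives a more detailed per-drive health view.

```python
import glob
import os


def scsi_error_counts(sys_root: str = "/sys") -> dict[str, int]:
    """Map each sd* disk to its SCSI midlayer I/O error counter.

    Assumption: disks are exposed as /sys/block/sd*/device/ioerr_cnt.
    """
    counts = {}
    pattern = os.path.join(sys_root, "block", "sd*", "device", "ioerr_cnt")
    for path in sorted(glob.glob(pattern)):
        disk = path.split(os.sep)[-3]  # .../block/<disk>/device/ioerr_cnt
        with open(path) as f:
            counts[disk] = int(f.read().strip(), 16)  # kernel prints "0x..."
    return counts


if __name__ == "__main__":
    # Print the noisiest disks first.
    for disk, n in sorted(scsi_error_counts().items(), key=lambda kv: -kv[1]):
        print(f"{disk}: {n}")
```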

Any ideas? I could obviously replace the faulty disk to stop this from happening, but I don't want to do that until this is fixed, because if a drive fails in the future and I don't notice, I could end up with corrupt files.

My /proc/mdstat:
Code:
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear] [multipath]
md2 : active raid1 sdk2[0] sdl2[1]
      16760832 blocks super 1.2 [2/2] [UU]

md4 : active raid6 sdc1[0] sdp1[13] sdo1[12] sdn1[11] sdm1[10] sdj1[9] sdb1[8] sdg1[15] sdi1[6] sdh1[5] sda1[14] sdf1[3] sde1[2] sdd1[1]
      23440588800 blocks super 1.2 level 6, 512k chunk, algorithm 2 [14/14] [UUUUUUUUUUUUUU]
      bitmap: 2/15 pages [8KB], 65536KB chunk

md1 : active raid1 sdk1[0] sdl1[1]
      1048512 blocks [2/2] [UU]

md3 : active raid1 sdk3[0] sdl3[1]
      1935556672 blocks super 1.2 [2/2] [UU]
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>
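
The [14/14] [UUUUUUUUUUUUUU] field in the mdstat output above is what shows whether an array is degraded: the counts are total and working members, and each _ marks a missing or failed one, as does an (F) suffix on a member name. mdadm --monitor (with MAILADDR set in mdadm.conf) is the usual way to get alerts; the Python sketch below is only a minimal, cron-able alternative, and degraded_arrays is a hypothetical name. Either way, it can only report a disk that md has actually marked as failed, which is exactly what isn't happening here.

```python
import re

ARRAY_LINE = re.compile(r"^(md\S+)\s*:")
# "[total/working] [flags]", where each flag is U (in sync) or _ (missing/failed)
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat: str) -> dict[str, str]:
    """Return {array: status} for arrays with a missing or failed member."""
    bad = {}
    current, has_faulty = None, False
    for line in mdstat.splitlines():
        m = ARRAY_LINE.match(line)
        if m:
            # "(F)" after a member name means md has marked it faulty.
            current, has_faulty = m.group(1), "(F)" in line
            continue
        s = STATUS.search(line)
        if current and s:
            total, working, flags = int(s.group(1)), int(s.group(2)), s.group(3)
            if working < total or "_" in flags or has_faulty:
                bad[current] = f"[{total}/{working}] [{flags}]"
            current = None
    return bad


if __name__ == "__main__":
    with open("/proc/mdstat") as f:
        problems = degraded_arrays(f.read())
    for array, status in problems.items():
        print(f"{array} is degraded: {status}")
    raise SystemExit(1 if problems else 0)
```

The non-zero exit status lets a cron job or systemd timer treat a degraded array as a failure.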

My mdadm --detail /dev/md4 output:
Code:
/dev/md4:
        Version : 1.2
  Creation Time : Thu May 21 09:36:16 2015
     Raid Level : raid6
     Array Size : 23440588800 (22354.69 GiB 24003.16 GB)
  Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
   Raid Devices : 14
  Total Devices : 14
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Nov  6 11:44:14 2015
          State : clean
 Active Devices : 14
Working Devices : 14
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : livecd:4
           UUID : 4b40ac8c:4f7ea8a7:722cbf0a:97537a64
         Events : 4122

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
      14       8        1        4      active sync   /dev/sda1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
      15       8       97        7      active sync   /dev/sdg1
       8       8       17        8      active sync   /dev/sdb1
       9       8      145        9      active sync   /dev/sdj1
      10       8      193       10      active sync   /dev/sdm1
      11       8      209       11      active sync   /dev/sdn1
      12       8      225       12      active sync   /dev/sdo1
      13       8      241       13      active sync   /dev/sdp1

Thanks