LinuxQuestions.org
Old 08-07-2022, 05:59 PM   #1
gosssamer
Member
 
Registered: Dec 2010
Posts: 59

Rep: Reputation: 0
Kernel crash due to bad disk?


Hi, I have a Fedora 35 system that uses rsync to operate as a backup server. It has an 8TB RAID5 array, and for the last few days it has crashed/segfaulted at what appears to be the same time that it starts to back up a particular remote host. This suggests to me that there may be some spot on the disk, related to this particular host's data, that is triggering the crash.

When it happens, there is a segfault message on the console, but nothing related to it in the logs. There are bits from the kernel about being unable to write prior to the crash, however:

Code:
Aug  1 12:24:32 mail03 kernel: [2415225.412978] EXT4-fs warning (device md2): ext4_end_bio:343: I/O error 10 writing to inode 232141206 starting block 3033088)
Aug  1 12:24:32 mail03 kernel: [2415225.412987] Buffer I/O error on device md2, logical block 3033088
Aug  1 12:24:32 mail03 kernel: [2415225.413025] Buffer I/O error on device md2, logical block 3033089
...
Aug  1 12:24:32 mail03 kernel: [2415225.526007] JBD2: Detected IO errors while flushing file data on md2-8
Aug  1 12:24:35 mail03 kernel: [2415227.560338] JBD2: Detected IO errors while flushing file data on md2-8
How do I identify which of the four disks this is? I've run smartctl short checks on each disk in the array, but all four passed without error. What is md2-8?

From /proc/mdstat:
Code:
md2 : active raid5 sde1[4] sdc1[7] sda1[5] sdf1[6]
      8790402048 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/22 pages [0KB], 65536KB chunk
You'll also notice the array is fully operational.
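
For the "which of the four disks" question, a first step is getting the member devices out of /proc/mdstat so each one can be checked individually. The following is only a minimal sketch: the function name md_members is made up for illustration, and it assumes member entries look like "sde1[4]" as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the member devices of one md array.
# Reads /proc/mdstat-style text on stdin; the array name is the first argument.
md_members() {
    awk -v arr="$1" '
        # Only the line that starts with "<array> :" lists the members.
        $1 == arr && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                # Member fields carry a slot number in brackets, e.g. sde1[4].
                if ($i ~ /\[[0-9]+\]/) {
                    dev = $i
                    sub(/\[.*$/, "", dev)   # drop the slot number and any flags
                    print "/dev/" dev
                }
            }
        }
    '
}
```

On the live system this would be called as `md_members md2 < /proc/mdstat`; for the array shown above it lists the four member partitions.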

I'm also now running a full fsck scan of the disk:

Code:
# fsck -Vfp -C0 /dev/md2
fsck from util-linux 2.37.4
[/usr/sbin/fsck.ext4 (1) -- /var/backup] fsck.ext4 -fp -C0 /dev/md2
/dev/md2: |===                                                     |  5.7%
but it'll clearly take a while. Update: this did eventually finish, and reported no errors.

I also don't see any errors in the kernel log related to any of the four individual disks.
 
Old 08-07-2022, 08:43 PM   #2
mrmazda
LQ Guru
 
Registered: Aug 2016
Location: SE USA
Distribution: openSUSE 24/7; Debian, Knoppix, Mageia, Fedora, others
Posts: 5,788
Blog Entries: 1

Rep: Reputation: 2065
Run smartctl -t long on each of the disks, then look in the results for the pending and reallocated attributes. You want their raw values to be zero, at least for pending.
Code:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   098   098   000    -    2
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
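
To avoid eyeballing the table on four disks, the check described above could be scripted roughly as follows. This is a sketch, not a vetted tool: the function name smart_flag is hypothetical, and it assumes the raw value is a plain number in the last column, as in the table above.

```shell
#!/bin/sh
# Hypothetical helper: report pending/reallocated attributes from
# `smartctl -A` style text on stdin and fail if any raw value is nonzero.
smart_flag() {
    awk '
        # Match on the attribute name, as suggested in the post above.
        $2 ~ /Pending|Reallocat/ {
            raw = $NF                        # assumption: raw value is the last field
            if (raw + 0 == 0) { state = "ok" } else { state = "CHECK"; bad = 1 }
            print $2, "raw=" raw, state
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

A possible use, assuming the whole-disk names behind the md2 members are sda, sdc, sde and sdf: `for d in /dev/sda /dev/sdc /dev/sde /dev/sdf; do smartctl -A "$d" | smart_flag || echo "$d needs a closer look"; done`.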
 
Old 08-08-2022, 12:20 AM   #3
lvm_
Member
 
Registered: Jul 2020
Posts: 884

Rep: Reputation: 308
RAID is designed to handle disk failures; that is its primary and basic function. In all my experience with md, disk errors a) were logged properly, including physical device names, and b) never caused the system to fail. If that does happen, it indicates either a major bug or, much more likely, that something else is going on, especially since, as I understand your message, the RAID is a plain data volume with the root and swap filesystems somewhere else. What exactly is the segfault message?
 
Old 08-08-2022, 07:26 AM   #4
gosssamer
Member
 
Registered: Dec 2010
Posts: 59

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by mrmazda
Run smartctl -t long on each of the disks, then look in the results for the pending and reallocated attributes. You want their raw values to be zero, at least for pending.
Code:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   098   098   000    -    2
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
Yes, I should have mentioned I had already performed a -t long test and all four disks passed without error.

The realloc and pending values were also zero for all four disks.
 
Old 08-08-2022, 07:28 AM   #5
gosssamer
Member
 
Registered: Dec 2010
Posts: 59

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by lvm_
RAID is designed to handle disk failures; that is its primary and basic function. In all my experience with md, disk errors a) were logged properly, including physical device names, and b) never caused the system to fail. If that does happen, it indicates either a major bug or, much more likely, that something else is going on, especially since, as I understand your message, the RAID is a plain data volume with the root and swap filesystems somewhere else. What exactly is the segfault message?
Yes, your assumptions are correct. Root and swap and home are on different partitions.

Not enough of the segfault message is visible on the screen to really get an idea of what caused it to fail, and there is never any info in the logs.

I've also run it through 4 hours of memtest, so I don't think it's a hardware/CPU/mem problem.

I'm going to start the backups again and see if it still fails in the same spot.
 
Old 08-08-2022, 07:37 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,103

Rep: Reputation: 4117
Quote:
Originally Posted by gosssamer
Not enough of the segfault message is visible on the screen to really get an idea of what caused it to fail
Those are filesystem errors, not (kernel) segfaults.
Quote:
... and there is never any info in the logs.
Unlikely - more likely is that you haven't been monitoring mdadm errors; they may have been months ago.
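
One way to look back for earlier md or I/O trouble is to filter old kernel messages for a few keywords. The sketch below is an assumption-laden starting point: the function name md_log_errors and the keyword list are illustrative and are not an exhaustive set of md messages.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines that hint at md, block I/O or
# ext4/JBD2 trouble. Reads log text on stdin.
md_log_errors() {
    # Keyword list is an assumption; extend it to match what your kernel logs.
    grep -E 'I/O error|md/raid|JBD2|EXT4-fs (error|warning)'
}
```

For example, `journalctl -k -b -1 | md_log_errors` would scan the kernel messages of the previous boot, which is the boot that ended in the crash if the machine has only been restarted once since.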
 
Old 08-08-2022, 08:38 AM   #7
gosssamer
Member
 
Registered: Dec 2010
Posts: 59

Original Poster
Rep: Reputation: 0
Okay, I restarted the backup that I know appears to write to the area that causes the panic, and within 15 minutes of running, it produced this on the screen. There's no ability to scroll up or use shift-pgup, as I normally would, to see the top of the message. I'm not sure how helpful this is.

https://imgur.com/a/uZUtpTO

I guess there's no way to paste an image here directly? Please see the above link for the panic
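
Since the top of the panic scrolls off the screen and nothing reaches the logs, one option not discussed in this thread is the kernel's netconsole module, which sends kernel messages over UDP to another machine. The sketch below only builds the module parameter string. The function name netconsole_param is hypothetical, and the ports, addresses and interface in the usage line are placeholders, not values from this system.

```shell
#!/bin/sh
# Hypothetical helper: build a netconsole= parameter value.
# Args: local_port local_ip interface remote_port remote_ip [remote_mac]
netconsole_param() {
    # Format: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac (the MAC may be empty)
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-}"
}
```

With placeholder values it would be used as `modprobe netconsole "netconsole=$(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20)"`, with a UDP listener (netcat, for instance) running on the receiving host. Whether the panic text makes it out depends on how far the network stack is still working at that point.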
 
  



Tags
kernel, raid5

