Linux - Hardware
This forum is for Hardware issues.
Having trouble installing a piece of hardware? Want to know if that peripheral is compatible with Linux?
Must be the drives. There's no way other hardware or software can make a drive report "pending sectors" via S.M.A.R.T. Media error is the only possibility.
Ok, I'm on the premises. I turned off the server (it was hanging with a lot of error messages, like you predicted). I removed sdb (I looked for the serial number on the drive casing, to match the serial number as reported by smartctl on sdb).
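For anyone following along: the serial number each drive reports can be listed with smartmontools, which makes matching the sticker on the casing to a device name less error-prone. A sketch (assumes smartctl is installed and this runs as root):

```shell
# Print the serial number S.M.A.R.T. reports for each disk,
# so it can be matched against the label on the drive casing
for dev in /dev/sd?; do
    printf '%s: ' "$dev"
    smartctl -i "$dev" | grep -i 'serial number'
done
```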
Booted up, and it's running now. But here's the really strange thing:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid1 sdc1[1] sdb1[0]
2930133824 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0]
1464710976 blocks super 1.2 [2/1] [U_]
md1 : active (auto-read-only) raid1 sda2[0]
24006528 blocks super 1.2 [2/1] [U_]
md2 : active raid1 sda3[0]
1441268544 blocks super 1.2 [2/1] [U_]
md4 : active raid1 sdd2[0] sde2[1]
2929939264 blocks super 1.2 [2/2] [UU]
unused devices: <none>
But I definitely removed sdb. Yet now sdf is missing, and sdb is there. Also, that mdstat doesn't make any sense; look at it closely... Looks like sdf became sdb, or something. Compare this with how mdstat used to look before:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2] sda1[0]
1464710976 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 4.3% (63596480/1464710976) finish=318.5min speed=73315K/sec
md1 : active raid1 sda2[0] sdb2[1]
24006528 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
1441268544 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sdc1[0] sdd1[1]
2930133824 blocks super 1.2 [2/2] [UU]
md4 : active raid1 sdf2[1] sde2[0]
2929939264 blocks super 1.2 [2/2] [UU]
unused devices: <none>
Btw, my swap partition runs on md1, but it shows as auto read-only?
EDIT: Here are the md device details:
Code:
/dev/md0:
Version : 1.2
Creation Time : Sat Dec 29 17:09:45 2012
Raid Level : raid1
Array Size : 1464710976 (1396.86 GiB 1499.86 GB)
Used Dev Size : 1464710976 (1396.86 GiB 1499.86 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Fri Nov 15 22:08:29 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : lia:0 (local to host lia)
UUID : eb302d19:ff70c7bf:401d63af:ed042d59
Events : 513922
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 0 0 1 removed
Code:
/dev/md1:
Version : 1.2
Creation Time : Sat Dec 29 17:09:50 2012
Raid Level : raid1
Array Size : 24006528 (22.89 GiB 24.58 GB)
Used Dev Size : 24006528 (22.89 GiB 24.58 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Fri Nov 15 15:36:33 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : lia:1 (local to host lia)
UUID : 1f8dff14:bc317bcb:d3587249:9ffc0b42
Events : 58
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 0 0 1 removed
Code:
/dev/md2:
Version : 1.2
Creation Time : Sat Dec 29 17:09:59 2012
Raid Level : raid1
Array Size : 1441268544 (1374.50 GiB 1475.86 GB)
Used Dev Size : 1441268544 (1374.50 GiB 1475.86 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Fri Nov 15 21:42:19 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : lia:2 (local to host lia)
UUID : 543b8db0:660e4e18:d388dec8:b9fe81cb
Events : 103
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 0 0 1 removed
Code:
/dev/md3:
Version : 1.2
Creation Time : Sat Dec 29 17:10:04 2012
Raid Level : raid1
Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Fri Nov 15 21:48:23 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : lia:3 (local to host lia)
UUID : 2a35faa7:b076b115:f2e45d70:e9e0f885
Events : 72
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
Code:
/dev/md4:
Version : 1.2
Creation Time : Sat Dec 29 17:10:15 2012
Raid Level : raid1
Array Size : 2929939264 (2794.21 GiB 3000.26 GB)
Used Dev Size : 2929939264 (2794.21 GiB 3000.26 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Fri Nov 15 22:08:50 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : lia:4 (local to host lia)
UUID : 18cafde6:cdd0d6ad:e80fe7e2:a346e157
Events : 196
Number Major Minor RaidDevice State
0 8 50 0 active sync /dev/sdd2
1 8 66 1 active sync /dev/sde2
I'll post the smartctl stats in the next post, this one is getting a bit long.
As you can see from all the stats in the above 3 posts, the sdb device doesn't have the original sdb serial number. Seems sdf renamed itself to sdb. Bizarre...
It looks like your current /dev/sdc may have issues. You should resync md3 immediately.
I've actually never seen an md device become read-only before. I found a forum post describing what seems to be a similar issue. Are you by any chance accessing Intel software RAID sets with mdadm?
As for the device names, well, welcome to the SCSI subsystem, where device names are assigned by the kernel on a first come, first served basis.
When you removed sdb, that name became vacant. Normally that would mean that every device gets to move one step up the ladder (sdc becomes sdb, sdd becomes sdc and so on), but on some (if not most) distributions, daemons like udev may interfere and try to preserve device-to-node mappings.
Thankfully, it doesn't really matter to the md driver what name is assigned to devices and partitions, as every component is labeled with a UUID. It does, however, make it difficult to determine exactly which device has any given device name at any given time. If you start off with six drives:
Code:
  1     2     3     4     5     6
[sda] [sdb] [sdc] [sdd] [sde] [sdf]
But after a reboot, there's always a risk that device names may have been reassigned:
Code:
After a reboot, and after being subjected
to typically inconsistent udev behaviour:
1 2 3 4 5 6
[sda] ----- [sdc] [sdd] [sde] [sdb]
It's mostly just a nuisance, unless you're using device names rather than labels or UUIDs in /etc/fstab.
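One way to take the guesswork out of the sdX shuffle is udev's persistent symlinks under /dev/disk/, which encode the model and serial (or the filesystem UUID) and point at whatever kernel name the device currently has. A sketch (exact link names vary by distro and drive):

```shell
# Whole-disk persistent names (model_serial -> current sdX);
# the filter drops the per-partition links
ls -l /dev/disk/by-id/ | grep -v -- '-part'
# Filesystem UUIDs, as used in /etc/fstab
ls -l /dev/disk/by-uuid/
```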
Thanks for the explanation - I suspected that it was simply a renaming-to-an-empty-slot issue, but at this stage I'm so paranoid that I'm pessimistic about anything strange :P Luckily we do use UUIDs in the fstab, yes, so it should be all good.
I've read the post you linked, but am still unsure what resolution to follow regarding the read-only swap md device. I'm not sure what you mean by Intel software RAID - we didn't set up the RAID devices using the onboard RAID utility; we set them up during the original Linux installation using Ubuntu's software RAID. So I guess the answer is no?
Could resyncing md3 result in the same catastrophic crash that we experienced when resyncing md0 earlier today? (Also, how exactly do I resync the "right" way?)
PS: I really owe you for sticking with me through this. Much appreciated!
Quote:
Originally Posted by reano
I've read the post you linked, but am still unsure what resolution to follow regarding the read-only swap md device. I'm not sure what you mean by Intel software RAID - we didn't set up the RAID devices using the onboard RAID utility; we set them up during the original Linux installation using Ubuntu's software RAID. So I guess the answer is no?
I guess so. The person in that thread ended up destroying and recreating the RAID array, and you could do the same, if the device in question is only used for swap (which obviously isn't working now, with the device being read-only).
Quote:
Originally Posted by reano
Could resyncing md3 result in the same catastrophic crash that we experienced when resyncing md0 earlier today? (Also, how exactly do I resync the "right" way?)
A resync is highly unlikely to cause any problems, quite the opposite. The md driver is remarkably tolerant of errors, and will try to rewrite a bad sector several times using data from another device in the array before failing a RAID member.
Your experience with the drive that used to be sdb is very much atypical, but problems can occur if a device is allowed to "bit rot" for an extended period of time. Arrays need to be verified/"scrubbed" regularly, and the S.M.A.R.T. status of all drives should be continuously monitored.
You can resync an md device by writing "check" to /sys/devices/virtual/block/<device>/md/sync_action. In this case, this command should initiate a verify/resync:
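The command itself appears to have been lost from this post; for md3, the array discussed above, the sysfs write would presumably be:

```shell
# Kick off a verify pass on md3 (as root); unreadable or mismatched
# sectors are rewritten from the other mirror half where possible
echo check > /sys/devices/virtual/block/md3/md/sync_action
```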
Quote:
Originally Posted by Ser Olmy
I guess so. The person in that thread ended up destroying and recreating the RAID array, and you could do the same, if the device in question is only used for swap (which obviously isn't working now, with the device being read-only).
Strange, the read-only flag disappeared suddenly. I'll see what it does after the next reboot (which will probably only be after md3's resync, and preferably on Monday, when I'm onsite again to monitor the boot process).
Quote:
Originally Posted by Ser Olmy
A resync is highly unlikely to cause any problems, quite the opposite. The md driver is remarkably tolerant of errors, and will try to rewrite a bad sector several times using data from another device in the array before failing a RAID member.
Your experience with the drive that used to be sdb is very much atypical, but problems can occur if a device is allowed to "bit rot" for an extended period of time. Arrays need to be verified/"scrubbed" regularly, and the S.M.A.R.T. status of all drives should be continuously monitored.
You can resync an md device by writing "check" to /sys/devices/virtual/block/<device>/md/sync_action. In this case, this command should initiate a verify/resync:
Thanks, I'll do that. How do I check the progress of the resync? Also via /proc/mdstat?
By the way, I've noticed something else. Every night at 30 minutes past midnight, the server backs up the contents of the /home directory to a NAS drive. This process usually takes about 20 minutes, but now it lasted over 90 minutes. /home resides on md3 - why would it take so long this time? I haven't started the resync on md3 yet, so it can't be that?
Quote:
Originally Posted by reano
Thanks, I'll do that. How do I check the progress of the resync? Also via /proc/mdstat?
That, or run mdadm --detail /dev/md3.
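Either source works; a convenient way to keep an eye on it (a sketch, run as root):

```shell
# Re-display the resync progress from /proc/mdstat every 60 seconds
watch -n 60 cat /proc/mdstat
# Or ask mdadm for the array's own view of the operation
mdadm --detail /dev/md3 | grep -iE 'state|resync|rebuild'
```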
Quote:
Originally Posted by reano
By the way, I've noticed something else. Every night at 30mins past midnight, the server backs up the contents of the /home directory to a NAS drive. This process usually takes about 20 minutes, but now it lasted over 90mins. /home resides on md3 - why would it take so long this time? I haven't started the resync on md3 yet, so it can't be that?
The md driver implements "read balancing" for RAID 1 sets, so I'd expect read performance to suffer with one device missing.
That could certainly be the reason, in which case you should see read errors in the logs.
Doing the resync now on md3 - this is going to take a few hours. Perfect excuse to get some sleep, it's about 2:30AM here now and it's (literally and figuratively) been a stormy night. Non-stop lightning since early evening. Seemed extremely appropriate to the situation, too - irony is a right bastard sometimes, hehe.
Okay, so it seems both the (old) sdb and the current sdc are faulty. So you'd recommend I replace both of those drives, right?
Btw, what do I check for specifically in the SMART status to determine if a drive is going AWOL on me? Only Pending sectors and Reallocated sectors, or is there another red flag to watch out for?
A growing number of defects is the first sign of a drive (slowly) going bad. The sectors first show up in the Current_Pending_Sector count as the drive lists them for reallocation, and once reallocated they become part of the Reallocated_Sector_Ct statistic.
The problem with S.M.A.R.T. is that the drive has to detect the errors for them to show up among the attributes. A bad sector will go undetected until you attempt to read it. Even regular backups might not cause such a sector to be read, as incremental or delta backups and de-duplication have become common features. That's why regular verifying/scrubbing of a RAID array is of the utmost importance.
As for other S.M.A.R.T. attributes, they can usually be ignored unless the drive status changes to "failing". smartd can be configured to send an e-mail whenever an attribute changes, something I would strongly recommend. Combined with mdadm in "--monitor" mode, you'll be informed if there's trouble brewing.
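As a sketch, the smartd side of that might look like the following in /etc/smartd.conf (the address is a placeholder; -a tracks all attributes and errors, -m sets the mail recipient):

Code:
# /etc/smartd.conf: monitor every detected drive, mail on trouble
DEVICESCAN -a -m admin@example.com

For mdadm's --monitor mode, a MAILADDR line in /etc/mdadm/mdadm.conf serves the same purpose.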
I'm planning to do a weekly scrubbing/resync of the arrays. My plan is to do it via a cron job (echo check > /sys/devices/virtual/block/<md_device>/md/sync_action) as follows:
- md1 (swap, only a few GB) on Wednesday mornings at 3AM, should finish before 4AM.
- md0 (root filesystem, 1.5TB) on Thursday mornings at 3AM, should finish by about 7AM.
- md2 (shared user data, 1.5TB) on Friday mornings at 3AM, should finish by about 7AM.
- md3 (home directories, 3.0TB) on Saturday mornings at 3AM, should finish by about 11AM.
- md4 (user IMAP mails, 3.0TB) on Saturday afternoons at 12PM, should finish by about 8PM.
Can users use the system, shared resources, mails, homedirs, etc. while the resync is taking place? Just in case there are some early birds starting work before 7AM, or on Saturdays? Or will they experience some serious slowdowns?
Then, further to that, I want to do another cronjob to mail me the smartctl output of all drives, daily, every morning at 8AM.
Does this make sense (and was I more or less correct with my time/duration estimates), or would you recommend any changes to the plan above?
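For what it's worth, that schedule might translate into root's crontab along these lines (a sketch; the md device names and the mail address are assumptions based on the list above):

Code:
# m h dom mon dow  command
0 3  * * 3  echo check > /sys/devices/virtual/block/md1/md/sync_action  # Wed 3AM: md1 (swap)
0 3  * * 4  echo check > /sys/devices/virtual/block/md0/md/sync_action  # Thu 3AM: md0 (root)
0 3  * * 5  echo check > /sys/devices/virtual/block/md2/md/sync_action  # Fri 3AM: md2 (shared)
0 3  * * 6  echo check > /sys/devices/virtual/block/md3/md/sync_action  # Sat 3AM: md3 (home)
0 12 * * 6  echo check > /sys/devices/virtual/block/md4/md/sync_action  # Sat 12PM: md4 (mail)
0 8  * * *  for d in /dev/sd?; do smartctl -a "$d"; done | mail -s "Daily SMART report" admin@example.com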