08-16-2017, 11:44 AM   #1
choogendyk (Senior Member)
how to reassemble mdadm raid after troubled reboot


I have a Supermicro server running Ubuntu 14.04 with all the latest aptitude updates. It has about 40 drives, spread across the internal bays and two external cabinets, assembled into a number of mdadm raid arrays with LVM on top. The cabinets are SAS multi-path.

The server experienced a panic yesterday morning and halted. The admin on call hit the reset and the system came up, but with issues. Apparently, it failed to get a full inventory of drives before assembling the arrays. One mirror didn't come up, one mirror was missing a drive, one raid6 was missing two drives, one raid5 was missing a drive, and one raid5 was missing two drives and didn't start up.

I have managed most of it. I got one mirror to come up using `sudo mdadm --run /dev/md0`. The other mirror and the degraded arrays that were still running I rebuilt with commands like `sudo mdadm --manage /dev/md2 --add /dev/sdk`. Since those arrays had been up and running in degraded mode, their event counts had diverged from those of the removed drives, so adding a removed drive back in meant a rebuild. All of this worked while the system was running.
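
For reference, the pattern for each running-but-degraded array was roughly this (md2 and sdk are just the example devices from above; the event-count comparison is my own sanity check, not something mdadm requires):

```
# Compare the array's event count with the dropped member's before re-adding;
# if they differ, re-adding will trigger a resync of that member.
sudo mdadm --detail /dev/md2 | grep -i events
sudo mdadm --examine /dev/sdk | grep -i events

# Re-add the dropped drive; mdadm rebuilds it against the degraded array.
sudo mdadm --manage /dev/md2 --add /dev/sdk

# Watch the rebuild progress.
cat /proc/mdstat
```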

The raid that didn't come up consists of three 4TB drives as raid5. It shows two removed. Since the drives were not found on reboot, and the raid array was never started, I'm assuming the event counts on the three drives have not diverged. It seems like I ought to be able to reassemble the array and bring it up without having to do any sort of rebuild. However, the metadata on the drive that came up now lists the other two drives as removed. Thus, `sudo mdadm --detail /dev/md125` shows it as "active, FAILED, Not Started" with two drives removed. So, I'm assuming a reboot would bring it up the same way.

Any ideas how to do this? Would two adds followed by a run do the job? I'm hoping for someone with real experience and not just speculation. The data on the raid is critical.
 
08-16-2017, 08:25 PM   #2
jefro (Moderator)
"The raid that didn't come up consists of three 4TB drives as raid5. It shows two removed."

Might have to start by finding out more about the condition of the physical drives, maybe?

Last edited by jefro; 08-17-2017 at 02:46 PM.
 
08-17-2017, 07:04 AM   #3
choogendyk (Original Poster)
`sudo smartctl -a /dev/sdq` for each drive says they are good.

`sudo mdadm --examine /dev/sdq1` for each drive shows the array UUID and an event count that matches across all of the drives.
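
For anyone following along, the per-drive check was roughly this, repeated for each member partition (sdq/sdq1 is just one of my devices; substitute your own):

```
# Pull out the array UUID, event count, and state for one member partition.
# Matching event counts across all members suggest the array should assemble
# cleanly without needing a rebuild.
sudo mdadm --examine /dev/sdq1 | grep -E 'Array UUID|Events|Array State'

# Quick health check of the underlying drive.
sudo smartctl -a /dev/sdq | grep -i 'overall-health'
```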

These are enterprise class drives in a Sun J4500 hanging off a SuperMicro SuperServer, not consumer grade cheap equipment. Every collection of research data is critical to the faculty member whose research depends on it. We encourage them to keep other copies, but that's not always possible when you have many terabytes of data. I also try to keep tape backups, but the growth of data over the past few years has been unbelievable. I went from AIT5 to LTO6, and now that's not keeping up. I just got an Overland NEO series T24 with two LTO7 drives, but haven't gotten it into operation yet. We've got on the order of 80TB of data in this department, and approaching 100TB in the other department I take care of. I have mirrors on the root drives and raid6 on the larger arrays. When people need more space, we're currently buying HGST 10TB Helium drives.
 
08-17-2017, 07:39 AM   #4
syg00 (LQ Veteran)
Not many are going to be prepared to offer advice that might trash that data.
Your institution is responsible for securing that data, not us. Harsh, but true.

Talk to the people who know this stuff - start with the linux-raid wiki at kernel.org.
 
08-17-2017, 02:52 PM   #5
jefro (Moderator)
I like syg00's link better than the one I posted.
 
08-17-2017, 03:36 PM   #6
choogendyk (Original Poster)
syg00, I appreciate the point. Obviously, I can't hold anyone accountable for free advice given in a public forum. I was hoping that someone might have encountered the situation and could say what worked for them.

Your link to the wiki at kernel.org was very useful. I don't know why it doesn't pop up in Google searches.

Their suggestion for similar situations was to issue `sudo mdadm --stop /dev/md125` followed by `sudo mdadm --assemble /dev/md125 /dev/sdv1 /dev/sdu1 /dev/sdai1` (substituting my values there). They said that this would do no harm, and that the sequence could be repeated with different parameters. When I did this, I got a "device busy" error on the two members I wanted back in (sdv1 and sdu1). I did not want to use `--force`, because that can result in things you don't want to happen.
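
In other words, the sequence from the wiki, as I ran it (with my device names), was:

```
# Stop the failed/not-started array, then try to assemble it explicitly from
# its member partitions. Per the wiki, a plain assemble (no --force) does no
# harm and can be retried with different device lists.
sudo mdadm --stop /dev/md125
sudo mdadm --assemble /dev/md125 /dev/sdv1 /dev/sdu1 /dev/sdai1

# This is where I got "device busy" on sdv1 and sdu1.
```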

At that point I was looking at the parted and partprobe manuals to see if there might be something there about the busy status. I had another 10TB raid5 made of two 10TB drives (we start minimal and add on), which had one drive dropped; it was running, but had no data on it yet, so I figured I could risk playing with it. It also had the weird anomaly that the partition device, /dev/sdn1, had come up as a character device rather than as a block device. Advice on StackExchange suggested rm'ing /dev/sdn1 and doing `sudo partprobe /dev/sdn` to regenerate it. In my case, that didn't work.
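
For completeness, the StackExchange suggestion amounted to roughly this (sdn is my device; it didn't help here, so treat it as a long shot):

```
# Remove the bogus character-device node, then ask the kernel to re-read the
# partition table, which should recreate /dev/sdn1 as a block device.
sudo rm /dev/sdn1
sudo partprobe /dev/sdn
ls -l /dev/sdn1   # a leading 'b' in the listing means a block device
```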

At this point, I realized that none of these other reports mention multi-pathing, but my drive cabinets are multi-pathed. Each drive therefore shows up twice, as, say, /dev/sdq and /dev/sdau, and a further device, e.g. /dev/dm-21, is created that encompasses those two paths. Devices that haven't been put into an array yet also show up in /dev/mapper/ with much longer names built from a WWN plus a possible "-partN" suffix. Looking there, I found three devices and one part1 entry. Using `sudo mdadm --examine /dev/mapper/35000cca266237d6c-part1` (for example), I could see the UUID of the raid array as well as the UUID for the drive, and with those IDs I could see which devices belonged where.

I then did `sudo mdadm --manage /dev/md127 --add /dev/mapper/35000cca266237d6c-part1`, which started the 10TB raid rebuilding parity. After that worked, I repeated the `sudo mdadm --stop /dev/md125` and followed up with `sudo mdadm --assemble /dev/md125 /dev/mapper/35000c5007bb7cc25 /dev/mapper/35000c5007bb79f4b /dev/sdai1`. That worked. The three-drive raid5 simply came up with all data intact.
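
Pulling the working part together, the sequence looked roughly like this (the /dev/mapper names are the WWN-based ones from my cabinets; yours will differ):

```
# Confirm which array a multipath member belongs to (check the "Array UUID"
# line in the --examine output).
sudo mdadm --examine /dev/mapper/35000cca266237d6c-part1

# Re-add the dropped member of the running two-drive raid5 via its multipath
# device rather than a single-path /dev/sdX node.
sudo mdadm --manage /dev/md127 --add /dev/mapper/35000cca266237d6c-part1

# Stop the failed three-drive raid5 and reassemble it from the multipath
# devices. This brought it back with all data intact, no rebuild needed.
sudo mdadm --stop /dev/md125
sudo mdadm --assemble /dev/md125 /dev/mapper/35000c5007bb7cc25 \
    /dev/mapper/35000c5007bb79f4b /dev/sdai1

# Confirm the result.
cat /proc/mdstat
sudo mdadm --detail /dev/md125
```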

The advice from https://raid.wiki.kernel.org/index.php/Assemble_Run was spot on. The difference in my situation was the multi-pathing, and knowing how the drives should be referenced so that you get the multi-path device rather than a single-path instance of the drive. The right names came from /dev/mapper/ and were confirmed with `mdadm --detail`.

Now we have to try to figure out why the system panic occurred in the first place and why the bootup went haywire.
 
08-17-2017, 05:41 PM   #7
syg00 (LQ Veteran)
Glad you got it sorted, and thanks for making us all aware of the situation and fix.
Now, about my sigline - and those LTO7's.
 
  

