Linux - Software
This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I had three 80 GB hard disks, with about 60 GB of each configured into a RAID5 array. It worked well for a number of years until I got an error screen telling me that one of the RAID disks was failing and asking whether I wanted to boot with a degraded RAID. I have since added another disk to the array and removed the failing one from it. However, I have not been able to identify the physical drive, so it remains installed.
Performance is very slow. Is there something else I should be doing? I have Ubuntu 11.10 installed, with version 3.14 of mdadm.
To identify a drive, the best way I've found with my setup is the following.
Find out the physical connection, usually from the manual: sata1, sata2, etc.
To save having to prefix every command with sudo, bring up a console window, issue "sudo -i" (remove quotes) and enter your current user's password.
Then issue "dmesg" and look for something along the lines of:
[ 0.855906] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 0.871922] ata1.00: ATA-8: WDC WD10EACS-00D6B1, 01.01A01, max UDMA/133
[ 0.871928] ata1.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 0.872814] ata1.00: configured for UDMA/133
[ 0.888121] scsi 0:0:0:0: Direct-Access ATA WDC WD10EACS-00D 01.0 PQ: 0 ANSI: 5
[ 1.612019] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1.620299] ata2.00: ATA-8: WDC WD10EAVS-00D7B1, 01.01A01, max UDMA/133
[ 1.620304] ata2.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 1.621204] ata2.00: configured for UDMA/133
[ 1.636122] scsi 1:0:0:0: Direct-Access ATA WDC WD10EAVS-00D 01.0 PQ: 0 ANSI: 5
This gives the drive numbers as the kernel sees them, and they relate directly to the hardware port numbers, because the ports are enumerated in physical order.
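As a quick sketch, the drive-identification lines can be filtered out of the kernel log like this. The here-doc simply replays the sample lines shown above; on a live system, pipe `dmesg` into the same grep instead:

```shell
# Pull out the lines that name each ata port's drive model.
# On a live system: dmesg | grep -E 'ata[0-9]+\.00: ATA-'
grep -E 'ata[0-9]+\.00: ATA-' <<'EOF'
[    0.871922] ata1.00: ATA-8: WDC WD10EACS-00D6B1, 01.01A01, max UDMA/133
[    1.620299] ata2.00: ATA-8: WDC WD10EAVS-00D7B1, 01.01A01, max UDMA/133
EOF
```

Each matching line ties an ata port number to a drive model string, which is the first half of matching a device name to a physical port.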
Further down in the dmesg output, the following:
[ 5.505237] md: md2 stopped.
[ 5.506711] md: bind<sdb2>
[ 5.506863] md: bind<sda2>
[ 5.508861] raid1: raid set md2 active with 2 out of 2 mirrors
[ 5.509448] md2: bitmap initialized from disk: read 1/1 pages, set 0 bits
[ 5.509451] created bitmap (1 pages) for device md2
[ 5.522807] md2: detected capacity change from 0 to 209702912
[ 5.523766] md2: unknown partition table
links the sd* device names to the md* RAID number.
Using a mix of the above, "cat /proc/mdstat" (remove quotes) and the mdadm command (see the manual) to list devices and details should tell you which physical drive you need to remove, and whether the RAID is rebuilding, still degraded, or has some other problem.
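A minimal sketch of the /proc/mdstat check: a healthy member shows as U in the bracketed status field, a missing one as _. The here-doc is a constructed sample consistent with a clean 3-disk RAID5 like this one; on a live system just read the real file with `cat /proc/mdstat`:

```shell
# Print "clean" or "degraded" based on the [UUU]-style status field.
# Sample mdstat line is hypothetical; on a live system feed /proc/mdstat in.
awk '/\[[U_]+\]$/ { if ($0 ~ /_/) print "degraded"; else print "clean" }' <<'EOF'
md0 : active raid5 sdd2[2] sdc3[1] sda3[0]
      116230144 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
EOF
```

A degraded array would show something like [3/2] [U_U] instead, and the script would print "degraded".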
The dmesg output will also show any errors related to the drives, so a careful scan of its content can be illuminating.
I would suggest that if one drive has failed, there is a good chance the others are on their way out too, so back up before they fail.
My dmesg output is not quite as simple as yours; the lines refer to ata3 and ata4 for a bit, then quite a bit later refer to ata5 and ata6. Are you saying these will be in the order of the labels on the motherboard?
I do not quite follow your steps for matching the Linux drives to the ata numbers, and therefore to the physical ports either. Could you provide a bit more guidance please?
When my machine is booting, I know it will boot correctly if it comes up with a message that IRQ #11 is being disabled. I have lifted the context of that from my dmesg output to see if it helps. It also says that the RAID was set up correctly, but performance is still very slow: it takes a long time to get filesystem information, and my Wine application reloads pages slower than reading speed.
[ 1.713761] FDC 0 is a post-1991 82077
[ 1.917695] md: bind<sdc3>
[ 1.923016] md: bind<sda3>
[ 1.933910] md: bind<sdd2>
[ 3.812031] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 0)
[ 4.615882] irq 11: nobody cared (try booting with the "irqpoll" option)
[ 4.615888] Pid: 0, comm: swapper Not tainted 3.0.0-12-generic #20-Ubuntu
[ 4.615891] Call Trace:
[ 4.615893] <IRQ> [<ffffffff810cf8ad>] __report_bad_irq+0x3d/0xe0
[ 4.615908] [<ffffffff810cfcd5>] note_interrupt+0x135/0x180
[ 4.615913] [<ffffffff810cdcc9>] handle_irq_event_percpu+0xa9/0x220
[ 4.615918] [<ffffffff810937a8>] ? tick_dev_program_event+0x48/0x110
[ 4.615923] [<ffffffff810cde8e>] handle_irq_event+0x4e/0x80
[ 4.615927] [<ffffffff810d01e1>] handle_level_irq+0x81/0x100
[ 4.615932] [<ffffffff8100c252>] handle_irq+0x22/0x40
[ 4.615937] [<ffffffff815f3d2a>] do_IRQ+0x5a/0xe0
[ 4.615942] [<ffffffff815ea413>] common_interrupt+0x13/0x13
[ 4.615947] [<ffffffff81065f10>] ? __do_softirq+0x60/0x210
[ 4.615952] [<ffffffff8109388f>] ? tick_program_event+0x1f/0x30
[ 4.615956] [<ffffffff815f34dc>] ? call_softirq+0x1c/0x30
[ 4.615959] [<ffffffff8100c2d5>] ? do_softirq+0x65/0xa0
[ 4.615963] [<ffffffff8106633e>] ? irq_exit+0x8e/0xb0
[ 4.615967] [<ffffffff815f3e1e>] ? smp_apic_timer_interrupt+0x6e/0x99
[ 4.615972] [<ffffffff815f2c93>] ? apic_timer_interrupt+0x13/0x20
[ 4.615974] <EOI> [<ffffffff81012457>] ? mwait_idle+0x87/0x160
[ 4.615984] [<ffffffff8100920b>] ? cpu_idle+0xab/0x100
[ 4.615990] [<ffffffff815b803e>] ? rest_init+0x72/0x74
[ 4.615995] [<ffffffff81ad0c2b>] ? start_kernel+0x3d4/0x3df
[ 4.616000] [<ffffffff81ad0388>] ? x86_64_start_reservations+0x132/0x136
[ 4.616005] [<ffffffff81ad0140>] ? early_idt_handlers+0x140/0x140
[ 4.616007] [<ffffffff81ad0459>] ? x86_64_start_kernel+0xcd/0xdc
[ 4.616007] handlers:
[ 4.616007] [<ffffffff81449450>] usb_hcd_irq
[ 4.616007] Disabling IRQ #11
[ 8.812022] ata5.00: qc timeout (cmd 0xa1)
[ 8.812029] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 10.996040] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 0)
[ 20.996017] ata5.00: qc timeout (cmd 0xec)
[ 20.996024] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[ 20.996028] ata5: limiting SATA link speed to 1.5 Gbps
[ 23.180035] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
[ 53.180021] ata5.00: qc timeout (cmd 0xec)
[ 53.180027] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[ 55.364036] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
[ 57.444023] ata6: SATA link down (SStatus 0 SControl 0)
[ 57.444525] xor: automatically using best checksumming function: generic_sse
[ 57.464008] generic_sse: 4403.000 MB/sec
[ 57.464011] xor: using function: generic_sse (4403.000 MB/sec)
[ 57.469621] md: raid6 personality registered for level 6
[ 57.469626] md: raid5 personality registered for level 5
[ 57.469629] md: raid4 personality registered for level 4
[ 57.472629] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[ 57.472635] PCI: setting IRQ 10 as level-triggered
[ 57.472643] firewire_ohci 0000:04:02.0: PCI INT A -> Link[LNKB] -> GSI 10 (level, low) -> IRQ 10
[ 57.474181] bio: create slab <bio-1> at 1
[ 57.474204] md/raid:md0: device sdd2 operational as raid disk 2
[ 57.474207] md/raid:md0: device sda3 operational as raid disk 0
[ 57.474210] md/raid:md0: device sdc3 operational as raid disk 1
[ 57.474831] md/raid:md0: allocated 3230kB
[ 57.475105] md/raid:md0: raid level 5 active with 3 out of 3 devices, algorithm 2
[ 57.475109] RAID conf printout:
[ 57.475111] --- level:5 rd:3 wd:3
[ 57.475114] disk 0, o:1, dev:sda3
[ 57.475116] disk 1, o:1, dev:sdc3
[ 57.475118] disk 2, o:1, dev:sdd2
[ 57.475152] md0: detected capacity change from 0 to 119019667456
I'm not sure of your level of knowledge, so:
Can you post the output of "cat /proc/mdstat" (remove quotes)?
This will tell me whether your RAIDs are up and in a good state.
Then the output of "mdadm --detail /dev/md0" (remove quotes)?
This tells me whether md0 is good, plus a few other bits of info.
According to what I've found out, IRQ 11 is used by network devices, additional disk controllers, and sound cards.
There are various reported problems, so it would be best to look at what Google brings up and see whether any of it relates to your setup.
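To see what is actually registered on IRQ 11, /proc/interrupts lists the handlers per IRQ line. A minimal sketch (the here-doc is a hypothetical excerpt; the handler names and counts are made up for illustration):

```shell
# Print the header row plus the IRQ 11 line.
# On a live system: awk 'NR==1 || $1 == "11:"' /proc/interrupts
awk 'NR==1 || $1 == "11:"' <<'EOF'
           CPU0
  10:        0   XT-PIC   firewire_ohci
  11:   123456   XT-PIC   usb_hcd
EOF
```

If several devices appear on the IRQ 11 line, they are sharing the interrupt, which fits the "nobody cared" trace above.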
I notice you are using a new version of Ubuntu. There are a lot of bug fixes going on all the time, and there was a rather nasty, short-lived problem with mdadm, so make sure you have run all updates.
At a guess from the output shown, I'd say sdb was the failed device?
Also, are you using a SATA card, or SATA on the motherboard?
Do you have any card readers, such as compact flash, etc.?
What is the motherboard make and model?
If the above does NOT help after you have posted, then in another post do the following.
Finally, post the whole output of "dmesg" after a reboot, up to the point where it says something similar to:
[ 10.629406] EXT4-fs (md5): mounted filesystem with ordered data mode
[ 10.729736] EXT4-fs (md6): mounted filesystem with ordered data mode
This is the point at which it is all happy with the disks and has mounted them; usually after this it will be something about bringing up the network and potentially login stuff.
You may find that pastebin.com is useful for the dmesg output, as it's going to be long. Also, please check that you don't include output that may identify your computer's IP address, computer name, or network name.
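As a sketch, a saved dmesg can be trimmed at the first "mounted filesystem" line before pasting, so the paste stops once the disks are up. The here-doc stands in for a saved log (the md device names vary per system):

```shell
# Print everything from the start of the log through the first EXT4 mount line.
# On a live system: dmesg | sed -n '1,/EXT4-fs (.*): mounted filesystem/p'
sed -n '1,/EXT4-fs (.*): mounted filesystem/p' <<'EOF'
[    5.505237] md: md2 stopped.
[   10.629406] EXT4-fs (md5): mounted filesystem with ordered data mode
[   11.000000] eth0: link becomes ready
EOF
```

Everything after the first mount line, such as the network bring-up, is dropped from the paste.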
I'm not sure whether I can help at this point, as it's getting very machine-specific, but I'll try.
Thanks for persisting, Jonathon. I think you are about to find out my knowledge is extremely limited.
These are the outputs you asked for:
don@don-ubuntu:~$ sudo cat /proc/mdstat/
[sudo] password for don:
cat: /proc/mdstat/: Not a directory
don@don-ubuntu:~$ cat /proc/mdstat/
cat: /proc/mdstat/: Not a directory
don@don-ubuntu:~$ sudo mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Sat Oct 27 16:40:52 2007
Raid Level : raid5
Array Size : 116230144 (110.85 GiB 119.02 GB)
Used Dev Size : 58115072 (55.42 GiB 59.51 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Feb 6 09:13:53 2012
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 5a5910e6:769ae62d:16970757:09b4ed7d (local to host don-ubuntu)
Events : 0.2379
    Number   Major   Minor   RaidDevice   State
       0       8       3         0        active sync   /dev/sda3
       1       8      35         1        active sync   /dev/sdc3
       2       8      50         2        active sync   /dev/sdd2
I do not know why the cat command did not work. Am I being really thick?
SATA is on the motherboard. I always try to keep the system up to date, applying updates as soon as they are available. The motherboard is an ABIT AL8. Yes, sdb was the drive that was reporting bad sectors. No, there is no flash card reader as part of the system, but I have a reader that plugs into a USB port and recently read a faulty card. However, that USB port has read other things since.
And according to the manual, on p25 you have six possible SATA ports and one IDE.
Can you run:
Code:
ls -al /dev/disk/by-id
This will tell me what you have in the way of disks...
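The by-id names embed each drive's model and serial number, and the serial is usually printed on the drive's label, which is what finally pins down the physical unit. A sketch using hypothetical link names (the serials below are made up; real ones come from the `ls -al /dev/disk/by-id` output):

```shell
# Strip partition suffixes and dedupe to get one by-id name per physical disk.
# The names below are hypothetical examples, not real output.
printf '%s\n' \
  'ata-WDC_WD800JD-00LSA0_WD-WMAM9AA00001-part1' \
  'ata-WDC_WD800JD-00LSA0_WD-WMAM9AA00001-part2' \
  'ata-WDC_WD800AAJS-00TDA0_WD-WCAPW0000002' |
  sed 's/-part[0-9]*$//' | sort -u
```

Match the trailing serial in each surviving name against the sticker on each drive to identify which physical disk is which.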
Also, can you give me the complete output of dmesg after a reboot and after you have browsed the contents of all your drives, just in case one of them is causing spurious errors.
I'm beginning to think this may not be a disk or RAID problem, although it would be worth identifying the failed drive so you can remove it completely.
One point of note, though: if the older drives were only capable of 1.5 Gbps transfer speeds and the new one runs at 3 Gbps, I'm not sure of the implications for the software RAID receiving data at different speeds. I guess it's possible it could cause problems that wouldn't actually show up as errors.
[edit]I checked on the mailing list and it seems it should make no difference at all to overall performance.[/edit]
It might also be worth looking at the linux-raid mailing list.
I have had network problems and no internet for a week or so, hence no response. An update to the kernel has also just come through, so I will see how things go for a while. If I still have problems I will follow Jonathon's suggestions.
After the kernel upgrade the problem is persisting. I have pasted the dmesg output to pastebin under the same user name as this one. While browsing it I noticed a warning that I was trying to mount an ext3 filesystem as ext2. I cannot find the warning now, but wondered if that could be the problem.
Since the kernel was upgraded there have been no reports of the RAID being degraded, but IRQ 11 is still being disabled.
Any help appreciated.