
silicon.pyro 08-17-2007 11:02 AM

Cannot assemble my clean RAID...
 
I have a server at home that I brought down to replace the fans (they were getting loud and annoying my roommates). I brought it back up, and for some reason the RAID array won't assemble.

Here's a transcript of what I tried:

Code:

[server] ~ # uname -r
2.6.22-gentoo-r2

[server] ~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
unused devices: <none>

[server] ~ # mdadm --assemble --verbose /dev/md0 /dev/sd[a-d]
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdb[0] sdd[3] sdc[2] sda[1]
      1953545984 blocks

unused devices: <none>

[server] ~ # dmesg | tail -n 27
md: bind<sda>
md: bind<sdc>
md: bind<sdd>
md: bind<sdb>
raid5: device sdb operational as raid disk 0
raid5: device sdd operational as raid disk 3
raid5: device sdc operational as raid disk 2
raid5: device sda operational as raid disk 1
raid5: allocated 4262kB for md0
raid5: raid level 5 set md0 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4
 disk 0, o:1, dev:sdb
 disk 1, o:1, dev:sda
 disk 2, o:1, dev:sdc
 disk 3, o:1, dev:sdd
attempt to access beyond end of device
sdb: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdd: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdc: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sda: rw=16, want=976773176, limit=976773168
md0: bitmap initialized from disk: read 21/30 pages, set 4 bits, status: -5
md0: failed to create bitmap (-5)
md: pers->run() failed ...

[server] ~ # mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Thu Aug 16 12:42:03 2007
          State : clean                                                                         
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d1c - correct
        Events : 0.1266080

        Layout : left-symmetric
    Chunk Size : 64K

      Number  Major  Minor  RaidDevice State
this    1      8        0        1      active sync  /dev/sda

  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --examine /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
 
    Update Time : Thu Aug 16 12:42:03 2007
          State : clean
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d2a - correct
        Events : 0.1266080
     
        Layout : left-symmetric
    Chunk Size : 64K
 
      Number  Major  Minor  RaidDevice State
this    0      8      16        0      active sync  /dev/sdb
 
  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --examine /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
 
    Update Time : Thu Aug 16 12:42:03 2007
          State : clean
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d3e - correct
        Events : 0.1266080
     
        Layout : left-symmetric
    Chunk Size : 64K
 
      Number  Major  Minor  RaidDevice State
this    2      8      32        2      active sync  /dev/sdc
 
  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --examine /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
 
    Update Time : Thu Aug 16 12:42:03 2007
          State : clean
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d50 - correct
        Events : 0.1266080
     
        Layout : left-symmetric
    Chunk Size : 64K
 
      Number  Major  Minor  RaidDevice State
this    3      8      48        3      active sync  /dev/sdd
 
  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # dmesg | tail
md0: failed to create bitmap (-5)
md: pers->run() failed ...
md: md0 stopped. 
md: unbind<sdb>
md: export_rdev(sdb)
md: unbind<sdd>
md: export_rdev(sdd)
md: unbind<sdc>
md: export_rdev(sdc)
md: unbind<sda>
md: export_rdev(sda)

I'm running Gentoo 2007.1. I've tried a few things: rolling back to an old kernel, running an emerge update, and adding libata.ignore_hpa=1 to the kernel boot parameters. I can't even get to the point where there's a filesystem to check. I usually have pretty good luck piecing together solutions from other people's problems, but there must be something I'm missing here.

Thanks in advance for the help.

macemoneta 08-17-2007 11:11 AM

- Check dmesg and/or /var/log/messages to identify the drive reporting the error.
- Assemble the array without that drive. Don't continue unless this succeeds!
- Wipe the beginning of the failed drive:

dd if=/dev/zero of=/dev/xxx bs=512 count=65

If this fails, replace the drive.

- Repartition the drive. If this fails, replace the drive.
- Re-add the drive to the array. If this fails, replace the drive.
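
A rough sketch of the whole sequence, assuming the bad drive turns out to be /dev/sdd (a hypothetical name -- substitute whichever disk is actually logging the errors), with the last step adjusted for whole-disk or partition-based members:

Code:

# 1. Assemble degraded, leaving out the suspect drive (RAID5 tolerates one missing member)
mdadm --assemble --run /dev/md0 /dev/sda /dev/sdb /dev/sdc

# 2. Wipe the start of the suspect drive (clears the MBR/partition table)
dd if=/dev/zero of=/dev/sdd bs=512 count=65

# 3. Repartition it (a single partition of type fd, Linux raid autodetect)
fdisk /dev/sdd

# 4. Hand it back to md and let it resync
mdadm /dev/md0 --add /dev/sdd1    # or /dev/sdd if the array was built on whole disks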

silicon.pyro 08-18-2007 02:31 AM

Quote:

Originally Posted by macemoneta (Post 2862121)
- Check dmesg and/or /var/log/messages to identify the drive reporting the error.
- Assemble the array without that drive. Don't continue unless this succeeds!
- Wipe the beginning of the failed drive:

dd if=/dev/zero of=/dev/xxx bs=512 count=65

If this fails, replace the drive.

- Repartition the drive. If this fails, replace the drive.
- Re-add the drive to the array. If this fails, replace the drive.

I suppose I should have chosen a different subject for my post... The drives assemble but will not run.

Perhaps the answer to your question is that all the drives are giving an error. However, I think all the drives are good, because I changed nothing between reboots that would affect the disk structure -- no formatting, no repartitioning, no rebuilding, etc. All I did was pull three fans, put new ones in their place, and hit the go button. There is no mention of drive errors anywhere in /var/log/messages (even going back to before the reboot, when everything was working).

The dmesg error that all 4 drives are throwing was in the original post:
Code:

attempt to access beyond end of device
sdb: rw=16, want=976773176, limit=976773168

This error is the same for each device. I suspect it might be a red herring, because the array was working before the reboot and none of the partitions were changed. But I'm no expert; is there something I can try to get past this particular error that won't result in data loss?
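
For what it's worth, if I'm reading that message correctly ('want' being where the request ends), each drive is only being overrun by a tiny amount:

Code:

976773176 - 976773168 = 8 sectors = 8 * 512 bytes = 4 KiB

which seems to line up with the internal bitmap read being the thing that fails ("read 21/30 pages ... status: -5").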

For completeness, though, I tried assembling four times, each time leaving out a different drive. Nothing; the array wouldn't run. dmesg shows the same results as before, except that the drive left out of the set no longer throws the "attempt to access beyond end of device" error.

Here are the results. The dmesg output is not included, as it is predictably the same as in the original post, with each drive that is put into the set throwing the error:

Code:

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sdb /dev/sdc /dev/sdd                                                                                                                   
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: no uptodate device for slot 1 of /dev/md0
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sda /dev/sdc /dev/sdd                                   
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: no uptodate device for slot 0 of /dev/md0
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sda /dev/sdb /dev/sdd                     
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: no uptodate device for slot 2 of /dev/md0
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sda /dev/sdb /dev/sdc
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: no uptodate device for slot 3 of /dev/md0
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
 
[server] ~ # mdadm --stop /dev/md0   
mdadm: stopped /dev/md0

I can post the full dmesg and/or /var/log/messages if it would help, but my untrained eye doesn't notice anything out of the ordinary apart from the errors already discussed.

macemoneta 08-18-2007 06:31 AM

If you are running with libata (and at 2.6.22, with devices getting sdX names, you probably are), try adding this to your /etc/modprobe.conf:

options libata ignore_hpa=1

Then reboot. This tells libata to ignore the "host protected area" on the drives. Ignoring the HPA was the default with the old IDE drivers, but libata turns it off by default.
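
For reference, the whole change might look like this (just a sketch -- on Gentoo the setting may belong in a file under /etc/modules.d/ and be regenerated with update-modules rather than edited into modprobe.conf directly, and the exact dmesg wording varies by kernel):

Code:

# /etc/modprobe.conf
options libata ignore_hpa=1

# after rebooting, check whether the parameter took effect (if your kernel exposes it)
cat /sys/module/libata/parameters/ignore_hpa

# and look for any HPA-related messages from the drives
dmesg | grep -i hpa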

silicon.pyro 08-18-2007 10:44 AM

I took your recommendation and added the option to modules.conf as well as modprobe.conf, to no avail... The option is also specified on the GRUB kernel command line, and I'm fairly confident it's being honored there, because I mistyped it the first time and dmesg complained about an unknown option; once I fixed it, that complaint disappeared. The result is below. You will notice that the same errors are thrown at the end, and I'm still getting the same thing when I try running the array with one drive left out.

Code:

[server] ~ # dmesg
Linux version 2.6.22-gentoo-r2 (root@aneris) (gcc version 4.1.2 (Gentoo 4.1.2 p1.0.1)) #1 Thu Aug 16 22:48:27 MDT 2007
Command line: root=/dev/hda3 softlevel=samba libata.ignore_hpa=1
--- SNIP ---
Kernel command line: root=/dev/hda3 softlevel=samba libata.ignore_hpa=1
--- SNIP ---
libata version 2.21 loaded.
--- SNIP ---
sata_nv 0000:00:0e.0: version 3.4
ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 22
ACPI: PCI Interrupt 0000:00:0e.0[A] -> Link [LSA0] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:0e.0 to 64
scsi0 : sata_nv
scsi1 : sata_nv
ata1: SATA max UDMA/133 cmd 0x000000000001e800 ctl 0x000000000001e482 bmdma 0x000000000001e000 irq 22
ata2: SATA max UDMA/133 cmd 0x000000000001e400 ctl 0x000000000001e082 bmdma 0x000000000001e008 irq 22
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata1.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata1.00: configured for UDMA/133
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata2.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata2.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda:
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
scsi 1:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdb:
sd 1:0:0:0: [sdb] Attached SCSI disk
sd 1:0:0:0: Attached scsi generic sg1 type 0
ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 21
ACPI: PCI Interrupt 0000:00:0f.0[A] -> Link [LSA1] -> GSI 21 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:0f.0 to 64
scsi2 : sata_nv
scsi3 : sata_nv
ata3: SATA max UDMA/133 cmd 0x000000000001dc00 ctl 0x000000000001d882 bmdma 0x000000000001d400 irq 21
ata4: SATA max UDMA/133 cmd 0x000000000001d800 ctl 0x000000000001d482 bmdma 0x000000000001d408 irq 21
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata3.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata3.00: configured for UDMA/133
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata4.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata4.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 2:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 2:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdc:
sd 2:0:0:0: [sdc] Attached SCSI disk
sd 2:0:0:0: Attached scsi generic sg2 type 0
scsi 3:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 3:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 3:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd:
sd 3:0:0:0: [sdd] Attached SCSI disk
sd 3:0:0:0: Attached scsi generic sg3 type 0
--- SNIP ---
md: md0 stopped.
md: bind<sda>
md: bind<sdc>
md: bind<sdd>
md: bind<sdb>
raid5: device sdb operational as raid disk 0
raid5: device sdd operational as raid disk 3
raid5: device sdc operational as raid disk 2
raid5: device sda operational as raid disk 1
raid5: allocated 4262kB for md0
raid5: raid level 5 set md0 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4
 disk 0, o:1, dev:sdb
 disk 1, o:1, dev:sda
 disk 2, o:1, dev:sdc
 disk 3, o:1, dev:sdd
attempt to access beyond end of device
sdb: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdd: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdc: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sda: rw=16, want=976773176, limit=976773168
md0: bitmap initialized from disk: read 21/30 pages, set 4 bits, status: -5
md0: failed to create bitmap (-5)
md: pers->run() failed ...

Sounds like I've got a real head-scratcher on my hands...

macemoneta 08-18-2007 12:33 PM

How did you create the array? After the array was created, what options did you specify on mke2fs?

ajg 08-18-2007 12:36 PM

Are those drives partitioned correctly?

I'd expect to see the partitions:
Code:

/dev/sda1
/dev/sdb1
/dev/sdc1
/dev/sdd1

making up the /dev/md0 RAID set, rather than the raw device entries:
Code:

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd

as you have listed.

For example, /proc/mdstat on one of my servers says:

Code:

> cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      4216960 blocks [2/2] [UU]
     
md2 : active raid1 sdb3[1] sda3[0]
      286744064 blocks [2/2] [UU]
     
md0 : active raid1 sdb1[1] sda1[0]
      2072256 blocks [2/2] [UU]
     
unused devices: <none>

and mdadm --detail /dev/md0 gives:

Code:

> mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.02
  Creation Time : Sat Jan  7 15:56:31 2006
    Raid Level : raid1
    Array Size : 2072256 (2023.69 MiB 2121.99 MB)
    Device Size : 2072256 (2023.69 MiB 2121.99 MB)
  Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Aug 18 18:32:32 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

          UUID : afd6ffc6:803a3b15:f337fa45:27abb7f8
        Events : 0.8495

    Number  Major  Minor  RaidDevice State
      0      8        1        0      active sync  /dev/sda1
      1      8      17        1      active sync  /dev/sdb1

... unless I'm reading your output wrong.

silicon.pyro 08-19-2007 01:06 AM

Nope... you're reading my output correctly. I did partition the drives before adding them to the set. However, when I actually created the array, I issued the following command:

Code:

mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
Though I didn't save the output, the man page notes that you can use whole devices this way rather than individual partitions, which is what I opted to do. Perhaps I misunderstood the man page, though.

This could be part of my problem now, and if so, it seems like an easy fix once I have the complete data back and can rebuild the array. But there has to be a way around this in the meantime; the setup has worked through many reboots and a few unrelated hardware changes (DVD-RW, added memory) for over a year.
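
When I do get to rebuild it, I'm picturing something like this -- just a sketch, after giving each disk a single partition of type fd (Linux raid autodetect), with device names as in the current layout:

Code:

# create the new array on partitions rather than raw disks
# (or --level=10 if I end up going the RAID-10 route)
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1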

My weekly backup regimen was delayed for a month and a half by a bunch of circumstances. Had that not happened, I would just eat a few days' worth of data, toast the array, and start again. But there's a few weeks of data I really want, and I can just tell it's there, waiting for me to figure out what's holding up the array.

I actually meant to have rebuilt the array by now and gone to RAID-10, though I'm not sure even that would have solved this problem, since it doesn't look like any single drive failed. I think I'll stick with my non-RAID backups, because I can generally be sure I have accessible data that way, even if I do have to cut it up to fit on various themed backup disks. Like most people I've encountered, I just have to be more diligent with the backup schedule. In this case, I really must solve this problem first, then move on to the next one, which is the backup schedule.

ajg 08-19-2007 03:22 AM

I did see something like what you're getting when I was replacing a failed drive. I typoed

Code:

mdadm /dev/md0 --add /dev/sdb
rather than

Code:

mdadm /dev/md0 --add /dev/sdb1
and it all appeared to proceed normally -- cat /proc/mdstat showed the drive rebuilding as I would expect. I didn't see a problem until I rebooted the system, at which point it immediately failed /dev/sdb with errors similar to those you are seeing. Luckily, I only ever use RAID1, so I just re-prepped the replacement drive, issued the rebuild commands correctly (roughly the sequence sketched below), and all was well.
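
Roughly this, from memory (device names are illustrative; /dev/sda was the surviving disk, and the same --add would be repeated for the other md devices on that disk):

Code:

# kick the wrongly-added whole device back out of the mirror
mdadm /dev/md0 --fail /dev/sdb
mdadm /dev/md0 --remove /dev/sdb

# copy the partition layout from the good disk, then add the partition back
sfdisk -d /dev/sda | sfdisk /dev/sdb
mdadm /dev/md0 --add /dev/sdb1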

Soft RAID5 is an accident waiting to happen, IMO. I did a lot of testing before I implemented any Linux soft-RAID setups, and although RAID5 works fine in normal operation, there are too many issues around failure/replacement procedures and day-to-day operational management for it to be considered reliable.

RAID1 is simple enough to boot under a number of failure conditions, and reliable enough to do the job. Given the price of 320GB drives, I don't believe there's much point in adding the complexity of RAID5 unless you need a really huge amount of storage, and given the risk, even 400-500GB drives are cheap enough.

The issue is backing all that data up -- a 500GB tape drive doesn't come cheap!

Anyway, I digress. If this started happening all of a sudden, then something must have changed to cause the problem. Given my similar experience, I don't think it's a hardware failure. I would suspect a software update to the MD driver, but as you've rolled the kernel back a few times to no avail, I'm a little stumped, I'm afraid.

silicon.pyro 08-19-2007 10:58 AM

I think my task for today, then, is to go through my emerge history, kernel history, etc., and try reverting a bunch of things just to get the data back. Perhaps I missed something, or updated something remotely without remembering.

silicon.pyro 08-19-2007 11:54 AM

Well, I'm fairly beaten down here... the last time I updated any software was April 2, 2007, and I have brought the server down many times since then without issue. That update was the kernel, which I have already tried reverting to, with no luck.

Is there anything I can modify at the disk level that would allow me to rebuild, even if it does result in *some* data loss?

ajg 08-19-2007 07:01 PM

Just reading back through the thread: you say you partitioned the drives before creating the array. It could be worth checking the partition tables with fdisk (a quick check is sketched at the end of this post).

If partitions exist (I would expect them to be type 'fd', Linux raid autodetect), you could then try assembling the array from the partitions, e.g.

Code:

mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
If there are no partitions, then I'm a bit stumped. I've googled a few things, and I can't find any reference to using raw devices with the MD driver. :confused:
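
The check itself would look something like this (illustrative only -- the Id column is the thing to look at):

Code:

fdisk -l /dev/sda
# a partition-based member should show a /dev/sda1 line with Id 'fd' (Linux raid autodetect);
# if fdisk reports no valid partition table, the array was probably built on the raw disk
# (or the table has since been overwritten by array data)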

silicon.pyro 08-20-2007 01:40 AM

Though I did partition the disks, the partitions no longer exist... most likely md took over the partition table as well, since fdisk didn't like the structure of the disk when I opened it up to have a look.

Since you had a similar experience, I'm going to say that this is, in fact, my problem. Perhaps the planets were simply aligned when I created the array the first time, and it just took until now for a bit of data to land somewhere it didn't belong on the disk. I'm going to have to chalk this one up to lessons learned. I'm about 75% recovered here, after digging through all my backup increments and looking for local copies of the data on the connected machines. The remaining data will just have to be lost -- the machine needs to go back into production, and I don't have the funds to duplicate the storage right now and keep trying. I would really like to get there soon and move to RAID-10 -- we are in agreement that RAID-5 is not a reasonable measure to protect against data loss.

As always: more backups, more backups, more backups. I probably should have learned by now that before I type anything at the console, even if it's just 'shutdown -h now', I should ask myself whether I can spare the five minutes to run off an increment to my external storage.

Thanks to both of you for all the advice. I appreciate it, and I hope I can pay it forward some day soon.

At least I've got one thing going for me -- these fluid-dynamic bearing fans are really quiet. The loudest part of the box now is the drives spinning away with no more purpose.

