
silicon.pyro 08-17-2007 11:02 AM

Cannot assemble my clean RAID...
 
I have a server at home that I brought down to replace the fans (they were getting loud and annoying my roommates). I brought it back up, and for some reason the RAID array won't assemble.

Here's a transcript of what I tried:

Code:

[server] ~ # uname -r
2.6.22-gentoo-r2

[server] ~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
unused devices: <none>

[server] ~ # mdadm --assemble --verbose /dev/md0 /dev/sd[a-d]
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdb[0] sdd[3] sdc[2] sda[1]
      1953545984 blocks

unused devices: <none>

[server] ~ # dmesg | tail -n 27
md: bind<sda>
md: bind<sdc>
md: bind<sdd>
md: bind<sdb>
raid5: device sdb operational as raid disk 0
raid5: device sdd operational as raid disk 3
raid5: device sdc operational as raid disk 2
raid5: device sda operational as raid disk 1
raid5: allocated 4262kB for md0
raid5: raid level 5 set md0 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4
 disk 0, o:1, dev:sdb
 disk 1, o:1, dev:sda
 disk 2, o:1, dev:sdc
 disk 3, o:1, dev:sdd
attempt to access beyond end of device
sdb: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdd: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdc: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sda: rw=16, want=976773176, limit=976773168
md0: bitmap initialized from disk: read 21/30 pages, set 4 bits, status: -5
md0: failed to create bitmap (-5)
md: pers->run() failed ...

[server] ~ # mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

    Update Time : Thu Aug 16 12:42:03 2007
          State : clean                                                                         
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d1c - correct
        Events : 0.1266080

        Layout : left-symmetric
    Chunk Size : 64K

      Number  Major  Minor  RaidDevice State
this    1      8        0        1      active sync  /dev/sda

  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --examine /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
 
    Update Time : Thu Aug 16 12:42:03 2007
          State : clean
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d2a - correct
        Events : 0.1266080
     
        Layout : left-symmetric
    Chunk Size : 64K
 
      Number  Major  Minor  RaidDevice State
this    0      8      16        0      active sync  /dev/sdb
 
  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --examine /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
 
    Update Time : Thu Aug 16 12:42:03 2007
          State : clean
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d3e - correct
        Events : 0.1266080
     
        Layout : left-symmetric
    Chunk Size : 64K
 
      Number  Major  Minor  RaidDevice State
this    2      8      32        2      active sync  /dev/sdc
 
  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --examine /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 00.90.00
          UUID : 02478577:a48b95e7:f32ee040:204d165a
  Creation Time : Sun Sep 24 12:20:52 2006
    Raid Level : raid5
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Array Size : 1465159488 (1397.29 GiB 1500.32 GB)
  Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
 
    Update Time : Thu Aug 16 12:42:03 2007
          State : clean
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
      Checksum : c999d50 - correct
        Events : 0.1266080
     
        Layout : left-symmetric
    Chunk Size : 64K
 
      Number  Major  Minor  RaidDevice State
this    3      8      48        3      active sync  /dev/sdd
 
  0    0      8      16        0      active sync  /dev/sdb
  1    1      8        0        1      active sync  /dev/sda
  2    2      8      32        2      active sync  /dev/sdc
  3    3      8      48        3      active sync  /dev/sdd

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # dmesg | tail
md0: failed to create bitmap (-5)
md: pers->run() failed ...
md: md0 stopped. 
md: unbind<sdb>
md: export_rdev(sdb)
md: unbind<sdd>
md: export_rdev(sdd)
md: unbind<sdc>
md: export_rdev(sdc)
md: unbind<sda>
md: export_rdev(sda)

I'm running Gentoo 2007.1. I've tried a few things: rolling back to an old kernel, running an emerge update, and adding libata.ignore_hpa=1 to the kernel boot parameters. I can't even get to the point where there's a filesystem to check. I usually have pretty good luck piecing together solutions from other people's problems, but there must be something I'm missing here.

Thanks in advance for the help.

macemoneta 08-17-2007 11:11 AM

- Check dmesg and/or /var/log/messages to identify the drive reporting the error.
- Assemble the array without that drive. Don't continue unless this succeeds!
- Wipe the beginning of the failed drive:

dd if=/dev/zero of=/dev/xxx bs=512 count=65

If this fails, replace the drive.

- Repartition the drive. If this fails, replace the drive.
- Re-add the drive to the array. If this fails, replace the drive.
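
A rough sketch of the whole sequence, assuming the bad drive turns out to be /dev/sdd (a hypothetical name -- substitute whichever disk is actually logging the errors), with the last step adjusted for whole-disk or partition-based members:

Code:

# 1. Assemble degraded, leaving out the suspect drive (RAID5 tolerates one missing member)
mdadm --assemble --run /dev/md0 /dev/sda /dev/sdb /dev/sdc

# 2. Wipe the start of the suspect drive (clears the MBR/partition table)
dd if=/dev/zero of=/dev/sdd bs=512 count=65

# 3. Repartition it (a single partition of type fd, Linux raid autodetect)
fdisk /dev/sdd

# 4. Hand it back to md and let it resync
mdadm /dev/md0 --add /dev/sdd1    # or /dev/sdd if the array was built on whole disks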

silicon.pyro 08-18-2007 02:31 AM

Quote:

Originally Posted by macemoneta (Post 2862121)
- Check dmesg and/or /var/log/messages to identify the drive reporting the error.
- Assemble the array without that drive. Don't continue unless this succeeds!
- Wipe the beginning of the failed drive:

dd if=/dev/zero of=/dev/xxx bs=512 count=65

If this fails, replace the drive.

- Repartition the drive. If this fails, replace the drive.
- Re-add the drive to the array. If this fails, replace the drive.

I suppose I should have chosen a different subject for my post... The drives assemble but will not run.

Perhaps the answer to your question is that all the drives are giving an error. However, I think all the drives are good, because I changed nothing between reboots that would affect the disk structure -- no formatting, no repartitioning, no rebuilding, etc. All I did was pull three fans, put new ones in their place, and hit the go button. There is no mention of drive errors anywhere in /var/log/messages (even going back to before the reboot, when everything was working).

The dmesg error that all 4 drives are throwing was in the original post:
Code:

attempt to access beyond end of device
sdb: rw=16, want=976773176, limit=976773168

This error is the same for each device. I suspect it might be a red herring, because the array was working before the reboot and none of the partitions were changed. But I'm no expert; is there something I can try to get past this particular error that won't result in data loss?
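
For what it's worth, if I'm reading that message correctly ('want' being where the request ends), each drive is only being overrun by a tiny amount:

Code:

976773176 - 976773168 = 8 sectors = 8 * 512 bytes = 4 KiB

which seems to line up with the internal bitmap read being the thing that fails ("read 21/30 pages ... status: -5").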

For completeness, though, I tried assembling four times, each time leaving out a different drive. Nothing; the array wouldn't run. dmesg shows the same results as before, except that the drive left out of the set no longer throws the "attempt to access beyond end of device" error.

Here are the results. The dmesg output is not included, as it is predictably the same as in the original post, with each drive that is put into the set throwing the error:

Code:

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sdb /dev/sdc /dev/sdd                                                                                                                   
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: no uptodate device for slot 1 of /dev/md0
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sda /dev/sdc /dev/sdd                                   
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: no uptodate device for slot 0 of /dev/md0
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sda /dev/sdb /dev/sdd                     
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 3.
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: no uptodate device for slot 2 of /dev/md0
mdadm: added /dev/sdd to /dev/md0 as 3
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

[server] ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

[server] ~ # mdadm --verbose --assemble --run /dev/md0 /dev/sda /dev/sdb /dev/sdc
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 2.
mdadm: added /dev/sda to /dev/md0 as 1
mdadm: added /dev/sdc to /dev/md0 as 2
mdadm: no uptodate device for slot 3 of /dev/md0
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
 
[server] ~ # mdadm --stop /dev/md0   
mdadm: stopped /dev/md0

I can post the full dmesg and/or /var/log/messages if it would help, but my untrained eye doesn't notice anything out of the ordinary apart from the errors already discussed.

macemoneta 08-18-2007 06:31 AM

If you are running with libata (and at 2.6.22, with devices getting sdX names, you probably are), try adding this to your /etc/modprobe.conf:

options libata ignore_hpa=1

Then reboot. This tells libata to ignore the "host protected area" on the drives. Ignoring the HPA was the default with the old IDE drivers, but libata turns it off by default.
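
For reference, the whole change might look like this (just a sketch -- on Gentoo the setting may belong in a file under /etc/modules.d/ and be regenerated with update-modules rather than edited into modprobe.conf directly, and the exact dmesg wording varies by kernel):

Code:

# /etc/modprobe.conf
options libata ignore_hpa=1

# after rebooting, check whether the parameter took effect (if your kernel exposes it)
cat /sys/module/libata/parameters/ignore_hpa

# and look for any HPA-related messages from the drives
dmesg | grep -i hpa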

silicon.pyro 08-18-2007 10:44 AM

I took your recommendation and added the option to modules.conf as well as modprobe.conf, to no avail... The option is also specified on the GRUB kernel command line, and I'm fairly confident it's being honored there, because I mistyped it the first time and dmesg complained about an unknown option; once I fixed it, that complaint disappeared. The result is below. You will notice that the same errors are thrown at the end, and I'm still getting the same thing when I try running the array with one drive left out.

Code:

[server] ~ # dmesg
Linux version 2.6.22-gentoo-r2 (root@aneris) (gcc version 4.1.2 (Gentoo 4.1.2 p1.0.1)) #1 Thu Aug 16 22:48:27 MDT 2007
Command line: root=/dev/hda3 softlevel=samba libata.ignore_hpa=1
--- SNIP ---
Kernel command line: root=/dev/hda3 softlevel=samba libata.ignore_hpa=1
--- SNIP ---
libata version 2.21 loaded.
--- SNIP ---
sata_nv 0000:00:0e.0: version 3.4
ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 22
ACPI: PCI Interrupt 0000:00:0e.0[A] -> Link [LSA0] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:0e.0 to 64
scsi0 : sata_nv
scsi1 : sata_nv
ata1: SATA max UDMA/133 cmd 0x000000000001e800 ctl 0x000000000001e482 bmdma 0x000000000001e000 irq 22
ata2: SATA max UDMA/133 cmd 0x000000000001e400 ctl 0x000000000001e082 bmdma 0x000000000001e008 irq 22
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata1.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata1.00: configured for UDMA/133
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata2.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata2.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda:
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
scsi 1:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdb:
sd 1:0:0:0: [sdb] Attached SCSI disk
sd 1:0:0:0: Attached scsi generic sg1 type 0
ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 21
ACPI: PCI Interrupt 0000:00:0f.0[A] -> Link [LSA1] -> GSI 21 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:0f.0 to 64
scsi2 : sata_nv
scsi3 : sata_nv
ata3: SATA max UDMA/133 cmd 0x000000000001dc00 ctl 0x000000000001d882 bmdma 0x000000000001d400 irq 21
ata4: SATA max UDMA/133 cmd 0x000000000001d800 ctl 0x000000000001d482 bmdma 0x000000000001d408 irq 21
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata3.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata3.00: configured for UDMA/133
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: ATA-7: ST3500630AS, 3.AAK, max UDMA/133
ata4.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata4.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 2:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 2:0:0:0: [sdc] 976773168 512-byte hardware sectors (500108 MB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdc:
sd 2:0:0:0: [sdc] Attached SCSI disk
sd 2:0:0:0: Attached scsi generic sg2 type 0
scsi 3:0:0:0: Direct-Access    ATA      ST3500630AS      3.AA PQ: 0 ANSI: 5
sd 3:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 3:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd:
sd 3:0:0:0: [sdd] Attached SCSI disk
sd 3:0:0:0: Attached scsi generic sg3 type 0
--- SNIP ---
md: md0 stopped.
md: bind<sda>
md: bind<sdc>
md: bind<sdd>
md: bind<sdb>
raid5: device sdb operational as raid disk 0
raid5: device sdd operational as raid disk 3
raid5: device sdc operational as raid disk 2
raid5: device sda operational as raid disk 1
raid5: allocated 4262kB for md0
raid5: raid level 5 set md0 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4
 disk 0, o:1, dev:sdb
 disk 1, o:1, dev:sda
 disk 2, o:1, dev:sdc
 disk 3, o:1, dev:sdd
attempt to access beyond end of device
sdb: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdd: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sdc: rw=16, want=976773176, limit=976773168
attempt to access beyond end of device
sda: rw=16, want=976773176, limit=976773168
md0: bitmap initialized from disk: read 21/30 pages, set 4 bits, status: -5
md0: failed to create bitmap (-5)
md: pers->run() failed ...

Sounds like I've got a real head-scratcher on my hands...

macemoneta 08-18-2007 12:33 PM

How did you create the array? After the array was created, what options did you specify on mke2fs?

ajg 08-18-2007 12:36 PM

Are those drives partitioned correctly?

I'd expect to see the partitions:
Code:

/dev/sda1
/dev/sdb1
/dev/sdc1
/dev/sdd1

making up the /dev/md0 RAID set, rather than the raw device entries:
Code:

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd

as you have listed.

For example, /proc/mdstat on one of my servers says:

Code:

> cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      4216960 blocks [2/2] [UU]
     
md2 : active raid1 sdb3[1] sda3[0]
      286744064 blocks [2/2] [UU]
     
md0 : active raid1 sdb1[1] sda1[0]
      2072256 blocks [2/2] [UU]
     
unused devices: <none>

and mdadm --detail /dev/md0 gives:

Code:

> mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.02
  Creation Time : Sat Jan  7 15:56:31 2006
    Raid Level : raid1
    Array Size : 2072256 (2023.69 MiB 2121.99 MB)
    Device Size : 2072256 (2023.69 MiB 2121.99 MB)
  Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Aug 18 18:32:32 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

          UUID : afd6ffc6:803a3b15:f337fa45:27abb7f8
        Events : 0.8495

    Number  Major  Minor  RaidDevice State
      0      8        1        0      active sync  /dev/sda1
      1      8      17        1      active sync  /dev/sdb1

... unless I'm reading your output wrong.

silicon.pyro 08-19-2007 01:06 AM

Nope... you're reading my output correctly. I did partition the drives before adding them to the set. However, when I actually created the array, I issued the following command:

Code:

mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
Though I didn't save the output, the man page notes that you can use whole devices this way rather than individual partitions, which is what I opted to do. Perhaps I misunderstood the man page, though.

This could be part of my problem now, and if so, it seems like an easy fix once I have the complete data back and can rebuild the array. But there has to be a way around this in the meantime; the setup has worked through many reboots and a few unrelated hardware changes (DVD-RW, added memory) for over a year.
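
When I do get to rebuild it, I'm picturing something like this -- just a sketch, after giving each disk a single partition of type fd (Linux raid autodetect), with device names as in the current layout:

Code:

# create the new array on partitions rather than raw disks
# (or --level=10 if I end up going the RAID-10 route)
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1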

My weekly backup regimen was delayed for a month and a half by a bunch of circumstances. Had that not happened, I would just eat a few days' worth of data, toast the array, and start again. But there's a few weeks of data I really want, and I can just tell it's there, waiting for me to figure out what's holding up the array.

I actually meant to have rebuilt the array by now and gone to RAID-10, though I'm not sure even that would have solved this problem, since it doesn't look like any single drive failed. I think I'll stick with my non-RAID backups, because I can generally be sure I have accessible data that way, even if I do have to cut it up to fit on various themed backup disks. Like most people I've encountered, I just have to be more diligent with the backup schedule. In this case, I really must solve this problem first, then move on to the next one, which is the backup schedule.

ajg 08-19-2007 03:22 AM

I did see something like what you're getting when I was replacing a failed drive. I typoed

Code:

mdadm /dev/md0 --add /dev/sdb
rather than

Code:

mdadm /dev/md0 --add /dev/sdb1
and it all appeared to proceed normally -- cat /proc/mdstat showed the drive rebuilding as I would expect. I didn't see a problem until I rebooted the system, at which point it immediately failed /dev/sdb with errors similar to those you are seeing. Luckily, I only ever use RAID1, so I just re-prepped the replacement drive, issued the rebuild commands correctly (roughly the sequence sketched below), and all was well.
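
Roughly this, from memory (device names are illustrative; /dev/sda was the surviving disk, and the same --add would be repeated for the other md devices on that disk):

Code:

# kick the wrongly-added whole device back out of the mirror
mdadm /dev/md0 --fail /dev/sdb
mdadm /dev/md0 --remove /dev/sdb

# copy the partition layout from the good disk, then add the partition back
sfdisk -d /dev/sda | sfdisk /dev/sdb
mdadm /dev/md0 --add /dev/sdb1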

Soft RAID5 is an accident waiting to happen, IMO. I did a lot of testing before I implemented any Linux soft-RAID setups, and although RAID5 works fine in normal operation, there are too many issues around failure/replacement procedures and day-to-day operational management for it to be considered reliable.

RAID1 is simple enough to boot under a number of failure conditions, and reliable enough to do the job. Given the price of 320GB drives, I don't believe there's much point in adding the complexity of RAID5 unless you need a really huge amount of storage, and given the risk, even 400-500GB drives are cheap enough.

The issue is backing all that data up -- a 500GB tape drive doesn't come cheap!

Anyway, I digress. If this started happening all of a sudden, then something must have changed to cause the problem. Given my similar experience, I don't think it's a hardware failure. I would suspect a software update to the MD driver, but as you've rolled the kernel back a few times to no avail, I'm a little stumped, I'm afraid.

silicon.pyro 08-19-2007 10:58 AM

I think my task for today, then, is to go through my emerge history, kernel history, etc., and try reverting a bunch of things just to get the data back. Perhaps I missed something, or updated something remotely without remembering.

silicon.pyro 08-19-2007 11:54 AM

Well, I'm fairly beaten down here... the last time I updated any software was April 2, 2007, and I have brought the server down many times since then without issue. That update was the kernel, which I have already tried reverting to, with no luck.

Is there anything I can modify at the disk level that would allow me to rebuild, even if it does result in *some* data loss?

ajg 08-19-2007 07:01 PM

Just reading back through the thread: you say you partitioned the drives before creating the array. It could be worth checking the partition tables with fdisk (a quick check is sketched at the end of this post).

If partitions exist (I would expect them to be type 'fd', Linux raid autodetect), you could then try assembling the array from the partitions, e.g.

Code:

mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
If there are no partitions, then I'm a bit stumped. I've googled a few things, and I can't find any reference to using raw devices with the MD driver. :confused:
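
The check itself would look something like this (illustrative only -- the Id column is the thing to look at):

Code:

fdisk -l /dev/sda
# a partition-based member should show a /dev/sda1 line with Id 'fd' (Linux raid autodetect);
# if fdisk reports no valid partition table, the array was probably built on the raw disk
# (or the table has since been overwritten by array data)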

silicon.pyro 08-20-2007 01:40 AM

Though I did partition the disks, the partitions no longer exist... most likely md took over the partition table as well, since fdisk didn't like the structure of the disk when I opened it up to have a look.

Since you had a similar experience, I'm going to say that this is, in fact, my problem. Perhaps the planets were simply aligned when I created the array the first time, and it just took until now for a bit of data to land somewhere it didn't belong on the disk. I'm going to have to chalk this one up to lessons learned. I'm about 75% recovered here, after digging through all my backup increments and looking for local copies of the data on the connected machines. The remaining data will just have to be lost -- the machine needs to go back into production, and I don't have the funds to duplicate the storage right now and keep trying. I would really like to get there soon and move to RAID-10 -- we are in agreement that RAID-5 is not a reasonable measure to protect against data loss.

As always: more backups, more backups, more backups. I probably should have learned by now that before I type anything at the console, even if it's just 'shutdown -h now', I should ask myself whether I can spare the five minutes to run off an increment to my external storage.

Thanks to both of you for all the advice. I appreciate it, and I hope I can pay it forward some day soon.

At least I've got one thing going for me -- these fluid-dynamic bearing fans are really quiet. The loudest part of the box now is the drives spinning away with no more purpose.

