Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux related and doesn't seem to fit in any other forum, then this is the place.
I have several drives in an LVM VG/LV, and for some reason, on reboot a drive will get a corrupt GPT table. I have killed the entire VG and re-created it without the drive that was showing the problem, and then it just happens to another drive. It does not appear to be the same drive each time, either; I've confirmed this by using smartctl to check the serial number of the drive reporting a corrupted table.
I have swapped cables around between the two controllers to see if I could pinpoint which cable or port showed the problem, and, long story short, there was little consistency. This simply does not appear to be caused by any single cable, port, controller, or drive.
Code:
parted /dev/sdb print
Error: The primary GPT table is corrupt, but the backup appears OK, so that will be used.
OK/Cancel?
When I see that and select OK, it just shows the same error again. I can do a mklabel and mkpart, and then the LVM LV shows up under /dev as it should, without another vgscan. If I then mount that LV, I can see the data is there and it seems OK, despite mklabel's warning that it will destroy the data.
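One reason the mount can still work: the partition table occupies only the first few sectors of the disk, so rewriting it leaves the LV contents further in untouched, as long as the new partition starts at the same offset. A minimal sketch on a scratch image file (not a real device; offsets assume 512-byte sectors, where a standard 128-entry GPT occupies LBAs 0-33):

```shell
# Demonstration on a scratch image: clobbering the first 34 sectors (where the
# GPT lives) does not touch data stored further into the "disk".
img=$(mktemp)
truncate -s 8M "$img"
# pretend this is LV data, 2 MiB into the disk
printf 'backup payload' | dd of="$img" bs=1M seek=2 conv=notrunc 2>/dev/null
# wipe the partition-table region, as a corrupt or rewritten GPT would
dd if=/dev/zero of="$img" bs=512 count=34 conv=notrunc 2>/dev/null
# the data beyond the table is still intact
dd if="$img" bs=1M skip=2 count=1 2>/dev/null | head -c 14   # prints: backup payload
rm -f "$img"
```

The caveat is real, though: mkpart has to recreate the partition at the same starting LBA, or the PV will appear shifted and unreadable.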
Logs show no cause during boot.
So, what is causing this? Will doing the mklabel kill the data on it?
I just don't understand why Ubuntu is randomly corrupting GPT tables.
Code:
Ubuntu 10.10 x64
Mobo: ASUS A8N-SLI - On board NVIDIA nforce4-SLI controller has 4 ports connected to 3 drives in this LVM LV.
HighPoint Technologies, Inc. RocketRAID 230x 4 Port SATA-II Controller - Has 4 ports, 3 of which are used in the LVM LV. (Had 4, one is out with an RMA).
Linux teal 2.6.35-22-server #34-Ubuntu SMP Sun Oct 10 10:54:55 UTC 2010 x86_64 GNU/Linux
--- Volume group ---
VG Name vg-backup
System ID
Format lvm2
Metadata Areas 6
Metadata Sequence No 2
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 0
Max PV 0
Cur PV 6
Act PV 6
VG Size 3.64 TiB
PE Size 4.00 MiB
Total PE 953868
Alloc PE / Size 953868 / 3.64 TiB
Free PE / Size 0 / 0
VG UUID iKFodI-VcUI-Aikr-N1v2-V6Fq-fXFX-6hhXmD
--- Logical volume ---
LV Name /dev/vg-backup/lv-backup
VG Name vg-backup
LV UUID yxDOVK-ep0Z-ODBT-LjdR-fQcS-72x8-Qu0fcI
LV Write Access read/write
LV Status available
# open 0
LV Size 3.64 TiB
Current LE 953868
Segments 6
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 251:4
04:00.0 SCSI storage controller: HighPoint Technologies, Inc. RocketRAID 230x 4 Port SATA-II Controller (rev 02)
Subsystem: Marvell Technology Group Ltd. Device 11ab
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 17
Region 0: Memory at d5000000 (64-bit, non-prefetchable) [size=1M]
Region 2: I/O ports at a000 [size=256]
[virtual] Expansion ROM at d6200000 [disabled] [size=512K]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [60] Express (v1) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <256ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #3, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <256ns, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
Kernel driver in use: sata_mv
Kernel modules: sata_mv
I've had some trouble with certain HighPoint products in the past, so I've switched all of my systems over to software-based RAID (mdadm is awesome).
A few questions first: a) Do you have a RAID configured by the HighPoint controller? b) Are you booting off of that RAID array? c) What kernel version are you using?
Now for some of my "lessons learned" with HighPoint: a) The manuals have some wacky stuff in there. Here's yours. b) Do you have "Staggered drive spin up" enabled?
I had problems with this, but I'm also using four 1TB 7200 RPM drives. By the time the fourth drive would spin up, something kept timing out and throwing a bus reset. c) Is "EBDA Reallocation" disabled?
I have a 2304 card, and this was disabled by default. Not only were there boot problems, but there were several issues with waking up the drives. d) Have you checked the SMART status of your disks?
Code:
smartctl -H /dev/sd#
I don't know if this helps, but I had to look up GPT (GUID Partition Table) on Wikipedia, not having moved to GRUB 2 myself and not having >2.2 TB of hard drives, but it appears to me that the GPT header is now stored at LBA 1 (rather than the old MBR, which is stored at LBA 0).
The Wikipedia article then goes on to say that most recent disks have 4096-byte sectors (whether or not they report 512-byte sectors), and as I understand it, if the GPT doesn't start at the right place you get corruption. In that event, the backup GPT header stored at the end of the disk (no, I don't know exactly where) is still good.
The article also mentions that the disk 'data' (LVs etc.) should not start until LBA 40.
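The layout described above can be sketched against a scratch image file: the primary header's "EFI PART" signature sits at LBA 1, with a backup copy at the last LBA (offsets assume 512-byte sectors; the image file stands in for a real disk and is purely illustrative).

```shell
# Sketch of the GPT on-disk layout, using a scratch image in place of a real
# disk (512-byte sectors assumed; a 1 MiB image = 2048 LBAs).
img=$(mktemp)
truncate -s 1M "$img"
# primary GPT header signature at LBA 1 (LBA 0 holds the protective MBR)
printf 'EFI PART' | dd of="$img" bs=512 seek=1 conv=notrunc 2>/dev/null
# backup GPT header at the last LBA of the disk
printf 'EFI PART' | dd of="$img" bs=512 seek=2047 conv=notrunc 2>/dev/null
# read the primary signature back
dd if="$img" bs=512 skip=1 count=1 2>/dev/null | head -c 8   # prints: EFI PART
rm -f "$img"
```

This is why parted can offer to fall back to the backup table: both copies exist independently, one at each end of the device.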
Hope this helps, although I don't profess to know how to ensure exactly where the GPT header gets loaded.
Quote:
Originally Posted by xeleema
b) Do you have "Staggered drive spin up" enabled? I had problems with this, but I'm also using four 1TB 7200 RPM drives. By the time the fourth drive would spin up, something kept timing out and throwing a bus reset. c) Is "EBDA Reallocation" disabled? I have a 2304 card, and this was disabled by default. Not only were there boot problems, but there were several issues with waking up the drives. d) Have you checked the SMART status of your disks?
b) Nope. c) I don't see anything that mentions this in the BIOS config. The manual mentions it, but only briefly in the Windows section. d) Yep, I've run a long test on all of them and they all come back clean. I have also run extended tests from the drive manufacturers' boot media. All are reporting good.
One last option (if you have a backup of the data on that disk). You mentioned in your first post that you've swapped around a few disks and keep getting the GPT errors on different disks. I also noticed that you have a VG (vg-backup) with six PVs in it.
a) Is vg-backup basically concatenating, or are you doing anything for redundancy? b) Do you have a valid backup of this system (or is this system basically the backup server)? c) Can you trash the whole vg-backup (all six PVs)? (Basically, is there anything "worth it" in that VG?) d) You're not booting the OS from any of those PVs, right? e) Are all of those PVs "whole-disk", or are you using other partitions for something else (like the OS)? f) Have you considered using mdadm to set up mirroring (RAID10)?
The reason I ask is this: 1) If you don't have any redundancy, that should be addressed (six PVs is a *lot* of drives that could fail). 2) Something may be fishy with the way the VG was set up, specifically if those disks were used for something prior to this. 3) Perhaps "dd'ing" the first 1MB of the disks would clear up any craziness left over on the disks from their previous life. However, this would destroy data (hence the backup questions).
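The head-and-tail wipe in point 3 can be sketched like this, demonstrated on a scratch image rather than a real device (substitute your /dev/sdX only if the data on it is expendable; this destroys both the primary GPT at the start of the disk and the backup GPT at the end):

```shell
# Zero the first and last 1 MiB of a "disk" -- enough to cover the protective
# MBR, primary GPT, and the backup GPT at the end of the device.
img=$(mktemp)               # stand-in for /dev/sdX
truncate -s 8M "$img"
size=$(stat -c %s "$img")   # on a real disk: blockdev --getsize64 /dev/sdX
dd if=/dev/zero of="$img" bs=1M count=1 conv=notrunc 2>/dev/null
dd if=/dev/zero of="$img" bs=1M count=1 seek=$(( size / 1048576 - 1 )) conv=notrunc 2>/dev/null
rm -f "$img"
```

Wiping only the first megabyte leaves the backup GPT at the end of the disk behind, which is exactly the kind of leftover that can confuse tools later.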
Quote:
Originally Posted by xeleema
a) Is vg-backup basically concatenating, or are you doing anything for redundancy? b) Do you have a valid backup of this system (or is this system basically the backup server)? c) Can you trash the whole vg-backup (all six PVs)? (Basically, is there anything "worth it" in that VG?) d) You're not booting the OS from any of those PVs, right? e) Are all of those PVs "whole-disk", or are you using other partitions for something else (like the OS)? f) Have you considered using mdadm to set up mirroring (RAID10)?
a) No redundancy on this box. b) This is the backup. c) Already trashed it several times. d) Nope. e) All are whole disk. f) I did, but the drives are not all the same size and there is no decent way of doing it with my setup. I am aware of the danger, but I don't want to drop the money on a proper setup. This is an old desktop with a bunch of hard drives in it, nothing more.
Quote:
Originally Posted by xeleema
The reason I ask is this: 1) If you don't have any redundancy, that should be addressed (six PVs is a *lot* of drives that could fail). 2) Something may be fishy with the way the VG was set up, specifically if those disks were used for something prior to this. 3) Perhaps "dd'ing" the first 1MB of the disks would clear up any craziness left over on the disks from their previous life. However, this would destroy data (hence the backup questions).
a) I am aware of this and accept it. b) I doubt it. In my testing I've been doing: lvremove, vgremove, pvremove {eachdrive}, pvcreate {eachdrive}, vgcreate, lvcreate, mkfs. c) I'll try it later, but I have doubts that will solve the problem.
Quote:
Originally Posted by xeleema
3) Perhaps "dd'ing" the first 1MB of the disks would clear up any craziness left over on the disks from their previous life. However, this would destroy data (hence the backup questions).
I did a dd on the first 10 MB of the drives, verified the table was gone on all of them, rebooted, re-made the table and partition on each one, verified they showed up (parted /dev/sda print), rebooted, and yet again the table was lost on one of them.
Note that this was before LVM was involved, so that completely rules it out (I didn't suspect it earlier, though).
So, something is still causing the table to get corrupted. Replacing my kernel with the mainline one is the only other idea I have, and I've never done it before so it may take a while and I'll have to read through some documentation first.
Quote:
Originally Posted by gimpy530
I did a dd on the first 10 MB of the drives, verified the table was gone on all of them, rebooted, re-made the table and partition on each one, verified they showed up (parted /dev/sda print), rebooted, and yet again the table was lost on one of them.
This is the Twilight Zone of errors if I've ever seen one.
Quote:
Originally Posted by gimpy530
Note that this was before LVM so that completely rules it out (I didn't suspect it earlier though).
Good to know. I didn't suspect it either, but it's good you covered that base (just in case).
Quote:
Originally Posted by gimpy530
So, something is still causing the table to get corrupted. Replacing my kernel with the mainline one is the only other idea I have, and I've never done it before so it may take a while and I'll have to read through some documentation first.
Okay, before you do that, see if you have a /proc/config.gz file. If you do, that's a compressed copy of the running kernel's configuration. This will save you a lot of guesswork about what needs to be a module vs. compiled in.
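A sketch of reusing that config, assuming you're sitting in a kernel source tree (note: /proc/config.gz only exists when the kernel was built with CONFIG_IKCONFIG_PROC; Ubuntu also keeps a copy of the config under /boot, so there's a fallback):

```shell
# Reuse the running kernel's configuration as the starting point for a build.
if [ -r /proc/config.gz ]; then
    zcat /proc/config.gz > .config            # decompress the in-kernel config
elif [ -r "/boot/config-$(uname -r)" ]; then
    cp "/boot/config-$(uname -r)" .config     # Ubuntu's on-disk copy
fi
# then: make oldconfig && make -j"$(nproc)" && sudo make modules_install install
```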
I'm going to pull down Ubuntu 10.10 x64, set it up as a VM, and see if I can reproduce things on my end. I have 8 x 8GB USB sticks and a 16-port USB 2.0 hub I can attach to the VM and experiment with...
By the way, are you using "Desktop" or "Server"?
Update #1: Ubuntu 10.10 x64 "Desktop" finished downloading. I've configured a VM, just have to install the OS.
Some more googling around has made me notice something...everyone with a GPT error has used "parted" to partition their drives rather than "fdisk", and no one seems to use mkfs.ext2. I wonder if that's a RedHat-ism or an Ubuntu-ism...
Update #2: Installing the OS now, just wanted to note my partition layout (in case it's relevant).
Update #3: OS is installed. Adding kernel sources for 2.6.35 & kernel tools, too. (Have to have my VMware Tools working...)
Update #4: Ready to do the test. A little pre-show diagnostic info;
Code:
luser@lhost:~$ cat /etc/lsb-release ; uname -a ; sudo fdisk -l /dev/sd[a-z]|grep -i dev| grep .
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.10
DISTRIB_CODENAME=maverick
DISTRIB_DESCRIPTION="Ubuntu 10.10"
Linux lhost 2.6.35-22-generic #33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010 x86_64 GNU/Linux
Disk /dev/sda: 10.7 GB, 10737418240 bytes
Device Boot Start End Blocks Id System
/dev/sda1 1 125 999424 82 Linux swap / Solaris
/dev/sda2 * 125 187 499712 83 Linux
/dev/sda3 187 1306 8984576 83 Linux
luser@lhost:~$
Update #5: 8 x 8GB USB sticks attached to the host system... blowing away the first 5 MB of each stick to destroy any existing filesystems (damn, I hate U3 disks, craziest partitioning scheme I've ever seen).
Update #6: Only 6 of the 8 USB sticks want to play nice. On with the show!
They've all been PV'd, and VG "vgusb" has been created.
Code:
luser@lhost:~# sudo vgdisplay /dev/vgusb
--- Volume group ---
VG Name vgusb
System ID
Format lvm2
Metadata Areas 6
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 6
Act PV 6
VG Size 44.84 GiB
PE Size 4.00 MiB
Total PE 11480
Alloc PE / Size 0 / 0
Free PE / Size 11480 / 44.84 GiB
VG UUID zjo6Iu-n78i-ym4T-p73g-Ocf2-Y3CK-sLgKid
luser@lhost:~# sudo vgdisplay -v /dev/vgusb|grep "PV Name"
Using volume group(s) on command line
Finding volume group "vgusb"
PV Name /dev/sdb1
PV Name /dev/sdc1
PV Name /dev/sdd1
PV Name /dev/sde1
PV Name /dev/sdf1
PV Name /dev/sdg1
luser@lhost:~#
I've also created "lvusb", one big fat logical volume (without any sort of striping or mirroring going on).
(I did enable "mount on reboot", too.)
You are far less lazy than me in doing all of that.
I was using GPT (hence the title of this thread), but I moved to msdos labels on all of these drives and I have not been able to replicate the problem. So how GPT stores its tables is part of the issue.
I could create a VM which matches the config of the physical machine and give it 8 or so virtual drives to try to emulate the problem on virtual hardware, which would (mostly) confirm whether the kernel itself is causing the problem.
Other random ideas I had:
The controller itself could be overwriting the table of a drive when it initializes. I have not seen any information to support this, but it seems the most likely. The strange part is that it is not always the same port which has the problem; maybe whichever drive is found first during the scan is the one that gets corrupted?
The kernel module in use (sata_mv) might not be playing nicely. It could be failing to write the information properly in the first place, which I could discover by looking at the raw data at the beginning of the disk... after researching what it should look like. Given that I can see the table after I create it, before a reboot, and that other drives are fine, I doubt this is the problem.
The kernel itself could be doing the above.
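One way to test the "not written properly in the first place" theory without memorizing the on-disk format: snapshot the raw header sectors right after partitioning, then diff them against a second snapshot taken after the reboot. A sketch on a scratch image with a simulated change (for a real PV, substitute /dev/sdX for the image file):

```shell
# Dump the first 34 sectors (protective MBR + GPT header + partition entries)
# before and after, then diff the hex dumps to see exactly which bytes changed.
img=$(mktemp)
truncate -s 4M "$img"
printf 'EFI PART' | dd of="$img" bs=512 seek=1 conv=notrunc 2>/dev/null
dd if="$img" bs=512 count=34 2>/dev/null | od -A d -t x1 > /tmp/gpt.before
# ...the reboot would go here; simulate the corruption instead:
printf 'XXXXXXXX' | dd of="$img" bs=512 seek=1 conv=notrunc 2>/dev/null
dd if="$img" bs=512 count=34 2>/dev/null | od -A d -t x1 > /tmp/gpt.after
diff /tmp/gpt.before /tmp/gpt.after || true   # nonzero exit just means "they differ"
rm -f "$img" /tmp/gpt.before /tmp/gpt.after
```

If the "before" dump is already wrong, the table was never written correctly; if "before" is good and "after" is garbage, something is clobbering it across the reboot.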
Within a year I will have to consider redesigning my storage layout of both my servers, at which point I may have to ditch this controller and go with a much higher-end one.
I've been able to replicate the problem by using GPT EFI partitioning on the USB sticks, and *not* dd'ing the head and tail of each device!
You'll need to nuke the drives as I mentioned in my previous post, then use "fdisk" to create the partitions (I have a newfound hatred for parted).