We recently encountered a fatal condition:
Code:
kernel: cciss: cmd ffff810037e16b30 has CHECK CONDITION sense key = 0x3
Buffer I/O error on device cciss/c0d1p1, logical block 204890595
lost page write due to I/O error on cciss/c0d1p1
...
kernel: Aborting journal on device cciss/c0d1p1
...
kernel: EXT3-fs error (device cciss/c0d1p1): ext3_journal_start_sb: Detected aborted journal
kernel: Remounting filesystem read-only
That last one is the killer. Is there any way to prevent the kernel from remounting the filesystem read-only?
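For context: as far as I understand it, ext3's reaction to errors is normally controlled by the errors= mount option (continue, remount-ro or panic) and by the default stored in the superblock, which tune2fs can display and change. Whether that setting still has any effect once the journal itself has been aborted is exactly what I'm unsure about. A minimal sketch for checking the current setting, assuming the device name from the log above; errors_behavior is just a helper name made up for this post:

```shell
#!/bin/sh
# Hypothetical helper: read `tune2fs -l` output on stdin and print only
# the value of the "Errors behavior" field (e.g. Continue, Remount read-only).
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Usage (needs root; device name assumed from the kernel log above):
#   tune2fs -l /dev/cciss/c0d1p1 | errors_behavior
#
# Changing the default stored in the superblock
# (accepted values: continue, remount-ro, panic):
#   tune2fs -e continue /dev/cciss/c0d1p1
```

The same behaviour can also be requested per mount with the errors= option, for example in /etc/fstab.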
Later, I ran badblocks on it (via fsck -cc) and got about a dozen messages like this:
Code:
badblocks: Input/output error during test data write, block 189444672
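To keep track of exactly which blocks were reported, messages of this form can be reduced to a plain list of block numbers. A rough sketch, assuming the messages were captured to a file (badblocks.log is a hypothetical name, and failed_blocks is a helper made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: read badblocks error messages on stdin and print the
# block numbers they mention, sorted numerically with duplicates removed.
# Lines that do not look like "badblocks: ... block <number>" are ignored.
failed_blocks() {
    sed -n 's/^badblocks: .* block \([0-9][0-9]*\)$/\1/p' | sort -n -u
}

# Usage (file name is an assumption):
#   failed_blocks < badblocks.log
```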
I ran smartctl on it and got some additional info:
Code:
# /usr/sbin/smartctl -d cciss,0 -a /dev/cciss/c0d1
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: HP EG0146FARTR Version: HPD5
Serial number: D0A1P9B06G200945
Device type: disk
Transport protocol: SAS
Local Time is: Mon Nov 28 19:08:30 2011 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature: 32 C
Drive Trip Temperature: 65 C
Manufactured in week 45 of year 2009
Recommended maximum start stop count: 50000 times
Current start stop count: 17 times
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         1          0     139921.210           0
write:         0        0         0         0          0       1855.147           0
Non-medium error count: 64
No self-tests have been logged
Long (extended) Self Test duration: 1722 seconds [28.7 minutes]
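The fields in that report that seem most relevant here are the health status, the grown defect list, the uncorrected error totals and the non-medium error count. A small sketch for pulling them out of a saved report, assuming the SCSI/SAS output layout shown above (smart_summary is a made-up helper name, and ATA drives print a different report that this will not match):

```shell
#!/bin/sh
# Hypothetical helper: read a SCSI/SAS `smartctl -a` report on stdin and
# print a few key fields as name=value pairs, in the order they appear.
smart_summary() {
    awk '
        /^SMART Health Status:/           { print "health=" $4 }
        /^Elements in grown defect list:/ { print "grown_defects=" $NF }
        /^Non-medium error count:/        { print "non_medium_errors=" $NF }
        # Last column of the read:/write: rows is "Total uncorrected errors".
        /^(read|write):/                  { sub(":", "", $1); print $1 "_uncorrected=" $NF }
    '
}

# Usage (same invocation as above, needs root):
#   /usr/sbin/smartctl -d cciss,0 -a /dev/cciss/c0d1 | smart_summary
```

For the report above, this shows a healthy status, an empty grown defect list and no uncorrected read or write errors, despite the write errors badblocks reported.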
A similar problem was discussed on this forum in
http://www.linuxquestions.org/questi...216/page2.html but here I have the added information that bad logical and physical blocks were detected, so simply dropping the journal and recreating it probably won't fix the problem. Or would it? Perhaps the SCSI subsystem remapped the sectors the first time, leaving holes in the journal that caused it to abort. Then the second time, the journal was still in the same state, so it aborted again. But if that is the case, why did a subsequent badblocks run (fsck -cc) still detect bad blocks?
Further, I'm in the luxurious position of being able to wipe the entire drive. Is there a way to do that without running badblocks again, and still guarantee that the previously found bad blocks have been fixed?