Help: Detected aborted journal / Remounting filesystem read-only

otheus · 11-28-2011, 11:38 AM

We recently encountered a fatal condition:

Code:

kernel: cciss: cmd ffff810037e16b30 has CHECK CONDITION sense key = 0x3
Buffer I/O error on device cciss/c0d1p1, logical block 204890595
lost page write due to I/O error on cciss/c0d1p1
...
kernel: Aborting journal on device cciss/c0d1p1
...
kernel: EXT3-fs error (device cciss/c0d1p1): ext3_journal_start_sb: Detected aborted journal
kernel: Remounting filesystem read-only

That last one is the killer. Is there any way to prevent the kernel from remounting the filesystem read-only?

Later, I ran badblocks on it (fsck -cc) and got about a dozen messages like this:

Code:

badblocks: Input/output error during test data write, block 189444672

I ran smartctl on it and got some additional info:

Code:

#  /usr/sbin/smartctl -d cciss,0 -a /dev/cciss/c0d1
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: HP       EG0146FARTR      Version: HPD5
Serial number: D0A1P9B06G200945
Device type: disk
Transport protocol: SAS
Local Time is: Mon Nov 28 19:08:30 2011 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     32 C
Drive Trip Temperature:        65 C
Manufactured in week 45 of year 2009
Recommended maximum start stop count:  50000 times
Current start stop count:      17 times
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         0         1          0     139921.210           0
write:         0        0         0         0          0       1855.147           0

Non-medium error count:       64
No self-tests have been logged
Long (extended) Self Test duration: 1722 seconds [28.7 minutes]

A similar problem was discussed on this forum in http://www.linuxquestions.org/questi...216/page2.html, but in my case bad blocks were also detected at both the logical and the physical level, so simply dropping the journal and recreating it won't really fix the problem. Or would it? Perhaps the SCSI layer remapped the sectors the first time, leaving holes in the journal that caused it to abort. Then the second time, the journal was still in the same state and it aborted again. But if that is the case, why did a subsequent badblocks (fsck -cc) detect bad blocks?

Further, I'm in the luxurious situation of being able to wipe the entire drive. Is there a way to do it without running badblocks again and yet guarantee the previously found bad blocks have been fixed?

clvic · 11-30-2011, 07:32 AM

Quote:

But if that is the case, why did a subsequent badblocks (fsck -cc) detect bad blocks?

If I read this correctly, the problem is that the drive has no spare blocks left in its reserve to substitute for the bad ones.

Quote:

Is there a way to do it without running badblocks again and yet guarantee the previously found bad blocks have been fixed?

If you saved the list of bad blocks to a text file, you can pass that file to mkfs.ext3 with the "-l" option, so that the new filesystem marks those blocks as unusable.
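
A minimal sketch of that approach, assuming the list was (or will be) saved with the "-o" option of badblocks. The list path and the 4096-byte block size are placeholders, not values from this thread; only the device name comes from the log above. The block size given to badblocks has to match the one given to mkfs.ext3, otherwise the block numbers in the list won't line up. The script only prints the commands, since the scan is slow and mkfs.ext3 destroys the existing data.

```shell
#!/bin/sh
# Sketch: reuse a saved bad-block list when recreating an ext3 filesystem.
# LIST and BLOCK_SIZE are assumed placeholder values; adjust before use.
DEV="${DEV:-/dev/cciss/c0d1p1}"
LIST="${LIST:-/root/c0d1p1.badblocks}"
BLOCK_SIZE="${BLOCK_SIZE:-4096}"

# Print the two commands instead of running them (dry run).
plan_mkfs_with_badblocks() {
    # Step 1, only needed if no list was saved earlier:
    # non-destructive read-write scan (-n), with progress (-s),
    # writing the bad block numbers to $LIST (-o).
    echo "badblocks -n -s -b $BLOCK_SIZE -o $LIST $DEV"
    # Step 2: create the filesystem with the same block size and
    # have it read the bad block list from $LIST (-l).
    echo "mkfs.ext3 -b $BLOCK_SIZE -l $LIST $DEV"
}

plan_mkfs_with_badblocks
```

Run it as-is to review the commands, then execute them by hand on the unmounted partition. This keeps the known bad blocks out of use by the filesystem; it does not repair them on the disk.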