LinuxQuestions.org
Red Hat This forum is for the discussion of Red Hat Linux.

02-11-2014, 04:45 PM   #1
orev
Intermittent read-only filesystem and ata errors


Running RHEL 6.5 x86.

I have been seeing this problem every so often, sometimes a few weeks apart, sometimes a few hours apart. The console shows an error on ata1, and the filesystem goes into read-only mode. I can log in but cannot run any other commands; trying to do so results in an I/O error. I have tried generating I/O load, but that does not seem to trigger the problem. Otherwise this system is mostly idle.
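
For anyone who wants to catch the read-only remount as it happens, here is a minimal sketch that checks the mount options in /proc/mounts. The function name `mount_is_readonly` is made up for illustration. Once the disk has dropped off the bus, commands that are not already cached may fail to run at all, so treat this as a monitoring aid rather than a fix.

```shell
#!/bin/sh
# Report whether the filesystem at a given mount point is mounted read-only.
# Reads /proc/mounts-style text on stdin: device, mount point, type, options.
# Exit status: 0 = read-only, 1 = read-write, 2 = mount point not found.
mount_is_readonly() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }            # keep the last matching entry
        END {
            if (opts == "") exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }'
}

# Check the root filesystem on a Linux system.
if [ -r /proc/mounts ]; then
    if mount_is_readonly / < /proc/mounts; then
        echo "root filesystem is read-only"
    else
        echo "root filesystem is not read-only"
    fi
fi
```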

I have performed tests, such as:
- Run disk manufacturer diagnostics
- Run spinrite
- Examine SMART data

I'm trying to determine if it's the drive or controller, so I can figure out which one to replace.

Here are my logs from dmesg and smartctl:

Code:
dmesg:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE DMA
ata1.00: cmd ca/00:08:00:b0:0f/00:00:00:00:00/e0 tag 0 dma 4096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: link is slow to respond, please be patient (ready=0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: SRST failed (errno=-16)
ata1: soft resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: SRST failed (errno=-16)
ata1: soft resetting link
ata1: link is slow to respond, please be patient (ready=0)
ata1: SRST failed (errno=-16)
ata1: soft resetting link
ata1: SRST failed (errno=-16)
ata1: reset failed, giving up
ata1.00: disabled
ata1.00: device reported invalid CHS sector 0
ata1: EH complete
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 00 0f b0 00 00 00 08 00
Buffer I/O error on device dm-0, logical block 0
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 00 0f b0 10 00 00 08 00
Buffer I/O error on device dm-0, logical block 2
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 02 0f b0 08 00 00 08 00
Buffer I/O error on device dm-0, logical block 4194305
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 02 0f b0 80 00 00 08 00
Buffer I/O error on device dm-0, logical block 4194320
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 02 0f b1 18 00 00 10 00
Buffer I/O error on device dm-0, logical block 4194339
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 4194340
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 02 0f b1 50 00 00 08 00
Buffer I/O error on device dm-0, logical block 4194346
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 02 0f b2 b8 00 00 08 00
Buffer I/O error on device dm-0, logical block 4194391
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 02 10 b9 28 00 00 08 00
Buffer I/O error on device dm-0, logical block 4202789
lost page write due to I/O error on dm-0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 03 16 2d d0 00 00 08 00
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Write(10): 2a 00 03 16 2d d8 00 00 08 00
Aborting journal on device dm-0-8.
end_request: I/O error, dev sda, sector 51621888
JBD2: I/O error detected when updating journal superblock for dm-0-8.
end_request: I/O error, dev sda, sector 93610960
JBD2: Detected IO errors while flushing file data on dm-0-8
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only
__ratelimit: 2 callbacks suppressed
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Read(10): 28 00 02 10 b9 28 00 00 08 00
__ratelimit: 2 callbacks suppressed
EXT4-fs error (device dm-0): ext4_find_entry: reading directory #1049471 offset 0
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Read(10): 28 00 00 57 22 e0 00 00 40 00
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Read(10): 28 00 00 57 23 18 00 00 08 00
sd 1:0:0:0: [sda] Unhandled error code
sd 1:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 1:0:0:0: [sda] CDB: Read(10): 28 00 00 57 23 18 00 00 08 00
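
A note on reading the "CDB:" lines above: in a 10-byte Read(10)/Write(10) command, bytes 2-5 are the big-endian starting LBA and bytes 7-8 are the number of blocks. The sketch below decodes one such line; `decode_cdb10` is a hypothetical helper name, not part of any tool. For the first failed write it gives 8 blocks, and 8 blocks of 512 bytes is 4096 bytes, which matches the "dma 4096 out" on the ata1.00 line.

```shell
#!/bin/sh
# Decode the opcode, LBA and block count from a 10-byte SCSI CDB
# as printed by the sd driver (Read(10) = 28, Write(10) = 2a).
decode_cdb10() (
    # Split the space-separated hex bytes into positional parameters.
    set -- $1
    [ "$#" -eq 10 ] || { echo "expected 10 CDB bytes" >&2; exit 1; }
    lba=$(( 0x$3$4$5$6 ))      # bytes 2-5: starting LBA, big-endian
    blocks=$(( 0x$8$9 ))       # bytes 7-8: transfer length in blocks
    echo "opcode=0x$1 lba=$lba blocks=$blocks"
)

# First failed write from the dmesg output above:
decode_cdb10 "2a 00 00 0f b0 00 00 00 08 00"
```

The LBA here refers to sda itself; the "logical block" numbers reported for dm-0 are offsets within the device-mapper volume and are not expected to match.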


Code:
smartctl:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Scorpio Blue Serial ATA
Device Model:     WDC WD1600BEVT-75ZCT2
Serial Number:    WD-WXAXXXXXXXXX
LU WWN Device Id: 5 0014ee 0ac759719
Firmware Version: 11.01A11
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Feb 11 14:32:17 2014 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 5160) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  64) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   160   159   021    Pre-fail  Always       -       991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       567
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2688
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       475
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       338
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       303000
194 Temperature_Celsius     0x0022   108   091   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
240 Head_Flying_Hours       0x0032   099   099   000    Old_age   Always       -       758
241 Total_LBAs_Written      0x0032   200   200   000    Old_age   Always       -       2879985410
242 Total_LBAs_Read         0x0032   200   200   000    Old_age   Always       -       3061568892
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2505         -
# 2  Short offline       Completed without error       00%      2504         -
# 3  Short offline       Completed without error       00%      2503         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
02-11-2014, 04:53 PM   #2
Ser Olmy
Looks like a failed write operation is causing the drive to hang, which in turn triggers a bus reset. This could indeed be a defective controller/motherboard, a drive going bad, or even the SATA cables/connectors. But statistically speaking, a bad drive is the most likely explanation by far.

And I think I may know why the drive may be struggling. A Load Cycle Count of 303000?!? Holy defective power management, Batman!
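
To put that number in perspective, here is a rough sketch that divides Load_Cycle_Count by Power_On_Hours from `smartctl -A` output. It assumes the usual attribute table layout, where column 2 is the attribute name and column 10 is the raw value. The function name `load_cycles_per_hour` is only illustrative.

```shell
#!/bin/sh
# Compute load cycles per power-on hour from `smartctl -A` output on stdin.
load_cycles_per_hour() {
    awk '
        $2 == "Load_Cycle_Count" { lcc = $10 }
        $2 == "Power_On_Hours"   { poh = $10 }
        END {
            if (poh + 0 == 0) exit 1      # avoid dividing by zero
            printf "%.1f\n", lcc / poh
        }'
}

# Values taken from the smartctl listing in the first post:
printf '%s\n' \
  '  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2688' \
  '193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       303000' |
  load_cycles_per_hour
```

With the posted values this comes to roughly 113 load cycles per power-on hour, or close to two head parks per minute. On a live system the same function can be fed with `smartctl -A /dev/sda | load_cycles_per_hour`.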
 
02-11-2014, 05:26 PM   #3
metaschima
Yeah, I would change the SATA cable first.

The HDD is fine, it is NOT failing.

As for the controller, I can't be sure if it is a hardware or software issue. Can you check which driver is being used? '/sbin/lspci' identifies the controller, and '/sbin/lspci -k' also lists the kernel driver in use. Also, how old is this computer, and what kernel version are you running?
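
As a sketch of the driver check: plain `lspci` output names the controller but not the driver, while `lspci -k` adds a "Kernel driver in use:" line per device. The filter below keeps only storage controllers and their driver lines. `storage_drivers` is a made-up helper name, and the class names it matches (IDE interface, SATA controller, RAID bus controller) are assumed to cover the common cases.

```shell
#!/bin/sh
# Print storage controllers and the kernel driver bound to each,
# reading `lspci -k` output on stdin.
storage_drivers() {
    awk '
        # Device lines start with a hex bus address; detail lines are indented.
        /^[0-9a-f]/ { show = ($0 ~ /IDE interface|SATA controller|RAID bus controller/) }
        show && (/^[0-9a-f]/ || /Kernel driver in use/) { print }
    '
}

# Run against the local machine if lspci is available.
if [ -x /sbin/lspci ]; then
    /sbin/lspci -k | storage_drivers
fi
```

On an ICH7-M controller in IDE mode the driver reported is typically ata_piix, but check the actual output rather than relying on that.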
 
02-11-2014, 05:39 PM   #4
orev (Original Poster)
It's a Dell Mini 10v netbook from 2009, so there are no SATA cables and the controller is on the mainboard. Yeah, a bit old, but I'm hoping to make it into a nice low-power system that can sit in the closet.

Kernel is 2.6.32-431.3.1.el6.i686

Here's the lspci output:

Code:
lspci:
00:00.0 Host bridge: Intel Corporation Mobile 945GSE Express Memory Controller Hub (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GSE Express Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
00:1b.0 Audio device: Intel Corporation NM10/ICH7 Family High Definition Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 2 (rev 02)
00:1c.2 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 3 (rev 02)
00:1d.0 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #3 (rev 02)
00:1d.3 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #4 (rev 02)
00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e2)
00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface Bridge (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7-M Family) SATA Controller [IDE mode] (rev 02)
00:1f.3 SMBus: Intel Corporation NM10/ICH7 Family SMBus Controller (rev 02)
03:00.0 Network controller: Broadcom Corporation BCM4312 802.11b/g LP-PHY (rev 01)
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 02)
Also, the load cycle count has increased by 248 since I posted this, so maybe something I need to look into there too.
 
02-11-2014, 05:52 PM   #5
orev (Original Poster)
Running:
Code:
hdparm -B 254 /dev/sda
hdparm -S 0 /dev/sda
seems to have resolved the load_cycle issue.
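
For reference, `hdparm -B 254` sets Advanced Power Management to its highest-performance level while leaving APM enabled (255 would disable it on drives that allow that), and `hdparm -S 0` disables the standby (spindown) timeout. These settings may not survive a power cycle, so they may need to be re-applied at boot. The sketch below shows one way to confirm the effect. The helper names are illustrative, and the parsing assumes `hdparm -B /dev/sda` prints a line of the form " APM_level = 254".

```shell
#!/bin/sh
# Extract the APM level from `hdparm -B <device>` query output on stdin.
apm_level() { awk -F'= *' '/APM_level/ { print $2 }'; }

# Extract the raw Load_Cycle_Count from `smartctl -A` output on stdin.
lcc_raw() { awk '$2 == "Load_Cycle_Count" { print $10 }'; }

# Usage (as root), assuming the disk is /dev/sda:
#   hdparm -B /dev/sda | apm_level
#   before=$(smartctl -A /dev/sda | lcc_raw); sleep 600
#   after=$(smartctl -A /dev/sda | lcc_raw)
#   echo "load cycles in 10 minutes: $((after - before))"
```

If the count still climbs by more than a handful over ten idle minutes, the setting probably did not take effect or has been reset.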
 
02-11-2014, 06:07 PM   #6
metaschima
Ok, well on a laptop your options are limited. It's a good thing that you turned off the aggressive head parking on the HDD, as the constant load cycling wears the drive out.

One thing you could try is putting the SATA controller in AHCI mode in the BIOS, if it supports it. First make sure the kernel has the 'ahci' module built-in or in the initrd. This would eliminate possible bugs in the IDE controller driver. There's no guarantee that this will fix it, though.
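
A minimal sketch of that check, assuming the distribution installs the kernel config as /boot/config-<version> (RHEL does). It reads the CONFIG_SATA_AHCI option and reports whether AHCI support is built in, a module, or absent. `ahci_support` is a hypothetical helper name.

```shell
#!/bin/sh
# Classify AHCI support from a kernel config file read on stdin.
ahci_support() {
    awk -F= '
        $1 == "CONFIG_SATA_AHCI" { state = $2 }
        END {
            if (state == "y")      print "built-in"
            else if (state == "m") print "module"
            else                   print "absent"
        }'
}

# Check the running kernel if its config file is readable.
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    ahci_support < "$cfg"
fi

# If the result is "module", also confirm the module is in the initramfs,
# for example on a dracut-based system (path is an assumption):
#   lsinitrd "/boot/initramfs-$(uname -r).img" | grep ahci
```

If the module is missing from the initramfs, switching the BIOS to AHCI mode can leave the system unable to find its root disk, so rebuild the initramfs before changing the setting.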
 
02-23-2014, 09:33 PM   #7
orev (Original Poster)
I wanted to post an update to this. My original problem was happening sporadically, once every week or two, then would seem to switch into a mode where it happened multiple times per day, then go back to once every week or two. Since I made the hdparm changes, the problem has not happened again, so the changes have either resolved it or pushed it back into the "sporadic" mode.

Hopefully this post will help anyone else with this problem (if it's resolved), or anger the tech gods so much that they will make it happen again, in which case I know it's not fixed.
 
03-03-2014, 07:01 PM   #8
byau
This has happened to us on occasion on different Linux versions. It always seems to be firmware- and driver-related from the vendor hardware point of view.

Our solution, which seems to work so far, is:

1) Update the vendor firmware.

If that doesn't fix it:

2) Overwrite the generic Linux RAID card driver with the vendor-specific driver (by default the install may keep the generic Linux driver, so you may need to force the issue to say "hey, I want to use the vendor driver").

So far that has always fixed the issue whenever we've had our Linux OS go read-only.
 
  


All times are GMT -5.