Ubuntu: This forum is for the discussion of Ubuntu Linux.
I have a server with three physical hard drives. For the past couple of days, /dev/sdc seems to get renamed to /dev/sdd while the server is running. Since this drive houses /home and swap, my home directories then give input/output errors. Most of the time the issue resolves itself with a reboot. However, sometimes the drive fails to come up properly, possibly due to a disk error.
A possible cause I came up with is that the drives are set in the BIOS to spin down after 15 minutes of idle time and might not come back up properly. According to the logs something strange happens in the morning:
This is probably when it notices the /home directory is missing when it is attempting to write a backup to it (called by cron.daily).
The line "ata2.00: model number mismatch 'SAMSUNG SP1614N' != 'QAMSUNE QP1614L'" is rather peculiar.
[44771.732352] ata2.00: model number mismatch 'SAMSUNG SP1614N' != 'QAMSUNE QP1614L'
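The number in square brackets is the kernel's timestamp in seconds since boot, so it can be lined up against the time cron.daily runs. A minimal sketch in plain POSIX shell; `to_hms` is a made-up helper name, not a standard tool:

```shell
# Convert a dmesg timestamp (seconds since boot, fraction allowed)
# into hours:minutes:seconds of uptime.
to_hms() {
    secs=${1%%.*}                      # drop the fractional part
    printf '%02d:%02d:%02d\n' \
        $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

to_hms 44771.732352    # prints 12:26:11
```

Adding that uptime to the boot time (for example as reported by `who -b`) gives the approximate wall-clock time of the message.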
You're right! That is quite odd. I wonder where it's getting that strange name from?
It isn't often these days that Google is stumped: the only result for that whole line was this very thread. Searching only for "ata2.00: model number mismatch" produces a fair number of hits, but in the few minutes I spent reading through some of them I found no solution. Many of the threads did mention Debian or Ubuntu though, so *possibly* a kernel/driver issue in these kernels, but it's too soon to say.
My first feeling is that this is a firmware bug, either in the drive's firmware or in the machine's BIOS; more likely the drive's firmware, I think. For the record, though, it is very interesting to compare the correct spelling of the make & model with the wrong spelling:
Code:
SAMSUNG SP1614N
QAMSUNE QP1614L
In the misspelled version, the first and last letter of each "word" is exactly two letters earlier in the alphabet than it should be (S became Q, G became E, N became L). I don't know enough about kernel code and how this data is read from the drive, but is it possible that a kernel/driver coding error (bug) could produce the shifted values seen here? There's a clear pattern to the spelling errors, and to me that is more indicative of a bug in the code that reads this info from the drive.
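The pattern is easier to see at the bit level. Below is a small sketch in plain POSIX shell (`diff_bits` is just a name made up for this post) that XORs the character codes of the two strings wherever they differ:

```shell
# Compare two strings character by character and print the XOR
# of the character codes at every position where they differ.
good='SAMSUNG SP1614N'
bad='QAMSUNE QP1614L'

diff_bits() {
    a=$1 b=$2 i=0
    while [ -n "$a" ]; do
        ca=${a%"${a#?}"}   # first character of $a
        cb=${b%"${b#?}"}   # first character of $b
        if [ "$ca" != "$cb" ]; then
            x=$(( $(printf '%d' "'$ca") ^ $(printf '%d' "'$cb") ))
            printf 'pos %d: %s -> %s  xor=0x%02x\n' "$i" "$ca" "$cb" "$x"
        fi
        a=${a#?} b=${b#?} i=$((i + 1))
    done
}

diff_bits "$good" "$bad"
```

Every differing character is off by exactly 0x02, so one and the same bit is cleared each time, and only at even positions (0, 6, 8 and 14). Characters at odd positions that also have that bit set (the second S, the first N, the 6) come through intact. ATA identify strings are transferred as 16-bit words holding two characters each, so this would be consistent with a single bit of each word being misread somewhere between the drive and the driver. That is only a guess from the pattern; by itself it does not say whether the drive, the cable, the controller or software is at fault.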
Could you run a couple of commands and provide a little more information about the hardware?
-- What make & model of computer, or what make & model# of motherboard in it, and what BIOS version? If you don't know this, perhaps you have the `dmidecode` command shown below:
Code:
/usr/sbin/dmidecode
Yours may be in a different location - use the `which` command to locate your dmidecode. If you have it, the command will output a whack of info about your machine; primarily I'd be interested in the first three blocks of data, which give specifics about the BIOS and motherboard, something like this:
Code:
root@reactor: /usr/sbin/dmidecode
# dmidecode 2.10
SMBIOS 2.5 present.
54 structures occupying 1995 bytes.
Table at 0x000FB4F0.
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: V2.7
Release Date: 12/09/2008
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 512 kB
Characteristics:
<-- SNIP lots of stuff in here, please leave yours. -->
BIOS Revision: 8.13
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: MSI
Product Name: MS-7350
Version: 1.0
Serial Number: To Be Filled By O.E.M.
UUID: Not Present
Wake-up Type: Power Switch
SKU Number: To Be Filled By O.E.M.
Family: To Be Filled By O.E.M.
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: MSI
Product Name: MSI P6N SLI
Version: 1.0
Serial Number: To be filled by O.E.M.
Asset Tag: To Be Filled By O.E.M.
Features:
Board is a hosting board
Board is replaceable
Location In Chassis: To Be Filled By O.E.M.
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0
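If the tool is not found, that is often only because /usr/sbin is missing from a normal user's PATH. A hedged sketch for locating it; `find_tool` is a hypothetical helper, and the directory list is an assumption about common layouts:

```shell
# Locate a tool even when /usr/sbin or /sbin is not in PATH
# (common for non-root users). Prints the path, or nothing if not found.
find_tool() {
    command -v "$1" 2>/dev/null && return 0
    for dir in /usr/sbin /sbin /usr/local/sbin; do
        if [ -x "$dir/$1" ]; then
            printf '%s\n' "$dir/$1"
            return 0
        fi
    done
    return 1
}

# To print only the BIOS, system and base board blocks (needs root):
#   "$(find_tool dmidecode)" -t bios -t system -t baseboard
```

The `-t` keywords are taken from the dmidecode man page; check that your version accepts them before relying on this.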
So we have some info about the board now. Next let's learn a bit about the disk drive itself; if you have the SMART tools installed, we'll use that. On my system, they're in a package called "smartmontools-5.39", and the commands we want are:
Code:
smartctl -a /dev/sdc
OR
smartctl -a /dev/sdd
That will produce a load of data about the drive. Of primary interest for now will be the first block, called "START OF INFORMATION SECTION", which will have the make & model and some other data about the drive. Please paste that block of data.
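To cut that block out of the full output, something along these lines should do. It is a sketch that assumes the section markers look like the `=== START OF ... ===` lines smartctl normally prints; `info_section` is a made-up name:

```shell
# Print only the information section from `smartctl -a` output
# read on standard input.
info_section() {
    awk '/^=== START OF INFORMATION SECTION ===/ { on = 1; next }
         /^=== /                                 { on = 0 }
         on'
}

# Usage (as root):  smartctl -a /dev/sdc | info_section
```

Alternatively, `smartctl -i` prints the identity information without the rest of the report.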
Regardless of what information the above commands produce, I have no suggestion yet for how to work around this. Maybe the output will hold a clue or something more to search on, but as yet, no idea. Maybe a hardware/firmware bug, or maybe it *is* a kernel bug. More searching is needed, unless someone already has the answer & solution from a similar past experience.
You're very observant! I hadn't noticed that those letters are off exactly two places in the alphabet.
It's a fairly old system. Here's the output of /usr/sbin/dmidecode:
Code:
/usr/sbin/dmidecode
# dmidecode 2.9
SMBIOS 2.3 present.
49 structures occupying 1360 bytes.
Table at 0x000F3B20.
Handle 0x0000, DMI type 0, 20 bytes
BIOS Information
Vendor: Award Software, Inc.
Version: ASUS A7V8X ACPI BIOS Revision 1014
Release Date: 04/21/2004
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 512 kB
Characteristics:
PCI is supported
PNP is supported
APM is supported
BIOS is upgradeable
BIOS shadowing is allowed
ESCD support is available
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/360 KB floppy services are supported (int 13h)
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 KB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
CGA/mono video services are supported (int 10h)
ACPI is supported
USB legacy is supported
AGP is supported
Handle 0x0001, DMI type 1, 25 bytes
System Information
Manufacturer: System Manufacturer
Product Name: System Name
Version: System Version
Serial Number: SYS-1234567890
UUID: Not Settable
Wake-up Type: Power Switch
Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
Manufacturer: ASUSTeK Computer INC.
Product Name: A7V8X
Version: REV 1.xx
Serial Number: xxxxxxxxxxx
Handle 0x0003, DMI type 3, 17 bytes
Chassis Information
Manufacturer: Chassis Manufacture
Type: Tower
Lock: Not Present
Version: Chassis Version
Serial Number: Chassis Serial Number
Asset Tag: Asset-1234567890
Boot-up State: Safe
Power Supply State: Safe
Thermal State: Safe
Security Status: Unknown
OEM Information: 0x00000001
Handle 0x0004, DMI type 4, 32 bytes
Processor Information
Socket Designation: SOCKET A
Type: Central Processor
Family: Other
Manufacturer: AuthenticAMD
ID: 81 06 00 00 FF FB 83 03
Signature: Family 6, Model 8, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
MMX (MMX technology supported)
FXSR (Fast floating-point save and restore)
SSE (Streaming SIMD extensions)
Version: AMD Athlon(TM) XP 2400+
Voltage: 1.7 V
External Clock: 133 MHz
Max Speed: 2250 MHz
Current Speed: 2000 MHz
Status: Populated, Enabled
Upgrade: Other
L1 Cache Handle: 0x0009
L2 Cache Handle: 0x000A
L3 Cache Handle: Not Provided
Handle 0x0005, DMI type 5, 22 bytes
Memory Controller Information
Error Detecting Method: None
Error Correcting Capabilities:
Other
Supported Interleave: Unknown
Current Interleave: Unknown
Maximum Memory Module Size: 1024 MB
Maximum Total Memory Size: 3072 MB
Supported Speeds:
70 ns
60 ns
50 ns
Supported Memory Types:
ECC
DIMM
SDRAM
Memory Module Voltage: 3.3 V
Associated Memory Slots: 3
0x0006
0x0007
0x0008
Enabled Error Correcting Capabilities:
Unknown
Handle 0x0006, DMI type 6, 12 bytes
Memory Module Information
Socket Designation: DIMM 1
Bank Connections: 0 1
Current Speed: Unknown
Type: DIMM SDRAM
Installed Size: 512 MB (Double-bank Connection)
Enabled Size: 512 MB (Double-bank Connection)
Error Status: OK
Handle 0x0007, DMI type 6, 12 bytes
Memory Module Information
Socket Designation: DIMM 2
Bank Connections: 2 3
Current Speed: Unknown
Type: DIMM SDRAM
Installed Size: 512 MB (Double-bank Connection)
Enabled Size: 512 MB (Double-bank Connection)
Error Status: OK
Handle 0x0008, DMI type 6, 12 bytes
Memory Module Information
Socket Designation: DIMM 3
Bank Connections: 4 5
Current Speed: Unknown
Type: DIMM SDRAM
Installed Size: Not Installed
Enabled Size: Not Installed
Error Status: OK
Handle 0x0009, DMI type 7, 19 bytes
Cache Information
Socket Designation: L1 Cache
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Location: Internal
Installed Size: 128 KB
Maximum Size: 128 KB
Supported SRAM Types:
Pipeline Burst
Synchronous
Installed SRAM Type: Pipeline Burst Synchronous
Speed: Unknown
Error Correction Type: Unknown
System Type: Data
Associativity: 4-way Set-associative
Handle 0x000A, DMI type 7, 19 bytes
Cache Information
Socket Designation: L2 Cache
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Back
Location: Internal
Installed Size: 256 KB
Maximum Size: 8192 KB
Supported SRAM Types:
Pipeline Burst
Synchronous
Installed SRAM Type: Pipeline Burst Synchronous
Speed: Unknown
Error Correction Type: Unknown
System Type: Data
Associativity: 4-way Set-associative
Handle 0x000B, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: PRIMARY IDE/HDD
Internal Connector Type: On Board IDE
External Reference Designator: Not Specified
External Connector Type: None
Port Type: None
Handle 0x000C, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: SECONDARY IDE/HDD
Internal Connector Type: On Board IDE
External Reference Designator: Not Specified
External Connector Type: None
Port Type: None
Handle 0x000D, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: FLOPPY
Internal Connector Type: On Board Floppy
External Reference Designator: Not Specified
External Connector Type: None
Port Type: None
Handle 0x000E, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: USB1
External Connector Type: Access Bus (USB)
Port Type: USB
Handle 0x000F, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: USB2
External Connector Type: Access Bus (USB)
Port Type: USB
Handle 0x0010, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: USB3
External Connector Type: Access Bus (USB)
Port Type: USB
Handle 0x0011, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: USB4
External Connector Type: Access Bus (USB)
Port Type: USB
Handle 0x0012, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: USB5
External Connector Type: Access Bus (USB)
Port Type: USB
Handle 0x0013, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: USB6
External Connector Type: Access Bus (USB)
Port Type: USB
Handle 0x0014, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: PS/2 Keyboard
External Connector Type: PS/2
Port Type: Keyboard Port
Handle 0x0015, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: PS/2 Mouse
External Connector Type: PS/2
Port Type: Mouse Port
Handle 0x0016, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: Parallel Port
External Connector Type: DB-25 female
Port Type: Parallel Port ECP/EPP
Handle 0x0017, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: Serial Port 1
External Connector Type: DB-9 male
Port Type: Serial Port 16550 Compatible
Handle 0x0018, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: Serial Port 2
External Connector Type: DB-9 male
Port Type: Serial Port 16550 Compatible
Handle 0x0019, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: Joystick Port
External Connector Type: DB-15 female
Port Type: Joystick Port
Handle 0x001A, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: MIDI Port
External Connector Type: DB-15 female
Port Type: MIDI Port
Handle 0x001B, DMI type 8, 9 bytes
Port Connector Information
Internal Reference Designator: Not Specified
Internal Connector Type: None
External Reference Designator: Line In Jack
External Connector Type: Mini Jack (headphones)
Port Type: Audio Port
Handle 0x001C, DMI type 9, 13 bytes
System Slot Information
Designation: PCI 1
Type: 32-bit PCI
Current Usage: Available
Length: Short
ID: 1
Characteristics:
5.0 V is provided
3.3 V is provided
PME signal is supported
Handle 0x001D, DMI type 9, 13 bytes
System Slot Information
Designation: PCI 2
Type: 32-bit PCI
Current Usage: Available
Length: Short
ID: 2
Characteristics:
5.0 V is provided
3.3 V is provided
PME signal is supported
Handle 0x001E, DMI type 9, 13 bytes
System Slot Information
Designation: PCI 3
Type: 32-bit PCI
Current Usage: Available
Length: Short
ID: 3
Characteristics:
5.0 V is provided
3.3 V is provided
PME signal is supported
Handle 0x001F, DMI type 9, 13 bytes
System Slot Information
Designation: PCI 4
Type: 32-bit PCI
Current Usage: Available
Length: Short
ID: 4
Characteristics:
5.0 V is provided
3.3 V is provided
PME signal is supported
Handle 0x0020, DMI type 9, 13 bytes
System Slot Information
Designation: PCI 5
Type: 32-bit PCI
Current Usage: Available
Length: Short
ID: 5
Characteristics:
5.0 V is provided
3.3 V is provided
PME signal is supported
Handle 0x0021, DMI type 9, 13 bytes
System Slot Information
Designation: PCI 6
Type: 32-bit PCI
Current Usage: Available
Length: Short
ID: 6
Characteristics:
5.0 V is provided
3.3 V is provided
PME signal is supported
Handle 0x0022, DMI type 9, 13 bytes
System Slot Information
Designation: AGP
Type: 32-bit AGP 8x
Current Usage: In Use
Length: Short
ID: 7
Characteristics:
3.3 V is provided
Handle 0x0023, DMI type 11, 5 bytes
OEM Strings
String 1: 0
String 2: 0
Handle 0x0024, DMI type 13, 22 bytes
BIOS Language Information
Installable Languages: 1
en|US|iso8859-1
Currently Installed Language: en|US|iso8859-1
Handle 0x0025, DMI type 14, 14 bytes
Group Associations
Name: Cpu Module
Items: 3
0x0004 (Processor)
0x0009 (Cache)
0x000A (Cache)
Handle 0x0026, DMI type 14, 29 bytes
Group Associations
Name: Memory Module Set
Items: 8
0x0027 (Physical Memory Array)
0x0028 (Memory Device)
0x002C (Memory Device Mapped Address)
0x0029 (Memory Device)
0x002D (Memory Device Mapped Address)
0x002A (Memory Device)
0x002E (Memory Device Mapped Address)
0x002B (Memory Array Mapped Address)
Handle 0x0027, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 3 GB
Error Information Handle: Not Provided
Number Of Devices: 3
Handle 0x0028, DMI type 17, 23 bytes
Memory Device
Array Handle: 0x0027
Error Information Handle: No Error
Total Width: 64 bits
Data Width: 64 bits
Size: 512 MB
Form Factor: DIMM
Set: 1
Locator: DDR 1
Bank Locator: Not Specified
Type: DRAM
Type Detail: Synchronous
Speed: Unknown
Handle 0x0029, DMI type 17, 23 bytes
Memory Device
Array Handle: 0x0027
Error Information Handle: No Error
Total Width: 64 bits
Data Width: 64 bits
Size: 512 MB
Form Factor: DIMM
Set: 2
Locator: DDR 2
Bank Locator: Not Specified
Type: DRAM
Type Detail: Synchronous
Speed: Unknown
Handle 0x002A, DMI type 17, 23 bytes
Memory Device
Array Handle: 0x0027
Error Information Handle: No Error
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: DIMM
Set: 3
Locator: DDR 3
Bank Locator: Not Specified
Type: DRAM
Type Detail: Synchronous
Speed: Unknown
Handle 0x002B, DMI type 19, 15 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x0003FFFFFFF
Range Size: 1 GB
Physical Array Handle: 0x0027
Partition Width: 0
Handle 0x002C, DMI type 20, 19 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x0001FFFFFFF
Range Size: 512 MB
Physical Device Handle: 0x0028
Memory Array Mapped Address Handle: 0x002B
Partition Row Position: 1
Handle 0x002D, DMI type 20, 19 bytes
Memory Device Mapped Address
Starting Address: 0x00020000000
Ending Address: 0x0003FFFFFFF
Range Size: 512 MB
Physical Device Handle: 0x0029
Memory Array Mapped Address Handle: 0x002B
Partition Row Position: 2
Handle 0x002E, DMI type 126, 19 bytes
Inactive
Handle 0x002F, DMI type 32, 11 bytes
System Boot Information
Status: No errors detected
Handle 0x0030, DMI type 127, 4 bytes
End Of Table
The smart data for /dev/sdc:
Code:
smartctl -a /dev/sdc
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG SP1614N
Serial Number: S016J10X612210
Firmware Version: TM100-24
User Capacity: 160,041,885,696 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Thu Sep 30 13:28:07 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (5760) seconds.
Offline data collection
capabilities: (0x1b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 96) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 2
3 Spin_Up_Time 0x0007 065 054 000 Pre-fail Always - 6016
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 827
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 253 253 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0024 253 253 000 Old_age Offline - 0
9 Power_On_Half_Minutes 0x0032 098 098 000 Old_age Always - 10342h+39m
10 Spin_Retry_Count 0x0013 253 253 049 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 157
194 Temperature_Celsius 0x0022 199 100 000 Old_age Always - 13
195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 252397751
196 Reallocated_Event_Count 0x0012 253 253 000 Old_age Always - 0
197 Current_Pending_Sector 0x0033 253 253 010 Pre-fail Always - 0
198 Offline_Uncorrectable 0x0031 253 253 010 Pre-fail Offline - 0
199 UDMA_CRC_Error_Count 0x000b 100 100 051 Pre-fail Always - 18
200 Multi_Zone_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
201 Soft_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
SMART Error Log Version: 1
ATA Error Count: 2297 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2297 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 fe 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 05 fe 00 00 00 40 00 05:56:36.375 SET FEATURES [Enable APM]
ca 00 10 08 09 80 ef 00 05:56:36.375 WRITE DMA
ca 00 08 00 08 84 e8 00 05:56:36.375 WRITE DMA
ca 00 08 48 09 81 e1 00 05:56:36.375 WRITE DMA
ca 00 08 a0 09 80 e1 00 05:56:36.375 WRITE DMA
Error 2296 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 fe 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 05 fe 00 00 00 40 00 05:46:47.375 SET FEATURES [Enable APM]
c4 00 08 02 04 00 a1 00 05:46:47.250 READ MULTIPLE
c4 00 08 21 00 00 a0 00 05:46:47.250 READ MULTIPLE
c4 00 08 09 00 00 a8 00 05:46:47.250 READ MULTIPLE
c4 00 08 1d 00 00 ac 00 05:46:47.250 READ MULTIPLE
Error 2295 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 fe 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ef 05 fe 00 00 00 40 00 04:22:13.125 SET FEATURES [Enable APM]
c4 00 08 02 04 00 a1 00 04:22:13.063 READ MULTIPLE
c4 00 08 21 00 00 a0 00 04:22:13.063 READ MULTIPLE
c4 00 08 09 00 00 a8 00 04:22:13.063 READ MULTIPLE
c4 00 08 1d 00 00 ac 00 04:22:13.063 READ MULTIPLE
Error 2294 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 08 01 00 00 a0 Error: ICRC, ABRT 8 sectors at LBA = 0x00000001 = 1
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 01 00 00 a0 00 04:22:12.688 READ DMA
ec 00 00 00 00 00 a0 00 04:22:12.688 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 04:22:12.688 SET FEATURES [Set transfer mode]
ec 00 00 00 00 00 a0 00 04:22:12.688 IDENTIFY DEVICE
ec 00 00 00 00 00 a0 00 04:22:12.688 IDENTIFY DEVICE
Error 2293 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 08 01 00 00 a0 Error: ICRC, ABRT 8 sectors at LBA = 0x00000001 = 1
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 01 00 00 a0 00 04:22:12.438 READ DMA
ec 00 00 00 00 00 a0 00 04:22:12.375 IDENTIFY DEVICE
ef 03 44 00 00 00 a0 00 04:22:12.375 SET FEATURES [Set transfer mode]
ec 00 00 00 00 00 a0 00 04:22:12.375 IDENTIFY DEVICE
ec 00 00 00 00 00 a0 00 04:22:12.375 IDENTIFY DEVICE
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Device does not support Selective Self Tests/Logging
In the meantime I tested spinning the drive down and up again while it's idle. That seems to work just fine. Examining the logs, I only find errors just after the daily cron jobs have been executed. Could it be that something in the daily cron jobs triggers this?
The log of the last five errors does show errors around the IDENTIFY DEVICE commands. The strange thing is that identification works properly during boot.
I don't want to worry you, and I'm not an expert on SMART, but to me that looks like the HD is either failing or is in a state of "can fail at any time". All those pre-fail entries are what raise the flag for me.
Do you have good, solid backups elsewhere? If not, then now is a good time to start thinking about some.
Quote:
Could it be that something in the daily cron jobs triggers this?
Certainly could be. Depends what-all is being done in these cronjobs. If there's something in one or more of the cronjobs that is messing up (confusing) this drive, then it would stand to reason that after the crons are done, the drive is borked up.
Above are two of the five sequences of commands given to the drive before it self-identified that an error of some sort had occurred. They appear to be listed newest to oldest. Common to all five blocks of messages is the "SET FEATURES" call. I wonder if something is trying to set features of the drive which it is incapable of doing, or the drive itself is failing to set features that it is supposed to have, because of some internal problem. And who's making those command calls to the drive - is it the computer's BIOS, or the kernel/driver, or something in a cronjob such as an `hdparm` command or `smartctl` command. I'm particularly looking at the "ENABLE APM" command, since you mention (if I'm understanding post #1 correctly) that this problem seems to manifest in the morning, after (I am guessing) the machine (or drive) may have been in a low-power or power-off state during the night?
Generally speaking, what are these cronjobs all about? Is there anything in any of them that directly sends commands to the drive, like `hdparm` or `smartctl`? Maybe one of the crons is trying to set some drive feature (see below) before doing a backup or something.
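For reference, the three bytes at the start of those SET FEATURES lines can be decoded by hand: CR `ef` is the SET FEATURES command, FR `05` is the "enable APM" subcommand, and SC carries the requested APM level. A tiny sketch (the function name is made up, and only the two APM subcommands are handled):

```shell
# Decode the FR and SC bytes of a SET FEATURES (CR = ef) log entry.
# Only the APM enable/disable subcommands are covered in this sketch.
decode_set_features() {
    fr=$1 sc=$2
    case $fr in
        05) printf 'enable APM, level %d\n' "0x$sc" ;;
        85) printf 'disable APM\n' ;;
        *)  printf 'other subcommand 0x%s\n' "$fr" ;;
    esac
}

decode_set_features 05 fe    # prints: enable APM, level 254
```

Level 254 is what `hdparm -B 254` would request, so it may be worth checking whether some power-management script calls hdparm. The drive answers with ABRT, which could simply mean it does not support APM; if so, those entries would be harmless rejected commands rather than disk faults, but that is an assumption to verify.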
If nothing else at all has changed recently anywhere within the system, such as an OS upgrade, a kernel change, etc., and this just started happening for no apparent reason (only the past few days), and never used to happen, then I'd be getting prepared to have to replace that drive. But, you might do some more testing first:
--disable the APM thing in the BIOS so the drive doesn't spin down. See if the problem goes away.
--manually force-execute your cronjobs, and see if a particular one triggers the problem. If so, check out that cronjob more closely.
--use `smartctl` to test enabling and disabling of available features of the drive, like APM, DMA mode, transfer mode, etc. and see if the error occurs.
--run a full/long SMART test on the drive. See `smartctl` man page re: the -t option or --test=long
--Of course, while doing these things, keep an eye on your kernel log or whatever log that was in your first post.
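The checks above could be run roughly like this. It is a sketch only: the device name is taken from this thread, `run` is a made-up wrapper, and by default it just prints the commands so nothing touches the disk until you set DRY_RUN=0 as root:

```shell
# Print the test commands by default; set DRY_RUN=0 to really run them.
DISK=${DISK:-/dev/sdc}
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

run run-parts --test /etc/cron.daily   # list the daily cron scripts
run hdparm -y "$DISK"                  # force the drive into standby
run hdparm -C "$DISK"                  # query the current power state
run smartctl -t long "$DISK"           # start the extended self-test
run smartctl -l selftest "$DISK"       # read the self-test log afterwards
```

Watch the kernel log in a second terminal while stepping through these, so an error can be tied to the command that preceded it.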
Thank you for your concern. I've moved critical data from the drive. However the SMART values are, to my understanding, nothing to worry about. All pre-fail indicators are at their best possible value except maybe for spin up time. Normally values decrease once errors occur. At least that's my understanding of how SMART works.
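On the "pre-fail" point: in the TYPE column, Pre-fail names the category of the attribute, not its current state. An attribute only counts as failed when its normalized VALUE drops to or below THRESH, which is also what the WHEN_FAILED column records. A rough sketch of that check, assuming the ten-column table layout shown in the smartctl output above (`check_attrs` is a made-up name):

```shell
# Read the attribute table from `smartctl -A` output on stdin and report
# any attribute whose normalized VALUE has fallen to or below THRESH.
# A THRESH of 0 means the attribute can never be marked as failed.
check_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
             if ($6 + 0 > 0 && $4 + 0 <= $6 + 0) { print "FAILING: " $2; bad = 1 }
         }
         END { if (!bad) print "no attribute at or below its threshold" }'
}

# Usage (as root):  smartctl -A /dev/sdc | check_attrs
```

Raw counters are a separate matter. UDMA_CRC_Error_Count, for instance, counts errors on the link between drive and controller, so a non-zero raw value there may point at the cable or connection rather than the platters.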
Quote:
Originally Posted by GrapefruiTgirl
Let us know what turns up if anything.
Problems started about 36 hours after a kernel update. Before that the drive would just spin down and come up properly as defined in the BIOS settings. Unfortunately I was unable to find the changelog for that update.
I was unable to reproduce the problem running the cronjobs (which are pretty standard). Also hdparm can set the drive to sleep and standby state without any problems.
So I have now disabled the BIOS power-down feature for the hard drives. If that turns out to be the problem, I'll look into a software solution to power down the hard drives.
Thanks for your help. I will report back tomorrow on whether the system made it through the night.
The way I understand the SMART output, I too would say that the drive's general health as reported by SMART looks OK. All values appear to be pretty good.
As for that kernel update you mention: keep that high on your list of potential reasons for this. If the machine had been rebooted immediately after the upgrade finished, I would have expected the problem to begin materializing less than 36 hours after the upgrade (maybe after the first night). On the other hand, if the machine was not rebooted immediately after the upgrade, but a few days later instead, then that could account for the 36 hour delay.
Or, if this problem happens less frequently than once every 24-48 hours, like on an inconsistent basis maybe every 1-5 days or so, then again, that kernel upgrade could still be the culprit -- maybe a bug or typo in the kernel somewhere. Only way to verify would be thorough testing/evaluation & monitoring of logs for a week or more with the new kernel, and followed by a week or more of running on the previous kernel, by rolling back that upgrade. If this turns up any evidence that the kernel upgrade produces the symptoms but the old kernel does not, then I'd suggest a bug report to launchpad.
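For the log monitoring, a per-day count of the mismatch message makes it easier to compare the two kernels. This is a sketch that assumes the usual syslog line format (month and day as the first two fields) and that kernel messages end up in /var/log/kern.log; `mismatch_per_day` is a made-up name:

```shell
# Count "model number mismatch" kernel messages per day from syslog-style
# lines on stdin (first two fields are assumed to be month and day).
mismatch_per_day() {
    grep 'model number mismatch' | awk '{ print $1, $2 }' | uniq -c
}

# Usage:  zcat -f /var/log/kern.log* | mismatch_per_day
```

Note the output of `uname -r` next to each test period so the counts can be matched to the kernel that was running.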
Before testing other things, I would recommend testing the hard drive with the manufacturer's diagnostic tool. I have seen many hard drives that did not report errors over the S.M.A.R.T. interface but were actually failing, which was only discovered by using those tools. For Samsung you should use the HUTIL tool.
Well, no problems so far. Apparently powering down the drive and then coming up again was the trigger for the problems. Of course this isn't really a solution since the drive doesn't power down at all now but at least the system is stable again.
I still haven't been able to find any changelogs for the kernel though.
After almost two months without problems it started again. After Valery's reply I noticed the drive was on a different cable than the other two. I replaced the cable and hope that solved the issue. I also tried to switch the cables between the motherboard buses, but that resulted in problems with the master/slave configuration, which I really don't want to get into today.