LinuxQuestions.org
09-30-2010, 05:10 AM   #1
Harkov
Member
 
Registered: May 2004
Distribution: Ubuntu 10.04.1 LTS
Posts: 38

hard drive /dev/sdc becomes /dev/sdd


Hello,

I have a server with three physical hard drives. For the past couple of days, /dev/sdc seems to get renamed to /dev/sdd while the server is running. Since this drive houses /home and swap, this causes my home directories to give input/output errors. Most of the time the issue resolves itself with a reboot. However, sometimes the drive fails to come up properly, possibly due to a disk error.

A possible cause I came up with is that the drives are set in the BIOS to spin down after 15 minutes of idle time and might not come back up properly. According to the logs, something strange happens in the morning:
Code:
Sep 30 07:56:05 patio4 kernel: [44766.403489] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 07:56:05 patio4 kernel: [44766.403693] ata2.00: BMDMA stat 0x24
Sep 30 07:56:05 patio4 kernel: [44766.403808] ata2.00: failed command: READ DMA
Sep 30 07:56:05 patio4 kernel: [44766.403935] ata2.00: cmd c8/00:08:38:09:81/00:00:00:00:00/e1 tag 0 dma 4096 in
Sep 30 07:56:05 patio4 kernel: [44766.403937]          res 51/84:00:3f:09:81/00:00:00:00:00/e1 Emask 0x10 (ATA bus error)
Sep 30 07:56:05 patio4 kernel: [44766.404336] ata2.00: status: { DRDY ERR }
Sep 30 07:56:05 patio4 kernel: [44766.404453] ata2.00: error: { ICRC ABRT }
Sep 30 07:56:05 patio4 kernel: [44766.404597] ata2: soft resetting link
Sep 30 07:56:05 patio4 kernel: [44766.576332] ata2.00: model number mismatch 'SAMSUNG SP1614N' != 'QAMSUNE QP1614L'
Sep 30 07:56:05 patio4 kernel: [44766.576340] ata2.00: revalidation failed (errno=-19)
Sep 30 07:56:10 patio4 kernel: [44771.560072] ata2: soft resetting link
Sep 30 07:56:10 patio4 kernel: [44771.732352] ata2.00: model number mismatch 'SAMSUNG SP1614N' != 'QAMSUNE QP1614L'
Sep 30 07:56:10 patio4 kernel: [44771.732359] ata2.00: revalidation failed (errno=-19)
Sep 30 07:56:10 patio4 kernel: [44771.732504] ata2.00: disabled
Sep 30 07:56:15 patio4 kernel: [44776.716060] ata2: soft resetting link
Sep 30 07:56:15 patio4 kernel: [44776.888356] ata2.00: ATA-7: QAMSUNE QP1614L, TM100-04, max UDMA/100
Sep 30 07:56:15 patio4 kernel: [44776.888363] ata2.00: 15997968 sectors, multi 16, CHS 15871/16/63
Sep 30 07:56:15 patio4 kernel: [44776.896245] ata2.00: configured for UDMA/100
Sep 30 07:56:15 patio4 kernel: [44776.912228] ata2.00: configured for UDMA/100
Sep 30 07:56:15 patio4 kernel: [44776.912252] sd 1:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 30 07:56:15 patio4 kernel: [44776.912257] sd 1:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
Sep 30 07:56:15 patio4 kernel: [44776.912264] Descriptor sense data with sense descriptors (in hex):
Sep 30 07:56:15 patio4 kernel: [44776.912267]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 30 07:56:15 patio4 kernel: [44776.912276]         01 81 09 3f
Sep 30 07:56:15 patio4 kernel: [44776.912280] sd 1:0:0:0: [sdc] Add. Sense: Scsi parity error
Sep 30 07:56:15 patio4 kernel: [44776.912290] sd 1:0:0:0: [sdc] CDB: Read(10): 28 00 01 81 09 38 00 00 08 00
Sep 30 07:56:15 patio4 kernel: [44776.912300] end_request: I/O error, dev sdc, sector 25233720
Sep 30 07:56:15 patio4 kernel: [44776.912476] ata2: EH complete
Sep 30 07:56:15 patio4 kernel: [44776.912488] EXT4-fs error (device sdc6): ext4_find_entry: reading directory #786502 offset 0
Sep 30 07:56:15 patio4 kernel: [44776.912838] sd 1:0:0:0: rejecting I/O to offline device
Sep 30 07:56:15 patio4 kernel: [44776.913171] sd 1:0:0:0: rejecting I/O to offline device
Sep 30 07:56:15 patio4 kernel: [44776.913313] ata2.00: detaching (SCSI 1:0:0:0)
Sep 30 07:56:15 patio4 kernel: [44776.917504] EXT4-fs error (device sdc6): ext4_find_entry: reading directory #4194307 offset 0
Sep 30 07:56:15 patio4 kernel: [44776.917842] EXT4-fs (sdc6): previous I/O error to superblock detected
Sep 30 07:56:15 patio4 kernel: [44776.918209] EXT4-fs error (device sdc6): ext4_find_entry: reading directory #4194307 offset 0
Sep 30 07:56:15 patio4 kernel: [44776.918522] EXT4-fs (sdc6): previous I/O error to superblock detected
Sep 30 07:56:15 patio4 kernel: [44776.918800] EXT4-fs error (device sdc6): ext4_find_entry: reading directory #4194307 offset 0
Sep 30 07:56:15 patio4 kernel: [44776.919112] EXT4-fs (sdc6): previous I/O error to superblock detected
Sep 30 07:56:15 patio4 kernel: [44776.937197] sd 1:0:0:0: [sdc] Synchronizing SCSI cache
Sep 30 07:56:15 patio4 kernel: [44776.937986] sd 1:0:0:0: [sdc] Stopping disk
Sep 30 07:56:15 patio4 kernel: [44776.940045] EXT4-fs error (device sdc6): __ext4_get_inode_loc: unable to read inode block - inode=3014657, block=12058656
Sep 30 07:56:15 patio4 kernel: [44776.957240] EXT4-fs (sdc6): previous I/O error to superblock detected
Sep 30 07:56:15 patio4 kernel: [44776.976145] EXT4-fs error (device sdc6): ext4_find_entry: reading directory #2 offset 0
Sep 30 07:56:15 patio4 kernel: [44776.992994] EXT4-fs (sdc6): previous I/O error to superblock detected
Sep 30 07:56:16 patio4 kernel: [44777.205514] scsi 1:0:0:0: Direct-Access     ATA      QAMSUNE QP1614L  TM10 PQ: 0 ANSI: 5
Sep 30 07:56:16 patio4 kernel: [44777.205790] sd 1:0:0:0: Attached scsi generic sg2 type 0
Sep 30 07:56:16 patio4 kernel: [44777.212289] sd 1:0:0:0: [sdd] 15997968 512-byte logical blocks: (8.19 GB/7.62 GiB)
Sep 30 07:56:16 patio4 kernel: [44777.212503] sd 1:0:0:0: [sdd] Write Protect is off
Sep 30 07:56:16 patio4 kernel: [44777.212507] sd 1:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Sep 30 07:56:16 patio4 kernel: [44777.212539] sd 1:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 30 07:56:18 patio4 kernel: [44777.213533]  sdd:
Sep 30 07:56:18 patio4 kernel: [44779.307146] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 07:56:18 patio4 kernel: [44779.315872] ata2.00: BMDMA stat 0x24
Sep 30 07:56:18 patio4 kernel: [44779.324497] ata2.00: failed command: READ DMA
Sep 30 07:56:18 patio4 kernel: [44779.332914] ata2.00: cmd c8/00:08:01:00:00/00:00:00:00:00/a0 tag 0 dma 4096 in
Sep 30 07:56:18 patio4 kernel: [44779.332916]          res 51/84:00:08:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Sep 30 07:56:18 patio4 kernel: [44779.367060] ata2.00: status: { DRDY ERR }
Sep 30 07:56:18 patio4 kernel: [44779.375887] ata2.00: error: { ICRC ABRT }
Sep 30 07:56:18 patio4 kernel: [44779.384800] ata2: soft resetting link
Sep 30 07:56:18 patio4 kernel: [44779.564286] ata2.00: configured for UDMA/100
Sep 30 07:56:18 patio4 kernel: [44779.580267] ata2.00: configured for UDMA/100
Sep 30 07:56:18 patio4 kernel: [44779.580287] ata2: EH complete
Sep 30 07:56:18 patio4 kernel: [44779.582118] ata2.00: limiting speed to UDMA/66:PIO4
Sep 30 07:56:18 patio4 kernel: [44779.582124] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 07:56:18 patio4 kernel: [44779.591202] ata2.00: BMDMA stat 0x24
Sep 30 07:56:18 patio4 kernel: [44779.600313] ata2.00: failed command: READ DMA
Sep 30 07:56:18 patio4 kernel: [44779.609244] ata2.00: cmd c8/00:08:01:00:00/00:00:00:00:00/a0 tag 0 dma 4096 in
Sep 30 07:56:18 patio4 kernel: [44779.609246]          res 51/84:00:08:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Sep 30 07:56:18 patio4 kernel: [44779.645508] ata2.00: status: { DRDY ERR }
Sep 30 07:56:18 patio4 kernel: [44779.654434] ata2.00: error: { ICRC ABRT }
Sep 30 07:56:18 patio4 kernel: [44779.663335] ata2: soft resetting link
Sep 30 07:56:18 patio4 kernel: [44779.840257] ata2.00: configured for UDMA/66
Sep 30 07:56:18 patio4 kernel: [44779.856320] ata2.00: configured for UDMA/66
Sep 30 07:56:18 patio4 kernel: [44779.856339] ata2: EH complete
Sep 30 07:56:18 patio4 kernel: [44779.865541] ata2.00: limiting speed to UDMA/33:PIO4
Sep 30 07:56:18 patio4 kernel: [44779.865550] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 07:56:18 patio4 kernel: [44779.874381] ata2.00: BMDMA stat 0x24
Sep 30 07:56:18 patio4 kernel: [44779.883106] ata2.00: failed command: READ DMA
Sep 30 07:56:18 patio4 kernel: [44779.891742] ata2.00: cmd c8/00:08:01:00:00/00:00:00:00:00/a0 tag 0 dma 4096 in
Sep 30 07:56:18 patio4 kernel: [44779.891744]          res 51/84:00:08:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Sep 30 07:56:18 patio4 kernel: [44779.926007] ata2.00: status: { DRDY ERR }
Sep 30 07:56:18 patio4 kernel: [44779.934676] ata2.00: error: { ICRC ABRT }
Sep 30 07:56:18 patio4 kernel: [44779.943240] ata2: soft resetting link
Sep 30 07:56:18 patio4 kernel: [44780.120418] ata2.00: configured for UDMA/33
Sep 30 07:56:18 patio4 kernel: [44780.136273] ata2.00: configured for UDMA/33
Sep 30 07:56:18 patio4 kernel: [44780.136294] ata2: EH complete
Sep 30 07:56:19 patio4 kernel: [44780.140674] ata2.00: limiting speed to PIO4
Sep 30 07:56:19 patio4 kernel: [44780.140682] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 07:56:19 patio4 kernel: [44780.149303] ata2.00: BMDMA stat 0x24
Sep 30 07:56:19 patio4 kernel: [44780.157797] ata2.00: failed command: READ DMA
Sep 30 07:56:19 patio4 kernel: [44780.166168] ata2.00: cmd c8/00:08:01:00:00/00:00:00:00:00/a0 tag 0 dma 4096 in
Sep 30 07:56:19 patio4 kernel: [44780.166170]          res 51/84:00:08:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Sep 30 07:56:19 patio4 kernel: [44780.199782] ata2.00: status: { DRDY ERR }
Sep 30 07:56:19 patio4 kernel: [44780.208173] ata2.00: error: { ICRC ABRT }
Sep 30 07:56:19 patio4 kernel: [44780.216489] ata2: soft resetting link
Sep 30 07:56:19 patio4 kernel: [44780.396369] ata2.00: configured for PIO4
Sep 30 07:56:19 patio4 kernel: [44780.412278] ata2.00: configured for PIO4
Sep 30 07:56:19 patio4 kernel: [44780.412299] ata2: EH complete
Sep 30 07:56:19 patio4 kernel: [44780.419628]  unknown partition table
Sep 30 07:56:19 patio4 kernel: [44780.421972] sdd: detected capacity change from 0 to 8190959616
Sep 30 07:56:19 patio4 kernel: [44780.422096] sd 1:0:0:0: [sdd] Attached SCSI disk
Sep 30 07:58:11 patio4 kernel: [44892.182907] EXT4-fs error (device sdc6): ext4_find_entry: reading directory #2 offset 0
Sep 30 07:58:11 patio4 kernel: [44892.200026] EXT4-fs (sdc6): previous I/O error to superblock detected
This is probably the point at which the system notices that /home is missing, while it is attempting to write a backup to it (run from cron.daily).
The line "ata2.00: model number mismatch 'SAMSUNG SP1614N' != 'QAMSUNE QP1614L'" is rather peculiar.
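The log above is long, so a short summary may help. The following is a minimal sketch, not a diagnostic tool: it takes a handful of lines copied from the log (timestamps trimmed, starting after the drive is re-attached), counts the ICRC errors, and lists the transfer modes the kernel falls back through. The function name `summarize` is made up for illustration.

```python
import re

# A few lines copied from the kernel log quoted above (timestamps trimmed).
LOG = """\
ata2.00: error: { ICRC ABRT }
ata2.00: configured for UDMA/100
ata2.00: limiting speed to UDMA/66:PIO4
ata2.00: error: { ICRC ABRT }
ata2.00: configured for UDMA/66
ata2.00: limiting speed to UDMA/33:PIO4
ata2.00: error: { ICRC ABRT }
ata2.00: configured for UDMA/33
ata2.00: limiting speed to PIO4
ata2.00: error: { ICRC ABRT }
ata2.00: configured for PIO4
"""

def summarize(log_text):
    """Return (number of ICRC error lines, ordered list of distinct transfer modes)."""
    crc_errors = 0
    modes = []
    for line in log_text.splitlines():
        if "ICRC" in line:
            crc_errors += 1
        match = re.search(r"configured for (\S+)", line)
        # Record a mode only when it differs from the previous one.
        if match and (not modes or modes[-1] != match.group(1)):
            modes.append(match.group(1))
    return crc_errors, modes

crc_errors, modes = summarize(LOG)
print(crc_errors, "ICRC errors;", " -> ".join(modes))
```

In this excerpt the read keeps failing with a bus CRC error at every speed, and the kernel steps down from UDMA/100 all the way to PIO4.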

Normally the layout is like this:
Code:
ls /dev/sd*
/dev/sda  /dev/sda1  /dev/sdb  /dev/sdb1  /dev/sdb5  /dev/sdc  /dev/sdc2  /dev/sdc5  /dev/sdc6
However, it changes to this:
Code:
ls /dev/sd*
/dev/sda  /dev/sda1  /dev/sdb  /dev/sdb1  /dev/sdb5  /dev/sdd
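To make the difference between the two listings explicit, here is a small sketch that compares them as sets. The two strings are copied from the `ls /dev/sd*` output above; the variable names are only illustrative.

```python
# Device nodes as listed above, before and after the problem occurs.
before = "/dev/sda /dev/sda1 /dev/sdb /dev/sdb1 /dev/sdb5 /dev/sdc /dev/sdc2 /dev/sdc5 /dev/sdc6".split()
after = "/dev/sda /dev/sda1 /dev/sdb /dev/sdb1 /dev/sdb5 /dev/sdd".split()

gone = sorted(set(before) - set(after))      # nodes that disappeared
appeared = sorted(set(after) - set(before))  # nodes that are new

print("gone:    ", gone)
print("appeared:", appeared)
```

Note that the disk does not simply move from sdc to sdd: all three partition nodes vanish and no sdd partitions appear, which fits the "unknown partition table" line in the log.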
Normal fdisk output:
Code:
fdisk -l

Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0000b653

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1       19458   156289024   83  Linux

Disk /dev/sdb: 320.1 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00094932

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       38914   312568833    5  Extended
/dev/sdb5               1       38914   312568832   83  Linux

Disk /dev/sdc: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e2dd5

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc2               1       19458   156289025    5  Extended
/dev/sdc5           18237       19458     9804800   82  Linux swap / Solaris
/dev/sdc6               1       18237   146483200   83  Linux
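The figures in the kernel log and in the fdisk output can be cross-checked with a little arithmetic. This is a minimal sketch using only numbers quoted above; the variable names are made up for illustration.

```python
SECTOR_BYTES = 512

reattached_sectors = 15997968   # "ata2.00: 15997968 sectors" in the log
expected_bytes = 160041885696   # "Disk /dev/sdc: 160.0 GB, 160041885696 bytes" from fdisk

# Size the drive reports after it comes back as sdd.
reattached_bytes = reattached_sectors * SECTOR_BYTES
print(reattached_bytes)  # 8190959616

# The CHS geometry in the same log line (15871/16/63) gives the same sector count.
chs_sectors = 15871 * 16 * 63
print(chs_sectors == reattached_sectors)  # True

# Sector count the drive should have according to fdisk.
expected_sectors = expected_bytes // SECTOR_BYTES
print(expected_sectors)
```

So the drive that reappears as sdd identifies itself as roughly 8.19 GB instead of 160 GB, which matches the "detected capacity change from 0 to 8190959616" line in the log.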
fstab:
Code:
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    nodev,noexec,nosuid 0       0
# / was on /dev/sda1 during installation
UUID=d6f35d28-cb1f-4401-a936-b5fb1b25d5c7 /               ext4    errors=remount-ro 0       1
# /data was on /dev/sdb5 during installation
UUID=44de0bbc-7f2c-4111-9b75-e16b9f0532c5 /data           ext4    defaults        0       2
# /home was on /dev/sdc6 during installation
UUID=c0d46731-933b-44d9-aa9f-078550b3f2b2 /home           ext4    defaults        0       2
# swap was on /dev/sdc5 during installation
UUID=37b43414-2373-419e-ad9b-372eb760b1fa none            swap    sw              0       0
#/dev/fd0        /media/floppy0  auto    rw,user,noauto,exec,utf8 0       0
/dev/mapper/cryptswap1 none swap sw 0 0
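One thing worth noting in the fstab above is that the filesystems are referenced by UUID rather than by /dev/sdX name. A rough sketch of how the entries split up follows; the text is copied from the fstab above with comment lines dropped, and `parse_fstab` is a hypothetical helper, not part of any tool.

```python
FSTAB = """\
proc            /proc           proc    nodev,noexec,nosuid 0       0
UUID=d6f35d28-cb1f-4401-a936-b5fb1b25d5c7 /               ext4    errors=remount-ro 0       1
UUID=44de0bbc-7f2c-4111-9b75-e16b9f0532c5 /data           ext4    defaults        0       2
UUID=c0d46731-933b-44d9-aa9f-078550b3f2b2 /home           ext4    defaults        0       2
UUID=37b43414-2373-419e-ad9b-372eb760b1fa none            swap    sw              0       0
/dev/mapper/cryptswap1 none swap sw 0 0
"""

def parse_fstab(text):
    """Return (spec, mount point, type) for every non-comment, non-empty line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        entries.append((fields[0], fields[1], fields[2]))
    return entries

entries = parse_fstab(FSTAB)
by_uuid = [mount for spec, mount, _ in entries if spec.startswith("UUID=")]
by_name = [spec for spec, _, _ in entries if not spec.startswith("UUID=")]

print("mounted by UUID:", by_uuid)
print("other specs:    ", by_name)
```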
The server is running Ubuntu 10.04.1 LTS
2.6.32-25-generic-pae #44-Ubuntu SMP Fri Sep 17 21:57:48 UTC 2010 i686 GNU/Linux

Any ideas as to what is causing this and what to do about it?

Last edited by Harkov; 09-30-2010 at 05:29 AM.
 
09-30-2010, 06:15 AM   #2
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

More info about the hardware..

Code:
[44771.732352] ata2.00: model number mismatch 'SAMSUNG SP1614N' != 'QAMSUNE QP1614L'
You're right! That is quite odd. I wonder where it's getting that strange name from?

It isn't often these days that Google is stumped: the only result for that whole line was this very thread. Searching only for "ata2.00: model number mismatch" produces a fair number of hits, but in the few minutes I spent reading through some of them I saw no solution. Many of the threads did mention Debian or Ubuntu though, so this is *possibly* a kernel/driver issue in those kernels, but it's too soon to say.

My first feeling about this is a firmware bug, either in the drive's firmware itself, or in the machine's BIOS; more likely to be in the drive's firmware I think.
However, for the record, something very interesting shows up when you compare the correct spelling of the make & model with the wrong spelling:
Code:
SAMSUNG SP1614N
QAMSUNE QP1614L
In the mis-spelled version, the first and last characters of each "word" are alphabetically 2 letters behind what they're supposed to be (S -> Q, G -> E, S -> Q, N -> L). I don't know enough about kernel code and how this data is read from the drive, but is it possible that a kernel/driver coding error (bug) could produce the shifted values seen here? There's basically a pattern to the spelling errors, and to me this is more indicative of an error in the code that reads this info from the drive.
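The pattern can be checked mechanically. Here is a minimal sketch that compares the two strings from the log character by character; `diff_positions` is just an illustrative name.

```python
expected = "SAMSUNG SP1614N"   # name the kernel had on record
reported = "QAMSUNE QP1614L"   # name the drive returned after the reset

def diff_positions(a, b):
    """Return (index, char in a, char in b, XOR of the two code points) per mismatch."""
    return [
        (i, x, y, ord(x) ^ ord(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]

for i, x, y, xor in diff_positions(expected, reported):
    shift = ord(y) - ord(x)
    print(f"pos {i:2d}: {x!r} -> {y!r}  shift {shift:+d}  xor 0x{xor:02x}")
```

All four mismatches are a shift of exactly -2, and in each case the two characters differ only in the single bit 0x02. That would be consistent with one bit being dropped during the transfer, though this is only one possible reading and the comparison alone does not say where the corruption happens.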

Could you run a couple commands and provide a little more information about the hardware?

-- What make & model of computer, or what make & model# of motherboard in it, and what BIOS version? If you don't know this, perhaps you have the `dmidecode` command shown below:
Code:
/usr/sbin/dmidecode
Yours may be in a different location - use the `which` command to locate your dmidecode. If you have it, the command will output a whack of info about your machine; primarily I'd be interested in the first three blocks of data, which give specifics about the BIOS and motherboard, something like this:
Code:
root@reactor: /usr/sbin/dmidecode
# dmidecode 2.10
SMBIOS 2.5 present.
54 structures occupying 1995 bytes.
Table at 0x000FB4F0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: V2.7
        Release Date: 12/09/2008
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 512 kB
        Characteristics:
     <-- SNIP lots of stuff in here, please leave yours. -->
        BIOS Revision: 8.13

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: MSI
        Product Name: MS-7350
        Version: 1.0
        Serial Number: To Be Filled By O.E.M.
        UUID: Not Present
        Wake-up Type: Power Switch
        SKU Number: To Be Filled By O.E.M.
        Family: To Be Filled By O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: MSI
        Product Name: MSI P6N SLI
        Version: 1.0
        Serial Number: To be filled by O.E.M.
        Asset Tag: To Be Filled By O.E.M.
        Features:
                Board is a hosting board
                Board is replaceable
        Location In Chassis: To Be Filled By O.E.M.
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0
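Since the full dmidecode dump is long, here is a rough sketch of how the blocks of interest could be picked out of saved output. The sample text is a shortened copy of the example above, and `parse_dmi` is a hypothetical helper that assumes the layout shown there: an unindented title line followed by indented "Key: value" lines.

```python
SAMPLE = """\
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: V2.7
        Release Date: 12/09/2008

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: MSI
        Product Name: MS-7350

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: MSI
        Product Name: MSI P6N SLI
"""

def parse_dmi(text):
    """Map each block title to a dict of its indented 'Key: value' lines."""
    blocks = {}
    current = None
    for line in text.splitlines():
        # Handle lines and blank lines separate one block from the next.
        if line.startswith("Handle ") or not line.strip():
            current = None
            continue
        if not line[0].isspace():
            # An unindented line is treated as a block title.
            current = blocks.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return blocks

blocks = parse_dmi(SAMPLE)
for title in ("BIOS Information", "System Information", "Base Board Information"):
    print(title, blocks.get(title))
```

Nested items without a colon (such as the entries under Characteristics) are simply skipped by this sketch.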
So we have some info about the board now. Next let's learn a bit about the disk drive itself; if you have the SMART tools installed, we'll use those. On my system, they're in a package called "smartmontools-5.39", and the commands we want are:
Code:
smartctl -a /dev/sdc
OR
smartctl -a /dev/sdd
That will produce a load of data about the drive. Of primary interest for now will be the first block, called "START OF INFORMATION SECTION", which will have the make & model and some other data about the drive. Please paste that block of data.

Regardless of what information the above commands produce, I have no suggestion yet for how to work around this. Maybe the output will hold a clue or something more to search on, but as yet, no idea. Maybe it's a hardware/firmware bug, or maybe it *is* a kernel bug. More searching is needed, unless someone already has the answer & solution from a similar past experience.

Interesting.. Good luck!
 
09-30-2010, 06:50 AM   #3
Harkov
Member
 
Registered: May 2004
Distribution: Ubuntu 10.04.1 LTS
Posts: 38

Original Poster
Thanks for your reply.

You're very observant! I hadn't noticed that those letters are off exactly two places in the alphabet.

It's a fairly old system; here's the output of /usr/sbin/dmidecode:
Code:
/usr/sbin/dmidecode
# dmidecode 2.9
SMBIOS 2.3 present.
49 structures occupying 1360 bytes.
Table at 0x000F3B20.

Handle 0x0000, DMI type 0, 20 bytes
BIOS Information
        Vendor: Award Software, Inc.
        Version: ASUS A7V8X ACPI BIOS Revision 1014
        Release Date: 04/21/2004
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 512 kB
        Characteristics:
                PCI is supported
                PNP is supported
                APM is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/360 KB floppy services are supported (int 13h)
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 KB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                AGP is supported

Handle 0x0001, DMI type 1, 25 bytes
System Information
        Manufacturer: System Manufacturer
        Product Name: System Name
        Version: System Version
        Serial Number: SYS-1234567890
        UUID: Not Settable
        Wake-up Type: Power Switch

Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
        Manufacturer: ASUSTeK Computer INC.
        Product Name: A7V8X
        Version: REV 1.xx
        Serial Number: xxxxxxxxxxx

Handle 0x0003, DMI type 3, 17 bytes
Chassis Information
        Manufacturer: Chassis Manufacture
        Type: Tower
        Lock: Not Present
        Version: Chassis Version
        Serial Number: Chassis Serial Number
        Asset Tag: Asset-1234567890
        Boot-up State: Safe
        Power Supply State: Safe
        Thermal State: Safe
        Security Status: Unknown
        OEM Information: 0x00000001

Handle 0x0004, DMI type 4, 32 bytes
Processor Information
        Socket Designation: SOCKET A
        Type: Central Processor
        Family: Other
        Manufacturer: AuthenticAMD
        ID: 81 06 00 00 FF FB 83 03
        Signature: Family 6, Model 8, Stepping 1
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                PSE-36 (36-bit page size extension)
                MMX (MMX technology supported)
                FXSR (Fast floating-point save and restore)
                SSE (Streaming SIMD extensions)
        Version: AMD Athlon(TM) XP 2400+
        Voltage: 1.7 V
        External Clock: 133 MHz
        Max Speed: 2250 MHz
        Current Speed: 2000 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x0009
        L2 Cache Handle: 0x000A
        L3 Cache Handle: Not Provided

Handle 0x0005, DMI type 5, 22 bytes
Memory Controller Information
        Error Detecting Method: None
        Error Correcting Capabilities:
                Other
        Supported Interleave: Unknown
        Current Interleave: Unknown
        Maximum Memory Module Size: 1024 MB
        Maximum Total Memory Size: 3072 MB
        Supported Speeds:
                70 ns
                60 ns
                50 ns
        Supported Memory Types:
                ECC
                DIMM
                SDRAM
        Memory Module Voltage: 3.3 V
        Associated Memory Slots: 3
                0x0006
                0x0007
                0x0008
        Enabled Error Correcting Capabilities:
                Unknown

Handle 0x0006, DMI type 6, 12 bytes
Memory Module Information
        Socket Designation: DIMM 1
        Bank Connections: 0 1
        Current Speed: Unknown
        Type: DIMM SDRAM
        Installed Size: 512 MB (Double-bank Connection)
        Enabled Size: 512 MB (Double-bank Connection)
        Error Status: OK

Handle 0x0007, DMI type 6, 12 bytes
Memory Module Information
        Socket Designation: DIMM 2
        Bank Connections: 2 3
        Current Speed: Unknown
        Type: DIMM SDRAM
        Installed Size: 512 MB (Double-bank Connection)
        Enabled Size: 512 MB (Double-bank Connection)
        Error Status: OK

Handle 0x0008, DMI type 6, 12 bytes
Memory Module Information
        Socket Designation: DIMM 3
        Bank Connections: 4 5
        Current Speed: Unknown
        Type: DIMM SDRAM
        Installed Size: Not Installed
        Enabled Size: Not Installed
        Error Status: OK

Handle 0x0009, DMI type 7, 19 bytes
Cache Information
        Socket Designation: L1 Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 128 KB
        Maximum Size: 128 KB
        Supported SRAM Types:
                Pipeline Burst
                Synchronous
        Installed SRAM Type: Pipeline Burst Synchronous
        Speed: Unknown
        Error Correction Type: Unknown
        System Type: Data
        Associativity: 4-way Set-associative

Handle 0x000A, DMI type 7, 19 bytes
Cache Information
        Socket Designation: L2 Cache
        Configuration: Enabled, Not Socketed, Level 2
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 256 KB
        Maximum Size: 8192 KB
        Supported SRAM Types:
                Pipeline Burst
                Synchronous
        Installed SRAM Type: Pipeline Burst Synchronous
        Speed: Unknown
        Error Correction Type: Unknown
        System Type: Data
        Associativity: 4-way Set-associative

Handle 0x000B, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: PRIMARY IDE/HDD
        Internal Connector Type: On Board IDE
        External Reference Designator: Not Specified
        External Connector Type: None
        Port Type: None

Handle 0x000C, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: SECONDARY IDE/HDD
        Internal Connector Type: On Board IDE
        External Reference Designator: Not Specified
        External Connector Type: None
        Port Type: None

Handle 0x000D, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: FLOPPY
        Internal Connector Type: On Board Floppy
        External Reference Designator: Not Specified
        External Connector Type: None
        Port Type: None

Handle 0x000E, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: USB1
        External Connector Type: Access Bus (USB)
        Port Type: USB

Handle 0x000F, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: USB2
        External Connector Type: Access Bus (USB)
        Port Type: USB

Handle 0x0010, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: USB3
        External Connector Type: Access Bus (USB)
        Port Type: USB

Handle 0x0011, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: USB4
        External Connector Type: Access Bus (USB)
        Port Type: USB

Handle 0x0012, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: USB5
        External Connector Type: Access Bus (USB)
        Port Type: USB

Handle 0x0013, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: USB6
        External Connector Type: Access Bus (USB)
        Port Type: USB

Handle 0x0014, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: PS/2 Keyboard
        External Connector Type: PS/2
        Port Type: Keyboard Port

Handle 0x0015, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: PS/2 Mouse
        External Connector Type: PS/2
        Port Type: Mouse Port

Handle 0x0016, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: Parallel Port
        External Connector Type: DB-25 female
        Port Type: Parallel Port ECP/EPP

Handle 0x0017, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: Serial Port 1
        External Connector Type: DB-9 male
        Port Type: Serial Port 16550 Compatible

Handle 0x0018, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: Serial Port 2
        External Connector Type: DB-9 male
        Port Type: Serial Port 16550 Compatible

Handle 0x0019, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: Joystick Port
        External Connector Type: DB-15 female
        Port Type: Joystick Port

Handle 0x001A, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: MIDI Port
        External Connector Type: DB-15 female
        Port Type: MIDI Port

Handle 0x001B, DMI type 8, 9 bytes
Port Connector Information
        Internal Reference Designator: Not Specified
        Internal Connector Type: None
        External Reference Designator: Line In Jack
        External Connector Type: Mini Jack (headphones)
        Port Type: Audio Port

Handle 0x001C, DMI type 9, 13 bytes
System Slot Information
        Designation: PCI 1
        Type: 32-bit PCI
        Current Usage: Available
        Length: Short
        ID: 1
        Characteristics:
                5.0 V is provided
                3.3 V is provided
                PME signal is supported

Handle 0x001D, DMI type 9, 13 bytes
System Slot Information
        Designation: PCI 2
        Type: 32-bit PCI
        Current Usage: Available
        Length: Short
        ID: 2
        Characteristics:
                5.0 V is provided
                3.3 V is provided
                PME signal is supported

Handle 0x001E, DMI type 9, 13 bytes
System Slot Information
        Designation: PCI 3
        Type: 32-bit PCI
        Current Usage: Available
        Length: Short
        ID: 3
        Characteristics:
                5.0 V is provided
                3.3 V is provided
                PME signal is supported

Handle 0x001F, DMI type 9, 13 bytes
System Slot Information
        Designation: PCI 4
        Type: 32-bit PCI
        Current Usage: Available
        Length: Short
        ID: 4
        Characteristics:
                5.0 V is provided
                3.3 V is provided
                PME signal is supported

Handle 0x0020, DMI type 9, 13 bytes
System Slot Information
        Designation: PCI 5
        Type: 32-bit PCI
        Current Usage: Available
        Length: Short
        ID: 5
        Characteristics:
                5.0 V is provided
                3.3 V is provided
                PME signal is supported

Handle 0x0021, DMI type 9, 13 bytes
System Slot Information
        Designation: PCI 6
        Type: 32-bit PCI
        Current Usage: Available
        Length: Short
        ID: 6
        Characteristics:
                5.0 V is provided
                3.3 V is provided
                PME signal is supported

Handle 0x0022, DMI type 9, 13 bytes
System Slot Information
        Designation: AGP
        Type: 32-bit AGP 8x
        Current Usage: In Use
        Length: Short
        ID: 7
        Characteristics:
                3.3 V is provided

Handle 0x0023, DMI type 11, 5 bytes
OEM Strings
        String 1: 0
        String 2: 0

Handle 0x0024, DMI type 13, 22 bytes
BIOS Language Information
        Installable Languages: 1
                en|US|iso8859-1
        Currently Installed Language: en|US|iso8859-1

Handle 0x0025, DMI type 14, 14 bytes
Group Associations
        Name: Cpu Module
        Items: 3
                0x0004 (Processor)
                0x0009 (Cache)
                0x000A (Cache)

Handle 0x0026, DMI type 14, 29 bytes
Group Associations
        Name: Memory Module Set
        Items: 8
                0x0027 (Physical Memory Array)
                0x0028 (Memory Device)
                0x002C (Memory Device Mapped Address)
                0x0029 (Memory Device)
                0x002D (Memory Device Mapped Address)
                0x002A (Memory Device)
                0x002E (Memory Device Mapped Address)
                0x002B (Memory Array Mapped Address)

Handle 0x0027, DMI type 16, 15 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: None
        Maximum Capacity: 3 GB
        Error Information Handle: Not Provided
        Number Of Devices: 3

Handle 0x0028, DMI type 17, 23 bytes
Memory Device
        Array Handle: 0x0027
        Error Information Handle: No Error
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 512 MB
        Form Factor: DIMM
        Set: 1
        Locator: DDR 1
        Bank Locator: Not Specified
        Type: DRAM
        Type Detail: Synchronous
        Speed: Unknown

Handle 0x0029, DMI type 17, 23 bytes
Memory Device
        Array Handle: 0x0027
        Error Information Handle: No Error
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 512 MB
        Form Factor: DIMM
        Set: 2
        Locator: DDR 2
        Bank Locator: Not Specified
        Type: DRAM
        Type Detail: Synchronous
        Speed: Unknown

Handle 0x002A, DMI type 17, 23 bytes
Memory Device
        Array Handle: 0x0027
        Error Information Handle: No Error
        Total Width: Unknown
        Data Width: Unknown
        Size: No Module Installed
        Form Factor: DIMM
        Set: 3
        Locator: DDR 3
        Bank Locator: Not Specified
        Type: DRAM
        Type Detail: Synchronous
        Speed: Unknown

Handle 0x002B, DMI type 19, 15 bytes
Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x0003FFFFFFF
        Range Size: 1 GB
        Physical Array Handle: 0x0027
        Partition Width: 0

Handle 0x002C, DMI type 20, 19 bytes
Memory Device Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x0001FFFFFFF
        Range Size: 512 MB
        Physical Device Handle: 0x0028
        Memory Array Mapped Address Handle: 0x002B
        Partition Row Position: 1

Handle 0x002D, DMI type 20, 19 bytes
Memory Device Mapped Address
        Starting Address: 0x00020000000
        Ending Address: 0x0003FFFFFFF
        Range Size: 512 MB
        Physical Device Handle: 0x0029
        Memory Array Mapped Address Handle: 0x002B
        Partition Row Position: 2

Handle 0x002E, DMI type 126, 19 bytes
Inactive

Handle 0x002F, DMI type 32, 11 bytes
System Boot Information
        Status: No errors detected

Handle 0x0030, DMI type 127, 4 bytes
End Of Table
The SMART data for /dev/sdc:
Code:
smartctl -a /dev/sdc
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SP1614N
Serial Number:    S016J10X612210
Firmware Version: TM100-24
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Thu Sep 30 13:28:07 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (5760) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  96) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0007   065   054   000    Pre-fail  Always       -       6016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       827
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Half_Minutes   0x0032   098   098   000    Old_age   Always       -       10342h+39m
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       157
194 Temperature_Celsius     0x0022   199   100   000    Old_age   Always       -       13
195 Hardware_ECC_Recovered  0x000a   100   100   000    Old_age   Always       -       252397751
196 Reallocated_Event_Count 0x0012   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0033   253   253   010    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   253   253   010    Pre-fail  Offline      -       0
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       18
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2297 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2297 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 fe 00 00 00 40  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 05 fe 00 00 00 40 00      05:56:36.375  SET FEATURES [Enable APM]
  ca 00 10 08 09 80 ef 00      05:56:36.375  WRITE DMA
  ca 00 08 00 08 84 e8 00      05:56:36.375  WRITE DMA
  ca 00 08 48 09 81 e1 00      05:56:36.375  WRITE DMA
  ca 00 08 a0 09 80 e1 00      05:56:36.375  WRITE DMA

Error 2296 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 fe 00 00 00 40  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 05 fe 00 00 00 40 00      05:46:47.375  SET FEATURES [Enable APM]
  c4 00 08 02 04 00 a1 00      05:46:47.250  READ MULTIPLE
  c4 00 08 21 00 00 a0 00      05:46:47.250  READ MULTIPLE
  c4 00 08 09 00 00 a8 00      05:46:47.250  READ MULTIPLE
  c4 00 08 1d 00 00 ac 00      05:46:47.250  READ MULTIPLE

Error 2295 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 fe 00 00 00 40  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 05 fe 00 00 00 40 00      04:22:13.125  SET FEATURES [Enable APM]
  c4 00 08 02 04 00 a1 00      04:22:13.063  READ MULTIPLE
  c4 00 08 21 00 00 a0 00      04:22:13.063  READ MULTIPLE
  c4 00 08 09 00 00 a8 00      04:22:13.063  READ MULTIPLE
  c4 00 08 1d 00 00 ac 00      04:22:13.063  READ MULTIPLE

Error 2294 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 08 01 00 00 a0  Error: ICRC, ABRT 8 sectors at LBA = 0x00000001 = 1

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 01 00 00 a0 00      04:22:12.688  READ DMA
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:22:12.688  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE

Error 2293 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 08 01 00 00 a0  Error: ICRC, ABRT 8 sectors at LBA = 0x00000001 = 1

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 01 00 00 a0 00      04:22:12.438  READ DMA
  ec 00 00 00 00 00 a0 00      04:22:12.375  IDENTIFY DEVICE
  ef 03 44 00 00 00 a0 00      04:22:12.375  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      04:22:12.375  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 00      04:22:12.375  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Device does not support Selective Self Tests/Logging
In the meantime I tested the drive spinning down and up again when it's idle. That seems to work just fine. Examining the logs, I only find errors just after the daily cron jobs have been executed. Could it be that something in the daily cron jobs triggers this?

The log of the last five errors does show errors around IDENTIFY DEVICE. The strange thing is that identification works properly during boot.
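To check whether the errors really cluster around a particular time of day, the kernel log can be bucketed by hour. Below is a rough Python sketch using three lines copied from the log in the first post. The regular expression and the keyword list are assumptions tailored to those lines, and `errors_per_hour` is an illustrative name, not an existing tool.

```python
import re
from collections import Counter

# Three lines copied from the kernel log in the first post.
SAMPLE_LOG = """\
Sep 30 07:56:05 patio4 kernel: [44766.403489] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep 30 07:56:05 patio4 kernel: [44766.403808] ata2.00: failed command: READ DMA
Sep 30 07:56:05 patio4 kernel: [44766.576340] ata2.00: revalidation failed (errno=-19)
"""

# Assumed syslog layout: "Mon DD HH:MM:SS host kernel: [uptime] ataN.NN: message"
LINE = re.compile(
    r"^(\w{3}\s+\d+)\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:.*\b(ata\d+(?:\.\d+)?): (.*)$"
)
# Message fragments treated as errors (an assumption, extend as needed).
KEYWORDS = ("exception", "failed command", "revalidation failed")


def errors_per_hour(text):
    """Count matching ATA error lines per (date, hour, port)."""
    counts = Counter()
    for line in text.splitlines():
        m = LINE.match(line)
        if m and any(k in m.group(4) for k in KEYWORDS):
            counts[(m.group(1), m.group(2), m.group(3))] += 1
    return counts


counts = errors_per_hour(SAMPLE_LOG)
for (day, hour, port), n in sorted(counts.items()):
    print(f"{day} {hour}:00  {port}  {n} error line(s)")
```

Fed with the complete kernel log instead of the sample, the busy hours can then be compared with the times at which the cron jobs run.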
 
Old 09-30-2010, 06:59 AM   #4
djsmiley2k
Member
 
Registered: Feb 2005
Location: Coventry, UK
Distribution: Home: Gentoo x86/amd64, Debian ppc. Work: Ubuntu, SuSe, CentOS
Posts: 343
Blog Entries: 1

Rep: Reputation: 72
Quote:
Originally Posted by Harkov
Thanks for your reply.

You're very observant! I hadn't noticed that those letters are off exactly two places in the alphabet.

It's a fairly old system, here's the output of /usr/sbin/dmidecode
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0007   065   054   000    Pre-fail  Always       -       6016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       827
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Half_Minutes   0x0032   098   098   000    Old_age   Always       -       10342h+39m
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       157
194 Temperature_Celsius     0x0022   199   100   000    Old_age   Always       -       13
195 Hardware_ECC_Recovered  0x000a   100   100   000    Old_age   Always       -       252397751
196 Reallocated_Event_Count 0x0012   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0033   253   253   010    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   253   253   010    Pre-fail  Offline      -       0
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       18
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       0
I don't want to worry you, and I'm not an expert on SMART, but to me that looks like the HD is either failing, or is in a state of "can fail at any time". All those pre-fail warnings are what is throwing this up for me.

Do you have good, solid backups elsewhere? If not, then now is a good time to start thinking about some.
 
Old 09-30-2010, 07:16 AM   #5
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
Quote:
Could it be that something in the daily cron jobs triggers this?
Certainly could be. Depends what-all is being done in these cronjobs. If there's something in one or more of the cronjobs that is messing up (confusing) this drive, then it would stand to reason that after the crons are done, the drive is borked up.

I find it interesting:
Code:
  ef 05 fe 00 00 00 40 00      04:22:13.125  SET FEATURES [Enable APM]
  c4 00 08 02 04 00 a1 00      04:22:13.063  READ MULTIPLE
  c4 00 08 21 00 00 a0 00      04:22:13.063  READ MULTIPLE
  c4 00 08 09 00 00 a8 00      04:22:13.063  READ MULTIPLE
  c4 00 08 1d 00 00 ac 00      04:22:13.063  READ MULTIPLE
<-- snip -->
  c8 00 08 01 00 00 a0 00      04:22:12.688  READ DMA
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:22:12.688  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
Above are two of the five sequences of commands given to the drive before it logged an error. They appear to be listed newest to oldest. Common to all five blocks of messages is the "SET FEATURES" call. I wonder if something is trying to set features of the drive which it is incapable of doing, or the drive itself is failing to set features that it is supposed to have, because of some internal problem. And who's making those command calls to the drive: is it the computer's BIOS, or the kernel/driver, or something in a cronjob such as an `hdparm` command or `smartctl` command? I'm particularly looking at the "ENABLE APM" command, since you mention (if I'm understanding post #1 correctly) that this problem seems to manifest in the morning, after (I am guessing) the machine (or drive) may have been in a low-power or power-off state during the night.
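One way to see which commands sit at the top of each logged error is to group the command rows by error number. The Python sketch below does that for two shortened blocks copied from the SMART error log earlier in the thread. `commands_by_error` is a hypothetical helper, and reading the first row under each error as the newest command is an assumption based only on the timestamps in this particular output.

```python
import re
from collections import Counter

# Two shortened blocks copied from the SMART error log earlier in the thread.
SAMPLE = """\
Error 2295 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
  ef 05 fe 00 00 00 40 00      04:22:13.125  SET FEATURES [Enable APM]
  c4 00 08 02 04 00 a1 00      04:22:13.063  READ MULTIPLE
Error 2294 occurred at disk power-on lifetime: 10341 hours (430 days + 21 hours)
  c8 00 08 01 00 00 a0 00      04:22:12.688  READ DMA
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:22:12.688  SET FEATURES [Set transfer mode]
"""

ERROR = re.compile(r"^Error (\d+) occurred")
# Eight hex register bytes, a Powered_Up_Time stamp, then the command name.
COMMAND = re.compile(
    r"^\s*(?:[0-9a-f]{2}\s+){8}(?:\d+d\+)?\d{2}:\d{2}:\d{2}\.\d{3}\s+(.+?)\s*$"
)


def commands_by_error(text):
    """Map each error number to the command names listed under it, in order."""
    result = {}
    current = None
    for line in text.splitlines():
        m = ERROR.match(line)
        if m:
            current = int(m.group(1))
            result[current] = []
            continue
        m = COMMAND.match(line)
        if m and current is not None:
            result[current].append(m.group(1))
    return result


by_error = commands_by_error(SAMPLE)
# Tally the first (newest) command listed under each error.
first_commands = Counter(cmds[0] for cmds in by_error.values() if cmds)
for name, n in first_commands.items():
    print(f"{n} x {name}")
```

On the sample this separates the aborted "Enable APM" request from the READ DMA that was logged with ICRC, which may be worth keeping apart when deciding what to test first.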

Generally speaking, what are these cronjobs all about? Is there anything in any of them that directly sends commands to the drive, like `hdparm` or `smartctl`? Maybe one of the crons is trying to set some drive feature (see below) before doing a backup or something.

If nothing else at all has changed recently anywhere within the system, such as an OS upgrade, a kernel change, etc., and this just started happening for no apparent reason (only the past few days), and never used to happen, then I'd be getting prepared to have to replace that drive. But, you might do some more testing first:

--disable the APM thing in the BIOS so the drive doesn't spin down. See if the problem goes away.
--manually force-execute your cronjobs, and see if a particular one triggers the problem. If so, check out that cronjob more closely.
--use `hdparm` or `smartctl` to test enabling and disabling of available features of the drive, like APM, DMA mode, transfer mode, etc. and see if the error occurs.
--run a full/long SMART test on the drive. See `smartctl` man page re: the -t option or --test=long
--Of course, while doing these things, keep an eye on your kernel log or whatever log that was in your first post.

Let us know what turns up if anything.
 
Old 09-30-2010, 08:48 AM   #6
Harkov
Member
 
Registered: May 2004
Distribution: Ubuntu 10.04.1 LTS
Posts: 38

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by djsmiley2k
I don't want to worry you, and I'm not an expert on SMART, but to me that looks like the HD is either failing, or is in a state of "can fail at any time". All those pre-fail warnings are what is throwing this up for me.

Do you have good, solid backups elsewhere? If not, then now is a good time to start thinking about some.
Thank you for your concern. I've moved critical data from the drive. However the SMART values are, to my understanding, nothing to worry about. All pre-fail indicators are at their best possible value except maybe for spin up time. Normally values decrease once errors occur. At least that's my understanding of how SMART works.
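A minimal sketch of that reading of the table, in Python: it parses a few rows copied from the `smartctl -a` output above and lists any attribute whose normalized VALUE has dropped to or below its THRESH. The helper names (`parse_attributes`, `failing`) are made up for illustration, and treating a THRESH of 0 as "cannot fail" is an assumption about how such attributes are usually interpreted.

```python
import re

# Four rows copied from the smartctl attribute table above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0007   065   054   000    Pre-fail  Always       -       6016
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       18
"""

# Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+\S+\s+\S+\s+(.*)$"
)


def parse_attributes(text):
    """Turn attribute rows into dictionaries; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            ident, name, value, worst, thresh, kind, raw = m.groups()
            rows.append({
                "id": int(ident), "name": name, "value": int(value),
                "worst": int(worst), "thresh": int(thresh),
                "type": kind, "raw": raw.strip(),
            })
    return rows


def failing(rows):
    """Names of attributes whose normalized value is at or below the threshold."""
    # Assumption: a threshold of 0 means the attribute can never be flagged.
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


rows = parse_attributes(SAMPLE)
print("failing attributes:", failing(rows))
for r in rows:
    print(f'{r["name"]:24s} value={r["value"]:3d} thresh={r["thresh"]:3d} raw={r["raw"]}')
```

For these rows the list of failing attributes is empty, which matches the PASSED self-assessment. The sketch says nothing about raw counters such as UDMA_CRC_Error_Count (raw value 18), which have to be judged separately.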

Quote:
Originally Posted by GrapefruiTgirl
Certainly could be. Depends what-all is being done in these cronjobs. If there's something in one or more of the cronjobs that is messing up (confusing) this drive, then it would stand to reason that after the crons are done, the drive is borked up.

I find it interesting:
Code:
  ef 05 fe 00 00 00 40 00      04:22:13.125  SET FEATURES [Enable APM]
  c4 00 08 02 04 00 a1 00      04:22:13.063  READ MULTIPLE
  c4 00 08 21 00 00 a0 00      04:22:13.063  READ MULTIPLE
  c4 00 08 09 00 00 a8 00      04:22:13.063  READ MULTIPLE
  c4 00 08 1d 00 00 ac 00      04:22:13.063  READ MULTIPLE
<-- snip -->
  c8 00 08 01 00 00 a0 00      04:22:12.688  READ DMA
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:22:12.688  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
  ec 00 00 00 00 00 a0 00      04:22:12.688  IDENTIFY DEVICE
Above are two of the five sequences of commands given to the drive before it logged an error. They appear to be listed newest to oldest. Common to all five blocks of messages is the "SET FEATURES" call. I wonder if something is trying to set features of the drive which it is incapable of doing, or the drive itself is failing to set features that it is supposed to have, because of some internal problem. And who's making those command calls to the drive: is it the computer's BIOS, or the kernel/driver, or something in a cronjob such as an `hdparm` command or `smartctl` command? I'm particularly looking at the "ENABLE APM" command, since you mention (if I'm understanding post #1 correctly) that this problem seems to manifest in the morning, after (I am guessing) the machine (or drive) may have been in a low-power or power-off state during the night.

Generally speaking, what are these cronjobs all about? Is there anything in any of them that directly sends commands to the drive, like `hdparm` or `smartctl`? Maybe one of the crons is trying to set some drive feature (see below) before doing a backup or something.

If nothing else at all has changed recently anywhere within the system, such as an OS upgrade, a kernel change, etc., and this just started happening for no apparent reason (only the past few days), and never used to happen, then I'd be getting prepared to have to replace that drive. But, you might do some more testing first:

--disable the APM thing in the BIOS so the drive doesn't spin down. See if the problem goes away.
--manually force-execute your cronjobs, and see if a particular one triggers the problem. If so, check out that cronjob more closely.
--use `hdparm` or `smartctl` to test enabling and disabling of available features of the drive, like APM, DMA mode, transfer mode, etc. and see if the error occurs.
--run a full/long SMART test on the drive. See `smartctl` man page re: the -t option or --test=long
--Of course, while doing these things, keep an eye on your kernel log or whatever log that was in your first post.

Let us know what turns up if anything.
Problems started about 36 hours after a kernel update. Before that the drive would just spin down and come up properly as defined in the BIOS settings. Unfortunately I was unable to find the changelog for that update.
Code:
Start-Date: 2010-09-27  17:13:38
Install: linux-headers-2.6.32-25 (2.6.32-25.44), linux-headers-2.6.32-25-generic-pae (2.6.32-25.44), linux-image-2.6.32-25-generic-pae (2.6.32-25.44)
Upgrade: linux-image-generic-pae (2.6.32.24.25, 2.6.32.25.27), linux-generic-pae (2.6.32.24.25, 2.6.32.25.27), linux-headers-generic-pae (2.6.32.24.25, 2.6.32.25.27)
End-Date: 2010-09-27  17:14:49
I was unable to reproduce the problem running the cronjobs (which are pretty standard). Also hdparm can set the drive to sleep and standby state without any problems.

So the BIOS power-down feature for the hard drives has now been disabled. If that turns out to be the problem I'll look into a software solution to power down the hard drives.

Thanks for your help and I will report back tomorrow whether the system made it through the night.
 
Old 09-30-2010, 08:58 AM   #7
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
The way I understand the SMART output, I too would say that the drive's general health as reported by SMART looks OK. All values appear to be pretty good.

As for that kernel update you mention: keep that high on your list of potential reasons for this. If the machine had been rebooted immediately after the upgrade finished, I would expect this problem to have begun to materialize less than 36 hours after the upgrade (like maybe after the first night). On the other hand, if the machine had not been rebooted immediately after the upgrade, but maybe a day or so later instead, then that could account for the 36 hour delay.

Or, if this problem happens less frequently than once every 24-48 hours, like on an inconsistent basis maybe every 1-5 days or so, then again, that kernel upgrade could still be the culprit -- maybe a bug or typo in the kernel somewhere. The only way to verify would be thorough testing/evaluation and monitoring of logs for a week or more with the new kernel, followed by a week or more of running on the previous kernel, by rolling back that upgrade. If this turns up any evidence that the new kernel produces the symptoms but the old kernel does not, then I'd suggest a bug report to Launchpad.

Good luck again! Let us know what you learn.
 
Old 09-30-2010, 09:45 AM   #8
TobiSGD
Moderator
 
Registered: Dec 2009
Location: Germany
Distribution: Whatever fits the task best
Posts: 17,148
Blog Entries: 2

Rep: Reputation: 4886
Before testing other things, I would recommend testing the hard drive with the manufacturer's diagnostic tool. I have seen many hard drives that did not report errors over the S.M.A.R.T. interface but were actually failing, which was only discovered by using those tools. For Samsung you should use the HUTIL tool.
 
Old 10-01-2010, 07:52 AM   #9
Harkov
Member
 
Registered: May 2004
Distribution: Ubuntu 10.04.1 LTS
Posts: 38

Original Poster
Rep: Reputation: 15
Well, no problems so far. Apparently powering down the drive and then coming up again was the trigger for the problems. Of course this isn't really a solution since the drive doesn't power down at all now but at least the system is stable again.

I still haven't been able to find any changelogs for the kernel though.
 
Old 10-01-2010, 11:37 AM   #10
Valery Reznic
ELF Statifier author
 
Registered: Oct 2007
Posts: 676

Rep: Reputation: 137
My system once misspelled a hard drive name too.

It turned out that the culprit was the cable (or maybe the IDE controller); I don't remember exactly which of them.
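The misspelled model name from the kernel log in the first post ('SAMSUNG SP1614N' reported as 'QAMSUNE QP1614L') can be compared character by character. The Python sketch below is only an illustration (`char_diffs` is a made-up helper); it XORs the character codes to show which bits differ.

```python
def char_diffs(expected, observed):
    """Return (index, expected_char, observed_char, xor_mask) for each mismatch."""
    return [
        (i, a, b, ord(a) ^ ord(b))
        for i, (a, b) in enumerate(zip(expected, observed))
        if a != b
    ]


# Both strings are taken from the "model number mismatch" line in the first post.
expected = "SAMSUNG SP1614N"
observed = "QAMSUNE QP1614L"

diffs = char_diffs(expected, observed)
for i, a, b, mask in diffs:
    print(f"pos {i:2d}: {a!r} -> {b!r}  xor=0x{mask:02x}")
```

Every mismatch differs in the same single bit (0x02), and all of them fall on even character positions. That kind of regular pattern would fit a flaky cable or connector better than random corruption, though it does not prove it.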
 
Old 11-30-2010, 02:24 PM   #11
Harkov
Member
 
Registered: May 2004
Distribution: Ubuntu 10.04.1 LTS
Posts: 38

Original Poster
Rep: Reputation: 15
After almost two months without problems it started again. After Valery's reply I noticed the drive was on a different cable than the other two. I replaced the cable; I hope that solves the issue. I also tried to switch the cables on the motherboard buses, but that resulted in problems with the master/slave configuration, which I really don't want to get into today.
 
  

