Slackware: This forum is for the discussion of Slackware Linux.
Right, so I've got some strange messages at boot. Courtesy of a mixture (kdiff3) of dmesg and /var/log/dmesg, I bring you this log info. It shows hdb in use (root partition is hdb1, hdb2 is swap, hda1 is vfat: Windows, but I haven't booted it since the XP trial period ended 18+ months ago):
Then there's the fact that smartctl (with testing on both hda and hdb) doesn't report any errors. Observe:
Code:
root@ixthus:/etc/profile.d# smartctl --all /dev/hda
smartctl version 5.36 [i486-slackware-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Seagate U Series 5 family
Device Model: ST320413A
Serial Number: 5ED0ERBZ
Firmware Version: 3.54
User Capacity: 20,020,396,032 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 4
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Sep 5 00:04:58 2006 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 422) seconds.
Offline data collection
capabilities: (0x1b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 23) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000e   075   055   025    Old_age   Always       -       16125276
  3 Spin_Up_Time            0x0002   075   072   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       364
  5 Reallocated_Sector_Ct   0x0032   100   100   036    Old_age   Always       -       0
  7 Seek_Error_Rate         0x000e   082   060   030    Old_age   Always       -       187872984
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6578
 10 Spin_Retry_Count        0x0012   100   100   097    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   020    Old_age   Always       -       3306
194 Temperature_Celsius     0x0022   035   052   000    Old_age   Always       -       35
195 Hardware_ECC_Recovered  0x001a   085   055   000    Old_age   Always       -       192632326
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   100   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6567         -
# 2  Short offline       Completed without error       00%      5782         -
# 3  Extended offline    Completed without error       00%      5600         -
# 4  Short offline       Completed without error       00%      5601         -
# 5  Short offline       Completed without error       00%      5601         -
# 6  Short offline       Completed without error       00%      3915         -
# 7  Short captive       Completed without error       00%         0         -
Device does not support Selective Self Tests/Logging
root@ixthus:/etc/profile.d# smartctl --all /dev/hdb
smartctl version 5.36 [i486-slackware-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE family
Device Model: WDC WD400JB-00ENA0
Serial Number: WD-WCAD16396026
Firmware Version: 05.03E05
User Capacity: 40,020,664,320 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 5
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Sep 5 00:06:06 2006 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (1506) seconds.
Offline data collection
capabilities: (0x3b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 28) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   093   092   021    Pre-fail  Always       -       2716
  4 Start_Stop_Count        0x0032   098   098   040    Old_age   Always       -       2001
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4052
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1889
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       765         -
# 2  Short offline       Completed without error       00%      1004         -
# 3  Short offline       Completed without error       00%       805         -
# 4  Short offline       Interrupted (host reset)      20%       992         -
Device does not support Selective Self Tests/Logging
Also, I booted in Knoppix (v5 I think) and ran fsck.reiserfs on hdb1 (my root drive) and fsck -c on a few other partitions without getting any errors or warnings, no journal replays required (I think that's what it says?).
I can run "hdparm -d1 -k1" on both drives (e.g. on /dev/hdb), and using the hdparm tests I can confirm that DMA is activated, but the setting isn't restored on reboot.
There's also some other weirdness happening with my system. For instance, I can no longer shut down properly (I just get "system halted" and have to force a hard-off via the power button).
Recently I've upgraded to KDE 3.5.4, but I don't think that's related. I also tried a kernel upgrade to 2.6.17.11, which ultimately failed as I couldn't get nvidia installed. I have a GeForce2 and have to use a patch to compile for my current kernel:
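For what it's worth, one common way to make a setting like this stick is to re-apply it from a boot script. Below is a minimal sketch, not a tested fix. It assumes that /etc/rc.d/rc.local is where local boot commands go on this Slackware install, and that "hdparm -d" prints a line of the form "using_dma = 1 (on)"; check both on your own system. The names dma_is_on and ensure_dma are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of hdparm itself.

# Read "hdparm -d <device>" output on stdin; succeed if it reports DMA on.
# Assumes a line of the form " using_dma    =  1 (on)".
dma_is_on() {
    awk '/using_dma/ { found = 1; on = ($3 == 1) }
         END { exit ((found && on) ? 0 : 1) }'
}

# Re-enable DMA on a device only when it is currently reported as off.
ensure_dma() {
    dev="$1"
    if ! hdparm -d "$dev" | dma_is_on; then
        hdparm -d1 "$dev"
    fi
}

# Example call for a boot script (left commented out; needs root):
# ensure_dma /dev/hdb
```

If the kernel is switching DMA off again after hitting errors, this would only paper over the symptom, so treat it as a convenience rather than a cure.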
Code:
Linux ixthus 2.6.16.13pbhj #1 PREEMPT Fri May 5 21:37:32 BST 2006 i686 athlon i386 GNU/Linux
So, anyone any ideas? Other threads with similar errors have had responses along the lines of "drive is failing d00d, replace pronto". However, everything else says that the drive is fine?! I have checked the cables are firmly in place! If I hear nothing I guess I'll boot Knoppix and check the dmesg there and try using a different cable on hda.
http://www.mail-archive.com/linux-id.../msg00704.html is about an Ali chipset error but may relate. However, this kernel has run without me noticing this error before, so I find it unlikely that its appearance now is due to kernel bugs.
macemoneta, thanks for the reply. That's what I'd say, only the smartctl man pages say:
Code:
The conversion from Raw value to a quantity with physical units is not specified by the SMART standard. In most
cases, the values printed by smartctl are sensible. For example the temperature Attribute generally has its raw
value equal to the temperature in Celsius. However in some cases vendors use unusual conventions. For example
the Hitachi disk on my laptop reports its power-on hours in minutes, not hours. Some IBM disks track three
temperatures rather than one, in their raw values. And so on.
Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH".
If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed.
If the Attribute is a pre-failure Attribute, then disk failure is imminent.
Each Attribute also has a "Worst" value shown under the heading "WORST". This is the smallest (closest to
failure) value that the disk has recorded at any time during its lifetime when SMART was enabled. [Note however
that some vendors firmware may actually increase the "Worst" value for some "rate-type" Attributes.]
The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two
possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their
threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate
end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to
the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is
about to fail! It only has this meaning if the Attributes current Normalized value is less than or equal to
the threshold value.
If the Attributes current Normalized value is less than or equal to the threshold value, then the "WHEN_FAILED"
column will display "FAILING_NOW". If not, but the worst recorded value is less than or equal to the threshold
value, then this column will display "In_the_past". If the "WHEN_FAILED" column has no entry (indicated by a
dash: -) then this Attribute is OK now (not failing) and has also never failed in the past.
The table column labeled "UPDATED" shows if the SMART Attribute values are updated during both normal operation
and off-line testing, or only during offline testing. The former are labeled "Always" and the latter are
labeled "Offline".
What that says to me is:
1) the raw values have no real meaning
2) a state of "pre-fail" doesn't mean failure is coming
3) it says "failing" when it's failing
4) a current value of an attribute above the threshold is good
smartctl reports healthy, and the values are well above the thresholds. The drive is old, yes. But if nothing's actually failing, where/why are the errors being made?
What's your interpretation of those results? Thanks.
I'm sorry, I just don't agree. I've now run the Seagate tests (except the overwrite test) on this drive and they report that the drive is healthy.
So I still think that this is an error elsewhere. Anyone else want to back up macemoneta (perhaps with a little more reasoning) or suggest something else?
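The rule quoted from the man page (an attribute has failed when its normalized VALUE is less than or equal to THRESH) is easy to check mechanically. Here is a rough sketch, assuming the ten-column attribute table layout shown in the smartctl output earlier in this thread. failed_attributes is a made-up helper name, not something smartctl provides.

```shell
#!/bin/sh
# Hypothetical helper: read a smartctl attribute table on stdin and print the
# name of every attribute whose normalized VALUE is <= its THRESH.
# Columns assumed: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) { print $2 }'
}

# Example (needs root and a real drive, so left commented out):
# smartctl -A /dev/hda | failed_attributes
```

Fed the two attribute tables posted above, this should print nothing, since every VALUE there sits above its THRESH. It says nothing about the raw values, which the man page warns are vendor-specific.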
Well, DMA is failing right?
What other reason do you need to think that your hard drive might die any day now?
I wouldn't trust SMART, and I certainly wouldn't trust a failing drive.
If you consider your data worth a new drive, then replace it while you still can. If you just plain don't care about your data, you can save some money, but don't cry later if everything gets magically lost.
I would say that if the problem developed without any changes made by you (e.g. a recompiled kernel),
then macemoneta and raska are right, and it's time for a backup/new disk.
If you just recompiled your kernel, check this option:
CONFIG_IDEDISK_MULTI_MODE=y
If that was set to =y before and is not now, you could see those errors.
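A quick way to compare that option between the old and new kernel trees is to pull it out of each .config. This is just a sketch: config_value is a made-up helper, and the example paths are guesses at where the trees live, so adjust them.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a kernel config option from a
# .config-style file, or "unset" when it is absent or commented out.
config_value() {
    file="$1"
    option="$2"
    value=$(sed -n "s/^${option}=//p" "$file" | head -n 1)
    echo "${value:-unset}"
}

# Example (paths are assumptions; adjust to where your kernel trees live):
# config_value /usr/src/linux-2.6.16.13/.config CONFIG_IDEDISK_MULTI_MODE
# config_value /usr/src/linux-2.6.17.11/.config CONFIG_IDEDISK_MULTI_MODE
```

A disabled option normally appears in .config as a "# ... is not set" comment, which this reports as "unset".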
My interpretation of the results is that the drive is failing. Replace it as soon as you can.
Your interpretation of the SMART test results is flawed. The values you cite as being indicative of failure are perfectly normal... or every drive I own is about to die. The documentation for smartctl also disagrees with you.
It's been my experience that this particular behaviour is typically one of two things. The first is a bad 80-conductor IDE cable (or someone using a 40-conductor cable). The second, looking at the actual error, is that the kernel, for whatever reason, is requesting a sector that's not actually there. Look at the dmesg output where it's reporting how many sectors the drive has, and remember that the sector count starts with zero. I'm guessing this is screwing up when it's telling the drive to fetch its last sector, and you'll be seeing 39102337 in the line of output where it first inits the drive.
The latter can usually be handled by blanking the drive with dd if=/dev/zero of=/drivename and then rebooting and repartitioning correctly, but what tobyl mentions can also screw this up.
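To make the sector arithmetic concrete, here is a small sketch that turns a capacity in bytes into a sector count and the highest valid sector number. It assumes 512-byte sectors, and sector_count and last_sector are made-up helper names.

```shell
#!/bin/sh
# Hypothetical helpers for the sector arithmetic; 512-byte sectors assumed.

# Number of whole 512-byte sectors in a capacity given in bytes.
sector_count() {
    echo $(( $1 / 512 ))
}

# Highest valid sector number: numbering starts at 0, so it is count - 1.
last_sector() {
    echo $(( $1 / 512 - 1 ))
}

# Capacity smartctl printed for the Seagate in this thread:
# sector_count 20020396032
# last_sector  20020396032
```

Going by the 20,020,396,032 bytes that smartctl printed for the Seagate, that works out to 39102336 sectors, numbered 0 to 39102335.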
Thanks Tobyl, evilD, shade, raska et al. The thing is, I noticed the problem on my first boot into a new kernel. I haven't gone on to use that kernel since, as I can no longer install the nvidia.ko (old GeForce2 card) with it, so I'm still using the previous kernel. Nothing else has changed really. I've probably upgraded some reiserfs tools and such, but the kernel config is known to work ... I still can't work out why my system won't halt properly now. Perhaps it is an IDE controller problem, maybe indicative of an imminent mainboard failure. We'll see!
I've checked the seating of my cable but haven't swapped it out. No other errors appear, I can enable DMA (hdparm -d1 /dev/hda) and performance with the hdparm tests is then as expected.
I've read a couple dozen threads on this type of problem now. Most seem to end unresolved, with the kernel fix Tobyl mentions, or just with "I changed my HDD and it went away".
I'm not too bothered about that drive (hda) as, like I say, it's an old Windows drive that I've not had cause to boot for some time. It does have lilo in the MBR, however.
You're right, it is too obvious. DMA I think was already being enabled by the kernel on boot, it is now being disabled later on.
I thought I posted this already, but never mind.
I've swapped the HDDs now and used a new cable, so hda<->hdb. I used Knoppix to rewrite lilo.conf and install lilo in the MBR of the (now) hda (Western Digital). So it seems it's not the controller or the cable, as the errors now show on hdb. Curious.
The thing is, I thought I'd cracked it: no errors showed until I realised I'd not updated fstab. I updated fstab (again in Knoppix), rebooted, and bingo, disk access errors appear:
I'm currently thinking that it's more akin to Tobyl's post. This is from "fdisk -l". Is there some way to look directly at the block at which the error is reported?
Code:
Disk /dev/hda: 40.0 GB, 40020664320 bytes
240 heads, 63 sectors/track, 5169 cylinders
Units = cylinders of 15120 * 512 = 7741440 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1        1355    10243768+  83  Linux
/dev/hda2            1356        1491     1028160   82  Linux swap
/dev/hda3            1492        2846    10243800   83  Linux
/dev/hda4            2847        5169    17561880    5  Extended
/dev/hda5            2847        3123     2094088+   c  W95 FAT32 (LBA)
/dev/hda6            3124        5169    15467728+  83  Linux

Disk /dev/hdb: 20.0 GB, 20020396544 bytes
240 heads, 63 sectors/track, 2586 cylinders
Units = cylinders of 15120 * 512 = 7741440 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdb1   *           1         961     7265128+   c  W95 FAT32 (LBA)
/dev/hdb2             962        2585    12277440    f  W95 Ext'd (LBA)
/dev/hdb5             962        2585    12277408+   7  HPFS/NTFS
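On looking directly at a block: one option is to read a single sector with dd and dump it. Below is a hedged sketch, assuming 512-byte sectors. read_sector is a made-up helper name, and the read is read-only but needs root on a real device.

```shell
#!/bin/sh
# Hypothetical helper: write one 512-byte sector of a device (or image file)
# to stdout. Read-only; needs root for real block devices.
read_sector() {
    dev="$1"
    lba="$2"
    dd if="$dev" bs=512 skip="$lba" count=1 2>/dev/null
}

# Example: look at the last sector the drive itself claims to have
# (39102335, from smartctl's 20,020,396,032 bytes / 512 - 1):
# read_sector /dev/hdb 39102335 | od -A d -t x1 | head
```

Worth noting from the numbers already in this thread: fdisk prints 20020396544 bytes for the Seagate, which is exactly 512 bytes (one sector) more than the 20,020,396,032 bytes smartctl printed for the same drive. Whether that one-sector difference is behind the errors is only a guess, but a read of sector 39102336 would be a cheap way to find out.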
Any further ideas gratefully accepted.
If I can't get anywhere now I think I'll wait for the next release and do a reinstall (or perhaps Kubuntu instead?).
I had a similar problem about a year ago with a self-built system and it turned out to be a bad power supply. Have you checked all the voltages (either in your BIOS or from an OS monitoring tool)?
Thanks for the suggestion. I guess I need to move on to other causes (hardware) as the culprit. The system is quite old. Unfortunately I don't think it has any voltage readings in the BIOS, and I don't know of any software tools that can help with this. Do you?
Considering the situation now, I can't actually see why hdb (Seagate) is accessed at all. It's now used for partitions that are all set to noauto. I guess there's a lookup of the partition table on boot in case any partitions are to be mounted and just to see what's there?