[SOLVED] Suddenly, strange hardware(?) warnings from syslog...

irgunII · 02-14-2013, 08:17 PM

I've been running my slackware 14 on this hdd on a new MoBo with new (4GB) RAM for about a week now and everything has been working fine.

Yesterday, I came back in to see about 50 little windows with 'warnings' about hardware and stuff.

I thought maybe I'd made a strange/wrong setting in the UEFI BIOS, so I went in and set it to 'default'. Still happened about 10 to 15 minutes later...a bunch of those 'warnings' pop up real fast (almost all at once actually) and I have to click on each one to get rid of it.

So all day today I've been testing different configurations in the BIOS and it just keeps happening.

I decided to start up my backup hdd instead about 2 hours ago to see if it happens over there (I use luckybackup to put *everything - including hidden files - from my /home dir to the backup hdd so that it's no different than my main hdd other than a very few things/apps not installed). That hdd stayed up and there were no problems with it. Not one warning window or anything.

Here are a few of the 'warnings' from the syslog in my /var/log...

Feb 14 10:26:42 oogah kernel: [12900.000036] [Hardware Error]: MC0_ADDR: 0x00000000cb5c8c00
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD
[Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040004a000136
[Hardware Error]: MC0_ADDR: 0x00000000c0ce8c00
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD
[Hardware Error]: CPU:0 MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[Hardware Error]: MC2_ADDR: 0x000000008bba8800
[Hardware Error]: Bus Unit Error: SNP error during data copyback.
[Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040004a000136
[Hardware Error]: MC0_ADDR: 0x00000000c0938d80
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD
[Hardware Error]: CPU:0 MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[Hardware Error]: MC2_ADDR: 0x00000000c8378800
[Hardware Error]: Bus Unit Error: SNP error during data copyback.
[Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040004a000136
[Hardware Error]: MC0_ADDR: 0x00000000c0978800
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD

I can't tell head or tails what the warnings are. Anyone have any ideas? Some things to try to do?
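
For anyone trying to read the MC0_STATUS and MC2_STATUS lines in the log above, the bracketed flags can be recomputed from the raw hex value. The following is only a minimal sketch, assuming the generic x86 machine-check bit layout (Over = bit 62, UC = bit 61, MiscV = bit 59, AddrV = bit 58, PCC = bit 57) and an AMD-style ECC field in bits 46:45. The function names are made up for illustration, and the CPU in this thread may use a different layout. Under those assumptions, "CE" and "CECC" would mean the errors were corrected by the hardware rather than left uncorrected.

```python
# Bit positions in an MCx_STATUS value. Assumption: generic x86
# machine-check layout; the ECC field in bits 46:45 is AMD-style.
FLAG_BITS = {
    "Val": 63,
    "Over": 62,
    "UC": 61,
    "En": 60,
    "MiscV": 59,
    "AddrV": 58,
    "PCC": 57,
}


def decode_mc_status(status):
    """Return the flag bits and ECC type encoded in a raw MCx_STATUS value."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    ecc = (status >> 45) & 0x3
    if ecc == 0:
        decoded["ECC"] = None
    elif ecc == 2:
        decoded["ECC"] = "CECC"  # corrected ECC error
    else:
        decoded["ECC"] = "UECC"  # uncorrected ECC error
    return decoded


def summary(status):
    """Rebuild a bracketed flag string in the same order as the log lines."""
    d = decode_mc_status(status)
    fields = [
        "Over" if d["Over"] else "-",
        "UE" if d["UC"] else "CE",
        "MiscV" if d["MiscV"] else "-",
        "PCC" if d["PCC"] else "-",
        "AddrV" if d["AddrV"] else "-",
    ]
    if d["ECC"]:
        fields.append(d["ECC"])
    return "[" + "|".join(fields) + "]"


if __name__ == "__main__":
    # The two distinct status values that appear in the log.
    for raw in (0xD40040004A000136, 0x940040000000018A):
        print(hex(raw), summary(raw))
```

Run on the two status values from the log, this reproduces the same bracketed strings the kernel printed, which is a reasonable sanity check on the assumed bit positions.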

WiseDraco · 02-15-2013, 01:28 AM

I'm no specialist in this, but to me it looks like something with the processor or motherboard.
First I would check the RAM with one of the memtest programs: download UBCD, write it to a CD or USB flash drive, make it bootable, then boot from it and choose one of the memtests.
L2 cache, as I understand it, sits in the processor. Do you monitor temperatures and voltages via sensors? Are they all OK? If yes, I would start by taking the processor out of the mobo, checking that all the contacts are clean, without dust or debris, and seating it again. Refit the heatsink with fresh thermal paste. Reseat the RAM modules and look at the mobo's electrolytic capacitors: are they all OK? If yes, try turning it on again. If the warnings remain, try swapping in another PSU. If none of that helps, then I don't know; try swapping in another CPU and see whether that helps. If not, check the mainboard...

irgunII · 02-15-2013, 01:59 AM

Sorry...I already did a Memtest86 and the RAM is all good. The temp on the cpu is at 107F, which is nice and cool. Nothing wrong with the cpu or MoBo or RAM as the backup hdd runs just fine. I'm posting from it right now. I've been on this hdd for the past few hours now and nothing pops up on this hdd.

I did get Parted Magic and booted with it and did some tests and it looks like it's more than likely the hdd is going out, though the tests weren't conclusive that anything is or will happen soon. I have to figure since this hdd is working fine and the other keeps popping up 'warnings' that it's the hdd going out. Thank goodness for backing up!

Thank you though for your input.

WiseDraco · 02-15-2013, 02:07 AM

Very strange to me: the logs talk about the CPU and the L1 and L2 cache, and that has no connection to the HDD, as far as I understand.
Did you check that HDD with smartctl -a /dev/sda?

H_TeXMeX_H · 02-15-2013, 02:48 AM

Try running:
http://www.mersenne.org/freesoft/#source

Run it in mode 1 to test for CPU issues.

wildwizard · 02-15-2013, 03:10 AM

It is your CPU, but it's not necessarily faulty; it may just be a kernel incompatibility.

What you need to do is compile the very latest kernel from kernel.org and test it with that.

irgunII · 02-15-2013, 06:56 AM

@WiseDraco - No, but I will shortly and give the results.

@H_TeXMeX_H - Download the source or what? What is that anyway? It looks like something similar to BOINC just not distributed computing (which I'm running with seti@home and is working fine on the backup hdd).

@wildwizard - But why all of a sudden out of the blue after almost a week of being fine? And why isn't my backup hdd making it do the same thing? Not doubting you, it just doesn't make any sense to me and mentioning it is all.

whizje · 02-15-2013, 07:12 AM

Mersenne (Prime95, or mprime on Linux) is a program which searches for Mersenne prime numbers and is highly optimized, so your processor and memory will be tested extremely hard. If there is a problem with your processor or RAM it will give errors. Even computers which normally give no errors can give errors under this load, so it's a very good test.

irgunII · 02-15-2013, 07:43 AM

@whizje - Gotcha. Thanks. I used all the tests for the cpu that come on the latest Parted Magic, which were similar IIRC. The cpu was flying without error and doing well against the other cpus in the list.

Here's what I got for the #smartctl command on sda:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0003   162   154   021    Pre-fail  Always       -       2891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   051   051   000    Old_age   Always       -       36012
 10 Spin_Retry_Count        0x0012   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       507
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       279
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       507
194 Temperature_Celsius     0x0022   112   100   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

This test is also on Parted Magic, and it told me that a Raw Value of anything but zero for Raw_Read_Error_Rate, Reallocated_Sector_Ct and Reallocated_Event_Count means it may be time to start thinking about a new hdd.
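
A rough way to do the same check by script is to split each row of the attribute table on whitespace and look at the last column. This is only a sketch: the set of watched IDs (5, 196, 197, 198) is an illustrative assumption, not an official list, the function names are hypothetical, and the sample text is a subset of the rows posted above. Vendors also encode RAW_VALUE differently, so a nonzero number is a hint rather than a verdict.

```python
# Attribute IDs to watch. Assumption: an illustrative choice covering
# reallocated and pending sectors, not an official list.
WATCHED_IDS = {5, 196, 197, 198}

# A few rows copied from the smartctl output posted above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       3
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       2
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
"""


def parse_attributes(text):
    """Parse attribute rows into dicts, skipping headers and odd lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        try:
            rows.append({
                "id": int(parts[0]),
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": int(parts[9]),
            })
        except ValueError:
            # Skip rows whose numeric columns are not plain integers.
            continue
    return rows


def nonzero_watched(rows):
    """Names of watched attributes whose raw value is above zero."""
    return [r["name"] for r in rows if r["id"] in WATCHED_IDS and r["raw"] > 0]


def at_or_below_threshold(rows):
    """Names of attributes whose normalized value has reached its threshold."""
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


if __name__ == "__main__":
    rows = parse_attributes(SAMPLE)
    print("nonzero watched:", nonzero_watched(rows))
    print("at or below threshold:", at_or_below_threshold(rows))
```

On the sample rows this reports the two reallocation counters as nonzero and nothing at or below its threshold. Tracking whether those counts grow over time is probably more informative than any single reading.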

WiseDraco · 02-15-2013, 08:00 AM

I think the raw read error count is nothing to fear; I have that value at 1050 right now. Reallocated sectors are not good, but if the number is small and does not increase then, IMHO, it is nothing tragic.

Maybe try running HDD Regenerator on that disk? But I don't see any way the disk could cause those problems...?
http://www.dposoft.net/hdd.html

H_TeXMeX_H · 02-15-2013, 08:31 AM

Quote:

Originally Posted by irgunII

@whizje - Gotcha. Thanks. I used all the tests for the cpu that come on the latest Parted Magic, which were similar IIRC. The cpu was flying without error and doing well against the other cpus in the list.

Here's what I got for the #smartctl command on sda:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0003   162   154   021    Pre-fail  Always       -       2891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   051   051   000    Old_age   Always       -       36012
 10 Spin_Retry_Count        0x0012   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       507
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       279
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       507
194 Temperature_Celsius     0x0022   112   100   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

This test is also on Parted Magic, and it told me that a Raw Value of anything but zero for Raw_Read_Error_Rate, Reallocated_Sector_Ct and Reallocated_Event_Count means it may be time to start thinking about a new hdd.

I don't see anything to worry about there. If you still are concerned that it may be the HDD, you can run a SMART long test.

irgunII · 02-15-2013, 12:49 PM

Ran all three hdd tests (the two short ones and the 39-minute one), and the one I posted above was the only one with anything 'negative' to say.

Does it not make any sense to everyone else though that my main hdd starts, out of the blue one day, to get those 'warnings', yet when I restart the system and boot into my backup hdd nothing happens at all and the backup hdd runs like my main one did for almost a week? I understand that the warnings said L1 cache and such and that that is something to do with the cpu, but does anyone know how the cpu can be affected by one hdd and not another on the same system? It's really bugging the heck out of me and I hate not having a backup hdd and don't have the money to get another until next month.

Is it possible that maybe a SATA cable can make the cpu hiccup and burp and send out warnings? Or possibly the SATA port on the MoBo? I can't think of anything else (which doesn't mean much, heh).

H_TeXMeX_H · 02-15-2013, 12:57 PM

The L1 cache is on the CPU die, so I doubt anything can affect it, the HDD included.

However, if the HDD became corrupt, it could be software that is reporting it incorrectly, but the HDD seems fine.

WiseDraco · 02-15-2013, 01:53 PM

Quote:

Originally Posted by H_TeXMeX_H

The L1 cache is on the CPU die, so I doubt anything can affect it, the HDD included.

However, if the HDD became corrupt, it could be software that is reporting it incorrectly, but the HDD seems fine.

+1
The log file reports CPU / L1 / L2 errors, and in theory those have no connection to the HDD or RAM. Something like this might be caused (I think) by a bad PSU, the motherboard, or the CPU itself. The best approach would be to get hold of those parts and swap them with yours to see what happens.