[SOLVED] Suddenly, strange hardware(?) warnings from syslog...
I've been running my Slackware 14 on this hdd on a new MoBo with new (4GB) RAM for about a week now and everything has been working fine.
Yesterday, I came back in to see about 50 little windows with warnings about hardware and stuff.
I thought maybe I'd made a strange/wrong setting in the UEFI BIOS, so I went in and set it to 'default'. It still happened about 10 to 15 minutes later...a bunch of those warnings pop up real fast (almost all at once actually) and I have to click on each one to get rid of it.
So all day today I've been testing different configurations in the BIOS and it just keeps happening.
I decided to start up my backup hdd instead about 2 hours ago to see if it happens over there (I use luckybackup to put *everything - including hidden files - from my /home dir to the backup hdd so that it's no different than my main hdd other than a very few things/apps not installed). That hdd stayed up and there were no problems with it. Not one warning window or anything.
Here are a few of the warnings from the syslog in my /var/log...
Feb 14 10:26:42 oogah kernel: [12900.000036] [Hardware Error]: MC0_ADDR: 0x00000000cb5c8c00
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD
[Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040004a000136
[Hardware Error]: MC0_ADDR: 0x00000000c0ce8c00
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD
[Hardware Error]: CPU:0 MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[Hardware Error]: MC2_ADDR: 0x000000008bba8800
[Hardware Error]: Bus Unit Error: SNP error during data copyback.
[Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040004a000136
[Hardware Error]: MC0_ADDR: 0x00000000c0938d80
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD
[Hardware Error]: CPU:0 MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[Hardware Error]: MC2_ADDR: 0x00000000c8378800
[Hardware Error]: Bus Unit Error: SNP error during data copyback.
[Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040004a000136
[Hardware Error]: MC0_ADDR: 0x00000000c0978800
[Hardware Error]: Data Cache Error: during L1 linefill from L2.
[Hardware Error]: cache level: L2, tx: DATA, mem-tx: RD
I can't tell head or tails what the warnings are. Anyone have any ideas? Some things to try to do?
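For anyone trying to read the MCx_STATUS lines above, here is a rough sketch of how the bracketed flags map onto the 64-bit status value. The bit positions are an assumption on my part (the usual x86 machine-check layout, with bits 46:45 treated as the ECC field the way AMD CPUs of this era report it), and the function names are made up for illustration. It is not the kernel's own decoder, only a way to see that "CE" means a corrected error and "Over" means an earlier error was overwritten before it was read.

```python
# Sketch only: bit positions below are assumed from the common x86
# machine-check status layout, not taken from this thread.
OVER_BIT = 62    # an earlier error was overwritten before being read
UC_BIT = 61      # set = uncorrected error (UE), clear = corrected (CE)
MISCV_BIT = 59   # MCx_MISC register holds valid extra info
ADDRV_BIT = 58   # MCx_ADDR register holds a valid address
PCC_BIT = 57     # processor context corrupt
ECC_SHIFT = 45   # assumed two-bit ECC field at bits 46:45


def bit(status: int, pos: int) -> bool:
    """Return True if bit `pos` of `status` is set."""
    return bool((status >> pos) & 1)


def flag_string(status: int) -> str:
    """Rebuild the Over|CE|-|-|AddrV|CECC style string from a status value."""
    parts = [
        "Over" if bit(status, OVER_BIT) else "-",
        "UE" if bit(status, UC_BIT) else "CE",
        "MiscV" if bit(status, MISCV_BIT) else "-",
        "PCC" if bit(status, PCC_BIT) else "-",
        "AddrV" if bit(status, ADDRV_BIT) else "-",
    ]
    ecc = (status >> ECC_SHIFT) & 0x3
    if ecc:
        # Assumption: a value of 2 means a correctable ECC error.
        parts.append("CECC" if ecc == 2 else "UECC")
    return "|".join(parts)


if __name__ == "__main__":
    # The two status values that appear in the syslog excerpt.
    for value in (0xD40040004A000136, 0x940040000000018A):
        print(f"{value:#018x} -> [{flag_string(value)}]")
```

Fed the two values from the log, this should reproduce the same flag strings the kernel printed, which suggests every event shown was a corrected ECC error rather than an uncorrected one.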
I am not a specialist in this, but to me it looks like something with the processor or motherboard?
First I would check the RAM with one of the memtest programs: download UBCD, write it to a CD or USB flash drive, make it bootable, then boot from it and choose one of the memtests.
The L2 cache, as I understand it, sits in the processor. Do you monitor temperatures and voltages via sensors? Are they all OK? If yes, I would start by taking the processor out of the mobo, checking that all the contacts are clean, without dust and debris, and seating it again. Refit the heatsink with new thermal paste. Reseat the RAM modules and look at the mobo's electrolytic capacitors; are they all OK? If yes, try turning it on again. If the warnings remain, try swapping in another PSU. If none of that helps, then I don't know; try swapping in another CPU and see whether it helps, and if not, check the mainboard...
Sorry...I already did a Memtest86 and the RAM is all good. The temp on the cpu is at 107F, which is nice and cool. Nothing wrong with the cpu or MoBo or RAM as the backup hdd runs just fine. I'm posting from it right now. I've been on this hdd for the past few hours now and nothing pops up on this hdd.
I did get Parted Magic and booted with it and did some tests and it looks like it's more than likely the hdd is going out, though the tests weren't conclusive that anything is or will happen soon. I have to figure since this hdd is working fine and the other keeps popping up 'warnings' that it's the hdd going out. Thank goodness for backing up!
Very strange to me: the logs talk about the CPU and the L1 and L2 caches, and as I understand it that has no connection to the HDD.
Did you check that hdd with smartctl -a /dev/sda?
@WiseDraco - No, but I will shortly and give the results.
@H_TeXMeX_H - Download the source or what? What is that anyway? It looks like something similar to BOINC just not distributed computing (which I'm running with seti@home and is working fine on the backup hdd).
@wildwizard - But why all of a sudden out of the blue after almost a week of being fine? And why isn't my backup hdd making it do the same thing? Not doubting you, it just doesn't make any sense to me and mentioning it is all.
Mersenne is a program that searches for Mersenne prime numbers and is highly optimized, so it stresses your processor and memory very heavily. If there is a problem with your processor or RAM it will give errors. Even computers that normally show no errors can give errors under this load, so it's a very good test.
@whizje - Gotcha. Thanks. I used all the tests for the cpu that come on the latest Parted Magic, which were similar IIRR. The cpu was flying without error and doing well against the other cpus in the list.
Here's what I got for the #smartctl command on sda:
This test is on Parted Magic also, and it told me that a Raw Value of anything but zero for Raw_Read_Error_Rate, Reallocated_Sector_Ct or Reallocated_Event_Count means it's possibly time to start thinking about a new hdd.
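If it helps, the "anything but zero" check on those three attributes can be sketched as a small script. This assumes the usual ten-column attribute table that smartctl prints, with the raw value in the last column. The function name and the sample table are made up for illustration; the numbers in it are not from the disk in this thread.

```python
# Sketch only: assumes the common ten-column smartctl attribute table.
WATCHED = (
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
)

# Hypothetical sample output, NOT taken from the disk discussed here.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1050
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       3
"""


def nonzero_watched(smartctl_output: str, watched=WATCHED) -> dict:
    """Return {attribute_name: raw_value} for watched attributes whose raw value is not zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the header row does not.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            raw = fields[9]
            if raw.isdigit() and int(raw) != 0:
                found[fields[1]] = int(raw)
    return found


if __name__ == "__main__":
    for name, raw in nonzero_watched(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

To use it on a real disk you would paste in the attribute table from your own smartctl run in place of SAMPLE. A non-zero value only flags the attribute for a closer look; it does not by itself say the disk is failing.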
I think the raw read error rate is nothing to fear; I have that value at 1050 right now. Reallocated sectors are not good, but if the number is small and does not increase then, imho, it is nothing tragic.
You could try running HDD Regenerator on that disk (http://www.dposoft.net/hdd.html), but I don't see how the disk could cause those problems.
I don't see anything to worry about there. If you still are concerned that it may be the HDD, you can run a SMART long test.
Ran all three tests for hdd's and the one I posted above was the only one with anything 'negative' to say. (the two short ones and the 39 minute one)
Does it not make any sense to everyone else though that my main hdd starts, out of the blue one day, to get those 'warnings', yet when I restart the system and boot into my backup hdd nothing happens at all and the backup hdd runs like my main one did for almost a week? I understand that the warnings said L1 cache and such and that that is something to do with the cpu, but does anyone know how the cpu can be affected by one hdd and not another on the same system? It's really bugging the heck out of me and I hate not having a backup hdd and don't have the money to get another until next month.
Is it possible that maybe an sata cable can make the cpu hiccup and burp and send out warnings? Or possibly the sata plugin on the MoBo? I can't think of anything else (which doesn't mean much, heh).
The L1 cache is on the CPU die, so I doubt anything can affect it, the HDD included.
However, if the HDD became corrupt, it could be software that is reporting it incorrectly, but the HDD seems fine.
+1
The logfile reports CPU / L1 / L2 errors, and in theory those have no connection with the hdd or RAM. Something like that may be caused (I think) by a bad PSU, the motherboard, or the CPU itself. It would be best to get hold of those parts and swap them with yours to see what happens.