dma_timer_expiry: dma status

PingFloyd · 08-05-2006, 08:15 AM

I've been searching the net for days for info about the following errors showing in system logs.

Code:

Aug  5 00:54:52 localhost kernel: hdc: dma_timer_expiry: dma status == 0x61
Aug  5 00:55:06 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
Aug  5 00:55:06 localhost kernel: hdc: DMA timeout error
Aug  5 00:55:06 localhost kernel: hdc: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 00:55:06 localhost kernel:
Aug  5 00:55:06 localhost kernel: hdd: status error: status=0x51 { DriveReady SeekComplete Error }
Aug  5 00:55:06 localhost kernel: hdd: status error: error=0x04Aborted Command
Aug  5 00:55:06 localhost kernel: hdd: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 00:55:06 localhost kernel: hdd: status error: error=0x00
Aug  5 00:55:08 localhost kernel: hdd: drive not ready for command
Aug  5 00:58:59 localhost kernel: hdc: dma_timer_expiry: dma status == 0x61
Aug  5 00:58:59 localhost kernel: hda: dma_timer_expiry: dma status == 0x21
Aug  5 00:59:11 localhost kernel: hdc: DMA timeout error
Aug  5 00:59:11 localhost kernel: hdc: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 00:59:11 localhost kernel:
Aug  5 00:59:11 localhost kernel: hdd: status error: status=0x51 { DriveReady SeekComplete Error }
Aug  5 00:59:11 localhost kernel: hdd: status error: error=0x04Aborted Command
Aug  5 00:59:11 localhost kernel: hdd: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 00:59:11 localhost kernel: hdd: status error: error=0x00
Aug  5 00:59:11 localhost kernel: hdd: drive not ready for command
Aug  5 00:59:13 localhost kernel: hdc: status error: status=0x59 { DriveReady SeekComplete DataRequest Error }
Aug  5 00:59:13 localhost kernel: hdc: status error: error=0x04 { DriveStatusError }
Aug  5 00:59:13 localhost kernel: hdc: no DRQ after issuing WRITE
Aug  5 00:59:13 localhost kernel: hdd: status error: status=0x51 { DriveReady SeekComplete Error }
Aug  5 00:59:13 localhost kernel: hdd: status error: error=0x04Aborted Command
Aug  5 01:00:21 localhost kernel: hdd: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 01:00:21 localhost kernel: hdd: status error: error=0x00
Aug  5 01:00:21 localhost kernel: hdd: drive not ready for command
Aug  5 01:00:23 localhost kernel: hdc: status error: status=0x59 { DriveReady SeekComplete DataRequest Error }
Aug  5 01:00:23 localhost kernel: hdc: status error: error=0x04 { DriveStatusError }
Aug  5 01:00:23 localhost kernel: hdc: no DRQ after issuing WRITE
Aug  5 01:00:23 localhost kernel: hdd: status error: status=0x51 { DriveReady SeekComplete Error }
Aug  5 01:00:23 localhost kernel: hdd: status error: error=0x04Aborted Command
Aug  5 01:00:23 localhost kernel: hdd: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 01:00:23 localhost kernel: hdd: status error: error=0x00
Aug  5 01:00:23 localhost kernel: hdd: drive not ready for command
Aug  5 01:00:25 localhost kernel: hdc: status error: status=0x59 { DriveReady SeekComplete DataRequest Error }
Aug  5 01:00:25 localhost kernel: hdc: status error: error=0x04 { DriveStatusError }
Aug  5 01:00:25 localhost kernel: hdc: no DRQ after issuing WRITE
Aug  5 01:05:14 localhost kernel: hdd: status error: status=0x51 { DriveReady SeekComplete Error }
Aug  5 01:05:14 localhost kernel: hdd: status error: error=0x04Aborted Command
Aug  5 01:05:14 localhost kernel: hdd: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 01:05:14 localhost kernel: hdd: status error: error=0x00
Aug  5 01:05:14 localhost kernel: hdd: drive not ready for command
Aug  5 01:05:16 localhost kernel: hdc: status error: status=0x59 { DriveReady SeekComplete DataRequest Error }
Aug  5 01:05:16 localhost kernel: hdc: status error: error=0x04 { DriveStatusError }
Aug  5 01:05:16 localhost kernel: hdc: no DRQ after issuing WRITE
Aug  5 01:05:16 localhost kernel: hdd: status error: status=0x51 { DriveReady SeekComplete Error }
Aug  5 01:05:16 localhost kernel: hdd: status error: error=0x04Aborted Command
Aug  5 01:05:16 localhost kernel: hdd: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Aug  5 01:05:16 localhost kernel: hdd: status error: error=0x00
Aug  5 01:05:16 localhost kernel: hdd: drive not ready for command

I do know that other people have encountered errors like this before. Keep in mind that my system has been completely stable and has NOT frozen or crashed, ever. It seems that usually when people get these sorts of errors, their system freezes or crashes. I looked through my logs and noticed these errors show up in logs dating back about a month or so. They seem to happen sporadically, usually about 2-3 times a day.
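
For anyone trying to read the status bytes in the log above, here is a minimal Python sketch that decodes the IDE status register into the flag names the kernel prints. The names DriveReady, SeekComplete, DataRequest and Error come straight from the log; the names for the remaining bits (Busy, DeviceFault, CorrectedError, Index) are an assumption based on the usual ATA status bit layout, and `decode_status` is just a hypothetical helper name.

```python
# Bit masks for the ATA/IDE status register, highest bit first.
# Names not seen in the log above are assumptions (see lead-in).
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the names of all flags set in a status byte."""
    return [name for mask, name in STATUS_BITS if value & mask]


# The three status values that appear in the log.
for raw in (0x58, 0x51, 0x59):
    flags = " ".join(decode_status(raw))
    print("status=0x%02x { %s }" % (raw, flags))
```

Decoding 0x58, 0x51 and 0x59 this way gives the same flag lists the kernel shows in braces, which is a quick sanity check that the table is lined up correctly.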

First off some info about my system.
Debian Sarge
Stock Debian kernel 2.6.8-3-686
Hardware:
PII 300Mhz 128MB DDR DRAM
P2L97 Asus Motherboard (if my memory serves me. Haven't had a chance to crack open the case to double-check. I'll have to do that next time I get a chance.)

One person suggested that I check the cables, but I can rule this out since 2 different HDs on separate IDE channels are getting these errors. In fact, from the logs it looks as if all HDs in this system are getting DMA-related errors.

I have a Quantum Fireball ST 6.4 GB ATA as master (/dev/hda) and a Western Digital WDC 32500H as slave on the primary IDE channel (/dev/hdb), a Western Digital WDC AC31200F ATA as master (/dev/hdc) and some atapi cd-rom (sorry, can't remember the make and model off hand) (/dev/hdd) as slave on the secondary IDE channel.

I came across one handy little piece of info on a mailing list.
http://www.uwsg.iu.edu/hypermail/lin...03.3/0767.html

It sounds like what he is saying is that he either has to disable DMA or remove the slave drive (/dev/hdb) in order for the error to go away. However, in his case, it sounds as if the system hangs or the drives quit working until reboot when the error comes up. In my case, the system seems to go on functioning just fine no matter how many times the error is logged, from what I can tell. In fact, I didn't even know there was any sort of problem with DMA in my system until I decided to look carefully through my system logs one day.

I guess the solution sounds obvious though: disable DMA, or experiment with which drives I have installed and where. The only problem is that going either route will be less than ideal. I would imagine that I'm going to take a considerable performance hit if I disable DMA on my hard drives. If I have to take a hard drive out of the system, then I'm going to have to do some serious work to move my filesystems around, not to mention lose some precious storage that I need.

The main reason I am writing this message is because I'm curious if anyone has ever come across this issue or something similar and if they figured out an elegant way to rectify it. I am also wondering if I really need to worry about it at all, since my system seems to be stable and these errors seem to have existed since I first installed Debian on this system about a month or two ago. About the only thing that looks like a red flag, if it really can be considered one, is that e2fsck found an error and fixed it. e2fsck was only triggered because the filesystem had reached its maximum mount count as set by tune2fs.

Any thoughts or advice?

runnerfrog · 08-06-2006, 09:11 PM

I had the same problem once with Debian Woody a couple of years ago. I booted a live CD with the nodma option and checked every filesystem with bad-block checking turned on, which found some bad blocks. The joke is that it never worked well until... I opened the case and plugged the IDE cables (40- and 80-conductor) back in a little more firmly. Bearing in mind that you use a non-journalling filesystem (you mentioned e2fsck), you should _start_ by doing both things. But first: I vaguely recall something about 2.4 and some 2.6 kernels having bugs involving these disk messages. On LinuxQuestions you can visit:

http://www.linuxquestions.org/questi...d.php?t=171059

There are people better prepared to guide you with this, of course, but, if needed, you should do what that thread suggests about options compiled into the kernel. Good luck.

PingFloyd · 08-08-2006, 05:26 AM

Thanks for the reply and suggestions Runnerfrog.

runnerfrog · 08-14-2006, 07:30 PM

Hi, I'm back. Please excuse my crappy English. I've listened to some friends who have been through this problem and heard their ideas, and I also searched around. If you have a VIA chipset like they and I do, this Debian bug report page might be a little useful: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=336103
Where they say "/etc/mkinitramfs/modules" you might, or might not, need to replace that path with "/etc/initramfs-tools/modules", and then update with "update-initramfs -u".
Good luck, pal.

PingFloyd · 08-15-2006, 12:54 PM

My system is using the PIIX4 chipset (440LX) for its IDE. I've come across all sorts of info on the internet about this error message (I've been scouring tons of sites and mailing lists, mostly through Google). It seems that there are tons of people having this same problem with all sorts of different chipsets and hardware. It's hard to tell if it's the exact same issue, but all sorts of people get the "dma_timer_expiry" error along with combinations of other accompanying error messages. I've also noticed that it presents itself in different ways, though that might be more a matter of different people noticing different oddities in how their systems are behaving.

After reading a lot of what other users have written, I noticed that my system has periods where it quits responding for a little while (around 5-10 seconds or so) when these error messages are popping up in the logs.

Anyway, I've been experimenting a bit to try to figure out where the problem stems from.

I updated my BIOS. At first it seemed like this fixed it, but the error came back. It was one of those cross-your-fingers things followed by disappointment when it came back (much like everyone else plagued by this issue). My current experiment is that I've rearranged where my system uses swap. I originally had it on hdc1. Now I have the system set to swap to hdb1 first (given higher priority) and then hda1. The reason is that I got to thinking about things and noticed what seemed like a pattern: this error message would pop up during periods of a lot of swap activity (high activity on hdc). So I'm going to run a while without using hdc for swapping (I have it for archiving right now), cross my fingers, and see if I can draw a correlation. If this goes on for a long time without the error message, then it would seem to indicate that perhaps there is an issue with that drive, perhaps something like an incompatibility between it and Linux, or between it and my chipset under Linux. Either way, it will hopefully help me get closer to solving this.
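
To double-check which swap areas the kernel will prefer after a change like this, one option is to read /proc/swaps, whose last column is the priority (higher is used first). Below is a small Python sketch; the function name `parse_swaps` and the sample text are hypothetical, made up only to illustrate the hdb1-before-hda1 ordering described above, and the sizes in it are not real measurements from this system.

```python
def parse_swaps(text):
    """Parse /proc/swaps-style text into (device, priority) pairs,
    sorted so the area the kernel uses first comes first."""
    entries = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or malformed lines
        entries.append((fields[0], int(fields[4])))
    return sorted(entries, key=lambda e: e[1], reverse=True)


# Hypothetical sample in the layout of /proc/swaps (illustrative values).
SAMPLE = """Filename Type Size Used Priority
/dev/hda1 partition 131064 0 1
/dev/hdb1 partition 131064 0 2
"""

for device, priority in parse_swaps(SAMPLE):
    print(device, priority)

# On a live system, something like this should work instead:
# with open("/proc/swaps") as f:
#     print(parse_swaps(f.read()))
```

If /dev/hdc1 still shows up in that list, it is still in use for swap regardless of what was intended.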

I saw your link to that bug report. It looks like one of those guys is using the same chipset as me. I didn't really understand what they were saying about mkinitramfs. I think maybe my Debian installation doesn't have it, which made me wonder if that was changed since then. I searched the repositories with Synaptic for any packages relating to it, but nothing turned up. Debian can be hard to follow and confusing in some ways.

I can't help but think maybe one of the modules I'm using doesn't work well with my hardware or something. I'm going to look into that further. Actually, I'm contemplating trying another distro where I can tell more clearly what is going on. With Debian it can be hard to trace all the configuration files and what the system is doing behind the scenes. I guess that's what makes it user friendly, but when something goes wrong, it seems to make it harder to troubleshoot. Debian is nice and easy to maintain, but its automation makes it hard to take direct control of things.

I'm thinking about giving Arch or Slackware a try since they sound a little more straightforward. I'll probably give Arch a shot since it sounds like it has repositories for installing packages over the internet.

Thanks for getting back to me. I'll keep you updated if anything changes, as well as post if I figure out something that seems to fix or change this error for me. Maybe between my findings and others we can figure out what the cause of it is.

One question. I'm running Debian Sarge with their 2.6.8-3-686 pre-compiled kernel. I'm wondering if what they were discussing in that bug report applies to my situation. I'm a bit confused as to what they were talking about. I'm not sure, but I think my system uses mkinitrd instead of mkinitramfs. I don't quite understand the differences between the two, but I was left with the impression that they're different and a person would use one or the other but not both. I read all the man pages I could find that refer to them, but I still don't quite understand how they work. I guess they act as a ramdisk image of one's root file system. Is this correct? A lot of the man pages are hard to understand.

stress_junkie · 08-15-2006, 01:05 PM

Does your motherboard report S.M.A.R.T. disk errors to the operating system? I used to get a lot of disk errors through this function. I turned it off on the motherboard BIOS over a year ago. My disks were and are fine. Remember to do backups and you will be prepared if the disks really are going bad.

PingFloyd · 08-15-2006, 01:41 PM

I know that one of the HDs supports the S.M.A.R.T. feature set. When you mentioned "motherboard report S.M.A.R.T. disk errors to the operating system", did you mean that errors related to that were showing up in system logs? I haven't seen any error messages relating to S.M.A.R.T. in my logs that I can tell. Can you remember the gist of what such an error message would look like? Then I can comb through the logs and see if I missed anything like that.

I think I have all S.M.A.R.T. related settings disabled in my BIOS, but I probably should double-check that.

Thanks for the info.

stress_junkie · 08-15-2006, 06:37 PM

Yes, the system logs had disk errors. I believe that they looked a lot like the ones in your system log. Anyway, I originally had the BIOS report disk errors to the O.S. and in Linux I had the S.M.A.R.T. system daemon running. I'm pretty sure it was a single-function daemon. So I turned off the daemon and I changed the motherboard settings to not report these disk errors. I've been happy ever since. Keep in mind that I do a backup on any day that I've done a lot of work. A surprise disk failure wouldn't be a problem for me.

iive · 08-26-2006, 06:59 AM

I'm having a very similar problem.

However, I have the feeling that it appears only when both IDE channels are used in parallel.

The higher the transfer rate on both channels, the higher the chance of triggering the problem (and filling the logs with DMA errors followed by a reset to PIO mode).

You can try to run read-only badblocks on both disks and see if that will trigger it.
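
As a rough sketch of that test, the Python below starts one badblocks scan per disk at the same time so both IDE channels are busy together. It assumes badblocks from e2fsprogs is installed and that the script is run as root; badblocks does a read-only scan by default, and -s and -v only add progress and verbose output. The device names are the ones from this thread, and `build_commands` and `run_parallel` are hypothetical helper names.

```python
import subprocess


def build_commands(devices):
    """One read-only badblocks command per device.
    No -n or -w flag is passed, so nothing is written to the disks."""
    return [["badblocks", "-sv", dev] for dev in devices]


def run_parallel(devices):
    """Start all scans at once, then wait for each and
    return their exit codes in the same order as the devices."""
    procs = [subprocess.Popen(cmd) for cmd in build_commands(devices)]
    return [proc.wait() for proc in procs]


# Needs root and real devices, so it is left commented out here:
# print(run_parallel(["/dev/hda", "/dev/hdc"]))
```

Watching the kernel log while the scans run should show fairly quickly whether parallel load on both channels is enough to bring the errors out.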

At first I thought it might be a problem with my (quite old) VIA KT133A chipset, but here I see it on Intel, so the chance that this is a problem with the kernel is quite high.

Oh, BTW unmask_irq=0 doesn't help. Both channels have separate irq's (14&15) that are not shared with other devices...
I'm running Slackware-current, with latest 2.6.17.10 kernel compiled.

I do not have SMART in the BIOS. When I run smartctl -a /dev/hda, I get zero raw read errors, zero reallocated sectors (write errors), zero seek errors (weird!) and zero UDMA CRC errors. I guess this indicates the drive has no problems and the 80-conductor cable is just fine.

I'm still googling and browsing sources. If I get something, I'll post it here.

PingFloyd · 08-27-2006, 12:38 AM

Update:

I think I fixed the problem. I haven't seen the error message in my logs for a long time.

I think it's a problem or limitation in my chipset or in how the kernel utilizes it, something in that relationship. I can't say for certain since there are still some stones left unturned before all possible culprits are eliminated. What made the error quit happening in my case was changing where I was swapping.

When I was plagued by this error, I was using a swap partition on /dev/hdc, the HD that is master on the secondary IDE channel. When I had swap handled by /dev/hda (primary master) and/or /dev/hdb (primary slave), the error didn't seem to come up. The thing is that I haven't gotten around to rearranging the hard drives between the different channels and master/slave positions to narrow things down further and eliminate other possible culprits and factors.

It seems the error would not come up no matter how I handled swap between both drives on the primary IDE channel. I tested with swap on /dev/hda, then on /dev/hdb, then also on both of those with different swap priorities and with the same swap priority. The error would not come up no matter what, as long as I didn't swap on /dev/hdc. That is good since it's some progress. At least it's giving me a general direction for figuring out what is responsible for it in my setup. And if I can eventually figure it out in my setup, comparing against other people's results with their setups should bring us closer to the true cause.

There is of course the possibility that I'm getting lucky or that it's coincidence, since those DMA errors seem to be rather intermittent. However, when I was getting the error it would usually come up at least twice a day. After no longer using /dev/hdc for swap I have yet to see it come up at all.
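
Since the argument here rests on how often the error shows up per day, it can help to count it rather than eyeball the logs. The Python sketch below tallies dma_timer_expiry lines per day and per drive. It assumes the syslog line format shown at the top of this thread; the function name `count_expiries` is hypothetical, and the sample lines are copied from that log.

```python
import re
from collections import Counter

# Matches e.g. "Aug  5 00:54:52 localhost kernel: hdc: dma_timer_expiry: ..."
PATTERN = re.compile(
    r"^(\w{3}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: (hd[a-z]): dma_timer_expiry"
)


def count_expiries(lines):
    """Count dma_timer_expiry messages keyed by (day, device)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            day = " ".join(match.group(1).split())  # "Aug  5" -> "Aug 5"
            counts[(day, match.group(2))] += 1
    return counts


# Lines taken from the log posted earlier in this thread.
SAMPLE = [
    "Aug  5 00:54:52 localhost kernel: hdc: dma_timer_expiry: dma status == 0x61",
    "Aug  5 00:55:06 localhost kernel: hda: dma_timer_expiry: dma status == 0x21",
    "Aug  5 00:55:06 localhost kernel: hdc: DMA timeout error",
    "Aug  5 00:58:59 localhost kernel: hdc: dma_timer_expiry: dma status == 0x61",
    "Aug  5 00:58:59 localhost kernel: hda: dma_timer_expiry: dma status == 0x21",
]

for (day, device), n in sorted(count_expiries(SAMPLE).items()):
    print(day, device, n)
```

Run against the full log files from before and after the swap change, a drop from a couple of hits a day to zero over several weeks would be much harder to put down to luck.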

However, there are a couple of things to keep in mind. I changed distros a few weeks ago, but I was still running the original distro for about a week of testing and having good results with /dev/hdc taken out of the picture. So I'm fairly certain that having swap on /dev/hdc played a part in causing the problem. Not necessarily the HD itself; I can't say that yet. There are still a lot of things to test, like moving the HDs around to determine whether it's an IDE cable issue, or specific to the drive, etc.

What turned me toward experimenting with the location of swap was that I started to see a pattern that seemed to indicate the error was popping up at times when there was heavy swapping going on (/dev/hdc had my swap partition during the time I was having the problem). It's probably not related to swap per se, but just to the fact that there was heavy disk activity on that particular drive. So now I'm left with trying to figure out if it's the drive, the chipset, the cabling, or using certain drives together in the system. In other words, there are still a few variables to test.

Anyway, I just wanted to share that. Maybe that will help others head in the right direction in their troubleshooting this issue. Hopefully between all of our results we'll eventually have this thing figured out.

Anyone still having this problem, you might try running your Linux system (all filesystems, including swap) entirely from just one HD. Run that way for a while and see if the error still comes up. If it doesn't, then try adding the rest of your drives back one by one, running for a period of time at each step. If and when the error comes back, you might try rearranging the drives to see if it goes away. This will give a closer indication of what's going on. If you've got some spare 80-conductor IDE cables lying around, you might try swapping those in first. I plan to get around to trying this myself, but right now my time is limited. If and when I do, I'll post the results.

I've also got some other ideas for testing this as well.

iive,

Not only that, but it seems like a whole bunch of systems with completely different chipsets are affected, not just Intel and VIA. I've seen posts all over the net from people with Promise controllers and a few others that I can't think of off the top of my head.

Anyway, I'll be sure to post any new things I come across and any new findings.

Thanks for sharing the info. Maybe we'll eventually get this thing narrowed down once and for all. I was pretty ecstatic when the error quit coming up by no longer using my /dev/hdc as swap. This helped eliminate a great deal of variables.

One thing to note is that I still have /dev/hdc plugged in completely (power and IDE cable in the same position they've always been). So there was no need to actually unplug it to test. All I had to do was stop using it for swap. I'm currently keeping an archive on it. I actually created a tar of my originally installed Debian system on it. The interesting thing is that the big long tarring process didn't seem to bring up that error.

pbhj · 09-04-2006, 07:50 PM

Me too... My system is an Athlon 1.1GHz with the VIA KT133 chipset.

My swap is at hdb2 and root is hdb1. Lilo runs from hda and is installed in the MBR, IIRC.

Weird!

Electro · 09-04-2006, 08:20 PM

Some distributions have IDE multimode (CONFIG_IDEDISK_MULTI_MODE) support disabled in the kernel. You may need to configure your kernel to enable it and compile it.
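
Before rebuilding anything, it may be worth checking how the running kernel was configured. The Python sketch below looks an option up in kernel config text. It assumes the distribution ships the config as /boot/config-<release>, which is common on Debian but not guaranteed everywhere; `option_state` is a hypothetical helper name and the sample text is made up for illustration.

```python
def option_state(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...),
    'n' if it is explicitly marked as not set, or None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == "# %s is not set" % option:
            return "n"
    return None


# Hypothetical config fragment, for illustration only.
SAMPLE = """CONFIG_BLK_DEV_IDEDISK=y
# CONFIG_IDEDISK_MULTI_MODE is not set
"""

print(option_state(SAMPLE, "CONFIG_IDEDISK_MULTI_MODE"))

# On a live system, something like this should work:
# import platform
# with open("/boot/config-" + platform.release()) as f:
#     print(option_state(f.read(), "CONFIG_IDEDISK_MULTI_MODE"))
```

A result of "n" or None would mean the option is off or not present in that kernel, in which case enabling it requires a rebuild as described above.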

A Pentium II does not have DDR. It actually has SDRAM. DDR came at the end of the Pentium III and the beginning of the Pentium 4.