Greetings,
I'm looking for help with a bug that has been haunting my system for quite a long time now. The problem is so weird that I don't even know where to file a bug report, so if anybody could point me in the right direction, I'd be very thankful.
System:
IBM ThinkPad X40, Debian 5.0, kernel 2.6.28.5
Description:
After the display manager (kdm) has loaded, there's a chance that the system will lock up after a short time. It doesn't always happen, but roughly on every second boot, reboot, or resume. The system as a whole stops responding to anything, neither to keystrokes nor to the SysRq key combinations. Only a hard power-off (holding the power key) "helps", and after that the system boots without complications and always runs, no matter how long, until I power off, reboot, or suspend again. The bug isn't reliably reproducible, though.
There's never any identifiable trace of what happened in the log files under /var/log. I recently got a camera and used it to photograph the screen. The vertically distorted image is quite characteristic of this error.
http://bayimg.com/landlAAbK
http://bayimg.com/landNAabK
http://bayimg.com/lANDpAABk
The error can occur whenever kdm has been loaded. In my experience the freeze happens most often while the user's KDE session is being initialized, which corrupts the config files that are open at that very moment (I now always keep a complete backup of ~/.kde/share/apps/config around to handle this). But it happens nearly as often on the first action in KMail. Leaving the KDE login manager running without actually logging in triggers it too.
Switching to a virtual console with Ctrl+Alt+Fn doesn't help: as long as an X server is running in the background, the system still locks up.
Narrowing down:
There's a whole bunch of programs that get loaded at startup, so for a long time I thought it was caused by KMail (I was so sure that I even filed a bug report on the KMail Bugzilla ...). But after a stupid dormitory firewall forced me to use webmail, I found out that it happens without KMail, too. Debug builds of KMail (3.5.[7-10]) never revealed any useful information in the logs.
It definitely looks like a graphics bug, but it happens with the Intel 855GM (kernel drivers: 830 and 955; Xorg drivers: i810 and intel) as well as with the VIA UniChrome (both drivers), with or without DRM. I never even tried Compiz and all that eye candy. I never tried running without X at all, though ... but so far I haven't experienced the lockup in single-user mode.
I had the problem with XFS, ext[2-4], and other file systems (XFS is by far the worst mess after that kind of hang), and with and without encrypted partitions.
As far as I remember, it first occurred with a fresh install of SUSE 9.3. I've been using Debian since 4.0, and so far neither a new OS nor an upgrade has helped.
Since switching to Debian (and the divine make-kpkg!) I've been using my own kernels and have already tried dozens of options, all to no avail.
I'm almost certain that it is not a hardware issue, because it happened not only on the IBM X40 I use today but also on the Acer Aspire 1362 I had before. The two machines have no parts in common. (I bought the Acer new some years ago, and the error first occurred after about a year of use. Last summer I switched to the second-hand ThinkPad, and the error was there right after installing Debian.)
No difference between wireless and ethernet.
It's very improbable that the problem is caused by overheating. My X40 can run boinc-client for days without complications and compiling three applications at the same time doesn't lead to instability either.
Maybe it's related to USB, too: as a mobile user I rely heavily on external equipment at home, e.g. the Ultrabay, a USB mouse, USB HDDs, a USB DVD burner etc., all of which are present at boot time. During the week I mostly work in the library without these devices, and there I'm not sure I have experienced the lockup at all. There I'm also offline and work with another user account that doesn't run network applications like KMail or Kopete.
Conclusion:
What seems strangest to me is that the error simply began at some point (with SUSE 9.3 or 10.0, I think, but I don't recall whether this coincided with using a particular program) and persists to this day. Even stranger is the distorted screen, even though the problem is most probably not related to the graphics hardware at all. Therefore I guess it's a USB or network issue.
I'm using the notebook as a production system, so the lockups sometimes corrupt important data very badly; journaling file systems can't prevent all the losses. I'd really like to get rid of the problem, but so far I don't even know where to start. Please tell me which log files to post. Any help is appreciated, and thank you for any kind of recommendation.
PS: I have searched the net many times. From what I found, my error looks similar to the notorious "random system freeze", but neither is my configuration the same nor does the usual fix (disabling Compiz etc.) apply. There are many other more or less similar bug reports, but I have never read an exact description of my problem so far. There's one thread over at Ubuntu that comes very close, but the solution offered there is more than impractical from a laptop user's point of view.