Big regression between 13.1 -> 13.37--am I the only one?

storkus · 05-19-2011, 08:21 AM

Searching around, I couldn't find anyone talking about this so I wonder if it's something specific to my BIOS and/or chipset, an NVIDIA N67 aka C67 aka GeForce 7000M / nForce 610M IGP on an Acer 5520 laptop.

13.1 has always worked fine, though I needed NVIDIA's binary blob to get full performance from my video. Unfortunately a weird problem has cropped up with 13.37. I didn't notice it at first (when I wrote about my Nouveau woes), but noticed it tonight:

This happens only while running X, regardless of kernel (2.6.37.6 or 2.6.38.4) or video driver (Nouveau or the proprietary blob). With the laptop plugged in, if I pull the plug, the usual ACPI event is generated; however, the IDE/SCSI/SATA subsystem also proceeds to reset the drive, which disconnects it. Following this, the entire computer crashes. Since X is running, I haven't been able to see the console for any panic messages: it just freezes. If I have an xterm open as root, I can still type commands, but nothing happens, probably because the binaries can't be loaded from disk. As soon as I exit back to my normal user, that locks up too and I get nothing. The only way out is the ol' holding-the-power-button-for-4-seconds thing to force the machine off.

My next idea, when I have time, is to try compiling the latest 2.6.39 kernel with the defaults that Pat gave out; if that fails, then I'll start removing things until it starts working. But it would sure save a lot of time if any of you had ideas or, better, solved this already.

Thanks, Mike

business_kid · 05-19-2011, 10:05 AM

What you are describing is an ACPI issue.
acpid acts on events defined in /etc/acpi/events and runs the scripts relating to them in /etc/acpi.
Slackware installs a default script and that does things, but what or how I'm not too sure.

To debug this, run acpid with the -l option, which logs events to syslog, then tail the log:

Quote:

/usr/sbin/acpid -l
tail -f /var/log/messages

and do ACPI things: press the lid button, charge the battery, etc., and note the events. You can then create event handling scripts to handle each situation.
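
As a rough starting point for such a handler (a sketch, not Slackware's stock script), the function below sorts one event line into AC or battery. The name classify_ac_event is hypothetical, and so is the assumption that an ac_adapter event has four fields with the last one ending in 1 on mains and 0 on battery, so compare it against what acpid -l actually logs on your machine first.

```shell
#!/bin/sh
# Hypothetical handler helper. The event layout assumed here
# ("ac_adapter <device> <type> <state>") must be checked against real logs.

# Print "ac", "battery", or "unknown" for one acpid event line.
classify_ac_event() {
    # Split the event text into whitespace-separated fields.
    set -- $1
    case "${1:-}" in
        ac_adapter*)
            # Assumption: the fourth field ends in 1 on AC, 0 on battery.
            case "${4:-}" in
                *1) echo ac ;;
                *0) echo battery ;;
                *)  echo unknown ;;
            esac
            ;;
        *)
            echo unknown
            ;;
    esac
}
```

A handler script could then do something like `logger "power source: $(classify_ac_event "$*")"` and nothing else, which is enough to see whether the freeze still happens when the event is only logged.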

storkus · 05-19-2011, 02:33 PM

Business_Kid, tailing messages is exactly what I was doing, from an xterm as root. When I disconnect the power supply it shows 2 messages, one about AC power and the other the Battery. When I plug it back in, it shows them again--both times without actually giving the states, which I suppose the option you mentioned would do (haven't read the man page). What I have no clue about is why these events would be doing anything with the nforce IDE/SATA driver, and why it requires X to be running to do them...although one clue occurred to me about 10 minutes ago as I was resting my eyes a bit (didn't sleep a wink last night, partly due to this): at all times I've been running XFCE, which has various ACPI hooks in some of its daemons--the power manager in particular being suspect here. What would happen if I sub FVWM2 or even TWM so that no X-related ACPI daemons are running?

Again, though, the big question: how is user space causing kernel space to crash, especially forcing the IDE/SATA driver to do a hard reset (according to the log)? What is so different between 13.1 and 13.37 to cause this regression that no one else has apparently reported?

This is a tough one...

business_kid · 05-20-2011, 03:41 AM

That default script tries to figure out what's talking, IIRC. Better to take the two events and link them to something else. If you grab a Fedora or Debian acpi package, they bundle more event scripts than anyone (or everyone) needs.

acpid is normally run with no options. Try:

Quote:

/etc/rc.d/rc.acpid stop
/usr/sbin/acpid -l

which logs all events. To pin down your problem, try moving /etc/acpi/events/default out of the way and temporarily replacing it with a symlink to /bin/true. That will do nothing. See whether you still have the problem or whether that cures it. I found XFCE's power manager a pain because it did what it liked and I couldn't configure it, so I stopped it from running (XFCE/Settings/Sessions & Startup/Application Autostart).

My laptop handles thermal stuff (fan speed) automagically. I have the lid button, power button, & battery critical events all linked to a hibernate script (suspend is a PITA on my box). I cobbled together a script for brightness up and down, and left it at that.

storkus · 05-20-2011, 07:10 PM

Ok, thanks for the ideas. You also, perhaps inadvertently, gave me another one with your outside package idea:

With 13.1 I did something I've never done before and did NOT compile a custom kernel: it was working fine and I felt a bit lazy with plenty of memory to spare on this machine (unlike the Vista that was on it before). Therefore, if replacing XFCE with FVWM or TWM doesn't work, the next idea is to try compiling a newer kernel on 13.1 since everything works fine there (after seeing what acpid is throwing out, as you suggested). Being that 13.1 has XFCE 4.6.1 and 13.37 has 4.6.2, the differences should be minor (but I'll check the change log just to be sure). If that bombs, I'll just backtrack through kernel versions until I find the one where the problem starts happening, and then look at the kernel change logs for a clue about what they changed.

One final idea that literally just occurred to me as I write this is to temporarily move /var/log (or, perhaps better, tail it) to another device like a USB stick or SD card: if I'm lucky, the problem is specific to the ATA subsystem and not the entire SCSI I/O infrastructure and it'll still record logs.

Whatever I do, compiling kernels takes even longer on this machine than a clean wipe and install (Turion dual-core 1,900 MHz), so that's the last idea for sure; and I'm running out of time as I'm taking off for my folks' place over Memorial Day weekend for a week and it would be nice to take the laptop just in case (though I'm taking my N900 regardless).

Worst case scenario, I just go back to 13.1 until I get back.

I'll write here again when I know more, especially if I can capture logs--one crazy thought is to take pictures of the screen and post them!

business_kid · 05-21-2011, 03:24 AM

I'd suggest the best way to handle /var/log would be to make /var or /var/log a separate partition mounted from USB. Then if your ATA subsystem went down, there may be a chance it could still write there.
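
If repartitioning is too much bother, something along these lines could mirror the log once the stick is mounted. It is only a sketch: /mnt/usb and the mirror_lines name are made up for illustration, and calling sync after every line is slow but gives each message the best chance of reaching the device before a freeze.

```shell
#!/bin/sh
# Sketch: copy log lines to a second device and flush after each one.
# /mnt/usb is a hypothetical mount point for the USB stick or SD card.

# Append every line read from stdin to the file named in $1, then sync
# so the line is pushed out to the device right away.
mirror_lines() {
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$1"
        sync
    done
}

# Usage (as root, with the stick already mounted):
#   tail -F /var/log/messages | mirror_lines /mnt/usb/messages.log
```

This only helps if the USB or card-reader path survives whatever takes the ATA side down, so treat an empty file afterwards as a clue rather than a failure of the script.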

storkus · 05-21-2011, 09:50 PM

That's exactly what I had in mind, BK. As I said, I also have some SD cards lying around and the laptop has a Ricoh interface that's worked fine as long as I've owned the laptop, so that's a possibility too. The only possible catch I see is if any of the problems are in the SCSI subsystem, but all my testing so far says no.

Now let me get to that point: the idea of trying the lightweight WMs worked. FVWM 2.4 and TWM both had no problems; however, both XFCE and KDE caused the machine to freeze identically. I still haven't been able to figure out what gets started by both desktops that would cause this. Blueman was a possibility, but deleting it did nothing.

Funny thing, I did get enough log to see a little after the fact: when I finally get that log written to the USB stick or SD card I'll post it here in the next message.

Also: looking at the kernel config options for the SATA system, under ACPI for SATA (or something like that) it gives the kernel boot time option "libata.noacpi=1". I don't get the weird ATA reset messages with it, but the machine still freezes up.

Finally, something kinda similar was posted back in December:

http://www.linuxquestions.org/questi...-drive-851900/

Think it could be related?

storkus · 05-22-2011, 05:17 AM

Still no logs (they got wiped?!?), but a quick update before I rush to bed at 3am here. I compiled 2.6.39 and reduced the number of drivers to mostly just what I need, with the frequently used ones compiled in and the rest modules, and no initrd.

The good news: no crash on pulling the plug! Just the ACPI messages.

The bad news is several items, though:

1. xfce4-power-manager isn't running? And running manually doesn't seem to work. Weird.
May be related. I should try running KDE as well before I go and see what happens.

2. I compiled without framebuffer console support. I now don't have a console and had to touch-type "startx", which brought everything up normally. I've had enough of KMS and Nouveau--I'm going to the blob and not looking back til next year!

3. Power brick seems to be running much hotter than before. I'm guessing it's the power-usage regression everyone's been talking about. The processor is running at only 800 MHz (lowest frequency), but that doesn't seem to be making any difference.

But I'm done here tonight. Now that I have a pretty good idea of what to look for, I think I can streamline this all a bit and finally get those logs.

storkus · 05-22-2011, 05:41 AM

Ok, I *REALLY* need to go to bed and deal with this tomorrow (today, now). *BUT* I have the origin of the problem nailed down now: HAL. Sometime back, I don't know when, I deleted HAL, and that's why everything seemed to work above except that the power manager wouldn't start and XFCE itself looked kind of weird. Once I restored HAL, XFCE looked fine again, and the weird ATA messages returned.

The machine didn't crash, though...at least not at first. But as I was composing a message, preparing to upload the logs, I accidentally did <CTRL> <ALT> <F2> instead of <2> (I use Jove) and the machine promptly crashed.

I'm convinced now that if I look at the differences between the HAL (and maybe D-BUS) scripts and config files between 13.1 and 13.37 I'll nail this down once and for all.
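
For that comparison, a sketch like the following would list which files changed between two unpacked package trees. The directory names in the usage comment are placeholders for wherever the 13.1 and 13.37 packages get extracted, and changed_files is just an illustrative name.

```shell
#!/bin/sh
# Sketch: list files that differ between two extracted package trees.
# The directory names used in the usage comment are hypothetical.

# Print one line per file that differs or exists in only one tree.
changed_files() {
    # diff exits non-zero when the trees differ; ignore the status so a
    # caller running under "set -e" keeps going.
    diff -rq "$1" "$2" || true
}

# Usage, after unpacking each release's package into its own directory:
#   changed_files hal-13.1 hal-13.37
#   diff -ru hal-13.1/etc hal-13.37/etc | less
```

Slackware packages are plain tarballs, so running tar xf on each one inside an empty directory is enough to get the two trees to compare.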

storkus · 05-22-2011, 10:38 PM

Quick update: I came across "pm-utils" today. I mention it like this because, inside /usr/doc/pm-utils*/README.SLACKWARE there is a TODO about removing it--they've got *ZERO* (their (Robby's?) emphasis) e-mail about it. I wish I'd known it was there before, but I certainly do now!

Anyway, looking at the other READMEs, both from 1.3.0 in Slackware 13.1 and 1.4.1 in 13.37, some specific code that messes with drives and such was added in 1.3.1 and the 1.4 series--but it is not in 1.3.0. This sounds way too much like my problem to ignore, so I'm doing a clean install of 13.37 with only pm-utils missing. If it works, it looks like I'll have found my problem. If not, I'll get those logs out tonight before I go home from work.

Also see the pm-utils wiki's release notes section: http://pm-utils.freedesktop.org/wiki

UPDATE: I feel like a Mythbuster: **CONFIRMED!!!** No problem without pm-utils installed on a clean install, and it returned when I installed the package! W00T!! Now I just need those logs to see where this is happening--hopefully a script, but we'll see--and then I can file a real bug report, and perhaps a fix, so this doesn't happen to anyone else. Another idea is to try the git version, but I couldn't find anything in their Bugzilla entries about this problem.

Again, just a reminder: pm-utils 1.3.0 from Slackware 13.1 does NOT have this problem. 1.3.1 probably does, though I haven't tested it; that guess is based on the READMEs and the project wiki's release notes.