LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   Identifying unaccountable, BAD, system-wide latency? 2.6.25.9 & nVidia 173.14.09 (https://www.linuxquestions.org/questions/linux-general-1/identifying-unaccountable-bad-system-wide-latency-2-6-25-9-and-nvidia-173-14-09-a-651931/)

GrapefruiTgirl 06-26-2008 07:50 PM

Identifying unaccountable, BAD, system-wide latency? 2.6.25.9 & nVidia 173.14.09
 
First, since I do not know if this may lead anywhere, I'm posting in General for now.

OS:
Slackware 11

Suspects:
Kernel 2.6.25.9 (latest stable)
nVidia driver 173.14.09 (latest stable)

Possible (though unlikely) accomplices:
KDE
Firefox 2.0.0.14

The situation:
The other day, I updated to 2.6.25.4, from 2.6.24.4. As I usually do, I began the new kernel configuration with the .config from my previous kernel, made any necessary adjustments, looked into any new features/options, and built it. That went fine.
Then, I checked for a new nVidia driver (they had issues with >2.6.25 kernels for a while) and found they had released a new one, so I installed that.

I rebooted, and all seemed perfect for a while (I estimate about half an hour till I noticed the weirdness) until I found that the desktop environment was acting very slow while using Firefox, reading some PDF files, not much intensive for sure, but:

1) mousing over windows generally switches focus in about 1/2 second, but it took like 5 seconds (or I had to click to speed up focus a bit)
2) the clock was going in slow motion (by the time I noticed this it was 45 minutes behind, and the colons that flash seconds were flashing about every 10 seconds)
3) issuing any console commands took 20-30 seconds to respond. `Telinit 1` from init-4 took about 2 minutes.
4) Logout took about 45 seconds to produce the Greeter screen.
5) etc..

I started "TOP" from init 1, which was almost useless because it was so slow, but I didn't note any process running up the CPU, or really running at all. The only process actually 'running' was TOP, but the CPU 'id' indicator (what is this?) in TOP was at 100%. All processes in the list were near 0% CPU.

I rebuilt the kernel (2.6.25.9) about 5 times now, as well as alternately switching back to the previous nVidia driver, and each time, the system would boot normally, I would log in, and for a time think all was well, but suddenly, while doing next to nothing, it would happen again.

As of this writing, I had had the machine running in init 1 with HTOP up for 5 hours now, with no problem. HTOP reported only itself as the sole running process, and the latency is normal.
I installed the new nVidia driver again, started X and logged in, toodled around for a bit, and then started posting this thread.
My 'toodling' has included a look in the LKML for any obvious causes, as well as Google and LQ searches for any similar reports referencing 2.6.25.[7|8|9] but nothing pops out at me.
The system is still OK right now.

SO.. Besides being open to suggestions, here are a question or two:

1) Will "LatencyTOP" help me identify what is/was going on? I need to build some kernel options to use that right? Or is "latencyTOP" the wrong approach?

2) Anyone else experience something like this, either with the same kernel, OR with the new nVidia driver?

3) Here is some stuff from var/log/messages.log output from one of the malfunctioning kernels, with a couple highlighted chunks which I think normally aren't there, and so may indicate something related to the slowdown:
Code:


Jun 26 09:43:59 darkstarssi exiting on signal 15
Jun 26 09:50:04 darkstarssi syslogd 1.4.1: restart.
...
Jun 26 09:50:07 darkstarssi kernel: Intel machine check architecture supported.
Jun 26 09:50:07 darkstarssi kernel: Intel machine check reporting enabled on CPU#0.
Jun 26 09:50:07 darkstarssi kernel: using mwait in idle threads.
Jun 26 09:50:07 darkstarssi kernel: Checking 'hlt' instruction... OK.
...
Jun 26 09:50:07 darkstarssi kernel: PCI: Using IRQ router default [10de/0260] at 0000:00:0a.0
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:00:0a.1[A] -> IRQ 11
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:00:0b.0[A] -> IRQ 5
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:00:0b.1[B] -> IRQ 10
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:00:0e.0[A] -> IRQ 11
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:00:0f.0[A] -> IRQ 5
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:00:10.1[B] -> IRQ 10
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:00:14.0[A] -> IRQ 10
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:01:00.0[A] -> IRQ 10
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:02:00.0[A] -> IRQ 10
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:05:06.0[A] -> IRQ 11
Jun 26 09:50:07 darkstarssi kernel: PCI->APIC IRQ transform: 0000:05:09.0[A] -> IRQ 5
Jun 26 09:50:07 darkstarssi kernel: PCI: Bridge: 0000:00:03.0
Jun 26 09:50:07 darkstarssi kernel:  IO window: disabled.
Jun 26 09:50:07 darkstarssi kernel:  MEM window: 0xf9f00000-0xfbffffff
Jun 26 09:50:07 darkstarssi kernel:  PREFETCH window: 0x00000000c0000000-0x00000000cfffffff
Jun 26 09:50:07 darkstarssi kernel: PCI: Bridge: 0000:00:05.0
Jun 26 09:50:07 darkstarssi kernel:  IO window: disabled.
Jun 26 09:50:07 darkstarssi kernel:  MEM window: 0xfc000000-0xfe6fffff
Jun 26 09:50:07 darkstarssi kernel:  PREFETCH window: 0x00000000d0000000-0x00000000dfffffff
Jun 26 09:50:07 darkstarssi kernel: PCI: Bridge: 0000:00:06.0
Jun 26 09:50:07 darkstarssi kernel:  IO window: disabled.
Jun 26 09:50:07 darkstarssi kernel:  MEM window: disabled.
Jun 26 09:50:07 darkstarssi kernel:  PREFETCH window: disabled.
Jun 26 09:50:07 darkstarssi kernel: PCI: Bridge: 0000:00:07.0
Jun 26 09:50:07 darkstarssi kernel:  IO window: disabled.
Jun 26 09:50:07 darkstarssi kernel:  MEM window: disabled.
Jun 26 09:50:07 darkstarssi kernel:  PREFETCH window: disabled.
Jun 26 09:50:07 darkstarssi kernel: PCI: Bridge: 0000:00:10.0
Jun 26 09:50:07 darkstarssi kernel:  IO window: e000-efff
Jun 26 09:50:07 darkstarssi kernel:  MEM window: 0xfe700000-0xfebfffff
Jun 26 09:50:07 darkstarssi kernel:  PREFETCH window: disabled.
Jun 26 09:50:07 darkstarssi kernel: NET: Registered protocol family 2
...
Jun 26 09:50:07 darkstarssi kernel: io scheduler cfq registered (default)
Jun 26 09:50:07 darkstarssi kernel: pci 0000:00:03.0: Disabling HT MSI mapping<6>pci 0000:00:05.0:
 Disabling HT MSI mapping<6>pci 0000:00:06.0:
Disabling HT MSI mapping<6>pci 0000:00:07.0:
Disabling HT MSI mapping<6>pci 0000:00:09.0:
 Disabling HT MSI mapping<6>pci 0000:00:0e.0:
 Disabling HT MSI mapping<6>pci 0000:00:0f.0:
 Disabling HT MSI mapping<6>pci 0000:00:10.0:
 Disabling HT MSI mapping<6>pci 0000:00:10.1:
 Disabling HT MSI mapping<7>pci 0000:01:00.0:
 Boot video device

Jun 26 09:50:07 darkstarssi kernel: Starting balanced_irq
Jun 26 09:50:07 darkstarssi kernel: Freeing unused kernel memory: 252k freed
Jun 26 09:50:07 darkstarssi kernel: Adding 2048248k swap on /dev/hda6.  Priority:-1 extents:1 across:2048248k
Jun 26 09:50:07 darkstarssi kernel: forcedeth: Reverse Engineered nForce ethernet driver. Version 0.61.
Jun 26 09:50:07 darkstarssi kernel: forcedeth 0000:00:14.0: ifname eth0, PHY OUI 0x732 @ 1, addr 00:19:db:b5:91:e2
Jun 26 09:50:07 darkstarssi kernel: forcedeth 0000:00:14.0: highdma pwrctl timirq gbit lnktim desc-v3
Jun 26 09:50:07 darkstarssi kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[5]  MMIO=[fe7ff800-fe7fffff]  Max Packet=[2048]  IR/IT contexts=[4/8]
Jun 26 09:50:08 darkstarssi logger: /etc/rc.d/rc.M:  /sbin/ifconfig lo 127.0.0.1
Jun 26 09:50:08 darkstarssi logger: /etc/rc.d/rc.M:  /sbin/route add -net 127.0.0.0 netmask 255.0.0.0 lo
Jun 26 09:50:08 darkstarssi logger: /etc/rc.d/rc.M:  /sbin/dhcpcd -d -t 60  eth0
Jun 26 09:50:09 darkstarssi kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Jun 26 09:50:10 darkstarssi sshd[2051]: Server listening on 0.0.0.0 port 22.
Jun 26 09:50:18 darkstarssi /usr/sbin/gpm[2070]: *** info [startup.c(95)]:
Jun 26 09:50:18 darkstarssi /usr/sbin/gpm[2070]: Started gpm successfully. Entered daemon mode.
Jun 26 09:50:18 darkstarssi /usr/sbin/gpm[2070]: *** info [mice.c(1766)]:
Jun 26 09:50:18 darkstarssi /usr/sbin/gpm[2070]: imps2: Auto-detected intellimouse PS/2
Jun 26 09:50:20 darkstarssi smartd[2079]: smartd version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Jun 26 09:50:20 darkstarssi smartd[2079]: Home page is http://smartmontools.sourceforge.net/ 
Jun 26 09:50:20 darkstarssi smartd[2079]: Opened configuration file /usr/etc/smartd.conf
...
Jun 26 09:50:20 darkstarssi kernel: pktcdvd: writer pktcdvd0 mapped to hdd
Jun 26 09:50:20 darkstarssi kernel: pktcdvd: writer pktcdvd1 mapped to sr0
Jun 26 09:50:21 darkstarssi ntpd[2136]: ntpd 4.2.2p3@1.1577-o Tue Aug  8 04:20:31 UTC 2006 (1)
Jun 26 09:50:21 darkstarssi ntpd[2136]: logging to file /var/log/ntp.log
Jun 26 09:50:39 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(137)]:
Jun 26 09:50:39 darkstarssi /usr/sbin/gpm[2070]: Connecting at fd 6
Jun 26 09:51:50 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(275)]:
Jun 26 09:51:50 darkstarssi /usr/sbin/gpm[2070]: Request on 6 (console 2)
Jun 26 09:51:50 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(284)]:
Jun 26 09:51:50 darkstarssi /usr/sbin/gpm[2070]: Closing
Jun 26 09:52:10 darkstarssi kernel: warning: process `nvidia-installe' used the deprecated sysctl system call with 1.23.
Jun 26 09:52:10 darkstarssi last message repeated 2 times

Jun 26 09:52:32 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(137)]:
Jun 26 09:52:32 darkstarssi /usr/sbin/gpm[2070]: Connecting at fd 6
Jun 26 09:52:50 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(275)]:
Jun 26 09:52:50 darkstarssi /usr/sbin/gpm[2070]: Request on 6 (console 2)
Jun 26 09:52:50 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(284)]:
Jun 26 09:52:50 darkstarssi /usr/sbin/gpm[2070]: Closing
Jun 26 09:52:55 darkstarssi init: Switching to runlevel: 3
Jun 26 09:52:59 darkstarssi init: Switching to runlevel: 4
Jun 26 11:17:13 darkstarssi ntpd[6425]: ntpd 4.2.2p3@1.1577-o Tue Aug  8 04:20:31 UTC 2006 (1)
Jun 26 11:17:13 darkstarssi ntpd[6426]: precision = 1.000 usec
Jun 26 11:17:13 darkstarssi ntpd[6426]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Jun 26 11:17:13 darkstarssi ntpd[6426]: Listening on interface lo, 127.0.0.1#123 Enabled
Jun 26 11:17:13 darkstarssi ntpd[6426]: Listening on interface eth0, 192.168.123.188#123 Enabled
Jun 26 11:17:13 darkstarssi ntpd[6426]: kernel time sync status 0040
Jun 26 11:18:41 darkstarssi ntpd[6710]: ntpd 4.2.2p3@1.1577-o Tue Aug  8 04:20:31 UTC 2006 (1)
Jun 26 11:18:41 darkstarssi ntpd[6711]: precision = 1.000 usec
Jun 26 11:18:41 darkstarssi ntpd[6711]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Jun 26 11:18:41 darkstarssi ntpd[6711]: Listening on interface lo, 127.0.0.1#123 Enabled
Jun 26 11:18:41 darkstarssi ntpd[6711]: Listening on interface eth0, 192.168.123.188#123 Enabled
Jun 26 11:18:41 darkstarssi ntpd[6711]: kernel time sync status 0040
Jun 26 11:39:41 darkstarssi -- MARK --
Jun 26 11:59:41 darkstarssi -- MARK --
Jun 26 12:19:41 darkstarssi -- MARK --
Jun 26 12:22:03 darkstarssi init: Switching to runlevel: 3
...
Request on 6 (console 2)
Jun 26 12:22:08 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(284)]:
Jun 26 12:22:08 darkstarssi /usr/sbin/gpm[2070]: Closing
Jun 26 12:22:11 darkstarssi kernel: warning: process `nvidia-installe' used the deprecated sysctl system call with 1.23.
Jun 26 12:22:11 darkstarssi kernel: warning: process `nvidia-installe' used the deprecated sysctl system call with 1.23.

Jun 26 12:22:14 darkstarssi /usr/sbin/gpm[2070]: *** info [client.c(137)]:
...

Jun 26 12:22:20 darkstarssi init: Switching to runlevel: 1
Jun 26 12:22:20 darkstarssi syslogd 1.4.1: restart.
Jun 26 12:22:20 darkstarssi sshd[2051]: Received SIGHUP; restarting.
Jun 26 12:22:20 darkstarssi kernel: Kernel logging (proc) stopped.
Jun 26 12:22:20 darkstarssi kernel: Kernel log daemon terminating.
Jun 26 12:22:20 darkstarssi sshd[19920]: Server listening on 0.0.0.0 port 22.
Jun 26 12:22:20 darkstarssi init: Re-reading inittab
Jun 26 12:22:25 darkstarssi exiting on signal 15
Jun 26 12:24:12 darkstarssi syslogd 1.4.1: restart.

I would have put more but LQ said the post was too long.. SO that's it for now. Tonight or tomorrow (or when it gets slow again) I will revisit the first build of the new kernel and run HTOP non stop to see if I can see anything.
Thanks for reading :) and ideas welcomed.

SVA

syg00 06-26-2008 10:23 PM

Quote:

Originally Posted by GrapefruiTgirl (Post 3196354)
1) Will "LatencyTOP" help me identify what is/was going on? I need to build some kernel options to use that right? Or is "latencyTOP" the wrong approach?

In my case, latency issues have almost always been a driver issue.
I never use modules unless I absolutely have to - nvidia being a classic example.
As for latencytop - just do it. You have a problem, you have a potential tool specifically designed for the job (that needs testers), and you do your own kernels ...
The patch looks to be against 2.6.24-rc8, but might fit your kernel. Try it and see.
Might be interesting to see what it produces.

I haven't done any testing of latencytop - I might do up a kernel and have a go this arvo as well.

As for that id%, that's the idle task. Yep, just what it sounds like.

Later: I found a 2.6.24-rc8 source-tree that hadn't been touched. Built a kernel, and latencytop went on fine. Apply the patch and you'll get another option under "kernel hacking". The tool is a simple make/make install - the latter as root.
No doco that I could find.
Seemed to perform as promised, although only a perfunctory test on my part.

GrapefruiTgirl 06-27-2008 06:55 AM

Thanks Syg00,

latencytop will be the plan for the day, after breakfast and all that stuff.
Got up this morning to see the screensaver going slow-motion; unlocked desktop, HTOP is still running from yesterday, sorted by CPU%, and is showing nothing interesting, yet my browser barely scrolls, window focus is slow as heck, cursor isn't flashing, and the clock is about 12 hours behind.... LOL :confused:

I'll update later.

Sasha

GrapefruiTgirl 06-27-2008 08:44 AM

I am unable to patch 2.6.25.[4|7|8|9] with the latencytop patch. It seems that *some* of the intended patching has already been integrated into these kernels, causing Patch to fail.
I tried telling Patch to ignore whitespace and existing/reversed patches, as well as a fuzz-factor of 3, and managed to get the patch to succeed, but the kernel make-xconfig aborts with a "recursive dependency" error.

The latencytop patch applies fine on 2.6.24.4, so I decided to do that, then patch the patched kernel up to 2.6.25.9, however the available incremental diffs on kernel.org do not patch across major numbers, in other words I cannot make the leap from a 2.6.24 release to a 2.6.25 release via patches; the highest I can get is 2.6.24.7, then there is no way to jump to 2.6.25.[0|1]

Ideas welcome. I am still working on it but not sure how to proceed.

PS - *IF* the latencytop patch will apply on 2.6.25.1, I can start there and patch upwards to 2.6.25.9, but I'm on dialup so downloading 2.6.25.1 to try that will need to wait till tonight/tomorrow.

EDIT - Now I feel silly :p it seems that latency tools ARE in the darned thing already.. I EXPANDED the kernel hacking section, but apparently didn't actually CLICK on it..

Grr.. OK, more later.

syg00 06-27-2008 08:49 AM

I'll have a look next week - I was hitting my test server at work.
Got a yard to mow over the weekend, so patching will have to wait ....

GrapefruiTgirl 06-27-2008 12:46 PM

Hi Syg00

Thanks for the cont'd input.

Currently desktop latency is so brutal that barely anything works (interactively at least) but applications still run. LatencyTOP is so 'latent' that by the time it displays anything, the information is so old as to be worthless.

That said, I did get it to show some stuff earlier today, junk about 'kthread' helper and 'syscall_call' and other in-kernel code, that was showing latency numbers like "1974565632783.415 msec" which is totally ridiculous.
As stated, LatencyTOP is not even displaying anything now, it is too slow.

I am now going into my BIOS to see if anything looks weird due to a power outage the other day, like maybe that "Advanced C1 Halt state" turned back on. If all is OK there, I'm booting back into the last working kernel I had been running for the past few months (2.6.24.4) and will run that for the next 24 hours to eliminate hardware issues.
Assuming that is all good, it'll be back to the current kernel with its latency issues for more diagnostics.

Cheers.
Sasha

GrapefruiTgirl 06-27-2008 01:53 PM

I couldn't get into my BIOS (very weird) so I booted an Ultimate Boot CD to see what it had to say, and the AIDA diagnostic output was so bizarre that I can't really put any stock in it: among other things, it claimed my CPU x FSB was 7x257 (it is actually 9x200) and also told me that my nVidia cards were Yamahas. LOL..
Anyway, because of the password issue, I reset the CMOS and reconfigged the BIOS, rebooted my working 2.6.24.4 kernel, and here I am.
I noticed in DMESG that the stuff I posted above about the ".. HT PCI MSI-X disabled.." was not present in this kernel/boot/dmesg as it was in the latency kernels; but there's too much vagueness going on here to draw a conclusion, and I cannot verify the state of my BIOS during the running of the latency kernels either.

I will run this (known-good) kernel for the next 24hrs, then rebuild a fresh 2.6.25.9 kernel with this exact kconfig I am runnung now, and see where that leads.

Syg00, in case you overlook the posts above, do not bother patching anything for my benefit-- the latencyTOP stuff is built into 2.6.25.9.

More tomorrow.
S

syg00 06-27-2008 05:17 PM

Quote:

Originally Posted by GrapefruiTgirl (Post 3197184)
the latencyTOP stuff is built into 2.6.25.9

Earlier than that even - this Fedora 9 system shows it in 2.6.25.6

H_TeXMeX_H 07-01-2008 12:28 PM

Hmmm, what an odd issue. I'm actually running the same versions of the kernel 2.6.25.9 and nvidia drivers 173.14.09, and I experience no lag whatsoever. However, I'm using Slackware 12.1. Also my video card is a GeForce 8800 GTS 512, and I have a quad core Q9300 processor. What kind of processor do you have ? Is it a Pentium 4 with HT ?

GrapefruiTgirl 07-01-2008 12:46 PM

Quote:

Originally Posted by H_TeXMeX_H (Post 3200480)
Hmmm, what an odd issue. I'm actually running the same versions of the kernel 2.6.25.9 and nvidia drivers 173.14.09, and I experience no lag whatsoever. However, I'm using Slackware 12.1. Also my video card is a GeForce 8800 GTS 512, and I have a quad core Q9300 processor. What kind of processor do you have ? Is it a Pentium 4 with HT ?

Hi TexMex,

First, I just want to reiterate that the 'lag' is not present at boot, nor for quite a while after boot. It just suddenly 'kicks in' for no reason; last time I ran one of the suspect kernels long enough, it took about 50 minutes to happen.

I do not think it relates to the nvidia driver though. It could, but I don't think so..

This is Slack 11 (for the most part), an E2160 Pentium core2 NON-HT processor. 4GB of RAM and two GeForce 72/7300 PCI-E cards.

I am currently experimenting with some rebuilds of the kernel: I have un-patched 2.6.25.9 down to 2.6.25.1 even though I believe the issue is still in 2.6.25.x *period* (I haven't yet made it 100% through the 7.5 MB changelog for 2.6.25 :p) and am playing with some ACPI boot stuff at the same time.

In a moment here I will be rebooting the 2.6.25.1 I just built, and will be away for the machine for several hours afterwards, so I will see if THIS build plops into slow-as-mud mode during that time.

Cheers!

Sasha

H_TeXMeX_H 07-01-2008 02:36 PM

could it be some type of memory leak ? I guess top would not see it if it were a kernel leak, but I think the overall used system memory would decrease.

GrapefruiTgirl 07-04-2008 02:39 PM

*grumble grumble* :/ too many projects going on at the same time, both in the 'puter and not..

Anyhow, have been running 2.6.25.1 for a number of days now without any sign of trouble.
I still am not fully up to date on EVERYthing in the 2.6.25 changelog, but I do know that there is relatively little in the changelogs between .1 and .9 that affects my x86 machine.

Without jumping the gun too fast (which I am about to do anyway), unless there is a bug between .1 and .9 that affected my system like it did, I am going to for now attribute the weird latency to a misconfiguration/oversight on my part--- that's not to say that the "kconfig turn machine into molasses after a while" option functions as intended for those who select it deliberately (I see no value in a valid config option that performs as that did), but without actually knowing what it was (if that is what it was) I can't pin it down to one specific thing.

As is my style, I will be advancing to the next kernel minor release over the next days or so, to 2.6.25.2, and so on until I get back up to .9, but will use the exact same config file I am running ATM until I get there. No experimental options or whatever, and will update again later.

Thanks,

Sasha


All times are GMT -5. The time now is 04:00 PM.