LinuxQuestions.org


chexmix 12-09-2012 07:46 PM

How to diagnose system freeze
 
Hi all -

I've been enjoying Slackware 14.0 on my new ZaReason laptop ... but am experiencing random system freezes: no cursor movement, keyboard unresponsive, the works.

At first I thought it might be Firefox, since I'd read in a couple of places that people were having such an issue. But the machine froze when I had a combination of seamonkey and openoffice running -- and no Firefox.

I ran memtest and there were no failures. I've been looking at syslog to see whether there's anything strange there, and there are a number of entries re: NetworkManager --

Code:

Dec  9 20:16:44 catbutt dhcpcd[1884]: timed out
Dec  9 20:16:44 catbutt dhcpcd[1884]: allowing 8 seconds for IPv4LL timeout
Dec  9 20:16:52 catbutt dhcpcd[1884]: timed out
Dec  9 20:16:58 catbutt NetworkManager[2085]: <warn> Failed to open plugin directory /usr/lib64/NetworkManager: Error opening directory '/usr/lib64/NetworkManager': No such file or directory
Dec  9 20:16:58 catbutt NetworkManager[2085]: <warn> failed to allocate link cache: (-10) Operation not supported
Dec  9 20:16:58 catbutt NetworkManager[2085]: <warn> (wlan0): driver supports Access Point (AP) mode
Dec  9 20:16:59 catbutt NetworkManager[2085]: <warn> bluez error getting default adapter: The name org.bluez was not provided by any .service files
Dec  9 20:16:59 catbutt NetworkManager[2085]: <warn> Trying to remove a non-existant call id.
Dec  9 20:17:00 catbutt dhcpcd[2119]: wlan0: sendmsg: Cannot assign requested address

... but I have no idea whether that could cause my whole system to lock up. Since they are just warnings, I'm skeptical that that is the cause.

What or where else might I check?

Thanks,

Glenn

H_TeXMeX_H 12-10-2012 02:10 AM

Can you post the output of 'lspci -k' and 'lsmod'? This is mostly to see what hardware and drivers you have.

I wrote a hardware diagnostics wiki page that may help:
http://docs.slackware.com/howtos:har...re_diagnostics

I'm not sure NetworkManager can cause such a hang. Maybe the errors were caused by the hang.

chexmix 12-10-2012 06:12 AM

Here is lspci -k:

Code:

00:14.0 USB controller: Intel Corporation Panther Point USB xHCI Host Controller (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
        Kernel driver in use: xhci_hcd
00:16.0 Communication controller: Intel Corporation Panther Point MEI Controller #1 (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
        Kernel driver in use: mei
00:1a.0 USB controller: Intel Corporation Panther Point USB Enhanced Host Controller #2 (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
        Kernel driver in use: ehci_hcd
00:1b.0 Audio device: Intel Corporation Panther Point High Definition Audio Controller (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
        Kernel driver in use: snd_hda_intel
00:1c.0 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 1 (rev c4)
        Kernel driver in use: pcieport
00:1c.1 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 2 (rev c4)
        Kernel driver in use: pcieport
00:1d.0 USB controller: Intel Corporation Panther Point USB Enhanced Host Controller #1 (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
        Kernel driver in use: ehci_hcd
00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
00:1f.2 SATA controller: Intel Corporation Panther Point 6 port SATA Controller [AHCI mode] (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
        Kernel driver in use: ahci
00:1f.3 SMBus: Intel Corporation Panther Point SMBus Controller (rev 04)
        Subsystem: COMPAL Electronics Inc Device 0065
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)
        Subsystem: COMPAL Electronics Inc Device 0065
        Kernel driver in use: r8169
02:00.0 Network controller: Intel Corporation Centrino Wireless-N 1030 (rev 34)
        Subsystem: Intel Corporation Centrino Wireless-N 1030 BGN
        Kernel driver in use: iwlwifi

The only thing that sticks out for me is the last entry, for wireless - I thought I had a 3945.

And here is lsmod:

Code:

Module                  Size  Used by
snd_seq_dummy          1455  0
snd_seq_oss            29048  0
snd_seq_midi_event      5620  1 snd_seq_oss
snd_seq                51265  5 snd_seq_midi_event,snd_seq_oss,snd_seq_dummy
snd_seq_device          5228  3 snd_seq,snd_seq_oss,snd_seq_dummy
snd_pcm_oss            39183  0
snd_mixer_oss          15404  2 snd_pcm_oss
ipv6                  279979  38
pcmcia                35720  0
pcmcia_core            12061  1 pcmcia
cpufreq_ondemand        6252  4
acpi_cpufreq            5773  1
mperf                  1171  1 acpi_cpufreq
freq_table              2475  2 acpi_cpufreq,cpufreq_ondemand
lp                      9787  0
ppdev                  5958  0
parport_pc            19423  0
parport                31427  3 parport_pc,ppdev,lp
fuse                  66626  3
snd_hda_codec_hdmi    24057  1
rts5139              342736  0
usbhid                35615  0
hid                    82876  1 usbhid
snd_hda_codec_realtek  195474  1
joydev                  9972  0
uvcvideo              62784  0
videodev              76679  1 uvcvideo
v4l2_compat_ioctl32    8660  1 videodev
iwlwifi              199185  0
i915                  419107  2
snd_hda_intel          23267  2
r8169                  48922  0
snd_hda_codec          81925  3 snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_hdmi
mac80211              227731  1 iwlwifi
snd_hwdep              6324  1 snd_hda_codec
snd_pcm                72864  4 snd_hda_codec,snd_hda_intel,snd_hda_codec_hdmi,snd_pcm_oss
snd_page_alloc          7081  2 snd_pcm,snd_hda_intel
snd_timer              18798  2 snd_pcm,snd_seq
snd                    57796  14 snd_timer,snd_pcm,snd_hwdep,snd_hda_codec,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_hdmi,snd_mixer_oss,snd_pcm_oss,snd_seq_device,snd_seq,snd_seq_oss
intel_agp              10864  1 i915
drm_kms_helper        26133  1 i915
intel_gtt              13833  3 intel_agp,i915
drm                  187389  3 drm_kms_helper,i915
psmouse                61704  0
i2c_algo_bit            5319  1 i915
btusb                  11676  0
mii                    3987  1 r8169
cfg80211              169025  2 mac80211,iwlwifi
bluetooth            151679  1 btusb
processor              25592  5 acpi_cpufreq
thermal                7983  0
fan                    2418  0
video                  11378  1 i915
rfkill                15428  4 bluetooth,cfg80211
mei                    32534  0
i2c_i801                8044  0
thermal_sys            14578  4 video,fan,thermal,processor
i2c_core              19978  6 i2c_i801,i2c_algo_bit,drm,drm_kms_helper,i915,videodev
agpgart                27372  3 drm,intel_gtt,intel_agp
serio_raw              4389  0
ac                      3331  0
hwmon                  1329  1 thermal_sys
soundcore              5474  2 snd
battery                11171  0
evdev                  9574  10
button                  4529  1 i915
loop                  18192  0

Thanks for looking.

/Glenn

H_TeXMeX_H 12-10-2012 06:28 AM

Does it always freeze the same way ?

It could be hardware or it could be some driver. I remember there were some intel video driver issues on 13.37, but I don't think they apply to 14.0 and they look different:
http://www.linuxquestions.org/questi...ng-4175425214/

kooru 12-10-2012 06:33 AM

Quote:

Originally Posted by chexmix (Post 4845915)
but am experiencing random system freezes: no cursor movement, keyboard unresponsive, the works.

Same thing for me.
I resolved it by upgrading the kernel.
you can see here

onebuck 12-10-2012 06:33 AM

Member Response
 
Hi,

What about switching to another console or using 'ssh' to get into the box, to see if the system is actually frozen?
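
A minimal sketch of that check, run from a second machine on the same network. The hostname 'laptop' and the helper name freeze_verdict are made up for illustration, and the verdict strings are rough guesses to narrow things down, not a diagnosis:

```shell
# freeze_verdict PING_RC SSH_RC
# Hypothetical helper: turns the exit codes of a ping and an ssh attempt
# (0 = success) into a rough guess about how deep the freeze goes.
freeze_verdict() {
    if [ "$2" -eq 0 ]; then
        echo "userspace alive: look at X, input or video drivers"
    elif [ "$1" -eq 0 ]; then
        echo "ping only: kernel networking alive, userspace stuck or no sshd"
    else
        echo "no response: hard lockup, or the network itself is down"
    fi
}

# Usage from the second machine ('laptop' is a placeholder hostname):
#   ping -c 1 -W 2 laptop >/dev/null 2>&1; ping_rc=$?
#   ssh -o BatchMode=yes -o ConnectTimeout=5 laptop true >/dev/null 2>&1; ssh_rc=$?
#   freeze_verdict "$ping_rc" "$ssh_rc"
```

This assumes sshd was already running on the laptop before the freeze; if it was not, a failed ssh says nothing.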

chexmix 12-10-2012 06:40 AM

Quote:

Originally Posted by H_TeXMeX_H (Post 4846174)
Does it always freeze the same way ?

Well, I typically notice it via the mouse cursor freezing in place ... there doesn't seem to be a common thread re: what kind of work I happen to be doing.

The machine locks hard: CTRL-ALT-DEL does nothing, for what that's worth.

/G

H_TeXMeX_H 12-10-2012 07:01 AM

Try Alt-SysRq REISUB:
http://en.wikipedia.org/wiki/Reisub
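
Before relying on it, it may be worth checking that the kernel accepts those keys at all. A small sketch, assuming the usual /proc/sys/kernel/sysrq interface, where 1 enables every function and larger values are a bitmask; the helper name reisub_allowed is made up for illustration:

```shell
# reisub_allowed VALUE
# VALUE is the content of /proc/sys/kernel/sysrq. 1 enables everything;
# otherwise it is a bitmask. REISUB needs:
#   4   keyboard control (R, unraw)
#   64  signalling processes (E and I)
#   16  sync (S)
#   32  remount read-only (U)
#   128 reboot (B)
# which adds up to 244.
reisub_allowed() {
    need=244
    if [ "$1" -eq 1 ] || [ $(( $1 & $need )) -eq "$need" ]; then
        echo yes
    else
        echo no
    fi
}

if [ -r /proc/sys/kernel/sysrq ]; then
    echo "REISUB fully enabled: $(reisub_allowed "$(cat /proc/sys/kernel/sysrq)")"
fi

# To enable all SysRq functions until the next reboot (as root):
#   echo 1 > /proc/sys/kernel/sysrq
```

If the keys are enabled and the machine still ignores them during a freeze, that points toward a lockup below the keyboard handler rather than a stuck desktop.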

chexmix 12-10-2012 08:41 AM

Quote:

Originally Posted by H_TeXMeX_H (Post 4846203)

Thanks -- I will try that.

Just noted before I left for work: there are some troubling lines re: ACPI in /var/log/messages (wish I'd had time to copy/paste them, but I was running late).
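
For anyone wanting to grab such lines after the fact, a minimal sketch; the helper name acpi_lines is made up, and the log path is the one mentioned above:

```shell
# acpi_lines FILE
# Print the last 20 lines of FILE that mention ACPI (case-insensitive).
acpi_lines() {
    grep -i 'acpi' "$1" | tail -n 20
}

# Usage (reading the log may require root):
#   acpi_lines /var/log/messages
```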

Also, when I unplugged the AC power cord, the battery monitor said my battery was at 89%. This thing's been plugged in a loooong time ... could this be a bad battery issue?

/Glenn

elyk 12-10-2012 10:26 PM

I have a machine that would act similar to what you describe -- keyboard and mouse freeze at random, pressing numlock/capslock don't change the keyboard LEDs, Alt+SysRq doesn't seem to be recognized. But I could SSH in after it happens. I think adding 'nolapic' to the kernel parameters fixed it.

chexmix 12-11-2012 06:47 AM

Quote:

Originally Posted by elyk (Post 4846678)
I have a machine that would act similar to what you describe -- keyboard and mouse freeze at random, pressing numlock/capslock don't change the keyboard LEDs, Alt+SysRq doesn't seem to be recognized. But I could SSH in after it happens. I think adding 'nolapic' to the kernel parameters fixed it.

I'm still waiting for another freeze to try things out (thanks everyone!) ...

Did a little Googling on 'nolapic' ... doesn't this slow down performance? In one place I could swear I read that it essentially turned a multicore machine into single core.

onebuck 12-11-2012 07:21 AM

Member Response
 
Hi,

Quote:

Originally Posted by chexmix (Post 4846928)
I'm still waiting for another freeze to try things out (thanks everyone!) ...

Did a little Googling on 'nolapic' ... doesn't this slow down performance? In one place I could swear I read that it essentially turned a multicore machine into single core.

From http://www.kernel.org/doc/Documentat...parameters.txt
Quote:

noapic [SMP,APIC] Tells the kernel to not make use of any IOAPICs that may be present in the system.
Quote:

nolapic [X86-32,APIC] Do not enable or use the local APIC.
Please notice the bracketed qualifiers in the above quotes (underlined in the original post). You can use 'noapic' for 'SMP' and 'nolapic' for 32-bit x86; both fall under the APIC classifier.

HTH!

H_TeXMeX_H 12-11-2012 08:47 AM

So then nolapic would have no effect on 64-bit systems ?

chexmix 12-11-2012 09:07 AM

Quote:

Originally Posted by H_TeXMeX_H (Post 4847011)
So then nolapic would have no effect on 64-bit systems ?

I'd like to know this as well. This box is Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz. I'm running Slack64 on it.

Would I use nolapic or noapic for a dual core machine?

Update: just had a freeze. Machine did not respond to the magic SysRq sequences. So I'm ready to try the no[l]apic thing. Where in lilo.conf do I add it (I can Google this, but I thought I'd ask while I was here)?

EDIT: I assume it goes under "# Append any additional kernel parameters:", and looks like this:

append = "noapic"

(or nolapic)

Thanks,

Glenn

onebuck 12-11-2012 11:15 AM

Member Response
 
Hi,

I would use 'addappend=' in the image stanza, unless you want the change to be global, in which case use 'append=' within the global section:
Quote:

From 'man lilo.conf':
addappend=<string>
(22.6) The kernel parameters from the specified string, are concatenated to the parameter(s) from an append= specification (see below). The
string must be enclosed within double quotes. Usually, the previous append= will specify parameters common to all kernels by appearing in the
top, or global, section of the configuration file and addappend= will be used to add local parameter(s) to an individual image. Addappend= may
be used only once per "image=" section.

append=<string>
Appends the options specified to the parameter line passed to the kernel. This is typically used to specify hardware parameters that can't be
entirely auto-detected or for which probing may be dangerous. Multiple kernel parameters are separated by a blank space, and the string must be
enclosed in double quotes. A local append= appearing within an image= section overrides any global append= appearing in the top section of the
configuration file. Append= may be used only once per "image=" section. To concatenate parameter strings, use "addappend=". Example:

append="mem=96M hd=576,64,32 console=ttyS1,9600"
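
Whichever keyword is used, the edit only takes effect once lilo has been rerun as root and the machine rebooted. A minimal sketch for confirming that the parameter really reached the kernel; the helper name cmdline_has and the sample stanza values in the comments are assumptions for illustration, not taken from this machine:

```shell
# Hypothetical image stanza in /etc/lilo.conf (paths and label are placeholders):
#   image = /boot/vmlinuz
#     root = /dev/sda1
#     label = Linux
#     read-only
#     addappend = "noapic"

# cmdline_has PARAM [FILE]
# Succeeds if PARAM appears as a whole word in the kernel command line.
# FILE defaults to /proc/cmdline; the second argument exists only so the
# function can be tried against a sample file.
cmdline_has() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# After editing /etc/lilo.conf (as root):
#   /sbin/lilo
# Then reboot and check:
#   cmdline_has noapic && echo "noapic is active"
```

Matching whole words matters here because 'noapic' and 'nolapic' differ by a single letter.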
A little old but applicable;
Quote:

From http://osdev.berlios.de/pic.html

1. Introduction
There are basically two things here to consider.
  1. Built into all recent x86 CPU chips (Pent Pro and up) is a thing called a Local APIC. It is addressed at physical addresses FEE00xxx. Actually, that is the default, it can be moved by programming the MSR that holds it base address.
    It has many fun things in it. The big thing is that you can interrupt other CPU's in a multiprocessor system. But if you just have a uniprocessor, there are useful things for it, too.
    The Local APIC is described in Chapter 7 of Volume 3 of the Intel processor books.
  2. Some motherboards have an IO APIC on them. This is usually only found on multiprocessor boards. Functionally, it replaces the 8259's. You must essentially shut off the 8259's and turn on the IO APIC to use it.
    The IO APIC is typically located at physical address FEC00000, but may be moved by programming the north/southbridge chipset.
    The Intel chip number is 82093 and you can get the doc for it off of the Intel website.
2. What the Local APIC Is
As stated above, the Local APIC (LAPIC) is a circuit that is part of the CPU chip. It contains these basic elements:
  1. A mechanism for generating interrupts
  2. A mechanism for accepting interrupts
  3. A timer
If you have a multiprocessor system, the APIC's are wired together so they can communicate. So the LAPIC on CPU 0 can communicate with the LAPIC on CPU 1, etc.

3. What the IO APIC Is
This is a separate chip that is wired to the Local APIC's so it can forward interrupts on to the CPU chips. It is programmed similar to the 8259's but has more flexibility.
It is wired to the same bus as the Local APIC's so it can communicate with them.

4. Fun things to do with a Local APIC in a Uniprocessor (this stuff also applies to multiprocessors, too)
One thing the LAPIC can help with is the following problem:
An IRQ-type interrupt routine wishes to wake a sleeping thread, but this IRQ interrupt may be nested several levels inside other IRQ interrupts, so it cannot simply switch stacks as those outer interrupt routines would not complete until the old thread is re-woken.
So we have to somehow switch out of the current thread and switch into the thread to be woken. A way the LAPIC can help us is to tell it to interrupt this same CPU, but only when there are no IRQ-type interrupt handlers active.
I call this a 'software' interrupt because the operating system software initiated the interrupt. It is programmed into the LAPIC to be at a priority lower than any IRQ-type interrupt.
So now if some IRQ-type routine wants to wake a thread, it makes the necessary changes to the data structures, then triggers a software interrupt to itself. Then, when all IRQ-type interrupt handlers have returned out, the LAPIC is now able to interrupt. It interrupts out of the currently executing thread and switches to the thread that was just woken. Very neat.
Without the LAPIC, your interrupt routine has to set a flag in memory somewhere that each IRET has to check for. So each IRET checks this flag and checks to see if it is the 'last' IRET. It is more efficient to let the LAPIC do this testing for you.
So now we have to make this software LAPIC interrupt have a lower priority than IRQ interrupts. We do this by studying how the LAPIC assigns priority to interrupts. This is a bit lame but it works ok. The priority is based on the vector number we choose for the interrupt. Interrupt vectors are numbered 0x00 through 0xFF in Intel CPUs. The LAPIC assigns a priority based on the first of the two hex digits and ignores the second digit. Thus, any interrupts using vectors 0x50 through 0x5F have the same priority. So if you block something at priority 0x52, you block all interrupts in the range 0x50 through 0x5F.
Now the CPU itself uses vectors in the range 0x00..0x1F for exceptions, so we don't want to use those for LAPIC interrupts. This means we can use a vector numbered 0x20 or 0x2F or somewhere in that range. We will have to redirect the IRQ interrupts to vectors 0x30..0x3F or something even higher if necessary, by re-programming the 8259's. Now we can block software interrupts without blocking IRQ interrupts.
The LAPIC's priority can be set by writing the LAPIC's TSKPRI (task priority) register. So if you want to block all interrupts through level 0x2F, just write a 0x20 (or 0x2B, etc) into the TSKPRI and you have blocked those interrupts.
Now the LAPIC is not really connected to the 8259's. You cannot block 8259 generated interrupts with the LAPIC. Likewise, being in an IRQ-type interrupt handler does not block any LAPIC interrupts. So we have to manually block/unblock the softints at the beginning of our IRQ handler. Just push the LAPIC's TSKPRI register, set it to 0x20 and handle your IRQ interrupt as usual. When done, pop the saved LAPIC's TSKPRI then IRET.
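
The priority arithmetic described above can be sketched in a few lines of shell. This only models the rule as the article states it (priority is the first hex digit of the vector, and the task priority register blocks everything up to and including its own class); the function names are made up for illustration and nothing here touches real hardware:

```shell
# priority_class VECTOR
# The LAPIC priority class is the high hex digit of the vector (vector / 16).
priority_class() {
    echo $(( $1 >> 4 ))
}

# lapic_blocks TPR VECTOR
# Succeeds if writing TPR to the task priority register would hold back an
# interrupt using VECTOR, i.e. the vector's class is not higher than TPR's.
lapic_blocks() {
    [ "$(priority_class "$2")" -le "$(priority_class "$1")" ]
}

priority_class 0x52   # prints 5, the class shared by 0x50 through 0x5F
```

With these definitions, lapic_blocks 0x20 0x2F succeeds while lapic_blocks 0x20 0x30 fails, which is the split between software interrupts and redirected IRQ vectors that the article relies on.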
SMP should help you understand multiprocessor or multi-core processor.

HTH!


All times are GMT -5.