LinuxQuestions.org - [SOLVED] --current, randomly timed kernel oops on bootup of two test boxen


Quote:

Originally Posted by hitest (Post 4210814)

Last crash, snapped a picture.

That looks familiar.

I have seen several different ways in which the kernel can crash. Since patching udev-165, I haven't seen any more crashes.
Ed

Is it safe to just downgrade to udev-164? Do we expect the next Slackware update to roll back to udev-164 until the bug is fixed, or not?

I hit the same problem during boot a few days ago.

I normally run Slack-64-m, but I also have a Slack32-current partition. I updated it today, and after a few boots I got a kernel panic.

I am booted into the slack64-M-current partition right now, so I took a look at /var/log/syslog on the slack32-current partition. Here it is:

+========================================+

Code:

Jan  3 23:23:13 base2 kernel: [    0.157045] raid6: int32x1  1281 MB/s
Jan  3 23:23:13 base2 kernel: [    0.174046] raid6: int32x2  1125 MB/s
Jan  3 23:23:13 base2 kernel: [    0.191042] raid6: int32x4    757 MB/s
Jan  3 23:23:13 base2 kernel: [    0.208002] raid6: int32x8    746 MB/s
Jan  3 23:23:13 base2 kernel: [    0.225018] raid6: mmxx1    2164 MB/s
Jan  3 23:23:13 base2 kernel: [    0.242001] raid6: mmxx2    3855 MB/s
Jan  3 23:23:13 base2 udevd[1599]: bind failed: Address already in use
Jan  3 23:23:13 base2 udevd[1599]: error binding control socket, seems udevd is already running
Jan  3 23:23:13 base2 kernel: [    0.259019] raid6: sse1x1    2246 MB/s
Jan  3 23:23:13 base2 kernel: [    0.276009] raid6: sse1x2    3617 MB/s
Jan  3 23:23:13 base2 kernel: [    0.293007] raid6: sse2x1    3832 MB/s
Jan  3 23:23:13 base2 kernel: [    0.310011] raid6: sse2x2    5042 MB/s
Jan  3 23:23:13 base2 kernel: [    0.310015] raid6: using algorithm sse2x2 (5042 MB/s)
Jan  3 23:23:13 base2 kernel: [    0.313378] pnp 00:02: disabling [mem 0x00000000-0x00000fff window] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jan  3 23:23:13 base2 kernel: [    0.313378] pnp 00:02: disabling [mem 0x00000000-0x00000fff window disabled] because it overlaps 0000:01:00.0 BAR 6 [mem 0x00000000-0x0007ffff pref]
Jan  3 23:23:13 base2 kernel: [    0.313378] pnp 00:02: disabling [mem 0x00000000-0x00000fff window disabled] because it overlaps 0000:02:00.0 BAR 6 [mem 0x00000000-0x0000ffff pref]
Jan  3 23:23:13 base2 kernel: [    0.316114] pnp 00:0c: disabling [mem 0x000d2a00-0x000d3fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jan  3 23:23:13 base2 kernel: [    0.316123] pnp 00:0c: disabling [mem 0x000f0000-0x000f7fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jan  3 23:23:13 base2 kernel: [    0.316130] pnp 00:0c: disabling [mem 0x000f8000-0x000fbfff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jan  3 23:23:13 base2 kernel: [    0.316137] pnp 00:0c: disabling [mem 0x000fc000-0x000fffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jan  3 23:23:13 base2 kernel: [    0.316144] pnp 00:0c: disabling [mem 0x00000000-0x0009ffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jan  3 23:23:13 base2 kernel: [    0.316151] pnp 00:0c: disabling [mem 0x00100000-0x7fddffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
Jan  3 23:23:13 base2 kernel: [    0.951021] pci 0000:00:13.0: OHCI: BIOS handoff failed (BIOS bug?) 00000184
Jan  3 23:23:13 base2 kernel: [    0.988634] highmem bounce pool size: 64 pages
Jan  3 23:23:13 base2 kernel: [    0.991832] Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
Jan  3 23:23:13 base2 kernel: [    0.992165] DLM (built Oct 11 2010 14:46:35) installed
Jan  3 23:23:13 base2 kernel: [    0.997371] OCFS2 User DLM kernel interface loaded
Jan  3 23:23:13 base2 kernel: [    0.998386] GFS2 (built Oct 11 2010 14:47:37) installed
Jan  3 23:23:13 base2 kernel: [    1.007672] Console: switching to colour frame buffer device 128x48
Jan  3 23:23:13 base2 kernel: [    1.595357] Compaq SMART2 Driver (v 2.6.0)
Jan  3 23:23:13 base2 kernel: [    1.599995] scsi: <fdomain> Detection failed (no card)
Jan  3 23:23:13 base2 kernel: [    1.608948] Emulex LightPulse Fibre Channel SCSI driver 8.3.12
Jan  3 23:23:13 base2 kernel: [    1.610431] Copyright(c) 2004-2009 Emulex.  All rights reserved.
Jan  3 23:23:13 base2 kernel: [    1.635040] Failed initialization of WD-7000 SCSI card!
Jan  3 23:23:13 base2 kernel: [    1.672485] GDT-HA: Storage RAID Controller Driver. Version: 3.05
Jan  3 23:23:13 base2 kernel: [    1.674207] 3ware Storage Controller device driver for Linux v1.26.02.003.
Jan  3 23:23:13 base2 kernel: [    1.675841] 3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
Jan  3 23:23:13 base2 kernel: [    2.172523] PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp
Jan  3 23:23:13 base2 kernel: [    2.179096] ata1: softreset failed (device not ready)
Jan  3 23:23:13 base2 kernel: [    2.179099] ata1: applying SB600 PMP SRST workaround and retrying
Jan  3 23:23:13 base2 kernel: [    2.445517]  sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 sda10 sda11 >
Jan  3 23:23:13 base2 kernel: [    2.605160] registered taskstats version 1
Jan  3 23:23:13 base2 kernel: [    2.642812] EXT3-fs (sda6): error: couldn't mount because of unsupported optional features (240)
Jan  3 23:23:13 base2 kernel: [    2.644658] EXT2-fs (sda6): error: couldn't mount because of unsupported optional features (240)
Jan  3 23:23:13 base2 kernel: [    2.674769] VFS: Mounted root (ext4 filesystem) readonly on device 8:6.
Jan  3 23:23:13 base2 kernel: [    4.056908] ACPI: resource piix4_smbus [io  0x0b00-0x0b07] conflicts with ACPI region SOR1 [??? 0x00000b00-0x00000b0f flags 0x31]
Jan  3 23:23:13 base2 kernel: [    4.062051] k8temp 0000:00:18.3: Temperature readouts might be wrong - check erratum #141
Jan  3 23:23:13 base2 kernel: [    5.247158] nvidia: module license 'NVIDIA' taints kernel.
Jan  3 23:23:13 base2 kernel: [    5.249073] Disabling lock debugging due to kernel taint
Jan  3 23:23:13 base2 kernel: [    6.276402] NVRM: loading NVIDIA UNIX x86 Kernel Module  260.19.21  Thu Nov  4 20:24:24 PDT 2010
Jan  3 23:23:20 base2 console-kit-daemon[1792]: WARNING: Failed to acquire org.freedesktop.ConsoleKit
Jan  3 23:23:20 base2 console-kit-daemon[1792]: WARNING: Could not acquire name; bailing out
Jan  3 23:24:19 base2 python: hp-systray[2271]: warning: No hp: or hpfax: devices found in any installed CUPS queue. Exiting.

+========================================+

Update Wed Jan 5 2011:

Okay so the bug is back!

I am running slackware current and got crashes twice this morning.

This has to be fixed!

If I were a new user I would give up on Linux and go back to Windows...

In the old days we used to make fun of Windows because the "Blue Screen of Death" would occur so often.

This is exactly what I am seeing now with Linux!

-----------------------------------------------------
Jan 3 2011:

I'm beginning to think that my random kernel oops are caused by my hardware.

When my fairly new PS/2 mouse is connected to a KVM, I get the random kernel oops.

And at other times the mouse is misconfigured by the kernel as a keyboard and does not work as a mouse.

When I disconnect the KVM and plug the mouse directly into the motherboard, then it works fine.

I have gone back to Slackware current using udev-165 and kernel 2.6.35.7.

So far no kernel oops and the mouse is recognized as a mouse.

There is still a kernel bug in there and it needs to be fixed, but hopefully it is not a very common bug.

A bad mouse should not be allowed to cause a kernel oops.

Hi all,
A follow-up to post #25. The problem is probably in the kernel scsi/sg code, specifically the code supporting "ATA pass-through" functionality. What's new in udev-165 is a function, "disk_identify_packet_device_command", which issues a 16-byte SCSI ATA pass-through command through the version 4 SG_IO interface to identify a cd/dvd drive and, if that fails with EINVAL, retries through the version 3 interface. I believe it is the version 3 attempt that causes the oops; commenting out line 270 of extras/ata_id/ata_id.c:

253     ret = ioctl(fd, SG_IO, &io_v4);
254     if (ret != 0) {
255         /* could be that the driver doesn't do version 4, try version 3 */
256         if (errno == EINVAL) {
257             struct sg_io_hdr io_hdr;
258
259             memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
260             io_hdr.interface_id = 'S';
261             io_hdr.cmdp = (unsigned char*) cdb;
262             io_hdr.cmd_len = sizeof (cdb);
263             io_hdr.dxferp = buf;
264             io_hdr.dxfer_len = buf_len;
265             io_hdr.sbp = sense;
266             io_hdr.mx_sb_len = sizeof (sense);
267             io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
268             io_hdr.timeout = COMMAND_TIMEOUT_MSEC;
269
270             // ret = ioctl(fd, SG_IO, &io_hdr);
271             if (ret != 0)
272                 goto out;
273         } else {
274             goto out;
275         }
276     }

appears to eliminate the panic.

Also, running the "sg_sat_identify" command from the sg3_utils package (http://sg.danny.cz/sg/sg3_utils.html#mozTocId479511), e.g.,
sg_sat_identify -p /dev/dvd

works, while running it as
sg_sat_identify -p -c /dev/dvd

frequently produces a kernel panic which looks the same as the udevd one at bootup. The difference is the -c switch, which instructs the kernel to write back ATA register data in the sense buffer. The udev-165 code does the same thing by setting the ck_cond bit, hence the oops.

UPDATE: Further testing shows that the ck_cond value is not relevant: the panic results regardless of how ck_cond is set.
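For anyone who wants to see what the ck_cond bit actually changes, here is a minimal sketch of the 16-byte ATA pass-through CDB for IDENTIFY PACKET DEVICE. The function name build_identify_packet_cdb and the ATA_PT16_CDB_LEN macro are made up for illustration; the byte values follow the ATA PASS-THROUGH (16) layout as I understand it, and this is not copied from ata_id.c. It only fills the buffer and never touches a device, so it cannot trigger the oops.

```c
#include <stdint.h>
#include <string.h>

#define ATA_PT16_CDB_LEN 16   /* hypothetical name: length of the 16-byte CDB */

/* Fill cdb with an ATA PASS-THROUGH (16) command wrapping
 * ATA IDENTIFY PACKET DEVICE. ck_cond selects whether the ATA
 * registers are requested back in the sense data. */
void build_identify_packet_cdb(uint8_t cdb[ATA_PT16_CDB_LEN], int ck_cond)
{
    memset(cdb, 0, ATA_PT16_CDB_LEN);
    cdb[0]  = 0x85;        /* opcode: ATA PASS-THROUGH (16) */
    cdb[1]  = 4 << 1;      /* protocol 4: PIO data-in */
    cdb[2]  = 0x0e;        /* T_DIR=1 (from device), BYT_BLOK=1, T_LENGTH=2 */
    if (ck_cond)
        cdb[2] |= 0x20;    /* CK_COND bit, the one sg_sat_identify -c sets */
    cdb[6]  = 1;           /* sector count: one block of data */
    cdb[14] = 0xA1;        /* ATA command: IDENTIFY PACKET DEVICE */
}
```

With ck_cond set, byte 2 comes out as 0x2e; without it, 0x0e. Everything else in the command is identical, which fits the observation that the panic does not depend on this bit.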

As a test, on one of my test boxes, I put udev-164-i486-3 back in the system, but kept the /etc/rc.d/rc.udev from udev-165 (as that apparently creates the /dev/root properly according to the changelog).

I have yet to notice any issues with the downgrade, and haven't had the boot time kernel oops yet after a bunch of halt, reboot, suspend or hibernate.

I just installed linux-2.6.37, and re-installed a vanilla udev-165. So far things look good, so (hopefully) the issue has been resolved in the kernel. The problem may be related to the fix (http://www.kernel.org/pub/linux/kern...geLog-2.6.37):

commit 2a5f07b5ec098edc69e05fdd2f35d3fbb1235723
Author: Tejun Heo <tj@kernel.org>
Date: Mon Nov 1 11:39:19 2010 +0100

libata: fix NULL sdev dereference race in atapi_qc_complete()

SCSI commands may be issued between __scsi_add_device() and dev->sdev
assignment, so it's unsafe for ata_qc_complete() to dereference
dev->sdev->locked without checking whether it's NULL or not. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
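
The kind of fix that commit message describes can be illustrated with a small user-space analogy. This is not the kernel code: struct toy_dev, struct toy_sdev and toy_clear_locked are hypothetical stand-ins, and the real patch also has to cope with concurrency, which this sketch ignores. It only shows the idea of checking the pointer before dereferencing it.

```c
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's real structures. */
struct toy_sdev { int locked; };
struct toy_dev  { struct toy_sdev *sdev; /* may still be NULL during setup */ };

/* Clear the lock flag only if the SCSI-side object has been attached.
 * Returns 1 if the flag was cleared, 0 if sdev was not assigned yet. */
int toy_clear_locked(struct toy_dev *dev)
{
    if (dev == NULL || dev->sdev == NULL)
        return 0;              /* the unguarded version would crash here */
    dev->sdev->locked = 0;
    return 1;
}
```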

I am still quietly putting up with the issue, which causes a failed boot roughly once in every 10 boots. Any word on an 'official' fix from the Slackware team? Otherwise I suspect I shall simply downgrade the udev package.

That patch is in 2.6.35.10; has anyone reproduced the problem with that kernel, perchance?

New kernels in -current are still a little ways out, I think - some other stuff probably needs to hit the tree first.

Quote:

Originally Posted by rworkman (Post 4214717)

That patch is in 2.6.35.10; has anyone reproduced the problem with that kernel, perchance?

I am new to -current and keen to get involved so:

Code:

andrew@skamandros~$ uname -r
2.6.35.10-ads

Used PV's config and just built the filesystem in; I shall watch for the familiar problem over the next couple of days :).

Follow-up to post #38:
I jumped the gun... The problem remains in kernel linux-2.6.37.

Follow-up to post #36:
The problem is probably in the kernel block, drivers/scsi/sg, or drivers/scsi/sd code, specifically related to "ATA pass-through" functionality, and probably only occurs for certain drive hardware. I don't know anything about this code, so until someone who does can fix it, using udev-164 (which doesn't use the ATA pass-through command on cd/dvd devices in ata_id.c), or commenting out this command in ata_id.c for udev-165, will side-step the issue for me.

Additional experimentation:
Another possible cause of this oops could be inappropriate buffer alignment. I built udev-165 with ata_id.c patched to use page-aligned sense and response buffers (rather than simple unsigned char arrays), and so far it looks promising: no panics yet (see attached patch).
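
For readers without the attachment, a page-aligned buffer can be obtained roughly like this. It is a sketch only, assuming posix_memalign and sysconf(_SC_PAGESIZE) are available; the helper name alloc_page_aligned is made up here, and the attached patch may well do it differently.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign under strict -std modes */
#endif
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate a zeroed buffer of len bytes aligned to the system page size.
 * Returns NULL on failure; release it with free(). */
void *alloc_page_aligned(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = NULL;

    if (page <= 0 || len == 0)
        return NULL;
    if (posix_memalign(&p, (size_t)page, len) != 0)
        return NULL;
    memset(p, 0, len);
    return p;
}
```

The sense and response buffers would then come from alloc_page_aligned() instead of being unsigned char arrays on the stack, and the sizes passed to the ioctl would be the allocated lengths rather than sizeof on an array.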

Quote:

Originally Posted by resonance (Post 4215639)

I agree with this observation. I have an old PIII 850 MHz 32 bit -current box that has no issues with booting at all. But, my two newer intel dual core boxes were crashing with 32 bit -current. I've moved my newer boxes to Slackware64-current for the time being.

Update Fri Jan 7:

2.6.35.10 and udev-165 crashed the same way this morning...
----------------------------------------------------
Thurs Jan 6:

Quote:

Originally Posted by andrew.46 (Post 4215178)

Used PV's config and just built the filesystem in; I shall watch for the familiar problem over the next couple of days :).

I did the same thing using the current huge smp kernel config to build a 2.6.35.10 kernel.

And I now use 2.6.35.10 and udev-165 along with the rest of Slackware current.

My set up is a new Intel dual core D510MO motherboard using an sata drive hooked up to the motherboard disk controller.

Quote:

Originally Posted by rworkman (Post 4214717)

I have been running 2.6.35.10 long-term for about two weeks now. I had two kernel panics in a row last night. I have had kernel panics a couple of other times as well. I don't know how to catch the output from the panic. It looks similar to what I've read in this thread. I must say I'm relieved it's not just my system.