LinuxQuestions.org


Storm16 05-24-2006 10:15 PM

Hard drive failure + Kernel reinstall = panics and odd behavior
 
I'm having a problem with my wife's machine. She recently lost a hard drive, which I replaced. I also replaced her mainboard/CPU/RAM in the process, with a board/chip that came out of my machine a week before. It's a 1.2 GHz Athlon with 768 MB of RAM. I've been using the same hardware (except the hard drive) in another box, and I have 10 other Debian unstable machines running in the house, so I'm not sure whether the problem is hardware or software, but I am leaning more toward the hardware. The lspci output is:

Code:

0000:00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 81)
0000:00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
0000:00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40)
0000:00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
0000:00:07.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1a)
0000:00:07.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1a)
0000:00:07.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
0000:00:11.0 Multimedia audio controller: ESS Technology ES1978 Maestro 2E (rev 10)
0000:00:12.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
0000:00:14.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 02)
0000:00:14.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 02)
0000:01:00.0 VGA compatible controller: nVidia Corporation NV11 [GeForce2 MX/MX 400] (rev a1)

Drives from dmesg:

Code:

hda: IBM-DTTA-371010, ATA DISK drive
hdb: Hewlett-Packard CD-Writer Plus 9100, ATAPI CD/DVD-ROM drive
hdc: ST328040A, ATA DISK drive
hdd: HITACHI GD-2000, ATAPI CD/DVD-ROM drive

Since I lost the drive, I reinstalled using the Debian etch beta netinst cd. Install went fine, added packages, installed the right processor version of the kernel:

Code:

Linux sutherland 2.6.16-1-k7 #2 Thu May 4 18:35:10 UTC 2006 i686 GNU/Linux

I also tried compiling my own kernel for this machine:

Code:

Linux sutherland 2.6.16 #1 PREEMPT Thu May 18 23:57:42 EDT 2006 i686 GNU/Linux

With similar results.

Installed the same package list that she was using before (Debian unstable). When I noticed that the system was having problems, I did a package upgrade.

In the week since doing this, the machine has regularly locked up or behaved strangely. A few examples...
  • In Firefox, at times (like during text entry in search boxes), the text will change from uppercase to lowercase at random.
  • She was sitting on a page in Firefox, and the browser executed a "back" without any input from her.
  • I rebooted the other night, and when I logged in to her account (running KDE), the machine locked hard. I was unable to even <Ctrl><Alt><F1> to get to a console. I had to reboot again.
  • The printer, a USB HP PSC 1315v, works sometimes, and other times it does not.
  • Starting Kontact causes a crash message from KDE, and sometimes locks up the system.
  • Was typing on a console last night while troubleshooting the issue, and without pressing any control keys, characters stayed lower case while other characters were shifted (e.g. / became ?). If I hit the caps lock, the situation reversed. I am swapping keyboards, so hopefully this is the problem.
  • Audio is garbled. It's kinda hard to explain, but the sound wavers in and out, in frequency and volume. It's the same with video: I run tvtime, and it flickers in time with the audio.

I did manage to capture two kernel panics.

Code:

May 17 23:35:33 sutherland kernel: ------------[ cut here ]------------
May 17 23:35:33 sutherland kernel: kernel BUG at lib/prio_tree.c:149!
May 17 23:35:33 sutherland kernel: invalid opcode: 0000 [#1]
May [...] comm [...]otplug via_agp agpgart parport_pc parport ext3 jbd mbcache ide_cd cdrom ide_disk 3c59x mii uhci_hcd usbcore via82cxxx generic ide_core processor
May 17 23:35:33 sutherland kernel: CPU:    0
May 17 23:35:33 sutherland kernel: EIP:    0060:[prio_tree_replace+32/97]    Tainted: P      VLI
May 17 23:35:33 sutherland kernel: EFLAGS: 00010203  (2.6.16-1-k7 #2)
May 17 23:35:33 sutherland kernel: EIP is at prio_tree_replace+0x20/0x61
May 17 23:35:33 sutherland kernel: eax: b1968884  ebx: bd6002cc  ecx: c9579514  edx: b1968884
May 17 23:35:33 sutherland kernel: esi: b196885c  edi: c95794ec  ebp: bd6002cc  esp: bad0bdec
May 17 23:35:33 sutherland kernel: ds: 007b  es: 007b  ss: 0068
May 17 23:35:33 sutherland kernel: Process gaim (pid: 16611, threadinfo=bad0a000 task=ce793050)
May 17 23:35:33 sutherland kernel: Stack: <0>00000000 b0137761 bd6002cc b1968884 c9579514 b196885c b196885c b1968804
May 17 23:35:33 sutherland kernel:        a68f7000 b013b2e6 b196885c bd6002cc c1de68b4 b0138ab8 b196885c b196885c
May 17 23:35:33 sutherland kernel:        bad0be6c c1de64ec c42aa3e0 bad0bec0 b013b60c bad0be6c c1de64ec 00000000
May 17 23:35:33 sutherland kernel: Call Trace:
May 17 23:35:33 sutherland kernel:  [vma_prio_tree_remove+129/192] vma_prio_tree_remove+0x81/0xc0
May 17 23:35:33 sutherland kernel:  [__remove_shared_vm_struct+68/72] __remove_shared_vm_struct+0x44/0x48
May 17 23:35:33 sutherland kernel:  [free_pgtables+34/108] free_pgtables+0x22/0x6c
May 17 23:35:33 sutherland kernel:  [exit_mmap+102/179] exit_mmap+0x66/0xb3
May 17 23:35:33 sutherland kernel:  [mmput+28/96] mmput+0x1c/0x60
May 17 23:35:33 sutherland kernel:  [exit_mm+183/188] exit_mm+0xb7/0xbc
May 17 23:35:33 sutherland kernel:  [do_exit+392/1597] do_exit+0x188/0x63d
May 17 23:35:33 sutherland kernel:  [sys_exit_group+0/17] sys_exit_group+0x0/0x11
May 17 23:35:33 sutherland kernel:  [get_signal_to_deliver+844/860] get_signal_to_deliver+0x34c/0x35c
May 17 23:35:33 sutherland kernel:  [do_notify_resume+138/1475] do_notify_resume+0x8a/0x5c3
May 17 23:35:33 sutherland kernel:  [sigprocmask+127/148] sigprocmask+0x7f/0x94
May 17 23:35:33 sutherland kernel:  [sys_rt_sigprocmask+71/154] sys_rt_sigprocmask+0x47/0x9a
May 17 23:35:33 sutherland kernel:  [work_notifysig+19/25] work_notifysig+0x13/0x19
May 17 23:35:33 sutherland kernel: Code: 80 8b 03 eb 02 31 c0 5b 5e 5f c3 53 8b 54 24 0c 8b 4c 24 10 8b 5c 24 08 89 49 08 8b 42 08 89 49 04 89 09 39 d0 75 13 39 13 74 08 <0f> 0b 95 00 1b db 27 b0 89 49 08 89 0b eb 11 89 41 08 8b 42 08
May 17 23:35:33 sutherland kernel:  <1>Fixing recursive fault but reboot is needed!

Code:

May 18 01:49:03 sutherland kernel: Unable to handle kernel paging request at virtual address fffe0004
May 18 01:49:03 sutherland kernel:  printing eip:
May 18 01:49:03 sutherland kernel: b0144438
May 18 01:49:03 sutherland kernel: *pde = 00002067
May 18 01:49:03 sutherland kernel: *pte = 00000000
May 18 01:49:03 sutherland kernel: Oops: 0002 [#2]
May [...] comm [...]otplug via_agp agpgart parport_pc parport ext3 jbd mbcache ide_cd cdrom ide_disk 3c59x mii uhci_hcd usbcore via82cxxx generic ide_core processor
May 18 01:49:03 sutherland kernel: CPU:    0
May 18 01:49:03 sutherland kernel: EIP:    0060:[cache_alloc_refill+305/1004]    Tainted: P      VLI
May 18 01:49:03 sutherland kernel: EFLAGS: 00010046  (2.6.16-1-k7 #2)
May 18 01:49:03 sutherland kernel: EIP is at cache_alloc_refill+0x131/0x3ec
May 18 01:49:03 sutherland kernel: eax: dfffdce0  ebx: ffffffff  ecx: dffffc00  edx: fffe0000
May 18 01:49:03 sutherland kernel: esi: b03b7000  edi: dfffdce0  ebp: dfff9200  esp: dd9dde28
May 18 01:49:03 sutherland kernel: ds: 007b  es: 007b  ss: 0068
May 18 01:49:03 sutherland kernel: Process ud (pid: 6445, threadinfo=dd9dc000 task=df172ab0)
May 18 01:49:03 sutherland kernel: Stack: <0>00000022 00000050 dffffc00 ffffffff 00001478 b13de760 b13de760 ffffffff
May 18 01:49:03 sutherland kernel:        00000050 dffffc00 00000246 00000000 00001000 b01442fe b13de760 00000000
May 18 01:49:03 sutherland kernel:        b0147dd5 dffffc00 00000050 b13de760 b014919b 00000050 b13de760 00001000
May 18 01:49:03 sutherland kernel: Call Trace:
May 18 01:49:03 sutherland kernel:  [kmem_cache_alloc+44/53] kmem_cache_alloc+0x2c/0x35
May 18 01:49:03 sutherland kernel:  [alloc_buffer_head+16/39] alloc_buffer_head+0x10/0x27
May 18 01:49:03 sutherland kernel:  [alloc_page_buffers+24/162] alloc_page_buffers+0x18/0xa2
May 18 01:49:03 sutherland kernel:  [__getblk+330/448] __getblk+0x14a/0x1c0
May 18 01:49:03 sutherland kernel:  [pg0+813744024/1338717184] do_journal_end+0x404/0xac4 [reiserfs]
May 18 01:49:03 sutherland kernel:  [pagevec_lookup_tag+30/37] pagevec_lookup_tag+0x1e/0x25

After the package upgrade, the lockups still occurred, but the panics do not go to the logs. I was logged in remotely from my laptop, and saw the following today in a wall message:

Code:

sutherland kernel: Oops: 0000 [#1]
sutherland kernel: CPU:    0
sutherland kernel: EIP is at __find_get_block_slow+0x6c/0xed
sutherland kernel: eax: 00000000  ebx: fffdfffe  ecx: 00000001  edx: ffffffff
sutherland kernel: esi: 0000800b  edi: 00000000  ebp: b1132c60  esp: c3e11a60
sutherland kernel: ds: 007b  es: 007b  ss: 0068
sutherland kernel: Process kontact (pid: 7972, threadinfo=c3e10000 task=c8ec3ab0)
sutherland kernel: Stack: <0>de418b24 b98eff40 cf1cb9c4 00108003 00000000 00000008 b0149fae df864c00
sutherland kernel:        c3e11e38 cf4cde3c b0149fa3 00000000 00000000 00000000 00000000 00000020
sutherland kernel:        00001000 0000800b df864c00 b0149fdc de418ac0 0000800b 00000000 00001000
sutherland kernel: Call Trace:
sutherland kernel:  [__find_get_block+297/317] __find_get_block+0x129/0x13d
sutherland kernel:  [__find_get_block+286/317] __find_get_block+0x11e/0x13d
sutherland kernel:  [__getblk+26/448] __getblk+0x1a/0x1c0
sutherland kernel:  [pg0+813802999/1338717184] search_by_key+0x78/0xd78 [reiserfs]
sutherland kernel:  [pg0+813737607/1338717184] reiserfs_update_sd_size+0x64/0x25e [reiserfs]
sutherland kernel:  [pg0+813725351/1338717184] reiserfs_rename+0x757/0x878 [reiserfs]
sutherland kernel:  [ll_rw_block+127/142] ll_rw_block+0x7f/0x8e
sutherland kernel:  [pg0+813803282/1338717184] search_by_key+0x193/0xd78 [reiserfs]
sutherland kernel:  [mntput_no_expire+20/96] mntput_no_expire+0x14/0x60
sutherland kernel:  [sys_renameat+347/460] sys_renameat+0x15b/0x1cc
sutherland kernel:  [sys_faccessat+146/306] sys_faccessat+0x92/0x132
sutherland kernel:  [handle_IRQ_event+32/76] handle_IRQ_event+0x20/0x4c
sutherland kernel:  [__do_IRQ+101/145] __do_IRQ+0x65/0x91
sutherland kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
sutherland kernel: Code: 9f 00 00 00 8b 00 f6 c4 08 0f 84 8b 00 00 00 8b 45 00 f6 c4 08 75 08 0f 0b 9d 01 42 7d 27 b0 8b 5d 0c b9 01 00 00 00 89 5c 24 04 <8b> 53 18 8b 43 14 39 fa 75 04 39 f0 74 5c 8b 03 8b 5b 04 a8 20
sutherland kernel:  [pg0+813803171/1338717184] search_by_key+0x124/0xd78 [reiserfs]

In the intervening time, I captured another trap remotely. In the wall-message oops above, the process was kontact. In the new one from last night, it was firefox-bin, and in the two captured from the logs above, the processes were gaim and ud respectively. This suggests to me that there is no rhyme or reason to which process triggers the fault. I ran a 12-hour memtest86+, which passed with zero errors.
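For anyone wanting to repeat that comparison across a larger log, here is a minimal Python sketch that pulls the faulting function and process name out of each oops. The function name summarize_oopses and the two regular expressions are my own and only assume the "EIP is at" and "Process ... (pid:" line formats seen in the traces quoted above; the sample text is copied from those traces.

```python
import re
from collections import Counter

# These patterns only assume the line formats seen in the quoted traces.
EIP_RE = re.compile(r"EIP is at (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
PROCESS_RE = re.compile(r"Process (\S+) \(pid: (\d+)")


def summarize_oopses(log_text):
    """Return one (faulting_function, process_name) pair per oops found."""
    results = []
    current_eip = None
    for line in log_text.splitlines():
        eip = EIP_RE.search(line)
        if eip:
            # Remember the function until the matching "Process" line shows up.
            current_eip = eip.group(1)
            continue
        proc = PROCESS_RE.search(line)
        if proc:
            results.append((current_eip, proc.group(1)))
            current_eip = None
    return results


# Lines copied from the three traces in this post.
sample = """\
May 17 23:35:33 sutherland kernel: EIP is at prio_tree_replace+0x20/0x61
May 17 23:35:33 sutherland kernel: Process gaim (pid: 16611, threadinfo=bad0a000 task=ce793050)
May 18 01:49:03 sutherland kernel: EIP is at cache_alloc_refill+0x131/0x3ec
May 18 01:49:03 sutherland kernel: Process ud (pid: 6445, threadinfo=dd9dc000 task=df172ab0)
sutherland kernel: EIP is at __find_get_block_slow+0x6c/0xed
sutherland kernel: Process kontact (pid: 7972, threadinfo=c3e10000 task=c8ec3ab0)
"""

if __name__ == "__main__":
    pairs = summarize_oopses(sample)
    for func, proc in pairs:
        print(f"{proc:10s} faulted in {func}")
    # A flat count per process is a quick check for "no single culprit".
    print(Counter(proc for _, proc in pairs))
```

On these three traces every oops names a different process and a different kernel function, which fits the "no pattern" reading, though three samples are not much to go on.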

Apologies, I cross posted this from the hardware forum, because of lack of response. Still kinda new here. Can anyone help me track this down?

Thanks,
--Storm

sundialsvcs 05-26-2006 05:46 PM

I frankly don't think that you have successfully exterminated all the hardware problems yet.

exvor 05-26-2006 05:57 PM

I suggest a young priest and an old priest. Cause from your post there it really sounds like you got demon possession.

Other than that, I'm baffled.

Storm16 05-26-2006 07:50 PM

Quote:

Originally Posted by sundialsvcs
I frankly don't think that you have successfully exterminated all the hardware problems yet.

I've had a chance to troubleshoot further and found at least part of the problem, specifically with the audio and video: the printer was causing some of it. I was trying to set up the printer, an HP PSC 1315v, once and for all. I couldn't get it configured, so I pulled the USB cable, and voilà, the audio and video distortion went away. Looking at /proc/interrupts:

Code:

           CPU0
  0:   61746234          XT-PIC  timer
  1:      12978          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  5:    5030502          XT-PIC  uhci_hcd:usb1, uhci_hcd:usb2, bttv0, Bt87x audio
  7:         68          XT-PIC  parport0
  8:          4          XT-PIC  rtc
 10:   25409621          XT-PIC  ESS Maestro, nvidia
 11:     314697          XT-PIC  acpi, eth0
 12:     690274          XT-PIC  i8042
 14:    2202649          XT-PIC  ide0
 15:     166795          XT-PIC  ide1
NMI:          0
LOC:          0
ERR:          0
MIS:          0

I've swapped cables on the printer to no avail. I plan to try the printer on a different machine to see if it is a problem with the printer itself. The scanning functionality works, so I wonder if it is some issue with the scanner drivers interfering with the printer drivers.

That said, is it a good idea to rearrange the IRQs? My kids' computer, a PIII/800, has nothing on IRQ 5, but has acpi, uhci_hcd and nvidia on IRQ 9. My computer (Athlon XP 1800+) has ehci_hcd:usb1, ohci_hcd:usb5, bttv0, Bt87x audio, EMU10K1 on IRQ 16. I thought I would try putting the USB on IRQ 9 through the BIOS.
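To compare the sharing across machines without eyeballing each table, a small Python sketch like the one below can list which IRQs carry more than one device. The function name shared_irqs is made up for illustration, and the parsing assumes the single-CPU layout shown above (one count column, then the controller name, then a comma-separated device list); a multi-CPU box has more count columns and would need adjusting.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> device list, for IRQs used by more than one device.

    Assumes the single-CPU /proc/interrupts layout quoted in this post.
    """
    shared = {}
    for line in interrupts_text.splitlines():
        # Split at the first colon only; device names such as
        # "uhci_hcd:usb1" contain colons of their own.
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skips the CPU0 header and the NMI/LOC/ERR/MIS rows
        fields = rest.split(None, 2)  # count, controller, device list
        if len(fields) < 3:
            continue
        devices = [name.strip() for name in fields[2].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


# Table copied from the /proc/interrupts output above.
sample = """\
           CPU0
  0:   61746234          XT-PIC  timer
  1:      12978          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  5:    5030502          XT-PIC  uhci_hcd:usb1, uhci_hcd:usb2, bttv0, Bt87x audio
  7:         68          XT-PIC  parport0
  8:          4          XT-PIC  rtc
 10:   25409621          XT-PIC  ESS Maestro, nvidia
 11:     314697          XT-PIC  acpi, eth0
 12:     690274          XT-PIC  i8042
 14:    2202649          XT-PIC  ide0
 15:     166795          XT-PIC  ide1
NMI:          0
LOC:          0
ERR:          0
MIS:          0
"""

if __name__ == "__main__":
    for irq, devices in sorted(shared_irqs(sample).items()):
        print(f"IRQ {irq}: {', '.join(devices)}")
```

On this table it reports IRQ 5 (both USB controllers plus the Bt878 video and audio capture), IRQ 10 (sound and nvidia) and IRQ 11 (acpi and eth0), so the USB printer and the TV card share a line, which at least fits the flicker going away when the cable is pulled.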

Thoughts?
--Storm

Lenard Spencer 07-16-2006 06:51 AM

Just out of curiosity, what motherboard is in this problem system? I have to ask because I used to buy Matsonic boards because the first one I bought (the 6260S Super-7) is still running like a charm in my old DOS-box (I still love the classic games), but when I stepped up to the Athlons I bought the 8137c+, and of the three I bought two failed, trashing the hard drives as they went.


All times are GMT -5.