LinuxQuestions.org


quill18 02-19-2004 09:02 AM

Kernel Oops - System also hangs under load.
 
Hi everyone, here's the situation:

Two days ago I signed up for a dedicated web host. Everything looked great on the server and testing revealed no problems, so I moved my busy (250,000 hits/day) website to the machine. Everything worked great for a couple of hours, then the machine hung. I put in a support request to have the machine rebooted, but the same thing happens every time I set the site live (I can't get the machine to hang with anything I do myself). The system is Red Hat 9.

My provider swears that they've checked the hardware. I don't actually believe them, but...

Anyway - I don't have access to the local console, so I can't vouch for any message that might be there. The logs don't usually reveal anything, but there are 2 Oops entries in there which did not produce a crash. Here's one:

ksymoops 2.4.5 on i686 2.4.20-28.9. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-28.9/ (default)
-m /boot/System.map-2.4.20-28.9 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Error (expand_objects): cannot stat(/lib/ext3.o) for ext3
Error (expand_objects): cannot stat(/lib/jbd.o) for jbd
Error (pclose_local): find_objects pclose failed 0x100
Warning (map_ksym_to_module): cannot match loaded module ext3 to a unique module object. Trace may not be reliable.
Feb 19 04:03:25 sm12311 kernel: Unable to handle kernel paging request at virtual address 13cd7cef
Feb 19 04:03:25 sm12311 kernel: c01187bb
Feb 19 04:03:25 sm12311 kernel: *pde = 00000000
Feb 19 04:03:25 sm12311 kernel: Oops: 0000
Feb 19 04:03:25 sm12311 kernel: CPU: 0
Feb 19 04:03:25 sm12311 kernel: EIP: 0060:[<c01187bb>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Feb 19 04:03:25 sm12311 kernel: EFLAGS: 00010086
Feb 19 04:03:25 sm12311 kernel: eax: f2e5b214 ebx: d8244000 ecx: 51eb851f edx: 00000ea1
Feb 19 04:03:25 sm12311 kernel: esi: d2e42000 edi: 00000000 ebp: d8245f54 esp: d8245f4c
Feb 19 04:03:25 sm12311 kernel: ds: 0068 es: 0068 ss: 0068
Feb 19 04:03:25 sm12311 kernel: Process logrotate (pid: 1241, stackpage=d8245000)
Feb 19 04:03:25 sm12311 kernel: Stack: d2e42000 00000000 bfffe3c8 c011b479 01200011 bfffe358 d8245fc4 00000000
Feb 19 04:03:25 sm12311 kernel: 00000000 400270c8 00000028 00000066 080e9830 00000000 fffffff2 00000000
Feb 19 04:03:25 sm12311 kernel: bfffe3e0 c0127d92 00000000 00000000 400270c8 c0107b39 01200011 bfffe358
Feb 19 04:03:25 sm12311 kernel: Call Trace: [<c011b479>] do_fork [kernel] 0x99 (0xd8245f58))
Feb 19 04:03:25 sm12311 kernel: [<c0127d92>] sys_rt_sigprocmask [kernel] 0xf2 (0xd8245f90))
Feb 19 04:03:25 sm12311 kernel: [<c0107b39>] sys_clone [kernel] 0x49 (0xd8245fa0))
Feb 19 04:03:25 sm12311 kernel: [<c010953f>] system_call [kernel] 0x33 (0xd8245fc0))
Feb 19 04:03:25 sm12311 kernel: Code: 02 89 d0 f7 e1 c1 ea 05 89 53 38 8b 56 38 8d 14 92 8d 14 92



>>EIP; c01187bb <wake_up_forked_process+2b/f0> <=====

>>eax; f2e5b214 <END_OF_CODE+124c68a9/????>
>>ebx; d8244000 <_end+17e64e80/2042dee0>
>>ecx; 51eb851f Before first symbol
>>edx; 00000ea1 Before first symbol
>>esi; d2e42000 <_end+12a62e80/2042dee0>
>>ebp; d8245f54 <_end+17e66dd4/2042dee0>
>>esp; d8245f4c <_end+17e66dcc/2042dee0>

Trace; c011b479 <do_fork+99/140>
Trace; c0127d92 <sys_rt_sigprocmask+f2/160>
Trace; c0107b39 <sys_clone+49/70>
Trace; c010953f <system_call+33/38>

Code; c01187bb <wake_up_forked_process+2b/f0>
00000000 <_EIP>:
Code; c01187bb <wake_up_forked_process+2b/f0> <=====
0: 02 89 d0 f7 e1 c1 add 0xc1e1f7d0(%ecx),%cl <=====
Code; c01187c1 <wake_up_forked_process+31/f0>
6: ea 05 89 53 38 8b 56 ljmp $0x568b,$0x38538905
Code; c01187c8 <wake_up_forked_process+38/f0>
d: 38 8d 14 92 8d 14 cmp %cl,0x148d9214(%ebp)
Code; c01187ce <wake_up_forked_process+3e/f0>
13: 92 xchg %eax,%edx


2 warnings and 3 errors issued. Results may not be reliable.

h/w 02-19-2004 03:22 PM

A paging request failure? Did they change (add or remove) the RAM or something like that? They do that sometimes, and maybe the module they popped in is not OK.

I don't know how much decoding the oops message will help if that is the case, but do you know how to? You basically have to check your System.map file for the symbol at 'c01187bb'. If you don't see a symbol at exactly c01187bb, you'll have to find the nearest symbol below it and work out the offset, then find the offending lines inside that function.

Don't think I have been of much help here.

quill18 02-19-2004 03:50 PM

Quote:

Originally posted by h/w
A paging request failure? Did they change (add or remove) the RAM or something like that? They do that sometimes, and maybe the module they popped in is not OK.

I'm sure they haven't changed anything - but I *really* wish they would. I'm pretty convinced that it's a hardware error.

Quote:

I don't know how much decoding the oops message will help if that is the case, but do you know how to? You basically have to check your System.map file for the symbol at 'c01187bb'. If you don't see a symbol at exactly c01187bb, you'll have to find the nearest symbol below it and work out the offset, then find the offending lines inside that function.

The nearest matches in my system map are:

c0118770 T wake_up_state
c0118790 T wake_up_forked_process
c0118880 T sched_exit
c01188e0 T schedule_tail
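
For anyone who wants to do the lookup by hand, here is a rough Python sketch of the "nearest symbol below the address, then the offset" step described above. It only uses the four System.map lines I pasted; the helper names (parse_system_map, resolve) are made up for this post, and the "size" is simply the distance to the next symbol in the excerpt, which is my assumption about how the +offset/size pair is worked out.

```python
import bisect

# Excerpt of the System.map lines quoted in this thread: address, type, symbol.
SYSTEM_MAP = """\
c0118770 T wake_up_state
c0118790 T wake_up_forked_process
c0118880 T sched_exit
c01188e0 T schedule_tail
"""

def parse_system_map(text):
    """Return a list of (address, symbol) tuples sorted by address."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        entries.append((int(parts[0], 16), parts[2]))
    entries.sort()
    return entries

def resolve(entries, address):
    """Map an address to (symbol, offset, size).

    The symbol is the last one whose start address is <= address.
    size is the gap to the next symbol, or None for the last entry.
    Returns None if the address lies before the first symbol.
    """
    addresses = [addr for addr, _ in entries]
    i = bisect.bisect_right(addresses, address) - 1
    if i < 0:
        return None
    start, name = entries[i]
    size = entries[i + 1][0] - start if i + 1 < len(entries) else None
    return name, address - start, size

entries = parse_system_map(SYSTEM_MAP)
name, offset, size = resolve(entries, 0xC01187BB)
print(f"{name}+{offset:x}/{size:x}")  # prints wake_up_forked_process+2b/f0
```

That agrees with the `wake_up_forked_process+2b/f0` that ksymoops reported for the EIP. To run it against a whole map you would read /boot/System.map-2.4.20-28.9 instead of the excerpt.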

I've actually noticed another Oops in my log, from a few days ago. The error message is different, but I see the same c01187bb as above:

Feb 17 13:36:11 sm12311 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000060
Feb 17 13:36:11 sm12311 kernel: e081f20e
Feb 17 13:36:11 sm12311 kernel: *pde = 00000000
Feb 17 13:36:11 sm12311 kernel: Oops: 0002
Feb 17 13:36:11 sm12311 kernel: CPU: 0
Feb 17 13:36:11 sm12311 kernel: EIP: 0060:[<e081f20e>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Feb 17 13:36:11 sm12311 kernel: EFLAGS: 00010282
Feb 17 13:36:11 sm12311 kernel: eax: 00000000 ebx: 00000000 ecx: c255a2f4 edx: 00000000
Feb 17 13:36:11 sm12311 kernel: esi: 00000001 edi: dbe33880 ebp: 00000000 esp: cb59fdc8
Feb 17 13:36:11 sm12311 kernel: ds: 0068 es: 0068 ss: 0068
Feb 17 13:36:11 sm12311 kernel: Process mysqld (pid: 2012, stackpage=cb59f000)
Feb 17 13:36:11 sm12311 kernel: Stack: cb59fe18 00000001 d0925704 d0925580 e081f5f1 dbe33880 d0925580 00000528
Feb 17 13:36:11 sm12311 kernel: cb59fe00 cb59fe18 00000001 00000003 0001913f 00000000 d09256c8 00018d1d
Feb 17 13:36:11 sm12311 kernel: 00000000 d3379410 00019122 d2e26c54 d2e4c870 0001913f d2e303bc 00000400
Feb 17 13:36:11 sm12311 kernel: Call Trace: [<e081f5f1>] ext3_get_block_handle [ext3] 0x251 (0xcb59fdd8))
Feb 17 13:36:11 sm12311 kernel: [<c0149b55>] get_unused_buffer_head [kernel] 0x65 (0xcb59fe2c))
Feb 17 13:36:11 sm12311 kernel: [<e081f6aa>] ext3_get_block [ext3] 0x4a (0xcb59fe50))
Feb 17 13:36:11 sm12311 kernel: [<c014a423>] __block_prepare_write [kernel] 0x193 (0xcb59fe70))
Warning (Oops_read): Code line not seen, dumping what data is available

>>EIP; c01187bb <wake_up_forked_process+2b/f0> <=====

>>eax; f2e5b214 <END_OF_CODE+124c68a9/????>
>>ebx; d8244000 <_end+17e64e80/2042dee0>
>>ecx; 51eb851f Before first symbol
>>edx; 00000ea1 Before first symbol
>>esi; d2e42000 <_end+12a62e80/2042dee0>
>>ebp; d8245f54 <_end+17e66dd4/2042dee0>
>>esp; d8245f4c <_end+17e66dcc/2042dee0>

Trace; c011b479 <do_fork+99/140>
Trace; c0127d92 <sys_rt_sigprocmask+f2/160>
Trace; c0107b39 <sys_clone+49/70>
Trace; c010953f <system_call+33/38>

Code; c01187bb <wake_up_forked_process+2b/f0>
00000000 <_EIP>:
Code; c01187bb <wake_up_forked_process+2b/f0> <=====
0: 02 89 d0 f7 e1 c1 add 0xc1e1f7d0(%ecx),%cl <=====
Code; c01187c1 <wake_up_forked_process+31/f0>
6: ea 05 89 53 38 8b 56 ljmp $0x568b,$0x38538905
Code; c01187c8 <wake_up_forked_process+38/f0>
d: 38 8d 14 92 8d 14 cmp %cl,0x148d9214(%ebp)
Code; c01187ce <wake_up_forked_process+3e/f0>
13: 92 xchg %eax,%edx
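
On how the two error messages differ: as far as I understand it, the number after "Oops:" on i386 is the page-fault error code, where bit 0 means protection fault (otherwise page not present), bit 1 means write (otherwise read) and bit 2 means user mode (otherwise kernel mode). A small sketch that decodes the two codes from the logs above under that assumption (decode_oops_code is just a name I made up):

```python
def decode_oops_code(code):
    """Decode an i386 page-fault error code as printed after 'Oops:'.

    Assumed bit layout: bit 0 = protection fault, bit 1 = write access,
    bit 2 = fault taken in user mode.
    """
    return {
        "cause": "protection fault" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }

# The two codes that appear in the logs in this thread.
for raw in ("0000", "0002"):
    print(raw, decode_oops_code(int(raw, 16)))
```

If that reading is right, the Feb 19 oops (0000) was a kernel-mode read of an unmapped page and the Feb 17 one (0002) was a kernel-mode write to an unmapped page.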

I hate not running my own hardware...but their bandwidth is so cheap...

h/w 02-19-2004 04:17 PM

<wake_up_forked_process>? Issues with the scheduler then, eh? Apart from agreeing with what you said (a hardware issue), I don't know what's going wrong here, as I have not seen this before.

