random core dumps, please help troubleshoot

whysyn · 02-01-2006, 01:15 PM

Hi everybody!

I think these might be related to mysql... server is RedHat 8.0, kernel 2.4.18-14smp, mysql version 3.23.52.

It runs 5 big databases with about 75 tables total. There are about 150 clients doing inserts, plus scripts that mine data from it. Total insert/update traffic is roughly 2.5 gig per 24 hours, and total online data is about 200 gigs.

The past two days, it has core dumped or locked hard 4 times. I can find no issues with the server... load average is within reason, disks aren't full, etc. This is driving me nuts!

Any ideas or suggestions on investigating this are greatly appreciated. Thank you all!

I happened to get a picture of the screen (unfortunately poor quality) which you can see HERE

I copied the text here as well (it was hard for me to read, so it might have an error or two):

Code:

autofs 3c59x iptable_filter ip_tables ide-scsi ide-cd cdrom mousedev keybdev h
CPU:    0
EIP:    0010:[<c0140bb6>]    Not tainted
EFLAGS: 00010282

EIP is at __free_pages_ok [kernel] 0x326 (2.4.18-14smp)
eax: 00000047   ebx: c1e24450   ecx: eded0000   edx: d7d07014
esi: 00000000   edi: f3ba19b4   ebp: 00000000   esp: 49613ecc
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 29090, stackpage=d9613000)
Stack: c0296360 c1e24450 f3ba19b4 c0147868 c1e24450 00001000 c0137aa5 000042d8
       00000000 000015b4 00001000 c1e24450 f3ba19b4 000042d9 c01374e6 d9613f6c
       c1e24450 00000000 00001000 00001000 00000000 00000000 00000000 f3ba1900
Call Trace: [<c0147868>] kmap_high [kernel] 0x50 (0xd9613ed8))
[<c0137aa5>] file_read_actor [kernel] 0xd5 (0xd9613f30))
[<c01374e6>] do_generic_file_read [kernel] 0x266 (0xd9613f04))
[<c01379d0>] file_read_actor [kernel] 0x0 (0xd9613f30))
[<c0137b80>] generic_file_read [kernel] 0xb0 (0xd9613f50))
[<c01379d0>] file_read_actor [kernel] 0x0 (0xd9613f60))
[<c014a5fa>] sys_pread [kernel] 0xca (0xd9613f8c))
[<c0109447>] system_call [kernel] 0x33 (0xd9613fc0))


Code: 0f 0b 82 00 99 5a 27 c0 8b 53 08 e9 0b fd ff ff 89 d8 e8 b3

unSpawn · 02-02-2006, 09:18 AM

The past two days, it has core dumped or locked hard 4 times. / I think these might be related to mysql...
Get a copy of your syslogging. Most OOPSes should be logged there. Diff all four and if they're not equal post at least two of them: better to have more and accurate info, because screenshots + typing can't compensate for lines scrolled off the screen. Next to that, does *any* daemon/application log show errors before the OOPS? Are these the only four OOPSes? In the past six months? Year? How about database and users? Was there an increase of usage? Recently? Were there any applications added? Any (recent?) other changes to the box?
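
If diffing the OOPSes by hand is a pain, something along these lines might help. It is only a rough Python sketch, not a tested tool: it assumes the classic syslog line layout ("Feb 1 10:57:40 host kernel: ..."), and it assumes each report can be recognised by a "kernel BUG at" line, which is a guess on my part. The function names are made up for illustration.

```python
import difflib
import re

# Assumed syslog prefix, e.g. "Feb 1 10:57:40 durant kernel: ".
# It is stripped so timestamps alone do not make two reports look different.
PREFIX = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+kernel:\s?")


def kernel_lines(syslog_lines):
    """Return only the kernel messages, without the date/host prefix."""
    out = []
    for line in syslog_lines:
        m = PREFIX.match(line)
        if m:
            out.append(line[m.end():].rstrip("\n"))
    return out


def split_reports(messages, marker="kernel BUG at"):
    """Start a new report at every message containing the (assumed) marker.

    Messages seen before the first marker are dropped.
    """
    reports = []
    for msg in messages:
        if marker in msg:
            reports.append([msg])
        elif reports:
            reports[-1].append(msg)
    return reports


def diff_reports(a, b):
    """Unified diff of two reports; an empty list means they are identical."""
    return list(difflib.unified_diff(a, b, "oops-1", "oops-2", lineterm=""))
```

Feed it the lines of the syslog file, e.g. `split_reports(kernel_lines(open(path)))`, and diff the reports pairwise. If syslogd writes every line twice, the duplicates end up inside each report and need to be removed first.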

I can find no issues with the server... load average is within reason, disks aren't full, etc.
Do you run continuous stats with something like Sa, Atsar or Dstat? Esp. in cases where problems don't appear every 5 minutes, it comes in handy to be able to paint a larger picture of what is going on.
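
To give an idea of what "continuous stats" means in practice, here is a minimal Python sketch that appends one timestamped load sample per run. It is no replacement for Sa, Atsar or Dstat, which record far more. It assumes Linux's /proc/loadavg, and the log file name and function names are hypothetical.

```python
import time


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def format_sample(timestamp, load):
    """Format one sample as a single log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return "%s load1=%.2f load5=%.2f load15=%.2f" % ((stamp,) + load)


def sample_once(path="/proc/loadavg", log_path="loadavg.log"):
    """Read the current load and append it to log_path (hypothetical name)."""
    with open(path) as f:
        load = parse_loadavg(f.read())
    with open(log_path, "a") as log:
        log.write(format_sample(time.time(), load) + "\n")
```

Calling `sample_once()` from cron every minute or so gives a history to look back at after a crash, instead of only the load seen when someone happened to be logged in.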

server is RedHat 8.0, kernel 2.4.18-14smp, mysql version 3.23.52.
Was this box designed and configured for this task?
BTW, any compelling reason for running an EOL'ed release and vulnerable kernel?

whysyn · 02-02-2006, 11:51 AM

Thanks for the response, I know I'm a total hack and I always appreciate knowledgeable users having patience for me...

Syslogging: here (linked due to length)
Only one of the 4 crashes actually wrote to the log, and it looks like syslogd is double logging them (I'll have to look into that also), but I left it as-is to avoid introducing any errors. The server has been rock solid since installation. The only issues were due to filling disks on a couple of occasions.

There has in the past 10 days or so been a modest increase in volume... of the 150 concurrent client connections I mentioned, about 10 of them are new. Their individual volume is not much different from average for our clients, but it is 10 new ones.

I can find no other application logs in the crash timeframe, everything seemed to be normal.

SA is running in cron, but I have never dealt with it, nor do I know if it is even running properly. I'll have to look into this, any suggestions welcome =)

This box was spec'd and built for this task and this task only. It was built from factory-new parts and has been in continuous operation since late 2003 (EDIT: late 2002, I can't subtract...). We're still running RH8.0 (1) because it hasn't been broken until now and (2) because of the downtime and prohibitive costs (parallel hardware, man hours, et al) associated with an upgrade.

Thanks again!

EDIT: typos

unSpawn · 02-15-2006, 02:11 PM

Sorry. Way late.

Feb 1 10:57:40 durant kernel: Page has mapping still set. This is a serious situation. However if you
Feb 1 10:57:40 durant kernel: kernel BUG at page_alloc.c:130!
Feb 1 10:57:40 durant kernel: EIP is at __free_pages_ok [kernel] 0x326 (2.4.18-14smp)
IIGC this has something to do with trying to free a page while it is still in use. I can't think of any other advice than moving to a later kernel; maybe updating through Fedora Legacy is an option.
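
As a cross-check that the hand-typed screenshot and this log entry describe the same BUG: as far as I know, BUG() on i386 in 2.4 kernels is emitted as a ud2 instruction (bytes 0f 0b) followed by a 16-bit line number and a 32-bit pointer to the file name. Treat that layout as my assumption. Under it, the "Code:" line from the first post can be decoded with a small Python sketch (the function name is made up):

```python
import struct


def decode_bug(code_line):
    """Decode 'ud2; .word line; .long file' from an oops 'Code:' line.

    Returns (line_number, file_name_pointer), or None if the bytes do not
    start with ud2 (0f 0b). The layout is an assumption about 2.4 i386.
    """
    hex_part = code_line.split(":", 1)[1] if ":" in code_line else code_line
    raw = bytes.fromhex("".join(hex_part.split()))
    if len(raw) < 8 or raw[:2] != b"\x0f\x0b":
        return None
    # Little-endian: unsigned 16-bit line number, then unsigned 32-bit pointer.
    line_no, file_ptr = struct.unpack_from("<HI", raw, 2)
    return line_no, file_ptr
```

For the posted bytes, `82 00` decodes to line 130, which matches "kernel BUG at page_alloc.c:130!" above, so at least that part of the transcription looks right.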

whysyn · 05-08-2006, 12:56 PM

Quote:

Originally Posted by unSpawn

Sorry. Way late.

I'm even later =)

It turned out to be a heat issue. One of the CPU fans had died. After replacing it, the system has been rock solid.