LinuxQuestions.org
Old 07-25-2014, 09:09 AM   #1
amudhan
LQ Newbie
 
Registered: Jul 2014
Posts: 2

Rep: Reputation: Disabled
Ubuntu 10.04 server hangs and throws error message


Hi,

I am using Ubuntu 10.04 LTS server to run a Hadoop cluster. In recent days a server hangs and prints an error message on the screen, but nothing is written to the logs.
When the server hangs, I cannot log in either locally or remotely.

Error messages on the local monitor:

Msg 1

[641677.044778] FS: 00007f61ea646700(0000) GS:ffff8800100c0000(0000) knlGS:0000000000000000
[641677.081636] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[641677.100839] CR2: 000000000061ade0 CR3: 000000042f758000 CR4: 000000000000060
[641677.137188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[641677.174348] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[641677.212210] Call Trace:
[641677.230768] [<ffffffff810fbba0>] ? drain_local_pages+0x0/0x20
[641677.249676] [<ffffffff8109a9c2>] ? smp_call_function+0x22/0x30
[641677.268199] [<ffffffff8106def4>] ? on_each_cpu+0x24/0x50
[641677.286434] [<ffffffff8101a0bc>] ? drain_all_pages+0x1c/0x20
[641677.304433] [<ffffffff8101a585>] ? __alloc_pages_slowpath+0x3c5/0x580
[641677.322223] [<ffffffff8101a8b1>] ? __alloc_pages_nodemask+0x171/0x180
[641677.339797] [<ffffffff81089512>] ? hrtimer_cancel+0x22/0x30
[641677.357173] [<ffffffff8112d6c7>] ? alloc_pages_current+0x87/0xd0
[641677.374510] [<ffffffff810197ae>] ? __get_free_pages+0xe/0x50
[641677.391581] [<ffffffff81064056>] ? dup_task_struct+0x46/0x170
[641677.408413] [<ffffffff81065241>] ? copy_process+0xb1/0xe90
[641677.424899] [<ffffffff810660b4>] ? do_fork+0x94/0x430
[641677.441042] [<ffffffff81098b38>] ? do_futex+0xb8/0x1b0
[641677.456781] [<ffffffff81098cab>] ? sys_futex+0x7b/0x170
[641677.472137] [<ffffffff81011558>] ? sys_clone+0x28/0x30
[641677.487142] [<ffffffff810134d3>] ? stub_clone+0x13/0x20
[641677.501801] [<ffffffff810131b2>] ? system_call_fastpath+0x16/0x1b
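For anyone who wants to compare traces from several hung servers, a call trace like the one above can be broken into its frames with a small script. This is only a minimal sketch, assuming Python 3 and assuming the frames have the usual kernel form `[<address>] ? symbol+0xoffset/0xsize`; the names `FRAME_RE`, `parse_frames` and `SAMPLE` are made up for illustration, and the sample lines are a cleaned-up excerpt. A leading `?` normally means the kernel found that address on the stack but could not confirm it as part of the call chain, so those frames are flagged as unreliable.

```python
import re

# One frame of a kernel call trace, for example:
# "[641677.424899] [<ffffffff810660b4>] ? do_fork+0x94/0x430"
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z0-9_.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(text):
    """Return a list of (address, symbol, offset, size, reliable) tuples."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append((
            int(m.group("addr"), 16),      # return address inside the function
            m.group("symbol"),             # function name
            int(m.group("offset"), 16),    # offset of the address in the function
            int(m.group("size"), 16),      # total size of the function
            m.group("unreliable") is None  # False when the frame carries a "?"
        ))
    return frames


# Hypothetical sample input: cleaned-up frames in the standard format.
SAMPLE = """\
[641677.408413] [<ffffffff81065241>] ? copy_process+0xb1/0xe90
[641677.424899] [<ffffffff810660b4>] ? do_fork+0x94/0x430
[2842010.855835] [<ffffffff8155894a>] thread_return+0x352/0x418
"""

if __name__ == "__main__":
    for addr, symbol, offset, size, reliable in parse_frames(SAMPLE):
        mark = " " if reliable else "?"
        # addr - offset is the start address of the function itself.
        print("%s %-16s start=0x%016x +0x%x/0x%x"
              % (mark, symbol, addr - offset, offset, size))
```

Feeding the text of each hang through `parse_frames` and comparing the symbol lists is a quick way to see whether different servers die in the same code path.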

Msg 2

000
[2842010.855788] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[2842010.855794] Process kondemand/2 (pid: 135, threadinfo ffff88044b8c2000, task ffff88044b1d44d0)
[2842010.855797] Stack:
[2842010.855800] ffff88044b8c3cc0 ffff88044b8c3c30 ffff88044b8c3d68 0000000000001111
[2842010.855805] <0> ffff88001008fb20 000000028155d896 00000000111111ff 0000000000000008
[2842010.855812] <0> 0000000000015bc0 0000000000015bc0 ffff880010181c30 0000000000015bc0
[2842010.855818] Call Trace:
[2842010.855827] [<ffffffff8105c888>] load_balance_newidle+0xa8/0x310
[2842010.855835] [<ffffffff8155894a>] thread_return+0x352/0x418
[2842010.855843] [<ffffffff810806aa>] worker_thread+0xda/0x110
[2842010.855849] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
[2842010.855856] [<ffffffff81080540>] ? worker_thread+0x0/0x110
[2842010.855862] [<ffffffff81084416>] kthread+0x96/0xa0
[2842010.855868] [<ffffffff810141ea>] child_rip+0xa/0x20
[2842010.855875] [<ffffffff81084c80>] ? kthread+0x0/0xa0
[2842010.855880] [<ffffffff810141e0>] ? child_rip+0x0/0x20
[2842010.855883] Code: 06 89 85 c0 fe ff ff c7 85 c4 fe ff ff 01 00 00 00 e9 97 fb ff ff 90 48 8b 95 e0 fe ff ff 48 8b 45 a8 8b 72 08 48 c1 e0 0a 31 d2 <48> f7 f6 48 8b 75 b0 48 89 45 a0 31 c0 48 85 16 74 0c 48 8b 45
[2842010.855922] RIP [<ffffffff81056284>] find_busiest_group+0x634/0x8f0
[2842010.855929] RSP <ffff88044b8c3bb0>
[2842010.855933] ---[ end trace 81a1739d978369cb ]---
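In the "Code:" line of Msg 2, the kernel marks the byte at the faulting instruction pointer with angle brackets, so it can help to pull out the bytes around that marker before handing them to a disassembler (the kernel source tree ships `scripts/decodecode` for that). The following is a minimal sketch, assuming Python 3; `faulting_bytes` and `SAMPLE_CODE` are hypothetical names, and the sample is a cleaned-up excerpt of the tail of the Code line, with the OCR'd `Oa` read as `0a`. The marked bytes `48 f7 f6` look like a 64-bit `div %rsi`, which would point to a divide error in find_busiest_group, but that reading should be confirmed with a real disassembler.

```python
import re


def faulting_bytes(code_line):
    """Split an oops 'Code:' line into (bytes before, marked byte, bytes after).

    The marked byte is the one printed as <xx>; it is the first byte of the
    instruction the CPU was executing when the oops was raised.
    """
    _, _, hexdump = code_line.partition("Code:")
    before, marked, after = [], None, []
    for tok in hexdump.split():
        m = re.fullmatch(r"<([0-9a-f]{2})>", tok)
        if m and marked is None:
            marked = m.group(1)
        elif marked is None:
            before.append(tok)
        else:
            after.append(tok)
    return before, marked, after


# Hypothetical sample: cleaned-up excerpt of the end of the Code line in Msg 2.
SAMPLE_CODE = "[2842010.855883] Code: 48 c1 e0 0a 31 d2 <48> f7 f6 48 8b 75 b0"

if __name__ == "__main__":
    before, marked, after = faulting_bytes(SAMPLE_CODE)
    print("before fault:", " ".join(before))
    print("at fault:    ", " ".join([marked] + after[:2]))
```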

The links below contain screenshots of the error messages shown when the server hangs.

https://drive.google.com/file/d/0BxC...it?usp=sharing

https://drive.google.com/file/d/0BxC...it?usp=sharing

https://drive.google.com/file/d/0BxC...it?usp=sharing

https://drive.google.com/file/d/0BxC...it?usp=sharing

https://drive.google.com/file/d/0BxC...it?usp=sharing

https://drive.google.com/file/d/0BxC...it?usp=sharing

https://drive.google.com/file/d/0BxC...it?usp=sharing

Please let me know if you have any suggestions or ideas about where to look.

Last edited by amudhan; 08-04-2014 at 03:09 AM. Reason: Added error messages as text
 
Old 07-27-2014, 02:49 PM   #2
dijetlo
Member
 
Registered: Jan 2009
Location: RHELtopia....
Distribution: Slackware Current 64bit Multi-Lib/RHEL
Posts: 743

Rep: Reputation: Disabled
You don't have any new entries in /var/log/kern.log?

I didn't do a deep dive on your errors, but it looks like this involves kernel memory space (I'm basing that on the page references -dump and clean- in the error). If so, it could be a memory leak or perhaps an overrun, which would be an odd thing to "develop".
How long did the Hadoop server run before it developed this problem? What changed in your environment around the time this "hang" began?
Is your system making huge pages? If so, how many and how big?
What's your upper limit for standard page files (number and size)?
As the Hadoop server is running, are you accumulating page files?
Are you writing to a remote location (a SAN or NAS)?
Are any of these machines virtual?
If you don't have sar (or a sar-like substitute) installed on your system, install it and look at the breads, bwrites and iopts that precede the "locking".
Finally, are you really interested enough in solving the problem to do all this? It's going to require a lot of work on your part, because after you've done all that, you're just getting started. I've worked on these kinds of problems fourteen hours a day (mostly spent reading and testing), three or four days at a stretch, before I found the silver bullet. I'm not willing to make that kind of commitment to your problem (though I'm happy to help you get started), so the question becomes: are you?
Maybe somebody has a really clear idea on how to fix this, but that guy ain't me. My experience has been that you're going to have to tinker a lot to figure it out.
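To put numbers on the huge-page and memory questions above, the counters in /proc/meminfo can be read on each node. This is a minimal sketch, assuming Python 3 is available on the servers; `parse_meminfo`, `summarize` and `SAMPLE` are hypothetical names, and the sample values are invented for illustration, not taken from the poster's machines.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Fields with a 'kB' unit stay in kB; counters such as HugePages_Total
    are plain page counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def summarize(info):
    """Pick out the huge-page and swap figures asked about in the thread."""
    total = info.get("HugePages_Total", 0)
    size_kb = info.get("Hugepagesize", 0)
    return {
        "huge_pages": total,
        "huge_page_size_kb": size_kb,
        "huge_pages_reserved_kb": total * size_kb,
        "mem_free_kb": info.get("MemFree", 0),
        "swap_used_kb": info.get("SwapTotal", 0) - info.get("SwapFree", 0),
    }


# Invented sample values, used only when /proc/meminfo is not available.
SAMPLE = """\
MemTotal:       16436528 kB
MemFree:          512340 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            text = f.read()
    else:
        text = SAMPLE
    for key, value in sorted(summarize(parse_meminfo(text)).items()):
        print("%-24s %d" % (key, value))
```

Running this from cron every few minutes and appending the output to a file would leave a record of memory state leading up to a hang, alongside whatever sar collects.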
 
Old 07-28-2014, 03:10 AM   #3
amudhan
LQ Newbie
 
Registered: Jul 2014
Posts: 2

Original Poster
Rep: Reputation: Disabled
There are no entries in the log from the time the server hung.

The problem started two months ago and happens randomly. So far, the issue has not recurred on any server that was restarted because of it.

Memory and swap usage are very low.

All servers run on bare metal; there are no virtual machines.

Files are written locally.

It happens randomly, at intervals of 1 week to 10 days, so it's tough to monitor all servers.

Anyway, I will try to install sar or a substitute to monitor I/O.

Thanks for your reply.


 
  

