LinuxQuestions.org
Old 05-18-2018, 07:19 AM   #1
1s440
Member
 
Registered: Mar 2018
Posts: 266

Rep: Reputation: Disabled
Root cause analysis for server crash


Hi all,

Our server crashed, apparently due to high CPU load, and has somehow come back up. How can I analyse the cause of the failure?
I checked /var/log/syslog, dmesg and /var/log/messages but found nothing.

Can anyone suggest where to look?
 
Old 05-18-2018, 07:28 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,789

Rep: Reputation: 7304
High CPU load is usually not a reason for a crash.
It would be nice to know what was running; you can probably find something in the logs of those apps.
 
Old 05-18-2018, 07:38 AM   #3
1s440
Member
 
Registered: Mar 2018
Posts: 266

Original Poster
Rep: Reputation: Disabled
Some of the messages are like this.
kernel: [694662.597621] free:1994 slab:2523 mapped:5 pagetables:0 bounce:0
jupiter kernel: [694662.597636] DMA free:4064kB min:60kB low:72kB high:88kB active:3964kB inactive:3684kB present:16160kB pages_scanned:13851 all_unreclaimable? yes
jupiter kernel: [694662.597643] lowmem_reserve[]: 0 1002 1002 1002
jupiter kernel: [694662.597650] DMA32 free:3912kB min:4016kB low:5020kB high:6024kB active:715948kB inactive:248440kB present:1026160kB pages_scanned:1744182 all_unreclaimable? yes
[694662.597658] lowmem_reserve[]: 0 0 0 0
: [694662.597664] DMA: 494*4kB 11*8kB 1*16kB 2*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 4064kB
jupiter kernel: [694662.597679] DMA32: 32*4kB 1*8kB 10*16kB 7*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3912kB
jupiter kernel: [694662.597699] 35 total pagecache pages
jupiter kernel: [694662.597703] Swap cache: add 314292, delete 314292, find 4998880/5004846
jupiter kernel: [694662.597714] Free swap = 0kB
jupiter kernel: [694662.597724] Total swap = 1048568kB
jupiter kernel: [694662.603505] 264192 pages of RAM
jupiter kernel: [694662.603526] 6120 reserved pages
jupiter kernel: [694662.603529] 4276 pages shared
jupiter kernel: [694662.603532] 0 pages swap cached
jupiter kernel: [694662.603537] Out of memory: kill process 1290 (slapd) score 24211 or a child
jupiter kernel: [694662.603561] Killed process 1290 (slapd)
jupiter kernel: [694672.737297] smbd invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
jupiter kernel: [694672.737315] Pid: 18717, comm: smbd
 
Old 05-18-2018, 07:47 AM   #4
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,251

Rep: Reputation: 2321
High CPU load isn't supposed to cause a crash on Linux. Are you running M$ Windows??
Give us real details: hardware, software, distro, RAM & cache, and what the load was caused by. Are you an experienced sysadmin? Is your box online? Secured? Patches applied?

Random resets can also be software- or malware-related. Kernel panics are reported on screen, but not logged, iirc. RAM errors I don't know about; segmentation faults (called that for historical reasons) usually just shut down the offending process.

Maybe it's an unpredictable effect of the new Spectre/Meltdown patches?
 
Old 05-18-2018, 08:20 AM   #5
1s440
Member
 
Registered: Mar 2018
Posts: 266

Original Poster
Rep: Reputation: Disabled
We are not running MS Windows; the server is a VM that was installed through vCenter. I am still investigating, so I am not yet sure what caused the crash. I am not an experienced system admin. Patches have not been applied; it is still an old version. I also noticed some kernel errors in /var/log/messages. I am really not sure what the reason behind the crash is.

Last edited by 1s440; 05-24-2018 at 02:12 AM.
 
Old 05-18-2018, 08:34 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,789

Rep: Reputation: 7304
Please post those "kernel errors" you mentioned.
 
Old 05-18-2018, 08:59 AM   #7
1s440
Member
 
Registered: Mar 2018
Posts: 266

Original Poster
Rep: Reputation: Disabled
May 13 10:03:07 jupiter kernel: [62385504.845584] CPU 1: hi: 186, btch: 31 usd: 172
May 13 10:03:07 jupiter kernel: [62385504.845586] Active:308091 inactive:188791 dirty:0 writeback:0 unstable:0
May 13 10:03:07 jupiter kernel: [62385504.845587] free:12215 slab:2820 mapped:4 pagetables:1595 bounce:0
May 13 10:03:07 jupiter kernel: [62385504.845589] DMA free:8124kB min:68kB low:84kB high:100kB active:2912kB inactive:1776kB present:16256kB pages_scanned:9101 all_unreclaimable? yes
May 13 10:03:07 jupiter kernel: [62385504.845590] lowmem_reserve[]: 0 873 2016 2016
May 13 10:03:07 jupiter kernel: [62385504.845594] Normal free:40272kB min:3744kB low:4680kB high:5616kB active:58636kB inactive:752620kB present:894080kB pages_scanned:1429067 all_unreclaimable? yes
May 13 10:03:07 jupiter kernel: [62385504.845595] lowmem_reserve[]: 0 0 9144 9144
May 13 10:03:07 jupiter kernel: [62385504.845598] HighMem free:464kB min:512kB low:1736kB high:2964kB active:1170816kB inactive:768kB present:1170432kB pages_scanned:3528704 all_unreclaimable? yes
May 13 10:03:07 jupiter kernel: [62385504.845599] lowmem_reserve[]: 0 0 0 0
May 13 10:03:07 jupiter kernel: [62385504.845603] DMA: 31*4kB 30*8kB 27*16kB 19*32kB 13*64kB 8*128kB 5*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 8124kB
May 13 10:03:07 jupiter kernel: [62385504.845608] Normal: 9166*4kB 12*8kB 8*16kB 2*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 40216kB
May 13 10:03:07 jupiter kernel: [62385504.845613] HighMem: 32*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 464kB
May 13 10:03:07 jupiter kernel: [62385504.845617] 73 total pagecache pages
May 13 10:03:07 jupiter kernel: [62385504.845619] Swap cache: add 7404283, delete 7404282, find 1147266373/1147714023
May 13 10:03:07 jupiter kernel: [62385504.845620] Free swap = 0kB
May 13 10:03:07 jupiter kernel: [62385504.845621] Total swap = 979956kB
May 13 10:03:07 jupiter kernel: [62385504.848256] 524288 pages of RAM
May 13 10:03:07 jupiter kernel: [62385504.848257] 294912 pages of HIGHMEM
May 13 10:03:07 jupiter kernel: [62385504.848258] 5254 reserved pages
May 13 10:03:07 jupiter kernel: [62385504.848259] 8161 pages shared
May 13 10:03:07 jupiter kernel: [62385504.848260] 1 pages swap cached
May 13 10:03:07 jupiter kernel: [62385504.848261] 0 pages dirty
May 13 10:03:07 jupiter kernel: [62385504.848261] 0 pages writeback
May 13 10:03:07 jupiter kernel: [62385504.848262] 4 pages mapped
May 13 10:03:07 jupiter kernel: [62385504.848263] 2820 pages slab
May 13 10:03:07 jupiter kernel: [62385504.848264] 1595 pages pagetables
May 13 14:04:53 jupiter kernel: [62385586.555477] smbd invoked oom-killer: gfp_mask=0x1200d2, order=0, oomkilladj=0
May 13 14:04:53 jupiter kernel: [62385586.555481] Pid: 28410, comm: smbd Not tainted 2.6.26-2-686 #1
May 13 14:04:53 jupiter kernel: [62385586.555504] [<c015919e>] oom_kill_process+0x4f/0x195
May 13 14:04:53 jupiter kernel: [62385586.555515] [<c01595c8>] out_of_memory+0x14e/0x17f
May 13 14:04:53 jupiter kernel: [62385586.555520] [<c015b530>] __alloc_pages_internal+0x2b8/0x34e
May 13 14:04:53 jupiter kernel: [62385586.555524] [<c015b5d2>] __alloc_pages+0x7/0x9
May 13 14:04:53 jupiter kernel: [62385586.555526] [<c0156da9>] __grab_cache_page+0x2b/0x5b
May 13 14:04:53 jupiter kernel: [62385586.555530] [<f896f6be>] ext3_write_begin+0x51/0x16d [ext3]
May 13 14:04:53 jupiter kernel: [62385586.555544] [<f88e276b>] do_get_write_access+0x2f8/0x331 [jbd]
May 13 14:04:53 jupiter kernel: [62385586.555553] [<c0157728>] generic_file_buffered_write+0xef/0x553
May 13 14:04:53 jupiter kernel: [62385586.555560] [<f88e246a>] journal_stop+0x148/0x151 [jbd]
May 13 14:04:53 jupiter kernel: [62385586.555566] [<c0157ff4>] __generic_file_aio_write_nolock+0x468/0x4cb
May 13 14:04:53 jupiter kernel: [62385586.555571] [<c0156be9>] find_lock_page+0x19/0x7c
May 13 14:04:53 jupiter kernel: [62385586.555576] [<c01580a9>] generic_file_aio_write+0x52/0xa9
May 13 14:04:53 jupiter kernel: [62385586.555580] [<f896bf99>] ext3_file_write+0x19/0x83 [ext3]
May 13 14:04:53 jupiter kernel: [62385586.555587] [<c0174506>] do_sync_write+0xbf/0x100
May 13 14:04:53 jupiter kernel: [62385586.555597] [<c0131a20>] autoremove_wake_function+0x0/0x2d
May 13 14:04:53 jupiter kernel: [62385586.555603] [<c01776f9>] sys_stat64+0x1e/0x23
May 13 14:04:53 jupiter kernel: [62385586.555607] [<c01bac85>] security_file_permission+0xc/0xd
May 13 14:04:53 jupiter kernel: [62385586.555614] [<c0174447>] do_sync_write+0x0/0x100
May 13 14:04:53 jupiter kernel: [62385586.555616] [<c0174c78>] vfs_write+0x83/0x120
May 13 14:04:53 jupiter kernel: [62385586.555619] [<c017524a>] sys_write+0x3c/0x63
May 13 14:04:53 jupiter kernel: [62385586.555622] [<c0103857>] sysenter_past_esp+0x78/0xb1
May 13 14:04:53 jupiter kernel: [62385586.555628] [<c02b0000>] quirk_ali7101_acpi+0x27/0x63
May 13 14:04:53 jupiter kernel: [62385586.555633] =======================
May 13 14:04:53 jupiter kernel: [62385586.555635] Mem-info:
May 13 14:04:53 jupiter kernel: [62385586.555635] DMA per-cpu:
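
A note on reading the dump above: each per-zone line such as "HighMem: 32*4kB ... = 464kB" lists how many free blocks of each size the zone holds, and the figure after "=" is their sum. The short Python sketch below redoes that arithmetic and converts the "pages of RAM" count to MiB. The function names are made up for illustration, and the 4 kB page size is an assumption (it is the usual size on a 686 kernel).

```python
import re

def buddy_free_kb(line):
    """Return (computed_kb, reported_kb) for one per-zone free-list line."""
    # Each "count*sizekB" term is a number of free blocks of that size.
    pairs = re.findall(r"(\d+)\*(\d+)kB", line)
    computed = sum(int(count) * int(size) for count, size in pairs)
    # The kernel prints its own total after the "=" sign.
    match = re.search(r"=\s*(\d+)kB", line)
    if match is None:
        raise ValueError("no '= <total>kB' found in line")
    return computed, int(match.group(1))

def pages_to_mib(pages, page_kb=4):
    """Convert a page count to MiB, assuming 4 kB pages by default."""
    return pages * page_kb / 1024

# Line copied from the log above.
highmem = ("May 13 10:03:07 jupiter kernel: [62385504.845613] HighMem: "
           "32*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB "
           "0*1024kB 0*2048kB 0*4096kB = 464kB")

print(buddy_free_kb(highmem))   # (464, 464)
print(pages_to_mib(524288))     # 2048.0
```

So the box has about 2 GiB of RAM, HighMem is down to 464 kB free, and "Free swap = 0kB" shows the swap is completely used as well.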
 
Old 05-18-2018, 12:50 PM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,789

Rep: Reputation: 7304
Quote:
smbd invoked oom-killer
This probably means Samba ran into an out-of-memory condition, and that probably caused the crash (although I'm not 100% sure).
You might need to check your Samba-related settings (and probably the version of your Samba packages).
 
Old 05-22-2018, 03:38 AM   #9
1s440
Member
 
Registered: Mar 2018
Posts: 266

Original Poster
Rep: Reputation: Disabled
The Samba version is 2.4, but this never happened before. So is this not a kernel issue?

I found this in the Samba log, but I think it is from after the crash.
smbd/process.c:smbd_process(2068)
receive_message_or_smb failed: NT_STATUS_END_OF_FILE, exiting

Last edited by 1s440; 05-22-2018 at 04:04 AM.
 
Old 05-22-2018, 04:15 AM   #10
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
Correct, it is not a kernel issue. I see no evidence of a "crash" - as in a kernel oops.

You have something consuming all your memory - not necessarily smbd, it may just be a victim. But whatever it is, it is bad enough to be impacting the system. The high CPU is probably memory-management trying to locate free-able page frames. Once the oom-killer gets enough memory back, the system will appear to come back to life.
Until it all happens again.

You need to check your monitoring history data to see what was happening over time - it may give some hints depending on what is being recorded.
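
To see from the logs who merely triggered the oom-killer and who actually got killed, the matching lines can be tallied. This is a rough Python sketch, not a finished tool: the function name is hypothetical, the patterns are based only on the message formats quoted in this thread, and other kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Patterns taken from the messages quoted in this thread.
INVOKED = re.compile(r"(\S+) invoked oom-killer")
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def summarize_oom(lines):
    """Count which commands invoked the oom-killer and which were killed."""
    invokers, victims = Counter(), Counter()
    for line in lines:
        match = INVOKED.search(line)
        if match:
            invokers[match.group(1)] += 1
        match = KILLED.search(line)
        if match:
            victims[match.group(2)] += 1
    return invokers, victims

# Lines copied from the excerpts posted earlier (wrapped lines rejoined).
sample = [
    "jupiter kernel: [694662.603561] Killed process 1290 (slapd)",
    "jupiter kernel: [694672.737297] smbd invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0",
    "May 13 14:04:53 jupiter kernel: [62385586.555477] smbd invoked oom-killer: gfp_mask=0x1200d2, order=0, oomkilladj=0",
]

invokers, victims = summarize_oom(sample)
print(dict(invokers))  # {'smbd': 2}
print(dict(victims))   # {'slapd': 1}
```

To run it over a whole log, pass an open file instead of the sample list, e.g. summarize_oom(open("/var/log/messages")). In the excerpts above smbd is the process whose allocation failed, while slapd is the one that was killed; neither fact alone says which process is eating the memory.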
 
Old 05-22-2018, 06:28 AM   #11
1s440
Member
 
Registered: Mar 2018
Posts: 266

Original Poster
Rep: Reputation: Disabled
Is there a monitoring tool I could add to the server so that this history is saved for later analysis?
 
Old 05-23-2018, 12:30 AM   #12
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
Not really - the data are not exposed or retained by default. Most servers probably have something like sysstat, but it tends not to be useful for historical analysis of (particularly) process metrics.
Maybe look at something like collectl - there will be a learning curve.

Another "quick look" option would be to use "top" and add the "swap" column - at least you can see who is using swap at that time. Maybe set up your own monitor - run it in batch mode with a delay of something like 10-20 minutes and write it to a file for later analysis.
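
A home-made monitor along those lines could look like the sketch below. Note that it does not drive "top"; it swaps in a different technique and reads /proc/<pid>/status directly, which is easier to log from a script. Assumptions, all labelled in the code too: Python 3 is installed, the function names and the log path in the usage note are invented for illustration, and the VmSwap field only exists on newer kernels, so on an old kernel the swap figure will simply read 0.

```python
import glob
import time

def parse_status(text):
    """Extract (name, rss_kb, swap_kb) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key] = value.strip()

    def kb(key):
        # Values look like "1234 kB". Kernel threads have no VmRSS, and
        # older kernels have no VmSwap, so a missing field counts as 0.
        parts = fields.get(key, "").split()
        return int(parts[0]) if parts else 0

    return fields.get("Name", "?"), kb("VmRSS"), kb("VmSwap")

def snapshot(top_n=5):
    """Return the top_n processes ranked by resident plus swapped memory."""
    rows = []
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as handle:
                name, rss, swap = parse_status(handle.read())
        except OSError:
            continue  # the process exited between listing and reading
        pid = path.split("/")[2]
        rows.append((rss + swap, pid, name, rss, swap))
    rows.sort(reverse=True)
    return rows[:top_n]

def monitor(logfile, interval_s=600, samples=None):
    """Append a snapshot to logfile every interval_s seconds.

    samples=None means run until interrupted.
    """
    count = 0
    while samples is None or count < samples:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(logfile, "a") as out:
            for _total, pid, name, rss, swap in snapshot():
                out.write(f"{stamp} pid={pid} name={name} "
                          f"rss_kb={rss} swap_kb={swap}\n")
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
```

Something like monitor("/var/tmp/mem-monitor.log", interval_s=600) would then log the five biggest memory users every 10 minutes (the path is just an example). After the next incident, the last few entries before the oom-killer messages should show which process was growing.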
 
Old 05-23-2018, 04:33 AM   #13
1s440
Member
 
Registered: Mar 2018
Posts: 266

Original Poster
Rep: Reputation: Disabled
Thanks for the information. How can I keep logging with collectl? I could not find the option. Any idea how long the logs are stored?
I also found another logging tool, "atop". It writes directly to /var/log and keeps its logs for 28 days.

Last edited by 1s440; 05-25-2018 at 08:12 AM.
 
Old 05-27-2018, 02:31 AM   #14
1s440
Member
 
Registered: Mar 2018
Posts: 266

Original Poster
Rep: Reputation: Disabled
Hello all,
I get the error below when I try to run the atop command. Can anyone help me with this?
error while loading shared libraries: libncurses.so.6: cannot open shared object file: No such file or directory
 
Old 05-27-2018, 03:05 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,789

Rep: Reputation: 7304
We still don't know your OS. You probably need to install the missing library. What did you do "before"?
 
  


All times are GMT -5. The time now is 12:31 AM.
