Analyze Boot logs

Bouki · 12-30-2014, 02:21 AM

Hello guys.
About 4 days ago one of our servers had trouble during the reboot process.
I captured all the messages and dmesg logs.
How can I analyze these files?
I have also attached the dmesg log file and hope someone can help.

unSpawn · 12-30-2014, 04:11 AM

Quote:

Originally Posted by Bouki

About 4 days ago one of our servers had trouble during the reboot process.

Please describe the symptoms and what steps were taken to assess the problem at the time in detail.

Quote:

Originally Posted by Bouki

I captured all the messages and dmesg logs.
How can I analyze these files?

By reading them.

Quote:

Originally Posted by Bouki

I have also attached the dmesg log file and hope someone can help.

Because you have not (yet) posted anything you would like us to look at, it makes no sense to read it for you.
*If you're worried about the "TCP: Peer unexpectedly shrunk window" line, note that this is an informational level message (not debug, err, warn or crit) because the kernel fixed things itself: hence the "(repaired)" part.
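As a starting point for "reading them", a saved dmesg or messages file can be narrowed down to the lines that look like trouble. This is only a minimal sketch: the keyword list and the function name `suspicious_lines` are my own assumptions, not an official set, and a match is a hint to read the surrounding lines rather than a diagnosis.

```python
import re

# Hypothetical keyword list (an assumption, not a complete or official set).
# These substrings often show up in kernel messages worth a closer look.
SUSPECT = re.compile(
    r"error|fail|timeout|timed out|I/O|oops|panic|call trace|machine check",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return (line_number, text) pairs for lines matching a suspect keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPECT.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it on a captured file, something like `suspicious_lines(open("dmesg.txt", errors="replace"))` would do, where `dmesg.txt` stands in for whatever name the log was saved under.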

Bouki · 12-30-2014, 05:07 AM

Thank you so much, dear.
Let me explain exactly what happened:
At first the server froze and stopped responding to ping.
We tried to connect with ssh but couldn't, so we rebooted the server. After about 15 minutes in which nothing happened, I pressed the reset button on the server.
The server powered off and started booting, but again the boot process took very long (about 30 minutes!). I pressed the reset button again, and this time the server booted normally. It took about 5 or 6 minutes.
This happened about 6 months ago.

Would you please explain the "TCP: Peer unexpectedly shrunk window" error?

I also have the messages logs. If you need anything else, just tell me.

unSpawn · 12-30-2014, 06:18 AM

Quote:

Originally Posted by Bouki

Let me explain exactly what happened:
At first the server froze and stopped responding to ping.
We tried to connect with ssh but couldn't.
Then we rebooted the server.

Sometimes processes take an unusual amount of system resources, which can make a server unresponsive. Depending on what remote monitoring is available, you can decide to act when you see values rising, or leave it be and face the consequences. When you decide to reboot a server it comes in handy to access it locally and attach a screen, or to gain access via an out-of-band method (IPMI, console server, KVM, etc.), to see whether messages are logged to the console.

Quote:

Originally Posted by Bouki

After about 15 minutes in which nothing happened, I pressed the reset button on the server.
The server powered off and started booting.
But again the boot process took very long (about 30 minutes!).

When you hard reset a server you give it no chance to shut down processes, finish writing files and reset each file system's "dirty" flag. This means (or should mean) that on reboot a file system check could (should) be forced to ensure the integrity of the file system. If you have not configured the server beforehand to handle file system checks automatically, and you were not watching the console while it checked file systems, you may not have seen what caused the lengthy boot process. It may have been trying to access resources it could no longer find, it may have been waiting for an answer, or it may simply have been slow checking file systems because of the size of the disks.
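To see which file systems are even eligible for a check at boot, the sixth field of each /etc/fstab entry (fs_passno) can be listed. A rough sketch, with `boot_fsck_plan` being a name I made up; it only reports what fstab says and does not tell you whether a check actually ran or how long it took.

```python
def boot_fsck_plan(fstab_text):
    """Return (mount_point, fs_passno) for every entry in fstab-style text.

    fs_passno is the sixth field: 0 means the file system is not checked at
    boot, 1 is meant for the root file system, 2 for the others. A missing
    field is treated as 0, as described in fstab(5).
    """
    plan = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed entry, ignore it in this sketch
        mount_point = fields[1]
        passno = int(fields[5]) if len(fields) > 5 else 0
        plan.append((mount_point, passno))
    return plan
```

Feeding it the contents of /etc/fstab, every entry with a non-zero second value is a candidate for a boot-time check, and large volumes among them are the likely slow ones.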

Quote:

Originally Posted by Bouki

I pressed the reset button again, and this time the server booted normally.
It took about 5 or 6 minutes.
This happened about 6 months ago.

When you hard reset the server again you once more gave it no chance to shut down processes, finish writing files, finish file system checks and reset each file system's "dirty" flag. If you have not investigated the cause of the problem and have not verified the integrity of the system after it booted up, you have neglected basic admin duties. 6 months ago. So I do hope this is "just" some expendable personal machine without any valuable data on it.

Quote:

Originally Posted by Bouki

Would you please explain the "TCP: Peer unexpectedly shrunk window" error?

Simply put, each side of a TCP connection advertises a receive window: the amount of data it is currently willing to accept from the other side. Network stack specifics and networked devices along the route may influence what that amount will be. The message means the peer reduced a window it had already advertised. Sometimes a device exhibits "odd" behaviour like this, and when the Linux kernel encounters it, it tries to compensate and smooth things over, as your log shows. AFAIK the only time this is worth investigating is when the message returns frequently or when you experience unacceptable network throughput degradation.
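To judge whether the message "returns frequently", the occurrences can be counted per peer. This is a sketch under an assumption: the pattern below expects the usual "Peer ADDRESS:PORT/PORT unexpectedly shrunk window" form, which I have not checked against your actual log, and `shrunk_window_peers` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed message layout: "Peer <address>:<port>/<port> unexpectedly shrunk window ..."
PATTERN = re.compile(r"Peer (\S+):\d+/\d+ unexpectedly shrunk window")


def shrunk_window_peers(lines):
    """Count 'shrunk window' messages per peer address."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A handful of hits spread over weeks is noise. One peer showing up again and again would be the case worth a closer look.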

Bouki · 12-30-2014, 12:28 PM

Thank you so much, dear unSpawn.
My last questions:
1- What is your professional opinion about this case? (Are you sure about "TCP: Peer unexpectedly shrunk window", or should I read more logs?)
2- How can I prevent this kind of problem?

Thank you again.

unSpawn · 12-30-2014, 01:34 PM

Quote:

Originally Posted by Bouki

What is your professional opinion about this case? (Are you sure about "TCP: Peer unexpectedly shrunk window", or should I read more logs?)

I am not a professional. Like I said before: the only time this is worth investigating is when that message returns frequently or when you experience unacceptable network throughput degradation.

Quote:

Originally Posted by Bouki

How can I prevent this kind of problem?

Which problems? Responding in 6 months time or what?

GaWdLy · 12-30-2014, 06:19 PM

Impossible to determine RCA 6 months later...

Also, be sure that you have sosreport and sysstat installed, and kdump configured. Then call Red Hat when it happens again. Provide a sosreport and a vmcore file.
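One quick sanity check for the kdump part: kdump needs memory reserved through the crashkernel= kernel parameter. Here is a small sketch that pulls that value out of a kernel command line. `crashkernel_setting` is a made-up helper name, and finding the parameter is necessary but not sufficient, since the kdump service still has to be enabled and tested.

```python
def crashkernel_setting(cmdline):
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None
```

On a running system the input would be the contents of /proc/cmdline. A result of None means no memory is reserved for a crash kernel, so no vmcore would be written.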

Bouki · 12-31-2014, 06:10 AM

Quote:

Originally Posted by unSpawn

I am not a professional. Like I said before: the only time this is worth investigating is when that message returns frequently or when you experience unacceptable network throughput degradation.
Which problems? Responding in 6 months time or what?

No, the server rebooting problems.

unSpawn · 12-31-2014, 02:37 PM

Ah, then what GaWdLy said.

Bouki · 12-31-2014, 11:42 PM

Quote:

Originally Posted by GaWdLy

Impossible to determine RCA 6 months later...

Also, be sure that you have sosreport and sysstat installed, and kdump configured. Then call Red Hat when it happens again. Provide a sosreport and a vmcore file.

Thank you, dear GaWdLy.
I collected the sosreport. Is there any way to analyze it myself?
I don't want to send it to the support representative. Last time it took about 2 months! I should report the problem next week.
Thanks.

GaWdLy · 01-01-2015, 12:58 AM

Red Hat has an SLA to meet, so it won't take 2 months to review. As long as you are a Premium or Standard subscriber, you should get some info back within a day or three.

Have you had a failed boot incident recently? Sosreports are great, but they are only so good at determining RCA, especially for boot-time issues. A sosreport is the bare minimum for troubleshooting, but it is still nearly impossible to get a clear-cut RCA from one.

GaWdLy · 01-01-2015, 01:00 AM

Bouki, if you put your sosreport in a secure place where you can share it with me, contact me by PM and I can look through it real quick.

GaWdLy · 01-01-2015, 01:06 AM

BTW, in your dmesg I see some machine check events logged. That usually indicates a CPU hardware error, but I don't think that would cause startup issues, per se.

Check /var/log/mcelog for details.
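To see when those events were logged relative to boot, the matching dmesg lines can be pulled out together with their timestamps. A rough sketch: it assumes the message text is "Machine check events logged" and that timestamps, when present, look like "[  123.456789]". `machine_check_events` is a hypothetical name.

```python
import re

# Optional dmesg timestamp such as "[  512.034567]", then the message itself.
MCE = re.compile(r"^(?:\[\s*(\d+\.\d+)\]\s*)?(.*Machine check events logged.*)$")


def machine_check_events(lines):
    """Return (seconds_since_boot_or_None, message) for each MCE notice."""
    events = []
    for line in lines:
        match = MCE.match(line.rstrip("\n"))
        if match:
            seconds = float(match.group(1)) if match.group(1) else None
            events.append((seconds, match.group(2)))
    return events
```

This only tells you that events were logged and roughly when. The decoded details are still in /var/log/mcelog.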

Bouki · 01-01-2015, 01:38 AM

Quote:

Originally Posted by GaWdLy

Bouki, if you put your sosreport in a secure place where you can share it with me, contact me by PM and I can look through it real quick.

Thank you, GaWdLy.
I will send the sosreport to you.
I really appreciate your help.