LinuxQuestions.org

LGX 12-17-2009 01:30 AM

Centos Server Troubleshoot Tips
 
Hey all,

I installed CentOS 5.2 on an Intel server box and it ran smoothly for about one month. A couple of days ago, with no changes made and no new programs added, the server started acting up.

When I try to log in via SSH, there is a delay on every keystroke. The server itself does not perform correctly. When I see this, I usually just reboot the server to fix the issue and it runs perfectly again. About one week later, it does it again.

I ran memtest and all memory passed without any errors. I also ran S.M.A.R.T. checks on the HDDs to see if they are bad, but they pass. I am not sure if there is anything else I can try (Linux commands/tools) to see what is causing this issue. I am not sure if there are system logs or if they are easy to read for a beginner like myself.

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2              137G   39G   91G  31% /
/dev/md1              145M   21M  117M  15% /boot
/dev/md0              3.9G   73M  3.7G   2% /tmp
tmpfs                 3.9G     0  3.9G   0% /dev/shm

# free -m
             total       used       free     shared    buffers     cached
Mem:          7900       4399       3500          0        199       1689
-/+ buffers/cache:       2510       5390
Swap:         8189          0       8189

Any help would be great.

Regards,
LGX

unSpawn 12-17-2009 10:15 AM

Actually one of the worst things to do is reboot the server. Sure, it may seem like a nice stop-gap solution, but rebooting makes all process data that might help troubleshoot the issue disappear unless you log it. Only by logging system stats can you get objective numbers to measure performance.

- What services does the machine provide (application names, versions)?
- Do these hiccups occur only from your IP address or others (different networks) as well?
- Could these hiccups be network related?
- Anything interesting in any /var/log/ logs around the time the hiccups occur?
- Do you run Logwatch to keep tabs on all things reported?
- Do you have any CPU hogs? A wee script like
Code:

# Every 10 seconds, log (via syslog) any of the top 10 processes above 5% CPU.
__genCpuhoglist() {
 while true; do
  /bin/ps -e -o pcpu= -o pid= -o args= | sort -bgr -k1 | head -10 | while read -r cpu pid command; do
   [ "${cpu%%.*}" -gt 5 ] && logger "CPU: ${command%% *}: ${cpu}%"
  done
  sleep 10s
 done
}

could show them (currently set to log processes using more than 5% CPU).
- Do you run anything like Dstat or Collectl? Atsar or SAR? With one of the first two you would have a macro view of the resource usage on the box. And if you want to run 'top' I would choose Atop instead: it can save process stats which you can replay later on.
- Do you log in over SSH as the root user (BAD)?
* And if you SSH in, use 'screen'. It enables you to re-attach to broken off sessions easily.

Whatever you do please be verbose in replying: the more information the better.
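
The advice above about logging system stats before rebooting could be sketched as a small snapshot function. This is only a minimal sketch: the function name `snapshot_stats` and the log file path are assumptions, not something from this thread, and it relies only on `date`, `uptime`, `free` and `ps`.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped snapshot of load, memory and
# the busiest processes to a log file (the default path is an assumption).
snapshot_stats() {
    logfile="${1:-/tmp/perf-snapshots.log}"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        uptime
        free -m
        # Ten busiest processes by CPU; empty headers suppress the header line.
        ps -e -o pcpu= -o pid= -o args= | sort -rn | head -10
    } >> "$logfile" 2>&1
}

# Usage (assumption: started inside a screen session, one snapshot a minute):
#   while true; do snapshot_stats /var/log/perf-snapshots.log; sleep 60; done
```

Because the snapshots are written to disk, they survive a reboot and can be compared with the timestamps in /var/log/messages afterwards.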

lazlow 12-17-2009 11:46 AM

Keep in mind that CentOS only supports the most current dot release. So 5.2 has not had any support since 5.3 came out, and 5.4 is current (two years without an update?). This could be a flaw that is fixed in the later releases. Upgrading from 5.X to 5.X+1 on CentOS is generally (read the release notes first) a simple yum update away.

DotHQ 12-17-2009 12:10 PM

Check /var/log/sa for sar?? files. These can show you how busy the CPU was in 10-minute intervals. sar is part of the sysstat package, which I believe installs by default. If not, you can install sysstat with yum: yum install sysstat
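
For reading those files, a minimal sketch follows. It assumes sysstat's default layout, where the binary data for day DD of the month is in /var/log/sa/saDD and the matching text report is in /var/log/sa/sarDD. The helper name `sa_file_for_day` is made up for illustration, and the time window in the usage comments is taken from the Dec 12 log excerpt posted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: build the path of sysstat's binary data file for a
# given day of the month (1-31). Assumes the default /var/log/sa location.
sa_file_for_day() {
    printf '/var/log/sa/sa%02d\n' "$1"
}

# Usage, assuming the slowdown happened on the 12th around 15:28:
#   sar -u -f "$(sa_file_for_day 12)" -s 15:00:00 -e 15:30:00   # CPU, incl. %iowait
#   sar -q -f "$(sa_file_for_day 12)" -s 15:00:00 -e 15:30:00   # run queue and load average
#   less /var/log/sa/sar12                                      # plain-text daily report
```

A high %iowait column during the slow period would suggest the CPUs were waiting on disk rather than busy with work.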

Have you run "top"? It will show you in real time how busy your CPUs are.

uptime will also show you the load average. Normally you want the load average under 2.0. Top also shows the load average. Load average is key to showing how busy your system really is at any given moment.
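
How high a load average is acceptable depends on the number of CPUs, so it can help to divide one by the other. The sketch below does that; the helper name `load_per_cpu` is an assumption for illustration. As a rough rule of thumb (not a hard limit), a sustained value above 1.0 per CPU suggests work is queuing. On Linux the load average also counts processes stuck in uninterruptible I/O wait, so a stalled disk can raise it even when the CPUs are idle.

```shell
#!/bin/sh
# Hypothetical helper: divide a load average by the number of CPUs.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Usage on a Linux box (reads the 1-minute load and counts processors):
#   load=$(cut -d' ' -f1 /proc/loadavg)
#   cpus=$(grep -c '^processor' /proc/cpuinfo)
#   load_per_cpu "$load" "$cpus"
```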

cat /etc/resolv.conf

I've seen slow logins because this file was not set up properly. But that would not help to explain the other quirkiness that you've experienced. You might have a multi-issue problem going on here.
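
One way to check whether the resolvers in that file respond quickly is sketched below. The helper name `list_nameservers` is an assumption, and the usage comment relies on `dig` (from the bind-utils package) being installed.

```shell
#!/bin/sh
# Hypothetical helper: print the nameserver addresses from a
# resolv.conf-style file (defaults to /etc/resolv.conf).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage: time one lookup against each configured server.
#   for ns in $(list_nameservers); do
#       echo "== $ns"
#       time dig +time=2 +tries=1 @"$ns" www.linuxquestions.org > /dev/null
#   done
```

A server that times out here would delay anything doing name lookups at login, although, as noted above, it would not explain per-keystroke lag once the session is established.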

LGX 12-17-2009 10:47 PM

I will try to capture some logs when this acts up again. I will provide as much info as you requested because I'm running out of ideas on what to do. Again, I do appreciate the help you guys have posted on this topic. The information below is from my last hiccup.

- What services does the machine provide (application names, versions)?
Only basic services (server defaults) are in use, and only one application is installed. It's a Linux remote host controller that connects to other web servers.

- Do these hiccups occur only from your IP address or others (different networks) as well?
I do have multiple IPs bound to the server, and it does impact network traffic on all the other IPs.

- Could these hiccups be network related?
I confirmed with my datacenter (DC) that there were no network-related issues when this happened.

- Anything interesting in any /var/log/ logs around the time the hiccups occur?

The last time I had the issue, this was logged in /var/log/messages.1:
===============================================================
timeout: status=0xd0 { Busy }
Dec 12 15:28:21 chi01-Fibernetservers-1 kernel: ide: failed opcode was: unknown
Dec 12 15:28:21 chi01-Fibernetservers-1 kernel: hdb: drive not ready for command
Dec 12 15:28:26 chi01-Fibernetservers-1 kernel: hdb: status timeout: status=0xd0 { Busy }
Dec 12 15:28:26 chi01-Fibernetservers-1 kernel: ide: failed opcode was: unknown
Dec 12 15:28:26 chi01-Fibernetservers-1 kernel: hdb: drive not ready for command
Dec 12 15:28:31 chi01-Fibernetservers-1 kernel: hdb: status timeout: status=0xd0 { Busy }
Dec 12 15:28:31 chi01-Fibernetservers-1 kernel: ide: failed opcode was: unknown
Dec 12 15:28:31 chi01-Fibernetservers-1 kernel: hdb: drive not ready for command
Dec 12 15:28:31 chi01-Fibernetservers-1 shutdown[8754]: shutting down for system reboot
Dec 12 15:28:31 chi01-Fibernetservers-1 init: Switching to runlevel: 6
Dec 12 15:28:32 chi01-Fibernetservers-1 smartd[4057]: smartd received signal 15: Terminated
Dec 12 15:28:32 chi01-Fibernetservers-1 smartd[4057]: smartd is exiting (exit status 0)
Dec 12 15:28:33 chi01-Fibernetservers-1 avahi-daemon[3971]: Got SIGTERM, quitting.
Dec 12 15:28:33 chi01-Fibernetservers-1 avahi-daemon[3971]: Leaving mDNS multicast group on interface eth1.IPv6 with address fe80::215:17ff:fe6a:779.
Dec 12 15:28:33 chi01-Fibernetservers-1 avahi-daemon[3971]: Leaving mDNS multicast group on interface eth1.IPv4 with address 208.100.1.1.
Dec 12 15:28:37 chi01-Fibernetservers-1 hcid[3638]: Got disconnected from the system message bus
Dec 12 15:28:38 chi01-Fibernetservers-1 rpc.statd[3500]: Caught signal 15, un-registering and exiting.
Dec 12 15:28:38 chi01-Fibernetservers-1 auditd[3395]: Error sending signal_info request etc.....
===============================================================
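
The repeated "hdb" kernel messages in the excerpt above are worth quantifying. The sketch below counts kernel messages per disk device name in a syslog-style file; the helper name `count_disk_messages` is an assumption, and the file paths in the usage comments assume the default CentOS log locations. Whether hdb is a hard disk, and whether it is a member of the md arrays shown in the df output, is not stated in the thread, so the last two commands are only suggestions for finding out.

```shell
#!/bin/sh
# Hypothetical helper: count kernel messages per IDE/SCSI device name in
# syslog-style files. Matches lines such as
#   "... kernel: hdb: drive not ready for command"
count_disk_messages() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "kernel:" && $(i + 1) ~ /^[hs]d[a-z]+:$/) {
                dev = $(i + 1); sub(/:$/, "", dev); count[dev]++
            }
    } END { for (d in count) print d, count[d] }' "$@"
}

# Usage:
#   count_disk_messages /var/log/messages /var/log/messages.1
#   cat /proc/mdstat          # state of the md RAID arrays
#   smartctl -a /dev/hdb      # SMART attributes and error log for that device
```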

- Do you run Logwatch to keep a tab on all things reported?
I believe I do not have Logwatch on. Is this something I might need? Is it a default package on the OS, or how do I install it?

- Do you run anything like Dstat or Collectl? Atsar or SAR? With one of the first two you would have a macro view of the resource usage on the box. And if you want to run 'top' I would choose Atop instead: it can save process stats which you can replay later on.

I tried to use Atop, but it looks like it is not installed.
# Atop
-bash: Atop: command not found
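
Command names are case-sensitive, so the binary would be `atop`, not `Atop`. A quick way to see which of the tools mentioned in this thread are actually present is sketched below; the helper name `check_tools` is an assumption. As far as I know, sysstat and logwatch are in the CentOS base repository, while atop may need a third-party repository, so treat the install line as a starting point rather than a guarantee.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named tools are on the PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "$tool: installed"
        else
            echo "$tool: not found"
        fi
    done
}

# Usage:
#   check_tools atop dstat sar logwatch screen
#   yum install sysstat logwatch
```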

- Do you use screen, and do you log in over SSH as root or as a regular user?

Yes, I do use screen to run multiple servers, and no, I have a user account that I use if needed.

- Keep in mind that CentOS only supports the most current dot release. So 5.2 has not had any support since 5.3 came out, and 5.4 is current (two years without an update?). This could be a flaw that is fixed in the later releases. Upgrading from 5.X to 5.X+1 on CentOS is generally (read the release notes first) a simple yum update away.

If I cannot find what is causing this issue, I might update to the current CentOS 5.x to see if it helps, but I am not sure whether that will change my current settings or impact the other servers running on this box.

- Check /var/log/sa for sar
I checked this, but I am not sure how to read it, and there is a lot of info in this file.

- Have you run "top"? It will show you in real time how busy your CPUs are.
This is top on my server when it is running well.

# top
top - 22:39:29 up 3 days,  6:34,  1 user,  load average: 1.20, 1.58, 1.63
Tasks: 273 total,   5 running, 268 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.5%us,  5.5%sy,  0.0%ni, 86.9%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8090436k total,  5147900k used,  2942536k free,   211696k buffers
Swap:  8385912k total,        0k used,  8385912k free,  1976912k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
25805 eco       20   0  405m 307m  12m S 13.7  3.9  26:53.94 srcds_i686
 5338 demochi   20   0  387m 273m  12m S 11.7  3.5 600:40.17 srcds_i686
 4294 Calvary   20   0  195m 106m  12m S  9.8  1.4 513:05.63 srcds_i686
 4988 delta     20   0  574m 486m  12m R  9.8  6.2 496:33.05 srcds_i686
 5978 delta     20   0  527m 431m  12m S  9.8  5.5 426:02.02 srcds_i686
20680 delta     20   0  456m 358m  12m R  9.8  4.5  79:00.87 srcds_i686
 4524 jhart17   20   0  407m 310m  12m S  7.8  3.9 405:27.87 srcds_i686
 4870 delta     20   0  271m 176m  12m R  7.8  2.2 382:54.18 srcds_i686
 5714 aquapod   20   0  206m 111m  12m R  7.8  1.4 320:28.07 srcds_i686
25074 jok3r100  20   0  203m 106m  12m S  7.8  1.4  18:28.32 srcds_i686
 5592 captthun  20   0  203m 106m  12m S  5.9  1.4 320:18.74 srcds_i686
   19 root     -51  -5     0    0    0 S  2.0  0.0   6:27.21 sirq-timer/1
   58 root     -51  -5     0    0    0 S  2.0  0.0   6:50.14 sirq-timer/4
26678 root      20   0 12740 1116  720 R  2.0  0.0   0:00.01 top
    1 root      20   0 10352  688  572 S  0.0  0.0   0:04.29 init
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 posixcputmr/0
    5 root     -51  -5     0    0    0 S  0.0  0.0   0:00.00 sirq-high/0
    6 root     -51  -5     0    0    0 S  0.0  0.0   7:43.71 sirq-timer/0
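
With many instances of the same program in a listing like this, a per-command total can be easier to read than individual rows. The sketch below sums %CPU by command name; the helper name `sum_cpu_by_command` is an assumption. Note that the `pcpu` value from `ps` is averaged over each process's lifetime, so it will not match the instantaneous %CPU column that top shows.

```shell
#!/bin/sh
# Hypothetical helper: sum %CPU per command name from "pcpu command" lines
# on stdin, and print the totals with the largest first.
sum_cpu_by_command() {
    awk '{ total[$2] += $1 } END { for (c in total) printf "%s %.1f\n", c, total[c] }' |
        sort -k2,2nr
}

# Usage:
#   ps -e -o pcpu= -o comm= | sum_cpu_by_command | head -5
```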


All times are GMT -5.