Slow response

robeadam · 09-06-2007, 04:24 PM

I'm running a test against 2.6.9-42.7.ELsmp. We are considering upgrading from 2.4.21-32.0.1.ELsmp to this version, hence the test. However, when I run the test, the Linux box becomes very slow to respond, taking 1-5 minutes to answer a command. Just issuing the date command takes 1.5 minutes:

rpd-routem114.cisco.com:27> date
Thu Sep 6 17:21:31 EDT 2007
rpd-routem114.cisco.com:28> date
Thu Sep 6 17:23:01 EDT 2007

I'm ssh'ing through eth0 so I looked there but found no errors:

rpd-routem114.cisco.com:21> sudo ethtool -S eth0
NIC statistics:
rx_packets: 62440
tx_packets: 43762
rx_bytes: 68680156
tx_bytes: 3819299
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 0
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_deferred: 0
tx_single_collisions: 0
tx_multi_collisions: 0
tx_flow_control_pause: 0
rx_flow_control_pause: 0
rx_flow_control_unsupported: 0
tx_tco_packets: 0
rx_tco_packets: 0

I also checked the CPU but it looks ok to me:

rpd-routem114.cisco.com:15> iostat
Linux 2.6.9-42.7.ELsmp (rpd-routem114.cisco.com) 09/06/2007

avg-cpu: %user %nice %sys %iowait %idle
7.23 0.00 58.15 0.11 34.50

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 1.71 19.66 14.40 145705 106696

rpd-routem114.cisco.com:26> iostat
Linux 2.6.9-42.7.ELsmp (rpd-routem114.cisco.com) 09/06/2007

avg-cpu: %user %nice %sys %iowait %idle
8.91 0.00 72.20 0.06 18.83

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 1.19 10.73 11.79 145753 160112

The same goes for memory:

rpd-routem114.cisco.com:7> free -m
total used free shared buffers cached
Mem: 1001 180 820 0 30 44
-/+ buffers/cache: 106 895
Swap: 4094 0 4094

Can anyone suggest anything else to look at to see why the box is responding so slowly?

Thanks,
Robert

robeadam · 09-06-2007, 04:54 PM

There is a tool we use to emulate BGP peers. I noticed that the response was really slow only while the tool was running. When I stopped the tool, the response time returned to normal. What's strange to me is that iostat doesn't show much more free CPU than before, even though the response has greatly improved.

With tool running:

avg-cpu: %user %nice %sys %iowait %idle
7.23 0.00 58.15 0.11 34.50

avg-cpu: %user %nice %sys %iowait %idle
8.91 0.00 72.20 0.06 18.83

avg-cpu: %user %nice %sys %iowait %idle
9.03 0.00 73.18 0.06 17.73

With tool stopped:

avg-cpu: %user %nice %sys %iowait %idle
8.72 0.00 70.73 0.05 20.49

avg-cpu: %user %nice %sys %iowait %idle
8.72 0.00 70.68 0.05 20.55

avg-cpu: %user %nice %sys %iowait %idle
8.66 0.00 70.21 0.05 21.07

During a test on the 2.4 kernel, the CPU has plenty free:

avg-cpu: %user %nice %sys %iowait %idle
0.10 0.04 0.24 0.01 99.60
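
A possible explanation for the nearly identical iostat figures with the tool running and stopped: when iostat is run without an interval argument, its report is an average since boot, so it barely moves when a load starts or stops. A minimal sketch in Python of the difference, assuming two snapshots of cumulative CPU tick counters like those on the "cpu" line of /proc/stat. The function name and the sample counters are made up for illustration:

```python
def cpu_percentages(before, after):
    """Return per-state CPU percentages for the interval between two snapshots.

    Each snapshot maps a state name (e.g. "user", "system", "idle") to a
    cumulative tick count, the way a since-boot counter behaves.
    """
    deltas = {state: after[state] - before[state] for state in after}
    total = sum(deltas.values())
    if total == 0:
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * d / total for state, d in deltas.items()}


# Hypothetical counters: a mostly idle machine that then becomes fully busy.
boot = {"user": 0, "system": 0, "idle": 0}
earlier = {"user": 100, "system": 200, "idle": 9700}
later = {"user": 210, "system": 1090, "idle": 9700}

since_boot = cpu_percentages(boot, later)   # what a since-boot average reports
interval = cpu_percentages(earlier, later)  # what an interval-based report shows
print(f"since boot:    {since_boot['idle']:.1f}% idle")
print(f"last interval: {interval['idle']:.1f}% idle")
```

In this made-up case the since-boot average still shows the machine as mostly idle while the most recent interval has no idle time at all. With sysstat's iostat, passing an interval (for example `iostat 5`) makes every report after the first cover only that interval.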

Robert

ilikejam · 09-06-2007, 05:43 PM

Hi.

What does 'top' look like when it's behaving like this? Also, are there any more recent kernels you could try?

Dave

robeadam · 09-06-2007, 07:46 PM

Hey Dave,

Thanks for the response.

When I ran top earlier, it showed the idle CPU at 0%:

top - 17:31:42 up 3:58, 1 user, load average: 102.39, 101.49, 101.00
Tasks: 264 total, 74 running, 190 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.9% us, 88.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.4% hi, 0.0% si
Mem: 1025712k total, 198840k used, 826872k free, 41260k buffers
Swap: 4192924k total, 0k used, 4192924k free, 46620k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9656 root 15 0 190m 20m 460 S 4 2.1 2:08.26 routem.latest
9665 root 15 0 190m 20m 460 S 4 2.1 2:07.64 routem.latest
9641 root 15 0 190m 20m 460 R 4 2.1 2:07.39 routem.latest
...

"routem" is the tool we use to emulate the BGP peers. I'm not sure why iostat would report idle CPU while top shows none. However, since the 1, 5 and 15 minute load averages are all over 100%, I can't really attribute the difference between iostat and top to the CPU required to run top.

We run Red Hat Enterprise Linux here, and I think the latest release deployed in our kickstart process is version 4 update 4. :-(

Robert

ilikejam · 09-06-2007, 08:11 PM

Those load averages aren't percentages; they're the average number of processes trying to run. A load average of 1 represents a fully occupied single-core machine. If your machine is dual core, then a load average of 2 represents a perfectly loaded host. So if you're running a dual-processor host, your load average is actually at about 5000%.

Looks like the kernel is really chewing on something (88.7% sy).

To be honest, the upgrade from a 2.4 to a 2.6 kernel isn't trivial. Is the rest of the software on the host updated as well? There are numerous changes to binutils and other packages required for the 2.6 kernel, if memory serves.
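
The arithmetic behind that figure can be sketched in a few lines of Python. The helper name is made up for illustration, and the CPU count of 2 is an assumption, since the thread doesn't say how many processors the host has:

```python
def load_percent(load_average, cpu_count):
    """Express a load average as a percentage of the machine's CPU capacity."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return 100.0 * load_average / cpu_count


one_minute_load = 102.39  # 1-minute load average from the top output
cpus = 2                  # assumption: a dual-processor host
print(f"about {load_percent(one_minute_load, cpus):.0f}% of capacity")
```

On a Linux box, `os.getloadavg()` and `os.cpu_count()` would supply the two inputs directly.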

Dave

robeadam · 09-13-2007, 09:26 AM

It turns out that the problem was with the BGP emulator. It loops through all the peers it is emulating, then sleeps for 10000 usec. This appears to have been fine on the 2.4 kernel, but on the 2.6 kernel it causes problems. Increasing the sleep to 50000 usec helped quite a bit but still doesn't completely resolve the problem.
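
A rough Python sketch of why the longer sleep would help: each emulator process wakes up once per loop, so the total wakeup rate scales inversely with the sleep time. The process count of 100 is an assumption loosely based on the load average of about 100, and the per-loop work is assumed to be negligible; neither comes from the emulator itself:

```python
def wakeups_per_second(sleep_usec, process_count, work_usec=0):
    """Estimate total loop iterations per second across all polling processes."""
    period_usec = sleep_usec + work_usec
    if period_usec <= 0:
        raise ValueError("loop period must be positive")
    return process_count * 1_000_000 / period_usec


for sleep_usec in (10_000, 50_000):
    rate = wakeups_per_second(sleep_usec, process_count=100)
    print(f"sleep {sleep_usec} usec -> about {rate:.0f} wakeups/s")
```

Under these assumptions, going from 10000 to 50000 usec cuts the wakeup rate by a factor of five, which would reduce the scheduling work the kernel has to do but not eliminate it.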

Thanks for the support!

Robert

ilikejam · 09-13-2007, 09:31 AM

I see. The kernel tick interval was changed between 2.4 and 2.6, so maybe that's what's causing the problem.

The old timer was 100Hz, but 2.6 raised the default to 1000Hz on x86, and later 2.6 releases let you choose between 100, 250, 300 and 1000Hz when building the kernel.
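
To illustrate how the tick rate could change the emulator's behaviour, here is a simplified Python model. It assumes a tick-based sleep is rounded up to whole ticks and then padded with one extra tick so it never returns early; that extra tick is an assumption about how tick-based timers typically behave, not something confirmed in this thread:

```python
import math


def effective_sleep_usec(requested_usec, hz, extra_ticks=1):
    """Model a tick-based sleep: round up to whole ticks, plus a safety tick."""
    tick_usec = 1_000_000 // hz
    ticks = math.ceil(requested_usec / tick_usec) + extra_ticks
    return ticks * tick_usec


for hz in (100, 1000):
    actual = effective_sleep_usec(10_000, hz)
    print(f"HZ={hz}: 10000 usec request -> about {actual} usec")
```

In this model a 10000 usec request lasts about 20000 usec at 100Hz but only about 11000 usec at 1000Hz, so the same loop would run nearly twice as often on the faster tick.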

Dave