Help on strange performance (user/sys/idle) problem using a 2.2.20 kernel
Hello,
I hope that someone out there has encountered the problem I am seeing when running a 2.2.20 kernel in an embedded set-top box. The set-top box runs an A/V application that receives MPEG-2 streams from dedicated A/V servers via UDP/IP in an embedded (dedicated) environment, where the only things running are the kernel, its kernel tasks, and the application itself.
What I am seeing is a stair-stepping effect where performance (measured as the percentage of CPU idle time reported by vmstat) improves over a period of approximately 8 hours. At its best, vmstat reports a CPU idle percentage of approximately 40 percent. At roughly the 8-hour mark, the CPU idle percentage suddenly drops to a very low level (8 or 9 percent) for a few minutes, then starts to improve again, and the cycle repeats. While the CPU idle percentage is low, vmstat also reports more CPU time being spent in kernel space than at other times. All other metrics reported by vmstat look as expected.
I have tried tweaking various kernel tunable parameters, but they don't seem to affect the behavior, so no magic bullet has been found yet and I'm running out of ammunition to fire...
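For anyone wanting to log the same user/sys/idle split over a long run, here is a minimal sketch of how those percentages can be derived from two readings of the aggregate `cpu` line in `/proc/stat`. It assumes the line starts with `cpu user nice system idle` counters and that nice time is folded into user time; the function name `cpu_percentages` is hypothetical, and this is not a reimplementation of vmstat itself.

```python
def cpu_percentages(prev_line, curr_line):
    """Return (user, system, idle) percentages between two 'cpu' lines.

    Assumption: each line looks like 'cpu user nice system idle ...',
    with cumulative tick counters. Any extra fields are ignored.
    """
    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        user, nice, system, idle = (int(x) for x in parts[1:5])
        # Nice time is counted as user time here (an assumption).
        return user + nice, system, idle

    prev = fields(prev_line)
    curr = fields(curr_line)
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("no elapsed ticks between samples")
    return tuple(100.0 * d / total for d in deltas)


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (the aggregate 'cpu' line)."""
    with open(path) as f:
        return f.readline()
```

Calling `read_cpu_line()` at a fixed interval and passing consecutive readings to `cpu_percentages` gives one (user, system, idle) sample per interval, which can be written out with a timestamp for later analysis.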
Has anyone seen this type of behavior and if so, any suggestions on how to resolve it?
Thanks much!
PS: Unfortunately upgrading to a modern kernel or changing the hardware is not an option at this time as my company did not do the platform requirements.
Is it possible for you to re-state this problem in terms of the application? In other words, is it "briefly dropping frames?" Is it "suddenly running slower than an animal in molasses?" Is the disk drive dancing a tango?
I don't know of many pure-computer things that would be tied so regularly to a schedule, such as an eight-hour clock. So if you are consistently observing that kind of pattern, I think that I would take a closer look at the nature of the workload, maybe even of the environment. "What happens, in the real world, about every eight hours?"
In other words, you're looking at kernel tuning parameters at a point where I am not yet persuaded that you are twiddling the right knobs.
Thank you for responding. In answer to your questions, the app is being run in a controlled lab "test environment". The environment consists of three machines (two PCs, one embedded STB) on a dedicated and closed LAN. Each of the two PCs generates two output A/V streams, for a total of four different streams. The STB can choose to display one of the four available streams, simulating the eventual deployment by allowing a customer to switch channels to different sources.
All of the data is being sent/received using UDP/IP, and the company that I work for is responsible for adding FEC protection to the actual data packets so packet loss can be handled without needing retransmission.
What happens when vmstat reports the CPU idle percentage dropping to very low levels (8 or 9 percent idle) is that the application simply runs out of CPU horsepower: the FEC decoding takes up too much processing time, and frames are dropped because the application can't pull them from the IP stack quickly enough. This is observable by watching the video output on a TV, where you can see artifacting and frame jitter occurring.
The workload itself does not change: the two transmitting PCs simply loop continuously, replaying (encoding and outputting) the A/V streams, and the STB just plays what it is told to play. At this point I really don't see how the environment can affect what is occurring.
BTW: We have also done "real world" testing using a partial deployment and the behavior was first discovered there. We are running the tests in our own labs in order to attempt to identify the root cause under controlled conditions.
I also completely agree with you that we may not be tweaking the right knobs. At this point we are just attempting to theorize possible problems and test those theories by practical observations. We are not sure which "knobs" actually are the right ones to tweak.
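One way to make those practical observations more precise is to measure how regular the roughly 8-hour cycle really is. The following is only a sketch, assuming the idle percentage has been logged as (timestamp in seconds, idle percent) pairs; the function names and the 15 percent threshold are hypothetical choices, picked to sit between the 8 to 9 percent lows and the 40 percent best case described above.

```python
def low_idle_episodes(samples, threshold=15.0):
    """Return the start timestamp of each low-idle episode.

    samples: iterable of (timestamp_seconds, idle_percent), in time order.
    threshold: idle percentage below which a sample counts as "low"
               (15.0 is an illustrative value, not a measured one).
    """
    starts = []
    in_episode = False
    for ts, idle in samples:
        if idle < threshold:
            if not in_episode:
                # First low sample after a normal one: a new episode begins.
                starts.append(ts)
                in_episode = True
        else:
            in_episode = False
    return starts


def intervals(starts):
    """Return the gaps, in seconds, between consecutive episode starts."""
    return [b - a for a, b in zip(starts, starts[1:])]
```

If the gaps come out nearly constant, that points toward something counter-driven or timer-driven; if they drift with the streams being played, the workload is the more likely suspect.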
Again, thank you very much for responding. Any insights or suggestions you can offer are greatly appreciated!