Why is my application waiting in sync_page for a long time?
Hi,
We have RHEL 6.4 and a tool, bip_server, installed on this server. Whenever it is started, it takes a very long time (around 20-30 minutes) and waits at sync_page in the D1 state. There is no load on the server, and sufficient memory is available. The ulimit settings are fine for the user that starts the application. Can somebody help me understand what else I need to check? Code:
[root@apps_db34 ~]# ps -ef | grep -i bip |
Process status D is uninterruptible sleep: https://unix.stackexchange.com/quest...state-indicate
It is most likely performing I/O, and since it is merely asleep, it should not be harming the operating system. Apart from that, do you have any particular reason to believe the server is not fulfilling the function you have it performing? I'd read the Oracle documentation about it or search their website for further information. |
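A quick way to confirm what state the process is in, and which kernel function it is sleeping in, is to check the STAT and wchan columns. A minimal sketch (a background `sleep` stands in for bip_server here so the commands run as-is; substitute the real PID on the server):

```shell
# Stand-in target process; on the real box, use the bip_server PID.
sleep 30 & pid=$!

# STAT column: D = uninterruptible sleep, S = interruptible sleep, R = running
ps -o pid,stat,wchan:30,cmd -p "$pid"

# wchan: the kernel function the process is currently sleeping in
# (this is where "sync_page" would show up)
cat /proc/"$pid"/wchan; echo

kill "$pid"
```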
Try strace, to see what it was doing just before it hung.
There are probably other things in its /proc to dig into for more clues, but I don't know offhand. Has it ever worked? Any oddities, like NFS swap space? Vendor support? ;) I assume the original RHEL kernel. I don't know which kernel source file sync_page lives in (anyone?) |
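A sketch of the suggestion above: attach strace briefly and poke at the /proc entries that matter for a D-state process. PID 2654 is the one from this thread; a background `sleep` stands in for it here so the commands are runnable (and strace is guarded, since it may not be installed on a minimal system):

```shell
# Stand-in target; on the real server, use the bip PID (2654 in this thread).
sleep 30 & pid=$!

# Attach strace for a few seconds to see which syscall the process is
# blocked in. -f follows threads, -tt timestamps each call.
if command -v strace >/dev/null; then
    timeout 3 strace -tt -f -p "$pid" -o /tmp/trace.out || true
fi

# /proc clues:
cat /proc/"$pid"/wchan; echo          # kernel function it is sleeping in
grep '^State:' /proc/"$pid"/status    # S (sleeping), D (uninterruptible), ...
cat /proc/"$pid"/stack 2>/dev/null || true   # kernel stack trace (needs root)

kill "$pid"
```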
Quote:
Code:
[root@apps_db34]# strace -s 99 -ffp 2654 |
#2 is probably right: just a 'normal' performance issue.
Now I think (not sure, though) that the D state might be a 'normal' reading-disk-in-progress occurrence! Can you repeat that ps ... | grep <pid of process found in D state> LOTS of times, to see IF it is ALWAYS in D state? I'm guessing MAYBE it's NOT!

I use just: strace -f -o myfile -p <#>

Do your two lines of dots just mean more of the same omitted? (I'm guessing: yes.) I'm guessing it keeps on doing the seek&read(9,... which would mean it's not hung at all! A 'false positive' of a hang, maybe; #2 simply right ;) What certainty/proof do you see that some actual activity/processing has hung? Could it be that the bip application simply does a huge amount of reading at startup? That would be just a performance issue, like dd reading terabytes at bs=1K. (Don't be upset IF my guess on this is wrong!)

Interesting that there are a dozen threads; ps has a -T option; any ideas on them? I haven't web-researched FUTEX_WAIT_PRIVATE (yet). Any other LQ'er have ideas? (I couldn't find a "D1" state, just D.) |
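The repeated-ps idea above can be sketched as a sampling loop: a process doing heavy reads flips in and out of D between samples, while a truly hung one shows D on every single sample. A minimal version (again using `sleep` as a stand-in target; substitute the real bip PID):

```shell
# Stand-in target; substitute the real bip PID on the server.
sleep 30 & pid=$!

# Take many samples of the state; a healthy-but-busy reader will show a mix
# of D and S/R, a hard hang will show D every time.
for i in $(seq 1 10); do
    ps -o stat=,wchan:25= -p "$pid"
    sleep 0.2
done | sort | uniq -c      # count of samples seen in each state

kill "$pid"
```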
It is still running and trying to come into an operational state, as you can see in the output below. On other servers it takes 2-3 minutes to come up, but here it has already been close to 2 hours.
Quote:
|
Ah! I see! Then I'd look into disk performance tools, to compare systems.
iostat, plus newer ones which I'm not familiar with. Web-search... Best wishes. p.s. Figure out (with lsof, maybe) what file that fd 9 is, and time a cp of it to /dev/null. Compare with the good systems. Are ALL system and app configs the same? Any dmesg errors? |
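The fd-9 idea above can be done straight from /proc. A sketch (hedged: here fd 9 is opened on /etc/passwd in the current shell so the commands run as-is; on the real server, inspect the bip PID instead of `$$`):

```shell
# Demo setup: open fd 9 on a known file. On the real server, skip this and
# use the bip PID in place of $$.
exec 9< /etc/passwd
pid=$$

ls -l /proc/"$pid"/fd/9           # which file is fd 9?
f=$(readlink /proc/"$pid"/fd/9)

# Time a sequential read of that file; compare the throughput dd reports
# against the same command on a healthy node.
dd if="$f" of=/dev/null bs=1M

# Whole-disk view while the app starts (iostat is in the sysstat package).
if command -v iostat >/dev/null; then iostat -x 1 3; fi

exec 9<&-
```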
That's "Dl", not "D1".
What kernel is this running? Is it under KVM? Is it using an NFS mount? |
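The three questions above can each be answered from the shell. A quick sketch (hedged: the exact dmesg/cpuinfo output varies by distro and hypervisor):

```shell
uname -r                                     # exact kernel version

# The "hypervisor" CPU flag is set inside VMware/KVM guests.
grep -qi hypervisor /proc/cpuinfo && echo "virtualized" || echo "bare metal?"

# Which hypervisor, if the boot log mentions one (may need root for dmesg).
dmesg 2>/dev/null | grep -im1 'vmware\|kvm' || true

mount -t nfs,nfs4                            # lists NFS mounts; empty = none
```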
It is not on KVM; this VM is running on VMware. There is no NFS mount on the server.
Quote:
|
That code was a bit of a problem child, apparently, and was yanked from the kernel in 2.6.39. That would be May 2011.
Even allowing for Red Hat backporting, I'd reckon it's time to upgrade rather than chasing this sort of thing. Else raise a ticket with Red Hat if still under support. |
We do not have Red Hat support for these servers (hosted on VMware), so we need to figure it out ourselves.
We can go with an upgrade, but we are keeping that as a secondary option, because this is failing on only one server while the other 3 nodes on the same kernel level are fine. Were you able to find a similar known issue related to this kernel level (any link or bug report)? If we have evidence that this kernel is problematic, we should be able to upgrade all the nodes. |