Why is my application waiting in sync_page for a long time?
Hi,
We have RHEL 6.4 and a tool, bip_server, installed on this server. Whenever it is started, it takes a very long time (around 20-30 minutes) and waits at sync_page in the D1 state. There is no load on the server, and sufficient memory is available. The ulimit settings are fine for the user that starts the application. Can somebody help me understand what else I need to check? Code:
[root@apps_db34 ~]# ps -ef | grep -i bip |
Process status D is uninterruptible sleep: https://unix.stackexchange.com/quest...state-indicate
It is most likely performing I/O, and since it is merely asleep, it should not be harming the operating system. Apart from that, do you have any particular reason to believe the server is not fulfilling the function you have it performing? I'd read the Oracle documentation about it or search their website for further information. |
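A quick way to confirm what state the process is in, and which kernel function it is sleeping in, is to check the STAT and wchan columns. A minimal sketch (a background `sleep` stands in for bip_server here so the commands run as-is; substitute the real PID on the server):

```shell
# Stand-in target process; on the real box, use the bip_server PID.
sleep 30 & pid=$!

# STAT column: D = uninterruptible sleep, S = interruptible sleep, R = running
ps -o pid,stat,wchan:30,cmd -p "$pid"

# wchan: the kernel function the process is currently sleeping in
# (this is where "sync_page" would show up)
cat /proc/"$pid"/wchan; echo

kill "$pid"
```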
Try strace, to see what it was doing just before it hung.
There are probably other things in its /proc to dig into for more clues, but I don't know offhand. Has it ever worked? Any oddities, like NFS swap space? Vendor support? ;) I assume the original RHEL kernel. I don't know which kernel source file sync_page lives in (anyone?) |
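A sketch of the suggestion above: attach strace briefly and poke at the /proc entries that matter for a D-state process. PID 2654 is the one from this thread; a background `sleep` stands in for it here so the commands are runnable (and strace is guarded, since it may not be installed on a minimal system):

```shell
# Stand-in target; on the real server, use the bip PID (2654 in this thread).
sleep 30 & pid=$!

# Attach strace for a few seconds to see which syscall the process is
# blocked in. -f follows threads, -tt timestamps each call.
if command -v strace >/dev/null; then
    timeout 3 strace -tt -f -p "$pid" -o /tmp/trace.out || true
fi

# /proc clues:
cat /proc/"$pid"/wchan; echo          # kernel function it is sleeping in
grep '^State:' /proc/"$pid"/status    # S (sleeping), D (uninterruptible), ...
cat /proc/"$pid"/stack 2>/dev/null || true   # kernel stack trace (needs root)

kill "$pid"
```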
Quote:
Code:
[root@apps_db34]# strace -s 99 -ffp 2654 |
#2 is probably right: just a 'normal' performance issue.
Now I think (not sure, though) that the D state might be a 'normal' reading-disk-in-progress occurrence! Can you repeat that ps ... | grep <pid of process found in D state> LOTS of times, to see IF it is ALWAYS in D state? I'm guessing MAYBE it's NOT!

I use just: strace -f -o myfile -p <#>

Do your two lines of dots just mean more of the same omitted? (I'm guessing: yes.) I'm guessing it keeps on doing the seek&read(9,... which would mean it's not hung at all! A 'false positive' of a hang, maybe; #2 simply right ;) What certainty/proof do you see that some actual activity/processing has hung? Could it be that the bip application simply does a huge amount of reading at startup? That would be just a performance issue, like dd reading terabytes at bs=1K. (Don't be upset IF my guess on this is wrong!)

Interesting that there are a dozen threads; ps has a -T option; any ideas on them? I haven't web-researched FUTEX_WAIT_PRIVATE (yet). Any other LQ'er have ideas? (I couldn't find a "D1" state, just D.) |
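The repeated-ps idea above can be sketched as a sampling loop: a process doing heavy reads flips in and out of D between samples, while a truly hung one shows D on every single sample. A minimal version (again using `sleep` as a stand-in target; substitute the real bip PID):

```shell
# Stand-in target; substitute the real bip PID on the server.
sleep 30 & pid=$!

# Take many samples of the state; a healthy-but-busy reader will show a mix
# of D and S/R, a hard hang will show D every time.
for i in $(seq 1 10); do
    ps -o stat=,wchan:25= -p "$pid"
    sleep 0.2
done | sort | uniq -c      # count of samples seen in each state

kill "$pid"
```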
It is still running and trying to come into an operational state, as you can see in the output below. On other servers it takes 2-3 minutes to come up, but here it has already been close to 2 hours.
Quote:
|
Ah! I see! Then I'd look into disk performance tools, to compare systems.
iostat, plus newer ones which I'm not familiar with. Web-search... Best wishes. p.s. Figure out (with lsof, maybe) what file that fd 9 is, and time a cp of it to /dev/null. Compare with the good systems. Are ALL system and app configs the same? Any dmesg errors? |
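The fd-9 idea above can be done straight from /proc. A sketch (hedged: here fd 9 is opened on /etc/passwd in the current shell so the commands run as-is; on the real server, inspect the bip PID instead of `$$`):

```shell
# Demo setup: open fd 9 on a known file. On the real server, skip this and
# use the bip PID in place of $$.
exec 9< /etc/passwd
pid=$$

ls -l /proc/"$pid"/fd/9           # which file is fd 9?
f=$(readlink /proc/"$pid"/fd/9)

# Time a sequential read of that file; compare the throughput dd reports
# against the same command on a healthy node.
dd if="$f" of=/dev/null bs=1M

# Whole-disk view while the app starts (iostat is in the sysstat package).
if command -v iostat >/dev/null; then iostat -x 1 3; fi

exec 9<&-
```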
That's "Dl", not "D1".
What kernel is this running? Is it under KVM? Is it using an NFS mount? |
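The three questions above can each be answered from the shell. A quick sketch (hedged: the exact dmesg/cpuinfo output varies by distro and hypervisor):

```shell
uname -r                                     # exact kernel version

# The "hypervisor" CPU flag is set inside VMware/KVM guests.
grep -qi hypervisor /proc/cpuinfo && echo "virtualized" || echo "bare metal?"

# Which hypervisor, if the boot log mentions one (may need root for dmesg).
dmesg 2>/dev/null | grep -im1 'vmware\|kvm' || true

mount -t nfs,nfs4                            # lists NFS mounts; empty = none
```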
It is not on KVM; this VM is running on VMware. There is no NFS mount on the server.
Quote:
|
That code was a bit of a problem child, apparently, and was yanked from the kernel in 2.6.39. That would be May 2011.
Even allowing for Red Hat backporting, I'd reckon it's time to upgrade rather than chasing this sort of thing. Else raise a ticket with Red Hat if still under support. |
We do not have Red Hat support for these servers (hosted on VMware), so we need to figure it out ourselves.
We can go with an upgrade, but we are keeping that as a secondary option, because this is failing on only one server while the other 3 nodes on the same kernel level are fine. Were you able to find a similar known issue related to this kernel level (any link or bug report)? If we have evidence that this kernel is problematic, we should be able to upgrade all the nodes. |