Old 03-16-2015, 03:58 PM   #1
dralasimer
LQ Newbie
 
Registered: Mar 2015
Posts: 5

Kernel - scheduling of thread with IO - how does it work?


Hello everyone !

This is a message in a bottle...

I'm currently seeing unexpected behavior on my embedded Linux system at work:

Context:
My software application is multithreaded (it uses condition variables and mutexes for synchronization between threads).
Some of the threads participate in a real-time operation (a task that must complete in less than 1 second).
These threads are set with the following properties: scheduler=SCHED_RR, priority=55
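For reference, these threads are configured roughly like this (a minimal sketch, not our exact code; it needs root or CAP_SYS_NICE):

Code:
/* Minimal sketch: give a thread the properties described above
 * (SCHED_RR, priority 55). Error handling reduced to a message. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void make_realtime(pthread_t thread)
{
    struct sched_param param;
    memset(&param, 0, sizeof(param));
    param.sched_priority = 55;   /* the priority our RT threads use */

    int err = pthread_setschedparam(thread, SCHED_RR, &param);
    if (err != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
}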


What's going wrong ?
It runs well most of the time.
But sometimes (about once every 2 hours, under the same test conditions the whole time), most of the threads are "locked" for more than 1 second, and I cannot explain why.


With ftrace I saw that during this "lock" time only a few threads/processes are still running, with a lot of "idle" task time (> 20 ms) between each of them:

- 2 threads with the following properties: scheduler=SCHED_RR, priority=50
These threads do some network operations (with a select()). I don't know exactly what they do (they belong to a closed-source third-party library we use).

- vsftpd access (in our test, a PC periodically performs FTP accesses to an FTP directory on a compact flash device).


I found that I can reproduce this behavior by executing a "dd" command that writes files to flash continuously.


I don't understand:
- why all the other tasks are "locked" during all this time (sometimes while requesting a mutex, sometimes while calculating a simple 16-bit CRC),
- why so much idle time shows up in ftrace (between sched events) during this period,
- why raising the application thread priorities doesn't solve the issue.


Since it seems linked to IO, I changed the IO scheduler from deadline to cfq.
Result: it seems a little better, especially with the "dd" command, but the "locking" issue is still present.

I added a call to fsync() because 10 seconds before the issue, 1 MB of files had been copied to flash (fsync() is now called just after the copy of these files).
Result: better but not fixed (it reduced the number of occurrences).
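The copy path now looks roughly like this (a minimal sketch; write_and_flush() is an illustrative name, not our actual code):

Code:
/* Minimal sketch: write a file and flush it to the flash device
 * immediately, instead of leaving the dirty pages to the kernel's
 * write-back. A short write is treated as failure. */
#include <fcntl.h>
#include <unistd.h>

static int write_and_flush(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int ok = (write(fd, buf, len) == (ssize_t)len)  /* full write...  */
          && (fsync(fd) == 0);                      /* ...then flush  */
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}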


My question:
I suspect something linked to IO management in the kernel, as if the kernel preempts every non-IO thread in order to do all the work relating to IO (network, files, ...).

Is that right?

Does the kernel implement this behavior by default?

If yes, how does it work?

How can I fix or change it to solve my issue?



My kernel settings:
Linux kernel version 2.6.39
Preempt option enabled
tickless
HZ=1000
CFQ scheduler (Default settings)
 
Old 03-20-2015, 02:58 AM   #2
dijetlo
Senior Member
 
Registered: Jan 2009
Location: RHELtopia....
Distribution: Solaris 11.2/Slackware/RHEL/
Posts: 1,491
Blog Entries: 2

Quote:
most of the threads are "locked" for more than 1 second, and I cannot explain why
They are being blocked by the threads that aren't locked.
Or, to think of it another way: a thread with a higher priority is in a loop and forcing the rest of your threads to wait, or its scheduling policy is "batch", for example.

Quote:
- 2 threads with the following properties: scheduler=SCHED_RR, priority=50
These threads do some network operations (with a select()). I don't know exactly what they do (they belong to a closed-source third-party library we use).
These seem likely candidates...
 
Old 03-20-2015, 03:53 PM   #3
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Hi

Sorry if this doesn't help, as I have no clue about "realtime" stuff - but I do fiddle around with threads from time to time, so I'll just dump what I have in mind:
1)
This...
Quote:
I found that I can reproduce this behavior by executing a "dd" command that writes files to flash continuously.
...is probably an important hint, right?

2)
Quote:
(about once every 2 hours, under the same test conditions the whole time), most of the threads are "locked" for more than 1 second, and I cannot explain why
Do you mean that they're waiting to grab one of your mutexes, or are they basically just "frozen" for no apparent reason? Did you already add, e.g., "sleep" calls in the methods/functions that are called by the threads, to exclude that it's the organization of your mutexes that blocks them?
Maybe your problem is a mixed one - on one side your code is not optimal, and on the other side something else is going on?

3)
Quote:
- why raising the application thread priorities doesn't solve the issue.
I don't think thread/process priorities solve anything when you have to deal with locked mutexes - once a mutex is locked, the thread that locked it has to finish its critical section. Actually, raising the priority of the threads waiting for the mutex might make the problem worse: CPU cycles get taken away from the thread that already holds the mutex, making that thread slower, and therefore making the waiting threads wait even longer than usual. Or am I wrong? (See the sketch at the end of this post for the usual mitigation.)

4)
Btw., which kernel version are you using?
There was a big patch some time ago (for sure before kernel 3.17) that makes the flush of dirty buffers not lock all processes => maybe that might help?
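Regarding point 3: if it really is such an inversion, the textbook mitigation (as far as I know - realtime is not my area) is a priority-inheritance mutex, where the holder is temporarily boosted to the priority of the highest-priority waiter. A minimal sketch, assuming your libc and kernel support PTHREAD_PRIO_INHERIT:

Code:
/* Minimal sketch: a mutex created with the priority-inheritance
 * protocol, so a low-priority holder is boosted while a higher-priority
 * thread waits on it. */
#include <pthread.h>

static pthread_mutex_t lock;

static int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);

    int err = pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}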
 
1 member found this post helpful.
Old 03-21-2015, 05:47 AM   #4
dralasimer
LQ Newbie
 
Registered: Mar 2015
Posts: 5

Original Poster
Thanks for your responses.


I can easily reproduce a similar behavior with dd (about 5 minutes after starting dd), but this behavior can also occur during nominal application use (without any dd):
- when one of my two network cameras is disconnected (these 2 cameras are managed by a binary library, "pylon" from Basler, with two threads at SCHED_RR priority 50). But I don't know exactly what this library does...
- when something else occurs (I suspect a log flush to flash, but I don't have proof).


I'm wondering whether all of this has the same root cause...

Sometimes the last ftrace mark I see before the "lock" is a boost mutex lock.
But sometimes the "lock" occurs during a simple CRC calculation (no mutex lock request).
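By the way, the custom MARKER lines you can see in the trace are written from userspace roughly like this (a minimal sketch; our real helper differs, and it assumes debugfs is mounted at /sys/kernel/debug):

Code:
/* Minimal sketch: print a custom mark into the ftrace buffer through
 * the trace_marker file (this is what shows up as tracing_mark_write /
 * "print: ... MARKER" events in the trace). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int trace_fd = -1;

static void trace_mark(const char *file, int line)
{
    if (trace_fd < 0)
        trace_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
    if (trace_fd < 0)
        return;

    char buf[128];
    int n = snprintf(buf, sizeof(buf), "MARKER - %s - %d", file, line);
    if (n > 0)
        write(trace_fd, buf, (size_t)n);
}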


My kernel version is 2.6.39.
About the flush patch: that's interesting information, thank you.


If you are interested, I have shared the ftrace file (to be viewed with kernelshark) from a run where I reproduced the issue with an external dd command:
https://drive.google.com/file/d/0B6p...ew?usp=sharing

Maybe it can help you see what is really happening, at least for the "dd" case.

The "hole" ("freeze") - no more custom ftrace markers from my application - is at timestamps:

begin: 469.118370
end: 469.802940

Another little "hole"
begin: 469.807644
end: 469.952975
 
Old 03-22-2015, 10:05 AM   #5
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

I see this in your trace:

Quote:
begin: 469.118370
Code:
             sdq-1654  [001]   469.118369: kmalloc:              (tracing_mark_write+0x35) call_site=c015eede ptr=0xcd845540 bytes_req=78 bytes_alloc=128 gfp_flags=GFP_KERNELGFP_NOTRACK
             sdq-1654  [001]   469.118371: print:                00000000 MARKER - ../../../../import/boost_util/include/util/Concurrent_queue.h - 203
             sdq-1654  [001]   469.118372: kfree:                (tracing_mark_write+0x87) call_site=c015ef30 ptr=0xcd845540
             sdq-1654  [001]   469.118374: sys_exit_write:       0x4c
             sdq-1654  [001]   469.118375: sys_exit:             NR 4 = 76
             sdq-1654  [001]   469.118377: sys_enter_close:      fd: 0x001e
             sdq-1654  [001]   469.118378: sys_enter:            NR 6 (1e, 0, 8139ff4, 810d4b4, b4432817, b4432998)
             sdq-1654  [001]   469.118381: sys_exit_close:       0x0
             sdq-1654  [001]   469.118383: sys_exit:             NR 6 = 0
             sdq-1654  [001]   469.118388: sys_enter_futex:      uaddr: 0x8980734, op: 0x0080, val: 0x0a35, utime: 0x0000, uaddr2: 0x8980718, val3: 0x0a35
             sdq-1654  [001]   469.118389: sys_enter:            NR 240 (8980734, 80, a35, 0, 8980718, a35)
             sdq-1654  [001]   469.118395: sched_stat_runtime:   comm=sdq pid=1654 runtime=227712 [ns] vruntime=48967751264 [ns]
             sdq-1654  [001]   469.118401: sched_switch:         sdq:1654 [120] S ==> kworker/0:0:0 [120]
          <idle>-0     [000]   469.118411: irq_handler_entry:    irq=18 name=uhci_hcd:usb4
          <idle>-0     [000]   469.118423: kmem_cache_free:      (uhci_free_urb_priv+0x8c) call_site=c031a650 ptr=0xde769ee4
          <idle>-0     [000]   469.118428: kmem_cache_alloc:     (uhci_urb_enqueue+0x52) call_site=c031c54c ptr=0xde769ee4 bytes_req=28 bytes_alloc=28 gfp_flags=GFP_ATOMIC|GFP_ZEROGFP_NOTRACK
          <idle>-0     [000]   469.118434: irq_handler_exit:     irq=18 ret=handled
          <idle>-0     [000]   469.118443: irq_handler_entry:    irq=16 name=uhci_hcd:usb2
          <idle>-0     [000]   469.118450: kmem_cache_free:      (uhci_free_urb_priv+0x8c) call_site=c031a650 ptr=0xde76957c
          <idle>-0     [000]   469.118454: kmem_cache_alloc:     (uhci_urb_enqueue+0x52) call_site=c031c54c ptr=0xde76957c bytes_req=28 bytes_alloc=28 gfp_flags=GFP_ATOMIC|GFP_ZEROGFP_NOTRACK
          <idle>-0     [000]   469.118458: irq_handler_exit:     irq=16 ret=handled
etc...

Quote:
begin: 469.807644
Code:
        netcheck-1665  [001]   469.807643: kmalloc:              (tracing_mark_write+0x35) call_site=c015eede ptr=0xcd845540 bytes_req=78 bytes_alloc=128 gfp_flags=GFP_KERNELGFP_NOTRACK
        netcheck-1665  [001]   469.807645: print:                00000000 MARKER - ../../../../import/boost_util/include/util/Concurrent_queue.h - 203
        netcheck-1665  [001]   469.807645: kfree:                (tracing_mark_write+0x87) call_site=c015ef30 ptr=0xcd845540
        netcheck-1665  [001]   469.807647: sys_exit_write:       0x4c
        netcheck-1665  [001]   469.807649: sys_exit:             NR 4 = 76
        netcheck-1665  [001]   469.807651: sys_enter_close:      fd: 0x001e
        netcheck-1665  [001]   469.807652: sys_enter:            NR 6 (1e, 0, 8139ff4, 810d4b4, b3927817, b3927998)
        netcheck-1665  [001]   469.807655: sys_exit_close:       0x0
        netcheck-1665  [001]   469.807656: sys_exit:             NR 6 = 0
        netcheck-1665  [001]   469.807663: sys_enter_futex:      uaddr: 0xb5a869e4, op: 0x0080, val: 0x0273, utime: 0x0000, uaddr2: 0xb5a869c8, val3: 0x0273
        netcheck-1665  [001]   469.807664: sys_enter:            NR 240 (b5a869e4, 80, 273, 0, b5a869c8, 273)
        netcheck-1665  [001]   469.807680: sched_switch:         netcheck:1665 [44] S ==> kworker/0:0:0 [120]
          <idle>-0     [000]   469.808053: hrtimer_cancel:       hrtimer=0xdf002b34
          <idle>-0     [000]   469.808055: hrtimer_expire_entry: hrtimer=0xdf002b34 now=529383007503 function=tick_sched_timer/0x0
          <idle>-0     [000]   469.808057: softirq_raise:        vec=1 [action=TIMER]
          <idle>-0     [000]   469.808062: hrtimer_expire_exit:  hrtimer=0xdf002b34
          <idle>-0     [000]   469.808063: hrtimer_start:        hrtimer=0xdf002b34 function=tick_sched_timer/0x0 expires=529384000000 softexpires=529384000000
          <idle>-0     [000]   469.808069: irq_handler_entry:    irq=0 name=timer
          <idle>-0     [000]   469.808073: irq_handler_exit:     irq=0 ret=handled
          <idle>-0     [000]   469.808076: softirq_entry:        vec=1 [action=TIMER]
          <idle>-0     [000]   469.808077: softirq_exit:         vec=1 [action=TIMER]
          <idle>-0     [001]   469.808082: hrtimer_cancel:       hrtimer=0xdf202b34
          <idle>-0     [001]   469.808084: hrtimer_expire_entry: hrtimer=0xdf202b34 now=529383037045 function=tick_sched_timer/0x0
          <idle>-0     [001]   469.808090: softirq_raise:        vec=1 [action=TIMER]
          <idle>-0     [001]   469.808091: softirq_raise:        vec=9 [action=RCU]
          <idle>-0     [001]   469.808096: hrtimer_expire_exit:  hrtimer=0xdf202b34
etc...
I have no clue what the output means, but in any case I don't see anything interesting that is unique to these two blocks.

Still, taking into account that I know nothing about realtime processes/threads and am therefore walking on thin ice, I think your issue is complex because there are a lot of different things going on at the same time (CPU, I/O, network, RAM de/allocation).

Taking a step back, are we sure that your OS/HW is ready to handle realtime processes?

You could first of all start using a kernel version that does not globally block I/O when it has to flush dirty blocks / empty the writeback buffer to storage => have a look at the changelogs.

Then, I'm not sure if you use a special realtime kernel. In the Gentoo distribution there is a separate kernel for realtime (package "sys-kernel/rt-sources" - "Full Linux 3.14 kernel sources with the CONFIG_PREEMPT_RT patch").

Then I'm not sure if TICKLESS is good - it's for sure good for power saving, but I would guess it's not a great feature for realtime stuff (read e.g. here from the 2nd post onwards). You probably don't want your CPU to go to sleep.

Finally, I'm not sure if you're using power-saving options (I would disable them, including Intel's Turbo Boost), what your CPU is (e.g. AMD multicore CPUs are less "multicore" than Intel's - the cores still share components internally), or whether you're using hyperthreading (I would disable it). I would also check whether the spontaneous migration of processes/threads between CPU cores causes trouble for realtime threads, and if yes, see how to assign them always to the same CPU core (sketched below).
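Pinning a thread to a core would look roughly like this (a minimal sketch using the GNU-specific pthread_setaffinity_np()):

Code:
/* Minimal sketch: pin a thread to one CPU core so the scheduler cannot
 * migrate it. The core number is just an example. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}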

What do you think?
 
1 member found this post helpful.
Old 03-22-2015, 03:22 PM   #6
dijetlo
Senior Member
 
Registered: Jan 2009
Location: RHELtopia....
Distribution: Solaris 11.2/Slackware/RHEL/
Posts: 1,491
Blog Entries: 2

Quote:
I can easily reproduce a similar behavior with dd (about 5 minutes after starting dd), but this behavior can also occur during nominal application use (without any dd):
Have you tried monitoring IO activity (iostat -xp <device> 1) to see if your suspicion that it's IO-related is correct? I tend to agree with Pearlseattle's assessment that we're dealing with too many possibilities...
Quote:
start using a kernel version that does not globally block I/O when it has to flush dirty blocks / empty the writeback buffer to storage ... I'm not sure if TICKLESS is good
This may just be the trade-off 'tickless' has made. An easy test is to change platforms (to address the HW question and remove the SCHED_RR threads from the equation), install a kernel with a different thread-scheduling algorithm, and see if the problem persists.

Last edited by dijetlo; 03-22-2015 at 03:50 PM. Reason: needed moar letterz.. no wait... less letterz
 
Old 03-25-2015, 10:16 PM   #7
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

It sounds to me, simply, like there's a bug in that application or library. At some point, under load, the bug manifests itself ... some kind of tiny timing hole in the way it uses mutexes, maybe. Perhaps some I/O error or other condition that happens rarely. Or maybe the thread that has the resource locked is just waiting a lot longer than you think it should.

When a deadlock occurs, CPU utilization drops to near-zero, and fiddling with thread/process priorities won't help because none of them are "runnable." Looking at an ordinary ps display at that time should show you what the various processes and threads are waiting for.
 
Old 07-01-2015, 12:19 PM   #8
dralasimer
LQ Newbie
 
Registered: Mar 2015
Posts: 5

Original Poster
Hi everyone,

I'm updating the status of this topic: we think we found the root cause of the lock.

My company hired a Linux expert for 2 days. Thanks to him we found the following:

The locks were caused by the fact that all flash accesses are blocked while the kernel flushes dirty data to the device.

This especially affects our logger module (used for timestamping, ...), which calls the syslog() function.

But this syslog() call also blocks the calling process, even though the syslogd daemon is the process that actually accesses the flash...
(We suspect that the unix sockets used for syslog communication block until resources are available, just as a bash pipe '|' does when writing a lot of logs into a file located on flash.)


The solution was to separate all flash access from the real-time threads: log/flash access is now done in an isolated thread, with a non-blocking custom message queue as the communication mechanism (sketched below).
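The pattern looks roughly like this (a heavily simplified sketch of the idea, not our production code - the real queue and message format differ):

Code:
/* Minimal sketch of the fix: real-time threads push log messages into a
 * bounded queue without ever blocking; a dedicated thread drains the
 * queue and is the only caller of syslog(), so flash/flush stalls hit
 * that thread alone. */
#include <pthread.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

#define QCAP 256                /* queue slots (one kept free to tell full/empty apart) */
#define QMSG 128                /* max message length */

static char q[QCAP][QMSG];
static unsigned q_head, q_tail; /* head: next read, tail: next write */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from the real-time threads: never blocks. If the lock is
 * contended or the queue is full, the message is dropped. */
int log_try_push(const char *msg)
{
    if (pthread_mutex_trylock(&q_lock) != 0)
        return -1;                        /* contended: drop rather than block */

    int ret = -1;
    unsigned next = (q_tail + 1) % QCAP;
    if (next != q_head) {                 /* queue not full */
        strncpy(q[q_tail], msg, QMSG - 1);
        q[q_tail][QMSG - 1] = '\0';
        q_tail = next;
        ret = 0;
    }
    pthread_mutex_unlock(&q_lock);
    return ret;
}

/* The isolated logger thread: the only code path that may block on
 * flash, via syslog(). */
void *logger_thread(void *arg)
{
    (void)arg;
    for (;;) {
        char msg[QMSG];
        int have = 0;

        pthread_mutex_lock(&q_lock);      /* held only for the copy */
        if (q_head != q_tail) {
            memcpy(msg, q[q_head], QMSG);
            q_head = (q_head + 1) % QCAP;
            have = 1;
        }
        pthread_mutex_unlock(&q_lock);

        if (have)
            syslog(LOG_INFO, "%s", msg);  /* may block; only this thread */
        else
            usleep(10000);                /* queue empty: poll every 10 ms */
    }
    return NULL;
}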

And it seems that it works!
 
1 member found this post helpful.
Old 07-01-2015, 05:55 PM   #9
Pearlseattle
Member
 
Registered: Aug 2007
Location: Zurich, Switzerland
Distribution: Gentoo
Posts: 999

Thanks for the update
 
  

