Kernel SMP + SLUB issues?

akiuni · 12-26-2012, 04:27 AM

Hi all

I'm having difficulties with Linux kernels on servers equipped with 4 CPU sockets (4 x 8 cores + Hyper-Threading = 64 logical CPUs in total) when the SLUB allocator is enabled.

Here are some examples:
- 4 CPUs + SLUB + kernel 2.6.35.14 = kernel panic after 1 h 10 min of uptime (stack trace at the end of my post).

- 4 CPUs + SLUB + kernel 2.6.39 = very poor networking performance. A bottleneck seems to appear in the softirqs, which are concentrated on cpu0 to cpu3.

- 4 CPUs + SLUB + kernel 3.2.34 = same behavior.
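One way to put a number on the "softirqs concentrated on cpu0 to cpu3" observation is to read the per-CPU counters in /proc/softirqs. The following is only a minimal Python sketch: the function names `parse_softirqs` and `cpu_share` are made up for illustration, and it assumes the usual layout of that file (a header row of CPU names, then one `NAME: count count ...` row per softirq type).

```python
def parse_softirqs(text):
    """Parse /proc/softirqs-style text into {softirq_name: [count per CPU]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        name, _, rest = ln.partition(":")
        counts = [int(x) for x in rest.split()]
        table[name.strip()] = counts[:ncpus]
    return table


def cpu_share(counts, cpus):
    """Fraction of all events in `counts` that were handled by the CPU indices in `cpus`."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[c] for c in cpus if c < len(counts)) / total
```

On a live box the input would be `open("/proc/softirqs").read()`, and `cpu_share(table["NET_RX"], range(4))` would give the fraction of receive softirqs handled by cpu0 to cpu3. Taking two snapshots a few seconds apart and comparing the differences shows current load rather than totals since boot.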

As a consequence, I'm wondering whether there is any known issue between the SMP and SLUB kernel options?

CONFIG_SLUB=y
CONFIG_SLUB_DEBUG=y
CONFIG_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_SCSI_SAS_HOST_SMP=y

Also, do you have any idea of an option that I could enable to improve SMP/SLUB performance?

thank you
best regards,
Julien

Stack trace:

Code:

Dec  5 17:07:10 Host kernel: BUG: unable to handle kernel paging request at 000000007f87312b
Dec  5 17:07:10 Host kernel: IP: [<ffffffff810c5228>] __d_lookup+0x88/0x150
Dec  5 17:07:10 Host kernel: PGD 105c776067 PUD 0
Dec  5 17:07:10 Host kernel: Oops: 0000 [#1] SMP
Dec  5 17:07:10 Host kernel: last sysfs file: /sys/class/scsi_host/host2/proc_name
Dec  5 17:07:10 Host kernel: CPU 32
Dec  5 17:07:10 Host kernel: Modules linked in: pkp_drv
Dec  5 17:07:10 Host kernel:
Dec  5 17:07:10 Host kernel: Pid: 18757, comm: keepalived Not tainted 2.6.35.14-Host64 #7 ....../PowerEdge R810
Dec  5 17:07:10 Host kernel: RIP: 0010:[<ffffffff810c5228>]  [<ffffffff810c5228>] __d_lookup+0x88/0x150
Dec  5 17:07:10 Host kernel: RSP: 0018:ffff88105b9c5b98  EFLAGS: 00210206
Dec  5 17:07:10 Host kernel: RAX: 0000000000000005 RBX: 000000007f87312b RCX: 0000000000000017
Dec  5 17:07:10 Host kernel: RDX: 018721dfc6947272 RSI: ffff88105b9c5c68 RDI: ffff88107f810240
Dec  5 17:07:10 Host kernel: RBP: ffff88105b9c5be8 R08: ffff88105b9c5c68 R09: 000000000000ffff
Dec  5 17:07:10 Host kernel: R10: 0000000000000005 R11: 000000000000000a R12: ffff88105b9c5c68
Dec  5 17:07:10 Host kernel: R13: 0000000008927e69 R14: ffff88107f810240 R15: 0000000000000000
Dec  5 17:07:10 Host kernel: FS:  0000000000000000(0000) GS:ffff880002800000(0063) knlGS:00000000f742eb80
Dec  5 17:07:10 Host kernel: CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
Dec  5 17:07:10 Host kernel: CR2: 000000007f87312b CR3: 0000001061dfe000 CR4: 00000000000006e0
Dec  5 17:07:10 Host kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  5 17:07:10 Host kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  5 17:07:10 Host kernel: Process keepalived (pid: 18757, threadinfo ffff88105b9c4000, task ffff88105a6c5460)
Dec  5 17:07:10 Host kernel: Stack:
Dec  5 17:07:10 Host kernel: dead000000200200 0000000000007125 0000000000000005 00000005ffb57b88
Dec  5 17:07:10 Host kernel: <0> ffff88105b9c5c78 00000000000001ae ffff88105b9c5c68 ffff88107f810240
Dec  5 17:07:10 Host kernel: <0> 0000000000007125 0000000000000000 ffff88105b9c5c18 ffffffff810c532b
Dec  5 17:07:10 Host kernel: Call Trace:
Dec  5 17:07:10 Host kernel: [<ffffffff810c532b>] d_lookup+0x3b/0x60
Dec  5 17:07:10 Host kernel: [<ffffffff810c53c9>] d_hash_and_lookup+0x79/0xa0
Dec  5 17:07:10 Host kernel: [<ffffffff8110269d>] proc_flush_task+0x8d/0x250
Dec  5 17:07:10 Host kernel: [<ffffffff8103bf52>] release_task+0x32/0x3c0
Dec  5 17:07:10 Host kernel: [<ffffffff8103c817>] wait_consider_task+0x537/0x950
Dec  5 17:07:10 Host kernel: [<ffffffff8103cd1d>] do_wait+0xed/0x220
Dec  5 17:07:10 Host kernel: [<ffffffff8103cef1>] sys_wait4+0xa1/0xf0
Dec  5 17:07:10 Host kernel: [<ffffffff8103b580>] ? child_wait_callback+0x0/0x70
Dec  5 17:07:10 Host kernel: [<ffffffff8106c01f>] compat_sys_wait4+0x8f/0xd0
Dec  5 17:07:10 Host kernel: [<ffffffff810b1890>] ? vfs_read+0x140/0x180
Dec  5 17:07:10 Host kernel: [<ffffffff8102a82b>] sys32_waitpid+0xb/0x10
Dec  5 17:07:10 Host kernel: [<ffffffff81029cc5>] sysenter_dispatch+0x7/0x2b
Dec  5 17:07:10 Host kernel: Code: 05 8e 7a 83 00 48 8b 00 48 89 c3 8b 45 cc 48 85 db 48 89 45 c0 75 14 eb 5a 66 2e 0f 1f 84 00 00 00 00 00 48 8b 1b 48 85 db 74 48 <48> 8b 03 4c 8d 63 e8 0f 18 08 45 39 6c 24 30 75 e7 4d 39 74 24
Dec  5 17:07:10 Host kernel: RIP  [<ffffffff810c5228>] __d_lookup+0x88/0x150
Dec  5 17:07:10 Host kernel: RSP <ffff88105b9c5b98>
Dec  5 17:07:10 Host kernel: CR2: 000000007f87312b
Dec  5 17:07:10 Host kernel: ---[ end trace cd9ae1febb12caac ]---
Dec  5 17:08:02 Host kernel: BUG: unable to handle kernel paging request at 000000007f87312b
Dec  5 17:08:02 Host kernel: IP: [<ffffffff810c5228>] __d_lookup+0x88/0x150
Dec  5 17:08:02 Host kernel: PGD 105c6ba067 PUD 0
Dec  5 17:08:02 Host kernel: Oops: 0000 [#2] SMP
Dec  5 17:08:02 Host kernel: last sysfs file: /sys/class/scsi_host/host2/proc_name
Dec  5 17:08:02 Host kernel: CPU 5
Dec  5 17:08:02 Host kernel: Modules linked in: pkp_drv
Dec  5 17:08:02 Host kernel:
Dec  5 17:08:02 Host kernel: Pid: 29486, comm: ps Tainted: G      D     2.6.35.14-Host64 #7 ....../PowerEdge R810
Dec  5 17:08:02 Host kernel: RIP: 0010:[<ffffffff810c5228>]  [<ffffffff810c5228>] __d_lookup+0x88/0x150
Dec  5 17:08:02 Host kernel: RSP: 0018:ffff88105c707d18  EFLAGS: 00010206
Dec  5 17:08:02 Host kernel: RAX: 0000000000000005 RBX: 000000007f87312b RCX: 0000000000000017
Dec  5 17:08:02 Host kernel: RDX: 018721dfc6947272 RSI: ffff88105c707dc8 RDI: ffff88107f810240
Dec  5 17:08:02 Host kernel: RBP: ffff88105c707d68 R08: ffff88105c707dc8 R09: ffffffff81101cb0
Dec  5 17:08:02 Host kernel: R10: 0000000000000005 R11: 000000000000000a R12: ffff88105c707dc8
Dec  5 17:08:02 Host kernel: R13: 0000000008927e69 R14: ffff88107f810240 R15: ffff88105c707e78
Dec  5 17:08:02 Host kernel: FS:  0000000000000000(0000) GS:ffff8800024a0000(0063) knlGS:00000000f7659ad0
Dec  5 17:08:02 Host kernel: CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
Dec  5 17:08:02 Host kernel: CR2: 000000007f87312b CR3: 000000105c7a2000 CR4: 00000000000006e0
Dec  5 17:08:02 Host kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  5 17:08:02 Host kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  5 17:08:02 Host kernel: Process ps (pid: 29486, threadinfo ffff88105c706000, task ffff88105d2f7080)
Dec  5 17:08:02 Host kernel: Stack:
Dec  5 17:08:02 Host kernel: ffff88107f2a5fa0 ffff88107fc68000 0000000000000005 00000005811c5c49
Dec  5 17:08:02 Host kernel: <0> ffff88105c707e78 00000000000001ae ffff88105c707dc8 ffff88107f810240
Dec  5 17:08:02 Host kernel: <0> 0000000000000005 ffff88105c707e78 ffff88105c707d98 ffffffff810c532b
Dec  5 17:08:02 Host kernel: Call Trace:
Dec  5 17:08:02 Host kernel: [<ffffffff810c532b>] d_lookup+0x3b/0x60
Dec  5 17:08:02 Host kernel: [<ffffffff81101cb0>] ? proc_pid_instantiate+0x0/0xd0
Dec  5 17:08:02 Host kernel: [<ffffffff810fedd6>] proc_fill_cache+0x86/0x170
Dec  5 17:08:02 Host kernel: [<ffffffff810eed50>] ? compat_filldir+0x0/0xf0
Dec  5 17:08:02 Host kernel: [<ffffffff811023cd>] proc_pid_readdir+0x19d/0x200
Dec  5 17:08:02 Host kernel: [<ffffffff810eed50>] ? compat_filldir+0x0/0xf0
Dec  5 17:08:02 Host kernel: [<ffffffff810ba66c>] ? path_put+0x2c/0x40
Dec  5 17:08:02 Host kernel: [<ffffffff810eed50>] ? compat_filldir+0x0/0xf0
Dec  5 17:08:02 Host kernel: [<ffffffff810eed50>] ? compat_filldir+0x0/0xf0
Dec  5 17:08:02 Host kernel: [<ffffffff810fe8d5>] proc_root_readdir+0x45/0x60
Dec  5 17:08:02 Host kernel: [<ffffffff810c0d73>] vfs_readdir+0xb3/0xd0
Dec  5 17:08:02 Host kernel: [<ffffffff810f0a33>] compat_sys_getdents+0x83/0xe0
Dec  5 17:08:02 Host kernel: [<ffffffff81029cc5>] sysenter_dispatch+0x7/0x2b
Dec  5 17:08:02 Host kernel: Code: 05 8e 7a 83 00 48 8b 00 48 89 c3 8b 45 cc 48 85 db 48 89 45 c0 75 14 eb 5a 66 2e 0f 1f 84 00 00 00 00 00 48 8b 1b 48 85 db 74 48 <48> 8b 03 4c 8d 63 e8 0f 18 08 45 39 6c 24 30 75 e7 4d 39 74 24
Dec  5 17:08:03 Host kernel: RIP  [<ffffffff810c5228>] __d_lookup+0x88/0x150
Dec  5 17:08:03 Host kernel: RSP <ffff88105c707d18>
Dec  5 17:08:03 Host kernel: CR2: 000000007f87312b
Dec  5 17:08:03 Host kernel: ---[ end trace cd9ae1febb12caad ]---

NB: the same behavior occurs with the pkp_drv module unloaded.

Additional information:

Code:

# echo "Code: 05 8e 7a 83 00 48 8b 00 48 89 c3 8b 45 cc 48 85 db 48 89 45 c0 75 14 eb 5a 66 2e 0f 1f 84 00 00 00 00 00 48 8b 1b 48 85 db 74 48 <48> 8b 03 4c 8d 63 e8 0f 18 08 45 39 6c 24 30 75 e7 4d 39 74 24" | ./scripts/decodecode
All code
========
   0:   05 8e 7a 83 00          add    $0x837a8e,%eax
   5:   48 8b 00                mov    (%rax),%rax
   8:   48 89 c3                mov    %rax,%rbx
   b:   8b 45 cc                mov    -0x34(%rbp),%eax
   e:   48 85 db                test   %rbx,%rbx
  11:   48 89 45 c0             mov    %rax,-0x40(%rbp)
  15:   75 14                   jne    0x2b
  17:   eb 5a                   jmp    0x73
  19:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  20:   00 00 00
  23:   48 8b 1b                mov    (%rbx),%rbx
  26:   48 85 db                test   %rbx,%rbx
  29:   74 48                   je     0x73
  2b:*  48 8b 03                mov    (%rbx),%rax     <-- trapping instruction
  2e:   4c 8d 63 e8             lea    -0x18(%rbx),%r12
  32:   0f 18 08                prefetcht0 (%rax)
  35:   45 39 6c 24 30          cmp    %r13d,0x30(%r12)
  3a:   75 e7                   jne    0x23
  3c:   4d                      rex.WRB
  3d:   39                      .byte 0x39
  3e:   74 24                   je     0x64

Code starting with the faulting instruction
===========================================
   0:   48 8b 03                mov    (%rbx),%rax
   3:   4c 8d 63 e8             lea    -0x18(%rbx),%r12
   7:   0f 18 08                prefetcht0 (%rax)
   a:   45 39 6c 24 30          cmp    %r13d,0x30(%r12)
   f:   75 e7                   jne    0xfffffffffffffff8
  11:   4d                      rex.WRB
  12:   39                      .byte 0x39
  13:   74 24                   je     0x39
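
The trapping instruction `mov (%rbx),%rax` reads memory at the address held in RBX, and in both oopses RBX carries the same value as CR2 (the faulting address). A small Python sketch of that cross-check follows. The helper names are hypothetical, and the kernel/user split at 0xffff800000000000 assumes x86_64 with 48-bit virtual addresses.

```python
import re

# Start of the upper (kernel) canonical half on x86_64 with 48-bit virtual addresses.
KERNEL_HALF_START = 0xFFFF800000000000


def parse_registers(oops_text):
    """Collect general-purpose registers printed as 'RAX: <16 hex digits>'."""
    regs = {}
    for name, value in re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b", oops_text):
        regs[name] = int(value, 16)
    return regs


def fault_address(oops_text):
    """Return the CR2 value (faulting address) from the oops, or None if absent."""
    m = re.search(r"\bCR2: ([0-9a-f]{16})\b", oops_text)
    return int(m.group(1), 16) if m else None


def registers_matching_cr2(oops_text):
    """Names of the registers whose value equals the faulting address."""
    cr2 = fault_address(oops_text)
    return sorted(n for n, v in parse_registers(oops_text).items() if v == cr2)


def is_kernel_address(addr):
    """True if addr lies in the kernel half of the address space."""
    return addr >= KERNEL_HALF_START
```

Fed the register dump above, `registers_matching_cr2` should point at RBX, and `is_kernel_address(0x7f87312b)` is false: the list pointer being followed is not a plausible kernel address, which suggests a corrupted hash-chain entry rather than a bug in the lookup code itself.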

Code:

# make tags
# vim -t __d_lookup

./fs/dcache.c
struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
{
        unsigned int len = name->len;
        unsigned int hash = name->hash;
        const unsigned char *str = name->name;
        struct hlist_head *head = d_hash(parent,hash);
        struct dentry *found = NULL;
        struct hlist_node *node;
        struct dentry *dentry;

        rcu_read_lock();

        hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
                struct qstr *qstr;

                if (dentry->d_name.hash != hash)
                        continue;
                if (dentry->d_parent != parent)
                        continue;

                spin_lock(&dentry->d_lock);

                /*
                 * Recheck the dentry after taking the lock - d_move may have
                 * changed things.  Don't bother checking the hash because we're
                 * about to compare the whole name anyway.
                 */
                if (dentry->d_parent != parent)
                        goto next;

                /* non-existing due to RCU? */
                if (d_unhashed(dentry))
                        goto next;

                /*
                 * It is safe to compare names since d_move() cannot
                 * change the qstr (protected by d_lock).
                 */
                qstr = &dentry->d_name;
                if (parent->d_op && parent->d_op->d_compare) {
                        if (parent->d_op->d_compare(parent, qstr, name))
                                goto next;
                } else {
                        if (qstr->len != len)
                                goto next;
                        if (memcmp(qstr->name, str, len))
                                goto next;
                }

                atomic_inc(&dentry->d_count);
                found = dentry;
                spin_unlock(&dentry->d_lock);
                break;
next:
                spin_unlock(&dentry->d_lock);
        }
        rcu_read_unlock();

        return found;
}
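
The decoded instructions can be lined up against this loop. `lea -0x18(%rbx),%r12` looks like the step that turns the hash-list node back into its `struct dentry`, and `cmp %r13d,0x30(%r12)` looks like the `dentry->d_name.hash != hash` test. The Python sketch below checks those two offsets with ctypes. The field layout is an assumption: it is the leading part of `struct dentry` as I recall it from 2.6.35-era headers, with a 4-byte spinlock (no lock debugging), so verify it against include/linux/dcache.h in your own tree.

```python
import ctypes as C


class HlistNode(C.Structure):
    # struct hlist_node { next, pprev }: two 64-bit pointers
    _fields_ = [("next", C.c_uint64), ("pprev", C.c_uint64)]


class Qstr(C.Structure):
    # struct qstr { hash, len, name }
    _fields_ = [("hash", C.c_uint32), ("len", C.c_uint32), ("name", C.c_uint64)]


class Dentry(C.Structure):
    # ASSUMED layout: leading fields of struct dentry only, pointers modelled as c_uint64.
    _fields_ = [
        ("d_count", C.c_int32),
        ("d_flags", C.c_uint32),
        ("d_lock", C.c_uint32),
        ("d_mounted", C.c_int32),
        ("d_inode", C.c_uint64),
        ("d_hash", HlistNode),
        ("d_parent", C.c_uint64),
        ("d_name", Qstr),
    ]


def container_of(node_addr, struct, member):
    """Address of the enclosing struct, given the address of one of its members."""
    return node_addr - getattr(struct, member).offset
```

Under this layout `Dentry.d_hash.offset` is 0x18 and `d_name.hash` sits at 0x30, matching the two constants in the disassembly. That would place the fault on the very first read of the list node at the top of `hlist_for_each_entry_rcu`, before any dentry field is touched.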

business_kid · 12-26-2012, 06:24 AM

Have you tried the latest kernel? There are a LOT of options now whose help text mentions large SMP systems.

akiuni · 12-26-2012, 07:20 AM

Hi, thank you for your prompt answer

Well, I've tried the 3.2.31 kernel, which is quite recent... (By the way, I made a mistake in my initial post: it's not 3.2.34 but 3.2.31.) I would prefer to stay on the 3.2 kernels because that's the series supported in Debian squeeze 6.0... Do you think there is a big gap between 3.2.31 and 3.2.35 (the latest 3.2 kernel)?

onebuck · 12-26-2012, 07:25 AM

Moved: This thread is more suitable in <Linux-General> and has been moved accordingly to help your thread/question get the exposure it deserves.

syg00 · 12-27-2012, 02:29 AM

Personally I wouldn't have thought a kernel oops query from someone prepared to attempt to decode the issue deserves to be plonked in "general".

If you really think you have an issue in the memory buddy system, open a bug against it. Probably too intricate for many of us here to help much.

akiuni · 12-27-2012, 03:13 AM

Well my knowledge stops where the code starts...

I can't open a bug because the oops seems to be fixed in the 2.6.39 kernel, though at the cost of poor performance, I suppose. Also, I may have missed an important option in the .config, and that was the goal of my post...

I will try to pin down the root cause of the bad performance more precisely, because my only clue today is "SMP+SLUB+softirqs". Depending on what I find, I'll be back on LQ to post the answer!

Thank you
Julien

business_kid · 12-27-2012, 03:15 AM

A problem with SLUB on a multi-socket SMP box might well have gone to the 'kernel' mailing list, but whatever.

@akiuni: I see this as a potential kernel bug. As such, the developers will want to make sure it hasn't already been fixed. I compiled 3.7.1 recently, and that was where I noticed comments in the help about large SMP boxes. So I would suggest 3.7.x, and if that doesn't fix it, file a kernel bug and attach your .config.

EDIT: Kernel options like MAXSMP, SCHED_MC, CROSS_MEMORY_ATTACH and many others become significant. One problem I see is that as the number of CPU cores rises, the hardware falls behind (e.g. number of memory pages), and ownership restrictions could get very complicated. Also enable SLUB_DEBUG. Never mind what people say about the kernel. If you are seeing bad behaviour, file a bug.
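
To see quickly how a given .config stands on the options named above, something along these lines could be used. It is only a sketch in Python: `config_values` is a made-up helper, and it assumes the standard .config syntax of `CONFIG_FOO=value` for set options and `# CONFIG_FOO is not set` for disabled ones.

```python
import re


def config_values(config_text, options):
    """Map each option (without the CONFIG_ prefix) to its value.

    Returns the literal value ('y', 'm', a number, ...) when the option is set,
    'n' when it appears as '# CONFIG_X is not set', and None when it is absent,
    which usually means this kernel version does not know the option.
    """
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        m = re.match(r"CONFIG_(\w+)=(.*)$", line)
        if m:
            found[m.group(1)] = m.group(2)
            continue
        m = re.match(r"# CONFIG_(\w+) is not set$", line)
        if m:
            found[m.group(1)] = "n"
    return {opt: found.get(opt) for opt in options}
```

For example, `config_values(open(".config").read(), ["MAXSMP", "SCHED_MC", "CROSS_MEMORY_ATTACH", "SLUB_DEBUG"])` returns one entry per option. A `None` result is worth noting on the older kernels in this thread, since not every option exists in every release.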