Deadlock in NFS4
I have a strange problem. I am connecting servers over NFS4: the shared directories are exported from servers running Debian 4, and the machine that reads from them runs Debian 5.0.3. The problem is that one of the exporting servers suddenly stops responding: you cannot list its share from the Debian 5 server, df hangs, and the web application that uses the share stops answering any request that touches the shared directory, because it is blocked. The load on the server then keeps increasing until the server can no longer respond (load over 90). I have found many entries in the syslog that refer to this, like:
ma25555 kernel: [1200285.732919] nfs: server 10.xxx.xxx.xxx not responding, still trying
Dec 31 08:16:33 ma25555 kernel: [1200289.815378] INFO: task java:9702 blocked for more than 120 seconds.
Dec 31 08:16:33 ma25555 kernel: [1200289.835249] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 31 08:16:33 ma25555 kernel: [1200289.857500] java D 0000000000000000 0 9702 1
Dec 31 08:16:33 ma25555 kernel: [1200289.871244] ffff81039d9e3948 0000000000000086 0000000000000000 0000000000000292
Dec 31 08:16:33 ma25555 kernel: [1200289.891554] ffff81032f943670 ffff81083ccc7470 ffff81032f9438f8 000000010000000c
Dec 31 08:16:33 ma25555 kernel: [1200289.908401] ffff8108395bb240 0000000000000000 00000000ffffffff 0000000000000000
Dec 31 08:16:33 ma25555 kernel: [1200289.924310] Call Trace:
Dec 31 08:16:33 ma25555 kernel: [1200290.011013] [<ffffffffa021f3ca>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
Dec 31 08:16:33 ma25555 kernel: [1200290.028766] [<ffffffffa021f3f4>] :sunrpc:rpc_wait_bit_killable+0x2a/0x31
Dec 31 08:16:33 ma25555 kernel: [1200290.048191] [<ffffffff804293f2>] __wait_on_bit+0x40/0x6e
Dec 31 08:16:33 ma25555 kernel: [1200290.068537] [<ffffffffa021f3ca>] :sunrpc:rpc_wait_bit_killable+0x0/0x31
Dec 31 08:16:33 ma25555 kernel: [1200290.089700] [<ffffffff8042948c>] out_of_line_wait_on_bit+0x6c/0x78
Dec 31 08:16:33 ma25555 kernel: [1200290.111979] [<ffffffff8024622f>] wake_bit_function+0x0/0x23
Dec 31 08:16:33 ma25555 kernel: [1200290.120914] [<ffffffffa021c2e9>] :sunrpc:xprt_connect+0x89/0x123
Dec 31 08:16:33 ma25555 kernel: [1200290.139567] [<ffffffffa021f98f>] :sunrpc:__rpc_execute+0xe6/0x223
Dec 31 08:16:33 ma25555 kernel: [1200290.157657] [<ffffffffa0219bcb>] :sunrpc:rpc_run_task+0x4f/0x56
Dec 31 08:16:34 ma25555 kernel: [1200290.171380] [<ffffffffa0219c67>] :sunrpc:rpc_call_sync+0x3e/0x5b
Dec 31 08:16:34 ma25555 kernel: [1200290.397448] [<ffffffffa02c3ed2>] :nfs:nfs4_proc_access+0x142/0x1c0
Dec 31 08:16:34 ma25555 kernel: [1200290.415733] [<ffffffff803b656c>] __alloc_skb+0x7f/0x12d
Dec 31 08:16:34 ma25555 kernel: [1200290.431886] [<ffffffff8031a31d>] __next_cpu+0x19/0x26
Dec 31 08:16:34 ma25555 kernel: [1200290.439891] [<ffffffff802295fc>] find_busiest_group+0x254/0x6dc
Dec 31 08:16:34 ma25555 kernel: [1200290.465581] [<ffffffff8020ab0d>] __switch_to+0x34c/0x35e
Dec 31 08:16:34 ma25555 kernel: [1200290.473941] [<ffffffffa02ae1e8>] :nfs:nfs_do_access+0x163/0x30c
Dec 31 08:16:34 ma25555 kernel: [1200290.491637] [<ffffffffa02ae481>] :nfs:nfs_permission+0xf0/0x15f
Dec 31 08:16:34 ma25555 kernel: [1200290.513582] [<ffffffff802a2227>] permission+0xb5/0x118
Dec 31 08:16:34 ma25555 kernel: [1200290.529537] [<ffffffff802a37af>] __link_path_walk+0x150/0xd05
Dec 31 08:16:34 ma25555 kernel: [1200290.542843] [<ffffffff802a43aa>] path_walk+0x46/0x8b
Dec 31 08:16:34 ma25555 kernel: [1200290.810421] [<ffffffff802a46d6>] do_path_lookup+0x158/0x1cf
Dec 31 08:16:34 ma25555 kernel: [1200290.823349] [<ffffffff802a34e1>] getname+0x140/0x1a7
Dec 31 08:16:34 ma25555 kernel: [1200290.971416] [<ffffffff802a5045>] __user_walk_fd+0x37/0x4c
Dec 31 08:16:34 ma25555 kernel: [1200290.985607] [<ffffffff8029e15d>] vfs_stat_fd+0x1b/0x4a
Dec 31 08:16:34 ma25555 kernel: [1200290.995950] [<ffffffff80221fbc>] do_page_fault+0x5d8/0x9c8
Dec 31 08:16:34 ma25555 kernel: [1200291.015844] [<ffffffff8029e1e8>] sys_newstat+0x19/0x31
Dec 31 08:16:34 ma25555 kernel: [1200291.028568] [<ffffffff8031e0a7>] __up_read+0x13/0x8a
Dec 31 08:16:34 ma25555 kernel: [1200291.040526] [<ffffffff8042a6a9>] error_exit+0x0/0x60
Dec 31 08:16:34 ma25555 kernel: [1200291.092521] [<ffffffff8020beca>] system_call_after_swapgs+0x8a/0x8f
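To see how often this happens and which processes get stuck, the syslog can be tallied with a short script. This is only a minimal sketch: the two regular expressions are copied from the message formats shown above, the function name `summarize` is made up for illustration, and the sample lines are the ones from this post.

```python
import re
from collections import Counter

# Patterns taken from the two kernel message formats quoted in this post.
NOT_RESPONDING = re.compile(r"nfs: server (\S+) not responding")
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def summarize(lines):
    """Count 'not responding' messages per server and list hung tasks.

    Returns (Counter of server -> count, list of (task name, pid, seconds)).
    """
    servers = Counter()
    hung = []
    for line in lines:
        match = NOT_RESPONDING.search(line)
        if match:
            servers[match.group(1)] += 1
            continue
        match = HUNG_TASK.search(line)
        if match:
            hung.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return servers, hung


# Sample input: two lines copied from the log excerpt in this post.
sample = [
    "ma25555 kernel: [1200285.732919] nfs: server 10.xxx.xxx.xxx not responding, still trying",
    "Dec 31 08:16:33 ma25555 kernel: [1200289.815378] INFO: task java:9702 blocked for more than 120 seconds.",
]
servers, hung = summarize(sample)
for server, count in servers.most_common():
    print(f"{server}: {count} 'not responding' message(s)")
for name, pid, seconds in hung:
    print(f"task {name} (pid {pid}) blocked for more than {seconds}s")
```

To run it against a real log, pass an open file object to `summarize` instead of the sample list; any iterable of lines works.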
I have tested the connection between the two servers with ping for a full day and everything was OK (zero packets lost).
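Ping only exercises ICMP, so it does not show whether the NFS service itself accepts connections, and the `xprt_connect` frame in the trace may mean the blocked task was waiting for the transport to connect. A complementary check could probe the NFS port directly. The sketch below assumes the exports are served over TCP on the standard NFS port 2049; the function name and the timeout value are illustrative, not part of any NFS tool.

```python
import socket


def tcp_port_open(host, port=2049, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout.

    Port 2049 is the standard NFS port (assumed here); any connection error
    or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Calling `tcp_port_open` with the address of the exporting server every few seconds and logging the result with a timestamp would show whether the port stops answering at the moment the share hangs, even while ping still succeeds.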
There are 3 other servers that are running Debian 4 and are working fine.
Any help would be appreciated.
Last edited by mam2; 01-04-2010 at 01:58 AM.