lsof: WARNING: can't stat() nfs file system
Hi guys,
On some servers I randomly get NFS issues. This thread is not about resolving those issues, but about getting an lsof command that will not hang. Whenever I run the following command on an affected system, it hangs: Code:
nagios@myhost:~$ sudo lsof -u appuser -b
I need to run this command in my monitoring script to list the files open by a user. On affected systems, when the NFS issue occurs, the command hangs and leaves orphaned processes, which increases system load. While the issue is present, we can accumulate 288 orphaned processes in one day, as the plugin runs every 5 minutes. Your help will be kindly appreciated! Thanks, |
I think you need to supervise lsof, and let the supervision thread kill -9 lsof after some time.
Do you have the timeout command? |
Thanks for your comment!
Yes, I have a timeout value, but it hangs forever. The other thing is that I am running the script as the nagios user, so I can't kill it after some time because of user permissions. OK, I could add kill to sudo for nagios, but this is not a solution, as it requires sudoers modifications on thousands of hosts. I would really prefer a workaround in the script I am responsible for: validate that the NFS share is working properly, then, if yes, start the lsof command, otherwise print an error and exit 0. An interesting thing I found is that in kern.log I see messages regarding NFS: Code:
~# dmesg -T|tail -2 |
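The pre-check described above (validate the NFS share first, run lsof only if it answers) could be sketched roughly as follows. This is only an illustration under assumptions: the names `PROBE_DEADLINE`, `list_nfs_mounts`, `probe_path` and `nfs_mounts_ok` are made up for this example, and it assumes bash and GNU coreutils. The probe runs `stat` in the background as the calling user, so the script can kill it without sudo. A probe stuck in uninterruptible sleep may still linger until the mount recovers, but the script itself does not wait for it.

```shell
#!/bin/bash
# Sketch only: all function and variable names here are hypothetical.

PROBE_DEADLINE=${PROBE_DEADLINE:-5}   # seconds to wait for each mount point

# Print the mount points of all NFS file systems in a mounts table.
# Note: /proc/mounts encodes spaces in paths as \040; not handled here.
list_nfs_mounts() {
    awk '$3 ~ /^nfs/ { print $2 }' "${1:-/proc/mounts}"
}

# stat a path in the background and poll until it finishes or the
# deadline passes, so a stuck mount cannot block this script.
probe_path() {
    local path=$1 deadline=${2:-$PROBE_DEADLINE} pid i
    stat -- "$path" >/dev/null 2>&1 &
    pid=$!
    for ((i = 0; i < deadline * 10; i++)); do
        if ! kill -0 "$pid" 2>/dev/null; then
            wait "$pid"              # probe finished: return its status
            return $?
        fi
        sleep 0.1
    done
    kill -9 "$pid" 2>/dev/null       # best effort; do not wait for it
    return 124
}

# Return 0 only if every NFS mount point answered within the deadline.
nfs_mounts_ok() {
    local mp
    while read -r mp; do
        if ! probe_path "$mp"; then
            echo "NFS mount not responding: $mp" >&2
            return 1
        fi
    done < <(list_nfs_mounts "$@")
    return 0
}
```

In the monitoring script this would be used as a guard, for example `nfs_mounts_ok || { echo "NFS not responding, skipping lsof"; exit 0; }` before the lsof call.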
I meant the timeout command
Code:
man timeout
Code:
timeout --signal=9 55s lsof ... |
Thanks!
It looks like this command works from the CLI, which is great! Sadly, when I put this code into the script, it hangs forever ;/ Code:
nagios@host002:~$ timeout --signal=9 5s sudo lsof -u user1 2>/dev/null|wc -l |
Wrong order: first sudo then timeout lsof!
Then the kill -9 is done with root rights. Code:
sudo timeout --signal=9 5s lsof -u user1 2>/dev/null|wc -l |
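One thing to watch when this goes into a script: the pipe to `wc -l` hides the exit status of `timeout`, so the script cannot tell a real count from a killed lsof. A minimal sketch of how that could be handled in bash is below. `count_lines` is a hypothetical helper name, and the return code 2 is an arbitrary choice for this example. With GNU timeout, a command killed with `--signal=9` is normally reported as status 137 (128+9), and 124 is the usual timeout status for other signals, so both are treated as a timeout here.

```shell
#!/bin/bash
# Sketch only: count_lines is a hypothetical helper, not part of any tool.

# Run the given command, count its output lines, and detect a timeout.
# Prints the line count and returns 0, or returns 2 if timeout fired.
count_lines() {
    local count rc
    # The subshell exits with the status of the first pipeline stage,
    # while its stdout carries the wc -l result.
    count=$("$@" 2>/dev/null | wc -l; exit "${PIPESTATUS[0]}")
    rc=$?
    case $rc in
        124|137)
            echo "command timed out" >&2
            return 2
            ;;
    esac
    echo "$count"
    return 0
}
```

It would be called with sudo outermost, as in the post above: `count_lines sudo timeout --signal=9 5s lsof -u user1`.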
None of those solutions work when the server has the NFS issue. lsof constantly hangs.
To resolve it, I wrote my own lsof, based on the /proc/PID/(smaps|fd) entries. Taking this into account, we can consider the issue resolved. |
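The poster's script is not shown, so the following is only a guess at what a /proc-based replacement might look like, not the actual solution. It walks /proc/PID/fd and reads each symlink with `readlink`, which only reads the link text and does not stat the target file. The function name `list_open_files_for_uid` and the optional second argument (an alternative proc root, handy for testing) are assumptions of this sketch. It also assumes that the owner of each /proc/PID directory matches the process's effective UID, which is the usual case, and that the caller has permission to read other users' fd directories (typically root).

```shell
#!/bin/bash
# Sketch only: list_open_files_for_uid is a hypothetical name.

# Print "PID<TAB>FD<TAB>TARGET" for every open descriptor of processes
# owned by the given numeric UID.
list_open_files_for_uid() {
    local uid=$1 proc_root=${2:-/proc}
    local pid_dir pid fd target
    for pid_dir in "$proc_root"/[0-9]*; do
        [ -d "$pid_dir/fd" ] || continue
        # Skip processes that belong to other users.
        [ "$(stat -c %u "$pid_dir" 2>/dev/null)" = "$uid" ] || continue
        pid=${pid_dir##*/}
        for fd in "$pid_dir"/fd/*; do
            # readlink fails if the process exited or access is denied.
            target=$(readlink "$fd" 2>/dev/null) || continue
            printf '%s\t%s\t%s\n' "$pid" "${fd##*/}" "$target"
        done
    done
}
```

A count comparable to the earlier `lsof ... | wc -l` would then be `list_open_files_for_uid "$(id -u appuser)" | wc -l`, although the numbers will not match lsof exactly, since lsof also reports memory-mapped files, working directories and similar entries.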
Well done.
More and more often I face "featurism" that puts the base function at risk, and would like to write my own "simply works" programs... |
Sometimes you have no choice. This is not the first time I was forced to code my own functions; that's life. But the good thing is that you know exactly what it does, and you can quickly implement a fix in case of other issues :)
|
Use strace to find out where it is hanging
Try:
strace lsof
ps aux | grep <pid where it is hanging> |