/var file system suddenly utilizes 100%
Hi,
I had an issue with a Red Hat server where the /var file system suddenly hit 100% utilization. It was picked up by an alert and was short-lived; the server was fine again a few minutes later. There were no entries in /var/log/messages.
I suspect a large amount of data was written to one of the directories under /var shortly beforehand, but I could not confirm this, and it may be something else entirely.
How can I trace what was going on at that particular time?
If /var suddenly filled up and then emptied, my guess is that something humongous was written to /var/tmp. Perhaps the process crashed when it ran out of space, or it moved the data elsewhere. Not much else writes to /var/tmp.
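One way to check for remnants of such a write is a quick find pass. A minimal sketch, assuming GNU find; the size and age thresholds are examples, not values from this thread:

```shell
# Look for large files left behind under /var/tmp (size threshold is an
# example; adjust to taste). -xdev keeps find on this one filesystem.
find /var/tmp -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null

# Anything under /var modified in the last 60 minutes and over 10 MB.
# Permission errors are expected when run as non-root, hence || true.
find /var -xdev -type f -mmin -60 -size +10M 2>/dev/null || true
```

If the writer cleaned up after itself this will come back empty, which is itself a hint that you need to catch it while it is happening.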
Indeed, the timestamp on the "check_log_messages._var_log_messages.messagelog" file matches the exact time the issue occurred. Could this be the cause? I have no idea what this file is for.
I compared with another server: all the files in /var/tmp/check_logfiles/ are owned by nagios and total only 8.0K, so I do not think this is the cause. I have googled around but haven't found anything yet.
Of course it's not there anymore; your space issue resolved itself when usage went from 100% back to normal, so whatever wrote the data has also erased it. The time to check is while usage is at 100%.
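Since you have to catch it live, one option is a small one-shot watcher run from cron every minute while you chase this. A hedged sketch; the mount point and threshold are assumptions, and it uses only df, du, and lsof:

```shell
#!/bin/sh
# One-shot disk-usage check, meant to be run from cron every minute.
# Mount point and threshold default to example values.
MOUNT="${1:-/var}"
THRESHOLD="${2:-90}"

# df -P guarantees one output line per filesystem; field 5 is "Use%".
usage=$(df -P "$MOUNT" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
echo "usage of $MOUNT: ${usage}%"

if [ "$usage" -ge "$THRESHOLD" ]; then
    # Snapshot the largest top-level consumers under the mount point.
    du -xsk "$MOUNT"/* 2>/dev/null | sort -rn | head -10
    # Files that are open but already unlinked still hold space;
    # lsof marks them "(deleted)".
    lsof +L1 "$MOUNT" 2>/dev/null || true   # lsof may not be installed
fi
```

Redirect its output to a log somewhere outside /var so the snapshot survives even if /var is completely full.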
The spike was at 12:30, but nothing in any log shows what was going on, which makes it hard to trace at the OS level.
There were no cron jobs running at the time, and logrotate was fine. It could be the application, but I would like to rule things out at the OS level first before asking the application team to investigate further.
Well, one way is to turn on process accounting. That gives you a log of every process that ran and when it terminated. If a process aborted because the disk was full, the space would be freed when it exited, and I believe the accounting entry records the exit status. This is not perfectly precise, as it will not identify the name of the file that failed to write.
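On Red Hat, process accounting comes from the psacct package. A sketch of the setup, assuming a systemd-based release; it must be run as root, and package/service names can differ between releases:

```shell
# Install and start BSD process accounting (RHEL package name: psacct).
yum install -y psacct
systemctl enable --now psacct     # older releases: service psacct start

# After the next spike, list the processes that ran around that time.
# lastcomm shows command name, flags, user, tty, CPU time, start time;
# the flag column marks abnormal exits (e.g. X = killed by a signal).
lastcomm | head -50

# sa summarizes the accounting file per command.
sa | head -20
```

Accounting data lands in /var/account/pacct, which itself lives under /var, so keep an eye on its size or rotate it while the investigation runs.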