[SOLVED] History of processes causing I/O wait

lpwevers · 02-28-2017, 04:27 AM

Hi,

I was wondering if it is possible to see what command caused a high I/O wait on a system. In this case I have a system where yesterday something went berserk and basically locked the system for 20 minutes or so, due to 70% I/O wait. That much I could see from sar.

However, I'm trying to figure out (if possible) if there's also something that will tell me what process caused this. I tried tools like pidstat and iostat, but they do not seem to have the history like sar does. But sar does not seem to track the processes.

If it's not possible I'll probably end up scripting something myself, but if it's possible using standard tools, that'd be great.
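
For the "scripting something myself" route, a minimal sketch in Python, assuming a Linux kernel that exposes per-process counters in /proc/<pid>/io (reading other users' processes generally needs root). The function names and the 60-second interval are made up for illustration; this is not how sar or pidstat work internally.

```python
import os
import time

def parse_io(text):
    """Turn the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters

def snapshot(proc_root="/proc"):
    """Map pid -> (command, read_bytes + write_bytes) for readable processes."""
    snap = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "io")) as f:
                io = parse_io(f.read())
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited, or we lack permission to read it
        total = io.get("read_bytes", 0) + io.get("write_bytes", 0)
        snap[int(entry)] = (comm, total)
    return snap

def busiest(before, after, limit=5):
    """Return (bytes_moved, pid, command) for the biggest movers, largest first."""
    deltas = []
    for pid, (comm, total) in after.items():
        if pid in before:
            moved = total - before[pid][1]
            if moved > 0:  # also skips reused pids whose counters went backwards
                deltas.append((moved, pid, comm))
    return sorted(deltas, reverse=True)[:limit]

def log_samples(count, interval=60, proc_root="/proc"):
    """Print the busiest processes `count` times, `interval` seconds apart."""
    previous = snapshot(proc_root)
    for _ in range(count):
        time.sleep(interval)
        current = snapshot(proc_root)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for moved, pid, comm in busiest(previous, current):
            print(stamp, pid, comm, moved, flush=True)
        previous = current
```

Redirecting the output of something like log_samples(1440) to a file would give roughly a day of per-process history to set next to the sar data. It only sees processes that are alive at both samples, so short-lived jobs can slip through.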

smallpond · 02-28-2017, 06:37 AM

Use "top" to see which processes are in wait. Writes are done in the background by the kernel flush tasks. Adjusting VM writeback variables may improve performance.
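
In top, the processes "in wait" are the ones shown in state D (uninterruptible sleep, usually blocked on I/O). If you would rather log them than watch them, here is a rough sketch that reads the same state from /proc/<pid>/stat. It assumes Linux, and the function names are only for illustration.

```python
import os

def parse_stat_line(line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def blocked_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were looking, or unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

Calling blocked_processes() every few seconds and writing the result out with a timestamp gives a crude record of who was stuck waiting. A process in D state is waiting on something, which is not proof that it caused the load.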

http://www.thegeekstuff.com/2009/10/...adable-format/

lpwevers · 02-28-2017, 06:40 AM

Thanks. I'll try that for future use. Build some monitoring on that.

Still, I'm hoping there's something that'll tell me from the sar (and maybe other) data from yesterday....

Habitual · 02-28-2017, 06:43 AM

collectl perhaps?

lpwevers · 02-28-2017, 06:55 AM

Thanks. collectl certainly looks promising. If not for yesterday's data, then for capturing future data in batch. I'll definitely check it out.

syg00 · 02-28-2017, 07:19 AM

iowait is a very misunderstood metric. Especially to base claims of a locked system on. Have a read of this - especially the last 2 examples. It's old, but pretty good - all I could find quickly.

collectl is good, but you'll need to set up process collection. Might help, might not.
Keep an eye on interrupt counts in case you have a driver and/or hardware problem you don't know about.
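
One way to keep an eye on interrupt counts is to diff two reads of /proc/interrupts. The sketch below assumes the usual Linux layout (a header row of CPU names, then one row per interrupt with one count per CPU); the exact columns vary by architecture and kernel, so treat the parsing as an approximation. The function names are hypothetical.

```python
def parse_interrupts(text):
    """Map each interrupt label to its count summed across all CPUs."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        total = 0
        for field in rest.split()[:cpus]:
            if not field.isdigit():
                break  # reached the description columns
            total += int(field)
        totals[label.strip()] = total
    return totals

def interrupt_deltas(before, after):
    """Per-label increase between two parsed snapshots, largest first."""
    grown = [(after[label] - before.get(label, 0), label) for label in after]
    return sorted((item for item in grown if item[0] > 0), reverse=True)

def read_interrupts(path="/proc/interrupts"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_interrupts(f.read())
```

Two read_interrupts() calls a few seconds apart, passed to interrupt_deltas(), show which lines are climbing fastest. A disk controller or NIC line that races ahead of the others is worth a closer look.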

lpwevers · 02-28-2017, 07:43 AM

Thanks, that explains a lot about iowait that I didn't know. Then again, this system runs a huge database and we suspect that one or more users were running analytical queries at the time the system became nearly unresponsive. Therefore I was hoping to find out what process was causing this. If it was the database server process, we could at least narrow down the culprit a bit.

I guess I'll set up monitoring using collectl; seems like this may come in very handy in the future anyway.

Thanks for all the advice.

syg00 · 02-28-2017, 07:51 AM

In that case you need all the monitoring data you can get. It's possible you are near the edge of the cliff all the time and you got pushed over at that time. Might be pretty hard to track the real reason rather than just chasing symptoms.

Good luck.

sundialsvcs · 02-28-2017, 08:01 AM

It is most probable that you are actually experiencing thrashing.

You had several users running "analytical queries." How much memory does each of these running tasks require? And, does the machine possess enough physical RAM to support all of them at once?

If not, your virtual-memory paging subsystem goes nuts. The disk drive turns into an out-of-balance washing machine on the "spin" cycle. Nothing gets done because everyone is experiencing constant page faults. The "swap in/out count" goes by in a blur. And, since file I/O and swap I/O take place (usually) to the same device(s), file I/O is severely impacted as well.
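
To put a number on that "swap in/out count", Linux keeps cumulative pswpin and pswpout page counters in /proc/vmstat (sar -W reports the same thing per second, if sysstat was collecting it). A small sketch that turns two reads into a rate; the function names are just for illustration, and what counts as "too high" is left to your own baseline.

```python
def parse_vmstat(text):
    """Turn the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def swap_rates(before, after, seconds):
    """Pages swapped in and out per second between two parsed snapshots."""
    return (
        (after["pswpin"] - before["pswpin"]) / seconds,
        (after["pswpout"] - before["pswpout"]) / seconds,
    )

def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Take one read_vmstat(), sleep for a known number of seconds, take another, and hand both to swap_rates(). Sustained nonzero rates in both directions while the queries run would fit the thrashing picture; near-zero rates would point elsewhere.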

lpwevers · 02-28-2017, 08:38 AM

Thanks for that. Indeed it looks like the machine is also showing increased memory usage and swap activity. I guess it's the combination of those that causes the issues. I'll see if I can increase the amount of memory in the machine.

Habitual · 02-28-2017, 09:15 AM

Mark Seger will show up...

sundialsvcs · 02-28-2017, 11:35 AM

"Thrashing" is an extreme situation. If you plotted the completion-time of a process on a graph, for a while the slowdown will be more-or-less linear. But then, it will "hit the wall." The completion-time graph has an elbow shape: when the thrash point is reached, time goes up, not linearly, but exponentially.

In the early days, when disk drives were about the size of a washing machine, they'd start vibrating so hard that we said they were in "Maytag® Mode." One of them actually vibrated its way to the edge of the raised-floor and fell off. (To the obvious detriment of the entire mechanism ... )

chrism01 · 03-01-2017, 02:31 AM

For past issues, the general system logs may have some clues, especially if you know what time this happened.
Some DBs keep a log of each process/query run inside the DB. Again if you know the time it happened (roughly) you may be able to determine the query/queries responsible.

Of course, as per syg00, it may simply be that the machine is gradually getting used more and more and you are hitting that point now...

Try to activate query logging and possibly some sort of performance graphing for ease of observation, rather than having to check logs directly.

lpwevers · 03-01-2017, 07:45 AM

Thanks again for all the help. In the meantime I've managed to set up more advanced monitoring using, for instance, collectl, and we managed to catch the culprit. We traced it down to a certain user who had scheduled a job in the system to generate some reports overnight. Needless to say, his query was less than optimally written.

As he promised to behave next time and test better before unleashing his jobs we've decided to let him live (for now) ;-)

And yes, the system is at its limits, so management will have to pay for more RAM eventually. But that's another story.

sundialsvcs · 03-01-2017, 08:36 AM

Quote:

Originally Posted by lpwevers

And yes, the system is at its limits, so management will have to pay for more RAM eventually. But that's another story.

I suggest that you should simply – and, promptly – make the case to management that maximizing the amount of RAM in these systems (to the extent that the motherboards will support!) will have a direct and immediate business impact. Furthermore, "chips are cheap."

You should have a qualified engineer look at exactly how the memory-cards are now deployed in these machines and construct a plan to optimize the amount of RAM that is available in the various systems with an efficient outlay of money. In any case, it won't be much money, everything will run faster across-the-board, and the bottom-line business impact is quite obviously there.

Presumably, in order to prosecute your business, you need to "run those queries." Trying to do it without enough resources is: "penny-wise and pound-foolish." This case can very easily be made, and you can quote me on that.