LinuxQuestions.org
Old 02-28-2017, 04:27 AM   #1
lpwevers
Member
 
Registered: Apr 2005
Location: The Netherlands
Distribution: SuSE, CentOS
Posts: 181

Rep: Reputation: 21
History of processes causing I/O wait


Hi,

I was wondering if it is possible to see what command caused a high I/O wait on a system. In this case I have a system where yesterday something went berserk and basically locked the system for 20 minutes or so, due to 70% I/O wait. That much I could see from sar.

However, I'm trying to figure out (if possible) whether there's also something that will tell me what process caused this. I tried tools like pidstat and iostat, but they do not seem to keep history the way sar does, and sar does not seem to track individual processes.

If it's not possible I'll probably end up scripting something myself, but if it's possible using standard tools, that'd be great.
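
If it does come to scripting, here is a minimal sketch of what such a logger could look like. It assumes a Linux kernel that exposes per-process counters in /proc/<pid>/io and that it runs as root (otherwise only your own processes are readable). The function names and the io-history.log path are made up for illustration.

```python
import os
import time


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def snapshot():
    """Return {pid: (command, bytes read, bytes written)} for readable processes."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/io") as f:
                io = parse_proc_io(f.read())
            with open(f"/proc/{entry}/comm") as f:
                name = f.read().strip()
        except OSError:
            # The process exited, or we lack permission to read its counters.
            continue
        result[int(entry)] = (name, io.get("read_bytes", 0), io.get("write_bytes", 0))
    return result


def top_io(before, after, limit=5):
    """Rank processes by bytes read plus written between two snapshots."""
    deltas = []
    for pid, (name, rd, wr) in after.items():
        if pid in before:
            _, rd0, wr0 = before[pid]
            deltas.append((rd - rd0 + wr - wr0, pid, name))
    return sorted(deltas, reverse=True)[:limit]


def log_forever(interval=60, path="io-history.log"):
    """Append the top I/O consumers to a log file once per interval."""
    before = snapshot()
    while True:
        time.sleep(interval)
        after = snapshot()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as log:
            for delta, pid, name in top_io(before, after):
                log.write(f"{stamp} pid={pid} comm={name} bytes={delta}\n")
        before = after
```

Calling log_forever() from a root shell or a service would then leave a per-minute history that can be lined up against the sar timestamps. Processes that start and exit within one interval are missed by this approach.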
 
Old 02-28-2017, 06:37 AM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,153

Rep: Reputation: 1265
Use "top" to see which processes are in wait. Writes are done in the background by the kernel flush tasks. Adjusting VM writeback variables may improve performance.

http://www.thegeekstuff.com/2009/10/...adable-format/
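
For anyone who wants the same information outside of top: processes blocked on I/O show up with state D (uninterruptible sleep), and the writeback variables live under /proc/sys/vm. A rough sketch, assuming a Linux /proc filesystem; the function names are illustrative only, and which writeback values suit a given system is not something this code decides.

```python
import os


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    command = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, command, state


def blocked_processes():
    """List (pid, command) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                pid, command, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process went away while we were reading it
        if state == "D":
            blocked.append((pid, command))
    return blocked


def writeback_settings():
    """Read the current VM writeback tunables from /proc/sys/vm."""
    names = ("dirty_ratio", "dirty_background_ratio",
             "dirty_expire_centisecs", "dirty_writeback_centisecs")
    settings = {}
    for name in names:
        try:
            with open(f"/proc/sys/vm/{name}") as f:
                settings[name] = int(f.read())
        except (OSError, ValueError):
            settings[name] = None  # tunable missing or unreadable on this kernel
    return settings
```

A process that shows up in blocked_processes() on many consecutive samples is a reasonable first suspect, though being in state D only says it is waiting, not that it issued the I/O.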
 
1 members found this post helpful.
Old 02-28-2017, 06:40 AM   #3
lpwevers
Member
 
Registered: Apr 2005
Location: The Netherlands
Distribution: SuSE, CentOS
Posts: 181

Original Poster
Rep: Reputation: 21
Thanks. I'll try that for future use. Build some monitoring on that.

Still, I'm hoping there's something that'll tell me from yesterday's sar (and maybe other) data....
 
Old 02-28-2017, 06:43 AM   #4
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
collectl perhaps?
 
1 members found this post helpful.
Old 02-28-2017, 06:55 AM   #5
lpwevers
Member
 
Registered: Apr 2005
Location: The Netherlands
Distribution: SuSE, CentOS
Posts: 181

Original Poster
Rep: Reputation: 21
Thanks. collectl certainly looks promising. If not for yesterday's data, then for capturing future data in batch. I'll definitely check it out.
 
Old 02-28-2017, 07:19 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,144

Rep: Reputation: 4124
iowait is a very misunderstood metric. Especially to base claims of a locked system on. Have a read of this - especially the last 2 examples. It's old, but pretty good - all I could find quickly.

collectl is good, but you'll need to set up process collection. Might help, might not.
Keep an eye on interrupt counts in case you have a driver and/or hardware problem you don't know about.
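
One way to keep an eye on interrupt counts is to sample /proc/interrupts twice and compare. The sketch below assumes the usual Linux layout (a header row of CPU names, then one row per interrupt source); the function names are made up for the example.

```python
def parse_interrupts(text):
    """Return {irq label: (total count across CPUs, description)}."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for field in fields[:cpu_count]:
            if not field.isdigit():
                break  # rows such as ERR carry fewer counters than CPUs
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def busiest(before, after, limit=5):
    """Rank interrupt sources by how much their totals grew between two samples."""
    growth = [
        (after[key][0] - before.get(key, (0, ""))[0], key, after[key][1])
        for key in after
    ]
    return sorted(growth, reverse=True)[:limit]


def read_interrupts():
    """Parse the live /proc/interrupts file."""
    with open("/proc/interrupts") as f:
        return parse_interrupts(f.read())
```

Taking read_interrupts() a few seconds apart and passing both results to busiest() shows which device is generating interrupts fastest. What counts as abnormal depends on the hardware, so a baseline from a quiet period helps.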
 
1 members found this post helpful.
Old 02-28-2017, 07:43 AM   #7
lpwevers
Member
 
Registered: Apr 2005
Location: The Netherlands
Distribution: SuSE, CentOS
Posts: 181

Original Poster
Rep: Reputation: 21
Thanks, that explains a lot about iowait that I didn't know. Then again, this system runs a huge database and we suspect one (or more) users of running some analytical queries at that time, which caused the system to become nearly unresponsive. Therefore I was hoping to find out what process was causing this. If it was the database server process, we could at least narrow down the culprit a bit.

I guess I'll set up monitoring using collectl; seems like this may come in very handy in the future anyway.

Thanks for all the advice.
 
Old 02-28-2017, 07:51 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,144

Rep: Reputation: 4124
In that case you need all the monitoring data you can get. It's possible you are near the edge of the cliff all the time and you got pushed over at that time. Might be pretty hard to track the real reason rather than just chasing symptoms.

Good luck.
 
Old 02-28-2017, 08:01 AM   #9
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,678
Blog Entries: 4

Rep: Reputation: 3947
It is most probable that you are actually experiencing thrashing.

You had several users running "analytical queries." How much memory does each of these running tasks require? And, does the machine possess enough physical RAM to support all of them at once?

If not, your virtual-memory paging subsystem goes nuts. The disk drive turns into an out-of-balance washing machine on the "spin" cycle. Nothing gets done because everyone is experiencing constant page faults. The "swap in/out count" goes by in a blur. And, since file I/O and swap I/O take place (usually) to the same device(s), file I/O is severely impacted as well.
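
To put a number on that "swap in/out count", the pswpin and pswpout counters in /proc/vmstat can be sampled and turned into a rate. A small sketch, assuming a Linux system with swap configured; the function names are hypothetical, and the rates are in pages, not bytes.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat ("name value" per line) into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def swap_rate(before, after, seconds):
    """Pages swapped in and out per second between two parsed samples."""
    swapped_in = (after["pswpin"] - before["pswpin"]) / seconds
    swapped_out = (after["pswpout"] - before["pswpout"]) / seconds
    return swapped_in, swapped_out


def read_vmstat():
    """Parse the live /proc/vmstat file."""
    with open("/proc/vmstat") as f:
        return parse_vmstat(f.read())


def watch(interval=5):
    """Print swap rates every interval seconds until interrupted."""
    before = read_vmstat()
    while True:
        time.sleep(interval)
        after = read_vmstat()
        rate_in, rate_out = swap_rate(before, after, interval)
        print(f"swap in: {rate_in:.1f} pages/s, swap out: {rate_out:.1f} pages/s")
        before = after
```

Sustained non-zero rates in both directions while the queries run would support the thrashing theory; occasional swap-out alone usually does not.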

Last edited by sundialsvcs; 02-28-2017 at 08:03 AM.
 
Old 02-28-2017, 08:38 AM   #10
lpwevers
Member
 
Registered: Apr 2005
Location: The Netherlands
Distribution: SuSE, CentOS
Posts: 181

Original Poster
Rep: Reputation: 21
Thanks for that. Indeed it looks like the machine is also showing increased memory usage and swap activity. I guess it's the combination of those that cause the issues. I'll see if I can increase the amount of memory in the machine.
 
Old 02-28-2017, 09:15 AM   #11
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Mark Seger will show up...
 
Old 02-28-2017, 11:35 AM   #12
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,678
Blog Entries: 4

Rep: Reputation: 3947
"Thrashing" is an extreme situation. If you plotted the completion-time of a process on a graph, for a while the slowdown will be more-or-less linear. But then, it will "hit the wall." The completion-time graph has an elbow shape: when the thrash point is reached, time goes up, not linearly, but exponentially.

In the early days, when disk drives were about the size of a washing machine, they'd start vibrating so hard that we said they were in "Maytag® Mode." One of them actually vibrated its way to the edge of the raised-floor and fell off. (To the obvious detriment of the entire mechanism ... )

Last edited by sundialsvcs; 02-28-2017 at 11:38 AM.
 
Old 03-01-2017, 02:31 AM   #13
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,364

Rep: Reputation: 2751
For past issues, the general system logs may have some clues, especially if you know what time this happened.
Some DBs keep a log of each process/query run inside the DB. Again if you know the time it happened (roughly) you may be able to determine the query/queries responsible.

Of course, as per syg00, it may simply be that the machine is gradually getting used more and more and you are hitting that point now...

Try to activate query logging and possibly some sort of performance graphing for ease of observation, rather than having to check logs directly.
 
1 members found this post helpful.
Old 03-01-2017, 07:45 AM   #14
lpwevers
Member
 
Registered: Apr 2005
Location: The Netherlands
Distribution: SuSE, CentOS
Posts: 181

Original Poster
Rep: Reputation: 21
Thanks again for all the help. In the meantime I've managed to set up more advanced monitoring using, for instance, collectl, and we managed to catch the culprit. We traced it down to a certain user who had scheduled a job in the system to generate some reports overnight. Needless to say his query was less than optimally written.

As he promised to behave next time and test better before unleashing his jobs we've decided to let him live (for now) ;-)

And yes, the system is at its limits, so management will have to pay for more RAM eventually. But that's another story.
 
Old 03-01-2017, 08:36 AM   #15
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,678
Blog Entries: 4

Rep: Reputation: 3947
Quote:
Originally Posted by lpwevers View Post
And yes, the system is at its limits, so management will have to pay for more RAM eventually. But that's another story.
I suggest that you should simply – and, promptly – make the case to management that maximizing the amount of RAM in these systems (to the extent that the motherboards will support!) will have a direct and immediate business impact. Furthermore, "chips are cheap."

You should have a qualified engineer look at exactly how the memory cards are now deployed in these machines and construct a plan to optimize the amount of RAM that is available in the various systems with an efficient outlay of money. In any case, it won't be much money, everything will run faster across the board, and the bottom-line business impact is quite obviously there.

Presumably, in order to prosecute your business, you need to "run those queries." Trying to do it without enough resources is: "penny-wise and pound-foolish." This case can very easily be made, and you can quote me on that.

Last edited by sundialsvcs; 03-01-2017 at 08:40 AM.
 
  

