Linux - Enterprise: This forum is for all items relating to using Linux in the Enterprise.
Background:
* High-school environment.
* 217 diskless clients whose root filesystems run off the server.
* Server running RAID 10 across 4 disks.
* Every once in a while atop and top show high iowait on the disks.
* Most of the time everything is fast/snappy.
Question:
* Is there a way to tell/track which files are causing the IO wait?
** I know that at certain times of the day there is high IO wait.
** I know that it is NFS causing it. But I don't know what the nfs thread is reading/writing to create the disk IO - a KDE cache file? A Firefox Flash applet? /usr/games/some_game?
%wa isn't a direct indicator of tasks waiting for I/O to complete. It's a very poor name and an almost useless metric: it indicates that (all) the CPUs are idle while there is uncompleted I/O. That might be just one I/O, or a write-out that nobody cares about - no way to tell.
What's your loadavg at the same time? That might (MIGHT) be a better indicator, but for "peaky" values you might not even see it in the numbers. You might want to look at installing collectl.
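One detail worth knowing when reading loadavg here: on Linux, tasks in uninterruptible (D) sleep - usually stuck waiting on I/O - are counted in the load average alongside runnable tasks. A quick sketch for snapshotting both:

```shell
# Load average plus a count of tasks blocked in uninterruptible (D) sleep.
# A loadavg spike with idle CPUs and a non-zero D count points at I/O, not CPU.
cat /proc/loadavg
ps -eo state= | awk '$1 ~ /^D/ {d++} END {print d+0, "task(s) in D state"}'
```

Run it a few times during the slow periods; a persistently non-zero D count is a much more direct sign of an I/O bottleneck than %wa.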
Hardware or software RAID? Separate controllers? Are you swapping? Is it impacting your users?
I doubt there is an easy way to identify file access - iotop indicates the major users, but you already know that.
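For what it's worth, the per-process counters iotop displays come from /proc/&lt;pid&gt;/io, which you can read directly (root is needed for other users' processes). A rough sketch listing the biggest writers - these are cumulative totals since each process started, not current rates, so it only hints at who to watch in iotop:

```shell
# Top 5 processes by total bytes written to the block layer since they started.
for p in /proc/[0-9]*; do
    b=$(awk '/^write_bytes/ {print $2}' "$p/io" 2>/dev/null)
    [ -n "$b" ] && echo "$b $(cat "$p/comm" 2>/dev/null) (pid ${p#/proc/})"
done | sort -rn | head -5
```

It still won't name the files, but it narrows down which processes to trace further.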
syg00 is right - I think especially concerning "It indicates that (all) the CPUs are idle, and there is uncompleted I/O" and "Is it impacting your users ?".
I used to work with an IBM AIX server and often had high CPU wait times. At the beginning I was jumping around trying to lower that value, but in most cases ("most" - not "all") it was normal and had practically no repercussions for the users.
This is what I figured - it is not possible to track IO on a per-file basis. I've been comparing our school's diskless servers in MRTG, and it seems that when we get to around 200 diskless workstations the hardware RAID cannot keep up. So I'll be focusing on faster drives/RAID.
In regards to atop vs iotop - atop gives a much nicer breakdown of what is causing bottlenecks (CPU, disk, net, swap). The MRTG graphs are nice for a visual comparison over time.
Yes - for some reason, when the IO is too high, DNS stops resolving in a timely manner, which prompts students to reboot their diskless clients in hopes of fixing the Internet. Rebooting diskless clients creates more disk IO.
This is probably a dumb question, but are you sure that it is this way around? That is, are you sure that it isn't DNS resolve failures causing high IO waits?
Good question
Assuming that most (all?) of your most frequently used NFS servers have static IPs, you might want to propagate an /etc/hosts containing those addresses to your clients (along with an nsswitch.conf that specifies "use /etc/hosts before querying DNS").
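Assuming the clients use glibc's name service switch, a minimal sketch of that setup (the server name and address below are made up for illustration):

```
# /etc/nsswitch.conf -- consult /etc/hosts ("files") before DNS
hosts: files dns

# /etc/hosts -- pushed to every client; name and address are examples
192.168.1.10    nfsserver.school.lan    nfsserver
```

With that in place, lookups for the file server never touch DNS at all.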
Resolved by ensuring that all of the following folders are stored in a local tmpfs on the client and not on the server:
/tmp /var/cache/man /var/lib/xkb /var/lock /var/run /var/log /var/spool /var/lib/discover /etc/hotplug/.run /var/lib/nfs /var/lib/gdm /var/lib/xdm /var/lib/cups /var/lib/urandom /var/yp/binding /var/cache/cups /etc/network/run /media /var/lib/preload
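On most distros these can be pinned to client RAM with tmpfs entries in the client's /etc/fstab; a sketch for a few of them (the size limits are guesses you'd tune per workstation):

```
# client /etc/fstab -- keep chatty paths in RAM instead of on the NFS root
tmpfs  /tmp       tmpfs  defaults,noatime,size=256m  0 0
tmpfs  /var/log   tmpfs  defaults,noatime,size=64m   0 0
tmpfs  /var/lock  tmpfs  defaults,noatime,size=4m    0 0
tmpfs  /var/run   tmpfs  defaults,noatime,size=16m   0 0
```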
A few of these I had moved to the server because software was running the workstation out of RAM. I'll deal with tmpfs-abusive software on a per-application basis.
If you have gigabit networking throughout, raise the MTU to 9000. Tune your NFS server and clients to use maximum transfer sizes.
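Assuming every switch and NIC on the path supports jumbo frames, the MTU is raised with `ip link set dev eth0 mtu 9000` (interface name is an example). On the client side, NFS transfer sizes are requested with the rsize/wsize mount options - a sketch with made-up server/export names; the server will negotiate these down if it caps them lower:

```
# client /etc/fstab -- request large NFS transfer sizes
nfsserver:/export/root  /  nfs  rsize=32768,wsize=32768,hard,intr  0 0
```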
Ideally, find an alternative to NFS - it's buggy and crufty as hell.
Treat any NFS-exported filesystem as _prohibited_ for any other kind of export or for local filesystem work, or you risk data corruption. In particular, it's extremely dangerous to export the same filesystem over NFS and Samba at the same time, or to export it over NFS to another box and then over Samba from there.
This is because NFS doesn't pass client lock requests back down to the filesystem, so it's trivially possible for NFS and any other process to try to write to the same file at the same time. Simultaneous NFS clients are handled within the NFS server, so that case is fairly safe.
As you've discovered, it's logging, locking, tmpfs and home-directory X stuff (the Firefox cache is a particularly nasty culprit) that can cause a LOT of server IO from a diskless client. Consider throwing more RAM at the clients, or toss in a cheap small disk and set up cachefilesd to help take some of the load off the server.
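A minimal cachefilesd sketch, assuming the cheap local disk is mounted at /var/cache/fscache (the culling thresholds shown are the package defaults, not tuned values), plus the `fsc` mount option that turns FS-Cache on for the NFS mount:

```
# /etc/cachefilesd.conf
dir /var/cache/fscache
tag nfscache
brun  10%
bcull 7%
bstop 3%

# client /etc/fstab -- "fsc" enables FS-Cache for this mount
nfsserver:/export/root  /  nfs  fsc,rsize=32768,wsize=32768,hard  0 0
```

Once the cachefilesd daemon is running, repeat reads (application binaries, libraries, the same cache files every login) come off the local disk instead of the wire.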
Finally: on the server itself, figure about 1 GB of RAM per TB of disk for optimal caching, and look at tuning your swappiness/memory-pressure settings. You don't want the server to swap or things get _really_ slow, nor do you want it to cache writes and then dump them all out at once, as this will lead to pauses in fileserving.
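Those knobs live in sysctl; a sketch for /etc/sysctl.conf (the values are illustrative starting points, not recommendations):

```
# /etc/sysctl.conf
vm.swappiness = 10               # prefer dropping page cache over swapping out daemons
vm.dirty_background_ratio = 5    # start background writeback earlier...
vm.dirty_ratio = 10              # ...so writes trickle out instead of dumping all at once
```

Apply with `sysctl -p` and watch whether the periodic write-out pauses smooth out.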
Yes, we also turned off Firefox caching and session storage - as you mentioned, Firefox hammers the disk, and overall performance improves with the following disabled...
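The exact settings aren't listed in the thread; for illustration, prefs along these lines (set via about:config or a site-wide user.js) are the usual knobs for the disk cache and session storage:

```
// user.js -- illustrative prefs, not the thread's actual list
user_pref("browser.cache.disk.enable", false);              // no on-disk cache
user_pref("browser.sessionstore.resume_from_crash", false); // skip crash-recovery writes
user_pref("browser.sessionstore.interval", 300000);         // session snapshots every 5 min, not every few seconds
```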