Linux - General
This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
After all, when a file is fragmented, the heads have to move all over the place to pick up the entire file; when files are merely scattered (but each one contiguous), the heads just have to move to where each file is.
Yes and no. It all depends on the usage activity on the machine. In a single-user machine where there's pretty much only one thing going on, what you say is true. It's not so clear in a multi-user machine where several users are accessing files in different home directories, which are at different places on the drive. Modern filesystems, using elevator algorithms, go a long way toward improving average access times in the second case, and scattering may actually improve access. It always depends.
This is what I have been saying.
That scattering may improve access is a bit much to expect. But in a modern system it probably doesn't hurt access noticeably.
Elevator algorithms were developed to take advantage of the case where there are many concurrent users of data in a single partition. The fact that the data is scattered means that one user's data will probably be near that of at least one other user's data on at least some accesses. Since this applies to all users, there is a better probability of smaller seek times when you have more users. If the data were to be segregated by user, then each user's access would imply a seek of at least one unit of segregation (probably a partition). Someone out there somewhere knows the proper nomenclature (probably involving the infamous "O"), but alas I don't.
OTOH, with a single user, it is more important that data be concentrated by fileid.
In any case, it is pretty important that executables not be chopped up and spread out all over the disk, no matter how many concurrent users you have. Data is just a different case.
Is your "target system" a multi-user system - i.e. one with a number of users logged in and working at the same time - or is it essentially a workstation, however large it may or may not be?
For our most recent purposes it's been single-user workstation systems. Over time, bad admin practices had really slowed the machines down. So, for several months now there's been a high-level management focus on having all workstations highly tuned.
I think I see where you are going with the question, though: in a multi-user system the optimization isn't as big of a deal because you can't really profile the file access behavior. I would imagine there are even multi-user circumstances where the optimization would be counterproductive.
I question whether you get the performance boost from the I-FAAST scheme particularly, or just from defragging the drives.
I discounted the optimization theory at first. We were running the built-in defrag on a regular basis, so I didn't think much of the theory. However, we were able to trade several users out of machines that had really slow (14-minute) boot-to-GINA times, as well as ridiculously long application load times. In several of those cases we had identical hardware on hand. We stripped down and cleaned the older machines, then compared them to identical hardware with a new build; they were still about half the speed. After letting I-FAAST do its job, the older machines are actually faster than the plainly defragmented "new build".
Over the years I had also forgotten how much more the full version of Diskeeper does; the built-in version really does suck.
But that's only for our M$FT boxes; our SUSE workstations are still suffering the scatter effect - i.e., a new build's drive usage is a lot lower than a cleaned old build's.
Quote:
That scattering may improve access is a bit much to expect. But in a modern system it probably doesn't hurt access noticeably.
Just to be clear, a lot of our really, really bad systems (laptops) had like 4800 rpm drives in them. In that case the optimization difference was crazy. I'm sure a lot of the higher-performance drives in our environment don't see nearly the gain we did with those laptops.
Quote:
I discounted the optimization theory at first. We were running the built-in defrag on a regular basis, so I didn't think much of the theory. However, we were able to trade several users out of machines that had really slow (14-minute) boot-to-GINA times, as well as ridiculously long application load times. In several of those cases we had identical hardware on hand. We stripped down and cleaned the older machines, then compared them to identical hardware with a new build; they were still about half the speed. After letting I-FAAST do its job, the older machines are actually faster than the plainly defragmented "new build".
Over the years I had also forgotten how much more the full version of Diskeeper does; the built-in version really does suck.
But that's only for our M$FT boxes; our SUSE workstations are still suffering the scatter effect - i.e., a new build's drive usage is a lot lower than a cleaned old build's.
Well that is certainly interesting.
If the pagefile is fragmented, that by itself will have a huge impact on performance. Also the registry files: the system hits the registry hives many times a second. The built-in Diskeeper does not defrag these files; some of the more advanced defraggers do, though of course they have to do it offline, because Windows won't let anyone touch those files while they are open - i.e., while Windows is running.
I wonder if defragging those files is where the performance improvement is mostly coming from.
I also would be interested in specific numbers regarding drive access on the SUSE machines. I myself have not noticed the kind of slowdown that you talk about on any of my Linux systems over time (though I do see it across the course of one session in KDE, which usually has to do with KDE bugs) and I do monitor my systems' performance rather closely.
I also notice some performance deterioration as drives become full. Generally, though, with my SCSI drives, so long as I keep up the maintenance (deleting the digital detritus, mainly) things seem to keep chugging right along without incident.
Quote:
Elevator algorithms were developed to take advantage of the case where there are many concurrent users of data in a single partition. The fact that the data is scattered means that one user's data will probably be near that of at least one other user's data on at least some accesses. Since this applies to all users, there is a better probability of smaller seek times when you have more users. If the data were to be segregated by user, then each user's access would imply a seek of at least one unit of segregation (probably a partition). Someone out there somewhere knows the proper nomenclature (probably involving the infamous "O"), but alas I don't.
OTOH, with a single user, it is more important that data be concentrated by fileid.
In any case, it is pretty important that executables not be chopped up and spread out all over the disk, no matter how many concurrent users you have. Data is just a different case.
How is this different than reordering I/O to minimize seek time?
Quote:
But that's only for our M$FT boxes; our SUSE workstations are still suffering the scatter effect - i.e., a new build's drive usage is a lot lower than a cleaned old build's.
I also wonder what would happen if you were to copy the entire contents of one partition to another drive (not an image...use the cp command...), then wipe and reformat the original partition, then copy all the files back into place. This specifically would permit the file system to reorganize the drive according to whatever scheme it considers optimum.
I have done this in the past, usually when enlarging a system partition or changing drives, and I have not noticed any particular performance improvement from doing it. However, as I say, I also have not noticed any deterioration over time, until the drive or partition is nearly full anyway.
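The copy-out / reformat / copy-back procedure described above might look something like the sketch below. The device name, mount points, and the ext3 choice are all made-up examples, not a recipe for any particular machine; DRY_RUN=1 (the default) only prints the steps instead of running them, so nothing is wiped until you deliberately flip it off - and only after you have a verified backup.

```shell
#!/bin/sh
# Sketch of the copy-out / reformat / copy-back idea. Device, paths, and
# filesystem type are ASSUMED example values -- adjust for your machine.
# DRY_RUN=1 (the default) only echoes each step instead of executing it.
DRY_RUN=${DRY_RUN:-1}
DEV=/dev/sdb1          # partition to rebuild (example)
SRC=/mnt/data          # its mount point (example)
TMP=/mnt/scratch       # spare space big enough to hold the files (example)

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run cp -a "$SRC/." "$TMP/"    # archive copy keeps perms/owners/symlinks
run umount "$SRC"
run mkfs.ext3 "$DEV"          # wipe and recreate the filesystem
run mount "$DEV" "$SRC"
run cp -a "$TMP/." "$SRC/"    # filesystem lays the files back down fresh
```

Note the plain `cp -a` rather than an image copy: that is exactly what lets the filesystem place the files wherever its allocator considers optimal.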
Quote:
How is this different than reordering I/O to minimize seek time?
That's what an elevator algorithm means. Essentially, the I/O is ordered by address, and the heads pass in one direction only until they reach the farthest access before traveling back in the other direction. Everybody gets their disk access done in a timely manner, but those wanting data in the same area as a previous request may get bumped ahead in the queue, depending on where the head is in its seek pattern when the request comes in.
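The sweep described above is easy to see with a toy example. The block addresses and head position below are invented numbers, and a real scheduler handles arrivals continuously rather than in one batch, but the ordering logic is the same: sort the pending requests, service everything at or above the head on the way up, then the rest on the way back down.

```shell
#!/bin/sh
# Toy SCAN ("elevator") ordering. Head position and request addresses are
# made-up example values; real schedulers also merge adjacent requests.
HEAD=500
REQUESTS="95 180 34 119 11 123 62 64 850 1523"

# Requests at or above the head, serviced in ascending order (upward sweep)
UP=$(for r in $REQUESTS; do [ "$r" -ge "$HEAD" ] && echo "$r"; done | sort -n)
# Requests below the head, serviced in descending order (return sweep)
DOWN=$(for r in $REQUESTS; do [ "$r" -lt "$HEAD" ] && echo "$r"; done | sort -nr)
ORDER=$(echo $UP $DOWN)

echo "service order: $ORDER"
```

A request at block 850 that arrives while the head is sweeping upward through block 400 gets serviced before the earlier-submitted request at block 34 - that's the "bumped ahead in the queue" effect.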
Quote:
That's what an elevator algorithm means. Essentially, the I/O is ordered by address, and the heads pass in one direction only until they reach the farthest access before traveling back in the other direction. Everybody gets their disk access done in a timely manner, but those wanting data in the same area as a previous request may get bumped ahead in the queue, depending on where the head is in its seek pattern when the request comes in.
I thought that is what you meant by it. This is handled at the hardware level with SCSI disks, on the drive and in the controller. Also with SAS, and I think with SATA. Older IDEs don't do it, though I think I read someplace that newer ones do.
Quote:
I thought that is what you meant by it. This is handled at the hardware level with SCSI disks, on the drive and in the controller. Also with SAS, and I think with SATA. Older IDEs don't do it, though I think I read someplace that newer ones do.
Actually, it used to be handled at the hardware level. I'm almost 100% certain that the libata driver turns off all drive optimization and does the optimization in the kernel, where it can be controlled to the advantage of the OS, rather than accepting whatever optimization the drive maker put in. Maybe johnsfine will pipe in and either correct me or give me an attaboy. But we are kind of straying, and I apologize for that.
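For what it's worth, the kernel-side elevator is visible (and switchable) per device through sysfs, so it's easy to check what a given box is actually using; the loop below just lists whatever block devices happen to be present, with the active scheduler shown in brackets:

```shell
#!/bin/sh
# List the kernel I/O scheduler ("elevator") in use for each block device.
# The active scheduler is the bracketed entry in each line.
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue      # skip if the glob matched nothing
    printf '%s: ' "$f"
    cat "$f"
done
# Switching is a one-line echo as root, e.g. (sda is just an example):
#   echo deadline > /sys/block/sda/queue/scheduler
```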
Quote:
For our most recent purposes it's been single-user workstation systems. Over time, bad admin practices had really slowed the machines down. So, for several months now there's been a high-level management focus on having all workstations highly tuned.
I think I see where you are going with the question, though: in a multi-user system the optimization isn't as big of a deal because you can't really profile the file access behavior. I would imagine there are even multi-user circumstances where the optimization would be counterproductive.
Yeah, I think you got it. One size doesn't necessarily fit everyone when it comes to filesystem access. Depending on your needs, you might get either a speedup or slowdown just from changing the filesystem block size. Also, when you need every last bit of performance, it pays to get oversized drives, since if you only use a small part of the drive your average seek times will be reduced. For example: get a 1 TB drive and only set up a partition of perhaps 500 GB. If the full-stroke seek time were 8 ms, you'd probably end up with a full-stroke seek time somewhere near 4-5 ms for the partition.
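The 4-5 ms ballpark above can be sanity-checked with a very crude model: treat a seek as a fixed settle/overhead component plus a distance-dependent travel component. The 8 ms full-stroke figure comes from the post; the 2 ms settle component and the linear travel scaling are assumptions for illustration (real drives are closer to square-root-of-distance, which would land a bit higher):

```shell
#!/bin/sh
# Back-of-envelope short-stroking arithmetic. The settle overhead and the
# linear distance scaling are ASSUMED for illustration only.
OUT=$(awk 'BEGIN {
    full   = 8.0                  # full-stroke seek time from the post (ms)
    settle = 2.0                  # assumed fixed settle/overhead (ms)
    travel = full - settle        # distance-dependent part (ms)
    half   = settle + travel * 0.5   # partition covers half the platter
    printf "full stroke: %.1f ms, half stroke: ~%.1f ms", full, half
}')
echo "$OUT"
```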
Quote:
I also would be interested in specific numbers regarding drive access on the SUSE machines. I myself have not noticed the kind of slowdown that you talk about on any of my Linux systems over time (though I do see it across the course of one session in KDE, which usually has to do with KDE bugs) and I do monitor my systems' performance rather closely.
I'll ask what the general IOPS are between the base build and an old scrubbed one on identical hardware.
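For a quick, comparable number between the two builds, even a crude `dd` timing works as a stand-in until proper IOPS figures are available (iostat or bonnie++ give much better data). The scratch file name is arbitrary, and the read will be cache-inflated unless caches are dropped first (echo 3 > /proc/sys/vm/drop_caches as root):

```shell
#!/bin/sh
# Crude throughput check: write then re-read a fixed amount of data.
# Note: the read comes from the page cache unless caches are dropped
# first, so treat this only as a rough comparison tool.
dd if=/dev/zero of=dd_testfile bs=1M count=32 conv=fsync 2>/dev/null
sync
dd if=dd_testfile of=/dev/null bs=1M 2>&1 | tail -n 1   # "... copied" line
rm -f dd_testfile
```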
Quote:
Originally Posted by jiml8
I also wonder what would happen if you were to copy the entire contents of one partition to another drive (not an image...use the cp command...), then wipe and reformat the original partition, then copy all the files back into place. This specifically would permit the file system to reorganize the drive according to whatever scheme it considers optimum.
Can definitely give this a shot; it should lay the files out sequentially on the drive in that partition. Though I'm still wondering: is there a process for Linux that will watch filesystem access and optimize file placement based on it?
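As far as I know there's no stock Linux service that watches access patterns and reorders files accordingly (I-FAAST-style), but per-file fragmentation can at least be measured with filefrag from e2fsprogs, which would give hard numbers for the old-build-vs-new-build comparison. The path below is just an example file, and the fallback message covers systems where the tool isn't installed:

```shell
#!/bin/sh
# Report the extent count of a file with filefrag (e2fsprogs); a file in
# one extent is unfragmented. /etc/hosts is only an example path.
MSG=$(filefrag /etc/hosts 2>/dev/null || echo "filefrag unavailable")
echo "$MSG"
```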