Old 03-01-2009, 07:04 PM   #1
dman65
Member
 
Registered: Sep 2003
Posts: 61

Rep: Reputation: 15
Where is the bottleneck in file opening and closing


I am in the middle of a project that requires me to copy about 2.5 million files from one server to another. So far all of the file copy methods I have tried top out at around 12GB an hour or about 3.33MB/sec.

I know that the drives themselves are faster than that because I can copy a single 1GB file from one drive to another in 56 seconds, so at the least they are capable of transferring 17.85MB/sec.

I am guessing that there is a lot of overhead in actually opening each file, and I was wondering if anyone knows what that overhead consists of. Is it all disk-bound, or could a program spawn a couple of threads to work on opening and copying files in parallel?
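On the threading question, here is a minimal sketch, assuming POSIX threads. Several workers pull paths from a shared list and each opens and reads its own file, so the waits inside open() and read() can overlap. Whether this helps depends on the storage; on a single disk the extra seeking may cancel any gain. The names `job_queue`, `reader`, and `read_parallel` are invented for illustration, and the sketch only reads (a real copier would also write each buffer out).

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical shared work queue: a list of paths plus the next index to take. */
struct job_queue {
	const char **paths;
	int count;
	int next;
	long long bytes;        /* total bytes read by all workers */
	pthread_mutex_t lock;
};

/* Worker: repeatedly claim the next path, then open and read that file. */
static void *reader(void *arg)
{
	struct job_queue *q = arg;
	char buf[32768];

	for (;;) {
		pthread_mutex_lock(&q->lock);
		int i = (q->next < q->count) ? q->next++ : -1;
		pthread_mutex_unlock(&q->lock);
		if (i < 0) break;

		int fd = open(q->paths[i], O_RDONLY);
		if (fd < 0) continue;

		long long total = 0;
		ssize_t n;
		while ((n = read(fd, buf, sizeof buf)) > 0)
			total += n;
		close(fd);

		pthread_mutex_lock(&q->lock);
		q->bytes += total;
		pthread_mutex_unlock(&q->lock);
	}
	return NULL;
}

/* Read all `count` files using up to `nthreads` workers (1 to 16 in this sketch).
   Returns the total number of bytes read, or -1 if no thread could be started. */
long long read_parallel(const char **paths, int count, int nthreads)
{
	pthread_t tid[16];
	struct job_queue q = { paths, count, 0, 0 };
	int started = 0;

	assert(nthreads >= 1 && nthreads <= 16);
	pthread_mutex_init(&q.lock, NULL);

	for (int i = 0; i < nthreads; i++)
		if (pthread_create(&tid[started], NULL, reader, &q) == 0)
			started++;
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	pthread_mutex_destroy(&q.lock);
	return started ? q.bytes : -1;
}
```

Build with `cc -pthread`. Timing `read_parallel(paths, count, 1)` against `read_parallel(paths, count, 4)` on the same (uncached) set of files would show whether parallel opens help on a given array.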

In case it matters, the volume I am copying from is a RAID5 array.

I put together a simple bare-bones program to copy everything in one directory to another, with no error checking and no stat call, to see if that would speed anything up appreciably, but that only bought me about 0.2MB/sec over the tar program.

Is there any way to read the directory and get information about the directory structure that would allow reading the files in their sequential order on the drive to keep head thrashing to a minimum?
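One commonly used heuristic for this is to read the whole directory first, sort the entries by inode number, and only then open the files in that order. The sketch below does the listing and sorting. Whether inode order actually tracks on-disk order on ReiserFS is an assumption here, not something confirmed in this thread, and the names `entry`, `list_by_inode`, and `free_list` are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations (strdup) */
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical record: one directory entry and its inode number. */
struct entry {
	ino_t ino;
	char *name;
};

static int by_inode(const void *a, const void *b)
{
	const struct entry *x = a, *y = b;
	return (x->ino > y->ino) - (x->ino < y->ino);
}

/* Release a list returned by list_by_inode(). */
void free_list(struct entry *list, size_t count)
{
	for (size_t i = 0; i < count; i++)
		free(list[i].name);
	free(list);
}

/* Read every entry of `path` except "." and "..", sort by inode number,
   and return a malloc'd array; *count receives the number of entries.
   Returns NULL on error. */
struct entry *list_by_inode(const char *path, size_t *count)
{
	assert(count != NULL);
	*count = 0;

	DIR *dir = opendir(path);
	if (!dir) return NULL;

	size_t used = 0, cap = 1024;
	struct entry *list = malloc(cap * sizeof *list);
	struct dirent *de;
	if (!list) { closedir(dir); return NULL; }

	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		if (used == cap) {
			struct entry *tmp = realloc(list, 2 * cap * sizeof *tmp);
			if (!tmp) goto fail;
			list = tmp;
			cap *= 2;
		}
		list[used].name = strdup(de->d_name);
		if (!list[used].name) goto fail;
		list[used].ino = de->d_ino;
		used++;
	}
	closedir(dir);

	qsort(list, used, sizeof *list, by_inode);
	*count = used;
	return list;

fail:
	closedir(dir);
	free_list(list, used);
	return NULL;
}
```

A copy loop would then walk the returned array in order and open `list[i].name` relative to the source directory. Comparing its throughput with plain readdir() order is the only way to tell whether the sort helps on a particular filesystem.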
 
Old 03-01-2009, 07:48 PM   #2
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: Slackware64 13.37, Kubuntu 10.04
Posts: 2,944

Rep: Reputation: Disabled
How are the machines connected? My guess is the network hardware is slowing it down. If you're using a network connection, what transfer protocol are you using? I don't think changing the file order is necessary unless you have a billion small files. RAID would neutralize any benefit gained from that. One thing that will slow things down is copying files in parallel in an attempt to go faster, because the drives will oscillate between different files mid-copy.

One particularly effective method of copying is to use tar piped to standard output, then piped into another instance of tar at the destination to extract the files. This preserves symlinks, permissions, times, and ownership. I don't know exactly how you'd set that up, but you could pipe the output of tar across the network using netcat and receive it the same way (you might need to write a small inetd script for the receiving end.) Depending on the type of data, bzip2 might reduce the size significantly.
Kevin Barry
 
Old 03-01-2009, 08:13 PM   #3
jschiwal
Moderator
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,263

Rep: Reputation: 562
Here is an example transporting the "logs2" directory to another computer using tar & netcat. I had two terminals open. In the second, I ssh'ed into the target computer. I have public key authentication set up & use ssh-agent so that the passphrase isn't prompted for.

Example using ssh-agent:
Code:
eval $(ssh-agent)
ssh-add
Transferring the files.
In the local shell:
Code:
tar -C /home/jschiwal/ -cf - logs2 | netcat qosmio 1025
In the remote shell:
Code:
netcat -l -w600 -p 1025 | tar -C /home/jschiwal/ -xvf -
The netcat command (A.K.A. nc) is launched via ssh but the transfer occurs unencrypted.

I didn't perform any speed tests comparing using netcat vs an ssh tunnel:
Code:
tar -C /home/jschiwal -cf - logs2 | ssh qosmio tar -C /home/jschiwal -xvf -

Last edited by jschiwal; 03-01-2009 at 08:14 PM.
 
Old 03-01-2009, 08:51 PM   #4
dman65
Member
 
Registered: Sep 2003
Posts: 61

Original Poster
Rep: Reputation: 15
Actually, this was not a transfer over a network. I put an additional EIDE drive in the existing machine and connected an external USB hard drive. I can transfer a 1GB file from the existing RAID5 array to the EIDE drive or to the external USB drive in 50-56 seconds, but transferring the smaller files in the 32K to 3MB range slows the data transfer down to about 1/6 that speed.

I haven't come across any software that makes the multi-file transfers faster.
 
Old 03-01-2009, 09:48 PM   #5
jschiwal
Moderator
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,263

Rep: Reputation: 562
USB drives are slow. To obtain the speed they reach, the programs are compressed at the sender and uncompressed on the receiver end. If a file doesn't readily compress, the transfer rate will be reduced. An eSATA disk would be the best and would work as fast as a regular drive (or faster compared to a laptop's drive).

A tar backup may be faster if the slowdown is due to the creation of inodes or directory entries on the USB drive's filesystem.
Indexing a directory with an extremely large number of files could cause slowing, especially when using a wildcard. Also, tar would eliminate slack space, which may be more significant for very small files. (Note: I'm not certain whether cp continues to the end of the block at the end of the file during a copy.)

What filesystem is used on the USB drive? I hope it isn't FAT32. That wouldn't copy permissions, and because of its legacy as a floppy disk filesystem, it is inefficient for a large number of files.

---

Also try a different USB port if you are using an external drive. Some computers will have better performance on one port than the other.

Last edited by jschiwal; 03-02-2009 at 05:44 AM. Reason: added info.
 
Old 03-02-2009, 07:03 AM   #6
dman65
Member
 
Registered: Sep 2003
Posts: 61

Original Poster
Rep: Reputation: 15
Hello jschiwal,

I also tried an internal EIDE drive and it was the same speed as the USB drive. I can write a 1GB file to either the external USB drive or the internal EIDE drive in 50-57 seconds. The issue is in writing many smaller files. I don't have any issues with a similar situation on a Windows server, so I am guessing that Windows somehow handles the opening of the files, or the caching of the data it reads, differently. I have tried tar, star, rsync, cpio, cp, etc., but none of them gets me over 3.33MB/sec, which is about what I can get over a 100BaseT connection using tar through ssh.

So, this seems to be an issue of opening files on the RAID5 array rather than an issue with transferring the data to the USB or EIDE drive.
 
Old 03-02-2009, 03:27 PM   #7
Big_Vern
LQ Newbie
 
Registered: Jan 2008
Posts: 9

Rep: Reputation: 1
Are you writing these files to a single directory? Do you know if it starts off fast, gradually getting slower as you put more files into the directory?

Perhaps it's taking quite long to allocate an inode.
 
Old 03-02-2009, 03:56 PM   #8
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: Slackware64 13.37, Kubuntu 10.04
Posts: 2,944

Rep: Reputation: Disabled
Quote:
Originally Posted by Big_Vern
Perhaps it's taking quite long to allocate an inode.
Or journaling, depending on the file system. Also, I've sometimes received a warning with JFS drivers about being in debug mode, which apparently slows the file system down a lot. Maybe OP should try XFS?
Kevin Barry
 
Old 03-02-2009, 08:33 PM   #9
dman65
Member
 
Registered: Sep 2003
Posts: 61

Original Poster
Rep: Reputation: 15
The files on the RAID5 array are all in the same directory. I am copying them into one directory at this point, and then I am going to move them into subdirectories from there. But the 3.3MB/sec speed is there from the beginning. I hate to think how many hours I have spent timing this process over the past couple of weeks.

The source drive is using the ReiserFS file system. I am using XFS on the destination.
 
Old 03-04-2009, 07:00 AM   #10
dman65
Member
 
Registered: Sep 2003
Posts: 61

Original Poster
Rep: Reputation: 15
Just to see what the results would be, I took out the parts of my file copying program that create the new file and write data to it, so all I am actually doing is reading the data from the existing file. Just opening and reading the existing files, I am getting a throughput of 7.4MB/sec. That would seem to be pretty much in line with the 3.33MB/sec speed copying to the USB drive, since it is a single-threaded application that reads one file and then writes it.

It seems like there is an awful lot of overhead in the opening of a file since I get 5-6 times this speed just copying a single large file.
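One way to check that guess is to time the open() calls separately from the read() calls. The sketch below walks one directory and keeps the two totals apart. It assumes the directory holds only regular files and subdirectories (a FIFO would block in open()), and the names `totals`, `now_sec`, and `measure_dir` are invented for illustration. Files already in the page cache will read far faster than from disk, so a second run over the same files mostly measures the cache.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations */
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical accumulator for the measurements. */
struct totals {
	double open_sec;   /* seconds spent inside open() */
	double read_sec;   /* seconds spent inside read() */
	long long bytes;   /* bytes read */
	long files;        /* regular files read */
};

static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/* Open and read every regular file directly inside `path`, timing open()
   and read() separately. Returns 0, or -1 if `path` cannot be opened. */
int measure_dir(const char *path, struct totals *out)
{
	char buf[65536], name[4096];
	struct dirent *de;
	struct stat st;

	assert(out != NULL);
	out->open_sec = out->read_sec = 0.0;
	out->bytes = 0;
	out->files = 0;

	DIR *dir = opendir(path);
	if (!dir) return -1;

	while ((de = readdir(dir)) != NULL) {
		snprintf(name, sizeof name, "%s/%s", path, de->d_name);

		double t0 = now_sec();
		int fd = open(name, O_RDONLY);
		double t1 = now_sec();
		if (fd < 0) continue;

		/* skip ".", ".." and anything else that is not a regular file */
		if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
			close(fd);
			continue;
		}

		ssize_t n;
		double t2 = now_sec();
		while ((n = read(fd, buf, sizeof buf)) > 0)
			out->bytes += n;
		double t3 = now_sec();
		close(fd);

		out->open_sec += t1 - t0;
		out->read_sec += t3 - t2;
		out->files++;
	}
	closedir(dir);
	return 0;
}
```

If `open_sec` turns out to be a large share of the total, the cost is in locating each file (directory lookup and inode reads) rather than in moving the data.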
 
Old 03-04-2009, 10:04 AM   #11
akuthia
Member
 
Registered: Oct 2007
Location: triad, nc, usa
Distribution: Ubuntu
Posts: 221

Rep: Reputation: 29
Have you considered trying to compress/archive a section of the directory and transferring the archive, to see if that improves speeds?
 
Old 03-04-2009, 02:18 PM   #12
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: Slackware64 13.37, Kubuntu 10.04
Posts: 2,944

Rep: Reputation: Disabled
Something that's non-standard (i.e. not guaranteed to work) is using truncate to arbitrarily expand a file. It works on my machine, and probably most others; therefore, rather than letting the file system incrementally expand the file as it grows, you could create the entire file at once, then fill it with the data. In my experience, truncate can give you a 1GB file in less than a second, which I would think would be faster than the sum of all one-block expansions when building the file incrementally. This would require writing a C program, of course, but because it's a one-time task, this could be split into two simple programs. One would instantiate the files from a simple list of file:size, etc., and the other would open them in non-appending/non-truncating mode and fill them with the data. Even though you have a lot of files, this still might save time because it sounds like your files have several hundred or thousand blocks, anyway. Also, writing an entire block at once seems to make things faster, although it might be better to write one page at a time to avoid the possibility of a page fault mid-write.
Kevin Barry

edit:
I actually tested this out just now and it doesn't seem to make it any faster. In fact, it seems to make it worse:
Code:
#include <stdio.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>


#define PAGES ( 1024 )
#define TEST_FILE "/tmp/test_file"


/*print the errno message, run any cleanup actions, then exit*/
#define EXIT_ERROR(program, actions) \
do { fprintf(stderr, "%s: error: %s\n", program, strerror(errno)); \
  actions; \
  exit(1); } while (0)


/*write PAGES buffers of sSize bytes each to fFile*/
static void write_loop(const char *pProgram, int fFile, char *bBuffer, int sSize)
{
	int I = 0;

	for (; I < PAGES; I++)
	{
		if (write(fFile, bBuffer, sSize) == (ssize_t) -1)
			EXIT_ERROR(pProgram, free(bBuffer));
	}
}


/*print the elapsed time between two timestamps, in seconds*/
static void show_time(const char *mMessage, const struct timespec *sStart,
  const struct timespec *sStop)
{
	double start = (double) sStart->tv_sec +
	               (double) sStart->tv_nsec / 1e9;

	double stop = (double) sStop->tv_sec +
	              (double) sStop->tv_nsec / 1e9;

	fprintf(stderr, "%s%f\n", mMessage, stop - start);
}


int main(int argc, char *argv[])
{
	(void) argc;

	/*calloc takes (count, size): one zero-filled buffer of a single page*/
	char *data = calloc(1, getpagesize());
	if (!data) EXIT_ERROR(argv[0],);

	int test_file = -1;

	struct timespec start_time, stop_time;

	/*append~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/

	test_file = open(TEST_FILE, O_RDWR | O_TRUNC | O_APPEND | O_CREAT, 0644);
	if (test_file < 0) EXIT_ERROR(argv[0], free(data));

	/*start timing (no fsync is issued, so this mostly times writes into the page cache)*/
	clock_gettime(CLOCK_MONOTONIC, &start_time);

	write_loop(argv[0], test_file, data, getpagesize());

	/*stop timing*/
	clock_gettime(CLOCK_MONOTONIC, &stop_time);
	show_time("appending: ", &start_time, &stop_time);

	close(test_file);
	remove(TEST_FILE);

	/*truncate~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/

	test_file = open(TEST_FILE, O_RDWR | O_CREAT, 0644);
	if (test_file < 0) EXIT_ERROR(argv[0], free(data));

	/*start timing*/
	clock_gettime(CLOCK_MONOTONIC, &start_time);

	/*set the final length first, then fill the file from offset 0*/
	if (ftruncate(test_file, PAGES * getpagesize()))
		EXIT_ERROR(argv[0], free(data));

	write_loop(argv[0], test_file, data, getpagesize());

	/*stop timing*/
	clock_gettime(CLOCK_MONOTONIC, &stop_time);
	show_time("truncating: ", &start_time, &stop_time);

	close(test_file);
	remove(TEST_FILE);

	free(data);
	return 0;
}
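
A possible follow-up to the test above: ftruncate only changes the recorded file length, and on most Linux filesystems the result is a sparse file with no data blocks reserved, so the later writes still have to allocate space. posix_fallocate() is the POSIX call that asks the filesystem to reserve the blocks up front. The sketch below swaps it in for ftruncate; the helper names `create_preallocated` and `allocated_bytes` are made up for illustration, and whether this changes the timings was not tested in this thread.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create `path`, reserve `size` bytes of disk space
   with posix_fallocate(), and return the open descriptor (or -1 on error). */
int create_preallocated(const char *path, off_t size)
{
	assert(size > 0);

	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) return -1;

	/* posix_fallocate returns an error number directly, not through errno */
	if (posix_fallocate(fd, 0, size) != 0) {
		close(fd);
		unlink(path);
		return -1;
	}
	return fd;
}

/* Bytes of disk space actually allocated to the file, or -1 on error.
   On Linux st_blocks counts 512-byte units. */
long long allocated_bytes(int fd)
{
	struct stat st;
	if (fstat(fd, &st) != 0) return -1;
	return (long long) st.st_blocks * 512LL;
}
```

Calling `allocated_bytes()` on a file sized with ftruncate and on one sized with `create_preallocated()` shows whether space was really reserved in each case.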

Last edited by ta0kira; 03-04-2009 at 02:57 PM.
 
Old 03-05-2009, 01:36 AM   #13
Guttorm
Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 959

Rep: Reputation: 157
Hi

Quote:
The source drive is using the ReiserFS file system. I am using XFS on the destination.
I think this is probably the reason. ReiserFS has a nice feature where it doesn't slow down when you have lots of files in one directory. I never tried XFS, but all the ones I tried slow down a lot in those situations.
 
Old 03-05-2009, 07:58 AM   #14
dman65
Member
 
Registered: Sep 2003
Posts: 61

Original Poster
Rep: Reputation: 15
Hello Guttorm,

The speed bottleneck appears to be as much on the ReiserFS system as on the XFS one. If I just cycle through the directory on the ReiserFS system, open each file, and read the entire file, the throughput is only 7.4MB/sec on a drive that shows 60MB/sec using hdparm.

At this point the question has become academic to me, as I am just trying to determine why this happens so I can hopefully set up the new system to avoid whatever bottleneck this one has. I did a transfer over the wire that took a couple of days, and now each day I am transferring the last 48 hours of data over to what will be the new server until I get a chance to take the old server down, make the necessary network ID changes to the new one, and bring it up. I also have to change the software that accesses the data on that server to look in subdirectories instead of one large directory going forward.
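
For the move to subdirectories, one possible layout is to derive the subdirectory from a hash of the file name, so that both the copier and the software reading the data can compute the path without a lookup table. The thread does not say how the subdirectories will be chosen, so everything here is an assumption: the 256-bucket layout, the FNV-1a hash, and the names `fnv1a` and `shard_path`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a(const char *s)
{
	uint32_t h = 2166136261u;      /* FNV offset basis */
	for (; *s; s++) {
		h ^= (unsigned char) *s;
		h *= 16777619u;            /* FNV prime */
	}
	return h;
}

/* Hypothetical layout: <root>/<two hex digits>/<name>, giving 256 subdirectories.
   Writes the path into `out` and returns its length, or -1 if it does not fit. */
int shard_path(const char *root, const char *name, char *out, size_t size)
{
	assert(root && name && out);

	unsigned bucket = fnv1a(name) & 0xffu;
	int n = snprintf(out, size, "%s/%02x/%s", root, bucket, name);
	return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

With about 2.5 million files, 256 buckets would put roughly ten thousand files in each subdirectory; using more hex digits gives smaller directories.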
 
  

