Serious file transfer problems (caching until out of memory?)

Cappella · 10-21-2006, 05:22 AM

Hi,

I'm having a big problem transferring a lot of data through sftp. I have seen this problem on different machines and with Red Hat Enterprise 3 WS, Red Hat Enterprise 4 ES U3 and U4. Here are the specs of the servers I used:

Dual Xeon 3.2GHz, Adaptec RAID controller, 4x72GB 10k RPM SCSI in RAID5 (Maxtor HDs), 2 onboard gigabit LAN ports (Broadcom and Intel, I think), 4GB DDR-ECC (2x2GB)
Another has the same config, except the 4 HDs are set up in RAID 10
Another is the same, except it has only 2 HDs in RAID 0 and 8GB of RAM (4x2GB)
Another is a normal P4-4GHz machine with a cheap network card, 2x RAID 1 IDE disks, 2GB of DDR (non-ECC)

All machines show the same problem when I make a big sftp transfer to them. I also had the same issue transferring files from a USB-connected HD.

The transfer starts at around 23MB/sec, which is normal for CAT5E cables, but then it stops, continues, stops, continues and so on; the final average for the files (2GB each) ends up between 5 and 10MB/sec. It also sometimes freezes between files for up to 2 or 3 minutes. Then, after transferring a few gigabytes in total, Linux starts to complain that there is no more memory available and starts killing processes one by one... until the system goes down with a kernel panic! This can be reproduced on all the machines mentioned above!

I noticed that if I transfer a single 2GB file, the disk system stays very busy for a long, long time after the transfer ends, and top shows 75% "wa" (write access?) CPU usage. (I have to start top before the transfer, otherwise I have trouble starting it or it takes a very long time.)

Does it mean that Linux is caching the whole file in memory and then writing it to disk a long time afterwards? If so, that's insane... I have never seen anything like that.
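
A rough way to check this would be to watch the Dirty and Writeback counters in /proc/meminfo while the transfer runs. Below is a minimal sketch in Python, assuming a 2.6 kernel that exposes those two fields and a Python 3 interpreter on the box; the function names (parse_meminfo, dirty_summary, watch) are made up for illustration.

```python
import re
import time


def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        # Lines look like "Dirty:        1234 kB"; skip anything else.
        match = re.match(r"^(\w+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def dirty_summary(text):
    """Return (dirty_kb, writeback_kb, percent_of_ram) from meminfo text."""
    info = parse_meminfo(text)
    dirty = info.get("Dirty", 0)          # assumed present (2.6 kernels)
    writeback = info.get("Writeback", 0)  # assumed present (2.6 kernels)
    total = info.get("MemTotal", 0)
    percent = 100.0 * (dirty + writeback) / total if total else 0.0
    return dirty, writeback, percent


def watch(path="/proc/meminfo", samples=10, interval=1.0):
    """Print the counters a fixed number of times, once per interval."""
    for _ in range(samples):
        with open(path) as f:
            dirty, writeback, percent = dirty_summary(f.read())
        print("Dirty=%d kB Writeback=%d kB (%.1f%% of RAM)"
              % (dirty, writeback, percent))
        time.sleep(interval)
```

Calling watch(samples=600) in a second terminal during a transfer would show whether the amount of data waiting to be written keeps growing while the transfer stalls.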

Then I did the same transfer through a 100Mbit switch, and the speed was capped at around 10MB/sec. I had the same stop/start problem, with an average transfer speed of 4MB/sec, but no more out-of-memory and crash problems.

Normally the HDs are able to write faster than that (especially the RAID 1 and RAID 10 configs, and even the RAID5, since a full RAID5 init (205GB) took less than 1 hour), and Linux should never write-cache such an enormous amount of data.

Does anyone know about this problem? How can it be solved, and what is its cause?

Thanks in advance,

David.

unSpawn · 10-23-2006, 08:36 AM

Does anyone know about this problem? How can it be solved, and what is its cause?
I don't know.

First I'd look for generic and network driver fixes in the changelogs for the WS3, ES4U3 and U4 kernels you run. Then I'd look for similar threads about network throughput involving the network drivers you use. If that doesn't turn up any workarounds or fixes, I'd try a more methodical approach. Start by setting up per-system baseline data, using something like Dstat or Atsar or any form of SAR that logs system and network stats, because being able to visualise the data makes it easier to pinpoint which area you should be looking into: kernel version, file caching or packet queueing, driver, network. You will want to make sure the servers are set up as evenly as possible: minimal load, the same kernel, no sysctl mods, no niceness, no network throttling. When you have the baseline data, run some local tests shoving data between disks or between local and external disks. After that, run a network performance test to see what you should expect. *Then* run some network tests shoving data between servers using something like FTP, which avoids the overhead SFTP causes. If you now plot and overlay the data you may have a better picture of what's going on.
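
For the local disk test, something along these lines could give a rough sustained write figure to compare against the network numbers. It's only a sketch in Python: the function name timed_write, the default sizes and the target directory are assumptions for illustration. It calls fsync before stopping the clock so that data still sitting in the page cache doesn't flatter the result.

```python
import os
import tempfile
import time


def timed_write(directory, total_mb=64, chunk_mb=1):
    """Write zeros to a temp file in `directory`, fsync, and return MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    chunks = total_mb // chunk_mb
    written_mb = chunks * chunk_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(chunks):
                f.write(chunk)
            f.flush()
            # Force the data to the device before stopping the clock.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the test file
    return written_mb / elapsed if elapsed > 0 else float("inf")
```

Run it with a size well above the amount of RAM, for example timed_write("/data", total_mb=8192) where /data is a placeholder for the array under test, otherwise caching on the controller or in the kernel can still skew a short run.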

Just my idea.