Serious file transfer problems (caching until out of memory?)
I'm having a big problem transferring a lot of data through sftp. I have had this problem on different machines, with Red Hat Enterprise 3 WS and Red Hat Enterprise 4 ES U3 and U4. Here are the specs of the servers I used:
Dual Xeon 3.2GHz, Adaptec RAID controller, 4x72GB 10k RPM SCSI in RAID5 (Maxtor HDs), 2 onboard gigabit LAN ports (Broadcom and Intel I think), 4GB DDR-ECC (2x2GB)
Another is the same config except the 4 HDs are set up in RAID 10
Another is the same except it has only 2 HDs in RAID 0 and 8GB of RAM (4x2GB)
Another is a normal P4-4GHz machine with a cheap network card, 2x RAID 1 IDE disks, 2GB of DDR (non-ECC)
All machines show the same problem when I try to make a big sftp transfer to them. I also had the same issue when transferring files from a USB-connected HD.
The transfer starts at around 23MB/sec, which is normal for CAT5E cables, but then it stops, continues, stops, continues and so on; the final average speed for the files (2GB each) ends up between 5 and 10MB/sec. It also sometimes freezes between files for up to 2 or 3 minutes. Then, after a few gigabytes in total have been transferred, Linux starts to complain that there is no more memory available and starts killing processes one by one... until the system goes down with a kernel panic! This can be reproduced on all the machines mentioned above!
I noticed that if I transfer a single 2GB file, the disk system stays very busy for a long, long time after the transfer ends... and top shows 75% "wa" CPU usage (write access?). I have to start top before the transfer, otherwise I have trouble starting it, or it takes a long time to come up.
Does it mean that Linux is caching the whole file in memory and then writing it to disk a long time afterwards? If so, that's insane... I've never seen anything like that.
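One way I could check this is to watch the Dirty and Writeback counters in /proc/meminfo while the transfer runs. Below is a minimal sketch, assuming a kernel that exposes those two fields (2.6 kernels do; I am not sure about older ones) and a Python 3 interpreter on the server. The function names parse_meminfo, dirty_summary and watch are made up for this sketch.

```python
import time

def parse_meminfo(text):
    """Return {field: number} for lines shaped like 'Dirty:   1234 kB'.

    Most /proc/meminfo values are in kB; the unit itself is not checked here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def dirty_summary(text):
    """Pick out the counters relevant to delayed writes (0 if a field is absent)."""
    info = parse_meminfo(text)
    return {key: info.get(key, 0) for key in ("MemFree", "Dirty", "Writeback")}

def watch(path="/proc/meminfo", samples=10, interval=1.0):
    """Print one summary per interval; Linux only, since it reads /proc."""
    for _ in range(samples):
        with open(path) as f:
            print(dirty_summary(f.read()))
        time.sleep(interval)
```

Calling watch(samples=600) in a second terminal during the sftp transfer should show whether Dirty keeps growing while MemFree shrinks, which would point at write caching rather than at the network.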
Now, I did the same transfer through a 100Mbit switch, and the speed was capped at around 10MB/sec. I had the same stop/start problem, for an average transfer speed of 4MB/sec, but no more out-of-memory and crash problems.
Normally the HDs are able to write faster than that (especially in the RAID 1 and RAID 10 configs, and even the RAID5, since a full RAID5 init (205GB) took less than 1 hour), and Linux should never write-cache such an enormous amount of data.
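To put rough numbers on the figures above, here is a small calculation. It assumes 1 GB = 1024 MB and that the RAID5 init really wrote the whole 205GB; since the init took less than an hour, the result is only a lower bound. The helper names are made up for illustration.

```python
def mb_per_sec(gigabytes, seconds):
    """Average throughput in MB/s, taking 1 GB = 1024 MB."""
    return gigabytes * 1024 / seconds

def transfer_seconds(gigabytes, rate_mb_per_sec):
    """Time needed to move the given amount of data at a steady rate."""
    return gigabytes * 1024 / rate_mb_per_sec

# RAID5 init: 205GB in (at most) one hour
print(round(mb_per_sec(205, 3600), 1))      # prints 58.3

# One 2GB file at the initial 23MB/sec versus the observed 5 to 10MB/sec average
print(round(transfer_seconds(2, 23)))       # prints 89
print(round(transfer_seconds(2, 10)))       # prints 205
print(round(transfer_seconds(2, 5)))        # prints 410
```

So under these assumptions the RAID5 array sustained at least about 58MB/sec during its init, well above the 23MB/sec the network delivers, yet each 2GB file takes several minutes instead of about a minute and a half.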
Does anyone know about this problem? What is its cause, and how can it be solved?
Thanks in advance,