If there is "swap usage," and especially if you are seeing swap writes, then you can be fairly certain that some of the memory you are reading in is getting swapped back out, which is exactly the kind of load you don't want in this case.
It's also obvious to me that 75 threads is far, far too many. You should probably repeat this test with only about 8 or 10 threads. Basically, the "ruling constraint" on this system is going to be the number of simultaneous I/O requests you can handle -- which is really going to be determined by the number of drives you have, and the number of independent I/O-channels leading out to those drives. EIDE really can't cut the mustard like SCSI and FireWire can. Multiple controllers on different IRQs may be needed.
Scatter your "10 million files" into a nest of subdirectories, using increasingly larger prefixes of the ID as the subdirectory names, so that for example item #123456 ends up at "/mydata/1/12/1234/123456.txt." Then, using symlinks or whatnot, distribute those files among drives and make sure that your I/O subsystem can support truly simultaneous disk access to all of them.
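As a rough sketch, the ID-to-path mapping might look like this (the `shard_path` name and the 1/2/4 prefix lengths are just illustrative choices matching the example above):

```python
import os

def shard_path(item_id: int, root: str = "/mydata") -> str:
    """Map a numeric ID to a nested path using increasingly larger
    prefixes of the ID as subdirectory names."""
    s = str(item_id)
    # Prefixes of length 1, 2, and 4 become the directory levels.
    parts = [s[:n] for n in (1, 2, 4)]
    return os.path.join(root, *parts, s + ".txt")

print(shard_path(123456))  # → /mydata/1/12/1234/123456.txt
```

The point of the scheme is simply to keep any single directory from holding millions of entries, which many filesystems handle badly.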
Now, your application... "one thread per request" won't cut it. Instead, grab the requests from the socket, put them into a queue, and perhaps at that point start thinking about the layout of your drives.
For example, when the input thread puts another request into the queue, another thread wakes up to see if the request can be satisfied from an in-memory cache; if so, it sends the data straight to an outbound queue, where the output thread can pick it up and send it. Otherwise, the request gets dropped into one of several queues, each corresponding to a device.
A per-device thread takes each request, issues the file read (a synchronous read), gets the data, and puts it into the outbound queue.
The outbound cache-manager thread sees each completed request next and decides whether to keep a copy of the data in its cache. Either way, it forwards the request on to the output thread.
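The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not a drop-in implementation: the queue names, the two-device layout, and the `send` callback are all hypothetical, and a real version would need shutdown handling and a bounded cache.

```python
import queue
import threading

request_q = queue.Queue()                  # filled by the input thread
outbound_q = queue.Queue()                 # drained by the output thread
device_qs = {"sda": queue.Queue(),         # one queue per physical drive
             "sdb": queue.Queue()}
cache = {}                                 # in-memory cache: path -> bytes
cache_lock = threading.Lock()

def dispatcher():
    """Route each request to the cache or to its device's queue."""
    while True:
        path, device = request_q.get()
        with cache_lock:
            data = cache.get(path)
        if data is not None:
            outbound_q.put((path, data))   # cache hit: skip the disk entirely
        else:
            device_qs[device].put(path)    # cache miss: queue for that drive

def device_worker(q):
    """One thread per drive: synchronous reads, one at a time."""
    while True:
        path = q.get()
        with open(path, "rb") as f:
            data = f.read()
        outbound_q.put((path, data))

def output_worker(send):
    """Cache-manager + output: optionally keep a copy, then send."""
    while True:
        path, data = outbound_q.get()
        with cache_lock:
            cache.setdefault(path, data)
        send(path, data)
```

Note that the per-device queues are what serialize access to each drive: no matter how many requests arrive, each spindle only ever sees one outstanding read at a time.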
That is the kind of "separation of concerns" architecture that will deliver the performance you need, with only a handful of threads, orchestrated around the nature of the ruling constraint, which is the hardware.