Originally Posted by mheymann
Your first option is along the correct lines, so
any other ideas in this direction would be really helpful
Okay, that's where I suspected you might be loading the wrong modules or that some other part of the system configuration might be incorrect. In truth I would be inclined to do a full install. I've been running servers in business for 20 years and I have always regretted trying to take a shortcut when setting up a new server. Moving the /etc files wholesale from the old server to the new one might have been a mistake; I don't know. Certainly you can move some individual files such as inittab, passwd, shadow, and fstab, but the rest are probably better off being configured afresh on the new server.
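For example, after a fresh install you can still carry over just that short list of files. A minimal sketch, assuming the old server's disk is mounted at /mnt/old (adjust the paths for your layout):

# Copy only the files that are safe to carry over from the old server
cp /mnt/old/etc/fstab   /etc/fstab
cp /mnt/old/etc/inittab /etc/inittab
cp /mnt/old/etc/passwd  /etc/passwd
cp /mnt/old/etc/shadow  /etc/shadow

Everything else gets rebuilt by hand, so you know it matches the new hardware.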
Some other things you could consider are file system tuning, compiling a kernel tailored to your particular machine, and looking for particular files that create an excessive I/O bottleneck.
Ideas for file system tuning:
What file system type are you using? Although I've only used ext2, I've read here on LQ that the ReiserFS and XFS file systems are good performers.
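If you're not sure what you're running, the mount command or df will tell you:

mount          # each line ends with the file system type, e.g. "type ext2 (rw)"
df -T          # -T adds a Type column for each mounted file system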
You could mount the file system with the noatime option. This reduces the number of writes to the disk because file access times are no longer recorded. I heard about this here at LQ as well; you can see that we're all learning all the time.
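A minimal example, assuming your data lives on a partition mounted at /home (the device and mount point are just placeholders):

# Try it on the running system first:
mount -o remount,noatime /home
# If you like the results, make it permanent in /etc/fstab:
# /dev/sda3   /home   ext2   defaults,noatime   1 2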
What is the block size of your file system? If your files are large then you can use a large block size for the partition. Small block sizes reduce wasted disk space but cost time when modifying large files; large block sizes on large files speed up file operations.
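To see what you have now on ext2, assuming /dev/sda3 is the partition in question:

tune2fs -l /dev/sda3 | grep 'Block size'
# The block size can only be chosen when the file system is created,
# and re-running mkfs DESTROYS the data on the partition:
# mkfs.ext2 -b 4096 /dev/sda3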
What is the chunk size of your RAID read/write operations? This can make a big difference in performance. Regrettably, I think you need to do your own testing to find the optimum size; I've seen some performance tests where 4K appears to be a good setting.
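If you're on Linux software RAID with mdadm, you can check an existing array like this (with the older raidtools the chunk-size line lives in /etc/raidtab instead; /dev/md0 and the member partitions below are just examples):

mdadm --detail /dev/md0 | grep -i chunk
# Chunk size is fixed at creation time, e.g. a two-disk RAID-0 with 32K chunks:
# mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=32 /dev/sda1 /dev/sdb1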
Is your RAID set to perform write-through or write-back operations? Write-through is safer but slower; write-back is faster because it doesn't wait for the operation to complete before moving on to the next disk operation. (The same is true for your CPU cache.)
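On a hardware RAID card that setting lives in the controller's BIOS, but on plain IDE/SATA disks you can at least toggle the drive's own write cache with hdparm (the device name is an example):

hdparm -W  /dev/hda     # show the current write-caching flag
hdparm -W1 /dev/hda     # enable write caching (faster, riskier on power loss)
hdparm -W0 /dev/hda     # disable it for write-through behaviour (safer)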
Speaking of the CPU, you might want to consider looking at motherboard tuning. I'm NOT saying to overclock your CPU or memory, but you may be able to enable things like PCI bus mastering.
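The setting itself is usually changed in the BIOS, but you can check from Linux whether a device already reports bus mastering by looking for "bus master" in each device's Flags line:

lspci -v | grep -i -B 4 'bus master'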
Back to the file system. If you have one file in the file system that is used a LOT while others are used less, then maybe you could move that file to another physical disk. Spreading your I/O load over more spindles can make a big difference in disk response. Notice that I'm not saying to move files between partitions on the same disk; I'm talking about a new physical disk (set).
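A common way to do that without touching the application's configuration is to move the file and leave a symlink behind (the paths here are made up for illustration; stop the application first):

# /var/lib/app/hot.db is the busy file, /disk2 is a freshly mounted spindle
mv /var/lib/app/hot.db /disk2/hot.db
ln -s /disk2/hot.db /var/lib/app/hot.db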
There is a commercial product called SARCheck. I haven't used it, but it claims to watch your computer's workload and make recommendations about tuning the settings in the /proc area. It might be worth a look. I used to use something like this on VMS and it was my best friend for performance tuning. I've also heard good things about BMC Patrol, but I haven't used that either.
Profile your workload. Find out what resources are being used, at what rate, and at what time of day. Naturally you are mostly interested in the workload during the hours that users are actually working on the machine. You could do this with a cron job that periodically runs vmstat or iostat or sar for a limited time, putting the data into a file that you can read at your leisure. vmstat, iostat, and sar will show you things like the number of blocked processes, the page-in/page-out rate, and CPU wait times. You can get a really good view of the disk load by using iostat with the -x and -p options; that will show you the I/O on each disk partition.
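A sketch of such a cron entry, assuming the sysstat package is installed (the schedule and sample counts are just examples):

# /etc/crontab: on weekdays, at the top of each business hour, sample
# extended per-partition I/O stats every 30 seconds for 10 minutes:
0 8-17 * * 1-5  root  /usr/bin/iostat -x -p ALL 30 20 >> /var/log/iostat.log 2>&1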
You can and should keep working your way outward toward the client computers. Look at your NIC(s). Are you using 10/100 Mbps or gigabit speed? Are you certain that they're running at full duplex? If you are using gigabit, are you sure that your hub is gigabit speed on all ports? Many hubs advertised as gigabit actually only have gigabit speed on the port that connects them to the LAN, while the ports that connect the computer NICs are 10/100.

You can also look at network topology. Put a sniffer on different parts of the LAN and measure network traffic. Look for saturation and the resulting bottlenecks, and for packet patterns that indicate faulty NIC hardware or a faulty configuration. I recently found a client machine that was experiencing a lot of duplicate ACKs, packet retransmissions, and out-of-order packets. I replaced the NIC and the performance on that machine increased dramatically; apparently the NIC was bad.

Finally, look at the client computers. Maybe they are overloaded by this application. Of course you can and should do several of these things concurrently, and all of them should be part of your normal system administration routine.
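Checking the negotiated speed and duplex from Linux is quick (eth0 is an example; mii-tool is an older alternative on some systems):

ethtool eth0 | grep -E 'Speed|Duplex'
# For a quick look at raw traffic on a segment:
tcpdump -n -i eth0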
Post what you end up doing and the results. I'm very interested in following your progress.