Very slow software RAID5/LVM array - which drive is dying?
I have an Ubuntu 10.04 system which is used as a file-server, primarily for storing video. The setup combines two RAID5 arrays joined in an LVM (details below).
It's served me very well in the past, with decent enough performance for my use (around 150-200MB/s sequential reads for example). All I really ask of it is to be able to stream HD video, which shouldn't be too onerous.
As I say, the setup used to work absolutely fine, but it now sometimes grinds to a halt (i.e. <1MB/s reads, 100ms+ seek times, ...), even with no load on the system. I strongly suspect one of the hard drives is on its way out, but can't tell which one. I've looked at the drives in system monitor and they all look healthy: SMART reports each of them as having either no bad sectors or just one or two. I've run transfer tests on each individual drive and they're perfectly fast. The problem is that the issue is intermittent, so when I run a test over a particular drive it's fine more often than not.
Any suggestions for how to pin down this problem?
Physical drives (all SATA):
-- 2x 320GB drives (partitions: 320GB)
-- 3x 750GB drives (partitions: 320GB, 430GB)
-- 1x 1.5TB drive (partitions: 320GB, 430GB)
-- 1x 6 drive RAID5 array, comprising the 320GB partitions.
-- 1x 4 drive RAID5 array, comprising the 430GB partitions.
-- One VG comprising the two RAID5 arrays.
(It's a slightly odd setup, the aim is to be able to grow the array by adding larger drives in future.)
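As a sanity check on the layout above, the usable capacity follows from the RAID5 rule that one member's worth of space goes to parity. This is a minimal sketch, assuming the listed partition sizes are exact and ignoring md, LVM and filesystem overhead; the function name is made up for illustration.

```python
def raid5_usable_gb(members: int, member_size_gb: int) -> int:
    """Usable space of a RAID5 array: one member's worth is lost to parity."""
    if members < 3:
        raise ValueError("RAID5 needs at least 3 members")
    return (members - 1) * member_size_gb


array_320 = raid5_usable_gb(6, 320)  # six 320GB partitions
array_430 = raid5_usable_gb(4, 430)  # four 430GB partitions
vg_total = array_320 + array_430     # both arrays sit in the one VG

print(array_320, array_430, vg_total)  # 1600 1290 2890
```

That comes to roughly 2.9TB for the volume group. The 750GB and 1.5TB drives each hold a member of both arrays, so a stall on any one of those four drives would be felt by both arrays at once.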
Swapping each drive out for a new one, rebuilding, and then seeing whether the problem goes away may be the way to go.
I'd look at all smart data but it may end up being controller or cables or other issues.
Thanks for the reply. I'm hoping there's a better way though; swapping out a drive and rebuilding the array will take a very long time (especially with the array running slow). Rebuilds take several hours to complete and I'd need to do that 6 times.
I might try moving the drives between controllers, and swapping cables out though - that would be quicker...
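One cheaper way to catch an intermittent stall is to compare per-drive read latency at the moment the array is slow. The sketch below assumes a Linux kernel exposing /proc/diskstats, where the fourth column is reads completed and the seventh is milliseconds spent reading. The function names are illustrative, and this is only one possible approach, not a definitive diagnostic.

```python
def parse_diskstats(text):
    """Return {device: (reads_completed, ms_spent_reading)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        # fields: major, minor, name, reads completed, reads merged,
        #         sectors read, ms spent reading, ...
        stats[fields[2]] = (int(fields[3]), int(fields[6]))
    return stats


def read_latency_ms(before, after):
    """Average milliseconds per completed read for each device between two samples."""
    result = {}
    for name, (reads1, ms1) in after.items():
        if name not in before:
            continue
        reads0, ms0 = before[name]
        done = reads1 - reads0
        if done > 0:  # devices with no reads in the interval are left out
            result[name] = (ms1 - ms0) / done
    return result


# Usage on the server: read /proc/diskstats twice, a few seconds apart,
# while the array is slow, and pass both texts through the functions above:
#   before = parse_diskstats(open("/proc/diskstats").read())
#   ... wait ...
#   after = parse_diskstats(open("/proc/diskstats").read())
#   print(read_latency_ms(before, after))
```

A drive whose figure is far higher than the other array members during a stall is the one to look at first, along with its cable and controller port.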
Run a smart long test on each, and then post the attributes.
I've run a long SMART test on each of the 7 drives (that's the 6 drives in the array, plus the OS boot drive), run as "sudo smartctl --test=long <device>". All tests have passed. I've attached the full SMART data for all of them below; that's the output of "sudo smartctl --all <device>".
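With seven drives' worth of output, it can help to pull out only the attributes that usually matter for a failing drive or a bad cable. This is a rough sketch that assumes the usual ten-column attribute table printed by smartctl, with the raw value in the last column. Attribute names vary between vendors, so the watch list is an assumption, and the sample text is invented for illustration rather than taken from these drives.

```python
# Attribute names commonly reported by smartctl; not every drive exposes all of them.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",   # often points at cabling rather than the drive
    "Temperature_Celsius",
)


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            continue  # raw values such as "1234h+56m" are skipped
    return attrs


def summarize(attrs):
    """Keep only the watched attributes that this drive actually reports."""
    return {name: attrs[name] for name in WATCHED if name in attrs}


# Made-up sample in the smartctl table layout (NOT real data from this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
194 Temperature_Celsius     0x0022   050   040   000    Old_age   Always       -       50 (0 16 0 0)
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(summarize(parse_smart_attributes(SAMPLE)))
```

Running the same summary over each drive's saved output makes it easier to see whether one drive stands out from the rest.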
Any ideas? I'm wondering whether something else might be causing the array to be slowing down, but can't think what that might be. In terms of the drives, what's a sensible upper limit for the temperature they should run at? Some of them are just over 50 celsius, I don't know whether that's reasonable.
Here are interesting bits:
This means that no drives are failing, but I would add some more fans.
Did you update recently or change anything on this system? Maybe it was a bad update, or something changed to cause this ...
Maybe check the logs for anything suspicious: /var/log/messages and /var/log/syslog.
Also check the cables.
Thanks for the help - I'll look at adding more fans to the case.
I haven't changed anything recently (other than installing standard security updates for Ubuntu). It's possible that the case has gathered some dust over time - I'll clean the filters for the case fans and see if I can blow some dust out of the case. What's a safe temperature for drives to operate at?
As and when I get new drives I'll try to get some 5400rpm ones rather than 7200rpm as they should run a little cooler.
I have the same drive and it runs at:
According to the manual:
The max operating temp is 60C, and yours have gone over in the past.
Having given it a good clean and rerouted some cables for better airflow, the drive temperature is now a little lower (46C for those two drives). However that doesn't seem to have helped - I'm still seeing poor performance with occasional long waits for access.
One thing which has now occurred to me though is file fragmentation. Many of the files here have been downloaded by Bittorrent, and when upgrading from Ubuntu 8.04 to 10.04 recently, I changed Bittorrent client from Vuze to Transmission (the default Ubuntu client). One change is that Vuze allocates the entire file on disk prior to starting the download, whereas Transmission allocates it incrementally while downloading.
I suspect that it's resulting in some very fragmented files, which is making access very slow. Picking a recently-downloaded 50MB file at random, filefrag reports it has 1399 extents which seems very poor to me (the array isn't very full: 600GB free out of 2.9TB). Picking an old 350MB file shows a more reasonable 39 extents.
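To put those two filefrag results on the same scale, the extent count can be turned into an average extent size. This is a small sketch, assuming the "<path>: N extents found" summary line that filefrag prints and treating the quoted MB sizes as MiB; the function names and file names are made up.

```python
import re


def extents_from_filefrag(line):
    """Extract the extent count from a filefrag summary line."""
    match = re.search(r"(\d+) extents? found", line)
    if match is None:
        raise ValueError("not a filefrag summary line: %r" % line)
    return int(match.group(1))


def avg_extent_kib(size_mib, extents):
    """Average contiguous run per extent, in KiB."""
    return size_mib * 1024 / extents


# Figures quoted in the post; the file names are placeholders.
new_file = extents_from_filefrag("recent-download.avi: 1399 extents found")
old_file = extents_from_filefrag("old-file.avi: 39 extents found")

print(round(avg_extent_kib(50, new_file), 1))          # about 36.6 KiB per extent
print(round(avg_extent_kib(350, old_file) / 1024, 1))  # about 9 MiB per extent
```

So the recent file averages only a few dozen KiB of contiguous data per extent, against several MiB for the old one, which means far more seeks for the same amount of sequential reading.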
A quick Google shows this bug report/discussion relating to precisely this issue: https://trac.transmissionbt.com/ticket/849. I'll set the option in Transmission to preallocate files, and see if that helps. I suspect the RAID/LVM setup I have exacerbates the problem (having two bits of the same drive in the same logical volume). This is an EXT3 filesystem - I wonder whether EXT4 would have helped at all...
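For illustration, this is roughly what "allocate the whole file before downloading" looks like from a program's point of view. It is a sketch using Python's os.posix_fallocate on Linux, and is not how Vuze or Transmission actually implement it. Whether the reserved space ends up contiguous depends on the filesystem and how full it is. The helper name and file name are hypothetical.

```python
import os
import tempfile


def preallocate(path, size_bytes):
    """Reserve size_bytes for path up front, before any data is written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Ask the OS to allocate the full range [0, size_bytes) now, so the
        # filesystem can place it in one go rather than piece by piece.
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)


# Demo in a throwaway directory: the file has its full size immediately.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "download.part")
    preallocate(target, 1024 * 1024)
    print(os.path.getsize(target))  # 1048576
```

Pieces arriving out of order can then be written into space that is already reserved, instead of growing the file a little at a time in whatever free blocks happen to be available.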
Anyway, thanks for the help - hopefully I have what I need now.
I didn't know about filefrag, and was looking for such a program. I probably didn't find it because it can only be run as root.
Certainly 1399 extents is very fragmented. Try copying the file and using that instead (you can use cp or dd to copy it).