Old 10-06-2013, 05:31 PM   #1
AndyB78
LQ Newbie
 
Registered: Oct 2013
Posts: 2

Rep: Reputation: Disabled
Various problems on a Linux software RAID-1 (mdraid) with Samsung 840 Pro SSDs


Hello,

We recently bought a server fitted with 2 x Samsung 840 Pro SSDs assembled into a Linux software RAID-1 array, and we have been having problems with it ever since.

(1) First, I noticed a very acute write-speed problem when creating or copying larger files (hundreds of MB or several GB), accompanied by serious load spikes:

root [~]# w
01:02:14 up 55 days, 57 min, 2 users, load average: 0.48, 1.07, 1.84
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT

root [~]# time dd if=backup-10.5.2013_23-22-11_xxxxxxxx.tar.gz of=test3 oflag=sync bs=1G
0+1 records in
0+1 records out
307191761 bytes (307 MB) copied, 43.0388 s, 7.1 MB/s

real 0m43.060s
user 0m0.000s
sys 0m1.228s

root [~]# w
01:03:07 up 55 days, 58 min, 2 users, load average: 17.97, 5.22, 3.18
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT

As you can see, a 307 MB file was copied in 43 seconds at an average speed of 7 MB/s. These SSDs should be able to do it in about a second, at hundreds of MB per second.

Also, this time the load spike was comparatively moderate. With 500 MB files the load spikes to 30-40, and with 1 GB files it can reach 100.
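To take the page cache and sync semantics out of the picture, I suppose a direct-I/O variant of the same copy is worth trying (a sketch; the input file is the one from the test above, the output name is arbitrary):

Code:
# Same copy, but writing with O_DIRECT to bypass the page cache
# (with oflag=direct, bs must be a multiple of the device sector size)
time dd if=backup-10.5.2013_23-22-11_xxxxxxxx.tar.gz of=test3-direct oflag=direct bs=1M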

(2) During the same kind of operation, iostat looks funny. A few samples taken during the copy:

Right at the start:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               1.50  6298.50  311.50  613.50 64616.00 55270.50   129.61     3.18    3.44   0.27  25.10
sdb               2.50  6603.50  284.50  230.50 61576.00 12054.50   142.97    23.50    8.54   1.58  81.40
md1               0.00     0.00    0.00  393.00     0.00  3144.00     8.00     0.00    0.00   0.00   0.00
md2               0.00     0.00  599.00 6814.00 125936.00 54504.00    24.34     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
After 2 seconds:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  1589.50    0.00   54.00     0.00 13148.00   243.48     0.60   11.17   0.46   2.50
sdb               0.00  1627.50    0.00   16.50     0.00  9524.00   577.21   144.25 1439.33  60.61 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 1602.00     0.00 12816.00     8.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
.......... (~40 seconds)

42 seconds later:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00   14.50     0.00 11788.00   812.97   143.62 7448.45  68.97 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
As you can see, for the first few seconds /dev/sda spiked to 25% util, but /dev/sdb went to 100% and stayed there for the remaining ~40 seconds (while /dev/sda barely broke a sweat). It took 43 seconds for a 300 MB file; with larger files it obviously takes much longer.
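Since only /dev/sdb is pegged, a raw read test of each member might show whether that drive is slow on its own (reads only, so it should be safe on the live array; a sketch — it won't expose write-path garbage collection, but it would rule out gross device problems):

Code:
# Compare buffered read throughput of the two RAID members directly
hdparm -t /dev/sda
hdparm -t /dev/sdb
# Or with dd, reading 1 GB from each raw device via O_DIRECT
dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct
dd if=/dev/sdb of=/dev/null bs=1M count=1024 iflag=direct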

Also, in the very first iostat iteration (which shows averages since the last reboot), %util differs noticeably between the two members of the same RAID-1 array:

Code:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              10.44    51.06  790.39  125.41  8803.98  1633.11    11.40     0.33    0.37   0.06   5.64
sdb               9.53    58.35  322.37  118.11  4835.59  1633.11    14.69     0.33    0.76   0.29  12.97
md1               0.00     0.00    1.88    1.33    15.07    10.68     8.00     0.00    0.00   0.00   0.00
md2               0.00     0.00 1109.02  173.12 10881.59  1620.39     9.75     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.41    0.01     3.10     0.02     7.42     0.00    0.00   0.00   0.00
(3) The wear of the two SSDs is very different:

Code:
root [~]# smartctl --attributes /dev/sda | grep -i wear
177 Wear_Leveling_Count     0x0013   095%   095   000    Pre-fail  Always       -       180
root [~]# smartctl --attributes /dev/sdb | grep -i wear
177 Wear_Leveling_Count     0x0013   072%   072   000    Pre-fail  Always       -       1005
/dev/sda: 5% wear
/dev/sdb: 28% wear

The total number of LBAs written to the two members also differs, but not by nearly as much:

Code:
root [~]# smartctl --attributes /dev/sda | grep -i LBA
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       21912041841
root [~]# smartctl --attributes /dev/sdb | grep -i LBA
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       23720836220
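To see whether the gap is still widening, I plan to log both counters periodically and compare the deltas, something like this (a sketch; the log path is arbitrary):

Code:
# Append a timestamped snapshot of the wear and LBA counters for both drives
for d in /dev/sda /dev/sdb; do
    echo "$(date +%F) $d $(smartctl --attributes $d | \
        awk '/Wear_Leveling_Count|Total_LBAs_Written/ {printf "%s=%s ", $2, $10}')"
done >> /var/log/ssd-wear.log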
(4) Also, following the I/O load with iotop, I've noticed moments when jbd2 sits at close to 100% I/O without writing or reading much (or anything, for that matter).
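As far as I understand, jbd2 is the ext4 journal commit thread, so this may just be journal flushing. To inspect the journal setup, something like this should work (a sketch):

Code:
# Show the journal parameters of the root filesystem (read-only, safe to run live)
dumpe2fs -h /dev/md2 | grep -i journal
# For testing, the commit interval can also be raised at mount time, e.g.:
# mount -o remount,commit=15 /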


Some background info:
- this is a shared cPanel server (web, mail, MySQL etc.)
- the SSDs have been in use for exactly the same amount of time, and I know of no resyncs during this time
- initially they had the DXM04B0Q firmware, but I have updated both to DXM05B0Q
- I have looked for "hard resetting link" in dmesg to check for cable/port issues, but found nothing
- I believe they are aligned correctly (listing below)
- the OS is CentOS 6.4, kernel 2.6.32-358.11.1.el6.x86_64
- the write-intent bitmap was removed right at the start
- the system was installed from a minimal installation DVD and packages were added as needed (it didn't even have the man pages)
- tested with and without discard and noatime
- tested with all I/O schedulers (see the sketch below this list)
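For reference, this is roughly how the scheduler was switched between test runs (a sketch using the sysfs paths on this kernel):

Code:
# The active scheduler is the one shown in brackets
cat /sys/block/sdb/queue/scheduler
# Switch to e.g. deadline for a run, then repeat the dd benchmark
echo deadline > /sys/block/sdb/queue/scheduler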

root [~]# fdisk -ul /dev/sda
Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00026d59

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sda3         4605952   814106623   404750336   fd  Linux raid autodetect


root [~]# fdisk -ul /dev/sdb
Disk /dev/sdb: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003dede

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sdb3         4605952   814106623   404750336   fd  Linux raid autodetect

MOUNT
root [/var/log]# mount
/dev/md2 on / type ext4 (rw,noatime,usrjquota=quota.user,jqfmt=vfsv0)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/md0 on /boot type ext4 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/usr/tmpDSK on /tmp type ext3 (rw,noexec,nosuid,loop=/dev/loop0)
/tmp on /var/tmp type none (rw,noexec,nosuid,bind)


/etc/fstab
root # cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Wed Apr 3 17:22:52 2013
#
UUID=8fedde2c-f5b7-4edf-975f-d8d087d79ebf / ext4 noatime,usrjquota=quota.user,jqfmt=vfsv0 1 1
UUID=bfc50d02-6d4d-4510-93ea-27941cd49cf4 /boot ext4 noatime,defaults 1 2
UUID=cef1d19d-2578-43db-9ffc-b6b70e227bfa swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
/usr/tmpDSK /tmp ext3 noatime,defaults,noauto 0 0


/proc/mdstat
root # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb2[1] sda2[0]
204736 blocks super 1.0 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
404750144 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb1[1] sda1[0]
2096064 blocks super 1.1 [2/2] [UU]

unused devices: <none>


What is the problem with this array?

Last edited by AndyB78; 10-06-2013 at 05:40 PM.
 
Old 10-06-2013, 06:25 PM   #2
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 3,339

Rep: Reputation: Disabled
I can't see any obvious problems with your array, but I did notice something unusual about your dd command.
Quote:
Originally Posted by AndyB78
root [~]# time dd if=backup-10.5.2013_23-22-11_xxxxxxxx.tar.gz of=test3 oflag=sync bs=1G
Surely, a block size of one gigabyte is a bit on the large side? AFAIK, this will cause dd to read a gigabyte worth of data into a buffer before writing the contents of that buffer to the destination device/file.

If you have less than a gigabyte of free RAM, this will cause the kernel to start swapping out pages, which could explain the abysmal performance. Have you tried using a slightly more reasonable block size, like say, a few megabytes?
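Something along these lines might give a fairer number (the file names are placeholders and the block size is just an example):

Code:
# A few megabytes per block, still syncing each write to disk
time dd if=yourfile.tar.gz of=testfile bs=4M oflag=sync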
 
Old 10-07-2013, 05:20 AM   #3
AndyB78
LQ Newbie
 
Registered: Oct 2013
Posts: 2

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Ser Olmy
I can't see any obvious problems with your array, but I did notice something unusual about your dd command.
Surely, a block size of one gigabyte is a bit on the large side? AFAIK, this will cause dd to read a gigabyte worth of data into a buffer before writing the contents of that buffer to the destination device/file.

If you have less than a gigabyte of free RAM, this will cause the kernel to start swapping out pages, which could explain the abysmal performance. Have you tried using a slightly more reasonable block size, like say, a few megabytes?
The reason behind my bs=1G (I'm not saying it was a good choice) was to have all the data readily available in memory by the time the write starts. Also, for a 307 MB file, I believe there was enough memory available:

Code:
root [~]# free -m
             total       used       free     shared    buffers     cached
Mem:         15921      14378       1543          0        921       9996
-/+ buffers/cache:       3461      12460
Swap:         2046       1110        936
Though I notice swap has been used (but I don't know when).
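To catch whether swapping actually happens while the test runs, I could watch vmstat in a second terminal (a sketch):

Code:
# Run alongside the dd test; non-zero si/so columns mean real swap traffic
vmstat 1
Anyway, back to the point: I've redone the test with bs=2M as you suggested: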

Code:
root [~]# time dd if=arch.tar.gz of=test4 bs=2M oflag=sync
146+1 records in
146+1 records out
307191761 bytes (307 MB) copied, 23.6788 s, 13.0 MB/s

real    0m23.680s
user    0m0.000s
sys     0m0.932s
A bit better, but still only about 5% of what it should be able to do.

Also during the copy iostat again shows something interesting:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    1.00    0.00   192.00     0.00   192.00     0.00    1.00   1.00   0.10
sdb               0.00     0.00    0.00    6.00     0.00  4548.00   758.00   106.55 3830.67 166.67 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    1.00    0.00   192.00     0.00   192.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
10 seconds later:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              16.00   203.00  103.50  811.50  2144.00  8026.00    11.11     0.62    0.68   0.17  15.75
sdb              12.50   177.50  107.50  840.50  1608.00 11066.00    13.37     2.01   69.84   0.54  51.00
md1               0.00     0.00  111.50  343.00   892.00  2744.00     8.00     0.00    0.00   0.00   0.00
md2               0.00     0.00  128.00  670.00  2860.00  5284.00    10.21     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
And at the end:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    14.50   26.00   37.00   224.00   375.50     9.52     0.03    0.54   0.27   1.70
sdb               0.00    14.50   17.50   37.00   140.00   375.50     9.46     0.03    0.48   0.33   1.80
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00   43.50   48.00   364.00   372.00     8.04     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Look at svctm and %util. When the I/O load is intensive, svctm and await are many times larger for sdb than for sda. When it's not stressed (the last iteration above), the values are similar. Look at the first iteration, with the averages since reboot:
Code:
#iostat -x
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              10.42    51.42  785.17  125.58  8752.47  1638.07    11.41     0.33    0.37   0.06   5.63
sdb               9.49    58.72  319.83  118.26  4803.77  1638.07    14.70     0.37    0.85   0.30  13.01
md1               0.00     0.00    1.91    1.33    15.29    10.68     8.00     0.00    0.00   0.00   0.00
md2               0.00     0.00 1101.37  173.65 10823.74  1625.35     9.76     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.41    0.01     3.07     0.02     7.42     0.00    0.00   0.00   0.00
Even on average, svctm is 5 times larger for sdb. Isn't svctm dependent solely on the device itself (the second SSD in the array)?
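Before blaming the drive itself, I also want to rule out a degraded SATA link on sdb's port; checking the negotiated link speed would go something like this (a sketch):

Code:
# The negotiated speed shows up in the kernel log, e.g. "SATA link up 6.0 Gbps"
dmesg | grep -i 'SATA link up'
# Recent smartctl versions also report current vs. maximum link speed
smartctl -i /dev/sdb | grep -i sata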
 
  

