LinuxQuestions.org
Old 04-05-2014, 01:30 AM   #1
blockdump
LQ Newbie
 
Registered: Jan 2012
Location: vienna
Distribution: Redhat, Oracle Enterprise Linux, Fedora
Posts: 10

sar reports more tps on a logical volume than on the underlying disk


Hi,

Before opening a new thread here I did some research in the documentation and on the internet, but I couldn't find an explanation.

When I run, for example, "sar -dp 2 22", I see far more tps on a particular LVM logical volume than on the underlying disk.

There is one disk, /dev/sdd, which is used as the physical volume for the volume group PPWP (see below). Several logical volumes were created from that volume group.

The filesystem on the logical volumes is ext3. When the database does I/O on the LV, sar shows more tps on the LV than on the PV (= /dev/sdd).

Code:
[serv1:/dev/mapper]sar -d 2 2
Linux 2.6.18-164.el5 (serv1)  04/05/2014

07:57:29 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
07:57:31 AM    dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM    dev8-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM    dev8-2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM   dev8-16      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM   dev8-32      1.00      0.00     32.00     32.00      0.00      0.00      0.00      0.00
07:57:31 AM   dev8-48    881.50 582584.00   3743.00    665.15      6.93      7.97      1.14    100.50    <<<< /dev/sdd Disk
07:57:31 AM  dev253-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM  dev253-2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM  dev253-3      6.00      0.00   1833.50    305.58      0.01      1.67      1.67      1.00
07:57:31 AM  dev253-4      6.50      0.00   1813.50    279.00      0.01      1.54      1.54      1.00
07:57:31 AM  dev253-5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
07:57:31 AM  dev253-6   2257.00 580016.00     96.00    257.03     15.57      6.94      0.45    100.50   <<< Logical volume  PPWP-lvdata 
07:57:31 AM  dev253-7      1.00      0.00     32.00     32.00      0.00      0.00      0.00      0.00

As you can see, dev253-6 shows more than twice the tps of the underlying disk.

This is the disk (dev8-48):

Code:
[serv1:/dev/mapper]ll /dev/sdd
brw-r----- 1 root disk 8, 48 Feb 11 14:03 /dev/sdd


This is the logical volume (dev253-6):

Code:
[serv1:/dev/mapper]ll /dev/mapper/
total 0
crw------- 1 root root  10, 63 Nov 20 19:56 control
brw------- 1 root root 253,  7 Nov 20 19:57 PPWPFRA-lvfra
brw------- 1 root root 253,  6 Nov 20 19:57 PPWP-lvdata       <<<<<< dev253-6
brw------- 1 root root 253,  5 Nov 20 19:57 PPWP-lvoracle
brw------- 1 root root 253,  3 Nov 20 19:57 PPWP-lvredo1
brw------- 1 root root 253,  4 Nov 20 19:57 PPWP-lvredo2
brw------- 1 root root 253,  0 Nov 20 19:57 VolGroup00-root
brw------- 1 root root 253,  2 Nov 20 19:56 VolGroup00-swap
brw------- 1 root root 253,  1 Nov 20 19:57 VolGroup00-var
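
The same major:minor to LV name mapping can also be cross-checked with dmsetup (a minimal sketch, run as root; output abbreviated):

Code:
# list device-mapper devices with their (major, minor) numbers
dmsetup ls
# e.g. PPWP-lvdata  (253, 6)

# show which physical volumes back each logical volume in the VG
lvs -o +devices PPWP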

This is the physical volume (disk /dev/sdd), the only disk in this volume group:

Code:
[serv1:/dev/mapper]pvs
  PV         VG         Fmt  Attr PSize   PFree
  /dev/sda2  VolGroup00 lvm2 a-    89.88G     0
  /dev/sdb              lvm2 --    54.00G 54.00G
  /dev/sdc   PPWPFRA    lvm2 a-   150.00G 96.00M
  /dev/sdd   PPWP       lvm2 a-   950.01G 16.00M      <<<<<

My first assumption was that the filesystem cache could be the reason, but the database running on this system does direct I/O, so I ruled that out.

Thanks for any explanations or guesses!

best regards

Last edited by blockdump; 04-05-2014 at 10:15 AM.
 
Old 04-05-2014, 07:06 PM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,140

LVM just takes I/O requests, maps them and passes them to the block device driver, in this case sd. sd, the SCSI disk driver, does queuing and scheduling. It can do read-ahead to reduce the number of reads, and will also coalesce sequential writes. Sounds like it is doing its job.
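
One way to see that coalescing directly is iostat's extended statistics: the rrqm/s and wrqm/s columns count requests that were merged before being issued to the device (a minimal sketch; assumes the same sysstat tools already in use here):

Code:
# extended per-device statistics at a 2-second interval
iostat -x 2
# compare rrqm/s and wrqm/s (merged) against r/s and w/s (actually issued) for sdd

In the sar output above the read throughput is nearly identical on both devices (582584 vs 580016 rd_sec/s) while avgrq-sz is 665 vs 257, i.e. the same data moves in fewer, larger requests at the disk.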
 
Old 04-05-2014, 07:08 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

LVM is a(nother) block device layer - just stacked on top of the "real" block device driver. So your direct_io is going to it (LVM). The "real" I/O is then handed down to be queued against the actual device (ignoring virtualisation and hardware smarts, which may add another layer or two).
The block device driver merges I/Os to reduce head thrashing (amongst other things). You can see this happening with blktrace.
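
For example (a minimal sketch; assumes the blktrace package is installed and that /dev/sdd is the device of interest):

Code:
# trace the block layer on the disk and decode the events inline
blktrace -d /dev/sdd -o - | blkparse -i -
# Q = request queued, M/F = back/front merge, D = issued to the driver, C = completed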
 
Old 04-07-2014, 12:36 AM   #4
blockdump
LQ Newbie
 
Registered: Jan 2012
Location: vienna
Distribution: Redhat, Oracle Enterprise Linux, Fedora
Posts: 10

Original Poster
Hello,

Thanks for the replies.
I forgot to mention that this disk is using the "noop" scheduler, so I assumed the disk queue handles the I/O requests in the order in which they arrive, without reordering anything in the queue.

Based on the previous two posts, I assume the values in the "svctm" column reflect this behaviour, since svctm on dev253-6 is much smaller than on dev8-48.

But the await times are not that different (about 7 ms versus 8 ms). I was thinking that if "noop" is used as the I/O scheduler, working on a first-come, first-served basis, the I/O requests are handed to the underlying block device in the same sequence in which they arrive in the queue. So shouldn't the await time on dev253-6 also be much smaller?
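
For reference, the scheduler actually in effect can be checked via sysfs (a minimal sketch; paths assumed, the available schedulers depend on the kernel build):

Code:
# the active scheduler is shown in brackets
cat /sys/block/sdd/queue/scheduler
# e.g. [noop] anticipatory deadline cfq
# as far as I understand, noop only skips sorting; adjacent requests can still be merged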

Thanks!

best regards
 