So, once I read the performance reviews of the Samsung SM951, I simply couldn't wait. Reads in the 2000 MB/s range? Writes in the 1000 MB/s range? Sign me up!
I've been eyeing PCI-e SSDs for a while. I bought an AData 64GB SP310 m.2 for my Latitude D430 and it works great! However, I haven't really been impressed with the controllers used in most PCI-e SSDs (generally just SATA controllers). So, with NGFF maturing into m.2, and now with the Samsung SM951, I finally worked up the courage to spend the $200 and jump in.
Basically, the SM951's full throughput is only achievable on a PCI-e v3.0 bus. However, I figured I should still see a noticeable performance improvement on a PCI-e 1.0 bus.
I have to admit, I was skeptical about whether I should jump in, knowing the biggest caveat in the equation. Now that I have empirical evidence, I realize why.
Here is the SSD I purchased off of Amazon:
Samsung SM951 256GB AHCI MZHPV256HDGL-00000 M.2 80mm PCIe 3.0 x4 SSD - OEM
http://www.amazon.com/gp/product/B00...ilpage_o02_s00
And the Addonics PCI-e adapter:
Addonics ADM2PX4 M2 Pcie Ssd Pcie 3.0 4-lane Accs Adapter
http://www.amazon.com/gp/product/B00...ilpage_o03_s00
And the system I'm using:
HP ProLiant ML350 G5
http://www8.hp.com/h20195/v2/GetPDF.aspx/c04284193.pdf
Here is what I see in the lspci output on my slackware64-current system (HP ProLiant ML350 G5):
Code:
06:00.0 SATA controller: Samsung Electronics Co Ltd Device a801 (rev 01) (prog-if 01 [AHCI 1.0])
Subsystem: Samsung Electronics Co Ltd Device a801
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 26
Region 5: Memory at cdff0000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/8 Maskable- 64bit+
Address: 00000000fee0f00c Data: 4162
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <4us, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158 v1] Power Budgeting <?>
Capabilities: [168 v1] #19
Capabilities: [188 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [190 v1] #1e
Kernel driver in use: ahci
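(Side note: there's no need to wade through the whole dump each time; grepping lspci for just the link lines gets straight to the point. 06:00.0 is the SM951's address on my system, so adjust accordingly:)
Code:
# compare what the link is capable of vs. what it actually trained at
lspci -s 06:00.0 -vv | grep -E 'LnkCap:|LnkSta:'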
I want to emphasize the following. Although the link capability specifies:
Code:
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <4us, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot-
...I believe this is what it's actually "training" at:
Code:
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
This is right on par with what Wikipedia says:
Code:
PCI-E Ver  Line code  Transfer rate  Bandwidth (x1)             Bandwidth (x16)
1.0        8b/10b     2.5 GT/s       2 Gbit/s (250 MB/s)        32 Gbit/s (4 GB/s)
3.0        128b/130b  8 GT/s         7.877 Gbit/s (984.6 MB/s)  126.032 Gbit/s (15.754 GB/s)
So, by this logic, if one PCI-e v1.0 lane transfers 250 MB/s, four lanes should give me around a 1 GB/s transfer rate (or somewhere in that neighborhood)... right?
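Spelling the math out (this is just the table above, per lane times four):
Code:
2.5 GT/s x (8/10 line code) / 8 bits = 250 MB/s per lane
250 MB/s x 4 lanes                   = ~1000 MB/s (before protocol overhead)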
Although the review site used CrystalMark and IOMeter, I used FIO (from the SlackBuilds 14.1 repo).
I haven't quite gotten used to FIO yet, but I did my best to put together a simple sequential read test using the following parameters:
Code:
root@v766:/home/slugman# fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sda --bs=4k --iodepth=8 --size=4G --readwrite=read
test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
(sda is the SM951 in my system. I have an HP Smart Array E200i, so I'm using cciss with 4x 3G 15K SAS drives in RAID 0--that array is /dev/cciss/c0d0p1, which is why the SM951 shows up as sda.)
I believe the above is basically a sequential read test: 4k block size, queue depth 8, with a 4G total transfer.
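For reference, a larger block size is more typical for raw sequential-throughput testing, since 4k reads burn a lot of CPU on I/O submission. A variant I may try next (untested here; same drive and flags, just bs=1M):
Code:
# same sequential read, but 1M blocks to stress the bus rather than the CPU
fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test \
    --filename=/dev/sda --bs=1M --iodepth=8 --size=4G --readwrite=read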
Here are my results:
Code:
Starting 1 process
Jobs: 1 (f=1): [R(1)] [100.0% done] [273.1MB/0KB/0KB /s] [70.2K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1880: Sat Oct 24 21:17:21 2015
read : io=4096.0MB, bw=273869KB/s, iops=68467, runt= 15315msec
cpu : usr=10.56%, sys=88.98%, ctx=11621, majf=0, minf=17
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: io=4096.0MB, aggrb=273869KB/s, minb=273869KB/s, maxb=273869KB/s, mint=15315msec, maxt=15315msec
Disk stats (read/write):
sda: ios=1042183/0, merge=0/0, ticks=30126/0, in_queue=29462, util=99.28%
My results:
SM951 (PCI-e v1.0 bus): 273.1 MB/s transfer rate
That's basically the theoretical max of PCI-e v1.0 with one lane--and I'm running it on four lanes! I know there is some overhead, but I need to establish this now--
I'm doing this test on a blank disk--there is no filesystem on the SM951! (I did this to get the best possible read performance straight from the hardware.)
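(As a quick sanity check independent of FIO, hdparm's read timing against the raw device should land in the same ballpark -- I haven't run it here, but for reference:)
Code:
# buffered sequential reads using O_DIRECT, straight off the raw device
hdparm -t --direct /dev/sda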
Is there anything I'm missing here? Or is this simply the performance I can expect from PCI-e v1.0? i.e., do I just have to bite the bullet and invest in new hardware across the board?