So, once I read the performance reviews of the Samsung SM951, I simply couldn't wait. Reads in the 2000 MB/s range? Writes in the 1000 MB/s range? Sign me up!
I've been eyeing PCI-e SSDs for a while. I bought an AData 64GB SP310 m.2 for my Latitude D430 and it works great! However, I haven't really been impressed with the controllers used in most PCI-e SSDs (generally just SATA controllers). So, with NGFF maturing into m.2, and now with the Samsung SM951, I finally worked up the courage to spend the $200 and jump in.
Basically, the SM951's full throughput is only achievable on a PCI-e v3.0 bus. However, I figured I should still see a noticeable performance improvement on a PCI-e 1.0 bus.
I have to admit, I was skeptical about whether I should jump in, knowing the biggest caveat in the equation. Now that I have empirical evidence, I realize why.
Here is the SSD I purchased off of Amazon:
Samsung SM951 256GB AHCI MZHPV256HDGL-00000 M.2 80mm PCIe 3.0 x4 SSD - OEM
http://www.amazon.com/gp/product/B00...ilpage_o02_s00
And the Addonics PCI-e adapter:
Addonics ADM2PX4 M2 Pcie Ssd Pcie 3.0 4-lane Accs Adapter
http://www.amazon.com/gp/product/B00...ilpage_o03_s00
And the system I'm using:
HP ProLiant ML350 G5
http://www8.hp.com/h20195/v2/GetPDF.aspx/c04284193.pdf
Here is what I see in the lspci output on my slackware64-current system (HP ProLiant ML350 G5):
Code:
06:00.0 SATA controller: Samsung Electronics Co Ltd Device a801 (rev 01) (prog-if 01 [AHCI 1.0])
Subsystem: Samsung Electronics Co Ltd Device a801
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 26
Region 5: Memory at cdff0000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/8 Maskable- 64bit+
Address: 00000000fee0f00c Data: 4162
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <4us, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158 v1] Power Budgeting <?>
Capabilities: [168 v1] #19
Capabilities: [188 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [190 v1] #1e
Kernel driver in use: ahci
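(Side note: there's no need to wade through the whole dump each time; grepping lspci for just the link lines gets straight to the point. 06:00.0 is the SM951's address on my system, so adjust accordingly:)
Code:
# compare what the link is capable of vs. what it actually trained at
lspci -s 06:00.0 -vv | grep -E 'LnkCap:|LnkSta:'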
I want to emphasize the following. Although the link capability specifies:
Code:
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <4us, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot-
...I believe this is what it's actually "training" at:
Code:
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
This is right on par with what Wikipedia says:
Code:
PCI-E Ver  Line code  Transfer rate  Bandwidth (x1)             Bandwidth (x16)
1.0        8b/10b     2.5 GT/s       2 Gbit/s (250 MB/s)        32 Gbit/s (4 GB/s)
3.0        128b/130b  8 GT/s         7.877 Gbit/s (984.6 MB/s)  126.032 Gbit/s (15.754 GB/s)
So, by this logic, if one PCI-e v1.0 lane transfers 250 MB/s, four lanes should give me around a 1 GB/s transfer rate (or somewhere in that neighborhood)... right?
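Spelling the math out (this is just the table above, per lane times four):
Code:
2.5 GT/s x (8/10 line code) / 8 bits = 250 MB/s per lane
250 MB/s x 4 lanes                   = ~1000 MB/s (before protocol overhead)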
Although the review site used CrystalMark and IOMeter, I used FIO (from the SlackBuilds 14.1 repo).
I haven't quite gotten used to FIO yet, but I did my best to put together a simple sequential read test using the following parameters:
Code:
root@v766:/home/slugman# fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sda --bs=4k --iodepth=8 --size=4G --readwrite=read
test: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
(sda is the SM951 in my system. I have an HP Smart Array E200i, so I'm using cciss with 4x 3G 15K SAS drives in RAID 0--that array is /dev/cciss/c0d0p1, which is why the SM951 shows up as sda.)
I believe the above is basically a sequential read test: 4k block size, queue depth 8, with a 4G total transfer.
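For reference, a larger block size is more typical for raw sequential-throughput testing, since 4k reads burn a lot of CPU on I/O submission. A variant I may try next (untested here; same drive and flags, just bs=1M):
Code:
# same sequential read, but 1M blocks to stress the bus rather than the CPU
fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test \
    --filename=/dev/sda --bs=1M --iodepth=8 --size=4G --readwrite=read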
Here are my results:
Code:
Starting 1 process
Jobs: 1 (f=1): [R(1)] [100.0% done] [273.1MB/0KB/0KB /s] [70.2K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1880: Sat Oct 24 21:17:21 2015
read : io=4096.0MB, bw=273869KB/s, iops=68467, runt= 15315msec
cpu : usr=10.56%, sys=88.98%, ctx=11621, majf=0, minf=17
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: io=4096.0MB, aggrb=273869KB/s, minb=273869KB/s, maxb=273869KB/s, mint=15315msec, maxt=15315msec
Disk stats (read/write):
sda: ios=1042183/0, merge=0/0, ticks=30126/0, in_queue=29462, util=99.28%
My results:
SM951 (PCI-e v1.0 bus): 273.1 MB/s transfer rate
That's basically the theoretical max of PCI-e v1.0 with one lane--and I'm running it on four lanes! I know there is some overhead, but I need to establish this now--
I'm doing this test on a blank disk--there is no filesystem on the SM951! (I did this to get the best possible read performance straight from the hardware.)
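(As a quick sanity check independent of FIO, hdparm's read timing against the raw device should land in the same ballpark -- I haven't run it here, but for reference:)
Code:
# buffered sequential reads using O_DIRECT, straight off the raw device
hdparm -t --direct /dev/sda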
Is there anything I'm missing here? Or is this simply the performance I can expect from PCI-e v1.0? i.e., do I just have to bite the bullet and invest in new hardware across the board?