LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Need help running 32-threaded benchmark (https://www.linuxquestions.org/questions/programming-9/need-help-running-32-threaded-benchmark-4175552573/)

Sanmayce 09-03-2015 08:58 PM

Need help running 32-threaded benchmark
 
1 Attachment(s)
I wrote an unseen-before (in C, heh-heh, it rhymes) integer CPU-RAM subsystem 32-threaded torture test, but since I have no access to a powerbox I need assistance running it on at least a 16-core/32-thread machine.

The C source is attached. Since my native environment is Windows, I used the latest MinGW with GCC 5.1.0 to ensure compatibility; the initial version was Windows-only, but my desire is to run it on *nix too.

In essence the benchmark is simple: it loads/replicates one 91,964,279-byte compressed file (260MB of English texts) into 32 pools, thus simulating 32 independent blocks, and decompresses them using my LZSS decompressor Nakamichi 'Lexx'. The idea is to boost I/O by using 3:1 decompression. My goal is to traverse hundreds of GBs of compressed textual data (mostly English texts) 3x faster than the "normal" way.
Actually, the compressed 32 blocks amount to 2,942,857,440 bytes, while the uncompressed ones amount to 8,748,875,776 bytes. Thus the whole RAM used is about 11GB.
Funny, the test was meant to be all about RAM latency and cache speeds, however at some point it becomes BANDWIDTH-bound too. Currently the best result, on @Jpmboy's rig, amounts to:

Let's see how many bytes/clock of decompression speed those 7.048 equal:

(32 threads * 273,401,856 bytes) / (6,215,992,807 ticks / 4,700,000,000 ticks-per-second) = 6,615,136,217 bytes/second, or 6,308 MB/s; on second thought, 4x that, or 24GB/s, is still far from the 50-60GB/s offered by modern high-end CPUs.

Compile line:

gcc -O3 -mavx -fopenmp Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.c -D_N_YMM -D_N_prefetch_4096 -D_gcc_mumbo_jumbo_ -DCommence_OpenMP -o Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf

Run line:

./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi

Download note:

You may download the package here: https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns

Thanks to some OCN overclockers I have already obtained the best result on the fastest enthusiast PC I have ever seen:
IPC (Instructions_Per_Clock_during_branchless_32-threaded_decompression) performance: 7.048 - yes, 7+ IPC. For the original thread and more info:

http://www.overclock.net/t/1564781/c...per-clock/0_20

Note: Sadly, current AMD CPUs are not optimized for this benchmark; AVX is used in a way that makes them struggle a lot - unaligned, uncached (outwith the LLC) RAM accesses. Anyway, I am still an AMD fan, and actually I wrote this bench for the incoming AMD 'Zen'.

Just saw a funny clip, it says 'AMD Zen is coming':
https://www.youtube.com/watch?v=Mw-c0avURD8

'Zen' will be no joke since AMD is planning to make it the core of its future Exascale Heterogeneous Processor:

AMD has released information concerning an upcoming 'Exascale Heterogeneous Processor', or EHP. In a paper submitted to the IEEE, AMD details an APU which packs a multitude of Zen cores, Greenland graphics, and up to 32GB of HBM2 memory. It says this processor is an embodiment of its "vision for exascale computing".

http://hexus.net/tech/news/cpu/85184...-32-zen-cores/

For now, the Intel 5960X holds the best result; 7+ IPC is no joke either.

Edit: Changed the CPU/CACHE/RAM frequencies to the right ones: 4.7GHz/4.2GHz/2666MHz. The speed was 300MB/s less.

Sanmayce 09-03-2015 10:01 PM

Quickly recalled that SATA III with its 550MB/s is no longer interesting for contemplating heavy ... visions, so let's see what one 2,200MB/s SSD could offer in the 'Freaky_Dreamer' scenario:

Intel 750 Series SSDPE2MW400G4R5 2.5" 400GB PCIe NVMe 3.0 x4 MLC Internal Solid State Drive (SSD)

SSD Interface: PCIe NVMe 3.0 x4

SSD Performance:

Max Sequential Read: 2200 MB/s
Max Sequential Write: 900 MB/s
4KB Random Read: 430,000 IOPS
4KB Random Write: 230,000 IOPS
Read Latency: 20 µs
Write Latency: 20 µs

Now, what does the above 2,200MB/s read speed translate to when reading the English Wikipedia XML dump file (enwiki-20150112-pages-articles.xml), 51,344,631,742 bytes long?

If we are to read it linearly using Intel's badboy we will need 51,344,631,742/1024/1024/2200 = 22.2 seconds. In case of using such an SSD and a 5960X, the "upload time" becomes:
51,344,631,742/1024/1024/2200/3 = 7.4 seconds for the upload part and 51,344,631,742/1024/1024/6308 = 7.7 seconds for the decompression part, or 7.4+7.7 = 15.1 seconds total, i.e. a 22.2-15.1 = 7.1 second boost.

Nah, the boost is not sweet enough. No worries: the used Nakamichi 'Lexx', despite being my favorite code, is not suitable for today's 64-bit architectures, it is pure 256-bit code; that's the reason it seems inferior while in truth it is monstrously fast.
Anyway, to "fill the gap", i.e. to have an intermediate performer until the real 256-bit CPUs come, I wrote Nakamichi 'Shifune'; it is much faster than 'Lexx' and will perform smoothly on AMD too, being 64-bit code with no AVX.

https://pbs.twimg.com/media/COBt7RSUcAAal4S.png

Maybe by the end of the month (or the next) I will share a nifty textual showdown (compression ratio + decompression speed) between Nakamichi 'Shifune', GZIP, Zstd, ZPAQ, LzTurbo and BSC.

My current results show decompression speed supremacy for Nakamichi 'Shifune' in 3:1 big English texts cases.

Just a glimpse at incoming 120+ testdatafiles:
Code:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testdatafile \ decompressor                                          | Uncompressed |            ZSTD v0.0.1 | LZ4 v1.4 (-9 -b -Sx -T1) |  Nakamichi 'Shifune' | 7za a -tgzip -mx9 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| The_Project_Gutenberg_EBook_of_Gulliver's_Travels_gltrv10.txt        |      604,629 |    239,149; 300.5 MB/s |    261,041;  927.1 MB/s |    xxx,xxx; zzz MB/s |    214,817; N.A. |
| The_Project_Gutenberg_EBook_of_Notre-Dame_de_Paris_2610-8.txt        |    1,101,276 |    450,066; 301.9 MB/s |    489,223;  901.5 MB/s |    xxx,xxx; zzz MB/s |    407,215; N.A. |
| The_Project_Gutenberg_EBook_of_Moby_Dick_moby11.txt                  |    1,255,801 |    540,039; 299.3 MB/s |    588,381;  895.7 MB/s |    xxx,xxx; zzz MB/s |    483,065; N.A. |
| Fleurs_du_mal.tar                                                    |    1,820,160 |    372,893; 563.3 MB/s |    607,482; 1139.3 MB/s |    540,320; 768 MB/s |    496,964; N.A. |
| pg46853_Le_Morte_Darthur_by_Sir_Thomas_Malory.txt                    |    2,136,831 |    733,590; 319.5 MB/s |    783,810;  943.7 MB/s |    xxx,xxx; zzz MB/s |    647,186; N.A. |
| University_of_Canterbury_The_Calgary_Corpus.tar                      |    3,265,536 |  1,164,397; 368.2 MB/s |  1,241,281; 1055.0 MB/s |  1,319,701; 576 MB/s |  1,017,658; N.A. |
| The_Complete_Sherlock_Holmes_-_Doyle_Arthur_Conan.txt                |    3,714,387 |  1,422,283; 302.6 MB/s |  1,539,507;  915.7 MB/s |  x,xxx,xxx; zzz MB/s |  1,285,462; N.A. |
| The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt        |    4,445,260 |  1,512,215; 322.2 MB/s |  1,605,032;  952.2 MB/s |  1,441,679; 704 MB/s |  1,320,100; N.A. |
| Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt |    5,245,293 |  2,079,270; 298.7 MB/s |  2,256,844;  899.4 MB/s |  1,938,723; 640 MB/s |  1,869,849; N.A. |
| Complete_Works_of_William_Shakespeare.txt                            |  10,455,117 |  3,851,884; 313.7 MB/s |  4,200,287;  908.9 MB/s |  x,xxx,xxx; zzz MB/s |  3,378,656; N.A. |
| Complete_Works_of_Fyodor_Dostoyevsky.txt                              |  13,713,275 |  5,127,549; 303.2 MB/s |  5,569,991;  906.5 MB/s |  4,582,363; 448 MB/s |  4,617,360; N.A. |
| The_Book_of_The_Thousand_Nights_and_a_Night.txt                      |  14,613,183 |  5,855,516; 306.2 MB/s |  6,223,909;  914.7 MB/s |  5,293,102; 384 MB/s |  5,198,949; N.A. |
| Dune_Complete_17_Ebooks.tar                                          |  16,973,312 |  6,660,569; 299.5 MB/s |  7,290,214;  894.2 MB/s |  5,893,697; 384 MB/s |  6,086,933; N.A. |
| Agatha_Christie_89-ebooks_TXT.tar                                    |  33,258,496 | 12,404,504; 305.5 MB/s |  13,365,090;  909.2 MB/s |  10,623,335; 320 MB/s |  11,173,195; N.A. |
| Encyclopedia_of_Language_and_Linguistics.txt                          |  59,416,161 | 19,435,030; 358.7 MB/s |  21,118,907; 1038.8 MB/s |  18,502,271; 320 MB/s |  17,546,530; N.A. |
| Stephen_King_67-books.tar                                            |  61,382,144 | 24,402,931; 299.2 MB/s |  26,310,486;  894.3 MB/s |  20,350,142; 256 MB/s |  21,854,632; N.A. |
| enwik8                                                                |  100,000,000 | 39,573,323; 325.6 MB/s |  42,283,793;  936.9 MB/s |  34,218,460; 256 MB/s |  35,102,891; N.A. |
| New_Shorter_Oxford_English_Dictionary_fifth_edition.tar              |  132,728,832 | 28,968,421; 460.7 MB/s |  30,133,137; 1169.4 MB/s |  29,059,023; 448 MB/s |  25,418,601; N.A. |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note: Yann's Zstd here is the very first 0.0.1 version; he has written a much faster one, which hopefully will enter the showdown.

suicidaleggroll 09-03-2015 10:11 PM

I'd love to run this on some test systems. I have access to a lot of big machines, including a dual-proc E5-2697v3 system with 28 cores and 128 GB of RAM, but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software; give me a real source code link that I can verify and I'll build and run it to give you some benchmarks.

Sanmayce 09-03-2015 10:14 PM

Oh man, you have helped me a lot with Kazahana; it would be extra cool to run Freaky_Dreamer too.

I am sorry for the messy (source here, testdatafile there) situation. If you have any issues compiling it, just ask me; my *nix command-line and GCC knowledge sucks. Hopefully it will run, since MinGW said OK.

Sanmayce 09-03-2015 10:18 PM

I am not experienced at all with *nix, so I can't tell beforehand how RDTSC will report ticks on a dual-socket machine:

Code:

#if defined(_gcc_mumbo_jumbo_)
static __inline__ unsigned long long GetRDTSC()
{
  unsigned hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif

I fear each CPU has its own clock counter...

Sanmayce 09-03-2015 10:26 PM

>... but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software, give me a real source code link that I can verify and I'll build and run it to give you some benchmarks.

Sure, you won't find a more open C coder than me; all my tests are fully open and mostly FREE.
Didn't get the second part. The walkthrough is this:

Step #1: The C source is attached to the first post, ZIPped but with extension .txt - the forum doesn't allow .ZIPs?!
Step #2: You need the testdatafile (the 91,964,279-byte Autobiography_411-ebooks_Collection.tar.Nakamichi); it is located in the https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns package.

suicidaleggroll 09-03-2015 10:38 PM

Sorry, I missed the zip download, I was distracted by the distro-specific downloads pushing rpm files on me.

I did download the source zip, extracted and compiled, but the results are a little odd:
Code:

$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 28
omp_get_max_threads( ) = 28
All threads finished.
Decompression time: 0 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.000
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: inf

Looks like there's a timing issue. Actual execution time is about 10 seconds.

Sanmayce 09-03-2015 10:39 PM

Oh, just got what you meant, I am gonna upload the testdatafile to my site right now.

Done:
http://www.sanmayce.com/Downloads/Na...C_32-threads.c
http://www.sanmayce.com/Downloads/Au....tar.Nakamichi

Sanmayce 09-03-2015 10:41 PM

Thanks. It seems someone knowledgeable in GetRDTSC (and time measuring in general) under *nix has to help me figure it out.

The important thing is that the 'Done' message is there:
All threads finished.

It means that the decompression went as it should, the sizes match, that is.

Decompression time: 0 ticks.

The above message means that my time reporting failed completely :mad:

Sanmayce 09-03-2015 11:03 PM

If someone on LQ knows why this time reporter fails to work, that will be the fix; otherwise I have to ask on SO.

Why is ticksTOTAL2 + GetRDTSC() - ticksStart zero?! Is GetRDTSC() simply not working?

Code:

#if defined(_gcc_mumbo_jumbo_)
ticksTOTAL2 = ticksTOTAL2 + GetRDTSC() - ticksStart;
#endif

Ten seconds is not good - do you use an old GCC? The generated code is of significant importance. But far more important are:

- Cache clock;
- CPU clock;
- RAM CAS latency;
- RAM clock.

I suppose your RAM is DDR4 @2133MHz; if so, then the major toll is taken by the RAM clock (2666MHz vs ?) and then by the CPU clock (4.7GHz vs 3.6GHz), I guess.

For comparison, the best result's total time on a 5960X at 4.7 core / 4.2 uncore with DDR4 @2666MHz is about one third of that (3.4 seconds):

Code:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 1 thread.
Decompression time: 40,943,938,444 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.935
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 1.070



Kernel  Time =     1.828 =   12%
User    Time =    12.828 =   86%
Process Time =    14.656 =   98%    Virtual  Memory =  11173 MB
Global  Time =    14.881 =  100%    Physical Memory =  11152 MB
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 16
omp_get_max_threads( ) = 16
All threads finished.
Decompression time: 6,215,992,807 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.142
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 7.048



Kernel  Time =    4.718 =  137%
User    Time =    34.781 = 1016%
Process Time =    39.500 = 1154%    Virtual  Memory =  11176 MB
Global  Time =    3.421 =  100%    Physical Memory =  11154 MB


Sanmayce 09-03-2015 11:33 PM

@suicidaleggroll

Man, excuse my non-integral distro this time. I just saw that you have compiled the C source from the benchmark package; it won't work, as it was targeted at Windows only.

The C source attached to the first post, and the link in POST #8, is the working source - it is "revision 2", targeted for *nix too; in there I added the *nix timing.
The dump should say:

Code:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
not

Code:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
If you care please recompile and rerun it.

suicidaleggroll 09-04-2015 12:12 PM

Worked fine now, thanks.

Initial results:
Code:

$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 28
omp_get_max_threads( ) = 28
Pass # 1 of 64
Pass # 2 of 64
Pass # 3 of 64
Pass # 4 of 64
Pass # 5 of 64
Pass # 6 of 64
Pass # 7 of 64
Pass # 8 of 64
Pass # 9 of 64
Pass #10 of 64
Pass #11 of 64
Pass #12 of 64
Pass #13 of 64
Pass #14 of 64
Pass #15 of 64
Pass #16 of 64
Pass #17 of 64
Pass #18 of 64
Pass #19 of 64
Pass #20 of 64
Pass #21 of 64
Pass #22 of 64
Pass #23 of 64
Pass #24 of 64
Pass #25 of 64
Pass #26 of 64
Pass #27 of 64
Pass #28 of 64
Pass #29 of 64
Pass #30 of 64
Pass #31 of 64
Pass #32 of 64
Pass #33 of 64
Pass #34 of 64
Pass #35 of 64
Pass #36 of 64
Pass #37 of 64
Pass #38 of 64
Pass #39 of 64
Pass #40 of 64
Pass #41 of 64
Pass #42 of 64
Pass #43 of 64
Pass #44 of 64
Pass #45 of 64
Pass #46 of 64
Pass #47 of 64
Pass #48 of 64
Pass #49 of 64
Pass #50 of 64
Pass #51 of 64
Pass #52 of 64
Pass #53 of 64
Pass #54 of 64
Pass #55 of 64
Pass #56 of 64
Pass #57 of 64
Pass #58 of 64
Pass #59 of 64
Pass #60 of 64
Pass #61 of 64
Pass #62 of 64
Pass #63 of 64
Pass #64 of 64
All threads finished.
Decompression time: 338,640,325,612 ticks.
TPI (Ticks_Per_Instruction_during_branchless_decompression) performance: 0.118
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 8.460

But the system wasn't completely available, it was already running with a load of 13, so only ~15 of the 28 cores were idle. It'll be a while before the system is idle and I can test again.

Sanmayce 09-04-2015 02:41 PM

Quote:

Originally Posted by suicidaleggroll (Post 5415838)
Worked fine now, thanks.

Code:

...
Decompression time: 338,640,325,612 ticks.
TPI (Ticks_Per_Instruction_during_branchless_decompression) performance: 0.118
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 8.460

But the system wasn't completely available, it was already running with a load of 13, so only ~15 of the 28 cores were idle. It'll be a while before the system is idle and I can test again.

Thank you :hattip: It is the first XEON behavior that I see; MT IPC = 8.460 beats the heavily overclocked 5960X by a good measure. Can't wait to see a real monster featuring 32 cores; then the RAM latency will be reduced even further, as every thread will have its own caches.

genss 09-04-2015 03:26 PM

Quote:

Originally Posted by Sanmayce (Post 5415556)
Note: Sadly, current AMD CPUs are not optimized for this benchmark, AVX is used in a way that they struggle a lot - unaligned uncached (outwith LLC) RAM accesses. Anyway, I am still an AMD fan and actually I wrote this bench for the incoming AMD 'Zen'.

AMD CPUs are usually much better with unaligned memory access;
see https://web.archive.org/web/20140101....cx/archives/8

AVX was made by Intel and pushed out so hard that AMD had to implement it fast.
It should still work well on more modern AMD CPUs.
Also see the XOP instruction set.


You should always align to the size of the register used (although it shouldn't matter for ymm/zmm registers), and if you have a lot of data, to the page size.


PS
AVX or AVX2?
AVX2 is the integer one.

Sanmayce 09-04-2015 03:42 PM

I don't know how 256-bit unaligned move instructions are optimized (if they are at all) across the whole AVX family; one guy on OCN ran the test on AMD Vishera (Edit: it DOESN'T support AVX 2.0) and the result was awful:
http://www.overclock.net/t/1564781/c...#post_24200497

The code we are talking about has little to do with caches; most of its accesses fall outside the caches. My dummy guess is that AMD didn't optimize them for such cases; what other explanation could one put forth?

Code:

.B30.3::                       
  00030 45 8b 38        mov r15d, DWORD PTR [r8]             
  00033 44 89 f9        mov ecx, r15d                         
  00036 83 f1 03        xor ecx, 3                           
  00039 41 bc ff ff ff
        ff              mov r12d, -1                         
  0003f c1 e1 03        shl ecx, 3                           
  00042 bd 01 00 00 00  mov ebp, 1                           
  00047 41 d3 ec        shr r12d, cl                         
  0004a 45 23 fc        and r15d, r12d                       
  0004d 45 33 e4        xor r12d, r12d                       
  00050 45 89 fe        mov r14d, r15d                       
  00053 45 89 fb        mov r11d, r15d                       
  00056 41 83 e6 0f      and r14d, 15                         
  0005a 48 89 c1        mov rcx, rax                         
  0005d 41 83 fe 0c      cmp r14d, 12                         
  00061 44 0f 44 e5      cmove r12d, ebp                       
  00065 4c 89 c5        mov rbp, r8                           
  00068 41 c1 eb 04      shr r11d, 4                           
  0006c 49 ff cc        dec r12                               
  0006f 45 89 da        mov r10d, r11d                       
  00072 4d 89 e6        mov r14, r12                         
  00075 49 2b ca        sub rcx, r10                         
  00078 49 f7 d6        not r14                               
  0007b 48 ff c9        dec rcx                               
  0007e 49 23 ee        and rbp, r14                         
  00081 49 23 cc        and rcx, r12                         
  00084 41 ff c3        inc r11d                             
  00087 4d 23 d6        and r10, r14                         
  0008a 4d 23 de        and r11, r14                         
  0008d c5 fe 6f 44 29
        01              vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp] 
  00093 44 89 fd        mov ebp, r15d                         
  00096 83 e5 03        and ebp, 3                           
  00099 41 83 e7 0c      and r15d, 12                         
  0009d ff c5            inc ebp                               
  0009f 41 83 c7 04      add r15d, 4                           
  000a3 89 e9            mov ecx, ebp                         
  000a5 c1 e9 02        shr ecx, 2                           
  000a8 41 d3 e7        shl r15d, cl                         
  000ab 49 23 ec        and rbp, r12                         
  000ae 4d 23 fc        and r15, r12                         
  000b1 4c 03 dd        add r11, rbp                         
  000b4 4d 03 d7        add r10, r15                         
  000b7 4d 03 c3        add r8, r11                           
  000ba c5 fe 7f 00      vmovdqu YMMWORD PTR [rax], ymm0       
  000be 49 03 c2        add rax, r10                         
  000c1 4d 3b c1        cmp r8, r9                           
  000c4 0f 82 66 ff ff
        ff              jb .B30.3

Can you see what else could harm the speed of the above loop, except the super-slow memory fetch:

Code:

vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp]
Haven't thought of that, but maybe that missing AVX 2.0 support was the cause of the poor performance. Is it so?

