Need help running 32-threaded benchmark (LinuxQuestions.org > Programming)

Sanmayce 09-03-2015 08:58 PM

Need help running 32-threaded benchmark
 
1 Attachment(s)
I wrote an unseen-in-C, heh-heh rhymed, integer CPU-RAM subsystem 32-threaded torture test, but since I have no access to a powerbox I need assistance running it on at least a 16-core/32-thread machine.

The C source is attached. Since my native environment is Windows, I used the latest MinGW with GCC 5.1.0 to ensure compatibility; the initial version was all Windows, but my desire is to run it on *nix too.

In essence the benchmark is simple: it loads/replicates one 91,964,279-byte compressed file (260MB of English texts) into 32 pools, thus simulating 32 independent blocks, and decompresses them using my LZSS decompressor Nakamichi 'Lexx'. The idea is to boost I/O by exploiting the 3:1 decompression ratio. My goal is to traverse hundreds of GBs of compressed textual data (mostly English texts) 3x faster than the "normal" way.
Actually, the 32 compressed blocks amount to 2,942,857,440 bytes while the uncompressed ones come to 8,748,875,776 bytes. Thus the whole RAM used is about 11GB.
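
For illustration, here is a minimal OpenMP sketch of the block-parallel decompression idea; the DecompressLZSS name and the pool layout are hypothetical stand-ins, not the actual Nakamichi code:

Code:

#include <stdint.h>

/* Hypothetical per-block decompressor signature; the real routine is
   the Nakamichi 'Lexx' decompressor from the attached source. */
extern void DecompressLZSS(const char *src, uint64_t srcSize, char *dst);

void DecompressAllBlocks(const char *srcPool, uint64_t srcBlockSize,
                         char *dstPool, uint64_t dstBlockSize)
{
    int block;
    /* 32 replicas of the same compressed file: each thread decompresses
       its own independent block, with no data shared between threads. */
    #pragma omp parallel for num_threads(32)
    for (block = 0; block < 32; block++)
        DecompressLZSS(srcPool + (uint64_t)block * srcBlockSize,
                       srcBlockSize,
                       dstPool + (uint64_t)block * dstBlockSize);
}
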
Funny, the test was meant to be all about RAM latency and cache speeds; however, at some point it becomes BANDWIDTH bound too. Currently the best result, on @Jpmboy's rig, amounts to:

Let's see what decompression speed those 7.048 IPC translate to:

(32 threads * 273,401,856 bytes) / (6,215,992,807 ticks / 4,700,000,000 ticks-per-second) = 6,615,136,217 bytes/second, or 6308 MB/s; on second thought, even 4x that, i.e. ~24GB/s, is far from the 50-60GB/s offered by modern high-end CPUs.

Compile line:

gcc -O3 -mavx -fopenmp Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.c -D_N_YMM -D_N_prefetch_4096 -D_gcc_mumbo_jumbo_ -DCommence_OpenMP -o Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf

Run line:

./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi

Download note:

You may download the package here: https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns

Thanks to some OCN overclockers I already obtained the best result on the fastest enthusiast PC I have ever seen; it is:
IPC (Instructions_Per_Clock_during_branchless_32-threaded_decompression) performance: 7.048, yes, 7+ IPC. For the original thread and more info:

http://www.overclock.net/t/1564781/c...per-clock/0_20

Note: Sadly, current AMD CPUs are not optimized for this benchmark; AVX is used in a way that makes them struggle a lot: unaligned, uncached (outside the LLC) RAM accesses. Anyway, I am still an AMD fan, and actually I wrote this bench for the incoming AMD 'Zen'.

Just saw a funny clip, it says 'AMD Zen is coming':
https://www.youtube.com/watch?v=Mw-c0avURD8

'Zen' will be no joke since AMD is planning to make it the core of its future Exascale Heterogeneous Processor:

AMD has released information concerning an upcoming 'Exascale Heterogeneous Processor', or EHP. In a paper submitted to the IEEE, AMD details an APU which packs a multitude of Zen cores, Greenland graphics and up to 32GB of HBM2 memory. It says this processor is an embodiment of its "vision for exascale computing".

http://hexus.net/tech/news/cpu/85184...-32-zen-cores/

For now, the Intel 5960X holds the best result; 7+ IPC is no joke either.

Edit: Replaced the CPU/cache/RAM frequencies with the right ones: 4.7GHz/4.2GHz/2666MHz. The speed came out 300MB/s less.

Sanmayce 09-03-2015 10:01 PM

Quickly recalled that SATA III with its 550MB/s is no longer interesting for contemplating heavy ... visions, so let's see what one 2,200MB/s SSD could offer in the 'Freaky_Dreamer' scenario:

Intel 750 Series SSDPE2MW400G4R5 2.5" 400GB PCIe NVMe 3.0 x4 MLC Internal Solid State Drive (SSD)

SSD Interface: PCIe NVMe 3.0 x4

SSD Performance:

Max Sequential Read: 2200 MB/s
Max Sequential Write: 900 MB/s
4KB Random Read: 430,000 IOPS
4KB Random Write: 230,000 IOPS
Read Latency: 20 µs
Write Latency: 20 µs

Now, what does the above 2,200MB/s read speed translate to when reading the English Wikipedia XML dump file (enwiki-20150112-pages-articles.xml), 51,344,631,742 bytes long?

If we linearly read it using Intel's badboy we will need 51,344,631,742/1024/1024/2200 = 22.2 seconds; in case of using such an SSD and the 5960X, the "upload time" becomes:
51,344,631,742/1024/1024/2200/3 = 7.4 seconds for the upload part and 51,344,631,742/1024/1024/6308 = 7.7 seconds for the decompression part, or 7.4+7.7 = 15.1 seconds, i.e. a 22.2-15.1 = 7.1 second boost.
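
The same arithmetic as a tiny C check; the drive speed (2200 MB/s) and the measured decompression speed (6308 MB/s) are the assumptions from above:

Code:

#include <stdio.h>

int main(void)
{
    const double bytes  = 51344631742.0;       /* enwiki dump size in bytes    */
    const double mb     = bytes / 1024 / 1024; /* ~48,966 MB                   */
    const double raw    = mb / 2200;           /* linear read at 2200 MB/s     */
    const double upload = mb / 2200 / 3;       /* read the 3:1 compressed form */
    const double decomp = mb / 6308;           /* decompress at 6308 MB/s      */
    printf("raw: %.1fs  compressed path: %.1fs  boost: %.1fs\n",
           raw, upload + decomp, raw - (upload + decomp));
    return 0;
}
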

Nah, the boost is not sweet enough. No worries: the Nakamichi 'Lexx' used here, despite being my favorite code, is not suitable for today's 64-bit architectures since it is pure 256-bit code; that's the reason it seems inferior, while in truth it is monstrously fast.
Anyway, to "fill the gap", i.e. to have an intermediate performer until the real 256-bit CPUs come, I wrote Nakamichi 'Shifune'; it is much faster than 'Lexx' and will perform smoothly on AMD too, since it is 64-bit code, no AVX.

https://pbs.twimg.com/media/COBt7RSUcAAal4S.png

Maybe by the end of the month (or the next) I will share a nifty textual showdown (compression ratio + decompression speed) between Nakamichi 'Shifune', GZIP, Zstd, ZPAQ, LzTurbo and BSC.

My current results show decompression speed supremacy for Nakamichi 'Shifune' in 3:1 big English texts cases.

Just a glimpse at the incoming 120+ testdatafiles:
Code:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| testdatafile \ decompressor                                          | Uncompressed |            ZSTD v0.0.1 | LZ4 v1.4 (-9 -b -Sx -T1) |  Nakamichi 'Shifune' | 7za a -tgzip -mx9 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| The_Project_Gutenberg_EBook_of_Gulliver's_Travels_gltrv10.txt        |      604,629 |    239,149; 300.5 MB/s |    261,041;  927.1 MB/s |    xxx,xxx; zzz MB/s |    214,817; N.A. |
| The_Project_Gutenberg_EBook_of_Notre-Dame_de_Paris_2610-8.txt        |    1,101,276 |    450,066; 301.9 MB/s |    489,223;  901.5 MB/s |    xxx,xxx; zzz MB/s |    407,215; N.A. |
| The_Project_Gutenberg_EBook_of_Moby_Dick_moby11.txt                  |    1,255,801 |    540,039; 299.3 MB/s |    588,381;  895.7 MB/s |    xxx,xxx; zzz MB/s |    483,065; N.A. |
| Fleurs_du_mal.tar                                                    |    1,820,160 |    372,893; 563.3 MB/s |    607,482; 1139.3 MB/s |    540,320; 768 MB/s |    496,964; N.A. |
| pg46853_Le_Morte_Darthur_by_Sir_Thomas_Malory.txt                    |    2,136,831 |    733,590; 319.5 MB/s |    783,810;  943.7 MB/s |    xxx,xxx; zzz MB/s |    647,186; N.A. |
| University_of_Canterbury_The_Calgary_Corpus.tar                      |    3,265,536 |  1,164,397; 368.2 MB/s |  1,241,281; 1055.0 MB/s |  1,319,701; 576 MB/s |  1,017,658; N.A. |
| The_Complete_Sherlock_Holmes_-_Doyle_Arthur_Conan.txt                |    3,714,387 |  1,422,283; 302.6 MB/s |  1,539,507;  915.7 MB/s |  x,xxx,xxx; zzz MB/s |  1,285,462; N.A. |
| The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt        |    4,445,260 |  1,512,215; 322.2 MB/s |  1,605,032;  952.2 MB/s |  1,441,679; 704 MB/s |  1,320,100; N.A. |
| Ian_Fleming_-_The_James_Bond_Anthology_(complete_collection).epub.txt |    5,245,293 |  2,079,270; 298.7 MB/s |  2,256,844;  899.4 MB/s |  1,938,723; 640 MB/s |  1,869,849; N.A. |
| Complete_Works_of_William_Shakespeare.txt                            |  10,455,117 |  3,851,884; 313.7 MB/s |  4,200,287;  908.9 MB/s |  x,xxx,xxx; zzz MB/s |  3,378,656; N.A. |
| Complete_Works_of_Fyodor_Dostoyevsky.txt                              |  13,713,275 |  5,127,549; 303.2 MB/s |  5,569,991;  906.5 MB/s |  4,582,363; 448 MB/s |  4,617,360; N.A. |
| The_Book_of_The_Thousand_Nights_and_a_Night.txt                      |  14,613,183 |  5,855,516; 306.2 MB/s |  6,223,909;  914.7 MB/s |  5,293,102; 384 MB/s |  5,198,949; N.A. |
| Dune_Complete_17_Ebooks.tar                                          |  16,973,312 |  6,660,569; 299.5 MB/s |  7,290,214;  894.2 MB/s |  5,893,697; 384 MB/s |  6,086,933; N.A. |
| Agatha_Christie_89-ebooks_TXT.tar                                    |  33,258,496 | 12,404,504; 305.5 MB/s |  13,365,090;  909.2 MB/s |  10,623,335; 320 MB/s |  11,173,195; N.A. |
| Encyclopedia_of_Language_and_Linguistics.txt                          |  59,416,161 | 19,435,030; 358.7 MB/s |  21,118,907; 1038.8 MB/s |  18,502,271; 320 MB/s |  17,546,530; N.A. |
| Stephen_King_67-books.tar                                            |  61,382,144 | 24,402,931; 299.2 MB/s |  26,310,486;  894.3 MB/s |  20,350,142; 256 MB/s |  21,854,632; N.A. |
| enwik8                                                                |  100,000,000 | 39,573,323; 325.6 MB/s |  42,283,793;  936.9 MB/s |  34,218,460; 256 MB/s |  35,102,891; N.A. |
| New_Shorter_Oxford_English_Dictionary_fifth_edition.tar              |  132,728,832 | 28,968,421; 460.7 MB/s |  30,133,137; 1169.4 MB/s |  29,059,023; 448 MB/s |  25,418,601; N.A. |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note: Yann's Zstd here is the very first 0.0.1 version; he has written a much faster one, which hopefully will enter the showdown.

suicidaleggroll 09-03-2015 10:11 PM

I'd love to run this on some test systems. I have access to a lot of big machines, including a dual-proc E5-2697 v3 system with 28 cores and 128 GB of RAM, but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software; give me a real source code link that I can verify, and I'll build and run it to give you some benchmarks.

Sanmayce 09-03-2015 10:14 PM

Oh, man, you have helped me a lot with Kazahana; it would be extracool to run Freaky_Dreamer too.

I am sorry for the messy (source here, testdatafile there) situation. If you have any issues compiling it, just ask me; my *nix command-line and GCC knowledge sucks. Hopefully it will run, since MinGW said OK.

Sanmayce 09-03-2015 10:18 PM

I am not experienced at all with *nix, so I can't tell beforehand how RDTSC will report ticks on a dual-socket machine:

Code:

#if defined(_gcc_mumbo_jumbo_)
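/* Read the CPU's 64-bit time-stamp counter (in core clock ticks). */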
static __inline__ unsigned long long GetRDTSC()
{
  unsigned hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif

I fear each CPU may have its own clock reporter...
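
For what it's worth, here is a portable *nix alternative sketch using POSIX clock_gettime (not the benchmark's actual timing code; on older glibc, link with -lrt):

Code:

#include <stdio.h>
#include <time.h>

/* Wall-clock nanoseconds via CLOCK_MONOTONIC: consistent across
   cores and sockets, unlike a per-core time-stamp counter. */
static unsigned long long GetMonotonicNS(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    unsigned long long start = GetMonotonicNS();
    /* ... workload goes here ... */
    printf("elapsed: %llu ns\n", GetMonotonicNS() - start);
    return 0;
}
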

Sanmayce 09-03-2015 10:26 PM

>... but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software, give me a real source code link that I can verify and I'll build and run it to give you some benchmarks.

Sure, you won't find a more open C coder than me; all my tests are fully open and mostly FREE.
Didn't get the second part. The walkthrough is this:

Step #1: The C source is attached to the first post, ZIPped but with extension .txt; the forum doesn't allow .ZIPs?!
Step #2: You need the testdatafile (the 91,964,279-byte Autobiography_411-ebooks_Collection.tar.Nakamichi); it is located in the https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns package.

suicidaleggroll 09-03-2015 10:38 PM

Sorry, I missed the zip download; I was distracted by the distro-specific downloads pushing rpm files on me.

I did download the source zip, extracted and compiled, but the results are a little odd:
Code:

$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 28
omp_get_max_threads( ) = 28
All threads finished.
Decompression time: 0 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.000
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: inf

Looks like there's a timing issue. Actual execution time is about 10 seconds.

Sanmayce 09-03-2015 10:39 PM

Oh, just got what you meant, I am gonna upload the testdatafile to my site right now.

Done:
http://www.sanmayce.com/Downloads/Na...C_32-threads.c
http://www.sanmayce.com/Downloads/Au....tar.Nakamichi

Sanmayce 09-03-2015 10:41 PM

Thanks, it seems someone knowledgeable in GetRDTSC (and time measuring in general) under *nix has to help me figure it out.

The important thing is that the 'Done' message is there:
All threads finished.

It means that the decompression went as it should; the sizes match, that is.

Decompression time: 0 ticks.

The above message means that my time reporting failed fully :mad:

Sanmayce 09-03-2015 11:03 PM

If someone on LQ knows why this time reporter fails to work, that will be the fix; otherwise I have to ask on SO.

Why is ticksTOTAL2 + GetRDTSC() - ticksStart zero?! Is GetRDTSC() simply not working?

Code:

#if defined(_gcc_mumbo_jumbo_)
ticksTOTAL2 = ticksTOTAL2 + GetRDTSC() - ticksStart;
#endif

Ten seconds is not good; do you use an old GCC? The generated code is of significant importance. But far more important are:

- Cache clock;
- CPU clock;
- RAM CAS latency;
- RAM clock.

I suppose your RAM is DDR4 @2133MHz; if so, the major toll is taken by the RAM clock (2666MHz vs ?) and then the CPU clock (4.7GHz vs 3.6GHz), I guess.

For comparison, the best-result total time on the 5960X (4.7GHz core / 4.2GHz uncore, DDR4 @2666MHz) is about one third of that (3.4 seconds):

Code:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 1 thread.
Decompression time: 40,943,938,444 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.935
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 1.070



Kernel  Time =    1.828 =  12%
User    Time =    12.828 =  86%
Process Time =    14.656 =  98%    Virtual  Memory =  11173 MB
Global  Time =    14.881 =  100%    Physical Memory =  11152 MB
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 16
omp_get_max_threads( ) = 16
All threads finished.
Decompression time: 6,215,992,807 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.142
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 7.048



Kernel  Time =    4.718 =  137%
User    Time =    34.781 = 1016%
Process Time =    39.500 = 1154%    Virtual  Memory =  11176 MB
Global  Time =    3.421 =  100%    Physical Memory =  11154 MB


Sanmayce 09-03-2015 11:33 PM

@suicidaleggroll

Man, excuse my non-integral distro this time; I just saw that you compiled the C source from the benchmark package. It won't work, as it was targeted at Windows only.

The C source attached to the first post and linked in POST #8 is the working source; it is "revision 2", targeted for *nix too, and in it I added the *nix timing.
The dump should say:

Code:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
not

Code:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
If you care, please recompile and rerun it.

suicidaleggroll 09-04-2015 12:12 PM

Worked fine now, thanks.

Initial results:
Code:

$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 28
omp_get_max_threads( ) = 28
Pass # 1 of 64
Pass # 2 of 64
...
Pass #63 of 64
Pass #64 of 64
All threads finished.
Decompression time: 338,640,325,612 ticks.
TPI (Ticks_Per_Instruction_during_branchless_decompression) performance: 0.118
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 8.460

But the system wasn't completely available, it was already running with a load of 13, so only ~15 of the 28 cores were idle. It'll be a while before the system is idle and I can test again.

Sanmayce 09-04-2015 02:41 PM

Quote:

Originally Posted by suicidaleggroll (Post 5415838)
Worked fine now, thanks.

Code:

...
Decompression time: 338,640,325,612 ticks.
TPI (Ticks_Per_Instruction_during_branchless_decompression) performance: 0.118
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 8.460

But the system wasn't completely available, it was already running with a load of 13, so only ~15 of the 28 cores were idle. It'll be a while before the system is idle and I can test again.

Thank you :hattip: It is the first XEON behavior I have seen; MT IPC = 8.460 beats the heavily overclocked 5960X by a good measure. Can't wait to see a real monster featuring 32 cores; then the RAM latency will be reduced further, as every thread will have its own caches.

genss 09-04-2015 03:26 PM

Quote:

Originally Posted by Sanmayce (Post 5415556)
Note: Sadly, current AMD CPUs are not optimized for this benchmark; AVX is used in a way that makes them struggle a lot: unaligned, uncached (outside the LLC) RAM accesses. Anyway, I am still an AMD fan, and actually I wrote this bench for the incoming AMD 'Zen'.

AMD cpus are usually much better with unaligned memory access
see https://web.archive.org/web/20140101....cx/archives/8

AVX was made by intel and pushed out so hard that AMD had to implement it fast
it should still work well on more modern amd cpus
also see the XOP instruction set


you should always align to the size of register used (although it shouldn't matter for ymm/zmm registers)
and if you have a lot of data, to page size
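
(for reference, one way to get such alignment on *nix, a minimal sketch assuming POSIX posix_memalign:)

Code:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *buf;
    /* align the data pool to the 4096-byte page size, which also
       satisfies the 32-byte alignment a YMM register wants */
    if (posix_memalign(&buf, 4096, 1 << 20) != 0)
        return 1;
    printf("pool at %p\n", buf);
    free(buf);
    return 0;
}
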


PS
AVX or AVX2 ?
AVX2 is integer

Sanmayce 09-04-2015 03:42 PM

I don't know how (or if at all) 256-bit unaligned move instructions are optimized across the whole AVX family; one guy on OCN ran the test on AMD Vishera (Edit: it DOESN'T support AVX 2.0) and the result was awful:
http://www.overclock.net/t/1564781/c...#post_24200497

The code we are talking about has little to do with caches; most of its accesses are outside the caches. My dummy guess is that AMD didn't optimize them for such cases; what other explanation could one put forth?

Code:

.B30.3::                       
  00030 45 8b 38        mov r15d, DWORD PTR [r8]             
  00033 44 89 f9        mov ecx, r15d                         
  00036 83 f1 03        xor ecx, 3                           
  00039 41 bc ff ff ff
        ff              mov r12d, -1                         
  0003f c1 e1 03        shl ecx, 3                           
  00042 bd 01 00 00 00  mov ebp, 1                           
  00047 41 d3 ec        shr r12d, cl                         
  0004a 45 23 fc        and r15d, r12d                       
  0004d 45 33 e4        xor r12d, r12d                       
  00050 45 89 fe        mov r14d, r15d                       
  00053 45 89 fb        mov r11d, r15d                       
  00056 41 83 e6 0f      and r14d, 15                         
  0005a 48 89 c1        mov rcx, rax                         
  0005d 41 83 fe 0c      cmp r14d, 12                         
  00061 44 0f 44 e5      cmove r12d, ebp                       
  00065 4c 89 c5        mov rbp, r8                           
  00068 41 c1 eb 04      shr r11d, 4                           
  0006c 49 ff cc        dec r12                               
  0006f 45 89 da        mov r10d, r11d                       
  00072 4d 89 e6        mov r14, r12                         
  00075 49 2b ca        sub rcx, r10                         
  00078 49 f7 d6        not r14                               
  0007b 48 ff c9        dec rcx                               
  0007e 49 23 ee        and rbp, r14                         
  00081 49 23 cc        and rcx, r12                         
  00084 41 ff c3        inc r11d                             
  00087 4d 23 d6        and r10, r14                         
  0008a 4d 23 de        and r11, r14                         
  0008d c5 fe 6f 44 29
        01              vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp] 
  00093 44 89 fd        mov ebp, r15d                         
  00096 83 e5 03        and ebp, 3                           
  00099 41 83 e7 0c      and r15d, 12                         
  0009d ff c5            inc ebp                               
  0009f 41 83 c7 04      add r15d, 4                           
  000a3 89 e9            mov ecx, ebp                         
  000a5 c1 e9 02        shr ecx, 2                           
  000a8 41 d3 e7        shl r15d, cl                         
  000ab 49 23 ec        and rbp, r12                         
  000ae 4d 23 fc        and r15, r12                         
  000b1 4c 03 dd        add r11, rbp                         
  000b4 4d 03 d7        add r10, r15                         
  000b7 4d 03 c3        add r8, r11                           
  000ba c5 fe 7f 00      vmovdqu YMMWORD PTR [rax], ymm0       
  000be 49 03 c2        add rax, r10                         
  000c1 4d 3b c1        cmp r8, r9                           
  000c4 0f 82 66 ff ff
        ff              jb .B30.3

Can you see what else can harm the speed of the above loop except the super-slow memory fetch:

Code:

vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp]
I hadn't thought of that, but maybe the missing AVX 2.0 support was the cause of the poor performance. Is that so?

genss 09-04-2015 05:34 PM

linux has an amazing tool named perf
idk how to compile this to include the assembly so i can't take a look

vmovdqu is an AVX(1) instruction

using unaligned MOVs is in most cases bad due to how the cpu handles memory internally
(and how the memory BUS on the northbridge handles it)
it comes down to the fact that reading 256 bits of unaligned memory results in reading 6*64 bits instead of 4*64
and writing requires reading 2*64 bits, AND-ing, and then writing 6*64 bits

ofc it depends on the cpu and i may be wrong
so let's look at the instruction latency tables
for my cpu (piledriver):

VMOVDQA has a latency of 6 cycles when reading from RAM and a reciprocal throughput of 1 (one instruction can start every cycle)
while the latency when writing is 11 cycles and the reciprocal throughput is 17 (one instruction can start every 17 cycles)

VMOVDQU has a latency of 6 when reading and the reciprocal throughput is still 1
but for writing it is 14 cycles with a reciprocal throughput of 20

note that a VMOVDQU write to cache/memory takes 8 instructions (microcode), while a VMOVDQA write takes 4


there doesn't seem to be data for these instructions on intel here
but if you find them, do note that these numbers are dependent on cpu frequency (cycle = 1/freq seconds)


as for the snippet of code
note how rax is not changed between the read and the write, and some other instructions are executed in between
this is due to those latencies; the compiler reordered instructions around those 2


if you really want to make that code run fast, use movntps (has to be aligned)
it is an SSE instruction that moves 16 bytes from the register directly into RAM, bypassing cache
this will reduce cache thrashing
(unless you read that value right after writing it)
while on the topic, intel cpus usually have more cache and are thus less vulnerable to cache thrashing
idk how this program works, so idk if using aligned memory access is worthwhile
if it is 4 or 8 byte aligned you can try combining 32bit and/or 64bit registers with movntps
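
(a minimal sketch of that suggestion via the MOVNTPS intrinsic; the helper name and the 16-bytes-at-a-time layout are just for illustration, not from Nakamichi:)

Code:

#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps (MOVNTPS), _mm_sfence */

/* copy n bytes (a multiple of 16, both pointers 16-byte aligned)
   with non-temporal stores, bypassing the cache on the write side */
static void StreamCopy16(float *dst, const float *src, uint64_t n)
{
    uint64_t i;
    for (i = 0; i < n / 16; i++)
        _mm_stream_ps(dst + i * 4, _mm_load_ps(src + i * 4));
    _mm_sfence();  /* make the streamed stores globally visible */
}
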

memory is almost always slower than processing on modern cpus
Computer Architecture, Fifth Edition: A Quantitative Approach is a great book about cpus and it explains cache well
(PS Agner Fog's web page)
(PPS some AMD manuals are good, better than intel ones if you ask me)

suicidaleggroll 09-05-2015 09:46 AM

There we go...the other jobs finally cleared off. Here's a re-run with all 28 cores available (as available as possible when running the OS and a couple of mostly-idle VMs):

Code:

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 28
omp_get_max_threads( ) = 28
Pass # 1 of 64
Pass # 2 of 64
...
Pass #63 of 64
Pass #64 of 64
All threads finished.
Decompression time: 302,662,933,786 ticks.
TPI (Ticks_Per_Instruction_during_branchless_decompression) performance: 0.106
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 9.466

Specs:

Chassis/Mb: Supermicro 1028GR-TRT
Procs: 2x Xeon E5-2697 v3 with hyperthreading disabled
RAM: 8x Kingston KVR21R15D4/16
Drives: 4x Intel SSDSC2BF480H501 in RAID 10
CentOS 7 running gcc 4.8.3

Sanmayce 09-05-2015 03:34 PM

@genss

Thanks for the suggestions. Yes, for some other cases I would consider them; however, this particular snippet is out-of-the-box: it disregards all standard one-to-one transfers and uses an overutilized one instead, storing/fetching all sizes <=32 with a YMM register. In some way it is a brute technique, since overlapping the writes is not good (one write should wait for the previous one to finish), but this is the charm of this etude: simplicity for the human, complexity for the CPU. I would try some manually done alignments, but that would only bubble up the code, which is not human-likeable; let there be some load/work for the logic inside that 5+ billion transistor box.
The AMD manual is good and I will look it up, but quick skimming showed that it explains only up to XMM; YMM is not covered, it is from 2005, yes.

As for the book, it is a good one, no doubt about it; on page 126 I saw:

Pitfall:
Simulating enough instructions to get accurate performance measures of the
memory hierarchy.
There are really three pitfalls here. One is trying to predict performance of a large
cache using a small trace. Another is that a program's locality behavior is not
constant over the run of the entire program. The third is that a program's locality
behavior may vary depending on the input.


Yes, but regardless of the small size of the decompression loop, 'Freaky_Dreamer' (the 'Lexx' decompression loop, in fact) doesn't fall into the above pitfall. In my opinion it stresses the whole memory hierarchy well, no?
Simply, the etude is superb: it loads 32-byte-long chunks across a 0..256MB range (backwards from the current position). Since it is LZSS decompression with a huge sliding window, this translates to L1/L2/L3- and RAM-intensive fetches.

Anyway, I don't know whether you felt one of the nifty facets of 'Lexx': my idea is for the branchlessness and the simplicity to be exploited well at some point on GPUs. I saw that the book even explains branching on GPUs; surely I will look it up more.

If you have the time and will to play with the AMD response on YMM vs GP registers, you may simply comment out the YMM line and replace it with 4x 8-byte loads.

Code:

                                #ifdef _N_YMM
//                SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
                SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
                                #endif

Above changed with this:

Code:

//                SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
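/* The four 64-bit GP-register copies below emulate the single 32-byte YMM
   copy above; the address arithmetic is identical, only the transfer width
   changes, so AVX is avoided entirely. */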
*(uint64_t*)(retLOCAL+8*(0)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(0));
*(uint64_t*)(retLOCAL+8*(1)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(1));
*(uint64_t*)(retLOCAL+8*(2)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(2));
*(uint64_t*)(retLOCAL+8*(3)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(3));

That way, AVX won't hurt AMD anymore.

The source of Nakamichi 'Lexx' is here:
https://software.intel.com/sites/def...onfly_Lexx.zip

The compressed with Nakamichi 'Lexx' file is here:
http://www.sanmayce.com/Downloads/Au....tar.Nakamichi

I am tempted to ask Agner Fog to share his view on the topic; if he can see how to speed things up, it would be much appreciated.
I have read (on Intel's forum) some of his visions for enhancing x86; in one of his posts he talked about unburdening the current architecture of the necessity to drag along compatibility with old code. As far as I understood, he proposed what I need now: when you have an instruction dealing with e.g. 256 bits, the goal is to make it native, not some kind of mutant at the microcode level. Things are complex, and compatibility is something awesome; however, my eyes are on whoever will make a pure 256-bit CPU-RAM subsystem available to the enthusiast community. My point: let's wait another 20 years and see whether the Nakamichi 'Lexx' decompression loop would need rearrangements. I think not.


@suicidaleggroll

Many thanks!

Now I know that my expectations were too high; I expected to see 14 IPC.
Yet:
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 9.466
is a solid trump over the speedy 5960x @4.7GHz with @2666MHz DDR4.

Every test brings new PRACTICAL knowledge, not some theoretical mumbo-jumbo estimation. Now it is clear to me: 'Lexx' decompression really is not for today's CPUs; maybe when 256MB fast caches arrive it will scream.

genss 09-06-2015 08:45 AM

Quote:

Originally Posted by Sanmayce (Post 5416374)
@genss

Thanks for the suggestions. Yes, for some other cases I would consider them; however, this particular snippet is out-of-the-box: it disregards all standard one-to-one transfers and uses an overutilized one instead, storing/fetching all sizes <=32 with a YMM register. In some way it is a brute technique, since overlapping the writes is not good (one write should wait for the previous one to finish), but this is the charm of this etude: simplicity for the human, complexity for the CPU. I would try some manually done alignments, but that would only bubble up the code, which is not human-likeable; let there be some load/work for the logic inside that 5+ billion transistor box.
The AMD manual is good and I will look it up, but quick skimming showed that it explains only up to XMM; YMM is not covered, it is from 2005, yes.

...

I am tempted to ask Agner Fog to share his view on the topic; if he can see how to speed things up, it would be much appreciated.
I have read (on Intel's forum) some of his visions for enhancing x86; in one of his posts he talked about unburdening the current architecture of the necessity to drag along compatibility with old code. As far as I understood, he proposed what I need now: when you have an instruction dealing with e.g. 256 bits, the goal is to make it native, not some kind of mutant at the microcode level. Things are complex, and compatibility is something awesome; however, my eyes are on whoever will make a pure 256-bit CPU-RAM subsystem available to the enthusiast community. My point: let's wait another 20 years and see whether the Nakamichi 'Lexx' decompression loop would need rearrangements. I think not.

...

Every test brings new PRACTICAL knowledge, not some theoretical mumbo-jumbo estimation. Now it is clear to me: 'Lexx' decompression really is not for today's CPUs; maybe when 256MB fast caches arrive it will scream.

you are welcome

i speak from experience, not some "theoretical mumbo-jumbo estimation"
my experience shows the correlation between reality and the texts i linked for you to read
(i wrote a very fast memcpy(), which is more or less what your program spends its time doing)
if you do not wish to learn, do not ask

the book about computer architecture clearly states that raising the amount of cpu cache is a futile effort
most of the cpu die is already used for cache, and tests clearly show only about a 10% speed increase when doubling the cache size
it also explains why that is so

do learn to use perf and do learn proper assembly

i would ask of you to please not bother mr. Fog

Sanmayce 09-06-2015 03:09 PM

@genss

Strange, you say I am welcome, and yet your words sound like I am some insolent smartass; nah.
If you knew me better you would see the misgiving.

I'll say it outright: I am interested only in sharing etudes and benchmarking them, not in learning per se, nor in some professional career or status. Pure amateurism heavily reinforced by tons of tests, that's me.

Anyway, I often use 'mumbo-jumbo' to highlight the contrast between theoretical and practical knowledge; not that I look down on theory, I just want to emphasize the principle of going against the flow, heh-heh. As for the clocks needed for each instruction and such, that is not mumbo-jumbo; maybe you thought that was what I meant, but it was not. I meant the endless empty compression/decompression talk about how some compressor features this and that, while in reality, when put on the bench, it starts to show that even the author didn't know whether it was good enough for e.g. textual decompression. No tests (speed showdowns), no practical value proven.

>if you do not wish to learn, do not ask
Tried several times to see what made you say that; no clue.

>i would ask of you to please not bother mr. Fog
Maybe you have something in mind, so be it.

