Need help running 32-threaded benchmark
I wrote an unseen-in-C (heh-heh, rhymed) integer CPU/RAM-subsystem 32-threaded torture test, but since I have no access to a powerbox I need assistance to run it on at least a 16-core/32-thread machine.
The C source is attached. Since my native environment is Windows, I used the latest MinGW with GCC 5.1.0 to ensure compatibility; the initial version was all Windows, but my desire is to run it on *nix too.

In essence the benchmark is simple: it loads/replicates one 91,964,279-byte compressed file (260MB of English texts) into 32 pools, thus simulating 32 independent blocks, and decompresses them using my LZSS decompressor Nakamichi 'Lexx'. The idea is to boost the I/O by using 3:1 decompression. My goal is to traverse hundreds of GBs of compressed textual data (mostly English texts) 3x faster than the "normal" way. Actually, the 32 compressed blocks amount to 2,942,857,440 bytes while the uncompressed ones amount to 8,748,875,776 bytes; thus the whole RAM used is about 11GB.

Funny, the test was meant to be all about RAM latency and cache speeds, however at some point it becomes BANDWIDTH bound too. Currently the best result, on @Jpmboy's rig, amounts to 7.048 IPC. Let's see how many bytes of decompression speed those 7.048 equal: (32 threads * 273,401,856 bytes) / (6,215,992,807 ticks / 4,700,000,000 ticks per second) = 6,615,136,217 bytes/second, or 6308 MB/s. On second thought, even 4x that, or ~24GB/s, is far from the 50-60GB/s offered by modern high-end CPUs.

Compile line: gcc -O3 -mavx -fopenmp Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.c -D_N_YMM -D_N_prefetch_4096 -D_gcc_mumbo_jumbo_ -DCommence_OpenMP -o Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf
Run line: ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi
Download note: You may download the package here: https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns
Thanks to some OCN overclockers I already obtained the best result on the fastest enthusiast PC I have ever seen: IPC (Instructions_Per_Clock_during_branchless_32-threaded_decompression) performance: 7.048. Yes, 7+ IPC. For the original thread and more info: http://www.overclock.net/t/1564781/c...per-clock/0_20

Note: Sadly, current AMD CPUs are not optimized for this benchmark; AVX is used in a way that makes them struggle a lot: unaligned, uncached (outwith LLC) RAM accesses. Anyway, I am still an AMD fan, and actually I wrote this bench for the incoming AMD 'Zen'. Just saw a funny clip, it says 'AMD Zen is coming': https://www.youtube.com/watch?v=Mw-c0avURD8

'Zen' will be no joke, since AMD is planning to make it the core of its future Exascale Heterogeneous Processor: "AMD has released information concerning an upcoming 'Exascale Heterogeneous Processor', or EHP. In a paper submitted to the IEEE, AMD details an APU which packs a multitude of Zen cores, Greenland graphics and up to 32GB of HBM2 memory. It says this processor is an embodiment of its 'vision for exascale computing'." http://hexus.net/tech/news/cpu/85184...-32-zen-cores/

For now, the Intel 5960X holds the best result; 7+ IPC is no joke either.

Edit: Changed the CPU/CACHE/RAM frequencies to the right ones: 4.7GHz/4.2GHz/2666MHz. The speed was 300MB/s less. |
Quickly recalled that SATA III with its 550MB/s is no longer interesting for contemplating heavy ... visions, so let's see what one 2,200MB/s SSD could offer in the 'Freaky_Dreamer' scenario:
Intel 750 Series SSDPE2MW400G4R5 2.5" 400GB PCIe NVMe 3.0 x4 MLC Internal Solid State Drive (SSD)
SSD Interface: PCIe NVMe 3.0 x4
Max Sequential Read: 2200 MB/s
Max Sequential Write: 900 MB/s
4KB Random Read: 430,000 IOPS
4KB Random Write: 230,000 IOPS
Read Latency: 20 microseconds
Write Latency: 20 microseconds

Now, what does the above 2,200MB/s read speed translate to when reading the English Wikipedia XML dump file (enwiki-20150112-pages-articles.xml), 51,344,631,742 bytes long? If we are to linearly read it using Intel's badboy, we will need 51,344,631,742/1024/1024/2200 = 22.2 seconds. In case of using such an SSD and a 5960X, the "upload time" becomes 51,344,631,742/1024/1024/2200/3 = 7.4 seconds for the upload part, plus 51,344,631,742/1024/1024/6308 = 7.7 seconds for the decompression part, or 7.4+7.7 = 15.1 seconds, i.e. a 22.2-15.1 = 7.1 seconds boost.

Nah, the boost is not sweet enough. No worries: the used Nakamichi 'Lexx', despite being my favorite code, is not suitable for nowadays' 64bit architectures; it is pure 256bit code, and that's the reason it seems inferior, while in truth it is monstrously fast. Anyway, to "fill the gap", i.e. to have an intermediate performer until the real 256bit CPUs come, I wrote Nakamichi 'Shifune': it is much faster than 'Lexx' and will perform smoothly on AMD too, as it is 64bit code, no AVX. https://pbs.twimg.com/media/COBt7RSUcAAal4S.png

Maybe by the end of the month (or next) I will share a nifty textual showdown (compression ratio + decompression speed) between Nakamichi 'Shifune' and GZIP and Zstd and ZPAQ and LzTurbo and BSC. My current results show decompression-speed supremacy for Nakamichi 'Shifune' in 3:1 big-English-texts cases. Just a glimpse at incoming 120+ testdatafiles: Code:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
I'd love to run this on some test systems. I have access to a lot of big machines, including a dual-proc E5-2697v3 system with 28 cores and 128 GB of RAM, but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software; give me a real source-code link that I can verify, and I'll build and run it to give you some benchmarks.
|
Oh, man, you have helped me a lot with Kazahana, it would be extracool to run Freaky_Dreamer too.
I am sorry for the messy (source here testdatafile there) situation. If you have any issues compiling it just ask me, my *nix command line and GCC knowledge sucks. Hopefully it will run since MinGW said ok. |
I am not experienced at all with *nix, so I can't tell beforehand how RDTSC will report ticks on a dual-socket machine:
Code:
#if defined(_gcc_mumbo_jumbo_) |
>... but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software, give me a real source code link that I can verify and I'll build and run it to give you some benchmarks.
Sure, you won't find a more open C coder than me; all my tests are fully open and mostly FREE. Didn't get the second part. The walkthrough is this:
Step #1: The C source is attached to the first post, ZIPped but with extension .txt (the forum doesn't allow .ZIPs?!).
Step #2: You need the testdatafile (91,964,279 bytes, Autobiography_411-ebooks_Collection.tar.Nakamichi); it is located in the https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns package. |
Sorry, I missed the zip download, I was distracted by the distro-specific downloads pushing rpm files on me.
I did download the source zip, extracted and compiled, but the results are a little odd: Code:
$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi |
Oh, just got what you meant, I am gonna upload the testdatafile to my site right now.
Done: http://www.sanmayce.com/Downloads/Na...C_32-threads.c http://www.sanmayce.com/Downloads/Au....tar.Nakamichi |
Thanks. It seems someone knowledgeable in GetRDTSC (and time measuring in general) under *nix has to help me figure it out.
The important thing is that the 'Done' message is there: "All threads finished." It means that the decompression went as it should; the sizes match, that is. However, "Decompression time: 0 ticks" means that my time reporting failed fully :mad: |
If someone on LQ knows why this time reporter fails to work, that will be the fix; otherwise I have to ask on SO.
Why is ticksTOTAL2 + GetRDTSC() - ticksStart zero?! Is GetRDTSC() simply not working? Code:
#if defined(_gcc_mumbo_jumbo_) |
The difference comes from:
- Cache clock;
- CPU clock;
- RAM CAS latency;
- RAM clock.
I suppose your RAM is DDR4 @2133MHz; if so, then the major toll is taken by the RAM clock (2666MHz vs ?) and then by the CPU clock (4.7GHz vs 3.6GHz), I guess. For comparison, the best-result total time on the 5960X (4.7 core / 4.2 uncore, DDR4 @2666MHz) is about one third of that, 3.4 seconds: Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced. |
@suicidaleggroll
Man, excuse my non-integral distro this time. I just saw that you have compiled the C source from the benchmark package; it won't work, as it was targeted at Windows only. The C source attached to the first post, and the link in POST #8, is the working source: it is "revision 2", targeted for *nix too; in there I added the *nix timing. The dump should say: Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced. Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced. |
Worked fine now, thanks.
Initial results: Code:
$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi |
Quote:
See https://web.archive.org/web/20140101....cx/archives/8. AVX was made by Intel and pushed out so hard that AMD had to implement it fast; it should still work well on more modern AMD CPUs. Also see the XOP instruction set. You should always align to the size of the register used (although it shouldn't matter for ymm/zmm registers) and, if you have a lot of data, to page size. PS: AVX or AVX2? AVX2 is the integer one. |
I don't know how 256bit unaligned move instructions are optimized (if they are at all) across the whole AVX family. One guy on OCN ran the test on AMD Vishera (Edit: it DOESN'T support AVX 2.0) and the result was awful:
http://www.overclock.net/t/1564781/c...#post_24200497 The code we are talking about has little to do with caches; most of its accesses are outside the caches. My dummy guess is that AMD didn't optimize them for such cases; what other explanation could one put forth? Code:
.B30.3:: Code:
vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp] |