Need help running 32-threaded benchmark
I wrote an unseen-in-C (heh-heh, it rhymes) integer CPU-RAM subsystem 32-threaded torture test, but since I have no access to a powerbox I need assistance running it on at least a 16-core/32-thread machine.
The C source is attached. Since my native environment is Windows, I used the latest MinGW with GCC 5.1.0 to ensure compatibility; the initial version was Windows-only, but my desire is to run it on *nix too.

In essence the benchmark is simple: it loads/replicates one 91,964,279-byte compressed file (260MB of English texts) into 32 pools, thus simulating 32 independent blocks, and decompresses them using my LZSS decompressor Nakamichi 'Lexx'. The idea is to boost I/O by using 3:1 decompression; my goal is to traverse hundreds of GBs of compressed textual data (mostly English texts) 3x faster than the "normal" way. Actually, the compressed 32 blocks amount to 2,942,857,440 bytes while the uncompressed ones amount to 8,748,875,776 bytes, so the whole RAM used is about 11GB.

Funny, the test was meant to be all about RAM latency and cache speeds, however at some point it becomes BANDWIDTH bound too. Currently the best result, on @Jpmboy's rig, amounts to 7.048 IPC. Let's see how many bytes of decompression speed those 7.048 equal:

(32 threads * 273,401,856) bytes / (6,215,992,807 ticks / 4,700,000,000 ticks-per-second) = 6,615,136,217 bytes/second, or 6308 MB/s. On second thought, 4x that, or ~24GB/s, is still far from the 50-60GB/s offered by modern high-end CPUs.

Compile line:
Code:
gcc -O3 -mavx -fopenmp Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.c -D_N_YMM -D_N_prefetch_4096 -D_gcc_mumbo_jumbo_ -DCommence_OpenMP -o Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf
Run line:
Code:
./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi

Download note: you may download the package here: https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns

Thanks to some OCN overclockers I already obtained the best result for the fastest enthusiast PC I have ever seen: IPC (Instructions_Per_Clock_during_branchless_32-threaded_decompression) performance: 7.048. Yes, 7+ IPC. For the original thread and more info: http://www.overclock.net/t/1564781/c...per-clock/0_20

Note: sadly, current AMD CPUs are not optimized for this benchmark; AVX is used in a way that makes them struggle a lot - unaligned, uncached (outwith LLC) RAM accesses. Anyway, I am still an AMD fan, and actually I wrote this bench for the incoming AMD 'Zen'. Just saw a funny clip, it says 'AMD Zen is coming': https://www.youtube.com/watch?v=Mw-c0avURD8

'Zen' will be no joke, since AMD is planning to make it the core of its future Exascale Heterogeneous Processor: "AMD has released information concerning an upcoming 'Exascale Heterogeneous Processor', or EHP. In a paper submitted to the IEEE, AMD details an APU which packs a multitude of Zen cores, Greenland graphics and up to 32GB of HBM2 memory. It says this processor is an embodiment of its 'vision for exascale computing'." http://hexus.net/tech/news/cpu/85184...-32-zen-cores/

For now, the Intel 5960X holds the best result; 7+ IPC is no joke either.

Edit: Changed the CPU/CACHE/RAM frequencies to the right ones: 4.7GHz/4.2GHz/2666MHz. The speed was 300MB/s less.
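For readers who want the shape of the benchmark without opening the attachment: the 32-pool pattern boils down to something like this (a minimal sketch with hypothetical names - DecompressLexx here stands in for the real entry point in the attached source):
Code:
#include <omp.h>
#include <stdint.h>

#define POOLS 32

/* Hypothetical stand-in for the Nakamichi 'Lexx' entry point:
   decompresses one block, returns the number of bytes written to dst. */
uint64_t DecompressLexx(const uint8_t *src, uint64_t srcSize, uint8_t *dst);

void DecompressAllPools(uint8_t *compressed[POOLS], uint64_t compressedSize,
                        uint8_t *uncompressed[POOLS])
{
    int i;
    /* 32 independent blocks, nothing shared between threads -
       embarrassingly parallel, one thread per pool. */
    #pragma omp parallel for num_threads(POOLS)
    for (i = 0; i < POOLS; i++)
        DecompressLexx(compressed[i], compressedSize, uncompressed[i]);
}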
I quickly recalled that SATA III with its 550MB/s is no longer interesting for contemplating heavy ... visions, so let's see what one 2,200MB/s SSD could offer in the 'Freaky_Dreamer' scenario:
Intel 750 Series SSDPE2MW400G4R5 2.5" 400GB PCIe NVMe 3.0 x4 MLC Internal Solid State Drive (SSD)
- Interface: PCIe NVMe 3.0 x4
- Max sequential read: 2,200 MB/s
- Max sequential write: 900 MB/s
- 4KB random read: 430,000 IOPS
- 4KB random write: 230,000 IOPS
- Read latency: 20 µs
- Write latency: 20 µs

Now, what does the above 2,200MB/s read speed translate to when reading the English Wikipedia XML dump file (enwiki-20150112-pages-articles.xml), 51,344,631,742 bytes long? If we read it linearly using Intel's badboy we will need 51,344,631,742/1024/1024/2200 = 22.2 seconds. Using such an SSD and a 5960X, the "upload" time becomes 51,344,631,742/1024/1024/2200/3 = 7.4 seconds for the upload part, plus 51,344,631,742/1024/1024/6308 = 7.7 seconds for the decompression part, or 7.4+7.7 = 15.1 seconds, i.e. a 22.2-15.1 = 7.1 second boost.

Nah, the boost is not sweet enough. No worries: the used Nakamichi 'Lexx', despite being my favorite code, is not suitable for nowadays' 64bit architectures - it is pure 256bit code, which is the reason it seems inferior while in truth it is monstrously fast. Anyway, to "fill the gap", i.e. to have an intermediate performer until the real 256bit CPUs come, I wrote Nakamichi 'Shifune' - it is much faster than 'Lexx' and will perform smoothly on AMD too, being 64bit code with no AVX. https://pbs.twimg.com/media/COBt7RSUcAAal4S.png

Maybe by the end of the month (or the next) I will share a nifty textual showdown (compression ratio + decompression speed) between Nakamichi 'Shifune' and GZIP and Zstd and ZPAQ and LzTurbo and BSC. My current results show decompression speed supremacy for Nakamichi 'Shifune' in 3:1 big-English-texts cases. Just a glimpse at the incoming 120+ testdatafiles:
(table of the 120+ testdatafiles omitted)
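Meanwhile, the upload-vs-decompression arithmetic above as a few lines of C (a back-of-the-envelope check using only the figures quoted in this post):
Code:
#include <stdio.h>

int main(void)
{
    const double wikiBytes  = 51344631742.0;           /* enwiki XML dump    */
    const double wikiMB     = wikiBytes / 1024 / 1024; /* ~48,966 MB         */
    const double readMBps   = 2200.0;                  /* Intel 750 seq read */
    const double decompMBps = 6308.0;                  /* 5960X 'Lexx' rate  */

    double plainRead  = wikiMB / readMBps;             /* ~22.2 s            */
    double upload3to1 = wikiMB / readMBps / 3;         /* ~7.4 s at 3:1      */
    double decompress = wikiMB / decompMBps;           /* ~7.7 s             */

    printf("plain read : %.1f s\n", plainRead);
    printf("compressed : %.1f s upload + %.1f s decompression = %.1f s\n",
           upload3to1, decompress, upload3to1 + decompress);
    printf("boost      : %.1f s\n", plainRead - upload3to1 - decompress);
    return 0;
}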
I'd love to run this on some test systems. I have access to a lot of big machines, including a dual-proc E5-2697v3 system with 28 cores and 128 GB of RAM, but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software; give me a real source code link that I can verify and I'll build and run it to give you some benchmarks.
Oh man, you have helped me a lot with Kazahana; it would be extra cool to run Freaky_Dreamer too.
I am sorry for the messy situation (source here, testdatafile there). If you have any issues compiling it, just ask me - my *nix command-line and GCC knowledge sucks. Hopefully it will run, since MinGW said OK.
I am not experienced at all with *nix, so I can't tell beforehand how RDTSC will report ticks on a dual-socket machine:
Code:
#if defined(_gcc_mumbo_jumbo_)
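For reference, the GCC branch of GetRDTSC is usually the standard RDTSC idiom, something along these lines (a sketch, not necessarily the exact code in the attachment):
Code:
#include <stdint.h>

/* The usual GCC inline-asm idiom for reading the time-stamp counter.
   Caveat for the dual-socket worry above: the TSCs of two packages are
   not guaranteed to be synchronized, and a thread migrating between
   sockets can observe a non-monotonic tick count. */
static inline uint64_t GetRDTSC(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}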
>... but I won't run any pre-compiled or closed-source code on it. Your download link is tied to distros and packaging software, give me a real source code link that I can verify and I'll build and run it to give you some benchmarks.
Sure, you won't find a more open C coder than me - all my tests are fully open and mostly FREE. I didn't get the second part. The walkthrough is this:
Step #1: The C source is attached to the first post, ZIPped but with extension .txt - the forum doesn't allow .ZIPs?!
Step #2: You need the testdatafile (the 91,964,279-byte Autobiography_411-ebooks_Collection.tar.Nakamichi); it is located in the https://mega.nz/#!I4hHwC5Y!3udON_nVU...SLmQGh89Jp8gns package.
Sorry, I missed the zip download; I was distracted by the distro-specific downloads pushing rpm files on me.
I did download the source zip, extracted and compiled, but the results are a little odd: Code:
$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi
Oh, just got what you meant; I am gonna upload the testdatafile to my site right now.
Done:
http://www.sanmayce.com/Downloads/Na...C_32-threads.c
http://www.sanmayce.com/Downloads/Au....tar.Nakamichi
Thanks. It seems someone knowledgeable in GetRDTSC (and time measuring in general) under *nix has to help me figure it out.
The important thing is that the 'Done' message is there: "All threads finished." It means the decompression went as it should - the sizes match, that is. However, "Decompression time: 0 ticks" means that my time reporting failed fully :mad:
If someone on LQ knows why this time reporter fails, that will be the fix; otherwise I have to ask on SO.
Why is ticksTOTAL2 + GetRDTSC() - ticksStart zero?! Is GetRDTSC() simply not working? Code:
#if defined(_gcc_mumbo_jumbo_)

The total time depends on:
- cache clock;
- CPU clock;
- RAM CAS latency;
- RAM clock.

I suppose your RAM is DDR4 @2133MHz; if so, the major toll is taken by the RAM clock (2666MHz vs ?) and then the CPU clock (4.7GHz vs 3.6GHz), I guess. For comparison, the best result total time on a 5960X at 4.7GHz core / 4.2GHz uncore with DDR4 @2666MHz is about one third of that (3.4 seconds):
Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
@suicidaleggroll
Man, excuse my non-integral distro this time - I just saw that you compiled the C source from the benchmark package; that one won't work, as it was targeted only at Windows. The C source attached to the first post, and the link in POST #8, is the working source - it is "revision 2", targeted for *nix too; in there I added the *nix timing. The dump should say: Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
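For the curious, the *nix timing added in revision 2 is along these lines (a minimal sketch assuming clock_gettime(CLOCK_MONOTONIC); the real source may differ in details):
Code:
#include <time.h>

/* Wall-clock timing that sidesteps the dual-socket TSC problem:
   CLOCK_MONOTONIC ticks at a fixed rate no matter which core or
   socket the thread lands on. Link with -lrt on older glibc. */
static double WallSeconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Around the decompression loop:
       double t0 = WallSeconds();
       ...decompress all pools...
       printf("Decompression time: %.3f seconds\n", WallSeconds() - t0);
*/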
Worked fine now, thanks.
Initial results: Code:
$ ./Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.elf Autobiography_411-ebooks_Collection.tar.Nakamichi
See https://web.archive.org/web/20140101....cx/archives/8

AVX was made by Intel and pushed out so hard that AMD had to implement it fast; it should still work well on more modern AMD CPUs. Also see the XOP instruction set. You should always align to the size of the register used (although it shouldn't matter for ymm/zmm registers) and, if you have a lot of data, to the page size.

PS: AVX or AVX2? AVX2 is the integer one.
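Concretely, getting a buffer aligned to the YMM width (or to the page size for big data) looks like this (a sketch, nothing from the benchmark source; compile with -mavx):
Code:
#include <immintrin.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    void *buf;
    /* 32-byte alignment matches the YMM width; for very large buffers
       align to the 4096-byte page size instead. */
    if (posix_memalign(&buf, 32, 1 << 20) != 0)
        return 1;
    memset(buf, 0, 1 << 20);

    /* The aligned load (vmovdqa) is legal here; on a pointer that is
       not 32-byte aligned it would fault. */
    __m256i v = _mm256_load_si256((const __m256i *)buf);
    (void)v;
    free(buf);
    return 0;
}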
I don't know how 256bit unaligned move instructions are optimized (if they are at all) across the whole AVX family. One guy on OCN ran the test on AMD Vishera (Edit: it DOESN'T support AVX 2.0) and the result was awful:
http://www.overclock.net/t/1564781/c...#post_24200497

The code we are talking about has little to do with caches - most of its accesses are outside the caches. My dummy guess is that AMD didn't optimize them for such cases; what other explanation could one put forth? Code:
.B30.3::
        vmovdqu   ymm0, YMMWORD PTR [1+rcx+rbp]
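In intrinsics terms, that hot instruction is just an unaligned 32-byte load paired with an unaligned store, often at an odd offset like the [1+rcx+rbp] above (a sketch; the names are illustrative, not from the 'Lexx' source):
Code:
#include <immintrin.h>

/* One 'Lexx'-style match copy: fetch 32 bytes from an arbitrarily
   (mis)aligned source inside the sliding window and store them at the
   output position. vmovdqu tolerates any alignment - the very tolerance
   the Vishera result suggests is slow for uncached addresses on AMD. */
static inline void Copy32Unaligned(const unsigned char *src, unsigned char *dst)
{
    _mm256_storeu_si256((__m256i *)dst,
                        _mm256_loadu_si256((const __m256i *)src));
}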
Linux has an amazing tool named perf.
idk how to compile this to include the assembly, so i can't take a look.

vmovdqu is an AVX(1) instruction. Using unaligned MOVs is in most cases bad due to how the CPU handles memory internally (and how the memory bus on the northbridge handles it). It comes down to this: reading 256 bits of unaligned memory results in reading 6*64 bits instead of 4*64, and writing requires reading 2*64 bits, AND-ing, and then writing 6*64 bits. Of course it depends on the CPU, and i may be wrong, so let's look at the instruction latency tables for my CPU (Piledriver):

- VMOVDQA has a latency of 6 cycles when reading from RAM and a reciprocal throughput of 1 (one instruction can start every cycle), while the latency when writing is 11 cycles with a reciprocal throughput of 17 (one instruction can start every 17 cycles).
- VMOVDQU has a latency of 6 when reading and a reciprocal throughput of still 1, but for writing it is 14 cycles with a reciprocal throughput of 20.

Note that a VMOVDQU write to cache/memory takes 8 instructions (microcode), while a VMOVDQA write takes 4. There doesn't seem to be data for these instructions on Intel here, but if you find it, do note that these numbers are dependent on CPU frequency (cycle = 1/freq seconds).

As for the snippet of code: note how, between the read and the write, rax is not changed and some other instructions are executed. This is due to those latencies - the compiler reordered instructions around those two.

If you really want to make that code run fast, use movntps (has to be aligned). It is an SSE instruction that moves 16 bytes from the register directly into RAM, bypassing cache. This will reduce cache thrashing (unless you read that value right after writing it). While at the topic: Intel CPUs usually have more cache and are thus less vulnerable to cache thrashing. idk how this program works, so idk if using aligned memory access is worthwhile; if it is 4- or 8-byte aligned you can try combining 32bit and/or 64bit registers with movntps. Memory is almost always slower than processing on modern CPUs.

Computer Architecture, Fifth Edition: A Quantitative Approach is a great book about CPUs, and it explains cache well. (PS: Agner Fog's web page.) (PPS: some AMD manuals are good, better than the Intel ones if you ask me.)
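For reference, the movntps idea in C intrinsics (a sketch; the destination must be 16-byte aligned, and it only pays off when the stored data is not re-read soon):
Code:
#include <xmmintrin.h>  /* SSE: _mm_stream_ps (movntps) and _mm_sfence */

/* Non-temporal 16-byte store: writes the register straight to RAM,
   bypassing the cache hierarchy, so a big write-only output stream
   does not evict the sliding-window data. 'dst' must be 16-byte
   aligned or the instruction faults. */
static inline void Store16Bypass(float *dst, __m128 value)
{
    _mm_stream_ps(dst, value);
}

/* After a run of streaming stores, fence before any dependent reads. */
static inline void StreamFence(void)
{
    _mm_sfence();
}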
There we go... the other jobs finally cleared off. Here's a re-run with all 28 cores available (as available as possible while running the OS and a couple of mostly-idle VMs):
Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.

Chassis/Mb: Supermicro 1028GR-TRT
Procs: 2x Xeon E5-2697 v3 with hyperthreading disabled
RAM: 8x Kingston KVR21R15D4/16
Drives: 4x Intel SSDSC2BF480H501 in RAID 10
CentOS 7 running gcc 4.8.3
@genss
Thanks for the suggestions. Yes, for some other cases I would consider them; however, this particular snippet is out-of-the-box - it disregards all standard one-to-one transfers and uses overutilized ones, storing/fetching all sizes <=32 with a YMM register. In some ways it is a brute technique, since overlapping the writes is not good - one write should wait for the previous one to finish - but this is the charm of this etude: simplicity for the human, complexity for the CPU. I would try some manually done alignments, but this would only bubble up the code, which is not human-likeable; let there be some load/work for the logic inside that 5+ billion transistor box.

The AMD manual is good and I will look it up, but quick skimming showed that it covers only up to XMM; YMM is not covered - it is from 2005, yes. As for 'Computer Architecture: A Quantitative Approach', a good book no doubt; on page 126 I saw:

>Pitfall: Simulating enough instructions to get accurate performance measures of the memory hierarchy. There are really three pitfalls here. One is trying to predict performance of a large cache using a small trace. Another is that a program's locality behavior is not constant over the run of the entire program. The third is that a program's locality behavior may vary depending on the input.

Yes, but regardless of the small size of the decompression loop, 'Freaky_Dreamer' (the 'Lexx' decompression loop, in fact) doesn't fall into the above pitfall. In my opinion it stresses the whole memory hierarchy well, no? Simply put, the etude is superb: it loads chunks 32 bytes long across a 0..256MB range (backwards from some current position); since it is LZSS decompression with a huge Sliding Window, this translates to L1/L2/L3- and RAM-intensive fetches. Anyway, I don't know whether you felt one of the nifty facets of 'Lexx': my idea is for the branchlessness and the simplicity to be exploited well at some point in GPUs. I saw that the book explains even the branching on GPUs; surely I will look it up more.

If you have the time and will to play with the AMD response on YMM vs GP registers, you may simply comment out the YMM line and replace it with 4x8-byte loads (see the sketch at the end of this post). Code:
#ifdef _N_YMM
Code:
// SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);

The source of Nakamichi 'Lexx' is here: https://software.intel.com/sites/def...onfly_Lexx.zip
The file compressed with Nakamichi 'Lexx' is here: http://www.sanmayce.com/Downloads/Au....tar.Nakamichi

I am tempted to ask Agner Fog to share his view on the topic; if he can see how to speed things up, it would be much appreciated. I have read (on Intel's forum) some of his visions for enhancing x86. In one of his posts he talked about unburdening the current architecture from the necessity of dragging along compatibility with old code; as far as I understood, he proposed what I need now - when you have an instruction dealing with e.g. 256 bits, the goal is to make it native, not some kind of mutant at the microcode level. Things are complex, and compatibility is something awesome, but my eyes are on whoever will make a pure 256bit CPU-RAM subsystem available to the enthusiast community. My point: let's wait another 20 years and see whether the Nakamichi 'Lexx' decompression loop will need rearrangements. I think not.

@suicidaleggroll
Many thanks! Now I know that my expectations were too high - I expected to see 14 IPC. Yet "IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 9.466" is a solid trump over the speedy 5960X @4.7GHz with @2666MHz DDR4. Every test brings new PRACTICAL knowledge, not some theoretical mumbo-jumbo estimation. Now it is clear to me: 'Lexx' decompression really is not for nowadays' CPUs; maybe when 256MB fast caches arrive it will scream.
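Coming back to the suggested swap: the two SlowCopy256bit variants look roughly like this (a sketch with illustrative shapes; the real SlowCopy256bit lives in the linked 'Lexx' source):
Code:
#include <stdint.h>
#include <string.h>
#ifdef _N_YMM
#include <immintrin.h>

/* YMM variant: one unaligned 32-byte fetch + store, as 'Lexx' does. */
static inline void SlowCopy256bit(const char *src, char *dst)
{
    _mm256_storeu_si256((__m256i *)dst,
                        _mm256_loadu_si256((const __m256i *)src));
}
#else

/* GP-register fallback: the same <=32 bytes as 4x8-byte transfers,
   the replacement suggested above for AVX-averse AMD parts. */
static inline void SlowCopy256bit(const char *src, char *dst)
{
    uint64_t q;
    int i;
    for (i = 0; i < 4; i++) {
        memcpy(&q, src + 8 * i, 8);  /* compiles to a plain 64bit mov */
        memcpy(dst + 8 * i, &q, 8);
    }
}
#endif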
I speak from experience, not some "theoretical mumbo-jumbo estimation"; my experience shows the correlation between reality and the texts I linked for you to read. (I wrote a very fast memcpy(), and that is more or less what your program spends its time doing.) If you do not wish to learn, do not ask.

The book about computer architecture clearly states that raising the amount of CPU cache is a futile effort: most of the CPU die is already used for cache, and tests clearly show only about a 10% speed increase when doubling the cache size. It also explains why that is so.

Do learn to use perf, and do learn proper assembly. I would ask of you to please not bother Mr. Fog.
@genss
Strange - you say I am welcome, and yet your words sound as if I am some insolent smartass; nah. If you knew me better you would see the misgiving. I'll tell you outright: I am interested only in sharing etudes and benchmarking them, not in learning per se, nor in some professional career or status. Pure amateurism heavily reinforced by tons of tests - that's me.

Anyway, I often use 'mumbo-jumbo' to highlight the contrast between theoretical and practical knowledge. Not that I look down on theory; I just want to emphasize the principle of going against the flow, heh-heh. As for the clocks needed for each instruction and such, that is not mumbo-jumbo - maybe you thought that was what I meant, but it wasn't. I meant the endless compression/decompression empty talk about how some compressor features this and that, while in reality, once put on the bench, it starts to show that even the author didn't know whether it was good enough for e.g. textual decompression. No tests (speed showdowns), no practical value proven.

>if you do not wish to learn, do not ask
Tried several times to see what made you say that; no clue.

>i would ask of you to please not bother mr. Fog
Maybe you have something in mind; so be it.