LinuxQuestions.org
Old 09-04-2015, 05:34 PM   #16
genss
Member

linux has an amazing tool named perf
idk how to compile this so that the assembly is included, so i can't take a look

vmovdqu is an AVX(1) instruction

using unaligned MOVs is in most cases bad, due to how the cpu handles memory internally
(and how the memory bus on the northbridge handles it)
it comes down to this: reading 256 bits of unaligned memory results in reading 6*64 bits instead of 4*64,
and writing requires reading 2*64 bits, AND-ing, and then writing 6*64 bits

ofc it depends on the cpu and i may be wrong
so let's look at instruction latency tables
for my cpu (piledriver):

VMOVDQA has a latency of 6 cycles when reading from RAM and a reciprocal throughput of 1 (one instruction can start every cycle)
while the latency when writing is 11 cycles and the reciprocal throughput is 17 (one instruction can start every 17 cycles)

VMOVDQU has a latency of 6 when reading and the reciprocal throughput is still 1
but for writing it is 14 cycles with a reciprocal throughput of 20

note that a VMOVDQU write to cache/memory takes 8 micro-ops (microcode), while a VMOVDQA write takes 4
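
in C with compiler intrinsics the two variants look like this (just a sketch, not from the program in question; _mm256_load_si256 emits vmovdqa and faults if the pointer is not 32-byte aligned, _mm256_loadu_si256 emits vmovdqu and accepts any alignment):

Code:
#include <immintrin.h>  /* AVX intrinsics */

/* vmovdqa: p must be 32-byte aligned */
__m256i load_aligned(const void *p)
{
    return _mm256_load_si256((const __m256i *)p);
}

/* vmovdqu: any alignment, but see the latency/throughput numbers above */
__m256i load_unaligned(const void *p)
{
    return _mm256_loadu_si256((const __m256i *)p);
}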


there doesn't seem to be data for these instructions on intel cpus here
but if you find them, do note that these numbers depend on cpu frequency (a cycle is 1/frequency seconds, so e.g. at 4GHz one cycle is 0.25ns)


as for the snippet of code
note how rax is not changed between the read and the write, while some other instructions are executed in between
that is due to those latencies; the compiler reordered instructions around those 2


if you really want to make that code run fast, use movntps (has to be aligned)
it is an SSE instruction that moves 16 bytes from a register directly into RAM, bypassing the cache
this will reduce cache thrashing
(unless you read that value right after writing it)
while at the topic, intel cpus usually have more cache and are thus less vulnerable to cache thrashing
idk how this program works, so idk if using aligned memory access is worthwhile
if the data is only 4 or 8 byte aligned you can try combining 32bit and/or 64bit registers with movntps, as in the sketch below
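
for example, a sketch of cache-bypassing stores with intrinsics (assuming a 16-byte aligned destination; movnti via _mm_stream_si64 covers the 8-byte-aligned case on x86-64):

Code:
#include <emmintrin.h>  /* SSE2: _mm_stream_si64 (movnti) */
#include <xmmintrin.h>  /* SSE:  _mm_stream_ps   (movntps) */

/* store 32 bytes without polluting the cache; dst 16-byte aligned */
void store32_nt(float *dst, __m128 lo, __m128 hi)
{
    _mm_stream_ps(dst,     lo);  /* movntps */
    _mm_stream_ps(dst + 4, hi);  /* movntps */
}

/* only 8-byte alignment available? movnti stores from a GP register */
void store8_nt(long long *dst, long long v)
{
    _mm_stream_si64(dst, v);     /* movnti */
}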

memory is almost always slower than processing on modern cpus
Computer Architecture, Fifth Edition: A Quantitative Approach is a great book about cpus and it explains cache well
(PS Agner Fog's web page)
(PPS some AMD manuals are good, better than the intel ones if you ask me)

Last edited by genss; 09-04-2015 at 05:42 PM.
 
Old 09-05-2015, 09:46 AM   #17
suicidaleggroll
LQ Guru

There we go...the other jobs finally cleared off. Here's a re-run with all 28 cores available (as available as possible when running the OS and a couple of mostly-idle VMs):

Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC_trials', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 28
omp_get_max_threads( ) = 28
Pass # 1 of 64
...
Pass #64 of 64
All threads finished.
Decompression time: 302,662,933,786 ticks.
TPI (Ticks_Per_Instruction_during_branchless_decompression) performance: 0.106
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 9.466
Specs:

Chassis/Mb: Supermicro 1028GR-TRT
Procs: 2x Xeon E5-2697 v3 with hyperthreading disabled
RAM: 8x Kingston KVR21R15D4/16
Drives: 4x Intel SSDSC2BF480H501 in RAID 10
CentOS 7 running gcc 4.8.3
 
Old 09-05-2015, 03:34 PM   #18
Sanmayce
Member (Original Poster)

@genss

Thanks for the suggestions, yes, for some other cases I would consider them; however, this particular snippet is deliberately out-of-the-box: it disregards all standard one-to-one transfers and uses an over-utilized one instead, storing/fetching all sizes <=32 with a single YMM register. In some way it is a brute technique, since overlapping the writes is not good (one write has to wait for the previous one to finish), but this is the charm of this etude: simplicity for the human, complexity for the CPU. I would try some manually done alignments, but that would only bloat the code, which is not human-likeable; let there be some load/work for the logic inside that 5+ billion transistor box.
The AMD manual is good and I will look into it, but a quick skim showed that it explains only up to XMM; YMM is not covered, since it dates from 2005.

As for the book, it is a good one, no doubt about it; on page 126 I saw:

Pitfall:
Simulating enough instructions to get accurate performance measures of the
memory hierarchy.
There are really three pitfalls here. One is trying to predict performance of a large
cache using a small trace. Another is that a program's locality behavior is not
constant over the run of the entire program. The third is that a program's locality
behavior may vary depending on the input.


Yes, but regardless of the small size of the decompression loop, 'Freaky_Dreamer' (the 'Lexx' decompression loop, in fact) doesn't fall into the above pitfall. In my opinion it stresses the whole memory hierarchy well, no?
Simply put, the etude is superb: it loads 32-byte-long chunks across a 0..256MB range (backwards from the current position). Since it is LZSS decompression with a huge Sliding Window, this translates to intensive L1/L2/L3 and RAM fetches.

Anyway, I don't know whether you sensed one of the nifty facets of 'Lexx': my idea is for the branchlessness and the simplicity to be exploited well at some point on GPUs. I saw that the book explains even branching on GPUs; surely I will look into it more.

If you have the time and will to play with the AMD response on YMM vs GP registers, you may simply comment out the YMM line and replace it with 4x 8-byte loads.

Code:
#ifdef _N_YMM
//	SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
// Another (incompatible with the Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK, but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
	SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
#endif
The above is changed to this:

Code:
//		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
*(uint64_t*)(retLOCAL+8*(0)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(0));
*(uint64_t*)(retLOCAL+8*(1)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(1));
*(uint64_t*)(retLOCAL+8*(2)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(2));
*(uint64_t*)(retLOCAL+8*(3)) = *(uint64_t*)(1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated)+8*(3));
That way, AVX won't hurt AMD anymore.
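
For clarity, the common branchless source pointer can also be hoisted into a local; a hypothetical equivalent of the 4x 8-byte variant above (same variables, same semantics; a decent compiler likely performs this common-subexpression elimination on its own):

Code:
// Hypothetical rewrite: compute the branchless source address once
const char *srcPTR = (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) );
*(uint64_t*)(retLOCAL+8*(0)) = *(uint64_t*)(srcPTR+8*(0));
*(uint64_t*)(retLOCAL+8*(1)) = *(uint64_t*)(srcPTR+8*(1));
*(uint64_t*)(retLOCAL+8*(2)) = *(uint64_t*)(srcPTR+8*(2));
*(uint64_t*)(retLOCAL+8*(3)) = *(uint64_t*)(srcPTR+8*(3));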

The source of Nakamichi 'Lexx' is here:
https://software.intel.com/sites/def...onfly_Lexx.zip

The compressed with Nakamichi 'Lexx' file is here:
http://www.sanmayce.com/Downloads/Au....tar.Nakamichi

I am tempted to ask Agner Fog to share his view on the topic; if he can see how to speed things up, it would be much appreciated.
I have read (on Intel's forum) some of his visions for enhancing x86. In one of his posts he talked about unburdening the current architecture from the necessity of dragging along compatibility with old code. As far as I understood, he proposed what I need now: when you have an instruction dealing with e.g. 256 bits, the goal is to make it native, not some kind of mutant at the microcode level. These things are complex, and compatibility is something awesome, but my eyes are on whoever will make a pure 256-bit CPU-RAM subsystem available to the enthusiast community. My point: let's wait another 20 years and see whether the Nakamichi 'Lexx' decompression loop will need rearrangements. I think not.


@suicidaleggroll

Many thanks!

Now I know that my expectations were too high, I expected to see 14 IPC.
Yet:
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 9.466
is a solid trump over the speedy 5960X @ 4.7GHz with 2666MHz DDR4.

Every test brings new PRACTICAL knowledge, not some theoretical mumbo-jumbo estimation. Now it is clear to me: 'Lexx' decompression really is not for today's CPUs; maybe when fast 256MB caches arrive, then it will scream.
 
Old 09-06-2015, 08:45 AM   #19
genss
Member

Quote:
Originally Posted by Sanmayce View Post
...
you are welcome

i speak from experience, not some "theoretical mumbo-jumbo estimation"
my experience shows the correlation between reality and the texts i linked for you to read
(i wrote a very fast memcpy(); that is more or less what your program spends its time doing)
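
for illustration only, a minimal sketch of that kind of copy loop, combining aligned SSE loads with the movntps stores from my earlier post (assumptions: both pointers 16-byte aligned, n a multiple of 64; this is not my actual code):

Code:
#include <stddef.h>     /* size_t */
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence */

/* copy n bytes with cache-bypassing stores;
   dst and src 16-byte aligned, n a multiple of 64 */
static void stream_copy(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n / sizeof(float); i += 16) {
        __m128 a = _mm_load_ps(src + i);        /* movaps: aligned loads */
        __m128 b = _mm_load_ps(src + i + 4);
        __m128 c = _mm_load_ps(src + i + 8);
        __m128 d = _mm_load_ps(src + i + 12);
        _mm_stream_ps(dst + i,      a);         /* movntps: bypass cache */
        _mm_stream_ps(dst + i + 4,  b);
        _mm_stream_ps(dst + i + 8,  c);
        _mm_stream_ps(dst + i + 12, d);
    }
    _mm_sfence();  /* fence so the non-temporal stores are visible in order */
}
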
if you do not wish to learn, do not ask

the book about computer architecture clearly states that raising the amount of cpu cache is a futile effort
most of the cpu die is already used for cache, and tests clearly show only about a 10% speed increase when doubling the cache size
it also explains why that is so

do learn to use perf and do learn proper assembly

i would ask of you to please not bother mr. Fog

Last edited by genss; 09-06-2015 at 09:19 AM. Reason: percent sign !!! damn forum
 
Old 09-06-2015, 03:09 PM   #20
Sanmayce
Member (Original Poster)

@genss

Strange, you say I am welcome, and yet your words sound as if I were some insolent smartass; nah.
If you knew me better you would see the misgiving.

I'll say it outright: I am interested only in sharing etudes and benchmarking them, not in learning per se, nor in some professional career or status. Pure amateurism heavily reinforced by tons of tests, that's me.

Anyway, I often use 'mumbo-jumbo' to highlight the contrast between theoretical and practical knowledge. Not that I look down on theory, I just want to emphasize the principle of going against the flow, heh-heh. As for the clocks needed for each instruction and such, that is not mumbo-jumbo; maybe you thought that was what I meant, but it was not. I meant the endless compression/decompression empty talk about how some compressor features this and that, while in reality, once put on the bench, it starts to show that even the author didn't know whether it was good enough for e.g. textual decompression. No tests (speed showdowns), no practical value proven.

>if you do not wish to learn, do not ask
Tried several times to see what made you say that; no clue.

>i would ask of you to please not bother mr. Fog
Maybe you have something in mind, so be it.
 
  

