qsort vs 'Magnetica' Quicksort

Sanmayce · 11-09-2021, 12:57 PM

Just wrote my qsort() replacement, wanna share it with the community and stress my variant with more versatile data...

Generally, we have three major scenarios:
few distinct keys;
many distinct keys;
ALL distinct keys.

In first two cases, the new implementation is faster or equal to the "Standard" one.
Sadly, in the third case, it is somewhat slower, should be investigated further.

Anyway, benchmarking will show the practical value of 'Magnetica', personally, I started to use it already.
Please, give it a try, my new partitioning scheme is not perfect but pushes the limitations of old ones.

Didn't have time to add the second scenario, but the results (for 1st scenario, with 10 distinct keys) are promising:
273.8/3.7= 74x faster

dugan · 11-09-2021, 01:00 PM

You didn't post it or link to it, but it sounds like the same idea as this:

https://en.wikipedia.org/wiki/Timsort

Sanmayce · 11-09-2021, 01:10 PM

Quote:

Originally Posted by dugan

You didn't post it or link to it, but it sounds like the same idea as this:

https://en.wikipedia.org/wiki/Timsort

Now attached as a .txt, LQ doesn't allow .c attachments, hmm.
As for Timsort, never looked it up, just know it is among some major performers.
AFAIK, my 'Magnetica' partitioning is a new one, nothing special, just did on paper for an hour the swaps and saw that there is a crucial moment where unnecessary swaps are to be avoided. Wonder, whether another guy had it before me, just to acknowledge it, I am too lazy to look up other's sources, however love benchmarking against them.

sundialsvcs · 11-10-2021, 08:26 AM

The "performance" of any sorting algorithm, as measured in microseconds, is always dependent upon the data. However, the practical performance of any algorithm is usually described in terms of "order," as in O(Log2N). Shell Sort will always out-perform Bubble Sort, and Quicksort will always out-perform Shell.

I suggest that you create a new Github project for your new algorithm, and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.

shruggy · 11-10-2021, 01:16 PM

Slightly off-topic. While searching XFCE documentation on their site, I stumbled upon an old article from year 1995 on AlphaSort.

Quote:

No matter how fast the processor, a 100 MB external sort using a single 1993-vintage SCSI disk takes about one minute of elapsed time. This one-minute barrier is created by the 3MB/s sequential transfer rate (bandwidth) of a single commodity disk. We measured both the OpenVMS Sort utility and AlphaSort to take a little under one minute when using one SCSI disk. Both sorts are disk-limited. A faster processor or faster algorithm would not sort much faster because the disk reads at about 4.5 MB/s, and writes at about 3.5 MB/s. Thus, it takes about 25 seconds to read the 100 MB, and about 30 seconds to write the 100 MB answer file. Even on mainframes, sort algorithms like Syncsort and DFsort are limited by this one-minute barrier, unless disk or file striping is used.

Sanmayce · 11-11-2021, 12:10 AM

>However, the practical performance of any algorithm is usually described in terms of "order," as in O(Log2N).

Indeed, however my focus is on 'perf', you are welcome to see the statistics below and draw even better PRACTICAL conclusions whether one performer is good or no good enough.

>I suggest that you create a new Github project for your new algorithm, and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.

Nah, I prefer sharing it with fellow forum members, few times got treated as a spammer on stack*, Github is cool but I search feedback from amateurs (in his primal sense - lovers of something) as me. Benchmarking is what interests me, of course if some coder improves on it then even better - we all win.
After all, all good things will make their way, by itself, naturally.

>Slightly off-topic. While searching XFCE documentation on their site, I stumbled upon an old article from year 1995 on AlphaSort.

Thanks, no idea what it is, but for a long thirty years I admire the work of Bob Pirko and its RPSORT.ASM - a 16-bit sorter, yet never looked into his source code - didn't want to "steal" his ideas, weird, maybe there are some ideas worth reviving.

Looking calmly at 'Magnetica' for few minutes, couldn't miss the obvious optimization of moving 'PR+1' outside the if-elseif-else, now the refined revision is smaller and faster.
Now, we can benchmark either with QS_bench_r2.c or by adding 'Magnetica' to your code.
You need "22338618_QWORDS.bin" and "mobythesaurus.txt" from this package:
https://www.qb64.org/forum/index.php...0;attach=14875

As for the third case (ALL distinct keys), I wanna 2,000,000,000 keys, but how to shuffle them, so each time the keys to be in a given order?

In the near future, I intend to make a showdown between Bentley-McIlroy 3-way partitioning and 'Magnetica', I am not convinced their scheme is better, will see...

This is the refinement which is used in QS_bench_r2.c:

Code:

            // 'Magnetica' partitioning rev.2[
            Jndx = Right;
            PL = Left;
            PR = Left;
            swap (&QWORDS[(Left + Right)>>1], &QWORDS[PR]);
            Pivot = QWORDS[PR];
            for (;PR < Jndx;) {
                PR = PR + 1;
                if (Pivot > QWORDS[PR]) {
                    swap (&QWORDS[PL], &QWORDS[PR]);
                    PL = PL + 1;
                //} else if (Pivot == QWORDS[PR]) {
                } else if (Pivot < QWORDS[PR]) {
                    for (;Pivot < QWORDS[Jndx];) {
                        Jndx = Jndx - 1;
                    }
                    if (PR < Jndx) swap (&QWORDS[PR], &QWORDS[Jndx]);
                    Jndx = Jndx - 1;
                    PR = PR - 1;
                }
            }
            Jndx = PL - 1;
            Indx = PR + 1;
            // 'Magnetica' partitioning rev.2]
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; ae-64+2=76 bytes i.e. (94-76=18 less), 5/1 conditional/unconditional jumps i.e. 2-1=1 less unconditional jump
; 'Magnetica' partitioning, mainloop rev.2[
.B4.5::                         
  00064 48 ff c5         inc rbp                                
  00067 48 8b 14 e9      mov rdx, QWORD PTR [rcx+rbp*8]         
  0006b 4c 3b ea         cmp r13, rdx                           
  0006e 77 2c            ja .B4.14 
.B4.6::                         
  00070 73 39            jae .B4.15 
.B4.7::                         
  00072 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00076 4d 3b ef         cmp r13, r15                           
  00079 73 0c            jae .B4.11 
.B4.9::                         
  0007b 49 ff cb         dec r11                                
  0007e 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00082 4d 3b ef         cmp r13, r15                           
  00085 72 f4            jb .B4.9 
.B4.11::                        
  00087 49 3b eb         cmp rbp, r11                           
  0008a 7d 08            jge .B4.13 
.B4.12::                        
  0008c 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
  00090 4a 89 14 d9      mov QWORD PTR [rcx+r11*8], rdx         
.B4.13::                        
  00094 49 ff cb         dec r11                                
  00097 48 ff cd         dec rbp                                
  0009a eb 0f            jmp .B4.15 
.B4.14::                        
  0009c 4e 8b 3c f1      mov r15, QWORD PTR [rcx+r14*8]         
  000a0 4a 89 14 f1      mov QWORD PTR [rcx+r14*8], rdx         
  000a4 49 ff c6         inc r14                                
  000a7 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
.B4.15::                        
  000ab 49 3b eb         cmp rbp, r11                           
  000ae 7c b4            jl .B4.5 
; 'Magnetica' partitioning, mainloop rev.2]
*/

Glad to share the first two cases benchmarked with 'Magnetica' rev.2, the console log on my Fedora 33, i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:

Code:

[kaze@kaze Quicksort_says_rev3]$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) 

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 qsort few':

        372,145.95 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,067      page-faults:u             #    0.012 M/sec                  
 1,123,792,986,601      cycles:u                  #    3.020 GHz                      (50.00%)
 3,268,399,480,738      instructions:u            #    2.91  insn per cycle           (62.50%)
   650,531,988,573      branches:u                # 1748.056 M/sec                    (62.50%)
     5,519,977,909      branch-misses:u           #    0.85% of all branches          (62.50%)
   734,216,718,766      L1-dcache-loads:u         # 1972.927 M/sec                    (62.50%)
     6,752,096,756      L1-dcache-load-misses:u   #    0.92% of all L1-dcache accesses  (62.50%)
       423,592,362      LLC-loads:u               #    1.138 M/sec                    (50.00%)
       126,404,435      LLC-load-misses:u         #   29.84% of all LL-cache accesses  (50.00%)

     372.770596352 seconds time elapsed

     364.324923000 seconds user
       5.532901000 seconds sys

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 Magnetica few':

         45,966.43 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,067      page-faults:u             #    0.095 M/sec                  
   123,532,537,542      cycles:u                  #    2.687 GHz                      (50.00%)
   101,337,186,572      instructions:u            #    0.82  insn per cycle           (62.50%)
    24,106,121,927      branches:u                #  524.429 M/sec                    (62.50%)
     3,277,959,707      branch-misses:u           #   13.60% of all branches          (62.50%)
    13,329,593,401      L1-dcache-loads:u         #  289.985 M/sec                    (62.50%)
     1,655,094,422      L1-dcache-load-misses:u   #   12.42% of all L1-dcache accesses  (62.50%)
        83,759,355      LLC-loads:u               #    1.822 M/sec                    (50.00%)
        69,790,047      LLC-load-misses:u         #   83.32% of all LL-cache accesses  (50.00%)

      45.977603171 seconds time elapsed

      40.202975000 seconds user
       5.433257000 seconds sys

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 qsort many':

        547,697.55 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,781      page-faults:u             #    0.009 M/sec                  
 1,660,777,393,784      cycles:u                  #    3.032 GHz                      (50.00%)
 2,926,096,300,717      instructions:u            #    1.76  insn per cycle           (62.50%)
   656,833,950,654      branches:u                # 1199.264 M/sec                    (62.50%)
    24,981,866,259      branch-misses:u           #    3.80% of all branches          (62.50%)
   675,499,020,355      L1-dcache-loads:u         # 1233.343 M/sec                    (62.50%)
     8,384,728,679      L1-dcache-load-misses:u   #    1.24% of all L1-dcache accesses  (62.50%)
       253,666,230      LLC-loads:u               #    0.463 M/sec                    (50.00%)
       149,205,273      LLC-load-misses:u         #   58.82% of all LL-cache accesses  (50.00%)

     547.972289671 seconds time elapsed

     538.278672000 seconds user
       5.984880000 seconds sys

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 Magnetica many':

        241,082.32 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,781      page-faults:u             #    0.020 M/sec                  
   719,580,616,824      cycles:u                  #    2.985 GHz                      (50.00%)
   788,555,184,960      instructions:u            #    1.10  insn per cycle           (62.50%)
   152,383,462,963      branches:u                #  632.081 M/sec                    (62.50%)
    18,722,063,529      branch-misses:u           #   12.29% of all branches          (62.50%)
   105,308,386,791      L1-dcache-loads:u         #  436.815 M/sec                    (62.50%)
     9,762,518,793      L1-dcache-load-misses:u   #    9.27% of all L1-dcache accesses  (62.50%)
       291,667,293      LLC-loads:u               #    1.210 M/sec                    (50.00%)
       240,922,518      LLC-load-misses:u         #   82.60% of all LL-cache accesses  (50.00%)

     241.152242784 seconds time elapsed

     233.338084000 seconds user
       5.991166000 seconds sys

[kaze@kaze Quicksort_says_rev3]$

In case "FEW distinct", Magnetica is 364.3/40.2=9.0x faster than qsort.
In case "MANY distinct", Magnetica is 538.2/233.3= 2.3x faster than qsort.
In case "ALL distinct", to be seen...

Sanmayce · 11-12-2021, 09:03 PM

Kinda expected Bentley-McIlroy duo to be superior (in many distinct keys scenario), it turns out 'Magnetica' dominates in all three cases, the following showdown tells a lot:

The Quicksort Showdown roster:

Code:

+-----------------+--------------+---------------+---------------+
| Performer/Keys  | FEW distinct | MANY distinct | ~ALL distinct |
+-----------------+--------------+---------------+---------------+
| qsort           |  317 seconds |   508 seconds |   504 seconds |
| Magnetica       |   39 seconds |   222 seconds |   292 seconds |
| Bentley-McIlroy |   45 seconds |   226 seconds |   306 seconds |
+-----------------+--------------+---------------+---------------+

Legend:
FEW = 2,233,861,800 keys, of them distinct keys = 10
MANY = 2,482,300,900 keys, of them distinct keys = 2,847,531
~ALL = 2,009,333,753 keys, of them distinct keys = 1,912,608,132

The testfile for case #3:
https://download.fedoraproject.org/p..._64-35-1.2.iso

The code (used in QS_bench_r3.c) of Bentley-McIlroy 3-way partitioning by Dr. Jon Bentley (Bell Laboratories) and Dr. Douglas McIlroy (Bell Laboratories):

Code:

void quicksort_Bentley_McIlroy_3way_partitioning(uint64_t a[], int64_t l, int64_t r)
{ 
	int64_t i = l-1, j = r, p = l-1, q = r; 
	int64_t k; 
	uint64_t v = a[r];
	if (r <= l) return;
	for (;;)
	{
		while (a[++i] < v) ;
		while (v < a[--j]) if (j == l) break;
		if (i >= j) break;
		swap(&a[i], &a[j]);
		if (a[i] == v) { p++; swap(&a[p], &a[i]); }
		if (v == a[j]) { q--; swap(&a[j], &a[q]); }
	}
	swap(&a[i], &a[r]); j = i-1; i = i+1;
	for (k = l; k < p; k++, j--) swap(&a[k], &a[j]);
	for (k = r-1; k > q; k--, i++) swap(&a[i], &a[k]);
	quicksort_Bentley_McIlroy_3way_partitioning(a, l, j);
	quicksort_Bentley_McIlroy_3way_partitioning(a, i, r);
}

The code (used in QS_bench_r3.c) of Magnetica 3-way partitioning by Sanmayce:

Code:

//    _____                                __  .__               
//   /     \ _____     ____   ____   _____/  |_|__| ____ _____   
//  /  \ /  \\__  \   / ___\ /    \_/ __ \   __\  |/ ___\\__  \  
// /    Y    \/ __ \_/ /_/  >   |  \  ___/|  | |  \  \___ / __ \_
// \____|__  (____  /\___  /|___|  /\___  >__| |__|\___  >____  /
//         \/     \//_____/      \/     \/             \/     \/ 
void Quicksort_QB64_v4(uint64_t QWORDS[], int64_t Left, int64_t Right) {
    int InsertionsortTHRESHOLD = 0;
    int64_t Indx, Jndx, PL, PR;
    int64_t Stack[99999];
    int64_t StackPtr = 0;
    uint64_t Pivot;
    StackPtr++; Stack[StackPtr] = Left;
    StackPtr++; Stack[StackPtr] = Right;

    do {
        Right = Stack[StackPtr];
        Left = Stack[StackPtr - 1];
        StackPtr = StackPtr - 2;
        do {
            // 'Magnetica' partitioning rev.2[
            Jndx = Right;
            PL = Left;
            PR = Left;
            swap (&QWORDS[(Left + Right)>>1], &QWORDS[PR]);
            Pivot = QWORDS[PR];
            for (;PR < Jndx;) {
                PR = PR + 1;
                if (Pivot > QWORDS[PR]) {
                    swap (&QWORDS[PL], &QWORDS[PR]);
                    PL = PL + 1;
                //} else if (Pivot == QWORDS[PR]) {
                } else if (Pivot < QWORDS[PR]) {
                    for (;Pivot < QWORDS[Jndx];) {
                        Jndx = Jndx - 1;
                    }
                    if (PR < Jndx) swap (&QWORDS[PR], &QWORDS[Jndx]);
                    Jndx = Jndx - 1;
                    PR = PR - 1;
                }
            }
            Jndx = PL - 1;
            Indx = PR + 1;
            // 'Magnetica' partitioning rev.2]
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; ae-64+2=76 bytes i.e. (94-76=18 less), 5/1 conditional/unconditional jumps i.e. 2-1=1 less unconditional jump
; 'Magnetica' partitioning, mainloop rev.2[
.B4.5::                         
  00064 48 ff c5         inc rbp                                
  00067 48 8b 14 e9      mov rdx, QWORD PTR [rcx+rbp*8]         
  0006b 4c 3b ea         cmp r13, rdx                           
  0006e 77 2c            ja .B4.14 
.B4.6::                         
  00070 73 39            jae .B4.15 
.B4.7::                         
  00072 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00076 4d 3b ef         cmp r13, r15                           
  00079 73 0c            jae .B4.11 
.B4.9::                         
  0007b 49 ff cb         dec r11                                
  0007e 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00082 4d 3b ef         cmp r13, r15                           
  00085 72 f4            jb .B4.9 
.B4.11::                        
  00087 49 3b eb         cmp rbp, r11                           
  0008a 7d 08            jge .B4.13 
.B4.12::                        
  0008c 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
  00090 4a 89 14 d9      mov QWORD PTR [rcx+r11*8], rdx         
.B4.13::                        
  00094 49 ff cb         dec r11                                
  00097 48 ff cd         dec rbp                                
  0009a eb 0f            jmp .B4.15 
.B4.14::                        
  0009c 4e 8b 3c f1      mov r15, QWORD PTR [rcx+r14*8]         
  000a0 4a 89 14 f1      mov QWORD PTR [rcx+r14*8], rdx         
  000a4 49 ff c6         inc r14                                
  000a7 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
.B4.15::                        
  000ab 49 3b eb         cmp rbp, r11                           
  000ae 7c b4            jl .B4.5 
; 'Magnetica' partitioning, mainloop rev.2]
*/
            if (Indx + InsertionsortTHRESHOLD < Right) {
                StackPtr = StackPtr + 2;
                Stack[StackPtr - 1] = Indx;
                Stack[StackPtr] = Right;
            }
            Right = Jndx;
        } while (Left < Right);
    } while (StackPtr != 0);
}

The full console log follows (my laptop i5-7200U 36GB DDR4, Fedora 33):

Code:

[kaze@kaze ~]$ cd /run/media/kaze/Sanmayce_223GB_A/Quicksort_says_rev3
[kaze@kaze Quicksort_says_rev3]$ dir
22338618_QWORDS.bin  Fedora-Workstation-Live-x86_64-35-1.2.iso	Px437_Cordata_PPC-400.ttf  Quicksort_says      Quicksort_says_rev3_Kaby-Lake_Fedora-33.png
bench.sh	     MEM.H					QS_bench_r3.c		   Quicksort_says.bas  Quicksort_says_rev3_Kaby-Lake_Windows-10.png
ColumnChart.ico      mobythesaurus.txt				QS_bench_r3.exe		   Quicksort_says.exe  Sir_Tony_Hoare_IMG_5125.png
[kaze@kaze Quicksort_says_rev3]$ gcc -O3 QS_bench_r3.c -o QS_bench_r3
QS_bench_r3.c:473:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
  473 | g(d,h){for(i=s;i<1<<25;i*=2)d=d*1LL*d%m;for(p=t;p<t+N;p+=s)for(i=s,c=1;i;i--)a=p[s]*(h?c:1LL)%m,p[s]=(m*1U+*p-a)*(h?1LL:c)%m,*p=(a*1U+*p)%m,p++,c=c*1LL*d%m;}
      | ^
QS_bench_r3.c: In function ‘g’:
QS_bench_r3.c:473:1: warning: type of ‘d’ defaults to ‘int’ [-Wimplicit-int]
QS_bench_r3.c:473:1: warning: type of ‘h’ defaults to ‘int’ [-Wimplicit-int]
QS_bench_r3.c: In function ‘main’:
QS_bench_r3.c:557:6: warning: implicit declaration of function ‘strcmp’ [-Wimplicit-function-declaration]
  557 |  if (strcmp(argv[2],"many")==0) {
      |      ^~~~~~
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r3 qsort few':

        325,465.05 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,063      page-faults:u             #    0.013 M/sec                  
   980,631,883,041      cycles:u                  #    3.013 GHz                      (50.00%)
 3,279,551,895,331      instructions:u            #    3.34  insn per cycle           (62.50%)
   650,482,999,545      branches:u                # 1998.626 M/sec                    (62.50%)
     5,582,924,930      branch-misses:u           #    0.86% of all branches          (62.50%)
   734,186,209,415      L1-dcache-loads:u         # 2255.807 M/sec                    (62.50%)
     6,753,617,333      L1-dcache-load-misses:u   #    0.92% of all L1-dcache accesses  (62.50%)
       404,414,575      LLC-loads:u               #    1.243 M/sec                    (50.00%)
       111,254,842      LLC-load-misses:u         #   27.51% of all LL-cache accesses  (50.00%)

     326.101784425 seconds time elapsed

     317.871539000 seconds user
       5.577791000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 Magnetica  few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r3 Magnetica few':

         45,479.14 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,068      page-faults:u             #    0.096 M/sec                  
   121,955,292,590      cycles:u                  #    2.682 GHz                      (50.00%)
   112,449,514,935      instructions:u            #    0.92  insn per cycle           (62.50%)
    24,092,559,780      branches:u                #  529.750 M/sec                    (62.50%)
     3,384,288,209      branch-misses:u           #   14.05% of all branches          (62.50%)
    13,329,400,797      L1-dcache-loads:u         #  293.088 M/sec                    (62.50%)
     1,655,811,975      L1-dcache-load-misses:u   #   12.42% of all L1-dcache accesses  (62.50%)
        70,845,055      LLC-loads:u               #    1.558 M/sec                    (50.00%)
        62,316,430      LLC-load-misses:u         #   87.96% of all LL-cache accesses  (50.00%)

      45.492587322 seconds time elapsed

      39.738044000 seconds user
       5.413636000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 BM  few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r3 BM few':

         51,751.74 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,064      page-faults:u             #    0.084 M/sec                  
   140,850,991,285      cycles:u                  #    2.722 GHz                      (50.00%)
   141,773,758,594      instructions:u            #    1.01  insn per cycle           (62.50%)
    29,021,682,740      branches:u                #  560.787 M/sec                    (62.50%)
     2,964,754,548      branch-misses:u           #   10.22% of all branches          (62.49%)
    20,759,616,908      L1-dcache-loads:u         #  401.139 M/sec                    (62.50%)
     2,253,507,095      L1-dcache-load-misses:u   #   10.86% of all L1-dcache accesses  (62.50%)
       240,693,428      LLC-loads:u               #    4.651 M/sec                    (50.00%)
       119,509,033      LLC-load-misses:u         #   49.65% of all LL-cache accesses  (50.00%)

      51.765804273 seconds time elapsed

      45.999839000 seconds user
       5.392254000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r3 qsort many':

        517,942.56 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,779      page-faults:u             #    0.009 M/sec                  
 1,569,421,552,908      cycles:u                  #    3.030 GHz                      (50.00%)
 2,937,315,064,269      instructions:u            #    1.87  insn per cycle           (62.50%)
   656,656,940,988      branches:u                # 1267.818 M/sec                    (62.50%)
    25,008,235,993      branch-misses:u           #    3.81% of all branches          (62.50%)
   675,474,160,813      L1-dcache-loads:u         # 1304.149 M/sec                    (62.50%)
     8,374,868,618      L1-dcache-load-misses:u   #    1.24% of all L1-dcache accesses  (62.50%)
       244,454,547      LLC-loads:u               #    0.472 M/sec                    (50.00%)
       132,791,156      LLC-load-misses:u         #   54.32% of all LL-cache accesses  (50.00%)

     518.237032044 seconds time elapsed

     508.735040000 seconds user
       5.950954000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r3 Magnetica many':

        230,292.26 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,782      page-faults:u             #    0.021 M/sec                  
   686,398,758,890      cycles:u                  #    2.981 GHz                      (50.00%)
   801,299,477,504      instructions:u            #    1.17  insn per cycle           (62.50%)
   152,341,819,141      branches:u                #  661.515 M/sec                    (62.50%)
    19,420,019,786      branch-misses:u           #   12.75% of all branches          (62.50%)
   105,227,165,554      L1-dcache-loads:u         #  456.929 M/sec                    (62.50%)
     9,758,775,155      L1-dcache-load-misses:u   #    9.27% of all L1-dcache accesses  (62.50%)
       331,205,631      LLC-loads:u               #    1.438 M/sec                    (50.00%)
       260,265,720      LLC-load-misses:u         #   78.58% of all LL-cache accesses  (50.00%)

     230.399202449 seconds time elapsed

     222.654274000 seconds user
       5.973146000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 BM many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r3 BM many':

        234,206.73 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,781      page-faults:u             #    0.021 M/sec                  
   698,588,697,890      cycles:u                  #    2.983 GHz                      (50.00%)
   656,081,987,764      instructions:u            #    0.94  insn per cycle           (62.50%)
   165,692,577,479      branches:u                #  707.463 M/sec                    (62.50%)
    18,927,109,180      branch-misses:u           #   11.42% of all branches          (62.50%)
    85,146,879,545      L1-dcache-loads:u         #  363.554 M/sec                    (62.50%)
     8,623,541,343      L1-dcache-load-misses:u   #   10.13% of all L1-dcache accesses  (62.50%)
       286,352,303      LLC-loads:u               #    1.223 M/sec                    (50.00%)
       227,490,519      LLC-load-misses:u         #   79.44% of all LL-cache accesses  (50.00%)

     234.306277225 seconds time elapsed

     226.632611000 seconds user
       5.938060000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 qsort ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r3 qsort ALL':

        520,097.93 msec task-clock:u              #    0.993 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,014      page-faults:u             #    0.015 M/sec                  
 1,553,894,682,781      cycles:u                  #    2.988 GHz                      (50.00%)
 2,271,179,443,512      instructions:u            #    1.46  insn per cycle           (62.50%)
   536,952,307,468      branches:u                # 1032.406 M/sec                    (62.50%)
    28,368,595,067      branch-misses:u           #    5.28% of all branches          (62.50%)
   520,144,933,578      L1-dcache-loads:u         # 1000.090 M/sec                    (62.50%)
     6,723,006,421      L1-dcache-load-misses:u   #    1.29% of all L1-dcache accesses  (62.50%)
       193,261,861      LLC-loads:u               #    0.372 M/sec                    (50.00%)
       102,759,148      LLC-load-misses:u         #   53.17% of all LL-cache accesses  (50.00%)

     524.002327535 seconds time elapsed

     504.423911000 seconds user
      12.294510000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 Magnetica ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r3 Magnetica ALL':

        306,913.60 msec task-clock:u              #    0.990 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,020      page-faults:u             #    0.026 M/sec                  
   899,578,515,950      cycles:u                  #    2.931 GHz                      (50.00%)
 1,077,947,210,852      instructions:u            #    1.20  insn per cycle           (62.50%)
   202,031,197,397      branches:u                #  658.267 M/sec                    (62.50%)
    25,788,554,132      branch-misses:u           #   12.76% of all branches          (62.50%)
   143,089,373,092      L1-dcache-loads:u         #  466.220 M/sec                    (62.50%)
     7,361,011,433      L1-dcache-load-misses:u   #    5.14% of all L1-dcache accesses  (62.50%)
       240,367,987      LLC-loads:u               #    0.783 M/sec                    (50.00%)
       189,560,929      LLC-load-misses:u         #   78.86% of all LL-cache accesses  (50.00%)

     310.137967617 seconds time elapsed

     292.453199000 seconds user
      12.266850000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 BM ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r3 BM ALL':

        322,688.09 msec task-clock:u              #    0.992 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,403      page-faults:u             #    0.024 M/sec                  
   943,352,167,992      cycles:u                  #    2.923 GHz                      (50.00%)
   896,389,731,553      instructions:u            #    0.95  insn per cycle           (62.50%)
   226,893,780,871      branches:u                #  703.137 M/sec                    (62.50%)
    26,011,308,958      branch-misses:u           #   11.46% of all branches          (62.50%)
   119,844,660,142      L1-dcache-loads:u         #  371.395 M/sec                    (62.50%)
     7,940,603,646      L1-dcache-load-misses:u   #    6.63% of all L1-dcache accesses  (62.50%)
       258,555,901      LLC-loads:u               #    0.801 M/sec                    (50.00%)
       217,482,356      LLC-load-misses:u         #   84.11% of all LL-cache accesses  (50.00%)

     325.339098420 seconds time elapsed

     306.791354000 seconds user
      13.564926000 seconds sys


[kaze@kaze Quicksort_says_rev3]$

Love 'Magnetica', enjoy!

Sanmayce · 11-29-2021, 06:12 AM

After a 2 weeks quest, glad to share Quicksort 'Magnetica' r.2, the FASTEST known to me:

The test program is attached, the testdatafile 3.54 GB (3,803,483,832 bytes) is downloadable at:
https://download.fedoraproject.org/p...aarch64.raw.xz

The benchmark is the most stressful for a 32GB machine, it uses 29GB of QWORDS i.e. 64bit elements, taking 8 bytes as an element at each position in the file being sorted i.e. filesize-8+1 elements in total. All in all, sorting 3,803,483,825 elements (of them unique keys = 3,346,259,533).

The results:
Quicksort 'Magnetica' r.2 is 1020s/569s= 1.792x faster than GLIBC qsort.
Quicksort 'Magnetica' r.2 is 607s/569s= 1.066x faster than Bentley_McIlroy.

And the console dump on my laptop Intel i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz, running Fedora 33, gcc 10.2.1 -O3 was used:

Code:

[kaze@kaze ~]$ cd /run/media/kaze/Sanmayce_223GB_A/Quicksort_says_rev7
[kaze@kaze Quicksort_says_rev7]$ gcc -O3 QS_bench_r7.c -o QS_bench_r7.elf
QS_bench_r7.c:1000:1: warning: return type defaults to â€˜intâ€™ [-Wimplicit-int]
 1000 | g(d,h){for(i=s;i<1<<25;i*=2)d=d*1LL*d%m;for(p=t;p<t+N;p+=s)for(i=s,c=1;i;i--)a=p[s]*(h?c:1LL)%m,p[s]=(m*1U+*p-a)*(h?1LL:c)%m,*p=(a*1U+*p)%m,p++,c=c*1LL*d%m;}
      | ^
QS_bench_r7.c: In function â€˜gâ€™:
QS_bench_r7.c:1000:1: warning: type of â€˜dâ€™ defaults to â€˜intâ€™ [-Wimplicit-int]
QS_bench_r7.c:1000:1: warning: type of â€˜hâ€™ defaults to â€˜intâ€™ [-Wimplicit-int]
[kaze@kaze Quicksort_says_rev7]$ ls -l
total 4501312
-rwxrwxrwx. 2 kaze kaze 3803483832 Nov 29 02:06 Fedora-Workstation-35-1.2.aarch64.raw.xz
-rwxrwxrwx. 2 kaze kaze      58408 Nov 29 12:56 QS_bench_r7.c
-rwxrwxrwx. 2 kaze kaze     335536 Nov 29 12:56 QS_bench_r7.cod
-rwxrwxrwx. 1 kaze kaze  268465864 Nov 29 13:06 QS_bench_r7.elf
-rwxrwxrwx. 2 kaze kaze  268535808 Nov 29 12:56 QS_bench_r7.exe
-rwxrwxrwx. 2 kaze kaze  268455855 Nov 29 12:56 QS_bench_r7.obj
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf Magnetica ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 569 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533

 Performance counter stats for './QS_bench_r7.elf Magnetica ALLmore':

        595,760.27 msec task-clock:u              #    0.990 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,428,735      page-faults:u             #    0.012 M/sec                  
 1,774,492,088,758      cycles:u                  #    2.979 GHz                      (50.00%)
 1,795,596,438,927      instructions:u            #    1.01  insn per cycle           (62.50%)
   348,244,974,654      branches:u                #  584.539 M/sec                    (62.50%)
    52,674,210,396      branch-misses:u           #   15.13% of all branches          (62.50%)
   238,612,258,163      L1-dcache-loads:u         #  400.517 M/sec                    (62.50%)
    12,071,470,906      L1-dcache-load-misses:u   #    5.06% of all L1-dcache accesses  (62.50%)
       291,005,737      LLC-loads:u               #    0.488 M/sec                    (50.00%)
       236,834,179      LLC-load-misses:u         #   81.38% of all LL-cache accesses  (50.00%)

     601.769088005 seconds time elapsed

     575.942070000 seconds user
      15.526557000 seconds sys

[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf BM ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 607 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533

 Performance counter stats for './QS_bench_r7.elf BM ALLmore':

        632,750.44 msec task-clock:u              #    0.992 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,428,742      page-faults:u             #    0.012 M/sec                  
 1,890,074,871,799      cycles:u                  #    2.987 GHz                      (50.00%)
 1,657,130,448,069      instructions:u            #    0.88  insn per cycle           (62.50%)
   424,010,738,154      branches:u                #  670.107 M/sec                    (62.50%)
    52,229,564,268      branch-misses:u           #   12.32% of all branches          (62.50%)
   221,684,796,070      L1-dcache-loads:u         #  350.351 M/sec                    (62.50%)
    13,935,776,880      L1-dcache-load-misses:u   #    6.29% of all L1-dcache accesses  (62.50%)
       296,863,115      LLC-loads:u               #    0.469 M/sec                    (50.00%)
       227,231,894      LLC-load-misses:u         #   76.54% of all LL-cache accesses  (50.00%)

     638.047061016 seconds time elapsed

     613.514293000 seconds user
      14.803650000 seconds sys

[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf qsort ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 1020 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533

 Performance counter stats for './QS_bench_r7.elf qsort ALLmore':

      1,047,717.43 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,429,008      page-faults:u             #    0.007 M/sec                  
 3,158,473,480,125      cycles:u                  #    3.015 GHz                      (50.00%)
 4,410,953,498,619      instructions:u            #    1.40  insn per cycle           (62.50%)
 1,045,210,950,583      branches:u                #  997.608 M/sec                    (62.50%)
    55,719,050,158      branch-misses:u           #    5.33% of all branches          (62.50%)
 1,009,805,547,231      L1-dcache-loads:u         #  963.815 M/sec                    (62.50%)
    13,006,118,247      L1-dcache-load-misses:u   #    1.29% of all L1-dcache accesses  (62.50%)
       320,405,768      LLC-loads:u               #    0.306 M/sec                    (50.00%)
       172,928,699      LLC-load-misses:u         #   53.97% of all LL-cache accesses  (50.00%)

    1052.421139598 seconds time elapsed

    1024.281537000 seconds user
      16.586670000 seconds sys

[kaze@kaze Quicksort_says_rev7]$

For more info, plus two videos showing the sort process:
https://qb64.org/forum/index.php?top...8570#msg138570

Enjoy!

Sanmayce · 11-30-2021, 08:22 AM

Thanks to PCBoy Studios, here comes a nice animated Magnetica in 50 tests:
https://youtu.be/sjQhoX10r9s

Note: This is the initial release of Magnetica, not the latest rev.2, there is no median of 5 branchlessly chosen, nor switching to Insertionsort cutoff at 13.

Sanmayce · 03-02-2022, 12:42 AM

I was able to speed up Magnetica by removing a conditional swap within the mainloop:

Code:

/*
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-FAcs -O3 -FeQS_bench_r9+_ICL15.0_64bit.exe -D_N_HIGH_PRIORITY";
; aa-65+2=71 bytes, 4/1 conditional/unconditional jumps; 23 instructions
; 'Magnetica' partitioning, mainloop rev.5[
.B66.5::                        
  00065 49 ff c6         inc r14                                
  00068 4e 8b 14 f1      mov r10, QWORD PTR [rcx+r14*8]         
  0006c 4d 3b c2         cmp r8, r10                            
  0006f 77 27            ja .B66.12 
.B66.6::                        
  00071 73 34            jae .B66.13 
.B66.7::                        
  00073 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00077 4d 3b c7         cmp r8, r15                            
  0007a 73 0c            jae .B66.11 
.B66.9::                        
  0007c 49 ff cb         dec r11                                
  0007f 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00083 4d 3b c7         cmp r8, r15                            
  00086 72 f4            jb .B66.9 
.B66.11::                       
  00088 4e 89 3c f1      mov QWORD PTR [rcx+r14*8], r15         
  0008c 49 ff ce         dec r14                                
  0008f 4e 89 14 d9      mov QWORD PTR [rcx+r11*8], r10         
  00093 49 ff cb         dec r11                                
  00096 eb 0f            jmp .B66.13 
.B66.12::                       
  00098 4c 8b 3c e9      mov r15, QWORD PTR [rcx+rbp*8]         
  0009c 4c 89 14 e9      mov QWORD PTR [rcx+rbp*8], r10         
  000a0 48 ff c5         inc rbp                                
  000a3 4e 89 3c f1      mov QWORD PTR [rcx+r14*8], r15         
.B66.13::                       
  000a7 4d 3b f3         cmp r14, r11                           
  000aa 7c b9            jl .B66.5 
; 'Magnetica' partitioning, mainloop rev.5]
*/

    #ifdef revision5
            for (;PR < Jndx;) {
                PR = PR + 1;
                if (Pivot > QWORDS[PR]) {
                    swapUnconditional (&QWORDS[PL], &QWORDS[PR]);
                    PL = PL + 1;
                //} else if (Pivot == QWORDS[PR]) {
                } else if (Pivot < QWORDS[PR]) {
                    for (;Pivot < QWORDS[Jndx];) {
                        Jndx = Jndx - 1;
                    }
                    swapUnconditional (&QWORDS[PR], &QWORDS[Jndx]); // Magnetica r.5
                    //if (PR < Jndx) swapUnconditional (&QWORDS[PR], &QWORDS[Jndx]);
                    //swapConditionalXY_BUGGY(PR, Jndx, &QWORDS[PR], &QWORDS[Jndx]); // Should be faster than "Akkodah" but buggy when wraparounding, inhere 'PR' and 'Jndx' are far below the dangerous values - no pitfall.
                    //swapConditionalXY_Akkodah(PR, Jndx, &QWORDS[PR], &QWORDS[Jndx]); // The safe one, but slower, isn't it?!
                    Jndx = Jndx - 1;
                    PR = PR - 1;
                }
            }
            if (PR > Jndx) { //redeem i.e. undo
                    swapUnconditional (&QWORDS[PR+1], &QWORDS[Jndx+1]);
            }
    #endif

Let's see what the gain is in sorting the tarred latest Linux kernel, sorting all its QWORDS, Fedora 35 with latest GCC were used:

The GLIBC qsort():

Code:

[kaze@kaze Quicksort_says_rev9+_finished]$ su
Password: 
[root@kaze Quicksort_says_rev9+_finished]# sh bench_gcc16GB.sh 
Current priority is -20.
Allocating FILE-Buffer 1084MB ...
Allocating AUX-Buffer 8679MB ...
Allocating Master-Buffer 8679MB ...
Sorting in single-thread 1137582073 elements...
Done in 199 seconds.
Checking whether sort went correct... OK. Unique keys = 77275994

 Performance counter stats for './QS_bench_r9+_GCC11.2.1.elf qsort manyC':

        209,256.72 msec task-clock                #    1.000 CPUs utilized          
             2,464      context-switches          #   11.775 /sec                   
               109      cpu-migrations            #    0.521 /sec                   
         6,943,283      page-faults               #   33.181 K/sec                  
   644,258,504,905      cycles                    #    3.079 GHz                      (50.00%)
   887,723,856,917      instructions              #    1.38  insn per cycle           (62.50%)
   207,412,371,273      branches                  #  991.186 M/sec                    (62.50%)
     7,297,082,974      branch-misses             #    3.52% of all branches          (62.50%)
   209,785,002,716      L1-dcache-loads           #    1.003 G/sec                    (62.50%)
    10,083,451,475      L1-dcache-load-misses     #    4.81% of all L1-dcache accesses  (62.50%)
       453,288,538      LLC-loads                 #    2.166 M/sec                    (50.00%)
       157,867,418      LLC-load-misses           #   34.83% of all LL-cache accesses  (50.00%)

     209.333380320 seconds time elapsed

     199.735624000 seconds user
       8.109237000 seconds sys

The IN-PLACE Quicksort 'Magnetica' r.5:

Code:

Current priority is -20.
Allocating FILE-Buffer 1084MB ...
Allocating AUX-Buffer 8679MB ...
Allocating Master-Buffer 8679MB ...
Sorting in single-thread 1137582073 elements...
Done in 96 seconds.
Checking whether sort went correct... OK. Unique keys = 77275994

 Performance counter stats for './QS_bench_r9+_GCC11.2.1.elf Magnetica manyC':

        107,254.52 msec task-clock                #    1.000 CPUs utilized          
             1,384      context-switches          #   12.904 /sec                   
                93      cpu-migrations            #    0.867 /sec                   
         4,721,443      page-faults               #   44.021 K/sec                  
   330,057,690,498      cycles                    #    3.077 GHz                      (50.00%)
   377,731,430,098      instructions              #    1.14  insn per cycle           (62.50%)
    69,040,391,678      branches                  #  643.706 M/sec                    (62.50%)
     8,637,267,391      branch-misses             #   12.51% of all branches          (62.50%)
    53,882,878,315      L1-dcache-loads           #  502.383 M/sec                    (62.50%)
     4,034,092,525      L1-dcache-load-misses     #    7.49% of all L1-dcache accesses  (62.50%)
       156,056,295      LLC-loads                 #    1.455 M/sec                    (50.00%)
       100,031,535      LLC-load-misses           #   64.10% of all LL-cache accesses  (50.00%)

     107.292018210 seconds time elapsed

     100.733157000 seconds user
       5.718556000 seconds sys

The IN-PLACE Quicksort 'Bentley_McIlroy':

Code:

Current priority is -20.
Allocating FILE-Buffer 1084MB ...
Allocating AUX-Buffer 8679MB ...
Allocating Master-Buffer 8679MB ...
Sorting in single-thread 1137582073 elements...
Done in 99 seconds.
Checking whether sort went correct... OK. Unique keys = 77275994

 Performance counter stats for './QS_bench_r9+_GCC11.2.1.elf BM manyC':

        110,085.06 msec task-clock                #    1.000 CPUs utilized          
             1,450      context-switches          #   13.172 /sec                   
               102      cpu-migrations            #    0.927 /sec                   
         4,721,442      page-faults               #   42.889 K/sec                  
   338,805,503,530      cycles                    #    3.078 GHz                      (50.00%)
   329,274,697,292      instructions              #    0.97  insn per cycle           (62.50%)
    81,292,002,162      branches                  #  738.447 M/sec                    (62.50%)
     8,566,231,164      branch-misses             #   10.54% of all branches          (62.51%)
    46,212,377,119      L1-dcache-loads           #  419.788 M/sec                    (62.50%)
     4,019,407,777      L1-dcache-load-misses     #    8.70% of all L1-dcache accesses  (62.50%)
       169,441,337      LLC-loads                 #    1.539 M/sec                    (49.99%)
       104,132,627      LLC-load-misses           #   61.46% of all LL-cache accesses  (49.99%)

     110.122229444 seconds time elapsed

     103.762912000 seconds user
       5.523235000 seconds sys

The OUT-OF-PLACE Scandum's Flexsort:

Code:

Current priority is -20.
Allocating FILE-Buffer 1084MB ...
Allocating AUX-Buffer 8679MB ...
Allocating Master-Buffer 8679MB ...
Sorting in single-thread 1137582073 elements...
Done in 66 seconds.
Checking whether sort went correct... OK. Unique keys = 77275994

 Performance counter stats for './QS_bench_r9+_GCC11.2.1.elf FS manyC':

         76,680.31 msec task-clock                #    1.000 CPUs utilized          
             1,022      context-switches          #   13.328 /sec                   
                53      cpu-migrations            #    0.691 /sec                   
         5,847,263      page-faults               #   76.255 K/sec                  
   235,988,083,763      cycles                    #    3.078 GHz                      (50.00%)
   573,773,325,807      instructions              #    2.43  insn per cycle           (62.50%)
    68,033,814,796      branches                  #  887.240 M/sec                    (62.50%)
       199,923,657      branch-misses             #    0.29% of all branches          (62.50%)
   117,675,535,396      L1-dcache-loads           #    1.535 G/sec                    (62.50%)
     5,926,662,790      L1-dcache-load-misses     #    5.04% of all L1-dcache accesses  (62.50%)
       109,487,878      LLC-loads                 #    1.428 M/sec                    (50.00%)
        58,046,998      LLC-load-misses           #   53.02% of all LL-cache accesses  (50.00%)

      76.704672397 seconds time elapsed

      69.129595000 seconds user
       6.966784000 seconds sys

The benchmark (all C sources and binaries included) is at:
http://www.sanmayce.com/Quicksort_sa...9+_finished.7z

And the showdown:

Code:

// Test run: 2022-Feb-27:
// Laptop "Compressionette", Intel 'Kaby Lake' i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:
// +--------------------+---------------------------+---------------------------+---------------------------+----------------------------+-----------------------------+
// | Performer/Keys     | FEW distinct              | MANY distinct             | ALL distinct              | ALLmore distinct           | ALLmax distinct             |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+
// |  Operating System, | Windows 10, | Fedora 35,  | Windows 10, | Fedora 35,  | Windows 10, | Fedora 35,  | Windows 10, | Fedora 35,   | Windows 10,  | Fedora 35,   |
// |      Compiler, -O3 | Intel v15.0 | GCC 11.2.1  | Intel v15.0 | GCC 11.2.1  | Intel v15.0 | GCC 11.2.1  | Intel v15.0 | GCC 11.2.1   | Intel v15.0  | GCC 11.2.1   |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+
// | qsort              |  60 seconds | 412 seconds | 350 seconds | 565 seconds | 450 seconds | 551 seconds | 881 seconds | 1066 seconds |         N.A. |         N.A. |
// | Bentley-McIlroy    |  38 seconds |  40 seconds | 204 seconds | 220 seconds | 276 seconds | 296 seconds | 542 seconds |  582 seconds |         N.A. |         N.A. |
// | Magnetica r.5BB    |  31 seconds |  33 seconds | 204 seconds | 216 seconds | 272 seconds | 286 seconds | 537 seconds |  565 seconds |         N.A. |         N.A. |
// | Magnetica r.5FF    |  32 seconds |  38 seconds | 209 seconds | 239 seconds | 265 seconds | 319 seconds | 525 seconds |  625 seconds |         N.A. |         N.A. |
// | Fluxsort           |  38 seconds |  38 seconds | 137 seconds | 138 seconds | 256 seconds | 197 seconds | 933 seconds |  641 seconds |         N.A. |         N.A. |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+
// | Best Time (bare    |                           |                           |                           |                            |                             |
// | bone in-place QS): |  31s for Magnetica r.5BB  | 204s PARITY               | 272s for Magnetica r.5BB  | 537s for Magnetica r.5BB   | N.A.                        |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------------------+-----------------------------+
// Bottomline: Realistically - 3 wins out of 4 runs both for Magnetica vs qsort & Bentley-McIlroy, the rest is PARITY.
// Note: Scandum's Fluxsort is superior speedwise, but using out-of-place RAM is problematic when RAM is not sufficient, virtual trashing happens.
//
// Legend (The time is exactly the Sort process time):
// FEW = 2,233,861,800 keys, of them distinct = 10; 178,708,944 bytes 22338618_QWORDS.bin; elements = 178,708,944/8 *100; // Keys are 100 times duplicated
// MANY = 2,482,300,900 keys, of them distinct = 2,847,531; 24,823,016 bytes mobythesaurus.txt; elements = 24823016 -8+1; // BuildingBlocks are size-order+1, they are 100 times duplicated
// ALL = 2,009,333,753 keys, of them distinct = 1,912,608,132; 2,009,333,760 bytes Fedora-Workstation-Live-x86_64-35-1.2.iso; elements = 2009333760 -8+1; // BuildingBlocks are size-order+1
// ALLmore = 3,803,483,825 keys, of them distinct = 3,346,259,533; 3,803,483,832 bytes Fedora-Workstation-35-1.2.aarch64.raw.xz; elements = 3803483832 -8+1; // BuildingBlocks are size-order+1
// ALLmax = 7,798,235,435 keys, of them distinct = 6,770,144,405; 7,798,235,442 bytes math.stackexchange.com_en_all_2019-02.zim; elements = 7798235442 -8+1; // BuildingBlocks are size-order+1
// 
// Notes:
// - All the runs were in "Current priority class is REALTIME_PRIORITY_CLASS" for Windows and "Current priority is -20." for Linux;
// - Better speeds for Magnetica r.5FF with Insertionsort Threshold 13 instead of 31;
// - Benchmark needs 32GB RAM, and 64GB for the 5th testset;
// - The whole package (except the 3rd, 4th and 5th datasets) is downloadable at:
//   www.sanmayce.com/Quicksort_says_rev9+_finished 
// - To reproduce the roster, run on Windows or Linux:
//   - BENCH_ICL32GB.BAT
//   - BENCH_ICL64GB.BAT
//   - sh bench_gcc32GB.sh
//   - sh bench_gcc64GB.sh
// - 3rd dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-35-1.2.iso 
// - 4th dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/aarch64/images/Fedora-Workstation-35-1.2.aarch64.raw.xz 
// - 5th dataset is downloadable at:
// https://download.kiwix.org/zim/stack_exchange/math.stackexchange.com_en_all_2019-02.zim

Sanmayce · 03-03-2022, 02:03 AM

Scandum's crumsort is the FASTEST in-place Quicksort known to me, wow! Written few days ago, it amazes.

https://github.com/scandum/crumsort

My quick tests show significant superiority when compared to Magnetica:

Code:

// Test run: 2022-Feb-27, add-on 2022-Mar-03 for crumsort:
// Laptop "Compressionette", Intel 'Kaby Lake' i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:
// +--------------------+---------------------------+---------------------------+---------------------------+----------------------------+-----------------------------+
// | Performer/Keys     | FEW distinct              | MANY distinct             | ALL distinct              | ALLmore distinct           | ALLmax distinct             |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+
// |  Operating System, | Windows 10, | Fedora 35,  | Windows 10, | Fedora 35,  | Windows 10, | Fedora 35,  | Windows 10, | Fedora 35,   | Windows 10,  | Fedora 35,   |
// |      Compiler, -O3 | Intel v15.0 | GCC 11.2.1  | Intel v15.0 | GCC 11.2.1  | Intel v15.0 | GCC 11.2.1  | Intel v15.0 | GCC 11.2.1   | Intel v15.0  | GCC 11.2.1   |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+
// | qsort              |  60 seconds | 412 seconds | 350 seconds | 565 seconds | 450 seconds | 551 seconds | 881 seconds | 1066 seconds |         N.A. |         N.A. |
// | Bentley-McIlroy    |  38 seconds |  40 seconds | 204 seconds | 220 seconds | 276 seconds | 296 seconds | 542 seconds |  582 seconds |         N.A. |         N.A. |
// | Magnetica r.5BB    |  31 seconds |  33 seconds | 204 seconds | 216 seconds | 272 seconds | 286 seconds | 537 seconds |  565 seconds |         N.A. |         N.A. |
// | Magnetica r.5FF    |  32 seconds |  38 seconds | 209 seconds | 239 seconds | 265 seconds | 319 seconds | 525 seconds |  625 seconds |         N.A. |         N.A. |
// | Fluxsort           |  38 seconds |  38 seconds | 137 seconds | 138 seconds | 256 seconds | 197 seconds | 933 seconds |  641 seconds |         N.A. |         N.A. |
// | Crumsort           |  31 seconds |  32 seconds | 133 seconds | 150 seconds | 185 seconds | 190 seconds | 355 seconds |  373 seconds |         N.A. |         N.A. |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+
// | Best Time (bare    |                           |                           |                           |                            |                             |
// | bone in-place QS): |  31s PARITY               | 133s Crumsort             | 185s Crumsort             | 355s Crumsort              | N.A.                        |
// +--------------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------------------+-----------------------------+
// Bottomline: Realistically - 3 wins out of 4 runs both for Magnetica vs qsort & Bentley-McIlroy, the rest is PARITY.
// Note1: Scandum's Fluxsort is superior speedwise, but using out-of-place RAM is problematic when RAM is not sufficient, virtual trashing happens.
// Note2: Scandum's Crumsort proves to be the FASTEST in-place Quicksort.
//
// Legend (The time is exactly the Sort process time):
// FEW = 2,233,861,800 keys, of them distinct = 10; 178,708,944 bytes 22338618_QWORDS.bin; elements = 178,708,944/8 *100; // Keys are 100 times duplicated
// MANY = 2,482,300,900 keys, of them distinct = 2,847,531; 24,823,016 bytes mobythesaurus.txt; elements = 24823016 -8+1; // BuildingBlocks are size-order+1, they are 100 times duplicated
// ALL = 2,009,333,753 keys, of them distinct = 1,912,608,132; 2,009,333,760 bytes Fedora-Workstation-Live-x86_64-35-1.2.iso; elements = 2009333760 -8+1; // BuildingBlocks are size-order+1
// ALLmore = 3,803,483,825 keys, of them distinct = 3,346,259,533; 3,803,483,832 bytes Fedora-Workstation-35-1.2.aarch64.raw.xz; elements = 3803483832 -8+1; // BuildingBlocks are size-order+1
// ALLmax = 7,798,235,435 keys, of them distinct = 6,770,144,405; 7,798,235,442 bytes math.stackexchange.com_en_all_2019-02.zim; elements = 7798235442 -8+1; // BuildingBlocks are size-order+1
// 
// Notes:
// - All the runs were in "Current priority class is REALTIME_PRIORITY_CLASS" for Windows and "Current priority is -20." for Linux;
// - Better speeds for Magnetica r.5FF with Insertionsort Threshold 13 instead of 31;
// - Benchmark needs 32GB RAM, and 64GB for the 5th testset;
// - The whole package (except the 3rd, 4th and 5th datasets) is downloadable at:
//   www.sanmayce.com/Quicksort_says_rev9+_finished 
// - To reproduce the roster, run on Windows or Linux:
//   - BENCH_ICL32GB.BAT
//   - BENCH_ICL64GB.BAT
//   - sh bench_gcc32GB.sh
//   - sh bench_gcc64GB.sh
// - 3rd dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-35-1.2.iso 
// - 4th dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/aarch64/images/Fedora-Workstation-35-1.2.aarch64.raw.xz 
// - 5th dataset is downloadable at:
// https://download.kiwix.org/zim/stack_exchange/math.stackexchange.com_en_all_2019-02.zim

Wow!

Sanmayce · 04-04-2022, 12:22 AM

Enter Quicksort 'Akkoda'.
The fastest Quicksort known to me.

The amazing work that Scandum did with his Crumsort is unbeatable in my view, yet, there is hidden potential unexploited, namely, the bi-threaded variant of my Quicksort 'Magnetica' called 'Akkoda'.

In my heavy benchmarks, it beats Crumsort, both using GCC and ICL compilers.

The full C source is attached. If some fellow member is interested in running the 6 heavy (1 to 7 billion QWORDS as keys) benchmarks, just let me know - will gladly assist in how to run them.

The Speed Roster:
https://www.linuxquestions.org/quest...3&d=1649049532

Code:

// Laptop "Compressionette", Intel 'Kaby Lake' i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:
// +----------------------------------+----------------------------+-----------------------------+---------------------------+-----------------------------+------------------------------+-----------------------------+
// | Performer/Keys                   | #1, FEW distinct           | #2, MANY distinct           | #3, MANYmore distinct     | #4, ALL distinct            | #5, ALLmore distinct         | #6, ALLmax distinct         |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+
// |                Operating System, | Windows 10, |   Fedora 35, |  Windows 10, |   Fedora 35, | Windows 10, |  Fedora 35, |  Windows 10, |   Fedora 35, |  Windows 10, |    Fedora 35, |  Windows 10, |   Fedora 35, |
// |                         Compiler | Intel v15.0 |   GCC 11.2.1 |  Intel v15.0 |   GCC 11.2.1 | Intel v15.0 |  GCC 11.2.1 |  Intel v15.0 |   GCC 11.2.1 |  Intel v15.0 |    GCC 11.2.1 |  Intel v15.0 |   GCC 11.2.1 |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+
// | Quicksort 'Akkoda' v.1 ITERATIVE |  26/04 sec. | 027/004 sec. | 121/032 sec. | 129/030 sec. | 054/14 sec. | 058/13 sec. | 131/036 sec. | 140/036 sec. | 303/072 sec. | 0323/072 sec. |         N.A. |         N.A. |
// | Quicksort 'Akkoda' v.1 RECURSIVE |  27/05 sec. | 027/005 sec. | 127/032 sec. | 123/031 sec. | 059/16 sec. | 067/24 sec. | 174/061 sec. | 254/150 sec. | 367/120 sec. | 0575/284 sec. |         N.A. |         N.A. |
// | Quicksort 'Magnetica' v.16       |  31/05 sec. | 031/005 sec. | 205/040 sec. | 218/040 sec. | 089/19 sec. | 094/19 sec. | 252/060 sec. | 264/058 sec. | 492/121 sec. | 0518/118 sec. |         N.A. |         N.A. |
// | Quicksort 'Hoare'                |  89/62 sec. | 105/076 sec. | 220/053 sec. | 264/083 sec. | 096/26 sec. | 116/41 sec. | 255/033 sec. | 308/065 sec. | 502/072 sec. | 0602/137 sec. |         N.A. |         N.A. |
// | Quicksort 'Bentley-McIlroy'      |  37/11 sec. | 038/013 sec. | 206/037 sec. | 224/046 sec. | 091/21 sec. | 100/25 sec. | 274/049 sec. | 305/062 sec. | 546/109 sec. | 0601/134 sec. |         N.A. |         N.A. |
// | qsort                            |  59/16 sec. | 360/320 sec. | 344/091 sec. | 523/240 sec. | 161/42 sec. | 196/69 sec. | 447/124 sec. | 521/146 sec. | 871/248 sec. | 1010/287 sec. |         N.A. |         N.A. |
// | Crumsort v.1.1.5.3               |  27/03 sec. | 029/004 sec. | 123/004 sec. | 129/005 sec. | 059/02 sec. | 061/02 sec. | 168/003 sec. | 174/004 sec. | 330/006 sec. | 0340/008 sec. |         N.A. |         N.A. |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+
// |                        Best Time |  26         | 027          | 121          | 123          | 054         | 058         | 131          | 140          | 303          | 0323          | N.A.         | N.A.         |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+

Code:

// Quick highlights:
// GLIBC's qsort() is trash;
// Microsoft/Intel's qsort() is trashy;
// Quicksort 'Hoare' is faster than Quicksort 'Bentley-McIlroy' on MOSTLY DISTINCT data (but only with ICL);
// Quicksort 'Magnetica' is faster than Quicksort 'Hoare', however on SORTED (and at the same time) MOSTLY DISTINCT data, it is inferior;
// Quicksort 'Akkoda' v.1 ITERATIVE is faster than Crumsort v.1.1.5.3, with the UNFAIR usage of another thread, but crushed on SORTED data;
//
// Legend (The time is exactly the Sort process time, first value is for unsorted, second one is for sorted):
// #1,FEW = 2,233,861,800 keys, of them distinct = 10; 178,708,944 bytes 22338618_QWORDS.bin; elements = 178,708,944/8 *100; // Keys are 100 times duplicated
// #2,MANY = 2,482,300,900 keys, of them distinct = 2,847,531; 24,823,016 bytes mobythesaurus.txt; elements = 24823016 -8+1; // BuildingBlocks are size-order+1, they are 100 times duplicated
// #3,MANYmore = 1,137,582,073 keys, of them distinct = 77,275,994; 1,137,582,080 bytes linux-5.15.25.tar; elements = 1137582080 -8+1; // BuildingBlocks are size-order+1
// #4,ALL = 2,009,333,753 keys, of them distinct = 1,912,608,132; 2,009,333,760 bytes Fedora-Workstation-Live-x86_64-35-1.2.iso; elements = 2009333760 -8+1; // BuildingBlocks are size-order+1
// #5,ALLmore = 3,803,483,825 keys, of them distinct = 3,346,259,533; 3,803,483,832 bytes Fedora-Workstation-35-1.2.aarch64.raw.xz; elements = 3803483832 -8+1; // BuildingBlocks are size-order+1
// #6,ALLmax = 7,798,235,435 keys, of them distinct = 6,770,144,405; 7,798,235,442 bytes math.stackexchange.com_en_all_2019-02.zim; elements = 7798235442 -8+1; // BuildingBlocks are size-order+1

Code:

// Notes:
// - The pair represents UNSORTED/SORTED pool, respectively;
// - Used options for GCC compiler: -O3 -fomit-frame-pointer -fopenmp
// - Used options for ICL compiler: /arch:SSE4.1 /O3 /Qopenmp
// - All the runs were in "Current priority class is REALTIME_PRIORITY_CLASS" for Windows and "Current priority is -20." for Linux;
// - To see more stats (the tables were deriving from) see 'log_i5-7200U_APR03.txt';
// - Benchmark needs 32GB RAM, and 64GB for the 6th testset;
// - The whole package (except the 3rd, 4th, 5th and 6th datasets) is downloadable at:
// www.sanmayce.com/QS_showdown_r16.7z
// - To reproduce the roster, run on Windows or Linux:
// - BENCH_ICL32GB.BAT
// - BENCH_ICL64GB.BAT
// - sh bench_gcc32GB.sh
Quicksort Showdown – quicksort_Magnetica_partitioning vs quicksort_Bentley_McIlroy_3way_partitioning; for more updates: https://twitter.com/Sanmayce Page 7 of 40
// - sh bench_gcc64GB.sh
// - 3rd dataset is downloadable at:
// https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.15.25.tar.xz
// - 4th dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-35-1.2.iso
// - 5th dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/aarch64/images/Fedora-Workstation-35-1.2.aarch64.raw.xz
// - 6th dataset is downloadable at:
// https://download.kiwix.org/zim/stack_exchange/math.stackexchange.com_en_all_2019-02.zim
// - Managed to reduce the mainloop down to 68 bytes (from 71 in r.12), the gain comes from switching to pointers, had to do that long time ago, wanted to keep the etude ARRAY-syntax friendly. Now, r.12 uses arrays, whereas
r.13 uses pointers. Another turning point, GCC now beats ICL in all tests.
// - In order to switch to "Benchmark mode":
// cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
// echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
// - Quicksort 'Akkoda' v.1 invokes two threads of Quicksort 'Magnetica' v.16, OpenMP is used;
// - Scandum's Crumsort is the FASTEST in-place sorter, known to me, hail Scandum!

And here comes the hottest mainloop:

Code:

; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-FAcs -O3 -FeQS_bench_r13_ICL15.0_64bit.exe -D_N_HIGH_PRIORITY";
; 13e-ff+2=65 bytes, 4/1 conditional/unconditional jumps; 22 instructions
; 'Magnetica' partitioning, mainloop rev.5a2p_BYPASSSWAP[
.B172.7::                       
  000ff 49 83 c1 08      add r9, 8                              
  00103 4d 8b 39         mov r15, QWORD PTR [r9]                
  00106 49 3b c7         cmp rax, r15                           
  00109 76 0c            jbe .B172.9 
.B172.8::                       
  0010b 4d 89 3b         mov QWORD PTR [r11], r15               
  0010e 49 83 c3 08      add r11, 8                             
  00112 49 89 01         mov QWORD PTR [r9], rax                
  00115 eb 24            jmp .B172.15 
.B172.9::                       
  00117 73 22            jae .B172.15 
.B172.10::                      
  00119 4d 8b 30         mov r14, QWORD PTR [r8]                
  0011c 49 3b c6         cmp rax, r14                           
  0011f 73 0c            jae .B172.14 
.B172.12::                      
  00121 49 83 c0 f8      add r8, -8                             
  00125 4d 8b 30         mov r14, QWORD PTR [r8]                
  00128 49 3b c6         cmp rax, r14                           
  0012b 72 f4            jb .B172.12 
.B172.14::                      
  0012d 4d 89 31         mov QWORD PTR [r9], r14                
  00130 49 83 c1 f8      add r9, -8                             
  00134 4d 89 38         mov QWORD PTR [r8], r15                
  00137 49 83 c0 f8      add r8, -8                             
.B172.15::                      
  0013b 4d 3b c8         cmp r9, r8                             
  0013e 72 bf            jb .B172.7 
; 'Magnetica' partitioning, mainloop rev.5a2p_BYPASSSWAP]

Enfun!

ntubski · 04-04-2022, 06:40 AM

Quote:

Originally Posted by Sanmayce

Code:

// Quick highlights:
// GLIBC's qsort() is trash;
// Microsoft/Intel's qsort() is trashy;

I wonder if the main reason these are so much worse is that they have to take the comparison function as a parameter, whereas the other sort functions seem to be hard-coding "<" directly.

Sanmayce · 04-04-2022, 11:18 PM

Quote:

Originally Posted by ntubski

I wonder if the main reason these are so much worse is that they have to take the comparison function as a parameter, whereas the other sort functions seem to be hard-coding "<" directly.

I cannot believe how such functions still exist.

Anyway, for a long time I needed a superfast sorter for billions of QWORD keys, glad that played with Quicksort, this is how Zen 2 executes the benchmark:

https://www.linuxquestions.org/quest...1&d=1649135797

The full C source, with compile scripts/lines (and a PDF booklet 40 pages long - a good reference/starting point for all coders wanted to improve on it) is freely downloadable at:
www.sanmayce.com/QS_showdown_r16.7z

Enfun!

pan64 · 04-04-2022, 11:39 PM

Quote:

Originally Posted by Sanmayce

I cannot believe how such functions still exist.

In a lot of cases any kind of sort is sufficient, especially when we have only a few items to work with. A general sort algorithm is as good as any other one for this purpose.
From the other hand in case of a huge amount of data [structures] we always need a special and unique implementation where all the important details were taken into account (like data types, record size, index table, "distinctness", way of comparison and also including hardware, like cpu type, amount of ram, i/o speed).

Just curious: how will you sort all the existing/valid phone numbers based on the age of the owners (in minutes?).