LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 11-09-2021, 01:57 PM   #1
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 65

Rep: Reputation: 13
qsort vs 'Magnetica' Quicksort


Just wrote my qsort() replacement, wanna share it with the community and stress my variant with more versatile data...

Generally, we have three major scenarios:
few distinct keys;
many distinct keys;
ALL distinct keys.

In first two cases, the new implementation is faster or equal to the "Standard" one.
Sadly, in the third case, it is somewhat slower, should be investigated further.

Anyway, benchmarking will show the practical value of 'Magnetica', personally, I started to use it already.
Please, give it a try, my new partitioning scheme is not perfect but pushes the limitations of old ones.

Didn't have time to add the second scenario, but the results (for 1st scenario, with 10 distinct keys) are promising:
273.8/3.7= 74x faster
Attached Files
File Type: txt QS_bench.c.txt (18.5 KB, 13 views)

Last edited by Sanmayce; 11-09-2021 at 01:59 PM.
 
Old 11-09-2021, 02:00 PM   #2
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 10,188

Rep: Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752Reputation: 4752
You didn't post it or link to it, but it sounds like the same idea as this:

https://en.wikipedia.org/wiki/Timsort

Last edited by dugan; 11-09-2021 at 02:02 PM.
 
Old 11-09-2021, 02:10 PM   #3
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 65

Original Poster
Rep: Reputation: 13
Quote:
Originally Posted by dugan View Post
You didn't post it or link to it, but it sounds like the same idea as this:

https://en.wikipedia.org/wiki/Timsort
Now attached as a .txt, LQ doesn't allow .c attachments, hmm.
As for Timsort, never looked it up, just know it is among some major performers.
AFAIK, my 'Magnetica' partitioning is a new one, nothing special, just did on paper for an hour the swaps and saw that there is a crucial moment where unnecessary swaps are to be avoided. Wonder, whether another guy had it before me, just to acknowledge it, I am too lazy to look up other's sources, however love benchmarking against them.
 
Old 11-10-2021, 09:26 AM   #4
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 9,431
Blog Entries: 4

Rep: Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375Reputation: 3375
The "performance" of any sorting algorithm, as measured in microseconds, is always dependent upon the data. However, the practical performance of any algorithm is usually described in terms of "order," as in O(Log2N). Shell Sort will always out-perform Bubble Sort, and Quicksort will always out-perform Shell.

I suggest that you create a new Github project for your new algorithm, and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.
 
Old 11-10-2021, 02:16 PM   #5
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,037

Rep: Reputation: Disabled
Slightly off-topic. While searching XFCE documentation on their site, I stumbled upon an old article from year 1995 on AlphaSort.
Quote:
No matter how fast the processor, a 100 MB external sort using a single 1993-vintage SCSI disk takes about one minute of elapsed time. This one-minute barrier is created by the 3MB/s sequential transfer rate (bandwidth) of a single commodity disk. We measured both the OpenVMS Sort utility and AlphaSort to take a little under one minute when using one SCSI disk. Both sorts are disk-limited. A faster processor or faster algorithm would not sort much faster because the disk reads at about 4.5 MB/s, and writes at about 3.5 MB/s. Thus, it takes about 25 seconds to read the 100 MB, and about 30 seconds to write the 100 MB answer file. Even on mainframes, sort algorithms like Syncsort and DFsort are limited by this one-minute barrier, unless disk or file striping is used.
 
Old 11-11-2021, 01:10 AM   #6
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 65

Original Poster
Rep: Reputation: 13
>However, the practical performance of any algorithm is usually described in terms of "order," as in O(Log2N).

Indeed, however my focus is on 'perf', you are welcome to see the statistics below and draw even better PRACTICAL conclusions whether one performer is good or no good enough.

>I suggest that you create a new Github project for your new algorithm, and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.

Nah, I prefer sharing it with fellow forum members, few times got treated as a spammer on stack*, Github is cool but I search feedback from amateurs (in his primal sense - lovers of something) as me. Benchmarking is what interests me, of course if some coder improves on it then even better - we all win.
After all, all good things will make their way, by itself, naturally.

>Slightly off-topic. While searching XFCE documentation on their site, I stumbled upon an old article from year 1995 on AlphaSort.

Thanks, no idea what it is, but for a long thirty years I admire the work of Bob Pirko and its RPSORT.ASM - a 16-bit sorter, yet never looked into his source code - didn't want to "steal" his ideas, weird, maybe there are some ideas worth reviving.

Looking calmly at 'Magnetica' for few minutes, couldn't miss the obvious optimization of moving 'PR+1' outside the if-elseif-else, now the refined revision is smaller and faster.
Now, we can benchmark either with QS_bench_r2.c or by adding 'Magnetica' to your code.
You need "22338618_QWORDS.bin" and "mobythesaurus.txt" from this package:
https://www.qb64.org/forum/index.php...0;attach=14875

As for the third case (ALL distinct keys), I wanna 2,000,000,000 keys, but how to shuffle them, so each time the keys to be in a given order?

In the near future, I intend to make a showdown between Bentley-McIlroy 3-way partitioning and 'Magnetica', I am not convinced their scheme is better, will see...

This is the refinement which is used in QS_bench_r2.c:

Code:
            // 'Magnetica' partitioning rev.2[
            Jndx = Right;
            PL = Left;
            PR = Left;
            swap (&QWORDS[(Left + Right)>>1], &QWORDS[PR]);
            Pivot = QWORDS[PR];
            for (;PR < Jndx;) {
                PR = PR + 1;
                if (Pivot > QWORDS[PR]) {
                    swap (&QWORDS[PL], &QWORDS[PR]);
                    PL = PL + 1;
                //} else if (Pivot == QWORDS[PR]) {
                } else if (Pivot < QWORDS[PR]) {
                    for (;Pivot < QWORDS[Jndx];) {
                        Jndx = Jndx - 1;
                    }
                    if (PR < Jndx) swap (&QWORDS[PR], &QWORDS[Jndx]);
                    Jndx = Jndx - 1;
                    PR = PR - 1;
                }
            }
            Jndx = PL - 1;
            Indx = PR + 1;
            // 'Magnetica' partitioning rev.2]
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; ae-64+2=76 bytes i.e. (94-76=18 less), 5/1 conditional/unconditional jumps i.e. 2-1=1 less unconditional jump
; 'Magnetica' partitioning, mainloop rev.2[
.B4.5::                         
  00064 48 ff c5         inc rbp                                
  00067 48 8b 14 e9      mov rdx, QWORD PTR [rcx+rbp*8]         
  0006b 4c 3b ea         cmp r13, rdx                           
  0006e 77 2c            ja .B4.14 
.B4.6::                         
  00070 73 39            jae .B4.15 
.B4.7::                         
  00072 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00076 4d 3b ef         cmp r13, r15                           
  00079 73 0c            jae .B4.11 
.B4.9::                         
  0007b 49 ff cb         dec r11                                
  0007e 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00082 4d 3b ef         cmp r13, r15                           
  00085 72 f4            jb .B4.9 
.B4.11::                        
  00087 49 3b eb         cmp rbp, r11                           
  0008a 7d 08            jge .B4.13 
.B4.12::                        
  0008c 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
  00090 4a 89 14 d9      mov QWORD PTR [rcx+r11*8], rdx         
.B4.13::                        
  00094 49 ff cb         dec r11                                
  00097 48 ff cd         dec rbp                                
  0009a eb 0f            jmp .B4.15 
.B4.14::                        
  0009c 4e 8b 3c f1      mov r15, QWORD PTR [rcx+r14*8]         
  000a0 4a 89 14 f1      mov QWORD PTR [rcx+r14*8], rdx         
  000a4 49 ff c6         inc r14                                
  000a7 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
.B4.15::                        
  000ab 49 3b eb         cmp rbp, r11                           
  000ae 7c b4            jl .B4.5 
; 'Magnetica' partitioning, mainloop rev.2]
*/
Glad to share the first two cases benchmarked with 'Magnetica' rev.2, the console log on my Fedora 33, i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:

Code:
[kaze@kaze Quicksort_says_rev3]$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) 

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 qsort few':

        372,145.95 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,067      page-faults:u             #    0.012 M/sec                  
 1,123,792,986,601      cycles:u                  #    3.020 GHz                      (50.00%)
 3,268,399,480,738      instructions:u            #    2.91  insn per cycle           (62.50%)
   650,531,988,573      branches:u                # 1748.056 M/sec                    (62.50%)
     5,519,977,909      branch-misses:u           #    0.85% of all branches          (62.50%)
   734,216,718,766      L1-dcache-loads:u         # 1972.927 M/sec                    (62.50%)
     6,752,096,756      L1-dcache-load-misses:u   #    0.92% of all L1-dcache accesses  (62.50%)
       423,592,362      LLC-loads:u               #    1.138 M/sec                    (50.00%)
       126,404,435      LLC-load-misses:u         #   29.84% of all LL-cache accesses  (50.00%)

     372.770596352 seconds time elapsed

     364.324923000 seconds user
       5.532901000 seconds sys

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 Magnetica few':

         45,966.43 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,067      page-faults:u             #    0.095 M/sec                  
   123,532,537,542      cycles:u                  #    2.687 GHz                      (50.00%)
   101,337,186,572      instructions:u            #    0.82  insn per cycle           (62.50%)
    24,106,121,927      branches:u                #  524.429 M/sec                    (62.50%)
     3,277,959,707      branch-misses:u           #   13.60% of all branches          (62.50%)
    13,329,593,401      L1-dcache-loads:u         #  289.985 M/sec                    (62.50%)
     1,655,094,422      L1-dcache-load-misses:u   #   12.42% of all L1-dcache accesses  (62.50%)
        83,759,355      LLC-loads:u               #    1.822 M/sec                    (50.00%)
        69,790,047      LLC-load-misses:u         #   83.32% of all LL-cache accesses  (50.00%)

      45.977603171 seconds time elapsed

      40.202975000 seconds user
       5.433257000 seconds sys

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 qsort many':

        547,697.55 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,781      page-faults:u             #    0.009 M/sec                  
 1,660,777,393,784      cycles:u                  #    3.032 GHz                      (50.00%)
 2,926,096,300,717      instructions:u            #    1.76  insn per cycle           (62.50%)
   656,833,950,654      branches:u                # 1199.264 M/sec                    (62.50%)
    24,981,866,259      branch-misses:u           #    3.80% of all branches          (62.50%)
   675,499,020,355      L1-dcache-loads:u         # 1233.343 M/sec                    (62.50%)
     8,384,728,679      L1-dcache-load-misses:u   #    1.24% of all L1-dcache accesses  (62.50%)
       253,666,230      LLC-loads:u               #    0.463 M/sec                    (50.00%)
       149,205,273      LLC-load-misses:u         #   58.82% of all LL-cache accesses  (50.00%)

     547.972289671 seconds time elapsed

     538.278672000 seconds user
       5.984880000 seconds sys

[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.

 Performance counter stats for './QS_bench_r2 Magnetica many':

        241,082.32 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,781      page-faults:u             #    0.020 M/sec                  
   719,580,616,824      cycles:u                  #    2.985 GHz                      (50.00%)
   788,555,184,960      instructions:u            #    1.10  insn per cycle           (62.50%)
   152,383,462,963      branches:u                #  632.081 M/sec                    (62.50%)
    18,722,063,529      branch-misses:u           #   12.29% of all branches          (62.50%)
   105,308,386,791      L1-dcache-loads:u         #  436.815 M/sec                    (62.50%)
     9,762,518,793      L1-dcache-load-misses:u   #    9.27% of all L1-dcache accesses  (62.50%)
       291,667,293      LLC-loads:u               #    1.210 M/sec                    (50.00%)
       240,922,518      LLC-load-misses:u         #   82.60% of all LL-cache accesses  (50.00%)

     241.152242784 seconds time elapsed

     233.338084000 seconds user
       5.991166000 seconds sys

[kaze@kaze Quicksort_says_rev3]$
In case "FEW distinct", Magnetica is 364.3/40.2=9.0x faster than qsort.
In case "MANY distinct", Magnetica is 538.2/233.3= 2.3x faster than qsort.
In case "ALL distinct", to be seen...
Attached Files
File Type: txt QS_bench_r2.c.txt (26.2 KB, 6 views)

Last edited by Sanmayce; 11-11-2021 at 01:19 AM.
 
2 members found this post helpful.
Old 11-12-2021, 10:03 PM   #7
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 65

Original Poster
Rep: Reputation: 13
Kinda expected Bentley-McIlroy duo to be superior (in many distinct keys scenario), it turns out 'Magnetica' dominates in all three cases, the following showdown tells a lot:

The Quicksort Showdown roster:
Code:
+-----------------+--------------+---------------+---------------+
| Performer/Keys  | FEW distinct | MANY distinct | ~ALL distinct |
+-----------------+--------------+---------------+---------------+
| qsort           |  317 seconds |   508 seconds |   504 seconds |
| Magnetica       |   39 seconds |   222 seconds |   292 seconds |
| Bentley-McIlroy |   45 seconds |   226 seconds |   306 seconds |
+-----------------+--------------+---------------+---------------+

Legend:
FEW = 2,233,861,800 keys, of them distinct keys = 10
MANY = 2,482,300,900 keys, of them distinct keys = 2,847,531
~ALL = 2,009,333,753 keys, of them distinct keys = 1,912,608,132
The testfile for case #3:
https://download.fedoraproject.org/p..._64-35-1.2.iso

The code (used in QS_bench_r3.c) of Bentley-McIlroy 3-way partitioning by Dr. Jon Bentley (Bell Laboratories) and Dr. Douglas McIlroy (Bell Laboratories):

Code:
void quicksort_Bentley_McIlroy_3way_partitioning(uint64_t a[], int64_t l, int64_t r)
{ 
	int64_t i = l-1, j = r, p = l-1, q = r; 
	int64_t k; 
	uint64_t v = a[r];
	if (r <= l) return;
	for (;;)
	{
		while (a[++i] < v) ;
		while (v < a[--j]) if (j == l) break;
		if (i >= j) break;
		swap(&a[i], &a[j]);
		if (a[i] == v) { p++; swap(&a[p], &a[i]); }
		if (v == a[j]) { q--; swap(&a[j], &a[q]); }
	}
	swap(&a[i], &a[r]); j = i-1; i = i+1;
	for (k = l; k < p; k++, j--) swap(&a[k], &a[j]);
	for (k = r-1; k > q; k--, i++) swap(&a[i], &a[k]);
	quicksort_Bentley_McIlroy_3way_partitioning(a, l, j);
	quicksort_Bentley_McIlroy_3way_partitioning(a, i, r);
}
The code (used in QS_bench_r3.c) of Magnetica 3-way partitioning by Sanmayce:

Code:
//    _____                                __  .__               
//   /     \ _____     ____   ____   _____/  |_|__| ____ _____   
//  /  \ /  \\__  \   / ___\ /    \_/ __ \   __\  |/ ___\\__  \  
// /    Y    \/ __ \_/ /_/  >   |  \  ___/|  | |  \  \___ / __ \_
// \____|__  (____  /\___  /|___|  /\___  >__| |__|\___  >____  /
//         \/     \//_____/      \/     \/             \/     \/ 
void Quicksort_QB64_v4(uint64_t QWORDS[], int64_t Left, int64_t Right) {
    int InsertionsortTHRESHOLD = 0;
    int64_t Indx, Jndx, PL, PR;
    int64_t Stack[99999];
    int64_t StackPtr = 0;
    uint64_t Pivot;
    StackPtr++; Stack[StackPtr] = Left;
    StackPtr++; Stack[StackPtr] = Right;

    do {
        Right = Stack[StackPtr];
        Left = Stack[StackPtr - 1];
        StackPtr = StackPtr - 2;
        do {
            // 'Magnetica' partitioning rev.2[
            Jndx = Right;
            PL = Left;
            PR = Left;
            swap (&QWORDS[(Left + Right)>>1], &QWORDS[PR]);
            Pivot = QWORDS[PR];
            for (;PR < Jndx;) {
                PR = PR + 1;
                if (Pivot > QWORDS[PR]) {
                    swap (&QWORDS[PL], &QWORDS[PR]);
                    PL = PL + 1;
                //} else if (Pivot == QWORDS[PR]) {
                } else if (Pivot < QWORDS[PR]) {
                    for (;Pivot < QWORDS[Jndx];) {
                        Jndx = Jndx - 1;
                    }
                    if (PR < Jndx) swap (&QWORDS[PR], &QWORDS[Jndx]);
                    Jndx = Jndx - 1;
                    PR = PR - 1;
                }
            }
            Jndx = PL - 1;
            Indx = PR + 1;
            // 'Magnetica' partitioning rev.2]
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; ae-64+2=76 bytes i.e. (94-76=18 less), 5/1 conditional/unconditional jumps i.e. 2-1=1 less unconditional jump
; 'Magnetica' partitioning, mainloop rev.2[
.B4.5::                         
  00064 48 ff c5         inc rbp                                
  00067 48 8b 14 e9      mov rdx, QWORD PTR [rcx+rbp*8]         
  0006b 4c 3b ea         cmp r13, rdx                           
  0006e 77 2c            ja .B4.14 
.B4.6::                         
  00070 73 39            jae .B4.15 
.B4.7::                         
  00072 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00076 4d 3b ef         cmp r13, r15                           
  00079 73 0c            jae .B4.11 
.B4.9::                         
  0007b 49 ff cb         dec r11                                
  0007e 4e 8b 3c d9      mov r15, QWORD PTR [rcx+r11*8]         
  00082 4d 3b ef         cmp r13, r15                           
  00085 72 f4            jb .B4.9 
.B4.11::                        
  00087 49 3b eb         cmp rbp, r11                           
  0008a 7d 08            jge .B4.13 
.B4.12::                        
  0008c 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
  00090 4a 89 14 d9      mov QWORD PTR [rcx+r11*8], rdx         
.B4.13::                        
  00094 49 ff cb         dec r11                                
  00097 48 ff cd         dec rbp                                
  0009a eb 0f            jmp .B4.15 
.B4.14::                        
  0009c 4e 8b 3c f1      mov r15, QWORD PTR [rcx+r14*8]         
  000a0 4a 89 14 f1      mov QWORD PTR [rcx+r14*8], rdx         
  000a4 49 ff c6         inc r14                                
  000a7 4c 89 3c e9      mov QWORD PTR [rcx+rbp*8], r15         
.B4.15::                        
  000ab 49 3b eb         cmp rbp, r11                           
  000ae 7c b4            jl .B4.5 
; 'Magnetica' partitioning, mainloop rev.2]
*/
            if (Indx + InsertionsortTHRESHOLD < Right) {
                StackPtr = StackPtr + 2;
                Stack[StackPtr - 1] = Indx;
                Stack[StackPtr] = Right;
            }
            Right = Jndx;
        } while (Left < Right);
    } while (StackPtr != 0);
}
The full console log follows (my laptop i5-7200U 36GB DDR4, Fedora 33):

Code:
[kaze@kaze ~]$ cd /run/media/kaze/Sanmayce_223GB_A/Quicksort_says_rev3
[kaze@kaze Quicksort_says_rev3]$ dir
22338618_QWORDS.bin  Fedora-Workstation-Live-x86_64-35-1.2.iso	Px437_Cordata_PPC-400.ttf  Quicksort_says      Quicksort_says_rev3_Kaby-Lake_Fedora-33.png
bench.sh	     MEM.H					QS_bench_r3.c		   Quicksort_says.bas  Quicksort_says_rev3_Kaby-Lake_Windows-10.png
ColumnChart.ico      mobythesaurus.txt				QS_bench_r3.exe		   Quicksort_says.exe  Sir_Tony_Hoare_IMG_5125.png
[kaze@kaze Quicksort_says_rev3]$ gcc -O3 QS_bench_r3.c -o QS_bench_r3
QS_bench_r3.c:473:1: warning: return type defaults to int [-Wimplicit-int]
  473 | g(d,h){for(i=s;i<1<<25;i*=2)d=d*1LL*d%m;for(p=t;p<t+N;p+=s)for(i=s,c=1;i;i--)a=p[s]*(h?c:1LL)%m,p[s]=(m*1U+*p-a)*(h?1LL:c)%m,*p=(a*1U+*p)%m,p++,c=c*1LL*d%m;}
      | ^
QS_bench_r3.c: In function g:
QS_bench_r3.c:473:1: warning: type of d defaults to int [-Wimplicit-int]
QS_bench_r3.c:473:1: warning: type of h defaults to int [-Wimplicit-int]
QS_bench_r3.c: In function main:
QS_bench_r3.c:557:6: warning: implicit declaration of function strcmp [-Wimplicit-function-declaration]
  557 |  if (strcmp(argv[2],"many")==0) {
      |      ^~~~~~
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r3 qsort few':

        325,465.05 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,063      page-faults:u             #    0.013 M/sec                  
   980,631,883,041      cycles:u                  #    3.013 GHz                      (50.00%)
 3,279,551,895,331      instructions:u            #    3.34  insn per cycle           (62.50%)
   650,482,999,545      branches:u                # 1998.626 M/sec                    (62.50%)
     5,582,924,930      branch-misses:u           #    0.86% of all branches          (62.50%)
   734,186,209,415      L1-dcache-loads:u         # 2255.807 M/sec                    (62.50%)
     6,753,617,333      L1-dcache-load-misses:u   #    0.92% of all L1-dcache accesses  (62.50%)
       404,414,575      LLC-loads:u               #    1.243 M/sec                    (50.00%)
       111,254,842      LLC-load-misses:u         #   27.51% of all LL-cache accesses  (50.00%)

     326.101784425 seconds time elapsed

     317.871539000 seconds user
       5.577791000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 Magnetica  few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r3 Magnetica few':

         45,479.14 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,068      page-faults:u             #    0.096 M/sec                  
   121,955,292,590      cycles:u                  #    2.682 GHz                      (50.00%)
   112,449,514,935      instructions:u            #    0.92  insn per cycle           (62.50%)
    24,092,559,780      branches:u                #  529.750 M/sec                    (62.50%)
     3,384,288,209      branch-misses:u           #   14.05% of all branches          (62.50%)
    13,329,400,797      L1-dcache-loads:u         #  293.088 M/sec                    (62.50%)
     1,655,811,975      L1-dcache-load-misses:u   #   12.42% of all L1-dcache accesses  (62.50%)
        70,845,055      LLC-loads:u               #    1.558 M/sec                    (50.00%)
        62,316,430      LLC-load-misses:u         #   87.96% of all LL-cache accesses  (50.00%)

      45.492587322 seconds time elapsed

      39.738044000 seconds user
       5.413636000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 BM  few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK. Unique keys = 10

 Performance counter stats for './QS_bench_r3 BM few':

         51,751.74 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,363,064      page-faults:u             #    0.084 M/sec                  
   140,850,991,285      cycles:u                  #    2.722 GHz                      (50.00%)
   141,773,758,594      instructions:u            #    1.01  insn per cycle           (62.50%)
    29,021,682,740      branches:u                #  560.787 M/sec                    (62.50%)
     2,964,754,548      branch-misses:u           #   10.22% of all branches          (62.49%)
    20,759,616,908      L1-dcache-loads:u         #  401.139 M/sec                    (62.50%)
     2,253,507,095      L1-dcache-load-misses:u   #   10.86% of all L1-dcache accesses  (62.50%)
       240,693,428      LLC-loads:u               #    4.651 M/sec                    (50.00%)
       119,509,033      LLC-load-misses:u         #   49.65% of all LL-cache accesses  (50.00%)

      51.765804273 seconds time elapsed

      45.999839000 seconds user
       5.392254000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r3 qsort many':

        517,942.56 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,779      page-faults:u             #    0.009 M/sec                  
 1,569,421,552,908      cycles:u                  #    3.030 GHz                      (50.00%)
 2,937,315,064,269      instructions:u            #    1.87  insn per cycle           (62.50%)
   656,656,940,988      branches:u                # 1267.818 M/sec                    (62.50%)
    25,008,235,993      branch-misses:u           #    3.81% of all branches          (62.50%)
   675,474,160,813      L1-dcache-loads:u         # 1304.149 M/sec                    (62.50%)
     8,374,868,618      L1-dcache-load-misses:u   #    1.24% of all L1-dcache accesses  (62.50%)
       244,454,547      LLC-loads:u               #    0.472 M/sec                    (50.00%)
       132,791,156      LLC-load-misses:u         #   54.32% of all LL-cache accesses  (50.00%)

     518.237032044 seconds time elapsed

     508.735040000 seconds user
       5.950954000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r3 Magnetica many':

        230,292.26 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,782      page-faults:u             #    0.021 M/sec                  
   686,398,758,890      cycles:u                  #    2.981 GHz                      (50.00%)
   801,299,477,504      instructions:u            #    1.17  insn per cycle           (62.50%)
   152,341,819,141      branches:u                #  661.515 M/sec                    (62.50%)
    19,420,019,786      branch-misses:u           #   12.75% of all branches          (62.50%)
   105,227,165,554      L1-dcache-loads:u         #  456.929 M/sec                    (62.50%)
     9,758,775,155      L1-dcache-load-misses:u   #    9.27% of all L1-dcache accesses  (62.50%)
       331,205,631      LLC-loads:u               #    1.438 M/sec                    (50.00%)
       260,265,720      LLC-load-misses:u         #   78.58% of all LL-cache accesses  (50.00%)

     230.399202449 seconds time elapsed

     222.654274000 seconds user
       5.973146000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 BM many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK. Unique keys = 2847531

 Performance counter stats for './QS_bench_r3 BM many':

        234,206.73 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         4,896,781      page-faults:u             #    0.021 M/sec                  
   698,588,697,890      cycles:u                  #    2.983 GHz                      (50.00%)
   656,081,987,764      instructions:u            #    0.94  insn per cycle           (62.50%)
   165,692,577,479      branches:u                #  707.463 M/sec                    (62.50%)
    18,927,109,180      branch-misses:u           #   11.42% of all branches          (62.50%)
    85,146,879,545      L1-dcache-loads:u         #  363.554 M/sec                    (62.50%)
     8,623,541,343      L1-dcache-load-misses:u   #   10.13% of all L1-dcache accesses  (62.50%)
       286,352,303      LLC-loads:u               #    1.223 M/sec                    (50.00%)
       227,490,519      LLC-load-misses:u         #   79.44% of all LL-cache accesses  (50.00%)

     234.306277225 seconds time elapsed

     226.632611000 seconds user
       5.938060000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 qsort ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r3 qsort ALL':

        520,097.93 msec task-clock:u              #    0.993 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,014      page-faults:u             #    0.015 M/sec                  
 1,553,894,682,781      cycles:u                  #    2.988 GHz                      (50.00%)
 2,271,179,443,512      instructions:u            #    1.46  insn per cycle           (62.50%)
   536,952,307,468      branches:u                # 1032.406 M/sec                    (62.50%)
    28,368,595,067      branch-misses:u           #    5.28% of all branches          (62.50%)
   520,144,933,578      L1-dcache-loads:u         # 1000.090 M/sec                    (62.50%)
     6,723,006,421      L1-dcache-load-misses:u   #    1.29% of all L1-dcache accesses  (62.50%)
       193,261,861      LLC-loads:u               #    0.372 M/sec                    (50.00%)
       102,759,148      LLC-load-misses:u         #   53.17% of all LL-cache accesses  (50.00%)

     524.002327535 seconds time elapsed

     504.423911000 seconds user
      12.294510000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 Magnetica ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r3 Magnetica ALL':

        306,913.60 msec task-clock:u              #    0.990 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,020      page-faults:u             #    0.026 M/sec                  
   899,578,515,950      cycles:u                  #    2.931 GHz                      (50.00%)
 1,077,947,210,852      instructions:u            #    1.20  insn per cycle           (62.50%)
   202,031,197,397      branches:u                #  658.267 M/sec                    (62.50%)
    25,788,554,132      branch-misses:u           #   12.76% of all branches          (62.50%)
   143,089,373,092      L1-dcache-loads:u         #  466.220 M/sec                    (62.50%)
     7,361,011,433      L1-dcache-load-misses:u   #    5.14% of all L1-dcache accesses  (62.50%)
       240,367,987      LLC-loads:u               #    0.783 M/sec                    (50.00%)
       189,560,929      LLC-load-misses:u         #   78.86% of all LL-cache accesses  (50.00%)

     310.137967617 seconds time elapsed

     292.453199000 seconds user
      12.266850000 seconds sys


[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r3 BM ALL
Allocating FILE-Buffer 1916MB ...
Allocating AUX-Buffer 15329MB ...
Allocating Master-Buffer 15329MB ...
Sorting in single-thread 2009333753 elements...
Checking whether sort went correct... OK. Unique keys = 1912608132

 Performance counter stats for './QS_bench_r3 BM ALL':

        322,688.09 msec task-clock:u              #    0.992 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,849,403      page-faults:u             #    0.024 M/sec                  
   943,352,167,992      cycles:u                  #    2.923 GHz                      (50.00%)
   896,389,731,553      instructions:u            #    0.95  insn per cycle           (62.50%)
   226,893,780,871      branches:u                #  703.137 M/sec                    (62.50%)
    26,011,308,958      branch-misses:u           #   11.46% of all branches          (62.50%)
   119,844,660,142      L1-dcache-loads:u         #  371.395 M/sec                    (62.50%)
     7,940,603,646      L1-dcache-load-misses:u   #    6.63% of all L1-dcache accesses  (62.50%)
       258,555,901      LLC-loads:u               #    0.801 M/sec                    (50.00%)
       217,482,356      LLC-load-misses:u         #   84.11% of all LL-cache accesses  (50.00%)

     325.339098420 seconds time elapsed

     306.791354000 seconds user
      13.564926000 seconds sys


[kaze@kaze Quicksort_says_rev3]$
Love 'Magnetica', enjoy!
Attached Files
File Type: txt QS_bench_r3.c.txt (37.1 KB, 2 views)

Last edited by Sanmayce; 11-12-2021 at 10:05 PM.
 
Old 11-29-2021, 07:12 AM   #8
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 65

Original Poster
Rep: Reputation: 13
After a 2 weeks quest, glad to share Quicksort 'Magnetica' r.2, the FASTEST known to me:

The test program is attached, the testdatafile 3.54 GB (3,803,483,832 bytes) is downloadable at:
https://download.fedoraproject.org/p...aarch64.raw.xz

The benchmark is the most stressful for a 32GB machine, it uses 29GB of QWORDS i.e. 64bit elements, taking 8 bytes as an element at each position in the file being sorted i.e. filesize-8+1 elements in total. All in all, sorting 3,803,483,825 elements (of them unique keys = 3,346,259,533).

The results:
Quicksort 'Magnetica' r.2 is 1020s/569s= 1.792x faster than GLIBC qsort.
Quicksort 'Magnetica' r.2 is 607s/569s= 1.066x faster than Bentley_McIlroy.

And the console dump on my laptop Intel i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz, running Fedora 33, gcc 10.2.1 -O3 was used:

Code:
[kaze@kaze ~]$ cd /run/media/kaze/Sanmayce_223GB_A/Quicksort_says_rev7
[kaze@kaze Quicksort_says_rev7]$ gcc -O3 QS_bench_r7.c -o QS_bench_r7.elf
QS_bench_r7.c:1000:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 1000 | g(d,h){for(i=s;i<1<<25;i*=2)d=d*1LL*d%m;for(p=t;p<t+N;p+=s)for(i=s,c=1;i;i--)a=p[s]*(h?c:1LL)%m,p[s]=(m*1U+*p-a)*(h?1LL:c)%m,*p=(a*1U+*p)%m,p++,c=c*1LL*d%m;}
      | ^
QS_bench_r7.c: In function ‘g’:
QS_bench_r7.c:1000:1: warning: type of ‘d’ defaults to ‘int’ [-Wimplicit-int]
QS_bench_r7.c:1000:1: warning: type of ‘h’ defaults to ‘int’ [-Wimplicit-int]
[kaze@kaze Quicksort_says_rev7]$ ls -l
total 4501312
-rwxrwxrwx. 2 kaze kaze 3803483832 Nov 29 02:06 Fedora-Workstation-35-1.2.aarch64.raw.xz
-rwxrwxrwx. 2 kaze kaze      58408 Nov 29 12:56 QS_bench_r7.c
-rwxrwxrwx. 2 kaze kaze     335536 Nov 29 12:56 QS_bench_r7.cod
-rwxrwxrwx. 1 kaze kaze  268465864 Nov 29 13:06 QS_bench_r7.elf
-rwxrwxrwx. 2 kaze kaze  268535808 Nov 29 12:56 QS_bench_r7.exe
-rwxrwxrwx. 2 kaze kaze  268455855 Nov 29 12:56 QS_bench_r7.obj
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf Magnetica ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 569 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533

 Performance counter stats for './QS_bench_r7.elf Magnetica ALLmore':

        595,760.27 msec task-clock:u              #    0.990 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,428,735      page-faults:u             #    0.012 M/sec                  
 1,774,492,088,758      cycles:u                  #    2.979 GHz                      (50.00%)
 1,795,596,438,927      instructions:u            #    1.01  insn per cycle           (62.50%)
   348,244,974,654      branches:u                #  584.539 M/sec                    (62.50%)
    52,674,210,396      branch-misses:u           #   15.13% of all branches          (62.50%)
   238,612,258,163      L1-dcache-loads:u         #  400.517 M/sec                    (62.50%)
    12,071,470,906      L1-dcache-load-misses:u   #    5.06% of all L1-dcache accesses  (62.50%)
       291,005,737      LLC-loads:u               #    0.488 M/sec                    (50.00%)
       236,834,179      LLC-load-misses:u         #   81.38% of all LL-cache accesses  (50.00%)

     601.769088005 seconds time elapsed

     575.942070000 seconds user
      15.526557000 seconds sys

[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf BM ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 607 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533

 Performance counter stats for './QS_bench_r7.elf BM ALLmore':

        632,750.44 msec task-clock:u              #    0.992 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,428,742      page-faults:u             #    0.012 M/sec                  
 1,890,074,871,799      cycles:u                  #    2.987 GHz                      (50.00%)
 1,657,130,448,069      instructions:u            #    0.88  insn per cycle           (62.50%)
   424,010,738,154      branches:u                #  670.107 M/sec                    (62.50%)
    52,229,564,268      branch-misses:u           #   12.32% of all branches          (62.50%)
   221,684,796,070      L1-dcache-loads:u         #  350.351 M/sec                    (62.50%)
    13,935,776,880      L1-dcache-load-misses:u   #    6.29% of all L1-dcache accesses  (62.50%)
       296,863,115      LLC-loads:u               #    0.469 M/sec                    (50.00%)
       227,231,894      LLC-load-misses:u         #   76.54% of all LL-cache accesses  (50.00%)

     638.047061016 seconds time elapsed

     613.514293000 seconds user
      14.803650000 seconds sys

[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf qsort ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 1020 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533

 Performance counter stats for './QS_bench_r7.elf qsort ALLmore':

      1,047,717.43 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
         7,429,008      page-faults:u             #    0.007 M/sec                  
 3,158,473,480,125      cycles:u                  #    3.015 GHz                      (50.00%)
 4,410,953,498,619      instructions:u            #    1.40  insn per cycle           (62.50%)
 1,045,210,950,583      branches:u                #  997.608 M/sec                    (62.50%)
    55,719,050,158      branch-misses:u           #    5.33% of all branches          (62.50%)
 1,009,805,547,231      L1-dcache-loads:u         #  963.815 M/sec                    (62.50%)
    13,006,118,247      L1-dcache-load-misses:u   #    1.29% of all L1-dcache accesses  (62.50%)
       320,405,768      LLC-loads:u               #    0.306 M/sec                    (50.00%)
       172,928,699      LLC-load-misses:u         #   53.97% of all LL-cache accesses  (50.00%)

    1052.421139598 seconds time elapsed

    1024.281537000 seconds user
      16.586670000 seconds sys

[kaze@kaze Quicksort_says_rev7]$
For more info, plus two videos showing the sort process:
https://qb64.org/forum/index.php?top...8570#msg138570

Enjoy!
Attached Files
File Type: txt QS_bench_r7.c.txt (57.0 KB, 2 views)
 
Old 11-30-2021, 09:22 AM   #9
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 65

Original Poster
Rep: Reputation: 13
Thanks to PCBoy Studios, here comes a nice animated Magnetica in 50 tests:
https://youtu.be/sjQhoX10r9s

Note: This is the initial release of Magnetica, not the latest rev.2, there is no median of 5 branchlessly chosen, nor switching to Insertionsort cutoff at 13.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Issue with sorting using memcmp and qsort Skyrius Programming 2 05-08-2012 05:36 PM
qsort for files? nodger Programming 4 01-12-2006 08:55 PM
qsort() on linux producing the wrong sort Adelsm Programming 17 09-27-2005 02:58 PM
In C, using qsort for dynamically allocated array ntmsz Programming 7 08-23-2005 11:33 AM
qsort in C questions trutnev Programming 4 06-10-2005 09:49 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 11:08 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration