Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux, and any language is fair game.
Just wrote my qsort() replacement; I want to share it with the community and stress my variant with more versatile data...
Generally, we have three major scenarios:
few distinct keys;
many distinct keys;
ALL distinct keys.
In the first two cases, the new implementation is as fast as or faster than the "standard" one.
Sadly, in the third case it is somewhat slower; this should be investigated further.
Anyway, benchmarking will show the practical value of 'Magnetica'; personally, I have started using it already.
Please give it a try; my new partitioning scheme is not perfect, but it pushes the limits of the old ones.
Didn't have time to add the second scenario, but the results (for the 1st scenario, with 10 distinct keys) are promising:
273.8/3.7 = 74x faster
Now attached as a .txt; LQ doesn't allow .c attachments, hmm.
As for Timsort, I never looked into it; I just know it is among the major performers.
AFAIK, my 'Magnetica' partitioning is a new one. Nothing special: I just worked through the swaps on paper for an hour and saw that there is a crucial moment where unnecessary swaps are to be avoided. I wonder whether someone had it before me, just to acknowledge it; I am too lazy to look up others' sources, however I love benchmarking against them.
The "performance" of any sorting algorithm, as measured in microseconds, always depends upon the data. However, the practical performance of any algorithm is usually described in terms of "order," as in O(N log N). Shell sort will generally out-perform bubble sort, and Quicksort will generally out-perform Shell sort.
I suggest that you create a new GitHub project for your new algorithm and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.
No matter how fast the processor, a 100 MB external sort using a single 1993-vintage SCSI disk takes about one minute of elapsed time. This one-minute barrier is created by the 3MB/s sequential transfer rate (bandwidth) of a single commodity disk. We measured both the OpenVMS Sort utility and AlphaSort to take a little under one minute when using one SCSI disk. Both sorts are disk-limited. A faster processor or faster algorithm would not sort much faster because the disk reads at about 4.5 MB/s, and writes at about 3.5 MB/s. Thus, it takes about 25 seconds to read the 100 MB, and about 30 seconds to write the 100 MB answer file. Even on mainframes, sort algorithms like Syncsort and DFsort are limited by this one-minute barrier, unless disk or file striping is used.
>However, the practical performance of any algorithm is usually described in terms of "order," as in O(N log N).
Indeed; however, my focus is on 'perf'. You are welcome to look at the statistics below and draw your own PRACTICAL conclusions about whether a given performer is good enough or not.
>I suggest that you create a new GitHub project for your new algorithm and introduce it for discussion in various places, including the Software Engineering forum at StackExchange.
Nah, I prefer sharing it with fellow forum members; a few times I got treated as a spammer on stack*. GitHub is cool, but I seek feedback from amateurs (in the original sense: lovers of something) like me. Benchmarking is what interests me; of course, if some coder improves on it, even better: we all win.
After all, good things make their way on their own, naturally.
>Slightly off-topic. While searching the XFCE documentation on their site, I stumbled upon an old article from 1995 on AlphaSort.
Thanks, I had no idea what it is. For thirty long years I have admired the work of Bob Pirko and his RPSORT.ASM, a 16-bit sorter, yet I never looked into his source code: I didn't want to "steal" his ideas. Weird, but maybe there are some ideas worth reviving.
Looking calmly at 'Magnetica' for a few minutes, I couldn't miss the obvious optimization of moving 'PR+1' outside the if-else-if-else; the refined revision is now smaller and faster.
Now, we can benchmark either with QS_bench_r2.c or by adding 'Magnetica' to your code.
You need "22338618_QWORDS.bin" and "mobythesaurus.txt" from this package: https://www.qb64.org/forum/index.php...0;attach=14875
As for the third case (ALL distinct keys), I want 2,000,000,000 keys, but how do I shuffle them so that the keys come out in the same order every time?
In the near future, I intend to make a showdown between Bentley-McIlroy 3-way partitioning and 'Magnetica'; I am not convinced their scheme is better, we will see...
This is the refinement which is used in QS_bench_r2.c:
Glad to share the first two cases benchmarked with 'Magnetica' rev.2; here is the console log on my Fedora 33, i5-7200U (3.1GHz max turbo), 36GB DDR4-2133:
Code:
[kaze@kaze Quicksort_says_rev3]$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 qsort few':
372,145.95 msec task-clock:u # 0.998 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,363,067 page-faults:u # 0.012 M/sec
1,123,792,986,601 cycles:u # 3.020 GHz (50.00%)
3,268,399,480,738 instructions:u # 2.91 insn per cycle (62.50%)
650,531,988,573 branches:u # 1748.056 M/sec (62.50%)
5,519,977,909 branch-misses:u # 0.85% of all branches (62.50%)
734,216,718,766 L1-dcache-loads:u # 1972.927 M/sec (62.50%)
6,752,096,756 L1-dcache-load-misses:u # 0.92% of all L1-dcache accesses (62.50%)
423,592,362 LLC-loads:u # 1.138 M/sec (50.00%)
126,404,435 LLC-load-misses:u # 29.84% of all LL-cache accesses (50.00%)
372.770596352 seconds time elapsed
364.324923000 seconds user
5.532901000 seconds sys
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica few
Allocating AUX-Buffer 21MB ...
Allocating Master-Buffer 17043MB ...
Sorting in single-thread 2233861800 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 Magnetica few':
45,966.43 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,363,067 page-faults:u # 0.095 M/sec
123,532,537,542 cycles:u # 2.687 GHz (50.00%)
101,337,186,572 instructions:u # 0.82 insn per cycle (62.50%)
24,106,121,927 branches:u # 524.429 M/sec (62.50%)
3,277,959,707 branch-misses:u # 13.60% of all branches (62.50%)
13,329,593,401 L1-dcache-loads:u # 289.985 M/sec (62.50%)
1,655,094,422 L1-dcache-load-misses:u # 12.42% of all L1-dcache accesses (62.50%)
83,759,355 LLC-loads:u # 1.822 M/sec (50.00%)
69,790,047 LLC-load-misses:u # 83.32% of all LL-cache accesses (50.00%)
45.977603171 seconds time elapsed
40.202975000 seconds user
5.433257000 seconds sys
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 qsort many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 qsort many':
547,697.55 msec task-clock:u # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,896,781 page-faults:u # 0.009 M/sec
1,660,777,393,784 cycles:u # 3.032 GHz (50.00%)
2,926,096,300,717 instructions:u # 1.76 insn per cycle (62.50%)
656,833,950,654 branches:u # 1199.264 M/sec (62.50%)
24,981,866,259 branch-misses:u # 3.80% of all branches (62.50%)
675,499,020,355 L1-dcache-loads:u # 1233.343 M/sec (62.50%)
8,384,728,679 L1-dcache-load-misses:u # 1.24% of all L1-dcache accesses (62.50%)
253,666,230 LLC-loads:u # 0.463 M/sec (50.00%)
149,205,273 LLC-load-misses:u # 58.82% of all LL-cache accesses (50.00%)
547.972289671 seconds time elapsed
538.278672000 seconds user
5.984880000 seconds sys
[kaze@kaze Quicksort_says_rev3]$ perf stat -d ./QS_bench_r2 Magnetica many
Allocating FILE-Buffer 23MB ...
Allocating AUX-Buffer 189MB ...
Allocating Master-Buffer 18938MB ...
Sorting in single-thread 2482300900 elements...
Checking whether sort went correct... OK.
Performance counter stats for './QS_bench_r2 Magnetica many':
241,082.32 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,896,781 page-faults:u # 0.020 M/sec
719,580,616,824 cycles:u # 2.985 GHz (50.00%)
788,555,184,960 instructions:u # 1.10 insn per cycle (62.50%)
152,383,462,963 branches:u # 632.081 M/sec (62.50%)
18,722,063,529 branch-misses:u # 12.29% of all branches (62.50%)
105,308,386,791 L1-dcache-loads:u # 436.815 M/sec (62.50%)
9,762,518,793 L1-dcache-load-misses:u # 9.27% of all L1-dcache accesses (62.50%)
291,667,293 LLC-loads:u # 1.210 M/sec (50.00%)
240,922,518 LLC-load-misses:u # 82.60% of all LL-cache accesses (50.00%)
241.152242784 seconds time elapsed
233.338084000 seconds user
5.991166000 seconds sys
[kaze@kaze Quicksort_says_rev3]$
In case "FEW distinct", Magnetica is 364.3/40.2 = 9.0x faster than qsort.
In case "MANY distinct", Magnetica is 538.2/233.3 = 2.3x faster than qsort.
In case "ALL distinct", to be seen...
I kinda expected the Bentley-McIlroy duo to be superior (in the many-distinct-keys scenario); it turns out 'Magnetica' dominates in all three cases. The following showdown tells a lot:
The Quicksort Showdown roster:
Code:
+-----------------+--------------+---------------+---------------+
| Performer/Keys | FEW distinct | MANY distinct | ~ALL distinct |
+-----------------+--------------+---------------+---------------+
| qsort | 317 seconds | 508 seconds | 504 seconds |
| Magnetica | 39 seconds | 222 seconds | 292 seconds |
| Bentley-McIlroy | 45 seconds | 226 seconds | 306 seconds |
+-----------------+--------------+---------------+---------------+
Legend:
FEW = 2,233,861,800 keys, of them distinct keys = 10
MANY = 2,482,300,900 keys, of them distinct keys = 2,847,531
~ALL = 2,009,333,753 keys, of them distinct keys = 1,912,608,132
The code (used in QS_bench_r3.c) of Bentley-McIlroy 3-way partitioning by Dr. Jon Bentley (Bell Laboratories) and Dr. Douglas McIlroy (Bell Laboratories):
Code:
#include <stdint.h>

/* swap two 64-bit keys; the original listing assumes this helper exists */
static void swap(uint64_t *x, uint64_t *y) { uint64_t t = *x; *x = *y; *y = t; }

void quicksort_Bentley_McIlroy_3way_partitioning(uint64_t a[], int64_t l, int64_t r)
{
    int64_t i = l-1, j = r, p = l-1, q = r;
    int64_t k;
    uint64_t v;
    if (r <= l) return;            /* check BEFORE reading a[r] */
    v = a[r];                      /* pivot = rightmost element */
    for (;;)
    {
        while (a[++i] < v) ;                       /* scan up from the left */
        while (v < a[--j]) if (j == l) break;      /* scan down from the right */
        if (i >= j) break;
        swap(&a[i], &a[j]);
        if (a[i] == v) { p++; swap(&a[p], &a[i]); }    /* park equals on the left edge */
        if (v == a[j]) { q--; swap(&a[j], &a[q]); }    /* park equals on the right edge */
    }
    swap(&a[i], &a[r]); j = i-1; i = i+1;
    for (k = l; k < p; k++, j--) swap(&a[k], &a[j]);   /* fold equals into the middle */
    for (k = r-1; k > q; k--, i++) swap(&a[i], &a[k]);
    quicksort_Bentley_McIlroy_3way_partitioning(a, l, j);
    quicksort_Bentley_McIlroy_3way_partitioning(a, i, r);
}
The code (used in QS_bench_r3.c) of Magnetica 3-way partitioning by Sanmayce:
The benchmark is the most stressful one for a 32GB machine: it uses 29GB of QWORDs, i.e. 64-bit elements, taking the 8 bytes at each byte position of the file being sorted, i.e. filesize-8+1 elements in total. All in all, it sorts 3,803,483,825 elements (of them, unique keys = 3,346,259,533).
The results:
Quicksort 'Magnetica' r.2 is 1020s/569s = 1.792x faster than GLIBC qsort.
Quicksort 'Magnetica' r.2 is 607s/569s = 1.066x faster than Bentley-McIlroy.
And here is the console dump on my laptop: Intel i5-7200U (3.1GHz max turbo), 36GB DDR4-2133, running Fedora 33; gcc 10.2.1 with -O3 was used:
Code:
[kaze@kaze ~]$ cd /run/media/kaze/Sanmayce_223GB_A/Quicksort_says_rev7
[kaze@kaze Quicksort_says_rev7]$ gcc -O3 QS_bench_r7.c -o QS_bench_r7.elf
QS_bench_r7.c:1000:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
1000 | g(d,h){for(i=s;i<1<<25;i*=2)d=d*1LL*d%m;for(p=t;p<t+N;p+=s)for(i=s,c=1;i;i--)a=p[s]*(h?c:1LL)%m,p[s]=(m*1U+*p-a)*(h?1LL:c)%m,*p=(a*1U+*p)%m,p++,c=c*1LL*d%m;}
| ^
QS_bench_r7.c: In function ‘g’:
QS_bench_r7.c:1000:1: warning: type of ‘d’ defaults to ‘int’ [-Wimplicit-int]
QS_bench_r7.c:1000:1: warning: type of ‘h’ defaults to ‘int’ [-Wimplicit-int]
[kaze@kaze Quicksort_says_rev7]$ ls -l
total 4501312
-rwxrwxrwx. 2 kaze kaze 3803483832 Nov 29 02:06 Fedora-Workstation-35-1.2.aarch64.raw.xz
-rwxrwxrwx. 2 kaze kaze 58408 Nov 29 12:56 QS_bench_r7.c
-rwxrwxrwx. 2 kaze kaze 335536 Nov 29 12:56 QS_bench_r7.cod
-rwxrwxrwx. 1 kaze kaze 268465864 Nov 29 13:06 QS_bench_r7.elf
-rwxrwxrwx. 2 kaze kaze 268535808 Nov 29 12:56 QS_bench_r7.exe
-rwxrwxrwx. 2 kaze kaze 268455855 Nov 29 12:56 QS_bench_r7.obj
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf Magnetica ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 569 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533
Performance counter stats for './QS_bench_r7.elf Magnetica ALLmore':
595,760.27 msec task-clock:u # 0.990 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,428,735 page-faults:u # 0.012 M/sec
1,774,492,088,758 cycles:u # 2.979 GHz (50.00%)
1,795,596,438,927 instructions:u # 1.01 insn per cycle (62.50%)
348,244,974,654 branches:u # 584.539 M/sec (62.50%)
52,674,210,396 branch-misses:u # 15.13% of all branches (62.50%)
238,612,258,163 L1-dcache-loads:u # 400.517 M/sec (62.50%)
12,071,470,906 L1-dcache-load-misses:u # 5.06% of all L1-dcache accesses (62.50%)
291,005,737 LLC-loads:u # 0.488 M/sec (50.00%)
236,834,179 LLC-load-misses:u # 81.38% of all LL-cache accesses (50.00%)
601.769088005 seconds time elapsed
575.942070000 seconds user
15.526557000 seconds sys
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf BM ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 607 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533
Performance counter stats for './QS_bench_r7.elf BM ALLmore':
632,750.44 msec task-clock:u # 0.992 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,428,742 page-faults:u # 0.012 M/sec
1,890,074,871,799 cycles:u # 2.987 GHz (50.00%)
1,657,130,448,069 instructions:u # 0.88 insn per cycle (62.50%)
424,010,738,154 branches:u # 670.107 M/sec (62.50%)
52,229,564,268 branch-misses:u # 12.32% of all branches (62.50%)
221,684,796,070 L1-dcache-loads:u # 350.351 M/sec (62.50%)
13,935,776,880 L1-dcache-load-misses:u # 6.29% of all L1-dcache accesses (62.50%)
296,863,115 LLC-loads:u # 0.469 M/sec (50.00%)
227,231,894 LLC-load-misses:u # 76.54% of all LL-cache accesses (50.00%)
638.047061016 seconds time elapsed
613.514293000 seconds user
14.803650000 seconds sys
[kaze@kaze Quicksort_says_rev7]$ perf stat -d ./QS_bench_r7.elf qsort ALLmore
Allocating FILE-Buffer 3627MB ...
Allocating Master-Buffer 29018MB ...
Sorting in single-thread 3803483825 elements...
Done in 1020 seconds.
Checking whether sort went correct... OK. Unique keys = 3346259533
Performance counter stats for './QS_bench_r7.elf qsort ALLmore':
1,047,717.43 msec task-clock:u # 0.996 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,429,008 page-faults:u # 0.007 M/sec
3,158,473,480,125 cycles:u # 3.015 GHz (50.00%)
4,410,953,498,619 instructions:u # 1.40 insn per cycle (62.50%)
1,045,210,950,583 branches:u # 997.608 M/sec (62.50%)
55,719,050,158 branch-misses:u # 5.33% of all branches (62.50%)
1,009,805,547,231 L1-dcache-loads:u # 963.815 M/sec (62.50%)
13,006,118,247 L1-dcache-load-misses:u # 1.29% of all L1-dcache accesses (62.50%)
320,405,768 LLC-loads:u # 0.306 M/sec (50.00%)
172,928,699 LLC-load-misses:u # 53.97% of all LL-cache accesses (50.00%)
1052.421139598 seconds time elapsed
1024.281537000 seconds user
16.586670000 seconds sys
[kaze@kaze Quicksort_says_rev7]$
Note: This is the initial release of Magnetica, not the latest rev.2; there is no branchlessly chosen median of 5, nor the switch to Insertionsort at a cutoff of 13.
Enter Quicksort 'Akkoda'.
The fastest Quicksort known to me.
The amazing work that Scandum did with his Crumsort is unbeatable in my view; yet there is hidden, unexploited potential, namely the bi-threaded variant of my Quicksort 'Magnetica', called 'Akkoda'.
In my heavy benchmarks, it beats Crumsort with both the GCC and ICL compilers.
The full C source is attached. If some fellow member is interested in running the 6 heavy benchmarks (1 to 7 billion QWORDs as keys), just let me know; I will gladly assist with how to run them.
// Laptop "Compressionette", Intel 'Kaby Lake' i5-7200U 3.1GHz max turbo, 36GB DDR4 2133MHz:
// +----------------------------------+----------------------------+-----------------------------+---------------------------+-----------------------------+------------------------------+-----------------------------+
// | Performer/Keys | #1, FEW distinct | #2, MANY distinct | #3, MANYmore distinct | #4, ALL distinct | #5, ALLmore distinct | #6, ALLmax distinct |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+
// | Operating System, | Windows 10, | Fedora 35, | Windows 10, | Fedora 35, | Windows 10, | Fedora 35, | Windows 10, | Fedora 35, | Windows 10, | Fedora 35, | Windows 10, | Fedora 35, |
// | Compiler | Intel v15.0 | GCC 11.2.1 | Intel v15.0 | GCC 11.2.1 | Intel v15.0 | GCC 11.2.1 | Intel v15.0 | GCC 11.2.1 | Intel v15.0 | GCC 11.2.1 | Intel v15.0 | GCC 11.2.1 |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+
// | Quicksort 'Akkoda' v.1 ITERATIVE | 26/04 sec. | 027/004 sec. | 121/032 sec. | 129/030 sec. | 054/14 sec. | 058/13 sec. | 131/036 sec. | 140/036 sec. | 303/072 sec. | 0323/072 sec. | N.A. | N.A. |
// | Quicksort 'Akkoda' v.1 RECURSIVE | 27/05 sec. | 027/005 sec. | 127/032 sec. | 123/031 sec. | 059/16 sec. | 067/24 sec. | 174/061 sec. | 254/150 sec. | 367/120 sec. | 0575/284 sec. | N.A. | N.A. |
// | Quicksort 'Magnetica' v.16 | 31/05 sec. | 031/005 sec. | 205/040 sec. | 218/040 sec. | 089/19 sec. | 094/19 sec. | 252/060 sec. | 264/058 sec. | 492/121 sec. | 0518/118 sec. | N.A. | N.A. |
// | Quicksort 'Hoare' | 89/62 sec. | 105/076 sec. | 220/053 sec. | 264/083 sec. | 096/26 sec. | 116/41 sec. | 255/033 sec. | 308/065 sec. | 502/072 sec. | 0602/137 sec. | N.A. | N.A. |
// | Quicksort 'Bentley-McIlroy' | 37/11 sec. | 038/013 sec. | 206/037 sec. | 224/046 sec. | 091/21 sec. | 100/25 sec. | 274/049 sec. | 305/062 sec. | 546/109 sec. | 0601/134 sec. | N.A. | N.A. |
// | qsort | 59/16 sec. | 360/320 sec. | 344/091 sec. | 523/240 sec. | 161/42 sec. | 196/69 sec. | 447/124 sec. | 521/146 sec. | 871/248 sec. | 1010/287 sec. | N.A. | N.A. |
// | Crumsort v.1.1.5.3 | 27/03 sec. | 029/004 sec. | 123/004 sec. | 129/005 sec. | 059/02 sec. | 061/02 sec. | 168/003 sec. | 174/004 sec. | 330/006 sec. | 0340/008 sec. | N.A. | N.A. |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+
// | Best Time | 26 | 027 | 121 | 123 | 054 | 058 | 131 | 140 | 303 | 0323 | N.A. | N.A. |
// +----------------------------------+-------------+--------------+--------------+--------------+-------------+-------------+--------------+--------------+--------------+---------------+--------------+--------------+
Code:
// Quick highlights:
// GLIBC's qsort() is trash;
// Microsoft/Intel's qsort() is trashy;
// Quicksort 'Hoare' is faster than Quicksort 'Bentley-McIlroy' on MOSTLY DISTINCT data (but only with ICL);
// Quicksort 'Magnetica' is faster than Quicksort 'Hoare'; however, on data that is SORTED and at the same time MOSTLY DISTINCT, it is inferior;
// Quicksort 'Akkoda' v.1 ITERATIVE is faster than Crumsort v.1.1.5.3, with the UNFAIR usage of another thread, but gets crushed on SORTED data;
//
// Legend (The time is exactly the Sort process time, first value is for unsorted, second one is for sorted):
// #1,FEW = 2,233,861,800 keys, of them distinct = 10; 178,708,944 bytes 22338618_QWORDS.bin; elements = 178,708,944/8 *100; // Keys are 100 times duplicated
// #2,MANY = 2,482,300,900 keys, of them distinct = 2,847,531; 24,823,016 bytes mobythesaurus.txt; elements = 24823016 -8+1; // BuildingBlocks are size-order+1, they are 100 times duplicated
// #3,MANYmore = 1,137,582,073 keys, of them distinct = 77,275,994; 1,137,582,080 bytes linux-5.15.25.tar; elements = 1137582080 -8+1; // BuildingBlocks are size-order+1
// #4,ALL = 2,009,333,753 keys, of them distinct = 1,912,608,132; 2,009,333,760 bytes Fedora-Workstation-Live-x86_64-35-1.2.iso; elements = 2009333760 -8+1; // BuildingBlocks are size-order+1
// #5,ALLmore = 3,803,483,825 keys, of them distinct = 3,346,259,533; 3,803,483,832 bytes Fedora-Workstation-35-1.2.aarch64.raw.xz; elements = 3803483832 -8+1; // BuildingBlocks are size-order+1
// #6,ALLmax = 7,798,235,435 keys, of them distinct = 6,770,144,405; 7,798,235,442 bytes math.stackexchange.com_en_all_2019-02.zim; elements = 7798235442 -8+1; // BuildingBlocks are size-order+1
Code:
// Notes:
// - The pair represents UNSORTED/SORTED pool, respectively;
// - Used options for GCC compiler: -O3 -fomit-frame-pointer -fopenmp
// - Used options for ICL compiler: /arch:SSE4.1 /O3 /Qopenmp
// - All the runs were in "Current priority class is REALTIME_PRIORITY_CLASS" for Windows and "Current priority is -20." for Linux;
// - To see more stats (the ones the tables were derived from), see 'log_i5-7200U_APR03.txt';
// - Benchmark needs 32GB RAM, and 64GB for the 6th testset;
// - The whole package (except the 3rd, 4th, 5th and 6th datasets) is downloadable at:
// www.sanmayce.com/QS_showdown_r16.7z
// - To reproduce the roster, run on Windows or Linux:
// - BENCH_ICL32GB.BAT
// - BENCH_ICL64GB.BAT
// - sh bench_gcc32GB.sh
// - sh bench_gcc64GB.sh
// - 3rd dataset is downloadable at:
// https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.15.25.tar.xz
// - 4th dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-35-1.2.iso
// - 5th dataset is downloadable at:
// https://download.fedoraproject.org/pub/fedora/linux/releases/35/Workstation/aarch64/images/Fedora-Workstation-35-1.2.aarch64.raw.xz
// - 6th dataset is downloadable at:
// https://download.kiwix.org/zim/stack_exchange/math.stackexchange.com_en_all_2019-02.zim
// - Managed to reduce the mainloop down to 68 bytes (from 71 in r.12); the gain comes from switching to pointers. I had to do that a long time ago but wanted to keep the etude ARRAY-syntax friendly. Now r.12 uses arrays, whereas r.13 uses pointers. Another turning point: GCC now beats ICL in all tests.
// - In order to switch to "Benchmark mode":
// cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
// echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
// - Quicksort 'Akkoda' v.1 invokes two threads of Quicksort 'Magnetica' v.16, OpenMP is used;
// - Scandum's Crumsort is the FASTEST in-place sorter, known to me, hail Scandum!
I wonder if the main reason these are so much worse is that they have to take the comparison function as a parameter, whereas the other sort functions seem to be hard-coding "<" directly.
I cannot believe how such functions still exist.
Anyway, for a long time I have needed a superfast sorter for billions of QWORD keys; glad that I played with Quicksort. This is how Zen 2 executes the benchmark:
The full C source, with compile scripts/lines (and a 40-page PDF booklet, a good reference/starting point for all coders who want to improve on it), is freely downloadable at: www.sanmayce.com/QS_showdown_r16.7z
In a lot of cases any kind of sort is sufficient, especially when we have only a few items to work with. A general sort algorithm is as good as any other one for this purpose.
On the other hand, in the case of a huge amount of data [structures], we always need a special and unique implementation where all the important details are taken into account (data types, record size, index table, "distinctness", the way of comparison, and also the hardware: CPU type, amount of RAM, I/O speed).
Just curious: how would you sort all the existing/valid phone numbers based on the age of the owners (in minutes)?