Unique combinations from range of integers

astrogeek · 10-17-2015, 05:39 PM

The following is a question that arose from a study of discrete math which has become a pleasant diversion. The original question was about the number of unique combinations of objects, of specified length, that could be formed from some set.

Being computer users, we began to think in terms of programs and algorithms that would generate such lists of combinations for us. PHP is an excellent rapid prototyping language, hence the solution below is written in PHP, but the language is really not important.

We can state the math and coding problem like this:

Code:

* Given some set with n members, generate all subsets with k members
* Code for best combined execution speed and elegance of code
* Language of your choice (speed comparison is among same language solutions only)
* Execution time does not include any output or use of resultant combinations

Relying on countability of finite sets to transition from "objects" to integers, we state our universal set to be some range of integers from zero to n-1, and extend that to represent any set of objects.

We have gone through quite a few alternative approaches and post the following code as our best current PHP solution in terms of simplicity consistent with best execution speed.

The initial solution was a much simpler (code wise) recursive approach, but that turns out to be the worst by far in terms of execution speed (with PHP), somewhat surprisingly. (This prompted us to look for faster algorithms/code methods.)

We would be interested to see alternate algorithms and alternate code solutions, including solutions using other programming languages, from any one interested.

Code:

#!/usr/bin/php
<?php

/**
 * Specification:
 *
 * 1: Generate all k-combinations of integers in range 0 to n-1 (inclusive) for
 *      some k and n where 0 < k <= n
 * 2: Make execution time as low as possible.
 * 3: Make code as elegant as possible.
 * 4: The choice of programming language is yours.
 * 5: Use or output of combinations not included in determining execution speed
 *
 * The number of combinations is given by the function
 *
 *             n!
 * C(n,k) = --------
 *          k!(n-k)!
 *
 * Below is my solution in PHP - JASA 10/17/15.
 */
/*** Runtime confguration ***/
$n=36;          // Numbers to use in combination (0 to $n-1 inclusive).
$k=8;           // Combination length (0 < k <= n).
/*** End Runtime confguration ***/

// Newline.
define('NL',chr(10));

function fact($n)
{
/** Returns factorial of $n (n!). */
for($f=1;$n>1;--$n)
        $f*=$n;
return $f;
}

$e=(int)(fact($n)/
        (fact($k)*fact($n-$k))); // Expected number of combinations.
$x=0;           // Counter.
$c=$m=array();  // $c=combination array. $m=max values of elements
                // ($n-$k,$n-($k-1),$n-($k-2),...)

// Set $m and $c.
for($i=0;$i<$k;++$i)
        $m[$i]=$n-($k-($c[$i]=$i));
// Last value of combination.
$v=&$c[$k-1];

// Special case where k=1 we go straight to solution set
if($k==1)
        for(;$v<$n;++$v,++$x){
                //Comment out next line to suppress output of all combinations
                //echo join(' ',$c),NL;
                }
else
        // Start combinations (k>1).
        while(1){
                // Change last number of combination.
                for(;$v<$n;++$v,++$x){
                        //Comment out next line to suppress output of all combinations
                        //echo join(' ',$c),NL;
                        }

                // Adjust earlier values.
                for($i=$k-2;++$c[$i]>$m[$i]&&--$i>-1;)
                        ;

                // At end of combinations?
                if($i<0)
                        break;

                // Adjust later values (after $i) for next loop.
                while(++$i<$k)
                        $c[$i]=$c[$i-1]+1;
        }

// For $n=36 and $k=8 this should be 30260340.
echo 'Combinations: ',$x,NL;

// Correct number of combinations?
if($x!==$e)
        echo 'Wrong number of combinations (expected ',$e,', got ',$x,')!',NL;
?>

*** Some added comments:

"Combination" is used here in the strict sense - an unordered collection - not to be confused with a permutation.

I have an awk version in mind, too - but nothing to show, yet... and a relational DB approach would support direct implementation using set notation... lots of ideas to explore.

norobro · 10-17-2015, 09:27 PM

Very nice! I have at various times enjoyed solving Project Euler problems.

I cobbled the following C++ program together from some Euler solutions that I have. It executes in <200 milliseconds on my machine.

Code:

// command line: g++ -O3 -std=c++11 main.cpp
#include <iostream>
#include <vector>
#include <ctime>
#include <chrono>

std::vector<int> range;
std::vector<int> combination;

static int count = 0;

double factorial(double number){
    return number==1 ? number : number*factorial(number - 1);
}

void print(const std::vector<int>& combo) {
    for (size_t i = 0; i < combo.size(); ++i) {
        std::cout << combo[i] << " ";
    }
    std::cout  << std::endl;
}

void permute(int offset, int k) {
    if (k == 0) {
        ++count;
        //print(combination);  // uncomment to print combinations
        return;
    }
    for (size_t i = offset; i <= range.size() - k; ++i) {
        combination.push_back(range[i]);
        permute(i+1, k-1);
        combination.pop_back();
    }
}

int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
    int n = 36;
    int k = 8;

    for (auto i = 0; i < n; ++i) {
        range.push_back(i+1);
    }
    permute(0, k);
    int possible = factorial(n) / (factorial(k)*factorial(n-k));
    if(count == possible) {
        std::cout << "The number of combinations is: " << count << std::endl;
    } else {
        std::cout << "Solution incorrect\n";
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Processing time= "<< double(elapsed.count()) <<" milliseconds"<< std::endl;
    return 0;
}

metaschima · 10-17-2015, 10:17 PM

There is a huge potential for optimization here, but the elegance will drop very quickly. So, can you clarify ?

astrogeek · 10-17-2015, 10:52 PM

Quote:

Originally Posted by norobro

Very nice! I have at various times enjoyed solving Project Euler problems.

I cobbled the following C++ program together from some Euler solutions that I have. It executes in <200 milliseconds on my machine.

Very nice indeed! I have thought about doing a c++ version but just had not gotten around to it! Well done!

*** Added comments after having time to look more closely at your code...

Your recursive permute(...) function is essentially what we had originally implemented in PHP, using arrays as you use vectors. It was surprisingly s-l-o-w...

I initially thought we were doing something incorrectly and set about trying tp improve the performance, with little effect. It turned out to be the recursion, not the array operations that were the bottleneck. Similar but non-recursive implementations were faster by a factor of 2-4 times.

Your vector / permute() function implementation meets my definition of elegant!

astrogeek · 10-17-2015, 10:59 PM

Quote:

Originally Posted by metaschima

There is a huge potential for optimization here, but the elegance will drop very quickly. So, can you clarify ?

Yea, I thought the word elegance might be kind of subjective.

What we had in mind was readability and neatness mostly, I think. Nothing too obscure or incomprehensible!
(Our PHP example is over commented for this post, so verbose comments are not an inelegant factor)

But as coding styles can be quite different, submit your optimizations and we won't complain as long as we can make sense of it - so let's see what you have in mind.

metaschima · 10-17-2015, 11:56 PM

Well, the entire problem is just that equation. The equation has many factorials, so factorials need to be done quickly. Also note that the chance of overflowing a type is high. Either you're going to use an optimized library that handles this, or you're going to make a table of factorials and use that. If you really have to calculate factorials, you can use threads.

astrogeek · 10-18-2015, 12:04 AM

Quote:

Originally Posted by metaschima

Well, the entire problem is just that equation. The equation has many factorials, so factorials need to be done quickly...

Actually no, the factorial calculation is only used to find the number of combinations for a given n and k - but that is not part of the stated problem. We added that code before posting here only as a convenience check so anyone writing code could know how to test if theirs (or ours) actually found all possible combinations as determined by actual count.

Code:

* Given some set with n members, generate all subsets with k members

You can leave out the factorial bit of code if you want - the actual problem space is to generate all possible combinations of given length, not to calculate how many there are.

*** Also, making a table of potential combinations and using them for lookup had occurred to us but those would be considered inelegant as a solution. We would want to see the code that generated the tables.

norobro · 10-18-2015, 10:26 AM

Speaking of slow, last night I tried the following program using 'next_permutation' from the C++ standard library. Compiled without optimization it takes 21.3 sec. to process. This morning I tried compiling it with '-O3' and the result is a more acceptable 1.3 sec. but still slower than the program posted above. Maybe I'm not holding my mouth right . . .

Code:

#include <iostream>
#include <algorithm>
#include <vector>
#include <ctime>
#include <chrono>

void print(const std::vector<bool> &v, int n) {
    for (int i = 0; i < n; ++i) {
            if (v[i]) {
                std::cout << (i+1) << " ";
            }
        }
        std::cout << "\n";
}

int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
    int n = 36;
    int k = 8;
    std::vector<bool> v(n);
    std::fill(v.begin() + n - k, v.end(), true);
    int count = 0;
    do {
        //print(v, n);
        ++count;
    } while (std::next_permutation(v.begin(), v.end()));
    std::cout << count << std::endl;
    auto end_time = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Processing time= "<< double(elapsed.count()) <<" milliseconds"<< std::endl;
    return 0;
}

An aside: Is it not possible to thank someone on the first post of a thread?

millgates · 10-18-2015, 10:55 AM

As an alternative, here's a somewhat naive, not-very-elegant, non-recursive solution:

Code:

#include <iostream>
#include <vector>
#include <stdio.h>

int comb(unsigned n, unsigned k) {
	// check arguments
        if (n < k || k < 1) return -1;

	// create data
	std::vector <unsigned> items;
	items.reserve(k);
	for (unsigned i = 0; i < k; i++) { items.push_back(i); }
	
	// start with the last of the k items
	unsigned which = k - 1;

	// get combinations
	for(;;) {
		// print current
		//for (unsigned i = 0; i < k; i++) { std::cout << items[i] << " "; }
		//std::cout << std::endl;

		// final state
		if (items[0] == n - k) { break; }

		
		while( (items[which] == n - 1 
			|| (which != k - 1 
				&& items[which + 1] == items[which] + 1))
		) { which--; }
	       	items[which]++;
		for (unsigned i = which + 1; i < k; i++) 
		{ items[i] = items[which]+i-which; }
		which=k-1;

	}

	return 0;
}

int main(int argc, char *argv[]) {
	unsigned n=36,k=8;
	comb(n,k);
	return 0;
}

astrogeek · 10-19-2015, 05:59 PM

Sorry for slow responses, my new "normal" I think!

Quote:

Originally Posted by norobro

Speaking of slow, last night I tried the following program using 'next_permutation' from the C++ standard library. Compiled without optimization it takes 21.3 sec. to process. This morning I tried compiling it with '-O3' and the result is a more acceptable 1.3 sec. but still slower than the program posted above. Maybe I'm not holding my mouth right . . .

An aside: Is it not possible to thank someone on the first post of a thread?

Thanks for the variation.

On my single CPU 32 bit laptop I get 38.8 seconds without optimization, 3.6 seconds with -O2 and 3.9 seconds with -O3.

It is interesting that it is so much slower than the first one you posted, which comes in at 3.76s not optimization*, 0.85 with -O2 and 0.51s with -O3.

*Note: I had to make the following change to get correct possible value with no optimizations:

Code:

#include <cmath>
...
int possible = ceil(factorial(n) / (factorial(k)*factorial(n-k)));

It is a delight to read your code by the way!

Aside: I think it is not possible on first post for helpful clicks, but you can use the little scale icon in lower left of post near OS icon.

astrogeek · 10-19-2015, 06:31 PM

Quote:

Originally Posted by millgates

As an alternative, here's a somewhat naive, not-very-elegant, non-recursive solution:

Thanks for the sweepstakes entry!

Most of my own code is naive, it is the only way I can read it the next day!

This is elegant enough!

I added a counter just as a sanity check...

Code:

long int comb(unsigned n, unsigned k) {
	// check arguments
        long int count = 0;

	for(;;) {
                count++;
                ...
	}
	return count;
}

int main(int argc, char *argv[]) {
        ...
        std::cout << "Items: " << comb(n,k) << std::endl;
        ...
}

... and it checks!

On my 32 bit single CPU laptop this runs in 1.29s without optimizations, 0.39 with -O2 and 0.41 with -O3. So overall this is fastest yet.

Very nice, thanks!

astrogeek · 10-20-2015, 01:10 AM

This will not win any speed contest compared to C++, but it occurred to me that the desired result can be most easily expressed in set notation, and that is the world of relational databases and SQL. So I have worked up a relational / SQL implementation using MariaDB (MySQL)...

First express the problem in simple set notation:

Code:

Given a set B with cardinality n
   and a constant k, 0 < k <= n

Find all sets A {A is proper subset of B, |A| = k}

So first we need to construct our set B as a simple relation (table) with one column (which we will call 'id') of integer type and insert n rows of sequential integers starting with 1. (We will name the table combo and use 1 based integers instead of zero based as in other examples because it simplifies inserting the values).

I include here a simple SQL script to do this for you. You only need to do this once, or to change the number of rows (cardinality of B).

Code:

Save SQL below to a file, mkset.sql then from mysql client with database seleted:
\. mkset.sql
............................................................
/* Create and populate universal set (table)
   Set the number of rows (cardinality of B) in the line...
    CALL mkset(rows);
*/
DELIMITER //
DROP TABLE IF EXISTS combo;
CREATE TABLE combo(id int primary key auto_increment);
DROP PROCEDURE IF EXISTS mkset;
CREATE PROCEDURE mkset(members INT)
BEGIN
SET @membs=members;
        REPEAT
                INSERT INTO combo SET id=NULL;
                SET @membs=@membs - 1;
        UNTIL @membs<=0 END REPEAT;
END
//
CALL mkset(36);
SELECT COUNT(*) as members FROM combo;

Now, for the actual query code that generates our combinations...

Our desired subsets A, are a simple restriction of the cartesian product of the set B with itself, k times. The restriction is that each successive product term only includes higher valued rows than the previous product term.

Code:

A = B X B X B X B X B X B X B X B
Where X represents restricted product operator

A simple JOIN produces the desired cartesian product and we will express the restriction within an ON(...) condition for each JOIN, or product. (You can do the same with a WHERE clause instead of the JOIN...ON() syntax, but for MariaDB (MySQL) the optimizer produces identical execution times and I prefer the JOIN syntax.)

So our query will look like this...

Code:

Save SQL below to a file, combo.sql then from mysql client with database seleted:
\. combo.sql
............................................................
SELECT
COUNT(*) as Combinations
-- OR uncomment 'k' columns below to see result rows
-- a.id, b.id, c.id, d.id, e.id, f.id, g.id, h.id
-- 'k' is the number of products, or JOINS
FROM combo a
JOIN combo b ON(b.id>a.id)
JOIN combo c ON(c.id>b.id)
JOIN combo d ON(d.id>c.id)
JOIN combo e ON(e.id>d.id)
JOIN combo f ON(f.id>e.id)
JOIN combo g ON(g.id>f.id)
JOIN combo h ON(h.id>g.id)
;

How does this compare to other languages in terms of execution speed?

Code:

MariaDB [test]> \. combo.sql
+--------------+
| Combinations |
+--------------+
|     30260340 |
+--------------+
1 row in set (4 min 7.89 sec)

Not a speed demon, but for a product set of 30-million-plus rows on this machine... not bad either (as long as you don't print them!).

But as an implementation of the original problem in a different language, it is what it is!

As time permits I may try a PostgreSQL and maybe an SQLite versionfor meaningful speed comparisons - and maybe try to improve the speed.

norobro · 10-20-2015, 01:16 PM

Thanks. That's very interesting! I would have never thought of using SQL and stored procedures. Of course my knowledge of DB's and SQL is very basic.

SQL results on my machine:

Code:

mysql> \. combo.sql
+--------------+
| Combinations |
+--------------+
|     30260340 |
+--------------+
1 row in set (1 min 36.27 sec)

Both Python and Perl provide modules for finding permutations/combinations. I'm not very proficient in either language but here are a couple of simple programs and their execution times for comparison:

Code:

#!/usr/bin/env perl

use strict; use warnings;
use Algorithm::Combinatorics qw(combinations);

my $start = time();

my @n = (1 .. 36);
my $k = 8;
my $iter = combinations(\@n, $k);
my $count = 0;
while (my $c = $iter->next) {
#        print "@$c\n";
    $count = $count +1;
}

my $end = time();
my $exec_time = $end - $start;
print "$count combinations in $exec_time seconds\n";

Execution time: 49 secs.

Code:

#!/usr/bin/python
import itertools 
import time

start = time.time()

iterable = range(1,37)
r = 8
y = 0;

for combo in itertools.combinations(iterable, r):
    #print combo
    y=y+1

print y," combinations in ",(time.time() - start)," seconds\n";

Execution time: 4.533 secs.

astrogeek · 10-22-2015, 03:17 PM

Quote:

Originally Posted by norobro

Thanks. That's very interesting! I would have never thought of using SQL and stored procedures. Of course my knowledge of DB's and SQL is very basic.

SQL results on my machine:

Code:

mysql> \. combo.sql
+--------------+
| Combinations |
+--------------+
|     30260340 |
+--------------+
1 row in set (1 min 36.27 sec)

Thanks for the test run - I kind of thought it unlikely that anyone would actually run it as it involved a little setup.

The stored procedure was only a convenience for easily re-populating the table and is not involved in getting the result set.

I just tried it on PostgreSQL, same machine, same query file, table structure as follows:

Code:

CREATE TABLE combo(id SERIAL);
INSERT INTO combo (id) VALUES(nextval('combo_id_seq')),... 36 times...;
...
# time psql testdb <combo.sql
 combinations
--------------
     30260340
(1 row)

real    1m19.934s
user    0m0.011s
sys     0m0.006s

That vs 4 min 7.89 sec for MariaDB/MySQL (although - the MySQL server on this machine has other users and >20 databases with ~50GB of data in filesystem. PostgreSQL is fairly quiescent).

But what is really impressive as an SQL implementation is SQLite...

Code:

sqlite3 dc-sqlite.db
CREATE TABLE combo(id INTEGER PRIMARY KEY AUTOINCREMENT);
INSERT INTO combo (id) VALUES(NULL),... 36 times ...;
...
# time sqlite3 -header dc-sqlite.db <combo.sql
Combinations
30260340

real    0m7.344s
user    0m7.255s
sys     0m0.007s

...which beats the simple, fast PHP script on same machine.