[SOLVED] strings combinations...

masavini · 05-15-2012, 04:47 PM

hi,
i have a file with list of strings like this:
apple
pie
cake

i have to calculate all of the possible combinations of strings (string order does not matter):

apple
pie
cake
apple pie
apple cake
pie cake
apple pie cake

can you help me getting it with bash?
thanks...

colucix · 05-15-2012, 07:01 PM

Your problem is to compute the permutations of N elements without repetitions. I cannot think a simple solution, right now. However, it can be done by means of a recursive function. Is bash mandatory or can you accept other solutions? Here is a link that might provide some inspiration, for now.

masavini · 05-15-2012, 07:07 PM

i knew about that post, but it didn't help me...
permutations contains a lot of redundant output, since inner strings order matters...

i need this code for a bash script, but i guess other languages (maybe perl?) should be fine...

thank you anyway...

Nominal Animal · 05-15-2012, 10:02 PM

Hmm.. if by "string order does not matter" you mean that you are only concered with substring existence and not order, that you only want one occurrence of either "apple pie" or "pie apple", then the solution is rather simple for up to 63 components (31 on 32-bit architectures):

Code:

combine() {
    local limit=$[ 1 << $# ]
    local args=("$@")
    for ((value = 1; value < limit; value++)); do
        local parts=()
        for ((i = 0; i < $#; i++)); do
            [ $[(1<<i) & value] -ne 0 ] && parts[${#parts[@]}]="${args[i]}"
        done
        echo "${parts[@]}"
    done
}

This works because each combination where only the substring existence matters maps perfectly to a substring-count-digit positive binary integer -- remember, each substring is either included or not included in each combination. (Zero maps to the empty string -- no substrings at all --, and is therefore uninteresting.)

The above simply loops value over all those positive integers, then constructs the string by letting each binary digit in value select whether a substring is included (1) or not included (0).

Right now it relies on Bash integer variables, and is therefore limited to 31 or 32 substrings on 32-bit architectures, and 63 or 64 substrings on 64-bit architectures.

Output of combine apple pie cake | sort:

Code:

apple
apple cake
apple pie
apple pie cake
cake
pie
pie cake

Output of combine 1 2 3 4:

Code:

masavini · 05-16-2012, 04:40 AM

what can i say..? just perfect!
i've been writing bash scripts for about 3 years, but i can't understand even half of the code you posted... i can just see it works!
thank you VERY much...

Nominal Animal · 05-16-2012, 04:12 PM

Quote:

Originally Posted by masavini

i can't understand even half of the code you posted...

Let's see if we can remedy that. Understanding is much more important than just having working code, in my opinion.

The code defines a Bash function. A function is similar to a external script, except that it is run within the same Bash shell.

Let's start:

Code:

combine() {
    local limit=$[ 1 << $# ]
    local args=("$@")
    local value

Above, we define a local variable limit, and set it the value (two to the power of the number of parameters to this function). If you have three parameters, the value will be 2³ = 8. Note that this is by definition the number of combinations, including the empty string. (Therefore, we will output limit-1 combinations.)

In general, the left shift operator a<<n is mathematically equivalent to a×2ⁿ for all integral nonnegative n and all integer nonnegative a. It simply moves the binary digits left, adding zeros to the right. (It does operate the same way with negative numbers, with a couple of quirks, but that's not important here, since we only deal with positive numbers and zero -- nonnegative integers.)

The above also defines an array args containing all the parameters to the function. Note that the first parameter will be ${args[0]} . This is important because bits are counted from 0 upwards: the lowest bit is bit 0 (and not bit 1). (Remember, 2⁰=1, 2¹=1, 2²=4, 2³=8, and so on.)

Note: I forgot the local value definition from the original script. It is not really necessary for proper working, but without it, the value variable will be visible in the caller! If the caller also uses a value variable, without this line the value in the caller will be modified (to limit).

Next, to generate each combination, we need to loop over all positive integers from 1 to limit, not including limit itself, using value as the loop variable:

Code:

    for ((value = 1; value < limit; value++)); do

Next, we need to loop over each bit -- each binary digit -- in value . First, though, we'll clear the parts array, which is used to hold the substrings for the current string (the combination that matches the binary representation of variable value ):

Code:

        local parts=()
        for ((i = 0; i < $#; i++)); do

The following line is the tricky bit. Its logic is simple: If bit i in value is set, then append the corresponding substring to the parts array.

Code:

            [ $[(1<<i) & value] -ne 0 ] && parts[${#parts[@]}]="${args[i]}"

Above, $[ (1 << i) & value ] evaluates to nonzero if and only if bit i in value is set, and to zero otherwise. (The value it evaluates to is 2ⁱ, but who cares: it makes much more sense to compare against zero.)

The expression parts[${#parts[@]}]=foo is the fastest way to append foo to the array parts. It is equivalent to parts=("${parts[@]}" foo) if the existing indices in parts run consecutively starting from zero. (Expression ${#parts[@]} evaluates to the number of elements already in the parts array.)

After the end of the inner loop, we have the combination in array parts , so we just echo it:

Code:

        done
        echo "${parts[@]}"
    done
}

masavini · 05-16-2012, 07:30 PM

ok, not that bad...
i can't understand how the expression $[ 1 << $# ] could mean "two to the power of the number of parameters to this function", but that's a detail...
i got lost in the "tricky bit" as well, but i guess if i didn't it wouldn't have been "tricky"...

but i got the rest, and i thank you so much for explaining...

Nominal Animal · 05-16-2012, 10:53 PM

Quote:

Originally Posted by masavini

i can't understand how the expression $[ 1 << $# ] could mean "two to the power of the number of parameters to this function"

Well, $# tells you the number of positional parameters. In a function, it is the number of parameters given to the funtion.

The $[ expression ] is the traditional Bash way to calculate arithmetic expressions.

A << B is an arithmetic expression, "shift A left B bits". Therefore, 1 << N always evaluates to two to the power of N. That is just basic binary arithmetic: in decimal systems, you can multiply by ten by adding zero to the right, but in binary systems, you multiply by only two when you add a zero to the right.

Quote:

Originally Posted by masavini

i got lost in the "tricky bit" as well

A & B is an arithmetic expression, "A AND B", where the AND is a binary operator.

(1 << N) & B is an arithmetic expression which means "B and the bit N". It is a very common way to test if the bit N is set in B . The only thing to remember about that is that the first (lowest) bit is numbered 0, not 1.

The && is a logical operator which can be read as "and, if that succeeded, then". If its left side is true, then and only then the right side is evaluated. If its left side is false, then the right side is ignored.

If you are unfamiliar with bits and boolean algebra, I suggest you take a look at the Wikipedia Bit, Binary numeral system, and Elementary Boolean algebra articles.

In the articles, ∧ is binary AND, written & in Bash. Binary OR is written as ∨ in the articles, | in Bash. If I ever discover a new operator, I shall name it 🐱 or maybe ̐😲 .

David the H. · 05-18-2012, 07:10 AM

Quote:

Originally Posted by Nominal Animal

The $[ expression ] is the traditional Bash way to calculate arithmetic expressions.

Correction: $[..] is the old, deprecated way to calculate arithmetic expressions, and it's not really a good idea to use it any more. While it is still supported for now, it has, for the longest time, been supplanted by $((..)). The newer version also has the benefit of being posix-compliant.

And if you don't need the value to be expanded, bash also provides ((..)), which only evaluates the expression and spits out a success/failure exit code, such as used in the c-style for loop.

Finally, you can more cleanly append values to an array using "array+=( value )". Also, the -a option should really be used when declaring one.

Therefore the function should more properly be written like this:

Code:

combine() {

	local limit=$(( 1 << $# ))
	local -a parts args=( "$@" )

	for (( value = 1; value < limit; value++ )); do

		parts=()

		for (( i = 0; i < $#; i++ )); do

			(( ( 1 << i ) & value )) && parts+=( "${args[i]}" )

		done

		echo "${parts[@]}"

	done
}

I can't say yet that I fully understand the use of the bitwise operators, however.

Nominal Animal · 05-18-2012, 10:13 AM

Quote:

Originally Posted by David the H.

Correction: $[..] is the old, deprecated way to calculate arithmetic expressions, and it's not really a good idea to use it any more. While it is still supported for now, it has, for the longest time, been supplanted by $((..)). The newer version also has the benefit of being posix-compliant.

I had not noticed POSIX shells (including dash) support $((...)) arithmetic expressions.

I have accounts on machines with only old versions of Bash available, and unfortunately the Bash Reference Manual neglects to describe the version required for each feature. This is why I recommend against switching to every new Bash features, but wait for them to be standardized, or at least stabilized. It is very annoying when you have to work with older versions of Bash occasionally.

David the H. is absolutely correct: $(( ... )) is the standard way for evaluating arithmetic expressions. Thanks!

masavini · 05-18-2012, 10:24 AM

Quote:

Originally Posted by David the H.

I can't say yet that I fully understand the use of the bitwise operators, however.

Nominal Animal · 05-18-2012, 11:08 AM

Quote:

Originally Posted by David the H.

I can't say yet that I fully understand the use of the bitwise operators, however.

I think an in-depth explanation may be in order: if you didn't follow the logic, then I don't think many other members did either.

Let's start with the initial set of words: apple pie cake. Create a table of all the combinations. Put the first word rightmost, because we are used to putting the least significant digit to the right.

Code:

 cake? │ pie? │ apple? ║ numeric ║ combination
═══════╪══════╪════════╬═════════╬═════════════
   0   │   0  │    0   ║ 000 = 0 ║
   0   │   0  │    1   ║ 001 = 1 ║ apple
   0   │   1  │    0   ║ 010 = 2 ║ pie
   0   │   1  │    1   ║ 011 = 3 ║ apple pie
   1   │   0  │    0   ║ 100 = 4 ║ cake
   1   │   0  │    1   ║ 101 = 5 ║ apple cake
   1   │   1  │    0   ║ 110 = 6 ║ pie cake
   1   │   1  │    1   ║ 111 = 7 ║ apple pie cake

From the above, we can already tell that whenever we add a new word, the table will double in size. Therefore, the size of the table is two multiplied by itself the number of words times, i.e. two to the power of the number of words.

As you can see, you can use as many words as you have bits (binary digits), and it will work exactly the same.

Now consider the bit shift operation. If we write binary numbers as say 100₂ = 4, 11₂ = 3, 1₂ = 1, and so on, then the bit shift operation is easy to understand:

Code:

1₂ << 0 = 1₂     = 1  = 2⁰
1₂ << 1 = 10₂    = 2  = 2¹
1₂ << 2 = 100₂   = 4  = 2²
1₂ << 3 = 1000₂  = 8  = 2³
1₂ << 4 = 10000₂ = 16 = 2⁴

Therefore, $(( 1 << i )) is the same as 2ⁱ. In binary, it has only one 1, at the i'th digit (if you consider 0 as the first digit).

Also, you can see from the table size that for i words, you have 2ⁱ combinations (including the empty combination at zero).

The obvious approach then is to have an outer loop, looping from 1 (to skip the empty combination at zero) to 2ⁱ less one (for i words), and generate each combination based on the loop counter value. (The combination for 2ⁱ would be just a new word; check the table!)

The logic is very simple: If the first binary digit is 1, "apple" is part of the combination. If the second bit is 1, "pie" is part of the combination. If the third bit is 1, "cake" is part of the combination. And so on.

In binary arithmetic, value & mask is the binary AND operation. The result will have 1 in each place where both value and mask are 1, and 0 elsewhere. The order does not matter, both sides are treated equally: mask & value yields always the same value as value & mask.

We can use the bit shift operation for picking out a single digit. Remember, $(( 1 << bit )) is a binary number with only one 1, at the bit position (starting from zero).

Therefore, $(( value & ( 1 << bit ) )) evaluates to zero if bit bit is zero, and to nonzero (two to the power of bit) otherwise.

Using a temporary array to hold the words helps because in Bash array indices start at 0. If you have word=(apple pie cake), then ${word[2]} is cake.

So. If value is the number that describes the combination we want, we loop over each bit ((i = 1; i < the-number-of-words; i++ )), and the ${word[i]} is part of the combination if and only if $(( value & (1 << i) )) evaluates to nonzero.

The textual explanation is quite dry and perhaps a bit difficult to follow, but if you switch back to looking at the table, it should help a lot. Hopefully you can see the box drawing characters correctly; it really should be a very nice table!

masavini · 05-18-2012, 11:27 AM

Quote:

Originally Posted by Nominal Animal

The logic is very simple: If the first binary digit is 1, "apple" is part of the combination. If the second bit is 1, "pie" is part of the combination. If the third bit is 1, "cake" is part of the combination. And so on.

got it!

David the H. · 05-20-2012, 09:56 AM

Thank you so much for the education, NA. The extra post did help to make it much more comprehensible.

To be honest, the real reason it wasn't clear was because I had only looked over your posts quickly at that point, and hadn't yet had the time to read through them carefully.

It wasn't until tonight (Sunday) that I had the time to really study them.

Actually, your posts here were quite serendipitous. I was just looking at the ARITHMETIC EVALUATION section of the bash man page the other day and wondering exactly what the bitwise mode operators listed there do. And while I (believe I) understand the basic concept now, trying to switch my mind over to thinking binary-wise is still a bit of headache for me, as is exactly how all the related operators operate. It's going to take some more research and practice to really get them down.

BTW, you can look here for a rundown of which features were introduced in which version of bash:

http://wiki.bash-hackers.org/scripting/bashchanges
http://tiswww.case.edu/php/chet/bash/NEWS

I agree that it's frustrating when the documentation doesn't clearly list when a feature was introduced. The gawk users guide suffers from the same weakness.