LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Numeral string compresion (http://www.linuxquestions.org/questions/programming-9/numeral-string-compresion-4175412355/)

dth4h 06-19-2012 07:26 PM

Numeral string compresion
 
There is this program that generates strings. I need to send these strings to other people, but they are so long.

Here is an example string:
6611881297804308126094135724381397121239011470217068827009544309078855190248685589492717744870438167 8813095694817950976715822649126252058466224169127600131869666810172530132873487253210013425115180428 5698251720291266302608570678775377908310708015293727529645935483754350886509110702614852135582194037 5845662725081900047695885919860265327310

The string of numbers will always be this long (340 characters) and they will always be numbers.


Is there some 'converter' I can make with bash that I can make that will compress the string of numbers and then it can be decompressed later?

This seems impossible but I want it to be less then a quarter of the length.

Is this possible?

Nominal Animal 06-19-2012 09:41 PM

You need log(10^340)/log(2) ≃ 1130 bits ≃ 142 bytes to represent a number with 340 decimal digits. Thus, converting the number to binary will reduce the size down to 42% from original.

A trivial conversion scheme can save each group of nine digits into a 30-bit unsigned int, yielding binary data about 45% of the original size. With a bit of trickery (to avoid using binary zero bytes) you can both read and write the data using only Bash. I can show you how this is done, but remember, it only halves the size -- and the output is binary. (If sending via e-mail, the base64 encoding necessary will expand the binary data by 33%, so the data in e-mail will take about 60% of the original file size.)

If you need to transfer the numbers in text files (as opposed to binary), then you can pack each set of nine digits into five Base64 characters (letters A-Z and a-z, digits 0-9, and two other characters, usually + and /). This is also possible to do in Bash alone, and I can show how it is done. It yields a string 56% of the original length (190 characters for a 340-digit number), so almost halves the original size. With a bit of care, you can allow the string to be split into multiple pieces, and even quoted, and still recover the original data without any extra effort. (Very useful when used in e-mails.)

If you need to fit the data into less space, you need an actual compression algorithm. Wherever you have Bash installed, you almost certainly have also gzip installed. You could see if compressing your data using gzip -9 yields a small enough result. It is not only just about the easiest option, but I believe the correct one for your use case.

It is definitely possible to implement a compression/decompression algorithm (similar to gzip) in Bash, but it seems like an unfeasible amount of effort for very little gain. There are no guarantees you can reach 25% compression for any input (actually, there are guarantees for the opposite -- there is always some input that fails to compress), and you most likely have to compress a lot of 340-digit numbers in one pack to get four-to-one compression rates in practice.

dth4h 06-19-2012 10:11 PM

Gzip works great! But you guessed right, it is something that has to be sent over email. And gzip wont work for a 'send over email' solution.
Even 60% of the original length is better than nothing.

So what is the best option for making a text string that can be sent over email that can be uncompressed later?

Nominal Animal 06-20-2012 01:28 AM

If you have also base64 available, you can first compress the numbers using gzip, then pipe the output to base64:
Code:

program.. | gzip -nc9 | base64
At the other end, you can recover the numbers using the reverse, i.e.
Code:

sed-or-something | base64 -d | gzip -dc
If you find yourself truly limited to bash, here is a packer script you could try. It reads in the numbers from standard input (although you can supply test numbers as command-line parameters), and outputs them as a packed list suitable to be included directly in an e-mail:
Code:

#!/bin/bash
export LANG=C LC_ALL=C

charset="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_"

Pack () {
    local string="${*//[^0-9]/}" value=0

    while [ ${#string} -ge 9 ]; do
        value=$(( 1${string:0:9} - 1000000000 ))
        string="${string:9}"
        printf '%s' "${charset:((value / 16777216) & 63):1}"
        printf '%s' "${charset:((value / 262144) & 63):1}"
        printf '%s' "${charset:((value / 4096) & 63):1}"
        printf '%s' "${charset:((value / 64) & 63):1}"
        printf '%s' "${charset:(value & 63):1}"
    done

    case ${#string} in
    8) value=$(( 8$string )) ;;
    7) value=$(( 70$string )) ;;
    6) value=$(( 600$string )) ;;
    5) value=$(( 5000$string )) ;;
    4) value=$(( 40000$string )) ;;
    3) value=$(( 300000$string )) ;;
    2) value=$(( 2000000$string )) ;;
    1) value=$(( 10000000$string )) ;;
    *) value=0 ;;
    esac

    printf '%s' "${charset:((value / 16777216) & 63):1}"
    printf '%s' "${charset:((value / 262144) & 63):1}"
    printf '%s' "${charset:((value / 4096) & 63):1}"
    printf '%s' "${charset:((value / 64) & 63):1}"
    printf '%s' "${charset:(value & 63):1}"

    printf '/'

    return 0
}

(  printf 'List:/'

    if [ $# -gt 0 ]; then
        for number in "$@" ; do
            Pack "$number" || exit $?
        done
    else
        while read number ; do
            Pack "$number" || exit $?
        done
    fi
    printf ':Ends'

) | while [ 1 ]; do

    line=""
    read -rd "" -N 64 line || [ -n "$line" ] || break
    printf '%s\n' "$line"

done

The packer will only put 64 characters per line. That way the lines won't be split even if quoted in a reply message.

Because the above packer prepends List:/ before the first number, / between numbers, and /:Ends after the last number, the corresponding unpacker can take an entire e-mail message (that contain exactly one number list) either from standard input or from named files, and produce just the original numbers. It works correctly even if the data was quoted using > or | from another message:
Code:

#!/bin/bash
export LANG=C LC_ALL=C
IFS="/"

Unpack5 () {
    local string="$*" value=0 char=""
    for char in "${string:0:1}" "${string:1:1}" "${string:2:1}" "${string:3:1}" "${string:4:1}" ; do
        case "$char" in
        0) value=$(( 64 * value +  0 )) ;;
        1) value=$(( 64 * value +  1 )) ;;
        2) value=$(( 64 * value +  2 )) ;;
        3) value=$(( 64 * value +  3 )) ;;
        4) value=$(( 64 * value +  4 )) ;;
        5) value=$(( 64 * value +  5 )) ;;
        6) value=$(( 64 * value +  6 )) ;;
        7) value=$(( 64 * value +  7 )) ;;
        8) value=$(( 64 * value +  8 )) ;;
        9) value=$(( 64 * value +  9 )) ;;
        A) value=$(( 64 * value + 10 )) ;;
        B) value=$(( 64 * value + 11 )) ;;
        C) value=$(( 64 * value + 12 )) ;;
        D) value=$(( 64 * value + 13 )) ;;
        E) value=$(( 64 * value + 14 )) ;;
        F) value=$(( 64 * value + 15 )) ;;
        G) value=$(( 64 * value + 16 )) ;;
        H) value=$(( 64 * value + 17 )) ;;
        I) value=$(( 64 * value + 18 )) ;;
        J) value=$(( 64 * value + 19 )) ;;
        K) value=$(( 64 * value + 20 )) ;;
        L) value=$(( 64 * value + 21 )) ;;
        M) value=$(( 64 * value + 22 )) ;;
        N) value=$(( 64 * value + 23 )) ;;
        O) value=$(( 64 * value + 24 )) ;;
        P) value=$(( 64 * value + 25 )) ;;
        Q) value=$(( 64 * value + 26 )) ;;
        R) value=$(( 64 * value + 27 )) ;;
        S) value=$(( 64 * value + 28 )) ;;
        T) value=$(( 64 * value + 29 )) ;;
        U) value=$(( 64 * value + 30 )) ;;
        V) value=$(( 64 * value + 31 )) ;;
        W) value=$(( 64 * value + 32 )) ;;
        X) value=$(( 64 * value + 33 )) ;;
        Y) value=$(( 64 * value + 34 )) ;;
        Z) value=$(( 64 * value + 35 )) ;;
        a) value=$(( 64 * value + 36 )) ;;
        b) value=$(( 64 * value + 37 )) ;;
        c) value=$(( 64 * value + 38 )) ;;
        d) value=$(( 64 * value + 39 )) ;;
        e) value=$(( 64 * value + 40 )) ;;
        f) value=$(( 64 * value + 41 )) ;;
        g) value=$(( 64 * value + 42 )) ;;
        h) value=$(( 64 * value + 43 )) ;;
        i) value=$(( 64 * value + 44 )) ;;
        j) value=$(( 64 * value + 45 )) ;;
        k) value=$(( 64 * value + 46 )) ;;
        l) value=$(( 64 * value + 47 )) ;;
        m) value=$(( 64 * value + 48 )) ;;
        n) value=$(( 64 * value + 49 )) ;;
        o) value=$(( 64 * value + 50 )) ;;
        p) value=$(( 64 * value + 51 )) ;;
        q) value=$(( 64 * value + 52 )) ;;
        r) value=$(( 64 * value + 53 )) ;;
        s) value=$(( 64 * value + 54 )) ;;
        t) value=$(( 64 * value + 55 )) ;;
        u) value=$(( 64 * value + 56 )) ;;
        v) value=$(( 64 * value + 57 )) ;;
        w) value=$(( 64 * value + 58 )) ;;
        x) value=$(( 64 * value + 59 )) ;;
        y) value=$(( 64 * value + 60 )) ;;
        z) value=$(( 64 * value + 61 )) ;;
        -) value=$(( 64 * value + 62 )) ;;
        _) value=$(( 64 * value + 63 )) ;;
        *) return 1 ;;
        esac
    done

    printf '%09d' $value
    return 0
}

data=""
if [ $# -gt 0 ]; then
    for file in "$@" ; do
        while read string ; do
            data="$data${string//[^-_\/:0-9A-Za-z]/}"
        done < "$file"
    done
else
    while read string ; do
        data="$data${string//[^-_\/:0-9A-Za-z]/}"
    done
fi

data="${data##*List:/}"
data="${data%%/:Ends*}"
for item in ${data//:/} ; do

    while [ ${#item} -gt 5 ]; do
        Unpack5 "${item:0:5}" || exit $?
        item="${item:5}"
    done

    value=$(Unpack5 "$item")
    value=$(( 1$value - 1000000000 ))

    if  [ $value -ge 900000000 ]; then
        printf 'value = %d\n' $value >&2
        exit 1
    elif [ $value -ge 800000000 ]; then
        printf '%08d\n' $((value - 800000000))
    elif [ $value -ge 700000000 ]; then
        printf '%07d\n' $((value - 700000000))
    elif [ $value -ge 600000000 ]; then
        printf '%06d\n' $((value - 600000000))
    elif [ $value -ge 500000000 ]; then
        printf '%05d\n' $((value - 500000000))
    elif [ $value -ge 400000000 ]; then
        printf '%04d\n' $((value - 400000000))
    elif [ $value -ge 300000000 ]; then
        printf '%03d\n' $((value - 300000000))
    elif [ $value -ge 200000000 ]; then
        printf '%02d\n' $((value - 200000000))
    elif [ $value -ge 100000000 ]; then
        printf '%01d\n' $((value - 100000000))
    elif [ $value -gt 0 ]; then
        printf 'value = %d\n' $value >&2
        exit 1
    else
        printf '\n'
    fi
done

Note that I just wrote the two scripts, and have only lightly tested them. You should test them thoroughly before production use. Both scripts rely on Bash only (version 3.0 or later, I believe; I tried to avoid the version 4 features).

The compressed data is always one or more groups of five characters. Each group is parsed as high-endian 30-bit unsigned integer. The last (or only) group is special: the decimal digit corresponding to hundreds of millions describes the length of the final digit group; the final group therefore contains zero to eight decimal digits. All preceding groups contain nine decimal digits.

Given only your original 340-digit number, the packing script will output
Code:

List:/dQEuXkX6tSaKkd4Q7Nj07OfJR1IpY2fnxS-s7JlF1UtREubBpUT1xBek-U
oCuX18rjkUVITI18uYsJpXfD0ZXI_kpS643HrJc1YOVjpGcE-4MyH1sZBgxiN2I2
N2weTKfHs8xYq0wLqVVaS5lSrO5kqrmusfuGr485D7o2FNwsGFYKU0ICSqZIFrgg
2d3E/:Ends

which the unpacking script recovers correctly back to the original number.

Note that you can pack any number of numbers in a single e-mail, as a list. For example, numbers 0 00 847436 5734873468576 35487653487634856736548 34953685364836498536854658735487346874535 3485734658475638457634857458347563856 pack into
Code:

List:/5zU40/Bwy80/Zq3HC/YBhZoNrwM0/L9lnsqF0INTpVF4/KrO9LcfKEvLzv
CRL9l1yTpeWd/KniyPoXCn5jWTunnmLyX5zU46/:Ends

which produces the original strings when unpacked.

dth4h 06-21-2012 09:17 PM

This works. Thanks for you help :)


All times are GMT -5. The time now is 11:42 PM.