Numeral string compression
There is this program that generates strings. I need to send these strings to other people, but they are so long.
Here is an example string:
6611881297804308126094135724381397121239011470217068827009544309078855190248685589492717744870438167 8813095694817950976715822649126252058466224169127600131869666810172530132873487253210013425115180428 5698251720291266302608570678775377908310708015293727529645935483754350886509110702614852135582194037 5845662725081900047695885919860265327310
The string of numbers will always be this long (340 characters) and will always consist of digits only. Is there some 'converter' I can make with Bash that will compress the string of numbers so that it can be decompressed later? This seems impossible, but I want the result to be less than a quarter of the original length. Is this possible?
You need log(10^340)/log(2) ≃ 1130 bits ≃ 142 bytes to represent a number with 340 decimal digits. Thus, converting the number to binary will reduce the size to about 42% of the original.
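As a rough check of that arithmetic, here is a minimal sketch that computes the bit and byte counts. It assumes awk is available for the logarithms (Bash itself only does integer arithmetic); the function names bits_needed and bytes_needed are made up for this illustration.

```bash
#!/bin/bash

# bits_needed DIGITS
# Print the smallest whole number of bits that can hold any decimal
# number with DIGITS digits, i.e. ceil(DIGITS * log(10) / log(2)).
bits_needed() {
    awk -v d="$1" 'BEGIN {
        b = d * log(10) / log(2)   # exact (fractional) bit count
        n = int(b)
        if (n < b) n++             # round up to a whole bit
        print n
    }'
}

# bytes_needed DIGITS
# Same, rounded up to whole 8-bit bytes.
bytes_needed() {
    echo $(( ($(bits_needed "$1") + 7) / 8 ))
}
```

Running bits_needed 340 and bytes_needed 340 gives 1130 and 142, and 142/340 is roughly 0.42.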
A trivial conversion scheme can save each group of nine digits into a 30-bit unsigned int, yielding binary data about 45% of the original size. With a bit of trickery (to avoid using binary zero bytes) you can both read and write the data using only Bash. I can show you how this is done, but remember, it only halves the size, and the output is binary. (If sending via e-mail, the necessary base64 encoding will expand the binary data by 33%, so the data in the e-mail will take about 60% of the original file size.)
If you need to transfer the numbers in text files (as opposed to binary), then you can pack each set of nine digits into five Base64 characters (letters A-Z and a-z, digits 0-9, and two other characters, usually + and /). This is also possible to do in Bash alone, and I can show how it is done. It yields a string 56% of the original length (190 characters for a 340-digit number), so it almost halves the original size. With a bit of care, you can allow the string to be split into multiple pieces, and even quoted, and still recover the original data without any extra effort. (Very useful when used in e-mails.)
If you need to fit the data into less space, you need an actual compression algorithm. Wherever you have Bash installed, you almost certainly also have gzip installed. You could see if compressing your data using gzip -9 yields a small enough result. It is not only just about the easiest option, but I believe the correct one for your use case.
It is definitely possible to implement a compression/decompression algorithm (similar to gzip) in Bash, but it seems like an infeasible amount of effort for very little gain. There are no guarantees you can reach 25% compression for any input (actually, there are guarantees for the opposite: there is always some input that fails to compress), and you most likely have to compress a lot of 340-digit numbers in one pack to get four-to-one compression rates in practice.
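The size figures quoted above can be reproduced with integer arithmetic alone. The sketch below is only an illustration: the function name packed_sizes is hypothetical, and it assumes each 30-bit group is stored in a four-byte word (which is what makes the binary form come out near 45%) and that base64 line breaks are not counted.

```bash
#!/bin/bash

# packed_sizes DIGITS
# Print the size of a DIGITS-digit number under the three schemes
# discussed above. All divisions round up.
packed_sizes() {
    local digits=$1
    local groups=$(( (digits + 8) / 9 ))      # nine decimal digits per group
    local binary=$(( groups * 4 ))            # assumption: one 4-byte word per group
    local mailed=$(( (binary + 2) / 3 * 4 ))  # base64: 4 characters per 3 bytes
    local text=$(( groups * 5 ))              # five 6-bit characters per group
    printf 'binary=%d base64-of-binary=%d text=%d\n' "$binary" "$mailed" "$text"
}
```

For 340 digits this prints binary=152 base64-of-binary=204 text=190, that is about 45%, 60% and 56% of the original length.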
Gzip works great! But you guessed right, it is something that has to be sent over email, and gzip won't work for a 'send over email' solution.
Even 60% of the original length is better than nothing. So what is the best option for making a text string that can be sent over email and uncompressed later?
If you also have base64 available, you can first compress the numbers using gzip, then pipe the output to base64:
Code:
program.. | gzip -nc9 | base64
Code:
sed-or-something | base64 -d | gzip -dc
Code:
#!/bin/bash
Because the above packer prepends List:/ before the first number, / between numbers, and /:Ends after the last number, the corresponding unpacker can take an entire e-mail message (that contains exactly one number list) either from standard input or from named files, and produce just the original numbers. It works correctly even if the data was quoted using > or | from another message:
Code:
#!/bin/bash
The compressed data is always one or more groups of five characters. Each group is parsed as a big-endian 30-bit unsigned integer. The last (or only) group is special: the decimal digit corresponding to hundreds of millions describes the length of the final digit group; the final group therefore contains zero to eight decimal digits. All preceding groups contain nine decimal digits.
Given only your original 340-digit number, the packing script will output
Code:
List:/dQEuXkX6tSaKkd4Q7Nj07OfJR1IpY2fnxS-s7JlF1UtREubBpUT1xBek-U
Note that you can pack any number of numbers in a single e-mail, as a list. For example, numbers 0 00 847436 5734873468576 35487653487634856736548 34953685364836498536854658735487346874535 3485734658475638457634857458347563856 pack into
Code:
List:/5zU40/Bwy80/Zq3HC/YBhZoNrwM0/L9lnsqF0INTpVF4/KrO9LcfKEvLzv
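The five-character group format described above can be sketched in Bash roughly as follows. This is not the script from the post, only an illustration of the same idea. The function names (encode_group, pack, unpack) are made up; the alphabet order 0-9, A-Z, a-z and then - is inferred from the sample output, and the final character _ is a pure assumption. The List:/ ... /:Ends framing and the handling of quoted e-mail text are left out.

```bash
#!/bin/bash

# 64-character alphabet. Indexes 0-62 are inferred from the sample output;
# the last character ("_") is an assumption.
ALPHABET='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_'

# encode_group VALUE
# Print a value below 2^30 as five characters, most significant first.
encode_group() {
    local v=$1 out='' i idx
    for i in 1 2 3 4 5; do
        idx=$(( v & 63 ))
        out="${ALPHABET:idx:1}$out"
        v=$(( v >> 6 ))
    done
    printf '%s' "$out"
}

# pack DIGITS
# Pack one string of decimal digits. Full groups hold nine digits; the
# final group holds 0-8 digits plus its length in the hundreds-of-millions
# digit. The 10# prefix stops Bash from reading leading zeros as octal.
pack() {
    local digits=$1 out=''
    while (( ${#digits} >= 9 )); do
        out+=$(encode_group $(( 10#${digits:0:9} )))
        digits=${digits:9}
    done
    out+=$(encode_group $(( ${#digits} * 100000000 + 10#${digits:-0} )))
    printf '%s\n' "$out"
}

# unpack TEXT
# Reverse of pack. Input is not validated: characters outside the
# alphabet give wrong results instead of an error.
unpack() {
    local text=$1 out='' group v i c rest len
    while [[ -n $text ]]; do
        group=${text:0:5}
        text=${text:5}
        v=0
        for (( i = 0; i < 5; i++ )); do
            c=${group:i:1}
            rest=${ALPHABET%%"$c"*}       # alphabet characters before c
            v=$(( v * 64 + ${#rest} ))    # their count is the index of c
        done
        if [[ -n $text ]]; then
            printf -v group '%09d' "$v"   # full group: exactly nine digits
            out+=$group
        else
            len=$(( v / 100000000 ))      # final group: length digit first
            if (( len > 0 )); then
                printf -v group '%0*d' "$len" $(( v % 100000000 ))
                out+=$group
            fi
        fi
    done
    printf '%s\n' "$out"
}
```

With this alphabet, pack 0, pack 00 and pack 847436 print 5zU40, Bwy80 and Zq3HC, which agree with the first three entries of the example list, and a 340-digit input packs into 190 characters.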
This works. Thanks for your help :)