LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Numerical encoding of text, by position (http://www.linuxquestions.org/questions/programming-9/numerical-encoding-of-text-by-position-929768/)

danielbmartin 02-16-2012 03:55 PM

Numerical encoding of text, by position
 
I posted this question many months ago and received suggestions, but none of them was a "bull's eye" answer. I'm posting again, thinking that LQ now has more members and someone may have a flash of inspiration.

I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions. This is best explained by example:

people ==> 123152

Reading left to right:
p was first encountered at position 1 so it is encoded as 1.
e was first encountered at position 2 so it is encoded as 2.
o was first encountered at position 3 so it is encoded as 3.
p (again) was first encountered at position 1 so it is encoded as 1.
l was first encountered at position 5 so it is encoded as 5.
e (again) was first encountered at position 2 so it is encoded as 2.

More examples:
aardvark => 11345138
sense => 12312
committee => 123356688
position => 12345428


This is a first step toward a solution.
Code:

echo 'people' \
|tr 'elpoep' '123456789abcdef' \
|tr '654321' '123456789abcdef'

echo 'aardvark' \
|tr 'kravdraa' '123456789abcdef' \
|tr '87654321' '123456789abcdef'

This combination of successive tr commands works, but it has a shortcoming: the length of the character strings must equal the length of the word to be encoded. That shortcoming may be remedied by coding something like this:
Code:

echo 'people' \
|tr 'elpoep*********' '123456789abcdef' \
|tr '654321*********' '123456789abcdef'

echo 'aardvark' \
|tr 'kravdraa*******' '123456789abcdef' \
|tr '87654321*******' '123456789abcdef'

Good, that seems like part of the puzzle, but only part.

Using this sample input file ...
Code:

people
aardvark
error
infinite
nevermore

... and massaging it with this clumsy code ...
Code:

cat < $InFile    \
|sed -e "s/$/'/" \
|sed -e "s/^/***************/"  \
|rev \
|cut -c1-15 \
|sed -e "s/$/' '123456789abcdef'/"

... we get this interesting file ...
Code:

'elpoep********' '123456789abcdef'
'kravdraa******' '123456789abcdef'
'rorre*********' '123456789abcdef'
'etinifni******' '123456789abcdef'
'eromreven*****' '123456789abcdef'

Which might be another piece of the puzzle.

Now, what kind of "glue" should be used to put these pieces together? Is there an easier, better way?

There is an index command but my Ubuntu system seems to lack it. Is that some kind of downloadable extension?

Daniel B. Martin

Nominal Animal 02-16-2012 04:43 PM

Quote:

Originally Posted by danielbmartin (Post 4604402)
I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions.

How about
Code:

#!/usr/bin/awk -f

BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}

{
    split("", p)

    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (!(c in p))
            p[c] = i

        printf(" %d", p[c])
    }

    printf("\n")
}

If you save the above as a script, you can use it as a filter, or supply it with file names to process.
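For example, assuming the script is saved as numerical.awk (the name used in the example further below) and made executable, feeding it the five sample words from the first post (in $InFile) should produce something like this:
Code:

chmod +x numerical.awk
./numerical.awk "$InFile"

 1 2 3 1 5 2
 1 1 3 4 5 1 3 8
 1 2 2 4 2
 1 2 3 1 2 1 7 8
 1 2 3 2 5 6 7 5 2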

It ignores case, and prints a space before each number. If you don't want the space, just remove it from the printf(" %d",p[c]) statement. Or, if you want a hexadecimal encoding without spaces, use printf("%x",p[c]). You can also use a custom encoding, by adding
Code:

split("0 1 2 3 4 5 6 7 8 9 a b c d e f z x v", m, " ")
into the BEGIN rule, and using printf("%s",m[p[c]]) to print the encoded character position for each input character.
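Wired into the script, the custom mapping looks something like this (a sketch; here the list starts with "1" so that position 1 prints "1", matching the examples in the first post, since m[1] is the symbol printed for position 1):
Code:

#!/usr/bin/awk -f

BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
    # m[1] is the symbol printed for position 1, m[2] for position 2, and so on
    split("1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m", m, " ")
}

{
    split("", p)

    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (!(c in p))
            p[c] = i

        # positions beyond the list simply print nothing
        printf("%s", m[p[c]])
    }

    printf("\n")
}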

The BEGIN rule sets the record separator to any whitespace. This means you don't need to have each word on its own line; any whitespace is okay. Setting FS to an empty string means each character in a record will be a separate field.
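For example, two words separated only by a space are still treated as two records (using the same numerical.awk as above):
Code:

echo 'people aardvark' | ./numerical.awk

 1 2 3 1 5 2
 1 1 3 4 5 1 3 8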

The conversion is done in the second rule. The p array maps already-seen characters to their positions. The for loop goes over each character c, the input field (character) converted to lower case. If the character is not already in the p array, it is added to the array, with the value being the current position, of course. Finally, the printf statement prints the position the current character maps to. After the loop, the second printf adds a newline.

If you want it to ignore e.g. punctuation and digits, use GNU awk and
Code:

#!/usr/bin/gawk -f

BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}

{
    split("", p)

    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (c !~ /[[:punct:][:digit:]]/) {
            if (!(c in p))
                p[c] = i
            printf(" %d", p[c])
        }
    }

    printf("\n")
}

instead; the only change is the added test that skips punctuation and digits.

This latter implementation works very well with other languages, too, if you use a single-byte character encoding, or UTF-8. GNU awk correctly detects multibyte UTF-8 characters as single characters. The character classes are also locale-specific. For example, using LANG=fi_FI.UTF-8
Code:

echo '2: Älä lyö, ööliä läikkyy!' | ./numerical.awk

1 2 1
1 2 3
1 1 3 4 5
1 2 3 4 4 6 6

(The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew. Google Translate does not translate it correctly, but does pronounce it pretty much perfectly.)

danielbmartin 02-17-2012 11:07 AM

Quote:

Originally Posted by Nominal Animal (Post 4604431)
Code:

#!/usr/bin/awk -f

BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}

{
    split("", p)

    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (!(c in p))
            p[c] = i

        printf(" %d", p[c])
    }

    printf("\n")
}

Thank you, Nominal Animal, for this awk. It does the job and runs fast!
Quote:

... if you want a hexadecimal encoding without spaces, use printf("%x",p[c]).
Yes, that is the preferred style of output.

Quote:

The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew.
I'll check this with my friend Terttu next time I see her.

Please don't think me ungrateful for your excellent solution ... but I'm still hoping for something even shorter. Some background information: I have a slew of self-written REXX programs from my previous life as a Windows user. After switching to Linux, I improved many of them by substituting small segments of Linux commands for large segments of REXX code. This has been a satisfying experience, resulting in programs which are smaller and faster.

The subject of this thread is an exception to the general experience. In REXX this encoding is done in one line. Here is an illustrative example.
Code:

/* A simple example of encoding text. */
Plaintext = 'people'
EncodedWord = translate(Plaintext,'123456789abcdef',Plaintext)
say EncodedWord Plaintext
exit

The output is a single line:
Code:

123152 people
The Linux tr command is so similar to the REXX translate, yet I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it. Perhaps you (or other LQ members) will show the way.
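One detail worth spelling out: when a character is repeated in SET1, GNU tr keeps the last mapping given, whereas REXX translate() keeps the first; the reversed strings in the first post work around exactly that. A minimal illustration (GNU tr assumed):
Code:

# REXX translate() keeps the FIRST mapping for a repeated character;
# GNU tr keeps the LAST one, so a direct copy of the idea fails:
echo 'people' | tr 'people' '123456'    # prints 463456, not 123152

# Reversing both sets makes the wanted occurrence the last one:
echo 'people' | tr 'elpoep' '654321'    # prints 123152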

Daniel B. Martin

uhelp 02-17-2012 11:24 AM

How would you like to deal with words like
'abcdefghijklmnopqrstxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'?

A strange but important question...

If the word contains more than 16 different characters, what then?
Ignore that and use one "digit-code" for multiple characters?

Or do you want to use a numeral system providing 26 symbols, i.e. a base-26 digit system?

Cedrik 02-17-2012 01:09 PM

Quote:

Originally Posted by danielbmartin (Post 4605106)
The subject of this thread is an exception to the general experience. In REXX this encoding is done in one line. Here is an illustrative example.
Code:

/* A simple example of encoding text. */
Plaintext = 'people'
EncodedWord = translate(Plaintext,'123456789abcdef',Plaintext)
say EncodedWord Plaintext
exit

The output is a single line:
Code:

123152 people
The Linux tr command is so similar to the REXX translate, yet I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it. Perhaps you (or other LQ members) will show the way.

Daniel B. Martin

A perl example:

Code:

#!/usr/bin/perl

$Plaintext = 'people';
@EncodedWord = map { (index $Plaintext, $_)+1 } split //, $Plaintext;
print @EncodedWord, ' ', $Plaintext, "\n";

The output is a single line:
Code:

123152 people
Well, I admit it's not shorter :p

Nominal Animal 02-17-2012 01:18 PM

Quote:

Originally Posted by danielbmartin (Post 4605106)
Please don't think me ungrateful for your excellent solution ... but I'm still hoping for something even shorter.

No worries! Seeing different types of solutions and approaches is important to me too.

You can compact the awk script quite a bit, e.g.
Code:

awk -F '' '{split("",k);for(i=1;i<=NF;i++){if(!k[$i])k[$i]=i;printf("%x",k[$i])}printf("\n")}'
but it's still just a compact script, and not a short command.

I suspect the main reason is that the Linux/POSIX/Unix command-line tools are stream-oriented; character position is rarely even accessible. For example, one must explicitly tell awk to handle each character as a separate field to do this.
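For instance, just getting at individual characters already needs the empty field separator (a small sketch; gawk and mawk both accept FS set to the empty string):
Code:

echo 'people' | awk -F '' '{print NF, $1, $4}'    # prints: 6 p p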

In Bash, you can use the ${#variable} expression to find out the length of the string in variable, and ${variable:skip} to skip over the first skip characters in variable:
Code:

mapping="zyxwvutsrqponmlkjihgfedcba987654321"

word="supercalifragilisticexpialidocious"

echo "$word" | tr "$(echo "$word" | rev)" "${mapping:${#mapping}-${#word}}"

but this is quite slow due to having to run two external programs, tr and rev. Also, the word to be translated must be in a Bash variable; it is referred to no fewer than three times. Still, it should work fine for all words that are not longer than the mapping string. (Obviously you only need to set the mapping string once; this one suffices for all words up to 35 characters long.)

The actual command, the last line, which does the translation of $word, is pretty funny, IMO. The nested pipe and string operations make it pretty hard to parse at first sight.
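Broken into its pieces for a shorter word, the expansions evaluate like this (the values shown assume the mapping string defined above):
Code:

word="people"
echo "$word" | rev                            # elpoep
echo "${#mapping} ${#word}"                   # 35 6
echo "${mapping:${#mapping}-${#word}}"        # 654321
echo "$word" | tr elpoep 654321               # 123152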

Fun times ;)

Edited to add: as an input filter,
Code:

#!/bin/bash

mapping="zyxwvutsrqponmlkjihgfedcba987654321"

while read -r word ; do
    echo -n "$word" | tr "$(echo "$word" | rev)" "${mapping:${#mapping}-${#word}}"
    echo " $word"
done


danielbmartin 02-17-2012 01:21 PM

Quote:

Originally Posted by uhelp (Post 4605119)
If the word contains more than 16 different characters, what then?
Ignore an use one "digit-code" for multiple characters?

The intended input is common English-language words. Restricting the input to words of length <= 15 characters is an acceptable program limitation.

However, the REXX code handles this with ease. Extending the previously posted example, it becomes ...
Code:

/* A simple example of encoding text. */
Plaintext = 'antidisestablishmentarianism'
EncodedWord = translate(Plaintext,'123456789abcdefghijklm',Plaintext)
say EncodedWord Plaintext

exit

... and the output is
Code:

12345478731cd47gh8231m41247h antidisestablishmentarianism
Daniel B. Martin

firstfire 02-17-2012 04:28 PM

Hi.

Here I am with sed again.

@Daniel: I enjoyed solving this problem and all previous ones. Thank you for that. This one makes me mad -- it is so simple, I wrote about 4 sed and a couple of bash solutions, BUT how to make them efficient and short? I can't beat Nominal's awk solution.

@Nominal Animal: As always, your solutions are excellent!

The solution I ended up with works similarly to Nominal's last one. Here is an illustration showing the internal representation of the data and the main loop (which is roughly equivalent to the combination of tr and rev):
Code:

$ echo '123456789;people:people' | sed -r ':a;s/((.).{9}(.).*:.*)\3/\1\2/;ta'
123456789;people:123152

The result goes after a colon.
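Spelled out with comments, the same loop reads like this (GNU sed):
Code:

echo '123456789;people:people' | sed -r '
:a
# \2 is a code character and \3 is the letter ten positions to its right,
# i.e. the corresponding letter in the untouched copy of the word.
# Replace the last remaining copy of \3 after the colon with \2, then repeat.
s/((.).{9}(.).*:.*)\3/\1\2/
ta'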

And here is the script itself:
Code:

#!/bin/sed -rf

s/.*/123456789ABCDEFGHIJK\L&:&/
:a
s/((.).{19}(.).*:.*)\3/\1\2/
ta
s/.*://

Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)':
Code:

#!/bin/sed -rf

# Shortcut: a word with no repeated character encodes to 123..., so just
# keep the first length-of-word code characters.
/(.).*\1/!{ s/^/123456789ABCDEFGHIJK/; s/.{20}$//; b;}

s/.*/123456789ABCDEFGHIJK\L&:&/
:a
s/((.).{19}(.).*:.*)\3/\1\2/
ta
s/.*://

I convert input words to lower case (using the GNU sed extension `\L') and use upper-case letters as "digits" so that they cannot interfere with the word characters.

This script is still MUCH SLOWER than Nominal's awk solution and not much shorter, so I am not completely satisfied with it.

audriusk 02-17-2012 05:54 PM

Mind if I join you with a Python example (2.x, but it would work with 3.x after small changes)? :)
Code:

#!/usr/bin/env python

import sys
from string import digits, ascii_lowercase
from itertools import izip_longest

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        # Ensure that encoding is not longer than text line.
        encoding = (digits + ascii_lowercase)[1:len(line)+1]
        d = {}
        print ''.join(d.setdefault(char, code)
                      for char, code in izip_longest(line, encoding)), line

As it reads lines from stdin, it can be used as a filter, similar to Nominal Animal's AWK script. Haven't compared its speed with others, just wanted to get a working example.
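A quick way to try it (assuming it is saved under an arbitrary name such as encode.py and made executable):
Code:

$ printf '%s\n' people aardvark | ./encode.py
123152 people
11345138 aardvark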

Cedrik 02-17-2012 07:07 PM

A little perl one-liner
Code:

echo people | perl -ne 'print ((index $_, $1)+1) while /(.)/g'

David the H. 02-18-2012 12:03 PM

In bash (ver 4+, due to associative array use):

Code:


#!/bin/bash

string="$1"            # to restrict it to lowercase, use string=${1,,}
declare -A array       # declare an associative array
pos=( 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m )    # list of positional characters to use

for (( i=0 ; i<${#string} ; i++ )) ; do                 # loop through the length of the string (zero-indexed)

    char=${string:$i:1}                                      # get the character at the current position
    [[ -z ${array[$char]} ]] && array[$char]=${pos[i]:-@}    # if first encountered, record the position; use "@" if pos is exceeded
    printf "%s" "${array[$char]}"                            # print the recorded position

done

echo        # echo a final newline

exit 0

Edit: Missed the desire for alnum character output. Hopefully the update corrects it. You can extend the pos array to cover additional characters if desired.

Edit2: Made a minor modification so that it uses a filler character if the number of different chars in the string exceeds the number of defined positions.
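A possible invocation, assuming the script is saved under an arbitrary name such as encode.sh:
Code:

$ ./encode.sh people
123152
$ ./encode.sh antidisestablishmentarianism
12345478731cd47gh8231m41247h

The second result matches the REXX output shown earlier in the thread.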

grail 02-19-2012 08:49 AM

Well Cedrik got me to thinking:

Awk:
Code:

echo people | awk -F "" '{while(++i <=NF)printf "%d", index($0,$i)}'
Ruby: (I am still new to this so it can probably be made much shorter)
Code:

echo people | ruby -ane '$F[0].split(//).each { |x| print $F[0].index(x)+1 }'

danielbmartin 02-19-2012 12:31 PM

Quote:

Originally Posted by firstfire (Post 4605366)
Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)' ...

No need for this check.

The problem statement in the original post focused on the technique for encoding. I didn't address the application in which this transformation is embedded. It should be said that the input file contains ...
1) One word per line.
2) Lower case characters, only.
3) Words with multiple repeated characters.

Some of the offered solutions have explicit loops. Maybe these solutions could be made a tiny bit faster by using this observation: there is no need to test the first character in a word, to see if it has already been observed. By definition, it has not already been observed... and consequently the first digit in the encoded result is always 1. Perhaps the loop index variable could begin with 2 rather than 1.

Daniel B. Martin

grail 02-19-2012 12:38 PM

In actual fact, it would take more coding in most, if not all, of the solutions to simply print 1 for the first letter and then start the loop.
Computationally, I do not see that you would benefit (programmatically) from such a change.

Nominal Animal 02-19-2012 12:43 PM

Ah, using the index() mapping Cedrik and Grail used makes much more sense. This is Grail's solution, but with the number-and-letter encoding, and with both the encoded string and the unencoded one on each output line.
Code:

awk -F '' -v enc=123456789abcdefghijklmnopqrstuvwxyz '{for(i=1;i<=NF;i++) printf("%s",substr(enc,index($0,$i),1)); printf(" %s\n", $0)}'
On my machine, it handles the entire /usr/share/dict/words (234937 words in 2486824 bytes) in about one and a half seconds using GNU awk 3.1.8, but well under a second using mawk-1.3.3.

Using
Code:

mawk -F '' -v enc=123456789abcdefghijklmnopqrstuvwxyz '{s="1"; for(i=2;i<=NF;i++) s=s substr(enc,index($0,$i),1); printf("%s %s\n", s, $0)}'
is even faster, only taking two thirds of a second for the /usr/share/dict/words on my machine. (It takes almost two seconds using gawk.) This version constructs the encoded version into a string instead of outputting character-by-character. It also uses the optimization mentioned above, starting the string always with "1" and not processing the first character explicitly.

Skipping the first character saves about 7% in runtime for the /usr/share/dict/words word list, so it certainly is worthwhile.

In all cases the output is exactly the same; I verified that by piping the output to sha256sum -b and comparing the results.

The speed difference for something this simple surprises me. It seems that mawk has much more efficient input/output buffering than GNU awk. (A good thing to know if you handle large amounts of data and don't need gawk-specific features; you may halve the wall time by switching to mawk.)
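For reference, the comparison can be reproduced along these lines (a sketch; it assumes gawk, mawk, sha256sum and the word list above are all available):
Code:

enc=123456789abcdefghijklmnopqrstuvwxyz
prog='{for(i=1;i<=NF;i++) printf("%s",substr(enc,index($0,$i),1)); printf(" %s\n", $0)}'

time gawk -F '' -v enc="$enc" "$prog" /usr/share/dict/words > /dev/null
time mawk -F '' -v enc="$enc" "$prog" /usr/share/dict/words > /dev/null

gawk -F '' -v enc="$enc" "$prog" /usr/share/dict/words | sha256sum -b
mawk -F '' -v enc="$enc" "$prog" /usr/share/dict/words | sha256sum -b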

