LinuxQuestions.org
View the Most Wanted LQ Wiki articles.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 02-16-2012, 02:55 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,065

Rep: Reputation: 284Reputation: 284Reputation: 284
Numerical encoding of text, by position


I posted this question many months ago and received suggestions, but none of them was a "bull's eye" answer. I'm posting again, thinking that LQ now has more members and someone may have a flash of inspiration.

I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions. This is best explained by example:

people ==> 123152

Reading left to right:
p was first encountered at position 1 so it is encoded as 1.
e was first encountered at position 2 so it is encoded as 2.
o was first encountered at position 3 so it is encoded as 3.
p (again) was first encountered at position 1 so it is encoded as 1.
l was first encountered at position 5 so it is encoded as 5.
e (again) was first encountered at position 2 so it is encoded as 2.

More examples:
aardvark => 11345138
sense => 12312
committee => 123356688
position => 12345428


This is a first step toward a solution.
Code:
echo 'people' \
|tr 'elpoep' '123456789abcdef' \
|tr '654321' '123456789abcdef'

echo 'aardvark' \
|tr 'kravdraa' '123456789abcdef' \
|tr '87654321' '123456789abcdef'
This combination of successive tr commands works but it has a shortcoming: the length of the character strings must equal the length of the word to be encoded. That shorcoming may be remedied by coding something like this:
Code:
echo 'people' \
|tr 'elpoep*********' '123456789abcdef' \
|tr '654321*********' '123456789abcdef'

echo 'aardvark' \
|tr 'kravdraa*******' '123456789abcdef' \
|tr '87654321*******' '123456789abcdef'
Good, that seems like part of the puzzle, but only part.

Using this sample input file ...
Code:
people
aardvark
error
infinite
nevermore
... and massaging it with this clumsy code ...
Code:
cat < $InFile    \
|sed -e "s/$/'/" \
|sed -e "s/^/***************/"  \
|rev \
|cut -c1-15 \
|sed -e "s/$/' '123456789abcdef'/"
... we get this interesting file ...
Code:
'elpoep********' '123456789abcdef'
'kravdraa******' '123456789abcdef'
'rorre*********' '123456789abcdef'
'etinifni******' '123456789abcdef'
'eromreven*****' '123456789abcdef'
Which might be another piece of the puzzle.

Now, what kind of "glue" should be used to put these pieces together? Is there an easier, better way?

There is an index command but my Ubuntu system seems to lack it. Is that some kind of downloadable extension?

Daniel B. Martin
 
Click here to see the post LQ members have rated as the most helpful post in this thread.
Old 02-16-2012, 03:43 PM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942
Quote:
Originally Posted by danielbmartin View Post
I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions.
How about
Code:
#!/usr/bin/awk -f

BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}

{
    split("", p)

    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (!(c in p))
            p[c] = i

        printf(" %d", p[c])
    }

    printf("\n")
}
If you save the above as a script, you can use it as a filter, or supply it with file names to process.

It ignores case, and prints a space before each number. If you don't want the space, just remove it from the printf(" %d",p[c]) statement. Or, if you want a hexadecimal encoding without spaces, use printf("%x",p[c]). You can also use a custom encoding, by adding
Code:
split("0 1 2 3 4 5 6 7 8 9 a b c d e f z x v", m, " ")
into the BEGIN rule, and using printf("%s",m[p[c]]) to print the encoded character position for each input character.

The BEGIN rule sets the record separator to any whitespace. This means you don't need to have each word on their own line, any whitespace is okay. Setting FS to an empty string means each character in a record will be a separate field.

The conversion is done in the second rule. The s array maps already seen characters to their columns. The for loop loops over each character, c, the input field (character) converted to lower case. If the character is not already in the s array, it is added to the array, with value being the current position, of course. Finally, the printf statement prints the position the current character maps to. After the loop, the second printf adds a newline.

If you want it to ignore e.g. punctuation and digits, use GNU awk and
Code:
#!/usr/bin/gawk -f

BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}

{
    split("", p)

    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (c !~ /[[:punct:][:digit:]]/) {
            if (!(c in p))
                p[c] = i
            printf(" %d", p[c])
        }
    }

    printf("\n")
}
instead; the only changes are the additions in red.

This latter implementation works very well with other languages, too, if you use a single-byte character encoding, or UTF-8. GNU awk correctly detects multibyte UTF-8 characters as single characters. The character classes are also locale-specific. For example, using LANG=fi_FI.UTF-8
Code:
echo '2: Älä lyö, ööliä läikkyy!' | ./numerical.awk

 1 2 1
 1 2 3
 1 1 3 4 5
 1 2 3 4 4 6 6
(The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew. Google translate does not translate it correctly, but does pronounces it pretty much perfectly.)
 
2 members found this post helpful.
Old 02-17-2012, 10:07 AM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,065

Original Poster
Rep: Reputation: 284Reputation: 284Reputation: 284
[QUOTE=Nominal Animal;4604431]
Code:
#!/usr/bin/awk -f

BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}

{
    split("", p)

    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (!(c in p))
            p[c] = i

        printf(" %d", p[c])
    }

    printf("\n")
}
Thank you, NominalAnimal for this awk. It does the job and runs fast!
Quote:
... if you want a hexadecimal encoding without spaces, use printf("%x",p[c]).
Yes, that is the preferred style of output.

Quote:
The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew.
I'll check this with my friend Terttu next time I see her.

Please don't think me as ungrateful for your excellent solution ... but I'm still hoping for something even shorter. Some background information: I have a slew of self-written REXX programs from my previous life as a Window user. After switching to Linux, I improved many of them by substituting small segments of Linux commands for large segments of REXX code. This has been a satisfying experience, resulting in programs which are smaller and faster.

The subject of this thread is an exception to the general experience. In REXX this encoding is done in one line. Here is an illustrative example.
Code:
/* A simple example of encoding text. */
Plaintext = 'people'
EncodedWord = translate(Plaintext,'123456789abcdef',Plaintext)
say EncodedWord Plaintext
exit
The output is a single line:
Code:
123152 people
The Linux tr command is so similar to the REXX translate but I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it. Perhaps you (or other LQ members) will show the way.

Daniel B. Martin
 
Old 02-17-2012, 10:24 AM   #4
uhelp
Member
 
Registered: Nov 2011
Location: Germany, Bavaria, Nueremberg area
Distribution: openSUSE, Debian, LFS
Posts: 205

Rep: Reputation: 43
How would you like to deal with words like:
'abcdefghijklmnopqrstxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
?

Strange, but important question....

If the word contains more than 16 different characters, what then?
Ignore an use one "digit-code" for multiple characters?

Or you want use a numeric system providing 26 signs, i.e. base26 digit-system?

Last edited by uhelp; 02-17-2012 at 10:28 AM.
 
1 members found this post helpful.
Old 02-17-2012, 12:09 PM   #5
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 242Reputation: 242Reputation: 242
Quote:
Originally Posted by danielbmartin View Post
The subject of this thread is an exception to the general experience. In REXX this encoding is done in one line. Here is an illustrative example.
Code:
/* A simple example of encoding text. */
Plaintext = 'people'
EncodedWord = translate(Plaintext,'123456789abcdef',Plaintext)
say EncodedWord Plaintext
exit
The output is a single line:
Code:
123152 people
The Linux tr command is so similar to the REXX translate but I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it. Perhaps you (or other LQ members) will show the way.

Daniel B. Martin
A perl example:

Code:
#!/usr/bin/perl

$Plaintext = 'people';
@EncodedWord = map { (index $Plaintext, $_)+1 } split //, $Plaintext;
print @EncodedWord, ' ', $Plaintext, "\n";
The output is a single line:
Code:
123152 people
Well, I admit it's not shorter
 
1 members found this post helpful.
Old 02-17-2012, 12:18 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942
Quote:
Originally Posted by danielbmartin View Post
Please don't think me as ungrateful for your excellent solution ... but I'm still hoping for something even shorter.
No worries! Seeing different types of solutions and approaches is important to me too.

You can compact the awk script quite a bit, e.g.
Code:
awk -F '' '{split("",k);for(i=1;i<=NF;i++){if(!k[$i])k[$i]=i;printf("%x",k[$i])}printf("\n")}'
but it's still just a compact script, and not a short command.

I suspect the main reason is because the Linux/POSIX/Unix command-line tools are stream-oriented; character position is rarely even accessible. For example, one must tell awk to explicitly handle each character as a separate field to do this.

In Bash, you can use the ${#variable} expression to find out the length of the string in variable, and ${variable:skip} to skip over the first skip characters in variable:
Code:
mapping="zyxwvutsrqponmlkjihgfedcba987654321"

word="supercalifragilisticexpialidocious"

echo "$word" | tr "$(echo "$word" | rev)" "${mapping:${#mapping}-${#word}}"
but this is quite slow due to having to use two external programs, tr and rev. Also, the word to be translated must be in a Bash variable; it is referred to no less than three times.. but it should work fine for all words that are not longer than the mapping string. (Obviously you only need to set the mapping string once; this one suffices for all words up to 35 characters long.)

The actual command, the last line, which does the translation (from $word) is pretty funny, IMO. The nested pipe and string operations makes it pretty hard to parse at first sight.

Fun times

Edited to add: as an input filter,
Code:
#!/bin/bash

mapping="zyxwvutsrqponmlkjihgfedcba987654321"

while read -r word ; do
    echo -n "$word" | tr "$(echo "$word" | rev)" "${mapping:${#mapping}-${#word}}"
    echo " $word"
done

Last edited by Nominal Animal; 02-17-2012 at 01:00 PM.
 
2 members found this post helpful.
Old 02-17-2012, 12:21 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,065

Original Poster
Rep: Reputation: 284Reputation: 284Reputation: 284
Quote:
Originally Posted by uhelp View Post
If the word contains more than 16 different characters, what then?
Ignore an use one "digit-code" for multiple characters?
The intended input is common English-language words. Restricting the input to words of length <= 15 characters is an acceptable program limitation.

However, the REXX code handles this with ease. Extending the previously posted example, it becomes ...
Code:
/* A simple example of encoding text. */
Plaintext = 'antidisestablishmentarianism'
EncodedWord = translate(Plaintext,'123456789abcdefghijklm',Plaintext)
say EncodedWord Plaintext

exit
... and the output is
Code:
12345478731cd47gh8231m41247h antidisestablishmentarianism
Daniel B. Martin
 
Old 02-17-2012, 03:28 PM   #8
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362Reputation: 362Reputation: 362Reputation: 362
Hi.

Here I am with sed again.

@Daniel: I enjoyed solving this problem and all previous ones. Thank you for that. This one makes me mad -- it is so simple, I wrote about 4 sed and a couple of bash solutions, BUT how to make them efficient and short? I can't beat Nominal's awk solution.

@Nominal Animal: As always, your solutions are excellent!

Solution, I ended up with, works similarly to Nominal's last one. Here is an illustration, showing internal representation of data and main loop (which is roughly equivalent to combination of tr and rev):
Code:
$ echo '123456789;people:people' | sed -r ':a;s/((.).{9}(.).*:.*)\3/\1\2/;ta'
123456789;people:123152
The result goes after a colon.

And here is a script itself:
Code:
#!/bin/sed -rf

s/.*/123456789ABCDEFGHIJK\L&:&/
:a
s/((.).{19}(.).*:.*)\3/\1\2/
ta
s/.*://
Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)':
Code:
#!/bin/sed -rf

/(.).*\1/!{ s/^/123456789ABCDEFGHIJK/; s/.{19}$//; b;}

s/.*/123456789ABCDEFGHIJK\L&:&/
:a
s/((.).{19}(.).*:.*)\3/\1\2/
ta
s/.*://
I convert input words to lower case (using GNU sed extension `\L') and use upper case letters as digits to prevent interfering them.

This script is still MUCH SLOWER than Nominal's awk solution and not much shorter, so I am not completely satisfied with it.
 
1 members found this post helpful.
Old 02-17-2012, 04:54 PM   #9
audriusk
Member
 
Registered: Mar 2011
Location: Klaipėda, Lithuania
Distribution: Slackware
Posts: 232

Rep: Reputation: 104Reputation: 104
Mind if I join you with Python (2.x, but would work with 3.x after small changes) example?
Code:
#!/usr/bin/env python

import sys
from string import digits, ascii_lowercase
from itertools import izip_longest

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        # Ensure that encoding is not longer than text line.
        encoding = (digits + ascii_lowercase)[1:len(line)+1]
        d = {}
        print ''.join(d.setdefault(char, code)
                      for char, code in izip_longest(line, encoding)), line
As it reads lines from stdin, it can be used as a filter, similar to Nominal Animal's AWK script. Haven't compared its speed with others, just wanted to get a working example.
 
1 members found this post helpful.
Old 02-17-2012, 06:07 PM   #10
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 242Reputation: 242Reputation: 242
A little perl onliner
Code:
echo people | perl -ne 'print ((index $_, $1)+1) while /(.)/g'
 
2 members found this post helpful.
Old 02-18-2012, 11:03 AM   #11
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946
In bash ( ver4+, due to associative array use ):

Code:
#!/bin/bash

string="$1"							#to restrict it to lowercase, use string=${1,,}
declare -A array						#declare an associative array
pos=( 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m )		#declare a list of positional characters to use

for (( i=0 ; i<${#string} ; i++ )) ; do				#loop through the length of the string (zero-indexed)

     char=${string:$i:1}					#get the character in the current position
     [[ -z ${array[$char]} ]] && array[$char]=${pos[i]:-@}		#if first time encountered, record the current position.  Use "@" if pos is exceeded.
     printf "%s" "${array[$char]}"				#print recorded position

done

echo	#echo a final newline

exit 0
Edit: Missed the desire for alnum character output. Hopefully the update corrects it. You can extend the pos array to cover additional characters if desired.

Edit2: Made a minor modification so that it uses a filler character if the number of different chars in the string exceeds the number of defined positions.

Last edited by David the H.; 02-18-2012 at 11:31 AM.
 
1 members found this post helpful.
Old 02-19-2012, 07:49 AM   #12
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,433

Rep: Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879
Well Cedrik got me to thinking:

Awk:
Code:
echo people | awk -F "" '{while(++i <=NF)printf "%d", index($0,$i)}'
Ruby: (I am still new to this so it can probably be made much shorter)
Code:
echo people | ruby -ane '$F[0].split(//).each { |x| print $F[0].index(x)+1 }'
 
2 members found this post helpful.
Old 02-19-2012, 11:31 AM   #13
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,065

Original Poster
Rep: Reputation: 284Reputation: 284Reputation: 284
Quote:
Originally Posted by firstfire View Post
Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)' ...
No need for this check.

The problem statement in the original post focused on the technique for encoding. I didn't address the application in which this transformation is embedded. It should be said that the input file contains ...
1) One word per line.
2) Lower case characters, only.
3) Words with multiple repeated characters.

Some of the offered solutions have explicit loops. Maybe these solutions could be made a tiny bit faster by using this observation: there is no need to test the first character in a word, to see if it has already been observed. By definition, it has not already been observed... and consequently the first digit in the encoded result is always 1. Perhaps the loop index variable could begin with 2 rather than 1.

Daniel B. Martin
 
Old 02-19-2012, 11:38 AM   #14
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,433

Rep: Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879Reputation: 1879
In actual fact, it would take more coding in most, if not all, of the solutions to simply print 1 on the first letter and then start a loop.
Computationally I do not see that you would benefit (programatically) from such a change.
 
Old 02-19-2012, 11:43 AM   #15
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942
Ah, using the index mapping Cedrik and Grail used makes much more sense. This is Grail's solution, but with the number-and-letter encoding, with both the encoded string and unencoded one on each output line.
Code:
awk -F '' -v enc=123456789abcdefghijklmnopqrstuvwxyz '{for(i=1;i<=NF;i++) printf("%s",substr(enc,index($0,$i),1)); printf(" %s\n", $0)}'
On my machine, it handles the entire /usr/share/dict/words (234937 words in 2486824 bytes) in about one and a half seconds using GNU awk 3.1.8, but well under a second using mawk-1.3.3.

Using
Code:
mawk -F '' -v enc=123456789abcdefghijklmnopqrstuvwxyz '{s="1"; for(i=2;i<=NF;i++) s=s substr(enc,index($0,$i),1); printf("%s %s\n", s, $0)}'
is even faster, only taking two thirds of a second for the /usr/share/dict/words on my machine. (It takes almost two seconds using gawk.) This version constructs the encoded version into a string instead of outputting character-by-character. It also uses the optimization mentioned above, starting the string always with "1" and not processing the first character explicitly.

Skipping the first character saves about 7% in runtime for the /usr/share/dict/words word list, so it certainly is worthwhile.

In all cases the output is exactly the same; I verified that by piping the output to sha256sum -b and comparing the results.

The speed differences for something this simple surprises me. It seems that mawk has much more efficient input/output buffering methods than GNU awk. (A good thing to know if you handle large amounts of data, and don't need gawk-specific features; you may halve the wall time by switching to mawk.)

Last edited by Nominal Animal; 02-19-2012 at 11:49 AM.
 
2 members found this post helpful.
  


Reply

Tags
sed, tr


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Retrieve numerical values from a text file shik28 Programming 1 11-22-2011 02:02 AM
Openoffice: extract numerical value from text cell lothario Linux - Software 2 01-04-2011 02:34 AM
Numerical encoding of text, by position danielbmartin Linux - Newbie 5 04-29-2010 12:31 AM
how to count the numerical digits in between the text using a command or a script? Kilam orez Linux - Newbie 9 01-03-2010 12:15 AM
Text file manipulation: Extracting specific rows according to numerical pattern CHARL0TTE Linux - Newbie 3 10-07-2009 07:14 AM


All times are GMT -5. The time now is 03:32 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration