Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
| Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
 |
GNU/Linux Basic Guide
This 255-page guide will provide you with the keys to understand the philosophy of free software, teach you how to use and handle it, and give you the tools required to move easily in the world of GNU/Linux. Many users and administrators will be taking their first steps with this GNU/Linux Basic guide and it will show you how to approach and solve the problems you encounter.
Click Here to receive this Complete Guide absolutely free. |
|
 |
|
02-16-2012, 02:55 PM
|
#1
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 756
Rep: 
|
Numerical encoding of text, by position
I posted this question many months ago and received suggestions, but none of them was a "bull's eye" answer. I'm posting again, thinking that LQ now has more members and someone may have a flash of inspiration.
I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions. This is best explained by example:
people ==> 123152
Reading left to right:
p was first encountered at position 1 so it is encoded as 1.
e was first encountered at position 2 so it is encoded as 2.
o was first encountered at position 3 so it is encoded as 3.
p (again) was first encountered at position 1 so it is encoded as 1.
l was first encountered at position 5 so it is encoded as 5.
e (again) was first encountered at position 2 so it is encoded as 2.
More examples:
aardvark => 11345138
sense => 12312
committee => 123356688
position => 12345428
This is a first step toward a solution.
Code:
echo 'people' \
|tr 'elpoep' '123456789abcdef' \
|tr '654321' '123456789abcdef'
echo 'aardvark' \
|tr 'kravdraa' '123456789abcdef' \
|tr '87654321' '123456789abcdef'
This combination of successive tr commands works but it has a shortcoming: the length of the character strings must equal the length of the word to be encoded. That shorcoming may be remedied by coding something like this:
Code:
echo 'people' \
|tr 'elpoep*********' '123456789abcdef' \
|tr '654321*********' '123456789abcdef'
echo 'aardvark' \
|tr 'kravdraa*******' '123456789abcdef' \
|tr '87654321*******' '123456789abcdef'
Good, that seems like part of the puzzle, but only part.
Using this sample input file ...
Code:
people
aardvark
error
infinite
nevermore
... and massaging it with this clumsy code ...
Code:
cat < $InFile \
|sed -e "s/$/'/" \
|sed -e "s/^/***************/" \
|rev \
|cut -c1-15 \
|sed -e "s/$/' '123456789abcdef'/"
... we get this interesting file ...
Code:
'elpoep********' '123456789abcdef'
'kravdraa******' '123456789abcdef'
'rorre*********' '123456789abcdef'
'etinifni******' '123456789abcdef'
'eromreven*****' '123456789abcdef'
Which might be another piece of the puzzle.
Now, what kind of "glue" should be used to put these pieces together? Is there an easier, better way?
There is an index command but my Ubuntu system seems to lack it. Is that some kind of downloadable extension?
Daniel B. Martin
|
|
|
|
|
Click here to see the post LQ members have rated as the most helpful post in this thread.
|
02-16-2012, 03:43 PM
|
#2
|
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
Quote:
Originally Posted by danielbmartin
I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions.
|
How about
Code:
#!/usr/bin/awk -f
BEGIN {
RS = "[\t\n\v\f\r ]+"
FS = ""
}
{
split("", p)
for (i = 1; i <= NF; i++) {
c = tolower($i)
if (!(c in p))
p[c] = i
printf(" %d", p[c])
}
printf("\n")
}
If you save the above as a script, you can use it as a filter, or supply it with file names to process.
It ignores case, and prints a space before each number. If you don't want the space, just remove it from the printf(" %d",p[c]) statement. Or, if you want a hexadecimal encoding without spaces, use printf("%x",p[c]). You can also use a custom encoding, by adding
Code:
split("0 1 2 3 4 5 6 7 8 9 a b c d e f z x v", m, " ")
into the BEGIN rule, and using printf("%s",m[p[c]]) to print the encoded character position for each input character.
The BEGIN rule sets the record separator to any whitespace. This means you don't need to have each word on their own line, any whitespace is okay. Setting FS to an empty string means each character in a record will be a separate field.
The conversion is done in the second rule. The s array maps already seen characters to their columns. The for loop loops over each character, c, the input field (character) converted to lower case. If the character is not already in the s array, it is added to the array, with value being the current position, of course. Finally, the printf statement prints the position the current character maps to. After the loop, the second printf adds a newline.
If you want it to ignore e.g. punctuation and digits, use GNU awk and
Code:
#!/usr/bin/gawk -f
BEGIN {
RS = "[\t\n\v\f\r ]+"
FS = ""
}
{
split("", p)
for (i = 1; i <= NF; i++) {
c = tolower($i)
if (c !~ /[[:punct:][:digit:]]/) {
if (!(c in p))
p[c] = i
printf(" %d", p[c])
}
}
printf("\n")
}
instead; the only changes are the additions in red.
This latter implementation works very well with other languages, too, if you use a single-byte character encoding, or UTF-8. GNU awk correctly detects multibyte UTF-8 characters as single characters. The character classes are also locale-specific. For example, using LANG=fi_FI.UTF-8
Code:
echo '2: Älä lyö, ööliä läikkyy!' | ./numerical.awk
1 2 1
1 2 3
1 1 3 4 5
1 2 3 4 4 6 6
(The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew. Google translate does not translate it correctly, but does pronounces it pretty much perfectly.)
|
|
|
2 members found this post helpful.
|
02-17-2012, 10:07 AM
|
#3
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 756
Original Poster
Rep: 
|
[QUOTE=Nominal Animal;4604431]
Code:
#!/usr/bin/awk -f
BEGIN {
RS = "[\t\n\v\f\r ]+"
FS = ""
}
{
split("", p)
for (i = 1; i <= NF; i++) {
c = tolower($i)
if (!(c in p))
p[c] = i
printf(" %d", p[c])
}
printf("\n")
}
Thank you, NominalAnimal for this awk. It does the job and runs fast!
Quote:
|
... if you want a hexadecimal encoding without spaces, use printf("%x",p[c]).
|
Yes, that is the preferred style of output.
Quote:
|
The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew.
|
I'll check this with my friend Terttu next time I see her.
Please don't think me as ungrateful for your excellent solution ... but I'm still hoping for something even shorter. Some background information: I have a slew of self-written REXX programs from my previous life as a Window user. After switching to Linux, I improved many of them by substituting small segments of Linux commands for large segments of REXX code. This has been a satisfying experience, resulting in programs which are smaller and faster.
The subject of this thread is an exception to the general experience. In REXX this encoding is done in one line. Here is an illustrative example.
Code:
/* A simple example of encoding text. */
Plaintext = 'people'
EncodedWord = translate(Plaintext,'123456789abcdef',Plaintext)
say EncodedWord Plaintext
exit
The output is a single line:
The Linux tr command is so similar to the REXX translate but I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it. Perhaps you (or other LQ members) will show the way.
Daniel B. Martin
|
|
|
|
02-17-2012, 10:24 AM
|
#4
|
|
Member
Registered: Nov 2011
Location: Germany, Bavaria, Nueremberg area
Distribution: openSUSE, Debian, LFS
Posts: 205
Rep:
|
How would you like to deal with words like:
'abcdefghijklmnopqrstxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
?
Strange, but important question....
If the word contains more than 16 different characters, what then?
Ignore an use one "digit-code" for multiple characters?
Or you want use a numeric system providing 26 signs, i.e. base26 digit-system?
Last edited by uhelp; 02-17-2012 at 10:28 AM.
|
|
|
1 members found this post helpful.
|
02-17-2012, 12:09 PM
|
#5
|
|
Senior Member
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140
|
Quote:
Originally Posted by danielbmartin
The subject of this thread is an exception to the general experience. In REXX this encoding is done in one line. Here is an illustrative example.
Code:
/* A simple example of encoding text. */
Plaintext = 'people'
EncodedWord = translate(Plaintext,'123456789abcdef',Plaintext)
say EncodedWord Plaintext
exit
The output is a single line:
The Linux tr command is so similar to the REXX translate but I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it. Perhaps you (or other LQ members) will show the way.
Daniel B. Martin
|
A perl example:
Code:
#!/usr/bin/perl
$Plaintext = 'people';
@EncodedWord = map { (index $Plaintext, $_)+1 } split //, $Plaintext;
print @EncodedWord, ' ', $Plaintext, "\n";
The output is a single line:
Well, I admit it's not shorter 
|
|
|
1 members found this post helpful.
|
02-17-2012, 12:18 PM
|
#6
|
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
Quote:
Originally Posted by danielbmartin
Please don't think me as ungrateful for your excellent solution ... but I'm still hoping for something even shorter.
|
No worries! Seeing different types of solutions and approaches is important to me too.
You can compact the awk script quite a bit, e.g.
Code:
awk -F '' '{split("",k);for(i=1;i<=NF;i++){if(!k[$i])k[$i]=i;printf("%x",k[$i])}printf("\n")}'
but it's still just a compact script, and not a short command.
I suspect the main reason is because the Linux/POSIX/Unix command-line tools are stream-oriented; character position is rarely even accessible. For example, one must tell awk to explicitly handle each character as a separate field to do this.
In Bash, you can use the ${#variable} expression to find out the length of the string in variable, and ${variable:skip} to skip over the first skip characters in variable:
Code:
mapping="zyxwvutsrqponmlkjihgfedcba987654321"
word="supercalifragilisticexpialidocious"
echo "$word" | tr "$(echo "$word" | rev)" "${mapping:${#mapping}-${#word}}"
but this is quite slow due to having to use two external programs, tr and rev. Also, the word to be translated must be in a Bash variable; it is referred to no less than three times.. but it should work fine for all words that are not longer than the mapping string. (Obviously you only need to set the mapping string once; this one suffices for all words up to 35 characters long.)
The actual command, the last line, which does the translation (from $word) is pretty funny, IMO. The nested pipe and string operations makes it pretty hard to parse at first sight.
Fun times
Edited to add: as an input filter,
Code:
#!/bin/bash
mapping="zyxwvutsrqponmlkjihgfedcba987654321"
while read -r word ; do
echo -n "$word" | tr "$(echo "$word" | rev)" "${mapping:${#mapping}-${#word}}"
echo " $word"
done
Last edited by Nominal Animal; 02-17-2012 at 01:00 PM.
|
|
|
2 members found this post helpful.
|
02-17-2012, 12:21 PM
|
#7
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 756
Original Poster
Rep: 
|
Quote:
Originally Posted by uhelp
If the word contains more than 16 different characters, what then?
Ignore an use one "digit-code" for multiple characters?
|
The intended input is common English-language words. Restricting the input to words of length <= 15 characters is an acceptable program limitation.
However, the REXX code handles this with ease. Extending the previously posted example, it becomes ...
Code:
/* A simple example of encoding text. */
Plaintext = 'antidisestablishmentarianism'
EncodedWord = translate(Plaintext,'123456789abcdefghijklm',Plaintext)
say EncodedWord Plaintext
exit
... and the output is
Code:
12345478731cd47gh8231m41247h antidisestablishmentarianism
Daniel B. Martin
|
|
|
|
02-17-2012, 03:28 PM
|
#8
|
|
Member
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 576
|
Hi.
Here I am with sed again.
@Daniel: I enjoyed solving this problem and all previous ones. Thank you for that. This one makes me mad -- it is so simple, I wrote about 4 sed and a couple of bash solutions, BUT how to make them efficient and short? I can't beat Nominal's awk solution.
@Nominal Animal: As always, your solutions are excellent!
Solution, I ended up with, works similarly to Nominal's last one. Here is an illustration, showing internal representation of data and main loop (which is roughly equivalent to combination of tr and rev):
Code:
$ echo '123456789;people:people' | sed -r ':a;s/((.).{9}(.).*:.*)\3/\1\2/;ta'
123456789;people:123152
The result goes after a colon.
And here is a script itself:
Code:
#!/bin/sed -rf
s/.*/123456789ABCDEFGHIJK\L&:&/
:a
s/((.).{19}(.).*:.*)\3/\1\2/
ta
s/.*://
Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)':
Code:
#!/bin/sed -rf
/(.).*\1/!{ s/^/123456789ABCDEFGHIJK/; s/.{19}$//; b;}
s/.*/123456789ABCDEFGHIJK\L&:&/
:a
s/((.).{19}(.).*:.*)\3/\1\2/
ta
s/.*://
I convert input words to lower case (using GNU sed extension `\L') and use upper case letters as digits to prevent interfering them.
This script is still MUCH SLOWER than Nominal's awk solution and not much shorter, so I am not completely satisfied with it.
|
|
|
1 members found this post helpful.
|
02-17-2012, 04:54 PM
|
#9
|
|
Member
Registered: Mar 2011
Location: Klaipėda, Lithuania
Distribution: Slackware
Posts: 228
Rep: 
|
Mind if I join you with Python (2.x, but would work with 3.x after small changes) example?
Code:
#!/usr/bin/env python
import sys
from string import digits, ascii_lowercase
from itertools import izip_longest
if __name__ == '__main__':
for line in sys.stdin:
line = line.strip()
# Ensure that encoding is not longer than text line.
encoding = (digits + ascii_lowercase)[1:len(line)+1]
d = {}
print ''.join(d.setdefault(char, code)
for char, code in izip_longest(line, encoding)), line
As it reads lines from stdin, it can be used as a filter, similar to Nominal Animal's AWK script. Haven't compared its speed with others, just wanted to get a working example.
|
|
|
1 members found this post helpful.
|
02-17-2012, 06:07 PM
|
#10
|
|
Senior Member
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140
|
A little perl onliner
Code:
echo people | perl -ne 'print ((index $_, $1)+1) while /(.)/g'
|
|
|
2 members found this post helpful.
|
02-18-2012, 11:03 AM
|
#11
|
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,577
|
In bash ( ver4+, due to associative array use ):
Code:
#!/bin/bash
string="$1" #to restrict it to lowercase, use string=${1,,}
declare -A array #declare an associative array
pos=( 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m ) #declare a list of positional characters to use
for (( i=0 ; i<${#string} ; i++ )) ; do #loop through the length of the string (zero-indexed)
char=${string:$i:1} #get the character in the current position
[[ -z ${array[$char]} ]] && array[$char]=${pos[i]:-@} #if first time encountered, record the current position. Use "@" if pos is exceeded.
printf "%s" "${array[$char]}" #print recorded position
done
echo #echo a final newline
exit 0
Edit: Missed the desire for alnum character output. Hopefully the update corrects it. You can extend the pos array to cover additional characters if desired.
Edit2: Made a minor modification so that it uses a filler character if the number of different chars in the string exceeds the number of defined positions.
Last edited by David the H.; 02-18-2012 at 11:31 AM.
|
|
|
1 members found this post helpful.
|
02-19-2012, 07:49 AM
|
#12
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,312
|
Well Cedrik got me to thinking:
Awk:
Code:
echo people | awk -F "" '{while(++i <=NF)printf "%d", index($0,$i)}'
Ruby: (I am still new to this so it can probably be made much shorter)
Code:
echo people | ruby -ane '$F[0].split(//).each { |x| print $F[0].index(x)+1 }'
|
|
|
2 members found this post helpful.
|
02-19-2012, 11:31 AM
|
#13
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 756
Original Poster
Rep: 
|
Quote:
Originally Posted by firstfire
Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)' ...
|
No need for this check.
The problem statement in the original post focused on the technique for encoding. I didn't address the application in which this transformation is embedded. It should be said that the input file contains ...
1) One word per line.
2) Lower case characters, only.
3) Words with multiple repeated characters.
Some of the offered solutions have explicit loops. Maybe these solutions could be made a tiny bit faster by using this observation: there is no need to test the first character in a word, to see if it has already been observed. By definition, it has not already been observed... and consequently the first digit in the encoded result is always 1. Perhaps the loop index variable could begin with 2 rather than 1.
Daniel B. Martin
|
|
|
|
02-19-2012, 11:38 AM
|
#14
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,312
|
In actual fact, it would take more coding in most, if not all, of the solutions to simply print 1 on the first letter and then start a loop.
Computationally I do not see that you would benefit (programatically) from such a change.
|
|
|
|
02-19-2012, 11:43 AM
|
#15
|
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
Ah, using the index mapping Cedrik and Grail used makes much more sense. This is Grail's solution, but with the number-and-letter encoding, with both the encoded string and unencoded one on each output line.
Code:
awk -F '' -v enc=123456789abcdefghijklmnopqrstuvwxyz '{for(i=1;i<=NF;i++) printf("%s",substr(enc,index($0,$i),1)); printf(" %s\n", $0)}'
On my machine, it handles the entire /usr/share/dict/words (234937 words in 2486824 bytes) in about one and a half seconds using GNU awk 3.1.8, but well under a second using mawk-1.3.3.
Using
Code:
mawk -F '' -v enc=123456789abcdefghijklmnopqrstuvwxyz '{s="1"; for(i=2;i<=NF;i++) s=s substr(enc,index($0,$i),1); printf("%s %s\n", s, $0)}'
is even faster, only taking two thirds of a second for the /usr/share/dict/words on my machine. (It takes almost two seconds using gawk.) This version constructs the encoded version into a string instead of outputting character-by-character. It also uses the optimization mentioned above, starting the string always with "1" and not processing the first character explicitly.
Skipping the first character saves about 7% in runtime for the /usr/share/dict/words word list, so it certainly is worthwhile.
In all cases the output is exactly the same; I verified that by piping the output to sha256sum -b and comparing the results.
The speed differences for something this simple surprises me. It seems that mawk has much more efficient input/output buffering methods than GNU awk. (A good thing to know if you handle large amounts of data, and don't need gawk-specific features; you may halve the wall time by switching to mawk.)
Last edited by Nominal Animal; 02-19-2012 at 11:49 AM.
|
|
|
2 members found this post helpful.
|
| Thread Tools |
Search this Thread |
|
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
All times are GMT -5. The time now is 04:59 AM.
|
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|