Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
I posted this question many months ago and received suggestions, but none of them was a "bull's eye" answer. I'm posting again, thinking that LQ now has more members and someone may have a flash of inspiration.
I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions. This is best explained by example:
people ==> 123152
Reading left to right:
p was first encountered at position 1 so it is encoded as 1.
e was first encountered at position 2 so it is encoded as 2.
o was first encountered at position 3 so it is encoded as 3.
p (again) was first encountered at position 1 so it is encoded as 1.
l was first encountered at position 5 so it is encoded as 5.
e (again) was first encountered at position 2 so it is encoded as 2.
More examples:
aardvark => 11345138
sense => 12312
committee => 123356688
position => 12345428
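For reference, the rule can be sketched in a few lines of Python (an editorial illustration; the helper name `encode` is made up and is not part of the original post):

```python
def encode(word):
    """Encode a word by the position at which each letter first occurs.

    Positions 1-9 are written as digits; positions 10 and up are written
    as the letters a, b, c, ... so each position stays a single character.
    """
    first = {}  # letter -> position of its first occurrence
    out = []
    for i, ch in enumerate(word, start=1):
        p = first.setdefault(ch, i)
        out.append(str(p) if p < 10 else chr(ord('a') + p - 10))
    return ''.join(out)

print(encode("people"))    # 123152
print(encode("aardvark"))  # 11345138
```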
This combination of successive tr commands works, but it has a shortcoming: the length of the character strings must equal the length of the word to be encoded. That shortcoming may be remedied by coding something like this:
Quote:
I have a file of English-language words, one word per line, and want to encode them in a numeric form, based on letter positions.
How about
Code:
#!/usr/bin/awk -f
BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}
{
    split("", p)
    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (!(c in p))
            p[c] = i
        printf(" %d", p[c])
    }
    printf("\n")
}
If you save the above as a script, you can use it as a filter, or supply it with file names to process.
It ignores case, and prints a space before each number. If you don't want the space, just remove it from the printf(" %d",p[c]) statement. Or, if you want a hexadecimal encoding without spaces, use printf("%x",p[c]). You can also use a custom encoding, by adding
Code:
split("0 1 2 3 4 5 6 7 8 9 a b c d e f z x v", m, " ")
into the BEGIN rule, and using printf("%s",m[p[c]]) to print the encoded character position for each input character.
The BEGIN rule sets the record separator to any whitespace. This means you don't need to have each word on their own line, any whitespace is okay. Setting FS to an empty string means each character in a record will be a separate field.
The conversion is done in the second rule. The p array maps already-seen characters to their columns. The for loop iterates over each character c, the input field (character) converted to lower case. If the character is not already in the p array, it is added to the array, with its current position as the value. Finally, the printf statement prints the position the current character maps to. After the loop, the second printf adds a newline.
If you want it to ignore e.g. punctuation and digits, use GNU awk and
Code:
#!/usr/bin/gawk -f
BEGIN {
    RS = "[\t\n\v\f\r ]+"
    FS = ""
}
{
    split("", p)
    for (i = 1; i <= NF; i++) {
        c = tolower($i)
        if (c !~ /[[:punct:][:digit:]]/) {
            if (!(c in p))
                p[c] = i
            printf(" %d", p[c])
        }
    }
    printf("\n")
}
instead; the only change is the added test that skips punctuation and digit characters.
This latter implementation works very well with other languages, too, if you use a single-byte character encoding, or UTF-8. GNU awk correctly detects multibyte UTF-8 characters as single characters. The character classes are also locale-specific. For example, using LANG=fi_FI.UTF-8
(The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew. Google Translate does not translate it correctly, but does pronounce it pretty much perfectly.)
Thank you, NominalAnimal for this awk. It does the job and runs fast!
Quote:
... if you want a hexadecimal encoding without spaces, use printf("%x",p[c]).
Yes, that is the preferred style of output.
Quote:
The text is Finnish and can be roughly translated as "Don't hit, you'll spill the brew!"; "ööli" is a slang term for brew.
I'll check this with my friend Terttu next time I see her.
Please don't think me ungrateful for your excellent solution ... but I'm still hoping for something even shorter. Some background information: I have a slew of self-written REXX programs from my previous life as a Windows user. After switching to Linux, I improved many of them by substituting small segments of Linux commands for large segments of REXX code. This has been a satisfying experience, resulting in programs which are smaller and faster.
The subject of this thread is an exception to the general experience. In REXX this encoding is done in one line. Here is an illustrative example.
Code:
/* A simple example of encoding text. */
Plaintext = 'people'
EncodedWord = translate(Plaintext,'123456789abcdef',Plaintext)
say EncodedWord Plaintext
exit
The output is a single line:
Code:
123152 people
The Linux tr command is so similar to the REXX translate but I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it. Perhaps you (or other LQ members) will show the way.
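One hedged note on why a direct tr translation misbehaves: when SET1 contains a character more than once, tr uses the last occurrence, whereas REXX translate searches left to right and uses the first. Reversing both strings turns "last" into "first". Python's str.maketrans resolves duplicates the same way tr does, so the effect can be demonstrated with it:

```python
word = "people"
table = "123456789abcdefghijklmnopqrstuvwxyz"[:len(word)]

# Direct translation: later duplicates in `word` overwrite earlier ones,
# so repeated letters get the position of their LAST occurrence.
print(word.translate(str.maketrans(word, table)))              # 463456

# Reversing both strings makes the FIRST occurrence win instead.
print(word.translate(str.maketrans(word[::-1], table[::-1])))  # 123152
```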
Quote:
In REXX this encoding is done in one line. ... The Linux tr command is so similar to the REXX translate but I cannot replicate the result. Maybe it just isn't possible; maybe I haven't figured out how to do it.
but it's still just a compact script, and not a short command.
I suspect the main reason is because the Linux/POSIX/Unix command-line tools are stream-oriented; character position is rarely even accessible. For example, one must tell awk to explicitly handle each character as a separate field to do this.
In Bash, you can use the ${#variable} expression to find out the length of the string in variable, and ${variable:skip} to skip over the first skip characters in variable:
but this is quite slow, since it has to run two external programs, tr and rev. Also, the word to be translated must be in a Bash variable, and it is referred to no fewer than three times. Still, it should work fine for all words that are not longer than the mapping string. (Obviously you only need to set the mapping string once; this one suffices for all words up to 35 characters long.)
The actual command, the last line, which does the translation (from $word), is pretty funny, IMO. The nested pipe and string operations make it pretty hard to parse at first sight.
Fun times
Edited to add: as an input filter,
Code:
#!/bin/bash
mapping="zyxwvutsrqponmlkjihgfedcba987654321"
while read -r word ; do
    echo -n "$word" | tr "$(echo "$word" | rev)" "${mapping:${#mapping}-${#word}}"
    echo " $word"
done
Last edited by Nominal Animal; 02-17-2012 at 01:00 PM.
If the word contains more than 16 different characters, what then?
Ignore it, or use one "digit-code" for multiple characters?
The intended input is common English-language words. Restricting the input to words of length <= 15 characters is an acceptable program limitation.
However, the REXX code handles this with ease. Extending the previously posted example, it becomes ...
Code:
/* A simple example of encoding text. */
Plaintext = 'antidisestablishmentarianism'
EncodedWord = translate(Plaintext,'123456789abcdefghijklm',Plaintext)
say EncodedWord Plaintext
exit
@Daniel: I enjoyed solving this problem and all previous ones. Thank you for that. This one makes me mad -- it is so simple, and I wrote about four sed solutions and a couple of Bash ones, BUT how to make them efficient and short? I can't beat Nominal's awk solution.
@Nominal Animal: As always, your solutions are excellent!
The solution I ended up with works similarly to Nominal's last one. Here is an illustration, showing the internal representation of the data and the main loop (which is roughly equivalent to the combination of tr and rev):
Code:
$ echo '123456789;people:people' | sed -r ':a;s/((.).{9}(.).*:.*)\3/\1\2/;ta'
123456789;people:123152
The result appears after the colon.
And here is the script itself:
Code:
#!/bin/sed -rf
s/.*/123456789ABCDEFGHIJK\L&:&/
:a
s/((.).{19}(.).*:.*)\3/\1\2/
ta
s/.*://
Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)':
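The sed implementing that fast path was not included above; as an editorial sketch (Python, with hypothetical names), the same idea would be:

```python
TABLE = "123456789abcdefghijklmnopqrstuvwxyz"

def encode_fast(word):
    # Fast path: a word with no repeated letters always encodes to
    # "123...len(word)", so the per-character scan can be skipped.
    if len(set(word)) == len(word):
        return TABLE[:len(word)]
    first = {}  # letter -> index of its first occurrence
    return ''.join(TABLE[first.setdefault(ch, i)] for i, ch in enumerate(word))

print(encode_fast("boxcar"))  # 123456  (no repeats: fast path)
print(encode_fast("people"))  # 123152  (repeats: full scan)
```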
Mind if I join you with Python (2.x, but would work with 3.x after small changes) example?
Code:
#!/usr/bin/env python
import sys
from string import digits, ascii_lowercase
from itertools import izip_longest

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        # Ensure that encoding is not longer than text line.
        encoding = (digits + ascii_lowercase)[1:len(line)+1]
        d = {}
        print ''.join(d.setdefault(char, code)
                      for char, code in izip_longest(line, encoding)), line
As it reads lines from stdin, it can be used as a filter, similar to Nominal Animal's AWK script. Haven't compared its speed with others, just wanted to get a working example.
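For reference, a possible Python 3 port (the "small changes" mentioned: izip_longest became itertools.zip_longest, and print is a function; the encode helper is pulled out here only to keep the logic in one place):

```python
#!/usr/bin/env python3
import sys
from string import digits, ascii_lowercase
from itertools import zip_longest  # izip_longest was renamed in Python 3

def encode(line):
    # Slice the code table so it is never longer than the text line.
    encoding = (digits + ascii_lowercase)[1:len(line) + 1]
    d = {}
    return ''.join(d.setdefault(char, code)
                   for char, code in zip_longest(line, encoding))

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        print(encode(line), line)
```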
Code:
#!/bin/bash
string="$1"          # to restrict it to lowercase, use string=${1,,}
declare -A array     # declare an associative array
pos=( 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m )  # positional characters to use
for (( i=0 ; i<${#string} ; i++ )) ; do  # loop through the string (zero-indexed)
    char=${string:$i:1}  # get the character in the current position
    [[ -z ${array[$char]} ]] && array[$char]=${pos[i]:-@}  # if first time encountered, record the current position; use "@" if pos is exceeded
    printf "%s" "${array[$char]}"  # print recorded position
done
echo  # echo a final newline
exit 0
Edit: Missed the desire for alnum character output. Hopefully the update corrects it. You can extend the pos array to cover additional characters if desired.
Edit2: Made a minor modification so that it uses a filler character if the number of different chars in the string exceeds the number of defined positions.
Last edited by David the H.; 02-18-2012 at 11:31 AM.
Quote:
Now, to make it a bit faster, we may add a check for duplicate characters in an input word: if there are no such characters, the result is simply '12345...len(word)' ...
No need for this check.
The problem statement in the original post focused on the technique for encoding. I didn't address the application in which this transformation is embedded. It should be said that the input file contains ...
1) One word per line.
2) Lower case characters, only.
3) Words with multiple repeated characters.
Some of the offered solutions have explicit loops. Maybe these solutions could be made a tiny bit faster by using this observation: there is no need to test the first character in a word to see whether it has already been observed. By definition, it has not been observed... and consequently the first digit in the encoded result is always 1. Perhaps the loop index variable could begin with 2 rather than 1.
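As an editorial sketch (Python, hypothetical names) of the proposed tweak:

```python
TABLE = "123456789abcdefghijklmnopqrstuvwxyz"

def encode_skip_first(word):
    # The first letter is by definition not yet seen, so it is always
    # encoded as 1; record it and start the scan at the second character.
    if not word:
        return ""
    first = {word[0]: '1'}
    out = ['1']
    for i, ch in enumerate(word[1:], start=1):
        out.append(first.setdefault(ch, TABLE[i]))
    return ''.join(out)

print(encode_skip_first("people"))  # 123152
```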
In actual fact, in most if not all of the solutions it would take more coding to simply print 1 for the first letter and then start the loop.
Computationally, I do not see that you would benefit (programmatically) from such a change.
Ah, using the index mapping Cedrik and Grail used makes much more sense. This is Grail's solution, but with the number-and-letter encoding, with both the encoded string and unencoded one on each output line.
On my machine, it handles the entire /usr/share/dict/words (234937 words in 2486824 bytes) in about one and a half seconds using GNU awk 3.1.8, but well under a second using mawk-1.3.3.
is even faster, only taking two thirds of a second for the /usr/share/dict/words on my machine. (It takes almost two seconds using gawk.) This version constructs the encoded version into a string instead of outputting character-by-character. It also uses the optimization mentioned above, starting the string always with "1" and not processing the first character explicitly.
Skipping the first character saves about 7% in runtime for the /usr/share/dict/words word list, so it certainly is worthwhile.
In all cases the output is exactly the same; I verified that by piping the output to sha256sum -b and comparing the results.
The speed differences for something this simple surprise me. It seems that mawk has much more efficient input/output buffering than GNU awk. (A good thing to know if you handle large amounts of data and don't need gawk-specific features; you may halve the wall time by switching to mawk.)
Last edited by Nominal Animal; 02-19-2012 at 11:49 AM.