Numerical encoding of text, by position

audriusk · 02-19-2012, 12:59 PM

Here's a revised Python version that uses index mapping:

Code:

#!/usr/bin/env python

import sys
from string import digits, ascii_lowercase

if __name__ == '__main__':
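    # Position symbols: '1'-'9', then 'a'-'z' (35 symbols in total).
    encoding = (digits + ascii_lowercase)[1:]
    for line in sys.stdin:
        line = line.strip()
        # Replace each character with the symbol for its first position in the line.
        print('%s %s' % (''.join(encoding[line.index(char)] for char in line), line))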
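To make the mapping easier to try out on single words, here is the same idea wrapped in a function. This is only a sketch: the names ENCODING and encode_by_position are mine, not part of the script above, and it assumes no character first appears beyond position 35 (otherwise the lookup raises IndexError).

```python
from string import digits, ascii_lowercase

# Symbols for positions 0..34: '1'-'9', then 'a'-'z'.
# (Hypothetical name; the script above calls this 'encoding'.)
ENCODING = (digits + ascii_lowercase)[1:]


def encode_by_position(word):
    """Replace each character with the symbol for its first position in word."""
    return ''.join(ENCODING[word.index(ch)] for ch in word)


if __name__ == '__main__':
    for sample in ('hello', 'banana', 'neighbor'):
        print('%s %s' % (encode_by_position(sample), sample))
```

Repeated letters get the symbol of their first occurrence, so hello becomes 12335 and banana becomes 123232.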

danielbmartin · 02-19-2012, 05:10 PM

Quote:

Originally Posted by Nominal Animal

On my machine, it handles the entire /usr/share/dict/words (234937 words in 2486824 bytes) in about one and a half seconds using GNU awk 3.1.8, but well under a second using mawk-1.3.3.

Wow, I learn so much from LQ. I am an unschooled newbie and had never known that /usr/share/dict/words exists, and exists on my PC. I am using a word list file obtained from a friend. No need to keep two word lists on the drive when one will do. Moreover, my friend's word list is British-English so it contains neighbour but not neighbor. For my purposes a US-English word list is preferable.

If I launch a terminal session and enter this ...

Code:

wc /usr/share/dict/words

... the response is ...

Code:

 98569  98568 931708 /usr/share/dict/words

Doesn't this mean 98569 words, and not 234937? Is my /usr/share/dict/words different from yours?

Further, I had never heard of mawk and discover that it exists on my PC. Are there functional differences? If mawk is more efficient, should I use it routinely instead of awk?

Daniel B. Martin

Nominal Animal · 02-19-2012, 06:10 PM

Quote:

Originally Posted by danielbmartin

Is my /usr/share/dict/words different from yours?

Yes, but it is no big deal. Mine is from dictionaries-common-1.11.5ubuntu1.
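For reference, wc without options prints the line count, the word count, and the byte count, in that order, so your output reads as 98569 lines and 98568 words. A rough Python sketch of the same three counts follows; wc_counts is a made-up name, and it assumes words are simply runs of non-whitespace bytes, which may differ from wc in some locales.

```python
def wc_counts(data):
    """Return (lines, words, bytes) for a bytes object, in wc's column order."""
    lines = data.count(b"\n")   # wc counts newline characters
    words = len(data.split())   # runs of non-whitespace bytes
    return lines, words, len(data)


if __name__ == '__main__':
    print(wc_counts(b"alpha\nbeta gamma\n"))
```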

Quote:

Originally Posted by danielbmartin

Further, I had never heard of mawk and discover that it exists on my PC. Are there functional differences? If mawk is more efficient, should I use it routinely instead of awk?

Yes, there are functional differences. If you look at The GNU Awk User's Manual, some functions (asort() for example) are marked with a #, meaning they are GNU extensions. (Sorting is often extremely useful, and although you can implement it yourself as an awk function, using GNU awk extensions when you need them makes sense to me.)

Also, GNU awk can process NUL-separated input ('\0' separators, like the terminator of C strings) simply by setting RS and/or FS to "\0". mawk cannot. I personally use GNU awk (gawk) for tasks such as filename mangling, supplying the names via find ... -print0 or -printf '%p\0' or similar. This handles all possible filenames in Linux, in any character set supported in Linux, if you set LANG=C LC_ALL=C to avoid errors due to invalid UTF-8 sequences.
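The reason NUL works as a separator is that it is the one byte that cannot occur inside a Linux path, while newlines and spaces can. As an illustration only (this is Python, not awk, and split_nul_records is a hypothetical helper name), splitting such a stream into records looks roughly like this:

```python
def split_nul_records(data):
    """Split a bytes object on NUL bytes, one record per filename.

    The empty piece after a trailing NUL (as written by find -print0)
    is dropped, so it does not show up as a bogus empty filename.
    """
    records = data.split(b"\0")
    if records and records[-1] == b"":
        records.pop()
    return records


# Possible usage, reading everything find wrote to the pipe:
#   import sys
#   for name in split_nul_records(sys.stdin.buffer.read()):
#       print(repr(name))
```

Working on bytes instead of decoded text sidesteps the invalid UTF-8 problem in much the same way as the C locale does for awk.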

It seems that mawk would be more efficient for text-formatted data conversions, where sorting or NULs are not needed. For example, setting RS="[\t\n\v\f\r ]*<"; FS=">[\t\n\v\f\r ]*" in a BEGIN rule would give you XML tags in $1 (including attributes) and all immediate content in $2. Obviously this is not full XML support (it does not handle CDATA sections or comments, for example), and if you want structure, you need to keep a tag stack. But if you only need to convert massive amounts of logically flat XML data, and you know it is parseable with awk, a simple script will often suffice. Awk is pretty efficient, after all; in this case it will even stream the data one record at a time (a tag plus the immediate text content that follows), so it needs very little memory, too.
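To see what those RS and FS settings do to the input, here is a rough Python emulation, not awk itself. The function name tag_records and the sample input are made up for illustration. The sketch skips empty records (such as the one before the first '<'), and it takes the whole text at once instead of streaming.

```python
import re

# The same two patterns as the awk RS and FS settings.
RS = re.compile(r"[\t\n\v\f\r ]*<")   # record separator: optional whitespace, then '<'
FS = re.compile(r">[\t\n\v\f\r ]*")   # field separator: '>', then optional whitespace


def tag_records(text):
    """Yield (tag, content) pairs, roughly what awk would see as $1 and $2."""
    for record in RS.split(text):
        if not record:
            continue  # nothing before the first '<'
        fields = FS.split(record)
        tag = fields[0]
        content = fields[1] if len(fields) > 1 else ""
        yield tag, content


if __name__ == '__main__':
    sample = '<item id="1">first</item>\n<item id="2">second</item>\n'
    for tag, content in tag_records(sample):
        print('%r %r' % (tag, content))
```

Closing tags come through as records of their own (tag '/item' with empty content), which is where a tag stack would hook in if you needed structure.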

In general, all awks try pretty hard to be POSIX-compatible, and are to a large extent interchangeable, aside from extensions and some bugs, of course. The POSIX standard for awk is quite close to historic implementations; awk was first developed at Bell Labs in the 1970s.