02-19-2012, 12:59 PM   #16
Registered: Mar 2011
Location: Klaipėda, Lithuania
Distribution: Slackware
Posts: 347

Rep: Reputation: 189

Here's a revised Python version that uses index mapping:
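#!/usr/bin/env python
# Runs under both Python 2 and Python 3.
from __future__ import print_function

import sys
from string import digits, ascii_lowercase

if __name__ == '__main__':
    # Code alphabet: '1'-'9' followed by 'a'-'z' (35 symbols).
    encoding = (digits + ascii_lowercase)[1:]
    for line in sys.stdin:
        line = line.strip()
        # Each character is encoded by the index of its first occurrence.
        # A character first seen at index 35 or later raises IndexError.
        print(''.join(encoding[line.index(char)] for char in line), line)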
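To make the index mapping concrete: each character is replaced by the code symbol at the position where that character first occurs in the word, so repeated letters share a symbol. Here is a minimal sketch of the same idea as a function, assuming Python 3; the names encode and ENCODING are made up for illustration.

```python
from string import digits, ascii_lowercase

# Code alphabet: '1'..'9' then 'a'..'z', built the same way as in the script.
ENCODING = (digits + ascii_lowercase)[1:]

def encode(word):
    """Replace each character by the symbol for the index of its first occurrence."""
    return ''.join(ENCODING[word.index(char)] for char in word)

print(encode("neighbor"))  # 12345678 (all letters distinct)
print(encode("banana"))    # 123232
print(encode("hello"))     # 12335 (the second 'l' reuses '3', and 'o' sits at index 4)
```

Note that the symbol reflects the position of the first occurrence, not the rank of the letter, which is why "hello" skips '4'.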
02-19-2012, 05:10 PM   #17
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,659

Original Poster
Rep: Reputation: 525
Originally Posted by Nominal Animal
On my machine, it handles the entire /usr/share/dict/words (234937 words in 2486824 bytes) in about one and a half seconds using GNU awk 3.1.8, but well under a second using mawk-1.3.3.
Wow, I learn so much from LQ. I am an unschooled newbie and had never known that /usr/share/dict/words exists, and exists on my PC. I am using a word list file obtained from a friend. No need to keep two word lists on the drive when one will do. Moreover, my friend's word list is British-English so it contains neighbour but not neighbor. For my purposes a US-English word list is preferable.

If I launch a terminal session and enter this ...
wc /usr/share/dict/words
... the response is ...
 98569  98568 931708 /usr/share/dict/words
Doesn't this mean 98569 words, and not 234937? Is my /usr/share/dict/words different from yours?
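For reference, wc without options prints three numbers in the order lines, words, bytes, so in the output above 98569 is the line count and 98568 is the word count. Below is a rough Python 3 sketch of those three counts. The helper name wc_counts is hypothetical, and splitting on ASCII whitespace only approximates how wc delimits words.

```python
def wc_counts(data):
    """Return (lines, words, bytes) for a bytes object, in wc's column order."""
    lines = data.count(b"\n")   # wc counts newline characters
    words = len(data.split())   # runs of non-whitespace bytes
    return lines, words, len(data)

# Two one-word lines: 2 lines, 2 words, 9 + 10 = 19 bytes.
print(wc_counts(b"neighbor\nneighbour\n"))  # (2, 2, 19)
```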

Further, I had never heard of mawk and discover that it exists on my PC. Are there functional differences? If mawk is more efficient, should I use it routinely instead of awk?

Daniel B. Martin

Last edited by danielbmartin; 02-19-2012 at 05:11 PM. Reason: Correct t7po.
02-19-2012, 06:10 PM   #18
Nominal Animal
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 947
Originally Posted by danielbmartin
Is my /usr/share/dict/words different from yours?
Yes, but it is no big deal. Mine is from dictionaries-common-1.11.5ubuntu1.

Originally Posted by danielbmartin
Further, I had never heard of mawk and discover that it exists on my PC. Are there functional differences? If mawk is more efficient, should I use it routinely instead of awk?
Yes, there are functional differences. If you look at The GNU Awk User's Manual, some functions (asort() for example) are marked with a #, meaning they are GNU extensions. (Sorting is often extremely useful, and although you can implement it yourself as an awk function, using GNU awk extensions when you need them makes sense to me.)

Also, GNU awk can process NUL-separated input ('\0' separators, like the string terminator in C) simply by setting RS and/or FS to "\0". mawk cannot. I personally use GNU awk (gawk) for things like filename mangling, supplying the names via find ... -print0 or -printf '%p\0' or similar. This handles all possible filenames in Linux, in any character set supported in Linux, too, if you set LANG=C LC_ALL=C to avoid errors due to invalid UTF-8 sequences.
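The reason NUL works as a separator is that it cannot occur inside a Linux filename, whereas a newline can. The following Python 3 sketch only illustrates that point and is not the gawk method itself; the helper name split_nul and the sample bytes are made up.

```python
def split_nul(data):
    """Split bytes on NUL, ignoring the empty piece after a trailing NUL."""
    return [name for name in data.split(b"\0") if name]

# Roughly what `find ... -print0` could emit for two files,
# one of which has a newline embedded in its name.
sample = b"./plain.txt\0./odd\nname.txt\0"
for name in split_nul(sample):
    print(repr(name))
```

To process real input, the bytes could come from sys.stdin.buffer.read() instead of the sample literal.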

It seems that mawk would be more efficient for text-formatted data conversions, where sorting or NULs are not needed. For example, setting RS="[\t\n\v\f\r ]*<";FS=">[\t\n\v\f\r ]*" in a BEGIN rule would give you XML tags in $1 (including attributes) and all immediate content in $2. Obviously this is not full XML support (no CDATA sections or comments, for example), and if you want structure, you need to keep a tag stack. But if you only need to convert massive amounts of logically flat XML data, and you know it is parseable with awk, a simple script will often suffice. Awk is pretty efficient, after all; in this case it will even stream the data, one tag plus the immediate text content that follows at a time, so it'll need very little memory, too.
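To see what those two separators do, here is a rough Python 3 sketch that mimics the record and field splitting with the same regular expressions. It is an illustration only: the function name tags_and_text is made up, it reads the whole string at once instead of streaming as awk does, and it skips the empty record that comes before the first "<".

```python
import re

# Same patterns as the awk RS and FS settings.
RS = re.compile(r"[\t\n\v\f\r ]*<")
FS = re.compile(r">[\t\n\v\f\r ]*")

def tags_and_text(xml):
    """Yield (tag, text) pairs, the counterparts of awk's $1 and $2."""
    for record in RS.split(xml):
        if not record:
            continue  # nothing before the first '<'
        fields = FS.split(record)
        tag = fields[0]
        text = fields[1] if len(fields) > 1 else ""
        yield tag, text

sample = '<a x="1"> hello </a>\n<b>text</b>\n'
for tag, text in tags_and_text(sample):
    print(repr(tag), repr(text))
```

Closing tags arrive as records of their own (for example '/a' with empty text), which is where a tag stack would come in if structure mattered.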

In general, all awks try pretty hard to be POSIX-compatible, and are to a large extent interchangeable. Aside from extensions and some bugs, of course. The POSIX standard for awk is quite close to historic implementations -- awk was first developed at Bell Labs in the 1970s.
All times are GMT -5.
