LinuxQuestions.org - [SOLVED] Text processing

Page 1 of 2

Show 50 post(s) from this thread on one page

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Text processing - "must have" letters (https://www.linuxquestions.org/questions/programming-9/text-processing-must-have-letters-4175554135/)

danielbmartin

09-21-2015 08:52 PM

Text processing - "must have" letters

Have: a file with many 6-character words, one per line.

Have: a key word with 6 characters. Example: deront

Want: a file containing all of the words from the original file which contain all of the letters in the key. Order is not important.

The key word may be thought of as a string of "Must Have" letters.
denote fails because it has no R.
redone fails because it has no T.
rodent passes the test because it contains all of the MH letters.

If the key contains 2 Rs then qualifying words must likewise have 2 Rs.

I've tried grep without success; tried tr without success.

As a matter of personal coding style I strive to avoid explicit loops.

Ideas?

Daniel B. Martin

firstfire

09-21-2015 10:27 PM

Hi.

The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.

Code:

$ echo deront | sed 's/./&\n/g' | sort | tr -d '\n'

denort

$ echo rodent | sed 's/./&\n/g' | sort | tr -d '\n'

denort  # this word matches

$ echo denote | sed 's/./&\n/g' | sort | tr -d '\n'

deenot  # does not match

(I added newlines at end of output manually for clarity)

firstfire

09-22-2015 12:29 AM

Hi.

Here is the reasonably fast perl solution based on my previous post:

Code:

#!/usr/bin/env perl



sub sortletters { return join "", sort split //, lc $_[0]; }

my $key = sortletters (shift);

while (<>) {

        chomp;

        print "$_\n" if ($key eq sortletters $_);

}

Demo run:

Code:

$ time /tmp/must-have.pl binary  /usr/share/dict/words

binary

brainy



real    0m0.221s

user    0m0.217s

sys    0m0.004s

HMW	09-22-2015 01:19 AM

Quote:

Originally Posted by firstfire (Post 5423744)

The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.

Clever approach, bravo!

Best regards,
HMW

HMW	09-22-2015 02:03 AM

And here is one version in Python, basically just ripping off firstfire's splendid approach.

With this ("dbm.txt") as infile:

Code:

$ cat dbm.txt 

denote

redone

rodent

And this Python script:

Code:

#!/usr/bin/env python3

import sys 



key = list(sys.argv[1])



with open("dbm.txt") as infile:

    for line in infile:

        sortedLine = "".join(sorted(list(line))).strip('\n')

        if "".join(sorted(key)) == sortedLine:

            print(line)

Upon execution, I get this output:

Code:

$ ./must_have.py deront

rodent

Thanks for a cool little exercise. All credit goes to firstfire for the approach.

Best regards,
HMW

firstfire

09-22-2015 04:25 AM

Hi.

The execution will be a bit faster if you move sorting of key out of the loop:

Code:

key = list(sys.argv[1])

sortedKey = "".join(sorted(key))

  ...

  if sortedKey == sortedLine:

    ...

HMW	09-22-2015 04:34 AM

Quote:

Originally Posted by firstfire (Post 5423839)

The execution will be a bit faster if you move sorting of key out of the loop:

Code:

key = list(sys.argv[1])

sortedKey = "".join(sorted(key))

  ...

  if sortedKey == sortedLine:

    ...

Yes, you are absolutely right. In hindsight, that was clumsy of me. I normally try to keep those kind of operations outside of loops.

The new, improved, code then becomes:

Code:

#!/usr/bin/env python3

import sys 



key = list(sys.argv[1])

sortedKey = "".join(sorted(key))



with open("dbm.txt") as infile:

    for line in infile:

        sortedLine = "".join(sorted(list(line))).strip('\n')

        if sortedKey == sortedLine:

            print(line)

Thanks for pointing this out, appreciate it!

HMW

ntubski

09-22-2015 10:00 AM

It's possible to just count the letters without sorting (skeleton borrowed from HMW, untested):

Code:

#!/usr/bin/env python3

import sys

from collections import Counter



key = Counter(sys.argv[1])



with open("dbm.txt") as infile:

    for line in infile:

        if key == Counter(line.rstrip("\n\r")):

            print(line)

grail

09-22-2015 11:03 AM

@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me :(, but if you are only counting the letters, won't they all return (sorry, not real familiar with python)

I did the ruby version, but it is boring as looks almost like the perl one (perl was faster though but I believe ruby is still working on its io stuff):

Code:

#!/usr/bin/env ruby



def sortletters(w)

        w.downcase.split(//).sort.join

end



key = sortletters(ARGV[0])



IO.foreach(ARGV[1]) do |word|

        puts word if key == sortletters(word.chomp)

end

millgates

09-22-2015 11:48 AM

OK, let's try sed:

Code:

key=deront

sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile

ntubski

09-22-2015 12:05 PM

Quote:

Originally Posted by grail (Post 5424003)

@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me :(,

Hmm, not really sure, but the following prints False False True, so it should work in principle.

Code:

#!/usr/bin/env python3

import sys

from collections import Counter

 

key = Counter('deront')

 

def okay(word):

        return Counter(word) == key

 

print (okay('denote'))

print (okay('redone'))

print (okay('rodent'))

Quote:

but if you are only counting the letters, won't they all return (sorry, not real familiar with python)

Maybe Histogram would be a better name than Counter. The idea is to compare letter count of the "deront" against the letter count of each word.

grail

09-22-2015 12:26 PM

It seems when I print the key value it looks like:

Code:

Counter({'t': 1, 'n': 1, 'e': 1, 'd': 1, 'r': 1, 'o': 1})

So i am wondering if '==' is the correct comparison??

Tada :) ... turns out you need to strip line endings too:

Code:

if key == Counter(line.rstrip("\n\r")):

grail

09-22-2015 01:00 PM

And the winner for slowest version ... lol

Code:

#!/usr/bin/env bash



sortletters()

{

        local array



        for (( i = 0; i < ${#1}; i++ ))

        do

                array+=(${1:i:1})

        done



        printf "c\n" ${array[@]} | sort | tr '\n' '\0'

}



key=$(sortletters "$1")



while read line

do

        [[ "$key" == "$(sortletters "$line")" ]] && echo "$line"

done<"$2"

On my machine:

Code:

Perl:



real        0m0.136s

user        0m0.133s

sys        0m0.000s



Ruby:



real        0m0.485s

user        0m0.467s

sys        0m0.013s



Python:



real        0m0.557s

user        0m0.543s

sys        0m0.013s



Bash:



real        1m54.069s

user        0m5.203s

sys        0m8.727s

firstfire

09-22-2015 01:08 PM

Quote:

Originally Posted by grail (Post 5424062)

And the winner for slowest version ... lol

Haha. I tried bash version first but couldn't wait when it finishes.
It seems forum ate %-sign in your program :)

BTW here is another way to convert string into column (e.g. for sorting)

Code:

$ grep -o . <<< 'hello'

h

e

l

l

o

Stumped upon somewhere on the net.

danielbmartin

09-22-2015 03:04 PM

Quote:

Originally Posted by millgates (Post 5424022)

Code:

key=deront

sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile

I am blown away by this brilliant one-liner!
It is too dense for me to understand.
Please give us a step-by-step. Thanks!

Daniel B. Martin

All times are GMT -5. The time now is 04:39 PM.

Page 1 of 2

Show 50 post(s) from this thread on one page