LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Text processing - "must have" letters (https://www.linuxquestions.org/questions/programming-9/text-processing-must-have-letters-4175554135/)

danielbmartin 09-21-2015 08:52 PM

Text processing - "must have" letters
 
Have: a file with many 6-character words, one per line.

Have: a key word with 6 characters. Example: deront

Want: a file containing all of the words from the original file which contain all of the letters in the key. Order is not important.

The key word may be thought of as a string of "Must Have" letters.
denote fails because it has no R.
redone fails because it has no T.
rodent passes the test because it contains all of the MH letters.

If the key contains 2 Rs then qualifying words must likewise have 2 Rs.

I've tried grep without success; tried tr without success.

As a matter of personal coding style I strive to avoid explicit loops.

Ideas?

Daniel B. Martin

firstfire 09-21-2015 10:27 PM

Hi.

The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.
Code:

$ echo deront | sed 's/./&\n/g' | sort | tr -d '\n'
denort
$ echo rodent | sed 's/./&\n/g' | sort | tr -d '\n'
denort  # this word matches
$ echo denote | sed 's/./&\n/g' | sort | tr -d '\n'
deenot  # does not match

(I added newlines at end of output manually for clarity)

firstfire 09-22-2015 12:29 AM

Hi.

Here is the reasonably fast perl solution based on my previous post:
Code:

#!/usr/bin/env perl

sub sortletters { return join "", sort split //, lc $_[0]; }
my $key = sortletters (shift);
while (<>) {
        chomp;
        print "$_\n" if ($key eq sortletters $_);
}

Demo run:
Code:

$ time /tmp/must-have.pl binary  /usr/share/dict/words
binary
brainy

real    0m0.221s
user    0m0.217s
sys    0m0.004s


HMW 09-22-2015 01:19 AM

Quote:

Originally Posted by firstfire (Post 5423744)
The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.

Clever approach, bravo!

Best regards,
HMW

HMW 09-22-2015 02:03 AM

And here is one version in Python, basically just ripping off firstfire's splendid approach.

With this ("dbm.txt") as infile:
Code:

$ cat dbm.txt
denote
redone
rodent

And this Python script:
Code:

#!/usr/bin/env python3
import sys

key = list(sys.argv[1])

with open("dbm.txt") as infile:
    for line in infile:
        sortedLine = "".join(sorted(list(line))).strip('\n')
        if "".join(sorted(key)) == sortedLine:
            print(line)

Upon execution, I get this output:
Code:

$ ./must_have.py deront
rodent

Thanks for a cool little exercise. All credit goes to firstfire for the approach.

Best regards,
HMW

firstfire 09-22-2015 04:25 AM

Hi.

The execution will be a bit faster if you move sorting of key out of the loop:
Code:

key = list(sys.argv[1])
sortedKey = "".join(sorted(key))
  ...
  if sortedKey == sortedLine:
    ...

:)

HMW 09-22-2015 04:34 AM

Quote:

Originally Posted by firstfire (Post 5423839)
The execution will be a bit faster if you move sorting of key out of the loop:
Code:

key = list(sys.argv[1])
sortedKey = "".join(sorted(key))
  ...
  if sortedKey == sortedLine:
    ...


Yes, you are absolutely right. In hindsight, that was clumsy of me. I normally try to keep those kind of operations outside of loops.

The new, improved, code then becomes:
Code:

#!/usr/bin/env python3
import sys

key = list(sys.argv[1])
sortedKey = "".join(sorted(key))

with open("dbm.txt") as infile:
    for line in infile:
        sortedLine = "".join(sorted(list(line))).strip('\n')
        if sortedKey == sortedLine:
            print(line)

Thanks for pointing this out, appreciate it!

HMW

ntubski 09-22-2015 10:00 AM

It's possible to just count the letters without sorting (skeleton borrowed from HMW, untested):
Code:

#!/usr/bin/env python3
import sys
from collections import Counter

key = Counter(sys.argv[1])

with open("dbm.txt") as infile:
    for line in infile:
        if key == Counter(line.rstrip("\n\r")):
            print(line)


grail 09-22-2015 11:03 AM

@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me :(, but if you are only counting the letters, won't they all return (sorry, not real familiar with python)

I did the ruby version, but it is boring as looks almost like the perl one (perl was faster though but I believe ruby is still working on its io stuff):
Code:

#!/usr/bin/env ruby

def sortletters(w)
        w.downcase.split(//).sort.join
end

key = sortletters(ARGV[0])

IO.foreach(ARGV[1]) do |word|
        puts word if key == sortletters(word.chomp)
end


millgates 09-22-2015 11:48 AM

OK, let's try sed:

Code:

key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile


ntubski 09-22-2015 12:05 PM

Quote:

Originally Posted by grail (Post 5424003)
@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me :(,

Hmm, not really sure, but the following prints False False True, so it should work in principle.

Code:

#!/usr/bin/env python3
import sys
from collections import Counter
 
key = Counter('deront')
 
def okay(word):
        return Counter(word) == key
 
print (okay('denote'))
print (okay('redone'))
print (okay('rodent'))

Quote:

but if you are only counting the letters, won't they all return (sorry, not real familiar with python)
Maybe Histogram would be a better name than Counter. The idea is to compare letter count of the "deront" against the letter count of each word.

grail 09-22-2015 12:26 PM

It seems when I print the key value it looks like:
Code:

Counter({'t': 1, 'n': 1, 'e': 1, 'd': 1, 'r': 1, 'o': 1})
So i am wondering if '==' is the correct comparison??


Tada :) ... turns out you need to strip line endings too:
Code:

if key == Counter(line.rstrip("\n\r")):

grail 09-22-2015 01:00 PM

And the winner for slowest version ... lol
Code:

#!/usr/bin/env bash

sortletters()
{
        local array

        for (( i = 0; i < ${#1}; i++ ))
        do
                array+=(${1:i:1})
        done

        printf "c\n" ${array[@]} | sort | tr '\n' '\0'
}

key=$(sortletters "$1")

while read line
do
        [[ "$key" == "$(sortletters "$line")" ]] && echo "$line"
done<"$2"

On my machine:
Code:

Perl:

real        0m0.136s
user        0m0.133s
sys        0m0.000s

Ruby:

real        0m0.485s
user        0m0.467s
sys        0m0.013s

Python:

real        0m0.557s
user        0m0.543s
sys        0m0.013s

Bash:

real        1m54.069s
user        0m5.203s
sys        0m8.727s


firstfire 09-22-2015 01:08 PM

Quote:

Originally Posted by grail (Post 5424062)
And the winner for slowest version ... lol

Haha. I tried bash version first but couldn't wait when it finishes.
It seems forum ate %-sign in your program :)

BTW here is another way to convert string into column (e.g. for sorting)
Code:

$ grep -o . <<< 'hello'
h
e
l
l
o

Stumped upon somewhere on the net.

danielbmartin 09-22-2015 03:04 PM

Quote:

Originally Posted by millgates (Post 5424022)
Code:

key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile


I am blown away by this brilliant one-liner!
It is too dense for me to understand.
Please give us a step-by-step. Thanks!

Daniel B. Martin


All times are GMT -5. The time now is 04:39 PM.