[SOLVED] Text processing

danielbmartin · 09-21-2015, 08:52 PM

Have: a file with many 6-character words, one per line.

Have: a key word with 6 characters. Example: deront

Want: a file containing all of the words from the original file which contain all of the letters in the key. Order is not important.

The key word may be thought of as a string of "Must Have" letters.
denote fails because it has no R.
redone fails because it has no T.
rodent passes the test because it contains all of the MH letters.

If the key contains 2 Rs then qualifying words must likewise have 2 Rs.

I've tried grep without success; tried tr without success.

As a matter of personal coding style I strive to avoid explicit loops.

Ideas?

Daniel B. Martin

firstfire · 09-21-2015, 10:27 PM

Hi.

The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.

Code:

$ echo deront | sed 's/./&\n/g' | sort | tr -d '\n'
denort
$ echo rodent | sed 's/./&\n/g' | sort | tr -d '\n'
denort   # this word matches
$ echo denote | sed 's/./&\n/g' | sort | tr -d '\n'
deenot   # does not match

(I added newlines at end of output manually for clarity)

firstfire · 09-22-2015, 12:29 AM

Hi.

Here is the reasonably fast perl solution based on my previous post:

Code:

#!/usr/bin/env perl

sub sortletters { return join "", sort split //, lc $_[0]; }
my $key = sortletters (shift);
while (<>) {
	chomp;
	print "$_\n" if ($key eq sortletters $_);
}

Demo run:

Code:

$ time /tmp/must-have.pl binary  /usr/share/dict/words
binary
brainy

real    0m0.221s
user    0m0.217s
sys     0m0.004s

HMW · 09-22-2015, 01:19 AM

Quote:

Originally Posted by firstfire

The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.

Clever approach, bravo!

Best regards,
HMW

HMW · 09-22-2015, 02:03 AM

And here is one version in Python, basically just ripping off firstfire's splendid approach.

With this ("dbm.txt") as infile:

Code:

$ cat dbm.txt 
denote
redone
rodent

And this Python script:

Code:

#!/usr/bin/env python3
import sys 

key = list(sys.argv[1])

with open("dbm.txt") as infile:
    for line in infile:
        sortedLine = "".join(sorted(list(line))).strip('\n')
        if "".join(sorted(key)) == sortedLine:
            print(line)

Upon execution, I get this output:

Code:

$ ./must_have.py deront
rodent

Thanks for a cool little exercise. All credit goes to firstfire for the approach.

Best regards,
HMW

firstfire · 09-22-2015, 04:25 AM

Hi.

The execution will be a bit faster if you move sorting of key out of the loop:

Code:

key = list(sys.argv[1])
sortedKey = "".join(sorted(key))
  ...
  if sortedKey == sortedLine:
    ...

HMW · 09-22-2015, 04:34 AM

Quote:

Originally Posted by firstfire

The execution will be a bit faster if you move sorting of key out of the loop:

Code:

key = list(sys.argv[1])
sortedKey = "".join(sorted(key))
  ...
  if sortedKey == sortedLine:
    ...

Yes, you are absolutely right. In hindsight, that was clumsy of me. I normally try to keep those kind of operations outside of loops.

The new, improved, code then becomes:

Code:

#!/usr/bin/env python3
import sys 

key = list(sys.argv[1])
sortedKey = "".join(sorted(key))

with open("dbm.txt") as infile:
    for line in infile:
        sortedLine = "".join(sorted(list(line))).strip('\n')
        if sortedKey == sortedLine:
            print(line)

Thanks for pointing this out, appreciate it!

HMW

ntubski · 09-22-2015, 10:00 AM

It's possible to just count the letters without sorting (skeleton borrowed from HMW, untested):

Code:

#!/usr/bin/env python3
import sys
from collections import Counter

key = Counter(sys.argv[1])

with open("dbm.txt") as infile:
    for line in infile:
        if key == Counter(line.rstrip("\n\r")):
            print(line)

grail · 09-22-2015, 11:03 AM

@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me

, but if you are only counting the letters, won't they all return (sorry, not real familiar with python)

I did the ruby version, but it is boring as looks almost like the perl one (perl was faster though but I believe ruby is still working on its io stuff):

Code:

#!/usr/bin/env ruby

def sortletters(w)
	w.downcase.split(//).sort.join
end

key = sortletters(ARGV[0])

IO.foreach(ARGV[1]) do |word|
	puts word if key == sortletters(word.chomp)
end

millgates · 09-22-2015, 11:48 AM

OK, let's try sed:

Code:

key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile

ntubski · 09-22-2015, 12:05 PM

Quote:

Originally Posted by grail

@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me

,

Hmm, not really sure, but the following prints False False True, so it should work in principle.

Code:

#!/usr/bin/env python3
import sys
from collections import Counter
 
key = Counter('deront')
 
def okay(word):
	return Counter(word) == key
 
print (okay('denote'))
print (okay('redone'))
print (okay('rodent'))

Quote:

but if you are only counting the letters, won't they all return (sorry, not real familiar with python)

Maybe Histogram would be a better name than Counter. The idea is to compare letter count of the "deront" against the letter count of each word.

grail · 09-22-2015, 12:26 PM

It seems when I print the key value it looks like:

Code:

Counter({'t': 1, 'n': 1, 'e': 1, 'd': 1, 'r': 1, 'o': 1})

So i am wondering if '==' is the correct comparison??

Tada

... turns out you need to strip line endings too:

Code:

if key == Counter(line.rstrip("\n\r")):

grail · 09-22-2015, 01:00 PM

And the winner for slowest version ... lol

Code:

#!/usr/bin/env bash

sortletters()
{
	local array

	for (( i = 0; i < ${#1}; i++ ))
	do
		array+=(${1:i:1})
	done

	printf "c\n" ${array[@]} | sort | tr '\n' '\0'
}

key=$(sortletters "$1")

while read line
do
	[[ "$key" == "$(sortletters "$line")" ]] && echo "$line"
done<"$2"

On my machine:

Code:

Perl:

real	0m0.136s
user	0m0.133s
sys	0m0.000s

Ruby:

real	0m0.485s
user	0m0.467s
sys	0m0.013s

Python:

real	0m0.557s
user	0m0.543s
sys	0m0.013s

Bash:

real	1m54.069s
user	0m5.203s
sys	0m8.727s

firstfire · 09-22-2015, 01:08 PM

Quote:

Originally Posted by grail

And the winner for slowest version ... lol

Haha. I tried bash version first but couldn't wait when it finishes.
It seems forum ate %-sign in your program

BTW here is another way to convert string into column (e.g. for sorting)

Code:

$ grep -o . <<< 'hello'
h
e
l
l
o

Stumped upon somewhere on the net.

danielbmartin · 09-22-2015, 03:04 PM

Quote:

Originally Posted by millgates

Code:

key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile

I am blown away by this brilliant one-liner!
It is too dense for me to understand.
Please give us a step-by-step. Thanks!

Daniel B. Martin