LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 09-21-2015, 08:52 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Text processing - "must have" letters


Have: a file with many 6-character words, one per line.

Have: a key word with 6 characters. Example: deront

Want: a file containing all of the words from the original file which contain all of the letters in the key. Order is not important.

The key word may be thought of as a string of "Must Have" letters.
denote fails because it has no R.
redone fails because it has no T.
rodent passes the test because it contains all of the MH letters.

If the key contains 2 Rs then qualifying words must likewise have 2 Rs.

I've tried grep without success; tried tr without success.

As a matter of personal coding style I strive to avoid explicit loops.

Ideas?

Daniel B. Martin

Last edited by danielbmartin; 09-21-2015 at 08:54 PM.
 
Old 09-21-2015, 10:27 PM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428Reputation: 428Reputation: 428Reputation: 428Reputation: 428
Hi.

The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.
Code:
$ echo deront | sed 's/./&\n/g' | sort | tr -d '\n'
denort
$ echo rodent | sed 's/./&\n/g' | sort | tr -d '\n'
denort   # this word matches
$ echo denote | sed 's/./&\n/g' | sort | tr -d '\n'
deenot   # does not match
(I added newlines at end of output manually for clarity)
 
2 members found this post helpful.
Old 09-22-2015, 12:29 AM   #3
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428Reputation: 428Reputation: 428Reputation: 428Reputation: 428
Hi.

Here is the reasonably fast perl solution based on my previous post:
Code:
#!/usr/bin/env perl

sub sortletters { return join "", sort split //, lc $_[0]; }
my $key = sortletters (shift);
while (<>) {
	chomp;
	print "$_\n" if ($key eq sortletters $_);
}
Demo run:
Code:
$ time /tmp/must-have.pl binary  /usr/share/dict/words
binary
brainy

real    0m0.221s
user    0m0.217s
sys     0m0.004s

Last edited by firstfire; 09-22-2015 at 12:32 AM. Reason: Add lc (lowercase)
 
1 members found this post helpful.
Old 09-22-2015, 01:19 AM   #4
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
Quote:
Originally Posted by firstfire View Post
The simple idea that comes to mind is that words in question differ only by order of letters. So, if we sort letters in each word then simple comparison will tell you whether the word passes or not.
Clever approach, bravo!

Best regards,
HMW
 
Old 09-22-2015, 02:03 AM   #5
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
And here is one version in Python, basically just ripping off firstfire's splendid approach.

With this ("dbm.txt") as infile:
Code:
$ cat dbm.txt 
denote
redone
rodent
And this Python script:
Code:
#!/usr/bin/env python3
import sys 

key = list(sys.argv[1])

with open("dbm.txt") as infile:
    for line in infile:
        sortedLine = "".join(sorted(list(line))).strip('\n')
        if "".join(sorted(key)) == sortedLine:
            print(line)
Upon execution, I get this output:
Code:
$ ./must_have.py deront
rodent
Thanks for a cool little exercise. All credit goes to firstfire for the approach.

Best regards,
HMW

Last edited by HMW; 09-22-2015 at 04:46 AM. Reason: Minor spelling error
 
1 members found this post helpful.
Old 09-22-2015, 04:25 AM   #6
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428Reputation: 428Reputation: 428Reputation: 428Reputation: 428
Hi.

The execution will be a bit faster if you move sorting of key out of the loop:
Code:
key = list(sys.argv[1])
sortedKey = "".join(sorted(key))
  ...
  if sortedKey == sortedLine:
    ...
 
2 members found this post helpful.
Old 09-22-2015, 04:34 AM   #7
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
Quote:
Originally Posted by firstfire View Post
The execution will be a bit faster if you move sorting of key out of the loop:
Code:
key = list(sys.argv[1])
sortedKey = "".join(sorted(key))
  ...
  if sortedKey == sortedLine:
    ...
Yes, you are absolutely right. In hindsight, that was clumsy of me. I normally try to keep those kind of operations outside of loops.

The new, improved, code then becomes:
Code:
#!/usr/bin/env python3
import sys 

key = list(sys.argv[1])
sortedKey = "".join(sorted(key))

with open("dbm.txt") as infile:
    for line in infile:
        sortedLine = "".join(sorted(list(line))).strip('\n')
        if sortedKey == sortedLine:
            print(line)
Thanks for pointing this out, appreciate it!

HMW
 
1 members found this post helpful.
Old 09-22-2015, 10:00 AM   #8
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083
It's possible to just count the letters without sorting (skeleton borrowed from HMW, untested):
Code:
#!/usr/bin/env python3
import sys
from collections import Counter

key = Counter(sys.argv[1])

with open("dbm.txt") as infile:
    for line in infile:
        if key == Counter(line.rstrip("\n\r")):
            print(line)

Last edited by ntubski; 09-22-2015 at 12:47 PM. Reason: add fix from grail
 
Old 09-22-2015, 11:03 AM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192
@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me , but if you are only counting the letters, won't they all return (sorry, not real familiar with python)

I did the ruby version, but it is boring as looks almost like the perl one (perl was faster though but I believe ruby is still working on its io stuff):
Code:
#!/usr/bin/env ruby

def sortletters(w)
	w.downcase.split(//).sort.join
end

key = sortletters(ARGV[0])

IO.foreach(ARGV[1]) do |word|
	puts word if key == sortletters(word.chomp)
end
 
Old 09-22-2015, 11:48 AM   #10
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852

Rep: Reputation: 389Reputation: 389Reputation: 389Reputation: 389
OK, let's try sed:

Code:
key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile
 
2 members found this post helpful.
Old 09-22-2015, 12:05 PM   #11
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083Reputation: 2083
Quote:
Originally Posted by grail View Post
@ntubski - not only must I be doing it wrong as your code didn't seem to return anything for me ,
Hmm, not really sure, but the following prints False False True, so it should work in principle.

Code:
#!/usr/bin/env python3
import sys
from collections import Counter
 
key = Counter('deront')
 
def okay(word):
	return Counter(word) == key
 
print (okay('denote'))
print (okay('redone'))
print (okay('rodent'))
Quote:
but if you are only counting the letters, won't they all return (sorry, not real familiar with python)
Maybe Histogram would be a better name than Counter. The idea is to compare letter count of the "deront" against the letter count of each word.
 
Old 09-22-2015, 12:26 PM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192
It seems when I print the key value it looks like:
Code:
Counter({'t': 1, 'n': 1, 'e': 1, 'd': 1, 'r': 1, 'o': 1})
So i am wondering if '==' is the correct comparison??


Tada ... turns out you need to strip line endings too:
Code:
if key == Counter(line.rstrip("\n\r")):

Last edited by grail; 09-22-2015 at 12:35 PM.
 
Old 09-22-2015, 01:00 PM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192Reputation: 3192
And the winner for slowest version ... lol
Code:
#!/usr/bin/env bash

sortletters()
{
	local array

	for (( i = 0; i < ${#1}; i++ ))
	do
		array+=(${1:i:1})
	done

	printf "c\n" ${array[@]} | sort | tr '\n' '\0'
}

key=$(sortletters "$1")

while read line
do
	[[ "$key" == "$(sortletters "$line")" ]] && echo "$line"
done<"$2"
On my machine:
Code:
Perl:

real	0m0.136s
user	0m0.133s
sys	0m0.000s

Ruby:

real	0m0.485s
user	0m0.467s
sys	0m0.013s

Python:

real	0m0.557s
user	0m0.543s
sys	0m0.013s

Bash:

real	1m54.069s
user	0m5.203s
sys	0m8.727s

Last edited by grail; 09-22-2015 at 01:01 PM.
 
Old 09-22-2015, 01:08 PM   #14
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428Reputation: 428Reputation: 428Reputation: 428Reputation: 428
Quote:
Originally Posted by grail View Post
And the winner for slowest version ... lol
Haha. I tried bash version first but couldn't wait when it finishes.
It seems forum ate %-sign in your program

BTW here is another way to convert string into column (e.g. for sorting)
Code:
$ grep -o . <<< 'hello'
h
e
l
l
o
Stumped upon somewhere on the net.
 
1 members found this post helpful.
Old 09-22-2015, 03:04 PM   #15
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by millgates View Post
Code:
key=deront
sed -rn "h; s/.*/&#$key/;:a s/(.)(.*#.*)\1/\2/;ta;/[^#]/!{g;p}" <$InFile
I am blown away by this brilliant one-liner!
It is too dense for me to understand.
Please give us a step-by-step. Thanks!

Daniel B. Martin
 
  


Reply

Tags
awk, grep, sed, text processing, tr



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Text processing -- UPPER CASE doubled letters in second word of each line danielbmartin Programming 34 09-29-2014 06:55 AM
[SOLVED] Delete lines from a text file not starting with "http://" or "https://" georgi Programming 4 10-04-2013 03:00 AM
[SOLVED] grep text from a line in between "start" and "end" word deepakdeore2004 Programming 7 08-07-2013 09:45 AM
Can't stop fast typing double letters from "skipping" one of the letters. Lola Kews Ubuntu 3 04-20-2013 03:21 PM
Openldap Authentication error 'send_ldap_result: err=49 matched="" text=""' mahao Linux - Server 1 03-07-2011 12:56 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 08:31 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration