bash file comparing question.

angeloban · 06-09-2009, 01:54 PM

Good afternoon

I have been messing with diff and grep for 2 days now
without result

I am trying to match a file consisting of words
to many separate other wordfiles in a specific directory.
one by one.

What I want the script to do is report how many words
my main file has in common with each file in the directory,
one file at a time.

setup:
/home/banana/lists/ has following files:
apples
oranges
kiwis

each of them is a plain text file with 1 word per line

/home/banana/wordfile is a text file with 1 word per line
(same format as the ones in /home/banana/lists)

output should be something like:

$ ./script wordfile /home/banana/lists/

SCRIPT REPORT:
the file apples shares 5 matches with wordfile
the file oranges shares 1 matches with wordfile
the file kiwis shares 3 matches with wordfile
$

I'm at a loss here, boys

the diff man page is just Chinese to me
I have been trying to use combinations of grep and wc -l
but it was never successful

pixellany · 06-09-2009, 02:13 PM

What part of diff do you not understand? To be sure, the man page shows a lot of options and formats, but the basic command is pretty simple.

You need to start with "pseudo-code". Here is a simple example---based on what I believe you are trying to do:

Code:

Initialize as required---e.g. any counters.
Loop_1: loop on the contents of WORDFILE (the master)---for each word in this file (MAINWORD):
     Loop_2:  loop through the list of filenames you want to check; for each of these (SCANFILE):
          Loop_3:  loop thru the words (SCANWORD) in SCANFILE
                 SCANWORD matches MAINWORD?
                      Yes:  Increment <SCANFILE>_COUNT by one; continue  (Note that you need one of these for each value of SCANFILE.)
                      No:   continue
          end Loop_3
     end Loop_2
end Loop_1

Now print the results. This depends a bit on where you got the SCANFILE values. If they are not already in the list, then you need to add code to capture them.

Why do I have the feeling that there must already be a utility that does this, or at least makes it easier?
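
For illustration, the pseudocode above could be sketched in bash roughly as follows. This is a minimal sketch, not a tuned solution: the function name count_by_loops is hypothetical, the loop over the files is moved to the outside so that a single counter is enough, and every input file is assumed to end with a newline. Like the pseudocode, it counts a word once for every line on which it appears in the scanned file.

```shell
#!/bin/bash
# Hypothetical helper: count_by_loops WORDFILE SCANFILE...
# Compares every word of WORDFILE against every word of each SCANFILE.
count_by_loops() {
    local wordfile=$1
    shift
    local scanfile mainword scanword count
    for scanfile in "$@"; do
        count=0
        # Loop over the words of the master file.
        while IFS= read -r mainword; do
            # Skip empty lines in the master file.
            [ -n "$mainword" ] || continue
            # Loop over the words of the file being scanned.
            while IFS= read -r scanword; do
                if [ "$scanword" = "$mainword" ]; then
                    count=$((count + 1))
                fi
            done < "$scanfile"
        done < "$wordfile"
        echo "the file $(basename "$scanfile") shares $count matches with $(basename "$wordfile")"
    done
}
```

It would be called as count_by_loops wordfile /home/banana/lists/*. Because it rereads each scanned file once per word, it gets slow on large lists, which is one reason to look for an existing utility.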

angeloban · 06-09-2009, 02:26 PM

Thanks for the reply.

I tried pseudocode, but since I couldn't even program
a hello world myself, I decided to use a language where all the commands
already exist and I just have to link them all up, somehow...

I want to check the contents of file1 against file2
and have the script print how many matching words both files
have: not which words match, just how many matches were detected.
Each file is a column of words, one per line.

The diff man page didn't show me any obvious way to do that.
I was thinking of approaches like:

find /home/banana/lists -type f -exec "cat mywordlist|diff|wc -l" '{}' \; (this is my pseudocode)
the wc -l would count how many strings are equal in both files
since all i can do is make "diff" spew out lines of both files

bigearsbilly · 06-09-2009, 02:37 PM

as a start,
try

grep -cf wordlist apples oranges kiwis

is this what you mean sort of?

angeloban · 06-09-2009, 02:47 PM

I have already tried that approach:

[gradius@galaga wordlists]$ grep -cf wordlist apples oranges
apples:25823
oranges:72
[gradius@galaga wordlists]$ wc -l apples
25823 apples
[gradius@galaga wordlists]$ wc -l oranges
72 oranges
[gradius@galaga wordlists]$

If it worked, it would show the number of words that wordlist has in common
with apples and with oranges. Instead it just reports each file's line count.

I echoed a word that was in "oranges" into wordlist and ran
grep -cf again, and the numbers did not change.

ntubski · 06-09-2009, 07:10 PM

works for me: (although the command should probably be grep -Fcf wordlist unless your wordlist is really a regexp list.)

Code:

~/tmp$ cat > apples
red
green
yellow
dark red
~/tmp$ cat > oranges
orange
ping
pink
green
orange'y
~/tmp$ echo red >wordlist
~/tmp$ grep -cf wordlist apples oranges 
apples:2
oranges:0
~/tmp$ grep -V
GNU grep 2.5.3

Copyright (C) 1988, 1992-2002, 2004, 2005  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

What are the contents of your wordlist?
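
One possible explanation, not confirmed anywhere in this thread, is an empty line in the wordlist: with -f, an empty line is an empty pattern, and an empty pattern matches every line, so -c reports each file's full line count. The sketch below strips empty lines from the pattern list before matching. The function name count_whole_line_matches is hypothetical, and reading patterns from standard input with -f - is assumed to be available, as it is in GNU grep.

```shell
#!/bin/bash
# Hypothetical helper: count_whole_line_matches WORDLIST FILE...
count_whole_line_matches() {
    local wordlist=$1
    shift
    # grep -v '^$' drops empty lines from the pattern list.
    # -F: patterns are fixed strings, not regular expressions.
    # -x: a pattern must match the whole line.
    # -c: print a count of matching lines per file.
    # -f -: read the patterns from standard input.
    grep -v '^$' "$wordlist" | grep -Fxc -f - "$@"
}
```

With the sample apples and oranges files above and a wordlist holding "red" plus one empty line, plain grep -cf wordlist apples oranges counts every line (apples:4, oranges:5), while count_whole_line_matches wordlist apples oranges prints apples:1 and oranges:0. The -x option is also why "dark red" no longer counts as a match for "red".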

ghostdog74 · 06-10-2009, 12:04 AM

if you have Python

Code:

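#!/usr/bin/env python3
import glob
import sys

wordfile = sys.argv[1]
# Read the master list once and keep it as a set of unique words.
with open(wordfile) as f:
    words = set(f.read().split())
# "file[12]" is a glob pattern matching file1 and file2; adjust as needed.
for name in sorted(glob.glob("file[12]")):
    with open(name) as f:
        data = f.read().split()
    # Count the unique words the two files have in common.
    matches = len(words.intersection(data))
    print("the file %s has %d matches with wordfile: %s" % (name, matches, wordfile))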

output

Code:

# more file
one
two
three
four
five
six
seven
eight
nine
ten

# more file1
two
one
eleven
twelve
three
fourteen

# more file2
fifteen
sixteen
one
two
seventeen

# python test.py file
the file file1 has 3 matches with wordfile: file
the file file2 has 2 matches with wordfile: file

bigearsbilly · 06-10-2009, 05:01 AM

I think you've confused us all.

what about man comm

how about a decent example?

angeloban · 06-10-2009, 12:51 PM

Hello, thank you all for your help. I figured it out,
though the grep example still does not work.

comm was the way to go.
This was what i needed:

[root@galaga wordlists]# comm wordlist apples -1 -2 | echo "list has `wc -l` matches"
list has 4 matches

It shows how many words wordlist and apples have in common.
Sorry for any confusion.
I haven't tried the Python example yet, but I will once I have managed
to change the file[12] pattern to filenames read from the command line.
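
For the report format asked for in the first post, the comm approach could be wrapped in a loop along these lines. This is a sketch under a few assumptions: the function name report_matches is hypothetical, the directory holds only word lists, and both inputs are run through sort -u first, because comm expects sorted input and duplicate words would otherwise be counted more than once.

```shell
#!/bin/bash
# Hypothetical helper: report_matches WORDFILE DIRECTORY
report_matches() {
    local wordfile=$1
    local dir=${2%/}    # drop a trailing slash from the directory name
    local sorted list count
    sorted=$(mktemp) || return 1
    # comm needs sorted input; -u also removes duplicate words.
    sort -u "$wordfile" > "$sorted"
    echo "SCRIPT REPORT:"
    for list in "$dir"/*; do
        # Skip subdirectories and anything else that is not a regular file.
        [ -f "$list" ] || continue
        # comm -12 hides the lines unique to either file,
        # leaving only the lines both files share.
        count=$(sort -u "$list" | comm -12 "$sorted" - | wc -l)
        # $((count)) strips any padding that wc adds around the number.
        echo "the file $(basename "$list") shares $((count)) matches with $(basename "$wordfile")"
    done
    rm -f "$sorted"
}

# When saved as a script and given two arguments, produce the report.
if [ "$#" -eq 2 ]; then
    report_matches "$1" "$2"
fi
```

Saved as ./script, it would be run as ./script /home/banana/wordfile /home/banana/lists/. The files are reported in the shell's glob order, so apples, kiwis, oranges rather than the order shown in the first post.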