LinuxQuestions.org
Old 06-09-2009, 01:54 PM   #1
angeloban
LQ Newbie
 
Registered: Jun 2009
Posts: 9

Rep: Reputation: 0
bash file comparing question.


Good afternoon.

I have been experimenting with diff and grep for two days now
without result.

I am trying to compare a file of words against each of several
other word files in a specific directory, one file at a time.

I want the script to report how many words my main file has
in common with each file in the directory,
each in turn.

Setup:
/home/banana/lists/ has the following files:
apples
oranges
kiwis

Each of them is a plain text file with one word per line.

/home/banana/wordfile is a text file with one word per line
(the same format as the files in /home/banana/lists).

The output should be something like:

$ ./script wordfile /home/banana/lists/

SCRIPT REPORT:
the file apples shares 5 matches with wordfile
the file oranges shares 1 matches with wordfile
the file kiwis shares 3 matches with wordfile
$

I'm at a loss here.
The diff man page is incomprehensible to me.
I have been trying combinations of grep and wc -l,
but never successfully.
 
Old 06-09-2009, 02:13 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
What part of diff do you not understand? To be sure, the man page shows a lot of options and formats, but the basic command is pretty simple.

You need to start with "pseudo-code". Here is a simple example---based on what I believe you are trying to do:
Code:
Initialize as required, e.g. any counters.
Loop_1: loop on the contents of WORDFILE (the master); for each word in this file (MAINWORD):
     Loop_2:  loop thru the list of filenames that you want to check; for each of these (SCANFILE):
          Loop_3:  loop thru the words (SCANWORD) in SCANFILE
                 SCANWORD matches MAINWORD?
                      Yes:  Increment <SCANFILE>_COUNT by one; continue  (Note that you need one counter for each value of SCANFILE.)
                      No:   continue
          end Loop_3
     end Loop_2
end Loop_1
Now print the results. This depends a bit on where you got the SCANFILE values. If they are not already in a list, then you need to add code to capture them.

Why do I have the feeling that there must already be a utility that does this, or at least makes it easier?
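
As a rough illustration, the nested loops in the pseudocode above might look like this in POSIX shell. This is only a sketch under assumptions: the function name count_matches_loops is made up for the example, the loop over files is moved outermost so that a single counter per file is enough, and every file is assumed to hold one word per line with a trailing newline. Duplicate words are counted each time they pair up.

```shell
#!/bin/sh
# Sketch of the three nested loops from the pseudocode.
# count_matches_loops is a hypothetical helper name, not from the thread.
# Usage: count_matches_loops WORDFILE SCANFILE...
count_matches_loops() {
    wordfile=$1
    shift
    for scanfile in "$@"; do                    # Loop_2: each file to check
        count=0
        while IFS= read -r mainword; do         # Loop_1: each word in the master
            while IFS= read -r scanword; do     # Loop_3: each word in SCANFILE
                if [ "$mainword" = "$scanword" ]; then
                    count=$((count + 1))
                fi
            done < "$scanfile"
        done < "$wordfile"
        echo "the file $(basename "$scanfile") shares $count matches with $(basename "$wordfile")"
    done
}
```

It could be called as count_matches_loops /home/banana/wordfile /home/banana/lists/*. It rereads each scan file once per master word, so it is slow on large lists; it is meant to show the logic, not to be fast.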
 
Old 06-09-2009, 02:26 PM   #3
angeloban
LQ Newbie
 
Registered: Jun 2009
Posts: 9

Original Poster
Rep: Reputation: 0
Thanks for the reply.

I have tried pseudocode, but since I couldn't even program a
hello world myself, I decided to use a language where all the commands
already exist and I just have to link them together somehow.

I want to compare the contents of file1 with file2
and have the script print how many matching words the two files
have: not which words match, just how many matches were detected.
Each file is a column of words, one per line.

The diff man page didn't show me any obvious way to do this.
I was thinking of approaches like:

find /home/banana/lists -type f -exec "cat mywordlist|diff|wc -l" '{}' \; (this is my pseudocode)
The wc -l would count how many lines are equal in both files,
since all I can get "diff" to do is print lines from both files.
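
A note on the find idea above: -exec runs a single command with arguments, not a quoted pipeline, so a pipeline has to be wrapped in sh -c. The following is a minimal sketch of that pattern, not something posted in the thread. The function name report_with_find is hypothetical, and it swaps diff for grep -Fxc, which counts the lines of each file that exactly equal some line of the word file.

```shell
#!/bin/sh
# report_with_find is a hypothetical helper name.
# Usage: report_with_find WORDFILE DIRECTORY
report_with_find() {
    # find passes the word file first, then a batch of found files, to sh -c.
    find "$2" -type f -exec sh -c '
        wordfile=$1
        shift
        for f in "$@"; do
            # -F: fixed strings, -x: whole-line match, -c: count matching lines
            n=$(grep -Fxc -f "$wordfile" "$f")
            echo "the file $(basename "$f") shares $n matches with $(basename "$wordfile")"
        done
    ' sh "$1" {} +
}
```

For the setup in the first post this would be run as report_with_find /home/banana/wordfile /home/banana/lists. The order of the report lines follows whatever order find returns the files in.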

 
Old 06-09-2009, 02:37 PM   #4
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
As a start, try:

grep -cf wordlist apples oranges kiwis

Is this roughly what you mean?
 
Old 06-09-2009, 02:47 PM   #5
angeloban
LQ Newbie
 
Registered: Jun 2009
Posts: 9

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by bigearsbilly
as a start,
try

grep -cf wordlist apples oranges kiwis

is this what you mean sort of?

I have already tried that approach:

[gradius@galaga wordlists]$ grep -cf wordlist apples oranges
apples:25823
oranges:72
[gradius@galaga wordlists]$ wc -l apples
25823 apples
[gradius@galaga wordlists]$ wc -l oranges
72 oranges
[gradius@galaga wordlists]$


If it worked, it would show the number of words that wordlist has in common
with apples and with oranges; instead it just reports each file's line count.

I echo'd a word that is in "oranges" into wordlist and ran
grep -cf again, and the numbers did not change.
 
Old 06-09-2009, 07:10 PM   #6
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2082
It works for me (although the command should probably be grep -Fcf wordlist, unless your wordlist is really a list of regexps):
Code:
~/tmp$ cat > apples
red
green
yellow
dark red
~/tmp$ cat > oranges
orange
ping
pink
green
orange'y
~/tmp$ echo red >wordlist
~/tmp$ grep -cf wordlist apples oranges 
apples:2
oranges:0
~/tmp$ grep -V
GNU grep 2.5.3

Copyright (C) 1988, 1992-2002, 2004, 2005  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
What are the contents of your wordlist?
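
Two things in the transcripts above are worth a small demonstration. First, the pattern "red" also counts the line "dark red", because grep matches substrings unless told otherwise. Second, a count equal to the file's line count is what GNU grep gives when the pattern file contains an empty line, since an empty pattern matches every line. Whether that was the cause here is only a guess, as the thread never shows the wordlist. The sketch below uses throwaway sample files, and the variable names are invented for the example.

```shell
#!/bin/sh
# Build throwaway sample files in a temporary directory (names are illustrative).
demo_dir=$(mktemp -d)
printf 'red\ngreen\nyellow\ndark red\n' > "$demo_dir/apples"
printf 'red\n\n' > "$demo_dir/wordlist"     # note the trailing blank line

# Plain -f: the empty pattern matches every line, so the count is the line count.
plain_count=$(grep -c -f "$demo_dir/wordlist" "$demo_dir/apples")

# -F -x: patterns are fixed strings that must match a whole line.
exact_count=$(grep -Fxc -f "$demo_dir/wordlist" "$demo_dir/apples")

echo "plain: $plain_count, exact: $exact_count"
rm -rf "$demo_dir"
```

With GNU grep this should report 4 for the plain count (every line of apples) and 1 for the exact count (only the line that is exactly "red").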
 
Old 06-10-2009, 12:04 AM   #7
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
If you have Python:
Code:
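#!/usr/bin/env python3
import glob
import sys

# The master word list is the first command-line argument.
wordfile = sys.argv[1]
with open(wordfile) as f:
    words = set(f.read().split())

# Compare against every file matching the pattern (file1 and file2 here).
for path in glob.glob("file[12]"):
    with open(path) as f:
        data = f.read().split()
    matches = len(words.intersection(data))
    print("the file %s has %d matches with wordfile: %s" % (path, matches, wordfile))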
Output:
Code:
# more file
one
two
three
four
five
six
seven
eight
nine
ten

# more file1
two
one
eleven
twelve
three
fourteen

# more file2
fifteen
sixteen
one
two
seventeen

# python test.py file
the file file1 has 3 matches with wordfile: file
the file file2 has 2 matches with wordfile: file
 
Old 06-10-2009, 05:01 AM   #8
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
I think you've confused us all.

Have you looked at man comm?

And how about giving us a decent example?
 
Old 06-10-2009, 12:51 PM   #9
angeloban
LQ Newbie
 
Registered: Jun 2009
Posts: 9

Original Poster
Rep: Reputation: 0
Hello, and thank you all for your help. I figured it out,
though the grep example still does not work for me.

comm was the way to go.
This is what I needed:

[root@galaga wordlists]# comm wordlist apples -1 -2 | echo "list has `wc -l` matches"
list has 4 matches


It shows how many words match between wordlist and apples.
Sorry for any confusion.
I haven't tried the Python example yet, but I will once I have managed
to change the file[12] pattern so that the filenames are read from the command line.
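
One caveat with the comm approach: comm expects both inputs to be sorted, so unsorted word lists can give wrong counts. A minimal sketch that sorts first and produces the report from the first post could look like this. The function name report_with_comm and the use of temporary files are choices made for the example, not something from the thread.

```shell
#!/bin/sh
# report_with_comm is a hypothetical helper name.
# Usage: report_with_comm WORDFILE DIRECTORY
report_with_comm() {
    sorted_words=$(mktemp)
    sorted_list=$(mktemp)
    # comm expects sorted input; -u also drops duplicate words.
    sort -u "$1" > "$sorted_words"
    for f in "$2"/*; do
        [ -f "$f" ] || continue
        sort -u "$f" > "$sorted_list"
        # -12 suppresses columns 1 and 2, leaving only lines common to both files.
        n=$(comm -12 "$sorted_words" "$sorted_list" | wc -l | tr -d ' ')
        echo "the file $(basename "$f") shares $n matches with $(basename "$1")"
    done
    rm -f "$sorted_words" "$sorted_list"
}
```

It would be run as report_with_comm /home/banana/wordfile /home/banana/lists. Because of sort -u, a word that appears several times in a file is counted once.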
 
  

