Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I have been messing with diff and grep for two days now without result.
I am trying to match a file consisting of words against many separate word files in a specific directory, one by one.
What I want the script to do is report how many matching words my main file has with every file in the directory, each in turn.
setup:
/home/banana/lists/ has the following files:
apples
oranges
kiwis
Each of them is a plain text file with one word per line.
/home/banana/wordfile is a text file with one word per line
(same format as the ones in /home/banana/lists).
output should be something like:
$ ./script wordfile /home/banana/lists/
SCRIPT REPORT:
the file apples shares 5 matches with wordfile
the file oranges shares 1 matches with wordfile
the file kiwis shares 3 matches with wordfile
$
I'm at a loss here, boys.
The diff man page is just Chinese to me.
I have been trying combinations of grep and wc -l, but it was never successful.
What part of diff do you not understand? To be sure, the man page shows a lot of options and formats, but the basic command is pretty simple.
You need to start with "pseudo-code". Here is a simple example---based on what I believe you are trying to do:
Code:
Initialize as required, e.g. any counters.
Loop_1: loop on the contents of WORDFILE (the master); for each word in this file (MAINWORD):
    Loop_2: loop through the list of filenames that you want to check; for each of these (SCANFILE):
        Loop_3: loop through the words (SCANWORD) in SCANFILE
            SCANWORD matches MAINWORD?
                Yes: increment <SCANFILE>_COUNT by one; continue (note that you need one counter for each value of SCANFILE)
                No: continue
        end Loop_3
    end Loop_2
end Loop_1
Now print the results. This depends a bit on where you got the SCANFILE values. If they are not already in a list, then you need to add code to capture them.
Why do I have the feeling that there must already be a utility that does this, or at least makes it easier?
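For illustration, the pseudocode above can be sketched as a POSIX shell function. This is only a sketch under a few assumptions: the function name count_matches_loop and its two arguments (the master word file and the directory of lists) are made up for this example, and the file loop is moved outermost so that a single counter is enough instead of one counter per SCANFILE. It counts every matching pair of lines, so duplicated words are counted more than once.

```shell
#!/bin/sh
# Sketch of the nested-loop pseudocode (hypothetical helper, not from the thread).
# $1 = master word file, $2 = directory holding the word lists.
count_matches_loop() {
    wordfile=$1
    listdir=$2
    # Loop_2 moved outermost: one pass (and one counter) per list file.
    for scanfile in "$listdir"/*; do
        [ -f "$scanfile" ] || continue
        count=0
        # Loop_1: every word of the master file (also handles a missing final newline).
        while IFS= read -r mainword || [ -n "$mainword" ]; do
            [ -n "$mainword" ] || continue   # skip blank lines
            # Loop_3: every word of the current list file.
            while IFS= read -r scanword || [ -n "$scanword" ]; do
                if [ "$scanword" = "$mainword" ]; then
                    count=$((count + 1))
                fi
            done < "$scanfile"
        done < "$wordfile"
        echo "the file $(basename "$scanfile") shares $count matches with $(basename "$wordfile")"
    done
}

# Example call: count_matches_loop /home/banana/wordfile /home/banana/lists
```

Re-reading each list once per master word is slow for large files; it is shown only because it mirrors the pseudocode step by step.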
I have tried pseudocode, but since I couldn't even program a hello world myself, I decided to use a language where all the commands already exist and I just have to link them all up, somehow...
I want to compare the contents of file1 with file2 and have the script print how many matching words the two files have: not which words match, just how many matches were detected.
Each file is a list of words, one per line.
The diff man page didn't show me any apparent way to do this.
I was thinking of approaches like:
find /home/banana/lists -type f -exec "cat mywordlist|diff|wc -l" '{}' \; (this is my pseudocode)
The wc -l would count how many strings are equal in both files,
since all I can do is make diff spew out lines from both files.
Works for me (although the command should probably be grep -Fcf wordlist, unless your wordlist is really a list of regexps):
Code:
~/tmp$ cat > apples
red
green
yellow
dark red
~/tmp$ cat > oranges
orange
ping
pink
green
orange'y
~/tmp$ echo red >wordlist
~/tmp$ grep -cf wordlist apples oranges
apples:2
oranges:0
~/tmp$ grep -V
GNU grep 2.5.3
Copyright (C) 1988, 1992-2002, 2004, 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
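Note that in the session above "red" is counted twice for apples because it also matches inside "dark red". For lists with one word per line, adding -x (match whole lines only) to -F avoids that. A minimal sketch, assuming the same layout as in the first post; the function name report_grep_matches and its arguments are hypothetical:

```shell
#!/bin/sh
# Sketch: count whole-line matches per file with grep (hypothetical helper).
# $1 = master word file, $2 = directory holding the word lists.
report_grep_matches() {
    wordfile=$1
    listdir=$2
    for f in "$listdir"/*; do
        [ -f "$f" ] || continue
        # -F fixed strings, -x whole-line match, -c count matching lines,
        # -f read the patterns from the master file.
        # grep exits with status 1 on zero matches but still prints 0.
        n=$(grep -Fxc -f "$wordfile" -- "$f" || true)
        echo "the file $(basename "$f") shares $n matches with $(basename "$wordfile")"
    done
}

# Example call: report_grep_matches /home/banana/wordfile /home/banana/lists
```

With the apples, oranges and wordlist files from the session above, this reports 1 match for apples instead of 2, since "dark red" is no longer counted.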
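Code:
#!/usr/bin/env python
import glob
import sys

wordfile = sys.argv[1]  # master word list, given on the command line
with open(wordfile) as f:
    words = set(f.read().split())
# Compare against every file matching the pattern (here file1 and file2).
for filename in sorted(glob.glob("file[12]")):
    with open(filename) as f:
        data = f.read().split()
    # Count the distinct words that occur in both files.
    matches = len(words.intersection(data))
    print("the file %s has %d matches with wordfile: %s" % (filename, matches, wordfile))
Output: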
Code:
# more file
one
two
three
four
five
six
seven
eight
nine
ten
# more file1
two
one
eleven
twelve
three
fourteen
# more file2
fifteen
sixteen
one
two
seventeen
# python test.py file
the file file1 has 3 matches with wordfile: file
the file file2 has 2 matches with wordfile: file
Hello, thank you all for your help. I figured it out,
though the grep example still does not work for me.
comm was the way to go.
This was what I needed:
[root@galaga wordlists]# comm wordlist apples -1 -2 | echo "list has `wc -l` matches"
list has 4 matches
It shows how many words in wordlist and apples match.
Sorry for any confusion.
I haven't tried the Python example yet, but I will once I have managed
to modify the file[12] pattern so that the filenames are read from the command line.
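One caveat for the comm approach: comm expects both inputs to be sorted, so unsorted lists can give wrong counts. The sketch below sorts both sides first and loops over every file in the directory to produce the report format asked for in the first post. The function name report_comm_matches is made up for this example, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Sketch: per-file match report built on sort and comm (hypothetical helper).
# $1 = master word file, $2 = directory holding the word lists.
report_comm_matches() {
    wordfile=$1
    listdir=$2
    # Sort the master list once (dropping duplicates) into a temporary file.
    sorted_main=$(mktemp) || return 1
    sort -u "$wordfile" > "$sorted_main"
    echo "SCRIPT REPORT:"
    for f in "$listdir"/*; do
        [ -f "$f" ] || continue
        # comm -12 prints only the lines common to both inputs; "-" means stdin.
        # tr strips the padding that some wc implementations add.
        n=$(sort -u "$f" | comm -12 "$sorted_main" - | wc -l | tr -d '[:space:]')
        echo "the file $(basename "$f") shares $n matches with $(basename "$wordfile")"
    done
    rm -f "$sorted_main"
}

# Example call: report_comm_matches /home/banana/wordfile /home/banana/lists
```

Because of sort -u, a word that appears several times in a list is counted only once.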