Search files for multiple keywords with AND

titopoquito · 03-05-2005, 05:22 PM

Hi all,

first I'd like to describe the problem I'm facing:

I have about 1,900 Microsoft Word documents that I converted to text files. I want to search these files for multiple keywords. I found a solution for grep that performs an OR search, but I need to combine the keywords with AND, so that the command lists only those documents that contain all of the given keywords (two or more).

I've been searching a lot on this topic but haven't found a solution. As far as I understand it, grep takes no more than one keyword or expression at a time. Another problem with grep is that it is line-based, whereas I want to search complete files. I don't need the line numbers or the number of occurrences in a file.
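For reference, a minimal sketch of the OR search mentioned above, listing file names instead of matching lines. The sample files and keywords are made up for illustration; `-l` and `-e` are standard grep options.

```shell
# Hypothetical sample files standing in for the converted documents.
docs=$(mktemp -d)
printf 'alpha\nbeta\n' > "$docs/one.txt"
printf 'gamma\n' > "$docs/two.txt"
printf 'delta\n' > "$docs/three.txt"

# -l prints each matching file name once instead of the matching lines.
# Each -e adds another pattern; a file is listed if ANY pattern matches (OR).
or_matches=$(cd "$docs" && grep -l -e 'alpha' -e 'gamma' *.txt)
printf '%s\n' "$or_matches"
```

Because `-l` reports whole files, the line-based matching only matters when one keyword has to be combined with another, which is the AND problem described here.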

My thoughts are going in two directions:
1) Is there an alternative to the grep command that can search for multiple keywords natively?
2) With grep I could output the names of all files that contain one particular keyword, one output per relevant keyword. Is there a utility I could use to get the intersection of multiple outputs? Maybe I should give you a more precise example:

# cat output-1 -------- keyword A occurs in these files
file-a.txt
file-b.txt
file-d.txt

# cat output-2 -------- keyword B has been found in these files
file-a.txt
file-c.txt
file-d.txt
file-e.txt

So to ask again, is there a utility that could retrieve the intersection of these two files, i.e. "file-a.txt" and "file-d.txt"?

Of course I would also appreciate a third or fourth solution.
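One standard tool that fits this description is comm, which compares two sorted files line by line. The sketch below recreates the two listings from the example in a temporary directory (the directory and the sorted-1/sorted-2 names are only for illustration).

```shell
# Recreate the two example listings in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' file-a.txt file-b.txt file-d.txt > "$tmpdir/output-1"
printf '%s\n' file-a.txt file-c.txt file-d.txt file-e.txt > "$tmpdir/output-2"

# comm expects sorted input, so sort both listings first.
sort "$tmpdir/output-1" > "$tmpdir/sorted-1"
sort "$tmpdir/output-2" > "$tmpdir/sorted-2"

# -1 hides lines only in the first file, -2 hides lines only in the second,
# so -12 leaves just the lines present in both: the intersection.
common=$(comm -12 "$tmpdir/sorted-1" "$tmpdir/sorted-2")
printf '%s\n' "$common"
```

For more than two keywords, the result can be saved and fed into comm again together with the next listing.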

Cheers,

titopoquito

Feminista · 03-05-2005, 05:34 PM

Seems to me that you could pretty easily write a script that takes two parameters, uses grep as described, stores the results in two arrays, and then compares the arrays to build a third one that is the intersection and outputs it. I'm not sure of the exact syntax for such a thing, as I don't do much scripting, but it certainly doesn't sound too difficult.
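A rough sketch of that idea in plain sh, under a few assumptions: sh has no arrays, so temporary files hold the candidate list instead, and the list is narrowed once per keyword rather than intersected at the end. The function name files_with_all and the sample files are hypothetical, and file names containing newlines are not handled.

```shell
# files_with_all DIR KEYWORD...   (hypothetical helper)
# Prints the regular files directly inside DIR that contain every KEYWORD.
files_with_all() {
    dir=$1
    shift
    candidates=$(mktemp)
    next=$(mktemp)
    # Start with every regular file in DIR as a candidate.
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done > "$candidates"
    # Each keyword keeps only the candidates that also contain it.
    for kw in "$@"; do
        while IFS= read -r f; do
            if grep -q -e "$kw" -- "$f"; then
                printf '%s\n' "$f"
            fi
        done < "$candidates" > "$next"
        mv "$next" "$candidates"
    done
    cat "$candidates"
    rm -f "$candidates" "$next"
}

# Usage with made-up sample files.
demo=$(mktemp -d)
printf 'apple\nbanana\n' > "$demo/a.txt"
printf 'apple\n' > "$demo/b.txt"
printf 'banana cherry\napple pie\n' > "$demo/c.txt"
both=$(files_with_all "$demo" apple banana)
printf '%s\n' "$both"
```

Narrowing the list step by step means any number of keywords can be given, and later keywords are only searched for in files that matched the earlier ones.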

Electro · 03-05-2005, 06:42 PM

Use the find utility. It should work better than grep because it has more functions.

Though you do not need to convert the docs to text, because there is ASCII text in the doc file.

titopoquito · 03-06-2005, 05:20 AM

@ Electro

But how can I use find to tell whether a keyword is found within a group of files? I read the man pages once again, but could only find options that apply to the file's name, not its content. Or did I miss something?
Is there any other utility/command that can do this?

I also looked again into diff and sdiff, but there is only an option to ignore matching lines, not to print them. The only way I can imagine using it would be to run diff, grep all lines starting with ">" and "<", and try to remove them from the first search output with sed. That sounds complicated to me; I can't imagine there isn't a better-fitting tool in the great field of Unix commands.
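On the find question: find itself has no content test, but its standard -exec primary counts as true when the command it runs exits successfully, so grep -q can serve as the content check. A minimal sketch, assuming made-up sample files and the keywords "invoice" and "contract":

```shell
# Hypothetical sample tree with one file containing both keywords.
tree=$(mktemp -d)
mkdir "$tree/sub"
printf 'invoice\nsome contract text\n' > "$tree/both.txt"
printf 'invoice only\n' > "$tree/sub/first.txt"
printf 'contract only\n' > "$tree/second.txt"

# Primaries are ANDed: a file is printed only if both grep -q calls succeed.
# grep -q prints nothing and just reports via its exit status.
found=$(find "$tree" -type f -name '*.txt' \
    -exec grep -q -e 'invoice' {} \; \
    -exec grep -q -e 'contract' {} \; \
    -print)
printf '%s\n' "$found"
```

This runs one grep process per file and keyword, which may be slow for large trees but should be acceptable for a couple of thousand text files. It also descends into subdirectories, unlike a plain `grep -l ... dir/*`.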

Any other suggestions?

Cheers,

titopoquito

titopoquito · 03-06-2005, 07:51 AM

I tried some scripting to solve my problem. It is surely not a tricky or elegant script, but it seems to work now that I have tracked down some errors. It should work if you put the keywords in double quotation marks (I don't understand why this is needed, since the keywords I tried didn't contain any spaces, but without the quotes it gave wrong results).
If you find any errors or have any suggestions to improve the code, please feel free to post them here.

CAVEAT: Before you run this script, be sure to replace "/path/to/documents" with the appropriate path, and make sure that none of the files it creates ("found-1" up to "found-5") already exist in the directory where you run the script. Otherwise they will be overwritten.

Cheers,

titopoquito

edit: replaced my individual documents path to all-purpose "/path/to/documents"

Code:

#!/bin/sh
# this script requires two arguments.
# all files in SEARCHDIR will be searched
# for the occurrence of each argument
# (keyword) with grep. The files used:
# found-1: lists files that contain keyword A = $1
# found-2: lists files that contain keyword B = $2
# found-3: lists all lines (= file names) that differ 
#          between found-1 and found-2
#          i. e. files that don't contain both keywords
# found-4: the same as found-3, but cleaned of
#          additional information that is not needed
# found-5: contains (found-1 minus found-4), so only
#          files in which both keywords appear remain
#
SEARCHDIR=/path/to/documents
#
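# check that exactly two arguments were given to this script,
# otherwise show the user how to use it
if [ $# -ne 2 ]; then
    echo "usage: $0 <keyword A> <keyword B>" >&2
    echo "Be sure to enclose the keywords in double quotes" >&2
    exit 1
fi
# quote the keywords so the shell passes each one to grep unchanged;
# -e marks the next argument as the pattern even if it starts with "-"
grep -l -e "$1" "$SEARCHDIR"/* > found-1
grep -l -e "$2" "$SEARCHDIR"/* > found-2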
# now take only the lines with the file names from the 
# search results. they are marked with an "> " or an "< " 
# at the beginning of the line
# anchor the match at the line start so file names that happen to
# contain ">" or "<" are not picked up by mistake
diff found-1 found-2 | grep '^[<>] ' > found-3
# remove the greater and lesser signs at the beginning of the lines
sed 's/^[<>] //' found-3 > found-4
# output the search result for keyword A - if any of these lines is not 
# in found-4 (i. e. the diff command reported it is the same in both
# searches, i. e. both keywords can be found in this file) put an entry
# for it in found-5
# clean found-5 first
rm -f found-5
touch found-5
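while IFS= read -r line; do
    # -F: treat the file name as a fixed string, not a regular expression
    # -x: the whole line must match, so "a.txt" does not match "aa.txt"
    if ! grep -Fxq -e "$line" found-4; then
        printf '%s\n' "$line" >> found-5
    fi
done < found-1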
echo "The search results for \"$1\" and \"$2\" are stored in found-5"
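# A possible shortcut for the diff/sed/loop steps above, offered as a sketch
# rather than a tested replacement: grep can compute the intersection of the
# two lists directly. The scratch directory and sample names are made up.

```shell
# Recreate two sample result lists in a scratch directory.
work=$(mktemp -d)
printf '%s\n' file-a.txt file-b.txt file-d.txt > "$work/found-1"
printf '%s\n' file-a.txt file-c.txt file-d.txt file-e.txt > "$work/found-2"

# -f reads the patterns from found-1 (one per line),
# -F treats them as fixed strings instead of regular expressions,
# -x requires a pattern to match a whole line of found-2.
# What remains are the lines of found-2 that also appear in found-1.
both=$(grep -F -x -f "$work/found-1" "$work/found-2")
printf '%s\n' "$both"
```

# Unlike comm, this does not need sorted input.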