LinuxQuestions.org
03-05-2005, 05:22 PM   #1
titopoquito
Senior Member
 
Registered: Jul 2004
Location: Lower Rhine region, Germany
Distribution: Slackware64 14.2 and current, SlackwareARM current
Posts: 1,645

Rep: Reputation: 146
Search files for multiple keywords with AND


Hi all,

First I'd like to describe the problem I'm facing:

I have about 1,900 Microsoft Word documents that I converted to text files. I want to search these files for multiple keywords. I found a way to make grep perform an OR search, but I need AND: the command should list only those documents that contain all of two or more keywords.

I've been searching a lot on this topic but haven't found a solution. As far as I understand it, grep takes no more than one keyword or expression at a time. Another problem with grep is that it is line-based, whereas I want to search complete files. I don't need the line numbers or the number of occurrences in a file.

My thoughts are going in two directions:
1) Is there an alternative to the grep command that can search for multiple keywords natively?
2) With grep I could output the names of all files that contain one particular keyword, one list per relevant keyword. Is there a utility I could use to get the intersection of multiple outputs? Maybe I should give a more precise example:

# cat output-1 -------- keyword A occurs in these files
file-a.txt
file-b.txt
file-d.txt

# cat output-2 -------- keyword B has been found in these files
file-a.txt
file-c.txt
file-d.txt
file-e.txt

So to ask again, is there a utility that could retrieve the intersection of these two files, i.e. "file-a.txt" and "file-d.txt"?
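
One candidate for this is comm, which compares two sorted files line by line. With -12 it should print only the lines that appear in both. The sketch below rebuilds the two example listings in a throwaway directory (the directory and the name "both" are only for illustration) and assumes both lists are sorted the same way, as the listings above are.

```shell
#!/bin/sh
# Sketch: intersect two lists of file names with comm.
# comm expects both inputs to be sorted in the same order.
workdir=$(mktemp -d)

printf '%s\n' file-a.txt file-b.txt file-d.txt > "$workdir/output-1"
printf '%s\n' file-a.txt file-c.txt file-d.txt file-e.txt > "$workdir/output-2"

# -1 drops lines found only in the first file, -2 drops lines found
# only in the second file, so only the common lines remain.
comm -12 "$workdir/output-1" "$workdir/output-2" > "$workdir/both"
cat "$workdir/both"
```

If the lists come from grep -l over the same glob they are normally in the same order already; otherwise run them through sort first.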

Of course I would also appreciate a third or fourth solution

Cheers,

titopoquito
 
03-05-2005, 05:34 PM   #2
Feminista
Member
 
Registered: Sep 2004
Distribution: Gentoo, OS X
Posts: 37

Rep: Reputation: 15
Seems to me that you could pretty easily write a script that takes two parameters, uses grep as described, stores the results in two arrays, then compares the arrays, builds a third that is their intersection, and outputs it. I'm not sure of the exact syntax of such a thing, as I don't do much scripting, but it certainly doesn't sound too difficult.
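
A rough sketch of such a script follows. Plain sh has no arrays, so instead of intersecting two result lists it checks every file against every keyword and prints the file only if all of them match. That is a different technique from the one described above, but it should give the same result. The function name files_with_all, the sample files, and the keywords are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: print every file in a directory that contains ALL given keywords.
# files_with_all is a hypothetical helper name, not an existing tool.
files_with_all() {
    dir=$1
    shift
    for file in "$dir"/*; do
        [ -f "$file" ] || continue    # skip directories and unmatched globs
        found_all=yes
        for keyword in "$@"; do
            # -q: no output, exit status only; -F: keyword is a fixed string
            if ! grep -qF -- "$keyword" "$file"; then
                found_all=no
                break
            fi
        done
        if [ "$found_all" = yes ]; then
            printf '%s\n' "$file"
        fi
    done
}

# Small demonstration with throwaway files.
demo=$(mktemp -d)
printf 'apple\nbanana\n'  > "$demo/file-a.txt"
printf 'apple\n'          > "$demo/file-b.txt"
printf 'banana\ncherry\n' > "$demo/file-c.txt"

files_with_all "$demo" apple banana
```

Because each keyword is tested against the whole file, the keywords do not have to be on the same line, and the function takes any number of keywords, not just two.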
 
03-05-2005, 06:42 PM   #3
Electro
LQ Guru
 
Registered: Jan 2002
Posts: 6,042

Rep: Reputation: Disabled
Use the find utility. It should work better than grep because it has more functions.

Though you do not need to convert docs to text because there is ASCII text in the doc file.
 
03-06-2005, 05:20 AM   #4
titopoquito
Senior Member
 
Registered: Jul 2004
Location: Lower Rhine region, Germany
Distribution: Slackware64 14.2 and current, SlackwareARM current
Posts: 1,645

Original Poster
Rep: Reputation: 146
@ Electro

But how can I use find to determine whether a keyword occurs within a group of files? I read the man pages once again, but could only find options that apply to a file's name, not its content. Or did I miss something?
Is there any other utility/command that can do this?
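
For what it's worth, find on its own only tests names and file metadata, but its -exec action can run grep on each file, and -exec counts as true only when the command it runs exits successfully. Chaining two -exec tests should therefore act as an AND over file contents. A sketch with invented sample files and keywords:

```shell
#!/bin/sh
# Sketch: AND search over file contents with find and grep.
docs=$(mktemp -d)
printf 'invoice\ncustomer\n' > "$docs/one.txt"
printf 'invoice\n'           > "$docs/two.txt"
printf 'customer\n'          > "$docs/three.txt"

# Each -exec is a test that is true when grep -q finds a match.
# find only reaches -print for files that pass both tests.
matches=$(find "$docs" -type f \
    -exec grep -q -- 'invoice' {} \; \
    -exec grep -q -- 'customer' {} \; \
    -print)
printf '%s\n' "$matches"
```

This starts one grep process per file and keyword, which is slow but probably tolerable for a couple of thousand text files.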

I also looked into diff and sdiff again, but there is only an option to ignore matching lines, not to print them. The only way I can imagine using it would be to run diff, grep all lines starting with ">" and "<", and try to remove them from the first search output with sed. That sounds complicated to me; I can't imagine there isn't a better-fitting tool in the great field of Unix commands.

Any other suggestions?

Cheers,

titopoquito
 
03-06-2005, 07:51 AM   #5
titopoquito
Senior Member
 
Registered: Jul 2004
Location: Lower Rhine region, Germany
Distribution: Slackware64 14.2 and current, SlackwareARM current
Posts: 1,645

Original Poster
Rep: Reputation: 146
I tried some scripting to solve my problem. It is surely not a tricky or elegant script, but it seems to work now that I have tracked down some errors. It should work if you put the keywords in double quotation marks (I don't understand why that is needed, since the keywords I tried didn't contain any spaces, but without the quotes it gave wrong results).
If you find any errors or have any suggestions to improve the code, please feel free to post them here

CAVEAT: Before you run this script, be sure to replace "/path/to/documents" with the appropriate path and check that none of the files it creates, "found-1" up to "found-5", already exist in the directory where you run the script. Otherwise they will be overwritten.

Cheers,

titopoquito

edit: replaced my own documents path with the generic "/path/to/documents"

Code:
#!/bin/sh
# This script requires two arguments (keywords).
# All files in $SEARCHDIR are searched with grep for
# each keyword. The files used:
# found-1: lists files that contain keyword A = $1
# found-2: lists files that contain keyword B = $2
# found-3: lists all lines (= file names) that differ
#          between found-1 and found-2,
#          i.e. files that don't contain both keywords
# found-4: the same as found-3, but with the diff
#          markers stripped
# found-5: contains (found-1 minus found-4), so only
#          files in which both keywords appear remain
#
SEARCHDIR=/path/to/documents
#
# Check that exactly two arguments were given,
# otherwise show the user how to call the script.
if [ $# -ne 2 ]; then
    echo "usage: $0 <keyword A> <keyword B>" >&2
    echo "Be sure to enclose the keywords in double quotes" >&2
    exit 1
fi
# Quote the keywords and the path so the shell passes them on unchanged.
grep -l -- "$1" "$SEARCHDIR"/* > found-1
grep -l -- "$2" "$SEARCHDIR"/* > found-2
# Keep only the lines that diff marks as differing. They start
# with "> " or "< ". Anchor the pattern so that file names which
# merely contain ">" or "<" are not picked up.
diff found-1 found-2 | grep '^[<>] ' > found-3
# Remove the marker at the beginning of each line.
sed 's/^[<>] //' found-3 > found-4
# Go through the results for keyword A. If a line is not in found-4
# (i.e. diff reported it as identical in both searches, so both
# keywords occur in that file), put an entry for it in found-5.
# Empty found-5 first.
: > found-5
while IFS= read -r line; do
    # -F: fixed string, -x: match the whole line, -q: no output.
    # Without -Fx, "file-a.txt" would also match "file-aa.txt".
    if ! grep -Fxq -- "$line" found-4; then
        printf '%s\n' "$line" >> found-5
    fi
done < found-1
echo "The search results for \"$1\" and \"$2\" are stored in found-5"

Last edited by titopoquito; 03-06-2005 at 08:15 AM.
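
For just two keywords, the temporary files can probably be avoided by feeding the file list from the first grep straight into a second grep. The sketch below assumes a grep that supports --null and an xargs that supports -0 (both are extensions found in the GNU tools, not POSIX). The sample directory, files, and keywords are invented for the demonstration.

```shell
#!/bin/sh
# Sketch: two-keyword AND search without temporary result files.
# Assumes grep --null and xargs -0 (GNU extensions, not POSIX).
searchdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$searchdir/file-a.txt"
printf 'alpha\n'       > "$searchdir/file-b.txt"
printf 'beta\n'        > "$searchdir/file-c.txt"

# The first grep lists the files containing keyword A, NUL-separated so
# that names with spaces survive. xargs hands that list to the second
# grep, which lists the ones that also contain keyword B.
both=$(grep -l --null -- 'alpha' "$searchdir"/* | xargs -0 grep -l -- 'beta')
printf '%s\n' "$both"
```

A third keyword would simply add another stage: make the second grep emit --null as well and pipe it into one more xargs -0 grep -l.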
 
  


All times are GMT -5.
