LinuxQuestions.org
Old 03-07-2015, 10:00 AM   #1
Thermoman
Member
 
Registered: Feb 2015
Location: UK
Distribution: Linux Mint 13
Posts: 39

Rep: Reputation: Disabled
Searching for Strings


If there is one Windows XP feature that I greatly miss in Mint, it is the Search Companion.

I have been struggling with 'grep' in order to create something suitable but with limited success. Take the following problem:-

I wish to interrogate the folder home/dell/Documents/Domestic/Recipes, searching for all files containing the word "mushroom" or "mushrooms", ignoring case. (I can manage the latter.)

Each individual file search should terminate at the first instance of a match and move to the next file. (Recursive, yeh?) Only the file names need to be listed and the output should be paged to allow for easier reading of long lists.

Several different types of file may be involved, including .doc, .odt, .txt, .pdf, .htm and .rtf. It would be nice to include all of them in one command. (Wild card behaviour in grep is not entirely predictable - at least not for me.) Running a separate grep command for each different file type would be tedious.

A significant difficulty is that, if grep fails with a syntax or run-time error, it generally reports the fact, but it also has a habit of producing no output at all, perhaps not even returning to the command prompt, and simply sitting there inviting the user to decide what to do next. What makes this particularly frustrating is that some file types might not be amenable to a grep search. Text in .txt files and, it would appear, .doc files is searchable but I suspect that .odt files might be more problematic. The snag in such circumstances is trying to interpret grep's response. Does a null return mean that no match was found or that the file format cannot be successfully interrogated? Such failure might not be apparent if the associated file names are simply excluded from the output list.
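One way to tell "no match" from a failure is grep's exit status: 0 when something matched, 1 when nothing matched and 2 when an error occurred (a missing or unreadable file, say). The following is only a minimal sketch; the function name grep_status is made up for illustration.

```shell
# grep_status WORD FILE: report how grep fared, based on its exit status.
# The function name is hypothetical; the exit codes 0, 1 and 2 are grep's own.
grep_status() {
    status=0
    # -q prints nothing and only sets the exit status; -i ignores case
    grep -qi -- "$1" "$2" 2>/dev/null || status=$?
    case $status in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;
    esac
}
```

Note that status 1 only means the bytes were not found in the file as stored, so a compressed format would most likely report "no match" as well. The "sitting there" behaviour is usually grep reading standard input because no file or directory was named; Ctrl-C returns to the prompt.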

Apart from grep, is there any other software that would do the job? Sadly LibreOffice Writer seems to be lacking in this area.
 
Old 03-07-2015, 11:46 AM   #2
bigrigdriver
LQ Addict
 
Registered: Jul 2002
Location: East Central Illinois, USA
Distribution: Debian Jessie 8.4
Posts: 5,873

Rep: Reputation: 348
My favorite method is to use "find" piped through "grep".

example: find /home/Thermoman -type f | grep -i mushroom

command find, directory to be searched, type file, pipe through grep (case insensitive), search for mushroom
 
1 member found this post helpful.
Old 03-07-2015, 02:49 PM   #3
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,331
Blog Entries: 55

Rep: Reputation: 3531
Quote:
Originally Posted by Thermoman
Apart from grep, is there any other software that would do the job?
Find and grep work well for just about anything text-based, but if you want multiple document formats indexed and searched you should look for a more comprehensive tool like Recoll, DocFetcher or any of the desktop search tools mentioned, for example, here (your Linux distribution may or may not provide them).
 
1 member found this post helpful.
Old 03-08-2015, 06:40 AM   #4
Thermoman
Member
 
Registered: Feb 2015
Location: UK
Distribution: Linux Mint 13
Posts: 39

Original Poster
Rep: Reputation: Disabled
Thanks for the responses - some interesting information there.

I have had a bit more success with 'grep'. The lesson seems to be, "Don't go messing around with complicated wild card commands. Just tell grep what you want and let it get on with the job." It's a smart cookie - if you get the syntax right. If you get it wrong, as I mentioned before, it does tend to sit looking at you with the implied response, "Your move!" but that's hardly grep's fault.

One interesting feature is that, despite my earlier disappointments, a properly formulated grep command can find strings in most, if not all, of the file types I mentioned in my earlier post, which is particularly useful when searching my word processing files, many of which are in .odt format, for example.

I think I can happily close this thread now. Good outcome!
 
Old 03-08-2015, 06:45 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,848

Rep: Reputation: 1823
grep is part of a basic (as in fundamental) toolkit, but note unSpawn's response. Indexing tools have a poor reputation (and history), but can really do the job if set up properly.
 
1 member found this post helpful.
Old 03-08-2015, 08:16 PM   #6
Thermoman
Member
 
Registered: Feb 2015
Location: UK
Distribution: Linux Mint 13
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by bigrigdriver
My favorite method is to use "find" piped through "grep".

example: find /home/Thermoman -type f | grep -i mushroom

command find, directory to be searched, type file, pipe through grep (case insensitive), search for mushroom
Ah! Well I am afraid that that does not work. When I issued the command:-
Code:
find /Documents/Domestic/Recipes -type f | grep -i mushroom
it failed with the error 'No such file or directory'. I was surprised to find that the '/' preceding 'Documents' had to be removed before the command would work. That appears decidedly odd because Documents is a sub-directory of the pwd. But there we are. I had the same problem with grep.

With the first '/' removed, the find command returned the names of all the files in /Recipes containing the word 'mushroom' in the file name but that is not the intention. The need is to find all the recipes that use mushrooms - i.e. that feature the word 'mushroom' or 'mushrooms' in the body of the text, irrespective of whether it occurs in the file name. (Incidentally, if the need is simply to list the file names containing the word 'mushroom', I discovered that my standard Applications list contains a utility called 'Search for Files', which does the job with minimal effort.)
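The difference between searching file names and searching file contents can be sketched as follows. The two function names are invented for illustration; the second uses find's "-exec ... {} +" form, which hands the files that find locates to grep so that grep reads inside them.

```shell
# names_matching WORD DIR: what "find DIR -type f | grep -i WORD" does.
# grep only sees the list of path names that find prints, never the contents.
names_matching() {
    find "$2" -type f | grep -i -- "$1"
}

# contents_matching WORD DIR: find passes the files themselves to grep.
# -i ignores case, -l prints just the names of files whose contents match.
contents_matching() {
    find "$2" -type f -exec grep -il -- "$1" {} +
}
```

With a file called mushroom_pie.txt that never mentions mushrooms and a file called soup.txt that does, the first function would list only mushroom_pie.txt and the second only soup.txt.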

Before and after starting this thread I spent ages trying to get grep to work before finally coming up with:-
Code:
grep -irl 'mushroom' Documents/Domestic/Recipes
That does the trick. grep takes quite some time searching all the sub-directories (I did not realise that I had so many) but it finds every instance of 'mushroom', so far as I can judge. It is a long list.
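For anyone adapting that command, here is a minimal sketch of two variations: paging the list and limiting the search to chosen file types. The function names are hypothetical, and --include is a GNU grep option, so this assumes GNU grep (the usual one on Linux).

```shell
# list_matches WORD DIR: names of files under DIR whose contents mention WORD.
#   -i ignore case, -r recurse into sub-directories,
#   -l print each matching file name once (grep moves on after the first hit)
list_matches() {
    grep -irl -- "$1" "$2"
}

# list_text_matches WORD DIR: the same, limited to .txt and .htm files.
# --include may be repeated, once per file-name pattern.
list_text_matches() {
    grep -irl --include='*.txt' --include='*.htm' -- "$1" "$2"
}

# Paged output for long lists (interactive use):
#   list_matches mushroom Documents/Domestic/Recipes | less
```

In less, the space bar moves on a page and q quits.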

There is only one snag. The file names are printed in a deep mauve colour against a black background. There is barely any contrast and the text is therefore hardly readable. Anyone know how to change the colour-set?

In the meantime, I have to say that I have emerged from this thread knowing a great deal more than when I started it.
 
Old 03-08-2015, 08:40 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,848

Rep: Reputation: 1823
Quote:
Originally Posted by Thermoman
I was surprised to find that the '/' preceding 'Documents' had to be removed before the command would work. That appears decidedly odd because Documents is a sub-directory of the pwd.
It is precisely because Documents is a sub-directory of your current location that the slash is not needed. By prepending the slash you force the search (for the Documents directory) to start from the root of the entire filesystem rather than from your current directory. All very logical.
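The rule can be sketched in a few lines of shell. This is only an illustration, and the function name resolve_path is made up; the shell applies the same rule itself whenever a command is given a path.

```shell
# resolve_path PATH: show where the shell would look for PATH.
# The function name is hypothetical, for illustration only.
resolve_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;        # leading slash: start at the root of the filesystem
        *)  printf '%s\n' "$PWD/$1" ;;   # no leading slash: start at the current directory
    esac
}
```

Assuming the current directory is /home/dell, "resolve_path Documents/Domestic/Recipes" would print /home/dell/Documents/Domestic/Recipes, whereas "resolve_path /Documents/Domestic/Recipes" would print the path unchanged, and that directory does not exist directly under the root.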
Quote:
grep takes quite some time searching all the sub-directories
Anything scanning the disk will take time - if you re-run the scan it will likely be (much) quicker due to the caching of the directories and file data. How much quicker depends on how much of the cached data has been flushed in the meantime. An indexer will also take (even more) time, but indexers generally run in the background, and once done the data is available with minimal delay - and complex queries are much easier.
Quote:
There is only one snag. The file names are printed in a deep mauve colour against a black background. There is barely any contrast and the text is therefore hardly readable. Anyone know how to change the colour-set?
Your distro has an alias set for that - type "alias" at a terminal to see them all. The simplest (one-off) solution is to put a backslash in front of the command (\grep ...), which bypasses the alias - you can also use "grep --color=never ..." (see the manpage), or simply delete the alias altogether.
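A minimal sketch of both approaches, switching the colour off or changing it. The function names are made up; GREP_COLORS is an environment variable read by GNU grep, in which "fn" sets the file-name colour (35, magenta, by default), so the second function assumes GNU grep.

```shell
# plain_grep: run the real grep with colour off, bypassing any alias or function.
plain_grep() {
    command grep --color=never "$@"
}

# bright_grep: keep colour on a terminal, but show file names in bold yellow
# (1;33) instead of the default magenta.
bright_grep() {
    GREP_COLORS='fn=1;33' command grep --color=auto "$@"
}

# Interactive use:
#   bright_grep -irl 'mushroom' Documents/Domestic/Recipes
#   unalias grep      # removes the alias for the current session only
```

To make a colour change permanent, a line such as export GREP_COLORS='fn=1;33' could go in ~/.bashrc, assuming bash is the login shell.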
 
Old 03-10-2015, 10:32 AM   #8
Thermoman
Member
 
Registered: Feb 2015
Location: UK
Distribution: Linux Mint 13
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00
grep is part of a basic (as in fundamental) toolkit, but note unSpawn's response. Indexing tools have a poor reputation (and history), but can really do the job if set up properly.
Well your posts certainly provided some useful information, particularly
Quote:
Your distro has an alias set for that - type "alias" at a terminal to see them all. The simplest (one-off) solution is to put a backslash in front of the command (\grep ...), which bypasses the alias - you can also use "grep --color=never ..." (see the manpage), or simply delete the alias altogether.
since I can now read the output.

My initial satisfaction in getting grep to work has now been tempered. It turns out to be a rather capricious tool. For example, whilst it found no fewer than 45 files containing the word 'mushrooms' - mostly .doc files - it failed to find at least one .doc file in which the word occurred no fewer than 6 times.

Further disappointments included confirmation of my earlier suspicion that .odt files are not readily amenable to grep searches. An identical .odt copy of the .doc file, containing the 6 references to 'mushrooms', was also missed by grep, as were all other .odt files in which 'mushrooms' feature. I wondered whether this could be caused by an auto-hyphenation character between 'mush' and 'rooms' so I used 'mush' as the search string - with no more success with either the .doc or the .odt file.

Other formats tend to yield similar 'hit-and-miss' results. I suppose that I should expect this with .pdf files but surprisingly I got a couple of hits with these. (I did not check the misses. That could be a long job but I'll bet that there were quite a few.)

There seem to be three answers:-
1. Save files in .txt format, or at least keep a .txt version of each word processor file. Unsurprisingly these tend to be more reliable grep candidates but the solution is a bit clumsy, to say the least.
2. Write a script, converting each word processing file to .txt format before doing the search on the latter - but that's well beyond my capabilities.
3. Dip my toe into the unknown waters of indexing tools, in the hope that there is nothing in there with a liking for newbies' appendages.
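Option 2 need not be a large script. The following is only a sketch under stated assumptions: the function names are made up, pdftotext (from poppler-utils) and catdoc are separate programs that would have to be installed, and .odt text is taken from the content.xml member of the archive with its XML tags crudely stripped.

```shell
# to_text FILE: write a plain-text version of FILE to standard output.
# Returns non-zero for file types it does not know how to convert.
to_text() {
    case "$1" in
        *.txt|*.htm|*.html) cat -- "$1" ;;
        *.odt) unzip -p "$1" content.xml | sed 's/<[^>]*>//g' ;;  # strip XML tags
        *.pdf) pdftotext "$1" - ;;    # assumes poppler-utils is installed
        *.doc) catdoc "$1" ;;         # assumes catdoc is installed
        *) return 1 ;;
    esac
}

# file_mentions WORD FILE: succeed if the converted text of FILE contains WORD.
file_mentions() {
    to_text "$2" 2>/dev/null | grep -qi -- "$1"
}

# Interactive use, listing every recipe that mentions mushrooms:
#   find Documents/Domestic/Recipes -type f | while IFS= read -r f; do
#       file_mentions mushroom "$f" && printf '%s\n' "$f"
#   done
```

Nothing is written to disk, so there are no duplicate .txt files to keep in step with the originals.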

Last edited by Thermoman; 03-11-2015 at 05:50 PM. Reason: Numerical error
 
Old 03-10-2015, 09:15 PM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,848

Rep: Reputation: 1823
OpenOffice word processing files (.odt) are compressed (ZIP) archives - the strings don't exist as plain strings that grep can see. A quick search found plenty of info - a simple unzip piped into grep will suffice. However this loses the filename. Scripts exist that will do it all for you - have a read of this for example.
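A minimal sketch of the unzip-into-grep idea for a single file, wrapped so that the file name is not lost. The function name is made up; the document text of an .odt file lives in the archive member content.xml, and unzip's -p option writes that member to standard output.

```shell
# odt_contains WORD FILE: succeed if WORD occurs in the text of one .odt file.
# The function name is hypothetical, for illustration only.
odt_contains() {
    unzip -p "$2" content.xml 2>/dev/null | grep -qi -- "$1"
}

# Interactive use: the loop prints the name of each matching file,
# which a bare "unzip | grep" would not.
#   for f in *.odt; do
#       odt_contains mushroom "$f" && printf '%s\n' "$f"
#   done
```

One caveat: a word that is split by formatting markup inside content.xml could still be missed by this simple approach.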
 
Old 03-11-2015, 05:50 PM   #10
Thermoman
Member
 
Registered: Feb 2015
Location: UK
Distribution: Linux Mint 13
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00
OpenOffice word processing files (.odt) are compressed (ZIP) archives - the strings don't exist as plain strings that grep can see. A quick search found plenty of info - a simple unzip piped into grep will suffice. However this loses the filename. Scripts exist that will do it all for you - have a read of this for example.
...and from syg00's earlier post:-
Quote:
grep is part of a basic (as in fundamental) toolkit, but note unSpawn's response. Indexing tools have a poor reputation (and history), but can really do the job if set up properly.
Interesting responses.
1. The script example searches for a specific string. It would be more convenient to be able to input it as a parameter which is then referenced indirectly in the code, making the latter universal. I am sure that that can be done but the result would probably be even more daunting for a newbie.
2. By following the suggested links I discovered that Recoll was the preferred utility in the associated review. I then found it in the repository and installed it. The result was intriguing. It is claimed to be able to interrogate .txt, .rtf, .pdf and .odt files, among a host of others, but not .doc files!! (Does Mr. William Gates know about this?)
True to the claim, Recoll appeared to find all instances of the word 'mushroom' in all the specified file types, including .odt files, but it also had a stab at some .doc files, finding 13 of them, in comparison with grep's 45 which, as earlier indicated, misses a few .doc files - but not nearly so many as Recoll.
The conclusion seems clear. Save my WP files in LibreOffice Writer's default .odt format. I currently keep my old MS Word documents in their original .doc format, which LO can utilize directly, and resave any edits in that same format. Since Recoll seems able to handle .odt files, my best option is probably to let LO resave my .doc files in .odt format alone. That looks a lot simpler than keeping a .txt duplicate for each file that can then be interrogated by grep.
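On point 1, passing the search string as a parameter is less daunting than it sounds: inside a shell function or script, "$1" stands for the first argument and "$2" for the second. The following is a hypothetical sketch, not the script from the link; the name search_odt is made up.

```shell
# search_odt WORD [DIR]: list the .odt files under DIR (default: the current
# directory) whose text contains WORD, ignoring case.
search_odt() {
    if [ -z "${1:-}" ]; then
        echo "usage: search_odt WORD [DIR]" >&2
        return 2
    fi
    # "${2:-.}" means: the second argument, or "." if none was given
    find "${2:-.}" -type f -name '*.odt' | while IFS= read -r f; do
        # the text of an .odt file is in the archive member content.xml
        if unzip -p "$f" content.xml 2>/dev/null | grep -qi -- "$1"; then
            printf '%s\n' "$f"
        fi
    done
}

# Interactive use:
#   search_odt mushroom Documents/Domestic/Recipes | less
```

The same function then serves for any word, for example "search_odt garlic" to search from the current directory downwards.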

Hopefully, that has sorted it. I am a keen advocate of taking the path of least resistance but I am left rather intrigued by the fact that .doc and .docx files - probably among the world's, if not THE world's, most popular document formats - are not catered for in Recoll. LO has chosen .odt as its default format but nevertheless recognizes the importance of .doc. Why not Recoll, I wonder?
 
  


All times are GMT -5. The time now is 04:57 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration