Searching for Strings
If there is one Windows XP feature that I greatly miss in Mint, it is the Search Companion.
I have been struggling with 'grep' in order to create something suitable but with limited success. Take the following problem: I wish to interrogate the folder home/dell/Documents/Domestic/Recipes, searching for all files containing the word "mushroom" or "mushrooms", ignoring case. (I can manage the latter. :)) Each individual file search should terminate at the first instance of a match and move to the next file. (Recursive, yes?) Only the file names need to be listed and the output should be paged to allow for easier reading of long lists.

Several different types of file may be involved, including .doc, .odt, .txt, .pdf, .htm and .rtf. It would be nice to include all of them in one command. (Wild card behaviour in grep is not entirely predictable - at least not for me.) Running a separate grep command for each different file type would be tedious.

A significant difficulty is that, if grep fails with a syntax or run-time error, it generally reports the fact, but it also has a habit of producing no output and perhaps not returning to the command prompt, sitting there and inviting the user to decide what to do next. What makes this particularly frustrating is that some file types might not be amenable to a grep search. Text in .txt files and, it would appear, .doc files seems to be searchable but I suspect that .odt files might be more problematic. The snag in such circumstances is trying to interpret grep's response. Does a null return mean that no match was found or that the file format cannot be successfully interrogated? Such failure might not be apparent if the associated file names are simply excluded from the output list.

Apart from grep, is there any other software that would do the job? Sadly LibreOffice Writer seems to be lacking in this area.
My favorite method is to use "find" piped through "grep".
Example:
Code:
find /home/Thermoman -type f | grep -i mushroom
That is: the find command, the directory to be searched, type file, piped through grep (case-insensitive), searching for mushroom.
Thanks for the responses - some interesting information there.
I have had a bit more success with 'grep'. The lesson seems to be, "Don't go messing around with complicated wild card commands. Just tell grep what you want and let it get on with the job." It's a smart cookie - if you get the syntax right. If you get it wrong, as I mentioned before, it does tend to sit looking at you with the implied response, "Your move!" but that's hardly grep's fault.

One interesting feature is that, despite my earlier disappointments, a properly formulated grep command can find strings in most, if not all, of the file types I mentioned in my earlier post. That is particularly useful when searching my word processing files, many of which are in .odt format.

I think I can happily close this thread now. Good outcome! :)
grep is part of a basic (as in fundamental) toolkit, but note unSpawn's response. Indexing tools have a poor reputation (and history), but can really do the job if set up properly.
Quote:
Code:
find /Documents/Domestic/Recipes -type f | grep -i mushroom
With the first '/' removed, the find command returned the names of all the files in /Recipes containing the word 'mushroom' in the file name but that is not the intention. The need is to find all the recipes that use mushrooms - i.e. that feature the word 'mushroom' or 'mushrooms' in the body of the text, irrespective of whether it occurs in the file name. (Incidentally, if the need is simply to list the file names containing the word 'mushroom', I discovered that my standard Applications list contains a utility called 'Search for Files', which does the job with minimal effort.)

Before and after starting this thread I spent ages trying to get grep to work before finally coming up with:
Code:
grep -irl 'mushroom' Documents/Domestic/Recipes
There is only one snag. The file names are printed in a deep mauve colour against a black background. There is barely any contrast and the text is therefore hardly readable. Does anyone know how to change the colour set?

In the meantime, I have to say that I have emerged from this thread knowing a great deal more than when I started it.
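For what it's worth, here is a minimal sketch of how the grep command above might be extended to cover the file-type, paging and colour points. It assumes GNU grep (the version Mint ships). The function name list_matches is made up for illustration, and the three extensions are just the plain-text formats from the first post.

```shell
# Hypothetical helper: list files under a directory whose contents match a word.
# Assumes GNU grep. -r recurses, -i ignores case, -l prints only file names
# (grep stops reading each file at its first match).
list_matches() {
    pattern=$1
    dir=$2
    # --include limits the search to the listed extensions;
    # --color=never switches off the coloured file names.
    grep -ril --color=never \
        --include='*.txt' --include='*.htm' --include='*.rtf' \
        -e "$pattern" "$dir"
}

# Usage, paging a long list with less:
# list_matches mushroom ~/Documents/Domestic/Recipes | less
```

If you would rather keep the colours and only change the mauve, GNU grep reads the GREP_COLORS environment variable, where the fn= field sets the file name colour. For example, GREP_COLORS='fn=1;33' should give bold yellow. Bear in mind that plain grep only sees the raw bytes of a file, so hits in compressed or binary formats will be unreliable.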
My initial satisfaction in getting grep to work has now been tempered. It turns out to be a rather capricious tool. For example, whilst it found no fewer than 45 files containing the word 'mushrooms' - mostly in .doc files - it failed to find at least one .doc file in which the word occurred no fewer than 6 times.

Further disappointments included confirmation of my earlier suspicion that .odt files are not readily amenable to grep searches. An identical .odt copy of the .doc file, containing the 6 references to 'mushrooms', was also missed by grep, as were all other .odt files in which 'mushrooms' feature. I wondered whether this could be caused by an auto-hyphenation character between 'mush' and 'rooms' so I used 'mush' as the search string - with no more success with either the .doc or the .odt file.

Other formats tend to yield similar 'hit-and-miss' results. I suppose that I should expect this with .pdf files but surprisingly I got a couple of hits with these. (I did not check the misses. That could be a long job but I'll bet that there were quite a few.)

There seem to be three answers:
1. Save files in .txt format, or at least keep a .txt version of each word processor file. Unsurprisingly these tend to be more reliable grep candidates but the solution is a bit clumsy, to say the least.
2. Write a script, converting each word processing file to .txt format before doing the search on the latter - but that's well beyond my capabilities.
3. Dip my toe into the unknown waters of indexing tools, in the hope that there is nothing in there with a liking for newbies' appendages.
OpenOffice word processing files (.odt) are compressed archives - the strings don't exist as plain strings inside the file. A quick search found plenty of info - a simple unzip piped into grep will suffice. However, this loses the filename. Scripts exist that will do it all for you - have a read of this for example.
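As a rough illustration of the unzip-into-grep idea that also keeps the filename, something along these lines might do. It is only a sketch: it assumes find, unzip and grep are installed, the function name odt_matches is invented here, and it relies on the document text living in the content.xml member of the .odt archive.

```shell
# Hypothetical helper: list .odt files under a directory whose text matches a word.
# An .odt file is a zip archive; the document text sits in content.xml.
odt_matches() {
    pattern=$1
    dir=$2
    find "$dir" -type f -name '*.odt' | while IFS= read -r file; do
        # unzip -p writes content.xml to stdout; grep -q stops at the first match.
        if unzip -p "$file" content.xml 2>/dev/null | grep -qi -e "$pattern"; then
            printf '%s\n' "$file"
        fi
    done
}

# Usage:
# odt_matches mushroom ~/Documents/Domestic/Recipes | less
```

One caveat: content.xml is XML, so a word that is split by formatting markup (say, half of it in bold) would not be found this way.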
1. The script example searches for a specific string. It would be more convenient to be able to input it as a parameter which is then referenced indirectly in the code, making the latter universal. I am sure that that can be done but the result would probably be even more daunting for a newbie.

The conclusion seems clear. Save my WP files in LibreOffice Writer's default .odt format. I currently keep my old MS Word documents in their original .doc format, which LO can utilize directly, and resave any edits in that same format. Since Recoll seems able to handle .odt files, my best option is probably to let LO resave my .doc files in .odt format alone. That looks a lot simpler than keeping a .txt duplicate for each file that can then be interrogated by grep.

Hopefully, that has sorted it. I am a keen advocate of taking the path of least resistance but I am left rather intrigued by the fact that .doc and .docx files - probably among the world's, if not THE world's, most popular document formats - are not catered for in Recoll. LO has chosen .odt as its default format but nevertheless recognizes the importance of .doc. Why not Recoll, I wonder?
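On the point about passing the search string as a parameter: in a shell script the first argument is available as $1, so it need not be daunting. A minimal sketch follows, assuming GNU grep. The name findword and the messages are made up for illustration, and it only does a plain grep over the files as they are.

```shell
# Hypothetical function: the search word arrives as the first argument
# instead of being written into the script.
findword() {
    if [ "$#" -lt 1 ]; then
        echo "usage: findword WORD [DIRECTORY]" >&2
        return 2
    fi
    word=$1
    dir=${2:-.}    # default to the current directory when none is given
    status=0
    grep -ril --color=never -e "$word" "$dir" || status=$?
    # grep exits 0 on a match, 1 on no match, 2 on an error
    if [ "$status" -eq 1 ]; then
        echo "no files contain '$word'" >&2
    fi
    return "$status"
}

# Usage:
# findword mushroom ~/Documents/Domestic/Recipes | less
# To use it as a stand-alone script, save it to a file and add a last line:
# findword "$@"
```

Checking grep's exit status this way also separates "nothing found" from "something went wrong", which a blank output alone does not.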