Grep: how to make it not match PDF and HTML files?
Hi:
Is there a way to make grep skip PDF and HTML files? By this I mean making grep not process these file types at all, rather than reading them to look for the pattern. I'm only interested in matches in plain ASCII text files, and grep wastes a lot of time on PDF and HTML files. From grep's man page (the italics are mine):
Quote:
From reading the manual, then, one thing seems clear to me: grep's default behavior is to search text files only. So it would be a great help to have a good definition of 'text file' in the sense used in grep's man page (which I presume is the same sense used in many Linux contexts). Then I would have an automatic answer to questions such as "Is an ISO/IEC 8859 file considered a text file by grep?", "Are 16-bit or 32-bit formats considered text files by grep?" or, more to the point, "Are HTM/HTML and PDF files considered text files by grep/Linux?" Of course, some PDF files are the result of, say, scanning a book.
Kernel 2.6.21.5, Slackware 12.0
GNU bash, 3.1.17
grep (GNU grep) 2.5 |
Why don't you try --exclude *.pdf or similar?
|
Thanks. I'm trying 'grep --exclude=html *' and it does not behave as expected. Maybe the pattern ('html') is not well formed. But I always believed this option made grep not match PATTERN when found as a string within a file, rather than being a pattern applied to the filename.
I'll check which patterns grep accepts, though I thought I knew something about regexps. |
Use --exclude="*.html".
Put the glob pattern inside quotes so it isn't expanded by bash before the grep command runs. |
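A minimal sketch of the difference quoting makes, assuming GNU grep and a scratch directory with two made-up files (notes.txt and page.html are hypothetical names, not files from this thread). The search is recursive from the current directory so that the exclude glob is applied to every file grep finds.

```shell
# Scratch directory with hypothetical file names, for illustration only.
demo=$(mktemp -d)
printf 'at that point\n' > "$demo/notes.txt"
printf '<p>at that point</p>\n' > "$demo/page.html"
cd "$demo" || exit 1

# Unquoted, the shell replaces the glob with the matching file names
# before the command ever sees it.
unquoted_args=$(echo --exclude *.html)

# Quoted, grep receives the glob itself and applies it to each file name.
matches=$(grep -rl --exclude="*.html" "at that point" .)

echo "$unquoted_args"
echo "$matches"
```

The first echo shows what a command would actually receive after an unquoted glob; the second shows that only the non-HTML file is reported once grep gets the glob intact.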
Thank you very much. It works fine. But
Code:
semoi@darkstar:~/STORE1/Nonsoft/shurebro$ grep "at that point" * |
See the man page for an explanation: --exclude="*htm" matches file names ending in htm. You probably need to try "*.htm*" instead.
|
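To see how the two globs differ, here is a small sketch, again assuming GNU grep and hypothetical file names (a.htm, b.html, c.txt) that all contain the search string. The output is piped through sort only because the order in which grep walks a directory is not guaranteed.

```shell
# Scratch directory with hypothetical file names, for illustration only.
demo=$(mktemp -d)
cd "$demo" || exit 1
for name in a.htm b.html c.txt; do
    printf 'at that point\n' > "$name"
done

# "*htm" only matches names that end in the letters htm,
# so b.html is still searched.
ends_in_htm=$(grep -rl --exclude="*htm" "at that point" . | sort)

# "*.htm*" matches .htm followed by anything,
# so both a.htm and b.html are skipped.
htm_and_html=$(grep -rl --exclude="*.htm*" "at that point" . | sort)

echo "$ends_in_htm"
echo "$htm_and_html"
```

These are shell-style wildcards matched against file names, not regular expressions, which is why a trailing * is needed to cover both extensions.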
I'll do it. I thought it was like searching with regular expressions. Thanks.
But the question about the precise meaning of the expression 'text file' in the context of grep's man page stands. |
It is not trivial, or rather, it costs too much. Look at the man page for magic; it is really not a simple decision. On the other hand, you have to scan a file to be able to identify its type (some systems use just two possibilities: a file containing non-printable characters is binary, a file containing only printable characters is text). So matching on the extension is a dirty but much quicker solution.
|
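As a rough illustration of content-based detection, the sketch below assumes GNU grep, where a NUL byte in the data it examines is one of the things that makes a file count as binary. The -I option (documented as equivalent to --binary-files=without-match) then treats such files as if they contained no match. The file names plain.txt and blob.bin are made up for the example.

```shell
# Scratch directory with hypothetical file names, for illustration only.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'an example line\n' > plain.txt
# The \000 escape writes a NUL byte into the file.
printf 'an\000example\n' > blob.bin

# Default behaviour: both files are searched.
all_files=$(grep -rl "an" . | sort)

# -I treats files detected as binary as if they had no match.
text_only=$(grep -rlI "an" . | sort)

echo "$all_files"
echo "$text_only"
```

Note that grep still has to open and read the start of each file to make this decision, so excluding by extension remains the quicker option. A file that happens to contain no such bytes is still treated as text whatever its extension, which is why HTML files are searched.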
Quote:
Plus, when you download an HTML file named foo.html from the web, you can end up with a big tree under foo_files/ containing files with several extensions. And I have seen grep match strings inside these files. That is, grep says, say, $ foo_files/pict1.png: <the_string_I_gave_grep_to_look_for>
Maybe grep has to read, say, the first 2KB of a file to determine whether it is a text file, so if the file is too small it will find the pattern and report it if the pattern is there. I can say that when I used grep with short patterns, say "an", it listed a lot of the files in foo_files/.
Thanks for the link. I see I have it on my system too.
BTW, this is the first time I've posted in Linux-Desktop, as I used to think the chances of getting feedback were low. Maybe it's beginner's luck. But with the name being 'Desktop', I think more people should post in it. |