LinuxQuestions.org - Grep: how to make it not match PDF and HTML files?

Grep: how to make it not match PDF and HTML files?

Hi:

Is there a way to make grep skip PDF and HTML files? By this I mean making grep not process these types at all, not look into them for the pattern. When I search for strings that I only care to find in plain ASCII text files, grep wastes a lot of time on PDF and HTML files. From grep's man page (the italics are mine):

Quote:

-U, --binary
Treat the file(s) as binary. By default, under MS-DOS and
MS-Windows, grep guesses the file type by looking at the contents
of the first 32KB read from the file. If grep decides the file
is a text file, it strips the CR characters from the original
file contents (to make regular expressions with ^ and $ work
correctly). Specifying -U overrules this guesswork, causing all
files to be read and passed to the matching mechanism verbatim;
if the file is a text file with CR/LF pairs at the end of each
line, this will cause some regular expressions to fail. This
option has no effect on platforms other than MS-DOS and
MS-Windows.

So, under given conditions, grep knows if a file is a text file or not. But what should I understand by a text file? Are HTML files, for instance, considered text files by grep? Also,

Quote:

--binary-files=TYPE
If the first few bytes of a file indicate that the file contains
binary data, assume that the file is of type TYPE. By default,
TYPE is binary, and grep normally outputs either a one-line
message saying that a binary file matches, or no message if there
is no match. If TYPE is without-match, grep assumes that a
binary file does not match; this is equivalent to the -I option.
If TYPE is text, grep processes a binary file as if it were
text; this is equivalent to the -a option. Warning: grep
--binary-files=text might output binary garbage, which can have
nasty side effects if the output is a terminal and if the
terminal driver interprets some of it as commands.

I guess, then, that grep looks for a magic number in the file. Again, to understand this correctly I'd need to know what the manual means by a binary file. Are only ELF executables and the like binary files, or is a PDF file a binary file too?

From reading the manual, then, one thing seems clear to me: the default behavior of grep is to look for text files only. So a great thing for me would be to have a good definition of 'text file' in the sense used in grep's man page (which I presume must be the same one used in many Linux contexts). Then I would have an automatic answer to questions such as "Is an ISO/IEC 8859 file considered a text file by grep?" or "Are 16- or 32-bit formats considered text files by grep?" or, more to the point, "Are HTM/HTML and PDF files considered to be text files by grep/Linux?" Of course, some PDF files are the result of, say, scanning a book.

Kernel 2.6.21.5, Slackware 12.0
GNU bash, 3.1.17
grep (GNU grep) 2.5

Why don't you try --exclude *.pdf or similar?

Thanks. I'm trying 'grep --exclude=html *' and it does not behave as expected. Maybe the pattern ('html') is not well formed. But I always believed this option made grep not match PATTERN when it is found as a string within a file, not as a pattern in the filename.

I'll see what are acceptable patterns for grep, though I thought I knew something about regexps.

Use --exclude="*.html".

Put the glob pattern inside quotes so it isn't expanded by bash before the grep command runs.
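To illustrate the quoting advice, here is a small sketch that counts the arguments a command would receive with and without quotes. The file names (a.pdf, b.pdf, notes.txt) and the temporary directory are made up for the demonstration. It uses the `--exclude *.pdf` spelling with a space, where the shell clearly expands the glob; the `--exclude=*.html` form usually survives unquoted only because no file name happens to match the whole word.

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.pdf b.pdf notes.txt

# Unquoted: the shell replaces *.pdf with the matching file names
# before the command ever sees it.
set -- --exclude *.pdf
unquoted_count=$#

# Quoted: the glob reaches the command as a single literal argument.
set -- --exclude '*.pdf'
quoted_count=$#

echo "unquoted: $unquoted_count arguments, quoted: $quoted_count arguments"
```

With the two PDF files above, the unquoted form passes three arguments (`--exclude a.pdf b.pdf`), while the quoted form passes two.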

Thank you very much. It works fine. But

Code:

semoi@darkstar:~/STORE1/Nonsoft/shurebro$ grep  "at that point" *

ttadjust.html:              Then at that point the cartridge's body should again be parallel 

semoi@darkstar:~/STORE1/Nonsoft/shurebro$ grep  "at that point" * --exclude="*html"

semoi@darkstar:~/STORE1/Nonsoft/shurebro$ grep  "at that point" * --exclude="*htm"

ttadjust.html:              Then at that point the cartridge's body should again be parallel 

semoi@darkstar:~/STORE1/Nonsoft/shurebro$

In the third instance, why isn't '*htm' matched? Isn't 'htm' a substring of 'html'?

See the man page for an explanation: --exclude="*htm" means a filename ending in htm. You probably need to try *.htm*
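A minimal sketch of the difference, assuming GNU grep and two made-up files in a scratch directory (ttadjust.html mirrors the name from the transcript above; notes.txt is hypothetical). The --exclude argument is a shell-style glob matched against the file name, not a regular expression or substring test, so `*htm` only matches names that end in "htm".

```shell
# Scratch directory with hypothetical sample files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Then at that point the body should be parallel\n' > ttadjust.html
printf 'at that point, in plain text\n' > notes.txt

# "*htm" matches only names ENDING in "htm", so ttadjust.html is still searched.
suffix_only=$(grep -l --exclude='*htm' "at that point" * | sort | xargs)

# "*.htm*" matches both .htm and .html names, so ttadjust.html is skipped.
with_wildcard=$(grep -l --exclude='*.htm*' "at that point" * | sort | xargs)

echo "exclude=*htm    -> $suffix_only"
echo "exclude=*.htm*  -> $with_wildcard"
```

The first command still lists ttadjust.html; the second lists only notes.txt.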

I'll do it. I thought it was like searching with regular expressions. Thanks.

But the question about the precise meaning of the expression 'text file' in the context of grep's man page stands.

It is not trivial, or rather, it costs too much. Look at the man page of magic: it is really not a simple decision. On the other hand, you need to scan a file to be able to identify it (some systems distinguish just two cases: a file that contains non-printable characters is binary, a file that contains only printable characters is text). So matching the extension is a dirty but much quicker solution.
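As a rough illustration of the content-based approach, the sketch below uses the -I option quoted from the man page earlier in the thread. It assumes GNU grep, which in practice treats a file containing NUL bytes as binary; the two files are invented for the demonstration. An HTML file is ordinary text by that test, so -I alone will not skip it.

```shell
# Scratch directory with hypothetical sample files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '<p>needle in markup</p>\n' > page.html
# A NUL byte (\000) is enough for grep to classify this file as binary.
printf 'needle\000binary payload\n' > blob.dat

# Default: the binary file is searched too (-l just lists matching names).
default_hits=$(grep -l needle page.html blob.dat | sort | xargs)

# -I (same as --binary-files=without-match): binary files never match.
text_only=$(grep -lI needle page.html blob.dat | sort | xargs)

echo "default: $default_hits"
echo "with -I: $text_only"
```

This suggests -I helps with formats that embed binary data but does nothing for HTML; skipping HTML still needs a file-name test such as --exclude.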

Quote:

Originally Posted by pan64 (Post 4713038)

So matching the extension is a dirty but much quicker solution.

I see. But what happens when your hard disk holds files in a large variety of formats (say HTML, PDF and the hundred formats/specifications currently circulating on the web)? I would need to give grep all of these extensions (which, I'll grant you, could be automated, for example with a shell script, but new standards appear every day).
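For what it's worth, --exclude can be given more than once, so a handful of extensions does not need a script. The sketch below assumes GNU grep with recursive search (-r); the directory layout and file names are invented, and manual.pdf is just a text file with a .pdf name, since only the name matters to --exclude.

```shell
# Scratch tree with hypothetical sample files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p foo_files
printf 'needle\n' > notes.txt
printf 'needle\n' > foo.html
printf 'needle\n' > manual.pdf          # not a real PDF, only the name counts here
printf 'needle\n' > foo_files/style.css
printf 'needle\n' > foo_files/index.htm

# Each --exclude adds one more glob; files matching any of them are skipped.
hits=$(grep -rl --exclude='*.pdf' --exclude='*.htm*' needle . | sort | xargs)

echo "$hits"
```

Only ./foo_files/style.css and ./notes.txt are reported. The list still has to be maintained by hand as new extensions turn up, which is the drawback raised above.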

Plus, when you download an HTML file named foo.html from the web, you can end up with a big tree at foo_files/, containing files with several extensions. And I have seen grep matching strings from within these files. That is, grep says, for example,

$ foo_files/pict1.png: <the_string_I_gave_grep_to_look_for>

Maybe, since grep has to read, say, the first 2KB of each of these files to determine whether it is a text file, it will find the pattern and report it when the file is that small and the pattern is there. I can say that I used grep with short patterns, say "an", and grep listed a lot of the files in foo_files/.

Thanks for the link. I see I have it on my system too. BTW, this is the first time I have posted in Linux-Desktop, as I used to think the chances of getting feedback were low. Maybe it's beginner's luck. But with the name being 'Desktop', I think more people should post in it.