LinuxQuestions.org
Old 06-26-2012, 08:12 AM   #1
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 3,125

Rep: Reputation: 46
Grep: how to make it not match PDF and HTML files?


Hi:

Is there a way to make grep skip PDF and HTML files? By this I mean making grep not process these types at all, not look into them for the pattern. When I search for strings that I only care about finding in plain ASCII text files, grep wastes a lot of time on PDF and HTML files. From grep's man page (the italics are mine):

Quote:
-U, --binary
Treat the file(s) as binary. By default, under MS-DOS and MS-Windows, grep guesses the file type by looking at the contents of the first 32KB read from the file. If grep decides the file is a text file, it strips the CR characters from the original file contents (to make regular expressions with ^ and $ work correctly). Specifying -U overrules this guesswork, causing all files to be read and passed to the matching mechanism verbatim; if the file is a text file with CR/LF pairs at the end of each line, this will cause some regular expressions to fail. This option has no effect on platforms other than MS-DOS and MS-Windows.
So, under given conditions, grep knows if a file is a text file or not. But what should I understand by a text file? Are HTML files, for instance, considered text files by grep? Also,

Quote:

--binary-files=TYPE
If the first few bytes of a file indicate that the file contains binary data, assume that the file is of type TYPE. By default, TYPE is binary, and grep normally outputs either a one-line message saying that a binary file matches, or no message if there is no match. If TYPE is without-match, grep assumes that a binary file does not match; this is equivalent to the -I option. If TYPE is text, grep processes a binary file as if it were text; this is equivalent to the -a option. Warning: grep --binary-files=text might output binary garbage, which can have nasty side effects if the output is a terminal and if the terminal driver interprets some of it as commands.

I guess, then, that grep looks for a magic number in the file. Again, to understand this correctly, I'd need to know what the manual means by a binary file. Are only ELF executables and the like binary files, or is a PDF file a binary file too?

From reading the manual, then, one thing is clear to me: grep's default behavior is to look in text files only. So it would be a great help to have a good definition of 'text file' in the sense used in grep's man page (which I presume is the same one used in many Linux contexts). Then I would have an automatic answer to questions such as "Is an ISO/IEC 8859 file considered a text file by grep?" or "Are 16- or 32-bit formats considered text files by grep?" or, more to the point, "Are HTM/HTML and PDF files considered text files by grep/Linux?" Of course, some PDF files are the result of, say, scanning a book.

Kernel 2.6.21.5, Slackware 12.0
GNU bash, 3.1.17
grep (GNU grep) 2.5

Last edited by stf92; 06-26-2012 at 08:20 AM.
 
Old 06-26-2012, 08:24 AM   #2
pan64
Senior Member
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 4,730

Rep: Reputation: 1262
Why don't you try --exclude *.pdf or something similar?
 
Old 06-26-2012, 08:41 AM   #3
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 3,125

Original Poster
Rep: Reputation: 46
Thanks. I'm trying 'grep --exclude=html *' and it does not behave as expected. Maybe the pattern ('html') is not well formed. But I always believed this option made grep skip PATTERN when found as a string within a file, not as a pattern in the filename.

I'll look into which patterns grep accepts, though I thought I knew something about regexps.
 
Old 06-26-2012, 09:03 AM   #4
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Use --exclude="*.html".

Put the glob pattern inside quotes so it isn't expanded by bash before the grep command runs.
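
A minimal sketch of the difference, assuming GNU grep and a POSIX shell. The scratch directory and the file names a.html, b.html and notes.txt are made up for illustration. With a space and no quotes, the shell expands *.html into file names before the command ever sees it; with quotes, the command receives the glob itself.

```shell
#!/bin/sh
# Hypothetical scratch directory with two HTML files and one text file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'at that point\n' > a.html
printf 'at that point\n' > b.html
printf 'at that point\n' > notes.txt

# Unquoted, with a space: the shell replaces *.html with the matching names,
# so a command would receive "--exclude", "a.html" and "b.html" as arguments.
unquoted_args=$(printf '%s\n' --exclude *.html)

# Quoted: the glob reaches the command untouched.
quoted_args=$(printf '%s\n' --exclude="*.html")

# With the quoted form, grep skips the HTML files and reports only notes.txt.
matches=$(grep -l --exclude="*.html" "at that point" *)

printf '%s\n' "$unquoted_args" "$quoted_args" "$matches"
```

printf is used here only to make the arguments visible; grep would receive exactly the same words.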
 
Old 06-26-2012, 09:23 AM   #5
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 3,125

Original Poster
Rep: Reputation: 46
Thank you very much. It works fine. But

Code:
semoi@darkstar:~/STORE1/Nonsoft/shurebro$ grep  "at that point" *
ttadjust.html:              Then at that point the cartridge's body should again be parallel 
semoi@darkstar:~/STORE1/Nonsoft/shurebro$ grep  "at that point" * --exclude="*html"
semoi@darkstar:~/STORE1/Nonsoft/shurebro$ grep  "at that point" * --exclude="*htm"
ttadjust.html:              Then at that point the cartridge's body should again be parallel 
semoi@darkstar:~/STORE1/Nonsoft/shurebro$
In the third instance, why isn't '*htm' matched? Isn't 'htm' a substring of 'html'?
 
Old 06-26-2012, 10:55 AM   #6
pan64
Senior Member
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 4,730

Rep: Reputation: 1262
See the man page for the explanation: --exclude="*htm" means a filename ending in htm. You probably need to try --exclude="*.htm*".
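
To see why '*htm' fails where '*html' succeeds, here is a small sketch using the shell's own wildcard matching in a case statement. It assumes that --exclude treats these simple patterns the same way, which is what the man page's wording about wildcard matching suggests. The helper name glob_matches is made up for this example.

```shell
#!/bin/sh
# glob_matches NAME GLOB: print "yes" if the whole NAME matches GLOB, else "no".
# (glob_matches is a helper invented for this illustration.)
glob_matches() {
    case $1 in
        $2) echo yes ;;   # unquoted $2 is used as a wildcard pattern
        *)  echo no ;;
    esac
}

for glob in '*html' '*htm' '*.htm*'; do
    printf '%s -> %s\n' "$glob" "$(glob_matches ttadjust.html "$glob")"
done
```

A glob has to account for the whole name up to its last character, so '*htm' leaves the final 'l' unmatched, while a regular expression such as htm would be free to match anywhere inside the name.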
 
Old 06-26-2012, 01:40 PM   #7
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 3,125

Original Poster
Rep: Reputation: 46
I'll do it. I thought it was like searching with regular expressions. Thanks.

But the question about the precise meaning of the expression 'text file' in the context of grep's man page stands.

Last edited by stf92; 06-26-2012 at 01:45 PM.
 
Old 06-27-2012, 12:40 AM   #8
pan64
Senior Member
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 4,730

Rep: Reputation: 1262
It is not trivial, or rather, it costs too much. Look at the man page for magic: it is really not a simple decision. On the other hand, you need to scan a file to be able to identify it (some systems recognize only two cases: a file that contains non-printable characters is binary, and a file that contains only printable characters is text). So matching on the extension is a dirty, but much quicker, solution.
 
Old 06-27-2012, 01:29 AM   #9
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 3,125

Original Poster
Rep: Reputation: 46
Quote:
Originally Posted by pan64
So matching the extension is a dirty, but a much quicker solution.
I see. But what happens when your hard disk holds files in a large variety of formats (say HTML, PDF and the hundred formats/specifications currently circulating on the web)? I would need to give grep all of these extensions (which, I grant you, could be done automatically, for example with a shell script, but new standards appear every day).
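
One way around an ever-growing exclude list might be to turn the test around and name only the extensions that are wanted, using GNU grep's --include option. This is a sketch, not something suggested earlier in the thread; the directory layout and file names below are invented, and it assumes the plain text files of interest end in .txt.

```shell
#!/bin/sh
# Hypothetical tree: one text file, one HTML file and its foo_files/ directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir foo_files
printf 'at that point\n' > notes.txt
printf '<p>at that point</p>\n' > foo.html
printf 'at that point\n' > foo_files/style.css

# Search recursively, but only in files whose names match *.txt.
# Every other extension is skipped without having to be listed.
hits=$(grep -r -l --include="*.txt" "at that point" .)

printf '%s\n' "$hits"
```

The obvious cost is that text files without the expected extension are silently skipped as well.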

Plus, when you download an HTML file named foo.html from the web, you can end up with a big tree at foo_files/, containing files with several extensions. And I have seen grep match strings inside these files. That is, grep says, for example,

$ foo_files/pict1.png: <the_string_I_gave_grep_to_look_for>

Maybe, since grep has to read, say, the first 2KB of each of these files to decide whether it is a text file, when a file is small enough grep finds the pattern, if it is there, and reports it. I can say that when I used grep with short patterns, such as "an", it listed a lot of the files in foo_files/.
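
For comparison, a sketch of what the -I option quoted from the man page does with such files, assuming a reasonably recent GNU grep, where a NUL byte in the data is enough for a file to be treated as binary. The file pict1.png here is a fake, written with printf just to contain a NUL byte; it is not a real image.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'an apple a day\n' > notes.txt
# A NUL byte (\000) makes GNU grep classify this file as binary.
printf 'an\000image payload\n' > pict1.png

# By default grep still searches binary files and lists them when they match.
all_hits=$(grep -l an *)

# With -I, binary files are treated as if they contained no match.
text_hits=$(grep -I -l an *)

printf 'default: %s\n' $all_hits
printf 'with -I: %s\n' $text_hits
```

Note that the decision is made on file contents, not on the extension, so an HTML file, which is normally plain text, would still be searched with -I and has to be excluded by name.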

Thanks for the link. I see I have it on my system too. BTW, this is the first time I have posted in Linux - Desktop, as I used to think the chances of getting feedback were low. Maybe it's beginner's luck. But with a name like 'Desktop', I think more people should post in it.

Last edited by stf92; 06-27-2012 at 01:38 AM.
 
  


All times are GMT -5.