LinuxQuestions.org
Old 04-16-2010, 09:33 AM   #1
citygrid
LQ Newbie
 
Registered: Mar 2010
Posts: 10

Help using wc and grep with regular expressions


I'm writing a program that works with text files, and I'm trying to create some filters with grep. I have various questions here, so I'll number them for clarity.

1) First of all, I'd like to know what wc -w is actually returning. Its word count is lower than the one gedit reports in Document Statistics, so presumably gedit is counting something (like newlines) that wc -w is not.

2) Secondly, I was wondering if there was a way to grep x number of words. I'm looking for something like the -m option, but returning a certain number of words instead of lines. For example, to find the first 2000 words, do something like grep -someoption 2000 ".*" or using \{1,2000\}.

3) Finally, I'm trying to filter out headers and footers of a text file but having no luck. The text files are Project Gutenberg files, so they have standardized headers and footers. Here's an example: http://www.gutenberg.org/files/97/97.txt

The header starts with "The Project Gutenberg EBook of" and ends with the line containing "START OF THIS PROJECT GUTENBERG EBOOK"

The footers begin with: "End of the Project Gutenberg EBook of"

My problem is, grep can find:

Code:
grep "The Project Gutenberg EBook of" flatland.txt
and
Code:
grep "START OF THIS PROJECT GUTENBERG EBOOK" flatland.txt
but not
Code:
grep "The Project Gutenberg EBook of.*START OF THIS PROJECT GUTENBERG EBOOK" flatland.txt
(above edited, thanks grail)

Similarly, grep can find:
Code:
grep "End of the Project Gutenberg EBook of" flatland.txt
but it doesn't return anything on
Code:
grep "End of the Project Gutenberg EBook of.*$" flatland.txt
So obviously I'm using the regular expression incorrectly with grep. What am I doing wrong?

Last edited by citygrid; 04-16-2010 at 03:13 PM. Reason: error in code
 
Old 04-16-2010, 10:02 AM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

I don't see any headers and footers in the file you linked---what am I missing?
 
Old 04-16-2010, 10:09 AM   #3
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Try
Code:
grep "[End of the Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt
Code:
[root@athlonz flatland]# grep "[End of the Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt | wc
   3285   36551  215295
[root@athlonz flatland]#
Or did you mean
Code:
grep "[The Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt
Code:
[root@athlonz flatland]# grep "[The Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt | wc
   3283   36550  215283
[root@athlonz flatland]#
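
One caveat worth checking before relying on these patterns: in a regular expression, square brackets form a bracket expression, which matches one single character from the listed set rather than the phrase written inside. The sketch below uses made-up sample lines (not the Flatland file) to show that such a pattern matches almost any line, which would also explain why nearly every line of the file is counted above.

```shell
# Sample input is invented for illustration; it is not taken from flatland.txt.
sample=$(printf 'watermelon\nfig\napples and peaches\n')

# [apples] matches ONE character out of a, p, l, e, s (not the word "apples").
# Any line with one such character followed later by one of p, e, a, c, h, s matches.
matched=$(printf '%s\n' "$sample" | grep '[apples].*[peaches]')
printf '%s\n' "$matched"
```

Here "watermelon" and "apples and peaches" are printed, while "fig" is not, because it contains none of the listed characters.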
 
Old 04-16-2010, 10:14 AM   #4
kbp
Senior Member
 
Registered: Aug 2009
Posts: 3,790

This seems to work:

Code:
cat 97.txt | sed -n '/.*START OF THIS PROJECT GUTENBERG EBOOK FLATLAND.*/,/End of the Project Gutenberg EBook of.*/p'
<edit>oops.. forgot about the rest of your points </edit>
hth

Last edited by kbp; 04-16-2010 at 10:16 AM.
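
A minimal sketch of the same range idea on a tiny stand-in file, in case it helps to see what the address range selects. The file contents and the workdir variable are invented; the two marker patterns are the ones quoted in the first post. Note that the range includes both marker lines, so the sketch drops them with a second sed.

```shell
# Build a tiny stand-in for a Project Gutenberg file (contents are invented).
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
The Project Gutenberg EBook of Sample, by Nobody
license text
*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***
body line one
body line two
End of the Project Gutenberg EBook of Sample, by Nobody
footer text
EOF

# -n suppresses the default output; p prints only lines inside the address range.
# The range includes both marker lines, so a second sed drops the first and last line.
body=$(sed -n '/START OF THIS PROJECT GUTENBERG EBOOK/,/End of the Project Gutenberg EBook of/p' "$workdir/sample.txt" | sed '1d;$d')
printf '%s\n' "$body"
rm -r "$workdir"
```

Only the two body lines remain.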
 
Old 04-16-2010, 10:44 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,550

Okay, let's try to go in order:

1 - Consider if the following was the only information in a file:

Type=Link

gedit's stats will say there are two words here, as it uses a lexical lookup based on human language.
wc -w, on the other hand, counts a word as any group of characters without white space, hence its count is 1.

2 - As with the above, you need to decide what you consider a word to be, i.e. consecutive letters followed by a space (which could mean blah is a word but "blah" is not, as quotes are not letters). Once you have made this decision you can then use the {n} construct after your regex to say how many you want.

3 - Not sure why you thought this would work:
Quote:
grep "End of the Project Gutenberg EBook of.*End of the Project Gutenberg EBook of" flatland.txt
According to the link you gave, "End of the Project Gutenberg EBook of" appears only once on the entire page, so a pattern that requires it, then .*, then itself again matches nothing and rightly returns nothing.

As for:
Quote:
grep "End of the Project Gutenberg EBook of.*$" flatland.txt
I ran the same and it returned the expected line of:
Code:
End of the Project Gutenberg EBook of Flatland, by Edwin A. Abbott
btw, this grep says: find a line containing "End of the Project Gutenberg EBook of", followed by any characters ".*", up to the end of that line "$". It works perfectly fine.

And lastly, your final statement can finish here:
Quote:
So obviously I'm using the regular expression incorrectly
as regex is not the sole domain of grep
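
For point 2, a minimal sketch of the interval idea, assuming a "word" is any run of non-whitespace characters and that your grep supports -o (GNU and BSD grep do; it is not in POSIX). Because grep works line by line, this only limits the words within a single line, so a whole file would first need its lines joined. The sample text and the limit of 3 are made up, and large counts such as 2000 may exceed what some implementations accept, since POSIX only guarantees interval counts up to 255.

```shell
line='one two three four five'

# First word, then up to 2 more "whitespace + word" groups: at most 3 words in total.
first_three=$(printf '%s\n' "$line" | grep -oE '^[^[:space:]]+([[:space:]]+[^[:space:]]+){0,2}')
printf '%s\n' "$first_three"
```

This prints "one two three".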
 
Old 04-16-2010, 11:04 AM   #6
tommylovell
Member
 
Registered: Nov 2005
Distribution: Fedora, Redhat
Posts: 372

Quote:
Originally Posted by citygrid View Post
1) First of all, I'd like to know what wc -w is actually returning. The word count is less than what gedit is counting in Document Statistics, so obviously gedit is counting something (like newlines) that wc -w is not
I ran 'wc' and the perl program at this site, http://en.literateprograms.org/Speci...unt_%28Perl%29
and they get the same result:

Code:
[root@athlonz flatland]# wc.pl flatland.txt 
    3922   36562  216624 flatland.txt
[root@athlonz flatland]# wc flatland.txt 
  3922  36562 216624 flatland.txt
[root@athlonz flatland]#
I'm not graphical, so I can't comment on gedit, but I've never heard of wc being shown to be wrong. It has only one thing to do in life and it would be tragic if wc could not count words...
 
Old 04-16-2010, 11:23 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,550

Thank you for proving the point as the lines responsible for word counting are:
Code:
my @w=split(/[ \t]+/, $line);
$words+=@w;
So here a word is any run of consecutive characters (I believe any type of character) delimited by one or more spaces or tabs, hence this code would also count Type=Link as one word, whereas gedit looks at it in a humanish way and sees two words either side of an equals sign.
 
Old 04-16-2010, 03:11 PM   #8
citygrid
LQ Newbie
 
Registered: Mar 2010
Posts: 10

Original Poster
First of all, I appreciate all your replies. I've given you all Thanks.

All of your suggestions worked, but what I am trying to do is filter/eliminate the header and footer entirely with grep (using -v).

I think the problem lies in my basic misunderstanding of what the wildcard represents. I am looking for an expression which means "return everything from this to that", so that the .* wildcard would INCLUDE all of the words in the filtering process.

For example:

Code:
#!/bin/bash

echo "apples
bananas
oranges
peaches
watermelon" > fruit.txt

grep "apples.*peaches" fruit.txt

exit
What I'd like to return is "apples bananas oranges peaches".

Code:
grep "apples.*peaches" fruit.txt
(my original suggestion) returns nothing

Code:
grep "[apples].*[peaches]" fruit.txt
(per tommylovell's suggestion) returns everything, including "watermelon"

How can I phrase it correctly to return "apples bananas oranges peaches"?

P.S. In reading your replies, I looked at my original code, and although I explained the situation correctly in the text part, there was a mistake in the code section. I've edited and corrected it.

The error was

Code:
grep "End of the Project Gutenberg EBook of.*End of the Project Gutenberg EBook of" flatland.txt
which does not do anything, as grail rightly pointed out.

What I meant to put was:

Code:
grep "The Project Gutenberg EBook of.*START OF THIS PROJECT GUTENBERG EBOOK" flatland.txt
Sorry about the confusion, and thank you for giving me intelligent answers anyway!
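
On the fruit example: grep tests each line separately, so "apples.*peaches" can only match when both words sit on the same line. A minimal sketch using a sed address range instead (the approach kbp suggested), assuming the same fruit.txt contents; the temporary directory is only there to keep the sketch self-contained.

```shell
workdir=$(mktemp -d)
printf 'apples\nbananas\noranges\npeaches\nwatermelon\n' > "$workdir/fruit.txt"

# Print every line from the first one matching /apples/ through the next one matching /peaches/.
range=$(sed -n '/apples/,/peaches/p' "$workdir/fruit.txt")
printf '%s\n' "$range"

# The inverse (what grep -v was meant to achieve): delete that range and keep the rest.
rest=$(sed '/apples/,/peaches/d' "$workdir/fruit.txt")
printf '%s\n' "$rest"
rm -r "$workdir"
```

The first command prints apples, bananas, oranges and peaches on separate lines; the second prints only watermelon.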
 
Old 04-16-2010, 03:28 PM   #9
citygrid
LQ Newbie
 
Registered: Mar 2010
Posts: 10

Original Poster
As for the word count issue, thanks for clarifying, grail. I did some experimenting, and as it turns out, there are several key differences:

gedit considers "J.D. Salinger" three words, whereas wc only counts 2.

gedit also counts newlines, but wc ignores them.

And finally, gedit considers contractions (I'm, don't, etc.) to be two words, while wc counts them as one.
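
These observations can be reproduced roughly from the shell. The sketch below compares wc -w with a count of alphanumeric runs. The second count is only an approximation that happens to agree with the numbers reported here, not gedit's actual algorithm, and the helper names count_ws and count_alnum are made up.

```shell
# Whitespace-delimited words, as wc -w counts them (tr strips padding some wc versions add).
count_ws() { printf '%s\n' "$1" | wc -w | tr -d ' '; }

# Runs of letters or digits, so punctuation such as . ' = splits a word (an assumed approximation).
count_alnum() { printf '%s\n' "$1" | grep -oE '[[:alnum:]]+' | wc -l | tr -d ' '; }

for text in 'J.D. Salinger' "don't" 'Type=Link'; do
    printf '%s: wc -w=%s, alnum runs=%s\n' "$text" "$(count_ws "$text")" "$(count_alnum "$text")"
done
```

For "J.D. Salinger" this gives 2 and 3, for "don't" 1 and 2, and for "Type=Link" 1 and 2.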
 
Old 04-16-2010, 06:55 PM   #10
citygrid
LQ Newbie
 
Registered: Mar 2010
Posts: 10

Original Poster
Finally got something to work for me -- it's sort of similar to what kbp suggested above (thanks, kbp):

Code:
cat flatland.txt | sed '/The Project Gutenberg EBook of/,/START OF THIS PROJECT GUTENBERG EBOOK/d; /End of the Project Gutenberg EBook of/,/\n.*$/d'
The newline was put into the second part of the argument because for some reason /.*$/d was not sufficient.
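
A likely explanation, shown on a four-line sample: sed starts looking for the end of a range on the line after the one that opened it, and /.*$/ matches any line, so the range closes after just two lines. A pattern containing \n never matches here, because the pattern space holds a single line with no embedded newline, so the range runs to the end of the input. The sample letters are made up.

```shell
# End address matches the very next line: only "a" and "b" are deleted.
short=$(printf 'a\nb\nc\nd\n' | sed '/a/,/.*$/d')
printf '%s\n' "$short"

# End address never matches: everything from "a" to the end of input is deleted.
to_eof=$(printf 'a\nb\nc\nd\n' | sed '/a/,/\n/d')
printf '[%s]\n' "$to_eof"
```

The first command leaves c and d; the second leaves nothing, so only the empty brackets are printed.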
 
Old 04-16-2010, 09:00 PM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,937

If you want to delete to eof, just use "$" by itself - as in "sed '/blah/,$ d' flatland.txt"
Note your use of "cat" is superfluous.
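
Put together with the header patterns from earlier in the thread, that could look like the sketch below. The sample file is an invented stand-in for flatland.txt, and sed reads the file directly, so no cat is needed.

```shell
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
The Project Gutenberg EBook of Sample, by Nobody
license text
*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***
body line one
body line two
End of the Project Gutenberg EBook of Sample, by Nobody
footer text
EOF

# First range deletes the header through the START marker;
# second range deletes from the End marker to the last line ($).
stripped=$(sed '/The Project Gutenberg EBook of/,/START OF THIS PROJECT GUTENBERG EBOOK/d; /End of the Project Gutenberg EBook of/,$d' "$workdir/sample.txt")
printf '%s\n' "$stripped"
rm -r "$workdir"
```

Only "body line one" and "body line two" are left. The patterns are case sensitive, so the footer line starting with "End of the Project" does not reopen the header range.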
 
Old 04-16-2010, 09:45 PM   #12
citygrid
LQ Newbie
 
Registered: Mar 2010
Posts: 10

Original Poster
Thanks, syg00. That's much neater.

By the way, here's my solution for my question #2 in the original post (how to output the first 2000 words of a file).

Code:
cat file.txt | tr "\n" " " | tr " " "\n" | head -n 2000 | tr "\n" " "
And here's a kind of clunky method for preserving newlines, if formatting of the text is important:

Code:
cat file.txt | sed ':top $!N;s/\n/@NEWLINE@/g; ttop' | tr " " "\n" | head -n 2000 | tr "\n" " " | sed ':top $!N;s/@NEWLINE@/\n/g; ttop'
After the filtering, the final wc comes out slightly more than 2000, but it's good enough for what I needed it for.

I'm marking this thread as solved. Thanks, everyone.
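
A slightly tighter variant of the first pipeline, as a sketch: tr -s squeezes runs of whitespace, so double spaces and blank lines do not count as empty "words". The function name first_words and the sample text are made up; input that begins with whitespace would still produce one empty leading entry.

```shell
# Print the first N whitespace-delimited words of standard input on one line (N is $1).
first_words() {
    tr -s '[:space:]' '\n' | head -n "$1" | paste -sd ' ' -
}

printf 'one two\n\nthree   four five six\n' | first_words 4
```

This prints "one two three four".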
 
Old 04-17-2010, 03:29 AM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,550

Well just for later thought, here is another way you could go about it:
Code:
awk 'BEGIN{wc=0;p=0;s=1}p && /EBook/{p=0;s=0}p && wc < 2000{wc += NF;print}s && /EBOOK/{p=1}' flatland.txt

This has the same restriction: it will finish on the line that contains the 2000th word, but not necessarily on that word.
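
If stopping exactly on the Nth word matters, one possible sketch is to count fields inside each line and cut the last line short. This is a separate illustration, not a change to the one-liner above: it does not skip the header or footer (it could be fed the output of the sed step), it normalises spacing within a line to single spaces, and the function name first_words_keep_lines is made up.

```shell
# Print input line by line, stopping exactly after the Nth word (N is $1).
first_words_keep_lines() {
    awk -v limit="$1" '
        count >= limit { exit }          # limit already reached on an earlier line
        {
            out = ""
            for (i = 1; i <= NF && count < limit; i++) {
                out = (out == "" ? $i : out " " $i)
                count++
            }
            print out                    # blank input lines stay blank
        }'
}

printf 'one two three\n\nfour five six\n' | first_words_keep_lines 4
```

With a limit of 4 this prints "one two three", an empty line, and then "four".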
 
  

