LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Help using wc and grep with regular expressions (https://www.linuxquestions.org/questions/linux-newbie-8/help-using-wc-and-grep-with-regular-expressions-802392/)

citygrid 04-16-2010 08:33 AM

Help using wc and grep with regular expressions
 
I'm writing a program that works with text files, and I'm trying to create some filters with grep. I have various questions here, so I'll number them for clarity.

1) First of all, I'd like to know what wc -w is actually returning. The word count is less than what gedit is counting in Document Statistics, so obviously gedit is counting something (like newlines) that wc -w is not.

2) Secondly, I was wondering if there was a way to grep x number of words. I'm looking for something like the -m option, but returning a certain number of words instead of lines. For example, to find the first 2000 words, do something like grep -someoption 2000 ".*" or using \{1,2000\}.

3) Finally, I'm trying to filter out headers and footers of a text file but having no luck. The text files are Project Gutenberg files, so they have standardized headers and footers. Here's an example: http://www.gutenberg.org/files/97/97.txt

The header starts with "The Project Gutenberg EBook of" and ends with the line containing "START OF THIS PROJECT GUTENBERG EBOOK"

The footers begin with: "End of the Project Gutenberg EBook of"

My problem is, grep can find:

Code:

grep "The Project Gutenberg EBook of" flatland.txt
and
Code:

grep "START OF THIS PROJECT GUTENBERG EBOOK" flatland.txt
but not
Code:

grep "The Project Gutenberg EBook of.*START OF THIS PROJECT GUTENBERG EBOOK"
(above edited, thanks grail)

Similarly, grep can find:
Code:


grep "End of the Project Gutenberg EBook of" flatland.txt

but it doesn't return anything on
Code:

grep "End of the Project Gutenberg EBook of.*$" flatland.txt
So obviously I'm using the regular expression incorrectly with grep. What am I doing wrong?

pixellany 04-16-2010 09:02 AM

I don't see any headers and footers in the file you linked---what am I missing?

tommylovell 04-16-2010 09:09 AM

Try
Code:

grep "[End of the Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt
Code:

[root@athlonz flatland]# grep "[End of the Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt | wc
  3285  36551  215295
[root@athlonz flatland]#

Or did you mean
Code:

grep "[The Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt
Code:

[root@athlonz flatland]# grep "[The Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt | wc
  3283  36550  215283
[root@athlonz flatland]#


kbp 04-16-2010 09:14 AM

This seems to work:

Code:

cat 97.txt | sed -n '/.*START OF THIS PROJECT GUTENBERG EBOOK FLATLAND.*/,/End of the Project Gutenberg EBook of.*/p'
<edit>oops.. forgot about the rest of your points :p </edit>
hth

grail 04-16-2010 09:44 AM

Okay, let's try to go in order:

1 - Consider if the following was the only information in a file:

Type=Link

gedit's stats will say there are two words here, as it does a lexicographical lookup based on human language.
wc -w, on the other hand, says how many words are in the file, a single word being any group of characters without white space; hence its count is 1.
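A quick way to see wc's rule in action (a minimal sketch using printf):

```shell
# wc -w counts runs of non-blank characters separated by white space
printf 'Type=Link\n' | wc -w      # 1: one unbroken token
printf 'Type = Link\n' | wc -w    # 3: the spaces create three tokens
```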

2 - As above, you first need to decide what you consider a word to be, i.e. consecutive letters followed by a space (which could mean blah is a word but "blah" is not, as quotes are not letters). Once you have made this decision you can then
use the {n} construct after your regex to say how many you want.

3 - Not sure why you thought this would work:
Quote:

grep "End of the Project Gutenberg EBook of.*End of the Project Gutenberg EBook of" flatland.txt
According to the link you gave, "End of the Project Gutenberg EBook of" appears only once on the entire page, so a pattern demanding it, then .*, then itself a second time matches nothing, and grep rightly returns nothing.

As for:
Quote:

grep "End of the Project Gutenberg EBook of.*$" flatland.txt
I ran the same and it returned the expected line of:
Code:

End of the Project Gutenberg EBook of Flatland, by Edwin A. Abbott
btw, this grep says: find a line containing "End of the Project Gutenberg EBook of", followed by any characters (".*"), up to the end of that line ("$"). It works perfectly fine.

And lastly, your final statement can finish here:
Quote:

So obviously I'm using the regular expression incorrectly
as regex is not the sole domain of grep

tommylovell 04-16-2010 10:04 AM

Quote:

Originally Posted by citygrid (Post 3937607)
1) First of all, I'd like to know what wc -w is actually returning. The word count is less than what gedit is counting in Document Statistics, so obviously gedit is counting something (like newlines) that wc -w is not

I ran 'wc' and the perl program at this site, http://en.literateprograms.org/Speci...unt_%28Perl%29
and they get the same result:

Code:

[root@athlonz flatland]# wc.pl flatland.txt
    3922  36562  216624 flatland.txt
[root@athlonz flatland]# wc flatland.txt
  3922  36562 216624 flatland.txt
[root@athlonz flatland]#

I'm not graphical, so I can't comment on gedit, but I've never heard of wc being shown to be wrong. It has only one thing to do in life, and it would be tragic if wc could not count words...

grail 04-16-2010 10:23 AM

Thank you for proving the point, as the lines responsible for word counting are:
Code:

my @w=split(/[ \t]+/, $line);
$words+=@w;

So here it says that a word is a run of consecutive characters (any type of character, I believe) delimited by one or more stretches of white space; hence this code would also consider Type=Link one word, where gedit looks at it in a more human way and sees two
words, one either side of the equals sign.

citygrid 04-16-2010 02:11 PM

First of all, I appreciate all your replies. I've given you all Thanks.

All of your suggestions worked, but what I am trying to do is filter/eliminate the header and footer entirely with grep (using -v).

I think the problem lies in my basic misunderstanding of what the wildcard represents. I am looking for an expression which means "return everything from this to that", so that the .* wildcard would INCLUDE all of the words in the filtering process.

For example:

Code:

#!/bin/bash

echo "apples
bananas
oranges
peaches
watermelon" > fruit.txt

grep "apples.*peaches" fruit.txt

exit

What I'd like to return is "apples bananas oranges peaches".

Code:

grep "apples.*peaches" fruit.txt
(my original suggestion) returns nothing

Code:

grep "[apples].*[peaches]" fruit.txt
(per tommylovell's suggestion) returns everything, including "watermelon"

How can I phrase it correctly to return "apples bananas oranges peaches"?
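One way to express "from this line to that line" is sed's two-address range, since grep matches one line at a time and a single pattern cannot span lines. A minimal sketch against the fruit.txt example above:

```shell
printf 'apples\nbananas\noranges\npeaches\nwatermelon\n' > fruit.txt

# -n suppresses default printing; /apples/,/peaches/p prints only the range
sed -n '/apples/,/peaches/p' fruit.txt    # apples..peaches, one per line

# the inverse (like grep -v for a whole range): delete the range, keep the rest
sed '/apples/,/peaches/d' fruit.txt       # leaves only: watermelon
```

The same two-address form is what kbp's suggestion earlier in the thread uses for the Gutenberg header and footer.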

P.S. In reading your replies, I looked at my original code, and although I explained the situation correctly in the text part, there was a mistake in the code section. I've edited and corrected it.

The error was

Code:

grep "End of the Project Gutenberg EBook of.*End of the Project Gutenberg EBook of" flatland.txt
which does not do anything, as grail rightly pointed out.

What I meant to put was:

Code:

grep "The Project Gutenberg EBook of.*START OF THIS PROJECT GUTENBERG EBOOK"
Sorry about the confusion, and thank you for giving me intelligent answers anyway!

citygrid 04-16-2010 02:28 PM

As for the word count issue, thanks for clarifying, grail. I did some experimenting, and as it turns out, there are several key differences:

gedit considers "J.D. Salinger" three words, whereas wc only counts two.

gedit also counts newlines, but wc ignores them.

And finally, gedit considers contractions (I'm, don't, etc.) to be two words, while wc counts them as one.
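As a rough cross-check of these differences (a sketch; counting alphanumeric runs only approximates gedit's behaviour, whose actual algorithm isn't documented here):

```shell
printf "J.D. Salinger said I'm here\n" > sample.txt

wc -w sample.txt                            # 5 whitespace-separated tokens
grep -oE '[[:alnum:]]+' sample.txt | wc -l  # 7: J, D, Salinger, said, I, m, here
```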

citygrid 04-16-2010 05:55 PM

Finally got something to work for me -- it's sort of similar to what kbp suggested above (thanks, kbp):

Code:

cat flatland.txt | sed '/The Project Gutenberg EBook of/,/START OF THIS PROJECT GUTENBERG EBOOK/d; /End of the Project Gutenberg EBook of/,/\n.*$/d'
The newline was put into the second part of the argument because for some reason /.*$/d was not sufficient.

syg00 04-16-2010 08:00 PM

If you want to delete to eof, just use "$" by itself - as in "sed '/blah/,$ d' flatland.txt"
Note your use of "cat" is superfluous.

citygrid 04-16-2010 08:45 PM

Thanks, syg00. That's much neater.

By the way, here's my solution for my question #2 in the original post (how to output the first 2000 words of a file).

Code:

cat file.txt | tr "\n" " " | tr " " "\n" | head -n 2000 | tr "\n" " "
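A hedged refinement of the pipeline above: consecutive blanks become empty lines that head then counts toward its 2000, which may explain the count coming out off. tr -s squeezes runs of whitespace into a single newline instead (shown on a tiny stand-in file in place of file.txt):

```shell
printf 'a  quick\n\nbrown fox jumps\n' > file.txt   # stand-in input

# -s squeezes repeats, so double spaces and blank lines never
# produce empty "words" for head to count
tr -s '[:space:]' '\n' < file.txt | head -n 3 | tr '\n' ' '
# prints: a quick brown
```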
And here's a kind of clunky method for preserving newlines, if formatting of the text is important:

Code:

cat file.txt | sed ':top $!N;s/\n/@NEWLINE@/g; ttop' | tr " " "\n" | head -n 2000 | tr "\n" " " | sed ':top $!N;s/@NEWLINE@/\n/g; ttop'
After the filtering, the final wc comes out slightly more than 2000, but it's good enough for what I needed it for.

I'm marking this thread as solved. Thanks, everyone.

grail 04-17-2010 02:29 AM

Well just for later thought, here is another way you could go about it:
Code:

awk 'BEGIN{wc=0;p=0;s=1}p && /EBook/{p=0;s=0}p && wc < 2000{wc += NF;print}s && /EBOOK/{p=1}' flatland.txt

This has the same restriction that it will finish on the line that contains the 2000th word, but not necessarily on
that word.
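For completeness, a hypothetical variant without that restriction: print word by word and stop exactly at word n (shown on a stand-in file, demo.txt; note it flattens the original line breaks into spaces):

```shell
printf 'one two three\nfour five six\n' > demo.txt   # stand-in input

# walk each field, stop the moment the running count c reaches n
awk -v n=4 '{for (i = 1; i <= NF; i++) { printf "%s ", $i; if (++c == n) exit }}' demo.txt
# prints: one two three four
```

With n=2000 and flatland.txt as input, this would end on exactly the 2000th word.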

