Help using wc and grep with regular expressions
I'm writing a program that works with text files, and I'm trying to create some filters with grep. I have various questions here, so I'll number them for clarity.
1) First of all, I'd like to know what wc -w is actually returning. The word count is less than what gedit reports in Document Statistics, so gedit is evidently counting something (newlines, perhaps) that wc -w is not.
2) Secondly, I was wondering if there is a way to grep the first x number of words. I'm looking for something like the -m option, but limited to a certain number of words instead of lines. For example, to find the first 2000 words, do something like grep -someoption 2000 ".*" or use \{1,2000\}.
3) Finally, I'm trying to filter out the headers and footers of a text file, but I'm having no luck. The text files are Project Gutenberg files, so they have standardized headers and footers. Here's an example: http://www.gutenberg.org/files/97/97.txt The header starts with "The Project Gutenberg EBook of" and ends with the line containing "START OF THIS PROJECT GUTENBERG EBOOK". The footer begins with "End of the Project Gutenberg EBook of".
My problem is, grep can find:
Code:
grep "The Project Gutenberg EBook of" flatland.txt
and:
Code:
grep "START OF THIS PROJECT GUTENBERG EBOOK" flatland.txt
but it cannot find:
Code:
grep "The Project Gutenberg EBook of.*START OF THIS PROJECT GUTENBERG EBOOK"
Similarly, grep can find:
Code:
grep "End of the Project Gutenberg EBook of.*$" flatland.txt |
I don't see any headers and footers in the file you linked---what am I missing?
|
Try
Code:
grep "[End of the Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt Code:
[root@athlonz flatland]# grep "[End of the Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt | wc Code:
grep "[The Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt Code:
[root@athlonz flatland]# grep "[The Project Gutenberg EBook of].*[End of the Project Gutenberg EBook of]" flatland.txt | wc |
This seems to work:
Code:
cat 97.txt | sed -n '/.*START OF THIS PROJECT GUTENBERG EBOOK FLATLAND.*/,/End of the Project Gutenberg EBook of.*/p'
hth |
Okay, let's try and go in order:
1 - Consider if the following was the only content in a file:
Type=Link
gedit's stats will say there are two words here, because it does a human-language, dictionary-style lookup. wc -w, on the other hand, reports how many words are in the file, a single word being any group of characters without whitespace; hence its count is 1.
2 - As with the above, you first need to decide what you consider a word to be, e.g. consecutive letters followed by a space (which could mean that blah is a word but "blah" is not, since a quote mark is not a letter). Once you have made this decision, you can use the {n} construct after your regex to say how many repetitions you want.
3 - Not sure why you thought this would work: Quote:
A line containing that phrase, then .*, then the same phrase again simply does not exist in the file, so grep rightly returns nothing. As for: Quote:
Code:
End of the Project Gutenberg EBook of Flatland, by Edwin A. Abbott
works perfectly fine. And lastly, your final statement can finish here: Quote:
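Part of what is going on here is that grep applies its regex to one line at a time, so a single pattern can never span the header and the footer. A minimal sketch with a made-up three-line input:

```shell
# grep tests its pattern against one line at a time; a match can
# never cross the newline between two lines.
printf 'HEADER\nbody text\nFOOTER\n' | grep -c 'HEADER.*FOOTER'   # prints 0: pattern spans two lines
printf 'HEADER\nbody text\nFOOTER\n' | grep -c 'body.*text'       # prints 1: pattern fits on one line
```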
|
Quote:
and they get the same result: Code:
[root@athlonz flatland]# wc.pl flatland.txt |
Thank you for proving the point, as the lines responsible for word counting are:
Code:
61 my @w=split(/[ \t]+/, $line);
Hence this code would also consider Type=Link as one word, where gedit looks at this in a humanish way and says it is two words, one on either side of the equals sign. |
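That split-on-whitespace rule is easy to verify from the shell; a minimal sketch (the scratch file name is made up):

```shell
# wc -w splits on whitespace, so "Type=Link" is a single word to it.
printf 'Type=Link\n' > /tmp/wc_demo.txt
wc -w < /tmp/wc_demo.txt                          # prints 1
# Counting runs of letters instead gives gedit's answer:
grep -oE '[[:alpha:]]+' /tmp/wc_demo.txt | wc -l  # prints 2 ("Type" and "Link")
```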
First of all, I appreciate all your replies. I've given you all Thanks.
All of your suggestions worked, but what I am trying to do is filter/eliminate the header and footer entirely with grep (using -v). I think the problem lies in my basic misunderstanding of what the wildcard represents. I am looking for an expression which means "return everything from this to that", so that the .* wildcard would INCLUDE all of the words in the filtering process. For example:
Code:
#!/bin/bash
Code:
grep "apples.*peaches" fruit.txt
Code:
grep "[apples].*[peaches]" fruit.txt
How can I phrase it correctly to return "apples bananas oranges peaches"?
P.S. In reading your replies, I looked at my original post, and although I explained the situation correctly in the text part, there was a mistake in the code section. I've edited and corrected it. The error was:
Code:
grep "End of the Project Gutenberg EBook of.*End of the Project Gutenberg EBook of" flatland.txt
What I meant to put was:
Code:
grep "The Project Gutenberg EBook of.*START OF THIS PROJECT GUTENBERG EBOOK" |
As for the word count issue, thanks for clarifying, grail. I did some experimenting, and as it turns out, there are several key differences:
gedit considers "J.D. Salinger" three words, whereas wc counts only two. gedit also counts newlines, but wc ignores them. And finally, gedit considers contractions (I'm, don't, etc.) to be two words, while wc counts them as one. |
Finally got something to work for me -- it's sort of similar to what kbp suggested above (thanks, kbp):
Code:
cat flatland.txt | sed '/The Project Gutenberg EBook of/,/START OF THIS PROJECT GUTENBERG EBOOK/d; /End of the Project Gutenberg EBook of/,/\n.*$/d' |
If you want to delete to eof, just use "$" by itself - as in "sed '/blah/,$ d' flatland.txt"
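A minimal sketch of that delete-to-end form, using a made-up three-line input:

```shell
# In sed, '/PATTERN/,$d' deletes every line from the first match of
# PATTERN through the end of the input ($ addresses the last line).
printf 'body text\nEnd of the Project Gutenberg EBook of Foo\nlicense\n' |
  sed '/End of the Project Gutenberg EBook of/,$d'
# prints only: body text
```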
Note your use of "cat" is superfluous. |
Thanks, syg00. That's much neater.
By the way, here's my solution for my question #2 in the original post (how to output the first 2000 words of a file).
Code:
cat file.txt | tr "\n" " " | tr " " "\n" | head -n 2000 | tr "\n" " "
That version loses the original line breaks, so here's a variant that protects them with a placeholder and restores them afterwards:
Code:
cat file.txt | sed ':top $!N;s/\n/@NEWLINE@/g; ttop' | tr " " "\n" | head -n 2000 | tr "\n" " " | sed ':top $!N;s/@NEWLINE@/\n/g; ttop'
I'm marking this thread as solved. Thanks, everyone. |
Well just for later thought, here is another way you could go about it:
Code:
awk 'BEGIN{wc=0;p=0;s=1}p && /EBook/{p=0;s=0}p && wc < 2000{wc += NF;print}s && /EBOOK/{p=1}' flatland.txt
This has the same restriction that it will finish on the line that contains the 2000th word, but not necessarily on that word. |