ThinkLinux 02-24-2010 03:49 AM

awk, sed, grep and paragraphs

I need to extract paragraphs that are more than 4 lines long from a text file.
The paragraph length may vary according to the results from a wget request. The paragraphs are separated by blank lines, and I need the entire contents of each matching paragraph to be returned in order to follow the redirects.

What would be the best way of doing this?


Tinkster 02-24-2010 05:08 AM


Welcome to LQ!

The quick & easy way:

awk 'BEGIN{RS=ORS="\n\n";FS=OFS="\n"}NF>=4' file
What this does is quite simple: awk normally treats each
line (\n) as a record and splits it into fields on runs of
whitespace. Here we tell it that a field is anything ending
in a line-end (FS), and that a record ends at a sequence of
2 line-endings with nothing else in between (RS), AKA our
empty line between paragraphs. The rest is even simpler:
if NF (number of fields, AKA lines with content) is greater
than or equal to 4, perform the default action (which is
print, and which we have lazily omitted). Since you asked for
more than 4 lines, use NF>4 instead if paragraphs of exactly
4 lines should be skipped. Setting RS=ORS and FS=OFS means
the output isn't reformatted to the "standard" awk separators.
Note that a multi-character RS is an extension: gawk and mawk
treat it as a regular expression, but POSIX leaves it unspecified.
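
If your awk isn't gawk or mawk, a more portable variant is awk's "paragraph mode" (RS set to the empty string), which POSIX defines as splitting records on blank lines. Here is a minimal sketch, assuming you want strictly more than 4 lines; the function name long_paragraphs and its two arguments are made up for illustration:

```shell
# Hypothetical helper (not part of any tool): print every paragraph of a
# file that has more than a given number of lines.
#   $1 = line threshold, $2 = input file
long_paragraphs() {
    # RS=""      paragraph mode: records are separated by blank lines
    # FS="\n"    each line of the paragraph becomes one field
    # ORS="\n\n" keep a blank line between the paragraphs we print
    awk -v min="$1" '
        BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }
        NF > min    # default action: print the whole paragraph
    ' "$2"
}
```

Usage would be something like: long_paragraphs 4 file. Paragraph mode treats a run of several blank lines as a single separator and ignores the newline at the end of the file, so the line count of the last paragraph isn't thrown off.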


Tinkster 02-25-2010 08:02 PM

OP, did you find the explanation satisfactory? Nothing left unclear?

Star_Gazer 04-09-2010 02:22 PM


Originally Posted by Tinkster (Post 3877043)
OP, did you find the explanation satisfactory? Nothing left unclear?

It educated me some! :hattip:

Not sure if the OP is aware of what "OP" means - depends on whether they are "forum-savvy" or not. :)


