Extract last paragraph from text file

pixellany · 07-16-2012, 04:18 AM

OOPS!!---yes, I was thinking end of file---fingers were not connected to brain.

Snark1994 · 07-16-2012, 04:32 AM

Quote:

Originally Posted by firstfire

Here is sed solution:

Code:

$ sed -rn '{:a; /-{5,}/be; N; ba};  :e; $p' infile

Or you can do

Code:

$ cat infile | sed -rn '{:a; /-{5,}/be; N; ba};  :e; $p' > outfile

if you wish.

Funny, it was simple to invent this script, but it took a while to understand why it works..

That's just showing off -.- nice work :P

Also, if the OP considers eir problem solved, could e please mark the thread as 'SOLVED'. Thanks

pixellany · 07-16-2012, 04:58 AM

Quote:

That's just showing off -.- nice work :P

There are a few people here who enjoy solving problems with specific commands--I am one of them. To me, it's just like solving puzzles.

I'm working on the FORTRAN solution to the OP's problem....

bunti01 · 07-16-2012, 06:06 AM

thanks all for your quick responses

danielbmartin · 07-16-2012, 06:32 AM

Quote:

Originally Posted by grail

Code:

awk 'NF{d=$0}END{print d}' RS='--+' file

This is brilliant! Please walk us through it.

As a novice awker I cannot fully follow it. I've groped through the darkness only this far:

NF is a System Variable which is the number of fields for the current input record. What is its significance here?

RS is a System Variable which is the record separator. In this thread the individual paragraphs are separated by a string of dashes so RS='--+' might be defining each paragraph as a record. Is this right? Why the +? Why is it cited at the end of this awk instead of the beginning?

$0 is the current input record in its entirety.

d is apparently a variable because if I change it to e or f the code still works.

So... (and here I get shaky)... we read the entire file one paragraph at a time, and each time we overwrite the contents of variable d with the most recent paragraph. Then we hit END which tells us to stop reading and start printing. There's only one thing to print, and that is d, the last paragraph.

Daniel B. Martin

grail · 07-16-2012, 09:29 AM

Not too far off, Daniel.

NF - you are correct about its origin. You then need to remember that whatever sits in front of the {} is a pattern that is evaluated as true or false. A record with zero fields has an NF value of zero, so the braces are not entered and the value of variable 'd' does not change. This matters for the OP's example because there are dashes after the last visible record, so awk treats the empty text after the last dashes as the final record, which of course we do not wish to print.
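A minimal sketch of the NF guard at work. The sample data and the variable names `sample`, `without_guard` and `with_guard` are made up for illustration, and it assumes an awk that accepts a regular expression as RS (gawk and mawk do):

```shell
# Hypothetical sample data: three paragraphs, each followed by a line of dashes.
sample='first
-----
second
-----
third
-----'

# Without the NF guard, the empty record after the final dashes overwrites d.
without_guard=$(printf '%s\n' "$sample" | awk '{d=$0} END{print d}' RS='--+')

# With the NF guard, records that have no fields are skipped, so d keeps the third paragraph.
with_guard=$(printf '%s\n' "$sample" | awk 'NF{d=$0} END{print d}' RS='--+')

printf 'without guard: [%s]\n' "$without_guard"
printf 'with guard:    [%s]\n' "$with_guard"
```

The records keep the newlines that surrounded them in the file, so the bracketed output may span more than one line.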

RS - again origin is correct. The trick to remember with awk is that there are actually 3 places you can set 'system variables':

1. Use -v ... awk -vRS="--+"

2. In the BEGIN ... BEGIN{RS = "--+"}

3. After the 'program' ... this is of course what I have used here

My general rule of thumb: if there is only one variable to set and it is less typing, I set it after the program; otherwise I use BEGIN. I (usually) reserve the -v option for values I wish to draw from the environment.
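The three placements can be sketched side by side. This is only an illustration: it uses a single-character separator (`;`) and a throwaway variable `x` rather than the thread's data, and the shell variable names are invented here:

```shell
# 1. -v: assigned before the program starts, so BEGIN can already see it.
via_v=$(printf 'a;b;c\n' | awk -v RS=';' 'NR==2')

# 2. BEGIN: assigned inside the program, before any input is read.
via_begin=$(printf 'a;b;c\n' | awk 'BEGIN{RS=";"} NR==2')

# 3. After the program: assigned when awk reaches that argument,
#    which is after BEGIN has run but before the input is read.
via_operand=$(printf 'a;b;c\n' | awk 'NR==2' RS=';')

# A trailing assignment is not yet visible inside BEGIN, so x prints as empty.
seen_in_begin=$(awk 'BEGIN{printf "[%s]", x}' x=1)

printf '%s %s %s %s\n' "$via_v" "$via_begin" "$via_operand" "$seen_in_begin"
```

All three placements pick out the same second record; the only practical difference shown is that a value set after the program is not available to BEGIN.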

As for the '+', * is zero or more and + is one or more. The data lent itself to the latter (try changing it to a * and see the difference)
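One way to see the difference without touching RS is to count the pieces that split() produces. A small sketch with a made-up input line (`line`, `with_plus` and `with_star` are illustrative names):

```shell
# Hypothetical input: one single dash and one run of three dashes.
line='a-b---c'

# '--+' needs at least two dashes, so only the run of three is a separator.
with_plus=$(printf '%s\n' "$line" | awk '{print split($0, parts, /--+/)}')

# '--*' is satisfied by a single dash, so both places become separators.
with_star=$(printf '%s\n' "$line" | awk '{print split($0, parts, /--*/)}')

printf 'pieces with +: %s, pieces with *: %s\n' "$with_plus" "$with_star"
```

With `*`, any lone hyphen inside a paragraph would also end the record, which is presumably not what is wanted here.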

Quote:

Then we hit END which tells us to stop reading and start printing.

Slight correction here: END is only processed once all files have finished being read (gawk v4+ now also contains ENDFILE, which allows you to set things to occur when each file completes)
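A quick sketch of the END behaviour with two throwaway files (the file names and `tmpdir` are invented for the example; ENDFILE is left out because it needs gawk):

```shell
# Create two small scratch files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'one\ntwo\n' > "$tmpdir/a.txt"
printf 'three\n' > "$tmpdir/b.txt"

# END fires a single time, after the last file, so NR is the total across both files.
end_output=$(awk 'END{print "END ran with NR=" NR}' "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$end_output"

rm -rf "$tmpdir"
```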

Please let me know if you need any further information

danielbmartin · 07-16-2012, 09:59 AM

Quote:

Originally Posted by grail

Please let me know if you need any further information

All questions answered; thank you.

Purely as a learning exercise, I propose making the OP's problem a bit more difficult. Suppose he wanted the penultimate paragraph. How could that be done?

Daniel B. Martin

grail · 07-16-2012, 10:28 AM

I would suggest using two variables and printing the older one. If you then extend this to any record, I would suggest storing the records in an array and printing element (length - N) of the array.
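Both ideas might look roughly like this. It is a sketch under assumptions, not a definitive answer: the sample data, the names `prev`, `cur`, `n` and the offset `k` are all invented here, and empty records are skipped with NF as in the earlier one-liner:

```shell
# Hypothetical sample data: three paragraphs, each followed by a line of dashes.
sample='first
-----
second
-----
third
-----'

# Two variables: prev always trails cur by one non-empty record.
penultimate=$(printf '%s\n' "$sample" |
    awk 'NF{prev=cur; cur=$0} END{print prev}' RS='--+')

# Array: keep every non-empty record, then count back k from the last one.
# k is a hypothetical parameter: k=0 is the last paragraph, k=2 the third from last.
third_from_last=$(printf '%s\n' "$sample" |
    awk -v k=2 'NF{d[++n]=$0} END{print d[n-k]}' RS='--+')

printf 'penultimate:     [%s]\n' "$penultimate"
printf 'third from last: [%s]\n' "$third_from_last"
```

The array version holds every paragraph in memory, so the two-variable form is the lighter choice when only the penultimate paragraph is needed.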

firstfire · 07-16-2012, 11:44 AM

Hi.

Quote:

Originally Posted by danielbmartin

Suppose he wanted the penultimate paragraph. How could that be done?

With sed it turned out to be very simple:

Code:

$ sed -nr '{:a; /-{5,}/be; N; ba}; :e; x; $p' in

The only new command here is 'x', which swaps pattern space and hold space. These two registers constitute a "ring buffer" of length 2.
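The effect of 'x' on its own can be sketched on a three-line input, where each "paragraph" is just a single line (the variable name `previous_line` is illustrative):

```shell
# On every line, x swaps pattern space with hold space, so pattern space
# ends up holding the line read one cycle earlier; $p prints it on the last line.
previous_line=$(printf 'a\nb\nc\n' | sed -n -e 'x' -e '$p')
printf '%s\n' "$previous_line"   # prints "b", the line before the last
```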

danielbmartin · 07-16-2012, 02:11 PM

Quote:

Originally Posted by firstfire

With sed it turned out to be very simple:

Code:

$ sed -nr '{:a; /-{5,}/be; N; ba}; :e; x; $p' in

Lovely!

Daniel B. Martin

danielbmartin · 07-16-2012, 10:04 PM

Quote:

Originally Posted by danielbmartin

Suppose he wanted the penultimate paragraph. How could that be done?

Playing off grail's method I devised this.

Code:

tac "$InFile" | awk 'NR==3 {print $0}' RS='--+' | tac > "$OutFile"

Daniel B. Martin

grail · 07-16-2012, 11:19 PM

Code:

awk '{d[NR]=$0}END{print d[NR-2]}' RS='--+' file

Snark1994 · 07-17-2012, 05:23 AM

Quote:

Originally Posted by pixellany

There are a few people here who enjoy solving problems with specific commands--I am one of them. To me, it's just like solving puzzles.

As am I; I was just admiring a particularly amazing solution. Well, here's a Haskell version:

Code:

import System.Environment (getArgs)
import Text.Regex.Posix

interactWith function inputFile outputFile = do
    input <- readFile inputFile
    writeFile outputFile (function input)

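main = do
    args <- getArgs
    case args of
        [input,output] -> interactWith (unlines . (!!1) . reverse . splitSections . lines) input output
        _ -> putStrLn "Usage: this_script.hs inputfile outputfile"

-- Group the lines into sections; folding from the right, each separator
-- line opens a new group, so a group is a paragraph plus its dashes.
splitSections :: [String] -> [[String]]
splitSections xs = foldr step [[]] xs
    where
        step x acc
            | (x =~ "---------" :: Bool) = [x] : acc
            | otherwise                  = (x : head acc) : tail acc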

Good luck with your FORTRAN, I've only had to use it once and I still have nightmares about it...

danielbmartin · 07-17-2012, 05:59 AM

Quote:

Originally Posted by grail

Code:

awk '{d[NR]=$0}END{print d[NR-2]}' RS='--+' file

grail hits the bulls-eye again! Thank you!

Daniel B. Martin

danielbmartin · 07-17-2012, 06:32 AM

I know nothing of internal implementation, hence this question.

Does tac really begin reading a file from its last record, or does it read the entire file and buffer it (or parts of it)?

The answer has performance implications. If the input file is huge, then a solution to the OP's problem which begins with tac might be faster than another solution which reads the entire file, start to end. That assumes tac is clever enough to read only enough to satisfy the following piped commands.

Ideas? Comments?

Daniel B. Martin