[SOLVED] sed, awk, Keep only text between two regular expressions

scott_audio · 08-05-2009, 07:47 AM

Greetings - I know this is most likely answered somewhere, but would appreciate some guidance. Not important, just trying to learn more.

I have a directory of saved rss articles, all beginning with IS_*.
Each article has the top fields, then the body text, then a bunch of garbage at the bottom: links, etc.

My goal is to keep everything between the first blank line and a common phrase at the end of the article, and save all the articles to one file.

Sample file:
-----
Title: title
Date: date
Link: link

I want to keep only this text

Each article ends with this text Want podcast of every article?

[1] link
[2] link
[3] and so on

-----

I can get everything to work by hacking away at it, deleting lines that start with [ , etc., but just wondering if there is an easier, more precise way to do it.

for i in $(ls IS_*.txt); do
cat $i | egrep ^Title\: | awk -F": " '{ print $2": " }' | sed 'G' >> is.1
# (magic sed command to keep the body of the message goes here)
done

I've read manuals and google'd, can't seem to find what I need, any guidance would be appreciated.

-Scott

raskin · 08-05-2009, 08:44 AM

I see your problem comes from not reading man csplit, which in turn comes from not having heard of csplit. If you prefer, read the POSIX csplit manual http://www.opengroup.org/onlinepubs/...es/csplit.html instead of the GNU csplit manual. I doubt you meant that every other POSIX-compliant utility is forbidden.
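
A minimal sketch of the csplit approach, under stated assumptions: the function name extract_body is hypothetical, the body is taken to start at the first blank line and to end just before the first line containing "Want podcasts", and mktemp -d is assumed to be available (it is common but not required by POSIX).

```shell
# Hypothetical helper: print only the body of one article using csplit.
extract_body() {
    article=$1
    workdir=$(mktemp -d) || return 1
    # -s: do not print the size of each piece; -f: prefix for the piece files.
    # Piece 00 = header fields, piece 01 = body (starting at the blank line),
    # piece 02 = the end marker and the trailing links.
    csplit -s -f "$workdir/piece_" "$article" '/^$/' '/Want podcasts/' &&
        cat "$workdir/piece_01"
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

It would be called as extract_body IS_something.txt >> is.1. Because piece 01 begins at the blank line itself, the output starts with that blank line.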

vonbiber · 08-05-2009, 08:44 AM

Quote:

Originally Posted by scott_audio

I have a directory of saved rss articles, all beginning with IS_*.
Each article has the top fields, then the body text, then a bunch of garbage at the bottom: links, etc.

My goal is to keep everything between the first blank line and a common phrase at the end of the article, and save all the articles to one file.

-Scott

Could you please copy and paste here a passage from one of your
RSS articles, put this line before the passage you want to retrieve:

### scott_audio: beginning

and follow the last line of the passage with:

### scott_audio: end

It would be easier to figure out if I can look at the
actual content.
Offhand, it should be possible to do this with a shell script
that loops through your files and runs a sed script on each one.

scott_audio · 08-05-2009, 09:18 AM

vonbiber: hi, here is one of the text files:

Title: Bubble Tea / The all-in-one beverage and snack
Author: Author
Date: Fri, 24 Jul 2009 01:00:01 -0400
Link: http://link..

[image 1]
### scott_audio: beginning

Several paragraphs here that I want to keep.

One of the great things about spending time in another country is learning
about new and unique foods. When I was living in Vancouver, Canada a few years
ago, I became acquainted with bubble tea,...

### scott_audio: end

Want podcasts of every article? Support ITotD by [becoming a paid subscriber][3]
.

[Permalink][4] [Email this Article][5] Category:[Food & Drink][6]
[image 7] Good karma: priceless. For everything else, theres [PayPal donations][8]
! [[?][9]]
[[image 11]][10]More Information about Bubble Tea...

-----

This is what I have, and it works, I just want to learn how to write it better. Please don't laugh too loudly, I am a newbie

Code:

for i in $(ls $JEFF/IS_*.txt); do
        cat $i | egrep ^Title\: | awk -F": " '{ print $2": " }' | sed 'G' >> $JEFF/is.rtf;
        sed '/Link: /,/Want podcasts/!d' $i > $JEFF/is.1;
        sed '/Link: /d' $JEFF/is.1 > $JEFF/is.2;
        sed '/\[image/d' $JEFF/is.2 > $JEFF/is.3;
        sed '/./,/^$/!d' $JEFF/is.3 > $JEFF/is.4
        sed '/Want podcasts/d' $JEFF/is.4 >> $JEFF/is.rtf;
        echo "-----" >> $JEFF/is.rtf;
done

vonbiber · 08-05-2009, 10:27 AM

I'll have a look and get back to you.
So it seems the top marker is '[image ...]'

and the bottom marker is
'Want podcasts of every....'

and you want to retrieve only what's in between
so that your output would be
##### beginning of output
Several paragraphs here
...
bubble tea
##### end of output

and maybe you want also to keep the title as well

Bubble Tea / The all... snack

the date and link?

scott_audio · 08-05-2009, 10:33 AM

that's correct...

title

everything between '[image' and 'Want podcasts of...'

thanks for taking time to look, it's not important though, what i have works, i just want to learn how to write it correctly.
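
One possible way to get the title plus everything between '[image' and 'Want podcasts' in a single sed pass, with no temporary files. This is only a sketch: the function name print_article is hypothetical, and it assumes the title is on a line starting with "Title: " and that both markers start at the beginning of a line, as in the sample above.

```shell
# Hypothetical helper: print the title, then the body between the first
# "[image" line and the "Want podcasts" line, for one article file.
print_article() {
    sed -n '
        # Title line: strip the field name and print what is left.
        s/^Title: //p
        # From the first "[image" line to the end of the file ...
        /^\[image/,$ {
            # ... drop the "[image" lines themselves,
            /^\[image/d
            # stop reading at the end marker without printing it,
            /^Want podcasts/q
            # and print everything else.
            p
        }
    ' "$1"
}
```

Quitting at the end marker matters here: the sample has another line starting with "[image" further down, so a plain /start/,/end/ range would open a second time and print the trailing links.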

synss · 08-05-2009, 04:56 PM

I did not know about csplit either, thanks. But to cut some text like you want, I use awk:

Code:

awk '
$0 ~ "Link:" || c {
	c++
	if ( $0 ~ "Want podcasts of every article?" ) {
		exit
	}
	if ( c > 1 ) {
		print $0
	}
}' feed_filename.txt >> keptRSS.txt

$0 is the current line.

Your for loop is ugly, you should write

Code:

for i in IS_*.txt; do ... ; done

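no need for $(ls): the shell expands the glob by itself, and unlike the output of ls the result is not split apart when a file name contains spaces.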
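
Putting the two pieces together might look like the sketch below. The function name collect_articles and its two arguments are hypothetical, and the awk program is a lightly adapted copy of the one above (the patterns are anchored to the start of the line, which is an assumption about the files).

```shell
# Hypothetical wrapper: run the awk filter over every IS_*.txt file in a
# directory and append the results to one output file.
collect_articles() {
    src_dir=$1
    out_file=$2
    for f in "$src_dir"/IS_*.txt; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        awk '
            # The block runs on the "Link:" line and, because c is then
            # non-zero, the "|| c" part keeps it running on every later line.
            /^Link:/ || c {
                c++
                if (/^Want podcasts/) exit   # end marker: stop reading
                if (c > 1) print             # skip the "Link:" line itself
            }
        ' "$f" >> "$out_file"
        echo "-----" >> "$out_file"
    done
}
```

A call such as collect_articles "$JEFF" "$JEFF/is.rtf" would then replace the whole loop. The quotes around "$f" are what keep file names with spaces in one piece.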

scott_audio · 08-05-2009, 06:45 PM

synss, hi, and thank you very much. That was exactly what I was looking for: a good, solid example of what I was trying to accomplish. I made my loop, piped the final output through sed '/./,/^$/!d' to take out any double blank lines and tr -d '\n' to keep each paragraph as one long line, and it worked perfectly. I'm reading up on what the double pipe || means. It looks like, though, it starts two lines down from the line that starts with 'Link:' and prints all the lines until it encounters 'Want podcasts...'

I've learned a lot and had fun, thanks again all
-Scott
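
For reference, a small sketch of what the sed '/./,/^$/!d' filter mentioned above does on its own. The function name squeeze_blank is hypothetical. The filter deletes every line that is not inside a range running from a non-blank line through the next blank line, so leading blank lines disappear and each run of blank lines shrinks to one.

```shell
# squeeze_blank: keep only lines inside a "non-blank line through the next
# blank line" range; everything else (extra blank lines) is deleted.
squeeze_blank() {
    sed '/./,/^$/!d'
}

# Prints "first paragraph", one blank line, then "second paragraph".
printf '\n\nfirst paragraph\n\n\n\nsecond paragraph\n' | squeeze_blank
```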

vonbiber · 08-06-2009, 06:11 AM

a sed script should do the job

1. copy and paste to a file that you could name seder.sed (or whatever)
then make it executable: chmod +x seder.sed

#### seder.sed : cut and paste ######
#!/bin/sed -f

:loop
N
$!b loop

#get rid of carriage returns
s?\r\n?\n?g
s?\r\([^\n\r]\)?\n\1?g
s?\r\r[\r\n]*?\n\n?g

s?Author:[^\n]*\n??
s?^.*Title:[ \t]*??
s?Link:[ \t]*\([^\n ]*\)?\1?

#first marker: after date
#keep only day month year
s?Date: *\([^\n]*\) [0-9][0-9]:[0-9][^\n]*?\1º?

#just in case
s?°?\&degree;?g
#second marker: [image ...]
s?\[image [0-9][^\n]*?°?

#keep everything from top to first marker (title, date)
#remove everything between first marker and second marker
s?^\([^º]*\)º[^°]*°[ \t\n]*?\1\n?

#remove everything below the line (including the line, and
# the blank lines that precedes it) that starts with:
s|[ \t\n]*Want podcasts of every article? Support ITotD by .*$||

#remove extra blank lines
s?\n[ \t\n]*\n?\n\n?g
########### end of seder.sed ############

2. test on one of your files: ./seder.sed IS_something.txt

3. If the output on the terminal looks ok, you can then:
a) loop thru all your files
b) redirect the output to a file (./seder.sed IS_something.txt > is_something.rtf)

4. you can write a shell script that does all of this:
loop, apply seder, redirect to a file, eg,

################ shell script ################
#!/bin/sh

#this is the current working directory
CWD=$PWD

#better put in an absolute path for this
JEFF=fill_in_the_value_here

SEDER=$CWD/seder.sed

find "$JEFF" | while IFS= read -r f
do
    #skip directories
    if [ -d "$f" ]; then
        continue
    fi
    dir=${f%/*}
    filename=${f##*/}
    ext=${filename##*.}
    name=${filename%.*}
    #only process file if it is of type IS_something.txt
    #(= rather than ==, which plain sh does not guarantee)
    if [ "$ext" = "txt" ] && [ "$(echo "$name" | grep -c '^IS_')" -gt 0 ]; then
        input="$f"
        #same directory, lower-cased name, .rtf extension
        output="$dir/$(echo "$name" | tr '[:upper:]' '[:lower:]').rtf"
        "$SEDER" "$input" > "$output"
    fi
done
######## end of shell script ##################

chmod +x whatever.sh
./whatever.sh
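
The first three commands of seder.sed (:loop, N, $!b loop) read the whole file into sed's pattern space, which is what lets the later substitutions match across line breaks. A stripped-down sketch of that idiom, reusing only the "remove extra blank lines" step: the function name collapse_blank_runs is hypothetical, and the \n and \t escapes assume GNU sed.

```shell
# collapse_blank_runs: slurp all input into the pattern space, then replace
# every run of blank (or whitespace-only) lines with a single blank line.
collapse_blank_runs() {
    # :loop / N / $!b loop  -> keep appending the next line until the last one
    # s/\n[ \t\n]*\n/\n\n/g -> edit across the embedded newlines
    sed -e ':loop' -e 'N' -e '$!b loop' -e 's/\n[ \t\n]*\n/\n\n/g'
}
```

For example, printf 'a\n\n\n \nb\n' | collapse_blank_runs leaves a single blank line between a and b. Since the whole file is held in memory at once, this suits article-sized files better than very large ones.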

scott_audio · 08-06-2009, 02:46 PM

vonbiber, thanks, this gives me lots to experiment with, appreciate your time. I'm sure I'll find some more feeds, and I'll be able to apply what I've learned here.