text search and replacement: bash scripting

jettachamp26 · 02-21-2008, 01:10 PM

Ok here's my dilemma. I have a file that has been converted from odt to an html file. (There's actually thousands all with the same format)
I need a script that finds "Headline:" (or any other version of that line), then goes down 3 lines and changes the <P></P> tags to <h1></h1> tags.

The same needs to be done for "Subheadline:". Is this possible? I have my script almost done, but I am totally stumped here.

Also, the format will always be like this.

<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<P>Headline text is here </P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<P>sub-headline text is here</P>
</TD>

Any help would be awesome! My brain has been turned to mush.

-Ryan

druuna · 02-21-2008, 02:03 PM

Hi,

If the layout of the file is always the same this could be done this way:

sed '/[Hh]eadline:/{n;n;n;s/P/h1/g;}' infile

It could need a bit of tuning:
- I used [Hh]eadline: as the first search criterion to catch both Subheadline: and Headline: in one go. The layout you posted might not be the same as the real thing.....
- The /P/h1/ is a bit rough: it replaces every P on the line. To prevent false hits, put the < and > around them.

Ok, how it works:

sed first searches for a line that contains [Hh]eadline: (/[Hh]eadline:/), if a line is found the part between the curly brackets is 'activated'. The next 3 lines are read (n;n;n;) and then the substitution takes place (s/P/h1/g).

If all works as expected you could use the -i option to replace 'in place'. I.e:
sed -i '/[Hh]eadline:/{n;n;n;s/P/h1/g;}' infile
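
The tuning suggested above (anchoring the substitution on the tag brackets) can be sketched as follows. This is only a sketch: the `sample` variable is a stand-in for the real file, and the address `/<B>Headline:/` assumes the label is always wrapped in <B> exactly as in the first post, so that Subheadline: lines are left alone.

```shell
# Hypothetical sample mirroring the layout from the first post.
sample='<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<P>Headline text is here </P>
</TD>'

# On the Headline: line, step down three lines (n;n;n), then rewrite
# only the literal <P> and </P> tags instead of every letter P.
result=$(printf '%s\n' "$sample" |
  sed '/<B>Headline:/{n;n;n;s/<P>/<h1>/;s/<\/P>/<\/h1>/;}')

printf '%s\n' "$result"
```

A second sed call with `/<B>Subheadline:/` and a different heading tag would handle the sub-headline the same way.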

Hope this helps.

ilikejam · 02-21-2008, 02:14 PM

Hi.

Code:

sed -E '/<B>(H|Subh)eadline:<\/B>/{n;n;n;s/P>/h1>/g}'

seems to work.

Dave

druuna · 02-21-2008, 02:18 PM

Hi,

@ilikejam: Maybe i've 'beaten' you, but your code is a bit more robust. Which would make me a bit lazy, a truism in this case...

jettachamp26 · 02-21-2008, 02:28 PM

The first way did the trick. I made 2 lines, one for Headline and one for Subheadline, since each one uses a different <H> tag. Thanks for explaining it too.

Thanks for the fast replies!

The other question I had: I am trying to remove/delete all unwanted code from the file (for instance, everything from the beginning of the file to the headline tags, and the ugly table tags in between).

Here is what I mean. Obviously I want to keep the bold stuff. It's ok if this has to be done in more than one command, I just don't have enough experience yet with shell scripts to figure it out.

Could it be done using sed still? or maybe grep and then sed?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<TITLE></TITLE>
<META NAME="GENERATOR" CONTENT="OpenOffice.org 2.2 (Linux)">
<META NAME="AUTHOR" CONTENT="">
<META NAME="CREATED" CONTENT="">
<META NAME="CHANGEDBY" CONTENT="">
<META NAME="CHANGED" CONTENT="">
<STYLE TYPE="text/css">

</STYLE>
</HEAD>
<BODY LANG="en-US" DIR="LTR">
<DIV TYPE=HEADER>
<TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0>
<COL WIDTH=128*>
<COL WIDTH=128*>
<TR VALIGN=TOP>
<TD WIDTH=50%>
<P>Author: <SDFIELD TYPE=AUTHOR FORMAT=NAME>Ryan</SDFIELD></P>
</TD>
<TD WIDTH=50%>
<P ALIGN=RIGHT>Page <SDFIELD TYPE=PAGE SUBTYPE=RANDOM FORMAT=ARABIC>4</SDFIELD>
of <SDFIELD TYPE=DOCSTAT SUBTYPE=PAGE FORMAT=ARABIC>4</SDFIELD></P>
</TD>
</TR>
</TABLE>
<P ALIGN=RIGHT STYLE="margin-bottom: 0.2in">Words: <SDFIELD TYPE=DOCSTAT SUBTYPE=WORD FORMAT=ARABIC>1256</SDFIELD></P>
</DIV>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=4 CELLSPACING=0 STYLE="page-break-before: always">
<COL WIDTH=64*>
<COL WIDTH=192*>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<DIV ID="Frame1" DIR="LTR" STYLE="position: absolute; top: 1.58in; left: 1.24in; width: 6.01in; height: 0.48in; border: 1px solid #000000; padding: 0.06in; background: #ffffff">
<P CLASS="frame-contents" ALIGN=LEFT><FONT SIZE=4 STYLE="font-size: 16pt"><B>ARIZONA
– FOOD</B></FONT></P>
</DIV>
<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<h6>content of h6</h6>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<h1>content of h1</h1>
</TD>
</TR>
</TABLE>

Thanks again.

-Ryan

jettachamp26 · 02-21-2008, 02:29 PM

as you can see, there is a lot of junk I don't need, and if I can get it out of there, it would be a huge time saver.

the reason im doing this is eventually, the clean info is going into a DB so i dont need all the 'proper' HTML, just the guts i'm tryin to format all pretty-like

druuna · 02-21-2008, 02:46 PM

Hi,

I cannot make out what you want to throw away and what you want to keep, but maybe this will get you going:

sed -n '/eadline:/,$p' infile

This suppresses all normal output (-n). Once a line containing eadline: is found, everything from that line up to the end of the file is printed.

Here's the output if I use the input provided in post #5:

Code:

$ sed -n '/eadline:/,$p' infile
<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<h6>content of h6</h6>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<h1>content of h1</h1>
</TD>
</TR>
</TABLE>

But like I stated before, I'm not sure what it is you want to keep. Anyway, it could be a start.

jettachamp26 · 02-21-2008, 02:52 PM

Honestly, if i could get rid of all of the extra text (the input I gave in post #5) from the beginning of the file to
<h6>content of h6</h6> that alone would be a big help.

I want to get rid of all of that text except for the headlines, though.

druuna · 02-21-2008, 03:48 PM

Quote:

Originally Posted by jettachamp26

Honestly, if i could get rid of all of the extra text (the input I gave in post #5) from the beginning of the file to
<h6>content of h6</h6> that alone would be a big help.

sed -n '/<h6>content of h6<\/h6>/,$p' infile would do just that.

Quote:

I want to get rid of all of that text except for the headlines, though.

If all that you want to keep are the lines containing [Hh]eadline:
sed -n '/[Hh]eadline:/p' infile
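
If what should survive is the headline text itself rather than the label lines, the n;n;n trick from earlier in the thread can be combined with -n. A rough sketch, assuming the text always sits exactly three lines below its label (the `sample` variable is a stand-in for the real file):

```shell
# Hypothetical sample taken from the tail of the file shown in post #5.
sample='<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<h6>content of h6</h6>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<h1>content of h1</h1>
</TD>'

# -n suppresses normal output; for each label line, skip three lines
# ahead and print (p) only the line that holds the text.
headlines=$(printf '%s\n' "$sample" | sed -n '/[Hh]eadline:/{n;n;n;p;}')

printf '%s\n' "$headlines"
```

If the distance between label and text ever varies, this prints the wrong line, so it is worth checking a few converted files first.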

jettachamp26 · 02-21-2008, 04:07 PM

not really what i need but thats ok. I can work with it thank you.

Now I just came into a big snag.

turns out, when a file is converted to html in openoffice, the <p> tags it inserts don't always have the same stuff. Here is what I mean.

Sometimes I get this:
<P STYLE="margin-bottom: 0.1in; font-weight: medium">
<P STYLE="margin-bottom: 0.1in">
<P ALIGN=CENTER STYLE="margin-bottom: 0in">
etc.

is there a way I can tell the script, "look for all instances of <P *> and replace it with just <p>"?

This is what i have now.

Code:

sed -i '/<P>/s/P/p/g' infile
sed -i '/<P /s/P/p/g' infile
sed -i '/<\/P>/s/P/p/g' infile

can i do something like search for a line that has a
<P and something else inside the < > other than P and replace it with <p>?

Lemme know if this doesn't make sense

druuna · 02-21-2008, 04:22 PM

Hi,

This is a bit tricky, but will work if the code layout is the same.

sed 's/<P .*">/<P>/' infile

I expanded the example code a bit to check for false/incorrect hits:

Code:

$ cat infile
<P STYLE="margin-bottom: 0.1in; font-weight: medium">test <B>text</B> </P>
<P STYLE="margin-bottom: 0.1in">test <B>text</B> </P>
<P ALIGN=CENTER STYLE="margin-bottom: 0in">test <B>text</B> </P>
$
$ sed 's/<P .*">/<P>/' infile
<P>test <B>text</B> </P>
<P>test <B>text</B> </P>
<P>test <B>text</B> </P>
$

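The tricky part is the regular expression (<P .*">). Because .* is greedy, it matches as much of the line as it can, so the closing "> is needed to stop the match at the end of the opening tag. With the looser <P .*> the match runs on to the last > on the line, and your output would look like this: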
$ sed 's/<P .*>/<P>/' infile
<P>
<P>
<P>
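
A common alternative that does not depend on the tag ending in "> is to match "anything except >" instead of "anything". The sketch below assumes no attribute value contains a literal > character; the `sample` variable is a stand-in for the real file, and its second line is a made-up case with an unquoted last attribute.

```shell
# Hypothetical sample: the second line does not end its tag with ">.
sample='<P STYLE="margin-bottom: 0.1in; font-weight: medium">test <B>text</B> </P>
<P ALIGN=CENTER>test <B>text</B> </P>'

# [^>]* stops at the first >, so only the opening tag is consumed.
# The remaining expressions lower-case bare <P> and closing </P> tags.
cleaned=$(printf '%s\n' "$sample" |
  sed 's/<P [^>]*>/<p>/g; s/<P>/<p>/g; s/<\/P>/<\/p>/g')

printf '%s\n' "$cleaned"
```

The space after <P in the first expression keeps other tags that merely start with P (such as <PRE>) from being touched.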

jettachamp26 · 02-21-2008, 04:47 PM

awesome that worked perfectly. I just needed to add the -i flag.

Yeah, the regexp is what i couldn't figure out.

Thanks for your help!

ghostdog74 · 02-21-2008, 07:06 PM

Parsing html/xml files using standard unix tools like sed/awk etc. has always been a "tricky" business. If you have time in the future, you can look into specialized html/xml parsers or parser libraries that come with programming languages like Perl/Python etc. Just my $0.02.

jettachamp26 · 02-22-2008, 08:31 AM

I will look into it. I heard that, while perl is an excellent language, it will probably only be mainstream for 4-5 more years, since Ruby is picking up speed.

Since I am new to both, would it be more rational to skip perl and begin studying ruby in-depth to be up on the cutting edge stuff or, to stick with perl for a while?

jettachamp26 · 02-22-2008, 09:45 AM

New question. I have some text in between two html comment markers, like so:

Code:

<!-- 
     some text 
     that is N lines long
-->

I know how to search for if a pattern exists in a line like shown in post #2. My question is, is there a way to search for the two comment tags and delete all lines in between those tags?