LinuxQuestions.org
Old 02-21-2008, 01:10 PM   #1
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Rep: Reputation: 15
text search and replacement: bash scripting


OK, here's my dilemma. I have a file that has been converted from ODT to HTML. (There are actually thousands of them, all with the same format.)
I need a script that finds "Headline:" or any other version of that line, then goes down 3 lines and changes the <P></P> tags to <h1></h1> tags.



The same needs to be done for "Subheadline:". Is this possible? I have my script almost done, but I am totally stumped here.

Also, the format will always be like this.

<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<P>Headline text is here </P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<P>sub-headline text is here</P>
</TD>

Any help would be awesome! My brain has been turned to mush.

-Ryan

Last edited by jettachamp26; 02-21-2008 at 01:12 PM.
 
Old 02-21-2008, 02:03 PM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

If the layout of the file is always the same, this could be done this way:

sed '/[Hh]eadline:/{n;n;n;s/P/h1/g;}' infile

It could need a bit of tuning:
- I used [Hh]eadline: as the search criterion to catch both Subheadline: and Headline: in one go. The layout you posted might not be identical to the real files...
- The s/P/h1/ is a bit rough; to prevent false hits, include the surrounding < and > in the pattern.

Ok, how it works:

sed first searches for a line that contains [Hh]eadline: (/[Hh]eadline:/), if a line is found the part between the curly brackets is 'activated'. The next 3 lines are read (n;n;n;) and then the substitution takes place (s/P/h1/g).

If all works as expected, you can use the -i option to edit the file in place, e.g.:
sed -i '/[Hh]eadline:/{n;n;n;s/P/h1/g;}' infile

Hope this helps.
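
The tuning suggested above can be sketched as follows. This is only an illustration, assuming the content line holds a bare <P>...</P> pair exactly as in post #1; the sample text and the variable names are made up:

```shell
# Hypothetical sample in the layout from post #1; the headline text contains
# capital P's on purpose.
sample='<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<P>Page One Preview</P>
</TD>'

# Match the whole opening and closing tag instead of a bare P, so only the
# tags on the third line after the label are rewritten.
result=$(printf '%s\n' "$sample" |
  sed '/[Hh]eadline:/{n;n;n;s/<P>/<h1>/;s/<\/P>/<\/h1>/;}')

printf '%s\n' "$result"
```

With the rougher s/P/h1/g, the same content line would come out as <h1>h1age One h1review</h1>.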
 
Old 02-21-2008, 02:14 PM   #3
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 96
Hi.

Code:
sed '/<B>\(Sub\)\{0,1\}[Hh]eadline:<\/B>/{n;n;n;s/P>/h1>/g;}'
seems to work.

Dave

Last edited by ilikejam; 02-21-2008 at 02:14 PM. Reason: Gah! Beaten to it.
 
Old 02-21-2008, 02:18 PM   #4
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

@ilikejam: Maybe I've 'beaten' you, but your code is a bit more robust. Which would make me a bit lazy, a truism in this case...
 
Old 02-21-2008, 02:28 PM   #5
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Original Poster
Rep: Reputation: 15
Thumbs up

The first way did the trick. I made 2 lines, one for Headline and one for Subheadline, since each one uses a different <H> tag. Thanks for explaining it too.
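
A two-expression version of that idea might look roughly like this. It is a sketch only: the choice of <h1> for the headline and <h2> for the subheadline is an assumption, since the exact tags used are not shown, and the variable names are placeholders:

```shell
# Sample in the layout from post #1.
sample='<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<P>Headline text is here </P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<P>sub-headline text is here</P>
</TD>'

# One -e expression per label, each with its own replacement tag.
# Matching on <B>Headline: keeps the first expression from also firing
# on the Subheadline: label.
result=$(printf '%s\n' "$sample" | sed \
  -e '/<B>Headline:/{n;n;n;s/<P>/<h1>/;s/<\/P>/<\/h1>/;}' \
  -e '/<B>Subheadline:/{n;n;n;s/<P>/<h2>/;s/<\/P>/<\/h2>/;}')

printf '%s\n' "$result"
```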

Thanks for the fast replies!

The other question I had: I am trying to remove all unwanted code from the file (for instance, everything from the beginning of the file to the headline tags, and the ugly table tags in between).

Here is what I mean. Obviously I want to keep the bold stuff. It's OK if this has to be done in more than one command; I just don't have enough experience with shell scripts yet to figure it out.

Could it be done using sed still? or maybe grep and then sed?


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<TITLE></TITLE>
<META NAME="GENERATOR" CONTENT="OpenOffice.org 2.2 (Linux)">
<META NAME="AUTHOR" CONTENT="">
<META NAME="CREATED" CONTENT="">
<META NAME="CHANGEDBY" CONTENT="">
<META NAME="CHANGED" CONTENT="">
<STYLE TYPE="text/css">
<!--
@page { size: 8.5in 11in; margin-left: 1.25in; margin-right: 1.25in; margin-top: 1in; margin-bottom: 1in }
P { margin-bottom: 0.08in }
TD P { margin-bottom: 0.08in }
-->
</STYLE>
</HEAD>
<BODY LANG="en-US" DIR="LTR">
<DIV TYPE=HEADER>
<TABLE WIDTH=100% BORDER=0 CELLPADDING=0 CELLSPACING=0>
<COL WIDTH=128*>
<COL WIDTH=128*>
<TR VALIGN=TOP>
<TD WIDTH=50%>
<P>Author: <SDFIELD TYPE=AUTHOR FORMAT=NAME>Ryan</SDFIELD></P>
</TD>
<TD WIDTH=50%>
<P ALIGN=RIGHT>Page <SDFIELD TYPE=PAGE SUBTYPE=RANDOM FORMAT=ARABIC>4</SDFIELD>
of <SDFIELD TYPE=DOCSTAT SUBTYPE=PAGE FORMAT=ARABIC>4</SDFIELD></P>
</TD>
</TR>
</TABLE>
<P ALIGN=RIGHT STYLE="margin-bottom: 0.2in">Words: <SDFIELD TYPE=DOCSTAT SUBTYPE=WORD FORMAT=ARABIC>1256</SDFIELD></P>
</DIV>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=4 CELLSPACING=0 STYLE="page-break-before: always">
<COL WIDTH=64*>
<COL WIDTH=192*>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<DIV ID="Frame1" DIR="LTR" STYLE="position: absolute; top: 1.58in; left: 1.24in; width: 6.01in; height: 0.48in; border: 1px solid #000000; padding: 0.06in; background: #ffffff">
<P CLASS="frame-contents" ALIGN=LEFT><FONT SIZE=4 STYLE="font-size: 16pt"><B>ARIZONA
FOOD</B></FONT></P>
</DIV>
<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<h6>content of h6</h6>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<h1>content of h1</h1>
</TD>
</TR>
</TABLE>

Thanks again.

-Ryan
 
Old 02-21-2008, 02:29 PM   #6
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Original Poster
Rep: Reputation: 15
As you can see, there is a lot of junk I don't need, and if I can get it out of there, it would be a huge time saver.

The reason I'm doing this is that the clean info is eventually going into a DB, so I don't need all the 'proper' HTML, just the guts I'm trying to format all pretty-like.

Last edited by jettachamp26; 02-21-2008 at 02:31 PM.
 
Old 02-21-2008, 02:46 PM   #7
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

I cannot make out what you want to throw away and what you want to keep, but maybe this will get you going:

sed -n '/eadline:/,$p' infile

This suppresses all normal output (-n). Once a line containing eadline: is found, everything from that line to the end of the file is printed.

Here's the output if I use the input provided in post #5:

Code:
$ sed -n '/eadline:/,$p' infile
<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<h6>content of h6</h6>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<h1>content of h1</h1>
</TD>
</TR>
</TABLE>
But like I stated before, I'm not sure what it is you want to keep. Anyway, it could be a start.
 
Old 02-21-2008, 02:52 PM   #8
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Original Poster
Rep: Reputation: 15
Honestly, if I could get rid of all of the extra text (the input I gave in post #5) from the beginning of the file to
<h6>content of h6</h6> that alone would be a big help.

I want to get rid of all of that text except for the headlines, though.

Last edited by jettachamp26; 02-21-2008 at 02:55 PM.
 
Old 02-21-2008, 03:48 PM   #9
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Quote:
Originally Posted by jettachamp26 View Post
Honestly, if I could get rid of all of the extra text (the input I gave in post #5) from the beginning of the file to
<h6>content of h6</h6> that alone would be a big help.
sed -n '/<h6>content of h6<\/h6>/,$p' infile would do just that.

Quote:
I want to get rid of all of that text except for the headlines, though.
If all that you want to keep are the lines containing [Hh]eadline:
sed -n '/[Hh]eadline:/p' infile
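
If the goal is instead to keep only the headline and subheadline text, the n;n;n idea from post #2 can be combined with -n and p. A minimal sketch, assuming the text always sits on the third line after its label; the variable names are placeholders:

```shell
# Tail end of the sample from post #5.
sample='<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Headline:</B></P>
</TD>
<TD WIDTH=75%>
<h6>content of h6</h6>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=25%>
<P STYLE="font-style: normal"><B>Subheadline:</B></P>
</TD>
<TD WIDTH=75%>
<h1>content of h1</h1>
</TD>'

# -n turns off automatic printing; after a label line is found, step three
# lines down and print only that line.
result=$(printf '%s\n' "$sample" | sed -n '/[Hh]eadline:/{n;n;n;p;}')

printf '%s\n' "$result"
```

On this sample the command prints just the <h6> and <h1> lines. If the label and its text are not always exactly three lines apart, the offset needs adjusting.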
 
Old 02-21-2008, 04:07 PM   #10
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Original Poster
Rep: Reputation: 15
Not really what I need, but that's OK. I can work with it, thank you.

Now I just ran into a big snag.

It turns out that when a file is converted to HTML in OpenOffice, the <P> tags it inserts don't always carry the same attributes. Here is what I mean.

Sometimes I get this:
<P STYLE="margin-bottom: 0.1in; font-weight: medium">
<P STYLE="margin-bottom: 0.1in">
<P ALIGN=CENTER STYLE="margin-bottom: 0in">
etc.

Is there a way I can tell the script, "look for all instances of <P *> and replace them with just <p>"?

This is what I have now.

Code:
sed -i '/<P>/s/P/p/g' infile
sed -i '/<P /s/P/p/g' infile
sed -i '/<\/P>/s/P/p/g' infile
Can I do something like search for a line that has a
<P with something else inside the < > besides the P, and replace it with <p>?

Let me know if this doesn't make sense.
 
Old 02-21-2008, 04:22 PM   #11
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

This is a bit tricky, but it will work as long as every opening <P ...> tag ends in a quoted attribute, as in your examples.

sed 's/<P .*">/<P>/' infile

I expanded the example code a bit to check for false/incorrect hits:
Code:
$ cat infile
<P STYLE="margin-bottom: 0.1in; font-weight: medium">test <B>text</B> </P>
<P STYLE="margin-bottom: 0.1in">test <B>text</B> </P>
<P ALIGN=CENTER STYLE="margin-bottom: 0in">test <B>text</B> </P>
$
$ sed 's/<P .*">/<P>/' infile
<P>test <B>text</B> </P>
<P>test <B>text</B> </P>
<P>test <B>text</B> </P>
$
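The tricky part is the regular expression (<P .*">). Because .* is greedy, it matches as much as it can: the expression above runs to the last "> on the line, whereas the shorter <P .*> would run to the last > on the line and swallow the text and the closing tags as well. Your output would then look like this: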
$ sed 's/<P .*>/<P>/' infile
<P>
<P>
<P>
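
One common way around the greediness is a negated bracket expression, [^>]*, which cannot run past the first >. A minimal sketch, assuming no attribute value itself contains a > character; the fourth sample line and the variable names are made up for illustration:

```shell
# The first three lines are the sample from post #11; the last line, with
# an unquoted attribute and a link inside the paragraph, is a made-up addition.
sample='<P STYLE="margin-bottom: 0.1in; font-weight: medium">test <B>text</B> </P>
<P STYLE="margin-bottom: 0.1in">test <B>text</B> </P>
<P ALIGN=CENTER STYLE="margin-bottom: 0in">test <B>text</B> </P>
<P ALIGN=CENTER>see <A HREF="x.html">this</A></P>'

# [^>]* matches any run of characters that are not ">", so the match ends at
# the first ">" after "<P " instead of at the last one on the line.
result=$(printf '%s\n' "$sample" | sed 's/<P [^>]*>/<P>/g')

printf '%s\n' "$result"
```

On the fourth line, the <P .*"> form would match all the way to the "> that closes the HREF attribute and leave <P>this</A></P>, while the [^>]* form leaves the link intact.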
 
Old 02-21-2008, 04:47 PM   #12
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Original Poster
Rep: Reputation: 15
Awesome, that worked perfectly. I just needed to add the -i flag.

Yeah, the regexp is what I couldn't figure out.


Thanks for your help!
 
Old 02-21-2008, 07:06 PM   #13
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 239
Parsing HTML/XML files with standard Unix tools like sed and awk has always been a "tricky" business. If you have time in the future, you can look into specialized HTML/XML parsers, or the parser libraries that come with programming languages like Perl and Python. Just my $0.02.
 
Old 02-22-2008, 08:31 AM   #14
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Original Poster
Rep: Reputation: 15
I will look into it. I heard that, while Perl is an excellent language, it will probably only be mainstream for 4-5 more years, since Ruby is picking up speed.

Since I am new to both, would it make more sense to skip Perl and start studying Ruby in depth to keep up with the cutting-edge stuff, or to stick with Perl for a while?
 
Old 02-22-2008, 09:45 AM   #15
jettachamp26
Member
 
Registered: Feb 2008
Location: Florida
Distribution: ubuntu
Posts: 30

Original Poster
Rep: Reputation: 15
New question. I have some text in between two HTML comment markers, like so:

Code:
<!-- 
     some text 
     that is N lines long
-->
I know how to test whether a pattern exists in a line, as shown in post #2. My question is: is there a way to search for the two comment markers and delete all the lines between them?
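
One possible approach is a sed address range with the d command. A minimal sketch, assuming the opening and closing comment markers always sit on different lines, as in the example above; the surrounding <P> lines and the variable names are made up for illustration:

```shell
# Hypothetical input: a multi-line comment between two lines worth keeping.
sample='<P>keep this</P>
<!--
     some text
     that is N lines long
-->
<P>and this</P>'

# The range starts at a line containing "<!--" and ends at the next line
# containing "-->"; d deletes every line in the range, markers included.
result=$(printf '%s\n' "$sample" | sed '/<!--/,/-->/d')

printf '%s\n' "$result"
```

sed only starts looking for the end of a range on the line after the one that opened it, so a comment that opens and closes on a single line would keep the range running until some later line containing -->.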
 
  

