text search and replacement: bash scripting

ghostdog74 · 02-22-2008, 09:51 AM

very simplistically

Code:

# more file
this is valid 1
<!--
     some text
     that is N lines long
-->
this is valid 2
# awk '/<!--/,/-->/{next}1' file
this is valid 1
this is valid 2

jettachamp26 · 02-22-2008, 10:12 AM

would you be as so kind to explain how that code works? I only started shell scripting about 2 weeks ago and haven't really been exposed to awk at all.

jettachamp26 · 02-22-2008, 10:16 AM

just an update for druuna, I had to modify the code in post #2. I caught the code adding h6 into the text in between the brackets when there happened to be a capital P in a word.

Code:

 sed -i '/Headline:/{n;n;n;s/<P>/<h6>/;s/<\/P>/<\/h6>/;}' infile

I've learned so much in the past 2 days it's crazy. Thanks again for everyone's (very speedy!) help!
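
For anyone following along, here is a small sketch of what that sed command does. The sample input is made up for illustration; the only assumption is that the <P> line to convert sits exactly three lines below the "Headline:" line, which is what the n;n;n implies.

```shell
# Hypothetical sample: the <P> line to convert is three lines below "Headline:".
sample='Headline:
line one
line two
<P>Big News</P>
<P>Body text</P>'

# /Headline:/  select the line containing "Headline:"
# n;n;n        print the current line and read the next one, three times
# s/.../.../   the two substitutions then apply only to that third line
result=$(printf '%s\n' "$sample" |
  sed '/Headline:/{n;n;n;s/<P>/<h6>/;s/<\/P>/<\/h6>/;}')

printf '%s\n' "$result"
```

Only the "Big News" line is rewritten; the later <P> line is left alone because it is not three lines after a "Headline:" match.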

ghostdog74 · 02-22-2008, 10:22 AM

Code:

awk '/<!--/,/-->/{next}{print}' file

From the line matching <!-- through the line matching -->, skip. Otherwise print the rest.
For more information, read this
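
An annotated version of the same one-liner may help. This is only a sketch: the input is modelled on the sample file from earlier in the thread, and the variable names are made up.

```shell
# Sample input modelled on the file shown earlier in the thread.
input='this is valid 1
<!--
     some text
-->
this is valid 2'

# /<!--/,/-->/ is a range pattern: it matches from a line containing "<!--"
# through the next line containing "-->", both included.
# {next} skips those lines; {print} then runs for every remaining line.
kept=$(printf '%s\n' "$input" | awk '/<!--/,/-->/{next}{print}')

printf '%s\n' "$kept"
```

Note that this works on whole lines, so any text sharing a line with <!-- or --> is dropped along with the comment.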

jettachamp26 · 02-22-2008, 10:39 AM

Thanks. Good link too. I'll be sure to bookmark it.

jettachamp26 · 02-22-2008, 11:09 AM

I tried using it, and tried tweaking it a little, but it wouldn't work inside my script.

Not really sure why it's not working though. I do know that I have awk on my system and that awk works.

any help?
here's the code, to make it clear what's not working.

Code:

awk '/<!--/,/-->/{next}{print}' infile

also tried

Code:

awk -i '/<!--/,/-->/{next}{print}' infile

but I believe the -i designator only works with sed.

druuna · 02-22-2008, 11:44 AM

Hi,

If I try the code on the sample input it seems to work:

Code:

$ cat infile 
this is valid 1
<!--
     some text
     that is N lines long
-->
this is valid 2
$
$
$ awk '/<!--/,/-->/{next}{print}' infile 
this is valid 1
this is valid 2

If you are more at home using sed you can do this: sed '/<!--/,/-->/d' infile. Result will be the same as the output of the awk command.

You mention that you needed to tweak the awk command but you don't say what needs to be tweaked. If the above doesn't work, could you post the relevant part of the input and the desired output?

Hope this helps.

BTW: Glad to read that you have learned something, it's always good to read that the help given is actually helping.

jettachamp26 · 02-22-2008, 12:22 PM

the sed command did the trick, I just added the -i designator.

I think the reason the awk command doesn't work is that the actual functionality of the script isn't my code. I just adapted the code to do the things I need. I still need to sit down and take the time to learn how all the little bits and pieces work.

the original script I used is by Ian Spillane http://iantheteacher.blogspot.com

ghostdog74 · 02-22-2008, 10:46 PM

If -i is all you need to get the thing working using druuna's sed suggestion, then you just need to redirect to a new file in awk and rename the new file back to the original. The -i in sed is just an in-place modification of the file.

Code:

awk '/..../{...}' file  > temp
mv temp file
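
A slightly more defensive sketch of the same idea, filled in with the comment-stripping awk from earlier in the thread. The scratch directory and file name are made up so the example does not touch real files; mktemp and the && guard are additions for illustration, not part of the original suggestion.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
file="$workdir/page.html"   # hypothetical file name
printf '%s\n' 'keep 1' '<!--' 'drop' '-->' 'keep 2' > "$file"

# Write the filtered output to a temporary file in the same directory.
tmp=$(mktemp "$workdir/tmp.XXXXXX")

# Replace the original only if awk exited successfully.
awk '/<!--/,/-->/{next}{print}' "$file" > "$tmp" && mv "$tmp" "$file"

cat "$file"
```

The && means a failed awk run leaves the original file untouched. One caveat: mktemp creates its file with restrictive permissions, so the replaced file may not keep the permissions of the original.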

jettachamp26 · 02-25-2008, 09:21 AM

Hey,

Back again with a question. I've been reading up on regular expressions, but I'm not really getting anywhere.

The problem: I have some code where there is a link wrapped in ugly <FONT> and <U> tags. I can clean up all of the tags just fine, until I get to the font tag where I use the command in code #2 below to try and remove it. What ends up happening is it removes everything except for the very last </a> tag. The tags are left fine too.

Through process of elimination, I found the problem lies in the command in code #2.

I know that .* is greedy, but I thought that is what the " and > were for.
I also tried using .*? instead, but to no avail.

Code #1

Code:

<FONT COLOR="#000080"><U><A HREF="http://www.example.com/"></A><A HREF="http://www.example.com/"></A><A HREF="http://www.example.com/">"http://www.example.com/"</A><A HREF="http://www.example.com/"></A><A HREF="http://www.example.com/"></A></U></FONT>
</P>
<P STYLE="margin-bottom: 0in">Example2:
<FONT COLOR="#000080"><U><A HREF="http://www.example2.com/"></A><A HREF="http://www.example2.com/"></A><A HREF="http://www.example2.com/">http://www.example2.com/</A><A HREF="http://www.example2.com/"></A><A HREF="http://www.example2.com/"></A></U></FONT>
</P>

Through process of elimination, I narrowed it down to this one command.

Code #2

Code:

sed -i 's/<FONT .*">//g' "$f"

jettachamp26 · 02-25-2008, 09:27 AM

I would just do a check for the COLOR="000080" but this is used across many different files, and the Font tag is useless to me in all of the files.

druuna · 02-25-2008, 09:40 AM

Hi,

Like ghostdog74 and I said before: parsing html/xml files is tricky business.

You need to find something unique to use in the reg-exp. This <FONT .*"> is not unique enough, and because .* is greedy it matches up to the last "> on the line.

If I look at the example given, this 000080 is unique,  will remove the first font entry on every line. But this only works if only the color value (000080) is used.

Wouldn't it be less work if you would take a look at perl and the html/xml parsing modules that are already available?

EDIT

I just noticed your new post, which renders my answer useless (well, most of it).

/EDIT
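
To make the greediness concrete, here is a minimal sketch comparing the Code #2 pattern with a bounded alternative. The negated character class [^>]* is a different technique from anchoring on a unique value, and the sample line is a shortened, made-up version of Code #1. It assumes no attribute value contains a literal ">".

```shell
line='<FONT COLOR="#000080"><U><A HREF="http://www.example.com/">link</A></U></FONT>'

# Greedy: .* runs to the LAST "> on the line, so <U> and the opening <A ...> go too.
greedy=$(printf '%s\n' "$line" | sed 's/<FONT .*">//g')

# Bounded: [^>]* cannot cross a ">", so the match ends with the FONT tag itself.
# The second expression removes the closing tag separately.
bounded=$(printf '%s\n' "$line" | sed -e 's/<FONT [^>]*>//g' -e 's/<\/FONT>//g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

As for .*?, sed's POSIX regular expressions have no non-greedy quantifier, which would explain why that attempt made no difference.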

jettachamp26 · 02-25-2008, 10:05 AM

i might just have to check out perl and its html/xml parsing because i am stumped.

EDIT
Do you happen to have any good links I can read up on it?
/EDIT

jettachamp26 · 02-25-2008, 10:12 AM

Also,

Would you know if/how I can remove duplicate empty <a></a> like in Code #1 of post #25?

could I use something like this?

Code:

sed -i 's/<a href=".*"></a>//g' infile

is there a way to tell it not to change the one that actually has text in between the <a href=""></a> tags?

druuna · 02-25-2008, 10:32 AM

Hi again,

The font problem could probably be solved this way: ^ (but I'm not 100% sure if this will exclude false positives).

This looks for: <FONT at the beginning of a line (the first ^), followed by a space and anything ( .*) and it should end with a # followed by 6 chars, which should be an A to F or 0 to 9. The last two chars should be a " and a >.
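
One possible reading of that description, written out as a sed substitution. This is a sketch built only from the wording above, not necessarily the exact command that was meant, and the sample line is a shortened, made-up version of Code #1.

```shell
line='<FONT COLOR="#000080"><U><A HREF="http://www.example.com/">link</A></U></FONT>'

# ^<FONT          "<FONT" at the start of the line
# " .*"           a space, then anything
# #[A-F0-9]\{6\}  a "#" followed by six characters from A-F or 0-9
# ">              the closing quote and bracket
stripped=$(printf '%s\n' "$line" | sed 's/^<FONT .*#[A-F0-9]\{6\}">//')

printf '%s\n' "$stripped"
```

This leaves the closing </FONT> in place and would not match lowercase hex digits. Because .* is still greedy, a second colour value later on the same line would stretch the match, which is the kind of false positive the caveat above hints at.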

Quote:

is there a way to tell it not to change the one that actually has text in between the <a href=""></a> tags?

Maybe, but this depends on the actual code (what is where and is it unique enough).
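
For the sample in Code #1, one way this might be done is sketched below. It assumes every empty anchor has exactly the form <A HREF="..."></A>, with nothing (not even whitespace) between the opening and closing tag; the sample line is shortened from Code #1.

```shell
line='<A HREF="http://www.example.com/"></A><A HREF="http://www.example.com/">http://www.example.com/</A><A HREF="http://www.example.com/"></A>'

# [^"]* keeps the match inside a single HREF value, and "><\/A> requires the
# closing tag to follow immediately, so an anchor with link text cannot match.
cleaned=$(printf '%s\n' "$line" | sed 's/<A HREF="[^"]*"><\/A>//g')

printf '%s\n' "$cleaned"
```

Two details differ from the attempt quoted above: the / in </A> has to be escaped (or a different delimiter used), and sed matches case-sensitively, so the pattern is written in uppercase to match the sample.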