LinuxQuestions.org


AP81 07-28-2008 06:50 AM

Sed/awk help with regular expressions needed
 
Hi guys,

I was given a rather large file (about 35,000 lines) and asked to create an .SQL file so I could import it into a Postgres database. I've already managed to do it, but I would like some input on how to make it easier the next time I have to do it.

My problem is that the file contains large amounts of text with markup in it. For example, a typical short row looks something like this:

Code:

Some text goes here, then <a href="http://www.something.com">here</a> is a link.  Here is some <b>more</b> text.
I have to remove all the markup and turn it into something like this:
Code:

Some text goes here, then here is a link.  Here is some more text.
What I did was paste all this text into GEdit, then use a regular expression plugin to remove all links and markup. The rest is easy from here on.

I would like to automate this, however. What I would like to do is something like this:

awk < infile.txt > outfile.txt

Obviously this would take the input file, strip out the HTML tags, then write the result to outfile.txt. I've tried a few things, but I can't get my head around regular expressions on the command line.

Any pointers on how to do this?

colucix 07-28-2008 07:15 AM

Code:

awk '{gsub(/<[^>]*>/,"")}1' infile.txt > outfile.txt
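
For reference, the pattern /<[^>]*>/ matches a "<", then any run of characters that are not ">", then the closing ">". gsub replaces every such match on the line with an empty string, and the trailing 1 is an always-true condition that makes awk print the line. Since the thread title also mentions sed, here is a minimal sketch of the same idea in sed. It assumes every tag opens and closes on the same line (a tag split across lines would be left in place), and the function name strip_tags is only illustrative.

```shell
# Strip anything that looks like <...> from each line of standard input.
# Assumption: each tag starts and ends on the same line.
strip_tags() {
    # s/<[^>]*>//g deletes every "<", run of non-">" characters, ">" on the line
    sed 's/<[^>]*>//g'
}

# Try it on the sample row from the first post.
printf '%s\n' 'Some text goes here, then <a href="http://www.something.com">here</a> is a link.  Here is some <b>more</b> text.' | strip_tags
```

On a real file this would probably be run as strip_tags < infile.txt > outfile.txt. Neither version decodes entities such as &amp;amp;, so those would still need separate handling.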

AP81 07-28-2008 07:18 AM

Awesome...I will give it a go tomorrow.

radoulov 07-28-2008 07:26 AM

If you have lynx:

Code:

lynx>outfile.txt --force-html --dump -nolist infile.txt
Or html2text:

Code:

html2text>outfile.txt infile.txt


All times are GMT -5.