I was given a rather large file (about 35,000 lines) and asked to create a .sql file so I could import it into a Postgres database. I've already managed to do it, but I'd like some input on how to make it easier next time.
My problem is that it contains large amounts of text that contains markup. For example, a typical small row would look something like this:
Some text goes here, then <a href="http://www.something.com">here</a> is a link. Here is some <b>more</b> text.
I have to remove all markup, turn it into something like this:
Some text goes here, then here is a link. Here is some more text.
What I did was paste all this text into GEdit, then use a regular expression plugin to remove all links and markup. The rest is easy from here on.
I would like to automate this, however. What I would like to do is something like this:
awk '...' < infile.txt > outfile.txt
Obviously, this would take the input file, strip out the HTML tags, and write the result to outfile.txt. I've tried a few things, but I can't get my head around regular expressions on the command line.
Any pointers on how to do this?
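For concreteness, here's the kind of one-liner I'm imagining, based on my guess that a pattern like `<[^>]*>` matches a tag (I haven't verified this is robust; it won't handle tags that span lines, for example):

```shell
# Strip anything that looks like an HTML tag: "<", then any
# characters that aren't ">", then ">". The g flag removes every
# match on the line, not just the first.
echo 'Some text goes here, then <a href="http://www.something.com">here</a> is a link. Here is some <b>more</b> text.' \
  | sed -e 's/<[^>]*>//g'
# prints: Some text goes here, then here is a link. Here is some more text.

# The same idea with awk, for a whole file:
# awk '{ gsub(/<[^>]*>/, ""); print }' infile.txt > outfile.txt
```

Is something along these lines the right approach, or is there a better tool for this?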