LinuxQuestions.org
07-28-2008, 06:50 AM   #1
AP81
Member
 
Registered: Mar 2005
Posts: 38

Rep: Reputation: 15
Sed/awk help with regular expressions needed


Hi guys,

I was given a rather large file (about 35,000 lines) and asked to create a .sql file so I could import it into a Postgres database. I've already managed to do it, but I would like some input on how to make it easier the next time I have to do it.

My problem is that it contains large amounts of text that contains markup. For example, a typical small row would look something like this:

Code:
Some text goes here, then <a href="http://www.something.com">here</a> is a link.  Here is some <b>more</b> text.
I have to remove all the markup and turn it into something like this:
Code:
Some text goes here, then here is a link.  Here is some more text.
What I did was paste all this text into GEdit, then use a regular expression plugin to remove all links and markup. The rest is easy from here on.

I would like to automate this, however. What I would like to do is something like this:

awk < infile.txt > outfile.txt

Obviously this would take the input file, strip out the HTML tags, then write the result to outfile.txt. I've tried a few things, but I can't get my head around using regular expressions on the command line.

Any pointers on how to do this?
 
07-28-2008, 07:15 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Code:
awk '{gsub(/<[^>]*>/,"")}1' infile.txt > outfile.txt
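For anyone reading along, here is a short sketch of what that one-liner does, run against the sample row from the first post. The regular expression `<[^>]*>` matches a `<`, then any run of characters other than `>`, then the closing `>`. `gsub` replaces every such match on the current line with an empty string, and the trailing `1` is an always-true pattern, so awk prints the (now modified) line. The `strip_tags` function name and the sed variant are additions for illustration, not part of the reply above.

```shell
# strip_tags is a hypothetical helper name wrapping the awk one-liner.
# With no file arguments it reads standard input.
strip_tags() {
    awk '{ gsub(/<[^>]*>/, "") } 1' "$@"
}

# Sample row from the first post, piped through the helper.
printf '%s\n' 'Some text goes here, then <a href="http://www.something.com">here</a> is a link.  Here is some <b>more</b> text.' | strip_tags

# The same substitution written for sed (the g flag replaces every match on the line).
printf '%s\n' 'Here is some <b>more</b> text.' | sed 's/<[^>]*>//g'
```

Both commands work one line at a time, so a tag that is split across two lines will not be removed, and character entities such as `&amp;` are left as they are.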
 
07-28-2008, 07:18 AM   #3
AP81
Member
 
Registered: Mar 2005
Posts: 38

Original Poster
Rep: Reputation: 15
Awesome...I will give it a go tomorrow.
 
07-28-2008, 07:26 AM   #4
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 38
If you have lynx:

Code:
lynx -force_html -dump -nolist infile.txt > outfile.txt
Or html2text:

Code:
html2text>outfile.txt infile.txt
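One practical difference between these tools and a tag-stripping regex: lynx and html2text render the HTML, so they should also turn character entities such as `&amp;` into plain characters, whereas the awk/sed approach leaves entities untouched. If you stay with awk or sed, a small extra sed pass can cover the few entities you expect to see. This is only a sketch: it assumes that just these four entities appear in the data, and the `decode_entities` name is made up for illustration.

```shell
# decode_entities is a hypothetical helper; it handles only four common entities.
decode_entities() {
    # &amp; is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
    # In a sed replacement a bare & means "the matched text", hence the \&.
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g' "$@"
}

printf '%s\n' 'Tom &amp; Jerry say &quot;hi&quot;' | decode_entities
```

Run it after the tags have been stripped, otherwise a decoded `&lt;` could be mistaken for the start of a tag.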
 
  

