LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 05-04-2006, 05:49 PM   #1
zackarya
Member
 
Registered: Jul 2003
Distribution: OpenSuse 10, Debian
Posts: 152

Rep: Reputation: 30
Help with changing the formatting of a text file.


Sorry for the thread title.

Here's what I'm trying to do. I need some arbitrary names for a database project I'm
working on. I got some off the net, but I need to get rid of the HTML markup.
So I have a file with lines like this:

<tr bgcolor="white"><td>MARY</td><td>2.629</td><td> 3,991,060 </td><td>1</td></tr>
<tr bgcolor="#f5fdd9"><td>PATRICIA</td><td>1.073</td><td> 1,628,911 </td><td>2</td></tr>
<tr bgcolor="white"><td>LINDA</td><td>1.035</td><td> 1,571,224 </td><td>3</td></tr>
<tr bgcolor="#f5fdd9"><td>BARBARA</td><td>0.98</td><td> 1,487,729 </td><td>4</td></tr>

What I want to do is use sed, awk, grep, etc. to find the first "<td>" on each line and
delete it and everything before it, then find the following "</td>" and delete it and
everything after it, leaving only the name. I hope that makes sense.


Thanks for your time and help.

Zackarya
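
A minimal sketch of the operation described above, using sed only. The function name strip_to_name is made up for illustration, and the sample rows are copied from the post. Because sed's ".*" is greedy, the order of the two expressions matters: cutting from the first "</td>" to the end of the line first leaves exactly one "<td>" behind, so the second expression cannot overshoot.

```shell
#!/bin/sh
# strip_to_name (hypothetical helper): keep only the text inside the
# first <td>...</td> pair on each input line.
strip_to_name() {
    # 1st expression: delete the first "</td>" and everything after it.
    # 2nd expression: delete everything up to and including the one
    #                 "<td>" that is left on the line.
    sed -e 's/<\/td>.*//' -e 's/.*<td>//'
}

strip_to_name <<'EOF'
<tr bgcolor="white"><td>MARY</td><td>2.629</td><td> 3,991,060 </td><td>1</td></tr>
<tr bgcolor="#f5fdd9"><td>PATRICIA</td><td>1.073</td><td> 1,628,911 </td><td>2</td></tr>
EOF
```

Lines that contain neither tag pass through unchanged, so this assumes every input line is a table row like the ones shown.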
 
Old 05-04-2006, 06:06 PM   #2
ataraxia
Member
 
Registered: Apr 2006
Location: Pittsburgh
Distribution: Debian Sid AMD64
Posts: 296

Rep: Reputation: 30
I suggest using the "-dump" option of lynx or w3m, rather than doing this yourself.
 
Old 05-04-2006, 06:33 PM   #3
zackarya
Member
 
Registered: Jul 2003
Distribution: OpenSuse 10, Debian
Posts: 152

Original Poster
Rep: Reputation: 30
ataraxia, thanks for the reply.
I've used lynx -dump before, but the site has extra information on the same line as the
names I want, so I would still have to go through and delete those extra fields.

I know of a couple ways I could hack this to make it work but from time to time I've run
across situations where I need to do something that's similar to this problem and I've
always had to basically hack around it because I don't know of a "good" way to do it.

Zackarya
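
If a text dump leaves the extra columns on the same line as the name, one possible way to drop them is to print only the first whitespace-separated field with awk. This is a sketch under assumptions: the sample below is a hypothetical guess at what a dump of the table might look like, not real lynx output, and first_column is a made-up helper name. It treats a line as a data row only if it has four fields and the last one is a number.

```shell
#!/bin/sh
# first_column (hypothetical helper): print the first field of each data row.
# Assumption: a data row has exactly four whitespace-separated fields and
# its last field (the rank) is all digits; other lines are skipped.
first_column() {
    awk 'NF == 4 && $4 ~ /^[0-9]+$/ { print $1 }'
}

# Hypothetical sample of dumped table text.
first_column <<'EOF'
   Name     Percent   Count      Rank
   MARY     2.629     3,991,060  1
   PATRICIA 1.073     1,628,911  2
EOF
```

The filter on the last field is only there to skip headings and blank lines; it would need adjusting for a page laid out differently.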
 
Old 05-05-2006, 01:07 AM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
There are a few variations on stripping out HTML tags in the Perl Cookbook, e.g.:

use HTML::Parse;   # lowercase "use"; HTML::Parse is the module that exports parse_html()
use HTML::FormatText;
$plain_text = HTML::FormatText->new->format(parse_html($html_text));
Then you can just grab the fields you want.
This might be a good starting point.
 
Old 05-06-2006, 01:47 PM   #5
zackarya
Member
 
Registered: Jul 2003
Distribution: OpenSuse 10, Debian
Posts: 152

Original Poster
Rep: Reputation: 30
Thanks chrism01.

I ended up doing what I usually do and just used a bad hack.
I'm trying to find a way to do what I outlined in my first post, but in a very
generic way. It's not always HTML I'm trying to get stuff out of.

For this case what I did was:

sed -e 's/<tr bgcolor="#f5fdd9"><td>//' -e 's/<tr bgcolor="white"><td>//' <most_common_surnames.txt >surnamesfirst.txt

Which strips out everything before the name.

Then
sed 's/</\n/' <surnamesfirst.txt >surnamessecond.txt

Which finds the character after the name and replaces it with a newline.
NOTE: I could have done this with the -i option of sed but I wanted to make sure
I didn't make any mistakes.


Then
awk '(NR%2==0) {print $0; }' <surnamessecond.txt >surnamesthird

Which prints every second line to a file. (The first line is blank, so the names are on the even-numbered lines.)


That's basically how I did it this time but it's such a specific solution.
If anyone knows of a "cleaner" way to do this please let me know because I
run across this kind of thing fairly often and would like a more generic
approach.
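
One possible generic approach, sketched here under stated assumptions, is a small shell function that prints whatever lies between the first occurrence of a start string and the next occurrence of an end string on each line. The name "between" is made up for illustration. It uses awk's index() and substr(), so both delimiters are matched as literal text rather than as regular expressions, which lets the same function work on input that is not HTML.

```shell
#!/bin/sh
# between START END (hypothetical helper): for each input line, print the
# text between the first START and the next END that follows it.
# Lines missing either delimiter are skipped.
# Caveat: awk -v interprets backslash escapes in the values, so delimiters
# containing backslashes would need extra care.
between() {
    awk -v s="$1" -v e="$2" '{
        i = index($0, s)                  # position of the first START
        if (i == 0) next
        rest = substr($0, i + length(s))  # text after START
        j = index(rest, e)                # position of the next END
        if (j == 0) next
        print substr(rest, 1, j - 1)
    }'
}

# The HTML rows from the first post:
between '<td>' '</td>' <<'EOF'
<tr bgcolor="white"><td>MARY</td><td>2.629</td><td> 3,991,060 </td><td>1</td></tr>
<tr bgcolor="#f5fdd9"><td>PATRICIA</td><td>1.073</td><td> 1,628,911 </td><td>2</td></tr>
EOF

# The same function on non-HTML input:
printf '%s\n' 'name=[LINDA] rank=[3]' | between '[' ']'
```

Only the first match per line is printed; pulling out later fields would need a loop over the remainder of the line.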
 
  



All times are GMT -5. The time now is 04:03 PM.
