Old 02-10-2011, 09:33 AM   #1
Registered: Aug 2009
Location: Houston
Distribution: Slackware 13.37 x64
Posts: 105

Rep: Reputation: 25
Python: Extract names and values from HTML tags

I'm working on a project at work to automate sending e-mails to customers.

Everything is in place except my ability to extract the useful data from HTML tags to use in the formation of the POST.

      <td width="25%" bgcolor="bisque"><b><font color="blue">From</font></b></td>
      <td width="25%" bgcolor="bisque"><input type="text" name="TechName" value='MY NAME'></td>
      <td width="25%" bgcolor="bisque"><input type="text" name="TechEmail" style="background-color: #FFFF00" size="30" value='MY EMAIL ADDRESS'></td>
      <td width="25%" bgcolor="bisque"><input type="text" name="TechPhone" size="30" value='MY PHONE NUMBER'></td>

I want to disregard everything but the name and value attributes (bolded in the original post).

So I need to figure out how to copy the text in quotes after name=, and then the text in quotes after value=, for each occurrence in the file. Some of these items may not be contained on the same line (though perhaps with BeautifulSoup they would be).

There are 10 or so standard values that I have to collect that show up only once per e-mail.

Then there is a looping section that contains incremented IDs along with associated content, following the same name="" value='' structure. In some cases name and value are separated by other attributes such as size and style, which I do not need; these are the cases where one line may not contain both the name and the value.

How can I do multi-line searching in Python, and what is a suitable way to tackle this problem?

My current idea is to accept that the values are always in order and do string.find("value="), then step forward in the string to just after the =' and assign the text up to the next ' to the "name" field that represents the actual variable in the POST. But this is a C-ish way of doing it, with arrays and indexes and whatnot, and it still doesn't address the multi-line issue. I'd rather be good at Python than good at making Python behave like C.
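For comparison, the string.find idea above could be sketched with a regular expression run over the whole file contents at once instead of line by line. This is only a rough sketch: it assumes name= always comes before value= inside each input tag (as in the sample), and the names INPUT_RE and extract_fields are made up for illustration.

```python
import re

# Sketch only: assumes name= appears before value= inside each <input> tag.
# [^>]* also matches newlines, so a tag split across lines is still matched
# as long as the whole file is searched as one string.
INPUT_RE = re.compile(
    r"""<input\b[^>]*?\bname\s*=\s*(["'])(?P<name>.*?)\1"""
    r"""[^>]*?\bvalue\s*=\s*(["'])(?P<value>.*?)\3""",
    re.IGNORECASE | re.DOTALL,
)


def extract_fields(html):
    """Return a list of (name, value) pairs in document order."""
    return [(m.group("name"), m.group("value")) for m in INPUT_RE.finditer(html)]


sample = """<td><input type="text" name="TechName" value='MY NAME'></td>
<td><input type="text" name="TechEmail" style="background-color: #FFFF00"
    size="30" value='MY EMAIL ADDRESS'></td>"""

fields = extract_fields(sample)
print(fields)
```

The second tag in the sample is deliberately broken across two lines; because the search runs over the full string, the extra attributes and the line break between name and value are simply skipped.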

Last edited by Dogs; 02-10-2011 at 09:38 AM.
Old 02-10-2011, 09:45 AM   #2
Registered: Apr 2010
Posts: 228

Rep: Reputation: 46
And why are you not using BeautifulSoup?
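To show what a parser-based approach buys you, here is a minimal sketch that uses the standard library's html.parser rather than BeautifulSoup itself, assuming Python 3. The class name InputCollector and the sample page are hypothetical; the point is only that a parser hands you the attributes of each tag regardless of their order or of line breaks inside the tag.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class InputCollector(HTMLParser):
    """Collect name/value pairs from every <input> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) tuples; attribute order and
        # line breaks inside the tag do not matter to the parser.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.fields[name] = attributes.get("value") or ""


page = """<td><input type="text" name="TechName" value='MY NAME'></td>
<td><input type="text" name="TechPhone"
    size="30" value='MY PHONE NUMBER'></td>"""

collector = InputCollector()
collector.feed(page)
collector.close()

# Encode the collected fields as a form body for the POST.
post_body = urlencode(collector.fields)
print(post_body)
```

Keeping the fields in a dict means the looping section with incremented IDs needs no special handling, as long as each name is unique.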
Old 02-10-2011, 09:56 AM   #3
Registered: Aug 2009
Location: Houston
Distribution: Slackware 13.37 x64
Posts: 105

Original Poster
Rep: Reputation: 25
OK, I got acquainted with BeautifulSoup, but of a document that is approximately 20 KB, only the first 1 KB or so is utilized...

Is there a tag or something in there that makes BeautifulSoup stop?
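One way to narrow down where a parser gives up is to compare how many input tags appear in the raw text with how many the parser actually reports, and to note the position of the last one it saw. The sketch below does this with the standard library's html.parser (not BeautifulSoup), assuming Python 3; InputCounter and input_report are names invented for the example, and the same comparison could be made against whatever BeautifulSoup returns.

```python
import re
from html.parser import HTMLParser


class InputCounter(HTMLParser):
    """Count <input> tags and remember where the last one was seen."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.last_position = None  # (line, column) of the last <input> parsed

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.count += 1
            self.last_position = self.getpos()


def input_report(html):
    """Return (raw_count, parsed_count, last_position) for the given markup."""
    raw_count = len(re.findall(r"<input\b", html, re.IGNORECASE))
    counter = InputCounter()
    counter.feed(html)
    counter.close()
    return raw_count, counter.count, counter.last_position


sample = "<input name='a' value='1'>\n<input name='b' value='2'>"
raw_count, parsed_count, last_position = input_report(sample)
print(raw_count, parsed_count, last_position)
```

If the parsed count comes out lower than the raw count, the markup just after the reported line is the first place to look for something malformed.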

Last edited by Dogs; 02-10-2011 at 03:19 PM.


All times are GMT -5.
