LinuxQuestions.org
Old 07-20-2015, 01:08 PM   #1
metallica1973
Senior Member
 
Registered: Feb 2003
Location: Washington D.C
Posts: 2,190

Rep: Reputation: 60
Python BeautifulSoup Re Finding Digits Within Tags


I am writing a little Python script that needs to grab the version numbers from cells such as "<td>4.2.2</td>" within the tbody of the page:
Code:
[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> 
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> 
(<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> | <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)
</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> 
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)
</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]
Is it possible to use a one-liner to scrape only the digits between the tags:

"<td>4.2.2</td>"

so it spits out:
4.2.2
4.2.1
etc..

This is what I have done so far, but I don't understand why it creates the variable rpart as a ResultSet and not as a regular string that I can scrape the data from.
Code:
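from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen
from bs4 import BeautifulSoup

# BeautifulSoup parses markup, not URLs, so fetch the page first
wphtml = BeautifulSoup(urlopen('http://blah.blah/release').read(), 'html.parser')
rpart = wphtml.find_all('tbody', limit=1)   # find_all() always returns a ResultSet
rpart[0]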
[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> (<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> (<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> | <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> (<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> (<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]
whos
rpart           ResultSet        [<tbody>\n<tr style="back<...>="1"></td></tr> </tbody>]
wphtml          BeautifulSoup    <!DOCTYPE html>\n<html di<...>"></iframe></body></html>
Is there a way to do this as a one-liner?
Code:
rpart = wphtml.find_all('tbody', limit=1, td=re.compile('\<td\>\d*.\d*.\d*.\<\/td\>'))
4.2.2
4.2.1
etc..

or 

for tag in wphtml.find_all('tbody', limit=1, string=re.compile("\b\<td\>\d*.\d*.\d*.\<\/td\>\b")):
    print(tag.content)
4.2.2
4.2.1
etc..
So what I am trying to do is:

1 - Search through the HTML page and capture only the first <tbody>...</tbody>, hence limit=1
2 - Run a regex over the result and print only the digits inside the <td>\d+\.\d+\.\d+</td> tags
3 - Resulting in:

4.2.2
4.2.1
etc..
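
One way the three steps above could be sketched with BeautifulSoup is shown here. It assumes bs4 is installed and uses a trimmed, made-up stand-in (SAMPLE_HTML) for the real page; first_tbody_versions and VERSION_RE are names invented for this illustration. Note that find_all() always returns a ResultSet (a list of tags), even with limit=1, while find() returns the first matching Tag directly.

```python
import re

from bs4 import BeautifulSoup

# Trimmed stand-in for the real release page (an assumption for this sketch).
SAMPLE_HTML = """
<table>
<tbody>
<tr><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a>)</td></tr>
<tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a>
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a>)</td></tr>
</tbody>
<tbody><tr><td>9.9.9</td></tr></tbody>
</table>
"""

# A cell counts as a version if its whole text is dot-separated digits.
VERSION_RE = re.compile(r"^\d+(?:\.\d+)+$")


def first_tbody_versions(html):
    """Return the version strings found in <td> cells of the first <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")  # one Tag (or None), not a ResultSet
    if tbody is None:
        return []
    versions = []
    for td in tbody.find_all("td"):
        text = td.get_text(strip=True)
        if VERSION_RE.match(text):  # skips the cells that hold download links
            versions.append(text)
    return versions


if __name__ == "__main__":
    for version in first_tbody_versions(SAMPLE_HTML):
        print(version)
```

The loop could be folded into a single list comprehension if a one-liner is the goal; it is written out here only for readability.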

Last edited by metallica1973; 07-20-2015 at 02:57 PM.
 
Old 07-21-2015, 12:33 PM   #2
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
I am sure you can do this with BeautifulSoup. But, you ought to be able to do it using sed as well.

Given your example HTML saved as a file (here named lq.html), I got the following output using this sed pipeline:
Code:
sed -n '/<tbody>/,/<\/tbody>/p' lq.html | sed -n 's/.*<td>\(.*\)<\/td>.*$/\1/p'
4.2.2
4.2.1
Now, obviously you need to fetch the page with wget or something similar (curl?) instead of reading a downloaded HTML file as I did.
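
For comparison, the same two-stage filter can be roughly sketched in Python with only the re module. This is an approximation of the sed pipeline, not an exact port: SAMPLE_LINES is a made-up stand-in for lq.html, and tbody_cells and CELL_RE are names invented here. Like the sed version, it is line-oriented and depends on how the markup happens to be laid out.

```python
import re

# Made-up stand-in for the contents of lq.html (an assumption for this sketch).
SAMPLE_LINES = """<table>
<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a>)
</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a>
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a>)
</td></tr>
</tbody>
</table>
<p><td>outside</td></p>
""".splitlines()

# Greedy on both sides, like s/.*<td>\(.*\)<\/td>.*$/\1/p in the sed command.
CELL_RE = re.compile(r".*<td>(.*)</td>")


def tbody_cells(lines):
    """Yield the captured <td> text from lines inside <tbody>...</tbody> ranges."""
    inside = False
    for line in lines:
        opened_here = not inside and "<tbody>" in line
        inside = inside or opened_here
        if not inside:
            continue
        match = CELL_RE.match(line)
        if match:
            yield match.group(1)
        # Like a sed range, only look for the end on lines after the start line.
        if not opened_here and "</tbody>" in line:
            inside = False


if __name__ == "__main__":
    for cell in tbody_cells(SAMPLE_LINES):
        print(cell)
```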

Best regards,
HMW

PS. If I get more time, I can try to help you out with a soup. DS

Last edited by HMW; 07-21-2015 at 02:02 PM.
 
Old 07-21-2015, 01:59 PM   #3
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
Ok, me again. I tried my own idea on a rather familiar http address (http://www.linuxquestions.org/questions/) to see if it works in practice with wget, and it does. Check this out:
Code:
wget -qO- http://www.linuxquestions.org/questions/ | sed -n '/<form.*>/,/<\/form>/p' | sed -n 's/.*<td.*>\(.*\)<\/td>.*$/\1/p'
User Name


Password
What I did here was pipe the output from wget (the source code for linuxquestions.org/questions/) to sed. The first sed keeps the lines between the <form> and </form> tags (sed prints every such range, not only the first one). That output is piped into a second sed, which extracts and prints ONLY the text between the <td> tags, which here is "User Name" and "Password".

So, if you simply modify the address to your address, and change the tags to the ones you are looking for, I see no reason why you shouldn't be able to get this working for you without BeautifulSoup.
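
If the same "change the tags" idea is wanted in Python without BeautifulSoup, the standard library's html.parser can serve as a swapped-in technique. The sketch below is only an illustration: CellText, text_between, and SAMPLE_PAGE are hypothetical names, and the markup is a made-up stand-in for the real page. Unlike the sed version, it stops after the first outer element and skips empty cells.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <inner> element inside the first <outer> element."""

    def __init__(self, outer, inner):
        super().__init__()
        self.outer = outer
        self.inner = inner
        self.state = "before"  # "before" -> "inside" -> "done"
        self.depth = 0         # > 0 while inside an <inner> element
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == self.outer:
            if self.state == "before":
                self.state = "inside"
        elif tag == self.inner and self.state == "inside":
            self.depth += 1
            if self.depth == 1:
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == self.outer:
            if self.state == "inside":
                self.state = "done"
                self.depth = 0
        elif tag == self.inner and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.cells[-1] += data


def text_between(html, outer, inner):
    """Return the non-empty text of <inner> elements in the first <outer> element."""
    parser = CellText(outer, inner)
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells if cell.strip()]


# Made-up login form, loosely modelled on the output shown above.
SAMPLE_PAGE = """
<td>outside the form</td>
<form action="login.php" method="post">
<table>
<tr><td class="smallfont">User Name</td><td><input type="text" name="username"></td></tr>
<tr><td class="smallfont">Password</td><td><input type="password" name="password"></td></tr>
</table>
</form>
<form><table><tr><td>second form</td></tr></table></form>
"""

if __name__ == "__main__":
    print(text_between(SAMPLE_PAGE, "form", "td"))
```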

Best regards,
HMW

PS. I like both Python and BeautifulSoup, but I do believe it's overkill for this operation. However, if you WANT to use those you should of course do that. DS.

Last edited by HMW; 07-21-2015 at 02:00 PM. Reason: Spelling... again...
 
Old 07-23-2015, 01:39 PM   #4
metallica1973
Senior Member
 
Registered: Feb 2003
Location: Washington D.C
Posts: 2,190

Original Poster
Rep: Reputation: 60
Many thanks for the reply,

After putting a little elbow grease into this, I was able to accomplish what I needed to do with BeautifulSoup and re:
Code:
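from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen
from bs4 import BeautifulSoup

# BeautifulSoup parses markup, not URLs, so fetch the page first
wphtml = BeautifulSoup(urlopen('http://blah.blah/release').read(), 'html.parser')
rpart = wphtml.find('tbody')        # first <tbody> only
tds = rpart.find_all('td')
blah = []
for r in tds:                       # loop over the cells, not over rpart
    # the re.compile(...) call here was discarded unused, so it is dropped;
    # .string is None for cells that hold links instead of plain text
    blah.append(r.string)
blah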
u'4.2.2',
 None,
 None,
 None,
 u'4.2.1',
my next question is how do I get rid of the None

Last edited by metallica1973; 07-23-2015 at 02:33 PM.
 
Old 07-23-2015, 02:12 PM   #5
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
Quote:
Originally Posted by metallica1973
my next question is how do I get rid of the None
Well, that can be done in a lot of ways; I would do it like this. Given that rpart looks something like:
Code:
>>> rpart
['4.2.2', None, None, None, '4.2.1']
You can simply choose not to append None to the variable blah like this:
Code:
>>> blah = []
>>> for r in rpart:
...   if r is not None:
...     blah.append(r)
Then blah becomes:
Code:
>>> blah
['4.2.2', '4.2.1']
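
As background on where the None entries come from: in BeautifulSoup, a tag's .string is None when the tag contains more than one child (for example a cell holding several links), so only the plain version cells yield text. A small sketch, assuming bs4 is installed and using a made-up row (ROW_HTML is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Made-up row modelled on the release table (an assumption for this sketch).
ROW_HTML = (
    '<table><tr><td>4.2.2</td> '
    '<td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a> '
    '(<a href="https://blah/blah-4.2.2.zip.md5">md5</a>)</td></tr></table>'
)

cells = BeautifulSoup(ROW_HTML, "html.parser").find_all("td")

# The second cell has several children, so its .string is None.
strings = [td.string for td in cells]

# Skipping None while building the list avoids a second clean-up pass.
versions = [td.string for td in cells if td.string is not None]

print(strings)
print(versions)
```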
Best regards,
HMW

Edit:
You can also do a classic 'not equal' comparison:
Code:
if r != None:
But that is not the 'Pythonic' way of checking whether something is None; PEP 8 recommends 'is' or 'is not' for comparisons to None.

Last edited by HMW; 07-23-2015 at 02:17 PM.
 
Old 07-23-2015, 04:45 PM   #6
metallica1973
Senior Member
 
Registered: Feb 2003
Location: Washington D.C
Posts: 2,190

Original Poster
Rep: Reputation: 60
nevermind,

I figured it out.
Code:
blah = list(filter(None, blah))  # list() is needed on Python 3, where filter() is lazy
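
One caveat worth knowing, sketched with a made-up list: filter(None, ...) keeps only truthy items, so it drops empty strings and zeros as well as None. If only None should go, an explicit test is the more precise tool. The variable names here are invented for the example.

```python
# Made-up data: the scraped values plus one empty string.
blah = ["4.2.2", None, None, None, "4.2.1", ""]

# filter(None, ...) removes every falsy item, including "".
truthy_only = list(filter(None, blah))

# An explicit None test keeps other falsy values such as "".
not_none = [item for item in blah if item is not None]

print(truthy_only)
print(not_none)
```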
 
  

