Python BeautifulSoup Re Finding Digits Within Tags

metallica1973 · 07-20-2015, 01:08 PM

I am writing a little Python script that needs to grab the version numbers inside tags such as "<td>4.2.2</td>" within the tbody of the page:

Code:

[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> 
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> 
(<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> | <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)
</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> 
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)
</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]

Is it possible to use a one-liner to scrape only the digits between the tags:

"<td>4.2.2</td>"

so it spits out:
4.2.2
4.2.1
etc..

This is what I have done so far, but I don't understand why it creates the variable rpart as a ResultSet and not a regular string that I can scrape the data from.

Code:

wphtml = BeautifulSoup('http://blah.blah/release')
rpart = wphtml.find_all('tbody', limit=1)
rpart[0]
[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> (<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> (<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> | <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> (<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> (<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]
whos
rpart           ResultSet        [<tbody>\n<tr style="back<...>="1"></td></tr> </tbody>]
wphtml          BeautifulSoup    <!DOCTYPE html>\n<html di<...>"></iframe></body></html>
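For what it's worth, the ResultSet is expected behaviour: find_all() always hands back a list-like ResultSet of matches, even with limit=1, whereas find() returns the first matching Tag itself (or None). A minimal sketch of the difference, assuming bs4 is installed and using a made-up snippet in place of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = "<table><tbody><tr><td>4.2.2</td></tr></tbody></table>"
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("tbody", limit=1)  # ResultSet, a list subclass
first = soup.find("tbody")                 # a single Tag, or None if absent

print(type(matches).__name__)
print(type(first).__name__)
```

So either index the ResultSet (matches[0]) or call find() when only one element is wanted.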

Is there a way to do this as a one-liner?

Code:

rpart = wphtml.find_all('tbody', limit=1, td=re.compile('\<td\>\d*.\d*.\d*.\<\/td\>'))
4.2.2
4.2.1
etc..

or 

for tag in wphtml.find_all('tbody', limit=1, string=re.compile("\b\<td\>\d*.\d*.\d*.\<\/td\>\b")):
    print(tag.content)
4.2.2
4.2.1
etc..

So what I am trying to do is:

1 - Search through the HTML page and capture only the first [tbody]....[/tbody], hence limit=1
2 - Regex through the results and only print out the digits that are inside the <td>\d*.\d*.\d*.\<td> tags
3 - Resulting in:

4.2.2
4.2.1
etc..
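The three steps above could be sketched with BeautifulSoup roughly as follows. This is only a sketch under assumptions: it needs a reasonably recent bs4 (older releases spell the `string=` argument `text=`), the HTML is a made-up, trimmed copy of the table from the question, and `version_re` and `first_tbody` are names picked here for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down copy of the page from the question.
html = """
<table>
<tbody>
<tr><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a>)</td></tr>
<tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a>
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a>)</td></tr>
</tbody>
<tbody><tr><td>9.9.9</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
version_re = re.compile(r"^\d+(?:\.\d+)+$")  # e.g. 4.2 or 4.2.2

# Step 1: only the first <tbody>.
first_tbody = soup.find("tbody")

# Step 2: only <td> cells whose sole content is a version-like string.
versions = [str(td.string) for td in first_tbody.find_all("td", string=version_re)]

# Step 3: print one version per line.
for v in versions:
    print(v)
```

Collapsed into a single expression, that becomes `[str(td.string) for td in soup.find('tbody').find_all('td', string=version_re)]`. Cells that hold several links have no single string, so they never match the pattern.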

HMW · 07-21-2015, 12:33 PM

I am sure you can do this with BeautifulSoup, but you ought to be able to do it using sed as well.

Given your example html file (here named as: lq.html), I got the following output using this sed:

Code:

sed -n '/<tbody>/,/<\/tbody>/p' lq.html | sed -n 's/.*<td>\(.*\)<\/td>.*$/\1/p'
4.2.2
4.2.1

Obviously, you will need to fetch the page with wget (or curl) rather than read a downloaded HTML file as I did.
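If you would rather stay in Python, the same two-stage idea can be mirrored with the standard re module. This is a rough sketch, not a robust HTML parser: the `html` string is a made-up stand-in for lq.html, and regexes on HTML break easily once the markup changes.

```python
import re

# Hypothetical stand-in for the contents of lq.html.
html = """<table>
<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a>
</td></tr>
</tbody>
</table>"""

# Roughly the first sed: keep what lies between <tbody> and </tbody>.
tbody = re.search(r"<tbody>(.*?)</tbody>", html, flags=re.DOTALL)

# Roughly the second sed: pull the text out of attribute-less <td> cells.
versions = re.findall(r"<td>([^<]*)</td>", tbody.group(1)) if tbody else []

print(versions)
```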

Best regards,
HMW

PS. If I get more time, I can try to help you out with a soup. DS

HMW · 07-21-2015, 01:59 PM

Ok, me again. I tried my own idea on a rather familiar http address (http://www.linuxquestions.org/questions/) to see if it works in practice with wget, and it does. Check this out:

Code:

wget -qO- http://www.linuxquestions.org/questions/ | sed -n '/<form.*>/,/<\/form>/p' | sed -n 's/.*<td.*>\(.*\)<\/td>.*$/\1/p'
User Name


Password

What I did here was pipe the output of wget (the source code of linuxquestions.org/questions/) into sed. The first sed extracts the text between the <form> tags; the second extracts and prints ONLY the text between the <td> tags, which is "User Name" and "Password".

So, if you simply modify the address to your address, and change the tags to the ones you are looking for, I see no reason why you shouldn't be able to get this working for you without BeautifulSoup.
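For anyone who wants to skip BeautifulSoup but still stay in Python, a similar "container first, then cells" extraction can be sketched with the standard library's html.parser. Everything named here (`CellCollector`, the sample markup) is hypothetical, and the sketch ignores nested containers and nested cells.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of <td> cells inside the first `container` element."""

    def __init__(self, container):
        super().__init__()
        self.container = container
        self.in_container = False
        self.done = False      # set once the first container has closed
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == self.container:
            self.in_container = True
        elif tag == "td" and self.in_container:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == self.container and self.in_container:
            self.in_container = False
            self.done = True
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


# Hypothetical markup loosely modelled on a login form.
html = """<form action="login">
<table><tr><td class="smallfont">User Name</td><td><input name="u"></td></tr>
<tr><td class="smallfont">Password</td><td><input name="p"></td></tr></table>
</form>
<form><table><tr><td>ignored</td></tr></table></form>"""

parser = CellCollector("form")
parser.feed(html)

# Drop cells that held no text (e.g. the ones wrapping <input> elements).
labels = [c.strip() for c in parser.cells if c.strip()]
print(labels)
```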

Best regards,
HMW

PS. I like both Python and BeautifulSoup, but I do believe it's overkill for this operation. However, if you WANT to use those you should of course do that. DS.

metallica1973 · 07-23-2015, 01:39 PM

Many thanks for the reply,

After putting a little elbow grease into this, I was able to accomplish what I needed with BeautifulSoup and re:

Code:

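wphtml = BeautifulSoup('http://blah.blah/release')
rpart = wphtml.find('tbody')
tds = rpart.find_all('td')
blah = []
for r in tds:  # loop over the <td> cells, not over rpart itself
    # note: this compiled pattern is never used; .string does the work
    re.compile(r'<td>(.*?)</td>', flags=re.DOTALL)
    blah.append(r.string)
blah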
u'4.2.2',
 None,
 None,
 None,
 u'4.2.1',

My next question is: how do I get rid of the None entries?
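As for where the None entries come from: a tag's .string is only set when the tag has a single child string; a cell holding several links has no single string, so .string is None for it. A small sketch, assuming bs4 and a made-up row modelled on the table in the first post:

```python
from bs4 import BeautifulSoup

# Hypothetical row: one plain version cell, one cell with several links.
row = ('<tr><td>4.2.2</td><td><a href="#">zip</a> '
       '(<a href="#">md5</a> | <a href="#">sha1</a>)</td></tr>')
soup = BeautifulSoup(row, "html.parser")
cells = soup.find_all("td")

strings = [td.string for td in cells]    # None where a cell has several children
texts = [td.get_text() for td in cells]  # concatenated text of each cell

print(strings)
print(texts)
```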

HMW · 07-23-2015, 02:12 PM

Quote:

Originally Posted by metallica1973

My next question is: how do I get rid of the None entries?

Well, that can be done in a lot of ways; I would do it like this. Given that rpart looks something like:

Code:

>>> rpart
['4.2.2', None, None, None, '4.2.1']

You can simply choose not to append None to the variable blah like this:

Code:

>>> blah = []
>>> for r in rpart:
...   if r is not None:
...     blah.append(r)

Then blah becomes:

Code:

>>> blah
['4.2.2', '4.2.1']

Best regards,
HMW

Edit:
You can also do a classic '!=' comparison:

Code:

if r != None:

But that is not the 'Pythonic' way of checking if something is 'None' or not.
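One reason `is not None` is usually preferred (PEP 8 recommends `is` / `is not` for comparisons to singletons such as None) is that `!=` can be overridden by the object being compared, while an identity test cannot. A contrived Python 3 sketch, with `AlwaysEqual` being a hypothetical class invented for the demonstration:

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

# On Python 3, != falls back to the inverse of the overridden __eq__.
looks_like_none = not (obj != None)

# The identity test ignores __eq__ entirely.
really_none = obj is None

print(looks_like_none, really_none)
```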

metallica1973 · 07-23-2015, 04:45 PM

Never mind,

I figured it out.

Code:

blah=filter(None, blah)
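Two caveats on filter(None, ...), sketched below with a made-up list: it drops every falsy item (so an empty string goes along with None), and on Python 3 it returns a lazy iterator rather than a list, so wrap it in list() if you need indexing or len().

```python
# Made-up sample with an empty string added to show the difference.
blah = ['4.2.2', None, '', None, '4.2.1']

# filter(None, ...) keeps only truthy items; list() materialises the iterator.
truthy_only = list(filter(None, blah))

# A comprehension that removes only None keeps the empty string.
not_none = [v for v in blah if v is not None]

print(truthy_only)  # ['4.2.2', '4.2.1']
print(not_none)     # ['4.2.2', '', '4.2.1']
```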