LinuxQuestions.org

metallica1973 06-24-2015 01:01 PM

Python Combining Two Commands
 
I have been digging deeper into Python and want to make my code as efficient as possible. The fewer lines of code the better, so I have been experimenting and wanted to ask the Python gurus if this is possible. So:

Code:

...
...
In [109]: kbfileurl = re.search('<p>For more information about this update.*</p>', tbull.text.encode('utf8'))
In [110]: kbfileurl.group()
Out[110]: '<p>For more information about this update, see <a href="https://support.microsoft.com/kb/3020393">Microsoft Knowledge Base Article 3020393</a>.</p>'

So, based on the URL string that I parsed out of the HTML page, I would like to pull out only the following, in a one-liner:
https://support.microsoft.com/kb/3020393
So is it possible to combine kbfileurl.group() with re.compile:
Code:

kbfileurl.group().encode('ascii')re.compile(r'\bhttps://support.microsoft.com/kb/\d+\b')
to parse out:
https://support.microsoft.com/kb/3020393
??

metallica1973 06-24-2015 02:16 PM

After playing around with it, I made a small modification and came up with this, though it is not exactly a one-liner:
Code:

In [201]: kbfileurl = re.search('<p>For more information about this update.*</p>', tbull.text.encode('utf8')).group()

In [202]: kbfileurl
Out[202]: '<p>For more information about this update, see <a href="https://support.microsoft.com/kb/3020393">Microsoft Knowledge Base Article 3020393</a>.</p>'

In [203]: kburl = re.search(r'\bhttps://support.microsoft.com/kb/\d+\b', kbfileurl).group(0)

In [204]: kburl
Out[204]: 'https://support.microsoft.com/kb/3020393'
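
For what it's worth, the two steps above can be chained into a single expression by nesting one search inside the other. This is only a sketch: it assumes Python 3 and that the page text is already a str (so the encode call is dropped), and the sample text is simply the output pasted above rather than a real tbull.text. The dots in the URL pattern are escaped so they match literal dots only.

```python
import re

# Sample input copied from the Out[202] line above; stands in for tbull.text.
text = ('<p>For more information about this update, see '
        '<a href="https://support.microsoft.com/kb/3020393">'
        'Microsoft Knowledge Base Article 3020393</a>.</p>')

# The inner search isolates the paragraph; the outer search pulls the URL out of it.
kburl = re.search(
    r'\bhttps://support\.microsoft\.com/kb/\d+\b',
    re.search(r'<p>For more information about this update.*</p>', text).group(),
).group()

print(kburl)
```

One caveat with chaining like this: re.search returns None when nothing matches, so either .group() call raises AttributeError on a page that lacks the paragraph or the link.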


dugan 06-24-2015 02:31 PM

You should not write your own regex to parse HTML in the real world, but there is nothing wrong with trying it for learning.
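
As an illustration of the non-regex route, here is a rough sketch that collects link targets with the standard library's html.parser module. It assumes Python 3 (the module had a different name in Python 2), and the LinkCollector class and the sample html string are made up for this example, not taken from the original code.

```python
from html.parser import HTMLParser

# Sample input based on the paragraph quoted earlier in the thread.
html = ('<p>For more information about this update, see '
        '<a href="https://support.microsoft.com/kb/3020393">'
        'Microsoft Knowledge Base Article 3020393</a>.</p>')


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)

# Keep only the Knowledge Base links.
kb_links = [url for url in collector.links
            if url.startswith('https://support.microsoft.com/kb/')]
print(kb_links)
```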

Anyway, you should be doing this with one regex search. If this works, it works:

Code:

re.search(r'\bhttps://support.microsoft.com/kb/\d+\b', CONTENTS_OF_ENTIRE_FILE)
If that matches the wrong URL (re.search only returns the first match), then put more contextual information in the regex.
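
A small sketch of what adding context might look like, assuming Python 3. The contents string is invented for illustration (a page with two KB links), and using re.findall to see every match is my own addition rather than part of the suggestion above.

```python
import re

# Hypothetical page contents: two KB links, only the second one is in the
# "For more information" paragraph.
contents = (
    '<p>See also <a href="https://support.microsoft.com/kb/1111111">'
    'an older article</a>.</p>\n'
    '<p>For more information about this update, see '
    '<a href="https://support.microsoft.com/kb/3020393">'
    'Microsoft Knowledge Base Article 3020393</a>.</p>\n'
)

kb_url = r'https://support\.microsoft\.com/kb/\d+'

# Without context, every KB link on the page matches.
all_urls = re.findall(r'\b' + kb_url + r'\b', contents)

# With context, one search matches the paragraph lead-in and captures just its link.
narrowed = re.search(
    r'For more information about this update[^<]*<a href="(' + kb_url + r')"',
    contents,
)

print(all_urls)
print(narrowed.group(1))
```

The capture group keeps the contextual text out of the result, so group(1) is the bare URL.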


All times are GMT -5.