Old 04-01-2013, 10:31 PM   #1
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Rep: Reputation: Disabled
extract links from webpage


I have a "source.txt" file which contains list of some URLs. For example:

Code:
source.txt:    
http://www.amazon.com/gp/product/B007OZNZG0/ref=s9_pop_gw_g349_ir05/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
http://www.amazon.com/gp/product/B0083PWAPW/ref=s9_pop_gw_g424_ir04/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
I want to extract all links containing "/gp/product" from the pages at the above URLs and store them in an "extracted.txt" file, which would be:

Code:
extracted.txt:
http://www.amazon.com/gp/product/B008GFRB9E/ref=fs_j
http://www.amazon.com/gp/product/B008GFUA4C/ref=fs_2
I am using Cygwin on Windows 7 (64 bit).
Any suggestions for this?

Thanks.
EDIT:
I want to retrieve and search through the links in "source.txt" for the keyword and export the matching links to "extracted.txt".

Last edited by Si14; 04-02-2013 at 01:35 AM.
 
Old 04-02-2013, 12:00 AM   #2
lykwydchykyn
Member
 
Registered: Mar 2006
Location: Tennessee, USA
Distribution: Debian, Ubuntu
Posts: 135

Rep: Reputation: 36
Does
Code:
grep "/gp/product" source.txt
not do it?
 
Old 04-02-2013, 12:47 AM   #3
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Thanks. What I want to do is to retrieve the links inside "source.txt" and search through the HTML of each page for that keyword, then find and write the matching links to "extracted.txt".
 
Old 04-02-2013, 12:56 AM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
I can't see 'fs_j' or 'fs_2' in the src file....??
 
Old 04-02-2013, 01:33 AM   #5
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by chrism01 View Post
I can't see 'fs_j' or 'fs_2' in the src file....??
Read the post before yours!
 
Old 04-02-2013, 01:40 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7309
Use awk or perl or similar to process it.
The first step is to collect the URLs and keywords, the next step is to evaluate the keywords, and finally you can print the result.
Actually, we need a sample of those keywords to be able to help further.
Which language do you prefer?

Last edited by pan64; 04-02-2013 at 01:40 AM. Reason: typo
 
Old 04-02-2013, 01:58 AM   #7
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by pan64 View Post
Use awk or perl or similar to process it.
The first step is to collect the URLs and keywords, the next step is to evaluate the keywords, and finally you can print the result.
Actually, we need a sample of those keywords to be able to help further.
Which language do you prefer?
Thank you for your reply. Yes, as I said before, what I want to do is to:
1- retrieve each link inside "source.txt"
2- search through the HTML of each of the above links for a keyword ("/gp/product")
3- extract and write those links which pass the comparison to "extracted.txt"

I am using Cygwin on Windows. I am a little bit familiar with PHP as well.
 
Old 04-02-2013, 02:01 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7309
I still don't know how to handle those keywords.
Cygwin has perl, python, php and awk as well, so you can pick whichever you prefer.
 
Old 04-02-2013, 02:19 AM   #9
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Thank you. Just to clarify, the keyword is a single phrase, "/gp/product". That means any link that has this phrase in it should be exported.
As for the programs you mentioned, I am sorry, I am not familiar with them; I know a little about some of them. If you have a program, it would be useful.

Last edited by Si14; 04-02-2013 at 02:20 AM.
 
Old 04-02-2013, 02:23 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7309
I have no program, I have never written such code, but I can help you implement what you need.
The first step could be to use grep, as mentioned in #2.
 
Old 04-02-2013, 02:35 AM   #11
evo2
LQ Guru
 
Registered: Jan 2009
Location: Japan
Distribution: Mostly Debian and CentOS
Posts: 6,724

Rep: Reputation: 1705
Hi,

The biggest mystery here is trying to work out exactly what you want to do... using words like "keyword" and "extracted" is kind of misleading. Anyway, here is my best guess (untested).
Code:
while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt
This downloads each URL and greps it for the string '/gp/product'; if that string is found, it outputs the URL. To get those URLs into a file, you could do:

Code:
(while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt) > extracted.txt
HTH,

Evo2.

PS. POSIX grep has -q, but if yours doesn't you'll need to redirect the output to /dev/null.
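For reference, a fallback form without -q might look like this (functionally the same loop, just relying on the redirect instead of the flag):

Code:
while read -r url ; do curl -s "$url" | grep '/gp/product' >/dev/null && echo "$url" ; done < source.txt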

Last edited by evo2; 04-02-2013 at 07:09 AM.
 
Old 04-02-2013, 05:27 AM   #12
mina86
Member
 
Registered: Aug 2008
Distribution: Debian
Posts: 517

Rep: Reputation: 229
https://github.com/mina86/tinyapps/b...xtractlinks.pl
 
Old 04-02-2013, 09:00 AM   #13
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by evo2 View Post
Hi,

The biggest mystery here is trying to work out exactly what you want to do... using words like "keyword" and "extracted" is kind of misleading. Anyway, here is my best guess (untested).
Code:
while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt
This downloads each URL and greps it for the string '/gp/product'; if that string is found, it outputs the URL. To get those URLs into a file, you could do:

Code:
(while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt) > extracted.txt
HTH,

Evo2.

PS. POSIX grep has -q, but if yours doesn't you'll need to redirect the output to /dev/null.
Thank you for your reply. I am sorry if the message was confusing.
I ran your code; it runs without any error, but it finishes in less than 2 seconds and nothing is written to "extracted.txt".
Cygwin grep does have -q; according to its help:
-q, --quiet, --silent     suppress all normal output

Please let me know if you have any idea. Thank you.
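One possible cause, though nothing in this thread confirms it: on Cygwin a file saved with Windows tools usually has CRLF line endings, so every URL the loop reads carries a trailing carriage return, and curl also doesn't follow redirects unless told to. A quick check and a hedged variant of the same loop:

Code:
file source.txt                              # reports "with CRLF line terminators" if the endings are suspect
tr -d '\r' < source.txt > source.unix.txt    # strip any carriage returns
(while read -r url ; do curl -sL "$url" | grep -q '/gp/product' && echo "$url" ; done < source.unix.txt) > extracted.txt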
 
Old 04-02-2013, 09:32 AM   #14
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244
In my post history you can find Perl code that uses the HTML::Tree modules for extracting links.

http://www.linuxquestions.org/questi...is-4175451447/
 
Old 04-02-2013, 10:22 AM   #15
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I don't recommend trying to parse raw html with tools like grep. The unstructured, nested nature of html is very difficult to parse reliably with line- and regex-based tools.

My recommendation would be to install the html-xml-utils package and then use the hxpipe application it provides. This converts the raw data into a line-based format that's easier to parse with the regular tools.

To get a list of all href links from an html file:
Code:
hxpipe input.html | sed -n 's/^Ahref CDATA //p' | grep '/gp/product'
You might also need to run the output through hxunent, if you want to convert escape sequences back into their literal equivalents.
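For example, hxunent ships in the same html-xml-utils package and works as a plain stdin-to-stdout filter, so it should slot onto the end of the same pipeline:

Code:
hxpipe input.html | sed -n 's/^Ahref CDATA //p' | grep '/gp/product' | hxunent    # hxunent decodes the HTML entities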

In fact, I used this on the two urls you provided:
Code:
mapfile -t urls <data.txt    # read the input urls into an array

for url in "${urls[@]}"; do

    source=$( wget -q -O- "$url" )    # fetch the page
    # flatten with hxpipe, strip the "Ahref CDATA " prefix, and print lines containing /gp/product
    { hxpipe <<<"$source" | sed -n 's/^Ahref CDATA // ; \|/gp/product|p' ;} 2>/dev/null
done >urlfile.txt
And ended up with this list of urls in the output file:

Code:
/gp/product/B007OZNZG0
https://www.amazon.com/gp/product/utility/edit-one-click-pref.html?ie=UTF8&query=*entries*%3D0%2C*Version*%3D1&returnPath=%2Fgp%2Fproduct%2FB007OZNZG0
http://www.amazon.com/gp/product/B008681XSG
/gp/product/B008UB7DU6/ref=kindle_dp_comp
/gp/product/B008GG93YE/ref=kindle_dp_comp
/gp/product/B004HZYA6E/ref=kindle_dp_comp
/gp/product/B008GFRBBW/ref=kindle_dp_comp
/gp/product/B008GFRB9E/ref=kindle_dp_comp
/gp/product/B008GGCAVM/ref=kindle_dp_comp
/gp/product/B008GFUA4C/ref=kindle_dp_comp
/gp/product/tags-on-product/B007OZNZG0
/gp/product/B007OZNZG0
https://www.amazon.com/gp/product/utility/edit-one-click-pref.html?ie=UTF8&query=*entries*%3D0%2C*Version*%3D1&returnPath=%2Fgp%2Fproduct%2FB0083PWAPW
/gp/product/B008UB7DU6/ref=kindle_dp_comp
/gp/product/B008GEKXUO/ref=kindle_dp_comp
/gp/product/B008GG93YE/ref=kindle_dp_comp
/gp/product/B004HZYA6E/ref=kindle_dp_comp
/gp/product/B008GFRBBW/ref=kindle_dp_comp
/gp/product/B008GFRB9E/ref=kindle_dp_comp
/gp/product/B008GFUA4C/ref=kindle_dp_comp
/gp/product/tags-on-product/B0083PWAPW
(I had to redirect errors away to /dev/null; it appears that Amazon's pages are a bit hard for hxpipe to swallow.)
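Since many of those results are relative paths while the desired "extracted.txt" held absolute URLs, one last hedged touch-up could prefix the host (this assumes every relative link belongs to www.amazon.com):

Code:
sed 's|^/|http://www.amazon.com/|' urlfile.txt > extracted.txt    # assumes all relative links live on www.amazon.com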
 