Old 04-01-2013, 10:31 PM   #1
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Rep: Reputation: Disabled
extract links from webpage


I have a "source.txt" file which contains list of some URLs. For example:

Code:
source.txt:    
http://www.amazon.com/gp/product/B007OZNZG0/ref=s9_pop_gw_g349_ir05/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
http://www.amazon.com/gp/product/B0083PWAPW/ref=s9_pop_gw_g424_ir04/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
I want to extract all links containing "/gp/product" from the pages at the above URLs and store them in an "extracted.txt" file, which would be:

Code:
extracted.txt:
http://www.amazon.com/gp/product/B008GFRB9E/ref=fs_j
http://www.amazon.com/gp/product/B008GFUA4C/ref=fs_2
I am using Cygwin on Windows 7 (64 bit).
Any suggestions for this?

Thanks.
EDIT:
I want to retrieve and search through the links in "source.txt" for the keyword and export the matching links to "extracted.txt".

Last edited by Si14; 04-02-2013 at 01:35 AM.
 
Old 04-02-2013, 12:00 AM   #2
lykwydchykyn
Member
 
Registered: Mar 2006
Location: Tennessee, USA
Distribution: Debian, Ubuntu
Posts: 135

Rep: Reputation: 36
Does
Code:
grep "/gp/product" source.txt
not do it?
 
Old 04-02-2013, 12:47 AM   #3
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Thanks. What I want to do is to retrieve the links inside "source.txt" and search through the HTML of each page for that keyword, then find and write the matching links to "extracted.txt".
 
Old 04-02-2013, 12:56 AM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
I can't see 'fs_j' or 'fs_2' in the src file....??
 
Old 04-02-2013, 01:33 AM   #5
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by chrism01 View Post
I can't see 'fs_j' or 'fs_2' in the src file....??
Read the post before yours!
 
Old 04-02-2013, 01:40 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7309
Use awk or perl or similar to process it.
The first step is to collect the URLs and keywords, the next step is to evaluate the keywords, and finally you can print the result.
Actually, we need a sample of those keywords to be able to help further.
Which language do you prefer?

Last edited by pan64; 04-02-2013 at 01:40 AM. Reason: typo
 
Old 04-02-2013, 01:58 AM   #7
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by pan64 View Post
Use awk or perl or similar to process it.
The first step is to collect the URLs and keywords, the next step is to evaluate the keywords, and finally you can print the result.
Actually, we need a sample of those keywords to be able to help further.
Which language do you prefer?
Thank you for your reply. Yes, as I said before, what I want to do is to:
1- retrieve each link inside "source.txt"
2- search through the HTML of each of the above links for a keyword ("/gp/product")
3- extract and write those links which pass the comparison to "extracted.txt"

I am using Cygwin on Windows. I am a little bit familiar with PHP as well.
 
Old 04-02-2013, 02:01 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7309
I still don't know how to handle those keywords.
Cygwin has perl, python, php and awk as well, so you can pick whichever you prefer.
 
Old 04-02-2013, 02:19 AM   #9
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Thank you. Just to clarify, the keyword is a single phrase, "/gp/product". That means any link that has this phrase in it should be exported.
As for the programs you mentioned, I am sorry, I am not familiar with them; I know a little about some of them. If you have a program, it would be useful.

Last edited by Si14; 04-02-2013 at 02:20 AM.
 
Old 04-02-2013, 02:23 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7309
I have no program, I have never written such code, but I can help you implement what you need.
The first step could be to use grep, as mentioned in #2.
 
Old 04-02-2013, 02:35 AM   #11
evo2
LQ Guru
 
Registered: Jan 2009
Location: Japan
Distribution: Mostly Debian and CentOS
Posts: 6,724

Rep: Reputation: 1705
Hi,

The biggest mystery here is trying to work out exactly what you want to do... using words like "keyword" and "extracted" is kind of misleading. Anyway, here is my best guess (untested).
Code:
while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt
This downloads each URL and greps it for the string '/gp/product'; if that string is found, it outputs the URL. To get those URLs into a file, you could do:

Code:
(while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt) > extracted.txt
HTH,

Evo2.

PS. POSIX grep has -q, but if yours doesn't you'll need to redirect the output to /dev/null.
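For reference, a fallback form without -q might look like this (functionally the same loop, just relying on the redirect instead of the flag):

Code:
while read -r url ; do curl -s "$url" | grep '/gp/product' >/dev/null && echo "$url" ; done < source.txt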

Last edited by evo2; 04-02-2013 at 07:09 AM.
 
Old 04-02-2013, 05:27 AM   #12
mina86
Member
 
Registered: Aug 2008
Distribution: Debian
Posts: 517

Rep: Reputation: 229
https://github.com/mina86/tinyapps/b...xtractlinks.pl
 
Old 04-02-2013, 09:00 AM   #13
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by evo2 View Post
Hi,

The biggest mystery here is trying to work out exactly what you want to do... using words like "keyword" and "extracted" is kind of misleading. Anyway, here is my best guess (untested).
Code:
while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt
This downloads each URL and greps it for the string '/gp/product'; if that string is found, it outputs the URL. To get those URLs into a file, you could do:

Code:
(while read -r url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt) > extracted.txt
HTH,

Evo2.

PS. POSIX grep has -q, but if yours doesn't you'll need to redirect the output to /dev/null.
Thank you for your reply. I am sorry if the message was confusing.
I ran your code; it runs without any error, but it finishes in less than 2 seconds and nothing is written to "extracted.txt".
Cygwin grep does have -q; according to its help:
-q, --quiet, --silent     suppress all normal output

Please let me know if you have any idea. Thank you.
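One possible cause, though nothing in this thread confirms it: on Cygwin a file saved with Windows tools usually has CRLF line endings, so every URL the loop reads carries a trailing carriage return, and curl also doesn't follow redirects unless told to. A quick check and a hedged variant of the same loop:

Code:
file source.txt                              # reports "with CRLF line terminators" if the endings are suspect
tr -d '\r' < source.txt > source.unix.txt    # strip any carriage returns
(while read -r url ; do curl -sL "$url" | grep -q '/gp/product' && echo "$url" ; done < source.unix.txt) > extracted.txt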
 
Old 04-02-2013, 09:32 AM   #14
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244
In my post history you can find Perl code that uses the HTML::Tree modules for extracting links.

http://www.linuxquestions.org/questi...is-4175451447/
 
Old 04-02-2013, 10:22 AM   #15
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I don't recommend trying to parse raw html with tools like grep. The unstructured, nested nature of html is very difficult to parse reliably with line- and regex-based tools.

My recommendation would be to install the html-xml-utils package and then use the hxpipe application it provides. This converts the raw data into a line-based format that's easier to parse with the regular tools.

To get a list of all href links from an html file:
Code:
hxpipe input.html | sed -n 's/^Ahref CDATA //p' | grep '/gp/product'
You might also need to run the output through hxunent, if you want to convert escape sequences back into their literal equivalents.
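For example, hxunent ships in the same html-xml-utils package and works as a plain stdin-to-stdout filter, so it should slot onto the end of the same pipeline:

Code:
hxpipe input.html | sed -n 's/^Ahref CDATA //p' | grep '/gp/product' | hxunent    # hxunent decodes the HTML entities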

In fact, I used this on the two urls you provided:
Code:
mapfile -t urls <data.txt    # read the input urls into an array

for url in "${urls[@]}"; do

    source=$( wget -q -O- "$url" )    # fetch the page
    # flatten with hxpipe, strip the "Ahref CDATA " prefix, and print lines containing /gp/product
    { hxpipe <<<"$source" | sed -n 's/^Ahref CDATA // ; \|/gp/product|p' ;} 2>/dev/null
done >urlfile.txt
And ended up with this list of urls in the output file:

Code:
/gp/product/B007OZNZG0
https://www.amazon.com/gp/product/utility/edit-one-click-pref.html?ie=UTF8&query=*entries*%3D0%2C*Version*%3D1&returnPath=%2Fgp%2Fproduct%2FB007OZNZG0
http://www.amazon.com/gp/product/B008681XSG
/gp/product/B008UB7DU6/ref=kindle_dp_comp
/gp/product/B008GG93YE/ref=kindle_dp_comp
/gp/product/B004HZYA6E/ref=kindle_dp_comp
/gp/product/B008GFRBBW/ref=kindle_dp_comp
/gp/product/B008GFRB9E/ref=kindle_dp_comp
/gp/product/B008GGCAVM/ref=kindle_dp_comp
/gp/product/B008GFUA4C/ref=kindle_dp_comp
/gp/product/tags-on-product/B007OZNZG0
/gp/product/B007OZNZG0
https://www.amazon.com/gp/product/utility/edit-one-click-pref.html?ie=UTF8&query=*entries*%3D0%2C*Version*%3D1&returnPath=%2Fgp%2Fproduct%2FB0083PWAPW
/gp/product/B008UB7DU6/ref=kindle_dp_comp
/gp/product/B008GEKXUO/ref=kindle_dp_comp
/gp/product/B008GG93YE/ref=kindle_dp_comp
/gp/product/B004HZYA6E/ref=kindle_dp_comp
/gp/product/B008GFRBBW/ref=kindle_dp_comp
/gp/product/B008GFRB9E/ref=kindle_dp_comp
/gp/product/B008GFUA4C/ref=kindle_dp_comp
/gp/product/tags-on-product/B0083PWAPW
(I had to redirect errors away to /dev/null; it appears that Amazon's pages are a bit hard for hxpipe to swallow.)
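Since many of those results are relative paths while the desired "extracted.txt" held absolute URLs, one last hedged touch-up could prefix the host (this assumes every relative link belongs to www.amazon.com):

Code:
sed 's|^/|http://www.amazon.com/|' urlfile.txt > extracted.txt    # assumes all relative links live on www.amazon.com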
 