Thanks. What I want to do is retrieve the links inside "source.txt", search through the HTML of each of those links for that keyword, and then find and write the matching links to "extracted.txt".
Use awk or perl or something similar to process it.
The first step is to collect the URLs and keywords, the next step is to evaluate the keywords, and finally you can print the result. A rough sketch of that structure is below.
Actually, we need some samples of those keywords to be able to help further.
Which language do you prefer?
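A minimal, untested sketch of that three-step structure in shell (assuming one URL per line in source.txt, and one keyword per line in a hypothetical keywords.txt):
Code:
# collect: read the URLs from source.txt and fetch each page
while read -r url ; do
    page=$(curl -s "$url")
    # evaluate: test the fetched page against every keyword
    while read -r kw ; do
        # print: emit the URL once any keyword matches
        printf '%s\n' "$page" | grep -qF "$kw" && { echo "$url" ; break ; }
    done < keywords.txt
done < source.txt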
Thank you for your reply. Yes, as I said before, what I want to do is:
1- retrieve each link inside "source.txt"
2- search through the HTML of each of those links for a keyword ("/gp/product")
3- extract and write the links that pass the comparison to "extracted.txt"
I am using Cygwin on Windows. I am also a little familiar with PHP.
Thank you. Just to clarify, the keyword is a single phrase, "/gp/product". That means any link that has this phrase in it should be exported.
As for the programs you mentioned, I am sorry, I am not really familiar with them; I know only a little about some of them. If you have a program, it might be useful.
I have no ready-made program; I have never written such code, but I can help you implement what you need.
The first step could be to use grep, as mentioned in #2. For example, see the check below.
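An untested example of that first step on a single page (the URL here is just a placeholder):
Code:
curl -s http://example.com/somepage.html | grep -qF '/gp/product' && echo "keyword found"
grep -q sets only the exit status, so the echo runs only when the phrase occurs somewhere in the downloaded HTML.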
The biggest mystery here is trying to work out exactly what you want to do... using words like "keyword" and "extracted" is kind of misleading. Anyway, here is my best guess (untested).
Code:
while read url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt
This downloads each URL and greps it for the string '/gp/product'; if that string is found, it outputs the URL. To get those URLs in a file, you could do:
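Code:
while read url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source.txt > extracted.txt
The only change is the redirection at the end, which sends the matching URLs into "extracted.txt".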
PS. POSIX grep has -s, but if yours doesn't, you'll need to redirect the error output to /dev/null (2>/dev/null) instead.
Thank you for your reply. I am sorry if the message was confusing.
I ran your code; the command apparently runs without any error, but it finishes in less than 2 seconds and nothing is written to "extracted.txt".
Cygwin grep has -s, and according to its help:
-s, --no-messages suppress error messages
Please let me know if you have any idea. Thank you.
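One possible cause worth checking (an assumption, since it depends on how source.txt was created): on Cygwin, a file saved with Windows CRLF line endings leaves a trailing carriage return on every URL, which makes curl fail silently. Stripping the carriage returns first may help (source_unix.txt is just a scratch file name):
Code:
tr -d '\r' < source.txt > source_unix.txt
while read url ; do curl -s "$url" | grep -q '/gp/product' && echo "$url" ; done < source_unix.txt > extracted.txt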
I don't recommend trying to parse raw html with tools like grep. The unstructured, nested nature of html is very difficult to parse reliably with line- and regex-based tools.
My recommendation would be to install the html-xml-utils package and then use the hxpipe application it provides. This converts the raw data into a line-based format that's easier to parse with the regular tools.
To get a list of all href links from an html file:
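Here is an untested sketch (it assumes hxpipe's ESIS-style output, where each attribute appears on its own line as "Ahref CDATA value", and runs hxnormalize -x first to clean up HTML that isn't well-formed; file.html is a placeholder name):
Code:
hxnormalize -x file.html | hxpipe | awk '$1 == "Ahref" { print $3 }'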