So this is what I have going on with reading in the RSSFeeds file.
Code:
func_collectrss ()
{
    while read -r line; do           # Read each line of the RSSFile individually
        wget "$line" -O file.html    # Download the RSS feed from that line of the file
        mv file.html ./tmp/file.html # Move the downloaded RSS feed to the tmp directory
        url=$(grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' ./tmp/file.html) # Extract the URLs from the downloaded RSS feed
        printf "%s\n" $url > tmp/url.dat # $url unquoted on purpose: word splitting puts one URL per line
    done < "$RSSFile"
}
Something is off, because I am getting partial URLs from the HTML file downloaded when I run wget on the RSS feed. This is the RSS feed link I am using for testing: http://rss.indeed.com/rss?q=systems+...ator&l=atlanta
jared@debian:~/Documents/Scripts/jobs/files$ cat RSSFeeds
http://rss.indeed.com/rss?q=systems+administrator&l=atlanta
jared@debian:~/Documents/Scripts/jobs/files$ wget http://rss.indeed.com/rss?q=systems+administrator&l=atlanta
[1] 5850
jared@debian:~/Documents/Scripts/jobs/files$ --2010-01-01 19:18:30-- http://rss.indeed.com/rss?q=systems+administrator
Resolving rss.indeed.com... 208.43.117.136
Connecting to rss.indeed.com|208.43.117.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: “rss?q=systems+administrator”
[ <=> ] 20,012 --.-K/s in 0.1s
2010-01-01 19:18:31 (153 KB/s) - “rss?q=systems+administrator” saved [20012]
[1]+ Done wget http://rss.indeed.com/rss?q=systems+administrator
jared@debian:~/Documents/Scripts/jobs/files$ ls
keywords RSSFeeds rss?q=systems+administrator sedscript sedscript~
jared@debian:~/Documents/Scripts/jobs/files$ ./sedscript rss\?q\=systems+administrator | grep http://
jared@debian:~/Documents/Scripts/jobs/files$ cat sedscript
#! /bin/sed -nf
# Join lines if we have tags that span multiple lines
:join
/<[^>]*$/ { N; s/[ ]*\n[ ]*/ /; b join; }
# Do some selection to speed the thing up
/<[ ]*\([aA]\|[iI][mM][gG]\)/!b
# Remove extra spaces before/after the tag name, change img/area to a
s/<[ ]*\([aA]\|[iI][mM][gG]\|[aA][rR][eE][aA]\)[ ]\+/<a /g
# To simplify the regexps that follow, change href/alt to lowercase
# and replace whitespace before them with a single space
s/<a\([^>]*\)[ ][hH][rR][eE][fF]=/<a\1 href=/g
s/<a\([^>]*\)[ ][aA][lL][tT]=/<a\1 alt=/g
# To simplify the regexps that follow, quote the arguments to href and alt
s/href=\([^" >]\+\)/href="\1"/g
s/alt=\([^" >]\+\)/alt="\1"/g
# Move the alt tag after href, remove attributes between them
s/\( alt="[^"]*"\)[^>]*\( href="[^"]*"\)/\2\1/g
# Remove attributes between <a and href
s/<a[^>]* href="/<a href="/g
# Change href="xxx" ... alt="yyy" to href="xxx|yyy"
s/\(<a href="[^"]*\)"[^>]* alt="\([^"]*"\)/\1|\2/g
t loop
# Print an URL, remove it, and loop
:loop
h
s/.*<a href="\([^"]*\)".*$/\1/p
g
s/\(.*\)<a href="\([^"]*\)".*$/\1/
t loop
jared@debian:~/Documents/Scripts/jobs/files$
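A side note on the interactive wget in the session above: the [1] 5850 line shows the shell put the command in the background, because the unquoted & in the URL is a shell operator, so everything after it (l=atlanta) never reached wget. Quoting the URL keeps the whole query string together; a minimal sketch:
Code:
# Unquoted, the & backgrounds the command and drops l=atlanta:
#   wget http://rss.indeed.com/rss?q=systems+administrator&l=atlanta
# Quoted, the full feed URL (including the location) is fetched:
wget "http://rss.indeed.com/rss?q=systems+administrator&l=atlanta" -O file.html
The same applies inside the script: quoting "$line" when it is passed to wget keeps any & in a feed URL intact.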
Why don't you make use of an RSS feed library? Here's a partial example using Python and the feedparser library.
Code:
import os
import sys
import feedparser

root = "/home"
keyfile = os.path.join(root, "path1", "keywords") # where the keywords file is stored
rssfile = os.path.join(root, "path1", "rssfile")  # where the rss file that contains all your rss links is stored

if not os.path.exists(keyfile) or not os.path.exists(rssfile):
    print "No keywords or rss links file"
    sys.exit()

# store all keywords in a list for later use
keywords = open(keyfile).read().split()

# read the rssfile for rss links
for rsslink in open(rssfile):
    rsslink = rsslink.strip()
    # pass the link to the parser
    feed = feedparser.parse(rsslink)
    for item in feed["items"]:
        print "title:", item["title"]
        print "url: ", item["link"]
        print "summary: ", item["summary"]
        print "*" * 100
        # check for keywords etc etc..
Partial output when run:
Code:
$ ./python.py|more
title: Senior Solaris Systems Administrator - Atlanta, GA
url: http://www.indeed.com/rc/clk?jk=580f08b272eb9e30&from=rss&qd=RnZhMybXSk4M3QtTVGXWoVUpPKQ-Ar2L74KrkUB91D73Oal7uRnyiLEYdTIyc2C-Y0nnJsYLYzaI9wfwmiLV62FaWCOBmwPWHXOVGjbhxi0&rd=
summary: for an experienced Sun Microsystems Solaris Systems Administrator to join a project for one of our clients... to do documentation on systems Business casual work... <br />
From ComputerJobs.com - 31 Dec 2009 13:23:00 GMT - <a href="http://www.indeed.com/job/Senior-Solaris-Systems-Administrator-in-Atlanta,-GA-580f08b272eb9e30">save job, email, more...</a>
****************************************************************************************************
title: SMS Systems Administrator - CCCi - Atlanta - Atlanta, GA
url: http://www.indeed.com/rc/clk?jk=8d8e9d083bfaf53a&from=rss&qd=RnZhMybXSk4M3QtTVGXWoVUpPKQ-Ar2L74KrkUB91D73Oal7uRnyiLEYdTIyc2C-Y0nnJsYLYzaI9wfwmiLV62FaWCOBmwPWHXOVGjbhxi0&rd=
summary: to-hire opportunity for a Microsoft SMS Systems Administrator for our customer in the North Atlanta area... possess the necessary systems experience required to... <br />
From Engineering Central - 31 Dec 2009 22:47:02 GMT - <a href="http://www.indeed.com/viewjob?t=SMS+Systems+Administrator&c=CCCi+-+Atlanta&l=Atlanta,+GA&jk=8d8e9d083bfaf53a">save job, email, more...</a>
****************************************************************************************************
title: Senior Unix Systems Administrator - Consilium1 - Atlanta, GA
url: http://www.indeed.com/rc/clk?jk=6a9e101777d4fc51&from=rss&qd=RnZhMybXSk4M3QtTVGXWoVUpPKQ-Ar2L74KrkUB91D73Oal7uRnyiLEYdTIyc2C-Y0nnJsYLYzaI9wfwmiLV62FaWCOBmwPWHXOVGjbhxi0&rd=
summary: for a Senior Systems Administrator to join our team... Produce design for the IT systems specifying the operations the system will perform and the way the data... <br />
From Great IT Jobs - 01 Jan 2010 18:43:00 GMT - <a href="http://www.indeed.com/job/Senior-Unix-Systems-Administrator-at-Consilium1-in-Atlanta,-GA-6a9e101777d4fc51">save job, email, more...</a>
****************************************************************************************************
You can then look for those keywords in item["summary"] (or see the feedparser documentation for the other fields).
Well, my reason for not going that route was lack of knowledge. I know a bit about Bash but nothing about Python. It does seem like you can do what I am trying to do in a lot less code, though.
I have been trying to improve my Bash skills lately, so I think I will stick to one language at a time. I will most likely reimplement this script in another language later on. I definitely want to do it in PHP for live output, but that will be later down the road when I am learning PHP.
Got the wget thing sorted. Even though it says it failed, it still downloads; kinda weird, but whatever.
This is what I got so far
Code:
#!/bin/bash
#
#-------------------------------------------------------------#
# This script is designed to take RSS feeds and process them  |
# against a list of keywords. The results of the processed    |
# RSS feeds will then be output in a human-readable format.   |
#-------------------------------------------------------------#
# Author: Jared Bloomer                                        |
# Email: jared@tuxknowledge.com                                |
# Website: http://www.tuxknowledge.com                         |
#-------------------------------------------------------------#
#
# TODO LIST
# [X] Define global variables
# [X] Create RSSFeeds and keywords files
# [X] Create a tmp directory for the script to work in
# [X] Collect the results of each link contained in each RSS feed
# [ ] Run a keyword search on the results of each link found on each RSS feed
# [ ] Determine the percentage of keywords found in each search
# [ ] Write the links of results containing 75% or more of the keywords to a file
# [ ] Format the final results file in a human-readable format
# [X] Remove tmp directory and its contents
# [ ] Call all functions to process everything in the correct order
#
# Define Global Variables
RSSFile=./files/RSSFeeds  # This is a list of all of the RSS feeds to be processed
Keywords=./files/keywords # This is the list of keywords used to process the RSS feeds
Time=$(date "+%T")        # This is to be used for file names and time stamping
TotalKeywords=$(wc -w < "$Keywords") # Determine total keywords listed in ./files/keywords
FinalPercentage=0         # Set initial percentage value to 0
KeywordsFound=0           # Set initial value of keywords found to 0

# Create tmp directory to work in
func_createtmp ()
{
    mkdir ./tmp
}

# Remove tmp directory
func_removetmp ()
{
    rm -r ./tmp
}

# Collect all of the HTML files from each URL in the RSSFeeds file
func_collectrss ()
{
    while read -r line; do           # Read each line of the RSSFile individually
        wget "$line" -O file.html    # Download the RSS feed from that line of the file
        mv file.html ./tmp/file.html # Move the downloaded RSS feed to the tmp directory
        url=$(grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' ./tmp/file.html) # Extract the URLs from the downloaded RSS feed
        printf "%s\n" $url > tmp/url.dat # $url unquoted on purpose: word splitting puts one URL per line
        while read -r link; do
            wget "$link" -O link.html    # Download the links in the tmp/url.dat file
            mv link.html ./tmp/link.html # Move the downloaded link to the tmp directory
        done < ./tmp/url.dat
    done < "$RSSFile"
}

func_keywordssearch ()
{
    while read -r line; do
        : # TODO: Figure out how to grep for the keywords listed in the keywords file
    done < ./tmp/link.html
}

func_createtmp
func_collectrss
func_keywordssearch
#func_removetmp
1.) I still need to figure out why I am getting partial URLs when I run the URL extraction.
2.) In the func_keywordssearch function I have to figure out how to search each HTML file for the keywords listed in the ./files/keywords file. I was thinking of loading all the keywords from ./files/keywords into an array and then searching ./tmp/link.html against the array values; I just don't know how to do that.
Help please! Thanks to everyone who has helped so far.
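For question 1, a likely culprit (worth verifying) is the regex itself: grep -E speaks POSIX ERE, which treats Perl-style shorthands like \w, \s, and (?:...) literally, so the pattern stops matching partway through a URL. A minimal sketch of two alternatives, assuming GNU grep:
Code:
# Option 1: run the existing Perl-style pattern as actual PCRE (GNU grep's -P switch):
url=$(grep -oP '\b(([\w-]+://?|www[.])[^\s()<>]+)' ./tmp/file.html)
# Option 2: a plain POSIX ERE that grabs http/https URLs up to the next quote, bracket, or space:
url=$(grep -oE 'https?://[^"<> ]+' ./tmp/file.html)
The second form is usually enough for an RSS file, where the links appear as quoted attribute values or element text.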
Following up on question 2: I now have func_readkeywords, which reads all the values of ./files/keywords into a global array named Keywords. I just don't know how to perform an intelligent search to see whether the words are contained in an HTML file, while making sure the keyword is not part of an HTML tag. If a word is found, I need to increment a variable called KeywordsFound by 1. The problem is that grep reports every occurrence rather than just the fact that it was found; I do not want to increment KeywordsFound every time a keyword appears, only once per keyword, and only if it is found at all. Hope that isn't too confusing.
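Here is a minimal sketch of that counting logic, assuming one keyword per line in ./files/keywords and that the page has first been rendered to plain text (lynx -dump is one way, which also avoids matching inside HTML tags; ./tmp/link stands in for that dump). grep -q only reports whether a match exists, so each keyword bumps the counter at most once:
Code:
KeywordsFound=0
while read -r kw; do
    [ -z "$kw" ] && continue                 # skip blank lines in the keywords file
    if grep -qiw "$kw" ./tmp/link; then      # -q: found or not; -i: any case; -w: whole words only
        KeywordsFound=$((KeywordsFound + 1)) # increment once per keyword, however often it appears
    fi
done < ./files/keywords
Feeding the loop with a redirection instead of cat file | while also matters here: a pipeline would run the loop in a subshell, and the final KeywordsFound value would be lost.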
Well, I think I figured it out by using lynx. Now I just have to implement the keyword search part, output the results to a file, format it to be human readable, and then sort out what type of delivery method I want to use.
Here is what I have so far
Code:
#!/bin/bash
#
#-------------------------------------------------------------#
# This script is designed to take RSS feeds and process them  |
# against a list of keywords. The results of the processed    |
# RSS feeds will then be output in a human-readable format.   |
#-------------------------------------------------------------#
# Author: Jared Bloomer                                        |
# Email: jared@tuxknowledge.com                                |
# Website: http://www.tuxknowledge.com                         |
#-------------------------------------------------------------#
#
# This script calls on external packages in order to perform. Please ensure
# the following packages are installed before executing this script.
#
# REQUIREMENTS
# lynx
#
# TODO LIST
# [X] Define global variables
# [X] Create RSSFeeds and keywords files
# [X] Create a tmp directory for the script to work in
# [X] Collect the results of each link contained in each RSS feed
# [ ] Run a keyword search on the results of each link found on each RSS feed
# [ ] Determine the percentage of keywords found in each search
# [ ] Write the links of results containing 75% or more of the keywords to a file
# [ ] Format the final results file in a human-readable format
# [X] Remove tmp directory and its contents
# [ ] Call all functions to process everything in the correct order
#
# Define Global Variables
RSSFile=./files/RSSFeeds  # This is a list of all of the RSS feeds to be processed
Keywords=./files/keywords # This is the list of keywords used to process the RSS feeds
Time=$(date "+%T")        # This is to be used for file names and time stamping
TotalKeywords=$(wc -w < "$Keywords") # Determine total keywords listed in ./files/keywords
FinalPercentage=0         # Set initial percentage value to 0
KeywordsFound=0           # Set initial value of keywords found to 0
declare -a KeywordList=() # Array that will hold the keywords (named so it does not clobber the $Keywords path above)

# Create tmp directory to work in
func_createtmp ()
{
    mkdir ./tmp
}

# Remove tmp directory
func_removetmp ()
{
    rm -r ./tmp
}

# Collect all of the HTML files from each URL in the RSSFeeds file
func_collectrss ()
{
    while read -r line; do           # Read each line of the RSSFile individually
        wget "$line" -O file.html    # Download the RSS feed from that line of the file
        mv file.html ./tmp/file.html # Move the downloaded RSS feed to the tmp directory
        # url=$(grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' ./tmp/file.html) # Extract the URLs from the downloaded RSS feed
        url=$(lynx -dump ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))')
        printf "%s\n" $url > tmp/url.dat # $url unquoted on purpose: word splitting puts one URL per line
        while read -r link; do
            wget "$link" -O link.html    # Download the links in the tmp/url.dat file
            mv link.html ./tmp/link.html # Move the downloaded link to the tmp directory
            func_keywordssearch
        done < ./tmp/url.dat
    done < "$RSSFile"
}

func_keywordssearch ()
{
    lynx -dump ./tmp/link.html >> ./tmp/link # Render the page to plain text for keyword matching
}

func_readkeywords ()
{
    while read -r txt; do
        KeywordList[${#KeywordList[@]}]=$txt
    done < ./files/keywords
}

func_createtmp
func_collectrss
func_readkeywords
func_removetmp
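For the two unchecked percentage items on the TODO list, bash integer arithmetic should be enough once KeywordsFound and TotalKeywords are filled in. A minimal sketch; the 75% cutoff comes from the TODO comments, and $CurrentURL is a hypothetical variable holding whatever link is being scored:
Code:
FinalPercentage=$((KeywordsFound * 100 / TotalKeywords)) # integer percentage of keywords matched
if [ "$FinalPercentage" -ge 75 ]; then
    echo "$CurrentURL" >> "./files/results_$Time"        # keep links matching 75% or more of the keywords ($CurrentURL is hypothetical)
fi
The results_$Time name reuses the Time variable defined at the top of the script, so each run writes to its own file.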