Old 01-01-2010, 05:02 PM   #16
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1297

Then there must be something different between what you did and what I did. Here's what I did:

First get the file (RSS feed)
Code:
wget http://kde-apps.org/content/show.php/GSW-GamStopWatch?content=117722
Then I run this against the file I get:
Code:
./sedscript GSW-GamStopWatch\?content\=117722 | grep http://
and I get this as result:
Code:
http://kde-apps.org                                                            
http://gtk-apps.org                                                            
http://cli-apps.org                                                            
http://qt-apps.org                                                             
http://qt-prop.org                                                             
http://maemo-apps.org                                                          
http://java-apps.org                                                           
http://eyeos-apps.org                                                          
http://wine-apps.org                                                           
http://server-apps.org                                                         
http://kde-look.org                                                            
http://gnome-look.org                                                          
http://xfce-look.org                                                           
http://box-look.org                                                            
http://e17-stuff.org                                                           
http://beryl-themes.org                                                        
http://compiz-themes.org                                                       
http://ede-look.org                                                            
http://debian-art.org                                                          
http://gentoo-art.org                                                          
http://suse-art.org                                                            
http://ubuntu-art.org                                                          
http://kubuntu-art.org                                                         
http://linuxmint-art.org                                                       
http://arch-stuff.org                                                          
http://frugalware-art.org                                                      
http://kde-files.org                                                           
http://opentemplate.org                                                        
http://gimpstuff.org                                                           
http://inkscapestuff.org                                                       
http://scribusstuff.org                                                        
http://blenderstuff.org                                                        
http://kde-help.org                                                            
http://gnome-help.org                                                          
http://xfce-help.org
http://Open-PC.com
http://opendesktop.org
http://opendesktop.org
http://www.fishing-penguins.de
http://www.fishing-penguins.de
http://kde-look.org
http://KDE-Look.org/content/show.php/Gettin+inside%3F?content=117898
http://KDE-Look.org/content/show.php/My+Clean+Desktop?content=117899
http://KDE-Look.org/content/show.php/Rai-qt?content=112093
http://KDE-Look.org/content/show.php/Storm+Watcher?content=117901
http://KDE-Look.org/content/show.php/Fedora+Microbutton?content=117902
http://KDE-Look.org/content/show.php/Fedora+13+Rocket+Wallpaper?content=117904
http://kde-look.org
http://www.xfce-look.org
http://www.konqueror.org
http://www.kde-look.org
http://www.kde-apps.org
http://www.gnome-look.org
http://userbase.kde.org/
http://scribusstuff.org
http://www.qt-apps.org
http://planetkde.org
http://www.inkscapestuff.org
http://dot.kde.org
http://del.icio.us/post?url=http%3A%2F%2Fkde-apps.org%2Fcontent%2Fshow.php%2FGSW-GamStopWatch%3Fcontent%3D117722&title=GSW-GamStopWatch
http://www.digg.com/submit?phase=2&url=http%3A%2F%2Fkde-apps.org%2Fcontent%2Fshow.php%2FGSW-GamStopWatch%3Fcontent%3D117722&title=GSW-GamStopWatch
http://slashdot.org/bookmark.pl?url=http%3A%2F%2Fkde-apps.org%2Fcontent%2Fshow.php%2FGSW-GamStopWatch%3Fcontent%3D117722&title=GSW-GamStopWatch
http://www.fsf.org/licenses/gpl.html
http://openDesktop.org
http://hive01.com/advertising
http://www.cafepress.com/opendesktop
http://www.opendesktop.spreadshirt.net
http://opendesktop.org/rss/opendesktop-events.rss
http://www.kde.org/dot/kde-apps-content.rdf
http://apps.facebook.com/opendesktop
http://twitter.com/opendesktop
http://identi.ca/opendesktop
http://blog.karlitschek.de
http://twitter.com/fkarlitschek
http://www.alles-iphone.de/index.php?xcontentmode=9102
http://www.casinoboni.net/
http://apps.facebook.com/opendesktop
So there has to be a difference. Please look at the contents of the file you get from the wget command.

Kind regards,

Eric
 
Old 01-01-2010, 06:15 PM   #17
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
So this is what I have going on with reading in the RSSFeeds file.

Code:
func_collectrss ()
{
cat $RSSFile | while read line; do # Read each line of the RSSFile individually
    wget $line -O file.html # Download the RSS Feed from that line of the file
	mv file.html ./tmp/file.html # Move the downloaded RSS Feed to the tmp directory.
	url=`cat ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'` # Extract the URLs from the downloaded RSS Feed
	printf "%s\n" $url > tmp/url.dat
	
done
}
This gives me the following output
Code:
www.geor
www.indeed.com/q
www.indeed.com/image
www.indeed.com/     
www.indeed.com/rc/clk?jk=580f08b272eb9e30&amp;from=r
www.indeed.com/job/Senior-Solari                    
www.indeed.com/rc/clk?jk=8d8e9d083bfaf53a&amp;from=r
www.indeed.com/viewjob?t=SMS+Sy                     
www.indeed.com/rc/clk?jk=6a9e101777d4fc51&amp;from=r
www.indeed.com/job/Senior-Unix-Sy
www.indeed.com/rc/clk?jk=241cffcb34894a23&amp;from=r
www.indeed.com/job/OS390-and-LINUX-Sy
www.indeed.com/rc/clk?jk=3c508de6ef18090e&amp;from=r
www.indeed.com/viewjob?t=Maximo+Sy
www.indeed.com/rc/clk?jk=2e9dc7157570bc11&amp;from=r
www.indeed.com/job/IT-Admini
www.indeed.com/rc/clk?jk=5ff6a7bfb160b60d&amp;from=r
www.indeed.com/viewjob?t=Senior+Network+Admini
www.indeed.com/rc/clk?jk=2787d9b0570b1a09&amp;from=r
www.indeed.com/job/Sy
www.indeed.com/rc/clk?jk=e011407448aeb34d&amp;from=r
www.indeed.com/viewjob?t=Network+Admini
www.indeed.com/rc/clk?jk=287778d4c3891714&amp;from=r
www.indeed.com/job/Office-Admini
www.indeed.com/rc/clk?jk=e98eeb3906902c4b&amp;from=r
www.indeed.com/job/Senior-Linux-Sy
www.indeed.com/rc/clk?jk=91f4d589ec8066a4&amp;from=r
www.indeed.com/viewjob?t=Sy
www.indeed.com/rc/clk?jk=efdec3dc366db863&amp;from=r
www.indeed.com/job/Advanced-SAN-Admini
www.indeed.com/rc/clk?jk=a0489ea9373f3b03&amp;from=r
www.indeed.com/job/VMS-Sy
www.indeed.com/rc/clk?jk=4d612d7629170766&amp;from=r
www.indeed.com/job/JDE-CNC-Admini
www.indeed.com/rc/clk?jk=a1479bf776c03191&amp;from=r
www.indeed.com/viewjob?t=JDE+CNC+Admini
www.indeed.com/rc/clk?jk=f9bbde7e70b171cb&amp;from=r
www.indeed.com/job/Window
www.indeed.com/rc/clk?jk=ede18d071578a8c4&amp;from=r
www.indeed.com/viewjob?t=SR+SYSTEMS+ENGINEER
www.indeed.com/rc/clk?jk=e286fb43f674b06f&amp;from=r
www.indeed.com/job/IT-Admini
www.indeed.com/rc/clk?jk=df1af9e78442ca62&amp;from=r
www.indeed.com/viewjob?t=WebLogic+Admini
Something is off, because I am getting partial URLs from the HTML file that wget downloads from the RSS feed. This is the RSS feed link I am using for testing: http://rss.indeed.com/rss?q=systems+...ator&l=atlanta

Any thoughts?
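A likely cause of the truncation: that regex is written in Perl (PCRE) syntax, but grep -E speaks POSIX ERE. Inside a bracket expression like [^\s()<>], \s is not a whitespace shorthand; it matches a literal backslash or a literal s, so every match gets cut just before the first s (note how from=rss becomes from=r and Systems becomes Sy in the output above). The (?:...) group is Perl-only as well. A minimal sketch of two alternatives, assuming GNU grep (the -P variant only works if grep was built with PCRE support):
Code:
# Option 1: plain POSIX ERE, using [[:space:]] instead of \s
grep -oE '(https?://|www\.)[^[:space:]()<>"]+' ./tmp/file.html

# Option 2: keep the Perl-style classes and switch grep into PCRE mode
grep -oP '\b(?:[\w-]+://|www\.)[^\s()<>"]+' ./tmp/file.html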
 
Old 01-01-2010, 06:22 PM   #18
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Code:
jared@debian:~/Documents/Scripts/jobs/files$ cat RSSFeeds 
http://rss.indeed.com/rss?q=systems+administrator&l=atlanta
jared@debian:~/Documents/Scripts/jobs/files$ wget http://rss.indeed.com/rss?q=systems+administrator&l=atlanta                                                             
[1] 5850                                                                             
jared@debian:~/Documents/Scripts/jobs/files$ --2010-01-01 19:18:30--  http://rss.indeed.com/rss?q=systems+administrator                                                   
Resolving rss.indeed.com... 208.43.117.136                                           
Connecting to rss.indeed.com|208.43.117.136|:80... connected.                        
HTTP request sent, awaiting response... 200 OK                                       
Length: unspecified [text/xml]                                                       
Saving to: “rss?q=systems+administrator”                                             

    [ <=>                                        ] 20,012      --.-K/s   in 0.1s    

2010-01-01 19:18:31 (153 KB/s) - “rss?q=systems+administrator” saved [20012]


[1]+  Done                    wget http://rss.indeed.com/rss?q=systems+administrator
jared@debian:~/Documents/Scripts/jobs/files$ ls                                     
keywords  RSSFeeds  rss?q=systems+administrator  sedscript  sedscript~              
jared@debian:~/Documents/Scripts/jobs/files$ ./sedscript rss\?q\=systems+administrator | grep http://                                                                     
jared@debian:~/Documents/Scripts/jobs/files$ cat sedscript                           
#! /bin/sed -nf                                                                      

# Join lines if we have tags that span multiple lines
:join
/<[^>]*$/ { N; s/[      *]\n[   *]/ /; b join; }

# Do some selection to speed the thing up
/<[     ]*\([aA]\|[iI][mM][gG]\)/!b

# Remove extra spaces before/after the tag name, change img/area to a
s/<[    ]*\([aA]\|[iI][mM][gG]|[aA][rR][eE][aA]\)[      ]\+/<a /g

# To simplify the regexps that follow, change href/alt to lowercase
# and replace whitespace before them with a single space
s/<a\([^>]*\)[  ][hH][rR][eE][fF]=/<a\1 href=/g
s/<a\([^>]*\)[  ][aA][lL][tT]=/<a\1 alt=/g

# To simplify the regexps that follow, quote the arguments to href and alt
s/href=\([^"    >]\+\)/href="\1"/g
s/alt=\([^"     >]\+\)/alt="\1"/g

# Move the alt tag after href, remove attributes between them
s/\( alt="[^"]*"\)[^>]*\( href="[^"]*"\)/\2\1/g

# Remove attributes between <a and href
s/<a[^>]* href="/<a href="/g

# Change href="xxx" ... alt="yyy" to href="xxx|yyy"
s/\(<a href="[^"]*\)"[^>]* alt="\([^"]*"\)/\1|\2/g

t loop

# Print an URL, remove it, and loop
:loop
h
s/.*<a href="\([^"]*\)".*$/\1/p
g
s/\(.*\)<a href="\([^"]*\)".*$/\1/
t loop
jared@debian:~/Documents/Scripts/jobs/files$
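Two things stand out in this transcript. First, the unquoted & in the wget command made the shell background the job (the [1] 5850 line) and drop everything after it, which is why only rss?q=systems+administrator was fetched; quoting the URL avoids that. Second, the sedscript extracts <a href=...> and <img> tags, but what wget saved here is RSS XML: the feed's URLs live in <link> elements, and the <a> tags inside the item summaries are entity-encoded (&lt;a href=...&gt;), so grep http:// finds nothing. Eric's earlier test fetched a regular HTML page, which would explain why the same script worked for him. A minimal sketch, assuming GNU grep/sed and that each <link> element fits on a single line:
Code:
# Quote the URL so the shell does not treat & as "run in background"
wget 'http://rss.indeed.com/rss?q=systems+administrator&l=atlanta' -O feed.xml

# Pull the <link> elements straight out of the XML
grep -o '<link>[^<]*</link>' feed.xml | sed -e 's|<link>||' -e 's|</link>||'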
 
Old 01-01-2010, 08:16 PM   #19
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Why don't you make use of an RSS feed library? Here's a partial example using Python and the feedparser library:
Code:
import feedparser
import os
import sys
root="/home"
keyfile=os.path.join(root,"path1","keywords") #where the keywords file is stored
rssfile=os.path.join(root,"path1","rssfile") #where the rss file that contains all your rss links is stored
if not os.path.exists(keyfile) or not os.path.exists(rssfile):
    print "No keywords or rss links file"
    sys.exit()

# store all keywords to a list for later use
keywords = open(keyfile).read().split()

# read  the rssfile for rss links
for rsslink in open(rssfile):
    rsslink=rsslink.strip()
    # pass the link to the feeder
    feed = feedparser.parse( rsslink )
    for item in feed["items"]:
        print "title:",item["title"]
        print "url: ",item["link"]
        print "summary: ",item["summary"]
        print "*" * 100
        # check for keywords etc etc..
partial output when run
Code:
$ ./python.py|more

title: Senior Solaris Systems Administrator -  Atlanta, GA
url:  http://www.indeed.com/rc/clk?jk=580f08b272eb9e30&from=rss&qd=RnZhMybXSk4M3QtTVGXWoVUpPKQ-Ar2L74KrkUB91D73Oal7uRnyiLEYdTIyc2C-Y0nnJsYLYzaI9wfwmiLV62FaWC
OBmwPWHXOVGjbhxi0&rd=                                                                                                                                        
summary:  for an experienced Sun Microsystems Solaris Systems Administrator to join a project for one of our clients... to do documentation on systems       

Business casual work... <br />
                        From ComputerJobs.com - 31 Dec 2009 13:23:00 GMT - <a href="http://www.indeed.com/job/Senior-Solaris-Systems-Administrator-in-Atlanta
,-GA-580f08b272eb9e30">save job, email, more...</a>
****************************************************************************************************
title: SMS Systems Administrator - CCCi - Atlanta -  Atlanta, GA
url:  http://www.indeed.com/rc/clk?jk=8d8e9d083bfaf53a&from=rss&qd=RnZhMybXSk4M3QtTVGXWoVUpPKQ-Ar2L74KrkUB91D73Oal7uRnyiLEYdTIyc2C-Y0nnJsYLYzaI9wfwmiLV62FaWC
OBmwPWHXOVGjbhxi0&rd=
summary:  to-hire opportunity for a Microsoft SMS Systems Administrator for our customer in the North Atlanta area... possess the necessary systems experienc
e required to... <br />
                        From Engineering Central - 31 Dec 2009 22:47:02 GMT - <a href="http://www.indeed.com/viewjob?t=SMS+Systems+Administrator&c=CCCi+-+Atl
anta&l=Atlanta,+GA&jk=8d8e9d083bfaf53a">save job, email, more...</a>
****************************************************************************************************
title: Senior Unix Systems Administrator - Consilium1 -  Atlanta, GA
url:  http://www.indeed.com/rc/clk?jk=6a9e101777d4fc51&from=rss&qd=RnZhMybXSk4M3QtTVGXWoVUpPKQ-Ar2L74KrkUB91D73Oal7uRnyiLEYdTIyc2C-Y0nnJsYLYzaI9wfwmiLV62FaWC
OBmwPWHXOVGjbhxi0&rd=
summary:  for a Senior Systems Administrator to join our team... Produce design for the IT systems specifying the operations the system will perform and the
way the data... <br />
                        From Great IT Jobs - 01 Jan 2010 18:43:00 GMT - <a href="http://www.indeed.com/job/Senior-Unix-Systems-Administrator-at-Consilium1-in
-Atlanta,-GA-6a9e101777d4fc51">save job, email, more...</a>
****************************************************************************************************
You can then search for those keywords in item["summary"] (or see the feedparser documentation for other fields).

Last edited by ghostdog74; 01-01-2010 at 08:20 PM.
 
Old 01-01-2010, 08:35 PM   #20
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Well, my reason for not going that route was lack of knowledge. I know a bit about Bash but nothing about Python. It seems you are able to do what I am trying to do in a lot less code, though.
 
Old 01-01-2010, 08:49 PM   #21
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by worm5252 View Post
I know a bit about bash but know nothing about python.
Last time, you knew nothing about Bash too, right? It's just a matter of taking the first step to learn new stuff.
Quote:
Seems like you are able to do what I am trying to do in a lot less code though.
That's right. The bulk of "parsing" the feeds is done by the library. It's the equivalent of all those "messy" regexes you have.

Of course, if you just want to learn about parsing text with regexps... by all means, do what you have to do.
 
Old 01-01-2010, 09:15 PM   #22
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Well, I have been trying to improve my Bash skills lately, so I think I will stick to doing one language at a time. I will most likely reimplement this script in another language later on. I definitely want to do it in PHP for live output, but that will be later down the road when I am learning PHP.
 
Old 01-02-2010, 11:06 AM   #23
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Hmm, odd thing happening now. Any time I run wget (in the script or not) with any URL, the download fails, but I have connectivity.
 
Old 01-02-2010, 11:36 AM   #24
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Got the wget thing sorted. Even though it says it failed, it still downloads; kind of weird, but whatever.

This is what I got so far
Code:
#!/bin/bash
#
#-------------------------------------------------------------#
# This Script is designed to take RSS feeds and process them  |
# against a list of key words. The results of the Processed   |
# RSS feeds will then be output in a human-readable format    |
#-------------------------------------------------------------#
# Author: Jared Bloomer                                       |
# Email: jared@tuxknowledge.com                               |
# Website: http://www.tuxknowledge.com                        |
#-------------------------------------------------------------#
#
# TODO LIST
# [X] Define Global Variables
# [X] Create RSSFeeds and keywords files
# [X] Create a tmp directory for the script to work in
# [X] Collect the results of each link contained in each RSS Feed
# [ ] Run a keyword search on the results of each link found on each RSS Feed
# [ ] Determine the percentage of keywords found in each search
# [ ] Write the links of results containing 75% or more of the keywords to a file
# [ ] Format the final results file in a human-readable format. 
# [X] Remove tmp directory and its contents
# [ ] Call all functions to process everything in the correct order
#
# Define Global Variables

RSSFile=./files/RSSFeeds #This is a list of all of the RSS Feeds to be Processed
Keywords=./files/keywords #This is the list of keywords used to process the RSS feeds
Time=`date "+%T"` #This is to be used for File Names and Time Stamping
TotalKeywords=`wc -w files/keywords | gawk  '{print   $1}'` # Determine total keywords listed in ./files/keywords
FinalPercentage=0 # Set initial Percentage value to 0
KeywordsFound=0 # Set initial value of Keywords found to 0


# Create tmp directory to work in
func_createtmp ()
{
mkdir ./tmp
}

# Remove tmp Directory
func_removetmp ()
{
rm -R ./tmp
}

# Collect all of the HTML Files from each URL in the RSSFeeds File
func_collectrss ()
{
cat $RSSFile | while read line; do # Read each line of the RSSFile individually
	wget $line -O file.html # Download the RSS Feed from that line of the file
	mv file.html ./tmp/file.html # Move the downloaded RSS Feed to the tmp directory.
	url=`cat ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'` # Extract the URLs from the downloaded RSS Feed
	printf "%s\n" $url > tmp/url.dat
	cat ./tmp/url.dat | while read line; do
		wget $line -O link.html # Download the links in the tmp/url.dat file
		mv link.html ./tmp/link.html # Move the downloaded link to the tmp directory.	
	done
done
}

func_keywordssearch ()
{
cat ./tmp/link.html | while read line; do
	#Figure out how to grep for the keywords listed in the keywords file
done
}

func_createtmp
func_collectrss
func_keywordssearch
#func_removetmp
1.) I still need to figure out why I am getting partial URLs when I run
Code:
url=`cat ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'`
2.) In the func_keywordssearch function I have to figure out how to search each HTML file for the keywords listed in the ./files/keywords file. I was thinking of loading all the keywords in ./files/keywords into an array and then searching ./tmp/link.html against the array values. I just don't know how to do that.

Help Please! Thanks to everyone who has helped so far.
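One thing to watch out for before wiring in the search: a loop whose body contains only a comment is a syntax error in bash, so func_keywordssearch as posted will stop the whole script from parsing. A no-op colon keeps it valid until the real logic is written; this is just a placeholder sketch:
Code:
func_keywordssearch ()
{
cat ./tmp/link.html | while read line; do
	: # TODO: grep for the keywords listed in the keywords file
done
}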
 
Old 01-02-2010, 02:31 PM   #25
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
I don't get it. I am reading the keywords file into an array, but when I display the value of the array I am not getting the keywords.

This is the contents of the keywords file being read
Code:
monitoring
support
troubleshoot
replace
This is the function reading the file into an array and displaying value 1 (which should be "support").
Code:
func_readkeywords ()
{
cat ./files/keywords | while read line; do
	Word=$line
	for (( i=0;i<$TotalKeywords;i++)); do
		Keywords[$i]=$Word
	done
#echo $Keywords[1]
#echo $Keywords[3]
echo ${#Keywords[1]}
echo "Total Keywords: " $TotalKeywords
done
}
However, when this function is run, this is the output I am getting:
Code:
10
Total Keywords:  4
7
Total Keywords:  4
12
Total Keywords:  4
7
Total Keywords:  4
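Those numbers are actually the lengths of the keywords: monitoring is 10 characters, support 7, troubleshoot 12, replace 7. ${#Keywords[1]} expands to the length of element 1, not its value, and since the inner for-loop assigns the current line's word to every index, element 1 always holds whatever word was just read. Note also that $Keywords[1] (without braces) expands element 0 followed by a literal [1], and that cat | while runs the loop in a subshell, so the array does not survive past the pipeline. For reference:
Code:
echo ${Keywords[1]}    # value of element 1
echo ${#Keywords[1]}   # length of element 1 (this is what printed 10, 7, 12, 7)
echo ${#Keywords[@]}   # number of elements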
 
Old 01-02-2010, 02:44 PM   #26
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Scratch that last post. I revamped my way of thinking and did some new searches on Google for other references. I came up with a working solution.
Code:
func_readkeywords ()
{
while read txt ; do
	Keywords[${#Keywords[@]}]=$txt
done < ./files/keywords
}
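This works because done < ./files/keywords feeds the loop without a pipeline, so while runs in the current shell and the array assignments persist, and Keywords[${#Keywords[@]}]=$txt appends one element per line. On bash 4 or later the same thing can be done in one line, as a sketch:
Code:
# bash 4+ equivalent of the loop above
mapfile -t Keywords < ./files/keywords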
 
Old 01-02-2010, 03:16 PM   #27
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
OK folks, I know that with all my posts it is a bit confusing what I need help with. Here is my current status:

1.) I still need to figure out why I am getting partial URLs when I run
Code:
url=`cat ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'`
2.) In the func_keywordssearch function I have to figure out how to search each HTML file for the keywords listed in the ./files/keywords file. I was thinking of loading all the keywords in ./files/keywords into an array and then searching ./tmp/link.html against the array values; I just don't know how to do that. I have func_readkeywords, which reads all the values of ./files/keywords into a global array named Keywords. I just don't know how to perform an intelligent search to see whether the words are contained in an HTML file, while making sure the keyword is not part of an HTML tag. If a word is found, I need to increment a variable called KeywordsFound by 1. The problem is that if I do a grep or something, it will show every occurrence rather than just that the word was found. I do not want to increment KeywordsFound every time grep finds the keyword it is searching for; I want to increment it only once per keyword, and only if that keyword is found. Hope that isn't too confusing.
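A minimal sketch for point 2, assuming the Keywords array has been filled by func_readkeywords and lynx is installed: grep -q exits successfully on the first match and prints nothing, so each keyword can increment the counter at most once; -i ignores case, -w matches whole words only; and searching the lynx -dump text instead of the raw HTML means the tags have already been stripped, so a keyword cannot match inside markup:
Code:
func_keywordssearch ()
{
lynx -dump ./tmp/link.html > ./tmp/link # plain text, tags stripped
KeywordsFound=0
for word in "${Keywords[@]}"; do
	if grep -qiw -- "$word" ./tmp/link; then # found at all: count it once
		KeywordsFound=$((KeywordsFound + 1))
	fi
done
FinalPercentage=$((KeywordsFound * 100 / TotalKeywords))
}
A result can then be kept whenever [ $FinalPercentage -ge 75 ].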
 
Old 01-02-2010, 05:28 PM   #28
GooseYArd
Member
 
Registered: Jul 2009
Location: Reston, VA
Distribution: Slackware, Ubuntu, RHEL
Posts: 183

Rep: Reputation: 46
Now would be an excellent time to read the O'Reilly llama book on Perl.
 
Old 01-02-2010, 05:59 PM   #29
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Well, I think I figured it out by using lynx. Now I just have to implement the keyword search part, output the results to a file, format that file to be human readable, and then sort out what type of delivery method I want to use.

Here is what I have so far
Code:
#!/bin/bash
#
#-------------------------------------------------------------#
# This Script is designed to take RSS feeds and process them  |
# against a list of key words. The results of the Processed   |
# RSS feeds will then be output in a human-readable format    |
#-------------------------------------------------------------#
# Author: Jared Bloomer                                       |
# Email: jared@tuxknowledge.com                               |
# Website: http://www.tuxknowledge.com                        |
#-------------------------------------------------------------#
#
# This script calls on external packages in order to work. Please ensure
# the following packages are installed before executing this script. 
#
# REQUIREMENTS
# lynx
#
#
#
# TODO LIST
# [X] Define Global Variables
# [X] Create RSSFeeds and keywords files
# [X] Create a tmp directory for the script to work in
# [X] Collect the results of each link contained in each RSS Feed
# [ ] Run a keyword search on the results of each link found on each RSS Feed
# [ ] Determine the percentage of keywords found in each search
# [ ] Write the links of results containing 75% or more of the keywords to a file
# [ ] Format the final results file in a human-readable format. 
# [X] Remove tmp directory and its contents
# [ ] Call all functions to process everything in the correct order
#
# Define Global Variables

RSSFile=./files/RSSFeeds #This is a list of all of the RSS Feeds to be Processed
Keywords=./files/keywords #This is the list of keywords used to process the RSS feeds
Time=`date "+%T"` #This is to be used for File Names and Time Stamping
TotalKeywords=`wc -w files/keywords | gawk  '{print   $1}'` # Determine total keywords listed in ./files/keywords
FinalPercentage=0 # Set initial Percentage value to 0
KeywordsFound=0 # Set initial value of Keywords found to 0
declare -a Keywords=( )


# Create tmp directory to work in
func_createtmp ()
{
mkdir ./tmp
}

# Remove tmp Directory
func_removetmp ()
{
rm -R ./tmp
}

# Collect all of the HTML Files from each URL in the RSSFeeds File
func_collectrss ()
{
cat $RSSFile | while read line; do # Read each line of the RSSFile individually
	wget $line -O file.html # Download the RSS Feed from that line of the file
	mv file.html ./tmp/file.html # Move the downloaded RSS Feed to the tmp directory.
#	url=`cat ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'` # Extract the URLs from the downloaded RSS Feed
	url=`lynx -dump ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'`
	printf "%s\n" $url > tmp/url.dat
	cat ./tmp/url.dat | while read line; do
		wget $line -O link.html # Download the links in the tmp/url.dat file
		mv link.html ./tmp/link.html # Move the downloaded link to the tmp directory.	
		func_keywordssearch
	done
done
}

func_keywordssearch ()
{
lynx -dump ./tmp/link.html >> ./tmp/link
	

}

func_readkeywords ()
{
while read txt ; do
	Keywords[${#Keywords[@]}]=$txt
done < ./files/keywords
}

func_createtmp
func_collectrss
func_readkeywords
func_removetmp
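A couple of notes on this version: Keywords is first set to the path ./files/keywords and then immediately redeclared as an empty array, so the path variable is lost (func_readkeywords happens to hardcode the path, which hides the collision); and func_readkeywords is called after func_collectrss, so once func_keywordssearch actually uses the array it will always see it empty. A sketch of the fix, with KeywordFile as a hypothetical rename:
Code:
KeywordFile=./files/keywords # path to the keywords list
declare -a Keywords=()       # the array, no longer clobbering the path

func_createtmp
func_readkeywords            # load the array before anything searches it
func_collectrss
func_removetmp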
 
Old 01-03-2010, 03:26 AM   #30
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1297
Hi,

I'll have to set aside some time to read through your script, but it seems you're getting closer and closer.

Kind regards,

Eric
 
  

