Old 01-01-2010, 01:37 PM   #1
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Rep: Reputation: 57
BASH - I Need a job - Need help with a script


Hey Guys,
I got laid off back in August and have had a hell of a time finding work. I have decided to write a script today to help me find jobs. The problem is, I know right off that I am not knowledgeable enough in BASH scripting to pull it off. So here is the idea.

I have a file with a list of RSS links in it for job searches on various job sites. For example, one of the links might be a search on Monster.com for Systems Administrator. Another might be the same search on Careerbuilders.com.

Along with that, I have a file with a list of keywords or phrases.

Now I want the script to run, process all of the results from the RSS feeds, and search for the keywords in those feeds. Results that match, say, 75% of the keywords are written to a file. The results are then presented in a user-readable format (which I have not determined yet). The format may be a web page on my internal web server, or an HTML email.

I then want to set Cron to run this script several times a day.

The idea is that I can filter out the jobs I know I can't get that would otherwise show up in the search results on the job sites.

Step 1 is to generate the two files with the RSS feeds and the keywords.
Step 2 is to figure out how to read the RSS feeds in a BASH script.
Step 3 is to process the results of each RSS feed against the keywords and write the matches to a file.
Step 4 is to present the final findings file in a readable format, which is to be determined.

So my first question is: how do I read the RSS feeds into my script so I can process them?
 
Old 01-01-2010, 01:51 PM   #2
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1296
Hi,

Sorry to hear that you lost your job. Hope you find something soon.

Looks like an interesting project you're starting. I just had a look at my Akregator (Slackware 13, KDE) RSS feed reader. It has options both to archive and to export to OPML/XML, so that could be one possibility. The archive option can also be configured with a maximum number of articles or an age limit. If you look at other feed readers, you'll probably find one that does exactly what you need.

For the keywords file, is that a file you would generate manually? Or are you getting the keywords from the articles you keep and adding them to the list automatically?

Not sure if it helps, but I'm just throwing out ideas.

Kind regards,

Eric
 
Old 01-01-2010, 02:10 PM   #3
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
The RSS feeds file and the keywords file are both files I would generate manually. That would allow me to target certain keywords and certain feeds. So I guess in the end this project isn't really about finding a job, but about processing RSS feeds against keywords.

I am still working on getting the idea in my head down on paper, so this is all I have coded so far:

Code:
#!/bin/bash
#
#-------------------------------------------------------------#
# This script is designed to take RSS feeds and process them  |
# against a list of keywords. The results of the processed    |
# RSS feeds will then be output in a human-readable format.   |
#-------------------------------------------------------------#
# Author: Jared Bloomer                                       |
# Email: jared@tuxknowledge.com                               |
# Website: http://www.tuxknowledge.com                        |
#-------------------------------------------------------------#
#
# Define Global Variables

RSSFile=./files/RSSFeeds #This is a list of all of the RSS Feeds to be Processed
Keywords=./files/keywords #This is the list of keywords used to process the RSS feeds
Time=`date "+%T"` #This is to be used for File Names and Time Stamping
For testing I have added some keywords to the keywords file and added http://rss.indeed.com/rss?q=systems+...ator&l=atlanta to the RSSFeeds file, so I have one RSS feed and a few keywords for testing. I will change this before I go to production with it. Here are the contents of the keywords file.

Code:
jared@debian:~/Documents/Scripts/jobs$ cat files/keywords
monitoring
support
troubleshoot
replace
I chose four words so I can test later with a percentage of how many keywords are found in each RSS feed.
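That percentage test later on could be simple integer arithmetic. A rough, untested sketch (hits, url and results.txt are just placeholders I haven't coded yet):

Code:
# Sketch: flag a posting when at least 75% of the keywords matched.
percent=$(( hits * 100 / TotalKeywords ))
if [ "$percent" -ge 75 ]; then
    echo "$url" >> results.txt
fi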

CRAP, something I just thought of: I don't want to process the RSS feed itself; I want to process the pages behind the links contained in the feed. The RSS feed just tells me about job postings, but I will have to run the keyword search on each job posting itself, not on the feed. How am I gonna do that?
 
Old 01-01-2010, 02:24 PM   #4
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1296
Quote:
Originally Posted by worm5252
CRAP, something I just thought of: I don't want to process the RSS feed itself; I want to process the pages behind the links contained in the feed. The RSS feed just tells me about job postings, but I will have to run the keyword search on each job posting itself, not on the feed. How am I gonna do that?
Just tried something. From Akregator I copied a link location, and then downloaded it in a console with wget:
Code:
wget http://kde-apps.org/content/show.php/GSW-GamStopWatch?content=117722
and got the complete article as HTML. So that's one way to get an article. Then you can snip out the code, pull out the keywords, and save or lose the text after testing it against your parameters (keywords).
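Something along these lines might work for the whole loop. Just a rough, untested sketch, assuming you've already pulled the article links into a links.txt file and have one keyword per line in keywords:

Code:
#!/bin/bash
# Sketch: fetch each job posting and count how many keywords it contains.
while read -r url; do
    page=$(wget -q -O - "$url")   # download the posting to stdout
    hits=0
    while read -r word; do
        echo "$page" | grep -qi "$word" && hits=$((hits + 1))
    done < keywords
    echo "$url : $hits keyword(s) found"
done < links.txt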

Kind regards,

Eric
 
Old 01-01-2010, 02:47 PM   #5
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
EricTRA, I did the same thing actually. HAHA.

Few things
1.) I am trying to get a numerical value for the total number of lines (keywords) in the keywords file. If I do "wc -l ./files/keywords" then I get "4 files/keywords". How do I strip that down to just the numeric value so I can assign it to a variable?

2.) How do I collect just the HTML links from the RSS feeds so I can process them with wget? I thought about doing a wget on the RSS feed itself, which gives me an HTML file of the feed. I just don't know how I am going to find the links, since every link will be different. I am looking for something dynamic.

3.) How do I tell wget to download files to a specific directory? I am trying to contain everything to a tmp directory, with my RSSFeeds and keywords files located in a files directory. In other words, it is laid out like this:
Code:
Parent Directory
|
|-files
|  |-RSSFeeds
|  |-keywords
|
|-tmp
|
|-script.sh
Anyway, here is my current code. Progress is slow since I am thinking about so many different parts of it:

Code:
#!/bin/bash
#
#-------------------------------------------------------------#
# This script is designed to take RSS feeds and process them  |
# against a list of keywords. The results of the processed    |
# RSS feeds will then be output in a human-readable format.   |
#-------------------------------------------------------------#
# Author: Jared Bloomer                                       |
# Email: jared@tuxknowledge.com                               |
# Website: http://www.tuxknowledge.com                        |
#-------------------------------------------------------------#
#
# TODO LIST
# [X] Define Global Variables
# [X] Create RSSFeeds and keywords files
# [X] Create a tmp directory for the script to work in
# [ ] Collect the results of each link contained in each RSS Feed
# [ ] Run a keyword search on the results of each link found on each RSS Feed
# [ ] Determine the percentage of keywords found in each search
# [ ] Write the links of results containing 75% or more of the keywords to a file
# [ ] Format the final results file in a human readable format. 
# [X] Remove tmp directory and its contents
# [ ] Call all functions to process everything in the correct order
#
# Define Global Variables

RSSFile=./files/RSSFeeds #This is a list of all of the RSS Feeds to be Processed
Keywords=./files/keywords #This is the list of keywords used to process the RSS feeds
Time=`date "+%T"` #This is to be used for File Names and Time Stamping
TotalKeywords=

# Create tmp directory to work in
func_createtmp ()
{
mkdir ./tmp
}

# Remove tmp Directory 
func_removetmp ()
{
rm -R ./tmp
}



func_createtmp
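Side note to myself: instead of a fixed ./tmp, mktemp -d might be safer so nothing gets clobbered if the directory already exists. Untested idea:

Code:
# Sketch: create a unique work directory and clean it up on exit.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT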
 
Old 01-01-2010, 02:56 PM   #6
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Forget my question about getting a numerical value for the total number of lines in the keywords file; I got it sorted using gawk:

Code:
TotalKeywords=`wc -l files/keywords | gawk '{print $1}'`
 
Old 01-01-2010, 02:57 PM   #7
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1296
Hi,

Yeah, when people are thinking about the same thing, chances are they'll end up doing the same stuff.

1. Try:
Code:
cat yourkeywordfile | wc -w
If you have one keyword per line, that gives you just the numeric value, e.g. 4 for four keywords.
Code:
cat yourkeywordfile | wc -l
would give you the same result with only one word on each line, but it is trickier if you have blank lines in your document. The first command (wc -w) counts words, not lines.
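You can also redirect the file into wc instead of naming it as an argument; then the filename never shows up in the output:

Code:
wc -l < yourkeywordfile    # prints just the number, no filename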

2. Have to think about that one and try some stuff to get it clear.

3. wget will download to the directory you're in, so one option is to cd into the directory where you want the files before executing it in your script.
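Alternatively, wget itself can put the download in a given directory with its -P (--directory-prefix) option, for example:

Code:
wget -P ./tmp "$line"    # saves the downloaded file under ./tmp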

Slow and steady is the best way to get things done. But don't try to code the entire thing at once. Take it one step at a time: first define what you want and split it into blocks, then decide how you are going to do each one, try it out, and code it. When all the blocks are constructed, put the whole thing together.

Kind regards,

Eric
 
Old 01-01-2010, 02:59 PM   #8
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1296
Quote:
Originally Posted by worm5252
Forget my question about getting a numerical value for the total number of lines in the keywords file; I got it sorted using gawk:

Code:
TotalKeywords=`wc -l files/keywords | gawk '{print $1}'`
That's another possibility.

Kind regards,

Eric
 
Old 01-01-2010, 03:34 PM   #9
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Something I am going to need for this script that I have never done before is to read the contents of the external files one line at a time and process each line individually.

How can I do that?
 
Old 01-01-2010, 03:37 PM   #10
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1296
With a while loop.

Code:
cat yourfile | while read line
    do
       block of code here
    done
This way, all the commands you put between do and done will be executed for each line, one after the other, until the end of the file.

Kind regards,

Eric
 
Old 01-01-2010, 03:43 PM   #11
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1296
Small notice.

If you're going to work with variables within the code block between do and done, then pay close attention to how you assign them. If a variable changes value inside the code block, that value will be lost after the block terminates.

That's because by 'catting' the file into the pipe, the while loop gets executed in a subshell.

Another and safer way would be to use redirection like this:
Code:
while read line
    do
       block of code here
    done < yourfile
The decision is up to you; both work, but the first needs more attention paid to the variables and their values.
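A quick way to see the difference for yourself (minimal demo; yourfile can be any text file):

Code:
# Piped version: the loop runs in a subshell, so count is lost.
count=0
cat yourfile | while read line; do count=$((count + 1)); done
echo "$count"    # still prints 0

# Redirected version: the loop runs in the current shell.
count=0
while read line; do count=$((count + 1)); done < yourfile
echo "$count"    # prints the number of lines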

Kind regards,

Eric
 
Old 01-01-2010, 04:04 PM   #12
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Thanks Eric. I am slowly working my way through this code.

Right now this is what I have

Code:
func_collectrss ()
{
cat $RSSFile | while read line; do # Read each line of the RSSFile individually
    wget $line -O file.html # Download the RSS feed from that line of the file
    mv file.html ./tmp/file.html
    url=`cat ./tmp/file.html | grep http://*`
    echo $url # Just for testing, to see the output of the previous line
done
}
Right now I am getting undesirable results from url=`cat ./tmp/file.html | grep http://*`. I am afraid each RSS feed will be different because of the HTML coding, and I am getting more than just the URLs. Is there an easier, more efficient way of extracting the URLs from this HTML without getting duplicates?
 
Old 01-01-2010, 04:18 PM   #13
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1296
Hi,

Found this little diamond on the internet:
Code:
#! /bin/sed -nf

# Join lines if we have tags that span multiple lines
:join
/<[^>]*$/ { N; s/[ 	*]\n[ 	*]/ /; b join; }

# Do some selection to speed the thing up
/<[ 	]*\([aA]\|[iI][mM][gG]\)/!b

# Remove extra spaces before/after the tag name, change img/area to a
s/<[ 	]*\([aA]\|[iI][mM][gG]|[aA][rR][eE][aA]\)[ 	]\+/<a /g

# To simplify the regexps that follow, change href/alt to lowercase
# and replace whitespace before them with a single space
s/<a\([^>]*\)[ 	][hH][rR][eE][fF]=/<a\1 href=/g
s/<a\([^>]*\)[ 	][aA][lL][tT]=/<a\1 alt=/g

# To simplify the regexps that follow, quote the arguments to href and alt
s/href=\([^" 	>]\+\)/href="\1"/g
s/alt=\([^" 	>]\+\)/alt="\1"/g

# Move the alt tag after href, remove attributes between them
s/\( alt="[^"]*"\)[^>]*\( href="[^"]*"\)/\2\1/g

# Remove attributes between <a and href
s/<a[^>]* href="/<a href="/g

# Change href="xxx" ... alt="yyy" to href="xxx|yyy"
s/\(<a href="[^"]*\)"[^>]* alt="\([^"]*"\)/\1|\2/g

t loop

# Print an URL, remove it, and loop
:loop
h
s/.*<a href="\([^"]*\)".*$/\1/p
g
s/\(.*\)<a href="\([^"]*\)".*$/\1/
t loop
(http://sed.sourceforge.net/grabbag/s.../list_urls.sed).

Put it in a file, make it executable, and run it against your HTML file. I tried it like this to clean things up a little more:
Code:
./sedscript mydownloadedfile | grep http:
and all I got were clean URLs, without even the HTML tags.

Is that what you need/want/like to have?

Kind regards,

Eric
 
Old 01-01-2010, 04:38 PM   #14
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
When I do that I get no output at all. I tried running the script from the command line, and calling it from the script I am writing. I even ran the sed script with grep as you suggested and redirected it to an output file, which resulted in an empty file. I get an empty file even if I do not use grep.
 
Old 01-01-2010, 04:56 PM   #15
worm5252
Member
 
Registered: Oct 2004
Location: Atlanta
Distribution: CentOS, RHEL, HP-UX, OS X
Posts: 567

Original Poster
Rep: Reputation: 57
Well, I did this, and I am close:
Code:
url=`cat ./tmp/file.html | grep -o -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'`
Here is the output
Code:
www.geor www.indeed.com/q www.indeed.com/image www.indeed.com/ www.indeed.com/rc/clk?jk=580f08b272eb9e30&amp;from=r www.indeed.com/job/Senior-Solari www.indeed.com/rc/clk?jk=8d8e9d083bfaf53a&amp;from=r www.indeed.com/viewjob?t=SMS+Sy www.indeed.com/rc/clk?jk=6a9e101777d4fc51&amp;from=r www.indeed.com/job/Senior-Unix-Sy www.indeed.com/rc/clk?jk=241cffcb34894a23&amp;from=r www.indeed.com/job/OS390-and-LINUX-Sy www.indeed.com/rc/clk?jk=3c508de6ef18090e&amp;from=r www.indeed.com/viewjob?t=Maximo+Sy www.indeed.com/rc/clk?jk=2e9dc7157570bc11&amp;from=r www.indeed.com/job/IT-Admini www.indeed.com/rc/clk?jk=5ff6a7bfb160b60d&amp;from=r www.indeed.com/viewjob?t=Senior+Network+Admini www.indeed.com/rc/clk?jk=2787d9b0570b1a09&amp;from=r www.indeed.com/job/Sy www.indeed.com/rc/clk?jk=e011407448aeb34d&amp;from=r www.indeed.com/viewjob?t=Network+Admini www.indeed.com/rc/clk?jk=287778d4c3891714&amp;from=r www.indeed.com/job/Office-Admini www.indeed.com/rc/clk?jk=e98eeb3906902c4b&amp;from=r www.indeed.com/job/Senior-Linux-Sy www.indeed.com/rc/clk?jk=91f4d589ec8066a4&amp;from=r www.indeed.com/viewjob?t=Sy www.indeed.com/rc/clk?jk=efdec3dc366db863&amp;from=r www.indeed.com/job/Advanced-SAN-Admini www.indeed.com/rc/clk?jk=a0489ea9373f3b03&amp;from=r www.indeed.com/job/VMS-Sy www.indeed.com/rc/clk?jk=4d612d7629170766&amp;from=r www.indeed.com/job/JDE-CNC-Admini www.indeed.com/rc/clk?jk=a1479bf776c03191&amp;from=r www.indeed.com/viewjob?t=JDE+CNC+Admini www.indeed.com/rc/clk?jk=f9bbde7e70b171cb&amp;from=r www.indeed.com/job/Window www.indeed.com/rc/clk?jk=ede18d071578a8c4&amp;from=r www.indeed.com/viewjob?t=SR+SYSTEMS+ENGINEER www.indeed.com/rc/clk?jk=e286fb43f674b06f&amp;from=r www.indeed.com/job/IT-Admini www.indeed.com/rc/clk?jk=df1af9e78442ca62&amp;from=r www.indeed.com/viewjob?t=WebLogic+Admini
Any ideas how to sort this out to get one complete URL per line?
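Maybe a simpler pattern would get closer. Something like this, untested, since grep -o already prints one match per line and sort -u drops the duplicates:

Code:
# Sketch: grab full http URLs up to a quote, space or tag bracket,
# one per line, duplicates removed.
grep -oE 'http://[^"<> ]+' ./tmp/file.html | sort -u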
 
  

