Old 06-29-2016, 09:09 AM   #1
socalheel
Member
 
Registered: Oct 2012
Location: Raleigh, NC
Distribution: CentOS / RHEL
Posts: 158

Rep: Reputation: 3
help make this bash script better


I'm okay with scripting, but I only know what I've had experience with, so I'd like other eyes on this to see what other commands and methods I could use to achieve the same result.

I've come up with a script that locates all robots.txt files on our shared web server.

What I care about is the presence of this string: "Disallow: /"

If that string is found, then what I care about is whether the line immediately before "Disallow: /" has this string: "User-agent: *"

If those two conditions are met, then I need to generate an alert to our monitoring service, OpsGenie.

The portion of this script that uses "egrep -v": I would love to move the strings I don't care about into a source file and have the script read its exclusions from there (see the sketch after the script below). The rationale is that this will be packaged as an RPM and deployed to several other shared web servers, and the excluded strings will be different on each server.

Code:
#!/bin/bash

#We are using locate for this seeing as the document root for the websites are not the same across the board.
#In order to ensure locate gives accurate results, we need to first update the locate db with the updatedb command

updatedb

#setting variables and file locations
hostname=`hostname`
outputFile="/root/robotsDisallow.txt"

#We are now finding all the robots.txt files, excluding unnecessary file locations and outputting the results to the $outputFile
#First we want to grep all robots.txt files for Disallow: / 
#After we have that list populated, we now care about if those Disallow: / strings pertain to all user agents by searching for User-agent: *

locate robots.txt | egrep -v "httpdocs.old|public_html.old|backup|bkp|/usr/libexec/webmin" | xargs grep -B1 'Disallow: /$' | grep 'User-agent: \*' > $outputFile


#Now we are finding if those two conditions are met, and if so, generate an alert to OpsGenie

grep robots.txt $outputFile
if [[ $? == 0 ]]
  then 
     sed -i "1i This file is generated from $hostname to inform you of robots.txt files that are disallowing all" $outputFile
     curl -XPOST 'https://api.opsgenie.com/v1/json/alert' -d '{"apiKey": "OUR-API-KEY","message" : "Misconfigured robots.txt on '$hostname'","teams" : ["IT.TEAM"]}'
     sleep 5
     #we need to get the alias for this alert
     alias=`curl -XGET 'https://api.opsgenie.com/v1/json/alert?apiKey=OUR-API-KEY&limit=1' | awk 'BEGIN { FS = ":" } ; {print $12}' | awk 'BEGIN {FS = "," } ; {print $1}' | sed 's/"//g'`
     curl -XPOST 'https://api.opsgenie.com/v1/json/alert/attach' -F apiKey="OUR-API-KEY" -F alias="$alias" -F attachment=@$outputFile
     sleep 5
fi

#cleaning up after ourselves
rm -f $outputFile
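
A minimal sketch of the source-file idea mentioned above, using grep's -f flag, which reads one pattern per line from a file. The path /etc/robots-check/excludes.txt is a hypothetical name; the patterns shown are the ones currently inlined in the script:
Code:
# Hypothetical exclusions file, one extended-regex pattern per line, e.g.:
#   httpdocs.old
#   public_html.old
#   backup
#   bkp
#   /usr/libexec/webmin
excludeFile="/etc/robots-check/excludes.txt"

locate robots.txt | egrep -v -f "$excludeFile" \
    | xargs grep -B1 'Disallow: /$' | grep 'User-agent: \*' > "$outputFile"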
 
Old 06-29-2016, 10:09 AM   #2
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,552

Rep: Reputation: 872
Instead of using locate to find robots.txt across the whole system and then excluding the irrelevant locations, it would be more controlled to search specific directories in the first place.
Are the web servers virtual hosts? Maybe you could build the search-directory list dynamically by grepping the DocumentRoot values from the httpd conf files.

Just a thought
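
A minimal sketch of that idea, assuming Apache configs under /etc/httpd (the paths are assumptions; adjust for your layout):
Code:
#!/bin/bash
# Collect DocumentRoot values from the Apache configs (stripping optional
# quotes), then search only those trees for robots.txt instead of the
# whole filesystem.
docroots=$(grep -hri '^[[:space:]]*DocumentRoot' /etc/httpd/conf /etc/httpd/conf.d 2>/dev/null \
    | awk '{gsub(/"/, "", $2); print $2}' | sort -u)

for root in $docroots; do
    find "$root" -name robots.txt 2>/dev/null
done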
 
1 members found this post helpful.
Old 06-29-2016, 11:07 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
I agree with keefaz that using locate is not really very expedient; depending on when updatedb was last executed, this may well take a long time.

Instead of:
Code:
grep robots.txt $outputFile
You can simply have your 'if' clause test it for you:
Code:
if [[ -s "$outputFile" ]]
Here is also an alternative for the 'alias' (not a fan of using variable names of commands) variable:
Code:
alias=$(curl -XGET 'https://api.opsgenie.com/v1/json/alert?apiKey=OUR-API-KEY&limit=1' | awk -F: '{sub(/,.*/,"",$12); print gensub(/"/,"","g",$12)}')
I obviously don't have an example to play with, so you might need to check your output on that one.
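
For illustration, a minimal sketch of how that test slots into the original script (-s is true when the file exists and is non-empty):
Code:
outputFile="/root/robotsDisallow.txt"

if [[ -s "$outputFile" ]]; then
    # at least one misconfigured robots.txt was found
    echo "alerting OpsGenie for $(hostname)"
fi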
 
1 members found this post helpful.
Old 06-29-2016, 01:17 PM   #4
socalheel
Member
 
Registered: Oct 2012
Location: Raleigh, NC
Distribution: CentOS / RHEL
Posts: 158

Original Poster
Rep: Reputation: 3
Quote:
Originally Posted by keefaz View Post
Instead of using locate to find robots.txt across the whole system and then excluding the irrelevant locations, it would be more controlled to search specific directories in the first place.
Are the web servers virtual hosts? Maybe you could build the search-directory list dynamically by grepping the DocumentRoot values from the httpd conf files.

Just a thought
Aah, I see. I didn't even think of that possibility... I really didn't want to use locate because of the updatedb step, but using find would have been difficult since we have different locations for the DocRoots. I like the virtual host file idea. Thanks for the idea.
 
Old 06-29-2016, 01:38 PM   #5
socalheel
Member
 
Registered: Oct 2012
Location: Raleigh, NC
Distribution: CentOS / RHEL
Posts: 158

Original Poster
Rep: Reputation: 3
Quote:
Originally Posted by grail View Post
Here is also an alternative for the 'alias' (not a fan of using variable names of commands) variable:
Code:
alias=$(curl -XGET 'https://api.opsgenie.com/v1/json/alert?apiKey=OUR-API-KEY&limit=1' | awk -F: '{sub(/,.*/,"",$12); print gensub(/"/,"","g",$12)}')
I obviously do not have an example to play with so might need to check your output on that one
Here is the output I'm working with:

Code:
{"alerts":[{"owner":"","acknowledged":false,"teams":["IT.TEAM"],"count":1,"message":"[StatusCake] http://www.site.com/ status is Down","isSeen":true,"tags":["OverwriteQuietHours","StatusCake"],"createdAt":1467222015951000173,"tinyId":"2362","alias":"http://www.site.com/","id":"111a222f-33oe-44cd-55ke-a3c777t6313","status":"closed","updatedAt":1467222395938000235}],"took":14}
and I need to pull only this string: http://www.site.com/ (located right before "id":"111a222f ...)

Last edited by socalheel; 06-29-2016 at 01:44 PM.
 
Old 06-29-2016, 01:56 PM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,879

Rep: Reputation: 660
First off, I must admit that I don't understand some of your terminology. I simplified the problem and hope my version is consistent with your general theme.

With file robot01 ...
Code:
Accura
Buick
Chrysler
Dodge
Apple
Disallow:
Ford
Honda
Jeep
Banana
Disallow:
Kia
Lexus
Mazda
Nissan
Cherry
Disallow:
and file robot02 ...
Code:
Hammer
Wrench
Screwdriver
Daffodil
Disallow:
Saw
Pliers
Rose
Disallow:
Daisy
Disallow:
and file robot03 ...
Code:
Red
Green
Deer
Disallow:
Purple
Orange
Antelope
Disallow:
Brown
Yellow
Goat
Disallow:
Violet
... this code ...
Code:
grep -B1 "Disallow:" $Path"robot"*".txt" \
|egrep -v "(--|Disallow:)"               \
>$OutFile
... produced this OutFile ...
Code:
/home/daniel/Desktop/LQfiles/dbm1640robot01.txt-Apple
/home/daniel/Desktop/LQfiles/dbm1640robot01.txt-Banana
/home/daniel/Desktop/LQfiles/dbm1640robot01.txt-Cherry
/home/daniel/Desktop/LQfiles/dbm1640robot02.txt-Daffodil
/home/daniel/Desktop/LQfiles/dbm1640robot02.txt-Rose
/home/daniel/Desktop/LQfiles/dbm1640robot02.txt-Daisy
/home/daniel/Desktop/LQfiles/dbm1640robot03.txt-Deer
/home/daniel/Desktop/LQfiles/dbm1640robot03.txt-Antelope
/home/daniel/Desktop/LQfiles/dbm1640robot03.txt-Goat
Daniel B. Martin
 
1 members found this post helpful.
Old 06-30-2016, 02:39 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
You will need to supply more information about how to know which part of the output to select.
It does point out that your current two-awk-and-sed solution also does not work.

For one thing, a colon is a terrible delimiter to pick, as there is one in your output, i.e. http:

So, using your current example, help me identify what it is about this second URL that says you should return that one.
 
1 members found this post helpful.
Old 06-30-2016, 06:57 AM   #8
socalheel
Member
 
Registered: Oct 2012
Location: Raleigh, NC
Distribution: CentOS / RHEL
Posts: 158

Original Poster
Rep: Reputation: 3
Quote:
Originally Posted by grail View Post
You will need to supply more information about how to know which part of the output to select.
It does point out that your current two-awk-and-sed solution also does not work.

For one thing, a colon is a terrible delimiter to pick, as there is one in your output, i.e. http:

So, using your current example, help me identify what it is about this second URL that says you should return that one.
You're right, that http://www.site.com was a terrible example.

How I obtain the alias ID is to query our OpsGenie account for alert information. In the output of an alert query, I need the value of the "alias" field. My example had http://www.site.com as the alias, but that is not a normal alert alias... I don't know how a URL got used in that request; I believe if an alert is closed, the alias switches format to a URL. That's possibly needless information, but anyway...

A normal alias from OpsGenie is in this format: "ab123c4d-5e67-89fg-01hi-2j3kl4mnop56"

So I need to extract this string:
Code:
ab123c4d-5e67-89fg-01hi-2j3kl4mnop56
from the following output:


Code:
{"alerts":[{"owner":"","acknowledged":false,"teams":["IT.TEAM"],"count":1,"message":"[StatusCake] http://www.site.com/ status is Down","isSeen":true,"tags":["OverwriteQuietHours","StatusCake"],"createdAt":1467222015951000173,"tinyId":"2362","alias":"ab123c4d-5e67-89fg-01hi-2j3kl4mnop56","id":"111a222f-33oe-44cd-55ke-a3c777t6313","status":"closed","updatedAt":1467222395938000235}],"took":14}"

Last edited by socalheel; 06-30-2016 at 07:03 AM.
 
Old 06-30-2016, 08:21 AM   #9
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,552

Rep: Reputation: 872Reputation: 872Reputation: 872Reputation: 872Reputation: 872Reputation: 872Reputation: 872
If you have python installed, python -mjson.tool is useful
https://docs.python.org/2/library/json.html

Code:
# $out holds the JSON response; json.tool pretty-prints one key per line,
# then sed prints the quoted value from the line containing "alias"
echo $out | python -mjson.tool | sed -ne '/alias/s/.*"\(.*\)",/\1/p'
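
For comparison, a minimal jq sketch (assuming jq is installed on the server; the field path follows the sample output posted above):
Code:
curl -s 'https://api.opsgenie.com/v1/json/alert?apiKey=OUR-API-KEY&limit=1' \
    | jq -r '.alerts[0].alias'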

Last edited by keefaz; 06-30-2016 at 08:22 AM.
 
1 members found this post helpful.
Old 06-30-2016, 12:07 PM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
Try something like:
Code:
awk '/alias/{getline;print}' RS='[,:"]+'
If we have to allow for possible URLs in there, this will not work, again because of the colon.

You could go a bit bashy on it too:
Code:
# capture from 'alias' up to, but not including, the next comma
regex='alias[^,]*'

# =~ stores the matched text in BASH_REMATCH[0]
[[ $(curl -XGET 'https://api.opsgenie.com/v1/json/alert?apiKey=OUR-API-KEY&limit=1') =~ $regex ]] && new_alias=${BASH_REMATCH[0]}

# strip the leading 'alias":"' and the trailing double quote
[[ -n "$new_alias" ]] && new_alias=${new_alias#*:\"} && new_alias=${new_alias%\"}
 
2 members found this post helpful.
Old 06-30-2016, 03:51 PM   #11
socalheel
Member
 
Registered: Oct 2012
Location: Raleigh, NC
Distribution: CentOS / RHEL
Posts: 158

Original Poster
Rep: Reputation: 3
Quote:
Originally Posted by grail View Post
Try something like:
Code:
awk '/alias/{getline;print}' RS='[,:"]+'
If we have to allow for possible URLs in there, this will not work, again because of the colon.

No, we don't have to account for URLs.

I substituted your shorter awk version and it works like a charm.

I like eliminating multiple awks and pipes when I can, so thank you very much. I don't quite understand awk, but maybe one day.
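
Since the one-liner may look opaque, here is a commented sketch of what it does (the curl call is the one from the original script):
Code:
# RS='[,:"]+' tells awk to split its input into records at runs of commas,
# colons, and double quotes, so the key  alias  and its value end up as
# consecutive records.  On the record matching /alias/, getline advances
# to the next record (the value), which print then emits.
curl -s 'https://api.opsgenie.com/v1/json/alert?apiKey=OUR-API-KEY&limit=1' \
    | awk '/alias/{getline; print}' RS='[,:"]+'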
 
  

