Programming
This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.
I am okay with scripting, and only know what I've had experience with, but I would like other eyes on this so I can see what other commands and methods could achieve the same result.
I've come up with a script that locates all robots.txt files on our shared web server.
What I care about is the presence of this string: "Disallow: /"
If that string is found, then what I care about is whether the line immediately before "Disallow: /" contains this string: "User-agent: *"
If those two conditions are met, then I need to generate an alert to our monitoring service, OpsGenie.
About the portion of this script that has "egrep -v": I would love to keep the strings I don't care about in a separate source file and have the script read that file to exclude them. The rationale is that this is going to be made into an RPM and deployed to several other shared web servers, and those strings will be different for each web server it is installed on.
Code:
#!/bin/bash
#We are using locate for this seeing as the document root for the websites are not the same across the board.
#In order to ensure locate gives accurate results, we need to first update the locate db with the updatedb command
updatedb
#setting variables and file locations
hostname=`hostname`
outputFile="/root/robotsDisallow.txt"
#We are now finding all the robots.txt files, excluding unnecessary file locations and outputting the results to the $outputFile
#First we want to grep all robots.txt files for Disallow: /
#After we have that list populated, we now care about if those Disallow: / strings pertain to all user agents by searching for User-agent: *
locate robots.txt | egrep -v "httpdocs.old|public_html.old|backup|bkp|/usr/libexec/webmin" | xargs grep -B1 'Disallow: /$' | grep 'User-agent: \*' > $outputFile
#Now we are finding if those two conditions are met, and if so, generate an alert to OpsGenie
grep robots.txt $outputFile
if [[ $? == 0 ]]
then
sed -i "1i This file is generated from $hostname to inform you of robots.txt files that are disallowing all" $outputFile
curl -XPOST 'https://api.opsgenie.com/v1/json/alert' -d '{"apiKey": "OUR-API-KEY","message" : "Misconfigured robots.txt on '$hostname'","teams" : ["IT.TEAM"]}'
sleep 5
#we need to get the alias for this alert
alias=`curl -XGET 'https://api.opsgenie.com/v1/json/alert?apiKey=OUR-API-KEY&limit=1' | awk 'BEGIN { FS = ":" } ; {print $12}' | awk 'BEGIN {FS = "," } ; {print $1}' | sed 's/"//g'`
curl -XPOST 'https://api.opsgenie.com/v1/json/alert/attach' -F apiKey="OUR-API-KEY" -F alias="$alias" -F attachment=@$outputFile
sleep 5
fi
#cleaning up after ourselves
rm -f $outputFile
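For the external exclude list described above, grep's -f option reads one pattern per line from a file, so the per-server list can ship as a separate file in each RPM instead of being hard-coded in the script. A minimal sketch; the /tmp/robots_excludes.txt path and its contents are assumptions:
Code:
```shell
# Hypothetical per-server exclude list: one extended-regex pattern per line.
cat > /tmp/robots_excludes.txt <<'EOF'
httpdocs.old
public_html.old
backup|bkp
/usr/libexec/webmin
EOF

# grep -E -v -f drops every path matching any pattern in the file,
# replacing the hard-coded egrep -v "a|b|c" alternation.
# (Guarded because locate may not exist on every box, and grep exits
# nonzero when nothing survives the filter.)
if command -v locate >/dev/null; then
    locate robots.txt | grep -E -v -f /tmp/robots_excludes.txt || true
fi
```
Swapping the list per server then only means installing a different excludes file, not editing the script.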
Instead of using locate to find robots.txt across the whole system and then excluding irrelevant locations, it would be more controlled to search specific directories in the first place.
Are the web servers virtual hosts? If so, maybe you could build the search directory list dynamically by grepping the DocumentRoot value from the httpd conf files.
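That DocumentRoot idea could be sketched like this; the conf path is an assumption (Debian-style layouts use /etc/apache2/sites-enabled/*.conf instead):
Code:
```shell
# Pull every DocumentRoot value out of the vhost configs, then check
# each one for a robots.txt directly under it. No locate/updatedb needed.
conf_glob=${conf_glob:-/etc/httpd/conf.d/*.conf}

awk '$1 == "DocumentRoot" {gsub(/"/, "", $2); print $2}' $conf_glob 2>/dev/null |
sort -u |
while read -r docroot; do
    if [ -f "$docroot/robots.txt" ]; then
        echo "$docroot/robots.txt"
    fi
done
```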
I agree with keefaz that using locate is not really expedient; depending on when updatedb was last executed, this may well take a long time.
Instead of :-
Code:
grep robots.txt $outputFile
You can simply have your 'if' clause test it for you:
Code:
if [[ -s "$outputFile" ]]
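A quick demonstration of what -s tests (the file name here is just for illustration):
Code:
```shell
outputFile=/tmp/robotsDisallow.txt

: > "$outputFile"       # truncate to zero bytes
if [[ -s "$outputFile" ]]; then echo "non-empty"; else echo "empty"; fi
# → empty

echo "hit" >> "$outputFile"
if [[ -s "$outputFile" ]]; then echo "non-empty"; else echo "empty"; fi
# → non-empty
```
This also saves the extra grep over the file and the separate $? check.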
Here is also an alternative for the 'alias' variable (I am not a fan of using command names as variable names):
aah, i see. i didn't even think of that possibility ... i really didn't want to use locate because of the updatedb, but using find would have been difficult since we have different locations for DocRoots. i like the virtual host file idea. thanks for the idea.
I obviously do not have an example to play with, so I might need to check your output on that one.
here is the output i'm working with:
Code:
{"alerts":[{"owner":"","acknowledged":false,"teams":["IT.TEAM"],"count":1,"message":"[StatusCake] http://www.site.com/ status is Down","isSeen":true,"tags":["OverwriteQuietHours","StatusCake"],"createdAt":1467222015951000173,"tinyId":"2362","alias":"http://www.site.com/","id":"111a222f-33oe-44cd-55ke-a3c777t6313","status":"closed","updatedAt":1467222395938000235}],"took":14}
and i need to pull only this string: http://www.site.com/ (located right before "id":"111a222f ...)
First off, I must admit that I don't understand some of your terminology. I simplified the problem and hope my version is consistent with your general theme.
With file robot01 ...
Code:
Accura
Buick
Chrysler
Dodge
Apple
Disallow:
Ford
Honda
Jeep
Banana
Disallow:
Kia
Lexus
Mazda
Nissan
Cherry
Disallow:
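The pairing check from the original script (each "Disallow:" together with the line just above it) can be demonstrated against that file, recreated here under /tmp for the sake of a runnable example:
Code:
```shell
cat > /tmp/robot01 <<'EOF'
Accura
Buick
Chrysler
Dodge
Apple
Disallow:
Ford
Honda
Jeep
Banana
Disallow:
Kia
Lexus
Mazda
Nissan
Cherry
Disallow:
EOF

# -B1 prints each matching line preceded by one line of context,
# so the fruit line immediately above each "Disallow:" shows up.
grep -B1 'Disallow:' /tmp/robot01
```
Each match prints with its preceding line (Apple, Banana, Cherry), and GNU grep separates the non-adjacent groups with "--".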
You will need to supply more information about how to know what part of the output to select.
It does point out that your current two-awk-and-sed solution also does not work.
For one thing, a colon is a terrible delimiter to choose, as there is already one inside your output, i.e. http:
So using your current example, help me identify what it is about this second url that says you should return that one?
how i obtain the alias id is to query our OpsGenie account for alert information. in the output of an alert query, i need the value of the "alias" field. my example had http://www.site.com as the alias, but that is not a normal alert alias ... i don't know how a url got used in that request; i believe if an alert is closed, the alias switches format to a URL. that's possibly needless information, but anyway ...
a normal alias from OpsGenie is in this format: "ab123c4d-5e67-89fg-01hi-2j3kl4mnop56"
so, i need to extract this string:
Code:
ab123c4d-5e67-89fg-01hi-2j3kl4mnop56
from the following output:
Code:
{"alerts":[{"owner":"","acknowledged":false,"teams":["IT.TEAM"],"count":1,"message":"[StatusCake] http://www.site.com/ status is Down","isSeen":true,"tags":["OverwriteQuietHours","StatusCake"],"createdAt":1467222015951000173,"tinyId":"2362","alias":"ab123c4d-5e67-89fg-01hi-2j3kl4mnop56","id":"111a222f-33oe-44cd-55ke-a3c777t6313","status":"closed","updatedAt":1467222395938000235}],"took":14}
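One way to do that keys on the "alias" field name instead of counting colon-separated fields, which sidesteps the http: problem mentioned above. A sketch against the sample output (if jq is available, jq -r '.alerts[0].alias' would be more robust than regex on JSON):
Code:
```shell
json='{"alerts":[{"owner":"","acknowledged":false,"teams":["IT.TEAM"],"count":1,"tinyId":"2362","alias":"ab123c4d-5e67-89fg-01hi-2j3kl4mnop56","id":"111a222f-33oe-44cd-55ke-a3c777t6313","status":"closed"}],"took":14}'

# Capture whatever sits between "alias":" and the next double quote
alias_id=$(printf '%s' "$json" | sed -n 's/.*"alias":"\([^"]*\)".*/\1/p')
echo "$alias_id"
# → ab123c4d-5e67-89fg-01hi-2j3kl4mnop56
```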