How to delete inactive web links from a file?

Satyaveer Arya · 07-17-2012, 12:52 AM

Hi guys,

I am using RHEL 5.3 and have configured a proxy server on it. I have blocked a number of proxy sites through the proxy server.
The blocked sites are stored in a file, and the list contains 67,260 links.
Many of the links in that file are inactive and no longer open.
Is there an automated way, such as a script, to delete all the inactive links from that file?
Deleting them manually would take a huge amount of time.

I have attached only part of the file, because the full file exceeds this forum's attachment size limit. You can take a look.

So guys, any idea?

sag47 · 07-17-2012, 01:44 AM

It can be done with a simple one-liner.

Code:

sed 's/^\.//' proxy_sites.txt | while read site;do curl "http://$site/" &> /dev/null;if [ $? -eq 0 ];then echo ".$site";fi;done

The curl command assumes there is a website listening on each domain name listed in that file.

Basically, I removed the leading period, tested the site with curl, and added the period again after testing. The command may take a while due to network timeouts, but it will be considerably faster than doing this by hand. To save the list to a new file you can just redirect the output of the while loop into a file.

Code:

sed 's/^\.//' proxy_sites.txt | while read site;do curl "http://$site/" &> /dev/null;if [ $? -eq 0 ];then echo ".$site";fi;done >> new_proxy_sites.txt

SAM

Satyaveer Arya · 07-17-2012, 01:55 AM

Hi Sam,

Thanks for the update.
But I don't want to save the output to a file. I have to delete those inactive links from the same file.

sag47 · 07-17-2012, 07:08 AM

You could use sed -i to replace links. See the man pages for sed. For example,

Code:

sed -i 's/find/replace/' "proxy sites.txt"

If you wish to create a backup file, give -i a suffix (not much point, since there are so many operations on the same file).

Code:

sed -i.bak 's/find/replace/' "proxy sites.txt"

That will create a backup of the original called "proxy sites.txt.bak", though it will be overwritten with each change or URL removed.

I still say that the original method I outlined for you is best. You can move or replace your working file with the temporary one. I gave you the tools you *could* use so hack them into the tool you actually want.
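As a rough illustration of the "write to a temporary file, then move it over the original" approach, here is a minimal Python 3 sketch. It is not the script from this thread: the function name `filter_file_in_place` and the `is_alive` callback are hypothetical, and the callback stands in for whatever test you choose (curl, a socket probe, and so on).

```python
import os
import shutil
import tempfile


def filter_file_in_place(path, is_alive):
    """Keep only entries whose site passes is_alive(); return how many were kept."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    kept = 0
    try:
        with os.fdopen(fd, "w") as tmp, open(path) as src:
            for line in src:
                entry = line.strip()
                if not entry:
                    continue  # skip blank lines
                # Entries look like ".example.com"; test without the leading dot.
                site = entry[1:] if entry.startswith(".") else entry
                if is_alive(site):
                    tmp.write(entry + "\n")
                    kept += 1
        shutil.copymode(path, tmp_path)  # keep the original file permissions
        os.replace(tmp_path, path)       # swap the filtered list into place
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise
    return kept
```

Called as `filter_file_in_place("proxy_sites.txt", my_check)`, it rewrites the list so that only entries for which `my_check(site)` returns true remain. Keeping a copy of the original file first is a sensible precaution.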

Satyaveer Arya · 07-17-2012, 10:02 PM

Hi Sam,

Thank you very much. You provided very valuable information.
I have only one doubt now: in the second command of your first post, you redirected the active web links to a new file, right?

Quote:

sed 's/^\.//' proxy_sites.txt | while read site;do curl "http://$site/" &> /dev/null;if [ $? -eq 0 ];then echo ".$site";fi;done >> new_proxy_sites.txt

Now, can we add a deletion command for the inactive web links to the method above, rather than redirecting to a new file?
Again, thank you very much Sam. It solved my problem to an extent.

Satyaveer Arya · 07-18-2012, 09:54 PM

Hi SAM,

Quote:

Basically, I removed the leading period; tested the site with curl

Sorry, I didn't get 'leading period'. Can you please explain this?

sag47 · 07-18-2012, 11:43 PM

Read the following pages.

The data sample you provided, "proxy sites.txt", starts each domain with a period or dot ("."). In order to properly test the domain with curl or any url tester you must remove the leading dot/period. Some additional steps you can take for better understanding...

Look at your own data sample, then look at what I provided, it should make more sense that way. Read the man pages for any commands you don't understand or google "howto command" which usually gives a good result for a tutorial on using the command.

One thing you should be aware of is that my sample script assumes all websites are http protocol running over port 80. It does not take into account other protocols or ports. Considering the name of the text file, it also does not test the domains to see if they're actually running a proxy service. What I wrote for you is a rough *prototype*, so I leave it up to you to turn it into the tool you specifically require, as you're more aware of what your system needs than I am. A better method would be to write a simple program (e.g. in Python) which simply tests the server to see if a socket is listening and doesn't care what the protocol is. If no socket is listening, ignore the host.
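To make the leading-dot step concrete, here is a small Python 3 sketch of what the `sed 's/^\.//'` part does to each entry before it is tested. The helper names `strip_leading_dot` and `to_url` are made up for illustration, and the URL form assumes plain HTTP on port 80, as the one-liner does.

```python
def strip_leading_dot(entry):
    """Return the entry without its single leading dot, if it has one."""
    entry = entry.strip()
    return entry[1:] if entry.startswith(".") else entry


def to_url(entry):
    """Build the plain-HTTP URL that the curl one-liner would request."""
    return "http://%s/" % strip_leading_dot(entry)
```

After testing, the dot is put back simply by writing `"." + site`, which is what `echo ".$site"` does in the shell version.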

Satyaveer Arya · 07-21-2012, 10:02 AM

Hi Sam,

Can you please tell me how to remove the dot (.) in front of all those links? Otherwise it would take too long.

I would be thankful to you very much.

sag47 · 07-22-2012, 10:58 AM

Quote:

Originally Posted by Satyaveer Arya

Hi Sam,

Can you please tell me how to remove the dot (.) in front of all those links? Otherwise it would take too long.

I would be thankful to you very much.

I'm not sure I understand you. The script I gave you automatically *removes* the dots in front of all the links and then *automatically* adds the dot back after testing is done. You should read the documentation I linked you to and try to understand what it is that you are running on your own system. It is your ethical and moral responsibility as someone who is managing that system, *especially* if that system houses other people's data or logins.

Satyaveer Arya · 07-22-2012, 08:47 PM

Hi Sam,

I am sorry for asking that stupid question. Yes, you showed in your first post how you removed the dot (.), checked the website and then put the dot (.) back. Right?
Pal, I am not proficient in shell scripting and am still learning a lot. So far I know only some basics, and I keep going.

But can you please share your idea of how we can put the working links in one file and the inactive links in another file after checking them? When I tried the command you posted in your first post, it was putting the active and inactive links in the same file.

I will be very thankful to you pal!

sag47 · 07-23-2012, 08:53 PM

Quote:

Originally Posted by Satyaveer Arya

I am sorry for asking that stupid question. Yes, you showed in your first post how you removed the dot (.), checked the website and then put the dot (.) back. Right?
Pal, I am not proficient in shell scripting and am still learning a lot. So far I know only some basics, and I keep going.

As they say there is no such thing as a dumb question. However, I would appreciate it if you actually read the answers I give you.

man bash; see sections: Compound commands, Exit status, conditional expressions.

man test (or man [) for additional information on the conditional expressions.

Quote:

Originally Posted by Satyaveer Arya

But can you please share your idea of how we can put the working links in one file and the inactive links in another file after checking them? When I tried the command you posted in your first post, it was putting the active and inactive links in the same file.

See this section on exit status as well. I am not separating "active" and "inactive" links. See the "exit codes" section of the curl man page.

So look back at my original script. What is going on is this: I am opening the link with curl. If curl can connect to the server and complete the HTTP transfer, it exits with zero (0), as in success. (Without the -f/--fail option curl also exits with zero when the server answers with an error status such as 404, so this really tests whether a web server responds, not whether the page returned 200.) Notice in my original script that I only output if the command is successful. This means that only links which were successfully tested were output to the file and any links that failed were not printed (i.e. discarded). Let me break it down into a more readable script (with some helpful changes) and maybe it will make a little more sense if it is more than a one-liner.

Code:

#!/bin/bash
sed 's/^\.//' proxy_sites.txt | while read -r site; do
  curl "http://$site/" &> /dev/null
  if [ $? -eq 0 ]; then
    #exit status for curl was zero, as in success
    #output the tested site into a working text file
    echo ".$site" >> working_sites.txt
  else
    #exit status was anything but zero, as in failure
    #output the tested site into a "broken" site text file
    echo ".$site" >> broken_sites.txt
  fi
done
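The exit-status branching in the script above can also be tried without touching the network. The following Python 3 sketch sorts entries into two lists depending on whether a command exits with zero. The function name `split_by_exit_status` and the `build_command` callback are hypothetical and not part of the original script.

```python
import subprocess


def split_by_exit_status(sites, build_command):
    """Run one command per site; exit status 0 means working, anything else broken."""
    working, broken = [], []
    for site in sites:
        status = subprocess.call(
            build_command(site),
            stdout=subprocess.DEVNULL,  # same idea as "&> /dev/null"
            stderr=subprocess.DEVNULL,
        )
        if status == 0:
            working.append("." + site)  # put the leading dot back
        else:
            broken.append("." + site)
    return working, broken
```

As one possible usage, assuming curl is installed, `build_command` could be `lambda site: ["curl", "--fail", "--max-time", "10", "http://%s/" % site]`. The two returned lists can then be written to `working_sites.txt` and `broken_sites.txt`.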

Perhaps to get a better understanding of bash and its inner workings you should review (I know, blah) some more documentation. Two good guides that help people with no scripting experience get up to speed are the Bash Guide for Beginners and the Advanced Bash-Scripting Guide.

It doesn't help you much for me to point you into the man pages for all the specific information if you don't see the big picture. Those scripting guides will help you see the big picture. I still point you to the man pages specifically because I am showing you that all this information exists on your local GNU/Linux system without any google searching required so you know where to find it.

It may sound frustrating that I won't outright give you exactly what you want, but it doesn't do anyone any good if I do that, and the way you want to go about it is bad practice in my opinion. Plus, I basically did give you exactly what you want; you just needed to put the final two puzzle pieces together (i.e. my original script and the sed -i command).

Give a man a fish and he eats for a day. Teach a man to fish and he eats for a lifetime.

Satyaveer Arya · 07-24-2012, 06:33 PM

Hi Sam,

First of all, thanks again for your valuable time and information.

And I am already reading both the online-books i.e., Bash Guide for Beginners & Advanced Bash-Scripting Guide.

Quote:

One thing you should be aware of is that my sample script assumes all websites are http protocol running over port 80. It does not take into account other protocols or ports

You said that your script assumes all websites use the HTTP protocol over port 80. So, is there any way to check websites which do not run on port 80?

sag47 · 07-25-2012, 12:33 AM

Quote:

Originally Posted by Satyaveer Arya

Hi Sam,

First of all, thanks again for your valuable time and information.

And I am already reading both the online-books i.e., Bash Guide for Beginners & Advanced Bash-Scripting Guide.

You said that your script assumes all websites use the HTTP protocol over port 80. So, is there any way to check websites which do not run on port 80?

There is not an easy way to accomplish this. One way to go about it is to be very selective about which ports you're going to scan and then test them for listening sockets. Testing for listening sockets will not tell you specifically which protocol is running on the port, but it will give you an idea that *something* is listening. If you're trying to filter a server then really *anything* listening could be considered a bad thing on said server. Here's a simple python script for how one might accomplish that.

Code:

#!/usr/bin/env python
# By Sam Gleske
# http://www.gleske.net/
# MIT Open Source License - http://opensource.org/licenses/mit-license.php

# This tests the status of arbitrary listening ports from a common_ports list.
# Python 2.7.2

import socket
from sys import exit
from sys import argv

common_ports=[80,443,8080,8443,8000,8081,444,1080,2301,3382,7777]

def isonline_sockettest(host,port):
  host = str(host)
  port = int(port)
  s=socket.socket(socket.AF_INET,socket.SOCK_STREAM)
  #if it's slower than one second then assume it's off
  s.settimeout(1)
  try:
    s.connect((host,port))
    s.shutdown(socket.SHUT_RDWR)
    #socket is listening
    return True
  except socket.error:
    #no listening socket (i.e. some service is turned off), timeout, or DNS failure
    return False
  finally:
    #always release the socket
    s.close()

def main():
  bad_server = True
  for port in common_ports:
    print "Testing port %d..." % port
    if isonline_sockettest(argv[1],port):
      bad_server = False
      break
  if bad_server:
    print "%s is down" % argv[1]
    exit(1)
  else:
    print "%s is up" % argv[1]
    exit(0)


if __name__ == "__main__":
  main()

Sample usage:

Code:

somescript.py google.com
somescript.py fakeserver.server.com

Again, as I stated before, I arbitrarily chose ports on which services like proxies commonly listen. This list can easily be modified, and the script's exit codes follow the usual POSIX convention (0 for success, non-zero for failure). You can replace the curl command with the python script. This will take considerably longer because each port that does not answer can cost up to the 1 second timeout.

I say that because the more ports you check, the more tests are bound to fail. This is one of those deals where you have to weigh whether the processing time is even worth checking all those ports. Worst case, my script will run for about (number of tested ports)*(number of tested servers) seconds (e.g. 11 ports tested against 1000 servers gives a worst-case runtime of 11000 seconds). Not really ideal.
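The worst-case estimate can be written out as a small calculation. This Python 3 sketch assumes every probe runs into the full timeout and that probes run one after another. The function name `worst_case_seconds` is made up for illustration, and the 67,260 figure is the list size given in the first post.

```python
def worst_case_seconds(num_ports, num_servers, timeout_seconds=1.0):
    """Upper bound when every probe hits the timeout, one after another."""
    return num_ports * num_servers * timeout_seconds


example = worst_case_seconds(11, 1000)     # the example above: 11000 seconds
full_list = worst_case_seconds(11, 67260)  # the full 67,260-entry list
full_list_days = full_list / 86400.0       # 86400 seconds per day

print("%d seconds, about %.1f days" % (full_list, full_list_days))
```

For the full list this bound comes to 739,860 seconds, roughly eight and a half days, which is why it pays to keep the port list short.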

This script is also a rough prototype thrown together to show you a proof of concept to what I was saying. Before running it on any production system you should do additional research and testing with the script.

SAM

Satyaveer Arya · 08-13-2012, 09:06 AM

Hi Sam,

I used the python script you posted above, but every time I try to check any website it says that the link is down.
For example, I checked google.com and it says google.com is down.
Why is that?

Thank You!

sag47 · 08-13-2012, 01:35 PM

Not sure, did you try checking with nmap against google.com too? Depending on the part of the world you're in they may block it and just allow your regional google domain. Either way, the prototype script works for me so you should try the script with more verbose error output (-v or -vv options for python, see man page). It is also worth noting that I wrote that script for Python 2.7. So if you're using it with Python 3 I have no idea if it would even work. Py3 is almost like a new language compared to 2.7/2.6.

I just tested the script against google.com, yahoo.com, and amazon.com. It works for all of them. If nmap works with google.com and my script doesn't, then you are likely a) using the wrong version of python or b) missing a required library. If neither nmap nor my script works then you're likely being blocked by that website. In that case try choosing a different website.
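For anyone on Python 3, here is a minimal sketch of the same socket test that also reports why a connection failed, which may help when every host shows up as down. It is an untested-in-production illustration, not a port of the script posted earlier; the function name `probe` and its return format are assumptions.

```python
import socket


def probe(host, port, timeout=1.0):
    """Return (True, None) if something accepts TCP connections, else (False, reason)."""
    try:
        # create_connection resolves the name and connects in one step
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # covers refused connections, timeouts and DNS failures
        return False, "%s: %s" % (type(exc).__name__, exc)
```

Printing the reason for a failed `probe("google.com", 80)` shows whether the failure is a timeout, a refused connection or a name-resolution error, each of which points to a different cause.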