LinuxQuestions.org
Old 07-17-2012, 12:52 AM   #1
Satyaveer Arya
Senior Member
 
Registered: May 2010
Location: Dehradun, Uttarakhand, India
Distribution: RHEL, CentOS, Debian, Oracle Solaris 10
Posts: 1,412

Rep: Reputation: 303
How to delete inactive web links from a file?


Hi guys,

I am using RHEL 5.3 and have configured a proxy server on it. I have blocked a number of proxy sites through the proxy server.
I store those proxy sites in a file, and the list contains 67,260 links.
Many of the links in that file are inactive and no longer open.
Now I am wondering whether there is an automated way to delete all those inactive links from the file, for example with a script.
Deleting the inactive links manually would take a huge amount of time.

I have attached only part of the file, because the full file exceeds this forum's attachment size limit. You can take a look.

So guys, any idea?
Attached Files
File Type: txt proxy sites.txt (27.8 KB, 13 views)
 
Old 07-17-2012, 01:44 AM   #2
sag47
Senior Member
 
Registered: Sep 2009
Location: Philly, PA
Distribution: Kubuntu x64, RHEL, Fedora Core, FreeBSD, Windows x64
Posts: 1,422
Blog Entries: 33

Rep: Reputation: 356
It can be done with a simple one liner.
Code:
sed 's/^\.//' proxy_sites.txt | while read site;do curl "http://$site/" &> /dev/null;if [ $? -eq 0 ];then echo ".$site";fi;done
The curl command assumes there is a website listening at each domain name listed in that file.

Basically, I removed the leading period; tested the site with curl; and added the period again after testing. The command may take a while because of network timeouts, but it is considerably faster than doing this by hand. To save the list to a new file, you can just redirect the output of the while loop into a file.

Code:
sed 's/^\.//' proxy_sites.txt | while read site;do curl "http://$site/" &> /dev/null;if [ $? -eq 0 ];then echo ".$site";fi;done >> new_proxy_sites.txt
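For readers who prefer Python, the same strip, test, and re-add steps can be sketched as below. This is a minimal sketch under stated assumptions, not part of the original one-liner: it assumes Python 3, the function names (`strip_leading_dot`, `responds_over_http`, `active_entries`) are made up for illustration, and the 5 second timeout is an arbitrary choice. Like curl without `-f`, it counts any HTTP answer as a live site.

```python
import http.client
import urllib.error
import urllib.request


def strip_leading_dot(entry):
    """Turn '.example.com' into 'example.com'; other entries pass through."""
    entry = entry.strip()
    return entry[1:] if entry.startswith(".") else entry


def responds_over_http(host, timeout=5.0):
    """Return True if http://host/ answers at all, whatever the status code."""
    try:
        with urllib.request.urlopen("http://%s/" % host, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError, ValueError,
            http.client.HTTPException):
        # No answer: DNS failure, refused connection, timeout, bad name, ...
        return False


def active_entries(lines, check=responds_over_http):
    """Yield the entries (leading dot restored) whose host passes the check."""
    for line in lines:
        host = strip_leading_dot(line)
        if host and check(host):
            yield "." + host
```

To try it, open the list file and write each value yielded by `active_entries(file_object)` to a new file. The `check` parameter exists only so that a different test can be swapped in later.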
SAM

Last edited by sag47; 07-17-2012 at 01:49 AM.
 
2 members found this post helpful.
Old 07-17-2012, 01:55 AM   #3
Satyaveer Arya
Original Poster
Hi Sam,


Thanks for the update.
But I don't want to save the output to a file. I have to delete those inactive links from the same file.
 
Old 07-17-2012, 07:08 AM   #4
sag47
You could use sed -i to replace links. See the man pages for sed. For example,
Code:
sed -i 's/find/replace/' "proxy sites.txt"
If you wish to create a backup file (there's not much point, since there are so many operations on the same file), use:
Code:
sed -i.bak 's/find/replace/' "proxy sites.txt"
That will create a backup of the original called "proxy sites.txt.bak", though the backup will be overwritten with each change or URL removed.

I still say that the original method I outlined for you is best. You can move or replace your working file with the temporary one. I gave you the tools you *could* use so hack them into the tool you actually want.
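The "write a temporary file, then move it over the working file" idea can be sketched in Python as follows. It is only an illustration: it assumes Python 3, and `rewrite_keeping` and its `keep` callback are hypothetical names, not anything from sed or from the script above.

```python
import os
import shutil
import tempfile


def rewrite_keeping(path, keep):
    """Rewrite path so it holds only the lines for which keep(line) is true."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the surviving lines to a temporary file in the same directory ...
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as tmp, open(path) as src:
            for line in src:
                if keep(line):
                    tmp.write(line)
        # ... keep the original permission bits ...
        shutil.copymode(path, tmp_path)
        # ... then swap the temporary file over the original in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, leave the original untouched and clean up.
        os.unlink(tmp_path)
        raise
```

From the outside the effect is the same as deleting lines from the same file, but the original stays intact until the new contents are completely written.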

Last edited by sag47; 07-17-2012 at 07:20 AM.
 
1 members found this post helpful.
Old 07-17-2012, 10:02 PM   #5
Satyaveer Arya
Original Poster
Hi Sam,

I am very thankful to you. You provided very valuable information.
I have only one doubt now: in the second command of your first post, you directed the active web links to a new file, right?

Quote:
sed 's/^\.//' proxy_sites.txt | while read site;do curl "http://$site/" &> /dev/null;if [ $? -eq 0 ];then echo ".$site";fi;done >> new_proxy_sites.txt
Now, can we add a deletion command for the inactive web links to the above method, rather than redirecting to a new file?
Thank you very much again, Sam. It solved my problem to an extent.

Last edited by Satyaveer Arya; 07-17-2012 at 10:20 PM.
 
Old 07-18-2012, 09:54 PM   #6
Satyaveer Arya
Original Poster
Hi SAM,

Quote:
Basically, I removed the leading period; tested the site with curl
Sorry, I didn't get 'leading period'. Can you please explain this?
 
Old 07-18-2012, 11:43 PM   #7
sag47
Read the following pages.
The data sample you provided, "proxy sites.txt", starts each domain with a period or dot ("."). In order to properly test the domain with curl or any url tester you must remove the leading dot/period. Some additional steps you can take for better understanding...

Look at your own data sample, then look at what I provided, it should make more sense that way. Read the man pages for any commands you don't understand or google "howto command" which usually gives a good result for a tutorial on using the command.
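As a small illustration of what "removing the leading period" means, here is the same substitution in Python 3. The entries are hypothetical stand-ins in the shape of the data sample, not lines taken from the attachment.

```python
import re

# Hypothetical entries in the same shape as the data sample.
entries = [".example.com", ".proxy.example.net", "example.org"]

# Same expression as sed 's/^\.//': drop one dot, only at the start of the line.
hosts = [re.sub(r"^\.", "", entry) for entry in entries]
print(hosts)  # ['example.com', 'proxy.example.net', 'example.org']

# After testing, the dot is put back the way the script's echo ".$site" does.
restored = ["." + host for host in hosts]
print(restored)  # ['.example.com', '.proxy.example.net', '.example.org']
```

Note that an entry which had no leading dot to begin with still gets one when the dot is added back, exactly as it would with the shell one-liner.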

One thing you should be aware of is that my sample script assumes all websites are http protocol running over port 80. It does not take into account other protocols or ports. Considering the name of the text file, it also does not test whether the domains are actually running a proxy service. What I wrote for you is a rough *prototype*, so I leave it up to you to turn it into the tool you specifically require, as you're more aware of what your system needs than I am. A better method would be to write a simple program (e.g. in Python) which simply tests the server to see whether a socket is listening and doesn't care what the protocol is. If no socket is listening, ignore the site.

Last edited by sag47; 07-18-2012 at 11:55 PM.
 
Old 07-21-2012, 10:02 AM   #8
Satyaveer Arya
Original Poster
Hi Sam,

Can you please tell me how to remove the dot (.) in front of all those links? Otherwise it would take so long.

I would be very thankful to you.
 
Old 07-22-2012, 10:58 AM   #9
sag47
Quote:
Originally Posted by Satyaveer Arya
Hi Sam,

Can you please tell me how to remove the dot (.) in front of all those links? Otherwise it would take so long.

I would be very thankful to you.
I'm not sure I understand you. The script I gave you automatically *removes* the dots in front of all the links and then *automatically* adds the dot back after testing is done. You should read the documentation I linked you to and try to understand what it is that you are running on your own system. It is your ethical and moral responsibility as someone who is managing that system, *especially* if that system houses other people's data or logins.
 
Old 07-22-2012, 08:47 PM   #10
Satyaveer Arya
Original Poster
Hi Sam,

I am sorry for asking that stupid question. Yes, in your first post you showed how you removed the dot (.), checked the website, and then put the dot (.) back. Right?
Pal, I am not proficient in shell scripting and am still learning a lot. So far I know only some basics, and I keep going.

But can you please share your idea of how to put the working links in one file and the inactive links in another after checking them? When I ran the command from your first post, it was putting the active and inactive links in the same file.

I will be very thankful to you, pal!

Last edited by Satyaveer Arya; 07-22-2012 at 08:48 PM.
 
Old 07-23-2012, 08:53 PM   #11
sag47
Quote:
Originally Posted by Satyaveer Arya
I am sorry for asking that stupid question. Yes, in your first post you showed how you removed the dot (.), checked the website, and then put the dot (.) back. Right?
Pal, I am not proficient in shell scripting and am still learning a lot. So far I know only some basics, and I keep going.
As they say there is no such thing as a dumb question. However, I would appreciate it if you actually read the answers I give you.

man bash; see sections: Compound commands, Exit status, conditional expressions.

man test (or man [) for additional information on the conditional expressions.

Quote:
Originally Posted by Satyaveer Arya
But can you please share your idea of how to put the working links in one file and the inactive links in another after checking them? When I ran the command from your first post, it was putting the active and inactive links in the same file.
See this section on exit status as well. I am not separating "active" and "inactive" links. See the "exit codes" section of the curl man page.

So look back at my original script. What is going on is this: I am opening the link with curl. If curl can connect and the server returns an HTTP response, curl exits with zero (0), as in success. Note that without the -f (--fail) option curl exits with zero even when the response is an error status such as 404, so the test shows that a web server answered, not that the page returned 200. Notice in my original script that I only output if the command is successful. This means that only links which were successfully tested were output to the file, and any links that failed were not printed (i.e. discarded). Let me break it down into a more readable script (with some helpful changes); maybe it will make a little more sense if it is more than a one-liner.

Code:
#!/bin/bash
sed 's/^\.//' proxy_sites.txt | while read site;do 
  curl "http://$site/" &> /dev/null
  if [ $? -eq 0 ];then
    #exit status for curl was zero, as in success
    #output the tested site into a working text file
    echo ".$site" >> working_sites.txt
  else
    #exit status was anything but zero, as in failure
    #output the tested site into a "broken" site text file
    echo ".$site" >> broken_sites.txt
  fi
done
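The same exit-status branching can be sketched in Python 3 for comparison. This is an assumption-laden illustration rather than a replacement for the script above: it assumes curl is on the PATH and Python 3.5 or later, the 10 second `--max-time` value is arbitrary, and `curl_exit_status` and `split_by_status` are hypothetical names.

```python
import subprocess


def curl_exit_status(host, timeout=10):
    """Run curl against http://host/ and return its exit status (0 = success)."""
    result = subprocess.run(
        ["curl", "--silent", "--max-time", str(timeout), "http://%s/" % host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def split_by_status(hosts, exit_status=curl_exit_status):
    """Sort hosts into (working, broken) lists, restoring the leading dot."""
    working, broken = [], []
    for host in hosts:
        if exit_status(host) == 0:
            # Exit status zero: curl succeeded.
            working.append("." + host)
        else:
            # Any non-zero exit status counts as a failure.
            broken.append("." + host)
    return working, broken
```

The two returned lists correspond to working_sites.txt and broken_sites.txt in the shell version; writing them out is left to the caller.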
Perhaps, to get a better understanding of bash and its inner workings, you should review (I know, blah) some more documentation. Here are two good guides that help people with no scripting experience get up to speed: the Bash Guide for Beginners and the Advanced Bash-Scripting Guide.
It doesn't help you much for me to point you to the man pages for all the specific information if you don't see the big picture. Those scripting guides will help you see the big picture. I still point you to the man pages specifically because I am showing you that all this information exists on your local GNU/Linux system, without any Google searching required, so you know where to find it.

It may sound frustrating that I won't outright give you exactly what you want, but it doesn't do anyone any good if I do that, and the way you want to go about it is bad practice in my opinion. Plus, I basically did give you exactly what you want; you just needed to put the final two puzzle pieces together (i.e. my original script and the sed -i command).

Give a man a fish he eats for a day. Teach a man to fish he eats for a lifetime.

Last edited by sag47; 07-23-2012 at 09:04 PM.
 
Old 07-24-2012, 06:33 PM   #12
Satyaveer Arya
Original Poster
Hi Sam,

First of all, thanks again for your valuable time and information.
I am already reading both online books, i.e. the Bash Guide for Beginners and the Advanced Bash-Scripting Guide.

Quote:
One thing you should be aware of is that my sample script assumes all websites are http protocol running over port 80. It does not take into account other protocols or ports
You said that your script assumes all websites are http protocol running over port 80. So, is there any way to check the websites which do not run over port 80?
 
Old 07-25-2012, 12:33 AM   #13
sag47
Quote:
Originally Posted by Satyaveer Arya
Hi Sam,

First of all, thanks again for your valuable time and information.
I am already reading both online books, i.e. the Bash Guide for Beginners and the Advanced Bash-Scripting Guide.

You said that your script assumes all websites are http protocol running over port 80. So, is there any way to check the websites which do not run over port 80?
There is not an easy way to accomplish this. One way to go about it is to be very selective about which ports you're going to scan and then test them for listening sockets. Testing for listening sockets will not tell you specifically which protocol is running on the port, but it will give you an idea that *something* is listening. If you're trying to filter a server, then really *anything* listening could be considered a bad thing on said server. Here's a simple Python script showing how one might accomplish that.

Code:
#!/usr/bin/env python
# By Sam Gleske
# http://www.gleske.net/
# MIT Open Source License - http://opensource.org/licenses/mit-license.php

# This tests the status of arbitrary listening ports from a common_ports list.
# Python 2.7.2

import socket
from sys import exit
from sys import argv

common_ports=[80,443,8080,8443,8000,8081,444,1080,2301,3382,7777]

def isonline_sockettest(host,port):
  host = str(host)
  port = int(port)
  s=socket.socket(socket.AF_INET,socket.SOCK_STREAM)
  #if it's slower than one second then assume it's off
  s.settimeout(1)
  try:
    s.connect((host,port))
    s.shutdown(2)
    #socket is listening
    return True
  except socket.error:
    #no listening socket (i.e. some service is turned off),
    #or the host name could not be resolved
    return False
  finally:
    #always release the socket, whether or not the connect worked
    s.close()

def main():
  bad_server = True
  for port in common_ports:
    print "Testing port %d..." % port
    if isonline_sockettest(argv[1],port):
      bad_server = False
      break
  if bad_server:
    print "%s is down" % argv[1]
    exit(1)
  else:
    print "%s is up" % argv[1]
    exit(0)


if __name__ == "__main__":
  main()
Sample usage:
Code:
somescript.py google.com
somescript.py fakeserver.server.com
Again, as I stated before, I arbitrarily chose ports on which services like proxies commonly listen. This list can easily be modified, and the script exit codes conform to posix standards. You can replace the curl command with the Python script. This will take considerably longer, because for each port that does not answer, the connection attempt can wait for the full 1 second timeout.

I say that because the more ports you check, the more tests are bound to fail. In the worst case my script will run for (number of tested ports)*(number of tested servers) seconds (e.g. 11 ports tested against 1000 servers gives a worst-case runtime of 11000 seconds). Not really ideal.
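The worst-case estimate is simple enough to check with a few lines of Python 3. The sketch below assumes every connection attempt waits for the full 1 second timeout and ignores DNS lookups; the 67,260 figure is the list size given in the first post, and `worst_case_seconds` is just an illustrative name.

```python
def worst_case_seconds(servers, ports, timeout_seconds=1):
    """Upper bound when every connection attempt waits for the full timeout."""
    return servers * ports * timeout_seconds


print(worst_case_seconds(1000, 11))   # 11000

full_list = worst_case_seconds(67260, 11)
print(full_list)                      # 739860
print(round(full_list / 86400, 1))    # 8.6 (days)
```

So a strictly sequential scan of the whole list against all 11 ports could, in the worst case, run for more than a week, which is why trimming the port list matters.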

This script is also a rough prototype thrown together to show you a proof of concept to what I was saying. Before running it on any production system you should do additional research and testing with the script.

SAM

Last edited by sag47; 07-25-2012 at 12:53 AM.
 
Old 08-13-2012, 09:06 AM   #14
Satyaveer Arya
Original Poster
Hi Sam,

I used the Python script you posted above, but every time I check a website it says that the link is down.
For example, when I checked google.com, it said google.com is down.
Why is that?


Thank You!
 
Old 08-13-2012, 01:35 PM   #15
sag47
Not sure, did you try checking with nmap against google.com too? Depending on the part of the world you're in they may block it and just allow your regional google domain. Either way, the prototype script works for me so you should try the script with more verbose error output (-v or -vv options for python, see man page). It is also worth noting that I wrote that script for Python 2.7. So if you're using it with Python 3 I have no idea if it would even work. Py3 is almost like a new language compared to 2.7/2.6.

I just tested the script against google.com, yahoo.com, and amazon.com. It works for all of them. If nmap works with google.com and my script doesn't, then you are likely a) using the wrong version of Python or b) missing a library that is required. If neither nmap nor my script works, then you're likely being blocked by that website. In that case try choosing a different website.
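For anyone who only has Python 3 available, a rough equivalent of the socket test might look like the sketch below. It is not the script that was tested above: it assumes Python 3.3 or later, reuses the same arbitrary port list, and the names `is_listening`, `first_open_port`, and `main` are chosen here for illustration.

```python
import socket
import sys

COMMON_PORTS = [80, 443, 8080, 8443, 8000, 8081, 444, 1080, 2301, 3382, 7777]


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (OSError, UnicodeError):
        # Refused, timed out, unreachable, or the name could not be resolved.
        return False


def first_open_port(host, ports=COMMON_PORTS, timeout=1.0):
    """Return the first port in ports that accepts a connection, else None."""
    for port in ports:
        if is_listening(host, port, timeout):
            return port
    return None


def main(argv):
    """Print whether argv[1] is up; return 0 for up, 1 for down, 2 for misuse."""
    if len(argv) != 2:
        print("usage: %s HOST" % argv[0], file=sys.stderr)
        return 2
    port = first_open_port(argv[1])
    if port is None:
        print("%s is down" % argv[1])
        return 1
    print("%s is up (port %d)" % (argv[1], port))
    return 0
```

To use it as a command, end the file with `sys.exit(main(sys.argv))`. If this version also reports every site as down, the cause is more likely the network path (for example a firewall that only allows traffic through the proxy) than the Python version.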

Last edited by sag47; 08-13-2012 at 01:37 PM.
 
  


All times are GMT -5.
