How to delete inactive web links from a file?
Hi guys,
I am using RHEL 5.3 and have configured a proxy server on it. I have blocked a number of proxy sites through the proxy server. The proxy sites are stored in a file, and the list contains 67,260 links. Many of the links in that file are inactive and don't open. I was wondering if there is an automated way to delete all those inactive links from the file, for example through a script, because deleting them manually would take a huge amount of time. I have attached part of the file (the full file is larger than this forum's attachment limit), so you can take a look. So guys, any idea? :) |
It can be done with a simple one-liner.
Code:
sed 's/^\.//' proxy_sites.txt | while read site; do curl "http://$site/" &> /dev/null; if [ $? -eq 0 ]; then echo ".$site"; fi; done
Basically, I removed the leading period, tested the site with curl, and added the period back after testing. The command may take a while due to network timeouts, but it is considerably faster than doing it by hand. To save the list to a new file you can just redirect the output of the while loop into a file.
Code:
sed 's/^\.//' proxy_sites.txt | while read site; do curl "http://$site/" &> /dev/null; if [ $? -eq 0 ]; then echo ".$site"; fi; done >> new_proxy_sites.txt
|
Hi Sam,
Thanks for the update. But I don't want to save the output to a file. I have to delete those inactive links from the same file. |
You could use sed -i to edit the file in place. See the man page for sed. For example,
Code:
sed -i 's/find/replace/' "proxy sites.txt"
or, to keep a backup copy of the original,
Code:
sed -i.bak 's/find/replace/' "proxy sites.txt"
I still say that the original method I outlined for you is best: you can move or replace your working file with the temporary one. I gave you the tools you *could* use, so hack them into the tool you actually want. |
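For illustration, a minimal sketch of that "write to a temporary file, then move it over the original" approach, combining the earlier one-liner with mv, might look like this (untested; the file names are the ones used earlier in the thread):
Code:
#!/bin/bash
# Sketch only: keep the links that answer, then replace the original file.
sed 's/^\.//' proxy_sites.txt | while read site; do
    if curl "http://$site/" &> /dev/null; then
        echo ".$site"
    fi
done > new_proxy_sites.txt
# Overwrite the working file with the filtered copy once you are happy with it.
mv new_proxy_sites.txt proxy_sites.txt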
Hi Sam,
Thank you very much; you provided very valuable information. There is only one doubt I have now: in the second method in your first post, you directed the active web links to a new file, right?
Again, thank you very much Sam. It solved my problem up to an extent. :) |
Hi SAM,
|
Read the following pages: the Bash Guide for Beginners and the Advanced Bash-Scripting Guide.
The data sample you provided, "proxy sites.txt", starts each domain with a period or dot ("."). In order to properly test a domain with curl or any URL tester you must remove that leading dot. Some additional steps you can take for better understanding: look at your own data sample, then look at what I provided; it should make more sense that way. Read the man pages for any commands you don't understand, or google "howto command", which usually gives a good result for a tutorial on using the command.
One thing you should be aware of is that my sample script assumes all websites speak the http protocol over port 80. It does not take other protocols or ports into account. Considering the name of the text file, it also does not test whether the domains are actually running a proxy service. What I wrote for you is a rough *prototype*, so I leave it up to you to turn it into the tool you specifically require, as you are more aware of what your system needs than I am. A better method would be to write a simple program (e.g. in Python) which simply tests whether the server has a socket listening and doesn't care what the protocol is. If no socket is listening, ignore it. |
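As a rough sketch of that last idea (untested; using nc rather than a small Python program, with a made-up 3-second timeout, and still only checking port 80):
Code:
#!/bin/bash
# Sketch only: test whether anything is listening on port 80, regardless of protocol.
# nc, the -w 3 timeout, and checking just port 80 are assumptions for illustration.
sed 's/^\.//' proxy_sites.txt | while read site; do
    if nc -z -w 3 "$site" 80 &> /dev/null; then
        echo ".$site is listening"
    else
        echo ".$site is not listening"
    fi
done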
Hi Sam,
Can you please tell me how to remove the dot (.) in front of all those links? Otherwise it would take a very long time. I would be very thankful to you. |
Hi Sam,
I am sorry for asking that stupid question. Yes, you showed in your first post how you removed the dot (.), checked the website, and then put the dot (.) back, right? Pal, I am not proficient in shell scripting and am still learning a lot; so far I know only some basics, but I keep going. Can you please share your idea of how we can put the working links in one file and the inactive links in another file after checking those links? When I tried the command you posted in your first post, it was putting the active and inactive links in the same file. I will be very thankful to you, pal! :hattip: |
man bash; see the sections on compound commands, exit status, and conditional expressions. See man test (or man [) for additional information on the conditional expressions.
So if you look back at my original script, what is going on is: I open the link with curl. If the page exists and returns an HTTP/1.1 200 status code, then curl exits with zero (0), as in success. Notice in my original script that I only produce output if the command is successful. This means that only links which were successfully tested were output to the file, and any links that failed were not printed (i.e. discarded). Let me break it down into a more readable script (with some helpful changes) and maybe it will make a little more sense if it is more than a one-liner.
Code:
#!/bin/bash
It doesn't help you much for me to point you to the man pages for all the specific information if you don't see the big picture. Those scripting guides will help you see the big picture. I still point you to the man pages because I am showing you that all this information exists on your local GNU/Linux system, without any google searching required, so you know where to find it. It may sound frustrating that I won't outright give you exactly what you want, but it doesn't do anyone any good if I do that, and the way you want to go about it is a bad practice in my opinion. Plus, I basically did give you exactly what you want; you just needed to piece the final two puzzle pieces together (i.e. my original script and the sed -i command). Give a man a fish and he eats for a day. Teach a man to fish and he eats for a lifetime. |
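For reference, one sketch of what a multi-line version of the earlier one-liner might look like (untested; splitting the results into separate active and inactive files follows the question above, and those output file names are made-up examples):
Code:
#!/bin/bash
# Sketch only: a readable version of the curl one-liner.
# active_sites.txt and inactive_sites.txt are example names, not from the thread.
while read site; do
    host="${site#.}"                          # strip the leading dot before testing
    if curl "http://$host/" &> /dev/null; then
        echo ".$host" >> active_sites.txt     # curl exited 0: link answered
    else
        echo ".$host" >> inactive_sites.txt   # curl failed: treat the link as dead
    fi
done < proxy_sites.txt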
Hi Sam,
First of all, thanks again for your valuable time and the valuable information. :hattip: I am already reading both of the online books, i.e. the Bash Guide for Beginners and the Advanced Bash-Scripting Guide.
|
Code:
#!/usr/bin/env python
Code:
somescript.py google.com
I say that because you're bound to fail a test with the greater number of ports you wish to check. This is one of those deals where you have to weigh whether the processing time is even worth checking all those ports. In the worst case my script will run for (number of tested ports) * (number of tested servers) seconds; e.g. 11 ports tested against 1000 servers gives a worst-case runtime of 11,000 seconds. Not really ideal. This script is also a rough prototype, thrown together as a proof of concept of what I was saying. Before running it on any production system you should do additional research and testing with the script. SAM |
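If the script is saved as somescript.py, as in the usage example above, one way it might be run over the whole list is the following sketch (untested; it reuses the earlier sed command to strip the leading dots):
Code:
sed 's/^\.//' proxy_sites.txt | while read site; do python somescript.py "$site"; done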
Hi Sam,
I used your above-mentioned Python script, but every time I try to check a website it says the link is down. For example, when I checked google.com it said google.com is down. Why is that? Thank you! |
Not sure; did you also try checking google.com with nmap? Depending on the part of the world you're in, they may block it and only allow your regional Google domain. Either way, the prototype script works for me, so you should try running the script with more verbose output (the -v or -vv options to python; see the man page). It is also worth noting that I wrote that script for Python 2.7, so if you're using it with Python 3 I have no idea whether it would even work. Py3 is almost like a new language compared to 2.7/2.6.
I just tested the script against google.com, yahoo.com, and amazon.com, and it works for all of them. If nmap works with google.com and my script doesn't, then you are likely a) using the wrong version of Python or b) missing a library that is required. If neither nmap nor my script works, then you're likely being blocked by that website; in that case, try a different website. |