Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
06-02-2008, 11:21 PM
#1
Member
Registered: Nov 2006
Posts: 89
filtering a list of domain names for subdomains
Hi all,
I don't know of a better place to ask a question like this than here. I have a text file of domain names. What I want to do with it is weed out any subdomain duplicates. For example:
mydomain.com
server1.mydomain.com
server2.mydomain.com
www.server2.mydomain.com
yourdomain.com
www.yourdomain.com
server1.yourdomain.com
www.hisdomain.com
ssl.www.hisdomain.com
ftp.www.hisdomain.com
theirdomain.co.uk
www.theirdomain.co.uk
ftp.theirdomain.co.uk
What I'd want to come out is:
mydomain.com
yourdomain.com
www.hisdomain.com
theirdomain.co.uk
I'd guess there's probably some way to do this with regular expressions and the like, but I don't know exactly how. The thing is, I could easily use cut to strip down to the domain plus tld, but there are two problems: 1. some domains have two tokens of "tld" while others have only one (co.uk versus .net or .com), and 2. there might be domains below the base domain plus tld that all match, in which case I'd want to keep the more specific name. I want the file to be as specific as possible without duplicating things unnecessarily. Is this possible?
Thanks for the help
F
06-04-2008, 05:51 AM
#2
LQ Addict
Registered: Jul 2002
Location: East Central Illinois, USA
Distribution: Debian Squeeze
Posts: 5,570
Quote:
I could easily use cut to strip down to the domain plus tld,
then pipe the list through uniq.
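For reference, a rough Python equivalent of that cut-then-uniq idea. This is only a sketch; the function name last_two_labels is made up for illustration. It keeps the last two dot-separated labels of each name and drops repeats. (In the shell, uniq only collapses adjacent duplicate lines, so the list would normally go through sort first.)

```python
def last_two_labels(domains):
    """Reduce each name to its last two labels and drop repeats (order kept)."""
    seen = set()
    result = []
    for raw in domains:
        labels = raw.strip().lower().split(".")
        base = ".".join(labels[-2:])
        # Skip blank lines and anything already emitted.
        if base and base not in seen:
            seen.add(base)
            result.append(base)
    return result
```

On the sample list from the first post this gives hisdomain.com rather than www.hisdomain.com, and co.uk rather than theirdomain.co.uk, which are exactly the two problems the original question raises.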
06-04-2008, 10:58 AM
#3
Senior Member
Registered: Oct 2004
Location: Houston, TX (USA)
Distribution: MEPIS, Debian, Knoppix
Posts: 4,727
There is a solution. It's not simple, it does involve regexes, and uniq is not at the core of it.
This sounds like a problem I worked on about 3 years ago to extract unique domain names from published hosts file (black)lists. I filter ads etc. for my whole LAN at a firewall using dnsmasq's config file, not a hosts file.
Unlike a hosts file, dnsmasq.conf can block entire domains without listing each individual host or sub-domain. This usually results in at least 95% shrinkage in the "distillation" process.
Be patient, I'll try to dig out my code and post it for you.
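A minimal Python sketch of one way to do this kind of distillation. It is not the code mentioned above, and the names drop_subdomains and EXAMPLE are made up for illustration. It assumes the rule is: keep a name only if none of its parent domains also appears in the list. That sidesteps the question of which TLDs have two tokens, because the comparison is only ever against names that are actually in the file.

```python
def drop_subdomains(domains):
    """Keep only names that have no parent domain elsewhere in the list."""
    # Normalise: trim whitespace, lowercase, drop a trailing dot, skip blanks.
    names = []
    seen = set()
    for raw in domains:
        name = raw.strip().lower().rstrip(".")
        if name and name not in seen:
            seen.add(name)
            names.append(name)

    kept = []
    for name in names:
        labels = name.split(".")
        # Every proper parent of a.b.c, i.e. b.c and c.
        parents = (".".join(labels[i:]) for i in range(1, len(labels)))
        if not any(parent in seen for parent in parents):
            kept.append(name)
    return kept


# The sample list from the first post.
EXAMPLE = [
    "mydomain.com", "server1.mydomain.com", "server2.mydomain.com",
    "www.server2.mydomain.com", "yourdomain.com", "www.yourdomain.com",
    "server1.yourdomain.com", "www.hisdomain.com", "ssl.www.hisdomain.com",
    "ftp.www.hisdomain.com", "theirdomain.co.uk", "www.theirdomain.co.uk",
    "ftp.theirdomain.co.uk",
]

for name in drop_subdomains(EXAMPLE):
    print(name)
```

On that sample this prints mydomain.com, yourdomain.com, www.hisdomain.com and theirdomain.co.uk, one per line. One caveat of this rule: if a bare suffix such as com or co.uk is itself a line in the file, everything beneath it is dropped.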