LinuxQuestions.org
Old 06-02-2008, 11:21 PM   #1
fmillion
Member
 
Registered: Nov 2006
Posts: 91

Rep: Reputation: 27
filtering a list of domain names for subdomains


Hi all,
I don't know of a better place to ask a question like this than here. I have a text file of domain names. What I want to do with it is weed out any subdomain duplicates. For example:

mydomain.com
server1.mydomain.com
server2.mydomain.com
www.server2.mydomain.com
yourdomain.com
www.yourdomain.com
server1.yourdomain.com
www.hisdomain.com
ssl.www.hisdomain.com
ftp.www.hisdomain.com
theirdomain.co.uk
www.theirdomain.co.uk
ftp.theirdomain.co.uk

What I'd want to come out is:
mydomain.com
yourdomain.com
www.hisdomain.com
theirdomain.co.uk

I'd guess there's probably some way to do this with regular expressions and the like, but I don't know exactly how. I could easily use cut to strip each name down to the domain plus TLD, but there are two problems: 1. some domains have a two-token TLD while others have only one (co.uk versus .net or .com), and 2. there might be domains below the base domain plus TLD that should all collapse to one entry. I want the file as specific as possible without duplicating things unnecessarily. Is this possible?

Thanks for the help
F
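One way to sketch this, not with regexes but in Python (the function name `minimal_domains` is illustrative): process names with the fewest labels first, and keep an entry only if no ancestor of it (at a dot boundary) has already been kept. This sidesteps the two-token-TLD problem, because it only collapses a name onto an ancestor that actually appears in the list — theirdomain.co.uk is kept, since bare co.uk is never in the input.

```python
def minimal_domains(domains):
    """Keep only entries that are not subdomains of another listed entry."""
    kept = set()
    # Fewest labels first, so a parent is always seen before its subdomains.
    for d in sorted(set(domains), key=lambda name: name.count(".")):
        labels = d.split(".")
        # Check every parent suffix: for a.b.c that is b.c, then c.
        if not any(".".join(labels[i:]) in kept for i in range(1, len(labels))):
            kept.add(d)
    return sorted(kept)

sample = ["mydomain.com", "server1.mydomain.com", "www.server2.mydomain.com",
          "yourdomain.com", "www.yourdomain.com", "server1.yourdomain.com",
          "www.hisdomain.com", "ssl.www.hisdomain.com", "ftp.www.hisdomain.com",
          "theirdomain.co.uk", "www.theirdomain.co.uk", "ftp.theirdomain.co.uk"]
print("\n".join(minimal_domains(sample)))
```

The output is the four entries from the example above, in sorted rather than original order. Note that www.hisdomain.com survives (not hisdomain.com), because the bare domain never appears in the input — only its common ancestor does.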
 
Old 06-04-2008, 05:51 AM   #2
bigrigdriver
LQ Addict
 
Registered: Jul 2002
Location: East Centra Illinois, USA
Distribution: Debian Squeeze
Posts: 5,766

Rep: Reputation: 307
Quote:
I could easily use cut to strip down to the domain plus tld,
then pipe the list through uniq.
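For illustration, a naive version of that pipeline might look like this (a sketch; `sort -u` stands in for the `sort | uniq` step, since uniq only removes adjacent duplicates). It also demonstrates the two-token-TLD problem from the original post:

```shell
# Keep only the last two dot-separated labels of each name, then dedupe.
printf '%s\n' mydomain.com server1.mydomain.com theirdomain.co.uk |
    rev | cut -d. -f1,2 | rev | sort -u
# theirdomain.co.uk is truncated to co.uk -- the two-token TLD
# problem described in the original post.
```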
 
Old 06-04-2008, 10:58 AM   #3
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 230
There is a solution; it's not simple, it does involve regexes, and uniq is not at the core of it.

This sounds like a problem I worked on about 3 yrs. ago to extract unique domain names from published hosts file (black)lists. -- I filter ads etc. for my whole LAN at a firewall using dnsmasq's config file, not a hosts file.

Unlike a hosts file, dnsmasq.conf can block entire domains w/o listing each individual host or sub-domain. This usually results in at least 95% shrinkage in the "distillation" process.

Be patient, I'll try to dig out my code & post it for you.
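For context on why dnsmasq needs so few entries: a single `address=` line in dnsmasq.conf covers a domain and every subdomain of it. A minimal illustration, using a domain from the example above:

```
# Resolve mydomain.com and all of its subdomains to 0.0.0.0.
address=/mydomain.com/0.0.0.0
```

A hosts file would instead need one line per individual hostname, which is why distilling a hosts blacklist down to unique base domains shrinks it so dramatically.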
 
  

