Old 03-23-2013, 08:09 PM   #1
LQ Newbie
Registered: Mar 2013
Posts: 25

Rep: Reputation: Disabled
How to remove forward slash and everything after with sed or awk?

OK guys, I've got a list of URLs and I need to parse just the domains out of them and get rid of all the trailing crap.

I've got a list of lines like this

And I need to end up with

Also, this is a massive list of over 2 million URLs, so it's not limited to .com, .ru, .ch, .pl, and .biz. So it needs to remove the first forward slash and everything after it on each line, regardless of the extension.

Thank you to the wizard who knows the answer in advance.
Old 03-23-2013, 09:06 PM   #2
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
sed 's:/.*::' file
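
Since the question also mentions awk, here is a rough equivalent as a sketch: split each line on "/" and keep only the first field. The function name first_field and the example.* lines are made up for illustration; the real input would be the file of URLs, and this assumes the lines have no http:// prefix.

```shell
# Hypothetical helper: print everything before the first "/" on each line.
first_field() {
    awk -F/ '{ print $1 }'
}

# Made-up sample lines standing in for the real list.
printf '%s\n' 'example.com/index.html' 'example.ru/a/b/c' 'example.biz' | first_field
```

cut -d/ -f1 file should give the same result with the same field-splitting idea.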
Old 03-23-2013, 09:10 PM   #3
LQ Newbie
Registered: Mar 2013
Posts: 19

Rep: Reputation: Disabled
If your list is formatted with no http:// at the beginning then this should work.
sed 's/\/.*//g' file_with_urls
This will give you a preview of the output. If you are happy with the results then redirect the output to another file.

sed 's/\/.*//g' file_with_urls > output_file

sed -i 's/\/.*//g' file_with_urls
The -i option edits the file in place, overwriting the original with the changes, so check the preview first.
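
If some lines do start with http:// (or another scheme), a possible variation is to strip the scheme first and then cut at the first slash. This is only a sketch: the function name strip_path and the sample lines are made up, and it assumes the scheme consists of letters only.

```shell
# Hypothetical helper: drop an optional leading scheme such as http://,
# then remove the first "/" and everything after it.
strip_path() {
    sed -e 's|^[A-Za-z]*://||' -e 's|/.*||'
}

# Made-up sample lines, with and without a scheme.
printf '%s\n' 'example.com/index.html' 'http://example.org/a/b?c=d' 'example.net' | strip_path
```

Using | as the delimiter avoids having to escape the slashes in the pattern.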

Last edited by freebsd_Rules_All_OSes; 03-23-2013 at 09:14 PM.
Old 03-23-2013, 10:04 PM   #4
LQ Newbie
Registered: Mar 2013
Posts: 25

Original Poster
Rep: Reputation: Disabled
sed 's:/.*::' file

Did exactly what I needed

Also, I didn't know about the -i option, that's awesome.

thanks guys.

Hmm, I have another, more complicated one. Anyone care to have a swing at it?

Ok, so this huge list of sites is like this.

I need a way to remove subdomains, but without destroying items like,

So I figure I need to remove anything before .*.* but only on lines that do not contain multiple entries like and

The whole idea is to have a list of top level domains. But retaining, and and all similar extensions
Old 03-25-2013, 02:15 AM   #5
LQ Guru
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,245

Rep: Reputation: 2327
Unfortunately, I think you'll need a list of all the second-level domain codes to keep, i.e. your program needs to know whether e.g. '' is a TLD (i.e. you only want .br) or an allowable second-level one.
Similarly, given that .it is the ccTLD for Italy, is there an equivalent to e.g.
For that I'd use Perl, although others might use awk.
I think sed would be a stretch for this more general problem.

This is a problem where the code needs to know stuff, as opposed to just truncating at a known marker.
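
As a rough illustration of "the code needs to know stuff", here is a minimal awk sketch. It assumes you supply your own file of two-label suffixes to keep (one per line); the function name registrable_domain, the suffix file, and the example.* hostnames are all hypothetical. A maintained list such as the Public Suffix List could be the source of the suffixes, but it also has wildcard and exception rules that this sketch ignores.

```shell
# Hypothetical helper: reduce each hostname on stdin to "name.suffix".
# $1 is a file of two-label suffixes to keep (e.g. co.uk), one per line.
registrable_domain() {
    awk -v suffix_file="$1" '
        BEGIN {
            # Load the known two-label suffixes into a lookup table.
            while ((getline line < suffix_file) > 0)
                if (line != "") two_level[line] = 1
            close(suffix_file)
        }
        {
            n = split($0, label, ".")
            # Nothing to strip if there are only one or two labels.
            if (n <= 2) { print; next }
            last_two = label[n-1] "." label[n]
            if (last_two in two_level)
                print label[n-2] "." last_two   # keep name + two-label suffix
            else
                print last_two                  # keep name + single-label TLD
        }
    '
}

# Made-up suffix list and hostnames for a quick trial run.
suffixes=$(mktemp)
printf '%s\n' 'co.uk' 'com.br' > "$suffixes"
printf '%s\n' 'www.example.com' 'shop.example.co.uk' 'example.com.br' | registrable_domain "$suffixes"
rm -f "$suffixes"
```

The lookup table is what separates this from the earlier sed one-liner: without it there is no way to tell a subdomain from a two-label extension by looking at the dots alone.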
