LinuxQuestions.org
Old 03-23-2013, 07:09 PM   #1
CaptainDerp
LQ Newbie
 
Registered: Mar 2013
Posts: 12

Rep: Reputation: Disabled
How to remove forward slash and everything after with sed or awk?


OK guys, I've got a list of URLs and I need to parse just the domains from them and get rid of all the trailing crap.

I've got a list of lines like this:

somebullshit.com/some/crap/i/want/to/remove
someothershit.biz/crap/poop.html
random.ru/shiz/myboody.php
moar.ch/caca.html
someotherstuff.pl/blah/blah/blah

And I need to end up with

somebullshit.com
someothershit.biz
random.ru
moar.ch
someotherstuff.pl

Also, this is a massive list of over 2 million URLs, so it's not limited to .com, .ru, .ch, .pl, and .biz. It needs to remove the first forward slash and everything after it on each line, whatever the extension.

Thanks in advance to the wizard who knows the answer.
 
Old 03-23-2013, 08:06 PM   #2
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941
Code:
sed 's:/.*::' file
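For anyone reading later, a short note on why this works: the `:` right after `s` is only the delimiter, so the `/` in the pattern needs no escaping, and `/.*` matches from the first slash to the end of the line, which is then replaced with nothing. Since the title also asks about awk, here is a rough sketch of equivalent one-liners. The file name `urls.txt` and its contents are made up for illustration, not taken from the original list.

```shell
# Build a small sample file (hypothetical name and contents).
printf '%s\n' 'example.com/some/path' 'example.org/a/b.html' 'example.net' > urls.txt

# sed: delete the first slash and everything after it.
sed 's:/.*::' urls.txt

# awk: split each line on "/" and print the first field.
awk -F/ '{ print $1 }' urls.txt

# cut: same idea, keep the first "/"-delimited field.
cut -d/ -f1 urls.txt
```

All three leave a line with no slash unchanged, so already-clean entries should pass through as they are.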
 
Old 03-23-2013, 08:10 PM   #3
freebsd_Rules_All_OSes
LQ Newbie
 
Registered: Mar 2013
Posts: 19

Rep: Reputation: Disabled
If your list is formatted with no http:// at the beginning then this should work.
Code:
sed 's/\/.*//g' file_with_urls
This will give you a preview of the output. If you are happy with the results then redirect the output to another file.

Code:
sed 's/\/.*//g' file_with_urls > output_file
or

Code:
sed -i 's/\/.*//g' file_with_urls
The -i option overwrites the original file with the changes.
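Two caveats, sketched under assumptions. If some lines do start with a scheme such as http://, the plain command cuts at the first slash of the `//` and leaves only `http:`, so stripping the scheme first avoids that. Also, a bare `-i` is GNU sed syntax; writing it with an attached suffix, as in `-i.bak`, should be accepted by both GNU and BSD sed and keeps a backup of the original. The file name `file_with_urls` comes from the post; the sample contents below are invented.

```shell
# Sample data (invented for illustration).
printf '%s\n' 'http://example.com/a/b' 'https://example.org/x.html' 'example.net/y' > file_with_urls

# First expression drops a leading scheme such as http:// if one is present.
# Second expression drops the first remaining slash and everything after it.
# -i.bak edits in place and keeps the original as file_with_urls.bak.
sed -i.bak -e 's#^[A-Za-z][A-Za-z0-9+.-]*://##' -e 's#/.*##' file_with_urls

cat file_with_urls
```

If the result looks wrong, the untouched input is still available in `file_with_urls.bak`.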

Last edited by freebsd_Rules_All_OSes; 03-23-2013 at 08:14 PM.
 
Old 03-23-2013, 09:04 PM   #4
CaptainDerp
LQ Newbie
 
Registered: Mar 2013
Posts: 12

Original Poster
Rep: Reputation: Disabled
sed 's:/.*::' file

Did exactly what I needed.

Also, I didn't know about the -i option. That's awesome.

Thanks, guys.


Hmm, I have another, more complicated one. Anyone care to take a swing at it?


OK, so this huge list of sites looks like this:

laui.somesite.com
lau-immobilien.de
laurapausini.fanspace.it
lauraroebuck.com
laurenserect.ru
laurentianbankz.ca
laurianoalmeida.sites.uol.com.br
lavasoftupdate.com
lavl-vicky.com
lavvckpordclbduy.ru

I need a way to remove subdomains without destroying items like .co.uk.

So I figure I need to remove everything before the last two parts (the .*.* bit), but only on lines that do not end in a multi-part extension like .co.uk or .co.nz.

The whole idea is to end up with a list of top-level domains while retaining .co.nz, .co.uk, and all similar extensions.
 
Old 03-25-2013, 01:15 AM   #5
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,226

Rep: Reputation: 2023
Unfortunately, I think you'll need a list of all the two-level domain codes to keep, i.e. your program needs to know whether, e.g., '.com.br' should be cut down to a plain TLD (i.e. you only want .br) or kept as an allowable two-level one.
Similarly, given that .it is the ccTLD for Italy, is there an equivalent to .co.uk, e.g. .co.it?
For that I'd use Perl, although others might use awk.
I think sed would be a stretch for this more general problem.

This is a problem where the code needs to know stuff, as opposed to just truncating at a known marker.
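To make the "code needs to know stuff" point concrete, here is a minimal awk sketch, assuming you can supply a file of the multi-part extensions you want preserved. The function name `strip_subdomains`, the files `suffixes.txt` and `hosts.txt`, and the three suffixes listed are all hypothetical; a real run over 2 million lines would need a far more complete list (the Public Suffix List is one possible source). For each line it keeps the longest known suffix plus one label to its left, and falls back to treating the last label as a plain TLD.

```shell
# Hypothetical list of multi-part suffixes to preserve (far from complete).
printf '%s\n' 'co.uk' 'co.nz' 'com.br' > suffixes.txt

# Sample hostnames, partly taken from the post above.
printf '%s\n' 'laui.somesite.com' 'lau-immobilien.de' \
  'laurianoalmeida.sites.uol.com.br' 'www.example.co.uk' > hosts.txt

# $1 = suffix list, $2 = hostname list
strip_subdomains() {
  awk '
    # First file: remember every multi-part suffix.
    NR == FNR { suffix[$0] = 1; next }

    # Second file: keep the longest known suffix plus one label to its left.
    {
      n = split($0, label, ".")
      for (i = 2; i <= n; i++) {
        # Join labels i..n into a candidate suffix, longest candidate first.
        cand = label[i]
        for (j = i + 1; j <= n; j++) cand = cand "." label[j]
        # The final single label (i == n) is always treated as a plain TLD.
        if (i == n || (cand in suffix)) {
          print label[i - 1] "." cand
          next
        }
      }
      print   # a line with no dot is passed through unchanged
    }
  ' "$1" "$2"
}

strip_subdomains suffixes.txt hosts.txt
```

The quality of the result depends entirely on the suffix list: any two-level extension missing from it gets cut down to its last two labels.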
 
All times are GMT -5.