LinuxQuestions.org


motherboard 08-28-2019 09:25 AM

grep help
 
Code:

blah blah blah www.website1.com blah blah blah
asdf www.website2.com asdf

How do I get grep to print only the website name and ignore everything before www. and everything after .com?

berndbausch 08-28-2019 09:33 AM

grep is not the tool for cutting lines into smaller pieces. It's the tool for filtering lines.

You want sed or awk. Something like this (not sure if it works):
Code:

sed 's/.*\(www.*com\).*/\1/'
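As a quick check, running the first sample line through it should print just the address (note that lines without a match would pass through unchanged unless you add -n and a p flag):

Code:

echo "blah blah blah www.website1.com blah blah blah" | sed 's/.*\(www.*com\).*/\1/'
www.website1.com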

Turbocapitalist 08-28-2019 09:38 AM

Quote:

Originally Posted by motherboard (Post 6030573)
How do I get grep to print only the website name and ignore everything before www. and everything after .com?

Take a look at the -w and -o options together.

If your patterns get more complex take a look at -E also. Then, if you have to, escalate to using -P for PCRE.
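For the sample lines, something roughly like this should do it (the file name is only a placeholder):

Code:

grep -wo 'www\.[[:alnum:].-]*\.com' file.txt
Here -o prints only the matched part of each line, and -w keeps the match anchored to whole words.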

Firerat 08-28-2019 09:43 AM

not true, berndbausch

Code:

grep -o "www.*com"

you can also use more complex matching
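for example, something like this with extended regexps (untested sketch - the TLD list and file name are only illustrative):

Code:

grep -oE 'www\.[[:alnum:].-]+\.(com|org|net)' file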


from the grep manpage
Quote:

Matcher Selection
-E, --extended-regexp
Interpret PATTERNS as extended regular expressions (EREs, see below).

-F, --fixed-strings
Interpret PATTERNS as fixed strings, not regular expressions.

-G, --basic-regexp
Interpret PATTERNS as basic regular expressions (BREs, see below). This is the default.

-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z
(--null-data) option, and grep -P may warn of unimplemented features.



however, another tool may be better suited to the task... it really depends on what else you want to do

berndbausch 08-28-2019 10:43 AM

Quote:

Originally Posted by Firerat (Post 6030582)
not true berndbausch

Code:

grep -o "www.*com"

True, this works wonderfully.

Firerat 08-28-2019 10:57 AM

yeah, it can get messy: the .* is greedy, so if you happen to have two web addresses on a single line you end up with both and the junk in between.

but the same is true with the sed

awk would be better since you could loop through each field
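something like this, maybe (untested sketch - assumes the addresses are whitespace-separated fields and FILE is your input):

Code:

awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^www\..*\.com$/) print $i }' FILE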

perl is probably the natural tool for the job
but I don't know perl

Turbocapitalist 08-28-2019 11:06 AM

Quote:

Originally Posted by Firerat (Post 6030617)
yeah, it can get messy: the .* is greedy, so if you happen to have two web addresses on a single line you end up with both and the junk in between.

grep has substantial, but not complete, support for PCRE. So you could try it like this:

Code:

grep -w -P -o 'www\..*?\.com'
With the examples given, -w is not strictly necessary but I figure it will come in handy just in case.
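Against the sample lines from the first post it should print just the addresses:

Code:

printf '%s\n' 'blah blah blah www.website1.com blah blah blah' 'asdf www.website2.com asdf' | grep -w -P -o 'www\..*?\.com'
www.website1.com
www.website2.com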

Sefyir 08-28-2019 11:12 AM

It can be pretty hard to match domains

Code:

./FILE
blah blah blah website1.com blah blah blah website2.com blah blah
asdf www.website3.org asdf

blah blah blah www.website4.com blah blah blah
asdf www.website5.com asdf

Code:

grep -woE '(?:www\.)?\w+\.[a-z]{3,4}' ./FILE
website1.com
website2.com
website3.org
website4.com
website5.com

This explains what each part does
regexr.com/4k0pk

boughtonp 08-28-2019 03:42 PM

Not all domain extensions are matched by \.[a-z]{3,4} - most notably the two-letter country-code ones (e.g. .uk or .de).

Also, \w includes underscores (not valid in domains) but not hyphens (which are), so I'd probably go with:

Code:

grep -owEi '[a-z0-9-]+(\.[a-z0-9-]+)+' ./FILE
With the i for case-insensitivity.

And, if the use-case calls for it, filter the output through something that does a DNS lookup to confirm actual domains.
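Something along these lines, perhaps (rough sketch - getent hosts is one way to do the lookup; host or dig would work just as well):

Code:

grep -owEi '[a-z0-9-]+(\.[a-z0-9-]+)+' ./FILE | sort -u | while read -r name; do
    getent hosts "$name" > /dev/null && echo "$name"
done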

