LinuxQuestions.org


motherboard 08-28-2019 09:25 AM

grep help
 
Code:

blah blah blah www.website1.com blah blah blah
asdf www.website2.com asdf

How do I get grep to print only the website name and ignore everything before www. and everything after .com?

berndbausch 08-28-2019 09:33 AM

grep is not the tool for cutting lines into smaller pieces. It's the tool for filtering lines.

You want sed or awk. Something like this (not sure if it works):
Code:

sed 's/.*\(www.*com\).*/\1/'
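As a quick check, running the first sample line through it should print just the address (note that lines without a match would pass through unchanged unless you add -n and a p flag):

Code:

echo "blah blah blah www.website1.com blah blah blah" | sed 's/.*\(www.*com\).*/\1/'
www.website1.com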

Turbocapitalist 08-28-2019 09:38 AM

Quote:

Originally Posted by motherboard (Post 6030573)
How do I get grep to print only the website name and ignore everything before www. and everything after .com?

Take a look at the -w and -o options together.

If your patterns get more complex take a look at -E also. Then, if you have to, escalate to using -P for PCRE.
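For the sample lines, something roughly like this should do it (the file name is only a placeholder):

Code:

grep -wo 'www\.[[:alnum:].-]*\.com' file.txt
Here -o prints only the matched part of each line, and -w keeps the match anchored to whole words.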

Firerat 08-28-2019 09:43 AM

not true, berndbausch

Code:

grep -o "www.*com"

you can also use more complex matching
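for example, something like this with extended regexps (untested sketch - the TLD list and file name are only illustrative):

Code:

grep -oE 'www\.[[:alnum:].-]+\.(com|org|net)' file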


from the grep manpage
Quote:

Matcher Selection
-E, --extended-regexp
Interpret PATTERNS as extended regular expressions (EREs, see below).

-F, --fixed-strings
Interpret PATTERNS as fixed strings, not regular expressions.

-G, --basic-regexp
Interpret PATTERNS as basic regular expressions (BREs, see below). This is the default.

-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z
(--null-data) option, and grep -P may warn of unimplemented features.



however, another tool may be better suited to the task... it really depends on what else you want to do

berndbausch 08-28-2019 10:43 AM

Quote:

Originally Posted by Firerat (Post 6030582)
not true berndbausch

Code:

grep -o "www.*com"

True, this works wonderfully.

Firerat 08-28-2019 10:57 AM

yeah, it can get messy: the .* is greedy, so if you happen to have two web addresses on a single line you end up with both and the junk in between.

but the same is true with the sed

awk would be better since you could loop through each field
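something like this, maybe (untested sketch - assumes the addresses are whitespace-separated fields and FILE is your input):

Code:

awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^www\..*\.com$/) print $i }' FILE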

perl is probably the natural tool for the job
but I don't know perl

Turbocapitalist 08-28-2019 11:06 AM

Quote:

Originally Posted by Firerat (Post 6030617)
yeah, it can get messy: the .* is greedy, so if you happen to have two web addresses on a single line you end up with both and the junk in between.

grep has substantial, but not complete, support for PCRE. So you could try it like this:

Code:

grep -w -P -o 'www\..*?\.com'
With the examples given, -w is not strictly necessary but I figure it will come in handy just in case.
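Against the sample lines from the first post it should print just the addresses:

Code:

printf '%s\n' 'blah blah blah www.website1.com blah blah blah' 'asdf www.website2.com asdf' | grep -w -P -o 'www\..*?\.com'
www.website1.com
www.website2.com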

Sefyir 08-28-2019 11:12 AM

It can be pretty hard to match domains

Code:

./FILE
blah blah blah website1.com blah blah blah website2.com blah blah
asdf www.website3.org asdf

blah blah blah www.website4.com blah blah blah
asdf www.website5.com asdf

Code:

grep -woE '(?:www\.)?\w+\.[a-z]{3,4}' ./FILE
website1.com
website2.com
website3.org
website4.com
website5.com

This explains what each part does
regexr.com/4k0pk

boughtonp 08-28-2019 03:42 PM

Not all domain extensions are matched by \.[a-z]{3,4} - most notably the two-letter country-code ones (e.g. .uk or .de).

Also, \w includes underscores (not valid in domains) but not hyphens (which are), so I'd probably go with:

Code:

grep -owEi '[a-z0-9-]+(\.[a-z0-9-]+)+' ./FILE
With the i for case-insensitivity.

And, if the use-case calls for it, filter the output through something that does a DNS lookup to confirm actual domains.
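Something along these lines, perhaps (rough sketch - getent hosts is one way to do the lookup; host or dig would work just as well):

Code:

grep -owEi '[a-z0-9-]+(\.[a-z0-9-]+)+' ./FILE | sort -u | while read -r name; do
    getent hosts "$name" > /dev/null && echo "$name"
done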

