LinuxQuestions.org

How to exclude all special characters using regex? (https://www.linuxquestions.org/questions/programming-9/how-to-exclude-all-speacial-characters-using-regex-4175657614/)

blason 07-18-2019 07:44 AM

How to exclude all special characters using regex?
 
Hi Folks,

I need to exclude special characters from a file and keep only
[a-zA-Z0-9] . -

In fact, I want to keep only domain names and exclude all special characters.

I am not able to achieve this.

~`!@#$%^&*()_+={}[]\|;:'"<,>/?

Can someone please help?

pan64 07-18-2019 07:45 AM

which language is it? do you have any written code already?

BW-userx 07-18-2019 07:52 AM

a quick search

blason 07-18-2019 07:57 AM

Quote:

Originally Posted by pan64 (Post 6016202)
which language is it? do you have any written code already?

I need that in bash. Dang, I am not a regex pro; I am giving it my best and failing :(
Any hints?

blason 07-18-2019 07:59 AM

My sample text would be

example.com
test.com
test123.com
123test.ocm
calid-domain.com
test-test.net
!def
@fsf
dafsrf#
fffgg$.net
%rrt.com
^testcom
asddf&.net
as*
(
)
_
+
=
\
;
:
'
"
<
,
>
?
/

TB0ne 07-18-2019 08:23 AM

Quote:

Originally Posted by blason (Post 6016211)
Need that in bash, dang I am not regex pro but giving my best and failing :(
Any hint?

You've been posting things like this for a good while now:
https://www.linuxquestions.org/quest...es-4175657403/
https://www.linuxquestions.org/quest...nd-4175656948/
https://www.linuxquestions.org/quest...rs-4175655180/
https://www.linuxquestions.org/quest...ng-4175648204/
https://www.linuxquestions.org/quest...es-4175641557/
https://www.linuxquestions.org/quest...pt-4175635666/
https://www.linuxquestions.org/quest...pt-4175616729/

Show your own efforts when posting, and do basic research. After three years, you should have SOME scripting/research skills.

Putting "bash regex strip out anything but letters and numbers" into Google pulls up a LOT of 'hints'. You've been told many times to post things in CODE tags, but don't seem to follow that advice either. The [:alnum:] character class matches alphanumeric characters.

blason 07-18-2019 08:30 AM

Quote:

Originally Posted by TB0ne (Post 6016222)
You've been posting things like this for a good while now:
https://www.linuxquestions.org/quest...es-4175657403/
https://www.linuxquestions.org/quest...nd-4175656948/
https://www.linuxquestions.org/quest...rs-4175655180/
https://www.linuxquestions.org/quest...ng-4175648204/
https://www.linuxquestions.org/quest...es-4175641557/
https://www.linuxquestions.org/quest...pt-4175635666/
https://www.linuxquestions.org/quest...pt-4175616729/

Show your own efforts when posting, and do basic research. After three years, you should have SOME scripting/research skills.

Putting "bash regex strip out anything but letters and numbers" into Google pulls up a LOT of 'hints'. You've been told many times to post things in CODE tags, but don't seem to follow that advice either. The [:alnum:] is alpha-numeric.

I understand, and I am definitely trying to find the answer myself. Of course everyone tries Google first, which I also did; I only come here when that does not resolve it.

I will make sure to use CODE tags.

crts 07-18-2019 08:46 AM

You need to escape certain characters inside the RegEx:
Code:

while read -r line;do
        if [[ ! "$line" =~ [][()\'\"~!\`@/?\>\<\\] ]];then
                echo "$line"
        fi
done < "/path/to/file"

The above code takes care of the most problematic ones. Note that if you want to match a literal ']' inside the brackets, it must be the first character after the opening '['.
I will leave matching the remaining characters as an exercise.

PS:
You can also achieve this with [:alnum:], as TB0ne suggested, but it also has a pitfall. I think, however, that doing it the "hard" way is more educational in the long run, since you learn how to handle certain characters in a RegEx.
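
The rule about a leading ']' can be tried in isolation. This is a minimal sketch using grep rather than the bash loop, on the assumption that both follow the POSIX rules for bracket expressions; the function name drop_bracket_lines and the sample lines are made up for illustration.

```shell
# drop_bracket_lines is a hypothetical helper: it removes every line that
# contains ']', '[', '(' or ')'. The ']' is listed first inside the set, so
# it is taken literally instead of closing the bracket expression.
drop_bracket_lines() {
    grep -v '[][()]'
}

printf '%s\n' 'example.com' ']gsef.ex' '[test.net' '(' ')' |
    drop_bracket_lines
# prints:
# example.com
```

Moving the ']' anywhere else in the set would end the bracket expression early and change what the pattern means.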

blason 07-18-2019 08:53 AM

Code:

'[!@#%%$^*()_+=\;:,"<>?/]'
I guess I am not able to exclude the single quote.

crts 07-18-2019 08:56 AM

Read post #8 again.

TB0ne 07-18-2019 09:01 AM

Quote:

Originally Posted by blason
I understand and I am definitely trying to get the answer and of course everyone first tries google which I also did and if that didnt resolve then come here. Will definitely ensure to follow the code tags.

Sorry, just don't believe that. Putting the search term I used into Google yielded 559,000 hits....hard to believe that out of all that there wasn't one 'hint' you could have used. And you've been asked about CODE tags for a LONG time, but don't use them.
Quote:

Originally Posted by blason (Post 6016236)
Code:

'[!@#%%$^*()_+=\;:,"<>?/]'
I guess I am not able to exclude single quote

And why is that, given the fact that I not only gave you a search-term that has your 'hints', but the **EXACT** thing you need to use for a regex to strip out anything but letters and numbers???

MadeInGermany 07-18-2019 09:10 AM

Better to name the characters you want to keep and use the complement of that set, either with tr and its -c option, or with a negating ^ in a character set in an RE:
Code:

tr -dc '.a-zA-Z0-9\n-' < samplefile
sed 's/[^.a-zA-Z0-9-]//g' < samplefile
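
To see what the tr command does to individual lines, here is a small sketch on a few lines taken from the sample text. The wrapper name strip_chars is hypothetical.

```shell
# strip_chars is a hypothetical wrapper around the tr command above.
strip_chars() {
    # -d deletes, -c complements the set: every character except letters,
    # digits, '.', '-' and the newline is removed.
    tr -dc '.a-zA-Z0-9\n-'
}

printf '%s\n' 'example.com' '%rrt.com' 'fffgg$.net' 'as*' | strip_chars
# prints:
# example.com
# rrt.com
# fffgg.net
# as
```

Note that this strips characters and keeps every line; it does not drop the lines that contained a forbidden character.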


BW-userx 07-18-2019 09:13 AM

just a quick test of that one loop.
Code:

#!/bin/bash

while read -r line;do
        if [[ ! "$line" =~ [][()\'\"~!\`@/?\>\<\\] ]];then
                echo "$line"
        fi
done < "$1"

testfile
Code:

[][()\'\"~!\`@/?\>\<\\]

[ in here ]
'what'
< if >
@googles

~where
!ho
Hello

results
Code:

[userx@arcomeo testdir]$ ./stripme testfile


Hello

tells a story...

crts 07-18-2019 09:19 AM

Quote:

Originally Posted by BW-userx (Post 6016250)

tells a story...

And what story would that be?

blason 07-18-2019 09:21 AM

Quote:

Originally Posted by MadeInGermany (Post 6016246)
Better name the printable characters, and use the complement of it, either with tr and -c option, or with a negating ^ in a charset in a RE:
Code:

tr -dc '.a-zA-Z0-9\n-' < samplefile
sed -n 's/[^a-zA-Z0-9\n-]//gp' < samplefile


Thanks, nice option; however, I am looking for a solution with grep if possible.

crts 07-18-2019 09:27 AM

Quote:

Originally Posted by blason (Post 6016252)
Thanks and nice option; however I am looking with Grep if possible.

Then why did you ask for a bash solution in a previous post?
Quote:

Originally Posted by blason (Post 6016211)
Need that in bash ...


blason 07-18-2019 09:28 AM

Quote:

Originally Posted by BW-userx (Post 6016250)
just a quick test of that one loop.
Code:

#!/bin/bash

while read -r line;do
        if [[ ! "$line" =~ [][()\'\"~!\`@/?\>\<\\] ]];then
                echo "$line"
        fi
done < $1

testfile
Code:

[][()\'\"~!\`@/?\>\<\\]

[ in here ]
'what'
< if >
@googles

~where
!ho
Hello

results
Code:

[userx@arcomeo testdir]$ ./stripme testfile


Hello

tells a story...

Just a few modifications:

Code:

#!/bin/bash

while read -r line;do
        if [[ ! "$line" =~ [][()\'\"~!\`@/?\>\<\/\*\_\+\=\;\:\,\#\$\%\^\&] ]];then
                echo "$line"
        fi
done < "$1"


BW-userx 07-18-2019 09:28 AM

Quote:

Originally Posted by crts (Post 6016251)
And what story would that be?

looks like someone needs to read the foot notes. it works..

blason 07-18-2019 09:29 AM

Quote:

Originally Posted by crts (Post 6016254)
Then why did you ask for a bash solution in a previous post?

My bad; by bash I meant that I wanted it in grep :(

crts 07-18-2019 09:36 AM

Quote:

Originally Posted by blason (Post 6016258)
my bad bash as in wanted in grep :(

Is the sample file you provided at least representative?

crts 07-18-2019 09:50 AM

Quote:

Originally Posted by MadeInGermany (Post 6016246)
Better name the printable characters, and use the complement of it, either with tr and -c option, or with a negating ^ in a charset in a RE:
Code:

tr -dc '.a-zA-Z0-9\n-' < samplefile
sed -n 's/[^a-zA-Z0-9\n-]//gp' < samplefile


Those solutions will also keep invalid domain names, like '%rrt.com'. It will be "transformed" to 'rrt.com', but I am not sure if this is desired by OP.

@OP:
Please provide a sample output file of what you expect it to look like before we keep guessing.

MadeInGermany 07-18-2019 09:57 AM

If you want to not print lines that have a forbidden character, with grep:
Code:

grep -v '[^a-zA-Z0-9-]' testfile
Again, it is easier to name the allowed characters.
For the [a-zA-Z0-9] set there is [[:alnum:]], which can be augmented with extra characters and of course with the ^ negation:
Code:

grep -v '[^[:alnum:]-]' testfile
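
A short sketch of how this grep behaves, using made-up input lines and a hypothetical wrapper name keep_clean_lines:

```shell
# keep_clean_lines is a hypothetical wrapper around the grep command above.
keep_clean_lines() {
    # The pattern matches any character outside the alphanumerics and '-';
    # -v then prints only the lines where no such character was found.
    grep -v '[^[:alnum:]-]'
}

printf '%s\n' 'localhost' 'test-test' '%rrt' 'example.com' | keep_clean_lines
# prints:
# localhost
# test-test
```

Whole lines are dropped here, not single characters. The line example.com is dropped as well, because '.' is not in the allowed set as written.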

TB0ne 07-18-2019 10:04 AM

Quote:

Originally Posted by MadeInGermany (Post 6016277)
If you want to not print lines that have a forbidden character, with grep:
Code:

grep -v '[^a-zA-Z0-9-]' testfile
Again, it is easier to name the allowed characters.
For the [a-zA-Z0-9] set there is [[:alnum:]], can be augmented with extra characters and of course with the ^ negation:
Code:

grep -v '[^[:alnum:]-]' testfile

Yep, the alnum was given earlier, and promptly ignored.

crts 07-18-2019 10:33 AM

Quote:

Originally Posted by MadeInGermany (Post 6016277)
Again, it is easier to name the allowed characters.
For the [a-zA-Z0-9] set there is [[:alnum:]], can be augmented with extra characters and of course with the ^ negation:
Code:

grep -v '[^[:alnum:]-]' testfile

On a side note, let me just point out for OP that if you want to include '-' inside '[]', it must be the last character. This is the pitfall I was referring to earlier. Otherwise it will be interpreted as a range operator. This solution will almost get you there. You still need to figure out the last missing part.

MadeInGermany 07-18-2019 12:00 PM

Quote:

Originally Posted by crts (Post 6016295)
On a side note, let me just point out for OP that if you want to include '-' inside '[]' then it must be the last character. This is the pitfall I was refering to earlier. Otherwise it will be interpreted as a range operator. This solution will almost bring you there. You still need to figure out the last missing part.

Yes, so extra characters must be added like this
Code:

grep -v '[^[:alnum:].-]' testfile
or this
Code:

grep -v '[^.[:alnum:]-]' testfile
Actually it is possible to have the - character first (after the ^ of course), but there are other characters like a ] that must be first, so it is better to remember "- must be last".
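
The effect of the hyphen's position can be checked with a tiny input. A sketch, with a hypothetical helper named sample and LC_ALL=C set so that the range a-c means exactly a, b and c:

```shell
# sample is a hypothetical helper: three letters and a lone hyphen.
sample() { printf '%s\n' a b c -; }

# Hyphen between two characters: the range a through c.
sample | LC_ALL=C grep -x '[a-c]'
# prints a, b, c (one per line)

# Hyphen last: the literal characters a, c and '-'.
sample | LC_ALL=C grep -x '[ac-]'
# prints a, c, - (one per line)

# Hyphen first: the same literal set.
sample | LC_ALL=C grep -x '[-ac]'
# prints a, c, - (one per line)
```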
Quote:

Originally Posted by TB0ne (Post 6016281)
Yep, the alnum was given earlier, and promptly ignored.

Please promptly ignore any sarcasm ;)

crts 07-18-2019 01:34 PM

Quote:

Originally Posted by MadeInGermany (Post 6016331)
Actually it is possible to have the - character first (after the ^ of course)

Did not know that it can be also first, always had it last when needed and never questioned it. Makes sense, though, since on position one (ignoring ^) it cannot be mistaken to indicate a range.

blason 07-18-2019 09:48 PM

Quote:

Originally Posted by MadeInGermany (Post 6016331)
Yes, so adding more extra characters must be like this
Code:

grep -v '[^[:alnum:].-]' testfile
or this
Code:

grep -v '[^.[:alnum:]-]' testfile
Actually it is possible to have the - character first (after the ^ of course), but there are other characters like a ] that must be first, so it is better to remember "- must be last".

Please promptly ignore any sarcasm ;)

It wasn't ignored; it just wasn't working because I mistakenly used only [[:alnum:]], which gave different results.

blason 07-19-2019 11:23 PM

Hello,

That worked perfectly fine; however, I am not sure whether what I am trying to match can be achieved in the same line.
The above pattern also accepts a lone dot or a lone hyphen. In a domain name those are surrounded by alphanumerics, so I am trying to match . and - only when they are surrounded by \w characters.

Maybe I am missing something?

Quote:

cat test | grep -v [^[:alnum:]\w.-]

crts 07-20-2019 06:17 AM

Quote:

Originally Posted by blason (Post 6016863)
May be I am missing something?

The most important thing you are missing is to provide a representative sample file and the output you expect.
You have been presented with a solution that works for the sample file you provided. Now you are telling us that the sample file is not representing the actual input data, thus the solution is inappropriate. It is pointless to provide you with a solution if you keep changing the requirement.

BW-userx 07-20-2019 07:05 AM

wrong post.. oops

pan64 07-20-2019 08:14 AM

\w and [:alnum:] are almost the same, just different syntax.
^ will have an effect on everything inside [ and ] (means exclusion instead of inclusion).
It would also be worth checking www.regex101.com, because you can construct and check any regexp yourself there.

ehartman 07-20-2019 09:26 AM

Quote:

Originally Posted by pan64 (Post 6016931)
\w and [:alnum:] are almost the same, just different syntax.

From the man page: The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]], so they extend the class [:alnum:] with the character _ (in any place for \w) or with _ as FIRST character (for \W).
As _ is not an alphanumeric character, both are thus extensions of [:alnum:].
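
The difference shows up as soon as the input contains an underscore. A small sketch, assuming GNU grep (where \w is available as an extension); the helper name words and the sample strings are made up.

```shell
# words is a hypothetical helper; only the first line contains an underscore.
words() { printf '%s\n' 'foo_bar' 'foobar'; }

# [[:alnum:]] does not include '_', so foo_bar is rejected.
words | grep -x '[[:alnum:]]*'
# prints:
# foobar

# \w behaves like [_[:alnum:]], so both lines pass.
words | grep -x '\w*'
# prints:
# foo_bar
# foobar
```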

TB0ne 07-20-2019 09:32 AM

Quote:

Originally Posted by crts (Post 6016904)
The most important thing you are missing is to provide a representative sample file and the output you expect.
You have been presented with a solution that works for the sample file you provided. Now you are telling us that the sample file is not representing the actual input data, thus the solution is inappropriate. It is pointless to provide you with a solution if you keep changing the requirement.

The OP asked for 'hints' (post #4). They've been given a LOT of hints, but don't seem to have actually worked on/thought about them.

OP, you have been given a LOT of advice that you could act on and research, to solve your own problem.

blason 07-21-2019 09:09 PM

Quote:

Originally Posted by crts (Post 6016904)
The most important thing you are missing is to provide a representative sample file and the output you expect.
You have been presented with a solution that works for the sample file you provided. Now you are telling us that the sample file is not representing the actual input data, thus the solution is inappropriate. It is pointless to provide you with a solution if you keep changing the requirement.

Here is my Input file
Code:

example.com
test.com
test123.com
123test.ocm
calid-domain.com
test-test.net
!def
@fsf
dafsrf#
fffgg$.net
%rrt.com
^testcom
asddf&.net
as*
(
)
_
+
=
\
;
:
'
"
<
,
>
?
/
[test.net
]gsef.ex
`ftrfgr
!
@
#
$
%
^
&
*
(
)
_
=
+
{
}
[
]
;
:
"
'
<
,
.
>
/
?
-

And here is the output I am getting. As you can see, it also matches a lone dot and a lone hyphen. I want to include a dot or hyphen only if it is surrounded by [:alnum:] characters.
Code:

example.com
test.com
test123.com
123test.ocm
calid-domain.com
test-test.net
.
-


ehartman 07-22-2019 02:15 AM

Quote:

Originally Posted by blason (Post 6017388)
Now if you see this is matching a single dot and hyphen as well. I am working include only if dot/hyphen has surrounded by [:alnum:]

Note: if this is about "filename like" strings:
1) Quite often filenames start with a dot (so-called hidden files).
Use "ls -A" in your home dir to see a lot of them.
2) Filenames with multiple dots and/or dashes are common too, e.g. php-5.6.40-x86_64-1.txz
3) The _ is often used to substitute spaces, like George_Harrison-What_Is_Life.mp3

blason 07-22-2019 08:04 AM

Quote:

Originally Posted by ehartman (Post 6017431)
Note: if this is about "filename like" strings:
1) Quite often filenames start with a dot (so-called hidden files).
Use "la -A" in your home dir to see a lot of them.
2) Filenames with multiple dots and/or dashes are common too, i.e. php-5.6.40-x86_64-1.txz
3) The _ is often used to substitute spaces, like George_Harrison-What_Is_Life.mp3

Nah, I am not trying to match file names; I am filtering the data in that file, and the entries are domain names. The only valid non-alphanumeric characters in domain names are . and -, and they should have alphanumerics around them.

pan64 07-22-2019 08:17 AM

so I would create something like this:
1. one alphanumeric
2. any number of alnum or -
3. one alnum
4. dot
5. any alnum (but at least one).

You can construct a regexp for this (or anything similar)
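
One possible reading of the five steps above as an extended regular expression for grep -E is sketched below. The variable name domain_re and the test lines are assumptions for illustration, not something given in the thread.

```shell
# Steps 1-3: one alnum, any number of alnum or '-', one alnum.
# Step 4: a literal dot. Step 5: one or more alnum, then end of line.
domain_re='^[[:alnum:]][[:alnum:]-]*[[:alnum:]]\.[[:alnum:]]+$'

printf '%s\n' 'example.com' 'test-test.net' '.' '-' '-bad.com' 'bad-.com' |
    grep -E "$domain_re"
# prints:
# example.com
# test-test.net
```

Read literally, the steps require at least two characters before the dot and allow exactly one dot, so this is a starting point rather than a complete domain check.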

ehartman 07-22-2019 12:59 PM

Quote:

Originally Posted by pan64 (Post 6017514)
4. dot
5. any alnum (but at least one).

You can construct a regexp for this (or anything similar)

And how do you then handle hostnames like www.debian.org or en.wikipedia.org (that is, with multiple dots)?
Those steps 4 and 5 have to be repeated until the end of the string is reached.
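
Repeating the "dot plus label" part could look roughly like this. It is a sketch only: the variable names label and host_re are hypothetical, and the pattern checks the shape of the name, not whether the domain really exists or meets any length limits.

```shell
# A label starts and ends with an alphanumeric; '-' is allowed in between.
label='[[:alnum:]]([[:alnum:]-]*[[:alnum:]])?'

# One label, followed by one or more ".label" groups, up to end of line.
host_re="^${label}(\\.${label})+\$"

printf '%s\n' 'www.debian.org' 'en.wikipedia.org' 'example.com' \
    '.' '-' 'a..b' 'test-.net' |
    grep -E "$host_re"
# prints:
# www.debian.org
# en.wikipedia.org
# example.com
```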

TB0ne 07-22-2019 01:11 PM

Quote:

Originally Posted by blason (Post 6017507)
Nah I am not trying to match files but trying to include the data in that file and those are domain names. Only valid characters in domains names are . and - but should have alphanumeric around it.

So what have you done with any/all of the 'hints' you asked for thus far, to try to accomplish this?? Haven't seen your efforts thus far.

And if all you're looking to do (since the goal has apparently changed again), is to get domain names, why can't you just grep for things like .net,.com,.edu, etc. into another file??

blason 07-22-2019 10:43 PM

Quote:

cat test | grep -v '[^[:alnum:].-]' | grep '\w'
This should be enough for my needs? Please correct me if I am wrong.

Quote:

cat test | grep -v '[^[:alnum:].-]' | grep '\w'
example.com
test.com
test123.com
123test.ocm
calid-domain.com
test-test.net

pan64 07-23-2019 02:09 AM

Quote:

Originally Posted by ehartman (Post 6017591)
And how do you handle then hostnames like www.debian.org or en.wikipedia.org (that is, with multiple dots)?
Those steps 4 and 5 have to be repeated until the end of the string is reached.

this was only an example and anyone can improve. just a way to build a correct regexp

Quote:

Originally Posted by blason (Post 6017722)
This eventually should suffice my need? Please correct me if I am wrong.

Did you check it? It looks [almost] identical/similar to:
Code:

grep '[[:alnum:].-]' test
Is this what you need?
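
The two commands can be compared on a few lines of the sample input. A sketch, assuming GNU grep (for \w); the helper name sample is hypothetical.

```shell
# sample is a hypothetical helper holding a few lines from the input file.
sample() { printf '%s\n' 'example.com' '!def' '%rrt.com' '.' '-' '('; }

# Two-stage pipeline: no forbidden character, then at least one word character.
sample | grep -v '[^[:alnum:].-]' | grep '\w'
# prints:
# example.com

# Single grep: any line holding at least one alphanumeric, '.' or '-'.
sample | grep '[[:alnum:].-]'
# prints:
# example.com
# !def
# %rrt.com
# .
# -
```

On this input the two outputs differ, so it is worth testing the shorter form against lines such as !def before relying on it.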

blason 07-23-2019 02:26 AM

Quote:

Originally Posted by pan64 (Post 6017744)
this was only an example and anyone can improve. just a way to build a correct regexp


Did you check it? It looks [almost] identical/similar to:
Code:

grep '[[:alnum:].-]' test
Is this what you need?

Yes it is now showing me the desired results.

pan64 07-23-2019 05:21 AM

in that case probably you can mark the thread solved
and again, you may check www.regex101.com to improve your skills and to check your regexps.

MadeInGermany 08-15-2019 01:46 AM

Quote:

Originally Posted by blason (Post 6016863)
Hello,

That worked perfectly fine; however what I am trying to match here is and not sure if this can be achieved in the same line.
Since the above pattern is catching single dot as liternal and hyphen. Being a domain name those will be surrounded by alnum hence trying hard for validation to match . and - only if surrounded by \wfollowed by those two literals.

May be I am missing something?

The grep runs in a shell; many special characters have a special meaning in the shell, too.
Use quotes, so the shell does not try special substitutions!
Code:

< test grep -v '[^[:alnum:]\w.-]'
I am unsure if perl-style extensions like \w work within a character set [ ], so I would go for only POSIX classes [:xxxx:].
Here it is sufficient to add the _ character to the [:alnum:] class
Code:

< test grep -v '[^[:alnum:]_.-]'
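
A quick check of both points, as a sketch with made-up input lines. Under the POSIX rules for bracket expressions the backslash is an ordinary character inside [ ], which is what the first command relies on.

```shell
# Inside a bracket expression, '[\w]' means "a backslash or the letter w",
# not "a word character".
printf '%s\n' 'w' 'a' '_' '\' | grep -x '[\w]'
# prints:
# w
# \

# Adding '_' to the POSIX class gives the intended "word character" set.
printf '%s\n' 'foo_bar.com' 'foo bar' | grep -v '[^[:alnum:]_.-]'
# prints:
# foo_bar.com
```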


All times are GMT -5.