How to exclude all special characters using regex?
Hi Folks,
I need to exclude special characters from a file and keep only [a-zA-Z0-9], the dot (.) and the hyphen (-). In fact, I am only keeping domain names and excluding all special characters, such as ~`!@#$%^&*()_+={}[]\|;:'"<,>/? but I have not been able to achieve this. Can someone please help? |
Which language is it? Do you have any code written already?
|
Any hint? |
My sample text would be
example.com test.com test123.com 123test.ocm calid-domain.com test-test.net !def @fsf dafsrf# fffgg$.net %rrt.com ^testcom asddf&.net as* ( ) _ + = \ ; : ' " < , > ? / |
https://www.linuxquestions.org/quest...es-4175657403/
https://www.linuxquestions.org/quest...nd-4175656948/
https://www.linuxquestions.org/quest...rs-4175655180/
https://www.linuxquestions.org/quest...ng-4175648204/
https://www.linuxquestions.org/quest...es-4175641557/
https://www.linuxquestions.org/quest...pt-4175635666/
https://www.linuxquestions.org/quest...pt-4175616729/
Show your own efforts when posting, and do basic research. After three years, you should have SOME scripting/research skills. Putting "bash regex strip out anything but letters and numbers" into Google pulls up a LOT of 'hints'. You've been told many times to post things in CODE tags, but don't seem to follow that advice either. The [:alnum:] class is alphanumeric. |
I will definitely make sure to use code tags. |
You need to escape certain characters inside the RegEx:
Code:
while read -r line;do
I will leave matching the remaining characters as an exercise. PS: You can also achieve this by using [:alnum:], as TB0ne suggested, but it also has a pitfall. I think, however, that doing it the "hard" way is more educational in the long run, since you learn how to handle certain characters in a RegEx. |
Code:
'[!@#%%$^*()_+=\;:,"<>?/]' |
Read post #8 again.
|
Better to name the allowed characters and take their complement, either with tr and its -c option, or with a negating ^ in a bracket expression in an RE:
Code:
tr -dc '.a-zA-Z0-9\n-' < samplefile |
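To see what the tr approach does to individual lines, here is a minimal sketch. The function name strip_specials is made up for illustration, and the sample lines are taken from the example text earlier in the thread. Note that this deletes the unwanted characters and keeps the rest of each line; it does not drop the line.

```shell
#!/bin/sh
# strip_specials is a hypothetical helper name, not something from the thread.
# -c complements the set, -d deletes: every character that is NOT a letter,
# digit, dot, hyphen or newline is removed from the input.
strip_specials() {
    tr -dc '.a-zA-Z0-9\n-'
}

# Sample lines from the thread's example text.
printf '%s\n' 'example.com' 'fffgg$.net' '%rrt.com' 'asddf&.net' | strip_specials
# prints:
# example.com
# fffgg.net
# rrt.com
# asddf.net
```

Whether turning fffgg$.net into fffgg.net is acceptable depends on the goal; if such lines should be rejected outright, a line filter is the better fit.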
Just a quick test of that one loop.
Code:
#!/bin/bash
Code:
[][()\'\"~!\`@/?\>\<\\]
Code:
[userx@arcomeo testdir]$ ./stripme testfile |
Code:
#!/bin/bash |
@OP: Please provide a sample output file of what you expect it to look like before we keep guessing. |
If you do not want to print lines that contain a forbidden character, use grep:
Code:
grep -v '[^a-zA-Z0-9-]' testfile
For the [a-zA-Z0-9] set there is [[:alnum:]]; it can be augmented with extra characters and, of course, combined with the ^ negation:
Code:
grep -v '[^[:alnum:]-]' testfile |
Code:
grep -v '[^[:alnum:].-]' testfile
Code:
grep -v '[^.[:alnum:]-]' testfile
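As a rough illustration of how the negated set combined with -v behaves, the sketch below feeds a few of the sample lines through the filter. The function name keep_clean_lines is hypothetical, and the input is piped in rather than read from testfile.

```shell
#!/bin/sh
# keep_clean_lines is a hypothetical helper name, not something from the thread.
# [^[:alnum:].-] matches any single character outside letters, digits, dot, hyphen.
# -v inverts the match, so only lines WITHOUT such a character are printed.
keep_clean_lines() {
    grep -v '[^[:alnum:].-]' || true   # grep exits 1 when it prints nothing
}

printf '%s\n' 'example.com' 'test-test.net' '!def' 'fffgg$.net' '^testcom' | keep_clean_lines
# prints:
# example.com
# test-test.net
```

The filter only checks which characters occur, not how they are arranged, so a line consisting of a lone dot or hyphen also passes.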
|
Hello,
That worked perfectly fine. However, I am not sure whether what I am actually trying to match can be achieved in the same line, since the above pattern also accepts a single dot or hyphen as a literal. In a domain name those characters are always surrounded by alphanumerics, so I am trying to validate that . and - match only when they are surrounded by \w. Maybe I am missing something?
|
You have been presented with a solution that works for the sample file you provided. Now you are telling us that the sample file is not representing the actual input data, thus the solution is inappropriate. It is pointless to provide you with a solution if you keep changing the requirement. |
wrong post.. oops
|
\w and [:alnum:] are almost the same, just different syntax.
A leading ^ affects everything inside [ and ] (it means exclusion instead of inclusion). It is also worth checking www.regex101.com, because there you can construct and test any regexp yourself. |
As _ is not an alphanumeric character, both are thus extensions of [:alnum:]. |
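The difference comes down to the underscore, which a small sketch can make visible. The helper name count_matches is made up for illustration, and \w is assumed to be available as in GNU grep; other grep implementations may not support it.

```shell
#!/bin/sh
# count_matches is a hypothetical helper: it prints how many input lines
# match the pattern given as its first argument.
count_matches() {
    grep -c -- "$1" || true   # grep -c prints 0 but exits 1 when nothing matches
}

printf '_\n' | count_matches '[[:alnum:]]'    # 0: the underscore is not alphanumeric
printf '_\n' | count_matches '[[:alnum:]_]'   # 1: class extended with a literal _
printf '_\n' | count_matches '\w'             # 1 with GNU grep, where \w includes _
```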
OP, you have been given a LOT of advice that you could act on and research, to solve your own problem. |
Quote:
Code:
example.com Code:
example.com |
1) Quite often filenames start with a dot (so-called hidden files). Use "ls -A" in your home dir to see a lot of them.
2) Filenames with multiple dots and/or dashes are common too, e.g. php-5.6.40-x86_64-1.txz
3) The _ is often used to substitute spaces, as in George_Harrison-What_Is_Life.mp3 |
So I would create something like this:
1. one alphanumeric
2. any number of alnum or - characters
3. one alnum
4. dot
5. any alnum (but at least one).
You can construct a regexp for this (or anything similar). |
Those steps 4 and 5 have to be repeated until the end of the string is reached. |
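One possible reading of the five steps, with steps 4 and 5 repeated to the end of the line, is sketched below as an extended regular expression. The function name is_domain_like is hypothetical, and the pattern is a literal transcription of the steps, not a full domain name validator.

```shell
#!/bin/sh
# is_domain_like is a hypothetical helper name, not something from the thread.
# Steps 1-3: the first label starts and ends with an alphanumeric and may
#            contain hyphens in between.
# Steps 4-5: a dot followed by one or more alphanumerics, repeated until
#            the end of the line.
is_domain_like() {
    grep -E '^[[:alnum:]][[:alnum:]-]*[[:alnum:]](\.[[:alnum:]]+)+$' || true
}

printf '%s\n' 'example.com' 'test-test.net' '.com' '-bad.com' 'testcom' 'fffgg$.net' | is_domain_like
# prints:
# example.com
# test-test.net
```

Read literally, the steps require the first label to be at least two characters long and do not allow hyphens in the later labels; both would need loosening if such names should pass.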
And if all you're looking to do (since the goal has apparently changed again), is to get domain names, why can't you just grep for things like .net,.com,.edu, etc. into another file?? |
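A minimal sketch of that idea, assuming a fixed list of endings: the helper name by_tld and the three TLDs are only examples, and the input is piped in instead of being read from a file.

```shell
#!/bin/sh
# by_tld is a hypothetical helper name; the TLD list is only an example.
# Keep lines that end in .com, .net or .edu.
by_tld() {
    grep -E '\.(com|net|edu)$' || true
}

printf '%s\n' 'example.com' 'test-test.net' '123test.ocm' '^testcom' 'fffgg$.net' | by_tld
# prints:
# example.com
# test-test.net
# fffgg$.net
```

Because this checks only the ending, fffgg$.net still gets through, so it would have to be combined with a character filter. Redirecting the output, for example with > domains.txt (a made-up file name), writes the result into another file.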
Code:
grep '[[:alnum:].-]' test |
In that case you can probably mark the thread as solved.
And again, you may check www.regex101.com to improve your skills and to test your regexps. |
Use quotes, so the shell does not try special substitutions!
Code:
< test grep -v '[^[:alnum:]\w.-]'
Here it is sufficient to add the _ character to the [:alnum:] class:
Code:
< test grep -v '[^[:alnum:]_.-]' |
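The reason \w does not help inside the brackets is that POSIX bracket expressions treat the backslash and the w as two ordinary characters. The sketch below shows the effect on a made-up host name; keep_allowed is a hypothetical helper that takes the pattern as its argument.

```shell
#!/bin/sh
# keep_allowed is a hypothetical helper: it prints the input lines that
# contain no character matched by the negated set passed as $1.
keep_allowed() {
    grep -v -- "$1" || true   # grep exits 1 when it prints nothing
}

# Inside [...] the sequence \w is a literal backslash and a literal w,
# so the underscore still counts as forbidden and the line is dropped.
printf 'my_host.example.com\n' | keep_allowed '[^[:alnum:]\w.-]'

# With a literal _ added to the set, the line is kept and printed.
printf 'my_host.example.com\n' | keep_allowed '[^[:alnum:]_.-]'
```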