bash: remove dashes (-) out of regexes...

masavini · 02-11-2020, 03:57 PM

hi,
can you please suggest a command to remove dashes (-) from a list of regexes, excluding the ones that define character ranges (e.g. [a-z])?

for example:

original regex:

Code:

13-45[a-z][3-4]-?buburr-ex
  ^            ^^      ^

modified regex:

Code:

1345[a-z][3-4]buburrex

boughtonp · 02-11-2020, 05:03 PM

How many regexes do you have to deal with that makes this not easier to do by hand?

What engine are the regexes for? If there is a dash in a character class (e.g. "[\w-]") how should it be handled?

syg00 · 02-11-2020, 05:16 PM

Not to mention there is a meta character being deleted unannounced. Who knows what other requirements have been omitted.
A quick hack simply doing multiple passes would probably be my solution - trying to do it as one convoluted regex is just an academic pursuit IMHO.

masavini · 02-11-2020, 05:57 PM

Quote:

Originally Posted by boughtonp

How many regexes do you have to deal with that makes this not easier to do by hand?

What engine are the regexes for? If there is a dash in a character class (e.g. "[\w-]") how should it be handled?

there are a few dozen regexes. i wouldn't edit them by hand because i need to grep for the original regexes first, and then grep for the 'no-dashes' regexes if no matches are found with the original ones.

the dash removal rule should be: 'remove dashes outside square brackets, together with any meta character (?, * or +) that immediately follows them'
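
The two-step lookup described above could be sketched roughly as follows. This is only an illustration: `grep_with_fallback` is a made-up helper name, `grep -E` is an assumption (the engine hasn't been stated), and the dash-free pattern is simply passed in by the caller rather than generated.

```shell
#!/bin/bash

# grep_with_fallback FILE ORIGINAL STRIPPED   (hypothetical helper)
# Print the lines of FILE that match ORIGINAL; only if there are none,
# try again with the dash-free variant STRIPPED.
grep_with_fallback() {
	local file=$1 original=$2 stripped=$3

	# -e stops a pattern that begins with a dash being read as an option
	grep -E -e "$original" -- "$file" && return 0
	grep -E -e "$stripped" -- "$file"
}

# Demo data in a temporary file
tmp=$(mktemp)
printf '%s\n' '1345a3buburrex' 'unrelated' > "$tmp"

# The original regex finds nothing here, so the dash-free one is tried
grep_with_fallback "$tmp" '13-45[a-z][3-4]-?buburr-ex' '1345[a-z][3-4]buburrex'
rm -f "$tmp"
```

The function returns grep's exit status, so a caller can still tell "no match with either regex" apart from a hit.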

syg00 · 02-11-2020, 09:58 PM

If all the data are of exactly that structure, a reasonably straightforward ERE with sed will do the job. I didn't attempt to do it in one compound expression, but it can be done in a single pass.
Perl will of course do it, but that is a given when regex is mentioned.
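
For data shaped exactly like the example, a single-pass sed ERE might look something like the sketch below. It is only a rough illustration: `strip_dashes_sed` is a made-up name, GNU sed is assumed, and the patterns are assumed to contain no escaped brackets, no classes that start with `]` or `^]`, and no nested classes.

```shell
#!/bin/bash

# strip_dashes_sed: hypothetical helper, reads regexes on stdin.
# Assumes simple patterns only (see the caveats above).
strip_dashes_sed() {
	# Alternative 1 captures a whole [...] block and puts it back via \1.
	# Alternative 2 matches a dash plus an optional ?, * or + and
	# captures nothing, so \1 is empty and the match is dropped.
	sed -E 's/(\[[^]]*])|-[?*+]?/\1/g'
}

echo '13-45[a-z][3-4]-?buburr-ex' | strip_dashes_sed
```

With the sample regex this should print 1345[a-z][3-4]buburrex. The trick relies on an unmatched group expanding to an empty string in the replacement.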

Doing it in pure bash is not something I would ever contemplate.

agillator · 02-12-2020, 04:20 PM

A dash is a special character, normally indicating a range, in all the regexps I have worked with. To remove a literal dash character you need to escape it. \ is the normal escape character, so 13\-14 would match the character sequence one three dash one four. Otherwise it is trying to match a range of 13 to 14, which is probably not what you want and may not even make any sense in your expression.

boughtonp · 02-15-2020, 08:01 PM

Quote:

Originally Posted by syg00

Doing it in pure bash is not something I would ever contemplate.

Why not?

It's basically just iterating through a string and either including, excluding, or toggling a flag - none of which is difficult in Bash.

Might be a bit more complex to properly handle all quantifiers and nested classes, but the OP hasn't requested those.

GazL · 02-25-2020, 10:02 AM

That was actually quite a challenge when you start looking at all the ways the characters can occur within a regex, escapes, and so on. Can't promise this is perfect, but it's a start:

Code:

$ cat /tmp/regex
#!/bin/bash

# ERE's:
#
# match bracket expression ([]) block:   (\[\^?]?[^]]*])
# match characters until escaped character, hyphen, or start of brackets block: ([^[\-]+)
# match escaped char: (\\.)
# match hyphens: (-+)
# rest of line: (.*)

sed -E -e '
  :again
     s/^(((\[\^?]?[^]]*])|(\\.)|([^[\-]+))*)(-+)(.*)$/\1\7/
     t again
'
$ echo '13-45[a-z][3-4]-?buburr-ex' | /tmp/regex 
1345[a-z][3-4]?buburrex
$ echo '13-45\[a-z][3-4]-?buburr-ex' | /tmp/regex 
1345\[az][3-4]?buburrex
$
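
To see what the `:again` / `t again` loop is doing, the same substitution can be run with a `p` added at the top of the loop, so each pass is printed. This is just a tracing sketch (GNU sed assumed, `trace_passes` is a made-up name); the regex itself is unchanged from the script above.

```shell
#!/bin/bash

# trace_passes: hypothetical helper that prints the pattern space before
# each substitution attempt, so every removed run of hyphens shows up as
# one extra line of output.
trace_passes() {
	sed -E -n -e '
	  :again
	     p
	     s/^(((\[\^?]?[^]]*])|(\\.)|([^[\-]+))*)(-+)(.*)$/\1\7/
	     t again
	'
}

echo '13-45[a-z][3-4]-?buburr-ex' | trace_passes
```

Each pass removes the first run of hyphens that is neither inside a bracket block nor escaped; the loop ends when the substitution stops matching, and the last line printed is the final result.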

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

boughtonp · 02-25-2020, 11:40 AM

Quote:

Originally Posted by GazL

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

Some people, when confronted with a regex, insist on posting a throw-away response to a misuse of Perl like it's a piece of inspired wisdom (instead of the useless meme it actually was).

Quote:

That was actually quite a challenge when you start looking at all the ways the characters can occur within a regex, escapes, and so on. Can't promise this is perfect, but it's a start:

A lot of people think regex is complicated, but it's actually a pretty simple language. There are only two fairly basic syntaxes (inside character classes and outside character classes), with a pretty limited set of rules of what goes where - and in this specific situation only a handful of characters that need to be cared about.

But that doesn't necessarily mean parsing regex with regex is the best approach...

The sed solution you posted checks for character classes before escaped characters, which means it doesn't handle [\]-] correctly.

That'll be fixable, but if you also want to handle nested classes like [[:alpha:]-] then the simplest approach is probably to count unescaped brackets to know when you're outside again. (Assuming a regex flavour where brackets must be escaped in character classes, which isn't guaranteed.)

Here's the dreaded pure Bash solution addressing that issue:

Code:

#!/bin/bash

input='13-45[a-z][3-4]-?buburr-ex'
output=''
inclass=0

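# Walk the regex one character at a time.
for (( i=0 ; i<${#input} ; ++i ))
do
	c=${input:$i:1}

	case "$c" in
		# Escape: copy the backslash and the next character untouched.
		'\') output+="${input:$i:2}" && ((++i)) && continue ;;
		# Count bracket depth so nested classes like [[:alpha:]-] work.
		'[') ((++inclass));;
		']') ((--inclass));;
		'-')
			# Outside any class: drop the dash and a following quantifier.
			if [ $inclass -eq 0 ]
			then
				case "${input:$i+1:1}" in ('*'|'+'|'?')
					((++i))
				esac
				continue
			fi
	esac

	output+="$c"
done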

echo "$output"
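
To run that over a few dozen regexes, the loop could be wrapped in a function along these lines. Treat it as a sketch: `strip_dashes` is a made-up name, and the guard that stops the depth counter going negative on a stray `]` is an addition that assumes an unmatched `]` outside a class is just a literal.

```shell
#!/bin/bash

# strip_dashes REGEX   (hypothetical wrapper around the loop above)
strip_dashes() {
	local input=$1 output='' c
	local i inclass=0

	for (( i=0 ; i<${#input} ; ++i ))
	do
		c=${input:i:1}
		case "$c" in
			# Escaped character: copy both characters untouched.
			'\') output+=${input:i:2}; (( ++i )); continue ;;
			'[') (( ++inclass )) ;;
			# Assumption: a stray ] outside a class is a literal,
			# so never let the depth drop below zero.
			']') if (( inclass > 0 )); then (( --inclass )); fi ;;
			'-')
				if (( inclass == 0 ))
				then
					# Also skip a quantifier right after the dash.
					case "${input:i+1:1}" in
						'*'|'+'|'?') (( ++i )) ;;
					esac
					continue
				fi
				;;
		esac
		output+=$c
	done

	printf '%s\n' "$output"
}

for re in '13-45[a-z][3-4]-?buburr-ex' '[[:alpha:]-]x-y' '[\]-]x-y'
do
	printf '%s  ->  %s\n' "$re" "$(strip_dashes "$re")"
done
```

The three sample inputs cover the plain case, a nested class, and an escaped bracket inside a class; in each of them only the dashes outside the brackets should disappear.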

GazL · 02-26-2020, 03:17 AM

Thanks for pointing that out. Clearly I missed the [:class:] cases. Oh well, I said it likely wasn't perfect.

If I were approaching this as a real problem to be solved I would have reached for C. I was using a regex here purely for the fun of it and to see if I could do it that way. Apparently I only got part way there.

However memey it has become, I still like that Jamie Zawinski quote: it's humorous, and it's also a caution that if you attempt to do something too complicated with a regex you're going to fail, or cause yourself problems in the future. Nothing is absolute though, and regexes do have their place; it just seems this wasn't one of them.