LinuxQuestions.org
Old 02-11-2020, 03:57 PM   #1
masavini
Member
 
Registered: Jun 2008
Posts: 285

Rep: Reputation: 6
bash: remove dashes (-) out of regexes...


hi,
can you please suggest a command to remove dashes (-) from a list of regexes, excluding the ones that define character ranges (e.g. [a-z])?

for example:

original regex:
Code:
13-45[a-z][3-4]-?buburr-ex
  ^            ^^      ^
modified regex:
Code:
1345[a-z][3-4]buburrex
 
Old 02-11-2020, 05:03 PM   #2
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,597

Rep: Reputation: 2545
How many regexes do you have to deal with that makes this not easier to do by hand?

What engine are the regexes for? If there is a dash in a character class (e.g. "[\w-]") how should it be handled?


Last edited by boughtonp; 02-11-2020 at 05:06 PM.
 
Old 02-11-2020, 05:16 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Not to mention there is a meta character being deleted unannounced. Who knows what other requirements have been omitted.
A quick hack simply doing multiple passes would probably be my solution - trying to do it as one convoluted regex is just an academic pursuit IMHO.
 
Old 02-11-2020, 05:57 PM   #4
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by boughtonp
How many regexes do you have to deal with that makes this not easier to do by hand?

What engine are the regexes for? If there is a dash in a character class (e.g. "[\w-]") how should it be handled?


there are a few dozen regexes. i wouldn't edit them by hand because i need to grep with the original regexes first, and then grep with the 'no-dashes' regexes only if the original ones find no matches.

the dash removal rule should be: 'remove every dash outside square brackets, together with any quantifier (?, * or +) that immediately follows it'
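
The two-step lookup described above could be sketched roughly like this. The function name `grep_with_fallback` and its argument order are made up for illustration, and it assumes both versions of the regex are already available and are meant for `grep -E`:

```shell
#!/bin/bash
# Hypothetical helper: search with the original regex first and only
# fall back to the dash-free variant if the first search finds nothing.
# Usage: grep_with_fallback ORIGINAL_REGEX NODASH_REGEX FILE
grep_with_fallback() {
    local original=$1 nodash=$2 file=$3
    # grep exits non-zero when nothing matches, which triggers the fallback.
    grep -E -- "$original" "$file" || grep -E -- "$nodash" "$file"
}
```

The `--` keeps a regex that starts with a dash from being read as an option.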
 
Old 02-11-2020, 09:58 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
If all the data are of exactly that structure, a reasonably straightforward ERE with sed will do the job. I didn't attempt to do it in one compound expression, but it can be done in a single pass.
Perl will of course do it, but that is a given when regex is mentioned.

Doing it in pure bash is not something I would ever contemplate.
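
For what it's worth, a multi-pass sed along those lines might look like the sketch below. This is only a guess at the approach, not the command the poster had in mind. It assumes GNU sed (for `\n` in the replacement), single-line regexes, and no escaped brackets or nested classes; the function name `strip_dashes_multipass` is invented.

```shell
#!/bin/bash
# Hypothetical multi-pass filter, reads regexes on stdin, one per line.
#   pass 1: hide every dash inside [...] behind a newline placeholder
#   pass 2: delete remaining dashes plus a directly following ?, * or +
#   pass 3: turn the placeholders back into dashes
strip_dashes_multipass() {
    sed -E \
        -e ':hide' \
        -e 's/(\[[^]]*)-([^]]*\])/\1\n\2/' \
        -e 't hide' \
        -e 's/-[?*+]?//g' \
        -e 's/\n/-/g'
}
```

With the sample from the first post, `echo '13-45[a-z][3-4]-?buburr-ex' | strip_dashes_multipass` prints `1345[a-z][3-4]buburrex`.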
 
Old 02-12-2020, 04:20 PM   #6
agillator
Member
 
Registered: Aug 2016
Distribution: Mint 19.1
Posts: 419

Rep: Reputation: Disabled
A dash is a special character, normally indicating a range, in all the regexps I have worked with. To remove a literal dash character you need to escape it. \ is the normal escape character, so 13\-14 would match the character sequence one three dash one four. Otherwise it is trying to match a range of 13 to 14, which is probably not what you want and may not even make sense in your expression.
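
This is worth checking against the engine in use. In POSIX ERE as implemented by `grep -E`, a dash only forms a range between two characters inside a bracket expression; outside brackets it matches itself, so a regex like 13-45 needs no escaping. A quick sketch to verify (the function names are made up for illustration):

```shell
#!/bin/bash
# Outside brackets, an unescaped dash matches a literal dash.
dash_is_literal() {
    printf '%s\n' '13-14' | grep -Eq '13-14'
}

# Inside brackets, a dash between two characters denotes a range.
dash_is_range() {
    printf '%s\n' 'm' | grep -Eq '^[a-z]$'
}
```

Both functions return success with `grep -E`, while the string 13 on its own is not matched by the pattern 13-14.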
 
Old 02-15-2020, 08:01 PM   #7
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,597

Rep: Reputation: 2545
Quote:
Originally Posted by syg00
Doing it in pure bash is not something I would ever contemplate.
Why not?

It's basically just iterating through a string and either including, excluding, or toggling a flag - none of which is difficult in Bash.

Might be a bit more complex to properly handle all quantifiers and nested classes, but the OP hasn't requested those.

 
Old 02-25-2020, 10:02 AM   #8
GazL
LQ Veteran
 
Registered: May 2008
Posts: 6,897

Rep: Reputation: 5018
That was actually quite a challenge when you start looking at all the ways the characters can occur within a regex, escapes, and so on. Can't promise this is perfect, but it's a start:
Code:
$ cat /tmp/regex
#!/bin/bash

# ERE's:
#
# match bracket expression ([]) block:   (\[\^?]?[^]]*])
# match characters until escaped character, hyphen, or start of brackets block: ([^[\-]+)
# match escaped char: (\\.)
# match hyphens: (-+)
# rest of line: (.*)

sed -E -e '
  :again
     s/^(((\[\^?]?[^]]*])|(\\.)|([^[\-]+))*)(-+)(.*)$/\1\7/
     t again
'
$ echo '13-45[a-z][3-4]-?buburr-ex' | /tmp/regex 
1345[a-z][3-4]?buburrex
$ echo '13-45\[a-z][3-4]-?buburr-ex' | /tmp/regex 
1345\[az][3-4]?buburrex
$

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

Last edited by GazL; 02-25-2020 at 10:15 AM. Reason: fixed comments
 
Old 02-25-2020, 11:40 AM   #9
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,597

Rep: Reputation: 2545
Quote:
Originally Posted by GazL
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Some people, when confronted with a regex, insist on posting a throw-away response to a misuse of Perl like it's a piece of inspired wisdom (instead of the useless meme it actually was).


Quote:
That was actually quite a challenge when you start looking at all the ways the characters can occur within a regex, escapes, and so on. Can't promise this is perfect, but it's a start:
A lot of people think regex is complicated, but it's actually a pretty simple language. There are only two fairly basic syntaxes (inside character classes and outside character classes), with a pretty limited set of rules of what goes where - and in this specific situation only a handful of characters that need to be cared about.

But that doesn't necessarily mean parsing regex with regex is the best approach...

The sed solution you posted checks for character classes before escaped characters, which means it doesn't handle [\]-] correctly.
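
To make that failure concrete, here is a sketch that wraps the sed command from the earlier post in a function (the name `strip_dashes_sed` is made up) and feeds it a class containing an escaped bracket. It assumes GNU sed and a regex flavour where `\]` inside a class is an escaped bracket rather than the end of the class.

```shell
#!/bin/bash
# The sed command from the earlier post, unchanged, wrapped in a
# hypothetical function so it can be fed test inputs on stdin.
strip_dashes_sed() {
    sed -E -e '
      :again
         s/^(((\[\^?]?[^]]*])|(\\.)|([^[\-]+))*)(-+)(.*)$/\1\7/
         t again
    '
}

# The class [\]-] should survive intact; only the dash in x-y should go.
printf '%s\n' '[\]-]x-y' | strip_dashes_sed
# prints [\]]xy, so the dash inside the class was removed as well
```

The bracket alternative is tried first and takes `[\]` as a complete bracket expression, so the following dash looks like it sits outside any class.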

That'll be fixable, but if you also want to handle nested classes like [[:alpha:]-] then the simplest approach is probably to count unescaped brackets to know when you're outside again. (Assuming a regex flavour where brackets must be escaped in character classes, which isn't guaranteed.)

Here's the dreaded pure Bash solution addressing that issue:
Code:
#!/bin/bash

input='13-45[a-z][3-4]-?buburr-ex'
output=''
inclass=0

for (( i=0 ; i<${#input} ; ++i ))
do
	c=${input:$i:1}

	case "$c" in
		'\') output+="${input:$i:2}" && ((++i)) && continue ;;
		'[') ((++inclass));;
		']') ((--inclass));;
		'-')
			if [ $inclass -eq 0 ]
			then
				case "${input:$i+1:1}" in ('*'|'+'|'?')
					((++i))
				esac
				continue
			fi
	esac

	output+="$c"
done

echo "$output"
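
Since the original question involves a few dozen regexes rather than one hard-coded string, the same loop could be wrapped in a function. The sketch below keeps the logic above (copy escaped pairs, count brackets, drop outside dashes plus a following quantifier); the function name `strip_dashes` is made up, and it carries the same assumption that brackets inside a class are escaped.

```shell
#!/bin/bash
# Hypothetical wrapper around the loop above.
# Usage: strip_dashes REGEX   (prints the dash-free regex)
strip_dashes() {
    local input=$1 output='' c
    local -i i=0 inclass=0

    while (( i < ${#input} )); do
        c=${input:i:1}
        case "$c" in
            '\')  # copy the backslash and the escaped character untouched
                output+=${input:i:2}
                i=$((i + 2))
                continue ;;
            '[') inclass=$((inclass + 1)) ;;
            ']') inclass=$((inclass - 1)) ;;
            '-')
                if (( inclass == 0 )); then
                    # outside any class: drop the dash, and also a
                    # quantifier that directly follows it
                    case "${input:i+1:1}" in
                        '*'|'+'|'?') i=$((i + 2)) ;;
                        *)           i=$((i + 1)) ;;
                    esac
                    continue
                fi ;;
        esac
        output+=$c
        i=$((i + 1))
    done

    printf '%s\n' "$output"
}
```

A list could then be processed with something like `while IFS= read -r re; do strip_dashes "$re"; done < regexes.txt`, where `regexes.txt` is a placeholder file name.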
 
Old 02-26-2020, 03:17 AM   #10
GazL
LQ Veteran
 
Registered: May 2008
Posts: 6,897

Rep: Reputation: 5018
Thanks for pointing that out. Clearly I missed the [:class:] cases. Oh well, I said it likely wasn't perfect.

If I were approaching this as a real problem to be solved I would have reached for C. I was using a regex here purely for the fun of it and to see if I could do it that way. Apparently I only got part way there.

However memey it has become I still like that Jamie Zawinski quote: It's humorous, and it's also a caution that if you attempt to do something too complicated with a regex you're going to fail, or cause yourself problems in the future. Nothing is absolute though, and regex do have their place, however it seems this wasn't one of them.
 
  

