sed to extract multiple matches in a line?

mhoch3 · 07-30-2005, 04:55 PM

Dear all,

How can I extract all occurrences of a given expression from a line?

so for instance from:

then the three of them went swimming until they were quite tired

and the regexp

the[a-z]

I would like to get

thenthemthey

I'm not too worried about overlapping matches - I'm extracting entries from a list and there are delimiters which never appear in the entries, not even in escaped form. Lucky me!

Regards,

Mydrofiol

jonaskoelker · 07-30-2005, 05:17 PM

Off the top of my head: how about substituting everything that doesn't match (that is, matches the `inverse' regexp) with (empty)?

hth --Jonas

mhoch3 · 07-30-2005, 05:37 PM

How do you match the inverse of a regexp?

I'm familiar with the use of [^q] to match all chars except q but I've never encountered (yet often wished for!) a not on longer strings. The problem I always supposed is that practically everything doesn't match the regexp.

e.g. in "the boy and his dog", the string " an" doesn't match "and", so almost any stretch of the line would count as a non-match. I also always believed that sed doesn't cope with overlapping matches, by which I mean that if you look for
b..
in
baboon
you will catch bab but not boo

(and indeed I've just checked..:

echo baboon | sed 's/b..//g'
returns
oon
which is presumably a stifled attempt at muttering the name of G. Hoon, General in charge of the British armed forces. I wonder what my machine knows about him...
)

I know there's a theorem stating that the complement (the "inverse") of every regular language is itself a regular language, but that theorem never said that the complement can be written down in the concise and elegant way I expect from sed.

mhoch3 · 07-31-2005, 09:16 AM

Although to be fair, I did once have a situation where the regexps for the bits I didn't want in each line were simple enough that I could do what you suggest. Unfortunately it isn't that straightforward this time, and surely there has to be a better way? If not, it should be added to sed. sed extract. sedex. Bound to be good.

jonaskoelker · 07-31-2005, 10:13 AM

Well, the inverse regular *expression* may be hairy, but inverting the finite automaton is easy, at least for a deterministic one with a transition defined for every symbol: AcceptStates = AllStates - AcceptStates

So, write a regexp package

Otherwise, try reading the manual; I seem to recall that it should be reasonably easy with sed (but I don't recall how).

anyone else?

hth --Jonas
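
For what it's worth, here is one way this might be done with sed alone. It is only a sketch and assumes GNU sed, which accepts \n both in the replacement and inside bracket expressions. The idea is to fence every match with newlines (a line being processed never contains one) and then delete everything outside the fences. The function name sed_extract is made up for illustration, and the pattern must not contain a "/".

```shell
# Hypothetical helper: for each input line, print all matches of the
# basic regex $1 concatenated together. Lines without a match print nothing.
# Assumes GNU sed; $1 must not contain '/'.
sed_extract() {
  sed -n '
    # 1. Wrap every match in newline markers.
    s/'"$1"'/\n&\n/g
    # 2. Drop the text before the first match.
    s/^[^\n]*\n//
    # 3. Drop the text between one match and the next.
    s/\n[^\n]*\n//g
    # 4. Drop the text after the last match and print the result.
    s/\n[^\n]*$//p
  '
}

echo 'then the three of them went swimming until they were quite tired' |
  sed_extract 'the[a-z]'
```

Like any sed substitution with the g flag, this only finds non-overlapping matches, scanning left to right.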

twsnnva · 07-31-2005, 11:28 AM

Can you use 'grep -o'?

Code:

Thomas@lightning:~$ cat test.txt
then the three of them went swimming until they were quite tired
Thomas@lightning:~$ grep -o 'the[a-z]' test.txt
then
them
they
Thomas@lightning:~$

That will give you a list; if that's not the format you want, you could run it through a for loop.

Code:

Thomas@lightning:~$ for i in $(grep -o 'the[a-z]' test.txt); do printf '%s' "$i"; done; echo
thenthemthey
Thomas@lightning:~$
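
If the loop feels heavy, the same joining can probably be done by deleting the newlines that grep -o puts between matches. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do; the option is not in POSIX); the name concat_matches is invented here:

```shell
# Hypothetical helper: read stdin, print every match of the basic
# regex $1 joined into one line.
concat_matches() {
  # grep -o prints each match on its own line; tr removes the newlines.
  grep -o -- "$1" | tr -d '\n'
  # Finish with a single newline so the prompt starts on a fresh line.
  echo
}

concat_matches 'the[a-z]' <<'EOF'
then the three of them went swimming until they were quite tired
EOF
```

Note that this joins the matches from all input lines into one string, so any line boundaries in the input are lost.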

jonaskoelker · 08-01-2005, 05:47 AM

twsnnva:

Of course--grep -o is *the* easy way to do that. One thing though: if test.txt were multiline, something would need to be done to mark line endings (as far as I can see).

Anyway, that's for the OP to decide.

OP: I suggest you read the `smart questions' faq--at least the part about describing the goal, not the step.

--Jonas

twsnnva · 08-01-2005, 08:18 AM

Quote:

Of course--grep -o is *the* easy way to do that.

You got me there. Though I'm sure we can think of a more complicated way if we put our heads together.

Quote:

One thing though: if test.txt were multiline, something would need to be done to mark line endings (as far as I can see).

Put everything in a loop that processes each line individually.
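
A rough sketch of what such a loop could look like, again assuming a grep with -o; the function name join_matches_per_line is made up for illustration. Each input line produces exactly one output line, which is empty when nothing matched:

```shell
# Hypothetical helper: for each line of file $2, print all matches of
# the basic regex $1 joined together, one output line per input line.
join_matches_per_line() {
  # The extra test keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line" | grep -o -- "$1" | tr -d '\n'
    echo
  done < "$2"
}
```

It would be called as join_matches_per_line 'the[a-z]' test.txt. Starting a grep and a tr for every line is slow on large files, so this is meant for modest inputs.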

jonaskoelker · 08-01-2005, 03:32 PM

Quote:

Originally posted by twsnnva
Put everything in a loop that processes each line individually.

D'oh!

I would think it takes a slight speed hit though.

"Premature optimization is the root of all evil."

Knuth, right?

--Jonas