Advice on sed

jjonas · 08-09-2014, 03:39 AM

Hi,

I would need to clean up HTML files by removing a recurring string of text, which has one changing element: name="ImageX", where X is a number of 1 to 3 digits:

Code:

<img align="bottom" border="0" height="1" name="Image13" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR4nGP5//8/AwAJEAMC/QwKSQAAAABJRU5ErkJggg==" width="1" />

I've read this tutorial for sed, and I think I've managed to do what I want, but I would like to ask for clarification, because I'm not entirely sure why I've apparently succeeded.

I've used the command:

Code:

sed -i 's/name="Image[^ ]*"/name="Image"/g' filename.htm

which has changed all the offending name="ImageX" names into just name="Image". I've then used the text editor's normal find-and-replace function to replace the whole litany (of which name="ImageX" was only a part) with nothing. I first tried to change it all with sed, but apparently the spaces in the string were causing problems, or maybe it was something else, but anyway I ended up doing what I've just described, and it worked.

My question is: what exactly does the [^ ]* portion of the command I executed do (and if it has several components, what is the effect of each of them)? The mentioned tutorial's section /g – Global replacement is not detailed enough for me, even though I managed to do what I did based on what was said there.

sycamorex · 08-09-2014, 04:44 AM

Quote:

Originally Posted by jjonas

My question is: what exactly does the [^ ]* portion of the command I executed do (and if it has several components, what is the effect of each of them)? The mentioned tutorial's section /g – Global replacement is not detailed enough for me, even though I managed to do what I did based on what was said there.

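It matches everything up to, but not including, the first space. [^ ] is a bracket expression that matches any single character other than a space, and * repeats it zero or more times.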

For example,

Code:

[^A]*

Would match anything up to the first capital letter A.
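To see the pieces at work, here is a small sketch run against the tag from the first post (assuming GNU sed; the variable names are only for illustration). The first command is the one jjonas used. The second is one possible way to drop the whole tag in a single pass, under the assumption that no > appears inside the tag before its closing />:

```shell
# The tag from the original post, stored in a shell variable for the demo.
tag='<img align="bottom" border="0" height="1" name="Image13" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR4nGP5//8/AwAJEAMC/QwKSQAAAABJRU5ErkJggg==" width="1" />'

# [^ ]* consumes the characters after "Image" up to the next space,
# then backs off one character so the closing quote can match.
renamed=$(printf '%s\n' "$tag" | sed 's/name="Image[^ ]*"/name="Image"/g')
printf '%s\n' "$renamed"

# Assumption: the tag contains no ">" before its end, so [^>]* can
# stand in for "everything else inside this tag", spaces included.
removed=$(printf 'before%safter\n' "$tag" | sed 's/<img [^>]*name="Image[0-9]*"[^>]*>//g')
printf '%s\n' "$removed"   # prints: beforeafter
```

Because [^>]* is allowed to match spaces, the spaces inside the tag are not a problem for the second command.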

jjonas · 08-09-2014, 05:08 AM

Hi,

thanks for the clarification. A further question: I have a file, test.txt, with the text:

TEST 123 abc

If I give the command

Code:

sed -i 's/[^ ]/test/g' test.txt

it changes the contents to

testtesttesttest testtesttest testtesttest

If I give the command

Code:

sed -i 's/[^ ]*/test/g' test.txt

the results look like

test test test

I'm not sure exactly what the logic of the * here is, could you please break it down for me?

sycamorex · 08-09-2014, 05:31 AM

Quote:

Originally Posted by jjonas

Hi,

thanks for the clarification. A further question: I have a file, test.txt, with the text:

TEST 123 abc

If I give the command

Code:

sed -i 's/[^ ]/test/g' test.txt

it changes the contents to

testtesttesttest testtesttest testtesttest

... and that's as expected. [^ ] matches any single character other than a blank space, and with /g each such character is replaced with the string 'test'.

So TEST (4 characters, each replaced with test) gives you testtesttesttest,
123 (3 characters, each replaced with test) gives you testtesttest, etc.

So here the replacements are done on individual characters.

Quote:

Originally Posted by jjonas

If I give the command

Code:

sed -i 's/[^ ]*/test/g' test.txt

the results look like

test test test

I'm not sure exactly what the logic of the * here is, could you please break it down for me?

* means zero or more occurrences of the preceding item, so here what's being replaced is not individual characters but each whole run of characters up to the next blank space. In other words, it translates to: replace zero or more consecutive non-space characters with the string 'test'. With /g this happens once per word, which is why you get one 'test' for each of the three words.

Please note that because * means ZERO or more occurrences, it is usually recommended to use [^ ][^ ]* to ensure it matches at least one character.

I hope this makes sense.
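The three variants discussed so far can be compared side by side without touching a file. A minimal sketch, assuming GNU sed and feeding the sample line through a pipe instead of using -i (the variable names are made up for the demo):

```shell
line='TEST 123 abc'

# One non-space character per match: every character becomes "test".
one_char=$(printf '%s\n' "$line" | sed 's/[^ ]/test/g')
printf '%s\n' "$one_char"      # testtesttesttest testtesttest testtesttest

# Zero or more non-space characters per match: each word becomes "test".
zero_or_more=$(printf '%s\n' "$line" | sed 's/[^ ]*/test/g')
printf '%s\n' "$zero_or_more"  # test test test

# At least one non-space character per match: same result on this input,
# but the pattern can no longer match an empty string.
one_or_more=$(printf '%s\n' "$line" | sed 's/[^ ][^ ]*/test/g')
printf '%s\n' "$one_or_more"   # test test test
```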

syg00 · 08-09-2014, 06:18 AM

Or use a "+" instead of the "*" - see the tutorial referenced above.

jjonas · 08-09-2014, 06:58 AM

I think I'm getting some of it. So if the file content is 'TEST abc 123' and I issue the command

Code:

sed -i 's/[^TE ST]/1/g' test.txt

everything except the letters T, E, S, T and the blank space will be changed to 1's. Apparently you need to have something that you exclude, because issuing

Code:

sed -i 's/[^]/1/g' test.txt

gives, instead of replacing everything with 1's,

Code:

sed: -e expression #1, char 9: unterminated `s' command

But I don't understand how more complicated stuff is supposed to work. I tried to apply the mentioned tutorial also to another HTML cleanup procedure, but without success. The text I'm trying to clean up has strings that are of the form

Code:

<sup>X</sup>

where X is some number of 1 to 3 digits. I tried to replace these with <a href="footnoteX">[X]</a> by issuing the following command:

Code:

sed -i 's/<sup>[0-9]*</sup>/<a href="footnote&">[&]</a>/g' test.txt

but it doesn't work ("sed: -e expression #1, char 21: unknown option to `s'"). I'm not sure which part the error message is referring to, and I'm not sure whether my search-for-this-text portion or replace-it-with-this-text portion (or both) of the command are faulty. Possibly special symbols like [ and < might require special notation, but I don't know.

Trying out simpler stuff didn't work either. If I have a file with the content abc 123 def 456, and I issue the command

Code:

sed -i 's/[0-9]/footnote&/g' test.txt

the file is changed into

abc footnote1footnote2footnote3 def footnote4footnote5footnote6

I understand this is because sed looks for individual numbers 0-9, and when it finds one, it changes it into "footnote&", where & is the number it found. Right?

Now, if I issue the command

Code:

sed -i 's/[0-9]*/footnote&/g' test.txt

the file is changed into

footnoteafootnotebfootnotecfootnote footnote123 footnotedfootnoteefootnoteffootnote footnote456

In other words, the numbers are changed correctly, but why does sed find the letters and change them individually? How can I make it leave the letters alone, and change only the numbers, whether they're one or three digits long?

sycamorex · 08-09-2014, 09:01 AM

Quote:

Originally Posted by jjonas

I think I'm getting some of it. So if the file content is 'TEST abc 123' and I issue the command

Code:

sed -i 's/[^TE ST]/1/g' test.txt

everything except the letters T, E, S, T and the blank space will be changed to 1's. Apparently you need to have something that you exclude, because issuing

Code:

sed -i 's/[^]/1/g' test.txt

gives, instead of replacing everything with 1's,

Code:

sed: -e expression #1, char 9: unterminated `s' command

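Well, yes. A ] placed directly after [^ is taken as a literal character belonging to the set, so the bracket expression is never closed and sed runs off the end of the command looking for the closing ]. That's why it reports an unterminated s command. An exclusion list with nothing in it does not make sense anyway.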
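If the aim is to replace every character, the dot is the usual tool: . matches any single character, so no exclusion list is needed. A quick sketch, assuming GNU sed and the sample line from the quoted post:

```shell
line='TEST abc 123'

# Everything except T, E, S and the space becomes 1.
excluded=$(printf '%s\n' "$line" | sed 's/[^TE ST]/1/g')
printf '%s\n' "$excluded"   # TEST 111 111

# "." matches any single character, so all 12 characters become 1.
all_ones=$(printf '%s\n' "$line" | sed 's/./1/g')
printf '%s\n' "$all_ones"   # 111111111111
```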

Quote:

Originally Posted by jjonas

But I don't understand how more complicated stuff is supposed to work. I tried to apply the mentioned tutorial also to another HTML cleanup procedure, but without success. The text I'm trying to clean up has strings that are of the form

Code:

<sup>X</sup>

where X is some number of 1 to 3 digits. I tried to replace these with <a href="footnoteX">[X]</a> by issuing the following command:

Code:

sed -i 's/<sup>[0-9]*</sup>/<a href="footnote&">[&]</a>/g' test.txt

but it doesn't work ("sed: -e expression #1, char 21: unknown option to `s'"). I'm not sure which part the error message is referring to, and I'm not sure whether my search-for-this-text portion or replace-it-with-this-text portion (or both) of the command are faulty. Possibly special symbols like [ and < might require special notation, but I don't know.

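Please note that in your example above, forward slashes (/) appear inside the text you want to match and inside the replacement. That confuses sed, because the forward slash is normally the delimiter of the s command. The form is s/old/new/, so something like s/ol/d/ne/w/ is syntactically incorrect. In your command sed took <sup>[0-9]*< as the pattern and sup> as the replacement, then tried to read the < at character 21 as a flag, hence "unknown option to `s'". You have two options:
Either escape each / by preceding it with \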

Code:

sed 's/<sup>[0-9]*<\/sup>/<a href="footnote&">[&]<\/a>/g'

or use some other character as a delimiter. In this case "|":

Code:

sed 's|<sup>[0-9]*</sup>|<a href="footnote&">[&]</a>|g'
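One more thing to watch in the commands above: & stands for the whole matched text, so the <sup> and </sup> tags are carried into the replacement too. If the goal is only the number, a capture group is one way to get it. A sketch, assuming GNU sed and a made-up sample line:

```shell
# Hypothetical sample line, not taken from the real HTML file.
html='See note<sup>7</sup> and note<sup>123</sup>.'

# With &, the tags come along with the digits.
with_amp=$(printf '%s\n' "$html" | sed 's|<sup>[0-9]*</sup>|<a href="footnote&">[&]</a>|g')
printf '%s\n' "$with_amp"

# \( ... \) captures just the digits; \1 reuses them in the replacement.
# [0-9][0-9]* requires at least one digit, so an empty <sup></sup> is left alone.
with_group=$(printf '%s\n' "$html" | sed 's|<sup>\([0-9][0-9]*\)</sup>|<a href="footnote\1">[\1]</a>|g')
printf '%s\n' "$with_group"
# prints: See note<a href="footnote7">[7]</a> and note<a href="footnote123">[123]</a>.
```

The | delimiter is kept here so that the slashes in </sup> and </a> need no escaping.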

Quote:

Originally Posted by jjonas

Trying out simpler stuff didn't work either. If I have a file with the content abc 123 def 456, and I issue the command

Code:

sed -i 's/[0-9]/footnote&/g' test.txt

the file is changed into

abc footnote1footnote2footnote3 def footnote4footnote5footnote6

I understand this is because sed looks for individual numbers 0-9, and when it finds one, it changes it into "footnote&", where & is the number it found. Right?

Correct.

Quote:

Originally Posted by jjonas

Now, if I issue the command

Code:

sed -i 's/[0-9]*/footnote&/g' test.txt

the file is changed into

footnoteafootnotebfootnotecfootnote footnote123 footnotedfootnoteefootnoteffootnote footnote456

In other words, the numbers are changed correctly, but why does sed find the letters and change them individually? How can I make it leave the letters alone, and change only the numbers, whether they're one or three digits long?

Now this is what I mentioned at the very end of my previous post. * means ZERO or more of the preceding item. ZERO. So [0-9]* also matches the empty string, and there is an empty string in front of every character. If you think about it, sed did exactly what it was told: at the letter a it looked for zero or more digits, found zero of them, and that still counts as a match. So it replaced that empty match with 'footnote' followed by & (which is empty here) and moved on. The letters themselves are not changed; 'footnote' is simply inserted in front of them.

This is something that is sometimes difficult to grasp about *

[0-9]* means 0 or more occurrences of a digit
[0-9][0-9]* now this means 1 or more occurrences of a digit.
So what you want is:

Code:

sed 's/[0-9][0-9]*/footnote&/g' file.txt

Or, as was suggested above, use sed with extended regex (-r) and then you can just use '+', which is less confusing than '*':

Code:

sed -r 's/[0-9]+/footnote&/g' file

Please keep reading the mentioned tutorial. All this information is there. Also make sure you understand this thing with * meaning ZERO or more (not one or more) occurrences of the previous character. It's a common source of errors.
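To make the difference concrete, the three patterns can be run on the sample line from the question. A minimal sketch, assuming GNU sed; -E is used here as the more portable spelling of -r, and the variable names are only for illustration:

```shell
line='abc 123 def 456'

# [0-9]* can match nothing at all, so "footnote" also lands
# in front of letters and spaces.
star=$(printf '%s\n' "$line" | sed 's/[0-9]*/footnote&/g')
printf '%s\n' "$star"

# Basic regex: one digit followed by zero or more digits.
basic=$(printf '%s\n' "$line" | sed 's/[0-9][0-9]*/footnote&/g')
printf '%s\n' "$basic"      # abc footnote123 def footnote456

# Extended regex: + means one or more.
extended=$(printf '%s\n' "$line" | sed -E 's/[0-9]+/footnote&/g')
printf '%s\n' "$extended"   # abc footnote123 def footnote456
```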

jjonas · 08-09-2014, 09:44 AM

Hi,

thanks for the reply, it was useful, using -r with + makes the most sense for the moment! :-)

sycamorex · 08-09-2014, 09:50 AM

Quote:

Originally Posted by jjonas

Hi,

thanks for the reply, it was useful, using -r with + makes the most sense for the moment! :-)

It is easier to grasp. As you get more and more familiar with sed, you'll probably encounter lots of examples using * and hopefully it'll become clearer to you.