LinuxQuestions.org
Old 08-09-2014, 03:39 AM   #1
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Rep: Reputation: 15
Advice on sed


Hi,

I need to clean up HTML files by removing a recurring string of text that has one changing element: name="ImageX", where X is a number of 1 to 3 digits:
Code:
<img align="bottom" border="0" height="1" name="Image13" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR4nGP5//8/AwAJEAMC/QwKSQAAAABJRU5ErkJggg==" width="1" />
I've read this tutorial for sed, and I think I've managed to do what I want, but I would like to ask for clarification, because I'm not entirely sure why I've apparently succeeded.

I've used the command:

Code:
sed -i 's/name="Image[^ ]*"/name="Image"/g' filename.htm
which has changed all the offending name="ImageX" names into just name="Image". I've then used the text editor's normal find-and-replace function to replace the whole litany (of which name="ImageX" was only a part) with nothing. I first tried to change it all with sed, but apparently the spaces in the string were causing problems, or maybe it was something else, but anyway I ended up doing what I've just described, and it worked.

My question is: what exactly does the [^ ]* portion of the command I executed do (and if it has several components, what is the effect of each of them)? The mentioned tutorial's section /g – Global replacement is not detailed enough for me, even though I managed to do what I did based on what was said there.
 
Old 08-09-2014, 04:44 AM   #2
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by jjonas View Post
My question is: what exactly does the [^ ]* portion of the command I executed do (and if it has several components, what is the effect of each of them)? The mentioned tutorial's section /g – Global replacement is not detailed enough for me, even though I managed to do what I did based on what was said there.

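It matches everything up to the first space. It has two components: [^ ] is a bracket expression that matches any one character that is not a space (the ^ right after the [ negates the list), and * repeats that zero or more times. Together they match the longest possible run of non-space characters.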

For example,

Code:
[^A]*
would match anything up to, but not including, the first capital letter A.
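To make that concrete, here is a minimal sketch, assuming GNU sed and a POSIX shell. The tag is a shortened stand-in for the one in the first post (the base64 payload is abbreviated to AAAA), and the variable names are only for illustration. The second command is a hypothetical one-step alternative to the editor's find-and-replace: it uses [^>]* so that the spaces inside the tag are no longer a problem.

```shell
# Shortened stand-in for the tag from the first post (payload abbreviated).
tag='<img align="bottom" border="0" height="1" name="Image13" src="data:image/png;base64,AAAA" width="1" />'

# [^ ]* consumes the non-space characters after "Image", so the number disappears.
renamed=$(printf '%s\n' "$tag" | sed 's/name="Image[^ ]*"/name="Image"/g')
printf '%s\n' "$renamed"

# Hypothetical one-step removal: [^>]* spans everything, spaces included,
# up to the > that closes the tag. | is used as the delimiter because the
# tag itself contains forward slashes.
removed=$(printf 'before %s after\n' "$tag" | sed 's|<img [^>]*name="Image[0-9][0-9]*"[^>]*>||g')
printf '%s\n' "$removed"
```

Try it without -i first and compare the output with the original file before editing anything in place.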
 
Old 08-09-2014, 05:08 AM   #3
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Original Poster
Rep: Reputation: 15
Hi,

thanks for the clarification. A further question. I have a file, test.txt, with the text:
TEST 123 abc
If I give the command

Code:
sed -i 's/[^ ]/test/g' test.txt
it changes the contents to
testtesttesttest testtesttest testtesttest
If I give the command

Code:
sed -i 's/[^ ]*/test/g' test.txt
the results look like
test test test
I'm not sure exactly what the logic of the * variable here is, could you please break it down for me..?
 
Old 08-09-2014, 05:31 AM   #4
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by jjonas View Post
Hi,

thanks for the clarification. A further question. I have a file, test.txt, with the text:
TEST 123 abc
If I give the command

Code:
sed -i 's/[^ ]/test/g' test.txt
it changes the contents to
testtesttesttest testtesttest testtesttest

... and that's as expected. [^ ] matches any single character other than a blank space, and with the g flag every such character is replaced with the string 'test'.

So in TEST each of the 4 characters is replaced with test, giving you testtesttesttest; in 123 each of the 3 characters is replaced with test, giving you testtesttest; and so on.

So here the replacements are done on individual characters.

Quote:
Originally Posted by jjonas View Post
If I give the command

Code:
sed -i 's/[^ ]*/test/g' test.txt
the results look like
test test test
I'm not sure exactly what the logic of the * variable here is, could you please break it down for me..?

* means zero or more occurrences of the preceding item, so what gets replaced here is no longer an individual character but a whole run of non-space characters, i.e. everything up to the next blank space. In other words: replace zero or more consecutive non-space characters with the string 'test'. Because of the g flag this is repeated for each such run on the line, which is why you get one 'test' per word.


Please note that because * means ZERO or more occurrences, [^ ]* can also match an empty string. It is therefore usually recommended to use [^ ][^ ]* to ensure it matches at least one character.
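The two cases can be compared side by side without touching a file. This is only a sketch, assuming GNU sed; the variable names are made up for the illustration.

```shell
line='TEST 123 abc'

# No *: every single non-space character is replaced on its own.
each=$(printf '%s\n' "$line" | sed 's/[^ ]/test/g')

# [^ ][^ ]*: one non-space character followed by zero or more others,
# i.e. a whole word at a time.
word=$(printf '%s\n' "$line" | sed 's/[^ ][^ ]*/test/g')

printf '%s\n' "$each"   # testtesttesttest testtesttest testtesttest
printf '%s\n' "$word"   # test test test
```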

I hope this makes sense.
 
Old 08-09-2014, 06:18 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,125

Rep: Reputation: 4120
Or use a "+" instead of the "*" (this needs extended regex, sed -r) - see the tutorial referenced above.
 
Old 08-09-2014, 06:58 AM   #6
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Original Poster
Rep: Reputation: 15
I think I'm getting some of it. So if the file content is 'TEST abc 123' and I issue the command

Code:
sed -i 's/[^TE ST]/1/g' test.txt
everything except the letters T, E, S, T and the blank space will be changed to 1's. Apparently you need to have something that you exclude, because issuing

Code:
sed -i 's/[^]/1/g' test.txt
gives, instead of replacing everything with 1's,

Code:
sed: -e expression #1, char 9: unterminated `s' command
But I don't understand how more complicated stuff is supposed to work. I tried to apply the mentioned tutorial also to another HTML cleanup procedure, but without success. The text I'm trying to clean up has strings that are of the form

Code:
<sup>X</sup>
where X is some number of 1 to 3 digits. I tried to replace these with <a href="footnoteX">[X]</a> by issuing the following command:

Code:
sed -i 's/<sup>[0-9]*</sup>/<a href="footnote&">[&]</a>/g' test.txt
but it doesn't work ("sed: -e expression #1, char 21: unknown option to `s'"). I'm not sure which part the error message is referring to, and I'm not sure whether my search-for-this-text portion or replace-it-with-this-text portion (or both) of the command are faulty. Possibly special symbols like [ and < might require special notation, but I don't know.

Trying out simpler stuff didn't work either. If I have a file with the content abc 123 def 456, and I issue the command

Code:
sed -i 's/[0-9]/footnote&/g' test.txt
the file is changed into
abc footnote1footnote2footnote3 def footnote4footnote5footnote6
I understand this is because sed looks for individual numbers 0-9, and when it finds one, it changes it into "footnote&", where & is the number it found. Right?

Now, if I issue the command

Code:
sed -i 's/[0-9]*/footnote&/g' test.txt
the file is changed into
footnoteafootnotebfootnotecfootnote footnote123 footnotedfootnoteefootnoteffootnote footnote456
In other words, the numbers are changed correctly, but why does sed find the letters and change them individually? How can I make it leave the letters alone, and change only the numbers, whether they're one or three digits long?
 
Old 08-09-2014, 09:01 AM   #7
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by jjonas View Post
I think I'm getting some of it. So if the file content is 'TEST abc 123' and I issue the command

Code:
sed -i 's/[^TE ST]/1/g' test.txt
everything except the letters T, E, S, T and the blank space will be changed to 1's. Apparently you need to have something that you exclude, because issuing

Code:
sed -i 's/[^]/1/g' test.txt
gives, instead of replacing everything with 1's,

Code:
sed: -e expression #1, char 9: unterminated `s' command
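Well, yes: a bracket expression needs at least one character inside it. A ] that comes directly after [^ is treated as a literal ], so the bracket expression is never closed, sed reads on to the end of the command looking for the closing ], and the s command ends up unterminated. If you want to match every character, use . (a dot) instead.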
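A short illustration of negated bracket expressions and of the dot, as a sketch assuming GNU sed (the variable names are just for this example):

```shell
line='TEST abc 123'

# Everything except T, E, S and the space becomes 1.
kept=$(printf '%s\n' "$line" | sed 's/[^TE ST]/1/g')

# . matches any single character, so this replaces all twelve of them.
all=$(printf '%s\n' "$line" | sed 's/./1/g')

# A ] directly after [^ is a literal ], so [^]] means "any character except ]".
no_bracket=$(printf 'a]b\n' | sed 's/[^]]/1/g')

printf '%s\n' "$kept"        # TEST 111 111
printf '%s\n' "$all"         # 111111111111
printf '%s\n' "$no_bracket"  # 1]1
```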

Quote:
Originally Posted by jjonas View Post

But I don't understand how more complicated stuff is supposed to work. I tried to apply the mentioned tutorial also to another HTML cleanup procedure, but without success. The text I'm trying to clean up has strings that are of the form

Code:
<sup>X</sup>
where X is some number of 1 to 3 digits. I tried to replace these with <a href="footnoteX">[X]</a> by issuing the following command:

Code:
sed -i 's/<sup>[0-9]*</sup>/<a href="footnote&">[&]</a>/g' test.txt
but it doesn't work ("sed: -e expression #1, char 21: unknown option to `s'"). I'm not sure which part the error message is referring to, and I'm not sure whether my search-for-this-text portion or replace-it-with-this-text portion (or both) of the command are faulty. Possibly special symbols like [ and < might require special notation, but I don't know.
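Please note that in your example above, forward slashes (/) appear as part of the string to match. That confuses sed, because the forward slash is also the delimiter of the s command: the syntax is s/old/new/, so something like s/ol/d/ne/w/ is syntactically incorrect. In your command sed takes <sup>[0-9]*< as the pattern and sup> as the replacement, then tries to read the next character (the < at position 21) as an option, hence "unknown option to `s'". You have two options:
Either escape each / by preceding it with \

Code:
sed 's/<sup>\([0-9][0-9]*\)<\/sup>/<a href="footnote\1">[\1]<\/a>/g'
or use some other character as the delimiter, in this case "|":
Code:
sed 's|<sup>\([0-9][0-9]*\)</sup>|<a href="footnote\1">[\1]</a>|g'
Also note that & stands for the whole match, including the <sup> and </sup> tags, so using & would put the tags inside the link. To reuse only the number, capture it with \( and \) and refer to it as \1 in the replacement, as in the two commands above.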
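One thing worth checking on a sample line is what ends up in the replacement: & expands to the entire match, tags included, whereas a capture group \( \) with \1 reuses only the digits. The following is a sketch assuming GNU sed; the sample text and variable names are invented for the illustration.

```shell
html='see<sup>12</sup> and<sup>7</sup>'

# & is the entire match, so the <sup> tags end up inside the brackets.
with_amp=$(printf '%s\n' "$html" | sed 's|<sup>[0-9][0-9]*</sup>|[&]|g')

# \( \) captures only the digits; \1 reuses them in the replacement.
with_group=$(printf '%s\n' "$html" | sed 's|<sup>\([0-9][0-9]*\)</sup>|<a href="footnote\1">[\1]</a>|g')

printf '%s\n' "$with_amp"    # see[<sup>12</sup>] and[<sup>7</sup>]
printf '%s\n' "$with_group"  # see<a href="footnote12">[12]</a> and<a href="footnote7">[7]</a>
```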
Quote:
Originally Posted by jjonas View Post
Trying out simpler stuff didn't work either. If I have a file with the content abc 123 def 456, and I issue the command

Code:
sed -i 's/[0-9]/footnote&/g' test.txt
the file is changed into
abc footnote1footnote2footnote3 def footnote4footnote5footnote6
I understand this is because sed looks for individual numbers 0-9, and when it finds one, it changes it into "footnote&", where & is the number it found. Right?
Correct.

Quote:
Originally Posted by jjonas View Post
Now, if I issue the command

Code:
sed -i 's/[0-9]*/footnote&/g' test.txt
the file is changed into
footnoteafootnotebfootnotecfootnote footnote123 footnotedfootnoteefootnoteffootnote footnote456
In other words, the numbers are changed correctly, but why does sed find the letters and change them individually? How can I make it leave the letters alone, and change only the numbers, whether they're one or three digits long?
Now this is what I mentioned at the very end of my previous post. * means ZERO or more of the preceding item. ZERO. So [0-9]* also matches an empty string, and there is an empty string in front of every character. In front of the letter a, sed finds zero digits. That counts as a match, so it inserts 'footnote' there and moves on. The letter itself is not part of the match and is left untouched, which is why you get footnotea rather than just footnote. The same happens in front of b, c and the blank space. Only when sed reaches 123 does the match actually contain digits.

This is something that is sometimes difficult to grasp about *

[0-9]* means 0 or more occurrences of a digit
[0-9][0-9]* means 1 or more occurrences of a digit.
So what you want is:
Code:
sed 's/[0-9][0-9]*/footnote&/g' file.txt
Or, as was suggested above, use sed with extended regex (-r) and then you can just use '+' (which is less confusing than '*'):
Code:
sed -r 's/[0-9]+/footnote&/g' file
Please keep reading the mentioned tutorial. All this information is there. Also make sure you understand that * means ZERO or more (not one or more) occurrences of the preceding item. It's a common source of errors.
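The three variants discussed in this post can be run next to each other on the sample line. This is a sketch assuming GNU sed, where -E is the same as -r; the variable names are only for illustration. With GNU sed the first command reproduces the output reported in the question.

```shell
line='abc 123 def 456'

# [0-9]* also matches the empty string in front of each letter and space.
star=$(printf '%s\n' "$line" | sed 's/[0-9]*/footnote&/g')

# At least one digit is required, so letters and spaces are left alone.
one_or_more=$(printf '%s\n' "$line" | sed 's/[0-9][0-9]*/footnote&/g')

# Same thing with extended regex: + means one or more.
plus=$(printf '%s\n' "$line" | sed -E 's/[0-9]+/footnote&/g')

printf '%s\n' "$star"
printf '%s\n' "$one_or_more"  # abc footnote123 def footnote456
printf '%s\n' "$plus"         # abc footnote123 def footnote456
```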

Last edited by sycamorex; 08-09-2014 at 09:06 AM.
 
Old 08-09-2014, 09:44 AM   #8
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Original Poster
Rep: Reputation: 15
Hi,

thanks for the reply, it was useful, using -r with + makes the most sense for the moment! :-)
 
Old 08-09-2014, 09:50 AM   #9
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by jjonas View Post
Hi,

thanks for the reply, it was useful, using -r with + makes the most sense for the moment! :-)
It is easier to grasp. As you get more and more familiar with sed, you'll probably encounter lots of examples using * and hopefully it'll become clearer to you.
 
  

