I need to clean up HTML files by removing a recurring string of text that has one changing element: name="ImageX", where X is a number of 1 to 3 digits:
I've read this tutorial for sed, and I think I've managed to do what I want, but I would like to ask for clarification, because I'm not entirely sure why I've apparently succeeded.
I've used the command:
Code:
sed -i 's/name="Image[^ ]*"/name="Image"/g' filename.htm
which has changed all the offending name="ImageX" names into just name="Image". I've then used the text editor's normal find-and-replace function to replace the whole litany (of which name="ImageX" was only a part) with nothing. I first tried to change it all with sed, but apparently the spaces in the string were causing problems, or maybe it was something else, but anyway I ended up doing what I've just described, and it worked.
My question is: what exactly does the [^ ]* portion of the command I executed do (and if it has several components, what is the effect of each of them)? The mentioned tutorial's section /g – Global replacement is not detailed enough for me, even though I managed to do what I did based on what was said there.
Quote:
Originally Posted by jjonas
My question is: what exactly does the [^ ]* portion of the command I executed do (and if it has several components, what is the effect of each of them)? The mentioned tutorial's section /g – Global replacement is not detailed enough for me, even though I managed to do what I did based on what was said there.
[^ ] matches any single character that is not a space, and * repeats that zero or more times, so together they match everything up to the first space.
For example,
Code:
[^A]*
would match anything up to the first capital letter A.
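A minimal sketch of the original cleanup, assuming GNU sed. The sample HTML line is made up for illustration, and the second pattern, which uses [0-9]* instead of [^ ]*, is a suggested tighter variant rather than something from the thread.

```shell
# Hypothetical sample line; the real HTML files are not shown in the thread.
line='<img name="Image12" src="a.png"> <img name="Image345" src="b.png">'

# Pattern from the question: any run of non-space characters, then a closing quote.
out_any=$(printf '%s\n' "$line" | sed 's/name="Image[^ ]*"/name="Image"/g')

# Tighter variant (an assumption, not from the thread): only digits may follow "Image".
out_digits=$(printf '%s\n' "$line" | sed 's/name="Image[0-9]*"/name="Image"/g')

printf '%s\n' "$out_any" "$out_digits"
```

Because [^ ]* accepts anything that is not a space, it can also swallow text that happens to follow the attribute without a space in between. Restricting the class to digits avoids that.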
Thanks for the clarification. A further question: I have a file, test.txt, with the text:
TEST 123 abc
If I give the command
Code:
sed -i 's/[^ ]/test/g' test.txt
it changes the contents to
testtesttesttest testtesttest testtesttest
... and that's as expected. [^ ] matches any single character other than a blank space, and every matched character is replaced with the string 'test'.
So TEST (4 characters) becomes testtesttesttest, 123 (3 characters) becomes testtesttest, and so on.
Here the replacements are done on individual characters.
Quote:
Originally Posted by jjonas
If I give the command
Code:
sed -i 's/[^ ]*/test/g' test.txt
the results look like
test test test
I'm not sure exactly what the logic of the * is here. Could you please break it down for me?
* means zero or more occurrences of the preceding item, so here what's being replaced is not individual characters but each whole run of non-space characters up to the next blank space. In other words, it translates to: replace zero or more consecutive non-space characters with the string 'test'.
Please note that because * means ZERO or more occurrences, it is usually recommended to use [^ ][^ ]* to ensure it matches at least one character.
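The three patterns discussed so far can be compared side by side. This is a small sketch assuming GNU sed. It pipes the sample text through sed instead of editing test.txt in place, and the variable names are only for illustration.

```shell
# Sample text from the question.
text='TEST 123 abc'

# [^ ]       : one non-space character per match
per_char=$(printf '%s\n' "$text" | sed 's/[^ ]/test/g')

# [^ ]*      : zero or more non-space characters per match (can match nothing)
zero_or_more=$(printf '%s\n' "$text" | sed 's/[^ ]*/test/g')

# [^ ][^ ]*  : at least one non-space character per match
one_or_more=$(printf '%s\n' "$text" | sed 's/[^ ][^ ]*/test/g')

printf '%s\n' "$per_char" "$zero_or_more" "$one_or_more"
```

On this input the last two commands are expected to print the same line, as reported above. They start to differ once the pattern can match an empty string somewhere, which is the reason for the [^ ][^ ]* recommendation.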
I think I'm getting some of it. So if the file content is 'TEST abc 123' and I issue the command
Code:
sed -i 's/[^TE ST]/1/g' test.txt
everything except the letters T, E, S, T and the blank space will be changed to 1's. Apparently you need to have something that you exclude, because issuing
But I don't understand how more complicated stuff is supposed to work. I tried to apply the mentioned tutorial also to another HTML cleanup procedure, but without success. The text I'm trying to clean up has strings that are of the form
Code:
<sup>X</sup>
where X is some number of 1 to 3 digits. I tried to replace these with <a href="footnoteX">[X]</a> by issuing the following command:
Code:
sed -i 's/<sup>[0-9]*</sup>/<a href="footnote&">[&]</a>/g' test.txt
but it doesn't work ("sed: -e expression #1, char 21: unknown option to `s'"). I'm not sure which part the error message is referring to, and I'm not sure whether my search-for-this-text portion or replace-it-with-this-text portion (or both) of the command are faulty. Possibly special symbols like [ and < might require special notation, but I don't know.
Trying out simpler stuff didn't work either. If I have a file with the content abc 123 def 456, and I issue the command
I understand this is because sed looks for individual numbers 0-9, and when it finds one, it changes it into "footnote&", where & is the number it found. Right?
In other words, the numbers are changed correctly, but why does sed find the letters and change them individually? How can I make it leave the letters alone, and change only the numbers, whether they're one or three digits long?
Quote:
Originally Posted by jjonas
I think I'm getting some of it. So if the file content is 'TEST abc 123' and I issue the command
Code:
sed -i 's/[^TE ST]/1/g' test.txt
everything except the letters T, E, S, T and the blank space will be changed to 1's. Apparently you need to have something that you exclude, because issuing
Well, obviously. Otherwise the whole expression does not make sense.
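For reference, a quick check of the negated bracket expression from the quoted command, assuming GNU sed and using a pipe instead of -i so that no file is modified:

```shell
# Sample text from the question.
text='TEST abc 123'

# [^TE ST] matches any single character that is NOT T, E, S or a space
# (listing T twice in the set is harmless).
result=$(printf '%s\n' "$text" | sed 's/[^TE ST]/1/g')

printf '%s\n' "$result"   # prints: TEST 111 111
```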
Quote:
Originally Posted by jjonas
But I don't understand how more complicated stuff is supposed to work. I tried to apply the mentioned tutorial also to another HTML cleanup procedure, but without success. The text I'm trying to clean up has strings that are of the form
Code:
<sup>X</sup>
where X is some number of 1 to 3 digits. I tried to replace these with <a href="footnoteX">[X]</a> by issuing the following command:
Code:
sed -i 's/<sup>[0-9]*</sup>/<a href="footnote&">[&]</a>/g' test.txt
but it doesn't work ("sed: -e expression #1, char 21: unknown option to `s'"). I'm not sure which part the error message is referring to, and I'm not sure whether my search-for-this-text portion or replace-it-with-this-text portion (or both) of the command are faulty. Possibly special symbols like [ and < might require special notation, but I don't know.
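Please note that in your example above, forward slashes (/) appear as part of the string to match. That confuses sed, because forward slashes are normally treated as the delimiters of the s command. For example, s/old/new/ is fine, but s/ol/d/ne/w/ is syntactically incorrect: sed takes the third slash as the end of the replacement and tries to read whatever follows as options. In your command the third slash is the one closing </sup>, so the < at character 21 is reported as an "unknown option to `s'". You have two options: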
Either escape / by preceding it with \
Code:
sed 's/<sup>[0-9]*<\/sup>/<a href="footnote&">[&]<\/a>/g'
or use some other character as a delimiter. In this case "|":
Code:
sed 's|<sup>[0-9]*</sup>|<a href="footnote&">[&]</a>|g'
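One more point about the replacement part: & stands for the whole matched text, which here includes the <sup> and </sup> tags, so the commands above run without error but put the tags inside the link. A possible way around this, sketched here assuming GNU sed, is to capture only the digits with \( \) and refer to them as \1. The sample line is made up for illustration.

```shell
# Hypothetical sample line; the real HTML is not shown in the thread.
html='See note<sup>7</sup> and note<sup>123</sup>.'

# With &: the whole match, tags included, is pasted into the replacement.
with_amp=$(printf '%s\n' "$html" | sed 's|<sup>[0-9]*</sup>|<a href="footnote&">[&]</a>|g')

# With a capture group: \1 holds only the digits between the tags.
with_group=$(printf '%s\n' "$html" | sed 's|<sup>\([0-9][0-9]*\)</sup>|<a href="footnote\1">[\1]</a>|g')

printf '%s\n' "$with_amp" "$with_group"
```

The second command should turn <sup>7</sup> into <a href="footnote7">[7]</a>, which is the form asked for in the question.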
Quote:
Originally Posted by jjonas
Trying out simpler stuff didn't work either. If I have a file with the content abc 123 def 456, and I issue the command
I understand this is because sed looks for individual numbers 0-9, and when it finds one, it changes it into "footnote&", where & is the number it found. Right?
In other words, the numbers are changed correctly, but why does sed find the letters and change them individually? How can I make it leave the letters alone, and change only the numbers, whether they're one or three digits long?
Now this is what I mentioned at the very end of my previous post. * means ZERO or more of the preceding item. ZERO. So [0-9]* also matches the empty string, and there is an empty string in front of every letter and every blank space. If you think about it, sed matched correctly: Match ZERO occurrences of a digit. Yep. Did it. The empty spot in front of a letter IS zero occurrences of a digit. Ok so now replace it with 'footnote'. No problem. The letters themselves are not changed; 'footnote' is just inserted in front of them.
This is something that is sometimes difficult to grasp about *:
[0-9]* means 0 or more occurrences of a digit
[0-9][0-9]* means 1 or more occurrences of a digit
So what you want is:
Code:
sed 's/[0-9][0-9]*/footnote&/g' file.txt
Or, as was suggested above, use sed with extended regular expressions (-r) and then you can just use '+', which is less confusing than '*':
Code:
sed -r 's/[0-9]+/footnote&/g' file
Please keep reading the mentioned tutorial. All this information is there. Also make sure you understand this thing with * meaning ZERO or more (not one or more) occurrences of the previous character. It's a common source of errors.
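The difference can be checked on the sample line from the question. This is a small sketch assuming GNU sed. It uses -E, which GNU sed accepts as another spelling of -r, and the variable names are only for illustration.

```shell
# Sample text from the question.
text='abc 123 def 456'

# [0-9]* can match the empty string, so 'footnote' also lands in front of non-digits.
zero_or_more=$(printf '%s\n' "$text" | sed 's/[0-9]*/footnote&/g')

# [0-9][0-9]* needs at least one digit, so only the numbers are touched.
one_or_more=$(printf '%s\n' "$text" | sed 's/[0-9][0-9]*/footnote&/g')

# Same thing with extended regular expressions and +.
ere_plus=$(printf '%s\n' "$text" | sed -E 's/[0-9]+/footnote&/g')

printf '%s\n' "$zero_or_more" "$one_or_more" "$ere_plus"
```

The last two commands should both print abc footnote123 def footnote456, while the first one scatters extra 'footnote' strings through the letters.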
Thanks for the reply, it was useful. Using -r with + makes the most sense for the moment! :-)
It is easier to grasp. As you get more and more familiar with sed, you'll probably encounter lots of examples using * and hopefully it'll become clearer to you.