LinuxQuestions.org


jonnybinthemix 09-23-2014 07:48 AM

Understanding a sed command...
 
Hey Guys,

I have been working on some scripts for some time now, and I've had a lot of help from the forum, so many thanks for that.

I'm pretty much there now.. I didn't know bash scripting before I started so it's been a learning curve.. a very enjoyable learning curve at that.

I understand everything I have, as I've been careful to make sure I fully understand something before it ends up in my script, however I have one command that has slipped through the grid and I don't fully understand it. I wonder if someone has a minute to explain it for me?

The command is:

Code:

find . -name "MF_BAT_BB*$D*" -exec sh -c 'a=$(echo {} | sed -r "s/([^.]*)\$/\L\1/"); [ "$a" != "{}" ] && mv "{}" "$a" ' \;
So, I understand the find command of course, which is finding all files under the current directory whose names start with MF_BAT_BB and contain today's date, which is stored in $D.

Then once we find the files, exec carries out a sed command... I am pretty sure it's renaming the files.. but I don't recall how and what it's renaming them as.

It's a little more difficult to dissect what it's doing as the script is quite large..

I would be eternally grateful if someone could walk me through the steps happening after the find command?

Thanks in advance,

Jon

business_kid 09-23-2014 08:17 AM

The sed command is this
Code:

sed -r "s/([^.]*)\$/\L\1/");
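The sed command itself ends at the closing double quote. The ")" after it closes the $( ... ) substitution and the semi-colon ends that shell statement. Sed is a stream editor - it pipes stuff through and, in this case, changes it on the way.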

The straight form of that is:
Code:

sed 's/Expression 1/Expression2/'
and exp. 1 is swapped for exp. 2. In your case they look like Posix regexes (extended ones, because of -r), where a backslash preceding a special character indicates the ordinary meaning for the next character.
exp1 = ([^.]*)\$
exp2 = \L\1

They look a little odd - the thing to watch is the full stop, which on its own means basically any character, while \. means a literal full stop. Inside square brackets, as in [^.], it is just a full stop as well. I have not seen '1' escaped before, but my regexes are weak.

As for the flow, each path that find matches is substituted for {} and handed to the command after -exec.
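To make the s/Expression 1/Expression 2/ form concrete, here is a minimal sketch with made-up input strings (they are not from the script in question). It shows a plain substitution and the difference between an escaped and an unescaped full stop:

```shell
# Replace the first match of the pattern on each input line.
echo "hello world" | sed 's/world/there/'    # hello there

# \. matches a literal full stop.
echo "a.b-c" | sed 's/\./_/'                 # a_b-c

# An unescaped . matches any single character, so the first one is replaced.
echo "a.b-c" | sed 's/./_/'                  # _.b-c
```

Only the first match on each line is replaced unless a "g" flag is added after the final slash.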

jonnybinthemix 09-23-2014 09:22 AM

So, as I currently understand it:

Code:

find . -name "MF_BAT_BB*$D*" -exec sh -c 'a=$(echo {} | sed -r "s/([^.]*)\$/\L\1/"); [ "$a" != "{}" ] && mv "{}" "$a" ' \;
find . -name "MF_BAT_BB*$D*" - Finds all files according to the name in the current location

-exec sh -c - Once it finds the file, execute something...

'a=$(echo {} - Execute this.. making $a nothing? (empty?)

sed -r "s/([^.]*)\$/\L\1/"); - Pipe the output to the SED command (not sure what it does).

[ "$a" != "{}" ] && mv "{}" "$a" ' \; - Then do another SED command which appears to ask if $a is not equal to {} (nothing?) and then move {} (nothing?) to $a.. ie, rename nothing with the value of $a.

That's my understanding of it from just looking at what it's doing with the knowledge I have.. but I am a little stuck in two areas. What is the first SED command doing? Renaming something? If so, how? And the second SED command appears to be actually carrying out the rename right? So maybe the first SED command is manipulating a filename and then the second is writing that filename using the mv command.

So, I think I've got a basic theoretical understanding of what it's doing.. but if possible I'd like to understand why it's doing what it's doing so that I could use variations of it elsewhere should I need to.

Thanks
Jon

jonnybinthemix 09-23-2014 09:45 AM

The more I stare at it the more things click...

$a contains the result of echo {} | sed -r "s/([^.]*)\$/\L\1/") - still no clearer as to what the sed command does, but in turn the second command seems to be nothing to do with sed and is just a shell command.

Little closer to understanding, but still a long way off lol.

rknichols 09-23-2014 10:49 AM

The sed command matches as many characters at the end of the line as it can without encountering a literal "." and changes them to lower case. The result of that substitution is assigned to variable a.

The find command will have replaced every instance of "{}" with the path that it found, so the test
Code:

[ "$a" != "{}" ]
checks whether the variable a is different from the original path (i.e., that sed actually changed something). If the test is true, the mv command renames the file to the changed name.
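Because the path is pasted straight into the script text, a file name containing quotes or other shell characters could break the command. A common alternative is to pass {} as a positional parameter and refer to it as "$1". The sketch below assumes GNU find and GNU sed; the scratch directory, the date value and the file name are invented for the demonstration:

```shell
# Work in a scratch directory so the demo touches nothing real.
tmp=$(mktemp -d)
cd "$tmp"
D=20140923                      # hypothetical date stamp
touch "MF_BAT_BB_$D.CSV"

# The path arrives as "$1" instead of being pasted into the script text.
find . -name "MF_BAT_BB*$D*" -exec sh -c '
    a=$(printf "%s\n" "$1" | sed -r "s/([^.]*)\$/\L\1/")
    [ "$a" != "$1" ] && mv -- "$1" "$a"
' sh {} \;

ls    # MF_BAT_BB_20140923.csv
```

The extra "sh" after the script becomes $0 inside the inner shell, so the path lands in $1.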

The purpose is to force the extension to lower case, e.g., rename "XyZZy.JPG" to "XyZZy.jpg".

There is a bug that changes the whole name to lower case if there is no "." in it. If such a file is in a subdirectory, that renaming attempt would extend to the directory name as well (and probably fail).

Note: I have not tested any of the above -- it's just my reading of the command.
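The reading above can be checked without touching any files by feeding names to the same substitution. This is a small sketch assuming GNU sed (for -r and \L); lower_ext is a hypothetical helper name, and the second path is invented to show the no-extension case:

```shell
# Same substitution as in the thread, single-quoted so "$" needs no backslash.
lower_ext() {
    printf '%s\n' "$1" | sed -r 's/([^.]*)$/\L\1/'
}

lower_ext "XyZZy.JPG"       # XyZZy.jpg
lower_ext "./Dir/README"    # ./dir/readme  (no "." in the name, so the directory is lowered too)
```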

jonnybinthemix 09-24-2014 02:58 AM

Wow that's awesome.. Great explanation and that makes perfect sense :)

One other question though, and I know this is probably going to be a minefield so apologies and feel free to say "Go read a book" lol.. (And I intend to read)

But how does the sed command do that? I assume it's regular expressions but how would I construct that for something else if I needed to?

jonnybinthemix 09-24-2014 04:36 AM

I have been reading all morning.. I did begin to remember quite a bit from the courses I've done about regular expressions which is good.. and I've written some notes as to my findings as I go.. I'm determined to fully understand this.. I don't like the idea of using a command/string in a script that I don't fully understand, so I'll be forever grateful if you could take a look at my notes and let me know if I'm on the right track.

Apologies if this seems a bit basic, but I'm still learning :)

Code:

find . -name "MF_BAT_BB*$D*" -exec sh -c 'a=$(echo {} | sed -r "s/([^.]*)\$/\L\1/"); [ "$a" != "{}" ] && mv "{}" "$a" ' \;

sed -r "s/([^.]*)\$/\L\1/");

"s/([^.]*)\$/\L\1/")

"s/( = The bracket is clearly sectioning off some expressions.. (I guess) - but the s/ is a mystery.. assuming it's something to do with SED?

[^.] = Exclude "." in the search? (^ inside brackets negates the expression) ^ outside []'s represents start of string? "." could mean any character? But not sure about within []'s

* = Exclude all "."'s? (* matches when the preceding character occurs 0 or more times)

\$ = $ is look only at the end of the string, but \ seems to be escape the character. This tells me it's searching for $ as a literal character but we know this is not true.

/ = Unable to find any information about /'s in regular expressions.

\L = Here I'm starting to think Regular Expressions is ending, and something new is starting.. because from what I'm reading \L should escape the L character (if it were a special character, which it doesn't look like it is). Knowing now what the command does, I guess it's something to do with lower case?

\1 = As above, not really sure.

")= This bit is confusing only because of the miss placement of the "'s and )'s.. why are they over lapping?

rknichols 09-24-2014 09:27 AM

sed has a language all its own, and regular expressions are just a small part of it. The manpage for sed has a brief synopsis of the sed commands. If the info command is installed on your system, you can run "info sed" and get a fuller description, or you can go to http://www.gnu.org/software/sed/. While the language is typically used for filtering text, it is actually Turing-complete. Someone actually wrote a sed script that emulates the bc arbitrary precision calculator.

I'll answer some of your questions:

The "s/regexp/replacement/" is the "substitute" command in sed, as it is in vi and some other editors. The part of the line matched by the regexp is replaced by the replacement text.

Since sed was given the "-r" option, it is using extended regular expressions. The parentheses within the regexp are marking a subexpression that is subsequently referenced by the "\1" back reference. (\n, where n is a single digit, refers to the n-th parenthesized subexpression of the regular expression.)
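As an illustration of back references, this sketch (assuming GNU sed for -r, with a made-up file name) captures two groups and swaps them in the replacement:

```shell
# \1 is the text matched by the first (...) group, \2 by the second.
echo "report.CSV" | sed -r 's/([^.]*)\.([^.]*)$/\2.\1/'    # CSV.report
```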

The "$" character is special to the shell, and needs to be escaped by a backslash in order to be passed literally to the sed command. You always need to be aware of how the shell handles various special characters. You can use "set -x" in the shell to see what is actually passed to each invoked command. (Use "set +x" to cancel that.)
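One quick way to see what sed actually receives is to print the argument instead of running sed. In the original command the double-quoted string is processed by the inner shell started with "sh -c"; the sketch below reproduces just that step:

```shell
# Inside double quotes the shell turns \$ into $, but leaves \L and \1 alone.
printf '%s\n' "s/([^.]*)\$/\L\1/"    # s/([^.]*)$/\L\1/

# Inside single quotes nothing is interpreted, so $ needs no backslash.
printf '%s\n' 's/([^.]*)$/\L\1/'     # s/([^.]*)$/\L\1/
```

Both lines print the same text, which is the sed script with a plain "$" anchoring the match to the end of the line.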

The "\L" in the replacement text is part of the sed language and causes the subsequent part of the replacement to be converted to lower case.
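The case-conversion escapes are a GNU sed extension, so the following sketch assumes GNU sed. Alongside \L it shows \U (upper case) and \E (stop converting); the input string is made up:

```shell
echo "Hello World" | sed -r 's/(.*)/\L\1/'                      # hello world
echo "Hello World" | sed -r 's/(.*)/\U\1/'                      # HELLO WORLD

# \E ends the conversion, so only the first group is lowered.
echo "Hello World" | sed -r 's/([^ ]+) ([^ ]+)/\L\1\E \2/'      # hello World
```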

As for that final "), that parenthesis is not part of the sed command, it is the end of the "a=$( ... )" shell variable assignment.

jonnybinthemix 09-24-2014 09:54 AM

Ah ok, thanks for that.

It's starting to make a little more sense.. and gives me something to work on :) You're a gentleman for spending the time to explain that to me, it actually all made sense too... so I credit your explanation over my ability to understand as it normally takes a while for things to sink in.

I'll take a look at the pages you've recommended and have a play around with the command to see if I can tweak it.

At the moment (I've tested it), it renames something like xxxxxx.CSV.PGP to xxxxxx.CSV.pgp. I'd actually like it to set the CSV to lower case too, which I imagine is possible by modifying the same command?

Thanks
Jon

rknichols 09-24-2014 10:46 AM

Quote:

Originally Posted by jonnybinthemix (Post 5243547)
At the moment (I've tested it), it renames something like xxxxxx.CSV.PGP to xxxxxx.CSV.pgp. I'd actually like it to set the CSV to lower case too, which I imagine is possible by modifying the same command?

Just add to the regexp:
Code:

sed -r "s/([^.]*\.[^.]*)\$/\L\1/"
Now it matches any number of characters that are not a ".", followed by a literal ".", followed by any number of characters that are not a ".", all occurring at the end of the line. Actually, I'd change those asterisks to "+" signs to insist on matching at least one non-"." character in each place:
Code:

sed -r "s/([^.]+\.[^.]+)\$/\L\1/"
It probably makes no difference, but when doing things like that I like to make my matches as specific as I can.

All of the above misbehave on files that do not have two extensions, so let's restrict the action to files that do have two extensions in the final path component:
Code:

sed -r "/[^/]+\.[^.]+\.[^.]+\$/s/([^.]+\.[^.]+)\$/\L\1/"
The address at the front, /[^/]+\.[^.]+\.[^.]+\$/, selects only those lines that end with one or more characters that are not "/", followed by ".", followed by the two "." separated extensions that we want to change.
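The effect of the restriction can be seen by running the unrestricted and restricted forms side by side. This sketch assumes GNU sed, and the sample paths are invented. The scripts are single-quoted, so "$" needs no backslash, and the address is written with the alternate delimiter form \,regex, purely so that the "/" inside the brackets cannot be mistaken for the end of the address:

```shell
for p in ./xxxxxx.CSV.PGP ./Report.TXT; do
    # Unrestricted: lowers the last two dot-separated parts, whatever they are.
    printf '%s\n' "$p" | sed -r 's/([^.]+\.[^.]+)$/\L\1/'
    # Restricted: only acts when the final path component has two extensions.
    printf '%s\n' "$p" | sed -r '\,[^/]+\.[^.]+\.[^.]+$,s/([^.]+\.[^.]+)$/\L\1/'
done
```

For ./xxxxxx.CSV.PGP both forms print ./xxxxxx.csv.pgp. For ./Report.TXT the unrestricted form prints ./report.txt, lowering the base name as well, while the restricted form leaves the path unchanged.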

business_kid 09-24-2014 11:21 AM

It's always the same - by the time you have it working you're an expert, but it's like you crammed for an exam.
Usually, you then get knowledge bulimia: Learn it for the job, forget it afterward:-P.

chrism01 09-25-2014 02:00 AM

This is also a popular how-sed-works-by-example, if a bit old: http://www.grymoire.com/Unix/Sed.html


All times are GMT -5.