Problem with sed regexp

Chipper · 03-16-2011, 04:39 AM

hi,
I have a problem with the sed command, because I don't know how to write this rule:

I want to transform this string:
foo CONST "STRING CONST" foo

And I want to transform it to:
foo <const>CONST</const> <string>"STRING CONST"</string> foo

NOT TO!!:
foo <const>CONST</const> <string>"STRING <const>CONST</const>"</string> foo

Now I have this bad regexp:
export LC_ALL=C
line="foo CONST \"STRING CONST\" foo"
line=`echo "$line" |sed -e "s/\([A-Z][A-Z0-9_]*\)/<const>\1<\/const>/g"`
line=`echo "$line" |sed -e "s/\([\"].*[\"]\)/<string>\1<\/string>/g"`

Thank you for your advice, and sorry for my bad English.

colucix · 03-16-2011, 05:23 AM

Maybe something like this?

Code:

line="foo CONST \"STRING CONST\" foo"
line=$(echo "$line" | sed 's/\([A-Z][A-Z0-9_]*\)/<const>\1<\/const>/')
line=$(echo "$line" | sed 's/\(".*"\)/<string>\1<\/string>/')

The first sed adds the <const> and </const> tags only to the first occurrence of uppercase words, the second one adds the <string> and </string> tags outside the double quotes. I'm not sure if this matches your requirement.
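
A minimal sketch of what the g flag changes, using the sample line from the first post. The variable names first_only and every are made up for illustration, and & (the whole match) is used here in place of a \(...\) group with \1:

```shell
# Pin the locale so that [A-Z] means exactly the ASCII uppercase letters.
export LC_ALL=C

line='foo CONST "STRING CONST" foo'

# Without g: only the first uppercase word on the line is tagged.
first_only=$(printf '%s\n' "$line" | sed 's/[A-Z][A-Z0-9_]*/<const>&<\/const>/')

# With g: every uppercase word is tagged, including those inside the quotes.
every=$(printf '%s\n' "$line" | sed 's/[A-Z][A-Z0-9_]*/<const>&<\/const>/g')

printf '%s\n' "$first_only" "$every"
```

The second line of output shows the unwanted tags inside the quoted string, which is why a plain global substitution is not enough on its own.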

Chipper · 03-16-2011, 05:29 AM

Sorry, it was my fault, the correct form is:

line="foo CONST \"STRING CONST\" foo"
line=$(echo "$line" | sed 's/\([A-Z][A-Z0-9_]*\)/<const>\1<\/const>/g')
line=$(echo "$line" | sed 's/\(".*"\)/<string>\1<\/string>/g')

colucix · 03-16-2011, 06:38 AM

Do you mean you have multiple occurrences of CONST and "STRING CONST" on the same line?

Chipper · 03-16-2011, 07:00 AM

Yes, I have more occurrences; the string can look like:

"foofoo" BAR "foo" BAR BAR "BAR foo" "BAR"

and the result I want is:

<string>"foofoo"</string> <const>BAR</const> <string>"foo"</string> <const>BAR</const> <const>BAR</const> <string>"BAR foo"</string> <string>"BAR"</string>

colucix · 03-16-2011, 10:15 AM

The problem with regular expressions is that there is no easy way to match what lies outside pairs of double quotes. Matching what is inside is more straightforward:

Code:

/"[^"]*"/

This matches every quoted string, even when there are several on the same line. Literally, it matches an opening double quote, followed by zero or more characters other than a double quote, followed by the closing double quote. That said, the solution to your issue is a bit tricky. Here is what I've done:

1. Add the <string> and </string> tags around the quoted strings. Assuming there are no @ characters in the text, also add an @ before the opening tag and another after the closing tag, for reasons that will become clear later:

Code:

$ line='"foofoo" BAR "foo" BAR BAR "BAR BAR foo BAR foo" "BAR"'
$ line=$(echo "$line" | sed -r 's/("[^"]*")/@<string>\1<\/string>@/g')
$ echo "$line"
@<string>"foofoo"</string>@ BAR @<string>"foo"</string>@ BAR BAR @<string>"BAR BAR foo BAR foo"</string>@ @<string>"BAR"</string>@

2. Now add the <const> and </const> tags around all the uppercase words, even those inside double quotes:

Code:

$ line=$(echo "$line" | sed -r 's/([A-Z]+)/<const>\1<\/const>/g')
$ echo "$line"
@<string>"foofoo"</string>@ <const>BAR</const> @<string>"foo"</string>@ <const>BAR</const> <const>BAR</const> @<string>"<const>BAR</const> <const>BAR</const> foo <const>BAR</const> foo"</string>@ @<string>"<const>BAR</const>"</string>@

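3. Now repeatedly remove every <const> </const> pair found between a pair of @ markers, that is, inside every <string> </string> pair. This is why the @ was added: a bracket expression such as [^@] needs a single character to mean "anything up to the marker", and it cannot exclude a multi-character pattern like <string> or </string>. The t command jumps back to the :again label for as long as the substitution keeps succeeding: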

Code:

$ line=$(echo "$line" | sed -r ':again; s/(@<string>[^@]*)<const>([^@]+)<\/const>([^@]*<\/string>@)/\1\2\3/; t again')
$ echo "$line"
@<string>"foofoo"</string>@ <const>BAR</const> @<string>"foo"</string>@ <const>BAR</const> <const>BAR</const> @<string>"BAR BAR foo BAR foo"</string>@ @<string>"BAR"</string>@

4. Now remove the @ characters and the trick is done:

Code:

$ line=$(echo "$line" | sed 's/@//g')
$ echo "$line"
<string>"foofoo"</string> <const>BAR</const> <string>"foo"</string> <const>BAR</const> <const>BAR</const> <string>"BAR BAR foo BAR foo"</string> <string>"BAR"</string>
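
To see why step 1 uses "[^"]*" rather than ".*", here is a small side-by-side sketch. The variable names greedy and bounded are only for illustration, and & stands for the whole match:

```shell
export LC_ALL=C

line='"foofoo" BAR "foo"'

# ".*" is greedy: it runs from the first quote on the line to the last one.
greedy=$(printf '%s\n' "$line" | sed 's/".*"/<string>&<\/string>/g')

# "[^"]*" cannot cross a quote, so each quoted string is matched separately.
bounded=$(printf '%s\n' "$line" | sed 's/"[^"]*"/<string>&<\/string>/g')

printf '%s\n' "$greedy" "$bounded"
# <string>"foofoo" BAR "foo"</string>
# <string>"foofoo"</string> BAR <string>"foo"</string>
```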

Feel free to ask for any clarification. Hope this helps.

Chipper · 03-16-2011, 10:32 AM

Thank you for your answer, but I don't know which characters will be in the text. The text is the output of the strace program.

I read somewhere about the hold space and pattern space features in sed, but I can't get them to work. Thank you for your solution, but it is not universal (what if there is an @ character in the text?).

colucix · 03-16-2011, 11:01 AM

OK. Since GNU sed accepts hexadecimal escapes of the form \xHH, you can choose a control character that is very unlikely to appear in the input line, for example the group separator (GS):

Code:

Dec Hex Oct Char
29  1d  035 GS    (group separator)

In this case you simply have to substitute @ with \x1d in the sed commands:

Code:

$ line='"foofoo" BAR "foo" BAR BAR "BAR BAR foo BAR foo" "BAR"'
$ line=$(echo "$line" | sed -r 's/("[^"]*")/\x1d<string>\1<\/string>\x1d/g')
$ line=$(echo "$line" | sed -r 's/([A-Z]+)/<const>\1<\/const>/g')
$ line=$(echo "$line" | sed -r ':again; s/(\x1d<string>[^\x1d]*)<const>([^\x1d]+)<\/const>([^\x1d]*<\/string>\x1d)/\1\2\3/; t again')
$ line=$(echo "$line" | sed 's/\x1d//g')
$ echo "$line"
<string>"foofoo"</string> <const>BAR</const> <string>"foo"</string> <const>BAR</const> <const>BAR</const> <string>"BAR BAR foo BAR foo"</string> <string>"BAR"</string>
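
If you want to be sure rather than rely on "most likely", the shell can check a line for the marker before sed touches it. This is only a sketch: contains_gs is a hypothetical helper name, not part of any tool, and what to do when the marker is found (skip the line, pick another control character) is left open:

```shell
# The GS control character itself (octal 035 = hex 1d), produced portably.
gs=$(printf '\035')

# Succeeds (returns 0) if the argument contains a GS character.
contains_gs() {
  case $1 in
    *"$gs"*) return 0 ;;
    *)       return 1 ;;
  esac
}

line='"foofoo" BAR "foo"'
if contains_gs "$line"; then
  echo "line already contains GS, pick another marker" >&2
else
  echo "GS is safe to use as a marker for this line"
fi
```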

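Alternatively, here is a more straightforward awk solution (it needs GNU awk, because gensub is a gawk extension). Here you can easily distinguish between quoted and unquoted text: just use the double quote as the field separator, so that the even-numbered fields are the quoted strings: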

Code:

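BEGIN { FS = "\""; OFS = "" }

{
  # Even fields were between double quotes: put the quotes back and add tags.
  # Odd fields are outside quotes: tag the uppercase identifiers in them.
  for ( i = 1; i <= NF; i++ )
    if ( i % 2 == 0 )
      $i = "<string>\"" $i "\"</string>"
    else
      $i = gensub(/([A-Z][A-Z0-9_]*)/,"<const>\\1</const>","g",$i)

  print
}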

Please note the empty string as output field separator, due to the fact that blank spaces are already inside the fields.
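
For an awk without gensub, the same idea can be sketched with the standard gsub function, where & in the replacement stands for the matched text. The function name highlight_awk is made up for this example. The sketch assumes that quotes are balanced on each line and that no string contains an escaped quote (\"), which strace output may well do:

```shell
highlight_awk() {
  LC_ALL=C awk '
    BEGIN { FS = "\""; OFS = "" }
    {
      # Even fields are quoted strings, odd fields are the text around them.
      for (i = 1; i <= NF; i++)
        if (i % 2 == 0)
          $i = "<string>\"" $i "\"</string>"
        else
          gsub(/[A-Z][A-Z0-9_]*/, "<const>&</const>", $i)
      print
    }'
}

printf '%s\n' 'open("Webinar", O_RDONLY|O_NONBLOCK) = 3' | highlight_awk
# open(<string>"Webinar"</string>, <const>O_RDONLY</const>|<const>O_NONBLOCK</const>) = 3
```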

Chipper · 03-19-2011, 05:16 AM

Thank you for your solution again. Last question:
Is this solution POSIX compliant? I ask because here: http://pubs.opengroup.org/onlinepubs...ities/sed.html I read that sed has no -r option.

jschiwal · 03-19-2011, 05:30 AM

I wonder how well your description of the input describes the output of strace. It is best to use real samples instead of a wordy explanation. Regular expressions are very finicky. Could you describe what the output for the following input sample should be?

Code:

open("Webinar", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
write(1, "\n", 1
)                       = 1
write(1, "Webinar:\n", 9Webinar:
)               = 9
getdents64(3, /* 5 entries */, 32768)   = 352
getdents64(3, /* 0 entries */, 32768)   = 0
close(3)                                = 0

Note that the write calls are split across two lines. This is where you usually need the hold space (the HOLD register): build up both lines, get the contents back into the pattern space, and then include an `\n' in the LHS expression. So you will probably need more than one command to accomplish what you want.
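
As a rough illustration of joining such split lines, here is a sketch that uses the N, P and D commands on the pattern space instead of the hold space. It assumes a continuation line always starts with a closing parenthesis, which holds for the first write in the sample but is not guaranteed in general. The function name join_split_calls is made up:

```shell
join_split_calls() {
  # $!N  append the next input line (unless this is the last line)
  # s    if the appended line starts with ")", drop the newline before it
  # P    print the pattern space up to the first newline
  # D    delete what was printed and restart with the remainder
  sed '$!N; s/\n)/)/; P; D'
}

printf '%s\n' 'write(1, "\n", 1' ')                       = 1' 'close(3)                                = 0' |
  join_split_calls
```

Program output that lands in the middle of a line, as in the second write of the sample, would stay in the joined line and would need separate handling.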

jschiwal · 03-19-2011, 05:38 AM

Sorry, if you use "strace -o file", the lines won't be split. The program output was getting mixed in with the strace output.

Code:

open("Webinar", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
write(1, "\n", 1)                       = 1
write(1, "Webinar:\n", 9)               = 9
getdents64(3, /* 5 entries */, 32768)   = 352
getdents64(3, /* 0 entries */, 32768)   = 0
close(3)                                = 0

It would still be appreciated if you showed us what the output should look like. I have no idea how "constant" and "foo" map onto lines in an strace log, so I don't know what the result would look like.

Chipper · 03-19-2011, 05:46 AM

OK, the source input file is: http://dl.dropbox.com/u/21850274/strace.txt

The output file is: http://dl.dropbox.com/u/21850274/out.html

Look at the 9th line, at the string. It is wrong.

I highlight the strings and constants like this:

# string highlight
line=`echo "$line" |sed 's/\("[^"]*"\)/<span class="string">\1<\/span>/g'`

# constant highlight
line=`echo "$line" |sed 's/\([^_A-Za-z0-9]\)\([A-Z][_A-Z0-9]*\)\([^_A-Za-z0-9]\)/\1<span class="const">\2<\/span>\3/g'`
line=`echo "$line" |sed 's/\([^_A-Za-z0-9]\)\([A-Z][_A-Z0-9]*\)\([^_A-Za-z0-9]\)/\1<span class="const">\2<\/span>\3/g'`

colucix · 03-19-2011, 05:49 AM

Quote:

Originally Posted by Chipper

Is this solution POSIX compliant? Because here: http://pubs.opengroup.org/onlinepubs...ities/sed.html I read that sed hasn't got -r option.

The -r option here is just to avoid escaping some special characters: the parentheses and the plus sign. You can safely remove the -r option and escape these characters instead.
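
For a sed without the GNU extensions, note that \+ and \x1d are GNU additions as well, and that a label traditionally has to end its own script line. A sketch of the same four steps that tries to stay within basic regular expressions follows. The function name highlight_posix is made up, the GS character is produced by printf and passed in through a shell variable, [^GS]* replaces [^GS]+, and the identifier pattern from the first post replaces [A-Z]+:

```shell
# The GS control character itself (octal 035 = hex 1d).
gs=$(printf '\035')

highlight_posix() {
  # 1. wrap quoted strings in <string> tags plus GS markers
  # 2. wrap every uppercase identifier in <const> tags
  # 3. loop: strip <const> pairs found between GS markers
  # 4. drop the GS markers
  LC_ALL=C sed \
    -e "s/\"[^\"]*\"/${gs}<string>&<\/string>${gs}/g" \
    -e 's/[A-Z][A-Z0-9_]*/<const>&<\/const>/g' \
    -e ':again' \
    -e "s/\(${gs}<string>[^${gs}]*\)<const>\([^${gs}]*\)<\/const>\([^${gs}]*<\/string>${gs}\)/\1\2\3/" \
    -e 't again' \
    -e "s/${gs}//g"
}

printf '%s\n' '"foofoo" BAR "foo" BAR BAR "BAR BAR foo BAR foo" "BAR"' | highlight_posix
```

All steps run in a single sed process here, so the line is read only once. The earlier caveat still applies: the input must not contain the GS character itself.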