LinuxQuestions.org
Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.

Old 03-16-2011, 04:39 AM   #1
Chipper
LQ Newbie
 
Registered: Feb 2011
Posts: 11

Rep: Reputation: 0
Problem with sed regexp


Hi,
I have a problem with sed: I don't know how to write this rule.

I want to transform this string:
foo CONST "STRING CONST" foo

into:
foo <const>CONST</const> <string>"STRING CONST"</string> foo

and NOT into:
foo <const>CONST</const> <string>"STRING <const>CONST</const>"</string> foo


This is my current, incorrect attempt:
export LC_ALL=C
line="foo CONST \"STRING CONST\" foo"
line=`echo "$line" |sed -e "s/\([A-Z][A-Z0-9_]*\)/<const>\1<\/const>/g"`
line=`echo "$line" |sed -e "s/\([\"].*[\"]\)/<string>\1<\/string>/g"`

Thank you for your advice, and sorry for my bad English.
 
Old 03-16-2011, 05:23 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Maybe something like this?
Code:
line="foo CONST \"STRING CONST\" foo"
line=$(echo "$line" | sed 's/\([A-Z][A-Z0-9_]*\)/<const>\1<\/const>/')
line=$(echo "$line" | sed 's/\(".*"\)/<string>\1<\/string>/')
The first sed adds the <const> and </const> tags only to the first uppercase word; the second one wraps the quoted text, double quotes included, in <string> and </string> tags. I'm not sure whether this matches your requirement.
 
Old 03-16-2011, 05:29 AM   #3
Chipper
LQ Newbie
 
Registered: Feb 2011
Posts: 11

Original Poster
Rep: Reputation: 0
Sorry, my fault

Quote:
Originally Posted by colucix View Post
Maybe something like this?
Code:
line="foo CONST \"STRING CONST\" foo"
line=$(echo "$line" | sed 's/\([A-Z][A-Z0-9_]*\)/<const>\1<\/const>/')
line=$(echo "$line" | sed 's/\(".*"\)/<string>\1<\/string>/')
The first sed adds the <const> and </const> tags only to the first uppercase word; the second one wraps the quoted text, double quotes included, in <string> and </string> tags. I'm not sure whether this matches your requirement.
Sorry, that was my fault; the correct form is:

line="foo CONST \"STRING CONST\" foo"
line=$(echo "$line" | sed 's/\([A-Z][A-Z0-9_]*\)/<const>\1<\/const>/g')
line=$(echo "$line" | sed 's/\(".*"\)/<string>\1<\/string>/g')
 
Old 03-16-2011, 06:38 AM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Do you mean you have multiple occurrences of CONST and "STRING CONST" on the same line?
 
Old 03-16-2011, 07:00 AM   #5
Chipper
LQ Newbie
 
Registered: Feb 2011
Posts: 11

Original Poster
Rep: Reputation: 0

Quote:
Originally Posted by colucix View Post
Do you mean you have multiple occurrences of CONST and "STRING CONST" on the same line?
Yes, I have multiple occurrences. The string can look like:

"foofoo" BAR "foo" BAR BAR "BAR foo" "BAR"

and the result I want is:

<string>"foofoo"</string> <const>BAR</const> <string>"foo"</string> <const>BAR</const> <const>BAR</const> <string>"BAR foo"</string> <string>"BAR"</string>
 
Old 03-16-2011, 10:15 AM   #6
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
The problem with using regular expressions is that there is no easy way to identify what lies outside pairs of double quotes. What is inside is a little more straightforward:
Code:
/"[^"]*"/
This matches every quoted string (even if there are several on the same line). Literally, it matches an opening double quote, followed by zero or more characters other than a double quote, followed by the closing double quote. That said, the solution to your issue is a bit tricky. Here is what I've done:

1. Add the <string> and </string> tags around the quoted strings. Assuming there are no @ characters in the text, also add an opening @ and a closing @, for reasons that will become clear later:
Code:
$ line='"foofoo" BAR "foo" BAR BAR "BAR BAR foo BAR foo" "BAR"'
$ line=$(echo "$line" | sed -r 's/("[^"]*")/@<string>\1<\/string>@/g')
$ echo "$line"
@<string>"foofoo"</string>@ BAR @<string>"foo"</string>@ BAR BAR @<string>"BAR BAR foo BAR foo"</string>@ @<string>"BAR"</string>@
2. Now add the <const> and </const> tags around all the uppercase words, even those inside double quotes:
Code:
$ line=$(echo "$line" | sed -r 's/([A-Z]+)/<const>\1<\/const>/g')
$ echo "$line"
@<string>"foofoo"</string>@ <const>BAR</const> @<string>"foo"</string>@ <const>BAR</const> <const>BAR</const> @<string>"<const>BAR</const> <const>BAR</const> foo <const>BAR</const> foo"</string>@ @<string>"<const>BAR</const>"</string>@
3. Now repeatedly remove every <const> </const> pair inside the @ pairs, that is, inside every <string> </string> pair. This is why the @ markers were added: a bracket expression such as [^@] needs a single character in order to match any text that does not contain the multi-character pattern <string> or </string>:
Code:
$ line=$(echo "$line" | sed -r ':again; s/(@<string>[^@]*)<const>([^@]+)<\/const>([^@]*<\/string>@)/\1\2\3/; t again')
$ echo "$line"
@<string>"foofoo"</string>@ <const>BAR</const> @<string>"foo"</string>@ <const>BAR</const> <const>BAR</const> @<string>"BAR BAR foo BAR foo"</string>@ @<string>"BAR"</string>@
4. Now remove the @ characters and the trick is done:
Code:
$ line=$(echo "$line" | sed 's/@//g')
$ echo "$line"
<string>"foofoo"</string> <const>BAR</const> <string>"foo"</string> <const>BAR</const> <const>BAR</const> <string>"BAR BAR foo BAR foo"</string> <string>"BAR"</string>
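As a minimal sketch, the four steps above can also be chained in a single sed call by giving each command its own -e. This assumes GNU sed (for -r) and, as before, that the input contains no @ characters; the variable name result is only illustrative.

```shell
# Sample input taken from step 1 above.
line='"foofoo" BAR "foo" BAR BAR "BAR BAR foo BAR foo" "BAR"'

# 1st -e: wrap quoted strings in <string> tags plus temporary @ markers.
# 2nd -e: wrap every run of uppercase letters in <const> tags.
# 3rd to 5th -e: loop, stripping one <const> pair per pass from inside the @ markers.
# 6th -e: drop the @ markers.
result=$(printf '%s\n' "$line" | sed -r \
  -e 's/("[^"]*")/@<string>\1<\/string>@/g' \
  -e 's/([A-Z]+)/<const>\1<\/const>/g' \
  -e ':again' \
  -e 's/(@<string>[^@]*)<const>([^@]+)<\/const>([^@]*<\/string>@)/\1\2\3/' \
  -e 't again' \
  -e 's/@//g')

printf '%s\n' "$result"
```

Keeping the label and the t command in separate -e expressions avoids relying on sed accepting a semicolon after a label name.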
Feel free to ask for any clarification. Hope this helps.

Last edited by colucix; 03-16-2011 at 10:16 AM.
 
Old 03-16-2011, 10:32 AM   #7
Chipper
LQ Newbie
 
Registered: Feb 2011
Posts: 11

Original Poster
Rep: Reputation: 0

Thank you for your answer, but I don't know which characters will appear in the text. The text is
the output of the strace program.

I read somewhere about the hold space and pattern space features in sed, but I can't get them to
work. Thank you for your solution, but it is not universal (what if the text contains an @ character?).
 
Old 03-16-2011, 11:01 AM   #8
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
OK. Since GNU sed understands hexadecimal escapes such as \x1d, you can choose a control character that is very unlikely to appear in the input line, for example the group separator (GS):
Code:
Dec Hex Oct Char
29  1d  035 GS    (group separator)
In this case you simply have to substitute @ with \x1d in the sed commands:
Code:
$ line='"foofoo" BAR "foo" BAR BAR "BAR BAR foo BAR foo" "BAR"'
$ line=$(echo "$line" | sed -r 's/("[^"]*")/\x1d<string>\1<\/string>\x1d/g')
$ line=$(echo "$line" | sed -r 's/([A-Z]+)/<const>\1<\/const>/g')
$ line=$(echo "$line" | sed -r ':again; s/(\x1d<string>[^\x1d]*)<const>([^\x1d]+)<\/const>([^\x1d]*<\/string>\x1d)/\1\2\3/; t again')
$ line=$(echo "$line" | sed 's/\x1d//g')
$ echo "$line"
<string>"foofoo"</string> <const>BAR</const> <string>"foo"</string> <const>BAR</const> <const>BAR</const> <string>"BAR BAR foo BAR foo"</string> <string>"BAR"</string>
Alternatively, here is a more straightforward awk solution (gensub requires GNU awk). Here you can easily distinguish between quoted and unquoted text: just use the double quote as the field separator:
Code:
BEGIN { FS = "\""; OFS = "" }

{
  
  for ( i = 1; i <= NF; i++ ) 
    if ( i % 2 == 0 )
      $i = "<string>\"" $i "\"</string>"
    else
      $i = gensub(/([A-Z][A-Z0-9_]*)/,"<const>\\1</const>","g",$i)
  
  print
  
}
Please note the empty string as output field separator: the blank spaces are already inside the fields.
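If gawk is not available, the same idea can be sketched with a plain awk function built on match() and substr() instead of gensub. The helper name tag_consts and the variable result are made up for this illustration, and the sketch assumes the double quotes on each line are balanced and never escaped (a \" inside a string would throw off the field count):

```shell
line='"foofoo" BAR "foo" BAR BAR "BAR foo" "BAR"'

result=$(printf '%s\n' "$line" | LC_ALL=C awk '
BEGIN { FS = "\""; OFS = "" }

# Hypothetical helper: wrap each constant-like word of s in <const> tags.
function tag_consts(s,    out) {
  out = ""
  while (match(s, /[A-Z][A-Z0-9_]*/)) {
    out = out substr(s, 1, RSTART - 1) "<const>" substr(s, RSTART, RLENGTH) "</const>"
    s = substr(s, RSTART + RLENGTH)   # continue after the match
  }
  return out s
}

{
  # Even-numbered fields sit between quotes; odd-numbered fields are outside.
  for (i = 1; i <= NF; i++)
    if (i % 2 == 0)
      $i = "<string>\"" $i "\"</string>"
    else
      $i = tag_consts($i)
  print
}')

printf '%s\n' "$result"
```

The loop consumes the field from left to right, so text that has already been tagged is never scanned again.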
 
1 members found this post helpful.
Old 03-19-2011, 05:16 AM   #9
Chipper
LQ Newbie
 
Registered: Feb 2011
Posts: 11

Original Poster
Rep: Reputation: 0
Thank you for your solution again. One last question:
Is this solution POSIX compliant? I ask because here: http://pubs.opengroup.org/onlinepubs...ities/sed.html I read that sed does not have a -r option.
 
Old 03-19-2011, 05:30 AM   #10
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
I wonder how well your description of the input matches the output of strace. It is best to use real samples instead of a wordy explanation. Regular expressions are very finicky. Could you describe what the output for the following input sample would be?
Code:
open("Webinar", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
write(1, "\n", 1
)                       = 1
write(1, "Webinar:\n", 9Webinar:
)               = 9
getdents64(3, /* 5 entries */, 32768)   = 352
getdents64(3, /* 0 entries */, 32768)   = 0
close(3)                                = 0
Note that the write calls are split across two lines. This is when you usually need to use the hold space: build up both lines, get the buffer back, and then include a `\n' in the LHS expression. So you will probably need more than one command to accomplish what you want.
 
Old 03-19-2011, 05:38 AM   #11
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
Sorry, if you use "strace -o file", the lines won't be split. The program output was getting mixed in with the strace output.

Code:
open("Webinar", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
write(1, "\n", 1)                       = 1
write(1, "Webinar:\n", 9)               = 9
getdents64(3, /* 5 entries */, 32768)   = 352
getdents64(3, /* 0 entries */, 32768)   = 0
close(3)                                = 0
It would still be appreciated if you showed us what the output should look like. I have no idea how constant & foo relate to lines in an strace log, so I don't know what the result would look like.

Last edited by jschiwal; 03-19-2011 at 05:41 AM.
 
Old 03-19-2011, 05:46 AM   #12
Chipper
LQ Newbie
 
Registered: Feb 2011
Posts: 11

Original Poster
Rep: Reputation: 0
OK, the source input file is: http://dl.dropbox.com/u/21850274/strace.txt

The output file is: http://dl.dropbox.com/u/21850274/out.html

Look at the string on the 9th line. It is wrong.

I highlight the strings and constants like this:

#string highlight
line=`echo "$line" |sed 's/\(\"[^"]*\"\)/<span class\="string">\1<\/span>/g'`

#constant highlight
line=`echo "$line" |sed 's/\([^_A-Za-z0-9]\)\([A-Z][_A-Z0-9]*\)\([^_A-Za-z0-9]\)/\1<span class="const">\2<\/span>\3/g$
line=`echo "$line" |sed 's/\([^_A-Za-z0-9]\)\([A-Z][_A-Z0-9]*\)\([^_A-Za-z0-9]\)/\1<span class="const">\2<\/span>\3/g$
 
Old 03-19-2011, 05:49 AM   #13
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by Chipper View Post
Is this solution POSIX compliant? Because here: http://pubs.opengroup.org/onlinepubs...ities/sed.html I read that sed hasn't got -r option.
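The -r option here is just to avoid escaping some special characters: the parentheses and the plus sign. You can remove the -r option and escape these characters. Strictly speaking, \+ and \x1d are GNU extensions as well, so for a POSIX sed use \{1,\} in place of + and a literal control character in place of \x1d.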
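A rough sketch of what a version without -r might look like, using only basic regular expressions. The variable names gs and result are illustrative, the GS character is produced with printf rather than \x1d, and the sketch still assumes that GS never appears in the input:

```shell
# GS (octal 035) as the temporary marker; assumed absent from the input.
gs=$(printf '\035')
line='"foofoo" BAR "foo" BAR BAR "BAR BAR foo BAR foo" "BAR"'

# Same four steps as before, written as basic regular expressions:
# groups are \( \), "one or more" is \{1,\}, and the label and the
# t command each get their own -e instead of being joined with ';'.
result=$(printf '%s\n' "$line" | sed \
  -e "s/\(\"[^\"]*\"\)/$gs<string>\1<\/string>$gs/g" \
  -e 's/\([A-Z]\{1,\}\)/<const>\1<\/const>/g' \
  -e ':again' \
  -e "s/\($gs<string>[^$gs]*\)<const>\([^$gs]\{1,\}\)<\/const>\([^$gs]*<\/string>$gs\)/\1\2\3/" \
  -e 't again' \
  -e "s/$gs//g")

printf '%s\n' "$result"
```

The expressions that need the marker are double-quoted so the shell expands $gs; inside double quotes the shell leaves sequences such as \( and \1 untouched, so only the literal double quotes need an extra backslash.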
 
  



Tags
regular expressions, sed


