Newbie SED / AWK / Regex command help request

Critcho · 03-15-2007, 02:32 PM

I'm not too familiar with some of the regex syntax, so I am seeking help with a text parser. I believe SED / AWK would probably do it...

I have a text file, basically a code file.

The interpreter doesn't like tabs or comments, and wants each command on its own line, with the complete command on one line only. I do like formatting, though: it helps me read the code and understand what is going on when I look at it a week later.

Different text editors will replace a <tab> with various combinations of special characters or consecutive spaces.

So I need to replace:
- tab
- /t
- carriage return / line feed
- everything on a line after // (comments)
- consecutive spaces

... with a single space (in that order, so any consecutive replacements to a single space don't add up to multiple spaces, the last replacement is the consecutive spaces)
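For what it's worth, the ordered replacements above can be sketched as one pipeline. This is only a rough sketch, not something tested against the real interpreter: the function name flatten_code is made up, it assumes a POSIX shell with sed and tr, and it keeps the single spaces around parentheses. Comments are stripped before the lines are joined, because otherwise //.* would swallow the rest of the joined line.

```shell
# flatten_code (hypothetical name): print the whole input as one line.
flatten_code() {
    tab=$(printf '\t')                 # a literal tab; not every sed understands \t
    sed "s|//.*||; s/$tab/ /g" "$@" |  # 1. drop // comments, 2. tabs -> spaces
        tr -s '\r\n' '  ' |            # 3. CR/LF -> space, then squeeze runs of spaces
        sed 's/^ //; s/ $//'           # 4. trim the leading/trailing space
    echo                               # finish with a newline
}
```

Usage would be something like: flatten_code formatted-code > flat-code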

e.g.

Code:

//Test for the start command
Digital01 = if (
                  ( //system is enabled
                       Mode = 2
                   OR  Mode = 3
                   )
               AND
                   ( //start request received
                       Calculated_Start = 1
                    OR Manual_Start = 1
                    )
            Then
                 1, //start
            Else
                 Digital01) //Unchanged

... gets replaced with

Code:

Digital01 = if ((Mode = 2 OR Mode = 3) AND (Calculated_Start = 1 OR Manual_Start = 1) Then 1, Else Digital01)

Thanks!!
Critcho.

anomie · 03-15-2007, 03:28 PM

Is this homework?

A bit of an ugly hack, but I got this working with gawk: several calls to its string function sub(), and concatenation of each line to a single long string that gets displayed at the end.

http://ftp.wayne.edu/pub/gnu/Manuals...mono/gawk.html

Critcho · 03-15-2007, 04:26 PM

Nope, not cheating on homework. It's part of a much larger program that I am working on with a team. It is a minor addition to the project, so none of the main engineers want to divert time to write and test the script, and I am pretty busy at the moment; learning awk looks like it will take me some time...

anomie · 03-15-2007, 04:39 PM

Ok, I warned you -- it ain't pretty.

Code:

[simba@lk ~]$ cat formatted-code 
//Test for the start command
Digital01 = if (
                  ( //system is enabled
                       Mode = 2
                   OR  Mode = 3
                   )
               AND
                   ( //start request received
                       Calculated_Start = 1
                    OR Manual_Start = 1
                    )
            Then
                 1, //start
            Else
                 Digital01) //Unchanged

[simba@lk ~]$ gawk --posix '{ sub(/\/\/.*$/,""); str=(str $0); gsub(/[[:space:]]{2,}/," ",str); }END{ print str; }' formatted-code 
Digital01 = if ( ( Mode = 2 OR Mode = 3 ) AND ( Calculated_Start = 1 OR Manual_Start = 1 ) Then 1, Else Digital01)

[simba@lk ~]$ gawk --version | head -1
GNU Awk 3.1.3

Very close to your desired output, but not exact. You might need to play around with it and tweak it a bit. (Which will require learning g/awk and regular expressions if you decide to go this route.)

Good luck.
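One possible tweak, as a sketch only: the variant below avoids --posix and the {2,} interval expression (in case another awk doesn't accept them; I haven't tried it on BusyBox) and also drops the space just inside each parenthesis, which is the difference from the desired output. The function name flatten_awk is made up for illustration.

```shell
# flatten_awk (hypothetical name): join all lines into one and tidy the spacing.
flatten_awk() {
    awk '
    {
        sub(/\/\/.*$/, "")          # drop // comments
        sub(/\r$/, "")              # drop a trailing carriage return, if any
        str = str " " $0            # append the line to one long string
    }
    END {
        gsub(/[ \t]+/, " ", str)    # runs of spaces/tabs -> one space
        gsub(/\( /, "(", str)       # no space after an opening parenthesis
        gsub(/ \)/, ")", str)       # no space before a closing parenthesis
        sub(/^ /, "", str)          # trim leading space
        sub(/ $/, "", str)          # trim trailing space
        print str
    }' "$@"
}
```

Called as flatten_awk formatted-code, this should print the statement on a single line without the padding inside the parentheses.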

Critcho · 03-15-2007, 07:06 PM

anomie, thanks for your help.

The script as you have it doesn't work on my OS (embedded Linux - BusyBox v1.00), but by following your lead (and studying just the regexes you used rather than the whole book) I have got most of the way with:

Code:

sed 's/  */ /g; s/\/\/.*//g; /^$/d' temp.lge

this returns:

Code:

Digital01 = if (
 (
 Mode = 2
 OR Mode = 3
 )
 AND
 (
 Calculated_Start = 1
 OR Manual_Start = 1
 )
 Then
 1,
 Else
 Digital01)

I just need to replace the carriage returns...

anomie · 03-15-2007, 09:44 PM

Sure - the regexp to search for to get the carriage returns will look something like:
"\r$" (which means match a carriage return at the very end of the line)
or
"\r" (match a carriage return anywhere in the line)

pokemaster · 03-16-2007, 11:24 AM

in other words,

Code:

sed 's/  */ /g; s/\/\/.*//g; /^$/d; /\r/d; ' temp.lge

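However, this won't work: sed works on one line at a time, with the trailing newline already stripped, so the script never gets to see the line break, and /\r/d deletes every matching line outright instead of joining it to the next. Instead, run this: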

Code:

sed 's/  */ /g; s/\/\/.*//g; /^$/d; ' temp.lge | tr '\n' ' '

Does this help?
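If you'd rather stay within sed, the usual workaround is to pull the following lines into the pattern space with N, after which \n can be matched. A minimal sketch, assuming a sed that accepts labels given as separate -e expressions; the name join_with_sed is made up, and the whole input is held in memory, so it only suits small files.

```shell
# join_with_sed (hypothetical name): join all input lines with single spaces.
join_with_sed() {
    # :a     define a label
    # N      append the next input line (with a newline) to the pattern space
    # $!ba   unless this is the last line, branch back to the label
    # s///   by now every line is in the pattern space, so \n can be matched
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' "$@"
}
```

By contrast, a plain sed 's/\n/ /g' leaves the input unchanged, because each line arrives without its newline.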

Critcho · 03-16-2007, 02:46 PM

Yeah, I had tried that, but it replaced all my "r"'s with spaces....

Code:

[user@Dreadnaught config]$ cat test3.lge.txt
Line 1 has words with the letter r in it

Line 3    ends with a space

Line 5 next line is 2 spaces

[user@Dreadnaught config]$ sed 's/  */ /g; s/\/\/.*//g; /^$/d; s/\r/ /g' test3.lge.txt >test3.lge
[user@Dreadnaught config]$ cat test3.lge
Line 1 has wo ds with the lette    in it
Line 3 ends with a space
Line 5 next line is 2 spaces

[user@Dreadnaught config]$
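Judging by the output above, this sed treats \r as a plain letter r rather than a carriage return. One way around that, sketched here under the assumption of a POSIX shell, is to hand sed a real carriage return character produced by printf. The function name strip_cr is made up for illustration.

```shell
# strip_cr (hypothetical name): remove a carriage return at the end of each line.
strip_cr() {
    cr=$(printf '\r')        # a real CR character, so sed needs no \r escape
    sed "s/$cr\$//" "$@"     # delete the CR only when it is the last character
}
```

tr -d '\r' is a blunter alternative: it removes every carriage return, not just the ones at line ends.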

Critcho · 03-16-2007, 03:31 PM

Quote:

Originally Posted by pokemaster

Code:

sed 's/  */ /g; s/\/\/.*//g; /^$/d; ' temp.lge | tr '\n' ' '

Does this help?

Sure does!! Yup, that is getting me much closer!

Thanks heaps!

I just played with the order a bit to remove the double spaces created by the tr command (it can't take null / '' as a second argument). Now I would like to be able to join lines that are not separated by blank lines (i.e. convert paragraphs into lines, one line per paragraph). It's getting tougher, but I'm getting much closer...

My interim solution of writing the code with -- on blank lines will do me for now.

i.e.

Code:

[user@Dreadnaught config]$ cat test3.lge.txt
Line 1 has words with the letter r in it
//Comments on line 2
--
Line 4    ends with a space
--
Line 6 next line is 2 spaces
--
All above lines should stay on their own line
--
  Lines 10 through 12
  are considered a paragraph
  they should end up on one line
[user@Dreadnaught config]$ sed 's/\/\/.*//g; /^$/d' test3.lge.txt | tr '\n' ' ' | tr '\-\-' '\n' | sed 's/\  */ /g; s/^ //g' >test3.lge
[user@Dreadnaught config]$ cat test3.lge
Line 1 has words with the letter r in it

Line 4 ends with a space

Line 6 next line is 2 spaces

All above lines should stay on their own line

Lines 10 through 12 are considered a paragraph they should end up on one line [user@Dreadnaught config]$
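The paragraph idea (one output line per block of lines separated by blank lines) could also be sketched with awk's paragraph mode, without the -- markers. This assumes an awk that supports RS = "" as POSIX describes, which I can't vouch for on every BusyBox build, and the name join_paragraphs is made up. Note that a comment on a line of its own becomes a blank line once stripped, so it ends a paragraph.

```shell
# join_paragraphs (hypothetical name): print each blank-line-separated
# paragraph on one line, with comments removed and spacing normalised.
join_paragraphs() {
    sed 's|//.*||' "$@" |                 # drop // comments first
        awk 'BEGIN { RS = "" }            # paragraph mode: blank lines split records
             NF    { $1 = $1; print }'    # rebuild the record with single spaces
}
```

Usage would be something like: join_paragraphs test3.lge.txt > test3.lge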

pokemaster · 03-16-2007, 05:15 PM

ah, well, try this:

Code:

sed 's/  */ /g; s/\/\/.*//g; /^$/d; ' temp.lge | tr -d '\n' | sed -e 's/  */ /g;'

Edit: on closer inspection, I see you also asked about paragraphs; this only solves the multiple spaces. Note the tr -d '\n': that is how you do the null replacement you were attempting (-d = delete).

To keep the paragraphs, you will need to change it a little more:

Code:

sed 's/  */ /g; s/\/\/.*//g; s/ *$//; s/^ *//; s/^$/\/\//' tmp.lge | tr '\n' ' ' | sed -e 's/\/\//\n/g'

I kept the space in the tr command, since otherwise some commands ended up run together...
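A small illustration of that difference, using a made-up three-line input: deleting the newlines glues neighbouring tokens together, while translating them to spaces keeps the tokens apart (at the cost of a trailing space and possible double spaces to clean up afterwards).

```shell
# Hypothetical sample input: three short lines of a command.
sample='AND
(
Mode = 2'

deleted=$(printf '%s\n' "$sample" | tr -d '\n')    # newlines removed entirely
spaced=$(printf '%s\n' "$sample" | tr '\n' ' ')    # each newline becomes a space

printf '[%s]\n' "$deleted"    # prints [AND(Mode = 2]
printf '[%s]\n' "$spaced"     # prints [AND ( Mode = 2 ]
```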

Critcho · 03-19-2007, 11:22 AM

OK, I think I have the final final working version

Code:

sed 's/\/\/.*//g; s/ *$//; s/^$/\/\//' $1.lge.txt | tr '\n' ' ' | sed 's/\/\//\n/g; s/  */ /g' | sed 's/^ //g; /^[#tab]*$/d' > $1.lge

And it does what I want it to!!

Code:

[user@Dreadnaught config]$ cat cleanlge.sh
if [ $# -ne 1 ]; then
        echo "cleanlge script"
        echo
        echo "function:"
        echo "  cleans comments from lge file, puts all commands on one line"
        echo
        echo "usage:"
        echo "  . ./cleanlge.sh filename"
        echo "  will clean up filename.lge.txt and save as filename.lge"
        echo
else
        echo "--------Original File------------"
        cat $1.lge.txt
        echo "--------Cleaned File-------------"
        sed 's/\/\/.*//g; s/ *$//; s/^$/\/\//' $1.lge.txt | tr '\n' ' ' | sed 's/\/\//\n/g; s/  */ /g' | sed 's/^ //g; /^[#tab]*$/d' > $1.lge
        cat $1.lge
        echo
        echo "---------------------------------"
fi
[user@Dreadnaught config]$ . ./cleanlge.sh test3
--------Original File------------
Line 1 has words with the letter r in it
//Comments on line 2

Line 4    ends with a space //Comments at the end of line 4

Line 6 next line is 2 spaces

All above lines should stay on their own line

  Lines 10 through 12 //and they have comments on each line
  are considered a paragraph //more comments
  they should end up on one line
--------Cleaned File-------------
Line 1 has words with the letter r in it
Line 4 ends with a space
Line 6 next line is 2 spaces
All above lines should stay on their own line
Lines 10 through 12 are considered a paragraph they should end up on one line
---------------------------------
[user@Dreadnaught config]$

Yah!

Thanks all!