LinuxQuestions.org - multiline pattern matching

Quote:

Originally Posted by csegau (Post 4354163)

Hi all,

I am finding it difficult to handle multiline pattern matching. The problem is like this.

csegau, as I understand it, your files have a structure similar to this:

Code:

# Name      Age
1 Alice     25
2 Bob       26
3 Carol     25
4 ThePersonT26
  hatHasAVer
  yLongName
5 Dave      26

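and you're having trouble, for example, finding all names that contain "Very" (in row 4 above, "Ver" and "y" end up on different lines, so a line-by-line search never sees the whole word). Am I correct?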

If the first field is empty for all continuation lines, then this is quite easy to solve using awk. GNU awk versions 2.13 and later have a facility that makes this much easier, but it's not too hard without it. The script below should work with any awk that accepts a regular expression as RS (gawk and mawk both do):

Code:

awk 'BEGIN    { RS="[\t\n\v\f\r ]*[\r\n]+"
                FS="\n"
                OFS="\t"
                col[1] = 1;  len[1] = 2
                col[2] = 3;  len[2] = 10
                col[3] = 13; len[3] = 2
                cols  = 3
                row    = 1
              }
              { if (substr($0, col[1], len[1]) ~ /^[\t ]*$/)
                    for (i = 1; i <= cols; i++)
                        field[i] = field[i] substr($0, col[i], len[i])
                else {
                    row = NR
                    for (i = 1; i <= cols; i++)
                        field[i] = substr($0, col[i], len[i])
                }

                NF = cols
                for (i = 1; i <= cols; i++) $i = field[i]
              }

    # Now you can use $1 .. $cols (or field[1] to field[cols]).
    # The starting row is in the variable row.
    # Note: the rule below runs for every physical line, so on a
    # continuation line the fields hold only what has been merged so far.

    $2 ~ /Very/ { print $0 }
    '

The second-to-last line checks whether the second logical field contains Very and, if so, prints the entire record with tabs between the fields (since OFS is a tab).
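For comparison, the GNU awk facility mentioned earlier is presumably the FIELDWIDTHS variable, which splits each record into fixed-width fields. The following is only a rough sketch under that assumption: it needs gawk, the widths 2, 10 and 2 are taken from len[1] to len[3] above, and find_name is a made-up helper name. Unlike the script above, it buffers each logical record and tests it only once it is complete.

```shell
#!/bin/sh
# Sketch only: requires GNU awk (gawk); FIELDWIDTHS is a gawk extension.
# find_name is a hypothetical helper, not part of the original script.
find_name() {    # usage: find_name REGEX < file
    gawk -v pat="$1" '
        # Field widths mirror len[1], len[2], len[3] from the script above.
        BEGIN { FIELDWIDTHS = "2 10 2"; OFS = "\t" }

        # Print the buffered logical record if its name matches pat.
        function emit() {
            sub(/ +$/, "", id); sub(/ +$/, "", name)
            if (have && name ~ pat) print id, name, age
        }

        # Blank first field: continuation line, append to the buffer.
        $1 ~ /^ *$/ { name = name $2; age = age $3; next }

        # Anything else starts a new logical record.
        { emit(); id = $1; name = $2; age = $3; have = 1 }

        END { emit() }
    '
}
```

It would be called as find_name Very < yourfile, and prints the first field, the merged name and the age of each matching record, separated by tabs.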

Another alternative is to reconstruct the data with one logical record per line, using e.g. tabs (\t) or pipes (|) as the field separator:

Code:

awk '
BEGIN { RS="[\t\n\v\f\r ]*[\r\n]+"
        FS="\n"
        OFS="\t"
        col[1] = 1;  len[1] = 2
        col[2] = 3;  len[2] = 10
        col[3] = 13; len[3] = 2
        cols  = 3
        row    = 0
      }
      { if (substr($0, col[1], len[1]) ~ /^[\t ]*$/)
            for (i = 1; i <= cols; i++)
                field[i] = field[i] substr($0, col[i], len[i])
        else {
            if (row) {
                printf("%s", field[1])
                for (i = 2; i <= cols; i++)
                    printf("%s%s", OFS, field[i])
                printf("\n")
            }
            row = NR
            for (i = 1; i <= cols; i++)
                field[i] = substr($0, col[i], len[i])
        }
      }
END  { if (row) {
            printf("%s", field[1])
            for (i = 2; i <= cols; i++)
                printf("%s%s", OFS, field[i])
            printf("\n")
        }
      }'

Since the latter script will merge all split fields, you can use grep or sed on the output.
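As a small illustration of that last step, here is a sketch that runs grep and sed over merged output. The sample lines are made up to match the example table, and they assume the merged fields keep their fixed-width padding (which is what the substr() calls would produce); the variable names are only for illustration.

```shell
#!/bin/sh
# Illustrative merged output, tab-separated, one logical record per line.
# Assumption: fields still carry the padding from the fixed-width columns.
tab=$(printf '\t')
merged=$(printf '%s\n' \
    "1 ${tab}Alice     ${tab}25" \
    "4     ${tab}ThePersonThatHasAVeryLongName${tab}26" \
    "5 ${tab}Dave      ${tab}26")

# grep now sees the whole name on one line, so "Very" is found.
matches=$(printf '%s\n' "$merged" | grep 'Very')

# sed can strip the padding that is left in front of each tab.
trimmed=$(printf '%s\n' "$matches" | sed "s/ *${tab}/${tab}/g")

printf '%s\n' "$trimmed"
```

With real data you would pipe the output of the merging script straight into grep or sed instead of using the sample variable.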

Hope this helps.