multiline pattern matching

csegau · 05-12-2011, 05:43 AM

Hi all,

I am finding it difficult to handle multiline pattern matching. The problem is as follows.

I have a formatted text file in which each column is X characters wide. When the text in a column exceeds X characters, the remaining characters are placed on the next line. The same happens with the other columns of the text file.

Now I have to search for a string that is longer than X characters.

The problem is that my search string is on a single line, while the text file contains it split across multiple lines because of the column width.

How can I search for it?

Thanks in advance

grail · 05-12-2011, 06:58 AM

I know what you have written is probably clear to you, but I am way lost. How about you show an actual example of the input and the desired output?

Also, what have you tried?

Nominal Animal · 05-13-2011, 02:06 PM

csegau, as I understand it, your files have a structure similar to this:

Code:

# Name      Age
1 Alice     25
2 Bob       26
3 Carol     25
4 ThePersonT26
  hatHasAVer
  yLongName
5 Dave      26

and you're having trouble with, for example, finding all names that contain "Very". Am I correct?

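If the first field is empty for all continuation lines, then this is quite easy to solve using awk. GNU awk versions 2.13 and later have a facility (the FIELDWIDTHS variable) that makes this much easier, but it's not too hard without it. The version below should work with any awk that accepts a regular expression as RS (GNU awk and mawk do; POSIX leaves a multi-character RS unspecified):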

Code:

awk 'BEGIN    { RS="[\t\n\v\f\r ]*[\r\n]+"
                FS="\n"
                OFS="\t"
                col[1] = 1;  len[1] = 2
                col[2] = 3;  len[2] = 10
                col[3] = 13; len[3] = 2
                cols   = 3
                row    = 1
              }
              { if (substr($0, col[1], len[1]) ~ /^[\t ]*$/)
                    for (i = 1; i <= cols; i++)
                        field[i] = field[i] substr($0, col[i], len[i])
                else {
                    row = NR
                    for (i = 1; i <= cols; i++)
                        field[i] = substr($0, col[i], len[i])
                }

                NF = cols
                for (i = 1; i <= cols; i++) $i = field[i]
              }

     # Now you can use $1 .. $cols (or field[1] to field[cols]).
     # The starting row is in variable 'row'.

     $2 ~ /Very/ { print $0 }
    '

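The second-to-last line checks whether the second logical field contains Very and, if so, prints the entire record with tabs between the fields (since OFS is a tab). Note that the check runs after every physical line, so a pattern that already matches before the last continuation line has been read will print a partial record, and print it again for each later continuation line.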
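For comparison, here is a minimal sketch of the same search using FIELDWIDTHS, assuming GNU awk is installed as gawk. The file name sample.txt and the function name find_name are made up for illustration, and the widths "2 10 2" simply mirror the col/len values used above. Unlike the script above, it tests a record only once all of its continuation lines have been merged.

```shell
#!/bin/sh
# Sketch only: needs GNU awk (gawk), because FIELDWIDTHS is a gawk extension.
# sample.txt and find_name are hypothetical names used for illustration.
cat > sample.txt <<'EOF'
# Name      Age
1 Alice     25
2 Bob       26
3 Carol     25
4 ThePersonT26
  hatHasAVer
  yLongName
5 Dave      26
EOF

find_name() {
    # $1: regular expression to match against the merged Name column
    # $2: input file
    gawk -v pat="$1" '
    # Strip the padding that fixed-width fields carry at the end.
    function trim(s) { sub(/[ \t]+$/, "", s); return s }
    # Print the pending logical record if its merged name matches.
    function emit()  { if (have && name ~ pat) print trim(id), trim(name), trim(age) }

    BEGIN { FIELDWIDTHS = "2 10 2"; OFS = "\t" }

    # Blank first column: continuation line, append to the pending record.
    $1 ~ /^[ \t]*$/ { name = name $2; age = age $3; next }

    # Otherwise a new record starts: flush the previous one first.
    { emit(); id = $1; name = $2; age = $3; have = 1 }

    END { emit() }
    ' "$2"
}
```

With the sample above, find_name 'Very' sample.txt should print record 4 once, with the name merged into ThePersonThatHasAVeryLongName.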

Another alternative is to reconstruct the data so that each logical record sits on a single line, using e.g. tabs (\t) or pipes (|) as the field separator:

Code:

awk '
BEGIN { RS="[\t\n\v\f\r ]*[\r\n]+"
        FS="\n"
        OFS="\t"
        col[1] = 1;  len[1] = 2
        col[2] = 3;  len[2] = 10
        col[3] = 13; len[3] = 2
        cols   = 3
        row    = 0
      }
      { if (substr($0, col[1], len[1]) ~ /^[\t ]*$/)
            for (i = 1; i <= cols; i++)
                field[i] = field[i] substr($0, col[i], len[i])
        else {
            if (row) {
                printf("%s", field[1])
                for (i = 2; i <= cols; i++)
                    printf("%s%s", OFS, field[i])
                printf("\n")
            }
            row = NR
            for (i = 1; i <= cols; i++)
                field[i] = substr($0, col[i], len[i])
        }
      }
END   { if (row) {
            printf("%s", field[1])
            for (i = 2; i <= cols; i++)
                printf("%s%s", OFS, field[i])
            printf("\n")
        }
      }'

Since the latter script will merge all split fields, you can use grep or sed on the output.
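As a usage sketch, here is the merge-then-grep idea end to end. This is a compact variant, not the script above verbatim: it sticks to the default RS so it should not depend on regular-expression record separators, it uses a pipe as the separator, and it trims the padding from each merged field. The names merge_rows and sample.txt are hypothetical, and the column layout is the one assumed earlier in this post.

```shell
#!/bin/sh
# Sketch only: merge_rows and sample.txt are hypothetical names.
cat > sample.txt <<'EOF'
# Name      Age
1 Alice     25
2 Bob       26
3 Carol     25
4 ThePersonT26
  hatHasAVer
  yLongName
5 Dave      26
EOF

merge_rows() {
    awk '
    # Print the pending logical record, one line, fields joined by "|".
    function emit(    i, f, line) {
        if (!have) return
        line = ""
        for (i = 1; i <= cols; i++) {
            f = field[i]
            sub(/[ \t]+$/, "", f)            # drop column padding
            line = line (i > 1 ? "|" : "") f
        }
        print line
    }
    BEGIN { col[1] = 1;  len[1] = 2
            col[2] = 3;  len[2] = 10
            col[3] = 13; len[3] = 2
            cols   = 3 }
    { sub(/\r$/, "") }                       # tolerate CRLF line endings
    # Blank first column: continuation line, append each column chunk.
    substr($0, col[1], len[1]) ~ /^[ \t]*$/ {
        for (i = 1; i <= cols; i++)
            field[i] = field[i] substr($0, col[i], len[i])
        next
    }
    # New record: flush the previous one, then start collecting again.
    {
        emit()
        for (i = 1; i <= cols; i++)
            field[i] = substr($0, col[i], len[i])
        have = 1
    }
    END { emit() }
    ' "$@"
}

# Search the merged output as if every record had been on one line.
merge_rows sample.txt | grep 'Very'
```

Because each record is complete before it reaches grep, a search string longer than the column width matches once, and the same pipeline works with sed in place of grep.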

Hope this helps.