LinuxQuestions.org


csegau 05-12-2011 05:43 AM

multiline pattern matching
 
Hi all,

I am finding it difficult to handle multiline pattern matching. The problem is like this.

I have a formatted text file in which each column has a width of 'X' characters. When the text in a column exceeds 'X' characters, the remaining characters are placed on the next line. The same happens with the other columns of the text file.

Now I have to search for a string that is longer than 'X' characters.

The problem is that my search string is on a single line, but the text file contains it split across multiple lines because of the column width.

How can I search for it?

Thanks in advance

grail 05-12-2011 06:58 AM

I know what you have written is probably clear to you, but I am way lost. How about you show an actual example of the input and the desired output?

Also, what have you tried?

Nominal Animal 05-13-2011 02:06 PM

Quote:

Originally Posted by csegau (Post 4354163)
Hi all,

I am finding it difficult to handle multiline pattern matching. The problem is like this.

csegau, as I understand it, your files have a structure similar to this:
Code:

# Name      Age
1 Alice     25
2 Bob       26
3 Carol     25
4 ThePersonT26
  hatHasAVer
  yLongName
5 Dave      26

and you're having a problem, for example, searching for all names that contain "Very". Am I correct?

If the first field is empty for all continuation lines, then this is quite easy to solve using awk. GNU awk versions 2.13 and later have a facility (FIELDWIDTHS) that makes this much easier, but it's not too hard with other awks either. The following should work with any awk that accepts a regular expression as RS, such as gawk or mawk:
Code:

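awk 'BEGIN    { RS="[\t\n\v\f\r ]*[\r\n]+"   # line end, swallowing trailing whitespace
                FS="\n"                      # keep each line as one field
                OFS="\t"                     # separator used when printing
                col[1] = 1;  len[1] = 2      # start column and width of each field
                col[2] = 3;  len[2] = 10
                col[3] = 13; len[3] = 2
                cols  = 3
                row    = 1
              }
              { # Continuation line (first column blank): append each piece.
                if (substr($0, col[1], len[1]) ~ /^[\t ]*$/)
                    for (i = 1; i <= cols; i++)
                        field[i] = field[i] substr($0, col[i], len[i])
                else {
                    # New row: remember where it starts and reset the fields.
                    row = NR
                    for (i = 1; i <= cols; i++)
                        field[i] = substr($0, col[i], len[i])
                }

                # Rebuild $1 .. $cols from the accumulated fields.
                NF = cols
                for (i = 1; i <= cols; i++) $i = field[i]
              }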

    # Now you can use $1 .. $cols (or field[1] to field[cols]).
    # The starting row is in variable 'row'.

    $2 ~ /Very/ { print $0 }
    '

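The second-to-last line checks whether the second logical field contains Very and, if so, prints the entire record with tabs between the fields (since OFS is a tab). Note that this test runs after every physical line, so a record that spans several lines is tested again each time a continuation line is appended.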
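
As a rough sketch of the GNU awk route mentioned earlier, and assuming the facility in question is the FIELDWIDTHS variable, the same search can be written more compactly. The function name search_name and the pat variable are made up for this illustration, and it needs gawk specifically. Unlike the script above, it waits until a logical row is complete before testing it, and it trims the column padding before printing.

```shell
# Sketch only: requires GNU awk (gawk); FIELDWIDTHS is a gawk extension.
# search_name is a hypothetical helper, not part of any tool.
# Usage (data.txt is a placeholder file name): search_name Very < data.txt
search_name() {
    gawk -v pat="$1" '
    # Column widths from the example: id = 2, name = 10, age = 2.
    BEGIN { FIELDWIDTHS = "2 10 2"; OFS = "\t" }

    # Strip trailing padding from one assembled field.
    function trim(s) { sub(/[ \t]+$/, "", s); return s }

    # Print the finished logical row if the full name matches the pattern.
    function emit() {
        if (have && name ~ pat) print trim(id), trim(name), trim(age)
    }

    # Drop trailing whitespace and any carriage return; gawk re-splits $0.
    { sub(/[ \t\r]+$/, "") }

    # Continuation line (blank first column): append the pieces.
    $1 ~ /^[ \t]*$/ { name = name $2; age = age $3; next }

    # Any other line starts a new logical row, so flush the previous one.
    { emit(); id = $1; name = $2; age = $3; have = 1 }

    END { emit() }
    '
}
```

Because the match is deferred to emit(), each logical row is printed at most once, whichever physical line the pattern happens to fall on.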

Another alternative is to reconstruct the data, using e.g. tabs \t or pipes | as the field separator:
Code:

awk '
BEGIN { RS="[\t\n\v\f\r ]*[\r\n]+"
        FS="\n"
        OFS="\t"
        col[1] = 1;  len[1] = 2
        col[2] = 3;  len[2] = 10
        col[3] = 13; len[3] = 2
        cols  = 3
        row    = 0
      }
      { if (substr($0, col[1], len[1]) ~ /^[\t ]*$/)
            for (i = 1; i <= cols; i++)
                field[i] = field[i] substr($0, col[i], len[i])
        else {
            if (row) {
                printf("%s", field[1])
                for (i = 2; i <= cols; i++)
                    printf("%s%s", OFS, field[i])
                printf("\n")
            }
            row = NR
            for (i = 1; i <= cols; i++)
                field[i] = substr($0, col[i], len[i])
        }
      }
END  { if (row) {
            printf("%s", field[1])
            for (i = 2; i <= cols; i++)
                printf("%s%s", OFS, field[i])
            printf("\n")
        }
      }'

Since the latter script will merge all split fields, you can use grep or sed on the output.
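
To illustrate the grep/sed idea, here is a self-contained sketch. The merge_rows helper is a hypothetical, compact stand-in for the script above: it uses sub() instead of a regular-expression RS and trims the column padding, which the original does not. The data is the sample table from this thread, laid out to match the column widths, and the temporary file is just for the demonstration.

```shell
# Sample data written to a temporary file (the location is arbitrary).
data=$(mktemp)
cat > "$data" <<'EOF'
# Name      Age
1 Alice     25
2 Bob       26
3 Carol     25
4 ThePersonT26
  hatHasAVer
  yLongName
5 Dave      26
EOF

# Hypothetical helper: print one tab-separated line per logical row.
merge_rows() {
    awk '
    # Start columns and widths of the three fields in the example.
    BEGIN { n = split("1 3 13", col, " "); split("2 10 2", len, " ") }

    # Print the row collected so far, with trailing padding removed.
    function emit(   i, line, f) {
        if (!have) return
        line = ""
        for (i = 1; i <= n; i++) {
            f = field[i]
            sub(/[ \t]+$/, "", f)
            line = line (i > 1 ? "\t" : "") f
        }
        print line
    }

    {
        sub(/[ \t\r]+$/, "")                  # drop trailing whitespace / CR
        cont = (substr($0, col[1], len[1]) ~ /^[ \t]*$/)
        if (!cont) { emit(); have = 1 }       # a new row flushes the old one
        for (i = 1; i <= n; i++) {
            part = substr($0, col[i], len[i])
            field[i] = cont ? (field[i] part) : part
        }
    }

    END { emit() }
    ' "$@"
}

# Ordinary line-oriented tools now see each logical row on a single line.
merge_rows "$data" | grep -F 'Very'          # prints the merged row with id 4
merge_rows "$data" | sed -n '/LongName/p'    # same row, selected with sed

rm -f "$data"
```

The same pipeline works with any other line-based filter, since the wrapping has been undone before the search starts.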

Hope this helps.


All times are GMT -5.