LinuxQuestions.org

btacuso 03-05-2013 11:24 PM

Multiline and complex text input for bash/ awk/perl help
 
Hi. The horizontal space here is not enough, so I attached the details and a sample input file instead. Basically, it is a file from which I need to pick data for import into Excel. Perl is welcome, but I prefer bash/awk, where I have basic knowledge; I have zero experience with Perl. Thanks for lending a hand.

chrism01 03-06-2013 01:46 AM

1.
Code:

file folio2.txt
folio2.txt: ASCII English text, with CRLF line terminators

I'd fix that first with dos2unix if you're processing in Linux; see http://linux.die.net/man/1/unix2dos for the reverse tool.
You'll want that reverse conversion (unix2dos) when you put the file back on MS.
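
A minimal sketch of that check-and-convert step, using only `tr` and `grep` for systems where dos2unix is not installed. The two-line sample and the file names are made up for illustration; the real folio2.txt is not reproduced here.

```sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Hypothetical two-line sample with CRLF terminators (not the real folio2.txt).
printf 'Account: 1273961\r\nRoom: 230\r\n' > "$workdir/crlf_sample.txt"

# Count the lines that contain a carriage return before conversion.
crlf_before=$(grep -c "$(printf '\r')" "$workdir/crlf_sample.txt")

# Delete every carriage return, which is enough for plain CRLF text files.
tr -d '\r' < "$workdir/crlf_sample.txt" > "$workdir/lf_sample.txt"

# grep -c exits non-zero when the count is 0, so ignore its exit status.
crlf_after=$(grep -c "$(printf '\r')" "$workdir/lf_sample.txt" || true)

echo "before=$crlf_before after=$crlf_after"
rm -rf "$workdir"
```

Note that `tr -d '\r'` removes every carriage return, not only those at the end of a line, which is usually fine for text exported from a report.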

2. 'Account' almost works as a record marker, but sometimes the same number appears twice (e.g. 277557771), and in between you get lines like '0.00 will be billed to: Account 1273961'.

3. That's a fairly free-flow format; I'm guessing not all records have an identical(!) layout.
Personally I would use Perl, but that's my weapon of choice for this sort of thing, and it's something Perl is very good at.
If you do go Perl:
http://perldoc.perl.org/
http://www.tizag.com/perlT/index.php

btacuso 03-06-2013 03:26 AM

1.) I do not think I need to deal with the line terminators at this point. The input was edited just so everybody could see it. I will do the conversion on the real file, which is a big one.
2.) It is not "Account" but "Account:" that I tested as the "Record Separator", and it seemed to work.
3.) You are right, it is very much free-flowing; I don't even know where to start. I would take a Perl solution if you could write one, but I might not be able to do even the simplest maintenance in Perl if needed. Thanks again.

grail 03-06-2013 07:45 AM

With either Perl or awk, I am not sure I see the patterns that would let you get at some of your data.

Perhaps you could show us, from a single account, what makes each line unique enough that the required information can be extracted?

Also, I would leave RS at its default and treat each "Account:" line as the point at which to reset the variables storing your data. That way you do not have to resort to a loop that cycles over the fields. Just a thought.
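
A rough sketch of that idea, assuming the input already has LF line endings. The function name `extract_rooms`, the account numbers 111 and 222, and the choice of "Room:" as the only field collected are illustrative assumptions, not taken from the attached file.

```sh
# Hypothetical helper: prints "account room" for each account block on stdin.
extract_rooms() {
    awk '
        # A new "Account:" line closes the previous record and resets state.
        /^Account:/ {
            if (seen_account) print account, room
            seen_account = 1
            account = $2
            room = "missing"
            next
        }

        # Ordinary lines only update the variables for the current account.
        /^Room:/ { room = $2 }

        # Flush the last record, which has no following "Account:" line.
        END { if (seen_account) print account, room }
    '
}

# Made-up sample; the "billed to" line shows why the pattern is anchored with ^.
printf 'Account: 111\nRoom: 230\n0.00 will be billed to: Account 1273961\nAccount: 222\nnoise\n' |
    extract_rooms
```

Because RS stays at its default, every rule sees one physical line at a time, and an account with no "Room:" line is reported as "missing" rather than inheriting the previous value.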

rigor 03-06-2013 09:01 AM

btacuso,

Using awk, it would seem possible to take, more or less, a "State Machine" style of approach to your task.

With your sample input data, and this program:

Code:

BEGIN  {
            # Constants:

            # "kludge" values to effectively allow for variables to be treated
            # almost as "boolean" type in awk:
            true  =  (  1  ==  1  ) ;
            false  =  (  0  ==  1  ) ;

            # Initial values/states for dynamic variables:
            found_prior_account_start = false ;
            account = "***UNKNOWN***" ;
            account_info_lines_count = 0 ;
            delete account_info ;
            prev_line_was_safari = false ;
            name_info = "***UNKNOWN***" ;
            name = "error" ;
            booking_id = "unknown" ;
            room = "***err***" ;
            arrival_date = "***error***" ;
            departure_date = "***error***" ;
            found_folio_summary = false ;
            financial_data_count = 0 ;
            found_balance_due = false ;

            # Expect that gawk may not be configured with --enable-switch
            # so "kludge" a switch-like behavior.
            financial_data_map[ 0 ] = "room_charge" ;
            financial_data_map[ 1 ] = "locale_tax" ;
            financial_data_map[ 2 ] = "occupancy_tax" ;
            financial_data_map[ 3 ] = "other_tax" ;
            financial_data_map[ 4 ] = "balance_due" ;
            financial_data[  financial_data_map[ 0 ]  ]  =  "0.0" ;
            financial_data[  financial_data_map[ 1 ]  ]  =  "0.0" ;
            financial_data[  financial_data_map[ 2 ]  ]  =  "0.0" ;
            financial_data[  financial_data_map[ 3 ]  ]  =  "0.0" ;
            financial_data[  financial_data_map[ 4 ]  ]  =  "0.0" ;

            total_tax = 0.0 ;

            prev_booking_id = "***UNKNOWN***" ;

            print "lname,fname    booking    invoice      arrival      departure    rate    tax    balance\n" ;

            debug = false ;
        }


function output_account_data()
{
    for (  fin_data_elem = 1 ;  fin_data_elem < 4 ;  fin_data_elem++  )
        financial_data[  financial_data_map[ fin_data_elem ]  ]  +=  0.0 ;

    for (  fin_data_elem = 1 ;  fin_data_elem < 4 ;  fin_data_elem++  )
        total_tax  +=  financial_data[  financial_data_map[ fin_data_elem ]  ]  ;

    if (  debug  ==  true  )
    {
        print "\n\n\n\nCURRENT ACCOUNT:  "  account ;
        print "name_info="  name_info ;
        print "name='"  name  "'" ;
        print "booking_id='"  booking_id  "'" ;
        print "room='"  room  "'" ;
        print "arrival_date='"  arrival_date  "'" ;
        print "departure_date='"  departure_date  "'" ;

        print financial_data_map[ 0 ] "='"  financial_data[  financial_data_map[ 0 ]  ]  "'" ;
        print "total_tax='"  total_tax  "'" ;
        print financial_data_map[ 4 ] "='"  financial_data[  financial_data_map[ 4 ]  ]  "'" ;
    }

    if (  name  !=  "error"  )
    {
        split(  name ,  pieces ,  /,/  ) ;
        lname = pieces[ 1 ] ;
    }
    else
    {
        lname = "***err***" ;
    }

    printf  "%-15s %-10s %-13s %-13s %-13s %7.2f %7.2f %7.2f\n" ,  name ,  booking_id ,  lname  room ,  arrival_date ,  departure_date ,  financial_data[  financial_data_map[ 0 ]  ] ,  total_tax ,  financial_data[  financial_data_map[ 4 ]  ] ;
}


/^Account:/ {
                if (  debug  ==  true  )
                    print $0 ;

                if (  found_prior_account_start  ==  true  )
                {
                    if (  booking_id  !=  prev_booking_id  )
                    {
                        output_account_data() ;
                    }
                }
                else
                    found_prior_account_start = true ;

                prev_booking_id = booking_id ;
                delete account_info ;
                account_info_lines_count = 0 ;
                account = $2 ;
                gsub(  /[\r]/ ,  "" ,  account  ) ;

                name_info = "***UNKNOWN***" ;
                name = "error" ;
                booking_id = "error" ;
                room = "***err***" ;
                arrival_date = "***error***" ;
                departure_date = "***error***" ;
                found_folio_summary = false ;
                financial_data_count = 0 ;
                found_balance_due = false ;
                financial_data[  financial_data_map[ 0 ]  ]  =  "0.0" ;
                financial_data[  financial_data_map[ 1 ]  ]  =  "0.0" ;
                financial_data[  financial_data_map[ 2 ]  ]  =  "0.0" ;
                financial_data[  financial_data_map[ 3 ]  ]  =  "0.0" ;
                financial_data[  financial_data_map[ 4 ]  ]  =  "0.0" ;
                total_tax = 0.0 ;

                next ;
            }


(  found_folio_summary  ==  true  ) {
                                        if (  debug  ==  true  )
                                            print $0 ;

                                        if (  $0  ~  /^x____________/  )
                                        {
                                            # Handle "stray" numeric data sometimes found after end of "normal" input.
                                            found_folio_summary = false ;

                                            next ;
                                        }

                                        if (  $0  ~  /^Balance Due:/  )
                                        {
                                            found_balance_due =  true ;

                                            next ;
                                        }

                                        if (  financial_data_count  >  4  )
                                            next ;

                                        if (  $0  ~  /^[0-9]+\.[0-9]+[\r]$/  )
                                        {
                                            if (  ( financial_data_count < 4 )  ||  (  ( financial_data_count == 4 )  &&  ( found_balance_due == true )  )  )
                                            {
                                              gsub(  /[\r]/ ,  "" ,  $0  ) ;
                                              financial_data[  financial_data_map[ financial_data_count++ ]  ]  =  $0 ;
                                            }
                                        }

                                        next ;
                                    }


/^Folio Summary/  {
                        if (  debug  ==  true  )
                            print $0 ;

                        found_folio_summary = true ;

                        next ;
                    }



/^Room:/  {
                if (  debug  ==  true  )
                    print $0 ;

                room = $2 ;
                gsub(  /[\r]/ ,  "" ,  room  ) ;

                next ;
            }


/^Arrival Date:/  {
                        if (  debug  ==  true  )
                            print $0 ;

                        arrival_date = $3 ;
                        gsub(  /[\r]/ ,  "" ,  arrival_date  ) ;

                        next ;
                    }


/^Departure Date:/  {
                        if (  debug  ==  true  )
                            print $0 ;

                        departure_date = $3 ;
                        gsub(  /[\r]/ ,  "" ,  departure_date  ) ;

                        next ;
                    }


/^SAFARI\/TOURSCAPE[\r]$/  {
                                if (  debug  ==  true  )
                                    print $0 ;

                                prev_line_was_safari = true ;

                                next ;
                            }


(  prev_line_was_safari  ==  true  )    {
                                            if (  debug  ==  true  )
                                                print $0 ;

                                            name_info = $0 ;
                                            gsub(  /[\r]/ ,  "" ,  name_info  ) ;
                                            piece_count = split(  name_info ,  pieces ,  /\*/  ) ;

                                            if (  piece_count  ==  3  )
                                            {
                                                name = substr(  pieces[ 1 ] ,  1 ,  length( pieces[ 1 ] ) - 2  ) ;
                                                booking_id = pieces[ 2 ] ;
                                            }
                                            else
                                            {
                                                name = pieces[ 1 ] ;
                                            }

                                            prev_line_was_safari = false ;

                                            next ;
                                        }


(  $0  !~  /Account:/  )  {
                                if (  debug  ==  true  )
                                    print $0 ;

                                account_info[ account_info_lines_count++ ] = $0 ;
                            }


END {
        output_account_data() ;
    }

I get output that looks like this:

Code:

lname,fname    booking    invoice      arrival      departure    rate    tax    balance

error          error      ***err***230  3/2/13        3/4/13        249.90  31.24    0.00
Evans, Greg    327022634  Evans237      3/1/13        3/4/13        320.25  41.12  361.37
error          error      ***err***245  3/2/13        3/4/13        219.90  28.08    0.00
Summers, Marvin error      Summers325    3/3/13        3/4/13          96.75  12.66  109.41

I wasn't sure I understood your explanation of how the input was to be treated. You seemingly gave examples from different "folios"; it might have been easier to follow if most examples had come from the same "folio". Also, if you meant the phrase "Only names under "SAFARI/TOURSCAPE" are needed" to mean that only names on the line immediately following a line containing just "SAFARI/TOURSCAPE" are to be used, then it appears to me that one "folio" has its name missing: the "folio" for "Christina Simpleman".

As you can see from my awk program, the carriage return characters just before the end of each line did need to be handled, at least under Linux, and this is a Linux forum.
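
One possible simplification, offered as a sketch rather than as part of the program above: strip the carriage return once, in a first rule that runs for every line, so that later patterns no longer need `[\r]` and later actions no longer need `gsub`. The function name `strip_cr_then_match` and the sample amounts are assumptions made for illustration.

```sh
# Hypothetical helper: prints each line of stdin that is a bare amount.
strip_cr_then_match() {
    awk '
        # Runs for every record before any other rule: drop one trailing CR.
        { sub(/\r$/, "") }

        # With the CR gone, the pattern can anchor on the real end of line.
        /^[0-9]+\.[0-9]+$/ { print "amount=" $0 }
    '
}

# Made-up CRLF input resembling the tail of a folio summary.
printf '249.90\r\nBalance Due:\r\n0.00\r\n' | strip_cr_then_match
```

Modifying `$0` in the first rule also re-splits the fields, so `$2`, `$3` and so on would be free of the carriage return in every rule that follows.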

The program is not complete. It's just that I had so many questions, it seemed easier to present code that might effectively raise the questions, rather than try to describe all the questions.

I didn't format dates.

I would guess that the output column for the names needs to be wider, or the names truncated, since names can be rather longer than the space allowed by your sample report heading. I also seemingly found various values missing from the input data.
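
One way to size the name column, sketched under the assumption that the rows can be held in memory until the end of input: measure the longest name first, then build the printf format from that width. The tab-separated input and the function name `print_names_padded` are illustrative; the two names and booking values are the ones shown in the output above.

```sh
# Hypothetical helper: reads "name<TAB>booking" lines, pads names to the longest.
print_names_padded() {
    awk -F'\t' '
        # Store each row and track the longest name seen so far.
        {
            name[NR] = $1
            booking[NR] = $2
            if (length($1) > width) width = length($1)
        }

        # Build the format string from the measured width, then print all rows.
        END {
            fmt = "%-" width "s  %s\n"
            for (i = 1; i <= NR; i++) printf fmt, name[i], booking[i]
        }
    '
}

printf 'Evans, Greg\t327022634\nSummers, Marvin\terror\n' | print_names_padded
```

The same measured width could be reused for the heading line, so the header and the data stay aligned however long the names get.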

That's IF I'm interpreting the file correctly. Having what appear to be labels for values follow the values, two lines after each value, seems a rather unusual file format. I took the final "dollar" value after the phrase "Balance Due" within the "Folio Summary" to be the value for "Balance Due". In at least one case, though, it seemed to be missing. Having the label for "Balance Due" come before the value seems an odd departure from the label-after-value layout elsewhere in the file. It also seemed that values for some of the taxes might be missing. If that's possible, you might want to output some error indication for that situation too. Also, if anything else can be missing, the "kludge"/assumption I made, that the dollar values for taxes come in a certain order, can easily be wrong. Other expectations the program illustrates could be wrong as well. It may be necessary to grab a dollar value, save it, then explicitly look for the label that is expected to follow it.
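
The "grab a value, then look for its label" idea could be sketched roughly as follows. The label texts "Room Charge" and "Occupancy Tax" are hypothetical stand-ins, since the real labels in the file are not quoted in this thread, and the input is assumed to have LF line endings already.

```sh
# Hypothetical helper: pairs each known label with the amount seen just before it.
pair_values_with_labels() {
    awk '
        # A bare amount: hold it until a label claims it.
        /^[0-9]+\.[0-9]+$/ { pending = $0; have_pending = 1; next }

        # Hypothetical label lines; the real file may word these differently.
        /^(Room Charge|Occupancy Tax)$/ {
            if (have_pending) print $0 "=" pending
            else              print $0 "=MISSING"
            have_pending = 0
            next
        }
    '
}

# Made-up input: one amount followed two lines later by its label,
# then a second label with no amount in front of it.
printf '249.90\n\nRoom Charge\nOccupancy Tax\n' | pair_values_with_labels
```

Keying on the label instead of on the position in a fixed sequence means a missing tax amount shows up as an explicit MISSING marker and does not shift the later values into the wrong columns.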

If the output is sufficiently close to how you expected the file to be processed, maybe we can help you adjust the program to be exactly what you need. If so, maybe you could give us some additional details on how the file is to be processed.

Hope this helps.

btacuso 03-06-2013 11:44 AM

Rigor, it seems you are on the right track. Yes, the file is somewhat confusing, but that is caused by the conversion from PDF to text: the original is a PDF file, which I converted to text using the poppler pdftotext tool. The output you created has the right data, except for the date format and, of course, the extra lines with errors. My statement "under SAFARI...." is also somewhat misleading. What I really meant is that if a folio was booked by SAFARI, then that folio is what I need. I should also emphasize that SAFARI folios come in pairs: immediately after the first, which is for the room, comes another for the same name, but for the restaurant bills. That second one is not needed. You are also right about the width of the output; it needs to be wider to accommodate long names.

Thanks for the immediate response. I hope my problem is fixed with your and everybody's help.
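
The "second folio for the same name is not needed" rule could be sketched as a small filter over already-extracted rows. This assumes one tab-separated row per folio with the name in the first column; the function name `drop_repeat_folios` and the "room"/"restaurant" tags in the sample are made up for illustration.

```sh
# Hypothetical helper: drops a folio row whose name repeats the row just before it.
drop_repeat_folios() {
    awk -F'\t' '
        # Keep a folio only when its name differs from the one just before it.
        $1 != prev_name { print }

        # Remember this name for the comparison on the next row.
        { prev_name = $1 }
    '
}

printf 'Evans, Greg\troom\nEvans, Greg\trestaurant\nSummers, Marvin\troom\n' |
    drop_repeat_folios
```

Only immediately repeated names are dropped, so the same guest appearing again later in the file, after a different guest, would still be kept.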

