LinuxQuestions.org


btacuso 08-06-2010 02:05 PM

How to parse files with variable record length
 
Hello. Below is a sample of my input file. I would like to extract the room number, Lastname, Firstname, invoice (205880080), arrival date, departure date, and total (229.46). Can you at least give me a hint on how to proceed? I have tried a lot, but I am stumped from the beginning. Thanks.
------------------------------------------------------------------------
***History***
Room: 124 B Payment: Bell/TRAVELSCAPE.COM
Lastname*FIT*,Firstname 4A, 0K, 0B Guest
Bell *205880080 FT
Bell *205880080 July 31, 2010
____ 00 August 1, 2010
00000
Date Trans Room Debit Credit Balance
Jul31'10ROOM 124 206.75 206.75
Jul31'10TAX 124 21.71 228.46
Jul31'10TID 124 1.00 229.46
Aug 1'10EX 124 229.46 CR 0.00
Account Bell/TRAVELSCAPE.COM
_____________________________________________________________________

colucix 08-06-2010 02:57 PM

Hmm... you could try GNU Awk, using the gensub function with strict regular expressions to extract specific parts of each line. However, you first have to identify which items appear in every record. In other words, it's necessary to define the format of the input text.

For example I tried to extract the desired information based on these assumptions:
1. The word History is at the start of each section
2. First line after History contains the keywords Room: and Payment:
3. The second line contains Lastname and Firstname separated by a single and unique comma
4. The third line contains the invoice preceded by the payment method (?) a.k.a. Spaceship or Bill in your samples
5. Arrival and departure dates are in the format Month [D]D, YYYY
6. Total is in the line above that one repeating the payment method (?) and CR is a keyword following the total amount.

Well.. based on my (surely wrong) guess, I can think of something like this:
Code:

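#!/usr/bin/awk -f
# Needs GNU Awk: gensub() and the three-argument match() are gawk extensions.

/History/ {

  # Line 1: room number and payment method
  getline
  room = gensub(/.*Room: ([0-9]*).*Payment.*/,"\\1","g")
  paym = gensub(/.*Payment: (.*)\/.*/,"\\1","g")

  # Line 2: the first field holds "Lastname,Firstname"
  getline
  lastname = gensub(/(.*),.*/,"\\1","g",$1)
  frstname = gensub(/.*,(.*)/,"\\1","g",$1)

  # Line 3: invoice number, with its leading asterisk stripped
  getline
  if ( $1 ~ paym ) sub(/^\*/,"",$2)
  invoice = $2

  # Lines 4 and 5: arrival and departure dates
  getline
  match($0,/[JFMASOND][a-z]* [1-3]?[0-9], 20[1-9][0-9]/,arrival)

  getline
  match($0,/[JFMASOND][a-z]* [1-3]?[0-9], 20[1-9][0-9]/,departure)

  # Read on until the payment method repeats (or the input ends);
  # the total is the amount in front of the CR keyword
  while ( (getline) > 0 && $0 !~ paym ) {
    if ($0 ~ / CR / ) total = gensub(/.* ([0-9.]*) CR.*/,"\\1","g")
  }

  # Print the extracted values; formatting is left to taste
  print room, lastname, frstname, invoice, arrival[0], departure[0], total
}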

Just to give you an idea. What is your skill in regular expressions, anyway? And in awk?
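
For anyone without GNU Awk, here is a rough sketch of the same kind of extraction using only POSIX awk features (match() with RSTART/RLENGTH, plus plain field access) instead of gensub. The function name extract_room_total is made up for illustration, and it assumes the room line always contains "Room: " followed by digits and that the total is the field right before the CR keyword, as in the sample above. It only covers two of the wanted items.

```shell
#!/bin/sh
# extract_room_total: hypothetical helper, reads a record on standard input
# and prints "<room> <total>".
extract_room_total() {
    awk '
    # Room line: take the digits that follow "Room: " (6 characters long).
    match($0, /Room: [0-9]+/) {
        room = substr($0, RSTART + 6, RLENGTH - 6)
    }
    # Total line: the amount is the field just before the CR keyword.
    / CR / {
        for (i = 2; i <= NF; i++)
            if ($i == "CR") total = $(i - 1)
        print room, total
    }'
}

# Three lines copied from the sample record in the first post.
extract_room_total <<'EOF'
Room: 124 B Payment: Bell/TRAVELSCAPE.COM
Jul31'10TID 124 1.00 229.46
Aug 1'10EX 124 229.46 CR 0.00
EOF
```

On the three sample lines this should print 124 229.46. The other items (names, invoice, dates) could be picked up the same way, one pattern per line type.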

grail 08-07-2010 12:58 AM

Like colucix I feel we need more information, but assuming the format is always the same as shown, the following is an alternative on the same theme:
Code:

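#!/usr/bin/awk -f

BEGIN{
    # A multi-character RS is treated as a regular expression (gawk, mawk),
    # so the asterisks are escaped to match the marker literally.
    RS="\\*\\*\\*History\\*\\*\\*"
}

# Skip the fragment before the first marker. Fields are whitespace-separated
# words counted across the whole record, so $1 is "Room:".
NF>2{
    room=$2" "$3

    # names[1] is the last name, names[2] the first name
    split($6,names,/\*FIT\*,/)

    # Strip the asterisk in front of the invoice number
    sub(/^\*/,"",$12)

    arrival=$16" "$17" "$18
    departure=$21" "$22" "$23

    # $(NF - 9) depends on how many words follow the total in each record
    print room,names[1],names[2],$12,arrival,departure,$(NF - 9)
}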

Obviously you need to work on the formatting, but you get the idea :)
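
One possible way to handle the formatting, sketched here with awk's printf to produce a CSV line. The helper name format_csv, the "|" separator between the already-extracted values, and the CSV layout itself are all assumptions for illustration; nothing in the thread says what output format is wanted. The text fields are quoted because the dates contain commas.

```shell
#!/bin/sh
# format_csv: hypothetical helper; expects one record per line, with the
# already-extracted values separated by "|".
format_csv() {
    awk -F'|' '{
        # Quote the text fields because the dates contain commas.
        printf "\"%s\",\"%s\",\"%s\",%s,\"%s\",\"%s\",%.2f\n", $1, $2, $3, $4, $5, $6, $7
    }'
}

# Values copied from the sample record in the first post.
echo '124 B|Lastname|Firstname|205880080|July 31, 2010|August 1, 2010|229.46' | format_csv
```

In a real script the printf line would simply replace the print statement, using the variables directly instead of reading them back from a "|"-separated line.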

btacuso 08-11-2010 10:49 AM

At last, I got it. Based on all your input, I was able to get a working script. Thanks again to all of you.

