LinuxQuestions.org
Old 08-06-2010, 02:05 PM   #1
btacuso
Member
 
Registered: May 2009
Posts: 32

Rep: Reputation: 15
How to parse files with variable record length


Hello. Below is a sample of my input file. I would like to extract the room number, last name, first name, invoice number (205880080), arrival date, departure date, and total (229.46). Can you at least give me a hint on how to proceed? I have tried a lot, but I have been stumped from the beginning. Thanks.
------------------------------------------------------------------------
***History***
Room: 124 B Payment: Bell/TRAVELSCAPE.COM
Lastname*FIT*,Firstname 4A, 0K, 0B Guest
Bell *205880080 FT
Bell *205880080 July 31, 2010
____ 00 August 1, 2010
00000
Date Trans Room Debit Credit Balance
Jul31'10ROOM 124 206.75 206.75
Jul31'10TAX 124 21.71 228.46
Jul31'10TID 124 1.00 229.46
Aug 1'10EX 124 229.46 CR 0.00
Account Bell/TRAVELSCAPE.COM
_____________________________________________________________________
Attached Files
File Type: txt xtest.txt (894 Bytes, 6 views)
 
Old 08-06-2010, 02:57 PM   #2
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,489

Rep: Reputation: 1956
Hmm.. you can try GNU Awk, using its gensub function to extract specific parts of each line with strict regular expressions. However, you first have to identify which items recur in every record. In other words, the format of the input text has to be defined.

For example I tried to extract the desired information based on these assumptions:
1. The word History is at the start of each section
2. First line after History contains the keywords Room: and Payment:
3. The second line contains Lastname and Firstname separated by a single and unique comma
4. The third line contains the invoice preceded by the payment method (?) a.k.a. Spaceship or Bill in your samples
5. Arrival and departure dates are in the format Month [D]D, YYYY
6. The total is on the line just above the one that repeats the payment method (?), and CR is a keyword that follows the total amount.

Well.. based on my (surely wrong) guess, I can think of something like this:
Code:
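#!/usr/bin/awk -f
# Needs GNU Awk: gensub() and the three-argument match() are gawk extensions.

/History/ {

   getline   # Room: ... Payment: ...
   room = gensub(/.*Room: ([0-9]*).*Payment.*/,"\\1","g")
   paym = gensub(/.*Payment: (.*)\/.*/,"\\1","g")
   
   getline   # Lastname,Firstname ...
   lastname = gensub(/(.*),.*/,"\\1","g",$1)
   frstname = gensub(/.*,(.*)/,"\\1","g",$1)
   
   getline   # payment method followed by *invoice
   if ( $1 ~ paym ) sub(/^\*/,"",$2)   # strip the literal leading asterisk
   invoice = $2
   
   getline   # arrival date; the matched text is stored in arrival[0]
   match($0,/[JFMASOND][a-z]* [1-3]?[0-9], 20[1-9][0-9]/,arrival)
   
   getline   # departure date; the matched text is stored in departure[0]
   match($0,/[JFMASOND][a-z]* [1-3]?[0-9], 20[1-9][0-9]/,departure)
   
   # Read on until the line that repeats the payment method, or end of input.
   while ( $0 !~ paym && (getline) > 0 ) {
     if ($0 ~ / CR / ) total = gensub(/.* ([0-9.]*) CR.*/,"\\1","g")
   }

   print room, lastname, frstname, invoice, arrival[0], departure[0], total
}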
Just to give you an idea. What is your skill in regular expressions, anyway? And in awk?
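For anyone without gawk, the same six assumptions can also be sketched in plain awk by tracking each line's offset from the History marker instead of chaining getline calls. This is only a sketch under those assumptions: the function name parse_history and the pipe-separated output are made up for illustration, and stripping the "*FIT*" marker from the last name is a guess about what that marker means.

```shell
# Hypothetical helper: reads the report from stdin or from the files given as arguments.
parse_history() {
  awk '
    # Assumption 1: a line containing "History" starts a new record.
    /History/ { start = NR; total = ""; next }
    !start    { next }                  # skip anything outside a record
    { n = NR - start }                  # offset of this line inside the record

    n == 1 {                            # Room: 124 B Payment: Bell/TRAVELSCAPE.COM
      room = $2
      paym = $0
      sub(/.*Payment: */, "", paym)     # keep what follows "Payment: "
      sub(/\/.*/, "", paym)             # and drop everything from the slash on
    }
    n == 2 {                            # Lastname*FIT*,Firstname ...
      split($1, name, ",")
      sub(/\*.*/, "", name[1])          # guess: "*FIT*" is a marker, not part of the name
    }
    n == 3 { invoice = $2; sub(/^\*/, "", invoice) }
    n == 4 { arrival   = $(NF-2) " " $(NF-1) " " $NF }   # Month D, YYYY at end of line
    n == 5 { departure = $(NF-2) " " $(NF-1) " " $NF }
    n > 5 && / CR / { total = $(NF-2) } # amount just before the CR keyword
    n > 5 && index($0, "Account " paym) == 1 {
      printf "%s|%s|%s|%s|%s|%s|%s\n", room, name[1], name[2], invoice, arrival, departure, total
      start = 0                         # record finished
    }
  ' "$@"
}
```

It could be called as parse_history xtest.txt and prints one line per record. If the real files do not always end a record with an "Account" line, the last rule would need a different trigger.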

Last edited by colucix; 08-06-2010 at 03:00 PM.
 
Old 08-06-2010, 05:24 PM   #3
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,974
Blog Entries: 11

Rep: Reputation: 879
Moved: This thread is more suitable in <PROGRAMMING> and has been moved accordingly to help your thread/question get the exposure it deserves.
 
Old 08-07-2010, 12:58 AM   #4
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,489

Rep: Reputation: 1891
Like colucix I feel we need more information, but assuming the format is always the same as shown, the following is an alternative on the same theme:
Code:
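#!/usr/bin/awk -f

BEGIN{
    # Each asterisk is bracketed so it matches literally and the whole marker
    # is consumed. A regular-expression RS needs gawk or mawk.
    RS="[*][*][*]History[*][*][*]"
}

# With the marker fully consumed, $1 is "Room:" and the fields count from there.
NF>2{
    room=$2" "$3

    # Split "Lastname*FIT*,Firstname" on the literal "*FIT*," marker.
    split($6,names,"[*]FIT[*],")

    sub(/^[*]/,"",$12)   # invoice number without its leading asterisk

    arrival=$16" "$17" "$18
    departure=$21" "$22" "$23

    print room,names[1],names[2],$12,arrival,departure,$(NF - 9)
}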
Obviously you need to work on the formatting, but you get the idea
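One caveat with counting fields from the end of the record: in the sample, "Aug 1'10EX" splits into two fields while "Jul31'10TID" is a single field, so a two-digit day on the closing line would shift the position of the total. A possible way around it, sketched here with a made-up function name find_total and assuming the closing amount is always followed by the keyword CR, is to search the record for CR and take the field before it.

```shell
# Hypothetical helper: prints the room number and closing total of each record.
find_total() {
  awk '
    BEGIN { RS = "[*][*][*]History[*][*][*]" }   # regex RS: gawk or mawk, not strict POSIX
    NF > 2 {
      total = ""
      for (i = 2; i <= NF; i++)
        if ($i == "CR") total = $(i - 1)         # amount immediately before the CR keyword
      print $2, total                            # $2 is the room number after "Room:"
    }
  ' "$@"
}
```

If a record ever holds more than one CR line, this keeps the last one, which may or may not be what is wanted.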
 
Old 08-11-2010, 10:49 AM   #5
btacuso
Member
 
Registered: May 2009
Posts: 32

Original Poster
Rep: Reputation: 15
At last, I got it. Based on all your input, I was able to get a working script. Thanks again to all of you.
 
  

