LinuxQuestions.org
Old 06-15-2009, 11:12 AM   #1
pwc101
Split header from data in file using python


I'm trying to get to grips with python, and thought a genuine application would be more likely to get me going than just fiddling my way through tutorials. With that in mind, I have the following problem.

I need to extract certain parameters from the header (purple) in a number of files. At the moment, the header is not a fixed number of lines, but its format is as follows:
Code:
DATA SOURCE CODE=RIKZ
DATA ANALYSIS INSTITUTE CODE=RIKZ
DATA CENTRE & CONTACT CODE=RIKZ
STATION NAME=Vlissingen
GEOGRAPHICAL COORDINATES=N512635.3/E0033550.5
PERIOD BEGIN=19840101 000000
PERIOD END=19841231 230000
TIME REFERENCE=MET
PERIOD DURATION=00365 230000
REGISTRATION INTERVAL= 60
NUMBER OF DATA RECORDS=   8784
UNITS=cm
REFERENCE LEVEL MEASUREMENT=NAP
PARAMETER=MEASURED WATER LEVEL
INSTRUMENT TYPE CODE=TNO_D
DATA QUALITY DESCRIPTION=No remarks
MISSING VALUE=NAN
VALUES
   220
   217
   180
   NAN
   NAN
   <snip>
   180
   NAN
   203
END
Each variable is separated from its value by an =, and the data (in red) are always preceded by a line whose sole value is VALUES and always followed by a line whose sole value is END.

So I'd like to be able to create variables of PERIOD BEGIN, PERIOD END etc. for use later on in the code, and then work on the data values separately.

I think this'll most easily be achieved if I can create two arrays, one which is the header information in two columns (separated by =) and one which is the values (which lie between the words VALUES and END).

I've tried to read the file in line by line, and then assign the file to a variable split by =, but it fails when the line contains only a single column with no = in it.

This is what I have so far, and it doesn't work:
Code:
#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

   openFile=open(file,'r')

   for line in openFile:
      while 'VALUES' not in line:
         currentLine=line.split("=")
         print currentLine
         break
This very effectively strips any line with VALUES in it, but I was expecting it to read each line and break out of the loop as soon as it came to a line with VALUES in it.

I've tried googling, but there's an enormous amount of information, and identifying what's relevant and what's outdated is difficult unless you know what you're looking for!
 
Old 06-15-2009, 12:36 PM   #2
motorider2
The for line is missing the readlines object

for line in openFile.readlines():
 
Old 06-15-2009, 12:42 PM   #3
pwc101
Quote:
Originally Posted by motorider2 View Post
The for line is missing the readlines object

for line in openFile.readlines():
As far as I'm aware, which is to say not very much, that's not necessary. See http://docs.python.org/tutorial/inpu...f-file-objects
Quote:
Originally Posted by Python v2.6.2 documentation
An alternative approach to reading lines is to loop over the file object. This is memory efficient, fast, and leads to simpler code:

>>> for line in f:
        print line,

This is the first line of the file.
Second line of the file
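
To illustrate the point in the quoted documentation, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). It assumes an in-memory `io.StringIO` object as a stand-in for an open file, and the sample lines are made up for the example. Looping over the object yields one line at a time without calling `readlines()`:

```python
import io

# io.StringIO stands in for an open file so the example is self-contained.
sample = io.StringIO("UNITS=cm\nMISSING VALUE=NAN\nVALUES\n   220\nEND\n")

seen = []
for line in sample:               # the file object is its own line iterator
    seen.append(line.rstrip("\n"))

print(seen)
```

Calling `readlines()` also works, but it builds the whole list in memory first, which matters more for large files than for this one.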
 
Old 06-15-2009, 01:12 PM   #4
Hko
I think this should do pretty much what you need. It could still be improved by adding more error-checking and raising WaterFileSyntaxError exceptions when such errors are found.

Hope this helps
Code:
#!/usr/bin/env python

import sys

class WaterFileSyntaxError(Exception):
    def __init__(self, msg): self.msg = msg
    def __str__(self): return repr(self.msg)
 
class WaterFile():
    def __init__(self, filepath):
        self.header = {}
        self.values = []
        self.parse(filepath)

    def __getitem__(self, key):
        return self.header.get(key)

    def parse(self, filepath):
        still_reading_header = True
        for line in open(filepath):
            line = line.strip()
            if still_reading_header:
                if line == 'VALUES':
                    still_reading_header = False
                else:
                    var, val = line.split('=')
                    self.header[var.strip()] = val.strip()
            else:
                if line == 'END':
                    break
                if line == self['MISSING VALUE']:
                    self.values.append(None)
                else:                    
                    self.values.append(int(line))
        else:
            raise WaterFileSyntaxError('Section "VALUES" missing or it did not end with "END"')

### Main program starts here ###

for filepath in sys.argv[1:]:
    # Create a 'WaterFile' object from the string containing a file path
    wf = WaterFile(filepath)

    # You can now access a list of values:
    print wf.values

    # Or iterate over all values:
    # Note: The "NAN" values have been replaced with the
    #       special python value None.
    i = 0
    for value in wf.values:
        print 'Value', i, ' = ', value
        i += 1

    # The first way of accessing header variables:
    print 'Units are:', wf['UNITS']
    print 'Period end is:', wf['PERIOD END']
    
    # If a non-existing header var is accessed, you will
    # get the python special value None:
    print 'Non-existing header var:', wf['DOES NOT EXIST']
    print
    
    # The second way of accessing header variables:
    print 'Units are:', wf.header['UNITS']
    print 'Period end is:', wf.header['PERIOD END']

    # The difference with the first way is that the second
    # way will raise an exception when a non-existing var
    # is accessed, instead of returning None:
    print 'Non-existing header var:', wf.header['DOES NOT EXIST']
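
The extra error-checking mentioned at the top of this post might look something like the sketch below. It is written in Python 3 syntax (the code above is Python 2) and is only one possible approach. The function name `parse_water_lines`, the choice to skip blank lines, and the wording of the error messages are all assumptions made for illustration:

```python
class WaterFileSyntaxError(Exception):
    """Raised when the input does not follow the header/VALUES/END layout."""


def parse_water_lines(lines):
    """Return (header dict, values list) from an iterable of text lines."""
    header, values = {}, []
    in_header = True
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are harmless and can be skipped
        if in_header:
            if line == "VALUES":
                in_header = False
                continue
            # partition() never raises; sep is empty when there is no "="
            name, sep, value = line.partition("=")
            if not sep:
                raise WaterFileSyntaxError(
                    "line %d: expected NAME=VALUE, got %r" % (number, line))
            header[name.strip()] = value.strip()
        elif line == "END":
            return header, values
        elif line == header.get("MISSING VALUE"):
            values.append(None)
        else:
            try:
                values.append(int(line))
            except ValueError:
                raise WaterFileSyntaxError(
                    "line %d: not an integer: %r" % (number, line))
    # Reaching here means the input ended before an END line was seen.
    raise WaterFileSyntaxError(
        'Section "VALUES" missing or it did not end with "END"')
```

With a real file this could be called as `parse_water_lines(open(filepath))`, since an open file yields its lines when iterated.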

Last edited by Hko; 06-15-2009 at 01:13 PM.
 
Old 06-15-2009, 04:43 PM   #5
Sergei Steshenko
A typical task for Perl - built-in regular expressions come in handy.

And I do not see a need for OOP in this case - pure procedural code would suffice because of the simplicity of the problem.
 
Old 06-15-2009, 05:02 PM   #6
syg00
Possibly, but that shouldn't stop anyone in his/her quest to learn a new language.
I only started on python because I had a specific project in mind - must get back to it sometime ...
 
Old 06-15-2009, 09:36 PM   #7
Sergei Steshenko
Quote:
Originally Posted by syg00 View Post
Possibly, but that shouldn't stop anyone in his/her quest to learn a new language.
I only started on python because I had a specific project in mind - must get back to it sometime ...
Python has some kind of regular expressions module, doesn't it?

...

Regarding the specific project - if you have to contribute to an existing Python project, of course you need Python.

If you're starting something from scratch - Python (AFAIK) is not a more capable language than Perl; I have already published a number of small pieces of code which cannot be implemented in Python due to its limitations.
 
Old 06-15-2009, 10:21 PM   #8
ghostdog74
Get familiar with dictionaries:
Code:
d={}
for line in open("file"):
    if "=" in line:
        line=line.strip().split("=")
        d.setdefault(line[0],line[-1])    
for i,j in d.iteritems():
    print "key %s has value %s" %(i,j)
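
For anyone reading this with Python 3, a rough equivalent of the dictionary approach is sketched below. It makes a few assumptions: the lines are held in a list rather than read from a file, `split("=", 1)` is used so that a value containing an extra = stays intact, and the `NOTE=a=b` line is a made-up header entry added only to show that case:

```python
# Sample lines; "NOTE=a=b" is hypothetical and not part of the real file format.
lines = [
    "STATION NAME=Vlissingen\n",
    "REGISTRATION INTERVAL= 60\n",
    "NOTE=a=b\n",
    "VALUES\n",
]

d = {}
for line in lines:
    if "=" in line:
        # Split on the first "=" only, then trim padding around both parts.
        key, value = line.strip().split("=", 1)
        d[key.strip()] = value.strip()

# dict.items() replaces the Python 2 iteritems() method.
for key, value in d.items():
    print("key %s has value %s" % (key, value))
```

Lines without an =, such as VALUES and the data themselves, are simply ignored here, so this only covers the header half of the problem.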
 
Old 06-15-2009, 10:30 PM   #9
ghostdog74
Quote:
Originally Posted by Sergei Steshenko View Post
Python has some kind of regular expressions module, doesn't it?

...
Yes, the re module. However, since you don't know much about Python, I can tell you that a simple task like this doesn't need regular expressions. Python has excellent string manipulation capabilities, on par with Perl's or even better.
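
As a small side-by-side sketch (Python 3 syntax, with a header line taken from the sample file earlier in the thread), here is the same line split once with the `re` module and once with plain string methods. The regular expression is just one way to write the pattern, not a recommended form:

```python
import re

line = "NUMBER OF DATA RECORDS=   8784"

# Regular-expression version: name, "=", value, with padding ignored.
match = re.match(r"\s*([^=]+?)\s*=\s*(.*?)\s*$", line)
regex_result = (match.group(1), match.group(2))

# Plain string-method version: partition on the first "=", then strip.
name, _, value = line.partition("=")
plain_result = (name.strip(), value.strip())

print(regex_result)
print(plain_result)
```

Both versions give the same name and value for this line, which is the point being made: for a fixed NAME=VALUE layout the string methods are enough.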

Quote:
If you're starting something from scratch - Python (AFAIK) is not a more capable language than Perl;
Let's not get into this. You have to really try Python to see the difference. After that, try maintaining a big project coded in Perl and one coded in Python, and see the difference.

Quote:
I have already published a number of small pieces of code which cannot be implemented in Python due to its limitations.
Yes, but those (I think you mean anonymous functions) don't have a real use case in practice. Also, they make code hard to read and hard to troubleshoot/understand. I can tell you that many things are far easier and better done in Python than in Perl, but that's not the point.
 
Old 06-16-2009, 03:27 AM   #10
angrybanana
Quote:
Originally Posted by pwc101 View Post
Code:
#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

   openFile=open(file,'r')

   for line in openFile:
      while 'VALUES' not in line:
         currentLine=line.split("=")
         print currentLine
         break
This very effectively strips any line with VALUES in it, but I was expecting it to read each line, and break out of the loop as soon as it came to line with VALUES in it.

I've tried googling, but there's an enormous amount of information, and identifying what's relevant and what's outdated is difficult unless you know what you're looking for!
You were close. I think what you were trying to do is this:
Code:
#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

   openFile=open(file,'r')

   for line in openFile:
      if 'VALUES' in line:
         break
      currentLine=line.split("=")
      print currentLine
split + strip is all you need for the individual lines.

For reading the values, you can either loop over the lines and track when you're in range (between 'VALUES' and 'END') or, if you have the whole file read into a list of lines, do something like this:

Code:
lines[lines.index('VALUES')+1:lines.index('END')]
As for perl vs python.. it's 2-3 lines in either language..
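
A slightly fuller sketch of the slicing idea, in Python 3 syntax. It assumes the lines have already been stripped of whitespace and newlines, since `lines.index('VALUES')` only matches a list element that is exactly `'VALUES'`. The sample text is a shortened version of the file from the first post:

```python
text = "UNITS=cm\nMISSING VALUE=NAN\nVALUES\n   220\n   NAN\n   203\nEND\n"

# Strip every line first so the marker lines compare equal to 'VALUES' / 'END'.
lines = [line.strip() for line in text.splitlines()]

header_lines = lines[:lines.index("VALUES")]
value_lines = lines[lines.index("VALUES") + 1:lines.index("END")]

print(header_lines)
print(value_lines)
```

Note that `list.index` raises `ValueError` when the marker is absent, so a file without a VALUES or END line fails loudly rather than silently.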
 
Old 06-16-2009, 05:09 AM   #11
pwc101
Thanks everyone for all the examples.

Since my secondary aim in doing this was to start learning python, the full-blown program method does appeal, although it seems this is a pretty trivial problem to solve! I've copied the code I've ended up using below. As you can see, I've ended up using most of Hko's code, adding a small section at the end to actually calculate the new date and times for each data point.
Code:
#!/usr/bin/env python

import sys
import datetime
from datetime import timedelta
import time

class WaterFileSyntaxError(Exception):
   def __init__(self, msg): self.msg = msg
   def __str__(self): return repr(self.msg)

class GetTide():
   def __init__(self,filepath):
      self.header={}
      self.values=[]
      self.parse(filepath)

   def __getitem__(self,key):
      return self.header.get(key)

   def parse(self,filepath):
      stillReadingHeader=True
      for line in open(filepath):
         line=line.strip()
         if stillReadingHeader:
            if line == 'VALUES':
               stillReadingHeader=False
            else:
               headerName,headerVal=line.split('=')
               self.header[headerName.strip()]=headerVal.strip()
         else:
            if line == 'END':
               break
            if line == self['MISSING VALUE']:
               self.values.append(None)
            else:
               self.values.append(int(line))
      else:
         raise WaterFileSyntaxError('Section "VALUES" missing or did not end with "END"')

### Main program i.e. grunt work

for filepath in sys.argv[1:]:
   # use GetTide above to separate the header from the tidal data
   gt=GetTide(filepath)

   # check the headers...
#   print gt.header['NUMBER OF DATA RECORDS']

   startDate,startTime=gt.header['PERIOD BEGIN'].split(" ")
   interval=int(gt.header['REGISTRATION INTERVAL'])

   timeOffset=timedelta(minutes=interval)
   # PERIOD BEGIN is "YYYYMMDD HHMMSS": after the 4-digit year, every field is a two-character slice
   currTime=datetime.datetime(int(startDate[:4]),int(startDate[4:6]),int(startDate[6:8]),int(startTime[0:2]),int(startTime[2:4]),int(startTime[4:6]))

   for i in range(int(gt.header['NUMBER OF DATA RECORDS'])):
      print currTime+(timeOffset*i),gt.values[i]
This basically gives me a time series of water levels (what I need to study) from a file which contains only metainformation on the dataset and a series of values. Every other program known to me needs to have a time stamp associated with each value, so I've calculated those from that metainformation and the datetime module in python.
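
The timestamp calculation described here can also be sketched with `datetime.strptime`, which parses the `PERIOD BEGIN` string in one step instead of slicing it by hand. This is a Python 3 sketch under a few assumptions: the header dictionary and the short list of values are hard-coded stand-ins for what the parser would return, and `REGISTRATION INTERVAL` is taken to be in minutes, as in the code above:

```python
from datetime import datetime, timedelta

# Stand-ins for the parsed header and data; real values come from the file.
header = {"PERIOD BEGIN": "19840101 000000", "REGISTRATION INTERVAL": "60"}
values = [220, 217, 180, None]

# Parse "YYYYMMDD HHMMSS" directly into a datetime.
start = datetime.strptime(header["PERIOD BEGIN"], "%Y%m%d %H%M%S")
step = timedelta(minutes=int(header["REGISTRATION INTERVAL"]))

# Pair each value with its time stamp: start, start + step, start + 2*step, ...
series = [(start + i * step, value) for i, value in enumerate(values)]

for stamp, value in series:
    print(stamp, value)
```

Iterating with `enumerate(values)` instead of `range(NUMBER OF DATA RECORDS)` avoids an `IndexError` if the header count and the actual number of values ever disagree.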

Incidentally, the original reason I decided to try this in python was that my bash attempt has been running for a few days (on around 200 files), and the python implementation above takes about 20 seconds for the same number of files. Needless to say, that is something of an improvement, and probably says more about my bash implementation than anything else!

As for perl vs. python, the main reason I wanted to use python is that it's so easy to read. Although perl's regular expressions are extremely powerful, they're just so hard to read unless you use them every day (I don't). So, python seemed the more obvious choice. That, and a piece of software I do use every day (ArcGIS) has the ability to incorporate custom functions written in python, so it's likely to be more useful to me in the future.

Thanks everyone for the input - hopefully this is the start of a prosperous use of python for me. It's been on my todo list for so long now.
 
Old 06-16-2009, 12:12 PM   #12
Sergei Steshenko
Quote:
Originally Posted by ghostdog74 View Post
...
Yes, but those (I think you mean anonymous functions) don't have a real use case in practice. Also, they make code hard to read and hard to troubleshoot/understand. I can tell you that many things are far easier and better done in Python than in Perl, but that's not the point.
It avoids name collisions, so the code is easier to maintain.

The idea of anonymous functions/objects is that it is not the author of the anonymous code but its user who decides on names, so the user and the author have no need for prior negotiations regarding names.

In a similar manner, inheritance is easily done "inline" through scoping rules - the lack of decent scoping rules is a big drawback of Python, probably the biggest repellent for me.

Last edited by Sergei Steshenko; 06-16-2009 at 12:15 PM.
 
Old 06-16-2009, 12:25 PM   #13
Hko
May I suggest you open (yet another) dedicated thread to do perl vs python flamewars?
 
Old 06-16-2009, 07:27 PM   #14
ghostdog74
Quote:
Originally Posted by Hko View Post
May I suggest you open (yet another) dedicated thread to do perl vs python flamewars?
Are you serious?
 
  

