LinuxQuestions.org - Split header from data in file using python

Split header from data in file using python

I'm trying to get to grips with python, and thought a genuine application would be more likely to get me going than just fiddling my way through tutorials. With that in mind, I have the following problem.

I need to extract certain parameters from the header (purple) in a number of files. At the moment, the header is not a fixed number of lines, but its format is as follows:

Code:

DATA SOURCE CODE=RIKZ
DATA ANALYSIS INSTITUTE CODE=RIKZ
DATA CENTRE & CONTACT CODE=RIKZ
STATION NAME=Vlissingen
GEOGRAPHICAL COORDINATES=N512635.3/E0033550.5
PERIOD BEGIN=19840101 000000
PERIOD END=19841231 230000
TIME REFERENCE=MET
PERIOD DURATION=00365 230000
REGISTRATION INTERVAL= 60
NUMBER OF DATA RECORDS=  8784
UNITS=cm
REFERENCE LEVEL MEASUREMENT=NAP
PARAMETER=MEASURED WATER LEVEL
INSTRUMENT TYPE CODE=TNO_D
DATA QUALITY DESCRIPTION=No remarks
MISSING VALUE=NAN
VALUES
  220
  217
  180
  NAN
  NAN
  <snip>
  180
  NAN
  203
END

Each variable is separated from its value by an =, and the data (in red) are always preceded by a line whose sole value is VALUES and always followed by a line whose sole value is END.

So I'd like to be able to create variables of PERIOD BEGIN, PERIOD END etc. for use later on in the code, and then work on the data values separately.

I think this'll most easily be achieved if I can create two arrays, one which is the header information in two columns (separated by =) and one which is the values (which lie between the words VALUES and END).

I've tried to read the file in line by line, and then assign the file to a variable split by =, but it fails when the line contains only a single column with no = in it.
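To see why that fails, it may help to look at what split returns for each kind of line. The sketch below is written for Python 3 (the thread itself uses Python 2 syntax), and the two sample lines are taken from the file excerpt above. str.partition is shown only as one possible alternative; it is not something the posts in this thread use.

```python
# Sample lines taken from the file excerpt above.
header_line = "REGISTRATION INTERVAL= 60"
marker_line = "VALUES"

# A header line splits into two parts...
print(header_line.split("="))   # ['REGISTRATION INTERVAL', ' 60']

# ...but a line without "=" yields a one-element list, so unpacking
# it into two names raises ValueError.
print(marker_line.split("="))   # ['VALUES']

# str.partition always returns three parts; the separator slot is
# empty when "=" is absent, which makes the check explicit.
name, sep, value = marker_line.partition("=")
print(repr(name), repr(sep), repr(value))   # 'VALUES' '' ''
```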

This is what I have so far, and it doesn't work:

Code:

#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

  openFile=open(file,'r')

  for line in openFile:
      while 'VALUES' not in line:
        currentLine=line.split("=")
        print currentLine
        break

This very effectively strips any line with VALUES in it, but I was expecting it to read each line, and break out of the loop as soon as it came to a line with VALUES in it.
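One way to see what is going on: the break sits inside the while loop, so it only leaves the while, and the for loop carries on with the next line. The sketch below (Python 3, with a made-up four-line sample and hypothetical function names) compares the loop from the post with a plain if, and suggests that the two behave the same way.

```python
# A tiny stand-in for the file contents.
sample = ["UNITS=cm", "VALUES", "220", "END"]

def using_while(lines):
    """The loop from the post: 'break' only leaves the inner while."""
    seen = []
    for line in lines:
        while "VALUES" not in line:
            seen.append(line.split("="))
            break  # exits the while, not the for
    return seen

def using_if(lines):
    """Equivalent form: the while/break pair acts like a plain if."""
    seen = []
    for line in lines:
        if "VALUES" not in line:
            seen.append(line.split("="))
    return seen

print(using_while(sample))                      # [['UNITS', 'cm'], ['220'], ['END']]
print(using_while(sample) == using_if(sample))  # True
```

Only the VALUES line is skipped; the data lines after it are still processed, which matches the behaviour described above.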

I've tried googling, but there's an enormous amount of information, and identifying what's relevant and what's outdated is difficult unless you know what you're looking for!

The for line is missing the readlines object

for line in openFile.readlines():

Quote:

Originally Posted by motorider2 (Post 3574757)

The for line is missing the readlines object

for line in openFile.readlines():

As far as I'm aware, which is to say not very much, that's not necessary. See http://docs.python.org/tutorial/inpu...f-file-objects

Quote:

Originally Posted by Python v2.6.2 documentation

An alternative approach to reading lines is to loop over the file object. This is memory efficient, fast, and leads to simpler code:

>>> for line in f:
        print line,

This is the first line of the file.
Second line of the file

I think this should do pretty much what you need. It could still be improved by adding more error checking and raising WaterFileSyntaxError exceptions when such errors are found.

Hope this helps

Code:

#!/usr/bin/env python

import sys

class WaterFileSyntaxError(Exception):
    def __init__(self, msg): self.msg = msg
    def __str__(self): return repr(self.msg)

class WaterFile():
    def __init__(self, filepath):
        self.header = {}
        self.values = []
        self.parse(filepath)

    def __getitem__(self, key):
        return self.header.get(key)

    def parse(self, filepath):
        still_reading_header = True
        for line in file(filepath):
            line = line.strip()
            if still_reading_header:
                if line == 'VALUES':
                    still_reading_header = False
                else:
                    var, val = line.split('=')
                    self.header[var.strip()] = val.strip()
            else:
                if line == 'END':
                    break
                if line == self['MISSING VALUE']:
                    self.values.append(None)
                else:
                    self.values.append(int(line))
        else:
            # Reached only if the loop finished without hitting 'break'
            raise WaterFileSyntaxError('Section "VALUES" missing or it did not end with "END"')

### Main program starts here ###

for filepath in sys.argv[1:]:
    # Create a 'WaterFile' object from the string containing a file path
    wf = WaterFile(filepath)

    # You can now access a list of values:
    print wf.values

    # Or iterate over all values:
    # Note: The "NAN" values have been replaced with the
    #       special python value None.
    i = 0
    for value in wf.values:
        print 'Value', i, ' = ', value
        i += 1

    # The first way of accessing header variables:
    print 'Units are:', wf['UNITS']
    print 'Period end is:', wf['PERIOD END']

    # If a non-existing header var is accessed, you will
    # get the python special value None:
    print 'Non-existing header var:', wf['DOES NOT EXIST']
    print

    # The second way of accessing header variables:
    print 'Units are:', wf.header['UNITS']
    print 'Period end is:', wf.header['PERIOD END']

    # The difference with the first way is that the second
    # way will raise an exception when a non-existing var
    # is accessed, instead of returning None:
    print 'Non-existing header var:', wf.header['DOES NOT EXIST']
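As a rough illustration of the extra error checking mentioned above, here is a minimal sketch in Python 3 (the code above is Python 2). The function name parse_water_lines, the line-number messages, and the decision to skip blank lines are all assumptions made for this example; only the WaterFileSyntaxError name and the final error message come from the post.

```python
class WaterFileSyntaxError(Exception):
    """Raised when a line does not fit the expected layout."""

def parse_water_lines(lines):
    """Return (header dict, list of values) from an iterable of lines."""
    header, values = {}, []
    in_header = True
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if in_header:
            if line == "VALUES":
                in_header = False
                continue
            name, sep, value = line.partition("=")
            if not sep:
                raise WaterFileSyntaxError(
                    "line %d: expected NAME=VALUE, got %r" % (number, line))
            header[name.strip()] = value.strip()
        elif line == "END":
            return header, values
        elif line == header.get("MISSING VALUE"):
            values.append(None)
        else:
            try:
                values.append(int(line))
            except ValueError:
                raise WaterFileSyntaxError(
                    "line %d: not an integer: %r" % (number, line))
    # Falling out of the loop means END was never seen.
    raise WaterFileSyntaxError(
        'Section "VALUES" missing or it did not end with "END"')

# Usage with a few lines modelled on the sample file:
sample = ["UNITS=cm", "MISSING VALUE=NAN", "VALUES", "  220", "  NAN", "END"]
header, values = parse_water_lines(sample)
print(header["UNITS"], values)   # cm [220, None]
```

Taking an iterable of lines rather than a file path is a design choice for the sketch: it lets the parser be tried on a short list, and an open file object can be passed in just the same.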

A typical task for Perl - built-in regular expressions come in handy.

And I do not see a need for OOP in this case - pure procedural code would suffice because the problem is so simple.

Possibly, but that shouldn't stop anyone in his/her quest to learn a new language.
I only started on python because I had a specific project in mind - must get back to it sometime ...

Quote:

Originally Posted by syg00 (Post 3575071)

Possibly, but that shouldn't stop anyone in his/her quest to learn a new language.
I only started on python because I had a specific project in mind - must get back to it sometime ...

Python has some kind of regular expressions module, doesn't it?

...

Regarding the specific project - if you have to contribute to an existing Python project, of course you need Python.

If you're starting something from scratch - Python (AFAIK) is not a more capable language than Perl; I have already published a number of small pieces of code which cannot be implemented in Python due to its limitations.

Quote:

Originally Posted by pwc101 (Post 3574645)

Code:

#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

  openFile=open(file,'r')

  for line in openFile:
      while 'VALUES' not in line:
        currentLine=line.split("=")
        print currentLine
        break

Get familiar with dictionaries:

Code:

d={}
for line in open("file"):
    if "=" in line:
        line=line.strip().split("=")
        d.setdefault(line[0],line[-1])
for i,j in d.iteritems():
    print "key %s has value %s" %(i,j)

Quote:

Originally Posted by Sergei Steshenko (Post 3575293)

Python has some kind of regular expressions module, doesn't it?

...

Yes, the re module. However, since you don't know much about Python, I can tell you that a task as simple as this doesn't need regular expressions. Python has excellent string manipulation capabilities, on par with Perl's or even better.
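For comparison, here is a small sketch (Python 3) that splits one header line from the sample file both ways, once with the re module and once with plain string methods. The regular expression is just one pattern that happens to work for this line; it is an illustration, not a recommendation from the thread.

```python
import re

line = "NUMBER OF DATA RECORDS=  8784"

# With the re module: capture the text on either side of the first "=",
# leaving surrounding whitespace out of both groups.
match = re.match(r"\s*([^=]+?)\s*=\s*(.*?)\s*$", line)
print(match.groups())                  # ('NUMBER OF DATA RECORDS', '8784')

# With plain string methods: same result, no pattern to decode.
name, _, value = line.partition("=")
print((name.strip(), value.strip()))   # ('NUMBER OF DATA RECORDS', '8784')
```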

Quote:

If you're starting something from scratch - Python (AFAIK) is not a more capable language than Perl;

Let's not get into this. You really have to try Python to see the difference. After that, try maintaining one big project coded in Perl and another coded in Python, and see the difference.

Quote:

I have already published a number of small pieces of code which cannot be implemented in Python due to its limitations.

Yes, but those (I think you mean anonymous functions) don't have a real use case in practice. They also make code hard to read and hard to troubleshoot or understand. I can tell you that many things are far easier and better done in Python than in Perl, but that's not the point.

Quote:

Originally Posted by pwc101 (Post 3574645)

Code:

#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

  openFile=open(file,'r')

  for line in openFile:
      while 'VALUES' not in line:
        currentLine=line.split("=")
        print currentLine
        break

You were close, I think what you were trying to do is this:

Code:

#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

  openFile=open(file,'r')

  for line in openFile:
      if 'VALUES' in line:
        break
      currentLine=line.split("=")
      print currentLine

split + strip is all you need for the individual lines.

For reading the values you can either loop over the lines and keep track of when you're in range (between 'VALUES' and 'END') or, if you have the whole file read into a list of stripped lines, do something like this:

Code:

lines[lines.index('VALUES')+1:lines.index('END')]
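A slightly fuller sketch of that slice, in Python 3, is below. The raw_lines list is a made-up stand-in for what reading a file would give; with a real file it could be built from the open file object instead. Note that list.index raises ValueError if 'VALUES' or 'END' is missing, which the sketch does not handle.

```python
# Stand-in for the lines of a file, newline characters included.
raw_lines = ["UNITS=cm\n", "MISSING VALUE=NAN\n", "VALUES\n",
             "  220\n", "  NAN\n", "  203\n", "END\n"]

# Strip first so that 'VALUES' and 'END' match exactly in index().
lines = [line.strip() for line in raw_lines]

start = lines.index("VALUES")
header_lines = lines[:start]                        # everything before VALUES
value_lines = lines[start + 1:lines.index("END")]   # between VALUES and END

print(header_lines)   # ['UNITS=cm', 'MISSING VALUE=NAN']
print(value_lines)    # ['220', 'NAN', '203']
```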

As for perl vs python.. it's 2-3 lines in either language..

Thanks everyone for all the examples.

Since my secondary aim in doing this was to start learning python, the full-blown program method does appeal, although it seems this is a pretty trivial problem to solve! I've copied the code I've ended up using below. As you can see, I've ended up using most of Hko's code, adding a small section at the end to actually calculate the new date and times for each data point.

Code:

#!/usr/bin/env python

import sys
import datetime
from datetime import timedelta
import time

class WaterFileSyntaxError(Exception):
  def __init__(self, msg): self.msg = msg
  def __str__(self): return repr(self.msg)

class GetTide():
  def __init__(self,filepath):
      self.header={}
      self.values=[]
      self.parse(filepath)

  def __getitem__(self,key):
      return self.header.get(key)

  def parse(self,filepath):
      stillReadingHeader=True
      for line in file(filepath):
        line=line.strip()
        if stillReadingHeader:
            if line == 'VALUES':
              stillReadingHeader=False
            else:
              headerName,headerVal=line.split('=')
              self.header[headerName.strip()]=headerVal.strip()
        else:
            if line == 'END':
              break
            if line == self['MISSING VALUE']:
              self.values.append(None)
            else:
              self.values.append(int(line))
      else:
        raise WaterFileSyntaxError('Section "VALUES" missing or did not end with "END"')

### Main program i.e. grunt work

for filepath in sys.argv[1:]:
  # use GetTide above to separate the header from the tidal data
  gt=GetTide(filepath)

  # check the headers...
#  print gt.header['NUMBER OF DATA RECORDS']

  startDate,startTime=gt.header['PERIOD BEGIN'].split(" ")
  interval=int(gt.header['REGISTRATION INTERVAL'])

  timeOffset=timedelta(minutes=interval)
  # PERIOD BEGIN is "YYYYMMDD HHMMSS": take four digits for the year and
  # two digits each for month, day, hour, minute and second
  currTime=datetime.datetime(int(startDate[:4]),int(startDate[4:6]),int(startDate[6:8]),int(startTime[0:2]),int(startTime[2:4]),int(startTime[4:6]))

  for i in range(int(gt.header['NUMBER OF DATA RECORDS'])):
      print currTime+(timeOffset*i),gt.values[i]

This basically gives me a time series of water levels (what I need to study) from a file which contains only metainformation on the dataset and a series of values. Every other program known to me needs to have a time stamp associated with each value, so I've calculated those from that metainformation and the datetime module in python.
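The timestamp calculation described above can also be sketched with datetime.strptime, which parses the whole PERIOD BEGIN string in one step. This version is Python 3, and the timestamps function name is made up for the example; the header values and the first three water levels are the ones from the sample file.

```python
from datetime import datetime, timedelta

def timestamps(period_begin, interval_minutes, count):
    """Yield one datetime per record, starting at period_begin."""
    start = datetime.strptime(period_begin, "%Y%m%d %H%M%S")
    step = timedelta(minutes=interval_minutes)
    for i in range(count):
        yield start + step * i

# PERIOD BEGIN and REGISTRATION INTERVAL as given in the sample header.
values = [220, 217, 180]
for stamp, level in zip(timestamps("19840101 000000", 60, len(values)), values):
    print(stamp, level)
# 1984-01-01 00:00:00 220
# 1984-01-01 01:00:00 217
# 1984-01-01 02:00:00 180
```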

Incidentally, the original reason I decided to try this in python was that my bash attempt has been running for a few days (on around 200 files), and the python implementation above takes about 20 seconds for the same number of files. Needless to say, that is something of an improvement, and probably says more about my bash implementation than anything else!

As for perl vs. python, the main reason I wanted to use python was it's so easy to read. Although perl's regular expressions are extremely powerful, they're just so hard to read unless you use them every day (I don't). So, python seemed the more obvious choice. That, and a piece of software I do use every day (ArcGIS) has the ability to incorporate custom functions written in python, so it's likely to be more useful to me in the future.

Thanks everyone for the input - hopefully this is the start of a prosperous use of python for me. It's been on my todo list for so long now.

Quote:

Originally Posted by ghostdog74 (Post 3575330)

...
Yes, but those (I think you mean anonymous functions) don't have a real use case in practice. They also make code hard to read and hard to troubleshoot or understand. I can tell you that many things are far easier and better done in Python than in Perl, but that's not the point.

It prevents name collisions, so the code is easier to maintain.

The idea of anonymous functions/objects is that the user of the anonymous code, not its author, decides on names, so the user and the author have no need for prior negotiation over names.

In a similar manner, inheritance is easily done "inline" through scoping rules. The lack of decent scoping rules is a big drawback of Python, and probably the biggest repellent for me.

May I suggest you open (yet another) dedicated thread to do perl vs python flamewars?

Quote:

Originally Posted by Hko (Post 3576137)

May I suggest you open (yet another) dedicated thread to do perl vs python flamewars?

are you serious?