LinuxQuestions.org

pwc101 06-15-2009 11:12 AM

Split header from data in file using python
 
I'm trying to get to grips with python, and thought a genuine application would be more likely to get me going than just fiddling my way through tutorials. With that in mind, I have the following problem.

I need to extract certain parameters from the header in a number of files. At the moment, the header is not a fixed number of lines, but its format is as follows:
Code:

DATA SOURCE CODE=RIKZ
DATA ANALYSIS INSTITUTE CODE=RIKZ
DATA CENTRE & CONTACT CODE=RIKZ
STATION NAME=Vlissingen
GEOGRAPHICAL COORDINATES=N512635.3/E0033550.5
PERIOD BEGIN=19840101 000000
PERIOD END=19841231 230000
TIME REFERENCE=MET
PERIOD DURATION=00365 230000
REGISTRATION INTERVAL= 60
NUMBER OF DATA RECORDS=  8784
UNITS=cm
REFERENCE LEVEL MEASUREMENT=NAP
PARAMETER=MEASURED WATER LEVEL
INSTRUMENT TYPE CODE=TNO_D
DATA QUALITY DESCRIPTION=No remarks
MISSING VALUE=NAN

VALUES
  220
  217
  180
  NAN
  NAN
  <snip>
  180
  NAN
  203
END

Each variable is separated from its value by an =, and the data are always preceded by a line whose sole content is VALUES and always followed by a line whose sole content is END.

So I'd like to be able to create variables of PERIOD BEGIN, PERIOD END etc. for use later on in the code, and then work on the data values separately.

I think this'll most easily be achieved if I can create two arrays, one which is the header information in two columns (separated by =) and one which is the values (which lie between the words VALUES and END).

I've tried to read the file in line by line, and then assign the file to a variable split by =, but it fails when the line contains only a single column with no = in it.

This is what I have so far, and it doesn't work:
Code:

#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

  openFile=open(file,'r')

  for line in openFile:
      while 'VALUES' not in line:
        currentLine=line.split("=")
        print currentLine
        break

This very effectively strips any line with VALUES in it, but I was expecting it to read each line, and break out of the loop as soon as it came to a line with VALUES in it.

I've tried googling, but there's an enormous amount of information, and identifying what's relevant and what's outdated is difficult unless you know what you're looking for!
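
For the failure described above on lines with no = in them, one possible workaround is `str.partition`, which always returns three parts and so never fails to unpack. This is only a minimal sketch in Python 3 syntax; the helper name `split_header_line` is made up for illustration and is not part of any posted code.

```python
def split_header_line(line):
    """Return (name, value) for a 'NAME=value' line, or None if there is no '='."""
    # partition() splits on the first '=' only and returns an empty
    # separator when the line contains no '=' at all.
    name, sep, value = line.strip().partition("=")
    if not sep:  # blank line, VALUES, END, or a bare data value
        return None
    return name.strip(), value.strip()


print(split_header_line("REGISTRATION INTERVAL= 60\n"))
print(split_header_line("VALUES\n"))
```

Lines that come back as None can then be handled separately (for example, to spot the VALUES marker) instead of crashing the split.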

motorider2 06-15-2009 12:36 PM

The for line is missing the readlines object

for line in openFile.readlines():

pwc101 06-15-2009 12:42 PM

Quote:

Originally Posted by motorider2 (Post 3574757)
The for line is missing the readlines object

for line in openFile.readlines():

As far as I'm aware, which is to say not very much, that's not necessary. See http://docs.python.org/tutorial/inpu...f-file-objects
Quote:

Originally Posted by Python v2.6.2 documentation
An alternative approach to reading lines is to loop over the file object. This is memory efficient, fast, and leads to simpler code:

>>> for line in f:
        print line,

This is the first line of the file.
Second line of the file


Hko 06-15-2009 01:12 PM

I think this should do pretty much what you need. It could still be improved by adding more error-checking and raising WaterFileSyntaxError exceptions when such errors are found.

Hope this helps
Code:

#!/usr/bin/env python

import sys

class WaterFileSyntaxError(Exception):
    def __init__(self, msg): self.msg = msg
    def __str__(self): return repr(self.msg)
 
class WaterFile():
    def __init__(self, filepath):
        self.header = {}
        self.values = []
        self.parse(filepath)

    def __getitem__(self, key):
        return self.header.get(key)

    def parse(self, filepath):
        still_reading_header = True
        for line in file(filepath):
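            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. the one before VALUES
            if still_reading_header:
                if line == 'VALUES':
                    still_reading_header = False
                else:
                    # split on the first '=' only, in case a value contains '='
                    var, val = line.split('=', 1)
                    self.header[var.strip()] = val.strip()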
            else:
                if line == 'END':
                    break
                if line == self['MISSING VALUE']:
                    self.values.append(None)
                else:                   
                    self.values.append(int(line))
        else:
            raise WaterFileSyntaxError('Section "VALUES" missing or it did not end with "END"')

### Main program starts here ###

for filepath in sys.argv[1:]:
    # Create a 'WaterFile' object from the string containing a file path
    wf = WaterFile(filepath)

    # You can now access a list of values:
    print wf.values

    # Or iterate over all values:
    # Note: The "NAN" values have been replaced with the
    #      special python value None.
    i = 0
    for value in wf.values:
        print 'Value', i, ' = ', value
        i += 1

    # The first way of accessing header variables:
    print 'Units are:', wf['UNITS']
    print 'Period end is:', wf['PERIOD END']
   
    # If a non-existing header var is accessed, you will
    # get the python special value None:
    print 'Non-existing header var:', wf['DOES NOT EXIST']
    print
   
    # The second way of accessing header variables:
    print 'Units are:', wf.header['UNITS']
    print 'Period end is:', wf.header['PERIOD END']

    # The difference with the first way is that the second
    # way will raise an exception when a non-existing var
    # is accessed, instead of returning None:
    print 'Non-existing header var:', wf.header['DOES NOT EXIST']
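
As a rough idea of what the extra error-checking mentioned above might look like, here is a sketch of two helpers that raise `WaterFileSyntaxError` with the offending line number. The function names `parse_header_line` and `parse_value_line` are invented for this example, the default marker "NAN" is taken from the sample file, and the snippet is written in Python 3 syntax rather than the Python 2 used in the program above.

```python
class WaterFileSyntaxError(Exception):
    """Raised when a line does not fit the expected file layout."""


def parse_header_line(line, lineno):
    """Split one 'NAME=value' header line, raising on malformed input."""
    name, sep, value = line.strip().partition("=")
    if not sep or not name.strip():
        raise WaterFileSyntaxError(
            "line %d: expected NAME=value, got %r" % (lineno, line)
        )
    return name.strip(), value.strip()


def parse_value_line(line, lineno, missing="NAN"):
    """Convert one data line to int, or None for the missing-value marker."""
    text = line.strip()
    if text == missing:
        return None
    try:
        return int(text)
    except ValueError:
        raise WaterFileSyntaxError(
            "line %d: not an integer: %r" % (lineno, line)
        )
```

Inside a parse loop these could be called with the line number from `enumerate(f, 1)`, so that a bad file reports where the problem is.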


Sergei Steshenko 06-15-2009 04:43 PM

A typical task for Perl - built-in regular expressions come in handy.

And I do not see a need for OOP in this case - pure procedural code would suffice because of the simplicity of the problem.

syg00 06-15-2009 05:02 PM

Possibly, but that shouldn't stop anyone in his/her quest to learn a new language.
I only started on python because I had a specific project in mind - must get back to it sometime ...

Sergei Steshenko 06-15-2009 09:36 PM

Quote:

Originally Posted by syg00 (Post 3575071)
Possibly, but that shouldn't stop anyone in his/her quest to learn a new language.
I only started on python because I had a specific project in mind - must get back to it sometime ...

Python has some kind of regular expressions module, doesn't it ?

...

Regarding the specific project - if you have to contribute to an existing Python project, of course you need Python.

If you're starting something from scratch - Python (AFAIK) is not a more capable language than Perl; I have already published a number of small pieces of code which cannot be implemented in Python due to its limitations.

ghostdog74 06-15-2009 10:21 PM

Get familiar with dictionaries:
Code:

d={}
for line in open("file"):
    if "=" in line:
        line=line.strip().split("=")
        d.setdefault(line[0],line[-1])   
for i,j in d.iteritems():
    print "key %s has value %s" %(i,j)


ghostdog74 06-15-2009 10:30 PM

Quote:

Originally Posted by Sergei Steshenko (Post 3575293)
Python has some kind of regular expressions module, doesn't it ?

...

Yes, the re module. However, since you don't know much about Python, I can tell you that a task as simple as this doesn't need regular expressions. Python has excellent string manipulation capabilities, on par with Perl's or even better.
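
To make the comparison concrete, here is a small sketch that parses one header line from the sample file both ways, once with plain string methods and once with the `re` module. The regular expression is just one possible pattern, not the only way to write it, and the snippet uses Python 3 syntax.

```python
import re

line = "NUMBER OF DATA RECORDS=  8784"

# Plain string methods: split once on '=' and trim the whitespace.
name, value = [part.strip() for part in line.split("=", 1)]

# The same result with the re module: name, '=', value, all trimmed.
match = re.match(r"\s*([^=]+?)\s*=\s*(.*?)\s*$", line)
re_name, re_value = match.group(1), match.group(2)

print(name, value)
print(re_name, re_value)
```

Both approaches give the same name and value here; the string-method version is shorter and arguably easier to read for this format.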

Quote:

If you're starting something from scratch - Python (AFAIK) is not a more capable language than Perl;
Let's not get into this. You really have to try Python to see the difference. After that, try maintaining a big project coded in Perl and one coded in Python, and see the difference.

Quote:

I have already published a number of small pieces of code which cannot be implemented in Python due to its limitations.
Yes, but those (I think you mean anonymous functions) don't have a real use case in practice. Also, they make code hard to read and hard to troubleshoot/understand. I can tell you the different things done in Python that are far easier and better than in Perl, but that's not the point.

angrybanana 06-16-2009 03:27 AM

Quote:

Originally Posted by pwc101 (Post 3574645)
Code:

#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

  openFile=open(file,'r')

  for line in openFile:
      while 'VALUES' not in line:
        currentLine=line.split("=")
        print currentLine
        break

This very effectively strips any line with VALUES in it, but I was expecting it to read each line, and break out of the loop as soon as it came to line with VALUES in it.

I've tried googling, but there's an enormous amount of information, and identifying what's relevant and what's outdated is difficult unless you know what you're looking for!

You were close, I think what you were trying to do is this:
Code:

#!/usr/bin/env python

import sys

for file in sys.argv[1:]:

  openFile=open(file,'r')

  for line in openFile:
      if 'VALUES' in line:
        break
      currentLine=line.split("=")
      print currentLine

split + strip is all you need for the individual lines.

For reading the values you can either loop over and log when you're in range (between 'values' and 'end') or if you have the whole file read into a list of lines do something like this:

Code:

lines[lines.index('VALUES')+1:lines.index('END')]

As for perl vs python.. it's 2-3 lines in either language..
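
One caveat with the slice above: `list.index` looks for an exact match, so it only works if the lines have been stripped of newlines and padding first. A minimal sketch in Python 3 syntax, using a few made-up in-memory lines in place of a real file:

```python
# Stand-in for open(path).readlines(); the content mimics the sample file.
raw_lines = [
    "UNITS=cm\n",
    "MISSING VALUE=NAN\n",
    "VALUES\n",
    "  220\n",
    "  NAN\n",
    "  203\n",
    "END\n",
]

# Strip newlines and padding so that 'VALUES' and 'END' match exactly.
lines = [line.strip() for line in raw_lines]

# Everything strictly between the two marker lines.
data = lines[lines.index("VALUES") + 1:lines.index("END")]
print(data)
```

Note that `index` raises ValueError if either marker is missing, which may or may not be the behaviour you want for a malformed file.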

pwc101 06-16-2009 05:09 AM

Thanks everyone for all the examples.

Since my secondary aim in doing this was to start learning python, the full-blown program method does appeal, although it seems this is a pretty trivial problem to solve! I've copied the code I've ended up using below. As you can see, I've ended up using most of Hko's code, adding a small section at the end to actually calculate the new date and times for each data point.
Code:

#!/usr/bin/env python

import sys
import datetime
from datetime import timedelta
import time

class WaterFileSyntaxError(Exception):
  def __init__(self, msg): self.msg = msg
  def __str__(self): return repr(self.msg)

class GetTide():
  def __init__(self,filepath):
      self.header={}
      self.values=[]
      self.parse(filepath)

  def __getitem__(self,key):
      return self.header.get(key)

  def parse(self,filepath):
      stillReadingHeader=True
      for line in file(filepath):
        line=line.strip()
        if not line:
            continue  # skip blank lines, e.g. the one before VALUES
        if stillReadingHeader:
            if line == 'VALUES':
              stillReadingHeader=False
            else:
              headerName,headerVal=line.split('=',1)
              self.header[headerName.strip()]=headerVal.strip()
        else:
            if line == 'END':
              break
            if line == self['MISSING VALUE']:
              self.values.append(None)
            else:
              self.values.append(int(line))
      else:
        raise WaterFileSyntaxError('Section "VALUES" missing or did not end with "END"')

### Main program i.e. grunt work

for filepath in sys.argv[1:]:
  # use GetTide above to separate the header from the tidal data
  gt=GetTide(filepath)

  # check the headers...
#  print gt.header['NUMBER OF DATA RECORDS']

  startDate,startTime=gt.header['PERIOD BEGIN'].split(" ")
  interval=int(gt.header['REGISTRATION INTERVAL'])

  timeOffset=timedelta(minutes=interval)
  # slices: YYYY=[:4], MM=[4:6], DD=[6:8]; hh=[0:2], mm=[2:4], ss=[4:6]
  currTime=datetime.datetime(int(startDate[:4]),int(startDate[4:6]),int(startDate[6:8]),int(startTime[0:2]),int(startTime[2:4]),int(startTime[4:6]))

  for i in range(int(gt.header['NUMBER OF DATA RECORDS'])):
      print currTime+(timeOffset*i),gt.values[i]

This basically gives me a time series of water levels (what I need to study) from a file which contains only metainformation on the dataset and a series of values. Every other program known to me needs to have a time stamp associated with each value, so I've calculated those from that metainformation and the datetime module in python.
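
The timestamp calculation described above can also be sketched with `datetime.strptime`, which parses the whole PERIOD BEGIN string in one call instead of slicing it by hand. The `header` dict and `values` list below are small stand-ins for what the parser produces, filled with numbers from the sample file; the snippet uses Python 3 syntax.

```python
from datetime import datetime, timedelta

# Stand-ins for the parsed header and data (taken from the sample file).
header = {"PERIOD BEGIN": "19840101 000000", "REGISTRATION INTERVAL": "60"}
values = [220, 217, 180, None]

# Parse 'YYYYMMDD hhmmss' in one go.
start = datetime.strptime(header["PERIOD BEGIN"], "%Y%m%d %H%M%S")
# The interval is assumed to be in minutes, as in the program above.
step = timedelta(minutes=int(header["REGISTRATION INTERVAL"]))

# Pair every value with its computed timestamp.
series = [(start + i * step, value) for i, value in enumerate(values)]
for stamp, value in series:
    print(stamp, value)
```

Using `enumerate` over the values also avoids relying on NUMBER OF DATA RECORDS matching the actual number of lines in the file.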

Incidentally, the original reason I decided to try this in python was that my bash attempt has been running for a few days (on around 200 files), and the python implementation above takes about 20 seconds for the same number of files. Needless to say, that is something of an improvement, and probably says more about my bash implementation than anything else!

As for perl vs. python, the main reason I wanted to use python was it's so easy to read. Although perl's regular expressions are extremely powerful, they're just so hard to read unless you use them every day (I don't). So, python seemed the more obvious choice. That, and a piece of software I do use every day (ArcGIS) has the ability to incorporate custom functions written in python, so it's likely to be more useful to me in the future.

Thanks everyone for the input - hopefully this is the start of a prosperous use of python for me. It's been on my todo list for so long now.

Sergei Steshenko 06-16-2009 12:12 PM

Quote:

Originally Posted by ghostdog74 (Post 3575330)
...
Yes, but those (I think you mean anonymous functions) don't have a real use case in practice. Also, they make code hard to read and hard to troubleshoot/understand. I can tell you the different things done in Python that are far easier and better than in Perl, but that's not the point.

It prevents name collisions, so the code is easier to maintain.

The idea of anonymous functions/objects is that not the anonymous code author, but the anonymous code user decides on names, so the user and the author have no need for prior negotiations regarding names.

In a similar manner, inheritance is easily done "inline" through scoping rules - the lack of decent scoping rules is a big drawback of Python, probably the biggest repellent for me.

Hko 06-16-2009 12:25 PM

May I suggest you open (yet another) dedicated thread to do perl vs python flamewars?

ghostdog74 06-16-2009 07:27 PM

Quote:

Originally Posted by Hko (Post 3576137)
May I suggest you open (yet another) dedicated thread to do perl vs python flamewars?

are you serious?


All times are GMT -5.