LinuxQuestions.org - write a file fo data array (with float values) in a binary format in Python

Please use [CODE][/CODE] tags around your code and data examples, so they're easier to read.

Quote:

Originally Posted by justinlapin (Post 4639380)

My file is composed of several data arrays containing float values [in binary format].

You use struct.pack() to pack the float data. For example:

Code:

import sys

import struct



data = [ 1.2, 1.5, 1.7 ]



handle = open('floats.bin', 'wb')

handle.write(struct.pack('<%df' % len(data), *data))

handle.close()



handle = open('doubles.bin', 'wb')

handle.write(struct.pack('<%dd' % len(data), *data))

handle.close()

In the above example, the pack statement evaluates to struct.pack('<3f', 1.2, 1.5, 1.7) for floats.bin file and to struct.pack('<3d', 1.2, 1.5, 1.7) for doubles.bin file. Check the struct format characters in the Python standard library documentation, and you'll see the pack format strings read as "in little-endian byte order, three IEEE-754 float (binary32) values" and "in little-endian byte order, three IEEE-754 double (binary64) values".

Reading the data is just as simple, except you use struct.unpack() instead. The only important point is to realize you need to know exactly how many values you want to unpack. I personally prefer to embed the count somewhere in the file header, but if you have pure float/double data, you can just divide the length of the data by 4 (for floats) or 8 (for doubles). Oh, and struct.unpack() always returns a tuple, which you probably want to convert to a list using list() ; I do.

Code:

import sys

import struct



handle = open('floats.bin', 'rb')

datastr = handle.read()

data = list(struct.unpack('<%df' % (len(datastr)/4), datastr))

sys.stdout.write('Read %d floats from floats.bin:\n' % len(data))

for i in data:

    sys.stdout.write('\t%g\n' % i)

handle.close()



handle = open('doubles.bin', 'rb')

datastr = handle.read()

data = list(struct.unpack('<%dd' % (len(datastr)/8), datastr))

sys.stdout.write('Read %d doubles from doubles.bin:\n' % len(data))

for i in data:

    sys.stdout.write('\t%g\n' % i)

handle.close()

Finally, a couple of notes.

The above code is platform-agnostic. The < at the start of the struct format means the data will be little-endian in the file, even if the computer architecture was big-endian. According to Python docs, it will also work even if the architecture used non-IEEE-754 floating-point formats.

While my code reads and writes the entire data at once, it wastes memory. You're better off slicing the data into say 131072-value units (yielding half a megabyte or megabyte of data per slice), and packing and writing or reading and unpacking each slice at a time, to limit the amount of temporary storage needed. It becomes very important very suddenly, when the data set size grows. Obviously, if your entire data set is smaller than twice that, it is better to just do it all at once. (But please, do not just assume that "640k is enough for everyone", that "nobody will ever use this for that much data". That way of thinking leads to tools that break for no discernible reason, when the dataset grows large enough; and that is extremely frustrating.)

(The I/O block size calculation is a bit of black magic. First, it should be a power of two, because all current filesystems use power-of-two block sizes, so non-power of two writes end up being much slower. Second, it should be large enough so that the OS can handle the write optimally. Third, it should be small enough so both source and target data, during the conversion, fits comfortably in the CPU core cache. Thus, there is really no right answer, but in practical terms, I find 131072/262144/524288/1048576/2097152/4194304/8388608 -byte blocks to work best -- and have done so for the last ten years or so, so I do not expect that to change anytime soon. Smaller end works better with desktop-type machines, and larger end with server and cluster-type machines.)

The above code snippets work in both Python 2 and Python 3.

@firstfire: Reading floating-point values from text strings, and writing them as strings, is a very, very slow process. The rounding requirements in the relevant standards mean very precise intermediate values must be used. Even on high-end CPUs, the conversion speed is less than a million values per second!

With classical atomic simulations (molecular dynamics), you may have say ten million atoms in a system. To animate it in a video, you might save just the coordinates for say a hundred frames (yielding just a few seconds of animation). In practice, the simulation is measurably slowed down by the conversion of coordinates to text format, even when the system is distributed over eight CPU cores (or nodes). On the other hand, saving the coordinates in native binary format (thirty million doubles per frame, for 100 frames, is about 24 gigabytes of data) is only limited by the local I/O speed. I also prefer to do it on the background, so as long as the calculations take up enough time, and there is enough memory to keep one frame in memory (about 240 megabytes in this case), the save is essentially free. If I do the conversion to text, it still takes a large amount of CPU processing power away from the calculations, and it is very difficult to distribute it evenly to the computing cores -- meaning it is usually one core that does the job, with the rest idling by. What a waste.

Additionally, the binary data format ends up being easier to handle. You can slice exactly the parts of data you want, simply by seeking the file; if it was text, you'd have to read the file from the start to find the exact row you want. Not that fun if the file is a couple of dozen gigabytes in size..

Hope this helps,