LinuxQuestions.org

-   -   Saving file data using Python in an embedded system in an safe and fast way (http://www.linuxquestions.org/questions/programming-9/saving-file-data-using-python-in-an-embedded-system-in-an-safe-and-fast-way-876015/)

Fabio Paolini 04-19-2011 08:54 PM

Saving file data using Python in an embedded system in a safe and fast way
 
Hi, I am developing a program on a system where Linux does not take care of the sync command automatically, so I have to run it from my application whenever I save data to disk, which in my case is a 2 GB SD card.

It is true that I could make the operating system take care of the synchronization by using a proper mount option, but in that case the program's performance drops drastically.


In particular I use the shelve module from Python to save data that comes from a socket/TCP connection, and I have to deal with the potential risk of the system being turned off suddenly.

Initially I wrote something like this to save data using shelve:
Code:

import shelve
import os

def saveData(key, vo):
    fd = shelve.open('fileName', 'c')
    fd[key] = vo
    fd.close()
    os.system("sync")

But that takes too much time to save the data.
Note that I run the sync from the OS every time I close a file, to prevent data corruption in case the "computer" is turned off with data still in the buffers.
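If the external sync command itself turns out to be part of the cost, note that it flushes every dirty buffer in the whole system; os.fsync can flush just the one file that was written. A minimal sketch (sync_file is an illustrative name, not a standard function):

```python
import os

def sync_file(path):
    # Flush only this one file to the device with fsync(), instead of
    # flushing every dirty buffer on the system with the external
    # "sync" command.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

On a shelve database this would be pointed at the underlying database file after fd.close().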

To improve the performance I did something like this:
Code:

def saveListData(items):
    fd = shelve.open('file_name', 'c')
    for itemVo in items:
        fd[itemVo.key] = itemVo
    fd.close()
    os.system("sync")

Thus, I first collect a number of objects in a list, then open the file and save them all, so the file only has to be opened once to save many objects.

However, I would like to know whether adding a lot of objects before closing the file increases the risk of data corruption.
I know that turning off the system after fd.close() and before the sync may cause problems. But what about turning off the system after
Code:

fd = shelve.open('file_name', 'c')
but before fd.close()?

Thanks for any suggestion.

bgeddy 04-21-2011 01:26 PM

Just a few thoughts: do you really need the overhead of shelve? Why not just use cPickle? Also, keeping the files open for as short a time as possible will cut down on the risk of data loss. I've made these and a few other changes to try to address some of your concerns. Not knowing much about your system, it's difficult to offer more suggestions, but it may be useful in some way, so here it is:
Code:

import cPickle
import subprocess

def save_data(data):
    # open in binary mode: pickle streams are bytes, not text
    with open("save.file", "wb") as save_file:
        cPickle.dump(data, save_file)
    return subprocess.call('sync')

def get_data():
    with open("save.file", "rb") as save_file:
        data = cPickle.load(save_file)
    return data

mylist = ["this is a list"] * 100
print "save_data returned %i" % save_data(mylist)
del mylist
mylist=get_data()
print mylist


Fabio Paolini 04-22-2011 01:50 PM

Thanks bgeddy for your reply.

Quote:

Not knowing much about your system, it's difficult to offer more suggestions.
Basically I have a class in the Python code, and this code receives a set of parameters over the socket/TCP connection for each object that will be saved in the file. Each object has a unique key that is used later.

My problem is that sometimes I have to save thousands of objects, and this process turns out to be very slow, mainly because of the time spent opening and closing the file followed by the synchronization.


About pickle: as I must save many objects and their keys in a file for later use, I am afraid that pickle is not well suited, because I have understood that pickle saves just one object per file, so I would not be able to look up objects by key.
But of course, if it is possible to use keys with pickle and it is faster, I would prefer that.

orgcandman 04-22-2011 02:22 PM

You might also investigate the filesystem being used. Some filesystems may be faster for what you're trying to do, if changing is an option.

If not, you might see if you can open using the O_DIRECT option, which guarantees a sync() when you write.

Fabio Paolini 04-22-2011 06:02 PM

Quote:

You might also investigate the filesystem being used. Some filesystems may be faster for what you're trying to do, if changing is an option.
Thanks, I will check whether it is possible to change the filesystem. Currently I use ext2.

I have noticed that having to run the sync command constantly is not the main problem. I measured the time taken to open and close the file and also the time taken by the sync. When the size of the shelve file is around 5 MB, it takes 30 seconds to add 50 objects at once, then 4 more seconds to run the sync command. Thus the main difficulty is that performance degrades as the file size increases.
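That open/write-batch/close step can be timed in isolation with a small harness (bench_batch and its payload size are made up for illustration; the real numbers of course come from the AVR32 board and its SD card, not a desktop machine):

```python
import os
import shelve
import tempfile
import time

def bench_batch(n_objects, payload_size=1000):
    # Time one open / write-batch / close cycle against a fresh
    # shelve file; shelve keys must be strings.
    path = os.path.join(tempfile.mkdtemp(), "bench")
    t0 = time.time()
    db = shelve.open(path, 'c')
    for i in range(n_objects):
        db[str(i)] = "x" * payload_size
    db.close()
    return time.time() - t0
```

Running it with a growing n_objects, or against a pre-populated file, would show whether the slowdown tracks the file size or the number of objects per batch.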

bgeddy 04-23-2011 08:11 AM

Quote:

About pickle: as I must save many objects and their keys in a file for later use, I am afraid that pickle is not well suited, because I have understood that pickle saves just one object per file, so I would not be able to look up objects by key.
But of course, if it is possible to use keys with pickle and it is faster, I would prefer that.
That is true. Perhaps a rethink of your program logic would help. Maybe build a dictionary outside the open/save/close/sync cycle in your code and save that to the pickle? You can then read it back and access members by key. An interesting problem you have there.
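That suggestion could be sketched like so (pickle here is the Python 3 spelling of Python 2's cPickle; save_index and load_index are illustrative names):

```python
import pickle

def save_index(path, index):
    # Write the whole dictionary as one pickle; the file must be
    # opened in binary mode because pickle streams are bytes.
    with open(path, "wb") as f:
        pickle.dump(index, f)

def load_index(path):
    # Read the dictionary back; individual objects are then
    # reachable by key, just as with shelve.
    with open(path, "rb") as f:
        return pickle.load(f)
```

The trade-off is that the whole dictionary is rewritten on every save, instead of shelve's per-key updates.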
Quote:

I measured the time taken to open and close the file and also the time taken by the sync. When the size of the shelve file is around 5 MB, it takes 30 seconds to add 50 objects at once, then 4 more seconds to run the sync command. Thus the main difficulty is that performance degrades as the file size increases.
Wow, that's slow! It looks like you have already profiled your code, but if not, it's worth looking into.

Quote:

If not, you might see if you can open using the O_DIRECT option, which guarantees a sync() when you write.
You can open a file using the low-level os.open method, which takes flags like os.O_DIRECT and os.O_SYNC, and then convert it to a file object with os.fdopen, so that Python modules expecting a file object, such as cPickle, will work. The shelve module does its own file opening, so it is not possible to hand it a custom-opened file. Even then, a file descriptor opened with os.O_DIRECT will not work correctly when converted to a Python file object, which effectively means that modules working with file objects, such as pickle, won't work with it. The following is from the man (2) page for the open system call, which is mirrored in Python's os.open:
Quote:

In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." -- Linus
Which amused me greatly. However, this option is irrelevant for standard Python file objects.
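If per-write durability is the actual goal, os.O_SYNC (unlike O_DIRECT) does convert cleanly to a Python file object. A sketch, assuming the platform exposes O_SYNC (open_sync is an illustrative name):

```python
import os

def open_sync(path):
    # With O_SYNC every write() blocks until the data has been
    # committed to the device, so no separate sync step is needed.
    # Unlike O_DIRECT there are no buffer-alignment requirements,
    # so the descriptor works fine behind a normal file object.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    return os.fdopen(fd, "wb")
```

That removes the system-wide sync at the cost of making each write itself slower, so whether it wins depends on the workload.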

Apparently there is a way to do O_DIRECT file access in Python by memory-aligning the buffers with mmap and using buffers that are a multiple of the logical block size of the filesystem. This adds a fair bit of complication and only works with raw system file descriptors, not Python file objects. There is an interesting post here on the subject.

Changing the filesystem, as suggested by orgcandman, could help; a good suggestion.

Hopefully someone with direct experience of systems like yours, which I obviously don't have, will chip in.

Fabio Paolini 04-23-2011 09:51 AM

Quote:

Wow that's slow! It looks like you have already profiled your code but if not it's worth looking into it.
Yes, that is slow. But the part of the code that sends data to the file is simple, and I do not see many points that could be changed.
When the file is small, around 500 KB, the time to save data is small too: about 1 second to save 50 objects.


About a better filesystem to use, if someone could give me a suggestion ...

Here is my processor:
Code:

cat /proc/cpuinfo
processor       : 0
chip type       : AT32AP700x revision C
cpu arch        : AVR32B revision 1
cpu core        : AP7 revision 0
cpu MHz         : 140.000
i-cache         : 16K (4 ways x 128 sets x 32)
d-cache         : 16K (4 ways x 128 sets x 32)
features        : dsp simd ocd perfctr java
bogomips        : 282.06

It is not very fast, and perhaps that explains my problem.


Thanks again.

Fabio Paolini 04-24-2011 11:07 AM

Quote:

That is true. Perhaps a re think of your program logic would help. Maybe build a dictionary outside the /open/save/close/sync in your code and save that to the pickle? You may then read it back and access members by key value. An interesting problem you have there.
I tried that, but in this case, whenever I want to update the dictionary I have to read it from the file, update it, and then write all the data back to the file, which also takes a lot of time. I could not see another way of using pickle in this case.
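The read-modify-write cycle described above looks roughly like this (update_entry is an illustrative name), which makes it clear why the cost grows with the size of the whole dictionary rather than with the size of one update:

```python
import pickle

def update_entry(path, key, value):
    # Load the whole dictionary, change one entry, and rewrite the
    # entire file: every update pays for the full data set.
    try:
        with open(path, "rb") as f:
            index = pickle.load(f)
    except (OSError, IOError):
        index = {}  # first run: no file yet
    index[key] = value
    with open(path, "wb") as f:
        pickle.dump(index, f)
```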

