Traceback (most recent call last):
  File "/home/10dot9/bin/pythonscript/./hashdir.py", line 75, in <module>
    content = openfile.read()
MemoryError
This code will walk a directory and hash the files it finds. It used to work fine for small directories, but not for big ones: a big directory gives a memory error. Any idea how to fix this? I seem to have hit a wall.
Code:
#!/usr/bin/env python3
from sys import exit
from sys import version as version    # needed to determine Python version number
from sys import platform as platform  # needed to determine OS version number
import hashlib
from os import walk
import csv
from os.path import join

print('Python Version ' + version)
print('Operating System Platform ' + platform)
print('This Python 3 script will Sha256 recurse hash files in Directory')
print(' ')
#==== Source Directory to hash
SRC_DIR = input('File path to run file hash on: \n')
#==== csv file creation
# field names
fields = ['Sha256 File Hash', 'Full File Path']
#==== name of csv file and its place in the filesystem
filename = input('Where to save the csv file: \n')
#==== writing to csv file
with open(filename, 'w') as csvfile:
    #==== creating a csv writer object
    csvwriter = csv.writer(csvfile)
    #==== writing the fields
    csvwriter.writerow(fields)
#==== recurse the Directory
print('Walking ', SRC_DIR)
files_dir = []
for root, subdirs, files in walk(SRC_DIR):
    for file in files:
        files_dir.append(join(root, file))
print('DoneWalking ', SRC_DIR)
#==== to see full file paths in files_dir
#for f in files_dir:
#    print(f)
#==== run sha256 on file
print('Starting the hashing of files\n')
for x in sorted(files_dir):
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        content = openfile.read()
        hasher.update(content)
    print(hasher.hexdigest().upper(), x + '\n', sep=",")
    with open(filename, 'a') as csvfile:
        #=========== writing data to csv file
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
print('Done with hashing of files!')
input('Enter to close..')
Traceback (most recent call last):
  File "/usr/lib/python3.9/idlelib/run.py", line 559, in runcode
    exec(code, self.locals)
  File "/home/10dot9/bin/pythonscript/filetreeHash.py", line 75, in <module>
    content = openfile.read()
MemoryError
I changed the code to del content to see if that would clear up the memory:
Code:
#==== run sha256 on file
print('Starting the hashing of files\n')
for x in sorted(files_dir):
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        content = openfile.read()
        hasher.update(content)
    print(hasher.hexdigest().upper(), x + '\n', sep=",")
    del content
    with open(filename, 'a') as csvfile:
        #=========== writing data to csv file
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
While this is probably not related to the problem, don't keep opening the CSV file. Just open it once, write to it until you have no data, and let it close afterwards.
Also, it looks to me like the problem isn't the size of the directory, but the size of the files in that directory?
I take it that some of the files being hashed are huge?
Obviously, as you've noticed, the openfile.read() calls are part of the problem.
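A minimal sketch of both suggestions (read in chunks, open the CSV once), keeping the walk-then-hash structure and variable names of the original script. SRC_DIR and filename are placeholders standing in for the interactive input() prompts:
Code:
#!/usr/bin/env python3
# Sketch only, not the original script: chunked reads plus a single
# CSV open. SRC_DIR and filename are placeholders.
import csv
import hashlib
from os import walk
from os.path import join

SRC_DIR = '/path/to/hash'   # placeholder
filename = 'hashes.csv'     # placeholder

files_dir = []
for root, subdirs, files in walk(SRC_DIR):
    for file in files:
        files_dir.append(join(root, file))

# One open() for the whole run; every row goes through the same handle.
with open(filename, 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['Sha256 File Hash', 'Full File Path'])
    for x in sorted(files_dir):
        hasher = hashlib.sha256()
        with open(x, 'rb') as openfile:
            # 1 MiB chunks: only one chunk is ever in memory at a time.
            for chunk in iter(lambda: openfile.read(2 ** 20), b''):
                hasher.update(chunk)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
The 2 ** 20 (1 MiB) chunk size is just one choice; bigger chunks mean fewer loop iterations, smaller chunks mean a lower memory ceiling.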
I was not thinking of possible large files in the directory. With help from the two answers below, I updated the code on Friday, January 07 2022 at 23:58.
https://stackoverflow.com/questions/36099331/how-to-grab-all-files-in-a-folder-and-get-their-md5-hash-in-python
Quote:
Notice that this can potentially exhaust your memory if you happen to have a large file in that directory, so it is better to read the file in smaller chunks (adapted here for 1 MiB blocks).
-- Antti Haapala
https://codereview.stackexchange.com/questions/147056/short-script-to-hash-files-in-a-directory
Quote:
I've used for chunk in iter(lambda: f.read(4096), b""), which is a better approach for hashing large files (sometimes you won't be able to fit the whole file in memory; in that case, you'll have to read chunks of 4096 bytes sequentially).
-- Grajdeanu Alex
I changed this part of the code to make it work
Code:
#==== run sha256 on file
n = 0
print('Starting the hashing of files\n')
for x in sorted(files_dir):
    n = n + 1
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        #content = openfile.read() # old
        #hasher.update(content)    # old
        # below is new to handle large files and avoid
        #     content = openfile.read()
        #     MemoryError
        for content in iter(lambda: openfile.read(2 ** 20), b""):
            hasher.update(content)
    print('file number:', n, '\n')
    print(hasher.hexdigest().upper(), x + '\n', sep=",")
    with open(filename, 'a') as csvfile:
        #=========== writing data to csv file
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
With the code change I have noticed a speed change when the script runs, but that does not bother me much.
I can only repeat post #3: do not keep files (or anything else) in memory if it is not needed. Otherwise, it would be nice to see more about the situation; dropping in an error message is far from enough. I guess you keep much more in memory, so this read is only the last step.
You can probably use a Python memory profiler to catch it.
Also, I wouldn't open the csv file in every cycle, just once, but that is not really related.
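For the profiler suggestion, a minimal sketch using the standard-library tracemalloc module; the directory walk and hashing loop are elided and would go where the comment sits:
Code:
import tracemalloc

tracemalloc.start()

# ... run the directory walk and hashing here ...

snapshot = tracemalloc.take_snapshot()
# Print the ten source lines that allocated the most memory.
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)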
I have seen exactly this sort of error both in Python and in PHP, leading me to suspect that the actual error is in some binary library used by both. Do not repeatedly load a file into memory, especially not a large one. You should read files "one record at a time." There is always a way to do that.
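For a plain text file, "one record at a time" can be as simple as iterating the file object, which yields one line per step instead of loading the whole file; a sketch, where process() is a stand-in for whatever you do with each record:
Code:
def process(line):
    pass  # placeholder for the real per-record work

with open('big.log') as f:
    for line in f:      # the file object yields one line at a time
        process(line)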