Traceback (most recent call last):
  File "/home/10dot9/bin/pythonscript/./hashdir.py", line 75, in <module>
    content = openfile.read()
MemoryError
This code will walk a directory and hash the files it finds. It used to work fine for small directories, but not for big ones: a big directory gives a memory error. Any idea how to fix this? I seem to have hit a wall.
Code:
#!/usr/bin/env python3
from sys import exit
from sys import version as version    # needed to determine Python version number
from sys import platform as platform  # needed to determine OS version number
import hashlib
from os import walk
import csv
from os.path import join

print('Python Version ' + version)
print('Operating System Platform ' + platform)
print('This Python 3 script will Sha256 recurse hash files in Directory')
print(' ')
#==== Source Directory to hash
SRC_DIR = input('File path to run file hash on: \n')
#==== csv file creation
# field names
fields = ['Sha256 File Hash', 'Full File Path']
#==== name of csv file and its place in the filesystem
filename = input('Where to save the csv file: \n')
#==== writing to csv file
with open(filename, 'w') as csvfile:
    #==== creating a csv writer object
    csvwriter = csv.writer(csvfile)
    #==== writing the fields
    csvwriter.writerow(fields)
#==== recurse the Directory
print('Walking ', SRC_DIR)
files_dir = []
for root, subdirs, files in walk(SRC_DIR):
    for file in files:
        files_dir.append(join(root, file))
print('DoneWalking ', SRC_DIR)
#==== to see full file paths in files_dir
#for f in files_dir:
#    print(f)
#==== run sha256 on file
print('Starting the hashing of files\n')
for x in sorted(files_dir):
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        content = openfile.read()
        hasher.update(content)
    print(hasher.hexdigest().upper(), x + '\n', sep=",")
    with open(filename, 'a') as csvfile:
        #=========== writing data to csv file
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
print('Done with hashing of files!')
input('Enter to close..')
Traceback (most recent call last):
  File "/usr/lib/python3.9/idlelib/run.py", line 559, in runcode
    exec(code, self.locals)
  File "/home/10dot9/bin/pythonscript/filetreeHash.py", line 75, in <module>
    content = openfile.read()
MemoryError
I changed the code to del content to see if that would clear up the memory:
Code:
#==== run sha256 on file
print('Starting the hashing of files\n')
for x in sorted(files_dir):
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        content = openfile.read()
        hasher.update(content)
    print(hasher.hexdigest().upper(), x + '\n', sep=",")
    del content
    with open(filename, 'a') as csvfile:
        #=========== writing data to csv file
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
While this is probably not related to the problem, don't keep opening the CSV file. Just open it once, write to it until you have no data, and let it close afterwards.
Also, it looks to me like the problem isn't the size of the directory, but the size of the files in that directory?
I take it that some of the files being hashed are huge?
Obviously, as you've noticed, the openfile.read() calls are part of the problem.
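A minimal sketch of both suggestions (read in chunks, open the CSV once), keeping the walk-then-hash structure and variable names of the original script. SRC_DIR and filename are placeholders standing in for the interactive input() prompts:
Code:
#!/usr/bin/env python3
# Sketch only, not the original script: chunked reads plus a single
# CSV open. SRC_DIR and filename are placeholders.
import csv
import hashlib
from os import walk
from os.path import join

SRC_DIR = '/path/to/hash'   # placeholder
filename = 'hashes.csv'     # placeholder

files_dir = []
for root, subdirs, files in walk(SRC_DIR):
    for file in files:
        files_dir.append(join(root, file))

# One open() for the whole run; every row goes through the same handle.
with open(filename, 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['Sha256 File Hash', 'Full File Path'])
    for x in sorted(files_dir):
        hasher = hashlib.sha256()
        with open(x, 'rb') as openfile:
            # 1 MiB chunks: only one chunk is ever in memory at a time.
            for chunk in iter(lambda: openfile.read(2 ** 20), b''):
                hasher.update(chunk)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
The 2 ** 20 (1 MiB) chunk size is just one choice; bigger chunks mean fewer loop iterations, smaller chunks mean a lower memory ceiling.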
I was not thinking of possible large files in the directory. With help from the two answers below, I updated the code on Friday, January 07 2022 at 23:58.
https://stackoverflow.com/questions/36099331/how-to-grab-all-files-in-a-folder-and-get-their-md5-hash-in-python
Quote:
Notice that this can potentially exhaust your memory if you happen to have a large file in that directory, so it is better to read the file in smaller chunks (adapted here for 1 MiB blocks).
-- Antti Haapala
https://codereview.stackexchange.com/questions/147056/short-script-to-hash-files-in-a-directory
Quote:
I've used for chunk in iter(lambda: f.read(4096), b""), which is a better approach for hashing large files (sometimes you won't be able to fit the whole file in memory; in that case, you'll have to read chunks of 4096 bytes sequentially).
-- Grajdeanu Alex
I changed this part of the code to make it work
Code:
#==== run sha256 on file
n = 0
print('Starting the hashing of files\n')
for x in sorted(files_dir):
    n = n + 1
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        #content = openfile.read() # old
        #hasher.update(content)    # old
        # below is new to handle large files and avoid
        #     content = openfile.read()
        #     MemoryError
        for content in iter(lambda: openfile.read(2 ** 20), b""):
            hasher.update(content)
    print('file number:', n, '\n')
    print(hasher.hexdigest().upper(), x + '\n', sep=",")
    with open(filename, 'a') as csvfile:
        #=========== writing data to csv file
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([hasher.hexdigest().upper(), x])
With the code change I have noticed a speed change when the script runs, but that does not bother me much.
I can only repeat post #3: do not keep files (or anything else) in memory if it is not needed. Otherwise, it would be nice to see more about the situation; dropping in an error message is far from enough. I guess you keep much more in memory, so this read is only the last step.
You can probably use a Python memory profiler to catch it.
Also, I wouldn't open the csv file in every cycle, just once, but that is not really related.
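For the profiler suggestion, a minimal sketch using the standard-library tracemalloc module; the directory walk and hashing loop are elided and would go where the comment sits:
Code:
import tracemalloc

tracemalloc.start()

# ... run the directory walk and hashing here ...

snapshot = tracemalloc.take_snapshot()
# Print the ten source lines that allocated the most memory.
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)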
I have seen exactly this sort of error both in Python and in PHP, leading me to suspect that the actual error is in some binary library used by both. Do not repeatedly load a file into memory, especially not a large one. You should read files "one record at a time." There is always a way to do that.
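For a plain text file, "one record at a time" can be as simple as iterating the file object, which yields one line per step instead of loading the whole file; a sketch, where process() is a stand-in for whatever you do with each record:
Code:
def process(line):
    pass  # placeholder for the real per-record work

with open('big.log') as f:
    for line in f:      # the file object yields one line at a time
        process(line)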