01-07-2022, 10:22 PM   #1
lancebermudez (LQ Newbie)

python memory error

The memory error I get is:
Code:
Traceback (most recent call last):
  File "/home/10dot9/bin/pythonscript/./hashdir.py", line 75, in <module>
    content = openfile.read()
MemoryError
This code walks a directory and collects the files in it to hash. It used to work well for small directories, but a big directory gives a memory error. Any idea how to fix this? I seem to have hit a wall.
Code:
#!/usr/bin/env python3
from sys import exit
from sys import version  # needed to determine the Python version number
from sys import platform  # needed to determine the OS platform
import hashlib
from os import walk
import csv
from os.path import join

print('Python Version '+ version)
print('Operating System Platform '+ platform)
print('This Python 3 script will recursively SHA256-hash the files in a directory')
print(' ')

#==== Source Directory to hash
SRC_DIR = input('File path to run file hash on: \n')

#==== csv file creation
# field names 
fields = ['Sha256 File Hash', 'Full File Path']

#==== name of the csv file and its place in the filesystem
filename = input('Where to save the csv file: \n')

#==== writing to csv file
with open(filename, 'w') as csvfile:
    #==== creating a csv writer object
    csvwriter = csv.writer(csvfile)

    #==== writing the fields
    csvwriter.writerow(fields)


#==== recurse the Directory
print('Walking ', SRC_DIR)
files_dir = []

for root, subdirs, files in walk(SRC_DIR):
    for file in files:
        files_dir.append(join(root, file))

print('Done walking ', SRC_DIR)

#==== to see full file paths in files_dir
#for f in files_dir:
#    print(f)


#==== run sha256 on file
print('Starting the hashing of files\n') 
for x in sorted(files_dir):
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        content = openfile.read()
        hasher.update(content)
        print(hasher.hexdigest().upper(), x +'\n', sep=",")
        with open(filename, 'a') as csvfile:
#=========== writing data to csv file
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow([hasher.hexdigest().upper(), x])

print('Done with hashing of files!')
input('Enter to close..')

Last edited by lancebermudez; 01-07-2022 at 10:26 PM.
 
01-07-2022, 11:02 PM   #2
lancebermudez (LQ Newbie, original poster)
Still getting the memory error at:
Code:
Traceback (most recent call last):
  File "/usr/lib/python3.9/idlelib/run.py", line 559, in runcode
    exec(code, self.locals)
  File "/home/10dot9/bin/pythonscript/filetreeHash.py", line 75, in <module>
    content = openfile.read()
MemoryError
I changed the code to del content to see if that will clear up the memory:
Code:
#==== run sha256 on file
print('Starting the hashing of files\n') 
for x in sorted(files_dir):
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        content = openfile.read()
        hasher.update(content)
        print(hasher.hexdigest().upper(), x +'\n', sep=",")
        del content
        with open(filename, 'a') as csvfile:
#=========== writing data to csv file
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow([hasher.hexdigest().upper(), x])
 
01-07-2022, 11:21 PM   #3
dugan (LQ Guru)
While this is probably not related to the problem, don't keep reopening the CSV file. Just open it once, write to it until you have no more data to write, and let it close afterwards.

Also, it looks to me like the problem isn't the size of the directory, but the size of the files in that directory?

I take it that some of the files being hashed are huge?

Obviously, as you've noticed, the openfile.read() calls are part of the problem.

Can you avoid those?

I see some sample code here...

https://stackoverflow.com/q/22058048/240515
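For reference, a minimal sketch of the chunked-reading approach described there; the function name and the 64 KiB block size are illustrative choices, not code taken from the link:
Code:
import hashlib

def sha256_of_file(path, block_size=65536):
    """Hash a file without loading all of it into memory."""
    hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() keeps calling f.read(block_size) until it returns b"" at EOF
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()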

Last edited by dugan; 01-07-2022 at 11:28 PM.
 
01-08-2022, 01:28 AM   #4
lancebermudez (LQ Newbie, original poster)

I was not thinking of the possibility of large files in the directory. With help from:
Code:
updated Friday, January 07 2022 23:58
https://stackoverflow.com/questions/36099331/how-to-grab-all-files-in-a-folder-and-get-their-md5-hash-in-python
Notice that this can potentially exhaust your memory
if you happen to have a large file in that directory,
so it is better to read the file in smaller chunks
(adapted here for 1 MiB blocks):
edited Aug 25 '18 at 8:13
answered Mar 19 '16 at 8:05
Antti Haapala

https://codereview.stackexchange.com/questions/147056/short-script-to-hash-files-in-a-directory
I've used for chunk in iter(lambda: f.read(4096), b"")
which is a better approach for hashing large files
(sometimes you won't be able to fit the whole file in
memory. In that case, you'll have to read chunks
of 4096 bytes sequentially).
edited Nov 15 '16 at 15:04
answered Nov 15 '16 at 8:02
Grajdeanu Alex
I changed this part of the code to make it work
Code:
#==== run sha256 on file
n = 0
print('Starting the hashing of files\n') 
for x in sorted(files_dir):
    n = n + 1
    hasher = hashlib.sha256()
    with open(x, 'rb') as openfile:
        #content = openfile.read() # old
        #hasher.update(content) # old
        # below is new to handle large files and avoid
        #     content = openfile.read()
        #     MemoryError
        for content in iter(lambda: openfile.read(2 ** 20), b""):
            hasher.update(content)
        print('file number:',n,'\n')
        print(hasher.hexdigest().upper(), x +'\n', sep=",")
        with open(filename, 'a') as csvfile:
#=========== writing data to csv file
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow([hasher.hexdigest().upper(), x])
With the code change I have noticed a change in speed when the script is run, but that does not bother me much.
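A possible further refinement, combining this chunked loop with dugan's suggestion to open the CSV file only once, might look like the sketch below; this restructuring is an assumption, not code posted in the thread (newline='' is what the csv module documentation recommends for file objects passed to csv.writer):
Code:
#==== run sha256 on files, with the csv file opened only once
print('Starting the hashing of files\n')
with open(filename, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    for n, x in enumerate(sorted(files_dir), start=1):
        hasher = hashlib.sha256()
        with open(x, 'rb') as openfile:
            # read 1 MiB at a time so only one chunk is ever in memory
            for content in iter(lambda: openfile.read(2 ** 20), b""):
                hasher.update(content)
        print('file number:', n, '\n')
        print(hasher.hexdigest().upper(), x + '\n', sep=",")
        csvwriter.writerow([hasher.hexdigest().upper(), x])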
 
01-08-2022, 03:42 AM   #5
pan64 (LQ Addict)
I can only repeat post #3: do not keep files, or anything else, in memory if it is not needed. Otherwise, it would be nice to see more about the situation; dropping in an error message alone is far from enough. I guess you keep much more in memory, so this failing read is only the last step. You can probably use a Python memory profiler to catch it (a sketch follows below).
Also, I wouldn't open the csvfile in every cycle, just once, but that is not really related.
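A minimal sketch of that profiling idea, using the standard-library tracemalloc module (the choice of tracemalloc is an assumption; no specific profiler is named above):
Code:
import tracemalloc

tracemalloc.start()

# ... run the hashing loop here ...

snapshot = tracemalloc.take_snapshot()
# print the ten source lines currently holding the most memory
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)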
 
01-08-2022, 10:03 AM   #6
sundialsvcs (LQ Guru)
I have seen exactly this sort of error in both Python and PHP, which leads me to suspect that the actual error is in some binary library used by both. Do not repeatedly load a whole file into memory, especially not a large one. You should read files "one record at a time." There is always a way to do that (a small illustration follows).
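For a plain text file, "one record at a time" can be as simple as iterating over the file object; a generic illustration (process() is a hypothetical per-record handler, not part of the script above):
Code:
with open('big.log', 'r') as f:
    # the file object yields one line at a time; only that line is in memory
    for line in f:
        process(line)  # hypothetical per-record handler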
 
  

