awk running out of memory
I am running the following awk script using the -f option, but it runs out of memory. I know that the program gets to the main action statement and that it is here that the memory gets consumed. It is as if something in the main action statement leads to ever-increasing memory consumption.
I have tried the following, but it still leads to massive memory consumption:
* running from the command line, i.e. without the -f option
* a main action statement consisting only of print $0
I am running it on a 7 GB file. This file size does not cause problems with other scripts. For example, the following works fine:
awk '($2==1){print}' myBigFile.txt
Any ideas on what could be causing the problem? Thanks. Tim.
Code:
BEGIN{ |
Well, you're essentially building a two dimensional array that's i x j characters. If i and j get very large, you're going to be running out of memory very quickly.
|
To expand on that, the awk...print just reads/prints as it goes; memory usage is small. The 2D array will be in RAM, then expand into swap if you run out of RAM.
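A rough sketch of that difference, using a hypothetical three-line input in place of myBigFile.txt. The names sample, lines, stored and streamed are made up for illustration. The first command keeps every record in an array, while the second is the filter from the original post, which only ever holds the current record.

```shell
# Hypothetical sample data standing in for the 7 GB file.
sample=$(printf '10 1\n20 1\n30 2\n')

# Storing: one array element per input line, so memory grows with the file.
stored=$(printf '%s\n' "$sample" | awk '
    { lines[NR] = $0 }
    END { n = 0; for (k in lines) n++; print n }')

# Streaming: only the current record is held; nothing accumulates.
streamed=$(printf '%s\n' "$sample" | awk '($2==1){print}' | awk 'END { print NR }')

echo "elements stored: $stored, lines streamed: $streamed"
```

On a file with millions of lines, the first form scales with the input size and the second does not.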
|
You shouldn't do such heavy data munging in a BEGIN section to start with.
Define functions instead if that is what you need, but keep those calculations inside the body of your main AWK script. Anyway, what version of the AWK interpreter are you using? For example, an earlier version of MAWK had a small memory leak, as far as I can recall. |
BTW Aren't your tables (arrays) growing insanely big? i.e., Isn't NB duplicating NR's internal value? (RS perhaps?)
|
Some questions and some notes:
1) How big is the regions file (the one read in the BEGIN section of your awk script)? In other words, how many elements does it store in the positionChr array?
2) How much memory does the machine running the script have?
3) What is the purpose of the following expression?
Code:
while((getline < rgnsFile) > 0)
4) Do you see any lines of output from the main section before the memory runs out?
5) One thing to keep in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason the syntax "index in array" exists: it lets you test for an element, or scan all the elements of an array, without creating anything. So if the main action looks up a huge number of elements that were never assigned in the BEGIN section, the array can grow enormously and unexpectedly in the main action. |
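A minimal sketch of point 5, assuming a hypothetical three-line input and an array that was never filled in BEGIN. The array name positionChr comes from the thread; sample, created and untouched are illustrative names. It counts the elements left in the array after a plain lookup and after an "in" test.

```shell
# Hypothetical input: three (chromosome, position) pairs, none preloaded.
sample=$(printf 'chr1 100\nchr1 200\nchr2 300\n')

# Plain reference: each lookup creates positionChr[$1,$2] as a side effect.
created=$(printf '%s\n' "$sample" | awk '
    { if (positionChr[$1,$2] == 1) hits++ }
    END { n = 0; for (k in positionChr) n++; print n }')

# Membership test: "in" checks for the key without creating it.
untouched=$(printf '%s\n' "$sample" | awk '
    { if (($1,$2) in positionChr) hits++ }
    END { n = 0; for (k in positionChr) n++; print n }')

echo "after plain reference: $created elements; after in-test: $untouched elements"
```

With a 7 GB input, the plain reference would add one element per distinct ($1,$2) pair seen, which may well be the growth described in the original post.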
Assuming you are indeed processing only one file of 7 GB, your awk code reads that file twice: once in the while loop in the BEGIN section, and once through the normal processing of the command-line argument. This is one of the reasons it takes more time. Why don't you do everything in a single pass, outside the BEGIN loop?
Small note: remember to close your file when you use a while loop with getline inside awk, e.g. close(rgnsFile) |
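One way this could look, as a sketch only: it assumes rgnsFile is a small, separate regions file rather than the big file, which the thread never confirms. The file names, sample contents and the variables tmpdir, line, f and matched are all hypothetical. The lookup table is loaded once in BEGIN, the file is closed, and the big file is then streamed with an "in" test so no new elements are created.

```shell
# Hypothetical stand-ins for the regions file and the big data file.
tmpdir=$(mktemp -d)
printf 'chr1 100\nchr2 200\n' > "$tmpdir/regions.txt"
printf 'chr1 100 x\nchr1 150 y\nchr2 200 z\n' > "$tmpdir/big.txt"

matched=$(awk -v rgnsFile="$tmpdir/regions.txt" '
BEGIN {
    # Load the small lookup table once, then release the file handle.
    while ((getline line < rgnsFile) > 0) {
        split(line, f, " ")
        positionChr[f[1], f[2]] = 1
    }
    close(rgnsFile)
}
# Stream the big file; "in" tests membership without growing the array.
($1, $2) in positionChr { print }
' "$tmpdir/big.txt")

echo "$matched"
rm -rf "$tmpdir"
```

If rgnsFile really is the 7 GB file itself, loading it into an array in BEGIN would need memory on the order of the file size, and the single-pass approach suggested above would be the better fit.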
i.e., as many lines as his 7 GB text database has, provided they have a $2 field.
He may have this variable assigned externally, and he does. If it weren't, AWK would exit with an error or fail silently, depending on his interpreter. It was stated clearly that -f was in use.
|
Hi ghostdog! :)
|
Code:
if(positionChr[$1,$2]==1) |
Code:
getline < rgnsFile |
Code:
awk 'BEGIN{ while ((getline < "file") > 0 ){ print } }'
Code:
awk '{print}' file
Lastly, as I stated in my first post, I am assuming rgnsFile is the big file, not a separate "region" file like you mentioned. The OP has to show more in order to clarify what's going on. |
getline < rgnsFile tries to read the next line (the first one on the first call; each later call continues from where the previous one stopped) from the file whose name is stored in the rgnsFile variable. If rgnsFile is not an assigned variable, or the file it names does not exist, getline fails and returns -1, so the comparison with > 0 evaluates to zero (0). Thus, this is the same as
Code:
while ((getline < "nofile") > 0)
Code:
empty_file=( (getline < foo) > 0 )
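A small sketch of the three getline return values, assuming a hypothetical one-line temporary file and a second path that does not exist. The names tmp, f, missing, r1, r2, r3 and codes are illustrative only.

```shell
# Hypothetical one-line file, plus a path that is known not to exist.
tmp=$(mktemp)
printf 'one\n' > "$tmp"

codes=$(awk -v f="$tmp" -v missing="$tmp.does-not-exist" 'BEGIN {
    r1 = (getline line < f)        # a line was read
    r2 = (getline line < f)        # end of file reached
    close(f)
    r3 = (getline line < missing)  # file could not be opened
    print r1, r2, r3
}')

echo "$codes"
rm -f "$tmp"
```

This is why the loop condition compares against > 0: a bare while (getline < file) would spin forever on a missing file, because -1 counts as true.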
That is, if he has a decent AWK interpreter, which was one of my earlier questions. EDIT: Anyway, the problem appears to be that he is keeping a lot of data in those arrays, and I fail to see why he needs to do so. |