Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I am running the following awk script using the -f option, but it runs out of memory. I know the program reaches the main action statement, and it is there that the memory gets consumed. It is as if something in the main action statement leads to increasing amounts of memory being consumed.
I have tried the following, but it still leads to massive memory consumption:
* running from the command line, i.e. without the -f option;
* a main action statement consisting only of print $0.
I am running it on a file of 7 GB.
This file size does not cause problems with other scripts. For example, the following works fine:
awk '($2==1){print}' myBigFile.txt
Any ideas on what could be causing the problem?
Thanks.
Tim.
Code:
BEGIN {
    print "something"
    OFS = "\t"
    nb = 0
    while ((getline < rgnsFile) > 0) {
        nb += 1
        tileChr[nb] = $1
        tileStarts[nb] = $2
        tileEnds[nb] = $3
        print "reading region"
    }
    for (i in tileStarts) {
        for (j = tileStarts[i] - upStreamDist; j <= tileEnds[i] + downStreamDist; j++) {
            positionChr[tileChr[i], j] = 1
        }
        print "building region"
    }
}
{
    if (positionChr[$1, $2] == 1) {
        # Emit a 0-based bed line: chr, start, end, cov
        # using the columns of the pileup input file
        print $1, ($2 - 1), $2, $4
    }
}
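For context, a script like this is presumably invoked with the region file name supplied through the rgnsFile variable on the command line (the OP doesn't show the invocation, so the file names and region values below are made up). A tiny self-contained reproduction of the pattern:

```shell
# Hypothetical sketch: regions.txt plays the role of rgnsFile,
# pileup.txt the role of the 7 GB input file.
printf 'chr1\t10\t12\n' > regions.txt
printf 'chr1\t11\t.\t5\nchr2\t11\t.\t9\n' > pileup.txt

awk -v rgnsFile=regions.txt -v upStreamDist=0 -v downStreamDist=0 '
BEGIN {
    OFS = "\t"
    # read the region file, marking every covered position
    while ((getline < rgnsFile) > 0)
        for (j = $2 - upStreamDist; j <= $3 + downStreamDist; j++)
            positionChr[$1, j] = 1
    close(rgnsFile)
}
positionChr[$1, $2] == 1 { print $1, $2 - 1, $2, $4 }
' pileup.txt
```

Only the first pileup line falls inside a region, so the sketch prints a single bed line.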
Well, you're essentially building a two-dimensional array of i x j elements. If i and j get very large, you're going to run out of memory very quickly.
To expand on that: the awk ... print one-liner just reads and prints as it goes, so its memory usage is small. The 2D array, however, lives in RAM, then spills into swap if RAM runs out.
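To put rough numbers on that (the region counts here are invented, not the OP's): positionChr gets one key per covered position, i.e. roughly regions x (width + upStreamDist + downStreamDist) entries. For example:

```shell
# Back-of-the-envelope: 100,000 regions of 1,000 bp, padded by 500 bp
# on each side, yields 200 million keys; at tens of bytes per key/value
# pair that is gigabytes before the main input is even touched.
awk 'BEGIN {
    regions = 100000; width = 1000; pad = 500
    print regions * (width + 2 * pad)   # number of array keys
}'
```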
You shouldn't do such big data munging in a BEGIN section to start with.
Define functions instead if that is what you need, but keep those calculations inside the body of your main AWK script.
Anyway, what version of the AWK interpreter are you using? An earlier version of MAWK had a small memory leak, as far as I can recall.
Some questions and some notes:
1) How big is the region section of the file (the one read in the BEGIN section of your awk script)? In other words, how many elements does it store in the positionChr array?
2) How much memory do you have on the running machine?
3) What is the purpose of the following expression?
Code:
while((getline < rgnsFile) > 0)
what I don't understand is that you evaluate the expression getline < rgnsFile (rgnsFile not being assigned in the posted script) and then check whether it is true using > 0. Is it a redundancy, or am I missing something?
4) Do you see some lines of output from the main section before the memory outage?
5) One thing to take in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason the syntax "index in array" was created for scanning the elements of an array. So there is a chance that, if a huge number of elements were not assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
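Point 5 is easy to demonstrate; the sketch below counts elements with a for (k in a) loop rather than gawk's length(array), so it should work in any POSIX awk:

```shell
awk 'BEGIN {
    t = (a["x"] == 1)       # the mere reference creates a["x"]
    u = ("y" in a)          # the "in" test does NOT create a["y"]
    n = 0
    for (k in a) n++
    print n                 # 1: only "x" exists
}'
```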
Assuming you are indeed processing only one file of 7 GB: in your awk code you process the file twice, once in the while loop in the BEGIN section and once in the normal processing of the command-line argument. That is one of the reasons it takes more time. Why don't you do everything once, not in the BEGIN loop?
Small note: remember to close your file when you use a while loop inside awk, e.g. close(rgnsFile).
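The close() point matters because getline keeps a per-file read position: without close(), a second read of the same file name just sees end-of-file (and each open file also holds a descriptor). A small illustration with a made-up file name:

```shell
printf 'one\ntwo\n' > rgns_demo.txt
awk 'BEGIN {
    while ((getline line < "rgns_demo.txt") > 0) n++
    close("rgns_demo.txt")      # rewind: allows the file to be read again
    while ((getline line < "rgns_demo.txt") > 0) m++
    print n, m                  # both passes see both lines
}'
```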
Quote:
Originally Posted by colucix
Some questions and some notes:
1) How big is the region section of the file (that one read in the BEGIN section of your awk script)? In other words, how many elements it stores in the positionChr array?
You don't understand this because you didn't understand 3).
i.e., as many lines as his 7 GB text database has, provided they have a $2 field.
Quote:
3) What is the purpose of the following expression?
Code:
while((getline < rgnsFile) > 0)
what I don't understand is that you evaluate the expression getline < rgnsFile (rgnsFile not being assigned in the posted script) and then check whether it is true using > 0. Is it a redundancy, or am I missing something?
You are missing a common AWK read-file-line-by-line idiom.
He may have this variable assigned externally, and he does. If it weren't, AWK would exit with an error or silently, depending on his interpreter. It was stated clearly that -f was in use.
Quote:
5) One thing to [s]take[/s]bear in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason the syntax "index in array" was created for scanning the elements of an array. So there is a chance that, if a huge number of elements were not assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
It isn't much of a problem, as non-existent fields should lead to null indexes, which is preferable to conditionals when you read a 7 GB file.
you process the file twice, once in the while loop in the BEGIN section and once in the normal processing of the command-line argument.
Maybe I continue to miss something, but I don't understand your assertion. To me the file is read only once, since the getline statement in the while loop causes it to process the first lines of the input file, which are then skipped in the main action. Beer?
You are missing a common AWK read-file-line-by-line idiom.
This is a bit of a strong assertion, if you don't mind. The OP did not specify the value of rgnsFile in the post. In any case, the getline statement returns -1, 0 or 1, which is compared with the value of rgnsFile; then the whole expression embedded in parentheses is evaluated for truth using > 0. If I am wrong, could you further explain what I'm missing here?
Quote:
It isn't much of a problem as non existent fields should lead to null indexes, which is preferable instead of conditionals whenever you read a 7Gb file.
That's not exactly what I was trying to explain. Let me re-formulate: whenever you reference an array element, as in the expression
Code:
if(positionChr[$1,$2]==1)
if the array element already exists (the index matches an existing one), no problem. If the array element does not exist, it is created by the mere reference to it (without an explicit assignment expression). The value assigned to that element is the null string, but this does not mean that the allocated memory is null. In this sense I said it can be a pitfall for awk scripts: if you test for the existence of a huge number of array elements that have not been explicitly assigned in advance, you may waste a large amount of memory and hit a memory outage.
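If that is the culprit, the standard remedy is to test membership with the in operator, which never creates the element. A sketch of the OP's condition rewritten that way (with made-up data, not tested against the OP's files):

```shell
printf 'chr1\t5\t.\t3\nchr2\t5\t.\t9\n' > pileup_demo.txt
awk '
BEGIN { positionChr["chr1", 5] = 1 }
{
    # ($1,$2) in positionChr checks for the key without creating it,
    # unlike positionChr[$1,$2] == 1, which allocates a new element
    # for every non-matching line of the 7 GB input
    if (($1, $2) in positionChr) print $1, $2 - 1, $2, $4
}' pileup_demo.txt
```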
since the getline statement in the while loop causes it to process the first lines of the input file,
No, one getline statement reads one line, BUT when a while loop comes into the picture and tests getline > 0, it is indeed iterating over a file, NOT one line. (In other words, a getline return value greater than 0 means a line was available.)
Quote:
No, one getline statement reads one line, BUT when a while loop comes into the picture and tests getline > 0, it is indeed iterating over a file, NOT one line. (In other words, a getline return value greater than 0 means a line was available.)
At first glance, I don't agree! But I'd like to do some tests to be sure. In any case, I think I know what I was missing now: most likely the OP didn't mention that the script reads the region information from another file, whose name is stored in the rgnsFile variable. If this is true, the expression
Code:
getline < rgnsFile
is to get a line from this "external file". Here is what I missed: the sign < acts as "input redirection", not as "less than". Indeed, getline less-than something does not make sense. So the script reads a "region definition" file, then starts to process the big input file in the main action. This sounds right to me, but until the OP clarifies, we can only guess.
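Right: the < here is input redirection, and the parenthesized getline expression returns 1 when a line was read, 0 at end of file, and -1 if the file cannot be opened, which is why the > 0 test is there. A quick check (file names are made up):

```shell
printf 'a\nb\n' > getline_demo.txt
awk 'BEGIN {
    f = "getline_demo.txt"
    r1 = (getline line < f)               # 1: read "a"
    r2 = (getline line < f)               # 1: read "b"
    r3 = (getline line < f)               # 0: end of file
    r4 = (getline line < "no_such_file")  # -1: cannot open
    close(f)
    print r1, r2, r3, r4
}'
```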
Lastly, as I stated in my first post, I am assuming rgnsFile is the big file, not another "region" file like you mentioned. The OP has to show more in order to clarify what's going on.
Quote:
Originally Posted by colucix
This is a bit of a strong assertion, if you don't mind.
No, I don't
Quote:
The OP did not specify the value of rgnsFile in the post.
OK
Quote:
In any case, the getline statement returns -1, 0 or 1, which is compared with the value of rgnsFile
So, so:
getline < rgnsFile tries to read the first line (not really the first: it depends on the internal read position) of whatever the rgnsFile variable contains, treated as the name of an existing file. If rgnsFile is not an assigned variable, or the file it points to does not exist, this evaluates to a value that is not greater than zero (strictly, -1 on an open error). Thus, this is the same as
Code:
while ((getline < "nofile") > 0)
which exits silently because there are no lines to read. This snippet is also used in standard AWK as a simple "file does not exist or is empty" check:
Code:
empty_file = !((getline < foo) > 0)
Quote:
If I am wrong, could you further explain what I'm missing here?
In such a case, "nothing" is created, evaluated or destroyed beyond this, and no output is shown for a 7 GB file, thus no memory problems. As memory problems are what he stated, I suppose it has nothing to do with this rgnsFile line. Unless he was drunk and assigned a 100 TB file
Quote:
If getline evaluates to false (0) it should exit; if there is a new line, it should evaluate to (1)
That's not exactly what I was trying to explain. Let me re-formulate: whenever you reference an array element, as in the expression
Code:
if(positionChr[$1,$2]==1)
if the array element already exists (the index matches an existing one), no problem. If the array element does not exist, it is created by the mere reference to it (without an explicit assignment expression).
Of course.
Quote:
The value assigned to that element is the null string, but this does not mean that the allocated memory is null.
Of course.
Quote:
In this sense I said it can be a pitfall for awk scripts: if you test for the existence of a huge number of array elements that have not been explicitly assigned in advance, you may waste a large amount of memory and hit a memory outage.
It is commonly pointed to as a pitfall, but it is not truly that. That's why garbage collectors are there, and this behaviour (direct assignment of NULL) is preferable to having three conditionals for $1, $2, $3 per line trying to avoid null assignments (even more memory referenced, though garbage-collected the same way). Do remember that arrays in awk aren't positional, i.e., p[2] != NULL and p[99999] != NULL may coexist separately from [1 ... 99998]. Thus, NULL pointers can be automatically collected anyway.
That is, if he has a decent AWK interpreter, which is part of my earlier questions.
EDIT: Anyway, the problem appears to be that he is keeping a lot of data in those arrays, and I fail to see why he needs to do so.