LinuxQuestions.org
Old 12-08-2009, 04:04 PM   #1
timonlq
LQ Newbie
 
Registered: Aug 2009
Posts: 7

Rep: Reputation: 0
awk running out of memory


I am running the following awk script using the -f option, but it runs out of memory. I know that the program reaches the main action statement and that this is where the memory gets consumed. It is as if something in the main action statement leads to increasing amounts of memory being consumed.

I have tried the following, but it still leads to massive memory consumption:
* running from the command line, i.e. without the -f option;
* a main action statement consisting only of print $0.

I am running it on a file of 7 GB.

This file size does not cause problems with other scripts. For example, the following works fine:

awk '($2==1){print}' myBigFile.txt

Any ideas on what could be causing the problem?

Thanks.

Tim.

Code:
BEGIN{
	print "something"
	OFS="\t";
	nb=0;
	while((getline < rgnsFile) > 0)
	{
		nb+=1;
		tileChr[nb]=$1;
		tileStarts[nb]=$2;
		tileEnds[nb]=$3;
		print "reading region"
	}
	for(i in tileStarts)
	{
		for(j=(tileStarts[i]-upStreamDist); j<=(tileEnds[i]+downStreamDist); j++)
		{
			positionChr[tileChr[i],j]=1;
		}
		print "building region"
	}
}
{
	if(positionChr[$1,$2]==1)
	{
		# The line of the bed file, 0-based
		# chr, start, end, cov
		# using the columns of the pileup input file
		print $1, ($2-1), $2, $4;
	}
}

Last edited by timonlq; 12-08-2009 at 04:42 PM.
 
 
Old 12-08-2009, 05:08 PM   #3
bartonski
Member
 
Registered: Jul 2006
Location: Louisville, KY
Distribution: Fedora 12, Slackware, Debian, Ubuntu Karmic, FreeBSD 7.1
Posts: 443
Blog Entries: 1

Rep: Reputation: 48
Well, you're essentially building a two-dimensional array with roughly i × j entries. If i and j get very large, you're going to run out of memory very quickly.
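To put rough numbers on that (the region count and span below are purely made up for illustration), the BEGIN loops create one positionChr entry per position of every padded region, so the entry count is regions × span:

```shell
# Back-of-the-envelope sketch with hypothetical numbers: 1,000 regions of
# 10,000 positions each (after up/downstream padding) means the BEGIN
# section creates 10 million associative-array entries before the main
# input is even opened.
awk 'BEGIN {
    nregions = 1000          # hypothetical region count
    span     = 10000         # hypothetical positions per padded region
    printf "%d entries\n", nregions * span
}'
```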
 
Old 12-08-2009, 05:28 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,365

Rep: Reputation: 2753
To expand on that: the plain awk ... print pipeline just reads and prints as it goes, so memory usage stays small. The 2D array, however, lives in RAM and then spills into swap if you run out of RAM.
 
Old 12-08-2009, 05:52 PM   #5
code933k
Member
 
Registered: Aug 2007
Location: Bogotá, Colombia. South America
Distribution: ArchLinux / Source Mage GNU Linux (test branch) / openSUSE
Posts: 130

Rep: Reputation: Disabled
You shouldn't do such big data munging in a BEGIN section to start with.
Define functions instead if that is what you need, but keep those calculations inside the body of your main AWK script.

Anyway, which AWK interpreter/compiler are you using, and which version? An earlier version of MAWK had a small memory leak, as far as I can recall.
 
Old 12-09-2009, 06:58 AM   #6
code933k
Member
 
Registered: Aug 2007
Location: Bogotá, Colombia. South America
Distribution: ArchLinux / Source Mage GNU Linux (test branch) / openSUSE
Posts: 130

Rep: Reputation: Disabled
BTW, aren't your tables (arrays) growing insanely big? i.e., isn't nb duplicating NR's internal value? (Or RS, perhaps?)
 
Old 12-09-2009, 07:42 AM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Some questions and some notes:
1) How big is the region section of the file (the one read in the BEGIN section of your awk script)? In other words, how many elements does it store in the positionChr array?
2) How much memory do you have on the machine you run it on?
3) What is the purpose of the following expression?
Code:
while((getline < rgnsFile) > 0)
What I don't understand is that you evaluate the expression getline < rgnsFile (rgnsFile not being assigned in the posted script) and then check whether it's true using > 0. Is it a redundancy, or am I missing something?
4) Do you see any lines of output from the main section before the memory outage?
5) One thing to take in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and the null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason the "index in array" syntax was created to scan all the elements of an array. So there is a chance that, if a huge number of elements were not assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
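A quick sketch of the pitfall described in point 5 (the array names are arbitrary): merely referencing an element in a comparison creates it, while the "in" operator tests membership without creating anything.

```shell
# Referencing a["x"] in a comparison silently creates the element;
# the "in" operator checks membership without creating it.
awk 'BEGIN {
    if (a["x"] == 1) print "never reached"
    n = 0; for (k in a) n++
    print "after reference:", n        # the reference created a["x"]
    if ("y" in b) print "never reached"
    m = 0; for (k in b) m++
    print "after in-test:", m          # the in-test created nothing
}'
```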
 
Old 12-09-2009, 07:55 AM   #8
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Assuming you are indeed processing only one file of 7 GB: in your awk code you are processing the file twice, once in the while loop in the BEGIN section and once in the normal processing of the command-line argument. This is one of the reasons it takes more time. Why don't you do everything once, not in the BEGIN loop?
Small note: remember to close your file when you use a while loop inside awk, e.g. close(rgnsFile)
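A minimal sketch of that idiom (the file name "regions.txt" is a made-up example):

```shell
# Read a small region file with getline in BEGIN, then close() it so the
# file descriptor is released once the loop is done.
printf 'chr1 100 200\nchr2 300 400\n' > regions.txt
awk 'BEGIN {
    f = "regions.txt"
    while ((getline line < f) > 0) nb++
    close(f)                     # the close() recommended above
    print nb, "regions read"
}'
rm -f regions.txt
```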
 
Old 12-09-2009, 08:10 AM   #9
code933k
Member
 
Registered: Aug 2007
Location: Bogotá, Colombia. South America
Distribution: ArchLinux / Source Mage GNU Linux (test branch) / openSUSE
Posts: 130

Rep: Reputation: Disabled
Post

Quote:
Originally Posted by colucix View Post
Some questions and some notes:
1) How big is the region section of the file (that one read in the BEGIN section of your awk script)? In other words, how many elements it stores in the positionChr array?
You don't understand this because you didn't understand 3).
i.e., as many lines as his 7 GB text database has, provided they have a $2 field.

Quote:
3) What is the purpose of the following expression?
Code:
while((getline < rgnsFile) > 0)
what I don't understand is that you evaluate the expression getline < rgnsFile (rgnsFile not being assigned in the posted script) then check if it's true using > 0. Is it a redundancy or am I missing something?
You are missing a common AWK idiom: reading a file line by line.
He may have this variable assigned externally, and he does. If it weren't, AWK would exit with an error or silently, depending on his interpreter. It was stated clearly that -f was in use.

Quote:
5) One thing to [s]take[/s]bear in mind is that in awk whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts and it is the reason why the syntax "index in array" has been created in order to scan all the elements of an array. So there is a chance that if a huge amount of elements have not been assigned in the BEGIN section, the size of the array increase enormously and unexpectedly in the main action.
It isn't much of a problem, as non-existent fields should lead to null indexes, which is preferable to conditionals whenever you read a 7 GB file.

Last edited by code933k; 12-09-2009 at 08:17 AM.
 
Old 12-09-2009, 08:11 AM   #10
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Hi ghostdog!
Quote:
Originally Posted by ghostdog74 View Post
you are processing your file twice, one in the while loop in the BEGIN section, the other is normal processing from command line argument.
Maybe I continue to miss something, but I don't understand your assertion. To me the file is read only once, since the getline statement in the while loop causes it to process the first lines of the input file, which are then skipped in the main action. Beer?
 
Old 12-09-2009, 08:29 AM   #11
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by code933k View Post
You are missing a common AWK! file read line by line.
This is a bit of a strong assertion, if you don't mind. The OP did not specify the value of rgnsFile in the post. In any case, the getline statement returns -1, 0 or 1, which is compared with the value of rgnsFile; then the whole expression in parentheses is evaluated for truth using > 0. If I am wrong, could you further explain what I'm missing here?
Quote:
It isn't much of a problem as non existent fields should lead to null indexes, which is preferable instead of conditionals whenever you read a 7Gb file.
That's not exactly what I was trying to explain. Let me re-formulate: whenever you reference an array element, as in the expression
Code:
if(positionChr[$1,$2]==1)
if the array element already exists (the index matches an existing one), no problem. If the array element does not exist, it is created by the mere act of referring to it (without an explicit assignment). The value assigned to that element is the null string, but this does not mean that the allocated memory is null. In this sense I stated that it can be a pitfall in awk scripts: if you test for the existence of a huge number of array elements that have not been explicitly assigned in advance, you may waste a large amount of memory and reach a memory outage.
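A sketch of how the main action could avoid this (the input lines below are invented): testing with the in operator, parenthesised for a multi-subscript key, never creates elements for positions outside the regions.

```shell
# Membership test with "in" instead of positionChr[$1,$2]==1, which
# would silently create an element for every non-matching line of a
# 7 GB file.
awk 'BEGIN { OFS = "\t"; positionChr["chr1", 150] = 1 }
     { if (($1, $2) in positionChr) print $1, ($2 - 1), $2, $4 }' <<'EOF'
chr1 150 A 7
chr1 151 A 9
EOF
```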

Last edited by colucix; 12-09-2009 at 08:30 AM.
 
Old 12-09-2009, 08:32 AM   #12
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by colucix View Post
since the getline statement in the while loop cause it to process the first lines of the input file,
No, one getline statement reads one line, BUT if a while loop comes into the picture and it tests for getline > 0, then it is indeed iterating over a file, NOT one line. (In other words, a getline return value greater than 0 means a line is available.)
 
Old 12-09-2009, 08:47 AM   #13
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by ghostdog74 View Post
No, one getline statement reads a line, BUT if a while loop comes into the picture and it tests for getline > 0, then its indeed iterating a file, NOT one line. ( In other words , a getline return value of more than 0 means a line is available. )
At first glance, I don't agree! But I'd like to do some tests to be sure. In any case, I think I know what I was missing now: most likely the OP didn't mention that the script reads the region information from another file whose name is stored in the rgnsFile variable. If this is true, the expression
Code:
getline < rgnsFile
is there to get a line from this "external file". Here is what I missed: the sign < as "input redirection", not as "less than". Indeed, getline less than something does not make sense. So the script reads a "region definition" file, then starts to process the big input file in the main action. This sounds right to me, but until the OP clarifies, we can only guess.
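A small demonstration of that reading (the file name is invented): the < after getline redirects input from the named file, one line per call.

```shell
# "getline ln < file" is input redirection, not a less-than comparison:
# each call reads the next line of the named file into ln.
printf 'first\nsecond\n' > rgns_demo.txt
awk 'BEGIN {
    while ((getline ln < "rgns_demo.txt") > 0) print "got:", ln
}'
rm -f rgns_demo.txt
```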
 
Old 12-09-2009, 08:58 AM   #14
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by colucix View Post
At a first glance, I don't agree! But I'd like to do some tests to be sure.
this:
Code:
awk 'BEGIN{ while ((getline < "file") > 0 ){ print } }'
and this:
Code:
awk '{print}' file
are the same (except for some speed difference).

Lastly, as I stated in my first post, I am assuming rgnsFile is the big file, not another "region" file like you mentioned. The OP has to show more in order to clarify what's going on.
 
Old 12-09-2009, 09:14 AM   #15
code933k
Member
 
Registered: Aug 2007
Location: Bogotá, Colombia. South America
Distribution: ArchLinux / Source Mage GNU Linux (test branch) / openSUSE
Posts: 130

Rep: Reputation: Disabled
Post

Quote:
Originally Posted by colucix View Post
This is a bit strong assertion, if you don't mind.
No, I don't
Quote:
The OP did not specify the value of rgnsFile in the post.
OK
Quote:
In any case the getline statement returns -1 0 or 1 which is compared with the value of rgnsFile
So:

getline < rgnsFile tries to read a line (not necessarily the first; it depends on the internal iterator) from whatever file the rgnsFile variable names. If rgnsFile is not an assigned variable, or the file it points to does not exist, then the expression evaluates to zero (0). Thus it is the same as
Code:
 while (getline < "nofile") > 0
which exits silently because there are no lines to read. This snippet is also used in standard AWK as a simple "file does not exist or is empty" check:
Code:
file_has_lines=( (getline < foo) > 0 )
Quote:
If I am wrong, could you further explain what I'm missing here?
In that case, "nothing" is created, evaluated or destroyed beyond this; no output is shown for a 7 GB file, and thus no memory problems. As memory problems are what he reported, I suppose it has nothing to do with this rgnsFile line. Unless he was drunk and assigned a 100 TB file.

If getline evaluates to false (0) it should exit; if there is a new line then it should evaluate to (1).

Quote:
That's not exactly what I was trying to explain. I try to re-formulate: whenever you reference an array element, as in the expression
Code:
if(positionChr[$1,$2]==1)
if the array element already exists (the index matches an existing one) no problem. If the array element does not exist, it is created even by the simple referring to that array element (without an explicit assignment expression).
Of course.

Quote:
The value assigned to that element is the null string, but this does not mean that the allocated memory is null.
Of course

Quote:
In this sense I stated that it can be a pitfall for awk script: if you test for the existence of a huge amount of array elements, that have not been explicitly assigned in advance, you may waste a large amount of memory and reach a memory outage.
It is commonly pointed to as a pitfall, but it is not truly that. That's why garbage collectors are there, and this behaviour (direct assignment of NULL) is preferable to having three conditionals for $1, $2, $3 per line trying to avoid null assignments (even more memory referenced, though garbage-collected the same way). Do remember that arrays in awk aren't positional, i.e., p[2] != NULL and p[99999] != NULL may coexist separately from [1 ... 99998]. Thus, NULL pointers can be collected automatically anyway.

That is, if he has a decent AWK interpreter, which is part of my earlier questions.

EDIT: Anyway, the problem appears to be that he is keeping a lot of data in those arrays, and I fail to see why he needs to do so.
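The sparseness claim is easy to check (the array name is arbitrary):

```shell
# awk arrays are associative: assigning p[2] and p[99999] stores exactly
# two elements; the indices in between are never allocated.
awk 'BEGIN {
    p[2] = "a"; p[99999] = "b"
    n = 0; for (k in p) n++
    print n, "elements stored"
}'
```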

Last edited by code933k; 12-09-2009 at 09:32 AM.
 
  

