Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I am running the following awk script using the -f option, but it runs out of memory. I know the program reaches the main action statement, and it is there that the memory gets consumed. It is as if something in the main action statement leads to increasing amounts of memory being consumed.
I have tried the following, but it still leads to massive memory consumption:
* running from the command line, i.e. without the -f option;
* a main action statement consisting only of print $0.
I am running it on a file of 7 GB.
This file size does not cause problems with other scripts. For example, the following works fine:
awk '($2==1){print}' myBigFile.txt
Any ideas on what could be causing the problem?
Thanks.
Tim.
Code:
BEGIN {
    print "something"
    OFS = "\t"
    nb = 0
    while ((getline < rgnsFile) > 0) {
        nb += 1
        tileChr[nb] = $1
        tileStarts[nb] = $2
        tileEnds[nb] = $3
        print "reading region"
    }
    for (i in tileStarts) {
        for (j = tileStarts[i] - upStreamDist; j <= tileEnds[i] + downStreamDist; j++) {
            positionChr[tileChr[i], j] = 1
        }
        print "building region"
    }
}
{
    if (positionChr[$1, $2] == 1) {
        # Emit a 0-based bed line: chr, start, end, cov
        # using the columns of the pileup input file
        print $1, ($2 - 1), $2, $4
    }
}
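For context, a script like this is presumably invoked with the region file name supplied through the rgnsFile variable on the command line (the OP doesn't show the invocation, so the file names and region values below are made up). A tiny self-contained reproduction of the pattern:

```shell
# Hypothetical sketch: regions.txt plays the role of rgnsFile,
# pileup.txt the role of the 7 GB input file.
printf 'chr1\t10\t12\n' > regions.txt
printf 'chr1\t11\t.\t5\nchr2\t11\t.\t9\n' > pileup.txt

awk -v rgnsFile=regions.txt -v upStreamDist=0 -v downStreamDist=0 '
BEGIN {
    OFS = "\t"
    # read the region file, marking every covered position
    while ((getline < rgnsFile) > 0)
        for (j = $2 - upStreamDist; j <= $3 + downStreamDist; j++)
            positionChr[$1, j] = 1
    close(rgnsFile)
}
positionChr[$1, $2] == 1 { print $1, $2 - 1, $2, $4 }
' pileup.txt
```

Only the first pileup line falls inside a region, so the sketch prints a single bed line.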
Well, you're essentially building a two-dimensional array of i x j elements. If i and j get very large, you're going to run out of memory very quickly.
To expand on that: the awk ... print one-liner just reads and prints as it goes, so its memory usage is small. The 2D array, however, lives in RAM, then spills into swap if RAM runs out.
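To put rough numbers on that (the region counts here are invented, not the OP's): positionChr gets one key per covered position, i.e. roughly regions x (width + upStreamDist + downStreamDist) entries. For example:

```shell
# Back-of-the-envelope: 100,000 regions of 1,000 bp, padded by 500 bp
# on each side, yields 200 million keys; at tens of bytes per key/value
# pair that is gigabytes before the main input is even touched.
awk 'BEGIN {
    regions = 100000; width = 1000; pad = 500
    print regions * (width + 2 * pad)   # number of array keys
}'
```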
You shouldn't do such big data munging in a BEGIN section to start with.
Define functions instead if that is what you need, but keep those calculations inside the body of your main AWK script.
Anyway, what version of the AWK interpreter are you using? An earlier version of MAWK had a small memory leak, as far as I can recall.
Some questions and some notes:
1) How big is the region section of the file (the one read in the BEGIN section of your awk script)? In other words, how many elements does it store in the positionChr array?
2) How much memory do you have on the running machine?
3) What is the purpose of the following expression?
Code:
while((getline < rgnsFile) > 0)
what I don't understand is that you evaluate the expression getline < rgnsFile (rgnsFile not being assigned in the posted script) and then check whether it is true using > 0. Is it a redundancy, or am I missing something?
4) Do you see some lines of output from the main section before the memory outage?
5) One thing to take in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason the syntax "index in array" was created for scanning the elements of an array. So there is a chance that, if a huge number of elements were not assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
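Point 5 is easy to demonstrate; the sketch below counts elements with a for (k in a) loop rather than gawk's length(array), so it should work in any POSIX awk:

```shell
awk 'BEGIN {
    t = (a["x"] == 1)       # the mere reference creates a["x"]
    u = ("y" in a)          # the "in" test does NOT create a["y"]
    n = 0
    for (k in a) n++
    print n                 # 1: only "x" exists
}'
```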
Assuming you are indeed processing only one file of 7 GB: in your awk code you process the file twice, once in the while loop in the BEGIN section and once in the normal processing of the command-line argument. That is one of the reasons it takes more time. Why don't you do everything once, not in the BEGIN loop?
Small note: remember to close your file when you use a while loop inside awk, e.g. close(rgnsFile).
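The close() point matters because getline keeps a per-file read position: without close(), a second read of the same file name just sees end-of-file (and each open file also holds a descriptor). A small illustration with a made-up file name:

```shell
printf 'one\ntwo\n' > rgns_demo.txt
awk 'BEGIN {
    while ((getline line < "rgns_demo.txt") > 0) n++
    close("rgns_demo.txt")      # rewind: allows the file to be read again
    while ((getline line < "rgns_demo.txt") > 0) m++
    print n, m                  # both passes see both lines
}'
```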
Quote:
Originally Posted by colucix
Some questions and some notes:
1) How big is the region section of the file (that one read in the BEGIN section of your awk script)? In other words, how many elements it stores in the positionChr array?
You don't understand this because you didn't understand 3).
i.e., as many lines as his 7 GB text database has, provided they have a $2 field.
Quote:
3) What is the purpose of the following expression?
Code:
while((getline < rgnsFile) > 0)
what I don't understand is that you evaluate the expression getline < rgnsFile (rgnsFile not being assigned in the posted script) and then check whether it is true using > 0. Is it a redundancy, or am I missing something?
You are missing a common AWK read-file-line-by-line idiom.
He may have this variable assigned externally, and he does. If it weren't, AWK would exit with an error or silently, depending on his interpreter. It was stated clearly that -f was in use.
Quote:
5) One thing to [s]take[/s]bear in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason the syntax "index in array" was created for scanning the elements of an array. So there is a chance that, if a huge number of elements were not assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
It isn't much of a problem, as non-existent fields should lead to null indexes, which is preferable to conditionals when you read a 7 GB file.
you process the file twice, once in the while loop in the BEGIN section and once in the normal processing of the command-line argument.
Maybe I continue to miss something, but I don't understand your assertion. To me the file is read only once, since the getline statement in the while loop causes it to process the first lines of the input file, which are then skipped in the main action. Beer?
You are missing a common AWK read-file-line-by-line idiom.
This is a bit of a strong assertion, if you don't mind. The OP did not specify the value of rgnsFile in the post. In any case, the getline statement returns -1, 0 or 1, which is compared with the value of rgnsFile; then the whole expression embedded in parentheses is evaluated for truth using > 0. If I am wrong, could you further explain what I'm missing here?
Quote:
It isn't much of a problem as non existent fields should lead to null indexes, which is preferable instead of conditionals whenever you read a 7Gb file.
That's not exactly what I was trying to explain. Let me re-formulate: whenever you reference an array element, as in the expression
Code:
if(positionChr[$1,$2]==1)
if the array element already exists (the index matches an existing one), no problem. If the array element does not exist, it is created by the mere reference to it (without an explicit assignment expression). The value assigned to that element is the null string, but this does not mean that the allocated memory is null. In this sense I said it can be a pitfall for awk scripts: if you test for the existence of a huge number of array elements that have not been explicitly assigned in advance, you may waste a large amount of memory and hit a memory outage.
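If that is the culprit, the standard remedy is to test membership with the in operator, which never creates the element. A sketch of the OP's condition rewritten that way (with made-up data, not tested against the OP's files):

```shell
printf 'chr1\t5\t.\t3\nchr2\t5\t.\t9\n' > pileup_demo.txt
awk '
BEGIN { positionChr["chr1", 5] = 1 }
{
    # ($1,$2) in positionChr checks for the key without creating it,
    # unlike positionChr[$1,$2] == 1, which allocates a new element
    # for every non-matching line of the 7 GB input
    if (($1, $2) in positionChr) print $1, $2 - 1, $2, $4
}' pileup_demo.txt
```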
since the getline statement in the while loop causes it to process the first lines of the input file,
No, one getline statement reads one line, BUT when a while loop comes into the picture and tests getline > 0, it is indeed iterating over a file, NOT one line. (In other words, a getline return value greater than 0 means a line was available.)
Quote:
No, one getline statement reads one line, BUT when a while loop comes into the picture and tests getline > 0, it is indeed iterating over a file, NOT one line. (In other words, a getline return value greater than 0 means a line was available.)
At first glance, I don't agree! But I'd like to do some tests to be sure. In any case, I think I know what I was missing now: most likely the OP didn't mention that the script reads the region information from another file, whose name is stored in the rgnsFile variable. If this is true, the expression
Code:
getline < rgnsFile
is to get a line from this "external file". Here is what I missed: the sign < acts as "input redirection", not as "less than". Indeed, getline less-than something does not make sense. So the script reads a "region definition" file, then starts to process the big input file in the main action. This sounds right to me, but until the OP clarifies, we can only guess.
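Right: the < here is input redirection, and the parenthesized getline expression returns 1 when a line was read, 0 at end of file, and -1 if the file cannot be opened, which is why the > 0 test is there. A quick check (file names are made up):

```shell
printf 'a\nb\n' > getline_demo.txt
awk 'BEGIN {
    f = "getline_demo.txt"
    r1 = (getline line < f)               # 1: read "a"
    r2 = (getline line < f)               # 1: read "b"
    r3 = (getline line < f)               # 0: end of file
    r4 = (getline line < "no_such_file")  # -1: cannot open
    close(f)
    print r1, r2, r3, r4
}'
```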
Lastly, as I stated in my first post, I am assuming rgnsFile is the big file, not another "region" file like you mentioned. The OP has to show more in order to clarify what's going on.
Quote:
Originally Posted by colucix
This is a bit of a strong assertion, if you don't mind.
No, I don't
Quote:
The OP did not specify the value of rgnsFile in the post.
OK
Quote:
In any case, the getline statement returns -1, 0 or 1, which is compared with the value of rgnsFile
So, so:
getline < rgnsFile tries to read the first line (not really the first: it depends on the internal read position) of whatever the rgnsFile variable contains, treated as the name of an existing file. If rgnsFile is not an assigned variable, or the file it points to does not exist, this evaluates to a value that is not greater than zero (strictly, -1 on an open error). Thus, this is the same as
Code:
while ((getline < "nofile") > 0)
which exits silently because there are no lines to read. This snippet is also used in standard AWK as a simple "file does not exist or is empty" check:
Code:
empty_file = !((getline < foo) > 0)
Quote:
If I am wrong, could you further explain what I'm missing here?
In such a case, "nothing" is created, evaluated or destroyed beyond this, and no output is shown for a 7 GB file, thus no memory problems. As memory problems are what he stated, I suppose it has nothing to do with this rgnsFile line. Unless he was drunk and assigned a 100 TB file
Quote:
If getline evaluates to false (0) it should exit; if there is a new line, it should evaluate to (1)
That's not exactly what I was trying to explain. Let me re-formulate: whenever you reference an array element, as in the expression
Code:
if(positionChr[$1,$2]==1)
if the array element already exists (the index matches an existing one), no problem. If the array element does not exist, it is created by the mere reference to it (without an explicit assignment expression).
Of course.
Quote:
The value assigned to that element is the null string, but this does not mean that the allocated memory is null.
Of course.
Quote:
In this sense I said it can be a pitfall for awk scripts: if you test for the existence of a huge number of array elements that have not been explicitly assigned in advance, you may waste a large amount of memory and hit a memory outage.
It is commonly pointed to as a pitfall, but it is not truly that. That's why garbage collectors are there, and this behaviour (direct assignment of NULL) is preferable to having three conditionals for $1, $2, $3 per line trying to avoid null assignments (even more memory referenced, though garbage-collected the same way). Do remember that arrays in awk aren't positional, i.e., p[2] != NULL and p[99999] != NULL may coexist separately from [1 ... 99998]. Thus, NULL pointers can be automatically collected anyway.
That is, if he has a decent AWK interpreter, which is part of my earlier questions.
EDIT: Anyway, the problem appears to be that he is keeping a lot of data in those arrays, and I fail to see why he needs to do so.