Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Yes, they are the same. And after having clarified my doubt, I can see what you stated in a previous post (the file is read twice). Let's assume rgnsFile is the same big input file: the statement
Code:
getline < rgnsFile
takes input from rgnsFile itself, but this does not affect the NR variable, so in the main action the input file (I mean the file whose name is passed as an argument on the command line) is processed from the beginning. In other words, it is as if the same file is opened twice (independently of each other) by the same awk script, right?
Quote:
OP has to show more in order to clarify what's going on.
Totally agree, this is why I asked for more information in the first instance.
Quote:
takes input from rgnsFile itself, but this does not affect the NR variable, so in the main action the input file (I mean the file whose name is passed as an argument on the command line) is processed from the beginning. In other words, it is as if the same file is opened twice (independently of each other) by the same awk script, right?
Anything in the BEGIN{} block is processed before the file (the command line argument) is processed. Therefore, yes, the NR variable is not affected.
Another way, without using the while-getline loop:
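A common alternative (a sketch with made-up file names and data, not taken from this thread) is the two-file FNR == NR idiom: while awk is reading the first file named on the command line, FNR and NR are equal, so that pass can load the lookup array; the second pass then only tests membership.

```shell
# Sketch: load the small file in the first pass (FNR == NR holds only
# while awk reads the first file), then test membership on the second.
cat > rgns.txt <<'EOF'
chr1 10
chr1 20
EOF
cat > input.txt <<'EOF'
chr1 10 x 1.5
chr1 15 x 2.0
chr1 20 x 3.0
EOF
awk 'FNR == NR { pos[$1,$2] = 1; next }   # first file: record keys only
     ($1 SUBSEP $2) in pos                # second file: print matching lines
' rgns.txt input.txt
```

With the sample data above this prints the first and third lines of input.txt. Note that this reads the small file through awk's normal input mechanism, so no getline redirection is involved at all.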
@code933k: indeed I can be a bit slow, but I finally succeeded in understanding what I was missing. Thank you for your detailed explanation, anyway!
I have just to reflect a bit more on the following...
Quote:
Originally Posted by code933k
It is commonly pointed to as a pitfall, but it is not truly that. That's why garbage collectors are there, and this behaviour (direct assignment of NULL) is preferable to having three conditionals for $1, $2, $3 per line trying to avoid null assignments (even more memory referenced, though garbage collected the same way). Do remember that arrays in awk aren't positional, i.e., p[2] != NULL and p[99998] != NULL may exist separately of [1 ... 99999]. Thus, NULL pointers can be automatically collected anyway.
Anyway, I cannot get out of my mind the idea that an alternative if statement like
Code:
if ( ($1 SUBSEP $2) in positionChr )
should solve the memory problem. A totally different logic - as suggested by you and ghostdog - should work as well.
Quote:
Anything in the BEGIN{} block is processed before the file (the command line argument) is processed. Therefore, yes, the NR variable is not affected.
That's not the point. To me, it is the getline taking input from "another" file (due to redirection) that does not affect the NR count of the input file passed as an argument. Indeed, the BEGIN section itself does affect the NR count. Consider the following:
Code:
$ cat testfile
line 1
line 2
line 3
$ cat test.awk
BEGIN {
getline
print NR
getline
print NR
}
{ print }
$ awk -f test.awk testfile
1
2
line 3
If I run the following code, instead:
Code:
$ cat test.awk
BEGIN {
getline < "testfile"
print NR
getline < "testfile"
print NR
}
{ print }
$ awk -f test.awk testfile
0
0
line 1
line 2
line 3
That is, the redirection from "testfile" in the getline statement causes awk to consider (open) it independently, as if it were another file. And this is what causes the same file to be processed twice by the same code.
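As a side note (a small sketch, not from the posts above): the redirected file keeps its own read position across getline calls, and close() resets that stream so the next getline starts from the beginning again.

```shell
# Sketch: a redirected getline keeps its own file position;
# close() resets it so the next read starts from line 1 again.
printf 'line 1\nline 2\nline 3\n' > testfile
awk 'BEGIN {
    getline a < "testfile"     # reads "line 1"
    getline b < "testfile"     # same stream continues: reads "line 2"
    close("testfile")          # closing resets the stream
    getline c < "testfile"     # reads "line 1" again
    print a; print b; print c
}'
```

This prints "line 1", "line 2", "line 1", confirming that the redirected stream is entirely separate from the main input.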
Quote:
Originally Posted by colucix
that is the redirection from "testfile" in the getline statement causes awk to consider (open) it independently, as it was another file. And this is what causes the same file to be processed twice by the same code.
That's it!
Though I am sure you didn't mean "it is processed twice by the same code", but rather that it is opened and processed once in the BEGIN section and once in the awk script body (taking the file from the command line), as previously stated by ghostdog74.
Quote:
That's it!
Though I am sure you didn't mean "it is processed twice by the same code", but rather that it is opened and processed once in the BEGIN section and once in the awk script body (taking the file from the command line), as previously stated by ghostdog74.
That's exactly what I meant. Now, let's wait for the OP to answer our queries.
Well, you're essentially building a two-dimensional array that's i x j characters. If i and j get very large, you're going to run out of memory very quickly.
Before that, the while { ... } loop will read the whole file (OK, OK, at least the first 3 fields from each line) into memory - enough to eat all the memory on a big file.
I don't think any of the above suggestions are the answer, as the program does not run out of memory in the BEGIN action, but rather in the main action.
In the main action all I am doing is reading lines from the awk input file and checking a value against the smallish array that I filled with values in the BEGIN action.
Any other suggestions?
Tim.
Strange that the problem happened in main and not in BEGIN.
Maybe for some reason awk is able to handle "out of memory" in BEGIN, but fails when it needs even a tiny memory allocation in the main action?
Anyway, could you explain what your program should do?
(And I can't see where upStreamDist and downStreamDist are initialized - on the command line?)
I see that my post has caused a lively debate between people who seem to know a million times more about awk than I do. That is good news for me.
Now for the clarification that I should have provided sooner as it would have reduced the uncertainty and debate:
I have 8 GB RAM.
It is clearly in the main action section that the memory gets consumed.
The rgnsFile used in the BEGIN section defines a smaller number of regions (less than 100 lines). It is of the form:
chromosome startIndex endIndex
Each endIndex-startIndex span is about 10,000, which means there are maybe 100*10,000 = 1,000,000 entries in the array.
The actual inputFile (7 GB, billions of lines) processed in the main section is of the form:
chromosome indexSinglePosition somethingNotUsed value
I wish to print out information for all lines in inputFile which fall in the regions defined in rgnsFile.
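Given that rgnsFile has fewer than 100 lines, one way to avoid the huge per-position array entirely (a hypothetical sketch based on the column layouts described above, with made-up data) is to store only the region bounds in BEGIN and range-test each input line against them:

```shell
# Hypothetical sketch: keep only ~100 (chromosome, start, end) triples
# in memory and range-test each input line against them.
cat > rgnsFile <<'EOF'
chr1 100 200
chr2 50 80
EOF
cat > inputFile <<'EOF'
chr1 150 x 1.1
chr1 999 x 2.2
chr2 60 x 3.3
EOF
awk -v rgns=rgnsFile '
BEGIN {
    while ((getline line < rgns) > 0) {     # load region bounds only
        split(line, f)
        n++; chr[n] = f[1]; lo[n] = f[2] + 0; hi[n] = f[3] + 0
    }
    close(rgns)
}
{
    for (i = 1; i <= n; i++)                # linear scan of ~100 regions
        if ($1 == chr[i] && $2 >= lo[i] && $2 <= hi[i]) { print; break }
}' inputFile
```

Memory use is then proportional to the number of regions, not the number of covered positions; the trade-off is ~100 comparisons per input line, which could be reduced further by sorting the bounds.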
If it is true that:
Quote:
5) One thing to keep in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason why the syntax "index in array" was created to scan all the elements of an array. So there is a chance that, if a huge number of elements have not been assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
Then this is most probably the cause of the problem, as there will be billions of positions in my inputFile which are not in the array.
Can you confirm this and perhaps suggest how to overcome this?
Thanks.
Tim.
Hi Tim,
well... this is one of the points we discussed. Actually, it is clearly stated in the official GAWK manual. As a possible (but not certain) solution, you can try the expression suggested in post #15:
Code:
if ( ($1 SUBSEP $2) in positionChr )
in place of
Code:
if(positionChr[$1,$2]==1)
from your original code. The expression ($1 SUBSEP $2) is a simple concatenation of $1, SUBSEP and $2, where SUBSEP is a gawk internal variable holding the separator used between the indices of a multidimensional array.
I suggest taking a look at the official manual for further details.
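To see the difference in action (a minimal sketch): merely referencing an element, even just to compare its value, creates it, while the in test does not.

```shell
# Sketch: a[x] on the right-hand side of a comparison CREATES the
# element; (x in a) only tests membership and creates nothing.
awk 'BEGIN {
    a["k"] = 1
    if (a["missing"] == 1) { }    # this reference creates a["missing"]
    n1 = 0; for (i in a) n1++     # count elements: now 2
    delete a                      # note: whole-array delete is a gawk extension
    a["k"] = 1
    if ("missing" in a) { }       # this test creates nothing
    n2 = 0; for (i in a) n2++     # count elements: still 1
    print n1, n2
}'
```

This prints "2 1": the comparison form silently grew the array, which is exactly how billions of unmatched lookups in the main action can exhaust memory.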