Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Yes, they are the same. And after having clarified my doubt, I can see what you stated in a previous post (the file is read twice). Let's assume rgnsFile is the same big input file: the statement
Code:
getline < rgnsFile
takes input from rgnsFile itself, but this does not affect the NR variable, so in the main action the input file (I mean the file whose name is passed as an argument on the command line) is processed from the beginning. In other words, it is as if the same file is opened twice (independently of each other) by the same awk script, right?
Quote:
OP has to show more in order to clarify what's going on.
Totally agree, this is why I asked for more information in the first instance.
Quote:
takes input from rgnsFile itself, but this does not affect the NR variable, so in the main action the input file (I mean the file whose name is passed as an argument on the command line) is processed from the beginning. In other words, it is as if the same file is opened twice (independently of each other) by the same awk script, right?
Anything in the BEGIN{} block is processed before the file (the command line argument) is processed. Therefore, yes, the NR variable is not affected.
Another way, without using the while-getline loop:
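A common alternative (a sketch with made-up file names and data, not taken from this thread) is the two-file FNR == NR idiom: while awk is reading the first file named on the command line, FNR and NR are equal, so that pass can load the lookup array; the second pass then only tests membership.

```shell
# Sketch: load the small file in the first pass (FNR == NR holds only
# while awk reads the first file), then test membership on the second.
cat > rgns.txt <<'EOF'
chr1 10
chr1 20
EOF
cat > input.txt <<'EOF'
chr1 10 x 1.5
chr1 15 x 2.0
chr1 20 x 3.0
EOF
awk 'FNR == NR { pos[$1,$2] = 1; next }   # first file: record keys only
     ($1 SUBSEP $2) in pos                # second file: print matching lines
' rgns.txt input.txt
```

With the sample data above this prints the first and third lines of input.txt. Note that this reads the small file through awk's normal input mechanism, so no getline redirection is involved at all.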
@code933k: indeed I can be a bit slow, but I finally succeeded in understanding what I was missing. Thank you for your detailed explanation, anyway!
I have just to reflect a bit more on the following...
Quote:
Originally Posted by code933k
It is commonly pointed to as a pitfall, but it is not truly that. That's why garbage collectors are there, and this behaviour (direct assignment of NULL) is preferable to having three conditionals for $1, $2, $3 per line trying to avoid null assignments (even more memory referenced, though garbage collected the same way). Do remember that arrays in awk aren't positional, i.e., p[2] != NULL and p[99998] != NULL may exist separately of [1 ... 99999]. Thus, NULL pointers can be automatically collected anyway.
Anyway, I cannot get out of my mind the idea that an alternative if statement like
Code:
if ( ($1 SUBSEP $2) in positionChr )
should solve the memory problem. A totally different logic - as suggested by you and ghostdog - should work as well.
Quote:
Anything in the BEGIN{} block is processed before the file (the command line argument) is processed. Therefore, yes, the NR variable is not affected.
That's not the point. To me, it is the getline taking input from "another" file (due to redirection) that does not affect the NR count of the input file passed as an argument. Indeed, the BEGIN section itself does affect the NR count. Consider the following:
Code:
$ cat testfile
line 1
line 2
line 3
$ cat test.awk
BEGIN {
getline
print NR
getline
print NR
}
{ print }
$ awk -f test.awk testfile
1
2
line 3
If I run the following code, instead:
Code:
$ cat test.awk
BEGIN {
getline < "testfile"
print NR
getline < "testfile"
print NR
}
{ print }
$ awk -f test.awk testfile
0
0
line 1
line 2
line 3
That is, the redirection from "testfile" in the getline statement causes awk to consider (open) it independently, as if it were another file. And this is what causes the same file to be processed twice by the same code.
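As a side note (a small sketch, not from the posts above): the redirected file keeps its own read position across getline calls, and close() resets that stream so the next getline starts from the beginning again.

```shell
# Sketch: a redirected getline keeps its own file position;
# close() resets it so the next read starts from line 1 again.
printf 'line 1\nline 2\nline 3\n' > testfile
awk 'BEGIN {
    getline a < "testfile"     # reads "line 1"
    getline b < "testfile"     # same stream continues: reads "line 2"
    close("testfile")          # closing resets the stream
    getline c < "testfile"     # reads "line 1" again
    print a; print b; print c
}'
```

This prints "line 1", "line 2", "line 1", confirming that the redirected stream is entirely separate from the main input.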
Quote:
Originally Posted by colucix
that is the redirection from "testfile" in the getline statement causes awk to consider (open) it independently, as it was another file. And this is what causes the same file to be processed twice by the same code.
That's it!
Though I am sure you didn't mean "it is processed twice by the same code", but rather that it is opened and processed once in the BEGIN section and once in the awk script body (taking the file from the command line), as previously stated by ghostdog74.
Quote:
That's it!
Though I am sure you didn't mean "it is processed twice by the same code", but rather that it is opened and processed once in the BEGIN section and once in the awk script body (taking the file from the command line), as previously stated by ghostdog74.
That's exactly what I meant. Now, let's wait for the OP to answer our queries.
Well, you're essentially building a two-dimensional array that's i x j characters. If i and j get very large, you're going to run out of memory very quickly.
Before that, the while { ... } loop will read the whole file (OK, OK, at least the first 3 fields from each line) into memory - enough to eat all the memory on a big file.
I don't think any of the above suggestions are the answer, as the program does not run out of memory in the BEGIN action, but rather in the main action.
In the main action all I am doing is reading lines from the awk input file and checking a value against the smallish array that I filled with values in the BEGIN action.
Any other suggestions?
Tim.
Strange that the problem happened in main and not in BEGIN.
Maybe for some reason awk is able to handle "out of memory" in BEGIN, but fails when it needs even a tiny memory allocation in the main action?
Anyway, could you explain what your program should do?
(And I can't see where upStreamDist and downStreamDist are initialized - on the command line?)
I see that my post has caused a lively debate between people who seem to know a million times more about awk than I do. That is good news for me.
Now for the clarification that I should have provided sooner as it would have reduced the uncertainty and debate:
I have 8 GB RAM.
It is clearly in the main action section that the memory gets consumed.
The rgnsFile used in the BEGIN section defines a smaller number of regions (less than 100 lines). It is of the form:
chromosome startIndex endIndex
Each endIndex-startIndex span is about 10,000, which means there are maybe 100*10,000 = 1,000,000 entries in the array.
The actual inputFile (7 GB, billions of lines) processed in the main section is of the form:
chromosome indexSinglePosition somethingNotUsed value
I wish to print out information for all lines in inputFile which fall in the regions defined in rgnsFile.
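Given that rgnsFile has fewer than 100 lines, one way to avoid the huge per-position array entirely (a hypothetical sketch based on the column layouts described above, with made-up data) is to store only the region bounds in BEGIN and range-test each input line against them:

```shell
# Hypothetical sketch: keep only ~100 (chromosome, start, end) triples
# in memory and range-test each input line against them.
cat > rgnsFile <<'EOF'
chr1 100 200
chr2 50 80
EOF
cat > inputFile <<'EOF'
chr1 150 x 1.1
chr1 999 x 2.2
chr2 60 x 3.3
EOF
awk -v rgns=rgnsFile '
BEGIN {
    while ((getline line < rgns) > 0) {     # load region bounds only
        split(line, f)
        n++; chr[n] = f[1]; lo[n] = f[2] + 0; hi[n] = f[3] + 0
    }
    close(rgns)
}
{
    for (i = 1; i <= n; i++)                # linear scan of ~100 regions
        if ($1 == chr[i] && $2 >= lo[i] && $2 <= hi[i]) { print; break }
}' inputFile
```

Memory use is then proportional to the number of regions, not the number of covered positions; the trade-off is ~100 comparisons per input line, which could be reduced further by sorting the bounds.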
If it is true that:
Quote:
5) One thing to keep in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason why the syntax "index in array" was created to scan all the elements of an array. So there is a chance that, if a huge number of elements have not been assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
Then this is most probably the cause of the problem, as there will be billions of positions in my inputFile which are not in the array.
Can you confirm this and perhaps suggest how to overcome this?
Thanks.
Tim.
Hi Tim,
well... this is one of the points we discussed. Actually, it is clearly stated in the official GAWK manual. As a possible (but not certain) solution, you can try the expression suggested in post #15:
Code:
if ( ($1 SUBSEP $2) in positionChr )
in place of
Code:
if(positionChr[$1,$2]==1)
from your original code. The expression ($1 SUBSEP $2) is a simple concatenation of $1, SUBSEP and $2, where SUBSEP is a gawk internal variable holding the separator used between the indices of a multidimensional array.
I suggest taking a look at the official manual for further details.
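To see the difference in action (a minimal sketch): merely referencing an element, even just to compare its value, creates it, while the in test does not.

```shell
# Sketch: a[x] on the right-hand side of a comparison CREATES the
# element; (x in a) only tests membership and creates nothing.
awk 'BEGIN {
    a["k"] = 1
    if (a["missing"] == 1) { }    # this reference creates a["missing"]
    n1 = 0; for (i in a) n1++     # count elements: now 2
    delete a                      # note: whole-array delete is a gawk extension
    a["k"] = 1
    if ("missing" in a) { }       # this test creates nothing
    n2 = 0; for (i in a) n2++     # count elements: still 1
    print n1, n2
}'
```

This prints "2 1": the comparison form silently grew the array, which is exactly how billions of unmatched lookups in the main action can exhaust memory.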