LinuxQuestions.org
Old 12-09-2009, 09:17 AM   #16
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983

Quote:
Originally Posted by ghostdog74
this:
Code:
awk 'BEGIN{ while ((getline < "file") > 0 ){ print } }'
and this:
Code:
  awk '{print}' file
are the same, (except for some speed difference.)
Yes, they are the same. Now that my doubt is cleared up, I can see what you stated in a previous post (the file is read twice). Let's assume rgnsFile is the same big input file: the statement
Code:
getline < rgnsFile
takes its input from rgnsFile itself, but this does not affect the NR variable, so in the main action the input file (I mean the file whose name is passed as an argument on the command line) is processed from the beginning. In other words, it is as if the same file were opened twice, each time independently, by the same awk script, right?

Quote:
OP has to show more in order to clarify what's going on.
Totally agree, this is why I asked for more information in the first instance.

Last edited by colucix; 12-09-2009 at 09:19 AM.
 
Old 12-09-2009, 09:26 AM   #17
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by colucix
takes its input from rgnsFile itself, but this does not affect the NR variable, so in the main action the input file (I mean the file whose name is passed as an argument on the command line) is processed from the beginning. In other words, it is as if the same file were opened twice, each time independently, by the same awk script, right?
Anything in the BEGIN{} block is processed before the file (the command-line argument) is processed. Therefore, yes, the NR variable is not affected.
Another way, without using the while getline loop:
Code:
awk 'FNR==NR{ print "from 1st file:" $0 ;next} 
{print "from 2nd file:" $0 }' file1 file2
 
Old 12-09-2009, 09:30 AM   #18
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
@code933k: indeed I can be a bit slow, but I finally succeeded in understanding what I was missing. Thank you for your detailed explanation, anyway!

I just have to reflect a bit more on the following...
Quote:
Originally Posted by code933k
It is commonly pointed out as a pitfall, but it is not truly one. That's why garbage collectors are there, and this behaviour (direct assignment of NULL) is preferable to having three conditionals for $1, $2, $3 per line trying to avoid null assignments (even more memory referenced, though garbage collected the same way). Do remember that arrays in awk aren't positional, i.e., p[2] != NULL and p[99998] != NULL may exist separately from [1 ... 99999]. Thus, NULL pointers can be automatically collected anyway.
Anyway, I cannot get it out of my mind that an alternative if statement like
Code:
if ( ($1 SUBSEP $2) in positionChr )
should solve the memory problem. And a totally different logic - as suggested by you and ghostdog - should do, as well.
 
Old 12-09-2009, 09:40 AM   #19
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by ghostdog74
Anything in the BEGIN{} block is processed before the file (the command-line argument) is processed. Therefore, yes, the NR variable is not affected.
That's not the point. To me, it is the getline taking input from "another" file (due to the redirection) that leaves the NR count of the input file (the one passed as an argument) unaffected. Indeed, a plain getline in the BEGIN section does affect the NR count. Consider the following:
Code:
$ cat testfile
line 1
line 2
line 3
$ cat test.awk
BEGIN {
  getline
  print NR
  getline
  print NR
}
{ print }
$ awk -f test.awk testfile
1
2
line 3
If I run the following code, instead:
Code:
$ cat test.awk
BEGIN {
  getline < "testfile"
  print NR
  getline < "testfile"
  print NR
}
{ print }
$ awk -f test.awk testfile
0
0
line 1
line 2
line 3
That is, the redirection from "testfile" in the getline statement causes awk to open it independently, as if it were another file. And this is what causes the same file to be processed twice by the same code.

Last edited by colucix; 12-09-2009 at 09:41 AM.
 
Old 12-09-2009, 10:17 AM   #20
code933k
Member
 
Registered: Aug 2007
Location: Bogotá, Colombia. South America
Distribution: ArchLinux / Source Mage GNU Linux (test branch) / openSUSE
Posts: 130

Rep: Reputation: Disabled
Quote:
Originally Posted by colucix
That is, the redirection from "testfile" in the getline statement causes awk to open it independently, as if it were another file. And this is what causes the same file to be processed twice by the same code.
That's it!
Though I am sure you didn't mean "it is processed twice by the same code", but rather that it is opened and processed once in the BEGIN section and once in the awk script body (taking the file from the command line), as previously stated by ghostdog74.
 
Old 12-09-2009, 10:47 AM   #21
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by code933k
That's it!
Though I am sure you didn't mean "it is processed twice by the same code", but rather that it is opened and processed once in the BEGIN section and once in the awk script body (taking the file from the command line), as previously stated by ghostdog74.
That's exactly what I meant. Now, let's wait for the OP to answer our queries.
 
Old 12-09-2009, 02:29 PM   #22
Valery Reznic
ELF Statifier author
 
Registered: Oct 2007
Posts: 676

Rep: Reputation: 137
Quote:
Originally Posted by bartonski
Well, you're essentially building a two dimensional array that's i x j characters. If i and j get very large, you're going to be running out of memory very quickly.
Before that, the while {...} loop will read the whole file (OK, OK, at least the first 3 fields from each line) into memory, which is enough to eat all the memory on a big file.
 
Old 12-10-2009, 05:57 AM   #23
timonlq
LQ Newbie
 
Registered: Aug 2009
Posts: 7

Original Poster
Rep: Reputation: 0
I don't think any of the above suggestions is the answer, as the program does not run out of memory in the BEGIN action, but rather in the main action.

In the main action all I am doing is reading lines from the input file and checking a value against the smallish array that I have filled with values in the BEGIN action statement.

Any other suggestions?

Tim.
 
Old 12-10-2009, 06:14 AM   #24
Valery Reznic
ELF Statifier author
 
Registered: Oct 2007
Posts: 676

Rep: Reputation: 137Reputation: 137
Quote:
Originally Posted by timonlq View Post
i don't think any of the above suggestions are the answer as the program does not run out of memory in the BEGIN action, but rather in the main action.

In the main action all I am doing is reading lines from the awk file and check a value against the smallish array that I have filled with values in the BEGIN action statement.

Any other suggestions?

Tim.
Strange that problem happened in main and not in BEGIN.
May be for some reason awk able to handle "out of memory" in BEGIN,
but when awk need event tiny memory allocation in the main it's fail ?

Anyway, could you explain what your program should do ?

(And I can't see where upStreamDist and downStreamDist initialized - in the command line ? )
 
Old 12-10-2009, 07:19 AM   #25
timonlq
LQ Newbie
 
Registered: Aug 2009
Posts: 7

Original Poster
Rep: Reputation: 0
hi,

I see that my post has caused a lively debate between people who seem to know a million times more about awk than I do. That is good news for me.

Now for the clarification that I should have provided sooner, as it would have reduced the uncertainty and debate:

I have 8 GB of RAM.

It is clearly in the main action section that the memory gets consumed.

The rgnsFile used in the BEGIN section defines a small number of regions (fewer than 100 lines). It is of the form:
chromosome startIndex endIndex

The difference endIndex - startIndex is about 10,000, which means that there are maybe 100 * 10,000 = 1,000,000 entries in the array.

The actual inputFile (7 GB, i.e. billions of lines) processed in the main section is of the form:

chromosome indexSinglePosition somethingNotUsed value

I wish to print out information for all lines in inputFile which are in the regions defined in rgnsFile.

If it is true that:

Quote:
5) One thing to keep in mind is that in awk, whenever you reference a non-existent array element, the element is actually created and a null string is assigned as its value. This is a common pitfall in awk scripts, and it is the reason why the syntax "index in array" exists for testing and scanning the elements of an array. So there is a chance that, if a huge number of elements have not been assigned in the BEGIN section, the size of the array increases enormously and unexpectedly in the main action.
Then this is most probably the cause of the problem, as there will be billions of positions in my inputFile which are not in the array.

Can you confirm this and perhaps suggest how to overcome this?

Thanks.

Tim.
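
Putting the pieces of the thread together, the filter described above could be sketched roughly as follows. This is a guess at the overall shape, not Tim's actual script: the two tiny sample files, passing the file name with -v rgnsFile=..., and leaving out upStreamDist and downStreamDist are all assumptions made for illustration. The main rule uses a membership test, so input lines that fall outside every region do not add anything to positionChr.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical sample data in the two layouts described in the post.
printf 'chr1 5 7\n' > "$tmp/regions.txt"
printf 'chr1 4 x 0.1\nchr1 6 x 0.2\nchr2 6 x 0.3\n' > "$tmp/input.txt"

filtered=$(awk -v rgnsFile="$tmp/regions.txt" '
  BEGIN {
    # Expand each region into one array element per position.
    while ((getline line < rgnsFile) > 0) {
      split(line, f, " ")
      for (pos = f[2] + 0; pos <= f[3] + 0; pos++)
        positionChr[f[1], pos] = 1
    }
    close(rgnsFile)
  }
  # Membership test: lines outside every region leave the array untouched.
  ($1 SUBSEP $2) in positionChr { print }
' "$tmp/input.txt")

echo "$filtered"
rm -rf "$tmp"
```

With the sample data only the chr1 line at position 6 falls inside the region chr1 5..7, so that is the only line printed.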
 
Old 12-10-2009, 10:02 AM   #26
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by timonlq
Can you confirm this and perhaps suggest how to overcome this?

Thanks.

Tim.
Hi Tim,

Well... this is one of the points we discussed. Actually, it is clearly stated in the official GAWK manual. As a possible (but not certain) solution, you can try the expression suggested in post #15:
Code:
if ( ($1 SUBSEP $2) in positionChr )
in place of
Code:
if(positionChr[$1,$2]==1)
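from your original code. The expression ($1 SUBSEP $2) is a simple concatenation of $1, SUBSEP and $2, where SUBSEP is the built-in awk variable holding the separator that is placed between the indices of a multi-dimensional array subscript. It therefore builds the same key as positionChr[$1,$2] but, used with "in", it only tests whether the element exists and does not create it.

I suggest taking a look at the official manual for further details.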

Good luck!
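
For anyone who wants to see the difference for themselves, here is a minimal sketch (not the OP's script). The array name seen and the three sample records are made up for illustration. Each awk run starts with one element in the array and reports the number of matches followed by the number of elements left in the array at the end.

```shell
#!/bin/sh
# Three hypothetical records; only the first one is in the lookup array.
data='chr1 10
chr1 20
chr2 30'

# Subscript test: every miss creates a new empty element.
by_subscript=$(printf '%s\n' "$data" | awk '
  BEGIN { seen["chr1", 10] = 1 }
  seen[$1, $2] == 1 { hits++ }
  END { n = 0; for (k in seen) n++; print hits, n }')

# Membership test: the array is only inspected, never modified.
by_membership=$(printf '%s\n' "$data" | awk '
  BEGIN { seen["chr1", 10] = 1 }
  ($1 SUBSEP $2) in seen { hits++ }
  END { n = 0; for (k in seen) n++; print hits, n }')

echo "subscript:  $by_subscript"
echo "membership: $by_membership"
```

Both variants find the same single match, but the subscript version ends with three elements in the array and the membership version with one. Scaled up to billions of non-matching input lines, that growth is what would eat the memory.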
 
Old 12-12-2009, 04:40 AM   #27
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
I have merged the two duplicate threads, since each one had many replies.

Next time: only one thread per topic. Thanks.
 
Old 12-15-2009, 01:59 AM   #28
timonlq
LQ Newbie
 
Registered: Aug 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Thanks to everyone who contributed to this thread and helped me with a solution.
 
Old 12-15-2009, 12:10 PM   #29
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by timonlq
Thanks to everyone who contributed to this thread and helped me with a solution.
And (just out of curiosity) the solution is...?
 
  



Tags
awk


