LinuxQuestions.org

rafir 04-10-2012 12:48 PM

Bash awk coding loop
 
Hi

I have a very large file of format:

chr1 54478 54479 00 00 00 00 3
chr1 41980 41981 11 11 11 11 4
chr1 54352 54353 00 nn 01 00 4
chr1 52726 52727 nn 00 01 01 12
chr1 47291 47292 nn nn nn 00 13
chr1 46669 46670 01 nn nn nn 14
chr1 47107 47108 nn nn 01 nn 14
chr1 54379 54380 00 00 0n 00 15
chr1 49297 49298 nn nn nn nn 16
chr1 54675 54676 00 00 00 0n 16
chr1 55163 55164 11 11 nn 11 16
chr1 51672 51673 00 nn nn nn 18
...

I would like to subset the data repeatedly on column 8 and, for each subset, count how many times columns 4 and 5 do not match, and likewise for columns 6 and 7. The only way I could think of was to combine bash and awk, but it does not seem to work.

for(i=1; i<=120; i++);
grep $8<=i|
awk 'BEGIN{n=0;m=0}
{
if($4!=$5){n=n+1;}
if($6!=$7){m=m+1;}
}
END{print i, n, m}'
done

**************************************

-or-

for(i=1; i<=120; i++);
awk 'BEGIN{n=0;m=0}
{if($8<=i){
if($4!=$5){n=n+1;}
if($6!=$7){m=m+1;}
}
}
END{print i, n, m}'
done

Thank you

David the H. 04-10-2012 01:04 PM

Please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.

Could you please explain what you want a little more clearly? Show us exactly what the output should look like for the above example.

The first thing I noticed:

Code:

for(i=1, i=120, i++){
grep $8>=i | awk

The first part of this command is shell syntax, not awk syntax, so, for example, $8 here would be considered the shell's eighth positional parameter, and the for loop syntax is completely wrong.
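
A minimal sketch of that difference, assuming bash. The `demo` function and its eight dummy arguments are made up for illustration. Outside the quotes, `$8` is expanded by the shell; inside a single-quoted awk program, it is the eighth field of the current input line. The last part shows the C-style loop syntax that bash accepts.

```shell
#!/bin/bash
# Hypothetical demo function; the name and the dummy arguments are illustrative only.
demo() {
    # Give this function eight positional parameters: $1=a ... $8=h.
    set -- a b c d e f g h

    # In double quotes, $8 is expanded by the shell before echo runs.
    echo "shell: $8"

    # Inside single quotes the shell leaves $8 alone, so awk receives it
    # and treats it as the eighth field of the input line.
    echo 'chr1 54478 54479 00 00 00 00 3' | awk '{ print "awk: " $8 }'

    # bash arithmetic for loop: double parentheses, semicolons, do ... done.
    for ((i = 1; i <= 3; i++)); do
        echo "i=$i"
    done
}

demo
```

On the sample line this should print `shell: h`, then `awk: 3`, then `i=1` to `i=3`.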

rafir 04-10-2012 01:20 PM

That's true. I am mixing up formats. So for i = 1 to 18 I would want this output:

3 0 0
4 1 1
...
12 2 1
13 2 2
14 3 3
15 3 4
16 3 6
17 3 6
18 4 6

grail 04-10-2012 01:41 PM

Yeah, you've still lost me :( Maybe you could explain how you are measuring the data you have shown. For example, in 18 4 6, I follow that 18 comes from the last column, but I have absolutely no idea how you came up with the other 2 numbers.

amani 04-10-2012 01:49 PM

@grail, 4 and 6 must be the numbers of non-matches (see the first post)

rafir 04-10-2012 01:58 PM

amani is right. Among the rows where the last column is <= 18, there are 4 non-matches where $4 != $5 and 6 non-matches where $6 != $7.
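
As a rough sketch, the count for a single threshold can be reproduced like this. The helper name `count_upto`, the awk variable `limit`, and the temporary file are assumptions made for the example; the rows are the sample from the first post.

```shell
#!/bin/bash
# Sample rows copied from the first post, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
chr1 54478 54479 00 00 00 00 3
chr1 41980 41981 11 11 11 11 4
chr1 54352 54353 00 nn 01 00 4
chr1 52726 52727 nn 00 01 01 12
chr1 47291 47292 nn nn nn 00 13
chr1 46669 46670 01 nn nn nn 14
chr1 47107 47108 nn nn 01 nn 14
chr1 54379 54380 00 00 0n 00 15
chr1 49297 49298 nn nn nn nn 16
chr1 54675 54676 00 00 00 0n 16
chr1 55163 55164 11 11 nn 11 16
chr1 51672 51673 00 nn nn nn 18
EOF

# count_upto LIMIT FILE (hypothetical helper): count mismatches in the rows
# whose eighth column is <= LIMIT.
count_upto() {
    awk -v limit="$1" '
        $8 <= limit {
            if ($4 != $5) n++      # columns 4 and 5 differ
            if ($6 != $7) m++      # columns 6 and 7 differ
        }
        END { print limit, n + 0, m + 0 }   # "+ 0" prints 0 rather than an empty string
    ' "$2"
}

count_upto 18 "$data"    # prints: 18 4 6
count_upto 13 "$data"    # prints: 13 2 2
```

Calling this once per threshold rereads the whole file each time, which is likely to be slow on a very large file; a single pass that accumulates per-value counts avoids that.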

grail 04-10-2012 02:05 PM

Okay ... so it is cumulative ... nice to know :)

Next silly question: when several rows have the same number in the last column (e.g. rows 2 and 3 both end in 4), should we hold the output until the last column changes?

As an example if we were not doing it per change, the output would be:
Code:

3 0 0
4 0 0
4 1 1

This may seem like an odd question given your output example; however, that example also includes values not present in the original data, such as 17.

rafir 04-10-2012 02:38 PM

Yes, only output when i changes value, and every value of i should get exactly one entry. So the complete output from above would be:

1 0 0
2 0 0
3 0 0
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 2 1
13 2 2
14 3 3
15 3 4
16 3 6
17 3 6
18 4 6

colucix 04-11-2012 03:49 AM

No need to mix bash and a powerful tool like awk! Please check this script:
Code:

#!/bin/awk -f

{

  pair_one[$NF] = pair_one[$NF] + ( $4 != $5 )
  pair_two[$NF] = pair_two[$NF] + ( $6 != $7 )
 
}

END {

  for ( i = 1; i <= $NF; i++ ) {
    sum_one += pair_one[i]
    sum_two += pair_two[i]
    print i, sum_one, sum_two
  }
   
}

The loop in the END section stops at the value found in the last column of the last line of the file. Replace $NF with 120 if you already know that is the maximum value (or if the numbers in the last column are not sorted in ascending order). Also note that, as written, this is a script interpreted by awk (see the shebang on the very first line). Hope this helps.
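
If the maximum is not known in advance and the file is not sorted, one possible variation is to track the largest last-column value while reading. This is only a sketch, not a replacement for the script above: the wrapper `cumulative_counts`, the variables `key` and `max`, and the temporary file are assumptions, and the sample rows are deliberately shuffled so that the last line does not hold the maximum.

```shell
#!/bin/bash
# Sample rows from the first post, shuffled on purpose (the last line ends in 15, not 18).
data=$(mktemp)
cat > "$data" <<'EOF'
chr1 51672 51673 00 nn nn nn 18
chr1 54478 54479 00 00 00 00 3
chr1 47107 47108 nn nn 01 nn 14
chr1 41980 41981 11 11 11 11 4
chr1 55163 55164 11 11 nn 11 16
chr1 54352 54353 00 nn 01 00 4
chr1 52726 52727 nn 00 01 01 12
chr1 49297 49298 nn nn nn nn 16
chr1 47291 47292 nn nn nn 00 13
chr1 46669 46670 01 nn nn nn 14
chr1 54675 54676 00 00 00 0n 16
chr1 54379 54380 00 00 0n 00 15
EOF

# cumulative_counts FILE (hypothetical wrapper): one pass over the file,
# then print the running totals for every value from 1 to the largest seen.
cumulative_counts() {
    awk '
        {
            key = $NF + 0                    # force a numeric key
            pair_one[key] += ($4 != $5)      # mismatches in columns 4/5 for this key
            pair_two[key] += ($6 != $7)      # mismatches in columns 6/7 for this key
            if (key > max) max = key         # remember the largest key seen
        }
        END {
            for (i = 1; i <= max; i++) {
                sum_one += pair_one[i]       # missing keys add 0
                sum_two += pair_two[i]
                print i, sum_one, sum_two
            }
        }
    ' "$1"
}

cumulative_counts "$data"
```

On these rows the output runs from `1 0 0` to `18 4 6`, matching the table rafir posted, regardless of the row order.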

grail 04-11-2012 04:50 AM

Right ... so now that I have all the information, you might want something like:
Code:

awk '{a[$NF] = a[last] + ($4!=$5);b[$NF] = b[last] + ($6!=$7);last = $NF}END{for(i=1;i<=$NF;i++){if(a[i])n = i;print i,a[n],b[n]}}' file
Currently it does not print zeros, but I am sure you can change it as need be :)

rafir 04-11-2012 01:32 PM

When run on the data above, both scripts produce huge (but not identical) files, which look like the result of infinite loops.

What is the meaning of the syntax:

pair_one[$NF] = pair_one[$NF] + ( $4 != $5 )

colucix 04-11-2012 02:34 PM

Quote:

Originally Posted by rafir (Post 4650356)
When run on the data above, both scripts produce huge (but not identical) files, which look like the result of infinite loops.

How did you run the code? Please show us what you entered on the command line, what you got, and the content of the current version of your script, preferably using CODE tags to make it more readable.

Quote:

Originally Posted by rafir (Post 4650356)
What is the meaning of the syntax:

pair_one[$NF] = pair_one[$NF] + ( $4 != $5 )

This means that the $NF-th element of the array pair_one (that is, the element whose index is the value of the last field of the current record) is set to its previous value plus the value returned by the expression
Code:

( $4 != $5 )
In awk (as in many programming languages) a logical expression evaluates to 0 if it is false and to 1 if it is true. Hence the count increases by 1 if the two fields differ and stays the same if they are equal. Hope it's a bit clearer now.
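
A tiny illustration of that behaviour, with made-up names (`show_comparisons`, `differ_12`, `differ_34`, `count`) and a one-line input invented for the demo:

```shell
#!/bin/bash
# Hypothetical helper: print the values awk assigns to two comparisons,
# followed by a counter that accumulates them.
show_comparisons() {
    echo '00 nn 01 01' | awk '{
        differ_12 = ($1 != $2)     # "00" vs "nn": different, so 1
        differ_34 = ($3 != $4)     # "01" vs "01": equal, so 0
        count += differ_12         # adding the comparison result counts mismatches
        count += differ_34
        print differ_12, differ_34, count
    }'
}

show_comparisons    # prints: 1 0 1
```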

grail 04-12-2012 04:18 AM

I am with colucix. I have run the code on the given example and it generates the exact output you have given.


All times are GMT -5.