How to combine matching lines into one line in a file

David the H. · 11-09-2011, 12:15 PM

I hesitate to post something while the OP is still studying up on it, but I've been working on my own *advanced* solution, and I'm a bit stumped on one point. Hopefully it will work to inspire him, rather than act as a spoon-feeding.

I decided to try my hand at using gawk's new arrays of arrays feature, and I've managed to get each line in the desired order (c1-4 + (all) c5's + c6's + c7's + c8's).

(I could probably dig deeper into array nesting to make it neater, but I've decided to stop at two dimensions for the time being.)

What I can't quite figure out is how to keep the lines themselves in their original input order, since awk's for ( i in array ) loop walks the elements in an implementation-defined order rather than in insertion order. I can't directly use asort/asorti or a simple counting loop, because the main index is a complex string rather than a simple numeric series.

Can someone suggest a fix for this, or am I just barking up the wrong tree here?

Code:

#!/usr/local/bin/gawk -f
# Requires gawk v.4.0+
#(it's in the above location on my system)

BEGIN{
     SUBSEP=" "
     }

{
     ar[$1,$2,$3,$4][1] = ar[$1,$2,$3,$4][1]" "$5
     ar[$1,$2,$3,$4][2] = ar[$1,$2,$3,$4][2]" "$6
     ar[$1,$2,$3,$4][3] = ar[$1,$2,$3,$4][3]" "$7
     ar[$1,$2,$3,$4][4] = ar[$1,$2,$3,$4][4]" "$8
}

END{
     for ( i in ar ) {
          printf "%s", i
          for ( j=1 ; j <= 4; j ++ ) { printf "%s", ar[i][j] }
          printf "\n"
     }
}

grail · 11-09-2011, 11:51 PM

How about setting the sort order just before the for loop:

Code:

PROCINFO["sorted_in"] = "@ind_num_asc"
for( i in ar)...

See Here for details.

David the H. · 11-10-2011, 12:58 AM

Yes, I've already looked into that. It suffers from the same limitations as the asort options. The only built-in choices are numerical and string sorting. Indeed, it appears that asort simply accesses PROCINFO["sorted_in"] internally.

So what I need is something that keeps track of the original input order as it comes in. I'm thinking along the lines of a separate array to keep track of the indexes as they come along, and a related sorting function to reorder the output, but I can't quite wrap my mind around how to do it.

grail · 11-10-2011, 02:13 AM

Sorry ... misunderstood requirement ... how about:

Code:

#!/usr/bin/awk -f

BEGIN{
     SUBSEP=" "
     }

{
    found = 0

    for(p in ar)
        if(($1,$2,$3,$4) in ar[p])
            found = p

    if ( found )
        i = found
    else
        i++

     ar[i][$1,$2,$3,$4][1] = ar[i][$1,$2,$3,$4][1]" "$5
     ar[i][$1,$2,$3,$4][2] = ar[i][$1,$2,$3,$4][2]" "$6
     ar[i][$1,$2,$3,$4][3] = ar[i][$1,$2,$3,$4][3]" "$7
     ar[i][$1,$2,$3,$4][4] = ar[i][$1,$2,$3,$4][4]" "$8
}

END{
    for ( x = 1; x <= i; x++)
        for (f in ar[x]){
            printf "%s", f
            for ( j=1 ; j <= 4; j ++ ) { printf "%s", ar[x][f][j] }
            printf "\n"
        }
}

eagal · 11-10-2011, 09:18 AM

Thanks a lot for your help. I'm trying to understand it and test it, but I don't know why the test shows:
gawk: test1:10: ar[$1,$2,$3,$4][1] = ar[$1,$2,$3,$4][1]" "$5
gawk: test1:10: ^ syntax error
gawk: test1:10: ar[$1,$2,$3,$4][1] = ar[$1,$2,$3,$4][1]" "$5
gawk: test1:10: ^ syntax error
gawk: test1:11: ar[$1,$2,$3,$4][2] = ar[$1,$2,$3,$4][2]" "$6
gawk: test1:11: ^ syntax error
gawk: test1:11: ar[$1,$2,$3,$4][2] = ar[$1,$2,$3,$4][2]" "$6
gawk: test1:11: ^ syntax error
gawk: test1:12: ar[$1,$2,$3,$4][3] = ar[$1,$2,$3,$4][3]" "$7
gawk: test1:12: ^ syntax error
gawk: test1:12: ar[$1,$2,$3,$4][3] = ar[$1,$2,$3,$4][3]" "$7
gawk: test1:12: ^ syntax error
gawk: test1:13: ar[$1,$2,$3,$4][4] = ar[$1,$2,$3,$4][4]" "$8
gawk: test1:13: ^ syntax error
gawk: test1:13: ar[$1,$2,$3,$4][4] = ar[$1,$2,$3,$4][4]" "$8
gawk: test1:13: ^ syntax error
gawk: test1:19: for ( j=1 ; j <= 4; j ++ ) { printf "%s", ar[i][j] }
gawk: test1:19: ^ syntax error

grail · 11-10-2011, 10:46 AM

Well, unless you are using version 4+ of gawk (as mentioned by David), this will never work, as previous versions do not support arrays of arrays.
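A quick way to check this before running either script is to feed the interpreter a one-line program that uses a nested subscript. The following is only a sketch: the helper name `supports_subarrays` is hypothetical, and it assumes the interpreter exits with a non-zero status on a syntax error, as gawk does.

```shell
#!/bin/sh
# supports_subarrays: hypothetical helper, not part of the thread's scripts.
# Usage: supports_subarrays [awk-command]   (defaults to gawk)
supports_subarrays() {
    # A nested subscript is a syntax error in awks without arrays of arrays,
    # so a non-zero exit status means the feature (or the command) is missing.
    "${1:-gawk}" 'BEGIN { a[1][1] = 1 }' >/dev/null 2>&1
}

if supports_subarrays gawk; then
    echo "gawk here supports arrays of arrays"
else
    echo "gawk is missing or too old for arrays of arrays"
fi
```

If the second message appears, the syntax errors shown above are expected, and only a solution based on traditional arrays will run.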

Maybe you also missed this line:

Quote:

I hesitate to post something while the OP is still studying up on it, but I've been working on my own *advanced* solution, and I'm a bit stumped on one point. Hopefully it will work to inspire him, rather than act as a spoon-feeding.

Which generally means that if you're not sure what you're doing, this is probably not the solution you want to try to understand first.

David the H. · 11-10-2011, 06:27 PM

Domo arigato, grail. An additional array level was one of the solutions I was thinking about. I just couldn't work out the implementation of it. At a certain level of complexity my brain apparently starts to overheat and I can't keep track of how everything is supposed to work.

I'm not sure what you gave works quite right though. I believe that if i was, for example, 2 on an input line, then 1 on the next input line, then there was no match on the line after that, then i++ would mistakenly increment it back to 2.
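That failure mode can be traced with a reduced sketch that copies only the counter logic from the script above into a plain one-dimensional array, so it runs in any awk. The names `trace_counter` and `group` are hypothetical, and the sample keys mentioned below are made up for illustration.

```shell
#!/bin/sh
# trace_counter: hypothetical helper that mimics only the "found / i++" logic.
# For every input line it prints the slot number the line would be filed under.
trace_counter() {
    awk '
    {
        key = $1 OFS $2 OFS $3 OFS $4
        if (key in group)
            i = group[key]   # found: i jumps back to the old slot
        else
            i++              # not found: counts up from wherever i was left
        group[key] = i
        print i, key
    }' "$@"
}
```

Feeding it four lines whose first four fields form the keys A, B, A, C (in that order) prints the slots 1, 2, 1, 2: the new key C is filed under slot 2, which B already occupies, because the match on the third line left i at 1.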

It took surprisingly long for me to work out the kinks, but here's my final solution. I also made the variables more regular and descriptive, as well as making it properly respect the output separator:

Code:

#!/usr/local/bin/gawk -f

BEGIN{
	SUBSEP=OFS=" "
}

{
	found = 0
	for( g in ar ) {
		if( ($1,$2,$3,$4) in ar[g] ) {
			found = 1
			break
		}
	}

	if ( ! found ) {
		g = length(ar) + 1
	}

	ar[g][$1,$2,$3,$4][1] = ar[g][$1,$2,$3,$4][1] OFS $5
	ar[g][$1,$2,$3,$4][2] = ar[g][$1,$2,$3,$4][2] OFS $6
	ar[g][$1,$2,$3,$4][3] = ar[g][$1,$2,$3,$4][3] OFS $7
	ar[g][$1,$2,$3,$4][4] = ar[g][$1,$2,$3,$4][4] OFS $8

}

END{
	for ( g = 1 ; g <= length(ar) ; g++ ) {
		for ( first4 in ar[g] ) {
			printf "%s", first4
			for ( i=1 ; i <= 4 ; i++ ) { printf "%s", ar[g][first4][i] }
			printf "\n"
		}
	}
}

@eagal, Sorry if I led you down the wrong track. I was really posting for my own edification more than anything else.

However, you might take some clues from the code I posted. awk's traditional arrays can likely do it as well, if you think about it creatively. I may see what I can come up with as well, if I have time to work on it.

grail · 11-10-2011, 08:20 PM

Quote:

I'm not sure what you gave works quite right though. I believe that if i was, for example, 2 on an input line, then 1 on the next input line, then there was no match on the line after that, then i++ would mistakenly increment it back to 2.

Good catch, David.

I knocked it up before leaving work so hadn't really had a chance to test it.

@eagal - the general solution is actually easier to follow than using this more advanced feature (hint)
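For reference, one possible shape of such a general solution, using only traditional one-dimensional arrays, is sketched below. It is not taken from any post in this thread: the function name `combine_lines` and the array names `seen`, `order` and `col` are hypothetical, and it assumes the eight-column, whitespace-separated input implied by the scripts above.

```shell
#!/bin/sh
# combine_lines: hypothetical helper name, not from the thread.
# Merges lines that share fields 1-4 and keeps the first-seen input order.
combine_lines() {
    awk '
    {
        key = $1 OFS $2 OFS $3 OFS $4
        # Remember each key the first time it appears.
        if (!(key in seen)) {
            seen[key] = 1
            order[++n] = key
        }
        # Append columns 5-8 of this line to the per-key, per-column strings.
        for (c = 5; c <= 8; c++)
            col[key, c] = col[key, c] OFS $c
    }
    END {
        # Walk the keys by their arrival number, not with for ( k in ... ).
        for (k = 1; k <= n; k++) {
            line = order[k]
            for (c = 5; c <= 8; c++)
                line = line col[order[k], c]
            print line
        }
    }' "$@"
}
```

Called as `combine_lines file`, or with data on standard input, it turns the three lines `a b c d 1 2 3 4`, `e f g h 5 6 7 8` and `a b c d 9 10 11 12` into `a b c d 1 9 2 10 3 11 4 12` followed by `e f g h 5 6 7 8`. The numeric `order` array is what preserves the input order, so no sorting step is needed.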