Issue for going back to previous line under conditions

Trd300 · 07-10-2012, 12:38 AM

I have this input:

Code:

Joe|info.1
Bob|info.1
Bob|info.2

I would like to write the different info about the same person on the same line, like this:

Code:

Joe|info.1
Bob|info.1|info.2

I tried:

Code:

awk 'BEGIN{FS=OFS="|"} {if(a[$1]++ == 0) {print; stored = $0}; else if(a[$1]++ > 0) print stored FS $2}'

But the original line gets duplicated:

Code:

Joe|info.1
Bob|info.1
Bob|info.1|info.2

That's because I print in the first if branch, but if I don't, I lose the first line...
Any advice?

Thanks in advance

grail · 07-10-2012, 12:52 AM

As a quick alternative:

Code:

awk -F"|" 'NR==1{printf "%s",$0}x{if(x!=$1)printf "\n%s",$0;else printf "|%s",$2}{x=$1}END{if(NR)print ""}' file

Trd300 · 07-10-2012, 01:09 AM

Thanks for your help, but this alternative is too quick ! :-)

It doesn't work for me...

The point is this: if a name in $1 occurs only once, print the entire line; if it occurs more than once, go back to the first occurrence and append the extra fields from the later occurrences.

David the H. · 07-10-2012, 02:14 AM

How about something like this?

Code:

#example file input
$ cat file.txt
Joe|info.1
Bob|info.1
Bob|info.2
David|info.1
Bob|info.3
Grail|info.1
David|info.2
Trd|info.1
Foo|info.1

$ awk 'BEGIN{ FS="|" } { a[$1]=(a[$1]?a[$1]:$1) FS $2 } END{ for (i in a){ print a[i] } }' file.txt
Foo|info.1
Grail|info.1
David|info.1|info.2
Bob|info.1|info.2|info.3
Trd|info.1
Joe|info.1

Caveats: it assumes there are only two fields per line, and the output is (as you can see) not in the original input order, because awk does not define the order in which "for (i in a)" walks the array.
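
If lines can carry more than two fields, one possible variant is to append every field after the first instead of only $2. This is just a sketch under that assumption: the sample input is made up, the loop variable "k" is an arbitrary name, and the trailing sort is only there to make the output order predictable.

```shell
# Made-up sample input with a varying number of fields per line.
out=$(printf '%s\n' 'Joe|info.1' 'Bob|info.1|info.2' 'Bob|info.3' |
awk 'BEGIN { FS = "|" }
{
    if (!($1 in a)) a[$1] = $1                      # start each entry with the name
    for (k = 2; k <= NF; k++) a[$1] = a[$1] FS $k   # append every remaining field
}
END { for (i in a) print a[i] }' | LC_ALL=C sort)   # sort only to get a stable order
printf '%s\n' "$out"
```

With this input, all three of Bob's fields end up on a single line.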

grail · 07-10-2012, 03:10 AM

From the initial data, I assumed the input was sorted by column 1 (hence my suggested solution).

Your current approach cannot work: once print has written a line, nothing more can be appended to it.
So for an unsorted list, David's solution, which stores the data before printing, is the way to go. (It will of course work for a sorted list as well.)

Further to David's solution, if name order is important, you could use asorti in the END block.
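
A rough sketch of the asorti idea, assuming gawk is installed (asorti is a gawk extension and is not available in other awks). The sample input is the one from the first post, and the array name "names" is just an illustrative choice.

```shell
# Requires gawk: asorti() is a gawk extension.
out=$(printf '%s\n' 'Joe|info.1' 'Bob|info.1' 'Bob|info.2' |
gawk 'BEGIN { FS = "|" }
{ a[$1] = (a[$1] ? a[$1] : $1) FS $2 }
END {
    n = asorti(a, names)                 # names[1..n] = indices of a, sorted as strings
    for (i = 1; i <= n; i++) print a[names[i]]
}')
printf '%s\n' "$out"
```

Here the END block walks the sorted list of names instead of using "for (i in a)", so Bob's line is printed before Joe's.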

David the H. · 07-10-2012, 08:53 AM

Yeah, maybe I should've added that to my caveats. If the list is unsorted, then you're going to have to store every line in memory and print everything out at the end. That's not a problem for small amounts of input text, but it won't work if there's more than the system memory can handle.

As for controlling array sorting, see here:
http://www.gnu.org/software/gawk/man...y-Sorting.html

Rather than using asorti though, if you want the output sorted alphabetically, for example, you can simply add a PROCINFO setting to the BEGIN section:

Code:

BEGIN{ PROCINFO["sorted_in"]="@ind_str_asc" ; FS=OFS="|" }

Note that only recent versions of gawk (4.0 and later) can do this. Older gawk versions don't support this setting, and other awk implementations don't have any sorting features built in, so there you'd have to roll your own index-tracking function. You'll also have to do so if you need the output order to be identical to the input order and that order isn't already one of the preset sorting types.

(And wouldn't it be nice if the gawk developers added a setting or two for "input order"?)
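
For what it's worth, rolling your own tracking for input order can be fairly short. A minimal sketch that should work in any POSIX awk: a second array records each name the first time it is seen, and the END block walks that array by number. The names "order" and "n" and the sample input are only illustrative.

```shell
# Made-up sample input; Bob appears twice, with David in between.
out=$(printf '%s\n' 'Joe|info.1' 'Bob|info.1' 'David|info.1' 'Bob|info.2' |
awk 'BEGIN { FS = "|" }
{
    if (!($1 in a)) order[++n] = $1      # remember each name on first sight
    a[$1] = (a[$1] ? a[$1] : $1) FS $2   # same accumulation as before
}
END { for (i = 1; i <= n; i++) print a[order[i]] }')
printf '%s\n' "$out"
```

The names come out in the order of their first appearance: Joe, then Bob (with both infos), then David. The memory caveat still applies, since everything is held until the END block.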

Trd300 · 07-10-2012, 07:14 PM

Thanks David & grail !

The order of the output doesn't really matter.

I didn't know this syntax:

Code:

{ a[$i]=(a[$i]?a[$i]:$i) FS $j }

it's very handy, and asorti and PROCINFO as well.

Thanks guys !

grail · 07-10-2012, 09:47 PM

If the input is sorted, then my solution avoids having to store the data.

David the H. · 07-11-2012, 10:42 AM

Quote:

Originally Posted by Trd300

I didn't know this syntax:

Code:

{ a[$i]=(a[$i]?a[$i]:$i) FS $j }

it's very handy, and asorti and PROCINFO as well.

Yeah, "condition?value1:value2" is the ternary operator, a kind of short form of if/then/else.

In this case, if array entry "a[$i]" already holds a previously set value (strictly, one that is non-empty and non-zero), then use it; otherwise use "$i".
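
To make the equivalence concrete, here is the same accumulation spelled out with a plain if/else. It is only a sketch using the two-field sample from the first post; the trailing sort is there just to make the output order predictable.

```shell
out=$(printf '%s\n' 'Joe|info.1' 'Bob|info.1' 'Bob|info.2' |
awk 'BEGIN { FS = "|" }
{
    # Long form of: a[$1] = (a[$1] ? a[$1] : $1) FS $2
    if (a[$1])
        a[$1] = a[$1] FS $2   # entry already has text: append the new field
    else
        a[$1] = $1 FS $2      # first time this name is seen: start with the name
}
END { for (i in a) print a[i] }' | LC_ALL=C sort)
printf '%s\n' "$out"
```

Both forms build the same strings; the ternary just folds the two assignments into one expression.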