Old 10-03-2012, 07:51 PM   #1
lqd9o
LQ Newbie
 
Registered: Oct 2012
Posts: 4

Rep: Reputation: Disabled
last occurrence of duplicate record, gawk


Hi Linux pros !

I am new to programming, and I am trying to use gawk to remove duplicate lines based on the first two fields, keeping only the last occurrence of each record.
I've seen plenty of commands that keep the first occurrence, but not really the last one (and it turns out to be more complicated than I thought).

The file I want to process looks like this (although the real one is much longer):
Code:
item1/ref.001/eur/Bel.
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/ame/Can.
item4/ref.003/afr/Sen.
Lines 1 and 2 have the same first two fields and are considered duplicates, so only line 2 should be kept. The same logic applies to lines 5 and 6, giving:
Code:
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/afr/Sen.
Based on what I've seen on the web, I tried:
Code:
gawk 'BEGIN{FS="/"}

{ array[$1$2] = NR
  lines[$1$2] = $0

  for(key in array)
       reverse[array[key]] = key
       for(nr=1;nr<=NR;nr++)
           if(nr in reverse)
               print lines[reverse[nr]]
}'
but it adds even more duplicates!

Thanks in advance !

 
Old 10-03-2012, 08:50 PM   #2
rosehosting.com
Member
 
Registered: Jun 2012
Location: Missouri, USA
Posts: 236

Rep: Reputation: 64
This code should work. The two loops have to go in an END block so that they run only once, after the whole file has been read; by that point pos[$1,$2] holds the line number of the last occurrence of each key.
Code:
awk 'BEGIN{FS="/"}
# each new occurrence of a key overwrites the previous line number and content
{pos[$1,$2] = NR; lines[$1,$2] = $0}
END {
  # invert the map: surviving line number -> key
  for(key in pos) reverse[pos[key]] = key
  # print the survivors in their original file order
  for(nr=1;nr<=NR;nr++)
    if(nr in reverse) print lines[reverse[nr]]
}'
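For example, with the sample data saved in a file named input and the program (the part between the single quotes) saved as keep_last.awk, both names chosen here just for illustration, a run looks like this:
Code:
$ awk -f keep_last.awk input
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/afr/Sen.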

 
1 member found this post helpful.
Old 10-04-2012, 03:32 AM   #3
lqd9o
LQ Newbie
 
Registered: Oct 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Whoops, I forgot the END... Thanks, rosehosting.com!

Can I ask you another question related to this thread? (Maybe I should create a new thread.)
It is not strictly necessary now; I am just curious about a detail.

While I was looking for a solution to this problem, I ran into another problem when creating an array.

For each line of the previous input, I appended a field with the occurrence count of the $1-$2 pair, then wanted to sort on that new field in descending order using the "asort" function. After that I could have used the classic first-occurrence method to remove the duplicates.
The problem is that I couldn't get the right output after sorting on the last field; I wanted to obtain this:
Code:
item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0
The code I used to sort the last field in descending order:
Code:
BEGIN{FS=OFS="/"}
{
  occur = array[$1$2]++
  line = $0 FS occur
  array[occ++] = occ
  n = asort(array, "@val_num_desc")
  for(i=0; i<=n; i++){
    print $0 FS array[i]
  }
}
Do you see why it doesn't work?
I am not sure about the array in the asort function.
Can you create an array without using the "split" function?

Thanks !

 
Old 10-04-2012, 07:28 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Here is another alternative:
Code:
awk -F/ '!($1$2 in a){i=1}{a[$1$2][i++]=$0}END{for(x in a)print a[x][length(a[x])]}' file
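Note that a[$1$2][i++] is an array of arrays, which needs gawk 4.0 or later; plain awk (and older gawk) will reject it. Also, for (x in a) visits keys in no particular order, so the output is not guaranteed to come out in file order, and with interleaved keys the shared i counter can leave gaps in the inner arrays. Since only the last occurrence per key matters anyway, a plain overwrite avoids the inner arrays entirely; a minimal sketch (output order still unspecified):
Code:
awk -F/ '{a[$1,$2] = $0} END{for (x in a) print a[x]}' file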
 
Old 10-04-2012, 06:49 PM   #5
lqd9o
LQ Newbie
 
Registered: Oct 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
@grail:
Thanks! (Your code didn't work for me with plain awk, but it did with gawk.)

So if I follow your logic, we can also create an array just by writing something like this?
Code:
( items in array){ do stuff with array ... }

But when I tried to create an array built up while iterating over the input lines (hence count[$1$2]++) with my previous code,
Code:
gawk 'BEGIN{FS=OFS="/"}
{
  array[j++] = count[$1$2]++
  n = asort(array, sorted, "@val_num_desc")
  for(i=1; i<=n; i++)
    print $0 FS sorted[i]
}' input
the output is just the input unchanged, instead of:
Code:
item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0
I don't get it...

 
Old 10-07-2012, 11:48 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
How about a simple non-awk solution?

Code:
tac infile.txt | sort -u -t '/' -k 1,2
Two caveats: tac is a GNU tool, and the pipeline relies on sort -u keeping the first of each run of key-equal lines, which GNU sort guarantees. The result also comes back sorted on the key fields rather than in original file order (the two happen to coincide for this sample).
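On the sample data (assuming it is saved as infile.txt), reversing the file puts each last occurrence first, and sort -u then keeps exactly those lines:
Code:
$ tac infile.txt | sort -u -t '/' -k 1,2
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/afr/Sen.
The same sort-based idea would also cover the occurrence-count question from earlier in the thread: append the count in awk, then sort by key ascending and count descending. A sketch, assuming GNU sort (field 5 is the count appended by the awk stage):
Code:
$ awk 'BEGIN{FS=OFS="/"} {print $0, count[$1 FS $2]++}' infile.txt | sort -t '/' -k 1,2 -k 5,5nr
item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0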
 
  


Similar Threads
Thread Thread Starter Forum Replies Last Post
does tar or bzip2 squash duplicate or near-duplicate files? garydale Linux - Software 6 11-19-2009 04:43 PM
CSV | GAWK | Record merge problem! lmedland Programming 4 07-30-2008 08:10 AM
Occurrence book(OB) nderitualex Linux - Software 0 03-21-2005 02:51 AM
Unable to record mic-in with SoundBlaster Live! while able to record other sources max76230 Linux - Newbie 2 03-14-2005 04:31 AM
Error: Acct: Couldn't insert SQL accounting START record - Duplicate entry '15212' fo ethanchic Linux - Software 0 04-11-2003 10:48 PM
