LinuxQuestions.org


lqd9o 10-03-2012 07:51 PM

last occurrence of duplicate record, gawk
 
Hi Linux pros !

I am new to programming, and I am trying to use gawk to remove duplicate lines based on the first 2 fields, keeping the last occurrence of each record.
I've seen a lot of commands to keep the first occurrence but not really the last one (and it seems more complicated than I thought).

The file I want to process is (although the real one is much longer):
Code:

item1/ref.001/eur/Bel.
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/ame/Can.
item4/ref.003/afr/Sen.

Lines 1 and 2 have the same first 2 fields, so they count as duplicates and only line 2 should be kept. The same logic applies to lines 5 and 6, so the expected output is:
Code:

item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/afr/Sen.

Based on what I've seen on the web, I tried:
Code:

gawk 'BEGIN{FS="/"}

{ array[$1$2] = NR
  lines[$1$2] = $0

  for(key in array)
      reverse[array[key]] = key
      for(nr=1;nr<=NR;nr++)
          if(nr in reverse)
              print lines[reverse[nr]]
}'

but it adds more duplicates !!!

Thanks in advance !

rosehosting.com 10-03-2012 08:50 PM

This code should work.
Code:

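awk 'BEGIN{FS="/"}
# For each key, remember the number and text of the most recent line seen.
{pos[$1,$2] = NR; lines[$1,$2] = $0}
END {
  # Invert the map (line number -> key), then print in line-number order.
  for(key in pos) reverse[pos[key]] = key
  for(nr=1;nr<=NR;nr++)
    if(nr in reverse) print lines[reverse[nr]]
}'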
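
For anyone comparing this with the original attempt: the structural change is the END block. A rule without a pattern runs once for every input line, so the printing loop in the first post ran again after each record, whereas END runs once, after the last line. A minimal sketch of that difference (the function names are made up for illustration):

```shell
# Hypothetical helper names, for illustration only.

# A pattern-less rule runs once per input line.
per_record() {
  awk '{ print "lines so far:", NR }'
}

# An END rule runs once, after all input has been read.
once_at_end() {
  awk 'END { print "total:", NR }'
}

printf 'a\nb\nc\n' | per_record    # one output line per input line
printf 'a\nb\nc\n' | once_at_end   # a single output line
```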


lqd9o 10-04-2012 03:32 AM

Whoops, I forgot the END... Thanks rosehosting.com !

Can I ask you another question related to this thread? (maybe I should create a new thread)
Even if it is not necessary now, I am just curious about a detail.

While I was looking for a solution to this problem, I ran into another problem when creating an array.

For each line of the previous input, I added a field holding the occurrence count of the $1/$2 pair, and wanted to sort on that new field in descending order using the "asort" function. Then I could have used the classic way of removing duplicates, which keeps the first instance.
The problem is that I couldn't get the right output after sorting on the last field. What I wanted was:
Code:

item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0

The code I used to sort the last field by descending order:
Code:

BEGIN{FS=OFS="/"}

{occur = array[$1$2]++
line = $0 FS occur

array[occ++] = occ
n = asort(array, "@val_num_desc")
for(i=0; i<=n; i++){
    print $0 FS array[i]
}
}

Do you see why it doesn't work?
I am not sure about the array in the asort function.
Can you create an array without using the "split" function?

Thanks !
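
On the asort() part of the question, a small sketch may help, assuming gawk 4.0 or later (the named orderings such as "@val_num_desc" are a gawk extension). It shows three things: assigning to an element is enough to create an array, so split() is not required; the second argument of asort() is the destination array, so the ordering string has to go third; and the sorted result is indexed from 1 to n, not from 0. The array contents and the function name here are invented for the example.

```shell
# Hypothetical helper; the values in counts[] are made-up sample data.
sorted_counts() {
  gawk 'BEGIN {
    # Assigning to an element creates the array; no split() needed.
    counts[1] = 0; counts[2] = 1; counts[3] = 0; counts[4] = 2

    # Three-argument form: source array, destination array, ordering name.
    n = asort(counts, sorted, "@val_num_desc")

    # asort() renumbers the destination from 1 to n.
    for (i = 1; i <= n; i++)
      printf "%s%s", sorted[i], (i < n ? " " : "\n")
  }'
}

sorted_counts
```

Note that asort() only keeps the values, so the link between a count and the line it came from is lost unless the lines are stored separately.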

grail 10-04-2012 07:28 AM

Here is another alternative:
Code:

awk -F/ '!($1$2 in a){i=1}{a[$1$2][i++]=$0}END{for(x in a)print a[x][length(a[x])]}' file
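
A few caveats on the one-liner above, as far as I can tell: it needs gawk 4 (arrays of arrays), the "for (x in a)" loop prints keys in no guaranteed order, and the single counter i is shared by all keys, so duplicates that are not on adjacent lines can leave gaps that length() does not account for. Concatenating $1$2 without a separator can also make two different field pairs collide. A sketch that sidesteps these in plain awk, keeping keys in order of first appearance (the function name is hypothetical):

```shell
# Hypothetical helper: keep the last line for each (field 1, field 2) pair.
keep_last() {
  awk -F/ '
    { key = $1 SUBSEP $2 }                 # same key as the $1,$2 subscript form
    !(key in last) { order[++n] = key }    # remember order of first appearance
    { last[key] = $0 }                     # a later line overwrites an earlier one
    END { for (i = 1; i <= n; i++) print last[order[i]] }
  '
}

# Sample data from the first post.
printf '%s\n' \
  'item1/ref.001/eur/Bel.' 'item1/ref.001/eur/Spa.' \
  'item2/ref.002/eur/Ita.' 'item3/ref.002/asi/Chi.' \
  'item4/ref.003/ame/Can.' 'item4/ref.003/afr/Sen.' | keep_last
```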

lqd9o 10-04-2012 06:49 PM

@grail:
Thanks! (Your code didn't work for me with awk, only with gawk.)

So if I follow your logic, we can also create an array by writing
Code:

( items in array){ do stuff with array ... }
?

But when I tried to build an array while iterating over the input lines (using count[$1$2]++) with a variation of my previous code,
Code:

gawk 'BEGIN{FS=OFS="/"}
{
  array[j++] = count[$1$2]++
  n = asort(array, sorted, "@val_num_desc")
  for(i=1; i<=n; i++)
    print $0 FS sorted[i]
}' input

the input file is unchanged, instead of getting:
Code:

item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0

I don't get it...
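
A possible explanation, offered as a sketch rather than a definitive answer: the main block runs once per input line, so asort() and the printing loop are repeated for every record, and $0 inside the loop is always the current line, never an earlier one. "(key in array)" is only a membership test; an array comes into existence when an element is assigned. One way to get the listing above is to store every line under its key and occurrence number, then print from END. This version uses plain awk and no asort(); the function and variable names are invented for the example.

```shell
# Hypothetical helper: append each line's occurrence number for its
# (field 1, field 2) pair, listing the latest occurrence first.
tag_occurrences() {
  awk 'BEGIN { FS = OFS = "/" }
  {
    key = $1 SUBSEP $2
    if (!(key in count)) order[++nkeys] = key   # order of first appearance
    line[key, count[key]++] = $0                # occurrence numbers 0, 1, 2, ...
  }
  END {
    for (k = 1; k <= nkeys; k++) {
      key = order[k]
      for (c = count[key] - 1; c >= 0; c--)     # highest occurrence first
        print line[key, c], c
    }
  }'
}

# Sample data from the first post.
printf '%s\n' \
  'item1/ref.001/eur/Bel.' 'item1/ref.001/eur/Spa.' \
  'item2/ref.002/eur/Ita.' 'item3/ref.002/asi/Chi.' \
  'item4/ref.003/ame/Can.' 'item4/ref.003/afr/Sen.' | tag_occurrences
```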

David the H. 10-07-2012 11:48 AM

How about a simple non-awk solution?

Code:

tac infile.txt | sort -u -t '/' -k 1,2
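It relies on GNU tools, however: "tac" comes from GNU coreutils, and although "-u" itself is standard, POSIX does not say which of the equal-key lines is kept, whereas GNU sort documents that it keeps the first one. The output also comes out sorted by key rather than in the original order.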
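
If the original line order matters, a variant of the same idea is to reverse the input, keep the first occurrence of each key with the usual awk idiom, and reverse again. This is only a sketch; it still assumes "tac" is available, and the function name is made up.

```shell
# Hypothetical helper: keep the last occurrence of each key, in input order.
keep_last_in_order() {
  # After the first tac, the last occurrence of a key is the first one seen.
  tac | awk -F/ '!seen[$1,$2]++' | tac
}

# Sample data from the first post.
printf '%s\n' \
  'item1/ref.001/eur/Bel.' 'item1/ref.001/eur/Spa.' \
  'item2/ref.002/eur/Ita.' 'item3/ref.002/asi/Chi.' \
  'item4/ref.003/ame/Can.' 'item4/ref.003/afr/Sen.' | keep_last_in_order
```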


All times are GMT -5.