Old 10-03-2012, 07:51 PM   #1
lqd9o
LQ Newbie
 
Registered: Oct 2012
Posts: 4

Rep: Reputation: Disabled
last occurrence of duplicate record, gawk


Hi Linux pros !

I am new to programming, and I am trying to use gawk to remove duplicate lines based on the first two fields, keeping only the last occurrence of each record.
I've seen plenty of commands that keep the first occurrence, but not really the last one (and it turns out to be more complicated than I thought).

The file I want to process looks like this (although the real one is much longer):
Code:
item1/ref.001/eur/Bel.
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/ame/Can.
item4/ref.003/afr/Sen.
Lines 1 and 2 have the same first two fields and are considered duplicates, so only line 2 should be kept. The same logic applies to lines 5 and 6, giving:
Code:
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/afr/Sen.
Based on what I've seen on the web, I tried:
Code:
gawk 'BEGIN{FS="/"}

{ array[$1$2] = NR
  lines[$1$2] = $0

  for(key in array)
       reverse[array[key]] = key
       for(nr=1;nr<=NR;nr++)
           if(nr in reverse)
               print lines[reverse[nr]]
}'
but it adds even more duplicates!

Thanks in advance !

 
Old 10-03-2012, 08:50 PM   #2
rosehosting.com
Member
 
Registered: Jun 2012
Location: Missouri, USA
Posts: 236

Rep: Reputation: 64
This code should work. The two loops have to go in an END block so that they run only once, after the whole file has been read; by that point pos[$1,$2] holds the line number of the last occurrence of each key.
Code:
awk 'BEGIN{FS="/"}
# each new occurrence of a key overwrites the previous line number and content
{pos[$1,$2] = NR; lines[$1,$2] = $0}
END {
  # invert the map: surviving line number -> key
  for(key in pos) reverse[pos[key]] = key
  # print the survivors in their original file order
  for(nr=1;nr<=NR;nr++)
    if(nr in reverse) print lines[reverse[nr]]
}'
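For example, with the sample data saved in a file named input and the program (the part between the single quotes) saved as keep_last.awk, both names chosen here just for illustration, a run looks like this:
Code:
$ awk -f keep_last.awk input
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/afr/Sen.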

 
1 member found this post helpful.
Old 10-04-2012, 03:32 AM   #3
lqd9o
LQ Newbie
 
Registered: Oct 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
Whoops, I forgot the END... Thanks, rosehosting.com!

Can I ask you another question related to this thread? (Maybe I should create a new thread.)
It is not strictly necessary now; I am just curious about a detail.

While I was looking for a solution to this problem, I ran into another problem when creating an array.

For each line of the previous input, I appended a field with the occurrence count of the $1-$2 pair, then wanted to sort on that new field in descending order using the "asort" function. After that I could have used the classic first-occurrence method to remove the duplicates.
The problem is that I couldn't get the right output after sorting on the last field; I wanted to obtain this:
Code:
item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0
The code I used to sort the last field in descending order:
Code:
BEGIN{FS=OFS="/"}
{
  occur = array[$1$2]++
  line = $0 FS occur
  array[occ++] = occ
  n = asort(array, "@val_num_desc")
  for(i=0; i<=n; i++){
    print $0 FS array[i]
  }
}
Do you see why it doesn't work?
I am not sure about the array in the asort function.
Can you create an array without using the "split" function?

Thanks !

 
Old 10-04-2012, 07:28 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Here is another alternative:
Code:
awk -F/ '!($1$2 in a){i=1}{a[$1$2][i++]=$0}END{for(x in a)print a[x][length(a[x])]}' file
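Note that a[$1$2][i++] is an array of arrays, which needs gawk 4.0 or later; plain awk (and older gawk) will reject it. Also, for (x in a) visits keys in no particular order, so the output is not guaranteed to come out in file order, and with interleaved keys the shared i counter can leave gaps in the inner arrays. Since only the last occurrence per key matters anyway, a plain overwrite avoids the inner arrays entirely; a minimal sketch (output order still unspecified):
Code:
awk -F/ '{a[$1,$2] = $0} END{for (x in a) print a[x]}' file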
 
Old 10-04-2012, 06:49 PM   #5
lqd9o
LQ Newbie
 
Registered: Oct 2012
Posts: 4

Original Poster
Rep: Reputation: Disabled
@grail:
Thanks! (Your code didn't work for me with plain awk, but it did with gawk.)

So if I follow your logic, we can also create an array just by writing something like this?
Code:
( items in array){ do stuff with array ... }

But when I tried to create an array built up while iterating over the input lines (hence count[$1$2]++) with my previous code,
Code:
gawk 'BEGIN{FS=OFS="/"}
{
  array[j++] = count[$1$2]++
  n = asort(array, sorted, "@val_num_desc")
  for(i=1; i<=n; i++)
    print $0 FS sorted[i]
}' input
the output is just the input unchanged, instead of:
Code:
item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0
I don't get it...

 
Old 10-07-2012, 11:48 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
How about a simple non-awk solution?

Code:
tac infile.txt | sort -u -t '/' -k 1,2
Two caveats: tac is a GNU tool, and the pipeline relies on sort -u keeping the first of each run of key-equal lines, which GNU sort guarantees. The result also comes back sorted on the key fields rather than in original file order (the two happen to coincide for this sample).
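On the sample data (assuming it is saved as infile.txt), reversing the file puts each last occurrence first, and sort -u then keeps exactly those lines:
Code:
$ tac infile.txt | sort -u -t '/' -k 1,2
item1/ref.001/eur/Spa.
item2/ref.002/eur/Ita.
item3/ref.002/asi/Chi.
item4/ref.003/afr/Sen.
The same sort-based idea would also cover the occurrence-count question from earlier in the thread: append the count in awk, then sort by key ascending and count descending. A sketch, assuming GNU sort (field 5 is the count appended by the awk stage):
Code:
$ awk 'BEGIN{FS=OFS="/"} {print $0, count[$1 FS $2]++}' infile.txt | sort -t '/' -k 1,2 -k 5,5nr
item1/ref.001/eur/Spa./1
item1/ref.001/eur/Bel./0
item2/ref.002/eur/Ita./0
item3/ref.002/asi/Chi./0
item4/ref.003/afr/Sen./1
item4/ref.003/ame/Can./0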
 
  


Similar Threads
Thread Thread Starter Forum Replies Last Post
does tar or bzip2 squash duplicate or near-duplicate files? garydale Linux - Software 6 11-19-2009 04:43 PM
CSV | GAWK | Record merge problem! lmedland Programming 4 07-30-2008 08:10 AM
Occurrence book(OB) nderitualex Linux - Software 0 03-21-2005 02:51 AM
Unable to record mic-in with SoundBlaster Live! while able to record other sources max76230 Linux - Newbie 2 03-14-2005 04:31 AM
Error: Acct: Couldn't insert SQL accounting START record - Duplicate entry '15212' fo ethanchic Linux - Software 0 04-11-2003 10:48 PM
