Old 07-11-2012, 01:06 PM   #1
jv61
LQ Newbie
 
Registered: May 2012
Posts: 24

Rename duplicate values in a column


Hi all,

I have a file like the one given below. I would like to rename the unique values in the first column by adding a, b, c, etc. at the end for each occurrence, and add something like NU at the beginning for non-unique lines. Any ideas on how to do this using awk/sed or any other programming language?

Code:
Contig  Data1 Data2

con1    pass   pass
con2    pass   pass
con3    pass     -
con3    fail   pass
con3    pass   fail
con4    fail   pass
con5    pass   fail
con5    fail   fail
My result file should look something like this:

Code:
Contig     Data1 Data2

NU_con1    pass   pass
NU_con2    pass   pass
con3a      pass     -
con3b      fail   pass
NU_con3    pass   fail
NU_con4    fail   pass
con5a      pass   fail
con5b      fail   fail
Thanks in advance,
 
Old 07-11-2012, 06:36 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

I think you'll need to better explain what qualifies a line as not unique.

In your example there's only one occurrence of con1, yet you flag it as NU. Why?
Why does NU_con3 spring into existence, but the last con5 turns into a con5b?



Cheers,
Tink
 
Old 07-12-2012, 03:59 AM   #3
jv61
LQ Newbie
 
Registered: May 2012
Posts: 24

Original Poster
I am very sorry for the confusion. There is an error in the result file I gave in my previous post. Here is a better explanation.

What I want to do is give unique values one ID and non-unique values another ID. For the non-unique values, I would like the added IDs to be numbered in series. So my result file should look something like this:

Code:
Contig             Data1 Data2

con1_Uniq          pass   pass
con2_Uniq          pass   pass
con3_NotUniq_1     pass     -
con3_NotUniq_2     fail   pass
con3_NotUniq_3     pass   fail
con4_Uniq          fail   pass
con5_NotUniq_1     pass   fail
con5_NotUniq_2     fail   fail
Thanks
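
For reference, here is a minimal two-pass awk sketch of this renaming (just one possible approach, assuming whitespace-separated columns, a single header line, and an input file small enough to read twice; the data does not need to be sorted, though the output collapses the column spacing). Here, file stands for the data file, passed twice:

Code:
awk '
    # First pass: count how many times each contig appears (skip the header and blank lines)
    NR == FNR { if (FNR > 1 && NF) seen[$1]++; next }
    # Second pass: print the header and blank lines unchanged
    FNR == 1 || !NF { print; next }
    {
        if (seen[$1] > 1)
            $1 = $1 "_NotUniq_" (++idx[$1])   # repeated contig: number each occurrence
        else
            $1 = $1 "_Uniq"                   # contig occurs only once
        print
    }
' file file
Reading the file twice lets the first pass build the occurrence counts, so the second pass knows up front whether a contig is unique; piping the result through column -t restores the alignment if needed.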
 
Old 07-12-2012, 07:51 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008

Is the data sorted by the first column?
 
Old 07-12-2012, 08:12 AM   #5
Farzan Mufti
LQ Newbie
 
Registered: Feb 2007
Location: USA/Canada
Distribution: Red Hat, CentOS, Scientific, Fedora, Ubuntu, SUSE, SLES
Posts: 14

Bash script solution

Here's a solution. I have made a couple of assumptions:
1. You are using a version of Bash that supports associative arrays (Bash 4.0 or later). You can check by issuing the command: declare -A arr
2. The actual data starts at line 3, so I process the file from line 3 onwards.

Code:
#!/bin/bash

FILE=$1

# Find all first-column values that occur more than once
dups=$(awk '{print $1}' "$FILE" | sort | uniq -d)

# Keep an occurrence counter for each duplicate value
declare -A count
for val in $dups
do
    count[$val]=1
done

# Now let's process the data one line at a time (skipping the two header lines)
sed -n '3,$ p' "$FILE" | while read -r line
do
    # Get the first field
    f1=$(echo "$line" | awk '{print $1}')
    if [[ -n ${count[$f1]} ]]
    then
        # Value is a duplicate: tag it with its occurrence number
        echo "$line" | sed "s/^\($f1\)/\1_NotUniq_${count[$f1]}/"
        (( count[$f1]++ ))
    else
        # Value occurs only once
        echo "$line" | sed "s/^\($f1\)/\1_Uniq/"
    fi
done

USAGE:
Code:
./script_name filename
NOTE: I have tested the code before posting.



From: Farzan Jameel Mufti Thursday July 12, 2012
 
  

