[SOLVED] To identify and count the identical string of alphabets across columns

Cheah Boon Huat · 02-26-2013, 07:42 AM

Hi,
I have a tabular file with DNA codes from 4 experimental samples, as follows (the Sample names are the column headings and the DNA names are the row headings):

Sample1 Sample2 Sample3 Sample4

DNA-1 ATTTTT CGTACG CGTACG CGTACG

DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA
(hundreds of rows to go)

Can anyone show me a script that identifies the consensus sequence across the samples for each row and also reports the number of occurrences of that consensus sequence? The output should be as follows:

Sample1 Sample2 Sample3 Sample4 Consensus No. of occurrences

DNA-1 ATTTTT CGTACG CGTACG CGTACG CGTACG 3

DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA ACGTAA 3
(hundreds of rows to go)

Thanks a lot.

David the H. · 02-26-2013, 09:03 AM

I don't understand what you mean exactly by "identify the consensus sequence". Could you please post a slightly longer example of the input text (including non-conforming lines, if any), and what you want the output of that to look like?

Most of the time this kind of thing is easily doable with a simple awk script, but we need details to proceed.

Here are a few useful awk references to start you out:
http://www.grymoire.com/Unix/Awk.html
http://www.gnu.org/software/gawk/man...ode/index.html
http://www.pement.org/awk/awk1line.txt
http://www.catonmat.net/series/awk-one-liners-explained

Oh, and please use [code][/code] tags around the code and data you post, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.

shivaa · 02-26-2013, 09:15 AM

@David:
As far as I could understand his requirement, he wants something like:

Input:

Code:

       Sample1 Sample2 Sample3 Sample4
 
DNA-1  ATTTTT  CGTACG  CGTACG  CGTACG
 
DNA-2  ACGTAA  ACGTAA  ACGTAA  AGGCAA
(Truncated)

Output:

Code:

      Sample1 Sample2 Sample3 Sample4  Repeated_genome Consensus_No._of_occurrences

DNA-1 ATTTTT  CGTACG  CGTACG  CGTACG   CGTACG          3

DNA-2 ACGTAA  ACGTAA  ACGTAA  AGGCAA   ACGTAA          3

He wants to print the string that is repeated and then count how many times it is repeated.
Hope this makes some sense now...

danielbmartin · 02-26-2013, 10:10 AM

With this InFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL

... this code ...

Code:

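MaxLines=$(wc -l < "$InFile")
rm -f "$Work2"  # Remove any stale work file; -f avoids an error if it is absent
# For each row: split into one field per line, drop the row label, count the
# duplicates, keep the most frequent one, and rewrite "count name" as
# "name Consensus=count".
for (( j=1; j<=MaxLines; j++ ))
  do
     sed -n "${j}p" "$InFile"  \
    |tr " " "\n"          \
    |sed '1d'             \
    |sort                 \
    |uniq -c              \
    |sort -nrk1           \
    |sed -n 1p            \
    |tr -s " "            \
    |sed -r 's/(.*) (.*)/\2 Consensus=\1/' \
    >> "$Work2"
  done
paste "$InFile" "$Work2" > "$OutFile"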

... produced this OutFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL	DANIEL Consensus= 3
NAMES-2 ALBERT GEORGE DANIEL ALBERT	ALBERT Consensus= 2
NAMES-3 ALFRED ALFRED ALBERT ALFRED	ALFRED Consensus= 3
NAMES-4 DANIEL BARNEY BARNEY HERMAN	BARNEY Consensus= 2
NAMES-5 EDWARD EDWARD EDWARD EDWARD	EDWARD Consensus= 4
NAMES-6 ALBERT EDWARD BARNEY DANIEL	EDWARD Consensus= 1

Daniel B. Martin

danielbmartin · 02-26-2013, 11:15 AM

With this InFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL

... this code ...

Code:

awk '{delete Names; BigCount=0;
  # The strict ">" means the first name to reach the highest count wins a tie.
  for (j=2;j<=NF;j++) {if (++Names[$j]>BigCount) {BigCount=Names[$j];BigName=$j}}
  print $0,"   ",BigName,"Consensus=",BigCount}' "$InFile" >"$OutFile"

... produced this OutFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL     DANIEL Consensus= 3
NAMES-2 ALBERT GEORGE DANIEL ALBERT     ALBERT Consensus= 2
NAMES-3 ALFRED ALFRED ALBERT ALFRED     ALFRED Consensus= 3
NAMES-4 DANIEL BARNEY BARNEY HERMAN     BARNEY Consensus= 2
NAMES-5 EDWARD EDWARD EDWARD EDWARD     EDWARD Consensus= 4
NAMES-6 ALBERT EDWARD BARNEY DANIEL     ALBERT Consensus= 1


danielbmartin · 02-26-2013, 11:51 AM

For future reference: questions such as this should be posted in the programming forum.


grail · 02-27-2013, 12:34 AM

I have an additional question ... What happens in the case of a tie, i.e. the 4 samples split evenly 2 and 2, or all 4 are unique?
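
A quick way to see what the single-winner awk above does with such rows. This is a sketch only: the two rows are made up, and `first_winner` is just a name I picked for the wrapper.

```shell
# Hypothetical wrapper around the single-winner awk posted earlier in the thread.
first_winner() {
  awk '{ delete Names; BigCount = 0
         for (j = 2; j <= NF; j++)
           if (++Names[$j] > BigCount) { BigCount++; BigName = $j }
         print $1, BigName, BigCount }'
}

# A 2-2 split and an all-unique row (made-up data).
printf '%s\n' 'DNA-1 AAAA CCCC CCCC AAAA' \
              'DNA-2 AAAA CCCC GGGG TTTT' | first_winner
```

Both rows still report a "winner": the 2-2 split gives CCCC with a count of 2, because the comparison is a strict `>` and CCCC reaches 2 first, and the all-unique row gives its first data field with a count of 1.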

Cheah Boon Huat · 02-27-2013, 03:01 AM

Thanks, everyone, for the quick replies and advice. Thank you, Daniel B. Martin, for the scripts, which worked pretty well on my file.

However, referring to grail's question, can the script be edited so as to show multiple strings that repeat at the same highest number of times?

Sorry for not including this scenario in my input file at the beginning.

danielbmartin · 02-27-2013, 09:55 AM

Quote:

Originally Posted by Cheah Boon Huat

... can the script be edited so as to show multiple strings that repeat at the same highest number of times.

With this InFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL
NAMES-7 EDWARD ALBERT BARNEY DANIEL
NAMES-8 HERMAN IRVING HERMAN IRVING

... this code ...

Code:

awk '{delete Names; ConVal=" ";
  for (j=2;j<=NF;j++) {Names[$j]++} 
  {for (k=NF-1;k>=2;k--)
     for (Name in Names)
       {if (Names[Name]==k) ConVal=ConVal" "Name"("Names[Name]")"}}
  if (ConVal==" ") ConVal="  No Consensus!"
  {print $0,ConVal} }' $InFile  >$OutFile

... produced this OutFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL   DANIEL(3)
NAMES-2 ALBERT GEORGE DANIEL ALBERT   ALBERT(2)
NAMES-3 ALFRED ALFRED ALBERT ALFRED   ALFRED(3)
NAMES-4 DANIEL BARNEY BARNEY HERMAN   BARNEY(2)
NAMES-5 EDWARD EDWARD EDWARD EDWARD   EDWARD(4)
NAMES-6 ALBERT EDWARD BARNEY DANIEL   No Consensus!
NAMES-7 EDWARD ALBERT BARNEY DANIEL   No Consensus!
NAMES-8 HERMAN IRVING HERMAN IRVING   HERMAN(2) IRVING(2)
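
Strictly speaking, the code above lists every string that occurs at least twice, not only those tied for the highest count; with four samples the two amount to the same thing. If there are more columns and only the top-scoring strings are wanted, a sketch along these lines might do. `top_ties` is a made-up name, and the order of tied names depends on awk's `for (... in ...)` traversal, so it should not be relied on.

```shell
# Hypothetical helper: print each row followed by the names tied for the top count.
top_ties() {
  awk '{ delete Names; Max = 0; ConVal = ""
         # Count each data field and remember the highest count seen.
         for (j = 2; j <= NF; j++)
           if (++Names[$j] > Max) Max = Names[$j]
         # Keep only the names whose count equals that maximum.
         if (Max > 1)
           for (Name in Names)
             if (Names[Name] == Max) ConVal = ConVal " " Name "(" Max ")"
         if (ConVal == "") ConVal = " No Consensus!"
         print $0 "  " ConVal }'
}

printf '%s\n' 'NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP' \
              'NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON' | top_ties
```

For the first row this reports only DANIEL(3) and leaves out GEORGE(2); the second row has no repeats and gets "No Consensus!".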


colucix · 02-27-2013, 10:40 AM

Hi Daniel. May I suggest a slightly different version of your code? Basically, I'd replace the outer for loop with a simple if condition:

Code:

awk '{ delete Names; ConVal=" "
       for (j=2;j<=NF;j++) Names[$j]++ 
       for (Name in Names)
         if (Names[Name] > 1)
           ConVal=ConVal" "Name"("Names[Name]")"
       if (ConVal==" ") ConVal="  No Consensus!"
       print $0, ConVal
}' $InFile > $OutFile

Also I removed some extra braces.

danielbmartin · 02-27-2013, 11:42 AM

Quote:

Originally Posted by colucix

May I suggest a slightly different version of your code?

Your version is better than mine, for the problem as stated by the OP. I wrote this code with features which were "invented" by my overactive imagination.
1) OP offered an InFile in which each line has an identifier followed by four data fields. I chose to generalize the problem to allow for lines which have an identifier followed by "n" data fields.
2) If the InFile does have more than four data fields, the Consensus Values could have an interesting mix of counts. I wanted those counts to be arranged in descending order, left to right.

With this InFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP
NAMES-2 ALBERT GEORGE DANIEL ALBERT GEORGE ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED MARCUS MARCUS
NAMES-4 DANIEL BARNEY BARNEY HERMAN HERMAN BARNEY
NAMES-5 EDWARD EDWARD EDWARD EDWARD NORMAN EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON
NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN
NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN

... this code (unchanged) ...

Code:

awk '{delete Names; ConVal=" ";
  for (j=2;j<=NF;j++) {Names[$j]++} 
  {for (k=NF-1;k>=2;k--)
     for (Name in Names)
       {if (Names[Name]==k) ConVal=ConVal" "Name"("Names[Name]")"}}
  if (ConVal==" ") ConVal="  No Consensus!"
  {print $0,ConVal} }' $InFile  >$OutFile

... produced this OutFile ...

Code:

NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP   DANIEL(3) GEORGE(2)
NAMES-2 ALBERT GEORGE DANIEL ALBERT GEORGE ALBERT   ALBERT(3) GEORGE(2)
NAMES-3 ALFRED ALFRED ALBERT ALFRED MARCUS MARCUS   ALFRED(3) MARCUS(2)
NAMES-4 DANIEL BARNEY BARNEY HERMAN HERMAN BARNEY   BARNEY(3) HERMAN(2)
NAMES-5 EDWARD EDWARD EDWARD EDWARD NORMAN EDWARD   EDWARD(5)
NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON   No Consensus!
NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN   HERMAN(3) IRVING(3)
NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN   NORMAN(4) GEORGE(2)

I am inexperienced with using arrays in awk and there might be a cleaner way to achieve the "descending order" feature. Your suggestions are accepted with gratitude.
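
One possible alternative, offered as a sketch rather than a recommendation: collect the repeated names into an indexed array and insertion-sort them by count as they arrive, so there is no need to rescan the whole array once per possible count. The function name `by_count_desc` and the `List` array are my own inventions. In gawk 4.0 or later, setting `PROCINFO["sorted_in"] = "@val_num_desc"` before the `for (Name in Names)` loop should give the same ordering with less code, but that is gawk-only.

```shell
# Hypothetical helper: repeated names per row, highest count first (portable awk).
by_count_desc() {
  awk '{ delete Names; delete List; n = 0; ConVal = ""
         for (j = 2; j <= NF; j++) Names[$j]++
         # Gather repeated names, inserting each one in descending count order.
         for (Name in Names) {
           if (Names[Name] < 2) continue
           for (i = n; i >= 1 && Names[List[i]] < Names[Name]; i--)
             List[i + 1] = List[i]
           List[i + 1] = Name
           n++
         }
         for (i = 1; i <= n; i++)
           ConVal = ConVal " " List[i] "(" Names[List[i]] ")"
         if (n == 0) ConVal = " No Consensus!"
         print $0 "  " ConVal }'
}

printf '%s\n' 'NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN' | by_count_desc
```

Names with equal counts still come out in whatever order awk happens to traverse the array.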


grail · 02-27-2013, 01:30 PM

Just for some fun ... here is what I came up with in ruby

Code:

ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 };  # Create hash and fill it
h.each_value{|v|
out = "#{out.empty? ? "" : out<<" / "}#{h.key(v)}(#{v})" if v > 1  # If the value is greater than 1 store string or append string
;h.delete(h.key(v))                       # delete associated key value pair
} if h.values.sort.max > 1;               # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?;        # Set out if still empty
puts out                                  # Print out value
' file

danielbmartin · 02-27-2013, 02:21 PM

Quote:

Originally Posted by grail

Just for some fun ... here is what I came up with in ruby

...

Blows up on my machine. Could this be a version discrepancy? This is what I have:

Code:

daniel@daniel-desktop:~$ ruby --version
ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]

This code ...

Code:

echo "Method of LQ Guru grail"
ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 };  # Create hash and fill it
h.each_value{|v|
out = "#{out.empty? ? "" : out<<" / "}#{h.key(v)}(#{v})" if v > 1  # If the value is greater than 1 store string or append string
;h.delete(h.key(v))                       # delete associated key value pair
} if h.values.sort.max > 1;               # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?;        # Set out if still empty
puts out                                  # Print out value
' $InFile > $OutFile

... produced this result ...

Code:

Method of LQ Guru grail
-e:5: undefined method `key' for {"NAMES-1"=>1, "PHILIP"=>1, "DANIEL"=>3, "GEORGE"=>2}:Hash (NoMethodError)
	from -e:3:in `each_value'
	from -e:3


grail · 02-28-2013, 02:02 AM

Hey Daniel ... you are correct, the Hash#key method is only available from Ruby v1.9 onwards. The following should work for you:

Code:

ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 };  # Create hash and fill it
h.each_pair{|k,v|
out = "#{out.empty? ? "" : out<<" / "}#{k}(#{v})" if v > 1  # If the value is greater than 1 store string or append string
;h.delete(k)                              # delete associated key value pair
} if h.values.sort.max > 1;               # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?;        # Set out if still empty
puts out                                  # Print out value
' $InFile > $OutFile

Actually this way looks a little cleaner too ... cheers

colucix · 02-28-2013, 03:54 AM

Yet another exercise using bash (with associative arrays):

Code:

#!/bin/bash
#
while read -r line
do
  #
  #  Declare associative array
  #
  declare -A array
  #
  #  Assign positional parameters
  #
  set -- $line
  #
  #  Loop over positional parameters and assign key/value pairs
  #
  while [[ $# -ge 1 ]]
  do
    array[$1]=$((${array[$1]}+1))
    shift
  done
  #
  #  Print line
  #
  echo -n "$line   "
  #
  #  Check repeated sequences
  #
  for i in "${!array[@]}"
  do
    [[ ${array[$i]} -gt 1 ]] && echo -n "$i(${array[$i]}) " && consensus=1
  done
  #
  #  End of line
  #
  [[ -n $consensus ]] && echo || echo 'No Consensus!'
  #
  #  Make clean for the next iteration
  #
  unset array consensus
done < file