[SOLVED] To identify and count the identical string of alphabets across columns
Hi,
I have a tabular file containing DNA sequences from 4 experimental samples, as follows (each column is a sample and each row is a DNA):
Sample1 Sample2 Sample3 Sample4
DNA-1 ATTTTT CGTACG CGTACG CGTACG
DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA
(hundreds of rows to go)
Can anyone show me a script that identifies the consensus sequence across the samples for each row and also reports the number of occurrences of that consensus sequence? The desired output is as follows:
Sample1 Sample2 Sample3 Sample4 Consensus No. of occurrences
DNA-1 ATTTTT CGTACG CGTACG CGTACG CGTACG 3
DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA ACGTAA 3
(hundreds of rows to go)
I don't understand what you mean exactly by "identify the consensus sequence". Could you please post a slightly longer example of the input text (including non-conforming lines, if any), and what you want the output of that to look like?
Most of the time this kind of thing is easily doable with a simple awk script, but we need details to proceed.
Oh, and please use [code][/code] tags around the code and data you post, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.
NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL
NAMES-1 GEORGE DANIEL DANIEL DANIEL DANIEL Consensus= 3
NAMES-2 ALBERT GEORGE DANIEL ALBERT ALBERT Consensus= 2
NAMES-3 ALFRED ALFRED ALBERT ALFRED ALFRED Consensus= 3
NAMES-4 DANIEL BARNEY BARNEY HERMAN BARNEY Consensus= 2
NAMES-5 EDWARD EDWARD EDWARD EDWARD EDWARD Consensus= 4
NAMES-6 ALBERT EDWARD BARNEY DANIEL EDWARD Consensus= 1
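A minimal sketch, in Ruby, of one way to produce a consensus column and count like the output above; it is not necessarily how that output was generated. It assumes whitespace-separated fields with the row label first. The method name `consensus` is made up for illustration, and ties are broken by first appearance, which is an assumption: the NAMES-6 row above, where every name occurs once, was evidently resolved differently.

```ruby
# Hypothetical helper: the most frequent data field on one line and its count.
# Assumes the first whitespace-separated field is the row label.
def consensus(line)
  counts = Hash.new(0)
  line.split.drop(1).each { |f| counts[f] += 1 }   # skip the label, tally the rest
  best = nil
  # Hashes keep insertion order (Ruby 1.9+), so on a tie the value seen
  # first wins. That tie rule is an assumption, not taken from the thread.
  counts.each { |seq, n| best = [seq, n] if best.nil? || n > best[1] }
  best
end

sample = [
  "DNA-1 ATTTTT CGTACG CGTACG CGTACG",
  "DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA"
]
sample.each do |line|
  seq, n = consensus(line)
  puts "#{line} #{seq} #{n}"
end
```

For the two DNA rows from the first post this reports CGTACG and ACGTAA, each with a count of 3.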
Daniel B. Martin
Last edited by danielbmartin; 02-26-2013 at 05:40 PM.
Reason: Improve code readability
... can the script be edited so as to show multiple strings that repeat at the same highest number of times.
With this InFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL
NAMES-7 EDWARD ALBERT BARNEY DANIEL
NAMES-8 HERMAN IRVING HERMAN IRVING
... this code ...
Code:
awk '{
  delete Names; ConVal=" "               # reset the counts for each line
  for (j=2; j<=NF; j++) Names[$j]++      # count every data field (field 1 is the label)
  for (k=NF-1; k>=2; k--)                # walk candidate counts, highest first
    for (Name in Names)
      if (Names[Name]==k) ConVal=ConVal" "Name"("Names[Name]")"
  if (ConVal==" ") ConVal=" No Consensus!"
  print $0, ConVal
}' "$InFile" >"$OutFile"
... produced this OutFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL DANIEL(3)
NAMES-2 ALBERT GEORGE DANIEL ALBERT ALBERT(2)
NAMES-3 ALFRED ALFRED ALBERT ALFRED ALFRED(3)
NAMES-4 DANIEL BARNEY BARNEY HERMAN BARNEY(2)
NAMES-5 EDWARD EDWARD EDWARD EDWARD EDWARD(4)
NAMES-6 ALBERT EDWARD BARNEY DANIEL No Consensus!
NAMES-7 EDWARD ALBERT BARNEY DANIEL No Consensus!
NAMES-8 HERMAN IRVING HERMAN IRVING HERMAN(2) IRVING(2)
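The awk code above lists every name that occurs at least twice, highest count first. With four data fields that is the same set as the names tied for the highest count, but with more fields the two can differ. A sketch in Ruby that reports only the names tied for the highest count; the method name `top_ties` is hypothetical, and the whitespace-separated layout with the label in the first field is assumed from the sample data.

```ruby
# Hypothetical helper: all data fields tied for the highest count on a line.
# Returns "No Consensus!" when nothing occurs more than once.
def top_ties(line)
  counts = Hash.new(0)
  line.split.drop(1).each { |f| counts[f] += 1 }   # skip the row label
  max = counts.values.max
  return "No Consensus!" if max.nil? || max < 2
  # select keeps first-seen order, so ties print in the order they appear
  tied = counts.select { |_, n| n == max }
  tied.map { |name, n| "#{name}(#{n})" }.join(" ")
end

puts top_ties("NAMES-8 HERMAN IRVING HERMAN IRVING")
puts top_ties("NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP")
```

The second call uses a made-up six-field line: GEORGE occurs twice there, but only DANIEL, with three occurrences, is reported.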
May I suggest a slightly different version of your code?
Your version is better than mine, for the problem as stated by the OP. I wrote this code with features which were "invented" by my overactive imagination.
1) OP offered an InFile in which each line has an identifier followed by four data fields. I chose to generalize the problem to allow for lines which have an identifier followed by "n" data fields.
2) If the InFile does have more than four data fields the Consensus Values could have an interesting mix of numbers. I wanted those numbers to be arranged in descending order, left-to-right.
With this InFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP
NAMES-2 ALBERT GEORGE DANIEL ALBERT GEORGE ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED MARCUS MARCUS
NAMES-4 DANIEL BARNEY BARNEY HERMAN HERMAN BARNEY
NAMES-5 EDWARD EDWARD EDWARD EDWARD NORMAN EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON
NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN
NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN
... this code (unchanged) ...
Code:
awk '{
  delete Names; ConVal=" "               # reset the counts for each line
  for (j=2; j<=NF; j++) Names[$j]++      # count every data field (field 1 is the label)
  for (k=NF-1; k>=2; k--)                # walk candidate counts, highest first
    for (Name in Names)
      if (Names[Name]==k) ConVal=ConVal" "Name"("Names[Name]")"
  if (ConVal==" ") ConVal=" No Consensus!"
  print $0, ConVal
}' "$InFile" >"$OutFile"
... produced this OutFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP DANIEL(3) GEORGE(2)
NAMES-2 ALBERT GEORGE DANIEL ALBERT GEORGE ALBERT ALBERT(3) GEORGE(2)
NAMES-3 ALFRED ALFRED ALBERT ALFRED MARCUS MARCUS ALFRED(3) MARCUS(2)
NAMES-4 DANIEL BARNEY BARNEY HERMAN HERMAN BARNEY BARNEY(3) HERMAN(2)
NAMES-5 EDWARD EDWARD EDWARD EDWARD NORMAN EDWARD EDWARD(5)
NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON No Consensus!
NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN HERMAN(3) IRVING(3)
NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN NORMAN(4) GEORGE(2)
I am inexperienced with using arrays in awk and there might be a cleaner way to achieve the "descending order" feature. Your suggestions are accepted with gratitude.
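One possible alternative for the "descending order" feature is to sort the (name, count) pairs once instead of looping over every candidate count. POSIX awk has no built-in sort function, so the sketch below uses Ruby. The method name `ranked` is hypothetical, and sorting ties by name is an added assumption to make the output order predictable.

```ruby
# Hypothetical helper: every repeated data field, highest count first.
def ranked(line)
  counts = Hash.new(0)
  line.split.drop(1).each { |f| counts[f] += 1 }   # skip the row label
  repeated = counts.select { |_, n| n > 1 }.to_a   # [[name, count], ...]
  return "No Consensus!" if repeated.empty?
  # Sort by count descending, then by name so ties come out in a fixed order.
  sorted = repeated.sort_by { |name, n| [-n, name] }
  sorted.map { |name, n| "#{name}(#{n})" }.join(" ")
end

puts ranked("NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN")
puts ranked("NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN")
```

For these two lines the result matches the OutFile above: NORMAN(4) GEORGE(2) and HERMAN(3) IRVING(3).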
Just for some fun ... here is what I came up with in ruby
Code:
ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 }; # Create hash and fill it
h.each_value{|v|
out = "#{out.empty? ? "" : out<<" / "}#{h.key(v)}(#{v})" if v > 1 # If the value is greater than 1 store string or append string
;h.delete(h.key(v)) # delete associated key value pair
} if h.values.sort.max > 1; # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?; # Set out if still empty
puts out # Print out value
' file
echo "Method of LQ Guru grail"
ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 }; # Create hash and fill it
h.each_value{|v|
out = "#{out.empty? ? "" : out<<" / "}#{h.key(v)}(#{v})" if v > 1 # If the value is greater than 1 store string or append string
;h.delete(h.key(v)) # delete associated key value pair
} if h.values.sort.max > 1; # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?; # Set out if still empty
puts out # Print out value
' $InFile > $OutFile
... produced this result ...
Code:
Method of LQ Guru grail
-e:5: undefined method `key' for {"NAMES-1"=>1, "PHILIP"=>1, "DANIEL"=>3, "GEORGE"=>2}:Hash (NoMethodError)
from -e:3:in `each_value'
from -e:3
Daniel B. Martin
Last edited by danielbmartin; 02-27-2013 at 08:46 PM.
Reason: Correct attribution
Hey Daniel ... you are correct that the key function is new from v1.9 onwards. The following should work for you:
Code:
ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 }; # Create hash and fill it
h.each_pair{|k,v|
out = "#{out.empty? ? "" : out<<" / "}#{k}(#{v})" if v > 1 # If the value is greater than 1 store string or append string
;h.delete(k) # delete associated key value pair
} if h.values.sort.max > 1; # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?; # Set out if still empty
puts out # Print out value
' $InFile > $OutFile
Actually this way looks a little cleaner too ... cheers