LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.

02-26-2013, 07:42 AM   #1
Cheah Boon Huat
LQ Newbie
 
Registered: Feb 2013
Posts: 10

Rep: Reputation: Disabled
To identify and count the identical string of alphabets across columns


Hi,
I have a tabular file with DNA codes from 4 experimental samples, as follows (the columns are headed by sample and the rows by DNA identifier):

Sample1 Sample2 Sample3 Sample4

DNA-1 ATTTTT CGTACG CGTACG CGTACG

DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA
(hundreds of rows to go)

Can anyone show me a script that identifies the consensus sequence across the samples for each row and also reports the number of occurrences of that consensus sequence? The desired output is as follows:

Sample1 Sample2 Sample3 Sample4 Consensus No. of occurrences

DNA-1 ATTTTT CGTACG CGTACG CGTACG CGTACG 3

DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA ACGTAA 3
(hundreds of rows to go)

Thanks a lot.
 
02-26-2013, 09:03 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1958
I don't understand what you mean exactly by "identify the consensus sequence". Could you please post a slightly longer example of the input text (including non-conforming lines, if any), and what you want the output of that to look like?

Most of the time this kind of thing is easily doable with a simple awk script, but we need details to proceed.

Here are a few useful awk references to start you out:
http://www.grymoire.com/Unix/Awk.html
http://www.gnu.org/software/gawk/man...ode/index.html
http://www.pement.org/awk/awk1line.txt
http://www.catonmat.net/series/awk-one-liners-explained


Oh, and please use [code][/code] tags around the code and data you post, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.
 
02-26-2013, 09:15 AM   #3
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,800
Blog Entries: 4

Rep: Reputation: 286
@David:
As far as I understand his requirement, he wants something like:

Input:
Code:
       Sample1 Sample2 Sample3 Sample4
 
DNA-1  ATTTTT  CGTACG  CGTACG  CGTACG
 
DNA-2  ACGTAA  ACGTAA  ACGTAA  AGGCAA
(Truncated)
Output:
Code:
      Sample1 Sample2 Sample3 Sample4  Repeated_genome Consensus_No._of_occurrences

DNA-1 ATTTTT  CGTACG  CGTACG  CGTACG   CGTACG          3

DNA-2 ACGTAA  ACGTAA  ACGTAA  AGGCAA   ACGTAA          3
He wants to print the string that is repeated and then count how many times it is repeated.
Hope this makes some sense now...
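
For what it is worth, the requirement as restated above can be sketched in a few lines of awk wrapped in a shell function. This is only a sketch under assumptions that are mine rather than the original poster's: the function name add_consensus and the column titles Consensus and Occurrences are made up, the first non-blank line is taken to be the header row, blank lines are skipped, and on a tie the string that first reaches the highest count is the one reported.

```shell
#!/bin/sh
# Sketch: append the most frequent string and its count to every data row.
# Assumptions (not from the thread): whitespace-separated fields, the first
# non-blank line is the header row, and blank lines are skipped.
add_consensus() {
  awk '
    NF == 0 { next }                      # skip blank separator lines
    !seen_header {                        # first non-blank line: the header
      seen_header = 1
      print $0, "Consensus", "Occurrences"
      next
    }
    {
      split("", count)                    # portable way to empty an array
      best = ""; best_n = 0
      for (j = 2; j <= NF; j++)           # field 1 is the row label
        if (++count[$j] > best_n) { best_n = count[$j]; best = $j }
      print $0, best, best_n
    }
  '
}

# Usage with the sample rows from the first post
printf '%s\n' \
  'Sample1 Sample2 Sample3 Sample4' \
  '' \
  'DNA-1 ATTTTT CGTACG CGTACG CGTACG' \
  '' \
  'DNA-2 ACGTAA ACGTAA ACGTAA AGGCAA' | add_consensus
```

With those two sample rows it reports CGTACG 3 for DNA-1 and ACGTAA 3 for DNA-2. Rows where two strings share the top count need a decision that the requirement does not yet spell out.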
 
02-26-2013, 10:10 AM   #4
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,513

Rep: Reputation: 437
With this InFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL
... this code ...
Code:
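MaxLines=$(wc -l < "$InFile")
rm -f "$Work2"  # Blow away work file (-f: no error if it does not exist yet)
for (( j=1; j<=MaxLines; j++ ))
  do
     sed -n "${j}p" "$InFile" |   # take line j
     tr " " "\n"              |   # one field per line
     sed '1d'                 |   # drop the row identifier
     sort                     |   # group identical strings
     uniq -c                  |   # count each group
     sort -nrk1               |   # highest count first
     sed -n 1p                |   # keep only the top entry
     tr -s " "                |   # squeeze the padding added by uniq -c
     sed -r 's/(.*) (.*)/\2 Consensus=\1/' \
    >> "$Work2"
  done
paste "$InFile" "$Work2" > "$OutFile"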
... produced this OutFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL	DANIEL Consensus= 3
NAMES-2 ALBERT GEORGE DANIEL ALBERT	ALBERT Consensus= 2
NAMES-3 ALFRED ALFRED ALBERT ALFRED	ALFRED Consensus= 3
NAMES-4 DANIEL BARNEY BARNEY HERMAN	BARNEY Consensus= 2
NAMES-5 EDWARD EDWARD EDWARD EDWARD	EDWARD Consensus= 4
NAMES-6 ALBERT EDWARD BARNEY DANIEL	EDWARD Consensus= 1
Daniel B. Martin
 
02-26-2013, 11:15 AM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,513

Rep: Reputation: 437
With this InFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL
... this code ...
Code:
awk '{delete Names; BigCount=0;
  for (j=2;j<=NF;j++) {if (++Names[$j]>BigCount) {BigCount++;BigName=$j}} 
  print $0,"   ",BigName,"Consensus=",BigCount}' $InFile >$OutFile
... produced this OutFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL     DANIEL Consensus= 3
NAMES-2 ALBERT GEORGE DANIEL ALBERT     ALBERT Consensus= 2
NAMES-3 ALFRED ALFRED ALBERT ALFRED     ALFRED Consensus= 3
NAMES-4 DANIEL BARNEY BARNEY HERMAN     BARNEY Consensus= 2
NAMES-5 EDWARD EDWARD EDWARD EDWARD     EDWARD Consensus= 4
NAMES-6 ALBERT EDWARD BARNEY DANIEL     ALBERT Consensus= 1
Daniel B. Martin

Last edited by danielbmartin; 02-26-2013 at 05:40 PM. Reason: Improve code readability
 
02-26-2013, 11:51 AM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,513

Rep: Reputation: 437
For future reference: questions such as this should be posted in the programming forum.

Daniel B. Martin
 
02-27-2013, 12:34 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,437

Rep: Reputation: 2842
I have an additional question ... What happens in the case of a tie, i.e. the 4 samples split evenly 2 and 2, or all 4 are unique?
 
02-27-2013, 03:01 AM   #8
Cheah Boon Huat
LQ Newbie
 
Registered: Feb 2013
Posts: 10

Original Poster
Rep: Reputation: Disabled
Thanks, everyone, for the quick replies and advice. Thank you, Daniel B. Martin, for the scripts, which worked pretty well for my file.

However, referring to grail's question, can the script be edited so as to show multiple strings that repeat at the same highest number of times?

Sorry for not including this scenario in my input file at the beginning.

Last edited by Cheah Boon Huat; 02-27-2013 at 03:19 AM.
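
The tie scenario asked about here could be handled along the following lines. This is a minimal sketch, not the method posted elsewhere in this thread: the function name top_ties is hypothetical, and it assumes that a row in which no string repeats has no consensus. The strings are recorded in first-seen order so that the output does not depend on awk's array traversal order.

```shell
#!/bin/sh
# Sketch: list every string that shares the highest count in a row.
# Assumption (mine): a row whose strings are all different has no consensus.
top_ties() {
  awk '
    {
      split("", count); split("", order)  # reset per-row state portably
      n = 0; max = 0
      for (j = 2; j <= NF; j++) {         # field 1 is the row identifier
        if (!($j in count)) order[++n] = $j   # remember first-seen order
        if (++count[$j] > max) max = count[$j]
      }
      result = ""
      if (max > 1)
        for (i = 1; i <= n; i++)
          if (count[order[i]] == max)
            result = result " " order[i] "(" max ")"
      if (result == "") result = " No Consensus!"
      print $0 "  " result
    }
  '
}

# Usage: a clear winner, an all-unique row, and a 2/2 split
printf '%s\n' \
  'NAMES-1 GEORGE DANIEL DANIEL DANIEL' \
  'NAMES-6 ALBERT EDWARD BARNEY DANIEL' \
  'NAMES-8 HERMAN IRVING HERMAN IRVING' | top_ties
```

Only the strings at the top count are listed, so a row such as A A A B B would show A(3) and leave B out.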
 
02-27-2013, 09:55 AM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,513

Rep: Reputation: 437
Quote:
Originally Posted by Cheah Boon Huat
... can the script be edited so as to show multiple strings that repeat at the same highest number of times.
With this InFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL
NAMES-2 ALBERT GEORGE DANIEL ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED
NAMES-4 DANIEL BARNEY BARNEY HERMAN
NAMES-5 EDWARD EDWARD EDWARD EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL
NAMES-7 EDWARD ALBERT BARNEY DANIEL
NAMES-8 HERMAN IRVING HERMAN IRVING
... this code ...
Code:
awk '{delete Names; ConVal=" ";
  for (j=2;j<=NF;j++) {Names[$j]++} 
  {for (k=NF-1;k>=2;k--)
     for (Name in Names)
       {if (Names[Name]==k) ConVal=ConVal" "Name"("Names[Name]")"}}
  if (ConVal==" ") ConVal="  No Consensus!"
  {print $0,ConVal} }' $InFile  >$OutFile
... produced this OutFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL   DANIEL(3)
NAMES-2 ALBERT GEORGE DANIEL ALBERT   ALBERT(2)
NAMES-3 ALFRED ALFRED ALBERT ALFRED   ALFRED(3)
NAMES-4 DANIEL BARNEY BARNEY HERMAN   BARNEY(2)
NAMES-5 EDWARD EDWARD EDWARD EDWARD   EDWARD(4)
NAMES-6 ALBERT EDWARD BARNEY DANIEL   No Consensus!
NAMES-7 EDWARD ALBERT BARNEY DANIEL   No Consensus!
NAMES-8 HERMAN IRVING HERMAN IRVING   HERMAN(2) IRVING(2)
Daniel B. Martin
 
02-27-2013, 10:40 AM   #10
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1978
Hi Daniel. May I suggest a slightly different version of your code? Basically, I'd replace the outer for loop with a simple if condition:
Code:
awk '{ delete Names; ConVal=" "
       for (j=2;j<=NF;j++) Names[$j]++ 
       for (Name in Names)
         if (Names[Name] > 1)
           ConVal=ConVal" "Name"("Names[Name]")"
       if (ConVal==" ") ConVal="  No Consensus!"
       print $0, ConVal
}' $InFile > $OutFile
Also I removed some extra braces.
 
02-27-2013, 11:42 AM   #11
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,513

Rep: Reputation: 437
Quote:
Originally Posted by colucix
May I suggest a slightly different version of your code?
Your version is better than mine, for the problem as stated by the OP. I wrote this code with features which were "invented" by my overactive imagination.
1) OP offered an InFile in which each line has an identifier followed by four data fields. I chose to generalize the problem to allow for lines which have an identifier followed by "n" data fields.
2) If the InFile does have more than four data fields, the Consensus Values could have an interesting mix of numbers. I wanted those numbers to be arranged in descending order, left to right.

With this InFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP
NAMES-2 ALBERT GEORGE DANIEL ALBERT GEORGE ALBERT
NAMES-3 ALFRED ALFRED ALBERT ALFRED MARCUS MARCUS
NAMES-4 DANIEL BARNEY BARNEY HERMAN HERMAN BARNEY
NAMES-5 EDWARD EDWARD EDWARD EDWARD NORMAN EDWARD
NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON
NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN
NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN
... this code (unchanged) ...
Code:
awk '{delete Names; ConVal=" ";
  for (j=2;j<=NF;j++) {Names[$j]++} 
  {for (k=NF-1;k>=2;k--)
     for (Name in Names)
       {if (Names[Name]==k) ConVal=ConVal" "Name"("Names[Name]")"}}
  if (ConVal==" ") ConVal="  No Consensus!"
  {print $0,ConVal} }' $InFile  >$OutFile
... produced this OutFile ...
Code:
NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP   DANIEL(3) GEORGE(2)
NAMES-2 ALBERT GEORGE DANIEL ALBERT GEORGE ALBERT   ALBERT(3) GEORGE(2)
NAMES-3 ALFRED ALFRED ALBERT ALFRED MARCUS MARCUS   ALFRED(3) MARCUS(2)
NAMES-4 DANIEL BARNEY BARNEY HERMAN HERMAN BARNEY   BARNEY(3) HERMAN(2)
NAMES-5 EDWARD EDWARD EDWARD EDWARD NORMAN EDWARD   EDWARD(5)
NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON   No Consensus!
NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN   HERMAN(3) IRVING(3)
NAMES-8 NORMAN GEORGE NORMAN NORMAN GEORGE NORMAN   NORMAN(4) GEORGE(2)
I am inexperienced with using arrays in awk and there might be a cleaner way to achieve the "descending order" feature. Your suggestions are accepted with gratitude.

Daniel B. Martin
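
One possible way to get the descending order without rescanning the whole array for every count is to group the strings into one bucket per count and then walk the buckets from the highest count down. The following is a sketch only: ranked_repeats, order and bucket are names invented for illustration, and it sticks to POSIX awk features. gawk 4.0 and later can also sort the traversal itself through PROCINFO["sorted_in"], which is left out here so that the sketch stays portable.

```shell
#!/bin/sh
# Sketch: list every repeated string, highest count first. Strings with the
# same count keep the order in which they first appear in the row.
ranked_repeats() {
  awk '
    {
      split("", count); split("", order); split("", bucket)
      n = 0; max = 0
      for (j = 2; j <= NF; j++) {         # field 1 is the row identifier
        if (!($j in count)) order[++n] = $j
        if (++count[$j] > max) max = count[$j]
      }
      for (i = 1; i <= n; i++) {          # group the strings by their count
        c = count[order[i]]
        bucket[c] = bucket[c] " " order[i] "(" c ")"
      }
      result = ""
      for (k = max; k >= 2; k--)          # walk the buckets downwards
        result = result bucket[k]
      if (result == "") result = " No Consensus!"
      print $0 "  " result
    }
  '
}

# Usage with three rows from the InFile above
printf '%s\n' \
  'NAMES-1 GEORGE DANIEL DANIEL DANIEL GEORGE PHILIP' \
  'NAMES-6 ALBERT EDWARD BARNEY DANIEL PHILIP CARSON' \
  'NAMES-7 HERMAN IRVING HERMAN IRVING IRVING HERMAN' | ranked_repeats
```

The outer loop starts at the highest count actually seen rather than at NF-1, and strings that share a count appear in input order instead of in whatever order the awk implementation walks the array.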
 
02-27-2013, 01:30 PM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,437

Rep: Reputation: 2842
Just for some fun ... here is what I came up with in ruby
Code:
ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 };  # Create hash and fill it
h.each_value{|v|
out = "#{out.empty? ? "" : out<<" / "}#{h.key(v)}(#{v})" if v > 1  # If the value is greater than 1 store string or append string
;h.delete(h.key(v))                       # delete associated key value pair
} if h.values.sort.max > 1;               # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?;        # Set out if still empty
puts out                                  # Print out value
' file
 
02-27-2013, 02:21 PM   #13
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,513

Rep: Reputation: 437
Quote:
Originally Posted by grail
Just for some fun ... here is what I came up with in ruby ...
Blows up on my machine. Could this be a version discrepancy? This is what I have:
Code:
daniel@daniel-desktop:~$ ruby --version
ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
This code ...
Code:
echo "Method of LQ Guru grail"
ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 };  # Create hash and fill it
h.each_value{|v|
out = "#{out.empty? ? "" : out<<" / "}#{h.key(v)}(#{v})" if v > 1  # If the value is greater than 1 store string or append string
;h.delete(h.key(v))                       # delete associated key value pair
} if h.values.sort.max > 1;               # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?;        # Set out if still empty
puts out                                  # Print out value
' $InFile > $OutFile
... produced this result ...
Code:
Method of LQ Guru grail
-e:5: undefined method `key' for {"NAMES-1"=>1, "PHILIP"=>1, "DANIEL"=>3, "GEORGE"=>2}:Hash (NoMethodError)
	from -e:3:in `each_value'
	from -e:3
Daniel B. Martin

Last edited by danielbmartin; 02-27-2013 at 08:46 PM. Reason: Correct attribution
 
02-28-2013, 02:02 AM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,437

Rep: Reputation: 2842
Hey Daniel ... you are correct that the key function is new from v1.9 onwards. The following should work for you:
Code:
ruby -ane 'print $_.chomp << "\t";out=""; # Print original line and set out to empty string
h={};$F.each{|x| h[x] = h[x]?h[x]+1:1 };  # Create hash and fill it
h.each_pair{|k,v|
out = "#{out.empty? ? "" : out<<" / "}#{k}(#{v})" if v > 1  # If the value is greater than 1 store string or append string
;h.delete(k)                              # delete associated key value pair
} if h.values.sort.max > 1;               # Only change out value if any of the values are greater than 1
out="No Consensus!" if out.empty?;        # Set out if still empty
puts out                                  # Print out value
' $InFile > $OutFile
Actually this way looks a little cleaner too ... cheers
 
02-28-2013, 03:54 AM   #15
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1978
Yet another exercise using bash (with associative arrays):
Code:
#!/bin/bash
#
while read -r line
do
  #
  #  Declare associative array
  #
  declare -A array
  #
  #  Assign positional parameters ("--" keeps a blank line from dumping all variables)
  #
  set -- $line
  #
  #  Loop over positional parameters and assign key/value pairs
  #
  while [[ $# -ge 1 ]]
  do
    array[$1]=$((${array[$1]}+1))
    shift
  done
  #
  #  Print line
  #
  echo -n "$line   "
  #
  #  Check repeated sequences
  #
  for i in "${!array[@]}"
  do
    [[ ${array[$i]} -gt 1 ]] && echo -n "$i(${array[$i]}) " && consensus=1
  done
  #
  #  End of line
  #
  [[ -n $consensus ]] && echo || echo 'No Consensus!'
  #
  #  Make clean for the next iteration
  #
  unset array consensus
done < file
 
All times are GMT -5.