need a column starting from a specific pattern

asrshell · 01-28-2013, 04:15 PM

Hi! Could anyone please help me with this? (I'm a new bash learner.) I have a file, let's call it 'test.txt', which contains the following (more precisely, it's an alignment file):

A-1 AAAAAAAKGAAKAAAAAAAAAAAAAAAA
A-2 ELEEEEEEEEEEEEEEEEESWEEEEEEEE
A-3 JJJLJJJJJJJJJJJJJJJWWJJJJJJJ

A-4 LLLHLIDDFRRRLLLLLLLLLLLGHLLLLLL
A-5 UUUGUUARRRHUUUUUUUUUUUJJUU
A-6 GFGFJYHFRRRGFRDCDAGGF.........

A-7 BBBBBBBBBBBAWBBBBBBBABBBSBBB
A-8 XXXXXFGXDXXXSXXXXXXXXXXXXXXX
A-9 ZZZDZZZZZZZZZZZZZZHZZZHZZZGZ

A-10 DDDDDDHDDDDIDDDIRRRDDDDDDKDDDD
A-11 QQQQHFQQQQQIQQQIRRRQQQQQQKQQQQ
A-12 IIIWWWWWWDDDIIIIIIIRRRIIIIIIKIIIILLL

Now, how can I get the 'n'th column counted from a specific pattern such as 'RRR' in the above text file (that is, the 'n'th column before or after this pattern)?

thanks in advance

kbp · 01-28-2013, 04:52 PM

Code:

egrep -o 'RRR[A-Z]{5}[A-Z]' test.txt | egrep -o '[A-Z]$'

The '5' is the number of columns between 'RRR' and the column you want
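To avoid editing the regex each time, the same idea can be wrapped in a small function. This is only a sketch: `nth_after` is a made-up name, `grep -E` is used as the equivalent of `egrep`, and the character class `[A-Z.]` assumes the sequences contain only capital letters and dots (as in line A-6).

```shell
# nth_after PATTERN N FILE
# Print the character N columns to the right of PATTERN (N >= 1),
# one result per line that contains PATTERN followed by at least N characters.
nth_after() {
    # First grep keeps PATTERN plus the N characters after it,
    # second grep keeps only the last character of that match.
    grep -Eo "$1[A-Z.]{$2}" "$3" | grep -Eo '.$'
}
```

With this, `nth_after RRR 6 test.txt` should behave like the one-liner above, since 5 columns in between plus the wanted one makes 6.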

shivaa · 01-28-2013, 09:24 PM

In a more generalized way, a one-line awk could do the job:

Code:

awk 'BEGIN{FS=" "}; /<search_pattern>/ {print $<column>}' test.txt

So, for example, if you want to print the 1st column of all lines containing the pattern "RRR", then do:

Code:

awk 'BEGIN{FS=" "}; /RRR/ {print $1}' test.txt

Output:

Code:

A-4
A-5
A-6
A-10
A-11
A-12

grail · 01-29-2013, 12:08 AM

I'm curious about what the OP considers to be a column? (as is evident from the 2 very different solutions so far)

asrshell · 01-29-2013, 10:41 AM

Thanks to both of you, kbp and shivaa, for answering.

shivaa: your code isn't producing what I want. Please consider each character as a column (anyway, your code helped me solve some of my other problems).

kbp: cheers!! Your code works, but only 'after' the given pattern (i.e. it gives the desired column on the 'right' side of the pattern).

1. How can it work 'before' the pattern as well (i.e. on the 'left' side of the pattern)?
2. Is it possible to print the output with the line header (in this example, A-4, A-5, A-6 etc. are the line headers)?

(additional query to all)
Let's say I have a file called 'test.txt' that contains the text below, where A-1, A-2, A-3 etc. are line headers. (There is always the same amount of space after each line header, and there is a blank line between paragraphs.)

A-1 AARAAAAAARAAAAAAAARAAAAAAZA
A-2 ARAAAAAAAKAAAAAAARARAAYAAAA
A-3 AARARAAYAKAARAAAAAAAAAAAAAA

A-1 ZAAAAAAAARRRAAAAAAAAAAAAAAA
A-2 AAYAAAAAARRRAAAAAAAAAAAAAAA
A-3 YAAAAAAAARRRAAAAAAAAAAAAAAA

A-1 AAZAAAAAAKAARARAAAQAAAAAARA
A-2 AAYRARARAKAAAAAAAAAAAAAAAQA
A-3 ARYAARAAAKAAAAAAAAQAAAAAARA

A-1 AAAZAAAAARRRAAAAAAAAAAAAAAA
A-2 AAYAAAAAARRRAAAAAAAAAAAAAAA
A-3 AAAAAAYAARRRAAAAAAAAAAAAAAA

3. Is it possible to get the 'n'th column, counted from a pattern (before or after it), in lines/paragraphs where the pattern is absent? Consider the example above.

How can I print the 'n'th column from paragraph 1 or 3, counting from the pattern 'RRR' (which is present in the 2nd and 4th paragraphs)? For example, the 18th column (ignoring line header and white space) counted from the left side of the 2nd 'RRR' pattern, or counted from the right side of the 1st 'RRR' pattern?

It would be nice if the output were printed with the corresponding line header. So briefly, I would be happy with output like this:

For the first case
A-1 Q
A-2 A
A-3 Q

For the 2nd case
A-1 Z
A-2 Y
A-3 Y

shivaa · 01-29-2013, 01:43 PM

Sorry if I misunderstood it. Anyway, if you consider RRR as the field separator, then you can specify it as follows.
Print after RRR:

Code:

~$ awk 'BEGIN{FS="RRR"}; NF>1 {print $2}' test.txt

Print before RRR (leaving headers):

Code:

~$ awk -F" " '{print $2}' <(awk 'BEGIN{FS="RRR"}; NF>1 {print $1}' test.txt)

Print only headers:

Code:

~$ awk -F" " '{print $1}' <(awk 'BEGIN{FS="RRR"}; NF>1 {print $1}' test.txt)

Well, for more accurate answers, could you post the sample output you want in each case?
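If every character counts as a column, as the OP clarified, one possible awk sketch uses index() and substr() on the second field and keeps the line header. The function name `col_near` and the sign convention for the offset are assumptions made up for this example, and only the first occurrence of the pattern on each line is considered.

```shell
# col_near PATTERN OFFSET FILE
# OFFSET > 0: that many columns to the right of the pattern's last character
# OFFSET < 0: that many columns to the left of the pattern's first character
col_near() {
    awk -v pat="$1" -v off="$2" '
        NF >= 2 {
            start = index($2, pat)      # 1-based start of the first match, 0 if absent
            if (start == 0) next        # skip lines without the pattern
            pos = (off > 0) ? start + length(pat) - 1 + off : start + off
            if (pos >= 1 && pos <= length($2))
                print $1, substr($2, pos, 1)   # header, then the single character
        }' "$3"
}
```

For example, `col_near RRR 6 test.txt` looks 6 columns to the right and `col_near RRR -6 test.txt` looks 6 columns to the left; lines where the requested column falls outside the sequence are skipped.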

kbp · 01-29-2013, 07:04 PM

Just move the pieces around and change the anchor in the second grep:

Code:

egrep -o '[A-Z]{6}RRR' test.txt | egrep -o '^[A-Z]'
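The grep one-liners only look inside a single line, so they cannot reach question 3, where the wanted column sits in a paragraph that does not contain 'RRR'. One way to read that question is that the paragraphs are consecutive blocks of the same sequences, so the blocks of each header can be joined into one long sequence before counting. The following is a sketch under that assumption only; `col_across` and its argument order are invented for this example, and matches are counted without overlap.

```shell
# col_across PATTERN OCCURRENCE OFFSET FILE
# Join all blocks that share a header, find the OCCURRENCE-th PATTERN,
# then print the character OFFSET columns right (> 0) or left (< 0) of it.
col_across() {
    awk -v pat="$1" -v occ="$2" -v off="$3" '
        NF >= 2 {
            if (!($1 in seq)) order[++count] = $1   # remember first-seen header order
            seq[$1] = seq[$1] $2                    # append this block to its sequence
        }
        END {
            for (i = 1; i <= count; i++) {
                s = seq[order[i]]
                from = 1; start = 0
                for (k = 1; k <= occ; k++) {        # walk to the requested occurrence
                    rel = index(substr(s, from), pat)
                    if (rel == 0) { start = 0; break }
                    start = from + rel - 1
                    from = start + length(pat)
                }
                if (start == 0) continue            # fewer than occ matches
                pos = (off > 0) ? start + length(pat) - 1 + off : start + off
                if (pos >= 1 && pos <= length(s))
                    print order[i], substr(s, pos, 1)
            }
        }' "$4"
}
```

Under this reading, the OP's first case would be `col_across RRR 2 -18 test.txt` and the second case `col_across RRR 1 18 test.txt`. Note that joining blocks can create a match that spans a paragraph boundary, which is usually what is wanted for an alignment but is worth keeping in mind.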