Regular expression problem

raghu123 · 10-10-2008, 10:53 AM

Hi,

I have a file with 3 columns, like this:

1N1Y.pdb_SIA_lig_1 1V0F.pdb_SLB_lig_1 0.217803
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SLB_lig_3 0.159091
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SLB_lig_6 0.157197
1A4G.pdb_NAG_lig_1 2ZG1.pdb_SIA_lig_1 0.076190
1A4G.pdb_NAG_lig_2 2ZG1.pdb_SIA_lig_1 0.057143
1A4G.pdb_CA_lig_5 2ZG1.pdb_SIA_lig_1 0.000000
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SLB_lig_9 0.092803
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SLB_lig_2 0.092803
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SIA_lig_5 0.081439
1A4G.pdb_ZMR_lig_7 2ZG1.pdb_SIA_lig_1 0.044326
1A4G.pdb_ZMR_lig_6 2ZG1.pdb_SIA_lig_1 0.042553
1A4G.pdb_ZMR_lig_6 2F0Z.pdb_ZMR_lig_1 0.394504
1A4G.pdb_ZMR_lig_7 2F0Z.pdb_ZMR_lig_1 0.391844

I want to grep the lines in such a way that I get

either SIA or ZMR (the middle term of the 1st and 2nd columns) in the 1st column and either SIA or ZMR again in the 2nd column. The 3rd column is the score and is printed as usual.

The output I described must look like this:

1N1Y.pdb_SIA_lig_1 1V0F.pdb_SIA_lig_5 0.081439
1A4G.pdb_ZMR_lig_7 2ZG1.pdb_SIA_lig_1 0.044326
1A4G.pdb_ZMR_lig_6 2ZG1.pdb_SIA_lig_1 0.042553
1A4G.pdb_ZMR_lig_6 2F0Z.pdb_ZMR_lig_1 0.394504
1A4G.pdb_ZMR_lig_7 2F0Z.pdb_ZMR_lig_1 0.391844

So print those lines in which the 1st column has either SIA or ZMR and the 2nd column also has SIA or ZMR.

Please help, I need it badly.

raghu123 · 10-10-2008, 10:55 AM

Actually there are spaces between the columns, which is not clear when I paste into the forum.
The columns are like this:

1A4G.pdb_ZMR_lig_6 2F0Z.pdb_ZMR_lig_1 0.394504
1A4G.pdb_ZMR_lig_7 2F0Z.pdb_ZMR_lig_1 0.391844

I typed just these 2 entries as a sample.

raconteur · 10-10-2008, 12:57 PM

Very simple.

Code:

cat <file> | egrep "ZMR.*(ZMR|SIA)|SIA.*(ZMR|SIA)"

[edit] Oops, I misread your post; I thought you wanted only lines that contained either tag in both columns. Fixed.

raghu123 · 10-10-2008, 02:50 PM

Please understand the question. I have stated my question correctly.

raconteur · 10-10-2008, 03:02 PM

Quote:

Originally Posted by raghu123

Please understand the question. I have stated my question correctly.

And I gave you a correct answer.

chrism01 · 10-10-2008, 07:32 PM

@raconteur
Actually, UUOC (Useless Use Of cat), you can do

egrep "ZMR.*(ZMR|SIA)|SIA.*(ZMR|SIA)" <filename>

PS Like the regex though...

raghu123 · 10-11-2008, 03:25 AM

Thank you.

The code is working fine.

archtoad6 · 10-11-2008, 08:40 AM

Quote:

Originally Posted by chrism01

@raconteur
Actually, UUOC (Useless Use Of cat), you can do

egrep "ZMR.*(ZMR|SIA)|SIA.*(ZMR|SIA)" <filename>
...

Actually it's a NACUUOC (Not A Completely Useless Use Of cat) -- cat'ing a file into a short command being iteratively developed on the CLI makes the heart of it, in this case the regex, more accessible for editing. I do it all the time, although I do usually remove the cat later.

I don't understand the "ZMR.*(ZMR|SIA)|SIA" part of the regex. Wouldn't:

Code:

egrep '(ZMR|SIA).*(ZMR|SIA)'
## (untested)

work just as well? Or maybe better:

Code:

egrep '^[^ ]*(ZMR|SIA)[^ ]* [^ ]*(ZMR|SIA)'
## (untested)
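A minimal sketch of the same per-column idea, with awk swapped in for egrep (awk is not what the thread uses). awk splits fields on any run of blanks, so it does not rely on exactly one space between columns the way the anchored regex above does. The sample rows are from the first post; the temporary file and the variable names are assumptions for illustration.

```shell
#!/bin/sh
# Sample rows from the first post, written to a temporary file.
# The last row uses two spaces between columns on purpose.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SLB_lig_1 0.217803
1A4G.pdb_NAG_lig_1 2ZG1.pdb_SIA_lig_1 0.076190
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SIA_lig_5 0.081439
1A4G.pdb_ZMR_lig_7 2ZG1.pdb_SIA_lig_1 0.044326
1A4G.pdb_ZMR_lig_6  2F0Z.pdb_ZMR_lig_1 0.394504
EOF

# Keep a row only when BOTH of the first two fields contain _SIA_ or _ZMR_.
matched=$(awk '$1 ~ /_(SIA|ZMR)_/ && $2 ~ /_(SIA|ZMR)_/' "$tmp")
printf '%s\n' "$matched"

rm -f "$tmp"
```

On this sample it should keep the SIA/SIA, ZMR/SIA and ZMR/ZMR rows and drop the SLB and NAG ones.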

chrism01 · 10-11-2008, 10:14 PM

TMTOWTDI

A final solution should probably not use cat unless it needs to. Shorter lines are easier to read and debug (unless they're really obscure).
I can see this degenerating into a style discussion.

Actually, using '|' can make a difference sometimes, e.g. you can echo a string through the wc command and get the number of words, but you can't use the wc command directly on a string, e.g.

you can

string="one two three"

echo "$string" | wc -w

but you can't

wc -w $string

In the latter case wc treats its arguments as file names, so it looks for files to open instead of counting the words in the string.
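A small sketch of that difference, assuming a POSIX shell. The variable name count is only for illustration, and the second half assumes no files named one, two or three exist in the current directory.

```shell
#!/bin/sh
string="one two three"

# With no file operand, wc reads standard input, so a pipe works.
# tr strips the padding some wc implementations put before the number.
count=$(printf '%s\n' "$string" | wc -w | tr -d ' ')
echo "$count"

# With operands, wc treats each word as a file name to open.
# Unless files named one, two and three exist here, this fails.
if ! wc -w $string >/dev/null 2>&1; then
    echo "wc looked for files named: $string"
fi
```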

I think I'll stop there before I get into trouble

Telemachos · 10-12-2008, 06:19 AM

Quote:

Originally Posted by archtoad6

Actually it's a NACUUOC (Not A Completely Useless Use Of cat) -- cat'ing a file into a short command being iteratively developed on the CLI makes the heart of it, in this case the regex, more accessible for editing. I do it all the time, although I do usually remove the cat later.

But there's a better solution for that. Put the filename first using redirection:

Code:

<filename egrep "ZMR.*(ZMR|SIA)|SIA.*(ZMR|SIA)"
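For comparison, a rough sketch that runs the three spellings discussed in this thread side by side on a few rows from the first post. It uses grep -E, the POSIX spelling of egrep; the temporary file and variable names are assumptions for illustration.

```shell
#!/bin/sh
# Three sample rows from the first post, in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SLB_lig_1 0.217803
1N1Y.pdb_SIA_lig_1 1V0F.pdb_SIA_lig_5 0.081439
1A4G.pdb_ZMR_lig_6 2F0Z.pdb_ZMR_lig_1 0.394504
EOF

re='ZMR.*(ZMR|SIA)|SIA.*(ZMR|SIA)'

# 1. cat piped into grep (the form from the first answer).
via_cat=$(cat "$tmp" | grep -E "$re")

# 2. File name as an operand, after the regex.
via_arg=$(grep -E "$re" "$tmp")

# 3. Redirection written first, so the regex stays at the end of the line.
via_redirect=$(<"$tmp" grep -E "$re")

printf '%s\n' "$via_redirect"
rm -f "$tmp"
```

All three should select the same lines; the only difference is where the file name sits on the command line.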

archtoad6 · 10-12-2008, 07:13 AM

Good point, thanks.

I wonder if that "word order" doesn't spring to my mind because of having English as a native language. Would a speaker of Latin, used to having word order make almost no difference to the meaning of a sentence, be more likely to think of that?

Anyway, I must remember that.

Telemachos · 10-12-2008, 07:17 AM

Quote:

Originally Posted by archtoad6

Would a speaker of Latin, used to having word order make almost no difference to the meaning of a sentence, be more likely to think of that?

A teacher of Latin in this case, so maybe you're right.