grep+awk+sed+paste+sort in one script?

mchriste · 02-03-2009, 04:00 PM

Hi people, new user here. Hope I can get some help.

I have to extract data pairwise from two lines in one big (4MB+) text file.
(The script is applied to a series of files but the data always comes pairwise from the same file)

I did kind of solve the problem by using a csh script to call an awk script
(and generating 4 temp files) but I'd like a more elegant way of doing it.

What I have:

Code:

getroots.csh

#!/bin/csh
grep "^DOCKED: USER    Estimated Free" *.dlg | awk '{print $9}'> tempe
grep "^DOCKED: ATOM      1  O" *.dlg > tempr
paste -d ' ' tempe tempr > tempp
sort -n tempp > temps
gawk -f 'root2pdb.awk' temps > roots.pdb
rm -f temp*

Code:

root2pdb.awk

#!/bin/awk -f
BEGIN {i=0}
   {
   i++
   printf ("ATOM  %5d %4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s\n",
   i, $5, $6, "L", 0, $7, $8, $9, 1, $1, $5)
   }
END {print "END"}

How can I combine all this into one awk script?

I can provide source and target data if it would help, but I didn't want to make an unnecessarily long post.

Thanks!

Tinkster · 02-03-2009, 04:58 PM

Providing data (before & after) would surely help. And chances
are it can be done with awk alone =)

Cheers,
Tink

mchriste · 02-03-2009, 05:27 PM

Thanks for the quick answer, Tinkster.

Source: (in *.dlg file)

Code:

(...) (variable length)

ATOM      1  O   UNK             8.570  33.752 -25.184 +0.00 -0.07    -0.318 51.853
ATOM      2  C   UNK             7.552  34.018 -24.226 +0.01 +0.08    +0.297 51.853
ATOM      3  C   UNK             7.197  32.776 -23.398 -0.02 +0.11    +0.233 51.853

(...) (variable length)

DOCKED: MODEL       30
DOCKED: USER    Run = 30
DOCKED: USER    DPF = SOSc.dpf
DOCKED: USER  
DOCKED: USER    Estimated Free Energy of Binding    =   -3.15 kcal/mol  [=(1)+(2)+(3)-(4)]
DOCKED: USER    Estimated Inhibition Constant, Ki   =    4.88 mM (millimolar)  [Temperature = 298.15 K]
DOCKED: USER    
DOCKED: USER    (1) Final Intermolecular Energy     =  -11.12 kcal/mol
DOCKED: USER        vdW + Hbond + desolv Energy     =   -3.47 kcal/mol
DOCKED: USER        Electrostatic Energy            =   -7.64 kcal/mol
DOCKED: USER    (2) Final Total Internal Energy     =   +9.90 kcal/mol
DOCKED: USER    (3) Torsional Free Energy           =   +5.76 kcal/mol
DOCKED: USER    (4) Unbound System's Energy         =   +7.70 kcal/mol
DOCKED: USER    

(...) (variable length)

DOCKED: REMARK   21  A    between atoms: O_44  and  S_55 
DOCKED: USER                              x       y       z     vdW  Elec       q    Type
DOCKED: USER                           _______ _______ _______ _____ _____    ______ ____
DOCKED: ROOT
DOCKED: ATOM      1  O   UNK            10.202   3.560  -4.925 +0.01 -0.17    -0.318 OA
DOCKED: ENDROOT
DOCKED: BRANCH   1   2
DOCKED: ATOM      2  C   UNK             8.987   4.249  -5.194 -0.01 +0.16    +0.297 C 

(pattern repeats)

desired output: (roots.pdb)

Code:

ATOM      1    O UNK L   0       7.718   2.274  -6.002  1.00 -8.36           O
ATOM      2    O UNK L   0      10.215   4.608  -4.430  1.00 -8.24           O
ATOM      3    O UNK L   0      10.720   4.202  -4.754  1.00 -8.20           O

(...)

ATOM    141    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O

(...)

Important notes:
- the energies (e.g. -3.15 for ATOM 141) must match the coordinates for each run (it's the purpose of the entire exercise!)
- the atom I want to extract for each run is always the "root" (always ATOM 1 for each run)
- there are thousands of runs per file resulting in one line for each in the output
- the coordinates I am interested in are always in a line that starts with DOCKED: ATOM
- I wish to sort the output by lowest energy
- original "ranking" is completely random (just by run # - here I gave run 30 as an example)
- output must follow exactly that format (including number of blank spaces)
- I'd like for this to work under Cygwin as well (just in case that's a limitation)
- bonus question: can I extract the run number as well? (appending it at the right end of each line in the output)

Thanks, Martin

Tinkster · 02-03-2009, 08:28 PM

Hmmm ... can you provide a slightly larger sample (one that would let us
produce more than one row of output)? Maybe 2 or 3 output lines?

Also, those sections with (variable length) ... are they actually separated
by blank lines or is the data stream not blank-line separated?

syg00 · 02-03-2009, 08:46 PM

I have a rule I follow - if it's quick and the data is well-formed use sed; else if it's manipulating data for re-display, use awk; when it gets too complex/ugly with either of them, use perl.

So ... use perl

mchriste · 02-04-2009, 11:43 AM

OK... guess I should have attached a larger text sample to begin with. Sorry, my bad.
I am attaching an (extensively truncated) example source file here.

Note: Please rename .txt to .dlg

The parts where I removed stuff are indicated by the line

Code:

>TRUNCATED HERE

The output I get from this example using my cumbersome method is:

Code:

ATOM      1    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O
ATOM      2    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O
ATOM      3    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O
ATOM      4    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O
ATOM      5    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O
END

If I can get the above using just one script, I'd be happy.

But ideally, I would like this (the last number corresponding to the run / MODEL #)

Code:

ATOM      1    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O  256
ATOM      2    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O   30
ATOM      3    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O    1
ATOM      4    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O    3
ATOM      5    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O    2
END

Syg00: Unfortunately, I have zero understanding of perl...

The point is that I do know how to get the data I need using the scripts in the OP.
I would just like to concatenate all this into one script - whether it's awk, csh or perl.

Tinkster · 02-04-2009, 02:13 PM

Does this work for you? =o)

Code:

#!/bin/awk -f
BEGIN{
  i=0
}
{
  if ( $0 ~ /^DOCKED: USER    Run =/){
    run = $5
    i++
  }
  if ( $0 ~ /^DOCKED: USER    Estimated Free/){
    energy = $9
    i++
  }
  if ( $0 ~ /^DOCKED: ATOM      1  O/){
    line[energy] = sprintf ("ATOM  %5d %4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s  %3d", i, $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run)
  }
}
END{
  for ( j in line){
    print line[j] | "sort -k 11,11g";
  }
  close( "sort -k 11,11g")
  print "END"
}

Cheers,
Tink

mchriste · 02-05-2009, 08:25 AM

Thanks for the script, Tink!
It works *almost* perfectly, except that the atoms are not serially numbered.

Using the example file, with your script I get:

Code:

ATOM     10    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O  256
ATOM      8    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O   30
ATOM      2    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O    1
ATOM      6    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O    3
ATOM      4    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O    2
END

but I'd like:

Code:

ATOM      1    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O  256
ATOM      2    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O   30
ATOM      3    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O    1
ATOM      4    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O    3
ATOM      5    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O    2
END

I can't figure out what causes this first number to be wrong.
Seems like it doubles the run number (because you increase i twice?) but only up to a point...
Can you - or anybody else on this forum! :-) fix it?

Thanks for your help!

mchriste · 02-06-2009, 03:04 PM

Uhm... anybody?

Tinkster · 02-08-2009, 03:20 PM

The problem here is that the numbering in column 2 needs to be
applied after the sort on the energy column ... which means we have
to insert the numbers *after* the sort, so it will again become a two
step process, where the numeric IDs get *created* after the awk.

I'll have a think about how to work this - sorry for the late
response, had a mini-holiday of 3.5 days =}

mchriste · 02-11-2009, 03:19 PM

Any progress with my problem yet?

Maybe something involving a counter mechanism for the new file (i.e. just counting line# of the output)
that is completely independent of the already applied i++ increment?

PS: Thanks for keeping me updated, Tink. I really do appreciate your help.

Tinkster · 02-11-2009, 03:55 PM

The problem with that is that the order is being determined by the
sort in the output inside the loop of the END processing ... I'd need
to see whether I can use awk's built-in sort functions (asort or asorti,
both gawk extensions) to do the sorting instead, and then do the
increment count in there.

The other (quick, but less elegant) way I see (for now, but I'm kind
of preoccupied with other stuff) is to put something bogus in
the position of the counter, and then pipe that through a second awk
that just replaces the bogus with a padded line number ...

Can you try and work with those (English) pseudo instructions yourself?
I guess it would only take me half an hour, but that's currently hard
to dig up.

Cheers,
Tink
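
A minimal sketch of the first idea above (sort inside awk, then number the records in the END block). Assumptions: the function name roots_sorted is made up for illustration, and a small insertion sort in plain awk stands in for gawk's asort/asorti, so it should not depend on those gawk functions. Records are stored under a running counter rather than under the energy, so two runs with the same energy should not overwrite each other. It has only been reasoned through against the sample lines in this thread, not a full .dlg file.

```shell
#!/bin/sh
# Sketch: collect one record per run, sort by energy inside awk, number afterwards.
# roots_sorted is a hypothetical name; insertion sort replaces gawk asort().
roots_sorted() {
  awk '
    /^DOCKED: USER    Run =/          { run = $5 }      # remember current run number
    /^DOCKED: USER    Estimated Free/ { energy = $9 }   # remember its binding energy
    /^DOCKED: ATOM      1  O/ {                         # root atom completes the record
      n++
      e[n] = energy + 0
      rec[n] = sprintf("%4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s  %3d",
                       $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run)
    }
    END {
      # insertion sort of record indices by ascending energy (stable for ties)
      for (i = 1; i <= n; i++) idx[i] = i
      for (i = 2; i <= n; i++) {
        k = idx[i]
        for (j = i - 1; j >= 1 && e[idx[j]] > e[k]; j--) idx[j + 1] = idx[j]
        idx[j + 1] = k
      }
      # the serial number is only assigned here, after sorting
      for (i = 1; i <= n; i++) printf "ATOM  %5d %s\n", i, rec[idx[i]]
      print "END"
    }
  ' "$@"
}
```

Usage would be along the lines of `roots_sorted *.dlg > roots.pdb` once the function is defined in the shell.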

mchriste · 03-05-2009, 12:40 PM

Thanks Tink, I finally figured out something that works using your suggestions.
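
The final script was not posted in the thread. As one possible reading of the placeholder suggestion, here is a minimal sketch: the first awk writes a dummy marker where the serial number belongs, sort orders the lines by the energy column, and a second awk swaps the marker for the line number. The function name roots_two_pass and the "@@@@@" marker are hypothetical choices, and `sort -k 11,11n` under `LC_ALL=C` is used instead of the `g` key from the earlier script on the assumption that plain numeric sort is enough for energies printed with `%5.2f`.

```shell
#!/bin/sh
# Sketch: placeholder in the counter column, sort by energy, then fill in numbers.
# roots_two_pass and the "@@@@@" marker are hypothetical.
roots_two_pass() {
  awk '
    /^DOCKED: USER    Run =/          { run = $5 }      # remember current run number
    /^DOCKED: USER    Estimated Free/ { energy = $9 }   # remember its binding energy
    /^DOCKED: ATOM      1  O/ {                         # root atom: emit one line per run
      printf "ATOM  %5s %4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s  %3d\n",
             "@@@@@", $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run
    }
  ' "$@" |
  LC_ALL=C sort -k 11,11n |
  awk '
    { sub(/@@@@@/, sprintf("%5d", NR)); print }   # NR = position after sorting
    END { print "END" }
  '
}
```

Because the counter is taken from NR in the second awk, it is independent of anything counted while reading the .dlg files, which is essentially the "count the output lines" idea raised earlier in the thread.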

Tinkster · 03-05-2009, 01:57 PM

Glad to hear it's done, thanks for coming back with the feedback
and sorry I couldn't be of more assistance.

Cheers,
Tink