LinuxQuestions.org - grep+awk+sed+paste+sort in one script?

Hi people, new user here. Hope I can get some help.

I have to extract data pairwise from two lines in one big (4MB+) text file.
(The script is applied to a series of files but the data always comes pairwise from the same file)

I did kind of solve the problem by using a csh script to call an awk script
(and generating 4 temp files) but I'd like a more elegant way of doing it.

What I have:

Code:

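getroots.csh

#!/bin/csh
# energy of each run (9th field of the "Estimated Free Energy" line)
grep "^DOCKED: USER    Estimated Free" *.dlg | awk '{print $9}' > tempe
# root atom (ATOM 1) line of each run
grep "^DOCKED: ATOM      1  O" *.dlg > tempr
# pair each energy with its root atom line, then sort by energy
paste -d ' ' tempe tempr > tempp
sort -n tempp > temps
# format the sorted pairs as PDB ATOM records
gawk -f 'root2pdb.awk' temps > roots.pdb
rm -f temp*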

Code:

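root2pdb.awk

#!/bin/awk -f
# Input fields after paste: $1 = energy, $5 = atom name, $6 = residue,
# $7..$9 = x, y, z coordinates.
BEGIN {i=0}
  {
  i++
  printf ("ATOM  %5d %4s %3s %1s %3d    %7.3f %7.3f %7.3f %5.2f %5.2f          %1s\n",
  i, $5, $6, "L", 0, $7, $8, $9, 1, $1, $5)
  }
END {print "END"}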

How can I combine all this into one awk script?

I can provide source and target data if it would help, but I didn't want to make an unnecessarily long post.

Thanks!

Providing data (before & after) would surely help. And chances
are it can be done with awk alone =)

Cheers,
Tink

Thanks for the quick answer, Tinkster.

Source: (in *.dlg file)

Code:

(...) (variable length)

ATOM      1  O  UNK            8.570  33.752 -25.184 +0.00 -0.07    -0.318 51.853
ATOM      2  C  UNK            7.552  34.018 -24.226 +0.01 +0.08    +0.297 51.853
ATOM      3  C  UNK            7.197  32.776 -23.398 -0.02 +0.11    +0.233 51.853

(...) (variable length)

DOCKED: MODEL      30
DOCKED: USER    Run = 30
DOCKED: USER    DPF = SOSc.dpf
DOCKED: USER  
DOCKED: USER    Estimated Free Energy of Binding    =  -3.15 kcal/mol  [=(1)+(2)+(3)-(4)]
DOCKED: USER    Estimated Inhibition Constant, Ki  =    4.88 mM (millimolar)  [Temperature = 298.15 K]
DOCKED: USER    
DOCKED: USER    (1) Final Intermolecular Energy    =  -11.12 kcal/mol
DOCKED: USER        vdW + Hbond + desolv Energy    =  -3.47 kcal/mol
DOCKED: USER        Electrostatic Energy            =  -7.64 kcal/mol
DOCKED: USER    (2) Final Total Internal Energy    =  +9.90 kcal/mol
DOCKED: USER    (3) Torsional Free Energy          =  +5.76 kcal/mol
DOCKED: USER    (4) Unbound System's Energy        =  +7.70 kcal/mol
DOCKED: USER    

(...) (variable length)

DOCKED: REMARK  21  A    between atoms: O_44  and  S_55 
DOCKED: USER                              x      y      z    vdW  Elec      q    Type
DOCKED: USER                          _______ _______ _______ _____ _____    ______ ____
DOCKED: ROOT
DOCKED: ATOM      1  O  UNK            10.202  3.560  -4.925 +0.01 -0.17    -0.318 OA
DOCKED: ENDROOT
DOCKED: BRANCH  1  2
DOCKED: ATOM      2  C  UNK            8.987  4.249  -5.194 -0.01 +0.16    +0.297 C 

(pattern repeats)

desired output: (roots.pdb)

Code:

ATOM      1    O UNK L  0      7.718  2.274  -6.002  1.00 -8.36          O
ATOM      2    O UNK L  0      10.215  4.608  -4.430  1.00 -8.24          O
ATOM      3    O UNK L  0      10.720  4.202  -4.754  1.00 -8.20          O

(...)

ATOM    141    O UNK L  0      10.202  3.560  -4.925  1.00 -3.15          O

(...)

Important notes:
- the energies (e.g. -3.15 for ATOM 141) must match the coordinates for each run (it's the purpose of the entire exercise!)
- the atom I want to extract for each run is always the "root" (always ATOM 1 for each run)
- there are thousands of runs per file resulting in one line for each in the output
- the coordinates I am interested in are always in a line that starts with DOCKED: ATOM
- I wish to sort the output by lowest energy
- original "ranking" is completely random (just by run # - here I gave run 30 as an example)
- output must follow exactly that format (including number of blank spaces)
- I'd like for this to work under Cygwin as well (just in case that's a limitation)
- bonus question: can I extract the run number as well? (appending it at the right end of each line in the output)

Thanks, Martin

Hmmm ... can you provide a slightly larger sample (one that would
produce more than one row of output)? Maybe 2 or 3 output lines?

Also, those sections with (variable length) ... are they actually separated
by blank lines or is the data stream not blank-line separated?

I have a rule I follow - if it's quick and the data is well-formed use sed; else if it's manipulating data for re-display, use awk; when it gets too complex/ugly with either of them, use perl.

So ... use perl ;)

OK... guess I should have attached a larger text sample to begin with. Sorry, my bad.
I am attaching an (extensively truncated) example source file here.

Note: Please rename .txt to .dlg

The parts where I removed stuff are indicated by the line

Code:

>TRUNCATED HERE

The output I get from this example using my cumbersome method is:

Code:

ATOM      1    O UNK L  0      8.544  24.334  -2.603  1.00 -3.46          O
ATOM      2    O UNK L  0      10.202  3.560  -4.925  1.00 -3.15          O
ATOM      3    O UNK L  0      11.768  9.257  0.669  1.00 -2.99          O
ATOM      4    O UNK L  0      11.426  2.115  -3.358  1.00 -2.61          O
ATOM      5    O UNK L  0      13.380  20.326  -2.311  1.00 -2.48          O
END

If I can get the above using just one script, I'd be happy.

But ideally, I would like this (the last number corresponding to the run / MODEL #)

Code:

ATOM      1    O UNK L  0      8.544  24.334  -2.603  1.00 -3.46          O  256
ATOM      2    O UNK L  0      10.202  3.560  -4.925  1.00 -3.15          O  30
ATOM      3    O UNK L  0      11.768  9.257  0.669  1.00 -2.99          O    1
ATOM      4    O UNK L  0      11.426  2.115  -3.358  1.00 -2.61          O    3
ATOM      5    O UNK L  0      13.380  20.326  -2.311  1.00 -2.48          O    2
END

Syg00: Unfortunately, I have zero understanding of perl...

The point is that I do know how to get the data I need using the scripts in the OP.
I would just like to concatenate all this into one script - whether it's awk, csh or perl.

Does this work for you? =o)

Code:

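#!/bin/awk -f
BEGIN{
  i=0
}
{
  # remember the run number and energy until the root atom line shows up
  if ( $0 ~ /^DOCKED: USER    Run =/){
    run = $5
    i++
  }
  if ( $0 ~ /^DOCKED: USER    Estimated Free/){
    energy = $9
    i++
  }
  if ( $0 ~ /^DOCKED: ATOM      1  O/){
    # keyed by i rather than by energy, so runs with equal energies do not overwrite each other
    line[i] = sprintf ("ATOM  %5d %4s %3s %1s %3d    %7.3f %7.3f %7.3f %5.2f %5.2f          %1s  %3d", i, $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run)
  }
}
END{
  # sort the stored lines on field 11 (the energy)
  for ( j in line){
    print line[j] | "sort -k 11,11g";
  }
  close( "sort -k 11,11g")
  print "END"
}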

Cheers,
Tink

Thanks for the script, Tink!
It works *almost* perfectly, except that the atoms are not serially numbered.

Using the example file, with your script I get:

Code:

ATOM    10    O UNK L  0      8.544  24.334  -2.603  1.00 -3.46          O  256
ATOM      8    O UNK L  0      10.202  3.560  -4.925  1.00 -3.15          O  30
ATOM      2    O UNK L  0      11.768  9.257  0.669  1.00 -2.99          O    1
ATOM      6    O UNK L  0      11.426  2.115  -3.358  1.00 -2.61          O    3
ATOM      4    O UNK L  0      13.380  20.326  -2.311  1.00 -2.48          O    2
END

but I'd like:

Code:

ATOM      1    O UNK L  0      8.544  24.334  -2.603  1.00 -3.46          O  256
ATOM      2    O UNK L  0      10.202  3.560  -4.925  1.00 -3.15          O  30
ATOM      3    O UNK L  0      11.768  9.257  0.669  1.00 -2.99          O    1
ATOM      4    O UNK L  0      11.426  2.115  -3.358  1.00 -2.61          O    3
ATOM      5    O UNK L  0      13.380  20.326  -2.311  1.00 -2.48          O    2
END

I can't figure out what causes this first number to be wrong.
Seems like it doubles the run number (because you increase i twice?) but only up to a point...
Can you - or anybody else on this forum! :-) fix it?

Thanks for your help!

The problem here is that the numbering in column 2 needs to be
applied after the sort on the energy column ... which means we have
to insert those numbers *after* the sort, so it will again become a
two step process, where the numeric IDs get *created* after the awk.

I'll have a think about how to work this - sorry for the late
response, had a mini-holiday of 3.5 days =}

Any progress with my problem yet?

Maybe something involving a counter mechanism for the new file (i.e. just counting line# of the output)
that is completely independent of the already applied i++ increment?

PS: Thanks for keeping me updated, Tink. I really do appreciate your help.

The problem with that is that the order is being determined by the
sort in the output inside the loop of the END processing ... I'd need
to see whether I can use awk's built-in sort functions (asort or asorti
in gawk) to do the sorting instead, and then do the increment count in there.
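
For reference, a minimal sketch of the sort-inside-awk idea. asort and asorti are gawk extensions, so this version swaps in a hand-written insertion sort that should also run on other awks. The function name roots_one_pass and the arrays e, rest and idx are made up for illustration, and the patterns assume the .dlg layout shown earlier in the thread.

```shell
#!/bin/sh
# roots_one_pass (hypothetical name): collect one record per run, sort the
# records by energy inside awk, and only then assign the serial atom number.
roots_one_pass() {
  awk '
    /^DOCKED: USER +Run =/          { run = $5 }
    /^DOCKED: USER +Estimated Free/ { energy = $9 }
    /^DOCKED: ATOM +1 +O/ {
      n++
      e[n]    = energy + 0          # numeric energy used as the sort key
      idx[n]  = n                   # idx[] will hold the sorted order
      # everything after the serial number, formatted as in the thread
      rest[n] = sprintf("%4s %3s %1s %3d    %7.3f %7.3f %7.3f %5.2f %5.2f          %1s  %3d",
                        $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run)
    }
    END {
      # insertion sort of idx[] by ascending energy
      for (a = 2; a <= n; a++) {
        k = idx[a]
        for (b = a - 1; b >= 1 && e[idx[b]] > e[k]; b--)
          idx[b + 1] = idx[b]
        idx[b + 1] = k
      }
      # the serial number is simply the position in the sorted order
      for (a = 1; a <= n; a++)
        printf "ATOM  %5d %s\n", a, rest[idx[a]]
      print "END"
    }' "$@"
}
```

Something like `roots_one_pass *.dlg > roots.pdb` would be the intended call. Insertion sort is quadratic, which should still be tolerable for a few thousand runs per file.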

The other (quick, but less elegant) way I see (for now, but I'm kind
of preoccupied with other stuff) is to put something bogus in
the position of the counter, and then pipe that through a second awk
that just replaces the bogus with a padded line number ...
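
A rough sketch of that placeholder idea, not a tested solution for real .dlg files. The marker XXXXX and the function names extract_roots and renumber are hypothetical, and sort -n is used here in place of the GNU-specific -g key, on the assumption that the energies are plain decimals.

```shell
#!/bin/sh
# renumber (hypothetical name): replace the placeholder in each sorted
# record with a 5-wide line number, then append the END record.
renumber() {
  awk '{ sub(/XXXXX/, sprintf("%5d", NR)); print } END { print "END" }'
}

# extract_roots (hypothetical name): first pass prints one record per run
# with XXXXX where the serial number belongs, sort orders the records by
# energy (field 11), and renumber fills in the serial numbers afterwards.
extract_roots() {
  awk '
    /^DOCKED: USER +Run =/          { run = $5 }
    /^DOCKED: USER +Estimated Free/ { energy = $9 }
    /^DOCKED: ATOM +1 +O/ {
      printf "ATOM  XXXXX %4s %3s %1s %3d    %7.3f %7.3f %7.3f %5.2f %5.2f          %1s  %3d\n",
             $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run
    }' "$@" | LC_ALL=C sort -k 11,11n | renumber
}
```

Called as `extract_roots *.dlg > roots.pdb`, this keeps everything in one script without temp files, although it is still two awk passes around the sort.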

Can you try and work with those (English) pseudo instructions yourself?
I guess it would only take me half an hour, but that's currently hard
to dig up :)

Cheers,
Tink

Thanks Tink, I finally figured out something that works using your suggestions.

Glad to hear it's done, thanks for coming back with the feedback
and sorry I couldn't be of more assistance.

Cheers,
Tink