LinuxQuestions.org
Old 02-03-2009, 04:00 PM   #1
mchriste
LQ Newbie
 
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10

Rep: Reputation: 0
grep+awk+sed+paste+sort in one script?


Hi people, new user here. Hope I can get some help.

I have to extract data pairwise from two lines in one big (4MB+) text file.
(The script is applied to a series of files but the data always comes pairwise from the same file)

I did kind of solve the problem by using a csh script to call an awk script
(and generating 4 temp files) but I'd like a more elegant way of doing it.

What I have:

Code:
getroots.csh

#!/bin/csh
grep "^DOCKED: USER    Estimated Free" *.dlg | awk '{print $9}'> tempe
grep "^DOCKED: ATOM      1  O" *.dlg > tempr
paste -d ' ' tempe tempr > tempp
sort -n tempp > temps
gawk -f 'root2pdb.awk' temps > roots.pdb
rm -f temp*
Code:
root2pdb.awk

#!/bin/awk -f
BEGIN {i=0}
   {
   i++
   printf ("ATOM  %5d %4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s\n",
   i, $5, $6, "L", 0, $7, $8, $9, 1, $1, $5)
   }
END {print "END"}
How can I combine all this into one awk script?

I can provide source and target data if it would help, but I didn't want to make an unnecessarily long post.

Thanks!

Last edited by mchriste; 02-03-2009 at 04:03 PM. Reason: clarification
 
Old 02-03-2009, 04:58 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,988
Blog Entries: 11

Rep: Reputation: 880
Providing data (before & after) would surely help. And chances
are it can be done with awk alone =)


Cheers,
Tink
 
Old 02-03-2009, 05:27 PM   #3
mchriste
LQ Newbie
 
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10

Original Poster
Rep: Reputation: 0
Thanks for the quick answer, Tinkster.

Source: (in *.dlg file)
Code:
(...) (variable length)

ATOM      1  O   UNK             8.570  33.752 -25.184 +0.00 -0.07    -0.318 51.853
ATOM      2  C   UNK             7.552  34.018 -24.226 +0.01 +0.08    +0.297 51.853
ATOM      3  C   UNK             7.197  32.776 -23.398 -0.02 +0.11    +0.233 51.853

(...) (variable length)

DOCKED: MODEL       30
DOCKED: USER    Run = 30
DOCKED: USER    DPF = SOSc.dpf
DOCKED: USER  
DOCKED: USER    Estimated Free Energy of Binding    =   -3.15 kcal/mol  [=(1)+(2)+(3)-(4)]
DOCKED: USER    Estimated Inhibition Constant, Ki   =    4.88 mM (millimolar)  [Temperature = 298.15 K]
DOCKED: USER    
DOCKED: USER    (1) Final Intermolecular Energy     =  -11.12 kcal/mol
DOCKED: USER        vdW + Hbond + desolv Energy     =   -3.47 kcal/mol
DOCKED: USER        Electrostatic Energy            =   -7.64 kcal/mol
DOCKED: USER    (2) Final Total Internal Energy     =   +9.90 kcal/mol
DOCKED: USER    (3) Torsional Free Energy           =   +5.76 kcal/mol
DOCKED: USER    (4) Unbound System's Energy         =   +7.70 kcal/mol
DOCKED: USER    

(...) (variable length)

DOCKED: REMARK   21  A    between atoms: O_44  and  S_55 
DOCKED: USER                              x       y       z     vdW  Elec       q    Type
DOCKED: USER                           _______ _______ _______ _____ _____    ______ ____
DOCKED: ROOT
DOCKED: ATOM      1  O   UNK            10.202   3.560  -4.925 +0.01 -0.17    -0.318 OA
DOCKED: ENDROOT
DOCKED: BRANCH   1   2
DOCKED: ATOM      2  C   UNK             8.987   4.249  -5.194 -0.01 +0.16    +0.297 C 

(pattern repeats)
Desired output (roots.pdb):
Code:
ATOM      1    O UNK L   0       7.718   2.274  -6.002  1.00 -8.36           O
ATOM      2    O UNK L   0      10.215   4.608  -4.430  1.00 -8.24           O
ATOM      3    O UNK L   0      10.720   4.202  -4.754  1.00 -8.20           O

(...)

ATOM    141    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O

(...)
Important notes:
- the energies (ex: -3.15 for ATOM 141) must match the coordinates for each run (it's the purpose of the entire exercise!)
- the atom I want to extract for each run is always the "root" (always ATOM 1 for each run)
- there are thousands of runs per file resulting in one line for each in the output
- the coordinates I am interested in are always in a line that starts with DOCKED: ATOM
- I wish to sort the output by lowest energy
- original "ranking" is completely random (just by run # - here I gave run 30 as an example)
- output must follow exactly that format (including number of blank spaces)
- I'd like for this to work under Cygwin as well (just in case that's a limitation)
- bonus question: can I extract the run number as well? (appending it at the right end of each line in the output)

Thanks, Martin

Last edited by mchriste; 02-03-2009 at 05:43 PM. Reason: clarification
 
Old 02-03-2009, 08:28 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,988
Blog Entries: 11

Rep: Reputation: 880
Hmmm ... can you provide a slightly larger sample (one that would allow us
to produce more than one row of output)? Maybe 2 or 3 output lines?

Also, those sections with (variable length) ... are they actually separated
by blank lines or is the data stream not blank-line separated?
 
Old 02-03-2009, 08:46 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,356

Rep: Reputation: 1043
I have a rule I follow - if it's quick and the data is well-formed use sed; else if it's manipulating data for re-display, use awk; when it gets too complex/ugly with either of them, use perl.

So ... use perl
 
Old 02-04-2009, 11:43 AM   #6
mchriste
LQ Newbie
 
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10

Original Poster
Rep: Reputation: 0
OK... guess I should have attached a larger text sample to begin with. Sorry, my bad.
I am attaching an (extensively truncated) example source file here.

Note: Please rename .txt to .dlg

The parts where I removed stuff are indicated by the line
Code:
>TRUNCATED HERE
The output I get from this example using my cumbersome method is:
Code:
ATOM      1    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O
ATOM      2    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O
ATOM      3    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O
ATOM      4    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O
ATOM      5    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O
END
If I can get the above using just one script, I'd be happy.

But ideally, I would like this (the last number corresponding to the run / MODEL #)
Code:
ATOM      1    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O  256
ATOM      2    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O   30
ATOM      3    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O    1
ATOM      4    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O    3
ATOM      5    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O    2
END
Syg00: Unfortunately, I have zero understanding of perl...

The point is that I do know how to get the data I need using the scripts in the OP.
I would just like to concatenate all this into one script - whether it's awk, csh or perl.
Attached Files
File Type: txt example.txt (131.4 KB, 5 views)

Last edited by mchriste; 02-04-2009 at 11:51 AM. Reason: clarification
 
Old 02-04-2009, 02:13 PM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,988
Blog Entries: 11

Rep: Reputation: 880
Does this work for you? =o)

Code:
#!/bin/awk -f
BEGIN{
  i=0
}
{
  if ( $0 ~ /^DOCKED: USER    Run =/){
    run = $5
    i++
  }
  if ( $0 ~ /^DOCKED: USER    Estimated Free/){
    energy = $9
    i++
  }
  if ( $0 ~ /^DOCKED: ATOM      1  O/){
    line[energy] = sprintf ("ATOM  %5d %4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s  %3d", i, $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run)
  }
}
END{
  for ( j in line){
    print line[j] | "sort -k 11,11g";
  }
  close( "sort -k 11,11g")
  print "END"
}


Cheers,
Tink

Last edited by Tinkster; 02-04-2009 at 02:15 PM.
 
Old 02-05-2009, 08:25 AM   #8
mchriste
LQ Newbie
 
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10

Original Poster
Rep: Reputation: 0
Thanks for the script, Tink!
It works *almost* perfectly, except that the atoms are not serially numbered.

Using the example file, with your script I get:
Code:
ATOM     10    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O  256
ATOM      8    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O   30
ATOM      2    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O    1
ATOM      6    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O    3
ATOM      4    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O    2
END
but I'd like:
Code:
ATOM      1    O UNK L   0       8.544  24.334  -2.603  1.00 -3.46           O  256
ATOM      2    O UNK L   0      10.202   3.560  -4.925  1.00 -3.15           O   30
ATOM      3    O UNK L   0      11.768   9.257   0.669  1.00 -2.99           O    1
ATOM      4    O UNK L   0      11.426   2.115  -3.358  1.00 -2.61           O    3
ATOM      5    O UNK L   0      13.380  20.326  -2.311  1.00 -2.48           O    2
END
I can't figure out what causes this first number to be wrong.
Seems like it doubles the run number (because you increase i twice?) but only up to a point...
Can you - or anybody else on this forum! :-) fix it?

Thanks for your help!
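
For anyone reading along, here is a small sketch of where 10, 8, 2, 6, 4 seem to come from. It mirrors the two i++ statements in the script from post #7 and prints the counter each time a root atom is seen. The function name trace_counter is made up for illustration, and the field tests assume the .dlg layout quoted earlier in this thread.

```shell
#!/bin/sh
# trace_counter is a hypothetical helper, not part of any script in this thread.
# It prints "run counter" for every root atom, using the same two increments
# as the post #7 script (one on the Run line, one on the Estimated Free line).
trace_counter() {
  awk '
    $1 == "DOCKED:" && $2 == "USER" && $3 == "Run"                         { run = $5; i++ }
    $1 == "DOCKED:" && $2 == "USER" && $3 == "Estimated" && $4 == "Free"   { i++ }
    $1 == "DOCKED:" && $2 == "ATOM" && $3 == 1                             { print run, i }
  ' "$@"
}
```

Called as trace_counter example.dlg, this should show the counter going up by two per run in file order, whatever the run number is. So the counter is neither the run number nor the rank after sorting, which fits the output above if the truncated example lists runs 1, 2, 3, 30 and 256 in that order.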

Last edited by mchriste; 02-05-2009 at 01:20 PM.
 
Old 02-06-2009, 03:04 PM   #9
mchriste
LQ Newbie
 
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10

Original Poster
Rep: Reputation: 0
Uhm... anybody?
 
Old 02-08-2009, 03:20 PM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,988
Blog Entries: 11

Rep: Reputation: 880
The problem here is that the numbering of column 2 needs to be
applied after the sort on the last one ... which means we have
to insert those *after* the sort, so it will again become a two
step process, where the numeric IDs get *created* after the awk.


I'll have a think about how to work this - sorry for the late
response, had a mini-holiday of 3.5 days =}
 
Old 02-11-2009, 03:19 PM   #11
mchriste
LQ Newbie
 
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10

Original Poster
Rep: Reputation: 0
Any progress with my problem yet?

Maybe something involving a counter mechanism for the new file (i.e. just counting line# of the output)
that is completely independent of the already applied i++ increment?

PS: Thanks for keeping me updated, Tink. I really do appreciate your help.
 
Old 02-11-2009, 03:55 PM   #12
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,988
Blog Entries: 11

Rep: Reputation: 880
Quote:
Originally Posted by mchriste
Any progress with my problem yet?

Maybe something involving a counter mechanism for the new file (i.e. just counting line# of the output)
that is completely independent of the already applied i++ increment?

PS: Thanks for keeping me updated, Tink. I really do appreciate your help.
The problem with that is that the order is being determined by the
sort in the output inside the loop of the END processing ... I'd need
to see whether I can use gawk's built-in sort functions (asort or asorti)
to do the sorting instead, and then do the increment count in there.
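
A rough sketch of the "sort inside awk, then count" idea, for illustration only. It swaps asort/asorti for a hand-written insertion sort, since those two are gawk extensions and a plain loop should also run under other awk implementations (Cygwin included). The names roots_sorted, e, rec and idx are made up here, and the field positions assume the .dlg layout quoted earlier in the thread. Records are stored under a running counter rather than under the energy value, so two runs with the same energy do not overwrite each other.

```shell
#!/bin/sh
# roots_sorted is a hypothetical name; field numbers assume the quoted .dlg layout.
roots_sorted() {
  awk '
    # Remember the run number and energy until the root atom of that run shows up.
    $1 == "DOCKED:" && $2 == "USER" && $3 == "Run"                        { run = $5 }
    $1 == "DOCKED:" && $2 == "USER" && $3 == "Estimated" && $4 == "Free"  { energy = $9 }
    $1 == "DOCKED:" && $2 == "ATOM" && $3 == 1 {
      n++
      e[n] = energy + 0          # numeric sort key
      # Everything after the serial number; the serial is added once the order is known.
      rec[n] = sprintf(" %4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s  %3d",
                       $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run)
    }
    END {
      # Insertion sort of the record indices by ascending energy (stable for ties).
      for (i = 1; i <= n; i++) idx[i] = i
      for (i = 2; i <= n; i++) {
        k = idx[i]
        for (j = i - 1; j >= 1 && e[idx[j]] > e[k]; j--) idx[j + 1] = idx[j]
        idx[j + 1] = k
      }
      # Number the lines in sorted order, then close the file with END.
      for (i = 1; i <= n; i++) printf "ATOM  %5d%s\n", i, rec[idx[i]]
      print "END"
    }
  ' "$@"
}
```

Usage would be something like roots_sorted *.dlg > roots.pdb. Insertion sort is quadratic, which should still be acceptable for a few thousand runs per file; for much larger inputs an external sort is probably the better choice.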

The other (quick, but less elegant) way I see (for now, but I'm kind
of preoccupied with other stuff) is to put something bogus in
the position of the counter, and then pipe that through a second awk
that just replaces the bogus with a padded line number ...
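
One possible shape for that placeholder pipeline, as a sketch rather than a finished script. The first awk writes ##### where the serial number belongs, sort orders the lines on the energy column, and a second awk overwrites the placeholder with the line number. The function name root_table and the ##### marker are invented for this example; the field positions assume the .dlg layout quoted earlier, and sort -n stands in for the GNU-only -g key used in post #7.

```shell
#!/bin/sh
# root_table is a hypothetical name; field numbers assume the quoted .dlg layout.
root_table() {
  awk '
    $1 == "DOCKED:" && $2 == "USER" && $3 == "Run"                        { run = $5 }
    $1 == "DOCKED:" && $2 == "USER" && $3 == "Estimated" && $4 == "Free"  { energy = $9 }
    $1 == "DOCKED:" && $2 == "ATOM" && $3 == 1 {
      # "#####" is a five-character placeholder in the serial number column.
      printf "ATOM  %5s %4s %3s %1s %3d     %7.3f %7.3f %7.3f %5.2f %5.2f           %1s  %3d\n",
             "#####", $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run
    }
  ' "$@" |
  # Field 11 is the energy; LC_ALL=C keeps "." as the decimal point.
  LC_ALL=C sort -k 11,11n |
  # Characters 7-11 hold the placeholder; replace them with the padded line number.
  awk '{ printf "ATOM  %5d%s\n", NR, substr($0, 12) } END { print "END" }'
}
```

Usage would be something like root_table *.dlg > roots.pdb. It is still a three-stage pipeline, but it needs no temp files and fits in a single script file.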

Can you try and work with those (English) pseudo instructions yourself?
I guess it would only take me half an hour, but that's currently hard
to dig up



Cheers,
Tink
 
Old 03-05-2009, 12:40 PM   #13
mchriste
LQ Newbie
 
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10

Original Poster
Rep: Reputation: 0
Thanks Tink, I finally figured out something that works using your suggestions.
 
Old 03-05-2009, 01:57 PM   #14
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,988
Blog Entries: 11

Rep: Reputation: 880
Glad to hear it's done, thanks for coming back with the feedback
and sorry I couldn't be of more assistance.



Cheers,
Tink
 
  

