02-03-2009, 04:00 PM
|
#1
|
LQ Newbie
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10
Rep:
|
grep+awk+sed+paste+sort in one script?
Hi people, new user here. Hope I can get some help.
I have to extract data pairwise from two lines in one big (4MB+) text file.
(The script is applied to a series of files but the data always comes pairwise from the same file)
I did kind of solve the problem by using a csh script to call an awk script
(and generating 4 temp files) but I'd like a more elegant way of doing it.
What I have:
Code:
getroots.csh
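#!/bin/csh
# energy of each run (field 9 of the "Estimated Free Energy" line)
grep "^DOCKED: USER Estimated Free" *.dlg | awk '{print $9}' > tempe
# root atom (ATOM 1) of each run
grep "^DOCKED: ATOM 1 O" *.dlg > tempr
# pair energy and atom line by line, then sort by energy
paste -d ' ' tempe tempr > tempp
sort -n tempp > temps
# renumber the sorted lines and write them in PDB format
gawk -f 'root2pdb.awk' temps > roots.pdb
# remove only the four temp files created above
rm -f tempe tempr tempp temps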
Code:
root2pdb.awk
#!/bin/awk -f
BEGIN {i=0}
{
    i++
    printf ("ATOM %5d %4s %3s %1s %3d %7.3f %7.3f %7.3f %5.2f %5.2f %1s\n",
            i, $5, $6, "L", 0, $7, $8, $9, 1, $1, $5)
}
END {print "END"}
How can I combine all this into one awk script?
I can provide source and target data if it would help, but I didn't want to make an unnecessarily long post.
Thanks!
Last edited by mchriste; 02-03-2009 at 04:03 PM.
Reason: clarification
|
|
|
02-03-2009, 04:58 PM
|
#2
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
Providing data (before & after) would surely help. And chances
are it can be done with awk alone =)
Cheers,
Tink
|
|
|
02-03-2009, 05:27 PM
|
#3
|
LQ Newbie
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10
Original Poster
Rep:
|
Thanks for the quick answer, Tinkster.
Source: (in *.dlg file)
Code:
(...) (variable length)
ATOM 1 O UNK 8.570 33.752 -25.184 +0.00 -0.07 -0.318 51.853
ATOM 2 C UNK 7.552 34.018 -24.226 +0.01 +0.08 +0.297 51.853
ATOM 3 C UNK 7.197 32.776 -23.398 -0.02 +0.11 +0.233 51.853
(...) (variable length)
DOCKED: MODEL 30
DOCKED: USER Run = 30
DOCKED: USER DPF = SOSc.dpf
DOCKED: USER
DOCKED: USER Estimated Free Energy of Binding = -3.15 kcal/mol [=(1)+(2)+(3)-(4)]
DOCKED: USER Estimated Inhibition Constant, Ki = 4.88 mM (millimolar) [Temperature = 298.15 K]
DOCKED: USER
DOCKED: USER (1) Final Intermolecular Energy = -11.12 kcal/mol
DOCKED: USER vdW + Hbond + desolv Energy = -3.47 kcal/mol
DOCKED: USER Electrostatic Energy = -7.64 kcal/mol
DOCKED: USER (2) Final Total Internal Energy = +9.90 kcal/mol
DOCKED: USER (3) Torsional Free Energy = +5.76 kcal/mol
DOCKED: USER (4) Unbound System's Energy = +7.70 kcal/mol
DOCKED: USER
(...) (variable length)
DOCKED: REMARK 21 A between atoms: O_44 and S_55
DOCKED: USER x y z vdW Elec q Type
DOCKED: USER _______ _______ _______ _____ _____ ______ ____
DOCKED: ROOT
DOCKED: ATOM 1 O UNK 10.202 3.560 -4.925 +0.01 -0.17 -0.318 OA
DOCKED: ENDROOT
DOCKED: BRANCH 1 2
DOCKED: ATOM 2 C UNK 8.987 4.249 -5.194 -0.01 +0.16 +0.297 C
(pattern repeats)
desired output: (roots.pdb)
Code:
ATOM 1 O UNK L 0 7.718 2.274 -6.002 1.00 -8.36 O
ATOM 2 O UNK L 0 10.215 4.608 -4.430 1.00 -8.24 O
ATOM 3 O UNK L 0 10.720 4.202 -4.754 1.00 -8.20 O
(...)
ATOM 141 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O
(...)
Important notes:
- the energies (ex: -3.15 for ATOM 141) must match the coordinates for each run (it's the purpose of the entire exercise!)
- the atom I want to extract for each run is always the "root" (always ATOM 1 for each run)
- there are thousands of runs per file resulting in one line for each in the output
- the coordinates I am interested in are always in a line that starts with DOCKED: ATOM
- I wish to sort the output by lowest energy
- original "ranking" is completely random (just by run # - here I gave run 30 as an example)
- output must follow exactly that format (including number of blank spaces)
- I'd like for this to work under Cygwin as well (just in case that's a limitation)
- bonus question: can I extract the run number as well? (appending it at the right end of each line in the output)
Thanks, Martin
Last edited by mchriste; 02-03-2009 at 05:43 PM.
Reason: clarification
|
|
|
02-03-2009, 08:28 PM
|
#4
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
Hmmm ... can you provide a slightly larger sample (one that would
produce more than one row of output)? Maybe 2 or 3 output lines?
Also, those sections with (variable length) ... are they actually separated
by blank lines or is the data stream not blank-line separated?
|
|
|
02-03-2009, 08:46 PM
|
#5
|
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,385
|
I have a rule I follow - if it's quick and the data is well-formed use sed; else if it's manipulating data for re-display, use awk; when it gets too complex/ugly with either of them, use perl.
So ... use perl 
|
|
|
02-04-2009, 11:43 AM
|
#6
|
LQ Newbie
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10
Original Poster
Rep:
|
OK... guess I should have attached a larger text sample to begin with. Sorry, my bad.
I am attaching an (extensively truncated) example source file here.
Note: Please rename .txt to .dlg
The parts where I removed stuff are indicated by the line
The output I get from this example using my cumbersome method is:
Code:
ATOM 1 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O
ATOM 2 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O
ATOM 3 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O
ATOM 4 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O
ATOM 5 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O
END
If I can get the above using just one script, I'd be happy.
But ideally, I would like this (the last number corresponding to the run / MODEL #)
Code:
ATOM 1 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O 256
ATOM 2 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O 30
ATOM 3 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O 1
ATOM 4 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O 3
ATOM 5 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O 2
END
Syg00: Unfortunately, I have zero understanding of perl...
The point is that I do know how to get the data I need using the scripts in the OP.
I would just like to concatenate all this into one script - whether it's awk, csh or perl.
Last edited by mchriste; 02-04-2009 at 11:51 AM.
Reason: clarification
|
|
|
02-04-2009, 02:13 PM
|
#7
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
Does this work for you? =o)
Code:
#!/bin/awk -f
BEGIN{
    i=0
}
{
    if ( $0 ~ /^DOCKED: USER Run =/){
        run = $5
        i++
    }
    if ( $0 ~ /^DOCKED: USER Estimated Free/){
        energy = $9
        i++
    }
    if ( $0 ~ /^DOCKED: ATOM 1 O/){
        line[energy] = sprintf ("ATOM %5d %4s %3s %1s %3d %7.3f %7.3f %7.3f %5.2f %5.2f %1s %3d", i, $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run)
    }
}
END{
    for ( j in line){
        print line[j] | "sort -k 11,11g";
    }
    close( "sort -k 11,11g")
    print "END"
}
Cheers,
Tink
Last edited by Tinkster; 02-04-2009 at 02:15 PM.
|
|
|
02-05-2009, 08:25 AM
|
#8
|
LQ Newbie
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10
Original Poster
Rep:
|
Thanks for the script, Tink!
It works *almost* perfectly, except that the atoms are not serially numbered.
Using the example file, with your script I get:
Code:
ATOM 10 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O 256
ATOM 8 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O 30
ATOM 2 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O 1
ATOM 6 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O 3
ATOM 4 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O 2
END
but I'd like:
Code:
ATOM 1 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O 256
ATOM 2 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O 30
ATOM 3 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O 1
ATOM 4 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O 3
ATOM 5 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O 2
END
I can't figure out what causes this first number to be wrong.
Seems like it doubles the run number (because you increase i twice?) but only up to a point...
Can you - or anybody else on this forum! :-) fix it?
Thanks for your help!
Last edited by mchriste; 02-05-2009 at 01:20 PM.
|
|
|
02-06-2009, 03:04 PM
|
#9
|
LQ Newbie
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10
Original Poster
Rep:
|
Uhm... anybody?
|
|
|
02-08-2009, 03:20 PM
|
#10
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
The problem here is that the numbering of column 2 needs to be
applied after the sort on the last one ... which means we have
to insert those *after* the sort, so it will again become a two
step process, where the numeric IDs get *created* after the awk.
I'll have a think about how to work this - sorry for the late
response, had a mini-holiday of 3.5 days =}
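A small sketch of why the counter comes out as 10, 8, 2, 6, 4 in the output above. It assumes (inferred from that output, not confirmed by the attachment) that the truncated example file holds runs 1, 2, 3, 30 and 256 in that order, and it mimics the two `i++` statements per run before sorting by energy. `counter_demo` is a made-up name for this illustration.

```shell
#!/bin/sh
# Illustration only: "run energy" pairs in assumed file order,
# with energies taken from the example output in this thread.
counter_demo() {
    printf '%s\n' '1 -2.99' '2 -2.48' '3 -2.61' '30 -3.15' '256 -3.46' |
    # two increments per run, as in the script above (Run line + Energy line)
    awk '{ i++; i++; print $2, "run=" $1, "i=" i }' |
    # sorting afterwards reorders the lines but not the counter values
    LC_ALL=C sort -k1,1n
}
```

Calling `counter_demo` should list the runs from lowest to highest energy with `i` values 10, 8, 2, 6, 4, i.e. twice each run's position in the file, which is why the serial number has to be assigned after the sort.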
|
|
|
02-11-2009, 03:19 PM
|
#11
|
LQ Newbie
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10
Original Poster
Rep:
|
Any progress with my problem yet?
Maybe something involving a counter mechanism for the new file (i.e. just counting line# of the output)
that is completely independent of the already applied i++ increment?
PS: Thanks for keeping me updated, Tink. I really do appreciate your help.
|
|
|
02-11-2009, 03:55 PM
|
#12
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
Quote:
Originally Posted by mchriste
Any progress with my problem yet?
Maybe something involving a counter mechanism for the new file (i.e. just counting line# of the output)
that is completely independent of the already applied i++ increment?
PS: Thanks for keeping me updated, Tink. I really do appreciate your help.
|
The problem with that is that the order is being determined by the
sort in the output inside the loop of the END processing ... I'd need
to see whether I can use gawk's built-in sort functions (asort or asorti)
to do the sorting instead, and then do the increment count in there.
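A minimal sketch of the "sort inside awk, then count" idea. Since `asort` and `asorti` are gawk extensions, this version swaps in a hand-written insertion sort so that it should run under any POSIX awk (which may matter under Cygwin). The function name `roots_single_awk` and the array names are invented for this example, and lines are matched by field rather than by exact spacing, which is an assumption about the .dlg layout.

```shell
#!/bin/sh
# Sketch: one awk program that collects, sorts and numbers the root atoms.
roots_single_awk() {
    awk '
        # remember the run number and energy of the current run
        $1 == "DOCKED:" && $2 == "USER" && $3 == "Run" { run = $5 }
        $1 == "DOCKED:" && $2 == "USER" && $3 == "Estimated" && $4 == "Free" { energy = $9 + 0 }
        # root atom (ATOM 1): store everything needed for the output line
        $1 == "DOCKED:" && $2 == "ATOM" && $3 == 1 {
            n++
            e[n] = energy; r[n] = run
            name[n] = $4; res[n] = $5; x[n] = $6; y[n] = $7; z[n] = $8
            order[n] = n
        }
        END {
            # insertion sort of the index array by ascending energy
            for (a = 2; a <= n; a++) {
                k = order[a]
                for (b = a - 1; b >= 1 && e[order[b]] > e[k]; b--)
                    order[b + 1] = order[b]
                order[b + 1] = k
            }
            # the serial number is simply the position after sorting
            for (a = 1; a <= n; a++) {
                k = order[a]
                printf "ATOM %5d %4s %3s %1s %3d %7.3f %7.3f %7.3f %5.2f %5.2f %1s %3d\n",
                       a, name[k], res[k], "L", 0, x[k], y[k], z[k], 1, e[k], name[k], r[k]
            }
            print "END"
        }
    ' "$@"
}
```

Usage might look like `roots_single_awk docking.dlg > roots.pdb`, where `docking.dlg` is a placeholder file name. Storing records by position instead of by energy also avoids losing a run when two runs happen to share the same energy.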
The other (quick, but less elegant) way I see (for now, but I'm kind
of preoccupied with other stuff) is to put something bogus in
the position of the counter, and then pipe that through a second awk
that just replaces the bogus with a padded line number ...
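A rough sketch of that placeholder idea as a single pipeline. The marker `@SER@` and the function name `roots_placeholder` are arbitrary choices for this example; it uses `sort -k11,11n` rather than the GNU-only `g` flag, and it assumes the energy ends up in the 11th whitespace-separated field of each formatted line.

```shell
#!/bin/sh
# Sketch: format first, sort by energy, then fill in the serial number.
roots_placeholder() {
    awk '
        $1 == "DOCKED:" && $2 == "USER" && $3 == "Run" { run = $5 }
        $1 == "DOCKED:" && $2 == "USER" && $3 == "Estimated" && $4 == "Free" { energy = $9 }
        # root atom: print the final layout with a 5-character marker
        # where the serial number will go
        $1 == "DOCKED:" && $2 == "ATOM" && $3 == 1 {
            printf "ATOM %5s %4s %3s %1s %3d %7.3f %7.3f %7.3f %5.2f %5.2f %1s %3d\n",
                   "@SER@", $4, $5, "L", 0, $6, $7, $8, 1, energy, $4, run
        }
    ' "$@" |
    # lowest energy first; field 11 is the energy column
    LC_ALL=C sort -k11,11n |
    # second awk: replace the marker with the padded output line number
    awk '{ sub(/@SER@/, sprintf("%5d", NR)); print } END { print "END" }'
}
```

Something like `roots_placeholder *.dlg > roots.pdb` would then replace both the csh wrapper and the temp files, at the cost of running awk twice.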
Can you try and work with those (English) pseudo instructions yourself?
I guess it would only take me half an hour, but that's currently hard
to dig up
Cheers,
Tink
|
|
|
03-05-2009, 12:40 PM
|
#13
|
LQ Newbie
Registered: Feb 2009
Distribution: OpenSuse 11.1
Posts: 10
Original Poster
Rep:
|
Thanks Tink, I finally figured out something that works using your suggestions.
|
|
|
03-05-2009, 01:57 PM
|
#14
|
Moderator
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
|
Glad to hear it's done, thanks for coming back with the feedback
and sorry I couldn't be of more assistance.
Cheers,
Tink
|
|
|
All times are GMT -5. The time now is 01:25 PM.