Hi people, new user here. Hope I can get some help.
I have to extract data pairwise from two lines in one big (4MB+) text file.
(The script is applied to a series of files but the data always comes pairwise from the same file)
I did kind of solve the problem by using a csh script to call an awk script
(and generating 4 temp files) but I'd like a more elegant way of doing it.
(...) (variable length)
ATOM 1 O UNK 8.570 33.752 -25.184 +0.00 -0.07 -0.318 51.853
ATOM 2 C UNK 7.552 34.018 -24.226 +0.01 +0.08 +0.297 51.853
ATOM 3 C UNK 7.197 32.776 -23.398 -0.02 +0.11 +0.233 51.853
(...) (variable length)
DOCKED: MODEL 30
DOCKED: USER Run = 30
DOCKED: USER DPF = SOSc.dpf
DOCKED: USER
DOCKED: USER Estimated Free Energy of Binding = -3.15 kcal/mol [=(1)+(2)+(3)-(4)]
DOCKED: USER Estimated Inhibition Constant, Ki = 4.88 mM (millimolar) [Temperature = 298.15 K]
DOCKED: USER
DOCKED: USER (1) Final Intermolecular Energy = -11.12 kcal/mol
DOCKED: USER vdW + Hbond + desolv Energy = -3.47 kcal/mol
DOCKED: USER Electrostatic Energy = -7.64 kcal/mol
DOCKED: USER (2) Final Total Internal Energy = +9.90 kcal/mol
DOCKED: USER (3) Torsional Free Energy = +5.76 kcal/mol
DOCKED: USER (4) Unbound System's Energy = +7.70 kcal/mol
DOCKED: USER
(...) (variable length)
DOCKED: REMARK 21 A between atoms: O_44 and S_55
DOCKED: USER x y z vdW Elec q Type
DOCKED: USER _______ _______ _______ _____ _____ ______ ____
DOCKED: ROOT
DOCKED: ATOM 1 O UNK 10.202 3.560 -4.925 +0.01 -0.17 -0.318 OA
DOCKED: ENDROOT
DOCKED: BRANCH 1 2
DOCKED: ATOM 2 C UNK 8.987 4.249 -5.194 -0.01 +0.16 +0.297 C
(pattern repeats)
desired output: (roots.pdb)
Code:
ATOM 1 O UNK L 0 7.718 2.274 -6.002 1.00 -8.36 O
ATOM 2 O UNK L 0 10.215 4.608 -4.430 1.00 -8.24 O
ATOM 3 O UNK L 0 10.720 4.202 -4.754 1.00 -8.20 O
(...)
ATOM 141 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O
(...)
Important notes:
- the energies (ex: -3.15 for ATOM 141) must match the coordinates for each run (it's the purpose of the entire exercise!)
- the atom I want to extract for each run is always the "root" (always ATOM 1 for each run)
- there are thousands of runs per file resulting in one line for each in the output
- the coordinates I am interested in are always in a line that starts with DOCKED: ATOM
- I wish to sort the output by lowest energy
- original "ranking" is completely random (just by run # - here I gave run 30 as an example)
- output must follow exactly that format (including number of blank spaces)
- I'd like for this to work under Cygwin as well (just in case that's a limitation)
- bonus question: can I extract the run number as well? (appending it at the right end of each line in the output)
Thanks, Martin
Last edited by mchriste; 02-03-2009 at 05:43 PM.
Reason: clarification
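The requirements listed in the question can be sketched as a single pass over the file. This is only an illustration in Python, not the awk/csh script the poster used: the function name `extract_roots` is hypothetical, and the column positions are assumed to match the whitespace-separated sample lines shown above (a real .dlg file may be laid out differently).

```python
import re

# Sample lines copied from the excerpt in the question; real files are far longer.
SAMPLE = """\
ATOM 1 O UNK 8.570 33.752 -25.184 +0.00 -0.07 -0.318 51.853
DOCKED: MODEL 30
DOCKED: USER Run = 30
DOCKED: USER Estimated Free Energy of Binding = -3.15 kcal/mol [=(1)+(2)+(3)-(4)]
DOCKED: ROOT
DOCKED: ATOM 1 O UNK 10.202 3.560 -4.925 +0.01 -0.17 -0.318 OA
DOCKED: ENDROOT
DOCKED: BRANCH 1 2
DOCKED: ATOM 2 C UNK 8.987 4.249 -5.194 -0.01 +0.16 +0.297 C
"""

ENERGY_RE = re.compile(
    r"Estimated Free Energy of Binding\s*=\s*([-+]?\d+(?:\.\d+)?)"
)


def extract_roots(lines):
    """Yield (run, energy, x, y, z) for the root atom of each docked model."""
    run = energy = None
    for line in lines:
        if not line.startswith("DOCKED:"):
            continue  # ignores the plain ATOM lines near the top of the file
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "MODEL":
            # A new run starts: remember its number, forget the old energy.
            run, energy = int(fields[2]), None
        elif "Estimated Free Energy of Binding" in line:
            match = ENERGY_RE.search(line)
            if match:
                energy = float(match.group(1))
        elif len(fields) >= 8 and fields[1] == "ATOM" and fields[2] == "1":
            # Root atom: pair its coordinates with the energy of the same run.
            if run is not None and energy is not None:
                x, y, z = (float(v) for v in fields[5:8])
                yield run, energy, x, y, z
                energy = None  # guard against emitting the same run twice


if __name__ == "__main__":
    for record in extract_roots(SAMPLE.splitlines()):
        print(record)
```

Keeping the run number and energy in variables until the root atom line shows up is what ties each energy to the right coordinates, so no temp files are needed.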
I have a rule I follow - if it's quick and the data is well-formed use sed; else if it's manipulating data for re-display, use awk; when it gets too complex/ugly with either of them, use perl.
OK... guess I should have attached a larger text sample to begin with. Sorry, my bad.
I am attaching an (extensively truncated) example source file here.
Note: Please rename .txt to .dlg
The parts where I removed stuff are indicated by the line
Code:
>TRUNCATED HERE
The output I get from this example using my cumbersome method is:
Code:
ATOM 1 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O
ATOM 2 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O
ATOM 3 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O
ATOM 4 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O
ATOM 5 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O
END
If I can get the above using just one script, I'd be happy.
But ideally, I would like this (the last number corresponding to the run / MODEL #)
Code:
ATOM 1 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O 256
ATOM 2 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O 30
ATOM 3 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O 1
ATOM 4 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O 3
ATOM 5 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O 2
END
Syg00: Unfortunately, I have zero understanding of perl...
The point is that I do know how to get the data I need using the scripts in the OP.
I would just like to concatenate all this into one script - whether it's awk, csh or perl.
Last edited by mchriste; 02-04-2009 at 11:51 AM.
Reason: clarification
Thanks for the script, Tink!
It works *almost* perfectly, except that the atoms are not serially numbered.
Using the example file, with your script I get:
Code:
ATOM 10 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O 256
ATOM 8 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O 30
ATOM 2 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O 1
ATOM 6 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O 3
ATOM 4 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O 2
END
but I'd like:
Code:
ATOM 1 O UNK L 0 8.544 24.334 -2.603 1.00 -3.46 O 256
ATOM 2 O UNK L 0 10.202 3.560 -4.925 1.00 -3.15 O 30
ATOM 3 O UNK L 0 11.768 9.257 0.669 1.00 -2.99 O 1
ATOM 4 O UNK L 0 11.426 2.115 -3.358 1.00 -2.61 O 3
ATOM 5 O UNK L 0 13.380 20.326 -2.311 1.00 -2.48 O 2
END
I can't figure out what causes this first number to be wrong.
Seems like it doubles the run number (because you increase i twice?) but only up to a point...
Can you - or anybody else on this forum! :-) fix it?
The problem here is that the numbering of column 2 needs to be
applied after the sort on the last one ... which means we have
to insert those *after* the sort, so it will again become a two
step process, where the numeric IDs get *created* after the awk.
I'll have a think about how to work this - sorry for the late
response, had a mini-holiday of 3.5 days =}
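The point about numbering after the sort can be shown with a small sketch. It is written in Python rather than awk, the name `to_pdb_lines` is hypothetical, and the column widths only approximate the usual PDB ATOM layout, which is an assumption because the exact spacing of roots.pdb is not visible in the thread. The records are the five runs from the example output above.

```python
# (run, energy, x, y, z) in file order, taken from the example output above.
RECORDS = [
    (1, -2.99, 11.768, 9.257, 0.669),
    (2, -2.48, 13.380, 20.326, -2.311),
    (3, -2.61, 11.426, 2.115, -3.358),
    (30, -3.15, 10.202, 3.560, -4.925),
    (256, -3.46, 8.544, 24.334, -2.603),
]

# Assumed layout: serial, atom/residue labels, coordinates, occupancy 1.00,
# energy in the B-factor column, element, then the run number appended.
PDB_FMT = (
    "ATOM  {serial:5d}  O   UNK L   0    "
    "{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{energy:6.2f}"
    "           O {run:5d}"
)


def to_pdb_lines(records):
    """Sort by energy first, then hand out serial numbers 1, 2, 3, ..."""
    ordered = sorted(records, key=lambda rec: rec[1])  # lowest energy first
    lines = []
    for serial, (run, energy, x, y, z) in enumerate(ordered, start=1):
        lines.append(
            PDB_FMT.format(
                serial=serial, x=x, y=y, z=z, occ=1.0, energy=energy, run=run
            )
        )
    lines.append("END")
    return lines


if __name__ == "__main__":
    print("\n".join(to_pdb_lines(RECORDS)))
```

Because the serial comes from the position in the sorted list and not from a counter incremented while reading, it stays independent of the run number.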
Maybe something involving a counter mechanism for the new file (i.e. just counting line# of the output)
that is completely independent of the already applied i++ increment?
PS: Thanks for keeping me updated, Tink. I really do appreciate your help.
The problem with that is that the order is being determined by the
sort in the output inside the loop of the END processing ... I'd need
to see whether I can use gawk's built-in sort functions (asort or asorti)
to do the sorting instead, and then do the increment count in there.
The other (quick, but less elegant) way I see (for now, but I'm kind
of preoccupied with other stuff) is to put something bogus in
the position of the counter, and then pipe that through a second awk
that just replaces the bogus with a padded line number ...
Can you try and work with those (English) pseudo instructions yourself?
I guess it would only take me half an hour, but that's currently hard
to dig up.
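The quick two-step idea (a bogus token in the counter position, then a second pass that swaps it for a padded line number) might look roughly like this. It is a Python sketch rather than a second awk, the marker `@@@@@` and the function names are made up for illustration, and the column widths are again an assumed PDB-like layout. The three records are runs 1, 2 and 30 from the example output.

```python
PLACEHOLDER = "@@@@@"  # hypothetical marker, same width as the serial column


def stage_one(records):
    """First pass: emit one line per run, serial still unknown."""
    return [
        f"ATOM  {PLACEHOLDER}  O   UNK L   0    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}  1.00{energy:6.2f}           O {run:5d}"
        for run, energy, x, y, z in records
    ]


def energy_of(line):
    """Sort key: the energy is the 11th whitespace-separated field."""
    return float(line.split()[10])


def number_lines(lines):
    """Second pass: replace the marker with a right-aligned line number."""
    return [
        line.replace(PLACEHOLDER, f"{n:5d}", 1)
        for n, line in enumerate(lines, start=1)
    ]


if __name__ == "__main__":
    records = [
        (1, -2.99, 11.768, 9.257, 0.669),
        (2, -2.48, 13.380, 20.326, -2.311),
        (30, -3.15, 10.202, 3.560, -4.925),
    ]
    ordered = sorted(stage_one(records), key=energy_of)
    print("\n".join(number_lines(ordered) + ["END"]))
```

The second pass only looks at line position, so it does not care how the first pass counted runs.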