I need to extract chunks of information from a file based on two variables that are stored within another file.
I have attached an example "data" file (alldocked.pdbqt) and "variables" file (roots.pdb).
Important notes:
- remove the .txt extension from both files first (the forum requires it for attachments)
- I had to heavily snip the data file. A real data file can easily be 10 MB.
The format for each record I want to extract is the following:
"Data" file: alldocked.pdbqt
Code:
<filename>:DOCKED: MODEL <model>
<filename>:DOCKED: USER (...)
(...)
<filename>:DOCKED: (all kinds of stuff, many lines)
(...)
<filename>:DOCKED: ENDMDL
<filename>:DOCKED: MODEL <model>
<filename>:DOCKED: USER (...)
(...)
<filename>:DOCKED: (all kinds of stuff, many lines)
(...)
<filename>:DOCKED: ENDMDL
(etc)
The <filename> and <model> variables are stored in a separate file called roots.pdb.
Each line of roots.pdb has the following format:
Code:
ATOM 1 O UNK L 0 <coord1> <coord2> <coord3> 1.00 <energy> O <filename> <model>
Characteristics of the data:
- <filename> corresponds to field #13, <model> to field #14 in each line.
- Both <filename> and <model> vary INDEPENDENTLY for each record.
- <filename> can be absolutely random but it will always end with ".dlg".
- There may be up to 99 different <filename> in the alldocked.pdbqt file.
- <model> is always an integer and can range from 1 to 999.
- The "data" file is always called alldocked.pdbqt, the "variables" file roots.pdb
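To illustrate the field layout listed above, here is a minimal Python sketch of how the two variables could be read from roots.pdb. It assumes the fields are separated by whitespace and that <filename> contains no spaces; the function names `parse_roots_line` and `read_keys` are hypothetical and only serve the illustration.

```python
from itertools import islice


def parse_roots_line(line):
    """Return (filename, model) from one roots.pdb line, or None if it does not fit."""
    fields = line.split()
    # Field #13 is the filename and field #14 the model (counting from 1),
    # which means list indices 12 and 13.
    if len(fields) < 14:
        return None
    filename, model = fields[12], fields[13]
    # Assumption taken from the description: the filename ends with ".dlg"
    # and the model is a plain integer.
    if not filename.endswith(".dlg") or not model.isdigit():
        return None
    return filename, int(model)


def read_keys(lines, limit=20):
    """Collect (filename, model) pairs from the first `limit` lines, in file order."""
    keys = []
    for line in islice(lines, limit):
        key = parse_roots_line(line)
        if key is not None:  # lines that do not match the layout are skipped
            keys.append(key)
    return keys
```

With a line such as "ATOM 1 O UNK L 0 9.200 5.778 -4.800 1.00 -10.02 O SOS_charged1.dlg 13", `parse_roots_line` would return the pair ("SOS_charged1.dlg", 13). If two coordinate columns ever run together without a space, the field count changes and this whitespace-based approach would skip that line.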
WHAT I WOULD LIKE TO GET:
I need a file that contains the records corresponding to the first x (say, 20) lines of roots.pdb.
I also need to add a line containing the <filename> between each record.
(Yes I know that <filename> is included on each line, but later I will strip this).
So, the output format I desire would be (output.pdbqt):
Code:
<filename>:DOCKED: REMARK <filename>
<filename>:DOCKED: MODEL <model>
<filename>:DOCKED: USER (...)
(...)
<filename>:DOCKED: (all kinds of stuff, many lines)
(...)
<filename>:DOCKED: ENDMDL
<filename>:DOCKED: REMARK <filename>
<filename>:DOCKED: MODEL <model>
<filename>:DOCKED: USER (...)
(...)
<filename>:DOCKED: (all kinds of stuff, many lines)
(...)
<filename>:DOCKED: ENDMDL
(...)
Enough with the abstract descriptions, here's a real example:
The example roots.pdb file I attached starts with the following lines:
Code:
ATOM 1 O UNK L 0 9.200 5.778 -4.800 1.00 -10.02 O SOS_charged1.dlg 13
ATOM 2 O UNK L 0 13.230 3.129 -6.038 1.00 -7.94 O SOS_charged2.dlg 20
ATOM 3 O UNK L 0 11.295 0.656 -6.503 1.00 -7.91 O SOS_charged1.dlg 7
So, the output file I would like should start with:
Code:
SOS_charged1.dlg:DOCKED: REMARK SOS_charged1.dlg
SOS_charged1.dlg:DOCKED: MODEL 13
SOS_charged1.dlg:DOCKED: USER (...)
(...)
SOS_charged1.dlg:DOCKED: (all data in alldocked.pdbqt that corresponds to model SOS_charged1.dlg 13)
(...)
SOS_charged1.dlg:DOCKED: ENDMDL
SOS_charged2.dlg:DOCKED: REMARK SOS_charged2.dlg
SOS_charged2.dlg:DOCKED: MODEL 20
SOS_charged2.dlg:DOCKED: USER (...)
(...)
SOS_charged2.dlg:DOCKED: (all data in alldocked.pdbqt that corresponds to model SOS_charged2.dlg 20)
(...)
SOS_charged2.dlg:DOCKED: ENDMDL
SOS_charged1.dlg:DOCKED: REMARK SOS_charged1.dlg
SOS_charged1.dlg:DOCKED: MODEL 7
SOS_charged1.dlg:DOCKED: USER (...)
(...)
SOS_charged1.dlg:DOCKED: (all data in alldocked.pdbqt that corresponds to model SOS_charged1.dlg 7)
(...)
SOS_charged1.dlg:DOCKED: ENDMDL
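The extraction described above could be sketched in Python roughly as follows. This is only one possible approach, not a tested solution for real docking output. It assumes that every line of a record starts with `<filename>:DOCKED:`, that each record runs from its MODEL line to the next ENDMDL line, and that the fields in roots.pdb are separated by whitespace. The names `extract_records`, `build_output` and `main` are hypothetical.

```python
from itertools import islice

MARKER = ":DOCKED:"


def extract_records(data_lines, keys):
    """Scan the data once and keep the lines of every wanted (filename, model) record."""
    wanted = set(keys)
    records = {}
    current = None  # key of the record being copied, or None between wanted records
    for raw in data_lines:
        line = raw.rstrip("\n")
        name, sep, rest = line.partition(MARKER)
        if not sep:
            continue  # line does not follow the "<filename>:DOCKED:" layout
        words = rest.split()
        if len(words) >= 2 and words[0] == "MODEL" and words[1].isdigit():
            key = (name, int(words[1]))
            current = key if key in wanted else None
            if current is not None:
                records[current] = []
        if current is not None and name == current[0]:
            records[current].append(line)
            if words[:1] == ["ENDMDL"]:
                current = None  # record is complete
    return records


def build_output(records, keys):
    """Return the output lines in roots.pdb order, each record preceded by a REMARK line."""
    out = []
    for name, model in keys:
        block = records.get((name, model))
        if block is None:
            continue  # record listed in roots.pdb but not found in the data file
        out.append(f"{name}{MARKER} REMARK {name}")
        out.extend(block)
    return out


def main(limit=20):
    """Read the first `limit` lines of roots.pdb and write output.pdbqt."""
    keys = []
    with open("roots.pdb") as roots:
        for line in islice(roots, limit):
            fields = line.split()
            # Fields #13 and #14 (counting from 1) hold the filename and the model.
            if len(fields) >= 14 and fields[13].isdigit():
                keys.append((fields[12], int(fields[13])))
    with open("alldocked.pdbqt") as data:
        records = extract_records(data, keys)
    with open("output.pdbqt", "w") as out:
        for line in build_output(records, keys):
            out.write(line + "\n")
```

Calling `main()` from the directory that holds both files would write output.pdbqt. The data file is read only once, which should keep a 10 MB file manageable, and the model number is compared as a whole integer so that model 1 is not confused with model 13.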
Who can help me? I'm trying to analyze protein-ligand docking results in a way that is not provided by the software I'm using - but I'm a biochemist, not a programmer!