Can anyone help me with this file modification problem?

Deadally · 06-09-2011, 03:44 PM

Hello everyone!

I am not a programmer, per se. I have had a very basic introduction to Python; however, I recognize a problem that might be solved with Linux shell scripts or Python. If I cannot solve this problem, it may well make my proposed project intractable. So, can anybody help me do this FOR SCIENCE?!

I have a file in the .mol2 format (the format itself is unimportant). What is important is that it is basically thousands and thousands of concatenated files in this format:

@<TRIPOS>MOLECULE
ZINC41503146
34 36 0 0 0
SMALL
USER_CHARGES

@<TRIPOS>ATOM
1 C1 -0.0164 1.3570 0.0095 C.ar 1 <0> -0.1101
2 C2 1.1559 2.1003 0.0023 C.ar 1 <0> -0.0852
3 C3 2.3751 1.4492 -0.0129 C.ar 1 <0> -0.0666
4 C4 2.4044 0.0758 -0.0205 C.ar 1 <0> -0.1663
5 C5 1.2262 -0.7047 -0.0135 C.ar 1 <0> 0.0726
6 C6 0.0021 -0.0041 0.0020 C.ar 1 <0> -0.0508
7 N1 1.4635 -2.0326 -0.0233 N.2 1 <0> -0.4836
8 C7 2.6693 -2.5212 -0.0378 C.2 1 <0> 0.0660
9 S1 3.6694 -1.1675 -0.0399 S.3 1 <0> 0.1590
10 C8 3.0782 -3.9717 -0.0496 C.3 1 <0> 0.0671
11 H1 2.1877 -4.6002 -0.0441 H 1 <0> 0.1779
12 C9 3.8673 -4.2492 -1.2608 C.1 1 <0> 0.1986
13 N2 4.4762 -4.4634 -2.1956 N.1 1 <0> -0.3600
14 C10 3.9103 -4.2689 1.1712 C.2 1 <0> 0.3657
15 O1 4.5222 -3.3815 1.7155 O.2 1 <0> -0.3757
16 C11 3.9788 -5.6736 1.7126 C.3 1 <0> 0.0132
17 O2 4.8242 -5.6994 2.8645 O.3 1 <0> -0.2798
18 C12 4.9944 -6.9006 3.4785 C.ar 1 <0> 0.1265
19 C13 4.3653 -8.0310 2.9786 C.ar 1 <0> -0.1695
20 C14 4.5394 -9.2510 3.6030 C.ar 1 <0> -0.1058
21 C15 5.3415 -9.3462 4.7277 C.ar 1 <0> 0.0803
22 C16 5.9704 -8.2192 5.2284 C.ar 1 <0> -0.1070
23 C17 5.8024 -6.9978 4.6027 C.ar 1 <0> -0.0390
24 Cl1 6.5963 -5.5857 5.2266 Cl 1 <0> -0.0336
25 F1 5.5101 -10.5401 5.3372 F 1 <0> -0.1304
26 H2 -0.9661 1.8710 0.0259 H 1 <0> 0.1383
27 H3 1.1157 3.1795 0.0081 H 1 <0> 0.1423
28 H4 3.2953 2.0145 -0.0190 H 1 <0> 0.1380
29 H5 -0.9285 -0.5522 0.0083 H 1 <0> 0.1367
30 H6 2.9781 -6.0058 1.9888 H 1 <0> 0.1158
31 H7 4.3843 -6.3378 0.9494 H 1 <0> 0.1108
32 H8 3.7392 -7.9575 2.1017 H 1 <0> 0.1500
33 H9 4.0491 -10.1312 3.2139 H 1 <0> 0.1487
34 H10 6.5958 -8.2949 6.1057 H 1 <0> 0.1555
@<TRIPOS>BOND
1 1 6 ar
2 1 2 ar
3 1 26 1
4 2 3 ar
5 2 27 1
6 3 4 ar
7 3 28 1
8 4 9 1
9 4 5 ar
10 5 6 ar
11 5 7 1
12 6 29 1
13 7 8 2
14 8 9 1
15 8 10 1
16 10 11 1
17 10 12 1
18 10 14 1
19 12 13 3
20 14 15 2
21 14 16 1
22 16 17 1
23 16 30 1
24 16 31 1
25 17 18 1
26 18 23 ar
27 18 19 ar
28 19 20 ar
29 19 32 1
30 20 21 ar
31 20 33 1
32 21 22 ar
33 21 25 1
34 22 23 ar
35 22 34 1
36 23 24 1

Here's what I want to do:

I need a way of screening out these guys that I can be quite sure will be unsuitable for my calculation. I would like to search the individual files and remove the ones that do not have a certain string. In this case, if it does not have a type under the @<TRIPOS>ATOM C.ar, then I want it to be removed from the file entirely.

Is there any way to automate this process? I have more than 7 million of these compounds to try and screen out, and I expect that I can eliminate a few million if I can accomplish this. I would be forever indebted to anybody who has insight!

Thank you

dugan · 06-09-2011, 04:40 PM

I'm going to need a better example. I don't see "@<TRIPOS>ATOM C.ar" in the example you posted, and I don't know which part the "type" is.

SigTerm · 06-09-2011, 05:55 PM

Quote:

Originally Posted by Deadally

I am not a programmer, per se.

You could ask somebody you know in person to help you (in exchange for a favor, for example), or you could hire a freelancer.

Quote:

Originally Posted by Deadally

Is there any way to automate this process?

Yes, there are multiple ways to approach the problem. Linux has the "grep" utility (regular expression pattern matching), which can be called from within a shell script; Python has regexp handling routines; Perl has built-in regular expression support. Basically, any language that supports strings and (optionally) regular expressions could be used for this task: shell scripts, Perl, Python, almost anything.

However, if you "aren't a programmer", then the quickest solution would be to hire/ask somebody else to do it for you.
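
For anyone taking the Python route, here is a minimal sketch of the regular-expression idea. One detail matters for this task: an unescaped "." matches any character, so "C.ar" has to be escaped to match only the literal atom type. The names LOOSE, STRICT and tricky_line are made up for illustration, and tricky_line is hypothetical text, not something from a real file.

```python
import re

# Unescaped, "." matches any single character, so this is looser than intended.
LOOSE = re.compile(r"C.ar")
# Escaped dot, and no non-space character allowed on either side:
# matches only the exact whitespace-delimited token C.ar.
STRICT = re.compile(r"(?<!\S)C\.ar(?!\S)")

atom_line = "1 C1 -0.0164 1.3570 0.0095 C.ar 1 <0> -0.1101"
other_line = "7 N1 1.4635 -2.0326 -0.0233 N.2 1 <0> -0.4836"
tricky_line = "ZINC_Cxar_example"  # hypothetical text, not from a real file

print(bool(STRICT.search(atom_line)))    # True
print(bool(STRICT.search(other_line)))   # False
print(bool(LOOSE.search(tricky_line)))   # True (a false positive)
print(bool(STRICT.search(tricky_line)))  # False
```

The same escaping applies to grep, sed and awk patterns.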

jschiwal · 06-09-2011, 06:57 PM

Do you simply want to include the files if they contain the string "C.ar" as in the line "1 C1 -0.0164 1.3570 0.0095 C.ar 1 <0> -0.1101"?

Code:

grep -l 'C\.ar' *.mol2 | xargs cat >> bigfile.mol2

The `-l' option prints only the name of each file that contains a match and stops reading that file after the first match. The xargs command passes the file names printed by grep as arguments to the command "cat >> bigfile.mol2".

If the pattern "C.ar" might occur in one of the other blocks, but you need to detect it only in the ATOM block, you can use `sed' to select just this block.

Code:

# Run this inside a loop that sets "$file" to one file name at a time.
sed -n '/@<TRIPOS>ATOM/,/@<TRIPOS>BOND/{
          /C\.ar/ { s/.*/'"$file"'/p;q }
        }' "$file"

Inside the range of lines, if the pattern "C.ar" is present, the line is replaced with the filename and printed, and then sed quits.

Here is an example. The files sample1.mol2 and sample3.mol2 are copies of your example. I altered it for sample2.mol2 without the pattern for testing:

Code:

for file in *.mol2; do
> sed -n '/@<TRIPOS>ATOM/,/@<TRIPOS>BOND/{ /C\.ar/ { s/.*/'"$file"'/p;q} }' "$file"
> done
sample1.mol2
sample3.mol2

The results could be piped to xargs, or I could have used "$file >>bigfile.mol2" in the sed line above to construct a script you could review and then run to assemble your big cat'ed file.

note:
You should put your sample inside [ code ] ... [ /code ] blocks and cut and paste. There is a blank line preceding the "@<TRIPOS>ATOM" header but not the "@<TRIPOS>BOND" header. This seems inconsistent.
Exactness is needed when dealing with regular expressions.

Deadally · 06-09-2011, 07:10 PM

Thanks for the suggestions so far, everyone. I'm sorry I was not more specific.

Basically, under the header @<TRIPOS>ATOM, I want it to search and ask whether an atom of type C.ar is present. Atom C1 is an example of this. If this is true, then I want it to leave it alone or export the whole entry to a different file. If no C.ar is found, I would like to either delete or ignore the entire entry so that these are not included in my screen.

I am not averse to learning to program it myself. I am simply inexperienced and did not know if this was a trivial problem for someone who knows better.

Thank you again! I hope this clarifies the situation some.
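
The rule described above (look only under the @<TRIPOS>ATOM header and ask whether any atom has type C.ar) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name has_atom_type is made up, the atom type is assumed to be the sixth whitespace-separated column as in the posted sample, and the two test entries are heavily trimmed versions of real entries.

```python
def has_atom_type(entry_lines, wanted="C.ar"):
    """Return True if any atom in the @<TRIPOS>ATOM section has type `wanted`."""
    in_atoms = False
    for line in entry_lines:
        stripped = line.strip()
        if stripped.startswith("@<TRIPOS>"):
            # Every section header resets the state; only ATOM switches it on.
            in_atoms = stripped == "@<TRIPOS>ATOM"
            continue
        if in_atoms:
            fields = stripped.split()
            # Assumed column order: id, name, x, y, z, type, ...
            if len(fields) >= 6 and fields[5] == wanted:
                return True
    return False


# Trimmed entries for illustration only (one atom line each).
kept = """@<TRIPOS>MOLECULE
ZINC41503146
@<TRIPOS>ATOM
1 C1 -0.0164 1.3570 0.0095 C.ar 1 <0> -0.1101
@<TRIPOS>BOND
1 1 6 ar
""".splitlines()

dropped = """@<TRIPOS>MOLECULE
ZINC02782238
@<TRIPOS>ATOM
1 C1 0.0021 -0.0041 0.0020 C.3 1 <0> -0.1669
@<TRIPOS>BOND
1 1 2 1
""".splitlines()

print(has_atom_type(kept))     # True
print(has_atom_type(dropped))  # False
```

Comparing the whole column rather than searching for a substring means that text in the molecule name or the bond section cannot trigger a match.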

jschiwal · 06-09-2011, 07:56 PM

Are only the lines under @<TRIPOS>ATOM affected? Do the corresponding entries under @<TRIPOS>BOND also need to be deleted? Could you manually post what the result should be so we can provide the advice you need?

This will only print lines with C.ar inside the @<TRIPOS>ATOM block:

Code:

sed -n '/@<TRIPOS>ATOM/,/@<TRIPOS>BOND/!p
                        /@<TRIPOS>ATOM/,/@<TRIPOS>BOND/{ /@<TRIPOS>ATOM/p
                                                         /@<TRIPOS>BOND/p
                                                         /C.ar/p
                                                       }' sample.mol2

If corresponding lines in the @<TRIPOS>BOND block need to be deleted as well, `awk' would be a better command to use.

Deadally · 06-09-2011, 08:12 PM

jschiwal, the entire entry would need to be excluded if C.ar types are not present. What I posted is one excerpt of a file containing thousands of similar entries. If C.ar is not present, then I want the entire entry from @<TRIPOS>MOLECULE to the last line under the bond header to disappear. Basically, each of these entries is a molecule, and if a certain type of carbon is not present, I know I do not want that molecule.

So a typical result you should see will have either the entire entry there or not, depending on the search criteria. Does that make more sense?

jschiwal · 06-09-2011, 08:42 PM

So what you posted is a single "entry". Do you start with one entry per file? If so, the grep example could be used to cat together only the files that contain lines with "C.ar".

If you start with a file containing thousands of entries, we need to know how separate entries are delineated. Also, is there a blank line before @<TRIPOS>ATOM?

Is the text "TRIPOS" in @<TRIPOS> a constant, or is it something else in other entries?

grail · 06-09-2011, 09:43 PM

I would like to know if all segments start with:

@<TRIPOS>MOLECULE

If so, then something like:

Code:

awk '/C\.ar/{print RS $0}' RS="@<TRIPOS>MOLECULE" file.mol2
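
For comparison, the same record-splitting idea could be sketched in Python as below. It reads the input line by line, starts a new entry at every @<TRIPOS>MOLECULE header, and writes out only entries that contain the search string anywhere, like the awk one-liner. The function names iter_entries and filter_entries and the tiny two-entry sample are made up for illustration; this is a sketch under those assumptions, not a tested tool for real ZINC files.

```python
import io

HEADER = "@<TRIPOS>MOLECULE"


def iter_entries(lines):
    """Yield each entry as a list of lines, starting at its MOLECULE header."""
    entry = []
    for line in lines:
        if line.startswith(HEADER) and entry:
            yield entry
            entry = []
        entry.append(line)
    if entry:
        yield entry


def filter_entries(src, dst, needle="C.ar"):
    """Copy entries containing `needle` from src to dst; return (kept, total)."""
    kept = total = 0
    for entry in iter_entries(src):
        total += 1
        # Plain substring test, so the dot is literal (no regex involved).
        if any(needle in line for line in entry):
            dst.writelines(entry)
            kept += 1
    return kept, total


# Tiny in-memory demonstration with two trimmed, made-up entries.
sample = io.StringIO(
    "@<TRIPOS>MOLECULE\nA\n@<TRIPOS>ATOM\n1 C1 0 0 0 C.3 1 <0> 0\n"
    "@<TRIPOS>MOLECULE\nB\n@<TRIPOS>ATOM\n1 C1 0 0 0 C.ar 1 <0> 0\n"
)
out = io.StringIO()
print(filter_entries(sample, out))  # (1, 2)
print(out.getvalue(), end="")       # only entry B is written
```

With real data, src and dst would be ordinary file objects from open(). Only one entry is held in memory at a time, which matters when a file holds thousands of molecules. Any lines before the first header are treated as an entry of their own.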

Deadally · 06-10-2011, 06:07 AM

Grail,

That seems to be getting a lot closer to what I am looking for! Yes, every entry begins with that header. They are identical in their layout. They just have different numbers of atoms, bonds, as well as different types. If an atom of type C.ar is not present, I want to exclude that entry entirely. Is the awk command going to be able to do that?

Jschiwal, this is one file with thousands of entries, as in thousands of molecules per file. They're packaged together for convenience, since 7 million individual files would be rather burdensome for the task at hand. Each entry is delineated by the molecule header, and every entry contains the information shown in my example. There are no blank lines in the real file. I'll attach another example in code tags:

Code:

@<TRIPOS>MOLECULE
ZINC02782238
   39    40     0     0     0
SMALL
USER_CHARGES
N-(3-carbamoyl-1-ethyl-pyrazol-4-yl)-1-ethyl-3-methyl-pyrazole-4-carboxamide
@<TRIPOS>ATOM
      1 C1          0.0021   -0.0041    0.0020 C.3       1 <0>        -0.1669
      2 C2         -0.0187    1.5258    0.0104 C.3       1 <0>         0.1233
      3 N1         -0.7044    1.9982    1.2158 N.pl3     1 <0>        -0.3527
      4 C3         -0.1980    2.8468    2.1248 C.2       1 <0>         0.1070
      5 C4         -1.1606    3.0341    3.0965 C.2       1 <0>        -0.2935
      6 C5         -2.2770    2.2471    2.7348 C.2       1 <0>         0.1364
      7 N2         -1.9953    1.6365    1.6169 N.2       1 <0>        -0.2961
      8 C6         -3.5668    2.1327    3.5056 C.3       1 <0>        -0.0784
      9 C7         -1.0433    3.8899    4.2911 C.2       1 <0>         0.5983
     10 O1         -1.9672    3.9649    5.0785 O.2       1 <0>        -0.5312
     11 N3          0.0877    4.5891    4.5107 N.am      1 <0>        -0.6332
     12 C8          0.1994    5.4036    5.6477 C.2       1 <0>         0.0899
     13 C9         -0.7358    5.6009    6.6147 C.2       1 <0>        -0.0298
     14 N4         -0.1986    6.4599    7.5189 N.pl3     1 <0>        -0.2707
     15 N5          0.9870    6.7986    7.1592 N.2       1 <0>        -0.2421
     16 C10         1.3098    6.2005    6.0286 C.2       1 <0>        -0.0493
     17 C11         2.5846    6.3344    5.2987 C.2       1 <0>         0.6174
     18 O2          2.7580    5.7246    4.2611 O.2       1 <0>        -0.5484
     19 N6          3.5572    7.1325    5.7820 N.am      1 <0>        -0.8385
     20 C12        -0.8761    6.9365    8.7272 C.3       1 <0>         0.1181
     21 C13        -0.0035    6.6454    9.9497 C.3       1 <0>        -0.1682
     22 H1          0.5293   -0.3651    0.8851 H         1 <0>         0.0620
     23 H2         -1.0205   -0.3814    0.0098 H         1 <0>         0.0728
     24 H3          0.5123   -0.3556   -0.8948 H         1 <0>         0.0792
     25 H4          1.0039    1.9031    0.0027 H         1 <0>         0.1017
     26 H5         -0.5459    1.8868   -0.8726 H         1 <0>         0.0921
     27 H6          0.7814    3.3012    2.0993 H         1 <0>         0.1849
     28 H7         -3.4913    1.3136    4.2208 H         1 <0>         0.0806
     29 H8         -3.7552    3.0645    4.0390 H         1 <0>         0.0952
     30 H9         -4.3869    1.9373    2.8147 H         1 <0>         0.0724
     31 H10         0.8246    4.5292    3.8827 H         1 <0>         0.4204
     32 H11        -1.7191    5.1564    6.6571 H         1 <0>         0.2072
     33 H12         3.4189    7.6189    6.6097 H         1 <0>         0.4091
     34 H13         4.3955    7.2205    5.3020 H         1 <0>         0.4074
     35 H14        -1.0455    8.0105    8.6491 H         1 <0>         0.0964
     36 H15        -1.8326    6.4246    8.8333 H         1 <0>         0.1080
     37 H16         0.1659    5.5715   10.0278 H         1 <0>         0.0641
     38 H17         0.9530    7.1573    9.8436 H         1 <0>         0.0729
     39 H18        -0.5076    7.0001   10.8487 H         1 <0>         0.0821
@<TRIPOS>BOND
     1    1    2 1
     2    1   22 1
     3    1   23 1
     4    1   24 1
     5    2    3 1
     6    2   25 1
     7    2   26 1
     8    3    7 1
     9    3    4 1
    10    4    5 2
    11    4   27 1
    12    5    6 1
    13    5    9 1
    14    6    7 2
    15    6    8 1
    16    8   28 1
    17    8   29 1
    18    8   30 1
    19    9   10 2
    20    9   11 am
    21   11   12 1
    22   11   31 1
    23   12   16 1
    24   12   13 2
    25   13   14 1
    26   13   32 1
    27   14   15 1
    28   14   20 1
    29   15   16 2
    30   16   17 1
    31   17   18 2
    32   17   19 am
    33   19   33 1
    34   19   34 1
    35   20   21 1
    36   20   35 1
    37   20   36 1
    38   21   37 1
    39   21   38 1
    40   21   39 1
@<TRIPOS>MOLECULE
ZINC04305585
   46    49     0     0     0
SMALL
USER_CHARGES
4-fluoro-N-[2-indolin-1-yl-2-(2-thienyl)ethyl]-benzenesulfonamide
@<TRIPOS>ATOM
      1 C1          2.5553    9.1832   -2.4259 C.ar      1 <0>        -0.1337
      2 C2          1.2900    8.6260   -2.4631 C.ar      1 <0>        -0.0906
      3 C3          1.0679    7.3699   -1.9390 C.ar      1 <0>        -0.1477
      4 C4          2.1179    6.6531   -1.3670 C.ar      1 <0>         0.0869
      5 C5          3.3842    7.2241   -1.3297 C.ar      1 <0>        -0.1393
      6 C6          3.6016    8.4794   -1.8580 C.ar      1 <0>        -0.0778
      7 C7          4.3197    6.2420   -0.6530 C.3       1 <0>        -0.0818
      8 C8          3.3453    5.2632    0.0416 C.3       1 <0>         0.0239
      9 N1          2.1256    5.3785   -0.7870 N.pl3     1 <0>        -0.4588
     10 C9          0.9180    5.1603    0.0205 C.3       1 <0>         0.1593
     11 H1          0.0461    5.5219   -0.5247 H         1 <0>         0.1199
     12 C10         1.0442    5.9202    1.3424 C.3       1 <0>         0.0824
     13 N2          1.2278    7.3478    1.0693 N.pl3     1 <0>        -1.1202
     14 S1          1.0235    8.4559    2.2829 S.o2      1 <0>         2.6899
     15 O1          1.2937    9.7314    1.7179 O.2       1 <0>        -0.9477
     16 O2         -0.2134    8.1473    2.9107 O.2       1 <0>        -0.9506
     17 C11         2.2832    8.1576    3.4783 C.ar      1 <0>        -0.6948
     18 C12         2.0503    7.2833    4.5236 C.ar      1 <0>         0.0077
     19 C13         3.0392    7.0449    5.4590 C.ar      1 <0>        -0.1734
     20 C14         4.2599    7.6900    5.3549 C.ar      1 <0>         0.1597
     21 C15         4.4901    8.5695    4.3106 C.ar      1 <0>        -0.1738
     22 C16         3.5016    8.8023    3.3734 C.ar      1 <0>         0.0130
     23 F1          5.2257    7.4613    6.2715 F         1 <0>        -0.1268
     24 C17         0.7602    3.6880    0.3005 C.2       1 <0>        -0.1436
     25 C18        -0.3682    2.9372    0.3111 C.2       1 <0>        -0.1393
     26 C19        -0.2727    1.5873    0.5930 C.2       1 <0>        -0.1576
     27 C20         0.9454    1.0564    0.8610 C.2       1 <0>        -0.1728
     28 S2          2.0111    2.4802    0.7051 S.3       1 <0>         0.0556
     29 H2          2.7258   10.1663   -2.8393 H         1 <0>         0.1303
     30 H3          0.4726    9.1767   -2.9048 H         1 <0>         0.1319
     31 H4          0.0776    6.9403   -1.9716 H         1 <0>         0.1269
     32 H5          4.5902    8.9132   -1.8284 H         1 <0>         0.1266
     33 H6          4.9464    6.7487    0.0809 H         1 <0>         0.0843
     34 H7          4.9330    5.7200   -1.3875 H         1 <0>         0.0921
     35 H8          3.1475    5.5767    1.0666 H         1 <0>         0.0780
     36 H9          3.7335    4.2449    0.0185 H         1 <0>         0.1013
     37 H10         0.1387    5.7766    1.9319 H         1 <0>         0.1012
     38 H11         1.9029    5.5431    1.8977 H         1 <0>         0.0822
     39 H12         1.4691    7.6440    0.1777 H         1 <0>         0.4212
     40 H13         1.0966    6.7833    4.6068 H         1 <0>         0.1511
     41 H14         2.8585    6.3587    6.2732 H         1 <0>         0.1486
     42 H15         5.4420    9.0730    4.2282 H         1 <0>         0.1485
     43 H16         3.6810    9.4880    2.5585 H         1 <0>         0.1510
     44 H17        -1.3273    3.3861    0.0987 H         1 <0>         0.1354
     45 H18        -1.1556    0.9655    0.6023 H         1 <0>         0.1355
     46 H19         1.2045    0.0360    1.1016 H         1 <0>         0.1857
@<TRIPOS>BOND
     1    1    6 ar
     2    1    2 ar
     3    1   29 1
     4    2    3 ar
     5    2   30 1
     6    3    4 ar
     7    3   31 1
     8    4    9 1
     9    4    5 ar
    10    5    6 ar
    11    5    7 1
    12    6   32 1
    13    7    8 1
    14    7   33 1
    15    7   34 1
    16    8    9 1
    17    8   35 1
    18    8   36 1
    19    9   10 1
    20   10   11 1
    21   10   12 1
    22   10   24 1
    23   12   13 1
    24   12   37 1
    25   12   38 1
    26   13   14 1
    27   13   39 1
    28   14   15 2
    29   14   16 2
    30   14   17 1
    31   17   22 ar
    32   17   18 ar
    33   18   19 ar
    34   18   40 1
    35   19   20 ar
    36   19   41 1
    37   20   21 ar
    38   20   23 1
    39   21   22 ar
    40   21   42 1
    41   22   43 1
    42   24   28 1
    43   24   25 2
    44   25   26 1
    45   25   44 1
    46   26   27 2
    47   26   45 1
    48   27   28 1
    49   27   46 1

Here are 2 entries. I hope it is more illustrative. I want to search the entries, looking for C.ar (or some other term). If that atom type is not present, as in the first entry, I want to eliminate that from consideration in my screen.

Thank you very much!

grail · 06-10-2011, 09:11 AM

Quote:

If an atom of type C.ar is not present, I want to exclude that entry entirely. Is the awk command going to be able to do that?

Yes, it is more than capable, and I believe the one I have presented does what you require. The only other question would be: is it possible for C.ar to appear in another section? The current example I have shown checks the entire molecule, and if C.ar is present it will output that molecule in its entirety.

Deadally · 06-10-2011, 10:31 AM

Thanks, Grail. I'll try that out as soon as I can.

C.ar should only be present in that section and should not be found anywhere else.