Hello everyone!
I am not a programmer per se. I have had a very basic introduction to Python, but I recognize a problem that might be solved with a Linux shell script or with Python. If I cannot solve this problem, my proposed project may well become intractable. So, can anybody help me do this FOR SCIENCE?!
I have a file in .mol2 format (the format itself is unimportant). What is important is that it is basically thousands and thousands of concatenated records, each in this format:
@<TRIPOS>MOLECULE
ZINC41503146
34 36 0 0 0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1 C1 -0.0164 1.3570 0.0095 C.ar 1 <0> -0.1101
2 C2 1.1559 2.1003 0.0023 C.ar 1 <0> -0.0852
3 C3 2.3751 1.4492 -0.0129 C.ar 1 <0> -0.0666
4 C4 2.4044 0.0758 -0.0205 C.ar 1 <0> -0.1663
5 C5 1.2262 -0.7047 -0.0135 C.ar 1 <0> 0.0726
6 C6 0.0021 -0.0041 0.0020 C.ar 1 <0> -0.0508
7 N1 1.4635 -2.0326 -0.0233 N.2 1 <0> -0.4836
8 C7 2.6693 -2.5212 -0.0378 C.2 1 <0> 0.0660
9 S1 3.6694 -1.1675 -0.0399 S.3 1 <0> 0.1590
10 C8 3.0782 -3.9717 -0.0496 C.3 1 <0> 0.0671
11 H1 2.1877 -4.6002 -0.0441 H 1 <0> 0.1779
12 C9 3.8673 -4.2492 -1.2608 C.1 1 <0> 0.1986
13 N2 4.4762 -4.4634 -2.1956 N.1 1 <0> -0.3600
14 C10 3.9103 -4.2689 1.1712 C.2 1 <0> 0.3657
15 O1 4.5222 -3.3815 1.7155 O.2 1 <0> -0.3757
16 C11 3.9788 -5.6736 1.7126 C.3 1 <0> 0.0132
17 O2 4.8242 -5.6994 2.8645 O.3 1 <0> -0.2798
18 C12 4.9944 -6.9006 3.4785 C.ar 1 <0> 0.1265
19 C13 4.3653 -8.0310 2.9786 C.ar 1 <0> -0.1695
20 C14 4.5394 -9.2510 3.6030 C.ar 1 <0> -0.1058
21 C15 5.3415 -9.3462 4.7277 C.ar 1 <0> 0.0803
22 C16 5.9704 -8.2192 5.2284 C.ar 1 <0> -0.1070
23 C17 5.8024 -6.9978 4.6027 C.ar 1 <0> -0.0390
24 Cl1 6.5963 -5.5857 5.2266 Cl 1 <0> -0.0336
25 F1 5.5101 -10.5401 5.3372 F 1 <0> -0.1304
26 H2 -0.9661 1.8710 0.0259 H 1 <0> 0.1383
27 H3 1.1157 3.1795 0.0081 H 1 <0> 0.1423
28 H4 3.2953 2.0145 -0.0190 H 1 <0> 0.1380
29 H5 -0.9285 -0.5522 0.0083 H 1 <0> 0.1367
30 H6 2.9781 -6.0058 1.9888 H 1 <0> 0.1158
31 H7 4.3843 -6.3378 0.9494 H 1 <0> 0.1108
32 H8 3.7392 -7.9575 2.1017 H 1 <0> 0.1500
33 H9 4.0491 -10.1312 3.2139 H 1 <0> 0.1487
34 H10 6.5958 -8.2949 6.1057 H 1 <0> 0.1555
@<TRIPOS>BOND
1 1 6 ar
2 1 2 ar
3 1 26 1
4 2 3 ar
5 2 27 1
6 3 4 ar
7 3 28 1
8 4 9 1
9 4 5 ar
10 5 6 ar
11 5 7 1
12 6 29 1
13 7 8 2
14 8 9 1
15 8 10 1
16 10 11 1
17 10 12 1
18 10 14 1
19 12 13 3
20 14 15 2
21 14 16 1
22 16 17 1
23 16 30 1
24 16 31 1
25 17 18 1
26 18 23 ar
27 18 19 ar
28 19 20 ar
29 19 32 1
30 20 21 ar
31 20 33 1
32 21 22 ar
33 21 25 1
34 22 23 ar
35 22 34 1
36 23 24 1
Here's what I want to do:
I need a way of screening out the compounds that I can be quite sure will be unsuitable for my calculation. I would like to search the individual records and remove the ones that do not contain a certain string. In this case, if a molecule does not have an atom of type C.ar in its @<TRIPOS>ATOM section, then I want it removed from the file entirely.
Is there any way to automate this process? I have more than 7 million of these compounds to screen, and I expect that I can eliminate a few million if I can accomplish this. I would be forever indebted to anybody who has insight!
Thank you
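For what it is worth, the screening described above could probably be done with a short Python script along these lines. This is only a sketch under some assumptions: every record starts with a line beginning @<TRIPOS>MOLECULE, the atom type is the sixth whitespace-separated column of each line in the @<TRIPOS>ATOM section (as in the sample record), and anything before the first MOLECULE line can be discarded. The function names (iter_records, has_atom_type, filter_mol2) and the file names in the usage line are made up for illustration. The script reads one molecule at a time, so the whole 7-million-compound file never has to fit in memory.

```python
import sys

RECORD_START = "@<TRIPOS>MOLECULE"
SECTION_PREFIX = "@<TRIPOS>"
ATOM_SECTION = "@<TRIPOS>ATOM"


def iter_records(lines):
    """Yield one molecule at a time as a list of lines.

    Assumes each record begins with a @<TRIPOS>MOLECULE line. Any lines
    before the first such line are yielded as their own "record" and will
    be dropped by the filter, since they contain no ATOM section.
    """
    record = []
    for line in lines:
        if line.startswith(RECORD_START) and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def has_atom_type(record, wanted="C.ar"):
    """Return True if any line in the ATOM section has `wanted` as its type."""
    in_atoms = False
    for line in record:
        if line.startswith(SECTION_PREFIX):
            # A new section header: only the ATOM section is searched.
            in_atoms = line.strip() == ATOM_SECTION
            continue
        if in_atoms:
            # Assumed column order: id, name, x, y, z, type, ...
            fields = line.split()
            if len(fields) >= 6 and fields[5] == wanted:
                return True
    return False


def filter_mol2(src, dst, wanted="C.ar"):
    """Copy records containing the wanted atom type from src to dst."""
    kept = dropped = 0
    for record in iter_records(src):
        if has_atom_type(record, wanted):
            dst.writelines(record)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


if __name__ == "__main__":
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
            kept, dropped = filter_mol2(src, dst)
        print(f"kept {kept}, dropped {dropped}", file=sys.stderr)
```

Saved under a hypothetical name such as filter_mol2.py, it would be run as: python filter_mol2.py input.mol2 output.mol2. The original file is left untouched and the surviving molecules are written to the second file. Checking the type column only inside the ATOM section, rather than searching the whole record for the text "C.ar", avoids false matches from other sections.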