LinuxQuestions.org
12-07-2011, 05:48 AM   #1
robertselwyne
LQ Newbie
 
Registered: Jul 2010
Posts: 7

Rep: Reputation: 0
Chemistry problem: Identify duplicates and non-duplicates within TWO sdf files


Dear Programmers,

I have two SDF files that contain the three-dimensional coordinates of chemical compounds.

Each SDF file contains 100 chemical compounds.

FILE_01.sdf
1.compound_01.sdf
2.compound_04.sdf
3.compound_05.sdf
4.compound_07.sdf
5.compound_09.sdf
6.compound_11.sdf
7.compound_13.sdf
8.compound_15.sdf
9.compound_17.sdf
10.compound_19.sdf ... up to the 100th compound

FILE_02.sdf
1.compound_02.sdf
2.compound_04.sdf
3.compound_06.sdf
4.compound_08.sdf
5.compound_09.sdf
6.compound_12.sdf
7.compound_13.sdf
8.compound_16.sdf
9.compound_17.sdf
10.compound_20.sdf ... up to the 100th compound

I need a script that identifies the chemical compounds common to FILE_01.sdf and FILE_02.sdf and writes them, along with their three-dimensional coordinates, to a separate file named “common.sdf”.

For example, in the case above compound_04.sdf, compound_09.sdf, compound_13.sdf and compound_17.sdf should be written to “common.sdf”, so “common.sdf” will hold 4 chemical compounds along with their three-dimensional coordinates.

The remaining unmatched compounds, again with their three-dimensional coordinates, should be written to two further SDF files. Excluding the common compounds, and counting only the ten compounds listed above for each file, there will be 6 compounds in FILE_01_unmatched.sdf and 6 compounds in FILE_02_unmatched.sdf.

Could anybody please help me solve this problem?
Thank you in advance.
Robert.
 
12-07-2011, 06:24 AM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Hi.
Please describe the SDF format or give a link to a description. This one? How should SDF records be compared? Which language should be used? What have you tried so far?
 
12-07-2011, 09:09 AM   #3
robertselwyne
LQ Newbie
 
Registered: Jul 2010
Posts: 7

Original Poster
Rep: Reputation: 0
Hi Firstfire. Thanks for the reply. The Wikipedia description of the SDF file format that you linked to is the right one.

Since the SDF files are so large, I am pasting just a sample below.

The example below is a file named "sample.sdf" in SDF format. It contains two chemical compounds, compound_01 and compound_02, along with their three-dimensional coordinates.

I hope that makes it clear. If not, please let me know and I will try to explain it better.

compound_01
3D
Structure written by MMmdl.
55 58 0 0 1 0 999 V2000
-4.8043 3.9814 0.3129 C 0 0 0 0 0 0
-5.9723 3.2006 0.2358 C 0 0 0 0 0 0
-5.8703 1.8149 0.0170 C 0 0 0 0 0 0
-4.6041 1.2145 -0.1242 C 0 0 0 0 0 0
-3.4176 1.9834 -0.0512 C 0 0 0 0 0 0
-3.5407 3.3753 0.1715 C 0 0 0 0 0 0
-2.0865 1.3532 -0.1905 C 0 0 0 0 0 0
0.2822 1.1578 0.2473 C 0 0 0 0 0 0
-0.9338 1.8442 0.4637 C 0 0 0 0 0 0
-2.0104 0.2637 -0.9893 N 0 0 0 0 0 0
-0.8203 -0.3159 -1.1322 C 0 0 0 0 0 0
0.3207 0.0765 -0.5626 N 0 0 0 0 0 0
-7.1990 3.7819 0.3723 O 0 0 0 0 0 0
1.5605 1.5605 0.8734 C 0 0 0 0 0 0
3.0285 3.3016 1.7996 C 0 0 0 0 0 0
1.8162 2.9142 1.1961 C 0 0 0 0 0 0
2.5591 0.6038 1.1790 C 0 0 0 0 0 0
3.7726 0.9861 1.7829 C 0 0 0 0 0 0
4.0088 2.3371 2.0941 C 0 0 0 0 0 0
-0.7964 -1.4306 -1.9489 N 0 0 0 0 0 0
0.3891 -2.2301 -2.2283 C 0 0 0 0 0 0
0.7431 -3.1774 -1.0712 C 0 0 0 0 0 0
2.0076 -4.0004 -1.3677 C 0 0 0 0 0 0
2.3505 -4.9968 -0.3537 N 0 0 0 0 0 0
3.4160 -5.8756 -0.8259 C 0 0 0 0 0 0
3.7996 -6.9337 0.2232 C 0 0 0 0 0 0
4.1832 -6.2874 1.4325 O 0 0 0 0 0 0
3.1547 -5.4537 1.9566 C 0 0 0 0 0 0
2.7623 -4.3928 0.9128 C 0 0 0 0 0 0
-4.8709 5.0467 0.4791 H 0 0 0 0 0 0
-6.7625 1.2087 -0.0419 H 0 0 0 0 0 0
-4.5499 0.1479 -0.2862 H 0 0 0 0 0 0
-2.6594 3.9968 0.2267 H 0 0 0 0 0 0
-0.9845 2.6998 1.1197 H 0 0 0 0 0 0
-7.1558 4.7149 0.5145 H 0 0 0 0 0 0
3.2062 4.3406 2.0354 H 0 0 0 0 0 0
1.0828 3.6746 0.9729 H 0 0 0 0 0 0
2.3976 -0.4405 0.9538 H 0 0 0 0 0 0
4.5215 0.2411 2.0081 H 0 0 0 0 0 0
4.9389 2.6326 2.5570 H 0 0 0 0 0 0
-1.6727 -1.7076 -2.3661 H 0 0 0 0 0 0
1.2286 -1.5693 -2.4482 H 0 0 0 0 0 0
0.2041 -2.8070 -3.1347 H 0 0 0 0 0 0
-0.0938 -3.8509 -0.8826 H 0 0 0 0 0 0
0.8833 -2.5941 -0.1619 H 0 0 0 0 0 0
2.8522 -3.3272 -1.5236 H 0 0 0 0 0 0
1.8520 -4.5100 -2.3197 H 0 0 0 0 0 0
3.1038 -6.3768 -1.7428 H 0 0 0 0 0 0
4.2982 -5.2851 -1.0787 H 0 0 0 0 0 0
2.9655 -7.6111 0.4122 H 0 0 0 0 0 0
4.6309 -7.5405 -0.1361 H 0 0 0 0 0 0
2.2910 -6.0606 2.2327 H 0 0 0 0 0 0
3.5138 -4.9747 2.8676 H 0 0 0 0 0 0
3.6066 -3.7219 0.7464 H 0 0 0 0 0 0
1.9568 -3.7859 1.3245 H 0 0 0 0 0 0
1 2 2 0 0 0
1 6 1 0 0 0
1 30 1 0 0 0
2 3 1 0 0 0
2 13 1 0 0 0
3 4 2 0 0 0
3 31 1 0 0 0
4 5 1 0 0 0
4 32 1 0 0 0
5 6 2 0 0 0
5 7 1 0 0 0
6 33 1 0 0 0
7 9 2 0 0 0
7 10 1 0 0 0
8 9 1 0 0 0
8 12 2 0 0 0
8 14 1 0 0 0
9 34 1 0 0 0
10 11 2 0 0 0
11 12 1 0 0 0
11 20 1 0 0 0
13 35 1 0 0 0
14 16 2 0 0 0
14 17 1 0 0 0
15 16 1 0 0 0
15 19 2 0 0 0
15 36 1 0 0 0
16 37 1 0 0 0
17 18 2 0 0 0
17 38 1 0 0 0
18 19 1 0 0 0
18 39 1 0 0 0
19 40 1 0 0 0
20 21 1 0 0 0
20 41 1 0 0 0
21 22 1 0 0 0
21 42 1 0 0 0
21 43 1 0 0 0
22 23 1 0 0 0
22 44 1 0 0 0
22 45 1 0 0 0
23 24 1 0 0 0
23 46 1 0 0 0
23 47 1 0 0 0
24 25 1 0 0 0
24 29 1 0 0 0
25 26 1 0 0 0
25 48 1 0 0 0
25 49 1 0 0 0
26 27 1 0 0 0
26 50 1 0 0 0
26 51 1 0 0 0
27 28 1 0 0 0
28 29 1 0 0 0
28 52 1 0 0 0
28 53 1 0 0 0
29 54 1 0 0 0
29 55 1 0 0 0
M END
> <s_m_entry_name>
compound_01
$$$$
compound_02
3D
Structure written by MMmdl.
56 59 0 0 1 0 999 V2000
-4.8679 3.9090 0.3957 C 0 0 0 0 0 0
-6.0073 3.0851 0.4495 C 0 0 0 0 0 0
-5.8662 1.6920 0.3160 C 0 0 0 0 0 0
-4.5896 1.1272 0.1292 C 0 0 0 0 0 0
-3.4316 1.9395 0.0710 C 0 0 0 0 0 0
-3.5937 3.3383 0.2093 C 0 0 0 0 0 0
-2.0891 1.3471 -0.1163 C 0 0 0 0 0 0
0.3092 1.2629 0.1715 C 0 0 0 0 0 0
-0.9157 1.9206 0.4238 C 0 0 0 0 0 0
-2.0229 0.2090 -0.8450 N 0 0 0 0 0 0
-0.8235 -0.3381 -1.0294 C 0 0 0 0 0 0
0.3366 0.1280 -0.5629 N 0 0 0 0 0 0
-7.2441 3.6319 0.6301 O 0 0 0 0 0 0
1.6091 1.7566 0.6767 C 0 0 0 0 0 0
3.0622 3.6140 1.3723 C 0 0 0 0 0 0
1.8292 3.1388 0.8845 C 0 0 0 0 0 0
2.6654 0.8627 0.9774 C 0 0 0 0 0 0
3.9000 1.3328 1.4658 C 0 0 0 0 0 0
4.0998 2.7106 1.6639 C 0 0 0 0 0 0
-0.8112 -1.5011 -1.7761 N 0 0 0 0 0 0
0.3838 -2.2680 -2.1001 C 0 0 0 0 0 0
0.8465 -3.1482 -0.9283 C 0 0 0 0 0 0
2.1187 -3.9421 -1.2682 C 0 0 0 0 0 0
2.6427 -4.7040 -0.1066 N 0 3 0 0 0 0
4.0695 -5.0730 -0.2731 C 0 0 0 0 0 0
4.5855 -5.8191 0.9692 C 0 0 0 0 0 0
3.7807 -6.9677 1.2023 O 0 0 0 0 0 0
2.4245 -6.6242 1.4563 C 0 0 0 0 0 0
1.8351 -5.9030 0.2318 C 0 0 0 0 0 0
-4.9648 4.9802 0.4955 H 0 0 0 0 0 0
-6.7363 1.0531 0.3577 H 0 0 0 0 0 0
-4.5050 0.0545 0.0346 H 0 0 0 0 0 0
-2.7354 3.9921 0.1637 H 0 0 0 0 0 0
-0.9563 2.8180 1.0222 H 0 0 0 0 0 0
-7.2282 4.5735 0.7071 H 0 0 0 0 0 0
3.2115 4.6734 1.5212 H 0 0 0 0 0 0
1.0513 3.8535 0.6605 H 0 0 0 0 0 0
2.5328 -0.2003 0.8379 H 0 0 0 0 0 0
4.6934 0.6349 1.6896 H 0 0 0 0 0 0
5.0458 3.0736 2.0381 H 0 0 0 0 0 0
-1.7009 -1.8245 -2.1263 H 0 0 0 0 0 0
1.1803 -1.5879 -2.4051 H 0 0 0 0 0 0
0.1620 -2.8922 -2.9661 H 0 0 0 0 0 0
0.0364 -3.8247 -0.6566 H 0 0 0 0 0 0
1.0370 -2.5251 -0.0536 H 0 0 0 0 0 0
2.8804 -3.2326 -1.5954 H 0 0 0 0 0 0
1.9429 -4.6155 -2.1086 H 0 0 0 0 0 0
4.1846 -5.7003 -1.1586 H 0 0 0 0 0 0
4.6694 -4.1765 -0.4351 H 0 0 0 0 0 0
5.6172 -6.1360 0.8155 H 0 0 0 0 0 0
4.5791 -5.1698 1.8463 H 0 0 0 0 0 0
1.8624 -7.5353 1.6620 H 0 0 0 0 0 0
2.3550 -5.9995 2.3484 H 0 0 0 0 0 0
0.8042 -5.6231 0.4455 H 0 0 0 0 0 0
1.8115 -6.5823 -0.6220 H 0 0 0 0 0 0
2.6012 -4.0826 0.6890 H 0 0 0 0 0 0
1 2 2 0 0 0
1 6 1 0 0 0
1 30 1 0 0 0
2 3 1 0 0 0
2 13 1 0 0 0
3 4 2 0 0 0
3 31 1 0 0 0
4 5 1 0 0 0
4 32 1 0 0 0
5 6 2 0 0 0
5 7 1 0 0 0
6 33 1 0 0 0
7 9 2 0 0 0
7 10 1 0 0 0
8 9 1 0 0 0
8 12 2 0 0 0
8 14 1 0 0 0
9 34 1 0 0 0
10 11 2 0 0 0
11 12 1 0 0 0
11 20 1 0 0 0
13 35 1 0 0 0
14 16 2 0 0 0
14 17 1 0 0 0
15 16 1 0 0 0
15 19 2 0 0 0
15 36 1 0 0 0
16 37 1 0 0 0
17 18 2 0 0 0
17 38 1 0 0 0
18 19 1 0 0 0
18 39 1 0 0 0
19 40 1 0 0 0
20 21 1 0 0 0
20 41 1 0 0 0
21 22 1 0 0 0
21 42 1 0 0 0
21 43 1 0 0 0
22 23 1 0 0 0
22 44 1 0 0 0
22 45 1 0 0 0
23 24 1 0 0 0
23 46 1 0 0 0
23 47 1 0 0 0
24 25 1 0 0 0
24 29 1 0 0 0
24 56 1 0 0 0
25 26 1 0 0 0
25 48 1 0 0 0
25 49 1 0 0 0
26 27 1 0 0 0
26 50 1 0 0 0
26 51 1 0 0 0
27 28 1 0 0 0
28 29 1 0 0 0
28 52 1 0 0 0
28 53 1 0 0 0
29 54 1 0 0 0
29 55 1 0 0 0
M CHG 1 24 1
M END
> <s_m_entry_name>
compound_02
$$$$
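
Going by the sample above, each record starts with a line holding the compound name and ends with a line containing $$$$. A minimal sketch that lists those names from a file, assuming that layout holds for every record (the function name list_names is made up for illustration):

```shell
# list_names FILE: print the first line of every SDF record in FILE.
# Assumption: the first line of each record is the compound name and
# each record is terminated by a line starting with "$$$$".
list_names() {
    awk '
        BEGIN { want_name = 1 }              # the file starts with a record
        /^\$\$\$\$/ { want_name = 1; next }  # delimiter: a new record follows
        want_name { print; want_name = 0 }   # first line of the record
    ' "$1"
}
```

Running `list_names sample.sdf` on the sample above should print compound_01 and compound_02, one per line. Comparing such name lists is only meaningful if the same compound always carries the same name in both files.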
 
12-07-2011, 02:25 PM   #4
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Hi.
I finally seem to have got it working. Here is an awk script:
Code:
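# Records end with a "$$$$" line. A regular-expression RS is an extension
# (supported by gawk and mawk, among others), not plain POSIX awk.
BEGIN { RS = "[$][$][$][$]\n" }

# First pass: count the records that start with each compound name.
{ compounds[$1]++ }

END {
	# A name counted more than once is treated as common to both files
	# (this assumes names are unique within each file).
	for (c in compounds)
		if (compounds[c] > 1) common[c] = 1

	print "Common compounds:"
	for (c in common)
		print c

	# Second pass: reread each input file and split its records.
	delete ARGV[0]
	for (f in ARGV) {
		# FILE_01.sdf -> FILE_01_unmatched.sdf
		unmatched = ARGV[f]
		sub(/\.sdf$/, "", unmatched)
		unmatched = unmatched "_unmatched.sdf"
		while ((getline < ARGV[f]) > 0) {
			if ($1 == "")
				continue
			if ($1 in common) {
				# Write each common compound only once.
				if (!($1 in seen)) {
					print $0 "$$$$" > "common.sdf"
					seen[$1] = 1
				}
			}
			else
				print $0 "$$$$" > unmatched
		}
		close(ARGV[f])
		close(unmatched)
	}
	close("common.sdf")
}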
Save it to a file, for example "process.awk".
To run it, type the following in the console:
Code:
awk -f process.awk FILE_*.sdf
The script prints the common compounds to standard output and creates "common.sdf", plus a "FILE_*_unmatched.sdf" file for each input file that contains unmatched compounds.
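
The script above relies on a regular-expression record separator, which not every awk provides. As a rough alternative, the same split can be sketched line by line using only POSIX awk features. This is a sketch under stated assumptions, not a drop-in replacement: the function name split_sdf is made up, it takes exactly two non-empty files, it assumes the first line of every record is the compound name, and it reads each file twice (once to collect names, once to write records).

```shell
# split_sdf A.sdf B.sdf: write common.sdf, A_unmatched.sdf and B_unmatched.sdf
# into the current directory. Both files are read twice: inputs 1-2 collect
# the record names, inputs 3-4 write the records out.
split_sdf() {
    awk '
        FNR == 1 {                       # a new input file starts
            nfile++
            want_name = 1                # its first line is a record name
            base = FILENAME
            sub(/\.sdf$/, "", base)
            unmatched = base "_unmatched.sdf"
        }
        want_name {                      # first line of a record
            name = $0
            want_name = 0
            if (nfile <= 2) {
                in_file[nfile, name] = 1
            } else {
                is_common = ((1, name) in in_file) && ((2, name) in in_file)
                # common records are written once, from the first file only
                if (!is_common)      out = unmatched
                else if (nfile == 3) out = "common.sdf"
                else                 out = ""
            }
        }
        nfile > 2 && out != "" { print > out }
        /^\$\$\$\$/ { want_name = 1 }    # record ends; a name line follows
    ' "$1" "$2" "$1" "$2"
}
```

A call such as `split_sdf FILE_01.sdf FILE_02.sdf` would then be expected to produce the same three output files as the script above. As with that script, compounds are matched purely by name, so two records with the same name but different coordinates are still treated as the same compound.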

Hope that helps.

Last edited by firstfire; 12-07-2011 at 02:30 PM.
 
12-09-2011, 05:15 AM   #5
robertselwyne
LQ Newbie
 
Registered: Jul 2010
Posts: 7

Original Poster
Rep: Reputation: 0
Thank you very much, Firstfire.
The script works fine. It is indeed a great help for my chemistry research.
 
12-09-2011, 07:20 AM   #6
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Quote:
Originally Posted by robertselwyne
The script works fine.
Glad to hear that.
 
All times are GMT -5.