Search and replace text in a csv file according to a text file

zillur · 03-04-2017, 09:52 AM

Hi there,

I have a CSV file with a list of gene symbols. I have another text file with all the gene symbols and their corresponding KEGG IDs/titles. I want to replace every gene symbol in the CSV file with the corresponding KEGG title. How can I do this? Any suggestions?

Best Regards
Zillur

The csv file with multiple columns and rows:

Code:

	Cparvum	Bmicroti	Tparva
OG0000000			
OG0000001			
OG0000002			
OG0000003			
OG0000004			TP03_0403-t26_1-p1
OG0000005			
OG0000006			
OG0000007			
OG0000008			
OG0000009			
OG0000010	cgd6_4080-t26_1-p1	BBM_III00070	TP01_0667-t26_1-p1, TP01_1185-t26_1-p1, TP01_1186-t26_1-p1, TP02_0704-t26_1-p1, TP03_0200-t26_1-p1, TP03_0738-t26_1-p1, TP03_0739-t26_1-p1, TP04_0044-t26_1-p1, TP04_0172-t26_1-p1
OG0000011			
OG0000012			TP01_0004-t26_1-p1, TP01_0005-t26_1-p1, TP01_0006-t26_1-p1, TP01_0007-t26_1-p1, TP01_0008-t26_1-p1, TP01_0009-t26_1-p1, TP01_1225-t26_1-p1, TP01_1226-t26_1-p1, TP01_1227-t26_1-p1, TP02_0003-t26_1-p1, TP02_0004-t26_1-p1, TP02_0005-t26_1-p1, TP02_0006-t26_1-p1, TP02_0007-t26_1-p1, TP02_0008-t26_1-p1, TP02_0010-t26_1-p1, TP02_0011-t26_1-p1, TP02_0785-t26_1-p1, TP02_0855-t26_1-p1, TP02_0953-t26_1-p1, TP02_0954-t26_1-p1, TP02_0955-t26_1-p1, TP02_0956-t26_1-p1, TP02_0957-t26_1-p1, TP02_0958-t26_1-p1, TP02_0959-t26_1-p1, TP02_0960-t26_1-p1, TP03_0001-t26_1-p1, TP03_0002-t26_1-p1, TP03_0003-t26_1-p1, TP03_0004-t26_1-p1, TP03_0005-t26_1-p1, TP03_0006-t26_1-p1, TP03_0298-t26_1-p1, TP03_0866-t26_1-p1, TP03_0867-t26_1-p1, TP03_0868-t26_1-p1, TP03_0869-t26_1-p1, TP03_0870-t26_1-p1, TP03_0871-t26_1-p1, TP03_0872-t26_1-p1, TP03_0873-t26_1-p1, TP03_0874-t26_1-p1, TP03_0875-t26_1-p1, TP03_0877-t26_1-p1, TP03_0878-t26_1-p1, TP03_0879-t26_1-p1, TP03_0880-t26_1-p1, TP03_0881-t26_1-p1, TP03_0882-t26_1-p1, TP03_0883-t26_1-p1, TP03_0884-t26_1-p1, TP03_0885-t26_1-p1, TP03_0886-t26_1-p1, TP03_0887-t26_1-p1, TP03_0888-t26_1-p1, TP03_0889-t26_1-p1, TP03_0890-t26_1-p1, TP03_0891-t26_1-p1, TP03_0892-t26_1-p1, TP03_0893-t26_1-p1, TP03_0930-t26_1-p1, TP04_0001-t26_1-p1, TP04_0002-t26_1-p1, TP04_0003-t26_1-p1, TP04_0004-t26_1-p1, TP04_0005-t26_1-p1, TP04_0006-t26_1-p1, TP04_0007-t26_1-p1, TP04_0008-t26_1-p1, TP04_0009-t26_1-p1, TP04_0010-t26_1-p1, TP04_0011-t26_1-p1, TP04_0013-t26_1-p1, TP04_0014-t26_1-p1, TP04_0015-t26_1-p1, TP04_0016-t26_1-p1, TP04_0017-t26_1-p1, TP04_0018-t26_1-p1, TP04_0019-t26_1-p1, TP04_0098-t26_1-p1, TP04_0099-t26_1-p1, TP04_0406-t26_1-p1, TP04_0916-t26_1-p1, TP04_0917-t26_1-p1, TP04_0918-t26_1-p1, TP04_0919-t26_1-p1, TP04_0920-t26_1-p1, TP04_0921-t26_1-p1, TP04_0923-t26_1-p1, TP04_0927-t26_1-p1, TP04_0928-t26_1-p1
OG0000013			
OG0000014			
OG0000015			
OG0000016			
OG0000017			
OG0000018	cgd8_4180-t26_1-p1		TP01_0001-t26_1-p1, TP01_0002-t26_1-p1, TP01_0003-t26_1-p1, TP01_0256-t26_1-p1, TP01_0257-t26_1-p1, TP01_0336-t26_1-p1, TP01_0539-t26_1-p1, TP01_0540-t26_1-p1, TP01_0640-t26_1-p1, TP01_0647-t26_1-p1, TP01_1221-t26_1-p1, TP01_1222-t26_1-p1, TP01_1223-t26_1-p1, TP01_1224-t26_1-p1, TP02_0001-t26_1-p1, TP02_0002-t26_1-p1, TP02_0013-t26_1-p1, TP02_0030-t26_1-p1, TP02_0570-t26_1-p1, TP02_0608-t26_1-p1, TP02_0637-t26_1-p1, TP02_0691-t26_1-p1, TP02_0778-t26_1-p1, TP02_0782-t26_1-p1, TP02_0789-t26_1-p1, TP02_0802-t26_1-p1, TP02_0818-t26_1-p1, TP02_0819-t26_1-p1, TP02_0856-t26_1-p1, TP02_0857-t26_1-p1, TP02_0872-t26_1-p1, TP02_0876-t26_1-p1, TP02_0878-t26_1-p1, TP02_0888-t26_1-p1, TP02_0890-t26_1-p1, TP02_0891-t26_1-p1, TP02_0895-t26_1-p1, TP02_0896-t26_1-p1, TP03_0009-t26_1-p1, TP03_0114-t26_1-p1, TP03_0213-t26_1-p1, TP03_0218-t26_1-p1, TP03_0219-t26_1-p1, TP03_0243-t26_1-p1, TP03_0297-t26_1-p1, TP03_0316-t26_1-p1, TP03_0368-t26_1-p1, TP03_0482-t26_1-p1, TP03_0822-t26_1-p1, TP03_0829-t26_1-p1, TP03_0863-t26_1-p1, TP03_0865-t26_1-p1, TP04_0012-t26_1-p1, TP04_0085-t26_1-p1, TP04_0086-t26_1-p1, TP04_0095-t26_1-p1, TP04_0096-t26_1-p1, TP04_0097-t26_1-p1, TP04_0100-t26_1-p1, TP04_0101-t26_1-p1, TP04_0102-t26_1-p1, TP04_0103-t26_1-p1, TP04_0104-t26_1-p1, TP04_0115-t26_1-p1, TP04_0136-t26_1-p1, TP04_0145-t26_1-p1, TP04_0407-t26_1-p1, TP04_0929-t26_1-p1
OG0000019			
OG0000020			
OG0000021			
OG0000022			
OG0000023	cgd1_2650-t26_1-p1, cgd4_4230-t26_1-p1, cgd6_1410-t26_1-p1, cgd7_640-t26_1-p1, cgd8_4100-t26_1-p1	BBM_I00890, BBM_II00035, BBM_II03475, BBM_II04040, BBM_III00940, BBM_III01185, BBM_III05095	TP01_0103-t26_1-p1, TP01_0544-t26_1-p1, TP01_0641-t26_1-p1, TP01_1019-t26_1-p1, TP02_0292-t26_1-p1, TP03_0394-t26_1-p1
OG0000024	cgd3_920-t26_1-p1, cgd5_820-t26_1-p1, cgd6_3400-t26_1-p1, cgd7_40-t26_1-p1	BBM_II03735, BBM_III01395, BBM_III08970	TP01_0983-t26_1-p1, TP01_1073-t26_1-p1, TP04_0518-t26_1-p1
OG0000025	cgd1_330-t26_1-p1, cgd4_1730-t26_1-p1, cgd5_2010-t26_1-p1	BBM_I01965, BBM_I02055, BBM_II02605, BBM_III04665	TP01_0937-t26_1-p1, TP01_1158-t26_1-p1, TP02_0059-t26_1-p1, TP03_0490-t26_1-p1
OG0000026	cgd2_1310-t26_1-p1	BBM_II01030, BBM_II04035, BBM_III00065, BBM_III01030, BBM_III06085	TP01_0355-t26_1-p1, TP02_0293-t26_1-p1, TP02_0672-t26_1-p1, TP02_0673-t26_1-p1, TP02_0674-t26_1-p1
OG0000027	cgd6_1330-t26_1-p1, cgd7_1120-t26_1-p1, cgd7_3190-t26_1-p1	BBM_III05040	TP02_0087-t26_1-p1
OG0000028	cgd2_1010-t26_1-p1, cgd4_3000-t26_1-p1, cgd8_800-t26_1-p1	BBM_I01030, BBM_I01485, BBM_I01910, BBM_II03965, BBM_III09930	TP03_0478-t26_1-p1, TP03_0532-t26_1-p1, TP03_0830-t26_1-p1, TP04_0562-t26_1-p1
OG0000029

The text file with two columns:

Code:

PCHAS_1108000	-	
PCHAS_1312900	-	
PCHAS_1428700	-	
PCHAS_1443900	-	
PCYB_103290	Spliceosome	
PCYB_126910	Spliceosome	
PCYB_143760	-	
PCYB_145280	Spliceosome	
PF3D7_0508700	-	
PF3D7_0810600	-	
PF3D7_1227100	-	
PF3D7_1445900	-	
PKNH_1024900	-	
PKNH_1236300	-	
PKNH_1430100	-	
PKNH_1446500	-	
PRCDC_0507900	-	
PRCDC_0809900	-	
PRCDC_1226400	-	
PRCDC_1445200	-	
PVX_097995	Spliceosome	
PVX_118190	Spliceosome	
PVX_123240	-	
PVX_123985	Spliceosome	
PYYM_1110500	-	
PYYM_1310500	-	
PYYM_1430700	-	
PYYM_1446000	-	
PmUG01_06016200.1-p1	-

TB0ne · 03-04-2017, 09:57 AM

You've shown us the two files, but haven't shown us what you want to search for/replace, or how you want the output to look. Most importantly, you haven't shown what you have done/tried to do this. There are many ways to do this, but since we don't know what language(s) you know/want, we can't really offer much, aside from "you can write a script to do this".

This could be a bash script, perl, ruby, or python, and all could easily do this. Post what you've done, and tell us where you're stuck.

zillur · 03-04-2017, 10:14 AM

Thank you very much for your reply. I want an output like this:

Code:

	Cparvum	Bmicroti	Tparva	Pberghei
OG0000000	-			
OG0000001	-			
OG0000002	-			
OG0000003	-			
OG0000004	-			
OG0000005	-			
OG0000006	-			
OG0000007	Protein processing in endoplasmic reticulum			
OG0000008	Protein processing in endoplasmic reticulum			
OG0000009	-			
OG0000010	Ribosome biogenesis in eukaryotes			
OG0000011	-			
OG0000012	-			
OG0000013	-			
OG0000014	-			
OG0000015	-			
OG0000016	-			
OG0000017	-			
OG0000018	-			
OG0000019	-			
OG0000020	-			
OG0000021	-			
OG0000022	-			
OG0000023	-			
OG0000024	Protein processing in endoplasmic reticulum			
OG0000025	Protein processing in endoplasmic reticulum			
OG0000026	Ribosome biogenesis in eukaryotes			
OG0000027	-			
	-

Then I want to convert it into a presence/absence matrix:

Code:

	Cparvum	Bmicroti	Tparva	Pberghei	Pchabaudi	Pcynomolgi	Pfalciparum
Protein processing in endoplasmic reticulum	1	0	0	1	0	1	0
Ribosome biogenesis in eukaryotes	0	0	1	1	1	1
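
For reference, the second step could be sketched in Python roughly as follows. This assumes the table has already been converted so that each cell holds comma-separated pathway titles (or "-", or nothing). The function name presence_absence and the sample rows are made up for illustration and are not taken from the real files.

```python
import csv
import io

def presence_absence(table_text):
    """Build {pathway: {organism: 0 or 1}} from a tab-delimited table whose
    cells hold comma-separated pathway titles ("-" or empty = no pathway)."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    organisms = header[1:]  # the first header field is the empty corner cell
    matrix = {}
    for row in reader:
        # row[0] is the orthogroup name; the remaining cells line up with organisms
        for organism, cell in zip(organisms, row[1:]):
            for title in cell.split(","):
                title = title.strip()
                if title and title != "-":
                    # create an all-zero row the first time a title is seen
                    matrix.setdefault(title, dict.fromkeys(organisms, 0))[organism] = 1
    return organisms, matrix

# Hypothetical sample in the same layout as the desired output above.
sample = (
    "\tCparvum\tBmicroti\tTparva\n"
    "OG0000007\tProtein processing in endoplasmic reticulum\t-\t\n"
    "OG0000010\t-\t\tRibosome biogenesis in eukaryotes, -\n"
)
organisms, matrix = presence_absence(sample)
print("\t" + "\t".join(organisms))
for title, flags in matrix.items():
    print(title + "\t" + "\t".join(str(flags[o]) for o in organisms))
```

Replacing the assignment of 1 with a running count would give an abundance matrix instead of a binary one.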

TB0ne · 03-04-2017, 10:22 AM

Ok...so again; can you post what you have done/tried on your own and what you've written to accomplish this? Show us where you're stuck?? We are happy to help you, but we aren't going to write your scripts for you.

And thank you for providing sample output...but it just isn't making much sense; what pattern in the text file are you searching for in the CSV, because I can't see what you're trying to search/replace. Instead of MANY lines of samples, post a few of the csv file, and highlight what you're searching for, and post a few of the text file with the relevant data in it.

zillur · 03-04-2017, 11:48 AM

Thank you very much for your quick reply and for the detailed clarification. In my Orthogroups.txt file I have a list of gene names (the first row holds the organism names, the first column the orthogroup names). Each cell contains gene symbols. In the mypathway.txt file I have all the same gene symbols in the 1st column and the corresponding pathway titles in the 2nd column. I want to replace every gene symbol in Orthogroups.txt with the corresponding 2nd-column entry from the 2nd file. Is it possible? Thanks again.

Best regards
Zillur

Turbocapitalist · 03-04-2017, 11:54 AM

Which columns are supposed to be the same in each file? Can you show two or three lines that match from each file?

perl has the module Text::CSV which parses CSV properly. From there it is easy to merge two tables. If the two files are tab-delimited text then you might even just use join instead.
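
In Python, the standard csv module plays a similar role to Text::CSV. A minimal sketch of reading the two-column file into a lookup table, assuming it is tab-delimited; the function name load_lookup and the sample text are illustrative only.

```python
import csv
import io

def load_lookup(lines):
    """Read a two-column, tab-delimited file into {gene_symbol: title}.
    Surrounding whitespace is stripped and extra (empty) columns are ignored."""
    lookup = {}
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) >= 2 and fields[0].strip():
            lookup[fields[0].strip()] = fields[1].strip()
    return lookup

# Sample lines in the layout posted earlier, including the trailing tab.
pathway_text = "PCYB_103290\tSpliceosome\t\nPCYB_143760\t-\t\n"
lookup = load_lookup(io.StringIO(pathway_text))
print(lookup)
```

With a real file, something like `with open("mypathway.txt", newline="") as fh: lookup = load_lookup(fh)` should work the same way.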

TB0ne · 03-04-2017, 03:50 PM

Quote:

Originally Posted by zillur

Thank you very much for your quick reply and for the detailed clarification. In my Orthogroups.txt file I have a list of gene names (the first row holds the organism names, the first column the orthogroup names). Each cell contains gene symbols. In the mypathway.txt file I have all the same gene symbols in the 1st column and the corresponding pathway titles in the 2nd column. I want to replace every gene symbol in Orthogroups.txt with the corresponding 2nd-column entry from the 2nd file. Is it possible? Thanks again.

First, yes, it's very possible.

Secondly, as you've been asked now several times:

POST WHAT YOU HAVE WRITTEN/DONE/TRIED of your own and tell us where you're stuck. We WILL NOT write your scripts for you, but will be happy to assist you if you're stuck
Can you, as you've been asked before, post one CLEAR example of what you want to search/replace, and what you want to see when you're done??? What you've posted so far makes no sense.

As Turbocapitalist said, perl would make short work of this, but you can also use any number of other languages as well. Since you haven't posted what you've done, what language you're working in, or told us where you're stuck, we can't offer much in the way of advice on how to fix whatever problem you have.

zillur · 03-04-2017, 07:22 PM

Thank you very much for your comments. I was trying in bash:

Code:

grep -vf mypathway.txt Orthogroups.txt > new_4.csv

It doesn't give me the expected output. new_4.csv is the same as Orthogroups.txt. I also tried awk:

Code:

awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' mypathway.txt Orthogroups.txt > new_4.csv

But I got the same result.

I want an output like new_5.csv.txt (attached). Sorry for the inconvenience.

Best Regards
Zillur

Turbocapitalist · 03-05-2017, 02:08 AM

It's not an inconvenience, but it does need to be spelled out clearly enough to retain any interest among forum members. You can put your data between [code] [/code] tags in the body of your post so they are accessible.

The data we are interested in would be a) two lines from mypathway.txt that match b) a line or more in Orthogroups.txt and c) which fields in a and b that should be examined for a match. All in all, it would be nice thus to see three groups of data between [code] [/code] tags. In that way you'll help us help you since we don't have familiarity with your data and we're volunteering our time for interesting questions like this one.

zillur · 03-05-2017, 09:12 PM

Thank you very much for helping me all the way. Here is mypathway.txt:

Code:

PVX_088085	Protein processing in endoplasmic reticulum	
PVX_114095	Protein processing in endoplasmic reticulum	
PVX_123055	Ribosome biogenesis in eukaryotes
PYYM_1032000	-	
PYYM_1120600	-

The 1st column contains gene symbols extracted from the orthogroups.csv file. mypathway.txt has only two columns. The 2nd column may contain multiple values (text).
Here is my orthogroups.csv file:

Code:

	Cparvum	Bmicroti	Tparva	Pberghei	Pchabaudi	Pcynomolgi	Pfalciparum	Pknowlesi	Preichenowi	Pvivax	Pyoelii	Pmalariae	Tgondii
OG0000000				PBANKA_0000600, PBANKA_0000701, PBANKA_0000801, PBANKA_0001001, PBANKA_0001101, PBANKA_0001201, PBANKA_0001301, PBANKA_0001401, PBANKA_0001501, PBANKA_0006300, PBANKA_0006401, PBANKA_0006501, PBANKA_0006600, PBANKA_0006701, OG0000001												PmUG01_00010100.1-p1, PmUG01_00010200.1-p1, PmUG01_00010400.1-p1, PmUG01_00010500.1-p1, PmUG01_00010600.1-p1, PmUG01_00010700.1-p1, PmUG01_00010800.1-p1, PmUG01_00010900.1-p1, PmUG01_00011000.1-p1, PmUG01_00011300.1-p1, PmUG01_00011400.1-p1, PmUG01_00011600.1-p1, PmUG01_00011700.1-p1, PmUG01_00012100.1-p1, PmUG01_00012200.1-p1,

Orthogroups.csv contains multiple columns. The 1st row holds the column names and the 1st column holds the row names. Each cell may contain one or more gene symbols, or may be blank (these gene symbols are the 1st column of mypathway.txt).
I want to replace each gene symbol in the CSV file with the corresponding 2nd-column entry from mypathway.txt without changing the format of the CSV file.
I tried the following perl script:

Code:

# This script was excerpted from http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary

use strict;
use warnings;

open my $fh, '<', 'bioDBnet_db2db_KEGG_Title_final.txt' or die $!;
my %dict =  map { chomp; split ' ', $_, 2 } <$fh>;
my $re = join '|', keys %dict;

open $fh, '<', 'Orthogroups_3.csv' or die $!;
while (<$fh>) {
  s/($re)/$dict{$1}/g;
  print;
}

It gave me the expected output but changed the format of the CSV file. The original CSV contains 13 columns, but the replaced CSV showed only 1 column when I loaded it in R. I want to load the replaced CSV in R and then convert it into a data matrix or binary matrix. The output I got:

Code:

	Cparvum	Bmicroti	Tparva	Pberghei	Pchabaudi	Pcynomolgi	Pfalciparum	Pknowlesi	Preichenowi	Pvivax	Pyoelii	Pmalariae	Tgondii
OG0000000				-	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	
OG0000024	-	, -	, -	, -		-	, -	, -		-	, -	, -		-	, -	, -	, -		-	, -	, -		Protein processing in endoplasmic reticulum	, -	, -	, -	, -		-	, -	, -		-	, -	, -		-	, -	, -		-	, -	, -	, -		-	, -	, -		-	, -	, -	, -		-	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	
OG0000025	-	, -	, -		-	, -	, -	, -		-	, -	, -	, -		-	, -	, -	, -		-	, -	, -	, -		Protein processing in endoplasmic reticulum	, Protein processing in endoplasmic reticulum	, -	, Ribosome biogenesis in eukaryotes		-	, -	, -	, -		-	, -	, -	, -		-	, -	, -	, -		-	, Protein processing in endoplasmic reticulum	, Protein processing in endoplasmic reticulum	, Ribosome biogenesis in eukaryotes		-	, -	, -	, -		-	, -	, -	, -		-	, -	, -	, -	, -	, -	, -	
OG0000026		
OG0000001												-	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -	, -		
OG0000002

I want the 1st row with the column names to stay the same. I want the text from the cells to become the row names, and then to make an abundance or binary matrix. Is it possible?

Thanks again.

Best Regards
Zillur

shane25119 · 03-05-2017, 11:00 PM

It strikes me that this would be more easily accomplished in R or Stata or some other statistical software using code akin to:

replace v2 = "X" if v2== 01234

That's Stata syntax above, but the logic for R, which you reference, is effectively the same.

zillur · 03-07-2017, 10:26 AM

Thank you very much for your kind reply.

Quote:

replace v2 = "X" if v2== 01234

Can you enlighten me a little more or give me an example?
I have converted the text in the CSV using this perl script: http://stackoverflow.com/questions/1...n-a-dictionary
But it gives me more columns than the original and I can't load the output in R.
Best Regards
Zillur

Turbocapitalist · 03-07-2017, 12:47 PM

Quote:

Originally Posted by zillur

I have converted the text in the csv using this perl script:http://stackoverflow.com/questions/1...n-a-dictionary
But it is giving me more columns than original and I can't load it the output in R.

The first two lines just enable the strict and warnings pragmas, which make perl enforce stricter rules on the program and warn about questionable constructs.

Code:

use strict;
use warnings;

The next three lines read in data from a text file and load it into a hash (lookup table) called %dict. The script expects two columns here. The split ' ' form is a special case in perl: it splits on any run of whitespace, and with the limit of 2 it stops after the first split, so everything after the first column, including any trailing tab or space, ends up in the second field. You may wish to change the separator to a tab and check for trailing whitespace. What is that file really using for a separator?

Code:

open my $fh, '<', 'mypathway.txt' or die $!;
my %dict =  map { chomp; split ' ', $_, 2 } <$fh>;
my $re = join '|', keys %dict;

The next five lines open the data file and go through it line by line replacing any of the keys of the lookup table with the matching data from the lookup table. The first part of the substitution s/// is the other place where you should be paying attention if it is producing extra columns.

Edit: it is using the magic variable $_ which often does not need to be named explicitly.

Code:

open $fh, '<', 'orthogroups.csv' or die $!;
while (<$fh>) {
  s/($re)/$dict{$1}/g;
  print;
}

The code has been written for stylishness / trendiness not so much clarity. Though I must admit I am not with the "in" crowd as far as the current perl styles go.
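
For comparison, here is a rough Python sketch of the same lookup-and-substitute idea. It differs from the perl above in two ways, both of them assumptions about what is wanted: keys are escaped before they go into the pattern, and a key only matches as a whole symbol, not as the prefix of a longer one. It also assumes the lookup values have already had trailing whitespace removed, since the mypathway.txt sample shows a tab after the second column, and a tab carried into the replacement text would add fields to the output. The names build_pattern and replace_symbols and the sample data are hypothetical.

```python
import re

def build_pattern(lookup):
    """Compile one regex that matches any key of the lookup as a whole symbol."""
    # Longest keys first so a longer symbol is tried before a shorter prefix.
    keys = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    # A symbol must be preceded by a tab, comma, space or the line start,
    # and followed by a tab, comma, space, newline or the line end.
    return re.compile(r"(?<![^\t, ])(?:" + alternation + r")(?![^\t, \n])")

def replace_symbols(line, pattern, lookup):
    """Replace every whole-symbol match in the line with its lookup value."""
    return pattern.sub(lambda m: lookup[m.group(0)], line)

# Hypothetical lookup (values already stripped) and one hypothetical table line.
lookup = {"cgd7_40": "Spliceosome", "PmUG01_06016200.1-p1": "-"}
line = "OG1\tcgd7_40, cgd7_400\tPmUG01_06016200.1-p1\n"
pattern = build_pattern(lookup)
result = replace_symbols(line, pattern, lookup)
print(repr(result))
```

Here cgd7_400 is left alone because it is not a key itself, and the number of tabs in the line is unchanged by the substitution.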

zillur · 03-07-2017, 01:49 PM

Thank you very much for your comment. I just ran:

Code:

use strict;
use warnings;
open my $fh, '<', 'k_all_1.txt' or die $!;
my %dict =  map { chomp; split '\t', $_, 2 } <$fh>;
my $re = join '|', keys %dict;

open $fh, '<', 'Orthogroups_3.csv' or die $!;
while (<$fh>) {
  s/($re)/$dict{$1}/g;
  print;
}

It gave me the output I wanted, but when I tried to load it in R:

Code:

> grpsTgrpsTbl <- read.csv("k_ortho_1.csv", header=T, sep = "\t", row.names = 1, stringsAsFactors=F)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  more columns than column names

When I load the data in Excel with "sep=\t" it shows me more columns than the original (attached is part of the output).

If I check the number of columns in bash:

Code:

awk -F'\t' '{print NF; exit}' k_ortho_1.csv
14

Maybe it counts the row names as a column. But if I delete the "row.names=1" part when loading in R, it still gives me the same error:

Code:

> grpsTgrpsTbl <- read.csv("k_ortho_1.csv", header=T, sep = "\t", stringsAsFactors=F)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  more columns than column names

What should I do now? Thanks again for your valuable comments.

Best Regards
Zillur
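
One way to narrow this down is to count the tab-separated fields on every line rather than only the first. The awk command above exits after line 1, and the R error typically means that some data line has more fields than the header line. A small Python sketch; the function name mismatched_lines and the sample lines are illustrative, not taken from the real file.

```python
def mismatched_lines(lines, sep="\t"):
    """Return (header_fields, [(line_number, n_fields), ...]) listing every
    line whose field count differs from that of the first (header) line."""
    expected = None
    bad = []
    for number, line in enumerate(lines, start=1):
        n_fields = len(line.rstrip("\r\n").split(sep))
        if expected is None:
            expected = n_fields          # field count of the header line
        elif n_fields != expected:
            bad.append((number, n_fields))
    return expected, bad

# Hypothetical sample: the third line carries one stray tab.
sample = [
    "\tCparvum\tBmicroti\n",
    "OG1\t-\t-\n",
    "OG2\t-\t, -\t\n",
]
expected, bad = mismatched_lines(sample)
print(expected, bad)
```

On the real output this could be run as `with open("k_ortho_1.csv") as fh: print(mismatched_lines(fh))`; any line it reports is a candidate for stray tabs introduced by the replacement values.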