LinuxQuestions.org
Old 02-28-2012, 01:17 AM   #1
Trd300
Member
 
Registered: Feb 2012
Posts: 89

Rep: Reputation: Disabled
Search multiple patterns & print matching patterns instead of whole line


%%%%%

Last edited by Trd300; 05-01-2012 at 04:30 AM.
 
Old 02-28-2012, 12:49 PM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362
Hi.

For file[12].tab of moderate size this may work:
Code:
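#!/usr/bin/awk -f

# File: search.awk
# Usage: chmod +x search.awk; ./search.awk bigdb.tab

BEGIN {
	# Join every line of each word list into one alternation regex: w1|w2|...
	while ((getline < "file1.tab") > 0)
		pat1 = pat1 (pat1 ? "|" : "") $0
	while ((getline < "file2.tab") > 0)
		pat2 = pat2 (pat2 ? "|" : "") $0
}

# Keep only records whose third field matches at least one word from each list.
$3 ~ pat1 && $3 ~ pat2 {
	split($3, a, /,/)
	$3 = ""
	$4 = ""
	# Rebuild $3 from the items matching list 1 and $4 from those matching list 2.
	for (e in a) {
		if (a[e] ~ pat1) $3 = $3 ($3 ? "," : "") a[e]
		if (a[e] ~ pat2) $4 = $4 ($4 ? "," : "") a[e]
	}
	print
}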
Code:
$ ./search.awk bigdb.tab
db1 0001 A_ent C_ent
db3 0003 H_ent K_ent
 
Old 02-28-2012, 07:57 PM   #3
Trd300
Member
 
Registered: Feb 2012
Posts: 89

Original Poster
Rep: Reputation: Disabled
Hi firstfire ! Thanks for your help !

Your program works perfectly on small files, but not on bigger ones.

There are 3 major problems with bigger files:
- the records contain only one field, with Database, Ref. and Entities separated by spaces, not tabs.
- Database and Ref. ($1 and $2) correspond to each other, but not to the correct Entities ($3).
- it prints records containing names that are not in file1.tab or file2.tab.

Why does your script work on small files but not on big ones (reminder: I am a beginner in Unix)?
Is there any way to make it more "stable"?

Thanks again !
 
Old 02-28-2012, 08:31 PM   #4
lisle2011
Member
 
Registered: Mar 2011
Location: Surrey B.C. Canada (Metro Vancouver)
Distribution: Slackware 2.6.33.4-smp
Posts: 179
Blog Entries: 1

Rep: Reputation: 25
Tab delimited file

If part of the file is tab delimited why is the other part (the big part) delimited by spaces?

So it is not a tab delimited file at all but a file that is higgledy piggledy. If the spaces are the same on each line then replace them with tabs or some other delimiter so that the file is consistent. Then the above code will work like a dream.
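
A note on the "spaces, not tabs" symptom: if a script does not set FS and OFS, awk splits on any whitespace and, as soon as a field is assigned, rebuilds the record with single spaces. The sketch below uses a made-up one-line record (not the OP's data) to show the difference; it is an illustration of that behaviour, not a diagnosis of the actual files.

```sh
# A made-up, tab-delimited record (hypothetical data).
line=$(printf 'db1\t0001\tA_ent,B_ent')

# Default FS/OFS: assigning to $3 rebuilds the record with single spaces.
out_default=$(printf '%s\n' "$line" | awk '{ $3 = "A_ent"; print }')

# Tab FS/OFS: the rebuilt record stays tab-delimited.
out_tab=$(printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "\t" } { $3 = "A_ent"; print }')

printf '%s\n' "$out_default" "$out_tab"
```

The first output line is separated by spaces and the second by tabs, even though the input was the same.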




If I helped at all give me a pat on the back
 
Old 02-28-2012, 09:50 PM   #5
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362
Hi.

Please post your actual input data (or a representative part of it) and, if possible, the desired result.
 
Old 02-29-2012, 02:42 AM   #6
Trd300
Member
 
Registered: Feb 2012
Posts: 89

Original Poster
Rep: Reputation: Disabled
@ lisle2011:
Hi !
Thanks for your help !
My original files are all tab-delimited for sure.
 
Old 02-29-2012, 02:49 AM   #7
Trd300
Member
 
Registered: Feb 2012
Posts: 89

Original Poster
Rep: Reputation: Disabled
%%%%%
Attached Files
File Type: txt bigdb.txt (5.6 KB, 4 views)
File Type: txt file1.txt (4.0 KB, 4 views)
File Type: txt file2.txt (8.6 KB, 2 views)

Last edited by Trd300; 05-01-2012 at 04:31 AM.
 
Old 02-29-2012, 03:05 AM   #8
Trd300
Member
 
Registered: Feb 2012
Posts: 89

Original Poster
Rep: Reputation: Disabled
%%%%%

Last edited by Trd300; 05-01-2012 at 04:31 AM.
 
Old 02-29-2012, 04:41 AM   #9
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362
Hi.

Here is a new version
Code:
#!/usr/bin/awk -f
function search(string, words, result,    found)
{
	# clear out the result array
	split("", result)
	for(w in words)
		if(string ~ w) { result[w]; found++; }
	return found;	# nonzero on success
}

BEGIN{
	FS="\t"
	OFS="\t"
	while( (getline < "file1.tab") > 0 )
		pat1[$0]
	while( (getline < "file2.tab") > 0 )
		pat2[$0]
}
search($3, pat1, res1) && search($3, pat2, res2) {
		$3=""
		$4=""
		for(e in res1) $3=$3 ($3?",":"") e
		for(e in res2) $4=$4 ($4?",":"") e
		
		print
	}
It should also run a bit faster, but the order of the items may change.

The logic, however, is the same as before, and for the line
Code:
$ sed -n '11p' bigdb.txt 
db12    12179021        F77_item,TH3B_item,SL978_item,7PO71_item,7POS7_item,7POS1_item,7POE_item,7LBU_item,PISE3_item,SPS92_item,3BP7_item,SGOL7_item,7P71B_item,IT77_item,7MPP1_item,KLK1_item,PPB1_item,7TM_item,P7LB7_item,KSM71_item,G7B31_item,7PO77_item,7PO79_item,TI7M1_item,L7T_item,G37P7_item,ITB2_item,PMP_item,MYL9_item,FIB7_item,FI4S_item,PGKG_item,SXSL7_item,IS1_item,SF7P_item,IGHG1_item,SF7B_item,4L3S9_item,T3Y1_item,PPE9B_item,O38P7_item,O3177_item,O37T4_item,S7S1I_item,4SO71_item,S4OT1_item,SS427_item,7P1M1_item,SMS7_item,777T_item,BBS1_item,GELS_item,OSTS_item,4PH4_item,PF9V_item,S7BL1_item,7P9E1_item,VI4EX_item,EXOS4_item,SHKB_item,SP7ST_item,P779F_item,S7TL1_item,7SPG_item,MYLK1_item,QT3P1_item,K1S19_item,SLS77_item,K7S1_item,K7S2_item,FETU7_item,TTHY_item,1BP2_item,SL7P7_item,374B1_item,3Y37_item,GF7P_item,F1071_item,K7S47_item,PEG1_item,T3120_item,SPSP1_item,SF7H_item,HPT_item,K7S4B_item,K1S10_item,HEMO_item,3LF_item,SF319_item,T77P1_item,STGE2_item,PHS1_item,PSP_item,S3UM1_item,GS41L_item,SEP99_item,K0717_item,Z4717_item,K1S9_item,J73P7_item,TPS11_item,S7SS_item,TL47_item,PL7G1_item,PSPH1_item,OBSL1_item,SETX_item,K1S14_item,SYTS_item,PE3L1_item,744KB1_item,TITI4_item,LPB1_item,7POS1_item,IGKS_item,KV109_item,SXG7_item,TTP7L_item,IGPS9_item,F71E7_item,YS018_item,ITIH1_item,K1S11_item,Z4797_item,F7277_item,F111B_item,PZ349_item,SL012_item,SMT71_item
it outputs
Code:
$ sed -n '11p' bigdb.txt | ./search.awk 
db12    12179021        SMS7_item,I4S_item,3Y37_item    S7S1I_item,KSM71_item,SS427_item,79_item
instead of what you want
Code:
db12    12179021        3Y37_item       79_item,S7S1I_item,SS427_item,KSM71_item
Either I do not understand the logic, or your expected output is wrong. SMS7_item and I4S_item are both present in file1.tab and in the 11th line of bigdb.txt, so they should appear in $3. No?
 
Old 02-29-2012, 09:49 AM   #10
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,442

Rep: Reputation: 1879
Actually firstfire, you are in error but only because your search does not delimit the search item (I too hit this snag)

Allow me to demonstrate:

Using string I4S_item which is in file1.txt:
Code:
grep I4S_item file1.txt
I4S_item
Now let us search the supplied bigdb.txt for the line containing both 12179021 and I4S_item:
Code:
grep 12179021 bigdb.txt | grep I4S_item
db12	12179021	F77_item,TH3B_item,SL978_item,7PO71_item,7POS7_item,7POS1_item,7POE_item,7LBU_item,PISE3_item,SPS92_item,3BP7_item,SGOL7_item,7P71B_item,IT77_item,7MPP1_item,
KLK1_item,PPB1_item,7TM_item,P7LB7_item,KSM71_item,G7B31_item,7PO77_item,7PO79_item,TI7M1_item,L7T_item,G37P7_item,ITB2_item,PMP_item,MYL9_item,FIB7_item,FI4S_item,
PGKG_item,SXSL7_item,IS1_item,SF7P_item,IGHG1_item,SF7B_item,4L3S9_item,T3Y1_item,PPE9B_item,O38P7_item,O3177_item,O37T4_item,S7S1I_item,4SO71_item,S4OT1_item,
SS427_item,7P1M1_item,SMS7_item,777T_item,BBS1_item,GELS_item,OSTS_item,4PH4_item,PF9V_item,S7BL1_item,7P9E1_item,VI4EX_item,EXOS4_item,SHKB_item,SP7ST_item,
P779F_item,S7TL1_item,7SPG_item,MYLK1_item,QT3P1_item,K1S19_item,SLS77_item,K7S1_item,K7S2_item,FETU7_item,TTHY_item,1BP2_item,SL7P7_item,374B1_item,
3Y37_item,GF7P_item,F1071_item,K7S47_item,PEG1_item,T3120_item,SPSP1_item,SF7H_item,HPT_item,K7S4B_item,K1S10_item,HEMO_item,3LF_item,SF319_item,T77P1_item,
STGE2_item,PHS1_item,PSP_item,S3UM1_item,GS41L_item,SEP99_item,K0717_item,Z4717_item,K1S9_item,J73P7_item,TPS11_item,S7SS_item,TL47_item,PL7G1_item,PSPH1_item,
OBSL1_item,SETX_item,K1S14_item,SYTS_item,PE3L1_item,744KB1_item,TITI4_item,LPB1_item,7POS1_item,IGKS_item,KV109_item,SXG7_item,TTP7L_item,IGPS9_item,F71E7_item,
YS018_item,ITIH1_item,K1S11_item,Z4797_item,F7277_item,F111B_item,PZ349_item,SL012_item,SMT71_item
From here we can see that the string I4S_item does exist, but only as part of FI4S_item, which is not in file1.txt.
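
The same effect can be reproduced in awk with a shortened, made-up list. A minimal sketch, assuming the items are comma-separated: an unanchored regex test finds I4S_item inside FI4S_item, while comparing whole items after split() does not.

```sh
# A shortened, made-up list that contains FI4S_item but not I4S_item.
list='FIB7_item,FI4S_item,PGKG_item'

# Unanchored regex test: matches because I4S_item is a substring of FI4S_item.
substring_hit=$(printf '%s\n' "$list" | awk '{ print (($0 ~ "I4S_item") ? "yes" : "no") }')

# Whole-item test: split on commas and compare each item exactly.
exact_hit=$(printf '%s\n' "$list" | awk '{
    n = split($0, item, ",")
    hit = "no"
    for (i = 1; i <= n; i++)
        if (item[i] == "I4S_item") hit = "yes"
    print hit
}')

printf 'substring: %s, exact: %s\n' "$substring_hit" "$exact_hit"
```

This prints "substring: yes, exact: no", which is the false positive described above.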

However, I would point out that some of the OP's "This is what I want" examples are actually incorrect, i.e. they do not contain the correct or complete list of items that should be found.

Also, your "search" function is quite similar to match() (which I tried, although I struggled to get all the matches), so in case it helps further the solution:
Code:
#!/usr/bin/awk -f

BEGIN{ OFS = "\t" }

FILENAME ~ /file/{
    if(FILENAME ~ /1/)
        pat1=pat1 (pat1?"|":"\\<(") $0
    else
        pat2=pat2 (pat2?"|":"\\<(") $0

    next
}

!x{ pat1 = pat1 ")\\>"
    pat2 = pat2 ")\\>"
    x=1
}

match($NF, pat1, f1) && match($NF, pat2, f2){
    $NF = ""
    print $0 f1[0],f2[0]
}
This is run like so:
Code:
./search.awk file* bigdb.txt
I believe it does produce all the correct lines, but at present it does not provide multiple matches in fields 3 and 4.

Last edited by grail; 02-29-2012 at 09:51 AM.
 
Old 02-29-2012, 10:18 AM   #11
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362
Hi.

Quote:
Originally Posted by grail View Post
Actually firstfire, you are in error but only because your search does not delimit the search item (I too hit this snag)
Ohh. I see. My bad. Thanks for pointing that out, grail.

I hope the following will fix the problem:
Code:
function search(string, words, result,    found)
{
        # clear out the result array
        split("", result)
        for(w in words)
                if(string ~ "\\<"w"\\>") { result[w]; found++; }
        return found;   # nonzero on success
}
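
Note that \< and \> are GNU awk extensions, so the function above may not work in other awks. A possible portable alternative, sketched here with a hypothetical helper has_item() (not from this thread), is to wrap both the list and the word in commas and use index(). This also avoids treating the word as a regex.

```sh
# has_item() is a hypothetical helper: it reports whether word is a whole
# item of a comma-separated list, using a plain string search.
out=$(printf 'FIB7_item,FI4S_item,PGKG_item\n' | awk '
function has_item(list, word) {
    # Surrounding commas force the match to cover a complete item.
    return index("," list ",", "," word ",") > 0
}
{ print has_item($0, "I4S_item"), has_item($0, "FI4S_item") }')

printf '%s\n' "$out"
```

It prints "0 1": I4S_item is rejected and FI4S_item is accepted.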
 
Old 02-29-2012, 10:53 AM   #12
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,442

Rep: Reputation: 1879
Here is an alternative:
Code:
#!/usr/bin/awk -f

BEGIN{ OFS = "\t" }

FILENAME ~ /file/{
    if(FILENAME ~ /1/)
        pat1=pat1 (pat1?"|":"\\<(") $0
    else
        pat2=pat2 (pat2?"|":"\\<(") $0

    next
}

!x{ pat1 = pat1 ")\\>"
    pat2 = pat2 ")\\>"
    x=1
}

match($NF, pat1, f1) && match($NF, pat2, f2){
    p3 = p4 = ""

    p3 = findmore(f1[0],f1[0,"start"]+f1[0,"length"],pat1)
    p4 = findmore(f2[0],f2[0,"start"]+f2[0,"length"],pat2)

    $NF = ""
    print $0 p3,p4
}

function findmore(string, len, pat,   f,l,ret_string)
{
    l = len + 1
    ret_string = string

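    while(match(substr($NF,l), pat, f)){
        ret_string = ret_string "," f[0]
        # f[0,"start"] is relative to the substring, so this moves l to just
        # after the comma that follows the match
        l = l + f[0,"start"] + f[0,"length"]
    }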

    return ret_string
}
My question about the output is: what is the requirement when an item appears twice? For example:
Code:
db7	do	sy	nt	P3P1_item,P3P1_item	S7S17_item,S7S1B_item,S7S1P_item,G47S1_item,G47S1_item,G3I77_item,G3I79_item,I3K1_item,I3K4_item,4MPE1_item,SS417_item
As you can see, there is repetition, but only because the item appears more than once in the data:
Code:
db7	do sy nt	7727_item,772B_item,772P_item,772E_item,772G_item,7777_item,777B_item,77B7_item,77BB_item,77BP_item,77BG_item,7PSY2_item,7KT1_item,7KT7_item,7KT1_item,7LEX_item,7OF7_item,
7OFB_item,733B7_item,7TF7_item,7TF9_item,7TF4B_item,BM7L1_item,S7S17_item,S7S1B_item,S7S1S_item,S7S1P_item,S7LL1_item,S7LL2_item,S7LL4_item,S7LM_item,S7LY_item,SLOSK_item,SOMT_item,
S31L1_item,S31L7_item,S31L1_item,S31L9_item,S3EB1_item,S3EB1_item,S3EB2_item,PPS_item,P3P1_item,P3P7_item,P3P1_item,P3P9_item,P3P2_item,FOS_item,GBB1_item,GBB7_item,GBB1_item,GBB9_item,
GBB2_item,GBG10_item,GBG11_item,GBG17_item,GBG11_item,GBG1_item,GBG7_item,GBG1_item,GBG9_item,GBG2_item,GBG7_item,GBG8_item,GBGT7_item,G47I1_item,G47I7_item,G47I1_item,G47L_item,G47O_item,
G47Q_item,G47S1_item,G47S7_item,G47S1_item,G3I71_item,G3I77_item,G3I71_item,G3I79_item,GSK17_item,GSK1B_item,I3K1_item,I3K2_item,I3K4_item,I3K9_item,ITP31_item,ITP37_item,ITP31_item,K7PS7_item,
K7PSB_item,K7PSG_item,KSS77_item,KSS7B_item,KSS7P_item,KSS7G_item,KIF27_item,KIF2S_item,KI4H_item,KPS7_item,KPSB_item,KPSG_item,MK08_item,MK09_item,MK10_item,MK11_item,MK17_item,MK11_item,
MK19_item,4MPE1_item,4MPE7_item,P7317_item,P731B_item,P731S_item,PLSB1_item,PLSB7_item,PLSB1_item,PLSB9_item,PP17_item,PP1B_item,PP1G_item,PP777_item,PP77B_item,PP7B7_item,PP7BB_item,PP7BS_item,
PP31B_item,P3KX_item,SS471_item,SS417_item,TY1H_item,VM7T1_item,VM7T7_item
I do not understand the significance of an item appearing more than once, so I would need clarification.
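
If it turns out that each item should be reported only once (the thread does not settle this), one possible approach is to remember which items have already been emitted for the current record. A minimal sketch on a made-up three-item list; the seen array and res variable are names chosen for illustration:

```sh
out=$(printf 'P3P1_item,P3P7_item,P3P1_item\n' | awk '{
    split("", seen)                 # clear the per-record memory
    n = split($0, item, ",")
    res = ""
    for (i = 1; i <= n; i++)
        if (!(item[i] in seen)) {   # first time this item shows up
            seen[item[i]]
            res = res (res == "" ? "" : ",") item[i]
        }
    print res
}')

printf '%s\n' "$out"
```

This prints P3P1_item,P3P7_item, keeping the first occurrence of each item in its original position.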
 
Old 02-29-2012, 11:17 AM   #13
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362
Hi.

The following script preserves the order of items. I stole (is that a correct form?) grail's approach to loading word-lists
Code:
#!/usr/bin/awk -f
function search(string, words, result)
{
	# clear out the result array
	split("", result)
	for(w in words)
		if(string ~ "\\<"w"\\>") { result[ ++result["N"] ]=w; }
	return result["N"];	# nonzero on success
}

BEGIN{ FS="\t"; OFS="\t" }

FILENAME ~ /file/{
	if(FILENAME ~ /1/)
		pat1[$0]
	else
		pat2[$0]
	next
	}

search($3, pat1, res1) && search($3, pat2, res2) {
	$3=""
	$4=""
	for(i=1; i<=res1["N"]; i++) $3=$3 ($3?",":"") res1[i]
	for(i=1; i<=res2["N"]; i++) $4=$4 ($4?",":"") res2[i]

	print
}
Run as follows
Code:
$ ./search.awk file* bigdb.txt

Last edited by firstfire; 02-29-2012 at 11:21 AM. Reason: Remove unused variable.
 
Old 02-29-2012, 12:42 PM   #14
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,442

Rep: Reputation: 1879
@firstfire - as you are using the 'in' construct for your array search you are not preserving the order. You may currently be getting the correct order by luck but it is not producing the same results for me and this construct has no set order so this outcome is to be expected.

On the plus side the search of the array dramatically improves the time

Up to OP if it is important to preserve order.
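
If order does matter, one way to keep the items in the order they appear in the record is to loop over the record's own items (split() numbers them 1..n) and test each one against the word lists with "in", rather than looping over the word lists. A minimal sketch with tiny made-up files standing in for file1.tab, file2.tab and bigdb.tab; the names set1, set2, hit1 and hit2 are chosen for illustration, and it has not been run against the attachments in this thread:

```sh
# Made-up stand-ins for the three files (hypothetical data).
tmp=$(mktemp -d)
printf 'A_ent\nH_ent\n' > "$tmp/file1.tab"
printf 'C_ent\nK_ent\n' > "$tmp/file2.tab"
printf 'db1\t0001\tK_ent,XA_ent,A_ent,C_ent,H_ent\ndb2\t0002\tA_ent,B_ent\n' > "$tmp/bigdb.tab"

out=$(awk '
BEGIN { FS = OFS = "\t" }
FILENAME == ARGV[1] { set1[$0]; next }   # words from the first list
FILENAME == ARGV[2] { set2[$0]; next }   # words from the second list
{
    n = split($3, item, ",")
    hit1 = hit2 = ""
    # Walk the items in record order; "in" is an exact whole-item lookup.
    for (i = 1; i <= n; i++) {
        if (item[i] in set1) hit1 = hit1 (hit1 == "" ? "" : ",") item[i]
        if (item[i] in set2) hit2 = hit2 (hit2 == "" ? "" : ",") item[i]
    }
    # Print only records with at least one hit from each list.
    if (hit1 != "" && hit2 != "") print $1, $2, hit1, hit2
}' "$tmp/file1.tab" "$tmp/file2.tab" "$tmp/bigdb.tab")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Only the db1 record is printed, with A_ent,H_ent in field 3 and K_ent,C_ent in field 4. K_ent comes before C_ent because that is their order in the record, and XA_ent is ignored because the lookup compares whole items.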
 
Old 02-29-2012, 12:56 PM   #15
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 620

Rep: Reputation: 362
Quote:
Originally Posted by grail View Post
@firstfire - as you are using the 'in' construct for your array search you are not preserving the order. You may currently be getting the correct order by luck but it is not producing the same results for me and this construct has no set order so this outcome is to be expected.

On the plus side the search of the array dramatically improves the time

Up to OP if it is important to preserve order.
Again, you're right. I should get some sleep.
 
  

