LinuxQuestions.org forum: Linux - Newbie
Post #1 by Trd300, 02-28-2012, 01:17 AM
Search multiple patterns & print matching patterns instead of whole line
%%%%%
Last edited by Trd300; 05-01-2012 at 04:30 AM.

Post #2 by firstfire, 02-28-2012, 12:49 PM
Hi.
For file[12].tab of moderate size this may work:
Code:
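#!/usr/bin/awk -f
# File: search.awk
# Usage: chmod +x search.awk; ./search.awk bigdb.tab
BEGIN {
    # Join the names in each list into one alternation regex: name1|name2|...
    while ((getline < "file1.tab") > 0)
        pat1 = pat1 (pat1 ? "|" : "") $0
    while ((getline < "file2.tab") > 0)
        pat2 = pat2 (pat2 ? "|" : "") $0
}
# Keep only records whose third field matches a name from both lists.
$3 ~ pat1 && $3 ~ pat2 {
    split($3, a, /,/)
    $3 = ""
    $4 = ""
    # Rebuild $3 from the file1.tab matches and $4 from the file2.tab matches.
    for (e in a) {
        if (a[e] ~ pat1) $3 = $3 ($3 ? "," : "") a[e]
        if (a[e] ~ pat2) $4 = $4 ($4 ? "," : "") a[e]
    }
    print
}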
Code:
$ ./search.awk bigdb.tab
db1 0001 A_ent C_ent
db3 0003 H_ent K_ent

Post #3 by Trd300 (Original Poster), 02-28-2012, 07:57 PM
Hi firstfire! Thanks for your help!
Your program works perfectly on small files, but not on bigger ones.
There are 3 major problems with bigger files:
- the records contain only one field, with Database, Ref. and Entities separated by spaces, not tabs.
- Database and Ref. ($1 and $2) correspond to each other, but not to the right Entities ($3)
- it prints records containing names that are not in file1.tab or file2.tab
Why does your script work on small files but not on big ones (reminder: I am a beginner in Unix)?
Is there any way to make it more "stable"?
Thanks again!

Post #4 by lisle2011, 02-28-2012, 08:31 PM
Tab delimited file
If part of the file is tab-delimited, why is the other part (the big part) delimited by spaces?
So it is not a tab-delimited file at all but a file that is higgledy-piggledy. If the spaces are the same on each line, then replace them with tabs or some other delimiter so that the file is consistent. Then the above code will work like a dream.
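One way to check whether a file really is tab-delimited throughout is to count the tab-separated fields on each line. This is only a minimal sketch: the helper name count_fields and the three sample lines are made up for illustration.

```shell
# count_fields FILE: print each distinct tab-separated field count, followed by
# the number of lines that have it. (count_fields is a hypothetical helper name.)
count_fields() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}

# Demo on made-up data: two proper lines and one that uses spaces instead of tabs.
tmp=$(mktemp -d)
printf 'db1\t0001\tA_ent,C_ent\ndb2\t0002\tB_ent\ndb3 0003 H_ent,K_ent\n' > "$tmp/sample.tab"
count_fields "$tmp/sample.tab"    # prints "1 1" and then "3 2"
rm -rf "$tmp"
```

A file that is consistently tab-delimited with three columns would report a single line starting with 3; any other count points at lines worth inspecting.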
If I helped at all give me a pat on the back

Post #5 by firstfire, 02-28-2012, 09:50 PM
Hi.
Please post your actual input data (or a representative part of it) and, if possible, the desired result.

Post #6 by Trd300 (Original Poster), 02-29-2012, 02:42 AM
@lisle2011:
Hi!
Thanks for your help!
My original files are all tab-delimited for sure.

Post #7 by Trd300 (Original Poster), 02-29-2012, 02:49 AM
%%%%%
Last edited by Trd300; 05-01-2012 at 04:31 AM.

Post #8 by Trd300 (Original Poster), 02-29-2012, 03:05 AM
%%%%%
Last edited by Trd300; 05-01-2012 at 04:31 AM.

Post #9 by firstfire, 02-29-2012, 04:41 AM
Hi.
Here is a new version:
Code:
#!/usr/bin/awk -f
# Store in result[] every word from words[] that matches string (used as a regex).
function search(string, words, result, found)
{
    # clear out the result array
    split("", result)
    for (w in words)
        if (string ~ w) { result[w]; found++ }
    return found    # nonzero on success
}
BEGIN {
    FS = "\t"
    OFS = "\t"
    # Load each word list into the keys of an array.
    while ((getline < "file1.tab") > 0)
        pat1[$0]
    while ((getline < "file2.tab") > 0)
        pat2[$0]
}
search($3, pat1, res1) && search($3, pat2, res2) {
    $3 = ""
    $4 = ""
    for (e in res1) $3 = $3 ($3 ? "," : "") e
    for (e in res2) $4 = $4 ($4 ? "," : "") e
    print
}
It should also run a bit faster, but the order of the items may change.
The logic is the same as before, though, and for the line
Code:
$ sed -n '11p' bigdb.txt
db12 12179021 F77_item,TH3B_item,SL978_item,7PO71_item,7POS7_item,7POS1_item,7POE_item,7LBU_item,PISE3_item,SPS92_item,3BP7_item,SGOL7_item,7P71B_item,IT77_item,7MPP1_item,KLK1_item,PPB1_item,7TM_item,P7LB7_item,KSM71_item,G7B31_item,7PO77_item,7PO79_item,TI7M1_item,L7T_item,G37P7_item,ITB2_item,PMP_item,MYL9_item,FIB7_item,FI4S_item,PGKG_item,SXSL7_item,IS1_item,SF7P_item,IGHG1_item,SF7B_item,4L3S9_item,T3Y1_item,PPE9B_item,O38P7_item,O3177_item,O37T4_item,S7S1I_item,4SO71_item,S4OT1_item,SS427_item,7P1M1_item,SMS7_item,777T_item,BBS1_item,GELS_item,OSTS_item,4PH4_item,PF9V_item,S7BL1_item,7P9E1_item,VI4EX_item,EXOS4_item,SHKB_item,SP7ST_item,P779F_item,S7TL1_item,7SPG_item,MYLK1_item,QT3P1_item,K1S19_item,SLS77_item,K7S1_item,K7S2_item,FETU7_item,TTHY_item,1BP2_item,SL7P7_item,374B1_item,3Y37_item,GF7P_item,F1071_item,K7S47_item,PEG1_item,T3120_item,SPSP1_item,SF7H_item,HPT_item,K7S4B_item,K1S10_item,HEMO_item,3LF_item,SF319_item,T77P1_item,STGE2_item,PHS1_item,PSP_item,S3UM1_item,GS41L_item,SEP99_item,K0717_item,Z4717_item,K1S9_item,J73P7_item,TPS11_item,S7SS_item,TL47_item,PL7G1_item,PSPH1_item,OBSL1_item,SETX_item,K1S14_item,SYTS_item,PE3L1_item,744KB1_item,TITI4_item,LPB1_item,7POS1_item,IGKS_item,KV109_item,SXG7_item,TTP7L_item,IGPS9_item,F71E7_item,YS018_item,ITIH1_item,K1S11_item,Z4797_item,F7277_item,F111B_item,PZ349_item,SL012_item,SMT71_item
it outputs
Code:
$ sed -n '11p' bigdb.txt | ./search.awk
db12 12179021 SMS7_item,I4S_item,3Y37_item S7S1I_item,KSM71_item,SS427_item,79_item
instead of what you want
Code:
db12 12179021 3Y37_item 79_item,S7S1I_item,SS427_item,KSM71_item
Either I do not understand the logic, or you are wrong. SMS7_item and I4S_item are both present in file1.tab and in the 11th line of bigdb.txt, so they should appear in $3. No?

Post #10 by grail, 02-29-2012, 09:49 AM
Actually, firstfire, you are in error, but only because your search does not delimit the search item (I too hit this snag).
Allow me to demonstrate.
Using the string I4S_item, which is in file1.txt:
Code:
grep I4S_item file1.txt
I4S_item
Now let us search the supplied bigdb.txt for the line that contains both 12179021 and I4S_item:
Code:
grep 12179021 bigdb.txt | grep I4S_item
db12 12179021 F77_item,TH3B_item,SL978_item,7PO71_item,7POS7_item,7POS1_item,7POE_item,7LBU_item,PISE3_item,SPS92_item,3BP7_item,SGOL7_item,7P71B_item,IT77_item,7MPP1_item,
KLK1_item,PPB1_item,7TM_item,P7LB7_item,KSM71_item,G7B31_item,7PO77_item,7PO79_item,TI7M1_item,L7T_item,G37P7_item,ITB2_item,PMP_item,MYL9_item,FIB7_item,FI4S_item,
PGKG_item,SXSL7_item,IS1_item,SF7P_item,IGHG1_item,SF7B_item,4L3S9_item,T3Y1_item,PPE9B_item,O38P7_item,O3177_item,O37T4_item,S7S1I_item,4SO71_item,S4OT1_item,
SS427_item,7P1M1_item,SMS7_item,777T_item,BBS1_item,GELS_item,OSTS_item,4PH4_item,PF9V_item,S7BL1_item,7P9E1_item,VI4EX_item,EXOS4_item,SHKB_item,SP7ST_item,
P779F_item,S7TL1_item,7SPG_item,MYLK1_item,QT3P1_item,K1S19_item,SLS77_item,K7S1_item,K7S2_item,FETU7_item,TTHY_item,1BP2_item,SL7P7_item,374B1_item,
3Y37_item,GF7P_item,F1071_item,K7S47_item,PEG1_item,T3120_item,SPSP1_item,SF7H_item,HPT_item,K7S4B_item,K1S10_item,HEMO_item,3LF_item,SF319_item,T77P1_item,
STGE2_item,PHS1_item,PSP_item,S3UM1_item,GS41L_item,SEP99_item,K0717_item,Z4717_item,K1S9_item,J73P7_item,TPS11_item,S7SS_item,TL47_item,PL7G1_item,PSPH1_item,
OBSL1_item,SETX_item,K1S14_item,SYTS_item,PE3L1_item,744KB1_item,TITI4_item,LPB1_item,7POS1_item,IGKS_item,KV109_item,SXG7_item,TTP7L_item,IGPS9_item,F71E7_item,
YS018_item,ITIH1_item,K1S11_item,Z4797_item,F7277_item,F111B_item,PZ349_item,SL012_item,SMT71_item
From here we can see that the string I4S_item does exist, but only as part of FI4S_item, which is not in file1.txt.
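The difference is easy to reproduce on a tiny example. The sketch below uses portable awk and a made-up three-item list; the function names has_substring and has_item are hypothetical. It contrasts the regex test used so far with a comparison against each whole comma-separated item.

```shell
# has_substring WORD LIST: regex test, as in the scripts so far.
has_substring() {
    echo "$2" | awk -v w="$1" '{ print (($0 ~ w) ? "match" : "no match") }'
}

# has_item WORD LIST: split LIST on commas and compare whole items for equality.
has_item() {
    echo "$2" | awk -v w="$1" '{
        n = split($0, item, ",")
        hit = 0
        for (i = 1; i <= n; i++)
            if (item[i] == w) hit = 1
        print (hit ? "match" : "no match")
    }'
}

# Made-up list that contains FI4S_item but not I4S_item itself.
list='F77_item,FI4S_item,PGKG_item'
has_substring 'I4S_item'  "$list"    # prints "match": I4S_item occurs inside FI4S_item
has_item      'I4S_item'  "$list"    # prints "no match"
has_item      'FI4S_item' "$list"    # prints "match"
```

Comparing whole items avoids regular expressions altogether, so it does not depend on any awk-specific word-boundary syntax.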
However, I would point out that some of the OP's "This is what I want" examples are actually incorrect, i.e. they do not have the correct or complete list of items that should be found.
Also, your "search" function is quite similar to match (which I tried, although I struggled to get all the matches), so in case it helps further the solution:
Code:
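#!/usr/bin/awk -f
# Written for gawk: uses the \< \> word-boundary operators and match() with an array argument.
BEGIN { OFS = "\t" }
# While reading the word-list files, build one alternation per list: \<(name1|name2|...
FILENAME ~ /file/ {
    if (FILENAME ~ /1/)
        pat1 = pat1 (pat1 ? "|" : "\\<(") $0
    else
        pat2 = pat2 (pat2 ? "|" : "\\<(") $0
    next
}
# On the first data line, close both patterns once: ...)\>
!x {
    pat1 = pat1 ")\\>"
    pat2 = pat2 ")\\>"
    x = 1
}
match($NF, pat1, f1) && match($NF, pat2, f2) {
    $NF = ""
    print $0 f1[0], f2[0]
}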
This is run like so:
Code:
./search.awk file* bigdb.txt
I believe it produces all the correct lines, but at present it reports only the first match in fields 3 and 4, not multiple matches.
Last edited by grail; 02-29-2012 at 09:51 AM.

Post #11 by firstfire, 02-29-2012, 10:18 AM
Hi.
Quote:
Originally Posted by grail
Actually firstfire, you are in error but only because your search does not delimit the search item (I too hit this snag) 
Ohh, I see. My bad. Thanks for pointing that out, grail.
I hope the following will fix the problem:
Code:
function search(string, words, result, found)
{
    # clear out the result array
    split("", result)
    for (w in words)
        if (string ~ "\\<" w "\\>") { result[w]; found++ }
    return found    # nonzero on success
}

Post #12 by grail, 02-29-2012, 10:53 AM
Here is an alternative:
Code:
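#!/usr/bin/awk -f
# Written for gawk (word boundaries, match() with an array argument).
BEGIN { OFS = "\t" }
FILENAME ~ /file/ {
    if (FILENAME ~ /1/)
        pat1 = pat1 (pat1 ? "|" : "\\<(") $0
    else
        pat2 = pat2 (pat2 ? "|" : "\\<(") $0
    next
}
!x {
    pat1 = pat1 ")\\>"
    pat2 = pat2 ")\\>"
    x = 1
}
match($NF, pat1, f1) && match($NF, pat2, f2) {
    p3 = p4 = ""
    p3 = findmore(f1[0], f1[0,"start"] + f1[0,"length"], pat1)
    p4 = findmore(f2[0], f2[0,"start"] + f2[0,"length"], pat2)
    $NF = ""
    print $0 p3, p4
}
# Collect the remaining matches of pat in $NF, starting after the first match.
# len is the position of the comma that follows the first match.
function findmore(string, len, pat,    f, l, ret_string)
{
    l = len + 1    # first character of the next item
    ret_string = string
    while (match(substr($NF, l), pat, f)) {
        ret_string = ret_string "," f[0]
        # f[0,"start"] is relative to the substring, so this moves l to the
        # first character after the comma that follows the match
        l = l + f[0,"start"] + f[0,"length"]
    }
    return ret_string
}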
My question about the output is: what is the requirement if something appears twice? For example:
Code:
db7 do sy nt P3P1_item,P3P1_item S7S17_item,S7S1B_item,S7S1P_item,G47S1_item,G47S1_item,G3I77_item,G3I79_item,I3K1_item,I3K4_item,4MPE1_item,SS417_item
As you can see there is repetition, but this is because the item appears more than once in the data:
Code:
db7 do sy nt 7727_item,772B_item,772P_item,772E_item,772G_item,7777_item,777B_item,77B7_item,77BB_item,77BP_item,77BG_item,7PSY2_item,7KT1_item,7KT7_item,7KT1_item,7LEX_item,7OF7_item,
7OFB_item,733B7_item,7TF7_item,7TF9_item,7TF4B_item,BM7L1_item,S7S17_item,S7S1B_item,S7S1S_item,S7S1P_item,S7LL1_item,S7LL2_item,S7LL4_item,S7LM_item,S7LY_item,SLOSK_item,SOMT_item,
S31L1_item,S31L7_item,S31L1_item,S31L9_item,S3EB1_item,S3EB1_item,S3EB2_item,PPS_item,P3P1_item,P3P7_item,P3P1_item,P3P9_item,P3P2_item,FOS_item,GBB1_item,GBB7_item,GBB1_item,GBB9_item,
GBB2_item,GBG10_item,GBG11_item,GBG17_item,GBG11_item,GBG1_item,GBG7_item,GBG1_item,GBG9_item,GBG2_item,GBG7_item,GBG8_item,GBGT7_item,G47I1_item,G47I7_item,G47I1_item,G47L_item,G47O_item,
G47Q_item,G47S1_item,G47S7_item,G47S1_item,G3I71_item,G3I77_item,G3I71_item,G3I79_item,GSK17_item,GSK1B_item,I3K1_item,I3K2_item,I3K4_item,I3K9_item,ITP31_item,ITP37_item,ITP31_item,K7PS7_item,
K7PSB_item,K7PSG_item,KSS77_item,KSS7B_item,KSS7P_item,KSS7G_item,KIF27_item,KIF2S_item,KI4H_item,KPS7_item,KPSB_item,KPSG_item,MK08_item,MK09_item,MK10_item,MK11_item,MK17_item,MK11_item,
MK19_item,4MPE1_item,4MPE7_item,P7317_item,P731B_item,P731S_item,PLSB1_item,PLSB7_item,PLSB1_item,PLSB9_item,PP17_item,PP1B_item,PP1G_item,PP777_item,PP77B_item,PP7B7_item,PP7BB_item,PP7BS_item,
PP31B_item,P3KX_item,SS471_item,SS417_item,TY1H_item,VM7T1_item,VM7T7_item
I do not understand the significance of an item appearing more than once, so I would need clarification.
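If repeats turn out not to be wanted, one common awk idiom is to remember each item in an array the first time it is seen and skip it afterwards. This is a sketch under that assumption only, since the requirement is still open; dedup_items is a made-up name.

```shell
# dedup_items LIST: print the comma-separated LIST with later repeats removed,
# keeping the first occurrence of each item in its original position.
dedup_items() {
    echo "$1" | awk '{
        split("", seen)                        # forget items from earlier lines
        n = split($0, item, ",")
        out = ""
        for (i = 1; i <= n; i++) {
            if (item[i] in seen) continue      # already kept once
            seen[item[i]] = 1
            out = out (out == "" ? "" : ",") item[i]
        }
        print out
    }'
}

dedup_items 'P3P1_item,P3P1_item'                           # prints P3P1_item
dedup_items 'S7S17_item,G47S1_item,G47S1_item,G3I77_item'   # prints S7S17_item,G47S1_item,G3I77_item
```

The same loop could be applied to fields 3 and 4 just before printing a record.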

Post #13 by firstfire, 02-29-2012, 11:17 AM
Hi.
The following script preserves the order of items. I stole (is that the correct form?) grail's approach to loading the word lists:
Code:
#!/usr/bin/awk -f
# Store in result[1..N] the words from words[] found in string as whole words;
# the count is kept in result["N"]. The \< \> word boundaries are gawk operators.
function search(string, words, result)
{
    # clear out the result array
    split("", result)
    for (w in words)
        if (string ~ "\\<" w "\\>") { result[++result["N"]] = w }
    return result["N"]    # nonzero on success
}
BEGIN { FS = "\t"; OFS = "\t" }
FILENAME ~ /file/ {
    if (FILENAME ~ /1/)
        pat1[$0]
    else
        pat2[$0]
    next
}
search($3, pat1, res1) && search($3, pat2, res2) {
    $3 = ""
    $4 = ""
    for (i = 1; i <= res1["N"]; i++) $3 = $3 ($3 ? "," : "") res1[i]
    for (i = 1; i <= res2["N"]; i++) $4 = $4 ($4 ? "," : "") res2[i]
    print
}
Run as follows
Code:
$ ./search.awk file* bigdb.txt
Last edited by firstfire; 02-29-2012 at 11:21 AM.
Reason: Remove unused variable.

Post #14 by grail, 02-29-2012, 12:42 PM
@firstfire - as you are using the 'in' construct for your array search, you are not preserving the order. You may currently be getting the correct order by luck, but it is not producing the same results for me, and since this construct has no set order, that outcome is to be expected.
On the plus side, searching the array dramatically improves the run time.
It is up to the OP whether preserving the order is important.
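If the order does matter, one way to keep it is to walk the items of the third field from left to right and look each one up in the word lists, instead of looping over the lists with 'in'. The following is a sketch in portable awk, assuming exact whole-item matches are what is wanted; search_ordered and the sample data are made up for illustration.

```shell
# search_ordered LIST1 LIST2 DB: for each tab-delimited record in DB whose third
# field contains items from both lists, print $1, $2, the LIST1 items and the
# LIST2 items, each in the order in which they appear in the record.
search_ordered() {
    awk -F'\t' -v OFS='\t' '
        FILENAME == ARGV[1] { pat1[$0] = 1; next }    # first file: load list 1
        FILENAME == ARGV[2] { pat2[$0] = 1; next }    # second file: load list 2
        {
            n = split($3, item, ",")
            m1 = m2 = ""
            for (i = 1; i <= n; i++) {                # left to right keeps the order
                if (item[i] in pat1) m1 = m1 (m1 == "" ? "" : ",") item[i]
                if (item[i] in pat2) m2 = m2 (m2 == "" ? "" : ",") item[i]
            }
            if (m1 != "" && m2 != "") print $1, $2, m1, m2
        }
    ' "$1" "$2" "$3"
}

# Demo on made-up data.
tmp=$(mktemp -d)
printf 'A_ent\nH_ent\n' > "$tmp/file1.tab"
printf 'C_ent\nK_ent\n' > "$tmp/file2.tab"
printf 'db1\t0001\tK_ent,A_ent,C_ent,H_ent\ndb2\t0002\tA_ent,B_ent\n' > "$tmp/bigdb.tab"
search_ordered "$tmp/file1.tab" "$tmp/file2.tab" "$tmp/bigdb.tab"
# prints: db1<TAB>0001<TAB>A_ent,H_ent<TAB>K_ent,C_ent
rm -rf "$tmp"
```

Each item costs only two array lookups, so this should keep most of the speed of the array approach while leaving the items in record order.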

Post #15 by firstfire, 02-29-2012, 12:56 PM
Quote:
Originally Posted by grail
@firstfire - as you are using the 'in' construct for your array search you are not preserving the order. You may currently be getting the correct order by luck but it is not producing the same results
for me and this construct has no set order so this outcome is to be expected.
On the plus side the search of the array dramatically improves the time
Up to OP if it is important to preserve order.
Again, you're right. I should get some sleep.