#1 | 01-24-2011, 08:45 PM
Member | Registered: Oct 2004 | Distribution: Slackware 12 | Posts: 165
Bash - match lines from two files
Hello all,
It's been a very long time since I've mucked around in Bash or Python so I've pretty much forgotten most of it. I've run into a problem I need to resolve at work. I can do it by hand, but it would take me hours upon hours to do it. I'd like to let the computer do the work for me, if at all possible...I'm just not sure how.
You see, I have two text files, "all.txt" and "address.txt". In the "all.txt" file I have email addresses, first name, and last name (approximately 10,000 lines) like so:
someone@somewhere.net John Doe
someoneelse@somewhere.net Jane Doe
In the "address.txt" file, I have email addresses only like so:
someone@somewhere.net
someoneelse@somewhere.net
I need to write a script that will read each line of the "address.txt" file, find its corresponding match in the "all.txt" file, and then print the whole line (email address, first name, and last name) into a file called "matched.txt". If a line in "address.txt" fails to match a line in "all.txt", then I need it to be printed to a file called "no-match.txt".
Hope this makes sense.
What is the best way to go about this in terms of speed, resource use, and accuracy?
I tried a few things in Bash and Python, but it isn't working out well. I'm back to being a newbie again.
Any help or advice would be sincerely appreciated!
Thanks.
LocoMojo

#2 | 01-24-2011, 09:00 PM
Member | Registered: Oct 2004 | Distribution: Slackware 12 | Posts: 165 | Original Poster
Never mind
As soon as I posted the OP, it dawned on me.
I'm so embarrassed, I forgot about grep.
Thanks anyway.
LocoMojo

#3 | 01-24-2011, 09:01 PM
Senior Member | Registered: Aug 2006 | Posts: 2,695
Python script
Code:
#!/usr/bin/env python
from collections import defaultdict
h = defaultdict(str)
addr = open("address.txt").read().split()
for line in open("all.txt"):
    s=line.rstrip().split(" ",1)
    h[s[0]] = line
keys = h.keys()
same = set(addr) and set(keys)
diff = set(addr) - set(keys)
match = open("matched.txt","w")
for found in same:
    match.write( h[found] )
match.close()
nomatch = open("no-match.txt","w")
for no in diff:
    nomatch.write(no)
nomatch.close()

#4 | 01-24-2011, 09:53 PM
Member | Registered: Oct 2004 | Distribution: Slackware 12 | Posts: 165 | Original Poster
Hello ghostdog74,
I came back because I found that my bash script didn't actually work 100%.
I saw your post and got excited and tried it out. It was so much faster than mine, but unfortunately it didn't work. I tried it with sample files.
all.txt = 3,102 lines
address.txt = 906 lines
After using your script:
matched.txt = 3,102 lines
no-match.txt = 1 line with many addresses (no new lines)
I skimmed over the files and counted at least 25 "no matches" so the matched.txt file should not equal the number of lines in all.txt.
Thanks though!
My bash script was far less elegant, but it almost worked:
Code:
#!/bin/bash
FILE1=address.txt
FILE2=all.txt
while read line; do
    if grep $line $FILE2; then
        echo $line >> matched.txt
    else
        echo $line >> no-matches.txt
    fi
done < $FILE1
With this script I got:
890 matches
15 no matches
A total of 905 out of 906 lines in address.txt ... strange.
I'll have to fiddle more with this. I like your script though; it was much faster and probably lighter on resources, but it was inaccurate.
Thanks again!
LocoMojo

#5 | 01-24-2011, 10:08 PM
Member | Registered: Oct 2004 | Distribution: Slackware 12 | Posts: 165 | Original Poster
In the above post:
"I skimmed over the files and counted at least 25 "no matches" " should have read "I skimmed over the files and counted at least 5 "no matches" ".
It doesn't matter anyway; the matches should not exceed 906 (the number of lines in "address.txt" being checked against the 3,102 lines of "all.txt").
LocoMojo

#6 | 01-24-2011, 10:33 PM
Senior Member | Registered: Aug 2006 | Posts: 2,695
Quote:
Originally Posted by LocoMojo
Hello ghostdog74,
I came back because I found that my bash script didn't actually work 100%.
I saw your post and got excited and tried it out. It was so much faster than mine, but unfortunately it didn't work. I tried it with s

You only provided a small bit of sample file to work with, and my code does work with it.
Why don't you provide more samples of both files? Are they all the same structure? Show your expected output also, if possible. It's much faster than your bash script since yours needs to call grep for EACH line (roughly O(n^2)).
Last edited by ghostdog74; 01-24-2011 at 10:35 PM.
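
The counts reported in post #4 are consistent with two details of the Python script in post #3, offered here as a likely explanation rather than a confirmed diagnosis. First, `set(addr) and set(keys)` is a boolean `and`, which evaluates to the second set whenever the first is non-empty, so every line of all.txt ends up in matched.txt; the set intersection operator is `&`. Second, `nomatch.write(no)` writes no newline, so all unmatched addresses land on one line. A minimal corrected sketch follows, assuming whitespace-separated fields with the address first; the function names `split_matches` and `write_results` are hypothetical.

```python
def split_matches(addresses, records):
    """Return (matched_lines, unmatched_addresses), in address-file order."""
    # Index each record by its first whitespace-separated field (the address).
    # If an address appears more than once, the first record is kept.
    by_address = {}
    for line in records:
        fields = line.split(None, 1)
        if fields:
            by_address.setdefault(fields[0], line.rstrip("\n"))

    matched, unmatched = [], []
    for raw in addresses:
        address = raw.strip()
        if not address:
            continue  # ignore blank lines in the address list
        if address in by_address:
            matched.append(by_address[address])
        else:
            unmatched.append(address)
    return matched, unmatched


def write_results(address_path, all_path,
                  matched_path="matched.txt", unmatched_path="no-match.txt"):
    """Read both files, then write one result per line to each output file."""
    with open(address_path) as addresses, open(all_path) as records:
        matched, unmatched = split_matches(addresses, records)
    with open(matched_path, "w") as out:
        out.writelines(line + "\n" for line in matched)
    with open(unmatched_path, "w") as out:
        out.writelines(address + "\n" for address in unmatched)
```

Calling `write_results("address.txt", "all.txt")` would produce the two output files. Each address costs one dictionary lookup, so both files are read only once, whereas the grep loop rereads all.txt for every address.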

#7 | 01-25-2011, 01:56 AM
Guru | Registered: Sep 2009 | Location: Perth | Distribution: Manjaro | Posts: 6,328
Well, not a full solution, but a quick way to get the first half would be:
Code:
grep -f address.txt all.txt > matched.txt
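
One caveat, as a sketch and not a tested claim about these particular files: `grep -f` treats each line of address.txt as a regular expression and matches it anywhere in a line, so `.` matches any character and a short address can match inside a longer one. `grep -F` switches to fixed strings but still matches substrings. The Python snippet below only imitates those behaviours with `re.search` and a field comparison; the sample lines and the helper names `regex_anywhere` and `exact_first_field` are made up for illustration.

```python
import re


def regex_anywhere(pattern, line):
    """Imitate an unanchored regular-expression match, as grep does by default."""
    return re.search(pattern, line) is not None


def exact_first_field(address, line):
    """Match only when the first whitespace-separated field equals the address."""
    fields = line.split(None, 1)
    return bool(fields) and fields[0] == address


line = "someone@somewhere.net John Doe"

# A shorter address matches inside a longer one.
print(regex_anywhere("one@somewhere.net", line))                      # True
# "." matches any character, so a different address also matches.
print(regex_anywhere("someone@somewhere.net",
                     "someone@somewhereXnet Jim Roe"))                # True
# Comparing the whole first field avoids both effects.
print(exact_first_field("one@somewhere.net", line))                   # False
print(exact_first_field("someone@somewhere.net", line))               # True
```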

#8 | 01-25-2011, 03:41 AM
Guru | Registered: Sep 2009 | Location: Perth | Distribution: Manjaro | Posts: 6,328
As an addition, if you threw this in a bash script you could do the following:
Code:
#!/bin/bash
grep -f address.txt all.txt > matched.txt
awk 'FNR==NR{arr[$1]++;next}!($1 in arr)' matched.txt address.txt > not_matched.txt
Not tested, and I'm not sure of the performance hit, but I think it should work.
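
For readers unfamiliar with the awk idiom above: `FNR==NR` is true only while the first file (matched.txt) is being read, so its first fields are collected as keys of `arr`; for the second file (address.txt), a line is printed when its first field is not a key in `arr`. A rough Python equivalent, using the hypothetical names `first_field` and `lines_not_in_first`, might look like this:

```python
def first_field(line):
    """Return the first whitespace-separated field, or "" for a blank line."""
    fields = line.split()
    return fields[0] if fields else ""


def lines_not_in_first(first_lines, second_lines):
    """Return lines of the second input whose first field is absent from the first."""
    # Pass 1 (FNR==NR in awk): remember every first field of the first input.
    seen = {first_field(line) for line in first_lines}
    # Pass 2: keep lines of the second input whose key was never seen.
    return [line.rstrip("\n") for line in second_lines
            if first_field(line) not in seen]
```

With the lines of matched.txt as the first argument and the lines of address.txt as the second, the result is the list of addresses that found no match.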

#9 | 01-25-2011, 04:45 AM
Senior Member | Registered: Dec 2004 | Location: Marburg, Germany | Distribution: openSUSE 11.4 | Posts: 1,314
There is also the join utility, often installed as part of the GNU text tools, which matches the lines of two files on a common field (both files need to be sorted on that field). Other useful text tools are presented here: GNU text utilities.
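
To illustrate the idea behind join, here is a small merge-join sketch in Python. It is not how join itself is implemented, only the general technique of walking two sorted inputs in step. It assumes both inputs are already sorted by address and that addresses are unique; the function name `merge_join` and the sample data are hypothetical. On the command line, `join -v 1` is the option that reports unpairable lines from the first file.

```python
def merge_join(addresses, records):
    """Pair sorted addresses with sorted (address, rest) records.

    Returns (joined, unpaired): joined holds (address, rest) tuples,
    unpaired holds addresses that have no record.
    """
    joined, unpaired = [], []
    i = 0
    for address in addresses:
        # Advance past records that sort before the current address.
        while i < len(records) and records[i][0] < address:
            i += 1
        if i < len(records) and records[i][0] == address:
            joined.append((address, records[i][1]))
        else:
            unpaired.append(address)
    return joined, unpaired


addresses = sorted(["b@x.net", "a@x.net", "z@x.net"])
records = sorted([("c@x.net", "Cy Day"), ("a@x.net", "Ann Lee"), ("b@x.net", "Bob Ray")])
joined, unpaired = merge_join(addresses, records)
print(joined)    # [('a@x.net', 'Ann Lee'), ('b@x.net', 'Bob Ray')]
print(unpaired)  # ['z@x.net']
```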