LinuxQuestions.org
Old 01-24-2011, 08:45 PM   #1
LocoMojo
Member
 
Registered: Oct 2004
Distribution: Slackware 12
Posts: 165

Rep: Reputation: 30
Bash - match lines from two files


Hello all,

It's been a very long time since I've mucked around in Bash or Python so I've pretty much forgotten most of it. I've run into a problem I need to resolve at work. I can do it by hand, but it would take me hours upon hours to do it. I'd like to let the computer do the work for me, if at all possible...I'm just not sure how.

You see, I have two text files, "all.txt" and "address.txt". In the "all.txt" file I have email addresses, first name, and last name (approximately 10,000 lines) like so:

someone@somewhere.net John Doe
someoneelse@somewhere.net Jane Doe

In the "address.txt" file, I have email addresses only like so:

someone@somewhere.net
someoneelse@somewhere.net

I need to write a script that reads each line of "address.txt", finds the matching line in "all.txt", and prints the whole line (email address, first name, and last name) to a file called "matched.txt". If a line in "address.txt" has no match in "all.txt", it should be printed to a file called "no-match.txt".

Hope this makes sense.

What is the best way to go about this, speed, resource, and accuracy wise?

I tried a few things in Bash and Python, but it isn't working out well. I'm back to being a newbie again

Any help or advice would be sincerely appreciated!

Thanks.

LocoMojo
 
Old 01-24-2011, 09:00 PM   #2
LocoMojo
Member
 
Registered: Oct 2004
Distribution: Slackware 12
Posts: 165

Original Poster
Rep: Reputation: 30
Never mind

As soon as I posted the OP, it dawned on me.

I'm so embarrassed, I forgot about grep.

Thanks anyway.

LocoMojo
 
Old 01-24-2011, 09:01 PM   #3
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240
Python script

Code:
#!/usr/bin/env python
from collections import defaultdict
h = defaultdict(str)
addr = open("address.txt").read().split()
for line in open("all.txt"):
    s=line.rstrip().split(" ",1)
    h[s[0]] = line

keys = h.keys()
same = set(addr) and set(keys)
diff = set(addr) - set(keys)
match = open("matched.txt","w")
for found in same:
    match.write( h[found] )
match.close()

nomatch = open("no-match.txt","w")
for no in diff:
    nomatch.write(no)
nomatch.close()
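
A note on the script above: in Python, `set(addr) and set(keys)` is a boolean expression, not a set intersection. When `set(addr)` is non-empty it simply evaluates to `set(keys)`, so every line of all.txt ends up in matched.txt. In addition, `nomatch.write(no)` writes no newline, so all unmatched addresses land on one line. Below is a minimal corrected sketch, assuming whitespace-separated fields with the address first and unique addresses in all.txt. The function names `split_matches` and `write_results` are hypothetical and not from the thread.

```python
def split_matches(all_lines, addresses):
    """Return (matched_lines, unmatched_addresses), in address.txt order."""
    # Index every line of all.txt by its first field (the email address).
    by_email = {}
    for line in all_lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            by_email[fields[0]] = line.rstrip("\n")

    matched, unmatched = [], []
    for addr in addresses:
        addr = addr.strip()
        if not addr:
            continue  # ignore blank lines in address.txt
        if addr in by_email:
            matched.append(by_email[addr])
        else:
            unmatched.append(addr)
    return matched, unmatched


def write_results():
    """Read and write the files named in the original question."""
    with open("all.txt") as f:
        all_lines = f.readlines()
    with open("address.txt") as f:
        addresses = f.readlines()
    matched, unmatched = split_matches(all_lines, addresses)
    with open("matched.txt", "w") as out:
        out.writelines(line + "\n" for line in matched)
    with open("no-match.txt", "w") as out:
        out.writelines(addr + "\n" for addr in unmatched)
```

Calling `write_results()` in the directory that holds both files produces the two output files. Because each address is looked up in a dictionary, the work grows with the total number of lines rather than with the product of the two file sizes.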
 
Old 01-24-2011, 09:53 PM   #4
LocoMojo
Member
 
Registered: Oct 2004
Distribution: Slackware 12
Posts: 165

Original Poster
Rep: Reputation: 30
Hello ghostdog74,

I came back because I found that my bash script didn't actually work 100%.

I saw your post and got excited and tried it out. It was so much faster than mine, but unfortunately it didn't work. I tried it with sample files.

all.txt = 3,102 lines
address.txt = 906 lines

After using your script:

matched.txt = 3,102 lines
no-match.txt = 1 line with many addresses (no new lines)

I skimmed over the files and counted at least 25 "no matches" so the matched.txt file should not equal the number of lines in all.txt.

Thanks though!

My bash script was far less elegant, but it almost worked:

Code:
#!/bin/bash

FILE1=address.txt
FILE2=all.txt

while read line; do
  if grep $line $FILE2; then
    echo $line >> matched.txt
  else
    echo $line >> no-matches.txt
  fi
done < $FILE1
With this script I got:

890 matches
15 no matches

A total of 905 out of 906 lines in address.txt ... strange.

I'll have to fiddle more with this. I like your script though; it was much faster and probably lighter on resources, but it was inaccurate.

Thanks again!

LocoMojo
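
A few things in the bash loop above may explain the odd counts. `grep $line` treats the address as an unanchored regular expression, so `.` matches any character and a shorter address also matches inside a longer one. The loop writes only the address, not the full line from all.txt, to matched.txt. And `read` does not run the loop body for a final line that lacks a trailing newline, which is one plausible cause (an assumption, not confirmed in the thread) of 905 results from 906 lines. The first point can be sketched in Python, where `re.search` only roughly mimics grep's matching and the addresses are made up:

```python
import re

# Hypothetical sample data, not taken from the poster's files.
all_lines = [
    "bobsmith@somewhere.net Bob Smith",
    "j_doe@somewhere.net Jay Doe",
]


def regex_hit(address, lines):
    """Roughly mimic `grep address file`: unanchored regex search per line."""
    return any(re.search(address, line) for line in lines)


def exact_hit(address, lines):
    """Compare the address with the first field of each non-blank line."""
    return any(line.split(None, 1)[0] == address for line in lines if line.strip())


# "smith@somewhere.net" is not in the list, but it is a substring of
# "bobsmith@somewhere.net", so the regex search reports a match.
print(regex_hit("smith@somewhere.net", all_lines))  # True (false positive)
print(exact_hit("smith@somewhere.net", all_lines))  # False

# "." matches any character, so "j.doe" also matches "j_doe".
print(regex_hit("j.doe@somewhere.net", all_lines))  # True (false positive)
print(exact_hit("j.doe@somewhere.net", all_lines))  # False
```

In the shell, quoting `"$line"` and using grep's `-F` (fixed strings) and `-q` (quiet) options avoids the regex problem, though a fixed-string search is still a substring search.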
 
Old 01-24-2011, 10:08 PM   #5
LocoMojo
Member
 
Registered: Oct 2004
Distribution: Slackware 12
Posts: 165

Original Poster
Rep: Reputation: 30
In the above post:

"I skimmed over the files and counted at least 25 "no matches" " should have read "I skimmed over the files and counted at least 5 "no matches" ".

It doesn't matter anyway; matches should not exceed 906 (the number of lines being checked against "all.txt", which has 3,102 lines).

LocoMojo
 
Old 01-24-2011, 10:33 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240
Quote:
Originally Posted by LocoMojo
Hello ghostdog74,

I came back because I found that my bash script didn't actually work 100%.

I saw your post and got excited and tried it out. It was so much faster than mine, but unfortunately it didn't work. I tried it with s
You only provided a small sample of the files to work with, and my code does work with that sample.
Why don't you provide more samples of both files? Are they all the same structure? Show your expected output too, if possible. It's much faster than your bash script since yours needs to call grep for EACH line of address.txt, and each call scans all of all.txt (roughly O(n*m) work overall).

Last edited by ghostdog74; 01-24-2011 at 10:35 PM.
 
Old 01-25-2011, 01:56 AM   #7
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,431

Rep: Reputation: 1878
Well not a full solution but a quick way to get the first half would be:
Code:
grep -f address.txt all.txt > matched.txt
 
Old 01-25-2011, 03:41 AM   #8
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,431

Rep: Reputation: 1878
As an addition, if you threw this in a bash script you could do the following:
Code:
#!/bin/bash

grep -f address.txt all.txt > matched.txt

awk 'FNR==NR{arr[$1]++;next}!($1 in arr)' matched.txt address.txt > not_matched.txt
I haven't tested this and I'm not sure of the performance hit, but I think it should work.
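
For readers unfamiliar with the awk idiom above: `FNR==NR` is true only while awk reads the first file (matched.txt), so the first block records every first field it sees, and the trailing condition then prints each line of address.txt whose first field was never recorded. A rough Python rendering follows, as a sketch only; the function name `not_matched` is hypothetical. Note that `grep -f` treats each address as an unanchored pattern, so matched.txt can contain extra lines if one address is a substring of another.

```python
def not_matched(matched_lines, address_lines):
    """Sketch of: awk 'FNR==NR{arr[$1]++;next}!($1 in arr)' matched.txt address.txt"""

    def first_field(line):
        fields = line.split()
        return fields[0] if fields else ""  # awk's $1 is empty on a blank line

    # First pass (FNR==NR in awk): remember the first field of every matched line.
    seen = {first_field(line) for line in matched_lines}

    # Second pass: keep the address lines whose first field was never seen.
    return [line.rstrip("\n") for line in address_lines
            if first_field(line) not in seen]
```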
 
Old 01-25-2011, 04:45 AM   #9
Reuti
Senior Member
 
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 11.4
Posts: 1,319

Rep: Reputation: 252
There is also the join utility, often installed as part of the GNU text tools, which matches the lines of two files on a common field (both files must be sorted on that field). Other useful text tools are presented here: GNU text utilities.
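
To illustrate the idea behind join, here is a minimal Python sketch of a merge over two inputs that are already sorted on the address. It is not join's actual implementation; the function name `merge_join` is hypothetical, and it assumes non-blank lines and unique addresses in the second input.

```python
def merge_join(addresses_sorted, lines_sorted):
    """Pair sorted addresses with lines sorted by their first field.

    Returns (paired_lines, unpaired_addresses).
    """
    paired, unpaired = [], []
    j = 0
    for key in addresses_sorted:
        # Skip lines whose address sorts before the current key.
        while j < len(lines_sorted) and lines_sorted[j].split(None, 1)[0] < key:
            j += 1
        if j < len(lines_sorted) and lines_sorted[j].split(None, 1)[0] == key:
            paired.append(lines_sorted[j])
        else:
            unpaired.append(key)
    return paired, unpaired


# Made-up sample data; both inputs are sorted on the address first.
addresses = sorted(["b@x.net", "a@x.net", "z@x.net"])
lines = sorted(
    ["b@x.net Jane Doe", "a@x.net John Doe", "c@x.net Jim Roe"],
    key=lambda line: line.split(None, 1)[0],
)
paired, unpaired = merge_join(addresses, lines)
print(paired)    # ['a@x.net John Doe', 'b@x.net Jane Doe']
print(unpaired)  # ['z@x.net']
```

Each input is scanned once, so after sorting the matching step is linear in the total number of lines. With join itself, the `-v 1` option prints the lines of the first file that have no partner in the second.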
 
  


All times are GMT -5.
