LinuxQuestions.org
Old 03-07-2012, 02:12 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,057

Rep: Reputation: 284
Select lines based on key match


I want to select lines from a large file based on matching a key value in a smaller file. The key values in the small file are unique; the matching values in the large file are not unique. Both files are sorted.

Sample small file ...
Code:
Cole
Phillips
Sample large file ...
Code:
Bergeron Denise
Bergeron Terrence
Cole Carlton
Cole Donald
Cole Martha
Davis Michelle
Davis Joel
High Alice
High Robert
Phillips Edgar
Phillips Suzanne
Sample output file ...
Code:
Cole Carlton
Cole Donald
Cole Martha
Phillips Edgar
Phillips Suzanne
These samples are representative but the actual files are large so performance is a consideration.

Daniel B. Martin
 
Old 03-07-2012, 02:23 PM   #2
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora, Lubuntu, FreeBSD
Posts: 3,930
Blog Entries: 5

Rep: Reputation: Disabled
grep(1) can read patterns from a file, a la:
Code:
$ grep -f keys.txt people.txt

Last edited by anomie; 03-07-2012 at 02:26 PM.
 
Old 03-07-2012, 03:41 PM   #3
danielbmartin
Quote:
Originally Posted by anomie
Code:
$ grep -f keys.txt people.txt
Thank you, anomie, we are on the right track. It is necessary to match the key strings to the left-most blank-delimited field only. That will make the grep run faster and, more importantly, avoid false matches. If the keys file contains "Martin" I don't want matches on lines in the people file such as "Davidson Martin."

How may we limit the scope of the grep?

Daniel B. Martin
 
Old 03-07-2012, 03:54 PM   #4
anomie
There may be a more efficient means for solving this, but I'd simply put the patterns in the keys.txt file.

Input files:

Code:
$ cat keys.txt 
^Cole\>
^Phillips\>
Code:
$ cat people.txt 
Bergeron Denise
Bergeron Terrence
Cole Carlton
Cole Donald
Cole Martha
Davis Michelle
Davis Joel
High Alice
High Robert
Phillips Edgar
Phillips Suzanne
Jo Cole
Coleen Hsu
Result from grep(1):

Code:
$ grep -f keys.txt people.txt 
Cole Carlton
Cole Donald
Cole Martha
Phillips Edgar
Phillips Suzanne
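
If the key file holds plain names rather than patterns, the anchored patterns can be generated on the fly instead of edited in by hand. This is only a sketch: the sample data and the workdir and anchored variables are made up for illustration, \> is a GNU grep extension, and keys containing regex metacharacters would need escaping first.

```shell
# Hypothetical sample data: keys.txt holds plain surnames, one per line.
workdir=$(mktemp -d)
printf '%s\n' Cole Phillips > "$workdir/keys.txt"
printf '%s\n' 'Cole Carlton' 'Phillips Edgar' 'Jo Cole' 'Coleen Hsu' > "$workdir/people.txt"

# sed wraps each key as ^KEY\> ; grep reads those patterns from stdin (-f -).
anchored=$(sed 's/.*/^&\\>/' "$workdir/keys.txt" | grep -f - "$workdir/people.txt")
printf '%s\n' "$anchored"

rm -rf "$workdir"
```

Only "Cole Carlton" and "Phillips Edgar" should come out; "Jo Cole" fails the ^ anchor and "Coleen Hsu" fails the \> anchor.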
 
Old 03-07-2012, 05:02 PM   #5
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942
How about
Code:
awk -v 'keyfile=path/to/small/file' 'BEGIN {
    while ((getline < keyfile) > 0) key[$1]
    close(keyfile)
  }
  ($1 in key)' 'path/to/large/file'
The BEGIN rule reads the key file first. The first field (word) on each line is saved to the key array as a key, with a null value. (In awk, merely referencing an array element creates it, so read the second line as if it ended with = "".)

The actual rule on the last line reads: if the first field is a key in the key array, print the record. (A rule with no action gets the default { print }, so the action can be omitted.)

Essentially, the above reads the first fields in the small file, then outputs the records (lines) of the large file only if the first field matches one of the ones read from the small file.
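
The same logic is often written without BEGIN, passing both files as arguments. This is a sketch of that common awk idiom, not a replacement for the command above; the sample data and the workdir and selected variables are invented for illustration.

```shell
# Hypothetical sample data for the small (keys) and large (people) files.
workdir=$(mktemp -d)
printf '%s\n' Cole Phillips > "$workdir/keys.txt"
printf '%s\n' 'Bergeron Denise' 'Cole Carlton' 'Cole Donald' \
    'Davidson Martin' 'Phillips Edgar' > "$workdir/people.txt"

# NR == FNR holds only while the first file is being read: store its first
# field as a key and skip to the next record. For the second file, print
# every record whose first field is one of the stored keys.
selected=$(awk 'NR == FNR { key[$1]; next } $1 in key' \
    "$workdir/keys.txt" "$workdir/people.txt")
printf '%s\n' "$selected"

rm -rf "$workdir"
```

One caveat: if the key file is empty, NR == FNR is also true for the large file, so the BEGIN/getline form is the safer of the two in that case.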
 
Old 03-08-2012, 06:40 AM   #6
Reuti
Senior Member
 
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 11.4
Posts: 1,319

Rep: Reputation: 252
Code:
$ join keys.txt people.txt
By default join matches on the first field of each file. Both inputs must be sorted on that field, which they are here.
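
If the inputs might not be sorted in the order join expects, sorting both in the same locale first is a cheap safeguard. A sketch with made-up sample data; the workdir and joined variables and the *.sorted file names are assumptions for illustration.

```shell
# Use one collation order for both sort and join.
export LC_ALL=C

# Hypothetical, deliberately unsorted sample data.
workdir=$(mktemp -d)
printf '%s\n' Phillips Cole > "$workdir/keys.txt"
printf '%s\n' 'Phillips Edgar' 'Cole Donald' 'Bergeron Denise' \
    'Cole Carlton' > "$workdir/people.txt"

# Sort both files on the join field (field 1); -u also drops duplicate keys.
sort -u "$workdir/keys.txt" > "$workdir/keys.sorted"
sort -k1,1 "$workdir/people.txt" > "$workdir/people.sorted"

# join prints only the lines whose first field appears in both files.
joined=$(join "$workdir/keys.sorted" "$workdir/people.sorted")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

Note that join rebuilds each output line from its fields, so runs of blanks in the large file come out as single spaces.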

Last edited by Reuti; 03-08-2012 at 06:41 AM. Reason: Changed file names
 
Old 03-08-2012, 10:52 AM   #7
danielbmartin
Quote:
Originally Posted by anomie
There may be a more efficient means for solving this, but I'd simply put the patterns in the keys.txt file.
Code:
$ cat keys.txt 
^Cole\>
^Phillips\>
The ^ means "starting in column 1" as desired. I don't understand what the \> does for us. I tried this method using only the ^ prefix and it seemed to work.

Daniel B. Martin
 
Old 03-08-2012, 11:11 AM   #8
anomie
Those are regular expression anchors. The meanings are:
  • ^ -- match beginning of line
  • \> -- match end of word

If you do not use the latter, you'll also match names like "Coleman Butler".
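
A quick way to see the difference is to count matches with and without the trailing anchor. The three names below are invented test data, and \> assumes GNU grep.

```shell
# Hypothetical test names: a real match, a longer surname, a first-name match.
names=$(printf '%s\n' 'Cole Martha' 'Coleman Butler' 'Jo Cole')

# ^Cole alone also accepts any line that merely starts with those letters.
start_only=$(printf '%s\n' "$names" | grep -c '^Cole')

# ^Cole\> additionally requires the word to end right after "Cole".
start_and_word_end=$(printf '%s\n' "$names" | grep -c '^Cole\>')

printf 'start only: %s, with word end: %s\n' "$start_only" "$start_and_word_end"
```

The first count includes "Coleman Butler"; the second does not.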

Last edited by anomie; 03-08-2012 at 11:12 AM.
 
Old 03-08-2012, 06:13 PM   #9
danielbmartin
Quote:
Originally Posted by anomie
\> -- match end of word
Thank you, this is something I haven't seen before.

I try to learn from tutorials and Google searches. Even knowing \> I found no mention of it anywhere. Help me to help myself -- where could I have found this on my own?

Daniel B. Martin
 
Old 03-08-2012, 08:26 PM   #10
anomie
In this case, you can view the manpages for grep(1):
Code:
Anchoring
   The caret ^ and the dollar sign $ are meta-characters that respectively
   match the empty string at the beginning and end of a line.

The Backslash Character and Special Expressions
   The  symbols  \<  and  \>  respectively  match  the empty string at the
   beginning and end of a word...

Last edited by anomie; 03-08-2012 at 08:31 PM. Reason: changed to manpage.
 
Old 03-09-2012, 11:03 AM   #11
danielbmartin
Quote:
Originally Posted by anomie
In this case, you can view the manpages for grep(1) ...
Yes, found it, thank you. As I climb the Linux learning curve, it becomes apparent that mastering grep and sed requires a proficiency in writing Regular Expressions.

Daniel B. Martin
 
  

