I want to select lines from a large file based on matching a key value in a smaller file. The key values in the small file are unique; the matching values in the large file are not unique. Both files are sorted.
Sample small file ...
Code:
Cole
Phillips
Sample large file ...
Code:
Bergeron Denise
Bergeron Terrence
Cole Carlton
Cole Donald
Cole Martha
Davis Michelle
Davis Joel
High Alice
High Robert
Phillips Edgar
Phillips Suzanne
Sample output file ...
Code:
Cole Carlton
Cole Donald
Cole Martha
Phillips Edgar
Phillips Suzanne
These samples are representative, but the actual files are large, so performance is a consideration.
Thank you, anomie, we are on the right track. It is necessary to match the key strings to the left-most blank-delimited field only. That will make the grep run faster and, more importantly, avoid false matches. If the keys file contains "Martin" I don't want matches on lines in the people file such as "Davidson Martin."
There may be a more efficient means for solving this, but I'd simply put the patterns in the keys.txt file.
Input files:
Code:
$ cat keys.txt
^Cole\>
^Phillips\>
Code:
$ cat people.txt
Bergeron Denise
Bergeron Terrence
Cole Carlton
Cole Donald
Cole Martha
Davis Michelle
Davis Joel
High Alice
High Robert
Phillips Edgar
Phillips Suzanne
Jo Cole
Coleen Hsu
Result from grep(1):
Code:
$ grep -f keys.txt people.txt
Cole Carlton
Cole Donald
Cole Martha
Phillips Edgar
Phillips Suzanne
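If the keys file holds plain names rather than patterns, the anchored patterns can be generated instead of typed by hand. The following is a minimal sketch, not something posted in the thread: the scratch directory and the file name plain_keys.txt are hypothetical, it assumes GNU grep and sed, and it assumes the keys contain no regular-expression metacharacters.

```shell
# Hypothetical scratch files; the names are for illustration only.
workdir=$(mktemp -d)
printf '%s\n' Cole Phillips > "$workdir/plain_keys.txt"
printf '%s\n' 'Bergeron Denise' 'Cole Carlton' 'Phillips Edgar' \
    'Jo Cole' 'Coleen Hsu' > "$workdir/people.txt"

# Wrap every key as ^key\> so it matches only a whole word at the start of a line.
# Assumption: the keys contain no regex metacharacters.
sed 's/.*/^&\\>/' "$workdir/plain_keys.txt" > "$workdir/keys.txt"

# Use the generated patterns exactly as in the grep -f example above.
matches=$(grep -f "$workdir/keys.txt" "$workdir/people.txt")
printf '%s\n' "$matches"
```

Lines such as "Jo Cole" and "Coleen Hsu" are skipped because the key has to be a complete word in column 1.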
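Code:
awk -v 'keyfile=path/to/small/file' 'BEGIN {
    # Store the first field of every key-file line as an array index
    while ((getline < keyfile) > 0) key[$1]
    close(keyfile)
}
# Print each large-file record whose first field is a stored key
($1 in key)' 'path/to/large/file'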
The BEGIN rule reads the key file first. The first field (word) of each line is stored in the key array as an index, with an empty value. (In awk, merely referencing an array element creates it, so key[$1] behaves as if an empty string had been assigned to it.)
The actual rule on the last line reads: if the first field is an index in the key array, print the record. (A rule with no action defaults to { print }, so the action can be omitted.)
Essentially, the above reads the first fields in the small file, then outputs the records (lines) of the large file only if the first field matches one of the ones read from the small file.
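Because both files are described as sorted, join(1) may be worth trying as well. This is a swapped-in alternative, not something suggested in the thread. The sketch below assumes both files are sorted on the first field in the C locale, uses hypothetical scratch file names, and reuses the sample data from the question.

```shell
# Hypothetical scratch files holding the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' Cole Phillips > "$workdir/small.txt"
printf '%s\n' 'Bergeron Denise' 'Bergeron Terrence' 'Cole Carlton' \
    'Cole Donald' 'Cole Martha' 'Davis Michelle' 'Davis Joel' \
    'High Alice' 'High Robert' 'Phillips Edgar' 'Phillips Suzanne' \
    > "$workdir/large.txt"

# join pairs lines whose first blank-delimited fields are equal.
# Assumption: both inputs are sorted on that field in the C locale.
joined=$(LC_ALL=C join "$workdir/small.txt" "$workdir/large.txt")
printf '%s\n' "$joined"
```

join reads both files in a single merge pass and compares whole fields, so a key such as "Cole" cannot match "Coleen". Note that it rewrites each output line with single spaces between fields, which matters if the original spacing must be preserved.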
There may be a more efficient means for solving this, but I'd simply put the patterns in the keys.txt file.
Code:
$ cat keys.txt
^Cole\>
^Phillips\>
The ^ means "starting in column 1" as desired. I don't understand what the \> does for us. I tried this method using only the ^ prefix and it seemed to work.
Thank you, this is something I haven't seen before.
I try to learn from tutorials and Google searches. Even knowing \> I found no mention of it anywhere. Help me to help myself -- where could I have found this on my own?
In this case, you can view the manpages for grep(1):
Code:
Anchoring
The caret ^ and the dollar sign $ are meta-characters that respectively
match the empty string at the beginning and end of a line.
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the
beginning and end of a word...
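To see what \> adds on top of ^, compare the two patterns against a line whose first word merely begins with the key. This is a small sketch assuming GNU grep; the scratch file is hypothetical and its contents are taken from the people.txt sample earlier in the thread.

```shell
# Hypothetical scratch file with three lines from the people.txt sample.
workdir=$(mktemp -d)
printf '%s\n' 'Cole Carlton' 'Coleen Hsu' 'Jo Cole' > "$workdir/people.txt"

# ^ alone anchors to column 1 but still accepts longer words such as Coleen.
caret_only=$(grep '^Cole' "$workdir/people.txt")

# Adding \> requires the word to end right after the key.
caret_and_word_end=$(grep '^Cole\>' "$workdir/people.txt")

printf 'caret only:\n%s\n' "$caret_only"
printf 'caret plus word end:\n%s\n' "$caret_and_word_end"
```

With only the ^ prefix, the test appears to work on data that happens to contain no longer surnames sharing the same prefix, which is why the difference is easy to miss.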
Last edited by anomie; 03-08-2012 at 08:31 PM.
Reason: changed to manpage.
In this case, you can view the manpages for grep(1) ...
Yes, found it, thank you. As I climb the Linux learning curve, it becomes apparent that mastering grep and sed requires proficiency in writing regular expressions.