LinuxQuestions.org

danielbmartin 03-07-2012 02:12 PM

Select lines based on key match
 
I want to select lines from a large file based on matching a key value in a smaller file. The key values in the small file are unique; the matching values in the large file are not unique. Both files are sorted.

Sample small file ...
Code:

Cole
Phillips

Sample large file ...
Code:

Bergeron Denise
Bergeron Terrence
Cole Carlton
Cole Donald
Cole Martha
Davis Michelle
Davis Joel
High Alice
High Robert
Phillips Edgar
Phillips Suzanne

Sample output file ...
Code:

Cole Carlton
Cole Donald
Cole Martha
Phillips Edgar
Phillips Suzanne

These samples are representative but the actual files are large so performance is a consideration.

Daniel B. Martin

anomie 03-07-2012 02:23 PM

grep(1) can read patterns from a file, a la:
Code:

$ grep -f keys.txt people.txt

danielbmartin 03-07-2012 03:41 PM

Quote:

Originally Posted by anomie (Post 4621157)
Code:

$ grep -f keys.txt people.txt

Thank you, anomie, we are on the right track. It is necessary to match the key strings to the left-most blank-delimited field only. That will make the grep run faster and, more importantly, avoid false matches. If the keys file contains "Martin" I don't want matches on lines in the people file such as "Davidson Martin."
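
To make the false-match concern concrete, here is a minimal sketch. The three names in people.txt are made up for illustration; only the "Davidson Martin" case comes from the post above. With a bare key, grep matches the string anywhere on the line:

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Martin' > "$workdir/keys.txt"
printf '%s\n' 'Davidson Martin' 'Martin Daniel' 'Martinez Ana' > "$workdir/people.txt"

# Unanchored patterns match anywhere on the line, so all three lines are selected.
unanchored=$(grep -f "$workdir/keys.txt" "$workdir/people.txt")
printf '%s\n' "$unanchored"

rm -rf "$workdir"
```

Only "Martin Daniel" has the key as its first field; the other two lines are the unwanted matches.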

How may we limit the scope of the grep?

Daniel B. Martin

anomie 03-07-2012 03:54 PM

There may be a more efficient means for solving this, but I'd simply put the patterns in the keys.txt file.

Input files:

Code:

$ cat keys.txt
^Cole\>
^Phillips\>

Code:

$ cat people.txt
Bergeron Denise
Bergeron Terrence
Cole Carlton
Cole Donald
Cole Martha
Davis Michelle
Davis Joel
High Alice
High Robert
Phillips Edgar
Phillips Suzanne
Jo Cole
Coleen Hsu

Result from grep(1):

Code:

$ grep -f keys.txt people.txt
Cole Carlton
Cole Donald
Cole Martha
Phillips Edgar
Phillips Suzanne
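
If the keys file holds plain names rather than patterns, the anchors could be added on the fly instead of by hand. This is only a sketch: the patterns.txt name is an assumption, the keys are assumed to contain no regular-expression metacharacters, and \> is not specified by POSIX (it works in GNU grep).

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Cole' 'Phillips' > "$workdir/keys.txt"
printf '%s\n' 'Cole Carlton' 'Phillips Edgar' 'Jo Cole' 'Coleen Hsu' > "$workdir/people.txt"

# Wrap each key as ^key\> ; in the replacement, & stands for the matched line
# and \\ produces one literal backslash.
sed 's/.*/^&\\>/' "$workdir/keys.txt" > "$workdir/patterns.txt"

selected=$(grep -f "$workdir/patterns.txt" "$workdir/people.txt")
printf '%s\n' "$selected"

rm -rf "$workdir"
```

This keeps keys.txt usable by other tools that expect plain key values.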


Nominal Animal 03-07-2012 05:02 PM

How about
Code:

awk -v 'keyfile=path/to/small/file' 'BEGIN {
    while ((getline < keyfile) > 0) key[$1]
    close(keyfile)
  }
  ($1 in key)' 'path/to/large/file'

The BEGIN rule reads the key file first. The first field (word) on each line is saved to the key array as an index, with an empty value. (In awk, merely referencing an array element creates it, so the second line behaves as if it ended with an assignment of an empty value.)

The actual rule on the last line reads: If first field matches a key in key array, then print the record. (You can omit the implicit { print } for the last rule.)

Essentially, the above reads the first fields in the small file, then outputs the records (lines) of the large file only if the first field matches one of the ones read from the small file.
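
For a quick trial, the same idea can be written with the common two-file NR == FNR idiom instead of getline. This is a variant sketch on made-up sample data, not the exact command above. Because the first field is compared as a whole string, "Coleen Hsu" and "Jo Cole" are not selected.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Cole' 'Phillips' > "$workdir/keys.txt"
printf '%s\n' 'Bergeron Denise' 'Cole Carlton' 'Coleen Hsu' 'Jo Cole' 'Phillips Edgar' > "$workdir/people.txt"

# While reading the first file (NR == FNR), remember each first field as an index.
# For the second file, print records whose first field is a remembered index.
# Caveat: if keys.txt is empty, NR == FNR also holds for the second file.
awk_selected=$(awk 'NR == FNR { key[$1]; next } ($1 in key)' \
    "$workdir/keys.txt" "$workdir/people.txt")
printf '%s\n' "$awk_selected"

rm -rf "$workdir"
```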

Reuti 03-08-2012 06:40 AM

Code:

$ join keys.txt people.txt
By default it will match on the first column.
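
A small sketch of join on made-up sample data follows. join expects both inputs to be sorted on the join field, which the original question says is the case; LC_ALL=C is set here only so that the sort order join expects is predictable.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory, sorted on the first field.
workdir=$(mktemp -d)
printf '%s\n' 'Cole' 'Phillips' > "$workdir/keys.txt"
printf '%s\n' 'Bergeron Denise' 'Cole Carlton' 'Cole Donald' 'Davis Joel' 'Phillips Edgar' > "$workdir/people.txt"

# Join on field 1 of both files (the default); unmatched lines are dropped.
joined=$(LC_ALL=C join "$workdir/keys.txt" "$workdir/people.txt")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

Note that join rebuilds each output line from its fields, separated by single spaces, so runs of blanks in the input are not preserved.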

danielbmartin 03-08-2012 10:52 AM

Quote:

Originally Posted by anomie (Post 4621230)
There may be a more efficient means for solving this, but I'd simply put the patterns in the keys.txt file.

Code:

$ cat keys.txt
^Cole\>
^Phillips\>

The ^ means "starting in column 1" as desired. I don't understand what the \> does for us. I tried this method using only the ^ prefix and it seemed to work.

Daniel B. Martin

anomie 03-08-2012 11:11 AM

Those are regular expression anchors. The meanings are:
  • ^ -- match beginning of line
  • \> -- match end of word

If you do not use the latter, you'll also match names like "Coleman Butler".
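
The difference can be checked with a small sketch (sample names are made up, apart from "Coleman Butler"). It assumes GNU grep, where word characters are letters, digits, and the underscore.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Cole Carlton' 'Coleman Butler' 'Cole-Smith Jane' > "$workdir/people.txt"

# Only the line-start anchor: every line beginning with "Cole" matches.
prefix_only=$(grep '^Cole' "$workdir/people.txt")

# Line-start anchor plus end-of-word: "Coleman" no longer matches, but
# "Cole-Smith" still does, because "-" is not a word character.
word_end=$(grep '^Cole\>' "$workdir/people.txt")

printf 'prefix only:\n%s\n\nwith \\>:\n%s\n' "$prefix_only" "$word_end"

rm -rf "$workdir"
```

If the key must equal the entire blank-delimited first field, comparing $1 in awk, as in Nominal Animal's post, avoids the hyphen case.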

danielbmartin 03-08-2012 06:13 PM

Quote:

Originally Posted by anomie (Post 4621987)
  • \> -- match end of word

Thank you, this is something I haven't seen before.

I try to learn from tutorials and Google searches. Even knowing \> I found no mention of it anywhere. Help me to help myself -- where could I have found this on my own?

Daniel B. Martin

anomie 03-08-2012 08:26 PM

In this case, you can view the manpages for grep(1):
Code:

Anchoring
  The caret ^ and the dollar sign $ are meta-characters that respectively
  match the empty string at the beginning and end of a line.

The Backslash Character and Special Expressions
  The  symbols  \<  and  \>  respectively  match  the empty string at the
  beginning and end of a word...


danielbmartin 03-09-2012 11:03 AM

Quote:

Originally Posted by anomie (Post 4622360)
In this case, you can view the manpages for grep(1) ...

Yes, found it, thank you. As I climb the Linux learning curve, it becomes apparent that mastering grep and sed requires a proficiency in writing Regular Expressions.

Daniel B. Martin


All times are GMT -5.