Keep duplicates based on first word only
I have a large file and want to keep the lines which are duplicates, but the test for duplicates is performed only on the first blank-delimited word.

Sample input file:
Code:
ALBERT 54
BENJAMIN 37
BILL 24
BILL 25
BILL 77
CARL 40
CARL 44
CHESTER 59
DAVID 23
DAVID 23
DAVID 28
DAVID 61
EDGAR 33
EDWARD 54
EDWARD 59
EDWIN 30
Desired output file:
Code:
BILL 24
BILL 25
BILL 77
CARL 40
CARL 44
DAVID 23
DAVID 23
DAVID 28
DAVID 61
EDWARD 54
EDWARD 59
I'm a newbie and still learning the basics, so please:
- no awk
- no bash
- no Perl
Let's stick to commands such as uniq, sort, sed, grep, cut, paste, join, etc.
Sounds a lot like homework.
Is this homework? Have you looked at the man page of 'sort'?
This looks like homework. You can do it with 4 of the commands you listed. Show what you tried, and we can supply hints.
The uniq command can print all the duplicate lines and discard the rest (-D). Moreover it has an option to skip the first N fields when comparing (-f N). What would be useful is an option to compare only the first N fields. Since that option is absent, here is a workaround: reverse each line so that the number, not the name, becomes the field that gets skipped:
Code:
rev file | uniq -f1 -D | rev
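For example, assuming the sample data above is saved as file, the pipeline produces exactly the desired output:
Code:
$ rev file | uniq -f1 -D | rev
BILL 24
BILL 25
BILL 77
CARL 40
CARL 44
DAVID 23
DAVID 23
DAVID 28
DAVID 61
EDWARD 54
EDWARD 59
Two caveats: rev reverses each line character by character, so after the reversal uniq -f1 skips what was originally the last field; the trick therefore assumes every line has exactly two fields. Also, uniq only compares adjacent lines, so the file must already be grouped by name, as the sample is.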
Another thing to try is to use cut & uniq to return the duplicated names, then grep for those names in the original list (uniq -d also only spots adjacent duplicates, so this again relies on the file being grouped by name):
Code:
grep -f <(cut -f1 -d' ' file | uniq -d) file

The <( ... ) construct is process substitution; I'll often use it when a command needs sorted input. Such as:
Code:
comm -23 <(sort list1) <(sort list2)
which prints the lines that appear only in list1.
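One caveat with the grep approach: grep -f matches each name anywhere on the line, so a name that also happens to occur in another column would match as well. A sketch of a stricter variant, still using only cut, uniq, sed and grep, anchors each pattern to the start of the line and requires a trailing blank:
Code:
grep -f <(cut -f1 -d' ' file | uniq -d | sed 's/.*/^& /') file
Here sed 's/.*/^& /' turns each duplicated name, e.g. BILL, into the pattern ^BILL (with a trailing space), so only lines whose first word is exactly that name survive. Note that process substitution is a bash/ksh/zsh feature; with a strictly POSIX shell, write the pattern list to a temporary file and pass that to grep -f instead.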