Keep duplicates based on first word only
I have a large file and want to keep the lines which are duplicates, but the test for duplicates is performed only on the first blank-delimited word.

Sample input file:
Code:
ALBERT 54
BENJAMIN 37
BILL 24
BILL 25
BILL 77
CARL 40
CARL 44
CHESTER 59
DAVID 23
DAVID 23
DAVID 28
DAVID 61
EDGAR 33
EDWARD 54
EDWARD 59
EDWIN 30
Desired output file:
Code:
BILL 24
BILL 25
BILL 77
CARL 40
CARL 44
DAVID 23
DAVID 23
DAVID 28
DAVID 61
EDWARD 54
EDWARD 59
I'm a newbie and still learning the basics, so please:
- no awk
- no bash
- no Perl
Let's stick to commands such as uniq, sort, sed, grep, cut, paste, join, etc.
Sounds a lot like homework.
Is this homework? Have you looked at the man page of 'sort'?
This looks like homework. You can do it with 4 of the commands you listed. Show what you tried, and we can supply hints.
The uniq command can print all the duplicate lines and discard the rest (-D). Moreover it has an option to skip the first N fields when comparing (-f N). What would be useful is an option to compare only the first N fields. Since that option is absent, here is a workaround: reverse each line so that the number, not the name, becomes the field that gets skipped:
Code:
rev file | uniq -f1 -D | rev
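For example, assuming the sample data above is saved as file, the pipeline produces exactly the desired output:
Code:
$ rev file | uniq -f1 -D | rev
BILL 24
BILL 25
BILL 77
CARL 40
CARL 44
DAVID 23
DAVID 23
DAVID 28
DAVID 61
EDWARD 54
EDWARD 59
Two caveats: rev reverses each line character by character, so after the reversal uniq -f1 skips what was originally the last field; the trick therefore assumes every line has exactly two fields. Also, uniq only compares adjacent lines, so the file must already be grouped by name, as the sample is.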
Another thing to try is to use cut & uniq to return the duplicated names, then grep for those names in the original list (uniq -d also only spots adjacent duplicates, so this again relies on the file being grouped by name):
Code:
grep -f <(cut -f1 -d' ' file | uniq -d) file

The <( ... ) construct is process substitution; I'll often use it when a command needs sorted input. Such as:
Code:
comm -23 <(sort list1) <(sort list2)
which prints the lines that appear only in list1.
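One caveat with the grep approach: grep -f matches each name anywhere on the line, so a name that also happens to occur in another column would match as well. A sketch of a stricter variant, still using only cut, uniq, sed and grep, anchors each pattern to the start of the line and requires a trailing blank:
Code:
grep -f <(cut -f1 -d' ' file | uniq -d | sed 's/.*/^& /') file
Here sed 's/.*/^& /' turns each duplicated name, e.g. BILL, into the pattern ^BILL (with a trailing space), so only lines whose first word is exactly that name survive. Note that process substitution is a bash/ksh/zsh feature; with a strictly POSIX shell, write the pattern list to a temporary file and pass that to grep -f instead.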