[SOLVED] Is there a program that removes non-adjacent duplicate lines?

RandomTroll · 11-15-2022, 07:09 AM

I have a file of lines that I want to keep in a non-sorted (as far as any Unix app can tell) order but remove duplicates.

boughtonp · 11-15-2022, 07:25 AM

Yes, there are lots of ways: https://duckduckgo.com/?q=linux+remove+duplicate+lines+without+sorting

slacker_et · 11-15-2022, 07:33 AM

Sounds like you are looking for a "uniq" command that does not require the source file to be sorted and is not interactive.
I do not think there is such a command.
However, in the past I dabbled with using this Windows-based program running under Wine: WinMerge

--ET

allend · 11-15-2022, 07:35 AM

Code:

awk '++dups[$0] == 1' filename

although the reverse logic in @boughtonp's link is cute

Turbocapitalist · 11-15-2022, 07:36 AM

You could do it with AWK and an associative array keyed on the contents of each line. I'm not sure how well that would scale though. How large a text file are you considering?

MadeInGermany · 11-15-2022, 07:38 AM

Most resources suggest the ultra-short

Code:

awk '!x[$0]++'

More explicit is

Code:

awk '!($0 in x){x[$0]; print}'

You can append a file name, otherwise it reads stdin.
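
To see why the order survives: the array counts how often each line has been seen, and a line is printed only on its first appearance, so nothing is ever sorted. Here is a small sketch with made-up sample input; the function name dedup and the array name seen are only illustrative choices, not anything standard. Memory use grows with the number of distinct lines, since every one of them is kept as an array key.

```shell
# dedup is a hypothetical wrapper name; the awk program is the one-liner above
# with the array renamed to "seen" for readability.
dedup() {
    # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
    # is true exactly once per distinct line, and awk's default action prints it.
    awk '!seen[$0]++' "$@"
}

# Made-up sample input with non-adjacent duplicates.
printf '%s\n' pear apple pear fig apple | dedup
# prints pear, apple, fig (one per line, in first-seen order)
```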

sundialsvcs · 11-15-2022, 11:46 AM

While I will not now write the "necessary one-liner" for you, the algorithm essentially is this:

• Produce a version of the input file which contains a "record number" field to the left of the record's contents.
• Sort the resulting file by the second field: the actual content.
• Now that the "duplicate values" are adjacent, considering only the second field, remove all but the first occurrence. This is trivial now, because you need only consider the current value against its immediate predecessor.
• Re-sort the resulting file by the first ("record number") field.
• Remove the "record number" field to produce the final result.
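
For reference, the steps above can be sketched as a pipeline of standard tools. This is only one possible rendering, under some assumptions: the function name dedup_by_sort is made up, a tab is used as the separator between the record number and the content, and LC_ALL=C is set so that sort compares bytes rather than using locale collation. The record number is also used as a tie-breaker in the first sort so that the earliest copy of each line comes first.

```shell
# dedup_by_sort is a hypothetical helper name, not an existing command.
dedup_by_sort() {
    # 1. Prefix each line with its record number and a tab.
    awk -v OFS='\t' '{ print NR, $0 }' "$@" |
        # 2. Sort by content (field 2 to end of line), then by record number.
        LC_ALL=C sort -t "$(printf '\t')" -k2 -k1,1n |
        # 3. Keep a line only if its content differs from the previous one.
        awk '{ content = substr($0, index($0, "\t") + 1) }
             NR == 1 || content != prev { print }
             { prev = content }' |
        # 4. Restore the original order by record number.
        LC_ALL=C sort -t "$(printf '\t')" -k1,1n |
        # 5. Drop the record number field.
        cut -f2-
}

# Made-up sample input.
printf '%s\n' b a b c a | dedup_by_sort
# prints b, a, c (one per line)
```

Unlike the AWK array approach, this does not need to hold every distinct line in memory at once, because sort can spill to temporary files on large inputs.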

Fifty years ago, they did this with magnetic tapes. It may well be that they did it earlier using punched cards.

Keith Hedger · 11-15-2022, 02:38 PM

If you don't mind the file being sorted, use

Code:

sort -u /path/to/file