LinuxQuestions.org
Forums > Linux Forums > Linux - General
Old 08-11-2015, 06:41 AM   #1
crowzie
LQ Newbie
 
Registered: Jul 2011
Posts: 16

Rep: Reputation: Disabled
text file manipulation


Hi,
I have a wordlist text file with lots of duplicate entries.
Is there a command to delete all duplicates, along with the entries that had duplicates, so that I am left only with entries that did not have duplicates in the first place?
 
Old 08-11-2015, 06:52 AM   #2
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Not a single command, but a couple of commands used together and then combined in a script or program.

Have you attempted any of this on your own? Please do so and when you get stuck, post your attempt and describe where it fell short.

I'm sure there are other ways, but I'm thinking that I'd read the file line by line, use grep to determine if there were duplicate lines, and then use sed to alter the file.
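The line-by-line idea described above could be sketched like this (a minimal sketch, assuming the list lives in a hypothetical file called wordlist.txt with one word per line; it filters rather than edits in place, so sed turns out not to be needed):

```shell
#!/bin/sh
# Print only the lines that occur exactly once in wordlist.txt
# (hypothetical filename; one word per line assumed).
while IFS= read -r word; do
    # grep -c counts matching lines; -F = fixed string, -x = whole-line match
    if [ "$(grep -cFx -- "$word" wordlist.txt)" -eq 1 ]; then
        printf '%s\n' "$word"
    fi
done < wordlist.txt
```

Note that this re-scans the file once per line, so it is slow on large lists; the uniq-based approaches later in the thread scale much better.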
 
Old 08-11-2015, 07:02 AM   #3
crowzie
LQ Newbie
 
Registered: Jul 2011
Posts: 16

Original Poster
Rep: Reputation: Disabled
Yes, I can remove duplicates easily the way you described, but I'm still left with the entries that had duplicates.
 
Old 08-11-2015, 07:04 AM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Incorrect. sed would remove all occurrences once you identified the entry targeted for removal.
 
Old 08-11-2015, 07:15 AM   #5
crowzie
LQ Newbie
 
Registered: Jul 2011
Posts: 16

Original Poster
Rep: Reputation: Disabled
There are one-liners to sort text files, but I can't think of how to remove the entries that had duplicates.
 
Old 08-11-2015, 07:21 AM   #6
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Post your attempts and people will give you some assistance. We're not here to do the task for you, that is not how LQ works. Please review the FAQ if you need to understand LQ better.
 
Old 08-11-2015, 07:25 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Have you already checked the uniq command (see its man page)?
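For reference, uniq's -u flag does exactly what the question asks in one pipeline (a sketch assuming the list is in a file named wordlist.txt, and that the original line order does not matter, since sort reorders the list):

```shell
# -u prints only the lines that are NOT repeated in the (sorted) input,
# i.e. entries that never had a duplicate.
sort wordlist.txt | uniq -u
```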

Last edited by pan64; 08-11-2015 at 01:43 PM. Reason: just use bold to highlight
 
2 members found this post helpful.
Old 08-11-2015, 02:54 PM   #8
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,219

Rep: Reputation: 5309
I wrote a Python implementation before realizing that this was in the Linux - General forum and not the Programming forum. Anyway...

Code:
#!/usr/bin/env python

import collections
import fileinput


def main():
    word_counts = collections.defaultdict(int)
    words = [word.strip() for word in fileinput.input()]
    for word in words:
        word_counts[word] += 1
    words = [word for word in words if word_counts[word] == 1]
    for word in words:
        print(word)


if __name__ == '__main__':
    main()
You can run this script with the wordlist file as its argument, or you can cat the wordlist file and pipe it into this script.

Last edited by dugan; 08-11-2015 at 02:57 PM.
 
Old 08-12-2015, 07:23 AM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
@dugan, is that guaranteed to print in input order? Just curiosity on my part; I only briefly dabbled in Python years ago.
I did something similar in awk, but associative arrays in awk aren't necessarily kept in order; I had to take extra steps to preserve input order for the print.
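An awk version along the lines mentioned above can preserve input order by reading the file twice (a sketch, assuming the list is in a hypothetical file wordlist.txt):

```shell
# First pass (NR==FNR) counts every line; second pass prints the lines
# seen exactly once, in their original input order.
awk 'NR == FNR { count[$0]++; next } count[$0] == 1' wordlist.txt wordlist.txt
```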
 
Old 08-12-2015, 07:29 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Do you want to keep the original order? Why didn't you say so? In that case, which of the duplicated lines do you want to keep (first, last, none, ...)?
 
Old 08-12-2015, 07:31 AM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
The OP didn't want to keep any of the duplicates - specifically asked for the non-duplicate lines.
 
Old 08-12-2015, 07:42 AM   #12
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Code:
# ok, so:
sort wordlist | uniq -d > filename
# will collect all the duplicates
grep -vFxf filename wordlist
# will drop them
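Put together with concrete filenames (an example assuming the list is in wordlist.txt; dups.txt is just a scratch file), the two steps might look like:

```shell
# Collect every line that appears more than once in the sorted copy.
sort wordlist.txt | uniq -d > dups.txt
# Drop those lines from the original file, preserving its order:
# -v inverts, -F takes fixed strings, -x matches whole lines,
# -f reads the patterns from dups.txt.
grep -vFxf dups.txt wordlist.txt
```

Unlike plain `sort | uniq -u`, this variant keeps the surviving lines in their original input order.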

Last edited by pan64; 08-12-2015 at 07:44 AM.
 
Old 08-12-2015, 07:43 AM   #13
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by syg00 View Post
The OP didn't want to keep any of the duplicates - specifically asked for the non-duplicate lines.
Quote:
Originally Posted by pan64 View Post
Do you want to keep the original order? Why didn't you tell that? In that case which one do you want to keep (first, last, none, ...) of duplicated lines.
These are a couple of reasons why I say it's best for the OP to post some of their attempts, or at least to be given time and a chance to respond. They may have stated their problem poorly or not thought it through well enough, so the definition of what they want may end up changing.

Instead of spinning off and solving a multitude of possible interpretations of this problem for the OP, responders should get clarification and wait for the OP to answer any follow-up questions. There are obviously several possible outcomes here: the OP could solve it themselves and never update the thread; the OP could decide not to put forth the effort and never return; the OP could get confused and frustrated because the varied interpretations don't match their problem; or a response such as pan64's suggestion of the uniq command could be entirely sufficient, and the OP simply never marks the thread as solved or thanks the responder.
 
Old 08-12-2015, 07:51 AM   #14
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
True enough.
It needs to be noted that uniq only works as expected on sorted data. Where I come from, I am not at liberty to mangle user data except when told to, and sorting when not requested is mangling.
 
Old 08-12-2015, 08:02 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Quote:
Originally Posted by syg00 View Post
True enough.
It needs to be noted that uniq only works as expected on sorted data.
That's why I wrote "see the man page": that fact is noted there.
Quote:
Originally Posted by syg00 View Post
Where I come from, I am not at liberty to mangle user data except when told. Sorting when not requested is mangling.
It was not specified by the OP, so sorting is probably not an issue (if only the list is required).

Also missing: is there only one word per line, or can there be more?
 