Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux related and doesn't seem to fit in any other forum, then this is the place.
Hi,
I have a wordlist text file with lots of duplicate entries.
Is there a command to delete all duplicates, along with the entries that had duplicates, so that I'm only left with entries that did not have duplicates in the first place?
Not a single command, but a couple of commands used together, possibly combined in a script or program.
Have you attempted any of this on your own? Please do so and when you get stuck, post your attempt and describe where it fell short.
I'm sure there are other ways, but I'm thinking that I'd read the file line by line, use grep to determine if there were duplicate lines, and then use sed to alter the file.
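A minimal sketch of that line-by-line idea, assuming the wordlist has one entry per line (the filename `wordlist.txt` is just a placeholder):

```shell
#!/bin/sh
# For each line, count how many times it appears in the whole file;
# print it only if it appears exactly once. Quadratic and slow on big
# files, but it mirrors the grep-per-line approach described above.
#   -c : count matching lines
#   -x : match the whole line
#   -F : treat the pattern as a fixed string, not a regex
while IFS= read -r line; do
    count=$(grep -cxF -- "$line" wordlist.txt)
    [ "$count" -eq 1 ] && printf '%s\n' "$line"
done < wordlist.txt
```

Rather than editing the file in place with sed, this just prints the surviving lines; redirect the output to a new file to keep the original data untouched.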
Post your attempts and people will give you some assistance. We're not here to do the task for you, that is not how LQ works. Please review the FAQ if you need to understand LQ better.
I wrote a Python implementation before realizing that this was in the Linux - General forum and not the Programming forum. Anyway...
Code:
#!/usr/bin/env python3
import collections
import fileinput

def main():
    # Count how many times each word occurs.
    word_counts = collections.defaultdict(int)
    words = [word.strip() for word in fileinput.input()]
    for word in words:
        word_counts[word] += 1
    # Keep only the words that occurred exactly once, in input order.
    words = [word for word in words if word_counts[word] == 1]
    for word in words:
        print(word)

if __name__ == '__main__':
    main()
You can run this script with the wordlist file as its argument, or you can cat the wordlist file and pipe it into this script.
@dugan, is that guaranteed to print in the order of input? Just asking out of interest - I only briefly dabbled in Python years ago.
I did similar in awk, but associative arrays in awk aren't necessarily in order - I had to take extra efforts to ensure input order for the print.
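One way to get input order out of awk without relying on array ordering is to read the file twice (a two-pass sketch; `wordlist.txt` is a placeholder name):

```shell
# First pass (NR==FNR): count every line. Second pass: print only the
# lines whose count is 1, in their original input order - no sorting.
awk 'NR==FNR { count[$0]++; next } count[$0] == 1' wordlist.txt wordlist.txt
```

Because the second pass walks the file in its natural order and the associative array is only consulted, never iterated, the output order is exactly the input order.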
Do you want to keep the original order? If so, you didn't say. And which of the duplicated lines do you want to keep (first, last, none, ...)?
The OP didn't want to keep any of the duplicates - specifically asked for the non-duplicate lines.
Quote:
Originally Posted by pan64
Do you want to keep the original order? Why didn't you tell that? In that case which one do you want to keep (first, last, none, ...) of duplicated lines.
These are a couple of reasons why I say it's best that the OP post some of their attempts, or at least be given time and a chance to respond. They may have stated their problem poorly or not thought it through well enough, so the definition of what they want to do may end up changing. Instead of spinning off and solving a multitude of possible interpretations of this problem for the OP, responders should get clarification and wait for the OP to respond if they have further questions. There are obviously several possible outcomes here: the OP could solve it themselves and never update the thread; the OP could conclude that they don't wish to put forth the effort and never return; the OP could get confused and frustrated because the varied interpretations don't match their problem; or a response such as pan64's offering the uniq command is entirely sufficient, and the OP just doesn't bother to mark the thread as solved or thank the responder.
True enough.
It needs to be noted that uniq only works as expected on sorted data. Where I come from, I am not at liberty to mangle user data except when told. Sorting when not requested is mangling.
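That said, when the output order doesn't matter, sorting a copy of the data through a pipe (leaving the original file untouched) makes uniq's -u option do exactly what the OP asked for:

```shell
# sort groups identical lines together; uniq -u then prints only the
# lines that are NOT repeated, discarding every duplicated entry entirely.
sort wordlist.txt | uniq -u
```

Note the difference from plain `sort -u`, which would keep one copy of each duplicated line instead of dropping them all.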