LinuxQuestions.org
Forums > Linux Forums > Linux - General
Old 08-11-2015, 06:41 AM   #1
crowzie
LQ Newbie
 
Registered: Jul 2011
Posts: 16

Rep: Reputation: Disabled
text file manipulation


Hi,
I have a wordlist text file with lots of duplicate entries.
Is there a command to delete all duplicates, along with the entries that had duplicates, so that I am left only with entries that did not have duplicates in the first place?
 
Old 08-11-2015, 06:52 AM   #2
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Not a single command, but a couple of commands used together and then combined in a script or program.

Have you attempted any of this on your own? Please do so and when you get stuck, post your attempt and describe where it fell short.

I'm sure there are other ways, but I'm thinking that I'd read the file line by line, use grep to determine if there were duplicate lines, and then use sed to alter the file.
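The line-by-line idea described above could be sketched like this (a minimal sketch, assuming the list lives in a hypothetical file called wordlist.txt with one word per line; it filters rather than edits in place, so sed turns out not to be needed):

```shell
#!/bin/sh
# Print only the lines that occur exactly once in wordlist.txt
# (hypothetical filename; one word per line assumed).
while IFS= read -r word; do
    # grep -c counts matching lines; -F = fixed string, -x = whole-line match
    if [ "$(grep -cFx -- "$word" wordlist.txt)" -eq 1 ]; then
        printf '%s\n' "$word"
    fi
done < wordlist.txt
```

Note that this re-scans the file once per line, so it is slow on large lists; the uniq-based approaches later in the thread scale much better.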
 
Old 08-11-2015, 07:02 AM   #3
crowzie
LQ Newbie
 
Registered: Jul 2011
Posts: 16

Original Poster
Rep: Reputation: Disabled
Yes, I can remove duplicates easily the way you described, but I'm still left with the entries that had duplicates.
 
Old 08-11-2015, 07:04 AM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Incorrect. sed would remove all occurrences once you identified the entry targeted for removal.
 
Old 08-11-2015, 07:15 AM   #5
crowzie
LQ Newbie
 
Registered: Jul 2011
Posts: 16

Original Poster
Rep: Reputation: Disabled
There are one-liners to sort text files, but I can't think of how to remove the entries that had duplicates.
 
Old 08-11-2015, 07:21 AM   #6
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Post your attempts and people will give you some assistance. We're not here to do the task for you, that is not how LQ works. Please review the FAQ if you need to understand LQ better.
 
Old 08-11-2015, 07:25 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Have you already checked the uniq command (see its man page)?
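For reference, uniq's -u flag does exactly what the question asks in one pipeline (a sketch assuming the list is in a file named wordlist.txt, and that the original line order does not matter, since sort reorders the list):

```shell
# -u prints only the lines that are NOT repeated in the (sorted) input,
# i.e. entries that never had a duplicate.
sort wordlist.txt | uniq -u
```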

Last edited by pan64; 08-11-2015 at 01:43 PM. Reason: just use bold to highlight
 
2 members found this post helpful.
Old 08-11-2015, 02:54 PM   #8
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,219

Rep: Reputation: 5309
I wrote a Python implementation before realizing that this was in the Linux - General forum and not the Programming forum. Anyway...

Code:
#!/usr/bin/env python

import collections
import fileinput


def main():
    word_counts = collections.defaultdict(int)
    words = [word.strip() for word in fileinput.input()]
    for word in words:
        word_counts[word] += 1
    words = [word for word in words if word_counts[word] == 1]
    for word in words:
        print(word)


if __name__ == '__main__':
    main()
You can run this script with the wordlist file as its argument, or you can cat the wordlist file and pipe it into this script.

Last edited by dugan; 08-11-2015 at 02:57 PM.
 
Old 08-12-2015, 07:23 AM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
@dugan, is that guaranteed to print in input order? Just curiosity on my part; I only briefly dabbled in Python years ago.
I did something similar in awk, but associative arrays in awk aren't necessarily kept in order; I had to take extra steps to preserve input order for the print.
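An awk version along the lines mentioned above can preserve input order by reading the file twice (a sketch, assuming the list is in a hypothetical file wordlist.txt):

```shell
# First pass (NR==FNR) counts every line; second pass prints the lines
# seen exactly once, in their original input order.
awk 'NR == FNR { count[$0]++; next } count[$0] == 1' wordlist.txt wordlist.txt
```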
 
Old 08-12-2015, 07:29 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Do you want to keep the original order? Why didn't you say so? In that case, which of the duplicated lines do you want to keep (first, last, none, ...)?
 
Old 08-12-2015, 07:31 AM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
The OP didn't want to keep any of the duplicates - specifically asked for the non-duplicate lines.
 
Old 08-12-2015, 07:42 AM   #12
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Code:
# ok, so:
sort wordlist | uniq -d > filename
# will collect all the duplicates
grep -vFxf filename wordlist
# will drop them
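Put together with concrete filenames (an example assuming the list is in wordlist.txt; dups.txt is just a scratch file), the two steps might look like:

```shell
# Collect every line that appears more than once in the sorted copy.
sort wordlist.txt | uniq -d > dups.txt
# Drop those lines from the original file, preserving its order:
# -v inverts, -F takes fixed strings, -x matches whole lines,
# -f reads the patterns from dups.txt.
grep -vFxf dups.txt wordlist.txt
```

Unlike plain `sort | uniq -u`, this variant keeps the surviving lines in their original input order.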

Last edited by pan64; 08-12-2015 at 07:44 AM.
 
Old 08-12-2015, 07:43 AM   #13
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by syg00 View Post
The OP didn't want to keep any of the duplicates - specifically asked for the non-duplicate lines.
Quote:
Originally Posted by pan64 View Post
Do you want to keep the original order? Why didn't you tell that? In that case which one do you want to keep (first, last, none, ...) of duplicated lines.
These are a couple of reasons why I say it's best for the OP to post some of their attempts, or at least to be given time and a chance to respond. They may have stated their problem poorly or not thought it through well enough, so the definition of what they want may end up changing.

Instead of spinning off and solving a multitude of possible interpretations of this problem for the OP, responders should get clarification and wait for the OP to answer any follow-up questions. There are obviously several possible outcomes here: the OP could solve it themselves and never update the thread; the OP could decide not to put forth the effort and never return; the OP could get confused and frustrated because the varied interpretations don't match their problem; or a response such as pan64's suggestion of the uniq command could be entirely sufficient, and the OP simply never marks the thread as solved or thanks the responder.
 
Old 08-12-2015, 07:51 AM   #14
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
True enough.
It needs to be noted that uniq only works as expected on sorted data. Where I come from, I am not at liberty to mangle user data except when told to, and sorting when not requested is mangling.
 
Old 08-12-2015, 08:02 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,792

Rep: Reputation: 7306
Quote:
Originally Posted by syg00 View Post
True enough.
It needs to be noted that uniq only works as expected on sorted data.
That's why I wrote "see the man page": that fact is noted there.
Quote:
Originally Posted by syg00 View Post
Where I come from, I am not at liberty to mangle user data except when told. Sorting when not requested is mangling.
It was not specified by the OP, so sorting is probably not an issue (if only the list is required).

Also missing: is there only one word per line, or can there be more?
 