LinuxQuestions.org
Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.

03-29-2011, 02:11 AM   #1
sundays211
LQ Newbie
 
Registered: Apr 2010
Distribution: Ubuntu 10.10
Posts: 8

Rep: Reputation: 0
Find and remove duplicate phrases in a document


I would like a command that automatically finds and removes phrases which appear more than once in a text file. I want to keep one copy of each phrase, but only one. Any ideas?
 
03-29-2011, 02:29 AM   #2
crts
Senior Member

Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

Try this:
Code:
awk '(!a[$0]++)' file
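
For anyone wondering why that one-liner works, here is a minimal sketch on a throwaway file. The file contents and the variable names (`sample`, `deduped`) are made up for illustration; only the awk expression comes from the post above.

```shell
# Build a small sample file (hypothetical contents) with repeated lines.
sample=$(mktemp)
printf '%s\n' 'beta' 'alpha' 'beta' 'gamma' 'alpha' > "$sample"

# a[$0] counts how often the current line has been seen so far.
# The post-increment returns the old count, so !a[$0]++ is true only
# on a line's first appearance, and awk's default action prints it.
deduped=$(awk '!a[$0]++' "$sample")
printf '%s\n' "$deduped"
# prints:
# beta
# alpha
# gamma

rm -f "$sample"
```

Unlike a sort-based approach, this keeps the lines in their original order, at the cost of holding every distinct line in memory.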
 
03-29-2011, 02:33 AM   #3
David the H.
Bash Guru

Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
The answer depends on the exact circumstances. Please give us a representative sample of the text, and the kind of changes you want to make.

In general, if you can define regular patterns and rules for matching and modification, then it's probably scriptable. The more variation and unpredictability in the text, the harder it is to work with.
 
03-29-2011, 02:33 AM   #4
k3lt01
Senior Member

Registered: Feb 2011
Location: Australia
Distribution: Debian Wheezy, Jessie, Sid/Experimental, playing with LFS.
Posts: 2,900

Rep: Reputation: 637
For my hosts file I use this:
Code:
tr '\t' ' ' < /home/michael/hosts | tr -s ' ' | sort | uniq >| /home/michael/hosts.new
Make a copy of your file and play around with it. Note that this requires each "phrase" to be on its own line, as in a hosts file, so the input will look something like this:

127.0.0.1 www(dot)abcde(dot)com
127.0.0.1 www(dot)abcde(dot)com
127.0.0.1 www(dot)bcdef(dot)com

(The actual . is replaced by (dot) because abcde is a real net address.)

The code above will remove the duplicate abcde(dot)com line after it puts all lines in alphabetical order.
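
A quick sketch of the same idea on a throwaway file, for anyone who wants to try it without touching a real hosts file. The host names and the variable names (`hosts`, `cleaned`) are hypothetical. Here the whitespace is squeezed before sorting, so lines that differ only in tabs versus spaces become identical and end up adjacent for uniq.

```shell
# Hypothetical hosts-style input: the same entry written with a tab,
# and again with several spaces, plus one distinct entry.
hosts=$(mktemp)
printf '127.0.0.1\twww.example.com\n'    > "$hosts"
printf '127.0.0.1 www.example.org\n'    >> "$hosts"
printf '127.0.0.1   www.example.com\n'  >> "$hosts"

# Turn tabs into spaces, squeeze runs of spaces into one,
# then sort so duplicates are adjacent and let uniq drop them.
cleaned=$(tr '\t' ' ' < "$hosts" | tr -s ' ' | sort | uniq)
printf '%s\n' "$cleaned"
# prints:
# 127.0.0.1 www.example.com
# 127.0.0.1 www.example.org

rm -f "$hosts"
```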
 
03-29-2011, 02:43 AM   #5
David the H.
Bash Guru

Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
For that matter, if you assume that each phrase is on a separate line, and that the original order doesn't need to be maintained, then all you may really need is:
Code:
sort -u filename
But that's why I requested clarification. Until the OP defines his needs in more detail, we're having to make assumptions like this.
 
03-30-2011, 01:05 AM   #6
sundays211
LQ Newbie
 
Registered: Apr 2010
Distribution: Ubuntu 10.10
Posts: 8

Original Poster
Rep: Reputation: 0
I used grep to select some lines from a group of .htm files (250 in total, 10 per file) and stored them in a text file. Unfortunately I've run into another small problem when sorting the list: the filename comes before the actual phrase I want to order the lines by. I would have no problem getting rid of the filename in the phrases (in fact, I want to).

Here is a sample of the text I wish to modify (I have changed the actual names, but I'm sure whatever you give me will work for the actual names). The phrases I am worried about are shown in bold. Note that the first number shown in bold is part of the filename, which I want removed.


34,35,576,17229483,goto,10.htm: href="http://www(dot)example(dot)com/directory/displayresults.ws?searchName=1abcd" class="flink" src="http://www.example.com/directory/1abcd/picture.gif" class="alink" alt=""
 
03-30-2011, 01:18 AM   #7
David the H.
Bash Guru

Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
grep has the -h option, which turns off filename output. See the man page.
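
As a rough illustration, assuming two small stand-in files (the file contents, the search pattern and the variable names are invented for this sketch), this shows the default filename prefix, the effect of -h, and a sed fallback for a list that has already been saved with prefixes. The sed variant assumes the filenames themselves contain no colon.

```shell
# Two hypothetical .htm files, each with one matching line.
workdir=$(mktemp -d)
printf 'href="displayresults.ws?searchName=1abcd"\n' > "$workdir/10.htm"
printf 'href="displayresults.ws?searchName=2efgh"\n' > "$workdir/11.htm"

# With more than one file, grep prefixes each match with "filename:".
with_names=$(cd "$workdir" && grep 'searchName' 10.htm 11.htm)

# -h suppresses that prefix.
without_names=$(cd "$workdir" && grep -h 'searchName' 10.htm 11.htm)

# For an already saved list: delete everything up to the first colon.
stripped=$(printf '%s\n' "$with_names" | sed 's/^[^:]*://')

printf '%s\n' "$without_names"
rm -rf "$workdir"
```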

But I still don't get it. Do you want to remove whole lines, or just the "1abcd" part? And you want to keep the first instance? I think just removing that phrase would lead to some odd remainders. Care to elaborate further?
 
03-30-2011, 01:37 AM   #8
sundays211
LQ Newbie
 
Registered: Apr 2010
Distribution: Ubuntu 10.10
Posts: 8

Original Poster
Rep: Reputation: 0
I want to remove any line which is identical to another line, but keep one copy of that line.

So for example, if I had:

phrase 2 phrase
phrase 4 phrase
phrase 7 phrase
phrase 2 phrase
phrase 7 phrase

I would want to have

phrase 2 phrase
phrase 4 phrase
phrase 7 phrase
 
03-30-2011, 04:31 AM   #9
David the H.
Bash Guru

Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
But that's not what your first example shows. It has several different html components, with the target phrase embedded inside multiple different components. And they aren't on individual lines either, unless your lack of [code][/code] tags around them has broken the formatting. Or is that supposed to be a single line?

But again, if the original order of the text doesn't matter, and the whole lines are truly identical, then the sort command I gave before can do it. If order matters, then crts' awk command will do it.
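
To make the difference concrete, here is a small sketch using the lines from post #8, deliberately reordered so the two commands give different results. The temporary file and the variable names (`list`, `sorted`, `in_order`) are only for illustration.

```shell
# The sample lines from post #8, shuffled so the first line is not
# the alphabetically smallest one.
list=$(mktemp)
printf '%s\n' 'phrase 7 phrase' 'phrase 2 phrase' 'phrase 4 phrase' \
              'phrase 2 phrase' 'phrase 7 phrase' > "$list"

# sort -u: duplicates removed, output in sorted order.
sorted=$(sort -u "$list")
printf '%s\n' "$sorted"
# prints:
# phrase 2 phrase
# phrase 4 phrase
# phrase 7 phrase

# awk: duplicates removed, first occurrences kept in original order.
in_order=$(awk '!a[$0]++' "$list")
printf '%s\n' "$in_order"
# prints:
# phrase 7 phrase
# phrase 2 phrase
# phrase 4 phrase

rm -f "$list"
```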

If the lines aren't exactly the same, then we'll need to do more work. Can you show us a larger sample of the actual text, wrapped in code tags, and exactly how you want it to look afterwards?
 
03-30-2011, 11:09 PM   #10
sundays211
LQ Newbie
 
Registered: Apr 2010
Distribution: Ubuntu 10.10
Posts: 8

Original Poster
Rep: Reputation: 0
"sort -u filename" was what I needed. In actual fact all phrases were on one line each, and they were all identical to each other apart from one part which I wanted them to be ordered by.

Thanks for the help
 
  


All times are GMT -5.
