Find and remove duplicate phrases in a document

sundays211 · 03-29-2011, 02:11 AM

I would like to find a command that automatically finds and removes phrases that appear more than once in a text file. I want to keep one copy of each phrase and remove the rest. Any ideas?

crts · 03-29-2011, 02:29 AM

Hi,

Try this:

Code:

awk '(!a[$0]++)' file

David the H. · 03-29-2011, 02:33 AM

The answer depends on the exact circumstances. Please give us a representative sample of the text, and the kind of changes you want to make.

In general, if you can define regular patterns and rules for matching and modification, then it's probably scriptable. The more variation and unpredictability in the text, the harder it is to work with.

k3lt01 · 03-29-2011, 02:33 AM

For my hosts file I use this:

Code:

tr '\t' ' ' < /home/michael/hosts | tr -s ' ' | sort | uniq >| /home/michael/hosts.new

Make a copy of your file and play around with it. Note that this approach requires each "phrase" to be on a separate line, as in a hosts file, so the input will look something like this:

127.0.0.1 www(dot)abcde(dot)com
127.0.0.1 www(dot)abcde(dot)com
127.0.0.1 www(dot)bcdef(dot)com

(The actual . is replaced by (dot) because abcde is a real net address.)

The code above puts all lines in alphabetical order and removes the duplicate abcde(dot)com line.
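
For anyone who wants to try this without touching a real hosts file, here is a minimal sketch on a throwaway file. The host names are hypothetical placeholders, not from the original file. Whitespace is normalized before sorting here, on the assumption that lines differing only in tabs versus spaces should count as duplicates; uniq only drops duplicates that are adjacent, so they need to match before the sort.

```shell
# Hypothetical sample data written to a temporary file.
tmp=$(mktemp)
printf '127.0.0.1\twww.bcdef.example\n127.0.0.1 www.abcde.example\n127.0.0.1   www.bcdef.example\n' > "$tmp"

# Turn tabs into spaces, squeeze runs of spaces, then sort and
# drop adjacent duplicate lines.
deduped=$(tr '\t' ' ' < "$tmp" | tr -s ' ' | sort | uniq)
printf '%s\n' "$deduped"

rm -f "$tmp"
```

The two bcdef lines differ only in their whitespace, so after normalization they collapse into one.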

David the H. · 03-29-2011, 02:43 AM

For that matter, if you assume that each phrase is on a separate line, and that the original order doesn't need to be maintained, then all you may really need is:

Code:

sort -u filename

But that's why I requested clarification. Until the OP defines his needs in more detail, we're having to make assumptions like this.

sundays211 · 03-30-2011, 01:05 AM

I have used grep to select some lines from a group of .htm files (250 in total, 10 per file) and stored them in a text file. Unfortunately I've run into another small problem when sorting the list: the filename comes before the actual phrase that I want to order the lines by. I would be happy to get rid of the filename in the phrases (in fact, I want to).

Here is a sample of the text I wish to modify (I have changed the actual names, but I'm sure whatever you give me will work for the actual names). The phrases I am worried about are shown in bold. Note that the first number shown in bold is part of the filename, which I want removed.

34,35,576,17229483,goto,10.htm: href="http://www(dot)example(dot)com/directory/displayresults.ws?searchName=1abcd" class="flink" src="http://www.example.com/directory/1abcd/picture.gif" class="alink" alt=""

David the H. · 03-30-2011, 01:18 AM

grep has the -h option, which turns off filename output. See the man page.
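
A small sketch of the difference, using two made-up .htm files in a temporary directory (the file names and contents are assumptions for illustration only):

```shell
# Hypothetical sample files in a throwaway directory.
dir=$(mktemp -d)
printf '<a href="one">\n<p>skip</p>\n' > "$dir/1.htm"
printf '<a href="two">\n' > "$dir/2.htm"

# With several input files, grep prefixes each match with "filename:".
with_names=$(grep 'href' "$dir"/*.htm)

# -h suppresses that filename prefix.
without_names=$(grep -h 'href' "$dir"/*.htm)
printf '%s\n' "$without_names"

rm -rf "$dir"
```

With -h, only the matching lines themselves are printed, so they can be sorted directly.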

But I still don't get it. Do you want to remove whole lines, or just the "1abcd" part? And you want to keep the first instance? I think just removing that phrase would leave some odd remainders. Care to elaborate further?

sundays211 · 03-30-2011, 01:37 AM

I want to remove any line which is identical to another line, but keep one copy of that line.

So for example, if I had:

phrase 2 phrase
phrase 4 phrase
phrase 7 phrase
phrase 2 phrase
phrase 7 phrase

I would want to have

phrase 2 phrase
phrase 4 phrase
phrase 7 phrase

David the H. · 03-30-2011, 04:31 AM

But that's not what your first example shows. It has several different html components, with the target phrase embedded inside multiple different components. And they aren't on individual lines either, unless your lack of [code][/code] tags around them has broken the formatting. Or is that supposed to be a single line?

But again, if the original order of the text doesn't matter, and the whole lines are truly identical, then the sort command I gave before can do it. If order matters, then crts' awk command will do it.
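
To make the difference concrete, here is a sketch that runs both approaches on a reordered variant of the sample lines (the reordering is an assumption, chosen so that the two results differ). The awk array name `seen` is arbitrary.

```shell
# Sample lines in a deliberately unsorted order.
input=$(printf 'phrase 7 phrase\nphrase 2 phrase\nphrase 7 phrase\nphrase 4 phrase\nphrase 2 phrase')

# sort -u: sorts the lines and keeps one copy of each.
sorted=$(printf '%s\n' "$input" | sort -u)

# awk: prints a line only the first time it is seen, so the
# original order of first occurrences is preserved.
ordered=$(printf '%s\n' "$input" | awk '!seen[$0]++')

printf 'sort -u:\n%s\n\nawk:\n%s\n' "$sorted" "$ordered"
```

Here sort -u gives the lines in the order 2, 4, 7, while awk gives 7, 2, 4.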

If the lines aren't exactly the same, then we'll need to do more work. Can you show us a larger sample of the actual text, wrapped in code tags, and exactly how you want it to look afterwards?

sundays211 · 03-30-2011, 11:09 PM

"sort -u filename" was what I needed. In actual fact all phrases were on one line each, and they were all identical to each other apart from one part which I wanted them to be ordered by.

Thanks for the help