How to detect duplicate text with sed or other means
Got a lot of text that contains some duplicate sentences. How do I detect them? Can sed do the job if the lines are first merged into one big line using: tr '\n' _ < input.txt > output.txt?
How do you define duplicate? The word "my" appears twice as does the word "dear". Or it could be counted as the phrase "my dear" appearing twice. And should the words or phrases to be checked occur consecutively or anywhere in a single line? Or are you looking for whole sentences which might be spread over several lines?
I would keep them on separate lines and run something like
Code:
sort | uniq -d
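For example, on a small throwaway file (the file name is just for illustration): sort groups identical lines together, and uniq -d prints one copy of each line that occurs more than once.

```shell
# sort brings duplicate lines next to each other; uniq -d then prints
# one copy of every line that appears more than once.
printf 'b\na\nb\nc\na\n' > sample.txt
sort sample.txt | uniq -d
# prints:
# a
# b
```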
edit:
Quote:
Originally Posted by Ulysses_
Let's say the text looks like this:
echo "hello my dear friend_how are you my dear John_" > temp
We want to detect that "my dear" is repeated twice. Why doesn't the following work?
sed "s/\(.*\).*\1/duplicate:\1/g" < temp
This outputs "duplicate:" followed by a newline.
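An aside on why that attempt prints only "duplicate:": both `\(.*\)` and the backreference `\1` can match the empty string, so the whole line matches with an empty capture and is replaced by "duplicate:" plus nothing. Forcing a minimum capture length avoids the empty match. A sketch, assuming GNU sed and the sample file temp from above (exactly which repeated run is reported depends on the engine's backtracking):

```shell
echo "hello my dear friend_how are you my dear John_" > temp
# Require the captured run to be at least 7 characters, so an empty
# capture can no longer satisfy the backreference; -n plus p prints
# only when a repeat is actually found.
sed -n 's/.*\(.\{7,\}\).*\1.*/duplicate: \1/p' temp
```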
This complicates things, considering that "my dear" is not a word delimited by spaces, nor a record delimited by lines, but a phrase containing a space (a similar problem to a CSV field containing a comma).
Also, most parsers would consider friend_how to be one word.
Quote:
Originally Posted by Ulysses_
Got a lot of text that contains some duplicate sentences. How do I detect them? Can sed do the job if the lines are first merged into one big line using: tr '\n' _ < input.txt > output.txt?
The thing is that it needs a basis for comparison, as you already know. But what would that be? It could be anything, since anything that matches something else is what you are looking for; even the letter "a" occurs in that file more than once. That said, you did say "some duplicate sentences".
You could write an elaborate two-loop algorithm: read one line at a time, store it, then go through the entire file looking for a duplicate of that line and do with it what you will, then go to the next line and do the same, using the period (.) as the means of identifying a sentence. Then break it down to three words, or two words, or even one word; since you already have your two loops written, it is just a minor change to the code inside to take "words" rather than complete sentences and check them against the hold-for-comparison variable. It is a variable because it changes with each new value to be compared.
hypothetical
Code:
#!/bin/bash
# Outer loop takes each line of the file in turn; the inner loop scans
# the remaining lines for an exact duplicate.
mapfile -t lines < "$1"
n=${#lines[@]}
for (( i = 0; i < n; i++ )); do
    for (( j = i + 1; j < n; j++ )); do
        if [[ ${lines[i]} == "${lines[j]}" ]]; then
            # match: do whatever you want with it
            echo "line $((i + 1)) duplicated at line $((j + 1)): ${lines[i]}"
            break    # kick out of the inner loop
        fi
    done
done
# The outer loop then gets the next line and runs the inner loop again,
# doing the same process; when it runs out of lines, it quits.
You've already read about some of the pitfalls of doing this with sed and the like. When you get the basics working you can then get more elaborate about what to check for and how to deal with it.
(I may have described the logic backwards.) Better read this too; it shows four loops, but you get the idea, yes? http://tldp.org/LDP/Bash-Beginners-G...ect_09_03.html
Gosh, I do not know why this is so difficult to communicate. Hydrurga, a line is text with a \n at the end (hex code 0Ah). Text usually comes in multiple lines, but it doesn't have to: they can all be merged into one if all the \n's are deleted or replaced with _ or whatever character you prefer; even a space will do.
So it's all a single line of great length. The letter e may be duplicated thousands of times in this line, but more of interest are long phrases being duplicated.
Quote:
Ulysses wrote a sentence. Ulysses wrote it this morning. It was a nice morning this morning.
In the above example, "Ulysses" is duplicated, and so are "Ulysses wrote", "wrote", "it", "morning", and "this morning". All are acceptable output, but the longer ones are preferred. So how do you exclude the short ones?
I think the easiest way to do it is to match a minimum length of 30 in the first pass (i.e. thirty dots in sed syntax), then edit out the duplication manually if we want to, which is the purpose of this exercise, then try 29 dots in the next pass, edit out that duplication too with an editor, and so on down to, say, 10.
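The pass-by-pass idea above could be sketched as a small loop, assuming GNU sed, a file named temp holding the merged text, and the 30-down-to-10 range just described; which run sed captures at each size depends on its backtracking:

```shell
#!/bin/sh
# Try a repeated run of length 30 first, then 29, and so on down to 10,
# printing whatever repeat sed finds at each size. File name is assumed.
f=temp
len=30
while [ "$len" -ge 10 ]; do
    hit=$(sed -n "s/.*\(.\{$len\}\).*\1.*/\1/p" "$f")
    [ -n "$hit" ] && echo "repeat of length $len: $hit"
    len=$((len - 1))
done
```

Between passes you would open the file in an editor and remove whichever copy of the reported duplicate you do not want, then rerun.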
Code:
#!/bin/bash
case "$1" in
    1)
        echo "$1"
        ;;
    2)
        echo "NO $1"
        ;;
    *)
        exit 1
        ;;
esac
MOD:
Correct me if I am wrong, but from my understanding sed is a simple find-and-replace tool. You have to tell it what to find first in order for it to replace it with something else. It is as if you want sed to search for a given pattern, and that can only be one pattern at a time.
"then edit out the duplication manually if we want to which is the purpose of this exercise"
which goes beyond what sed is for, because sed is supposed to match a pattern and then replace it, not match, let you know it found something, and then move on looking for the next match.
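That said, sed can be made to just report matches rather than replace them, using -n together with the p flag:

```shell
# -n suppresses sed's normal output; p prints only the lines matching
# the pattern, so nothing gets replaced.
echo "hello my dear friend" | sed -n '/dear/p'
# prints: hello my dear friend
```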
hello world
this is only a test. this is only a test.
the end is never the end is never the end
chun-li vs. akuma
this is not only a test; this is also a trap.
ryu vs. chun-li
never take a test on an empty stomach.
this is only a test. this is only a test.
the end
I think you should define how greedy you want your matching to be.
counting duplicate lines or words would be trivial. counting duplicates of randomly identified groupings of words would take some pseudo-coding.
Code:
[schneidz@hyper ~]$ sort ulysses.txt | uniq -c -d # by line:
2 this is only a test. this is only a test.
[schneidz@hyper ~]$ cat ulysses.txt | tr ' ' '\n' | tr -d . | sort | uniq -d -c # by word:
7 a
2 chun-li
4 end
8 is
3 never
5 only
5 test
4 the
6 this
2 vs
Last edited by schneidz; 05-23-2017 at 11:07 AM.
Reason: added results
But bummer, it does not work on large files: it does not match at all on my 1.13 MB file, yet it matches if the file is first truncated to 32000 bytes with head -c 32000.
"then edit out the duplication manually if we want to which is the purpose of this exercise"
which supersedes sed because it is suppose to match a pattern then replace not match let you know it has something that matches then move on looking for the next match.
First you find the unknown duplicate text, then you go to your gedit or leafpad or whatever manual editor you have and decide which of the two instances you discovered you want to keep.
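For what it's worth, one way around sed choking on a single huge line is to not build the huge line at all: split the text into one word per line and count repeated adjacent word pairs. A rough awk sketch (the pair size and the file name input.txt are assumptions, not anything from the thread):

```shell
# One word per line, then count every adjacent word pair and print the
# pairs that occur more than once.
tr -s '[:space:]' '\n' < input.txt |
awk 'prev != "" { count[prev " " $0]++ }
     { prev = $0 }
     END { for (pair in count) if (count[pair] > 1) print count[pair], pair }'
```

Run against the earlier sample sentence "hello my dear friend how are you my dear John", this prints `2 my dear`. Longer repeats could be found the same way by widening the window to three or more words.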