How to detect duplicate text with sed or other means
Got a lot of text that contains some duplicate sentences. How do I detect them? Can sed do the job if the lines are first merged into one big line using: tr '\n' _ < input.txt > output.txt?
How do you define duplicate? The word "my" appears twice as does the word "dear". Or it could be counted as the phrase "my dear" appearing twice. And should the words or phrases to be checked occur consecutively or anywhere in a single line? Or are you looking for whole sentences which might be spread over several lines?
I would keep them on separate lines and run something like
Code:
sort | uniq -d
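For example, on a small throwaway file (the file name is just for illustration): sort groups identical lines together, and uniq -d prints one copy of each line that occurs more than once.

```shell
# sort brings duplicate lines next to each other; uniq -d then prints
# one copy of every line that appears more than once.
printf 'b\na\nb\nc\na\n' > sample.txt
sort sample.txt | uniq -d
# prints:
# a
# b
```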
edit:
Quote:
Originally Posted by Ulysses_
Let's say the text looks like this:
echo "hello my dear friend_how are you my dear John_" > temp
We want to detect that "my dear" is repeated twice. Why doesn't the following work?
sed "s/\(.*\).*\1/duplicate:\1/g" < temp
This outputs "duplicate:" followed by a newline.
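An aside on why that attempt prints only "duplicate:": both `\(.*\)` and the backreference `\1` can match the empty string, so the whole line matches with an empty capture and is replaced by "duplicate:" plus nothing. Forcing a minimum capture length avoids the empty match. A sketch, assuming GNU sed and the sample file temp from above (exactly which repeated run is reported depends on the engine's backtracking):

```shell
echo "hello my dear friend_how are you my dear John_" > temp
# Require the captured run to be at least 7 characters, so an empty
# capture can no longer satisfy the backreference; -n plus p prints
# only when a repeat is actually found.
sed -n 's/.*\(.\{7,\}\).*\1.*/duplicate: \1/p' temp
```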
This complicates things, considering that "my dear" is not a word delimited by spaces, nor a record delimited by lines, but a phrase containing a space (a similar problem to a CSV field containing a comma).
Also, most parsers would consider friend_how to be one word.
Quote:
Originally Posted by Ulysses_
Got a lot of text that contains some duplicate sentences. How do I detect them? Can sed do the job if the lines are first merged into one big line using: tr '\n' _ < input.txt > output.txt?
The thing is that it needs a basis for comparison, as you already know. But what would that be? It could be anything, since anything that matches something else is what you are looking for; even the letter "a" occurs in that file more than once. That said, you did say "some duplicate sentences".
You could write an elaborate two-loop algorithm: read one line at a time, store it, then go through the entire file looking for a duplicate of that line and do with it what you will, then go to the next line and do the same, using the period (.) as the means of identifying a sentence. Then break it down to three words, or two words, or even one word; since you already have your two loops written, it is just a minor change to the code inside to take "words" rather than complete sentences and check them against the hold-for-comparison variable. It is a variable because it changes with each new value to be compared.
hypothetical
Code:
#!/bin/bash
# Outer loop takes each line of the file in turn; the inner loop scans
# the remaining lines for an exact duplicate.
mapfile -t lines < "$1"
n=${#lines[@]}
for (( i = 0; i < n; i++ )); do
    for (( j = i + 1; j < n; j++ )); do
        if [[ ${lines[i]} == "${lines[j]}" ]]; then
            # match: do whatever you want with it
            echo "line $((i + 1)) duplicated at line $((j + 1)): ${lines[i]}"
            break    # kick out of the inner loop
        fi
    done
done
# The outer loop then gets the next line and runs the inner loop again,
# doing the same process; when it runs out of lines, it quits.
You've already read about some of the pitfalls of doing this with sed and the like. When you get the basics working you can then get more elaborate about what to check for and how to deal with it.
(I may have described the logic backwards.) Better read this too; it shows four loops, but you get the idea, yes? http://tldp.org/LDP/Bash-Beginners-G...ect_09_03.html
Gosh, I do not know why this is so difficult to communicate. Hydrurga, a line is text with a \n at the end (hex code 0Ah). Text usually comes in multiple lines, but it doesn't have to: they can all be merged into one if all the \n's are deleted or replaced with _ or whatever character you prefer; even a space will do.
So it's all a single line of great length. The letter e may be duplicated thousands of times in this line, but more of interest are long phrases being duplicated.
Quote:
Ulysses wrote a sentence. Ulysses wrote it this morning. It was a nice morning this morning.
In the above example, "Ulysses" is duplicated, and so are "Ulysses wrote", "wrote", "it", "morning", and "this morning". All are acceptable output, but the longer ones are preferred. So how do you exclude the short ones?
I think the easiest way to do it is to match a minimum length of 30 in the first pass (i.e. thirty dots in sed syntax), then edit out the duplication manually if we want to, which is the purpose of this exercise, then try 29 dots in the next pass, edit out that duplication too with an editor, and so on down to, say, 10.
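The pass-by-pass idea above could be sketched as a small loop, assuming GNU sed, a file named temp holding the merged text, and the 30-down-to-10 range just described; which run sed captures at each size depends on its backtracking:

```shell
#!/bin/sh
# Try a repeated run of length 30 first, then 29, and so on down to 10,
# printing whatever repeat sed finds at each size. File name is assumed.
f=temp
len=30
while [ "$len" -ge 10 ]; do
    hit=$(sed -n "s/.*\(.\{$len\}\).*\1.*/\1/p" "$f")
    [ -n "$hit" ] && echo "repeat of length $len: $hit"
    len=$((len - 1))
done
```

Between passes you would open the file in an editor and remove whichever copy of the reported duplicate you do not want, then rerun.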
Code:
#!/bin/bash
case "$1" in
    1)
        echo "$1"
        ;;
    2)
        echo "NO $1"
        ;;
    *)
        exit 1
        ;;
esac
MOD:
Correct me if I am wrong, but from my understanding sed is a simple find-and-replace tool. You have to tell it what to find first in order for it to replace it with something else. It is as if you want sed to search for a given pattern, and that can only be one pattern at a time.
"then edit out the duplication manually if we want to which is the purpose of this exercise"
which goes beyond what sed is for, because sed is supposed to match a pattern and then replace it, not match, let you know it found something, and then move on looking for the next match.
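That said, sed can be made to just report matches rather than replace them, using -n together with the p flag:

```shell
# -n suppresses sed's normal output; p prints only the lines matching
# the pattern, so nothing gets replaced.
echo "hello my dear friend" | sed -n '/dear/p'
# prints: hello my dear friend
```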
hello world
this is only a test. this is only a test.
the end is never the end is never the end
chun-li vs. akuma
this is not only a test; this is also a trap.
ryu vs. chun-li
never take a test on an empty stomach.
this is only a test. this is only a test.
the end
I think you should define how greedy you want your matching to be.
counting duplicate lines or words would be trivial. counting duplicates of randomly identified groupings of words would take some pseudo-coding.
Code:
[schneidz@hyper ~]$ sort ulysses.txt | uniq -c -d # by line:
2 this is only a test. this is only a test.
[schneidz@hyper ~]$ cat ulysses.txt | tr ' ' '\n' | tr -d . | sort | uniq -d -c # by word:
7 a
2 chun-li
4 end
8 is
3 never
5 only
5 test
4 the
6 this
2 vs
Last edited by schneidz; 05-23-2017 at 11:07 AM.
Reason: added results
But bummer, it does not work on large files: it does not match at all on my 1.13 MB file, yet it matches if the file is first truncated to 32000 bytes with head -c 32000.
"then edit out the duplication manually if we want to which is the purpose of this exercise"
which supersedes sed because it is suppose to match a pattern then replace not match let you know it has something that matches then move on looking for the next match.
First you find the unknown duplicate text, then you go to your gedit or leafpad or whatever manual editor you have and decide which of the two instances you discovered you want to keep.
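For what it's worth, one way around sed choking on a single huge line is to not build the huge line at all: split the text into one word per line and count repeated adjacent word pairs. A rough awk sketch (the pair size and the file name input.txt are assumptions, not anything from the thread):

```shell
# One word per line, then count every adjacent word pair and print the
# pairs that occur more than once.
tr -s '[:space:]' '\n' < input.txt |
awk 'prev != "" { count[prev " " $0]++ }
     { prev = $0 }
     END { for (pair in count) if (count[pair] > 1) print count[pair], pair }'
```

Run against the earlier sample sentence "hello my dear friend how are you my dear John", this prints `2 my dear`. Longer repeats could be found the same way by widening the window to three or more words.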