Old 05-23-2017, 06:21 AM   #1
Ulysses_
Senior Member
 
Registered: Jul 2009
Posts: 1,303

Rep: Reputation: 57
How to detect duplicate text with sed or other means


Got a lot of text that contains some duplicate sentences. How do I detect them? Can sed do the job if the lines are first merged into one big line using: tr '\n' _ < input.txt > output.txt?
 
Old 05-23-2017, 06:29 AM   #2
Ulysses_
Senior Member
 
Registered: Jul 2009
Posts: 1,303

Original Poster
Rep: Reputation: 57
Let's say the text looks like this:

echo "hello my dear friend_how are you my dear John_" > temp

We want to detect that "my dear" appears twice. Why doesn't the following work?

sed "s/\(.*\).*\1/duplicate:\1/g" < temp

This outputs "duplicate:" followed by a newline.
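A likely explanation, offered as an editor's note rather than anything stated in the thread: sed looks for the longest overall match, and because \(.*\) is allowed to capture an empty string, the pattern can be satisfied by letting the second .* swallow the whole line against an empty backreference, which is exactly the bare "duplicate:" seen above. A hedged workaround is to force the captured group to have some minimum length (the interval \{5,\} is standard BRE syntax in GNU sed; the value 5 here is arbitrary):

Code:
# Require the capture to be at least 5 characters, so the empty-capture
# parse can no longer win; only the repeated fragment is kept.
sed 's/.*\(.\{5,\}\).*\1.*/duplicate:\1/' temp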

Last edited by Ulysses_; 05-23-2017 at 06:32 AM.
 
Old 05-23-2017, 07:19 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,328
Blog Entries: 3

Rep: Reputation: 3726
How do you define duplicate? The word "my" appears twice as does the word "dear". Or it could be counted as the phrase "my dear" appearing twice. And should the words or phrases to be checked occur consecutively or anywhere in a single line? Or are you looking for whole sentences which might be spread over several lines?
 
1 member found this post helpful.
Old 05-23-2017, 07:26 AM   #4
Ulysses_
Senior Member
 
Registered: Jul 2009
Posts: 1,303

Original Poster
Rep: Reputation: 57
Quote:
Originally Posted by Turbocapitalist View Post
How do you define duplicate? The word "my" appears twice as does the word "dear".
The longest match would be ideal. Could specify a minimum length if needed. I am really looking for sentences made up of dozens of words.

Quote:
And should the words or phrases to be checked occur consecutively or anywhere in a single line?
Definitely anywhere in a single line.

Quote:
Or are you looking for whole sentences which might be spread over several lines?
Yes, but all lines can be merged into one big line as shown above, so this is not an issue.
 
Old 05-23-2017, 08:12 AM   #5
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Just to clarify: given your flexible approach to the definition of "line", do you effectively only want to find duplicates within a given sentence?

For example:

Quote:
Ulysses wrote a sentence. Ulysses wrote it this morning. It was a nice morning this morning.
What would you like to define as a duplicate in the above?
 
Old 05-23-2017, 08:18 AM   #6
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
i would keep them on separate lines and run something like
Code:
sort | uniq -d
edit:

Quote:
Originally Posted by Ulysses_ View Post
Let's say the text looks like this:

echo "hello my dear friend_how are you my dear John_" > temp

We want to detect that "my dear" is repeated twice. Why doesn't the following work?

sed "s/\(.*\).*\1/duplicate:\1/g" < temp

This outputs "duplicate:" followed by new line.
this complicates things, considering "my dear" is not a word delimited by spaces nor a record delimited by lines, but a phrase containing a space (a similar problem to CSVs that contain a comma inside a field).

also, most parsers would consider friend_how to be 1 word.
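One way around the phrase-with-a-space problem, sketched by the editor rather than taken from the thread, is to emit every run of N consecutive words on its own line and then count the runs that repeat; the underscore-to-space step assumes the merged temp file from post #2:

Code:
# Turn the merged line back into one word per line, build sliding
# word N-grams in awk, and count the N-grams that occur more than once.
N=2
tr '_' ' ' < temp | tr -s ' ' '\n' |
awk -v n="$N" '{ w[NR] = $0 }
     NR >= n  { s = w[NR-n+1]; for (i = NR-n+2; i <= NR; i++) s = s " " w[i]; print s }' |
sort | uniq -c -d

With N=2 on the sample line from post #2, the repeated bigram "my dear" should show up with a count of 2.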

Last edited by schneidz; 05-23-2017 at 08:33 AM.
 
Old 05-23-2017, 09:44 AM   #7
BW-userx
LQ Guru
 
Registered: Sep 2013
Location: Somewhere in my head.
Distribution: Slackware (15 current), Slack15, Ubuntu studio, MX Linux, FreeBSD 13.1, WIn10
Posts: 10,342

Rep: Reputation: 2242
Quote:
Originally Posted by Ulysses_ View Post
Got a lot of text that contains some duplicate sentences. How do I detect them? Can sed do the job if the lines are first merged into one big line using: tr '\n' _ < input.txt > output.txt?
The thing is that it needs a basis for comparison, as you already know. But what would that be? It could be anything, since that is what you are looking for: anything that matches something else, even the letter "a" appearing in that file more than once. But you are stating "some duplicate sentences", so that narrows it down.

You could write a two-loop algorithm that reads one line at a time, stores that line, then goes through the entire file looking for a duplicate of it, does with it what you will, and then moves on to the next line and does the same, using the period "." to mark the end of a sentence. Once that works, you can break it down to three words, two words, or even one word, since you already have the two loops written: just a minor change inside the loops to compare "words" instead of complete sentences against the hold-for-comparison variable. It is a variable because it changes with each new value to be compared.

hypothetical
Code:
#!/bin/bash
# $1 is the file to scan.
outer_no=0
# Outer loop: take each line of the file in turn and hold it for comparison.
while IFS= read -r held; do
    outer_no=$((outer_no + 1))
    inner_no=0
    # Inner loop: scan the whole file again, string-comparing every line
    # against the held line.
    while IFS= read -r candidate; do
        inner_no=$((inner_no + 1))
        if [[ $inner_no -gt $outer_no && "$candidate" == "$held" ]]; then
            echo "line $inner_no duplicates line $outer_no: $held"
        fi
    done < "$1"
done < "$1"
# The outer loop then gets the next line and runs the inner loop again;
# when the outer loop hits EOF, it quits.

You've already read about some of the pitfalls of doing this with sed and the like.

Once you get the basic version working, you can then get more elaborate about what to check for and how to deal with it.
(I might have the logic backwards in how I described it working.) Better read this too; it shows four loops, but you get the idea, yes?
http://tldp.org/LDP/Bash-Beginners-G...ect_09_03.html

Last edited by BW-userx; 05-23-2017 at 10:18 AM.
 
Old 05-23-2017, 10:04 AM   #8
Ulysses_
Senior Member
 
Registered: Jul 2009
Posts: 1,303

Original Poster
Rep: Reputation: 57
Gosh, I do not know why this is so difficult to communicate. Hydrurga, a line is text with a \n at the end (hex code 0Ah). Text comes in multiple lines, but it doesn't have to: the lines can all be merged into one if all the \n's are deleted or replaced with _ or whatever character you prefer; even a space will do.

So it's all a single line of great length. The letter "e" may be duplicated thousands of times in this line, but of more interest are long phrases being duplicated.

Quote:
Ulysses wrote a sentence. Ulysses wrote it this morning. It was a nice morning this morning.
In the above example, "Ulysses" is duplicated, and also "Ulysses wrote" is duplicated, and "wrote", and "it", and "morning", and "this morning". All are acceptable output. But the long ones are preferred. So how do you exclude the short ones?

I think the easiest way to do it is to match a minimum length of 30 in the first pass (i.e. .............................. in sed syntax), then edit out the duplication manually if we want to, which is the purpose of this exercise; then 29 dots in the next pass, edit out that duplication too with an editor, and so on down to 10, say.
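A minimal sketch of the decreasing-length passes described above, assuming GNU sed and the merge-to-one-line step from post #1, and assuming the input is small enough for sed's back-reference matching to cope with; printf and seq are only used to avoid typing the runs of dots by hand:

Code:
# One pass per minimum length, from 30 characters down to 10.
tr '\n' _ < input.txt > temp
for len in $(seq 30 -1 10); do
    dots=$(printf '.%.0s' $(seq "$len"))          # a run of $len dots
    match=$(sed -n "s/.*\($dots\).*\1.*/\1/p" temp)
    [ -n "$match" ] && echo "repeated (length >= $len): $match"
done

Each pass only reports a duplicate; the actual removal is still done by hand in an editor, as described above.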

Last edited by Ulysses_; 05-23-2017 at 10:13 AM.
 
Old 05-23-2017, 10:55 AM   #9
BW-userx
LQ Guru
 
Registered: Sep 2013
Location: Somewhere in my head.
Distribution: Slackware (15 current), Slack15, Ubuntu studio, MX Linux, FreeBSD 13.1, WIn10
Posts: 10,342

Rep: Reputation: 2242
Code:
userx%slackwhere ⚡ testing ⚡> grep -oh "[[:alpha:]]*ec[[:alpha:]]*" casey
echo
echo
the file casey:
Code:
#!/bin/bash

case "$1" in
  1)
    echo "$1"
    ;;
  2)
    echo "NO $1"
    ;;
  *)
    exit 1
    ;;
esac
MOD: correct me if I am wrong, but from my understanding sed is a simple find/replace tool. You have to tell it what to find first in order for it to replace it with something else; it is as if you want sed to search for a given pattern, and that can only be one pattern.
"then edit out the duplication manually if we want to which is the purpose of this exercise"
That goes beyond what sed does, because sed is supposed to match a pattern and then replace it, not just tell you that it found a match and move on looking for the next one.
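As an editor's aside rather than anything from this post: sed can in fact be made to report rather than rewrite, by suppressing automatic output with -n and printing only when the substitution matched:

Code:
# -n: do not print every line; the trailing p prints only lines where s/// matched.
# "&" puts back the whole match, so nothing is actually changed.
sed -n 's/pattern/&/p' file        # behaves roughly like: grep 'pattern' file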

Last edited by BW-userx; 05-23-2017 at 11:11 AM.
 
Old 05-23-2017, 10:56 AM   #10
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
what output are you looking for in this example:
Code:
hello world
this is only a test.  this is only a test.
the end is never the end is never the end
chun-li vs. akuma
this is not only a test; this is also a trap.
ryu vs. chun-li
never take a test on an empty stomach.
this is only a test.  this is only a test.
the end
i think you should define how greedy you want your matching to be?
counting duplicate lines or words would be trivial; counting duplicates of arbitrarily identified groupings of words would take some pseudo-coding.
Code:
[schneidz@hyper ~]$ sort ulysses.txt | uniq -c -d # by line:
      2 this is only a test.  this is only a test.
[schneidz@hyper ~]$ cat ulysses.txt | tr ' ' '\n' | tr -d . | sort | uniq -d -c # by word:
      2 
      7 a
      2 chun-li
      4 end
      8 is
      3 never
      5 only
      5 test
      4 the
      6 this
      2 vs
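A third variant, an editor's sketch rather than part of this post, counts duplicates by sentence: join the lines, split on "." instead, trim the surrounding spaces, and count what repeats:

Code:
# By sentence: each "."-terminated fragment becomes one line to compare.
tr '\n' ' ' < ulysses.txt | tr '.' '\n' | sed 's/^ *//; s/ *$//' | sort | uniq -c -d

On the sample above this should report "this is only a test" as a repeated sentence (the very first occurrence ends up glued to "hello world", since that line has no terminating period).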

Last edited by schneidz; 05-23-2017 at 11:07 AM. Reason: added results
 
Old 05-23-2017, 11:16 AM   #11
BW-userx
LQ Guru
 
Registered: Sep 2013
Location: Somewhere in my head.
Distribution: Slackware (15 current), Slack15, Ubuntu studio, MX Linux, FreeBSD 13.1, WIn10
Posts: 10,342

Rep: Reputation: 2242
pseudo: not genuine; sham.

so pseudo-code by itself still would not do the work. terminology-wise, pseudo-code looks like this:
Code:
if var = 'match this' 
then
  print "match var"
else
  print "no match"
This has been just a friendly FYI announcement.

Last edited by BW-userx; 05-23-2017 at 11:18 AM.
 
Old 05-23-2017, 11:21 AM   #12
Ulysses_
Senior Member
 
Registered: Jul 2009
Posts: 1,303

Original Poster
Rep: Reputation: 57
What is the "ec" in that grep command meant for?

Here's what works on schneidz's text:

Code:
cat input.txt | tr '\n' _ > temp
sed "s/.*\(..............................\).*\1.*/\1/g" temp
The output is:

Code:
  this is only a test._the end
But bummer, it does not work on large files: it does not match at all on my 1.13 MB file, yet it matches if the file is first truncated to 32000 bytes with head -c 32000.
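A possible reason, offered as an editor's guess rather than anything established in the thread, is that back-reference matching over a megabyte-long line is extremely expensive, and the regex engine may give up or hit an internal limit. One alternative that avoids back-references entirely is to hash every fixed-length window of the merged line; the window length of 30 mirrors the sed pattern above, and overlapping windows mean a long repeat is reported many times:

Code:
# Report every 30-character window of temp that occurs more than once.
# Memory use grows with the file size (one hash entry per window).
awk -v len=30 '{
    for (i = 1; i + len - 1 <= length($0); i++) {
        s = substr($0, i, len)
        if (seen[s]++) print "duplicate: " s
    }
}' temp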

Last edited by Ulysses_; 05-23-2017 at 11:22 AM.
 
Old 05-23-2017, 11:25 AM   #13
BW-userx
LQ Guru
 
Registered: Sep 2013
Location: Somewhere in my head.
Distribution: Slackware (15 current), Slack15, Ubuntu studio, MX Linux, FreeBSD 13.1, WIn10
Posts: 10,342

Rep: Reputation: 2242
Quote:
Originally Posted by Ulysses_ View Post
What is the "ec" in that grep command meant for?
words containing 'ec',
hence echo being the only match.
 
Old 05-23-2017, 11:31 AM   #14
BW-userx
LQ Guru
 
Registered: Sep 2013
Location: Somewhere in my head.
Distribution: Slackware (15 current), Slack15, Ubuntu studio, MX Linux, FreeBSD 13.1, WIn10
Posts: 10,342

Rep: Reputation: 2242
regular expressions and grep
https://www.digitalocean.com/communi...terns-in-linux

you could capture the match in a var then slip it into sed to be replaced?
pseudo-coding
Code:
# grep -o prints only the matched text; take the first hit from the file given as $1
var=$(grep -o 'search pattern' "$1" | head -n 1)
# double quotes so that $var is expanded inside the sed expression
sed "s/$var/replace with/g" "$1"
$1 being the file name taken off the command line
Code:
./script FileTOBEworkedOn

Last edited by BW-userx; 05-23-2017 at 11:39 AM.
 
Old 05-23-2017, 11:32 AM   #15
Ulysses_
Senior Member
 
Registered: Jul 2009
Posts: 1,303

Original Poster
Rep: Reputation: 57
Quote:
Originally Posted by BW-userx View Post
"then edit out the duplication manually if we want to which is the purpose of this exercise"
which supersedes sed because it is suppose to match a pattern then replace not match let you know it has something that matches then move on looking for the next match.
First you find the unknown duplicate text, then you go to your gedit or leafpad or whatever manual editor you have and decide which of the two instances you discovered you want to keep.
 
  

