LinuxQuestions.org
Download your favorite Linux distribution at LQ ISO.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 04-01-2018, 09:53 AM   #1
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 46

Rep: Reputation: 11
Fastest LCSS (Longest-Common-SubString) reporter?


Is there an utility (console tool) in *nix toolset reporting the longest "plagiarism" taking place in one document when compared to another?

Yesterday, I watched very funny comical video dramatizing the "coincidence" of plagiarizing a paragraph of Wikipedia, so how one is supposed to find that paragraph within, say, dumps.wikimedia.org_English_enwiki-20180220-pages-articles.xml 61.3 GB (65,865,333,874 bytes)?

Published on Mar 23, 2018
Two students (Vocab Malone and Jon McCray) are caught by their philosophy professor (David Wood) copying a Wikipedia article for their papers. Can atheism help them avoid charges of plagiarism and imminent disciplinary action? Let's find out!

https://www.youtube.com/watch?v=yto4jXOOen8

...

Dr. Wood:
Hmm, I guess you have a point. If the Universe is a product of chance, I just have no basis of accusing you of plagiarism.

The hatman #1:
Thank you.

Dr. Wood:
Gentlemen, I'm afraid I own you an apology, not only I withdraw my accusation, I hereby declare that no matter how similar your future papers are to Wikipedia articles I can never accuse you of cheating because the odds of you writing the exact same words that Wikipedia author came up with, no matter how many times it happens are far better than the odds of Universe being able to sport life by chance, if I can believe that the Universe formed without the designer, I can believe absolutely anything.

The hatman #1:
Apology accepted.

My amateurish attempt to write such a practical console tool reporting ALL LCSS of two given files led to writing Kamboocha revision 2h-128 - a simple (written in C) tool but with lot of potential. By no means it is the fastest, there is vectorization (within inner loop) and vertical (outer loop itself) parallelization remaining to play with. As for GPU scenarios, I am clueless what those mighty vectorizers hold. My word here is for CPU approaches, how to boost this superb etude!

The .C source is in the attached file.

Since my environment is Windows, Intel C compiler v.15.0, the compile lines that I use:

Code:
icl /O3 /arch:SSE2 Kamboocha.c /FAcs
copy/y Kamboocha.exe Kamboocha_Vanilla_Intel_v15_32bit.exe

icl /O3 /arch:SSE2 /openmp Kamboocha.c /FAcs -DCommence_OpenMP
copy/y Kamboocha.exe Kamboocha_Parallelization_Intel_v15_SSE2_32bit.exe

icl /O3 /arch:CORE-AVX2 /openmp Kamboocha.c /FAcs -DCommence_OpenMP
copy/y Kamboocha.exe Kamboocha_Parallelization_Intel_v15_AVX2_32bit.exe
With minor changes it will compile with GCC, sure of it, line 56/57 are to be commented/uncommented:
Code:
#define _WIN32_ENVIRONMENT_
//#define _POSIX_ENVIRONMENT_
I'm interested in seeing better implementations, so in order to test the performance, in the file below two biographical books are included.
Kamboocha_Intel_(32bit_64bit)_revision_2h-128.zip 11.7 MB (12,299,575 bytes)
https://drive.google.com/file/d/1Z5v...ew?usp=sharing

The benchmark is called "Mike_vs_Mickey", the result on my laptop with i5-7200u (2cores/4threads) looks like this:

Code:
C:\Kamboocha_Intel_(32bit_64bit)_revision_2h-128\64>dir

04/01/2018  04:39 PM            96,374 Kamboocha.c
04/01/2018  05:16 PM         1,859,349 Kamboocha.cod
04/01/2018  05:16 PM           131,072 Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe
04/06/2017  11:44 AM           389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
04/06/2017  11:44 AM         1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt

C:\Kamboocha_Intel_(32bit_64bit)_revision_2h-128\64>timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_
Sandro_Monetti.epub.txt
Kamboocha, revision 2h-128, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Note1: This revision implements inner-loop (horizontal) MANUAL (parallel section) multi-threading for finding LCSS.
Note2: This revision implements inner-loop (horizontal) AUTOMATICAL (for-loop) multi-threading for dumping ALL LCSS.
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Haystack Allocation of 1,195,397 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 9,563,176 bytes successful.
VectorCurr Allocation of 9,563,176 bytes successful.
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Enforcing 128 threads ...
-; Done 100%
Dumping ALL common chunks of order 38 ...
In ASCII: " wanted to be the center of attention."
LCSS = 38

Kernel  Time =    26.156 =    2%
User    Time =  4058.875 =  381%
Process Time =  4085.031 =  383%    Virtual  Memory =     23 MB
Global  Time =  1064.004 =  100%    Physical Memory =     23 MB

C:\Kamboocha_Intel_(32bit_64bit)_revision_2h-128\64>
Would like to see it in action on some 16++ threaded machine, anyone?

Only for the brave - currently I am running (included in the benchmark):
Code:
@echo Kamboochaize 15,583,440 Arabian_Nights_complete.html vs 16,968,704 Sunnah_Hadith_Quran.tar ...
timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe Arabian_Nights_complete.html Sunnah_Hadith_Quran.tar
Attached Thumbnails
Click image for larger version

Name:	DZYliBMWAAEMVzj.jpg
Views:	2
Size:	168.9 KB
ID:	27358  
Attached Files
File Type: txt Kamboocha.c.txt (94.1 KB, 2 views)
 
Old 04-01-2018, 10:09 AM   #2
Sanmayce
Member
 
Registered: Jul 2010
Location: Sofia
Posts: 46

Original Poster
Rep: Reputation: 11
Oh, in theory this revision should be able to find the paragraph which Dr. Wood found himself, BUT, the requirements are:

Code:
Haystack Allocation of 65,865,333,874 bytes successful?
EnvelopedNeedle Allocation of size-of-paper bytes successful?
VectorPrev Allocation of 8*65,865,333,874 bytes successful?
VectorCurr Allocation of 8*65,865,333,874 bytes successful?
1472 GB RAM
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Al Jazeera reporter nails it.. Dogs General 34 05-13-2010 04:37 PM
xfree86-common xserver-common xfonts-base missing in etch/lenny unev_21 Debian 2 09-11-2009 02:12 AM
php preg_replace substring of a substring senyahnoj Programming 5 12-08-2006 11:31 AM
vulnerability reporter available? BenCollver Fedora 1 09-25-2006 03:16 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 02:03 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration