LinuxQuestions.org
06-25-2019, 04:20 PM   #1
qrange
Distribution: Debian stable/testing, amd64
duplicate word files


Unfortunately, many documents are still in MS formats.

Is there a batch CLI tool that can compare the content of Word, PowerPoint, etc. files to find duplicates that differ only in their metadata?

(That is, files with the same content but a different hash because the metadata was modified.)

Thanks.
 
06-26-2019, 02:30 AM   #2
ondoho
There's a utility to extract text from Word documents: catdoc.
Write a script that combines that with diff?
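
A minimal sketch of that idea, assuming catdoc is installed and the files are old binary .doc documents. Instead of running diff on every pair, it hashes the extracted text and groups files with the same digest, which is a swapped-in technique that scales better to a batch. The function names catdoc_text and group_duplicates are made up for illustration.

```python
import hashlib
import subprocess
from collections import defaultdict


def catdoc_text(path):
    """Return the plain text of a .doc file as extracted by catdoc."""
    result = subprocess.run(["catdoc", path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def group_duplicates(paths, extract=catdoc_text):
    """Group paths whose extracted text is byte-for-byte identical.

    `extract` is any callable mapping a path to text; it defaults to
    catdoc but can be replaced for other formats.
    """
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(extract(path).encode("utf-8")).hexdigest()
        groups[digest].append(path)
    # Only groups with more than one member are duplicates.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Calling group_duplicates on a list of .doc paths returns one list per set of files with identical text. Once the candidates are narrowed down this way, diff on the extracted text is still useful for checking near-duplicates by hand.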
 
06-26-2019, 02:46 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721Reputation: 3721
I don't know of a pre-existing tool. Also, it'd be hard to do much comparison in the style you ask without actually parsing the body of the file and normalizing it in preparation for some kind of comparison.
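
To illustrate what "parsing the body and normalizing it" could look like, here is a rough sketch for one case only: the newer zip-based .docx format, which the thread does not say is the one in use. It reads word/document.xml from the archive, keeps only the text runs, collapses whitespace, and hashes the result, so the metadata parts under docProps/ never enter the comparison. The function name docx_body_digest is hypothetical, and old binary .doc files would need a converter instead.

```python
import hashlib
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML tag for text runs (<w:t>) inside word/document.xml.
W_T = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"


def docx_body_digest(path_or_file):
    """Hash only the visible text of a .docx, ignoring docProps/ metadata."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # Concatenate text runs; paragraph boundaries are deliberately ignored.
    text = "".join(node.text or "" for node in root.iter(W_T))
    # Collapse whitespace so trivial spacing differences do not matter.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two files with the same digest have the same text, but this deliberately ignores formatting, images, headers, and footers, so treat matches as candidates to confirm rather than proof of identical documents.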

LibreOffice is scriptable from within using Python and JavaScript. It handles the old MS-Word formats better than MS-Word itself.

Or you could call its APIs via external scripts using its --headless option.

So you could look at LibreOffice for a means of normalizing the document bodies for comparison. Just do it in a way that does not further disturb the originals.
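
One way this could be sketched, under some assumptions: LibreOffice is installed and its binary is reachable as soffice (on some systems it is libreoffice instead), and the documents are Writer-compatible, since the plain-text export filter used here does not apply to presentations. The conversion writes into a temporary directory, so the originals are only read. The helper names convert_command and normalized_text are made up for illustration, and reading the output as UTF-8 is also an assumption that may depend on locale.

```python
import subprocess
import tempfile
from pathlib import Path


def convert_command(source, outdir):
    """Build the soffice command that writes a plain-text copy into outdir."""
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(outdir),
        str(source),
    ]


def normalized_text(source):
    """Convert source to text in a temp dir; the original is never modified."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(convert_command(source, tmp), check=True,
                       capture_output=True)
        converted = Path(tmp) / (Path(source).stem + ".txt")
        # Assumed encoding; undecodable bytes are replaced rather than fatal.
        raw = converted.read_text(encoding="utf-8", errors="replace")
    # Collapse whitespace so layout differences do not affect comparison.
    return " ".join(raw.split())
```

The returned strings can then be hashed or diffed to find files whose bodies match. Starting LibreOffice once per file is slow, so for a large batch it may be worth passing several files to a single soffice call.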
 
All times are GMT -5.
