LinuxQuestions.org
Old 02-18-2011, 08:25 AM   #1
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Choice of programming solutions for collated document repository


Hello

LQ's advice on programming solutions for a collated document repository would be helpful.

We have documents on multiple workstations and want to collate them into a single repository to provide text search and download. So far we have implemented rsync to copy files from each workstation under a directory for each workstation on a server (incidentally providing a backup) and have set up text search using Xapian with Omega; users access it via a web browser. Still to do is to set up a system to copy files from each workstation's area on the server to the repository.

Many files are duplicated. In these cases we want to preserve the names but keep a single copy of the file; hard links can be used for that.

For each file to be copied from a workstation's area into the collated area we need to check whether it is a duplicate (file size and, if same, MD5 sum) and if so, create a hard link to the original rather than create a copy.
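
The check described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names (`md5sum`, `same_content`, `collate_file`) are made up for the example, the list of candidate originals is supplied by the caller, and both areas are assumed to sit on the same filesystem, since hard links cannot cross filesystems.

```python
import hashlib
import os
import shutil


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Compare sizes first (cheap); hash both files only when sizes match."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return md5sum(path_a) == md5sum(path_b)


def collate_file(source, destination, candidates):
    """Place source at destination in the collated area.

    If any path in candidates has identical content, destination becomes a
    hard link to it; otherwise the file is copied. Returns "linked" or
    "copied". Assumes destination does not exist yet.
    """
    os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
    for existing in candidates:
        if same_content(source, existing):
            os.link(existing, destination)  # second name, same data on disk
            return "linked"
    shutil.copy2(source, destination)  # copy2 also preserves timestamps
    return "copied"
```

Rehashing every candidate on each call is wasteful; in practice the fingerprints would be looked up rather than recomputed.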

A system to detect and replace duplicates in the collated area has been written using ruby and postgresql but the developer cannot commit to continuing this work. It does mean we have a postgresql database populated with "fingerprints" of files in the collated area.

My first priority is to get the system working; in the longer term whatever is developed must be maintainable; I do not yet know which language skills are available locally.

I am fluent in bash and competent with awk. Ruby looks nice but I have started to learn python and do not think it prudent to learn both at the same time. Python's postgresql capabilities are not settled (http://lwn.net/Articles/374627/) but may be fine for the simple usage required.

What to do? A bash solution would run very slowly but could be developed quickly. Language knowledge aside, I have found it difficult to install ruby on the server (CentOS 5.5; installed rvm but "gem" still not installed; seems a very complex system with its own package management). Python?

Best

Charles
 
Old 02-18-2011, 08:35 AM   #2
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,200

Have you looked at existing document management systems? LX-er did a review of them just a little while ago. There's also Sharepoint, if you're mainly managing MS Office documents.

Last edited by dugan; 02-18-2011 at 08:38 AM.
 
Old 02-18-2011, 08:59 AM   #3
catkin
Quote:
Originally Posted by dugan
Have you looked at existing document management systems? LX-er did a review of them just a little while ago. There's also Sharepoint, if you're mainly managing MS Office documents.
Thanks dugan

We started out going for a document management system but soon realised:
  • We didn't even have all the documents in one place as a starting point.
  • It is essential that the tags required for document management are well chosen -- and we do not have resource for doing so.
  • Tagging all the existing documents would take a long time -- and we do not have resource for doing so.
  • Our user base is neither very computerate nor trainable -- and we have a non-authoritarian organization culture which does not support ordering document producers to use a document management system.
In the light of those realisations I decided on this "quick win" of a collated document repository to provide text search and download. It also pre-positions us for document management when the organization is ready for it.

EDIT: many of the documents are MS Office but many are not.
 
Old 02-18-2011, 12:46 PM   #4
paulsm4
LQ Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Hi -

As always, the best software solution ... is usually the one you DON'T have to write from scratch.

A couple of suggestions:

FSLint

SquashFS
 
Old 02-18-2011, 10:06 PM   #5
catkin
Thanks paulsm4

fslint looks good and could be used to give us the same functionality as our ruby and postgresql proof-of-concept code. Using the MD5 sum of the first 4 K is a nice speed-up technique. fslint's being in python is a plus too.
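
The speed-up mentioned above, hashing only the first 4 K before committing to a full hash, might look something like this in Python. It is a sketch of the general idea and not fslint's actual code; the names `head_md5`, `full_md5` and `find_duplicates` are invented for the example.

```python
import hashlib
import os
from collections import defaultdict


def head_md5(path, head_bytes=4096):
    """MD5 of only the first head_bytes of the file (cheap pre-filter)."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(head_bytes)).hexdigest()


def full_md5(path, chunk_size=1 << 20):
    """MD5 of the whole file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return lists of paths whose contents are identical.

    Files are narrowed down in three passes: by size, then by the MD5 of
    the first 4 K, and only then by the MD5 of the whole file.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_head = defaultdict(list)
        for path in same_size:
            by_head[head_md5(path)].append(path)
        for same_head in by_head.values():
            if len(same_head) < 2:
                continue  # the first 4 K already differ
            by_full = defaultdict(list)
            for path in same_head:
                by_full[full_md5(path)].append(path)
            groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

Most non-duplicates are rejected by the size or head check, so the expensive full read happens only for files that are very likely to match.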

fslint doesn't address our need to copy files from each workstation's area into the collated area though. We could simply copy them and remove duplicates later but it would not be efficient.

SquashFS would be useful for distributing the collated document repository but, if I understand correctly, it is a read only file system so could not be used for the live system which will be constantly updated.
 
Old 02-18-2011, 10:48 PM   #6
paulsm4
LQ Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Hi, Catkin -

Actually, I misspoke about "SquashFS". I read about a new "remove duplicate files and compress contents" FUSE filesystem in a magazine recently. I couldn't recall the name and, when I googled, I came up with "SquashFS".

I was wrong.

The magazine article was actually about "LessFS". The magazine was Linux Pro (Jan 2011 edition). And here's a link to the project:

http://www.lessfs.com/

This link about LessFS was also cited in the LP article:

http://forums.gentoo.org/viewtopic-t...ht-lessfs.html

'Hope that helps .. PSM

Last edited by paulsm4; 02-18-2011 at 10:56 PM.
 
Old 02-18-2011, 11:17 PM   #7
catkin
Thanks for the new links PSM

LessFS and ZFS look promising for when we run out of disk space -- by which time they may have matured nicely.

Right now our ruby and postgresql proof-of-concept code can do de-duplication -- and do it quickly because previously derived file fingerprint information is kept in the database. The missing functionality is copying files from each workstation's area into the collated area using the database to a) avoid re-copying files and b) create hard links instead of duplicates.
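
To make the missing piece concrete, here is a rough Python sketch of a database-driven copy step. It is not the existing ruby code: the standard-library `sqlite3` module stands in for postgresql so that the example is self-contained, and the `collated` table, its columns, and the function names are all hypothetical.

```python
import hashlib
import os
import shutil
import sqlite3

# Hypothetical schema: one row per file already placed in the collated area.
SCHEMA = """
CREATE TABLE IF NOT EXISTS collated (
    source TEXT PRIMARY KEY,  -- path in the workstation's area
    size   INTEGER NOT NULL,
    mtime  REAL NOT NULL,
    md5    TEXT NOT NULL,
    dest   TEXT NOT NULL      -- path in the collated area
);
CREATE INDEX IF NOT EXISTS collated_fingerprint ON collated (size, md5);
"""


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collate(db, source, dest):
    """Bring one file into the collated area; return what was done."""
    stat = os.stat(source)
    seen = db.execute(
        "SELECT size, mtime FROM collated WHERE source = ?", (source,)
    ).fetchone()
    if seen == (stat.st_size, stat.st_mtime):
        return "skipped"  # a) unchanged since the last run: no re-copy

    digest = md5sum(source)
    match = db.execute(
        "SELECT dest FROM collated WHERE size = ? AND md5 = ? AND source != ?",
        (stat.st_size, digest, source),
    ).fetchone()

    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    if os.path.lexists(dest):
        # Remove rather than overwrite, so other hard links keep their data.
        os.remove(dest)
    if match and os.path.exists(match[0]):
        os.link(match[0], dest)  # b) hard link instead of a duplicate
        action = "linked"
    else:
        shutil.copy2(source, dest)
        action = "copied"

    db.execute(
        "INSERT OR REPLACE INTO collated VALUES (?, ?, ?, ?, ?)",
        (source, stat.st_size, stat.st_mtime, digest, dest),
    )
    db.commit()
    return action
```

Using size and modification time to decide whether a file needs re-hashing is the same shortcut rsync takes by default; a file changed without altering either would be missed.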

Pending further suggestions, I'm going to try to install a working ruby system and develop the proof-of-concept ruby code.
 
Old 02-20-2011, 12:10 PM   #8
catkin
Update.

Ruby now set up and scripts being developed. It's a nice language, even at the painful early learning stage.

Apparently PHP skills are available locally so I started another thread asking about PHP's suitability for this type of work.
 
  

