Hello
LQ's advice on programming solutions for a collated document repository would be helpful.
We have documents on multiple workstations and want to collate them into a single repository to provide text search and download. So far we have implemented rsync to copy files from each workstation under a directory for each workstation on a server (incidentally providing a backup) and have set up text search using Xapian with Omega; users access it via a web browser. Still to do is to set up a system to copy files from each workstation's area on the server to the repository.
Many files are duplicated. In these cases we want to preserve the names but keep a single copy of the file; hard links can be used for that.
For each file to be copied from a workstation's area into the collated area we need to check whether it is a duplicate (file size and, if same, MD5 sum) and if so, create a hard link to the original rather than create a copy.
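A minimal sketch of that check, assuming Python 3 and assuming the workstation areas and the collated area sit on the same filesystem (hard links cannot cross filesystems). The function names (md5sum, build_index, collate_file) and the in-memory size index are hypothetical, for illustration only; the existing postgresql fingerprints could presumably stand in for the index rather than re-hashing files. The destination path is assumed not to exist yet.

```python
import hashlib
import os
import shutil


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(repo_root):
    """Map file size -> list of regular-file paths under the collated area."""
    by_size = {}
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size.setdefault(os.path.getsize(path), []).append(path)
    return by_size


def collate_file(src, dest, by_size):
    """Place src at dest: hard link to an identical file if one exists, else copy.

    Returns "linked" or "copied". Assumes dest does not already exist.
    """
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    size = os.path.getsize(src)
    candidates = by_size.get(size, [])
    if candidates:
        # Only hash when at least one existing file has the same size.
        src_sum = md5sum(src)
        for existing in candidates:
            if md5sum(existing) == src_sum:
                os.link(existing, dest)  # new name, same data on disk
                return "linked"
    shutil.copy2(src, dest)  # first copy of this content; keep timestamps
    by_size.setdefault(size, []).append(dest)
    return "copied"
```

Typical use would be to call build_index once on the collated area, then call collate_file for each file found under a workstation's area, passing the same index each time so that newly copied files become candidates for later duplicates.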
A system to detect and replace duplicates in the collated area has been written using ruby and postgresql but the developer cannot commit to continuing this work. It does mean we have a postgresql database populated with "fingerprints" of files in the collated area.
My first priority is to get the system working; in the longer term whatever is developed must be maintainable; I do not yet know which language skills are available locally.
I am fluent in bash and competent with awk. Ruby looks nice but I have started to learn python and do not think it prudent to learn both at the same time. Python's postgresql capabilities are not settled (http://lwn.net/Articles/374627/) but may be fine for the simple usage required.
What to do? A bash solution would run very slowly but could be developed quickly. Language knowledge aside, I have found it difficult to install ruby on the server (CentOS 5.5; I installed rvm but "gem" is still not installed; it seems a very complex system with its own package management). Python?
Best
Charles