LinuxQuestions.org
Old 11-03-2010, 12:34 AM   #1
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Searching all computers on a LAN (OSS and Yacy)


Hello

The issue: the organisation has accumulated many/important documents, currently on personal computers. Looking for documents is difficult. We're implementing a document management system but it will take time, especially for users to change working habits.

The opportunity: a "quick win" with a solution to search documents on all computers on the internal network, available to users of each computer.

Candidate solutions: after researching products -- including Datapark, Docfetcher, Flax, Hadoopi, Lucene, Nutch, Omega, Open Search Server (OSS), OpenPipeline, Recoll, Sixsearch, Sphinx, Tracker, Xavian, Xesam (formerly wasabi), Yacy and ore.xapian -- only two candidates were identified that run on Linux, OSX and Windows: Open Search Server (OSS) and Yacy.

The questions: Are there any other candidate solutions worth considering? What is the real-world experience of using the candidates (neither is primarily designed for this task; both are primarily web search engines)?

Best

Charles

Last edited by catkin; 11-03-2010 at 08:42 AM. Reason: Added "(OSS and Yacy)" to title
 
Old 11-04-2010, 09:26 AM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,520

From my personal experience slaying that particular dragon, I'd say there is no easy answer.

First, you'd have to know what users name their documents...if they do at all. Some put extensions on them, some don't. Are they saved as OpenOffice, XLS, or something else? In what folders, and where? External USB drive(s)? CDs/DVDs? The user part of the equation is the most difficult. Even if you find a software-based solution that MOSTLY works, it'll have to scan the user's entire drive and copy all the documents off. If you run the scan at login time, chances are VERY good that they'll get tired of waiting and abort the scan, reboot the box, etc.
Second is the type of clients...Windows/Mac/Linux, or all three?

In my opinion, the only GOOD way to accomplish this involves buy-in from management and a clear plan that people agree to. Get with your boss(es) first, and draw something up that's reasonable. Make sure that each user has a network resource that is private and accessible only to them, and a 'shared' drive that their department can access. Make it very clear that those resources are backed up on a regular schedule, and emphasize that this is a good thing...if Frank is out one day and has an important document, they can still access it, or have an admin copy it from Frank's personal share to the department share. Since the data is now stored on server-class hardware, they don't have to worry about individual PCs crashing and losing their work.

Then comes the painful part...go department by department, presenting this doc and plan to each manager, and explaining what's happening, and why, and how they'll benefit. And make it very clear that they CANNOT decline. The week before you begin work, have the manager tell the employees to start moving their files around. Put files they want personal in a PERSONAL folder, and ones the department needs in a SHARED folder. The actual work is easy, and can be done by a temp or an intern...go to each desk, one at a time, and make the changes. Attach them to the network, load any clients/software they need to do it, and then copy the files over accordingly, doing a final search to make sure you got them all. Repeat for each user. When you're done, you'll have private folders, and a shared drive with individual directories under them for each user. Make SURE the users understand that they are NOT to save ANYTHING to their personal hard drives, unless for some reason the network is down, and even then, that's a temporary fix, just to keep them working until the network is back up.
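The "final search to make sure you got them all" step could be sketched roughly as below. This is only an illustration in Python, not something anyone in the thread posted: the extension list and the function name `find_documents` are assumptions, and any file saved without an extension would be missed, so treat the output as a starting point rather than a guarantee.

```python
import os
import sys

# Hypothetical list of document extensions; adjust to what users actually save.
DOC_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".odt", ".ods", ".pdf"}


def find_documents(root, extensions=DOC_EXTENSIONS):
    """Return a sorted list of paths under root whose extension is in extensions."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so REPORT.DOC and report.doc both match.
            ext = os.path.splitext(name)[1].lower()
            if ext in extensions:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    # Usage: python find_documents.py <directory>; defaults to the current directory.
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_documents(start):
        print(path)
```

Running it against a user's home directory after the copy gives a list to compare with what landed on the network share.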

Once you get things into a central location, you can then implement a document-search solution, and have some flexibility on how you do it. Until then, there are far too many variables to make it work.
 
Old 11-04-2010, 10:32 AM   #3
catkin
Many thanks for the benefit of your experience TBOne

You have saved me days, perhaps months, of wasted effort. The insight that this is more of a management exercise than a technical one is not good news, though. Have you ever tried cat-herding?!

A high proportion of our users have portables so synchronisation will be essential. For backup and resilience we might as well treat the desktops the same way.

The personal computers are ~90% Windows, and ~5% Linux and OSX. There's no server yet (that's another project).

Networking is peer-to-peer; the installation has grown ad-hoc. Anything more sophisticated is thus a big cultural change and early wins that all can see the benefit of are very desirable for credibility and generating enthusiasm. That makes your advice even more valuable than just saving wasted effort.

Best

Charles
 
Old 11-04-2010, 11:19 AM   #4
TB0ne
Quote:
Originally Posted by catkin
Many thanks for the benefit of your experience TBOne

You have saved me days, perhaps months, of wasted effort. The insight that this is more of a management exercise than a technical one is not good news, though. Have you ever tried cat-herding?!
Yes...I find a large club to be useful.
Quote:
A high proportion of our users have portables so synchronisation will be essential. For backup and resilience we might as well treat the desktops the same way.
That's actually not that difficult. The desktops are a no brainer....just map them when they log in, and you're done. The portables, just set a policy that ANY documents are to be saved in a particular folder, then write a small script that runs every now and then, to see if there's a network connection. If so, attach the network drives, and sync up.
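A rough sketch of such a script, in Python. It is an illustration under several assumptions, not a tested deployment: it uses rsync over SSH instead of attaching network drives, the host name `fileserver`, port 22 and the paths are placeholders, and on Windows an rsync port would have to be installed first.

```python
import socket
import subprocess


def server_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def build_sync_command(source_dir, destination):
    """Build an rsync command that transfers only files that have changed."""
    # -a preserves timestamps and permissions, -z compresses in transit.
    # The trailing slash makes rsync copy the directory's contents.
    return ["rsync", "-az", source_dir.rstrip("/") + "/", destination]


def sync_if_online(host, port, source_dir, destination):
    """Sync source_dir to destination if the server answers; return True if synced."""
    if not server_reachable(host, port):
        return False
    subprocess.run(build_sync_command(source_dir, destination), check=True)
    return True


if __name__ == "__main__":
    # Placeholder values: replace with the real server and the agreed documents folder.
    sync_if_online("fileserver", 22, "/home/frank/Documents",
                   "fileserver:/srv/personal/frank/")
```

Scheduled from cron or a login hook, this does nothing when the portable is off the network and catches up the next time the server is reachable.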
Quote:
The personal computers are ~90% Windows, and ~5% Linux and OSX. There's no server yet (that's another project).
Windows 7 has a backup utility that just backs up what changed. Linux has rsync, and I'm not sure about OSX, but you've got the tools you need already. A server is a small thing, but get something bigger than you think you'll need, then go bigger. Don't skimp on disk resources and RAID capabilities. This has to run 24/7/365, so make sure it has multiple NICs, connected to different core routers.
Quote:
Networking is peer-to-peer; the installation has grown ad-hoc. Anything more sophisticated is thus a big cultural change and early wins that all can see the benefit of are very desirable for credibility and generating enthusiasm. That makes your advice even more valuable than just saving wasted effort.
Just touting the "if you save it on the network, it's backed up for you", is a big one. Anyone who has lost days/weeks of work from a hard drive failure will automatically love it. You'll wind up saving a ton of $$ too, in the long run, but it's a painful process to go through.

Best thing: have a clear, reasonable plan, that's approved by upper management. After that, the department managers won't have any wiggle room, and will have to comply, which makes your job easier.
 
Old 11-04-2010, 12:44 PM   #5
catkin
Thanks again TBOne

You're envisaging a bigger and more conventional organisation than it actually is. There are fewer than 20 users and very little management -- it's peer-to-peer like the network.

The network devices are a switch and a WiFi router so no redundancy there but easy to replace.

The cost of a high-availability server cannot be justified; having a good backup and a DR plan will have to suffice but mirrored HDDs are on the list.

I don't yet know what the existing backup strategy is, beyond knowing that they're using Zmanda and the sysadmin (for the whole building) is on the ball.

I never wanted to be a salesman but an element of that is going to be crucial.

Fun times ahead ...

Best

Charles
 
Old 11-04-2010, 02:27 PM   #6
TB0ne
Quote:
Originally Posted by catkin
Thanks again TBOne
You're envisaging a bigger and more conventional organisation than it actually is. There are fewer than 20 users and very little management -- it's peer-to-peer like the network.

The network devices are a switch and a WiFi router so no redundancy there but easy to replace.
No problems there, and I didn't know the size of the organization. You're in a MUCH easier position, than if you had the 2,500+ I had to deal with.
Quote:
The cost of a high-availability server cannot be justified; having a good backup and a DR plan will have to suffice but mirrored HDDs are on the list.

I don't yet know what the existing backup strategy is, beyond knowing that they're using Zmanda and the sysadmin (for the whole building) is on the ball.
That'll definitely suffice. Disk space is very cheap these days, and my personal preference is always to go with RAID5. One disk fails, and the system goes to the online spare, with zero downtime. Replace the failed disk later. Fairly cheap, too...a RAID5 SATA controller is about $350. Cheap disks are abundant; I'd spend the $$$ on the disk/controllers, rather than on CPU, if you're only looking for a file/print server. Not going to be heavily taxed.
Quote:
I never wanted to be a salesman but an element of that is going to be crucial.

Fun times ahead ...
True...but sales isn't really involved, if you get a clear plan before you go in to talk to them. Lay the risks out, VERY clearly, and explain how your solution will address them. If they say "No", then get them to physically sign-off on something saying that. When things die, and they come to you, drag that document out, and tell them you'll do your best, but can make no promises. Make sure they can't lay it at your feet.
 
Old 11-05-2010, 12:16 AM   #7
catkin
Thanks TBOne
Quote:
Originally Posted by TB0ne
No problems there, and I didn't know the size of the organization. You're in a MUCH easier position, than if you had the 2,500+ I had to deal with.
Certainly am -- a much smaller organisation changes many aspects of the project for the easier.
Quote:
Originally Posted by TB0ne
That'll definitely suffice. Disk space is very cheap these days, and my personal preference is always to go with RAID5. One disk fails, and the system goes to the online spare, with zero downtime. Replace the failed disk later. Fairly cheap, too...a RAID5 SATA controller is about $350. Cheap disks are abundant; I'd spend the $$$ on the disk/controllers, rather than on CPU, if you're only looking for a file/print server. Not going to be heavily taxed.
Choices are constrained by:
  1. Rack space (going for 1U).
  2. Local availability (a few foreign suppliers will ship to India but it's expensive and makes replacement slow).
  3. Desire for low electrical power consumption (aiming for lower environmental impact, plus grid cuts and low voltage mean the UPS battery bank's capacity is challenged).
#3 means an Atom-based system. That's OK -- the D525 should be powerful enough for the projected usage of a document management system and a ticketing system along with file-serving -- and the ICH9R (NH82801IR) chipset provides RAID 5 with hotplugging, but I have yet to find a 1U case locally that houses more than two HDDs. Mirroring/RAID 1 should also allow disk failure and replacement with zero down time; there would be no parity checking but maybe that's "good enough" -- or it may be possible to fit a third HDD given the motherboard's mini-ITX form.
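To make the RAID 1 versus RAID 5 trade-off concrete, here is a small Python sketch of the idea behind RAID 5 parity and of the usable capacity of each layout. It is a simplified model, not how any particular controller or the ICH9R implements it: real arrays rotate parity across disks and work on fixed-size stripes, and the function names and the 500 GB disk size are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_capacity(level, disks, disk_size):
    """Usable capacity for a simple RAID 1 mirror or a RAID 5 array."""
    if level == 1:
        # Every disk holds a full copy, so capacity is that of one disk.
        return disk_size
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        # One disk's worth of space is spent on parity.
        return (disks - 1) * disk_size
    raise ValueError("only RAID 1 and RAID 5 are modelled here")


if __name__ == "__main__":
    data = [b"docs", b"mail", b"logs"]      # one block per data disk
    parity = xor_blocks(data)               # what the parity block would hold
    # Simulate losing the second disk: XOR of the survivors and parity rebuilds it.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
    print(usable_capacity(1, 2, 500), usable_capacity(5, 3, 500))
```

With two 500 GB disks a mirror gives 500 GB usable; a third disk in RAID 5 would double that to 1000 GB while still surviving a single disk failure.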
 
Old 12-11-2011, 10:31 AM   #8
catkin
Update

The project has reached enough maturity to be published in case anyone else can make use of it. It's called docoll (document collation) and is on Savannah.

If you are curious, you can download the documentation tarball (docoll_documentation-0.7.3.tgz) from the download page and get an overview from the "docoll system introduction" OpenOffice.org Writer file.

The final solution uses cwrsync and rsync to synchronise files from the client computers to the server; Ruby, bash and PostgreSQL to build a collation from the individual clients' files; Xapian Omega's omindex to index the collation; and Xapian Omega to drive the web browser search UI.
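For readers who want a feel for the indexing step, the following Python sketch builds an `omindex` command line for a directory of collated files. It is not taken from docoll, whose own scripts are in Ruby and bash; the paths and the function name are placeholders, and only the `--db` and `--url` options of omindex are used.

```python
import subprocess


def build_omindex_command(collation_dir, database_dir, base_url):
    """Build an omindex command that indexes collation_dir into database_dir."""
    # --db names the Xapian database to update; --url is the URL prefix under
    # which the indexed files will be reachable from the search results page.
    return ["omindex", "--db", database_dir, "--url", base_url, collation_dir]


def index_collation(collation_dir, database_dir, base_url):
    """Run omindex and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_omindex_command(collation_dir, database_dir, base_url),
                   check=True)


if __name__ == "__main__":
    # Placeholder paths: point these at the real collation and Omega database.
    index_collation("/srv/docoll/collation",
                    "/var/lib/omega/data/default",
                    "/docs/")
```

Re-running it after each synchronisation pass would keep the search index in step with the collated files.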
 