LinuxQuestions.org - Need a file-based tagging file organizer

Need a file-based tagging file organizer

I've been looking for a while, and I cannot find an app, or system plugin, or desktop setting, which will allow me to take a set of files, and tag them according to my wishes. Let me add a set of filenames to this application, tag each as "bill," "receipt," "medical," "auto," etc., and I will be happy.

So far, my solution is Referencer, but my use case is very much not the goal of the project. Referencer is a way to organize bibliographical documents in a research project, not a general-purpose organizer for personal scanned documents.

How do other people do this? How do you organize the scanned bills and other stuff you get in the mail? How do you take a stack of files and categorize them with useful tags?

In KDE4, I think nepomuk (the general desktop search program) does this. http://nepomuk.kde.org/discover/user

What desktop (name and version) are you using and what distro (name and version)?

I don't do much with Document Management, but I probably should. Anyway, I did some searching and found this page:

Linux User & Developer: OpenOffice.org Base--No Frills Document Management

It's a bit of a do-it-yourself system, which could be good or bad depending on your persuasion. I could swear I've seen a pre-packaged, tag-based document management system somewhere though. I'll probably do a little more searching just in case.

EDIT: I looked a little more closely at the description, and I can't guarantee that DocMGR allows for tag-based organization. There is reference to "keywords" in some of the documentation, but no clear indication that "keywords" and "tags" are equivalent.

Also came across this blog post: Cool Web-based Software-DocMGR. And just a tad more digging turned up the DocMGR homepage.

Again, I don't do much document management. So I don't know how good/bad either of these are or if they fit your needs. Just pointing them out in case your searching has not turned them up.

Quote:

Originally Posted by pljvaldez (Post 4588766)

In KDE4, I think nepomuk (the general desktop search program) does this. http://nepomuk.kde.org/discover/user

What desktop (name and version) are you using and what distro (name and version)?

Ubuntu at work, Mint at home, and Debian on the server. But it shouldn't matter. If the solution is that tightly coupled to the desktop, it isn't a solution to this problem. Referencer does fit the puzzle, only it is not designed for my use case. I'm trying to make sure there isn't an obvious solution before I start dedicating development time to Referencer to mold it to my needs.

Quote:

Originally Posted by Dark_Helmet (Post 4588773)

It is often good, but like most of us, my time is valuable. I stopped reading the link after it mentioned "open office base." OObase is interesting, but I'm not going to develop a personal Microsoft Access-style solution around OO just because I can. It seems that this problem should be more common. I get paper in the mail, I scan it into a computer, and I want to organize it. How do I do this?

Re: time being valuable. Sure, I understand.

The last two options I'll throw at you are from this Open Source Document Management blog post. The link to KnowledgeTree in the article is dead. But the link to jLibrary is still good. Scanning over the description, it looks like jLibrary supports customizable meta information for files.

I did find (what I assume is) the open source Knowledge Tree document management system on Sourceforge.

I did find this novel approach of a "filesystem" for tags. http://www.tagsistant.net/

I have been working on a general file tagging program for a little while now. It's called TMSU and works by providing a tool with which you can tag your files. It also then lets you mount a tag based view of your files so that you can use tags to access your files from any other program.

(It's GPL3 and tested only on Linux at present, though it should theoretically work on BSD too. I've had a report it's not yet working on OSX. Windows port planned but not started yet.)

---------- Post added 06-07-12 at 11:07 AM ----------

http://www.tmsu.org/

Edit: just noticed the link to 'tagsistant' above. I have completely independently taken an almost identical path to that tool with my own. Shame I didn't find that before I started work on tmsu!

Add one more to the list =). Sorry for the necro, but I'm wondering if this is still an issue and whether the following seems like a good solution for it: (I can't post links yet, but search for 'arch linux hitagiFS' for the thread) (disclaimer: this is my project). In short, it provides a tag-based general file organization system based on hard links.

The reason for this shameless plug is that I'm looking to see if there is a need for the project I'm working on. If no one needs it, then I won't be as motivated getting it into shape. If there IS a need for it, I'll be glad to work on it, and hopefully others can get some use out of it.

I've looked at tagsistant, but it wasn't for me. Plus, it looks abandoned now. I haven't seen TMSU before, but it looks similar. The difference is that mine (hitagiFS) is much more transparent, relying solely on file system hard links and symlinks. There's no database or portability issues. If you use it for a while and decide that you don't like it, you can ditch the program, but the directory structure is left the same.
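For readers unfamiliar with the hard-link idea described above, here is a minimal sketch of how it can work in general. It is not hitagiFS's actual code: the function names `tag_file` and `tags_of` and the one-directory-per-tag layout are assumptions made for illustration. Hard links only work within a single filesystem, so the tag root has to live on the same volume as the files.

```python
import os

def tag_file(path, tag_root, tag):
    """Hard-link `path` into tag_root/<tag>/, creating the tag directory if needed."""
    tag_dir = os.path.join(tag_root, tag)
    os.makedirs(tag_dir, exist_ok=True)
    link = os.path.join(tag_dir, os.path.basename(path))
    if not os.path.exists(link):
        os.link(path, link)  # second name for the same inode, no data copied
    return link

def tags_of(path, tag_root):
    """Return the tags whose directory holds a link to the same inode as `path`."""
    target = os.stat(path)
    found = []
    for tag in sorted(os.listdir(tag_root)):
        tag_dir = os.path.join(tag_root, tag)
        if not os.path.isdir(tag_dir):
            continue
        for name in os.listdir(tag_dir):
            st = os.stat(os.path.join(tag_dir, name))
            if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
                found.append(tag)
                break
    return found
```

Because every tag is just another directory entry for the same inode, the layout stays browsable with plain `ls` after the tool is gone, and deleting the original name does not delete the data while a tagged link remains.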

The reason I'm putting this here is (again) I would like to see if there's still interest in yet another (although quite different) tag-based file organization system. If not, I won't work as hard on it. As much as I like contributing, I'm not going to put in hours if no one'll appreciate it =).

Quote:

Originally Posted by darkfeline (Post 4867653)

I can't post links yet, but search for 'arch linux hitagiFS' for the thread) (disclaimer: this is my project).

Here, let me add that link for you: https://github.com/darkfeline/hitagiFS

@darkfeline: I'm working on release 0.6 of Tagsistant these very days. Why do you think it's abandoned?

Hi Tx0

Sorry if I was wrong. I think the last time I checked, the last news item was from a while back, and the documentation and everything seemed incomplete/old. But I seem to be wrong.

I think your tagsistant is cool, and it was one of the options I considered for my needs, but the documentation is a little messy. Could you explain the internals of tagsistant a little more? As I'm working on my program (now called "dantalian" because the first name was ill-chosen), I think there may be some duplication of effort here.

Are you storing the files in MySQL, or just the paths to the files? How do you handle filenames (I heard something about unique names for the entire database, but that may be wrong/outdated)?

P.S. I just read the 0.6 howto and I must say it looks a lot better than when I looked before. I am considering dropping my own project, with mixed feelings, but I still think there's a fundamental difference in our approaches to the problem. For example, the deduplication: What if I want two separate identical files (because I want to edit one of them in the future)? It seems like you can't do that with tagsistant.

I'm also interested in your backend, and performance with huge numbers of tags/files (100,000 tags with as many files each, for example), as one programmer asking another for advice.

@darkfeline: I noticed just now that your post dates to January 2013 and I'm not sure I published news about 0.6 before February, so it's perfectly possible that you perceived Tagsistant as an abandoned project. :) The good news is: it isn't.

There are still minor issues with Tagsistant 0.6 that I'm fixing in the SVN repository, publishing a new release candidate every week or two. Yesterday I changed a callback used to retrieve an integer from an SQL query because it used arbitrarily chosen libDBI functions to get the number, while now the callback checks the return type and uses the proper libDBI function to fetch it.

The documentation on the site is largely outdated because it targets the 0.2 release which I don't support any more. As you noticed, the 0.6 release has a long howto here: http://www.tagsistant.net/documents-...tant/0-6-howto.

I'm not storing the file content in MySQL, just the name of the file. The file is actually stored in the archive/ directory inside the repository. I plan to organize archive/ in subfolders based on the inode of the objects stored to avoid hogging the archive/ directory and slowing down its browsing.

The issue about unique names in the filesystem is related to Tagsistant 0.2 (another reason for dropping it in favour of Tagsistant 0.6).

You are right about deduplication: if two files with the same content (same MD5 hash) are created, the second gets deleted and its tag-set gets transferred to the first copy. So it's not possible to edit just one copy: both are altered because they're one.

This is also a bit rough because it can confuse the user: if two identical files named A.jpg and B.jpg are copied into the tag1/ and tag2/ directories respectively, after deduplication the file in tag2/ is called A.jpg too! But I'll address this in a future release.
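The deduplication behaviour described above can be illustrated with a small in-memory sketch. This is not Tagsistant's implementation: the `Archive` class and its fields are hypothetical, and only the rule itself (same MD5 hash means the second copy is dropped and its tags move to the first) comes from the post.

```python
import hashlib

class Archive:
    """Toy archive that keeps one copy per MD5 hash and merges tag sets."""

    def __init__(self):
        self.by_hash = {}  # md5 hex digest -> name of the copy that was kept
        self.tags = {}     # kept name -> set of tags

    def add(self, name, data, tags):
        """Store `data` under `name`, or merge `tags` into an identical existing file."""
        digest = hashlib.md5(data).hexdigest()
        kept = self.by_hash.get(digest)
        if kept is None:
            # First time this content is seen: keep it under its own name.
            self.by_hash[digest] = name
            self.tags[name] = set(tags)
            return name
        # Duplicate content: drop the new copy, transfer its tags to the first one.
        self.tags[kept] |= set(tags)
        return kept
```

Adding A.jpg with tag1 and then an identical B.jpg with tag2 returns "A.jpg" both times and leaves A.jpg carrying both tags, which is the naming surprise mentioned above.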

About performance: this is hitting a nerve. The biggest load in every query is due to:

1. parsing the query into tokens
2. building the tag tree (a tree representing the tags involved and the unions made with +/)
3. reasoning over the tag tree (that is, finding related tags)
4. building the corresponding SQL query (just for readdir)
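As a rough illustration of the parsing and tag-tree steps in that list, the sketch below splits a slash-separated query into groups of tags, with `+` starting a new group that is unioned with the others. The exact query syntax, the function names `parse_query` and `evaluate`, and the in-memory `index` dictionary are all assumptions; the reasoning step and the SQL generation are left out.

```python
def parse_query(path):
    """Split 'a/b/+/c' into [['a', 'b'], ['c']]: a union of tag groups."""
    tokens = [t for t in path.split("/") if t]
    groups, current = [], []
    for tok in tokens:
        if tok == "+":
            # '+' closes the current group and starts the next one.
            if current:
                groups.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        groups.append(current)
    return groups

def evaluate(groups, index):
    """Intersect the file sets inside each group, then union the groups.

    `index` maps a tag name to the set of files carrying that tag.
    """
    result = set()
    for group in groups:
        sets = [index.get(tag, set()) for tag in group]
        result |= set.intersection(*sets)
    return result
```

With an index such as `{"bill": {"a", "b"}, "2012": {"b", "c"}, "medical": {"d"}}`, the query `bill/2012/+/medical` selects the files tagged both bill and 2012, plus everything tagged medical.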

The second and third steps are the most SQL intensive. So I decided to create a query cache which runs through the steps above only the first time, reuses the cached query from the second time on, and deletes a cache entry as soon as a new relation involves at least one of the tags included in that entry.

This is giving a 5-10x improvement! But I still don't have metrics about 100K tags, so I can't really answer your question.
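The caching strategy described here can be sketched in a few lines. This is only a model of the idea, not Tagsistant's code: the `QueryCache` class, its `compute` callback (assumed to return the tags a query involves together with its result), and the `relation_added` hook are hypothetical names.

```python
class QueryCache:
    """Cache query results and drop entries when a related tag changes."""

    def __init__(self, compute):
        # compute(query) -> (iterable of tags involved, result); assumed expensive.
        self._compute = compute
        self._entries = {}  # query string -> (frozenset of tags, result)
        self.misses = 0     # number of times the expensive path was taken

    def get(self, query):
        """Return the cached result, computing it only on the first request."""
        if query not in self._entries:
            self.misses += 1
            tags, result = self._compute(query)
            self._entries[query] = (frozenset(tags), result)
        return self._entries[query][1]

    def relation_added(self, tag_a, tag_b):
        """Invalidate every entry that involves either tag of a new relation."""
        stale = [q for q, (tags, _) in self._entries.items()
                 if tag_a in tags or tag_b in tags]
        for q in stale:
            del self._entries[q]
```

Keying invalidation on the tags an entry touched keeps unrelated queries cached, at the cost of storing the tag set alongside each result.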

Quote:

Originally Posted by darkfeline (Post 4867653)

I haven't seen TMSU before, but it looks similar. The difference is that mine (hitagiFS) is much more transparent, relying solely on file system hard links and symlinks. There's no database or portability issues. If you use it for a while and decide that you don't like it, you can ditch the program, but the directory structure is left the same.

TMSU does not work this way. It does not store any files (only metadata) in its database and does not alter the original filesystem in any way, so you can likewise ditch TMSU and start up exactly where you started (only with a marginally shorter life). I think I'll throw up a comparison table on the TMSU wiki so that prospective users can pick the tool most appropriate for them.
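To make the metadata-only approach concrete, here is a minimal sketch of a tag database that records paths and tags but never touches the files themselves. The three-table layout and the function names `open_db`, `add_tags` and `files_with` are assumptions for illustration and are not taken from TMSU's source.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create (or open) a database holding only file paths, tags and their links."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS file (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS tag  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS file_tag (
            file_id INTEGER NOT NULL REFERENCES file(id),
            tag_id  INTEGER NOT NULL REFERENCES tag(id),
            PRIMARY KEY (file_id, tag_id));
    """)
    return db

def add_tags(db, path, *tags):
    """Record that the file at `path` carries each of `tags`."""
    db.execute("INSERT OR IGNORE INTO file (path) VALUES (?)", (path,))
    for name in tags:
        db.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        db.execute(
            "INSERT OR IGNORE INTO file_tag (file_id, tag_id) "
            "SELECT f.id, t.id FROM file f, tag t WHERE f.path = ? AND t.name = ?",
            (path, name))
    db.commit()

def files_with(db, *tags):
    """Return the paths that carry all of the given tags, sorted by path."""
    placeholders = ",".join("?" * len(tags))
    rows = db.execute(
        "SELECT f.path FROM file f "
        "JOIN file_tag ft ON ft.file_id = f.id "
        "JOIN tag t ON t.id = ft.tag_id "
        f"WHERE t.name IN ({placeholders}) "
        "GROUP BY f.id HAVING COUNT(DISTINCT t.id) = ? "
        "ORDER BY f.path",
        (*tags, len(tags)))
    return [row[0] for row in rows]
```

Deleting the database file removes every tag and leaves the tagged files exactly as they were, which is the property described above.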

Found these links to TagFileSystem and related projects...
http://code.google.com/p/tagfilesystem/
http://code.google.com/p/tagfilesyst...SummaryOfTagFS
(note: I haven't tried anything out as of this posting)