LinuxQuestions.org
Old 10-08-2010, 02:33 PM   #1
unihiekka
Member
 
Registered: Aug 2005
Distribution: SuSE Linux / Scientific Linux / [K|X]ubuntu
Posts: 273

Rep: Reputation: 32
database choice advice


Hi!

I want, out of personal interest and curiosity, to analyse loads of texts that I have in plain text files (Unicode). My programme (C++) should read all the words, count them and do some statistical analyses per word; it should also count the different characters in the texts and do some more statistical analyses with these. I think it's easiest to create one database for the characters and a separate one for the words, with entries for e.g. "possible positions in a sentence", "possible positions within words", "number count", etc.

I admit that I know very little about databases, so what I'm basically asking is what database would you advise me to use and why. The database should be accessed quite often and preferably be extremely fast, but I think that goes without saying. A cross-platform database would be very nice, but it should not be a main concern, as most of my development will be on Linux (and for Linux/myself).

Thanks in advance for sharing your thoughts.
 
Old 10-08-2010, 06:20 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
There's no simple answer; in terms of text data and lexical analysis
I'd recommend PostgreSQL.

Have a look at their specs.

[edit]
Forgot to mention - you can install Pl/R, which gives you
an interface between Postgres and the R statistics package
[/edit]


Cheers,
Tink

Last edited by Tinkster; 10-08-2010 at 07:39 PM.
 
Old 10-08-2010, 08:40 PM   #3
leejohnli
Member
 
Registered: Sep 2010
Posts: 66

Rep: Reputation: 2
What about Oracle?
 
Old 10-09-2010, 12:47 AM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by leejohnli
What about Oracle?
It's fat, not open-source, and it requires a team of
DBAs to look after it ... ;}


Cheers,
Tink
 
Old 10-09-2010, 01:48 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,424

Rep: Reputation: 2823
Oracle is an enterprise solution and very unlikely to be of use in this scenario. I would use Tink's suggestion or perhaps MySQL.
 
Old 10-09-2010, 02:00 AM   #6
sohail0399
Member
 
Registered: Oct 2008
Location: Pakistan, Islamabad
Distribution: CentOS, Fedora, Solaris
Posts: 154

Rep: Reputation: 23
This is a very broad question; the right database depends on your working scenario.

I have seen MySQL on many Linux systems, for example in web hosting setups; it is fast and has no licensing issues.
Enterprise solutions and telecommunication companies, however, use Oracle and Informix IDS databases to manage millions and billions of log entries and data records. ERP solutions also run on Oracle.

Yes, Oracle requires a license for the Enterprise Edition, but you can use and test its Personal Edition.

It depends on what kind of solution you require.
My experience: if you ask any DBA this question, he will recommend the database he has experience with, because each DB has its own functionality and area of expertise.

I have also worked with monitoring tools that use Round Robin DB.
 
Old 10-09-2010, 02:29 AM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by sohail0399
This is a very broad question; the right database depends on your working scenario.

I have seen MySQL on many Linux systems, for example in web hosting setups; it is fast and has no licensing issues.
Enterprise solutions and telecommunication companies, however, use Oracle and Informix IDS databases to manage millions and billions of log entries and data records. ERP solutions also run on Oracle.
That's a fairly narrow view ... The ever-popular Skype uses PostgreSQL
as its backend, and contributes to the code-base.
Big players use it, too ... see this, for example:
http://www.bull.us/liberatedb/succes...PostgreSQL.php


Cheers,
Tink
 
Old 10-09-2010, 02:44 AM   #8
unihiekka
Member
 
Registered: Aug 2005
Distribution: SuSE Linux / Scientific Linux / [K|X]ubuntu
Posts: 273

Original Poster
Rep: Reputation: 32
PostgreSQL seems OK. At the risk of sounding stupid: what is the difference/advantage of PostgreSQL over, for instance, MySQL?

About Oracle: I want it to be (free and) open-source, so I'll pass on that one. Thanks anyway for the suggestion.

I know that there is no one answer, but as I know virtually nothing about DBs, I just want to ask around to see what you all would recommend and then start learning one of these.

Apart from that, do you think I require a database for what I'm trying to do or not?
 
Old 10-09-2010, 04:39 PM   #9
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by unihiekka
PostgreSQL seems OK. At the risk of sounding stupid: what is the difference/advantage of PostgreSQL over, for instance, MySQL?
Where to begin?! :}

There are MySQL's criminally insane default settings;
e.g.:
  • 0-0-0000 and 30-02-2010 are accepted as valid dates
  • Insert 250 characters into a 200-byte-wide column and it will happily do so; only in a log will you be told that it discarded 50
  • It will let you define foreign key constraints for any storage engine type, but they only actually work on InnoDB (note that MyISAM is their default engine).

Other advantages:
PostgreSQL has a BSD license, MySQL has a funky hybrid.
PostgreSQL *isn't* owned by Oracle (MySQL is).

Quote:
Originally Posted by unihiekka
Apart from that, do you think I require a database for what I'm trying to do or not?
Depends on how much work you want to do yourself;
as I said - PostgreSQL with Pl/R will probably do any
statistical work for you, and your C++ coding should
be kept to a bare minimum.

That said: your stats sound fairly simple - chances
are that you're going to see faster results w/ just
a few flat-files and in-memory operations (if you have
enough RAM & Swap combined to hold your data sets).
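
To give a rough idea of what the flat-file, in-memory route could look like, here is a minimal C++ sketch that tallies word frequencies with std::unordered_map. The names count_words and count_words_in_text are made up for illustration, and splitting on whitespace only is an assumption; a real corpus would need proper handling of punctuation, case and Unicode word boundaries.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: count how often each whitespace-separated token
// occurs in the stream. Tokens are compared byte-for-byte, so "Word"
// and "word" are counted separately and punctuation stays attached.
std::unordered_map<std::string, std::size_t> count_words(std::istream& in) {
    std::unordered_map<std::string, std::size_t> counts;
    std::string word;
    while (in >> word) {
        ++counts[word];
    }
    return counts;
}

// Convenience wrapper (also hypothetical) for text already held in memory.
std::unordered_map<std::string, std::size_t> count_words_in_text(const std::string& text) {
    std::istringstream in(text);
    return count_words(in);
}
```

The same count_words function can be fed a std::ifstream opened on one of the plain text files; whether the resulting map fits in RAM depends on the vocabulary size rather than on the size of the corpus.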

Not knowing the state of your project, the anticipated
memory consumption, amount of statistical work, or the exact
types of "queries" against your data, all one can say is:

How long is a piece of string?


Cheers,
Tink

Last edited by Tinkster; 10-09-2010 at 04:41 PM.
 
Old 10-09-2010, 05:19 PM   #10
joec@home
Member
 
Registered: Sep 2009
Location: Galveston Tx
Posts: 291

Rep: Reputation: 70
I know I will be hated for this answer, but there is one reason to put MySQL over PostGRE: there are more techs in the world familiar with debugging MySQL.
 
Old 10-09-2010, 05:59 PM   #11
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by joec@home
I know I will be hated for this answer, but there is one reason to put MySQL over PostGRE: there are more techs in the world familiar with debugging MySQL.
No hate at all; same as using Windows instead of Linux.

Just do Postgres the honour and spell it right.
It's either PostgreSQL, or Postgres for short.
Think "post-gres-queue-elle".

There's no such thing as PostGRE.


Cheers,
Tink
 
Old 10-10-2010, 02:21 AM   #12
unihiekka
Member
 
Registered: Aug 2005
Distribution: SuSE Linux / Scientific Linux / [K|X]ubuntu
Posts: 273

Original Poster
Rep: Reputation: 32
Quote:
Originally Posted by Tinkster
Where to begin?! :}

How long is a piece of string?
Well, the idea is to have the whole corpus of a language at a certain point in time to be analysed, so it can be quite big (IMO). Also, the initial statistics are going to be quite simple, but in the end there is going to be some more refined work, although I would probably use a library in C++ that I already created, so R is not really necessary.

Thanks for your PostgreSQL/MySQL comparison. I'm definitely leaning towards PostgreSQL now. Can PostgreSQL handle (text) files in UTF-8? If so, I'll be setting up PostgreSQL in no time!

Cheers,
unihiekka.
 
Old 10-10-2010, 02:41 AM   #13
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by unihiekka
Well, the idea is to have the whole corpus of a language at a certain point in time to be analysed, so it can be quite big (IMO). Also, the initial statistics are going to be quite simple, but in the end there is going to be some more refined work, although I would probably use a library in C++ that I already created, so R is not really necessary.
Well ... you're the one in control =} ... More
power (and good luck) to you.

Quote:
Originally Posted by unihiekka
Thanks for your PostgreSQL/MySQL comparison. I'm definitely leaning towards PostgreSQL now. Can PostgreSQL handle (text) files in UTF-8? If so, I'll be setting up PostgreSQL in no time!
I'm biased - I'm sure there are people who'll draw
quite opposite conclusions (and I'll think of them
as dimwitted), and back those w/ their experience.

It most certainly can, and in my personal experience
with far less farting around than MySQL. I *had* to
use MySQL for a project at some stage, and still feel
the pain ...
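
Whatever ends up storing the data, the C++ side still has to count characters rather than bytes when the files are UTF-8. A minimal sketch of that, assuming well-formed input; the function name count_code_points is made up for illustration, and it counts Unicode code points, which is not quite the same thing as user-perceived characters when combining marks are involved.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: tally Unicode code points in a UTF-8 encoded string.
// Assumes well-formed UTF-8: stray continuation bytes and truncated
// sequences are skipped, and no further validation is done.
std::map<char32_t, std::size_t> count_code_points(const std::string& utf8) {
    std::map<char32_t, std::size_t> counts;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 0;  // sequence length implied by the lead byte
        char32_t cp = 0;      // code point being assembled
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        if (len == 0 || i + len > utf8.size()) {
            ++i;  // not a valid lead byte, or sequence runs past the end
            continue;
        }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        ++counts[cp];
        i += len;
    }
    return counts;
}
```

A std::map keeps the tallies ordered by code point, which is convenient for printing a frequency table; swapping in std::unordered_map would be a reasonable choice if ordering does not matter.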



Cheers,
Tink
 
Old 10-10-2010, 02:43 AM   #14
mericet
Member
 
Registered: Jul 2009
Posts: 50

Rep: Reputation: 8
The relational databases mentioned so far are not designed to do large-scale text manipulation.

Try CLucene, a C++ port of Apache's Lucene project.

http://sourceforge.net/projects/clucene/
 
Old 10-10-2010, 03:06 AM   #15
vikas027
Senior Member
 
Registered: May 2007
Location: Sydney
Distribution: RHEL, CentOS, Debian, OS X
Posts: 1,298

Rep: Reputation: 102
IMHO MySQL would be the best choice.

Advantages:-
Fast
Easy to Install
No DBAs required to manage it
No Licensing Required

Last edited by vikas027; 10-10-2010 at 03:07 AM.
 
All times are GMT -5.