LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Software (https://www.linuxquestions.org/questions/linux-software-2/)
-   -   Database or Spreadsheet? (https://www.linuxquestions.org/questions/linux-software-2/database-or-spreadsheet-614049/)

johnsfine 01-16-2008 01:32 PM

Database or Spreadsheet?
 
I have a file of data, about 5000 rows by about 100 columns of 1 to 40 characters per field, from which I frequently need to view selected subsets.

Currently this is done with an Excel spreadsheet, that has a nearly optimal UI and would be really great if there weren't some Excel bug making it often give incorrect results.

I'd like to switch to some free program I can run under Linux (or better yet a free package, such as Open Office that can run on both Windows and Linux).

I haven't yet tried any free database or spreadsheet program. I expect I'll need to learn quite a lot about a program before I can try something this big, and only then will I find out whether the program can achieve these UI features at this size data set.

I'm hoping some experts in this forum can tell me which programs can do this, so I won't waste the learning time on any program that turns out not to do the job.

Some basic UI requirements:

The data is presented in a grid (like an ordinary spreadsheet).

There is a title row at the top that stays put as you scroll lower rows (standard spreadsheet feature, I think).

The major UI feature (the heart of this application):

There is a pull down dialog from each (top of column) title cell that presents a sorted set of all unique values of that column. One of the columns has about 1000 unique values among its 5000 entries. Other columns have far fewer. Most columns have only two unique values over all 5000 fields.

You can select a value in that dialog, and the program will then hide all rows that don't have that value in that field, both from the main display and from the set of unique values you would get by opening the unique value dialog of a different column.

You can select a value each in several different columns to get the intersection of all those selections.

(This is an Excel feature that I don't know how to set up, but someone else set it up. It is the feature with the bug. Sometimes it hides a few rows that it shouldn't have hidden).

Other UI features:

For the many columns in which the data is just one character wide, the title is displayed vertically, so the column is just one character wide.

There are a few columns locked at the left, so when you scroll horizontally to see columns initially off the right edge, those few at the left don't move (standard Excel feature).

It would be great to be able to select a few values in a column, rather than all or one. In Excel a column is either unrestricted or restricted to a single value. So if I wanted to find everything that is either A or B in column one and is either C or D in column 2, I'd need to do four different selections: AC, AD, BC, and BD. I'd prefer to see them all together as (A or B)(C or D).

Each column defaults to the width of the widest data item in it, but there is the usual GUI method to manually adjust column widths.

I also have a second data set for a different project, that is currently stored as 1900 txt files totalling 55Mb of tab delimited text. I search that in a very different and cruder method. I'd like to be able to manage that dataset the same way as the first. 55Mb seems pretty trivial for computers with a GB or more of RAM, but in Excel this data set is too many rows for the selection feature described above.

tronayne 01-16-2008 05:47 PM

You may want to download and install OpenOffice (available for both Linux and Windows; you can open an Excel spreadsheet in OpenOffice' Calc program) and compare the features and function of OpenOffice to Microsoft Office (and I'll bet you'll be impressed). You should bear in mind that a large spread sheet (5,000 rows by 100 columns is a fairly large spread sheet) requires overhead to manipulate and performance may not be exactly what you expect with any spread sheet program.

On the other hand, 5,000 rows by 100 columns is trivial for a relational data base management system (DBMS); however, a DBMS does not provide the GUI that a spread sheet does, so there is that to consider. The language of DBMS' is Structured Query Language or SQL and that does take some investment to learn; once learned, though, you'll probably never want to go back. The MySQL DBMS is pretty much standard equipment on Linux distributions, and there are some GUI front end applications for it (MySQLAdmin comes to mind); MySQL is fast, well-supported and an excellent platform to build on (and it pretty much runs on any operating system you might encounter). With a DBMS, you define a table (columns and their data type) and load your spread sheet data one time; from there you query the data base based on relationships between columns (or, more accurately, between columns in one or more tables -- that's the "relational" part, where different tables contain one or more elements also contained in other tables). Different queries for different relationships between data.

As above, get and try OpenOffice and read about MySQL and other DBMS and see what you think.

jlinkels 01-16-2008 06:00 PM

Interesting problem. It doesn't seem to me that you'll be able to find something which already provides something more towards you solution than a storage engine and a display mechanism.

MS Excel is not that bad. Really, I think it is the best MS application I know. It is fast, almost all features work as expected and the underlying VB is extremely powerful. Are you sure sure sure there is no way you can work around the bug? Are you using macros or VB for your selection and display?

If not, don't even consider using OpenOffice Calc. It is sometimes buggy, it is sloooooooooow where you don't expect it. The programming language seems to be extremely and insanely versatile and orthogonal, but it also seems it is only implemented for a small part. Code is unreadable because of the loooooooong object names and the large level of fields you need to mention before you get somewhere. And undebuggable because most objects or actions can not be debugged.

Since you store data, using a database is obvious. MySQL doesn't have any problems with storing 500000 values. It is scalable until the end of the universe. MySQL you learn in under 2 hours.

That leaves presentation. That is difficult (isn't it always?). If you start to write a custom program in of of the libraries like GTK is fear you will be programming for the next 6 months instead of working on your project.

Instead I would look into something like Tcl/Tk. Tcl is an interpreted programming language. A bit strange but extremely powerful. You need about 5 nights going thru the tutorial. It has a connector for MySQL.

Tk is the graphical shell. It is very easy to learn, in one night you have built your first application. Tk is able to implement windows with scrollbars etc like you are used to in Excel.

I doubt whether Tcl or Tk are able to read, store and handle 500000 values. But you don't have to. Because you use MySQL everything is safely stored there, and only what you want to display is retrieved and being displayed. It is bit of calculation where you have to start your display and what values to retrieve for that etc, but certainly doable.

There are numerous examples on the internet to show what Tcl/Tk can do, including graphics and clickable areas etc. Tk exists for other languages as well like Python and Perl. But I don't have experience with those, and as a matter if fact Tcl is simple, powerful & funny.

jlinkels

johnsfine 01-17-2008 08:49 AM

I just tried opening that XLS file in OpenOffice. Previously I had just assumed that wouldn't work (It's Excel compatibility couldn't be that good).

I'm impressed. It all seems to work. I'll try using them side by side for a while to how the bugs compare. (At the moment I can't recall any specific selection that triggered the bug in Excel, but I hit them fairly often in normal use).

But the performance in OpenOffice is terrible.

I made a selection in Excel (that reduced 5000 rows to 13 by selecting one unusual value in one column). It took about half a second to display the result. I did the exact same selection in OpenOffice and it took 48 seconds.

Also the titles are badly displayed. The column widths are too small for most of the titles. Each program seems to be automatically choosing a way to display each title in the available space. The Excel choices are better than the OpenOffice choices. One condition occurs in a few titles, that causes OpenOffice to totally mess up and display the title in the wrong place (on top of a nearby cell) and totally unreadable. In the same case, Excel just truncates the title. (Column is one character wide and many characters hight, but the title is too long to fit vertically).

Also, Task Manager tells me OpenOffice is using 70Mb of virtual memory to have that file open, while Excel needs only 10Mb.

I HOPE some of these problems are caused by building the spreadsheet in Excel before opening in OpenOffice (vs. building the same thing from scratch in OpenOffice). But I don't really EXPECT that to be true (If I did, I'd be rebuilding this spreadsheet from scratch in OpenOffice now rather than puting that far down on my things to try later list).

jlinkels 01-17-2008 12:42 PM

Hehe I hadn't noticed that troynane posted a reply while I was composing mine.

johnsfine, this is what I expected from OOCalc, and it can be much worse. (like 30 seconds for creating a graph of 512 rows x 2 columns) Memory usage *might* be because I expect that each cell in OO is an object, and a terribly complicated object as well. You can try but I do not expect that building the application from scratch brings relief.

This is off-topic, but to credit the OO developers: I know my critics on OOCalc sounds harsh. On the other hand, is you compare OOWriter with MsWord, the comparison ends up exactly the other way around. Writer is an excellent product, fast, small file sizes, features that work as expected.

This morning while reading the documentation I coincidentally stumbled on the specified limitation of MySQL tables. You can have at most 4096 colums in a table (not bad) but one row can be at most 65653 bytes long. Which could be sufficient if you'd use variable lenght strings and most of the strings are a single character.

jlinkels

johnsfine 01-22-2008 10:19 AM

I tried OO Base. I guess I should have followed your advice and tried MySQL. But OO Base was already installed on my Windows computers and my Linux computer, and it has a spreadsheet like GUI available.

It was amazingly hard to figure out how to do anything. Simple tasks, like importing tab delimited ascii, took multiple searches for and through tutorials to find (import to OOCalc first, then copy/paste and that paste step itself was super confusing).

Everything ran super slow and there was no hour glass (in the Windows version). I needed to keep task manager open beside it and check the CPU time it was using, or I couldn't tell whether the operation I just tried does nothing (as many did) or takes a long time with no intial change to the display before finally doing something (as almost everything else did).

I still haven't figured out how to do useful queries (which requires variables, which are slightly documented, but don't work as they seem to be documented). The basic filtering available without defining a query seems to be neither more flexible, nor faster than the filering in OOCalc.

The documentation I found is incredibly thin. It is barely more than titles telling you what you can already see from the program's menu structure. It isn't even as well organised as the menues and dialogs it documents, and it explains nothing. At best, it tells you what is in the menu or dialog (even that is incomplete), nothing about what any of it means, nor how you accomplish basic tasks, nor the pros/cons of any choices you need to make.

Hopefully, I'll do better when I find time to try MySQL.

In case you want to see the kind of data I'm working on, I started from this XLS file by Rob Crowe
http://www.hifi-remote.com/forums/dl...e&file_id=2039
I haven't changed enough yet to matter for any performance or functionality question.
(That file tells summary information about all the different Setup Codes installed in many different models of OneForAll remotes. I use it to help in answering questions in several remote control forums.) Typical use is to select a protocol name from the pull down in that column, then select a device number in the dev1 column, then see what setup codes exist for that and which model remotes include that setup code.
I want to do more with that data than that, but I can't really get started until I have decent speed on the basic feature I just described.

tronayne 01-22-2008 12:12 PM

Well, you've discovered that OOBase is a three-legged dog with a busted tail, particularly with the data set you're using. If it was me, I'd sit down and load the data into a MySQL table (or tables) and go from there (you will get the speed and flexibility you're looking for but you will not get the ease-of-use of a spread sheet). You will have to invest some time and effort learning the ins and outs of data base management systems, SQL, and, probably PHP to get the display you want in a browser -- none of this is trivial, but at the same time, it's not rocket science either.

A hint: save the spread sheet as a comma-separated-variable file and you'll be able to load it in a MySQL table. First, though, you need to define the table columns (the same as the spread sheet columns); I highly recommend that you do not use the full column names from the spread sheet -- keep 'em short, no spaces in the names, something you can readily understand (I usually use no more than 10 characters for column names, generally three characters, underscore, three characters when the name will be longer than eight or ten characters; remember, you're going to have to type that stuff in queries!).

If you invest the time and effort you'll recoup your "cost" when it comes time to do the next application, the next data base, the next... whatever. You can grow MySQL pretty much to whatever size you need (yeah, there are limits, but they're darned hard to get to); a spread sheet is just a spread sheet and that's pretty much that... and you've discovered the limit switches, eh?

johnsfine 01-22-2008 04:04 PM

I installed MySQL and started trying to play with it.

I have no clue yet how I'm going to put any sort of GUI in front of it to make it useful. I would have thought a spreadsheet like GUI in front of a database would be a common requirement. I'm sure it is available already done somewhere. I'm not sure it would be easier to find than redo. But I don't even know what tools to start learning for redoing it. I'm an experienced C++ programmer, but only back-end projects. I've done very little GUI programming and none recently.

I was unhappy about much of the MySQL install process. I hate software packages designed on the assumption that they are the most important thing you will be running and deserve a footprint whether you are running them at the moment or not. I like applications that sit around as just a collection of files on your hard drive (including executable files of course) and don't do anything until you invoke them. I tried answering the MySQL install questions for minimal idle footprint, but didn't really understand the questions and don't yet know how much (but clearly nonzero) idle footprint it now has. (I tried Windows first, because of where I was when I had time to try this. Maybe the Linux install of MySQL is cleaner).

I'd also like to be able to share my results with a lot of people who would only use it occasionally and don't otherwise use databases. If my result were an OOCalc spreadsheet file, it wouldn't be too unreasonable to post the file and tell people they need OOCalc in order to use it. With MySQL, I haven't figured out yet how packed my result could be nor how reasonable it might be to tell others to install MySQL if they want to use it.

The GUI is the important question, but pending that, I'm looking at the data side and getting frustrated there as well. One important field type for both my projects has 257 legal values: empty and the numbers 0 to 255. Reading MySQL doc, it seems like NULL is the best substitute for empty, so I hoped to get it working with:
tinyint(3) unsigned default null
That can store the 257 values I want, but missing values become 0 rather than null. In order to get NULL, I need explicit \N in the input instead of having nothing. Obviously, I can write some C++ code to filter the input, and if I need to code a GUI it can filter all the output. But it seems like there ought to be a built-in way to make empty and NULL be the same for numeric fields. I do understand why empty (zero length text) cannot be the same as NULL in field types such as VARCHAR. But that shouldn't apply to numeric.

I expect I'll need to switch to a more appropriate forum as my MySQL questions get harder. I do appreciate the help you've given me here so far.

jlinkels 01-22-2008 06:11 PM

Creating the import file from a CSV is almost trivial, but not completely. CSV is not a nice format when you have quoted strings in your CSV, or comma's in your strings, whatever.

This little program converts a CSV file into a file with nice field separators like '|'. Then the easiest is to use AWK to convert it into a MySQL file. Look for an example here.

Did you already check the MySQL documentation? It has many examples, and I know there are some real SQL gurus around here as well.

From your one but latest port I understand that the spreadsheet thing is not a real requirement. You want to select a protocol and a device and see what that yields. That is MUCH easier to write a GUI for than mimicking the entire spreadsheet with columns, scrollbars and whatnot.

You even might be able to do that in PHP so you can query this in every web browser. I am just a bit afraid about getting a drop down list of 5000 items in PHP for performance. If you could live with entering text + a wildcard you'd be fine though.

You should really have a look into Tcl/Tk.

I tried to open the spreadsheet at home on my 512MB machine, but that choked on the spreadsheet. I did it in my office (1.5 GB) and that was fine. Unfortunately I do not quite remember the contents of the spreadsheet.

jlinkels

johnsfine 01-22-2008 07:23 PM

Quote:

Originally Posted by jlinkels (Post 3031881)
Creating the import file from a CSV is almost trivial, but not completely. CSV is not a nice format when you have quoted strings in your CSV, or comma's in your strings, whatever.

There are no non delimiter tabs, no comma's, and no quotes in this data. Despite the name "csv", the doc I found suggested tab delimited and tab delimited works fine. For the second project I have in mind, there are commas and quotes in the data, but never any tabs (even inside quotes) that aren't delimiters. From the doc, that seems to be a valid input form (tab as a field delimiter and no text delimiter).

Other than changing empty fields to \N, I don't think I need to preprocess the CVS input. I was hoping there was a way to not need to preprocess for the empty fields.

Quote:

Originally Posted by jlinkels (Post 3031881)
Did you already check the MySQL documentation?

I didn't read every page. But I read every page that seemed (from the table of contents) to be relevant to any of my current problems.

Quote:

Originally Posted by jlinkels (Post 3031881)
From your one but latest port I understand that the spreadsheet thing is not a real requirement. You want to select a protocol and a device and see what that yields. That is MUCH easier to write a GUI for than mimicking the entire spreadsheet with columns, scrollbars and whatnot.

It would be quite hard to come up with a different but still effective GUI. Things like the horizontal split screen scrolling are very important to use (wrapping lines would make it all unreadable, but it can't all fit at once across the screen). It's just NICE that before the first step of filtering you can seen the top N rows on screen of unfiltered data and scroll further. But it is IMPORTANT that after the first step of filtering you can see and scroll the top N rows of partially filtered data to help guide the next filtering step.

Even when you reach the fully filtered result, it many be many rows and each row is still too wide to view. It would be a lot of extra effort to select horizontally (tell the GUI which fields you want to see each time) than simply use the horizontal scrolling and column width adjustments and other basic features of a spreadsheet GUI.

Quote:

Originally Posted by jlinkels (Post 3031881)
You even might be able to do that in PHP so you can query this in every web browser.

I don't really see the big advantage to viewing this in a web browser. I do see the disadvantage you mentioned and a lot more like it.

Quote:

Originally Posted by jlinkels (Post 3031881)
You should really have a look into Tcl/Tk.

A few years ago, I made a small try at learning that for some other project. Not a big enough try to learn more than the fact that a small try wasn't close to enough.

Quote:

Originally Posted by jlinkels (Post 3031881)
I tried to open the spreadsheet at home on my 512MB machine, but that choked on the spreadsheet. I did it in my office (1.5 GB) and that was fine. Unfortunately I do not quite remember the contents of the spreadsheet.

My Linux box has 3GB, so if OOCalc needs a lot to open that in Linux, I wouldn't have noticed. My report that OOCalc was super slow was based mainly on results on a Windows box with 2GB. I also opened it slowly on a Windows box with 512MB, however, even there it didn't choke and wasn't even hitting the swap file. I'm pretty sure it was slower there due to smaller L2 cache rather than because it is limited by the 512Mb.

The second project I have in mind is much bigger and its data much more suited to a database. But for that one as well, a spreadsheet would be the best GUI for presenting the results.


All times are GMT -5. The time now is 02:00 PM.