Unpopular Positions: One Geek's Take
A space to ponder, discuss, speculate, disagree, and learn.

Expected topics include:
  • Technology
  • Politics
  • Defense
  • Philosophy
  • Humanism and Transhumanism
  • The Future

Pondering version two of "sel"

Posted 04-01-2015 at 03:15 PM by ttk

Some years ago, I wrote a little utility called "sel", short for "select" (to avoid confusion with bash's "select" builtin).

It's more or less SQL's "select" statement for the command line, reading structured data on stdin and writing structured data on stdout, optionally filtering rows (like a structured-data-savvy grep) and creating / removing / reordering columns and changing the format. For instance, it could read CSV in and write tab-delimited rows out, or read XML and write JSON out, etc.

http://ciar.org/ttk/codecloset/select/

It's grown organically over the years, and brought me far. I've gotten some excellent use out of it. It's also one of the few open source projects of mine which other people have actually used and enjoyed. I forked it while at DM so I could work on employment-related features on the company dime, and that fork has many powerful features (which unfortunately won't be available to the public until I re-implement them "from scratch" in the public fork).

Unfortunately it wasn't written very cleanly, and the code is something of a mess. It's gotten to the point where adding desired features (like support for ini format) or fixing bugs (the CSV implementation has some quirks) is a huge chore.

So, I'd like to rewrite it from scratch, using a clean, modular design, so that it is more maintainable / extensible in the future. A rewrite offers fantastic opportunities, but also some dangers (like Second System Syndrome), and a few questions besides.

One of the "features" I've been wanting for a while is better performance. It's written in perl, which is very fast compared to some languages, but a slowpoke compared to C. So perhaps version two could be written in C? This wouldn't improve JSON or XML parsing by very much (perl's JSON::XS and XML::LibXML are already wrappers around parsers implemented in C), but according to my benchmarks it would increase performance of parsing other formats dramatically.

It would also improve the performance of operations on the data's intermediate format. Currently, sel translates the input data to a native perl hash (associative array), performs filtering and transformation on the hash, and then translates the hash content to the output format. This is very handy and elegant, but slow. Doing it in C would be orders of magnitude faster.

The flipside of this is that C doesn't have a data structure comparable to perl's hash in convenience and versatility. The available parsers for complex structured data formats (like jansson for JSON, or libxml2 for XML) deal with this by implementing their own highly-dynamic abstract data types for representing and manipulating the parsed data.
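
To make that concrete, here's roughly what juggling one of those dynamic types looks like, as a minimal jansson sketch (the keys and values are invented for the example):

    /* Minimal jansson sketch: build a dynamic "hash of hashes" and read it
     * back.  The keys and values here are invented for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <jansson.h>

    int main(void) {
        json_t *row = json_object();              /* like an empty perl hash */
        json_object_set_new(row, "username", json_string("ttk"));
        json_object_set_new(row, "logins",   json_integer(42));

        json_t *root = json_object();             /* nested, like a HoH      */
        json_object_set_new(root, "row1", row);

        /* look a value back up, analogous to $root->{row1}->{username} */
        const char *name = json_string_value(
            json_object_get(json_object_get(root, "row1"), "username"));
        printf("username = %s\n", name);

        char *dump = json_dumps(root, JSON_INDENT(2));
        printf("%s\n", dump);
        free(dump);
        json_decref(root);                        /* frees the whole tree    */
        return 0;
    }

It's workable, but every one of those calls is doing work that perl's { } syntax gets for free.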

This means that instead of translating (for instance) JSON -> perl hash -> XML, sel rewritten in C would be translating JSON -> jansson -> libxml2 -> XML, which wouldn't be so bad, but sel also supports translating to/from about half a dozen other data formats. Without a common intermediate representation, every pair of formats needs its own translator, on the order of n*(n-1) of them, so eight formats is already fifty-six translation implementations, and ten formats is ninety.

One solution to this would be implementing an intermediate representation: an abstract data type flexible enough to hold data from any supported format. Then I would only need to write eight or ten translators to the intermediate representation, and eight or ten translators from it. But would translating JSON -> jansson -> intermediate format -> libxml2 -> XML be much faster, at the end of the day, than just using perl? That intermediate format would need to look pretty much like a perl hash.
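
Concretely, I'd expect such an intermediate representation to shake out as a tagged union along these lines (a hypothetical sketch with invented names, not anything that exists in sel):

    /* Hypothetical sketch of a perl-hash-like intermediate representation.
     * All names are invented; nothing like this exists in sel yet. */
    #include <stddef.h>
    #include <stdio.h>

    typedef enum { IR_STRING, IR_NUMBER, IR_LIST, IR_HASH } ir_type;

    typedef struct ir_value ir_value;

    struct ir_value {
        ir_type type;
        union {
            char   *string;
            double  number;
            struct {                    /* IR_LIST: ordered values  */
                ir_value **items;
                size_t     count;
            } list;
            struct {                    /* IR_HASH: key/value pairs */
                char     **keys;
                ir_value **values;
                size_t     count;
            } hash;
        } u;
    };

    /* Each supported format then needs just one reader and one writer: */
    ir_value *ir_read_json(FILE *in);
    ir_value *ir_read_csv(FILE *in);
    int       ir_write_xml(const ir_value *v, FILE *out);
    int       ir_write_ini(const ir_value *v, FILE *out);

With ten formats that's twenty translators instead of ninety, which is the whole argument for having the intermediate representation.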

Alternatively I could pick one of these libraries' representations and use it as the intermediate representation. The jansson data representation type has the flexibility, and it would save me considerable coding effort. I wouldn't have to implement my own data representation type, and I wouldn't have to implement the additional translation layers for JSON. I'd still have to implement translators for CSV, tab-delimited, loadfiles, ini, YAML, html, sql, etc, though. The code would translate CSV -> jansson -> INI, or CSV -> jansson -> libxml2 -> XML.
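
The translator layer would then look something like this sketch, where csv_read() and ini_write() are hypothetical stand-ins I'd have to write, and only the jansson calls are real API:

    /* Sketch of jansson's json_t as the common intermediate representation.
     * csv_read() and ini_write() are hypothetical; the jansson calls are real. */
    #include <stdio.h>
    #include <jansson.h>

    json_t *csv_read(FILE *in);                 /* CSV -> json_t, to be written */
    int     ini_write(json_t *data, FILE *out); /* json_t -> INI, to be written */

    /* CSV -> INI by way of jansson, with no CSV<->INI-specific code: */
    int translate_csv_to_ini(FILE *in, FILE *out) {
        json_t *data = csv_read(in);
        if (!data)
            return -1;
        int rc = ini_write(data, out);
        json_decref(data);
        return rc;
    }

    /* ...and JSON output comes along for free: */
    int translate_csv_to_json(FILE *in, FILE *out) {
        json_t *data = csv_read(in);
        if (!data)
            return -1;
        int rc = json_dumpf(data, out, JSON_INDENT(2));
        json_decref(data);
        return rc;
    }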

That might not be too bad.

Or I could give up on performance, for now, and stick with perl, which would make development go much faster. This holds some appeal, too, since there are quite a few features I'd like to add, each of which would require ten or twenty times as many lines of C code as perl code:

New formats: I've been wanting sel to support INI, YAML, C, and SQL data formats for a long time.

Easier transformations: Right now the only way to create or rename columns with sel is to use the --perl= option to provide a fragment of perl code which sel eval()'s on every data row. It would be very nice if the column list could contain decorations, like "username=customer_name" to rename the input's "username" column to "customer_name" on output, or perhaps something more SQL-like, supporting "username AS customer_name" syntax (see the sketch below). Also, if I reimplemented sel in C, I'd still want the --perl option somehow, because it's very powerful. Perhaps replace it with --lua, or pipe through an external perl process when needed?
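
Parsing that decoration is trivial in any implementation language; a hypothetical C fragment for the "=" form:

    /* Hypothetical sketch: split a column spec like "username=customer_name"
     * into input and output names.  A bare "username" keeps its own name. */
    #define _POSIX_C_SOURCE 200809L   /* for strdup/strndup */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char *in_name;    /* column name in the input  */
        char *out_name;   /* column name in the output */
    } colspec;

    colspec parse_colspec(const char *spec) {
        colspec c;
        const char *eq = strchr(spec, '=');
        if (eq) {
            c.in_name  = strndup(spec, (size_t)(eq - spec));
            c.out_name = strdup(eq + 1);
        } else {
            c.in_name  = strdup(spec);
            c.out_name = strdup(spec);
        }
        return c;
    }

    int main(void) {
        colspec c = parse_colspec("username=customer_name");
        printf("rename %s -> %s\n", c.in_name, c.out_name);
        free(c.in_name);
        free(c.out_name);
        return 0;
    }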

Parallelism: This goes hand-in-hand with "more performance". Perl's threading implementation kind of sucks, and using multiple processes is a bad fit to the kinds of transformations sel is useful for. Making sel multithreaded would actually be easier in C, and would give it a performance boost on top of C's natural runtime performance advantage.
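
For a sense of how little machinery that takes in C, a sketch with plain pthreads, where transform_row() is a stand-in for whatever per-row work sel is doing:

    /* Hypothetical sketch: transform an in-memory array of rows on N threads.
     * transform_row() stands in for sel's per-row filtering/transformation. */
    #include <pthread.h>
    #include <stddef.h>

    #define NTHREADS 4

    extern void transform_row(char *row);  /* the real work, defined elsewhere */

    struct slice { char **rows; size_t begin, end; };

    static void *worker(void *arg) {
        struct slice *s = arg;
        for (size_t i = s->begin; i < s->end; i++)
            transform_row(s->rows[i]);
        return NULL;
    }

    void transform_all(char **rows, size_t nrows) {
        pthread_t tid[NTHREADS];
        struct slice s[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            s[t].rows  = rows;
            s[t].begin = nrows * (size_t)t / NTHREADS;
            s[t].end   = nrows * (size_t)(t + 1) / NTHREADS;
            pthread_create(&tid[t], NULL, worker, &s[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
    }

Each row is independent of the others, so no locking is needed, which is exactly the shape of work sel's transformations have.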

Complex sort: Often I've wished sel could sort its output in a scalable way, akin to GNU sort(1), doing what it can in memory but using temporary files and mergesorting them if the input dataset is too large to fit in memory. I've written a half-assed "hashsort" utility, which sort of does what I want, but only understands the hash data format. This means that in order to translate JSON to CSV sorted by the input data's {"foo": {"bar": ...}} field, I would do something like: "cat input.js | sel --all --out=hash | hashsort foo.bar | sel --all --out=csv > output.csv", which would be slow as molasses due to the extra translation steps. It would also require two serialize/deserialize steps, which means that even if sel and hashsort were implemented for concurrent processing, their ability to take advantage of that concurrency would be unduly limited.
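
The algorithm I want is exactly GNU sort's: sort what fits in memory, spill sorted runs to temp files, then merge the runs. A stripped-down C sketch of the shape of it, comparing whole rows where the real thing would compare an extracted key like foo.bar:

    /* Hypothetical sketch of sort(1)-style external sorting: sort rows in
     * bounded memory, spill sorted runs to temp files, then k-way merge.
     * Compares whole rows; the real thing would compare an extracted key.
     * Error handling and run-count overflow are elided for brevity. */
    #define _POSIX_C_SOURCE 200809L   /* for getline/strdup */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_RUNS 64
    #define RUN_ROWS 100000    /* rows per in-memory run (tuning assumption) */

    static int cmp(const void *a, const void *b) {
        return strcmp(*(char *const *)a, *(char *const *)b);
    }

    static FILE *spill(char **rows, size_t n) { /* sort one run, write it out */
        qsort(rows, n, sizeof *rows, cmp);
        FILE *tf = tmpfile();
        for (size_t i = 0; i < n; i++) { fputs(rows[i], tf); free(rows[i]); }
        rewind(tf);
        return tf;
    }

    int main(void) {
        static char *rows[RUN_ROWS];
        FILE  *run[MAX_RUNS];
        char  *head[MAX_RUNS] = {0};      /* current line from each run */
        size_t hcap[MAX_RUNS] = {0};
        int    nruns = 0;
        size_t n = 0;
        char  *line = NULL;
        size_t lcap = 0;

        while (getline(&line, &lcap, stdin) != -1) {
            rows[n++] = strdup(line);
            if (n == RUN_ROWS) { run[nruns++] = spill(rows, n); n = 0; }
        }
        if (n > 0) run[nruns++] = spill(rows, n);

        /* k-way merge: repeatedly emit the smallest head line */
        for (int i = 0; i < nruns; i++)
            if (getline(&head[i], &hcap[i], run[i]) == -1) {
                free(head[i]);
                head[i] = NULL;
            }

        for (;;) {
            int min = -1;
            for (int i = 0; i < nruns; i++)
                if (head[i] && (min < 0 || strcmp(head[i], head[min]) < 0))
                    min = i;
            if (min < 0) break;
            fputs(head[min], stdout);
            if (getline(&head[min], &hcap[min], run[min]) == -1) {
                free(head[min]);
                head[min] = NULL;
            }
        }
        return 0;
    }

Doing that inside sel, keyed on a field of the intermediate representation, eliminates both the extra translation steps and the extra serialize/deserialize round trips.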

Perhaps I should implement version two in perl, which would make it easier to incorporate all of these features, and save the C re-implementation for version three? It's much, much easier to rewrite a working perl program in C than it is to develop and debug new code in C. With perl I can more quickly/easily write and debug the underlying logic, giving me a known-working program to translate straightforwardly into C later.

Unfortunately it's taken me ten years to move on from version one to version two, so how long would it take me to work on a version three? Another ten years? But would a version two written in C ever get finished in the first place?

Right now I'm leaning towards writing version two in perl, but want to ponder it a bit longer before starting (again -- I started writing a thread-enabled sel2 in perl a few years back, but abandoned it).

Comments

    One of the regulars in ##slackware suggests using glib's hash table implementation (which I've not used) and Inline::C (which I have used, and like). He's seen 3x speedup this way vs native perl hashes: https://developer.gnome.org/glib/sta...sh-Tables.html . The main downside here is introducing a gnome dependency, but their implementation looks nicely complete.
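
    For reference, the GHashTable basics look like this; a minimal sketch using glib's documented string-hash API, not his code:

        /* Minimal GHashTable sketch, using glib's documented string-hash API.
         * The key and value here are invented for illustration. */
        #include <stdio.h>
        #include <glib.h>

        int main(void) {
            GHashTable *h = g_hash_table_new_full(g_str_hash, g_str_equal,
                                                  g_free, g_free);
            g_hash_table_insert(h, g_strdup("username"), g_strdup("ttk"));

            const char *val = g_hash_table_lookup(h, "username");
            printf("username = %s\n", val);

            g_hash_table_destroy(h);  /* frees keys and values via g_free */
            return 0;
        }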

    Another regular pointed me at his own C hash table implementation, which also looks nice, though it needs a bit of fixing (which he might be taking care of now): http://tekk.in/dtype.tar.gz . Specifically, some declarations need to be changed so that it either uses Hash and List or TList, THash, and FHash. Also I'd want to replace his hash() with my own http://ciar.org/ttk/codecloset/libth-0.2.tar.gz implementation (which is about as fast as Google's CityHash, and generates more square output).

    It occurs to me that if I write the code such that the functions which access/manipulate the intermediate format are encapsulated and isolated, I can use perl hashes first to get things working, and then replace those isolated functions with an Inline::C implementation, using either of these hash libraries.
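
    Something like this interface, with all names hypothetical; callers see only an opaque handle, so the backing store can start as a perl hash and later become a C hash table without touching any caller:

        /* Hypothetical accessor interface for the intermediate format.
         * The struct stays opaque, so the backend (perl hash via Inline::C
         * now, a C hash table later) can be swapped behind these functions. */
        typedef struct ir ir;          /* opaque; the backend's business */

        ir         *ir_new(void);
        void        ir_free(ir *h);
        void        ir_set(ir *h, const char *key, const char *value);
        const char *ir_get(const ir *h, const char *key);
        void        ir_delete(ir *h, const char *key);
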
    Posted 04-01-2015 at 04:30 PM by ttk
 

  


