LinuxQuestions.org
Latest LQ Deal: Latest LQ Deals
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Blogs > astrogeek
User Name
Password

Notices


Rate this Entry

New Years Data To DB - The Model and The Data

Posted 03-12-2019 at 03:48 PM by astrogeek
Updated 03-12-2019 at 05:02 PM by astrogeek (typos)

With a now complete solution set and minimal set of generating expressions, it seems reasonable to think about inserting these into a relational database in order to more easily explore the landscape. To do so we need a normalized data model to query, and methods of capturing the data into the model or schema, both of which are easily within reach.

First, it is important to realize how many unique result sets there are, and how these are related to the ten thousand four digit "years". As summarized by dogpatch and implicit in all our solutions generating code, there are:

Code:
715   - Unique combinations of four digits, therefore 715 unique solution sets
10000 - Permutations of those combinations, which are just the years
46978 46980 - Unique solutions, integers 0-100 (found by evaluation of permuted expressions/digits)
To make this explicit, in the notation of combinatorics there are C(n,r) unique unordered combinations of n things taken r at a time, and P(n,r) permutations, or ways to order them. (Note: I have included a longer discussion of this in COMBINATIONS.TXT included in my attached code bundle below which I hope you will find interesting!)

There are two cases depending on whether the r "things" are unique (1234, 5678, etc.) or if repetitions are allowed (1233, 1999, etc), for both combinations and permutations. For our purposes here, n=10, the number of base 10 digits, r=4, and repetition is allowed, so the applicable forms are:

Code:
                             (r+n-1)!
C(10,4), with repetitions = ---------- = 715
                             r!(n-1)!

P(10,4), with repetitions = n^r = 10,000
So we are going to store solutions for 715 permuted combinations. But we also want to be able to access any result by years, or permutations. To relate the two we will use the combination as a key, adopting the lowest-to-greatest left to right permutation as the canonical key form. We will also keep the total solutions count generated by our programs with each year.

We want to store all solutions generated by all expressions for all permutations, not just one unique solution per combination. That is much more than the 46980 total unique solutions! For my own minimal expression set that comes to 675,295 separate solutions. My own code uses the absolute value of the result as the result, with implicit sign inversion, which reduces the number of expressions we must evaluate. But it may be useful to know when that has happened so we will capture the sign of the result before inversion is applied. We also want to keep the expression which generated a given result, after variable substitution.

Finally, for each solution we want to know the symbolic expression which generated it, but storing that in the solution would create a lot of redundancy, so we will normalize that into a separate relation and link it to each solution by an integer key, exp_id.

There are various ways we might want to rank our expressions for use in queries, so we will include an integer rank column for that purpose. Because we have not actually defined its use yet, we will simply set it to a default value of zero for now.

In order to link solutions to the symbolic expressions which generated them, we will need to import the symbolic expression data before the solution data, then link the expressions into the solutions when they are imported. To facilitate this we define a solution data staging table.

And we will add a utility pivot table, because you never know when you might need a source of strictly increasing monotonic integers...

Taken all together, this leads us to the following schema for our data model, suitable for most any RDBMS, I am using MariaDB (MySQL) at the moment:


Now we need to get our expressions and solution data into the model. Originally, I did this with a few simple shell scripts and awk - not really difficult. But I have now modified my program to generate the year and solution data in the form of SQL INSERT statements to make it very easy.

Either way, the idea is to generate the necessary INSERTs into a text file, then execute them to enter the data into the data model, or schema.

The year data consists of three fields, the four digit year (include leading zeroes), the lowest left to right ordering of those digits as combination key, and the total solutions found. This data allows us easy access to solutions by year and also stores the originally generated totals by year as an easy check on validity of solutions data - mostly a convenience. You may get the raw data from any of our programs, -F option for my own. Or you may get the SQL inserts with the -FQ option.

The expression data is just the list of symbolic expressions used to generate the solutions. That might be the contents of the "minimal" expressions file posted with my own code, for example, or your own list. Note that it is essential that the expressions be the identical set used to generate the solution data! I have included a simple shell script for generating the necessary inserts in my code below.

The solution data is the complete set of solutions for all 10,000 permutations, including all found solutions (not simply the unique solutions). Using my own code the raw data may be obtained using the -Fsv options (but no sign field), while the -FQs options will generate the necessary SQL INSERTs with sign information.

If you generate your own data using sed, awk or other method, be sure to use the lowest left-right ordering of digits for the combination key - that ties it all together!

HINT: Also, if generating your own INSERTs, use the VALUES(...),... form to group all or bunches of them into one or a few aggregate INSERT statements - you will appreciate the difference in execution speed! Using my program generated data and scripts I can go from start to fully imported data in about 2 minutes, for reference.

Here is a bundle of my current code, years_v1.1.0.tar.gz.txt, which should include enough documentation to get your data into a MariaDB, or other RDBMS model!

As for making use of the data, you are on your own until my next post! But I am just learning my way around it myself, so you may well do better without any (mis)direction from me!
« Prev     Main     Next »
Total Comments 2

Comments

  1. Old Comment
    What i find especially satisfying here is seeing a formula for C equating to 715 combinations in our case. I now see that, without realizing it at the time, i was indeed following the above formula for C in finding 715 unique combinations of 4 digits.
    Posted 03-12-2019 at 06:34 PM by dogpatch dogpatch is offline
  2. Old Comment
    For some reason I never seem to be able to remember the formula for combinations and find myself looking it up again. At some point, probably after you had mentioned it, I gave myself a refresher, notes from which led to writing of COMBINATIONS.TXT. It is satisfying to arrive at some such conclusion based only on the problem at hand, then to find that you have reinvented another wheel - successfully. Reinventing wheels is a very useful exercise, so long as they are round in the end! It would be much less satisfying to simply memorize everything, and forget why it works...

    Thanks for the comment!
    Posted 03-13-2019 at 03:23 PM by astrogeek astrogeek is offline
    Updated 03-13-2019 at 03:30 PM by astrogeek
 

  



All times are GMT -5. The time now is 12:22 PM.

Main Menu
Advertisement
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration