New Years Data To DB - Arithmetic, Graphs and Relations

Posted 03-14-2019 at 04:58 PM by astrogeek
Updated 03-15-2019 at 12:14 PM by astrogeek (Forgot aggregate column)

... Or, How To Get From Arithmetic to Relational Algebra and Enjoy The Trip!

Quote:

Originally Posted by dogpatch

I found the 'NP Complete' link quite interesting, even though i still don't fully grasp the concept, nor your explanations about sets, etc.

I would very much like to try to explain some of the ideas of sets, how those are important to some aspects of our fun with the newyears puzzle, and why that takes us into the domains of graph theory and relational databases. In doing so, I do not pretend to be an expert, or even to fully understand the various subjects. But I do very much enjoy having what knowledge I have gotten from them, always enjoy the explorations, and would like to share some of that with you if I can.

What follows is a tour of the landscape as I understand it, pointing out the major landmarks and points of interest without stopping to look very closely at any of them - a whirlwind tour of sorts.

To get started, recall that the original problem posed by the newyears puzzle was to find as many integers between 0 and 100 which could be produced by the basic operations of arithmetic acting on the four digits of the year. The expressions we use for that purpose enter the domain of elementary algebra as soon as we begin to substitute symbols, or variables, for the numbers. Most of our programming efforts are concerned with solving those symbolic expressions using suitably defined variables within the algebra of some programming language or other.

Obviously, our efforts to produce new expressions which produce new solutions using that model have been very productive, and continue!

But now we are beginning to ask questions about the set of solutions we have found, and the sets of expressions which produce them, and we are in many ways no longer in the domain of arithmetic or elementary algebra in asking them! We need to use another algebra - an algebra of the domain of the questions, to find the answers we seek.

We usually think of algebra only as the elementary algebra we are taught in school, or the modern algebra (also called abstract algebra) we would encounter in advanced courses of study. But there are as many algebras as there are ways of thinking about any given subject!

Quote:

In its most general form, algebra is the study of mathematical symbols and the rules for manipulating these symbols.

When we ask quesitons about sets of things, we need to use the tools of sets, the symbols and notation of sets, and one or more of the algebras of sets, to find those answers. So our problem leads naturally into the domain of sets (naturally, as opposed to via some contrived or obscure path).

Once we begin to learn how our questions may be modeled in the language of sets, we quickly discover that many of them are actually variations of a well known class of problems known as covering set problems (the term I learned long ago), or commonly, cover set problems. And immediately we enter the domain of graph theory!

While I have worked with cover sets in the past, the term Hitting set was new to me, but this short description of the problem hits the target:

Quote:

Set covering is equivalent to the hitting set problem. That is seen by observing that an instance of set covering can be viewed as an arbitrary bipartite graph, with sets represented by vertices on the left, the universe represented by vertices on the right, and edges representing the inclusion of elements in sets. The task is then to find a minimum cardinality subset of left-vertices which covers all of the right-vertices. In the Hitting set problem, the objective is to cover the left-vertices using a minimum subset of the right vertices. Converting from one problem to the other is therefore achieved by interchanging the two sets of vertices.

What we get from all of this is a very clear notation with which to represent the various "minimal set" problems, and a pool of well explored ideas and algorithms which we may draw upon to help understand and solve them! All we really need to learn is the notation or symbols used, and the rules for manipulating them - an algebra of the domain, in other words.

As we learn the new algebra, which itself is unfamiliar to most of us, but not actually difficult to understand, we have in back of our minds another question, "How do I express this in a programming language? How do I do this on a computer?"

Which brings us to the relational database.

A relation is a specific concept from set theory, a specific type of set with specific properties, and the relational model for databases is a way of organizing arbitrary information, or data into relations. Set theory also provides us with relational operators for manipulating relations - a relational algebra...

The relational algebra is an algebra which operates on relations much the same as the arithmetic operators operate on real numbers. Not surprisingly then, a relational database management system, or RDBMS, is at its core a set processing machine which understands the language of the relational algebra, which makes it very useful indeed!

One important concept to grasp, and which shows immediately why this is useful, is that of closure. The set of real numbers is closed under the basic operations of arithmetic, for example. What this means is that the result of applying one or more of those operations to a member of the set of real numbers, is another real number.

Similarly, the set of relations is closed under the operations of the relational algebra - the result is always another relation. Relations go in, relations come out. The all important consequence of this is...

Quote:

...The fact that the output from any given relational operation is another relation is referred to as the relational closure property. ... Closure means we can write nested relational expressions - i.e. relational expressions in which the operands are themselves represented by relational expressions in turn, of potentially arbitrary complexity. (There is an obvious analogy between the ability to nest relational expressions in the relational algebra and the ability to nest arithmetic expressions in ordinary arithmetic; indeed, the fact that relations are closed under the algebra is important for exactly the same kinds of reasons that the fact that numbers are closed under ordinary arithmetic is important.)
An Introduction To Database Systems, 7th Ed, C. J. Date, pp 152

Many (most) users of relational databases do not think in terms of sets, and therefore never realize the power at their fingertips, making use of it only to search and sort lists of things. That is kind of like owning a Formula-1 racer, but using it only to drop the kids off at school, and make trips to the grocery store!

But the current relational technology is really very good at what it does! When we realize that it actually is a set processing machine which implements and understands the relational algebra, and use it as such, we obtain great benefits for our trouble!

So that, in a nutshell, is how we get from a newyears puzzle to a relational database, with visits to set theory and graph theory along the way - and why all of that is fun to know, and may be useful to us!

As a quick example of how we may use all of this to find minimal sets of expressions (in my data model) which produce all solutions, here is a slightly simplified template algorithm in pseudo SQL which does that for us with repeated application (the while block is a wrapper, not part of SQL)...

Code:

while {The set of all solutions} is not empty {

SELECT expression
FROM {The set of all expressions} 
JOIN {The set of all solutions} USING {Product terms}
WHERE {Minimizing criteria}
GROUP BY expression
ORDER BY {Optimal ordering criteria}
LIMIT 1

DELETE SetB FROM {The set of all solutions} AS SetA
JOIN {Our selected expressions set} USING expression
JOIN {The set of all solutions} AS SetB USING result

}

Here, JOIN is a relational product operator, producing one kind of product of sets.
SELECT...WHERE is the relational restrict operator, restricting the result to some subset.
GROUP BY aggregate column or columns, expression identifier for this example.
ORDER BY imposes some order on the result set which is an unordered set by definition. Here we use it to place our single desired expression at top of the list which we then LIMIT to that one expression.

You can see, I hope, that this focuses our efforts on finding better Minimizing criteria and Optimal ordering criteria, each ultimately expressed as some property of the input or output sets. All variations of this algorithm are forms of the basic greedy algorithm which produces only approximate solutions, but does so in polynomial time - proportional to the size of the sets. All things considered, it works pretty well!

But if we want an exact solution, there is a "covering algebra" specifically defined for use in finding exact minimal covering sets... if we have the time. But that is a topic for another day!

As a final note, I found this concise but understandable paper on covering sets online from MIT, which you may find useful: Cover set PDF

I hope this overview is at least of some use to you in understanding the why, if not the how of this very interesting subset of the mathematics related to our little puzzle!

New Years Data To DB - Arithmetic, Graphs and Relations

Comments