Let's first implement tree-less, on-the-fly calculation. The point is to get a better feel for how much of the already processed part of the expression you really need for calculation and decision making.
I guess you can just return pre-calculated results instead of trees.
Quote:
Originally Posted by Sergei Steshenko
Secondly, the tree is a "book" one, but not necessarily the most convenient one. For many practical tasks a tree like this:
Code:
(+, 1, 2, 3, (*, 4, (^, 5, 2)))
is much more convenient.
I just thought that since they are binary operators, they would be better represented as having two operands each.
Quote:
I guess you can just return pre-calculated results instead of trees.
I just thought that since they are binary operators, they would be better represented as having two operands each.
There are two points.
1) In the first approximation you do not need a tree, and trust me, adding tree generation (and, if necessary, eliminating the on-the-fly calculation) is a walk in the park in a correctly implemented on-the-fly calculator.
2) If you think about constant folding, e.g. if you want to convert
Code:
x + y + 3 + 4 + 5 * (z + 7 + 8)
into
Code:
x + y + 7 + 5 * (z + 15)
then the second form is more convenient.
Philosophically/empirically, if you can do something without recursion, do it - your form is more recursive. But again, for general understanding at this stage the tree is premature.
The solution I prefer for this common parsing problem is to use an explicit data stack instead of much of the recursion, and to give each operator a left precedence distinct from its right precedence, allowing no ties (ties are easy to handle, but not allowing them is less confusing).
For example, consider making every left precedence odd and every right precedence even. For ordinary operators, always make the right precedence one more than the left precedence. For right associative operators (such as exponentiation in many languages) make the right precedence one less than the left precedence.
I prefer languages in which right and left precedence can get further apart, such as having an assignment operator with a very high left precedence despite its usual very low right precedence.
Ignoring startup and termination details, the basic work flow is:
You have a stack of pairs of subexpression and following operator. The stack always has at least one such pair.
The main loop does:
A) Read another primary expression and following operator and push onto the stack.
B) Repeat as long as the left precedence of the top operator is less than the right precedence of the second to top operator:
B1) Pop both pairs from the stack.
B2) Combine the two subexpressions with the operator that was second to top.
B3) Push back the pair of that resulting subexpression and the other operator.
The startup is obviously just reading the first primary expression and following operator.
The termination is simplest if you invent a dummy operator representing any token that is not a binary operator that is seen when you try to read a binary operator. That dummy operator has left precedence lower than any real right precedence.
Before step A in the main loop, you check whether the one operator you have on the stack is the dummy. If so, you exit.
Re the original question of this thread, it may be easiest to understand a recursive expression parser (such as the one in that tutorial) by seeing its similarity to the explicit stack version I just described.
The stack is just the local variables of all the recursive calls. So a net push is one level of call and a net pop is one level of return (but the double pop and push that forms a new subexpression is combined, so it is not two returns and a call). Otherwise, the operation sequence and all the decision points of the recursive version are the same as in the explicit stack version.
This is not the final algorithm, I'm just still thinking.
You know, please write the types of the variables involved, i.e. 'op', 'rhs', 'next', 'lhs'.
Also, please publish all the subroutines related to each other; I mean that expression parsing/calculation involves cross/mutual recursion, so it's easier to comprehend them all at once.
I.e. if A possibly calls B, and B possibly calls C, and C possibly calls A, then it's better to see all the A, B, C side by side.
Another problem I see (and C/C++ is not quite your friend here): your algorithm will most likely behave erratically/ungracefully in case of errors. I.e., what will the behavior be in the case of, say,
Quote:
You know, please write the types of the variables involved, i.e. 'op', 'rhs', 'next', 'lhs'.
Also, please publish all the subroutines related to each other; I mean that expression parsing/calculation involves cross/mutual recursion, so it's easier to comprehend them all at once.
I disagree. I think it is better to focus on the algorithm being designed at the moment and just make reasonable assumptions about the characteristics of the data types and called functions.
Part of being a skilled programmer is doing a good job of estimating what is or isn't a reasonable assumption for the parts you aren't focusing on at the moment.
Quote:
Originally Posted by Sergei Steshenko
Without going too much into detail - see the line in red, and the line in blue - should the line in blue be immediately after the line in red?
I mean, can you be sure the look-ahead token is indeed an operation and not a newline, apparently indicating end of expression?
I also have an issue with the line in blue, but not any issue with its position.
I'd be much more comfortable with
Code:
if( ! binary_operator(next) ) break;
Does every expression end with '\n', including invalid expressions like Sergei's "2 3 + 4" example? I think not.
Quote:
Originally Posted by Sergei Steshenko
So,
Code:
assoc['\n']
is defined and returns a value high enough to make the 'if' condition true?
The point should be that assoc[x] is defined for any value x that might be returned by peekToken(), and that for any token that isn't a binary operator it is some value not equal to either ASSOC_LEFT or ASSOC_RIGHT, so the if condition will not be true.
I'd also like to comment on an important difference between the typical non object oriented version of the above code and a typical object oriented version.
A typical non object oriented version would tend to have a predefined set of possible values returned by peekToken(), so that when the token is a number or identifier, the returned value can only indicate that it is a number or identifier. It can't hold which number or identifier.
Then something like assoc[next] can be the simple array lookup that it appears to be. But the important info about which number or identifier has been tokenized must be returned through some back door.
In a more object oriented design the return from peekToken() would be a pointer or object for the whole token. So assoc[next] is not a simple array lookup. It's really the execution of a function.
Quote:
I disagree. I think it is better to focus on the algorithm being designed at the moment and just make reasonable assumptions about the characteristics of the data types and called functions.
Part of being a skilled programmer is doing a good job of estimating what is or isn't a reasonable assumption for the parts you aren't focusing on at the moment.
...
I know that the hardest mistakes to correct are infrastructural ones.
If we are talking about both a parser and an on-the-fly evaluator, and if they are recursive (which they are), two values should be returned:
result of parsing (i.e. legal or illegal language construct);
result of calculation (i.e. the numeric value, defined only when the construct is legal).
I'm thinking that I should just drop this issue for now.
Taking my mind away from it for a while might help me understand it better, and also this topic still reminds me of yesterday's argument and makes me a little emotional and regretful.
Quote:
I'm thinking that I should just drop this issue for now.
Taking my mind away from it for a while might help me understand it better, and also this topic still reminds me of yesterday's argument and makes me a little emotional and regretful.
Here are some more examples to think about:
Code:
1 + (2 - 2) + 3
- a correct expression - indeed needs recursion; the inner expression legitimately returns 0 as its numeric value and TRUE from the grammar point of view because it is correct, so parsing/calculation should continue.
Code:
1 + (2 2) + 3
- a wrong expression - the inner expression should return FALSE from the grammar point of view because it is wrong, its value for calculation is undefined, and parsing/calculation should stop.