
ajiten 07-30-2023 08:23 PM

Why can a regular grammar not be both left-recursive and right-recursive?
 
It is stated that a regular grammar can be either left-linear or right-linear, but not both.
I want to understand why.
Consider the grammar below, which uses left recursion in every rule except the one deriving the non-terminal factor:
Code:

expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> digit factor | digit
digit -> 0|1|2|3|4|5|6|7|8|9

I want to understand why parsing arithmetic expressions with the above grammar is said to be impossible; take, for example, the string 2*3*5-112+2+3.

Please give some insights.

NevemTeve 07-30-2023 11:18 PM

Exactly what is impossible and why is it impossible?
My only concern is that lexical parsing should be a separate process:
Code:

expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> ( expr ) | NUMBER

# Lexical parser returns NUMBER + - * / ( )
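
With such a lexer in front, the parser never sees raw digits. As a sketch (writing NUMBER(n) for a NUMBER token carrying the value n), your example string would reach the parser as this token stream:
Code:

2*3*5-112+2+3  =>  NUMBER(2) * NUMBER(3) * NUMBER(5) - NUMBER(112) + NUMBER(2) + NUMBER(3)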


ajiten 07-31-2023 04:14 AM

Quote:

Originally Posted by NevemTeve (Post 6445301)
Exactly what is impossible and why is it impossible?
My only concern is that lexical parsing should be a separate process:
Code:

expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> ( expr ) | NUMBER

# Lexical parser returns NUMBER + - * / ( )


Thanks, but you seem to be saying that if a regular grammar is used for the parser, it can be both left-linear and right-linear.
And that it is just a theoretical issue whether such a regular grammar is left-linear or right-linear.

But still the doubt lingers: why do books state that a regular grammar must be either left-linear or right-linear, not both?

NevemTeve 07-31-2023 04:39 AM

I don't know what you mean by 'linear', but I do think you should actually start creating example programs with bison:
https://www.gnu.org/software/bison/manual/bison.html

astrogeek 07-31-2023 10:45 AM

I too am confused by your terms - what exactly do you mean by linear? And are you asking specifically about a "regular" grammar, that is, a Type 3 grammar in the Chomsky hierarchy? All of which leads me to think that you may be confusing yourself as well, and using an example which is not applicable to the question.

Taking just the question itself:

Quote:

Why a regular grammar cannot be both left-recursive, and right-recursive?
By definition a regular grammar, that is a Type 3 grammar, cannot "remember" things that have gone before, which means that the RHS may contain at most one non-terminal, and if present it must come at the end. This means that there may be only two types of rules in a regular grammar (from Parsing Techniques: A Practical Guide, Dick Grune & Ceriel J. H. Jacobs, 1990, pg. 38):

Quote:

A non-terminal produces zero or more terminals
A non-terminal produces zero or more terminals followed by one non-terminal
This definition precludes left-recursion, so left-recursion is impossible in such a regular grammar.
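
To see concretely why mixing directions is forbidden, here is a standard illustration (my own addition, not from the book). Each rule below is individually left-linear or right-linear, yet the grammar as a whole mixes both:
Code:

S -> a A | ε      (right-linear; ε is the empty string)
A -> S b          (left-linear)

A derivation runs S => aA => aSb => aaAb => aaSbb => ..., so this grammar generates {a^n b^n : n >= 0}, which is famously not regular - no finite-state machine can match the count of a's against the b's. A grammar that freely mixes left-linear and right-linear rules is therefore no longer guaranteed to describe a regular language, which is exactly why the definition requires one direction or the other.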

I agree that you should be exploring this by writing your own examples using GNU Bison - and be precise about your terms. It may be that other authors use other terms, but you should still make the effort to define their meaning more precisely, not assuming that others have read the same material.

ajiten 08-06-2023 01:44 PM

Why is it valid for the parser to break a token into smaller units?
 
Reconsidering the grammar above, for a different reason.
For the arithmetic expression:
Code:

200 + 300 * 50 / 2 * 5 - 3 + 4
The rule for the non-terminal factor will break the tokens 200, 300, and 50 into smaller parts, which is somewhat confusing, as a token is considered an atom by the lexer.

Code:

expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> digit factor | digit
digit -> 0|1|2|3|4|5|6|7|8|9


NevemTeve 08-06-2023 02:08 PM

I might have mentioned earlier that you should have a lexical parser that returns a sequence of consecutive digits as a single unit (and also removes the white-space between tokens), so it should be like this:

Code:

expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> NUMBER | ( expr )


ajiten 08-06-2023 03:58 PM

Quote:

Originally Posted by NevemTeve (Post 6446700)
I might have mentioned earlier that you should have a lexical parser that returns a sequence of consecutive digits as a single unit (and also removes the white-space between tokens), so it should be like this:

Code:

expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> NUMBER | ( expr )


Thanks, but any hint towards coding this lexical parser, where NUMBER is both a token and a terminal, and hence is taken directly without any need to break it further?

Also, it is unclear why the last rule is needed, i.e.
Code:

factor -> (expr)
=============================================

Edit:
Have found here a similar way to split an integer number into parts, based on a recursive rule.

Though the Rosetta Code version does not break the number apart.

NevemTeve 08-06-2023 11:40 PM

"Token" is a data unit returned by the lexical parser. Sample:
Input:
Code:

str := "foo" + 'b';
Tokens:
Code:

IDENTIFIER
ASSIGN
STRINGLITERAL
+
CHARLITERAL
;

https://en.wikipedia.org/wiki/Lexical_analysis
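
As a hint for the coding question, here is a minimal lexer sketch in C for the arithmetic grammar above (the yylval variable and the token codes are my illustrative choices, not from any particular tool):
Code:

#include <ctype.h>
#include <stdio.h>

/* Single-character operators are returned as themselves; NUMBER
   stands for a maximal run of consecutive digits. */
enum { NUMBER = 256, END = 257 };

static long yylval;              /* value of the last NUMBER token */

/* Return the next token from stdin, skipping white-space. */
static int next_token(void)
{
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n')
        c = getchar();
    if (c == EOF)
        return END;
    if (isdigit(c)) {
        yylval = 0;
        do {
            yylval = yylval * 10 + (c - '0');
            c = getchar();
        } while (c != EOF && isdigit(c));
        if (c != EOF)
            ungetc(c, stdin);    /* one character of look-ahead */
        return NUMBER;
    }
    return c;                    /* '+', '-', '*', '/', '(', ')', ... */
}

The parser calls next_token() each time it needs another token; a NUMBER arrives whole, so the parser never has to break it apart.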

Regarding `factor->(expr)`: please try to parse this expression: `1*(2-3)`
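
In case it helps, here is a sketch of one derivation of 1*(2-3) with the tokenized grammar, which is only possible because of the factor -> ( expr ) rule:
Code:

expr => term
     => term * factor
     => factor * factor
     => NUMBER * factor              (NUMBER is 1)
     => NUMBER * ( expr )
     => NUMBER * ( expr - term )
     => NUMBER * ( term - term )
     => NUMBER * ( factor - term )
     => NUMBER * ( NUMBER - term )   (NUMBER is 2)
     => NUMBER * ( NUMBER - factor )
     => NUMBER * ( NUMBER - NUMBER ) (NUMBER is 3)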

ajiten 08-06-2023 11:43 PM

Quote:

Originally Posted by NevemTeve (Post 6446759)
"Token" is a data unit returned by the lexical parser. Sample:
Input:
Code:

str := "foo" + 'b';
Tokens:
Code:

IDENTIFIER
ASSIGN
STRINGLITERAL
+
CHARLITERAL
;

https://en.wikipedia.org/wiki/Lexical_analysis

But that means that your (and Rosetta Code's) approach is correct!

NevemTeve 08-07-2023 01:36 AM

> But, that means that your approach is correct!

I should hope so. Do you still have any unanswered question?

sundialsvcs 08-10-2023 08:58 AM

A few comments:

The scanner, or "lexer," is the input side. Its purpose is simply to take a stream of raw characters and to arrange them meaningfully into "tokens." For example, the seven consecutive characters 1234.56 would be recognized as "a floating-point number" and passed along accordingly. The lexer is also able to distinguish between '<' '<=' '<>' and to return three individual 'tokens' accordingly, usually needing to "look ahead" only one character to do so. (A sketch of that look-ahead follows below.)
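
For instance, that one-character look-ahead might look like this in C (next_char(), push_back() and the TK_* token codes are illustrative names, not from any real scanner):
Code:

/* Token codes and input primitives are illustrative placeholders. */
enum { TK_LT = 1, TK_LE, TK_NE };
extern int  next_char(void);
extern void push_back(int c);

/* Distinguish '<', '<=' and '<>' using one character of look-ahead. */
int scan_less_than(void)
{
    int c = next_char();
    if (c == '=')
        return TK_LE;        /* "<=" */
    if (c == '>')
        return TK_NE;        /* "<>" */
    push_back(c);            /* look-ahead char starts the next token */
    return TK_LT;            /* plain "<" */
}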

(The language is designed that way, from the bottom up. It is created to be "efficiently parsed" by such an engine. And then, to allow for efficient code generation and the generation of efficient code. But, I digress ...)

The parser then accepts this stream of "tokens" – TK_FLOAT, TK_LT, TK_LE, TK_NE – and processes it against the "grammar."

However, at some point you must also consider how the underlying parser engine is actually implemented. The grammar that you write is transformed into a data structure that actually feeds the engine, and "the engine is built for simplicity and raw speed." When the pedal finally meets the metal, the engine isn't "recursive" at all. Usually, it amounts to (in part) an extremely clever FSM = finite-state machine. This implementation imposes certain by-design constraints upon how it actually works, and many of those decisions are careful compromises. (A schematic of such an engine follows below.)
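
To make that concrete, here is a schematic of a table-driven (LR-style) engine in C. Everything here - the ACTION/GOTO tables, the helper names, the sizes - is illustrative, not bison's real data layout:
Code:

#define NTOKENS  64                      /* illustrative table sizes */
#define NNONTERM 32

enum kind { SHIFT, REDUCE, ACCEPT, ERROR };
struct action { enum kind kind; int state; int rule; };

/* Assumed to be generated from the grammar / provided by the lexer: */
extern struct action ACTION[][NTOKENS];  /* state x token   -> action */
extern int GOTO[][NNONTERM];             /* state x nonterm -> state  */
extern int rhs_length[], lhs[];          /* per grammar rule          */
extern int next_token(void);
extern void syntax_error(void);

void parse(void)
{
    int stack[256] = {0};                /* stack of states; 0 = start */
    int top = 0;
    int tok = next_token();
    for (;;) {
        struct action a = ACTION[stack[top]][tok];
        if (a.kind == SHIFT) {           /* push new state, consume token */
            stack[++top] = a.state;
            tok = next_token();
        } else if (a.kind == REDUCE) {   /* pop |RHS| states, follow GOTO */
            top -= rhs_length[a.rule];
            stack[top + 1] = GOTO[stack[top]][lhs[a.rule]];
            ++top;
        } else if (a.kind == ACCEPT) {
            return;                      /* input matches the grammar */
        } else {
            syntax_error();
            return;
        }
    }
}

Note that there is no recursion anywhere: the "decision paths" live entirely in the tables, and the loop just follows them.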

The compiler for that grammar understands the innermost workings of that particular engine and builds its data structures (similarity: "object code") accordingly. Of course, not all engines are the same: "bison" is 'more advanced' than "yacc" because it is specifically engineered to handle more difficult cases. But, at the end of the day, both of them are focused on being "blindingly fast." And this basically means that "they don't 'think.'" The decision paths are built into the data structure by the compiler ... and imposed upon you as the creator of the "valid" grammar that goes in. It is because of the underlying nature of the engine that various grammars and their grammar-compilers also differ.

I would encourage you to "dumpster-dive into the actual source code" of the "compiler creator" of your choice – since you have it – and to read some of the many articles (and books) that have been written to describe why(!) it works. What were the issues that prompted the creators of these tools to write them in this particular way? They were building tools to solve an extremely practical and commonly-encountered problem. But, from their technical point of view, what was 'that problem?' Exactly?

ajiten 08-11-2023 08:38 AM

Quote:

Originally Posted by sundialsvcs (Post 6447413)
... Its purpose is simply to take a stream of raw characters and to distinguish between '<' '<=' '<>' and to return three individual 'tokens' accordingly, usually needing to "look ahead" only one character to do so.

The lexer generator for the C language is built on a regular grammar, though the same component in some other tools is based on a CFG.
If that is correct, the reason must be simplicity; but I still request some example-based reasoning to show why the weaker formalism (regular, rather than context-free) has been chosen for the lexer. In fact, it is stated that the gcc C compiler has a hand-built lexer, rather than one generated by a tool such as Lex, for reasons of efficiency; but the gcc C lexer is still based on a regular grammar.
I suspect the gcc C lexer is built on a regular grammar because it works in a closed loop with its parser, which demands one token at a time.
But I am unable to find any literature on this.

Quote:

(The language is designed that way, from the bottom up.
Kindly elaborate or provide a link for what is meant by 'bottom up'.

Quote:

It is created to be "efficiently parsed" by such an engine. And then, to allow for efficient code generation and the generation of efficient code. But, I digress ...)

The parser then accepts this stream of "tokens" – TK_FLOAT, TK_LT, TK_LE, TK_NE – and processes it against the "grammar."

However, at some point you must also consider how the underlying parser engine is actually implemented. The grammar that you write is transformed into a data structure that actually feeds the engine, and "the engine is built for simplicity and raw speed." When the pedal finally meets the metal, the engine isn't "recursive" at all.

Usually, it amounts to (in part) an extremely clever FSM = finite-state machine. This implementation imposes certain by-design constraints upon how it actually works, and many of those decisions are careful compromises.
Please give an example or reference showing that the parser, instead of being recursive (and hence context-free), is regular.

It seems the book by Holub might point in this direction, as it has a code implementation; but I have to search.
If you know of a source, kindly point it out.


Quote:

The compiler for that grammar understands the innermost workings of that particular engine and builds its data structures (similarity: "object code") accordingly. Of course, not all engines are the same: "bison" is 'more advanced' than "yacc" because it is specifically engineered to handle more difficult cases. But, at the end of the day, both of them are focused on being "blindingly fast." And this basically means that "they don't 'think.'" The decision paths are built into the data structure by the compiler ... and imposed upon you as the creator of the "valid" grammar that goes in. It is because of the underlying nature of the engine that various grammars and their grammar-compilers also differ.
Seems "bison"/"yacc" is different from the data structures, in which the decision paths are implemented.


Quote:

I would encourage you to "dumpster-dive into the actual source code" of the "compiler creator" of your choice – since you have it – and to read some of the many articles (and books) that have been written to describe why(!) it works.
Request book and article names; even 1-2 should suffice.
Have got from Delnet even a very old book titled Pascal - The Language and its Implementation, by D. W. Barron.
Have most of the books in the library, but it seems none reaches any of the topics stated by you.
The gcc C compiler related statements above are courtesy of some answers on StackExchange, hence depth is missing.

Quote:

What were the issues that prompted the creators of these tools to write them in this particular way? They were building tools to solve an extremely practical and commonly-encountered problem. But, from their technical point of view, what was 'that problem?' Exactly?
That depends on sources, and it seems I have none, in spite of googling.

sundialsvcs 08-11-2023 08:53 AM

I have long ago come to the conclusion that authors who write books about compilers don't know how to write. :) Many of them are very frankly terrible.

ajiten 08-11-2023 06:32 PM

Quote:

Originally Posted by sundialsvcs (Post 6447592)
I have long ago come to the conclusion that authors who write books about compilers don't know how to write. :) Many of them are very frankly terrible.

Then some paper, or some open-source compiler code, should be the clue.
So I should start with the yacc/bison source code.
Any direction on what to put on a checklist for such code, and in what order - particularly for checking how the parser is implemented, or how the lexer is implemented?
Also, the reason why it is implemented that way.
Also, efficiency results based on a particular implementation of the parser/lexer as separate phases, or as a closed loop.
But that needs literature too, to point in the right direction, and papers must be there.
Your earlier answers are relevant here, as they raise more questions than I have asked here.
In particular, the last lines of your earlier message, restated below:
Quote:

I would encourage you to "dumpster-dive into the actual source code" of the "compiler creator" of your choice – since you have it – and to read some of the many articles (and books) that have been written to describe why(!) it works. What were the issues that prompted the creators of these tools to write them in this particular way? They were building tools to solve an extremely practical and commonly-encountered problem. But, from their technical point of view, what was 'that problem?' Exactly?
The answer cannot be found just by understanding stand-alone snippets of the source code. Literature is needed, though that literature must itself have been derived from coding.

