Unable to understand why the given Lex program fails to recognize the given input string.

ajiten · 07-06-2023, 11:45 PM

The problem is stated below, based on the contents here: https://www.ibm.com/docs/en/zos/2.4....uity-lookahead

It states a Lex program may be ambiguous, if some particular string may match more than one translation expression.
If the input matches more than one expression, Lex uses the following rules to determine which action to take:
1. The rule that matches the longest possible input stream is preferred.
2. If more than one rule matches an input of the same length, the rule that appears first in the translations section is preferred.

It states that the below fragment of lex program, would not be able to recognize the input string: ‘abbb9’, due to the second translation expression never being reached.
This is stated to be due to the above rule (not clear which of the two, the author implies):

letter [[:lower:]]
%%
a({letter})* { return('A'); }
ab({letter})* { return('B'); }

It states that by the swapping the sequence of the two translation expressions, the given string would be recognized.

NevemTeve · 07-07-2023, 12:13 AM

You might have missed this part:
In this example, rule 1 is not sufficient, as, for example, the string abb9 applies to either action; therefore, by rule 2, the first matching action should apply.

Note: it is unlucky that 'rule' is used in two meaning, maybe it could be rephrased as:

1. The pattern that matches the longest possible input stream is preferred.
2. If more than one pattern matches an input of the same length, the one that appears first in the translations section is preferred.

(Also '9' in 'abbb9' is not part of the pattern, guess it is meant to terminate the 'abbb' sequence.)

ajiten · 07-07-2023, 12:29 AM

Quote:

Originally Posted by NevemTeve

You might have missed this part:
In this example, rule 1 is not sufficient, as, for example, the string abb9 applies to either action; therefore, by rule 2, the first matching action should apply.

Note: it is unlucky that 'rule' is used in two meaning, maybe it could be rephrased as:

1. The pattern that matches the longest possible input stream is preferred.
2. If more than one pattern matches an input of the same length, the one that appears first in the translations section is preferred.

(Also '9' in 'abbb9' is not part of the pattern, guess it is meant to terminate the 'abbb' sequence.)

Thanks, but do you mean to say that I have wrongly interpreted that the given Lex program, would not be able to read the input: 'abbb'?
If yes, then why the second pattern is preferred by the author? What difference is there, if the first pattern is able to read the input: 'abbb'?
Also, if the second pattern is never reached by the given program, then why the grammar is ambiguous?
How the swap of the two patterns, would make the program unambiguous?

NevemTeve · 07-07-2023, 12:49 AM

It is no question that the program will accept `abbb`, the question is which pattern will be used. Both pattern (ab(letter)* and a(letter)*) matches the sequence (the whole), so the first one wins.

ajiten · 07-07-2023, 12:54 AM

Quote:

Originally Posted by NevemTeve

It is no question that the program will accept `abbb`, the question is which pattern will be used. Both pattern (ab(letter)* and a(letter)*) matches the sequence (the whole), so the first one wins.

But, this winning of the first pattern having translation expression, for recognizing the given input string; would make the program ambiguous, as per the author!
But, how the swap of the two translation expressions, would make the program unambiguous is unclear?
Still both translation expressions can match, and just there is no need for the second one, i.e. in the program containing the swapped translation expressions.

NevemTeve · 07-07-2023, 01:00 AM

Please re-read the second rule:

1. The pattern that matches the longest possible input stream is preferred.
2. If more than one pattern matches an input of the same length, the one that appears first in the translations section is preferred.

astrogeek · 07-07-2023, 01:16 AM

Quote:

Originally Posted by ajiten

Also, if the second pattern is never reached by the given program, then why the grammar is ambiguous?
How the swap of the two patterns, would make the program unambiguous?

I have not read the linked article, but based on what has been said in this thread so far...

If you have a pattern that can never be reached by any matching pattern, then what is the purpose of that pattern... that is ambiguity. The program can never see a 'B' token.

If you swap the order of those patterns then patterns which match it will be returned, thus removing the ambiguity. The program can see a 'B' token.

ajiten · 07-07-2023, 01:24 AM

Quote:

Originally Posted by astrogeek

I have not read the linked article, but based on what has been said in this thread so far...

If you have a pattern that can never be reached by any matching pattern, then what is the purpose of that pattern... that is ambiguity. The program can never see a 'B' token.

If you swap the order of those patterns then patterns which match it will be returned, thus removing the ambiguity. The program can see a 'B' token.

Thanks, but by swapping the matching patterns, the program would not able to return a 'A' Token.
And, that means it cannot be the basis for stating that the modified program is unambiguous.

To quote the source, it states:
In this example, rule 1 is not sufficient, as, for example, the string abb9 applies to either action; therefore, by rule 2, the first matching action should apply.

syg00 · 07-07-2023, 02:14 AM

Might one ask why you are reading that particular doco ?. That is for a Unix subsystem that was added to IBMs proprietary z/OS operating system running (only) on IBM proprietary mainframe zseries hardware. Note - not z/linux.

I would image the lex code is pretty common, but it's not guaranteed - remember this is the company that also has AIX.

NevemTeve · 07-07-2023, 03:15 AM

> Thanks, but by swapping the matching patterns, the program would not able to return a 'A' Token.

Of course it would, try sequences "a", "aa", "acab" etc.

ajiten · 07-07-2023, 05:45 AM

Quote:

Originally Posted by NevemTeve

> Thanks, but by swapping the matching patterns, the program would not able to return a 'A' Token.

Of course it would, try sequences "a", "aa", "acab" etc.

So, the document is wrong in specifying the problem, as one of ambiguity; as the problem is of specifying non-general patterns, that distinguish between different cases of inputs; rather than specifying a relatively general pattern.
An alternative way to state the same, is: the pattern should match the state diagram precisely.
So, the state diagram for the input language given by the pattern in the second rule, as:
> start node (let, n1), one input: 'a', that leads to the second node (let, n2).
n2 , in-turn, has two possible inputs: 'a', 'b'; and that leads to two new states (nodes): n3, n4; and n3/n4 nodes can have any number of 'a', or 'b', as inputs; and act as final states.

It implies that a general pattern, should not be used, instead of a particular pattern.

But, is that synonymous with ambiguity?

NevemTeve · 07-11-2023, 12:28 AM

Having both (1) a[a-z]* and (2) ab[a-z]* in the language is ambiguous (it is not lex-related).
(Possible solution is changing the rule (1) to a|(a[ac-z][a-z]*)

Programmatic implementations (like lex) can solve this (without changing the pattern) by defining evaluation order: (2) first, then (1)