Bison YACC compiler generator question

schmitta · 10-13-2021, 02:40 PM

I have a grammar BOOK -> PAGES -> page -> lines -> line ->tokens. the book page can have any number of lines (set by user) and the line can have as many tokens as can fit the Line Length which is set by the user. I can't use ; to end a line as this is text. a token is added to the line with a space until the line length is reached then it is UNPUT. I do not know how to tell yacc that I can't fit any more tokens on a line and so to terminate the production and go back up the tree (un load the grammar stack ) This is probably my problem as my grammar is probably suppose to solve this but I do not know how. If a call or a bit could be set in yacc to tell it the line has been made and now reduce the grammar stack it would be useful to me. the input to yacc is a bunch of tokens with some commands

The input is just to feed tokens but the line being built determines when it is finished and is time to reduce.

one way to do this would be, upon a line full condition, unput the input a certain number of times and then insert a crazy string such as (}{) and have a bnf rule <data> ::=tokens token (}{) and then ask lex for the next token and it would get (}{) which would terminate the rule. Is this possible?

astrogeek · 10-13-2021, 05:36 PM

I have not been able to form a picture in my mind of what you are tyring to accomplish, based on your description.

Rather than ask any of the specific questions that come to mind (and which might lead us down an unproductive path!), it would probably be more helpful if you could provide a minimal actual or contrived example of an input stream, how it would be tokenized, and the Bison grammar that is expected to handle it.

Also...

When you say the number of lines and line length are "set by the user", are you referring to the user who is building the parser or the user who is running the resultant program?

A more complete description with example of what you are actually trying to accomplish would help others and myself understand the problem.

sundialsvcs · 10-13-2021, 08:43 PM

If "the line length set by the user" is syntactically significant ... as "tab-stops and indentation" are in Python ... then you may need to address this issue in your scanner/tokenizer. This layer might need to inject an LINE_IS_FULL "token" (of some sort ...) into the symbol stream so that the YACC-or-Bison grammar can then react to it.

On the other hand, perhaps the problem that you describe isn't the parser's problem at all, but rather yours. Maybe you need to write some code which, executing after the parser has done its business and delivered the AST to you, now traverses this structure and then re-structures it to conform to the line-length requirement.

astrogeek · 10-13-2021, 09:10 PM

Taking only this part of your question:

Quote:

Originally Posted by schmitta

I do not know how to tell yacc that I can't fit any more tokens on a line and so to terminate the production...

You would do that in the grammar by having a token with the meaning "line is full" and a production rule which will terminate a "line rule" with that token. One way to do that is to have two levels of rule for a line, a left-recursive one which accumulates words into a line, which I will call segment below, and another which reduces the segment to a line when it is followed by the "line is full" token.

Code:

%token WORD LINEFULL
...
lines: line | lines line ;
line: segment LINEFULL ;
segment: WORD | segment WORD ;

Then you would need variables to keep track of the accumulated words and spaces, say curleng, and the maximum length, maxleng. Then when curleng + yyleng > maxleng have the lexer return LINEFULL and unput the current token.

Maybe something like this in the lexer:

Code:

[a-zA-Z0-9_-]+ { if(curlen + yyleng > maxleng){ /*unput matched text here*/; curleng=0; return LINEFULL; }
                 yylval=strdup(yytext); curlen+=yyleng; return WORD;
               }

The parser would be responsible for adding the length of any spaces or other characters it added to a segment to curleng.

I cannot say how well this may fit your case because I am not sure I understand your case, but hopefully it will answer this one question or at the least lead you to a solution.

UPDATE: And I agree with sundialsvcs that this may not be something that should be handled by the parser (i.e. grammar) at all. If the lexer could match a "line"'s worth of input and return it as a single token value that would remove it from being the responsibility of the grammar. On the other hand, if the tokens returned to the parser are more complex and have syntactic significance, then perhaps breaking it into fixed length lines would be better moved into the semantic processing or even output formatting arenas.

schmitta · 10-14-2021, 09:29 AM

Thank you to all of you for your help to me

schmitta · 10-14-2021, 09:51 AM

The parser would find out from the user the line width ( right margin (.rm <num>) plus left margin (.lm <num>) plus line width .lw <num>)) from commands the user put in the text. how would I send the information back to the lexer to issue the LINEFULL - wait - maybe I am being stupid here - the output is c code for both the lexer and yacc so I could signal between the two since they are both compiled at the same time (I guess with an extern variable) to have lex output LINEFULL when the code in yacc found the end of the line had been reached (in its c code for that production or grammar rule). My problem is that I need to write some code to learn the interaction between the two. I will write a little test case to prove to myself that this can be done. Thank you all for your help in this.

astrogeek · 10-14-2021, 12:35 PM

Quote:

Originally Posted by schmitta

The parser would find out from the user the line width ( right margin (.rm <num>) plus left margin (.lm <num>) plus line width .lw <num>)) from commands the user put in the text.

If these parameters are in the text then they must first be extracted by the lexer and (potentially and subsequently) recognized by the parser. In other words they are first seen by the lexer so there may be no need to pass them back from the parser.

It seems to me that if you are setting margins and widths this way they are ultimately going to be device dependent and not something that you can set by simply breaking lines. In that case you will probably want capture them into variables to be passed along to the output generation routines.

But if the point of the exercise is just this...

Quote:

Originally Posted by schmitta

My problem is that I need to write some code to learn the interaction between the two. I will write a little test case to prove to myself that this can be done.

That is a little ambiguous. Do you want learn how the lexer and parser normally interact, or do you want to prove that you can force them to interact in this specific way, that is possibly not normal?

Assuming you mean that you want to learn how they normally interact...

Quote:

Originally Posted by schmitta

how would I send the information back to the lexer to issue the LINEFULL

Well you wouldn't. The lexer should generate the token sequence based on the input stream and its internal state. The parser should not normally tell the lexer, "Hey! send me one of these next!", or what would be the point of having a grammar to recognize structure in the token stream when it is determining that structure?

It seems to me based on the description so far (and still without actual input examples), that the lexer itself should pick up the .lm, .lw and .rm parameters and maintain its own state based on those, perhaps without passing them along to the parser at all.

Think about using different start states in the lexer to match whole lines and as hinted previously, to perhaps deliver complete lines of the desired length (and now margin offsets) to the parser as a single token value pair.

Quote:

Originally Posted by schmitta

- wait - maybe I am being stupid here - the output is c code for both the lexer and yacc so I could signal between the two since they are both compiled at the same time (I guess with an extern variable) to have lex output LINEFULL when the code in yacc found the end of the line had been reached (in its c code for that production or grammar rule).

Instead of thinking about how to add some new path to signal between the two, think instead about how to make use of the designed in signal path, the token stream and grammar, to accomplish your purpose.

Redefine your lexer to send a well defined token sequence based on the input and its own internal state (including any necessary end state tokens), and define your grammar to recoginze that token sequence.

schmitta · 10-14-2021, 04:22 PM

the line length etc is found out by other command parts of the grammar. If I can access a variable (extern) in the lexer then I can use the code that is already parsing the next token to check this variable and execute a return LINEEND operation to cause the parser to end the grammar line.

sundialsvcs · 10-14-2021, 04:44 PM

I think that you're injecting complexity where it really doesn't belong.

Just let the lexer and the parser do their thing – then, look at the resulting AST-segment in terms of "line length." The parser initially hands you a tree of productions which might or might not "fit." Now, iterate through that tree and change it by building a subtree. Put "as many tokens as will fit" onto each branch of that new subtree. Do this after the parser has done its thing. (Or, you might be able to invoke it during grammar processing ...)

Given the strategy that you seem to be pursuing right now, my guess is that you're going to wind up with an unreliable, fragile "solution." But, as Perl programmers love to say: "TMTOWTDI: "Tim Toady": There's More Than One Way To Do It.™"

astrogeek · 10-14-2021, 06:36 PM

I would still like to see a complete but minimal example of actual input, including real margin and width specifiers and any other text that you use to identify paragraph breaks, page breaks, etc., and a corresponding example of what you would expect to see output by your parser. From that we would be able to make more concrete suggestions.

I agree with sundialsvcs once more, it seems to me that you may be adding needless complexity. It is the very sort of thing I did so many times when trying to figure out how to make use of Lex and YACC! But I finally got it and you will too!

Good luck!

schmitta · 10-16-2021, 12:55 PM

the approach by ASTROGEEK of:

[a-zA-Z0-9_-]+ { if(curlen + yyleng > maxleng){ /*unput matched text here*/; curleng=0; return LINEFULL; }
yylval=strdup(yytext); curlen+=strlen(yytext); return WORD;
}

is the absolute best way to terminate the sequence and then send LINEFULL to terminate the rule in yacc and i thank him and all of you for your inciteful ideas to my problem. Later if you want me to put up my test case to verify this just ask and I will.

astrogeek · 10-16-2021, 02:45 PM

Excellent! And you are welcome, glad that worked for your case!

By the way, you have probably already noticed, but this...

Code:

curlen+=strlen(yytext);

... should really just be this...

Code:

curlen+=yyleng;

...because you already have the length of the word in yyleng (and it was used in the if(){...}). Not sure why I used strlen(), probably a glitch in the matrix or some other asynchronous interrupt!

By all means, please post your test case when you are satisfied with it, always helpful to others who land here in future!

Good luck!

schmitta · 10-17-2021, 04:36 PM

what I am doing - I am writing what IBM in 1972 called SCRIPT and I think the program RUNOFF did the same thing. I never did like wysiwyg and I wanted a program to take free form input and make pretty output on a page sorta like HTML (I call it an automatic editor). My version just does text, titles and outlines. I first wrote it in 1972 in PL/1 and used it for documentation for our projects in our computer lab at VPI&SU (Virginia Tech) and also to write my thesis. Then, later, I wrote it to run on a PC using Bill Gates BASIC. But now I have been trying to write it for C and have a very good parser written as well as a command processor (you will see later) but have been having trouble writing the rest of it because my brain IS NOT AS FLEXIBLE as it use to be (I am 70). So I thought it would be fun to write it in YACC and easier as it has been. I also wrote most of a BASIC compiler (the background stuff expression analyzer; the tokenizer etc) but I want to use YACC to build compilers and assemblers for an abstract machine (modern people call it a VM). I do this stuff, as all of us do this stuff, because it is fun. I am looking for a machine to emulate and was thinking about the 65816 but it will probably be an ARM with peripherals such as I2C UART SPI USB etc.

schmitta · 10-18-2021, 12:40 PM

yacc, lex and script files. Note that I have not et included astrogeeks suggestions

FLEX

Quote:

%{

#include <stdio.h>
#include "y.tab.h"
extern YYSTYPE yylval;
/*int yywrap ()
{
return (1);
}
*/
%}
%option noinput nounput noyywrap
%%

[' '|\t]+ {return BLANKS;}

^'.'"pa"|"PA" {return PA;}
^'.'"lm"|"LM" {return LM;}
^'.'"rm"|"RM" {return RM;}
^'.'"tm"|"TM" {return TM;}
^'.'"t"|"T" {return TITLE;}
^'.'"bm"|"BM" {return BM;}
^'.'"l"|"L" {return OUTLINE;}
^'.'"bl"|"BL" {return BLANK;}
^'.'"pnon"|"PNON" {return PNON;}
^'.'"pnoff"|"PNOFF" {return PNOFF;}
^'.'"pn"|"PN" {return PN;}
^'.'"fon"|"FON" {return FON;}
^'.'"foff"|"FOFF" {return FOFF;}

[0-9]+ {yylval.intValue = atoi(yytext); return NUM;}

^-s\n? |
" "+-s\n? {yylval.stringValue=strdup(yytext); return WORD;}

%%

extern int yyerror(const char *msg){
fprintf(stderr,"%d: %s at '%s'\n",yylineno,msg,yytext);
}

bison

Quote:

%{
#include <stdio.h>
#include <stdlib.h>
void yyerror(char *);
int lm=3,rm=0,tm=3,bm=3,pn=0;
/*
extern int yywrap ()
{
return (1);
}
*/
%}

%start start
%token WORD PA TITLE LM RM TM BM PNON PNOFF PN
%token FON FOFF BLANK OUTLINE FINP FINL NUM BLANKS
%union
{
int intValue;
float floatValue;
char *stringValue;
}
%%
start : book
;

book : pages
;

pages : page
| pages page
;

page : lines FINP
;

lines : line
| lines line
;

line : segment FINL
;

segment : command
| words
;

words : WORD {printf ("%s\n",$<stringValue>1);}
| words BLANKS WORD {printf ("%s\n",$<stringValue>3);}
;

command : PA
| TITLE BLANKS words '\n'
| LM BLANKS NUM '\n' {lm=$<intValue>3;fprintf(stdout,"here:%d",lm);}
| RM BLANKS NUM '\n' {rm=$<intValue>3;}
| TM BLANKS NUM '\n' {tm=$<intValue>3;}
| BM BLANKS NUM '\n' {bm=$<intValue>3;}
| PNON '\n'
| PNOFF '\n'
| PN BLANKS NUM '\n' {pn=$<intValue>3;}
| FON '\n'
| FOFF '\n'
| BLANK '\n'
| BLANK BLANKS NUM '\n'
| OUTLINE BLANKS NUM '\n'
| OUTLINE BLANKS NUM BLANKS words '\n'
;

%%
/*
int main(void)
{
return(yyparse());
}
*/

script file

Quote:

#!/bin/bash
bison -d $1.y
lex $1.l
gcc -o $1 lex.yy.c y.tab.c
#gcc -o $1 -ll y.tab.c
chmod 755 $1

entered .lm 9

and got

--(end of buffer or a NUL)
.lm 9
--accepting default rule (".")
--accepting default rule ("l")
--accepting default rule ("m")

it did not pick up .L rule 20 which is ^'.'"lm"|"LM" {return LM;}

why?

--accepting rule at line 16 (" ")
1: syntax error at ' '
.lm

sundialsvcs · 10-18-2021, 02:19 PM

Don't need to remind you that "bison" and "lex/yacc" are differing technologies and that "bison" can handle certain grammar edge-cases that are, well, its raison d'entre. Don't know if you will encounter any of them in your project.

Incidentally, I wonder if the "grammar convolutions" that were used in the implementation of PHP might be somehow useful to you.