Programming: This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
extern int yyerror(const char *msg){
fprintf(stderr,"%d: %s at '%s'\n",yylineno,msg,yytext);
}
I run the program and when I type:
Quote:
.lm 7
or
this is a test
.lm 7 is a command in my language to set the left margin to 7 spaces.
"this is a test" is a line that should print each WORD of <this is a test> on a separate line.
I am now using flex and bison and I get nothing out.
1) I put yytext in yacc and it is not found until I add extern.
2) Is yytext an int, not a string?
3) $3 is an int when I think it should be a string.
4) How do I access the lex string (which I think is yytext) in yacc?
I will not have a chance to look more closely until this evening, but this does not look right...
Code:
" "+.+\n? {return *yytext;}
Two things:
1. You should return only an integer TOKEN value, not the matched text...
2. This is not a valid statement to return the matched text anyway...
yytext is a char *, so returning its dereferenced value makes no sense. Furthermore, the text pointed to by yytext (the matched text) is not necessarily valid after the return, so if you need the text later you should return a pointer to a copy of it. And YACC/Bison only looks for a semantic value in a YYSTYPE object named yylval (int by default; see the %union directive):
Code:
yylval=strdup(yytext); return TOKEN;
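For completeness, here is a sketch of how the two sides fit together once a %union is in place. The token names and union member names are hypothetical, not taken from the posted grammar:

```
/* Bison file (sketch): give yylval a string member */
%union {
    int   intValue;
    char *stringValue;
}
%token <stringValue> WORD
%token <intValue>    NUM

/* Flex file (sketch): copy the text before returning, since
   yytext is reused on the next match */
[A-Za-z]+   { yylval.stringValue = strdup(yytext); return WORD; }
[0-9]+      { yylval.intValue = atoi(yytext); return NUM; }
```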
That is just what I see on a quick look so there may be other difficulties as well.
It would be helpful if you would also provide the commands you are actually using to build with so we can reproduce what you are seeing. I'll have a closer look later today.
**UPDATE**
Sorry I have not had opportunity to return to this but had another quick look and offer the following...
Code:
command : PA
...
| LM BLANKS NUM '\n' {lm=$2;fprintf(stdout,"here:%d",lm);}
| RM BLANKS NUM '\n' {rm=$2;}
Here and in other places you reference semantic values ($2) which you have not passed from the lexer - this needs to be fixed.
Code:
words : WORD {printf ("%s\n",yytext);}
| words BLANKS WORD {printf ("%s\n",yytext);}
Per my previous notes, yytext is not valid after the return from yylex() and should not be used.
The token WORD is not returned by any lexer rule so these grammar rules will never be used.
The definition of yyerror() does not look correct; for one thing, it returns no value despite being declared to return int.
A suggestion: write a stand-alone lexer with a main(){...} function that will show you what it is doing, so you can first get those parts right independently of the parser, then work on the grammar. Perhaps something like:
Code:
int main(){
    int tok;
    while((tok = yylex())){
        switch(tok){
            /* ... suitable messages here ... */
        }
    }
    return 0;
}
You will also need to define an enum for the tokens as they are normally in the header created by bison and will not be available to the standalone lexer (hint: just copy it from the existing bison generated source, very easy). Then generate the lexer with the -d option to build with debug trace - very useful!
I updated the lex and yacc files in my previous note with the proper and current text. I also added comments to that note to explain what I am doing and what I am confused about. Do I need %union? In C, a union to me means various variables occupying the same storage space.
I found out that the type of $n differs depending on where it is in the parsing stack: sometimes an int, sometimes a string, depending (I guess) on what yylval was assigned and where we are on the stack.
---I do not know if the above is true. I do not understand %union, which seems to name types (intValue, stringValue) rather than assign common storage as in C.
schmitta@schmitta-ThinkPad-T500:~/Dropbox/PRODUCTS/APS_PRODUCTS/OUTLINE/OUTLINELY$ ./makely.sh test01
lex.yy.c:662:12: warning: prototype for ‘yywrap’ follows non-prototype definition
662 | extern int yywrap ( void );
| ^~~~~~
y.tab.c: In function ‘yyparse’:
y.tab.c:1306:16: warning: implicit declaration of function ‘yylex’ [-Wimplicit-function-declaration]
1306 | yychar = yylex ();
| ^~~~~
/usr/bin/ld: y.tab.o: in function `main':
y.tab.c:(.text+0x9dd): multiple definition of `main'; /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libl.a(libmain.o):(.text.startup+0x0): first defined here
/usr/bin/ld: lex.yy.o: in function `yywrap':
lex.yy.c:(.text+0x0): multiple definition of `yywrap'; y.tab.o:y.tab.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
chmod: cannot access 'test01': No such file or directory
This is what I have now. It compiles but does not recognize .lm 8 (set left margin to 8); it gets trapped at .l 33, which has a dot, and it does not get trapped at .l line 20 where it is supposed to get trapped.
Good to see you using %union to set up your semantic value types - always a good idea.
You may also want to declare the types for those tokens and non-terminals which have values and avoid all those bracketed type references. For example, define the types for NUM and WORD in the %token declaration like this...
Code:
%token <intValue> NUM
%token <stringValue> WORD
...then when you reference them in action code...
Code:
words : WORD {printf ("%s\n",$<stringValue>1);}
...may be simply...
words : WORD {printf ("%s\n",$1);}
| RM BLANKS NUM '\n' {rm=$<intValue>3;}
...may be simply...
| RM BLANKS NUM '\n' {rm=$3;}
For any non-terminals which have a type use the %type declaration to set them up.
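For instance, assuming a non-terminal that carries a string value (the names here are hypothetical):

```
%union {
    int   intValue;
    char *stringValue;
}
%token <stringValue> WORD
%type  <stringValue> words   /* declares the value type of the non-terminal */
```

With that in place, $$ and $n references to words need no bracketed type annotations either.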
Quote:
Originally Posted by schmitta
...it compiles but does not recognize .lm 8 (set left margin to 8) but gets trapped at .l 33 which has a . and it does not get trapped at .l line 20 where it is supposed to get trapped.
Your regular expressions are responsible for that. When more than one pattern matches, Flex selects the longest match, or the one which occurs first in the specification if they are the same length. So this rule at line 33...
... is going to match just about anything that you send as input. That rule says "match literally anything from the start of a non-empty line up to and including a newline", which is going to override any of your dot-rules whether or not they are followed by a number because those characters plus the newline will be a longer match.
In fact, there are probably other problems with those regular expressions so you need to look carefully at them and be sure you know what they are actually going to match. You are building with Flex debug trace enabled so just look at what that is telling you for each case of test input. Additionally, as mentioned in a previous post, you may want to set up a stand-alone lexer so that you can test those rules independent of the parser and have certainty about what they are producing.
One final comment - I see that you changed the declaration of yyerror() in the Bison file, but it does not match what is in the Flex file. It will probably be more convenient if you move the definition into the Bison file as well (unless you have some reason for including it in the Flex spec).
I need a way of accepting any input for a word, so I used
Code:
^.+\n?
at the end of all the other rules, but this does not capture numbers, as they are captured by a rule several lines earlier.
It will still override the number match rule too because it matches any number of characters followed by the newline. That rule is going to be very problematic for you and you should probably rethink it.
**UPDATED**
To be very clear, that rule (quoted above) will match all of these lines as they appear in input, instead of their intended rules...
Code:
.l Some text
.t Some text
.lm 8
.rm 20
.pnon
.pnoff
1234
Anything else you put here...
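One way out, while keeping a single lexer state, is to anchor each command pattern and make the catch-all explicitly exclude lines that start with a dot. This is only a sketch under assumed token names (LM, RM, PNON, TEXTLINE and the stringValue member are placeholders, not taken from the posted grammar):

```
^\.lm[ ]+[0-9]+   { /* left margin command  */ return LM;  }
^\.rm[ ]+[0-9]+   { /* right margin command */ return RM;  }
^\.pnon           { return PNON;  }
^\.pnoff          { return PNOFF; }
^[^.\n].*         { /* any line NOT starting with a dot */
                    yylval.stringValue = strdup(yytext); return TEXTLINE; }
\n                { return '\n'; }
```

Because the text rule can no longer match a line beginning with a dot, the command rules win even though Flex prefers the longest match.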
Thank you ASTROGEEK for your gracious help. I am basically coming up from knowing nothing, using the internet to educate myself in this.
You are very welcome!
I know why I am here too, best expressed by someone named schmitta:
Quote:
Originally Posted by schmitta
I do this stuff, as all of us do this stuff, because it is fun.
Agreed! And I have found the whole progression of ideas behind parsing and compilers (and there are a lot of them!) more interesting, and more fun than many other problems encountered in computing! It is a treat to encounter others exploring them too!
To astrogeek - 400 years before the birth of Christ, Isaiah tells of his coming in chapter 53. He tells of his virgin birth; that he would be wounded for our transgressions and that by his stripes we would be healed. He tells of Christ's death and resurrection and that belief in him is the only way to heaven. Only stupid people go to hell, because hell is a lake of liquid fire that an angel of God throws the sinner into. When he hits he screams bloody murder and the pain is unreal and forever. I don't want to see anyone go there. Heaven is where you get what you were always looking for, even if you did not know that on earth. Please confess with your mouth to someone that Jesus is God and believe in your heart that God the Father raised Jesus the Son from the dead, and you shall be saved from an eternity of misery. Thank you - I just felt the need to share that.
Thank you for sharing that which is most important to you!
LQ, like the Free software movement and the culture we share, was founded on the idea of sharing, helping others, doing to others as we would want for ourselves. However we may express it, that is really why we are here, isn't it!
Your comments are received and appreciated in the spirit in which they were given, thank you for sharing them!
But this is a technical forum, so let's continue our explorations of the topic at hand in that same spirit of sharing, for the benefit of us all.
"My" lexer is still just your lexer code. All I did was copy in the union and token-type enum, and add a simple main(){} function to call yylex just as the Bison code would do. The main point of doing that is to separate the lexer from the parser so that you can interact with the lexer independently of the parser (and grammar) and gain a better degree of certainty about how your lexer rules actually work. I have found it to be a useful exercise for most of my own projects, and it is easy to do.
The only "specification" I have for your project is what you have described in your posts and what I imagine you intend from looking at your rules - which is incomplete at best.
I would suggest that you write a simple description in plain English of how each part of the input is supposed to be handled. For example, try to write a concise one line description of each of the dot-rules, specifying how it must appear in the input stream (i.e., at start of the document or embedded, must be at start of line or may be inline with other text, followed by number or not, one per line, etc.), and what effect it has on the output stream.
Then do the same for the text you are trying to process. Should whitespace be preserved? Does it recognize paragraph breaks? Page breaks? Is all text just words and whitespace, or do you need to recognize any special keywords or symbols? Etc...
Try to then put together a simplest test case, or a few of them, which you can then feed into your standalone lexer or combined parser application to work out the necessary rules... it is really only at this point that you are in position to work those descriptions and examples into a proper grammar.
Because you are trying to process the input text as blocks of text, as opposed to a small set of keywords or other symbols, you will probably find it helpful to make use of different start states in the lexer. That will allow you to separate out the control commands from random text without getting things crossed up as your current rules are doing. If you are not familiar with Flex start states I'll be happy to suggest an example based on any test case you care to post.
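For what it's worth, here is a minimal sketch of the start-state idea. The exclusive state name, token names, and union members are all hypothetical, untested against the posted grammar:

```
%x CMD
%%
^\.[a-z]+     { BEGIN(CMD); yylval.stringValue = strdup(yytext); return CMDNAME; }
<CMD>[ ]+     { return BLANKS; }
<CMD>[0-9]+   { yylval.intValue = atoi(yytext); return NUM; }
<CMD>\n       { BEGIN(INITIAL); return '\n'; }
^[^.\n].*     { yylval.stringValue = strdup(yytext); return TEXTLINE; }
\n            { return '\n'; }
```

While in the CMD state only the <CMD> rules apply, so the plain-text rules can stay simple and the two kinds of input never compete for the same match.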
Also, as you are new to this, I suggest you try to find a copy of Parsing Techniques: A Practical Guide by Dick Grune and Ceriel J. H. Jacobs. For several years Grune offered free download of an earlier edition online, although that was gone last time I looked. You can probably find a used copy of the original edition online for $5-$10 and it will repay you many times the cost! If you can clearly understand the ideas presented in just the first three chapters your world will be changed - at least with regard to the basic ideas of parsing!