LinuxQuestions.org - Book "Compiler Construction using Flex and Bison"

- - Book "Compiler Construction using Flex and Bison" - problems, discussions, steps, ... (https://www.linuxquestions.org/questions/programming-9/book-compiler-construction-using-flex-and-bison-problems-discussions-steps-4175587087/)

astrogeek

08-16-2016 02:41 AM

Quote:

Originally Posted by dedec0 (Post 5591397)

astrogeek, thank you for your explanations. I had not yet seen %type, so I had no clue what it was. I just used it as something more to try, almost blindly (as my trial and error report shows).

As I understand it now, IDENTIFIER is the token that corresponds to the variable name, not its value or symbol. In C we could have an integer assignment:

boxOfFruits = 26;

For this line, there would be 4 tokens: IDENTIFIER (with a string "boxOfFruits"); EQUAL_SIGN; INTEGER (with value 26); END_OF_LINE. So, is IDENTIFIER a terminal symbol or not? (I think it is not.) The language of the book is not C, but it is something similar for that expression.

I downloaded the PDF with the intent of copying and compiling the code. But as there does not seem to be a complete version for each chapter and everything is shown as modifications from the previous page, I would have to read it all and work through the changes and I do not have time to do that now.

So, I will try to offer what I think will be the most helpful comments for you, but maybe not as directly applicable to your code example as you would like.

First, let me try to clarify a few terms and their use that did not seem to be presented clearly in my limited scan of the PDF. I think that getting these very clearly in mind will allow you to understand things not so clearly explained in examples that you find on the interwebs...

Let's start with terminal and non-terminal. These terms come directly from BNF/EBNF and the foundations of grammar theory. They are not especially relevant to the code itself unless you already know what they mean! I did not see any explanation of them in your PDF, so you might want to search online for a complete discussion.

But in simple terms, in your grammar (which is what the Bison rules section code actually is), non-terminals are the symbols that appear on the left-hand side of a rule, and terminals are what is received from the lexer or scanner. Non-terminals are called that because they can be further expanded into other symbols. Terminals are end-points; they cannot be expanded further.

Note that your PDF adopts the convention that terminals are shown in all caps, and non-terminals are always lower case. So, IDENTIFIER is definitely a terminal - and it comes from the lexer too, obviously.

Now that we have a simple working idea of those, what are "tokens"? Tokens are what the lexer or scanner produces - it is a "tokenizer". And what it produces is received by the parser (syntactical analyzer) as terminals... tokens in the lexer are terminals in the grammar. Very easy.

Tokens are represented as integer values as a result of Bison processing the %token declarations. These integer definitions are passed to the lexer through the .tab.h file produced by Bison so that they have the same names on both ends, but they are just integers associated with names - enums.
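To make the "just integers associated with names" point concrete, here is a minimal hand-written sketch of the kind of enum that ends up in a Bison-generated header. The specific numbers and the helper token_name() are assumptions for illustration only; Bison picks the real values (named tokens typically start at 258, above the range of single-character tokens), and code should always use the names rather than the numbers.

```c
/* Hypothetical token codes, in the style of a Bison-generated *.tab.h.
   The values are illustrative; Bison assigns the real ones. */
enum token_code {
    LET = 258,
    INTEGER,
    IN,
    IDENTIFIER,
    NUMBER
};

/* Hypothetical helper: map a token code back to a printable name.
   Single-character tokens such as '+' use their character code and
   fall through to the default case here. */
const char *token_name(int tok)
{
    switch (tok) {
    case LET:        return "LET";
    case INTEGER:    return "INTEGER";
    case IN:         return "IN";
    case IDENTIFIER: return "IDENTIFIER";
    case NUMBER:     return "NUMBER";
    default:         return "(single character or unknown)";
    }
}
```

Both the scanner and the parser include the same header, which is why `return(IDENTIFIER);` in the lexer and `IDENTIFIER` in the grammar refer to the same integer.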

But the scanner may need to also return a value along with a token, and in the parser the $1, $2, ... references in the action code are mapped to these values, not to the token itself. These values are always passed in a variable named yylval. Look at your lexer code - it returns the token (integer identifier) by name, and if there is an associated value it is first placed into yylval, where the parser can find it.

If you only ever pass integer values, then yylval is all you need - an integer by default.

But, if you need to pass other types such as strings or structures you must declare a %union type. What the %union declaration does is create a union definition in the resulting C code so that values of other types (typically pointers) can be returned with the tokens. Of course, then you must tell the parser what type to expect! This is what the type declarations are all about!
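The hand-off described above can be sketched in plain C with a toy scanner. This is not Flex output: the names semantic_value, lval and next_token() are made up for this sketch and stand in for YYSTYPE, yylval and yylex(), and the token numbers are assumed. The point is only to show a token code being returned while its value travels separately through a union.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; Bison would normally generate these. */
enum { NUMBER = 258, IDENTIFIER };

/* Stand-in for the union type that a %union declaration produces. */
typedef union {
    int   num;  /* value carried by a NUMBER token      */
    char *id;   /* text carried by an IDENTIFIER token  */
} semantic_value;

/* Plays the role of yylval: the scanner fills it, the parser reads it. */
semantic_value lval;

/* Toy scanner (not Flex output): reads one token from *src, advances
   *src past it, and returns the token code, or 0 at end of input.
   Any associated value is left in lval. */
int next_token(const char **src)
{
    const char *p = *src;

    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        *src = p;
        return 0;
    }
    if (isdigit((unsigned char)*p)) {
        char *end;
        lval.num = (int)strtol(p, &end, 10);
        *src = end;
        return NUMBER;
    }
    if (islower((unsigned char)*p)) {
        const char *start = p;
        size_t len;

        while (islower((unsigned char)*p) || isdigit((unsigned char)*p))
            p++;
        len = (size_t)(p - start);
        lval.id = malloc(len + 1);   /* caller frees, as with strdup */
        if (lval.id == NULL)
            abort();
        memcpy(lval.id, start, len);
        lval.id[len] = '\0';
        *src = p;
        return IDENTIFIER;
    }
    *src = p + 1;
    return (unsigned char)*p;        /* any other character is its own token */
}
```

Calling next_token() repeatedly on the text `box 26` would return IDENTIFIER with lval.id holding "box", then NUMBER with lval.num holding 26. Note that the caller has to know which union member is valid for each token, which is exactly the information the type declarations give to Bison.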

If a %union declaration exists, then you must declare the type of every terminal and non-terminal that is referenced by value ($1, $2, ...) in the action code. This way the parser code knows which union member to fetch and can check for type errors in the code! These declarations appear like this in the Bison source file:

Code:

%token <type> TERMINAL-NAME

%type <type> non-terminal-name

... where <type> is the union member name for the type.

Terminals and non-terminals that are never referenced as $1, $2, etc. in the action code need no type declaration.

Additionally, any non-terminal (left hand symbol) that can take on the value of a typed terminal or non-terminal on the right hand side must have the same declared type! If it does not, then you will receive the kind of error seen in your example code: "$2 from `command' has no declared type", or a similar complaint about mismatched types.

So, if this makes sense, just review your example code and see which symbols are terminals that return a value, and be sure it has a type declaration. And check all the non-terminals that can be referenced by or assigned a typed value, and be sure they have a type declaration.

Hopefully, when you see a "type" error generated by the parser processing you will know what it means and how to fix it!

That is all the time I have tonight, sorry for the length, and good luck!

dedec0

08-16-2016 03:30 PM

Thank you very much for this post, astrogeek. It blends together several things that I studied a bit, some a little more. And it puts Bison and Flex together in the story with their own possible functions, clearly. It helped me to understand more about what I am reading. :) And I will read the post again, just to make sure... hahaha

I agree with you. This book feels unfinished, but (at least) it is free to distribute. I am finding it better to read than the other book about Flex and Bison that I tried before. A friend has suggested that a Wikibook be made from this book, with at least these corrections and improvements - and possibly with a few others, naturally.

astrogeek

08-16-2016 06:34 PM

I am glad that the previous post was helpful!

During the day today I have tried to catch up with you by compiling the example code from the PDF...

It is not compilable as written... I copied and pasted each step - including MUCH editing to adjust for dumb-quotes (usually called smart-quotes) in the PDF code, as well as dumb-substitutions such as a unicode -- for the simple -, etc... I expect that you must have done the same.

I also added the missing token declarations ASSGNOP, FI, etc...

And I finally got down to the Chapter 4 complete version. But that cannot be processed by Bison due to multiple missing type declarations (declarations, command, id_seq,... more). I suspect those you were seeing are just the beginning for you.

I also suspect that the author intends to replace IDENTIFIER with IDENT, but that is not clear in the text. The lexer does not include a rule to return an IDENT token either, and the grammar rules using it are inconsistent, or at least make no sense as written.

It is not clear to me that the author intends for the examples to actually be compiled, at least not incrementally per-chapter.

It may be that the author intends to make further modifications in the next chapter, but I will leave that up to you to follow along at this point. Perhaps they will arrive at a "final" version, but I think that you will waste a lot of time trying to compile the modifications per-chapter, at least if Chapter 4 is representative.

I am not discouraging you from completing the exercise, but it is clearly not a tested code example, so you might want to read along and not get bogged down debugging the untested example code - or look for another example to follow...

dedec0

08-16-2016 07:28 PM

As I read chapters 1-3, I have made a few files, trying to make them as correct as possible. The aim is to have a more practical reading.

Chapter 1 does not need a program file. I have only a txt file with what it shows. As in shell scripts, I treat lines starting with "#" as comments, and added one:

Code:

# Context-free grammar for the language Simple



program ::= LET [ declarations ] IN command sequence END

declarations ::= INTEGER [ id_seq ] IDENTIFIER .

id_seq ::= id_seq... IDENTIFIER ,

command sequence ::= command... command

command ::= SKIP ;

            | IDENTIFIER := expression ;

            | IF exp THEN command sequence ELSE command sequence FI ;

            | WHILE exp DO command sequence END ;

            | READ IDENTIFIER ;

            | WRITE expression ;

expression ::= NUMBER | IDENTIFIER | '(' expression ')'

            | expression + expression | expression - expression

            | expression * expression | expression / expression

            | expression ^ expression

            | expression = expression

            | expression < expression

            | expression > expression

Chapter 2 needs one, ch2.y:

Code:

%start program

%token LET INTEGER IN

 /* FI was missing */

%token SKIP IF THEN ELSE FI END WHILE DO READ WRITE

%token NUMBER

 /* ASSGNOP was missing */

%token IDENTIFIER ASSGNOP

%left '-' '+'

%left '*' '/'

%left '<' '>' '=' /* missing in the book */

%right '^'



%%



 /* Grammar rules and actions */



program : LET declarations IN commands END ;



declarations : /* empty */

    | INTEGER id_seq IDENTIFIER '.'

;



id_seq : /* empty */

    | id_seq IDENTIFIER ','

;

commands : /* empty */

    | commands command ';'

;

command : SKIP

    | READ IDENTIFIER

    | WRITE exp

    | IDENTIFIER ASSGNOP exp

    | IF exp THEN commands ELSE commands FI

    | WHILE exp DO commands END

;

exp : NUMBER

  | IDENTIFIER

  | exp '<' exp

  | exp '=' exp

  | exp '>' exp

  | exp '+' exp

  | exp '-' exp

  | exp '*' exp

  | exp '/' exp

  | exp '^' exp

  | '(' exp ')'

;



%%



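 /* C subroutines */

#include <stdio.h>  /* FILE, fopen, perror, printf */



 /* no output, parse tree implicitly constructed */

int main( int argc, char *argv[] )

{

    extern FILE *yyin;

    ++argv; --argc;

    /* with no file argument the scanner keeps reading standard input */

    if ( argc > 0 && (yyin = fopen( argv[0], "r" )) == NULL )

    {

        perror( argv[0] );  /* could not open the input file */

        return 1;

    }

#if YYDEBUG

    yydebug = 1;  /* only exists when the parser is built with tracing */

#endif

    return yyparse ();

}

int yyerror (char *s) /* Called by yyparse on error */

{

  printf ("%s\n", s);

  return 0;

}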

Chapter 3 needs two; ch2.y (indirectly used) and ch3.lex:

Code:

%{

#include "ch2.tab.h" /* tokens (Bison output for ch2.y) */

%}



DIGIT [0-9]

ID    [a-z][a-z0-9]*



%%



":="      { return(ASSGNOP);        }

{DIGIT}+  { return(NUMBER);          }

do        { return(DO);              }

else      { return(ELSE);            }

end        { return(END);            }

fi        { return(FI);              }

if        { return(IF);              }

in        { return(IN);              }

integer    { return(INTEGER);        }

let        { return(LET);            }

read      { return(READ);            }

skip      { return(SKIP);            }

then      { return(THEN);            }

while      { return(WHILE);          }

write      { return(WRITE);          }

{ID}      { return(IDENTIFIER);      }

[ \t\n\r]+ /* blank, tab, new line: eat up whitespace */

.          { return(yytext[0]);      }

To compile these two files, I used the commands below, shown with their output forced as comments:

Code:

# produces ch2.output, ch2.tab.c and ch2.tab.h

bison -vd ch2.y



# produces lex.yy.c

flex ch3.lex 



# compiles a program that parses the Simple

# language, although silently

gcc -Wall lex.yy.c -lfl

# lex.yy.c:1192: warning: ‘yyunput’ defined but not used

# lex.yy.c:1233: warning: ‘input’ defined but not used

These two warnings are not a problem. Do you think these 3 files make sense in every detail?

The Chapter 4 files are my next task, which is probably possible now with the help I got in this thread.

dedec0

08-16-2016 09:36 PM

I did not notice your answer before my post with the files' code. I only read it now.

Well, I will try to follow the whole book as I did for chapters 1-3. If the number of problems keeps growing, I will do as you suggest: just pass them by while trying to get the idea.

"Follow other examples": if you (or anyone else) can recommend better books or other materials to read, that is welcome!

astrogeek

08-17-2016 01:33 AM

Sorry to be so slow responding.

Later today I had a chance to look ahead in the PDF that you are working from, and I must say I am not optimistic that you will learn Flex and Bison from it, assuming that is your objective.

After Chapter 4 it becomes somewhat confused with no continuity from what had gone before. Chapter 5, Optimization is mostly just random thoughts (IMO) that ends with "We do not illustrate an optimizer in the parser...". Chapter 6, Virtual Machines... little connection to the rest of the PDF... Chapter 7 Code Generation is back to the original code, with more modifications, but honestly I think there would be little point in trying to follow it as a learning exercise.

I have sent you a PM with some other options and will post any useful links for examples that I can find back in this thread.

Good luck.

dedec0

08-29-2016 09:27 AM

Chapter 4: first try (does not work yet)

Chapter 4 is fairly quick to read, but many things are implied or carried over from previous chapters. I have made a few files to get something that makes sense for the whole chapter (and, if possible, that can be compiled, run and tested, which I have not yet achieved).

I had some difficulties with it, though. The files are shown separately below, after all of my remarks, so that there is nothing more of mine to read once the listings start (apart from the comments inside the files).

The files are being compiled with:

Code:

# first bison because ch4.tab.h is included in flex file

bison -vd ch4.y

flex ch4.lex

gcc lex.yy.c -lfl  # I use -Wall in a gcc alias

Flex and Bison run silently. The gcc output shows a problem: it complains about yylval being used without being declared. Isn't it automatic, like the yy* functions? How should I declare it, and what are the best and most common ways? I imagine it is related to that union we extended...

Code:

$ gcc lex.yy.c -lfl

/tmp/ccH6IRzd.o: In function `yylex':

/full/path/ch4.lex:36: undefined reference to `yylval'

collect2: ld returned 1 exit status

-------------------------------------------------------------------------

ch4.lex:

Code:

%{

#include <string.h>



 /* #include "simple.tab.h" / * tokens (Bison output) */

#include "ch4.tab.h"



%}



DIGIT [0-9]

ID    [a-z][a-z0-9]*



%%



":="      { return(ASSGNOP);        }

{DIGIT}+  { return(NUMBER);          }

 /* below there are keywords */

do        { return(DO);              }

else      { return(ELSE);            }

end        { return(END);            }

fi        { return(FI);              }

if        { return(IF);              }

in        { return(IN);              }

integer    { return(INTEGER);        }

let        { return(LET);            }

read      { return(READ);            }

skip      { return(SKIP);            }

then      { return(THEN);            }

while      { return(WHILE);          }

write      { return(WRITE);          }



 /* where the book wrote IDENT, IDENTIFIER is assumed to be the correct word */

 /* {ID}      { return(IDENTIFIER);      } */

 /* ID declaration was changed to return the identifier text and its

    token */

{ID}            {

                /* yylval->union, which is int or (char*) ->ch4.y:9 */

                yylval.id = strdup(yytext);

                return(IDENTIFIER);

            }



[ \t\n\r]+ /* blank, tab, new lines: ignore all whitespace */

.          { return(yytext[0]);      }

ch4.y:

Code:

%start program



 /* SEMANTIC RECORD */

 /* char *id: For returning identifiers */

 /*

Place for easily pasting/cutting the union declaration

 */



%union {

char *id;

}



/* Simple identifier */

 /*

Place to exchange the IDENTIFIER token declarations

 */

%token IDENTIFIER

%type <id> IDENTIFIER



%token LET IN

/* "integer" keyword */

/* do both INT and INTEGER exist? Removed INT, occurrences changed.

%token INT

*/

%token INTEGER



/* Same question for NUMBER: should it exist?

  Replaced all of its occurrences with INTEGER

  -> wrong, written numbers X 'integer' keyword. Changes undone

*/

%token NUMBER



 /* FI was missing */

%token SKIP IF THEN ELSE FI END WHILE DO READ WRITE

    /* ASSGNOP was missing */

%token ASSGNOP 

%left '-' '+'

%left '*' '/'

%left '<' '>' '=' /* these were missing in the book, added */

%right '^'



%{



#include <stdlib.h> /* For malloc in symbol table */

#include <string.h> /* For strcmp in symbol table */

#include <stdio.h> /* For error messages */

#include "st.h" /* The Symbol Table Module */

#define YYDEBUG 1 /* for debugging */

int errors = 0;   /* count of semantic errors, used by install() and main() */



int install( char* sym_name)

{

    symrec *s;

    s = getsym(sym_name);

    if (s == 0)

        s = putsym (sym_name);

    else

    {

        errors++;

        printf("%s is already defined\n", sym_name);

        return 0;

    }

    return 1;

}



int context_check(char* sym_name)

{

    if ( getsym( sym_name ) == 0 )

    {

        printf("%s is an undeclared identifier\n", sym_name);

        return 0;

    }

    return 1;

}





%}



%%



 /* Grammar rules and actions */



program : LET declarations IN commands END ;



declarations : /* empty */

    | INTEGER id_seq IDENTIFIER '.' { install( $3 ); }

;



id_seq : /* empty */

    | id_seq IDENTIFIER ','            { install( $2 ); }

;

commands : /* empty */

    | commands command ';'

;

command : SKIP

    | READ IDENTIFIER                    { context_check( $2 ); }

    | WRITE exp

    | IDENTIFIER ASSGNOP exp            { context_check( $1 ); }

    | IF exp THEN commands ELSE commands FI

    | WHILE exp DO commands END

;

 /* expressions */

exp : NUMBER

                                    /* in book it is $2, wrong */

  | IDENTIFIER                            { context_check( $1 ); }

  | exp '<' exp

  | exp '=' exp

  | exp '>' exp

  | exp '+' exp

  | exp '-' exp

  | exp '*' exp

  | exp '/' exp

  | exp '^' exp

  | '(' exp ')'

;



%%



 /* C subroutines */



/* no output, implicit parse tree */

int main( int argc, char *argv[] )

{

    extern FILE *yyin;

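    ++argv; --argc;

    /* with no file argument the scanner keeps reading standard input */

    if ( argc > 0 && (yyin = fopen( argv[0], "r" )) == NULL )

    {

        perror( argv[0] );  /* could not open the input file */

        return 1;

    }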

    yydebug = 1;

    errors = 0;

    yyparse ();

    return 0;

}

int yyerror (char *s) /* called by yyparse() in error situations */

{

    printf ("%s\n", s);

    return 0;

}

st.h:

Code:

/* symbol table module */

/*    -> to be included in Bison file */

typedef struct symrec

{

    char *name;             /* symbol name */

    struct symrec *next;    /* list link */

} symrec;



symrec *sym_table = (symrec *)0;

symrec* putsym(char *);

symrec* getsym(char *);



symrec* putsym( char *sym_name)

{

    symrec *ptr;

    ptr = (symrec *) malloc( sizeof(symrec) );

    ptr->name = (char *) malloc( strlen(sym_name) + 1 );

    strcpy( ptr->name, sym_name);

    ptr->next = (symrec*) sym_table;

    sym_table = ptr;

    return ptr;

}



symrec* getsym( char* sym_name)

{

    symrec *ptr;

    for (

        ptr = sym_table;

        ptr != (symrec *) 0;

        ptr = (symrec *) ptr->next

        )

        if( strcmp( ptr->name, sym_name) == 0 )

            return ptr;

    return 0;

}

dedec0

08-29-2016 12:45 PM

Quote:

Originally Posted by astrogeek (Post 5591385)

[...]

Quote:

Originally Posted by dedec0

For now I am keeping the "%type <id> IDENTIFIER" line, but the error continues.

That is not correct IF IDENTIFIER is a terminal symbol, as appears to be the case.

It may be that the '$2 from command' error you are seeing is complaining about exp, not IDENTIFIER (but, also not clear to me).

The first parts of your post (removed) are about things we discussed in the previous posts. I would suggest skimming through them.

No, IDENTIFIER is not a terminal symbol. It is a variable name in its declaration or use. The Simple grammar was shown in chapter 1 (pdf page 9). I have made a text file from it, given below.

I am not sure what the exp problem is/was. There was an error due to grammar ambiguity that I solved (I do not remember where now). There was also a simple error where a "$M" was used where a "$K" makes sense.

----------------------------------------------------------
ch1.txt:

Code:

# Context-free grammar for the language Simple



program ::= LET [ declarations ] IN command_sequence END

declarations ::= INTEGER [ id_seq ] IDENTIFIER .

id_seq ::= id_seq... IDENTIFIER ,

command_sequence ::= command... command

command ::= SKIP ;

            | IDENTIFIER := expression ;

            | IF exp THEN command_sequence ELSE command_sequence FI ;

            | WHILE exp DO command_sequence END ;

            | READ IDENTIFIER ;

            | WRITE expression ;

expression ::= NUMBER | IDENTIFIER | '(' expression ')'

            | expression + expression | expression - expression

            | expression * expression | expression / expression

            | expression ^ expression

            | expression = expression

            | expression < expression

            | expression > expression

astrogeek

08-30-2016 02:42 AM

Hi dedec0!

Sorry to be so late getting in here, and I only have a few minutes so I'll have to be brief for now.

Quote:

Originally Posted by dedec0 (Post 5597853)

I looked back over the previous posts but am not sure what you would like me to see from reviewing them. Could you be more specific?

And yes, IDENTIFIER is a terminal symbol. Perhaps you are confused about what a terminal symbol is?

In the simple CF grammar on pages 8-9 (bottom page numbers 2-3), IDENTIFIER appears only on the right hand side of the rules and is used exactly as a terminal symbol. Also, immediately below that figure on page 9 (page number 3) it says...

Quote:

...Figure 1.1 where the non-terminal symbols are given in all lower case, the terminal symbols are given in all caps or as literal symbols and, where the literal symbols conflict with the meta symbols of the EBNF, they are enclosed with single quotes. The start symbol is program. While the grammar uses upper-case to high-light terminal symbols, they are to be implemented in lower case.

And at the top of page 12 (page number 6), IDENTIFIER is defined as a token - tokens are terminal symbols.

Further along on page 16 (page number 10) in the scanner example code, IDENTIFIER is a token returned by the scanner to the parser - i.e., a terminal symbol.

I don't know what to add to that, and I am not sure how you think that should be used if it is not a terminal symbol... perhaps if you could try to explain to me more precisely what you think IDENTIFIER is in the CF grammar, and how that relates to IDENTIFIER in the scanner, maybe we could reach a better understanding of this example.

A little OT: Late last week I was able to spend a little time refreshing my Flex/Bison notes and I found a link to an example I do not recall seeing before - and it is a very good Flex Bison C++ Example too! The C++ Flex code may be a little confusing, so just focus on the lexer rules and the grammar in the Bison code. But it compiles and works well and will give you an example parser that builds the AST explicitly. Hope it is helpful.

dedec0

08-30-2016 06:34 AM

Two parts:

1. A terminal symbol is one that does change anymore. It may be something written in the source code, like numbers values, variable names or language symbols ('+','-', ...). IDENTIFIER is the token given where variable names are declared or used in Simple. Nonterminal symbols are variables in the grammar.

For example, in the grammar we have a line: program ::= LET [ declarations ] IN command_sequence END. In this line: program, declarations and command_sequence are nonterminal symbols (or variables); LET, IN and END are terminal symbols.

This book's convention (uppercase for terminal symbols, lowercase for nonterminal symbols) is different from what we always used in school.

In a Simple program (source code), we may have:

Code:

let integer trees, seedpackage. in

    trees := 3;

    seedpackage := 17;

    while trees < 1000 do

        trees := trees + seedpackage;

    end;

    write trees;

end

This program has 2 constants: the current number of trees, and the number of seeds in each package we buy. We plant all the seeds of each package. We want to have at least a thousand trees. How many trees will we have, then?
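The loop in the program above can be traced with an equivalent C function. This is only a hand translation for checking the arithmetic, not something produced by the book's compiler, and the function name run_tree_program() is made up for this sketch.

```c
/* C rendering of the Simple program above; returns the value it writes. */
int run_tree_program(void)
{
    int trees = 3;          /* current number of trees */
    int seed_package = 17;  /* seeds per package       */

    /* keep planting whole packages until there are at least 1000 trees */
    while (trees < 1000)
        trees = trees + seed_package;

    return trees;  /* 3 + 59 * 17 = 1006 */
}
```

After 58 packages there are 3 + 986 = 989 trees, which is still below 1000, so one more package is planted and the program writes 1006.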

Is there something you want to correct or add in my ideas?

2. I am still waiting for answers and comments on my post about Chapter 4.

ntubski

08-30-2016 06:53 AM

Quote:

Originally Posted by dedec0 (Post 5598179)

1. A terminal symbol is one that does change anymore. It may be something written in the source code, like numbers values, variable names or language symbols ('+','-', ...). IDENTIFIER is the token given where variable names are declared or used in Simple. Non-terminal symbols are variables in the grammar.

According to that definition IDENTIFIER is a terminal. I don't really like the phrase "one that does [not] change anymore". A more precise definition would be: A terminal symbol is one that never occurs on the left hand side of the grammar rules. Hence a derivation always terminates with a list of terminal symbols and the terminal nodes of a parse tree are always terminal symbols.
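That definition is mechanical enough to check in a few lines of C. The sketch below stores a fragment of the Simple grammar from this thread as plain strings (empty alternatives and most commands are left out) and classifies a symbol as terminal when it never appears on a left-hand side. The struct layout and the name is_terminal() are assumptions made for this illustration; Bison does not represent grammars this way.

```c
#include <string.h>

#define MAX_RHS 8

/* One grammar rule: a left-hand symbol and its right-hand symbols. */
struct rule {
    const char *lhs;
    const char *rhs[MAX_RHS];  /* unused slots are NULL */
};

/* A fragment of the Simple grammar discussed in this thread. */
static const struct rule rules[] = {
    { "program",      { "LET", "declarations", "IN", "commands", "END" } },
    { "declarations", { "INTEGER", "id_seq", "IDENTIFIER", "." } },
    { "id_seq",       { "id_seq", "IDENTIFIER", "," } },
    { "commands",     { "commands", "command", ";" } },
    { "command",      { "READ", "IDENTIFIER" } },
};
static const size_t rule_count = sizeof rules / sizeof rules[0];

/* Returns 1 if sym never appears on a left-hand side, 0 otherwise.
   A name that is not in the grammar at all also yields 1, so this
   should only be asked about symbols the grammar actually uses. */
int is_terminal(const char *sym)
{
    size_t i;

    for (i = 0; i < rule_count; i++)
        if (strcmp(rules[i].lhs, sym) == 0)
            return 0;
    return 1;
}
```

With this fragment, is_terminal("IDENTIFIER") and is_terminal("LET") both give 1, while is_terminal("command") gives 0, which matches the book's all-caps convention for terminals.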

dedec0

08-30-2016 07:48 AM

Indeed, that is better. Thank you also for reminding me of the parse tree concept and its characteristics. :)

sundialsvcs

08-30-2016 11:29 AM

Another way to think of it is that a "terminal symbol" is "something that you can see in the source-code you are compiling." They are the structural elements that define the language and that allow positions in the grammar to be non-ambiguously determined.

Nonterminal symbols correspond to something else in the grammar.

ntubski

08-31-2016 08:35 AM

Quote:

Originally Posted by dedec0 (Post 5597752)

Code:

# first bison because ch4.tab.h is included in flex file

bison -vd ch4.y

flex ch4.lex

gcc lex.yy.c -lfl  # I use -Wall in a gcc alias

Code:

$ gcc lex.yy.c -lfl

/tmp/ccH6IRzd.o: In function `yylex':

/full/path/ch4.lex:36: undefined reference to `yylval'

collect2: ld returned 1 exit status

This is an error about yylval being undefined (see Declaration vs. definition). It means you didn't link the code that defines it. Perhaps it is defined in the bison output? Probably you should do something like this:

Code:

# First produce C files

bison -vd ch4.y # produces ch4.tab.c (right?)

flex ch4.lex # produces lex.yy.c



# Second compile (that's what the -c option means) C files to object (.o) files.

gcc -c lex.yy.c # produces lex.yy.o

gcc -c ch4.tab.c # produces ch4.tab.o



# Lastly, link all the object files together into a program 

# (I called it "ch4-program", if you drop the -o option you get a program named "a.out").

gcc -o ch4-program lex.yy.o ch4.tab.o -lfl

The command you showed was trying to create a program using the source from flex only.
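The difference between a declaration and a definition can be shown in a single C file. The names shared_value and read_shared() are made up for this sketch; in the real build, the extern declaration of yylval comes from the Bison-generated header and the definition lives in ch4.tab.c.

```c
/* Declaration: promises that an int named shared_value exists somewhere.
   It reserves no storage. This is the role the Bison header plays for
   yylval when it is included in the scanner. */
extern int shared_value;

/* Code that uses the variable compiles with only the declaration... */
int read_shared(void)
{
    return shared_value;
}

/* Definition: actually reserves the storage. If no object file on the
   link line contains this, the linker reports an "undefined reference",
   even though every file compiled without complaint. */
int shared_value = 42;
```

Deleting the last line reproduces the situation in the quoted output: the compile step succeeds and the link step fails, because the object file that holds the definition was never given to the linker.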

sundialsvcs

08-31-2016 10:53 AM

Every parser generator that produces a compilable source file must give you some way to put headers at the top of the generated code, if it does not supply them itself. The #include statements that the compiler needs go in the section provided for that purpose, and such sections are generally copied into the output source file verbatim.

All times are GMT -5.