LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Book "Compiler Construction using Flex and Bison" - problems, discussions, steps, ... (https://www.linuxquestions.org/questions/programming-9/book-compiler-construction-using-flex-and-bison-problems-discussions-steps-4175587087/)

dedec0 08-15-2016 08:13 AM

Book "Compiler Construction using Flex and Bison" - problems, discussions, steps, ...
 
Hello. I started reading the book "Compiler Construction using Flex and Bison", freely available at http://research.microsoft.com/en-us/...l/compiler.pdf. Same file mirrored here: http://balacobaco.insomnia247.nl/dedec0/compiler.pdf .

There are a few errors but was able to fix them until chapter 3. Now, in chapter 4, I have reached a point where I cannot make some modifications the text directs to, without reaching a compilation error from bison.

A working bison file (it compiles with no output), with the lines that causes problem when added/changed is:

Code:

%start program

/*

 We should create this union, although it does not have more than one declaration in it:

%union {  /* SEMANTIC RECORD * /
char *id;  /* For returning identifiers * /
}

  We should change IDENTIFIER token to get the name of declared variables:

%token IDENTIFIER

  becomes:

%token <id> IDENTIFIER /* Simple identifier * /

*/

%token <id> IDENTIFIER /* Simple identifier */
%token IDENTIFIER
%token LET INTEGER INT IN
%token SKIP IF THEN ELSE FI END WHILE DO READ WRITE
%token NUMBER
%token ASSGNOP
%left '-' '+'
%left '*' '/'
%left '<' '>' '=' '' /* were missing in the book */
%right '^ '

%{

#include <stdlib.h> /* For malloc in symbol table */
#include <string.h> /* For strcmp in symbol table */
#include <stdio.h> /* For error messages */
#include "st.h" /* The Symbol Table Module */
#define YYDEBUG 1 /* para depuração */

int install( char* sym_name)
{
    symrec *s;
    s = getsym(sym_name);
    if (s == 0)
        s = putsym (sym_name);
    else
    {
        errors++;
        printf("%s is already defined\n", sym_name);
        return 0;
    }
    return 1;
}

int context_check(char* sym_name)
{
    if ( getsym( sym_name ) == 0 )
    {
        printf("%s is an undeclared identifier\n", sym_name);
        return 0;
    }
    return 1;
}


%}

%%

 /* Grammar rules and actions */

program : LET declarations IN commands END ;

declarations : /* empty */
    | INTEGER id_seq IDENTIFIER '.' { install( $3 ); }
;

id_seq : /* empty */
    | id_seq IDENTIFIER ','            { install( $2 ); }
;
commands : /* empty */
    | commands command ';'
;
command : SKIP
    | READ IDENTIFIER                    { context_check( $2 ); }
    | WRITE exp
    | IDENTIFIER ASSGNOP exp            { context_check( $2 ); }
    | IF exp THEN commands ELSE commands FI
    | WHILE exp DO commands END
;
exp : NUMBER
                            /* book said $2 for this, wrong*/
  | IDENTIFIER                            { context_check( $1 ); }
  | exp '<' exp
  | exp '=' exp
  | exp '>' exp
  | exp '+' exp
  | exp '-' exp
  | exp '' exp
  | exp '/' exp
  | exp '^ ' exp
  | '(' exp ')'
;

%%

 /* C subroutines */

 /* no output, implied parse tree */
int main( int argc, char *argv[] )
{
    extern FILE *yyin;
    ++argv; --argc;
    yyin = fopen( argv[0], "r" );
    yydebug = 1;
    errors = 0;
    yyparse ();
    return 0;
}
int yyerror (char *s) /* chamada por yyparse() com erros */
{
    printf ("%s\n", s);
    return 1;
}

The file as above works. Making the said changes the error reported is:

Code:

$ bison -vd ch4.y
ch4.y:82.54-55: $2 from `command' has no declared type
$

Chapter 4 starts in PDF page 19.

I need this .y file working before I proceed to next session, where the corresponding scanner (a flex file) is changed. May you help me understanding what is wrong and fixing it?

dedec0 08-15-2016 08:15 AM

Forgot to post it. The "st.h" file is:

Code:

typedef struct symrec
{
    char *name;                    /* symbol name*/
    struct symrec *next;    /* link field */
} symrec;

symrec *sym_table = (symrec *)0;
symrec* putsym(char *);
symrec* getsym(char *);

symrec* putsym( char *sym_name)
{
    symrec *ptr;
    ptr = (symrec *) malloc( sizeof(symrec) );
    ptr->name = (char *) malloc( strlen(sym_name) + 1 );
    strcpy( ptr->name, sym_name);
    ptr->next = (symrec*) sym_table;
    sym_table = ptr;
    return ptr;
}

symrec* getsym( char* sym_name)
{
    symrec *ptr;
    for (
        ptr = sym_table;
        ptr != (symrec *) 0;
        ptr = (symrec *)ptr- >next
        )
        if( strcmp( ptr->name, sym_name) == 0 )
            return ptr;
    return 0;
}


grail 08-15-2016 10:23 AM

Are you sure you have copied over all the changes as required? I noticed a reference to "exp : INT" which I do not see in your file. There could be others. I would suggest going back over all entries.

dedec0 08-15-2016 11:19 AM

The "exp: INT" you mention is in the end of PDF page 21? I have changed it to INTEGER.

I had a few doubts and went through a few errors in chapters 1-3. I had to make changes to what is written in the book to be able to compile. I have files made for each chapter, with their given code. Until chapter 3 they work.

Is "INT" something that makes sense? Or would it be a short "typo" for INTEGER, like I thought? And there is also the NUMBER token.

Note that, in the file above, I have added declarations for all of them (or bison gives error for undeclared token).

In the file below I have changed all INT and NUMBER tokens to INTEGER. Then I removed declarations for both, the error is the same. Without the union declaration and the IDENTIFIER with <id> (exchanged) bison compiles it silently (assumed to be good). With the union declaration and <id> in IDENTIFIER, as added in chapter 4, the error appears.

In the code below it is easy to make the changes I mentioned. It is just to cut/paste a few lines from/to multiline comments above each part:

Code:

%start program

 /* SEMANTIC RECORD */
 /* char *id: For returning identifiers */
 /*
Place to easily pasting/cutting the union declaration
%union {
char *id;
}

 */

/* Simple identifier */
 /*
Place to exchange the IDENTIFIER token declarations
%token <id> IDENTIFIER
 */
%token IDENTIFIER

%token LET IN
/* tem os dois, INT e INTEGER?
%token INT
%token NUMBER
*/
%token INTEGER

 /* tava faltando o FI */
%token SKIP IF THEN ELSE FI END WHILE DO READ WRITE
    /* tava faltando o ASSGNOP */
%token ASSGNOP
%left '-' '+'
%left '*' '/'
%left '<' '>' '=' '' /* não tinham no livro, precisa acrescentar */
%right '^ '

%{

#include <stdlib.h> /* For malloc in symbol table */
#include <string.h> /* For strcmp in symbol table */
#include <stdio.h> /* For error messages */
#include "st.h" /* The Symbol Table Module */
#define YYDEBUG 1 /* para depuração */

int install( char* sym_name)
{
    symrec *s;
    s = getsym(sym_name);
    if (s == 0)
        s = putsym (sym_name);
    else
    {
        errors++;
        printf("%s is already defined\n", sym_name);
        return 0;
    }
    return 1;
}

int context_check(char* sym_name)
{
    if ( getsym( sym_name ) == 0 )
    {
        printf("%s is an undeclared identifier\n", sym_name);
        return 0;
    }
    return 1;
}


%}

%%

 /* Grammar rules and actions */

program : LET declarations IN commands END ;

declarations : /* empty */
    | INTEGER id_seq IDENTIFIER '.' { install( $3 ); }
;

id_seq : /* empty */
    | id_seq IDENTIFIER ','            { install( $2 ); }
;
commands : /* empty */
    | commands command ';'
;
command : SKIP
    | READ IDENTIFIER                    { context_check( $2 ); }
    | WRITE exp
    | IDENTIFIER ASSGNOP exp            { context_check( $2 ); }
    | IF exp THEN commands ELSE commands FI
    | WHILE exp DO commands END
;
exp : INTEGER
                                    /* no livro está $2, errado */
  | IDENTIFIER                            { context_check( $1 ); }
  | exp '<' exp
  | exp '=' exp
  | exp '>' exp
  | exp '+' exp
  | exp '-' exp
  | exp '' exp
  | exp '/' exp
  | exp '^ ' exp
  | '(' exp ')'
;

%%

 /* C subroutines */

/* não tem saída, a árvore de recon. fica implícita */
int main( int argc, char *argv[] )
{
    extern FILE *yyin;
    ++argv; --argc;
    yyin = fopen( argv[0], "r" );
    yydebug = 1;
    errors = 0;
    yyparse ();
    return 0;
}
int yyerror (char *s) /* chamada por yyparse() com erros */
{
    printf ("%s\n", s);
    return 0;
}

Code:

$ bison -vd ch4.y
ch4.y:91.54-55: $2 de `command' não tem tipo declarado


grail 08-15-2016 11:59 AM

INT appears in your tokens. Been a while since I have played with this stuff, but I am sure one of the others will be able to help you further :)

dedec0 08-15-2016 12:22 PM

Do not miss it: in this last post INT is only inside a multiline comment (that we should use it to easily change from a working to an erroneous file, or vice versa). There is no other occurrences of it.

I hope to have given enough details of my problem so anyone could easily and quickly reproduce it. Thank you for your goodwill, grail. Do you have a book to recommend? This is the second one that I got with these not so small problems.

smallpond 08-15-2016 12:26 PM

Not seeing any definition for ASSGNOP.

Also I don't think this is right:

Code:

%token <id> IDENTIFIER
It looks like it should be:

Code:

%type <id> IDENTIFIER

dedec0 08-15-2016 01:13 PM

Quote:

Originally Posted by smallpond (Post 5591167)
Not seeing any definition for ASSGNOP.

It is missing in the book, I noted it. But I have added it before starting this thread. It is there on line 30 (and it is not inside a comment): %token ASSGNOP .

Quote:

Originally Posted by smallpond (Post 5591167)
Also I don't think this is right:

Code:

%token <id> IDENTIFIER
It looks like it should be:

Code:

%type <id> IDENTIFIER

Should both type and token exist? It is not clear for me (I tried to read something in https://www.gnu.org/software/bison/m...emantic-Tokens too). My results:

1. Without union and IDENTIFIER is a simple token: it compiles (a previous result).

2. With union declared, "%token IDENTIFIER" removed, line "%type <id> IDENTIFIER" added:

Code:

$bison -vd ch4.y
ch4.y:18.12-21: symbol IDENTIFIER used, but not defined as a token and has no rules
ch4.y:91.54-55: $2 from `command' has no declared type

3. Continuing from try 2, simply add one more line with "%token IDENTIFIER". Result is the same error from the last big post:

Code:

$bison -vd ch4.y
ch4.y:91.54-55: $2 from `command' has no declared type

Did I try everything?

For now I am keeping the "%type <id> IDENTIFIER" line, but the error continues.

smallpond 08-15-2016 01:24 PM

Code:

ch4.y:91.54-55: $2 from `command' has no declared type
The error says you have not fully defined the 2nd word of line 91 which it has expanded from "command".

dedec0 08-15-2016 01:56 PM

Line 91 is
Code:

command : SKIP
    | READ IDENTIFIER                    { context_check( $2 ); }
    | WRITE exp
    | IDENTIFIER ASSGNOP exp            { context_check( $2 ); } /* line 91 */
    | IF exp THEN commands ELSE commands FI
    | WHILE exp DO commands END
;

Should it be $3 in line 91? The error repeats. I do not know what else to do with this information. :( Please give me a clearer hint.

dedec0 08-15-2016 02:02 PM

NO! It should be $1, right?? So context_check will check if the variable is declared or not. Right?? :D :D :D

smallpond 08-15-2016 06:53 PM

How much clearer can I get? I already told you in comment 7 there is no definition for ASSGNOP and you didn't believe me.

astrogeek 08-15-2016 08:31 PM

I will admit first, that I have not fully followed your example code nor read the PDF (that URL is blocked by my local firewall rules).

But I do see that you are confused about the use of %token and %type, so perhaps I can offer some helpful comment on those.

Quote:

Originally Posted by dedec0 (Post 5591205)
Should both type and token exist? It is not clear for me (I tried to read something in https://www.gnu.org/software/bison/m...emantic-Tokens too).

%token and %type are two different things.

%token is used to declare terminals, and may include or require a <type> assignment when a %union has been declared, but not otherwise. So for terminal symbols (which IDENTIFIER seems to be), either of these would be correct depending on usage...

Code:

%token IDENTIFIER
  or
%token <id> IDENTIFIER

... but NOT...

Code:

%type <id> IDENTIFIER
%type is used to declare the types of non-terminals (commands, command, exp, etc... from your example code).

So something like this might be appropriate, again depending on usage...

Code:

%type <id> commands command exp ...
So in simplistic terms...

When %union is declared you will need to declare terminals (tokens) with a type as...

Code:

%token <id> IDENTIFIER
... and non-terminals with a type as...

Code:

%type <id> command commands exp ...
And your code must be consistent with regard to those types as values traverse the parse tree. That is, when a typed right-hand value is assigned to a left-hand (non-terminal) symbol, they must be of the same type or the compiler will complain (this is what the type declarations are used for).

Quote:

Originally Posted by dedec0 (Post 5591205)
For now I am keeping the "%type <id> IDENTIFIER" line, but the error continues.

That is not correct IF IDENTIFIER is a terminal symbol, as appears to be the case.

It may be that the '$2 from command' error you are seeing is complaining about exp, not IDENTIFIER (but, also not clear to me).

dedec0 08-15-2016 08:37 PM

Quote:

Originally Posted by smallpond (Post 5591360)
How much clearer can I get? I already told you in comment 7 there is no definition for ASSGNOP and you didn't believe me.

It is not what I did. Don't mix understanding with believing. ASSGNOP was defined, there is a token declaration for it, as I said above. You expected me to understand that:

"since there is no definition for ASSGNOP, I should look at line 91 and see that I need to change the argument because it is incorrect, it is pointing to ASSGNOP instead of IDENTIFIER"

?

No way. Your words there are not clear at all, not for me. I bet that not for others too. The title of the thread contains the word "learning" because I know very little of Bison, if I can say I know something about it at all.

Anyway, thank you. Our conversation helped me to solve another "easy" problem in this book.

dedec0 08-15-2016 08:57 PM

astrogeek, thank you for your explanations. I had not yet seen %type, so I had no clue what was it. I just used it as something more to try, almost blindly (as my trial and error report shows).

As I undertand now, IDENTIFIER is the token that corresponds to the variable name, not its value or symbol. In C we could have an integer atribution:

boxOfFruits = 26;

For this line, there would be 4 tokens: IDENTIFIER (with a string "boxOfFruits"); EQUAL_SIGN; INTEGER (with value 26); END_OF_LINE. So, IDENTIFIER is a terminal symbol or not (I think it is not). The language of the book it not C, but it is something similar for that expression.

astrogeek 08-16-2016 02:41 AM

Quote:

Originally Posted by dedec0 (Post 5591397)
astrogeek, thank you for your explanations. I had not yet seen %type, so I had no clue what was it. I just used it as something more to try, almost blindly (as my trial and error report shows).

As I undertand now, IDENTIFIER is the token that corresponds to the variable name, not its value or symbol. In C we could have an integer atribution:

boxOfFruits = 26;

For this line, there would be 4 tokens: IDENTIFIER (with a string "boxOfFruits"); EQUAL_SIGN; INTEGER (with value 26); END_OF_LINE. So, IDENTIFIER is a terminal symbol or not (I think it is not). The language of the book it not C, but it is something similar for that expression.

I downloaded the PDF with the intent of copying and compiling the code. But as there does not seem to be a complete version for each chapter and everything is shown as modifications from the previous page, I would have to read it all and work through the changes and I do not have time to do that now.

So, I will try to offer what I think will be the most helpful comments for you, but maybe not as directly applicable to your code example as you would like.

First, let me try to clarify a few terms and their use that did not seem to be presented clearly in my limited scan of the PDF. I think that getting these very clearly in mind will allow you to understand things not so clearly explained in examples that you find on the interwebs...

Let's start with terminal and non-terminal. These terms come directly from BNF/EBNF and the foundations of grammer theory. They are not especially relevant to the code itself unless you already know what they mean! I did not see any explanation of them in your PDF, so you might want to search online for a complete discussion.

But in simple terms, in your grammer (which is what the Bison rules section code actually is), non-terminals are the symbols that appear on the left-hand side of any expression, and terminals are what is received from the lexer or scanner. Non-terminals are called that because they can be further reduced to other terms. Terminals are end-points, they cannot be further reduced.

Note that your PDF adopts the convention that terminals are shown in all caps, and non-terminals are always lower case. So, IDENTIFIER is definitely a terminal - and it comes from the lexer too, obviously.

Now that we have a simple working idea of those, what are "tokens"? Tokens are what the lexer or scanner produces - it is a "tokenizer". And what it produces is received by the parser (syntactical analyzer) as terminals... tokens in the lexer are terminals in the grammer. Very easy.

Tokens are represented as integer values as a result of Bison processing the %token declarations, these integer definitions are passed to the lexer by the tab.h files produced by Bison so that they have the same names on both ends, but they are just integers associated with names - enums.

But the scanner may need to also return a value along with a token, and in the parser the $1, $2, ... references in the action code are mapped to these values, not to the token itself. These values are always passed in a variable named yylval. Look at your lexer code - it returns the token (integer identifier) by name, and if there is an associated value it is first placed into yylval, where the parser can find it.

If you only ever pass integer values, then yylval is all you need - an integer by default.

But, if you need to pass other types such as strings or structures you must declare a %union type. What the %union declaration does is creates a union definition in the resulting C code so that pointers to other values can be returned with the tokens. Of course, then you must tell the parser what type to expect! This is what the type declarations are all about!

If a %union declaration exists, then you must declare the type of every terminal and non-terminal that is referenced by value ($1, $2, ...) in the action code. This way the parser code knows which union member to fetch and can check for type errors in the code! These declarations appear like this in the Bison source file:

Code:

%token <type> TERMINAL-NAME
%type <type> non-terminal-name

... where <type> is the union member name for the type.

Terminals and non-terminals that are never referenced as $1, $2, etc. in the action code need no type declaration.

Additionally, any non-terminal (left hand symbol) that can take on the value of a typed terminal or non-terminal on the right hand side, must have the same declared type! If they do not, then you will receive the type of error in your example code: "$2 from `command' has no declared type", or a similar complaint about mis-matched types.

So, if this makes sense, just review your example code and see which symbols are terminals that return a value, and be sure it has a type declaration. And check all the non-terminals that can be referenced by or assigned a typed value, and be sure they have a type declaration.

Hopefully, when you see a "type" error generated by the parser processing you will know what it means and how to fix it!

That is all the time I have tonight, sorry for the length, and good luck!

dedec0 08-16-2016 03:30 PM

Thank you very much for this post, astrogeek. It blends together several things that I studied a bit, some a little more. And it puts Bison and Flex together in the story with their own possible functions, clearly. It helped me to understand more about what I am reading. :) And I will read the post again, just to make sure... hahaha

I agree with you. This book feels unfinished, but (at least) it is free to distribute. I am finding it better to read than the other book about Flex and Bison that I tried before. A friend have suggested that a Wikibook is made from this book, with at least these corrections and improvements - and possibly with a few others, naturally.

astrogeek 08-16-2016 06:34 PM

I am glad that the previous post was helpful!

During the day today I have tried to catch up with you by compiling the example code from the PDF...

It is not compilable as written... I copied and pasted each step - including MUCH editing to adjust for dumb-quotes (usually called smart-quotes) in the PDF code, as well as dumb-substitutions such as a unicode -- for the simple -, etc... I expect that you must have done the same.

I also added the missing token declarations ASSGNOP, FI, etc...

And I finally got down to the Chapter 4 complete version. But that cannot be processed by Bison due to multiple missing type declarations (declarations, command, id_seq,... more). I suspect those you were seeing are just the beginning for you.

I also suspect that the author intends to replace IDENTIFIER with IDENT, but that is not clear in the text. The lexer does not include a rule to return an IDENT token either, and the grammer rules using it are inconsistent, or at least make no sense as written.

It is not clear to me that the author intends for the examples to actually be compiled, at least not incrementally per-chapter.

It may be that the author intends to make further modifications in the next chapter, but I will leave that up to you to follow along at this point. Perhaps they will arrive at a "final" version, but I think that you will waste a lot of time trying to compile the modifications per-chapter, at least if Chapter 4 is representative.

I am not discouraging you from completing the exercise, but it is clearly not a tested code example, so you might want to read along and not get bogged down debugging the untested example code - or look for another example to follow...

dedec0 08-16-2016 07:28 PM

As I read chapters 1-3, I have made a few files, trying to make them as correct as possible. The aim is to have a more practical reading.

Chapter 1 does not need a program file. I have only a txt with what it shows. As shell scripts, I consider lines starting with "#" a comment, and added one:

Code:

# Context-free grammar for the language Simple

program ::= LET [ declarations ] IN command sequence END
declarations ::= INTEGER [ id_seq ] IDENTIFIER .
id_seq ::= id_seq... IDENTIFIER ,
command sequence ::= command... command
command ::= SKIP ;
            | IDENTIFIER := expression ;
            | IF exp THEN command sequence ELSE command sequence FI ;
            | WHILE exp DO command sequence END ;
            | READ IDENTIFIER ;
            | WRITE expression ;
expression ::= NUMBER | IDENTIFIER | '(' expression ')'
            | expression + expression | expression - expression
            | expression * expression | expression / expression
            | expression ^ expression
            | expression = expression
            | expression < expression
            | expression > expression

Chapter 2 needs one, ch2.y:

Code:

%start program
%token LET INTEGER IN
 /* FI was missing */
%token SKIP IF THEN ELSE FI END WHILE DO READ WRITE
%token NUMBER
 /* ASSGNOP was missing */
%token IDENTIFIER ASSGNOP
%left '-' '+'
%left '*' '/'
%left '<' '>' '=' '' /* missing in the book */
%right '^ '

%%

 /* Grammar rules and actions */

program : LET declarations IN commands END ;

declarations : /* empty */
    | INTEGER id_seq IDENTIFIER '.'
;

id_seq : /* empty */
    | id_seq IDENTIFIER ','
;
commands : /* empty */
    | commands command ';'
;
command : SKIP
    | READ IDENTIFIER
    | WRITE exp
    | IDENTIFIER ASSGNOP exp
    | IF exp THEN commands ELSE commands FI
    | WHILE exp DO commands END
;
exp : NUMBER
  | IDENTIFIER
  | exp '<' exp
  | exp '=' exp
  | exp '>' exp
  | exp '+' exp
  | exp '-' exp
  | exp '' exp
  | exp '/' exp
  | exp '^ ' exp
  | '(' exp ')'
;

%%

 /* C subroutines */

 /* no output, parse tree implicitly constructed */
int main( int argc, char *argv[] )
{
    extern FILE *yyin;
    ++argv; --argc;
    yyin = fopen( argv[0], "r" );
    yydebug = 1;
    errors = 0;
    yyparse ();
}
int yyerror (char *s) /* Called by yyparse on error */
{
  printf ("%s\n", s);
}

Chapter 3 needs two; ch2.y (indirectly used) and ch3.lex:

Code:

%{
#include "ch2.tab.h" /* tokens (Bison output for ch2.y) */
%}

DIGIT [0-9]
ID    [a-z][a-z0-9]*

%%

":="      { return(ASSGNOP);        }
{DIGIT}+  { return(NUMBER);          }
do        { return(DO);              }
else      { return(ELSE);            }
end        { return(END);            }
fi        { return(FI);              }
if        { return(IF);              }
in        { return(IN);              }
integer    { return(INTEGER);        }
let        { return(LET);            }
read      { return(READ);            }
skip      { return(SKIP);            }
then      { return(THEN);            }
while      { return(WHILE);          }
write      { return(WRITE);          }
{ID}      { return(IDENTIFIER);      }
[ \t\n\r]+ /* blank, tab, new line: eat up whitespace */
.          { return(yytext[0]);      }

To compile these two files, I used the commands below, shown with their output forced as comments:

Code:

# produces ch2.output, ch2.tab.c and ch2.tab.h
bison -vd ch2.y

# produces lex.yy.c
flex ch3.lex

# compiles a program that parses the Simple
# language, although silently
gcc -Wall lex.yy.c -lfl
# lex.yy.c:1192: warning: ‘yyunput’ defined but not used
# lex.yy.c:1233: warning: ‘input’ defined but not used

These two warnings are not a problem. Do you think these 3 files makes sense in every detail?

Chapter 4 files is my next task, probably possible now with the help I got in this thread.

dedec0 08-16-2016 09:36 PM

I did not note your answer before my post with files' code. I only read it now.

Well, I will try to follow the whole book as I did for chapters 1-3. If it increases its numbers of problems, I will do as you suggest: just pass them by while trying to get the idea there.

"Follow other examples": if you (or anyone else) can recommend better books or other materials to read, that is welcome!

astrogeek 08-17-2016 01:33 AM

Sorry to be so slow responding.

I had a chance to look ahead in the PDF that you are working from later today, and I must say I am not optimistic that you will learn Flex and Bison from it, assuming that is your objective.

After Chapter 4 it becomes somewhat confused with no continuity from what had gone before. Chapter 5, Optimization is mostly just random thoughts (IMO) that ends with "We do not illustrate an optimizer in the parser...". Chapter 6, Virtual Machines... little connection to the rest of the PDF... Chapter 7 Code Generation is back to the original code, with more modifications, but honestly I think there would be little point in trying to follow it as a learning exercise.

I have sent you a PM with some other options and will post any useful links for examples that I can find back in this thread.

Good luck.

dedec0 08-29-2016 09:27 AM

Chapter 4: first try (does not work yet)
 
Chapter 4 is fairly fast to read, but many things are implied or come from previous chapters. I have made a few files to achieve something that makes sense for the whole chapter (and, if possible, that can be compiled, run and tested, not yet achieved).

I had some difficulties with it, though. The files are shown separately below, after everything I write (so you do not have more to read after their start - that is not a comment inside the files, there are some).

The files are being compiled with:

Code:

# first bison because ch4.tab.h is included in flex file
bison -vd ch4.y
flex ch4.lex
gcc lex.yy.c -lfl  # I use -Wall in a gcc alias

Flex and Bison silently work. In gcc output a problem is shown. It complains about yylval used without being declared. Isn't it automatic, like the yy* functions? How should I declare it, what are the best and common ways? I imagine it is close to that union we increased...

Code:

$ gcc lex.yy.c -lfl
/tmp/ccH6IRzd.o: In function `yylex':
/full/path/ch4.lex:36: undefined reference to `yylval'
collect2: ld returned 1 exit status

-------------------------------------------------------------------------

ch4.lex:
Code:

%{
#include <string.h>

 /* #include "simple.tab.h" / * tokens/fichas (Bison output) */
#include "ch4.tab.h"

%}

DIGIT [0-9]
ID    [a-z][a-z0-9]*

%%

":="      { return(ASSGNOP);        }
{DIGIT}+  { return(NUMBER);          }
 /* below there are keywords */
do        { return(DO);              }
else      { return(ELSE);            }
end        { return(END);            }
fi        { return(FI);              }
if        { return(IF);              }
in        { return(IN);              }
integer    { return(INTEGER);        }
let        { return(LET);            }
read      { return(READ);            }
skip      { return(SKIP);            }
then      { return(THEN);            }
while      { return(WHILE);          }
write      { return(WRITE);          }

 /* where IDENT was written IDENTIFIER assumed to be correct word */
 /* {ID}      { return(IDENTIFIER);      } */
 /* ID declaration was changed to return the identifier text and its
    token */
{ID}            {
                /* yylval->union, which is int or (char*) ->ch4.y:9 */
                yylval.id = strdup(yytext);
                return(IDENTIFIER);
            }

[ \t\n\r]+ /* blank, tab, new lines: ignore all whitespace */
.          { return(yytext[0]);      }

ch4.y:
Code:

%start program

 /* SEMANTIC RECORD */
 /* char *id: For returning identifiers */
 /*
Place to easily pasting/cutting the union declaration
 */

%union {
char *id;
}

/* Simple identifier */
 /*
Place to exchange the IDENTIFIER token declarations
 */
%token IDENTIFIER
%type <id> IDENTIFIER

%token LET IN
/* "integer" keyword */
/* both INT e INTEGER exist? Removed INT, occurrences changed.
%token INT
*/
%token INTEGER

/* Same question for NUMBER: should it exist?
  Replaced all of its occurrences with INTEGER
  -> wrong, written numbers X 'integer' keyword. Changes undone
*/
%token NUMBER

 /* FI was missing */
%token SKIP IF THEN ELSE FI END WHILE DO READ WRITE
    /* ASSGNOP was missing */
%token ASSGNOP
%left '-' '+'
%left '*' '/'
%left '<' '>' '=' '' /* these where missing on the book, added */
%right '^ '

%{

#include <stdlib.h> /* For malloc in symbol table */
#include <string.h> /* For strcmp in symbol table */
#include <stdio.h> /* For error messages */
#include "st.h" /* The Symbol Table Module */
#define YYDEBUG 1 /* for debugging / para depuração */

int install( char* sym_name)
{
    symrec *s;
    s = getsym(sym_name);
    if (s == 0)
        s = putsym (sym_name);
    else
    {
        errors++;
        printf("%s is already defined\n", sym_name);
        return 0;
    }
    return 1;
}

int context_check(char* sym_name)
{
    if ( getsym( sym_name ) == 0 )
    {
        printf("%s is an undeclared identifier\n", sym_name);
        return 0;
    }
    return 1;
}


%}

%%

 /* Grammar rules and actions */

program : LET declarations IN commands END ;

declarations : /* empty */
    | INTEGER id_seq IDENTIFIER '.' { install( $3 ); }
;

id_seq : /* empty */
    | id_seq IDENTIFIER ','            { install( $2 ); }
;
commands : /* empty */
    | commands command ';'
;
command : SKIP
    | READ IDENTIFIER                    { context_check( $2 ); }
    | WRITE exp
    | IDENTIFIER ASSGNOP exp            { context_check( $1 ); }
    | IF exp THEN commands ELSE commands FI
    | WHILE exp DO commands END
;
 /* expressions */
exp : NUMBER
                                    /* in book it is $2, wrong */
  | IDENTIFIER                            { context_check( $1 ); }
  | exp '<' exp
  | exp '=' exp
  | exp '>' exp
  | exp '+' exp
  | exp '-' exp
  | exp '' exp
  | exp '/' exp
  | exp '^ ' exp
  | '(' exp ')'
;

%%

 /* C subroutines */

/* não tem saída, a árvore de reconhec. fica implícita */
/* no output, implicit parse tree */
int main( int argc, char *argv[] )
{
    extern FILE *yyin;
    ++argv; --argc;
    yyin = fopen( argv[0], "r" );
    yydebug = 1;
    errors = 0;
    yyparse ();
    return 0;
}
int yyerror (char *s) /* called by yyparse() in error situations */
{
    printf ("%s\n", s);
    return 0;
}

st.h:
Code:

/* symbol table module */
/*    -> to be included in Bison file */
typedef struct symrec
{
    char *name;                    /* symbol name / nome do símbolo */
    struct symrec *next;    /* list link / elo da lista */
} symrec;

symrec *sym_table = (symrec *)0;
symrec* putsym(char *);
symrec* getsym(char *);

symrec* putsym( char *sym_name)
{
    symrec *ptr;
    ptr = (symrec *) malloc( sizeof(symrec) );
    ptr->name = (char *) malloc( strlen(sym_name) + 1 );
    strcpy( ptr->name, sym_name);
    ptr->next = (symrec*) sym_table;
    sym_table = ptr;
    return ptr;
}

symrec* getsym( char* sym_name)
{
    symrec *ptr;
    for (
        ptr = sym_table;
        ptr != (symrec *) 0;
        ptr = (symrec *) ptr->next
        )
        if( strcmp( ptr->name, sym_name) == 0 )
            return ptr;
    return 0;
}


dedec0 08-29-2016 12:45 PM

Quote:

Originally Posted by astrogeek (Post 5591385)
[...]
Quote:

Originally Posted by dedec0
For now I am keeping the "%type <id> IDENTIFIER" line, but the error continues.

That is not correct IF IDENTIFIER is a terminal symbol, as appears to be the case.

It may be that the '$2 from command' error you are seeing is complaining about exp, not IDENTIFIER (but, also not clear to me).

The first parts your post (removed) are about things we discussed in the previous posts. Skim through them, I would suggest.

No, IDENTIFIER is not a terminal symbol. It is a variable name in their declaration or use. The Simple grammar was showed in chapter 1 (pdf page 9). I have made a text file from it, given below.

I am not sure what is/was the exp problem. There was an error due to grammar ambiguity that I solved it (somewhere I do not remember now). There was also a simple error too about a "$M" being used where a "$K" makes sense.

----------------------------------------------------------
ch1.txt:
Code:

# Context-free grammar for the language Simple

program ::= LET [ declarations ] IN command_sequence END
declarations ::= INTEGER [ id_seq ] IDENTIFIER .
id_seq ::= id_seq... IDENTIFIER ,
command_sequence ::= command... command
command ::= SKIP ;
            | IDENTIFIER := expression ;
            | IF exp THEN command_sequence ELSE command_sequence FI ;
            | WHILE exp DO command_sequence END ;
            | READ IDENTIFIER ;
            | WRITE expression ;
expression ::= NUMBER | IDENTIFIER | '(' expression ')'
            | expression + expression | expression - expression
            | expression * expression | expression / expression
            | expression ^ expression
            | expression = expression
            | expression < expression
            | expression > expression


astrogeek 08-30-2016 02:42 AM

Hi dedec0!

Sorry to be so late getting in here, and I only have a few minutes so I'll have to be brief for now.

Quote:

Originally Posted by dedec0 (Post 5597853)
The first parts your post (removed) are about things we discussed in the previous posts. Skim through them, I would suggest.

No, IDENTIFIER is not a terminal symbol. It is a variable name in their declaration or use. The Simple grammar was showed in chapter 1 (pdf page 9). I have made a text file from it, given below.

I looked back over the previous posts but am not sure what you would like me to see from reviewing them. Could you be more specific.

And yes, IDENTIFIER is a terminal symbol. Perhaps you are confused about what a terminal symbol is?

In the simple CF grammer on pages 8-9 (bottom page numbers 2-3), IDENTIFIER appears only on the right hand side of any expressions and is used exactly as a terminal symbol. Also, immediately below that figure on page 9 (page number 3) it says...

Quote:

...Figure 1.1 where the non-terminal symbols are given in all lower case,
the terminal symbols are given in all caps or as literal symbols and, where the
literal symbols conflict with the meta symbols of the EBNF, they are enclosed
with single quotes. The start symbol is program. While the grammar uses
upper-case to high-light terminal symbols, they are to be implemented in lower
case.
And at the top of page 12 (page number 6), IDENTIFIER is defined as a token - tokens are terminal symbols.

Further along on page 16 (page number 10) in the scanner example code, IDENTIFIER is a token returned by the scanner to the parser - i.e., a terminal symbol.

I don't know what to add to that, and I am not sure how you think that should be used if it is not a terminal symbol... perhaps if you could try to explain to me more precisely what you think IDENTIFIER is in the CF grammer, and how that relates to IDENTIFIER in the scanner, maybe we could reach a better understanding of this example.

A little OT: Late last week I was able to spend a little time refreshing my Flex/Bison notes and I found a link to an example I do not recall seeing before - and it is a very good Flex Bison C++ Example too! The C++ Flex code may be a little confusing so just focus on the lexer rules and the grammer in the Bison code. But it compiles and works well and will give you an example parser that builds the AST explicitly. Hope it is helpful.

dedec0 08-30-2016 06:34 AM

Two parts:

1. A terminal symbol is one that does change anymore. It may be something written in the source code, like numbers values, variable names or language symbols ('+','-', ...). IDENTIFIER is the token given where variable names are declared or used in Simple. Nonterminal symbols are variables in the grammar.

For example, in the grammar we have a line: program ::= LET [ declarations ] IN command_sequence END. In this line: program, declarations and command_sequence are nonterminal symbols (or variables); LET, IN and END are terminal symbols.

In this book, the convention to use uppercase/lowercase for nonterminal/terminal symbols is different from what we always used in school.

In a Simple program (source code), we may have:

Code:

let integer trees, seed_package in
    trees = 3
    seed_package = 17
    while trees < 1000 do
        trees = trees + seed_package
    end
    write trees
end

This program has 2 constants: the current number of trees; the amount of seeds in a package we buy. We plant all seeds of packages. We want to have at least a thousand trees. How many trees will we have, then?

Is there something you want to correct or add in my ideas?

2. I am waiting for an answer and comments in my post about chapter 4

ntubski 08-30-2016 06:53 AM

Quote:

Originally Posted by dedec0 (Post 5598179)
1. A terminal symbol is one that does change anymore. It may be something written in the source code, like numbers values, variable names or language symbols ('+','-', ...). IDENTIFIER is the token given where variable names are declared or used in Simple. Non-terminal symbols are variables in the grammar.

According to that definition IDENTIFIER is a terminal. I don't really like the phrase "one that does [not] change anymore". A more precise definition would be: A terminal symbol is one that never occurs on the left hand side of the grammar rules. Hence a derivation always terminates with a list of terminal symbols and the terminal nodes of a parse tree are always terminal symbols.

dedec0 08-30-2016 07:48 AM

Indeed, that is better. Thank you also for remembering the parse tree concept and its characteristics. :)

sundialsvcs 08-30-2016 11:29 AM

Another way to think of it is that a "terminal symbol" is "something that you can see in the source-code you are compiling." They are the structural elements that define the language and that allow positions in the grammar to be non-ambiguously determined.

Noterminal symbols correspond to something else in the grammar.

ntubski 08-31-2016 08:35 AM

Quote:

Originally Posted by dedec0 (Post 5597752)
Code:

# first bison because ch4.tab.h is included in flex file
bison -vd ch4.y
flex ch4.lex
gcc lex.yy.c -lfl  # I use -Wall in a gcc alias

Flex and Bison silently work. In gcc output a problem is shown. It complains about yylval used without being declared. Isn't it automatic, like the yy* functions? How should I declare it, what are the best and common ways? I imagine it is close to that union we increased...

Code:

$ gcc lex.yy.c -lfl
/tmp/ccH6IRzd.o: In function `yylex':
/full/path/ch4.lex:36: undefined reference to `yylval'
collect2: ld returned 1 exit status


This is an error about yylval being undefined (see Declaration vs. definition). It means you didn't link the code that defines it. Perhaps it is defined in the bison output? Probably you should do something like this:

Code:

# First produce C files
bison -vd ch4.y # produces ch4.tab.c (right?)
flex ch4.lex # produces lex.yy.c

# Second compile (that's what the -c option means) C files to object (.o) files.
gcc -c lex.yy.c # produces lex.yy.o
gcc -c ch4.tab.c # produces ch4.tab.o

# Lastly, link all the object files together into a program
# (I called it "ch4-program", if you drop the -o option you get a program named "a.out").
gcc -o ch4-program lex.yy.o ch4.tab.o -lfl

The command you showed was trying to create a program using the source from flex only.

sundialsvcs 08-31-2016 10:53 AM

Every parser-generator which produces a compilable source-file must provide some means for you to include headers at the top of the generated code, if it does not provide them already. Statements needed to #include the header-files needed by the compiler must be provided-for by appropriate means. These sections will generally be included in the output source-file verbatim.

ntubski 08-31-2016 11:10 AM

Quote:

Originally Posted by sundialsvcs (Post 5598791)
Every parser-generator which produces a compilable source-file must provide some means for you to include headers at the top of the generated code, if it does not provide them already. Statements needed to #include the header-files needed by the compiler must be provided-for by appropriate means. These sections will generally be included in the output source-file verbatim.

This is useful general advice, but please note that headers are for declarations, while had the problem mentioned here was a lack of definition.

https://en.wikipedia.org/wiki/Declar...vs._definition

dedec0 08-31-2016 01:40 PM

Almost there, chapter 4...
 
Thank you for these clarifications, ntubski.

Now I have retried these steps:

Code:

bison -vd ch4.y
flex ch4.lex
gcc -c lex.yy.c
gcc -c ch4.tab.c
gcc lex.yy.o ch4.tab.o -lfl
# Or everything in just one line to read the warnings easier:
# bison -vd ch4.y; flex ch4.lex; gcc -c lex.yy.c; echo ==1; gcc -c ch4.tab.c; echo ==2; gcc lex.yy.o ch4.tab.o -lfl

The compilation of ch4.tab.c had a few problems. I will say what I did for which I "fixed" (not all are good fixes, just to make it work), and then what remains.

1.
Code:

ch4.y:48:1: warning: "YYDEBUG" redefined
ch4.tab.c:76:1: warning: this is the location of the previous definition

Redefined?! There are only two ocurrences of 'define YYDEBUG'. One is in our ch4.y file, the other is a conditional define to 0 in the .c file outputed by Bison. Fixed by commenting the ch4.y:48, which seems wrong or bad: is debugging still active?

2.
Code:

ch4.y: In function ‘install’:
ch4.y:58: error: ‘errors’ undeclared (first use in this function)
ch4.y: In function ‘main’:
ch4.y:128: error: ‘errors’ undeclared (first use in this function)

Fixed by creating a global variable 'Errors' (initialized to 0) and changed those two lines to use it instead of 'errors'. The second use of it was just to initialize it, so I removed that line. Is everything correct? Or 'errors' is somehow provided by flex/bison or the library we use (I searched in Bison doc)?

File ch4.y with all these two changes is below. One problem remains, I do not know what to do:

3.
Code:

ch4.y:130: error: ‘yydebug’ undeclared (first use in this function)
yydebug is mentioned in Bison's documentation (https://www.gnu.org/software/bison/manual/bison.html) but reading there did not ring anything for me. I decided to comment the line that uses it.

4.
Code:

ch4.tab.c:1496: warning: implicit declaration of function ‘yyerror’
I declared the function yyerror we have before the function install, it removed the warning.

There is the -t option for bison to activate debugging, but I think it should also work in the code, which I could not do.

Now I have an a.out that seems to work (I have not played with it yet). But I think the fixes I did are not the best that could be...

There is a warning still there:
Code:

$ gcc -c ch4.tab.c
ch4.tab.c: In function ‘yyparse’:
ch4.tab.c:1349: warning: implicit declaration of function ‘yylex’

This function is not mentioned in our .y file. It is probably in the Flex library -lfl. Should I disconsider this warning? It seems to be something repetitive, probably related with that Flex library. Isn't something missing in the source code or compilation commands?


--------------------------------------------
ch4.y (changes in red):
Code:

%start program

 /* SEMANTIC RECORD */
 /* char *id: For returning identifiers */
 /*
Place to easily pasting/cutting the union declaration
 */

%union {
char *id;
}

/* Simple identifier */
 /*
Place to exchange the IDENTIFIER token declarations
 */
%token IDENTIFIER
%type <id> IDENTIFIER

%token LET IN
/* "integer" keyword */
/* both INT e INTEGER exist? Removed INT, occurrences changed.
%token INT
*/
%token INTEGER

/* Same question for NUMBER: should it exist?
  Replaced all of its occurrences with INTEGER
  -> wrong, written numbers X 'integer' keyword. Changes undone
*/
%token NUMBER

 /* FI was missing */
%token SKIP IF THEN ELSE FI END WHILE DO READ WRITE
    /* ASSGNOP was missing */
%token ASSGNOP
%left '-' '+'
%left '*' '/'
%left '<' '>' '=' '' /* these where missing on the book, added */
%right '^ '

%{

#include <stdlib.h> /* For malloc in symbol table */
#include <string.h> /* For strcmp in symbol table */
#include <stdio.h> /* For error messages */
#include "st.h" /* The Symbol Table Module */

/* Redeclaration? Should not be. #define YYDEBUG 1 / * for debugging / para depuração */

/*  Global variable 'Errors' to substitute the previously
  undeclared variable 'errors' */
  int Errors = 0;


int yyerror (char *s);    // function header, removes a warning
int install( char* sym_name)
{
    symrec *s;
    s = getsym(sym_name);
    if (s == 0)
        s = putsym (sym_name);
    else
    {
        Errors++;
        printf("%s is already defined\n", sym_name);
        return 0;
    }
    return 1;
}

int context_check(char* sym_name)
{
    if ( getsym( sym_name ) == 0 )
    {
        printf("%s is an undeclared identifier\n", sym_name);
        return 0;
    }
    return 1;
}


%}

%%

 /* Grammar rules and actions */

program : LET declarations IN commands END ;

declarations : /* empty */
    | INTEGER id_seq IDENTIFIER '.' { install( $3 ); }
;

id_seq : /* empty */
    | id_seq IDENTIFIER ','            { install( $2 ); }
;
commands : /* empty */
    | commands command ';'
;
command : SKIP
    | READ IDENTIFIER                    { context_check( $2 ); }
    | WRITE exp
    | IDENTIFIER ASSGNOP exp            { context_check( $1 ); }
    | IF exp THEN commands ELSE commands FI
    | WHILE exp DO commands END
;
 /* expressions */
exp : NUMBER
                                    /* in book it is $2, wrong */
  | IDENTIFIER                            { context_check( $1 ); }
  | exp '<' exp
  | exp '=' exp
  | exp '>' exp
  | exp '+' exp
  | exp '-' exp
  | exp '' exp
  | exp '/' exp
  | exp '^ ' exp
  | '(' exp ')'
;

%%

 /* C subroutines */

/* não tem saída, a árvore de reconhec. fica implícita */
/* no output, implicit parse tree */
int main( int argc, char *argv[] )
{
    extern FILE *yyin;
    ++argv; --argc;
    yyin = fopen( argv[0], "r" );
    // yydebug = 1;          // line commented -> bad fix

    // errors = 0;            // initialization unneeded => line removed
    yyparse ();
    return 0;
}
int yyerror (char *s) /* called by yyparse() in error situations */
{
    printf ("%s\n", s);
    return 0;
}


ntubski 08-31-2016 02:55 PM

Quote:

Originally Posted by dedec0 (Post 5598874)
1.
Code:

ch4.y:48:1: warning: "YYDEBUG" redefined
ch4.tab.c:76:1: warning: this is the location of the previous definition

Redefined?! There are only two ocurrences of 'define YYDEBUG'. One is in our ch4.y file, the other is a conditional define to 0 in the .c file outputed by Bison.

I believe the one in the .y file would also end up in the .c file. Somehow you need to arrange that it should go before the conditional generated by bison (so that the condition prevents it from taking effect). Note the all warnings/errors listed as occurring in .y files are really from .c files, but bison puts #line directives in the .c file to tell the compiler which part of the .y file is "responsible" for that code. Then gcc lists .y as the origin of the error so you can fix it in the real source rather than the generated one (which would be overwritten anyway).

Quote:

2.
Code:

ch4.y: In function ‘install’:
ch4.y:58: error: ‘errors’ undeclared (first use in this function)
ch4.y: In function ‘main’:
ch4.y:128: error: ‘errors’ undeclared (first use in this function)

Fixed by creating a global variable 'Errors' (initialized to 0) and changed those two lines to use it instead of 'errors'. The second use of it was just to initialize it, so I removed that line. Is everything correct? Or 'errors' is somehow provided by flex/bison or the library we use (I searched in Bison doc)?
I guess it's just something particular to the C code you had in your bison file, so defining it there makes sense.


Quote:

3.
Code:

ch4.y:130: error: ‘yydebug’ undeclared (first use in this function)
yydebug is mentioned in Bison's documentation (https://www.gnu.org/software/bison/manual/bison.html) but reading there did not ring anything for me. I decided to comment the line that uses it.

4.
Code:

ch4.tab.c:1496: warning: implicit declaration of function ‘yyerror’
I declared the function yyerror we have before the function install, it removed the warning.
Looking at https://www.gnu.org/software/bison/m...n.html#Tracing it seems that those are symbols that bison defines. I guess it should provide declarations as well somehow, but I'm not exactly sure how it's supposed to work.

Quote:

Code:

$ gcc -c ch4.tab.c
ch4.tab.c: In function ‘yyparse’:
ch4.tab.c:1349: warning: implicit declaration of function ‘yylex’

This function is not mentioned in our .y file. It is probably in the Flex library -lfl. Should I disconsider this warning? It seems to be something repetitive, probably related with that Flex library. Isn't something missing in the source code or compilation commands?
I think that's the function produced by flex from your .lex file. Does flex produce a header with declarations? If not I guess you should add a declaration yourself.

astrogeek 09-01-2016 03:37 AM

First a personal comment - As this PDF, this alleged Flex/Bison "book" originated from Micro$oft I am inclined to think it is an example of them poisoning the FREE software well by providing examples that are so obscure, incomplete and badly written that they will discourage all who attempt to use them from ever trying to use FREE software again in the future! :thumbsdown: Sheeeesh!!

That having been said, I have been able to get the Chapter 4 code to compile and run, with great effort I will add... but what it gets you I do not know...

But first...

Quote:

Originally Posted by dedec0 (Post 5598179)
Two parts:

1. A terminal symbol is one that does [not] change anymore... Nonterminal symbols are variables in the grammar.

As ntubski has pointed out, thinking of a terminal as something that does not change (any more?), and nonterminals as things that do change, totally confuses their use and meaning! That description only has meaning, limited at that, if you keep clearly in mind that the grammar rules are production rules which are applied repeatedly, beginning with the start symbol and ending only when there are no nonterminals on the right hand side of that which is produced. (Think: generative grammars).

I emphasize "production rules" and the direction of the process because that is the direction the parse tree grows, and the parse tree is the ultimate product of the process whether directly visible or not. It is all too easy, and common, to think of the tokens (terminals) as the starting point and the start symbol as the end point because in the code that looks like the direction of "flow" - it is not!

"Variables" reinforces that confusion. I think this terminology is responsible for a great deal of confusion among those trying to learn parsing. Some authors call nonterminals "variables", but that is very far from reality, especially if you think of them as similar to "variables" in a programming language. You are writing a compiler, naturally you want to "see" the grammar as just more code - it is not!

Terminals are symbols that stand-in for some real "word" which can appear in a syntactically correct sentence. Nonterminals are symbols which stand-in for intermediate states* and can never appear in a syntactically correct sentence. Those states, in general, are much more complex than the term "variable" indicates. The start symbol, itself a nonterminal, is just a stand-in for all possible correct sentences which can be produced by the grammar.

*Intermediate states between the start symbol and the terminal state reached by repeated application of the production, or rewriting rules.

My point (much too wordy and not at all as clear as it is in my own brain!) is that you need to really grasp these terms, what exactly grammar is and what the process and product of parsing are. Work to distinguish the grammar and the parsing process from the code - the code your compiler will produce and the code in which the parser is written. It is a separate "thing" altogether.

Quote:

Originally Posted by dedec0 (Post 5598179)
In this book, the convention to use uppercase/lowercase for nonterminal/terminal symbols is different from what we always used in school.

Just to be clear, when I made reference in another post to IDENTIFIER being a terminal and being all upper case, I was saying that it is a terminal because of how it is used in the grammer - being all upper case only indicates that this was the use intended by the author as well.

An upper/lower convention is very helpful when reading the grammer and code, but it is the useage which determines whether a symbol is a terminal or nonterminal.

Quote:

Originally Posted by dedec0 (Post 5598179)
2. I am waiting for an answer and comments in my post about chapter 4

I am way far past my time limit for the day (week!) ;) so I'll provide more detail in a separate post on this one. Here are a few quick comments to give you something to work on until then...

The code as written in the "book" is not capable of being compiled. That is not due to typos or simple errors - it is just incomplete. I really don't think it was intended to be run at this point (maybe in Chapter 7, but that will be another story I am sure!).

Some useful hints - your modifications are not correct I think, according to what is in chapter 4. In particular, I think they may intend to use both IDENT and IDENTIFIER as separate symbols, just look at how they used side by side in the grammar and in the modification text. As such I defined a type for both and that resolves several issues (and opens others). When you make the modifications, look carefully at the surrounding context in the book and do it exactly as shown (which seems to me to be different than your posted code).

There are a couple of undefined variables - errors comes to mind. And types, types, types...

With those changes I find only one $x reference which seems to be wrong (but may be awaiting further modification in Chapter 7, very confusing).

Code:

exp : INT
 | IDENT { context_check( $1 ); } /* $2 in book */

I am less sure whether INT and INTEGER are intended to be separate symbols, but have left them both defined as they do not conflict with each other at this point.

Ah, conflicts... SURELY your bison does not silently build the code! Can you recheck that! I count 39 shift-reduce conflicts, mostly resulting from all the as yet undefined actions for all those exp terms!

And finally - the big answer you have been waiting for...

Code:

$ gcc lex.yy.c -lfl
/tmp/ccH6IRzd.o: In function `yylex':
/full/path/ch4.lex:36: undefined reference to `yylval'
collect2: ld returned 1 exit status

Look at that - it is a linker error, not a compiler error - it is defined in the header file, but there is no object code to link to! You have to compile it! Try this...

Code:

gcc lex.yy.c ch4.tab.c -lfl
So how does it run...? I leave that for another day as well!

Good luck!

dedec0 09-01-2016 07:28 AM

Quote:

Originally Posted by astrogeek (Post 5599162)
First a personal comment - As this PDF, this alleged Flex/Bison "book" originated from Micro$oft I am inclined to think it is an example of them poisoning the FREE software well by providing examples that are so obscure, incomplete and badly written that they will discourage all who attempt to use them from ever trying to use FREE software again in the future! :thumbsdown: Sheeeesh!!

We are not being fair to Microsoft in this discussion. I have thought about saying it before, but kept it because it did not seem necessary. I am not a fan of Microsoft also, but it has been doing a few good things toward the open source things - that I will not discuss or tell in this thread, please do not do it. All that Microsoft has done in this story is to distribute the file of this book. It is available in many other places as well, I chose its website because it seems more reliable, given the other choices I saw then. Check it yourselves: https://duckduckgo.com/html/?q=antho...bison+compiler.

This book is free: we can copy and distribute it as much as we want, no harm done to anyone. This is written in its pages. When the author, Anthony A. Aaby, wrote it, he was in USA, in the then called Walla Walla College (website was cs.wwc.edu, does not exist anymore). His email account is also written in the book: aabyan@wwc.edu. This account is not active anymore, I discovered this when I tried to contact him for the first "bigger" problems I had. This college is now an university, changed its name, he does not work there anymore. By finding and contacting them, they have told me most of this information. They were kind and helpful as they could, even offering me the contact of their Computer Science team to discuss anything - which was out of scope for me, I preferred to choose this forum.

There has been an initiative to make a Wikibook from this work. I think the idea is good, since the book has many details that seems incomplete. And the book ideas have a place to exist, just need some more work - partially done in this thread we are making.

Just to end this post with a good thought for everybody... a composer from my country says: "the best things in life are free".

dedec0 09-01-2016 12:04 PM

Chapter 5: it is simple (using irony to tell about it)
 
Chapter 4 is still waiting for some more work (from me). I plan to do this in the next two days, I will make another post just for that.

While the discussion in the last posts were made, I have finished chapter 5. It just presents a few fairly simple ideas on optimization of the parse tree. But it does not present anything else. No code, just a few pseudo code transformations and some explanation where needed. The ideas are showing using a tree, sadly. We must get it ourselves.

I will not try to make some code to optimize now - not until I finish this book and know better how to use these tools, at least. =)

astrogeek 09-01-2016 02:56 PM

Sorry, it was not my intent to divert this thread into an anti-M$ rant. My comment was plainly labeled as personal opinion and kept brief - but my frustrations with the book demanded some vent at the time. My disdain for all things M$ is irreconcilable, we won't go there.

So with only the most brief response possible...

Quote:

Originally Posted by dedec0 (Post 5599208)
We are not being fair to Microsoft in this discussion.

The M$ corporate has done immeasurable damage to all our lives, all our societies and all our futures. The only fairness will come when the individuals responsible for that behavior are held to account as the criminals they are, and their ill-gotten fortunes liquidated and distributed to the poor.

Quote:

Originally Posted by dedec0 (Post 5599208)
This book is free: we can copy and distribute it as much as we want, no harm done to anyone.

That is entirely thanks to the author! And as I do not know the originally intended audience, I cannot say whether the book accomplished what the author intended or not - it may have been entirely successful in its intended purpose for all I know. But clearly that purpose was not focused on teaching Flex and Bison per se.

But the fact that it was chosen as a resource on the M$ site stains it... like a dentist passing out free candy! The candy itself is not evil, but the dentist's motives are necessarily open to question! That was the thought I expressed. ;)

The best things are indeed FREE, as in FREEDOM! But do not confuse free of cost, or "open source" with FREEDOM!

That is enough... back to the grammar in another post...

sundialsvcs 09-01-2016 06:00 PM

Actually, "when human creative effort :banghead: meets the computer," Microsoft Corporation is in pretty much the same boat as the rest of us. In order to (profitably) produce products, they first have to make them work. Bison happens to be a technology that they use a lot, because it's a little bit more forgiving than Yacc. (And, due to Microsoft's long corporate history, they're "stuck" with some syntaxes for which Bison is better.)

So, in this case, I would regard their "corporate offering" simply at face value, and say, "thank you."

I consider "Microsoft Press" to be a superb technical publishing house, and I still consider their "Software Engineering Classics" boxed-set to be ... well ... "a classic."

No matter what you might think of the layers of mega-corporate bullsh*t mantle :rolleyes: that has been lathered upon Microsoft over so many years and by so many Wall Street investors, at the heart of the company there still beats the heart and soul of a group of people who have done some truly-magnificent things hacks ... successfully. (Even, dare I say it, "gracefully.")

Quote:

"Sonofab*tch ... Dammit!! ... I don't wanna say it 'cept I have to ... ... They a-r-e good! ... :banghead: ... They are That good!!" :eek:

Yes, "still." Despise them if you want to, but, "they are where they are today" because ... because ... they earned it. (Aye, in spite of Wall Street.)

dedec0 09-03-2016 04:46 AM

Chapter 4: tests
 
So I have made the partially unsatisfying steps (shown before) to compile chapter 4 files (also given above). It was not working as expected, I was using a correct Simple code - not! The grammar given in chapter 1 is not consistent with the lex and bison files (or parts of them) shown after that. It is obvious to see it, but I did not even considered or thought about checking the consistence of this detail.

A Simple program that satisfies the current compilation steps is:

Code:

let integer trees, seedpackage. in
    trees := 3;
    seedpackage := 17;
    while trees < 1000 do
        trees := trees + seedpackage;
    end;
    write trees;
 end

It is much different from the Simple program I wrote here before. It is more Complicated! xD

Now I am a bit puzzled by what is being considered from ch4.lex and from ch4.y, and what is wrong in them, for this execution. Some "obvious" comments would be nice. Some quick questions I have:

1. The lex file does not have tokens for +-*/^><= operators. They appear explicitly in bison file. I should create them, right? Would it be good to make a better scanner (more complete) and separate more clearly the scanner from the parser?

2. Bison file has a precedence for '='. It is different from ':=' (ASSGNOP).
Code:

let integer a, b, c. in
  c := 3;
  a := b = c;
 end

Would it be the comparison operator that does that "==" does in C?

3. The bison file has the line %right '^ '. Note the space. Does a "^" followed by a space makes any sense? I am tempted to remove that space... both programs are accepted now, with the space in that line:

Code:

let integer a,b,c. in
  a := b^ c;
 end

Code:

let integer a,b,c. in
  a := b^c;
 end

4. In bison file, does the rule exp : exp '' exp, with two single quotes together, make sense?

5. Is the Simple language a creation of the book author? Have you ever seen it somewhere else? My searches had no other ocurrences.

6. Any other "hidden" things here?

dedec0 09-03-2016 12:38 PM

Quote:

Originally Posted by dedec0 (Post 5600131)

2. Bison file has a precedence for '='. It is different from ':=' (ASSGNOP).
Code:

let integer a, b, c. in
  c := 3;
  a := b = c;
 end

Would it be the comparison operator that does that "==" does in C?

Yes. This was confirmed (written) elsewhere. It is intended do be used in conditionals (if, while), and not in an attribution, like above. The above code is accepted, but is not correct.

Quote:

Originally Posted by dedec0 (Post 5600131)
4. In bison file, does the rule exp : exp '' exp, with two single quotes together, make sense?

They do not make sense. The bison file did not have the rule for the * operator... guess where it should be!

The other questions are still waiting answers...

astrogeek 09-03-2016 11:55 PM

Quote:

Originally Posted by dedec0 (Post 5600131)
So I have made the partially unsatisfying steps (shown before) to compile chapter 4 files (also given above). It was not working as expected

I must ask the question, how are you expecting it to work at this point? Are you expecting it to be capable of processing "Simple" programs?

Quote:

Originally Posted by dedec0 (Post 5600131)
I was using a correct Simple code - not! The grammar given in chapter 1 is not consistent with the lex and bison files (or parts of them) shown after that. It is obvious to see it, but I did not even considered or thought about checking the consistence of this detail.

No it is neither consistent, nor complete.

Quote:

Originally Posted by dedec0 (Post 5600131)
Now I am a bit puzzled by what is being considered from ch4.lex and from ch4.y, and what is wrong in them, for this execution. Some "obvious" comments would be nice. Some quick questions I have:

1. The lex file does not have tokens for +-*/^><= operators. They appear explicitly in bison file. I should create them, right? Would it be good to make a better scanner (more complete) and separate more clearly the scanner from the parser?

Those are all handled by the 'any character' rule in the lexer.

Quote:

Originally Posted by dedec0 (Post 5600131)
2. Bison file has a precedence for '='. It is different from ':=' (ASSGNOP).
Code:

let integer a, b, c. in
  c := 3;
  a := b = c;
 end

Would it be the comparison operator that does that "==" does in C?

From the above usage it would be a comparison operator I think.

But I see no precedence set for it in the bison code...??

Quote:

Originally Posted by dedec0 (Post 5600131)
3. The bison file has the line %right '^ '. Note the space. Does a "^" followed by a space makes any sense? I am tempted to remove that space... both programs are accepted now, with the space in that line:

Code:

let integer a,b,c. in
  a := b^ c;
 end

Code:

let integer a,b,c. in
  a := b^c;
 end


No, that is an extraneous space introduced in the copy paste. The 'ˆ' in the PDF is also not a caret character, it is unicode 0x02c6. You should also double check the rest as -,*,_, all quotes and ^ were not ASCII in all cases (also not consistent!).

The ONLY way that rule makes sense is to remove the space and replace the 0x02c6 with a real ASCII '^' (0x5e).

There were _many_ such cases in the PDF, so double check everything from your copy paste code...

It might also be that your local locale settings and editors affect how copy/paste translates those.

Quote:

Originally Posted by dedec0 (Post 5600131)
4. In bison file, does the rule exp : exp '' exp, with two single quotes together, make sense?

No. It not a single error either - check the lines before and after it in Chapter 2 and Chapter 4 - they are not the same...

Quote:

Originally Posted by dedec0 (Post 5600131)
5. Is the Simple language a creation of the book author? Have you ever seen it somewhere else? My searches had no other ocurrences.

6. Any other "hidden" things here?

I have never heard of a commonly known "Simple" language, but I have seen many examples referred to by their respective authors as "a simple language", even Levine uses the term in Flex & Bison (O'Reilly). So as far as I know there is no "Simple" language, it must be an invention of the author.

So may I ask, are you trying to write the wiki book that you mentioned?

dedec0 09-04-2016 08:02 AM

I will answer each question or comment that you have made (without quoting, assuming it will be easier and faster to read it).

At this point (chapter 4) I expect it to accept Simple programs. It does not do anything. When the source is not recognized, an error is shown - and this fact made me see inconsistencies in grammar and definition pages of the book.

I know +-*/<>= are handled by the . in the lex file. The point I tried to make/ask is: they should have a token instead of being literally recognized in the parser.

For the = operator and its "precedence", I meant the lines:

Code:

%left '-' '+'
%left '*' '/'
%left '<' '>' '=' /* these where missing on the book, added */
%right '^'

The order of these lines and elements does not have a "last resort" consequence on their precedence?

For the "unicode or normal char" differences, I have tested most of the copied code. It should not have more of this.

So Simple is author's creation. I feel more free to change it in some details that seem bad or "not so simple".

I am not trying to write the wikibook now since I cannot completely understand and complete a considerable part of the book now. But I would do it otherwise. This thread, as it got, can probably help whoever tries that.

dedec0 09-09-2016 06:36 AM

Chapter 5
 
Chapter 5 presents some ideas for optimization, but it does not give a single example. May you help me getting a simple one to work? I believe that once this example is given, I should be able to add other optimizations to it.

The first step that I imagine is to make a full Simple language compiler, which is not what chapter 4 ended with (what I have now, described in posts before this one). Is this correct? What do you suggest?

astrogeek 09-19-2016 03:14 AM

Hi dedec0! Sorry to be so slow responding.

Partly, my delay is because I have been mostly unavailable... but partly it is because each time I do try to make sense of the (badly written and hopelessly incomplete IMO) book, it takes far too much time to fill in the mystery pieces and try to give a reasonable reply - so I end without being able to contribute anything helpful.

I have just completed that exercise yet again, and am inclined to make this my last trip around this particular hamster wheel.

Quote:

Originally Posted by dedec0 (Post 5600526)
At this point (chapter 4) I expect it to accept Simple programs. It does not do anything. When the source is not recognized, an error is shown - and this fact made me see inconsistencies in grammar and definition pages of the book.

I am not sure that I understand what you are expressing here.

Are you saying that you expect it to accept - and process in some way - "Simple" programs, but you find that it does not do anything useful?

Or are you saying that it should accept the syntax without doing anything with it? This is loosely what it I would expect.

But as you have found, while it accepts user input strings (and does nothing with them), the "syntax" (i.e. grammar) it recognizes is very ambiguous (i.e. ill-defined), with 39 shift-reduce errors by my count - vagueness and errors are indeed the order of the day!

How to fix that? I cannot read the original author's mind so I cannot say how they intended to further develop this example. In as much as they did not complete the exercise themselves in this "book", it is kind of an open ended question.

Quote:

Originally Posted by dedec0 (Post 5600526)
I know +-*/<>= are handled by the . in the lex file. The point I tried to make/ask is: they should have a token instead of being literally recognized in the parser.

That too is up to the author of the lexer and parser.

For myself, I almost always explicitly define every token and almost never rely on a "." handler to return the ASCII value as the token. On the other hand, Bison was specifically designed to not conflict with ASCII valued tokens just for this purpose, and either way they are just unique integers... so it is really down to your preference (the code author).

Quote:

Originally Posted by dedec0 (Post 5600526)
For the = operator and its "precedence", I meant the lines:

Code:

%left '-' '+'
%left '*' '/'
%left '<' '>' '=' /* these where missing on the book, added */
%right '^'

The order of these lines and elements does not have a "last resort" consequence on their precedence?

Again, I am not really sure I understand your question.

The way Bison works is that symbols defined further down in the code have a higher precedence. If you do not assign any symbol's precedence explicitly, it will be assigned a precedence level based on where Bison first encounters it when processing the rules.

By explicitly assigning the associativity (%left, %right, %nonassoc), you are also declaring the precedence by the order of the lines. Symbols on the same line have the same precedence.

In the book, those symbols not declared will have a higher precedence based on their order of occurrance in the code, top to bottom.

Again, I have no idea what the author ultimately intended, whether this was an omission or intentional.

Quote:

Originally Posted by dedec0 (Post 5600526)
So Simple is author's creation. I feel more free to change it in some details that seem bad or "not so simple".

I am not trying to write the wikibook now since I cannot completely understand and complete a considerable part of the book now. But I would do it otherwise. This thread, as it got, can probably help whoever tries that.

I think that to write the wikibook, you will first have to complete the language specification and the book itself. I am still of the opinion that it was rather obviously never completed and as such is not much useful as an accessible learning source. On the other hand, completing the book would indeed be a good, but difficult exercise for an advanced student.

Finally, you have asked about an example of the optimizations suggested in Chapter 5...

The text suggests changes in the parse tree (i.e. grammar specification) to optimize specific output code structures.

All I can say is that you would want to understand the parse tree generated by the existing rules for those cases, and modify them and likely add new nonterminals and rules to recognize those cases and lead to some desired result.

The rules as they exist do not lend themselves to that level of optimization (IMO), especially with existing ambiguities and without any action handling code yet written. Again, the author may have had something specific in mind, but I do not know what it was, and they did not elaborate...

Good luck!

dedec0 09-19-2016 12:38 PM

No problem with your delay. Thank you for these efforts with my posts.

I will think more later, and (if and where still necessary) I will try to rewrite my questions.

One important point is that: the code I was able to use and/or improve from the book is not a full compiler, and (as consequence of this) it does not have any optimizations.

I would like to have an example compiler made with Flex and Bison that has all these characteristics. It does not need to be of the Simple language. It does not need (and should not) to be complex or big. But, as far as I have got until now, I could build one.

One point I tried to make and confirm with you is that some tasks happen in the parser "block", others in the scanner "block", which happens first. Why tokens for some of these symbols are not being tokenized? They should! (right?) I think it seems to bring a more clear division between each compiler phase, and so it is between some modifications I have made on my code copied from the book.

sundialsvcs 09-19-2016 01:06 PM

When specifying your grammar, I think that it is very important that you explicitly state the precedence of all operators, and left/right bindings and so on. "Don't rely on any defaults: say it."

I also strongly suggest that you :study: study :study: the code that is produced. Study the source-code of the actual parsing engine as it works through the data structures that are the output of the compiled grammar. It needs to "stop being a mystery, to stop being 'a black box' in your mind," as soon as possible.

dedec0 09-19-2016 01:54 PM

Quote:

Originally Posted by sundialsvcs (Post 5607352)
When specifying your grammar, I think that it is very important that you explicitly state the precedence of all operators, and left/right bindings and so on. "Don't rely on any defaults: say it."

I also strongly suggest that you :study: study :study: the code that is produced. Study the source-code of the actual parsing engine as it works through the data structures that are the output of the compiled grammar. It needs to "stop being a mystery, to stop being 'a black box' in your mind," as soon as possible.

Writing them all would be my choice, sundialsvcs. :)

Studying the generated code... seems good. I will try it sometime. For now, my aim is to get a full working flex+bison+cc compiler with some optimization for any not complicated language. I imagine that using it for more complex languages/tasks/optimizations can be smooth.

jpollard 09-19-2016 02:44 PM

The book itself is VERY incomplete - basically unusable.

Even the stack machine included is incomplete (it doesn't detect stack overflow/underflow for one, missing PC out of range, memory range limits too).

The book LOOKS like a draft copy, not a final copy.

1. The problems you have seen are due to poor choice of grammar. It is not necessary (though sometimes simpler) to use %left/%right anywhere - but you do have to be careful with the grammar rules.

2. The incorrect use of some terminology... It is almost like this was being translated from something done earlier, but hasn't yet been polished.

3. Missing examples - Optimization is an important section - yet it is nearly empty other than a little introductory text. Completely left out is algebraic optimization, a little constant folding (barely mentioned)...

4. code generation is mostly skipped.

5. Unspecified references for "further reading"...

You get better examples from the "Lex and Yacc" book, though they won't go into optimization (which is a rather complex subject on its own).

dedec0 09-19-2016 02:52 PM

Quote:

Originally Posted by jpollard (Post 5607382)
The book itself is VERY incomplete.

Even the stack machine included is incomplete (it doesn't detect stack overflow or underflow for one).

The book LOOKS like a draft copy, not a final copy.

I agree with you, but I probably cannot see it as clearly as you (because I am trying to learn from it).

I have done some work to produce the files I showed here, with things until chapter 4. If they are close to being a full compiler, I would like the help of people that are reading this thread. If they are not so close, please point that detail, so I do not expend more time with this idea.

jpollard 09-19-2016 03:08 PM

The basic problem is that you need a decent book. This is not one of them.

You might try http://gnuu.org/2009/09/18/writing-y...-toy-compiler/

I haven't gone through this - but the organization behind it do know what they are doing.


All times are GMT -5. The time now is 11:41 AM.