regular expressions

schmitta · 10-01-2013, 01:32 AM

How would I make a multi pass interpreter with flex and bison to pre process labels and forward branches as well as find all the variable declares ?

jpollard · 10-01-2013, 02:27 PM

You need to refer to a text book on compiler construction as that topic is fairly large.

But a VERY simple overview is:

1. The scanner+parser validate language input and generate a symbol table + parse tree (or some intermediate language) All loops are left as symbolic in the intermediate code.

2. Optimizing passes can go through the intermediate code to prune code not reached (constant true/false tests always do the same thing,so why generate the alternative, loops that can never be executed are the same...) Register usage optimization, tail recursion elimination, constant expression elimination... Redundant operand elimination -- a*b + a*c is usually slow and replaced by a *(b+a), which eliminates a multiply. Other things that can be done is tail recursion elimination (it really is just iteration, which is faster and reduces stack overhead).

3. During code generation (the last pass usually, but not necessarily) the code generated is assigned addresses - when a label is assigned an address, the back links in the symbol table provide all locations that need the code adjusted with the real address. Note, this may involve additional changes as using things like short relative addresses may change the actual address of the label - requiring additional adjustments to be repeated. It is even possible to enter infinite loops here - where changing an address causes other locations to need long addresses - which again change the location. So some optimization of what and where long/short/relative addressing is to be used becomes important.

schmitta · 10-05-2013, 12:47 AM

What I was asking is how to do multiple passes with bison flex? Use BEGIN var to turn on and off certain areas of the grammar? Also I am getting 1 shift/reduce warning (error?). It is for :

Quote:

statement : IF exp THEN statement
| IF exp THEN statement ELSE statement
;

I enabled %glr-parser but am not sure if this will take care of the problem.

jpollard · 10-05-2013, 07:37 AM

Quote:

Originally Posted by schmitta

What I was asking is how to do multiple passes with bison flex? Use BEGIN var to turn on and off certain areas of the grammar?

You don't "turn off" parts of the grammar - that is a capability for preprocessing the input, though it can be done by a scanner (when it recognizes a disabled section it skips input processing until it recognizes a "turn on" again...) but it makes the scanner much harder to write - this is under "conditional compilation", and it usually means that you can't use flex for the scanner, and you have to hand code one. It can be done - but makes the action part of the rule recognizing the "turn off" symbol have to scan for the "turn on" symbol using yet another scanner...
I don't think the scanner is designed to be invoked recursively, at least not those generated by flex...

Neither flex, nor bison are good at two passes... that isn't what they were designed for. Now flex CAN be made to do two file passes (it is a simple scanner) as it does not maintain internal state (I believe seeking to the beginning of the input file will do - but now you also have to track how many times you reach the end of file or you have an infinite loop). But bison DOES maintain internal state.

The goal of bison is to either directly interpret the parse tree, or generate an output file... When the scanner reports an end of file to the parser, the parsing is supposed to be completed --- unless the internal state still has elements on the stack... in which case you have a parse error (unexpected end-of-file).

Now that said, you can accomplish the equivalent using something like

Code:

program:
    pass1 pass2;

pass1: statements EOF1

pass2: statements EOF2

where EOF1 is the token returned at the end of file (the first time)
and EOF2 is the token returned at the second end of file.

It also allows you to have to different action rules instead of one. This can be used to do an expected grammar check during pass1, generating errors and such that abort performing the "pass2" parsing, and have pass2 with different action rules.

It is awkward, and you HAVE to make sure the grammar specification is identical. It doubles the size of the parser... And all it really does is force double the work, as the action rules of the first part can include those of the second.

Quote:

Also I am getting 1 shift/reduce warning (error?). It is for :

I enabled %glr-parser but am not sure if this will take care of the problem.

your sample grammar is ambiguous - it allows a construct like

Code:

IF expr THEN IF expr THEN somestatement ELSE somestatement

The shift reduce error is cause by the fact that bison is a LR(1) parser (with extensions). In the case of your warning it is because the stack has two ways to continue - associating the ELSE with the first if, or the second. It is listed as a warning because bison has been coded to take the shortest match, which would associate the ELSE with the second IF, which is the most frequent meaning.
As describe in the documentation for bison:
http://www.gnu.org/software/bison/ma...02fReduce.html

You fix it as shown in
http://www.gnu.org/software/bison/ma...#Non-Operators

which forces a particular choice.

It can also be solved by having a "if...endif" construct to force the association as well, where the ELSE part is either null or an else clause... the grammar for that would look something like:

Code:

ifstatement:
 IF expr ifblock ENDIF

ifblock: statement elseblock

elseblock:             /* note null else block */
   | ELSE statement

But don't take the above as gospel - it has been a long time since I've done this. The only reason this should work is that the IF is terminated by an ENDIF which marks how the shift/reduce is to be processed.

schmitta · 10-05-2013, 10:33 PM

Thanks jpollard for your expert help. I made it:

Code:

statement: IF exp THEN stmts ENDIF
         | IF exp THEN stmts ELSE stmts ENDIF
         ;

Would that work? Should I separate the ELSE statement into a separate production to make the generated C code easier to specify? Thanks. Alvin....

schmitta · 10-05-2013, 10:49 PM

How do you handle a forward branch in say a

Code:

 WHILE exp stmts WEND

statement without using two pass compiler to find the WEND statement on the first pass, for the branch when exp is not true? Would it be better form to have

Code:

 WHILE ( exp ) stmts WEND

? Thanks. Alvin...

jpollard · 10-06-2013, 10:47 PM

Quote:

Originally Posted by schmitta

Thanks jpollard for your expert help. I made it:

Code:

statement: IF exp THEN stmts ENDIF
         | IF exp THEN stmts ELSE stmts ENDIF
         ;

Would that work? Should I separate the ELSE statement into a separate production to make the generated C code easier to specify? Thanks. Alvin....

It should work... but it puts double the work for you on the actions for the rules.

jpollard · 10-06-2013, 11:07 PM

Quote:

Originally Posted by schmitta

How do you handle a forward branch in say a

Code:

 WHILE exp stmts WEND

statement without using two pass compiler to find the WEND statement on the first pass, for the branch when exp is not true? Would it be better form to have

Code:

 WHILE ( exp ) stmts WEND

? Thanks. Alvin...

I keep telling you that the second pass doesn't work on the original source. It works on intermediate code generated during the first pass by the actions associated with the rule.

You really need to refer to a textbook for this level of detail.

schmitta · 10-07-2013, 10:33 AM

But if I am interpreting the code then I have the original in core in order to process it on a second pass. To write a two pass algorithm would it not be best to come up with two different flex/bison routines each with its own unique yyparse() (yyparse1,yyparse2) and call them from a main routine with global symbol table and variable recording routines using different (needed) grammar for each? yyparse1 for first pass for forward branch labels, while wend statements and variable declare statements with yypass2 processing the program with the now known values? A kludge I know but a way to accomplish it.

jpollard · 10-07-2013, 10:52 AM

Only if you want to bloat the translator.

Having two parsers in memory takes twice the space... and makes it harder to fix errors.

In a single pass, the rule

Code:

...: WHILE expr statements WEND  {
	    $$.value = generate_while($2.value, $3.value);
};

This simply adds to the parse tree (represented by $$.value), by creating a
(dynamic) structure with 3 elements - the entry is identified as a while, a
pointer to the parse tree for the expression (represented by $2.value), and a
pointer to the parse tree for the statements (represented by the $3.value).

No labels (forward or backward) required AT THIS TIME.

What is this entry? Something like:

Code:

struct entry {
    unsigned char	entry_type;
    struct entry	*left;
    struct entry	*center;
    struct entry	*right;
}

What are the values for entry_type? A number representing the significant
elments of the language: STATEMENT, IF, WHILE, FUNCTION, ASSIGN, VARIABLE, ADD, SUB,
MUL, DIV, EXP, ...

These are the junction points of the grammar. Each entry only points to other
entries. The end result is that a complete scan of the source code is translated
into one or more elements in this tree.

An expression like (a + b(c+d)) gets converted into a tree with elements like:

Code:

{add, left, null, right}
        |          |
        |          {MUL, left, null, right}
        |                  |          |
        |                  |          {ADD, left, null, right}
        |                  |                |             |
        |                  |                |             {VARIABLE, ptr to d,null}
        |                  |                VARIABLE, ptr to c, null}
        |                  {VARIABLE, ptr to b, null}
        {VARIABLE, pointer to symboltable entry a, null}

A while statement gets something similar:

Code:

{while, left, null, right}
         |            |
         |            {points to a list of statements}
         a tree for an expression

An if statement looks a little bit different:

Code:

{if, left, center, right}
       |      |       |
       |      |       {pointer to a list of statements}
       |      {pointer to a list of statements}
       {pointer to an expression tree }

So what would a list of statements look like?:

Code:

{STATEMENT, ptr to a single statement, null, ptr to another STATEMENT entry}

A "ptr to a single statement" can be one of ASSIGN, IF, WHILE, or FUNCTION
The last STATEMENT entry has a null for "ptr to another STATEMENT entry".

So now we have a parse tree that represents the program.

First pass done.

The SECOND PASS can be either a code generator, or an interpreter. If an
interpreter, no labels are needed. The interpreter only does what each element
of the tree specifies. The interpreter just walks the tree and invokes
functions for each entry type identified for each node in the tree.

If the second pass is a code generator, THEN AND ONLY THEN are labels necessary.
But it works ALMOST the same as the interpreter. The only difference is that
instead of evaluating the statements, it outputs an assembly/machine code.

A translator for the IF entry would be something like:

Code:

void trans_if(struct entry *left, struct entry *center, struct entry *right)
{
    char *label1, *label2;

    gencode(left);                /* the expression */
    label1 = newlabel();
    printf("    JZ   %s\n",label1);
    gencode(center);              /* the true part */
    if (NULL != right) {          /* the else part (if it exists)
        label2 = newlabel();
        printf("    JMP    %s\n",label2);
        printf("%s:\n",label1);
        gencode(right);
        printf("%s:\n",label2);
    } else {
        printf("%s\n:",label1);
    }
}

How about the while? Very similar:

Code:

void trans_while(struct entry *left, struct entry *right)
{
    char *label1, *label2;

    label1 = newlabel();
    printf("%s:\n",label1);	     /* this is where the while repeats */
    gencode(left);                   /* the expression again */
    label2 = newlabel();
    printf("    JZ  %s\n",label2);   /* how to get out of the loop */
    gencode(right);                  /* the list of statements */
    printf("    JMP   %s\n",label1); /* and this repeats */
    printf("%s:\n",label2);
}

All gencode does is walk the tree of statements...invoking the translator for
the current node - that is why it is recursive - the code generated for an IF
needs code generated for the expression... so the trans_if function just
invokes the gencode function to generate code for a particular branch of
the tree.

The only caution in the "newlabel" use is to be sure to get the value of the
generated label before it is needed.

Now, using the above as an example creates a three pass compiler. The FIRST
PASS does the syntax checking, error handling, and creates a parse tree.

The SECOND PASS generats an assembly output...

So a THIRD PASS would be the assember that translate the assembly code into machine code...

Please note - none of this has been tested.

schmitta · 10-08-2013, 12:03 AM

Jpollard. Thank you for your most helpful explanation. I greatly appreciate it. Alvin...

schmitta · 10-08-2013, 12:28 AM

I am not sure I will be able to trust recursion as my mcu has a return stack only 16 levels deep. Some other way of recursion would have to be found but I will ask the compiler maker what he thinks should be done.

jpollard · 10-08-2013, 07:17 AM

Quote:

Originally Posted by schmitta

I am not sure I will be able to trust recursion as my mcu has a return stack only 16 levels deep. Some other way of recursion would have to be found but I will ask the compiler maker what he thinks should be done.

Which is why the code generator doesn't run on it.

The sample is a translator to assembler - much smaller, and the resulting code generated doesn't require recursion.

All interpreters (with subroutines) need recursion. Now one way to get around the hardware limitation is to define a pcode interpreter. These don't necessarily require recursion, but do need subroutines. Then the pcode interpreter can implement a stack however it wants. The programs that run on the pcode interpreter (also called a virtual machine) can then use recursion with a much less limited stack.

The problem you have listed is the limited physical memory. It is possible to get a pcode interpreter there - but the language must aimed at the machine. That was why I tried pointing you to the language fourth - it was designed for minimal overhead, AND simple parsing (it is, a very simple stack machine, and uses Reverse Polish Notation to simplify the parsing - almost to the point of just being a scanner). Plus it has the advantage that source code is already available. It is only the conversion of the interpreter to the target machine that is necessary.

Using lex/bison is not what one uses to create compact programs - they aren't. They are designed to make implementing compilers easier. The parsers generated are huge, and require a lot of stack space due to the supporting functions required.

schmitta · 10-10-2013, 02:10 AM

1)Are you saying that the parser should be run on the PC and the entry structure:

Code:

struct entry {
    unsigned char	entry_type;
    struct entry	*left;
    struct entry	*center;
    struct entry	*right;
}

downloaded to the mcu to be executed by a c program on the mcu?

2)Excuse me for being dense but is the entry structure generated on the fly and how is that done?

3)Can bison be called line by line by an external program and still keep its internal data structure intact?

jpollard · 10-10-2013, 04:43 AM

Quote:

Originally Posted by schmitta

1)Are you saying that the parser should be run on the PC and the entry structure:

Code:

struct entry {
    unsigned char	entry_type;
    struct entry	*left;
    struct entry	*center;
    struct entry	*right;
}

downloaded to the mcu to be executed by a c program on the mcu?

No. The dynamic structure is part of the parser.

Quote:

2)Excuse me for being dense but is the entry structure generated on the fly and how is that done?

The tree is generated dynamically and is composed of nodes allocated during parsing.

Quote:

3)Can bison be called line by line by an external program and still keep its internal data structure intact?

no. It is an LALR(1) parser generator. The code generated interprets an entire grammar. Once it is finished with the grammar the only "state" remaining is any dynamic structure created during parsing. It is NOT a line-by-line translator.

Again, that is why I pointed you at a minimalist language requiring only simple parsing/scanning such as Forth (http://en.wikipedia.org/wiki/Forth_%...ng_language%29)