LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   regular expressions (https://www.linuxquestions.org/questions/linux-newbie-8/regular-expressions-4175478394/)

schmitta 09-24-2013 03:30 PM

regular expressions
 
Is there a program into which you enter a regular expression and it tells you in English what it does?

sycamorex 09-24-2013 03:58 PM

Perhaps something like this would help:

http://regex101.com/
or

http://www.myezapp.com/apps/dev/regexp/show.ws

schmitta 09-25-2013 12:11 AM

Thanks sycamorex! The myezapp site did the trick and gave me what I wanted. They did not have flex, but the Java version seems to work identically. Thanks again! Alvin....

schmitta 09-25-2013 01:10 AM

I am trying to match a label that is defined as a letter, followed by zero to seven letters or numbers, followed by a colon. I would also like it to be the first item in the statement. I tried:

lines ::= labeldef statement
labeldef ::= [A-Za-z][A-Za-z0-9]{0,7}:

but it allows labels longer than 8 characters, matching only the last 8 characters before the colon plus the colon. How do I write the regular expression to allow only eight characters total and give an error for 9 or more characters (or for a label not starting with a letter)? Maybe I need to precede it with a null or get it to start at the first character on the line. Thank you. Alvin.... (Note: I am not in college doing this for a course but for my business.)

Firerat 09-25-2013 02:34 AM

it appears fine here, at least in bash

Code:

Check=""
for i in a {1..10};do
  Check=${Check%:}${i}:
  printf "$Check "
  [[ $Check =~ [A-Za-z][A-Za-z0-9]{0,7}: ]] \
    && echo True \
    || echo False
done

gets the following output

Code:

a: True
a1: True
a12: True
a123: True
a1234: True
a12345: True
a123456: True
a1234567: True
a12345678: False
a123456789: False
a12345678910: False


pan64 09-25-2013 02:52 AM

starting means ^, so you would need to write: ^[A-Za-z][A-Za-z0-9]{0,7}:

Firerat 09-25-2013 03:11 AM

good point pan64 :)

Still, it should work without ^
Code:

Check="";for i in 0 {1..10};do  Check=${Check%:}${i}:;  printf "$Check ";  [[ $Check =~ [A-Za-z][A-Za-z0-9]{0,7}: ]]    && echo True    || echo False; done
results in all False
But 100% agree, ^ should be used, along with $ on the end
Code:

Check="";for i in z {1..10};do  Check=${Check%:}${i}:;  printf "$Check ";  [[ $Check =~ ^[A-Za-z][A-Za-z0-9]{0,7}:$ ]]    && echo True    || echo False; done

As I mentioned earlier, your regexp is working in bash.

Do you have sample code where it is not working?

schmitta 09-25-2013 02:59 PM

I was using the software at:

http://regex101.com/
or

http://www.myezapp.com/apps/dev/regexp/show.ws

to test with. The myezapp program shows the match with a colored bar. For "testabc0:" all was colored; for "testabc12:", only "stabc12:" was colored as a match. The label needs to be at the beginning of the line, so I will use the ^ first. Other tokens can follow, so I will leave off the $. I just tried it with http://regex101.com and now it seems to work correctly. I have ^[A-Za-z][A-Za-z0-9]{0,7}: which now seems to work, rejecting "testabc01:" and " testabc0:" (not first in the line). Thank you for your help. Do you know if flex will reject it and just pass the non-match through, or will it give me some way to flag it as an error? Thanks. Alvin...

jpollard 09-25-2013 03:34 PM

Quote:

Originally Posted by Firerat (Post 5034366)
good point pan64 :)

Still, it should work without ^

No, it should work exactly as shown: aaab0123456: would match "b0123456:"... even though it is preceded by the string aaa. Only by giving the ^ does it specify that the match starts at the beginning of the string.

One other note - it is part of a flex scanner/tokenizer. Now specifying the ^ will identify it as valid, but usually this would be counted as a SEMANTIC error, rather than a syntax error. Leaving the ^ off would allow the action part to make more detailed analysis and determine the difference between a valid label, and a label that is too long, and thus provide better error diagnostics for the user to be able to make corrections faster, and more accurately.

Habitual 09-25-2013 06:00 PM

also http://www.gskinner.com/RegExr/

schmitta 09-26-2013 08:22 PM

Should I leave the ^ in or out? I changed mine to ^[A-Za-z][A-Za-z0-9]{0,7}[ \t\n] to catch a blank or tab between the labeldef and the next token on the line, or just the labeldef alone on the line. But how will I catch a label that is too long, as it will probably just pass through if flex works as I think it does?

jpollard 09-27-2013 03:45 AM

It depends on the grammar. Don't forget that a scanner is only supposed to recognize tokens. Whether the grammar uses white space as a token or just as a separator makes a difference.

The scanner can easily consume white space (if it isn't significant) with a very simple rule. For instance, if the grammar specifies a label as:

Code:

label : symbol ':' {whatever to do with a label}
      .

Then the scanner only has to identify what a symbol is. The length of a symbol is not really relevant to the grammar. The action part of the grammar can look at the length, decide if it is too long, and report an error (label too long) that is specific to the label.

If all symbols are limited, then the scanner can identify the error, but still not abort scanning - translators work best by identifying as many errors as possible, not terminating on the first one.

The scanner could identify the label with:
Code:

id    [A-Za-z]{1,}[0-9]*
ws    [ \t]
nl    \n
coln  ":"
%%
{id}    { return ID; }
{coln}  { return COLON; }
.....   /* other tokens */
{ws}    ;   /* discard */
{nl}    { linecount++; }

Now the whitespace is discarded - but the newline is checked for to maintain a line count for error messages. This allows the grammar to identify whether something is a label or not with:

Code:

label:  ID COLON    { if ($1.length > 8) {
                            /* print error message with linecount */
                            errorcount++;
                      }
                      /* do other stuff with label - update symbol table... */
                     
                      }

It really depends on how you decide to handle things, and how complex the grammar is. Scanners generated by flex are OK - though sometimes they are not terribly clear, and sometimes it is easier to create one by hand (especially simple ones).
Flex is good to use when speed of implementation is more important, but it does assume you are already familiar with what scanners are used for and how, and with interfacing to bison (as in generating the appropriate include files...).

They are supposed to make it easier for the parser to handle things and separate semantics from tokenizing - and help make error messages more meaningful and parsing recovery possible. The simplest error handling is to abort on the first error - but that makes USING the application/translator/... much harder as you have to keep re-running the application just to find the next error.

schmitta 09-28-2013 01:49 PM

Thanks jpollard. I really appreciate the extra effort you took to make those points. Alvin...

schmitta 09-30-2013 04:14 PM

I am writing a BASIC interpreter. I was going to write the BNF and run it through flex and bison, but I am not sure they would be appropriate for writing an interpreter. The original idea was to generate the interpreter with flex and bison and compile the C code to run on an MCU. The MCU has 170K words of flash and 56KB of RAM in a Harvard architecture. But I am considering using flex and bison to write a C program that would run on a PC and generate an intermediate pseudo-assembler language to be interpreted on the MCU. I am including DCL (declare) statements in the BASIC, which I would like to find on an initial pass through the code. Pass zero would also find forward branches and the WEND in WHILE/WEND statements. Can multiple passes be done with flex and bison, or is there a better way? The PC program would be written in Java so as to run under Windows, Mac and Linux systems. Any ideas you have would be greatly appreciated.

jpollard 09-30-2013 04:54 PM

The first pass of a compiler translates the source into something more amenable to analysis. This is the intermediate language that can be in one of many forms - a parse tree plus symbol table, or multiple parse trees and symbol tables (this is the one I'm most familiar with, but there are others). Consider the inclusion of subroutines/functions for instance. The ability to include "precompiled" parse trees (or whatever the intermediate language is) allows for global optimization of the code. After the optimization pass, a third pass can be made that generates the optimized assembler.

You might want to look into the LLVM/Clang compilers (http://llvm.org/) - they are designed for this.

Another back end (which I have only read about, not used) is what is used for Android - the Dalvik bytecode interpreter. It is supposed to provide an efficient interpreter with a minimum of actual code, and it allows a higher-level language to be converted into a smaller size (good). Though it is slower than native code (bad), it is MUCH easier to generate good code for... And it allows the code to run on anything the interpreter runs on (making it easier to develop for).

This is also the goal of the Forth language (http://www.forth.org/) - a very small interpreter, with an easily parsed language, to run on very small processors.

schmitta 10-01-2013 01:32 AM

How would I make a multi-pass interpreter with flex and bison to preprocess labels and forward branches, as well as find all the variable declares?

jpollard 10-01-2013 02:27 PM

You need to refer to a text book on compiler construction as that topic is fairly large.

But a VERY simple overview is:

1. The scanner+parser validate the language input and generate a symbol table + parse tree (or some intermediate language). All loops are left symbolic in the intermediate code.

2. Optimizing passes can go through the intermediate code to prune code that is never reached (constant true/false tests always do the same thing, so why generate the alternative; loops that can never be executed are the same...). Other optimizations include register usage optimization, constant expression elimination, and redundant operand elimination -- a*b + a*c is usually slow and is replaced by a*(b+c), which eliminates a multiply. Tail recursion can also be eliminated (it really is just iteration, which is faster and reduces stack overhead).

3. During code generation (the last pass usually, but not necessarily) the code generated is assigned addresses - when a label is assigned an address, the back links in the symbol table provide all locations that need the code adjusted with the real address. Note, this may involve additional changes as using things like short relative addresses may change the actual address of the label - requiring additional adjustments to be repeated. It is even possible to enter infinite loops here - where changing an address causes other locations to need long addresses - which again change the location. So some optimization of what and where long/short/relative addressing is to be used becomes important.

schmitta 10-05-2013 12:47 AM

What I was asking is how to do multiple passes with flex and bison? Use BEGIN var to turn certain areas of the grammar on and off? Also I am getting 1 shift/reduce warning (error?). It is for:
Quote:

statement : IF exp THEN statement
| IF exp THEN statement ELSE statement
;
I enabled %glr-parser but am not sure if this will take care of the problem.

jpollard 10-05-2013 07:37 AM

Quote:

Originally Posted by schmitta (Post 5040366)
What I was asking is how to do multiple passes with bison flex? Use BEGIN var to turn on and off certain areas of the grammar?

You don't "turn off" parts of the grammar - that is a capability for preprocessing the input. It can be done by a scanner (when it recognizes a disabled section, it skips input processing until it recognizes a "turn on" again...), but it makes the scanner much harder to write - this falls under "conditional compilation", and it usually means that you can't use flex for the scanner and have to hand-code one. It can be done - but it makes the action part of the rule recognizing the "turn off" symbol have to scan for the "turn on" symbol using yet another scanner...
I don't think the scanner is designed to be invoked recursively, at least not those generated by flex...

Neither flex, nor bison are good at two passes... that isn't what they were designed for. Now flex CAN be made to do two file passes (it is a simple scanner) as it does not maintain internal state (I believe seeking to the beginning of the input file will do - but now you also have to track how many times you reach the end of file or you have an infinite loop). But bison DOES maintain internal state.

The goal of bison is to either directly interpret the parse tree, or generate an output file... When the scanner reports an end of file to the parser, the parsing is supposed to be completed --- unless the internal state still has elements on the stack... in which case you have a parse error (unexpected end-of-file).

Now that said, you can accomplish the equivalent using something like

Code:


program:
    pass1 pass2;

pass1: statements EOF1;

pass2: statements EOF2;

where EOF1 is the token returned at the end of file (the first time)
and EOF2 is the token returned at the second end of file.

It also allows you to have two different sets of action rules instead of one. This can be used to do a grammar check during pass1, generating errors and such that abort performing the "pass2" parsing, with pass2 having different action rules.

It is awkward, and you HAVE to make sure the grammar specification is identical. It doubles the size of the parser... And all it really does is force double the work, as the action rules of the first part can include those of the second.

Quote:

Also I am getting 1 shift/reduce warning (error?). It is for :

I enabled %glr-parser but am not sure if this will take care of the problem.
Your sample grammar is ambiguous - it allows a construct like
Code:

IF expr THEN IF expr THEN somestatement ELSE somestatement
The shift/reduce warning is caused by the fact that bison is an LR(1) parser (with extensions). In the case of your warning it is because the parser has two ways to continue - associating the ELSE with the first IF, or with the second. It is listed as a warning because bison has been coded to resolve the conflict by shifting, which associates the ELSE with the second IF - the most frequent meaning.
As described in the documentation for bison:
http://www.gnu.org/software/bison/ma...02fReduce.html

You fix it as shown in
http://www.gnu.org/software/bison/ma...#Non-Operators

which forces a particular choice.

It can also be solved by having an "if...endif" construct to force the association, where the ELSE part is either null or an else clause... The grammar for that would look something like:

Code:

ifstatement:
 IF expr ifblock ENDIF

ifblock: statement elseblock

elseblock:            /* note null else block */
  | ELSE statement

But don't take the above as gospel - it has been a long time since I've done this. The only reason this should work is that the IF is terminated by an ENDIF which marks how the shift/reduce is to be processed.

schmitta 10-05-2013 10:33 PM

Thanks jpollard for your expert help. I made it:
Code:


statement: IF exp THEN stmts ENDIF
        | IF exp THEN stmts ELSE stmts ENDIF
        ;

Would that work? Should I separate the ELSE statement into a separate production to make the generated C code easier to specify? Thanks. Alvin....

schmitta 10-05-2013 10:49 PM

How do you handle a forward branch in say a
Code:

WHILE exp stmts WEND
statement, without using a two-pass compiler to find the WEND statement on the first pass (for the branch when exp is not true)? Would it be better form to have
Code:

WHILE ( exp ) stmts WEND
? Thanks. Alvin...

jpollard 10-06-2013 10:47 PM

Quote:

Originally Posted by schmitta (Post 5040770)
Thanks jpollard for your expert help. I made it:
Code:


statement: IF exp THEN stmts ENDIF
        | IF exp THEN stmts ELSE stmts ENDIF
        ;

Would that work? Should I separate the ELSE statement into a separate production to make the generated C code easier to specify? Thanks. Alvin....

It should work... but it doubles the work you have to put into the actions for the rules.

jpollard 10-06-2013 11:07 PM

Quote:

Originally Posted by schmitta (Post 5040772)
How do you handle a forward branch in say a
Code:

WHILE exp stmts WEND
statement without using two pass compiler to find the WEND statement on the first pass, for the branch when exp is not true? Would it be better form to have
Code:

WHILE ( exp ) stmts WEND
? Thanks. Alvin...

I keep telling you that the second pass doesn't work on the original source. It works on intermediate code generated during the first pass by the actions associated with the rule.

You really need to refer to a textbook for this level of detail.

schmitta 10-07-2013 10:33 AM

But if I am interpreting the code, then I have the original in core in order to process it on a second pass. To write a two-pass algorithm, would it not be best to come up with two different flex/bison routines, each with its own unique yyparse() (yyparse1, yyparse2), and call them from a main routine with global symbol table and variable recording routines, using different grammar for each as needed? yyparse1 would do a first pass for forward branch labels, WHILE/WEND statements and variable declare statements, with yyparse2 processing the program with the now-known values? A kludge, I know, but a way to accomplish it.

jpollard 10-07-2013 10:52 AM

Only if you want to bloat the translator.

Having two parsers in memory takes twice the space... and makes it harder to fix errors.

In a single pass, the rule
Code:

...: WHILE expr statements WEND  {
            $$.value = generate_while($2.value, $3.value);
};

This simply adds to the parse tree (represented by $$.value), by creating a
(dynamic) structure with 3 elements - the entry is identified as a while, a
pointer to the parse tree for the expression (represented by $2.value), and a
pointer to the parse tree for the statements (represented by the $3.value).


No labels (forward or backward) required AT THIS TIME.

What is this entry? Something like:

Code:

struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};

What are the values for entry_type? A number representing the significant
elements of the language: STATEMENT, IF, WHILE, FUNCTION, ASSIGN, VARIABLE, ADD, SUB,
MUL, DIV, EXP, ...


These are the junction points of the grammar. Each entry only points to other
entries. The end result is that a complete scan of the source code is translated
into one or more elements in this tree.

An expression like (a + b*(c+d)) gets converted into a tree with elements like:
Code:

{ADD, left, null, right}
        |          |
        |          {MUL, left, null, right}
        |                  |          |
        |                  |          {ADD, left, null, right}
        |                  |                |            |
        |                  |                |            {VARIABLE, ptr to d, null}
        |                  |                {VARIABLE, ptr to c, null}
        |                  {VARIABLE, ptr to b, null}
        {VARIABLE, ptr to symbol table entry for a, null}

A while statement gets something similar:

Code:

{while, left, null, right}
        |            |
        |            {points to a list of statements}
        a tree for an expression

An if statement looks a little bit different:

Code:

{if, left, center, right}
      |      |      |
      |      |      {pointer to a list of statements}
      |      {pointer to a list of statements}
      {pointer to an expression tree }

So what would a list of statements look like?:

Code:

{STATEMENT, ptr to a single statement, null, ptr to another STATEMENT entry}
A "ptr to a single statement" can be one of ASSIGN, IF, WHILE, or FUNCTION
The last STATEMENT entry has a null for "ptr to another STATEMENT entry".

So now we have a parse tree that represents the program.

First pass done.

The SECOND PASS can be either a code generator, or an interpreter. If an
interpreter, no labels are needed. The interpreter only does what each element
of the tree specifies. The interpreter just walks the tree and invokes
functions for each entry type identified for each node in the tree.

If the second pass is a code generator, THEN AND ONLY THEN are labels necessary.
But it works ALMOST the same as the interpreter. The only difference is that
instead of evaluating the statements, it outputs an assembly/machine code.

A translator for the IF entry would be something like:

Code:

void trans_if(struct entry *left, struct entry *center, struct entry *right)
{
    char *label1, *label2;

    gencode(left);                /* the expression */
    label1 = newlabel();
    printf("    JZ  %s\n",label1);
    gencode(center);              /* the true part */
    if (NULL != right) {          /* the else part (if it exists) */
        label2 = newlabel();
        printf("    JMP    %s\n",label2);
        printf("%s:\n",label1);
        gencode(right);
        printf("%s:\n",label2);
    } else {
        printf("%s:\n",label1);
    }
}

How about the while? Very similar:

Code:

void trans_while(struct entry *left, struct entry *right)
{
    char *label1, *label2;

    label1 = newlabel();
    printf("%s:\n",label1);            /* this is where the while repeats */
    gencode(left);                  /* the expression again */
    label2 = newlabel();
    printf("    JZ  %s\n",label2);  /* how to get out of the loop */
    gencode(right);                  /* the list of statements */
    printf("    JMP  %s\n",label1); /* and this repeats */
    printf("%s:\n",label2);
}

All gencode does is walk the tree of statements...invoking the translator for
the current node - that is why it is recursive - the code generated for an IF
needs code generated for the expression... so the trans_if function just
invokes the gencode function to generate code for a particular branch of
the tree.

The only caution in the "newlabel" use is to be sure to get the value of the
generated label before it is needed.

Now, using the above as an example creates a three pass compiler. The FIRST
PASS does the syntax checking, error handling, and creates a parse tree.

The SECOND PASS generates an assembly output...

So a THIRD PASS would be the assembler that translates the assembly code into machine code...

Please note - none of this has been tested.

schmitta 10-08-2013 12:03 AM

Jpollard. Thank you for your most helpful explanation. I greatly appreciate it. Alvin...

schmitta 10-08-2013 12:28 AM

I am not sure I will be able to trust recursion, as my MCU has a return stack only 16 levels deep. Some other way of handling recursion would have to be found, but I will ask the compiler maker what he thinks should be done.

jpollard 10-08-2013 07:17 AM

Quote:

Originally Posted by schmitta (Post 5041878)
I am not sure I will be able to trust recursion, as my MCU has a return stack only 16 levels deep. Some other way of handling recursion would have to be found, but I will ask the compiler maker what he thinks should be done.

Which is why the code generator doesn't run on it.

The sample is a translator to assembler - much smaller, and the resulting code generated doesn't require recursion.

All interpreters (with subroutines) need recursion. Now, one way to get around the hardware limitation is to define a pcode interpreter. These don't necessarily require recursion, but do need subroutines. The pcode interpreter can then implement a stack however it wants. The programs that run on the pcode interpreter (also called a virtual machine) can then use recursion with a much less limited stack.

The problem you have listed is the limited physical memory. It is possible to get a pcode interpreter there - but the language must be aimed at the machine. That is why I tried pointing you to the language Forth - it was designed for minimal overhead AND simple parsing (it is a very simple stack machine, and uses Reverse Polish Notation to simplify the parsing - almost to the point of just being a scanner). Plus it has the advantage that source code is already available. Only the conversion of the interpreter to the target machine is necessary.

lex/bison are not what one uses to create compact programs - the programs they generate aren't compact. They are designed to make implementing compilers easier. The parsers generated are huge, and require a lot of stack space due to the supporting functions required.

schmitta 10-10-2013 02:10 AM

1)Are you saying that the parser should be run on the PC and the entry structure:
Code:

struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};

downloaded to the mcu to be executed by a c program on the mcu?

2)Excuse me for being dense but is the entry structure generated on the fly and how is that done?

3)Can bison be called line by line by an external program and still keep its internal data structure intact?

jpollard 10-10-2013 04:43 AM

Quote:

Originally Posted by schmitta (Post 5043232)
1)Are you saying that the parser should be run on the PC and the entry structure:
Code:

struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};

downloaded to the mcu to be executed by a c program on the mcu?

No. The dynamic structure is part of the parser.
Quote:

2)Excuse me for being dense but is the entry structure generated on the fly and how is that done?
The tree is generated dynamically and is composed of nodes allocated during parsing.
Quote:

3)Can bison be called line by line by an external program and still keep its internal data structure intact?
No. It is an LALR(1) parser generator. The code generated interprets an entire grammar. Once it is finished with the grammar, the only "state" remaining is any dynamic structure created during parsing. It is NOT a line-by-line translator.

Again, that is why I pointed you at a minimalist language requiring only simple parsing/scanning such as Forth (http://en.wikipedia.org/wiki/Forth_%...ng_language%29)

schmitta 10-10-2013 06:42 PM

Please bear with me.


1)
Quote:

Can bison be called line by line by an external program and still keep its internal data structure intact?
No. It is an LALR(1) parser generator. The code generated interprets an entire grammar. Once it is finished with the grammar, the only "state" remaining is any dynamic structure created during parsing. It is NOT a line-by-line translator.
Would the dynamic structure disappear after yyparse() was exited or could it be made to stay around?

2) For the following:
Quote:

In a single pass, the rule
Code:
...: WHILE expr statements WEND {
$$.value = generate_while($2.value, $3.value);
};
This simply adds to the parse tree (represented by $$.value), by creating a
(dynamic) structure with 3 elements - the entry is identified as a while, a
pointer to the parse tree for the expression (represented by $2.value), and a
pointer to the parse tree for the statements (represented by the $3.value).


No labels (forward or backward) required AT THIS TIME.

What is this entry? Something like:

Code:

struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};

What are the values for entry_type? A number representing the significant
elements of the language: STATEMENT, IF, WHILE, FUNCTION, ASSIGN, VARIABLE, ADD, SUB,
MUL, DIV, EXP, ...
For:
Code:

$$.value = generate_while($2.value, $3.value);
does "generate_while" create and fill STRUCT ENTRY's left, center and right elements and then assign the structure's address to $$.value? Does it have to create a new instance of STRUCT ENTRY? If so how? With malloc?

jpollard 10-10-2013 08:48 PM

Quote:

Originally Posted by schmitta (Post 5043635)
Please bear with me.

Not a problem. I've always liked parsers, and have used them to create small languages for configuration files (things like menu buttons, rules for network access, even one that allowed a program to be morphed into any program desired).

Quote:

1)

Would the dynamic structure disappear after yyparse() was exited or could it be made to stay around?
It would stay around, because the value of the $$ parse variable should have been assigned to a static pointer variable as the last thing handled. Since the parse tree is dynamically allocated, it is external to the parse state structure (a stack). That is why the definition of YYSTYPE is made to make the parse stack an array of "struct entry *". The internal state is maintained by that stack. The top-level rule can then save the entire tree by saving the value of the $$ parse value - it is the last entry remaining on the stack. If the value is NOT saved, you have a memory leak - as the entire program (as represented by the tree) is lost.
Quote:

2) For the following:


For:
Code:

$$.value = generate_while($2.value, $3.value);
does "generate_while" create and fill STRUCT ENTRY's left, center and right elements and then assign the structure's address to $$.value? Does it have to create a new instance of STRUCT ENTRY? If so how? With malloc?
Yes. I was working from your description of your target:
Quote:

The MCU has 170K words of flash and 56KB of RAM in a Harvard architecture. But I am considering using flex and bison to write a C program that would run on a PC and generate an intermediate pseudo-assembler language to be interpreted on the MCU.
You don't have the available memory to hold a parser in your target. Instead, what you describe is a translator, for which I have been providing a bit of description - even how it can generate the pseudo-assembly.

A PC doesn't have the same limitations. One thing that will be necessary though, is to have a solid definition of what the pseudo assembly language is. Even to the point of having a "simulator" to evaluate the results of your translator (it would help tremendously in debugging).

I do apologize for not being fully consistent. The $$ variable is actually a struct entry *. I made an error using $$.value. The production rule for a while should be:

Code:

...: WHILE expr statements WEND  {
            $$ = generate_while($2, $3);
        };

And the generate_while function would then be:

Code:

struct entry *generate_while(
    struct entry *expr,
    struct entry *statements)
{
    struct entry *while_stmnt = malloc(sizeof(struct entry));

    while_stmnt->entry_type = WHILE;
    while_stmnt->left = expr;
    while_stmnt->center = NULL;
    while_stmnt->right = statements;
    return(while_stmnt);
}

As usual, this isn't quite complete - there are no checks for malloc errors :) It also points up another possible area of confusion - the symbol WHILE could be confused with the token. Perhaps I should have used some other naming style, like WHILE_t (read as "WHILE type"), to make it clear it isn't something else.

schmitta 10-10-2013 10:10 PM

Thanks Jpollard you are a real life saver!

schmitta 10-10-2013 11:44 PM

Would YYSTYPE be defined at the top of the rules.y bison file as:
Code:

%{
struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};
#define YYSTYPE struct entry *
%}


schmitta 10-11-2013 01:22 AM

Would there in fact be a need for two or more structure types one for
Code:

struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};

for non terminal (left hand side) elements and then a structure that held the values of the terminal elements like a decimal number or a character or array value?

jpollard 10-11-2013 05:02 AM

Quote:

Originally Posted by schmitta (Post 5043802)
Would there in fact be a need for two or more structure types one for
Code:

struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};

for non terminal (left hand side) elements and then a structure that held the values of the terminal elements like a decimal number or a character or array value?

I think the answer is yes.

Characters, arrays, and integers are all elements of the underlying language. They get an entry_type value just like variables and statements do. One way to handle an array is to treat indexing just like an operator (add/sub/mul ...), only the entry type would be an INDEX_t. I'm assuming, for the time being, the grammar allows for multiple dimensions.

As a bit of expansion, an array (declared as A[M,N], where M,N are the sizes of the corresponding dimensions - actual allocation calls for a block of memory of M*N elements) would have an index operation done for a use of A[x,y],
such that A[x,y] is equivalent to A[x*N+y] (which happens to be why the original BASIC only had a single dimension for an array). This works for arrays of strings just as well as for integers (using an array of pointers to the string values), and does assume the symbol table entry for A indicates the data type.
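The A[x*N+y] arithmetic can be checked against what C itself does with a two-dimensional array - a throwaway sketch, not part of the translator:

```c
/* row-major flattening: for an array declared A[M,N], element
   A[x,y] lives at offset x*N + y in a single block of M*N cells */
static int flat_index(int x, int y, int N)
{
    return x * N + y;
}
```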

Please note - there is a bit of vagueness here. It is entirely possible to just have a tree {INDEX_t, NULL, ptr to var x, NULL, ptr to expression list} and, during the interpretation/code generation, use the parent entry {VARIABLE_t, ptr to symbol table entry for variable A, ptr to index tree}, extracting the M,N values from the symbol table as needed. If the "ptr to index tree" is NULL, then there are no subscripts involved. Note: "ptr to var x/y" could be expression subtrees.

There are several ways to represent the indexing in the parse tree. In the above, I'm allowing for "expression list", which can be used for either indexing (for example "A[x,y]") or for function parameter lists (for example "function_name(x,y)").

HOW the list is used is up to the parent entry structure. An array reference A[x,y] would be broken down into a tree like {VARIABLE_t, ptr to symbol table entry for A, NULL, {INDEX_t, NULL, NULL, ptr to list}}, where ptr to list would be a tree representing the expression list, like {EXPR_t, ptr to expression subtree, NULL, ptr to next EXPR_t entry}, which allows for evaluation during code generation. This directly corresponds to a statement list (same type of BNF rules). It also happens to allow for as many dimensions as you want, just as it allows for as many function parameters as you want.

In the case of a function, the tree would look almost the same - the difference is at the top of the tree. Instead of a VARIABLE_t at the beginning, you have FUNCTION_t, and instead of INDEX_t, you have PARAM_t (parameter entry type). If the function call has no parameters then the pointer would be NULL. You could even have no syntactic difference between an array and a function (provided the function/array definition appears in the program source code BEFORE its use). This would make program constructs like "A(x,y)" be either an array or a function call, as determined by the symbol table entry for A. If A is an array, then it is indexing, if A is a function, then it is a parameter list.

Personally (my opinion) it is easier to read code that does make a syntax difference. Visually the code "A(x,y)" would be a function call, and "A[x,y]" is an array without needing to hunt for the declaration of A to find out which it is.
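As a concrete (and purely illustrative) sketch of the tree shape described above - the type names and the node() helper are assumptions, not fixed names, and in a real translator the VARIABLE_t node's first slot would point into the symbol table rather than at another entry:

```c
#include <stdlib.h>

/* hypothetical entry-type codes - illustrative names only */
enum { VARIABLE_t = 1, INDEX_t, EXPR_t };

struct entry {
    unsigned char entry_type;
    struct entry *left;
    struct entry *center;
    struct entry *right;
};

static struct entry *node(unsigned char t, struct entry *l,
                          struct entry *c, struct entry *r)
{
    struct entry *e = calloc(1, sizeof *e);
    if (e == NULL)
        abort();
    e->entry_type = t;
    e->left = l;
    e->center = c;
    e->right = r;
    return e;
}

/* A[x,y] -> {VARIABLE_t, A, NULL,
                {INDEX_t, NULL, NULL,
                  {EXPR_t, x, NULL, {EXPR_t, y, NULL, NULL}}}} */
struct entry *array_ref(struct entry *sym_A,
                        struct entry *x, struct entry *y)
{
    struct entry *list = node(EXPR_t, x, NULL,
                              node(EXPR_t, y, NULL, NULL));
    return node(VARIABLE_t, sym_A, NULL,
                node(INDEX_t, NULL, NULL, list));
}
```

A third subscript would just be one more EXPR_t link on the list, which is the point of representing indexes the same way as parameter lists.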

schmitta 10-11-2013 11:11 AM

Would YYSTYPE be defined at the top of the rules.y bison file as:
Code:

%{
struct entry {
    unsigned char        entry_type;
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
};
#define YYSTYPE struct entry *
%}


jpollard 10-11-2013 12:44 PM

I believe it needs to also be included in the scanner so that the scanner can generate some entries for it. I tend to put it in an include file that can then be included in both (this is mostly so the scanner can add an identifier to a symbol table, and return an entry structure that connects the parse tree with the symbol table entry).

Note - the structure as it stands is not quite complete. It has enough for things within the parse tree, but things like "ptr to variable" usually mean pointing into a symbol table (which we haven't discussed yet). That is still a pointer, but a variable needs a different structure - one that accounts for initialization, number and size of dimensions (if any), and data type (at a minimum, integer and string). Then there is debugging information (what source code line it was defined on, whether a value has ever been assigned...) as well as how function parameters are handled (you wouldn't want a function name used as a variable...).
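A shared include file along those lines might look like this - the file and function names are placeholders, not anything agreed on above:

```c
/* parse_tree.h - included by both the flex scanner and the bison
   grammar so they agree on the tree node type; names are illustrative */
#ifndef PARSE_TREE_H
#define PARSE_TREE_H

struct entry {
    unsigned char  entry_type;
    struct entry  *left;
    struct entry  *center;
    struct entry  *right;
};

/* both yylval (in the scanner) and $$/$n (in the parser)
   become a pointer to a tree node */
#define YYSTYPE struct entry *

struct entry *new_entry(unsigned char type);  /* shared constructor */

#endif /* PARSE_TREE_H */
```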

schmitta 10-11-2013 02:44 PM

For the time being I am eliminating functions, just using GOSUB with parameters passed as global variables. Do you have an idea as to how the structure entry should look? What about:
Code:

struct entry {
    unsigned char        entry_type;
    unsigned char      vst;          // 0 => non-terminal symbol
                                      // 1 => terminal symbol variable
                                      // 2 => terminal symbol actual value
                                      //
    unsigned char      vtype;        // 1 => pointer to integer symbol table entry
                                      // 2 => contains actual integer value
                                      // 3 => pointer to character symbol table entry
                                      // 4 => contains actual character value
                                      // 5 => pointer to string symbol table entry
                                      // 6 => pointer to actual string.
                                      //
    struct entry        *left;
    struct entry        *center;
    struct entry        *right;
                                      //
    int                *int_st_ptr;  // for vtype = 1
    int                  intvalue;    // for vtype = 2
    char                *char_st_ptr;  // for vtype = 3
    char                charvalue;    // for vtype = 4
    char                *string_st_ptr;// for vtype = 5
    char                *stringptr;    // for vtype = 6
};

or something like the above to point at the symbol table entry or contain the actual integer value, character value or string value. It would not be as elegant as your solution but it would be easier for me to understand and implement. It would use more space for struct entry and more code in each routine, but solve the problem of left, right and center only being pointers to struct entry.

jpollard 10-11-2013 04:36 PM

I don't think vst or vtype would ever get used - the grammar itself specifies whether something is a terminal or not, and the entry_type field would have the same (or equivalent) values. As for the parse tree, most terminals (and non-terminals) are discarded. The only terminal symbols that would remain are references to variables in the symbol table, or constant values (whether char, integer, string or whatever), and these would be identified in the entry type.

Identifiers are terminal symbols... but they are already identified in the entry type (which could be an int rather than an unsigned char if there are more than 256 different values; I thought there would only be 25 to 30 different entry types).

A symbol table entry might be
Code:

struct symbol {
    char          *symbol;    /* the string name for the symbol */
    unsigned char sym_type;   /* might be int/string/char */
    unsigned int  flags;      /* various flags, array for one thing, function */
    unsigned int  line;       /* line the symbol was defined on (for errors) */
    unsigned int  dimsize[2]; /* dimension size - only two dimensions allowed */
    char          *other;     /* could be used to remove duplicate output code */
};

This allows the symbol table to be sorted (or put in a tree). And every time the parser recognizes a symbol, the production rules can check the symbol table entry for semantic errors (such as doing arithmetic on a string...), or create code that knows how to convert from a string into a number... BTW, nothing prevents putting constants (string or numbers) into the symbol structure - they all start out as a string of characters. Or even having multiple tables - one for variable symbols, one for constant numbers, one for constant strings... The nice part of that is that IF a particular constant is already in the table, there is no need to keep two of them - it allows an immediate storage optimization by eliminating duplicates (the "other" field could be a label already used during code generation, so that the duplicate would be dropped).

And the field "symbol" can point to any string at all. Which is why it can also hold constant string values (even if that string is a sequence of numerical digits)... With that consideration the only way a variable would exist is if it is in a variable table.

Now this would allow a entry structure like:

Code:

struct entry {
    unsigned char    entry_type;
    struct entry    *left;
    struct symbol    *sym;
    struct entry    *center;
    struct entry    *right;
};

If storage were an issue, there is the fact that sym is only used when "left" is NOT used... thus allowing for a union of the two. But it isn't necessary.

I don't think you want to completely drop functions - I suspect you are going to want some built-in functions (like substr, index, ...). They aren't that bad - the most complex thing is that the defined parameter list of a function declaration becomes a "symbol table", and symbols within the function body are first looked up in that table, and if not defined, are then looked for in the global table. Mapping the parameter values to the code becomes a relatively simple task depending on the target interpreter. Most languages even have a table for variables defined and used only within the body. The complexity is deallocating it after the parsing is complete. Some just drop it with the expectation that the translator will exit - which releases all the memory anyway.

schmitta 10-11-2013 08:48 PM

1) Where are the constants stored? - I will be using a signed 32 bit int, an unsigned char for a character and a pointer to a string (16 bit).

2) I have never written a bison/flex program. Would you be interested in looking over my code? I could pay a reasonable amount for your services. If you would be interested please contact me at schmitt_alvin[at]yahoo[dot]com . Thank you for all your help. Alvin...

jpollard 10-11-2013 10:17 PM

What had crossed my mind about constants (both numeric and string) was that these could be stored just like other symbols - in different tables. That way if the code used the same constant it would be identified as a duplicate, and only stored once.

Storage would be just like a variable name (a string) - so each "symbol" table would accumulate those constants: symbol names in a table for variables, string constants in a string table, and so on. Besides the elimination of duplicates (more significant for 32-bit integer and string constants), it could also permit more optimal storage in the target machine memory. Integers (of various sizes) have their own alignment requirements, where character strings don't. The various sized integers can be stored in the same table, but when they are laid out for the target machine, sorted according to storage size - all the characters together and the integers together (duplicating a single character inline, even if extended to 16 bits, is smaller than using an address). But it is just a thought.

I've started throwing some grammar around again myself, and may make a translator as an example. The target machine interpreter is also something I've done before. My old one was implemented as a protocol interpreter in a driver for a PDP-11, for a serial line block storage device (a DEC TU58 tape cartridge, which worked like the old Ftape drive - right down to the same block limits - and was cheaper and more reliable than a floppy at the time).

As for #2, you have an interesting problem and it has gotten me off my duff to poke around at small languages again. I wouldn't mind looking your code over. If you think it worth it, let's just see afterwards about any payment you think reasonable. Otherwise it's free.

schmitta 10-12-2013 12:41 AM

Please contact me at schmitt_alvin[at]yahoo[dot]com if you like. Alvin...

schmitta 10-14-2013 08:55 AM

After the parser has processed the language and created the tree, I guess the tree is sent to the MCU for processing as a pseudo program (interpreted?). How would the tree be sent to the micro? By walking the tree? Should the struct entry tree be in a predefined array, since sending that array would be easier than walking the tree? Since the tree is defined on a PC with infinite storage (!?), a very large array (larger than the total 53kb of ram on the MCU) could be defined, and since the struct entry allocator program would know what storage was used (allocated linearly from index 0 on), it would be simple to send the array as one contiguous byte stream to the micro. What do you think?

jpollard 10-14-2013 10:30 AM

The tree is "sort of" sent to the mcu. It isn't in a suitable form as it uses local addresses, and a rather large tree structure for the program.

What should be done (still on the development PC) is that the tree gets translated into a format directly related to the target machine.

There are several ways to do that, and it depends on the target as to what form it takes:

1) machine code - the parse tree can be converted to the base assembler code for other development tools (such as an assembler) which make the final creation of the machine code. This usually has the best runtime, but sometimes can be a booger to debug.

2) instructions for a virtual machine that actually runs on the target machine... This approach is like the Java virtual machine - you have to have a virtual machine program running on the target, and provide the application binary to it.

3) an in-between mix of the two. This one makes heavy use of runtime support (if the target has such a limited stack, then function/subroutine calls and returns are emulated by the runtime using a software-implemented stack). Beyond the emulated capability (such as some arithmetic operations, string operations, input/output), the generated code is combined with the runtime (either during translation of the assembler code, or by having some kind of linker do the work).

#3 is likely the best choice as it allows a minimal virtual machine, plus the advantage of mostly direct execution. There are two forms of this:
a) a threaded code virtual machine - makes code generation relatively easy as you get to avoid problems of register availability limits solved by the virtual machine, and the code generation can be of several different types.
b) a "run time library" that defines things that the hardware doesn't provide (a general stack for instance - runtime routines can emulate having a fully functioning stack of nearly any depth, subroutine calls/returns using that general stack, extra arithmetic capability (multiplication/division? even floating point is possible that way).

Of these two options (3a and 3b), 3a is the easiest to implement in an emulator for debugging purposes, 3b has the best speed performance.

I don't know how familiar you are with threaded code interpreters though. They can have nearly the same speed as direct execution, with an overhead of one to about four instructions, depending on the specific CPU target. A PDP-11 with a 64k address range could do it with a single instruction (it had an indirect jump with an auto-increment register addressing mode); others need two to five instructions to do it.
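For illustration, here is about the smallest token-threaded dispatch loop I can write in portable C - the opcode names and the three operations are invented for the example; a real target would dispatch through addresses or a jump table tailored to the MCU:

```c
/* a tiny token-threaded VM: each "instruction" in the program is an
   index into a table of handler functions */
static int stack[32], sp;
static const int *pc;          /* the thread's virtual PC */
static int running, result;

static void op_push(void) { stack[sp++] = *pc++; }            /* operand follows opcode */
static void op_add(void)  { sp--; stack[sp - 1] += stack[sp]; }
static void op_halt(void) { running = 0; result = stack[sp - 1]; }

/* the jump table: a thread "instruction" is just an index into it */
static void (*const optab[])(void) = { op_push, op_add, op_halt };

enum { PUSH, ADD, HALT };      /* opcode values = table indexes */

int run(const int *program)
{
    pc = program;
    sp = 0;
    running = 1;
    while (running)
        optab[*pc++]();        /* fetch the next token, dispatch through the table */
    return result;
}
```

The dispatch cost is exactly the `optab[*pc++]()` line; on a CPU with indirect-jump addressing modes that can collapse to a couple of machine instructions.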

schmitta 10-14-2013 11:29 AM

If struct entry was an array of structures, then the array could be sent to the target as is, and storage could be malloc-allocated on the target. The array would still be valid as the indexes are virtual. I guess int on the PC is 32 bits but the int index on the target would be 16 bits; the indexes would still be valid as long as the struct size was known on the target. If code generation were done, I would probably like to go for the Java-like abstract machine on the target - but a simple one, of course. Then the language could be ported to other machines, maybe even as open source, although I would need some proprietary code to access the target MCU's peripherals. What do you think, and thank you for your input.

jpollard 10-14-2013 11:41 AM

That is the advantage of the JVM approach. It even works for threaded code, as the runtime portion of threaded code is just a library of functions. The only difference between them and a subroutine is that the starting address value serves as the threaded-code "instruction", and instead of a return, the last instruction(s) update the thread version of a PC and do an indirect jump.

Neither requires the application code (which uses the library) to be under the same license. For the threaded code it is "linked" to the runtime, and the runtime can have anything in it desired - thus a separate license - with the two only combined by the user when the target object code is generated.

The advantage of adding the jump table is that instead of using the addresses directly, the thread instruction is used as an index into the jump table - thus needing a few more instructions, but making the application completely independent of the runtime, and not requiring a "link them together" step other than the vm needing a loader to put it in an appropriate place in memory.

In either case, it is not possible (usually due to the lack of memory protection) to separate the virtual machine entirely from the application - but it is very, very close, as the VM just copies the "application" into a typeless (or nearly so) array before starting interpretation. If the two can be combined externally (installed into the flash as a unit), you eliminate the need for the VM load - and only the base boot loader would be used to copy from flash to ram. In some architectures, the "ram" would also be the flash, eliminating the loader entirely. It would all look like the BIOS at that point.

jpollard 10-14-2013 12:19 PM

Quote:

Originally Posted by schmitta (Post 5045522)
If STRUCT ENTRY was an array of structures then the array could be sent to the target as is and storage could be malloc allocated on the target. The array would still be valid as the indexes are virtual. I guess int on the pc is 32 bits but the int index on the target would be 16 bit. But the indexes would still be valid as long as the struct size was known on the target. If code generation were done I would probably like to go for the java like abstract machine on the target but a simple one of course. Then the language could be ported to other machines maybe even as open source although I would need some proprietary code to access the target mcu's peripherals. What do you think and thank you for your input.

Sorry, forgot to answer the first part of your question about struct entry.

The problem with that is the size of the struct entry - there is no need to bulk up most of the structures. If you look, almost none of them are fully used; most only use the left/right pointers.

Converting the tree into a code sequence eliminates the structure entirely, and saves all the memory and time needed to walk the tree during execution. Effectively the walking is done during the final translation, so the VM/interpreter doesn't have to do that work.
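As a sketch of that final translation step - the node types, opcodes, and fixed-size code buffer are all invented for the example:

```c
#include <stddef.h>

enum { NUMBER_t = 1, ADD_t };          /* illustrative node types */
enum { OP_PUSH = 1, OP_ADD };          /* illustrative target opcodes */

struct entry {
    unsigned char entry_type;
    int value;                  /* used when entry_type == NUMBER_t */
    struct entry *left;
    struct entry *right;
};

static int code[64];
static int ncode;

static void out(int word) { code[ncode++] = word; }

/* post-order walk: emit children first, then the operator, turning
   the tree into a linear stack-machine sequence - all done on the
   PC, so the target never sees the tree at all */
void emit(const struct entry *e)
{
    if (e == NULL)
        return;
    emit(e->left);
    emit(e->right);
    if (e->entry_type == NUMBER_t) {
        out(OP_PUSH);
        out(e->value);
    } else if (e->entry_type == ADD_t) {
        out(OP_ADD);
    }
}
```

A tree for 2+3 comes out as PUSH 2, PUSH 3, ADD - exactly the byte stream a small stack VM on the MCU can execute without any pointer chasing.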

schmitta 10-14-2013 04:05 PM

I have the .lex code of:

Code:

%{
/* note: %union belongs in the bison .y file; the scanner picks up
   its members through YYSTYPE via test.tab.h */
#include "test.tab.h"
#include <stdlib.h>
#include <string.h>
%}
%option yylineno
WHITE   [ \t]+
DIGIT   [0-9]
%%
0[bB][01]+         {yylval.stringValue = strdup(yytext); return BINARY;}
0[xX][0-9A-Fa-f]+  {(void) sscanf(yytext,"%x",&yylval.intValue); return HEX;}
-?{DIGIT}+         {yylval.intValue = atoi(yytext); return DEC;}
'[^'\n]'           {yylval.charValue = yytext[1]; return CHARACTER;}
[^"\n]+            {yylval.stringValue = strdup(yytext); return CHARS;}

with the following BNF and parser coding:

Code:

constant:       HEX             {$$ = $<intValue>1;}
        |       DEC             {$$ = $<intValue>1;}
        |       CHARACTER       {$$ = $<charValue>1;}
        |       BINARY          {$$ = $<stringValue>1;}
        |       string          {       }
        ;

string  :       DQ CHARS DQ     {$$ = $<stringValue>2;}
        ;


Is this the correct interpretation of what you want to do?

jpollard 10-15-2013 05:14 AM

Semantically, there is no difference between HEX and DEC. If you notice, the scanner specification turns both into an integer, so the syntax element constant only needs a single "NUMBER" token for those two.

I think there is a scanner problem though.

How is the scanner going to distinguish an input sequence like "v=a-5" (an expression) from "v= -5" (a constant)?

If the "-5" is handled as a CONSTANT, then the expression is syntactically wrong. The problem is caused by the "-" used to mark a negative. The parse would identify the two cases differently: "expression - CONSTANT" versus "- CONSTANT". An optimizing pass could tell the first case from the second, and then collapse "- CONSTANT" into a single element where the constant is negative. It could be done during the parse, but it requires the action routine to look at the "- expression" subtree to decide whether it is a constant that can be collapsed into a negative constant.

Now semantic handling of the constants can be done in different ways...

My original thought (not necessarily valid, mind) was to store the various constants in different tables, one table for each type (it does mean a different search function for each table, or a parameter to tell a search function which type to look for). The tables provide a way to accomplish duplicate elimination (saving memory, and better data packing in the target system). They also reduce the number of types in the parse tree to just pointers to entries in the various tables. If the table structure definition can contain all types, then the parse tree only needs to contain references either to the parse tree itself (forming the tree) or to a table entry. It does mean that the scanner would be the one to enter the constants into the various tables, rather than the action portion of the parser rules (which would instead provide updates to the table in the form of attribute flags and such).

One of the purposes of the tables (beyond duplicate elimination) is to carry extra information (the attributes, and the "other" field) which can be used when generating code, and possibly debugging information (semantic errors - like making the value of a character too large, or converting an integer to a character, or a character to an integer...).
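The "- CONSTANT" collapse mentioned earlier can be sketched as a tiny tree rewrite - the struct here carries a value field for the sake of the example, so it is an assumption, not the struct from the earlier posts:

```c
#include <stdlib.h>

enum { NUMBER_t = 1, NEGATE_t };       /* illustrative type codes */

struct entry {
    unsigned char entry_type;
    int value;                  /* only meaningful for NUMBER_t */
    struct entry *left;
    struct entry *center;
    struct entry *right;
};

/* collapse {NEGATE_t, {NUMBER_t n}} into {NUMBER_t -n}; anything else
   (for example "a - 5", a binary subtract) is returned untouched */
struct entry *fold_negate(struct entry *e)
{
    if (e->entry_type == NEGATE_t &&
        e->left != NULL && e->left->entry_type == NUMBER_t) {
        struct entry *k = e->left;
        k->value = -k->value;
        free(e);                /* the unary-minus node is no longer needed */
        return k;
    }
    return e;
}
```

The parser action for a unary-minus production could call this directly, so "v= -5" ends up as a single negative constant while "v=a-5" stays a subtraction.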

