Please bear with me.
For the following:
Code:
$$.value = generate_while($2.value, $3.value); |
A PC doesn't have the same limitations. One thing that will be necessary, though, is to have a solid definition of what the pseudo-assembly language is, even to the point of having a "simulator" to evaluate the results of your translator (it would help tremendously in debugging). I do apologize for not being fully consistent. The $$ variable is actually a struct entry *; I made an error using $$.value. The production rule for a while should be:
Code:
...: WHILE expr statements WEND { $$ = generate_while($2, $3); }
with the generator function declared along the lines of:
Code:
struct entry *generate_while(struct entry *expr, struct entry *statements); |
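As a minimal sketch of what generate_while might do: it allocates a parse-tree node whose children are the condition expression and the loop body. The WHILE_t tag and the exact field layout of struct entry are assumptions for illustration, not the thread's actual definitions:

```c
#include <stdlib.h>

enum entry_type { WHILE_t = 1 };  /* assumed tag; the real enum has many more */

struct entry {
    int entry_type;           /* what kind of node this is          */
    struct entry *left;       /* condition expression subtree       */
    struct entry *right;      /* statement-list subtree (loop body) */
};

/* Build a WHILE node whose children are the condition and the body. */
struct entry *generate_while(struct entry *expr, struct entry *statements)
{
    struct entry *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    node->entry_type = WHILE_t;
    node->left = expr;
    node->right = statements;
    return node;
}
```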
Thanks Jpollard, you are a real life saver!
|
Would YYSTYPE be defined at the top of the rules.y bison file as:
Code:
%{
#define YYSTYPE struct entry *
%} |
Would there in fact be a need for two or more structure types, one for
Code:
struct entry { |
Characters, arrays, and integers are all elements of the underlying language. They get an entry_type value just like variables and statements do. One way to handle an array is to treat it just like an operator (add/sub/mul ...), only the entry type for indexing would be an INDEX_t. I'm assuming, for the time being, the grammar allows for multiple dimensions.

As a bit of expansion, an array (declared as A[M,N], where M,N are the sizes of the corresponding dimensions - actual allocation calls for a block of memory of size M*N elements) would have an index operation done for a use of A[x,y] such that A[x,y] is equivalent to A[x*N+y] (which happens to be why the original BASIC only had a single dimension for an array). This works for arrays of strings just as well as for integers (using an array of pointers to the string values), and it does assume the symbol table entry for A indicates the data type.

Please note - there is a bit of vagueness here. It is entirely possible to just have a tree {INDEX_t, NULL, ptr to var x, NULL, ptr to expression list} and, during the interpretation/code generation, use the parent entry {VARIABLE_t, ptr to symbol table entry for variable A, ptr to index tree}, and extract the M,N values from the symbol table as needed. If the "ptr to index tree" is NULL, then there are no subscripts involved. Note: "ptr to var x/y" could be expression subtrees.

There are several ways to represent the indexing in the parse tree. In the above, I'm allowing for an "expression list", which can be used for either indexing (for example "A[x,y]") or for function parameter lists (for example "function_name(x,y)"). HOW the list is used is up to the parent entry structure.
An array reference A[x,y] would be broken down into a tree like {VARIABLE_t, ptr to symbol table entry for A, NULL, {INDEX_t, NULL, NULL, ptr to list}}, where "ptr to list" would be a tree representing the expression list, like {EXPR_t, ptr to expression subtree, NULL, ptr to next EXPR_t entry}, which allows for evaluation during code generation. This directly corresponds to a statement list (same type of BNF rules). It also happens to allow for as many dimensions as you want, just as it allows for as many function parameters as you want.

In the case of a function, the tree would look almost the same - the difference is at the top of the tree. Instead of a VARIABLE_t at the beginning, you have FUNCTION_t, and instead of INDEX_t, you have PARAM_t (parameter entry type). If the function call has no parameters, then the pointer would be NULL.

You could even have no syntactic difference between an array and a function (provided the function/array definition appears in the program source code BEFORE its use). This would make program constructs like "A(x,y)" be either an array or a function call, as determined by the symbol table entry for A. If A is an array, then it is indexing; if A is a function, then it is a parameter list. Personally (my opinion) it is easier to read code that does make a syntax difference. Visually, the code "A(x,y)" would be a function call, and "A[x,y]" an array, without needing to hunt for the declaration of A to find out which it is. |
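The row-major flattening described above (A[x,y] being equivalent to A[x*N+y]) can be sketched directly; the function name is an invention for illustration:

```c
/* Flatten a two-dimensional subscript into a linear element index,
 * row-major: for an array declared A[M,N], element A[x,y] lives at
 * offset x*N + y within the underlying block of M*N elements. */
int flatten_index(int x, int y, int n)
{
    return x * n + y;
}
```

The same idea extends to more dimensions by multiplying out each trailing dimension size, which is how an expression list of any length can drive the index computation.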
Quote:
Would YYSTYPE be defined at the top of the rules.y bison file as:
Code:
%{
#define YYSTYPE struct entry *
%} |
I believe it needs to also be included in the scanner, so that the scanner can generate some entries for it. I tend to put it in an include file that can then be included in both (this is mostly so the scanner can add an identifier to a symbol table, and return an entry structure that connects the parse tree with the symbol table entry).
Note - the structure as it stands is not quite complete. It has enough for things within the parse tree, but things like "ptr to variable" usually mean it has to be pointing into a symbol table (which we haven't discussed yet). It is still a pointer, but for a variable there is a different structure that needs to be defined - one that takes account of initialization, number and size of dimensions (if any), and data type (at a minimum, integer and string). And then there is debugging information (what source code line it was defined on, for instance, or whether a value has ever been assigned...), as well as how function parameters are handled (wouldn't want a function name used as a variable...). |
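A sketch of the kind of symbol-table record that paragraph calls for - every field name here is an assumption for illustration, not the thread's actual definition:

```c
enum data_type { TYPE_INTEGER, TYPE_STRING };   /* minimum set discussed */

struct variable {
    char *name;            /* identifier text                        */
    enum data_type type;   /* integer or string, at a minimum        */
    int ndims;             /* 0 for a scalar                         */
    int dims[2];           /* size of each dimension, if any         */
    int initialized;       /* has a value ever been assigned?        */
    int defined_on_line;   /* source line of the declaration (debug) */
};
```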
For the time being I am eliminating functions, just using gosub with parameters passed as global variables. Do you have an idea as to how the entry structure should look? What about:
Code:
struct entry { |
I don't think vst or vtype would ever get used - the grammar itself specifies whether something is a terminal or not. The entry_type field would have the same (or equivalent) values. As for the parse tree, most terminals (and non-terminals) are discarded. The only terminal symbols that would remain are references to variables in the symbol table, or a constant value (whether char, integer, string, or whatever), and these would be identified in the entry type.
Identifiers are terminal symbols... but they are already identified in the entry type (which could be an int rather than an unsigned char if there are more than 256 different values; I thought there would only be 25 to 30 different entry types). A symbol table entry might be:
Code:
struct symbol {
The field "symbol" can point to any string at all, which is why it can also hold constant string values (even if that string is a sequence of numerical digits)... With that consideration, the only way a variable would exist is if it is in a variable table. Now this would allow an entry structure like:
Code:
struct entry {
I don't think you want to completely drop functions - I suspect you are going to want some built-in functions (like substr, index, ...). They aren't that bad - the most complex thing is that the defined parameter list of a function declaration becomes a "symbol table", and symbols within the function body are first looked up in that table and, if not defined there, are then looked for in the global table. Mapping the parameter values to the code becomes a relatively simple task depending on the target interpreter. Most languages even have a table for variables defined and used only within the body. The complexity is deallocating it after the parsing is complete. Some just drop it with the expectation that the translator will exit - which releases all the memory anyway. |
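The two-level lookup described for function bodies (parameter table first, global table second) could be sketched like this. The shape of struct symbol is a guess, since the thread's code was cut off - a linked list is assumed purely for brevity:

```c
#include <string.h>
#include <stddef.h>

struct symbol {
    char *symbol;          /* points to any string at all */
    struct symbol *next;   /* simple chaining (assumed)   */
};

/* Search one table (here a linked list) for a name; NULL if absent. */
static struct symbol *find_in(struct symbol *table, const char *name)
{
    for (; table != NULL; table = table->next)
        if (strcmp(table->symbol, name) == 0)
            return table;
    return NULL;
}

/* Look a name up in the function's parameter table first,
 * falling back to the global table, as described above. */
struct symbol *lookup(struct symbol *params, struct symbol *globals,
                      const char *name)
{
    struct symbol *s = find_in(params, name);
    return s != NULL ? s : find_in(globals, name);
}
```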
1) Where are the constants stored? I will be using a signed 32-bit int, an unsigned char for a character, and a pointer to a string (16-bit).
2) I have never written a bison/flex program. Would you be interested in looking over my code? I could pay a reasonable amount for your services. If you would be interested please contact me at schmitt_alvin[at]yahoo[dot]com . Thank you for all your help. Alvin... |
What had crossed my mind about constants (both numeric and string) was that these could be stored just like other symbols - in different tables. That way if the code used the same constant it would be identified as a duplicate, and only stored once.
Storage would be just like a variable name (a string) - so each "symbol" table would accumulate those constants: symbol names in a table for variables, string constants in a string table, and so on. Besides the elimination of duplicates (more significant for 32-bit integer and string constants), it could also permit more optimal storage in the target machine memory. Integers (of various sizes) can have their own alignment requirements, where character strings don't. The various sized integers can be stored in the same table, but when they get created for the target machine code they can be sorted according to storage size - all characters together, or put inline, since duplicating a byte (even if extended to 16 bits) is smaller than using addresses - so all the characters would be together, and the integers together. But it is just a thought.

I've started throwing some grammar around again myself, and may make a translator as an example. The target machine interpreter is also something I've done before. My old one was implemented as a protocol interpreter in a driver for a PDP-11 for a serial line block storage device (a DEC TU58 tape cartridge, which worked like the old Ftape drive - right down to the same block limits, and cheaper/more reliable than a floppy at the time).

As for #2, you have an interesting problem, and it has gotten me off my duff to poke around at small languages again. I wouldn't mind looking your code over. If you think it worth it, let's just see afterwards about any payment you think reasonable. Otherwise it's free. |
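The duplicate-elimination idea above can be sketched as a small "intern" function over a constant table - a minimal linear-scan version, with the table shape invented for illustration:

```c
#define MAX_CONSTS 64

struct const_table {
    int count;
    long value[MAX_CONSTS];
};

/* Return the index of a 32-bit integer constant in the table, adding
 * it only if it is not already there - so the same constant is stored
 * once no matter how many times the source code uses it. */
int intern_int(struct const_table *t, long v)
{
    int i;
    for (i = 0; i < t->count; i++)
        if (t->value[i] == v)
            return i;          /* duplicate: reuse the existing slot */
    t->value[t->count] = v;
    return t->count++;
}
```

The same scheme works for string constants by comparing with strcmp instead of ==, which is exactly why constants can live in symbol-style tables.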
Please contact me at schmitt_alvin[at]yahoo[dot]com if you like. Alvin...
|
After the parser has processed the language and created the tree, I guess the tree is sent to the mcu for processing as a pseudo program (interpreted?). How would the tree be sent to the micro? By walking the tree? Should the struct entry tree be in a predefined array, since sending that array would be easier than walking the tree? Since the tree is defined on a PC with infinite storage (!?), a very large array (larger than the total 53kb of ram on the mcu) could be defined, and the struct entry allocator would know what storage was used (allocated linearly from index 0 on), so it would be simple to send the array as one contiguous byte stream to the micro. What do you think?
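The proposal above - allocate nodes linearly from an array so the whole thing can be shipped as one byte stream - can be sketched by replacing child pointers with array indices. The node shape and pool size are assumptions for illustration:

```c
#define MAX_NODES 1024

/* A pointer-free node: children are array indices, with -1 for "none",
 * so the pool needs no address fix-up after being copied to the mcu. */
struct flat_entry {
    int entry_type;
    int left;     /* index of left child, or -1  */
    int right;    /* index of right child, or -1 */
};

struct flat_entry pool[MAX_NODES];
int pool_used = 0;

/* Allocate nodes linearly from index 0 on, as proposed above;
 * returns the new node's index, or -1 if the pool is full. */
int flat_alloc(int type, int left, int right)
{
    if (pool_used >= MAX_NODES)
        return -1;
    pool[pool_used].entry_type = type;
    pool[pool_used].left = left;
    pool[pool_used].right = right;
    return pool_used++;
}
```

Only the first pool_used entries would need to be transmitted, and the stream is position-independent because all references are indices.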
|
The tree is "sort of" sent to the mcu. It isn't in a suitable form as it uses local addresses, and a rather large tree structure for the program.
What should be done (still on the development PC) is that the tree gets translated into a format directly related to the target machine. There are several ways to do that, and it depends on the target as to what form it takes:

1) Machine code - the parse tree can be converted to base assembler code for other development tools (such as an assembler), which make the final creation of the machine code. This usually has the best runtime, but sometimes can be a booger to debug.

2) Instructions for a virtual machine that actually runs on the target machine. This approach is like the Java virtual machine - you have to have a virtual machine program running on the target, and provide the application binary to it.

3) An in-between mix of the two. This one makes heavy use of runtime support (if the target has such a limited stack, then function/subroutine calls and returns are emulated by the runtime using a software-implemented stack). Beyond the emulated capability (such as some arithmetic operations, string operations, input/output), the generated code is combined with the runtime (either during translation of the assembler code, or by having some kind of linker do the work).

#3 is likely the best choice, as it allows a minimal virtual machine plus the advantage of mostly direct execution. There are two forms of this:

a) a threaded-code virtual machine - makes code generation relatively easy, as problems of register availability limits are solved by the virtual machine, and the code generation can be of several different types.

b) a "run time library" that defines things the hardware doesn't provide - a general stack, for instance: runtime routines can emulate having a fully functioning stack of nearly any depth, subroutine calls/returns using that general stack, and extra arithmetic capability (multiplication/division, even floating point is possible that way).
Of these two options (3a and 3b), 3a is the easiest to implement in an emulator for debugging purposes; 3b has the best speed performance. I don't know how familiar you are with threaded code interpreters, though. They can have nearly the same speed as direct execution, with a dispatch overhead of roughly one to five instructions, depending on the specific CPU target. A PDP-11 with a 64k address range could do it with a single instruction (it had an indirect jump with an auto-increment register addressing mode); others need two to five instructions to do it. |
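For readers unfamiliar with threaded code, here is a toy dispatch loop in C, using an array of handler addresses and function-pointer calls in place of the PDP-11's indirect auto-increment jump. The opcode set is invented for illustration:

```c
/* Each "instruction" is just the address of its handler; the inner
 * loop fetches the next address and jumps through it - the C-level
 * equivalent of the PDP-11's indirect jump with auto-increment. */
typedef void (*op_fn)(void);

static int acc;        /* toy accumulator                */
static int running;    /* cleared by the halt "opcode"   */

static void op_inc(void)  { acc++; }
static void op_dbl(void)  { acc *= 2; }
static void op_halt(void) { running = 0; }

/* Run a threaded program: an array of handler addresses. */
int run(op_fn *program, int start)
{
    op_fn *ip = program;   /* virtual instruction pointer */
    acc = start;
    running = 1;
    while (running)
        (*ip++)();         /* fetch next handler address, call it */
    return acc;
}
```

The entire virtual machine is the two-line while loop, which is why a threaded-code interpreter can stay so small on a memory-limited mcu.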