regular expressions
Is there a program where you enter a regular expression and it tells you in English what it will do?
|
Perhaps something like this would help:
http://regex101.com/ or http://www.myezapp.com/apps/dev/regexp/show.ws |
Thanks sycamorex! The myezapp site did the trick and gave me what I wanted. They did not have flex, but the Java version seems to work identically. Thanks again! Alvin....
|
I am trying to get a label that is defined as starting with a letter, followed by zero to seven letters or digits, followed by a colon. I would also like it to be the first item in the statement. I tried:
lines ::= labeldef statement
labeldef ::= [A-Za-z][A-Za-z0-9]{0,7}:
but it allows labels longer than eight characters, matching only the last eight characters before the colon plus the colon. How do I write the regular expression to allow only eight characters total and give an error for nine or more (or for a label not starting with a letter)? Maybe I need to precede it with a null or get it to start at the first character on the line. Thank you. Alvin.... (Note: I am not in college doing this for a course but for my business.) |
it appears fine here, at least in bash
Code:
Check=""
Code:
a: True |
starting means ^, so you would need to write: ^[A-Za-z][A-Za-z0-9]{0,7}:
|
good point pan64 :)
Still, it should work without ^ Code:
Check="";for i in 0 {1..10};do Check=${Check%:}${i}:; printf "$Check "; [[ $Check =~ [A-Za-z][A-Za-z0-9]{0,7}: ]] && echo True || echo False; done
But 100% agree, ^ should be used, along with $ on the end Code:
Check="";for i in z {1..10};do Check=${Check%:}${i}:; printf "$Check "; [[ $Check =~ ^[A-Za-z][A-Za-z0-9]{0,7}:$ ]] && echo True || echo False; done
As I mentioned earlier, your regexp is working in bash. Do you have sample code where it is not working? |
I was using the software at:
http://regex101.com/ or http://www.myezapp.com/apps/dev/regexp/show.ws to test with. The myezapp program shows the match with a colored bar. For "testabc0:" all was colored; for "testabc12:", "stabc12:" was colored as a match. The label needs to be at the beginning of the line so I will use the ^ first. Other tokens can follow so I will leave off the $. I just tried it with http://regex101.com and now it seems to work correctly. I have: ^[A-Za-z][A-Za-z0-9]{0,7}: which now seems to work, rejecting "testabc01:" and " testabc0:" (not first in the line). Thank you for your help. Do you know if flex will reject it and just pass the non-match through, or will it give me some way to flag it as an error? Thanks. Alvin... |
Quote:
One other note - it is part of a flex scanner/tokenizer. Now, specifying the ^ will identify it as valid, but usually a too-long label would be counted as a SEMANTIC error rather than a syntax error. Leaving the ^ off would allow the action part to do more detailed analysis and determine the difference between a valid label and a label that is too long, and thus provide better error diagnostics so the user can make corrections faster and more accurately. |
|
Should I leave the ^ in or out? I changed mine to ^[A-Za-z][A-Za-z0-9]{0,7}[ \t\n] to catch a blank or tab between the labeldef and the next token on the line, or just the labeldef on the line. But how will I catch a label that is too long, as it will probably just pass through if flex works as I think?
|
It depends on the grammar. Don't forget that a scanner is only supposed to recognize tokens. Whether the grammar uses white space as a token or just as a separator are two different things.
The scanner can easily consume white space with a very simple rule if it isn't significant. For instance, if the grammar specifies a label as: Code:
label : symbol ':' {whatever to do with a label}
If all symbols are limited in length, then the scanner can identify the error, but still not abort scanning - translators work best by identifying as many errors as possible, not terminating on the first one. The scanner could identify the label with: Code:
id [A-Za-z]{1,}[0-9]* Code:
label: ID COLON { if ($1.length > 8) {
Flex is good to use when speed of implementation is more important, but it does assume you are already familiar with what scanners are used for, how they are used, and how to interface one with bison (as in generating the appropriate include files...). They are supposed to make it easier for the parser to handle things and to separate semantics from tokenizing - and to help make error messages more meaningful and parsing recovery possible. The simplest error handling is to abort on the first error - but that makes USING the application/translator/... much harder, as you have to keep re-running the application just to find the next error. |
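A minimal, untested sketch of how that might fit together, with the length check done in the parser action rather than in the regular expression (the %union text field, define_label(), the error wording, and the usual #include boilerplate are illustrative or omitted, not from this thread): Code:
/* scanner (flex) side - assumes %union { char *text; ... } and %token <text> ID */
[A-Za-z][A-Za-z0-9]*   { yylval.text = strdup(yytext); return ID; }
":"                    { return COLON; }

/* parser (bison) side - any symbol scans as an ID; the action decides if it is a legal label */
labeldef: ID COLON {
        if (strlen($1) > 8)
            yyerror("label longer than 8 characters");   /* semantic error, parsing continues */
        else
            define_label($1);                            /* assumed helper */
    }
    ;
This way a nine-character label is still scanned and parsed, and the diagnostic can say exactly what is wrong instead of producing a generic syntax error.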
Thanks jpollard. I really appreciate the extra effort you took to make those points. Alvin...
|
I am writing a BASIC interpreter. I was going to write the BNF and run it through flex and bison, but I am not sure they would be appropriate for writing an interpreter. The original idea was to generate the interpreter with flex and bison and compile the C code to run in an MCU. The MCU has 170k words of flash and 56kb of RAM in a Harvard architecture. But I am considering using flex and bison to write a C program that would run on a PC and generate an intermediate pseudo-assembler language to be interpreted on the MCU. I am including DCL (declare) statements in the BASIC which I would like to find on an initial pass through the code. Pass zero would also find forward branches and the WEND in WHILE ... WEND statements. Can multiple passes be done with flex and bison, or is there a better way? The PC program would be written in Java so as to run under Windows, Mac and Linux systems. Any ideas you have would be greatly appreciated.
|
The first pass of a compiler translates the source into something more amenable to analysis. This is the intermediate language that can be in one of many forms - a parse tree plus symbol table, or multiple parse trees and symbol tables (this is the one I'm most familiar with, but there are others). Consider the inclusion of subroutines/functions for instance. The ability to include "precompiled" parse trees (or whatever the intermediate language is) allows for global optimization of the code. After the optimization pass, a third pass can be made that generates the optimized assembler.
You might want to look into the LLVM/Clang compilers (http://llvm.org/) - they are designed for this. Another back end (I have only read about, not used) is what is used for Android - the Dalvik bytecode interpreter. This is supposed to provide an efficient interpreter with a minimum of actual code, but allows a higher-level language to be converted into a smaller size (good); though it is slower than native code (bad), it is MUCH easier to generate good code for... And it allows the code to run on anything that the interpreter runs on (making it easier to develop for). This is also the goal of the Forth language (http://www.forth.org/) - a very small interpreter, with an easily parsed language, to run on very small processors. |
How would I make a multi pass interpreter with flex and bison to pre process labels and forward branches as well as find all the variable declares ?
|
You need to refer to a text book on compiler construction as that topic is fairly large.
But a VERY simple overview is:

1. The scanner+parser validate the language input and generate a symbol table + parse tree (or some intermediate language). All loops are left symbolic in the intermediate code.

2. Optimizing passes can go through the intermediate code to prune code that is never reached (constant true/false tests always do the same thing, so why generate the alternative; loops that can never be executed are the same...). Register usage optimization, constant expression elimination... Redundant operand elimination -- a*b + a*c is usually slow and is replaced by a*(b+c), which eliminates a multiply. Another thing that can be done is tail recursion elimination (it really is just iteration, which is faster and reduces stack overhead).

3. During code generation (the last pass usually, but not necessarily) the generated code is assigned addresses - when a label is assigned an address, the back links in the symbol table provide all the locations that need to be adjusted with the real address. Note, this may involve additional changes, as using things like short relative addresses may change the actual address of the label - requiring the adjustments to be repeated. It is even possible to enter infinite loops here - where changing an address causes other locations to need long addresses - which again change the location. So some optimization of what and where long/short/relative addressing is to be used becomes important. |
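As a hedged illustration of the back-link idea in step 3 (the names code[], struct fixup, and the helper functions are made up for this sketch, not taken from any particular assembler): Code:
#include <stdlib.h>

struct fixup    { int location; struct fixup *next; };         /* one hole to patch */
struct labelsym { int address; int defined; struct fixup *refs; };

/* emit a branch to a label that may not be defined yet */
void emit_branch(struct labelsym *l, int *code, int pc)
{
    if (l->defined) {
        code[pc] = l->address;                 /* backward branch: address already known */
    } else {
        struct fixup *f = malloc(sizeof *f);   /* forward branch: remember the hole */
        f->location = pc;
        f->next = l->refs;
        l->refs = f;
    }
}

/* when the label finally gets an address, patch every recorded hole */
void place_label(struct labelsym *l, int *code, int pc)
{
    l->address = pc;
    l->defined = 1;
    for (struct fixup *f = l->refs; f != NULL; f = f->next)
        code[f->location] = pc;
}
If short/relative addressing is then introduced, the patched locations can shift, which is exactly the repeat-until-stable problem described above.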
What I was asking is how to do multiple passes with bison/flex. Use BEGIN var to turn certain areas of the grammar on and off? Also, I am getting 1 shift/reduce warning (error?). It is for:
Quote:
|
Quote:
I don't think the scanner is designed to be invoked recursively, at least not those generated by flex... Neither flex nor bison is good at two passes... that isn't what they were designed for.

Now flex CAN be made to do two passes over the file (it is a simple scanner) as it does not maintain internal state (I believe seeking to the beginning of the input file will do - but now you also have to track how many times you reach the end of file or you have an infinite loop). But bison DOES maintain internal state. The goal of bison is to either directly interpret the parse tree, or generate an output file... When the scanner reports an end of file to the parser, the parsing is supposed to be complete --- unless the internal state still has elements on the stack... in which case you have a parse error (unexpected end-of-file).

Now that said, you can accomplish the equivalent using something like Code:
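/* hedged, untested sketch: pass1_program and pass2_program are two copies of
 * the same grammar, differing only in their action rules; the scanner rewinds
 * its input after the first end of file and returns the pseudo-tokens below */
%token EOF1 EOF2

start
    : pass1_program EOF1 pass2_program EOF2
    ;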
and EOF2 is the token returned at the second end of file. It also allows you to have two different sets of action rules instead of one. This can be used to do a grammar check during pass1, generating errors and such that abort the "pass2" parsing, and then have pass2 use different action rules. It is awkward, and you HAVE to make sure the two grammar specifications are identical. It doubles the size of the parser... And all it really does is force double the work, as the action rules of the first part can include those of the second. Quote:
Code:
IF expr THEN IF expr THEN somestatement ELSE somestatement
As described in the documentation for bison: http://www.gnu.org/software/bison/ma...02fReduce.html You fix it as shown in http://www.gnu.org/software/bison/ma...#Non-Operators which forces a particular choice. It can also be solved by having an "if...endif" construct to force the association, where the ELSE part is either null or an else clause... the grammar for that would look something like: Code:
ifstatement: |
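A hedged guess at how that rule might be filled in (the IF/THEN/ELSE/ENDIF tokens and the expr/statements nonterminals are assumed to exist elsewhere in the grammar): Code:
ifstatement
    : IF expr THEN statements elsepart ENDIF
    ;

elsepart
    : /* empty - no else clause */
    | ELSE statements
    ;
Because every IF is closed by its own ENDIF, the dangling-else ambiguity (and the shift/reduce warning) goes away.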
Thanks jpollard for your expert help. I made it:
Code:
|
How do you handle a forward branch in, say, a
Code:
WHILE exp stmts WEND Code:
WHILE ( exp ) stmts WEND |
Quote:
|
Quote:
You really need to refer to a textbook for this level of detail. |
But if I am interpreting the code, then I have the original in core in order to process it on a second pass. To write a two-pass algorithm, would it not be best to come up with two different flex/bison routines, each with its own unique yyparse() (yyparse1, yyparse2), and call them from a main routine with global symbol table and variable recording routines, using a different grammar (as needed) for each? yyparse1 for the first pass for forward branch labels, WHILE/WEND statements, and variable declare statements, with yyparse2 processing the program with the now-known values? A kludge, I know, but a way to accomplish it.
|
Only if you want to bloat the translator.
Having two parsers in memory takes twice the space... and makes it harder to fix errors. In a single pass, the rule Code:
...: WHILE expr statements WEND { ... }
creates a (dynamic) structure with 3 elements - the entry is identified as a while, a pointer to the parse tree for the expression (represented by $2.value), and a pointer to the parse tree for the statements (represented by $3.value). No labels (forward or backward) are required AT THIS TIME. What is this entry? Something like: Code:
struct entry { ... }
where the entry type identifies elements of the language: STATEMENT, IF, WHILE, FUNCTION, ASSIGN, VARIABLE, ADD, SUB, MUL, DIV, EXP, ... These are the junction points of the grammar. Each entry only points to other entries. The end result is that a complete scan of the source code is translated into one or more elements in this tree. An expression like (a + b(c+d)) gets converted into a tree with elements like: Code:
{add, left, null, right} Code:
{while, left, null, right} Code:
{if, left, center, right} Code:
{STATEMENT, ptr to a single statement, null, ptr to another STATEMENT entry} The last STATEMENT entry has a null for "ptr to another STATEMENT entry". So now we have a parse tree that represents the program. First pass done. The SECOND PASS can be either a code generator, or an interpreter. If an interpreter, no labels are needed. The interpreter only does what each element of the tree specifies. The interpreter just walks the tree and invokes functions for each entry type identified for each node in the tree. If the second pass is a code generator, THEN AND ONLY THEN are labels necessary. But it works ALMOST the same as the interpreter. The only difference is that instead of evaluating the statements, it outputs an assembly/machine code. A translator for the IF entry would be something like: Code:
void trans_if(struct entry *left, struct entry *center, struct entry *right) Code:
void trans_while(struct entry *left, struct entry *right)
Each of these recursively generates code for the subtrees below the current node - that is why it is recursive - the code generated for an IF needs code generated for the expression... so the trans_if function just invokes the gencode function to generate code for a particular branch of the tree. The only caution in the "newlabel" use is to be sure to get the value of the generated label before it is needed.

Now, using the above as an example creates a three-pass compiler. The FIRST PASS does the syntax checking, error handling, and creates a parse tree. The SECOND PASS generates an assembly output... So a THIRD PASS would be the assembler that translates the assembly code into machine code... Please note - none of this has been tested. |
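To make the shape of that concrete, here is a hedged, untested sketch built from the description above; the struct fields are trimmed to what these two routines need, and newlabel(), gencode() and the emit_*() helpers are illustrative names, not an established API: Code:
struct entry {
    int entry_type;              /* STATEMENT, IF, WHILE, ADD, VARIABLE, ... */
    struct entry *left;
    struct entry *center;        /* only the IF entry uses all three pointers */
    struct entry *right;
};

int  newlabel(void);                                   /* assumed helpers, declared only */
void emit_label(int), emit_jump(int), emit_branch_false(int);
void gencode(struct entry *node);                      /* recursive tree walker */

/* WHILE: left = expression subtree, right = statement-list subtree */
void trans_while(struct entry *left, struct entry *right)
{
    int top = newlabel();        /* label in front of the test */
    int out = newlabel();        /* label just past the loop */

    emit_label(top);
    gencode(left);               /* code that leaves the test value */
    emit_branch_false(out);      /* leave the loop when the test fails */
    gencode(right);              /* the loop body */
    emit_jump(top);              /* back to the test */
    emit_label(out);
}

/* IF: left = expression, center = then part, right = else part (may be NULL) */
void trans_if(struct entry *left, struct entry *center, struct entry *right)
{
    int elsepart = newlabel();
    int out = newlabel();

    gencode(left);
    emit_branch_false(elsepart);
    gencode(center);
    emit_jump(out);
    emit_label(elsepart);
    if (right != NULL)
        gencode(right);
    emit_label(out);
}
Note that both labels are fetched from newlabel() before any code that uses them is emitted, which is the caution mentioned above.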
Jpollard. Thank you for your most helpful explanation. I greatly appreciate it. Alvin...
|
I am not sure I will be able to trust recursion, as my MCU has a return stack that is only 16 levels deep. Some other way of doing recursion would have to be found, but I will ask the compiler maker what he thinks should be done.
|
Quote:
The sample is a translator to assembler - much smaller, and the resulting code generated doesn't require recursion. All interpreters (with subroutines) need recursion.

Now, one way to get around the hardware limitation is to define a pcode interpreter. These don't necessarily require recursion, but do need subroutines. Then the pcode interpreter can implement a stack however it wants. The programs that run on the pcode interpreter (also called a virtual machine) can then use recursion with a much less limited stack.

The problem you have listed is the limited physical memory. It is possible to get a pcode interpreter there - but the language must be aimed at the machine. That was why I tried pointing you to the language Forth - it was designed for minimal overhead AND simple parsing (it is a very simple stack machine, and uses Reverse Polish Notation to simplify the parsing - almost to the point of just being a scanner). Plus it has the advantage that source code is already available. It is only the conversion of the interpreter to the target machine that is necessary.

Using lex/bison is not the way to create compact programs - they aren't compact. They are designed to make implementing compilers easier. The parsers generated are huge, and require a lot of stack space due to the supporting functions required. |
1) Are you saying that the parser should be run on the PC and the entry structure:
Code:
struct entry {
2) Excuse me for being dense, but is the entry structure generated on the fly, and how is that done?
3) Can bison be called line by line by an external program and still keep its internal data structure intact? |
Quote:
Quote:
Quote:
Again, that is why I pointed you at a minimalist language requiring only simple parsing/scanning such as Forth (http://en.wikipedia.org/wiki/Forth_%...ng_language%29) |
Please bear with me.
1) Quote:
2) For the following: Quote:
Code:
$$.value = generate_while($2.value, $3.value); |
Quote:
Quote:
Quote:
Quote:
A PC doesn't have the same limitations. One thing that will be necessary though, is to have a solid definition of what the pseudo assembly language is. Even to the point of having a "simulator" to evaluate the results of your translator (it would help tremendously in debugging). I do apologize for not being fully consistent. The $$ variable is actually a struct entry *. I made an error using $$.value. The production rule for a while should be: Code:
...: WHILE expr statements WEND { Code:
struct entry *generate_while( |
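Filled in along the lines of that correction, the rule and its helper might look roughly like this (hedged sketch: WHILE_t, the node allocator, and the field names are assumptions, and YYSTYPE is taken to be struct entry * as stated above): Code:
/* bison rule - $2 and $3 are already struct entry pointers */
statement
    : WHILE expr statements WEND   { $$ = generate_while($2, $3); }
    ;

/* builds the {while, expr subtree, null, statement subtree} node; assumes <stdlib.h> */
struct entry *generate_while(struct entry *expr, struct entry *stmts)
{
    struct entry *e = malloc(sizeof *e);   /* a real translator should check for NULL */
    e->entry_type = WHILE_t;
    e->left   = expr;
    e->center = NULL;
    e->right  = stmts;
    return e;
}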
Thanks Jpollard you are a real life saver!
|
Would YYSTYPE be defined at the top of the rules.y bison file as:
Code:
%{ |
Would there in fact be a need for two or more structure types, one for
Code:
struct entry { |
Quote:
Characters, arrays, and integers are all elements of the underlying language. They get an entry_type value just like variables and statements do. One way to handle an array is to treat it just like an operator (add/sub/mul ...), only the entry type for indexing would be an INDEX_t. I'm assuming, for the time being, the grammar allows for multiple dimensions.

As a bit of expansion, an array (declared as A[M,N], where M,N is the size of the corresponding dimension - actual allocation calls for a block of memory of size M*N elements) would have an index operation done for a use of A[x,y] such that A[x,y] is equivalent to A[x*N+y] (which happens to be why the original BASIC only had a single dimension for an array). This works for arrays of strings just as well as for integers (using an array of pointers to the string values), and does assume the symbol table entry for A indicates the data type.

Please note - there is a bit of vagueness here. It is entirely possible to just have a tree {INDEX_t, NULL, ptr to var x, NULL, ptr to expression list} and, during the interpretation/code generation, use the parent entry {VARIABLE_t, ptr to symbol table entry for variable A, ptr to index tree} and extract the M,N values from the symbol table as needed. If the "ptr to index tree" is null, then there are no subscripts involved. Note: "ptr to var x/y" could be expression subtrees.

There are several ways to represent the indexing in the parse tree. In the above, I'm allowing for an "expression list", which can be used either for indexing (for example "A[x,y]") or for function parameter lists (for example "function_name(x,y)"). HOW the list is used is up to the parent entry structure. An array reference A[x,y] would be broken down into a tree like {VARIABLE_t, ptr to symbol table entry for A, NULL, {INDEX_t, NULL, NULL, ptr to list}}, where ptr to list would be a tree representing the expression list, like {EXPR_t, ptr to expression subtree, NULL, ptr to next EXPR_t entry}, which would allow for evaluation during code generation. This directly corresponds to a statement list (same type of BNF rules). It also happens to allow for as many dimensions as you want, just as it allows for as many function parameters as you want.

In the case of a function, the tree would look almost the same - the difference is at the top of the tree. Instead of a VARIABLE_t at the beginning, you have FUNCTION_t, and instead of INDEX_t, you have PARAM_t (parameter entry type). If the function call has no parameters then the pointer would be NULL. You could even have no syntactic difference between an array and a function (provided the function/array definition appears in the program source code BEFORE its use). This would make program constructs like "A(x,y)" be either an array reference or a function call, as determined by the symbol table entry for A. If A is an array, then it is indexing; if A is a function, then it is a parameter list. Personally (my opinion) it is easier to read code that does make a syntax difference. Visually, the code "A(x,y)" would be a function call, and "A[x,y]" an array, without needing to hunt for the declaration of A to find out which it is. |
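A hedged sketch of that A[x,y] == A[x*N+y] flattening, simplified to exactly two subscripts and following the {EXPR_t, ptr to expression, NULL, ptr to next EXPR_t} layout described above (eval() and the field names are illustrative): Code:
struct entry { int entry_type; struct entry *left, *center, *right; };
int eval(struct entry *e);                 /* assumed recursive expression evaluator */

/* index_list: the first EXPR_t entry; ncols: the N of an M x N array, from A's symbol table */
int array_offset(struct entry *index_list, int ncols)
{
    int x = eval(index_list->left);        /* first subscript expression */
    int y = eval(index_list->right->left); /* second subscript, via the next EXPR_t entry */
    return x * ncols + y;                  /* row-major flattening */
}
The general case just loops down the EXPR_t chain, multiplying the accumulated offset by each dimension in turn, which is also why the same list shape works unchanged for function parameter lists.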
Would YYSTYPE be defined at the top of the rules.y bison file as:
Code:
%{ |
I believe it needs to also be included in the scanner so that the scanner can generate some entries for it. I tend to put it in an include file that can then be included in both (this is mostly so the scanner can add an identifier to a symbol table, and return an entry structure that connects the parse tree with the symbol table entry).
note - the structure as it stands is not quite complete (it has enough for things within the parse tree), but things like "ptr to variable" usually mean it has to be pointing into a symbol table (which we haven't discussed yet). Now it is still a pointer, but as a variable there is a different structure that needs to be defined - one that takes account of initialization, number and size of dimensions (if any), and data type (at a minimum, integer and string). And then there is debugging information (what source code line it was defined on, for instance, or whether a value has ever been assigned...), as well as how function parameters are handled (wouldn't want a function name used as a variable...) |
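Something along these lines is what that shared include file might look like (a hedged sketch; the file name, the pointer-valued YYSTYPE, and the helper prototypes are illustrative): Code:
/* parsetree.h - included by both the .l and the .y file */
#ifndef PARSETREE_H
#define PARSETREE_H

struct entry;                          /* parse tree node, as discussed above */
struct symbol;                         /* symbol table entry */

#define YYSTYPE struct entry *         /* every $$ / $n is a pointer to a tree node */

struct entry  *new_entry(int entry_type, struct entry *left,
                         struct entry *center, struct entry *right);
struct symbol *lookup(const char *name);   /* add-or-find in the symbol table */

#endif
Both rules.y and the scanner would then start with %{ #include "parsetree.h" %}, so the scanner can add identifiers to the symbol table and hand back the same YYSTYPE the parser expects.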
For the time being I am eliminating functions, just using GOSUB with parameters passed as global variables. Do you have an idea as to how the structure entry should look? What about:
Code:
struct entry { |
I don't think vst or vtype would ever get used - the grammar itself specifies whether something is a terminal or not. The entry_type field would have the same (or equivalent) values. As for the parse tree, most terminals (and non-terminals) are discarded. The only terminal symbols that would remain are references to variables in the symbol table, or constant values (whether char, integer, string or whatever), and these would be identified in the entry type.
Identifiers are terminal symbols... but they are already identified in the entry type (which could be an int rather than an unsigned char if there are more than 256 different values; I thought there would only be 25 to 30 different entry types). A symbol table entry might be Code:
struct symbol {
And the field "symbol" can point to any string at all, which is why it can also hold constant string values (even if that string is a sequence of numerical digits)... With that consideration, the only way a variable would exist is if it is in a variable table. Now this would allow an entry structure like: Code:
struct entry {
I don't think you want to completely drop functions - I suspect you are going to want some built-in functions (like substr, index, ...). They aren't that bad - the most complex thing is that the defined parameter list of a function declaration becomes a "symbol table", and symbols within the function body are first looked up in that table, and if not defined, are then looked for in the global table. Mapping the parameter values to the code becomes a relatively simple task depending on the target interpreter. Most languages even have a table for variables defined and used only within the body. The complexity is deallocating it after the parsing is complete. Some just drop it with the expectation that the translator will exit - which releases all the memory anyway. |
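Pulling the description above together, the two structures might end up looking something like this (hedged; every field name here is a guess, not established code): Code:
struct symbol {
    char   *symbol;          /* the text: a variable name, or a constant's characters */
    int     sym_type;        /* variable, integer constant, string constant, ... */
    int     attributes;      /* defined, referenced, data type, ... */
    int     dims[2];         /* array dimensions, 0 if scalar */
    int     line_defined;    /* debugging: where it first appeared */
    struct symbol *next;     /* simple chained table */
};

struct entry {
    int     entry_type;      /* STATEMENT_t, IF_t, WHILE_t, VARIABLE_t, ADD_t, ... */
    struct symbol *sym;      /* non-NULL only for variable/constant entries */
    struct entry  *left;
    struct entry  *center;
    struct entry  *right;
};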
1) Where are the constants stored? - I will be using a signed 32 bit int, an unsigned char for a character and a pointer to a string (16 bit).
2) I have never written a bison/flex program. Would you be interested in looking over my code? I could pay a reasonable amount for your services. If you would be interested please contact me at schmitt_alvin[at]yahoo[dot]com . Thank you for all your help. Alvin... |
What had crossed my mind about constants (both numeric and string) was that these could be stored just like other symbols - in different tables. That way if the code used the same constant it would be identified as a duplicate, and only stored once.
Storage would be just like a variable name (a string) - so each "symbol" table would accumulate those constants - symbol names in a table for variables, string constants in a string table, and such. Besides the elimination of duplicates (more significant for 32-bit integer and string constants), it could also permit more optimal storage in the target machine memory. Integers (of various sizes) can have their own alignment requirements, where character strings don't. The various sized integers can be stored in the same table, but when they get created for the target machine code, sorted according to the storage size - all characters together, or put inline, as duplicating a byte (even if extended to 16 bits) is smaller than using addresses - so all the characters would be together, and the integers together. But it is just a thought.

I've started throwing some grammar around again myself, and may make a translator as an example. The target machine interpreter is also something I've done before. My old one was implemented as a protocol interpreter in a driver for a PDP-11 for a serial line block storage device (a DEC TU58 tape cartridge, which worked like the old Ftape drive - right down to the same block limits, and was cheaper/more reliable than a floppy at the time).

As for #2, you have an interesting problem and it has gotten me off my duff to poke around at small languages again. I wouldn't mind looking your code over; if you think it worth it, let's just see afterwards about any payment you think reasonable. Otherwise it's free. |
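A hedged sketch of that duplicate-elimination idea, reusing the struct symbol shape sketched earlier (INT_CONST_t and the table layout are illustrative): Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static struct symbol *int_constants = NULL;       /* head of the integer-constant table */

struct symbol *intern_int(long value)
{
    char text[32];
    snprintf(text, sizeof text, "%ld", value);    /* store the digits as the symbol text */

    for (struct symbol *s = int_constants; s; s = s->next)
        if (strcmp(s->symbol, text) == 0)
            return s;                             /* already present: reuse the entry */

    struct symbol *s = calloc(1, sizeof *s);
    s->symbol   = strdup(text);
    s->sym_type = INT_CONST_t;                    /* illustrative type tag */
    s->next     = int_constants;
    int_constants = s;
    return s;
}
A string-constant table would be the same function with the text used directly, and the code generator can later walk each table once, emitting integers and characters in separate, properly aligned blocks.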
Please contact me at schmitt_alvin[at]yahoo[dot]com if you like. Alvin...
|
After the parser has processed the language and created the tree, I guess the tree is sent to the MCU for processing as a pseudo program (interpreted?). How would the tree be sent to the micro? By walking the tree? Should the STRUCT ENTRY tree be in a predefined array, as sending that array would be easier than walking the tree? Since the tree is defined on a PC with infinite storage (!?), a very large array (larger than the total 53kb of RAM on the MCU) could be defined, and since the STRUCT ENTRY allocator program would know what storage was used (allocated linearly from index 0 on), it would be simple to send the array as one contiguous byte stream to the micro. What do you think?
|
The tree is "sort of" sent to the MCU. It isn't in a suitable form, as it uses local addresses and a rather large tree structure for the program.
What should be done (still on the development PC) is that the tree gets translated into a format directly related to the target machine. There are several ways to do that, and it depends on the target as to what form it takes:

1) Machine code - the parse tree can be converted to base assembler code for other development tools (such as an assembler) which make the final creation of the machine code. This usually has the best runtime, but sometimes can be a booger to debug.

2) Instructions for a virtual machine that actually runs on the target machine. This approach is like the Java virtual machine - you have to have a virtual machine program running on the target, and provide the application binary to it.

3) An in-between mix of the two. This one makes heavy use of runtime support (if the target has such a limited stack, then function/subroutine calls and returns are emulated by the runtime using a software-implemented stack). Beyond the emulated capability (such as some arithmetic operations, string operations, input/output), the code generated is combined with the runtime (either during translation of the assembler code, or by having some kind of linker do the work).

#3 is likely the best choice as it allows a minimal virtual machine, plus the advantage of mostly direct execution. There are two forms of this:

a) a threaded code virtual machine - this makes code generation relatively easy as you get to avoid the problems of register availability limits (solved by the virtual machine), and the code generation can be of several different types.

b) a "run time library" that defines things that the hardware doesn't provide (a general stack, for instance - runtime routines can emulate a fully functioning stack of nearly any depth, subroutine calls/returns using that general stack, and extra arithmetic capability (multiplication/division? even floating point is possible that way)).

Of these two options (3a and 3b), 3a is the easiest to implement in an emulator for debugging purposes; 3b has the best speed performance. I don't know how familiar you are with threaded code interpreters though. They can have nearly the same speed as direct execution, with an overhead of one to about 4 instructions (it depends on the specific CPU target). A PDP-11 with a 64k address range could do it with a single instruction (it had an indirect jump with an auto-increment register addressing mode). Others need two to five instructions to do it. |
If STRUCT ENTRY were an array of structures, then the array could be sent to the target as-is and storage could be malloc-allocated on the target. The array would still be valid, as the indexes are virtual. I guess int on the PC is 32 bits but the int index on the target would be 16 bits; the indexes would still be valid as long as the struct size was known on the target. If code generation were done, I would probably like to go for the Java-like abstract machine on the target, but a simple one of course. Then the language could be ported to other machines, maybe even as open source, although I would need some proprietary code to access the target MCU's peripherals. What do you think, and thank you for your input.
|
That is the advantage for the JVM approach. It even works for the threaded code as the runtime portion of threaded code is just a library of functions. The only difference between them and a subroutine is that the starting address value serves as the threaded code "instruction", and instead of a return, the last instruction(s) are those needed to update the thread version of a PC, and an indirect jump.
Neither requires application code (which uses the library) to be of the same license. For the threaded code it is "linked" to the runtime, and the runtime can have anything in it desired, thus a separate license, and the two are only combined by the user when the target object code is generated.

The advantage of adding the jump table is that instead of using the addresses directly, the thread instruction is used as an index into the jump table - thus needing a few more instructions, but making the application completely independent of the runtime, and not requiring a "link them together" step other than the VM needing a loader to put it in an appropriate place in memory.

In either case, it is not possible (usually due to the lack of memory protection) to separate the virtual machine entirely from the application - but it is very, very close, as the VM just copies the "application" into a typeless (or nearly so) array before starting interpretation. If the two can be combined externally (installed into the flash as a unit), you eliminate the need for the VM loader - and only the base boot loader would be used to copy from flash to RAM. In some architectures, the "RAM" would also be the flash - eliminating the loader entirely. It would all look like the BIOS at that point. |
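For what it is worth, here is a hedged C sketch of the jump-table flavour of threaded code described above; a real MCU runtime would probably be assembler, and the opcode set here is made up: Code:
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

typedef void (*op_fn)(void);

static int32_t stack[64];              /* software stack, independent of the 16-entry hardware one */
static int     sp = 0;
static const uint16_t *tpc;            /* the "thread PC": walks the array of opcodes/operands */

static void op_push(void)  { stack[sp++] = (int32_t)*tpc++; }          /* operand follows the opcode */
static void op_add(void)   { sp--; stack[sp-1] += stack[sp]; }
static void op_print(void) { sp--; /* hand stack[sp] to whatever output the target has */ }
static void op_halt(void)  { /* never dispatched; keeps the table aligned with the enum */ }

static const op_fn jump_table[] = { op_push, op_add, op_print, op_halt };

void run(const uint16_t *code)
{
    tpc = code;
    for (;;) {
        uint16_t op = *tpc++;
        if (op == OP_HALT)
            break;
        jump_table[op]();              /* the indirect jump through the table */
    }
}
The application is just an array like { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT } - completely independent of the runtime, which is the licensing/loading property described above.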
Quote:
The problem with that is the size of the struct entry - there is no need to bulk up most of the structures. If you look, almost none of them are fully used - most only use the left/right pointers. Converting the tree into a code sequence eliminates the structure entirely, and saves all the memory and time needed to walk the tree during execution. Effectively, the walking is done during the final translation, so the VM/interpreter doesn't have to do that work. |
I have the .lex code of:
Code:
%{ Code:
constant: HEX {$$=$1<intValue>.value;}
Is this the correct interpretation of what you want to do?
Semantically, there is no difference between HEX and DEC. If you notice (the result of both is an int), the scanner specification turns them into an integer, so that the syntax element constant only needs "NUMBER" for these two.
I think there is a scanner problem though. How is the scanner going to tell an input sequence like "v=a-5" (an expression) from "v= -5" (a constant)? If the "-5" is handled as a CONSTANT, then the expression is syntactically wrong. The problem is caused by the "-" used to set the negative. The parsing of the expression (whether "a-5" or "-5") would be identified differently: as "expression - CONSTANT", or as "- CONSTANT" separately. An optimizing pass would identify the difference between the first and second case, and then collapse the "- CONSTANT" into a single element where the constant is a negative. It could be done during the parse, but it requires the action routine to look at the "- expression" subtree to decide if it is a constant and can be collapsed into a negative constant.

Now, semantic handling of the constants can be done in different ways... My original thought (not necessarily valid, mind) was to store the various constants in different tables, one table for each type (it does mean a different search function for each table, or a parameter to tell a search function which type to look for). The tables provide a way to accomplish duplicate elimination (saving memory, and better data packing in the target system). It also reduces the number of types for the parse tree to just pointers to entries in the various tables. If the table structure definition can contain all types, then the parse tree only needs to contain references to either other parse tree entries (forming the tree) or a pointer to a table entry.

It does mean that the scanner would be the one to enter the constants into the various tables, rather than it being done in the action portion of the parser rules (which would instead provide updates to the table in the form of attribute flags and such). One of the purposes of the tables (outside the duplicate elimination) is to carry extra information (the attributes, and the "other" field) which can be used when generating code, and possibly debugging information (semantic errors - like making the value of a character too large, conversion of an integer to a character, character to an integer...). |
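For what it is worth, the scanner side of the NUMBER idea might look roughly like this (hedged: the token name NUMBER, the yylval.ival field, and the C-style 0x prefix are assumptions - a BASIC dialect using &H for hex would need a different pattern and base): Code:
/* flex fragment - both forms end up as the single token NUMBER; assumes <stdlib.h> */
0[xX][0-9A-Fa-f]+   { yylval.ival = (int)strtol(yytext, NULL, 16); return NUMBER; }
[0-9]+              { yylval.ival = (int)strtol(yytext, NULL, 10); return NUMBER; }
Leaving the "-" out of these patterns sidesteps the "v=a-5" versus "v= -5" problem described above: the minus stays a separate token, and the grammar (or a later folding pass) decides whether it is subtraction or a negative constant.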