Share your knowledge at the LQ Wiki.
Go Back > Blogs > rainbowsally
User Name


Rate this Entry

Parser Test 1

Posted 01-02-2013 at 08:24 PM by rainbowsally
Updated 01-02-2013 at 08:26 PM by rainbowsally

Parser Test 1


See the previous blog entry: "Playing with parsing" here.
to grab the files we need to include for these tests and a working mc2 definitions file for those who have mc2. Type "mc2 -init" first time. After that "make update" will add/remove files from the makefile.

[Note that we can link OR include the sources directly. This is how the mc2 MULTI types of files can be compiled using a library of shared functions without the need for an actual lib. The lib (such as it is) is the files in the src/inc folder. Use the parser-inc.sfxz file to self-extract it.]

In this test we will look for the "BEGIN COMMENT" and "END COMMENT" markers in a file an insert VB-like comments between these markers. Visual Basic doesn't have multi-line comments. This was an actual application for a friend. :-)

The reason for the 'customizable' delimiters is because normal C-style comments like "/*" can be mixed up with paths. Sorry about the long delimiters. Hope you like to type! ;-) But notice that spaces between words is not a problem.

First let's create the application, see what it does, and then talk.

file: src/test1.cpp
purpose: source file
// main.cpp - convert text between delimiters to VB style comments

/* The basic idea of a parser is to identify input and convert it to another form
 * which may be text (a C program for example) or byte codes or anything else.
 * In this scheme the input is INVIOLATE! This is unlike flex which temporarily 
 * pokes holes in the input in order to terminate yytext.  Instead we use the 
 * ability of the output buffer to 'undo' changes and push and pop temporary 
 * strings that automatically get recycled, which allows us great flexibility in 
 * manipulating the output text using C calls such as 'sprintf()'.
 * This demo is best viewed in a debugger like kdbg or some other insight-like 
 * gdb front end.  The 'watchpoint' you'll find most interesting is named 'obj'.
 * The Computer Mad Science Team
 * :-)

#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include "inc/parser2.cpp"

void dbg(){} // for a non-moving breakpoint

bool parse();

#define obj (*g_parser_obj)

bool parse();

const char *delim1, *delim2;

int usage(int errcode);
int load_file(const char* filename, char** pbuf, size_t* pbuflen);

int main(int argc, char** argv)
  const char* filename;
    case 1:
    case 3:
      return usage(1);
    case 2:
      if(strcmp(argv[1], "--help") == 0)
        return usage(0);
      // else
        delim1 = "/*";
        delim2 = "*/";
        filename = argv[1];
    case 4:
      filename = argv[1];
      delim1 = argv[2];
      delim2 = argv[3];
      return usage(1);
  char* buf;
  size_t buflen;
  load_file(argv[1], &buf, &buflen);
  if(buflen == 0)
    fprintf(stderr, "Can't read file '%s'\n", argv[1]);
    return 1;
  parser_new(buf, buflen, true);
  // run the parser
    return 1;
  return 0;

// the parser

bool c2vb();

bool parse()
  while(1) // the loop minus a zzwrap() function.
    if(is_eoi() || obj.errorflag)
    // no wrap function because all chars are significant.
      // here's the BNF grammar
      // C2VB | ERROR | COPY_CHAR
      // here's the C implementation
      obj.state = 
        c2vb() || obj.errorflag
  // no setjump, just the error flag as part of the 'state' returned.
  return obj.state = is_eoi() && !obj.errorflag;

bool c2vb()
  // We are interested in the finding the first delimiter here and 
  // we don't care if it has a newline because we want to make sure 
  // the delimiter appears on a newline regardless of its position 
  // on a line.
  SAVE_PTRS();    // we'll need to be able to undo.
  // Here's the BNF grammar for the test, whether or not to 
  // enter the VB comment insertion loop.
  // and here's the implementation.  Note the parens, not braces below, C 
  // guys.  This is a single boolean statement.
  obj.state =
       (skip_newline() OPT)
    && skip(delim1)
    && (skip_newline() OPT)
  // Note: more sure than just a check for a newline would be a check for 
  // [NEWLINE] [WHITESPACES] before the delimiter and for
  // [WHITESPACES] [NEWLINE] after it.
  // if state (obj.state) is true at this point we have skipped past an 
  // optional newline and the first delimiter.
  // if not, we need to undo and get the heck out of here. :-)  
    // redundant since obj.state is already false but it's harmless
    // and makes the code more readable. Costs two opcodes in assembler
    return obj.state = false; 
  // now we are almost ready for the loop part.  We need to put the 
  // delimiter back so we can see it, also commented out.  This is 
  // equivalent to the 'action' part of a bison scanner, but as you 
  // can see WE are in control of exactly how it's done.  No magic.
  // no hidden code.
  put("\n' ");
  put("\n' "); // and now we're ready for the text until DELIM2
  // Note: there's a function like sprintf() called putfmt() that 
  // could have been used to combine all three of those statments.
  // Now we enter the commenting loop.  
    // save and restore pointers functions are nestable but you can't
    // save pointers twice at the same nesting level (or 'scope').  
    // See the macro. in parser2.h
    // when do we break out of the loop?
    if(is_eoi())    // at end of input (eoi).
      set_errormsg("Unexpected end of file"); // needs setjmp?
      return obj.state = false; // not a match
    // when we encounter a second first delimiter because we don't 
    // allow (count) these comments to be nested.
      set_errormsg("Unexpected '%s'", delim1);
      return obj.state = false; // not a match
    // or when we find the second delimiter, and again we'll
    // handle the newlines ourselves.  C guys, this is similar 
    // the boolean statement above, but is followed directly 
    // by the 'action' part which will more closely resemble 
    // bison action code.
    // Scoping is cheap.  We don't need to do this but it's a 
    // safe way to SAVE and RESTORE PTRS being sure of the 
    // pointer positions when we do so.
      // New scope, and new pointers we have to save.
          (skip_newline() OPT)
        && skip(delim2)
        && (skip_newline() OPT)
        putfmt("\n' %s\n", delim2);   // See put() above
        return obj.state = true;      // is a completed match
      // Undo any optional parts of the parse above.
    } // end of scope

    // Not exiting the loop, so then when do we insert
    // a comment char?  Here's the BNF grammar:
    // NEWLINE
    // That's the match rule.
    // And we can use the newlines we encounter so we 'copy'.
      put("' ");          // these functions with parameters always 
                          // return 'true' states which makes them 
                          // usable in boolean evaluations.  Of 
                          // course we don't need to use the state 
                          // from 'put()' above, though.

    // and each parser block should set the state to the same logic level 
    // it returns.  There are only two exceptions to this rule and they 
    // are both experiemental.
    // Since these are 'normal' parser blocks above, we can, for example,
    // do this, relying on the internal state variable in the parser's 'obj'
      if(!obj.state)      // same as 'else' or BNF: '|'
  // We never get down here, but if we did 'break's above instead of 'return's
  // we might want to RESET_PTRS() here.  In this case we don't because we 
  // want to preserve the position where an error may have occurred.
  return obj.state;

// normalish C code from here to end.

int load_file(const char* fname, char** pbuf, size_t* pbuflen)
  *pbuflen = 0;
  *pbuf = 0;
  FILE* ifile = fopen(fname, "r");
  if(!ifile) return 1;
  fseek(ifile, 0, SEEK_END);
  *pbuflen = ftell(ifile);
  fseek(ifile, 0, SEEK_SET);
  *pbuf = (char*)malloc(*pbuflen);
    fclose(ifile); return 2;
  fread(*pbuf, 1, *pbuflen, ifile);
  return 0;

int usage(int errcode)
    "Usage: c2vb-multi-line <inputfile> [delim1 delim2]\n"
    "  Converts (and preserves) multiline comments with VB style\n"
    "  Comment chars\n"
    "  Outputs to stdout only.\n"
    "Note: if delim1 and delim2 are not supplied the default\n"
    "is to use '/*' and '*/' respectively, which MAY get confused\n"
    "if the file contains linux or http style paths.  The delimiters\n"
    "should be quoted and SPACES in the delimiters is OK.\n"
    "  (Parser test 3 - The Computer Mad Science Team.\n"
  return errcode;
And here's a test file.

file: test1.txt
purpose: test file for #1
  test1 test1.txt "BEGIN<SP>COMMENT" "END<SP>COMMENT"
to test the VB comment function, where <SP> is a space between the words.

Flex GOTCHAS you gotta love.

1. If any code is in the C Code sections, the file may overwrite other C files 
even if you explicity tell it to write to some other file name with --outfile=<name>.

There may be other causes for flex to write to the wrong file as well.  But this 
work-around seems to work.

Work-around: run 'touch <filename[s]' to mark the existing files you don't want 
clobbered before running flex.  This marks the files you want to protect as being END COMMENT
newer than any others in the directory and appears (so far) to keep flex from 
BEGIN COMMENT screwing up all your work. END COMMENT
test1 test1.txt "BEGIN COMMENT" "END COMMENT"
to check it out.

In this simple example, the built-ins can be used for all the parser elements other than c2vb() itself.

Points of interest in the source code are:

In the main loop called 'parse()' the BNF-like grammar describes the parse as
The control flow is from C2VB to ERROR to COPY_CHAR as long as the internal state is false.

An 'OR' (the '|' symbol) will break out of the chain at the first occurrence of a true state.

Makes sense?

What happens here in C:
if(A || B || C)
If A is true does B ever execute? (No. And so on down the chain.)

Also in that parser code we see only two ways to break out of the loop.

is_eoi() sets and returns the state = true only if the input pointer is at the end of the file (or input buffer). 'eoi' stands for 'end of input', as 'eof' stands for 'end of file'.

And the other way to exit is if the errorflag is set. It's addressed as 'obj.errorflag' and it's set along with the error message by the call to 'set_errormsg()'.

The name 'obj' as in obj.errormsg is defined by a macro and is mainly for making typing easier and (especially) for setting a watch point in a debugger so that all the internal workings can be seen at once through a single 'obj' definition. (Recommend kdbg v >= 5.0 or some other insight-like gdb front end.)

The C2VB is the other function of interest. It's lowercase in the app because uppercase will be reserved for executable tokens and their IDs.

In c2vb() we decide when to enter a loop and we decide when to exit it. And the loop is where we 'translate' one pattern to another analyzing optional and required parts that determine the state of the parse through each loop.

Take a look at the source code. There are tons of comments there. More comments than code, actually. :-)

See why we might want to include a NEWLINE (and even more stuff) in a parse looking for a token, and then throw it all away! :-)

And stay tuned for something a bit more challenging. In test2 we actually do some serious parsing and prepare for a hairy high dive into 'compiling' executable tokens to actually do the operations that we just identify in test2.

[Here's hoping. This compiling executable tokens idea is still very experimental though I have successfully done similar work, even algebraic calculators in Forth. But test2 works and it shouldn't be a big leap from 'identifying' to 'doing'.]

In the meantime, think about this:

How would you parse text so that if you had a keyword like 'quit', it wouldn't execute an action intended for 'quit' when the input was "quite" (having and 'e' at the end).

There are two logical ways.

Flex/Bison arrange the text (or 'rules') by length, always checking the longest first.

But the longest isn't always the one that occurs the most frequently so... is there another way? Is there a way that YOU can decide what order these evaluations are done be sure your match is correct?

What is that other way? (Remember, no strlen() kinds of stuff because the input buffer is inviolate and considered read-only.)

[Hint: find 'cname' in the parser sources. That's NOT the answer. It's just a parser element that can lead to a workable answer. Need more? Okay, what does the \b regex do at the end of a word? And that's all you get. :-) ]

- The Computer Mad Science Team

Posted in Uncategorized
Views 461 Comments 0
« Prev     Main     Next »
Total Comments 0




All times are GMT -5. The time now is 12:01 PM.

Main Menu

Write for LQ is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration