[SOLVED] parsing with gnu-flex and bison fails for space and brace

RudraB · 05-02-2013, 11:04 AM

I am trying to parse a file like this (simpler than what I actually need, but fine as a starting point):

Code:

@Book{key2,
 Author="Some2VALUE" ,
 Title="VALUE2" 
}

The lexer part (gnu-flex) is:

Code:

[A-Za-z"][^\\\"  \n\(\),=\{\}#~_]*      { yylval.sval = strdup(yytext); return KEY; }
@[A-Za-z][A-Za-z]+                 {yylval.sval = strdup(yytext + 1); return ENTRYTYPE;}
[ \t\n]                                ; /* ignore whitespace */
[{}=,]                                 { return *yytext; }
.                                      { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }

I then parse this (GNU Bison) with:

Code:

%union
{
    char    *sval;
};

%token <sval> ENTRYTYPE
%type <sval> VALUE
%token <sval> KEY

%start Input

%%

Input: Entry
      | Input Entry ;  /* input is one or more entries */

Entry: 
     ENTRYTYPE '{' KEY ','{ 
         b_entry.type = $1; 
         b_entry.id = $3;
         b_entry.table = g_hash_table_new_full(g_str_hash, g_str_equal, free, free);} 
     KeyVals '}' {
         parse_entry(&b_entry);
         g_hash_table_destroy(b_entry.table);
         free(b_entry.type); free(b_entry.id);
         b_entry.table = NULL;
         b_entry.type = b_entry.id = NULL;}
     ;

KeyVals: 
      /* empty */ 
      | KeyVals KeyVal ; /* zero or more keyvals */

VALUE:
      /*empty*/
      | KEY 
      | VALUE KEY 
      ;
KeyVal: 
      /*empty*/
      KEY '=' VALUE ',' { g_hash_table_replace(b_entry.table, $1, $3); }
      | KEY '=' VALUE  { g_hash_table_replace(b_entry.table, $1, $3); }
      | error '\n' {yyerrok;}
      ;

There are a few problems, so I need to generalize both the lexer and the parser:
1) It cannot read a sentence: if the RHS is Author="Some Value", it only shows "Some, i.e. the space is not handled. I don't know how to fix this.
2) If I enclose the RHS in {} rather than "", it gives a syntax error.
I am looking for help with these two situations.

ntubski · 05-02-2013, 03:16 PM

Looking just at the lexer:

book-lex.l

Code:

/* -*- mode: c; -*- */

%option noyywrap

%%

[A-Za-z\"][^\\\"  \n\(\),=\{\}#~_]*    { printf("KEY: %s\n", yytext); }
@[A-Za-z][A-Za-z]+                     { printf("ENTRTYTYPE: %s\n", yytext+1); }
[ \t\n]                                ; /* ignore whitespace */
[{}=,]                                 { printf("%s\n", yytext); }
.                                      { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }

%%
int main()
{
    return yylex();
}

compile (no Makefile required: GNU make's builtin rules suffice)

Code:

% make LEX=flex book-lex
flex  -t book-lex.l > book-lex.c
cc    -c -o book-lex.o book-lex.c
cc   book-lex.o   -o book-lex
rm book-lex.c book-lex.o

execute:

Code:

% ./book-lex < book.txt
ENTRTYTYPE: Book
{
KEY: key2
,
KEY: Author
=
KEY: "Some2VALUE
KEY: "
,
KEY: Title
=
KEY: "VALUE2
KEY: "
}

You can see the final quote in "Some2VALUE" being lexed as a separate token, which is probably not what you want. I suggest having separate lexing rules for quoted and unquoted tokens:

Code:

[A-Za-z][A-Za-z0-9]*            { printf("KEY: %s\n", yytext); }
\"[^\"]*\"                      { printf("KEY: %.*s\n", yyleng-2, yytext+1); }

Enclosing the RHS with {} will be a bit more tricky as you are already enclosing the Entry value in {}. If you don't need any further nesting of {}s, you could probably handle it in the lexer using Start Conditions, but it may be a better idea to handle it in the parser.
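
To make the nesting problem concrete, here is a minimal sketch in plain C (not flex) of what reading a braced value involves: count the brace depth and stop at the '}' that brings it back to zero. The function name read_braced is hypothetical and is not part of flex or bison. It only illustrates why everything inside the braces, including white space and inner braces, has to be captured as one value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of flex or bison.
 * `s` must point at an opening '{'. Returns a malloc'd copy of the text
 * between that brace and its matching '}', keeping spaces and nested
 * braces intact, or NULL if the braces are unbalanced. */
char *read_braced(const char *s)
{
    assert(s != NULL && *s == '{');
    int depth = 0;
    const char *p = s;
    for (; *p != '\0'; p++) {
        if (*p == '{') {
            depth++;
        } else if (*p == '}') {
            depth--;
            if (depth == 0)
                break;              /* found the matching close brace */
        }
    }
    if (*p != '}')                  /* ran off the end: unbalanced input */
        return NULL;
    size_t len = (size_t)(p - s) - 1;   /* exclude the outer braces */
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s + 1, len);
    out[len] = '\0';
    return out;
}
```

Called on the text {Some {Nested} Value}, this returns Some {Nested} Value, and the caller has to free() the result. A single regular expression cannot do this counting, which is why the choice is between a start condition in the lexer and recursion in the grammar.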

RudraB · 05-02-2013, 03:31 PM

Quote:

Originally Posted by ntubski

it may be a better idea to handle it in the parser.

Ntubski,
Thanks for your reply. I already have a lexer and parser that parse it correctly when the strings are quoted. But in the more general case the strings may be braced, and even nested braces are common.
So the last line of your reply is my actual goal. As you can see from my parser and lexer, I am trying to parse it using the grammar, but have not achieved much.
I need help with the grammar.

ntubski · 05-02-2013, 08:26 PM

Ah, I looked at the BibTeX Format Description: because the stuff within braces has to include everything, including white space, you have to tell the lexer about it. Here is a parser that just prints out the Entries (I didn't bother freeing memory, so it's very leaky):

Code:

/* -*- mode: c; -*- */
%{
#include "book-parse.h"
%}

%option noyywrap

%x braceV

%%

<braceV>[^{}]* { yylval.sval = strdup(yytext); return VALUE; }
<braceV>[{}]   { return *yytext; }

[A-Za-z][A-Za-z0-9]*    { yylval.sval = strdup(yytext); return KEY; }
\"[^\"]*\"              { yylval.sval = strndup(yytext+1, yyleng-2); return VALUE; }
[0-9]+                  { yylval.sval = strdup(yytext); return VALUE; }
@[A-Za-z][A-Za-z]+      { yylval.sval = strdup(yytext+1); return ENTRYTYPE; }
[ \t\n]                 ; /* ignore whitespace */
[{}=,]                  { return *yytext; }
.                       { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }

%%

void lex_brace() {
    BEGIN(braceV);
}
void lex_normal() {
    BEGIN(0);
}

Code:

%union
{
    char    *sval;
};

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char* concat(char* str1, char* str2);

void lex_brace();
void lex_normal();
%}

%token <sval> ENTRYTYPE
%token <sval> VALUE
%token <sval> KEY
%type <sval> Value
%type <sval> BraceV
%type <sval> BraceVs

%start Input

%%

Input : Entry
      | Input Entry
      ;

Entry: ENTRYTYPE '{' KEY { printf(" %s\n", $3); } ',' KeyVals '}';

KeyVals : /* empty */
        | KeyVal
        | KeyVal ',' KeyVals
        ;

KeyVal : KEY '=' Value           { printf("  %s = %s\n", $1, $3); }
       ;

Value : '{' { lex_brace(); } BraceVs { lex_normal(); } '}' { $$ = $3; }
      | VALUE
      ;

BraceVs : /* empty */ { $$ = ""; }
        | BraceV BraceVs { $$ = concat($1, $2); }
        ;

BraceV : VALUE
       | '{' BraceVs '}' { $$ = concat(concat("{", $2), "}"); }
       ;


%%

int main() {
    return yyparse();
}
int yyerror(char *s) {
    printf("yyerror : %s\n",s);
    return 0;
}

/* let's pretend we have garbage collection for simplicity */
char* concat(char* str1, char* str2) {
    char* ret = malloc(strlen(str1) + strlen(str2) + 1);
    strcat(ret, str1);
    strcat(ret, str2);
    return ret;
}

RudraB · 05-03-2013, 03:37 PM

Ntubski,
Thanks a lot.
Being a novice, it's too sophisticated for me.

RudraB · 05-23-2013, 08:53 AM

Hi Ntubski and all,
Sorry to reopen a solved thread, but it is probably the best thing for the sake of completeness.
I have adopted the code Ntubski provided and just put it into a GTK tree view and a hash table.
Now it looks like this.
The lexer:

Code:

%{
#include "bib.h"
  int line;
%}

%option noyywrap

%x braceV

%%

<braceV>[^{}]* { yylval.sval = strdup(yytext); return VALUE; }
<braceV>[{}]   { return *yytext; }

[A-Za-z][A-Za-z0-9_":]* 	{ yylval.sval = strdup(yytext); return KEY; }
\".*\"              	{ yylval.sval = strndup(yytext+1, yyleng-2); return VALUE; }
[0-9]+                  	{ yylval.sval = strdup(yytext); return VALUE; }
@[A-Za-z][A-Za-z]+      	{ yylval.sval = strdup(yytext+1); return ENTRYTYPE; }
[ \t\n]                 	; /* ignore whitespace */
[{}=,]                  	{ return *yytext; }
.                       	{ fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }
\n				{line++;printf("%d",line);}

%%

void lex_brace() {
    BEGIN(braceV);
}
void lex_normal() {
    BEGIN(0);
}

The parser

Code:

%{
#include <stdio.h>
#include <glib.h>
#include <gtk/gtk.h>
#include <string.h>
#include <glib/gstdio.h>
#include <fcntl.h>
enum
{
  COL_BIB_KEY=0,
  COL_BIB_TYPE,	COL_BIB_AUTHOR,	COL_BIB_YEAR,
  NUM_COLS} ;

char* concat(char* str1, char* str2);

void lex_brace();
void lex_normal();
#define slen 1064
 int yylex(void);
/*enum
{
  COL_BIB_KEY=0,
  COL_BIB_TYPE,	COL_BIB_AUTHOR,	COL_BIB_YEAR,
  NUM_COLS} ;
*/  
typedef struct {
  char *type;
  char *id;
  GHashTable *table;
} BibEntry;

BibEntry b_entry;
GtkTreeIter siter;
GtkListStore *store;

void yyerror(char *s)
{
  printf("YYERROR : %s\n", s);
}
void parse_entry(BibEntry *bib_entry);

%}

// Symbols.
%union
{
    char    *sval;
};

%token <sval> ENTRYTYPE
%token <sval> VALUE
%token <sval> KEY
%token OBRACE
%token EBRACE
%token QUOTE
%token SEMICOLON 
%type <sval> Value
%type <sval> BraceV
%type <sval> BraceVs
%start Input

%%

Input: Entry
      | Input Entry ;  /* input is one or more entries */

Entry: 
     ENTRYTYPE '{' KEY ','{ 
         b_entry.type = $1; 
         b_entry.id = $3;
         b_entry.table = g_hash_table_new_full(g_str_hash, g_str_equal, free, free);} 
     KeyVals '}' {
         parse_entry(&b_entry);
         g_hash_table_destroy(b_entry.table);
         free(b_entry.type); free(b_entry.id);
         b_entry.table = NULL;
         b_entry.type = b_entry.id = NULL;}
     ;

KeyVals : /* empty */
        | KeyVal
        | KeyVal ',' KeyVals
        ;

KeyVal : KEY '=' Value          { g_hash_table_replace(b_entry.table, $1, $3); 
                                printf("%s\n",$3);
				}
       ;

Value : '{' { lex_brace(); } BraceVs { lex_normal(); } '}' { $$ = $3; }
      | VALUE
      ;

BraceVs : /* empty */ { $$ = ""; }
        | BraceV BraceVs { $$ = concat($1, $2); }
        ;

BraceV : VALUE
       | '{' BraceVs '}' { $$ = concat(concat("{", $2), "}"); }
       ;
%%


void parse_entry(BibEntry *bib_entry)
{
  char *author = "", *year = "";
  GHashTableIter iter;
  gpointer key, val;
  char **kiter;
  int i;
 char *keys[] = {"id", "type", "author", "year", "title", "publisher", "editor", 
    "volume", "number", "pages", "month", "note", "address", "edition", "journal",
    "series", "book", "chapter", "organization", NULL};
  char *vals[] = {NULL,  NULL,  NULL, NULL, NULL,
    NULL,  NULL,  NULL, NULL, NULL,
    NULL,  NULL,  NULL, NULL, NULL,
    NULL,	 NULL,  NULL, NULL, NULL};

  g_hash_table_iter_init(&iter, bib_entry->table);
  while (g_hash_table_iter_next(&iter, &key, &val)) {
  for (kiter = keys, i = 0; *kiter; kiter++, i++)
    {
    if (!g_ascii_strcasecmp(*kiter, key)) {
    vals[i] = g_strndup(val,slen);
	break;
    }
  }
  }

  gtk_list_store_append (store, &siter);
  gtk_list_store_set (store, &siter,
                      COL_BIB_AUTHOR, 		vals[COL_BIB_AUTHOR],
                      COL_BIB_TYPE, 		bib_entry->type,
                      COL_BIB_KEY, 		bib_entry->id,
                      COL_BIB_YEAR, 		vals[COL_BIB_YEAR],
                      -1);
}

void setup_tree(GtkWidget *tree)
{
  GtkCellRenderer *renderer;
  GtkTreeViewColumn *column;

  renderer = gtk_cell_renderer_text_new ();
  /********************************************/
    g_object_set(G_OBJECT(renderer), "wrap-mode", PANGO_WRAP_WORD, 
      "wrap-width",300, NULL);
  column=gtk_tree_view_column_new_with_attributes (
      "Author", renderer,
      "text", COL_BIB_AUTHOR,
      NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);

  /********************************************/
  column = gtk_tree_view_column_new_with_attributes
    ("KEY", renderer, "text",COL_BIB_KEY, NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);
  /********************************************/
  column = gtk_tree_view_column_new_with_attributes
    ("Type", renderer, "text",COL_BIB_TYPE , NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);
  /********************************************/
  column = gtk_tree_view_column_new_with_attributes
    ("Year", renderer, "text",COL_BIB_YEAR, NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);
}
char* concat(char* str1, char* str2) {
    char* ret = malloc(strlen(str1) + strlen(str2) + 1);
    strcat(ret, str1);
    strcat(ret, str2);
    return ret;
}

and the C routine:

Code:

#include <stdio.h>
#include <glib.h>
#include <gtk/gtk.h>
#include <string.h>
#include <glib/gstdio.h>
#include <fcntl.h>
char* buffer;
gsize length;
  GError* error=NULL;
enum
{
  COL_BIB_KEY=0,
  COL_BIB_TYPE,	COL_BIB_AUTHOR,	COL_BIB_YEAR,
  NUM_COLS} ;
#define slen 102

GtkTreeIter siter;
GtkListStore *store;
typedef struct {
  char *type;
  char *id;
  GHashTable *table;
} BibEntry;

BibEntry b_entry;

void parse_entry (BibEntry *bib_entry);
void setup_tree(GtkWidget *tree);

int main(int argc, char** argv)
{
  gtk_init(&argc, &argv);
  GtkWidget  *window = gtk_window_new (GTK_WINDOW_TOPLEVEL);    
  GtkWidget *tree = gtk_tree_view_new();
  GtkWidget *scrolledw = gtk_scrolled_window_new(NULL, NULL);
extern  FILE *yyin;
extern int yyparse (void);
extern yy_create_buffer;
  setup_tree(tree);

  gtk_container_add(GTK_CONTAINER(window), scrolledw);
  gtk_container_add(GTK_CONTAINER(scrolledw), tree);
  store = gtk_list_store_new(NUM_COLS, 
      G_TYPE_STRING, G_TYPE_STRING, G_TYPE_STRING, G_TYPE_STRING);
//FILE *fin=fopen("u2.bib","r");
//yyin=fin;
g_file_get_contents("/home/rudra/Desktop/u2.bib", &buffer, &length , &error);
 yyin=fmemopen(buffer,strlen(buffer),"r");
  yyparse();
g_file_get_contents("/home/rudra/Desktop/u2.bib", &buffer, &length , &error);
 yyin=fmemopen(buffer,strlen(buffer),"r");
  yyparse();

  gtk_tree_view_set_model (GTK_TREE_VIEW (tree), GTK_TREE_MODEL (store));
  g_object_unref (store);
  gtk_widget_show_all (window);

  g_signal_connect(window, "destroy", G_CALLBACK(gtk_main_quit), NULL);

  gtk_main();

  return 0;
}

This is a minimal example.
The problem is that, even ignoring all the GTK parts, the printf statement (line #86 of the parser) prints a garbage value for Key={Value}, but not for Key="Value" or Key="{Value}".
Also, the garbage appears only when I read the file the second time, not the first. (This resembles the case where I open the file using a GTK widget, which is why I added the second yyparse.)
The output of printf looks like:
��b<Rudra } when the actual thing is {Rudra}
and emits warning:

Quote:

Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()

Please help.

ntubski · 05-23-2013, 11:37 AM

Sorry, I had a bug. The first strcat() in concat() should be strcpy():

Code:

char* concat(char* str1, char* str2) {
    char* ret = malloc(strlen(str1) + strlen(str2) + 1);
    strcpy(ret, str1); // must not strcat to an uninitialized location
    strcat(ret, str2);
    return ret;
}
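
Some background on why this matters: malloc() returns uninitialized memory, and strcat() first searches its destination for a terminating NUL before appending. If the fresh buffer happens to start with a zero byte the call appears to work, which is probably why the first parse looked fine. Once the allocator hands back recycled memory, the old bytes end up in front of the string, or the write runs past the end of the buffer. As a sketch, a variant that never reads the destination at all could look like the following. The const qualifiers, the NULL check, and the use of memcpy() are additions of mine, not part of the code in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Variant of concat() from the thread: returns a newly malloc'd string
 * holding str1 followed by str2, or NULL if allocation fails.
 * The caller owns the result and must free() it. */
char *concat(const char *str1, const char *str2)
{
    assert(str1 != NULL && str2 != NULL);
    size_t len1 = strlen(str1);
    size_t len2 = strlen(str2);
    char *ret = malloc(len1 + len2 + 1);
    if (ret == NULL)
        return NULL;
    memcpy(ret, str1, len1);             /* write at the start; never search the buffer */
    memcpy(ret + len1, str2, len2 + 1);  /* the +1 copies the terminating NUL */
    return ret;
}
```

With this version, concat(concat("{", "Rudra"), "}") yields {Rudra} regardless of what the allocated memory contained beforehand. The inner result is still never freed, so the leak remains.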

RudraB · 05-23-2013, 12:39 PM

ntubski, thanks a lot.
thanks for your patience and help.
That correction is working.

RudraB · 05-25-2013, 08:31 AM

ntubski,
Though the problem is now solved, I am confused why the warning was only when the string starts with {, not with ".
A little tutorial?

ntubski · 05-25-2013, 05:41 PM

Quote:

Originally Posted by RudraB

I am confused why the warning was only when the string starts with {, not with ".
A little tutorial?

In the case of the " quoted string, the concat() function was not used to construct the value, so there is no problem. A " quoted string is lexed as a VALUE, so the second alternative of Value is chosen. A {} quoted string is lexed as '{' VALUE '}', so the BraceVs alternative of Value is chosen. Both BraceVs and BraceV use concat() in their semantic actions to create a value for $$.

Code:

Value : '{' { lex_brace(); } BraceVs { lex_normal(); } '}' { $$ = $3; }
      | VALUE
      ;

BraceVs : /* empty */ { $$ = ""; }
        | BraceV BraceVs { $$ = concat($1, $2); }
        ;

BraceV : VALUE
       | '{' BraceVs '}' { $$ = concat(concat("{", $2), "}"); }
       ;

Note that even with concat() fixed it still leaks memory. Since you are using glib, you might consider using its GString utilities.
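
One of the leaks is easy to see in the BraceV rule: concat(concat("{", $2), "}") allocates an intermediate string in the inner call and nothing ever frees it. As a small sketch in plain C (this is not the GString approach, and wrap_braces is a hypothetical name), the same result can be built in a single allocation:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly malloc'd "{" + inner + "}",
 * or NULL if allocation fails. Only one buffer is allocated, so there
 * is no intermediate string left behind. The caller must free() it. */
char *wrap_braces(const char *inner)
{
    assert(inner != NULL);
    size_t size = strlen(inner) + 3;     /* '{' + inner + '}' + NUL */
    char *ret = malloc(size);
    if (ret == NULL)
        return NULL;
    snprintf(ret, size, "{%s}", inner);  /* formats into the exact-size buffer */
    return ret;
}
```

The action would then read $$ = wrap_braces($2). This removes only the intermediate allocation. The strdup()'d token values and the results of concat() in BraceVs are still never freed, so the leak is reduced, not eliminated.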