LinuxQuestions.org
Old 05-02-2013, 11:04 AM   #1
RudraB
Member
 
Registered: Mar 2007
Distribution: Fedora
Posts: 264

Rep: Reputation: 23
parsing with gnu-flex and bison fails for space and brace


I am trying to parse a file like this (too simple for my actual purpose, but fine as a starting point):

Code:
@Book{key2,
 Author="Some2VALUE" ,
 Title="VALUE2" 
}
The lexer part (gnu-flex) is:

Code:
[A-Za-z"][^\\\"  \n\(\),=\{\}#~_]*      { yylval.sval = strdup(yytext); return KEY; }
@[A-Za-z][A-Za-z]+                 {yylval.sval = strdup(yytext + 1); return ENTRYTYPE;}
[ \t\n]                                ; /* ignore whitespace */
[{}=,]                                 { return *yytext; }
.                                      { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }
I then parse this (GNU bison) with:

Code:
%union
{
    char    *sval;
};

%token <sval> ENTRYTYPE
%type <sval> VALUE
%token <sval> KEY

%start Input

%%

Input: Entry
      | Input Entry ;  /* input is one or more entries */

Entry: 
     ENTRYTYPE '{' KEY ','{ 
         b_entry.type = $1; 
         b_entry.id = $3;
         b_entry.table = g_hash_table_new_full(g_str_hash, g_str_equal, free, free);} 
     KeyVals '}' {
         parse_entry(&b_entry);
         g_hash_table_destroy(b_entry.table);
         free(b_entry.type); free(b_entry.id);
         b_entry.table = NULL;
         b_entry.type = b_entry.id = NULL;}
     ;

KeyVals: 
      /* empty */ 
      | KeyVals KeyVal ; /* zero or more keyvals */

VALUE:
      /*empty*/
      | KEY 
      | VALUE KEY 
      ;
KeyVal: 
      /*empty*/
      KEY '=' VALUE ',' { g_hash_table_replace(b_entry.table, $1, $3); }
      | KEY '=' VALUE  { g_hash_table_replace(b_entry.table, $1, $3); }
      | error '\n' {yyerrok;}
      ;
There are a few problems, so I need to generalize both the lexer and the parser:
1) It cannot read a sentence: if the RHS is Author="Some Value", it only shows "Some, i.e. the space is not handled. I don't know how to fix this.
2) If I enclose the RHS in {} rather than "", it gives a syntax error.
I am looking for help with these two situations.
 
Old 05-02-2013, 03:16 PM   #2
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083
Looking just at the lexer:

book-lex.l
Code:
/* -*- mode: c; -*- */

%option noyywrap

%%

[A-Za-z\"][^\\\"  \n\(\),=\{\}#~_]*    { printf("KEY: %s\n", yytext); }
@[A-Za-z][A-Za-z]+                     { printf("ENTRTYTYPE: %s\n", yytext+1); }
[ \t\n]                                ; /* ignore whitespace */
[{}=,]                                 { printf("%s\n", yytext); }
.                                      { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }

%%
int main()
{
    return yylex();
}
compile (no Makefile required: GNU make's builtin rules suffice)
Code:
% make LEX=flex book-lex
flex  -t book-lex.l > book-lex.c
cc    -c -o book-lex.o book-lex.c
cc   book-lex.o   -o book-lex
rm book-lex.c book-lex.o
execute:
Code:
% ./book-lex < book.txt
ENTRTYTYPE: Book
{
KEY: key2
,
KEY: Author
=
KEY: "Some2VALUE
KEY: "
,
KEY: Title
=
KEY: "VALUE2
KEY: "
}
You can see the final quote in "Some2VALUE" being lexed as a separate token, which is probably not what you want. I suggest having separate lexing rules for quoted and unquoted tokens:

Code:
[A-Za-z][A-Za-z0-9]*            { printf("KEY: %s\n", yytext); }
\"[^\"]*\"                      { printf("KEY: %.*s\n", yyleng-2, yytext+1); }
Enclosing the RHS with {} will be a bit more tricky as you are already enclosing the Entry value in {}. If you don't need any further nesting of {}s, you could probably handle it in the lexer using Start Conditions, but it may be a better idea to handle it in the parser.
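To make the nesting problem concrete, here is a minimal sketch in plain C of the depth counting that any solution has to do, whether it lives in the lexer or in the grammar. The helper name read_braced() is hypothetical and is not part of flex or bison. It assumes the value starts at an opening brace and it ignores backslash-escaped braces, which a real BibTeX reader may need to handle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of flex or bison: given a pointer to an
 * opening '{', return a malloc'd copy of everything up to the matching '}'.
 * Inner braces and whitespace are kept verbatim. Returns NULL if the braces
 * are unbalanced or allocation fails. The caller frees the result. */
char *read_braced(const char *open)
{
    assert(open != NULL && *open == '{');

    int depth = 0;
    const char *p = open;
    do {
        if (*p == '\0')
            return NULL;            /* ran out of input before the match */
        if (*p == '{')
            depth++;                /* one level deeper */
        else if (*p == '}')
            depth--;                /* one level back out */
        p++;
    } while (depth > 0);

    /* p now points one past the matching '}' */
    size_t len = (size_t)(p - open) - 2;   /* drop the outer brace pair */
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return out;
}
```

Called on the text {Some {Nested} Value}, the helper returns Some {Nested} Value, spaces and inner braces included. A single regular expression cannot count depth like this, which is why the lexer needs either a start condition plus a counter or help from the parser.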
 
Old 05-02-2013, 03:31 PM   #3
RudraB
Member
 
Registered: Mar 2007
Distribution: Fedora
Posts: 264

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by ntubski
it may be a better idea to handle it in the parser.
Ntubski,
Thanks for your reply. I already have a lexer+parser that can parse it correctly (when the strings are quoted). But in the more general case the strings may be braced, and even nested braces are common.
So the last line of your reply is my actual goal. As you can see from my parser and lexer, I am trying to parse it using the grammar, but I have not achieved much.
I need help with the grammar.
 
Old 05-02-2013, 08:26 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083
Ah, I looked at the BibTeX Format Description: because the text within braces has to include everything, including white space, you have to tell the lexer about it. Here is a parser that just prints out the entries (I didn't bother freeing memory, so it's very leaky):

Code:
/* -*- mode: c; -*- */
%{
#include "book-parse.h"
%}

%option noyywrap

%x braceV

%%

<braceV>[^{}]* { yylval.sval = strdup(yytext); return VALUE; }
<braceV>[{}]   { return *yytext; }

[A-Za-z][A-Za-z0-9]*    { yylval.sval = strdup(yytext); return KEY; }
\"[^\"]*\"              { yylval.sval = strndup(yytext+1, yyleng-2); return VALUE; }
[0-9]+                  { yylval.sval = strdup(yytext); return VALUE; }
@[A-Za-z][A-Za-z]+      { yylval.sval = strdup(yytext+1); return ENTRYTYPE; }
[ \t\n]                 ; /* ignore whitespace */
[{}=,]                  { return *yytext; }
.                       { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }

%%

void lex_brace() {
    BEGIN(braceV);
}
void lex_normal() {
    BEGIN(0);
}
Code:
%union
{
    char    *sval;
};

%{
#include <stdio.h>
#include <stdlib.h>   /* malloc() used by concat() */
#include <string.h>
char* concat(char* str1, char* str2);

void lex_brace();
void lex_normal();
%}

%token <sval> ENTRYTYPE
%token <sval> VALUE
%token <sval> KEY
%type <sval> Value
%type <sval> BraceV
%type <sval> BraceVs

%start Input

%%

Input : Entry
      | Input Entry
      ;

Entry: ENTRYTYPE '{' KEY { printf(" %s\n", $3); } ',' KeyVals '}';

KeyVals : /* empty */
        | KeyVal
        | KeyVal ',' KeyVals
        ;

KeyVal : KEY '=' Value           { printf("  %s = %s\n", $1, $3); }
       ;

Value : '{' { lex_brace(); } BraceVs { lex_normal(); } '}' { $$ = $3; }
      | VALUE
      ;

BraceVs : /* empty */ { $$ = ""; }
        | BraceV BraceVs { $$ = concat($1, $2); }
        ;

BraceV : VALUE
       | '{' BraceVs '}' { $$ = concat(concat("{", $2), "}"); }
       ;


%%

int main() {
    return yyparse();
}
int yyerror(char *s) {
    printf("yyerror : %s\n",s);
    return 0;
}

/* let's pretend we have garbage collection for simplicity */
char* concat(char* str1, char* str2) {
    char* ret = malloc(strlen(str1) + strlen(str2) + 1);
    strcat(ret, str1);
    strcat(ret, str2);
    return ret;
}
 
1 members found this post helpful.
Old 05-03-2013, 03:37 PM   #5
RudraB
Member
 
Registered: Mar 2007
Distribution: Fedora
Posts: 264

Original Poster
Rep: Reputation: 23
Ntubski,
Thanks a lot.
Being a novice, it's too sophisticated for me.
 
Old 05-23-2013, 08:53 AM   #6
RudraB
Member
 
Registered: Mar 2007
Distribution: Fedora
Posts: 264

Original Poster
Rep: Reputation: 23
Hi Ntubski and all,
Sorry to reopen a solved thread, but it's probably best for the sake of completeness.
I have adopted the code Ntubski provided and just put it in a GTK tree view and hash table.
Now it looks like this:
The lexer:
Code:
%{
#include "bib.h"
  int line;
%}

%option noyywrap

%x braceV

%%

<braceV>[^{}]* { yylval.sval = strdup(yytext); return VALUE; }
<braceV>[{}]   { return *yytext; }

[A-Za-z][A-Za-z0-9_":]* 	{ yylval.sval = strdup(yytext); return KEY; }
\".*\"              	{ yylval.sval = strndup(yytext+1, yyleng-2); return VALUE; }
[0-9]+                  	{ yylval.sval = strdup(yytext); return VALUE; }
@[A-Za-z][A-Za-z]+      	{ yylval.sval = strdup(yytext+1); return ENTRYTYPE; }
[ \t\n]                 	; /* ignore whitespace */
[{}=,]                  	{ return *yytext; }
.                       	{ fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }
\n				{line++;printf("%d",line);}

%%

void lex_brace() {
    BEGIN(braceV);
}
void lex_normal() {
    BEGIN(0);
}

The parser
Code:
%{
#include <stdio.h>
#include <glib.h>
#include <gtk/gtk.h>
#include <string.h>
#include <glib/gstdio.h>
#include <fcntl.h>
enum
{
  COL_BIB_KEY=0,
  COL_BIB_TYPE,	COL_BIB_AUTHOR,	COL_BIB_YEAR,
  NUM_COLS} ;

char* concat(char* str1, char* str2);

void lex_brace();
void lex_normal();
#define slen 1064
 int yylex(void);
/*enum
{
  COL_BIB_KEY=0,
  COL_BIB_TYPE,	COL_BIB_AUTHOR,	COL_BIB_YEAR,
  NUM_COLS} ;
*/  
typedef struct {
  char *type;
  char *id;
  GHashTable *table;
} BibEntry;

BibEntry b_entry;
GtkTreeIter siter;
GtkListStore *store;

void yyerror(char *s)
{
  printf("YYERROR : %s\n", s);
}
void parse_entry(BibEntry *bib_entry);

%}

// Symbols.
%union
{
    char    *sval;
};

%token <sval> ENTRYTYPE
%token <sval> VALUE
%token <sval> KEY
%token OBRACE
%token EBRACE
%token QUOTE
%token SEMICOLON 
%type <sval> Value
%type <sval> BraceV
%type <sval> BraceVs
%start Input

%%

Input: Entry
      | Input Entry ;  /* input is one or more entries */

Entry: 
     ENTRYTYPE '{' KEY ','{ 
         b_entry.type = $1; 
         b_entry.id = $3;
         b_entry.table = g_hash_table_new_full(g_str_hash, g_str_equal, free, free);} 
     KeyVals '}' {
         parse_entry(&b_entry);
         g_hash_table_destroy(b_entry.table);
         free(b_entry.type); free(b_entry.id);
         b_entry.table = NULL;
         b_entry.type = b_entry.id = NULL;}
     ;

KeyVals : /* empty */
        | KeyVal
        | KeyVal ',' KeyVals
        ;

KeyVal : KEY '=' Value          { g_hash_table_replace(b_entry.table, $1, $3); 
                                printf("%s\n",$3);
				}
       ;

Value : '{' { lex_brace(); } BraceVs { lex_normal(); } '}' { $$ = $3; }
      | VALUE
      ;

BraceVs : /* empty */ { $$ = ""; }
        | BraceV BraceVs { $$ = concat($1, $2); }
        ;

BraceV : VALUE
       | '{' BraceVs '}' { $$ = concat(concat("{", $2), "}"); }
       ;
%%


void parse_entry(BibEntry *bib_entry)
{
  char *author = "", *year = "";
  GHashTableIter iter;
  gpointer key, val;
  char **kiter;
  int i;
 char *keys[] = {"id", "type", "author", "year", "title", "publisher", "editor", 
    "volume", "number", "pages", "month", "note", "address", "edition", "journal",
    "series", "book", "chapter", "organization", NULL};
  char *vals[] = {NULL,  NULL,  NULL, NULL, NULL,
    NULL,  NULL,  NULL, NULL, NULL,
    NULL,  NULL,  NULL, NULL, NULL,
    NULL,	 NULL,  NULL, NULL, NULL};

  g_hash_table_iter_init(&iter, bib_entry->table);
  while (g_hash_table_iter_next(&iter, &key, &val)) {
  for (kiter = keys, i = 0; *kiter; kiter++, i++)
    {
    if (!g_ascii_strcasecmp(*kiter, key)) {
    vals[i] = g_strndup(val,slen);
	break;
    }
  }
  }

  gtk_list_store_append (store, &siter);
  gtk_list_store_set (store, &siter,
                      COL_BIB_AUTHOR, 		vals[COL_BIB_AUTHOR],
                      COL_BIB_TYPE, 		bib_entry->type,
                      COL_BIB_KEY, 		bib_entry->id,
                      COL_BIB_YEAR, 		vals[COL_BIB_YEAR],
                      -1);
}

void setup_tree(GtkWidget *tree)
{
  GtkCellRenderer *renderer;
  GtkTreeViewColumn *column;

  renderer = gtk_cell_renderer_text_new ();
  /********************************************/
    g_object_set(G_OBJECT(renderer), "wrap-mode", PANGO_WRAP_WORD, 
      "wrap-width",300, NULL);
  column=gtk_tree_view_column_new_with_attributes (
      "Author", renderer,
      "text", COL_BIB_AUTHOR,
      NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);

  /********************************************/
  column = gtk_tree_view_column_new_with_attributes
    ("KEY", renderer, "text",COL_BIB_KEY, NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);
  /********************************************/
  column = gtk_tree_view_column_new_with_attributes
    ("Type", renderer, "text",COL_BIB_TYPE , NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);
  /********************************************/
  column = gtk_tree_view_column_new_with_attributes
    ("Year", renderer, "text",COL_BIB_YEAR, NULL);
  gtk_tree_view_append_column (GTK_TREE_VIEW (tree), column);
}
char* concat(char* str1, char* str2) {
    char* ret = malloc(strlen(str1) + strlen(str2) + 1);
    strcat(ret, str1);
    strcat(ret, str2);
    return ret;
}
and the C routine:
Code:
#include <stdio.h>
#include <glib.h>
#include <gtk/gtk.h>
#include <string.h>
#include <glib/gstdio.h>
#include <fcntl.h>
char* buffer;
gsize length;
  GError* error=NULL;
enum
{
  COL_BIB_KEY=0,
  COL_BIB_TYPE,	COL_BIB_AUTHOR,	COL_BIB_YEAR,
  NUM_COLS} ;
#define slen 102

GtkTreeIter siter;
GtkListStore *store;
typedef struct {
  char *type;
  char *id;
  GHashTable *table;
} BibEntry;

BibEntry b_entry;

void parse_entry (BibEntry *bib_entry);
void setup_tree(GtkWidget *tree);

int main(int argc, char** argv)
{
  gtk_init(&argc, &argv);
  GtkWidget  *window = gtk_window_new (GTK_WINDOW_TOPLEVEL);    
  GtkWidget *tree = gtk_tree_view_new();
  GtkWidget *scrolledw = gtk_scrolled_window_new(NULL, NULL);
extern  FILE *yyin;
extern int yyparse (void);
extern yy_create_buffer;
  setup_tree(tree);

  gtk_container_add(GTK_CONTAINER(window), scrolledw);
  gtk_container_add(GTK_CONTAINER(scrolledw), tree);
  store = gtk_list_store_new(NUM_COLS, 
      G_TYPE_STRING, G_TYPE_STRING, G_TYPE_STRING, G_TYPE_STRING);
//FILE *fin=fopen("u2.bib","r");
//yyin=fin;
g_file_get_contents("/home/rudra/Desktop/u2.bib", &buffer, &length , &error);
 yyin=fmemopen(buffer,strlen(buffer),"r");
  yyparse();
g_file_get_contents("/home/rudra/Desktop/u2.bib", &buffer, &length , &error);
 yyin=fmemopen(buffer,strlen(buffer),"r");
  yyparse();

  gtk_tree_view_set_model (GTK_TREE_VIEW (tree), GTK_TREE_MODEL (store));
  g_object_unref (store);
  gtk_widget_show_all (window);

  g_signal_connect(window, "destroy", G_CALLBACK(gtk_main_quit), NULL);

  gtk_main();

  return 0;
}
This is a minimal example.
The problem is that, even if we ignore everything GTK-related, the printf statement (line #86 of the parser) prints a garbage value for Key={Value}, but not for Key="Value" or Key="{Value}".
Also, the garbage appears only when I read the file a second time, not the first. (This resembles the case where I open the file using a GTK widget, which is why I added the second yyparse.)
The output of printf looks like:
��b<Rudra } when the actual thing is {Rudra}
and emits warning:
Quote:
Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()
Please help.
 
Old 05-23-2013, 11:37 AM   #7
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083
Sorry, I had a bug. The first strcat() in concat() should be strcpy():
Code:
char* concat(char* str1, char* str2) {
    char* ret = malloc(strlen(str1) + strlen(str2) + 1);
    strcpy(ret, str1); // must not strcat to an uninitialized location
    strcat(ret, str2);
    return ret;
}
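Some background on why this matters: malloc() returns uninitialized memory, and strcat() first scans its destination for a terminating NUL before appending, so calling it on a fresh buffer is undefined behaviour. It can appear to work when the heap block happens to start with a zero byte and fail once memory is reused, which may be why the garbage only showed up on the second parse. The following is a sketch of a slightly more defensive variant, not the code from the thread: it copies by explicit length so the uninitialized buffer is never read, and it checks the malloc() result. The const qualifiers and the assert are additions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Join two NUL-terminated strings into a freshly allocated buffer.
 * Returns NULL if allocation fails; the caller owns the result. */
char *concat(const char *str1, const char *str2)
{
    assert(str1 != NULL && str2 != NULL);

    size_t len1 = strlen(str1);
    size_t len2 = strlen(str2);
    char *ret = malloc(len1 + len2 + 1);
    if (ret == NULL)
        return NULL;

    /* Copy by length: nothing here ever reads the uninitialized buffer. */
    memcpy(ret, str1, len1);
    memcpy(ret + len1, str2, len2 + 1);   /* +1 also copies str2's NUL */
    return ret;
}
```

It is used exactly like the original, for example concat(concat("{", $2), "}") in the grammar action, and like the original it leaks the intermediate string unless the caller frees it.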
 
1 members found this post helpful.
Old 05-23-2013, 12:39 PM   #8
RudraB
Member
 
Registered: Mar 2007
Distribution: Fedora
Posts: 264

Original Poster
Rep: Reputation: 23
ntubski, thanks a lot.
thanks for your patience and help.
That correction is working.
 
Old 05-25-2013, 08:31 AM   #9
RudraB
Member
 
Registered: Mar 2007
Distribution: Fedora
Posts: 264

Original Poster
Rep: Reputation: 23
ntubski,
Though the problem is now solved, I am confused about why the warning appeared only when the string starts with {, not with ".
A little tutorial?
 
Old 05-25-2013, 05:41 PM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083
Quote:
Originally Posted by RudraB
I am confused about why the warning appeared only when the string starts with {, not with ".
A little tutorial?
In the case of the " quoted string, the concat() function was not used to construct the value, so there is no problem. A " quoted string is lexed as a VALUE, so the second alternative of Value is chosen. A {} quoted string is lexed as '{' VALUE '}', so the first alternative of Value (the one containing BraceVs) is chosen. Both BraceVs and BraceV use concat() in their semantic actions to create a value for $$.

Code:
Value : '{' { lex_brace(); } BraceVs { lex_normal(); } '}' { $$ = $3; }
      | VALUE
      ;

BraceVs : /* empty */ { $$ = ""; }
        | BraceV BraceVs { $$ = concat($1, $2); }
        ;

BraceV : VALUE
       | '{' BraceVs '}' { $$ = concat(concat("{", $2), "}"); }
       ;
Note that even with concat() fixed it still leaks memory. Since you are using glib, you might consider using its GString utilities.
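To illustrate the idea behind that suggestion without requiring GLib to compile, here is a hand-rolled stand-in for a GString-like buffer. The type StrBuf and the strbuf_* functions are hypothetical names invented for this sketch and are not GLib API. The point is that pieces are appended into one growing allocation instead of creating a new string per concat() call, so there is only one pointer to free at the end.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a GString-like growable buffer. */
typedef struct {
    char  *data;   /* always NUL-terminated once initialized */
    size_t len;    /* bytes used, excluding the terminator */
    size_t cap;    /* bytes allocated */
} StrBuf;

/* Start with a small empty string. Returns 0 on success, -1 on failure. */
int strbuf_init(StrBuf *b)
{
    assert(b != NULL);
    b->cap = 16;
    b->len = 0;
    b->data = malloc(b->cap);
    if (b->data == NULL)
        return -1;
    b->data[0] = '\0';
    return 0;
}

/* Append s, doubling the capacity as needed. Returns 0 or -1. */
int strbuf_append(StrBuf *b, const char *s)
{
    assert(b != NULL && b->data != NULL && s != NULL);
    size_t n = strlen(s);
    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap;
        while (b->len + n + 1 > cap)
            cap *= 2;
        char *grown = realloc(b->data, cap);
        if (grown == NULL)
            return -1;          /* the old buffer is still valid */
        b->data = grown;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n + 1);   /* includes the terminator */
    b->len += n;
    return 0;
}

/* Hand the finished string to the caller and reset the buffer. */
char *strbuf_steal(StrBuf *b)
{
    assert(b != NULL);
    char *out = b->data;
    b->data = NULL;
    b->len = b->cap = 0;
    return out;
}
```

With GLib itself, g_string_new(), g_string_append() and g_string_free() play roughly the same roles as the init, append and steal functions above. The string handed to g_hash_table_replace() would then be the single final allocation, which the table's free function already releases.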
 
  

