How to handle extremely long strings in flex and bison

heth · 06-28-2012, 05:28 AM

Hi,

I am writing a parser using flex and bison. Some input files contain extremely long tokens, which cause a segmentation fault during parsing.

I set YYLMAX to a very large number and allocate the string buffer with that same maximum size. I think the value I set exceeds the lexer's buffer limit, so it does not really solve the issue.

Any ideas on how to handle extremely long tokens in flex and bison?

firstfire · 06-28-2012, 05:44 AM

Hi.

Try including the `%pointer' directive in the first section of your flex input. In that case yytext will grow automatically.

From `info flex':

Quote:

Note that `yytext' can be defined in two different ways: either as a
character _pointer_ or as a character _array_. You can control which
definition `flex' uses by including one of the special directives
`%pointer' or `%array' in the first (definitions) section of your flex
input. The default is `%pointer', unless you use the `-l' lex
compatibility option, in which case `yytext' will be an array. The
advantage of using `%pointer' is substantially faster scanning and no
buffer overflow when matching very large tokens (unless you run out of
dynamic memory). The disadvantage is that you are restricted in how
your actions can modify `yytext' (*note Actions::), and calls to the
`unput()' function destroys the present contents of `yytext', which can
be a considerable porting headache when moving between different `lex'
versions.

The advantage of `%array' is that you can then modify `yytext' to
your heart's content, and calls to `unput()' do not destroy `yytext'
(*note Actions::). Furthermore, existing `lex' programs sometimes
access `yytext' externally using declarations of the form:

extern char yytext[];

This definition is erroneous when used with `%pointer', but correct
for `%array'.
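
To illustrate the quoted point about very large tokens, here is a minimal C sketch of a token buffer that is reached through a pointer and grows on demand, as opposed to a fixed-size array. This is not flex's actual code (flex makes yytext point into its own input buffer and enlarges that buffer as needed); `token_buf`, `token_buf_push` and `token_buf_free` are hypothetical names made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable token buffer; not flex's real data structure. */
struct token_buf {
    char  *data;  /* NUL-terminated token text on the heap */
    size_t len;   /* characters stored, excluding the NUL */
    size_t cap;   /* bytes currently allocated for data */
};

/* Append one character, doubling the allocation when it is full.
 * Returns 0 on success, -1 if memory runs out (buffer left unchanged). */
int token_buf_push(struct token_buf *b, char c)
{
    if (b->len + 1 >= b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 16;
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    b->data[b->len++] = c;
    b->data[b->len] = '\0';
    return 0;
}

/* Release the buffer and reset it to the empty state. */
void token_buf_free(struct token_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

Starting from a zero-initialized `struct token_buf`, the only limit on token length is available memory, which matches the "unless you run out of dynamic memory" remark in the quote.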

heth · 06-28-2012, 11:04 AM

Hi firstfire,

Thanks for the suggestion. Do you have an example of how %pointer is used?

firstfire · 06-28-2012, 01:13 PM

Hi.

I just realized that %pointer is the default behavior in flex. So either you are running out of memory, or you are using lex instead of flex, or there are bugs in the code. Can you provide more info? An example is trivial:

Code:

%pointer
%%
[[:alnum:]]+	printf("[%s] : %zu bytes\n", yytext, strlen(yytext));
%%

Even with %array, you get the following error message if the token length exceeds YYLMAX:

Quote:

token too large, exceeds YYLMAX

instead of a segmentation fault. You may have a memory error such as a buffer overflow. Try `valgrind ./a.out' to check. GDB is another useful tool in this situation.
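
The same "report an error instead of crashing" idea can be applied to your own copies of yytext. Below is a minimal sketch in C, assuming the destination is a fixed-size buffer; `copy_token` is a hypothetical helper, not part of flex or bison.

```c
#include <stdio.h>
#include <string.h>

/* Copy a token of length len into dst, which holds cap bytes.
 * Returns 0 on success, or -1 (leaving dst untouched) when the token
 * plus its terminating NUL does not fit. */
int copy_token(char *dst, size_t cap, const char *src, size_t len)
{
    if (len >= cap) {
        fprintf(stderr, "token too large (%zu bytes), buffer holds %zu\n",
                len, cap);
        return -1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

In a flex action this could be called as `copy_token(buf, sizeof buf, yytext, (size_t)yyleng)`, where `buf` stands for whatever fixed-size array you copy into, and a nonzero result treated as a scanning error. That usage is an untested suggestion.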

heth · 07-02-2012, 12:02 PM

Hi firstfire,

This is the example from my flex file:

%{
#undef YYLMAX
#define YYLMAX 40000
%}

id { strcpy(yylval.string, (char*)yytext); return (ID); }
call { strcpy(yylval.string, (char*)yytext); return (CALL); }

digit [0-9]
id [a-zA-Z0-9_\/\-=><.\"]*
%%

call { strcpy(yylval.string, (char*)yytext); return (CALL); }
{digit}+ {
//yylval.integer = atoi((char*)yytext);

strcpy(yylval.string, (char*)yytext);
return(NUMBERS);
}
{id} { strcpy(yylval.string, (char*)yytext); return(ID); }
%%

In my bison file:

%{
extern "C" {
extern char yytext[];
}
%}
%union {
char string[40000];
}
%token <string> ID CALL NUMBERS

%%
file: commands {};
commands: command {}
| commands command {};
command: id {}
| call {};
id: ID
{
sprintf($$, $1);
}
| NUMBERS
{
sprintf($$, $1);
};
call: CALL NUMBERS ',' ID /* <- segmentation fault when the file contains long values for call */
{

};

firstfire · 07-02-2012, 04:06 PM

Hi.

Try to replace

Code:

extern char yytext[];

by

Code:

extern char *yytext;
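
The declaration matters because an array name refers to the characters themselves, while a pointer variable only stores the address of characters that live elsewhere. Here is a small C sketch of the difference within a single file. All names in it are made up for illustration; none come from flex.

```c
#include <stddef.h>

/* Like %array: the object itself is the character storage. */
static char text_array[16] = "call";

/* Like %pointer: the object only holds the address of the characters. */
static char *text_pointer = text_array;

/* Where the array's characters live: the array object itself. */
const void *array_storage(void)
{
    return text_array;
}

/* Address of the pointer variable. A mismatched `extern char yytext[]`
 * would treat the bytes stored here (an address) as if they were text. */
const void *pointer_storage(void)
{
    return &text_pointer;
}

/* Address the pointer refers to: where the characters really are. */
const void *pointer_target(void)
{
    return text_pointer;
}
```

Here `array_storage()` and `pointer_target()` return the same address, while `pointer_storage()` returns a different one. Code compiled against the wrong declaration would look for text at that different address, which is why the two forms are not interchangeable across files.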

Here is your code, slightly modified so that it compiles:
lexer.l:

Code:

%{
#undef YYLMAX
#define YYLMAX 40000

#include "parser.h"
%}

digit [0-9]
id [a-zA-Z0-9_\/\-=><.\"]+
%%
call	 { snprintf(yylval.string, sizeof yylval.string, "%s", yytext); return CALL; }
{digit}+ { snprintf(yylval.string, sizeof yylval.string, "%s", yytext); return NUMBER; }
{id}	{ snprintf(yylval.string, sizeof yylval.string, "%s", yytext); return ID; }
[ \t]+	/* eat up whitespaces */
\n	return '\n';
.	return *yytext;

parser.y:

Code:

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern char *yytext;
void yyerror(const char * e)
{
	fprintf(stderr, "Error: %s\n", e);
}
%}
%union {
char string[40000];
}
%token ID CALL NUMBER
%type<string> id command call ID NUMBER

%%
input: /* empty */
     | input line	
;
line: '\n'	{ puts("here"); } 
    | command '\n'	{ printf("command: %s\n", $1); }
;
command: id
       | call {}
;

id: ID		{ snprintf($$, sizeof $$, "%s(id)", $1); }
  | NUMBER	{ snprintf($$, sizeof $$, "%s(num)", $1); }
;

call: CALL NUMBER ',' ID { snprintf($$, sizeof $$, "call(%s, %s)", $2, $4); }
;
%%
int main(void)
{
	return yyparse();
}

Makefile:

Code:

a.out: parser.o lex.yy.o
	$(CC) $^ -o $@ -lfl

lex.yy.c: lexer.l
	flex $<
parser.c: parser.y
	bison --defines=parser.h $< -o $@

%.o: %.c
	$(CC) -c $< -o $@

clean:
	rm -f lex.yy.c parser.c parser.h *.o a.out

Sample session:

Code:

$ make
bison --defines=parser.h parser.y -o parser.c
gcc -c parser.c -o parser.o
flex lexer.l
gcc -c lex.yy.c -o lex.yy.o
cc parser.o lex.yy.o -o a.out -lfl
$ ./a.out 
123
command: 123(num)
qwe
command: qwe(id)
call 123, me
command: call(123, me)
call me,123
Error: syntax error
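
One more thing worth checking is the size of the semantic value itself. With `char string[40000]` in the %union, every slot of the parser's value stack is 40000 bytes. Assuming bison's default initial stack depth of 200 entries (YYINITDEPTH), and assuming the generated C parser keeps that initial stack in a local array (my understanding, but please verify in your parser.c), that is about 8 MB of stack. That is close to the default stack limit on many Linux systems, so a crash could come from there rather than from the token length. A minimal sketch of the arithmetic and of a pointer-based alternative follows; `value_stack_bytes`, `dup_token` and the two union names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* The layout used in this thread: every semantic value is 40000 bytes. */
union fixed_value {
    char string[40000];
};

/* Hypothetical alternative: the value is only a pointer to heap text. */
union pointer_value {
    char *string;
};

/* Bytes needed for a value stack of `depth` entries of `elem` bytes each. */
size_t value_stack_bytes(size_t elem, size_t depth)
{
    return elem * depth;
}

/* Heap copy of a token of length len. The caller must free() it.
 * Returns NULL if memory runs out. */
char *dup_token(const char *text, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

With `%union { char *string; }`, a lexer action could set `yylval.string = dup_token(yytext, (size_t)yyleng);`. Token length is then limited only by available memory, at the cost of having to free each string once the parser actions are done with it.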

P.S. Please use [CODE]...[/CODE] tags around your code and data to preserve formatting.