How to implement variable substitution in strings?

Sergei Steshenko · 05-07-2011, 06:16 AM

The deep problem is that for nested strings plain '"' is not enough - one needs opening and closing "quote". This has been implemented in Perl, for example: http://perldoc.perl.org/perlop.html#...like-Operators .

MTK358 · 05-07-2011, 08:26 AM

@Sergei Steshenko

Either you're being very cryptic again or you have no idea what I'm trying to do.

"${foo${bar}}" is a syntax error, it is NOT translated into "foo(bar)". Substitution does not occur inside the substitution operator, which means that "${foo${bar}}" will try to evaluate the expression "foo${bar}", which is invalid.

I think that the way this would work is that if the lexer comes across a "$" followed by a "{" inside a double-quoted string, it cuts out everything from the "{" to the matching "}", and creates more instances of the scanner/lexer/parser that will parse it as if it were a separate program in the interpreted language. Since the parser does not know that the program it's parsing is embedded in a string, it doesn't do any ${} substitution. It can, however, contain double-quoted strings, and those strings can contain ${} substitutions, and this can recursively go on and on as long as there's room on the stack.

Sergei Steshenko · 05-07-2011, 08:52 AM

Quote:

Originally Posted by MTK358

@Sergei Steshenko

Either you're being very cryptic again or you have no idea what I'm trying to do.

"${foo${bar}}" is a syntax error, it is NOT translated into "foo(bar)". Substitution does not occur inside the substitution operator, which means that "${foo${bar}}" will try to evaluate the expression "foo${bar}", which is invalid.

I think that the way this would work is that if the lexer comes across a "$" followed by a "{" inside a double-quoted string, it cuts out everything from the "{" to the matching "}", and creates more instances of the scanner/lexer/parser that will parse it as if it were a separate program in the interpreted language. Since the parser does not know that the program it's parsing is embedded in a string, it doesn't do any ${} substitution. It can, however, contain double-quoted strings, and those strings can contain ${} substitutions, and this can recursively go on and on as long as there's room on the stack.

You said (IIRC) that in ${something} the "something" is an expression. Applying the "inner items are dealt with first" principle I've created my 'foo(1)' example.

MTK358 · 05-07-2011, 09:04 AM

I thought that outer items are dealt with first?

Another problem is how to find the matching "}": braces should be ignored inside nested strings.

For now I did simple variable substitution; I might do expressions later if something is figured out.
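
The brace-matching problem raised in this post can be sketched roughly as follows. This is only an illustration, not code from the interpreter discussed in the thread: the function names (findMatchingBrace, skipCode, skipDoubleQuoted) are hypothetical, and it assumes that single-quoted strings have no escapes and that comments are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; none of these names come from the thread.
// Each returns std::string::npos if the input ends before the construct closes.

std::size_t skipCode(const std::string& s, std::size_t i);

// `i` is the index just past an opening '"'.
// Returns the index just past the matching closing '"'.
std::size_t skipDoubleQuoted(const std::string& s, std::size_t i) {
	while (i < s.size()) {
		if (s[i] == '\\') {
			i += 2; // skip the backslash and the escaped character
		} else if (s[i] == '"') {
			return i + 1;
		} else if (s[i] == '$' && i + 1 < s.size() && s[i + 1] == '{') {
			// A nested ${...}: its contents are code again.
			std::size_t close = skipCode(s, i + 2);
			if (close == std::string::npos) return close;
			i = close + 1;
		} else {
			++i;
		}
	}
	return std::string::npos;
}

// `i` is the index just past a '{'. Returns the index of the matching '}'.
std::size_t skipCode(const std::string& s, std::size_t i) {
	while (i < s.size()) {
		if (s[i] == '}') {
			return i;
		} else if (s[i] == '{') {
			std::size_t close = skipCode(s, i + 1);
			if (close == std::string::npos) return close;
			i = close + 1;
		} else if (s[i] == '"') {
			// Braces inside the string are ignored; npos ends the loop.
			i = skipDoubleQuoted(s, i + 1);
		} else if (s[i] == '\'') {
			// Assumption: single-quoted strings are raw, with no escapes.
			std::size_t close = s.find('\'', i + 1);
			if (close == std::string::npos) return close;
			i = close + 1;
		} else {
			++i;
		}
	}
	return std::string::npos;
}

// `open` must be the index of the '$' in "${".
std::size_t findMatchingBrace(const std::string& s, std::size_t open) {
	assert(s.compare(open, 2, "${") == 0);
	return skipCode(s, open + 2);
}
```

For example, findMatchingBrace("${ \"}\" }x", 0) skips the brace inside the inner string and returns the index of the final "}". The two functions call each other, so the nesting depth is limited only by the stack, as described earlier in the thread.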

Sergei Steshenko · 05-07-2011, 09:28 AM

Quote:

Originally Posted by MTK358

I thought that outer items are dealt with first? ...

You parse from outside to inside, but you evaluate from inside to outside. For example, when in 'kcalc' I enter

Code:

3*(4+5)

, at the moment I enter ')', 'kcalc' shows '9', which is '4 + 5', and when I press <ENTER>, it shows '27'. I.e. evaluation started from inside.
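
The "parse from outside to inside, evaluate from inside to outside" point can be illustrated with a toy recursive-descent evaluator. This is a minimal sketch under stated assumptions, not how kcalc works internally: the Calc class and evaluate function are hypothetical, and the input is assumed to be well-formed, with no whitespace and only non-negative integers, "+", "*" and parentheses.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Toy evaluator for illustration only; it is not how kcalc is implemented.
// Grammar (no whitespace, non-negative integers):
//   expr   := term ('+' term)*
//   term   := factor ('*' factor)*
//   factor := digits | '(' expr ')'
class Calc {
public:
	explicit Calc(std::string text) : src(std::move(text)), pos(0) {}

	int expr() {
		int value = term();
		while (peek() == '+') { ++pos; value += term(); }
		return value;
	}

private:
	// Returns the current character, or '\0' at the end of the input.
	char peek() const { return pos < src.size() ? src[pos] : '\0'; }

	int term() {
		int value = factor();
		while (peek() == '*') { ++pos; value *= factor(); }
		return value;
	}

	int factor() {
		if (peek() == '(') {
			++pos;              // parsing reaches the outer '(' first
			int inner = expr(); // but the inner value is complete before the outer operator is applied
			assert(peek() == ')'); // the sketch assumes well-formed input
			++pos;
			return inner;
		}
		int value = 0;
		while (std::isdigit(static_cast<unsigned char>(peek())))
			value = value * 10 + (src[pos++] - '0');
		return value;
	}

	std::string src;
	std::size_t pos;
};

int evaluate(const std::string& text) {
	return Calc(text).expr();
}
```

With this sketch, evaluate("3*(4+5)") descends into the parentheses while parsing, computes 4 + 5 there, and only then multiplies by 3 on the way back out.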

Sergei Steshenko · 05-07-2011, 09:30 AM

Quote:

Originally Posted by MTK358

... I might do expressions later if something is figured out.

You know, all those guys who invented various languages introduced

Code:

eval <string>

for a reason. And I think the reason is not making them and us confused.

MTK358 · 05-07-2011, 10:01 AM

Quote:

Originally Posted by Sergei Steshenko

You know, all those guys who invented various languages introduced

Code:

eval <string>

for a reason. And I think the reason is not making them and us confused.

What's confusing me isn't the concept of eval, but how to figure out what string to pass to it.

Sergei Steshenko · 05-07-2011, 11:42 AM

Quote:

Originally Posted by MTK358

What's confusing me isn't the concept of eval, but how to figure out what string to pass to it.

I think the "founding fathers" were confused too and decided not to complicate their (and our) lives: if one wants more than pure variable substitution, he/she needs to explicitly call 'eval'.

MTK358 · 05-07-2011, 11:57 AM

Quote:

Originally Posted by Sergei Steshenko

I think the "founding fathers" were confused too and decided not to complicate their (and our) lives: if one wants more than pure variable substitution, he/she needs to explicitly call 'eval'.

Maybe, but there is a language that does expression substitution exactly the way I described it: Ruby.

Code:

foo = 3
bar = 8
puts("#{foo} + #{bar} = #{foo + bar}")

puts("#{ "#{foo + bar}" + ' here are some curly braces: { }{}}}}{{' }")

# this causes a syntax error (the program only runs with it commented out)
# puts("#{foo#{bar}}")

MTK358 · 05-09-2011, 10:22 AM

I did it!

Code:

#!/home/michael/Projects/lang/build/src/lang

foo = 'te'
bar = 'st'

"\"$foo\" + \"$bar\" = \"${foo + bar}\"\n":print()

"${ "nested" + "${" embedded expressions"}" }":println()

# this is a syntax error
# "${foo${bar}}":println()

Output:

Code:

$ ./test_program 
"te" + "st" = "test"
nested embedded expressions

With the "syntax error" line un-commented:

Code:

$ ./test_program 
./test_program:11:7: syntax error: Invalid token

Nominal Animal · 05-09-2011, 01:06 PM

Quote:

Originally Posted by MTK358

I did it!

Good to hear. Did you use another instance of the lexer/parser to convert such string constants to AST, or how did you do it?

MTK358 · 05-09-2011, 01:48 PM

Quote:

Originally Posted by Nominal Animal

Good to hear. Did you use another instance of the lexer/parser to convert such string constants to AST, or how did you do it?

The way I did it is that when the lexer comes across a "${" inside a double-quoted string, it returns a special token. When the parser gets that token, it creates a new lexer and parser but tells them to use the original scanner (since it remembers the place in the text file).

I also had to slightly modify the parser to be able to recognize any specified token (not just EOF) as the end of the program, in this case the closing curly bracket.

From the lexer:

Code:

	if (isInDoubleQuotes) {
		if (s->current() == '"') {
			s->next();
			isInDoubleQuotes = false;
			curTok = DoubleQuoteTok;
		} else if (s->current() == '$') {
			s->next();
			if (s->current() == '{') {
				s->next();
				curTok = DoubleQuotedExpressionTok;
			} else if (!isCharFirstNameCharacter(s->current())) {
				curTok = InvalidInput;
			} else {
				do {
					str.push_back(s->current());
				} while (isCharNameCharacter(s->next()));
				curText = str.c_str();
				curTok = DoubleQuotedVariableTok;
			}
		} else if (s->current() == Scanner::ReadError) {
			isInDoubleQuotes = false;
			curTok = ReadError;
		} else if (s->current() == Scanner::EndOfFile) {
			isInDoubleQuotes = false;
			curTok = InvalidInput;
		} else {
			do {
				if (s->current() != '\\') {
					str.push_back(s->current());
				} else {
					s->next();
					switch (s->current()) {
						case '\\':
							str.push_back('\\');
							break;
						case 'n':
							str.push_back('\n');
							break;
						case 'r':
							str.push_back('\r');
							break;
						case '0':
							str.push_back('\0');
							break;
						case 'a':
							str.push_back('\a');
							break;
						case 'b':
							str.push_back('\b');
							break;
						case 't':
							str.push_back('\t');
							break;
						case 'v':
							str.push_back('\v');
							break;
						case 'f':
							str.push_back('\f');
							break;
						case 'e':
							str.push_back('\033'); // ESC; '\e' is a non-standard compiler extension
							break;
						case '"':
							str.push_back('"');
							break;
						default:
							str.push_back(s->current());
					}
				}
				s->next();
			} while (s->current() != '"' && s->current() != '$' && s->current() >= 0);
			curText = str.c_str();
			curTok = DoubleQuotedTextTok;
		}
		return curTok;
	}

From the parser:

Code:

	else if (accept(Lexer::DoubleQuoteTok))
	{
		int l = lex->prevLine(), c = lex->prevCol();
		node = new SubstitutionStringNode();
		while ( lex->current() == Lexer::DoubleQuotedTextTok       ||
		        lex->current() == Lexer::DoubleQuotedVariableTok   ||
		        lex->current() == Lexer::DoubleQuotedExpressionTok ) {
			if (lex->current() == Lexer::DoubleQuotedTextTok) {
				((SubstitutionStringNode*) node)->addText(String::fromAscii(lex->text()));
			} else if (lex->current() == Lexer::DoubleQuotedExpressionTok) {
				Lexer l2;
				l2.setScanner(lex->getScanner());
				Parser p2;
				Node* node2 = p2.parse(&l2, Lexer::CCurlyTok);
				((SubstitutionStringNode*) node)->addExpr(node2);
			} else if (lex->current() == Lexer::DoubleQuotedVariableTok) {
				((SubstitutionStringNode*) node)->addVar(lex->text());
			}
			lex->next();
		} 
		if (!accept(Lexer::DoubleQuoteTok)) throw SyntaxError("No closing double-quote", l, c);
	}

Nominal Animal · 05-09-2011, 02:22 PM

Quote:

Originally Posted by MTK358

The way I did it is that when the lexer comes across a "${" inside a double-quoted string, it returns a special token. When the parser gets that token, it creates a new lexer and parser but tells them to use the original scanner (since it remembers the place in the text file).

Quite neat.

Quote:

Originally Posted by MTK358

I also had to slightly modify the parser to be able to recognize any specified token (not just EOF) as the end of the program, in this case the closing curly bracket.

Does it still return an error on a stray closing brace (}), or does it treat it as the end of the program?

MTK358 · 05-09-2011, 03:22 PM

Quote:

Originally Posted by Nominal Animal

Does it still return an error on a stray closing brace (}), or does it treat it as the end of the program?

It treats it as the end of the program.

The parser is a recursive descent parser. The "program" rule matches an expr-list followed by the ending token (EOF or "}", depending on how the parser was initialized). The expr-list rule matches 0 or more newlines, and then it checks if the next token could be the first token of an expression (for example, "if" or "(" tokens could be the start of an expression, while ")" or "end" could not). If so, it matches an expression and starts over. If not, it quits, returning a node that evaluates all the expressions in the list, and returns the value of the last one. If the top-level expr-list returns and the next token is not the ending token, it's treated as a syntax error.
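
The "program" and expr-list rules described in this post might look roughly like the sketch below. It is a simplified illustration under assumptions, not the actual parser: the names Tok, Token and MiniParser are hypothetical, an "expression" is reduced to a single number token, and the parser returns the last value directly instead of building an AST node.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical token kinds; the real lexer has many more.
enum class Tok { Number, Newline, CCurly, CParen, Eof };

struct Token {
	Tok kind;
	int value; // only meaningful for Tok::Number
};

class MiniParser {
public:
	// endTok is Tok::Eof for a whole file, Tok::CCurly inside ${...}.
	MiniParser(std::vector<Token> tokens, Tok endTok)
		: toks(std::move(tokens)), end(endTok), pos(0) {
		assert(endTok == Tok::Eof || endTok == Tok::CCurly);
	}

	// program := expr-list END
	int parseProgram() {
		int last = parseExprList();
		if (current() != end)
			throw std::runtime_error("syntax error: unexpected token");
		return last;
	}

private:
	Tok current() const { return pos < toks.size() ? toks[pos].kind : Tok::Eof; }

	// Could this token be the first token of an expression?
	static bool startsExpr(Tok t) { return t == Tok::Number; }

	// expr-list := (newline* expr)* ; yields the value of the last expression.
	int parseExprList() {
		int last = 0;
		for (;;) {
			while (current() == Tok::Newline) ++pos;
			if (!startsExpr(current())) return last;
			last = toks[pos++].value; // stand-in for parsing a full expression
		}
	}

	std::vector<Token> toks;
	Tok end;
	std::size_t pos;
};
```

Under this reading, a token list ending in "}" parses cleanly when the parser was constructed with Tok::CCurly as the ending token, while the same list given to a top-level parser (ending token Tok::Eof) is rejected, because the expr-list stops at the brace and the brace is not the expected ending token.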

MTK358 · 07-14-2011, 11:51 AM

I came across a big issue with this, so I have to mark the thread as unsolved:

I recently modified the parser to have two-token lookahead, since that was necessary for some syntax I wanted to add. The problem is that this completely broke expression substitution in strings, and I'm not sure how to solve it.

Basically, the way it worked before is that if you evaluate an expression, the lexer is at the token after the expression's last token. This was OK before, but now the lexer is actually internally two tokens after the expression's last token, because that's how it implements its new peek() feature. The reason that this poses a problem for expression substitution is that the inner lexer (when inside the ${...}) actually goes past the closing curly brace to peek at the next token. If the contents of the string right after the closing brace happen not to be a valid token, the inner lexer throws a syntax error. Or if it is a valid token, when it goes back to the main parser/lexer, it starts reading from where the inner lexer finished, which means that it skips the part of the string after the closing brace.
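
The over-read described above can be reproduced with a toy model. The sketch below is only an illustration of the problem, not a fix and not the real lexer: Scanner, Lexer and scannerPosAtClosingBrace are hypothetical names, tokens are simplified to alphanumeric runs or single characters, and the lookahead is assumed to be filled eagerly.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <deque>
#include <string>

// Toy stand-ins for the scanner and lexer; the names are hypothetical.
struct Scanner {
	std::string text;
	std::size_t pos;
};

class Lexer {
public:
	// `depth` is how many tokens are kept buffered: 1 models the original
	// lexer, 2 models the lexer after peek() support was added.
	Lexer(Scanner& scanner, std::size_t depth) : s(scanner) {
		assert(depth >= 1);
		while (buffer.size() < depth) buffer.push_back(readToken());
	}

	const std::string& current() const { return buffer.front(); }

	void next() {
		buffer.pop_front();
		buffer.push_back(readToken());
	}

private:
	// A token is a run of alphanumerics or one other character.
	// An empty string means end of input.
	std::string readToken() {
		while (s.pos < s.text.size() && s.text[s.pos] == ' ') ++s.pos;
		if (s.pos >= s.text.size()) return "";
		std::size_t start = s.pos++;
		if (std::isalnum(static_cast<unsigned char>(s.text[start]))) {
			while (s.pos < s.text.size() &&
			       std::isalnum(static_cast<unsigned char>(s.text[s.pos])))
				++s.pos;
		}
		return s.text.substr(start, s.pos - start);
	}

	Scanner& s;
	std::deque<std::string> buffer;
};

// Runs an inner lexer until its current token is "}" and reports where the
// shared scanner ends up, i.e. where the outer lexer would resume reading.
std::size_t scannerPosAtClosingBrace(const std::string& afterDollarBrace, std::size_t depth) {
	Scanner scanner{afterDollarBrace, 0};
	Lexer inner(scanner, depth);
	while (inner.current() != "}" && !inner.current().empty()) inner.next();
	return scanner.pos;
}
```

For the input foo} tail" (the text following "${"), a depth of 1 leaves the shared scanner at position 4, directly after the closing brace, so the outer lexer can continue with the rest of the string. A depth of 2 leaves it at position 9, after "tail" has already been consumed as a lookahead token, which corresponds to the skipped string contents described in this post. The other failure mode, where the text after the brace is not a valid token, is not modelled here.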