Writing a small parser / interpreter (Part 2: Parser)

In the previous step, we looked at how to write a lexical scanner for a small subset of the LOGO programming language.

Our parser relies heavily on the fact that we're parsing a CFG language free from left-recursion (non-terminal not repeated as first symbol on lhs) and common left side prefixes (for any one rule, no productions share a common prefix). These properties enables us to write a special kind of parser, called a LL(1) parser.

The first "L" means we're recognizing LOGO programs read left-to-right.

The second "L" signifies that we're recognizing left-most parse-trees (parse-trees are the derivations of the production rules in our grammar), but this is non-important in our case, since the LOGO subset grammar can only produce one kind of parse-trees (you could try writing down derivations, if in doubt).

The "(1)" part tells us that it is sufficient to look one symbol ahead to be able to predict the next production rule.

You could possibly read up on the subject, preferably in Crafting a compiler with C by Fisher & LeBlanc or just take my word for it.

Tests

Unexpected token's are treated as syntax errors and therefore a suite of tests could look like this (please improve if you find them inadequate):

[TestFixture]
public class TestLogoParser
{
	[Test]
	public void CanParseSimpleLogoProgram()
	{
		TestParseProgram("FORWARD 50");
	}

	[Test]
	public void CanThrowOnSimpleSyntaxError()
	{
		var exception = TestParseProgramThrows("50 FORWARD");
		Assert.AreEqual(LogoErrorCode.SentenceError, exception.ErrorCode);
		Assert.AreEqual("50", exception.ScanBuffer);
	}

	[Test]
	public void CanThrowOnSecondSentenceWithSyntaxError()
	{
		var exception = TestParseProgramThrows("FORWARD 50 LEFT");
		Assert.AreEqual(LogoErrorCode.MatchError, exception.ErrorCode);
		Assert.AreEqual("LEFT", exception.ScanBuffer);
	}

	[Test]
	public void CanThrowOnMissingRBracketSyntaxError()
	{
		var exception = TestParseProgramThrows("REPEAT 4 [ FORWARD 50");
		Assert.AreEqual(LogoErrorCode.MatchError, exception.ErrorCode);
		Assert.AreEqual("50", exception.ScanBuffer);
	}

	private static void TestParseProgram(string program)
	{
		var parser = new LogoParser(new LogoScanner(program));
		parser.ParseLogoProgram();
	}

	private static LogoSyntaxErrorException TestParseProgramThrows(string program)
	{
		return Assert.Throws<LogoSyntaxErrorException>(
			() => TestParseProgram(program)
		);
	}
}

Code

The trick is to write a method, either a recursive one or one that "decends" into a sub-method, where each method corresponds to the possible rule outcomes for a nonterminal symbol. We match tokens as we go along.

public class LogoParser
{
	private readonly LogoScanner scanner;

	public LogoParser(LogoScanner logoScanner)
	{
		scanner = logoScanner;
	}

	// <logo-program>  ::= <logo-sentence> { <logo-sentence> } <EOF>
	public void ParseLogoProgram()
	{
		ParseLogoSentence();
		while (true)
		{
			switch(scanner.NextToken())
			{
				case Token.FORWARD:
				case Token.BACK:
				case Token.LEFT:
				case Token.RIGHT:
				case Token.REPEAT:
					ParseLogoSentence();
					break;

				default:
					Match(Token.EOF);
					return;
			}
		}
	}

	// <logo-sentence> ::= FORWARD <integer>
	//                   | BACK <integer>
	//                   | LEFT <integer>
	//                   | RIGHT <integer>
	//                   | REPEAT <integer> [ <logo-sentence> { <logo-sentence> } ]
	private void ParseLogoSentence()
	{
		var nextToken = scanner.NextToken();
		switch(nextToken)
		{
			case Token.FORWARD:
			case Token.BACK:
			case Token.LEFT:
			case Token.RIGHT:
				Match(nextToken);
				Match(Token.NUMBER);
				break;

			case Token.REPEAT:
				Match(nextToken);
				Match(Token.NUMBER);
				Match(Token.LBRACKET);
				ParseLogoSentence();
				Match(Token.RBRACKET);
				break;

			default:
				SyntaxError(String.Format("Expected one of: FORWARD, BACK, LEFT, RIGHT or REPEAT but found {0}", 
					TokenHelper.TokenToText(nextToken)), LogoErrorCode.SentenceError);
				break;
		}
	}

	private void Match(Token token)
	{
		Token nextToken = scanner.Scan();
		if (nextToken != token)
		{
			SyntaxError(String.Format("Expected {0} but found {1} (at '{2}')",
				TokenHelper.TokenToText(token),
				TokenHelper.TokenToText(nextToken), 
				scanner.ScanBuffer), LogoErrorCode.MatchError);
		}
	}

	private void SyntaxError(string errorMessage, LogoErrorCode errorCode)
	{
		throw new LogoSyntaxErrorException(errorMessage, errorCode, scanner.ScanBuffer);
	}
}

We'll also extend the LogoScanner class with the "lookahead" feature (no tests shown):

public Token NextToken()
{
	var oldIdx = idx;
	var result = Scan();
	idx = oldIdx;
	return result;
}

Ok, so far so good!

We have now created a parser and scanner capable of recognizing our LOGO subset programming language. But we need to translate the LOGO program into meaningful drawing primitives.

We will look into that in the final part 3 of this small mini-series.

Comments are closed