Russell McLoughlin

I occasionally write on topics related to software and synthetic biology.

Indentation + RPLY

03 Feb 2023 » programming

I really struggled to find an example of using RPLY to lex and parse an indentation-based language. This feels a little ironic since RPLY is written in Python!

I hope the below helps someone. The secret is that you can inject a step between the output of the lexer and the input of the parser that converts some tokens into other tokens!

When you define your lexer you need to stop ignoring whitespace and newlines. Add them as tokens to your lexer.

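from rply import LexerGenerator

lg = LexerGenerator()

# ...more tokens here

# A newline plus any leading whitespace on the following line.
lg.add('NEWLINE', r'\n[ \r\t]*')
# Whitespace between tokens within a line.
lg.add('WHITESPACE', r'[ \t]+')

# Remove your lg.ignore(...) rule for whitespace!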
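To see why the NEWLINE token carries the indentation, it can help to try the two patterns with Python's re module directly. This is only a sketch: RPLY builds its own lexer from these rules, and the sample source string is made up for illustration.

```python
import re

# The same patterns as the lexer rules above, compiled with the standard library.
NEWLINE = re.compile(r'\n[ \r\t]*')
WHITESPACE = re.compile(r'[ \t]+')

# Hypothetical two-line program; the second line is indented by four spaces.
source = "if x:\n    y"

# Whitespace in the middle of a line matches WHITESPACE only.
inline = WHITESPACE.match(source, 2)

# At the line break, NEWLINE swallows the newline and the next line's
# leading whitespace in a single match.
match = NEWLINE.search(source)

# Dropping the first character (the newline) leaves exactly the indentation.
indent = match.group()[1:]
print(repr(inline.group()), repr(match.group()), len(indent))
```

Because the indentation is consumed as part of NEWLINE, any WHITESPACE token that reaches the filter sits in the middle of a line (or at the very start of the file) and can be discarded.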

Then when you define your parser generator you should define an extra token “INDENT” and omit the NEWLINE and WHITESPACE tokens.

from rply import ParserGenerator

tokens = [
    # ...other tokens, but not WHITESPACE or NEWLINE
    "INDENT",
]
pg = ParserGenerator(tokens, precedence=[])

Finally, rather than passing the output of your lexer straight to the parser, you can define a generator which looks for NEWLINE and WHITESPACE tokens and filters them or converts them to INDENT tokens.

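from rply import Token

def process_whitespace(tokens):
    """Given a stream of tokens, convert leading whitespace to INDENT tokens."""
    for token in tokens:
        if token.name == 'WHITESPACE':
            # Whitespace inside a line carries no meaning; drop it.
            continue
        elif token.name == 'NEWLINE':
            # Everything after the newline character is the next line's indentation.
            indent_len = len(token.value[1:])
            if indent_len > 0:
                yield Token('INDENT', indent_len, token.source_pos)
        else:
            yield token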

lexer = lg.build()
parser = pg.build()

tokens = lexer.lex(source)
new_tokens = process_whitespace(tokens)
result = parser.parse(new_tokens)

And that’s it! Now you have a nifty INDENT token whose value is the number of whitespace characters in the indent. You can use a parser state object (the state argument that RPLY's parser.parse accepts and hands to each production) to keep track of the current indentation, so you can raise errors or work out which scope your program is in.
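One way to track indentation is a stack of open indent widths, similar in spirit to how Python's own tokenizer decides when a block opens or closes. The sketch below is an assumption about how you might do this, not something RPLY provides: IndentTracker and its update method are hypothetical names, and it works on plain integers (the value of each INDENT token, or 0 for an unindented line) so that it runs on its own.

```python
class IndentTracker:
    """Track nested indentation levels from a sequence of indent widths."""

    def __init__(self):
        # Column 0 is always open; deeper blocks are pushed on top.
        self.stack = [0]

    def update(self, width):
        """Return 1 for a new block, 0 for the same level, or -n for n closed blocks."""
        if width > self.stack[-1]:
            self.stack.append(width)
            return 1
        closed = 0
        # Pop every open block that is deeper than the new width.
        while width < self.stack[-1]:
            self.stack.pop()
            closed += 1
        # The new width must line up with a block that is still open.
        if width != self.stack[-1]:
            raise IndentationError(f"unindent to column {width} matches no open block")
        return -closed


tracker = IndentTracker()
# Indent widths for five consecutive lines of a made-up program.
changes = [tracker.update(w) for w in [0, 4, 8, 4, 0]]
print(changes)
```

An object like this could live on the state you pass to the parser, with update called wherever an INDENT token is consumed. Note that the filter above emits no token for a line that starts at column 0, so a return to the top level would need to be signalled some other way, for example by also yielding INDENT tokens with a value of 0.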