I really struggled to find an example of using RPLY to lex and parse an indentation-based language. This feels a little ironic, since RPLY is written in Python!
I hope the below helps someone. The secret is that you can inject a step between the output of the lexer and the input of the parser that converts some tokens into other tokens!
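When you define your lexer, you need to stop ignoring whitespace and newlines. Add them as tokens to your lexer instead.

from rply import LexerGenerator

lg = LexerGenerator()
# ...more tokens here
# A newline plus any indentation at the start of the next line.
lg.add('NEWLINE', r'\n[ \r\t]*')
# Whitespace in the middle of a line.
lg.add('WHITESPACE', r'[ \t]+')
# Remove your lg.ignore(...) call for whitespace!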
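
To see why the NEWLINE pattern also swallows the whitespace that follows it, here is a small sketch that applies the two patterns with Python's standard re module rather than through RPLY. The sample source string is made up for illustration.

```python
import re

# The same two patterns as above, compiled directly with re (not RPLY).
NEWLINE = re.compile(r'\n[ \r\t]*')
WHITESPACE = re.compile(r'[ \t]+')

# Made-up sample: one header line followed by an indented line.
source = "if x:\n    y = 1\n"

# Each NEWLINE match includes the indentation of the line that follows.
newline_matches = [m.group() for m in NEWLINE.finditer(source)]

# Dropping the leading '\n' leaves just the indentation characters.
indent_widths = [len(text) - 1 for text in newline_matches]

# The space between "if" and "x" (index 2) only matches WHITESPACE.
inline = WHITESPACE.match(source, 2).group()
```

Because the newline token already carries the next line's indentation, any WHITESPACE token that remains sits in the middle of a line and can safely be discarded.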
Then, when you define your parser generator, add an extra token, “INDENT”, and omit the NEWLINE and WHITESPACE tokens.

from rply import ParserGenerator

tokens = [
    # ...other tokens, but not WHITESPACE or NEWLINE
    "INDENT",
]
pg = ParserGenerator(tokens, precedence=[])
Finally, rather than passing the output of your lexer straight to the parser, you can define a generator that looks for NEWLINE and WHITESPACE tokens and either filters them out or converts them to INDENT tokens.

from rply import Token

def process_whitespace(tokens):
    """Given a stream of tokens, convert whitespace to INDENT tokens."""
    for token in tokens:
        if token.name == 'WHITESPACE':
            # Mid-line whitespace carries no meaning; drop it.
            continue
        elif token.name == 'NEWLINE':
            # Everything after the newline character is indentation.
            indent_len = len(token.value[1:])
            if indent_len > 0:
                yield Token('INDENT', indent_len)
        else:
            yield token

# Here lexer = lg.build() and parser = pg.build().
tokens = lexer.lex(source)
new_tokens = process_whitespace(tokens)
result = parser.parse(new_tokens)
And that’s it! Now you have a nifty INDENT token whose value is the number of whitespace characters in the indent. You can use a ParserState object (the state you pass to RPLY's parser.parse) to keep track of the current indent level, so you can throw errors or work out which scope your program is in.
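
As a rough illustration of that last point, here is a minimal sketch of the bookkeeping such a state object might do. IndentTracker and its update method are hypothetical names, not part of RPLY, and the sketch assumes the caller treats a line with no INDENT token as width 0, since the filter above emits nothing for unindented lines.

```python
class IndentTracker:
    """Hypothetical helper (not part of RPLY): tracks open indent levels."""

    def __init__(self):
        # Width 0 is the top level and is never popped.
        self.levels = [0]

    def update(self, width):
        """Return +1 if a block opened, -n if n blocks closed, else 0."""
        current = self.levels[-1]
        if width == current:
            return 0
        if width > current:
            self.levels.append(width)
            return 1
        # Dedent: pop levels until we reach one no wider than this line.
        closed = 0
        while self.levels[-1] > width:
            self.levels.pop()
            closed += 1
        if self.levels[-1] != width:
            raise IndentationError(
                "dedent to width %d does not match any outer level" % width
            )
        return -closed


# Widths of five consecutive lines; 0 stands for "no INDENT token seen".
tracker = IndentTracker()
changes = [tracker.update(width) for width in (0, 4, 8, 8, 0)]
```

A positive result means a new scope opened, a negative one tells you how many scopes closed at once, and the exception covers a dedent that does not line up with any enclosing block.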