ANTLR / lexer (was: Backtick Hickup)

Allan Odgaard 29mtuz102 at sneakemail.com
Mon Sep 3 03:04:03 EDT 2007

Previous message: PHP benchmark (was: Incremental parser)
Next message: ANTLR / lexer (was: Backtick Hickup)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Aug 27, 2007, at 11:02 PM, Eric Astor wrote:

> [...]
> Well - has anyone else looked into ANTLR 3.0 at all? The LL(*)
> grammar language it uses (an EBNF) allows for full backtracking
> support, and unspecified lookahead as far as necessary. It's fairly
> well-optimized, as I understand it, taking advantage of some of the
> packrat-parsing ideas to save handling a single text section
> repeatedly...

I am playing a bit with writing a Markdown parser now, since I have
been involuntarily cut off from my regular project.

The main challenge is really the lexer. There are two problems here:

1. If we generate a token for every special character, we end up
having to deal with a lot of tokens in the parser (grammar). I don’t
like this, so I am making the lexer slightly context aware. I don’t
think this is really a problem: it is no different from having the
lexer switch to another state when it sees e.g. a string literal in a
language where string literals have their own mini-grammar (escape
codes) and thus benefit from their own lexer.

2. Block-environments do not have an end-marker per se; instead, each
line participating in the environment is prefixed with something.

My solution for #2 so far is to make the LF token special in that it
encompasses the leading prefix-stuff of the next line. So effectively,
when the lexer sees ‘ > ’ it outputs a QUOTE_START token and pushes
‘ > ’ onto a global stack (read by the rule for the LF token). When the
LF rule matches ‘\n’ it walks this stack, matching each pattern against
the start of the next line. If a pattern does not match, it pops the
stack down to (and incl.) that pattern, outputs a «token»_STOP for each
popped entry, and then outputs the LF token.
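
To make the stack handling concrete, here is a rough sketch in Python
(not the ANTLR lexer I am actually writing). It only knows about block
quotes, and the names lex, QUOTE_PREFIX and the TEXT token are made up
for illustration. Closing the still-open blocks at end of input is an
assumption of the sketch, not something described above.

```python
import re

# Hypothetical prefix pattern for the only block type in this sketch.
QUOTE_PREFIX = re.compile(r"> ?")


def lex(text):
    """Return (kind, value) tokens; each LF swallows the prefixes of the next line."""
    stack = []               # open block environments, outermost first
    tokens = []
    pos = 0
    at_block_start = True    # True while a new block may still open on this line
    while pos < len(text):
        if text[pos] == "\n":
            newline_at = pos
            pos += 1
            # Match every open prefix, outermost first, against the next line.
            matched = 0
            for _name, pattern in stack:
                m = pattern.match(text, pos)
                if m is None:
                    break
                pos = m.end()
                matched += 1
            # Pop everything from the first non-matching entry inwards.
            while len(stack) > matched:
                name, _pattern = stack.pop()
                tokens.append((name + "_STOP", ""))
            # The LF token covers the newline plus the matched prefixes.
            tokens.append(("LF", text[newline_at:pos]))
            at_block_start = True
            continue
        m = QUOTE_PREFIX.match(text, pos) if at_block_start else None
        if m is not None:
            stack.append(("QUOTE", QUOTE_PREFIX))
            tokens.append(("QUOTE_START", m.group()))
            pos = m.end()
            continue
        # Anything else on the line is plain text in this sketch.
        end = text.find("\n", pos)
        if end == -1:
            end = len(text)
        tokens.append(("TEXT", text[pos:end]))
        pos = end
        at_block_start = False
    # Assumption: blocks still open at end of input are closed here.
    while stack:
        name, _pattern = stack.pop()
        tokens.append((name + "_STOP", ""))
    return tokens


if __name__ == "__main__":
    for token in lex("> a\n> b\nc"):
        print(token)
```

Note that QUOTE_STOP comes out just before the LF that ends the last
quoted line, since that is the first point where the lexer sees that
the next line lacks the prefix.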

This approach means that generally we detect “end of block-level
construct” one LF after it actually did end, so e.g.:

* This is a list item

A paragraph below it.

Becomes:

<ul><li>This is a list item
</li></ul>
<p>A paragraph below it.</p>

This is because the empty line (included in the list item) could
actually have been part of the list item; we do not know that it isn’t
before we see the paragraph. In general though this shouldn’t matter
(except for stuff in <pre>) so I am not sure it is worth addressing --
though a simple pattern-based re-ordering of tokens could fix it, or
maybe I can address it in the parser (grammar).
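
One way the token re-ordering could look, as a naive Python sketch: a
«token»_STOP that directly follows a run of LF tokens is moved in front
of that run. The function name hoist_stops and the (kind, value) token
tuples are made up for illustration, and the token stream below is my
guess at what the lexer would emit for the list example. I have not
checked how this interacts with LF tokens that carry a prefix.

```python
def hoist_stops(tokens):
    """Move each *_STOP token in front of the LF run that directly precedes it."""
    out = []
    for token in tokens:
        kind = token[0]
        if kind.endswith("_STOP"):
            # Find where the trailing run of LF tokens in the output begins.
            i = len(out)
            while i > 0 and out[i - 1][0] == "LF":
                i -= 1
            # Inserting at the start of the run keeps several STOPs in order.
            out.insert(i, token)
        else:
            out.append(token)
    return out


if __name__ == "__main__":
    stream = [
        ("LIST_START", "* "),
        ("TEXT", "This is a list item"),
        ("LF", "\n"),
        ("LIST_STOP", ""),
        ("LF", "\n"),
        ("TEXT", "A paragraph below it."),
    ]
    for token in hoist_stops(stream):
        print(token)
```

With this, LIST_STOP lands directly after the item text, so the empty
line ends up between the list and the paragraph instead of inside the
list item.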


More information about the Markdown-Discuss mailing list