Incremental parser (was: Backtick Hickup)

Michel Fortin michel.fortin at michelf.com
Tue Aug 14 10:41:15 EDT 2007


On 2007-08-13, at 21:56, Allan Odgaard wrote:

> On Aug 13, 2007, at 10:20 AM, Michel Fortin wrote:
>
>>> [...] I know most Markdown parsers do not follow conventional
>>> parser wisdom, but IMO this is also the interpretation that suits
>>> an incremental tokenizer/parser best compared to your
>>> interpretation [...]
>>
>> [...]
>>
>> There are a lot of look-aheads in Markdown:
>
> Are you talking about the spec or implementations? I believe when
> it comes to the implementation it would be more correct to say a
> lot of “iteratively performing search and replacement on the entire
> document”.


Both. Implementations use look-aheads simply because the syntax often
requires looking ahead to see whether a pattern is complete before
deciding if something should be taken literally or not. For instance,
a lone asterisk doesn't start emphasis unless it has a corresponding
one somewhere in the same paragraph.
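
To illustrate, here's a minimal sketch of how a regex-based
implementation can express that look-ahead (simplified; not the
actual Markdown.pl pattern, which is more involved):

<?php
$paragraph = "a *strongly* held opinion and a lone * asterisk";

# The look-ahead is implicit in the regex: when no closing "*"
# can be matched in the paragraph, the pattern fails and the
# asterisk is left as literal text.
echo preg_replace(
    '{\*(?=\S)(.+?)(?<=\S)\*}s',
    '<em>$1</em>',
    $paragraph);
# prints: a <em>strongly</em> held opinion and a lone * asterisk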


>> emphasis won't be applied if asterisks or underscores can't be
>> matched in pairs; links won't be links if there's no suitable
>> parenthesis after the closing bracket, Setext-style headers need
>> the line of hyphens or equal signs following its content, the
>> parsing mode for list items depends on whether or not it contains
>> a blank line, etc.
>
> All but the style thing is limited (fixed size) look-ahead. This is
> not a problem. But those “look to the end of the document each time
> you see X” is a huge problem (for performance, and performance for
> Markdown is bad) -- if there was interest in addressing this, it
> could be done, sure we would have to mildly tweak a few rules, but
> those rules are anyway not written down today, they are just de
> facto rules based on the current implementation, this is why I
> jumped in at this back-tick thing, because as I see it, we have a
> very unconventional parser (in markdown.pl and ported by you to
> PHP) and we let the language be defined by how this parser ends up
> dealing with edge-cases (like pairing two back-ticks with three
> back-ticks). But often setting that way of dealing with things as
> the standard, is just counter-productive to ever getting a “real”
> parser for Markdown.


In the case of the backtick in the other thread, it's clearly not an
edge case to me: I'm pretty sure it has been designed that way.
First, it's much more usable, and second, the code explicitly checks
for the surrounding backticks, so it can't really be an oversight.
But that's for the other thread...

I agree that the syntax needs to be defined more clearly. I think the
syntax page should be updated when we find an ambiguity. But I'm not
the one in charge of that page. I'd suggest checking the test suites
announced on this list: most decisions regarding edge cases have been
"documented" there as regression tests. If some behaviour is part of
the test suite, you can be pretty much certain that it's not a parser
quirk.

I also agree that parsing Markdown is slower than it should be, but I
wonder which unwritten rules you're talking about that make it so
slow. When I wrote about looking until the end of the document, I was
talking about reference-style links, which are pretty well documented
as such. Maybe you had something else in mind?
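
Incidentally, both Markdown.pl and PHP Markdown avoid repeated
end-of-document scans for those by stripping all the link definitions
in a first pass, before anything else is processed. A rough sketch of
the idea (simplified; the real pattern also handles titles and
multi-line forms):

<?php
$text = "See [the spec][md] for details.\n\n"
      . "[md]: http://daringfireball.net/projects/markdown/\n";

# First pass: collect every "[id]: url" definition, wherever it
# appears, then strip it from the text. Afterwards each
# [text][id] reference can be resolved from the $urls table
# without scanning ahead.
$def  = '{^[ ]{0,3}\[(.+)\]:[ \t]*<?(\S+?)>?[ \t]*$}m';
$urls = array();
preg_match_all($def, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    $urls[strtolower($m[1])] = $m[2];
}
$text = preg_replace($def, '', $text);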


>> There's no way to do a truly incremental parsing of Markdown...
>> well, you could in a way, but you'd have to mutate many parts of
>> the output document while parsing
>
> I strongly disagree. TextMate does a very good job at syntax
> highlighting Markdown, and it is based 100% on an incremental
> parser -- in v2.0 there will be some new parser features which will
> allow for it to deal with 99% of all Markdown out there. Where it
> has problems is really in the edge-cases, but that is partly
> because these are undefined, and partly because when they come up,
> e.g. like this back-tick thing, they “get defined” in a bad way.


Syntax highlighting isn't the same thing as "parsing" Markdown, at
least not in my mind. It's more like "tokenizing" Markdown, and for
that you don't really need to be aware of the whole document at once;
you just need to identify things, not interpret them... Although it
can get complicated if you consider that some things, such as
reference-style links, are only links when they have a corresponding
definition.
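
As a toy illustration of the distinction (this is not how TextMate's
grammar engine works, just the general idea): a tokenizer only has to
classify what it sees, line by line, without resolving what it means:

<?php
$text = "# Title\n\n* an item\n\n    code();\n\nplain text\n";

# Classify each line without interpreting the document as a
# whole: good enough for highlighting, not for correct output.
$tokens = array();
foreach (explode("\n", $text) as $line) {
    if (preg_match('/^#{1,6} /', $line))        $type = 'heading';
    elseif (preg_match('/^[*+-] /', $line))     $type = 'list-item';
    elseif (preg_match('/^(\t|    )/', $line))  $type = 'code';
    else                                        $type = 'text';
    $tokens[] = array($type, $line);
}

A real parser, on the other hand, has to decide things like whether a
list item contains a blank line before it knows how to render its
content.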


>> (like HTML parsers do in browsers),
>
> Say what?


I was referring to how HTML browsers mutate the DOM in strange ways
while parsing improper HTML, if only to "fix" things such as this:

<b><p>text</b> text</p>

into this:

<p><b>text</b> text</p>

or this:

<table>
<tbody><tr>text</tr></tbody>
</table>

which becomes:

text<table>
<tbody><tr></tr></tbody>
</table>

Also, some head-elements like `<meta>` or `<base>` are placed into
the head of the document when they're found outside of it (in some
browsers).

But that's not really what you meant by "incremental parsing", it
seems, so I guess all these examples are irrelevant.


>> or to delay the output of ambiguous parts until the end of the
>> document; all this surely defeats the purpose of an incremental
>> parser.
>
> I think you misunderstand my use of the term. By incremental I mean
> that it scans the input document byte-by-byte (and creates tokens,
> from which a parse-tree is built), never going back to already
> scanned bytes. So this gives it a somewhat linear time complexity
> with a low constant factor.


Indeed, I assumed the parser was doing the output too, or at least
generating sequential events (à la SAX). If you're talking about
creating the document tree in one pass, I suppose this can work if
you allow mutations of the document (as opposed to just appending new
elements at the end, as you'd do with an XML parser).


> I believe I have already mentioned it, but for reference,
> markdown.pl takes almost 40s to convert the TextMate manual into
> HTML where TM 2’s parser, which parses the manual *exactly* the
> same uses less than a quarter of a second.


Ok, back to performance. How many times do you start a new Perl
process when building the manual? My benchmarks indicate that
starting the process generally takes more time than the actual
conversion, although this clearly depends on the input. Also, if
you're using the current betas of Markdown.pl, the performance of the
HTML block parser is pretty bad.

Is the manual available in Markdown format somewhere? I'd like to do
some benchmarking.
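
For anyone who wants to reproduce that kind of measurement, here's
how I'd separate conversion time from process startup with PHP
Markdown (the input file name is made up; Markdown() is the function
markdown.php defines):

<?php
require_once 'markdown.php'; # defines the Markdown() function

# Time the conversion alone: process startup and script
# compilation have already been paid for by this point.
$text  = file_get_contents('manual.text'); # hypothetical input
$start = microtime(true);
$html  = Markdown($text);
printf("conversion: %.3f s\n", microtime(true) - $start);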

I'm not at all convinced that creating a byte-by-byte parser in Perl
or PHP is going to be very useful. Using regular expressions is much
faster than executing equivalent PHP code in my experience. If we
were talking about a compiled language, such as C++, things would be
reversed. But a C++ Markdown implementation wouldn't be very useful
to most website engines running on shared hosting with limited
privileges, hence the language constraints.

That doesn't mean PHP Markdown and Markdown.pl can't be made faster,
but I'd be surprised if they ever reach the speed of TextMate's
dedicated parsing state machine.


> That said, another reason why I am focused on an incremental parser
> is because then we get closer to a formal grammar -- i.e. if we see
> token A we switch to state X, etc. rather than now where nested
> constructs are only made possible by dubious md5-transformations on
> subsets of the document.


You may not like the way Markdown.pl or PHP Markdown works, but we
each have to work within the limits of our particular language.
Between a PHP-made state machine and a PHP-regex hybrid design, I'd
choose the second if it means I get decent performance. And I don't
think a formal grammar will help much in getting better performance
in a PHP or Perl environment, for the reasons outlined above.

Don't misunderstand me: I am all for a more formal definition of the
syntax, I just don't think it'll be of much use to me (for PHP
Markdown at least). If you wish to create a better definition of the
language, I'll be glad to help by answering the questions I can
answer, exemplifying edge cases and their desirable outputs, etc. If
you want the syntax changed so that it better fits your parser (and
possibly other incremental parsers), then I can provide my point of
view, but I'm not the one who makes the final decision.


>> The worst "look-ahead" (or most complex "mutations") would be for
>> reference-style links which can have their definitions absolutely
>> anywhere in the document. Interestingly, that's probably one of
>> the most appreciated features of Markdown.
>
> You may have misunderstood the incremental as giving incremental
> output -- that is not the case.
>
> So basically you parse the document token-for-token. When you see
> e.g. [foo][bar] you create the link, but just keep the bar as its
> value -- when you see [bar]: http://domain.tld then you insert
> http://domain.tld in the symbol table for the bar symbol.
>
> When it gets time to write out the document (i.e. the full input
> has been parsed / parse tree built) you just look-up the links in
> the symbol table when they are written out.


That's what I meant by "mutations" of the document.

There's a tricky case here, however: [foo][bar] isn't a link in
Markdown unless "bar" is defined somewhere; if it isn't defined, it's
plain text. That may seem like an edge case right now, but when/if
Markdown gets the [shortcut link] syntax (as added to the current
betas of 1.0.2), this may become a more interesting problem for
syntax highlighting, as any bracketed text will then become a
potential link depending on whether or not it has been defined
elsewhere in the document.
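
For what it's worth, here's one way an incremental parser could
handle that: keep each bracketed span as a pending node and decide
between link and literal text only at output time, once the symbol
table is complete. A rough sketch, not taken from any existing
implementation:

<?php
$pending = array(); # bracketed spans, in document order
$urls    = array(); # symbol table: id => url

# During parsing, "[foo][bar]" becomes a pending node...
$pending[] = array('foo', 'bar');
# ...and a definition "[bar]: http://domain.tld" fills the table.
$urls['bar'] = 'http://domain.tld';

# At output time each node resolves to a link if its id was ever
# defined, and falls back to the literal text otherwise.
foreach ($pending as $node) {
    list($label, $id) = $node;
    if (isset($urls[$id])) {
        echo "<a href=\"{$urls[$id]}\">$label</a>\n";
    } else {
        echo "[$label][$id]\n"; # no definition: plain text
    }
}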


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/



