Formal Grammar — some thoughts

A. Pagaltzis pagaltzis at gmx.de
Sun Jul 30 18:50:23 EDT 2006


* Michel Fortin <michel.fortin at michelf.com> [2006-07-30 22:40]:

> Personally, I'd do it with multiple passes of tokenization. I'd
> first tokenize block-level elements and define a particular
> rendering procedure for each of these block-level tokens. Then,
> when parsing of span-level elements is needed inside
> block-level tokens, I'd tokenize the text content of these
> blocks (with proper indentation removed as needed) into
> span-level tokens. This means you'd have two grammars: one to
> separate block elements, one to separate span elements.


I thought of this before, but had the impression that Markdown is
ambiguous in some cases, so that you can’t always decide where a
block ends without knowing about the span-level elements inside
it. But now that I look at the syntax reference and the source,
I’m unsure how I got this impression. I think it was because of
the apparent ambiguity of asterisks (“emphasis, or list?”), but
they actually aren’t ambiguous. Lazy indentation was another
thing I was worried about, but now that I look at it, it seems
that it shouldn’t be an issue either.
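
To make the two-pass idea concrete, here is a minimal sketch in
Python (not Perl, and not taken from Markdown.pl). The function
names and the tiny subset of syntax handled (single-line list
items, paragraphs, `*emphasis*`) are assumptions made for
illustration only. It shows the block pass consuming list markers
before the span pass ever looks for emphasis, which is why the
asterisk is not ambiguous here.

```python
import re

def tokenize_blocks(text):
    """First pass: split a document into (kind, items) block tokens."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        lines = chunk.split("\n")
        # A list marker is a bullet character followed by a space at
        # the start of a line; "*word*" never matches this.
        if all(re.match(r"[*+-] ", ln) for ln in lines):
            # Strip the markers so the span pass never sees them.
            blocks.append(("list", [ln[2:] for ln in lines]))
        else:
            blocks.append(("para", [" ".join(ln.strip() for ln in lines)]))
    return blocks

SPAN = re.compile(r"\*(.+?)\*")

def tokenize_spans(text):
    """Second pass: split block content into (kind, content) span tokens."""
    tokens, pos = [], 0
    for m in SPAN.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        tokens.append(("em", m.group(1)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens

def parse(text):
    """Run the block grammar first, then the span grammar inside each block."""
    return [(kind, [tokenize_spans(item) for item in items])
            for kind, items in tokenize_blocks(text)]

tree = parse("* one\n* *two*\n\nSome *emphasis* here.")
```

A real block pass would also have to handle lazy indentation,
nested lists, code blocks and so on; the sketch only shows the
division of labour between the two grammars.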


> I'd like to point out that in my view John's implementation is
> already doing tokenization in some form.


Yes. I don’t see that it operates on the tokens in any way,
though. It tokenizes only in order to hide certain things from
particular stages of the search-and-replace chain, because that
chain is so extremely order-dependent.


> I recognize that md5 hashes are somewhat overkill for this
> process. In fact, any alphanumeric string which isn't present
> in the input text is suitable for "tokens".


If “isn’t present in the input text” were easy, I’m pretty sure
John would have done that. But it’s not. Using MD5 hashes means
a collision with the input remains possible, but the chance is so
extremely slim as to be dismissible.
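
For illustration, the hide-and-restore trick can be sketched
roughly as follows, in Python rather than Perl. This is not the
Markdown.pl code; the helper names `protect` and `restore` and
the code-span pattern are hypothetical. A hex digest contains
only the characters 0-9 and a-f, so later substitution stages
find nothing in it to act on.

```python
import hashlib
import re

def protect(text, pattern, stash):
    """Replace each match of pattern with its MD5 hex digest,
    remembering the original text in stash."""
    def swap(m):
        key = hashlib.md5(m.group(0).encode("utf-8")).hexdigest()
        stash[key] = m.group(0)
        return key
    return re.sub(pattern, swap, text)

def restore(text, stash):
    """Put the hidden originals back in place of their digests."""
    for key, original in stash.items():
        text = text.replace(key, original)
    return text

stash = {}
# Hide code spans, so the emphasis stage cannot touch their asterisks.
hidden = protect("a `*not emphasis*` b *real*", r"`[^`]*`", stash)
# A later search-and-replace stage, unaware of code spans.
marked = re.sub(r"\*(.+?)\*", r"<em>\1</em>", hidden)
result = restore(marked, stash)
# result == "a `*not emphasis*` b <em>real</em>"
```

The restore step would misfire only if the input already
contained one of the 32-character digests, which is the slim
chance mentioned above.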


> This is far from having a formal grammar, but it shows that a
> lot more could be done by reusing the current approach.


I guess if all you want to do is exactly what Markdown.pl does,
i.e. convert a Markdown document to an HTML fragment, then that
is true.

Extending or modifying a parser that’s separate from the
converter would still be easier, though. Such a parser could also
be used to just extract bits of info from a Markdown document, or
to attach a modified converter or one that targets a different
format (DocBook, say) without needing an entirely new
implementation of Markdown.
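
As a rough sketch of what that separation buys, assume a parser
that produces a small tree of (kind, children) nodes. The tree
shape, the template tables and the function names below are all
hypothetical, written in Python for brevity. The same tree feeds
an HTML converter, a DocBook-flavoured converter, and a function
that extracts information without rendering anything.

```python
from html import escape

# Hypothetical parser output: (kind, children) tuples, strings as leaves.
doc = ("doc", [
    ("para", ["Some ", ("em", ["emphasis"]), " here."]),
])

# One template table per output format; the parser knows about neither.
HTML = {"doc": "{}", "para": "<p>{}</p>", "em": "<em>{}</em>"}
DOCBOOK = {
    "doc": "<article>{}</article>",
    "para": "<para>{}</para>",
    "em": "<emphasis>{}</emphasis>",
}

def render(node, templates):
    """Convert a tree to markup using the given template table."""
    if isinstance(node, str):
        return escape(node)
    kind, children = node
    inner = "".join(render(child, templates) for child in children)
    return templates[kind].format(inner)

def text_of(node):
    """Plain text content of a node, with all markup dropped."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[1])

def collect(node, kind):
    """Extract the text of every node of one kind, without rendering."""
    if isinstance(node, str):
        return []
    found = [text_of(node)] if node[0] == kind else []
    for child in node[1]:
        found.extend(collect(child, kind))
    return found

html_out = render(doc, HTML)
docbook_out = render(doc, DOCBOOK)
emphasized = collect(doc, "em")
```

Adding another target format here means adding another table,
not touching the parser.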

I wouldn’t be surprised if it required less logic as well.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


More information about the Markdown-Discuss mailing list