RE: Formal Grammar — some thoughts

Eric Astor eastor1 at swarthmore.edu
Sat Jul 29 17:22:16 EDT 2006



> -----Original Message-----

> From: markdown-discuss-bounces at six.pairlist.net [mailto:markdown-discuss-

> bounces at six.pairlist.net] On Behalf Of Allan Odgaard

> Sent: Saturday, July 29, 2006 4:38 PM

> To: markdown-discuss at six.pairlist.net

> Subject: Formal Grammar — some thoughts

>

> I recently subscribed and saw in the archive that Eric Astor was

> asking for a formal grammar (unlikely the first time for such request.)


I should now add that since asking, I've started work on a parser for a
Markdown variant, coding in Python and using the Martel parsing framework
(http://www.dalkescientific.com/Martel/). A VERY large fraction of what I've
written could likely be re-used in building a true Markdown parser with this
framework, which uses a regex-based specification of the parsing format and
supports various features (including back-references) that are not possible
in formal grammars. In the process, I've learned a lot that may also be
applicable to building a true formal grammar for Markdown.


> Some of the problems with a formal grammar (as I see it) are:

>

> 1. interpreting tokens as literal text when end token is missing,

> example: `this is __not starting bold`.


This is actually simple to deal with in most formal grammars - since formal
grammars are recursive, you simply define bold (for example) as:
bold := ('__' SPAN '__') | ('**' SPAN '**')
This will then allow the marker token to be interpreted as literal text. The
Martel 'grammar' works similarly.


> 2. using back-references in end-tokens, example: `a ``` ``raw`` ```

> environment`. A formal grammar can’t really do that, and IMO the

> clean solution would be to define single-quoted (backticked) raw as

> supporting no escaping and end with the first `` ` ``, where double-

> quoted (backticked) raw would support escaping of `` ` `` and `\`.


True, back-references are not possible in any formal grammar - but given a
parsing framework that supports a fixed amount of lookahead, it's easy to
support nearly the same functionality that Markdown does using alternative
definitions. Now this can get ugly, as when six definitions are required to
support a code span using up to six backticks as markers, but it is somewhat
manageable.

Regardless, I definitely prefer Allan's proposed revision of the syntax -
but if we need the backwards-compatibility, then it should be possible to
support it with only a minor mess resulting.


> 5. heuristically defined end of lists, sub-lists and block-quotes.

> This would need to be more strict. I am not entirely sure what the

> current definition is, so I am wary of reformulating a strict

> version.


This would indeed have to be more strict - and I really think some sort of
stricter specification would be very valuable, particularly since this is an
issue the various Markdown implementations tend to disagree on. Personally,
I think sublisting should require at least 4 spaces (or 1 tab) of
indentation past the previous list's indenting level, which would keep
consistency with the rest of Markdown. This sort of thing would also have
the benefit of making it MUCH easier to write a decent semi-formal lexer,
which could then simplify the task of writing the parser. (That reminds me -
like other languages that are indentation-sensitive, it's impossible to
specify Markdown both formally and completely... some part of the parser
will need to be informal.)

Anyway, that's a bit of what I've picked up in the process of writing my
parser... I'd be glad to help out with finding a way to formalize Markdown,
however I can.

- Eric Astor

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.10.5 - Release Date: 7/28/2006




More information about the Markdown-Discuss mailing list