Markdown Extra Spec: Parsing Section

John MacFarlane jgm at berkeley.edu
Tue May 13 02:06:11 EDT 2008


+++ Michel Fortin [May 13 08 00:32 ]:

> On 2008-05-12 at 18:14, John MacFarlane wrote:
>
>> The PEG representation is concise, precise, and readable.
>
> Readable, hum... if I look at this rule from PEG Markdown:
>
> ListContinuationBlock = a:StartList
>     ( BlankLines
>       { if (strlen($$.contents.str) == 0)
>             $$.contents.str = strdup("\001"); /* block separator */
>         pushelt($$, &a); } )
>     ( Indent ListBlock { pushelt($$, &a); } )+
>     { $$ = mk_str(concat_string_list(reverse(a.children))); }
>
> it looks a lot like code to me, half of it I don't understand.


Well, you've picked the ugliest part. But don't be repelled too
quickly. Note that the stuff between { } is C code that constructs the
syntax tree. If you just want to see the syntax specification, you can
pretty much ignore those parts. The "StartList" bit can also be ignored,
as it just initializes a list. With that stripped out, you get:

ListContinuationBlock = BlankLines (Indent ListBlock)+

That is, a list continuation block is some blank lines followed by
one or more ListBlocks, each preceded by indentation. That seems
pretty readable to me.
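
To make that reading concrete, here is a minimal Python sketch of a matcher for the stripped-down rule. It is not taken from peg-markdown. It assumes that BlankLines means zero or more blank lines, that Indent is a tab or four spaces, and that a ListBlock is a single non-blank line (the real ListBlock rule is richer). The function name is hypothetical.

```python
def match_list_continuation_block(lines, pos=0):
    """Return the index just past the match, or None if there is no match."""
    # BlankLines: assumed here to mean zero or more blank lines.
    while pos < len(lines) and lines[pos].strip() == "":
        pos += 1

    # (Indent ListBlock)+ : one or more indented, non-blank lines.
    count = 0
    while pos < len(lines):
        line = lines[pos]
        if line.startswith("\t"):          # Indent: assumed tab ...
            rest = line[1:]
        elif line.startswith("    "):      # ... or four spaces
            rest = line[4:]
        else:
            break
        if rest.strip() == "":             # simplified ListBlock: non-blank
            break
        pos += 1
        count += 1

    # The '+' requires at least one (Indent ListBlock) repetition.
    return pos if count > 0 else None
```

For example, given the lines `["", "    continued text", "not indented"]` the sketch consumes the first two lines and stops at the third.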

Here's the part that concerns the recent discussion about "refname"
(again, I've omitted the {} parts and parts that modify the rules
depending on which syntax extensions have been selected).

Reference = NonindentSpace Label ':' Spnl RefSrc Spnl RefTitle
BlankLine*

A reference is some space of less than one indent, followed by a
Label, followed by ':', followed by optional blank space including
at most one newline, followed by a RefSrc, followed
by optional blank space including at most one newline, followed by
a RefTitle, followed by optional blank lines. (You may not agree
with that. But it's easy to see how to modify the rule above if,
for example, you don't think leading space should be allowed.)

Label = '[' (!']' Inline)+ ']'

A label is a '[' followed by a string of one or more Inline elements
that don't begin with ']', followed by ']'. (Note: this allows
text within balanced brackets, which will be parsed as a single Inline
element.)
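
The negative lookahead `!']'` and the balanced-bracket behaviour can be illustrated with a small hand-written matcher in Python. This is only a sketch: the real Inline rule covers emphasis, links, code spans and more, whereas here Inline is assumed to be either a balanced `[...]` group or any single character other than a newline. All function names are hypothetical.

```python
def match_inline(text, pos):
    """Simplified Inline: a balanced [...] group, or one ordinary character."""
    if pos >= len(text) or text[pos] == "\n":
        return None
    if text[pos] == "[":
        end = match_brackets(text, pos, minimum=0)
        if end is not None:
            return end          # the whole bracketed group is one Inline
    return pos + 1              # otherwise consume a single character


def match_brackets(text, pos, minimum):
    """Match '[', then at least `minimum` Inlines not starting with ']', then ']'."""
    if pos >= len(text) or text[pos] != "[":
        return None
    pos += 1
    count = 0
    while pos < len(text) and text[pos] != "]":   # the !']' lookahead
        nxt = match_inline(text, pos)
        if nxt is None:
            break
        pos = nxt
        count += 1
    if count < minimum or pos >= len(text) or text[pos] != "]":
        return None
    return pos + 1


def match_label(text, pos=0):
    """Label = '[' (!']' Inline)+ ']'; returns the end offset or None."""
    return match_brackets(text, pos, minimum=1)
```

As in a PEG, the repetition is greedy and never backtracks, so an opening bracket that is never closed makes the whole Label fail rather than being reinterpreted.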

RefSrc = Nonspacechar+

A RefSrc is a string of one or more nonspace characters.

And so on.
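
For readers who want to experiment, the three rules above can be roughly approximated by a single Python regular expression. This is a sketch under stated assumptions, not the grammar itself: NonindentSpace is taken to be zero to three spaces, Label is simplified to exclude nested brackets, and RefTitle is assumed to be an optional quoted string on one line. The names `REFERENCE` and `parse_reference` are hypothetical.

```python
import re

REFERENCE = re.compile(
    r"""
    [ ]{0,3}                      # NonindentSpace: assumed to be 0-3 spaces
    \[ (?P<label> [^\]\n]+ ) \]   # Label, simplified: no nested brackets
    :
    [ \t]* (?:\n[ \t]*)?          # Spnl: blank space with at most one newline
    (?P<src> \S+ )                # RefSrc = Nonspacechar+
    (?:
        [ \t]* (?:\n[ \t]*)?      # Spnl again, used only if a title follows
        (?P<title> "[^"\n]*" | '[^'\n]*' )   # RefTitle: assumed quoted string
    )?
    [ \t]* (?: \n | \Z )          # rest of the line must be empty
    (?: [ \t]* \n )*              # BlankLine*
    """,
    re.VERBOSE,
)


def parse_reference(text):
    """Return (label, src, title) for a reference at the start of text, else None."""
    m = REFERENCE.match(text)
    if m is None:
        return None
    title = m.group("title")
    if title is not None:
        title = title[1:-1]       # drop the surrounding quotes
    return m.group("label"), m.group("src"), title
```

Changing the rule is then a matter of editing one line of the pattern; for instance, replacing `[ ]{0,3}` with nothing disallows leading space. Note that a regular expression backtracks where a PEG does not, so the two can disagree on edge cases.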

Again, a lot of the ugliness of the specification is due to the C
code that constructs the parse tree. If that bothers you, you might
like the Haskell version better (though there are a few problems with
that grammar that I have corrected in the C version). It contains
a minimum of embedded code, and the whole grammar fits in just
160 lines:

http://github.com/jgm/markdown-peg/tree/4438c336444a714f15ed619c9897d91c3ab6b40e/Markdown.hs#L68

I've been working on the C version mainly because more people have
access to a C compiler, and because it's significantly faster than the
Haskell.

John


