Incremental parser (was: Backtick Hickup)
    Allan Odgaard 
    29mtuz102 at sneakemail.com
       
    Sun Aug 19 01:07:53 EDT 2007
    
    
  
On Aug 14, 2007, at 9:41 AM, Michel Fortin wrote:
> [...]
> I agree that the syntax needs to be defined more clearly.
I am glad that we are finally reaching agreement on this. You may not  
recall, but a year ago you asked me: “is it so much important that  
these border cases be consistent across all implementations?” [1]
[1]: http://six.pairlist.net/pipermail/markdown-discuss/2006-July/000146.html
> I think the syntax page should be updated when we find an ambiguity.
That would be nice, yes -- but IMO we need to take a step back and  
really define the syntax in a more formal way, because just  
clarifying a lot of border cases is tedious and complex. Doing  
something closer to a real grammar would not leave us with all these  
ambiguities in the first place. As stated, this is also why I brought  
up the incremental parser: it works from a state machine, and a state  
machine has a clear transition from state to state based on the  
input, unlike the present ad-hoc parser.
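To illustrate what I mean by a clear transition per input character,  
here is a minimal sketch (Python, purely for illustration -- not how  
TM or any real implementation is written) of a two-state machine that  
tokenizes backtick code spans:

    # Two states: TEXT and CODE.  Every backtick causes exactly one
    # well-defined transition, so earlier input never needs re-scanning.
    def tokenize(line):
        tokens, state, start = [], "TEXT", 0
        for i, ch in enumerate(line):
            if ch == "`":
                if i > start:                  # emit what came before
                    tokens.append((state, line[start:i]))
                state = "CODE" if state == "TEXT" else "TEXT"
                start = i + 1
        if start < len(line):
            tokens.append((state, line[start:]))
        # if state is still CODE here, the span is unterminated -- a
        # formal grammar forces us to decide that case up front
        return tokens

    print(tokenize("foo `bar` baz"))
    # [('TEXT', 'foo '), ('CODE', 'bar'), ('TEXT', ' baz')]

The point is not this particular tokenizer, but that every input has  
exactly one defined outcome.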
> But I'm not the one in charge of that page. I'd suggest checking  
> the testsuites announced on this list: most decisions regarding  
> edge cases have been "documented" there as regression tests. If  
> some behaviour is part of the test suite, you can be pretty much  
> certain that it's not a parser quirk.
I have not looked at these -- that is, I did look at Gruber’s  
original test suite, and it basically just tested a lot of simple  
cases. This is IMO *not* the way to go about things, i.e. defining a  
grammar by accumulating test cases.
Take e.g. this letter from last year:  
http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000151.html  
-- there I talk about the problems that arise from mixing nested  
block elements and the lazy-mode rules. I think this should be  
clearly defined, not just defined via test cases, because we need  
clear rules, not recipes for how to handle a few select cases.
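To give a concrete example of the kind of ambiguity I mean (a made-up  
case in the same spirit as that letter), consider:

    > * a list item inside a blockquote
    a lazy continuation line

Whether the second line lazily continues the list item inside the  
blockquote, continues only the blockquote, or starts a new paragraph  
is exactly the kind of question a grammar should answer once, rather  
than case by case.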
> [...]
> Syntax highlighting isn't the same thing as "parsing" Markdown, not  
> in my mind. It's more like "tokenizing" Markdown [...]
But building a parse-tree is pretty easy if you have already  
tokenized the Markdown correctly. Anyway, TM does build the parse- 
tree as well. This is slightly beside the point, though; I was just  
saying that TM does take the “incremental approach”, and it works  
quite well for the actual documents out there.
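For instance, once the tokenizer emits well-formed open/close tokens,  
building the tree is just a stack fold. A minimal sketch in Python,  
with made-up token shapes:

    # Fold a (correct) token stream into a parse tree with a stack --
    # the hard part is the tokenizing, not the tree building.
    def build_tree(tokens):
        root = ("root", [])
        stack = [root]
        for kind, value in tokens:
            if kind == "open":                 # e.g. ("open", "emph")
                node = (value, [])
                stack[-1][1].append(node)
                stack.append(node)
            elif kind == "close":              # e.g. ("close", "emph")
                stack.pop()
            else:                              # plain text token
                stack[-1][1].append(value)
        return root

    print(build_tree([("text", "a "), ("open", "emph"),
                      ("text", "b"), ("close", "emph")]))
    # ('root', ['a ', ('emph', ['b'])])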
> [...]
> Ok, back to performance.
Just to be clear, my motivation here is *not* performance. My  
motivation is getting Markdown interpreted the same in different  
contexts, which presently it isn’t always, i.e. to get a clearly  
defined syntax, so I can ensure that the highlight in TM follows the  
standard to the letter (and thus the syntax highlight you get matches  
the look of the post on your blog, the HTML constructed by  
Markdown.pl, or the local preview of the Markdown done with redcloth).
> How many time do you start a new Perl process when building the  
> manual?
> [...]
> Is the manual available in Markdown format somewhere? I'd like to  
> do some benchmarking.
http://six.pairlist.net/pipermail/markdown-discuss/2006-August/000152.html
> I'm totally not convinced that creating a byte-by-byte parser in  
> Perl or PHP is going to be very useful.
The key here is really having clearly defined state transitions.
> Using regular expressions is much faster than executing equivalent  
> PHP code in my experience [...] I'd be surprised if it [PHP  
> Markdown / Markdown.pl] ever reach the speed of TextMate's  
> dedicated parsing state machine.
TM has a language grammar declaration where each token is defined by  
a regexp -- it could be a lot faster if a dedicated parser were  
written, but my goal here was flexibility, not speed.
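Roughly the idea is this (a toy model in Python, *not* TM’s actual  
grammar format; the token definitions are hypothetical):

    import re

    GRAMMAR = [                                # hypothetical tokens
        ("code", re.compile(r"`[^`]*`")),
        ("emph", re.compile(r"\*[^*]+\*")),
        ("link", re.compile(r"\[[^\]]*\]\[[^\]]*\]")),
    ]

    def scan(text):
        tokens, pos = [], 0
        while pos < len(text):
            for name, pattern in GRAMMAR:
                m = pattern.match(text, pos)   # try each token here
                if m:
                    tokens.append((name, m.group()))
                    pos = m.end()
                    break
            else:                              # no token matched:
                tokens.append(("text", text[pos]))
                pos += 1                       # plain text, advance one
        return tokens

    print(scan("see `code` and *emph*"))

Trading that flexibility back for speed would mean compiling such a  
table into a dedicated state machine instead of trying each regexp in  
turn.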
I am *not* touting TM’s parser as fast. I am trying to convince you  
that the current way things are done is pretty bad, and bad for many  
reasons: the (lack of) formality with which the grammar is defined;  
the (lack of) simplicity in the code (and thus extensibility of the  
language grammar); and also the (lack of) performance, since the  
current implementation effectively does not support nested constructs  
and thus has to fake them by doing iterative manglings of subsets of  
the document to treat them as a “nested” environment -- complicated a  
lot by how it is documented to support embedded HTML (untouched by  
the Markdown parser, though in practice some edge cases are not  
handled correctly here).
> [...] If you wish to create a better definition of the language,  
> I'll be glad to help by answering questions I can answer,  
> exemplifying edge cases and their desirable outputs, etc.
We pretty much went over that last year, and I thought I had made the  
point by now: what I am after is defining the syntax, not the edge  
cases -- I can read how Markdown.pl deals with them myself (although  
it deals with several by producing invalid code).
> If you want the syntax changed so that it better fit your parser  
> (and possibly other incremental parsers) then I can provide my  
> point of view, but I'm not the one who takes the final decision.
Unfortunately Gruber is dead silent when it comes to this.
It may come off as self-serving to approach things from the  
traditional incremental-parser (formal grammar / BNF) POV, but it is  
because I really think this would be best for bringing all  
implementations of the Markdown parser in sync, getting better  
performance, avoiding many of the broken edge cases we have now, and  
having the tools provide accurate syntax highlight.
Already there are several forks of Markdown (i.e. where stuff is  
added to the syntax), so I don’t think the best approach (for me)  
would be to start yet another fork -- Markdown should be one  
standard, not a dozen different ones, and that is why I am so keen on  
having a clearly defined standard.
> [...]
> There's a tricky case here however: [foo][bar] isn't a link in  
> Markdown unless "bar" is defined somewhere; if it isn't defined,  
> it's plain text. That may seem like an edge case right now, but  
> when/if Markdown gets the [shortcut link] syntax (as added to the  
> current betas of 1.0.2), this may become a more interesting problem  
> for syntax highlighting as any bracketed text will then become a  
> potential link depending on whether or not it has been defined  
> elsewhere in the document.
Yes, and personally I would say that whenever you write [foo][bar]  
you get a link, regardless of whether or not bar is a defined  
reference -- if bar is not defined, you could default to making it  
reference the URL ‘#’. This makes parsing *much* easier (here I am  
thinking about cases like ‘*dum [foo*][bar]’ or ‘[*foo][bar] dum*’).  
The three reasons for choosing this rule are:

 1) partial documents are tokenized the same as full documents --  
consider that my references may come from an external file, yet some  
tools may still work on the “partial” document (i.e. the one w/o the  
actual bibliography), such as a local preview and the syntax  
highlight;

 2) no-one would likely make use of the “feature” that [foo][bar] is  
the raw text [foo][bar] when bar is undefined (this is equivalent to  
saying that <p>foo</b></p> should keep </b> as literal text, since no  
<b> was found); and

 3) it really is easier for the user to relate to “the pattern  
[something][something] is a link”.
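A sketch of that rule, assuming the reference definitions have  
already been collected into a table (the names and the sample table  
here are hypothetical):

    import re

    LINK = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]")

    def link_href(ref_id, references):
        # the proposed rule: [foo][bar] is *always* a link; an
        # undefined reference just falls back to the URL '#'
        return references.get(ref_id.lower(), "#")

    refs = {"bar": "http://example.com/"}      # hypothetical table
    for m in LINK.finditer("a [foo][bar] and [foo][baz]"):
        print(m.group(1), "->", link_href(m.group(2), refs))
    # foo -> http://example.com/
    # foo -> #

With this, the tokenizer never needs to know the reference table at  
all -- only the renderer does, which is what makes partial documents  
work.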
    
    