Incremental parser
Jacob Rus
jacobolus at gmail.com
Tue Aug 14 21:54:48 EDT 2007

Michel Fortin wrote:
> Le 2007-08-13 à 21:56, Allan Odgaard a écrit :
> Both. Implementations use look-aheads simply because the syntax often 
> requires looking ahead to see if the pattern is complete before deciding 
> if something should be taken literally or not. For instance, an asterisk 
> alone doesn't start emphasis unless it has a corresponding one in the 
> same paragraph.
Okay, but the spec could easily be changed to continue emphasis until 
the end of the current paragraph if no closing * is found.  This would 
IMO be much more logical for document readers and writers (as well as 
for computer parsers), since an asterisk 60 lines later would no longer 
affect part of the document on the current line.
Given that `*`s surrounded by spaces are left alone, I don't think this 
change would have any real downside in terms of unintentional emphasis, 
and it would make it possible for syntax highlighting to accurately 
match the expected output, which would make my life as a document author 
much easier.
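To illustrate (just a rough Python sketch of the rule I'm proposing, 
not anyone's actual implementation), a paragraph-local scan can decide 
emphasis without ever looking beyond the current paragraph:

    def apply_emphasis(paragraph):
        # Sketch of the proposed rule: a "*" surrounded by spaces on both
        # sides stays a literal asterisk; any other "*" toggles emphasis,
        # and emphasis left open simply runs to the end of the paragraph.
        out = []
        open_em = False
        for i, ch in enumerate(paragraph):
            if ch == '*':
                prev_space = i == 0 or paragraph[i - 1].isspace()
                next_space = (i + 1 >= len(paragraph)
                              or paragraph[i + 1].isspace())
                if prev_space and next_space:
                    out.append('*')        # literal: surrounded by spaces
                elif not open_em:
                    out.append('<em>')     # open immediately, no look-ahead
                    open_em = True
                else:
                    out.append('</em>')
                    open_em = False
            else:
                out.append(ch)
        if open_em:
            out.append('</em>')            # no closer in this paragraph
        return ''.join(out)

With this rule, "an *unclosed asterisk here" gets emphasized to the end 
of the paragraph, while "2 * 3 * 4" is left untouched.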
> In the case of the backtick in the other thread, it's clearly not an 
> edge case to me: I'm pretty sure that it has been designed that way. 
> First, it's much more useable, and second, the code explicitly checks 
> for the surrounding backticks so it can't really be an oversight. But 
> that's for the other thread...
Well, the specification is clearly ambiguous, if it can be interpreted 
(pretty reasonably AFAICT) in two completely different ways.
> I also agree that parsing Markdown is slower than it should be, but I 
> wonder which non-written rules you're talking about that makes it so 
> slow. When I wrote about looking until the end of the document, I was 
> talking about reference-style links, which are pretty well documented as 
> such. Maybe you had something else in mind?
You don't actually need to look to the end of the document when 
constructing the parse tree, however.  Links which actually have a 
reference can easily be distinguished from those which don't in a quick 
pass at the end of parsing.
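Something like this (a deliberately simplified Python sketch, not how 
Markdown.pl actually works; it ignores stripping the definition lines 
and every other inline rule) shows the idea: collect the definitions 
first, then resolve [text][id] tokens in one linear pass, leaving 
anything without a matching definition as literal text:

    import re

    LINK = re.compile(r'\[([^\]]+)\]\[([^\]]*)\]')
    DEF = re.compile(r'^ {0,3}\[([^\]]+)\]:\s*(\S+)', re.MULTILINE)

    def resolve_reference_links(source):
        # Pass 1: collect every reference definition in the document.
        definitions = dict((m.group(1).lower(), m.group(2))
                           for m in DEF.finditer(source))

        # Pass 2: a [text][id] token becomes a link only if "id" (or the
        # text itself, for an empty id) was defined somewhere; otherwise
        # it stays plain text.
        def substitute(m):
            text, ref = m.group(1), (m.group(2) or m.group(1)).lower()
            if ref in definitions:
                return '<a href="%s">%s</a>' % (definitions[ref], text)
            return m.group(0)

        return LINK.sub(substitute, source)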
> Syntax highlighting isn't the same thing as "parsing" Markdown, not in 
> my mind. It's more like "tokenizing" Markdown, and for that you don't 
> really need to be aware of the whole document at once; you just need to 
> identify things, not interpret them... Although it can get complicated 
> if you consider that some things, such as reference-style links, are 
> only links when they have a corresponding definition.
Yes, but tokenizing the document into a parse tree is 90% of the battle… 
the rest can be handled quickly (i.e. in at worst linear time) at the end.
> I was referring to how HTML browsers mutate the DOM in strange ways 
> while parsing improper HTML, if only to "fix" things such as this:
Yes, but that's not really in the scope of our discussion; I don't 
particularly care how clearly invalid markdown is converted.  It might 
be nice to specify (as the HTML5 spec is trying to do for HTML), but 
document authors don't need to know about it.
> Indeed, I was considering the parser was doing the output too, or at 
> least generating sequential events (à la SAX). If you're talking about 
> creating the document tree in one pass, I suppose this can work if you 
> allow mutations of the document (as opposed to just appending new 
> elements at the end as you'd do with an XML parser).
It requires one pass to create a document tree, and one more pass for a 
few other things, such as assigning links.
> Ok, back to performance. How many times do you start a new Perl process 
> when building the manual? My benchmarks indicate that starting the 
AFAIK it only takes one Perl process to run Markdown on a large 
file...
> I'm totally not convinced that creating a byte-by-byte parser in Perl or 
> PHP is going to be very useful. Using regular expressions is much faster 
> than executing equivalent PHP code in my experience.
This is a silly statement.  Whether a language is compiled or interpreted 
has nothing to do with the big-O running time of an algorithm.  Real 
parsers are optimized for this type of task and will run in linear time 
(with suitable changes to the Markdown spec), while regular expressions 
are not; it may even be possible to construct a document that takes 
exponential time with the current Markdown implementation.
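To be clear, I'm not pointing at any specific pattern in Markdown.pl; 
this is just the textbook illustration of how a backtracking regex can 
blow up on a failing match:

    import re, time

    # The nested quantifier has exponentially many ways to split up the
    # a's, and a failing match is forced to try all of them.
    pattern = re.compile(r'(a+)+b')

    for n in (18, 20, 22, 24):
        text = 'a' * n      # there is no trailing "b", so it must fail
        start = time.time()
        pattern.search(text)
        print('%d characters: %.2f seconds' % (n, time.time() - start))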
> That doesn't mean PHP Markdown and Markdown.pl can't be made faster, but 
> I'd be surprised if it ever reaches the speed of TextMate's dedicated 
> parsing state machine.
TextMate's parser has to do quite a bit of extra work that a dedicated 
parser for a particular format would not have to do.  I would not be at 
all surprised if a Perl or PHP implementation could be made extremely fast.
> You may not like the way Markdown.pl or PHP Markdown works, but we each 
> have to work within the limits of our particular language. Between a 
> PHP-made state machine or a PHP-regex hybrid design, I'd choose the 
> second one if it means I get decent performance. And I don't think a 
> formal grammar will help much getting better performance in a PHP or 
> Perl environment for the reasons outlined above.
I really don't understand where you're coming from on this point.  Why 
would a PHP state machine be so terribly slow?
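Here is roughly what I mean, as a toy sketch in Python rather than PHP 
(and certainly not PHP Markdown's actual design): a block-level state 
machine does a constant amount of work per line, so the scan is linear 
in the size of the document no matter which language it is written in:

    def classify_blocks(lines):
        # One pass over the lines; the current state decides how each
        # line is handled, and nothing is ever re-scanned.
        blocks = []
        state = 'text'
        for line in lines:
            if state == 'code':
                if line.startswith('    '):      # still inside the block
                    blocks[-1][1].append(line[4:])
                    continue
                state = 'text'                   # block ended, reclassify
            if line.startswith('    ') and line.strip():
                blocks.append(('code', [line[4:]]))
                state = 'code'
            elif line.startswith('#'):
                blocks.append(('heading', [line.lstrip('#').strip()]))
            elif not line.strip():
                blocks.append(('blank', []))
            else:
                blocks.append(('paragraph', [line]))
        return blocks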
> There's a tricky case here however: [foo][bar] isn't a link in Markdown 
> unless "bar" is defined somewhere; if it isn't defined, it's plain text.
As previously mentioned, that's quite trivial to pick up after the 
document has been completely tokenized.
-Jacob Rus
    
    