Incremental parser
    Michel Fortin
    michel.fortin at michelf.com
    Wed Aug 15 11:24:02 EDT 2007

On 2007-08-14, at 21:54, Jacob Rus wrote:
> Okay, but the spec could easily be changed to continue emphasis  
> until the end of the current paragraph, if no ending * is found.   
> This would IMO be much more logical for document readers/writers  
> (as well as computer parsers), as some asterisk 60 lines later  
> would no longer affect part of the document on the current line.
I disagree about it being better for readers and writers. To me a
sole asterisk or underscore doesn't mean emphasis. If someone
deliberately writes only one asterisk in front of a word, he probably
didn't mean "emphasis until the end of the paragraph" either.
Hum, and if your paragraph takes 60 lines, you'd better fix that  
first. :-)
But seriously, if you happened to put an asterisk at the start of a
word 10 lines earlier but forgot the closing asterisk, would you
prefer the whole remaining paragraph to be screwed up, or to just have
one rogue asterisk where emphasis should have started? Personally,
I'd prefer the latter as it would be more readable.
> Given that `*`s surrounded by spaces are left alone, I don't think  
> this change would have any real downside, in terms of unintentional  
> emphasis, and it would make it possible to have syntax highlighting  
> which actually accurately matched the expected output, which would  
> make my life as a document author much easier.
I wouldn't be so sure that no one has been writing asterisks at the  
start of words. You're right though that by the current rules this  
wouldn't be very common.
> Well, the specification is clearly ambiguous, if it can be  
> interpreted (pretty reasonably AFAICT) in two completely different  
> ways.
The specification for code spans has an ambiguity, I agree. I still  
think my interpretation and the parsing currently implemented by both  
Markdown.pl and PHP Markdown is more useful, but I guess that's still  
open for debate (in the other thread).
> You don't actually need to look to the end of the document when  
> constructing the parse tree, however.  Links which actually have a  
> reference can easily be distinguished from those which don't in a  
> quick pass at the end of parsing.
The tricky case is when deciding between a link and literal text has
consequences for parsing subsequent text. For instance, take this image:
     ![some *image][] and text*
If there is no corresponding link definition, then "image][] and
text" is text and should be emphasized; otherwise, if the link is
defined, then you have one literal asterisk inside the alt
attribute and another in the text, and no emphasis.
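To make the two outcomes concrete, here's a hypothetical way to see
both results side by side (assuming PHP Markdown's markdown.php and
its Markdown() function are available; the exact HTML could differ a
little from what the comments show):

    <?php
    require_once 'markdown.php';

    $undefined = "![some *image][] and text*";
    $defined   = "![some *image][] and text*\n\n[some *image]: /img.png";

    // No matching definition: the image syntax stays literal text and
    // the two asterisks pair up, giving something like
    //   <p>![some <em>image][] and text</em></p>
    echo Markdown($undefined);

    // Definition present: one asterisk ends up in the alt attribute,
    // the other stays literal, and there is no emphasis, e.g.
    //   <p><img src="/img.png" alt="some *image" /> and text*</p>
    echo Markdown($defined);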
The same problem could apply to links, although with links -- which  
can contain markup -- you could use some complicated trickery to span  
the emphasis inside and outside the link with two emphasis elements.
> Yes, but that's not really in the scope of our discussion; I don't  
> particularly care how clearly invalid markdown is converted.  It  
> might be nice to specify (as the HTML5 spec is trying to do for  
> HTML), but document authors don't need to know about it.
There is no such thing as "invalid Markdown" currently. When would
you call a Markdown document "invalid"?
> It requires one pass to create a document tree, and one more pass  
> for a few other things, such as assigning links.
You could also do one pass to strip link definitions and another to  
do the actual parsing incrementally. :-)
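Something along these lines (a simplified, hypothetical sketch, not
PHP Markdown's actual code; it ignores titles and other details, and
assumes the document is already in $text):

    <?php
    // First pass: collect reference link definitions and strip them.
    $def_re = '{^[ ]{0,3}\[(.+)\]:[ \t]*<?(\S+?)>?[ \t]*$}m';

    $definitions = array();
    if (preg_match_all($def_re, $text, $defs, PREG_SET_ORDER)) {
        foreach ($defs as $d) {
            $definitions[strtolower($d[1])] = $d[2];  // id => url
        }
    }
    $text = preg_replace($def_re, '', $text);

    // Second pass: parse $text incrementally, looking up $definitions
    // whenever a reference link or image is encountered.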
>> Ok, back to performance. How many times do you start a new Perl
>> process when building the manual? My benchmarks indicate that
>> starting the
>
> AFAIK it only takes starting one perl process to run markdown on a  
> large file...
Sure, that's true, but that doesn't answer my question. Is the manual  
parsed as one big file or many smaller ones? And if only one file,  
what size is it? I'm interested in understanding what makes it so  
slow, but I haven't much data at hand to comment on the speed issue.
>> I'm totally not convinced that creating a byte-by-byte parser in  
>> Perl or PHP is going to be very useful. Using regular expressions  
>> is much faster than executing equivalent PHP code in my experience.
>
> This is a silly statement.  Compiled vs. interpreted languages has  
> nothing to do with the big-O run-time of an algorithm.  Real  
> parsers are optimized for this type of task and will take linear  
> time (with suitable changes in the markdown spec), while regular  
> expressions are not (and it's possible even that a document could  
> be created that would take exponential time with the current  
> markdown implementation).
>
> [...]
>
> TextMate's parser has to do quite a bit of extra work that a  
> dedicated parser for a particular format would not have to do.  I  
> would not be at all surprised if a Perl or PHP implementation could  
> be made extremely fast.
>
> [...]
>
> I really don't understand where you're coming from on this point.   
> Why would a PHP state machine be so terribly slow?
This is not about compiled vs. interpreted languages. I mentioned
coding in C++ only because it's clear that with C++ (or most
compiled languages, for that matter) you can get at the very least
the same speed as with a regular expression. What really matters here
is only the speed of PHP code vs. regular expressions, because these
are the two things I have at my disposal for PHP Markdown.
In all my tests, the more work you do inside the regular expression,
the faster it is. I've observed that many times while optimizing PHP
Markdown.
As a demonstration, I've made a small test program that counts the
characters 'a', 'b', and 'c' inside a string. There are nine different
implementations counting those characters. No matter what the string
size is (still within reasonable limits), the pure-regex
implementation is always faster in my tests: even if it needs to pass
three times over the whole string, even if it generates irrelevant
arrays containing thousands of matches, it always beats the
simplest PHP loop by a factor of more than five. Hybrid approaches
with a single regular expression and simple PHP branching code tend
to fall in between.
You can find the test script [here][1] if you want to check for
yourself. Here's a typical output (tested on an iBook G4 1.2 GHz, PHP
5.0.4 with Zend Optimizer, 200 KB string):
     1) a: 12623; b: 12606; c:12373; time: 1346 ms.
     2) a: 12623; b: 12606; c:12373; time: 2572 ms.
     3) a: 12623; b: 12606; c:12373; time: 180 ms.
     4) a: 12623; b: 12606; c:12373; time: 184 ms.
     5) a: 12623; b: 12606; c:12373; time: 407 ms.
     6) a: 12623; b: 12606; c:12373; time: 397 ms.
     7) a: 12623; b: 12606; c:12373; time: 208 ms.
     8) a: 12623; b: 12606; c:12373; time: 25 ms.
     9) a: 12623; b: 12606; c:12373; time: 15 ms.
As for the algorithms used:
     1: simple for loop + switch over characters
     2: split text to an array + foreach + switch over characters
     3: count(explode('a', $text)) (called 3 times: once for each character)
     4: preg_match_all (called 3 times: once for each character)
     5: preg_match_all + foreach on matches + switch over matched characters
     6: preg_replace_callback + callback + switch over matched character
     7: preg_replace_callback + callback (called 3 times: once for each character)
     8: preg_replace + string length difference
     9: str_replace + string length difference
  [1]: http://www.michelf.com/docs/charcount.zip
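For reference, here's a minimal sketch of what approaches 1, 4 and 9
could look like (written here for illustration only; the actual test
script may differ in its details):

    <?php
    // Approach 1: simple for loop + switch over characters.
    function count_abc_loop($text) {
        $counts = array('a' => 0, 'b' => 0, 'c' => 0);
        $len = strlen($text);
        for ($i = 0; $i < $len; $i++) {
            switch ($text[$i]) {
                case 'a': $counts['a']++; break;
                case 'b': $counts['b']++; break;
                case 'c': $counts['c']++; break;
            }
        }
        return $counts;
    }

    // Approach 4: preg_match_all, called once per character.
    function count_abc_preg($text) {
        return array(
            'a' => preg_match_all('/a/', $text, $m),
            'b' => preg_match_all('/b/', $text, $m),
            'c' => preg_match_all('/c/', $text, $m),
        );
    }

    // Approach 9: str_replace + string length difference.
    function count_abc_str_replace($text) {
        return array(
            'a' => strlen($text) - strlen(str_replace('a', '', $text)),
            'b' => strlen($text) - strlen(str_replace('b', '', $text)),
            'c' => strlen($text) - strlen(str_replace('c', '', $text)),
        );
    }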
Notice how the simple PHP loop (1) is absurdly slow compared to
even the most ridiculous approaches using a regular expression.
Common sense would tell us that the first approach should be the
fastest because all the others do extra work, but that's certainly
not the case: every other approach involving a native function is
faster by a wide margin, almost in proportion to the amount of PHP
code it has to execute. That means that executing PHP code *is* in
itself a lot more extra work than anything else done in these tests.
So the issue isn't so much about algorithmic complexity; it's about
PHP code being an order of magnitude slower to execute than regular
expressions, or any native code for that matter. The smaller the
amount of PHP code and the less it is repeated, the better for
performance; that's how I have optimized PHP Markdown over the last
few years.
Another case in point: PHP Markdown Extra's HTML block parser. It
works mostly as an incremental parser. Going into the details would
take too long, but it's generally much slower than PHP Markdown's
HTML block parser, even though PHP Markdown's parser traverses the
whole document 5 times with different regular expressions to handle
HTML blocks.
Note that all this is not about incremental parsing vs. a bunch of
regexes; it's PHP speed vs. regex speed. Regular expressions can be
used pretty effectively to speed up tokenization inside an
incremental parser (as Extra's HTML block parser does). I'm pretty
sure I could come up with something that parses Markdown's whole
block gamut incrementally by combining all the block patterns into a
single regex and then looping on the results. This would surely
improve performance a little, although it could also make the code
harder to maintain.
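Roughly like this (a hypothetical sketch, not how PHP Markdown is
written today; it only shows the dispatching, assumes the document is
in $text, and a real parser would also track offsets to handle the
paragraph text in between the matches):

    <?php
    // One alternation finds the next block start; PHP only dispatches
    // on which named branch matched. Patterns are simplified.
    $block_re = '{
          (?P<heading> ^\#{1,6}[ ].*$ )
        | (?P<hr>      ^[ ]{0,3}(?:[-*_][ ]*){3,}$ )
        | (?P<quote>   ^>[ ]?.*$ )
        }mx';

    if (preg_match_all($block_re, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            if (!empty($m['heading'])) {
                // handle heading ...
            } elseif (!empty($m['hr'])) {
                // handle horizontal rule ...
            } elseif (!empty($m['quote'])) {
                // handle blockquote line ...
            }
        }
    }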
Now, about algorithmic complexity (big-O). It's certainly true that
backtracking in regular expressions can be pretty bad. Twice I have
had to fix things like this: the first time was a few years ago in
the header expression ported from Markdown.pl (which Perl was
optimizing better, avoiding the backtracking); the second was in PHP
Markdown Extra's HTML block parser and was reported by someone using
pretty bad HTML as input. In both cases it was easy to fix, and I'm
pretty confident there are no more loopholes now. PHP supports
(?> once-only subpatterns ) (Perl does too): those have been a great
help in avoiding unwanted backtracking.
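To illustrate (a contrived example of mine, not one of the actual
patterns mentioned above):

    <?php
    $subject = str_repeat('a', 28) . '!';   // cannot match either regex

    // Nested quantifiers: to prove there is no match, PCRE tries an
    // enormous number of ways to split the a's between the two
    // quantifiers; in PHP this typically aborts against
    // pcre.backtrack_limit instead of returning 0.
    var_dump(preg_match('/^(?:a+)+b$/', $subject));

    // Once-only (atomic) subpattern: a+ never gives characters back,
    // so the failure is reported after a single attempt.
    var_dump(preg_match('/^(?>a+)+b$/', $subject));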
Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/
    
    