Incremental parser

Michel Fortin michel.fortin at michelf.com
Mon Aug 27 12:09:59 EDT 2007


On 2007-08-15 at 15:04, Jacob Rus wrote:


> Michel Fortin wrote:
>> I disagree about it being better for readers and writers. To me a
>> sole asterisk or underscore doesn't mean emphasis. If someone
>> voluntarily writes only one asterisk in front of a word, he
>> probably didn't mean "emphasis until the end of the paragraph"
>> either.
>
> Well, this really depends. If I have a text editor which does some
> syntax highlighting for me, I'd rather have emphasis at the end of
> the paragraph, which is extremely obvious and can be fixed, than a
> stray asterisk.


I don't think syntax highlighting is an argument that should help
decide what Markdown should do.

To solve your problem, I suggest you use two colors: one for
so-called "valid" emphasis, the kind Markdown will actually convert
to emphasis, and another for "invalid" emphasis, where the closing
asterisk is missing. That should make authoring errors even more
obvious.
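
As a rough illustration of the two-color idea, here is a minimal Python
sketch. It assumes a deliberately simplified rule (emphasis is an
asterisk, some text with no asterisk in it, then a closing asterisk,
with no whitespace just inside either delimiter), which is not
Markdown's full emphasis rule. The function name `classify_asterisks`
and the "valid"/"invalid" labels are hypothetical, chosen only for this
example; an editor would map each label to a color.

```python
import re

# Simplified assumption, not Markdown's real rule: *text* where the text
# contains no asterisk and does not start or end with whitespace.
PAIRED = re.compile(r"\*(?=\S)([^*]+?)(?<=\S)\*")

def classify_asterisks(paragraph):
    """Return sorted (start, end, kind) spans for one paragraph.

    kind is "valid" for a closed emphasis span and "invalid" for an
    asterisk that has no partner under the simplified rule above.
    """
    spans = []
    covered = set()
    for match in PAIRED.finditer(paragraph):
        spans.append((match.start(), match.end(), "valid"))
        covered.update(range(match.start(), match.end()))
    # Any asterisk not inside a closed span gets the "invalid" color.
    for index, char in enumerate(paragraph):
        if char == "*" and index not in covered:
            spans.append((index, index + 1, "invalid"))
    return sorted(spans)
```

For the paragraph `a *b* c *d`, the closed span `*b*` is reported as
"valid" and the lone asterisk before `d` as "invalid".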


> But really, the point here is that we can't determine whether that
> stray asterisk has meaning until an indefinite point in the future
> (the end of the paragraph). This means it's hard for a reader to
> understand the document's intent under the current rules until the
> whole paragraph has been read.


Well, what you're describing applies to cases where the paragraph is
too long to scan visually. I don't think Markdown should be modeled
around stretched cases like that.



>> There is no thing such as "invalid Markdown" currently. When would
>> you call a Markdown document "invalid"?
>
> You happily gave me a couple examples above. :) I would consider
> anything that tries to be markdown syntax, but is never closed to
> be invalid, as one example.


Basically, I think what you're calling "invalid Markdown" is really
what is left undefined by the current documentation.

It's certainly good practice to avoid depending on undefined
behaviours. But given that half of Markdown users haven't read a line
of the syntax document and know very little about HTML, I don't think
it would accomplish much to call some documents "invalid" when they
contain asterisks in the wrong place. To me, it sounds like an
excuse to output garbage for poorly-edited documents, which is not
something I want to do with my parser.



>> Sure, that's true, but that doesn't answer my question. Is the
>> manual parsed as one big file or many smaller ones? And if only
>> one file, what size is it? I'm interested in understanding what
>> makes it so slow, but I haven't much data at hand to comment on
>> the speed issue.
>
> Well, why shouldn't markdown be equally usable for many small files
> or a few big ones? I'd rather have it be performant for all files.


Me too. I'm not accusing anyone of having files that are too big. If
you have something that parses too slowly, file a bug report (with a
sample file) so someone can look at the problem.

It's clear however that the current parser for PHP Markdown and
Markdown.pl is pretty slow with big files, and that it may not be
easy to fix.



>> So the issue isn't so much about algorithmic complexity, it's
>> about PHP code being a magnitude slower than regular expressions
>> to execute, or any native code for that matter. The smallest the
>> amount of PHP code and the least repeated it is, the better for
>> performance; that's how I have optimised PHP Markdown in the last
>> few years.
>
> Well this just implies to me that PHP should not be used for text
> processing in general... ;)


I agree: PHP is very poor in that regard, and that's why in PHP
Markdown I defer everything I can to regular expressions, which are
much faster. Ideally, we'd have a compiled parser, but even that
would be useless to thousands of people on shared hosting.
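
To make the "defer to regular expressions" approach concrete, here is a
small sketch of the same transformation written two ways. It uses
Python as a stand-in for PHP, and the function names and the simplified
`*text*` pattern are assumptions made for this example, not code from
PHP Markdown. It shows only the structural difference and makes no
timing claim.

```python
import re

# Simplified emphasis pattern, assumed for illustration only.
EMPHASIS = re.compile(r"\*([^*]+)\*")

def emphasize_regex(text):
    # One call: the scan over the text runs inside the regex engine,
    # which is native code.
    return EMPHASIS.sub(r"<em>\1</em>", text)

def emphasize_loop(text):
    # Same transformation, but the scan is driven step by step by
    # interpreted code, which is the part that gets costly on big files.
    out = []
    i = 0
    while i < len(text):
        if text[i] == "*":
            close = text.find("*", i + 1)
            if close > i + 1:  # a closing asterisk with text in between
                out.append("<em>" + text[i + 1:close] + "</em>")
                i = close + 1
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

Both functions leave an unclosed asterisk untouched. The first keeps
the amount of interpreted code per input character close to zero, which
is the property the optimisation described above relies on.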


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/




More information about the Markdown-Discuss mailing list