Incremental parser

Jacob Rus jacobolus at gmail.com
Wed Aug 15 15:04:11 EDT 2007


Michel Fortin wrote:

> I disagree about it being better for readers and writers. To me a sole
> asterisk or underscore doesn't mean emphasis. If someone voluntarily
> writes only one asterisk in front of a word, he probably didn't mean
> "emphasis until the end of the paragraph" either.


Well, this really depends. If I have a text editor which does some
syntax highlighting for me, I'd rather have emphasis running to the end
of the paragraph, which is extremely obvious and can be fixed, than a
stray asterisk. But really, the point here is that we can't determine
whether that stray asterisk has meaning until an indefinite point in the
future (the end of the paragraph). This means it's hard for a reader to
understand the document's intent under the current rules until the whole
paragraph has been read.
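
To make the lookahead concrete, here is a minimal Python sketch. It
assumes a simplified reading of the current rules, in which an asterisk
opens emphasis only if another asterisk appears later in the same
paragraph. The name classify_asterisk is hypothetical and not taken from
any Markdown implementation.

```python
def classify_asterisk(paragraph, pos):
    """Classify the asterisk at index `pos` of one paragraph.

    Simplifying assumption: the asterisk is "emphasis" only if a later
    asterisk exists in the same paragraph; otherwise it is "literal".
    Returns (kind, chars_scanned) to show how far we had to look ahead.
    """
    if paragraph[pos] != "*":
        raise ValueError("no asterisk at pos")
    closing = paragraph.find("*", pos + 1)
    if closing == -1:
        # No closer: we only know this after scanning to the very end.
        return "literal", len(paragraph) - pos - 1
    return "emphasis", closing - pos
```

For an unclosed asterisk the scan always runs to the end of the
paragraph, which is the unbounded lookahead I'm complaining about.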


> I wouldn't be so sure that no one has been writing asterisks at the
> start of words. You're right though that by the current rules this
> wouldn't be very common.


If they did, I would consider that "invalid" markdown.


> The tricky case is when deciding between a link or literal text has
> consequences for parsing subsequent text. For instance, take this image:
>
> ![some *image][] and text*
>
> If there is no corresponding link definition, then "image][] and text"
> is text and should be emphasized; otherwise if the link is defined, then
> you have a first literal asterisk inside the alt attribute and another
> in the text, and no emphasis.


If there is no corresponding link definition, I would consider that an
"invalid" markdown document. Clearly the author *does not* intend to
put random square brackets and exclamation marks in the middle of his
prose! I'm not sure how that should render, but I think for my personal
documents I'd prefer it to put in a gap for an image, and a clear sign
that the document is invalid.

Likewise with invalid links, I'd much rather (for my own documents, I'm
not suggesting this generally) have a link put in, with no URL, and a
CSS class like "invalid-link" that I could personally style to be red
and bold or something, instead of just rendering as random stupid square
brackets that I quite clearly would never intend to render as plain text.
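
As a rough illustration of that preference, here is a minimal Python
sketch. It assumes link definitions have already been collected into a
dict keyed by lowercased id, and it handles only reference-style links
of the form [text][id] or [text][] (no images, no inline links). The
name render_ref_links and the regex are hypothetical, not taken from
Markdown.pl or PHP Markdown.

```python
import html
import re

# Simplified pattern for [text][id] and [text][] reference links.
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")

def render_ref_links(text, definitions):
    """Render reference links; undefined ones get class="invalid-link"."""
    def replace(match):
        label = match.group(1)
        # An empty id ([text][]) falls back to the link text itself.
        ref_id = (match.group(2) or label).lower()
        url = definitions.get(ref_id)
        if url is None:
            # No definition: emit a link with no URL and a styleable class.
            return '<a class="invalid-link">%s</a>' % html.escape(label)
        return '<a href="%s">%s</a>' % (html.escape(url, quote=True),
                                        html.escape(label))
    return REF_LINK.sub(replace, text)
```

A stylesheet rule for a.invalid-link could then make the broken
reference red and bold, so the mistake is visible in the output.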


> There is no thing such as "invalid Markdown" currently. When would you
> call a Markdown document "invalid"?


You happily gave me a couple of examples above. :) As one example, I
would consider anything that tries to be markdown syntax but is never
closed to be invalid.


>> It requires one pass to create a document tree, and one more pass for
>> a few other things, such as assigning links.
>
> You could also do one pass to strip link definitions and another to do
> the actual parsing incrementally. :-)


Sure... that would be perfectly reasonable too.
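
A minimal Python sketch of that first pass, assuming definitions sit on
a single line of the form "[id]: url" with at most three leading spaces
(optional titles are ignored here). The name strip_link_definitions is
hypothetical, and the second, incremental parsing pass is not shown.

```python
import re

# Simplified one-line link definition: "[id]: url" (titles not handled).
LINK_DEF = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)\s*$")

def strip_link_definitions(lines):
    """First pass: split input lines into link definitions and body text."""
    definitions = {}
    body = []
    for line in lines:
        match = LINK_DEF.match(line)
        if match:
            # Ids are matched case-insensitively, so store them lowercased.
            definitions[match.group(1).lower()] = match.group(2)
        else:
            body.append(line)
    return definitions, body
```

With the definitions known up front, the second pass can decide whether
[text][id] is a link the moment it sees it, without looking ahead.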


>> AFAIK it only takes starting one perl process to run markdown on a
>> large file...
>
> Sure, that's true, but that doesn't answer my question. Is the manual
> parsed as one big file or many smaller ones? And if only one file, what
> size is it? I'm interested in understanding what makes it so slow, but I
> haven't much data at hand to comment on the speed issue.


Well, why shouldn't markdown be equally usable for many small files or a
few big ones? I'd rather have it be performant for all files.


> So the issue isn't so much about algorithmic complexity, it's about PHP
> code being an order of magnitude slower to execute than regular
> expressions, or any native code for that matter. The smaller the amount
> of PHP code and the less it is repeated, the better for performance;
> that's how I have optimised PHP Markdown in the last few years.


Well, this just implies to me that PHP should not be used for text
processing in general... ;)

-Jacob


