Incremental parser (was: Backtick Hickup)

Allan Odgaard 29mtuz102 at sneakemail.com
Tue Aug 28 18:51:00 EDT 2007


On Aug 27, 2007, at 6:42 PM, Michel Fortin wrote:


>>> I'm totally not convinced that creating a byte-by-byte parser in
>>> Perl or PHP is going to be very useful.
>> The key here is really having clearly defined state transitions.
> I'm not sure what you mean by that in relation to what I wrote above.


That your implementation concerns are not what I am presently
concerned with, and that I do not think you are right (you can still
use lots of regexps for tokenization in a conventional parser) -- but
that is another discussion.


> [...]
> There are many complains about different things here. About the
> syntax, you complain that it is badly defined (I agree).
>
> You then talk about lack of simplicity in the code, which I assume
> apply to Markdown.pl (or PHP Markdown), not the syntax; or perhaps
> you mean that the syntax makes it impossible to write simple code
> to parse it? I'm not sure I understand what you mean here.


Yes -- my view is that the complexity of the implementation stems
from not basing it on a standard parser. Presently everything is
basically a special case in the implementation. With a grammar, the
parser is generated from it, and you do not need to do things like
run a first pass that hides raw HTML (behind hashes), then a pass
that grabs the raw spans before a later pass grabs the emphasis, etc.
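
To make the contrast concrete, here is a minimal sketch in PHP of
handling two inline constructs in a single descent over the text
rather than a sequence of whole-document passes. All names are
invented for illustration; this is not how Markdown.pl or PHP
Markdown actually work.

<?php
// Illustrative only: a single-pass recursive-descent handler for two
// inline constructs (code spans and emphasis). Each construct is one
// "rule", so no hide-and-restore passes are needed.

function parseInline(string $text): string {
    $out = '';
    $len = strlen($text);
    $pos = 0;
    while ($pos < $len) {
        $c = $text[$pos];
        if ($c === '`' && ($end = strpos($text, '`', $pos + 1)) !== false) {
            // code span: emit contents verbatim, no further inline parsing
            $out .= '<code>' . htmlspecialchars(substr($text, $pos + 1, $end - $pos - 1)) . '</code>';
            $pos = $end + 1;
        } elseif ($c === '*' && ($end = strpos($text, '*', $pos + 1)) !== false) {
            // emphasis: recurse into the span for nested constructs
            $out .= '<em>' . parseInline(substr($text, $pos + 1, $end - $pos - 1)) . '</em>';
            $pos = $end + 1;
        } else {
            $out .= htmlspecialchars($c);
            $pos++;
        }
    }
    return $out;
}

echo parseInline('use `printf` for *formatted* output'), "\n";
// => use <code>printf</code> for <em>formatted</em> output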


> Then you talk about the lack of extensibility of the language
> grammar (which I'm not sure what you mean by that, is there a
> language grammar for Markdown anyway?).


With a formal grammar, extending the syntax is generally just a
matter of adding or editing a rule, and we have the syntax extension.
With a hand-written parser you tend to end up with code written for
one very specific purpose that is generally not easy to extend: tweak
something in one place in the source and you break something in
another place. I think we have already seen that on a few occasions
(when something is fixed/changed in Markdown.pl).
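
As a hypothetical illustration of how local such an extension could
be, assume the inline syntax were driven by a rule table; none of
these names, nor the single-character delimiters, exist in the real
implementations.

<?php
// Hypothetical: inline constructs described as data (delimiter => rule),
// so a syntax extension is one new entry rather than edits scattered
// across several passes.

$rules = [
    '`' => fn(string $inner) => '<code>' . htmlspecialchars($inner) . '</code>',
    '*' => fn(string $inner) => '<em>' . $inner . '</em>',
];

// Extension: register one more rule (a stand-in for e.g. strike-through).
$rules['~'] = fn(string $inner) => '<del>' . $inner . '</del>';

echo ($rules['~'])('obsolete'), "\n";   // => <del>obsolete</del>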


> Then you go on the lack of performance (are you calling this a
> syntax or parser issue or both?).


I mention that because if we had a grammar and a generated parser, we
would get a known, good time complexity and pretty efficient code.

I.e. my point is that all these problems I raise are really rooted in
the lack of a grammar -- sure, we can address them even without a
grammar, and maybe they do not (all) apply to the PHP Markdown
implementation; I was just adding some (more) arguments for why I
would like to see the goal of a formal grammar taken more seriously.


> Finally you say the current implementation (I assume you're talking
> about Markdown.pl, perhaps PHP Markdown) does not "effectively"
> support nested constructs (which constructs? what does "effectivly"
> means here?) but "support" them somewhat by recursively reparsing
> parts of the document. Very true, but how is that a problem for you?


Effectively as in: in practice the parser is a parser for a regular
language [1], and only by doing multiple passes, where subsets of the
text are hidden from further passes, does it achieve its result
(Markdown is not a regular language, so you need a parser more
powerful than one for a regular language to handle it).

This solution, though, is IMO anything but ideal: it has been the
cause of many bugs in the past, and the result is still not what I
would prefer to see, e.g. the issue of token type taking precedence
over position in the document.

[1]: http://en.wikipedia.org/wiki/Regular_language
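
For anyone unfamiliar with the technique being described, here is a
rough sketch of that hide-behind-hashes, multi-pass style. It is
simplified and is not the actual Markdown.pl/PHP Markdown code; the
placeholder format is invented.

<?php
// Rough sketch of the multi-pass style described above: code spans are
// replaced by opaque placeholders first, so the later emphasis pass
// cannot see into them.

$text = 'literal `*not emphasis*` but *this is*';
$hashes = [];

// Pass 1: hide code spans behind placeholder tokens.
$text = preg_replace_callback('/`(.+?)`/', function ($m) use (&$hashes) {
    $key = "\x1A" . md5($m[1]) . "\x1A";   // arbitrary placeholder format
    $hashes[$key] = '<code>' . htmlspecialchars($m[1]) . '</code>';
    return $key;
}, $text);

// Pass 2: emphasis, now blind to whatever was hashed out.
$text = preg_replace('/\*(.+?)\*/', '<em>$1</em>', $text);

// Pass 3: restore the hidden spans.
echo strtr($text, $hashes), "\n";
// => literal <code>*not emphasis*</code> but <em>this is</em>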


> [...]
> I don't really want to see the syntax changed in and out only to
> make it easier to implement as an incremental parser.


Yeah, that is a more interesting discussion -- how much would be okay
to change? For example, if we changed the rules so that we had
_emphasis_ and *strong*, we would solve the problem with *** (which
today is ambiguous between emphasis-inside-strong and
strong-inside-emphasis), and IMO it would be a welcome change, since
typing four asterisks for bold is tedious and noisy in the text
(granted, cmd-B will type the asterisks for me, but still…)


> I don't think such a parser would be usable (read fast-enough) in
> PHP anyway. Well, perhaps it could be, but not in the traditional
> sense of an incremental parser; the concept would probably need to
> be stretched a lot to fit with regular expressions.


I am not sure what you base these assumptions on. What exactly is it
that makes PHP so extremely slow that it is unfit for a parser, when
the current (granted, regexp-based) PHP Markdown works fine?
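
For what it is worth, a parser in PHP does not have to walk the input
byte by byte in userland; it can still let a regexp do the
tokenization and merely consume the resulting token stream. A tiny
sketch, with the token classification invented for the example:

<?php
// Tokenize with one regexp, then dispatch on whole tokens instead of
// individual bytes. Token classes here are purely illustrative.

preg_match_all('/`[^`]*`|\*\*|\*|[^`*]+/', 'plain *em* and `code`', $tokens);
foreach ($tokens[0] as $tok) {
    printf("%-6s %s\n", $tok[0] === '`' ? 'CODE' : ($tok[0] === '*' ? 'DELIM' : 'TEXT'), $tok);
}
// TEXT   plain
// DELIM  *
// TEXT   em
// DELIM  *
// TEXT    and
// CODE   `code`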


>> Yes, and personally I would say whenever you do [foo][bar] you get
>> a link, regardless of whether or not bar is a defined reference --
>> if bar is not a defined reference, you could default to make it
>> reference the URL ‘#’ [...]
> Hum, I disagree strongly here that creating links to nowhere (#) is
> the solution to undefined reference links. This is bad usability
> for authors who will need to test every links in resulting page to
> make sure they're linking where they should be


On the contrary, add this to your preview style sheet:

a[href="#"] {
    background: blue;
    border: 2px solid red;
    color: white;
}

Now you have a very good indicator for missing links, unlike the
current behaviour, where they easily blend in with the regular text
and there is no simple way to find them.


