[ANN] PHP Markdown 1.0.2b7

Michel Fortin michel.fortin at michelf.com
Tue Sep 19 07:53:52 EDT 2006


On 18 Sept. 2006 at 19:00, Jacob Rus wrote:


> I don't like this solution. It seems to me that the output should
> instead be:
>
> <em>Some **strange</em> emphasis**
>
> because the "do what comes first, and then toss out improper
> nesting" rule is more understandable for humans (well, at least for
> this one) and also I expect easier for a computer parser.


Is the "do what comes first" rule easier to implement? Yes and no; it
really depends. Currently, Markdown checks a text snippet for strong
emphasis first, then for regular emphasis. So it "sees" only the
strong emphasis on the first pass, creates the tags and wraps the
result "into" a hash; on the second pass it sees nothing, because one
of the two simple emphasis markers has been removed as part of the hash.
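
A minimal sketch of that two-pass behaviour, written in Python rather
than PHP and not taken from PHP Markdown's source. The function names,
the simplified emphasis patterns and the sample input are assumptions
made for illustration; the input is the one implied by the quoted
output above.

```python
import hashlib
import re

def hash_span(html, table):
    """Store generated HTML in `table` and return its md5 key as a placeholder."""
    key = hashlib.md5(html.encode("utf-8")).hexdigest()
    table[key] = html
    return key

def two_pass_emphasis(text):
    """Hypothetical helper: strong emphasis first, then regular emphasis."""
    table = {}
    # Pass 1: strong emphasis only. The whole span, content included, is
    # hidden behind a hash, so a lone "*" inside it disappears from view.
    text = re.sub(r"(\*\*|__)(?=\S)(.+?)(?<=\S)\1",
                  lambda m: hash_span("<strong>%s</strong>" % m.group(2), table),
                  text)
    # Pass 2: regular emphasis on whatever is still visible.
    text = re.sub(r"(\*|_)(?=\S)(.+?)(?<=\S)\1",
                  lambda m: hash_span("<em>%s</em>" % m.group(2), table),
                  text)
    # Swap placeholders back for HTML until none are left (they can nest).
    while True:
        restored = re.sub(r"[0-9a-f]{32}",
                          lambda m: table.get(m.group(0), m.group(0)), text)
        if restored == text:
            return text
        text = restored

print(two_pass_emphasis("*Some **strange* emphasis**"))
# *Some <strong>strange* emphasis</strong>
```

The second pass finds only one visible "*", so no <em> is produced.
Properly nested input such as "*a **b** c*" still comes out as
<em>a <strong>b</strong> c</em>.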

- - -

Actually, that's how it works in PHP Markdown. PHP Markdown Extra,
however, gives this:

<em>test *</em>test* test**

which isn't right at all. I'll have to fix it. But at least it
validates.

- - -

If Markdown processed the text character by character, it would be a
lot easier to work with the "do what comes first" rule. But that's not
the case, because in Perl or in PHP it was easier to call a bunch of
regexes one by one and change what each of them finds.
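
For comparison, here is a rough Python sketch of what a left-to-right
scanner following the "do what comes first" rule might look like. It
is only an illustration under simplifying assumptions: it handles "*"
and "**" only, ignores Markdown's whitespace rules, and the function
names are made up.

```python
import re

STAR_RUN = re.compile(r"\*+")
TAGS = {1: "em", 2: "strong"}

def find_closer(text, start, width):
    """Index of the next run of exactly `width` asterisks, or -1 if none."""
    for m in STAR_RUN.finditer(text, start):
        if len(m.group(0)) == width:
            return m.start()
    return -1

def first_come_emphasis(text):
    """Scan left to right; the first opener wins, unmatched markers stay literal."""
    out = []
    i = 0
    while i < len(text):
        m = STAR_RUN.match(text, i)
        if not m:
            out.append(text[i])
            i += 1
            continue
        run = m.group(0)
        tag = TAGS.get(len(run))
        close = find_closer(text, m.end(), len(run)) if tag else -1
        if close < 0:
            # No matching closer: keep the marker as plain text.
            out.append(run)
            i = m.end()
            continue
        # Whatever is improperly nested inside is handled recursively and
        # ends up literal if it cannot close within this span.
        inner = first_come_emphasis(text[m.end():close])
        out.append("<%s>%s</%s>" % (tag, inner, tag))
        i = close + len(run)
    return "".join(out)

print(first_come_emphasis("*Some **strange* emphasis**"))
# <em>Some **strange</em> emphasis**
```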



> I've never seen any non-contrived example where overlapping bold
> and italic like this are needed. So forcing authors to use proper
> nesting should be perfectly fine.


Doesn't that mean that what Markdown does with improperly nested
emphasis is unimportant?

I agree with the "whichever comes first" rule when it concerns
emphasis, but what about emphasis and links? Should a link disappear
just because of badly nested emphasis?

This is *a [link*](#).

This is *a [link*](#) within emphasis*.

In that case I think it is better to prioritize the link, just as PHP
Markdown does now. What do you think?
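
One way prioritizing the link could play out, sketched in Python. This
is an assumption about the general approach (links are set aside
before emphasis runs), not a copy of PHP Markdown's code or a
guarantee of its exact output. The NUL-delimited placeholder and the
function name are hypothetical.

```python
import re

def links_then_emphasis(text):
    """Hypothetical helper: protect inline links, then apply regular emphasis."""
    stash = []

    def stash_link(m):
        # Keep the finished link out of sight of the emphasis pass.
        stash.append('<a href="%s">%s</a>' % (m.group(2), m.group(1)))
        return "\x00%d\x00" % (len(stash) - 1)

    text = re.sub(r"\[([^\]]*)\]\(([^)]*)\)", stash_link, text)
    text = re.sub(r"\*(?=\S)(.+?)(?<=\S)\*", r"<em>\1</em>", text)
    # Put the links back in place of their placeholders.
    return re.sub(r"\x00(\d+)\x00", lambda m: stash[int(m.group(1))], text)

print(links_then_emphasis("This is *a [link*](#)."))
# This is *a <a href="#">link*</a>.
print(links_then_emphasis("This is *a [link*](#) within emphasis*."))
# This is <em>a <a href="#">link*</a> within emphasis</em>.
```

In both cases the link survives; only the emphasis markers that cannot
pair up outside the link are left as literal asterisks.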



>>> * Made the block-level HTML parser smarter using a
>>> specially-crafted regular expression capable of handling
>>> nested tags.
>> A single pattern that matches nested tags?!
>> $me == "downloading now";
>
> Somehow it doesn't excite me to learn that even more things will
> get round-tripped through markdown's weird MD5 hash step. :P


The HTML block regular expression is completely independent of
anything else. It just helps delimit HTML blocks within the Markdown
source; it has nothing to do with md5 hashes. Yes, the HTML block
parser function changes them to md5 hashes, but this has always been
the case; nothing new here.

About the block-level automatic hashing, this isn't really new:
Markdown was already doing this indirectly by calling the HTML block
parser again on its own generated content. I'm only skipping that
step and encoding blocks directly.

About the span-level automatic hashing: I added it simply because it
allows a better separation between each part of Markdown. Besides
preventing bad nesting, it means that the generated tags cannot be
altered just because the next processing method finds its favorite
character in them. Previously, Markdown was escaping special
characters in the tags it created. But an extensible PHP Markdown
makes that rather difficult, as new special character sequences can be
introduced. I'll probably start to hash inline HTML tags for the same
reasons too, instead of escaping special characters in them.

And I don't like md5 hashes either: I'm certainly going to replace
them at some point with another kind of text token that has no chance
of clashing with any input text.



> Are there any examples of what the current behavior is, and what
> changes when we use this "smarter" block HTML parser?


It is smarter in the sense that it does not rely on correct
indentation to find the end of an HTML block. Simple things such as
this should now work correctly:

<div>
<div>
</div>
</div>

<div>
<div>
</div>
</div>

There is a limitation in my implementation of the regex: it won't
correctly recognize nesting deeper than 5 levels when the nested tags
all have the same name as the opening tag of the HTML block. A true
recursive regular expression would theoretically allow infinite
nesting, but using one would raise the requirement from PHP 4.1,
WordPress's minimum requirement, to PHP 4.3.3.
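
To illustrate how a non-recursive pattern can still handle nesting up
to a fixed depth, here is a Python sketch that builds the expression
by wrapping it around itself a set number of times. It is not the
regex used in PHP Markdown; the function name, the way the pattern is
built, and how the depth is counted are all assumptions.

```python
import re

def nested_block_pattern(tag, depth):
    """Regex for <tag>...</tag> allowing `depth` further levels of same-name nesting."""
    name = re.escape(tag)
    open_tag = r"<%s\b[^>]*>" % name
    close_tag = r"</%s>" % name
    # One character that does not begin an opening or closing tag of this name.
    plain = r"(?!<%s\b|</%s>)." % (name, name)
    content = r"(?:%s)*" % plain
    for _ in range(depth):
        # Each round permits one more level of <tag>...</tag> inside the content.
        content = r"(?:%s|%s%s%s)*" % (plain, open_tag, content, close_tag)
    return re.compile(open_tag + content + close_tag, re.DOTALL)

html = "<div>\n<div>\n</div>\n</div>\n\nafter"
block = nested_block_pattern("div", 5).search(html).group(0)
print(block == "<div>\n<div>\n</div>\n</div>")
# True
```

With a depth of 5, six <div> elements nested inside one another still
match from the outermost tag, while seven do not. A depth of 0 finds
only the inner <div> pair in the example above.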


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/



