Markdown doesn't always generate XHTML

Andrea Censi andrea at censi.org
Sat Mar 15 11:51:45 EDT 2008


On Sat, Mar 15, 2008 at 3:49 AM, Ulf Ochsenfahrt <ulf at ofahrt.de> wrote:

> Waylan Limberg wrote:
> > Regarding the security issues, I understand your concerns, but there
> > are some situations where all document authors are trusted
> > (authenticated) users and have a legitimate need for that feature. We
> > can't cut them off for everyone else. However, I know that
> > Python-Markdown has an option to not allow any html in a document
> > (this "safe_mode" can be set to either replace with a customizable
> > message, remove completely, or escape the html). Of course, to stay in
> > line with the Markdown standard, it is off by default, but very easy
> > to turn on in your code. Other implementations may offer a similar
> > option.
>
> Yes, there are situations where all document authors are trusted
> (authentication isn't trust though), but the fact remains that this
> makes markdown completely unusable for anything else. And worse, people
> are not made aware of this fact. I only encountered this by coincidence,
> because one of my users entered what looked like html tags into the forum.
>
> In summary:
> Markdown wasn't designed to handle this situation. Some implementations
> provide a 'safe mode' which aims to filter the code either before or
> after markdown conversion.
>
>
> Markdownj (Java, which I've been using) doesn't provide such an option.
>
> Markdown.pl doesn't provide such an option.
>
> Nanoki tries to, and fails (see related mail by Michel Fortin) on:
>
> <script <!--
> alert("Hello world!")
> </script <>
>
> PHP Markdown has something like this, and it has to be enabled in the
> source (?). It fails when no_markup=true and no_entities=false on:
> <script>alert('hallo');</script>
>
> Python markdown has such an option and it appears to work for simple
> tests. Looking at the code, python markdown apparently creates an XML
> document tree and serializes it, making sure that the generated code is
> always valid XML (that's a very good design choice if I may say so).
>
> I haven't tried Pandoc, which was also mentioned by John MacFarlane.
>
>


Thanks for the summary.
For completeness, Maruku's output is:

<pre class='markdown-html-error'>HTML parse error:
&lt;script &lt;!--
alert(&quot;Hello world!&quot;)
&lt;/script &lt;&gt;</pre>

You see, Maruku is used inside Jacques Distler's math-enabled branch
of Instiki [1] which outputs well-formed XHTML + MathML + SVG. You
can't really leave anything to chance. If there is even a single error
somewhere, the document does not validate and therefore it does not
render.

Maruku's treatment of raw XML is that it requires it to be well-formed
XML, with some user-friendly exceptions inspired by HTML (the user doesn't
have to close <br>, <img>, etc.).
If it isn't well formed, it triggers an error (nicely displayed on
stderr, or intercepted by API). Parsing goes on, but for convenience
it outputs the error in the document (see above; can be hidden by
CSS).
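
To make the behaviour described above concrete, here is a rough sketch
of the same idea in Python. It is not Maruku's code (Maruku is written
in Ruby), and the function name render_raw_xml is made up for this
illustration. It assumes the raw fragment can be checked by wrapping it
in a dummy root element and handing it to the standard library XML
parser. It does not implement the HTML-inspired exceptions for <br>,
<img>, etc., and it would also reject HTML named entities such as
&nbsp;.

```python
import html
import xml.etree.ElementTree as ET


def render_raw_xml(fragment: str) -> str:
    """Pass a raw XML fragment through if it is well-formed.

    Otherwise return an escaped error block, so the surrounding
    document stays well-formed. Hypothetical helper, not Maruku's API.
    """
    try:
        # Wrap in a dummy root so fragments with several top-level
        # elements (or leading text) can still be parsed.
        ET.fromstring("<root>" + fragment + "</root>")
    except ET.ParseError:
        # Escape &, <, > and quotes so nothing in the fragment is
        # interpreted as markup by the browser.
        escaped = html.escape(fragment, quote=True)
        return (
            "<pre class='markdown-html-error'>HTML parse error:\n"
            + escaped
            + "</pre>"
        )
    return fragment


if __name__ == "__main__":
    bad = '<script <!--\nalert("Hello world!")\n</script <>'
    print(render_raw_xml(bad))
    print(render_raw_xml("<em>fine</em>"))
```

The point of the design is that a malformed fragment never reaches the
output as markup: it is either accepted as well-formed XML or shown as
escaped text inside the error block.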

Jacques also did some work on sanitizing the XHTML document, but this
logically happens after Maruku.

[1]: http://golem.ph.utexas.edu/instiki/show/HomePage

--
Andrea Censi
PhD student, Control & Dynamical Systems, Caltech
http://www.cds.caltech.edu/~andrea/
"Life is too important to be taken seriously" (Oscar Wilde)

