HTML entities in URLs and urlencoding

Waylan Limberg waylan at gmail.com
Mon Mar 31 21:45:13 EDT 2008


We recently received the following bug report for the python-markdown
implementation:


> The "&" are escaped in URLs.
>
> An example:
> [Link](http://www.site.com/?param1=value1&param2=value1)
>
> Should output:
> <a href="http://www.site.com/?param1=value1&param2=value1">Link</a>
>
> Currently outputs:
> <a href="http://www.site.com/?param1=value1&amp;param2=value1">Link</a>
>
> So the "&" must not be escaped!


A fix is easy, but it occurred to me that perhaps links should be
urlencoded, or at least some chars should be: specifically, the "unsafe"
chars listed in RFC 1738 [1]. The "reserved" chars probably should be too
when not used in their approved manner (e.g., a colon should only be
allowed after the scheme (http://) or in the location
(user:pass@host:port) but should be encoded anywhere else). Of course,
that involves extra work, so I went to check what other
implementations do [2] and discovered that every one of them escapes
with HTML entities. Is there something I'm missing, or is this a bug?
As far as I can tell, the "&amp;" breaks the query string.
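
To make the urlencoding idea concrete, here is a minimal sketch in Python. It is not what python-markdown does. The function name encode_unsafe and the KEEP set are assumptions made for illustration. The sketch percent-encodes everything outside the RFC 1738 "reserved" set and leaves "%" alone so that already-encoded sequences are not encoded twice. It does not attempt the context-sensitive handling of reserved chars described above (for example, a colon outside the scheme or location).

```python
from urllib.parse import quote

# RFC 1738 "reserved" characters, passed through untouched in this sketch.
RESERVED = ";/?:@=&"
# Also keep "%" so existing percent-escapes such as "%20" are not re-encoded.
KEEP = RESERVED + "%"


def encode_unsafe(url):
    """Percent-encode characters outside KEEP (hypothetical helper).

    Letters, digits and "_.-~" are never encoded by urllib.parse.quote.
    Note that "#" is encoded too, since RFC 1738 lists it as unsafe, so
    a URL with a fragment would need to be split before calling this.
    """
    return quote(url, safe=KEEP)


print(encode_unsafe("http://www.site.com/?param1=value 1&param2=value1"))
```

With this approach the space in the query string is encoded, while the "&" separating the two parameters passes through unchanged. Whether that "&" should then also be written as an entity in the HTML output is a separate question.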

[1]: http://www.rfc-editor.org/rfc/rfc1738.txt
[2]: http://babelmark.bobtfish.net/?markdown=%5BLink%5D%28http%3A%2F%2Fwww.site.com%2F%3Fparam1%3Dvalue1%26param2%3Dvalue1%29&normalize=on&src=1&dest=2
--
----
Waylan Limberg
waylan at gmail.com

