Detab should be multi-byte aware?

John Gruber gruber at fedora.net
Mon Oct 9 23:51:49 EDT 2006


Michel Fortin <michel.fortin at michelf.com> wrote on 10/9/06 at
9:33 PM:


> I haven't tried it inside PHP Markdown yet, but I've tested
> `mb_strlen` and it seems to treat any invalid UTF-8 byte
> sequence as individual characters. So the net result is that
> text in ISO Latin, Windows Latin, or Mac Roman will work fine
> unless it contains sequences which are valid UTF-8. For
> instance, "é" in UTF-8 is seen as "√©" in Mac Roman, so if you
> have "√©" in a Mac Roman-encoded text it'll be treated as only
> one character. I'm not sure how high that risk is for all
> character combinations, but it is obviously less problematic
> than the current behaviour is for UTF-8.


That sounds great -- fits right in with my idea that UTF-8 and
only UTF-8 should be officially supported, but other encodings
should "just work" insofar as they've always "just worked" in
Markdown.

It's one of those things where the only time those character
sequences are likely to come up is when you're actually talking
about them as character sequences that look like UTF-8. E.g. this
very message.
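
To make the collision concrete, here is a rough sketch in Python
(not PHP, and not PHP Markdown's actual code) of the counting
behaviour Michel reports for `mb_strlen`. The helper name
`char_count` is made up for illustration, and the "every invalid
byte counts as one character" rule is an assumption taken from his
description, not something checked against PHP itself.

```python
def char_count(raw: bytes) -> int:
    """Count characters roughly the way Michel describes: a valid
    UTF-8 sequence counts once, and every byte that is not part of
    a valid sequence counts as one character of its own."""
    # "surrogateescape" maps each undecodable byte to a single
    # placeholder code point, so stray bytes are counted one by one.
    return len(raw.decode("utf-8", errors="surrogateescape"))


# Proper UTF-8: "é" is two bytes (C3 A9) but one character.
utf8_cafe = "café".encode("utf-8")
# ISO Latin-1: "é" is the lone byte E9, which is not valid UTF-8.
latin1_cafe = "café".encode("latin-1")
# Mac Roman: "√©" is the bytes C3 A9, which is also valid UTF-8 for "é".
mac_collision = "√©".encode("mac_roman")

print(char_count(utf8_cafe))      # 4
print(char_count(latin1_cafe))    # 4
print(char_count(mac_collision))  # 1, although two characters were typed
```

The first two cases are the "just works" part; the third is the
only kind of input that gets miscounted, and it takes a Mac Roman
text that literally contains "√©" to trigger it.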



> Yet another solution is a distinct configuration variable set
> to UTF-8 by default.


I think it's simpler and better to just say "use UTF-8".

-J.G.

