Detab should be multi-byte aware?

Michel Fortin michel.fortin at michelf.com
Tue Oct 10 10:18:30 EDT 2006


On Oct 10, 2006, at 3:17, A. Pagaltzis wrote:


> * John Gruber <gruber at fedora.net> [2006-10-10 05:55]:
>> I think it's simpler and better to just say "use UTF-8".
>
> +1
>
> UTF-8 is in fact deliberately constructed such that the chance of
> arbitrary text accidentally being valid UTF-8 approaches zero with
> increasing length of the text.


Except that increasing the length of the text won't have any effect
when using `mb_strlen` because:

1. I only pass small snippets through it to calculate the number of
spaces needed to replace the tab character, not the whole text;

2. If I give it a string with both valid and invalid UTF-8 byte
sequences, it will count each valid sequence as one character
and each invalid one as two or more, depending on the number of
bytes.

I decided to test this more systematically by writing a small
PHP script that displays how UTF-8 characters are interpreted by a
couple of ASCII-compatible 8-bit encodings. You can test it here (be
sure to look beyond the first page):

<http://www.michelf.com/docs/utf8-confusion.php>

From what I can see, it seems that ISO Latin and Windows Latin are
mostly immune to any problem. Mac Roman has a couple of problematic
strings which would be legitimate within text ("«é" for instance)
and which are also valid UTF-8. But the worst seems to come from ISO
Cyrillic and ISO Greek, which have many common character combinations
clashing with UTF-8 sequences. For instance: "Чорнобиль" (Chernobyl in
Ukrainian) is 9 characters, but if you encode it with ISO 8859-5 and
then count the characters as if they were UTF-8 characters, you find
only 4.
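
The two examples above can be reproduced with a short sketch. It is
written in Python rather than PHP, and `implied_length` and
`count_by_lead_byte` are hypothetical helpers: they assume a counter
that trusts the length implied by each lead byte and never validates
the continuation bytes. That is one way a count of 4 can arise; it is
not a statement of how `mb_strlen` is implemented.

```python
def implied_length(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte (assumed table)."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1  # ASCII, stray continuation bytes, invalid lead bytes


def count_by_lead_byte(data: bytes) -> int:
    """Count characters by skipping ahead per lead byte, without validating."""
    count = 0
    i = 0
    while i < len(data):
        i += implied_length(data[i])
        count += 1
    return count


word = "Чорнобиль"
cyrillic_bytes = word.encode("iso8859_5")  # one byte per letter
# characters, bytes, and the non-validating "UTF-8" count
print(len(word), len(cyrillic_bytes), count_by_lead_byte(cyrillic_bytes))
# 9 9 4 under the assumed table

mac_bytes = "«é".encode("mac_roman")  # bytes C7 8E
# These two bytes happen to form one valid UTF-8 sequence (U+01CE).
print(mac_bytes.decode("utf-8"))
```

A strictly validating decoder would reject the ISO 8859-5 bytes
outright, since every byte in that word looks like a lead byte and none
looks like a continuation byte.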

This shows that using `mb_strlen` in `detab` as I suggested could
cause problems, especially with non-Latin encodings, but also with
some rare, but not so silly, character combinations in Mac Roman.
That said, I think these problems are less important than UTF-8
characters not working right, so I still plan to use UTF-8 to count
the characters in `detab`.
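
To make the trade-off concrete, here is a minimal Python sketch of a
tab expander whose length function can be swapped. It is not the actual
`detab` code: the function names, the `length` parameter and the tab
width of 4 are assumptions made for illustration.

```python
TAB_WIDTH = 4  # assumed tab stop width


def utf8_byte_len(text: str) -> int:
    """Length in bytes once encoded as UTF-8 (what a byte counter sees)."""
    return len(text.encode("utf-8"))


def detab(line: str, length=len, tab_width: int = TAB_WIDTH) -> str:
    """Replace each tab with spaces up to the next tab stop.

    `length` decides how wide the text before a tab is considered to be.
    After padding, the column is a multiple of tab_width, so only the
    length of the current piece matters.
    """
    pieces = line.split("\t")
    out = []
    for piece in pieces[:-1]:
        pad = tab_width - length(piece) % tab_width
        out.append(piece + " " * pad)
    out.append(pieces[-1])
    return "".join(out)


print(repr(detab("é\tx")))                        # counts characters
print(repr(detab("é\tx", length=utf8_byte_len)))  # counts bytes
```

Counting characters pads "é" with three spaces, while counting bytes
pads it with only two, so the text after the tab lands one column early.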


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/



