[ANN] vfmd

Roopesh Chander roop at forwardbias.in
Fri Oct 4 11:22:08 EDT 2013


My responses inlined:

On Thu, Oct 3, 2013 at 11:36 PM, Michel Fortin <michel.fortin at michelf.ca> wrote:


> On Oct 3, 2013, at 11:38, Roopesh Chander <roop at forwardbias.in> wrote:
>
> Well, what I meant is that it's more maintenance work for everyone (spec
> writer and all implementers).
>


We are talking about a case where certain characters (e.g. tabs) in the
middle of some text are converted to a certain number of other characters
(e.g. spaces) in the output, just like that. And nothing to justify or even
explain this exists in the user-facing documentation. It surprises me that
this does not seem to you like a problem worth attempting to fix.



> Really? We should be more concerned with how it *should* be interpreted
> instead of how it *is* implemented?
>
> I'll just open a parenthesis here. You know what made the HTML5 parsing
> algorithm a success? It's quite simple actually. It formalized all the
> clunky patchwork that browsers were doing and created a parser algorithm
> that everyone could use. That meant that parsing of the `<title>` element
> is idiotically special-cased, so is `<script>`, so is `<plaintext>`, etc.
> Why? Because browser vendors could not start from a clean state: their
> browsers needed to be able to parse the thousands of millions of HTML
> documents on the web reliably, irrespective of how "well-formed" they were.
> The failure rate had to be tremendously small.
>


Let me think this analogy through:

Before HTML5, there was no consistency in how different browsers handled
*badly-formed* HTML.
With the HTML5 parsing algorithm, instead of "clunky patchworks", the
browsers could use a common algorithm that has been designed with most of
the use cases in mind. However, switching to the HTML5 parsing algo wasn't
easy (WebKit wrote over 10k LOC for just tokenising input and forming the
syntax tree [1]). So for HTML5, as I understand it, handling as many
current HTML documents as possible is a goal; minimizing the effort for
browsers to modify their code to adopt it is a non-goal.

[1]: https://www.webkit.org/blog/1273/the-html5-parsing-algorithm/

So to map your analogy to Markdown:
HTML5 parsing spec -> vfmd spec
Web Browsers -> Markdown implementations
HTML documents -> Markdown documents

For obvious input, most Markdown implementations agree. For non-obvious
input, they behave inconsistently. For vfmd, it's desirable that it handles
existing documents well. It's a non-goal to minimize the coding effort for
switching to vfmd (though that would be a nice-to-have).

However, there's another goal for vfmd that is placed at a higher priority:
Provide a user guide for the syntax that is consistent with the parsing
spec. The user guide can say stuff like this:
(a) "For blockquoting, start the line with '>' followed by an optional
space."
(b) "For code blocks, indent each line with 4 spaces or 1 tab"

Let's say a user wants to place a code block inside a blockquote. After
reading the above lines in the user doc, the user could write
(.=space;_=tab;tabstop=4):


>.__code block


If we preprocessed tabs, this wouldn't be interpreted as said in the user
doc. From the user's perspective, it would be like "I wrote it as said, but
things didn't happen as promised."
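
To make the mismatch concrete, here is a small Python sketch of the two
approaches applied to that line. The helper names (`strip_blockquote`,
`strip_code_indent`) and the simplified rules are assumptions for
illustration, taken only from (a) and (b) above; this is not the vfmd
algorithm or any existing implementation.

```python
TABSTOP = 4

def strip_blockquote(line):
    """Rule (a): drop a leading '>' and one optional space after it."""
    if not line.startswith(">"):
        return None
    rest = line[1:]
    return rest[1:] if rest.startswith(" ") else rest

def strip_code_indent(line):
    """Rule (b): drop a code-block indent of 4 spaces or 1 tab."""
    if line.startswith("\t"):
        return line[1:]
    if line.startswith("    "):
        return line[4:]
    return None  # not indented enough to be a code block

line = "> \t\tcode block"  # the example above: '>', space, tab, tab, text

# Approach 1: expand tabs to tab stops first, then parse.
preprocessed = strip_code_indent(strip_blockquote(line.expandtabs(TABSTOP)))

# Approach 2: parse with the tab characters still present.
tab_aware = strip_code_indent(strip_blockquote(line))

print(repr(preprocessed))  # '  code block'
print(repr(tab_aware))     # '\tcode block'
```

With preprocessing, the first tab sits at column 2 and becomes only 2
spaces, so the code block content ends up with 2 leading spaces, which is
neither what the user typed nor what (b) suggests. Parsing with the tabs
intact consumes one tab as the indent and keeps the other as content.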

I believe stuff like this can be solved only if we are aware of the tab
characters while parsing. If you say we should preprocess tabs to spaces
before parsing, can you come up with the right wording for the user
documentation that would be compatible with the preprocess-tabs method of
parsing?

> So a change in the treatment of tabs, while it might seem innocuous at
> first glance, is the kind of change that has the potential to break
> existing documents in various ways that are hard to predict even for an
> expert reviewing a document in text form (all whitespace looks the same
> after all).



I think we will be able to discuss this better with actual examples. You
mention "lists within blockquotes" later in the mail, maybe you can
elaborate on that.



>
> >> The more the spec deviates from what the parsers are actually doing,
> >> the more difficult it'll be to adopt for implementers for two reasons:
> >> implementation work and the potential to break our users' documents.
> >
> > Let's consider each of your reasons one by one.
> >
> > ### Reason 1: Implementation work
> >
> > vfmd can entice developers to adopt it on two orthogonal, sometimes
> > conflicting factors:
> > (a) It's easy to adopt it
> > (b) It gives the best possible interpretation for any input
> >
> > vfmd anyway has a different parsing architecture from most current
> > implementations (per my knowledge), so (a) wouldn't stand. Just
> > satisfying (a) wouldn't be very persuasive either. If it wants a chance
> > at being implemented, it's got to aim for (b), even if that can be a
> > little detrimental to (a). It should be easy to adopt, but not at the
> > cost of correctness.
>
> But are you sure about B? I'm not convinced it is so much better.
> Replacing tabs with spaces before parsing means that we interpret things
> the same way as a 4-spaces-per-tab-stop editor will display them, always.
> Even if you have an invisible stray tab somewhere, anywhere, if it looks
> right in your editor it'll work. On the other side, your algorithm assumes
> that tabs are always intentional; it will break if somewhere there's a
> stray tab that was not meant to be there. It's not that clear-cut to me
> which is better. It is just based on different assumptions, and will fail
> in different circumstances. Changing behaviour is more likely to fail with
> existing documents however.
>


I guess anything that exists in the document should be considered as being
put there intentionally. I don't think we should be trying to *fix* the
user's document for him. (For example, we don't filter off unmatched
asterisks assuming they were stray characters). It's futile to "guess"
whether a character in the input is there by intent or by mistake.

And anyway, for HTML output, actual tab characters in the input matter only
in code blocks/spans, not in normal text: there, a tab and a space display
identically in browsers unless overridden in CSS.



>
> > ### Reason 2: Breaking existing documents
> >
> > Are you talking about list handling or for other parts of the syntax? For
> > lists, for users using tabstop=4, the behaviour is the same, as we saw
> > earlier.
>
> It's not always the same, even for lists. Putting your list inside a
> blockquote changes things because it adds a two-column indentation. Some
> extension features I mentioned before may also be affected, making things
> harder to adopt for implementations that have to support extensions (pretty
> much all of them actually).
>


Please give a few examples of inputs that would break. Maybe that way, I
will be able to appreciate how adverse a change this will be, if made.
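
As a starting point, here is a guess at the kind of input that might be
meant, sketched in Python. The sample line and the helper `indent_width`
are assumptions made up for illustration, not something taken from your
mail or from any implementation: a tab-indented list item placed inside a
blockquote, with tabstop=4.

```python
TABSTOP = 4

def indent_width(text):
    """Number of leading spaces in text."""
    return len(text) - len(text.lstrip(" "))

line = "> \t- item"  # '>', space, tab, then a list item

# Expand tabs over the whole line, then drop the "> " prefix.
expand_first = line.expandtabs(TABSTOP)[2:]

# Drop the "> " prefix, then expand tabs in what remains.
strip_first = line[2:].expandtabs(TABSTOP)

print(indent_width(expand_first))  # 2
print(indent_width(strip_first))   # 4
```

The two-column "> " prefix shifts the tab stop, so the same tab is worth 2
columns in one order and 4 in the other. Whether existing documents
actually depend on either result is what real examples would show.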



> > For tabs within code blocks, the behaviour would be different, but
> > I would be surprised if users *relied* on tabs within code blocks turning
> > into spaces in the output.
>
> Actually, *I* rely on tabs being converted to spaces within code blocks in
> many of my documents. It happens a lot that I have tabs in the code I
> copy-paste, and since browsers don't all show tabs in a consistent way
> inside a `<pre>`, it's much better if they get converted to spaces. But I
> didn't think tabs inside code blocks were into question here, are they?
>


As far as I know, most browsers use 8 as tabstop by default, so it's fairly
consistent. I didn't understand what you meant by "I didn't think tabs
inside code blocks were into question here, are they?".

Is there anyone else on this list who relies on tabs inside code blocks
being converted to spaces?

BR,
roop.


More information about the Markdown-Discuss mailing list