Backtick Hickup

Michel Fortin michel.fortin at michelf.com
Sun Sep 2 11:12:32 EDT 2007


On 2007-08-28 at 19:32, Allan Odgaard wrote:

> On Aug 27, 2007, at 10:35 PM, Michel Fortin wrote:
>
>> I don't find them confusing, but perhaps it's only because I'm
>> used to it. Which aspect of it do you find confusing?
>
> Maybe ‘intuitive’ would have been a better choice of word. But this
> thread started because somebody did not understand how to embed
> back-ticks in back-tick quoted strings -- personally I didn’t
> understand it either until I looked at the implementation.

You don't have to look at the implementation to see what you need to
do to include a backtick inside a code span; it's documented in the
[code span section][1] of the syntax documentation (see the third example).

[1]: http://daringfireball.net/projects/markdown/syntax#code
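
For what it's worth, here is a minimal Python sketch of the rule as I read the documentation: a code span opens with a run of one or more backticks and closes with a run of the same length, so a longer delimiter lets you put shorter runs of backticks inside. The regular expression and the `find_code_spans` name are assumptions made for illustration; this is not the code used by Markdown.pl or PHP Markdown.

```python
import re

# Assumed rule: the opening run of backticks must be matched by a closing
# run of exactly the same length (neither run may touch another backtick).
CODE_SPAN = re.compile(r"(?<!`)(`+)(?!`)(.+?)(?<!`)\1(?!`)", re.DOTALL)

def find_code_spans(text):
    """Return the content of each code span, surrounding whitespace stripped."""
    return [m.group(2).strip() for m in CODE_SPAN.finditer(text)]

# A single backtick inside a span delimited by double backticks.
print(find_code_spans("``There is a literal backtick (`) here.``"))

# Padding spaces let the content itself start and end with a backtick.
print(find_code_spans("Write `` `foo` `` to show the delimiters."))
```

The second call shows why the spaces matter: without them, the content's own backticks would merge with the delimiter run.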

What's really confusing about this is that the syntax description
document on Daring Fireball was accidentally reverted at one point
(late 2005 or early 2006, I don't know) to an older version
describing instead the behaviour of Markdown 1.0 (where backslash
escapes are applied inside code spans). It took about a year and a
half until this was fixed. I'm pretty sure many people read that
version, so it's no wonder many are confused about it.

>> [...]
>> I think I prefer the current behaviour. I can't really see when
>> having to escape the content of code span would be useful. Perhaps
>> you had something in mind when proposing that?
>
> Yes, when you need special characters -- you can’t use entities
> inside `…` so ``…`` would allow you to do e.g. \u2620 for a unicode
> character or similar -- with everybody using utf-8 these days
> (knock on wood) escape codes for special characters are less useful
> than in the past.

Markdown doesn't support Unicode escapes. It supports HTML entities
though, so you can write <code>&#x2620;</code> if you want.
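
To make the difference concrete, here is a small Python sketch. It only models the behaviour described in this thread: the content of a backtick code span is escaped, so an entity shows up literally, while an inline `<code>` element is passed through and the browser decodes the entity. Both function names are hypothetical stand-ins, not part of any Markdown implementation.

```python
import html

def backtick_span(content):
    # Hypothetical stand-in for a Markdown code span: the content is
    # escaped, so an entity typed between backticks is shown literally.
    return "<code>" + html.escape(content, quote=False) + "</code>"

def raw_code_element(content):
    # Inline HTML is left untouched, so the browser decodes the entity.
    return "<code>" + content + "</code>"

print(backtick_span("&#x2620;"))      # the ampersand is encoded
print(raw_code_element("&#x2620;"))   # the entity survives as written

# &#x2620; is the numeric character reference for U+2620.
print(html.unescape("&#x2620;") == "\u2620")
```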

>> [...]
>> I have some difficulty figuring out what you mean by "embedded
>> HTML does not lend itself well to the 'split the document into
>> paragraphs'".
>>
>> Markdown currently distinguishes block-level HTML elements from
>> span-level HTML elements: The former creates blocks which are left
>> alone by Markdown (and left outside paragraphs) while the latter
>> gets wrapped into paragraphs (as valid HTML expects them to be)
>> along with Markdown-formatted text.

>
> Yes, we are dependent on Markdown finding the HTML before it does
> the paragraph splitting, so it doesn’t insert <p> in my HTML -- yet
> the present heuristic to find HTML is easily confused (talking
> Markdown.pl), for me it actually got worse when John switched to
> the Perl library thing.

Yes, Markdown.pl has several limitations regarding HTML, and I think
it's known that the new parser in the latest betas doesn't work so
great either. Recent releases of PHP Markdown are much better in that
regard.

> In fact, presently I have my own preprocessor for my Markdown pages
> (on my site, which sometimes need to embed tables and stuff) to
> take out the HTML before giving it to Markdown -- although this is
> also because Markdown does not know about <% scripting %> <?php
> tags ?> and since there is no grammar where I can just educate it
> about them, I need to handle that myself in a pre-parse step.

There *is* a grammar for HTML blocks: it's the regular expressions in
_HashHTMLBlocks. Sure, you need to be a little careful about not
catching other Markdown constructs there (especially code blocks),
but it's not that complicated if you base your work on the
already-existing regular expressions. If you capture what you want, hash the
block and put the hash alone on one line separated by two blank
lines (as it is done everywhere in _HashHTMLBlocks), Markdown will
handle it correctly later.

That said, you can do the same by preprocessing the input given to
Markdown too; it's not much different.
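
As a rough illustration of the preprocessing variant, here is a Python sketch. Everything in it is an assumption made for the example: the pattern for `<% ... %>` and `<?php ... ?>` blocks, the choice of SHA-256 for the placeholder, and both function names. It also assumes script blocks stand on their own, since every match becomes a separate block. Because the placeholder is unknown to Markdown when it is inserted from the outside, it will probably come back wrapped in `<p>` tags, so the restore step handles that case too.

```python
import hashlib
import re

# Hypothetical pattern: ASP-style <% ... %> and PHP-style <?php ... ?> blocks.
SCRIPT_BLOCK = re.compile(r"<%.*?%>|<\?php.*?\?>", re.DOTALL)

def hide_script_blocks(text):
    """Swap each script block for a hash placeholder; return (text, table)."""
    table = {}

    def stash(match):
        block = match.group(0)
        key = hashlib.sha256(block.encode("utf-8")).hexdigest()
        table[key] = block
        # The placeholder sits alone on its line, set off by blank lines.
        return "\n\n" + key + "\n\n"

    return SCRIPT_BLOCK.sub(stash, text), table

def restore_script_blocks(output, table):
    """Put the original blocks back into the HTML produced by Markdown."""
    for key, block in table.items():
        # Drop the paragraph wrapper if the placeholder was treated as text.
        output = output.replace("<p>" + key + "</p>", block)
        output = output.replace(key, block)
    return output
```

You would call `hide_script_blocks` on the source, give the resulting text to Markdown, then call `restore_script_blocks` on the output with the same table.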

>>> Anyway, if we agree that everything is dependent on everything
>>> that precedes it, I think we can slowly start to agree that
>>> *also* having things depend on what follows, is problematic.
>> Well, I think you mean problematic for writing a parser, in which
>> case I disagree.
>
> No, I mean problematic as in; what the hell should we do? You and I
> disagree about how to interpret the same line of Markdown exactly
> because it depends on the angle you view it from (read: which token
> you think is most important), i.e. totally subjective…

There are subjective things all over the world. Who decides which
operator has which priority in programming languages? This is a
design issue; it should be decided by balancing people's expectations,
usefulness, and verbosity, and all that is subjective.

>> Well, look at how the WHATWG is defining HTML right now: it's
>> exactly that. They describe how the parser works (in English), and
>> everything that matches its behaviour is conforming...
>
> Yes, and do you know *why* they are doing that?

I think they're defining HTML parsing in English because it'd be very
complex as a formal grammar and thus unapproachable to the common
programmer.

But you seem to have understood the question as "why are they
(re)defining how to parse HTML?", which is a valid question too, but not
really what I wanted to get into.

> [...]
>
> So given this rather broken situation, the WhatWG decided to try to
> figure out in which ways all the browsers were broken and document
> that to get them in sync, and make that the official spec, so that
> we can move on with (expanding) the HTML specification w/o cutting
> backwards compatibility -- because browser vendors don’t want
> existing pages to break, cause that makes them lose users, so if
> W3C adds features to HTML which require the browser to have a
> strict parser to really work, browser vendors may not do it because
> of backwards compatibility, or something like that…
>
> You really think Markdown should take the same route? ;)

Well, what's the alternative? We could try to develop a "cleaner"
Markdown and encourage people to use that instead. Many people would
have to give up on backward compatibility, but "cleaner" Markdown
documents would be guaranteed to be parsed the same with every parser.
That's the XHTML route: a diverging path from what's currently in
use, with some benefits but loads of uncertainties about its adoption.

I think it's important that if we specify Markdown in detail, we do
it in a way that is actually useful for parsing today's Markdown
documents, not according to a utopian view of what the syntax should be.

Being able to parse today's documents doesn't mean adopting every
quirk of the Markdown.pl parser. No, it only means that we do not
change things which are likely to break a considerable number of
documents, and we should refrain from changing things "officially"
documented on John Gruber's syntax description page. It doesn't mean
we can't add new features either; we only need to make sure they're
not likely to break current documents.

>> which brings out an interesting side topic: how should HTML be
>> parsed (or even specified) within Markdown? :-)
>
> I would say strict (for which a grammar is pretty simple)! There is
> no reason Markdown should conform to the looser WhatWG definition,
> since strict HTML is a subset of WhatWG’s definition, and they made
> a superset only to be compatible with existing bad pages, but
> Markdown does not need to support that.

That doesn't really answer my question.

Markdown should handle "well-formed" HTML correctly, that's for sure.
The question is what Markdown should do when it encounters malformed
HTML, such as an unclosed tag, an unclosed quoted attribute,
misnested tags, etc. Personally, I think we should handle things mostly
as the WHATWG spec defines them, at least at the tokenization stage.
Obviously some things would need to be tweaked to coexist with
Markdown, but I think if everyone parses HTML the same, we'll only
benefit.

Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/
