evolving the spec (was: forking Markdown.pl?)
    Allan Odgaard 
    29mtuz102 at sneakemail.com
       
    Tue Mar  4 00:49:24 EST 2008
    
    
  
On 3 Mar 2008, at 13:30, Michel Fortin wrote:
> [...]
>> 1. A regexp that makes the parser enter the context the rule
>> represents (e.g. block quote, list, raw, etc.).
>>
>> 2. A list of which rules are allowed in the context of this rule.
>>
>> 3. A regexp for leaving the context of this rule.
>>
>> 4. A regexp which is pushed onto a stack when entering the context of
>> this rule, and popped again when leaving this rule.
>>
>> The fourth item here is really the interesting part, because it is
>> what made Markdown nesting work (99% of the time) despite this being
>> 100% rule-driven.
>
> I'm not sure what the regular expression in 4 does, besides being
> pushed and popped from the stack
Yeah, I accidentally sent the letter without noticing that I had
forgotten to explain the fourth item.
The regexps which end up on this stack are used to preprocess the
current line. For example, the rule for code blocks is:
     RAW[1] = /\g {4}/          # Four spaces start raw.
     RAW[2] = [ RAW_TEXT ]      # No other rules are active inside raw;
                                # RAW_TEXT is a dummy .+ rule.
     RAW[4] = /\g( {4}| {,3}$)/ # While in the raw context, we need to
                                # eat the first four spaces of each line,
                                # or the line must be empty.
Two things to notice here:
  1. I don’t use an explicit ‘end’ rule since we automatically leave  
the context if RAW[4] doesn’t successfully match.
  2. I use \g instead of ^ since we need to anchor to where the last  
block-rule stopped matching, not necessarily BOL.
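As a minimal sketch of the two points above, here is RAW[4] translated to Python. This is an illustration, not the actual implementation: Python's re module has no \g anchor, so the sketch assumes that Pattern.match(line, pos), which only matches starting at pos, can stand in for it. The names RAW_PREFIX and eat are hypothetical.

```python
import re

# Hypothetical translation of RAW[4]. Pattern.match(line, pos) plays the
# role of \g: the match must start exactly where the last rule stopped.
RAW_PREFIX = re.compile(r"( {4}| {,3}$)")

def eat(pattern, line, pos):
    """Return the position after a match anchored at pos, or None."""
    m = pattern.match(line, pos)
    return m.end() if m else None

print(eat(RAW_PREFIX, "    code", 0))    # 4: four spaces are eaten
print(eat(RAW_PREFIX, "", 0))            # 0: an empty line stays in raw
print(eat(RAW_PREFIX, "text", 0))        # None: no match, so we leave raw
print(eat(RAW_PREFIX, ">     code", 2))  # 6: anchored after an eaten '> '
```

The last call shows why anchoring at the current position matters: the four spaces are not at the beginning of the line, yet the rule still applies once the quote prefix has been consumed.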
Now take the rule for block quote:
     BQ[1] = /\g {,3}> {,3}/    # We start it for lines with > allowing
                                # up to 3 spaces before/after.
     BQ[2] = [ BQ, RAW, PAR, … ] # Basically all block elements
                                 # can go inside block quote.
     BQ[3] = /\g( *$|«hr»)/     # We leave block quote at empty lines or
                                # horizontal rulers¹. The actual pattern
                                # for «hr» is something like:
                                #     [ ]{,3}(?<M>[-*_])([ ]{,2}\k<M>){2,}[ \t]*+$
     BQ[4] = /\g( {,3}> ?)?/    # While in BQ eat leading quote characters.
¹ I am actually not sure if this is “the spec” or just a bug. But
placing a horizontal ruler just below a block quoted paragraph does
not trigger the expected “lazy mode”, which would place the <hr> inside
the block quote; instead it leaves the block quote.
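To make BQ[3] concrete, here is a hedged Python rendering of the «hr» pattern and the leave test. It is a sketch under assumptions: the named group syntax is adapted to Python ((?P<M>…) and (?P=M)), the possessive *+ is written as a plain * because older Python versions do not support possessive quantifiers, and the helper name leaves_block_quote is made up for the example.

```python
import re

# «hr» adapted to Python syntax: a marker character, then at least two
# repeats of the same marker, each preceded by up to two spaces.
HR = r"[ ]{,3}(?P<M>[-*_])(?:[ ]{,2}(?P=M)){2,}[ \t]*$"

# BQ[3]: leave the block quote at an empty line or a horizontal ruler.
BQ_LEAVE = re.compile(r"(?: *$|" + HR + r")")

def leaves_block_quote(line, pos=0):
    """True if BQ[3] matches at pos, i.e. the quote context ends here."""
    return BQ_LEAVE.match(line, pos) is not None

for line in ["", "* * *", "---", "-*-", "--", "more quoted text"]:
    print(repr(line), leaves_block_quote(line))
```

The backreference is what rejects a mixed line such as '-*-': every repeated marker has to be the same character that the first group captured.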
Just to make the example more complete, let us also have a paragraph  
rule:
     PAR[1] = /\g {,3}(?=[^ >])/ # Any non-special character with fewer
                                 # than 4 leading spaces starts a paragraph.
     PAR[2] = [ B, EM, LINK, TEXT, … ] # All the inline stuff works in
                                       # this context.
     PAR[3] = /\g(?=    | {,3}>| {,3}$)/ # We exit the paragraph when the
                                         # line is starting raw, block
                                         # quote, or is empty. In practice
                                         # paragraphs do end with block
                                         # quote, but not with raw.
Now we have 3 rules. Be aware that I typed all this just now without
actually testing it; the goal is not to replicate Markdown.pl 100%,
just to give an example of how the rule system works.
So our ROOT rule looks like this:
     ROOT[1] = //
     ROOT[2] = [ RAW, BQ, PAR ]
So when we start to process a document, using this root rule, we will  
get a match (without actually advancing our position in the document,  
since zero characters were matched).
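One possible way to hold the four items of a rule in code is sketched below. The Rule class and its field names are hypothetical, chosen only to mirror items 1 to 4; the point is that ROOT's empty regexp matches at position 0 without consuming anything.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Rule:
    name: str
    enter: str                                         # item 1
    children: List[str] = field(default_factory=list)  # item 2
    leave: Optional[str] = None                        # item 3
    prefix: Optional[str] = None                       # item 4

# ROOT only has items 1 and 2, as in the listing above.
ROOT = Rule("ROOT", enter=r"", children=["RAW", "BQ", "PAR"])

m = re.compile(ROOT.enter).match("> A normal paragraph", 0)
print(m.end())        # 0: ROOT matched without advancing
print(ROOT.children)  # ['RAW', 'BQ', 'PAR']
```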
After this match we have RAW, BQ, and PAR as active rules. Say our  
document looks like this:
     > A normal paragraph
     >     Some raw text
     > Normal text again
     Out of the block quote
The first line is ‘> A normal paragraph’ and we have 3 rules to apply:
BQ[1], RAW[1], and PAR[1].
Since all of these regexps start with \g, they are anchored to the
first byte of the document, and only BQ[1] will match.
This “eats” the ‘> ’ prefix, pushes BQ[4] on our stack, and makes BQ,
RAW, and PAR our new active rules (yeah, the same as before).
So we now have ‘A normal paragraph’ and again apply our 3 active rules.
This time PAR[1] will match. It won’t actually eat any characters, and
it won’t push additional rules onto our stack, but it will change the
active rules to: B, EM, LINK, TEXT, …
I didn’t define TEXT, but that is a fallback rule for non-special text
runs. We apply these rules to the line, and TEXT will match the line.
Now comes the special part. When we move to the next line, which is
‘>     Some raw text’, we start by applying the rules from our stack to
this line. We have BQ[4] on the stack, which will eat the leading ‘> ’.
The line is now ‘    Some raw text’ and we have no more rules on the
stack. Before we apply the active rules, though, we need to check
whether we need to leave the current context, which is PAR. Thus we try
to apply PAR[3], and we do get a match, so we leave PAR.
The active rules now revert to those active before we entered PAR,  
i.e. RAW, BQ, and PAR. Applying these will give a match for RAW, so we  
eat the match (the leading four spaces), push RAW[4] on the stack, and  
set the new active rules to RAW[2], i.e. RAW_TEXT.
The line is now ‘Some raw text’ which will be eaten by the RAW_TEXT  
rule.
Next line is ‘> Normal text again’ and we have both BQ[4] and RAW[4]
on the stack. We apply these in FIFO order, so first BQ[4], which
eats ‘> ’, then RAW[4], which fails to match, instructing us to leave
RAW, …
Okay, enough writing — I hope the above gives a better understanding  
of how the rules are used.
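The walkthrough above can be replayed with a small Python sketch. It is an illustration under several assumptions, not the real parser: Pattern.match(line, pos) stands in for \g, only TEXT is defined among the inline rules, a rule with no children is treated as a leaf that eats text without opening a context, the item-3 check is repeated for each enclosing context, and the handling of the third line after leaving RAW goes beyond what the walkthrough spells out. The names RULES, rx, and parse are hypothetical.

```python
import re

def rx(pattern):
    return re.compile(pattern)

HR = r"[ ]{,3}(?P<M>[-*_])(?:[ ]{,2}(?P=M)){2,}[ \t]*$"

# name: (1 enter, 2 allowed rules, 3 leave, 4 per-line prefix)
RULES = {
    "ROOT":     (rx(r""), ["RAW", "BQ", "PAR"], None, None),
    "RAW":      (rx(r" {4}"), ["RAW_TEXT"], None, rx(r"( {4}| {,3}$)")),
    "BQ":       (rx(r" {,3}> {,3}"), ["BQ", "RAW", "PAR"],
                 rx(r"(?: *$|" + HR + r")"), rx(r"( {,3}> ?)?")),
    "PAR":      (rx(r" {,3}(?=[^ >])"), ["TEXT"],
                 rx(r"(?=    | {,3}>| {,3}$)"), None),
    "RAW_TEXT": (rx(r".+"), [], None, None),  # leaf rules: no children
    "TEXT":     (rx(r".+"), [], None, None),
}

def parse(lines):
    contexts = ["ROOT"]  # entered contexts, outermost first
    trace = []
    for line in lines:
        pos, events = 0, []
        # Step 1: apply the stacked item-4 regexps in FIFO order.
        for depth, name in enumerate(contexts):
            prefix = RULES[name][3]
            if prefix is None:
                continue
            m = prefix.match(line, pos)
            if m is None:  # prefix failed: leave this context
                events += ["leave " + n for n in reversed(contexts[depth:])]
                del contexts[depth:]
                break
            pos = m.end()
        # Step 2: check the item-3 regexp of the current context.
        while RULES[contexts[-1]][2] is not None:
            m = RULES[contexts[-1]][2].match(line, pos)
            if m is None:
                break
            pos = m.end()
            events.append("leave " + contexts.pop())
        # Step 3: apply the active rules until the line is consumed.
        while pos < len(line):
            for name in RULES[contexts[-1]][1]:
                m = RULES[name][0].match(line, pos)
                if m is not None:
                    break
            else:
                break  # no active rule matched; give up on this line
            pos = m.end()
            if RULES[name][1]:  # the rule has children: enter its context
                contexts.append(name)
                events.append("enter " + name)
            else:
                events.append(name)
        trace.append(events)
    return trace

doc = ["> A normal paragraph", ">     Some raw text", "> Normal text again"]
for line, events in zip(doc, parse(doc)):
    print(repr(line), events)
```

For the three quoted lines this prints the events enter BQ, enter PAR, TEXT; then leave PAR, enter RAW, RAW_TEXT; then leave RAW, enter PAR, TEXT. The first two lines follow the steps described above, and the third is the sketch's own continuation after RAW[4] fails.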
> [...] You also need a way for the regular expression in 3 to be  
> variable depending on what you caught in 1 (to match the same number  
> of backticks in a code span for instance; to catch a matching  
> closing HTML tag, etc.).
I allow captures from the match done by 1 to be referenced in 3.
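A hedged sketch of that idea, using the code span case from the quoted question: the leave regexp is built from whatever the enter regexp captured. The rule itself, and the names SPAN_ENTER, leave_pattern, and code_span, are hypothetical and only meant to show the mechanism.

```python
import re

# Hypothetical code span rule: item 1 captures the opening backtick run.
SPAN_ENTER = re.compile(r"(`+)")

def leave_pattern(enter_match):
    """Build the item-3 regexp from what item 1 captured."""
    ticks = re.escape(enter_match.group(1))
    # Exactly the same run of backticks, not part of a longer run.
    return re.compile(r"(?<!`)" + ticks + r"(?!`)")

def code_span(text):
    """Return the content of a code span starting at text[0], or None."""
    m = SPAN_ENTER.match(text)
    if m is None:
        return None
    end = leave_pattern(m).search(text, m.end())
    return text[m.end():end.start()] if end else None

print(code_span("``a ` b`` rest"))  # a ` b
print(code_span("`a``b` rest"))     # a``b
print(code_span("`` unclosed"))     # None
```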
    
    