When I started to write the code for the latest incarnation of eigenclass, I was planning to use an existent Markdown processor to generate the HTML for the posts and comments dynamically. That'd take at most a couple lines to pipe the markup to a process and read back the HTML. I took the first Markdown implementation that came to mind, Bluecloth (written in Ruby) and ran it against a few documents. I was most underwhelmed by its speed. It was so slow it'd need over one or two seconds to process some of the entries I've written since. I benchmarked other common implementations, the original markdown (in Perl) and python-markdown, and realized that they were only marginally better. At the risk of being perceived as performance-obsessed, here's the observed performance when processing markdown's README (README.n is README concatenated n times) on a 3GHZ AMD64 box (much faster than the old server running this site):
| language | LoCs (approx.) | README.1 time | README.8 time | README.32 time | README.32 MEM | |
|---|---|---|---|---|---|---|
| Bluecloth | Ruby | 1100 | 0.130s | 2.16s | 30s | 31MB |
| markdown | Perl | 1400 | 0.068s | 0.66s | segfault | segfault |
| python-markdown | Python | 1900 | 0.090s | 0.35s | 2.06s | 23MB |
| Pandoc | Haskell | 900 + 450 | 0.068s | 0.55s | 2.7s | 25MB |
Compare to the rather acceptable performance of my own Simple_markup module in OCaml, and of discount, a C implementation I found when I had already written mine:
| language | LoCs (approx.) | README.8 time | README.32 time | README.32 MEM | |
|---|---|---|---|---|---|
| Simple_markup | OCaml | 313 + 55 | 12ms | 43ms | 3.5MB |
| discount | C | ~4500 | 16ms | 63ms | 2.8MB |
(The LoC counts for Simple_markup and Pandoc are split into parsing and HTML generation.)
I didn't do any attempt to optimize Simple_markup beyond replacing a single O(n^2) call to String.nsplit with a O(n) Str.split one in order to split the input string into lines. I'm not compiling with -unsafe or -nodynlink either.
To add insult to injury, Bluecloth, markdown and python-markdown are ugly hacks that boil down to iterated regexp-based gsubs. I can see why they have a long history of bugs: it is easy for a gsub pass to interfere accidentally with another, and such regexp-based transformations are full of corner cases.
A much cleaner approach is to parse the markup into a parse tree, and then generate the (X)HTML in a separate pass. This is what Pandoc, discount and Simple_markup do.
Recognized markup
Simple_markup deviates from Markdown in a few ways:
any character can be escaped with \,
emphasis is done with __ (less prone to accidental use than _) and bold text with *,
typographical abuses like bold emphasized text are not allowed,
headers are done with
!level1 header,!!level2 header, etc.,the
#character is thus free and can be used for numbered lists, replacing1.,pre-formatted code is done with
{{ whatever }}instead of using the indentation (less error-prone), and
it is possible to extend the markup with custom processors, using
{{extension-nameblocks; for instance, raw HTML is inserted with{{html <b> whatever </b> }}(before you try to inject arbitrary HTML in the comments: this extension is only enabled in the main text).
Some comments about the implementation
The bulk of the code parses a stream of lines (I use Extlib/Batteries' Enum) into a list of paragraphs, defined as:
type paragraph =
Normal of par_text
| Pre of string * string option
| Heading of int * par_text
| Quote of paragraph list
| Ulist of paragraph list * paragraph list list
| Olist of paragraph list * paragraph list list
and par_text = text list
and text =
Text of string
| Emph of string
| Bold of string
| Struck of par_text
| Code of string
| Link of href
| Anchor of string
| Image of img_ref
and href = { href_target : string; href_desc : string; }
and img_ref = { img_src : string; img_alt : string; }
The Ulist and Olist constructors take the first item followed by a (possible empty) list of items, to prevent empty lists --- this way, there's at least one element.
Note that Emph, Bold and Struck don't contain more text, but only strings. It'd be rather easy to allow this by using polymorphic variants, as follows:
and simple_text = [`Text of string | `Emph of simple_text
|`Bold of simple_text | `Struck of simple_text]
and text = [ simple_text | `Code of string | ... ]
The same goes for styled text inside links. I chose not to do it, though, because bold emphasized text is just poor style.
The top-down parser takes 313 LoC and is not very complex, in spite of markdown's use of significant indentation. At its core, it's just a set of recursive functions that take an enumeration of (int * string * bool) tuples with the indentation, the string and whether the line is empty, for convenience:
let parse_enum e =
collect (read_paragraph 0)
(Enum.map (fun l -> let l' = String.strip l in (indentation l, l', l' = "")) e)
where read_paragraph indent returns Some paragraph or None at EOF or when the indentation level is less than indent, and
let collect f x =
let rec loop acc = match f x with
None -> List.rev acc
| Some y -> loop (y :: acc)
in loop []
Since the parser operates on an enumeration, it can work incrementally, reading a line from disk at a time, if wanted. As an aside, it seems to me that this is one of the two most important use cases for laziness (the other being, well, computations performed on demand, which are easy to do with OCaml's lazy), and it is perfectly satisfied by Enum.
The rest of the code is very top-down:
let rec read_paragraph ?(skip_blank=true) indent e = match Enum.peek e with
None -> None
| Some (indentation, line, isblank) -> match isblank with
true ->
Enum.junk e;
if skip_blank then read_paragraph indent e else None
| false ->
if indentation < indent then
None
else begin
Enum.junk e;
read_nonempty indentation e line
end
and read_nonempty indent e s = match s.[0] with
'!' -> read_heading s
| '*' when snd_is_space s -> push_remainder indent s e; read_ul indent e
| '#' when snd_is_space s -> push_remainder indent s e; read_ol indent e
| '{' when snd_is s '{' -> read_pre (String.slice s ~first:2) e
| '>' when snd_is_space s || s = ">" ->
(* last check needed because "> " becomes ">" *)
Enum.push e (indent, s, false); read_quote indent e
| _ -> Enum.push e (indent, s, false); read_normal e
and read_heading s =
let s' = String.strip ~chars:"!" s in
let level = String.length s - String.length s' in
Some (Heading (level, parse_text s'))
and read_ul indent e =
read_list
(fun fst others -> Ulist (fst, others))
(fun s -> snd_is_space s && s.[0] = '*')
indent e
...
Comments
test
the
the
the
For those of us who think this code is awesome enough for production use, what is it license?
I can only agree with you: these common markup languages are easily really bad. We tried textile for savonet's website, since it seemed to be standardized, but quickly realized that it was only defined by its broken (and slow) implementation.
We ended up quickly hacking a simple markup language in ocaml. The particularity of our approach is that we used ocamllex/yacc to generate the AST, with a simple preprocessing function between the two. It gives a fast and strict implementation.
Nice comparison. And I like the continuing "OCaml can do it like this ..." theme. ;)
Did you do any comparisons with some of the other ruby markdown processors? I believe Maruku is a bit faster than Bluecloth, but I'm not sure if it holds up on large documents. I have never looked at its internals in any detail. I remember liking Maruku's easy extension capabilities.
I believe there is also a ruby extension to discount, as well one to PEG. Ah yes, Tomayko covered them on his blog: http://tomayko.com/writings/ruby-markdown-libraries-real-cheap-for-you-two-for-price-of-one (sure, extensions might be a bit off-topic here. but I've always been a fan of how easy it is to tie Ruby to C.)
Why do you think README.n is a good file to do the comparison?
And besides, don't you state simple_markup is doing something different than the original markup? Aren't you comparing apples with oranges then?
However, I agree, gsub() all over the place is not a good design decision (most of the time).
Pretty cool.
Just a heads up, if others are looking for a super fast Ruby BlueCloth replacement check out rdiscount. Much faster and it's 100% Markdown 1.0 compatible.
Why are your headers in this post indented over like that?
Weird
I've been pretty happy with Maruku. Haven't done any benchmarks, but I haven't noticed it being slow, even on large documents. Also supports the PHP-extra extensions, for tables and definition lists, as well as some of its own, which can set html attributes on block-level elements.
Markdown's syntax is good because the plain-text version still looks like plain-text. Not code. I'm sure Simple_markup reduces syntactical ambiguity, but at the cost of making the plaintext much less aesthetic.
Kyle: MIT license.
Cameron: while the initial motivation to write my own markdown-ish processor was to get acceptable performance, I quickly realized that the ability to operate on the parse tree was essential; doing it in OCaml (which runs this site) is much easier than working with
discount's structures in C. A mere "to_html" is not flexible enough, I want to process the parse tree.Andreas, anony: Simple_markup is close enough to Markdown to make the comparison meaningful, I think. The most important deviation is the use of
{{for pre-formatted blocks; other than that, it's pretty much like Markdown, and not much less aesthetic IMO, unless you think that__foo__is much worse than_foo_(markdownrecognizes both IIRC, and my code also used the latter, originally, but I changed to the former after I saw identifiers likesome_varcausing problems in my reddit comments; I often forgot to escape such uses).Why isn't Simple_markup 100% "Markdown compliant" (if such a thing exists, as Markdown is ill-defined and buggy)? Simply because it scratches my personal itch; I just wanted some markup engine for my site and did minor adjustments to the language that made sense to me. I think this is not an apple-to-orange situation because the markup languages share an important subset (pretty much everything but
preblocks and minor differences here and there), and my code could be easily adapted to be fully "Markdown compliant", with no substantial differences in processing speed or code size. It's just that the Simple_markup language works a bit better for me.As for the choice of README.n: it's just the first Markdown document I got hold of when I needed to time the processors. It's maybe a bit more list-heavy than an average document, but not utterly unrealistic --- it's a real document, not a synthetic example.
ProbablyCorey: yup,
rdiscountbuilds ondiscountand is thus fast. The problem I have with it (the reason why I didn't scrap my code) is that it seems to be limited tomarkdown.to_html, and there's no easy way to manipulate the parse tree the way I'm doing here."my code could be easily adapted to be fully 'Markdown compliant', with no substantial differences in processing speed or code size."
OK, I throw down the gauntlet. Get your code to pass the markdown test suite, and then the performance comparisons will be meaningful.
You might be interested in my peg-markdown (written in C). It creates a parse tree and currently supports output in HTML, LaTeX, and groff. New output formats can be added very easily, and it is also easy to modify the markup syntax, since it is based on a PEG grammar.
My first thought was to do it in Ragel, the Grammar -> FSM -> C compiler. It turns out someone did that for RedCloth and it's linked off the Ragel homepage:
http://redhanded.hobix.com/inspect/superredclothAsAChild.html http://www.complang.org/ragel/
Very nice!
ThirteenNoTrump,
Apparently that was merged into RedCloth, as the FAQ says that RedCloth requires Ragel. So part of why BlueCloth is so slow is that the same hasn't been done for it. There are a couple of reasons why a developer might not bother to make a version of BlueCloth that uses ragel, though:
rdiscount is available
ragel requires c extension support, so it won't work on JRuby, and you might as well just use rdiscount
I love the idea of being able to add extensions to code blocks. It's so simple! There's a feature I haven't added to a personal wiki because I couldn't figure out maruku's extension methods. This would take the pain out of it!
Also, I didn't get why you wouldn't support both bold and italics being used at the same time at first. After all, you could always just decide not to use it. Then I realized, however, that it keeps commentors from using it. That is great. If only reddit would limit this abuse and others.
The current version of the Ruby gem RedCloth a textile => html processor which uses Ragel to compile a FSM to C or Java is now maintained by Jason Garber and is located here:
http://github.com/jgarber/redcloth/tree/master
I must be missing something ...?
Which markdown README are you processing in your benchmarks?
The one that comes with the original markdown was too small so I found this somewhat longer one:
http://wpcal.firetree.net/wp-content/plugins/PHP%20Markdown%201.0.1k/PHP%20Markdown%20Readme.text
When I process that readme concatenated 32 times with the version of bluecloth on github:
http://github.com/mislav/bluecloth/tree/master
it takes between 0.03 - 0.04s in both MRI and JRuby
See: http://gist.github.com/91477
This is on a MacBook Pro w/a 2.5 GHz Intel Core Duo 2
Is there a much larger README somewhere else?