Years ago I started using Markdown to mark up text on this site. And, as is often the case today, I would include various code snippets in my posts. I wanted those snippets to be highlighted the way I saw them on other sites (mostly Trac instances at the time) and in whatever text editor I happened to be using. As I soon learned, this is a problem that many people have tackled, with varying levels of success, for some time.
Of course, Trac was written in Python and I like working with Python, so I thought I'd dig in and see if I could borrow from their code. It is an open source project after all. Well, after a few false starts trying to piece together their code, I finally came across something (not in the code itself) that told me they were using a third-party package whose name escapes me now. The package had horrible documentation, and a little searching confirmed that most people had as little success getting it to work as I did. I believe Trac has since abandoned that package for a newer one which had not yet been released when this was all going on.
This lack of a good syntax highlighter for Python sent me on a search through various possible solutions. In the course of that search I learned a lot, and many of the opinions discussed below are drawn from that experience. I should also note that the second Python-Markdown extension I ever released [1] was the CodeHilite extension. In its earliest form it called a command line script, which is available on most any Linux distribution, then extracted and slightly modified a snippet of HTML from the returned result to be included in a Markdown document. It worked, but each snippet in a blog post would generate another call to an external process and slow rendering down that much more.
Before finding Pygments, I looked at starting my own library, among other
things. At the time a syntax parser was a bit over
my head, but I did find many examples that highlighted Python code. I even got
my own variation working with little trouble. If I recall correctly, it was the
first backend I built for my extension. However, I abandoned it before public
release, as it is one thing to get Python to tokenize Python code [3], but quite
another to get it to tokenize anything else. At the time I was writing more
about HTML/CSS than about Python, so this was a non-option.
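To give a feel for how little it takes to get Python to highlight Python, here is a minimal sketch built on the standard library's `tokenize` module. It is not the backend I actually wrote; the function name `highlight_python` and the one-letter CSS class names are made up for illustration, and it makes no attempt to handle anything but valid Python source.

```python
import html
import io
import keyword
import tokenize

# Hypothetical CSS class names; use whatever your stylesheet expects.
CSS_CLASSES = {
    tokenize.NUMBER: "m",
    tokenize.STRING: "s",
    tokenize.COMMENT: "c",
    tokenize.OP: "o",
}


def highlight_python(source):
    """Return `source` as HTML with each recognized token wrapped in a span."""
    # Record where each line starts so a (row, col) pair maps to a string index.
    offsets = [0]
    for line in io.StringIO(source):
        offsets.append(offsets[-1] + len(line))

    def index(row, col):
        return offsets[min(row - 1, len(offsets) - 1)] + col

    out = []
    pos = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        start, end = index(*tok.start), index(*tok.end)
        # Copy whitespace (or anything else between tokens) through unchanged.
        if start > pos:
            out.append(html.escape(source[pos:start]))
        text = html.escape(source[max(start, pos):end])
        if tok.type == tokenize.NAME:
            css = "k" if keyword.iskeyword(tok.string) else "n"
        else:
            css = CSS_CLASSES.get(tok.type)
        if css and text:
            out.append('<span class="%s">%s</span>' % (css, text))
        else:
            out.append(text)
        pos = max(pos, end)
    return "".join(out)
```

Calling `highlight_python("def f():\n    pass\n")` wraps `def` and `pass` in keyword spans and leaves the indentation alone. The catch is exactly the one described above: the tokenizer only understands Python, so this gets you nowhere with HTML or CSS snippets.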
Another issue, besides tokenizing the code into its various parts (keywords, variables, integers, strings, comments, etc.), is line numbering. Some may argue that this is an unnecessary feature, but try reading a tutorial as a beginner trying to understand some snippet of code. If the author doesn't have an easy way to indicate which line of the code he's talking about, you could easily get lost. Besides, it makes it easy for commenters to point to the exact error they found in your code (Hey, it's not my fault you didn't test the code you're writing about).
Even today it's not uncommon to find lines of code broken up into table rows:
one row per line of code, with two columns, the first being the line number and
the second containing the actual code. With proper styling it looks nice, but
now try to select the code to copy and paste into your editor. Oops, you got the
line numbers as well. Now you have to go back and delete the line number, the
punctuation following it (usually a period), and any whitespace added in, without
messing up indentation and the like. The same problem presents itself with a
standard ordered list. Fortunately, that was solved a long time ago by
Dan Loda, Simon Willison (both via the Wayback Machine), and
Dunstan Orchard (scroll about half-way down that page). The trick is to style
<ol> so that the line numbers are displayed but don't get copied.
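As a rough sketch of the markup side of that ordered-list idea (my own illustration, not code taken from any of the sites mentioned), something like the following would do. The function name `code_to_ordered_list` and the `code` class name are hypothetical, and the stylesheet is assumed to set `white-space: pre` on the `<code>` elements so indentation survives.

```python
import html


def code_to_ordered_list(source, css_class="code"):
    """Return an <ol> with one <li> per line of `source`.

    The browser draws the numbers as list markers, so they are not part
    of the text content of each line.
    """
    items = [
        "<li><code>%s</code></li>" % html.escape(line)
        for line in source.splitlines()
    ]
    return '<ol class="%s">\n%s\n</ol>' % (css_class, "\n".join(items))
```

Whether the numbers stay out of a copy and paste still depends on the styling and on the browser, so treat this as the starting point for the trick rather than the whole of it.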
Interestingly, modern highlighters such as Pygments still have not solved this problem. Pygments specifically still uses a table, of only one row and two columns. The first column contains all the line numbers, separated by line breaks, in a single cell, and the second column contains all the lines of code, again separated by line breaks in a single cell. Assuming your styling is correct, the line numbers in the first column should line up with the corresponding lines in column two. Unfortunately, I have seen many sites where this is broken. Strangely, it seems to be more of a problem on longer snippets, so the site designer probably didn't catch it on his short test examples while adjusting the styling to fit his site. On some very long snippets, the line numbers actually run out before the lines of code do. While this allows one to select and copy the code within one table cell and avoid getting line numbers, it simply is not an acceptable solution. Someone really should write a new formatter for Pygments that uses the much better ordered list.
The Un-Styled Source
But there is something else interesting about Dunstan Orchard's solution. Even if the line numbers were still a problem when it came to copying and pasting, a link is provided to download each snippet as a separate file in its original plain text form -- no line numbers anywhere. Unfortunately for Dunstan, this means each snippet needs to be in a separate file on his server. His code then parses the post and finds each instance of his custom markup, determines where the source file is, reads it from the file system, and inserts it into his HTML line by line before serving it. Not ideal. Besides, I am working with Markdown, where the snippets are inline with the body of text and simply marked as code blocks according to Markdown's syntax. How am I going to serve each snippet separately, especially when I have multiple snippets in one document? I'm sure it's possible, but not really worth the effort in my opinion.
Consider the project that I believe started life as that wonky dp.SyntaxHighlighter library I spoke of earlier: SyntaxHighlighter. I realize the link goes to an old abandoned version of the project, but go take a look at the sample provided in the summary there. Notice the extra links at the top of the screen capture? If it wasn't just an image, clicking on them would reveal that they are links, one of which gives you the option to "view plain" code. The newer version of the project gives you the same option as a little pop-up when you hover your mouse over the block of code, as does the jQuery plugin adaptation of the library which actually outputs valid HTML. Personally, I like the old styling better, but that should be customizable.
sending the document with the plain code wrapped in
we already have the original plain text version available client-side, it's easy
code. While pop-ups are generally to be avoided, they are only used here when
specifically activated by the end user and they serve a useful purpose in this
context. I would imagine the end user could think that the plain code has
actually been fetched from the server as a separate file like Dunstan Orchard's
site does. Except, the code displays instantaneously without the delay of a
request. And there's no dance involved in including two versions and figuring
out which one to display and which one to hide by default. I suspect that's why
working specifically with Markdown.
The Sad Conclusion
The first extension I released publicly was a simple little thing that converted WikiLinks to links. It simply served as a means to better understand Python-Markdown's extension API. I had actually started the core of my highlighting extension before the wikilink extension was thought of. ↩
For those who don't know, the Python tokenizer is callable from Python code. It will return a list of tokens which you can easily iterate over and build a bunch of appropriately styled spans for syntax highlighted Python code on the web. ↩
Actually that is not entirely true. For example, while Google's own library (google-code-prettify) does not directly support line numbers, they suggest a rather lousy workaround on their FAQs page (second to last FAQ). Ugh. ↩