Tuesday, October 27, 2009

Generating HTML color syntax highlighting from PRE tag

I'm working on a Django template filter that takes code embedded in a PRE tag and converts it to nicely formatted HTML with syntax coloring. To test it, I've inserted a block of code here: the same code I wrote to do the conversion.


#!/usr/bin/python

from django import template
from django.template.defaultfilters import stringfilter
from django.utils.safestring import mark_safe
from BeautifulSoup import BeautifulSoup
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name, guess_lexer, TextLexer

register = template.Library()

@stringfilter
def tocode(value):
    """Replace each PRE block in value with Pygments-highlighted HTML."""
    try:
        soup = BeautifulSoup(value)
        for pre in soup.findAll('pre'):
            # Turn <br> tags back into newlines before extracting the text.
            for br in pre.findAll('br'):
                br.replaceWith('\n')
            joined = ''.join(pre.findAll(text=True))

            if pre.has_key('class'):
                # Use the class name verbatim as the lexer name.
                lex = get_lexer_by_name(pre['class'], stripall=True)
            else:
                try:
                    lex = guess_lexer(joined)
                except Exception:
                    # Fall back to plain text when no lexer can be guessed.
                    lex = TextLexer()
            formatter = HtmlFormatter(linenos=True, cssclass="source")
            result = highlight(joined, lex, formatter)
            pre.replaceWith(result)

        return mark_safe(unicode(soup))

    except Exception:
        # On any failure, return the input unchanged.
        return value

register.filter('tocode', tocode)


This is how it works: you pull a feed from, for example, Blogger, and look for PRE tags, assuming that there is something interesting (like a code snippet) inside. After discovering that Pygments' guess_lexer has a hard time identifying most of the snippets I feed it, I made it possible to specify the PRE content type explicitly by tagging the PRE element with a class name. The class name is passed verbatim to the get_lexer_by_name function. So this...

<pre class="python" >

...will look up the python lexer and

<pre class="php" >...

will look up the php lexer.
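
To make the "look for PRE tags and read the class name" step concrete, here is a minimal sketch. It is not the filter above: it swaps BeautifulSoup for the Python 3 standard library's html.parser, and the names PreCollector and find_pre_blocks are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class PreCollector(HTMLParser):
    """Collect (class_name, text) pairs for every PRE element."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished (class_name, text) pairs
        self._class = None   # class attribute of the PRE being read
        self._parts = None   # text pieces of the PRE being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self._class = dict(attrs).get('class')
            self._parts = []
        elif tag == 'br' and self._parts is not None:
            # Inside a PRE, a <br> stands for a line break in the code.
            self._parts.append('\n')

    def handle_endtag(self, tag):
        if tag == 'pre' and self._parts is not None:
            self.blocks.append((self._class, ''.join(self._parts)))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def find_pre_blocks(html):
    """Return a list of (class_name, text) for each PRE in html."""
    parser = PreCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

A PRE without a class attribute comes back with None as its class name, which is the case where a default lexer has to be chosen.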

The PHP lexer is very disappointing actually.

Originally, I set the code to use the TextLexer when the PRE class attribute was not present, but this was boring. I found that the Python lexer produces more appealing results for almost all snippets, so now I use it as the default when the class attribute is not specified. Of course, not all content can be lexed as Python, so in case of an exception I fall back to the TextLexer. The exception handling chain is a bit ugly, but it gets the job done.
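
The lexer choice described here can be sketched roughly as follows. This is one possible reading, not the exact filter code: pick_lexer and render are hypothetical helper names, and I am assuming that the failure worth catching is an unknown lexer name (Pygments raises ClassNotFound for those).

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer, TextLexer, get_lexer_by_name
from pygments.util import ClassNotFound


def pick_lexer(class_name=None):
    """Choose a lexer from an optional PRE class name (hypothetical helper)."""
    if not class_name:
        # No class attribute: default to the Python lexer.
        return PythonLexer(stripall=True)
    try:
        # Class attribute present: use it verbatim as the lexer name.
        return get_lexer_by_name(class_name, stripall=True)
    except ClassNotFound:
        # Assumption: an unknown name is the failure that triggers the fallback.
        return TextLexer(stripall=True)


def render(code, class_name=None):
    """Highlight code as HTML inside a wrapper with CSS class "source"."""
    formatter = HtmlFormatter(cssclass="source")
    return highlight(code, pick_lexer(class_name), formatter)
```

Catching only ClassNotFound, rather than every exception, keeps the fallback from hiding unrelated bugs, which is one way to make the chain a little less ugly.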
