Saturday, October 20, 2007

How to get character encoding right in Smarty

Smarty is a great tool, but like its underlying PHP, has no native understanding of internationalization. One aspect of it that's easy to get wrong is character output encoding. Fortunately it's also fairly easy to get right.

HTTP user agents send a request header named "Accept-Charset". This header has a really stupid format, the parsing of which I'll address another time. For now, let's start from the point where you know what encoding the client prefers. Let's say it's something really weird like BIG-5. All your Chinese Smarty templates have text stored in UTF-8 because you're a good programmer. How do you get those lovely UTF-8 templates mangled into something this silly UA likes?

First, if you're even reading this post, you're probably aware of PHP's mbstring library. This library offers a lovely function called mb_convert_encoding. The basic idea is to attach an output filter to Smarty that runs the template output through mb_convert_encoding. Here's how this looks:

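// This sets an internal property in the PHP instance. Of course, this should be set to whatever the UA wants, within reason.
mb_http_output( "BIG-5" );

// Smarty output filter: converts the rendered template from UTF-8 to the current output encoding.
// The source encoding is passed explicitly so the result does not depend on mbstring's internal encoding setting.
function convertEncoding( $templateOutput, &$smarty )
{
    return mb_convert_encoding( $templateOutput, mb_http_output(), "UTF-8" );
}

$smarty->register_outputfilter( "convertEncoding" );

// You must send headers so the browser knows that you gave it what it wanted
header( "Content-Type: text/html; charset=" . mb_http_output() );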

Now, when you call display() or fetch() on that Smarty instance, all its output will be converted to the encoding currently returned by mb_http_output() - even if the file was already cached!
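If you want to see what the filter does to the bytes without wiring up Smarty at all, here is a minimal sketch. It assumes the mbstring extension is loaded; the helper name convertOutput and the sample string are made up for illustration and are not part of Smarty or of the code above.

```php
<?php
// Hypothetical helper doing the same job as the output filter:
// convert UTF-8 text to a target encoding, naming the source encoding explicitly.
function convertOutput(string $templateOutput, string $targetEncoding): string
{
    return mb_convert_encoding($templateOutput, $targetEncoding, "UTF-8");
}

// Stand-in for rendered template output (UTF-8 source text).
$utf8 = "中文";

$big5 = convertOutput($utf8, "BIG-5");

// Converting back lets us check that nothing was lost on the way.
$roundTrip = mb_convert_encoding($big5, "UTF-8", "BIG-5");

var_dump(strlen($utf8));        // byte length of the UTF-8 text
var_dump(strlen($big5));        // byte length after conversion
var_dump($roundTrip === $utf8); // true when every character survived
```

The byte lengths differ because the two encodings use different numbers of bytes per character; the round trip only succeeds for characters that exist in both encodings.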

This approach guarantees a few nice things:
  • It does not depend on output buffering
  • It is insulated from changes in php.ini
  • Only templates have their encoding converted
  • The encoding can be converted on a per-request basis
So go get this right, and stop sending everybody ISO-8859-1.
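As for the "within reason" part and the per-request conversion: one way to handle it is an allow-list. This is only a sketch, written for PHP 7 or later; the function chooseOutputEncoding, the allow-list contents, and the UTF-8 fallback are all assumptions of mine, and the requested name is presumed to come from whatever Accept-Charset parsing you already have.

```php
<?php
// Hypothetical helper: return the requested encoding only if it is on the
// allow-list (compared case-insensitively); otherwise use the fallback.
function chooseOutputEncoding(string $requested, array $allowed, string $fallback = "UTF-8"): string
{
    foreach ($allowed as $name) {
        if (strcasecmp($requested, $name) === 0) {
            return $name; // return the spelling from the allow-list
        }
    }
    return $fallback;
}

// Assumed allow-list: encodings this site is willing to serve.
$allowed = ["UTF-8", "BIG-5", "ISO-8859-1"];

// "big-5" stands in for the value extracted from the request.
$encoding = chooseOutputEncoding("big-5", $allowed);

// Set it once per request; the output filter and the header() call read it back.
mb_http_output($encoding);
```

Falling back to UTF-8 keeps an unknown or hostile charset name from ever reaching mb_http_output().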


Trasher said...

Nice article. But I can't see the advantage. Converting from ISO-8859-1 to UTF-8 the way you do, just changes the data format, but doesn't expand the available characters. For example, escaping dynamic content for HTML like {$content|escape:'HTML'} will still not be UTF-8 compliant. Actually, smarty will in this case not display anything, if $content contains UTF-8 chars.

Going a bit further, converting a page to UTF-8 using your method results in HTML forms being submitted with UTF-8 encoding. The server-side script will likely not be able to handle that, or at least Smarty will not be able to display it properly, because of the escape:'html' problem described above.

In fact, supporting UTF-8 is much more than just converting the template output. You should add a bold WARNING to your article and tell readers about these disadvantages before they mess up their sites with lots of subtle errors.

jonathan said...

Thanks for the feedback. I'm making some really key assumptions here that I didn't explain fully - it was 3 years ago, I was a newbie. :)

1. I'm assuming that all your templates are encoded in UTF-8.
2. I'm assuming all your database fields are stored in UTF-8 (character set and collation).
3. I'm assuming you're taking all the POSTed form data and converting it to UTF-8 before storing it in the database.
4. I'm assuming all your localized string files - whatever mechanism you're using - are encoded in UTF-8.
5. I'm assuming you're not using Smarty's built-in escaping mechanism since as you pointed out, it doesn't support UTF-8.
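Assumptions 3 and 5 can be sketched in plain PHP. This is a rough illustration, not a drop-in fix: the helper names toUtf8 and escapeUtf8 are invented here, and the encoding of the submitted data is assumed to be known already.

```php
<?php
// Hypothetical helper for assumption 3: normalize a submitted value to UTF-8
// before it is stored. $submittedEncoding is assumed to be known.
function toUtf8(string $value, string $submittedEncoding): string
{
    return mb_convert_encoding($value, "UTF-8", $submittedEncoding);
}

// Hypothetical helper for assumption 5: HTML-escape dynamic content while
// telling PHP that the input is UTF-8.
function escapeUtf8(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, "UTF-8");
}

// "café" as ISO-8859-1 bytes, standing in for a POSTed form field.
$latin1 = "caf\xE9";
$stored = toUtf8($latin1, "ISO-8859-1");

echo escapeUtf8("<b>" . $stored . "</b>"), "\n";
```

Escaping happens on the UTF-8 data before the output filter runs, so the filter still sees well-formed UTF-8.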

The only - and very smallish - problem addressed here is how to get all this lovely UTF-8 data served to a client that doesn't understand it, say one that only understands ISO-8859-1. Well, you can give it ISO-8859-1, and not give it a bunch of characters it doesn't understand, by having them removed (or substituted) in the encoding conversion.