Some time ago, when I wrote the first version of this post, I thought I had mastered UTF-8/Unicode with Perl and MySQL. Sadly, I was very, very wrong. So I had to revisit the topic, and I’d like to share my findings in the hope that they can save some coders from going nuts.
First you should read “Why does modern Perl avoid UTF-8 by default?” on Stack Overflow, especially the top-voted answer. It is the best resource on UTF-8 and Perl I’ve found so far.
The next stop would be “UTF8, Mysql, Perl and PHP” on gyford.com. Pay special attention to the “utf8::decode( $var ) unless utf8::is_utf8( $var );” part. However, I’d suggest using Encode::decode and Encode::is_utf8 instead. The important lesson to take away here is that you may still need to “decode” the bytes coming from the database into Perl’s internal UTF-8 representation. Once Perl knows it’s dealing with UTF-8, it will probably handle the data correctly. Unfortunately, sometimes the conditional decode doesn’t work … in these cases you can try to decode the data without first checking whether it is already in UTF-8. Brave new world …
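To make the conditional decode concrete, here is a minimal sketch using Encode::decode and Encode::is_utf8. The helper name decode_unless_flagged and the sample byte string are my own inventions for illustration; in real code the value would come from your database handle, and I’m assuming the database hands you UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from any library): decode a value into Perl
# characters unless Perl already flags it as a character string.
sub decode_unless_flagged {
    my ($value) = @_;
    return $value if !defined $value;           # leave NULL columns alone
    return $value if Encode::is_utf8($value);   # already decoded, keep as-is
    return Encode::decode('UTF-8', $value);     # raw bytes -> characters
}

# Stand-in for a database value: "\xC3\xA4" is the two-byte UTF-8
# encoding of U+00E4 (a with diaeresis).
my $bytes = "\xC3\xA4";
my $chars = decode_unless_flagged($bytes);

# prints "bytes: 2, characters: 1"
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

Calling the helper a second time on $chars returns it unchanged, which is the whole point of the check. For the cases where the conditional version doesn’t help, the unconditional variant is the same helper without the Encode::is_utf8 line.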
If you still need more advice I suggest the following links, in this order:
- Checklist for going the Unicode way with Perl