NAME
    Class::DBI::utf8 - A Class::DBI subclass that knows about UTF-8

SYNOPSIS
    This module is a Class::DBI plugin:

      package Foo;
      use base qw( Class::DBI );
      use Class::DBI::utf8;

      ...
      __PACKAGE__->columns( All => qw( id text other ) );

      # the text column contains utf8-encoded character data
      __PACKAGE__->utf8_columns(qw( text ));
      ...
  
      # create an object with a nasty character.
      my $foo = Foo->create({
        text => "a \x{2264} b for some a",
      });
  
      # search for utf8 chars.
      Foo->search( text => "a \x{2264} b for some a" );

DESCRIPTION
    Rather than have to think about things like character sets, I prefer to
    have my objects just Do The Right Thing. I also want utf-8 encoded byte
    strings in the database whenever possible. Using this subclass of
    Class::DBI, I can just put perl strings into the properties of an
    object, and the right thing will always go into the database and come
    out again.

    For example, without Class::DBI::utf8,

      MyObject->create({ id => 1, text => "\x{2264}" }); # a less-than-or-equal-to symbol

    ... will create a row in the database containing (probably) the utf-8
    byte encoding of the less-than-or-equal-to symbol. But when trying to
    retrieve the object again ...

      my $broken = MyObject->retrieve( 1 );
      my $text = $broken->text;

    ... $text will (probably) contain 3 characters and look nothing like a
    less-than-or-equal-to symbol. Likewise, you will be unable to search
    properly for strings containing non-ascii characters.
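
    The three-character result can be reproduced without a database. This
    is a minimal sketch, assuming the driver stores and returns raw utf-8
    bytes with no decoding; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

my $text  = "\x{2264}";           # one character: less-than-or-equal-to
my $bytes = encode_utf8($text);   # what a byte-oriented database would store

# Read back without decoding, perl sees one character per byte.
my $char_count_raw = length $bytes;             # 3

# Decoding the bytes restores the original single character.
my $round_trip         = decode_utf8($bytes);
my $char_count_decoded = length $round_trip;    # 1
```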

    Creating objects with simpler non-ascii characters from the latin-1
    range will lead to even stranger behaviours:

      my $e_acute = "\x{e9}"; # an e-acute
      MyObject->create({ text => $e_acute });

      utf8::upgrade($e_acute); # still the same letter, but with a different
                               # internal representation
      MyObject->create({ text => $e_acute });

    This will create two rows in the database. The first contains the
    latin-1 encoded bytes of an e-acute character (or the database may
    refuse to let you create the row, if it's been set up to require utf-8).
    The second contains the utf-8 encoded bytes of an e-acute. If you
    retrieve the second row you won't get an e-acute back out again; you'll
    get a string containing two characters, one for each byte of the utf-8
    encoding.
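
    The difference between the two strings is invisible at the perl level
    and only shows up in the internal byte representation. The following
    sketch uses the core bytes module purely to inspect that
    representation; it makes no claim about what any particular DBD driver
    sends to the database.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

my $plain    = "\x{e9}";    # e-acute, held internally as one byte
my $upgraded = "\x{e9}";
utf8::upgrade($upgraded);   # same character, held internally as two bytes

# Perl compares characters, so the two strings are equal.
my $same_string = ( $plain eq $upgraded ) ? 1 : 0;

# The internal storage differs, which is what a byte-oriented driver may see.
my $plain_bytes    = bytes::length($plain);       # 1
my $upgraded_bytes = bytes::length($upgraded);    # 2
```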

    Because of this, if you're handling data from an outside source, you
    won't really have any clear idea of what will be going into the database
    at all.

    Fortunately, simply adding the lines:

      use Class::DBI::utf8;
      __PACKAGE__->utf8_columns("text");

    will make all these operations work much more as expected - the database
    will always contain utf-8 bytes, you will always get back the characters
    you put in, and you will instantly become the most popular person at
    work.

    This module assumes that the underlying database and driver don't know
    anything about character sets, and just store bytes. Some databases, for
    instance PostgreSQL and later versions of MySQL, allow you to create
    tables with utf-8 character sets, but the Perl DB drivers don't respect
    this. They still require you to pass in utf-8 bytes and still return
    utf-8 bytes, and hence still need special handling with Class::DBI.

    Class::DBI::utf8 will do the right thing in both cases, and I would
    suggest you tell the database to use utf-8 encoding as well as using
    Class::DBI::utf8 where possible.

CAVEATS
    This module requires perl 5.8.0 or later - if you're still using 5.6,
    and you want to use unicode, I suggest you don't. It's not nice.

    Be aware that utf-8 encoded strings will commonly have a byte length
    greater than their character length - this is because non-ascii
    characters such as e-acute will encode to two bytes, and other
    characters can be encoded to other numbers of bytes, although 2 or 3
    bytes are typical. If your database has an underlying data type of a
    limited length, for instance a CHAR(10), you may not be able to store 10
    characters in it.
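
    A quick way to check this in advance is to measure the encoded length.
    The sketch below assumes the column limit is counted in bytes (some
    databases count characters instead); fits_in_bytes is a hypothetical
    helper, not part of this module.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Hypothetical helper: true if $string fits in a column limited to
# $max_bytes bytes once it has been utf-8 encoded.
sub fits_in_bytes {
    my ( $string, $max_bytes ) = @_;
    return length( encode_utf8($string) ) <= $max_bytes;
}

my $ascii  = "abcdefghij";     # 10 characters, 10 bytes
my $accent = "\x{e9}" x 10;    # 10 characters, 20 bytes

my $ascii_fits  = fits_in_bytes( $ascii,  10 ) ? 1 : 0;
my $accent_fits = fits_in_bytes( $accent, 10 ) ? 1 : 0;
```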

    Internally, the module is futzing with the _utf8_on and _utf8_off
    functions from Encode. If you don't know *why* doing that is probably a
    bad idea, you should read into it before you start trying to do this
    sort of thing yourself. I'd prefer to use encode_utf8 and decode_utf8,
    but I have my reasons for doing it this way - mostly, it's so that we
    can allow for DBD drivers that do know about character sets.
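
    To illustrate why flipping the flag by hand is risky, the sketch below
    compares encode_utf8 with _utf8_off on two strings. It only
    demonstrates the general behaviour of the Encode functions and is not
    a description of how this module uses them.

```perl
use strict;
use warnings;
use Encode ();

my $wide   = "\x{2264}";   # held internally as utf-8 by perl
my $narrow = "\x{e9}";     # held internally as a single latin-1 byte

# encode_utf8 always yields the utf-8 encoding of the characters.
my $wide_encoded   = Encode::encode_utf8($wide);      # 3 bytes
my $narrow_encoded = Encode::encode_utf8($narrow);    # 2 bytes

# _utf8_off only clears the flag, so the result depends on the
# internal representation the string happened to have.
my $wide_flagged   = $wide;
my $narrow_flagged = $narrow;
Encode::_utf8_off($wide_flagged);      # 3 bytes, same as encode_utf8
Encode::_utf8_off($narrow_flagged);    # still 1 byte, not valid utf-8
```

    The second case is the trap: a string that was never upgraded
    internally comes out as latin-1 rather than utf-8.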

    Finally, the database may have some internal string-handling functions,
    for instance LOWER(), UPPER(), various sorting functions, etc. *If* the
    database is properly utf-8 aware, it *may* do the right thing to the
    utf-8 encoded strings in the database if you use these functions. But
    I've never seen a database do the right thing. Likewise, there are all
    sorts of nasty normalisation considerations when performing searches
    that are outside of the scope of these docs to discuss, but which can
    really ruin your day.
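
    As one small taste of the normalisation problem, the same visible
    letter can be written as different code point sequences, which will
    not match in a plain string comparison or search. This sketch uses the
    core Unicode::Normalize module; whether to normalise before storing or
    searching is a decision this module leaves to you.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

my $composed   = "\x{e9}";      # e-acute as one code point
my $decomposed = "e\x{301}";    # "e" followed by a combining acute accent

# The raw strings differ, so an exact-match search would miss one of them.
my $equal_raw = ( $composed eq $decomposed ) ? 1 : 0;

# After normalising both to NFC they compare equal.
my $equal_nfc = ( NFC($composed) eq NFC($decomposed) ) ? 1 : 0;
```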

BUGS
    I've attempted to make the module keep doing the Right Thing even when
    the DBD driver for the database knows what it's doing, i.e. if you give
    it sensible perl strings it'll store the right thing in the database and
    recover the right thing from the database. However, I've been forced to
    assume that, in this eventuality, the database driver will hand back
    strings that already have the utf-8 bit set. If they don't, things
    *will* break. On the bright side, they'll break really fast. I also find
    it extremely unlikely that anyone would bother reducing strings to
    latin1 internally.

    Also, I've been forced to override the _do_search method to make
    searching for utf8 strings work, so if you override it locally as well,
    bad things will happen. Sorry.

    Incredible popularity and fame gained through understanding of utf-8 may
    not actually be real.

SEE ALSO
    Class::DBI

AUTHOR
    Tom Insam <tinsam@fotango.com>

    Copyright Fotango 2005. All rights reserved.

    This module is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.