NAME

HTML::Encoding - Determine the encoding of (X)HTML documents

SYNOPSIS

  use HTML::Encoding;
  # ...
  # $r is a response object, e.g. an HTTP::Response
  my $encoding = get_encoding(
    headers       => $r->headers,
    string        => $r->content,
    check_bom     => 1,
    check_xmldecl => 0,
    check_meta    => 1,
  );

DESCRIPTION

This module can be used to determine the encoding of HTML and XHTML files. It reports explicitly declared encoding information, namely:

  • the charset parameter of the HTTP Content-Type header

  • the XML declaration

  • the byte order mark (BOM)

  • the meta element with http-equiv set to Content-Type

get_encoding( %options )

This function takes a hash of configuration options. The following options are available:

string

A string containing the (X)HTML document. The function assumes that any Content-Encoding or Content-Transfer-Encoding has already been removed.

headers

An HTTP::Headers or Mail::Header object from which the Content-Type header is extracted. Please note that LWP::UserAgent by default stores header values from meta elements in the response header. To turn this off, call the $ua->parse_head() method with a false value. get_encoding() always uses only the first Content-Type header given; in most cases this should be the one from the original HTTP header.
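
As an illustration of the first-header rule, the following is a minimal sketch that works on plain header strings rather than an HTTP::Headers object. The helper charset_from_content_type() is hypothetical and not part of HTML::Encoding; the module's own parsing may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Encoding: extract the charset
# parameter from a single Content-Type header value.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;
    # Accept both charset=token and charset="quoted value".
    if ($value =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
        my $charset = defined $1 ? $1 : $2;
        return lc $charset;
    }
    return;
}

# Only the first Content-Type value is considered.
my @content_types = (
    'text/html; charset=ISO-8859-1',   # from the HTTP response
    'text/html; charset=utf-8',        # e.g. copied from a meta element
);
my $charset = charset_from_content_type($content_types[0]);
print defined $charset ? "$charset\n" : "no charset\n";
```

Lower-casing the result is a choice made in this sketch so that later comparisons are case-insensitive.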

check_xmldecl

Checks the document for an XML declaration. If one is found, it tries to extract the value of the encoding pseudo-attribute. Please note that the XML declaration must not be preceded by any character. The default is no.
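
The rule that nothing may precede the XML declaration can be sketched with an anchored match. The helper encoding_from_xmldecl() below is hypothetical and not the module's implementation; it also assumes the document is in an ASCII-compatible encoding.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Encoding: return the value of the
# encoding pseudo-attribute if the string starts with an XML declaration.
sub encoding_from_xmldecl {
    my ($string) = @_;
    # \A anchors at the very start: nothing may precede "<?xml".
    return unless $string =~ /\A<\?xml\s([^>]*?)\?>/s;
    my $pseudo_attrs = $1;
    # The value must be quoted with matching single or double quotes.
    return unless $pseudo_attrs
        =~ /\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/;
    return lc $2;
}

my $doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<html></html>};
my $encoding = encoding_from_xmldecl($doc);
print defined $encoding ? "$encoding\n" : "no XML declaration encoding\n";
```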

check_bom

Checks the document for a byte order mark (BOM). The default is yes; it's always yes if check_xmldecl is set to a true value.
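
A BOM check amounts to comparing the first bytes of the document against known signatures. The sketch below uses a hypothetical helper, encoding_from_bom(), with the signatures defined by the Unicode standard; which of these the module itself recognizes is not stated here, so treat the table as an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Encoding: map a leading byte order
# mark to an encoding name. Expects a byte string, not decoded characters.
sub encoding_from_bom {
    my ($bytes) = @_;
    # Longer signatures first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    my @signatures = (
        [ "\x00\x00\xFE\xFF" => 'utf-32be' ],
        [ "\xFF\xFE\x00\x00" => 'utf-32le' ],
        [ "\xEF\xBB\xBF"     => 'utf-8'    ],
        [ "\xFE\xFF"         => 'utf-16be' ],
        [ "\xFF\xFE"         => 'utf-16le' ],
    );
    for my $signature (@signatures) {
        my ($bom, $name) = @$signature;
        return $name if substr($bytes, 0, length $bom) eq $bom;
    }
    return;
}

my $encoding = encoding_from_bom("\xEF\xBB\xBF<html></html>");
print defined $encoding ? "$encoding\n" : "no BOM\n";
```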

check_meta

Checks the document for a meta element like

  <meta http-equiv='Content-Type'
        content='text/html;charset=iso-8859-1'>

using HTML::Parser 3.21 or later (or does nothing if it fails to load that module). The default is yes.

In list context, get_encoding() returns a list of hash references. Each hash reference consists of two key/value pairs, e.g.

  [
    { source => 4, encoding => 'utf-8' },
    { source => 1, encoding => 'utf-8' }
  ]

The source value is one of the constants FROM_META, FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these constants into your namespace individually or all at once using the :constants tag, e.g.

  use HTML::Encoding ':constants';

In scalar context, it returns the value of the encoding key from the first entry in the list. The list is sorted according to the origin of the encoding information; see the list at the beginning of this document.

If no explicit encoding information is found, it returns undef. It's up to you to implement defaulting behaviour if this is applicable.
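
One possible way to handle defaulting and to inspect the list-context result is sketched below. Both helpers are hypothetical and not part of HTML::Encoding, the sample data merely mirrors the shape of the result shown earlier, and the fallback 'iso-8859-1' is only an example value.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of HTML::Encoding.

# Scalar-context style: use the detected encoding, or fall back to a
# caller-supplied default when nothing was detected (undef).
sub encoding_or_default {
    my ($detected, $default) = @_;
    return defined $detected ? $detected : $default;
}

# List-context style: collect the distinct encoding names, compared
# case-insensitively, in order of first appearance.
sub distinct_encodings {
    my @results = @_;
    my %seen;
    return grep { !$seen{$_}++ } map { lc $_->{encoding} } @results;
}

# Sample data shaped like the documented list-context result.
my @results = (
    { source => 4, encoding => 'utf-8' },
    { source => 1, encoding => 'UTF-8' },
);
my @names = distinct_encodings(@results);
print "sources disagree: @names\n" if @names > 1;
print encoding_or_default($results[0]{encoding}, 'iso-8859-1'), "\n";
```

More than one distinct name means the sources contradict each other, which a caller may want to report rather than silently accept the first entry.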

BUGS

  • The module does not recode the content before passing it to HTML::Parser, which only supports US-ASCII compatible encodings.

SEE ALSO

  • http://www.w3.org/TR/REC-xml-20001006.htm#sec-guessing

  • http://www.w3.org/TR/1999/REC-html401-19991224/charset.html#h-5.2

  • http://www.ietf.org/rfc/rfc2854.txt

  • http://www.ietf.org/rfc/rfc2616.txt

  • RFC 2045 - RFC 2049

COPYRIGHT

Copyright (c) 2001 Björn Höhrmann

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Björn Höhrmann <bjoern@hoehrmann.de>