HTML::Encoding - Determine the encoding of (X)HTML documents
use HTML::Encoding;
# ...
my $encoding = get_encoding
    headers       => $r->headers,
    string        => $r->content,
    check_bom     => 1,
    check_xmldecl => 0,
    check_meta    => 1
This module can be used to determine the encoding of HTML and XHTML files. It reports explicitly given encoding information, namely:
the charset parameter of the HTTP Content-Type header
the XML declaration
the byte order mark (BOM)
the meta element with http-equiv set to Content-Type
get_encoding() takes a hash of configuration options as its argument. The following options are available:
string
A string containing the (X)HTML document. The function assumes that any Content-Encoding or Content-Transfer-Encoding that was applied has already been removed.
headers
An HTTP::Headers or Mail::Header object from which the Content-Type header is extracted. Note that LWP::UserAgent by default stores header values taken from meta elements in the response headers. To turn this off, call the $ua->parse_head() method with a false value. get_encoding() always uses only the first Content-Type header given; in most cases this should be the one from the original HTTP header.
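As an illustration of the header-based lookup described above, here is a minimal sketch of pulling the charset parameter out of a Content-Type value. The helper name charset_from_content_type is hypothetical, and the regular expression is an assumption about reasonable parsing, not the module's actual code. Only the first value passed in is considered, mirroring the rule that only the first Content-Type header is used.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Encoding): extract the charset
# parameter from the first Content-Type header value, or return undef.
sub charset_from_content_type {
    my ($first_value) = @_;    # any further values are ignored
    return unless defined $first_value;

    # The parameter value may be a bare token or a quoted string.
    if ($first_value =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
        return defined $1 ? $1 : $2;
    }
    return;
}

my $charset = charset_from_content_type('text/html; charset=ISO-8859-1');
print defined $charset ? $charset : '(none)', "\n";
```

The value is returned exactly as written in the header; normalising case or resolving aliases is left to the caller.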
check_xmldecl
Checks the document for an XML declaration. If one is found, the function tries to extract the value of the encoding pseudo-attribute. Note that the XML declaration must not be preceded by any character. The default is no.
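The XML declaration check can be pictured with the following sketch. It assumes the document is in an US-ASCII compatible encoding and that any byte order mark has already been stripped. The helper name encoding_from_xmldecl and the pattern are illustrative assumptions, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Encoding): return the value of the
# encoding pseudo-attribute of a leading XML declaration, or undef.
sub encoding_from_xmldecl {
    my ($doc) = @_;

    # \A anchors the match at the very first character, so a declaration
    # preceded by anything (even whitespace) is not recognised.
    if ($doc =~ /\A<\?xml\s+version\s*=\s*(["'])[^"']*\1(?:\s+encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\2)?/) {
        return $3;    # undef when the declaration has no encoding
    }
    return;
}

my $enc = encoding_from_xmldecl(q{<?xml version="1.0" encoding="iso-8859-1"?><html/>});
print defined $enc ? $enc : '(none)', "\n";
```

A declaration without an encoding pseudo-attribute, or one that does not start at the first character, yields undef in this sketch.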
check_bom
Checks the document for a byte order mark (BOM). The default is yes, and the check is always performed if check_xmldecl is set to a true value.
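For readers unfamiliar with byte order marks, the sketch below shows one way such a check could work on the raw bytes of a document. The helper name encoding_from_bom, the set of marks tested, and the encoding names returned are all assumptions for illustration; the module may recognise a different set or report different names.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Encoding): map a leading byte
# order mark to an encoding name, or return undef if there is none.
sub encoding_from_bom {
    my ($bytes) = @_;

    # Longer marks come first because the UTF-32LE mark begins with the
    # same two bytes as the UTF-16LE mark.
    my @boms = (
        [ "\x00\x00\xFE\xFF" => 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00" => 'UTF-32LE' ],
        [ "\xEF\xBB\xBF"     => 'UTF-8'    ],
        [ "\xFE\xFF"         => 'UTF-16BE' ],
        [ "\xFF\xFE"         => 'UTF-16LE' ],
    );

    for my $bom (@boms) {
        my ($mark, $name) = @$bom;
        return $name if substr($bytes, 0, length $mark) eq $mark;
    }
    return;    # no byte order mark found
}

my $enc = encoding_from_bom("\xEF\xBB\xBF<html></html>");
print defined $enc ? $enc : '(none)', "\n";
```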
check_meta
Checks the document for a meta element such as
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
using HTML::Parser 3.21 or later. If that module cannot be loaded, the check is skipped. The default is yes.
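A meta element lookup built on HTML::Parser might look roughly like the sketch below. It uses the version 3 handler interface of HTML::Parser; the helper name charset_from_meta and the charset pattern are hypothetical, and this is not claimed to be how the module itself does it.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not part of HTML::Encoding): return the charset named
# in the first <meta http-equiv="Content-Type" ...> element, or undef.
sub charset_from_meta {
    my ($html) = @_;
    my $charset;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;
                return if defined $charset;    # keep only the first match
                return unless $tag eq 'meta';
                return unless lc($attr->{'http-equiv'} // '') eq 'content-type';
                if (($attr->{content} // '') =~ /charset\s*=\s*["']?([^\s;"']+)/i) {
                    $charset = $1;
                }
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return $charset;
}

my $enc = charset_from_meta(
    q{<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>}
);
print defined $enc ? $enc : '(none)', "\n";
```

Because the parser sees the document as is, this only works when the markup is readable in an US-ASCII compatible encoding.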
In list context, get_encoding() returns a list of hash references. Each hash reference holds two key/value pairs, for example:
[ { source => 4, encoding => 'utf-8' }, { source => 1, encoding => 'utf-8' } ]
The source value is one of the constants FROM_META, FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these constants into your namespace individually or all at once with the :constants tag, for example:
use HTML::Encoding ':constants';
In scalar context, it returns the value of the encoding key from the first entry in that list. The list is sorted by the origin of the encoding information; see the list of sources at the beginning of this document.
If no explicit encoding information is found, it returns undef. It is up to you to implement defaulting behaviour where applicable.
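Defaulting could be layered on top of the list-context result along the following lines. The helper encoding_or_default is hypothetical, the sample data is the example list shown earlier in this document, and the fallback value 'iso-8859-1' is an arbitrary choice for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Encoding): given a default and the
# list of hash references described above, pick the first entry's encoding
# or fall back to the default when the list is empty.
sub encoding_or_default {
    my ($default, @found) = @_;
    return @found ? $found[0]{encoding} : $default;
}

# Sample data copied from the example result in this document.
my @found = (
    { source => 4, encoding => 'utf-8' },
    { source => 1, encoding => 'utf-8' },
);

print encoding_or_default('iso-8859-1', @found), "\n";    # prints "utf-8"
print encoding_or_default('iso-8859-1'), "\n";            # prints "iso-8859-1"
```

Which default is appropriate depends on where the document came from, so the sketch leaves that decision to the caller.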
The module does not recode the content before passing it to HTML::Parser, which supports only US-ASCII compatible encodings.
HTML::Parser
http://www.w3.org/TR/REC-xml-20001006.htm#sec-guessing
http://www.w3.org/TR/1999/REC-html401-19991224/charset.html#h-5.2
http://www.ietf.org/rfc/rfc2854.txt
http://www.ietf.org/rfc/rfc2616.txt
RFC 2045 - RFC 2049
Copyright (c) 2001 Björn Höhrmann
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Björn Höhrmann <bjoern@hoehrmann.de>