NAME

Text::GuessEncoding - Convert Text from almost any encoding to ASCII or UTF8

VERSION

Version 0.07

SYNOPSIS

CAUTION: unfinished code. No objects created.

Text::GuessEncoding gathers statistics about typical and invalid codes in both Latin1 and UTF-8. The notion of 'typical' currently reflects a European point of view. Based on these statistics, methods to transform to Latin1 and UTF-8 are provided. These methods handle 'broken' input strings with mixed encodings well.

For to_ascii(), the input string may or may not have its utf8 flag set correctly; the flag is ignored. The returned string always has the utf8 flag off and contains no characters above codepoint 127 (that is, it lies entirely within the ASCII character set).

If called in list context, to_ascii() returns the mapping table as a second value. This mapping table is a hash, using all recognized encodings as keys. (Any well-formed string should only have one encoding, but one can never be sure.) The value per encoding is an array ref, listing all the codepoints in the following form:

  [ [ $codepoint, $replacement_bytecount, [ $offset, ... ] ], ... ]

Offset positions refer to the output string, where byte counts are identical to character counts.

Example:

  my $guess = Text::GuessEncoding->new();
  ($ascii, $map) = $guess->to_ascii("J\x{fc}rgen \x{c3}\x{bc}\n");
  # $ascii = 'Juergen ue';
  # $map = { 'utf8' => [ [252, 2, [8]] ], 'latin1' => [ [252, 2, [1]] ] };

The input string contains both a UTF-8 encoded u-umlaut and a plain Latin1 u-umlaut byte. The output string is never flagged as utf8.
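The mapping table can be walked with two nested loops. The following is only a sketch: it assumes the nested form given in the description above ([ [ $codepoint, $replacement_bytecount, [ $offset, ... ] ], ... ]), uses hard-coded sample data instead of calling the module, and describe_map() is a hypothetical helper name, not part of Text::GuessEncoding.

```perl
use strict;
use warnings;

# Sample data in the ASSUMED nested form from the description;
# nothing here calls Text::GuessEncoding itself.
my $ascii = 'Juergen ue';
my $map   = {
    utf8   => [ [ 252, 2, [8] ] ],
    latin1 => [ [ 252, 2, [1] ] ],
};

# Hypothetical helper: one report line per replaced character.
sub describe_map {
    my ($text, $table) = @_;
    my @lines;
    for my $enc (sort keys %$table) {
        for my $entry (@{ $table->{$enc} }) {
            my ($codepoint, $count, $offsets) = @$entry;
            for my $offset (@$offsets) {
                # $count bytes at $offset in the output hold the replacement.
                push @lines, sprintf "%s: U+%04X became '%s' at offset %d",
                    $enc, $codepoint, substr($text, $offset, $count), $offset;
            }
        }
    }
    return @lines;
}

print "$_\n" for describe_map($ascii, $map);
# latin1: U+00FC became 'ue' at offset 1
# utf8: U+00FC became 'ue' at offset 8
```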

  ($utf8, $map) = $guess->to_utf8("J\x{fc}rgen \x{c3}\x{bc}\n");
  # $utf8 = 'J\N{U+fc}rgen \N{U+fc}';
  # $map = { 'utf8' => [7], 'latin1' => [1] };

to_utf8() returns a simpler mapping table, as the string preserves more information. Note that the offsets differ from those of to_ascii(), as no multi-character rewriting takes place. The output string is always flagged as utf8.
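One way to picture the handling of mixed input is a scanner that accepts well-formed UTF-8 sequences where it finds them and treats every other high byte as Latin1. The sketch below illustrates that idea only; it is not the module's implementation (which is unfinished), it ignores the statistics mentioned above, and mixed_bytes_to_chars() and $utf8_seq are hypothetical names.

```perl
use strict;
use warnings;
use Encode ();

# Well-formed multi-byte UTF-8 sequences (no overlongs, no surrogates).
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper, NOT part of Text::GuessEncoding.
# Expects a byte string; returns a character string and, in list
# context, a hash of output offsets per guessed encoding.
sub mixed_bytes_to_chars {
    my ($bytes) = @_;
    my $out = '';
    my %map = (utf8 => [], latin1 => []);
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        if ($bytes =~ /\G([\x00-\x7F]+)/gc) {
            $out .= $1;                          # plain ASCII run
        }
        elsif ($bytes =~ /\G($utf8_seq)/gc) {
            push @{ $map{utf8} }, length $out;
            $out .= Encode::decode('UTF-8', "$1");
        }
        else {
            $bytes =~ /\G(.)/gcs;                # stray high byte
            push @{ $map{latin1} }, length $out;
            $out .= $1;    # Latin1 byte value equals its code point
        }
    }
    return wantarray ? ($out, \%map) : $out;
}

my ($text, $map) = mixed_bytes_to_chars("J\x{fc}rgen \x{c3}\x{bc}\n");
# $text is "J\x{fc}rgen \x{fc}\n"; $map is { utf8 => [7], latin1 => [1] }
```

A scanner like this cannot tell a genuine Latin1 pair such as "\x{c3}\x{bc}" from a UTF-8 u-umlaut; it always prefers the UTF-8 reading, and resolving that ambiguity is presumably where the statistics would come in.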

    use Text::GuessEncoding;

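    # Scalar context: only the converted text is returned.
    my $asciitext = Text::GuessEncoding::to_ascii($enctext);
    # List context: the mapping table comes back as a second value.
    ($asciitext, my $mapping) = Text::GuessEncoding::to_ascii($enctext);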

EXPORT

to_ascii() - create plain text in 7-bit ASCII encoding.

to_utf8() - return UTF-8 encoded text.

SUBROUTINES/METHODS

to_ascii

to_ascii() is implemented in Perl code as a post-processor of to_utf8(). It examines charnames::viacode($_) for each codepoint and constructs useful ASCII replacements from the character names. A number of frequently used codepoint values can be precompiled for speed.
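As an illustration of the name-based approach, the sketch below derives a replacement from the Unicode character name and consults a small lookup table first. The helper ascii_replacement(), the %precompiled table, and the rule of appending "e" for a diaeresis (a German-style transcription) are assumptions made for this example, not the module's actual rules.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical table of precompiled replacements for frequent codepoints.
my %precompiled = (
    0xDF => 'ss',    # LATIN SMALL LETTER SHARP S
    0xFC => 'ue',    # LATIN SMALL LETTER U WITH DIAERESIS
);

# Hypothetical helper, not part of Text::GuessEncoding.
sub ascii_replacement {
    my ($cp) = @_;
    return chr($cp) if $cp < 128;                       # already ASCII
    return $precompiled{$cp} if exists $precompiled{$cp};

    my $name = charnames::viacode($cp);
    return '?' unless defined $name;

    # e.g. "LATIN CAPITAL LETTER A WITH DIAERESIS"
    if ($name =~ /^LATIN (SMALL|CAPITAL) LETTER ([A-Z])(?: WITH (.+))?$/) {
        my $letter = $1 eq 'SMALL' ? lc($2) : $2;
        my $mark   = defined $3 ? $3 : '';
        # Assumed rule: diaeresis becomes a trailing "e"; other marks are dropped.
        return $mark eq 'DIAERESIS' ? $letter . 'e' : $letter;
    }
    return '?';                                         # no rule for this name
}

print join('', map { ascii_replacement(ord $_) } split //, "J\x{fc}rgen"), "\n";
# prints "Juergen"
```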

sysread_tout(FILE, $len, $tout)

Attempts to read $len bytes from FILE, with a select() timeout of $tout.
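A read with a select() timeout usually follows the pattern below. This is a minimal sketch under assumptions: the name sysread_timeout_sketch() is hypothetical, the timeout is taken to be in seconds (as Perl's four-argument select() expects), and the return convention (data on success, undef on timeout or error) is chosen for the example rather than taken from the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for sysread_tout(); the return convention is assumed.
sub sysread_timeout_sketch {
    my ($fh, $len, $timeout) = @_;

    # Build the bit vector select() expects, with only our handle set.
    my $rin = '';
    vec($rin, fileno($fh), 1) = 1;

    # Wait until the handle is readable or the timeout (seconds) expires.
    my $nfound = select(my $rout = $rin, undef, undef, $timeout);
    return unless $nfound > 0;          # timeout (0) or error (-1)

    my $n = sysread($fh, my $buf, $len);
    return unless defined $n;           # read error
    return $buf;                        # may be shorter than $len
}
```

For example, with a pipe holding three bytes, sysread_timeout_sketch($reader, 10, 0.5) returns those three bytes at once, while a second call on the now empty pipe returns undef after the timeout.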

$orig_mode = tty_raw(FILE)

Uses POSIX::Termios to set the terminal FILE to raw mode. Returns the previous mode so that it can be restored with tty_set().

tty_set(FILE, $tty_mode)

Sets FILE, which is assumed to be a terminal, to mode $tty_mode.
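Saving and restoring a terminal mode with POSIX::Termios could look roughly like this. It is a sketch, not the module's code: the names tty_raw_sketch() and tty_set_sketch() are hypothetical, the "mode" is represented here as a saved POSIX::Termios object (the module may well use something else), and "raw" is reduced to switching off echo and line buffering.

```perl
use strict;
use warnings;
use POSIX qw(:termios_h);

# Hypothetical stand-in for tty_raw(): returns the previous settings as a
# POSIX::Termios object, or undef if the handle is not a terminal.
sub tty_raw_sketch {
    my ($fh) = @_;
    my $fd = fileno($fh);

    my $orig = POSIX::Termios->new;
    $orig->getattr($fd) or return;      # fails on non-terminals

    my $raw = POSIX::Termios->new;
    $raw->getattr($fd) or return;
    $raw->setlflag($raw->getlflag & ~(ECHO | ICANON));   # no echo, no line mode
    $raw->setcc(VMIN, 1);               # read() returns after one byte
    $raw->setcc(VTIME, 0);              # no inter-byte timer
    $raw->setattr($fd, TCSANOW) or return;

    return $orig;
}

# Hypothetical stand-in for tty_set(): re-applies previously saved settings.
sub tty_set_sketch {
    my ($fh, $mode) = @_;
    return $mode->setattr(fileno($fh), TCSANOW);
}
```

Typical use is to keep the value returned by the first call and pass it to the second one once the raw-mode work is done, so that the terminal is left the way it was found.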

get_cursor_pos

Sets the tty to raw mode by calling tty_raw(), flushes STDIN, sends the ANSI code 'ESC [ 6 n' (DSR, Device Status Report, which asks the terminal to report the cursor position), and attempts to read the returned position with a 100 ms timeout. The terminal is then returned to its previous mode, and a hashref containing the keys x and y is returned.
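A terminal answers the 'ESC [ 6 n' request with a cursor position report of the form 'ESC [ row ; column R'. Parsing that reply might be sketched as follows; parse_cursor_report() is a hypothetical helper, and mapping the column to x and the row to y (with the terminal's 1-based values left untouched) is an assumption about what get_cursor_pos() returns.

```perl
use strict;
use warnings;

# Hypothetical helper: turns a cursor position report into { x => ..., y => ... }.
# Returns undef if the reply is missing or not a position report.
sub parse_cursor_report {
    my ($reply) = @_;
    return unless defined $reply;
    return unless $reply =~ /\e\[(\d+);(\d+)R/;
    return { y => $1, x => $2 };        # row first, then column
}

my $pos = parse_cursor_report("\e[12;40R");
printf "column %d, row %d\n", $pos->{x}, $pos->{y};
# prints "column 40, row 12"
```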

probe_tty

Prints a carriage return (no linefeed) to move the cursor to a defined column position. Prints a few test characters to STDOUT and calls get_cursor_pos() to check how the terminal reacts to each, e.g. by (not) advancing the cursor one or multiple positions. Then restores the cursor to the carriage return position.

probe_file

probe_file() contains all the material from which to_utf8() should be built.

utf8toascii converts a well-formed UTF-8 file into an ASCII transcript.

    #! /usr/bin/perl -w -T
    #
    # utf8toascii -- convert a well formed utf8 file into ascii transcript.
    #
    # 2010-01-09, jw -- initial draft.
    #
    # perl -e 'use Encode; use charnames ":full"; map { print charnames::viacode($_). "\n" } unpack "W*", Encode::decode_utf8("J\x{c3}\x{bc}rgen fl\x{ef}\x{ac}\x{82} ü\x{c2}\x{a8}\n")'
    #
    #use charnames ":full";
    #
    #my $ofd = \*STDOUT;
    #my $ifd = \*STDIN;
    #
    #print STDERR utf8toascii($ifd, $ofd);
    #exit 0;
    #############

AUTHOR

Juergen Weigert, <jw at suse.de>

BUGS

Please report any bugs or feature requests to bug-text-toascii at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-GuessEncoding. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Text::GuessEncoding

You can also look for information at:

RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-GuessEncoding

AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Text-GuessEncoding

CPAN Ratings

http://cpanratings.perl.org/d/Text-GuessEncoding

Search CPAN

http://search.cpan.org/dist/Text-GuessEncoding/

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.
