brian d foy > Unicode-Tussle-1.08 > uniquote

NAME

uniquote - escape special characters using various quoting conventions

SYNOPSIS

uniquote [options] [ textfile ... ]

 Standard options:

    --version       print version information and exit
    --help          this message
    --man           full manpage
    --debug         add some debugging output

 Character mode options:

            Without a specified encoding, utf8 is assumed
            unless file has encoding extension.

    --verbose   -v  show full character names like \N{EN DASH}

    --hex       -x  use singleton \x{...} escapes instead of \N{U+XXX}

    --encoding  -E  specify encoding for all input files

    --html      -H  show HTML entities (add --verbose for names)
    --xml       -X  show XML entities

 Binary mode options:

    --bytes     -b  binary file in hex
    --octal     -0  binary file in octal

 Other options:

    --endings       -n   place $ at EOL so trailing spaces visible
    --backslash     -t   use backslash escapes for unprintable ASCII
    --fix-newlines  -l   consider any Unicode linebreak sequence as EOL
    --unbuffer      -u   flush each output line

DESCRIPTION

The uniquote program is meant as a Unicode-aware replacement for programs like od(1) and cat -v. It converts ASCII control codes and all non-ASCII code points into a quoted form such as one might use in a Perl literal.

Use --endings or -e to behave like cat -e and add a dollar sign at the end of each line so trailing spaces become apparent.

Use --backslash or -t to show tabs and other ASCII control codes as backslash escapes.
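As a rough illustration of what these two options do to a line, here is a minimal Python sketch. It is not uniquote's Perl implementation. The names show_line and BACKSLASH are hypothetical, and the escape table holds only \t and \0 because those are the only backslash escapes visible in the sample output in EXAMPLES.

```python
# Hypothetical escape table: only \t and \0 appear in the sample output,
# so nothing else is assumed here.
BACKSLASH = {"\t": r"\t", "\0": r"\0"}

def show_line(line, endings=False, backslash=False):
    """Spell known control codes as escapes and/or mark the end of line with '$'."""
    body = line.rstrip("\n")
    if backslash:
        body = "".join(BACKSLASH.get(ch, ch) for ch in body)
    if endings:
        body += "$"   # makes trailing spaces visible
    return body

print(show_line("ascii:\tnayeeve fassodd  \n", endings=True, backslash=True))
```

The two trailing spaces in the input end up directly in front of the dollar sign, which is the point of the option.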

By default, uniquote converts each such code point into the form \N{U+hex}, making code point 962 appear as \N{U+3C2}. The --hex option instead shows eligible points in backslash-X notation, so code point 962 would be displayed as \x{3C2}.
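The two notations can be sketched in Python as follows. This is an approximation written for illustration, not the program's own code. The function name quote is hypothetical, the two-digit minimum width for the default form is inferred from sample output such as \N{U+09}, and the treatment of printable ASCII and newline as pass-through is an assumption.

```python
def quote(text, hex_style=False):
    """Escape everything except printable ASCII and newline."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7E or ch == "\n":
            out.append(ch)                    # printable ASCII passes through
        elif hex_style:
            out.append("\\x{%X}" % cp)        # --hex style
        else:
            out.append("\\N{U+%02X}" % cp)    # default style, at least two hex digits
    return "".join(out)

print(quote("\u03c2"))                    # \N{U+3C2}
print(quote("\u03c2", hex_style=True))    # \x{3C2}
print(quote("na\u00efve\tfa\u00e7ade"))   # na\N{U+EF}ve\N{U+09}fa\N{U+E7}ade
```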

The --verbose option instead displays eligible code points by name. Code point 962 would then be shown as \N{GREEK SMALL LETTER FINAL SIGMA}.
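A comparable lookup can be sketched with Python's unicodedata module. This is only an illustration: the function name quote_named is hypothetical, and Python's name table has no entries for control characters, so the sketch falls back to the \N{U+hex} form where the sample output in EXAMPLES shows names such as CHARACTER TABULATION.

```python
import unicodedata

def quote_named(text):
    """Replace non-ASCII and control characters with \\N{NAME} where a name is known."""
    out = []
    for ch in text:
        if " " <= ch <= "~" or ch == "\n":
            out.append(ch)
            continue
        name = unicodedata.name(ch, None)   # None when Python has no name for ch
        out.append("\\N{%s}" % name if name else "\\N{U+%02X}" % ord(ch))
    return "".join(out)

print(quote_named("\u03c2"))   # \N{GREEK SMALL LETTER FINAL SIGMA}
print(quote_named("na\u00efve"))
```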

The --xml and --html options show code points using numeric entities. Adding --verbose to --html will use named HTML entities where available.
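The three entity forms can be approximated in Python as shown here. The hexadecimal form for XML and the decimal form for HTML follow the sample output in EXAMPLES. The function names are hypothetical, the named entities come from Python's html.entities table rather than whatever table uniquote uses, and leaving printable ASCII such as & and < untouched is an assumption.

```python
from html.entities import codepoint2name

def entity(ch, style="xml", verbose=False):
    """Return one entity: &#xhex; for XML, &#dec; or &name; for HTML."""
    cp = ord(ch)
    if style == "xml":
        return "&#x%x;" % cp
    if verbose and cp in codepoint2name:
        return "&%s;" % codepoint2name[cp]
    return "&#%d;" % cp

def quote_entities(text, style="xml", verbose=False):
    """Leave printable ASCII and newline alone, turn the rest into entities."""
    return "".join(
        ch if " " <= ch <= "~" or ch == "\n" else entity(ch, style, verbose)
        for ch in text
    )

print(quote_entities("na\u00efve", "xml"))                  # na&#xef;ve
print(quote_entities("na\u00efve", "html"))                 # na&#239;ve
print(quote_entities("na\u00efve", "html", verbose=True))   # na&iuml;ve
```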

Character Modes vs Binary Mode

To treat the file as a sequence of bytes, use --binary. This displays bytes other than printable ASCII escaped in the form \xXX. The other way to specify binary input uses the --octal option.
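A minimal Python sketch of the hexadecimal byte display, written to match the sample --binary output in EXAMPLES, where printable ASCII bytes appear unescaped. The function name quote_bytes is hypothetical, and passing the newline byte through unchanged is an assumption.

```python
def quote_bytes(data):
    """Show printable ASCII bytes as-is and every other byte as \\xXX."""
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E or b == 0x0A else "\\x%02X" % b
        for b in data
    )

print(quote_bytes("utf8:\tna\u00efve".encode("utf-8")))      # utf8:\x09na\xC3\xAFve
print(quote_bytes("latin1:\tna\u00efve".encode("latin1")))   # latin1:\x09na\xEFve
```

No decoding happens here, which is why the same character shows up as one escaped byte in Latin-1 and two in UTF-8.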

If you have not specified binary mode, then you are in character mode. The default encoding in character mode is not ASCII but UTF-8. If you have not specified an encoding with --encoding, but the filename ends with the name of an encoding that Perl recognizes, that encoding will be assumed.
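The selection order described above can be sketched in Python. This is an assumption-laden illustration: guess_encoding is a hypothetical name, Python's codec registry stands in for the set of encodings that Perl recognizes, and the two sets are not identical.

```python
import codecs
import os

def guess_encoding(filename, explicit=None, default="utf-8"):
    """Explicit encoding wins, then a recognized filename suffix, then the default."""
    if explicit:
        return codecs.lookup(explicit).name
    suffix = os.path.splitext(filename)[1].lstrip(".")
    if suffix:
        try:
            return codecs.lookup(suffix).name   # e.g. "macroman", "utf16"
        except LookupError:
            pass                                # suffix is not an encoding name
    return codecs.lookup(default).name

print(guess_encoding("/tmp/nf.macroman"))
print(guess_encoding("/tmp/notes.txt"))
print(guess_encoding("/tmp/nf.utf8", explicit="latin1"))
```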

Note that no matter the actual input character encoding, each displayed code point is the Unicode number of that character, not its byte value in the input encoding. You can use this property to normalize input, or to check that you actually know a file's encoding. For example, you can test the same file with various 8-bit encodings like Latin1, MacRoman, and CP1252.
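The same check can be tried by hand in Python. The sketch below decodes the MacRoman bytes from the sample file under three candidate encodings and lists the resulting non-ASCII code points. The function name try_encodings is hypothetical, and Python's codec tables are used in place of Perl's Encode tables.

```python
def try_encodings(raw, encodings):
    """Map each encoding to the non-ASCII code points it yields, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None   # the bytes are not valid in this encoding
            continue
        results[enc] = ["U+%04X" % ord(ch) for ch in text if ord(ch) > 0x7E]
    return results

raw = b"na\x95ve fa\x8dade"   # the accented words as stored in /tmp/nf.macroman
for enc, found in try_encodings(raw, ["mac_roman", "latin1", "cp1252"]).items():
    print(enc, found)
```

Only the MacRoman reading yields U+00EF and U+00E7. Latin-1 yields C1 control codes, and CP1252 has no mapping for byte 8D at all.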

The default input encoding is actually utf8; that is, Perl's permissive version of UTF-8. If you want strict UTF-8, override it.
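To give a feel for the difference between a strict and a permissive decoder, here is a Python analogy. It is only an analogy: Python has no direct counterpart to Perl's permissive utf8, and the surrogatepass error handler used here relaxes just one of the strict rules (encoded surrogates). The name decodes_strictly is hypothetical.

```python
encoded_surrogate = b"\xed\xa0\x80"   # U+D800 written out UTF-8 style; invalid in strict UTF-8

def decodes_strictly(data):
    """Return True if data is acceptable to Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A permissive decode accepts the same bytes that the strict one rejects.
lax = encoded_surrogate.decode("utf-8", errors="surrogatepass")

print(decodes_strictly(encoded_surrogate))   # False
print("U+%04X" % ord(lax))                   # U+D800
```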

EXAMPLES

  $ perl -E 'say "ascii:\tnayeeve fassodd"'                                                     > /tmp/nf.ascii
  $ perl -E 'binmode(STDOUT, "encoding(macroman)")||die; say "macroman:\tna\xEFve fa\xE7ade"'   > /tmp/nf.macroman
  $ perl -E 'binmode(STDOUT, "encoding(utf8)")||die;     say "utf8:\tna\xEFve fa\xE7ade"'       > /tmp/nf.utf8
  $ perl -E 'binmode(STDOUT, "encoding(utf16)")||die;    say "utf16:\tna\xEFve fa\xE7ade"'      > /tmp/nf.utf16
  $ perl -E 'binmode(STDOUT, "encoding(utf32)")||die;    say "utf32:\tna\xEFve fa\xE7ade"'      > /tmp/nf.utf32
  $ perl -E 'binmode(STDOUT, "encoding(latin1)")||die;   say "latin1:\tna\xEFve fa\xE7ade"'     > /tmp/nf.latin1
  $ perl -E 'binmode(STDOUT, "encoding(cp1252)")||die;   say "cp1252:\tna\xEFve fa\xE7ade"'     > /tmp/nf.cp1252


  $ wc -c /tmp/nf*
      23 /tmp/nf.ascii
      21 /tmp/nf.cp1252
      21 /tmp/nf.latin1
      23 /tmp/nf.macroman
      42 /tmp/nf.utf16
      84 /tmp/nf.utf32
      21 /tmp/nf.utf8
     235 total
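The byte counts above follow directly from the encodings, which the following Python sketch recomputes. The names FILES and byte_count are hypothetical. Python's utf-16 and utf-32 codecs write a byte order mark in native byte order, so the bytes may be ordered differently from the Perl output, but the lengths agree.

```python
ACCENTED = "na\u00efve fa\u00e7ade"

# (label written into the file, Python codec name, text after the tab)
FILES = [
    ("ascii", "ascii", "nayeeve fassodd"),
    ("cp1252", "cp1252", ACCENTED),
    ("latin1", "latin1", ACCENTED),
    ("macroman", "mac_roman", ACCENTED),
    ("utf16", "utf-16", ACCENTED),    # 2 bytes per character plus a 2-byte BOM
    ("utf32", "utf-32", ACCENTED),    # 4 bytes per character plus a 4-byte BOM
    ("utf8", "utf-8", ACCENTED),      # the two accented letters take 2 bytes each
]

def byte_count(label, codec, text):
    """Size in bytes of 'label:<TAB>text<NEWLINE>' in the given encoding."""
    return len(f"{label}:\t{text}\n".encode(codec))

total = 0
for label, codec, text in FILES:
    n = byte_count(label, codec, text)
    total += n
    print(f"{n:8d} /tmp/nf.{label}")
print(f"{total:8d} total")
```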

  $ uniquote /tmp/nf.*
ascii:\N{U+09}nayeeve fassodd
cp1252:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
latin1:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
macroman:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
utf16:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
utf32:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
utf8:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade

  $ uniquote --backslash --endings /tmp/nf.*
ascii:\tnayeeve fassodd$
cp1252:\tna\N{U+EF}ve fa\N{U+E7}ade$
latin1:\tna\N{U+EF}ve fa\N{U+E7}ade$
macroman:\tna\N{U+EF}ve fa\N{U+E7}ade$
utf16:\tna\N{U+EF}ve fa\N{U+E7}ade$
utf32:\tna\N{U+EF}ve fa\N{U+E7}ade$
utf8:\tna\N{U+EF}ve fa\N{U+E7}ade$

  $ uniquote --verbose /tmp/nf.*
ascii:\N{CHARACTER TABULATION}nayeeve fassodd
cp1252:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
latin1:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
macroman:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
utf16:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
utf32:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
utf8:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade

  $ uniquote --binary /tmp/nf.*
ascii:\x09nayeeve fassodd
cp1252:\x09na\xEFve fa\xE7ade
latin1:\x09na\xEFve fa\xE7ade
macroman:\x09na\x95ve fa\x8Dade
\xFE\xFF\x00u\x00t\x00f\x001\x006\x00:\x00\x09\x00n\x00a\x00\xEF\x00v\x00e\x00 \x00f\x00a\x00\xE7\x00a\x00d\x00e\x00
\x00\x00\xFE\xFF\x00\x00\x00u\x00\x00\x00t\x00\x00\x00f\x00\x00\x003\x00\x00\x002\x00\x00\x00:\x00\x00\x00\x09\x00\x00\x00n\x00\x00\x00a\x00\x00\x00\xEF\x00\x00\x00v\x00\x00\x00e\x00\x00\x00 \x00\x00\x00f\x00\x00\x00a\x00\x00\x00\xE7\x00\x00\x00a\x00\x00\x00d\x00\x00\x00e\x00\x00\x00
utf8:\x09na\xC3\xAFve fa\xC3\xA7ade

  $ uniquote --xml /tmp/nf.*
ascii:&#x9;nayeeve fassodd
cp1252:&#x9;na&#xef;ve fa&#xe7;ade
latin1:&#x9;na&#xef;ve fa&#xe7;ade
macroman:&#x9;na&#xef;ve fa&#xe7;ade
utf16:&#x9;na&#xef;ve fa&#xe7;ade
utf32:&#x9;na&#xef;ve fa&#xe7;ade
utf8:&#x9;na&#xef;ve fa&#xe7;ade

  $ uniquote --html /tmp/nf.*
ascii:&#9;nayeeve fassodd
cp1252:&#9;na&#239;ve fa&#231;ade
latin1:&#9;na&#239;ve fa&#231;ade
macroman:&#9;na&#239;ve fa&#231;ade
utf16:&#9;na&#239;ve fa&#231;ade
utf32:&#9;na&#239;ve fa&#231;ade
utf8:&#9;na&#239;ve fa&#231;ade

  $ uniquote --html --verbose /tmp/nf.*
ascii:&#9;nayeeve fassodd
cp1252:&#9;na&iuml;ve fa&ccedil;ade
latin1:&#9;na&iuml;ve fa&ccedil;ade
macroman:&#9;na&iuml;ve fa&ccedil;ade
utf16:&#9;na&iuml;ve fa&ccedil;ade
utf32:&#9;na&iuml;ve fa&ccedil;ade
utf8:&#9;na&iuml;ve fa&ccedil;ade

  $ uniquote --backslash --encoding latin1   --verbose /tmp/nf.*
ascii:\tnayeeve fassodd
cp1252:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
latin1:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
macroman:\tna\N{MESSAGE WAITING}ve fa\N{REVERSE LINE FEED}ade
\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0u\0t\0f\01\06\0:\0\t\0n\0a\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0v\0e\0 \0f\0a\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0a\0d\0e\0
\0\0\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0\0\0u\0\0\0t\0\0\0f\0\0\03\0\0\02\0\0\0:\0\0\0\t\0\0\0n\0\0\0a\0\0\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0\0\0v\0\0\0e\0\0\0 \0\0\0f\0\0\0a\0\0\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0\0\0a\0\0\0d\0\0\0e\0\0\0
utf8:\tna\N{LATIN CAPITAL LETTER A WITH TILDE}\N{MACRON}ve fa\N{LATIN CAPITAL LETTER A WITH TILDE}\N{SECTION SIGN}ade

  $ uniquote --backslash --encoding cp1252   --verbose /tmp/nf.*
ascii:\tnayeeve fassodd
uniquote: cp1252 "\x8D" does not map to Unicode at /tmp/nf.macroman line 0
cp1252:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
latin1:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0u\0t\0f\01\06\0:\0\t\0n\0a\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0v\0e\0 \0f\0a\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0a\0d\0e\0
\0\0\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0\0\0u\0\0\0t\0\0\0f\0\0\03\0\0\02\0\0\0:\0\0\0\t\0\0\0n\0\0\0a\0\0\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0\0\0v\0\0\0e\0\0\0 \0\0\0f\0\0\0a\0\0\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0\0\0a\0\0\0d\0\0\0e\0\0\0
utf8:\tna\N{LATIN CAPITAL LETTER A WITH TILDE}\N{MACRON}ve fa\N{LATIN CAPITAL LETTER A WITH TILDE}\N{SECTION SIGN}ade

  $ uniquote --backslash --encoding macroman --verbose /tmp/nf.*
ascii:\tnayeeve fassodd
cp1252:\tna\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}ve fa\N{LATIN CAPITAL LETTER A WITH ACUTE}ade
latin1:\tna\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}ve fa\N{LATIN CAPITAL LETTER A WITH ACUTE}ade
macroman:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
\N{OGONEK}\N{CARON}\0u\0t\0f\01\06\0:\0\t\0n\0a\0\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}\0v\0e\0 \0f\0a\0\N{LATIN CAPITAL LETTER A WITH ACUTE}\0a\0d\0e\0
\0\0\N{OGONEK}\N{CARON}\0\0\0u\0\0\0t\0\0\0f\0\0\03\0\0\02\0\0\0:\0\0\0\t\0\0\0n\0\0\0a\0\0\0\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}\0\0\0v\0\0\0e\0\0\0 \0\0\0f\0\0\0a\0\0\0\N{LATIN CAPITAL LETTER A WITH ACUTE}\0\0\0a\0\0\0d\0\0\0e\0\0\0
utf8:\tna\N{SQUARE ROOT}\N{LATIN CAPITAL LETTER O WITH STROKE}ve fa\N{SQUARE ROOT}\N{LATIN SMALL LETTER SHARP S}ade
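The odd names on the utf8 lines of the last three runs come from reading each two-byte UTF-8 sequence as two separate 8-bit characters. A small Python sketch reproduces the effect. The function name misread is hypothetical, and Python's codec tables stand in for Perl's.

```python
import unicodedata

def misread(text, wrong_encoding):
    """Encode as UTF-8, then decode those bytes with the wrong 8-bit encoding."""
    garbled = text.encode("utf-8").decode(wrong_encoding)
    return [unicodedata.name(ch) for ch in garbled]

print(misread("\u00ef", "latin1"))      # i with diaeresis read as Latin-1
print(misread("\u00e7", "latin1"))      # c with cedilla read as Latin-1
print(misread("\u00ef", "mac_roman"))   # i with diaeresis read as MacRoman
print(misread("\u00e7", "mac_roman"))   # c with cedilla read as MacRoman
```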

ERRORS

Exits 0 if all is well, 1 otherwise.

Errors include inaccessible files, bogus encodings, and contents that do not match a specified encoding.

BUGS

Good question.

SEE ALSO

od(1), cat(1), Encode(3)

HISTORY

First public release February 27, 2011.

AUTHOR

Tom Christiansen <tchrist@perl.com>

COPYRIGHT AND LICENCE

Copyright 2010 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.
