Module Version: v0.4.8   Source   Latest Release: EBook-Tools-v0.5.4

NAME

EBook::Tools::Mobipocket - Palm::PDB handler for manipulating the Mobipocket format.

SYNOPSIS

 use EBook::Tools::Mobipocket qw(:all);
 my $mobi = EBook::Tools::Mobipocket->new();
 $mobi->Load('filename.prc');
 print "Title: ",$mobi->{title},"\n";
 print "Author: ",$mobi->{header}{exth}{author},"\n";
 print "Language: ",$mobi->{header}{mobi}{language},"\n";

 my $mobigen = find_mobigen();
 system_mobigen('myfile.opf');

DEPENDENCIES

CONSTRUCTOR

new()

Instantiates a new EBook::Tools::Mobipocket object.

ACCESSOR METHODS

drm()

Returns 1 if the drmoffset header value is neither 0 nor 0xffffffff. Returns undef if drmoffset is undefined. Returns 0 otherwise.
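The three-way return value can be illustrated with a minimal stand-alone sketch. The sub name drm_status is hypothetical and is not part of this module; it only mirrors the rule described above, taking the parsed drmoffset value as its argument.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the rule described for drm().
# $drmoffset stands in for the parsed drmoffset header value.
sub drm_status {
    my ($drmoffset) = @_;
    return unless defined $drmoffset;      # header too short to hold the field
    return 0 if $drmoffset == 0;           # value reported for samples
    return 0 if $drmoffset == 0xffffffff;  # value reported for normal books
    return 1;                              # any other value is treated as DRM
}

for my $value (undef, 0, 0xffffffff, 1234) {
    my $status = drm_status($value);
    printf "%s => %s\n",
        defined $value  ? $value  : 'undef',
        defined $status ? $status : 'undef';
}
```

Callers should test the result with defined() before treating it as a boolean, since undef and 0 mean different things here.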

text()

Returns the text of the file.

write_images()

Writes each image record to the disk.

Returns the number of images written.

write_text($filename)

Writes the book text to disk with the given filename. This filename must match the filename given to "fix_html()" for the internal links to be consistent.

Croaks if $filename is not specified.

Returns 1 on success, or undef if there was no text to write.

write_unknown_records()

Writes each unidentified record to disk with a filename in the format of 'raw-record-####', where #### is the record number (not the record ID).

Returns the number of records written.
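As a rough illustration of the filename format, the sketch below builds such a name from a record number. The helper name raw_record_filename is hypothetical, and zero-padding to four digits is an assumption read from the '####' placeholder rather than something confirmed by the module source.

```perl
use strict;
use warnings;

# Hypothetical helper: build the output name for an unidentified record.
# Assumes the record number is zero-padded to four digits.
sub raw_record_filename {
    my ($recordnumber) = @_;
    return sprintf('raw-record-%04d', $recordnumber);
}

print raw_record_filename(7), "\n";
```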

MODIFIER METHODS

These methods have two naming/capitalization schemes -- methods directly related to the subclassing of Palm::PDB use its MethodName capitalization style. Any other methods are lowercase_with_underscores for consistency with the rest of EBook::Tools.

Load($filename)

Sets $self->{filename} and then loads and parses the file specified by $filename, calling "ParseRecord(%record)" on every record found.

If DictionaryHuffman compression is detected, text records will be left untouched during the ParseRecord pass, and "uncompress_dictionaryhuffman_records()" will be called after the initial parsing pass is complete.

ParseRecord(%record)

Parses PDB records, updating the object attributes. This method is called automatically on every database record during Load().

ParseRecord0($data)

Parses the header record and places the parsed values into the hashref $self->{header}{palm}, the hashref $self->{header}{mobi}, and $self->{header}{exth} by calling "parse_palmdoc_header()", "parse_mobi_header()", and "parse_mobi_exth()" respectively.

ParseRecordCDIC(\$data)

Parses a CDIC record. Takes as a sole argument a reference to the data of the record.

Record format

ParseRecordHUFF(\$data)

Parses a HUFF record. Takes as a sole argument a reference to the data of the record.

Record format

ParseRecordImage(\$dataref)

Parses image records, updating object attributes, most notably adding the image data to the hash $self->{imagedata}, adding the image filename to $self->{recindexlinks}, and incrementing $self->{recindex}.

Takes as an argument a reference to the record data. Croaks if it isn't provided, or isn't a reference.

This is called automatically by "ParseRecord()" and "ParseResource()" as needed.

ParseRecordText(\$dataref)

Parses text records, updating object attributes, most notably appending text to $self->{text}. Takes as an argument a reference to the record data.

This is called automatically by "ParseRecord()" and "ParseResource()" as needed.

fix_html(%args)

Takes raw Mobipocket text and replaces the custom tags and file position anchors.

Arguments

fix_html_filepos()

Takes the raw HTML text of the object and replaces the filepos anchors. This has to be called before any other action that modifies the text, or the filepos positions will not be valid.

Returns 1 if successful, undef if there was no text to fix.

This is called automatically by "fix_html()".

uncompress_dictionaryhuffman_records()

Uncompresses all text records using "uncompress_dictionaryhuffman()". This destroys the existing contents of $self->{text} if any.

This method is called automatically at the end of Load() if DictionaryHuffman encoding is detected.

PROCEDURES

All procedures are exportable, but none are exported by default. All procedures can be exported by using the ":all" tag.

find_mobidedrm()

Attempts to locate a copy of the MobiDeDrm script by searching PATH and looking in the EBook::Tools user configuration directory (see "userconfigdir()" in EBook::Tools).

Returns the complete path to the script, or undef if nothing was found.

This will use package variable $mobidedrm_cmd as its first guess, and set that variable to the return value as well.

find_mobigen()

Attempts to locate the mobigen executable by making a test execution on predicted locations (including just checking PATH) and looking in the EBook::Tools user configuration directory (see "userconfigdir()" in EBook::Tools).

Returns the system command used for a successful invocation, or undef if nothing worked.

This will use package variable $mobigen_cmd as its first guess, and set that variable to the return value as well.

parse_mobi_exth($headerdata)

Takes as an argument a scalar containing the variable-length Mobipocket EXTH data from the first record. Returns an array of hashes, each hash containing the data from one EXTH record with values from that data keyed to recognizable names.

If $headerdata doesn't appear to be an EXTH header, carps a warning and returns an empty list.

See:

http://wiki.mobileread.com/wiki/MOBI

Hash keys

parse_mobi_header($headerdata)

Takes as an argument a scalar containing the variable-length Mobipocket-specific header data from the first record. Returns a hash containing values from that data keyed to recognizable names.

See:

http://wiki.mobileread.com/wiki/MOBI

keys

The returned hash will have the following keys (documented in the order in which they are encountered in the header):

identifier

This should always be the string 'MOBI'. If it isn't, the procedure croaks.

headerlength

This is the size of the complete header. If this value is different from the length of the argument, the procedure croaks.

type

A numeric code indicating what category of Mobipocket file this is.

encoding

A numeric code representing the encoding. Expected values are '1252' (for Windows-1252) and '65001' (for UTF-8).

The procedure carps a warning if an unexpected value is encountered.

uniqueid

This is thought to be a unique ID for the book, but its actual use is unknown.

Use with caution. This key may be renamed in the future if more information is found.

version

This is thought to be the Mobipocket format version. A second version code shows up again later as version2, which is usually the same on unprotected books but different on DRMd books.

Use with caution. This key may be renamed in the future if more information is found.

reserved

40 bytes of reserved data.

Use with caution. This key may be renamed in the future if more information is found.

indxrecord

This is thought to be the record offset to the first 'INDX' record, so named for its first four letters.

Use with caution. This key may be renamed in the future if more information is found.

titleoffset

Offset in record 0 (not from start of file) of the full title of the book.

titlelength

Length in bytes of the full title of the book.

languageunknown

16 bits of unknown data thought to be related to the book language.

Use with caution. This key may be renamed in the future if more information is found.

language

A pseudo-IANA language code string representing the main book language (i.e. the value of <dc:language>). See %mobilangcodes for an exact map of raw values to this string and notes on non-compliant results.

dilanguageunknown

16 bits of unknown data thought to be related to the dictionary input language.

Use with caution. This key may be renamed in the future if more information is found.

dilanguage

A pseudo-IANA language code string for the DictionaryInLanguage element. See %mobilangcodes for an exact map of raw values to this string and notes on non-compliant results.

dolanguageunknown

16 bits of unknown data thought to be related to the dictionary output language.

Use with caution. This key may be renamed in the future if more information is found.

dolanguage

A pseudo-IANA language code string for the DictionaryOutLanguage element. See %mobilangcodes for an exact map of raw values to this string and notes on non-compliant results.

version2

This is another Mobipocket format version related to DRM. If no DRM is present, it should be the same as version.

Use with caution. This key may be renamed in the future if more information is found.

firstimagerecord

This is thought to be an index to the first record containing image data. If there are no images in the book, this value will be 4294967295 (0xffffffff).

Use with caution. This key may be renamed in the future if more information is found.

huffrecord

This is thought to be the record offset to the 'HUFF' record, used in HUFF/CDIC decompression.

Use with caution. This key may be renamed in the future if more information is found.

huffreccnt

This is thought to be the number of HUFF and CDIC records, starting at huffrecord.

Use with caution. This key may be renamed in the future if more information is found.

datprecord

This is thought to be the record offset to the first 'DATP' record, so named for its first four letters.

Use with caution. This key may be renamed in the future if more information is found.

datpreccnt

This is thought to be the number of 'DATP' records present.

Use with caution. This key may be renamed in the future if more information is found.

exthflags

A 32-bit bitfield related to the Mobipocket EXTH data. If bit 6 (0x40) is set, then there is at least one EXTH record.

unknown116

36 bytes of unknown data at offset 116. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

drmoffset

A number thought to be the byte offset inside of the record 0 data in which DRM data can be found. If present and no DRM is set, contains either the value 0xFFFFFFFF (normal books) or 0x00000000 (samples). This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

drmcount

A number thought to be related to DRM.

This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

drmsize

A number thought to be the size of the data in bytes after drmoffset containing DRM keys.

This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

drmflags

A number thought to be related to DRM.

This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown168

32 bits of unknown data at offset 168, usually zeroes. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown172

32 bits of unknown data at offset 172, usually zeroes. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown176

16 bits of unknown data at offset 176. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

lastimagerecord

This is thought to be an index to the last record containing image data. If there are no images in the book, this value will be 65535 (0xffff).

Use with caution. This key may be renamed in the future if more information is found.

unknown180

32 bits of unknown data at offset 180. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

fcisrecord

This is thought to be an index to a 'FCIS' record, so named because those are always the first four characters when the record data is decompressed using uncompress_palmdoc().

This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown188

32 bits of unknown data at offset 188. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

flisrecord

This is thought to be an index to a 'FLIS' record, so named because those are always the first four characters when the record data is decompressed using uncompress_palmdoc().

This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown196

32 bits of unknown data at offset 196. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

unknown200

Unknown data of unknown length running to the end of the header. This value will be undefined if the header data was not long enough to contain it.

Use with caution. This key may be renamed in the future if more information is found.

extradataflags

Two bytes sometimes found inside of unknown200, used to determine if extra data has been appended to each text record that should not be used in decompression.
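The checks described for identifier, headerlength, encoding, and exthflags can be sketched as follows. This is a simplified illustration, not the module's implementation: the sub names parse_mobi_header_prefix and has_exth are hypothetical, and the field widths (a 4-byte identifier followed by 32-bit big-endian integers) are assumptions. Only the first four fields are parsed, so the synthetic header used here is just 16 bytes long.

```perl
use strict;
use warnings;
use Carp;

# Illustrative only: parse the first four fields of a MOBI header.
# Assumed layout: 4-byte identifier, then 32-bit big-endian integers.
sub parse_mobi_header_prefix {
    my ($headerdata) = @_;
    croak("no header data provided") unless defined $headerdata;
    croak("header data too short") if length($headerdata) < 16;

    my %header;
    @header{qw(identifier headerlength type encoding)}
        = unpack('a4 N N N', $headerdata);

    croak("identifier is '$header{identifier}', expected 'MOBI'")
        unless $header{identifier} eq 'MOBI';
    croak("header claims $header{headerlength} bytes, got "
          . length($headerdata))
        unless $header{headerlength} == length($headerdata);

    # Warn, but continue, on an encoding other than the two expected ones.
    my %known_encoding = (1252 => 'Windows-1252', 65001 => 'UTF-8');
    carp("unexpected encoding $header{encoding}")
        unless exists $known_encoding{ $header{encoding} };

    return %header;
}

# Hypothetical helper: true if bit 6 (0x40) of exthflags is set,
# meaning at least one EXTH record is present.
sub has_exth {
    my ($exthflags) = @_;
    return ($exthflags & 0x40) ? 1 : 0;
}

# Build a synthetic 16-byte header to exercise the sketch.
my $data = pack('a4 N N N', 'MOBI', 16, 2, 65001);
my %h = parse_mobi_header_prefix($data);
print "encoding: $h{encoding}\n";
print "has EXTH: ", has_exth(0x50), "\n";
```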

parse_mobi_language($languagecode, $regioncode)

Takes the integer values $languagecode and $regioncode unpacked from the Mobipocket header and returns a language string mostly (but not entirely) conformant to the IANA language subtag registry codes.

Croaks if $languagecode is not provided. If $languagecode is not recognized, a warning is carped and the sub returns undef. Note that 0,0 is a recognized code returning an empty string.

If $regioncode is not provided or not recognized, it is disregarded and the base language string (with no region or script) is returned.

See %mobilangcodes for an exact map of values. Note that the bottom two bits of the region code appear to be unused (i.e. the values are all multiples of 4).
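The lookup and fallback rules can be sketched with a tiny stand-in table. Everything named here is hypothetical: %langmap_sketch is an illustrative two-entry table and not the module's real map, and storing the base language under region 0 is an assumption made for this example.

```perl
use strict;
use warnings;
use Carp;

# Illustrative table only, NOT the module's real map.
# Outer key: language code; inner key: region code (0 holds the base language).
my %langmap_sketch = (
    0 => { 0 => '' },
    9 => { 0 => 'en', 4 => 'en-us', 8 => 'en-gb' },
);

sub parse_language_sketch {
    my ($languagecode, $regioncode) = @_;
    croak("language code not provided") unless defined $languagecode;

    my $regions = $langmap_sketch{$languagecode};
    unless ($regions) {
        carp("unrecognized language code $languagecode");
        return;
    }

    # Fall back to the base language when the region is missing or unknown.
    return $regions->{0}
        unless defined $regioncode && exists $regions->{$regioncode};
    return $regions->{$regioncode};
}

print parse_language_sketch(9, 4), "\n";    # region recognized
print parse_language_sketch(9, 252), "\n";  # region unknown, base language
```

Because 0,0 maps to an empty string, callers need defined() rather than a truth test to tell a recognized code from an unrecognized one.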

pid_append_checksum($pid)

Computes the Mobipocket PID checksum used as the final two bytes of the PID and appends them to $pid, returning the merged string.

Used by "pid_is_valid($pid)".

pid_is_valid($pid)

Returns 1 if the PID is a valid Mobipocket/Kindle PID and 0 otherwise.

This is determined by first ensuring that $pid is exactly ten bytes long, and then stripping the final two bytes normally used as a checksum and recomputing them, returning 1 only if they are recomputed correctly.

pukall_cipher_1(%args)

This is a COMPLETELY UNTESTED implementation of the Pukall Cipher 1 algorithm used for encryption and decryption in Mobipocket files. It is a 128-bit stream cipher. For more information and alternate implementations, see http://membres.lycos.fr/pc1/.

Use at your own risk. Bug reports appreciated.

Arguments

record_extradata_size(%args)

This checks the end of a text record for extra data that should not be made part of decompression and returns the total size of all data fields.

Arguments

system_mobidedrm(%args)

Runs python on a copy of MobiDeDrm.py if it is available (not included with this distribution) to downconvert a Mobipocket file.

Returns the output filename on success, or undef otherwise.

Arguments

system_mobigen(%args)

Runs mobigen to convert OPF, HTML, or ePub input into a Mobipocket .prc/.mobi book. The procedure find_mobigen() is called to locate the executable.

Returns the return value from mobigen, or undef if no filename was specified or the file did not exist. Also returns undef if mobigen could not be found.

Arguments

uncompress_dictionaryhuffman(%args)

Uncompresses text compressed with the DictionaryHuffman compression scheme.

Arguments

unpack_mobi_language($data)

Takes as an argument 4 bytes of data. If fewer than 4 bytes are provided, the sub croaks. If more are provided, a debug warning is issued, but the sub continues.

In scalar context returns a language string mostly (but not entirely) conformant to the IANA language subtag registry codes.

In list context, returns the language string, an unknown code integer, a region code integer, and a language code integer, with the last three being directly unpacked values.

See %mobilangcodes for an exact map of values. Note that the bottom two bits of the region code appear to be unused (i.e. the values are all multiples of 4). The unknown code integer appears to be unused, and is generally zero.

The original implementation by Mobipocket may have been via Microsoft's .NET CultureInfo class. See: http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo(VS.71).aspx
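A minimal sketch of the unpacking step, under stated assumptions: the helper name unpack_language_codes is hypothetical, and the byte layout (a 16-bit big-endian unknown value, then one region byte, then one language byte) is inferred from the descriptions in this document rather than taken from the module source. Mapping the integers to a language string is left out.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: split 4 bytes of language data into raw integers.
# Assumed layout: 16-bit big-endian unknown value, region byte, language byte.
sub unpack_language_codes {
    my ($data) = @_;
    croak("4 bytes of language data required")
        unless defined $data && length($data) >= 4;
    my ($unknowncode, $regioncode, $languagecode) = unpack('n C C', $data);
    return ($unknowncode, $regioncode, $languagecode);
}

my ($unknown, $region, $language) = unpack_language_codes("\x00\x00\x04\x09");
print "unknown=$unknown region=$region language=$language\n";
# prints "unknown=0 region=4 language=9"
```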

BUGS AND LIMITATIONS

AUTHOR

Zed Pobre <zed@debian.org>

LICENSE AND COPYRIGHT

Copyright 2008 Zed Pobre

Licensed to the public under the terms of the GNU GPL, version 2
