Changes for version 0.4

  • fixed double utf-8 encoding bug

Documentation

HTML to Alvis XML converter
export XML content along the Alvis pipeline
joins several Alvis XML files into one Alvis XML file
merges Alvis XML files from an input directory with Alvis XML nodes from an extra directory or file
splits a big file into pieces in a directory for easier processing
adds relevance scores for categories to an Alvis version of a Wikipedia dump
HTML to Alvis XML converter
HTML to plain text converter
news XML to Alvis XML converter
Wikipedia XML dump to Alvis XML converter

Modules

Perl extension for buffering utilities for the Alvis pipeline
Perl extension for converting documents in various formats into the Alvis canonical format for documents
Perl extension for converting documents from a number of different source formats to Alvis XML format.
Perl extension for assembling an Alvis documentRecord from given pieces.
Perl extension for guessing and checking the encoding of documents.
Perl extension for representing links occurring in documents.
Perl extension for representing meta information about a document, such as its URL, title, modification date, HTML header information, and detected character set.
Perl extension for guessing and checking the type of a document (an extension of MIME types).
Perl extension for converting documents in dirty HTML into "clean" HTML suitable for Alvis purposes.

Provides

in lib/Alvis/AinoDump.pm
in lib/Alvis/Utils.pm
in lib/Alvis/Wikipedia/Templates.pm
in lib/Alvis/Wikipedia/Variables.pm
in lib/Alvis/Wikipedia/WikitextParser.pm
in lib/Alvis/Wikipedia/XMLDump.pm