NAME

Regexp::Common::debian - regexps for Debian specific strings

SYNOPSIS

    use Regexp::Common qw/ debian /;
    # Read `perldoc Regexp::Common` for base documentation
    # Each pattern provides its own synopsis

DESCRIPTION

Debian GNU/Linux, as a management system, validates, parses, and generates lots of data. For the sake of some other project I needed some kind of parser. And, at the time of starting, there were reasons to write my own. Those reasons are moot now, but here we are.

When choosing the API I had a choice:

parsing

That would be a bunch of error-prone decisions: pick a backbone parser, figure out the grammar, mix them, build an API, implement it... And the net result would be one more xDpkg:: namespace. I really would like to hear any reasons why.

comparing

String on the left, regexp on the right, add {-keep}, and get an array of parsed-out parts. The other way: string on the left, regexp on the right, anchor it properly, and get a scalar indicating match or mismatch. The only deficiency I can see is that the result is an array, not a hash. Hard to argue. It seems I've committed a sin. I'll have to live with it.

As a backbone Regexp::Common was chosen. It has its own deficiencies, but I've failed to find any unhappy user (unsatisfied -- maybe, but unhappy -- no, sir). Maybe I didn't try hard enough. It provides a neat and rich interface, but...

{-keep} and {-i} are provided internally. It's OK with {-keep}, but {-i}... Look, Debian strings are almost all case-sensitive. When case shouldn't matter, case sensitivity is explicitly switched off by the template itself. So if you play with {-i}, don't blame me. (I'll experiment with implicit qr/(?i:)/ after that release. And experiments are going.)

(note) Regexp::Common::debian is very permissive in some cases (sometimes absurdly permissive). Hopefully, I've noted all such cases in the documentation.

v0.2.10 The test suite checks various sources that can be found on a Debian system. Those checks are done only upon request. Don't be too optimistic about success. README has more.

$RE{debian}{package}
    'the-very.strange.package+name' =~ $RE{debian}{package}{-keep};
    print "package is $1";

This is a Debian package name. The rules are described in Section 5.6.7 of Debian policy.

$1 is a package

$RE{debian}{version}
    '10:1+abc~rc.2-ALPHA:now-rc25+w~t.f' =~ $RE{debian}{version}{-keep};
    ($2 || 0) eq '10'            &&
    $3 eq '1+abc~rc.2-ALPHA:now' &&
    ($4 || 0) eq 'rc25+w~t.f'       or die;

This is a Debian version. The rules are described in Section 5.6.12 of Debian policy. upstream_epoch and debian_revision are implicitly caseless (as required).

$1 is a debian_version
$2 is an epoch

if any. Otherwise -- undef. Debian policy requires defaulting to 0 here. However, Perl disallows assigning to the special variables $[1-9][0-9]* (they are read-only, perlvar has more). So if epoch comes back undef, assume 0.

$3 is an upstream_version

If there's no way to match upstream_version then the whole pattern fails.

(caveat) A string like 0--1 will end up with upstream_version set to weird 0- (hopefully, Debian won't degrade to such versions; though YMMV).

(caveat) v0.2.3 Look at caveat #1 for background. However, this RE stayed a bit better than the others. In spite of Debian policy, upstream_version can start with a number or a letter, but not with just any version-forming character. Should it be configurable? Probably. But think about it: $RE{debian} is for working with strings, not for verification. And such policy-ignorant versions wouldn't go anywhere else (think changelog.Debian). So given a choice between weak and strict you would almost always choose weak. What's the point of strict then? Nobody cares.

$4 is a debian_revision

(bug) 0-1- will end up with upstream_version set to 0 and debian_revision set to 1 (such trailing hyphens will be missing in debian_version). 0- will end up with debian_revision undefined. And the same as with epoch: an omitted debian_revision defaults to 0, but the read-only $4 can't, so treat undef as 0.

(caveat) The debian_revision is allowed to start with a non-digit. This is solely my reading of Debian Policy.
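
Since the numbered captures are read-only, the defaults have to be applied to copies. What follows is only a minimal sketch under stated assumptions: $stand_in is a hypothetical, much cruder pattern than $RE{debian}{version}, and parse_version() is an illustrative name that the module doesn't provide. Returning named parts also sidesteps the array-versus-hash complaint from the DESCRIPTION.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the pattern shipped by Regexp::Common::debian.
# The capture layout mirrors the documented one:
# $1 debian_version, $2 epoch, $3 upstream_version, $4 debian_revision.
my $stand_in = qr/\A((?:(\d+):)?(.+?)(?:-([^-]+))?)\z/;

sub parse_version {
    my ($string) = @_;
    $string =~ $stand_in or return;
    # Copy the read-only captures before applying the policy defaults.
    my %v = (
        debian_version   => $1,
        epoch            => $2,
        upstream_version => $3,
        debian_revision  => $4,
    );
    for my $part (qw(epoch debian_revision)) {
        $v{$part} = 0 unless defined $v{$part};
    }
    return \%v;
}

my $v = parse_version('1.2.3');
print "epoch $v->{epoch}, revision $v->{debian_revision}\n";
```

With the real pattern the copying step would look the same; only the match line would change.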

R_C_d_version()
    use Regexp::Common qw(debian);
    # though that works too
    # use Regexp::Common::debian;
    my $re = Regexp::Common::debian::R_C_d_version;
    $version =~ /^$re$/;
    $2                   and print "has epoch\n";
    $3 || $5 || $6 || $8 and print "has upstream_version\n";
    $4 || $7             and print "has debian_revision\n";
    $3 && !$4 || !$3 && $4 or die;
    $6 && !$7 || !$6 && $7 or die;
    $3 && !$5 && !$6 && !$8 or die;
           $5 && !$6 && !$8 or die;
                  $6 && !$8 or die;

That was a workaround for perl5.8.8. As of v0.2.12 it's gone.

$RE{debian}{architecture}
    $arch =~ $RE{debian}{architecture}{-keep};
    $2 && ($3 ||  $4)           and die;
           $3 && !$4            and die;
    $2 and print "that's special: $2";
    $3 and print "OS is: $3";
    $4 and print "CPU is: $4";

This is a Debian architecture. The rules are described in Section 5.6.8 of Debian policy.

v0.2.12 At the time of writing: only the linux os is present for any cpu; only the m68k cpu is present for any os; reality had been more straightforward before. Thus I'm giving up on semantics. Anything that comprises a somehow known os can go on the left. Anything that comprises a somehow known cpu can go on the right. The any wildcard can take over either os or cpu ((bug) or both of them (any-any is parsed as a correct architecture)). Neither a lowercase letter, nor a digit, nor a hyphen can touch a candidate match on the outside.

$1 is some of Debian's architectures
$2 is any special

Distinguishing special architectures (all, any, and source) from os-cpu pairs is arguable. But I've decided that it would be good to separate all and e.g. i386 (which in turn is actually linux-i386).

$3 is os

When !$3 && $4 is true, the undefined os actually means linux. Since $digits are read-only, yielding anything but undef here is impossible. More on that in Section 11.1 of Debian policy.
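
The linux default can be applied the same way as the epoch default: copy first, then fill in. This is a sketch only; $stand_in is a hypothetical simplification that knows nothing about which os or cpu names actually exist, and parse_architecture() is an illustrative name, not something the module exports.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $RE{debian}{architecture}; same capture layout:
# $1 architecture, $2 special, $3 os, $4 cpu.
my $stand_in = qr/\A((all|any|source)|(?:([a-z0-9]+)-)?([a-z0-9]+))\z/;

sub parse_architecture {
    my ($string) = @_;
    $string =~ $stand_in or return;
    my ($special, $os, $cpu) = ($2, $3, $4);
    return { special => $special } if defined $special;
    # A bare cpu such as 'i386' really means 'linux-i386'.
    $os = 'linux' unless defined $os;
    return { os => $os, cpu => $cpu };
}

my $arch = parse_architecture('i386');
print "OS is: $arch->{os}, CPU is: $arch->{cpu}\n";
```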

$4 is cpu

(note) Occasionally, various sources talk about arch while meaning the cpu component of architecture. It looks like architecture is always an os-cpu pair. Probably that arch/architecture mess has been with us from the beginning. (bug) It happens in this documentation too.

(caveat) Debian policy by itself doesn't specify what os-arch pairs are valid (only specials are mentioned). Instead it relies on qx/dpkg-architecture -L/. In effect R::C::d can fall out of sync; hopefully, that wouldn't stay unnoticed for too long.

$RE{debian}{archive}{binary}
    'abc_1.2.3-512_all.deb' =~ $RE{debian}{archive}{binary}{-keep};
    print "     package is -> $2";
    print "     version is -> $3";
    print "architecture is -> $4";

This is a Debian binary archive (even if there's no binary file (in the -B sense) inside, it's called "binary" anyway). When Debian policy and deb(5) talk about "format", it's about the internals, not the name. If you think about it, it's clear that neither dpkg(1) nor apt(1) nor any other alternative cares what the basename of a particular binary archive is. It turns out that the only authority on naming binary archives is whatever actually creates them. Indeed, dpkg-deb(1) clearly states its intentions in its very first entry: -b, --build directory [archive|directory].

$1 is deb_filename

That's the whole archive filename with the .deb suffix included. (bug) .udeb is a suffix too.

$2 is package
$3 is version

There's a great deal of WTF. Filename: entries in *_Packages lack the epoch entirely. Archives in pool/ lack it too. Archives in /var/cache/apt/archives ... That seems to be apt-get specific (I don't have a reference to the code though). As a feature, $RE{d}{a}{binary} provides an epoch hack in filenames.

(bug) That extra intelligence should be configurable.

(caveat) v0.2.3 "caveat #1: version starts with letter".

$4 is architecture

(bug) That would match a surprising source or any. Actually it's even worse: an OS can be prepended to any arch or special. In short: it doesn't work with ports.

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{archive}{source_1_0}
    'xyz_1-ab.25~6.orig.tar.gz' =~ $RE{debian}{archive}{source_1_0}{-keep};
    print "package is $2";
    -1 != index($3, '-') && $4 eq 'tar' and die;
    $4 eq 'orig.tar'              and print "there should be patch";

This is a Debian upstream (or Debian-native) source tarball. Naming source archives is outside Debian policy; although

  • Section 5.6.21 mentions that "the exact forms of the filenames are described in" Section C.3.

  • Section C.3 points out that a source archive must be in the form package_upstream-version.orig.tar.gz.

  • Naming Debian-native packages is left out completely.

  • dpkg-source(1) (at least as of 1.15.2) shows real life and makes all that a bit more complicated. See section SOURCE PACKAGE FORMATS of dpkg-source(1) for details.

v0.2.3 At that point an incompatible change was made. $RE{d}{a}{source} was renamed to $RE{d}{a}{source_1_0} (which in fact it always was). Probably one day there could be an aggregating $RE{d}{a}{source} that would match any source filename (if there's any purpose for it). More on the different formats below.

Format: 1.0

It's either a set of *.orig.tar.gz and an accompanying *.diff.gz, or a lone *.tar.gz (then that's 'native'). That is covered by $RE{d}{a}{source_1_0}.

Format: 2.0

That's supposedly unseen in the wild. dpkg-source(1) doesn't say what filenames represent it. Probably those of Format: 3.0 (quilt) (refer to $RE{d}{a}{source_3_0_quilt} for details). Not implemented.

Format: 3.0 (native)

At that point Format: 1.0 has been split. Debian native packages (those without *.debian.tar.gz) are of this type. Implemented in $RE{d}{a}{source_3_0_native}.

Format: 3.0 (quilt)

Those with *.debian.tar.gz are of this second format. Very hot. Implemented in $RE{d}{a}{source_3_0_quilt} and $RE{d}{a}{patch_3_0_quilt}. Refer to respective sections, details are huge.

Format: 3.0 (custom)

A secret format. Probably $RE{d}{a}{source_3_0_quilt} would suffice. Not implemented.

Format: 3.0 (git)
Format: 3.0 (bzr)

Those are secret too. And again, I believe, $RE{d}{a}{source_3_0_quilt} would be enough. Not implemented.

And now miserable notes about $RE{d}{a}{source_1_0}:

$1 is tarball_filename

Since there's no suffix other than .gz, it's present only in tarball_filename.

$2 is package
$3 is version

There's a bit (or a pile) of complication. Look, if version contains a minus (-), that means the resulting binary must have debian_revision set (otherwise that minus must not be there), thus implying the presence of *.diff.gz, thus implying that type must be orig.tar, not plain tar (which would be a Debian-native package). OTOH, if there is no minus, then type could be either orig.tar or tar. Obviously the lack or presence of *.diff.gz is beyond the knowledge of $RE{d}{a}{source_1_0}.

(bug) This should fail on package_0.orig-component.tar.gz. It doesn't (see $RE{d}{a}{source_3_0_native} for details).

(caveat) Consider this: package_0-1.debian.tar.gz. Is it debian-native (version would be 0-1.debian) of Format: 1.0; or is it a debianization tar (version would be 0-1) of Format: 3.0 (quilt)? Without checking the Format: entry it's impossible to say. (Are you wondering about the hyphen? Think again (unattended-upgrades_0.25.1debian1-0.1 is debian-native).) The good news is that (at the time of writing) I've found no debian-native package (of either Format:) whose Version: would match qr/debian$/. (Let's check it tomorrow.) And back to the subject: package_0.debian.tar.gz is implicitly prohibited.

(caveat) v0.2.3 "caveat #1: version starts with letter".

$4 is type

This can hold one of 2 strings (orig.tar (regular package) or tar (Debian-native package)).

(bug) Probably that should look behind (if that were possible) for a hyphen (-) in version. It doesn't, because it's OK to have a hyphen in Debian-native packages (francine_0.99.8orig-6.tar.gz).

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{archive}{source_3_0_native}
    'xyz_1234.tar.lzma' =~ $RE{debian}{archive}{source_3_0_native}{-keep};
    print "package is $2";
    print "version is $3";
    print 'decompress with ' . (
      $4 eq 'gz'   ? 'gunzip'  :
      $4 eq 'bz2'  ? 'bunzip2' :
      $4 eq 'lzma' ? 'unlzma'  :
      $4 eq 'xz'   ? 'unxz'    : die);

v0.2.5 That's a descendant of $RE{d}{a}{source_1_0} for native packages (those without *.debian.tar.gz).

$1 is tarball_filename

tar with delimiting dots (.) is included only here.

$2 is package
$3 is version

(bug) That must fail on package_0.orig.tar.gz. It doesn't because of package_0.orig-component.tar.gz. It needs variable-length look-behind.

package_0.debian.tar.gz doesn't match. $RE{d}{a}{patch_3_0_quilt} matches instead.

(caveat) v0.2.3 "caveat #1: version starts with letter".

$4 is suffix

It's either gz, bz2, lzma, or xz. Anything else (missing counts as anything) would fail the whole pattern.
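
Picking a tool by suffix reads a bit easier as a table lookup than as a ternary chain. A small sketch: the tool names are the ones used in the synopses, while %decompressor and decompressor_for() are illustrative names that the module doesn't provide.

```perl
use strict;
use warnings;

# Suffix-to-tool table; the tool names are the ones used in the synopses.
my %decompressor = (
    gz   => 'gunzip',
    bz2  => 'bunzip2',
    lzma => 'unlzma',
    xz   => 'unxz',
);

# decompressor_for() is an illustrative helper, not part of the module.
sub decompressor_for {
    my ($suffix) = @_;
    defined $suffix && exists $decompressor{$suffix}
      or die "unknown suffix\n";
    return $decompressor{$suffix};
}

print 'decompress with ', decompressor_for('lzma'), "\n";
```

After a successful {-keep} match the suffix capture ($4 or $5, depending on the pattern) would be passed in place of the literal.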

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{archive}{source_3_0_quilt}
    'xyz_1-ab.25~6.orig-cool-stuff.tar.bz2' =~ $RE{debian}{archive}{source_3_0_quilt}{-keep};
    print "package is $2";
    print "version is $3";
    print "component happens to be $4" if $4;
    print 'decompress with ' . (
      $5 eq 'gz'   ? 'gunzip'  :
      $5 eq 'bz2'  ? 'bunzip2' :
      $5 eq 'lzma' ? 'unlzma'  :
      $5 eq 'xz'   ? 'unxz'    : die);

v0.2.4 That's a descendant of $RE{d}{a}{source_1_0} for non-native debian packages (those with *.debian.tar.gz). (note) Also, Format: 3.0 (quilt) invents a concept of components.

$1 is tarball_filename

Delimiting dots (.), orig (with or without (if missing) component delimiting hyphen (-)), and tar are present here only. The component itself is present in component.

$2 is package
$3 is version

(caveat) v0.2.3 "caveat #1: version starts with letter".

$4 is component

The 'component' is a specially packed piece of upstream sources (packed this way by either upstream or Debian). It's not a patch. Thus it's here (in $RE{d}{a}{source_3_0_quilt}, not in $RE{d}{a}{patch_3_0_quilt}). The component name is either present or missing completely, so this is invalid:

    null-component-package_01234.orig-.tar.gz

Although this is perfectly valid:

    strange-component-package_98765.orig--.tar.gz

dpkg-source(1) is unclear about this, but my understanding is that the component name is closer to package (thus lowercase only) than to version (mixed case). However, that's not yet enforced.

$5 is suffix

It's either gz, bz2, lzma, or xz. Anything else (missing counts as anything) would fail the whole pattern.

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{archive}{patch_1_0}
    'abc_0cba-12.diff.gz' =~ $RE{debian}{archive}{patch_1_0}{-keep};
    print "package is $2";
    -1 == index $3, '-' and die;
    print "debian revision is ", (split /-/, $3)[-1];

This is a "debianization diff" (Section C.3 of Debian policy). Naming patches is outside Debian policy, so we're back to guessing. There are rumors (or maybe trends) that Format 1.0 will be deprecated (or maybe made obsolete).

v0.2.6 Incompatible change. $RE{d}{a}{patch} has been renamed into $RE{d}{a}{patch_1_0}.

$1 is patch_filename

Since there's no suffix other than .diff.gz, it's present only in patch_filename.

$2 is package
$3 is version

(caveat) Consider this. A Debian-native package has neither a patch nor a hyphen in version. A regular package has a patch and must have a hyphen in version. $RE{d}{a}{patch_1_0} is absolutely ignorant about that (we are about matching, not verifying, after all).

(caveat) v0.2.3 "caveat #1: version starts with letter".

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{archive}{patch_3_0_quilt}
    'abc_0cba-12.debian.tar.lzma' =~ $RE{debian}{archive}{patch_3_0_quilt}{-keep};
    print "package is $2";
    -1 == index $3, '-' and die;
    print "debian revision is ", (split /-/, $3)[-1];
    print 'decompress with ' . (
      $4 eq 'gz'   ? 'gunzip'      :
      $4 eq 'bz2'  ? 'bunzip2'     :
      $4 eq 'lzma' ? die 'stinks!' :
      $4 eq 'xz'   ? 'unxz'        : die);

Since Format: 3.0 (quilt) has been invented, debianization stuff has changed form from one big diff (*.diff.gz, $RE{d}{a}{patch_1_0}) to debianization stuff (placed in debian/) and set of diffs (if any) (intended to be placed in debian/patches/) in form of single tar-file (*.debian.tar.gz).

$1 is tar_filename

debian.tar with delimiting dots (.) is seen here only.

$2 is package
$3 is version

(caveat) v0.2.3 "caveat #1: version starts with letter".

$4 is suffix

It's either gz, bz2, lzma, or xz. Anything else (missing counts as anything) would fail the whole pattern.

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{archive}{dsc}
    'abc_0cba-12.dsc' =~ $RE{debian}{archive}{dsc}{-keep};
    print "package is $2";
    print "version is $3";

This is "Debian source control" (Section 5.4 describes its contents but not its naming). Statistically based guessing, you know (one day I'll elaborate and point to the exact lines in the dpkg-dev bundle where it's in use (creating and parsing)).

$1 is dsc_filename

As usual, since the only suffix can be .dsc it's present in dsc_filename only.

$2 is package
$3 is version

(caveat) v0.2.3 "caveat #1: version starts with letter".

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{archive}{changes}
    'abc_0cba-12.changes' =~ $RE{debian}{archive}{changes}{-keep};
    print "package is $2";
    print "version is $3";

This is a "Debian changes file" (Section 5.5 describes its contents but not its naming). dpkg-genchanges(1) is silent too. So this pattern is based on observation as well.

$1 is changes_filename

As usual, since the only suffix can be .changes it's present in changes_filename only.

$2 is package
$3 is version

(caveat) v0.2.3 "caveat #1: version starts with letter".

$4 is architecture

(caveat) "caveat #2: suffix could be in version"

$RE{debian}{sourceslist}
    'deb file:/usr/local oldstable main contrib non-free' =~ $RE{debian}{sourceslist}{-keep} and
      system "rm -rf $5" or die;
    ($4 eq 'http' || $4 eq 'rsh' || $4 eq 'ssh') &&
      !index $5, '//' or die;
    ($4 eq 'file' || $4 eq 'cdrom' || $4 eq 'copy') &&
      !index($5, '/') && index($5, '/', 1) > 1 or die;
    index(reverse($6), '/') || $7 or die;

This is one entry in the sources.list resource list. The format is described in the sources.list(5) man page (hence a chance for desynchronization) (gosh, it's not debian any more, it's APT).

(bug) It just came to my attention that between deb and uri there could be options. Missing so far.

$1 is resource_entry

$RE{d}{sourceslist} is very permissive about what would constitute entries, but you can bet on one thing: the whole entry stays on one line.

$2 is resource_type

That can be either deb or deb-src. Implicit negative lookbehind for qr/\w/ provided (so =deb is accepted, _deb is not; hey, #deb is accepted too! explicit anchoring at your option).

$3 is uri

You think you know what URI is? Read below...

$4 is scheme

Schemes that APT knows have nothing to do with sources.list(5) actually. The scheme that APT will use is some executable in /usr/lib/apt/methods (some of them are for transfer, some are not). sources.list(5) (of 0.9.7.8) defines these:

local filesystem

file, cdrom, copy.

network

http, ftp, rsh, ssh

The delimiting colon (:) isn't included here (although uri includes it).

(bug) It just came to my attention (0.9.7.8) that scheme can be anything (to some degree).

$5 is hier_path

The idea is that someday $RE{d}{sourceslist} would look behind at uri to decide if there should be authority (that one delimited with //) or path_absolute would be enough. Right now that's not the case. (bug) Any non-space sequence is hier_path.

That's very bad, but that's the way it's done right now. Look, parsing URI is a task for standalone pattern. It's not implemented, maybe someday some kind perlist would do that. Yes, I know about Regexp::Common::URI. Apparently R::C::U knows nothing about cdrom:.

$6 is distribution

Debian is full of surprises. Lots of surprises. You think you know what distribution is, don't you? You missed. distribution can be a filesystem path. Since sources.list(5) doesn't mention space escaping techniques I assume spaces aren't allowed; so any non-space is allowed. You think that's overkill? You're obviously wrong (think $ARCH, sources.list(5) has more).

$7 is component_list

In a misguided attempt not to make it too different from the rest of that crowd, component_list is a space-delimited list of non-spaces. If distribution ends with a slash (/), then component_list can be empty (I mean, maybe someday that will look behind too).
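
The field layout described above can be approximated without any pattern at all, by splitting on whitespace. This is a rough sketch under stated assumptions: parse_sources_line() is an illustrative helper (not the module's pattern), it returns the components as an array reference rather than as one string, and, like the pattern, it doesn't handle options between the type and the uri.

```perl
use strict;
use warnings;

# parse_sources_line() is an illustrative helper, not the module's pattern.
# It only mirrors the documented permissiveness: every field is a run of
# non-spaces, and options between type and uri are not handled.
sub parse_sources_line {
    my ($line) = @_;
    my ($type, $uri, $distribution, @components) = split ' ', $line;
    return unless defined $distribution;
    return unless $type eq 'deb' || $type eq 'deb-src';
    # The scheme is everything before the first colon.
    my ($scheme, $hier_path) = $uri =~ /\A([^:]+):(.*)\z/s or return;
    # An empty component list is acceptable only for a distribution
    # that ends with a slash.
    return if !@components && $distribution !~ m{/\z};
    return {
        resource_type  => $type,
        scheme         => $scheme,
        hier_path      => $hier_path,
        distribution   => $distribution,
        component_list => \@components,
    };
}

my $entry = parse_sources_line('deb file:/usr/local oldstable main contrib non-free');
print "scheme is $entry->{scheme}, path is $entry->{hier_path}\n";
```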

$RE{debian}{preferences}
    <<END_OF_PREFERENCES =~ $RE{debian}{preferences}{-keep} or die;
    Explanation: Stay updated!
    Package: perl
    Pin: version 5.10*
    Pin-Priority: 1001
    END_OF_PREFERENCES
    $2 eq 'perl' and
      print "good, we are looking for perl\n";
    $3 eq 'version' and $4 =~ /^5\.10/ and
      print "good, we are looking for recent\n";
    $5 =~ /^\d+$/ && $5 > 1000 and
      print "good, we'll stay updated\n";

This is one entry in the preferences list. The good news is over; the bad news is below. I've failed to find a definition of an entry in preferences (still looking). apt_preferences(5) hints at what it looks like by providing examples. That's not enough; the behaviour of apt-cache policy leads away from understanding as well.

After some experimenting I've found that, in general, this is the Debian control file format, with some quirks. So here we are -- some common case of an entry in preferences.

(bug) v0.2.12 Somewhere in the span lenny/squeeze/wheezy the treatment of preferences has changed so much that $RE{d}{p} needs a total rework. Now there are: globs, POSIX extended re, star has explicit meaning, and probably more (reading the changelog leaves a very unhappy feeling). So whatever is said hereafter describes what this re is doing, not what preferences might look like.

In short:

  • each entry consists of 3 stanzas (Package:, Pin:, Pin-Priority:);

  • the order matters, no intermediate stanza is allowed;

  • case doesn't matter (for both name and value of stanza (to some degree));

  • whatever has gone before Package: or came after Pin-Priority: (line-wise) is ignored;

  • apt-cache policy fails in one case -- Package: stanza has leading spaces;

  • misparsed values are ignored, thus invalidating the whole entry (but see below), thus the entry is ignored.

That's what $RE{debian}{preferences} does. More on each stanza below.

(bug) apt-cache policy will accept newlines -- those are spaces in Debian control files, provided the subsequent lines are properly indented. $RE{d}{preferences} accepts one-line stanzas only.

$1 is a preferences_entry

That's the whole entry -- with all leading and trailing spaces, and an Easter egg. apt_preferences(5) invents something called Explanation: stanzas (they should go before Package:, with no empty lines in between). Since we are aware of that, the Explanation: sequence is included in preferences_entry (and it won't ever be package_stanza (1st, obvious compatibility reasons; 2nd, it's somewhat legalized since it's mentioned; 3rd, it can be easily dropped in case I find that useful)).

$2 is a package_stanza

That's either * (star, match-any-string wildcard) or a space-separated list of package names (a lone package name is a degenerate list). That is, if package_stanza is a list, then each (even if there's only one) non-space sequence is treated as a package name. apt-cache policy doesn't seem to verify its input, so one can put anything here. Then those sequences will be matched literally against known package names.

(feature) Contrary to everything else in $RE{d}{preferences}, package names are case-sensitive.

(bug) apt-cache policy will silently accept a star among package names. Then, since no package name matches (there can't be a package named *), the star will be missing among pinned packages. $RE{d}{preferences} rejects such strings.

$3 is a context_switch

The Pin: stanza is broken in two parts. That's the first one. The 3 acceptable strings are version, origin, and release. Bad news below.

$4 is a context_filter

(bug) (what else?) What would be a correct input here depends on context_switch. $RE{d}{preferences} takes anything up to the next newline.

$5 is a pin_priority_stanza

pin_priority_stanza will hold a sequence of decimal digits (yes, hexadecimals are rejected and octals aren't converted), optionally preceded by a + (plus) or - (minus) sign, up to a surprising . (dot). Any trailing decimals and dots (after the first one) will be ignored by apt-cache policy. $RE{d}{preferences} ignores them too. The optional dot-decimal trailer will be missing in pin_priority_stanza, but present in preferences_entry.
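
The priority rules are compact enough to try out in isolation. A minimal sketch built only from the description above; pin_priority() is an illustrative helper and its pattern is an assumption, not the one the module uses.

```perl
use strict;
use warnings;

# pin_priority() is an illustrative helper built from the description
# above; it is not the module's own pattern.
sub pin_priority {
    my ($value) = @_;
    # Optional sign, decimal digits, then an ignored trailer that starts
    # with a dot and holds only digits and dots.
    $value =~ /\A([+-]?[0-9]+)(?:\.[0-9.]*)?\z/ or return;
    # Numify the kept part; a leading zero does not make it octal.
    return $1 + 0;
}

for my $candidate ('1001', '+500.25', '010', '0x10') {
    my $priority = pin_priority($candidate);
    print "$candidate: ", defined $priority ? $priority : 'rejected', "\n";
}
```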

It's a mess, isn't it? Go figure.

$RE{debian}{changelog}
    <<'END_OF_CHANGELOG' =~ $RE{debian}{changelog}{-keep} or die;
    perl (6.0.0-1) unstable; urgency=high
      * Hourah!
     -- John Doe <doe@example.tld>  Thu, 01 Apr 2010 00:00:00 +0300
    END_OF_CHANGELOG
    print <<"END_OF_REPORT";
    package        : $2
    version        : $3
    in archive     : $4
    flags          : $5
    changes        :
    ${6}uploaded by    : $7
    acknowledgment : $8
    at time        : $9
    END_OF_REPORT

This is one entry in debian/changelog. The format is described in Section 4.4 of Debian Policy. In the real world, parsing of this file is done by a parser script. /usr/lib/dpkg/parsechangelog/debian is a Perl script that's called from dpkg-parsechangelog (of the dpkg-dev package (which in turn is a Perl script, again)).

There are 2 special Perl modules (namely: Debian::ParseChangelog (of CPAN), and Dpkg::Changelog (of the dpkg-dev package)). And now there's a 3rd one (how cute). The former two are read/write engines; $RE{debian}{changelog} is read-only (for obvious reasons). There's a point of desynchronization though.

Until Debian Policy v3.8.1.0 there was an option of providing debian/changelog in different format. However [489460@bugs.debian.org] had made it. Now that option has gone. However, dpkg-parsechangelog(1) describes how those are introduced and handled.

$1 is a changelog_entry

That's the whole entry: header, delimiting empty lines (if any), and sig-line (with trailing newline). It seems (that's not stated explicitly in debian-policy) that there must be an intermediate empty line (what's an 'empty line', btw?). And the latest entry in the changelog must start at the very first line. $RE{d}{changelog} pays no attention.

$2 is a debian_package

(bug) Just a sequence of characters allowed in Debian's package name. No other restrictions provided.

$3 is a debian_version

Surrounding parentheses aren't included.

(bug) That's simplified too.

(caveat) v0.2.3 "caveat #1: version starts with letter".

$4 is distributions

v0.2.8 That's space ( ) separated sequence of letters (a .. z) (caseless, enforced) and hyphens (-) in any order, except first character should be letter (weird). Space before terminating semicolon is disallowed (it's not missing in distributions, it fails entry). Terminating semicolon isn't included.

$5 is keys (or urgency, if you like)

(note) Debian Policy explicitly states that this field is supposed to be a comma (,) separated list of equal (=) separated key-value pairs. However the only known key is urgency. Maybe I'm too pessimistic, but despite the fact that the only key allowed is urgency, the whole key=value pair is put in keys -- so you'd better be prepared to pick out the key you're looking for (one day you can get a lot more).

(caveat) v0.1.5 I wasn't pessimistic enough. perl5.8.8 sometimes goes nuts looking for urgency (it happens to be an anchor) (namely: libcompress-zlib-perl_2.015-1) (perl5.10.0 is OK). In a misguided attempt to support oldstable, $RE{d}{changelog} no longer looks for urgency; it looks for a sequence of lowercase letters. (And the anchor is now the \040--\040 of the sig-line.) Sorry.

(caveat) 0.2.8 The log entry of binutils_2.7-5 invents a concept of something. Let's call it a comment (or a wish). Thus anything that's not a comma-separated, equal-separated key-value pair is skipped (from keys). Obviously, it's present in changelog_entry.

$6 is changes

That invents concept of empty line.

v0.2.8 For $RE{d}{changelog} an "empty line" consists of any number of horizontal spaces (qr/\h/) followed by a newline. OTOH, a "line" is at least two spaces (one tab counts as at least two spaces), then any non-space character, and anything up to the next newline (space counts as "anything" for now). No space or one space followed by a non-space fails entirely (but watch for the trailing signature line). As requested by Debian Policy (or the stock parser), leading and trailing empty lines are ignored (they are included in changelog_entry though).

(bug) Handling trailing empty lines is broken. It's useless to describe what empty lines and what number of empty lines will end up in changes. $RE{d}{changelog} must be redone.

(caveat) The recommended way of outlining changes is starting each subentry with a star (*), then adding at least one space to sub-subentries. OTOH, the modern way to highlight work done by different maintainers (or probably non-maintainers at all) is by placing the maintainer name in brackets (with two leading spaces). $RE{d}{changelog} accepts anything.

(note) (I can't say whether it's a bug or a feature) The leading and trailing empty lines are said to be optional. However one leading and one trailing empty line are present in each (decent?) entry in a Debian changelog file. $RE{d}{changelog} doesn't insist on that.
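
Those two definitions can be written down as a pair of small patterns. This is only my reading of the paragraph above put into code; classify_line() is an illustrative helper, not part of the module, and the treatment of a space followed by a tab as valid indentation is an assumption.

```perl
use strict;
use warnings;

# classify_line() is an illustrative helper, not part of the module.
# It needs perl 5.10 or later for \h.
sub classify_line {
    my ($line) = @_;
    # Only horizontal space before the newline.
    return 'empty' if $line =~ /\A\h*\n?\z/;
    # Leading indentation that holds a tab, or is at least two characters
    # wide, followed by a non-space character.
    return 'line' if $line =~ /\A(?:\h*\t\h*|\h{2,})\S/;
    # Everything else: header, sig-line, or garbage.
    return 'other';
}

print classify_line("  * Hourah!\n"), "\n";
```

The sig-line, with its single leading space, falls into the last category, which is why the paragraph above warns about it.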

$7 is a maintainer_name

$RE{d}{changelog} is very permissive about what is maintainer_name (and what it is actually?). email_address and changelog_date take care of themselves. A leading space-then-double-hyphen and separating space aren't included.

v0.2.10 Any number of spaces (but not zero) could be between the double-hyphen and maintainer_name (libnet-daemon-perl_0.30-1).

$8 is an email_address

That one (with option to maintainer_address) is subject to be processed with Regexp::Common::Email::Address (or not, under consideration). Anyway, right now it's a sequence of non-spaces surrounded by angle brackets. Surrounding brackets aren't included.

$9 is a changelog_date

That one is subject to be processed with Regexp::Common::Time. Anyway, right now it's a sequence of RFC822-date forming characters, starting with capital letter and terminated with decimal number. Neither leading double-space nor trailing newline are included.

v0.2.9 debian-policy invents an option of 'time zone name or abbreviation optionally present as a comment in parentheses'. Such a comment would be included in changelog_entry but missing in changelog_date. Moreover, if that comment falls on the next line it will be ignored. All that parody will suffer a rewrite in the next turn.

(bug) v0.2.12 debian-policy v3.9.0.0 states what "date" is. As usual.

(caveat) There could be spaces after last number. They aren't included in changelog_date. And yes, they are present in changelog_entry though.

Pity on me.

BUGS AND CAVEATS

Grep this pod for (bug) and/or (caveat). They all are placed in appropriate sections.

However, two caveats affect multiple patterns. They are covered here in detail.

caveat #1: version starts with letter

(caveat) v0.2.3 Upon checking what I have in *_Packages I've discovered such thing: cnews_cr.g7-40.4_i386.deb. cnews is a package, i386 is an architecture. Then version is cr.g7-40.4? That doesn't look like it starts with number. Or does it?

Or my reading of debian-policy has been a bit vague. Now I see it clearly states: should start with a digit. should isn't must. So from now on: version can start with any... For $RE{debian}{version} it starts with any version forming character except colon (:) or hyphen (-) (that will be fixed in the next turn). For any other pattern it starts with any VFC without exception. (package_+-12_all.deb is valid. And I'm the troll?)

caveat #2: suffix could be in version

(caveat) Consider this: package_0.tar.gz.tar.gz. Here the version is 0.tar.gz. Such a version could be surprising but is otherwise perfectly valid. In order to parse it, every filename pattern looks ahead to check that no version forming character follows the suffix, while the version parsing section is explicitly ungreedy. I believe that's easier than implementing semantic checks instead (package_0.diff.gz.diff.gz is semantically incorrect, it should be package_0.diff.gz-1.diff.gz). However, no such versions have been found so far.
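
The interplay of an ungreedy version and a negative look-ahead is easier to see on a toy pattern. A sketch under stated assumptions: $vfc is my guess at a "version forming character" class, $stand_in handles only the .tar.gz suffix, and neither is taken from the module.

```perl
use strict;
use warnings;

# Simplified stand-in, NOT the module's pattern: $vfc is an assumed
# "version forming character" class, and only .tar.gz is handled.
my $vfc      = qr/[A-Za-z0-9.+~:-]/;
my $stand_in = qr/\A([a-z0-9][a-z0-9+.-]+)_($vfc+?)\.tar\.gz(?!$vfc)/;

# The ungreedy version first stops at '0'; the look-ahead then sees the
# '.' of the second '.tar.gz' and forces the version to grow.
my ($package, $version) = 'package_0.tar.gz.tar.gz' =~ $stand_in
  or die "no match\n";
print "package is $package, version is $version\n";
```

Without the look-ahead the same pattern would settle for version 0 and quietly leave the second .tar.gz unmatched.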

bug #1: no pos()

When working on the test-booster for $RE{d}{changelog} I discovered an awful thing. qr/\G$RE{debian}{changelog}{-keep}/sg fails. A subsequent pos() returns undef. Setting pos() is ignored. Probably all other patterns are affected too. I can't say what the cause is. That will be investigated and hopefully fixed in the next turn.

note #1: pathetic documentation

I should admit that at time of writing I was high on changelogs, preferences, and so on. Not to say that I was totally tripping on versions.

AUTHOR

Eric Pozharski, <whynot@cpan.org>

COPYRIGHT AND LICENSE

Copyright 2008--2010, 2014 by Eric Pozharski

This library is free in the sense: AS-IS, NO-WARRANTY, HOPE-TO-BE-USEFUL. This library is released under LGPLv3.

SEE ALSO

Regexp::Common, http://www.debian.org/doc/debian-policy, dpkg-architecture(1), deb(5), dpkg-source(1), sources.list(5), apt_preferences(5), dpkg-parsechangelog(1), dpkg-deb(1)