package URL::Normalize;
use Moose;
use namespace::autoclean;

use Carp qw();
use URI qw();
use URI::QueryParam qw();

=head1 NAME

URL::Normalize - Normalize/optimize URLs.

=head1 VERSION

Version 0.32

=cut

our $VERSION = '0.32';

=head1 SYNOPSIS

    use URL::Normalize;

    my $normalizer = URL::Normalize->new( 'http://www.example.com/display?lang=en&article=fred' );

    # Normalize the URL of your choosing.
    $normalizer->remove_social_query_parameters;
    $normalizer->make_canonical;

    # Get the normalized version back.
    my $url = $normalizer->url;

=cut

=head1 DESCRIPTION

This is NOT a perfect solution. If you normalize a URL using all of the methods
in this module, there is a high probability that the URL will stop "working."
This is merely a helper module for those of you who want to normalize a URL
using only a few of the safer methods, and/or generate a unique "ID" from any
given URL.

When writing a web crawler, for example, it is costly to check whether a URL
has already been fetched/seen when you have millions or billions of URLs in
some kind of database. This module can help you create a unique "ID", which
you can then use as a key in a key/value store; the key is the normalized URL,
whereas all the URLs that normalize to it make up the value (normally an
array or hash):

    'http://www.example.com/' = {
        'http://www.example.com:80/'        => 1,
        'http://www.example.com/index.html' => 1,
        'http://www.example.com/?'          => 1,
    }

Above, all of the URLs inside the hash normalize to the key if you run these
methods:

=over 4

=item * C<make_canonical>

=item * C<remove_directory_index>

=item * C<remove_empty_query>

=back

You could think of this module as a URL-specific L<Bloom::Filter> helper.
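As a sketch of this approach, a crawler's "have we seen this URL?" check could
use the normalized URL as the key. (C<@urls> below is a hypothetical list of
URLs you have collected; adjust which normalization methods you run to what is
safe for your data.)

    use URL::Normalize;

    my %seen = ();

    foreach my $url ( @urls ) {
        my $normalizer = URL::Normalize->new( $url );

        $normalizer->make_canonical;
        $normalizer->remove_directory_index;
        $normalizer->remove_empty_query;

        # Remember which original URLs mapped to this normalized key.
        $seen{ $normalizer->url }->{ $url } = 1;
    }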

=head1 CONSTRUCTORS

=head2 new( $url )

Constructs a new URL::Normalize object:

    my $normalizer = URL::Normalize->new( 'http://www.example.com/some/path' );

You can also send in just the path:

    my $normalizer = URL::Normalize->new( '/some/path' );

The latter is not recommended, however; look into L<URI>'s C<new_abs> to
create absolute URLs before you use them with URL::Normalize.

=cut

=head1 METHODS

=cut

around BUILDARGS => sub {
    my $orig  = shift;
    my $class = shift;

    if ( @_ == 1 && !ref $_[0] ) {
        return $class->$orig( url => $_[0] );
    }
    else {
        return $class->$orig( @_ );
    }
};

=head2 url

Get the current URL, preferably after you have run one or more of the
normalization methods.

=cut

has 'url' => (
    isa      => 'Str',
    is       => 'ro',
    required => 1,
    writer   => '_set_url',
);

has 'dir_index_regexps' => (
    traits  => [ 'Array' ],
    isa     => 'ArrayRef[Str]',
    is      => 'rw',
    handles => {
        'add_directory_index_regexp' => 'push',
    },
    default => sub {
        [
            '/default\.aspx?',
            '/index\.cgi',
            '/index\.php\d?',
            '/index\.pl',
            '/index\.s?html?',
        ];
    },
);

has 'social_query_params' => (
    traits  => [ 'Array' ],
    isa     => 'ArrayRef[Str]',
    is      => 'rw',
    handles => {
        'add_social_query_param' => 'push',
    },
    default => sub {
        [
            'ncid',
            'utm_campaign',
            'utm_medium',
            'utm_source',
        ];
    },
);

=head2 get_url

DEPRECATED! Use C<url> instead.

=cut

sub get_url {
    my $self = shift;

    Carp::carp( "The 'get_url' method is deprecated; start using the 'url' method instead." );

    return $self->url;
}

=head2 URI

Returns a L<URI> representation of the current URL.

=cut

sub URI {
    my $self = shift;

    my $URI = undef;

    eval {
        $URI = URI->new( $self->url );
    };

    if ( $@ ) {
        Carp::carp( "Failed to create a URI object from URL '" . $self->url . "'" );
    }

    return $URI;
}

=head2 make_canonical

Just a shortcut for C<< URI->new( $url )->canonical->as_string >>, and involves
the following steps (at least):

=over 4

=item * Converts the scheme and host to lower case.

=item * Capitalizes letters in escape sequences.

=item * Decodes percent-encoded octets of unreserved characters.

=item * Removes the default port (port 80 for http).

=back

Example:

    my $normalizer = URL::Normalize->new(
        url => 'HTTP://www.example.com:80/%7Eusername/',
    );

    $normalizer->make_canonical;

    print $normalizer->url; # http://www.example.com/~username/

=cut

sub make_canonical {
    my $self = shift;

    if ( my $URI = $self->URI ) {
        $self->_set_url( $URI->canonical->as_string );
    }
    else {
        Carp::carp( "Can't make non-URI URLs canonical." );
    }
}

=head2 remove_dot_segments

The C<.>, C<..> and C<...> segments will be removed and "folded" (or
"flattened", if you prefer) from the URL.

This method does NOT follow the algorithm described in L<RFC 3986: Uniform Resource Identifier|http://tools.ietf.org/html/rfc3986>,
but rather flattens each path segment.

Also keep in mind that this method doesn't (can't) account for symbolic links
on the server side.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/../a/b/../c/./d.html',
    );

    $normalizer->remove_dot_segments;

    print $normalizer->url; # http://www.example.com/a/c/d.html
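For illustration, the flattening rule itself can be sketched in plain Perl.
(This is a simplified sketch, not the module's actual implementation, which
operates on L<URI> path segments and re-canonicalizes afterwards.)

    sub _flatten_path {
        my ( $path ) = @_;
        my @out = ();

        foreach my $segment ( split('/', $path) ) {
            next if ( $segment eq '.' || $segment eq '...' );

            if ( $segment eq '..' ) {
                pop( @out );
                next;
            }

            push( @out, $segment );
        }

        my $new_path = join( '/', @out );
        $new_path = '/' . $new_path if ( $path =~ m,^/, && $new_path !~ m,^/, );

        return $new_path;
    }

    print _flatten_path( '/../a/b/../c/./d.html' ); # /a/c/d.html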

=cut

sub remove_dot_segments {
    my $self = shift;

    my @old_path_segments = ();
    my @new_path_segments = ();

    if ( my $URI = $self->URI ) {
        @old_path_segments = split( '/', $URI->path_segments );

        foreach my $segment ( @old_path_segments ) {
            if ( $segment eq '.' || $segment eq '...' ) {
                next;
            }

            if ( $segment eq '..' ) {
                pop( @new_path_segments );
                next;
            }

            push( @new_path_segments, $segment );
        }

        if ( @new_path_segments ) {
            $URI->path_segments( @new_path_segments );
        }
        else {
            $URI->path_segments( '' );
        }

        my $new_url = $URI->as_string;
        $new_url =  '/' . $new_url if ( $self->url =~ m,^/, );
        $new_url =  $new_url . '/' if ( $self->url =~ m,/$, );
        $new_url =~ s,^/+,/,;

        $self->_set_url( $new_url );
    }

    $self->make_canonical;
}

=head2 remove_directory_index

Removes well-known directory indexes, eg. C<index.html>, C<default.asp> etc.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/index.cgi?foo=/',
    );

    $normalizer->remove_directory_index;

    print $normalizer->url; # http://www.example.com/?foo=/

The default regular expressions for matching a directory index are:

=over 4

=item * C</default\.aspx?>

=item * C</index\.cgi>

=item * C</index\.php\d?>

=item * C</index\.pl>

=item * C</index\.s?html?>

=back

You can override these by sending in your own list of regular expressions
when creating the URL::Normalizer object:

    my $normalizer = URL::Normalize->new(
        url               => 'http://www.example.com/index.cgi?foo=/',
        dir_index_regexps => [ 'MyDirIndex\.html' ], # etc.
    );

You can also choose to add regular expressions after the URL::Normalize
object has been created:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/index.cgi?foo=/',
    );

    # ...

    $normalizer->add_directory_index_regexp( 'MyDirIndex\.html' );

Keep in mind that the regular expressions ARE case-sensitive, so the
default C</default\.aspx?> expression will NOT match C</Default.aspx>.

=cut

sub remove_directory_index {
    my $self = shift;

    if ( my $URI = $self->URI ) {
        if ( my $path = $URI->path ) {
            foreach my $regex ( @{$self->dir_index_regexps} ) {
                $path =~ s,$regex,/,;
            }

            $URI->path( $path );
        }

        $self->_set_url( $URI->as_string );
    }
}

=head2 sort_query_parameters

Sorts the URL's query parameters alphabetically.

Uppercased parameters will be lowercased during sorting, and if there are
multiple values for a parameter, the key/value-pairs will be sorted as well.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/?b=2&c=3&a=0&A=1',
    );

    $normalizer->sort_query_parameters;

    print $normalizer->url; # http://www.example.com/?a=0&A=1&b=2&c=3

=cut

sub sort_query_parameters {
    my $self = shift;

    if ( my $URI = $self->URI ) {
        if ( $URI->as_string =~ m,\?, ) {
            my $query_hash     = $URI->query_form_hash || {};
            my $query_string   = '';
            my %new_query_hash = ();

            foreach my $key ( sort { lc($a) cmp lc($b) } keys %{$query_hash} ) {
                my $values = $query_hash->{ $key };
                unless ( ref $values ) {
                    $values = [ $values ];
                }

                foreach my $value ( @{$values} ) {
                    push( @{ $new_query_hash{lc($key)}->{$value} }, $key );
                }
            }

            foreach my $sort_key ( sort keys %new_query_hash ) {
                foreach my $value ( sort keys %{$new_query_hash{$sort_key}} ) {
                    foreach my $key ( @{$new_query_hash{$sort_key}->{$value}} ) {
                        $query_string .= $key . '=' . $value . '&';
                    }
                }
            }

            $query_string =~ s,&$,,;

            $URI->query( $query_string );
        }

        $self->_set_url( $URI->as_string );
    }
}

=head2 remove_duplicate_query_parameters

Removes duplicate query parameters, i.e. where the key/value combination is
identical with another key/value combination.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/?a=1&a=2&b=4&a=1&c=4',
    );

    $normalizer->remove_duplicate_query_parameters;

    print $normalizer->url; # http://www.example.com/?a=1&a=2&b=4&c=4

=cut

sub remove_duplicate_query_parameters {
    my $self = shift;

    if ( my $URI = $self->URI ) {
        my %seen      = ();
        my @new_query = ();

        foreach my $key ( $URI->query_param ) {
            my @values = $URI->query_param( $key );

            foreach my $value ( @values ) {
                unless ( $seen{$key}->{$value} ) {
                    push( @new_query, { key => $key, value => $value } );
                    $seen{$key}->{$value}++;
                }
            }
        }

        my $query_string = '';
        foreach ( @new_query ) {
            $query_string .= $_->{key} . '=' . $_->{value} . '&';
        }

        $query_string =~ s,&$,,;

        $URI->query( $query_string );

        $self->_set_url( $URI->as_string );
    }
}

=head2 remove_empty_query_parameters

Removes empty query parameters, i.e. keys with no value. Note that this only
removes BLANK values; values that merely evaluate to false, like zero (0),
are kept.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/?a=1&b=&c=3',
    );

    $normalizer->remove_empty_query_parameters;

    print $normalizer->url; # http://www.example.com/?a=1&c=3

=cut

sub remove_empty_query_parameters {
    my $self = shift;

    if ( my $URI = $self->URI ) {
        foreach my $key ( $URI->query_param ) {
            my @values = $URI->query_param( $key );

            $URI->query_param_delete( $key );

            foreach my $value ( @values ) {
                if ( defined $value && length $value ) {
                    $URI->query_param_append( $key, $value );
                }
            }
        }

        $self->_set_url( $URI->as_string );
    }
}

=head2 remove_empty_query

Removes an empty query (a trailing C<?>) from the URL.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/foo?',
    );

    $normalizer->remove_empty_query;

    print $normalizer->url; # http://www.example.com/foo

=cut

sub remove_empty_query {
    my $self = shift;

    my $url = $self->url;
    $url =~ s,\?$,,;

    $self->_set_url( $url );
}

=head2 remove_fragment

Removes the fragment from the URL, but only if it is at the end of the URL.

For example C<http://www.example.com/#foo> will be translated to
C<http://www.example.com/>, but C<http://www.example.com/#foo/bar> will stay the
same.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/bar.html#section1',
    );

    $normalizer->remove_fragment;

    print $normalizer->url; # http://www.example.com/bar.html

You should probably use this with caution, as many web frameworks today allow
fragments to carry logic, for example:

=over 4

=item * C<http://www.example.com/players#all>

=item * C<http://www.example.com/players#banned>

=item * C<http://www.example.com/players#top>

=back

...can all return very different content, despite their "unfragmented" URLs
being the same.

=cut

sub remove_fragment {
    my $self = shift;

    my $url = $self->url;

    $url =~ s{#(?:/|[^?/]*)$}{};

    $self->_set_url( $url );
}

=head2 remove_fragments

Removes EVERYTHING after a C<#>. As with C<remove_fragment>, you should use
this with caution, because many web applications these days return different
output depending on the fragment, for example:

=over 4

=item * C<http://www.example.com/users#list>

=item * C<http://www.example.com/users#edit>

=back

...etc.
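Example (note that, unlike C<remove_fragment>, everything after the C<#> is
removed, including any slashes):

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/path#foo/bar',
    );

    $normalizer->remove_fragments;

    print $normalizer->url; # http://www.example.com/path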

=cut

sub remove_fragments {
    my $self = shift;

    my $url = $self->url;

    $url =~ s/#.*//;

    $self->_set_url( $url );
}

=head2 remove_duplicate_slashes

Remove duplicate slashes from the URL.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/foo//bar.html',
    );

    $normalizer->remove_duplicate_slashes;

    print $normalizer->url; # http://www.example.com/foo/bar.html

=cut

sub remove_duplicate_slashes {
    my $self = shift;

    if ( my $URI = $self->URI ) {
        my $path = $URI->path;

        $path =~ s,/+,/,g;

        $URI->path( $path );

        $self->_set_url( $URI->as_string );
    }
}

=head2 remove_social_query_parameters

Removes query parameters that are used for "social tracking."

For example, a lot of newspapers post links to their articles on Twitter,
and add a lot of (for us) "noise" to the URL so that they can track the
number of users clicking on that specific URL. This method attempts to
remove such query parameters.

Example:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/?utm_campaign=SomeCampaignId',
    );

    $normalizer->remove_social_query_parameters;

    print $normalizer->url; # http://www.example.com/

Default social query parameters are:

=over 4

=item * C<ncid>

=item * C<utm_campaign>

=item * C<utm_medium>

=item * C<utm_source>

=back

You can override these default values when creating the URL::Normalize
object:

    my $normalizer = URL::Normalize->new(
        url                 => 'http://www.example.com/',
        social_query_params => [ 'your', 'list' ],
    );

You can also choose to add parameters after the URL::Normalize object
has been created:

    my $normalizer = URL::Normalize->new(
        url => 'http://www.example.com/',
    );

    push( @{$normalizer->social_query_params}, 'QueryParam' );

=cut

sub remove_social_query_parameters {
    my $self = shift;

    if ( my $URI = $self->URI ) {
        foreach ( @{$self->social_query_params} ) {
            $URI->query_param_delete( $_ );
        }

        $self->_set_url( $URI->as_string );
    }
}

=head1 SEE ALSO

=over 4

=item * L<URI>

=item * L<URI::URL>

=item * L<URI::QueryParam>

=item * L<RFC 3986: Uniform Resource Identifier|http://tools.ietf.org/html/rfc3986>

=item * L<Wikipedia: URL normalization|http://en.wikipedia.org/wiki/URL_normalization>

=back

=head1 AUTHOR

Tore Aursand, C<< <toreau at gmail.com> >>

=head1 BUGS

Please report any bugs or feature requests to the web interface at L<https://rt.cpan.org/Dist/Display.html?Name=URL-Normalize>.

=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc URL::Normalize

You can also look for information at:

=over 4

=item * AnnoCPAN: Annotated CPAN documentation

L<http://annocpan.org/dist/URL-Normalize>

=item * CPAN Ratings

L<http://cpanratings.perl.org/d/URL-Normalize>

=item * Search CPAN

L<http://search.cpan.org/dist/URL-Normalize/>

=back

=head1 LICENSE AND COPYRIGHT

Copyright 2012-2015 Tore Aursand.

This program is free software; you can redistribute it and/or modify it
under the terms of the Artistic License (2.0). You may obtain a
copy of the full license at:

L<http://www.perlfoundation.org/artistic_license_2_0>

Any use, modification, and distribution of the Standard or Modified
Versions is governed by this Artistic License. By using, modifying or
distributing the Package, you accept this license. Do not use, modify,
or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made
by someone other than you, you are nevertheless required to ensure that
your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service
mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge
patent license to make, have made, use, offer to sell, sell, import and
otherwise transfer the Package with respect to any patent claims
licensable by the Copyright Holder that are necessarily infringed by the
Package. If you institute patent litigation (including a cross-claim or
counterclaim) against any party alleging that the Package constitutes
direct or contributory patent infringement, then this Artistic License
to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER
AND CONTRIBUTORS "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY
YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR
CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR
CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

=cut

__PACKAGE__->meta->make_immutable;

1; # End of URL::Normalize