The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
package Locale::Maketext::Utils::Phrase::Norm::Ellipsis;

use strict;
use warnings;

sub normalize_maketext_string {
    my ($filter) = @_;

    my $string_sr = $filter->get_string_sr();

    # 1. placeholder for BN w/ empty string args: ',,'
    while ( ${$string_sr} =~ m/(\[.*?\])/g ) {    # see note about this regex in Consider.pm
        my $bn_match = $1;
        if ( $bn_match =~ m/[,]{2,}/ ) {
            my $bn_match_tmp = $bn_match;
            $bn_match_tmp =~ s/([,]{2,})/my $n=CORE::length("$1");"MULTI_COMMA_IN_BN_$n"/ge;
            ${$string_sr} =~ s/\Q$bn_match\E/$bn_match_tmp/;
        }
    }

    # 2. look for multi's
    if ( ${$string_sr} =~ s/(?:[.]{2,}|[,]{2,})/…/g ) {
        $filter->add_warning('multiple period/comma instead of ellipsis character');
    }

    # 3. restore placeholder
    ${$string_sr} =~ s/MULTI_COMMA_IN_BN_([0-9]+)/"," x "$1"/eg;

    # TODO: output,latin so this occurance is more rare:
    # if ( ${$string_sr} =~ s/([,.]{2,})/\[comment,should “$1” here be an ellipsis?\]/g ) {
    #     $filter->add_warning('multiple concurrent period and comma');
    # }

    if ( ${$string_sr} =~ s/^(|\xc2\xa0|\[output\,nbsp\])…/ …/ ) {
        $filter->add_warning('initial ellipsis should be preceded by a normal space');
    }

    # 1. placeholders for legit ones
    my %l;
    my $copy = ${$string_sr};
    if ( ${$string_sr} =~ s/((?:\x20|\xc2\xa0|\[output\,nbsp\])…[\!\?\.\:])$/ELLIPSIS_END/ ) {    # final
        $l{'ELLIPSIS_END'} = $1;
    }

    if ( ${$string_sr} =~ s/^( …(?:\x20|\xc2\xa0|\[output\,nbsp\]))/ELLIPSIS_START/ ) {           # initial
        $l{'ELLIPSIS_START'} = $1;
    }

    while ( ${$string_sr} =~ m/(\(|\x20|\xc2\xa0|\[output\,nbsp\])…(\)|\x20|\xc2\xa0|\[output\,nbsp\])/g ) {
        ${$string_sr} =~ s/(\(|\x20|\xc2\xa0|\[output\,nbsp\])…(\)|\x20|\xc2\xa0|\[output\,nbsp\])/ELLIPSIS_MEDIAL/;
        push @{ $l{'ELLIPSIS_MEDIAL'} }, [ $1, $2 ];
    }

    # 2. mark any remaining ones (that are not legit)
    if ( ${$string_sr} =~ s/\A …(?!\x20|\xc2\xa0|\[output\,nbsp\])/ … / ) {
        $filter->add_warning('initial ellipsis should be followed by a normal space or a non-break-space (in bracket notation or character form)');
    }

    if ( ${$string_sr} =~ s/…(?:\x20|\xc2\xa0|\[output\,nbsp\]|\s)+\z/…/ ) {
        $filter->add_warning('final ellipsis should be followed by a valid punctuation mark or nothing');
    }

    if ( ${$string_sr} =~ m/…\z/ && ${$string_sr} !~ m/(?:\x20|\xc2\xa0|\[output\,nbsp\])…\z/ ) {
        ${$string_sr} =~ s/…$/ …/;
        $filter->add_warning('final ellipsis should be preceded by a normal space or a non-break-space (in bracket notation or character form)');
    }

    my $medial_prob = 0;
    if ( ${$string_sr} =~ s/(.{1})((?:(?<!\x20)…|(?<!\xc2\xa0)…(?<!\[output\,nbsp\])…))(.{2})/$1 $2$3/g ) {
        $medial_prob++;
    }

    if ( ${$string_sr} =~ s/(.{2})…(?!\x20|\xc2\xa0|\[output\,nbsp\]|\z)(.{1})/$1… $2/g ) {
        $medial_prob++;
    }

    if ($medial_prob) {
        $filter->add_warning('medial ellipsis should be surrounded on each side by a parenthesis or normal space or a non-break-space (in bracket notation or character form)');
    }

    # 3. reconstruct the valid ones
    ${$string_sr} =~ s/ELLIPSIS_END/$l{'ELLIPSIS_END'}/     if exists $l{'ELLIPSIS_END'};
    ${$string_sr} =~ s/ELLIPSIS_START/$l{'ELLIPSIS_START'}/ if exists $l{'ELLIPSIS_START'};
    if ( exists $l{'ELLIPSIS_MEDIAL'} ) {
        for my $medial ( @{ $l{'ELLIPSIS_MEDIAL'} } ) {
            ${$string_sr} =~ s/ELLIPSIS_MEDIAL/$medial->[0]…$medial->[1]/;
        }
    }

    return $filter->return_value;
}

1;

__END__

=encoding utf-8

=head1 Normalization

=over 4

=item * It must be an ellipsis character (OSX: ⌥;).

=item * It must be surrounded by valid whitespace …

=item *  … except for a trailing ellipsis.

=back

Valid whitespace is a normal space or a non-break-space (literal (OSX: ⌥space) or via [output,nbsp]).

The only exception is that the initial space has to be a normal space (non-break-space there would imply formatting or partial phrase, ick).

=head2 Rationale

We want to be simple, consistent, and clear.

=over 4

=item * CLDR has 3 simple location based rules:

  initial:…{0}
  medial:{0}…{1}
  final:{0}…

Yet, English provides many more rules based on location in the text, purpose (show an omission, indicate a trailing off for various purposes), context (puntuation before or after?), and author’s whim.

Some are exact opposites and yet still valid either way.

So lets keep it simple.

=item * We are unlikely to be omitting things from a quote:

   The server said, “PHP […] is like training wheels without the bike.”.

Can be added later if necessary.

=item * We are unlikely to be implying a continuing thought:

   What can you do, you know how he is ….

Even if we were this form is still valid. So lets keep it consistent.

=item * We are not writing literature.

So lets keep it simple.

=item * The CLDR version leaves room for ambiguity:

   I drove the car…

Is that the first part of “I drove the car to the store.” or “I drove the carpet home and installed it.”?

So lets keep it clear.

=back

Tip: If you’re doing a single word(e.g. to indicate an action is happening) you might consider doing a non-break-space to the left of it:

    'Loading …' # i.e. Loading(OSX: ⌥-space)…

    'Loading[output,nbsp]…' # visually explicit

=head1 possible violations

None

=head1 possible warnings

=over 4

=item multiple period/comma instead of ellipsis character

We want an ellipsis character instead of 3 periods (or 2 periods, 4 or 5 periods, or commas (yes I’ve seen translators do ‘..’, ‘,,,,’, etc and after inquiring ‘…’ was the correct syntax)).

These will be turned into an ellipsis character.

=item initial ellipsis should be preceded by a normal space

The string is modified with a corrected version.

=item initial ellipsis should be followed by a normal space or a non-break-space (in bracket notation or character form)

The string is modified with a corrected version.

=item final ellipsis should be preceded by a normal space or a non-break-space (in bracket notation or character form)

The string is modified with a corrected version.

=item final ellipsis should be followed by a valid punctuation mark or nothing

The string is modified with a corrected version.

=item medial ellipsis should be surrounded on each side by a parenthesis or normal space or a non-break-space (in bracket notation or character form)

The string is modified with a corrected version.

=back