NAME

psame - finds similarities between files or versions of files

SYNOPSIS

  psame [options] file1 file2
  psame [options] file
  psame [options] -r version file
  psame [options] -r version_a -r version_b file

The first usage compares the two files. The second compares the latest version from Subversion, CVS or RCS against the given file. The third compares the given version from Subversion, CVS or RCS against the given file. The fourth compares the two given versions of the file from Subversion, CVS or RCS.

By default, blank lines, whitespace and case are ignored when comparing. The output will be a side-by-side view of matching regions with a few lines of context.

MOTIVATION

psame was written to give the author an easy way to compare two pieces of text, in particular to find lines in one piece of text (generally from a file) that match lines in a second piece of text.

USES OF PSAME

Code comparison

The diff(1) command is excellent for finding differences between files, but sometimes similarity is more interesting. A common case is when a chunk of code is moved to another part of the same file. In that case, comparing the old and new versions of the file with diff will tell you that there has been a deletion of text and an insertion. psame, on the other hand, will tell you where the moved code is in the new version. In simple cases the output from diff is clear enough, but comparison with psame can help when there have been many edits.

DESCRIPTION

Options

-b, --dont-ignore-spaces

don't ignore changes in whitespace within a line

-i, --dont-ignore-case

don't ignore case when comparing lines

-B, --dont-ignore-blank-lines

don't ignore blank lines

-M <num>, --minimum-line-length <num>

ignore simple/short lines (i.e. those with fewer than <num> characters). If the -b flag is active, the line length is tested after removing whitespace. Default: no lines are considered too simple

-S <num>, --minimum-score <num>

only show matches with score higher than <num> (see the SCORE section below) - default 0

-y, --side-by-side

side-by-side match view (default)

-V, --vertical

vertical match view

-n, --show-only-non-matches

show non-matches instead of matches

-N, --show-non-matches

show matches and non-matches

-x <wid>, --terminal-width <wid>

set terminal width in columns (normally guessed)

-r <ver>, --revision <ver>, -r <ver> -r <ver>

compare with version(s) from SVN, CVS or RCS

-C <num>, --context <num>

number of lines of context - default 3

-m <mode>, --mode <mode>

use the <mode> to choose appropriate settings

text

choose settings suitable for text/documentation: -M 10 -S 4

code

choose settings suitable for code: -i -M 5 -S 0

-v, --version

display version number and exit

-h, -?, --help

show a usage message

MATCHES

A "match" is some number of consecutive lines in one file (or file version) that are similar to some number of consecutive lines in a second file (or file version). In the simplest case with no options specified, the lines in each file must be identical. As an example, consider these two pieces of text (with added line numbers):

text_1

 1.  The parrot sketch -
 2.   'E's kicked the bucket, 'e's
 3.   shuffled off 'is mortal coil, run
 4.   down the curtain and joined the
 5.   bleedin' choir invisible!
 6.   THIS IS AN EX-PARROT!

text_2

 1.   'E's kicked the bucket, 'e's
 2.   shuffled off 'is mortal coil, run
 3.   down the curtain and joined the
 4.   bleedin' choir invisible!
 5.
 6.   This is an ex-parrot!

Default settings

Using the default settings, psame will report this:

 match 2..6==1..6
   The parrot sketch -
    'E's kicked the bucket, 'e's      =  'E's kicked the bucket, 'e's
    shuffled off 'is mortal coil,     =  shuffled off 'is mortal coil,
    run down the curtain and joined   =  run down the curtain and joined
    the bleedin' choir invisible!     =  the bleedin' choir invisible!
                                      >
    THIS IS AN EX-PARROT!             =  This is an ex-parrot!

which indicates that five lines from text_1 (i.e. lines 2 to 6) match six lines from text_2 (i.e. lines 1 to 6). By default psame is case- and whitespace-insensitive, and blank lines are ignored when comparing files. The "=" symbol indicates a match between two lines. The ">" indicates that text_2 has an extra blank line that has been ignored during the comparison.

Case sensitivity and ignoring blank lines

Adding the -B parameter will produce this output:

 match 2..5==1..4
   The parrot sketch -
    'E's kicked the bucket, 'e's      =  'E's kicked the bucket, 'e's
    shuffled off 'is mortal coil, run =  shuffled off 'is mortal coil, run
    down the curtain and joined the   =  down the curtain and joined the
    bleedin' choir invisible!         =  bleedin' choir invisible!
    THIS IS AN EX-PARROT!
                                         This is an ex-parrot!
 match 6..6==6..6
    shuffled off 'is mortal coil, run    down the curtain and joined the
    down the curtain and joined the      bleedin' choir invisible!
    bleedin' choir invisible!
    THIS IS AN EX-PARROT!             =  This is an ex-parrot!

In this case blank lines are significant for the comparison. psame reports two distinct matches - one four lines long and the other one line long.

Adding the -i option as well will make psame respect case. Here is the output:

 match 2..5==1..4
  The parrot sketch -
   'E's kicked the bucket, 'e's      =  'E's kicked the bucket, 'e's
   shuffled off 'is mortal coil, run =  shuffled off 'is mortal coil, run
   down the curtain and joined the   =  down the curtain and joined the
   bleedin' choir invisible!         =  bleedin' choir invisible!
   THIS IS AN EX-PARROT!
                                        This is an ex-parrot!

Note that the "This is an ex-parrot!" line doesn't match now.
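
The three runs above can be reproduced with a small sketch. This is not psame's own code: it is a minimal Python illustration of "consecutive similar lines", and the names normalize and find_matches, their keyword arguments, and the choice to drop all whitespace before comparing are assumptions made for the example.

```python
import re

def normalize(line, ignore_case=True, ignore_space=True):
    """Reduce a line to the key used for comparison (assumed behaviour)."""
    if ignore_space:
        line = re.sub(r"\s+", "", line)  # assumption: all whitespace is dropped
    if ignore_case:
        line = line.lower()
    return line

def find_matches(a, b, ignore_case=True, ignore_space=True, ignore_blank=True):
    """Return maximal runs of equal lines as (a_start, a_end, b_start, b_end),
    using 1-based inclusive line numbers."""
    def prepare(lines):
        keyed = [(n, normalize(l, ignore_case, ignore_space))
                 for n, l in enumerate(lines, 1)]
        if ignore_blank:
            keyed = [(n, k) for n, k in keyed if k.strip()]
        return keyed

    xs, ys = prepare(a), prepare(b)
    # run[i][j] = length of the run of equal lines ending at xs[i-1], ys[j-1]
    run = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            if xs[i - 1][1] == ys[j - 1][1]:
                run[i][j] = run[i - 1][j - 1] + 1

    matches = []
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            n = run[i][j]
            # a run is maximal when it cannot be extended by one more line pair
            ends_here = i == len(xs) or j == len(ys) or run[i + 1][j + 1] == 0
            if n and ends_here:
                matches.append((xs[i - n][0], xs[i - 1][0],
                                ys[j - n][0], ys[j - 1][0]))
    return matches

TEXT_1 = ["The parrot sketch -",
          " 'E's kicked the bucket, 'e's",
          " shuffled off 'is mortal coil, run",
          " down the curtain and joined the",
          " bleedin' choir invisible!",
          " THIS IS AN EX-PARROT!"]
TEXT_2 = TEXT_1[1:5] + ["", " This is an ex-parrot!"]

print(find_matches(TEXT_1, TEXT_2))                      # [(2, 6, 1, 6)]
print(find_matches(TEXT_1, TEXT_2, ignore_blank=False))  # [(2, 5, 1, 4), (6, 6, 6, 6)]
print(find_matches(TEXT_1, TEXT_2, ignore_blank=False,
                   ignore_case=False))                   # [(2, 5, 1, 4)]
```

The second and third calls correspond to the -B and -B -i runs shown above. The table of run lengths needs memory proportional to the product of the two line counts, so the sketch is only practical for small inputs.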

NON-MATCHES

The -n flag will report lines in each file that don't match any lines in the other file. For example, running psame -n on the files above, with no other options, gives:

 non matches in text_1:
   1..1:
     The parrot sketch -

i.e. line 1 in text_1 doesn't occur anywhere in text_2.

In this case diff(1) will tell us the same thing but in other situations we only want to know about lines in file A that don't appear anywhere in file B. An example might be when modifying the order of sections in a manuscript - we would like to check that all sections are still present, even if in a different place.
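
The manuscript check described above can be sketched as follows. This is an illustration in Python rather than psame's implementation; the function names key and non_matches are hypothetical, and the comparison assumes the default behaviour (case, whitespace and blank lines ignored).

```python
import re

def key(line):
    """Comparison key: whitespace removed and case folded (assumed defaults)."""
    return re.sub(r"\s+", "", line).lower()

def non_matches(a, b):
    """Return 1-based inclusive ranges of lines in a that occur nowhere in b."""
    seen = {key(line) for line in b}
    ranges = []
    for n, line in enumerate(a, 1):
        k = key(line)
        if not k or k in seen:
            continue  # blank lines and lines present in b are not reported
        if ranges and ranges[-1][1] == n - 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current range
        else:
            ranges.append((n, n))
    return ranges

old_draft = ["Introduction", "Methods", "Results", "Discussion"]
new_draft = ["Results", "introduction", "", "Discussion"]

print(non_matches(old_draft, new_draft))  # [(2, 2)]
print(non_matches(new_draft, old_draft))  # []
```

Here the reordered draft has lost the "Methods" line, which is reported even though every other line has moved.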

SCORE

The score of a match is currently the total number of lines this match covers in both files. The -S option for filtering by score is useful for filtering out small matches so that the larger similarity can be seen.
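
As a rough illustration of the score and of the -S filter, the sketch below takes matches as (start_1, end_1, start_2, end_2) line ranges. The function names are hypothetical, and counting lines directly from the reported ranges is an assumption about how "covers" is meant.

```python
def score(match):
    """Total number of lines covered in both files (ranges are inclusive)."""
    a_start, a_end, b_start, b_end = match
    return (a_end - a_start + 1) + (b_end - b_start + 1)

def filter_by_score(matches, minimum_score=0):
    """Keep matches whose score is higher than minimum_score, as -S is described."""
    return [m for m in matches if score(m) > minimum_score]

# the two matches reported for the parrot texts with -B
matches = [(2, 5, 1, 4), (6, 6, 6, 6)]

print([score(m) for m in matches])  # [8, 2]
print(filter_by_score(matches, 4))  # [(2, 5, 1, 4)]
```

With a threshold of 4 (the value used by -m text), the one-line match is dropped and only the larger similarity remains.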

BUGS

None known

LIMITATIONS

The code works well with small input files (up to 10,000 lines or so), but is too slow and memory intensive for larger files.

TO DO

Output formatting should be done with Perl6::Form or some such and the output needs to be more readable. Suggestions are very welcome.

AUTHOR

Kim Rutherford <kmr+same@xenu.org.uk>

http://www.xenu.org.uk