psame - finds similarities between files or versions of files
  psame [options] file1 file2
  psame [options] file
  psame [options] -r version file
  psame [options] -r version_a -r version_b file
The first usage compares the two files. The second usage compares the latest version from Subversion, CVS or RCS against the given file. The third usage will compare the given version from Subversion, CVS or RCS against the given file. The fourth usage will compare the two versions of the given file from Subversion, CVS or RCS.
By default, blank lines, whitespace and case are ignored when comparing. The output will be a side-by-side view of matching regions with a few lines of context.
psame was written to give the author an easy way to compare two pieces of text, and in particular to find lines in one piece of text (generally from a file) that match lines in a second piece of text.
The diff(1) command is excellent for finding differences between files, but sometimes similarity is more interesting. A common case is when a chunk of code is moved to another part of the same file. In that case comparing the old and new versions of the file with diff will tell you that there has been a deletion of text and an insertion. psame, on the other hand, will tell you where moved code is in the new version. In simple cases, the output from diff is clear enough but comparison with psame can help in the cases where there have been many edits.
don't ignore changes in whitespace within a line
don't ignore case when comparing lines
don't ignore blank lines
ignore simple/short lines (i.e. those with fewer than <num> characters). If the -b flag is active, the line length is tested after removing whitespace. default: no lines are considered too simple
only show matches with score higher than <num> (see the SCORE section below) - default 0
side-by-side match view (default)
vertical match view
show non-matches instead of matches
show matches and non-matches
set terminal width in columns (normally guessed)
compare with version(s) from SVN, CVS or RCS
number of lines of context - default 3
use the <mode> to choose appropriate settings
choose settings suitable for text/documentation: -M 10 -S 4
choose settings suitable for code: -i -M 5 -S 0
Display version number and exit
show a usage message
A "match" is some number of consecutive lines in one file (or file version) that are similar to some number of consecutive lines in a second file (or file version). In the simplest case with no options specified, the lines in each file must be identical. As an example, consider these two pieces of text (with added line numbers):
  1. The parrot sketch -
  2. 'E's kicked the bucket, 'e's
  3. shuffled off 'is mortal coil, run
  4. down the curtain and joined the
  5. bleedin' choir invisible!
  6. THIS IS AN EX-PARROT!

  1. 'E's kicked the bucket, 'e's
  2. shuffled off 'is mortal coil, run
  3. down the curtain and joined the
  4. bleedin' choir invisible!
  5.
  6. This is an ex-parrot!
Using the default settings, psame will report this:
  match 2..6==1..6
    The parrot sketch -
    'E's kicked the bucket, 'e's       = 'E's kicked the bucket, 'e's
    shuffled off 'is mortal coil, run  = shuffled off 'is mortal coil, run
    down the curtain and joined the    = down the curtain and joined the
    bleedin' choir invisible!          = bleedin' choir invisible!
                                       >
    THIS IS AN EX-PARROT!              = This is an ex-parrot!
which indicates that there are five lines from text_1 (i.e. lines 2 to 6) that match six lines from text_2 (i.e. lines 1 to 6). By default psame is case and whitespace insensitive, and blank lines are ignored when comparing files. The "=" symbol indicates a match between two lines. The ">" indicates that text_2 has an extra blank line that has been ignored during the comparison.
Adding the -B parameter will produce this output:
  match 2..5==1..4
    The parrot sketch -
    'E's kicked the bucket, 'e's       = 'E's kicked the bucket, 'e's
    shuffled off 'is mortal coil, run  = shuffled off 'is mortal coil, run
    down the curtain and joined the    = down the curtain and joined the
    bleedin' choir invisible!          = bleedin' choir invisible!
    THIS IS AN EX-PARROT!
                                         This is an ex-parrot!
  match 6..6==6..6
    shuffled off 'is mortal coil, run    down the curtain and joined the
    down the curtain and joined the      bleedin' choir invisible!
    bleedin' choir invisible!
    THIS IS AN EX-PARROT!              = This is an ex-parrot!
In this case blank lines are significant for the comparison. psame reports two distinct matches - one four lines long and the other one line long.
Adding the -i option as well will make psame respect case. Here is the output:
  match 2..5==1..4
    The parrot sketch -
    'E's kicked the bucket, 'e's       = 'E's kicked the bucket, 'e's
    shuffled off 'is mortal coil, run  = shuffled off 'is mortal coil, run
    down the curtain and joined the    = down the curtain and joined the
    bleedin' choir invisible!          = bleedin' choir invisible!
    THIS IS AN EX-PARROT!
                                         This is an ex-parrot!
Note that the "This is an ex-parrot!" line doesn't match now.
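The three runs above can be reproduced with a short Python sketch. This is not psame's implementation (psame is written in Perl and its algorithm is not described here); the helper names keyed_lines and find_matches are hypothetical, and the sketch assumes that "ignoring whitespace" means deleting every whitespace character before comparing. It finds maximal runs of consecutive lines whose normalised forms are equal.

```python
TEXT_1 = (
    "The parrot sketch -\n"
    "'E's kicked the bucket, 'e's\n"
    "shuffled off 'is mortal coil, run\n"
    "down the curtain and joined the\n"
    "bleedin' choir invisible!\n"
    "THIS IS AN EX-PARROT!"
)
TEXT_2 = (
    "'E's kicked the bucket, 'e's\n"
    "shuffled off 'is mortal coil, run\n"
    "down the curtain and joined the\n"
    "bleedin' choir invisible!\n"
    "\n"
    "This is an ex-parrot!"
)


def keyed_lines(text, ignore_blank=True, ignore_case=True):
    """Return (line_number, key) pairs; the key is what gets compared."""
    pairs = []
    for number, line in enumerate(text.splitlines(), start=1):
        if ignore_blank and not line.strip():
            continue  # blank lines take no part in the comparison
        key = "".join(line.split())  # assumption: drop all whitespace
        if ignore_case:
            key = key.lower()
        pairs.append((number, key))
    return pairs


def find_matches(a, b):
    """Return maximal runs of equal keys as ((a_start, a_end), (b_start, b_end))."""
    matches = []
    prev = [0] * (len(b) + 1)  # run lengths for the previous line of a
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1][1] != b[j - 1][1]:
                continue
            cur[j] = prev[j - 1] + 1
            # Record the run only when the next pair of lines does not extend it.
            at_end = i == len(a) or j == len(b)
            if at_end or a[i][1] != b[j][1]:
                n = cur[j]
                matches.append(((a[i - n][0], a[i - 1][0]),
                                (b[j - n][0], b[j - 1][0])))
        prev = cur
    return matches


for (a0, a1), (b0, b1) in find_matches(keyed_lines(TEXT_1), keyed_lines(TEXT_2)):
    print(f"match {a0}..{a1}=={b0}..{b1}")  # match 2..6==1..6
```

Passing ignore_blank=False to both keyed_lines calls mimics the -B run and yields the two matches 2..5==1..4 and 6..6==6..6; adding ignore_case=False mimics -B -i and leaves only 2..5==1..4. The quadratic loop is chosen for clarity, not speed.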
The -n flag will report lines in each file that don't match any lines in the other file. For example, running psame -n on the files above, with no other options gives:
  non matches in text_1:
    1..1: The parrot sketch -

i.e. line 1 in text_1 doesn't occur anywhere in text_2.
In this case diff(1) will tell us the same thing but in other situations we only want to know about lines in file A that don't appear anywhere in file B. An example might be when modifying the order of sections in a manuscript - we would like to check that all sections are still present, even if in a different place.
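The manuscript check described above can be sketched in Python as a set lookup. This is an illustration only, not how psame implements -n: the function name non_matches and the sample section titles are hypothetical, and the sketch assumes the default behaviour of ignoring case, whitespace and blank lines.

```python
def non_matches(text_a, text_b):
    """Return (line_number, line) for lines of text_a found nowhere in text_b."""

    def key(line):
        # Assumption: compare with all whitespace removed and case folded.
        return "".join(line.split()).lower()

    keys_b = {key(line) for line in text_b.splitlines() if line.strip()}
    return [
        (number, line)
        for number, line in enumerate(text_a.splitlines(), start=1)
        if line.strip() and key(line) not in keys_b
    ]


old_draft = "Introduction\nMethods\nResults\nDiscussion"
new_draft = "Methods\nIntroduction\nDiscussion"

# "Results" was lost during the reordering; the moved sections are not reported.
print(non_matches(old_draft, new_draft))  # [(3, 'Results')]
```

Unlike a diff, the order of the lines in the second text plays no part in the result, which is what makes this view useful after sections have been shuffled.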
The score of a match is currently the total number of lines this match covers in both files. The -S option for filtering by score is useful for filtering out small matches so that the larger similarity can be seen.
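As a rough illustration of that definition, the following Python sketch computes a score from the two line ranges of a match and filters on it. The names match_score and filter_by_score are hypothetical, and the sketch assumes that every line inside a reported range counts towards the score, including blank lines that were ignored during comparison; psame itself may count differently.

```python
def match_score(range_a, range_b):
    """Total number of lines covered in both files (ranges are inclusive)."""
    (a_start, a_end), (b_start, b_end) = range_a, range_b
    return (a_end - a_start + 1) + (b_end - b_start + 1)


def filter_by_score(matches, threshold=0):
    """Keep only matches whose score is higher than the threshold."""
    return [m for m in matches if match_score(*m) > threshold]


# The two matches from the -B example: 2..5==1..4 and 6..6==6..6.
matches = [((2, 5), (1, 4)), ((6, 6), (6, 6))]
print([match_score(*m) for m in matches])     # [8, 2]
print(filter_by_score(matches, threshold=4))  # [((2, 5), (1, 4))]
```

With a threshold of 4 the single-line match is dropped, so only the larger region of similarity remains.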
None known
The code works well with small input files (up to 10,000 lines or so), but is too slow and memory intensive for larger files.
Output formatting should be done with Perl6::Form or some such and the output needs to be more readable. Suggestions are very welcome.
Kim Rutherford <kmr+same@xenu.org.uk>
http://www.xenu.org.uk