NAME

movie.pl - A sample script to link actors through movies

DESCRIPTION

This sample script takes a database full of actors and movies, and creates the necessary framework for Algorithm::SixDegrees to link the actors through the movies.

The data source (and thus the script) expects the last name first. In other words, you can play "Six Degrees of Bacon, Kevin" with this.

FINDING ACTORS

If an actor is not found, the script searches the data source for names that begin with the input. If it finds exactly one match, it uses that name instead. For example, in my data source, Johnny Carson is actually represented as 'Carson, Johnny (I)'. Since he's the only one (there's no 'Carson, Johnny (II)'), entering 'Carson, Johnny' makes the script use 'Carson, Johnny (I)'. On the other hand, 'Smith, Will' gives the following note:

  No match for 'Smith, Will'.  Did you mean:
        'Smith, Will (I)' (career: 1992-2005)
        'Smith, Willetta' (career: 1953-1954)
        'Smith, William 'Smitty'' (career: 1990)
        ... (omitted for brevity) ...
        'Smith, Willis S.' (career: 1920)
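
The lookup rule described above can be sketched in a few lines of Perl. This is only an illustration under assumptions, not the code movie.pl actually uses: the sub name resolve_actor and the in-memory list @names are hypothetical stand-ins for the real database query, and the career years are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: resolve a user-supplied name against a list of known names.
# Returns (name, []) on an exact or unique prefix match, otherwise
# (undef, \@candidates) so the caller can print a "Did you mean" list.
sub resolve_actor {
    my ($input, $names) = @_;

    # An exact match wins outright.
    return ($input, []) if grep { $_ eq $input } @$names;

    # Otherwise treat the input as the start of the name.
    my @candidates = grep { index($_, $input) == 0 } @$names;
    @candidates = sort @candidates;

    return ($candidates[0], []) if @candidates == 1;
    return (undef, \@candidates);
}

# Made-up sample data standing in for the database.
my @names = (
    'Carson, Johnny (I)',
    'Smith, Will (I)',
    'Smith, Willetta',
    'Smith, Willis S.',
);

for my $input ('Carson, Johnny', 'Smith, Will') {
    my ($name, $candidates) = resolve_actor($input, \@names);
    if (defined $name) {
        print "Using '$name'\n";
    }
    else {
        print "No match for '$input'.  Did you mean:\n";
        print "\t'$_'\n" for @$candidates;
    }
}
```

A plain prefix comparison with index() is used here instead of a regular expression so that parentheses and other punctuation in names need no escaping.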

Also, the script is not smart enough to recognize different names for the same person. For example, in my data source, Charlie Chaplin is listed as 'Chaplin, Charles'; this sample will not know the two are the same.

MAKING A DATA SOURCE

A sample data source is at ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/. I grabbed the actors.list.gz and actresses.list.gz files from there.

I created a MySQL database table and some indexes:

  create database movact;
  grant all privileges on movact.* to movact identified by 'movact';
  use movact;
  create table movact ( actor varchar(128), movie varchar(128), year int );
  create index movact_actor on movact ( actor );
  create index movact_movie on movact ( movie );

(You may want to create the indexes after the data load instead of before; a bulk load into an unindexed table is usually faster.)

I then trimmed the header and footer from both data files and ran this Perl script on each of them to load the data into the database:

  #!/usr/bin/perl

  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('DBI:mysql:database=movact','movact','movact',{AutoCommit=>1})
      or die $DBI::errstr;
  my $sth = $dbh->prepare('INSERT INTO movact (movie, actor, year) VALUES (?,?,?)')
      or die $dbh->errstr;

  my $actor = '';
  while (<>) {
      chomp;
      # Each line is "actor<TAB>title"; continuation lines leave the actor blank
      my ($name, $title) = split(/\t+/,$_,2);
      $actor = $name if defined $name && $name =~ /\S/;
      next unless $title;
      next if $title =~ /\((TV|V|VG)\)/; # No TV movies / video-only movies / video games
      next if $title =~ /^"/; # No TV series
      # Trim everything after "(year)"; the year is 1000 if absent or ????
      my $year = 1000;
      if ($title =~ s/(\(((?:18|19|20)\d\d|\?\?\?\?)(?:\/(\w+))?\)).*/$1/) {
          $year = $2 unless $2 eq '????';
      }
      $sth->execute($title,$actor,$year) or die $sth->errstr;
  }

  $sth->finish;
  $dbh->disconnect;
  exit(0);

The database is thus prepared.
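
To see what the loader's filtering and year extraction do to individual titles, the per-title rules can be pulled out into a small function and tried without a database. This is a sketch only: the sub name parse_title and the sample titles are hypothetical, and the samples merely imitate the general shape of entries in the list files.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the loader's per-title rules.
# Returns (trimmed_title, year), or an empty list if the title is skipped.
sub parse_title {
    my ($title) = @_;

    return if $title =~ /\((TV|V|VG)\)/;  # TV movies, video-only, video games
    return if $title =~ /^"/;             # TV series start with a double quote

    my $year = 1000;                      # placeholder for a missing or unknown year
    # Cut everything after "(1995)", "(1995/I)" or "(????)".
    if ($title =~ s/(\(((?:18|19|20)\d\d|\?\?\?\?)(?:\/\w+)?\)).*/$1/) {
        $year = $2 unless $2 eq '????';
    }
    return ($title, $year);
}

# Made-up sample titles, for illustration only.
for my $raw (
    'Example Film (1995)  [Some Role]  <2>',
    'Unknown Year Film (????)',
    'Example Video (2001) (V)',
    '"Example Series" (1999)',
) {
    my ($title, $year) = parse_title($raw);
    if (defined $title) { print "$title => $year\n" }
    else                { print "skipped: $raw\n" }
}
```

The year is taken from the capture group only when the substitution succeeds, so a title without a recognizable year falls back to 1000 rather than picking up a value left over from an earlier match.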