The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

Genetic Algorithm Optimizer for Meta Rules

Henry Stern
Anti-Spam Engineering
McAfee International
Alton House
Gatehouse Way
Aylesbury, Bucks
HP19 8YD

April 9, 2005

1.  WHAT IS IT?

This program is used to optimize phrase-based meta rules such as
ADVANCE_FEE and NIGERIAN for performance by selecting a subset of the
candidate body rules.

It selects rule sets based on combined hit rates, false positive rates and
a desired number of rules.

This program requres the GAUL: Genetic Algorithm Utility Library which can be
obtained from http://gaul.sourceforge.net/.  It is licensed under the GPL.

2.  OPTIONS

Config parameters:
  -h hits_file
	Path to the compressed matrix containing the rule hits.
	Default: hits.dat

  -r rules_fule
	Path to the file containing the rule names corresponding to columns		in the compressed matrix.
	Default: rules.dat

Fitness function parameters:
  -m maximum_relevant_hits
	Stop counting hits after seeing this many.  The rule analogue of this
	is:
	ADVANCE_FEE_1, ADVANCE_FEE_2, ..., ADVANCE_FEE_m
	Default: 4

  -t target_num_rules
	How many sub-rules should be used by the meta rule.
	Default: 50

  -l target_flex_rules
	Solutions with target_num_rules +/- target_flex_rules are half as
	fit as solutions with target_num_rules.
	Default: 5

  -e hits_exponent
	Parameter to the fitness function, how the importance of high numbers of
	hits is.
	Default: 3.0

  -p penalty_exponent
	If rules hit ham, this exponential penalty is applied based on the
	number of hits.
	Default: 9.0

GA parameters:
  -s population_size
	How many individuals should be used in the simulation.
	Default: 100

  -g max_generations
	How many geenrations that the simulation should run for.
	Default: 10000

  -x crossover_prob
	The probability of an allele-mixing cross-over.
	Default: 1.0

  -u mutation_prob
	The probability of a one-allele mutation.
	Default: 0.1

3.  HOW DOES IT WORK?

For every generation, "Parents" are selected based on their fitness.  Parents
are "mated" to produce two children.  The alleles on each chromosome are
randomly selected, which isn't very biologically plausible, but it works.

Fitness of an individual is evaluated based on hit rates, false positive rates
and how close the individual is to the target number of rules.

--
hs
9/5/2005