To become familiar with the Algorithm::DecisionTree module:
(1) Run the
generate_training_data.pl
script to create your training data. First run the
script as it is; then make a copy of the param.txt
file, modify the copy as you wish, and run the
script again with your version of param.txt.
(2) Next run the
construct_dt_and_classify_one_sample.pl
script as it is.
Now modify the test sample in this script and see
what classification results you get for the new
test sample. Then run the script on the new
training datafile that you created yourself; in
that case, your test sample must use the feature
and value names that appear in your own parameter
file.
(3) If you are using a large number of features, or if the
features can take a very large number of values, the
tree could end up being much too large and much too
slow to construct unless you take care. To limit the
size of the tree, you may need to change the values
of the following constructor parameters in the
previous step:
max_depth_desired
entropy_threshold
The first parameter, max_depth_desired, controls
the depth of the tree from the root node, and the
second parameter, entropy_threshold, controls the
resolution in the entropy space. The smaller the
value for the first parameter and the larger the
value for the second parameter, the smaller the
decision tree. The largest possible value for
max_depth_desired is the number of features. Take
it down from there to make the tree smaller. The
smallest possible value for entropy_threshold is 0.
Take it up from there to make the tree smaller.
(4) Now run the test data generator script by invoking
generate_test_data.pl
As supplied, it will write out 20 samples for testing,
but you can set that number to anything you wish.
The test data is dumped into a file without the class
labels, since the classifier must not see them. The
class labels are dumped into a separate file whose name
you can specify in the script. As currently programmed,
the name of this file is
test_data_class_labels.dat
By comparing the class labels returned by the classifier
with the class labels in this file, you can assess the
accuracy of the classifier.
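The comparison can be scripted. The sketch below is illustrative
only: the label values ("benign"/"malignant") and the
one-label-per-line layout are assumptions, and the files actually
produced by the scripts may need different field handling.

```shell
# Fabricate a ground-truth file and a predictions file, one label per
# line, then compute the fraction of lines on which they agree.
printf 'benign\nmalignant\nbenign\nbenign\n'    > true_labels.txt
printf 'benign\nmalignant\nmalignant\nbenign\n' > predicted_labels.txt
paste true_labels.txt predicted_labels.txt |
    awk '$1 == $2 { hits++ } END { printf "accuracy = %.2f\n", hits/NR }'
# prints: accuracy = 0.75
```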
(5) Finally, run the classifier on the test datafile by
invoking
classify_test_data_in_a_file.pl training.dat testdata2.dat out.txt
Note carefully the three arguments you must supply to
the script: the first is the training datafile, the
second the test datafile, and the third the file in
which the classification results will be deposited.
=======================================================================
FOR USING A DECISION TREE CLASSIFIER INTERACTIVELY:
Starting with Version 1.6 of the module, you can use the
DecisionTree classifier in an interactive mode. In this
mode, after the decision tree has been constructed, the
user is prompted for answers to questions about the
feature tests at the nodes of the tree. Depending on the
answer supplied at a node, the classifier descends along
the corresponding branch to the next node, and so on.
To get a feel for using a decision tree in this mode,
examine the script
classify_by_asking_questions.pl
Execute the script as it is and see what happens.
=======================================================================
FOR THE CASE OF VERY LARGE DECISION TREES:
Large decision trees can take a very long time to create.
If that is the case with your application, having to
construct a decision tree afresh every time you want to
classify something can quickly become tiresome. Instead,
consider storing your decision tree in a diskfile and
using the disk-stored tree for your subsequent
classification work. The following scripts in this
directory:
store_dt_on_disk.pl
classify_from_disk_stored_dt.pl
show you how you can do that.