Assignment title: Information
Assignment #2
Due date: August 1
Total Score: 100
This assignment can be completed by a team.
---------------------------------------------------------------------------------------------------------------------------------------------------
In this assignment, you will use an evolutionary approach (either a genetic algorithm, GA, or genetic programming, GP)
as a learning method to learn a valid mathematical expression for a given data set. This problem is also known as a
symbolic regression problem. To successfully complete this assignment, conduct the following activities:
(1) Research existing systems that use either genetic programming or a genetic algorithm with symbolic regression
capability, select one system, and learn how to use it, or write your own system if you wish, to solve this
regression problem. Note that you are NOT allowed to use any approach other than GP- or GA-based systems to
complete this assignment.
(2) For a given data set of value pairs (xi, yi) in a text file, "train.txt", where xi is a given input value and yi is
the expected output for xi, find a mathematical model (function) h(x) that represents the data set
using the selected GP/GA system.
In this type of problem, the data set "train.txt" is often called a training data set, especially when an inductive
learning approach is used, because the data in it is used to train the evolutionary system to learn the model.
Inductive learning is one of the most commonly used machine learning approaches and simulates the human learning
process, e.g., learning by examples or mistakes. In general, inductive learning requires two steps:
training and testing (or verification). During the training step, examples in a training data set ("train.txt" in this
assignment) are provided to the learning system, which learns (or builds) a model (or finds patterns) that
represents those examples. During the testing step, the model (or patterns) learned in the
training step is verified for accuracy against a testing data set ("test.txt" in this assignment). The training and testing
steps can be repeated, changing parameters or strategies as needed, until the learned model reaches a
satisfactory accuracy level, before it is used in any real-world application.
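The repeat-until-satisfactory cycle described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `train_then_test`, the `learner` hook, and `placeholder_learner` are hypothetical names, the data is made up, and the placeholder is NOT a GP/GA system. In the assignment, `learner` would wrap a run of the selected GP/GA system.

```python
def total_abs_error(model, cases):
    """Cumulative absolute error of `model` over (x, y) cases."""
    return sum(abs(y - model(x)) for x, y in cases)


def train_then_test(learner, settings_list, train_cases, test_cases, tolerance=0.0):
    """Train once per settings entry; stop when the test error is acceptable.

    `learner(train_cases, settings)` is a hypothetical hook that returns a
    callable model. In the assignment it would wrap a GP/GA run.
    """
    history = []
    best_model, best_error = None, float("inf")
    for settings in settings_list:
        model = learner(train_cases, settings)       # training step
        error = total_abs_error(model, test_cases)   # testing step
        history.append((settings, error))
        if error < best_error:
            best_model, best_error = model, error
        if best_error <= tolerance:
            break
    return best_model, best_error, history


# Placeholder learner for illustration only (NOT a GP/GA system): it uses the
# slope named in `settings` and fits the intercept to the first training case.
def placeholder_learner(train_cases, settings):
    slope = settings["slope"]
    x0, y0 = train_cases[0]
    intercept = y0 - slope * x0
    return lambda x: slope * x + intercept


# Made-up data following y = 2x + 1.
train = [(0, 1), (1, 3), (2, 5)]
test = [(3, 7), (4, 9)]
model, error, history = train_then_test(
    placeholder_learner, [{"slope": 1}, {"slope": 2}, {"slope": 3}], train, test
)
print(error, len(history))  # prints "0 2": the second setting already fits
```

Keeping the history of settings and test errors makes it easier to report later which changes actually improved accuracy.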
(3) Once your system is running and has produced a result (a model) from the training data set, test the model with
the test data set, "test.txt", to determine its accuracy. Both data sets, "train.txt" and "test.txt", will be posted later when
you are ready to run your system. To verify the accuracy of the best model discovered in the training step,
you may use the following error function, f(), or another function that calculates the accuracy level of a model. For
example, if mi is the best model (function) learned by your system during the training step, the accuracy of the model
against the testing data set can be calculated as follows:
f(mi) = Σ |yi - oi|, summed over all cases in the testing data set, where yi is the expected output on the ith case and
oi is the output from mi for the given xi value on the ith case. In other words, f(mi) is the cumulative error from all
cases in the testing data set.
This error can be easily calculated using a tool like Excel if your system doesn't support this verification feature.
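If you prefer a script to a spreadsheet, the cumulative error can also be computed with a few lines of Python. The sketch below assumes each line of the data file holds one x value and one y value separated by whitespace or a comma. The actual layout of "test.txt" is not specified here, so adjust `parse_cases` as needed. The names `parse_cases`, `cumulative_error`, and `candidate`, as well as the sample values, are made up for illustration.

```python
def parse_cases(text):
    """Parse lines of 'x y' pairs (assumed layout) into (x, y) float tuples."""
    cases = []
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        if len(parts) >= 2:
            cases.append((float(parts[0]), float(parts[1])))
    return cases


def cumulative_error(model, cases):
    """f(m) = sum of |y_i - o_i| over all cases, where o_i = model(x_i)."""
    return sum(abs(y - model(x)) for x, y in cases)


# Hypothetical best model, written by hand from the system's output.
def candidate(x):
    return 3 * x + 5 + x ** 2


# Made-up sample; with a real file use: parse_cases(open("test.txt").read())
sample = "0 5\n1 9\n2 16\n"
print(cumulative_error(candidate, parse_cases(sample)))  # prints 1.0
```

In the sample, the model matches the first two cases exactly and misses the third by 1, so the cumulative error is 1.0.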
(4) If you didn't find a perfect model (a model with 0 error) during the training step, improve its accuracy by modifying
various system parameters such as crossover rate, mutation rate, population size, or fitness function, by modifying how
new individuals are created in the initial population, or by making any other changes that you believe to be
useful.
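To show where these parameters enter an evolutionary run, here is a deliberately small GA sketch in Python. It is not a full symbolic regression system: it assumes the model has the fixed form a*x^2 + b*x + c and evolves only the three coefficients, whereas a GP system would also evolve the structure of the expression. The function names, the default values, the tournament size of 3, and the training cases are all assumptions chosen for illustration.

```python
import random


def fitness(coeffs, cases):
    """Cumulative absolute error of a*x^2 + b*x + c; lower is better."""
    a, b, c = coeffs
    return sum(abs(y - (a * x * x + b * x + c)) for x, y in cases)


def evolve(cases, population_size=40, crossover_rate=0.8,
           mutation_rate=0.2, generations=60, seed=0):
    """Evolve [a, b, c]; returns (best individual, its cumulative error)."""
    rng = random.Random(seed)
    # Initial population: random coefficients in an assumed range.
    population = [[rng.uniform(-10, 10) for _ in range(3)]
                  for _ in range(population_size)]

    def score(individual):
        return fitness(individual, cases)

    best = min(population, key=score)
    for _ in range(generations):
        next_population = [best[:]]  # elitism: always keep the best so far
        while len(next_population) < population_size:
            # Tournament selection: best of 3 random individuals, twice.
            mother = min(rng.sample(population, 3), key=score)
            father = min(rng.sample(population, 3), key=score)
            if rng.random() < crossover_rate:
                # Uniform crossover: each gene comes from either parent.
                child = [rng.choice(pair) for pair in zip(mother, father)]
            else:
                child = mother[:]
            # Gaussian mutation, applied gene by gene.
            child = [gene + rng.gauss(0, 1) if rng.random() < mutation_rate else gene
                     for gene in child]
            next_population.append(child)
        population = next_population
        best = min(population, key=score)
    return best, score(best)


# Made-up training cases generated from y = x^2 + 3x + 5.
cases = [(x, x * x + 3 * x + 5) for x in range(-5, 6)]
for rate in (0.05, 0.2, 0.5):
    _, error = evolve(cases, mutation_rate=rate)
    print(f"mutation_rate={rate}: cumulative error {error:.3f}")
```

Running the same data with several values of one parameter while holding the others fixed, as in the final loop, is one simple way to document whether a change helped.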
(5) Write a brief report that summarizes your activities and results, including at least: (a) the name(s) and contact email
addresses of your team members and, if the assignment was completed by a team, each member's percentage
contribution to this assignment. If a team cannot reach a consensus on individual contributions, include each
individual's claimed percentage contribution with a brief description of the specific tasks performed; (b) the name of the
evolutionary learning system used, with a brief description of the system's major capabilities, the implemented
algorithm (e.g., is it based on GP or GA?), the parameter settings used to run the system, and the best model found.
The required parameter information includes the function set, terminal set, fitness function, crossover rate,
mutation rate, and population size. In addition to these parameters, you may optionally specify any other parameter
settings you used to find your final model. The best model found should be given as a standard mathematical
expression, with error information and a brief justification of why you think it is the best function you could find.
Regarding the standard form of expression: if, for example, a LISP-style expression, (+ (* 3 x) 5 (** x 2)), is returned
as the best model from your system, you may need to rewrite it (manually) as a standard mathematical expression,
f(x) = 3x + 5 + x^2; (c) the strategies you used to improve the accuracy of the model, explaining whether the accuracy
was indeed improved; (d) optionally, retrospective comments about the system used, evolutionary approaches in general, etc.
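For long models, rewriting a LISP-style expression by hand is error-prone, so a small converter can give you a starting point. The Python sketch below turns a prefix expression into a fully parenthesized infix string. It assumes every operator takes two or more operands and that `**` means exponentiation. The helper names are made up, and the output still needs manual simplification (for example, to 3x + 5 + x^2).

```python
def tokenize(source):
    """Split a LISP-style expression into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one expression from the front of `tokens` into nested lists."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return node
    return token


def to_infix(node):
    """Render a parsed prefix expression as an infix string."""
    if isinstance(node, str):
        return node
    operator, *operands = node
    symbol = "^" if operator == "**" else operator
    # Parenthesize every nested sub-expression so precedence is explicit.
    rendered = [to_infix(arg) if isinstance(arg, str) else "(" + to_infix(arg) + ")"
                for arg in operands]
    return f" {symbol} ".join(rendered)


print(to_infix(parse(tokenize("(+ (* 3 x) 5 (** x 2))"))))
# prints: (3 * x) + 5 + (x ^ 2)
```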
How to submit
Upload your report in MS Word format to Titanium. Unless the source code is your own implementation, DO NOT
include any source code. Instead, clearly state in the report the source of any program you used for your research.
Grading criteria
The grade will be based largely on the quality of your report and on how clearly it demonstrates your research activities,
outcomes, overall effort, and level of understanding of evolutionary computation in general and of the regression
problem-solving process.