Assignment title: Information

Assignment #2
Due date: August 1
Total score: 100
This assignment can be completed by a team.
----------------------------------------------------------------------
In this assignment, you will use an evolutionary approach (either GA or GP) as a learning method to learn a valid mathematical expression for a given data set. This problem is also known as symbolic regression. To complete this assignment, carry out the following activities:

(1) Research existing systems that use either genetic programming or a genetic algorithm and have symbolic regression capability. Select one system and learn how to use it, or write your own system if you wish, to solve this regression problem. Note that you are NOT allowed to use any approach other than GP- or GA-based systems to complete this assignment.

(2) For a given data set consisting of value pairs (xi, yi) in a text file, "train.txt", where xi is a given input value and yi is the expected output for the input xi, find a mathematical model (function) h(x) that represents the data set, using the selected GP/GA system.

In this type of problem, the data set "train.txt" is often called a training data set, especially when an inductive learning approach is used, because the data in it is used to train the evolutionary system to learn the model. Inductive learning is one of the most commonly used machine learning approaches. It simulates the human learning process, e.g., learning by examples or from mistakes. Inductive learning generally requires two steps: training and testing (or verification). During the training step, the examples in a training data set ("train.txt" in this assignment) are provided to the learning system so that the system can learn (or build) a model, or find patterns, that represents those examples.
During the testing step, the model (or patterns) learned during the training step is verified for accuracy against a testing data set ("test.txt" in this assignment). The training and testing steps can be repeated, changing the necessary parameters or using different strategies, until the learned model reaches a satisfactory accuracy level before it is used for any real-world application.

(3) Once your system is running and has produced a result (a model) from the training data set, test the model against the test data set, "test.txt", to determine its accuracy. Both data sets, "train.txt" and "test.txt", will be posted later, when you are ready to run your system. To verify the accuracy of the best model discovered in the training step, you may use the following error function f(), or another function that calculates the accuracy level of a model. For example, if mi is the best model (function) learned by your system during the training step, the accuracy of the model against the testing data set can be calculated as follows:

f(mi) = Σ |yi - oi|

where yi is the expected output for the ith case, oi is the output from mi for the given xi value of the ith case, and the sum is taken over all cases in the testing data set. In other words, f(mi) is the cumulative error over all cases in the testing data set. This error can easily be calculated with a tool like Excel if your system does not support this verification feature.

(4) If you did not find a perfect model (a model with 0 error) during the training step, improve its accuracy by modifying system parameters such as the crossover rate, mutation rate, population size, or fitness function, by changing how new individuals are created in the initial population, or by making any other changes that you believe to be useful.
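The error function f() in item (3) can also be computed with a few lines of code instead of a spreadsheet. The following is a minimal sketch, not a required implementation. It assumes that each line of the data file holds one x value and one y value separated by whitespace or a comma (the actual format of "train.txt" and "test.txt" is not specified here), and the names parse_pairs, cumulative_abs_error, and candidate are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Case = Tuple[float, float]  # one (x, expected y) pair


def parse_pairs(text: str) -> List[Case]:
    """Parse one (x, y) pair per line.

    Assumption: values are separated by whitespace or a comma.
    Lines with fewer than two fields are skipped.
    """
    cases: List[Case] = []
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            continue
        cases.append((float(fields[0]), float(fields[1])))
    return cases


def cumulative_abs_error(model: Callable[[float], float],
                         cases: Iterable[Case]) -> float:
    """Return the sum of |y - model(x)| over all cases."""
    return sum(abs(y - model(x)) for x, y in cases)


def candidate(x: float) -> float:
    """Hypothetical model used only for illustration: 3x + 5 + x^2."""
    return 3 * x + 5 + x ** 2


if __name__ == "__main__":
    # Made-up sample data; replace with the contents of the real test file,
    # e.g. parse_pairs(open("test.txt").read()).
    sample = "0 5\n1 9\n2 16\n"
    print(cumulative_abs_error(candidate, parse_pairs(sample)))
```

An error of 0 on the testing data set corresponds to the "perfect model" mentioned in item (4). Running the same function on the training data gives a comparable number for the training step.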
(5) Write a brief report that summarizes your activities and results. It must include at least the following:

(a) The name(s) and contact email addresses of your team members and, if the assignment was completed by a team, each member's percentage contribution to this assignment. If a team cannot reach a consensus on the individual contributions, include each individual's claimed percentage contribution with a brief description of the specific tasks performed.

(b) The name of the evolutionary learning system used, with a brief description of the system's major capabilities, the implemented algorithm (e.g., is it based on GP or GA?), the parameter settings used to run the system, and the best model found. The required parameter information includes the function set, terminal set, fitness function, crossover rate, mutation rate, and population size. Optionally, you may also specify any other parameter settings you used to find your final model. The best model found should be given in the standard form of a mathematical expression, with its error information and a brief justification of why you think it is the best function you could find. Regarding the standard form: for example, if a LISP-style expression, (+ (* 3 x) 5 (** x 2)), is returned as the best model from your system, you may need to rewrite it (manually) in standard form as f(x) = 3x + 5 + x^2.

(c) The strategies you used to improve the accuracy of the model, explaining whether the accuracy was indeed improved.

(d) Optionally, retrospective comments about the system used, evolutionary approaches in general, etc.

How to submit
Upload your report in MS Word format to Titanium. Unless the source code is your own implementation, DO NOT include any source code. Instead, clearly specify the source of the program in the report if it is used for your research.
Grading criteria
The grade will be based largely on the quality of your report: how clearly it demonstrates your research activities, your outcomes, your overall effort, and your level of understanding of evolutionary computation in general and of the regression problem-solving process.
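For the standard-form requirement in item (5)(b), the rewrite from a LISP-style expression to infix notation can be checked mechanically. The following is a rough sketch under stated assumptions: it handles only the operators +, -, *, / and ** in the style of the example in (5)(b), renders any other operator as a function call, and uses hypothetical names (tokenize, parse, to_infix). The output of a real GP/GA system may use a different syntax.

```python
from typing import List, Union

Expr = Union[str, List["Expr"]]  # a symbol/number, or [operator, arg, ...]

# Assumed operator set, taken from the style of the handout's example.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "**": 3}


def tokenize(source: str) -> List[str]:
    """Split an s-expression into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Expr:
    """Consume tokens from the front and build a nested list."""
    token = tokens.pop(0)
    if token != "(":
        return token
    node: List[Expr] = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # drop the closing ")"
    return node


def to_infix(expr: Expr, parent_prec: int = 0) -> str:
    """Render a parsed expression in infix form, writing ** as ^."""
    if isinstance(expr, str):
        return expr
    op, args = expr[0], expr[1:]
    if op not in PRECEDENCE:
        # Unknown operator: render as a function call, e.g. sin(x).
        return f"{op}({', '.join(to_infix(a) for a in args)})"
    prec = PRECEDENCE[op]
    parts = [to_infix(a, prec) for a in args]
    if op == "-" and len(parts) == 1:
        text = "-" + parts[0]          # unary minus
    elif op == "**":
        text = "^".join(parts)
    else:
        text = f" {op} ".join(parts)
    # Parenthesize whenever the enclosing operator binds at least as tightly.
    # This may add redundant parentheses but never drops needed ones.
    return f"({text})" if prec <= parent_prec else text


if __name__ == "__main__":
    print(to_infix(parse(tokenize("(+ (* 3 x) 5 (** x 2))"))))
```

For the example in (5)(b) this prints 3 * x + 5 + x^2, which still needs light manual tidying to reach f(x) = 3x + 5 + x^2 in the report.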