Assignment title: Information
The sinking of the Titanic is a famous event. You may find it useful to research the facts surrounding the sinking to inform your understanding of the problem and your interpretation of the factors that determined the survival of passengers on the Titanic. Use the data mining tool RapidMiner to conduct an exploratory analysis of the titanic_train.csv data set, which is provided in the Assignment 2 folder on the course study desk, and then build a simple predictive model of survival on the Titanic using a Decision Tree.

a) Identify the five key variables that contribute most to determining the survival rate of passengers on the ill-fated Titanic on its maiden voyage. You should also refer to the data dictionary provided with the titanic3_train.csv file, which describes each of the variables and its range of values. (Hint: an exploratory analysis should be based on summary statistics, histograms, crosstab tables and scatterplots of individual variables, and of the relationship between individual variables and the target variable survived. Which variables are correlated with the target variable survived and with other variables?)

You might also need to reformat some of the variables to facilitate the next stage of the analysis of the titanic3_train.csv and titanic3_score.csv data sets using a Decision Tree. (Hint: you will need to convert the survival variable to a nominal variable with the values Yes = 1, No = 0 in titanic_train.csv.) See Data Mining for the Masses, Chapters 3 and 4, for guidance on exploratory data analysis using RapidMiner.

Discuss each of your five top predictor variables and the results of your exploratory data analysis in RapidMiner in general, as well as how you dealt with missing data and unusual data, informed by relevant supporting literature on the survival rate of passengers on the Titanic.
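The assignment itself is to be carried out in RapidMiner, but the logic behind the two hints (recoding the target as a nominal Yes/No label and cross-tabulating it against a candidate predictor) can be illustrated with a minimal Python sketch. The rows below are invented purely for illustration and are not taken from titanic_train.csv; the column names sex and pclass are assumptions about the data dictionary, and the helper names to_nominal, crosstab and survival_rate are hypothetical.

```python
from collections import Counter

# Toy rows invented for illustration only; NOT real records from titanic_train.csv.
# The columns "sex" and "pclass" are assumed names, not confirmed by the brief.
rows = [
    {"survived": 1, "sex": "female", "pclass": 1},
    {"survived": 0, "sex": "male", "pclass": 3},
    {"survived": 1, "sex": "female", "pclass": 2},
    {"survived": 0, "sex": "male", "pclass": 1},
    {"survived": 1, "sex": "male", "pclass": 1},
    {"survived": 0, "sex": "female", "pclass": 3},
]


def to_nominal(value):
    """Map the numeric survival flag to a nominal label (Yes = 1, No = 0)."""
    return "Yes" if int(value) == 1 else "No"


def crosstab(records, variable, target="survived"):
    """Count records for each (variable value, nominal target label) pair."""
    return Counter((r[variable], to_nominal(r[target])) for r in records)


def survival_rate(records, variable, target="survived"):
    """Share of records labelled Yes within each value of the variable."""
    totals = Counter(r[variable] for r in records)
    survived = Counter(
        r[variable] for r in records if to_nominal(r[target]) == "Yes"
    )
    return {value: survived[value] / totals[value] for value in totals}


table = crosstab(rows, "sex")
rates = survival_rate(rows, "sex")

for (value, label), count in sorted(table.items()):
    print(f"sex={value:<6} survived={label:<3} count={count}")
for value, rate in sorted(rates.items()):
    print(f"sex={value:<6} survival rate={rate:.2f}")
```

A variable whose survival rate differs markedly between its values is a candidate key predictor; the same comparison can be repeated for each variable in the data dictionary.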
Your discussion should also include appropriate statistical analysis results, such as graphs and results tables from your exploratory data analysis in RapidMiner, together with some supporting references on predictive model building and interpretation using Decision Trees in data mining (about 600 words). The following table lists the data dictionary for the data set titanic_train.csv. (Note: titanic_score.csv is the same as titanic_train.csv but does not contain any values for the target variable survived, which is referred to as a label variable in RapidMiner.)