Assignment title: Information
The sinking of the Titanic is a famous event. You may find it useful to research the facts surrounding the sinking to inform your understanding of the problem and your interpretation of your analysis of the factors that determined the survival of passengers on the Titanic. Use the data mining tool RapidMiner to conduct an exploratory analysis of the titanic_train.csv data set, which is provided on the course study desk Assignment 2 folder link, and then build a simple predictive model of survival on the Titanic using a Decision Tree.

a) You need to identify five key variables that contribute most to determining the survival rate of passengers on the ill-fated Titanic on its maiden voyage. Note you should also refer to the data dictionary provided with the titanic3_train.csv file, which describes each of the variables and their range of values.

(Hint: an exploratory analysis should be based on summary statistics, histograms, crosstab tables and scatterplots of individual variables and of the relationship between individual variables and the target variable survived. Which variables are correlated with the target variable survived and with other variables?)

You might also need to consider reformatting some of the variables to facilitate the next stage of analysis of the titanic3_train.csv and titanic3_score.csv data sets using a Decision Tree. (Hint: you will need to convert the survived variable to a nominal variable with the values Yes = 1, No = 0 in titanic_train.csv.) See Data Mining for the Masses, Chapters 3 and 4, for guidance on Exploratory Data Analysis using RapidMiner.

Discuss each of your five top predictor variables and the results of your exploratory data analysis in general using the RapidMiner data mining tool. Also discuss how you dealt with missing data and unusual data, informed by relevant supporting literature on the survival rate of passengers on the Titanic.
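For readers who want to check their understanding of the two hints in sub task a) outside RapidMiner, the following is a minimal Python sketch of recoding survived from 0/1 to No/Yes and building a crosstab against one candidate predictor. It is an illustration only and not a substitute for the required RapidMiner analysis. The rows are invented toy values, not records from titanic_train.csv, and the function names recode_survived, crosstab and survival_rate are hypothetical helpers introduced here.

```python
from collections import Counter

# Toy rows invented for illustration only; NOT taken from titanic_train.csv.
rows = [
    {"pclass": 1, "sex": "female", "survived": 1},
    {"pclass": 1, "sex": "male", "survived": 0},
    {"pclass": 2, "sex": "female", "survived": 1},
    {"pclass": 3, "sex": "female", "survived": 1},
    {"pclass": 3, "sex": "male", "survived": 0},
    {"pclass": 3, "sex": "male", "survived": 0},
]

# Mapping given in the assignment hint: Yes = 1, No = 0.
LABELS = {0: "No", 1: "Yes"}


def recode_survived(records):
    """Return copies of the records with survived mapped from 0/1 to No/Yes."""
    return [{**r, "survived": LABELS[int(r["survived"])]} for r in records]


def crosstab(records, row_key, col_key="survived"):
    """Count records for each (row_key value, col_key value) pair."""
    return Counter((r[row_key], r[col_key]) for r in records)


def survival_rate(records, row_key):
    """Share of 'Yes' outcomes within each value of row_key."""
    counts = crosstab(records, row_key)
    totals = Counter(r[row_key] for r in records)
    return {value: counts[(value, "Yes")] / n for value, n in totals.items()}


recoded = recode_survived(rows)
by_sex = crosstab(recoded, "sex")
rates = survival_rate(recoded, "sex")

for (sex, outcome), count in sorted(by_sex.items()):
    print(f"{sex:<7} {outcome:<4} {count}")
print(rates)
```

The same crosstab can be repeated with "pclass" as row_key. In RapidMiner itself, the recoding step corresponds to changing the type of survived to nominal and setting it as the label before the Decision Tree is trained.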
Your discussion should also include appropriate statistical analysis results, such as graphs and results tables from conducting an exploratory data analysis in the RapidMiner data mining tool, with some supporting references on predictive model building and interpretation using Decision Trees in data mining (about 600 words).

The following table lists the data dictionary for the data set titanic_train.csv. (Note: titanic_score.csv is the same as titanic_train.csv but does not contain any values for the target variable survived, which is referred to as a label variable in RapidMiner.)

Variable     Description
pclass       Passenger Class (1 = 1st class; 2 = 2nd class; 3 = 3rd class)
survived     Survived (0 = No; 1 = Yes)
name         Name
Sex          Sex
Age          Age
sibsp        Number of Siblings/Spouses Aboard
parch        Number of Parents/Children Aboard
ticket       Ticket Number
fare         Passenger Fare
cabin        Cabin
embarked     Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat         Lifeboat
body         Body Identification Number
home.dest    Home/Destination

SPECIAL NOTES:

Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
Age is in years; fractional if age is less than one (1). If the age is estimated, it is in the form xx.5.
Fare is in pre-1970 British pounds (£). Conversion factors: £1 = 20s = 240d and 1s = 12d.

With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiancées Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them.
As well, some travelled with very close friends or neighbours in a village; however, the definitions do not support such relations.

STORY BEHIND THE DATA: This dataset is based on the Titanic Passenger List edited by Michael A. Findlay, originally published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, and expanded with the help of the internet community.

b) Build a model for predicting the survival of passengers on the Titanic using a decision tree in RapidMiner (see Chapter 10 of the Data Mining for the Masses textbook for guidance on Decision Trees in RapidMiner) using the two data sets, titanic3_train.csv and titanic3_score.csv. Then present and discuss the results of your Decision Tree analysis, together with a diagram showing your final Decision Tree. Comment on the relative predictive strength of this model and on what you believe are the most significant variables that determined whether a passenger on the Titanic survived or not. Include some supporting references on using Decision Trees in data mining (about 400 words).

Task 2 (Worth 25 marks) consists of the following two sub tasks

Big data is a hot topic and is generating enormous interest in industry and academia; however, there is no agreement on the definition of this term, and the application of big data analytics in practice is currently more hype than reality.
Your task is twofold:

a) Research and critically review the current literature available on the Internet and in academic journals and conferences, and provide a comprehensive definition and description of the term 'Big Data' that is underpinned and supported by the reference literature (approx. 500 words).

b) Research and critically review the current literature available on the Internet and in academic journals and conferences, and provide a comprehensive discussion describing one specific application of Big Data analytics in an industry sector. Emphasize how, in this specific application, Big Data analytics is providing business value to organisations in this industry sector (approx. 1000 words).

Your discussion and analysis here should be underpinned by an appropriate level of in-text referencing using Harvard Referencing Style.

Task 3 (Worth 25 marks) consists of the following sub tasks

Using the Excel file SalesSuperstore.xlsx, provided on the course study desk Assignment 2 folder link, and Tableau Desktop 8.3, produce the following four reports with appropriate accompanying graphs, each based on a Tableau workbook sheet view. Briefly comment on each report in about 125 words in terms of what trends and patterns are apparent in each report.

The SalesSuperstore.xlsx file contains the following dimensions and information:

1. Customer Name, Customer Segment
2. Location - Region, State, City, Zipcode
3. Product Category, Sub Category, Product Name, Product Container, Unit Price
4. Order Information
5. Shipping Information
6. Sales Information
7. Profit

a) Create a report and accompanying graph using Tableau that shows a trend analysis for sales by Product Category over the years 2009 to 2012, and comment on key trends and patterns apparent in this report (approx. 125 words).

b) Create a report and accompanying graph using Tableau that shows, for each Product Category, Average Profit and Total Sales for each month over the years 2009 to 2012, and comment on key trends and patterns apparent in this report (approx. 125 words).

c) Create a geographical map presentation using Tableau that shows graphically the relative size, by City within each State, of Product Sales for the year 2012, and comment on key trends and patterns in this report (approx. 125 words).

d) Create a report and accompanying graph using Tableau that shows, for Product Sub Categories that are technology based, Unit Prices, Sales and Profit for each month over the years 2009 to 2012, and comment on key trends and patterns in this report (approx. 125 words).

Your Assignment 2 report must be structured as follows, which is similar to the report structure detailed in Summers & Smith 2010:

Cover page for Assignment 2 report
1. Title Page
2. Table of Contents
3. Body of report: main sections and subsections for the Assignment 2 tasks and sub tasks, so 3.1 Task 1 will be a main heading with appropriate sub headings for each sub task, and so on
   3.1 Task 1
   3.2 Task 2 …
   3.3 Task 3 …
4. List of References
5. List of Appendices

You need to submit two files when you submit Assignment 2:

1. Your Assignment 2 Report for Tasks 1, 2 and 3 in Word document format with the extension .docx
2. Your Assignment 2 Task 3 as a Tableau packaged workbook with the extension .twbx

Use the following file naming convention:

1. Student_no_Student_name_CIS8008_Ass2.docx
2. Student_no_Student_name_CIS8008_Ass2.twbx