Here is the outline:
The report will be graded according to the following:
This should be a short report, not an essay. Use as many tables, diagrams, and code examples as you can. It is not an ESSAY!
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
The Titanic has long been an interesting topic for historians and for researchers studying the safety measures that could have reduced the death toll. In this report I try to show that, using some data features of the passengers, a predictive model can predict whether a passenger lived or died.
As part of the machine learning assignment, I chose the Kaggle “Titanic: Machine Learning from Disaster” competition over the others. There were several reasons, but the main one was building a predictive model to predict whether a passenger on the Titanic would survive. It was interesting to predict the survival of the passengers from features such as Sex, Fare, Embarked, Age, Pclass, and SibSp.
Data Preparation 15%
Data preparation includes some exploration: the dimensions and factor levels, five-number summaries of the continuous variables, and so on.
Changing data types: by default, when we load the data file to create a table, every column is typed as “String”, but that is not how it should be. Hence, I cast each column to the type it requires.
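The actual notebook performs this casting through the Spark API. As an illustration of the idea only, here is a minimal plain-Python sketch; the column names follow the Kaggle Titanic schema, and the per-column cast rules are assumptions for the example:

```python
# Sketch: cast string-typed CSV fields to proper types.
# '' and 'NA' are treated as missing and become None.

def cast_row(row):
    """Cast a dict of raw string values to typed values."""
    def to_float(s):
        return float(s) if s not in ("", "NA") else None
    def to_int(s):
        return int(s) if s not in ("", "NA") else None
    return {
        "PassengerId": to_int(row["PassengerId"]),
        "Pclass": to_int(row["Pclass"]),
        "Sex": row["Sex"],
        "Age": to_float(row["Age"]),
        "SibSp": to_int(row["SibSp"]),
        "Parch": to_int(row["Parch"]),
        "Fare": to_float(row["Fare"]),
        "Embarked": row["Embarked"] or None,
    }

raw = {"PassengerId": "1", "Pclass": "3", "Sex": "male", "Age": "22",
       "SibSp": "1", "Parch": "0", "Fare": "7.25", "Embarked": "S"}
typed = cast_row(raw)
```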
The first step was cleaning the training and test data sets.
Handling missing data: for the missing values in the training and test data sets, one option is to compute the median of each data set separately and impute with it; another is to combine the two data sets and impute with the global median. The median age of the train data set is 28, while the median age of the test data set is 27.
Before casting, Embarked had four levels, including ‘NA’. After casting, Embarked is a factor with three levels: "C", "Q", "S". That means the NA level has been removed.
Values were missing in Age, Fare, and Cabin. However, I decided to drop the Cabin feature, as its importance for predicting a passenger’s survival is almost zero compared with features such as Age, Sex, Pclass, Embarked, and SibSp.
First I calculated the median Age and median Fare in the train and test data sets. I could have imputed the medians for Age and Fare separately in each data set, but instead I created a combined data set called “Titanic.full”, calculated the global median for Age and for Fare, and replaced the missing values in “Titanic.full”. After replacing the missing Age and Fare values, I split “Titanic.full” back into the train and test data sets.
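The combine, impute-with-global-median, and split-back steps can be sketched as follows; this is a plain-Python illustration with toy Age values standing in for the real columns (the actual work is done through the Spark API):

```python
from statistics import median

# Toy Age columns standing in for the train and test sets (None = missing).
train_age = [22.0, None, 28.0, 35.0, None, 30.0]
test_age  = [None, 27.0, 26.0, None, 28.0]

# Combine both sets ("Titanic.full"), compute one global median,
# fill the gaps, then split back into train and test.
full = train_age + test_age
global_median = median(a for a in full if a is not None)
filled = [a if a is not None else global_median for a in full]

train_filled = filled[:len(train_age)]
test_filled  = filled[len(train_age):]
```

The same pattern applies to Fare; only the column changes.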
There are 13 columns in the train data set and 12 columns in the test data set. To make the column counts equal, I added a “Survived” column to the test data set.
Also, to turn the model into a binary classification and to remove “NA” values from the dataset, I took the approach of casting the target as a factor. Because the model depends on categorical inputs, I converted the character columns into factors using the “as.factor” command.
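The idea behind “casting as a factor” is mapping each character column to a fixed set of levels with integer codes. A minimal plain-Python sketch of that idea (R’s `as.factor` and Spark’s `StringIndexer` do this for real; the function name here is hypothetical):

```python
def as_factor(values):
    """Map a character column to integer codes; missing values (None) stay missing."""
    levels = sorted(set(v for v in values if v is not None))
    codes = {level: i for i, level in enumerate(levels)}
    return levels, [codes[v] if v is not None else None for v in values]

# Toy Embarked column with one missing value.
embarked = ["S", "C", "Q", None, "S"]
levels, encoded = as_factor(embarked)
```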
Data Statistics
Dataset statistics    Test data set    Train data set
Number of rows        418              891
Number of columns     12               13
Number of features    6                6
Stratified Sampling - splits automatically on a given level
Here the Titanic data set is already divided into a train data set and a test data set, so there is no requirement for stratified sampling in this case. In stratified sampling, one divides the population into separate groups called strata. In general, stratified sampling is used to divide a data set into train and test sets, typically in a 70:30 ratio.
Stratified sampling has some advantages over simple random sampling. It can reduce the sample size required to achieve a given precision, or increase the precision achievable with a given sample size. It also guards against an “unrepresentative” sample, and it ensures sufficient sample points to support a separate analysis of any subgroup.
On the other hand, its main disadvantage is that it may require more administrative effort than simple random sampling.
In the Titanic Kaggle competition, the data is divided into a training data set and a test data set by default, so stratified sampling is not required in this task.
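Although it was not needed here, a 70:30 stratified split can be sketched in a few lines of plain Python; the function name and toy data are assumptions for illustration (Spark offers `sampleBy` for the same purpose):

```python
import random

def stratified_split(rows, key, train_frac=0.7, seed=42):
    """Split rows into train/test, preserving the proportion of each stratum."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(row[key], []).append(row)
    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy population: 60 survivors, 40 non-survivors.
rows = ([{"Survived": 1} for _ in range(60)]
        + [{"Survived": 0} for _ in range(40)])
train, test = stratified_split(rows, "Survived")
```

Each stratum contributes 70% of its rows to the train set, so the 60:40 class balance is preserved in both splits.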
Tool used:
1) Importing libraries
2) Importing Train data file:
3) Importing Test data set:
4) Evaluation of the decision tree model:
A simple and short exploration of the data. Databricks has a nice visualisation tool that can help with illustration.
Display of the data:
Dropping some unwanted attributes from the dataset:
Schema:
Visualise our data:
Summary statistics:
For a Kaggle competition, this could involve all feature engineering. This includes dropping features and transforming features, but you must include a reason for why you have done it, not just “well, I thought it was useless”.
Feature engineering:
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. By contrast, the technique of automatically transforming raw input data into a representation that can be effectively exploited in machine learning tasks is called feature learning.
There are several attributes in the training and test data sets: PassengerId, Age, Sex, Ticket, Cabin, Fare, Embarked, Pclass, Name, Parch, and SibSp. These attributes can be used in building a predictive model, but not all of them serve that purpose: some are helpful for building a model and some are not. Any attribute can be a feature if it helps the model; a feature is a characteristic that helps when solving the problem.
The features in the data are important for predictive modelling. The quality and quantity of the features have a great influence on whether the predictive model is good or not. Simply put, the better the features, the better the result. However, the quality of the model does not depend only on the data and the algorithm; choosing the right features is still very important. In feature engineering, features that expose the model to a risk of over-fitting are typically discarded. There are several features that I decided to drop from the predictive model formula.
The features used to build the predictive model (to predict Survived) were Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked. In my opinion, these features have an impact on the survival of the passengers, so it was worth building a model based on them.
The remaining features, such as Cabin, Name, and Ticket, have little impact on whether a passenger will survive or not.
Implementation: 28%
This is your solution, algorithm, parameters and features.
Kaggle leaderboard position:
Kaggle scorecard:
Decision trees:
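The model itself is trained with Spark’s decision tree implementation. To illustrate the core mechanism a decision tree uses, here is a plain-Python sketch of the Gini-impurity split criterion on a toy Sex column; the function names and six-row dataset are assumptions for the example:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def split_gain(rows, feature, value, label="Survived"):
    """Impurity reduction from splitting rows on feature == value."""
    left  = [r[label] for r in rows if r[feature] == value]
    right = [r[label] for r in rows if r[feature] != value]
    labels = [r[label] for r in rows]
    n = len(rows)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Toy data: women survive more often than men.
rows = [
    {"Sex": "female", "Survived": 1}, {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0}, {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},   {"Sex": "male", "Survived": 1},
]
gain = split_gain(rows, "Sex", "female")
```

The tree grows by repeatedly choosing the feature/value split with the highest gain; a positive gain on Sex is exactly why it ends up near the root of a Titanic tree.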
Model evaluation: 22%
This is how you evaluate: do you use the right evaluation metric or technique, and do you apply the right measure?
Examples of evaluation techniques:
Evaluation
Classification
Confusion Matrix
Precision (Positive Predictive Value)
Recall (True Positive Rate)
F-measure
Receiver Operating Characteristic (ROC)
Area Under ROC Curve
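The classification metrics above all derive from the confusion matrix. A self-contained plain-Python sketch with toy predictions (Spark’s evaluators compute the same quantities on real predictions):

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts plus precision, recall and F-measure."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Toy predictions for ten passengers (1 = survived).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
m = classification_metrics(y_true, y_pred)
```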
Regression
Mean Squared Error (MSE)
Root Mean Squared Error & Mean Absolute Error
Coefficient of Determination (R2)
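The regression metrics listed above can likewise be sketched in a few lines of plain Python; the four-point toy data is an assumption for illustration:

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and coefficient of determination (R^2)."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot else 0.0
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae, "r2": r2}

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
m = regression_metrics(y_true, y_pred)
```

Note these apply to regression targets; for the binary Survived label, the classification metrics are the right choice.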
Discussion: 10%
This is one or two paragraphs on why you selected your approach and why you think it is good/bad - that's all.
YOU CAN ONLY USE THE SPARK SDK/API.
Basically, we have a Titanic dataset which is divided into train and test datasets. We have to build a predictive model to predict whether a passenger will survive or not.
/FileStore/tables/72j3kcfj1492026805967/train.csv
/FileStore/tables/72j3kcfj1492026805967/test.csv
/FileStore/tables/brr6pab81492644019874/test.csv
/FileStore/tables/brr6pab81492644019874/train.csv