MIS772 Predictive Analytics Assignment A1 / Workshops M1T1-M1T4
Assignment A1 / Workshops M1T1-M1T4: R
Objectives
This assignment has associated workshops: three sessions, M1T1-M1T3, as well as M1T4.
By completing the workshops and the assignment, students will understand how to use R to
explore data, gain insights into the problem domain, create statistical classification
models and make predictions based on such models. The workshops will rely on students’
knowledge of methods and techniques introduced in a series of classes. Note that a partial
assignment solution must be formally submitted by its own deadline.
Methods
During the workshops (on-campus and on-cloud) students will work individually. They will
be given some tasks and will use R with RStudio to achieve them. Bringing your own laptop
with R and RStudio pre-installed is encouraged.
Prerequisites
Before attending the R workshops, students are required to be familiar with the class
readings and:
R Intro: Paul Torfs and Claudia Brauer (2014): A (very) short introduction to R,
Wageningen University, The Netherlands. Accessed Jan 2015. URL:
http://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf
Workshop Schedule
Activities and topics (no late arrivals for the on-campus sessions!):
Preparation
1. Install R and RStudio and test that both work.
M1T1 Data Exploration
2. The workshop facilitator will explain the mini case study.
Work in teams but submit individual and unique reports.
Improve your work iteratively. Work stage-by-stage.
Start by formulating a business problem (it may change later).
3. Explore the RStudio layout, find out where Help is, practise the R
basics, clean a data file and save it, manipulate and plot sample
data, read and write data files.
Learn about the problem area and the assignment data.
Download your data as CSV files, read them and explore your
data. Select variables as candidate targets and predictors. Plot
the selected variables to investigate their properties.
M1T2 Relationships and Data Prep
4. Study relationships between the selected variables using
correlation charts and tables. Ensure that predictors are
associated with targets, but that predictors are independent of
each other (likewise the targets). Deal with incorrect or missing
variable values. Consider transforming some variables to
improve their correlation. Visualise and report the results.
M1T3 Modelling
M1T4 Evaluation
5. Use the selected variables to create a regression model in R.
Consider whether or not any of the previously selected variables
should be dropped or new variables added to improve the model.
Evaluate your intermediate and final model performance.
Visualise and report all your results.
Report and Executive Summary
6. Prepare a report of your findings. Ensure that its executive
summary is aimed at management, includes an interpretation of
results and gives well-justified recommendations.
Submission
7. Individually submit your partial solution(s) and later the
final assignment via the CloudDeakin dropbox. Also submit all
data files used as well as all R code (.R files with comments).
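As an illustration of the correlation study in activity 4, the following minimal sketch
builds a correlation table and flags predictors that are strongly related to each other.
It is only a sketch under stated assumptions: the data are simulated (not World Bank
figures), the variable names y, x1, x2 and x3 are hypothetical, and the 0.8 cut-off is an
arbitrary choice rather than a requirement of this assignment.

```r
set.seed(42)
n <- 100
# Simulated data for illustration only (hypothetical variable names).
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # nearly a copy of x1 (a redundant predictor)
x3 <- rnorm(n)
y  <- 2 * x1 + x3 + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2, x3)

# Correlation table for all variables, rounded for readability.
cors <- cor(dat, use = "complete.obs")
print(round(cors, 2))

# Association of each predictor with the target.
predictors <- c("x1", "x2", "x3")
target_cor <- cors["y", predictors]
print(round(target_cor, 2))

# Predictor pairs that are strongly correlated with each other.
# The 0.8 threshold is an assumption made for this sketch.
pred_cor <- cors[predictors, predictors]
high <- which(abs(pred_cor) > 0.8 & upper.tri(pred_cor), arr.ind = TRUE)
redundant <- data.frame(
  var1 = rownames(pred_cor)[high[, "row"]],
  var2 = colnames(pred_cor)[high[, "col"]],
  r    = pred_cor[high]
)
print(redundant)

# Scatter-plot matrix as a simple correlation chart (interactive sessions only).
if (interactive()) pairs(dat)
```

In this simulated case x1 and x2 carry almost the same information, so one of them would
be a candidate for removal before modelling.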
Note: Demos of workshop activities are given in class and are video recorded.
Workshop activities support all assignment deliverables.
Mini Case Study
This mini case study will be used in all workshops of module 1, i.e. M1T1-M1T4. All
amendments, extensions and assumptions should be recorded in the final submission.
The World Bank approached you to assist in the identification of social and
environmental predictors of wealth and poverty. While the impact of national economic
development on people’s wealth is well known, the World Bank seeks to develop a model
that could predict the effects of social and environmental changes on the economic
well-being of people living in different countries.
They would also like to determine a course of action aimed at improving the situation in
the countries most affected by such changes.
In technical terms:
You have been asked to explore the World Bank web site to identify a number of social,
environmental, wealth, poverty and other indicators, some of which could be used as
predictors (independent variables) and others as targets (dependent variables) to predict
the national levels of wealth and/or poverty. You will need to justify your variable
selection. Then explore the selected variables’ characteristics and their relationships
using statistical methods and data visualisation in R. Subsequently, build a multiple
regression model and evaluate its performance. Interpret the generated results and use
them to suggest ways of reducing poverty (increasing wealth) across the globe.
Develop your solution in a team but report your unique insights individually.
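The modelling and evaluation steps described in the technical brief above could look
roughly like the sketch below. It is not a model answer: the data are simulated, the
column names (schooling, sanitation, noise_var, wealth) are hypothetical stand-ins for
whatever indicators you select, and the 70/30 train/test split is an assumption.

```r
set.seed(7)
n <- 120
# Simulated predictors and target (hypothetical names, not real indicators).
dat <- data.frame(
  schooling  = rnorm(n, 10, 2),
  sanitation = rnorm(n, 60, 15),
  noise_var  = rnorm(n)
)
dat$wealth <- 1.5 * dat$schooling + 0.2 * dat$sanitation + rnorm(n, sd = 2)

# Hold out 30% of the rows for testing (the split ratio is an assumption).
test_idx <- sample(n, size = round(0.3 * n))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Fit a multiple regression model on the training rows.
fit <- lm(wealth ~ schooling + sanitation + noise_var, data = train)
print(summary(fit))   # R-squared, F-statistic and coefficient p-values

# Try dropping a predictor expected to be uninformative and compare the models.
fit2 <- update(fit, . ~ . - noise_var)
print(anova(fit2, fit))

# Evaluate on unseen rows: RMSE and correlation of predicted vs actual values.
pred   <- predict(fit2, newdata = test)
rmse   <- sqrt(mean((test$wealth - pred)^2))
r_pred <- cor(test$wealth, pred)
cat("Test RMSE:", round(rmse, 2), " correlation:", round(r_pred, 2), "\n")
```

Evaluating on held-out rows gives a more honest picture of predictive ability than the
R-squared reported on the training data alone.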
Data:
The World Bank data bank:
http://data.worldbank.org/indicator
Assignment Submission
Repeating and highlighting main points / hints on the process:
First, define exactly what “wealth” and “poverty” mean and suggest 3-5 variables (or
their combinations) that fit your definitions. You may wish to explore all such variables
as candidate targets of your predictive model.
Second, conduct research and explore The World Bank’s data bank to identify up to 10-15
social and environmental (non-economic) national-level indicators which could be
used as candidate predictors of the previously selected targets.
Third, explore your predictor and target candidate variables in terms of their values,
distribution and their relationships. As you analyse the variables you are likely to
eliminate those that are not useful for further analysis, so the final number of selected
variables will be much smaller.
Note that some of the selected variables may need to be transformed before using them in
your modelling tasks. You may also have to deal with incorrect or missing values.
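The exploration, cleaning and transformation steps mentioned in these notes might be
sketched as follows. The data are simulated and the column names (income, literacy) are
hypothetical; dropping incomplete rows is only one of several ways of handling missing
values and is chosen here purely for brevity.

```r
set.seed(1)
# Simulated data (not World Bank figures): a right-skewed income-like variable.
dat <- data.frame(
  income   = exp(rnorm(50, mean = 8, sd = 1)),
  literacy = runif(50, min = 40, max = 100)
)
dat$literacy[c(3, 17)] <- NA      # pretend two values were not reported
dat$literacy[25] <- 250           # an impossible percentage (data entry error)

print(summary(dat))               # ranges, quartiles and NA counts

# Treat impossible values as missing, then count missing values per column.
dat$literacy[which(dat$literacy > 100)] <- NA
print(colSums(is.na(dat)))

# One simple option: keep only complete rows (imputation is another option).
clean <- dat[complete.cases(dat), ]

# A log transform often makes a right-skewed variable more symmetric.
clean$log_income <- log(clean$income)

# Compare the distributions before and after (interactive sessions only).
if (interactive()) {
  par(mfrow = c(1, 2))
  hist(clean$income, main = "income")
  hist(clean$log_income, main = "log(income)")
}
```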
Note that the national indicators are provided by The World Bank on its web site as a
great many files (in either CSV or Excel form), each describing a different aspect of a
nation. Download the most promising files and build new data sets (in Excel, for example)
which can subsequently be loaded into R. Make sure that the final data files are in the
form of comma separated values (CSV).
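Combining several indicator files can also be done in R itself. The sketch below assumes
each file has already been tidied to one row per country with a shared Country column;
the file names, column names and values are invented for illustration, and the files as
downloaded may have a different layout that needs cleaning first.

```r
# Hypothetical example: two tidy indicator files, one row per country.
tmp <- tempdir()
life_file <- file.path(tmp, "life_expectancy.csv")
gni_file  <- file.path(tmp, "gni_per_capita.csv")

# Create the two example files (invented values, for illustration only).
write.csv(data.frame(Country = c("A", "B", "C"),
                     LifeExp = c(71.2, 80.5, 64.3)),
          life_file, row.names = FALSE)
write.csv(data.frame(Country = c("A", "B", "D"),
                     GNI = c(5200, 41000, 1300)),
          gni_file, row.names = FALSE)

# Read each file and join on the shared country column.
life <- read.csv(life_file, stringsAsFactors = FALSE)
gni  <- read.csv(gni_file, stringsAsFactors = FALSE)

# all = TRUE keeps countries missing from one file (their values become NA).
combined <- merge(life, gni, by = "Country", all = TRUE)
print(combined)

# Save the combined data set as CSV for later workshops.
out_file <- file.path(tmp, "combined.csv")
write.csv(combined, out_file, row.names = FALSE)
```

Keeping unmatched countries (all = TRUE) makes missing values visible, so they can be
dealt with explicitly rather than silently dropped.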
Check the competency assessment criteria on the next page to see how you are going to
be assessed. Stick to the recommended process. Complete the basics first before moving
to the more advanced tasks or any extensions and research tasks.
Practical work is done in workshops or labs (at Deakin called “seminars”). Workshop
activities will directly contribute to the assignment work. Document your results and
findings as you go along, so that your report is always ready for submission. Submit each
stage of your project as your partial report as soon as the submission dropbox is made
available on CloudDeakin.
Use the form provided as a report template. It is essential that your final report fits into a
strict page limit (only pages within the limit will be marked).
Your deliverables must include the report, your CSV data files and all R files with code.
All submissions are to be done via CloudDeakin dropbox before the deadline.
There are separate deadlines for the first partial submission and the final submission.
Special consideration must be arranged well ahead of the deadline.
Special consideration will only take the form of reduced deliverables, not an extension.
Note that team work and collaboration are encouraged but plagiarism is penalised.
The CloudDeakin groups should be self-selected (you can also be a 1-member group).
Teams are to share ideas and help each other in solving technical problems.
Team discussion areas assist your lecturer in providing you with confidential feedback.
Ask your team for feedback before submitting your individual work.
Your assignment needs to be completed individually. Ensure that it is unique!
Competency Assessment Criteria
The work will be assessed against the competency criteria explained in lecture 1. Note
the column called “Unacceptable”: it indicates that the assignment must follow the
analytic process and its competency levels in order. Do not commence the more advanced
levels without first meeting the expectations of the lower levels (no points given).
Data files and R files with all code and clear comments must be supplied so that the
reported results can be (easily) reproduced.
Very important: read this very carefully!
Assessment levels: Exceptional (Extensions and Research); Meets Expectations, focus on
these (Based on Unit Teaching); Unacceptable.
Prepare Exec Summary (10 points; half a page limit; unacceptable: 0)
Exceptional: Clearly identify what kind of decisions are to be supported by the analytic
solution and what types of actions can be recommended by the system.
Meets expectations: Succinctly state a business problem (or question) and specify
requirements for its solution in terms of insights to be generated. Succinctly describe
the results (answer or solution) and justify. Provide references to the supporting
evidence, e.g. charts and plots.
Unacceptable: Not provided or incomprehensible. Solution not justified. No references to
the rest of the report (numbered figures).
Prepare Data (30 points; one page limit; unacceptable: 0)
Exceptional: Identify more variables, e.g. up to 3-5 candidate targets and 10-15
candidate predictors. Be selective in data visualisation.
Meets expectations: Understand what data is needed to solve the problem; select and
extract 1-2 candidate targets and 5-9 candidate predictors; explore and understand
characteristics of these variables, e.g. using scatter plots or line charts, histograms
or density curves, etc. Report all important insights.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R
code. Over page limit.
Discover Relationships (10 points; one page limit; unacceptable: 0)
Exceptional: Identify 2-3 targets and 5-10 predictors. Transform these variables if
needed. Note that it is likely that some variables will be eliminated in the process of
correlation analysis.
Meets expectations: Explore, visualise and understand correlation between candidate
variables; recommend and justify the selection of the most appropriate target variable
and a subset of predictors to build an analytic solution in terms of relationships
between them.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R
code. Over page limit.
Create Models (10 points; one page limit; unacceptable: 0)
Exceptional: Create 2-3 models, one for each target variable. Predict the likely values
of your target variables for all countries in year 2020.
Meets expectations: Build a multiple regression model. Optimise it with respect to
R-squared, F-ratios and coefficient p-values. Model optimisation will determine the final
variables. Briefly report intermediate steps taken and model characteristics.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R
code. Over page limit.
Evaluate & Improve (15 points; one page limit; unacceptable: 0)
Exceptional: Deal with extreme cases using Cook's distance. Eliminate multicollinearity
using VIF. Validate and test all models. Tabulate model performance before and after
optimisation.
Meets expectations: Validate and test the model for its ability to predict target values;
evaluate the model performance, e.g. in terms of accuracy, kappa, correlation of expected
and obtained results, dollar value of error, etc. Interpret and report the results.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R
code. Over page limit.
Provide Solution (10 points; one page limit; unacceptable: 0)
Exceptional: Evaluate the final model using cross-validation, bagging or boosting, plot
and interpret the model performance, e.g. using Gain, Lift, ROC or other appropriate
charts.
Meets expectations: Integrate all analytic elements into a process that could be used by
the client to solve the WB problem, i.e. to read and transform data, create and validate
the model, produce visualisations, tables and reports. Write the final report and
recommendations.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R
code. Over page limit.
Research & Extend (15 points; one page limit; unacceptable: 0)
Exceptional: Wow factor. Report new and surprising insights. Deliver professional
quality. Conduct independent research to determine if your predictions for 2020 confirm
or extend previously published results.
Meets expectations: Extend your work with features well beyond what was covered in class,
to improve the model and to present its results in the best way. Examples: report results
on Google Maps or in Leaflet. Use stunning visualisations. Apply logistic regression,
k-NN or Naïve Bayes models for additional insights.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R
code. Over page limit.
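The Cook's distance and VIF checks named under Evaluate & Improve could be approached
along the lines of the sketch below. It rests on several assumptions: the data are
simulated with one planted extreme case and one deliberately collinear predictor, the
4/n cut-off for Cook's distance is a common rule of thumb rather than a fixed standard,
and vif_manual is a hypothetical helper written here in base R (add-on packages offer
ready-made VIF functions, but none is assumed to be installed).

```r
set.seed(3)
n <- 80
x1 <- rnorm(n)
x1[1] <- 4                         # give case 1 an unusual predictor value
x2 <- x1 + rnorm(n, sd = 0.2)      # deliberately collinear with x1
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + x3 + rnorm(n)
y[1] <- y[1] + 15                  # and an extreme target value
dat <- data.frame(y, x1, x2, x3)

fit <- lm(y ~ x1 + x2 + x3, data = dat)

# Cook's distance: flag cases above the 4/n rule of thumb (an assumption).
cd <- cooks.distance(fit)
influential <- which(cd > 4 / n)
print(influential)

# VIF for predictor j is 1 / (1 - R^2) from regressing j on the other predictors.
# vif_manual is a hypothetical helper, not a function from any package.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_manual(dat, c("x1", "x2", "x3"))
print(round(vifs, 1))

# Refit without the flagged cases and without the redundant predictor x2.
keep <- setdiff(seq_len(n), influential)
fit_clean <- lm(y ~ x1 + x3, data = dat[keep, ])
print(summary(fit_clean)$r.squared)
```

Any case removed this way should be reported and justified, since an extreme country may
be a genuine observation rather than an error.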