Assignment title: Information


MIS772 Predictive Analytics
Assignment A1 / Workshops M1T1-M1T4: R

Objectives

This assignment has associated workshops: three sessions, M1T1-M1T3, followed by M1T4. By completing the workshops and the assignment, students will understand how to use R to explore data, gain insights into the problem domain, create statistical classification models and make predictions based on such models.

Methods

The workshops rely on students' knowledge of the methods and techniques introduced in a series of classes. Note that a partial assignment solution needs to be formally submitted by its own deadline. During the workshops (on-campus and on-cloud) students will work individually. They will be given tasks and will use R with RStudio to complete them. Bringing your own laptop with R and RStudio pre-installed is encouraged.

Prerequisites

Before attending the R workshops, students are required to be familiar with the class readings and:

- R Intro: Paul Torfs and Claudia Brauer (2014): A (very) short introduction to R, Wageningen University, The Netherlands. Accessed Jan 2015. URL: http://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf

Workshop Schedule

Activities by topic (no late arrivals for the on-campus sessions!)

Preparation
1. Install R and RStudio, and test that both work.

M1T1 Data Exploration
2. The workshop facilitator will explain the mini case study. Work in teams but submit individual and unique reports. Improve your work iteratively. Work stage-by-stage. Start by formulating a business problem (it may change later).
3. Explore the RStudio layout, find out where Help is, exercise the R basics, clean a data file and save it, manipulate and plot sample data, read and write data files. Learn about the problem area and the assignment data. Download your data as CSV files, read them and explore your data. Select variables as candidate targets and predictors. Plot the selected variables to investigate their properties.

M1T2 Relationships and Data Prep
4. Study relationships between the selected variables using correlation charts and tables. Ensure that predictors are associated with targets, but that predictors are independent of each other; the same applies to the targets. Deal with incorrect or missing variable values. Consider transforming some variables to improve their correlation. Visualise and report the results.

M1T3 Modelling, M1T4 Evaluation
5. Use the selected variables to create a regression model in R. Consider whether any of the previously selected variables should be dropped, or new variables added, to improve the model. Evaluate your intermediate and final model performance. Visualise and report all your results.

Report and Executive Summary
6. Prepare a report of your findings. Ensure that its executive summary is aimed at management, includes an interpretation of the results and gives well-justified recommendations.

Submission
7. Individually submit your partial solution(s) and later the final assignment via the CloudDeakin dropbox. Also submit all data files used as well as all R code (.R files with comments).

Note: Demos of workshop activities are given in class and are video recorded. Workshop activities support all assignment deliverables.

Mini Case Study

This mini case study will be used in all workshops of module 1, i.e. M1T1-M1T4. All amendments, extensions and assumptions should be recorded in the final submission.

The World Bank has approached you to assist in the identification of social and environmental predictors of wealth and poverty. While the impact of national economic development on people's wealth is well known, the World Bank seeks to develop a model that could predict the effects of social and environmental changes on the economic well-being of people living in different countries. They would also like to determine a course of action aimed at improving the situation in the countries most affected by such changes.

In technical terms: You have been asked to explore the World Bank web site to identify a number of social, environmental, wealth, poverty and other indicators, some of which could be used as predictors (independent variables) and others as targets (dependent variables) to predict the national levels of wealth and/or poverty. You will need to justify your variable selection. Then explore the selected variables' characteristics and their relationships using statistical methods and data visualisation in R. Subsequently, build a multiple regression model and evaluate its performance. Interpret the generated results and use them to suggest ways of reducing poverty (increasing wealth) across the globe. Develop your solution in a team but report your unique insights individually.

Data: The World Bank data bank: http://data.worldbank.org/indicator

Assignment Submission

Repeating and highlighting main points / hints on the process:

First, define what exactly "wealth" and "poverty" are and suggest up to 3-5 variables (or their combination) that fit your definitions. You may wish to explore all such variables as candidate targets of your predictive model.

Second, conduct research and explore The World Bank's data bank to identify up to 10-15 social and environmental (non-economic) national-level indicators which could be used as candidate predictors of the previously selected targets.

Third, explore your predictor and target candidate variables in terms of their values, distribution and relationships. As you analyse the variables you are likely to eliminate those that are not useful for further analysis, so the final number of selected variables will be much smaller. Note that some of the selected variables may need to be transformed before you use them in your modelling tasks. You may also have to deal with incorrect or missing values.

Note that the national indicators are provided by The World Bank on its web site as a great many files (in either CSV or Excel form), each describing a different aspect of a nation. Download the most promising files and build new data sets (in Excel, for example) which can subsequently be loaded into R. Make sure that the final data files are in the form of comma-separated values (CSV).

Check the competency assessment criteria on the next page to see how you are going to be assessed. Stick to the recommended process. Complete the basics first before moving to the more advanced tasks or any extensions and research tasks.

Practical work is done in workshops or labs (at Deakin called "seminars"). Workshop activities will directly contribute to the assignment work.

Document your results and findings as you go along, so that your report is always ready for submission. Submit each stage of your project as your partial report as soon as the submission dropbox is made available on CloudDeakin.

Use the form provided as a report template. It is essential that your final report fits into a strict page limit (only pages within the limit will be marked). Your deliverables must include the report, your CSV data files and all R files with code.

All submissions are to be made via the CloudDeakin dropbox before the deadline. There is one deadline for the first partial submission and another for the final submission. Special consideration must be arranged well ahead of the deadline. The only form of special consideration will be reduced deliverables, not an extension.

Note that teamwork and collaboration are encouraged but plagiarism is penalised. The CloudDeakin groups should be self-selected (you can also be a 1-member group). Teams are to share ideas and help each other in solving technical problems. Team discussion areas assist your lecturer in providing you with confidential feedback. Ask your team for feedback before submitting your individual work.

Your assignment needs to be completed individually. Ensure that it is unique!

Competency Assessment Criteria

Very Important: Read this very carefully!

The work will be assessed based on the competency criteria explained in lecture 1. Note the column called "Unacceptable", which indicates that the assignment should be done along the analytic process and its competency levels. Do not commence the more advanced levels without first meeting the lower levels' expectations (no points given). Data files and R files with all code and clear comments must be supplied so that the reported results can be (easily) reproduced.

Each criterion below lists three levels: Exceptional (Extensions and Research), Meets Expectations (Focus on These; Based on Unit Teaching) and Unacceptable.

Prepare Exec Summary (marks: 10 to 0; half-page limit)
Exceptional: Clearly identify what kind of decisions are to be supported by the analytic solution and what types of actions can be recommended by the system.
Meets Expectations: Succinctly state a business problem (or question) and specify requirements for its solution in terms of insights to be generated. Succinctly describe the results (answer or solution) and justify them. Provide references to the supporting evidence, e.g. charts and plots.
Unacceptable: Not provided or incomprehensible. Solution not justified. No references to the rest of the report (numbered figures).

Prepare Data (marks: 30 to 0; one-page limit)
Exceptional: Identify more variables, e.g. up to 3-5 candidate targets and 10-15 candidate predictors. Be selective in data visualisation.
Meets Expectations: Understand what data is needed to solve the problem; select and extract 1-2 candidate targets and 5-9 candidate predictors; explore and understand the characteristics of these variables, e.g. using scatter plots or line charts, histograms or density curves, etc. Report all important insights.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R code. Over page limit.

Discover Relationships (marks: 10 to 0; one-page limit)
Exceptional: Identify 2-3 targets and 5-10 predictors. Transform these variables if needed. Note that it is likely that some variables will be eliminated in the process of correlation analysis.
Meets Expectations: Explore, visualise and understand the correlation between candidate variables; recommend and justify the selection of the most appropriate target variable and a subset of predictors to build an analytic solution in terms of the relationships between them.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R code. Over page limit.

Create Models (marks: 10 to 0; one-page limit)
Exceptional: Create 2-3 models, one for each target variable. Predict the likely values of your target variables for all countries in year 2020.
Meets Expectations: Build a multiple regression model. Optimise it with respect to R-squared, F-ratios and coefficient p-values. Model optimisation will determine the variables. Briefly report the intermediate steps taken and the model characteristics.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R code. Over page limit.

Evaluate & Improve (marks: 15 to 0; one-page limit)
Exceptional: Deal with extreme cases using Cook's distance. Eliminate multicollinearity using VIF. Validate and test all models. Tabulate model performance before and after optimisation.
Meets Expectations: Validate and test the model for its ability to predict target values; evaluate the model performance, e.g. in terms of accuracy, kappa, correlation of expected and obtained results, dollar value of error, etc. Interpret and report the results.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R code. Over page limit.

Provide Solution (marks: 10 to 0; one-page limit)
Exceptional: Evaluate the final model using cross-validation, bagging or boosting; plot and interpret the model performance, e.g. using Gain, Lift, ROC or other appropriate charts.
Meets Expectations: Integrate all analytic elements into a process that could be used by the client to solve the WB problem, i.e. to read and transform data, create and validate the model, and produce visualisations, tables and reports. Write the final report and recommendations.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R code. Over page limit.

Research & Extend (marks: 15 to 0; one-page limit)
Exceptional: Wow factor. Report new and surprising insights. Deliver professional quality. Conduct independent research to determine if your predictions for 2020 confirm or extend previously published results. Extend your work with features well beyond what was covered in class, to improve the model and to present its results in the best way. Examples: report results on Google Maps or in Leaflet. Use stunning visualisations. Apply logistic regression, k-NN or Naïve Bayes models for additional insights.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy R code. Over page limit.
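
The relationship, modelling and evaluation steps named in the criteria above can be sketched in base R roughly as follows. This is a minimal illustration under stated assumptions, not a model solution. The data are randomly generated stand-ins for World Bank indicators, and all column names are hypothetical. The 4/n cut-off for Cook's distance is a common rule of thumb, and the 5-fold split is an arbitrary choice. VIF is computed by hand so that no add-on package is needed.

```r
# Illustrative only: synthetic data stands in for indicators downloaded
# from the World Bank; every column name here is hypothetical.
set.seed(772)
n <- 120
wb <- data.frame(
  life_expectancy = rnorm(n, mean = 70, sd = 8),
  school_years    = rnorm(n, mean = 9,  sd = 3),
  forest_pct      = runif(n, min = 0,  max = 80)
)
# Hypothetical target built from two predictors plus random noise
wb$log_income <- 2 + 0.08 * wb$life_expectancy + 0.15 * wb$school_years +
  rnorm(n, sd = 0.5)

# 1. Relationships: correlation table for the target and the predictors
cor_table <- round(cor(wb), 2)
print(cor_table)

# 2. Multiple regression; summary() reports R-squared, the F-statistic
#    and the coefficient p-values
fit <- lm(log_income ~ life_expectancy + school_years + forest_pct, data = wb)
print(summary(fit))

# 3. Extreme cases: flag rows whose Cook's distance exceeds 4/n
#    (assumed rule of thumb, not prescribed by the assignment)
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)

# 4. Multicollinearity: VIF = 1 / (1 - R^2), where R^2 comes from
#    regressing each predictor on the remaining predictors
predictors <- c("life_expectancy", "school_years", "forest_pct")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = wb))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))

# 5. Validation: 5-fold cross-validated root mean squared error
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cv_rmse <- sapply(seq_len(k), function(i) {
  train <- wb[fold != i, ]
  test  <- wb[fold == i, ]
  m <- lm(log_income ~ life_expectancy + school_years + forest_pct,
          data = train)
  sqrt(mean((test$log_income - predict(m, newdata = test))^2))
})
print(mean(cv_rmse))
```

To adapt the sketch, replace the synthetic data frame with one loaded from your own CSV file using read.csv(), and substitute your own target and predictor names in the model formula.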