MIS772 Predictive Analytics Assignment A2

Assignment A2: SAS Enterprise Miner

Objectives
After this workshop, consisting of sessions in modules M2 and M3, students will understand how to use SAS Enterprise Miner (SAS EM) to explore data, gain insights into the problem domain and make predictions based on such insights. The workshop relies on students’ knowledge of methods and techniques introduced in a series of classes. Note that a partial assignment solution must be formally submitted by its own deadline.

Methods
In the assignment (as well as in on-campus labs) students will work in teams of up to 3 members. Teams will be given tasks and will use SAS EM to complete them. Demonstrations and lab exercises will assist skill development.

Prerequisites
Before attending SAS EM workshops, students need to be familiar with class readings and:
• Kattamuri S. Sarma (2013): Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications, Second Edition. SAS Institute.

Workshop Schedule
Activities – No late arrivals for the on-campus sessions!

1. Learn how to use Deakin AppsOnDemand and SAS Enterprise Miner; create project and library folders on your home drive.
   Topic: Before Workshop
2. The workshop facilitator will explain the case in the focus of this assignment. Work in groups of up to 3 (1, 2 or 3 members, but not 4).
3. Learn SAS EM and the role of nodes to read and manipulate data from CSV files and libraries, clean and transform this data, and produce statistics and charts. Learn to create decision trees, regression and neural network models. Gain hands-on experience in model validation and comparison of models’ performance.
   Topic (activities 2 and 3): M2T1, M2T2 SAS EM Regression, Neural Nets, Decision Trees & Model Comparison
4. Explore SAS EM facilities for data exploration and dimensionality reduction with data clustering. Use Ward’s hierarchical cluster analysis to determine the number of clusters for k-means clustering. Learn how to profile and validate data clusters using the CCC statistic.
   Topic: M2T3 Clustering
5.
Evaluate the models using bagging, boosting and cross-validation. Explore gradient boosting, random forests and other “high performance” data models (HPDM).
   Topic: M2T4 Cross-Validation HP Models
6. Learn how to evaluate and compare individual predictive models. Integrate several predictive models into ensembles. Conduct validation and testing of ensemble models. Visualise and interpret the results.
   Topic: M3T1 Model Comparison & Ensembles
7. As a team, prepare a report of your findings using the provided template. The executive summary should offer interpretation and justification of results. Your report should include screenshots of SAS EM analytic processes and of the tables and charts produced.
   Topic: Report and Executive Summary
8. Each team must make a single submission of its work via the CloudDeakin dropbox (possibly in multiple versions submitted weekly or daily). Submissions must include team members’ names, student numbers and the group ID.
   Topic: Submission

Note: Demos of workshop activities are given in class and are video recorded. Workshop activities support all assignment deliverables.

Assignment Case Study
The following mini case study will be used in assignment A2. The workshop materials for topics M2T1-M3T1 are presented in separate handouts. All amendments, extensions and assumptions should be recorded in the final submission.

Business Scenario
An independent online business, Best Iowa Buys, has set up a members-only service to predict the likely value of auctioned real estate. It has collected sample data about property sales in Ames, Iowa (USA) and asked you to develop an analytics solution that it could use to estimate the price of any property. It is also interested in classifying each property in terms of its affordability within its category or group, and of course its value for money for the potential buyer. You were given some data in the CSV format.
The data consists of over 2,930 records of properties sold in Ames between 2006 and 2010, each described with 79 variables. Each record describes a property in terms of its type (house, unit, apartment or townhouse), the number of storeys, its zoning, lot area and shape, utilities, its condition, location (suburb), and of course the price. Note that the case has been adapted from a past Kaggle.com competition.

Assessment Objective
You have been hired as a data analyst subcontracting for Best Iowa Buys. Your role is to develop, evaluate and test a predictive model in SAS Enterprise Miner. The company would also like you to produce a list of the Iowa properties currently advertised for auction, each with its estimated price, its affordability and its value for money.

Questions
Q1. Describe the business problem and the potential value of the predictive model to the client. Present an analytic solution to the problem and support your recommendation with references to the data analytics you conducted.
Q2. Explore the sample data using descriptive statistics, frequency plots and cluster analysis. Specifically identify any missing, anomalous or inconsistent data characteristics, explaining their potential impact. Perform any treatment or transformation of the data needed to rectify data quality issues.
Q3. Perform cluster analysis and segmentation of your data to identify any natural categories or groups of the Ames properties that could potentially be used to guide customers’ buying choices.
Q4. Develop analytic models to estimate the property price (at least two models), assess its affordability (at least two models) and its value for money (at least two models). Be sure to incorporate all of these model types: a) Regression; b) Decision trees; and c) Neural networks. Consider using HPDM models.
Q5.
For all models, provide a summary of the model assessment statistics over the training, validation and test data sets. Consider using cross-validation.
Q6. Compare the performance of alternative / competing models. Select the best modelling options and combine your models to provide the predictive solution to the problem. Consider using ensembles.
Q7. Research any additional factors that may have influenced the Ames property prices during the period of data collection. Suggest and implement a strategy to include those factors in your predictive model. Evaluate your strategy. Alternatively, incorporate into your modelling some novel elements of SAS Enterprise Miner (not previously taught in MIS772) that could significantly improve the model’s predictive performance.

On-Campus / Off-Campus
Both on-campus and off-campus students will work in teams created for the duration of assignment A2. Module M2 and M3 workshops will support the assignment work.

Submission
Use the provided template as a way to structure your final report. Some deviation from the format is acceptable; however, the page limit and readability of each section must be preserved. All SAS Enterprise Miner models in XML format and any data used in this project must be included in your submission. Teams must submit the assignment via the CloudDeakin assignment box by the indicated deadline. You will be assessed as a team, with equal share. Ensure that your team’s work is unique. No extensions will be possible.

Weekly Meetings
Weekly contribution of all team members is necessary and must be documented. All teams, on-campus, off-campus, even those of a single member, must lodge weekly minutes of meetings in CloudDeakin’s file locker area with a file name “Minutes of Meeting yy-mm-dd.pdf” (where yy-mm-dd is a date).
The post should be in the format:
• Date and Time: when the meeting took place
• Location: where the meeting took place (either virtual or face-to-face)
• Attendance: names of all present team members and any apologies from others
• Work Review: status of all tasks allocated to individual team members
• Issues: discussion of problems and suggested plans of action, especially those to do with lack of progress, incomplete tasks and rescheduling of tasks; screenshots of diagrams, charts, tables and results discussed
• Task Allocation: a list of new and rescheduled tasks against the names of responsible team members, with due dates clearly identified
• Next Meeting: date and time of the next meeting

Failure to lodge meaningful minutes of meetings in two consecutive weeks will result in an automatic deduction of 20% of the assignment marks. Failure to contribute to the team effort, as evident from weekly reports, will result in expulsion from the team.

Competency Assessment Criteria
Very Important: Read this very carefully!

The work will be assessed based on the competency criteria explained in lecture 1. Note the column called “Unacceptable”, which indicates that the assignment should be done along the analytic process and its competency levels. Do not commence the more advanced levels without first meeting the lower levels’ expectations (no points given). All SAS EM models in XML and all data files must be supplied so that the reported results can be (easily) reproduced.

Assessment columns: Exceptional / Extensions and Research; Meets Expectations – Focus on These / Based on Unit Teaching; Unacceptable.

Prepare Exec Summary (10 to 0 marks; half a page limit)
Clearly identify what kind of decisions are to be supported by the analytic solution and what types of actions can be recommended by the system.
Succinctly state a business problem (or question) and specify requirements for its solution in terms of insights to be generated. Succinctly describe the results (answer or solution) and justify. Provide references to the supporting evidence, e.g. charts and plots.
Unacceptable: Not provided or incomprehensible. Solution not justified. No references to the rest of the report (numbered figures).

Prepare Data (30 to 0 marks; one page limit)
Clean up, transform and filter your data as needed. Ensure that your predictors include both numerical (interval) and categorical (nominal) variables. Define all (3) targets and select predictors. Justify your definition and the methods of creating / adopting the nominal variables. Explore and understand selected variables using a variety of charts. Report the important insights. XML and data files attached.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy SAS models. Over the page limit.

Discover Relationships (10 to 0 marks; one page limit)
Include performance analysis of your clustering activities. Perform cluster analysis and segmentation of your data to identify data categories. Explore, visualise and understand the relationships between variables. Justify the selection of your predictors to build each of the models and an analytic solution. XML and data files attached.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy SAS models. Over the page limit.

Create Models (10 to 0 marks; one page limit)
Include the results of your cluster analysis in the predictive models. Use the selected HPDM models. Develop a number of analytic models to predict the property price (at least 2 models), its affordability (at least 2 models) and value for money (at least 2 models). Use all these model types: a) Regression; b) Decision trees; c) Neural networks. XML and data files attached.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy SAS models. Over the page limit.

Evaluate & Improve (15 to 0 marks; one page limit)
Apply the most suitable cross-validation methods to your models. Validate and test the models for their ability to predict target values; evaluate the models’ performance. Visualise, interpret and report the results of your performance testing. XML and data files attached.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy models. Page limit.

Provide Solution (10 to 0 marks; one page limit)
Create ensemble models where appropriate. Evaluate the final model using cross-validation, bagging or boosting; plot and interpret the overall model performance. Integrate all analytic elements into a process that could be used by the client to solve the WB problem, i.e. to read and transform data, create and validate the model, and produce visualisations, tables and reports. Write the final report. XML and data files attached.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy SAS models. Over the page limit.

Research & Extend (15 to 0 marks; one page limit)
Wow factor. Report new and surprising insights. Deliver professional quality. Conduct independent research to verify your results. Extend your work with features well beyond what was covered in class, to improve the model and to present its results in the best way. XML and data files attached.
Unacceptable: Not meeting expectations. Above steps unacceptable. Missing or messy models. Page limit.
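
The criteria above ask for cross-validation when evaluating models. The assignment itself is to be completed in SAS EM, so the following is only an illustrative sketch, in Python, of what k-fold cross-validation does: the data is split into k folds, each fold serves once as the validation set, and the fold-level errors are averaged. The function names (k_fold_indices, cross_validated_mae), the baseline that predicts the mean training price, and the choice of mean absolute error are all assumptions made for this illustration; they are not taken from the unit materials or from SAS EM.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not a SAS EM facility.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)       # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i, valid in enumerate(folds):
        # Training rows are all rows that are not in the current fold.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


def cross_validated_mae(prices, k=5, seed=0):
    """Average validation error of a mean-price baseline over k folds.

    The baseline (predict the mean of the training prices) is an assumed
    stand-in for a real model such as a regression or decision tree.
    """
    fold_errors = []
    for train, valid in k_fold_indices(len(prices), k, seed):
        prediction = mean(prices[j] for j in train)
        fold_errors.append(mean(abs(prices[j] - prediction) for j in valid))
    return mean(fold_errors)
```

Calling cross_validated_mae(sale_prices, k=5) on a list of sale prices would return the average of five fold-level errors. Every record is used for validation exactly once, which is why cross-validated estimates tend to be less sensitive to one particular train/validation split.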