CIS8008, Semester 2 2017 Assignment 2 Specifications University of Southern Queensland © 2017 1 Description Possible Marks and Wtg(%) Word Count Due date Assignment 2 Written Practical Report 100 marks 15% Weighting 2000 14/08/17 Learning objectives The key frameworks and concepts covered in modules 1–6 are particularly relevant for this assignment. Assignment 2 relates to the course learning objectives 1, 2 and 4: 1. demonstrate applied knowledge of people, markets, finances, technology and management in a global context of business intelligence practice (data warehouse design, data mining process, data visualisation and performance management) and resulting organisational change and how these apply to implementation of business intelligence in organisation systems and business processes 2. identify and solve complex organisational problems creatively and practically through the use of business intelligence and critically reflect on how evidence based decision making and sustainable business performance management can effectively address real world problems 4. demonstrate the ability to communicate effectively in a clear and concise manner in written report style for senior management with correct and appropriate acknowledgment of main ideas presented and discussed. Software Requirements You must use RapidMiner Studio for Task 2 and you must use Tableau Desktop for Task 3 in this Assignment 2. Failure to do so may result in Task 2 and/or Task 3 not being marked and zero marks awarded. Academic integrity Note carefully University policy on Academic Misconduct such as plagiarism, collusion and cheating. Your Assignment 2 submission is automatically submitted to and checked in Turnitin for academic integrity when you submit your Assignment 2 via the course study Assignment 2 submission link. If any of these occur they will be found and dealt with by the USQ Academic Integrity Procedures. If proven, Academic Misconduct may result in failure of an individual assessment, the entire course or exclusion from a University program or programs. This assignments should be completed by each student individually (not in group). Late submission policy USQ’s Assessment – Assignment (Late Submission) and Compassionate and Compelling Circumstances procedures relate to Extensions and Late Assignments. They can be found under the following links:  Assessment – Assignment (Late Submission) Procedure  Assessment of Compassionate and Compelling Circumstances Procedures: An Assignment submitted after the due date without an approved extension of time will be penalised. The penalty for late submission is a reduction by five percent (5%) of the maximum mark applicable for the assignment, for each University Business Day or part Business Day that the assignment is late.CIS8008, Semester 2 2017 Assignment 2 Specifications University of Southern Queensland © 2017 2 Assignment 2 consists of three main tasks and a number of sub tasks Task 1 Data Management Architectures (Worth 30 Marks) For Task 1 you are required to conduct a critical literature review of the related concepts of data warehouses, data lakes and data marts in order to complete two sub tasks. Task 1.1) Define and discuss the concepts of (1) a data warehouse, (2) a data lake and (3) a data mart drawing relevant and reputable literature (about 500 words). Task 1.2) Explain how a data warehouse, a data lake and a data mart would be used in an organization and in your answer provide some real world examples of how each would be used in an organization (about 500 words). Task 2 Exploratory Data Analysis and Linear Regression Analysis (Worth 40 Marks) For Task 2 consider a set of observations on a large number of white wine varieties involving their chemical properties and ranking by wine tasters contained in white-wines.csv data set. Wine industry has been growing steadily as social drinking of wine is on the rise. The price of a wine largely depends on wine appreciation by wine tasters which may have a high degree of variability. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and take into account factors like acidity, pH level, presence of sugar and other chemical properties. For wine producers, it would be of interest if wine tasters’ perception of wine quality after tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process of wines is more rigorous. The white-wines.csv data set consists of 4898 white wine varieties in total (records). All wines are from one wine producing region. The white-wines.csv data set was collected on 12 different properties of wines. Quality is based on sensory data (wine tasters’ perception of the quality of a wine), the rest are based on chemical properties of wines including density, acidity, alcohol content etc. All chemical properties of wines are coded as continuous numeric variables. Quality is an ordinal variable with a possible ranking from 1 (worst) to 10 (best). Each white wine variety is tasted by three independent tasters and final rank assigned is the median rank given by tasters. See Table 1 White Wines Data Set Data Dictionary for full details of white-wines.csv data set. Table 1 White Wines Data Set Data Dictionary Variable Name Type and description of variable Range of values fixed acidity Real - tartaric acid grams/litre - most acids in wines are fixed 3 through to 15 volatile acidity Real - acetic acid grams/litre - steam distillable acids 0 through to 1.2 citric acid Real – grams/litre commercially produced acid supplement 0 through to 2 residual sugar 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Real - measured in grams of sugar per litre of wine 0 through to 70 chlorides Real - sodium chloride - grams/litre 0 through to 1 free sulfur dioxide Integer fruit preservative used in wines milligrams/litre 1 through to 300 total sulfur dioxide Integer fruit preservative used in wines milligrams/litre 1 through to 450 density Real – density - grams/cubic 0 through to 2 pH Real - strength of acidity 2 through to 4 sulphates Real - potassium sulfate - grams/litre 0 through to 2 alcohol Real – percentage of volume - alcohol content of a wine 3 through to 15 quality Integer - Quality Sensory score by wine tasters between 0 and 10CIS8008, Semester 2 2017 Assignment 2 Specifications University of Southern Queensland © 2017 3 Note: You will need to conduct some desktop research on wine tasting and the chemical properties of wine in order to fully understand and interpret the key findings of the exploratory data analysis (EDA) of the white-wines.csv data set and the key results of the Linear Regression Models for the white-wines.csv data set. The following resources: Wikipedia page on Acidity in Wines and What’s in a Wine? are a good starting point. Task 2.1) Conduct an exploratory data analysis of the white-wines.csv data set using the RapidMiner Studio data mining tool. Summarise the findings of your exploratory data analysis in terms of describing key characteristics of each variable in the wines.csv data set such as maximum, minimum values, average, standard deviation, most frequent values (mode), missing values and invalid values etc and relationships of variables with other variables if relevant in a table named Table 2.1 Results of Exploratory Data Analysis for the White-Wines.csv Data Set. Hint: The Statistics Tab and the Chart Tab in RapidMiner provide descriptive statistical information and the ability to create useful charts like Barcharts, Scatterplots etc. You might also like to look at running some correlations and chi square tests on the white-wines.csv data set to indicate which variables you consider to be the top five key variables and which contribute most to confirming a wine taster’s ranking of the quality of a wine. Note in completing Task 2.1 you will find it useful to refer to the data dictionary for the white-wines.csv data set included in this document which defines each of the variables in terms of their data type and range of p o s s i b l e values. Discuss the key results of your exploratory data analysis presented in Table 2.1 and provide a rationale for why you have selected your five top variables for predicting a wine taster’s ranking of a white wine drawing on the results of your EDA analysis and relevant literature (About 500 words). Task 2.2) Build a Linear Regression model for predicting the quality ranking of a white wine using a RapidMiner data mining process and an appropriate set of data mining operators and a reduced set of variables from the white-wines.csv data set determined by your exploratory data analysis in Task 2.1. Provide these outputs from RapidMiner (1) Final Linear Regression Model process and (2) Summary Table of Results of Final Linear Regression Model for Task 2.2 for white-wines.csv data set. Briefly describe your final Linear Regression Model Process, and discuss the results of the Final Linear Regression Model for white wine.csv data set drawing on the key outputs (coefficient, standardised coefficients, t-statistics values, p-values and significance levels etc) for predicting Wine Quality and relevant supporting literature on the interpretation of a Linear Regression Model (About 500 words). Include all appropriate RapidMiner outputs such as RapidMiner Processes, Graphs and Tables that support the key aspects of your exploratory data analysis and linear regression model analysis of the white-wines.csv data set in your Assignment 2 report. Note you need export these outputs from RapidMiner using the File/Print/Export Image option and include in Task 2 where relevant or in Appendix A of Assignment 2 report.CIS8008, Semester 2 2017 Assignment 2 Specifications University of Southern Queensland © 2017 4 Task 3 Tableau Desktop View of Snow Falls at Whistler over Time (worth 20 marks) Create a Tableau Text Table or Graph view that displays Snow falls at Whistler BC Canada and other relevant data over time (years) using whistler-daily-snowfall.csv. Comment on the (1) process of preparing this data set for a Text Table or Graph view presentation using Tableau Desktop and (2) key trends and patterns that are apparent in Text Table/Graph you created for presentation of the whistler-daily-snowfall.csv data set (about 200 words). Whistler Daily Snow Fall Data Set contains: Historical daily snowfall data record at Whistler, BC, Canada over the period July 1 1972 to December 31 2009. Measured at top of Whistler Gondola: Latitude: 50°04'04.000" N Longitude: 122°56'50.000" W Elevation: 1835.00 m. Data Source: http://www.climate.weatheroffice.ec.gc.ca/ (Whistler Roundhouse station, identifier 1108906). Note you will need to ensure that Tableau Desktop recognizes the variables in the whistler-dailysnowfall.csv data set as correct type and Note there are a lot of missing values (coded as nan) or null values in this data set for the earlier years 1970s etc. Table 2 Whistler Daily Snow Fall Data Set Data Dictionary Variable Name Type and description of variable Range of values Year Integer – Years for snow fall recorded at Whistler BC Canada 1972 to 2009 Month Integer – Months January through December for each year 1 through to 12 Day Integer – Days of each month of each year 1 through to 31 MaxTemp 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10) Real - measured in Celsius MinTemp Real – measured in Celsius MeanTemp Real – measured in Celsius HeatDegDays Real – measured in Celsius CoolDegDays Real – measured in Celsius TotalRain Real – rainfall is measured in mm TotalSnow Real – snow is measured in cm TotalPrecip Real – total precipitation is measured in mm SnowOnGrd Real – snow on ground is measured in cm Report presentation, writing style and referencing (worth 10 marks) Your Assignment 2 must be presented in report format, written in an appropriate and critical style and supported where required with in text references using Harvard Referencing Style Your assignment 2 report must be structured in report format as follows: Cover Page: Including course code/name, Student name & ID, assignment title (e.g. “Assignment 2) Table of Contents Body of report Task 1 1.1 1.2 Task 2 2.1 2.2 Task 3 List of References AppendicesCIS8008, Semester 2 2017 Assignment 2 Specifications University of Southern Queensland © 2017 5 You must submit two files for Assignment 2: 1. Assignment 2 Report for Tasks 1, 2 and 3 in Word document format with extension .docx 2. Tableau packaged workbook with the extension .twbx contains required Text Table / Graph view for Task 3 You must the following file naming convention: 1. Student ID_Student name_CIS8008_Ass2.docx 2. Student ID_ Student name_CIS8008_Ass2.twbx Harvard referencing resources Install a bibliography referencing tool – Endnote which integrates with your word processor. http://www.usq.edu.au/library/referencing/endnote-bibliographic-software or alternatively use an online citation tool such as Zetoro or You Cite This For Me USQ Library how to reference correctly using the Harvard referencing system https://www.usq.edu.au/library/referencing/harvard-agps-referencing-guide