Description Possible Marks and Wtg (%) Word Count Due Date
Assignment 3 Written Practical Report 100 marks, 40% weighting 4000 22/05/17
The key frameworks and concepts covered in modules 1–10 are particularly relevant for
this assignment. Assignment 3 relates to the specific course learning objectives 1, 2, 3
and 4:
1. apply knowledge of people, markets, finances, technology and management in a
global context of business intelligence practice (data warehousing and big data
architecture, data mining process, data visualisation and performance management)
and resulting organisational change and understand how these apply to the
implementation of business intelligence in organisation systems and business
processes
2. identify and solve complex organisational problems creatively and practically
through the use of business intelligence and critically reflect on how evidence based
decision making and sustainable business performance management can effectively
address real-world problems
3. comprehend and address complex ethical dilemmas that arise from evidence based
decision making and business performance management
4. communicate effectively in a clear and concise manner in written report style for
senior management with the correct and appropriate acknowledgment of the main
ideas presented and discussed.
Note you must use RapidMiner Studio for Task 1 and Tableau Desktop for Task 3
in this Assignment 3. Failure to do so may result in Task 1 and/or 3 not being marked
and zero marks awarded.
Note carefully University policy on Academic Misconduct such as plagiarism,
collusion and cheating. If any of these occur they will be found and dealt with by the
USQ Academic Integrity Procedures. If proven, Academic Misconduct may result in
failure of an individual assessment, the entire course or exclusion from a University
program or programs.
Assignment 3 consists of three main tasks and a number of sub tasks.
Task 1 (Worth 30 Marks)
The goal of Task 1 is to predict the likelihood of rainfall for tomorrow (next day) based on
today’s weather conditions. In Task 1 of Assignment 3 you are required to use the data
mining tool RapidMiner to analyse and report on the weatherAUS.csv data set provided for
Assignment 3. You should review the data dictionary for weatherAUS.csv data set (see Table
1 below).
Table 1 Data dictionary for Australian Weather Data set variables
Variable Name Data Type Description
Date Date Date of weather observation
Location Text Common name of the location of the weather station.
MinTemp Real Minimum temperature in degrees Celsius.
MaxTemp Real Maximum temperature in degrees Celsius.
Rainfall Real Amount of rainfall recorded for the day in mm.
Evaporation Real So-called Class A pan evaporation (mm) in the 24 hours
to 9am.
Sunshine Real Number of hours of bright sunshine in the day.
WindGustDir Polynominal Direction of the strongest wind gust in the 24 hours to
midnight.
WindGustSpeed Integer Speed (km/h) of the strongest wind gust in the 24 hours
to midnight.
WindDir9am Polynominal Direction of wind at 9am
WindDir3pm Polynominal Direction of wind at 3pm
WindSpeed9am Integer Wind speed (km/hr) averaged over 10 minutes prior to
9am.
WindSpeed3pm Integer Wind speed (km/hr) averaged over 10 minutes prior to
3pm.
Humidity9am Integer Relative humidity (percent) at 9am.
Humidity3pm Integer Relative humidity (percent) at 3pm.
Pressure9am Real Atmospheric pressure (hpa) reduced to mean sea level at
9am.
Pressure3pm Real Atmospheric pressure (hpa) reduced to mean sea level at
3pm.
Cloud9am Integer Fraction of sky obscured by cloud at 9am. This is
measured in "oktas", which are a unit of eighths. It
records how many eighths of the sky are obscured by
cloud. A 0 measure indicates completely clear sky whilst
an 8 indicates that it is completely overcast.
Cloud3pm Integer Fraction of sky obscured by cloud (in "oktas": eighths) at
3pm. See Cloud9am for a description of the values.
Temp9am Real Temperature (degrees C) at 9am.
Temp3pm Real Temperature (degrees C) at 3pm.
RainToday Nominal Yes if precipitation (mm) in the 24 hours to 9am
exceeds 1mm, otherwise No.
RISK_MM Real Amount of rain. A kind of measure of the "risk".
RainTomorrow Nominal Target variable. Did it rain tomorrow? Yes or No
The Australian Weather dataset contains over 138,000 daily observations from January 2008
through to January 2017, drawn from 49 Australian weather stations. The daily observations
are available from the Bureau of Meteorology at http://www.bom.gov.au/climate/data.
Definitions for each variable are adapted from
http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml. In completing
Task 1 of Assignment 3 you will need to apply the business understanding, data
understanding, data preparation, modelling and evaluation phases of the CRISP-DM data
mining process.
Task 1.1 Conduct an exploratory data analysis of the weatherAUS.csv data set using
RapidMiner to understand the characteristics of each variable and the relationship of each
variable to the other variables in the data set. Summarise the findings of your exploratory
data analysis in terms of describing key characteristics of each of the variables in the
weatherAUS.csv data set such as maximum, minimum values, average, standard deviation,
most frequent values (mode), missing values and invalid values etc and relationships with
other variables if relevant in a table named Task 1.1 Results of Exploratory Data
Analysis for weatherAUS Data Set.
Hint: Statistics Tab and Chart Tab in RapidMiner provide a lot of descriptive statistical
information and useful charts like Barcharts, Scatterplots etc. You might also like to look at
running some correlations and chi-square tests. Indicate in Task 1.1 Table which variables
you consider to be the key variables which contribute most to determining whether it is
likely to rain tomorrow.
Briefly discuss the key results of your exploratory data analysis and the justification for
selecting your five top variables for predicting whether it is likely to rain tomorrow based
on today’s weather conditions. (About 250 words)
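The descriptive statistics requested for Task 1.1 can be illustrated outside RapidMiner with a minimal Python sketch. RapidMiner remains the required tool for the assignment; the function name summarise and the sample values below are assumptions made for illustration and are not taken from weatherAUS.csv.

```python
import statistics
from collections import Counter


def summarise(values):
    """Summarise one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {"missing": len(values) - len(present)}
    # Numeric columns get range, mean and standard deviation.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
        summary["stdev"] = statistics.stdev(present) if len(present) > 1 else 0.0
    # The mode applies to numeric and nominal columns alike.
    if present:
        summary["mode"] = Counter(present).most_common(1)[0][0]
    return summary


# Made-up sample values, not taken from weatherAUS.csv.
max_temp = [22.9, 25.1, None, 28.0, 25.1]
rain_today = ["No", "No", "Yes", None, "No"]

print(summarise(max_temp))
print(summarise(rain_today))
```

Each dictionary corresponds to one row of the Task 1.1 results table: one variable with its range, average, spread, most frequent value and missing-value count.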
Task 1.2 Build a Decision Tree model for predicting whether it is likely to rain tomorrow
based on today’s weather conditions using RapidMiner and an appropriate set of data
mining operators and a reduced weatherAUS.csv data set determined by your exploratory
data analysis in Task 1.1. Provide these outputs from RapidMiner (1) Final Decision Tree
Model process, (2) Final Decision Tree diagram, and (3) associated decision tree rules.
Briefly explain your final Decision Tree Model Process, and discuss the results of the Final
Decision Tree Model drawing on the key outputs (Decision Tree Diagram, Decision Tree
Rules) for predicting whether it is likely to rain tomorrow based on today’s weather
conditions and relevant supporting literature on the interpretation of decision trees (About
250 words).
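When interpreting the decision tree, it can help to see how a split criterion ranks variables. The sketch below computes information gain, one common criterion, for a nominal attribute. It is an illustration only: the criterion actually used depends on the settings of the RapidMiner Decision Tree operator, and the rows shown are made up rather than drawn from weatherAUS.csv.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy after splitting rows on a nominal attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Made-up rows, not taken from weatherAUS.csv.
rows = [
    {"RainToday": "Yes", "RainTomorrow": "Yes"},
    {"RainToday": "Yes", "RainTomorrow": "No"},
    {"RainToday": "No", "RainTomorrow": "No"},
    {"RainToday": "No", "RainTomorrow": "No"},
]
print(information_gain(rows, "RainToday", "RainTomorrow"))
```

A variable with a higher gain separates the Yes and No cases more cleanly, which is why such variables tend to appear near the root of the tree.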
Task 1.3 Build a Logistic Regression model for predicting whether it is likely to rain
tomorrow based on today’s weather conditions using RapidMiner and an appropriate set of
data mining operators and a reduced weatherAUS.csv data set determined by your
exploratory data analysis in Task 1.1. Provide these outputs from RapidMiner (1) Final
Logistic Regression Model process, (2) Coefficients, and (3) Odds Ratios. Hint: you will
need to install the Weka Extension in RapidMiner, use W-Logistic Regression Operator for
this Task 1.3 and you may need to change data types of some variables.
Briefly explain your final Logistic Regression Model Process, and discuss the results of the
Final Logistic Regression Model drawing on the key outputs (Coefficients, Odds Ratios) for
predicting whether it is likely to rain tomorrow based on today’s weather conditions and
relevant supporting literature on the interpretation of logistic regression models (About 250
words).
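The relationship between the Coefficients and Odds Ratios outputs can be sketched as follows: each odds ratio is the exponential of the corresponding logistic regression coefficient. The coefficient values below are hypothetical and chosen only to show the direction of the effect; they are not results from the weatherAUS.csv data set.

```python
import math


def odds_ratios(coefficients):
    """Convert logistic regression coefficients to odds ratios via exp(beta)."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Hypothetical coefficients, for illustration only.
coefs = {"Humidity3pm": 0.05, "Pressure3pm": -0.10, "Sunshine": 0.0}

for name, ratio in odds_ratios(coefs).items():
    if ratio > 1:
        effect = "raises the odds of rain tomorrow"
    elif ratio < 1:
        effect = "lowers the odds of rain tomorrow"
    else:
        effect = "leaves the odds unchanged"
    print(f"{name}: odds ratio {ratio:.3f} ({effect})")
```

An odds ratio above 1 means a one-unit increase in the variable multiplies the odds of rain by that factor, holding the other variables constant; a ratio below 1 means the odds shrink.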
Task 1.4 You will need to validate your Final Decision Tree Model and Final Logistic
Regression Model. Note you will need to use the X-Validation Operator; Apply Model
Operator and Performance Operator in your data mining process models here.
Discuss and compare the accuracy of your Final Decision Tree Model with the Final
Logistic Regression Model for whether it is likely to rain tomorrow based on today’s
weather conditions based on the results of the confusion matrix and ROC chart for each final
model. You should use a table here to compare the key results of the confusion matrix for
the Final Decision Tree Model and Final Logistic Regression Model (About 250 words).
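The measures typically read off a confusion matrix can be sketched as below. The counts are invented for illustration and are not results from either model; RapidMiner's Performance operator reports the equivalent figures directly, so this sketch only shows how the numbers in the comparison table relate to the four cells of the matrix.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall for the positive class (rain tomorrow = Yes)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Of the days predicted as rain, how many actually had rain.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the days that actually had rain, how many were predicted.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented counts, for illustration only.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because days without rain usually outnumber days with rain, accuracy alone can look high for a model that rarely predicts rain, so precision and recall for the Yes class are worth comparing as well.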
Note the important outputs from your data mining analyses conducted in RapidMiner for
Task 1 should be included in your Assignment 3 report to provide support for your
conclusions reached regarding each analysis conducted for Task 1.1, Task 1.2, Task 1.3 and
Task 1.4. Note you can export the important outputs from RapidMiner as jpg image files
and include these screenshots in the relevant Task 1 parts of your Assignment 3 Report.
Note you will find the North textbook a useful reference for the data mining process
activities conducted in Task 1 in relation to the exploratory data analysis, decision tree
analysis, logistic regression analysis and evaluation of the accuracy of the Final Decision
Tree model and the Final Logistic Regression model.
Task 2 (Worth 30 marks)
Research the relevant literature on how big data analytics capability can be incorporated
into a data warehouse architecture. Note Chapter 2 Data Warehousing and Chapter 6 Big
Data and Analytics of Sharda et al. 2014 Textbook will be particularly useful for answering
some aspects of Task 2.
Task 2.1 Provide a high level data warehouse architecture design for a large state-owned
water utility that incorporates big data capture, processing, storage and presentation in a
diagram called Figure 1.1 Big Data Analytics and Data Warehouse Combined.
Task 2.2 Describe and justify the main components of your proposed high level data
warehouse architecture design with big data capability incorporated presented in Figure 1.1
with appropriate in-text referencing support (about 750 words).
Task 2.3 Identify and discuss the key security, privacy and ethical concerns for
organisations within a specific industry that are already using a big data analytics and
algorithmic approach to decision making with appropriate in-text referencing support
(about 750 words).
Task 3 (Worth 30 marks)
Scenario Dashboard
Los Angeles Police Department (LAPD) are responsible for enforcing law and order in the
City of Los Angeles which is the cultural, financial, and commercial centre of Southern
California. With a census-estimated 2015 population of 3,971,883, it is the second-most
populous city in the United States (after New York City) and the most populous
city in California. Located in a large coastal basin surrounded on three sides by mountains
reaching up to and over 10,000 feet (3,000 m), Los Angeles covers an area of about 469
square miles (1,210 km2).
LAPD Crime Analytics Unit would like to have a Crime Events dashboard built with the aim of
providing a better understanding of the patterns that are occurring in relation to different crimes
across the 21 Police Department areas over time in the City of Los Angeles. In particular, they
would like to see if there are any distinct patterns in relation to (1) types of crimes, (2) frequency
of each type of crime across each of the 21 Police Department areas for years 2012 through to
first quarter of 2016 based on the LACrimes2012-2016.csv data set. Note this is a large data set
containing over 1 million records. This Crime Events dashboard will assist LAPD to better
manage and coordinate their efforts in catching the perpetrators of these crimes and be more
proactive in preventing these crimes from occurring in the first place.
The LAPD Crime Analytics Unit wants the flexibility to visualize the frequency that each
type of crime is occurring over time across each of the 21 Police Department areas/districts in
the City of Los Angeles. They want to be able to get a quick overview of the crime data in
relation to category of crimes, location, date of occurrence and frequency that each crime is
occurring over time and then be able to zoom in and filter on particular aspects and then get
further details as required.
LA Crimes Data Set Data Dictionary
variable name type Description
year_id 1. character Original dataset id
date_rptd 2. date Date crime was reported
dr_no 3. character Count of Date Reported
date_occ 4. date Date crime occurred
time_occ 5. date Time crime occurred on a day
area 6. character Area Code
area_name 7. character Area geographical location
rd 8. character Nearby road identifier
crm_cd 9. character Crime type code
crm_cd_desc 10. character Crime type description
Status 11. character Status code
status_desc 12. character Status outcome of crime
location 13. character Nearby address location
cross_st 14. character Nearby cross street
lat 15. numeric Latitude of crime event
long 16. numeric Longitude of crime event
year 17. numeric Year of crime occurred
month 18. numeric Month of crime occurred
day_of_month 19. numeric Day of month crime occurred
hour_of_day 20. numeric Hour of day crime occurred
month_year 21. Month and year when crime occurred
day_of_week 22. character Day of week crime occurred
weekday 23. character Weekday/weekend classification for crime
event
intersection 24. character Occurred at an intersection
crime_classification 25. character subjective binning of crimes
Task 3 requires a Tableau dashboard consisting of four crime event views of the LA Crimes
2012-2016 data set.
Task 3.1 Specific Crimes within each Crime Category for a specific Police Department Area
and specific year
Task 3.2 Frequency of Occurrence for a selected crime over 24 hours for a specific Police
Department Area
Task 3.3 Frequency of Crimes within each Crime Classification by Police Department Area
and by Time
Task 3.4 Geographical (location) presentation of each Police Department Area for given
crime(s) and year. Note for this task you will need to make use of the geo-mapping capability
of Tableau Desktop.
You should briefly discuss the key findings for each of these four views in your
Crimes Event Dashboard (about 60 words each and 250 words in total)
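The aggregation behind a view such as Task 3.2 (frequency of a selected crime over 24 hours for one area) can be sketched in Python as below. Tableau Desktop remains the required tool; this sketch only shows the underlying count. The field names area_name, crm_cd_desc and hour_of_day come from the data dictionary, while the function name hourly_frequency and the sample records are assumptions made for illustration.

```python
from collections import Counter


def hourly_frequency(records, area_name, crime_desc):
    """Count events per hour of day (0-23) for one area and one crime type."""
    counts = Counter(
        r["hour_of_day"]
        for r in records
        if r["area_name"] == area_name and r["crm_cd_desc"] == crime_desc
    )
    # Return a fixed 24-slot list so hours with no events show as zero.
    return [counts.get(hour, 0) for hour in range(24)]


# Made-up records, not taken from LACrimes2012-2016.csv.
records = [
    {"area_name": "Area A", "crm_cd_desc": "BURGLARY", "hour_of_day": 2},
    {"area_name": "Area A", "crm_cd_desc": "BURGLARY", "hour_of_day": 2},
    {"area_name": "Area A", "crm_cd_desc": "BURGLARY", "hour_of_day": 14},
    {"area_name": "Area B", "crm_cd_desc": "BURGLARY", "hour_of_day": 2},
    {"area_name": "Area A", "crm_cd_desc": "ROBBERY", "hour_of_day": 2},
]

print(hourly_frequency(records, "Area A", "BURGLARY"))
```

In Tableau the same result comes from placing hour_of_day on one axis, a record count on the other, and filtering on area_name and crm_cd_desc.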
Task 3.5 Provide a rationale (drawing on relevant literature for good dashboard design) for
the graphic design and functionality that is provided in your LAPD Crimes Event dashboard
for the required four specified crime events views for Tasks 3.1, 3.2, 3.3 and 3.4 (About
750 words). Note Stephen Few is considered to be the guru of good dashboard design
and has written a number of books on this topic. It is worth having a look at his website
https://www.perceptualedge.com/about.php and in particular his examples of poorly
designed dashboard views and his suggestions for better dashboard views.
For your Assignment 3 submission, you will need to submit your Task 3 Tableau workbook
in .twbx format which will contain your dashboard, four views and the associated data set
as a separate document together with your Assignment 3 Main Report in Word docx format.
Report presentation writing style and referencing (worth 10 marks)
Presentation: use of formatting, spacing, paragraphs, tables and diagrams,
introduction, conclusion, table of contents
Writing style: Use of English (Correct use of language and grammar. Also, is there evidence
of spell-checking and proofreading?)
Referencing: Appropriate level of referencing in text where required, reference list provided,
used Harvard Referencing Style correctly
Assignment 3 Report should be structured as follows:
Assignment 3 Cover page
Table of Contents
Task 1 Main Heading
Task 1 Sub Tasks – Sub headings for Tasks 1.1, 1.2, 1.3 and 1.4
Task 2
Task 2 Sub Tasks – Sub headings for Task 2.1, 2.2 and 2.3
Task 3
Task 3 Sub Tasks – Sub headings for Task 3.1, 3.2, 3.3, 3.4 and 3.5
List of References
List of Appendices