Assignment title: Information
Data Warehousing and Business
Intelligence Project
Due: Week 13 (See Moodle for exact dates)
Task: Architect, Populate and Explore a Data Warehouse for Business
Intelligence (40%)
This CA will assess your ability to apply the theories, methodologies and strategies tackled in the
course to successfully implement a data warehouse to support business intelligence queries.
CA Deliverables
For the project, there are 2 deliverables: 1) a 20 page report, and 2) a pre-recorded video
demonstration with voice commentary of up to 5 mins.
The paper should be structured as follows:
1. Introduction: description of the setting of your data warehouse. This section should
document: the sources of data, motivation for these sources of data, and describe how the
data sets have been curated as well as coerced to work together. If you have fabricated
your own data for this project, describe how this has been done, and what assumptions
and/or data artefacts are interwoven in this process.
2. Architecture and Implementation: a high-level overview of how the data warehouse is
built, i.e. have you followed in an Inmon, Kimball, or alternate approach. Detail why you
have designed the data warehouse in this manner. Use supporting argumentation and
(where appropriate) substantiating literature with respect to the architectural,
methodological and design choices made. Include any details of staging in this section.
3. Data Warehouse Data Model: description of the schema(s) built. What dimensions do you
have, and why? Also include a discussion on why they are composed in this manner.
Noteworthy, also is how the data model has been prepared for the BI queries.
4. ETL strategy: how has the data been extracted, transformed, cleaned, and loaded into the
data warehouse? Document precisely how the data goes from the sources to data
warehouse, ideally as a logical data map. Document all transformations and steps taken
to clean the data sources. Note which parts of your approach are automated.
5. Application of Data Warehouse: Execute three non-trivial queries that highlight the
business value of your data warehouse. For each query, note the sources of data
contributing to the answer of the query. Each BI query should be articulated as a question
that is answered using the data warehouse. You may use visualisation tools to execute the
queries, e.g. Power Pivot, Tableau, Power BI etc. However, the video MUST illustrate how
these tools have been integrated into your overall solution – at least how the cube has
been imported/exported.
The final report should not be more than 20 pages including all figures and references.
The video should be no more than 5 minutes in length. You may use any screen capture software,
e.g. Snagit, Camtasia, CamStudio, Quicktime, etc.
1 The video should showcase core functionality
of the data warehouse and ETL strategy. It should also show the execution of 2 or more of the BI
queries discussed in the report. The voice commentary should explain key aspects of what is
being shown, and noteworthy implications as well as interesting / surprising results or outcomes.
Some Guidance and Storage Requirements
Full freedom on the design and implementation methodology is given to the student. Students are
free to use any data warehousing tool set discussed in the course, or otherwise.
The final project implementation should be stored until the module results are announced after
the end of the semester. You may be called upon to demonstrate your solution if further details
or evidence are required to support your paper and/or video presentation. It is your
responsibility to ensure appropriate storage and backup for your project. Failure to do so, if
demonstration is required, could negatively impact the project grade.
Data Sources
Students are expected to independently source relevant data for use in the case studies. Note:
data sets will likely need cleaning and you need to ensure that appropriate relationships exist
between your data sets, such that they can be transformed and loaded into the warehouse for
meaningful case studies. The warehouse must incorporate 3 or more sources of data, of
which at least 1 should be structured and 1 unstructured.
Extra credit will be given for incorporating specifically challenging or
innovative sources of data.
The report should detail the sources of data, how they were generated or extracted, and the steps
taken to load and transform the data for storage in the warehouse.
Some example sources of data include:
- Social Media and Blogs
- Corporate documents, reports and white papers
- Patents
- (Dummy) CRM, ERP, Customer transaction records
- Accounting documents
- News articles (e.g. via feedzilla)
- Stock tickers (e.g. quandl.com)
- Web Pages
1 See http://elearningbrothers.com/top-10-screen-recording-software-for-windows/
or http://www.toptenreviews.com/software/multimedia/best-video-capture-
software/ for reference
There is a huge number of online repositories of relevant data, where some examples include, but
are not limited to:
- European Data Portal, EU Open Data Portal, and other http://data.europa.eu/
- UK’s open government data repository: http://data.gov.uk
- Central Statistics Office, Ireland: http://www.cso.ie
- Kaggle: http://www.kaggle.com
- Run My Code: http://www.runmycode.org/
- IBM’s many eyes repository:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets
- Amazon’s public dataset repository:
https://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1
- Google’s Public Data Directory: http://www.google.com/publicdata/directory
- The UCI machine learning repository: http://archive.ics.uci.edu/ml/
- Mockaroo (for data set generation): https://www.mockaroo.com
BI Queries
The BI queries represent three exemplary non-trivial knowledge discovery exercises facilitated
through your prototypical data warehouse implementation. Appropriate presentation of the
results should be provided in the report. Their implications should be appropriately discussed
referencing relevant literature where applicable. If more than 3 BI queries are documented, the
best three will be used for grading purposes.
Advisory project time plan
• Week 5 completed the Multidimensional SQL Server Tutorial
• Reading week identify topic and potential sources of data and ensure you have
completed the SSIS tutorial
• Week 10 finalise data model (star/snowflake schema), and have data in the DW
• Week 11 have at least 1 deployable cube
• Week 12 begin documentation, and plan the video / BI queries
• Week 13 finalise the report
• Week 14 don’t plan anything for this week, you will probably miss the deadline!
Stuck on the project?
Watch these two videos, they tend to help address most problems:
• https://www.youtube.com/watch?v=ctUiHZHr-5M
• https://www.youtube.com/watch?v=kTPJBAtv29k&list=PL7A29088C98E92D5F
This tutorial is also typically a good guide for the project as well:
http://www.codeproject.com/Articles/652108/Create-First-Data-WareHouse
Grading Rubric
H1 H2.1 H2.2 Pass Fail
Setting and
Architecture
(10%)
The setting and architecture are
coherently documented.
Appropriate academic
literature substantiates well-
argued design choices and
methodological considerations.
The setting and architecture are
coherently documented, with
well-argued design choices and
methodological considerations.
The setting and architecture are
documented and justified..
The setting and architecture are
documented in an intelligible
manner. Limited justification is
given, but lacks in depth and/or
argumentation.
The setting and architecture
are poorly conceived and/or
little to no justification is
given.
Data Model
(15%)
The presented schema is
coherently documented and
engineered with key design
choices are justified.
The presented schema is
coherently documented and key
design choices are justified. It is
possible that the schema is
over-engineered.
The presented schema is
coherently documented, key
design choices are justified, but
may lack depth. It is possible
that there are better ways to
design the data model.
A schema, and core aspects are
documented. There may, however,
be some flaws in the design
choices.
The schema is poorly
conceived and/or badly
documented.
All data sets have a well
conceived place in the data
model.
Most data sets have a well-
conceived place in the data
model
It is not clear how/why some data sets are included Some of the included data sets
are arbitrary and/or not used
The model evidences multi-
dimensional drill down
The model evidences drill down No drill down demonstrated
ETL (35%) A solution set that addresses
complex issues, and performs
complex ETL processes using
advanced methods.
A solution set that attempts,
with some degree of success, to
address some complex issues
using some advanced methods.
A solution set that addresses a
wide scope of issues and
challenges, but lacks depth in
one or more components in ETL
processes.
A solution set that addresses the
basic requirements of ETL such
that a functioning data warehouse
is facilitated.
A solution may be presented,
but the basic ETL
requirements are not met.
The data warehouse may not
fully function.
May include or exploit new or emerging technologies to facilitate
complex ETL methods.
Merges concepts from existing
approaches and/or sources.
Meets basic requirements using
standard tools or technologies.
Basic ETL requirements are
likely not met.
Loading, Extraction,
Transformation and Cleaning
are automated.
Loading, and two of (Extraction,
Transformation, and Cleaning)
are automated.
Loading, and one of (Extraction,
Transformation, and Cleaning)
are automated.
Loading is automated No ETL Automation. There
may be no evidence that the
data warehouse can be
rebuilt.
The ETL methodology is well
documented and discussed.
The ETL methodology is
appropriately documented and
discussed, but may contain
some minor inconsistencies or
missing information.
The ETL methodology is
documented, but weakly
discussed. Some errors in
argumentation and information
may be present.
An ETL methodology is
documented that is sufficient to
understand the basic process
undertaken.
The ETL methodology is
poorly presented and/or not
discussed with sufficient
detail.
Application
(30%)
Three or more non-trivial BI
queries illustrating business
value are presented in the
report. A solid methodology has
been applied to evaluate the
findings and implications of the
BI queries.
Three or more non-trivial case
studies illustrating business
value are presented in the
report, which documents an
attempt to methodologically
evaluate the BI queries.
Three or more non-trivial case
studies illustrating some
business value are presented in
the report and at least 2 in the
video. An evaluation
methodology may be present,
but is weak.
Three or more non-trivial case
studies are presented in the report
and at least 2 in the video. They
may lack clear business value.
Little methodological
consideration is present in the
evaluation.
Less than three non-trivial
case studies are presented in
the report. The video may
contain some evidence of the
case studies. The evaluate
lacks methodological
grounding.
3 BI queries depending on
multiple sources are presented.
The report documents and
motivates which data sources
are leveraged for each BI query.
At least 2 BI queries depending
on multiple sources are
presented. The report
documents which data sources
are leveraged for each BI query.
At least 1 BI query depending on
multiple sources is presented.
The report documents which
data sources contribute to BI
queries.
Some BI queries depend on
multiple, but potentially arbitrary,
sources of data. The report
documents which data sources
contribute to BI queries.
None of the case studies rely
on multiple non-arbitrary
sources of data. The report
does NOT document how
different data sources
contribute to each BI query.
The results are critically
evaluated at multiple levels
using appropriate academic
literature to substantiate lines
of argumentation.
The results are critically
evaluated, but may lack some
depth.
An attempt is made to critically
evaluate the results, but lacks
depth.
Little to no attempt is made to critically evaluate the results.
Video (10%) A well-conceived
demonstration video
documents key functionality.
and at least 2 BI queries. The
results of selected BI queries
are illustrated with an advanced
grasp of their limitations,
implications and efficacy.
A well-conceived
demonstration video
documents key functionality.
The results of selected BI
queries are illustrated with an
acceptable grasp of their
limitations and implications.
A well-conceived demonstration
video documents a basic level of
functionality. The results of
selected BI queries are
illustrated; an attempt may have
been made to discuss their
implications and/or limitations.
A demonstration video is provided
that shows a functioning data
warehouse. However, the video is
poorly conceived and/or lacks
depth. Some results of BI queries
are shown, with little to no
insightful discussion on their
implications and/or limitations.
A demonstration video may
be provided, but is poorly
conceived or does not clearly
illustrate a functioning data
warehouse solution.
Meaningful results from BI
queries might not be
illustrated.