Assignment title: Information


Data Warehousing and Business Intelligence Project Due: Week 13 (See Moodle for exact dates) Task: Architect, Populate and Explore a Data Warehouse for Business Intelligence (40%) This CA will assess your ability to apply the theories, methodologies and strategies tackled in the course to successfully implement a data warehouse to support business intelligence queries. CA Deliverables For the project, there are 2 deliverables: 1) a 20 page report, and 2) a pre-recorded video demonstration with voice commentary of up to 5 mins. The paper should be structured as follows: 1. Introduction: description of the setting of your data warehouse. This section should document: the sources of data, motivation for these sources of data, and describe how the data sets have been curated as well as coerced to work together. If you have fabricated your own data for this project, describe how this has been done, and what assumptions and/or data artefacts are interwoven in this process. 2. Architecture and Implementation: a high-level overview of how the data warehouse is built, i.e. have you followed in an Inmon, Kimball, or alternate approach. Detail why you have designed the data warehouse in this manner. Use supporting argumentation and (where appropriate) substantiating literature with respect to the architectural, methodological and design choices made. Include any details of staging in this section. 3. Data Warehouse Data Model: description of the schema(s) built. What dimensions do you have, and why? Also include a discussion on why they are composed in this manner. Noteworthy, also is how the data model has been prepared for the BI queries. 4. ETL strategy: how has the data been extracted, transformed, cleaned, and loaded into the data warehouse? Document precisely how the data goes from the sources to data warehouse, ideally as a logical data map. Document all transformations and steps taken to clean the data sources. Note which parts of your approach are automated. 5. Application of Data Warehouse: Execute three non-trivial queries that highlight the business value of your data warehouse. For each query, note the sources of data contributing to the answer of the query. Each BI query should be articulated as a question that is answered using the data warehouse. You may use visualisation tools to execute the queries, e.g. Power Pivot, Tableau, Power BI etc. However, the video MUST illustrate how these tools have been integrated into your overall solution – at least how the cube has been imported/exported. The final report should not be more than 20 pages including all figures and references. The video should be no more than 5 minutes in length. You may use any screen capture software, e.g. Snagit, Camtasia, CamStudio, Quicktime, etc. 1 The video should showcase core functionality of the data warehouse and ETL strategy. It should also show the execution of 2 or more of the BI queries discussed in the report. The voice commentary should explain key aspects of what is being shown, and noteworthy implications as well as interesting / surprising results or outcomes. Some Guidance and Storage Requirements Full freedom on the design and implementation methodology is given to the student. Students are free to use any data warehousing tool set discussed in the course, or otherwise. The final project implementation should be stored until the module results are announced after the end of the semester. You may be called upon to demonstrate your solution if further details or evidence are required to support your paper and/or video presentation. It is your responsibility to ensure appropriate storage and backup for your project. Failure to do so, if demonstration is required, could negatively impact the project grade. Data Sources Students are expected to independently source relevant data for use in the case studies. Note: data sets will likely need cleaning and you need to ensure that appropriate relationships exist between your data sets, such that they can be transformed and loaded into the warehouse for meaningful case studies. The warehouse must incorporate 3 or more sources of data, of which at least 1 should be structured and 1 unstructured. Extra credit will be given for incorporating specifically challenging or innovative sources of data. The report should detail the sources of data, how they were generated or extracted, and the steps taken to load and transform the data for storage in the warehouse. Some example sources of data include: - Social Media and Blogs - Corporate documents, reports and white papers - Patents - (Dummy) CRM, ERP, Customer transaction records - Accounting documents - News articles (e.g. via feedzilla) - Stock tickers (e.g. quandl.com) - Web Pages 1 See http://elearningbrothers.com/top-10-screen-recording-software-for-windows/ or http://www.toptenreviews.com/software/multimedia/best-video-capture- software/ for reference There is a huge number of online repositories of relevant data, where some examples include, but are not limited to: - European Data Portal, EU Open Data Portal, and other http://data.europa.eu/ - UK’s open government data repository: http://data.gov.uk - Central Statistics Office, Ireland: http://www.cso.ie - Kaggle: http://www.kaggle.com - Run My Code: http://www.runmycode.org/ - IBM’s many eyes repository: http://www-958.ibm.com/software/data/cognos/manyeyes/datasets - Amazon’s public dataset repository: https://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1 - Google’s Public Data Directory: http://www.google.com/publicdata/directory - The UCI machine learning repository: http://archive.ics.uci.edu/ml/ - Mockaroo (for data set generation): https://www.mockaroo.com BI Queries The BI queries represent three exemplary non-trivial knowledge discovery exercises facilitated through your prototypical data warehouse implementation. Appropriate presentation of the results should be provided in the report. Their implications should be appropriately discussed referencing relevant literature where applicable. If more than 3 BI queries are documented, the best three will be used for grading purposes. Advisory project time plan • Week 5 completed the Multidimensional SQL Server Tutorial • Reading week identify topic and potential sources of data and ensure you have completed the SSIS tutorial • Week 10 finalise data model (star/snowflake schema), and have data in the DW • Week 11 have at least 1 deployable cube • Week 12 begin documentation, and plan the video / BI queries • Week 13 finalise the report • Week 14 don’t plan anything for this week, you will probably miss the deadline! Stuck on the project? Watch these two videos, they tend to help address most problems: • https://www.youtube.com/watch?v=ctUiHZHr-5M • https://www.youtube.com/watch?v=kTPJBAtv29k&list=PL7A29088C98E92D5F This tutorial is also typically a good guide for the project as well: http://www.codeproject.com/Articles/652108/Create-First-Data-WareHouse Grading Rubric H1 H2.1 H2.2 Pass Fail Setting and Architecture (10%) The setting and architecture are coherently documented. Appropriate academic literature substantiates well- argued design choices and methodological considerations. The setting and architecture are coherently documented, with well-argued design choices and methodological considerations. The setting and architecture are documented and justified.. The setting and architecture are documented in an intelligible manner. Limited justification is given, but lacks in depth and/or argumentation. The setting and architecture are poorly conceived and/or little to no justification is given. Data Model (15%) The presented schema is coherently documented and engineered with key design choices are justified. The presented schema is coherently documented and key design choices are justified. It is possible that the schema is over-engineered. The presented schema is coherently documented, key design choices are justified, but may lack depth. It is possible that there are better ways to design the data model. A schema, and core aspects are documented. There may, however, be some flaws in the design choices. The schema is poorly conceived and/or badly documented. All data sets have a well conceived place in the data model. Most data sets have a well- conceived place in the data model It is not clear how/why some data sets are included Some of the included data sets are arbitrary and/or not used The model evidences multi- dimensional drill down The model evidences drill down No drill down demonstrated ETL (35%) A solution set that addresses complex issues, and performs complex ETL processes using advanced methods. A solution set that attempts, with some degree of success, to address some complex issues using some advanced methods. A solution set that addresses a wide scope of issues and challenges, but lacks depth in one or more components in ETL processes. A solution set that addresses the basic requirements of ETL such that a functioning data warehouse is facilitated. A solution may be presented, but the basic ETL requirements are not met. The data warehouse may not fully function. May include or exploit new or emerging technologies to facilitate complex ETL methods. Merges concepts from existing approaches and/or sources. Meets basic requirements using standard tools or technologies. Basic ETL requirements are likely not met. Loading, Extraction, Transformation and Cleaning are automated. Loading, and two of (Extraction, Transformation, and Cleaning) are automated. Loading, and one of (Extraction, Transformation, and Cleaning) are automated. Loading is automated No ETL Automation. There may be no evidence that the data warehouse can be rebuilt. The ETL methodology is well documented and discussed. The ETL methodology is appropriately documented and discussed, but may contain some minor inconsistencies or missing information. The ETL methodology is documented, but weakly discussed. Some errors in argumentation and information may be present. An ETL methodology is documented that is sufficient to understand the basic process undertaken. The ETL methodology is poorly presented and/or not discussed with sufficient detail. Application (30%) Three or more non-trivial BI queries illustrating business value are presented in the report. A solid methodology has been applied to evaluate the findings and implications of the BI queries. Three or more non-trivial case studies illustrating business value are presented in the report, which documents an attempt to methodologically evaluate the BI queries. Three or more non-trivial case studies illustrating some business value are presented in the report and at least 2 in the video. An evaluation methodology may be present, but is weak. Three or more non-trivial case studies are presented in the report and at least 2 in the video. They may lack clear business value. Little methodological consideration is present in the evaluation. Less than three non-trivial case studies are presented in the report. The video may contain some evidence of the case studies. The evaluate lacks methodological grounding. 3 BI queries depending on multiple sources are presented. The report documents and motivates which data sources are leveraged for each BI query. At least 2 BI queries depending on multiple sources are presented. The report documents which data sources are leveraged for each BI query. At least 1 BI query depending on multiple sources is presented. The report documents which data sources contribute to BI queries. Some BI queries depend on multiple, but potentially arbitrary, sources of data. The report documents which data sources contribute to BI queries. None of the case studies rely on multiple non-arbitrary sources of data. The report does NOT document how different data sources contribute to each BI query. The results are critically evaluated at multiple levels using appropriate academic literature to substantiate lines of argumentation. The results are critically evaluated, but may lack some depth. An attempt is made to critically evaluate the results, but lacks depth. Little to no attempt is made to critically evaluate the results. Video (10%) A well-conceived demonstration video documents key functionality. and at least 2 BI queries. The results of selected BI queries are illustrated with an advanced grasp of their limitations, implications and efficacy. A well-conceived demonstration video documents key functionality. The results of selected BI queries are illustrated with an acceptable grasp of their limitations and implications. A well-conceived demonstration video documents a basic level of functionality. The results of selected BI queries are illustrated; an attempt may have been made to discuss their implications and/or limitations. A demonstration video is provided that shows a functioning data warehouse. However, the video is poorly conceived and/or lacks depth. Some results of BI queries are shown, with little to no insightful discussion on their implications and/or limitations. A demonstration video may be provided, but is poorly conceived or does not clearly illustrate a functioning data warehouse solution. Meaningful results from BI queries might not be illustrated.