INFS4018 Assignment – Analysis of a dataset

Introduction
The aim of the assignment is to introduce you to the analysis of routine data sets (“wild datasets”). You will need to write data dictionaries, assess data quality, explore the data using visual tools, perform some data wrangling, plan and perform data analysis, and write a comprehensive report that gives an account of your findings and summarises your recommendations. For the assignment, you will be given a general scenario and a raw dataset. You will need to explore the given problem in more depth. You will be working in groups to produce both group and individual deliverables.

Project methodology
Data can be the product of a meticulously planned study, or it can be a side-product of practice (wild datasets). While planned studies typically yield well-defined, clean data, they are expensive both in terms of money and other resources. Such effort is not sustainable in the long term. Data produced as part of routine (health care) activities is, on the other hand, readily available without additional cost. However, such data is typically incomplete, may contain errors, and requires cleansing and transformation before it can be used beyond its primary purpose. The framework we will be using for this assignment was developed by industry as the Cross-Industry Standard Process for Data Mining (CRISP-DM). This process has several phases:

Business understanding
Before you start any attempt to collect or analyse data, you need a good idea of why you are doing the exercise: understand the purpose. The main components are:
⦁ Determine business objectives
    ⦁ Initial situation/problem etc. (…we have crowded emergency departments (ED)…)
    ⦁ Explore the context of the problem and the context of the data collection (…types of organisation generating the data; processes involved in the data creation...)
⦁ Assess situation
    ⦁ Inventory of resources (personnel, data, software)
    ⦁ Requirements (e.g. deadline), constraints (e.g. legal issues), risks
Understanding your business will help you determine the scope of the project, the timeframe, budget etc.
NB: The direction of your analysis is determined by your business needs. An attempt to analyse a dataset without first identifying the main directions would lead to extensive exploration. While this may be justified in some cases, in real business it is seldom required.

Data understanding
The next step is to look at what data is needed (and available) and to write data definitions, so that we know exactly what we are talking about. This is very important when aggregating apparently identical data, because the definitions may not be the same. Blood pressure data may look exactly the same, but it matters whether it was acquired in the ICU via an intra-arterial cannula or is a casual self-monitoring measurement taken by the patient at home. Nailing down the date format is important, especially when aggregating data from different sources: 02/03/12 can be 2nd of March 2012, 3rd of February 2012 or 3rd of December 2002. Explicitly describe any coding schemas.
⦁ Collect initial data
    ⦁ Acquire data listed in project resources
    ⦁ Report locations of data, methods used to acquire them, ...
⦁ Describe data
    ⦁ Examine "surface" properties
    ⦁ Report for example format, quantity of data, ...
    Data dictionary
    ⦁ NB: the data dictionary summarises your knowledge about each piece of data. This description can be considered part of the dataset: each piece of data comes with metadata describing its meaning, coding, context of collection etc. In many cases you will be given these descriptions along with the dataset.
⦁ Explore data
    ⦁ Examine central tendencies, distributions, ...
    ⦁ Report insights suggesting examination of particular data subsets (data selection)
⦁ Verify data quality
    ⦁ Is the data complete? (missing values)
    ⦁ Is the data correct? (integrity constraints)
    ⦁ Is the data noisy or are there outliers?
NB: this is an initial exploration, scouting the problem space. It helps you to understand what data is available and to align your approach with the business objectives and the available data. At the same time, this phase can help to verify whether the project is viable (feasibility) and to refine the project scope, budget, resources etc. This phase is very different from a typical prospective research approach, where you design the study so that you always know what data you will get.

Data preparation
Typically, the data you get is not in the right format for analysis (it was collected for other purposes, such as providing care or managing the practice) and needs to be pre-processed.
⦁ Select data
    ⦁ Relevance to the data mining goals
    ⦁ Quality of data
    ⦁ Technical constraints, e.g. limits on data volume
⦁ Clean data
    ⦁ Raise data quality if possible
    ⦁ Selection of clean subsets
    ⦁ Insertion of defaults
⦁ Construct data
    ⦁ Derived attributes (e.g. age = NOW - DOB and possibly subsequent coding of age into buckets; duration of a process from timestamps etc.)
⦁ Integrate data
    ⦁ Merge data from different sources
    ⦁ Merge data within source (tuple merging)
⦁ Format data
    ⦁ Data must conform to the requirements of the initially selected mining tools (e.g. Weka expects different input from Disco).

Modelling
This phase goes hand-in-hand with data preparation. Here you select which analytic techniques you plan to use, in which sequence etc. Once you have the analysis design, you execute it.
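Because modelling goes hand-in-hand with data preparation, it may help to see what constructing derived attributes can look like in practice. The following is only a minimal Python sketch, not a prescribed method: the function names, the age buckets and the timestamp format are assumptions made for illustration and are not taken from the assignment dataset. It covers the examples mentioned above: age from date of birth, coding age into buckets, a duration from two timestamps, and the three readings of an ambiguous date such as 02/03/12.

```python
from datetime import date, datetime


def age_in_years(dob: date, on: date) -> int:
    """Completed years between date of birth and a reference date."""
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)


def age_bucket(age: int) -> str:
    """Code an age into buckets (the cut-offs are assumed, for illustration only)."""
    if age < 16:
        return "0-15"
    if age < 65:
        return "16-64"
    return "65+"


def duration_minutes(start: str, end: str, fmt: str = "%Y-%m-%d %H:%M") -> float:
    """Minutes between two timestamps written in an explicitly stated format."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def candidate_dates(text: str) -> dict:
    """Parse one ambiguous date string under three different conventions."""
    formats = {
        "day/month/year": "%d/%m/%y",
        "month/day/year": "%m/%d/%y",
        "year/day/month": "%y/%d/%m",
    }
    return {name: datetime.strptime(text, fmt).date() for name, fmt in formats.items()}
```

Using a fixed reference date instead of "now" keeps a derived age reproducible, and stating the timestamp format explicitly is one way to record in the data dictionary how a constructed value was produced.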
⦁ Select modelling technique
    ⦁ Make the selection from the business understanding phase concrete
    ⦁ E.g., linear regression, correlation, association detection, decision tree construction…
⦁ Generate test design
    ⦁ Separate test data from training data (in case of supervised learning)
    ⦁ Define quality measures for the model
⦁ Build model
    ⦁ List parameters and chosen values
⦁ Assess model
At the end of the Data preparation-Modelling phase you have a set of results coming from the analysis (you have a model). NB: this needs to be assessed and evaluated from the technical point of view (to mitigate issues such as overfitting).

Evaluation
Here you evaluate the results (model) from the business perspective (Did we learn something new? How do the results fit into the knowledge we already have? etc.).
⦁ Evaluate results from business perspective
    ⦁ Test models on test applications if possible
⦁ Review process
    ⦁ Determine if there are any important factors or tasks that have been overlooked
⦁ Determine next steps
NB: the results of this phase either lead to a change in business understanding (and start a new cycle of business intelligence), or are viable as-is and can be deployed, or both.

Deployment
In this phase you conclude the project.
⦁ Plan deployment
    ⦁ Determine how results (discovered knowledge) are effectively used to reach business objectives
⦁ Plan monitoring and maintenance
    ⦁ Results become part of day-to-day business and therefore need to be monitored and maintained.
⦁ Final report
⦁ Project review
    ⦁ Assess what went right and what went wrong; debriefing

Caveats
The CRISP-DM framework describes the phases in a rather linear (cyclic) fashion. In reality this is an exploratory process, frequently carried out on a trial-and-error basis. You will work with the data and use frequent visualisation to “see” the patterns. Then you confirm what you “see” with more formal statistics.

General scenario
You are employed by Deloitte as a young consultant.
Deloitte was engaged to analyse emergency department (ED) data for the New South Wales Productivity Committee. The main reason for the engagement is to explore the available data for clues that could help improve the performance of EDs, especially with respect to long stays of patients at the ED. Long stays carry a risk of overcrowding. Overcrowding is associated with inferior clinical outcomes (such as mortality), as well as with reduced quality and timeliness of therapy.

Business understanding
Explore the literature on ED processes and related problems. Annotate relevant publications. Brainstorm and summarise your findings in the group. Write a brief justification of a project measuring times of arrival, triage, clinical care, and departure. This will require you to find a few resources and get a basic idea of how an emergency department operates. Your work will serve as the Introduction/Background section of your data analysis material as well as of the Final report.
Group:
⦁ Mind map of the problem (result of brainstorming)
⦁ Project justification and scope (explain what is going to be the main goal of your analysis)
Individual:
⦁ Bibliography - find, read and annotate 2 sources as a basis for group discussions and brainstorming

Analysis of data
For this part of your assignment you will be given a dataset of realistic data measuring the time points of ED processes. Your task will be to have a look at the data (with the understanding of ED processes you gained from your previous reading) and:
⦁ Extract a data dictionary from the NSW document (will be provided) and add a description of any data you construct.
⦁ Select which data you will be using for your analysis (and justify your choice)
⦁ Construct data (e.g. duration between “triage” and “seen by clinician”): justify why you need this data, and describe in detail (in the data dictionary) how you are going to construct the data point (formulas, …)
⦁ Explore the data (e.g. basic statistics, graphs…)
⦁ Comment on data quality
⦁ Make your choices of analytic methods (basic stats, process mining [Disco] etc.) and justify them (you are asked to use process mining, so think about why it is useful; other methods are up to you).
⦁ Formatting/re-formatting data: what changes need to be made for the methods you apply (statistical packages expect different input from Disco).
⦁ Write an analysis plan, to be discussed in class
⦁ Perform the analysis as you proposed it, taking into account any comments you may have received.
This part of your assignment will form your project planning, execution and results section. You will re-use the Introduction/Background section from your previous assignment. You can copy/paste it, but I suggest you try to improve it.
Group:
⦁ Data dictionary: this includes a consistent description of data, both copied/adjusted from the NSW data dictionary and description of data the group members constructed
⦁ Summary of dataset exploration
⦁ Analysis plan with justification and assignment of work to individuals
⦁ Results of analysis: summary of results/findings produced by individual group members
Individual:
⦁ Data construction (each member of the group has to do at least one data construction task)
⦁ Result of exploring the data: each group member will submit the result of their exploration of the dataset (what was done, why it was done, what was the result, what do you think about the result)
⦁ Data quality analysis (What are the problems? Can they be fixed? How will data quality influence the validity/trustworthiness of results?)
⦁ Analysis, as assigned by the group in the analysis plan
⦁ Result of analysis (what you found, what the data tells you, i.e. trends, patterns etc.; NB: this is about facts, not interpretations or opinions)

Final report
In the final report you will take the results of your analysis, interpret what you have found (comment on what the results may mean in the context of what you learned about the ED; design visualisations of your results), and write conclusions and recommendations for action, for a future analytic project, or both. Your final document will contain ALL sections from all previous parts: you copy the material over, re-organise it, and (I suggest) improve it.
Group:
⦁ Final report (includes Interpretation of findings (summary of individual work) and Recommendations)
Individual:
⦁ Interpretation of your findings (here is where you express your opinions, interpretations etc. of what you found).

Formatting
Your document is aimed at the senior management of a hospital, so adjust your style accordingly. While you have to do the assignment in 3 phases, these are stepping stones towards writing one document at the end. Write the individual parts in a way that lets you recycle them in later stages (Phase 1 → Phase 2 → Phase 3). Please do not write lengthy introductions (your audience is the senior management of a hospital!). Use references only if you need them (there is no merit in “backfilling” references). The preferred referencing format is Harvard, but you can use any other format if you use it consistently throughout the entire document. Word count: use as many words as you need. There is no penalty for exceeding the word count.

Assignments
The work described above is split into 2 assignments:

Assignment 1
In this part you will produce:
Group:
⦁ Mind map of the problem (result of brainstorming)
⦁ Project justification and scope (explain what is going to be the main goal of your analysis)
⦁ Data dictionary: this includes a consistent description of data, both copied/adjusted from the NSW data dictionary and description of data the group members constructed
⦁ Summary of dataset exploration
⦁ Analysis plan with justification and assignment of work to individuals
Individual:
⦁ Bibliography - find, read and annotate 2 sources as a basis for group discussions and brainstorming
⦁ Data construction (each member of the group has to do at least one data construction task)
⦁ Result of exploring the data: each group member will submit the result of their exploration of the dataset (what was done, why it was done, what was the result, what do you think about the result)
⦁ Data quality analysis (What are the problems? Can they be fixed? How will data quality influence the validity/trustworthiness of results?)

Assignment 2
In this part you will produce:
Group:
⦁ Compilation of analysis results: summary of results/findings produced by individual group members
⦁ Final report (includes Interpretation of findings (summary of individual work) and Recommendations)
Individual:
⦁ Analysis, as assigned by the group in the analysis plan
⦁ Result of analysis (what you found, what the data tells you, i.e. trends, patterns etc.; NB: this is about facts, not interpretations or opinions)
⦁ Interpretation of your findings (here is where you express your opinions, interpretations etc. of what you found).
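
As a closing illustration of the re-formatting task described under Analysis of data: process mining tools generally work on an event log, i.e. one row per event with a case identifier, an activity name and a timestamp, whereas routine data often arrives as one row per visit. The Python sketch below shows one possible way to reshape such data. The column names (visit_id, arrival, triage, seen_by_clinician, departure) are hypothetical; the real names must come from the NSW data dictionary, and the exact import requirements of Disco should be checked against its own documentation.

```python
import csv
import io

# Hypothetical time-point columns; replace with the names from the NSW data dictionary.
STEPS = ["arrival", "triage", "seen_by_clinician", "departure"]


def to_event_log(rows):
    """Turn one-row-per-visit records into one-row-per-event records."""
    events = []
    for row in rows:
        for step in STEPS:
            stamp = row.get(step, "")
            if stamp:  # skip missing time points rather than inventing values
                events.append(
                    {"case_id": row["visit_id"], "activity": step, "timestamp": stamp}
                )
    return events


def write_csv(events):
    """Serialise the event log as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["case_id", "activity", "timestamp"])
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


# Example with made-up values: one visit where "seen by clinician" was not recorded.
example_rows = [
    {
        "visit_id": "V1",
        "arrival": "2012-03-02 10:00",
        "triage": "2012-03-02 10:10",
        "seen_by_clinician": "",
        "departure": "2012-03-02 12:00",
    }
]
example_events = to_event_log(example_rows)
example_csv = write_csv(example_events)
```

Skipping missing time points keeps the gap visible in the event log, which is itself a data quality finding worth reporting.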