Assignment title: Information
Write a report with the following sections for the project started in Assignment 1 (2,000 words)
1 Data preparation
1.1 Select data
Task Select data
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types.
Output Rationale for inclusion/exclusion
List the data to be used/excluded and the reasons for these decisions.
1.2 Clean data
Task Clean data
Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling.
Output Data cleaning report
Describe the decisions and actions that were taken to address the data quality problems reported during the Verify Data Quality task. The report should also state which data quality issues remain outstanding if the data is to be used in the data mining exercise, and what possible effects these could have on the results.
Activities
- Reconsider how to deal with each observed type of noise: correct, remove or ignore it. Decide how to deal with special values and their meaning. Special values can give rise to many strange results and should be carefully examined. They may arise when some survey questions were not asked or not answered, which might be recorded as a value of '99' for unknown data, for example 99 for marital status or political affiliation. Special values can also arise when data is truncated, e.g. '00' for 100-year-old people or for all cars with 100,000 km on the clock. A sketch of handling such sentinel values follows this list.
- Reconsider the data selection criteria (see Task 1.1) in light of the data cleaning experience (i.e. one may wish to include/exclude other sets of data).
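For example, a minimal Python/pandas sketch of handling such sentinel values (the column names and the '99' sentinel are illustrative, not from any particular dataset):

    import numpy as np
    import pandas as pd

    # Illustrative survey data: 99 is a special value meaning
    # "not asked / not answered", not a real category or amount.
    df = pd.DataFrame({
        "marital_status": [1, 2, 99, 1, 99],
        "income": [32000, 41000, 99, 38000, 45000],
    })

    # Replace the sentinel with a proper missing-value marker first,
    # so that 99 cannot leak into the analysis as if it were real data.
    df = df.replace({"marital_status": {99: np.nan}, "income": {99: np.nan}})

    # One simple default-insertion strategy: fill numeric gaps with the median.
    # More ambitious imputation (estimation by modeling) could go here instead.
    df["income"] = df["income"].fillna(df["income"].median())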
1.3 Construct data
Task Construct data
This task includes constructive data preparation operations such as the production of derived attributes, complete new records or transformed values for existing attributes.
Output Derived attributes
Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. An example might be area = length * width (a code sketch follows the list below). Why might we need to construct derived attributes during a data mining investigation? Data taken directly from databases or other sources is not the only kind of data that can be used in constructing a model. Derived attributes might be constructed because:
- Background knowledge convinces us that some fact is important and ought to be represented, although we have no attribute currently to represent it.
- The modeling algorithm in use handles only certain types of data; for example, we are using linear regression and we suspect that there are certain non-linearities that will not be included in the model.
- The outcome of the modeling phase may suggest that certain facts are not being covered.
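A minimal pandas sketch of the ideas above, using illustrative length/width columns; the squared term shows one way to hand a suspected non-linearity to a linear model:

    import pandas as pd

    df = pd.DataFrame({"length": [2.0, 3.5, 4.0], "width": [1.0, 2.0, 2.5]})

    # Derived attribute built from existing attributes in the same record.
    df["area"] = df["length"] * df["width"]

    # A squared term lets a linear-regression model pick up a suspected
    # non-linearity that the raw attribute alone would not express.
    df["length_sq"] = df["length"] ** 2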
Activities Derived attributes
- Decide whether any attribute should be normalized (e.g. when using a clustering algorithm with age and income in lire, the income will dominate); a normalization sketch follows this list.
- Consider adding new information on the relative importance of attributes by adding new attributes (for example, attribute weights, weighted normalization).
- Decide how missing attributes can be constructed or imputed, and choose the type of construction (e.g. aggregate, average, induction).
- Add the new attributes to the accessed data.
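A hedged sketch of the normalization and weighting ideas above (the weight values are illustrative assumptions, not prescribed):

    import pandas as pd

    df = pd.DataFrame({
        "age": [25, 40, 60],
        "income": [18_000_000, 45_000_000, 90_000_000],  # e.g. income in lire
    })

    # Min-max normalization rescales both attributes to [0, 1], so income's
    # large scale no longer dominates a distance-based clustering algorithm.
    norm = (df - df.min()) / (df.max() - df.min())

    # Weighted normalization: encode a belief about relative importance
    # (here, that age matters twice as much as income).
    weights = pd.Series({"age": 2.0, "income": 1.0})
    weighted = norm * weights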
Good idea! Before adding derived attributes, try to determine whether and how they ease the modeling process or facilitate the modeling algorithm. Perhaps "income per head" is a better/easier attribute to use than "income per household." Do not derive attributes simply to reduce the number of input attributes. Another type of derived attribute is the single-attribute transformation, usually performed to fit the needs of the modeling tools.
Activities Single-attribute transformations
- Specify the necessary transformation steps in terms of the available transformation facilities (for example, changing the binning of a numeric attribute).
- Perform the transformation steps.
Hint! Transformations may be necessary to convert ranges to symbolic fields (e.g. ages to age ranges) or symbolic fields ("definitely yes," "yes," "don't know," "no") to numeric values. Modeling tools or algorithms often require them. A sketch of both transformations follows.
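A pandas sketch of both transformations from the hint, with illustrative bin edges and an illustrative numeric coding:

    import pandas as pd

    # Numeric range -> symbolic field: bin ages into age ranges.
    ages = pd.Series([17, 23, 45, 67, 81])
    age_range = pd.cut(ages, bins=[0, 18, 35, 65, 120],
                       labels=["minor", "young", "middle-aged", "senior"])

    # Symbolic field -> numeric values: map ordered answers onto a scale.
    answers = pd.Series(["definitely yes", "yes", "don't know", "no"])
    coded = answers.map({"definitely yes": 2, "yes": 1, "don't know": 0, "no": -1})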
Output Generated records
Generated records are completely new records which add new knowledge or represent data that is not otherwise represented; for example, having segmented the data, it may be useful to generate a record representing the prototypical member of each segment for further processing.
Activities Check for available techniques if needed (e.g., mechanisms to construct prototypes for each segment of segmented data).
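One such mechanism, sketched under the assumption that the data have been segmented with k-means: generate one new record per segment, its centroid, as the prototypical member.

    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.DataFrame({"age": [22, 25, 58, 61, 44],
                       "income": [21000, 24000, 80000, 85000, 50000]})

    # Segment the data, then generate a completely new record per segment:
    # the cluster centroid, representing that segment's prototypical member.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df)
    prototypes = pd.DataFrame(km.cluster_centers_, columns=df.columns)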
2 Modeling
2.1 Select modeling technique
Task Select modeling technique
As the first step in modeling, select the actual modeling technique that is to be used initially. If multiple techniques are applied, perform this task separately for each technique. It should not be forgotten that not all tools and techniques are applicable to each and every task. For certain problems, only some techniques are appropriate. Among these tools and techniques there are "political requirements" and other constraints which further limit the choice available to the miner. It may be that only one tool or technique is available to solve the problem in hand, and even then the tool may not be technically the best for that problem.
Output Modeling technique
Record the actual modeling technique that is used.
Activities Decide on the appropriate technique for the exercise, bearing in mind the tool selected.
Output Modeling assumption
Many modeling techniques make specific assumptions about the data, data quality or the data format.
Activities
- Define any built-in assumptions made by the technique about the data (e.g. quality, format, distribution).
- Compare these assumptions with those in the Data Description Report.
- Make sure these assumptions hold, and step back to the Data Preparation phase if necessary; a sketch of such a check follows.
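A minimal sketch of such a check, assuming the technique (say, linear regression) requires numeric inputs with no missing values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [34, 27, np.nan], "income": [32000, 41000, 38000]})

    # Verify two common technique assumptions before modeling; if either
    # fails, step back to the Data Preparation phase rather than model on.
    problems = []
    if not df.select_dtypes(exclude="number").empty:
        problems.append("non-numeric columns present")
    if df.isna().any().any():
        problems.append("missing values present")
    print(problems or "assumptions hold")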
2.2 Generate test design
Task Generate test design
Prior to building a model, a procedure needs to be defined to test the model's quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. The test design therefore specifies that the dataset should be separated into training and test sets: the model is built on the training set and its quality estimated on the test set.
Output Test design
Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is deciding how to divide the available dataset into training, test and validation sets.
Activities
- Check existing test designs for each data mining goal separately.
- Decide on the necessary steps (number of iterations, number of folds, etc.).
- Prepare the data required for the test; a minimal test-design sketch follows.
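A minimal scikit-learn sketch of such a test design, using a bundled dataset purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold, train_test_split

    X, y = load_iris(return_X_y=True)

    # Separate the data: the model is built on the training set and its
    # quality (e.g. error rate) estimated on the held-out test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # The number of folds is one of the design decisions to record.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)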
2.3 Build model
Task Build model
Run the modeling tool on the prepared dataset to create one or more models.
Output Parameter settings
With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice.
Activities
- Set the initial parameters.
- Document the reasons for choosing those values.
Output Models
Run the modeling tool on the prepared dataset to create one or more models.
Activities
- Run the selected technique on the input dataset to produce the model.
- Post-process the data mining results (e.g. editing rules, displaying trees); a sketch of these steps follows.
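A hedged sketch combining documented parameter settings with a model run, for one possible technique (a decision tree); the parameter values and their rationales are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Parameter settings, each documented with the reason for its value.
    params = {
        "max_depth": 3,         # shallow tree first: interpretable baseline
        "min_samples_leaf": 5,  # guards against overfitting to tiny leaves
    }

    # Run the selected technique on the prepared dataset to produce the model.
    model = DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train)

    # Post-process the result, e.g. display the induced tree as rules.
    print(export_text(model))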
Output Model description
Describe the resultant model and assess its expected accuracy, robustness and possible shortcomings. Report on the interpretation of the models and any difficulties encountered.
Activities
- Describe any characteristics of the current model that may be useful for the future.
- Record the parameter settings used to produce the model.
- Give a detailed description of the model and any special features. For rule-based models, list the rules produced plus any assessment of per-rule or overall model accuracy and coverage. For opaque models, list any technical information about the model (such as neural network topology) and any behavioral descriptions produced by the modeling process (such as accuracy or sensitivity). Describe the model's behavior and interpretation.
- State conclusions regarding patterns in the data (if any); sometimes the model reveals important facts about the data without a separate assessment process (e.g. that the output or conclusion is duplicated in one of the inputs).
2.4 Assess model
Task Assess model
The model should now be assessed to ensure that it meets the data mining success criteria and passes the desired test criteria. This is a purely technical assessment based on the outcome of the modeling tasks.
Output Model assessment
Summarize results of this task, list qualities of generated models (e.g. in terms of accuracy) and rank their quality in relation to each other.
Activities
- Evaluate the results with respect to the evaluation criteria.
- Test the results according to a test strategy (e.g. train-and-test, cross-validation, bootstrapping, etc.).
- Compare evaluation results and interpretations. Create a ranking of results with respect to the success and evaluation criteria. Select the best models. Interpret the results in business terms (as far as possible at this stage).
- Get comments on the models from domain or data experts. Check the plausibility of each model and its impact on the data mining goal. Check the model against the given knowledge base to see whether the discovered information is novel and useful. Check the reliability of the results. Analyze the deployment potential of each result. If there is a verbal description of the generated model (e.g. via rules), assess the rules: are they logical, are they feasible, are there too many or too few, do they offend common sense? Assess the results, and seek insight into why a certain modeling technique and certain parameter settings lead to good or bad results.
Good idea! "Lift tables" and "gain tables" can be constructed to determine how well the model is predicting; a sketch follows.
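A hedged sketch of a simple gain/lift table by score decile, on synthetic scores (in practice the scores would come from the model under assessment):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    score = rng.random(1000)                         # model's predicted scores
    actual = (rng.random(1000) < score).astype(int)  # synthetic true outcomes

    df = pd.DataFrame({"score": score, "actual": actual})
    df["decile"] = pd.qcut(df["score"], 10, labels=False)

    # Work from the highest-scoring decile downwards.
    table = (df.groupby("decile")["actual"].agg(["count", "sum"])
               .sort_index(ascending=False))
    # Gain: cumulative share of all positives captured so far.
    table["gain"] = table["sum"].cumsum() / table["sum"].sum()
    # Lift: response rate in the decile relative to the overall rate.
    table["lift"] = (table["sum"] / table["count"]) / df["actual"].mean()
    print(table)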
Output Revised parameter settings
According to the model assessment, revise the parameter settings and tune them for the next run of the Build Model task. Iterate model building and assessment until you find the best model.
Activities Adjust the parameters to give a better model; one systematic way to do this is sketched below.
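One way to iterate build-and-assess systematically is a cross-validated grid search over parameter settings; a minimal sketch (grid values illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Each grid point is one "revised parameter setting"; the search
    # iterates model building and assessment and keeps the best model.
    grid = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 20]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)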
3 Evaluation
Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. It compares results with the evaluation criteria defined at the start of the project. A good way of defining the total outputs of a data mining project is to use the equation:
RESULTS = MODELS + FINDINGS
In this equation we are defining that the total output of the data mining project is not just the models (although they are, of course, important) but also findings, which we define as anything (apart from the model) that is important in meeting the objectives of the business, or important in leading to new questions, lines of approach or side effects (e.g. data quality problems uncovered by the data mining exercise). Note: although the model is directly connected to the business questions, the findings need not be related to any questions or objective, but they are important to the initiator of the project.
3.1 Evaluate results
Task Evaluate results
Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Another option is to test the model(s) on test applications in the real application, if time and budget constraints permit. Moreover, evaluation also assesses the other data mining results generated. Data mining results cover both models, which are necessarily related to the original business objectives, and all other findings, which are not necessarily related to the original business objectives but may unveil additional challenges, information or hints for future directions.
Output Assessment of data mining results with respect to business success criteria
Summarize the assessment results in terms of the business success criteria, including a final statement on whether the project already meets the initial business objectives.
Activities
- Understand the data mining results.
- Interpret the results in terms of the application.
- Check the impact on the data mining goal.
- Check the data mining results against the given knowledge base to see whether the discovered information is novel and useful.
- Evaluate and assess the results with respect to the business success criteria, i.e. has the project achieved the original business objectives?
- Compare evaluation results and interpretations.
- Create a ranking of results with respect to the business success criteria.
- Check the impact of the results on the initial application goal. Are there new business objectives to be addressed later in the project or in new projects?
- State conclusions for future data mining projects.
Output Approved models
After assessing the models with respect to the business success criteria, the generated models that meet the selected criteria become the approved models.
3.2 Review process
Task Review process
At this point the resultant model appears to be satisfactory and to satisfy business needs. It is now appropriate to make a more thorough review of the data mining engagement in order to determine whether any important factor or task has somehow been overlooked. At this stage of the data mining exercise, the process review takes the form of a quality assurance review.
Output Review of process
Summarize the process review and give hints for activities that have been missed and/or should be repeated.
Activities
- Give an overview of the data mining process used.
- Analyze the data mining process; for each stage, ask: Was it necessary in retrospect? Was it executed optimally? In what ways could it be improved?
- Identify failures.
- Identify misleading steps.
- Identify possible alternative actions and unexpected paths in the process.
- Review the data mining results with respect to the business success criteria.
3.3 Determine next steps
Task Determine next steps
According to the assessment results and the process review, the project team decides how to proceed at this stage: whether to finish this project and move on to deployment, to initiate further iterations, or to set up new data mining projects.
Output List of possible actions
List possible further actions along with the reasons for and against each option.
Activities
- Analyze the deployment potential of each result.
- Estimate the potential for improvement of the current process.
- Check the remaining resources to determine whether they allow additional process iterations (or whether additional resources can be made available).
- Recommend alternative continuations.
- Refine the process plan.
Output Decision
Describe the decision as to how to proceed along with the rationale.
Activities
- Rank the possible actions.
- Select one of the possible actions.
- Document the reasons for the choice.