Data Mining Server Event Logs for Unknown Patterns

CP3300 Data Mining Server Event Logs for Unknown Patterns

Contents

Abstract
Introduction
Related work
Methods
    Information Domain
    Algorithm Selection
        Apriori
        PredictiveApriori
        Generalised Sequential Patterns
    Pre-processing - Data Cleaning, Integration, Transformation, and Reduction
    Data Mining Processing
        File Import
        Individual Server Event Log Analysis
        Combined Server Event Log Analysis
        Algorithm Settings
Results
    Individual Server Event Logs
        Temporal Patterns
        Sequential Patterns
    Cross Server Event Log Analysis
        Basket Analysis
        Sequential Patterns
Discussion
Issues
Conclusion
Future Work
References
APPENDIX A.
APPENDIX B.

Abstract

This project mined the system event logs from 17 Windows servers in order to discover previously unknown patterns in event generation within each server, and between servers. Each server system event log was exported to its own comma-separated values (CSV) file. The file was then cleaned, trimmed and sorted into a standard format. All CSV files were concatenated into one master file where further cleaning was performed. Individual server files were generated from the master file for individual server analysis and the master file was retained for cross server analysis; variations on these files were also created. Three Association Rule Mining (ARM) algorithms were chosen for the analysis in order to broaden the results and to gain a better understanding of the data mining tool and algorithms. The algorithms Apriori, Predictive Apriori, and Generalised Sequential Patterns were used within the WEKA data mining tool.

The knowledge discovery process produced both uninteresting and interesting patterns. The uninteresting patterns included scheduled tasks, system restarts, expected service state changes, Microsoft updates, IIS application pool recycling, and coincidental associations.
The interesting patterns revealed by the data mining process included resource allocation problems, software bugs and time synchronisation errors. Some problems were not reported by the software applications involved. For example, the scheduled virus scans on two primary servers were approaching resource limitations, which impacted the effectiveness of the scans. Likewise, the backup software on a Microsoft SharePoint server failed to report problems with its granular restore technology that would affect the way individual files are restored. Furthermore, time synchronisation errors occurring across multiple servers could affect resource allocation and authentication; this requires further investigation.

As organisations become more complex, so too do the systems that support them. Event log analysis can be time consuming if performed manually due to the volume of data that needs to be examined. Furthermore, it is often difficult to identify patterns within a server, let alone between servers, in order to recognise emerging problems or predict imminent failures. As a result, event logs are normally used as a reactive tool. Although some tools are available for event log analysis, these are often expensive and do not provide predictive analysis, so they also become reactive tools. Data mining, and in particular ARM, could be a cost-effective and proactive method of monitoring server events.

Introduction

Knowledge Discovery in Data (KDD) is the end-to-end process of identifying data for analysis, preparing the data for analysis (by way of selecting, cleaning, integrating, and/or transforming the data), mining the data for patterns, then presenting and evaluating the results (Han, Kamber & Pei 2012, ch.1). KDD requires that the computations be non-trivial, that the resulting patterns be valid and applicable to unseen data, and that the patterns be novel, understandable, and of some value (Lee 2013).
As mentioned above, data mining is the component of KDD that is used to identify patterns in the data. Data mining has a number of functionalities belonging to two primary categories: descriptive and predictive (Han, Kamber & Pei 2012, ch.1). Firstly, clustering is a descriptive mining method that groups items so that items within a cluster share similar characteristics and items in different clusters have dissimilar characteristics. Secondly, classification is a predictive mining method that predicts unknown class attributes using models, or rules, generated by analysing the data attributes of items whose class is known. Thirdly, characterisation and discrimination describe classes and concepts in summarised form. Fourthly, outlier analysis identifies objects that do not conform to normal object behaviour. Lastly, ARM is a predictive mining method that identifies frequently occurring items and the associations between them in order to create rules for predicting future occurrences (Han, Kamber & Pei 2012, pp. 15-21).

ARM will be used in this project to identify associations between frequently occurring event itemsets. Furthermore, ARM will also be used to find sequential associations of itemsets.

Microsoft Windows event logs are used to record significant events. Early versions of Windows supported three types of event logs: application, system and security. In addition to these three, newer versions of Windows support the creation of event log types for specific applications and services. This project focusses on the system event logs because these logs are generated by the operating system and system services; significant security or application events will also appear in the system event logs.
The aims of this project are to identify three kinds of pattern: event generation patterns within a server, to uncover associations or dependencies between events; temporal event generation patterns, to find recurring behaviour; and event generation patterns between servers, to identify associations or dependencies between services on different machines.

Firstly, event logs will be reviewed to gain an understanding of the data. Secondly, the data to be analysed will be extracted from the servers and cleaned, appropriate attributes will be selected, and the data will be transformed into the required format. All server data will then be integrated and cleaned further. Thirdly, the integrated data will be separated into individual server files and a master file for mining. Fourthly, WEKA will be used to mine the data using three different Association Rule Mining (ARM) algorithms. The WEKA user manual by Bouckaert et al. (2013) provided the required level of operational information. The three algorithms employed will be Apriori, Predictive Apriori, and Generalised Sequential Patterns. Lastly, if any patterns are found then they will be analysed for interestingness and reported. This process is shown in Figure 1.

Event log analysis can be time consuming if performed manually due to the volume of data that needs to be examined; this project’s sample event logs exceeded 348,000 records. Furthermore, it is often difficult to identify patterns within a server, let alone between servers, in order to recognise emerging problems or predict imminent failures. Because of this, event logs are normally used as a reactive tool. The project site is configured to e-mail significant events to staff in real time. If staff were trained to recognise event types that precede major incidents, system problems could potentially be rectified before they cause disruption.
[Figure 1: Server Event Log Data Mining Process. Flowchart: the event logs of 17 servers are exported to CSV; cleaned, transformed and have attributes selected; then integrated and cleaned. The result is separated into single server event logs and a trimmed combined server event log, each also produced in a variant with the time normalised into 14.25 minute ranges. Single server files have the time and event attributes selected and are mined using the Apriori, Predictive Apriori and Generalised Sequential Patterns algorithms. In the combined files the server and event attributes are concatenated, as are the date and time attributes; these files are mined using the Generalised Sequential Patterns algorithm, then a pivot table is created and the Basket format extracted for mining using the Apriori and Predictive Apriori algorithms.]

Related work

Previous work in this area has focussed on accurately mining the large amounts of log data generated by today’s information systems. All works reviewed recognise the need to automate this process due to the exhaustive, and often unrealistic, task humans face in complete event log examination.

Peng, Li and Ma (2005) focus on the categorisation accuracy of three data mining algorithms (Naive Bayes, Modified Naive Bayes, and the Hidden Markov Model) when applied to system log files. The authors recognise the difficulties involved in mining logs presented in different formats by each system vendor and software product. The paper highlights the potential for greater categorisation accuracy when using text and temporal data. The article then proceeds to examine the effectiveness of the algorithms when used with the text and temporal data. Interestingly, the categorisation results are presented graphically for human analysis, rather than being mined for association rules, which affords reactive rather than predictive analysis.
The authors acknowledge this limitation, along with the problem of having to declare the number of categories and the potential for a large number of categories. This project differs in that event prediction will be a primary goal, although historical pattern analysis will also be performed.

Liang, Zhang and Xiong (2007) use classification mining of failure logs containing temporal data to predict failures within time windows. Four classification algorithms (RIPPER, Support Vector Machines, classic Nearest Neighbour, and a customised Bi-Modal Nearest Neighbour Predictor) were applied to the data to compare their accuracy. The Bi-Modal Nearest Neighbour Predictor algorithm out-performed all other algorithms. The solution differs from this project in that it was applied to logs of fixed format from one specific system. Furthermore, the aim was to give advance warning of an event occurring within time windows of six and twelve hours. This project examines event logs across multiple system types of varying attributes and data formats. It also differs in that it looks to identify events that precede a failure rather than predicting the time of a failure.

Fullop, Gainaru & Plutchak (2012) describe a solution for mining supercomputer event logs. The solution uses clustering to group historical events by subsystem type, reducing the incidence of coincidental associations between subsystem types. The granularity of the clustering is configurable. The initial clustering builds templates that describe the contained events. Real-time events are then compared to the templates immediately, and cluster definitions can be modified or further clusters created if necessary. The clusters are then examined for frequent itemsets, producing a graph describing system behaviour and an event relationship table. Event relationships are used to predict how long after one particular event another event will occur.
Similarly, the paper discusses “event chains”, which are sequences of events leading up to a significant event. One of the primary objectives was not to let events go undetected, so accuracy was paramount. The solution presented in the paper is sophisticated and designed for supercomputers, but it does support multiple supercomputer types. It has methods to deal with noise, outliers, and coincidental events. In contrast, this project uses simple techniques that are within the abilities of most system administrators and is aimed at Windows servers; however, the project does aim to predict associations between events.

Methods

Information Domain

The project site operates 17 Windows servers in a domain configuration. Important events are e-mailed to ICT staff in real time; event selection for e-mailing is configurable. If an association between events could be found that gave ICT staff sufficient time to rectify an imminent problem before it occurred, then service availability could be enhanced. For example, if data mining revealed that event-x predominantly occurred 20 minutes before event-y, then the system could be configured to notify ICT staff when event-x occurred, and they could then respond in a way that prevents event-y from occurring.

Similarly, this project will also explore frequent associations of event log generation between Windows systems operating in a domain environment. As with events occurring within a server, the question is whether, if event-x occurs on server-x, there is a high probability that event-y will then occur on server-y. Likewise, if an action performed by staff on server-x causes an event on server-y then such actions could be scheduled to reduce impact. Furthermore, temporal event patterns can be difficult to identify when dealing with large or multiple datasets, so mining for recurring temporal patterns will also be performed.
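The event-x and event-y scenario above can be illustrated with a minimal Python sketch. It is not part of the project's WEKA workflow; the function name, the (timestamp, event) tuple layout and the sample timestamps are all hypothetical, and only the 20 minute window comes from the text. The sketch estimates how often one event is followed by another within a time window, which is the kind of evidence that would justify an early-warning notification.

```python
from datetime import datetime, timedelta


def precedence_confidence(events, antecedent, consequent,
                          window=timedelta(minutes=20)):
    """Fraction of `antecedent` occurrences that are followed by
    `consequent` within `window`.

    `events` is a list of (timestamp, event_name) tuples sorted by
    timestamp (a hypothetical layout, not the WEKA file format).
    """
    total = 0
    hits = 0
    for i, (t_x, name_x) in enumerate(events):
        if name_x != antecedent:
            continue
        total += 1
        # Scan forward until the window is exceeded.
        for t_y, name_y in events[i + 1:]:
            if t_y - t_x > window:
                break
            if name_y == consequent:
                hits += 1
                break
    return hits / total if total else 0.0


# Hypothetical sample log: event-x occurs three times, and event-y
# follows within 20 minutes on two of those occasions.
sample_events = [
    (datetime(2013, 5, 1, 9, 0), "event-x"),
    (datetime(2013, 5, 1, 9, 18), "event-y"),
    (datetime(2013, 5, 1, 12, 0), "event-x"),
    (datetime(2013, 5, 1, 13, 0), "event-y"),
    (datetime(2013, 5, 2, 9, 0), "event-x"),
    (datetime(2013, 5, 2, 9, 5), "event-y"),
]

confidence = precedence_confidence(sample_events, "event-x", "event-y")
print(f"event-x -> event-y within 20 minutes: {confidence:.2f}")
```

A high value on real data would suggest that notifying staff on event-x is worthwhile; a low value would suggest that the pairing is coincidental.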
Algorithm Selection

Apriori

The Apriori algorithm finds frequent intra transaction itemsets within a dataset (Shweta & Garg 2013). The algorithm passes over a dataset to find all frequent one-sequence itemsets that satisfy the minimum support. Candidate two-sequence itemsets are then created by forming all possible two-sequence combinations of the frequent one-sequence itemsets. For example, if A, B, and C are frequent one-sequence itemsets then the candidate sets will be A->B, A->C, and B->C (Shweta & Garg 2013). The dataset is then scanned to see whether the candidate two-sequence itemsets occur frequently enough to satisfy the minimum support. For example, the two-sequence itemsets A->B, A->C and B->C may all be found to be frequent. The frequent two-sequence itemsets are then used to generate candidate three-sequence itemsets, which are confirmed in another dataset pass. For example, a three-sequence itemset of A, B, C could be found (Shweta & Garg 2013). This process continues until no further frequent itemsets are found.

There are two main parameters used by the Apriori algorithm: support, which measures the probability of an itemset being contained in a dataset; and confidence, or interestingness, which measures the probability of an itemset containing value Y if it contains value X. By default, Apriori will generate the best 10 rules found within the set parameters. The algorithm decreases support until it has found “n” rules or the minimum support value has been reached (Shweta & Garg 2013). The minimum support value is configurable. This algorithm was chosen because it forms the basis of a number of ARM algorithms.

PredictiveApriori

Predictive Apriori is similar to Apriori but calculates confidence differently (Shweta & Garg 2013).
Where Apriori confidence is the accuracy of an association rule on the training data, that is, the probability that an itemset containing value X will also contain value Y, Predictive Apriori calculates the expected, or predicted, accuracy of an association rule (Shweta & Garg 2013). To achieve this, the Predictive Apriori algorithm increases support from zero when calculating its rules, whereas the Apriori algorithm decreases support from maximum to minimum, until “n” rules have been found (Shweta & Garg 2013). This algorithm was chosen as a comparison to the Apriori algorithm and to increase the likelihood of finding patterns.

Generalised Sequential Patterns

This is an Apriori based algorithm (Mooney & Roddick 2013). The Generalised Sequential Patterns (GSP) algorithm finds inter transaction frequent itemsets within a dataset. Furthermore, where the standard Apriori algorithm would generate a single three-sequence itemset of A, B, C, the GSP algorithm would generate combinations of the items in varying order. For example, it could generate the three-sequences A->B->C, A->C->B and A->BC from the two-sequences above (Mooney & Roddick 2013). This algorithm was chosen because of its ability to identify sequences among itemsets within a dataset, which is thought to be useful for finding interdependencies among services between machines as well as services within a machine.

Pre-processing - Data Cleaning, Integration, Transformation, and Reduction

System event logs from 17 Windows servers were exported to individual CSV files. The Windows Server operating systems included versions 2003R2, 2008, 2008R2 and 2012, which resulted in file formats that differed in attribute types and characteristics. Each file’s columns and attributes were adjusted to a common format.
For example, the Windows 2003 R2 servers concatenate their date and time attributes into one attribute; these had to be separated because time alone was a required attribute. Other servers presented their date and time in different formats, such as dd/mm/yy instead of dd/mm/yyyy, and time in hh:mm as opposed to HH:mm:ss. Furthermore, Windows 2003 R2 exports include additional fields, which were removed. All event logs were then concatenated into a master file and further cleaned to remove special or hidden characters and in-text commas. Notepad++ was selected for dealing with special or hidden characters because of its ability to easily find and replace these, whereas Excel was selected as the best tool for handling dates. Both tools were also able to handle the volume of data, which exceeded 348,000 records.

Once the event logs were cleaned, individual server event logs were created from the master file. Additionally, a trimmed version of the master file, going back 12 months, was created for “between server data mining”. This file was created to reduce processing time and because of data relevance: logs older than 12 months may not be relevant due to system changes.

Further files were created to enable familiarisation with the WEKA tool and ARM data mining algorithms. For example, individual server and combined server event log files were created with a normalised time field that was divided into 14.25 minute ranges to capture frequent patterns that occurred within a time frame. Additionally, the date and time attributes, and likewise the server and event attributes, were merged for GSP algorithm analysis. This file was then transformed into a Basket format for analysis with the Apriori and Predictive Apriori algorithms. A zipped copy of the files imported into WEKA is embedded in Appendix A.
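The transformations described above were carried out with Notepad++ and Excel. As an illustration only, the following Python sketch shows one way the same steps could be scripted: parsing the mixed date and time formats, assigning each event to a 14.25 minute range, grouping events into baskets, and measuring support and confidence for pairs of items. The row layout, server names and event IDs are hypothetical, the ranges are assumed to be counted from midnight, and the pair counting is a simplified stand-in for the first two passes of Apriori rather than WEKA's implementation.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations

BIN_SECONDS = int(14.25 * 60)  # one 14.25 minute range = 855 seconds

# Formats named in the text: dd/mm/yyyy or dd/mm/yy, HH:mm:ss or hh:mm.
# Four-digit years are tried first so that "2013" is not misread.
TIMESTAMP_FORMATS = (
    "%d/%m/%Y %H:%M:%S",
    "%d/%m/%y %H:%M:%S",
    "%d/%m/%Y %H:%M",
    "%d/%m/%y %H:%M",
)


def parse_timestamp(date_text, time_text):
    """Parse a date and time given in any of the known formats."""
    raw = f"{date_text} {time_text}"
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised timestamp: {raw!r}")


def time_bin(ts):
    """Index of the 14.25 minute range, counted from midnight (assumption)."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return seconds // BIN_SECONDS


def build_baskets(rows):
    """Group 'server:event_id' items by (date, time range)."""
    baskets = defaultdict(set)
    for server, date_text, time_text, event_id in rows:
        ts = parse_timestamp(date_text, time_text)
        baskets[(ts.date(), time_bin(ts))].add(f"{server}:{event_id}")
    return baskets


def pair_rules(baskets, min_support=0.0, min_confidence=0.7):
    """Return (x, y, support, confidence) for rules x -> y over item pairs."""
    n = len(baskets)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        for pair in combinations(sorted(items), 2):
            pair_count[pair] += 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


# Hypothetical rows: (server, date, time, event_id).
sample_rows = [
    ("SRV01", "01/05/2013", "09:02:00", 1001),
    ("SRV02", "01/05/13", "09:05", 1001),
    ("SRV01", "01/05/2013", "12:00:00", 2002),
    ("SRV01", "02/05/2013", "09:03:00", 1001),
    ("SRV02", "02/05/2013", "09:10:30", 1001),
    ("SRV02", "02/05/2013", "15:00:00", 3003),
]

baskets = build_baskets(sample_rows)
rules = pair_rules(baskets)
for x, y, support, confidence in sorted(rules):
    print(f"{x} -> {y}  support={support:.2f}  confidence={confidence:.2f}")
```

The 0.7 confidence threshold mirrors the cut-off used later in this report. Fixed ranges have a known weakness: two events a minute apart can fall either side of a range boundary and so never share a basket.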
Data Mining Processing

File Import

The event log dataset files will be opened directly in WEKA (Figure 2) using the Explorer application’s open file option on the “preprocess” tab (Figure 3). The required file can be imported by browsing to the appropriate directory and selecting “files of type” “CSV data files (*.CSV)” or “ARFF data files”, described in arff data format (2013) (Figure 4). An unsupervised attribute filter (Figure 5) will be applied to the event_id to convert it from field type numeric to nominal (Figure 6). This is appropriate because the event_id has no numerical significance and is merely a label.

Figure 2. The WEKA main screen showing the Explorer application option.

Figure 3. WEKA Explorer application’s preprocess screen showing the open file option.

Figure 4. WEKA open file option showing the method of selecting the file type.

Figure 5. WEKA Preprocess screen showing where to choose a filter, the chosen filter and data type before the filter is applied.

Figure 6. WEKA Preprocess screen showing the applied filter.

Individual Server Event Log Analysis

Each server’s event log files will have each of the above ARM algorithms applied. The algorithms will be selected from the WEKA associate screen’s associator choose option (Figure 7). The rules and patterns resulting from the analysis will be compared to each other as well as examined for interesting and uninteresting patterns.

Figure 7. WEKA associate screen and choose option for selecting the required algorithm.

Combined Server Event Log Analysis

Trimmed versions of the concatenated server master file will be used, in which the server name and event ID are concatenated into one attribute and the date and time are concatenated into another. Firstly, a standard version will be analysed with the GSP algorithm.
Secondly, a version of the standard file with the date/time normalised and divided into 14.25 minute ranges will also be analysed with the GSP algorithm. Thirdly, the standard file will be transformed into the Basket format and have the Apriori and Predictive Apriori algorithms applied. Lastly, a Basket format version with the date/time normalised and divided into 14.25 minute ranges will also have the Apriori and Predictive Apriori algorithms applied. The rules and patterns resulting from the analysis will be compared and examined for interesting and uninteresting patterns.

Algorithm Settings

An algorithm’s settings are accessed by clicking the algorithm name in the choose field shown in Figure 8. The Apriori algorithm will be configured to use a minimum support value of zero to allow for direct comparison with the Predictive Apriori algorithm. Similarly, the confidence will be lowered as far as 0.7 if a sufficient number of rules is not found. Additionally, the number of rules returned by Apriori will be increased from 10 to 100, again to allow a direct comparison with the Predictive Apriori algorithm. The figure of 100 rules was also chosen because rules containing certain event types, such as events that record system uptime, can be immediately discarded. Figure 9 shows the settings adjustment screen for the Apriori algorithm in WEKA.

Figure 8. Left mouse clicking the algorithm name will present the settings for that algorithm.

Figure 9. Apriori settings adjustment screen in WEKA.

The Predictive Apriori algorithm will use the standard settings; however, rules with an accuracy below 0.7 will be discarded, in line with the Apriori results. The Generalised Sequential Patterns algorithm’s minimum support value (Figure 10) will be adjusted as required so that it returns enough results for evaluation without being so low that processing time is excessive.
A setting greater than 0 will eliminate single instance sequences but still return infrequent, and potentially interesting, sequences that can occur with periodic issues.

Figure 10. Generalised Sequential Pattern minimum support setting.

Results

The Apriori and Predictive Apriori algorithms consistently produced identical results for the first 12 to 15 rules generated, with minimal difference in rules from that point on. All algorithms returned fewer rules, and sometimes none, on datasets containing the normalised time range than on datasets containing the standard time; confidence levels were also lower. A zipped copy of the full results produced by WEKA is embedded in Appendix B.

Individual Server Event Logs

Temporal Patterns

The Apriori and Predictive Apriori algorithms identified both uninteresting and interesting frequent itemsets of a temporal nature. Uninteresting frequent temporal itemsets included regular events like system uptime being reported every day; web services suspending due to inactivity; web services starting again when activity commences; scheduled tasks such as individual backup jobs; Microsoft updates; IIS Application Pool recycling; and scheduled restarts. Interesting, previously unknown frequent temporal itemsets included: excessive resource depletion during scheduled virus scanning tasks on the SharePoint and mail servers; time synchronisation problems across all servers; and granular backup errors, not presented by the backup software, on a SharePoint server.

Sequential Patterns

Although some uninteresting sequential patterns were found, such as terminal services related printer mapping errors for administrators, no interesting sequential patterns were found with the Generalised Sequential Patterns algorithm when applied to the datasets used in this project.
Cross Server Event Log Analysis

Basket Analysis

Concurrent time synchronisation errors were identified across all servers.

Sequential Patterns

Although some uninteresting sequential patterns were found, these are believed to be coincidental rather than related. No interesting sequential patterns were found with the Generalised Sequential Patterns algorithm when applied to the datasets used in this project.

Discussion

The Apriori and Predictive Apriori algorithms identified both uninteresting and interesting frequent itemsets of a temporal nature. Uninteresting frequent temporal itemsets included regular events like daily system uptime, web services suspending due to inactivity, web services starting again when activity commences, scheduled tasks such as individual backup jobs, Microsoft updates, IIS Application Pool recycling, and scheduled system restarts. Future event log mining may benefit from excluding these types of events from the dataset. This could reduce the processing time and increase the support of interesting event types by reducing noise. This project did not exclude frequent uninteresting events in case they were the cause of interesting events.

Interesting, previously unknown frequent temporal itemsets included excessive resource depletion during scheduled virus scanning tasks on the SharePoint and mail servers, time synchronisation problems for all servers, and granular backup errors (not presented by the backup software) on a SharePoint server. The virus scan resource issues have been rectified by adding memory to the affected servers. The granular SharePoint backup problem has also been rectified with a patch from the vendor. As discussed below, the time synchronisation problems warrant further investigation.
The Generalised Sequential Patterns algorithm did not produce any interesting patterns when applied to either individual server datasets or concatenated server datasets. However, some uninteresting patterns were found. Uninteresting patterns found within each server were mainly composed of events generated at system start-up or shutdown, and uptime informational events. All reported inter server Generalised Sequential Patterns are believed to be coincidental and not related. These inter server patterns consisted mainly of uninteresting events common to all individual server sequential patterns. One problem with mining large amounts of data (and one of the reasons why individual server files were created and mined) is that the likelihood of coinciding events increases simply because there is more data. Fullop, Gainaru & Plutchak (2012) used a multi-staged approach where the event data was mined into clusters first and then association rules were mined. This reportedly helped reduce some of the problems associated with mining large quantities of data, and a similar approach could be applied to this project in the future.

Basket analysis revealed concurrent time synchronisation errors across all servers. This condition will require further investigation but may be caused by problems such as an unstable network connection to the external time servers, an over-utilised time server, virtual host time interference, or unstable system clocks. Time synchronisation problems can cause authentication issues and some application issues, as well as virtual memory and CPU over-commitment.

Issues

A number of issues were encountered during the project. Physical memory limitations became a problem when working with large datasets, which generated the error in Figure 11.
To overcome this problem, a virtual machine was created with a 64-bit operating system, which removes the 4 GB addressable memory limit of 32-bit systems, and was configured with 64 GB of RAM. Additionally, the RunWeka.ini file located in the WEKA program files directory was adjusted by setting the “maxheap” parameter to “32768M”. Similarly, memory limitations were experienced with the 32-bit version of Microsoft Excel (Figure 12), so the 64-bit version of Office was installed on the project computer to overcome this problem.

Figure 11. WEKA memory error.

Figure 12. Excel insufficient memory error.

A number of WEKA “.arff” files were created using a date type attribute, as described in ‘ARFF Data Format’ (2013). These files could not be processed using the WEKA association algorithms, and this will require further detailed investigation; a large number of filtered associators were applied without success.

When multiple attributes were used with large datasets, WEKA would stop processing (the WEKA status bird would sit down) although the application still indicated that it was processing (the Start button remained greyed out). This problem was overcome by concatenating some attributes in order to reduce the number of attributes to two. Further work may benefit from the application of different data mining tools.

The frequency of uninteresting events meant that the minimum support had to be set quite low when mining for association rules. Unfortunately, this meant that the quality of the rules returned suffered. Similarly, the volume of data significantly increased processing time, and some jobs had to be left to run overnight.
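The attribute concatenation workaround described above can be sketched in Python as follows. The report does not state which attributes were combined, so the pairing used here (date with hour, and server with event ID) is an assumption, as are the sample rows, the attribute names time_slot and server_event, and the relation name. The output follows the general ARFF layout of a relation line, nominal attribute declarations, and a data section.

```python
# Hypothetical (server, date, hour, event_id) rows; not from the project dataset.
ROWS = [
    ("web01", "2013-08-01", 2, 6013),
    ("mail01", "2013-08-01", 2, 129),
    ("sp01", "2013-08-01", 3, 2004),
]


def concatenate_attributes(rows):
    """Collapse four-field rows into two nominal values per row."""
    return [
        (f"{date}_{hour:02d}", f"{server}_{event_id}")
        for server, date, hour, event_id in rows
    ]


def _quote(value):
    """Wrap a nominal value in single quotes for ARFF output."""
    return f"'{value}'"


def to_arff(pairs, relation="events"):
    """Render two-attribute rows as ARFF text with nominal attributes."""
    slots = sorted({slot for slot, _ in pairs})
    items = sorted({item for _, item in pairs})
    lines = [
        f"@relation {relation}",
        "@attribute time_slot {" + ",".join(map(_quote, slots)) + "}",
        "@attribute server_event {" + ",".join(map(_quote, items)) + "}",
        "@data",
    ]
    lines += [f"{_quote(slot)},{_quote(item)}" for slot, item in pairs]
    return "\n".join(lines)


arff_text = to_arff(concatenate_attributes(ROWS))
print(arff_text)
```

Each data row then carries only two nominal values, at the cost of a larger set of distinct values per attribute. Whether this reduces the processing problem for a given dataset would need to be checked in WEKA itself.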
Conclusion

The primary objective of this project was to identify three kinds of pattern in server event logs: patterns within each server's event log, to uncover associations or dependencies between events; temporal patterns in event generation, to find recurring events; and patterns in event generation between servers, to identify associations or dependencies between services. This objective was partially realised.

The Apriori and Predictive Apriori algorithms identified both uninteresting and interesting frequent itemsets of a temporal nature. Uninteresting frequent temporal itemsets included regular events like daily system uptime, web services suspending due to inactivity, web services starting again when activity commences, scheduled tasks such as individual backup jobs, Microsoft updates, IIS Application Pool recycling, and scheduled system restarts. Interesting, previously unknown frequent temporal itemsets included excessive resource depletion during scheduled virus scanning tasks on the SharePoint and mail servers, time synchronisation problems across all servers, and granular backup errors on a SharePoint server.

The Generalised Sequential Patterns algorithm did not produce any interesting patterns when applied to either individual server datasets or concatenated server datasets. However, some uninteresting patterns were found. Uninteresting patterns found within each server were mainly composed of events generated at system start-up or shutdown, and uptime informational events. All reported inter-server event patterns are believed to be coincidental and not related.

Basket analysis revealed concurrent time synchronisation errors across all servers. This condition will require further investigation.

This project has shown that server event logs can be mined for interesting patterns. More interesting patterns may be found with further refinements to the KDD process.
This type of work can improve the ability to monitor large volumes of computer system events and to predict incidents.

Future Work

Although this work focussed on the system name, date, time, and event ID, other attributes were processed during the data pre-processing stage. An attribute of particular interest is the event description because it contains keywords like application names, system parameters, service names, and state information. The event ID is a single value, whereas the description text may contain many values capable of associating an event with other events. This may increase the possibility of finding further frequent patterns. Future work will involve analysing the text for keywords and then identifying associations between events in a manner similar to what has already been done in this project.

As an extension of this project, on completion of CP1300 this semester, I would like to adjust the data mining algorithms used above to learn and filter out uninteresting and frequently recurring events. This should reduce noise and coincidental associations, and increase the likelihood of identifying interesting patterns should they exist. Similarly, a multi-stage approach could be taken in which data mining is itself used as one of the pre-processing steps. For example, clustering could be used as a method of data reduction by supporting the removal of clusters of uninteresting events that contain uninteresting keywords.

References

Han, J, Kamber, M & Pei, J 2012, Data Mining: Concepts and Techniques, 3rd edn, Morgan Kaufmann, Waltham, MA.
Lee, K 2013, Week 1 Lecture – Intro to DM, lecture PowerPoint slides, viewed 01 August 2013, https://learnjcu.jcu.edu.au/webapps/portal/frameset.jsp?tab_tab_group_id=_312_1&url=%2Fwebapps%2Fblackboard%2Fexecute%2Flauncher%3Ftype%3DCourse%26id%3D_50648_1%26url%3D

Peng, W, Li, T & Ma, S 2005, ‘Mining Logs Files for Data-Driven System Management’, ACM SIGKDD Explorations Newsletter - Natural language processing and text mining, vol. 7, iss. 1, pp. 44-51.

Liang, Y, Zhang, Y, Sahoo, R & Xiong, H 2007, ‘Failure Prediction in IBM BlueGene/L Event Logs’, Seventh IEEE International Conference on Data Mining (ICDM), pp. 583-588.

Fullop, J, Gainaru, A & Plutchak, J 2012, Real Time Analysis and Event Prediction Engine, Blue Waters project, National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, viewed 03 October 2013, https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap155.pdf

Mooney, CH & Roddick, JF 2013, ‘Sequential Pattern Mining – Approaches and Algorithms’, ACM Computing Surveys, vol. 45, no. 2, art. 19.

Shweta, M & Garg, K 2013, ‘Mining Efficient Association Rules Through Apriori Algorithm Using Attributes and Comparative Analysis of Various Association Rule Algorithms’, International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, iss. 6, pp. 306-312.

Bouckaert, R, Frank, E, Hall, M, Kirkby, R, Reutemann, P, Seewald, A & Scuse, D 2013, ‘WEKA Manual (3-7-10)’, University of Waikato, Hamilton, New Zealand, July 31, 2013.

‘ARFF Data Format’ 2013, ARFF (stable version), viewed 2 October 2013, http://weka.wikispaces.com/ARFF+%28stable+version%29

APPENDIX A. Project import files for WEKA.

WEKA Import Files.zip

APPENDIX B. Project output files produced by WEKA.

WEKA Output Files.zip