Data Mining Server Event Logs for Unknown Patterns
CP3300 Data Mining Server Event Logs for Unknown Patterns
Contents
Abstract
Introduction
Related work
Methods
  Information Domain
  Algorithm Selection
    Apriori
    PredictiveApriori
    Generalised Sequential Patterns
  Pre-processing - Data Cleaning, Integration, Transformation, and Reduction
  Data Mining Processing
    File Import
    Individual Server Event Log Analysis
    Combined Server Event Log Analysis
    Algorithm Settings
Results
  Individual Server Event Logs
    Temporal Patterns
    Sequential Patterns
  Cross Server Event Log Analysis
    Basket Analysis
    Sequential Patterns
Discussion
Issues
Conclusion
Future Work
References
APPENDIX A.
APPENDIX B.
Abstract
This project mined the system event logs from 17 Windows servers in order to discover
previously unknown patterns in event generation within each server, and between servers. Each
server system event log was exported to its own comma separated values (CSV) file. The file was
then cleaned, trimmed and sorted into a standard format. All CSV files were concatenated into one
master file where further cleaning was performed. Individual server files were generated from the
master file for individual server analysis and the master file was retained for cross server analysis;
variations on these files were also created. Three Association Rule Mining (ARM) algorithms were
chosen for the analysis in order to broaden the results and to gain a better understanding of the
data mining tool and algorithms. The algorithms Apriori, Predictive Apriori, and Generalized
Sequential Patterns were used within the WEKA data mining tool. The knowledge discovery process
produced both uninteresting and interesting patterns. The uninteresting patterns included
scheduled tasks, system restarts, expected service state changes, Microsoft updates, IIS application
pool recycling, and coincidental associations. The interesting patterns revealed by the data mining
process included resource allocation problems, software bugs and time synchronisation errors.
Some problems were not reported by the software applications involved. For example, the scheduled
virus scans on two primary servers were approaching resource limits, which
reduced the effectiveness of the scans. Likewise, the backup software on a Microsoft SharePoint
server failed to report problems with its granular restore technology that would affect the way
individual files are restored. Furthermore, time synchronisation errors occurring across multiple
servers could affect resource allocation and authentication; this requires further investigation. As
organisations become more complex, so too do the systems that support them. Event log analysis
can be time consuming if performed manually due to the volume of data that needs to be examined.
Furthermore, it is often difficult to identify patterns within a server, let alone between servers, in
order to recognise emerging problems or predict imminent failures. As a result, event logs are
normally used as a reactive tool. Although some tools are available for event log analysis, these are
often expensive and do not provide predictive analysis, so they also become a reactive tool. Data
mining, and in particular ARM, could be a cost effective and proactive method of monitoring server
events.
Introduction
Knowledge Discovery in Data (KDD) is the end-to-end process of identifying data for analysis,
preparing the data for analysis (by selecting, cleaning, integrating, and/or transforming the
data), mining the data for patterns, then presenting and evaluating the results (Han, Kamber & Pei
2012, ch.1). KDD requires that the computations be non-trivial and that the resulting
patterns be valid, applicable to unseen data, novel, understandable, and
of some value (Lee 2013). Data mining is the component of KDD that is
used to identify patterns in the data.
Data mining has a number of functionalities belonging to two primary categories: descriptive
and predictive (Han, Kamber & Pei 2012, ch.1). Firstly, clustering is a descriptive mining
method that groups items so that items in the same cluster share similar characteristics
and items in different clusters have dissimilar characteristics. Secondly,
classification is a predictive mining method that predicts unknown class attributes
using models, or rules, generated by analysing the data attributes of items whose class is known. Thirdly,
characterisation and discrimination describe classes and concepts in summarised
form. Fourthly, outlier analysis identifies objects that do not conform to normal object behaviour.
Lastly, ARM is a predictive mining method that identifies frequently occurring items and
the associations between them to create rules for predicting future
occurrences (Han, Kamber & Pei 2012, pp. 15-21).
ARM will be used in this project to identify associations between frequently occurring event
itemsets. Furthermore, ARM will also be used to find sequential associations of itemsets.
Microsoft Windows event logs are used to record significant events. Early versions of
Windows supported three types of event logs: application, system and security. In addition to
these three, newer versions of Windows support the creation of event log
types for specific applications and services. This project focusses on the system event logs because
these logs are generated by the operating system and system services; significant security or
application events will also appear in the system event logs.
The aims of this project are threefold: to identify event generation patterns within a server,
in order to uncover associations or dependencies between events; to identify recurring temporal
patterns in event generation; and to identify event generation patterns between servers, in order
to uncover associations or dependencies between services on different machines. Firstly, event logs will be reviewed to gain an
understanding of the data. Secondly, the data to be analysed will be extracted from the servers,
cleaned, appropriate attributes selected, and transformed into the required format. All server data
will then be integrated and cleaned further. Thirdly, the integrated data will be separated into
individual server files and a master file for mining. Fourthly, WEKA will be used to mine the data
using three different Association Rule Mining (ARM) algorithms. The WEKA user manual by
Bouckaert et al. (2013) provided the required level of operational information. The three algorithms
employed will be Apriori, Predictive Apriori, and Generalised Sequential Patterns. Lastly, if any
patterns are found then they will be analysed for interestingness and reported. This process is
shown in Figure 1.
Event log analysis can be time consuming if performed manually due to the volume of data
that needs to be examined; this project’s sample event logs exceeded 348,000 records.
Furthermore, it is often difficult to identify patterns within a server, let alone between servers, in
order to recognise emerging problems or predict imminent failures. Because of this, event logs are
normally used as a reactive tool. The project site is configured to E-Mail significant events to staff
in real time. If the staff were trained to recognise event types that precede major incidents then
system problems could potentially be rectified before they cause disruption.
[Figure 1 flowchart. Event logs from the 17 servers are exported to CSV, then cleaned, transformed
and reduced to selected attributes, then integrated and cleaned again. The result is separated into
single server event logs (time and event attributes selected) and a trimmed combined server event
log (server and event attributes concatenated, date and time attributes concatenated), each with a
variant normalised into 14.25 min time ranges. Each single server file is mined using the Apriori,
Predictive Apriori and Generalised Sequential Patterns algorithms. Each combined file is mined
using the Generalised Sequential Patterns algorithm; a pivot table is then created and a Basket
format extracted for mining using the Apriori and Predictive Apriori algorithms.]
Figure 1: Server Event Log Data Mining Process.
Related work
Previous work in this area has focussed on accurately mining the large amounts of log data
generated by today’s information systems. All works reviewed recognise the need to automate this
process due to the exhaustive, and often unrealistic, task humans face in complete event log
examination.
Peng, Li and Ma (2005) focus on the categorisation accuracy of three data mining algorithms
(Naive Bayes, Modified Naive Bayes, and the Hidden Markov Model) when applied to system log
files. The authors recognise the difficulties involved in mining logs presented in different formats by
each system vendor and software product. The paper highlights the potential for greater
categorisation accuracy when using text and temporal data. The article then proceeds to examine
the effectiveness of the algorithms when used with the text and temporal data. Interestingly, the
categorisation results are presented graphically for human analysis, rather than being mined for
association rules, which affords reactive rather than predictive analysis. The authors acknowledge
this limitation along with the problem of having to declare the number of categories and the
potential to have a large number of categories. This project differs in that event prediction will be a
primary goal, although historical pattern analysis will also be performed.
Liang, Zhang and Xiong (2007) use classification mining of failure logs containing temporal
data to predict failures within time windows. Four classification algorithms (RIPPER, Support Vector
Machines, classic Nearest Neighbour, and a customised Bi-Modal Nearest Neighbour Predictor) were
applied to the data to compare their accuracy. The Bi-Modal Nearest Neighbour Predictor algorithm
out-performed all other algorithms. The solution differs from this project in that it was applied to
logs of fixed format from one specific system. Furthermore, the aim was to give advance warning
of an event occurring within time windows of six and twelve hours. This project examines event
logs across multiple system types of varying attributes and data formats. Also, this project differs in
that it looks to identify events that precede a failure rather than predicting the time of a
failure.
Fullop, Gainaru & Plutchak (2012) describe a solution for mining supercomputer
event logs. The solution uses clustering to group historical events by subsystem types,
reducing the incidence of coincidental associations between subsystem types. The granularity of the
clustering is configurable. The initial clustering builds templates that describe the contained events.
Real-time events are then compared to the templates immediately, and cluster definitions can be
modified or further clusters created if necessary. The clusters are then examined for frequent
itemsets, producing a graph describing system behaviour and an event relationship table. Event
relationships are used to predict how long after one particular event another event will occur.
Similarly, the paper discusses “event chains”, which are sequences of events leading up to a
significant event. One of the primary objectives was not to let events go undetected, so accuracy
was paramount. The solution presented in the paper is sophisticated and designed for
supercomputers, but it does support multiple supercomputer types. It has methods to deal with noise,
outliers, and coincidental events. In contrast, this project uses simple techniques that are within the
abilities of most system administrators and is aimed at Windows servers; however, the project does
aim to predict associations between events.
Methods
Information Domain
The project site operates 17 Windows servers in a domain configuration. Important events
are E-Mailed to ICT staff in real time; event selection for E-Mailing is configurable. If an association
between events could be found that gave ICT staff sufficient time to rectify an imminent problem
before it occurred then service availability could be enhanced. For example, if data mining revealed
that event-x predominantly occurred 20 minutes before event-y then the system could be
configured to notify ICT staff when event-x occurred, who would then respond in a way to prevent
event-y from occurring. Similarly, this project will also explore frequent associations of event log
generation between Windows systems operating in a domain environment. As with events occurring
within a server, the question is whether event-x occurring on server-x gives a high probability that
event-y will then occur on server-y. Likewise, if an action performed by staff on server-x causes an event on server-y
then such actions could be scheduled to reduce impact. Furthermore, temporal event patterns can
be difficult to identify when dealing with large or multiple datasets, so mining for recurring temporal
patterns will also be performed.
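The event-x and event-y example can be illustrated with a minimal Python sketch. It is not part of the project's WEKA workflow; the function name `follow_rate`, the event labels, and the 30 minute window are assumptions chosen for illustration. It estimates how often one event is followed by another within a time window.

```python
from datetime import datetime, timedelta

def follow_rate(events, first_id, second_id, window):
    """Fraction of first_id occurrences followed by second_id within `window`.

    `events` is a list of (timestamp, event_id) pairs; names are hypothetical.
    """
    first_times = [t for t, e in events if e == first_id]
    second_times = [t for t, e in events if e == second_id]
    if not first_times:
        return 0.0
    # Count each first_id occurrence that has a second_id shortly after it.
    hits = sum(
        1 for t in first_times
        if any(t < s <= t + window for s in second_times)
    )
    return hits / len(first_times)

# Illustrative data only: "x" is followed by "y" within 30 minutes once out of twice.
sample = [
    (datetime(2013, 1, 1, 10, 0), "x"),
    (datetime(2013, 1, 1, 10, 20), "y"),
    (datetime(2013, 1, 2, 9, 0), "x"),
    (datetime(2013, 1, 2, 11, 0), "y"),
]
rate = follow_rate(sample, "x", "y", timedelta(minutes=30))
```

A high rate for a pair of events would suggest that the first event is worth E-Mailing as an early warning of the second.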
Algorithm Selection
Apriori
The Apriori algorithm finds frequent intra-transaction itemsets within a dataset (Shweta &
Garg 2013). The algorithm passes over a dataset to find all frequent one-sequence itemsets
that satisfy the minimum support. Candidate two-sequence itemsets are then created
by forming all possible two-sequence combinations of
the frequent one-sequence itemsets. For example, if A, B, and C are frequent one-sequence itemsets then
the candidate sets will be A->B, A->C, and B->C (Shweta & Garg 2013). The dataset is then scanned to
see whether the candidate two-sequence itemsets occur frequently enough to satisfy the
minimum support. For example, the two-sequence itemsets A->B, A->C and B->C may be found to be
frequent. The frequent two-sequence itemsets are then used to generate candidate
itemsets of three-sequences, which are confirmed in another dataset pass. For example, a
three-sequence itemset of A, B, C could be found (Shweta & Garg 2013). This process continues until no
further frequent itemsets are found.
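The level-wise passes described above can be sketched in a few lines of Python. This is a minimal illustration of the general Apriori idea rather than WEKA's implementation; the function name and the sample transactions are assumptions.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Another dataset pass confirms which candidates are frequent.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Illustrative transactions only.
baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
found = apriori_frequent_itemsets(baskets, 0.4)
```

With these sample baskets each pair appears in three of five transactions and the triple A, B, C in two of five, so all are retained at a minimum support of 0.4.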
There are two main parameters used by the Apriori algorithm: support, which measures the
probability of an itemset being contained in a dataset; and confidence or interestingness, which
measures the probability of an itemset containing value Y if it contains value X. By default Apriori
will generate the best 10 rules found within the set parameters. The algorithm decreases support
until it has found “n” rules or the minimum support value has been reached (Shweta &
Garg 2013). The minimum support value is configurable. This algorithm was chosen because it
forms the basis of a number of ARM algorithms.
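As a worked illustration of the two measures, the following Python sketch computes support and confidence for a rule X -> Y over a handful of made-up baskets. The function names and data are assumptions for illustration and are not taken from WEKA.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Illustrative baskets: X appears in 3 of 4, X and Y together in 2 of 4.
baskets = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Y"}]
rule_support = support(baskets, {"X", "Y"})        # 2 / 4
rule_confidence = confidence(baskets, {"X"}, {"Y"})  # (2 / 4) / (3 / 4)
```

Here the rule X -> Y would fall just below a 0.7 confidence threshold, since only two of the three baskets containing X also contain Y.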
PredictiveApriori
Predictive Apriori is similar to Apriori but calculates confidence differently (Shweta & Garg
2013). Where Apriori confidence is the accuracy of an association rule, the probability that if value X
is in an itemset then it will contain value Y, Predictive Apriori calculates the expected accuracy, or
predicted accuracy, of an association rule rather than calculating accuracy based on the training data
(Shweta & Garg 2013). To achieve this, the Predictive Apriori algorithm increases support from zero
when calculating its rules, whereas the Apriori algorithm decreases support from maximum to
minimum, until the “n” number of rules have been found (Shweta & Garg 2013). This algorithm was
chosen as a comparison to the Apriori algorithm and to increase the likelihood of finding patterns.
Generalised Sequential Patterns
This is an Apriori-based algorithm (Mooney & Roddick 2013). The Generalised Sequential
Patterns algorithm finds frequent inter-transaction itemsets within a dataset. Furthermore, where
the standard Apriori algorithm would generate a single three-sequence itemset of A, B, C, the Generalised
Sequential Patterns algorithm generates combinations of the items in varying order, for
example the 3-sequences A->B->C, A->C->B and A->BC from the above 2-sequences
(Mooney & Roddick 2013). This algorithm was chosen because of its ability to identify sequences
among itemsets within a dataset, which is thought to be useful for finding interdependencies among
services between machines as well as services within a machine.
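The difference that ordering makes can be shown with a small Python sketch that counts how many event sequences contain a given ordered pattern. It is a simplification, assuming each element of a sequence is a single event rather than an itemset, and it does not reproduce the full Generalised Sequential Patterns candidate generation; the function names and data are illustrative.

```python
def contains_subsequence(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator, so order is enforced.
    return all(item in remaining for item in pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    matches = sum(1 for s in sequences if contains_subsequence(s, pattern))
    return matches / len(sequences)

# Illustrative event sequences only.
logs = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
a_then_c = sequence_support(logs, ["A", "C"])
a_then_b = sequence_support(logs, ["A", "B"])
c_then_a = sequence_support(logs, ["C", "A"])
```

All three sample sequences contain the items A and C, so an itemset view treats them alike, whereas the sequential view distinguishes A followed by C from C followed by A.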
Pre-processing - Data Cleaning, Integration, Transformation, and Reduction
System Event Logs from 17 Windows servers were exported to individual CSV files. The
Windows Server operating systems included versions 2003R2, 2008, 2008R2 and 2012, which
resulted in file formats that differed in attribute types and characteristics. Each file’s columns and
attributes were adjusted to a common format. For example, the Windows 2003 R2 servers
concatenate their date and time attributes into one attribute; these had to be separated because
time alone was a required attribute. Other servers presented their date and time in different
formats, such as dd/mm/yy instead of dd/mm/yyyy, and time in hh:mm as opposed to HH:mm:ss.
Furthermore, Windows 2003 R2 exports include additional fields, which were removed. All
event logs were then concatenated into a master file and further cleaned to remove special or
hidden characters and in-text commas. Notepad++ was selected for dealing
with special or hidden characters because of its ability to easily find and replace these, whereas
Excel was selected as the best tool for handling dates. The tools were also able to handle the
volume of data, which exceeded 348,000 records.
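The project performed this clean-up with Notepad++ and Excel. As an alternative illustration only, the date and time normalisation could be scripted along the following lines in Python. The function names are hypothetical, and the assumption that a combined date and time value is separated by a single space is not confirmed by the source.

```python
from datetime import datetime

# Accepted input formats; the four-digit year form is tried first.
DATE_FORMATS = ("%d/%m/%Y", "%d/%m/%y")
TIME_FORMATS = ("%H:%M:%S", "%H:%M")

def _parse(value, formats):
    """Try each format in turn and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised value: {value!r}")

def normalise(date_text, time_text):
    """Return (dd/mm/yyyy, HH:mm:ss) strings for the given date and time."""
    date_value = _parse(date_text, DATE_FORMATS)
    time_value = _parse(time_text, TIME_FORMATS)
    return date_value.strftime("%d/%m/%Y"), time_value.strftime("%H:%M:%S")

def split_datetime(combined):
    """Split a combined 'date time' value (assumed space separated)."""
    date_text, time_text = combined.strip().split(" ", 1)
    return normalise(date_text, time_text)

short_form = normalise("05/03/13", "09:15")
combined_form = split_datetime("05/03/2013 09:15:30")
```

Values that match none of the listed formats raise an error rather than being silently guessed, which makes unexpected export formats visible during cleaning.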
Once the event logs were cleaned, individual server event logs were created from the
master file. Additionally, a trimmed version of the master file, going back 12 months, was created
for “between server data mining”. This file was created to reduce processing time and because of
data relevance: logs older than 12 months may not be relevant due to system changes. Further files
were created to enable familiarisation with the WEKA tool and ARM data mining algorithms. For
example, individual server and combined server event log files were created with a normalised time
field that was divided into 14.25 minute ranges to capture frequent patterns that occurred within a
time frame. Additionally, the date and time attributes, and the server and event attributes, were
each merged for GSP algorithm analysis. This file was then transformed into a Basket format for analysis
with the Apriori and Predictive Apriori algorithms. A zipped copy of the files imported into WEKA is
embedded in Appendix A.
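A minimal Python sketch of the time-range normalisation and Basket transformation is given below. The project built these files with Excel pivot tables; the grouping key used here (date plus time range), the `server_event` label format, and the server names and event ids are assumptions for illustration.

```python
from collections import defaultdict

BIN_SECONDS = 855  # 14.25 minutes expressed in seconds

def time_bin(time_text):
    """Index of the 14.25 minute range containing an HH:mm:ss time."""
    hours, minutes, seconds = (int(part) for part in time_text.split(":"))
    return (hours * 3600 + minutes * 60 + seconds) // BIN_SECONDS

def to_baskets(rows):
    """Group (date, time, server, event_id) rows into one basket per range."""
    baskets = defaultdict(set)
    for date, time_text, server, event_id in rows:
        # Server and event id are concatenated into a single item label.
        baskets[(date, time_bin(time_text))].add(f"{server}_{event_id}")
    return dict(baskets)

# Illustrative rows only; server names and event ids are made up.
rows = [
    ("01/01/2013", "00:01:00", "SRV1", 36),
    ("01/01/2013", "00:10:00", "SRV2", 36),
    ("01/01/2013", "00:20:00", "SRV1", 7036),
]
baskets = to_baskets(rows)
```

Events from different servers that fall in the same time range end up in the same basket, which is what allows an intra-transaction algorithm such as Apriori to associate them.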
Data Mining Processing
File Import
The event log dataset files will be opened directly in WEKA (Figure 2) using the Explorer
application’s open file option on the “preprocess” tab (Figure 3). The required file can be
imported by browsing to the appropriate directory and selecting “files of type” “CSV data
files (*.CSV)” or “ARFF data files”, described in arff data format (2013) (Figure 4). An unsupervised
attribute filter (Figure 5) will be applied to the event_id to convert it from field type numeric to
nominal (Figure 6); this is appropriate because the event_id has no numerical significance and is
merely a label.
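The project converted event_id with a WEKA filter. As an illustration of what a nominal event_id means in practice, the Python sketch below writes a small ARFF file in which both attributes are declared as nominal value lists. This is an assumed alternative, not the method used in the project, and the function name, relation name and sample rows are hypothetical.

```python
def to_arff(relation, rows):
    """Build ARFF text from (time_text, event_id) rows with nominal attributes."""
    times = sorted({str(t) for t, _ in rows})
    event_ids = sorted({str(e) for _, e in rows})
    lines = [
        f"@relation {relation}",
        "",
        # Times contain colons, so each nominal label is quoted.
        "@attribute time {" + ",".join(f"'{t}'" for t in times) + "}",
        # Listing the ids in braces declares them as labels, not numbers.
        "@attribute event_id {" + ",".join(event_ids) + "}",
        "",
        "@data",
    ]
    lines += [f"'{t}',{e}" for t, e in rows]
    return "\n".join(lines)

# Illustrative rows only.
arff_text = to_arff("events", [("09:15:00", 7036), ("09:20:00", 36)])
```

Because the ids are declared as a set of labels, an association algorithm treats 36 and 7036 as unrelated categories rather than as magnitudes.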
Figure 2. The WEKA main screen showing the explorer application option.
Figure 3. WEKA Explorer application’s Preprocess screen showing the open file option.
Figure 4. WEKA open file option showing the method of selecting the file type.
Figure 5. WEKA Preprocess screen showing where to choose a filter, the chosen filter and data type
before the filter is applied.
Figure 6. WEKA Preprocess screen showing the applied filter.
Individual Server Event Log Analysis
Each server’s event log files will have each of the above ARM algorithms applied. The
algorithms will be selected from the WEKA associate screen’s associator choose option (Figure 7).
The rules and patterns resulting from the analysis will be compared to each other as well as
examined for interesting and uninteresting patterns.
Figure 7. WEKA associate screen and choose option for selecting the required algorithm.
Combined Server Event Log Analysis
The trimmed version of the concatenated server master file, with the server name and
event id concatenated and the date and time concatenated, will be used. Firstly, a standard version will be
analysed with the GSP algorithm. Secondly, a version of the standard file with the date/time normalised
and divided into 14.25 minute ranges will also be analysed with the GSP algorithm. Thirdly, the
standard file will be transformed into the Basket format and have the Apriori and Predictive Apriori
algorithms applied. Lastly, a Basket format version with the date/time normalised
and divided into 14.25 minute ranges will also have the Apriori and Predictive Apriori algorithms
applied. The rules and patterns resulting from the analysis will be compared and examined for
interesting and uninteresting patterns.
Algorithm Settings
The algorithm’s settings are accessed by clicking the algorithm name in the choose field, as
shown in Figure 8. The Apriori algorithm will be configured to use a support value of zero to allow for direct
comparison with the Predictive Apriori algorithm. Similarly, the confidence will be
lowered as far as 0.7 if a sufficient number of rules are not found. Additionally, the number
of rules returned by Apriori will be increased from 10 to 100, also to allow a direct comparison with the
Predictive Apriori algorithm; furthermore, 100 rules were chosen because rules containing certain
event types, for example events that record system uptime, can be immediately discarded. Figure 9
shows the settings adjustment screen for the Apriori algorithm in WEKA.
Figure 8. Left mouse clicking the algorithm name will present the settings for that algorithm.
Figure 9. Apriori settings adjustment screen in WEKA.
The Predictive Apriori algorithm will use the standard settings, however, rules with an
accuracy below 0.7 will be discarded, in line with the Apriori results.
The Generalised Sequential Patterns algorithm minimum support value (Figure 10) will be
adjusted as required in order to return enough results for evaluation, but not so low that processing
time is excessive. A setting greater than 0 will eliminate single instance sequences but still return
infrequent, and potentially interesting, sequences that can occur with periodic issues.
Figure 10. Generalised Sequential Pattern minimum support setting.
Results
The Apriori and Predictive Apriori algorithms consistently produced identical results for the
first 12 to 15 rules generated, with minimal difference in rules from that point on. Compared to
the datasets containing the standard time, the datasets containing the normalised time range
returned fewer rules, sometimes none, from all algorithms, and at lower
confidence levels. A zipped copy of the full results produced by WEKA is
embedded in Appendix B.
Individual Server Event Logs
Temporal Patterns
The Apriori and Predictive Apriori Algorithms identified both uninteresting and interesting
frequent itemsets of a temporal nature. Uninteresting frequent temporal itemsets included regular
events like system uptime being reported every day; web services suspending due to inactivity; web
services starting again when activity commences; scheduled tasks such as individual backup jobs;
Microsoft updates; IIS Application Pool recycling; and scheduled restarts. Interesting previously
unknown frequent temporal itemsets included: excessive resource depletion during scheduled virus
scanning tasks on the SharePoint and mail servers; time synchronisation problems across all servers;
and granular backup errors, not presented by the backup software, on a SharePoint server.
Sequential Patterns
Although some uninteresting sequential patterns were found, like terminal services related
printer mapping errors for administrators, no interesting sequential patterns were found with the
Generalized Sequential Patterns algorithm when applied to the datasets used in this project.
Cross Server Event Log Analysis
Basket Analysis
Concurrent time synchronisation errors were identified across all servers.
Sequential Patterns
Although some uninteresting sequential patterns were found, these are believed to be
coincidental rather than related. No interesting sequential patterns were found with the
Generalized Sequential Patterns algorithm when applied to the datasets used in this project.
Discussion
The Apriori and Predictive Apriori Algorithms identified both uninteresting and interesting
frequent itemsets of a temporal nature. Uninteresting frequent temporal itemsets included regular
events like daily system uptime, web services suspending due to inactivity, web services starting
again when activity commences, scheduled tasks such as individual backup jobs, Microsoft updates,
IIS Application Pool recycling, and scheduled system restarts. Future event log mining may benefit
from excluding these types of events from the dataset. This could reduce the processing time and
serve to increase the support of interesting event types by reducing noise. This project did not
exclude frequent uninteresting events in case they were the cause of interesting events. Interesting
previously unknown frequent temporal itemsets included excessive resource depletion during
scheduled virus scanning tasks on the SharePoint and mail servers, time synchronisation problems
for all servers, and granular backup errors (not presented by the backup software) on a SharePoint
server. The virus scan resource issues have been rectified by adding additional memory to the
affected servers. The granular SharePoint backup has also been rectified with a patch from the
vendor. As discussed below, the time synchronisation problems warrant further investigation.
The Generalized Sequential Patterns algorithm did not produce any interesting patterns
when applied to either individual server datasets or concatenated server datasets. However, some
uninteresting patterns were found. Uninteresting patterns found within each server were mainly
composed of events generated at system start-up or shutdown, and uptime informational events.
All reported inter server Generalized Sequential Patterns are believed to be coincidental and
not related. These inter server patterns consisted mainly of uninteresting events common to all
individual server sequential patterns. One problem with mining large amounts of data (and one of
the reasons why individual server files were created and mined) is that the likelihood of coinciding
events increases simply because there is more data. Fullop, Gainaru & Plutchak (2012) used a multi-
staged approach where the event data was mined into clusters first and then association rules were
mined. This reportedly helped reduce some of the problems associated with mining large quantities
of data and a similar approach could be applied to this project in the future.
Basket analysis revealed concurrent time synchronisation errors across all servers. This
condition requires further investigation, but possible causes include an unstable network
connection to the external time servers, an over-utilised time server, virtual host time interference, or
unstable system clocks. Time synchronisation problems can cause authentication issues, some
application issues, and virtual memory and CPU over-commitment.
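The idea of concurrent events across servers can be sketched as follows. This is an illustrative sketch only, not the WEKA procedure used in the project: the one-minute window, the server names, and the event id 9001 are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def concurrent_windows(events, event_id, min_servers):
    """Return {minute: servers} for each one-minute window in which at
    least min_servers distinct servers logged the given event id."""
    servers_by_minute = defaultdict(set)
    for server, timestamp, eid in events:
        if eid == event_id:
            # Truncate the timestamp to the start of its minute.
            minute = timestamp.replace(second=0, microsecond=0)
            servers_by_minute[minute].add(server)
    return {
        minute: servers
        for minute, servers in servers_by_minute.items()
        if len(servers) >= min_servers
    }

# Hypothetical records: (server name, timestamp, event id).
events = [
    ("srv-a", datetime(2013, 9, 1, 2, 0, 5), 9001),
    ("srv-b", datetime(2013, 9, 1, 2, 0, 40), 9001),
    ("srv-c", datetime(2013, 9, 1, 2, 0, 59), 9001),
    ("srv-a", datetime(2013, 9, 1, 5, 30, 0), 9001),
    ("srv-b", datetime(2013, 9, 1, 2, 0, 10), 1234),
]
concurrent = concurrent_windows(events, 9001, min_servers=3)
```

Here only the 02:00 window is returned, because it is the only minute in which all three hypothetical servers logged the event.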
Issues
A number of issues were encountered during the project. Physical memory limitations
became a problem when working with large datasets, generating the error shown in Figure 11. To
overcome this problem, a virtual machine was created with a 64-bit operating system, which avoids
the 4GB addressable memory limit of 32-bit systems, and configured with 64GB of RAM.
Additionally, the RunWeka.ini file located in the WEKA program files directory was adjusted by
setting the “Maxheap” parameter to “32768M”. Similarly, memory limitations were experienced
with the 32-bit version of Microsoft Excel (Figure 12), so the 64-bit version of Office was installed on the
project computer.
Figure 11. WEKA memory error.
Figure 12. Excel insufficient memory error.
A number of WEKA “.arff” files were created using a date type attribute, as described in
ARFF Data Format (2013). These files could not be processed using the WEKA association algorithms
and will require further detailed investigation; a large number of filtered associators were applied
without success. When multiple attributes were used with large datasets, the program would stop
processing (the WEKA bird status icon would sit down) while the application still indicated that it was
processing (the start button remained greyed out). This problem was overcome by concatenating some
attributes in order to reduce the number of attributes to two. Further work may benefit from the
application of different data mining tools.
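The attribute concatenation described above can be sketched as follows. Which attributes the project actually merged is not stated, so the pairing used here (date with time, and server with event id) is an assumption, as are the function name and the sample values.

```python
def to_two_attributes(row):
    """Collapse (server, date, time, event_id) into two nominal values.
    The choice of which fields to join is an assumption for illustration."""
    server, date, time, event_id = row
    return (f"{date}_{time}", f"{server}_{event_id}")

# Hypothetical record.
reduced = to_two_attributes(("srv-a", "2013-09-01", "02:00", "9001"))
```

Joining fields in this way keeps the information in each record while presenting the association algorithm with only two attributes.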
The frequency of uninteresting events meant that the minimum support had to be set quite low when
mining for association rules. Unfortunately, this reduced the quality of the rules returned.
Similarly, the volume of data significantly increased processing time, and some jobs had to be left
running overnight.
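The support problem can be illustrated with a minimal sketch of the standard support and confidence measures. The transactions and item names below are hypothetical, and the code is not the WEKA implementation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent.
    Assumes the antecedent occurs in at least one transaction."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical data: 95 routine transactions and 5 involving a virus scan.
transactions = (
    [{"uptime"}] * 95
    + [{"av_scan", "low_memory"}] * 4
    + [{"av_scan"}]
)
rule_support = support(transactions, {"av_scan", "low_memory"})
rule_confidence = confidence(transactions, {"av_scan"}, {"low_memory"})
```

In this example the rule has high confidence but a support of only 4%, so it is found only when the minimum support is set at or below that level, where many coincidental itemsets also qualify.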
Conclusion
The primary objective of this project was to identify patterns in server event log
generation: within each server, to uncover associations or dependencies between events;
over time, to find recurring patterns; and between servers, to identify associations or
dependencies between services. This objective was partially
realised.
The Apriori and PredictiveApriori algorithms identified both uninteresting and interesting
frequent itemsets of a temporal nature. Uninteresting frequent temporal itemsets included regular
events such as daily system uptime, web services suspending due to inactivity, web services starting
again when activity resumes, scheduled tasks such as individual backup jobs, Microsoft updates,
IIS Application Pool recycling, and scheduled system restarts. Interesting, previously unknown
frequent temporal itemsets included excessive resource depletion during scheduled virus scanning
tasks on the SharePoint and mail servers, time synchronisation problems across all servers, and
granular backup errors on a SharePoint server.
The Generalised Sequential Patterns algorithm did not produce any interesting patterns
when applied to either individual server datasets or concatenated server datasets. However, some
uninteresting patterns were found. Uninteresting patterns found within each server were mainly
composed of events generated at system start-up or shutdown, and uptime informational events.
All reported inter-server event patterns are believed to be coincidental and unrelated.
Basket analysis revealed concurrent time synchronisation errors across all servers. This
condition will require further investigation.
This project has shown that server event logs can be mined for interesting patterns. Further
refinements to the KDD process may uncover more interesting patterns. This type of work
improves the ability to monitor large volumes of computer system events and
predict incidents.
Future Work
Although this work focussed on the system name, date, time, and event id, there were other
attributes processed during the data pre-processing stage. An attribute of particular interest is the
event description because it contains key words like application names, system parameters, service
names, and state information. The event id is a single value whereas the text may contain many
values capable of associating an event with other events. This may increase the possibility of finding
further frequent patterns. Future work will involve analysing the text for key words then identifying
associations between events in a manner similar to what has already been done in this project.
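A minimal sketch of this idea is given below. It turns an event description into a set of key word items that could be mined alongside the event id. The stop-word list, the tokenising rule, and the two sample descriptions are assumptions made for illustration and are not taken from the project data.

```python
import re

# Hypothetical stop-word list; a real one would need tuning to the logs.
STOP_WORDS = {"the", "and", "for", "was", "has", "with", "from"}

def description_items(description):
    """Return the set of key words in an event description, dropping
    stop words and tokens of two characters or fewer."""
    tokens = re.findall(r"[a-z0-9]+(?:[._-][a-z0-9]+)*", description.lower())
    return {t for t in tokens if t not in STOP_WORDS and len(t) > 2}

# Hypothetical event descriptions.
first = description_items("The Windows Time service entered the stopped state.")
second = description_items("Time service is now synchronising the system time.")
shared = first & second
```

Key words shared by two events, such as "time" and "service" here, offer a way of associating events whose ids differ.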
As an extension of this project, on completion of CP1300 this semester, I would like to adjust
the data mining algorithms used above to learn and filter out uninteresting and highly recurring
events. This should reduce noise and coincidental associations, and increase the likelihood of
identifying interesting patterns should they exist. Similarly, a multi-stage approach could be taken
in which data mining is itself one of the pre-processing steps. For example, clustering could
be used as a method of data reduction by supporting the removal of clusters of uninteresting events
with uninteresting key words.
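As a starting point, the filtering step could be as simple as the sketch below, in which a fixed frequency threshold stands in for the learning described above. The threshold, the function name, and the event ids are hypothetical, and a frequent event is not necessarily an uninteresting one.

```python
from collections import Counter

def drop_frequent_events(events, max_share=0.2):
    """Remove records whose event id accounts for more than max_share
    of all records. Returns (kept records, set of removed event ids)."""
    counts = Counter(event_id for _, event_id in events)
    total = len(events)
    noisy = {eid for eid, n in counts.items() if n / total > max_share}
    kept = [record for record in events if record[1] not in noisy]
    return kept, noisy

# Hypothetical records: (server name, event id).
events = [("srv-a", 6013)] * 8 + [("srv-a", 1001), ("srv-b", 2002)]
kept, noisy = drop_frequent_events(events)
```

In this example the event id that makes up 80% of the records is removed, leaving the two rarer events for mining.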
References
Han, J, Kamber, M & Pei, J 2012, Data Mining: Concepts and Techniques, 3rd Edn, Morgan
Kaufmann, Waltham, MA.
Lee, K 2013, Week 1 Lecture – Intro to DM, lecture PowerPoint slides, viewed 01 August 2013,
https://learnjcu.jcu.edu.au/webapps/portal/frameset.jsp?tab_tab_group_id=_312_1&url=%2Fwebapps%2Fblackboard%2Fexecute%2Flauncher%3Ftype%3DCourse%26id%3D_50648_1%26url%3D
Peng, W, Li, T & Ma, S 2005, ‘Mining Logs Files for Data-Driven System Management’, ACM
SIGKDD Explorations Newsletter - Natural language processing and text mining, vol. 7, iss. 1,
pp. 44-51.
Liang, Y, Zhang, Y, Sahoo, R & Xiong, H 2007, ‘Failure Prediction in IBM BlueGene/L Event Logs’,
Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 583-588.
Fullop, J, Gainaru, A & Plutchak, J 2012, Real Time Analysis and Event Prediction Engine, Blue Waters
project, National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign,
viewed 03 October 2013,
https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap155.pdf
Mooney, CH & Roddick, JF 2013, ‘Sequential Pattern Mining – Approaches and Algorithms’, ACM
Computing Surveys, vol. 45, no. 2, art 19.
Shweta, M & Garg, K 2013, ‘Mining Efficient Association Rules Through Apriori Algorithm Using
Attributes and Comparative Analysis of Various Association Rule Algorithms’, International Journal
of Advanced Research in Computer Science and Software Engineering, vol. 3, iss. 6, pp. 306-312.
Bouckaert, R, Frank, E, Hall, M, Kirkby, R, Reutemann, P, Seewald, A & Scuse, D 2013, ‘WEKA Manual
(3-7-10)’, University of Waikato, Hamilton, New Zealand, July 31, 2013.
‘ARFF Data Format’ 2013, ARFF (stable version), viewed 2 October 2013,
http://weka.wikispaces.com/ARFF+%28stable+version%29
APPENDIX A.
Project import files for WEKA.
WEKA Import Files.zip
APPENDIX B.
Project output files produced by WEKA.
WEKA Output Files.zip