Solution to Prac 9
Find and read the article ‘A Survey of Techniques for Internet Traffic Classification
using Machine Learning’ by Thuy T.T. Nguyen and Grenville Armitage, IEEE
Communications Surveys and Tutorials, vol 10(4), 2008.
Q1. Explain the traditional method used to predict the network traffic. What is
its main drawback?
Ans: Successive IP packets having the same 5-tuple of protocol type, source
address:port and destination address:port are considered to belong to a flow whose
controlling application we wish to determine. Simple classification infers the
controlling application’s identity by assuming that most applications consistently use
‘well known’ TCP or UDP port numbers (visible in the TCP or UDP headers).
The main drawback is that many applications increasingly use unpredictable (or at
least obscure) port numbers, so a port-based classifier cannot identify them reliably.
Consequently, more sophisticated classification techniques infer
application type by looking for application-specific data (or well-known protocol
behavior) within the TCP or UDP payloads.
Unfortunately, the effectiveness of such ‘deep packet inspection’ techniques is
diminishing. Such packet inspection relies on two related assumptions:
• Third parties unaffiliated with either source or recipient are able to inspect each IP
packet’s payload (i.e. the payload is visible).
• The classifier knows the syntax of each application’s packet payloads (i.e. the
payload can be interpreted).
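The port-based method described in this answer can be sketched as follows. This is only an illustration: the packet dictionaries, the field names and the small WELL_KNOWN_PORTS table are assumptions made for the example, not part of the paper.

```python
from collections import defaultdict

# Hypothetical port-to-application table (a tiny subset, for illustration only).
WELL_KNOWN_PORTS = {25: "smtp", 53: "dns", 80: "http", 443: "https"}


def flow_key(pkt):
    """Return the 5-tuple that identifies the flow a packet belongs to."""
    return (pkt["proto"], pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])


def group_into_flows(packets):
    """Group packets that share the same 5-tuple into one flow."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


def classify_by_port(key):
    """Guess the application from the destination port, then the source port."""
    _proto, _src, sport, _dst, dport = key
    return WELL_KNOWN_PORTS.get(dport) or WELL_KNOWN_PORTS.get(sport) or "unknown"


packets = [
    {"proto": "tcp", "src": "10.0.0.1", "sport": 40000, "dst": "10.0.0.2", "dport": 80},
    {"proto": "tcp", "src": "10.0.0.1", "sport": 40000, "dst": "10.0.0.2", "dport": 80},
    {"proto": "udp", "src": "10.0.0.3", "sport": 51000, "dst": "10.0.0.4", "dport": 6881},
]
flows = group_into_flows(packets)
labels = {key: classify_by_port(key) for key in flows}
```

The third packet uses a port that is not in the table, so its flow is labelled "unknown". This is the drawback in miniature: an application on an unpredictable port cannot be identified from the port number alone.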
Q2. Explain the two important steps of machine learning algorithms used to
classify network traffic.
Ans: The application of ML techniques involves a number of steps. First, features are
defined by which future unknown IP traffic may be identified and differentiated.
Features are attributes of flows calculated over multiple packets (such as maximum or
minimum packet lengths in each direction, flow durations or inter-packet arrival
times). Second, the ML classifier is trained to associate sets of features with known
traffic classes (creating rules), and the ML algorithm is then applied to classify
unknown traffic using the previously learned rules.
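As a rough sketch of the first step, the features named above can be computed per flow as shown here. The input format, a list of (timestamp, length) pairs for one direction of one flow, and the feature names are assumptions made for the example.

```python
def flow_features(packets):
    """Compute simple per-flow features.

    packets: list of (timestamp_seconds, length_bytes) tuples for one
    direction of one flow, in arrival order (assumed input format).
    """
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    # Inter-packet arrival times: gaps between consecutive timestamps.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "max_len": max(lengths),
        "min_len": min(lengths),
        "duration": times[-1] - times[0],
        "mean_iat": sum(gaps) / len(gaps) if gaps else 0.0,
    }


features = flow_features([(0.0, 60), (0.5, 1500), (1.5, 40)])
```

Each flow is reduced to one such set of feature values, and these are what the classifier is trained on in the second step.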
Q3. List and explain the six key metrics used for traffic classification.
Ans:
• False Negatives (FN): Percentage of members of class X incorrectly classified as not
belonging to class X.
• False Positives (FP): Percentage of members of other classes incorrectly classified
as belonging to class X.
• True Positives (TP): Percentage of members of class X correctly classified as
belonging to class X (equivalent to 100% - FN ).
• True Negatives (TN): Percentage of members of other classes correctly classified as
not belonging to class X (equivalent to 100% - FP ).
• Recall: Percentage of members of class X correctly classified as belonging to class X.
• Precision: Percentage of those instances that truly have class X, among all those
classified as class X.
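The six metrics can be computed from a list of true labels and a list of predicted labels, as in the sketch below. The function name and the sample labels are made up for the example, and the code assumes that class X and the other classes each have at least one member and that at least one instance is predicted as X (otherwise a division by zero occurs).

```python
def class_metrics(true_labels, predicted_labels, cls):
    """Return the six metrics for class `cls`, as percentages."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    tn = sum(1 for t, p in pairs if t != cls and p != cls)
    members = tp + fn  # instances that truly belong to cls
    others = fp + tn   # instances that truly belong to other classes
    return {
        "FN": 100.0 * fn / members,
        "FP": 100.0 * fp / others,
        "TP": 100.0 * tp / members,
        "TN": 100.0 * tn / others,
        "recall": 100.0 * tp / members,
        "precision": 100.0 * tp / (tp + fp),
    }


true_labels = ["X", "X", "X", "X", "Y", "Y", "Y", "Y"]
predicted = ["X", "X", "X", "Y", "X", "Y", "Y", "Y"]
metrics = class_metrics(true_labels, predicted, "X")
```

With these definitions, recall is the same quantity as TP, and TP + FN and TN + FP each sum to 100%.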
Q4. Explain the concepts of flow accuracy and byte accuracy in traffic
classification.
Ans: Flow accuracy - measuring the accuracy with which flows are correctly
classified, relative to the number of other flows in the author’s test and/or training
dataset(s). Byte accuracy - focusing more on how many bytes are carried by the
packets of correctly classified flows, relative to the total number of bytes in the
author’s test and/or training dataset(s).
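The difference between the two measures shows up clearly on a small made-up dataset. In this sketch each flow is a (true label, predicted label, bytes carried) tuple; the format and the numbers are assumptions chosen for illustration.

```python
def flow_and_byte_accuracy(flows):
    """flows: list of (true_label, predicted_label, bytes_carried) tuples."""
    correct = [f for f in flows if f[0] == f[1]]
    flow_acc = len(correct) / len(flows)
    byte_acc = sum(f[2] for f in correct) / sum(f[2] for f in flows)
    return flow_acc, byte_acc


flows = [
    ("voip", "voip", 1_000),
    ("voip", "voip", 1_000),
    ("voip", "voip", 1_000),
    ("p2p", "web", 97_000),  # one misclassified elephant flow
]
flow_acc, byte_acc = flow_and_byte_accuracy(flows)
```

Three of the four flows are classified correctly, so flow accuracy is 75%, but those flows carry only 3,000 of the 100,000 bytes, so byte accuracy is 3%.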
Q5. Is it true that byte accuracy is more important than flow accuracy in traffic
classification? Justify your answer.
Ans: Whether flow accuracy or byte accuracy is more important will generally depend
on the classifier’s intended use. For example, when classifying traffic for IP QoS
purposes it is plausible that identifying every instance of a short lived flow needing
QoS (such as a 5 minute, 32Kbit/sec phone call) is as important as identifying long
lived flows needing QoS (such as a 30 minute, 256Kbit/sec video conference) with
both being far more important to correctly identify than the few flows that represent
multi-hour (and/or hundreds of megabytes) peer to peer file sharing sessions.
Conversely, an ISP doing analysis of load patterns on their network may well be
significantly interested in correctly classifying the applications driving the elephant
flows that contribute a disproportionate number of packets across their network.
Q6. What are the main drawbacks of payload based traffic classification?
Ans: Although payload based inspection avoids reliance on fixed port numbers, it
imposes significant complexity and processing load on the traffic identification
device. It must be kept up-to-date with extensive knowledge of application protocol
semantics, and must be powerful enough to perform concurrent analysis of a
potentially large number of flows. This approach can be difficult or impossible when
dealing with proprietary protocols or encrypted traffic. Furthermore, direct analysis of
session and application layer content may represent a breach of organisational privacy
policies or violation of relevant privacy legislation.
Q7. Explain the input and output of a machine learning process.
Ans: ML takes input in the form of a dataset of instances (also known as examples).
An instance refers to an individual, independent example of the dataset. Each instance
is characterised by the values of its features (also known as attributes or
discriminators) that measure different aspects of the instance. (In the networking field
consecutive packets from the same flow might form an instance, while the set of
features might include median inter-packet arrival times or standard deviation of
packet lengths over a number of consecutive packets in a flow.) The dataset is
ultimately presented as a matrix of instances versus features.
The output is the description of the knowledge that has been learnt. How the specific
outcome of the learning process is represented (the syntax and semantics) depends
largely on the particular ML approach being used.
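A minimal sketch of the input side follows, building the matrix of instances versus features from the two example features mentioned above. The helper name, the (timestamp, length) packet format and the sample values are assumptions for illustration.

```python
import statistics

FEATURE_NAMES = ["median_iat", "stdev_len"]


def instance_row(packets):
    """Turn consecutive packets of one flow into one row of feature values.

    packets: list of (timestamp_seconds, length_bytes), at least two packets.
    """
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return [statistics.median(gaps), statistics.pstdev(lengths)]


# Each inner list of packets is one instance; each row of the matrix
# holds that instance's values for FEATURE_NAMES.
instances = [
    [(0.0, 100), (0.25, 100), (0.75, 100)],
    [(0.0, 40), (1.0, 1500)],
]
dataset = [instance_row(packets) for packets in instances]
```

Here dataset has one row per instance and one column per feature. The output side is not shown because its form depends on the ML approach used.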
Q8. Explain the four types of machine learning.
Ans:
• Classification: A machine learns from a set of pre-classified (also called
pre-labeled) examples, from which it builds a set of classification rules (a model) to
classify unseen examples.
• Clustering: Instances that have similar characteristics are grouped into clusters,
without any prior guidance.
• Association: Any association between features is sought.
• Numeric prediction: The outcome to be predicted is not a discrete class but a
numeric quantity.
Q9. What are the two phases in supervised learning?
Ans: • Training: The learning phase that examines the provided data (called the
training dataset) and constructs (builds) a classification model.
• Testing (also known as classifying): The model that has been built in the training
phase is used to classify new unseen instances.
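The two phases can be illustrated with a deliberately simple nearest-centroid classifier. This technique is chosen here only for brevity and is not one the paper singles out; the feature values and class names are made up.

```python
import math
from collections import defaultdict


def train(instances, labels):
    """Training phase: build a model holding one mean feature vector per class."""
    grouped = defaultdict(list)
    for row, label in zip(instances, labels):
        grouped[label].append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(model, row):
    """Testing phase: assign the class whose centroid is nearest to the row."""
    return min(model, key=lambda label: math.dist(model[label], row))


# Hypothetical training dataset: [mean packet length, mean inter-arrival time].
training_rows = [[100, 0.02], [120, 0.03], [1400, 0.5], [1500, 0.4]]
training_labels = ["voip", "voip", "bulk", "bulk"]

model = train(training_rows, training_labels)
prediction_a = classify(model, [130, 0.02])
prediction_b = classify(model, [1300, 0.6])
```

The call to train() corresponds to the training phase, and each call to classify() on an unseen instance corresponds to the testing phase. The sketch uses math.dist, which requires Python 3.8 or later.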