Solution to Prac 9

Find and read the article ‘A Survey of Techniques for Internet Traffic Classification using Machine Learning’ by Thuy T.T. Nguyen and Grenville Armitage, IEEE Communications Surveys and Tutorials, vol 10(4), 2008.

Q1. Explain the traditional method used to predict the network traffic. What is its main drawback?
Ans: Successive IP packets having the same 5-tuple of protocol type, source address:port and destination address:port are considered to belong to a flow whose controlling application we wish to determine. Simple classification infers the controlling application’s identity by assuming that most applications consistently use ‘well known’ TCP or UDP port numbers (visible in the TCP or UDP headers). However, many applications are increasingly using unpredictable (or at least obscure) port numbers, which is the main drawback of port-based classification. Consequently, more sophisticated classification techniques infer application type by looking for application-specific data (or well-known protocol behavior) within the TCP or UDP payloads. Unfortunately, the effectiveness of such ‘deep packet inspection’ techniques is also diminishing. Such packet inspection relies on two related assumptions:
• Third parties unaffiliated with either source or recipient are able to inspect each IP packet’s payload (i.e. is the payload visible?)
• The classifier knows the syntax of each application’s packet payloads (i.e. can the payload be interpreted?)

Q2. Explain the two important steps of machine learning algorithms used to classify network traffic.
Ans: The application of ML techniques involves a number of steps. First, features are defined by which future unknown IP traffic may be identified and differentiated. Features are attributes of flows calculated over multiple packets (such as maximum or minimum packet lengths in each direction, flow durations or inter-packet arrival times).
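As a rough illustration of this first step, the sketch below groups packets into flows by their 5-tuple and computes a few of the features mentioned above. The packet records, field order and function names are hypothetical, chosen only for this example; for brevity it treats each 5-tuple as one unidirectional flow rather than computing features per direction.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical packet records (not from the article):
# (timestamp_s, protocol, src_addr, src_port, dst_addr, dst_port, length_bytes)
PACKETS = [
    (0.00, "TCP", "10.0.0.1", 40000, "10.0.0.2", 80, 60),
    (0.05, "TCP", "10.0.0.1", 40000, "10.0.0.2", 80, 1500),
    (0.20, "TCP", "10.0.0.1", 40000, "10.0.0.2", 80, 400),
    (0.10, "UDP", "10.0.0.3", 5004, "10.0.0.4", 5004, 200),
    (0.12, "UDP", "10.0.0.3", 5004, "10.0.0.4", 5004, 200),
]


def group_into_flows(packets):
    """Group packets that share the same 5-tuple into one flow."""
    flows = defaultdict(list)
    for ts, proto, src, sport, dst, dport, length in packets:
        flows[(proto, src, sport, dst, dport)].append((ts, length))
    return flows


def flow_features(flow_packets):
    """Compute simple per-flow features over all packets of the flow."""
    ordered = sorted(flow_packets)  # order by timestamp
    times = [ts for ts, _ in ordered]
    lengths = [length for _, length in ordered]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "min_len": min(lengths),
        "max_len": max(lengths),
        "duration": times[-1] - times[0],
        # Mean inter-packet arrival time; 0.0 for single-packet flows.
        "mean_iat": mean(gaps) if gaps else 0.0,
    }


FEATURES = {
    five_tuple: flow_features(pkts)
    for five_tuple, pkts in group_into_flows(PACKETS).items()
}

for five_tuple, features in FEATURES.items():
    print(five_tuple, features)
```

Each resulting feature dictionary corresponds to one row of the dataset that would later be handed to the ML algorithm.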
Then the ML classifier is trained to associate sets of features with known traffic classes (creating rules), and the ML algorithm is applied to classify unknown traffic using the previously learned rules.

Q3. List and explain the six key metrics used for traffic classification.
Ans:
• False Negatives (FN): Percentage of members of class X incorrectly classified as not belonging to class X.
• False Positives (FP): Percentage of members of other classes incorrectly classified as belonging to class X.
• True Positives (TP): Percentage of members of class X correctly classified as belonging to class X (equivalent to 100% - FN).
• True Negatives (TN): Percentage of members of other classes correctly classified as not belonging to class X (equivalent to 100% - FP).
• Recall: Percentage of members of class X correctly classified as belonging to class X.
• Precision: Percentage of those instances that truly have class X, among all those classified as class X.

Q4. Explain the concepts of flow accuracy and byte accuracy in traffic classification.
Ans:
• Flow accuracy: measures the accuracy with which flows are correctly classified, relative to the number of other flows in the author’s test and/or training dataset(s).
• Byte accuracy: focuses on how many bytes are carried by the packets of correctly classified flows, relative to the total number of bytes in the author’s test and/or training dataset(s).

Q5. Is it true that byte accuracy is more important than flow accuracy in traffic classification? Justify your answer.
Ans: Whether flow accuracy or byte accuracy is more important will generally depend on the classifier’s intended use.
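To see how far the two measures can diverge, here is a minimal sketch that computes both over a small set of test flows. The flow records, class names and byte counts are invented for illustration and do not come from the article.

```python
# Hypothetical test flows: (true_class, predicted_class, bytes_carried)
FLOWS = [
    ("voip", "voip", 1_200_000),
    ("voip", "voip", 1_100_000),
    ("web", "web", 300_000),
    ("web", "web", 250_000),
    ("p2p", "web", 500_000_000),  # one misclassified elephant flow
]


def flow_accuracy(flows):
    """Percentage of flows whose predicted class matches the true class."""
    correct = sum(1 for true, pred, _ in flows if true == pred)
    return 100.0 * correct / len(flows)


def byte_accuracy(flows):
    """Percentage of all bytes that belong to correctly classified flows."""
    total_bytes = sum(size for _, _, size in flows)
    correct_bytes = sum(size for true, pred, size in flows if true == pred)
    return 100.0 * correct_bytes / total_bytes


print(f"flow accuracy: {flow_accuracy(FLOWS):.1f}%")
print(f"byte accuracy: {byte_accuracy(FLOWS):.2f}%")
```

With this data four of the five flows are classified correctly, so flow accuracy is 80%, yet byte accuracy is below 1% because the single misclassified flow carries almost all of the bytes.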
For example, when classifying traffic for IP QoS purposes, it is plausible that identifying every instance of a short-lived flow needing QoS (such as a 5 minute, 32 Kbit/sec phone call) is as important as identifying long-lived flows needing QoS (such as a 30 minute, 256 Kbit/sec video conference). Both are far more important to identify correctly than the few flows that represent multi-hour (and/or hundreds of megabytes) peer-to-peer file sharing sessions. Conversely, an ISP analysing load patterns on its network may well be significantly interested in correctly classifying the applications driving the elephant flows that contribute a disproportionate number of packets across its network.

Q6. What are the main drawbacks of payload based traffic classification?
Ans: Although payload based inspection avoids reliance on fixed port numbers, it imposes significant complexity and processing load on the traffic identification device. The device must be kept up-to-date with extensive knowledge of application protocol semantics, and must be powerful enough to perform concurrent analysis of a potentially large number of flows. This approach can be difficult or impossible when dealing with proprietary protocols or encrypted traffic. Furthermore, direct analysis of session and application layer content may represent a breach of organisational privacy policies or a violation of relevant privacy legislation.

Q7. Explain the input and output of a machine learning process.
Ans: ML takes input in the form of a dataset of instances (also known as examples). An instance refers to an individual, independent example of the dataset. Each instance is characterised by the values of its features (also known as attributes or discriminators) that measure different aspects of the instance.
(In the networking field, consecutive packets from the same flow might form an instance, while the set of features might include median inter-packet arrival times or standard deviation of packet lengths over a number of consecutive packets in a flow.) The dataset is ultimately presented as a matrix of instances versus features. The output is the description of the knowledge that has been learnt. How the specific outcome of the learning process is represented (the syntax and semantics) depends largely on the particular ML approach being used.

Q8. Explain the four types of machine learning.
Ans:
• Classification learning: a machine learns from a set of pre-classified (also called pre-labeled) examples, from which it builds a set of classification rules (a model) to classify unseen examples.
• Clustering: instances that have similar characteristics are grouped into clusters, without any prior guidance.
• Association learning: any association between features is sought.
• Numeric prediction: the outcome to be predicted is not a discrete class but a numeric quantity.

Q9. What are the two phases in supervised learning?
Ans:
• Training: The learning phase that examines the provided data (called the training dataset) and constructs (builds) a classification model.
• Testing (also known as classifying): The model that has been built in the training phase is used to classify new unseen instances.
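
The two phases can be sketched with a very small classifier. This is only an illustration under stated assumptions: it uses a nearest-centroid rule, which is chosen here for brevity and is not presented as a method from the article, and the feature values, class names and function names are all hypothetical. The last function computes Recall and Precision for one class as defined in Q3.

```python
from collections import defaultdict
from math import dist

# Hypothetical pre-labeled instances:
# ((mean_packet_len_bytes, mean_inter_arrival_ms), traffic_class)
TRAINING = [
    ((180.0, 20.0), "voip"),
    ((200.0, 22.0), "voip"),
    ((1200.0, 5.0), "bulk"),
    ((1400.0, 3.0), "bulk"),
]
TESTING = [
    ((190.0, 21.0), "voip"),
    ((1300.0, 4.0), "bulk"),
    ((210.0, 19.0), "voip"),
]


def train(instances):
    """Training phase: build a model holding one centroid per class."""
    grouped = defaultdict(list)
    for features, label in instances:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(model, features):
    """Testing phase: assign the class whose centroid is nearest."""
    return min(model, key=lambda label: dist(model[label], features))


def recall_and_precision(true_labels, predicted_labels, cls):
    """Recall and Precision for class `cls`, both as percentages."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


MODEL = train(TRAINING)
TRUE_LABELS = [label for _, label in TESTING]
PREDICTIONS = [classify(MODEL, features) for features, _ in TESTING]

print(PREDICTIONS)
print(recall_and_precision(TRUE_LABELS, PREDICTIONS, "voip"))
```

The features are deliberately left unscaled to keep the sketch short; with real data, features on very different scales would normally be normalised before computing distances.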