SIT 384 Data Analytics for Cyber Security Assignment 2
Trimester 1, 2017
Objectives
• To apply skills and knowledge acquired throughout the trimester in classification algorithms and the machine learning process.
• To rationalize the use of machine learning algorithms to process large volumes of data effectively and efficiently.
• To demonstrate the ability to use R to perform spam classification tasks that are common for corporate security analysts.
• To scientifically conduct and document machine learning experiments for analytics
purposes.
Due Date: 2pm, Friday, May 19, 2017
This assignment consists of a report worth 20 marks. Delays caused by a student's own computer downtime cannot be accepted as a valid reason for late submission without penalty. Students must plan their work to allow for both scheduled and unscheduled downtime.
Submission instructions: You must submit an electronic copy of all your assignment files via CloudDeakin. You must include your report, source code, any necessary data files and, optionally, a presentation file. Assignments will not be accepted through any other manner of submission. Students should note that email and paper-based submissions will ordinarily be rejected.
Special requirements to prove the originality of your work: On-campus students (B and G) are required to demonstrate the execution of their classification programs in R to their tutor in Week 10; Cloud students are required to attach a 3-5 minute video presentation demonstrating how their R code is executed to derive the claimed results. The video should be uploaded to cloud storage. (You can find out how to upload a video from https://video.deakin.edu.au/.) Failure to do so will result in delayed assessment of your submission.
Late submissions: Submissions received after the due date are penalized at a rate of 5% (of the full mark) per day, no exceptions. Submissions more than 5 days late are penalized at 100% of the full mark. For penalty purposes, submissions close at 5:00 pm Australian Eastern Time (UTC+10) on the due date and on each day thereafter. Students outside Victoria should note that the normal time zone in Victoria is UTC+10. No extensions will be granted.
It is the student's responsibility to ensure that they understand the submission instructions. If you have ANY difficulties, ask the Lecturer/Tutor for assistance (prior to the submission date).
Copying, Plagiarism Notice
This is an individual assignment. You are not permitted to work as part of a group when writing this assignment. The University's policy on plagiarism can be viewed online at
http://www.deakin.edu.au/students/study-support/referencing/plagiarism
Overview
The popularity of social media networks such as Twitter has led to an increasing amount of spamming activity. Researchers have employed various machine learning methods to detect Twitter spam. In this assignment, you are required to classify spam tweets using the provided datasets. The features have been extracted and clearly structured in JSON format. The extracted features can be categorized into two groups: user profile-based features and tweet content-based features, as summarized in Table 1.
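As a minimal sketch of getting started (the filenames and the label column below are assumptions, not part of the assignment specification), the jsonlite package can read the provided JSON features into an R data frame:

    # Minimal loading sketch -- the filenames and the "spam" label column are
    # hypothetical; adapt them to the files supplied with the assignment.
    library(jsonlite)

    # stream_in() reads newline-delimited JSON records into a data frame;
    # use fromJSON() instead if the file holds a single JSON array.
    train <- stream_in(file("training_tweets.json"))
    train$spam <- as.factor(train$spam)   # treat the label as a class factor
    str(train)   # inspect the user profile-based and content-based features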
The provided training dataset and testing datasets are listed in Table 2 and Table 3 respectively. In the testing datasets, the ratio of spam to non-spam is 1:1 in Dataset 1, while it is 1:19 in Dataset 2. In most previous work, the testing datasets are nearly evenly distributed. In the real world, however, only around 5% of tweets on Twitter are spam, which means that testing Dataset 2 simulates the real-world scenario. You are required to classify spam tweets, evaluate the classifiers' performance and compare the outcomes on Dataset 1 and Dataset 2 by conducting experiments.
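For instance, once the testing sets are loaded (the variable and column names below are hypothetical, carried over from the loading sketch above), the stated class ratios can be checked with a simple frequency table:

    # Verify the spam/non-spam ratios quoted above.
    table(test1$spam)               # Dataset 1: expect roughly 1:1
    prop.table(table(test2$spam))   # Dataset 2: expect about 5% spam (1:19)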
[Figure: Twitter Spam Detection Work Flow]
Problem Statement
This is an individual assessment task. Each student is required to submit a report of approximately 2,000-2,500 words, along with exhibits to support findings with respect to the provided spam and non-spam messages. The report should consist of:
• Overview of classifiers and evaluation metrics
• Construction of data sets, identification of features and the process of conducting
classification
• Technical findings from the experiment results
• Justified discussion of the performance evaluation outcomes for different classifiers
To demonstrate your achievement of these goals, you must write a report of at least 2,000 words
(2,500 words maximum). Your report should consist of the following chapters:
1. A proper title which matches the contents of your report.
2. Your name and Deakin student number in the author line.
3. An executive summary which summarizes your findings. (You may find hints on writing good
executive summaries from http://unilearning.uow.edu.au/report/4bi1.html.)
4. An introduction chapter which lists the classification algorithms of your choice (at least 5 algorithms), the features used for classification, the performance evaluation metrics (at least 5 evaluation metrics), a brief summary of your findings, and the organization of the rest of your report. (You may find hints on features used for classification in the Twitter Developer Documentation: https://dev.twitter.com/overview/api )
5. A literature review chapter which surveys the latest academic papers regarding the classifiers and performance evaluation metrics of your choice. With respect to each classifier and performance evaluation metric, you are advised to identify and cite at least one paper published in an ACM or IEEE journal or conference proceedings. In addition, your aim in this part of the report is to demonstrate a deep and thorough understanding of the existing body of knowledge encompassing multiple classification techniques for security data analytics; specifically, your argument should explain why machine learning algorithms should be used rather than human readers. (Please read through the hints on this web page before writing this chapter: http://www.uq.edu.au/student-services/learning/literature-review.)
6. A technical demonstration chapter which consists of fully explained screenshots taken while your experiments were conducted in R. That is, you should explain each step of the classification procedure and the performance results for your classifiers. Note that the classifiers you present in the literature review should be the ones you conduct experiments with (a minimal R sketch of such an experiment follows this list).
7. A performance evaluation chapter which evaluates the performance of the classifiers. You should analyse each classifier's performance with respect to the performance metrics of your choice. In addition, you should compare the performance results across the selected classifiers and datasets in terms of evaluation metrics, e.g., accuracy, false positive rate, recall, F-measure, speed and so on.
8. A conclusions chapter which summarizes the major findings of the study (you should use at least 5 evaluation metrics to evaluate the performance of the classifiers and to compare the performance of the different classifiers; you can present your experiment results in the form of tables and plots), discusses whether the results match the hypotheses you held prior to the experiments, and recommends the best performing classification algorithm.
9. A bibliography list of all cited papers and other resources. You must use in-text citations in Harvard style, and each citation must correspond to a bibliography entry. There must be no bibliography entries that are not cited in the report. (You should familiarise yourself with the contents of this page: http://www.deakin.edu.au/students/study-support/referencing/harvard.)
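The following sketch illustrates the general shape of the experiments expected in chapters 6 and 7: it trains one classifier (naive Bayes from the e1071 package, as one of the five you might choose) and derives five common metrics from the resulting confusion matrix. The package, the label column and its "spam"/"non-spam" levels are assumptions; substitute your own classifiers, features and datasets.

    # One classifier, five metrics -- a sketch, not the required solution.
    # Assumes 'train' and 'test1' are data frames whose factor column 'spam'
    # has the levels "spam" and "non-spam" (hypothetical names).
    library(e1071)   # naiveBayes(); rpart, randomForest etc. work similarly

    fit_time <- system.time(
      model <- naiveBayes(spam ~ ., data = train)   # train on all features
    )
    pred <- predict(model, newdata = test1)

    # Confusion matrix: rows are predictions, columns are true labels.
    cm <- table(Predicted = pred, Actual = test1$spam)
    TP <- cm["spam", "spam"];       FP <- cm["spam", "non-spam"]
    FN <- cm["non-spam", "spam"];   TN <- cm["non-spam", "non-spam"]

    accuracy  <- (TP + TN) / sum(cm)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)                 # true positive rate
    fpr       <- FP / (FP + TN)                 # false positive rate
    f_measure <- 2 * precision * recall / (precision + recall)

    # 'fit_time["elapsed"]' captures training speed; repeat the block for
    # each classifier and for both testing datasets, then tabulate results.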
Marking Rubric

Criterion: Scientific Writing in Introduction and Conclusion (out of 4 marks)
• Proficient (above 80%): Use appropriate language and genre to extend the knowledge of a range of audiences.
• Average (60-79%): Use discipline-specific language and genres to address the gaps of a self-selected audience. Apply the knowledge developed innovatively to a different context.
• Satisfactory (50-59%): Use some discipline-specific language and the prescribed genre to demonstrate understanding from a stated perspective and for a specified audience. Apply the knowledge developed to different contexts.
• Below Expectation (0-50%): Fail to demonstrate understanding for the lecturer/teacher as audience. Fail to apply the knowledge developed to a similar context.
Criterion: Literature Review (out of 4 marks)
• Proficient (above 80%): Collect and record self-determined information from self-selected sources, choosing or devising an appropriate methodology with self-structured guidelines. Organize information using student-determined structures and management of processes. Generate questions/aims/hypotheses based on the literature.
• Average (60-79%): Collect and record self-determined information/data from self-selected sources, choosing an appropriate methodology based on structured guidelines. Organize information/data using student-determined structures, and manage the processes, within the parameters set by the guidelines. Generate questions/aims/hypotheses framed within structured guidelines.
• Satisfactory (50-59%): Collect and record required information/data from self-selected sources using one of several prescribed methodologies. Organize information/data using recommended structures. Manage self-determined processes with multiple possible pathways. Respond to questions/tasks generated from a closed inquiry.
• Below Expectation (0-50%): Fail to collect required information or data from the prescribed source. Fail to organize information/data using the prescribed structure. Fail to respond to questions/tasks arising explicitly from a closed inquiry.
Criterion: Technical Demonstration (out of 4 marks)
• Proficient (above 80%): Provide fully explained screenshots with R script. Explain each step of the classification procedure and the performance results in detail. The entire demo is clear and correct and covers all findings.
• Average (60-79%): Provide fully explained screenshots with R script. Explain each step of the classification procedure and the performance results. The entire demo is clear, but there are some mistakes.
• Satisfactory (50-59%): Provide screenshots with R script. Explain each step of the classification procedure and the performance results, but many parts of the demo are not clear enough and/or contain major flaws or mistakes.
• Below Expectation (0-50%): No screenshots or explanations provided.

Criterion: Performance Evaluation (out of 4 marks)
• Proficient (above 80%): Evaluate information/data and the inquiry process rigorously based on the latest literature. Reflect insightfully to renew others' processes. Construct and use one training data set and two testing data sets. 5 classifiers work correctly. 5 evaluation metrics are applied to analyse the performance of the classifiers.
• Average (60-79%): Evaluate information/data and the inquiry process comprehensively, developed within the scope of the given literature. Reflect insightfully to renew others' processes. Construct and use one training data set and two testing data sets. 4 classifiers work correctly. 4 evaluation metrics are applied to analyse the performance of the classifiers.
• Satisfactory (50-59%): Evaluate information/data and reflect on the inquiry process based on the given literature. Use only one testing data set. Fewer than 4 classifiers work correctly. Fewer than 4 evaluation metrics are applied to analyse the performance of the classifiers.
• Below Expectation (0-50%): Fail to evaluate information/data and to reflect on the inquiry process. Use one testing data set or none. Fewer than 2 classifiers work correctly. Fewer than 2 evaluation metrics are applied to analyse the performance of the classifiers.
Criterion: Reference (out of 4 marks)
• Proficient (above 80%): More than 10 bibliographic items (all of them academic papers, with at least 1 item per classifier and at least 1 item per evaluation metric) are correctly presented, and inline citations are correctly used.
• Average (60-79%): More than 10 bibliographic items (most of them academic papers, with at least 1 item per classifier) are presented, but there are a few errors. Inline citations are used, but with a few errors.
• Satisfactory (50-59%): More than 10 bibliographic items (most of them academic papers) are presented. Inline citations are often used incorrectly.
• Below Expectation (0-50%): Fewer than 10 bibliographic items are presented, or there are more than 3 errors in the bibliography list and inline citations.