31250 / 32130 Assignment 2

31250 Introduction to Data Analytics
32130 Fundamentals of Data Analytics

Assignment 2: data exploration and preparation

Due date: 11:59pm Friday, 5 May 2017
Marks: Out of 100, weighted to 25% of your final mark.
Submission format: A report in Adobe PDF (preferable) or MS Word Doc and an Excel spreadsheet.
Filename: ida_a2_xxxxxxxx.pdf or ida_a2_xxxxxxxx.doc where xxxxxxxx is your student id. ida_a2_xxxxxxxx.xls for the spreadsheet.
Report format: Around 25-30 pages with the information described below. Use 11 or 12 point Times or Arial fonts.

Submit using the UTS Online assignment submission button. Please make sure to name the files as described above.

This assignment is individual work. Each of you will be working with an individual data set that you will be able to download from UTS Online.

Scenario

You have just started working as a data miner/analyst in the Analytics Unit of a company. The Head of the Analytics Unit has brought you a data set [a welcome present ;-))]. The data set includes two files: a description of the attributes and a table with the actual values of these attributes. The Head of the Analytics Unit has mentioned to you that this is some sort of demographic data that a potential client has provided for analysis. She would like to have a report with some insights about that data that she could deliver to the client.

Your tasks include:
• understanding the specifics of the data set
• extracting information about each of the attributes, possible associations between them and other specifics of the data set.

The tasks in the assignment are specified below.

Data sets

The description of the attributes is the same for all students and comes in a tiny documentation file (download it from UTS Online). Each student is assigned an individual table with the actual values of these attributes. Please download the file that is linked to your name from UTS Online.

Tasks

1A. Initial data exploration

1. Identify the type of each attribute (nominal, ordinal, interval or ratio). If the type is not clear, you may need to justify why you chose it.
2. Identify the values of the summarising properties for each attribute, including frequency, location and spread (e.g. value ranges of the attributes, frequency of values, distributions, medians, means, variances, percentiles, etc. - the statistics that have been covered in the lectures and materials given). Note that not all of these summary statistics will make sense for all the attribute types, so use your judgement! Where necessary, use proper visualisations for the corresponding statistics.
3. Using KNIME or other tools, explore your data set and identify any outliers, clusters of similar instances, "interesting" attributes and specific values of those attributes. Note that you may need to 'temporarily' recode attributes to numeric or from numeric to nominal. In the report, include the corresponding snapshots from the tools and an explanation of what has been identified there.

Present your findings in the assignment report.
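
As an illustration of the kind of summary statistics and outlier checks meant in items 2 and 3, the following is a minimal Python sketch using only the standard library. The Age and Education values are invented for the example and are not taken from any student's data set; the function names and the 1.5 x IQR rule (Tukey's rule) are assumptions chosen for illustration, not a required method.

```python
from collections import Counter
import statistics


def numeric_summary(values):
    """Location and spread statistics for an interval or ratio attribute."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": median,
        "variance": statistics.variance(values),  # sample variance
        "q1": q1,
        "q3": q3,
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical values, for illustration only.
ages = [19, 23, 25, 31, 34, 38, 42, 47, 52, 90]
education = ["Bachelors", "HS-grad", "HS-grad", "Masters", "HS-grad"]

summary = numeric_summary(ages)
outliers = iqr_outliers(ages)

# For a nominal attribute only frequencies (and the mode) are meaningful.
education_counts = Counter(education)
```

Means and variances are computed only for the numeric attribute here; for nominal attributes such as Education, a frequency table is the appropriate summary.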

1B. Data preprocessing

Perform each of the following data preparation tasks (each task applies to the original data) using your choice of tool:

a. Use the following binning techniques to smooth the values of the Age attribute:
• equi-width binning
• equi-depth binning.
In the assignment report, illustrate your steps for each of these techniques. In your Excel workbook file, place the results in separate columns in the corresponding spreadsheet. Use your judgement in choosing the appropriate number of bins, and justify this in the report.
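
The two binning techniques can be sketched as follows. This is an illustrative Python sketch, not a prescribed solution: the ages are invented, the choice of three bins is arbitrary, and smoothing by bin means is only one of the usual smoothing options (bin medians or bin boundaries are others).

```python
def equi_width_bins(values, n_bins):
    """Assign each value a bin index using intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equi_depth_bins(values, n_bins):
    """Assign bin indices so that each bin holds (nearly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def smooth_by_bin_means(values, bins):
    """Replace each value with the mean of the bin it belongs to."""
    totals, counts = {}, {}
    for v, b in zip(values, bins):
        totals[b] = totals.get(b, 0) + v
        counts[b] = counts.get(b, 0) + 1
    return [totals[b] / counts[b] for b in bins]


# Hypothetical ages, for illustration only.
ages = [19, 23, 25, 31, 34, 38, 42, 47, 52]
width_bins = equi_width_bins(ages, 3)
depth_bins = equi_depth_bins(ages, 3)
smoothed = smooth_by_bin_means(ages, depth_bins)
```

Note that this simple equi-depth assignment splits purely by rank, so identical ages can end up in different bins; whether that is acceptable is a judgement to state in the report.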

b. Use the following techniques to normalise the attribute Age:
• min-max normalization to transform the values onto the range [0.0-1.0].
• z-score normalization to transform the values.
In the assignment report, provide an explanation of each of the applied techniques. In your Excel workbook file, place the results in separate columns in the corresponding spreadsheet.
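
Both normalisation techniques reduce to a one-line formula: min-max maps v to (v - min) / (max - min), and z-score maps v to (v - mean) / standard deviation. The sketch below shows them in Python with invented ages. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation (`statistics.stdev`) is an equally common choice and should be stated in the report. Both functions assume the values are not all identical, otherwise the denominator is zero.

```python
import statistics


def min_max_normalise(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score_normalise(values):
    """Centre values on the mean and scale by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical ages, for illustration only.
ages = [20, 30, 40, 50, 60]
min_max = min_max_normalise(ages)
z_scores = z_score_normalise(ages)
```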

c. Discretise the Age attribute into the following categories: Teenager = 1-20; Young = 21-30; Mid_Age = 31-45; Mature = 46-65; Old = 66+. Provide the frequency of each category in your data set. In the assignment report, provide an explanation of each of the applied techniques. In your Excel workbook file, place the results in a separate column in the corresponding spreadsheet.
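
The category boundaries above translate directly into a lookup on inclusive upper bounds. A minimal Python sketch follows; the ages are invented, and rejecting ages below 1 is an assumption made here because the categories start at 1.

```python
from collections import Counter

# (inclusive upper bound, label), taken from the category ranges in task (c).
AGE_CATEGORIES = [(20, "Teenager"), (30, "Young"), (45, "Mid_Age"), (65, "Mature")]


def age_category(age):
    """Map an age in years to its category label."""
    if age < 1:
        # Assumption: ages below 1 are outside the defined categories.
        raise ValueError(f"age {age} is outside the defined categories")
    for upper, label in AGE_CATEGORIES:
        if age <= upper:
            return label
    return "Old"  # 66 and above


# Hypothetical ages, for illustration only.
ages = [19, 23, 25, 31, 34, 46, 65, 66, 80]
categories = [age_category(a) for a in ages]
frequencies = Counter(categories)
```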

d. Binarise the Education variable [with values "0" or "1"]. In the assignment report, provide an explanation of the applied binarisation technique. In your Excel workbook file, place the results in separate columns in the corresponding spreadsheet.
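
One common reading of binarisation, consistent with placing the results in separate columns, is to create one 0/1 indicator column per distinct Education value. The Python sketch below illustrates that reading only; the Education labels are invented and may not match the values in the actual data set, and other binarisation schemes are possible.

```python
def binarise(values):
    """Return one 0/1 indicator column per distinct category value."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical Education values, for illustration only.
education = ["Bachelors", "HS-grad", "Masters", "HS-grad"]
columns = binarise(education)

# Each row has exactly one indicator set to 1.
row_sums = [sum(col[i] for col in columns.values()) for i in range(len(education))]
```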

1C. Summary

At the end of the report, include a summary section in which you summarise your findings. The summary is not a narrative of what you have done, but a condensed, informative section on what you have found about the data that you should report to the Head of the Analytics Unit. The summary may include the most important findings: specific characteristics (or values) of some attributes, important information about the distributions, some clusters identified visually that you propose to examine, associations found that should be investigated more rigorously, etc.

Deliverables

The deliverables include:
• a report, whose structure should follow the tasks of the assignment, and
• an Excel workbook file with individual spreadsheets for each task (spreadsheets should be labeled according to the task names, for example, "1A"). Each of the results of parts (a) through (d) in task 1B should be presented in a separate spreadsheet (and, correspondingly, in a separate table in the assignment report).

Report: In the report, include a section (starting with a section title) for each of the tasks in this assignment. Your report will likely be between 25 and 30 pages in length using an 11 or 12 point font, including title page and graphs.

On average you will require between 15 and 23 hours to complete this assignment.

Assessment

This assignment is assessed as individual work. The assessment criteria are:
• Correctness of the initial data exploration (1A) -- 20%;
• Correctness of the preprocessing procedures, results and explanation of the steps (1B) -- 40%;
• Depth of data understanding - how comprehensive the explanations of your explorative results are, and the appropriateness of illustrations -- 20%;
• Quality of the summary section (1C) -- 20%.

Relationship to Objectives

This assignment addresses subject objectives 2 and 3.

Return of Assignments

We plan to return marked assignments within 3 weeks of submission. Emails will be sent when marking is complete.

Academic Standards

All text in your assignment should be paraphrased into your own words and referenced using the Harvard referencing style. Please refer to the Subject Outline for details about penalties for Academic Misconduct.

Late Penalties

A late penalty of up to 50% may be applied to submitted work unless prior arrangements have been made with the subject coordinator.

Special Consideration

You may apply for special consideration (SC) due to unforeseen circumstances, either before or after the due date, at http://www.sau.uts.edu.au/assessment/consideration/online.html. The three basic reasons for SC are health, family, or work problems; "I haven't finished yet" is not a valid reason. You must provide documentary evidence to support your claim, such as a doctor's certificate, a statutory declaration, or a letter from your employer.

Note

The assignments will be checked through the Turnitin® Plagiarism Prevention system to identify unoriginal material copied (without reference to the source) from electronic sources on the Internet, electronic libraries, or other assignments.