31250 / 32130 Assignment 2

31250 Introduction to Data Analytics
32130 Fundamentals of Data Analytics

Assignment 2: data exploration and preparation

Due date: 11:59pm Friday, 5 May 2017
Marks: Out of 100, weighted to 25% of your final mark.
Submission format: A report in Adobe PDF (preferable) or MS Word Doc and an Excel spreadsheet.
Filename: ida_a2_xxxxxxxx.pdf or ida_a2_xxxxxxxx.doc, where xxxxxxxx is your student id; ida_a2_xxxxxxxx.xls for the spreadsheet.
Report format: Around 25-30 pages with the information described below. Use 11 or 12 point Times or Arial fonts.
Submit to: UTS Online assignment submission button. Please make sure to name the files as described above.

This assignment is individual work. Each of you will be working with an individual data set that you will be able to download from UTS Online.

Scenario

You have just started working as a data miner/analyst in the Analytics Unit of a company. The Head of the Analytics Unit has brought you a data set [a welcome present ;-))]. The data set includes two files: a description of the attributes and a table with the actual values of these attributes. The Head of the Analytics Unit has mentioned to you that this is some sort of demographic data that a potential client has provided for analysis. The Head of the Analytics Unit would like to have a report with some insights about that data that she could deliver to the client. Your tasks include:
• understanding the specifics of the data set
• extracting information about each of the attributes, possible associations between them and other specifics of the data set.

The tasks in the assignment are specified below.
Data sets

The description of the attributes is the same for all students and comes in a tiny documentation file (download it from UTS Online). Each student is assigned an individual table with the actual values of these attributes. Please download the file that is linked to your name from UTS Online.

Tasks

1A. Initial data exploration

1. Identify the type of each attribute (nominal, ordinal, interval or ratio). If the type is not clear, you may need to justify your choice.
2. Identify the values of the summarising properties for each attribute, including frequency, location and spread (e.g. value ranges of the attributes, frequency of values, distributions, medians, means, variances, percentiles, etc.; that is, the statistics covered in the lectures and materials given). Note that not all of these summary statistics will make sense for all the attribute types, so use your judgement! Where necessary, use proper visualisations for the corresponding statistics.
3. Using KNIME or other tools, explore your data set and identify any outliers, clusters of similar instances, "interesting" attributes and specific values of those attributes. Note that you may need to 'temporarily' recode attributes to numeric or from numeric to nominal. In the report include the corresponding snapshots from the tools and an explanation of what has been identified there.

Present your findings in the assignment report.

1B. Data preprocessing

Perform each of the following data preparation tasks (each task applies to the original data) using your choice of tool:

a. Use the following binning techniques to smooth the values of the Age attribute:
   • equi-width binning
   • equi-depth binning.
   In the assignment report, illustrate your steps for each of these techniques. In your Excel workbook file place the results in separate columns in the corresponding spreadsheet. Use your judgement in choosing an appropriate number of bins, and justify this in the report.
b. Use the following techniques to normalise the attribute Age:
   • min-max normalisation to transform the values onto the range [0.0, 1.0]
   • z-score normalisation to transform the values.
   In the assignment report provide an explanation of each of the applied techniques. In your Excel workbook file place the results in separate columns in the corresponding spreadsheet.
c. Discretise the Age attribute into the following categories: Teenager = 1-20; Young = 21-30; Mid_Age = 31-45; Mature = 46-65; Old = 66+. Provide the frequency of each category in your data set.
   In the assignment report provide an explanation of the applied technique. In your Excel workbook file place the results in a separate column in the corresponding spreadsheet.
d. Binarise the Education variable [with values "0" or "1"].
   In the assignment report provide an explanation of the applied binarisation technique. In your Excel workbook file place the results in separate columns in the corresponding spreadsheet.

1C. Summary

At the end of the report include a summary section in which you summarise your findings.
The summary is not a narrative of what you have done, but a condensed, informative section on what you have found about the data that you should report to the Head of the Analytics Unit. The summary may include the most important findings (specific characteristics or values of some attributes, important information about the distributions, some clusters identified visually that you propose to examine, associations found that should be investigated more rigorously, etc.).

Deliverables

The deliverables include:
• a report whose structure should follow the tasks of the assignment, and
• an Excel workbook file with individual spreadsheets for each task (spreadsheets should be labelled according to the task names, for example, "1A"). Each of the results of parts (a) through (d) in task 1B should be presented in a separate spreadsheet (and, correspondingly, in a separate table in the assignment report).

Report: In the report include a section (starting with a section title) for each of the tasks in this assignment.

Your report will likely be between 25 and 30 pages in length using an 11 or 12 point font, including title page and graphs. On average you will require between 15 and 23 hours to complete this assignment.

Assessment

This assignment is assessed as individual work.
The assessment criteria are:
• Correctness of the initial data exploration (1A) -- 20%
• Correctness of the preprocessing procedures, results and explanation of the steps (1B) -- 40%
• Depth of data understanding: how comprehensive the explanations of your explorative results are, and the appropriateness of illustrations -- 20%
• Quality of the summary section (1C) -- 20%

Relationship to Objectives

This assignment addresses subject objectives 2 and 3.

Return of Assignments

We plan to return marked assignments within 3 weeks of submission. Emails will be sent when marking is complete.

Academic Standards

All text in your assignment should be paraphrased into your own words and referenced using the Harvard referencing style. Please refer to the Subject Outline for details about penalties for Academic Misconduct.

Late Penalties

A late penalty of up to 50% may be applied to submitted work unless prior arrangements have been made with the subject coordinator.

Special Consideration

You may apply for special consideration (SC) due to unforeseen circumstances, either before or after the due date, at http://www.sau.uts.edu.au/assessment/consideration/online.html. The three basic reasons for SC are health, family, or work problems; "I haven't finished yet" is not a valid reason. You must provide documentary evidence to support your claim, such as a doctor's certificate, a statutory declaration, or a letter from your employer.

Note

The assignments will be checked through the Turnitin® Plagiarism Prevention system to identify unoriginal material copied (without reference to the source) from electronic sources on the Internet, electronic libraries, or other assignments.
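
Illustration of the task 1B techniques

For orientation only, the transformations named in task 1B (a) to (d) might be sketched in Python as below. This is a minimal sketch, not a prescribed method: the sample ages, the Education labels and all function names are assumptions made for illustration, bin smoothing is done by bin means, and the z-score uses the population standard deviation (the sample standard deviation is an equally defensible choice). Your own tool, number of bins and justification may differ.

```python
from collections import Counter
from statistics import mean, pstdev


def equi_width_smooth(values, n_bins):
    """Smooth by bin means, using bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values identical: nothing to smooth
        return list(values)
    width = (hi - lo) / n_bins
    # Clamp the index so that the maximum value falls into the last bin.
    idx = [min(int((v - lo) / width), n_bins - 1) for v in values]
    means = {b: mean(v for v, i in zip(values, idx) if i == b) for b in set(idx)}
    return [means[i] for i in idx]


def equi_depth_smooth(values, n_bins):
    """Smooth by bin means, using bins holding (almost) equal numbers of values.

    Tied values may be split across neighbouring bins in this simple version.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    smoothed = [0.0] * n
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if members:
            m = mean(values[i] for i in members)
            for i in members:
                smoothed[i] = m
    return smoothed


def min_max(values):
    """Rescale onto [0.0, 1.0]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Upper bounds taken from the categories listed in task 1B (c).
AGE_CATEGORIES = [(20, "Teenager"), (30, "Young"), (45, "Mid_Age"), (65, "Mature")]


def age_category(age):
    """Map an age to its category; ages below 1 are not treated specially here."""
    for upper, label in AGE_CATEGORIES:
        if age <= upper:
            return label
    return "Old"


def one_hot(values):
    """Binarise a nominal attribute: one 0/1 column per distinct value."""
    return {lab: [int(v == lab) for v in values] for lab in sorted(set(values))}


if __name__ == "__main__":
    ages = [18, 22, 25, 31, 40, 47, 52, 66]   # made-up sample, not real data
    print(equi_width_smooth(ages, 3))
    print(equi_depth_smooth(ages, 4))
    print(min_max(ages))
    print(z_score(ages))
    print(Counter(age_category(a) for a in ages))
    print(one_hot(["HS", "BSc", "HS"]))       # hypothetical Education labels
```

Each function returns a list (or, for one_hot, a set of 0/1 columns) aligned with the input order, so the results can be pasted as separate columns next to the original attribute in a spreadsheet.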