Enterprise Business Intelligence (BUS5EBI) Semester 1, 2017
Group Assignment
Due: 5pm, 22nd May 2017
Assignment Details:
1. This is a group assignment worth 45% of the total assessment for BUS5EBI (15% for Presentation, 10% for Report and 20% for Investigation, Analysis and Interpretation).
2. Your work should be submitted online via the LMS by the due date. The submission link on the LMS is labelled “Group Assignment Submit Here!”. The submission should be named in the format “EBIAssignment_GroupID” and should include the following:
a. The project files used for the project.
b. The report document in Microsoft Word format.
c. The presentation file in Microsoft PowerPoint format.
d. There should be only one submission for each group.
e. The online Academic Integrity module and statement of student responsibility will suffice for the individual statements of authorship, so there is no need to attach a copy.
Objective of the Group Assignment
1. To ascertain that each member understands the use of the data mining techniques and the situations in which they should be used
2. Encourage collaboration amongst peers while undertaking this project
3. Test the ability of group members to explain the outcome of their analysis and create useful knowledge
4. Present tasks and outcomes of project in a professional manner
Important Note
PLEASE NOTE THAT ANSWERS FOR RELEVANT QUESTIONS AND COMMENTS SHOULD BE SUPPORTED WITH APPROPRIATE ARGUMENTS AND RELEVANT SCREEN SHOTS
Task 1 of 3: Decision Trees
The Dataset for task 1: Online news popularity 1/2/3 (Please look up the assigned dataset for your group in the Project Groups listing on the LMS)
This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The idea is to be able to predict the number of shares in social networks (popularity).
Attribute Information:
Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)
Attribute details:
0. url: URL of the article
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares
For the dataset, complete the following tasks:
Missing Values:
o Are there any anomalies (unusual data or missing values) in the given dataset? Support your answer with appropriate argument.
o List two possible strategies to handle cases with missing values in the data (if applicable) and provide appropriate reasoning.
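As a rough illustration of the two questions above, the sketch below counts missing values per column and then applies two common handling strategies: dropping incomplete rows and imputing the column median. It is only a minimal sketch in plain Python, not the SAS Enterprise Miner workflow; the column names come from the attribute list, while the sample rows and the helper names (`count_missing`, `drop_incomplete`, `impute_median`) are invented for illustration.

```python
from statistics import median

# Hypothetical sample rows: column names follow the attribute list,
# but the values are invented for illustration only.
rows = [
    {"n_tokens_title": 12, "num_imgs": 1, "shares": 593},
    {"n_tokens_title": 9, "num_imgs": None, "shares": 711},
    {"n_tokens_title": None, "num_imgs": 3, "shares": 1500},
    {"n_tokens_title": 10, "num_imgs": 0, "shares": 1200},
]


def count_missing(rows):
    """Return {column: number of missing (None) values}."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return counts


def drop_incomplete(rows):
    """Strategy 1: keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_median(rows):
    """Strategy 2: replace each missing value with its column median."""
    cols = list(rows[0].keys())
    medians = {c: median(r[c] for r in rows if r[c] is not None) for c in cols}
    return [
        {c: (r[c] if r[c] is not None else medians[c]) for c in cols}
        for r in rows
    ]


missing = count_missing(rows)
complete = drop_incomplete(rows)
imputed = impute_median(rows)
print(missing)
print(len(complete), "complete rows;", "imputed num_imgs:", imputed[1]["num_imgs"])
```

Dropping rows is simple but discards data, whereas imputation keeps every case at the cost of introducing estimated values; which trade-off is acceptable depends on how many values are missing.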
Decision trees
o Create a decision tree for the provided dataset.
o Do the subjectivity and the publishing day of a news article matter?
o Is this dataset appropriate for analysing news popularity? Explain.
o Which variable ranks highest in variable importance?
o Which characteristics in an article would help it achieve the maximum number of shares?
o Based on these results, what other value added observations can you make?
Task 2 of 3: Association Rule mining
The Dataset for task 2: Groceries 1/2/3 (Please look up the assigned dataset for your group in the Project Groups listing on the LMS)
The dataset describes grocery purchases. The store wants to analyse associations among purchases of these items for the purposes of point-of-sale display, guidance to sales personnel in promoting cross sales, and guidance for piloting an eventual time-of-purchase electronic recommender system to boost cross sales.
For the dataset, complete the following tasks:
Association rule mining:
o Perform association rule mining on the dataset provided.
o In your analysis, which pairs of entries have the highest lift value? List the top 3.
o In your opinion, what are the top 3 recommendations that can increase sales based on the above Analysis? Support your answer with appropriate argument.
o Can you recommend two products which can be sold as bundle offers based on the current data?
o Based on these results, what other value added observations/recommendations can you make (if applicable)?
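For reference, the lift of a pair of items A and B is support(A and B) / (support(A) × support(B)); a value above 1 means the items are bought together more often than would be expected if they were independent. The sketch below computes pairwise lift in plain Python and lists the top 3 pairs. It is a minimal illustration under stated assumptions: the baskets are invented, the helper names (`support`, `pair_lifts`) are hypothetical, and it is not a substitute for the association node in the mining tool.

```python
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_lifts(transactions):
    """Return {(a, b): lift} for every pair that occurs together at least once."""
    items = sorted(set().union(*transactions))
    lifts = {}
    for a, b in combinations(items, 2):
        joint = support({a, b}, transactions)
        if joint > 0:
            lifts[(a, b)] = joint / (
                support({a}, transactions) * support({b}, transactions)
            )
    return lifts


lifts = pair_lifts(transactions)
# Sort pairs by lift, highest first, and keep the top 3.
top = sorted(lifts.items(), key=lambda kv: kv[1], reverse=True)[:3]
for pair, lift in top:
    print(pair, round(lift, 3))
```

On real data, lift is usually read together with support and confidence, since a very high lift can come from a pair that appears in only a handful of transactions.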
Task 3 of 3: Clustering
The Dataset for task 3: East West Airlines 1/2/3 (Please look up the assigned dataset for your group in the Project Groups listing on the LMS)
The East West airlines dataset contains information on 1000 passengers who belong to an airline’s frequent flier program. For each passenger, the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for targeting different segments for different types of mileage offers.
Dataset Description:
Field Name | Data Type | Max Data Length | Raw Data or Telcom Created Field? | Description
--- | --- | --- | --- | ---
ID# | NUMBER | | Telcom | Unique ID
Balance | NUMBER | 8 | Raw | Number of miles eligible for award travel
Qual_miles | NUMBER | 8 | Raw | Number of miles counted as qualifying for Topflight status
cc1_miles | CHAR | 1 | Raw | Number of miles earned with freq. flyer credit card in the past 12 months (see miles bins below)
cc2_miles | CHAR | 1 | Raw | Number of miles earned with Rewards credit card in the past 12 months (see miles bins below)
cc3_miles | CHAR | 1 | Raw | Number of miles earned with Small Business credit card in the past 12 months (see miles bins below)
Bonus_miles | NUMBER | | Raw | Number of miles earned from non-flight bonus transactions in the past 12 months
Bonus_trans | NUMBER | | Raw | Number of non-flight bonus transactions in the past 12 months
Flight_miles_12mo | NUMBER | | Raw | Number of flight miles in the past 12 months
Flight_trans_12 | NUMBER | | Raw | Number of flight transactions in the past 12 months
Days_since_enroll | NUMBER | | Telcom | Number of days since Enroll_date
Award? | NUMBER | | Telcom | Dummy variable for Last_award (1=not null, 0=null)

Note: miles bins (cc1_miles, cc2_miles, cc3_miles):
1 = under 5,000
2 = 5,000 - 10,000
3 = 10,001 - 25,000
4 = 25,001 - 50,000
5 = over 50,000
For the dataset, complete the following tasks:
Missing Values:
o Are there any anomalies (unusual data/missing values) in the given dataset? Support your answer with appropriate argument.
o List possible strategies to handle cases with unusual or missing values in the data (if applicable).
Clustering:
o What would happen if the data were not standardized? Explain.
o How does SAS Enterprise Miner handle standardisation?
o Perform k-means clustering on the dataset. What would be the optimal value of ‘k’ in this case? Explain.
o Which cluster(s) would you target for offers, and what type of offers would you target to customers in that cluster? Include proper reasoning in support of your choice of cluster(s) and the corresponding offer(s).
o State the business proposition for the largest cluster. What potential offers can you suggest for this cluster to increase ticket sales?
o If applicable, mention the business proposition for the second largest cluster.
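To make the standardisation question above concrete, the sketch below shows how Euclidean distance, which k-means relies on, behaves before and after z-score standardisation. It is a minimal sketch under stated assumptions: the passenger values are invented, only two fields (Balance and Bonus_trans) are used, the helper names are hypothetical, and z-scoring is just one common standardisation choice, not necessarily what SAS Enterprise Miner applies.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical passengers as (Balance, Bonus_trans); values invented
# for illustration only.
passengers = [
    (20000, 2),
    (21000, 30),
    (60000, 3),
    (61000, 28),
]


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std dev."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, mus, sigmas))
        for row in rows
    ]


def nearest(rows, i):
    """Index of the row closest to row i by Euclidean distance."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: dist(rows[i], rows[j]))


raw_neighbor = nearest(passengers, 0)
std_neighbor = nearest(standardize(passengers), 0)
print("nearest to passenger 0 (raw):", raw_neighbor)
print("nearest to passenger 0 (standardized):", std_neighbor)
```

On the raw values, Balance is measured in tens of thousands and therefore dominates the distance, so passenger 0 is matched to the passenger with a similar balance even though their bonus transaction counts differ widely. After standardisation both fields contribute on a comparable scale and the nearest neighbour changes.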