Assignment title: Information
Scenario
You have been approached by the airline Qantas to help them make a decision on the type of service they should introduce for their Brisbane customers wanting to fly to New York via Los Angeles.
Qantas already has a service operating between Brisbane (BNE) and Los Angeles (LAX) with one return flight per day. QF15 (BNE to LAX) lands in Los Angeles at 6:25am local time, and QF16 (LAX to BNE) departs Los Angeles at 11:45pm local time. On average, QF15 lands 13 minutes early and QF16 departs 29 minutes late.
Qantas has two options to introduce a service to New York. Firstly, they can introduce their own Qantas-operated service between New York and LAX. Secondly, they can recommend to passengers a connecting airline that services New York, which passengers would book through Qantas. Regardless of which option is selected, Qantas would like you to ensure that there is a gap of two to four hours between the average departure/arrival time and the relevant service to/from New York.
Important facts about flights between New York and Los Angeles:
• The flight time between New York and Los Angeles is 6 hours both ways.
• New York is 3 hours ahead of Los Angeles in time zone.
• Both cities are within the United States so passengers do not have to pass through customs when travelling between the cities.
• New York has three airports, which are serviced by many carriers.
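The two-to-four-hour gap can be turned into concrete time windows with some clock arithmetic. The sketch below is one possible reading of the scenario, not a prescribed method: it assumes the gap is measured from the average actual times of QF15 and QF16 (scheduled time adjusted by the average early/late figure), and that a New York to LAX flight must arrive two to four hours before QF16 leaves. The reference date and all variable names are illustrative assumptions.

```python
from datetime import datetime, timedelta

FLIGHT_TIME = timedelta(hours=6)       # New York <-> Los Angeles, from the scenario
NY_AHEAD_OF_LA = timedelta(hours=3)    # time-zone difference, from the scenario
MIN_GAP, MAX_GAP = timedelta(hours=2), timedelta(hours=4)

# Arbitrary reference date (assumption); only the clock times matter.
day = datetime(2000, 1, 1)

# QF15 is scheduled to land at 6:25am and lands 13 minutes early on average.
qf15_avg_arrival = day.replace(hour=6, minute=25) - timedelta(minutes=13)
# QF16 is scheduled to depart at 11:45pm and departs 29 minutes late on average.
qf16_avg_departure = day.replace(hour=23, minute=45) + timedelta(minutes=29)

# LAX -> New York: depart 2 to 4 hours after QF15 lands (Los Angeles time).
outbound_departure_la = (qf15_avg_arrival + MIN_GAP, qf15_avg_arrival + MAX_GAP)

# New York -> LAX: arrive 2 to 4 hours before QF16 departs (Los Angeles time).
inbound_arrival_la = (qf16_avg_departure - MAX_GAP, qf16_avg_departure - MIN_GAP)

# Subtract the flight time, then shift to New York local time,
# to get the matching departure window from New York.
inbound_departure_ny = tuple(
    t - FLIGHT_TIME + NY_AHEAD_OF_LA for t in inbound_arrival_la
)


def fmt(window):
    """Format a (start, end) pair of datetimes as 24-hour clock times."""
    return " to ".join(t.strftime("%H:%M") for t in window)


print("Depart LAX for New York (LA time):", fmt(outbound_departure_la))
print("Arrive LAX from New York (LA time):", fmt(inbound_arrival_la))
print("Depart New York for LAX (NY time):", fmt(inbound_departure_ny))
```

Under these assumptions, the New York departure window is the one to compare against the departure times in the supplied datasets, since the files describe flights leaving the New York City airports.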
Qantas has supplied you with six datasets in CSV format that contain information about the flights leaving three New York City airports on a specified day. The CSV files also contain some additional information that may be useful for your analysis.
Your Task
Qantas would like you to analyze this data only and at a minimum report on:
1. What services currently exist between each New York City airport and Los Angeles (LAX)?
2. How does the performance of each New York airport compare?
3. How does the performance of each airline compare?
4. What factors may affect the performance of particular airlines, and New York airports, and any evidence of association?
5. What suitable options are available to Qantas based on the information available, and what you would recommend to Qantas?
6. What further data would help them better inform their decision?
How to approach the assignment
You have to work within the data provided; in other words, you cannot consider any information other than that given in the CSV files. If you need to make an assumption while analyzing the data, it should be meaningful.
You will be required to explore the data provided, improve its quality and process it for reporting and mining purposes. Once the data is ready for analysis (i.e. integrated and cleaned), you should be able to set the variables according to their roles in your chosen software. The variable roles should also be set according to the type of analysis you choose, such as decision tree, clustering or association analysis. Perform the various data analytics operations and report the meaningful results.
Overall the aim of the assignment is to produce a 1,750 word report that discusses the questions Qantas would like answered and identifies any trends in the data. To do this you will need to:
1. Clean the data
2. Reformat the data to remove redundancy and inconsistency
3. Remove outliers from the data (this is easiest to do in Excel using the z-score method)
4. Determine the variables and their roles (e.g. categorical or numerical, input or target)
5. Build a cube, perform multivariate analysis and discuss the outcome
6. Import the data into a data analytics package, such as Orange
7. Use software to manipulate the data, calculate descriptive statistics, make tables and graphs, and use data-mining tools to identify patterns and trends in the data
8. Make recommendations to the company based on what you find in the data
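The z-score method mentioned in step 3 can also be done outside Excel. The following is a minimal sketch, assuming a numeric delay column and a cut-off of 3 standard deviations; the column names (`carrier`, `arr_delay`), the threshold and the sample rows are hypothetical and are not taken from the supplied files.

```python
import statistics


def split_by_z_score(rows, column, threshold=3.0):
    """Split rows into (kept, outliers) using the z-score of one numeric column.

    z = (value - mean) / sample standard deviation; a row is treated as an
    outlier when |z| exceeds the threshold (3.0 is a common but arbitrary choice).
    """
    values = [float(row[column]) for row in rows]
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    kept, outliers = [], []
    for row, value in zip(rows, values):
        z = (value - mean) / stdev if stdev else 0.0
        (outliers if abs(z) > threshold else kept).append(row)
    return kept, outliers


# Hypothetical arrival delays in minutes, stored as strings as csv.DictReader would.
sample_delays = ["5", "-3", "12", "0", "7", "-8", "3", "10", "-2", "4",
                 "6", "1", "-5", "9", "2", "8", "-1", "11", "0", "400"]
sample_rows = [{"carrier": "X", "arr_delay": d} for d in sample_delays]

kept, outliers = split_by_z_score(sample_rows, "arr_delay")
print(len(kept), "rows kept;", len(outliers), "outlier(s):", outliers)
```

Returning the removed rows separately makes it straightforward to write them to their own file, for example with `csv.DictWriter`. Note that in very small samples no value can reach a z-score of 3, so the threshold may need adjusting to the data.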
Submission Requirements
You are to submit the following files, compressed as assignment2.zip, through Blackboard:
1. CSV files containing your clean data that you used for analysis (import1.csv, import2.csv etc.)
2. A CSV file containing only the data you identified as outliers and removed from your dataset (outliers.csv)
3. A 1,750 word report (+/- 20%) which contains the following sections (report.pdf):
a. Introduction
b. Data Summary (quality improvement: errors and outliers removed etc.)
c. Existing Services
d. Airport Performance
e. Airline Performance
f. Cube based Multivariate Analysis and Data mining based trends influencing Performance
g. Business Options and Recommendations
h. Further Data Required
i. Description of how software was used
j. Conclusion
Hints
1. A few examples of descriptive statistical information you might want to calculate are the:
a. Average value (mean) for numerical variables
b. Most frequently occurring value (mode) for categorical or numerical variables
c. On average, how much each measurement deviates from the mean (standard deviation)
d. Span of values over which the data set occurs (range), and
e. Middle value of the set when the values are sorted in order (median)
2. The majority of your work will be in sections d, e and f. These sections should discuss the techniques you used to analyse the data and the results.
a. The performance of each New York airport is an important consideration in selecting a suitable expansion strategy.
b. To compare the performance of each airport, flight arrival times (early or delayed) can be used as an indicator; add plots where needed to support your discussion.
c. The performance of individual airlines is critical to selecting an appropriate partner airline to deliver connecting flights. The spread of delay time for each airline can be used to lead this discussion.
d. You can use visualization (e.g. box-and-whisker plots) and statistics to explain your view.
e. Evidence of association between airline and the considered factors for comparing them should be included.
3. Section f should build on what you have already discussed in sections d and e.
4. These sections should together make use of most of the analysis techniques introduced in the lecture such as multivariate analysis (cube), decision tree, clustering and/or association analysis.
5. It would be useful for you to use graphs and tables in these sections, but do not include tables and graphs for every analysis you perform. Report the final tables and graphs that give meaningful information about data and the analyses you perform. You should import your graphs and figures from the statistical software you use.
6. For g, you should use the findings of your data mining approaches to make a business decision or recommendation.
7. For h, think about what additional data could benefit your intelligent data analysis and make suggestions accordingly.
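The descriptive statistics listed in hint 1 can be computed per airline (or per airport) before comparing them. The sketch below uses only the Python standard library; the carrier labels and delay values are made-up placeholders, not figures from the supplied datasets, and grouping by carrier is just one possible choice.

```python
import statistics
from collections import defaultdict

# Hypothetical (carrier, arrival delay in minutes) pairs; negative means early.
flights = [
    ("CarrierA", 5), ("CarrierA", 5), ("CarrierA", -10),
    ("CarrierA", 20), ("CarrierA", 0),
    ("CarrierB", 30), ("CarrierB", -5), ("CarrierB", 10),
    ("CarrierB", 10), ("CarrierB", 45),
]


def summarise(values):
    """Return the descriptive statistics named in hint 1 for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),   # middle value of the sorted data
        "mode": statistics.mode(values),       # most frequently occurring value
        "stdev": statistics.stdev(values),     # sample standard deviation
        "range": max(values) - min(values),
    }


# Group delays by carrier, then summarise each group.
delays_by_carrier = defaultdict(list)
for carrier, delay in flights:
    delays_by_carrier[carrier].append(delay)

summary = {carrier: summarise(d) for carrier, d in delays_by_carrier.items()}

for carrier, stats in summary.items():
    print(carrier, {name: round(value, 2) for name, value in stats.items()})
```

The same grouping can be repeated with the departure airport as the key to compare airports, and the standard deviation and range give a first view of the spread of delay times that a box plot would show graphically.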