Assignment title: Information
Question 1 (15 points) LBS is an investment management firm managing about $600 million in assets, primarily
in stocks and mutual funds, for both institutional and individual investors. It believes that conventional approaches
to money management are having an increasingly difficult time meeting or exceeding benchmarks. Further, it
believes that the new generation of data-mining techniques can capture significant non-linear causal relationships
for use in forecasting when market and security price behavior is dominated by non-linearity.
LBS wants to maximize the return on the assets it invests for its clients while minimizing their risk
exposure. For LBS, it is not enough just to know which securities to purchase. In order to be successful, the asset
management firm must also know when to buy and sell the securities. The firm feels that it can do this through a
combination of high-quality analytic tools, highly efficient computer engineering, and market-savvy analysts.
The problem of developing a system to estimate future prices is daunting because financial processes are
generally characterized by high levels of non-linearity and complexity. The amount of data available to an analyst
is overwhelming. Also, financial markets are constantly evolving, so models must adapt to these changes.
The system needs to be able to quickly incorporate knowledge about a domain that often defies explicit
definition. On a day-to-day basis, random shocks, crowd psychology, and short-lived trends influence financial
markets. Also, different experts have widely varying interpretations of the data even after the fact. Even expert
traders sometimes have difficulty explaining what general principle led them to make a specific trade.
The system needs to be able to deal with and analyze complex data. As a result of the interactions among
several different market forces, financial markets can exhibit highly non-linear and highly complex behavior.
The system needs to be able to deal with the large amounts of economic and financial data that are generated
daily. It is difficult or impossible even for the most skilled expert to assimilate this amount of data accurately
and consistently. In the words of one experienced trader, "Even the smartest of us is not as smart as the
market. In order to make sense of the data, we have little choice but to turn to the computer."
The system needs to be able to adapt quickly over time. A trading strategy that works in a bull market may not
fare well in a bear market. Markets evolve and adapt to different forces over time.
The firm has determined that a meaningful horizon is about 4 weeks. It is an "active" manager that seeks to
outperform the market, as opposed to a "passive" manager that indexes its portfolio with the market and seeks only
to match the market's performance.
The system needs to be able to interpret and analyze large amounts of market data and "update its view of the
world" frequently and easily, accessing economic and market data from a variety of sources and, using these data,
identifying those stocks that are "likely" to be winners, and those that are more "likely" to be losers, over the next 4
weeks. LBS will use simulated trading systems to test the models. Models will be tested (or validated) by
backtesting over several historical years to determine how they would have performed. Models that recommend buying
stocks in volumes that were not obtainable or conducting so many trades that transaction fees wiped out profits
would not be considered successful.
LBS's data is plentiful, although not necessarily clean. The system does not need to make specific point
predictions for prices on a specific date but only to provide the decision maker with estimates of a security's upside
and downside potential. On the other hand, since a decision maker (typically a portfolio manager) would be
interpreting the results of a prediction, it would be useful if the model could offer some insight into its analysis. It
is also important that the system fits smoothly into LBS's workflow and current modeling tools. To do this, the
system must interface smoothly with the financial databases where the market data are stored.
Since LBS wants a 4-week time horizon, the system need not function in real-time. On the other hand, the
system must be able to perform the analysis on each individual security in a reasonable amount of time. The system
also must be able to be expanded to accommodate additional securities and input factors.
In addition, LBS would like to take up as little of the firm's expert traders' time as possible. Expert time is
valuable; each hour away from market analysis or trading can cost real dollars. Furthermore, and more important,
LBS has found that it can be somewhat difficult for its expert traders and analysts to articulate their expertise,
especially since the rules are complex and continually evolving.
(a) Briefly describe the modeling problem facing LBS, and identify what type of problem it is in terms of the
types of data mining problems discussed in session 1 (prediction, estimation, classification, clustering,
association, etc.). Justify your answer.
(b) What data mining model type would you propose for this problem? Justify your answer.
(c) What are two significant limitations of your proposed approach for the given problem?
Question 2 (10 points): Assume that you have to build an online recommendation system for buying cars. Cars
have hundreds of specifications/features. Comment on whether Naïve Bayes, K Nearest Neighbors or Decision
Trees would be the best approach for this type of system. Justify your answer.
Question 3 (10 points) Assume that using scanner data on customer purchases combined with demographic and
behavioral data on customers stored in the corporate data warehouse, you would like to build a predictive model
that would help classify customers into one of a set of distinct profitability segments (e.g., high, medium and low).
Further, assume that although your company operates across the whole Southern US, you would like to focus on
customers spending at least $500 per month on average for the past 12 months, at any of 5 stores in Texas. Discuss
whether K-means clustering would be useful to identify the relevant customer set. Justify your answer.
Question 4 (15 points). Which of the following is a symptom of a decision tree that is "over-fitted"? In each case,
briefly justify your answer.
(a) The error rate (misclassification) chart for the model is as in the graphs below (for the training and
validation sets):
[Figure: misclassification error rate versus number of leaf nodes, with separate curves for the validation set and the training set]
(b) The tree is unbalanced (i.e., some paths from the root to leaf nodes are long while others are short)
(c) The confusion (classification) matrices for both the training set and the validation set have large values in
the off-diagonal cells (Hint: In a confusion matrix C, c_ij indicates the number of cases whose actual output
value r_i was classified as r_j by the tree)
(d) The tree has a high overall misclassification rate for the training set but not for the validation set.
(e) A number of the leaf nodes have very low support.
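For reference, the hint in (c) can be made concrete with a minimal sketch of how a confusion matrix and the overall misclassification rate are tabulated. The class labels and the actual/predicted lists below are hypothetical and are not taken from any data set in this assignment.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Return C where C[i][j] counts cases with actual class i classified as class j."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(ri, rj)] for rj in classes] for ri in classes]

def misclassification_rate(matrix):
    """Share of cases that fall in the off-diagonal cells."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

# Hypothetical labels, for illustration only.
classes = ["yes", "no"]
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

C = confusion_matrix(actual, predicted, classes)
print(C)
print(misclassification_rate(C))
```

Computing the same matrix separately for the training set and the validation set allows the two error rates to be compared directly.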
Question 5 (15 points) Given the following data on purchase transactions expressed as itemsets:
 1  Bread Juice Ketchup
 2  Milk Juice Apples
 3  Pepper Apples Juice Wine
 4  Juice Ketchup Wine Salt
 5  Apples Detergent Wine
 6  Juice Ketchup Wine Apples
 7  Bread Milk Juice
 8  Detergent Wine Apples
 9  Salt Wine
10  Juice Ketchup Milk Apples
11  Bread Apples Wine
12  Milk Juice Detergent Ketchup
Each row is an itemset (i.e., a collection of items that were bought together).
(a) Identify all the large itemsets with minsup = 0.25 (i.e., 25%). For each large itemset, compute its support as
a percentage (%).
(b) Using the results in (a), state one association rule that has a confidence above 80% and acceptable lift.
Compute its confidence, support and lift.
(c) If the Apriori approach described in class were used to identify association rules for this data set, identify
three itemsets whose support would not have to be computed by the rule mining process. Explain why they
would not be considered.
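For reference, the quantities named in this question can be sketched as follows, using the standard definitions of support, confidence and lift together with the usual Apriori pruning test (a candidate is skipped if any of its subsets one item smaller is not large). The baskets, the 50% minimum support and all function names are hypothetical, chosen for illustration. They are not the transactions above, and the version of the algorithm presented in class may differ in detail.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence of lhs -> rhs divided by the support of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

def has_infrequent_subset(candidate, large_smaller):
    """True if some subset one item smaller is not large, so the candidate
    can be skipped without counting its support."""
    k = len(candidate) - 1
    return any(frozenset(s) not in large_smaller for s in combinations(candidate, k))

# Hypothetical baskets, for illustration only.
baskets = [
    {"Tea", "Sugar"},
    {"Tea", "Sugar", "Lemon"},
    {"Tea", "Lemon"},
    {"Sugar"},
]

print(support(baskets, {"Tea", "Sugar"}))
print(confidence(baskets, {"Tea"}, {"Sugar"}))
print(lift(baskets, {"Tea"}, {"Sugar"}))

# With a hypothetical minsup of 50%, {Sugar, Lemon} (support 25%) is not large,
# so the three-item candidate is pruned before its support is counted.
large_pairs = {frozenset({"Tea", "Sugar"}), frozenset({"Tea", "Lemon"})}
print(has_infrequent_subset(("Lemon", "Sugar", "Tea"), large_pairs))
```

A lift above 1 indicates that the left-hand side makes the right-hand side more likely than its baseline frequency. In this toy example the lift is below 1.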
Question 6 (15 points).
Consider the following dataset about customers of a particular product. The column "Buyer" indicates whether each
customer bought the product or not. You have been asked to use Naïve Bayes Classification to identify potential
buyers.
Name     Married  Job       Hair   Gender  Buyer
Peter    No       Manager   Short  Male    Yes
Claudia  Yes      Engineer  Long   Female  No
Angela   No       Lawyer    Long   Female  No
Amy      No       Manager   Long   Female  Yes
Albert   Yes      Engineer  Short  Male    Yes
Karin    No       Manager   Long   Female  No
Nina     Yes      Engineer  Short  Female  Yes
Sergio   Yes      Manager   Long   Male    Yes
Would the following person be a buyer or not (show your calculations)?
John     Yes      Engineer  Short  Male    ?
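For reference, the form of the Naïve Bayes calculation can be sketched as below: for each class, multiply the class prior by the conditional frequency of each attribute value within that class, then compare the resulting scores. The sketch uses a small hypothetical weather data set rather than the customer table above, applies no smoothing, and the function names are illustrative only.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and, per class, the frequency of each attribute value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def scores(model, row):
    """Unnormalized score per class: P(class) times the product of P(value | class)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    result = {}
    for label, n in class_counts.items():
        score = n / total
        for i, value in enumerate(row):
            score *= value_counts[(label, i)][value] / n
        result[label] = score
    return result

# Hypothetical data (Outlook, Temperature) -> Play, for illustration only.
rows = [
    ("Sunny", "Hot"),
    ("Sunny", "Mild"),
    ("Rainy", "Mild"),
    ("Rainy", "Hot"),
    ("Sunny", "Mild"),
]
labels = ["No", "Yes", "Yes", "No", "No"]

model = train(rows, labels)
s = scores(model, ("Sunny", "Mild"))
print(s)
print(max(s, key=s.get))
```

The scores are not probabilities until they are divided by their sum, but the class with the largest score is the predicted class either way.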
Question 7 (10 points) Assume that you have joined a company that sells disk drives for PCs. It has decided to
enter the market for mobile phones starting next year. The CEO has heard that neural nets are powerful tools for
building classification and prediction models, and has asked you to build a Neural Network model for classifying
mobile phone products proposed by your R&D department into one of the following three market potential
categories: Low, Medium, High. You have been given access to detailed data on the company's products and sales
for ten of the last eleven years (current year sales have yet to be compiled). How would you respond?
Question 8 (10 points) Your boss has suggested that rather than using a single type of classification model, it
might be useful to combine the strengths of different model types. So she has suggested that you initially build a set
of neural network models to figure out the key determinants of buying behavior in each segment, and then use
these significant variables to build a decision tree model which would provide the key thresholds for each variable
that influence the important outcomes in future buying behavior. How would you respond?