Assignment title: Information


Question 1 (15 points)

LBS is an investment management firm managing about $600 million in assets, primarily in stocks and mutual funds, for both institutional and individual investors. It believes that conventional approaches to money management are having an increasingly difficult time meeting or exceeding benchmarks. Further, it believes that the new generation of data-mining techniques can capture significant non-linear causal relationships for use in forecasting when market and security price behavior is dominated by non-linearity. LBS wants to maximize the return on the assets it invests for its clients while minimizing their risk exposure. For LBS, it is not enough just to know which securities to purchase. In order to be successful, the asset management firm must also know when to buy and sell the securities. The firm feels that it can do this through a combination of high-quality analytic tools, highly efficient computer engineering, and market-savvy analysts.

The problem of developing a system to estimate future prices is daunting because financial processes are generally characterized by high levels of non-linearity and complexity. The amount of data available to an analyst is overwhelming. Also, financial markets are constantly evolving, so models must adapt to these changes. So:

- The system needs to be able to quickly incorporate knowledge about a domain that often defies explicit definition. On a day-to-day basis, random shocks, crowd psychology, and short-lived trends influence financial markets. Also, different experts have widely varying interpretations of the data even after the fact. Even expert traders sometimes have difficulty explaining what general principle led them to make a specific trade.
- The system needs to be able to deal with and analyze complex data. As a result of the interactions among several different market forces, financial markets can exhibit highly non-linear and highly complex behavior.
- The system needs to be able to deal with the large amounts of economic and financial data that are generated daily. It is difficult or impossible even for the most skilled expert to assimilate this amount of data accurately and consistently. In the words of one experienced trader, "Even the smartest of us is not as smart as the market. In order to make sense of the data, we have little choice but to turn to the computer."
- The system needs to be able to adapt quickly over time. A trading strategy that works in a bull market may not fare well in a bear market. Markets evolve and adapt to different forces over time.

The firm has determined that a meaningful horizon is about 4 weeks. It is an "active" manager that seeks to outperform the market, as opposed to a "passive" manager that indexes its portfolio with the market and seeks only to match the market's performance. The system needs to be able to interpret and analyze large amounts of market data and "update its view of the world" frequently and easily, accessing economic and market data from a variety of sources and, using these data, identifying those stocks that are "likely" to be winners, and those that are more "likely" to be losers, over the next 4 weeks.

LBS will use simulated trading systems to test the models. Models will be tested (or validated) by back-testing over several historical years to determine how they would have performed. Models that recommend buying stocks in volumes that were not obtainable, or conducting so many trades that transaction fees wiped out profits, would not be considered successful. LBS's data is plentiful, although not necessarily clean. The system does not need to make specific point predictions for prices on a specific date but only to provide the decision maker with estimates of a security's upside and downside potential.
On the other hand, since a decision maker (typically a portfolio manager) would be interpreting the results of a prediction, it would be useful if the model could offer some insight into its analysis. It is also important that the system fits smoothly into LBS's workflow and current modeling tools. To do this, the system must interface smoothly with the financial databases where the market data are stored. Since LBS wants a 4-week time horizon, the system need not function in real time. On the other hand, the system must be able to perform the analysis on each individual security in a reasonable amount of time. The system also must be able to be expanded to accommodate additional securities and input factors.

In addition, LBS would like to take up as little of the firm's expert traders' time as possible. Expert time is valuable; each hour away from market analysis or trading can cost real dollars. Furthermore, and more importantly, LBS has found that it could be somewhat difficult for its expert traders and analysts to articulate their expertise, especially since the rules are complex and continually evolving.

(a) Briefly describe the modeling problem facing LBS, and identify what type of problem it is in terms of the types of data mining problems discussed in session 1 (prediction, estimation, classification, clustering, association, etc.). Justify your answer.
(b) What data mining model type would you propose for this problem? Justify your answer.
(c) What are two significant limitations of your proposed approach for the given problem?

Question 2 (10 points)

Assume that you have to build an online recommendation system for buying cars. Cars have hundreds of specifications/features. Comment on whether Naïve Bayes, K-Nearest Neighbors or Decision Trees would be the best approach for this type of system. Justify your answer.
Question 3 (10 points)

Assume that using scanner data on customer purchases combined with demographic and behavioral data on customers stored in the corporate data warehouse, you would like to build a predictive model that would help classify customers into one of a set of distinct profitability segments (e.g., high, medium and low). Further, assume that although your company operates across the whole Southern US, you would like to focus on customers spending at least $500 per month on average for the past 12 months, at any of 5 stores in Texas. Discuss whether K-means clustering would be useful to identify the relevant customer set. Justify your answer.

Question 4 (15 points)

Which of the following is a symptom of a decision tree that is "over-fitted"? In each case, briefly justify your answer.

(a) The error rate (misclassification) chart for the model is as in the graphs below (for the training and validation sets):
    [Figure: misclassification error rate versus number of leaf nodes, with one curve for the validation set and one for the training set]
(b) The tree is unbalanced (i.e., some paths from the root to leaf nodes are long while others are short).
(c) The confusion (classification) matrices for both the training set and the validation set have large values in the off-diagonal cells. (Hint: In a confusion matrix C, cij indicates the number of cases whose actual output value ri was classified as rj by the tree.)
(d) The tree has a high overall misclassification rate for the training set but not for the validation set.
(e) A number of the leaf nodes have very low support.

Question 5 (15 points)

Given the following data on purchase transactions expressed as itemsets:

 1  Bread, Juice, Ketchup
 2  Milk, Juice, Apples
 3  Pepper, Apples, Juice, Wine
 4  Juice, Ketchup, Wine, Salt
 5  Apples, Detergent
 6  Juice, Ketchup, Wine, Apples
 7  Bread, Milk, Juice
 8  Detergent
 9  Salt, Wine
10  Juice, Ketchup, Milk, Apples
11  Bread, Apples, Wine
12  Milk, Juice, Detergent

Each row is an itemset (i.e., a collection of items that were bought together).
(a) Identify all the large itemsets with minsup = 0.25 (i.e., 25%). For each large itemset, compute its support as a percentage (%).

Wine t Wine Apples t Ketchu p

(b) Using the results in (a), state one association rule that has a confidence above 80% and acceptable lift. Compute its confidence, support and lift.
(c) If the Apriori approach described in class were used to identify association rules for this data set, identify three itemsets whose support would not have to be calculated by the rule mining process (i.e., their support would not have to be computed). Explain why they would not be considered.

Question 6 (15 points)

Consider the following dataset about customers of a particular product. The column "Buyer" indicates whether each customer bought the product or not. You have been asked to use Naïve Bayes Classification to identify potential buyers.

Name     Married  Job       Hair   Gender  Buyer
Peter    No       Manager   Short  Male    Yes
Claudia  Yes      Engineer  Long   Female  No
Angela   No       Lawyer    Long   Female  No
Amy      No       Manager   Long   Female  Yes
Albert   Yes      Engineer  Short  Male    Yes
Karin    No       Manager   Long   Female  No
Nina     Yes      Engineer  Short  Female  Yes
Sergio   Yes      Manager   Long   Male    Yes

Would the following person be a buyer or not (show your calculations)?

Name     Married  Job       Hair   Gender  Buyer
John     Yes      Engineer  Short  Male    ?

Question 7 (10 points)

Assume that you have joined a company that sells disk drives for PCs. It has decided to enter the market for mobile phones starting next year. The CEO has heard that neural nets are powerful tools for building classification and prediction models, and has asked you to build a Neural Network model for classifying mobile phone products proposed by your R&D department into one of the following three market potential categories: Low, Medium, High. You have been given access to detailed data on the company's products and sales for ten of the last eleven years (current year sales have still to be compiled). How would you respond?
Question 8 (10 points)

Your boss has suggested that rather than using a single type of classification model, it might be useful to combine the strengths of different model types. So she has suggested that you initially build a set of neural network models to figure out the key determinants of buying behavior in each segment, and then use these significant variables to build a decision tree model which would provide the key thresholds of each variable that influence the important outcomes in future buying behavior. How would you respond?
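
Worked sketch for Question 6

The Naïve Bayes calculation asked for in Question 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: probabilities are plain relative frequencies taken from the eight training rows, with no Laplace or other smoothing (the estimator used in class may differ), and the helper name nb_scores is hypothetical. Exact fractions are used so that each factor can be checked by hand.

```python
from fractions import Fraction

# Training rows from the Question 6 table: (Married, Job, Hair, Gender, Buyer).
ROWS = [
    ("No", "Manager", "Short", "Male", "Yes"),      # Peter
    ("Yes", "Engineer", "Long", "Female", "No"),    # Claudia
    ("No", "Lawyer", "Long", "Female", "No"),       # Angela
    ("No", "Manager", "Long", "Female", "Yes"),     # Amy
    ("Yes", "Engineer", "Short", "Male", "Yes"),    # Albert
    ("No", "Manager", "Long", "Female", "No"),      # Karin
    ("Yes", "Engineer", "Short", "Female", "Yes"),  # Nina
    ("Yes", "Manager", "Long", "Male", "Yes"),      # Sergio
]


def nb_scores(rows, query):
    """Return {class: P(class) * product of P(x_i | class)}.

    Assumption: unsmoothed relative frequencies, so an attribute value
    never seen within a class drives that class's score to zero.
    """
    scores = {}
    for label in sorted({row[-1] for row in rows}):
        in_class = [row for row in rows if row[-1] == label]
        score = Fraction(len(in_class), len(rows))  # prior P(class)
        for i, value in enumerate(query):
            matches = sum(1 for row in in_class if row[i] == value)
            score *= Fraction(matches, len(in_class))  # P(x_i | class)
        scores[label] = score
    return scores


# John: Married = Yes, Job = Engineer, Hair = Short, Gender = Male.
john = ("Yes", "Engineer", "Short", "Male")
scores = nb_scores(ROWS, john)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Under these assumptions the score for "Yes" is (5/8)(3/5)(2/5)(3/5)(3/5) = 27/500 = 0.054, while the score for "No" is 0 because none of the three non-buyers in the table has short hair or is male. With a smoothed estimator the "No" score would be small but non-zero.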