Assignment title: Information


Information Retrieval Concepts and Text Processing

Question 1: (4 marks)
a) Identify the 3 key components of an information retrieval system and describe their function.
b) Describe the difference between a user's information need and a query.

Question 2: (3 marks)
a) Identify 3 simple, proven approaches to term weighting.
b) For each of the approaches, describe the justification for why it is successful.

Question 3: (7 marks)
a) For each of the words below, use the Porter Stemmer to identify its stem.
   i. Stagnate
   ii. Appendices
   iii. Oriental
   iv. Agree
   v. Startle
   vi. Punctuation
   vii. Marvellous
   viii. Evolution
   ix. Caste
   x. Angular
b) For each of the words above, identify which of the Porter Stemmer's steps, as defined in the reference in the readings list, help reduce the word to its stem.
c) For each of the word pairs below, identify whether the pair of stems reported by the Porter Stemmer represents a true positive, a true negative, a false positive or a false negative.
   i. Stagnate/Stagnation
   ii. Appendices/Appendix
   iii. Oriental/Orientation
   iv. Agree/Agreement
   v. Startle/Start
   vi. Punctuation/Punctual
   vii. Marvellous/Marvel
   viii. Evolution/Evolve
   ix. Caste/Cast
   x. Angular/Angle
d) The standard Porter Stemmer does not correctly stem words that end with the "ise" suffix. Identify which step (or steps) of the Porter Stemmer, again from the reference in the readings list, you would need to change in order to enable it to handle "ise" words.

Question 4: (6 marks)
You are working as a consultant for an organisation that develops search engines for companies. You have been asked to suggest some of the text (aka linguistic) processing techniques that could be used during indexing. For each of the following case studies identify:
a) a technique that you would recommend and why, and
b) a technique that you would not recommend and why.

1. Client 1 is the owner of a retro vinyl record/CD shop. The client would like you to develop a search engine that will facilitate customers searching for song titles and lyrics, for example "A day in the life", "Start me up", "Someone like you". The client's customers represent an audience that requires very precise results to be retrieved.

2. Client 2 has a large collection of legal text documents (> 1 terabyte) on a relatively small hard disk (~2 terabytes). The client is able to use the current disk exclusively for the collection and indexing but cannot afford to purchase a new disk. The client's customers represent an audience that requires a high level of recall, even if precision is compromised.
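
Note on the outcome labels used in Question 3(c): the four labels can be read as a two-way decision, namely whether the stemmer conflated the pair (produced the same stem) and whether the two words are actually related. The following is only a minimal sketch of that reading, not part of the readings. The function name classify_pair, the "related" flag (a human judgement supplied by the assessor) and the example stems are all assumptions made for illustration; the stems are hypothetical inputs, are not taken from the word lists above, and have not been checked against any stemmer.

```python
def classify_pair(stem_a: str, stem_b: str, related: bool) -> str:
    """Label one word pair in a stemmer evaluation.

    stem_a, stem_b: the stems reported for the two words (inputs, not computed here).
    related: the assessor's judgement that the two words should share a stem.
    """
    conflated = stem_a == stem_b  # "positive" means the stemmer merged the pair
    if conflated and related:
        return "true positive"    # merged, and rightly so
    if conflated and not related:
        return "false positive"   # merged unrelated words (over-stemming)
    if not conflated and related:
        return "false negative"   # failed to merge related words (under-stemming)
    return "true negative"        # kept unrelated words apart


# Hypothetical stems and relatedness judgements, for illustration only.
examples = [
    ("connect", "connect", True),
    ("univers", "univers", False),
    ("alumnu", "alumni", True),
    ("cat", "dog", False),
]

for stem_a, stem_b, related in examples:
    print(f"{stem_a} / {stem_b}: {classify_pair(stem_a, stem_b, related)}")
```

The stems themselves still have to come from the Porter Stemmer as defined in the readings list, and the relatedness judgement has to be argued for each pair; the sketch only fixes how the two combine into a label.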