Assignment title: Information

Readings:

Büttcher, S., Clarke, C. L. A., & Cormack, G. V. (2010). Chapter 2: Basic Techniques. Information retrieval: Implementing and evaluating search engines (pp. 33-39; 46-51; 51-60). Retrieved from http://www.ir.uwaterloo.ca/book/02-basic-techniques.pdf

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Chapter 1: Boolean retrieval. Introduction to Information Retrieval (pp. 1-18). Cambridge University Press. Retrieved from http://nlp.stanford.edu/IR-book/pdf/01bool.pdf

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Chapter 2: The term vocabulary and postings lists. Introduction to Information Retrieval (pp. 36-41). Cambridge University Press. Retrieved from http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Chapter 6: Scoring, term weighting and the vector space model. Introduction to Information Retrieval (pp. 109-133). Cambridge University Press. Retrieved from http://nlp.stanford.edu/IR-book/pdf/06vect.pdf

Tf-idf :: A Single-Page Tutorial - Information Retrieval and Text Mining. (2015). Retrieved 3 August, 2015, from http://www.tfidf.com/

Question 1: (6 marks)

Imagine that you are given the following collection:

Document-ID   Text
1             Love to love you, baby
2             Love me, love you, baby
3             Nobody Loves No Baby (Like My Baby Loves Me)

Figure 1. An example collection.

a) Create a table that contains appropriate values for the dictionary and postings files for the collection. The table should be formatted as follows:
• Information about each term should be on a separate row;
• The information for the dictionary and postings should be written in separate columns;
• The dictionary should list each term in the collection;
• For each term, the postings should contain a list of the documents, identified by document id, that the term occurs in. For terms that occur in more than one document, the document ids should be separated by semicolons;
• The information should be sorted alphabetically by term.
Assume that the following steps are taken, in order, during indexing:
1. Terms are folded to lower case;
2. Punctuation and non-alphanumeric characters are removed;
3. Spaces are used as the delimiter during tokenisation.

An example of the required format is supplied in Figure 2.

Dictionary   Posting
hello        4;6;10;
world        7;9;

Figure 2. Example format for question 1A

b) Repeat question 1A with the following changes. The table should be formatted as follows:
• The dictionary file should list each term in the collection and the number of times that term occurs in the collection, separated by a colon.
• For each term, the postings file should contain a list of document entries for the documents that the term occurs in. Each document entry should contain the document id and the number of times that the term occurs in that document, separated by a colon. For terms that occur in more than one document, the document entries should be separated by semicolons.

All other requirements and assumptions from Q1A remain. An example of the required format is supplied in Figure 3.

Dictionary   Posting
Hello:4      4:2;6:1;10:1;
World:9      7:5;9:4;

Figure 3. Example format required for question 1B/1C

c) Repeat question 1B with the following changes. Assume that the following steps are taken, in order, during indexing:
1. Terms are folded to lower case;
2. Punctuation and non-alphanumeric characters are removed;
3. Stopwords are removed using the SMART stopword list provided in the references (Salton, 1971);
4. The Porter Stemmer algorithm is applied using the steps outlined in the Porter Stemmer paper provided in the references (Porter, 1980);
5. Spaces are used as the delimiter during tokenisation.

All other requirements and assumptions from Q1B remain.

Question 2: (5 marks)

Imagine that the following set of text documents has been indexed in your collection.
Document Id   Text
1             Love will tear us apart
2             All you need is love
3             Run the world (Girls)
4             Love makes the world go round
5             Back in black
6             Lovely Rita
7             The man who sold the world
8             The most beautiful girl in the world
9             Lovin' you
10            Crazy in love

Figure 4. Collection Document and Text

a) Identify which subset of documents (by document id) would be retrieved by an information retrieval system for each of the queries below (I-V) if the Boolean Model was implemented.
I. Q1 "World"
II. Q2 "Love"
III. Q3 "World" AND "Girl"
IV. Q4 "Love" OR "World"
V. Q5 "Love" AND NOT "World"

Assume that the following steps are taken, in order, during indexing and querying:
1. Terms are folded to lower case;
2. Punctuation and non-alphanumeric characters are removed;
3. Spaces are used as the delimiter during tokenisation.

b) Repeat question 2A. However, this time assume that the following steps are taken, in order, during indexing and querying:
1. Terms are folded to lower case;
2. Punctuation and non-alphanumeric characters are removed;
3. The Porter Stemmer algorithm is applied using the steps outlined in the Porter Stemmer paper provided in the references (Porter, 1980);
4. Spaces are used as the delimiter during tokenisation.

Question 3: (7 marks)

Figures 5 and 6 provide the occurrence statistics for a set of terms (t1 – t5) and documents (Doc 1 – Doc 5). Figure 5 provides 1) a term by document occurrence matrix, 2) the total number of term occurrences in the collection for each term and for the collection as a whole, and 3) the total number of term occurrences in each document. Figure 6 provides the number of documents that each term (t1 – t5) occurs in, as well as the total number of documents in the collection.

            Occurrences
Term        Collection   Doc 1   Doc 2   Doc 3   Doc 4   Doc 5
t1          1,000        3       4       2       1       0
t2          20,000       0       10      4       10      2
t3          10,000       5       100     50      40      5
t4          5,000        3       1       2       4       1
t5          2,000        0       50      30      5       10
All Terms   5,000,000    50      1,000   400     200     500

Figure 5. Term Occurrence Statistics

Term        Number of Documents that Contain Term
t1          10
t2          100
t3          1,000
t4          10
t5          100
All Terms   1,000

Figure 6. Term Document Occurrence Statistics

a) For each of the terms (t1 – t5), identify the inverse document frequency weight (idf) using Equation 1.
b) Construct a term × document matrix which lists the term frequency weight (tf) for each term in each document using Equation 2.
c) Construct a term × document matrix which lists the tf.idf weight for each term in each document using Equation 3.
d) For each of the given terms (t1 – t5), specify which of the given documents (Doc 1 – Doc 5) would be the highest ranked document if the term was submitted as an unweighted single-term query and the similarity measure in Equation 3 was used to rank documents.

Question 4: (2 marks)

You are working as a consultant for an organisation that uses Lucene.NET to develop search engines for companies. For each of the following case studies, identify which of the Lucene.NET Analysers would be the most appropriate to use and why. Assume that for each case study the type of information to be searched is limited to the examples given.
1. Client 1 is the owner of a technology warehouse called "Nerds R Us". She wants to create an information retrieval system for her products. She wants to be able to search by a particular product id, for example: "IPhone 7", "Windows 10", "Samsung-S6".
2. Client 2 is the office manager at a legal practice. He wants to create an information retrieval system for office email. In particular, he wants to be able to search by email address and then retrieve all emails sent or received by that person.
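
The three indexing steps assumed in Questions 1 and 2 (case folding, removal of punctuation and non-alphanumeric characters, tokenisation on spaces) and the dictionary/postings layout of Figure 3 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a prescribed implementation: the function names and the toy collection are hypothetical, and "alphanumeric" is assumed to mean the ASCII letters a-z and digits 0-9 only. It does not perform stopword removal or stemming.

```python
import re
from collections import Counter, defaultdict


def tokenise(text):
    """Apply the three indexing steps in order (ASCII-only assumption)."""
    lowered = text.lower()                        # 1. fold to lower case
    cleaned = re.sub(r"[^a-z0-9 ]", "", lowered)  # 2. drop punctuation / non-alphanumerics
    return cleaned.split()                        # 3. split on spaces, ignoring empty tokens


def build_index(collection):
    """Map each term to {doc_id: occurrences of the term in that document}."""
    postings = defaultdict(dict)
    for doc_id, text in collection.items():
        for term, count in Counter(tokenise(text)).items():
            postings[term][doc_id] = count
    return postings


def format_rows(postings):
    """Return (dictionary, posting) string pairs in the Figure 3 layout."""
    rows = []
    for term in sorted(postings):                 # sorted alphabetically by term
        entries = postings[term]
        total = sum(entries.values())             # occurrences in the whole collection
        posting = "".join(
            f"{doc_id}:{count};" for doc_id, count in sorted(entries.items())
        )
        rows.append((f"{term}:{total}", posting))
    return rows


# Hypothetical toy collection, unrelated to the assignment data.
toy_collection = {
    4: "Hello, hello world!",
    6: "Hello there",
    7: "World -- world.",
}

rows = format_rows(build_index(toy_collection))
for dictionary_entry, posting in rows:
    print(f"{dictionary_entry:<10} {posting}")
```

Because punctuation is removed before splitting, a token such as "Samsung-S6" would become the single term "samsungs6" under these assumptions; a pipeline that splits before stripping punctuation would give the same result here, but the order matters once characters other than spaces are treated as delimiters.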