Assignment title: Information

CMPT 454 Assignment 2 – Indices This assignment is worth 6% of your final grade. Part 1 A primary B+ tree file where the file's records are the leaf nodes of the tree has 6 levels (including both the root and the leaves). 5 main memory frames have been allocated for the use of the B+ tree, all of which are initially empty. Interior nodes contain more than 12 search key values and leaf nodes contain more than 12 records. The B+ tree is used twelve times in sequence to retrieve twelve different single records (i.e. selections on equality, with each search returning exactly one record), with each query being made after the previous one has been evaluated. [1 mark each] a. If the least recently used replacement policy (LRU) is used to replace buffer frames, how many disk reads are required if all twelve records reside on the same block? – 72 disk reads b. If LRU is used, how many disk reads are required if all twelve records reside on different blocks, and if the search keys for all twelve records are on different interior nodes of each level of the tree (except the root)? – 72 disk reads c. If the most recently used replacement policy (MRU) is used to replace buffer frames, how many disk reads are required if all twelve records reside on the same block? – 17 disk reads d. If MRU is used, how many disk reads are required if all twelve records reside on different blocks, and if the search keys for all twelve records are on different interior nodes of each level of the tree (except the root)? – 71 disk reads (72 is acceptable) https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com Part 2 a. Write the search algorithm in pseudocode for a B+ tree that does not allow duplicates. – review and judge for accuracy [good = 3, 2 = adequate, 1 = seriously incomplete] b. Write the insertion algorithm in pseudocode for a B+ tree that does allow duplicates. The algorithm should follow the process for allowing duplicates in the tree as described in class (the method that may require null entries for search key values). – review and judge for accuracy [good = 3, 2 = adequate, 1 = seriously incomplete] c. The B+ tree insertion process may result in parent nodes having values inserted in them or in being split. In the absence of parent pointers in nodes briefly describe how this might be implemented. – there could be multiple different answers, one is to keep track of parent nodes in a data structure, but the most obvious explanation is to use recursion [2 marks for an adequate explanation, 1 mark for an explanation that is incomplete] Part 3 a. Sketch the extensible hash index, and its directory, that results from the following insertions, where two (2) data entries can fit in a disk page. Assume that initially the directory has four entries and uses the last two bits of the hash value. The initial state of the index, and the global and local depths, are shown below, where the values shown are the hash values (not the search key values). You should show the index and its directory just before each insertion that causes the directory to double and after all insertions. global depth = 2, all buckets have a local depth of 2 00-----> 0010 0000, 1101 0100 01-----> 1111 0101, 1001 0001 10-----> 0000 0110 11-----> 0110 1111 Search keys with the following five hash values are to be inserted: https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com 1010 1011 0110 0000 1101 0111 1110 1101 0101 1111 directory bucket ld 00 0010 0000, 1101 0100 2 01 1111 0101, 1001 0001 2 10 0000 0110 2 11 0110 1111, 1010 1011 2 [1 mark] Insertion of 0110 0000 causes a bucket to overflow and the directory to split directory bucket ld 000 0010 0000, 0110 0000 3 001 1001 0001 3 010 0000 0110 2 011 1010 1011 3 100 1101 0100 3 101 1111 0101, 1110 1101 3 110(010) 111 0110 1111, 1101 0111 3 [1 mark] Insertion of 0101 1111 causes a bucket to overflow and the directory to split (again) directory bucket ld 0000 0010 0000, 0110 0000 3 0001 1001 0001 3 0010 0000 0110 2 0011 1010 1011 3 0100 1101 0100 3 0101 1111 0101, 1110 1101 3 0110(0010) 0111 1101 0111 4 1000(0000) 1001(0001) 1010(0010) 1011(0011) 1100(0100) 1101(0101) 1110(0010) 1111 0110 1111, 0101 1111 4 https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com [3 marks, if the directory pointers are not shown somehow -1 – in my version I`ve noted which buckets the unused pointers point to] b. Sketch the linear hash index that results from the initial state and insertions given in (a) above, assuming that index pages cannot be greater than 80% full. You should show the index and its directory just before each insertion that adds a bucket to the index and after all insertions. Search keys with the following five hash values are to be inserted: 1010 1011 0110 0000 1101 0111 1110 1101 0101 1111 index bucket occupancy = 6/8 = 0.75 00 0010 0000, 1101 0100 01 1111 0101, 1001 0001 10 0000 0110 11 0110 1111 [1 mark] Insertion of 1010 1011 results in 87.5% occupancy so we need another bucket index bucket occupancy = 8/10 = 0.8 000 0010 0000, 0110 0000 001 1111 0101, 1001 0001 010 0000 0110 011 0110 1111, 1010 1011 100 1101 0100 [1 mark] Insertion of 1101 0111 results in 90% occupancy so we need another bucket index bucket occupancy = 9/12 = 0.75 000 0010 0000, 0110 0000 001 1001 0001 010 0000 0110 https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com 011 0110 1111, 1010 1011 1101 0111 100 1101 0100 101 1111 0101 [1 mark] Insertion of 1110 1101 results in 83.3% occupancy so we need another bucket index bucket occupancy = 11/14 = 0.786 000 0010 0000, 0110 0000 001 1001 0001 010 011 0110 1111, 1010 1011 1101 0111, 0101 1111 100 1101 0100 101 1111 0101, 1110 1101 110 0000 0110 [2 marks] Part 4 A DB table has the schema: Customer = {id, fname, lname, age, income, city, street, number} The fields have the following domains:  age, number - 4 byte integers  id - 8 byte integer  income - 8 byte real  fname, lname, city, street - 20 byte character strings The table contains 100,000 records, and (for the sake of simplicity) no additional space is required to represent record headers. The following file organizations are to be considered for this file:  heap file - unsorted, pages have full occupancy  sorted file - sorted on {lname, fname} pages are in sequence and have full occupancy  primary B+ tree file – sorted on {city, street, number}, note that the leaves contain records (so only require pointers to adjacent leaf nodes), pages (leaf and interior) have 2/3 occupancy https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com The following additional information is relevant  Size of a record ID, or pointer to a block - 8 bytes  The file is to be stored on a single disk, with ten 2,048 byte blocks per track  No single record is stored on more than one block The average daily use of the file is expected to be:  Search with equality selection on {lname, fname}, note that over 99% of the time both attributes are given, as customers tend to know their own names - 10,000  Search with equality selection on {id} - 1,000  Search with equality selection on {city, street, number} - 1,000  Search with equality selection on {city, street} - 5,000  Search with equality selection on {city}, returning, on average, 0.5% of the records - 1,000  Range selections on {income}, returning, on average, 10% of the records - 20  Range selections on {age}, returning, on average, 1% of the records - 500  Insertions - 1,200  Deletions, given all data except id - 500  Deletions, given only id - 300 Answer the following questions about the file: a. How many records fit on a single disk block? – record size = 8 + 20 + 20 + 4 + 8 + 20 + 20 + 4 = 104, 2,048 / 104 = 19.6923…, so 19 per page [1 mark] b. How large in blocks is the file, assuming that it is stored as the heap file? – 100,000 / 19 = 5,264 [1 mark] c. How large in blocks is the file, assuming that it is stored as the sorted file? – 100,000 / 19 = 5,264 [1 mark] d. How large in blocks is the file, assuming that it is stored as the B+ tree file (do not count the interior nodes or root of the tree)? – 19 * 2/3 = 13 per page, 100,000 / 13 = 7,693 – or 12 per page, 100,000 / 12 = 8,334 for the leaf nodes (values close to this are acceptable) [1 mark] https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com e. Assume that a secondary B+ index on id is to be created, how many leaf pages does the index have? Assume that leaf pages have 2/3 occupancy. – a data entry is 8 + 8 = 16 bytes, 2,048 / 16 = 128, but we need two pointers for the pointer chains, so 127 * 2/3 = 84 per page, and 100,000 / 84 = 1,190 (values close to this are acceptable) [1 mark] f. For the index described in (e), how many data entries can be stored in a non-leaf node? – a non-leaf page has an extra pointer, so 127 data entries [1 mark] g. For the index described in (e), what is the height of the tree, assuming that non-leaf nodes also have 2/3 occupancy? State the height of the tree including both root and leaf levels. – fan-out is 84, root fans out to 84 second level nodes, each of which can fan-out to 84 third level nodes, 842 = 7,056 (!), so there 3 levels (root, nonleaf, and leaf) [1 mark] h. Assume that a secondary hash index on id is to be created, with hash buckets having 80% occupancy on average. How many buckets does the index contain? – again, a data entry requires 16 bytes, 128 * .8 = 102, 100,000 / 102 =981 (values close to this are acceptable) [1 mark] i. For the index described in (h) above, if the index is an extensible hash index, how large (in blocks) is the index? – the index is an array of 8 byte pointers, it must be a power of 2 so assume that it has 1,024 entries (assuming minimal skew), so requires 4 blocks [1 mark] j. Which of the three file organizations provides the best support for the daily use of the file? Justify your answer. – an unfortunately worded question, the best answer is the B+ tree file, since it supports equality selection, on any prefix of {city, street and number}, but only if it is supported by an index on {lname, fname}, so I would also accept the file sorted on {lname, fname} as an answer [2 marks – one for either organization, and 1 for a reasonable justification] k. Describe which secondary indices you would create for the file where the file is organized as noted in your answer to (j). Your goal is to make as few disk read and write https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com operations as possible. Your answer should include the search key and type (B+ tree or hash) of each index structure that you would create. – either (1) sorted file on {lname, fname}, with a B+ tree index on {city, street, number}, a hash index on id and a B+ tree index on age or (2) the B+ tree file with its primary index on {city, street, number}, a B+ tree index on {lname, fname}, a hash index on id and a B+ tree index on age [3 marks, 1 mark for each index -1 for indexes on income] l. For your chosen file and index organization, calculate the cost, in disk reads and writes, of each of the operations noted above (for one operation only, you don't have to multiply by the number of operations per day) – see the table below, note that only the column showing the cost of a single operation for the chosen organization is required [0.5 mark each, accept values close to those shown below] operation n (1) (1) total (1) metho d (2) (2) total search – fname, lname 10,00 0 13 130,000 binary search 8 80,000 search – id 1,000 3 3,000 index 3 3,000 search – city, street, numb er 1,000 4 4,000 index 3 3,000 search – city, street 5,000 40 200,000 index, 10 record s 4 20,000 search – city 1,000 2,00 0 2,000,00 0 index, 500 record s 44 44,000 search – income 20 5,26 4 105,280 8,33 4 166,680 search – age 500 2,00 0 1,000,00 0 index, 1,000 record 2,00 0 1,000,00 0 https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com s insertion 1,200 2,65 3 3,183,60 0 search , move up 3 3,600 deletions (no id) 500 21 10,500 search 3 1,500 deletions (id) 300 8 2,400 search , index 3 900 6,638,78 0 1,322,68 0 Assessment The assignment is out of 419. Marks are assigned as follows:  Question 1 – 4  Question 2 – 8  Question 3 – 10  Question 4 – 19 Submission You should submit your assignment online to the CoursSys submission server. You must submit a single .zip (or other archive) file that contains your solution, please read the documentation on site for further information. The assignment is due by 11:59pm on Monday the 17th of February.  CMPT 454 Home John Edgar ([email protected]) https://www.coursehero.com/file/10763127/CMP-454-Assignment2solution1/ This study resource was shared via CourseHero.com Powered by TCPDF (www.tcpdf.org)