MOD004553 ARTIFICIAL INTELLIGENCE 2016-2017
AUTOMATIC TEXT SUMMARIZATION COURSEWORK ASSIGNMENT

The ability to summarise text is a skilled requirement, both for personal use and as a corporate activity. It is particularly difficult for an individual to summarize text where the information sources are large. Not surprisingly, automatic text summarization is currently a very topical area of research in AI and information technology, with applications as diverse as news summarization apps for mobile phones [i], search engine returns [ii] and the summarization of game logs [iii].

There are two groups of automatic summarization methods:

a) Extractive methods - selecting important sentences, paragraphs etc. from the original document and concatenating them into a shorter form. The importance of sentences is decided based on statistical and linguistic features of the sentences.

b) Abstractive methods - understanding the original text and re-telling it in fewer words.

Your assignment is to create an automatic text summarization tool based on an extractive method.

Program requirements

All students should have a background in C# console programming. In fairness to all, your implementation must conform to the following two requirements:

1. The language must be C#.
2. The application must be console based (a GUI interface would be an unnecessary distraction from the AI aspect of this work).

Code Development

Your program should do the following:

1. Prompt the user to enter an input filename (e.g. inFile.txt) and a percent summarization factor, SF (where SF = summarizedWordLength x 100 / inputWordLength).
2. Read the text from inFile.txt and stopwords.txt, and process the text from inFile.txt accordingly.
3. Output the summary text to a file (e.g. outFile.txt) and display some appropriate statistics to the console.
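As a rough sketch of how steps 1-3 might hang together, consider the skeleton below. The prompts, file names and helper names (WordCount, ActualSf) are illustrative only, not prescribed by the brief; the summarization itself is left as a placeholder.

```csharp
using System;
using System.IO;

class Summarizer
{
    // Count whitespace-separated words in a piece of text.
    public static int WordCount(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;

    // Actual SF achieved = summarizedWordLength x 100 / inputWordLength.
    public static double ActualSf(int summaryWords, int inputWords) =>
        summaryWords * 100.0 / inputWords;

    static void Main()
    {
        Console.Write("Input filename (e.g. inFile.txt): ");
        string inFile = Console.ReadLine();
        Console.Write("Summarization factor SF (percent): ");
        double sf = double.Parse(Console.ReadLine());

        string text = File.ReadAllText(inFile);
        string[] stopWords = File.ReadAllLines("stopwords.txt");

        string summary = text; // TODO: replace with the extraction algorithm

        File.WriteAllText("outFile.txt", summary);
        Console.WriteLine($"Input words:   {WordCount(text)}");
        Console.WriteLine($"Summary words: {WordCount(summary)}");
        Console.WriteLine($"Actual SF:     {ActualSf(WordCount(summary), WordCount(text)):F1}%");
    }
}
```

Separating small helpers such as WordCount from Main makes each stage testable on its own, which matches the staged development approach recommended later in this brief.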
There are many extractive methods one can take in Step 2; here is a common generic algorithm, in pseudocode, which also ensures the summarised text does not exceed the SF constraint:

- Count the occurrence of different words in the document, copying the words into a list ordered by frequency, with the most common word at the top.
- Copy each sentence in the document into a second list.
- Remove (filter) from the word-frequency list those words which are very common and of little use in classifying the document. These are called 'stop words' and include words such as the, is, at, which and on. There is no definitive listing of what defines a stop word, as it could be document specific; however, the file 'stopwords.txt' is provided, containing a listing of generic common stop words (taken from: http://www.lextek.com/manuals/onix/stopwords1.html ). Its use is optional: you may edit it, use an alternative list of stop words, or not use a stop word list at all. Longer and shorter alternative lists are given at: http://www.ranks.nl/stopwords.
- Repeat the following until the SF limit is reached:
  {
  - For each sentence, count the number of words that match the top (most frequent) word in the filtered word list.
  - Find the sentence that has the highest number of occurrences of the most frequent word.
  - If the word length of the summary text plus the word length of the currently selected sentence exceeds the summary word-length limit, the sentence is ignored; else add the sentence to the summary text.
  - Remove the word from the top of the frequency word list and the sentence from the list of sentences.
  }
- Output the summary text to file and display appropriate statistics, including the actual SF.

Some issues to consider:

1. For testing purposes your input text files can be from any source, but ensure they are copied/converted to plain text files (.txt extension) for input, purely to simplify the file reading process.
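The extraction loop above might be sketched as follows. This is only an illustration under simplifying assumptions that the brief does not prescribe: sentences are split naively on full stops, words are lower-cased, and only sentences actually added to the summary are removed from the sentence list (the pseudocode leaves that choice open).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Extractor
{
    // Lower-case the text and split it into words on whitespace and punctuation.
    static string[] Words(string s) =>
        s.ToLowerInvariant()
         .Split(new[] { ' ', '\t', '\n', '\r', ',', '.', ';', ':', '!', '?' },
                StringSplitOptions.RemoveEmptyEntries);

    public static string Summarize(string text, string[] stopWords, double sfPercent)
    {
        // Naive sentence split on '.' — good enough for illustration only.
        var sentences = text.Split('.')
                            .Select(s => s.Trim())
                            .Where(s => s.Length > 0)
                            .ToList();

        int wordLimit = (int)(Words(text).Length * sfPercent / 100.0);

        // Word-frequency list, most frequent first, stop words filtered out.
        var stops = new HashSet<string>(stopWords.Select(w => w.ToLowerInvariant()));
        var freqList = Words(text)
            .Where(w => !stops.Contains(w))
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
            .ToList();

        var summary = new List<string>();
        int summaryWords = 0;

        while (freqList.Count > 0 && sentences.Count > 0)
        {
            string top = freqList[0];
            // Sentence with the most occurrences of the current top word.
            string best = sentences
                .OrderByDescending(s => Words(s).Count(w => w == top))
                .First();

            int bestLen = Words(best).Length;
            if (summaryWords + bestLen <= wordLimit)
            {
                summary.Add(best);
                summaryWords += bestLen;
                sentences.Remove(best);   // each sentence is used at most once
            }
            freqList.RemoveAt(0);         // move on to the next most frequent word
        }
        return string.Join(". ", summary) + (summary.Count > 0 ? "." : "");
    }
}
```

Note that splitting on '.' mishandles abbreviations such as "e.g.", and ties in word frequency are broken by order of first occurrence; both are places where a real implementation could do better.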
2. How are you going to implement the word list and the list of sentences? More than one kind of structure/list will be needed (think arrays, or lists, and/or variations of these).

3. Whilst your summarization tool should work with text of any length, the more sentences the input file has, the greater the likelihood that the summarization can be clearly demonstrated.

4. Your program will potentially have to read in a large number of words from a file whose length the program does not initially know. Whilst you could use an array and impose a limit on the number of words and/or sentences, it would be better to create storage as the data is read, allowing data of any length to be stored (this is also more efficient).

5. Many extraction methods rely on ranking the sentences in the source text. The given algorithm does this by counting the occurrence of the text's most frequent words in each sentence. However, this may not necessarily be the best way of ranking the sentences. For example, the given algorithm will generate a summary text whose sentences are arranged in order of their importance, not necessarily in their original order or in a meaningful order. Spurious results will occur if most sentences contain the same number of most frequent words. You are encouraged to explore whether there are alternative ranking techniques or additional statistical processing that might be applied to generate a more meaningful output. However, the subject of natural language processing (NLP) is complicated, and you are NOT expected to generate a result that is equivalent to the capabilities of the extensive research community in this field. Do not worry if the summary text generated by your solution is not always what you consider to be a 'sensible' result; the assessment focus is on the implementation of technique rather than the collective success of the implementation itself.

6. Implement the assignment in stages, e.g.
get a program/method to read and store the stop words; then implement reading and storing the sentences; etc. Back up working solution states regularly, and do not try to extend the program without debugging each stage of improvement, as otherwise the program could become impossible to debug.

Assessment [1]

The automatic text summarization assignment is component 010 of the assessment and contributes 100% of the overall module mark. Submit your work to the iCentre or equivalent as a short report by Friday 16th December 2016 (end of teaching week 12). Ensure the removable storage medium is adequately secured within or to your report, which should contain:

1. Source and compiled code on a CD/DVD or USB stick. This media should also include a copy of all your submitted work for this assessment, to comply with the University requirement for submission of all assessed coursework in electronic form.
2. A hard-copy printout of your source code.
3. Documentation, to include a 600-word evaluation of the summarization tool (for example, the design of any classes and/or data structures, and the strengths and limitations of the implementation). You should also include instructions for running the program in an appendix.

Your final mark will be an amalgamation of four equally weighted components. See the separate mark scheme/mark sheet for details on how this assignment will be marked. Note that even if your final program is not fully functional (or even will not compile), you can still get credit for the program design, coding and evaluation.

Feedback will be available in three ways:

a) via a personalised completed mark sheet attached to each assignment, on collection of the coursework;
b) via an email sent to each student, containing personalised content summarising the results;
c) via e-vision, accessible by students, listing percentage, grade and outcome (subject to assessment board rules).
Reassessment

If you are unfortunate enough to require a coursework reassessment, you should attempt this coursework assignment again. Be aware that no one solution will define a single definitive 'answer' to this assignment. You are encouraged to explore a different approach to achieving a solution compared to your first attempt.

[i] Radev, D. R. et al. 2001. NewsInEssence: A System for Domain-Independent, Real-Time News Clustering and Multi-Document Summarization. In: Proceedings of the Human Language Technology Conference (HLT 2001), 4pp. San Diego, California.
[ii] Turpin, A. 2007. Fast generation of result snippets in web search. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), pp. 127-134. Amsterdam, The Netherlands.
[iii] Cheong, Y. et al. 2008. Automatically Generating Summary Visualizations from Game Logs. In: Proceedings of the 4th Artificial Intelligence and Interactive Digital Entertainment Conference (AIIDE '08), pp. 167-172. Palo Alto, California.
[1] Detail may vary slightly to accommodate local delivery.

MOD004553 Coursework feedback sheet 2016-17 Sem1

Columns: Student | Functionality grade | Program design (excluding enhancements) grade | Extraction algorithm grade | Evaluation, code clarity and presentation grade | Comment | Overall 010 mark % | Overall module grade

Assessment criteria:

1. Functionality - Does the program execute, proceed to completion, generate a runtime error or crash? Is the interface understandable? Successful file reading/output? Meaningful summary?
2. Program design (excluding enhancements) - Choice of user-written classes and their methods, including: file processing, word list and sentence construction, pattern matching, output display.
3. Extraction algorithm - To what extent does the code implement the generic extraction algorithm? Has the algorithm been enhanced? If so, how? Degree of enhancement complexity.
4. Evaluation, code clarity and presentation - Assessment of achievements and limitations. Report layout and structure. Any items missing? Physical presentation. Clarity of code layout and in-line comments.
Comment - Personalised comments specific to the student submission.

Each component is assigned a grade (F, D, C, B, A) and each maps to a mark (0-100), from which the average is taken: F (<40); D (40-49); C (50-59); B (60-69); A (70+).

SID Number | Grade | Grade | Grade | Grade | Comment | Final Mark | Final Grade