Assignment title: Information
MOD004553 ARTIFICIAL INTELLIGENCE 2016 - 2017
AUTOMATIC TEXT SUMMARIZATION
COURSEWORK ASSIGNMENT
The ability to summarise text is a valued skill, both for personal use and as a corporate
activity. It is particularly difficult for an individual to summarise text where the information
sources are large. Not surprisingly, automatic text summarization is currently a very topical
area of research in AI and information technology, with applications as diverse as news
summarization apps for mobile phones [i], search engine returns [ii] and the summarization of
game logs [iii].
There are two groups of automatic summarization methods:
a) Extractive methods - selecting important sentences, paragraphs etc. from the original
document and concatenating them into a shorter form. The importance of a sentence is
decided from its statistical and linguistic features.
b) Abstractive methods - understanding the original text and re-telling it in fewer words.
Your assignment is to create an automatic text summarization tool based on an Extractive
method.
Program requirements
All students should have a background in C# console programming. In fairness to all, your
implementation must conform to the following two requirements:
1. Language must be C#
2. Application must be console based (a GUI interface would be an unnecessary
distraction to the AI aspect of this work).
Code Development
Your program should do the following:
1. Prompt the user to enter an input filename (e.g. inFile.txt) and a percentage
summarization factor, SF (where SF = summarizedWordLength x 100 / inputWordLength).
2. Read the text from inFile.txt and stopwords.txt, and process the text from inFile.txt
accordingly.
3. Output the summary text to a file (e.g. outFile.txt) and display some appropriate
statistics to the console.
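These three steps might be skeletoned as follows. This is only a sketch: the word budget
helper follows the SF definition above, but a temporary file stands in for inFile.txt so
the fragment runs on its own (in the real program the filename and SF would come from
Console.ReadLine() prompts, as indicated in the comments).

```csharp
using System;
using System.IO;

// From the SF definition (SF = summarizedWordLength x 100 / inputWordLength),
// the summary's word budget for a given SF is:
int TargetWordCount(int inputWordLength, int sf) => inputWordLength * sf / 100;

// In the full program these two values come from Console.ReadLine() prompts, e.g.
//   Console.Write("Input filename: ");  string inFile = Console.ReadLine();
//   Console.Write("SF (%): ");          int sf = int.Parse(Console.ReadLine());
string inFile = Path.GetTempFileName();   // stands in for inFile.txt in this sketch
File.WriteAllText(inFile, "One two three. Four five six. Seven eight nine ten.");
int sf = 30;

string text = File.ReadAllText(inFile);
int inputWords = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
Console.WriteLine($"Input words: {inputWords}, word limit at SF={sf}%: {TargetWordCount(inputWords, sf)}");
File.Delete(inFile);
```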
There are many extractive methods one could apply in Step 2; here is a common generic
algorithm, in pseudocode, which also ensures the summarised text does not exceed the SF
constraint:
Count the occurrence of different words in the document, copying the words into a list
which is ordered by frequency with the most common word at the top of the list.
Copy each sentence in the document into a second list.
Remove (filter) from the word frequency list those words which are very common and
of little use in classifying the document. These are called 'stop words' and include
words such as the, is, at, which, and on. There is no definitive listing of what
defines a stop word, as it can be document specific; however, the file 'stopwords.txt'
is provided containing a listing of generic common stop words (taken from:
http://www.lextek.com/manuals/onix/stopwords1.html ). Its use is optional, and you
may edit it, use an alternative list of stop words, or not use a stop word list at all.
Longer and shorter alternative lists are given at: http://www.ranks.nl/stopwords.
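The filtering step might look like this. A small in-line list stands in for the contents of
stopwords.txt (in the program it would be loaded with File.ReadAllLines("stopwords.txt")),
and a HashSet with a case-insensitive comparer gives fast membership tests:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the contents of stopwords.txt; in the real program use
// File.ReadAllLines("stopwords.txt") instead of this literal array.
var stopWords = new HashSet<string>(
    new[] { "the", "is", "at", "which", "on" },
    StringComparer.OrdinalIgnoreCase);          // "The" and "the" both match

string[] words = { "The", "cat", "is", "on", "the", "mat" };
var contentWords = words.Where(w => !stopWords.Contains(w)).ToList();
Console.WriteLine(string.Join(" ", contentWords));   // only the content words remain
```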
Repeat the following until SF is exceeded:
{
For each sentence, count the number of words that match the top word
(most frequent) in the filtered word list.
Find the sentence that has the highest number of occurrences of the most
frequent word.
If the word length of the summary text plus the word length of the
currently selected sentence exceeds the summary word-length limit, the
sentence is ignored.
Else add the sentence to the summary text. Remove the word from the top of
the frequency word list and remove the sentence from the list of sentences.
}
Output the summary text to file and display appropriate statistics, including the actual SF achieved.
Some issues to consider:
1. For testing purposes your input text files can be from any source, but ensure they are
copied/converted to plain text files (.txt extension) for input, purely to simplify the file-reading
process.
2. How are you going to implement the word list and the list of sentences? More than
one kind of structure/list will be needed (think arrays, or lists, and/or variations of
these).
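One plausible pairing of structures (a design choice, not a requirement) is a Dictionary
for the word counts plus a List holding the sentences in document order, with LINQ used to
derive the ordered frequency list the algorithm needs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Dictionary for word counts; List keeps the sentences in document order.
var wordFreq = new Dictionary<string, int>();
var sentences = new List<string> { "The cat sat.", "The cat slept." };

foreach (string w in new[] { "cat", "sat", "cat", "slept" })
    wordFreq[w] = wordFreq.TryGetValue(w, out int n) ? n + 1 : 1;

// The ordered word-frequency list, most frequent word first.
var ordered = wordFreq.OrderByDescending(kv => kv.Value)
                      .Select(kv => kv.Key)
                      .ToList();
Console.WriteLine($"Top word: {ordered[0]} ({wordFreq[ordered[0]]} occurrences)");
```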
3. Whilst your summarization tool should work with text of any length, the more
sentences the input file has, the greater the likelihood that the summarization can be clearly
demonstrated.
4. Your program will potentially have to read in a large number of words from a file
whose length the program does not know in advance. Whilst you could use an array and
impose a limit on the number of words and/or sentences, it would be better to create
storage as the data is read, allowing data of any length to be stored (this is also more
efficient).
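A List&lt;T&gt; gives exactly this behaviour: it grows as items arrive, so no word limit need be
imposed up front. In this sketch a StringReader stands in for the StreamReader the program
would open over the input file:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// The List grows automatically; a StringReader stands in for the input file.
var words = new List<string>();
using (var reader = new StringReader("an input file\nof unknown length"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        words.AddRange(line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
}
Console.WriteLine($"{words.Count} words stored");
```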
5. Many extraction methods rely on ranking the sentences in the source text. The given
algorithm does this by counting the occurrences of the text's most frequent words in each
sentence. However, this may not necessarily be the best way of ranking the sentences. For
example, the given algorithm will generate a summary text whose sentences are arranged in
order of their importance, not necessarily in their original order or in a meaningful order.
Spurious results will occur if most sentences contain the same number of the most frequent
words. You are encouraged to explore whether there are alternative ranking techniques or
additional statistical processing that might be applied to generate a more meaningful output.
However, the subject of natural language processing (NLP) is complicated and you are NOT
expected to generate a result that matches the capabilities of the extensive research
community in this field. Do not worry if the summary text generated by your solution is not
always what you consider to be a 'sensible' result; the assessment focus is on the
implementation of the technique rather than the overall success of the implementation
itself.
6. Implement the assignment in stages, e.g. get a program/method to read and store the
stop words; implement reading and storing the sentences; etc. Back up working solution
states regularly, and do not try to extend the program without debugging each stage
of improvement, as otherwise the program could become impossible to debug.
Assessment [1]
The automatic text summarization assignment is component 010 of the assessment and
contributes 100% of the overall module mark.
Submit your work to the iCentre or equivalent as a short report by Friday 16th December 2016
(end of teaching week 12). Ensure the removable storage medium is adequately secured
within or to your report, which should contain:
1. Source and compiled code on a CD/DVD or USB stick. This media should also
include a copy of all your submitted work for this assessment to comply with the
University requirement for submission of all assessed coursework in electronic form.
2. A hard-copy print out of your source code.
3. Documentation to include a 600 word evaluation of the summarization tool (for
example, design of any classes and/or data structure(s), strengths and limitations of the
implementation). You should also include instructions for running the program in an
appendix.
Your final mark will be an amalgamation of four equally weighted components. See the
separate mark scheme/mark sheet for details on how this assignment will be marked. Note
that even if your final program is not fully functional (or even will not compile) you can still
get credit for the program design, coding and evaluation. Feedback will be available in three
ways: a) via a personalised completed mark sheet attached to each assignment on collection
of the coursework; b) via an email sent to each student containing personalised content
summarising the results; c) via e-vision, accessible by students, listing percentage, grade and
outcome (subject to assessment board rules).
Reassessment
If you are unfortunate enough to require a coursework reassessment, you should attempt this
coursework assignment again. Be aware that no one solution will define a single definitive
'answer' to this assignment. You are encouraged to explore a different approach to achieving
a solution to this assignment compared to the first attempt.
i Radev, D. R. et al., 2001. NewsInEssence: A System for Domain-Independent, Real-Time News Clustering and
Multi-Document Summarization. In: Proceedings of the Human Language Technology Conference (HLT 2001), 4pp.
San Diego, California.
ii Turpin, A., 2007. Fast generation of result snippets in web search. In: Proceedings of the 30th annual
international ACM SIGIR conference on Research and development in information retrieval (SIGIR '07), pp. 127-134.
Amsterdam, The Netherlands.
iii Cheong, Y. et al., 2008. Automatically Generating Summary Visualizations from Game Logs. In: Proceedings
of the 4th Artificial Intelligence and Interactive Digital Entertainment Conference (AIIDE '08), pp. 167-172. Palo Alto,
California.
1 Detail may vary slightly to accommodate local delivery

MOD004553 Coursework feedback sheet 2016-17 Sem1

The feedback sheet records one grade per component, a comment, and the overall result, as follows:

Student: SID number.

Functionality (grade): Does the program execute, proceed to completion, generate a run-time
error or crash? Is the interface understandable? Successful file reading/output? Meaningful
summary?

Program design, excluding enhancements (grade): Choice of user-written classes and their
methods, including file processing, word list and sentence construction, pattern matching and
output display.

Extraction algorithm (grade): To what extent does the code implement the generic extraction
algorithm? Has the algorithm been enhanced? If so, how? Degree of enhancement complexity.

Evaluation, code clarity and presentation (grade): Assessment of achievements and
limitations. Report layout and structure. Any items missing? Physical presentation. Clarity of
code layout and in-line comments.

Comment: Personalized comments specific to the student submission.

Overall 010 mark % and overall module grade: Each component is assigned a grade (F, D, C,
B, A) and each maps to a mark (0-100), from which the average is taken to give the final mark
and final grade. F (<40); D (40-49); C (50-59); B (60-69); A (70+).