Assignment title: Information


IFB104 Building IT Systems
Semester 1, 2016

Assignment 2, Part A: The Top Ten of Everything
(15%, due 11:59pm Thursday, May 26th)

Overview

This is the first part of a two-part assignment. This part is worth 15% of your final grade for IFB104. Part B will be worth a further 10%. Part B is intended as a last-minute extension to the assignment, thereby testing the maintainability of your solution to Part A and your ability to work under time pressure. The instructions for completing Part B will not be released until Week 12. Whether or not you complete Part B you will submit only one file, and receive only one mark, for the whole 25% assignment.

Motivation

People love creating rankings, comparative lists and league tables, and numerous examples can be found online. Here you will use your new-found skills in developing Graphical User Interfaces, accessing Web documents and searching for patterns in text to create an interactive tool that allows its user to view the latest "top ten" lists in a variety of categories. Most importantly, your application will get all of its data and images directly from online sources. To do this you will need to develop (a) "back end" functions which fetch Web documents and extract relevant text and images from them, and (b) a "front end" Graphical User Interface that makes it easy for the user to view such data.

To complete this part of the assignment you will need to use Python, the Tkinter module, regular expressions and (probably) the Python Imaging Library. We have not used the last of these in IFB104 before, so you may need to install the necessary software first (see below).

Illustrative Example

Taking our cue from the popular book series The Top Ten of Everything, our goal is to create a program which displays the current top ten lists in (at least) three different categories.
The user can select a category to display, and the program must respond by downloading and displaying the current top ten items in this category from the Internet. To make the program visually interesting it must also display images related to the chosen categories.

For the purposes of this assignment you have a free choice of what kind of "top ten" lists to display; however, they should all be lists that are updated online on a regular basis. Possible categories of top-ten list that you may consider include:

• Sports rankings
• Online gaming leader boards
• TV and movie ratings
• Financial data such as Stock Market prices
• Popularity contests
• … or any other ordered/sorted list of items.

Whichever categories you choose, they must be lists which are updated on a regular basis and for which information can be found readily on the Internet. Your task is to extract this data from online sources and display it in an easily-interpreted format.

Your program must have the following features.

• When first started it must display an introductory "splash" page and provide an interactive mechanism via which the user can select which top-ten categories to view. Selection could be done using push buttons, menus, or any other suitable widget. At least three clearly distinct categories must be made available.
• When the user selects a category your program must fetch the latest "top ten" items relating to it from online and display them in an easy-to-read format. In addition, your program must display an image relating to this category and the online address (URL) from which the data was obtained.
• The downloaded top-ten data must come from (at least) three distinct Web sites. It is not acceptable to download three lists from the same site. The Web sites chosen must be ones that update their lists of data on a regular basis.
• The various images displayed can be extracted from any Web documents you like.
They do not have to come from the same sites as the top-ten lists themselves.

To illustrate the idea, consider the following "splash" page, waiting for the user to select a category.

Our "splash" page consists of a fixed image with three interactive buttons at the bottom. The instruction "Choose from the Top Ten …" is a static label. The image is the cover for the 2015 edition of the book The Top Ten of Everything. Importantly, this image is not contained in a local file. It is downloaded each time the program starts via the following URL:

http://www.boorooandtiggertoo.com/wp-content/uploads/2014/07/Top-10-ofEverything-2015.jpg

All images and data displayed by your program must be freshly downloaded from online sources. You can submit only one Python file, with no other files, including image files.

When one of the buttons is pushed the corresponding top-ten list, an associated image, and the URL from which the top-ten data was obtained must be displayed. For instance, pushing the "AFL teams" button in our example causes the following new window to pop up.

In this case the "AFL Ladder" image at the top was downloaded from www.krockfootball.com.au, and the list of football teams was extracted from the "Real Footy Ladder" page published by The Age newspaper, at the address shown at the bottom of the "view" above. (The original Age document accessed is shown in the appendix below.)

Importantly, of course, the Australian Rules football ladder changes at least once a week during the winter months. The top-ten list of teams shown above was current when we ran our demonstration program on Sunday, May 1st. Your program must always download the latest data from the Web, and must be resilient to changes in the Web documents it relies on.

Pushing the "conspiracies" button on our splash page (again on May 1st) then produced the following new window.
In this case we have used a list of conspiracy theories sourced from the Ranker site, which has many lists of rankings voted on by the public. The original page from which we extracted our top ten list is shown in the appendix below. Again, this page is updated regularly. The X-Files inspired poster image on the left was downloaded from a different Web site. Also notice that the layout and colours used in this second top-ten list differ markedly from the first. Each of your top-ten lists must have a distinct visual appearance.

Finally, pushing the button for the Top Ten Australian Google searches caused the following window to pop up when we tested our program on May 1st. Again we have given the GUI a distinct appearance for this top-ten list, based around the standard "Google colours".

At the end of this process we have four distinct "views" of our GUI on show, as four different Tkinter windows. (We can close the individual top-ten views and re-open them again by pushing the corresponding button on the splash page.)

In each of these "views" we downloaded the image using a fixed URL address, having previously used a standard Web browser to find the online images we wanted. For the top-ten lists we first downloaded the appropriate Web document and then used a regular expression to extract the parts of the text we needed, based on our prior analysis of the Web document's HTML source code structure. For instance, in simple cases you may find that the items you want to extract appear between tags in the HTML source, so you can easily extract them using a corresponding regular expression and Python's findall function. Generally, however, the patterns to search for are not quite this simple. (None of the ones for our examples above were as straightforward as this, although none of them were very challenging either.)
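As a minimal sketch of the simple case just described, suppose each item of interest appears between a pair of list-item tags. The sample HTML text, the choice of tag and the variable names below are all invented for illustration; they do not come from any of the Web documents used in our demonstration program, and real documents will usually need a more specific pattern.

```python
from re import findall

# A made-up fragment of HTML source code (hypothetical data,
# standing in for the text of a downloaded Web document)
sample_html = '''
<h1>Fruit rankings</h1>
<ol>
<li>Mango</li>
<li>Banana</li>
<li>Cherry</li>
</ol>
'''

# Find all the text appearing between <li> and </li> tags, using a
# non-greedy pattern so that each match stops at the first </li>
items = findall('<li>(.*?)</li>', sample_html)

# Keep (at most) the first ten items found
top_items = items[:10]
print(top_items)
```

Because the pattern describes the mark-up surrounding the items rather than the items themselves, the same code should keep working when the list's contents change, provided the document's structure stays the same.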
A particularly good source of easily-searched Web documents is Rich Site Summary (a.k.a. Really Simple Syndication) Web feeds. RSS documents are written in XML, a markup language closely related to HTML, and are used for publishing information that is updated frequently, in a form that allows it to be used as the source for other Web documents. They usually have a simple standardised format, so we can rely on them always formatting their contents in the same way, making it relatively easy to extract specific elements from the document's source code via pattern matching.

Also note that in some cases the text you extract may contain HTML mark-ups for special fonts or characters, such as "&mdash;" for a dash and "&rsquo;" for a single quote. If so you will need to use Python's string functions to delete or replace these mark-ups before displaying the text in your GUI. No HTML mark-ups should be displayed to the user.

Requirements and marking guide

To complete this task you are required to extend the provided top_ten.py template file with your version of a program possessing equivalent capabilities to those shown above. Your submitted solution will consist of a single Python file, and must satisfy the following criteria. Marks available are as shown.

• Creating four distinct interfaces (4%). Your program must be able to create (at least) four distinct visual interfaces, the opening "splash" view and the three top-ten lists. An appropriate interactive mechanism must be provided to allow the user to choose top-ten lists to view (buttons, menus, etc). Each of the four "views" must have a distinct appearance in terms of layout, text fonts, colours, etc. Each view must have a title describing its purpose. NB: In our demonstration program above we created a separate pop-up window for each "view", but you do not need to follow this example. All four "views" can be displayed in the same Tkinter window if you prefer.
• Displaying four distinct online images (4%).
Your program must be able to download and display (at least) four distinct online images, one in each of the "views". Each of the images must come from a different Web site. The images for the three top-ten lists must be clearly related to the list's subject matter. The "splash" image is a free choice; it could relate to the generic theme of the assignment, as in our example above, or could relate to some consistent theme linking your choice of top-ten lists. The images must be of an appropriate size for the GUI and must not be distorted either horizontally or vertically.
• Displaying three distinct, current top-ten lists (5%). Most importantly, your program must be able to download and display (at least) three distinct top-ten lists, one in each of the corresponding "views". Each of the lists must come from a different Web site and must be "live" data, downloaded at the time the user selected which list to view. The list items must be clearly numbered and must not contain any HTML mark-ups or other extraneous characters when displayed. The address of the Web document from which the list was obtained must be displayed. Your solution must continue to work even when the Web documents you access have been updated. For this reason it is unacceptable to "hardwire" your solution to the particular text and images appearing on the Web site on a particular day. Instead you will need to use regular expressions or some equivalent pattern matching technique to actively search for the relevant text in the document, regardless of any updates that may have occurred since you wrote your program.
• Code quality and presentation (2%). Your program code must be presented in a professional manner. See the coding guidelines in the IFB104 Code Presentation Guide (on Blackboard under Assessment) for suggestions on how to achieve this.
In particular, each significant code segment must be clearly commented to say what it does, e.g., "Create the splash image", "Extract the top ten items from the document's source code", etc.
• Extra feature (10%). Part B of this assignment will require you to make a 'last-minute extension' to your solution. The instructions for Part B will not be released until just before the final deadline for Assignment 2.

You must complete the task using only basic Python features and the modules already imported into the provided template. You may not import any additional modules or files into your program. In particular, you may not import any local image files. All displayed images and "top ten" text must be downloaded from online sources each time the program is run.

Most importantly, you are not required to copy the precise GUI layouts shown in the example above. Instead you are strongly encouraged to be creative and to choose your own top-ten lists and design your own graphical interfaces.

Support tools

To get started on this task you need to download various Web documents of your choice and work out how to extract two things:

• The images you want to display for your opening "splash" page and to illustrate each of your top-ten lists.
• The textual top-ten lists.

Note that each top-ten list and its associated image do not have to come from the same Web document. You also need to allow for the fact that the contents of the Web documents from which you get your top-ten data will be updated occasionally, so you cannot hardwire the locations of the document elements in your solution. The answer to this problem, of course, is to use Python's regular expression function findall to extract the necessary elements, no matter where they appear in the HTML source code.

To help you develop your regular expressions, we have included two small Python programs with these instructions.

1. downloader.py is a small script that downloads and displays the source code of a Web document. Use it to see a copy of your chosen Web document in exactly the form that it will be received by your Python program. (This is helpful because the version of a Web document opened by a Python program may not be the same as one opened by a particular Web browser! This can occur because some Web servers produce different HTML/XML source code for different browser clients. Some Web servers will also block access to Web pages by Python programs in the belief that they are malware!)
2. regex_tester.py is an interactive program introduced in the Week 9 lecture and workshops which makes it easy to experiment with different regular expressions on small text segments. You can use this together with the downloaded text from the Web to help you perfect your regular expressions. (The advantage of using this tool, rather than some of the online tools available, is that it is guaranteed to follow Python's regular expression syntax because it is written in Python itself.)

Image formats and the Python Imaging Library

As part of this assignment you need to download and display some image files, so we have included two utility functions, gif_to_PhotoImage and image_to_PhotoImage, in the Python template accompanying these instructions. Image files come in a wide variety of formats, including GIF, JPG, PNG, BMP, and so on. Python's Tkinter module can display 'photo images' but these must be in GIF format encoded as base-64 character strings, so these functions will help you convert the images into the format needed by Tkinter.

If you have identified a GIF image from a particular URL of the form 'http://….gif' it can be made ready for display in a Tkinter widget as follows. Greek letters have been used for the Python variables below; obviously you should replace these with meaningful variable names. Let α be a string containing the URL.

    # Read the GIF image as a raw byte stream
    β = urlopen(α).read()
    # Create the Tkinter version of the image
    γ = gif_to_PhotoImage(β)

The resulting variable γ can then be used as the image attribute in a Tkinter Label widget or any other such widget that can display images.

You will find, however, that most images on the web are in JPEG, PNG or some other format and therefore cannot be used in Tkinter without conversion. If you are unable to find suitable GIF images you will need to use an additional software package called the Python Imaging Library to allow conversion of the image. The Python Imaging Library is a well-established (stable since 2009) toolkit for manipulating image data in Python programs. It contains a large number of functions for converting images, resizing them, rotating them, changing colours, etc.

• To use it, Microsoft Windows users should download and install PIL version 1.1.7 for Python 2.7. Download the library from http://www.pythonware.com/products/pil/ and select the installation for Python 2.7.
• Alternatively, Mac OS X users should install disk image PIL-1.1.7-py2.7-python.orgmacosx10.6.dmg. You can find a copy at http://www.astro.washington.edu/users/rowen/python/.

We have not used the PIL library previously in this unit. PIL documentation can be found at http://effbot.org/imagingbook/ but you should not need to consult this unless you intend to do complex image processing yourself.

To convert a non-GIF image into Tkinter's 'photo image' format the process is similar to the one above. For instance, if you have the URL of an image in JPEG format, 'http://….jpg', then you can convert it to a format ready for use in a Tkinter widget as follows. Greek letters have been used for the Python variables below; obviously you should replace these with meaningful variable names. Let α be a string containing the URL.

    # Read the JPEG or PNG image as a raw byte stream
    β = urlopen(α).read()
    # Create the Tkinter version of the image
    γ = image_to_PhotoImage(β)

In addition, the image_to_PhotoImage function allows the image to be resized. To do so, simply supply two additional arguments specifying the desired width and height in pixels.

Updating images in widgets

If, as part of your solution, you decide to replace the image in a Label widget you may run across a problem where, after updating the Label's 'image' attribute, the label comes out blank. This occurs if you create the new image within a function because the image is automatically "garbage collected" when the function terminates, i.e., when we leave the function's scope. In other words, an image "local" to a function will no longer exist after the function ends. A simple solution is to make the variable holding the image a global variable, so that it will exist after the function terminates.

Another solution is to entirely "destroy" the existing widget and replace it with a new one containing the updated image. To do this you can use widget.destroy() to remove the old widget entirely. Alternatively, if you used the 'grid' geometry manager to place your widgets in the window you can also use widget.grid_forget() to remove the widget from the grid but without destroying the widget itself.

Extra features

Beyond the basic requirements above you are encouraged to be creative! You can add extra features to display more information about the "top ten" items in a category and you can display more than three categories of top-ten list (provided they all come from different Web sites). Sadly, we cannot award extra marks for such enterprise, but we will create a "Hall of Fame" in Blackboard to display some of the best solutions!

Development hints

This is a substantial task, so you should not attempt to do it all at once.
In particular, you should work on the "back end" parts first, before attempting the Graphical User Interface "front end". If you are unable to complete the whole task, just submit those stages you can get working. You will receive partial marks for incomplete solutions.

It is strongly suggested that you use the following development process:

1. Decide what kind of lists you want to display and search the Web for appropriate HTML or XML documents that contain the necessary text and images. For the top-ten lists choose only documents that are updated regularly.
2. Using the provided downloader.py application, download each document so that you can examine its structure. You will want to study the HTML or XML source code of the document to determine how the parts you want to extract are marked up. For the images you simply need to find the URL address of the image. However, for the top-ten text you will want to identify the markup tags, and perhaps other unchanging parts of the document, that uniquely identify the beginning and end of the text you want to extract.
3. Using the provided regex_tester.py application, devise regular expressions which extract just the necessary elements from the relevant parts of the Web document. Using these regular expressions and Python's urllib module and findall function you can now develop a simple prototype of your "back end" solution that just extracts and prints the required data, i.e., the top-ten text and the addresses of the images, from the Web documents in IDLE's shell window. Doing this will give you confidence that you are heading in the right direction, and is a useful outcome in its own right.
4. Once you have got the data extraction functions working you can add a Graphical User Interface "front end" to your program.
Decide whether you want to use push buttons, radio buttons, menus, lists or some other mechanism for choosing which top-ten categories to display, and extend your back-end code accordingly. Developing the GUI is the "messiest" step, and is best left to last.

Trying to develop the front-end and back-end code simultaneously is very difficult. Focus on one thing at a time.

Deliverable

You should develop your solution by completing and submitting the provided Python template file top_ten.py as follows.

1. Complete the "statement" at the beginning of the Python file to confirm that this is your own individual work by inserting your name and student number in the places indicated. Submissions without a completed statement will not be marked.
2. Complete your solution by developing Python code at the place indicated. You must complete your solution using only the modules imported by the provided template. You may not use or import any other modules or files to complete this program.
3. Submit only a single, self-contained Python file. Do not submit multiple files. Do not submit an archive containing multiple files. This means that all images and text you want to display must be downloaded from online Web documents, not from local files.

Apart from working correctly your program code must be well-presented and easy to understand, thanks to (sparse) commenting that explains the purpose of significant code segments and helpful choices of variable and function names. Professional presentation of your code will be taken into account when marking this assignment.

If you are unable to solve the whole problem, submit whatever parts you can get working. You will receive partial marks for incomplete solutions.

How to submit your solution

A link will be created on Blackboard under Assessment for uploading your solution file before the deadline (11:59pm on Thursday, May 26th).
Note that you will be able to submit as many drafts of your solution as you like. You are strongly encouraged to submit draft solutions before the deadline.

Appendix: Original Web documents

The original Web documents from which we extracted our top-ten lists in the examples shown above were as follows at the time our demonstration program was executed (Sunday, May 1st). Notice that each of them contains many more elements than we needed, so we used regular expressions to extract just the required text from the HTML source code.
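To give a feel for this kind of extraction, the sketch below pulls just the wanted items out of a small document that also contains unwanted elements, removes any leftover mark-up, and numbers the results. It is only an illustration under stated assumptions: the page text, the class names in the pattern and the clean helper function are all hypothetical, it assumes the sub function is available alongside findall, and it is not the code used in our demonstration program.

```python
from re import findall, sub

# A made-up Web document with extra elements around the
# items of interest (hypothetical data)
page_source = '''
<div class="menu"><a href="/home">Home</a></div>
<td class="name"><b>Area 51</b> &mdash; aliens</td>
<td class="votes">1,024</td>
<td class="name">The Moon landing wasn&rsquo;t real</td>
<td class="votes">980</td>
'''

# Extract only the table cells marked as names,
# ignoring the menu and the vote counts
raw_items = findall('<td class="name">(.*?)</td>', page_source)

def clean(text):
    # Delete any HTML tags left inside the extracted text
    text = sub('<[^>]*>', '', text)
    # Replace HTML character mark-ups with plain characters
    text = text.replace('&mdash;', '-')
    text = text.replace('&rsquo;', "'")
    return text

# Number the cleaned items (at most ten) ready for display
numbered = [str(rank) + '. ' + clean(item)
            for rank, item in enumerate(raw_items[:10], 1)]
for line in numbered:
    print(line)
```

Anchoring the pattern on unchanging mark-up (here the hypothetical class="name" attribute) rather than on the item text is what lets the extraction survive updates to the list itself.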