IFB104 Building IT Systems
Semester 1, 2017
Assignment 2, Part A: Online Shopper
(19%, due 11:59pm Sunday, May 28th)
Overview
This is the first part of a two-part assignment. This part is worth 19% of your final grade for
IFB104. Part B will be worth a further 6%. Part B is intended as a last-minute extension to
the assignment, thereby testing the maintainability of your solution to Part A and your ability
to work under time pressure. The instructions for completing Part B will not be released until
Week 12. Whether or not you complete Part B you will submit only one file, and receive only
one mark, for the whole 25% assignment.
Motivation
The Internet has dramatically changed the way we conduct commerce. Online shopping is
increasingly displacing traditional forms of retailing. Here you will develop an application
that provides an online shopping experience, using data downloaded from the World-Wide
Web. The program will have a Graphical User Interface that allows the user to choose how
many products they want to buy from different categories. Having done so they will then be
able to print an illustrated invoice summarising their purchases, which can be viewed in a
standard web browser. Most importantly, the online shopping application will aggregate up-to-date data sourced from online “feeds” that are updated on a regular basis. To complete
this assignment you will need to: (a) use Tkinter to create a simple Graphical User Interface;
(b) download web pages using a Python script and use pattern matching to extract particular
elements from them, and (c) generate an HTML document containing the extracted elements
presented in an attractive, easy-to-read format.
Illustrative Example
For the purposes of this assignment you have a free choice of what products your online
shopping application will offer. Categories of products could be:
• clothing,
• electrical goods,
• books and magazines,
• motor vehicles,
• furniture,
• jewellery and cosmetics,
• etc.
However, whatever products you choose, you must be able to find at least three different online web sites that contain regularly-updated lists of such products. For each product there
must be a textual description, a photograph, and a price. A good source for such data is Rich
Site Summary (RSS) web-feed documents. Appendix A below lists some such sites, but you
are encouraged to find your own of personal interest.
To demonstrate the idea, below is our own online shopping application, which uses data extracted from several different web sites. Our demonstration application allows users to select
from four product categories: diversion safes (i.e., safes disguised as other household items),
replica watches, sportswear and computer accessories. The application allows users to
choose how many items from each product category they want to buy. The program then
downloads the necessary data from the various web sites and uses it to produce an illustrated
invoice in the form of an HTML document which can be viewed in a standard web browser.
The screenshot below shows our example solution’s GUI when it first starts. We’ve called
our online shop Pot Luck because although the user can choose the quantities of goods they
want to buy they have no control over which items they will get!
The user is invited to choose how many items in each category they want to buy. In this case
this is done using “spinbox” widgets, but other solutions are possible. Below the user has
chosen to buy two safes, two watches, three items of sportswear and one computer accessory.
Having pressed the button to start printing the invoice for these items, the user can watch
their order’s progress at the bottom of the GUI.
The application downloads information about the products currently for sale in each of these
categories from four different online shopping sites and uses this information to generate an
HTML invoice. When the invoice is ready to be viewed the user is informed with a “Done!”
message as follows.
The invoice appears in the file invoice.html, which the user can then open in any standard web browser. The invoice has several parts, and begins as follows.
Firstly there is a heading identifying our online shop. This is followed by an image, representative of our imaginary company. Like all data in the invoice, this image is itself downloaded
from the web, in this case via the following URL: http://www.connectgrowserveshare.com/dotAsset/9732.jpg.
Following this is the total cost for the user’s eight purchases in this case. Importantly, this
price is expressed in Australian dollars, even though the original web sites from which the
product prices were obtained used United States dollars.
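As a rough sketch of this conversion, the snippet below turns US-dollar price strings into Australian-dollar amounts and totals them. The multiplier of 1.33 is the fixed approximation quoted in the marking guide later in this handout; the function names and sample prices are invented for illustration, and the code assumes Python 3.

```python
# Approximate exchange rate from US to Australian dollars. This is a
# fixed multiplier, not a live rate.
USD_TO_AUD = 1.33

def usd_to_aud(usd_price_text):
    """Convert a price string such as '$19.95' (US dollars) to a float in AUD."""
    usd = float(usd_price_text.replace('$', '').replace(',', ''))
    return round(usd * USD_TO_AUD, 2)

def invoice_total(usd_price_texts):
    """Return the total cost in AUD of all purchases, rounded to whole cents."""
    return round(sum(usd_to_aud(price) for price in usd_price_texts), 2)

# Hypothetical prices, as they might be extracted from a feed
sample_prices = ['$10.00', '$1,000.00', '$2.00']
print('Total: ${:.2f} AUD'.format(invoice_total(sample_prices)))
```

Converting each price before summing keeps the individual prices shown in the invoice consistent with the total.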
After this, each of the user’s purchases is described. For each one there is a short textual description, a photo, and the price, again converted to Australian dollars. In this case the eight
purchases were as follows.
Like all the data displayed, the product photos have all been downloaded from different web
sites. Those for the diversion safes and replica watches are only small “thumbnail” images,
so appear blurry at this scale, but the sportswear and computer accessories sites provided our
application with larger, more detailed images.
The final part of the invoice is due acknowledgement of the sources for all this product data.
In this case the invoice provides hyperlinks to the four web sites used by our online shopping
application, as shown below.
A special case, however, is what the invoice should contain if the user doesn’t select any
items, i.e., if the “Print invoice” button is pressed when all the product quantities are zero. In
this case our demonstration application produces the following polite, “no charge” invoice.
Online data sources
So where did the data about the products for sale actually come from? Most importantly, we
downloaded all this data “live” from the four online web documents listed above. This was
done by reading the HTML source files and using regular expressions to find the necessary
elements needed to construct our invoice. For instance, we accessed the Zazzle online store
for computer accessory products, using the following web document which, in full, lists dozens of products for sale.
This web document contains all the information we needed for our product invoice: descriptions, photos and prices (although in US dollars). Our application uses regular expressions to
extract the necessary data, thereby allowing for the possibility that the products listed are
changed. (Indeed this site was updated while we were developing our demo solution.)
Similarly, the sportswear items were extracted from the web site shown overleaf. As can be
seen by the dates on the products, this particular web site is not updated often, but our application
must still allow for the possibility that new sportswear products will be added occasionally.
In each case our application extracts the topmost items in the product lists, regardless of what
they are, as per the user’s specified quantities. Our online shopping application must always
produce the most recent items listed, even after the source web page has been updated.
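The idea of taking the topmost items can be sketched as follows. The feed text and its tag names (`<item>`, `<title>`, `<enclosure>`, `<price>`) are made up for this example and will differ from the documents you choose, so treat the patterns as placeholders rather than as the ones used in our demonstration. The code assumes Python 3.

```python
import re

# A made-up fragment in the general shape of an RSS feed. Real feeds mark
# up their contents differently, so study your own source documents first.
SAMPLE_FEED = """<rss><channel>
<item>
  <title>Hollow Book Safe</title>
  <enclosure url="http://example.com/images/book-safe.jpg" type="image/jpeg"/>
  <price>$24.95</price>
</item>
<item>
  <title>Soft Drink Can Safe</title>
  <enclosure url="http://example.com/images/can-safe.jpg" type="image/jpeg"/>
  <price>$12.50</price>
</item>
<item>
  <title>Wall Clock Safe</title>
  <enclosure url="http://example.com/images/clock-safe.jpg" type="image/jpeg"/>
  <price>$31.00</price>
</item>
</channel></rss>"""

def top_products(feed_text, quantity):
    """Return (description, photo URL, price) for the first `quantity` items."""
    # Isolate each item first, so that the three elements of one product
    # can never be paired with those of another
    items = re.findall(r'<item>(.*?)</item>', feed_text, re.DOTALL)
    products = []
    for item in items[:quantity]:
        title = re.search(r'<title>(.*?)</title>', item, re.DOTALL)
        photo = re.search(r'<enclosure[^>]*url="([^"]+)"', item)
        price = re.search(r'<price>\$?([0-9.,]+)</price>', item)
        # Skip any item that is missing one of the three elements
        if title and photo and price:
            products.append((title.group(1).strip(), photo.group(1), price.group(1)))
    return products

for product in top_products(SAMPLE_FEED, 2):
    print(product)
```

Because nothing in the function mentions a particular product, it keeps returning whatever is at the top of the list after the feed has been updated.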
Output format
Our Python application generates an HTML document in a file called invoice.html that
can be viewed in a standard web browser. Although not intended for human consumption,
the generated HTML code is nonetheless laid out neatly, and with comments indicating the
purpose of each element. Part of the generated HTML for the invoice above is shown overleaf.
To generate our HTML code we downloaded the source web documents as character strings
and used a combination of Python string functions and regular expressions to isolate the elements we needed to construct our own HTML code. For instance, from a prior examination
of the XML source code of the “replica watches” web site we knew that each watch’s description appears between … markup tags so we used this knowledge
to help extract the necessary data. Our application does this whenever the “Print invoice”
button is pressed, so the generated HTML invoice will be updated with fresh data each time.
We also discovered that sometimes the downloaded text contained unusual characters that are
not handled properly in Python strings, for instance “registered trademark” symbols, as
shown in the sportswear product description below, so we removed these before adding the
text to our invoice.
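One simple way to remove such characters, offered here as an assumption rather than as the method used in our demonstration, is to keep only plain ASCII characters. The function name and the sample product description are invented, and the code assumes Python 3.

```python
def remove_unusual_characters(text):
    """Keep only plain ASCII characters, dropping symbols such as the
    registered-trademark sign."""
    return ''.join(char for char in text if ord(char) < 128)

# \u00ae is the registered-trademark symbol
sample = 'SuperFit\u00ae Running Shorts'
print(remove_unusual_characters(sample))
```

Note that this filter also discards accented letters, so it may be too blunt for product descriptions written in languages other than English.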
Requirements and marking guide
To complete this task you are required to develop an application in Python similar to that
above, using the provided online_shopper.py template file as your starting point. Although
our demonstration allowed the user to select from four distinct product categories, you
are only required to support three. Your solution must have at least the following features.
• An intuitive Graphical User Interface (4%). Your program must provide an easy-to-use GUI. This interface must have the following characteristics:
o All of the widgets must be neatly laid out.
o It must allow the user to select quantities from three (or more) categories of
products on sale. The user must be able to select up to (at least) five items in
each category. Any mechanism can be provided for selecting quantities as
long as it is intuitive and easy to use, e.g., text entry boxes, radio buttons, spinboxes, etc.
o It must allow the user to choose to “print” their invoice, i.e., to generate the
invoice.html file.
o It must visually indicate to the user the progress being made on downloading
data and generating their invoice. Any clear mechanism can be used for showing progress, e.g., a textual description, a progress bar, highlighting of GUI elements, etc.
• The ability to generate the fixed invoice elements (2%). Your program must be
able to generate an HTML file, invoice.html, which contains the following fixed
elements:
o The name of your “online shop”.
o An image evocative of the shop’s title. The image must be sourced from online (you cannot attach image files to your solution), but since it will never
change, the URL for this particular image can be “hardwired” into your Python code.
o Hyperlinks to each of the three (or more) web sites from which your application gets its product data. You must derive your product data from three distinct online shopping sites offering different products, not just three different
product categories from the same online shop.
o When the user has not bought any items an appropriate “no charge” message
must be included.
The HTML source code generated by your Python program must be laid out neatly.
• The ability to calculate a total in Australian dollars (2%). When the user has selected some quantities of items to buy, your application must download the prices for
that many items, convert them to Australian dollars (if necessary), and display the total cost in the invoice. You do not need to use exact current exchange rates to perform this calculation but can use a reasonable approximation based on current rates.
For instance, in our demonstration solution all of the prices downloaded were in US
dollars, so we used a fixed multiplier of 1.33 to convert them to Australian dollars.
• The ability to generate lists of products for each category (7%). Your Python
program must be capable of generating HTML code to display products downloaded
from at least three distinct web sites in quantities as specified by the user. You must
derive your product data from three different online shopping sites offering different
types of products, not just three different product categories from the same online
shop.
For each quantity selected by the user you must add a description of that many products as derived from the online product lists. For each category of product there
must be the current top-listed elements from the source web page, in the quantity
specified by the user, and for each individual product there must be
o a textual description,
o a photo, and
o a price in Australian dollars.
In each case the description, photo and price must match correctly as shown on the
source web page, e.g., you can’t have the name of one product matched with the
photo of another. (Mis-matched elements are an indication that your pattern matching
solution is not working correctly.)
The top-listed products on the source web page must be used, regardless of any changes made to the web site since you developed your Python application. Your code for
extracting online web elements cannot be hardwired to the source web sites as they
were at a particular time in the past.
Each of the HTML elements used in your invoice must be extracted from the original
document separately. It is not acceptable to simply copy large chunks of the original
web document’s source code. The HTML source code generated by your Python program must be laid out neatly.
When viewed in a web browser, your invoice must:
o lay out all HTML elements neatly, irrespective of the size of the web browser’s
window.
The precise visual layout, colour and style of the invoice is up to you and is determined by the design of your generated HTML code. Nonetheless, the invoice must be
attractive and easy to read. No HTML markup tags or other odd characters should
appear in any of the text displayed in the invoice.
Data on the web changes frequently, so your solution must continue to work even after the source web documents you use have been updated. For this reason it is unacceptable to “hardwire” your solution to the particular text and images appearing on
the web on a certain day. Instead you will need to use pattern matching to actively
find the text and photos in the documents, regardless of any product updates that may
have occurred since you wrote your program.
• Good Python and HTML code presentation (4%). Both your Python program code
and your generated HTML source code must be presented in a professional manner
for both parts of the assignment. See the coding guidelines in the IFB104 Code Presentation Guide (on Blackboard under Assessment) for some suggestions on how to
achieve this for Python. In particular, each significant code segment must be clearly
commented to say what it does, e.g., “Generate the invoice’s title”, “Show the total
price”, etc. in both the Python and HTML code.
• Extra feature (6%). Part B of this assignment will require you to make a last-minute
extension to your solution. The instructions for Part B will not be released until just
before the final deadline for Assignment 2.
You can add other features if you wish, as long as you meet these basic requirements. For
instance, in our example solution above we supported more than the required three product
categories, and our application allows up to nine items to be ordered from each category, rather than the required five.
You must complete the task using only basic Python features and the modules already imported into the provided template. In particular, you may not import any local image files.
All displayed images and text must be downloaded from online sources each time your program is run.
However, your solution is not required to follow precisely our example shown above. Instead
you are strongly encouraged to be creative in your choices of products to display, the design
of your Graphical User Interface, and the design of your invoice.
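As one illustration of the invoice requirements in this section, the sketch below builds the HTML as a list of indented lines, with a comment before each part and a “no charge” message when nothing was ordered. The function name, the layout and the example.com logo address are all assumptions made for this example, and the code assumes Python 3; your own design can look quite different.

```python
from html import escape

def generate_invoice(shop_name, logo_url, products, source_urls):
    """Return the invoice as a string of neatly laid out, commented HTML.

    `products` is a list of (description, photo URL, price in AUD) tuples.
    """
    lines = [
        '<!DOCTYPE html>',
        '<html>',
        '  <head>',
        '    <meta charset="UTF-8">',
        '    <title>{} Invoice</title>'.format(escape(shop_name)),
        '  </head>',
        '  <body>',
        '    <!-- Shop name and logo -->',
        '    <h1>{}</h1>'.format(escape(shop_name)),
        '    <img src="{}" alt="Shop logo">'.format(escape(logo_url)),
    ]
    if products:
        total = sum(price for _, _, price in products)
        lines.append('    <!-- Total cost of all purchases -->')
        lines.append('    <h2>Total: ${:.2f} AUD</h2>'.format(total))
        for description, photo_url, price in products:
            lines.append('    <!-- One purchased product -->')
            lines.append('    <h3>{}</h3>'.format(escape(description)))
            lines.append('    <img src="{}" alt="{}">'.format(
                escape(photo_url), escape(description)))
            lines.append('    <p>Price: ${:.2f} AUD</p>'.format(price))
    else:
        lines.append('    <!-- Nothing was ordered -->')
        lines.append('    <h2>You did not select any items, so there is no charge.</h2>')
    lines.append('    <!-- Sources of the product data -->')
    lines.append('    <ul>')
    for url in source_urls:
        lines.append('      <li><a href="{0}">{0}</a></li>'.format(escape(url)))
    lines.append('    </ul>')
    lines.append('  </body>')
    lines.append('</html>')
    return '\n'.join(lines) + '\n'

# The special case of an empty order
print(generate_invoice('Pot Luck', 'http://example.com/logo.jpg', [],
                       ['https://feed.zazzle.com/rss']))
```

The returned string can then be written to invoice.html with Python's built-in `open` function. Passing the extracted text through `escape` is one way to stop stray markup characters from appearing in the displayed invoice.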
Support tools
To get started on this task you need to download various web documents of your choice and
work out how to extract three things:
• The description of each listed product for sale.
• The photo of each listed product.
• The price of each product.
You also need to allow for the fact that the contents of the web documents from which you
get your data can change unexpectedly, so you cannot hardwire the locations of the document
elements in your solution. The solution to this problem is to use Python’s find character
string method and/or regular expression findall function to extract the necessary elements, no matter where they appear in the HTML/XML source code.
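The two approaches can be compared on a small piece of text. The snippet of source code below is invented for the example, and the code assumes Python 3.

```python
import re

# Hypothetical snippet of downloaded source code
page = '<item><title>Red Scarf</title></item><item><title>Blue Hat</title></item>'

# Approach 1: the find string method locates fixed start and end markers.
# Note that find returns -1 when the marker is absent, which real code
# should check for.
start_tag, end_tag = '<title>', '</title>'
start = page.find(start_tag) + len(start_tag)
end = page.find(end_tag, start)
first_title = page[start:end]

# Approach 2: findall returns every match of the pattern, in document order
all_titles = re.findall(r'<title>(.*?)</title>', page)

print(first_title)
print(all_titles)
```

Neither approach depends on where in the document the elements appear, only on the markup that surrounds them.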
To help you develop your pattern matching solution, we have already provided two small Python programs.
1. downloader.py is a small script that downloads and displays the source code of a
web document and was demonstrated in Week 7. Use it to see a copy of your chosen
web document in exactly the form that it will be received by your Python program.
This is helpful because the version of a web document delivered by a web server to a
Python program may not be the same as one delivered to a web browser! Worse,
some web servers will entirely block access to web pages by Python scripts in the belief that they are malware! In this case they usually deliver a short document containing an “access denied” message instead of the desired data (see Appendix B).
2. regex_tester.py is an interactive program introduced in Week 8 which makes it
easy to experiment with different regular expressions on small text segments. You
can use this together with the downloaded text from the web to help perfect your
regular expressions. There are also many online tools that do the same job.
Internet ethics: Responsible scraping
The process of automatically extracting data from web documents is sometimes called
“scraping”. The RSS feeds we recommend using for this assignment are specifically intended to be easily “scrapable”. However, in order to protect their intellectual property, owners of some other web pages may not want their data exploited in this way. They will therefore deny access to their web documents by anything other than recognised web browsers
such as Firefox, Safari, Internet Explorer, etc. Typically in this situation the web server will
return a short “access denied” document to your Python script instead of the expected web
document (see Appendix B).
In this situation it’s possible to trick the web server into delivering the desired document by
having your Python script impersonate a standard web browser. To do this you need to
change the “user agent” identity enclosed in the request sent to the web server. Instructions
for doing so can be found online. In short, when using urllib the process involves assigning a new value to the urllib.URLopener.version attribute and then using
urllib.urlopen to request the web document as usual. We leave it to your own conscience whether or not you wish to do this, but note that the assignment can be completed
without resorting to such subterfuge.
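For reference, the user-agent change described above might look as follows in Python 3, where the equivalent functionality lives in `urllib.request`. This is an assumption about your environment, since the attribute names quoted above belong to the Python 2 `urllib` module. The helper names and the user-agent string `Mozilla/5.0` are illustrative only.

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent):
    """Create a request that announces the given user-agent string."""
    return Request(url, headers={'User-Agent': user_agent})

def download(url, user_agent):
    """Download a web document as text (requires network access)."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.read().decode('utf-8', errors='replace')

# Building the request does not contact the server; only download() does
request = build_request('https://feed.zazzle.com/rss?qs=cats', 'Mozilla/5.0')
print(request.get_header('User-agent'))
```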
Development hints
This is a substantial task, so you should not attempt to do it all at once. In particular, you
should work on the “back end” parts first, designing your HTML document and working out
how to extract the necessary elements from the online web pages, before attempting the
Graphical User Interface “front end”. If you are unable to complete the whole task, just submit those stages you can get working. You will receive partial marks for incomplete solutions.
It is suggested that you use the following development process:
1. Decide what products you want to “sell” and search the web for appropriate HTML or
XML documents that contain the necessary descriptions, photos and prices.
2. Using the downloader.py application from Week 7, download each document so
that you can examine its structure. You will want to study the HTML/XML source
code of the document to determine how the elements you want to extract are marked
up. Typically you will want to identify the markup tags, and perhaps other unchanging parts of the document, that uniquely identify the beginning and end of the text and
image addresses you want to extract.
3. Using the provided regex_tester.py application from Week 8, devise regular
expressions which extract just the necessary elements from the relevant parts of the
web document. Using these regular expressions and Python’s urllib module and
findall function you can now develop a simple prototype of your “back end” solution that just extracts and prints the required elements, i.e., the text and the URLs of
the photos, from the web documents in IDLE’s shell window. Doing this will give
you confidence that you are heading in the right direction, and is a useful outcome in
its own right.
4. Design the HTML source code for your invoice, with appropriate placeholders for the
downloaded web elements you will insert. Keep the document simple and its source
code neat. The invoice must be well laid out when viewed in a web browser.
5. Develop the necessary Python code to download the web elements and generate the
HTML file. This completes the “back end” functions of your solution.
6. Add the Graphical User Interface “front end” to your program. Decide which mechanism for allowing the user to choose product quantities you want to provide, e.g., text
entry boxes, radio buttons, etc., and extend your back-end code accordingly. Developing the GUI is the “messiest” step, and is best left to last.
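The final step above might be sketched as follows, using spinboxes for the quantities and a text label for progress. This is a minimal outline under several assumptions: it uses the Python 3 module name `tkinter` (the module is called `Tkinter` in Python 2), the category names are placeholders, and `print_invoice` stands for whatever back-end function you developed in the earlier steps.

```python
MAX_ITEMS = 5  # the marking guide asks for at least five items per category
CATEGORIES = ['Safes', 'Watches', 'Sportswear']  # placeholder category names

def parse_quantity(text):
    """Turn a spinbox's text into a whole number between 0 and MAX_ITEMS."""
    try:
        quantity = int(text)
    except ValueError:
        return 0
    return max(0, min(MAX_ITEMS, quantity))

def build_gui(print_invoice):
    """Create the window. `print_invoice` is called with the list of chosen
    quantities and a function for reporting progress to the user."""
    # Imported here so that parse_quantity can be used without a display
    import tkinter as tk

    window = tk.Tk()
    window.title('Online Shopper')

    # One labelled spinbox per product category
    spinboxes = []
    for row, category in enumerate(CATEGORIES):
        tk.Label(window, text=category).grid(row=row, column=0, padx=5, pady=5, sticky='w')
        spinbox = tk.Spinbox(window, from_=0, to=MAX_ITEMS, width=3)
        spinbox.grid(row=row, column=1, padx=5, pady=5)
        spinboxes.append(spinbox)

    # Progress messages appear at the bottom of the window
    progress = tk.Label(window, text='Ready')

    def show_progress(message):
        progress.config(text=message)
        window.update_idletasks()  # redraw now, before the next slow download

    def on_print():
        quantities = [parse_quantity(box.get()) for box in spinboxes]
        print_invoice(quantities, show_progress)
        show_progress('Done!')

    tk.Button(window, text='Print invoice', command=on_print).grid(
        row=len(CATEGORIES), column=0, columnspan=2, pady=5)
    progress.grid(row=len(CATEGORIES) + 1, column=0, columnspan=2, pady=5)
    return window
```

Calling `build_gui(my_back_end).mainloop()`, where `my_back_end` is your own function, would open the window. Keeping the GUI separate from the back end in this way means the downloading and HTML generation can be tested from IDLE's shell without the window.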
Deliverable
You should develop your solution by completing and submitting the provided Python template file online_shopper.py as follows.
1. Complete the “statement” at the beginning of the Python file to confirm that this is
your own individual work by inserting your name and student number in the places
indicated. Submissions without a completed statement will not be marked.
2. Complete your solution by developing Python code at the place indicated. You must
complete your solution using only the modules imported by the provided template.
You do not need to use or import any other modules or files to complete this assignment.
3. Submit only a single, self-contained Python file. Do not submit multiple files. Do
not submit an archive containing multiple files. This means that all images and
text you want to display must be downloaded from online web documents, not from
local files.
Apart from working correctly your program code must be well-presented and easy to understand, thanks to (sparse) commenting that explains the purpose of significant code segments
and helpful choices of variable and function names. Professional presentation of your code
will be taken into account when marking this assignment.
If you are unable to solve the whole problem, submit whatever parts you can get working.
You will receive partial marks for incomplete solutions.
How to submit your solution
A link is available on Blackboard under Assessment for uploading your solution file before
the deadline (11:59pm on Sunday, May 28th). Note that you can submit as many drafts of
your solution as you like. You are strongly encouraged to submit draft solutions before the
deadline.
Appendix A: Some RSS feeds that may prove helpful
For this assignment you need to find several web documents that contain regularly-updated
lists of products for sale, and that have a fairly simple source-code format so that you can
easily extract elements from them. This appendix suggests some such pages, but you are encouraged to find your own. Note that you are not limited to using RSS feeds for this assignment, but you may find other, more complex, web documents harder to work with. Most
importantly, you are not required to use any of the sources below for this task. You are
strongly encouraged to find online documents of your own, that contain products of personal
interest.
The following links point to Rich Site Summary, a.k.a. Really Simple Syndication, web feed documents. RSS documents are written in XML and
are used for publishing information that is updated frequently in a format
that can be displayed by RSS reader software. Such documents have a
simple standardised format, so we can rely on them always formatting
their contents in the same way, making it relatively easy to extract specific elements from the document’s source code via pattern matching.
Another important advantage of RSS feeds for our purposes is that such documents are specifically intended to serve as sources of online information for RSS readers and other such
software, so they are unlikely to block Python scripts from accessing their contents (see Appendix B).
However, a disadvantage of using RSS feeds is that they can be hard to find! Often you can
discover them by looking for the symbol above on shopping web sites. However, because
RSS feeds are not intended for human consumption, they don’t usually feature prominently in
the results of web searches using standard search engines such as Google, DuckDuckGo,
Bing, etc. To overcome this you can find various directories of RSS feeds online, as well as
search engines specifically intended for finding RSS feeds. Explore!
For our example solution above we used four particular RSS feeds, but there are many other
sites suitable for this assignment. Some examples of RSS feeds that could be used for this
assignment include the following.
• A big site for Indian handicrafts: http://india-shopping.khazano.com/rss/
• A huge online “marketplace” with lots of “shops” that can be accessed as RSS feeds:
https://www.etsy.com/au/c/. To find the RSS feed for a particular shop, e.g., “oktak”,
you just need to put the shop’s name into a standardised URL as, in this case:
https://www.etsy.com/shop/oktak/rss
• An online department store with lots of RSS feeds: https://feed.zazzle.com/rss. Use
the “qs” qualifier to access different product categories, e.g., for products featuring
cats use https://feed.zazzle.com/rss?qs=cats
• Another online shop with lots of feeds: https://www.rakuten.com/ct/rss/
• A site for buying property in the UK: https://www.foxtons.co.uk/buy/feeds.html
• Car accessories: https://www.seicane.com/rss
• Computer products and accessories (lots): http://www.tigerdirect.com/rss/index.asp
• Anti-theft technologies: http://www.crimezappers.com/rss/
• Dresses: http://www.joomlajingle.com/rss
• Shoes and handbags (and probably other stuff if you can figure out the necessary
URLs): http://www.shoebuy.com/rss-sale-shoes and http://www.shoebuy.com/rss-sale-bags
• And many, many more you can find for yourself!
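The standardised Etsy and Zazzle addresses mentioned in the list above can be built from a shop name or search term, as in the following sketch. The function names are invented, the code assumes Python 3, and the URL patterns are simply those quoted above, which the sites may change at any time.

```python
from urllib.parse import urlencode

def etsy_shop_feed(shop_name):
    """Address of the RSS feed for one Etsy shop, using the pattern above."""
    return 'https://www.etsy.com/shop/{}/rss'.format(shop_name)

def zazzle_feed(search_term):
    """Address of a Zazzle feed selected with the 'qs' qualifier."""
    # urlencode takes care of spaces and other special characters
    return 'https://feed.zazzle.com/rss?' + urlencode({'qs': search_term})

print(etsy_shop_feed('oktak'))
print(zazzle_feed('cats'))
```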
Not all such sites are easy to use, however. Some web sites are unreliable and, quite commonly, others don’t put all the information needed for this assignment in one place. The following are some other sites we discovered while developing our demonstration solution for
this assignment, but which may be difficult to use.
• The following online department store site is not an RSS feed, but it has lots of pages
of products and looks easily “scrapable”: http://www.shopzilla.com/
• This site looks very promising at first because it lists lots of RSS feeds, but when we
tried it they all took us to the same computer accessories page:
https://www.newegg.com/RSS/Index.aspx
• This site has lots of great feeds, but appears to be very unreliable. Use it at your own
risk! http://www.shop.com/rss-a.xhtml
• This menswear site has lots of feeds accessible from the categories on the first page
below, but for each product you need to follow a link to find the prices:
http://mensclothing2you.info/
http://mensclothing2you.info/rss/clothing-men-dress.rss
http://mensclothing2you.info/rss/mens-designer-suits.rss
http://mensclothing2you.info/rss/mens-designer-pants.rss
etc.
• Similarly, you need to follow the links to get the prices from this women’s clothing
site: http://womenclothing2you.info/rss.jsp
• Another online shop with lots of great stuff but in this case you need to follow more
than one link to get full details: http://www.onlineshoppingaustralia.com.au/rss.xml
Appendix B: Web sites that block access to Python scripts
As noted above, some web servers will block access to web documents by Python programs
in the belief that they may be malware. In this situation they usually return a short HTML
document containing an “access denied” message instead of the desired document. This can
be very confusing because you can usually view the document without any problems using a
standard web browser even though your Python program is delivered something different by
the server.
If you suspect that your Python program isn’t being allowed to access your chosen web page,
use the small downloader.py application from Week 7 to check whether or not your Python program is being sent an access denied message. When viewed in a web browser, such
messages look something like the following example. In this case the blog www.wayofcats.com has used the anti-malware software “Cloudflare” to block access to the blog’s
contents by our Python program.