Assignment title: Information
CSE5BIO Bioinformatics Technologies
Assignment Two
(Presentations will be Week 11, 17th May and Week 12, 24th May in
Lecture and Practical Time)
Assignment Due Thursday 9:30am 26th May 2016
20% of Total Subject Grade
Plagiarism Note: As part of this evaluation, student assignments may be submitted to Turnitin
for checking. Note that this software checks not only for copying from another student
but also for copying from web sites. Any plagiarism detected by
Turnitin will be pursued according to the Faculty's regulations for plagiarism.
Part I
1) Complete Biological Central Dogma (5 Marks)
Zika sequence containing a gene
1 ctgttgctgc ttcagactgc gacagttcga gtttgaagcg aaagctagca acagtatcaa
61 caggttttat ttggatttgg aaacgagagt ttctggtcat gaaaaaccca aaaaagaaat
121 ccggaggatt ccggattgtc aatatgctaa aacgcggagt agcccgtgtg agcccctttg
181 ggggcttgaa gaggctgcca gccggacttc tgctgggtca tgggcccatc aggatggtct
241 tggcgattct agcctttttg agattcacgg caatcaagcc atcactgggt ctcatcaata
301 gatggggttc agtggggaaa aaagaggcta tggaaataat aaagaagttc aagaaagatc
361 tggctgccat gctgagaata atcaatgcta ggaaggagaa gaagagacga ggcgcagaaa
421 ctagtgtcgg aattgttggc ctcctgctga ccacagctat ggcagcggag gtcactagac
481 gtgggagtgc atactatatg tacttggaca gaaacgatgc tggggaggcc atatcttttc
541 caaccacatt ggggatgaat aagtgttata tacagatcat ggatcttgga cacatgtgtg
601 atgccaccat gagctatgaa tgccctatgc tggatgaggg ggtggaacca gatgacgtcg
661 attgttggtg caacacgacg tcaacttggg ttgtgtacgg aacctgccat cacaaaaaag
721 gtgaagcacg gagatctaga agagccgtga cgctcccctc ccattccact aggaagctgc
781 aaacgcggtc gcaaacctgg ttggaatcaa gagaatacac aaagcacttg attagagtcg
841 aaaattggat attcaggaac cctggtttcg ctttagcagc agctgccatc gcgtggcttt
901 tgggaagctc aacgagccaa aaagtcatat acttggtcat gatactgctg attgccccgg
961 catacagcat caggtgcata ggagtcagca atagggactt tgtggaaggt atgtcaggtg
1021 ggacttgggt tgatgttgtc ttggaacatg gaggttgtgt caccgtaatg gcacaggaca
1081 aaccgactgt cgacatagag ctggttacaa caacagtcag caacatggcg gaggtaagat
1141 cctactgcta tgaggcatca atatcagaca tggcttcgcc cagccgctgc ccaacacaag
1201 ccgctgccta ccttgacaag caatcagaca ctcaatatgt ctgcaaaaga acgttagtgg
1261 acagcgactg gggtt
a) There is a gene in this sequence. Use the Translate tool at
http://au.expasy.org/tools/dna.html to find the most likely protein product.
i. Paste the protein sequence into your assignment and indicate which frame you
used.
ii. Why are there 6 possible frames? Hint: the longest sequence Met -> Stop will
be the correct outcome.
b) Using the translator, you have found the start and end of the gene. Trim the nucleotide
sequence to show just the coding sequence.
i. How many nucleotides were removed from the front or end of the above
sequence?
c) Using the nucleotide sequence, transcribe the first 21 nucleotides into RNA, and then
translate that into amino acids.
d) Paste the nucleotide sequence of the whole gene into NCBI Blast and perform a
BLASTX search against SwissProt database.
i. What organism was my sequence most closely matching?
ii. Give a brief summary of the function of the closely matching protein.
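Note: The steps in Question 1 (six-frame translation, picking the longest Met -> Stop stretch, and transcription) can also be sketched in code. The following is a minimal Python illustration on a toy sequence, not a replacement for the ExPASy tool. It assumes the standard genetic code, and the function names (translate, six_frames, longest_orf) are hypothetical helpers introduced here for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def transcribe(dna):
    """Write the coding-strand DNA as mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three frames on the given strand plus three on its reverse complement."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"5'3' frame {offset + 1}"] = translate(dna[offset:])
        frames[f"3'5' frame {offset + 1}"] = translate(rc[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch that starts at M and is terminated by a stop ('*')."""
    best = ""
    for segment in protein.split("*")[:-1]:  # drop the unterminated tail
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


if __name__ == "__main__":
    toy = "ATGGCCTAA"  # toy example, not the assignment sequence
    print(transcribe(toy))
    for name, protein in six_frames(toy).items():
        print(name, protein, longest_orf(protein))
```

Running the same functions on your own sequence is a quick way to cross-check the frame reported by the web tool.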
2) Biological Sequence alignment. (5 marks)
a) Calculate the alignment score for different alignments and indicate the best possible
alignment for the following pair of sequences:
ACCTAGCTAGCCGAT
ACCCCTAGGCGAAA
Use Match: +1, Mismatch:-1 and Indel: -2
i. Show all possible alignments and their scores
ii. Which is the best possible alignment? State the reason.
b) Carry out a Multiple Sequence Alignment (MSA) for the protein Ras 1. Use the UniProt
(http://www.uniprot.org/) database for retrieving the protein sequence of keratin type2.
The protein sequences from the following organisms must be included in your MSA.
1) Homo sapiens (Human)
2) Drosophila melanogaster (Fruit Fly)
3) Candida albicans (Yeast)
4) Mus musculus (Mouse)
5) Gallus gallus (Chicken)
Notes:
Use only the reviewed sequences from the database. This can be done by selecting
'Reviewed' under the heading 'filter by' at the left hand corner of the results page in UniProt.
Use Clustal Omega for MSA (http://www.ebi.ac.uk/Tools/msa/clustalo/)
i. Provide the FASTA sequences retrieved in your report (5 sequences retrieved
from UniProt).
ii. Show the coloured alignment obtained from ClustalO in your report.
iii. What do you think about the alignment? Give a brief discussion on the alignment
results.
iv. Display the phylogenetic tree and discuss the relationship.
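Note: For part 2(a), the scoring scheme (Match: +1, Mismatch: -1, Indel: -2) can be checked with a short script. The sketch below is one possible approach, shown on toy sequences only. score_alignment scores a gapped alignment you supply, and best_global_score uses a Needleman-Wunsch style dynamic programme (an assumption about the intended notion of "best", namely the global alignment with the highest score). Both function names are hypothetical.

```python
def score_alignment(a, b, match=1, mismatch=-1, indel=-2):
    """Score two gapped strings of equal length column by column ('-' is a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += indel
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def best_global_score(s, t, match=1, mismatch=-1, indel=-2):
    """Highest achievable global alignment score (Needleman-Wunsch recurrence)."""
    # prev[j] holds the best score for aligning the processed prefix of s with t[:j].
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Toy example, not the assignment sequences.
    print(score_alignment("AC-GT", "ACCGT"))
    print(best_global_score("ACGT", "ACCGT"))
```

An alignment you propose by hand is optimal under this scheme when its score_alignment value equals best_global_score for the ungapped sequences.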
3) Hidden Markov Model (5 marks)
Consider the following HMM, which models an intronic sequence with GC rich regions. The
model consists of 9 states:
The states and transition probabilities are indicated in the following diagram:
Fig. 1. The states and transition probabilities.
The emission probabilities for the non-silent states are as follows:
Base   S1    S2    S3    S4    S5    S6    S7
A      0     0     0.35  0.50  0.50  0     0
C      0     0     0.25  0.18  0.44  1     1
G      1     0     0.30  0.19  0.33  0     0
T      0     1     0.10  0.13  0.03  0     0
Please answer the following questions:
a) What is the probability of observing GTGGTA along the state path
p = START-S1-S2-S4-S5--S7-END?
b) What is the probability of seeing GTCGC, given the current HMM? (Show the steps
of your calculation.)
c) What is the probability of seeing GCTCGT, given the current HMM? State your
observation.
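Note: The two kinds of calculation asked for in Question 3 (the probability of a sequence along one given path, and the probability of a sequence summed over all paths) can be sketched as follows. The transition probabilities of Fig. 1 are not reproduced here, so the example uses a small hypothetical two-state model (states X and Y, with silent START and END); substitute the values from Fig. 1 and the emission table yourself. The function names are also hypothetical, and the second function is a plain forward algorithm.

```python
def path_probability(seq, path, trans, emit):
    """P(seq, path): product of transitions START -> path -> END and emissions."""
    if len(seq) != len(path):
        raise ValueError("need exactly one emitting state per symbol")
    states = ["START"] + list(path) + ["END"]
    p = 1.0
    for src, dst in zip(states, states[1:]):
        p *= trans.get(src, {}).get(dst, 0.0)
    for symbol, state in zip(seq, path):
        p *= emit[state].get(symbol, 0.0)
    return p


def sequence_probability(seq, trans, emit):
    """P(seq) summed over all state paths (forward algorithm)."""
    if not seq:
        return trans.get("START", {}).get("END", 0.0)
    # fwd[s] = probability of the symbols seen so far, ending in state s.
    fwd = {
        s: trans["START"].get(s, 0.0) * emit[s].get(seq[0], 0.0) for s in emit
    }
    for symbol in seq[1:]:
        fwd = {
            s: emit[s].get(symbol, 0.0)
            * sum(fwd[r] * trans[r].get(s, 0.0) for r in emit)
            for s in emit
        }
    return sum(fwd[s] * trans[s].get("END", 0.0) for s in emit)


# Hypothetical toy model, NOT the model of Fig. 1.
TOY_TRANS = {
    "START": {"X": 1.0},
    "X": {"X": 0.5, "Y": 0.5},
    "Y": {"Y": 0.5, "END": 0.5},
}
TOY_EMIT = {
    "X": {"A": 0.5, "G": 0.5},
    "Y": {"C": 1.0},
}

if __name__ == "__main__":
    print(path_probability("GC", ["X", "Y"], TOY_TRANS, TOY_EMIT))
    print(sequence_probability("AGC", TOY_TRANS, TOY_EMIT))
```

A path contributes zero as soon as one of its states cannot emit the observed symbol, which is worth keeping in mind when a state emits only a single base.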
4) Microarray (2 marks)
a) Explain why some of the dots are red, some yellow, and some green in Fig. 2.
b) Find gene expression using MagicTool. You can either download microarray data
in the form of TIFF files from any website or use the data provided. An example website for
downloading a TIFF dataset:
http://www.bio.davidson.edu/projects/magic/magic.html
5) Next Generation sequencing (3 marks)
Using Galaxy, find the highest number of SNPs in chromosome 9.
a) Display the results acquired after sorting, showing the SNP count, and state the highest
number of SNPs per exon.
b) Select the top 5 exons with the highest number of SNPs and display the result.
c) Also provide screenshots of the whole process, as shown in the tutorial.
Part II: Presentation of a given research topic
(6 marks for slide content + 4 marks for innovative solution + 5 marks for presentation
+ 5 marks for group work)
Work in groups of three. Each presentation is 8 minutes long and will be given during the Week 11 and
Week 12 Lecture and Practical times, 17th and 24th May 2016.
Students must attend the presentation sessions in full.
Choose ONE of the 8 research topics given below.
Each topic is allocated to ONLY ONE group on a first-come, first-served basis.
If your topic has already been chosen by another group, you will be asked to choose another
topic.
You MUST email your group members and topic by 3rd May 2016 to
[email protected]
LIST OF 8 TOPICS
1. Next generation sequencing data analysis: current developments.
2. Personalised medicine and Bioinformatics
3. Role of bioinformatics in Drug Design, Discovery and Development
4. RNA secondary structure prediction using Bioinformatics tools.
5. Translational bioinformatics.
6. Select one of the recent hot topics in proteomics.
7. Current challenges in Bioinformatics.
8. Insights into disease using Bioinformatics.
Submission requirements:
An electronic submission of your answers (*.doc or *.docx).
An electronic copy of your presentation (*.PPT).
Oral Presentation in weeks 11 and 12 of semester.