Assignment title: Information

Homework 4

1. Profile HMMs for sequence families

a) Define the match (M), insert (I), and delete (D) states for the multiple sequence alignment (MSA) shown in Figure 1 (5 pts)

b) Derive the parameters of the profile HMM for the MSA given in Figure 1 (50 pts)

I. Emission counts for match states
II. Emission counts for insert states
III. Counts of transitions between states
IV. Emission probabilities for match, insert, and hidden states

Figure 1. Multiple sequence alignment of five DNA sequences

T--CT-
-AA-TA
T--CTA
TC-G-A
C-CGAC

Feel free to use Durbin's Figure 5.7c format
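The counting steps in part (b) can be sketched in a few lines of Python. This is a minimal sketch under a stated assumption: a column is treated as a match column when more than half of its rows hold a residue, which is a common heuristic but is not fixed by the assignment. The function names (`classify_columns`, `state_path`, `count_parameters`) and the state labels `B`/`E` for begin/end are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Figure 1 alignment, one string per sequence; '-' marks a gap.
MSA = ["T--CT-", "-AA-TA", "T--CTA", "TC-G-A", "C-CGAC"]


def classify_columns(msa, gap="-"):
    """Label a column 'M' if more than half of the rows hold a residue, else 'I'.

    The one-half threshold is an assumed heuristic, not part of the assignment.
    """
    n_rows = len(msa)
    labels = []
    for column in zip(*msa):
        residues = sum(ch != gap for ch in column)
        labels.append("M" if residues > n_rows / 2 else "I")
    return labels


def state_path(seq, labels, gap="-"):
    """Translate one aligned row into its path: Begin, M/I/D states, End."""
    path, k = ["B"], 0
    for ch, label in zip(seq, labels):
        if label == "M":
            k += 1  # index of the current match column
            path.append(f"M{k}" if ch != gap else f"D{k}")
        elif ch != gap:
            path.append(f"I{k}")  # residue in an insert column
        # a gap in an insert column uses no state at all
    path.append("E")
    return path


def count_parameters(msa, gap="-"):
    """Return column labels, per-state emission counts, and transition counts."""
    labels = classify_columns(msa, gap)
    emissions = {}
    transitions = Counter()
    for seq in msa:
        path = state_path(seq, labels, gap)
        residues = [ch for ch in seq if ch != gap]
        emitting = [s for s in path if s[0] in "MI"]  # D, B, E emit nothing
        for state, ch in zip(emitting, residues):
            emissions.setdefault(state, Counter())[ch] += 1
        transitions.update(zip(path, path[1:]))
    return labels, emissions, transitions


labels, emissions, transitions = count_parameters(MSA)
print("columns:", "".join(labels))
for state in sorted(emissions):
    print(state, dict(emissions[state]))
for (src, dst), n in sorted(transitions.items()):
    print(src, "->", dst, n)
```

Probabilities would then follow by normalizing the counts for each state, with or without pseudocounts; the assignment does not say which, so that step is left out here.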

2. Provide a 1-1.5 page review of the paper "Genome-wide genetic marker discovery and genotyping using next-generation sequencing", available under this week's course content (35 pts)

Some guidelines:
- Underline main points of the paper.
- Keep your work structured.
- While focusing on the big picture, keep in mind our class is on statistical processes.

3. Use the file available in course content for this week to write and submit an R script which will (10 pts):

a. Define HMM model for Q4 in Homework 3
b. Parse the Homework 3 Q4 sequence to show sequence of hidden states using Viterbi algorithm

Homework 3 Solution Question 4:

(a) Define zero order Markov model for sequence2_A2, which represents portion of non-coding sequence of Mycobacterium tuberculosis (refer to course content)

Zero order for sequence2_A2:

        Count   Probability
P(A)    107     0.195255474
P(C)    156     0.284671533
P(G)    183     0.333941606
P(T)    102     0.186131387
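As a quick check on the table above, a zero order model is just the relative frequency of each base. The sketch below, in Python, recomputes the probabilities from the listed counts; the variable names are hypothetical.

```python
# Base counts for sequence2_A2, copied from the table above.
counts = {"A": 107, "C": 156, "G": 183, "T": 102}

total = sum(counts.values())

# Zero order Markov model: P(base) = count(base) / total length.
probs = {base: n / total for base, n in counts.items()}

for base in "ACGT":
    print(f"P({base}) {counts[base]} {probs[base]:.9f}")
```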

b) Use zero order Markov models defined for sequence1_A2 and sequence2_A2 and apply Viterbi algorithm to find the most likely path for sequence CGCGTTACTTCAATG without taking frame into consideration

Assume:

Initial transition probabilities
a0c = a0n = 0.5

State transition probabilities
acc 0.55    acn 0.45
ann 0.5     anc 0.5

where aij is the transition probability, c = coding, n = non-coding

sequence CGCGTTACTTCAATG

path of hidden states CCCCNNCCNNCCCCC

#### SEE attached excel sheet for Viterbi algorithm for question 4 homework 3 from above.
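The Viterbi recursion for the two-state model above can be sketched as follows. This is a minimal Python sketch rather than the attached spreadsheet's method or the expected R solution: it works in log space, uses the state labels `C` (coding) and `N` (non-coding), and takes the transition values and the non-coding emission probabilities from this handout. The coding emission probabilities (the zero order model for sequence1_A2) are not reproduced here, so the caller has to supply them; the names `viterbi`, `START`, `TRANS`, and `EMIT_N` are hypothetical.

```python
import math

# Values copied from the handout: C = coding, N = non-coding.
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.55, "N": 0.45}, "N": {"C": 0.5, "N": 0.5}}
EMIT_N = {"A": 0.195255474, "C": 0.284671533, "G": 0.333941606, "T": 0.186131387}


def viterbi(seq, states, start, trans, emit):
    """Return the most likely state path for seq as a string of state labels.

    emit maps each state to a dict of per-base emission probabilities.
    All probabilities used must be strictly positive (logs are taken).
    """
    # Initialisation: start probability times emission of the first base.
    scores = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for ch in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            best = max(states, key=lambda r: scores[r] + math.log(trans[r][s]))
            new_scores[s] = (
                scores[best] + math.log(trans[best][s]) + math.log(emit[s][ch])
            )
            pointers[s] = best
        scores = new_scores
        backpointers.append(pointers)
    # Traceback from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))
```

A call would look like `viterbi("CGCGTTACTTCAATG", ["C", "N"], START, TRANS, {"C": emit_c, "N": EMIT_N})`, where `emit_c` stands for the coding model from sequence1_A2 and must be filled in from the course content.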