Assignment 2

To be handed in by Tuesday, November 21, 10 AM, in my mailbox (Old Chemistry 211).

(1) Guide Trees
Use the links on http://www.stat.duke.edu/courses/Fall00/sta294/Links.html.
Go to ClustalW and calculate a multiple alignment of the following hemoglobin sequences:

>HUMAN P012922
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>GORILLA P02024
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>MACACA P02027
VHLTPEEKNAVTTLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKV
KAHGKKVLGAFSDGLNHLDNLKGTFAQLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK
EFTPQVQAAYQKVVAGVANALAHKYH
>LEMUR P02053
TLLSAEENAHVTSLWGKVDVEKVGGEALGRLLVVYPWTQRFFESFGDLSSPSAVMGNPKV
KAHGKKVLSAFSEGLHHLDNLKGTFAQLSELHCDKLHVDPQNFTLLGNVLVVVLAEHFGN
AFSPAVQAAFQKVVAGVANALAHKYH
>DONKEY P01959
VLSAADKTNVKAAWSKVGGNAGEFGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR

(a) ClustalW calculates all pairwise alignment scores and reports them in its command output. Calculate a distance table using the (very ad hoc) formula: Distance = 10/Score.
(b) Calculate a guide tree using single linkage clustering (follow the example given in class).
(c) Do you have any explanation for what you observe? Do you think we need a different guide tree if we want a good multiple alignment?
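
The two steps in (a) and (b) can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure from class: the sequence names A to D and their scores are made up for the example, and the function names are hypothetical. Replace the scores with the pairwise scores that ClustalW reports.

```python
from itertools import combinations


def distance_table(scores):
    """Turn pairwise alignment scores into distances via Distance = 10 / Score."""
    return {frozenset(pair): 10.0 / s for pair, s in scores.items()}


def linkage(a, b, dist):
    """Single linkage: distance between clusters is the smallest member distance."""
    return min(dist[frozenset((x, y))] for x in a for y in b)


def single_linkage(names, dist):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(p[0], p[1], dist))
        merges.append((set(a), set(b), linkage(a, b, dist)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical scores, NOT the ClustalW output for the hemoglobin sequences.
scores = {
    ("A", "B"): 50.0, ("A", "C"): 20.0, ("A", "D"): 10.0,
    ("B", "C"): 25.0, ("B", "D"): 10.0, ("C", "D"): 40.0,
}
merges = single_linkage("ABCD", distance_table(scores))
for left, right, d in merges:
    print(sorted(left), sorted(right), round(d, 3))
```

The order of the merges, read from first to last, gives the guide tree from the leaves up to the root.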

(2) Chimerical Alignment
Have a look at the following 3 sequences:

S1 ATTAG--GCATTAGCTTATG
S2 GTTGCCCTCAT----ATTCGTTC
--------------------------
S3 GTTCCCCCATTACCTTCTG

Obviously, the third sequence is very similar to the first sequence at its start and then becomes more similar to the second sequence. When aligning the third sequence to the first two in a progressive alignment step, it would be good if the alignment were determined by a pairwise comparison of S1 and S3 for the first half, after which the reference sequence is changed and the alignment is continued by aligning S2 and S3.

(a) Suppose there is only one change point: Describe a dynamic programming algorithm that detects the change point.
(b) Suppose there is more than one change point: Adjust your algorithm.
(c) How can we avoid changing the reference sequence too often (wild hopping)?
(d) Come up with a complete progressive alignment algorithm that uses the above algorithm for adding a single new sequence to an existing partial multiple alignment.
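
As a starting point for thinking about (a) to (c), here is a deliberately simplified sketch. It assumes that a similarity score of the new sequence against each reference is already known for every position, which sidesteps the real difficulty that the alignment itself is unknown. The function names and the switch penalty are hypothetical and only illustrate the dynamic programming idea.

```python
def single_change_point(a, b):
    """Best k such that reference 1 is used for positions < k and reference 2 after.

    a[t], b[t] are per-position scores against reference 1 and reference 2.
    """
    total = sum(b)                 # k = 0: reference 2 is used throughout
    best_k, best = 0, total
    for k in range(1, len(a) + 1):
        total += a[k - 1] - b[k - 1]   # move position k-1 over to reference 1
        if total > best:
            best_k, best = k, total
    return best_k, best


def reference_path(scores, switch_penalty):
    """Viterbi-style DP: pick one reference per position, paying for each switch."""
    n_refs, n = len(scores), len(scores[0])
    best = [scores[r][0] for r in range(n_refs)]
    back = []
    for t in range(1, n):
        new_best, ptr = [], []
        for r in range(n_refs):
            cands = [best[p] - (switch_penalty if p != r else 0.0)
                     for p in range(n_refs)]
            p = max(range(n_refs), key=cands.__getitem__)
            new_best.append(cands[p] + scores[r][t])
            ptr.append(p)
        best = new_best
        back.append(ptr)
    # Trace back from the best final reference.
    r = max(range(n_refs), key=best.__getitem__)
    path = [r]
    for ptr in reversed(back):
        r = ptr[r]
        path.append(r)
    path.reverse()
    return path


# Made-up per-position scores: reference 1 fits first, reference 2 later.
a = [1, 1, 1, 0, 0, 0]
b = [0, 0, 0, 1, 1, 1]
print(single_change_point(a, b))
print(reference_path([a, b], switch_penalty=0.5))
```

Raising switch_penalty makes each change of reference more expensive, which is one possible way to suppress wild hopping on noisy scores.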

(3) Sum of pairs score
We have the following scores:
match=1
mismatch=0
gap of length n = -n

(a) Calculate the sum of pairs score for the following multiple alignment:

AGC-CCA
CACCCA-
AGAA--A

(b) Give an example for a multiple alignment of these sequences with a higher sum of pairs score.
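
To check hand calculations, the sum of pairs score can be computed mechanically. The sketch below makes two assumptions that the problem statement does not spell out: a gap of length n is charged as n single gap characters at -1 each, and columns in which both sequences of a pair have a gap are ignored for that pair. The toy alignment is made up and is not the one from (a).

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=0, gap=-1):
    """Score one pair of aligned rows, skipping columns where both have a gap."""
    score = 0
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            continue               # gap against gap: not part of this pair's alignment
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows):
    """Sum the pairwise scores over all pairs of rows in the alignment."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(pair_score(x, y) for x, y in combinations(rows, 2))


# Toy alignment for illustration only.
toy = ["AC-T", "A-GT", "ACGT"]
print(sum_of_pairs(toy))  # 0 + 2 + 2 = 4
```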

(4) Score matrix
Suppose all four letters A, T, C, G have the same probability in a stochastic background model for DNA. As a model for related sequences we use the probabilities m(i,i)=0.19 and m(i,j)=0.02 if i and j are different. Remember that the m(i,j) denote the joint probability of observing letter i in the first sequence and letter j in the second sequence of a correct alignment.

(a) Calculate a score matrix based on these models.
(b) What is the transition matrix P of the underlying Markov chain for this model?
(c) Calculate a new score matrix that is based on P^4.
(d) What differences in the resulting local alignments do you expect between the two score matrices?
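
The calculations in (a) to (c) can be checked numerically. The sketch below rests on assumptions that should be compared with the definitions from class: scores are log-odds ratios log2(m(i,j) / (q(i) q(j))) with background q(i) = 0.25, the transition matrix is P(i,j) = m(i,j) / q(i), and the joint probabilities after four steps are q(i) P^4(i,j). All function names are hypothetical.

```python
import math

N = 4                      # letters A, T, C, G
Q = [0.25] * N             # uniform background model
M = [[0.19 if i == j else 0.02 for j in range(N)] for i in range(N)]


def log_odds(joint, q):
    """Score matrix s(i,j) = log2( joint(i,j) / (q(i) q(j)) ), in bits."""
    n = len(q)
    return [[math.log2(joint[i][j] / (q[i] * q[j])) for j in range(n)]
            for i in range(n)]


def transition_matrix(joint, q):
    """Conditional probabilities P(i,j) = joint(i,j) / q(i)."""
    n = len(q)
    return [[joint[i][j] / q[i] for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, k):
    """k-th matrix power for k >= 1 by repeated multiplication."""
    result = p
    for _ in range(k - 1):
        result = mat_mul(result, p)
    return result


P = transition_matrix(M, Q)
P4 = mat_pow(P, 4)
M4 = [[Q[i] * P4[i][j] for j in range(N)] for i in range(N)]  # joint after 4 steps
S1 = log_odds(M, Q)
S4 = log_odds(M4, Q)

for name, s in (("based on P", S1), ("based on P^4", S4)):
    print(name, "match:", round(s[0][0], 3), "mismatch:", round(s[0][1], 3))
```

Comparing the match and mismatch entries of the two matrices is a useful first step toward answering (d).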