To be handed in on Tuesday, November 21, at 10 AM
in my mailbox (Old Chemistry 211).
(1) Guide Trees
Use the links on http://www.stat.duke.edu/courses/Fall00/sta294/Links.html.
Go to ClustalW and calculate a multiple alignment
of the following hemoglobin sequences:
>HUMAN P012922
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>GORILLA P02024
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>MACACA P02027
VHLTPEEKNAVTTLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKV
KAHGKKVLGAFSDGLNHLDNLKGTFAQLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK
EFTPQVQAAYQKVVAGVANALAHKYH
>LEMUR P02053
TLLSAEENAHVTSLWGKVDVEKVGGEALGRLLVVYPWTQRFFESFGDLSSPSAVMGNPKV
KAHGKKVLSAFSEGLHHLDNLKGTFAQLSELHCDKLHVDPQNFTLLGNVLVVVLAEHFGN
AFSPAVQAAFQKVVAGVANALAHKYH
>DONKEY P01959
VLSAADKTNVKAAWSKVGGNAGEFGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
(a) ClustalW calculates all pairwise alignment
scores and reports them in the command output. Calculate a distance
table using the (very ad hoc) formula: Distance = 10/Score.
(b) Calculate a guide tree using single linkage
clustering (follow the example given in class).
(c) Do you have any explanation for what you
observe? Do you think we need a different guide tree if we want to have
a good multiple alignment?
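Steps (a) and (b) can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure shown in class: the sequence labels A to D and all scores below are invented placeholders (not ClustalW output), and the helper names `distance_table` and `single_linkage` are hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise scores (NOT real ClustalW output); replace them
# with the values reported in the ClustalW command output.
scores = {
    ("A", "B"): 90.0, ("A", "C"): 40.0, ("A", "D"): 42.0,
    ("B", "C"): 41.0, ("B", "D"): 43.0, ("C", "D"): 80.0,
}

def distance_table(scores):
    """Apply the ad hoc formula Distance = 10 / Score to every pair."""
    return {frozenset(pair): 10.0 / s for pair, s in scores.items()}

def single_linkage(names, dist):
    """Agglomerative clustering where the distance between two clusters
    is the minimum distance between any two of their members.
    Returns the merge history as (cluster_a, cluster_b, distance)."""
    def link(pair):
        a, b = pair
        return min(dist[frozenset([x, y])] for x in a for y in b)

    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=link)
        merges.append((sorted(a), sorted(b), link((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

dist = distance_table(scores)
merges = single_linkage(["A", "B", "C", "D"], dist)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

The merge history read from top to bottom is the guide tree: early merges are the leaves joined first, and the last merge is the root.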
(2) Chimerical Alignment
Have a look at the following 3 sequences:
S1 ATTAG--GCATTAGCTTATG
S2 GTTGCCCTCAT----ATTCGTTC
--------------------------
S3 GTTCCCCCATTACCTTCTG
Obviously, the third sequence is very similar to the first sequence at its start and then becomes more similar to the second sequence. When aligning the third sequence to the first two in a progressive alignment step, it would be good if the alignment were determined by a pairwise comparison of S1 and S3 for the first half, after which the reference sequence is changed and the alignment is continued by aligning S2 and S3.
(a) Suppose there is only one change point: Describe
a dynamic programming algorithm that detects the change point.
(b) Suppose there is more than one change point:
Adjust your algorithm.
(c) How can we avoid changing the reference sequence
too often (wild hopping)?
(d) Come up with a complete progressive alignment
algorithm that uses the above algorithm for adding a single
new sequence to an existing partial multiple alignment.
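As a starting point for thinking about (a) to (c), here is a deliberately simplified sketch in Python. It assumes all sequences have equal length and are compared position by position without gaps, which the real problem does not allow, so it illustrates only the idea of tracking the current reference sequence as a dynamic programming state. The function name `best_reference_path`, the toy sequences, and the `hop_penalty` parameter are all made up for this illustration.

```python
def best_reference_path(target, refs, hop_penalty=2):
    """For each position of `target`, choose which reference to follow.

    Simplifying assumption: gap-free, position-by-position comparison.
    A match scores +1, a mismatch 0, and every change of reference
    costs `hop_penalty`. Returns (best score, reference index per position).
    """
    n, k = len(target), len(refs)

    def match(r, i):
        return 1 if refs[r][i] == target[i] else 0

    best = [match(r, 0) for r in range(k)]  # best score ending on ref r
    back = []                               # back-pointers per position
    for i in range(1, n):
        new_best, pointers = [], []
        for r in range(k):
            # Staying on r is free; hopping from p != r costs a penalty.
            prev, score = r, best[r]
            for p in range(k):
                if p != r and best[p] - hop_penalty > score:
                    prev, score = p, best[p] - hop_penalty
            new_best.append(score + match(r, i))
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace back from the best final state to recover the path.
    r = max(range(k), key=lambda x: best[x])
    total = best[r]
    path = [r]
    for pointers in reversed(back):
        r = pointers[r]
        path.append(r)
    path.reverse()
    return total, path

# Toy data (hypothetical): the target follows ref 0 first, then ref 1.
refs = ["AAAAAAGGGGGG", "TTTTTTCCCCCC"]
score, path = best_reference_path("AAAAAACCCCCC", refs, hop_penalty=2)
print(score, path)
```

Positions where consecutive entries of `path` differ are the change points. Setting `hop_penalty` to 0 lets the reference change at every mismatch, while a larger value suppresses hopping; a full answer would still have to combine this state with gaps in the alignment recursion.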
(3) Sum of pairs score
We have the following scores:
match=1
mismatch=0
gap of length n = -n
(a) Calculate the sum of pairs score for the following
multiple alignment:
AGC-CCA
CACCCA-
AGAA--A
(b) Give an example for a multiple alignment of
these sequences with a higher sum of pairs score.
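The sum of pairs score can be checked mechanically. The Python sketch below uses the scoring scheme above (match 1, mismatch 0, and a gap of length n costing n, that is 1 per gap character) on a small made-up alignment, not the one from (a). It assumes that columns in which both sequences of a pair have a gap are dropped from that pair, which is a common convention but is not stated in the problem. The function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score the pairwise alignment induced by two rows of an alignment."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue          # assumption: gap-gap columns are dropped
        elif x == "-" or y == "-":
            total += gap      # linear gap cost: -1 per gap character
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def sum_of_pairs(rows):
    """Add up the induced pairwise scores over all pairs of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

# Toy alignment (hypothetical, not the one from the exercise).
toy = ["AC-G",
       "ACTG",
       "A-TG"]
print(sum_of_pairs(toy))
```

For the toy alignment the three pairs score 2, 0 and 2, so the total is 4. Comparing two candidate alignments of the same sequences is then a matter of calling `sum_of_pairs` on each.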
(4) Score matrix
Suppose all four letters A, T, C, G have the same
probability in a stochastic background model for DNA. As a model for related
sequences we use the probabilities m(i,i)=0.19 and m(i,j)=0.02
if i and j are different. Remember that m(i,j) denotes
the joint probability of observing letter i in the first
sequence and letter j in the second sequence of a correct alignment.
(a) Calculate a score matrix based on these models.
(b) What is the transition matrix P of the underlying
Markov chain for this model?
(c) Calculate a new score matrix that is based
on P^4.
(d) What difference in local alignments do you
expect?
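The quantities in (a) to (c) can be computed with a short Python sketch. It rests on assumptions that the problem does not fix: scores are taken as log-odds ratios log2(m(i,j) / (q(i) q(j))) with base 2 chosen arbitrarily, the transition matrix is read as P(i,j) = m(i,j) / q(i), and the joint probabilities after four steps are taken as q(i) P^4(i,j). The names `log_odds`, `transition`, `matmul` and `matpow` are hypothetical helpers.

```python
import math

LETTERS = "ATCG"
q = {a: 0.25 for a in LETTERS}                    # uniform background
m = {(a, b): 0.19 if a == b else 0.02             # joint model from the text
     for a in LETTERS for b in LETTERS}

def log_odds(joint):
    """Score s(i,j) = log2( joint(i,j) / (q(i) * q(j)) ); base 2 is a choice."""
    return {(a, b): math.log2(joint[a, b] / (q[a] * q[b])) for a, b in joint}

def transition(joint):
    """Conditional probabilities P(i,j) = joint(i,j) / q(i); rows sum to 1."""
    return [[joint[a, b] / q[a] for b in LETTERS] for a in LETTERS]

def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(x, n):
    """n-th power of a square matrix by repeated multiplication (n >= 1)."""
    result = x
    for _ in range(n - 1):
        result = matmul(result, x)
    return result

S = log_odds(m)                                   # score matrix for (a)
P = transition(m)                                 # transition matrix for (b)
P4 = matpow(P, 4)
m4 = {(a, b): q[a] * P4[i][j]                     # joint model after 4 steps
      for i, a in enumerate(LETTERS) for j, b in enumerate(LETTERS)}
S4 = log_odds(m4)                                 # score matrix for (c)

print(round(S["A", "A"], 3), round(S["A", "T"], 3))
print(round(S4["A", "A"], 3), round(S4["A", "T"], 3))
```

Comparing the printed match and mismatch scores of `S` and `S4` gives a concrete handle on (d): the matrix based on P^4 rewards matches less and penalizes mismatches less severely.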