You may work on problems together, but write up the solutions on your own. Especially, team work of students with a background in life sciences together with students with a background in statistics, computer science or mathematics is highly encouraged. However, every student must be able to present his/her solutions in class.
(1) Surfing the internet
Use the links on http://www.stat.duke.edu/courses/Fall00/sta294/Links.html
and
retrieve information on this human protein sequence from the internet.
MGDSEMAVFGAAAPYLRKSEKERLEAQTRPFDLKKDVFVPDDKQEFVKAKIVSREGGKVT
AETEYGKTVTVKEDQVMQQNPPKFDKIEDMAMLTFLHEPAVLYNLKDRYGSWMIYTYSGL
FCVTVNPYKWLPVYTPEVVAAYRGKKRSEAPPHIFSISDNAYQYMLTDRENQSILITGES
GAGKTVNTKRVIQYFAVIAAIGDRSKKDQSPGKGTLEDQIIQANPALEAFGNAKTVRNDN
SSRFGKFIRIHFGATGKLASADIETYLLEKSRVIFQLKAERDYHIFYQILSNKKPELLDM
LLITNNPYDYAFISQGETTVASIDDAEELMATDNAFDVLGFTSEEKNSMYKLTGAIMHFG
NMKFKLKQREEQAEPDGTEEADKSAYLMGLNSADLLKGLCHPRVKVGNEYVTKGQNVQQV
IYATGALAKAVYERMFNWMVTRINATLETKQPRQYFIGVLDIAGFEIFDFNSFEQLCINF
TNEKLQQFFNHHMFVLEQEEYKKEGIEWTFIDFGMDLQACIDLIEKPMGIMSILEEECMF
PKATDMTFKAKLFDNHLGKSANFQKPRNIKGKPEAHFSLIHYAGIVDYNIIGWLQKNKDP
LNETVVGLYQKSSLKLLSTLFANYAGADAPIEKGKGKAKKGSSFQTVSALHRENLNKLMT
NLRSTHPHFVRCIIPNETKSPGVMDNPLVMHQLRCNGVLEGIRICRKGFPNRILYGDFRQ
RYRILNPAAIPEGQFIDSRKGAEKLLSSLDIDHNQYKFGHTKVFFKAGLLGLLEEMRDER
LSRIITRIQAQSRGVLARMEYKKLLERRDSLLVIQWNIRAFMGVKNWPWMKLYFKIKPLL
KSAEREKEMASMKEEFTRLKEALEKSEARRKELEEKMVSLLQEKNDLQLQVQAEQDNLAD
AEERCDQLIKNKIQLEAKVKEMNERLEDEEEMNAELTAKKRKLEDECSELKRDIDDLELT
LAKVEKEKHATENKVKNLTEEMAGLDEIIAKLTKEKKALQEAHQQALDDLQAEEDKVNTL
TKAKVKLEQQVDDLEGSLEQEKKVRMDLERAKRKLEGDLKLTQESIMDLENDKQQLDERL
KKKDFELNALNARIEDEQALGSQLQKKLKELQARIEELEEELESERTARAKVEKLRSDLS
RELEEISERLEEAGGATSVQIEMNKKREAEFQKMRRDLEEATLQHEATAAALRKKHADSV
AELGEQIDNLQRVKQKLEKEKSEFKLELDDVTSNMEQIIKAKANLEKMCRTLEDQMNEHR
SKAEETQRSVNDLTSQRAKLQTENGELSRQLDEKEALISQLTRGKLTYTQQLEDLKRQLE
EEVKAKNALAHALQSARHDCDLLREQYEEETEAKAELQRVLSKANSEVAQWRTKYETDAI
QRTEELEEAKKKLAQRLQEAEEAVEAVNAKCSSLEKTKHRLQNEIEDLMVDVERSNAAAA
ALDKKQRNFDKILAEWKQKYEESQSELESSQKEARSLSTELFKLKNAYEESLEHLETFKR
ENKNLQEEISDLTEQLGSSGKTIHELEKVRKQLEAEKMELQSALEEAEASLEHEEGKILR
AQLEFNQIKAEIERKLAEKDEEMEQAKRNHLRVVDSLQTSLDAETRSRNEALRVKKKMEG
DLNEMEIQLSHANRMAAEAQKQVKSLQSLLKDTQIQLDDAVRANDDLKENIAIVERRNNL
LQAELEELRAVVEQTERSRKLAEQELIETSERVQLLHSQNTSLINQKKKMDADLSQLQTE
VEEAVQECRNAEEKAKKAITDAAMMAEELKKEQDTSAHLERMKKNMEQTIKDLQHRLDEA
EQIALKGGKKQLQKLEARVRELENELEAEQKRNAESVKGMRKSERRIKELTYQTEEDRKN
LLRLQDLVDKLQLKVKAYKRQAEEAEEQANTNLSKFRKVQHELDEAEERADIAESQVNKL
RAKSRDIGTKGLNEE
(a) What is the name of the protein?
(b) In which type of tissue is it mostly found?
(c) To which other human protein does it bind?
(d) What is the function of this protein, give
a <=150 words description
how it works.
(e) What happens to people who have a defect
in the encoding genomic sequence?
(f) Get homologous sequences from
a pig, a rat and a hamster.
What are the Swissprot
IDs?
(g) Compute and print a multiple alignment
of these sequences.
(2) Sequence to structure deduction via
alignment
An important application for aligning protein
sequences is to deduce unknown secondary structure of one protein from
known secondary structure of a second protein. There are essentially
three types of secondary structure: alpha-helices, beta-sheets and
loops
(turns) . It is known that alpha-helices and beta-sheets are more highly
conserved than loop regions. Hence there are typically less substitutions,
insertions and deletions in these regions, these however may be very frequent
in the loops. Assume we have two sequences S1 and S2, the secondary structure
of S1 is known and the secondary structure of S2 is not known. We would
like to use a sequence alignment of S1 and S2 for identifying alpha-helices,
beta-sheets and loops in S2. The hope is that the regions with the
alpha- and beta-structures in both sequences will align with each other.
(a) Describe briefly what alpha-helice,
beta-sheets and loops are. Use your knowledge, the literature, the internet
or consult a biologist (biochemist). Note that teamwork is allowed.
Your answers should focus on the most important issues and should be shorter
than 150 words per structure.
(b) Suggest a variant of a global alignment
algorithm that encourages mismatch positions and gaps to be placed in loop
regions. Give the details of this algorithm: What is the optimization problem?
How is the computation done? If you use dynamic programming, note that
a dynamic programming algorithm has three parts. What is the time and space
compexity of your algorithm?
(3) Dynamic programming I
Given the sequences;
S1: AAALKYTVLT
S2: AMLKYTLVLS
(a) Calculate the edit distance between S1 and
S2.
(b) Calculate a global minimal edit distance
alignment between S1 and S2.
(c) How many cooptimal global alignments exist
?
(c) Calculate the shortest edit script that transforms
S1 into S2.
To answer this questions you can ...
(i) ... write some code in any programming
language of your choice. In this case, return the solutions and the code.
OR
(ii) ... do the computations manually.
In this case, return the solutions and the calculation tables.
(4) Dynamic programming II
Assume we are given two sequences S1 and S2.
Lets say the length of S1 is n1 and the length of S2 is n2. In each
of these sequences we fix one position. Let's say S1(x1) 0<=x1<=n1
and S2(x2) 0<=x2<=n2.
Example:
S1: TTACTATCATAAT
S2: TCATTACTCTAAT
Describe a dynamic programming algorithm, that
finds among all alignments that match the characters at the fixed positions,
the one with minimal edit distance. In the example the requirement is that
the two underlined T's need to match in the resulting alignment.
(5) Distance based alignments and score
based alignments
Consider the following scores for global alignment
with affine gap penalties:
match:
+3
mismatch:
+1
gap of length n : -( 4 + n
)
Find a distance such that every global minimal distance alignment is identical to the global maximal score alignment for the scores described above. Proof your result. Your distance should look like this:
match:
0
mismatch:
some positive number
gap of length n some positive
number depending on n