Assignment1

To be handed in Friday November 10 10 AM in my mailbox (old chemistry 211).

You may work on problems together, but write up the solutions on your own. Especially, team work of students with a background in life sciences together with students with a background in statistics, computer science or mathematics is highly encouraged. However, every student must be able to present his/her solutions in class.

(1) Surfing the internet
Use the links on http://www.stat.duke.edu/courses/Fall00/sta294/Links.htmland retrieve information on this human protein sequence from the internet.

MGDSEMAVFGAAAPYLRKSEKERLEAQTRPFDLKKDVFVPDDKQEFVKAKIVSREGGKVT
AETEYGKTVTVKEDQVMQQNPPKFDKIEDMAMLTFLHEPAVLYNLKDRYGSWMIYTYSGL
FCVTVNPYKWLPVYTPEVVAAYRGKKRSEAPPHIFSISDNAYQYMLTDRENQSILITGES
GAGKTVNTKRVIQYFAVIAAIGDRSKKDQSPGKGTLEDQIIQANPALEAFGNAKTVRNDN
SSRFGKFIRIHFGATGKLASADIETYLLEKSRVIFQLKAERDYHIFYQILSNKKPELLDM
LLITNNPYDYAFISQGETTVASIDDAEELMATDNAFDVLGFTSEEKNSMYKLTGAIMHFG
NMKFKLKQREEQAEPDGTEEADKSAYLMGLNSADLLKGLCHPRVKVGNEYVTKGQNVQQV
IYATGALAKAVYERMFNWMVTRINATLETKQPRQYFIGVLDIAGFEIFDFNSFEQLCINF
TNEKLQQFFNHHMFVLEQEEYKKEGIEWTFIDFGMDLQACIDLIEKPMGIMSILEEECMF
PKATDMTFKAKLFDNHLGKSANFQKPRNIKGKPEAHFSLIHYAGIVDYNIIGWLQKNKDP
LNETVVGLYQKSSLKLLSTLFANYAGADAPIEKGKGKAKKGSSFQTVSALHRENLNKLMT
NLRSTHPHFVRCIIPNETKSPGVMDNPLVMHQLRCNGVLEGIRICRKGFPNRILYGDFRQ
RYRILNPAAIPEGQFIDSRKGAEKLLSSLDIDHNQYKFGHTKVFFKAGLLGLLEEMRDER
LSRIITRIQAQSRGVLARMEYKKLLERRDSLLVIQWNIRAFMGVKNWPWMKLYFKIKPLL
KSAEREKEMASMKEEFTRLKEALEKSEARRKELEEKMVSLLQEKNDLQLQVQAEQDNLAD
AEERCDQLIKNKIQLEAKVKEMNERLEDEEEMNAELTAKKRKLEDECSELKRDIDDLELT
LAKVEKEKHATENKVKNLTEEMAGLDEIIAKLTKEKKALQEAHQQALDDLQAEEDKVNTL
TKAKVKLEQQVDDLEGSLEQEKKVRMDLERAKRKLEGDLKLTQESIMDLENDKQQLDERL
KKKDFELNALNARIEDEQALGSQLQKKLKELQARIEELEEELESERTARAKVEKLRSDLS
RELEEISERLEEAGGATSVQIEMNKKREAEFQKMRRDLEEATLQHEATAAALRKKHADSV
AELGEQIDNLQRVKQKLEKEKSEFKLELDDVTSNMEQIIKAKANLEKMCRTLEDQMNEHR
SKAEETQRSVNDLTSQRAKLQTENGELSRQLDEKEALISQLTRGKLTYTQQLEDLKRQLE
EEVKAKNALAHALQSARHDCDLLREQYEEETEAKAELQRVLSKANSEVAQWRTKYETDAI
QRTEELEEAKKKLAQRLQEAEEAVEAVNAKCSSLEKTKHRLQNEIEDLMVDVERSNAAAA
ALDKKQRNFDKILAEWKQKYEESQSELESSQKEARSLSTELFKLKNAYEESLEHLETFKR
ENKNLQEEISDLTEQLGSSGKTIHELEKVRKQLEAEKMELQSALEEAEASLEHEEGKILR
AQLEFNQIKAEIERKLAEKDEEMEQAKRNHLRVVDSLQTSLDAETRSRNEALRVKKKMEG
DLNEMEIQLSHANRMAAEAQKQVKSLQSLLKDTQIQLDDAVRANDDLKENIAIVERRNNL
LQAELEELRAVVEQTERSRKLAEQELIETSERVQLLHSQNTSLINQKKKMDADLSQLQTE
VEEAVQECRNAEEKAKKAITDAAMMAEELKKEQDTSAHLERMKKNMEQTIKDLQHRLDEA
EQIALKGGKKQLQKLEARVRELENELEAEQKRNAESVKGMRKSERRIKELTYQTEEDRKN
LLRLQDLVDKLQLKVKAYKRQAEEAEEQANTNLSKFRKVQHELDEAEERADIAESQVNKL
RAKSRDIGTKGLNEE

(a) What is the name of the protein?
(b) In which type of tissue is it mostly found?
(c) To which other human protein does it bind?
(d) What is the function of this protein, give a <=150 words description
how it works.
(e) What happens to people who have a defect in the encoding genomic sequence?
(f) Get homologous sequences from a pig, a rat and a hamster.
What are the Swissprot IDs?
(g) Compute and print a multiple alignment of these sequences.

(2) Sequence to structure deduction via alignment
An important application for aligning protein sequences is to deduce unknown secondary structure of one protein from known secondary structure of a second protein. There are essentially three types of secondary structure: alpha-helices, beta-sheets and loops (turns) . It is known that alpha-helices and beta-sheets are more highly conserved than loop regions. Hence there are typically less substitutions, insertions and deletions in these regions, these however may be very frequent in the loops. Assume we have two sequences S1 and S2, the secondary structure of S1 is known and the secondary structure of S2 is not known. We would like to use a sequence alignment of S1 and S2 for identifying alpha-helices, beta-sheets and loops in S2. The hope is that the regions with the alpha- and beta-structures in both sequences will align with each other.

(a) Describe briefly what alpha-helice, beta-sheets and loops are. Use your knowledge, the literature, the internet or consult a biologist (biochemist). Note that teamwork is allowed. Your answers should focus on the most important issues and should be shorter than 150 words per structure.
(b) Suggest a variant of a global alignment algorithm that encourages mismatch positions and gaps to be placed in loop regions. Give the details of this algorithm: What is the optimization problem? How is the computation done? If you use dynamic programming, note that a dynamic programming algorithm has three parts. What is the time and space compexity of your algorithm?

(3) Dynamic programming I
Given the sequences;
S1: AAALKYTVLT
S2: AMLKYTLVLS

(a) Calculate the edit distance between S1 and S2.
(b) Calculate a global minimal edit distance alignment between S1 and S2.
(c) How many cooptimal global alignments exist ?
(c) Calculate the shortest edit script that transforms S1 into S2.

To answer this questions you can ...
(i) ... write some code in any programming language of your choice. In this case, return the solutions and the code.
OR
(ii) ... do the computations manually. In this case, return the solutions and the calculation tables.

(4) Dynamic programming II
Assume we are given two sequences S1 and S2. Lets say the length of S1 is n1 and the length of S2 is n2. In each of these sequences we fix one position. Let's say S1(x1) 0<=x1<=n1 and S2(x2) 0<=x2<=n2.
Example:
S1: TTACTATCATAAT S2: TCATTACTCTAAT
Describe a dynamic programming algorithm, that finds among all alignments that match the characters at the fixed positions, the one with minimal edit distance. In the example the requirement is that the two underlined T's need to match in the resulting alignment.

(5) Distance based alignments and score based alignments
Consider the following scores for global alignment with affine gap penalties:

match:                   +3
mismatch:              +1
gap of length n :    -( 4 + n )

Find a distance such that every global minimal distance alignment is identical to the global maximal score alignment for the scores described above. Proof your result. Your distance should look like this:

match:                 0
mismatch:           some positive number
gap of length n    some positive number depending on n