Project 2: Genomes Galore

Deliverable: Bring to class on Monday, 15 February a stapled turn-in containing your answers to all of the questions (which should all be in your project2.py file).

Purpose

Preparation

This project assumes you have completed Udacity cs101 Lesson 3: How to Manage Data (Notes) and the Lesson 3: Problem Set, except for the "three gold stars" problems (the last two of the problem set), which you are not expected to do. If you have not already done this, you are strongly encouraged to do so before continuing with this assignment.

Collaboration Policy

For this assignment, you may discuss the assignment with anyone you want and get help from others, so long as it is help in the spirit of learning how to do things yourself, not getting answers you don't understand. You should understand everything you turn in well enough to be able to produce it completely on your own.

Remember to follow the course pledge you read and signed at the beginning of the semester. For this assignment, you may consult any outside resources you wish, including books, papers, web sites, and people, except for materials from previous cs1120 courses or direct solutions to the given problems. You may consult an outside person (e.g., a friend who is a CS major but is not in this class) who is not a member of the course staff, but that person cannot type anything in for you; all work must remain your own, and outside sources should never give you specific answers to problem set questions.

If you use resources other than the class materials, lectures and course staff, you should document this in your turn-in.

You are strongly encouraged to take advantage of the scheduled help hours and office hours for this course.

Eliminating Listlessness

Before getting into genomics, it is important to have a good understanding of how the List datatype works in Python and how to use it.

Exercise A: For this exercise, for each Python fragment below:

  1. Without using the Python interpreter, predict whether the fragment is a valid Python expression. If it is valid, derive its expected value. If it is invalid, understand why and guess what error the Python interpreter will produce.

  2. Then, try evaluating the fragment in your Python interpreter (this is easiest to do by just running the fragment in the Python shell).

  3. If the value or error produced by the Python interpreter does not match your expectations, figure out why. You should play around with evaluating other expressions to make sure you understand why the given fragment evaluates the way it does.

a)     [1, 2] + 3
b)     [1, 2] + [3]
c)     [1] + [2, 3]
d)     1 + [2, 3]
e)     [1,2][0]
f)     "dna"[1:]
g)     ([1,2][1:])[0]
h)     len([1,2,3][1:])
i)     len([1,2,3][0])
j)     len([1,2,3] + 4)
k)     len([6,7,8] + [])

It is not necessary to turn in anything for this question, but you should ask questions about anything that does not make sense or is surprising.

Project 2 Repository

Set up your project2 repository in the same way as you did for project 1:

The project2 repository contains these files:

Genomics Background

For this project, you will develop functions to solve some problems in genomics. The main problem we want to solve is known as the sequence alignment problem.

Sequence alignment is used to compare genomes to understand how they might be related, both in terms of their function and how they might have evolved from common ancestors.

Most information in biology is encoded in DNA, a nucleic acid with two very useful properties for storing and processing information. The first is that DNA is very stable, so it preserves information for a long time. This is why it was possible to sequence DNA from a woolly mammoth, even though it was over 300,000 years old. DNA is also an incredibly dense way of storing information — a single cubic millimeter of DNA (weighing about a milligram) can reliably store about a quadrillion bits of data!

It has not escaped our notice that the specific pairing that we have postulated immediately suggests a possible copying mechanism for the genetic material.
James Watson and Francis Crick, Molecular Structure of Nucleic Acids, Nature, 25 April 1953.

Another important property of DNA is that it can be reliably replicated. DNA is composed of two chains of nucleotides that are intertwined in a double helix. The chains are connected by chemical bonds. Each nucleotide is one of four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).

Because of their chemical structure, only certain pairs of the bases can form bonds — Adenine bonds with Thymine, and Cytosine bonds with Guanine. We call bases that bond with each other complementary: the complement of A is T, the complement of T is A, the complement of C is G, and the complement of G is C.

These bonds connect the two strands of the double helix, so the two chains are redundant. The second chain contains no extra information: it is completely determined by the first chain because of the chemical bonds. For example, anywhere we know there is an A in the first chain, we know there is a T in the other chain. This means that DNA can be copied by unraveling the strands to produce two separate strands. The nucleotides in each strand will then form chemical bonds with their complementary bases, thus forming a new double-stranded DNA that is equivalent to the first.

We use the strings 'A', 'C', 'G', and 'T' to represent the four bases. You can test to see if two strings are equal with the double equal sign ==. For example:

>>> 'A' == 'A'
True
>>> 'A' == 'C'
False
>>> 'ACAT' == 'ADOG'
False

Problem 1: Define a function nucleotide_complement that takes as input a single base represented by one of 'A', 'C', 'T', or 'G' and returns the complement of that base (also a one-letter string).

You should get these interactions:

>>> nucleotide_complement('A')
'T'
>>> nucleotide_complement('G')
'C'
>>> nucleotide_complement('C')
'G'
>>> nucleotide_complement('T')
'A'

Paranoid programmers (which should be everyone!) will be worried about what happens if the input to your function is not a valid base (e.g., nucleotide_complement('U')). For now, it is okay to have any behavior (including generating an error or launching the missiles) for an invalid input, but it is essential as a programmer to be thinking about what will happen if someone runs your code with a bad input.
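For example, here is a sketch of what a more defensive version might look like. This is just an illustration, not a requirement for Problem 1; the name checked_complement and the choice to raise a ValueError are our own, not part of the assignment:

def checked_complement(base):
    # Explicitly reject anything that is not one of the four valid bases,
    # instead of silently returning a wrong answer.
    if base == 'A':
        return 'T'
    if base == 'T':
        return 'A'
    if base == 'C':
        return 'G'
    if base == 'G':
        return 'C'
    raise ValueError("not a valid DNA base: " + repr(base))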

We'll want to find the complement of an entire DNA sequence, not just a single nucleotide. Conveniently, Python allows you to treat lists and strings uniformly in many cases (where a string behaves like a list of characters):

>>> 'CAT'[0]
'C'
>>> ['C', 'A', 'T'][0] 
'C'
>>> 'CAT'[1:] 
'AT'
>>> ['C', 'A', 'T'][1:] 
['A', 'T']
>>> 'DOG' + 'COW'
'DOGCOW'
>>> ['D', 'O', 'G'] + ['C', 'O', 'W']
['D', 'O', 'G', 'C', 'O', 'W']

Note that lists of single-character strings and strings are not the same, though. For example,

['D', 'O', 'G'] + 'COW'

produces an error, since it is not possible to concatenate a list and a string.
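
If you do need to combine a list and a string, the built-in list function first converts the string into a list of its one-character strings:

>>> list('COW')
['C', 'O', 'W']
>>> ['D', 'O', 'G'] + list('COW')
['D', 'O', 'G', 'C', 'O', 'W']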

Problem 2. Define a function sequence_complement that takes as input a string representing a DNA sequence (or a list of letters representing a DNA sequence: if you write your function in a straightforward way, the same code will work in both cases!), and returns a list of symbols that is the complement of that sequence.

For example,

>>> sequence_complement(['A','C','A','T'])
['T', 'G', 'T', 'A'] 
>>> sequence_complement([])
[] 
>>> sequence_complement("GATTACA")
['C', 'T', 'A', 'A', 'T', 'G', 'T']
>>> sequence_complement("")
[] 

Hint 1: the body of this function can be very short!

Hint 2: the test for your base case should work for both strings (where the empty string is the base case) and lists (where the empty list is the base case). So, neither sequence == [] (which only works for lists) nor sequence == "" (which only works for strings) will work for both, but a simple test based on len(sequence) will!
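
A quick experiment in the interpreter (nothing to turn in) shows why a test based on len works uniformly for both types:

>>> len('')
0
>>> len([])
0
>>> len("GATTACA")
7
>>> len(['A', 'C', 'A', 'T'])
4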

Hamming Distance

The DNA copying mechanism is quite reliable, making about 1 error for every 100,000 nucleotides. The human genome (that is, all of a human's genetic information) is about 3 billion base pairs long (only a small fraction of which is ever actually translated into proteins), though, so that means every time a human cell (which contains two versions of the genome, one inherited from each parent) divides there would be about 120,000 errors.

Fortunately, biology also has mechanisms for error-correction that fix most of the copying mistakes. After the error correction, the remaining number of mistakes is about one per 100 million bases copied (so, about 60 mistakes across a human genome). In addition to copying errors, exposure to radiation and other environmental factors can alter the genome. (For more on this, see DNA Replication and Causes of Mutation, Nature Education, 2008.)

The mutations that remain after the error-correction are mostly inconsequential, but some of them change proteins encoded in the DNA that are produced by the cell. Most consequential changes are disastrous, resulting in a cell that cannot develop into a normal organism. In other cases, a single mutation does not prevent a viable organism from developing but results in a disease (cystic fibrosis, sickle-cell anemia, and color blindness are examples of conditions that result from a single mutation changing a base pair in the genome in a way that leaves a cell unable to produce an important protein).

Very occasionally, the copying mistakes turn out to be useful and lead to an organism that is more likely to survive and reproduce than organisms whose DNA does not have the mutations. As a result of natural selection, the new, mutated, version of the DNA will (over many generations) become the dominant version.

A very rough measure of the distance between two nucleotide sequences is to just count the number of positions where the bases are different. This is known as the Hamming distance. (The Hamming distance is very useful for error-correcting codes and cryptography, but, as we will see soon, not actually very useful in genomics. It was developed by Richard Hamming during work on error-correcting codes for communication at Bell Labs in the 1940s.)

Problem 3. Define a function count_matches that takes as inputs a list and a value, and outputs the number of elements in the list that match the value.

For example:

>>> count_matches(['A','C','A','T'], 'A')
2
>>> count_matches([], 'A')
0
>>> count_matches(['A','C'], 'G')
0

Note that count_matches is not actually useful for defining Hamming distance, but this question will give you practice with an easier function that operates on lists before trying Hamming distance.

Problem 4. Define a function hamming_distance that takes as inputs two lists (or two strings; your function should work for both, and will if you write it sensibly) and outputs the number of positions where the elements are different.

For your hamming_distance function, you may assume that both inputs are the same length (although it would be even better to define a function that does not assume this!).

Here are some examples:

>>> hamming_distance([], [])
0
>>> hamming_distance(['A', 'C', 'A', 'T'], ['A', 'C', 'T', 'A'])
2       # last two elements do not match
>>> hamming_distance("erode", "geode")
2       # first two characters do not match

Since Hamming distance is a common and useful function, it is not hard to find implementations of it by searching the web (including a Python implementation on the wikipedia page). You should write your own implementation, and must understand everything in your code. After you've written your own implementation using things we've learned so far in class, you can look at other implementations if you want.

Levenshtein Distance

In reality, however, not all copying mistakes are point mutations. It is also (relatively) common for there to be a copying error where one or more bases are skipped (a deletion) or added (an insertion) in the (mis-)copy. This makes it tougher to measure how closely related two sequences are. For example, a single base insertion at the beginning of a sequence makes every subsequent base a mismatch, so the Hamming distance would be (nearly) the length of the rest of the sequence even though there was only a single copying mistake.

To account for this, we need a more complex way of measuring distance, known as Levenshtein distance or, more simply, edit distance.

The edit distance metric accounts for the three different types of copying errors that might occur:

  1. a base could be skipped in the copy (deletion)
  2. a base could be inserted between two of the original bases (insertion)
  3. or a base could be replaced with a different base (mutation).

To understand how closely related two DNA sequences are, a biologist wants to know the number of copying errors that separate them. (In reality, since both sequences may have started from a common ancestor, the errors happened along the paths to both of the sequences. But an insertion in one sequence is similar to a deletion in the other, so it is still useful to just count the number of copying errors between the two sequences being compared.)

The goal of sequence alignment is to find the minimum number of edits necessary to make two (or more) sequences identical. Biologists usually think of this as inserting gaps in the sequences to make them line up better (hence the name, alignment). (In real biology, not all changes are equally likely and insertions and deletions tend to occur in chunks of multiple bases at a time, so more complex edit distance metrics are used instead of simple Levenshtein distance. See the bonus questions to try a more realistic model.)

Exercise B. Manually figure out the Levenshtein distance for each pair of sequences below. (You don't need to turn anything in for this, but should make sure your answers are correct, since you will use them to test your code.)

(a) ACAT and ACA
(b) AACCT and CCCCG
(c) GATTACA and GACACA
(d) ACAT and C
(e) AAAAAAAAAAAAAAAAAA and AAAAAAAAAAAAAAAAAA

Breaking Down the Problem

As President Obama explained, computer scientists learn how to solve difficult problems by breaking them into smaller pieces.

Since edit distance involves sequences of bases, a smaller problem would just be a shorter sequence. Let's see if we can define the edit distance of two sequences in terms of the edit distance of two shorter sequences.

Suppose we are given two sequences: x, with bases x1 x2 x3 ... xn, and y, with bases y1 y2 y3 ... ym. We are looking for the minimum number of edits needed to transform x into y.

Then the edit distance can be computed as follows:

  1. If x is empty (i.e., n is zero, so there are no bases in x), then you have to insert all of the bases in y to make y (starting from x, the empty sequence). Hence, the edit distance is m.

  2. If y is empty (i.e., m is zero), then we need to delete all of the bases in x to produce y. Hence, the edit distance is n.

  3. Otherwise, both x and y have at least one base (x1 and y1). We have four possibilities to consider, and should pick the one that results in the lowest edit distance. The four cases are:
    (1) The sequences x and y start with the same base (that is, x1 == y1). We can match up these two bases, and the edit distance is the distance between the remaining sequences: (x2 ... xn) and (y2 ... ym).
    (2) The shortest path starts by deleting x1. The edit distance is the distance between the remaining bases of x (x2 ... xn) and y, plus one because of the deletion of x1.
    (3) The shortest path starts by deleting y1 (equivalently, inserting a new base at the front of x to match y1). The edit distance is the distance between x and the remaining bases of y (y2 ... ym), plus one because of the deletion of y1.
    (4) The shortest path starts by mutating x1 into y1. Then, the edit distance is the distance between the remaining sequences, (x2 ... xn) and (y2 ... ym), plus one because of the mutation.

That's it! The parts that say "the edit distance between the remaining sequences" can be implemented as recursive function calls, where we are making progress because at least one of the input sequences is now one base shorter (for cases (1) and (4), both of the sequences are shorter). Since we keep making progress by shortening the sequences, eventually we will reach the base case where one of the sequences is empty and we know the answer right away (the length of the other sequence).
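Once you have attempted Problem 5 (below) on your own, you can compare your answer against this direct transliteration of the rules above into Python. It is only a sketch of one way to write it, and it is deliberately the slow version (see the Memoizing section for why):

def edit_distance(x, y):
    if len(x) == 0:
        return len(y)    # rule 1: insert all of the bases in y
    if len(y) == 0:
        return len(x)    # rule 2: delete all of the bases in x
    if x[0] == y[0]:
        # case (1): the first bases match, so no edit is needed for them
        return edit_distance(x[1:], y[1:])
    return min(1 + edit_distance(x[1:], y),       # case (2): delete x1
               1 + edit_distance(x, y[1:]),       # case (3): delete y1
               1 + edit_distance(x[1:], y[1:]))   # case (4): mutate x1 into y1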

Problem 5. Define a function, edit_distance, that takes as input two lists (or two strings — your code should work for both) and returns the Levenshtein distance between those sequences. (This is challenging! There are some hints below; don't get frustrated, and take advantage of the available help.)

Use min. Python has a built-in min function that takes any number of arguments and returns the value of the smallest one:

>>> min(5, 7) 
5
>>> min(5, 1, 7, 3)
1
>>> min(min(5, 2), min(7, 3))
2

Test as you code. Test your function on simple examples first, and confirm that it works, before trying more complex examples. You should put your testing code in a function, so you can re-execute it easily when you change your edit_distance function. Start with test cases where the sequences are empty, then move on to test cases with one base in each sequence, before trying tests with longer sequences.

Here's a suggestion for how to write a testing function:

def test_distance():
    def test_one(s, t, dist):
        print("Testing: " + s + " -> " + t + " == " + dist)
        assert edit_distance(s, t) == dist
        print("Correct!")
    # Now, we can use test_one to check the distance between any two strings
    test_one('', '', 0)
    test_one('', 'A', 1)
    test_one('A', '', 1)
    test_one('T', 'T', 0)
    # ...
    test_one('GATTACA', 'GACACA', 2)

The way edit_distance is described involves a huge amount of computation, even for fairly short sequences. So, don't be surprised if your function doesn't finish when you try it on longer sequences (although it should finish quickly for sequences of length up to 4 or 5 bases). In the next section, we look at how to make it execute more quickly (but make sure you have a correct, slow implementation first).
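
If you want to measure this for yourself, here is a rough timing sketch using the time module (time_distance is our own helper name, and exact times will vary by machine):

import time

def time_distance(s, t):
    # Measure how long one edit_distance call takes, in wall-clock seconds.
    start = time.time()
    result = edit_distance(s, t)
    print(s, t, "->", result, "took", time.time() - start, "seconds")

time_distance('AAAAA', 'GGGGG')
time_distance('AAAAAAAAAA', 'GGGGGGGGGG')   # ten bases: millions of recursive calls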

Memoizing

We have a serious problem if our edit_distance function takes too long to execute on strings like AAAAAAAAAAAAAAAAAA (18 bases), since proteins contain hundreds of amino acids (each of which requires 3 bases to encode). If we want to solve any interesting problems in biology, we need an edit_distance function that is much faster than this one. (Soon, we will get more precise about what it costs to execute a procedure. This is the main focus of Lesson 5 of the cs101 course and Chapter 7 of the coursebook. But, for now, we just want to build an intuition about how much computation is going on when this function runs.)

Why is the edit_distance function you defined (assuming you followed the method in the provided description) so slow? Think about this yourself first before reading on.

The reason the recursive definition of edit distance as described is so slow is that it does a huge amount of redundant work. For example, suppose we are computing edit_distance("ACT","GCT"). Since the first bases ('A' and 'G') do not match, we want to find:

min((1 + edit_distance("CT","GCT")),
    (1 + edit_distance("ACT","CT")),
    (1 + edit_distance("CT","CT")))

To evaluate this application expression, the interpreter needs to evaluate all the subexpressions that are inputs to min.

The first operand subexpression is (1 + edit_distance("CT","GCT")), which requires evaluating edit_distance("CT","GCT"). As before, we need to find the minimum of the three options. In this case,

min((1 + edit_distance("T","GCT")),
    (1 + edit_distance("CT","CT")),
    (1 + edit_distance("T","CT")))

Note that one of these, (1 + edit_distance("CT","CT")), is identical to the third option for the first edit_distance call.

But, the interpreter doesn't know that. It just follows the evaluation rules, re-evaluating the same expression every time it appears. In evaluating a large edit_distance application, there are quadrillions of identical edit_distance evaluations, all of which the evaluator recomputes from scratch.

One solution is to store the results of previous evaluations, so they don’t need to be recomputed. This is known as memoizing. (An alternate approach, which requires rethinking the algorithm to do things in the reverse direction, is known as dynamic programming, but we won't get into that now.)

Memoizing is fairly easy to do in many programming languages. Python provides a way to "decorate" functions and indicate that they should be memoized.

We have provided code to do so below (and in memoize.py in the project2 repository). It is not necessary for you to understand how memoize is defined (indeed, it uses some structures and techniques that we have not yet introduced).

The memoize function takes as input a function and outputs a new function. The output function is similar to the input function, except whenever it is evaluated it stores the result in a table. If it is re-evaluated on the same inputs, instead of needing to evaluate the full function, it looks up the value from the previous evaluation in the table and evaluates to that value right away.

def memoize(function):
    """ Memoization decorator for a function taking one or more arguments. """
    class MemoStore(dict):
        """
        Dictionary to keep track of results from previous calls to function.
        """
        def __getitem__(self, *key):
            """
            Returns the result of applying the function to the arguments, *key.  If the
            result is already stored in the dictionary, returns it directly.  If not,
            __missing__ will be called to evaluate the function on the inputs, store that
            result in the dictionary, and return it.
            """
            return dict.__getitem__(self, key)

        def __missing__(self, key):
            """
            Adds the result of applying function to *key arguments to the dictionary, and
            returns the result.
            """
            ret = self[key] = function(*key)
            return ret

    return MemoStore().__getitem__

With memoize defined, we can decorate any function by writing @memoize before its definition (if you instead use import memoize, write @memoize.memoize). This means calls to the function will be replaced with calls to the new function that is the result of memoize(function).

Here's how to use it for your edit_distance function:

@memoize
def edit_distance(a,b):
    ...

After adding the @memoize decorator, you should be able to re-run all your previous tests with the same results (but faster execution, which will only be noticeable for longer sequences).
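
If you want to see the effect on a smaller example first, try memoizing a deliberately slow function. Here, slow_fib is our own toy example, unrelated to the project, and we assume memoize.py is imported as shown:

from memoize import memoize

@memoize
def slow_fib(n):
    # Without @memoize, this recomputes the same subproblems exponentially
    # many times and would effectively never finish for n = 100.
    if n < 2:
        return n
    return slow_fib(n - 1) + slow_fib(n - 2)

print(slow_fib(100))   # with @memoize, this finishes almost instantly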

Now you're ready to try a more ambitious (and realistic) example!

In the apoe.py file we have included some (abridged) actual gene data: a normal "mus musculus apolipoprotein E" gene (call it apoe) and the E4 variant of it associated with Alzheimer's disease.

An individual who has the E4 variant in both copies of their genome is about 30 times more likely to develop Alzheimer’s disease by age 75, compared to individuals who have both copies of their genome with the normal variant. With no copies of the E4 variant, someone has a 7% chance of developing Alzheimer’s; with one E4 variant, there is a 14% chance; with both copies as E4 variant, the chance of developing Alzheimer’s by age 75 is around 80%.

Problem 6. Compute the edit_distance between APOE_NORMAL and APOE_BAD using your memoized edit_distance function. (Without memoization, you'll probably have to terminate the Python process without getting a result.)
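
Assuming the sequences in apoe.py are named APOE_NORMAL and APOE_BAD as above (check the actual names in the provided file), the computation might look like:

from apoe import APOE_NORMAL, APOE_BAD

print(edit_distance(APOE_NORMAL, APOE_BAD))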

Remember to bring to class on Monday, 15 February, a stapled printout of all your answers (which should all be in your project2.py file).

Teasers

Edit distance and auto-correcting. Try searching for some misspelled phrases in Google, DuckDuckGo, or Bing. For example, tyler swaft ("Did you mean taylor swift?"), ooobbamma, or tom jeffson. Search engines use methods similar to edit distance to guess what you mean when your query doesn't match lots of documents. Try some experiments with your favorite search engine to see how many substitutions, insertions, and deletions it can handle (this will vary depending on how popular the search term you are close to is). For an excellent, more in-depth discussion of this, with an implementation of a full spelling corrector in 21 elegant lines of Python code, see Peter Norvig's How to Write a Spelling Corrector.

Evolutionary edit distance. We are simplifying things to assume all types of copying errors are equally likely. In biology, this is not the case, and it is more likely for a long sequence of bases to be inserted (or deleted) at once than for the same number of point mutations to occur. So, biological algorithms for measuring evolutionary distance, such as the Smith-Waterman algorithm, are a bit more complicated than the model we use, where single substitutions, insertions, and deletions all have the same cost (one edit). With the more sophisticated edit distance metrics used in biology, the cost is different for different substitutions (because of the way amino acids are encoded, many point mutations have no impact on the actual protein produced by a gene, which makes them "less costly"), and the cost of a multiple-base insertion or deletion is a function of its length (but less than the cost of that many independent single-base insertions).

List comprehensions. We have focused on writing recursive functions, using only simple Python constructs that are shared by nearly every programming language. A more Pythonic way to write functions that operate on every element of a list, such as sequence_complement or hamming_distance, would be to use list comprehensions. They allow us to turn a multi-line loop or recursive function into a single line of code (one that is harder to understand until you get comfortable reading list comprehensions, but that experienced Python programmers find easier to read). A fun tutorial on list comprehensions is Trey Hunner's Python List Comprehensions: Explained Visually. Try to rewrite the functions you defined using list comprehensions.
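
To get a flavor of the syntax on examples unrelated to the project:

>>> [x * x for x in range(5)]
[0, 1, 4, 9, 16]
>>> [c for c in 'GATTACA' if c == 'A']
['A', 'A', 'A']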

Dynamic Programming. Another way to compute functions like edit distance efficiently is to use dynamic programming. Like memoization, this avoids the redundant computation inherent in a straightforward recursive solution, but instead of storing results in the order they happen to be computed, it builds up a table of results in a particular order. The Wikipedia page has an explanation and many examples.
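
As a tiny taste of the idea on an unrelated problem, here is a bottom-up table for Fibonacci numbers (a sketch; fib_table is our own example name):

def fib_table(n):
    # Build the table of results in order, so each entry is computed once.
    table = [0, 1]
    while len(table) <= n:
        table.append(table[-1] + table[-2])
    return table[n]

print(fib_table(100))   # fast, with no recursion or memoization needed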

Credits: This problem set was originally developed for UVa cs1120 Fall 2011 by David Evans with help from Ivan Alagenchev, Jonathan Burket, Peter Chapman, Jiamin Chen, Joseph Featherston, Valerie Sapp, and Kristina Shichanin. Cameron Mura provided assistance with the biology (but is not responsible for any biological mistakes or gross oversimplifications here!). Wes Weimer revised it for Python in Fall 2012, and Yuchi Tian and David Evans revamped it for Spring 2016.