Project 2: Genomes Galore
project2.py
file).
Purpose
- Practice programming with functions, including using
if
statements,while
loops, and definitions. - Become familiar with Lists and how they can be used to manage complex data.
- Understand and construct recursive definitions.
- Learn just enough genomics to be dangerous.
Preparation
This project assumes you have completed Udacity cs101 Lesson 3: How to Manage Data (Notes) and Lesson 3: Problem Set except for the "three gold stars" problems (the last two of the problem set) which you are not expected to do. If you have not already done that, it is strongly encouraged that you do so before continuing with this assignment.
Collaboration Policy
For this assignment, you may discuss the assignment with anyone you want and get help from others, so long as it is help in the spirit of learning how to do things yourself not getting answers you don't understand. You should understand everything you turn in for the assignment well enough to be able to produce it completely on your own.
Remember to follow the course pledge you read and signed at the beginning of the semester. For this assignment, you may consult any outside resources, including books, papers, web sites and people, you wish except for materials from previous cs1120 courses or direct solutions to the given problems. You may consult an outside person (e.g., another friend who is a CS major but is not in this class) who is not a member of the course staff, but that person cannot type anything in for you and all work must remain your own and outside sources should never give you specific answers to problem set questions.
If you use resources other than the class materials, lectures and course staff, you should document this in your turn-in.
You are strongly encouraged to take advantage of the scheduled help hours and office hours for this course.
Eliminating Listlessness
Before getting into genomics, it is important to have a good understanding of how the List datatype works in Python and how to use it.
-
Without using the Python interpreter, predict if the fragment is a valid Python expression. If it is valid, you should derive its expected value. If it is invalid, you should understand why and guess what error the Python interpreter will produce.
-
Then, try evaluating the fragment in your Python interpreter (this is easiest to do by just running the fragment in the Python shell).
-
If the value or error produced by the Python interpreter does not match your expectations, figure out why. You should play around with evaluating other expressions to make sure you understand what the given fragment is evaluating the way it does.
a) [1, 2] + 3
b) [1, 2] + [3]
c) [1] + [2, 3]
d) 1 + [2, 3]
e) [1,2][0]
f) "dna"[1:]
g) ([1,2][1:])[0]
h) len([1,2,3][1:])
i) len([1,2,3][0])
j) len([1,2,3] + 4)
k) len([6,7,8] + [])
It is not necessary to turn in anything for this question, but you should ask questions about anything that does not make sense or is surprising.
Project 2 Repository
Set up your project2 repository in the same way as you did for project 1:
-
Fork Project 2. Visit https://bitbucket.org/cs1120/project2/fork. Make sure to check the "This is a private repository" box to make your forked repository private (only visible to yourself and people you authorize to access it).
-
Clone to your machine. To have a working copy of the repository on your machine, clone the forked repository in SourceTree by selecting
File | New/Clone
and enter the URL of the cloned repository. You can copy this from the project page you see in bitbucket.org afer forking the provided Project 2 repository. It should be a URL likehttps://<your bitbucket id>@bitbucket.org/<your bitbucket id>/project2.git
.
The project2 repository contains these files:
project2.py
- This is the file you will edit and turn in (both electronically and as a printout) for this assignment.memoize.py
- Provides an implementation of a memoize decorator (this is used in the last part of the assignment).apoe.py
- Excepts from the base sequence for two variants of the APOE gene (associated with Alzheimer's disease).
Genomics Background
For this project, you will develop functions to solve some problems in genomics. The main problem we want to solve is known as the sequence alignment problem.
Sequence alignment is used to compare genomes to understand how they might be related, both in terms of their function and how they might have evolved from common ancestors.
Most information in biology is encoded in DNA, a nucleic acid with two very useful properties for storing and processing information. The first is that DNA is very stable, so preserves information for a long time. This is why it was possible to sequence DNA from a woolly mammoth, even though it was over 300,000 years old. DNA is also an incredibly dense way of storing information — a single cubic millimeter of DNA (weighing about one gram) can reliably store about a quadrillion bits of data!
Another important property of DNA is that it can be reliably replicated. DNA is composed of two chains of nucleotides that are intertwined in a double helix. The chains are connected by chemical bonds. Each nucleotide is one of four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
Because of their chemical structure, only certain pairs of the bases can form bonds — Adenine bonds with Thymine, and Cytosine bonds with Guanine. We call bases that bond with each other complementary: the complement of A is T, the complement of T is A, the complement of C is G, and the complement of G is C.
These bonds connect the two strands of the double helix, so the two chains are redundant. The second chain contains no extra information, it can be completely determined by the first chain because of the chemical bonds. For example, anywhere we know there is an A in the first chain, we know there is a T in the other chain. This means that DNA can be copied by unraveling the strands to produce two separate strands. The nucleotides in each strand will then form chemical bonds with their complementary bases, thus forming a new double-stranded DNA that is equivalent to the first.
We use the strings 'A'
, 'C'
, 'G'
, and 'T'
to represent the four
bases. You can test to see if two strings are equal with the double
equal sign ==
. For example:
>>> 'A' == 'A' True >>> 'A' == 'C' False >>> 'ACAT' == 'ADOG' False
nucleotide_complement
that takes as input a
single base represented by one of 'A'
, 'C'
, 'T'
, or 'G'
and returns the complement of that base
(also a one-letter string).
You should get these interactions:
>>> nucleotide_complement('A') 'T' >>> nucleotide_complement('G') 'C' >>> nucleotide_complement('C') 'G' >>> nucleotide_complement('T') 'A'
We'll want to find the complement of an entire DNA sequence, not just a single nucleotide. Conveniently, Python allows you to treat lists and strings uniformly in many cases (where a string behaves like a list of characters):
>>> 'CAT'[0] 'C' >>> ['C', 'A', 'T'][0] 'C' >>> 'CAT'[1:] 'AT' >>> ['C', 'A', 'T'][1:] ['A', 'T'] >>> 'DOG' + 'COW' 'DOGCOW' >>> ['D', 'O', 'G'] + ['C', 'O', 'W'] ['D', 'O', 'G', 'C', 'O', 'W']
Note that lists of single-character string and strings are not the same, though. For example,
['D', 'O', 'G'] + 'COW'
produces an error since it is not possible to concatenate a list and string.
sequence_complement
that takes as input a string
representing a DNA sequence (or a list of letters representing a DNA
sequence: if you write your function in a straightforward way, the same code will work in both cases!), and
a list of symbols that are the complement of that sequence.
For example,
>>> sequence_complement(['A','C','A','T']) ['T', 'G', 'T', 'A'] >>> sequence_complement([]) [] >>> sequence_complement("GATTACA") ['C', 'T', 'A', 'A', 'T', 'G', 'T'] >>> sequence_complement("") []
Hint 0: the test for your base case should work for both strings (the empty string is the base case) and lists (the empty list is the base case). So, neither sequence == []
(only works for lists) or sequence == '""'
will work for both, but a simple test based on len(sequence)
will!
Hamming Distance
The DNA copying mechanism is quite reliable, making about 1 error for every 100,000 nucleotides. The human genome (that is, all of a human’s genetic information) is about 3 billion base pairs long (only a small fraction of which is actually every translated into proteins), though, so that means every time a human cell (which contains two versions of the genome, one inherited from each parent) divides there would be about 120,000 errors.
Fortunately, biology also has mechanisms for error-correction that fix most of the copying mistakes. After the error correction, the remaining number of mistakes is about one per 100 million bases copied (so, about 60 mistakes across a human genome). In addition to copying errors, exposure to radiation and other environmental factors can alter the genome. (For more on this, see DNA Replication and Causes of Mutation, Nature Education, 2008.)
The mutations that remain after the error-correction are mostly inconsequential, but some of them change proteins encoded in the DNA that are produced by the cell. Most consequential changes are disastrous, resulting in a cell that cannot develop into a normal organism. In other cases, a single mutation does not prevent a viable organism from developing but results in a disease (e.g., cystic fibrosis, sickle-cell anemia, and color blindness are examples of conditions that result from a single mutation changing a base pair in the genome in a way that causes a cell to be unable to produce an important protein).
Very occasionally, the copying mistakes turn out to be useful and lead to an organism that is more likely to survive and reproduce than organisms whose DNA does not have the mutations. As a result of natural selection, the new, mutated, version of the DNA will (over many generations) become the dominant version.
A very rough measure of the distance between two nucleotide sequences is to just count the number of positions where the bases are different. This is known as the Hamming distance. (The Hamming distance is very useful for error-correcting codes and cryptography, but, as we will see soon, not actually very useful in genomics. It was developed by Richard Hamming during work on error-correcting codes for communication at Bell Labs in the 1940s.)
count_matches
that takes as
inputs a list and a value, and outputs the number of elements in the
list that match the value.
For example:
>>> count_matches(['A','C','A','T'], 'A') 2 >>> count_matches([], 'A') 0 >>> count_matches(['A','C'], 'G') 0
count_matches
is not actually useful for defining Hamming
distance, but this question will give you practice with an easier
function that operates on lists before trying Hamming distance.
hamming_distance
that takes
as inputs two lists (or two strings, your function should work for
both, and will if you write it sensibly) and outputs the number of
positions where the list elements are different.
For your hamming_distance
function, you may assume that both inputs
are the same length (although it would be even better to define a
function that does not assume this!).
Here are some examples:
>>> hamming_distance([], []) 0 >>> hamming_distance(['A', 'C', 'A', 'T'], ['A', 'C', 'T', 'A']) 2 # last two elements do not match >>> hamming_distance("erode", "geode") 2 # first two characters do not match
Levenshtein Distance
In reality, not all of the copying mistakes are point mutations, however. It is also (relatively) common for there to be a copying error where one or more bases are skipped (a deletion), or added (an insertion) into the (mis-)copy. This makes it tougher to measure how closely related two sequences are. For example, a single base insertion at the beginning of a sequence makes every subsequent base a mismatch and the Hamming distance would be (nearly) the length of the rest of the sequence even though there was only a single copying mistake.
To account for this we need a more complex way of measuring distance, known as Levenshtein distance, or more simply, edit distance.
The edit distance metric accounts for the three different types of copying errors that might occur:
- a base could be skipped in the copy (deletion)
- a base could be inserted between two of the original bases (insertion)
- or a base could be replaced with a different base (mutation).
To understand how closely related two DNA sequences are, a biologists wants to know the number of copying errors that separate them. (In reality, if both sequences may have started from a common ancestor, the errors happened along the paths to both of the sequences. But, an insertion in one sequence is similar to a deletion in the other sequence, so it is still useful to just look for the number of copying errors between the two comparison sequences.)
The goal of sequence alignment is to find the minimum number of edits necessary to make two (or more) sequences identical. Biologists usually think of this as inserting gaps in the sequences to make them line up better (hence the name, alignment). (In real biology, not all changes are equally likely and insertions and deletions tend to occur in chunks of multiple bases at a time, so more complex edit distance metrics are used instead of simple Levenshtein distance. See the bonus questions to try a more realistic model.)
(a) | (b) | (c) | (d) | (e) |
ACAT ACA |
AACCT CCCCG |
GATTACA GACACA |
ACAT C |
AAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAA |
Breaking Down the Problem
As President Obama explained, computer scientists learn how to solve difficult problems by breaking them into smaller pieces.
Since edit distance involves sequences of bases, a smaller problem would just be a shorter sequence. Let's see if we can define the edit distance of two sequences in terms of the edit distance of two shorter sequences.
Suppose we are given two sequences, x, with bases x1 x2 x3 ... xn, and y, with bases y1 y2 y3 ... yn. We are looking for the minimum number of edits needed to transform x into y.
Then the edit distance can be computed as follows:
-
If x is empty (i.e., n is zero, so there are no bases in x), then you have to insert all of the bases in y to make y (starting from x, the empty sequence). Hence, the edit distance is m.
-
If y is empty (i.e., m is zero), then we need to delete all of the bases in x to produce y. Hence, the edit distance is n.
-
Otherwise, both x and y have at least one base (x1 and y1). We have four possibilities to consider, and should pick the one that results in the lowest edit distance. The four cases are:
(1) The sequences x and y start with the same base (that is, x1 == y1). We can match up these two bases, and the edit distance is the distance between the remaining sequences: (x2 ... xn) and (y2 ... yn).
(2) The shortest path starts by deleting x1. The edit distance is the distance between the remaning bases of x (x2 ... xn) and y, plus one because of the deletion of x1.
(3) The shortest path starts by deleting y1. The edit distance is the distance between x and the remaning bases of y (y2 ... yn), plus one because of the deletion of y1.
(4) The shortest path starts by mutating x1 into y1. Then, the edit distance is the distance between the remaining sequences, (x2 ... xn) and (y2 ... yn), plus one because of the mutation.
That's it! The parts that say "the edit distance between the remaining sequences" can be implemented as recursive function calls, where we are making progress because at least one of the input sequences is now one base shorter (for cases (1) and (4), both of the sequences are shorter). Since we keep making progress by shortening the sequences, eventually we will reach the base case where one of the sequences is empty and we know the answer right away (the length of the other sequence).
edit_distance
, that takes as input
two lists (or two strings — your code should work for both) and
returns the Levenshtein distance between those sequences. (This is
challenging! There are some hints below, and don't get frustrated
instead of taking advantage of the available help.)
Use min
. Python has a built-in min
function that takes any
number of arguments and returns the value of the smallest one:
>>> min(5, 7) 5 >>> min(5, 1, 7, 3) 1 >>> min(min(5, 2), min(7, 3)) 2
Test as you code. Test your function on simple examples first, and
confirm that it works, before trying more complex examples. You should
make a function with the testing code, so you can re-execute it easily
when you change your edit_distance
function. Start with test cases
where the sequences are empty, and then move on to test cases with one
base in each sequence, before trying tests with longer sequences.
Here's a suggestion for how to write a testing function:
def test_distance(): def test_one(s, t, dist): print("Testing: " + s + " -> " + t + " == " + dist) assert edit_distance(s, t) == dist print("Correct!") # Now, we can use test_one to check the distance between any two strings test_one('', '', 0) test_one('', 'A', 1) test_one('A', '', 1) test_one('T', 'T', 0) # ... test_one('GATTACA', 'GACACA', 2)
The way edit_distance
is described involves a huge amount of
computation, even for fairly short sequences. So, don't be surprised if
your function doesn't finish when you try it on longer sequences
(although it should finish quickly for sequences of length up to 4 or 5
bases). In the next section, we look at how to make it execute more
quickly (but make sure you have a correct, slow implementation first).
Memoizing
We have a serious problem if our edit_distance
function takes too long
to execute on strings like AAAAAAAAAAAAAAAAAA
(10 bases), since
proteins contain hundreds of amino acids (each of which requires 3 bases
to encode). If we want to solve any interesting problems in biology we
need an edit_distance
function that is much faster than this one.
(Soon, we will get more precise about what it costs to execute a
procedure. This is the main focus of Lesson 5 of the cs101 course and
Chapter 7 of the coursebook. But,
for now, we just want to build an intuition about how much computation
is going on when this function runs.)
Why is the edit_distance
function you defined (assuming you followed
the method in the provided description) so slow? Think about this
yourself first before reading on.
The reason the recursive definition of edit distance as described is so
slow is because it is doing so much redundant work. For example, suppose
were are computing edit_distance("ACT","GCT")
. Using the three
options, we want to find:
min((1 + edit_distance("CT","GCT")), (1 + edit_distance("ACT","CT")), (1 + edit_distance("CT","CT")))
To evaluate this application expression, the interpreter needs to
evaluate all the subexpressions that are inputs to min
.
The first operand subexpression is (1 + edit_distance("CT","GCT"))
,
which requires evaluating edit_distance("CT","GCT")
. As before, we
need to find the minimum of the three options. In this case,
min((1 + edit_distance("T","GCT")), (1 + edit_distance("CT","CT")), (1 + edit_distance("T","CT")))
Note that one of these, (1 + edit_distance("CT","CT"))
is
identical to the third option for the first edit_distance
call.
But, the interpreter doesn’t know that. It just follows the
evaluation rules, re-evaluating the same expression every time it
appears. In evaluating a large edit_distance
application, there are
quadrillions of identical evaluations of edit_distance
, all of which
are redundantly evaluated by the evaluator.
One solution is to store the results of previous evaluations, so they don’t need to be recomputed. This is known as memoizing. (An alternate approach, which requires rethinking the algorithm to do things in the reverse direction, is known as dynamic programming, but we won't get into that now.)
Memoizing is fairly easy to do in many programming languages. Python provides a way to "decorate" functions and indicate that they should be memoized.
We have provided code to do so below (and in memoize.py
in the
project2 repository). It is not necessary for you to understand how
memoize
is defined (indeed, it uses some structures and techniques
that we have not yet introduced).
The memoize
function takes as input a function and outputs a new
function. The output function is similar to the input function, except
whenever it is evaluated it stores the result in a table. If it is
re-evaluated on the same inputs, instead of needing to evaluate the full
function, it looks up the value from the previous evaluation in the
table and evaluates to that value right away.
def memoize(function): """ Memoization decorator for a function taking one or more arguments. """ class MemoStore(dict): """ Dictionary to keep track of results from previous calls to function. """ def __getitem__(self, *key): """ Returns the result of applying the function to the arguments, *key. If the result is already stored in the dictionary, returns it directly. If not, __missing__ will be called to evaluate the function on the inputs, store that result in the dictionary, and return it. """ return dict.__getitem__(self, key) def __missing__(self, key): """ Adds the result of applying function to *key arguments to the dictionary, and returns the result. """ ret = self[key] = function(*key) return ret return MemoStore().__getitem__
With memoize
defined, we can decorate any function by using @memoize.memoize
before its definition. This means calls to the function will be
replaced with calls to the new function that is the result of
memoize(function)
.
Here's how to use it for your edit_distance
function:
@memoize def edit_distance(a,b): ...
After adding the @memoize
decorator, you should be able to re-run all
your previous tests with the same results (but faster execution, which
will only be noticable for longer sequences).
Now you're ready to try a more ambitious (and realistic) example!
In the apoe.py
file we have included some (abridged) actual gene data:
a normal "mus musculus apolipoprotein E" gene (call it apoe) and
the E4 variant of it associated with Alzheimer's disease.
An individual who has the E4 variant in both copies of their genome is about 30 times more likely to develop Alzheimer’s disease by age 75, compared to individuals who have both copies of their genome with the normal variant. With no copies of the E4 variant, someone has a 7% chance of developing Alzheimer’s; with one E4 variant, there is a 14% chance; with both copies as E4 variant, the chance of developing Alzheimer’s by age 75 is around 80%.
APOE_NORMAL
and
APOE_BAD
using your memoized edit_distance
function. (Without
memoization, you'll probably have to terminate the Python process
without getting a result.)
project2.py
file.
Teasers
Edit distance and auto-correcting. Try searching for some mispelled phrases in Google, DuckDuckGo, or Bing. For example, tyler swaft ("Did you mean taylor swift?"), ooobbamma, or tom jeffson. Search engines are using methods similar to edit distance to guess what you mean when your query doesn't match lots of documents. Try some experiments with your favorite search engine to see how many substitutions, insertions, and deletions it can handle (this will vary depending on how popular the search term you are close to is). For an excellent more in-depth discussion of this, with an implementation of a full spelling corrector in 21 elegant lines Python code, see Peter Norvig's How to Write a Spelling Corrector.
Evolutionary edit distance. We are simplifying things to assume all types of copying errors are equally likely. In biology, this is not the cases, and it is more likely for a long sequence of bases to be inserted (or deleted) at once, than for the same number of point mutations. So, biological algorithms for measuring evolutionary distance such as the Smith-Waterman algorithm are a bit more complicated than the model we use where single substitutions, insertions, and deletions all have the same cost (one edit). With more sophisticated edit distance metrics used in biology, the cost is different for different subsitutions (because of the way amino acids are encoded, many point mutations have no impact on the actual protein produced by a gene, which means they are "less costly"), and the cost of a multiple-base insertion or deletion is a function of its length (but less than the cost of that many independent single-base insertions).
List comprehensions. We have focused on writing recursive functions,
using only simple Python constructs that are essentially shared by
nearly every programming language. A more Pythonic way to write
functions that operate on every element of a list, such as
sequence_complement
or hamming_distance
would be to use list
comprehensions. They would allow us to turn a multi-line loop or
recursive function into a single line of code (that is harder to
understand until one gets comfortable with reading list comprehensions,
but for experienced Python programmers becomes easier to read). A fun
tutorial on list comprehensions is Trey Hunner's Python List
Comprehensions: Explained
Visually.
Try to rewrite the functions you defined using list comprehensions.
Dynamic Programming. Another way to compute functions like edit distance efficiently is to us dynamic programming. Similarly to using memoization, this avoids the redundant computation inherent in a straightforward recursive solution, but instead of doing it by storing results in the order they happen to be computed, it builds up a table of results in a particular order. The wikipedia page has an explanation and a lot of examples.