Science: Genomics

According to the Human Genome Project, human chromosomes range in length from about 50 million to 250 million base pairs. Researchers continue to gain new insights into what individual genes are responsible for and how they are expressed. The amount of data encoded in DNA is enormous, and the interactions between genes are highly complex. How are scientists able to gain any insights when there are so many possibilities and so much data? Algorithms were being applied in genetics long before technology made it possible to implement them so efficiently.

Gregor Mendel is remembered for his crossbreeding experiments with pea plants, which built on the ancient practice of selective breeding. What made Mendel's experiments so impactful were the patterns he found in the results. Mendel studied thousands of plants and tracked how particular traits were passed on to their offspring. By looking through the data, Mendel was able to find patterns and uncover what are now referred to as the Laws of Inheritance. Although the Punnett Square was not developed until years later, it can help you visualize the possibilities he had to work through. Imagine how big a challenge this was for Mendel!

Consider a gene that determines the color of a flower. Like humans, flowers receive two copies of each gene, one from each parent. These copies may differ from one another, which leads to the genetic diversity we see. For this gene, there are two versions: one for purple flowers (P) and one for white flowers, or no color pigment (p). When both copies in the organism are for white flowers (pp), the flower will be white, but if either copy is for purple (Pp or PP), the flower will be some shade of purple.

The Punnett Square below displays all 4 possibilities for the offspring of two parents that each have one gene of each type. Such parents are said to be heterozygous. A flower produced by these two parents has a 75% chance of being purple, because three of the four combinations contain at least one P. This outcome would change if one parent had only dominant or only recessive genes. For multiple genes, the number of possibilities grows exponentially: a dihybrid cross (two genes) results in 16 possibilities, and a cross involving three genes has 64.


        P              p
P       PP (purple)    Pp (purple)
p       Pp (purple)    pp (white)
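The same square can be built in a few lines of code. The sketch below is only an illustration: the function names punnett_square and purple_probability are made up for this example, and it assumes a single gene in which P is fully dominant over p.

```python
from collections import Counter
from itertools import product

def punnett_square(parent_one, parent_two):
    """Return every offspring genotype for a single-gene cross.

    Each parent is a two-letter string such as 'Pp'; each offspring
    receives one allele from each parent.
    """
    offspring = []
    for allele_one, allele_two in product(parent_one, parent_two):
        # Sort so the dominant (uppercase) allele is written first: 'pP' -> 'Pp'
        offspring.append(''.join(sorted(allele_one + allele_two)))
    return offspring

def purple_probability(offspring):
    """Fraction of offspring carrying at least one dominant 'P' allele."""
    purple = sum(1 for genotype in offspring if 'P' in genotype)
    return purple / len(offspring)

# Two heterozygous parents, as in the square above
cross = punnett_square('Pp', 'Pp')
print(Counter(cross))
print(purple_probability(cross))

# Each heterozygous gene doubles the gametes a parent can produce, so a
# cross involving n genes fills a square with (2**n) * (2**n) = 4**n cells.
for genes in (1, 2, 3):
    print(genes, 'gene(s):', 4 ** genes, 'possibilities')
```

Changing one parent to 'PP' or 'pp' shows how the outcome shifts when that parent has only dominant or only recessive genes.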

You can see how quickly the collection and calculation of this data would overwhelm anyone. It is hard to imagine the care and patience that were necessary for Gregor Mendel to collect the results from his seven years of experiments on seven different traits. When you consider that the human genome contains over 20,000 genes, it is clear that paper, pencil, and traditional methods of calculation would not be enough. However, a look at the data might reveal to you, just as it once did for Mendel, patterns that can be used to formulate laws and understand underlying phenomena.

An algorithm is a set of instructions. Algorithms can be used to make sense of data by searching for certain patterns, calculating possibilities, and much more. Even transcription and translation can be viewed as algorithms that human cells use to copy DNA into RNA and synthesize proteins. The process can be written out and each step understood, but the fact that each cell of the body carries out these processes so many times throughout the day exemplifies the power of algorithms at scale. Understanding this process has helped scientists better understand human physiology and has led to new treatments against viruses.

Transcription and Translation Example

Below is an example of code simulating the process of transcription of DNA to messenger RNA (mRNA) and translation from mRNA to an amino acid chain. This mimics the process your cells use to create proteins that are used throughout your body. By simulating this process it is possible to see the impact a single mutation has on the resulting amino acid chain.

In the following example, the first DNA snippet is a short sample of the DNA code for HBB (hemoglobin subunit beta). The second and third snippets are identical to the first, except that in the second the 5th base is changed from 'C' to 'T', and in the third the 5th base ('C') is deleted entirely, which shifts every codon that follows it.
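To make the two mutations concrete, here is a minimal sketch of how the second and third snippets could be derived from the first. The helper names substitute and delete are hypothetical and exist only for this illustration; positions are counted from 1 to match the description above.

```python
# First DNA snippet, as printed in the Results section of this article
dna_one = 'GGATCCTCACATGAGTTCAGTATATAATTGTAACAGAATAAAAAAT'

def substitute(dna, position, new_base):
    """Return dna with the base at a 1-based position replaced."""
    index = position - 1
    return dna[:index] + new_base + dna[index + 1:]

def delete(dna, position):
    """Return dna with the base at a 1-based position removed."""
    index = position - 1
    return dna[:index] + dna[index + 1:]

dna_two = substitute(dna_one, 5, 'T')   # point mutation: 'C' becomes 'T'
dna_three = delete(dna_one, 5)          # deletion: every later base shifts left

print(dna_two)
print(dna_three)
```

The deleted version is one base shorter than the original. The third snippet printed in the Results carries one extra base at the end, so only its first 45 bases are expected to match this output.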

Example Transcription and Translation Algorithm

#DNA --> mRNA
transcription_dict = {'A':'U', 'G':'C', 'C':'G', 'T':'A'}

#mRNA --> Amino acid
# The standard genetic code: all 64 codons, grouped by their first two bases
translation_dict = {
    'UUU':'Phe', 'UUC':'Phe', 'UUA':'Leu', 'UUG':'Leu',
    'UCU':'Ser', 'UCC':'Ser', 'UCA':'Ser', 'UCG':'Ser',
    'UAU':'Tyr', 'UAC':'Tyr', 'UAA':'Stop', 'UAG':'Stop',
    'UGU':'Cys', 'UGC':'Cys', 'UGA':'Stop', 'UGG':'Trp',
    'CUU':'Leu', 'CUC':'Leu', 'CUA':'Leu', 'CUG':'Leu',
    'CCU':'Pro', 'CCC':'Pro', 'CCA':'Pro', 'CCG':'Pro',
    'CAU':'His', 'CAC':'His', 'CAA':'Gln', 'CAG':'Gln',
    'CGU':'Arg', 'CGC':'Arg', 'CGA':'Arg', 'CGG':'Arg',
    'AUU':'Ile', 'AUC':'Ile', 'AUA':'Ile', 'AUG':'Met',
    'ACU':'Thr', 'ACC':'Thr', 'ACA':'Thr', 'ACG':'Thr',
    'AAU':'Asn', 'AAC':'Asn', 'AAA':'Lys', 'AAG':'Lys',
    'AGU':'Ser', 'AGC':'Ser', 'AGA':'Arg', 'AGG':'Arg',
    'GUU':'Val', 'GUC':'Val', 'GUA':'Val', 'GUG':'Val',
    'GCU':'Ala', 'GCC':'Ala', 'GCA':'Ala', 'GCG':'Ala',
    'GAU':'Asp', 'GAC':'Asp', 'GAA':'Glu', 'GAG':'Glu',
    'GGU':'Gly', 'GGC':'Gly', 'GGA':'Gly', 'GGG':'Gly',
}

def transcription(DNA):
    # Return a string of mRNA given a string of DNA.
    # Each DNA base is replaced by its complementary RNA base.
    mRNA = ''
    for base in DNA:
        mRNA += transcription_dict[base]
    return mRNA

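def translation(mRNA):
    # Return a list of amino acids given a string of mRNA
    amino_acid_chain = []
    codon_length = 3
    end = len(mRNA)
    # ribosome_one and ribosome_two mark the start and end of the current codon
    ribosome_one = 0
    ribosome_two = 3
    while ribosome_two <= end:
        codon = mRNA[ribosome_one:ribosome_two]
        # In this simplified model, stop codons and unrecognized codons
        # add nothing to the chain and reading simply continues.
        if codon in translation_dict and translation_dict[codon] != 'Stop':
            amino_acid = translation_dict[codon]
            amino_acid_chain.append(amino_acid)
        ribosome_one = ribosome_two
        ribosome_two += codon_length
    # Any leftover bases that do not fill a whole codon are ignored
    return amino_acid_chain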

Results

Snippet of hemoglobin DNA
DNA #1 : GGATCCTCACATGAGTTCAGTATATAATTGTAACAGAATAAAAAAT
mRNA: CCUAGGAGUGUACUCAAGUCAUAUAUUAACAUUGUCUUAUUUUUUA
Amino Acid Chain: Pro-Arg-Ser-Val-Leu-Lys-Ser-Tyr-Ile-Asn-Ile-Val-Leu-Phe-Phe

Snippet of hemoglobin DNA, 5th base changed from 'C' to 'T'
DNA #2 : GGATTCTCACATGAGTTCAGTATATAATTGTAACAGAATAAAAAAT
mRNA: CCUAAGAGUGUACUCAAGUCAUAUAUUAACAUUGUCUUAUUUUUUA
Amino Acid Chain: Pro-Lys-Ser-Val-Leu-Lys-Ser-Tyr-Ile-Asn-Ile-Val-Leu-Phe-Phe

Snippet of hemoglobin DNA, 5th base deleted.
DNA #3 : GGATCTCACATGAGTTCAGTATATAATTGTAACAGAATAAAAAATC
mRNA: CCUAGAGUGUACUCAAGUCAUAUAUUAACAUUGUCUUAUUUUUUAG
Amino Acid Chain: Pro-Arg-Val-Tyr-Ser-Ser-His-Ile-Leu-Thr-Leu-Ser-Tyr-Phe-Leu
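
One way to see how different the two mutations are is to compare the chains position by position. The sketch below copies the three amino acid chains from the results above; the function name differing_positions is made up for this illustration.

```python
# Amino acid chains copied from the results above
chain_one = 'Pro-Arg-Ser-Val-Leu-Lys-Ser-Tyr-Ile-Asn-Ile-Val-Leu-Phe-Phe'.split('-')
chain_two = 'Pro-Lys-Ser-Val-Leu-Lys-Ser-Tyr-Ile-Asn-Ile-Val-Leu-Phe-Phe'.split('-')
chain_three = 'Pro-Arg-Val-Tyr-Ser-Ser-His-Ile-Leu-Thr-Leu-Ser-Tyr-Phe-Leu'.split('-')

def differing_positions(reference, mutated):
    """Return the 1-based positions where two chains disagree."""
    return [
        position
        for position, (expected, actual) in enumerate(zip(reference, mutated), start=1)
        if expected != actual
    ]

changed_base = differing_positions(chain_one, chain_two)
deleted_base = differing_positions(chain_one, chain_three)

print('Changed base:', len(changed_base), 'of', len(chain_one), 'amino acids differ')
print('Deleted base:', len(deleted_base), 'of', len(chain_one), 'amino acids differ')
```

For these snippets, the changed base alters only the 2nd amino acid, while the deleted base alters 12 of the 15, because every codon after the deletion is read one base out of step.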

These types of mutations, where even a single DNA base is changed or deleted, can have a dramatic effect on the resulting protein. Detecting them with older manual processes would have required a lot of work from a geneticist. In contrast, you can instantly see the effects on the resulting amino acids. Integrating algorithms into technology yields insights more efficiently and provides more time for analysis and experimentation. The tools needed to implement these algorithms continue to become cheaper and more widely available. As you and your students explore genetic data, you will not only learn about genetics, but can also conduct your own research and possibly make a novel discovery.

If you are interested in learning more about applying algorithms to the field of genetics, consider exploring the Rosalind Problems. The first few Rosalind problems cover topics that are a common part of biology curriculums, but students will gain not only that knowledge but also the ability to design algorithms for solving these types of problems and others like them. Even if you do not implement these algorithms using Python or another programming language, your students can write out the steps they would use to solve each problem and gain many insights from that process alone.
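
As a small illustration in the spirit of those early exercises, the sketch below counts how often each base appears in the first hemoglobin snippet from the Results. The function name count_bases is hypothetical, and this is only one of many ways a student might approach the task.

```python
from collections import Counter

def count_bases(dna):
    """Return how many times each of the four DNA bases appears."""
    counts = Counter(dna)
    # Report the bases in a fixed order, including any that never appear
    return {base: counts[base] for base in 'ACGT'}

# First DNA snippet from the Results section of this article
dna_one = 'GGATCCTCACATGAGTTCAGTATATAATTGTAACAGAATAAAAAAT'

for base, count in count_bases(dna_one).items():
    print(base, count)
```

Students who are not writing code could describe the same idea in words: keep four tallies, read the sequence one base at a time, and add one to the matching tally.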

To try some of the tools used in the field of bioinformatics, you can look at BLAST, as well as the Google Genomics API. You can also observe the power of algorithms, and even assist in biological research, through Fold.it. To learn more about these topics, search the Internet for Punnett Squares, bioinformatics, computational biology, and the Human Genome Project.

Potential standards this activity could align with if used with students.