Why Genetic Engineering is a Computer Hacker's Problem
Imagine you are given an alien computer and you have to
figure out how it works. This is not Star Trek where you can program
alien computers after staring at their console for 60 seconds. There
are no manuals, no source code, but you have lots of working machine
code to examine. If you write a test program it usually does nothing,
but be careful, it is possible your test program will kill you.
Of course the programming language is DNA and its machine code runs all
life on this planet.
The DNA/RNA Machine Language
The machinery of life needs thousands of different proteins, but these
proteins are based on just twenty amino acids strung together in different sequences.
The instructions that define a specific sequence are coded in RNA by an alphabet
of just four bases: adenine (A), guanine (G), cytosine (C) and uracil (U).
In DNA thymine (T) replaces uracil, but DNA is the master template and it is never
used directly to make proteins.
DNA is huge and usually kept safe in its double helix form. When
a protein is needed part of the DNA unwinds and an RNA copy of a
small segment of the DNA is created. This process is called
transcription.
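Transcription can be sketched in a few lines of Python. This is a minimal illustration, not a model of the real enzyme machinery: the function name `transcribe` and the sample sequence are made up for this example, and the template strand is assumed to be written in the order that yields the RNA left to right.

```python
# Pairing rules used when an RNA copy is built from a DNA template strand.
# In RNA, uracil (U) takes the place of thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand: str) -> str:
    """Return the RNA copy of a DNA template strand (hypothetical helper)."""
    return "".join(DNA_TO_RNA[base] for base in template_strand.upper())


# Made-up twelve-base segment, used only as a demonstration.
rna = transcribe("TACAAAGGGATT")
print(rna)  # AUGUUUCCCUAA
```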
A ribosome is a large complex made from RNA and more than fifty proteins that
reads and decodes the RNA thread. The RNA thread is like a little application, and
every time it runs through the ribosome it makes another copy of
the protein. This is called translation. The same RNA thread may be
run many times to make many copies of the protein molecule.
Each amino acid in the protein sequence is coded by a three-letter instruction word,
called a codon, where each letter is one base. Since each position in the instruction
word can take one of four states (A, G, C, or U), the number of possible instruction
words is 4^3 = 64. Only 20 amino acids are actually used, so many of the codon words
specify the same amino acid.
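The count of instruction words is easy to check by enumerating them. The sketch below simply lists every three-letter word over the four-letter alphabet; the variable names are illustrative.

```python
from itertools import product

BASES = "AGCU"  # the four letters of the RNA alphabet

# Every possible three-letter instruction word.
codons = ["".join(word) for word in product(BASES, repeat=3)]

print(len(codons))   # 64
print(codons[:4])    # ['AAA', 'AAG', 'AAC', 'AAU']
```

With 64 words and only 20 amino acids plus a stop signal to encode, there is plenty of room for synonyms.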
The redundancy is not random because there is a chemical relationship between
the four letters. The two DNA strands are complementary: A on one strand
always binds to T on the other, and C always binds to G. In the last position of an
instruction word, A and G are usually interchangeable, and so are U and C, so the
instruction word UUC is equivalent to UUU and both encode Phenylalanine. In half the
instruction words the encoding engine completely ignores the last letter, so for example CUx always encodes Leucine.
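The "half the instruction words" claim can be checked against the standard genetic code. The sketch below builds the standard table from a compact 64-character string (one-letter amino acid codes, `*` for stop, with the letters ordered U, C, A, G) and looks for two-letter prefixes where the third letter makes no difference. The helper name `ignored_third_letter_prefixes` is made up for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UUU, UUC, UUA, UUG, UCU, ... order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def ignored_third_letter_prefixes():
    """Return the two-letter prefixes whose meaning never depends on the third letter."""
    prefixes = []
    for first, second in product(BASES, repeat=2):
        meanings = {CODON_TABLE[first + second + third] for third in BASES}
        if len(meanings) == 1:
            prefixes.append(first + second)
    return prefixes


prefixes = ignored_third_letter_prefixes()
print(prefixes)  # ['UC', 'CU', 'CC', 'CG', 'AC', 'GU', 'GC', 'GG']
print(len(prefixes) * 4, "of", len(CODON_TABLE))  # 32 of 64
```

Eight prefixes times four endings gives 32 words, exactly half of the 64, and CU (Leucine) is among them.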
Some of the codon sequences are used to specify the start of a new protein
(like an entry point in a routine) and other codons are used to specify a stop (like a return
from the routine). The codon instruction word table used by Nature looks like:
Codon Table
The Reverse Codon table starts with the amino acid and identifies which codon
instruction words can be used to specify that amino acid.
Reverse Codon Table
The 20 amino acids used in proteins and their codons
Ala | GCU, GCC, GCA, GCG | Leu | UUA, UUG, CUU, CUC, CUA, CUG
Arg | CGU, CGC, CGA, CGG, AGA, AGG | Lys | AAA, AAG
Asn | AAU, AAC | Met | AUG
Asp | GAU, GAC | Phe | UUU, UUC
Cys | UGU, UGC | Pro | CCU, CCC, CCA, CCG
Gln | CAA, CAG | Ser | UCU, UCC, UCA, UCG, AGU, AGC
Glu | GAA, GAG | Thr | ACU, ACC, ACA, ACG
Gly | GGU, GGC, GGA, GGG | Trp | UGG
His | CAU, CAC | Tyr | UAU, UAC
Ile | AUU, AUC, AUA | Val | GUU, GUC, GUA, GUG
Start | AUG, GUG | Stop | UAG, UGA, UAA
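The reverse table can be turned back into a decoder by inverting it. The sketch below is a deliberately naive "ribosome": it assumes translation begins at the first AUG and ends at the first stop codon in that reading frame, and it ignores the exceptions real cells allow. The function name `translate` and the sample RNA string are made up for this example.

```python
# The reverse codon table from the text (the Start row is left out because
# AUG and GUG already appear under Met and Val).
REVERSE_TABLE = {
    "Ala": "GCU GCC GCA GCG", "Leu": "UUA UUG CUU CUC CUA CUG",
    "Arg": "CGU CGC CGA CGG AGA AGG", "Lys": "AAA AAG",
    "Asn": "AAU AAC", "Met": "AUG",
    "Asp": "GAU GAC", "Phe": "UUU UUC",
    "Cys": "UGU UGC", "Pro": "CCU CCC CCA CCG",
    "Gln": "CAA CAG", "Ser": "UCU UCC UCA UCG AGU AGC",
    "Glu": "GAA GAG", "Thr": "ACU ACC ACA ACG",
    "Gly": "GGU GGC GGA GGG", "Trp": "UGG",
    "His": "CAU CAC", "Tyr": "UAU UAC",
    "Ile": "AUU AUC AUA", "Val": "GUU GUC GUA GUG",
    "Stop": "UAG UGA UAA",
}

# Invert it: instruction word -> amino acid.
CODON_TO_AMINO = {
    codon: amino
    for amino, codons in REVERSE_TABLE.items()
    for codon in codons.split()
}


def translate(rna: str) -> list:
    """Decode from the first AUG to the first stop codon (simplified)."""
    start = rna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino = CODON_TO_AMINO[rna[i:i + 3]]
        if amino == "Stop":
            break
        protein.append(amino)
    return protein


print(len(CODON_TO_AMINO))               # 64
print(translate("GGAUGUUUCUACCCUAAGG"))  # ['Met', 'Phe', 'Leu', 'Pro']
```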
Exceptions
Nature being Nature, there are exceptions to a simple use
of the codon table. The first exception is relatively easy to understand:
if you begin creating a protein at a start codon, then when you hit another
start codon you just make the amino acid listed in the codon table. In programming terms this
is like a routine with multiple entry points. It is also possible
for a ribosome to skip a stop codon and exit on the next one. In programming
terms this is like a conditional return. Many of the RNA programs
have multiple entry and exit points, so it is clear Nature never read
a book on structured programming.
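The multiple entry point and conditional return analogies can be sketched as follows. This is a toy model under stated assumptions: `entry_points` and `translate_from` are hypothetical helpers, the RNA string is made up, and a skipped stop codon is simply dropped here (what a real ribosome inserts at that position is not modeled).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def entry_points(rna: str) -> list:
    """Positions of every AUG in the reading frame set by the first AUG."""
    first = rna.find("AUG")
    if first == -1:
        return []
    return [i for i in range(first, len(rna) - 2, 3) if rna[i:i + 3] == "AUG"]


def translate_from(rna: str, start: int, skip_stops: int = 0) -> str:
    """Decode from `start`, reading through the first `skip_stops` stop codons."""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":
            if skip_stops == 0:
                break            # normal return
            skip_stops -= 1      # conditional return: keep going
            continue
        protein.append(amino)
    return "".join(protein)


rna = "AUGGCUAUGAAAUAGGGGUAA"
print(entry_points(rna))                      # [0, 6]
print(translate_from(rna, 0))                 # MAMK
print(translate_from(rna, 6))                 # MK
print(translate_from(rna, 0, skip_stops=1))   # MAMKG
```

The same machine code yields three different products depending on where execution enters and which return is taken.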
The Ribosome also has the ability to make amino acid substitutions so the RNA
program is not guaranteed to completely determine the final protein sequence.
Other changes and substitutions are made after the amino acid string is created.
Sometimes a long RNA sequence is chopped up into pieces where some of the pieces are
legitimate proteins and some of the pieces are simply discarded and recycled.
Protein Folding
A typical protein is a few hundred amino acids long and can be manufactured in about a minute or less.
Just creating the string is not enough. Proteins are used as building blocks or molecular
keys, so they must be the correct shape to work properly. Some of the amino acids repel each other
and some attract, so the protein will start to fold into shape as it is being created.
If something interferes with this folding or some helper molecule is not present to
direct the folding the result will be a protein that will not work as intended. In most
cases the protein will do nothing and disease will result because not enough of the correct
protein is manufactured. In rare cases the mis-folded protein will damage some normal
process because it competes with the protein that is supposed to do the job.
Genetic Programming
Anyone who has hacked into an application to try to break a license key or fix an
executable when the source code has been lost will be right at home trying to figure
out how the program of life works. Of course gene hacking brings with it much larger
risks than computer hacking. A computer virus is costly and annoying but it can only
indirectly threaten people's health. A gene programmed virus can easily kill its
author and many others as well. With powerful tools come awesome responsibilities.
Figuring out how to program life is not trivial but for the most part Nature plays by
a set of rules and those rules can be coded into tools and utilities that make the task
easier. No one writes machine code for computers anymore. We have assemblers for low
level work and compilers that allow us to do big projects in a high level language.
It is certain we will build assemblers and compilers for genetic programming.
We now have the complete genomes for a number of different species including
humans, mice, rats and fruit flies. This is like having the machine code listings. The most
interesting genome project is the one that sequences a big dog and a little dog so we
can put the two listings side by side and figure out what bigness looks like in machine
code.
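Putting two listings side by side is, in its simplest form, a diff. The sketch below compares two made-up DNA fragments of equal length position by position. The sequences and the function name `differences` are purely illustrative, and real genome comparison first needs an alignment step that is not shown here.

```python
def differences(seq_a: str, seq_b: str) -> list:
    """Return (position, letter_in_a, letter_in_b) wherever the listings differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this simple diff needs sequences of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical fragments, not real dog DNA.
big_dog = "ATGGCCAAGTTG"
small_dog = "ATGGCTAAGCTG"

print(differences(big_dog, small_dog))  # [(5, 'C', 'T'), (9, 'T', 'C')]
```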
There are huge benefits (and huge profits) to be gained with careful use of the technology.
Like many other things we take from Nature, we can bend the rules and create radically
new capabilities. There are simulation programs that predict protein folding and we
will be able to create designer proteins that cure diseases. There is already
research on modifying protein coding and manufacture to use amino acids outside of
Nature's privileged twenty. In the not too distant future we may even be able to have
fun with genetic programming; bring back the dinosaurs, make dogs and cats intelligent,
and even create mythical creatures like unicorns, pegasus, or dragons.
Bioinformatics
Genetic programming is now called bioinformatics. It is showing up at
many universities as a field of study that crosses computing science, biology
and organic chemistry. Some universities are offering degrees in bioinformatics, and there are
organizations with lots of free (but very advanced) software and an opportunity to join
projects to develop software. Some of these organizations have lecture notes
on everything from software that predicts protein folding to how to configure your MySQL database.