Sooty Solutions - Burnaby BC Consulting Company - Advising Business Managers on Security, Information Technology, Business Process Performance, & Best Practices

Business and Technology Trends


Why Genetic Engineering is a Computer Hacker's Problem

Imagine you are given an alien computer and you have to figure out how it works. This is not Star Trek where you can program alien computers after staring at their console for 60 seconds. There are no manuals, no source code, but you have lots of working machine code to examine. If you write a test program it usually does nothing, but be careful, it is possible your test program will kill you. Of course the programming language is DNA and its machine code runs all life on this planet.

The DNA/RNA Machine Language

The machinery of life needs thousands of different proteins, but these proteins are built from just twenty amino acids strung together in different sequences. The instructions that define a specific sequence are coded in RNA by an alphabet of just four nucleotide bases: adenine (A), guanine (G), cytosine (C) and uracil (U). In DNA thymine (T) replaces uracil, but DNA is the master template and is never used directly to make proteins.

DNA is huge and usually kept safe in its double helix form. When a protein is needed, part of the DNA unwinds and an RNA copy of a small segment of the DNA is created. This process is called transcription. A ribosome is a large complex, built from ribosomal RNA and more than fifty proteins, that reads and decodes the RNA thread. The RNA thread is like a little application, and every time it runs through the ribosome it makes another copy of the protein. This process is called translation. The same RNA thread may be run many times to make many copies of the protein molecule.
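Transcription can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the DNA is given as a string of letters for the coding strand, and the function name `transcribe` and the sample sequence are made up for the example.

```python
from typing import Optional


def transcribe(dna: str, start: int = 0, end: Optional[int] = None) -> str:
    """Return an RNA copy of dna[start:end], with T replaced by U."""
    segment = dna[start:end].upper()
    if set(segment) - set("ACGT"):
        raise ValueError("DNA may only contain A, C, G and T")
    return segment.replace("T", "U")


# Hypothetical stretch of DNA; only the middle segment is copied to RNA.
genome = "GGCCATGTTTTAAGGCC"
rna = transcribe(genome, 4, 13)
print(rna)
```

The master copy (`genome`) is never changed; each call produces a fresh, short RNA string, which is the part that would be handed to the ribosome.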

Each amino acid in the protein sequence is coded by an instruction word that takes three of these bases; a three-base word is called a codon. Since each position in the instruction word can take one of four states (A, G, C, or U), the number of possible instruction words is 4^3, or 64. Only 20 amino acids are actually used, so many of the codon words specify the same amino acid.
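The count of instruction words is easy to check by enumerating them. The snippet below is a small sketch that uses only the four-letter RNA alphabet; the variable names are chosen for the example.

```python
from itertools import product

BASES = "AGCU"

# Every three-letter word over the four-letter alphabet.
words = ["".join(letters) for letters in product(BASES, repeat=3)]

print(len(words))   # 4 ** 3 possible instruction words
print(words[:4])    # the first few words in enumeration order
```

With 64 words and only 20 amino acids (plus a stop signal) to encode, some amino acids must be spelled by more than one word.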

The redundancy is not random because there is a chemical relationship between the bases. The two DNA strands are complements of each other: A on one strand always binds to T on the other, and C always binds to G. The bases also fall into two chemical families, with A and G in one (the purines) and U and C in the other (the pyrimidines), and in the last position of an instruction word members of the same family are often interchangeable. In that sense the instruction word UUC is equal to UUU, and both encode Phenylalanine. In half the instruction words the decoding machinery ignores the last base completely, so for example CUx always encodes Leucine.
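The pattern in the last position can be checked mechanically. The sketch below assumes the standard genetic code, written out by hand as a list of three-letter abbreviations in U, C, A, G order; names such as `CODON_TABLE` and `ignored` are made up for the example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in U, C, A, G order for each codon position.
AMINO = ("Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
         "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
         "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
         "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly").split()
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The sixteen possible first-two-base prefixes.
prefixes = ["".join(p) for p in product(BASES, repeat=2)]

# Prefixes whose meaning does not depend on the last base at all.
ignored = [p for p in prefixes
           if len({CODON_TABLE[p + b] for b in BASES}) == 1]

# Prefixes where swapping U for C, or A for G, in the last position changes nothing.
u_equals_c = [p for p in prefixes if CODON_TABLE[p + "U"] == CODON_TABLE[p + "C"]]
a_equals_g = [p for p in prefixes if CODON_TABLE[p + "A"] == CODON_TABLE[p + "G"]]

print(len(ignored), "of", len(prefixes), "prefixes ignore the last base")
print(len(u_equals_c), "prefixes treat U and C alike")
print(len(a_equals_g), "prefixes treat A and G alike")
```

Under this table the U/C swap is always harmless, while the A/G swap fails for two prefixes (UG and AU), which is where Tryptophan and Methionine get their single codons.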

Some of the codons are used to specify the start of a new protein (like an entry point in a routine) and other codons are used to specify a stop (like a return from the routine). The codon instruction word table used by Nature looks like this:

Codon Table

The 64 codons and the amino acids they encode

1st base  2nd base U        2nd base C  2nd base A          2nd base G

U         UUU Phe           UCU Ser     UAU Tyr             UGU Cys
          UUC Phe           UCC Ser     UAC Tyr             UGC Cys
          UUA Leu           UCA Ser     UAA Stop (Ochre)    UGA Stop (Opal)
          UUG Leu (Start)   UCG Ser     UAG Stop (Amber)    UGG Trp

C         CUU Leu           CCU Pro     CAU His             CGU Arg
          CUC Leu           CCC Pro     CAC His             CGC Arg
          CUA Leu           CCA Pro     CAA Gln             CGA Arg
          CUG Leu (Start)   CCG Pro     CAG Gln             CGG Arg

A         AUU Ile (Start)   ACU Thr     AAU Asn             AGU Ser
          AUC Ile           ACC Thr     AAC Asn             AGC Ser
          AUA Ile           ACA Thr     AAA Lys             AGA Arg
          AUG Met (Start)   ACG Thr     AAG Lys             AGG Arg

G         GUU Val           GCU Ala     GAU Asp             GGU Gly
          GUC Val           GCC Ala     GAC Asp             GGC Gly
          GUA Val           GCA Ala     GAA Glu             GGA Gly
          GUG Val (Start)   GCG Ala     GAG Glu             GGG Gly
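To show how the table is used, here is a minimal sketch of the decoding loop in Python. It assumes the standard genetic code and, for simplicity, treats only AUG as an entry point even though other codons can sometimes act as a start; the names `CODON_TABLE` and `translate` and the sample thread are made up for the example.

```python
from itertools import product
from typing import List

BASES = "UCAG"
# Standard genetic code, listed in U, C, A, G order for each codon position.
AMINO = ("Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
         "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
         "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
         "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly").split()
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna: str) -> List[str]:
    """Read codons from the first AUG until a stop codon or the end of the thread."""
    start = rna.find("AUG")
    if start < 0:
        return []  # no entry point, nothing is made
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "Stop":
            break
        protein.append(amino)
    return protein


# Hypothetical RNA thread with two bases of padding on each side.
print(translate("GGAUGUUUCUAUGGUAAGG"))
```

The function can be called any number of times on the same thread, which mirrors the way one RNA copy is run through the ribosome repeatedly.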




The Reverse Codon table starts with the amino acid and identifies which codon instruction words can be used to specify that amino acid.

Reverse Codon Table

The 20 amino acids used in proteins and their codons

Ala    GCU, GCC, GCA, GCG              Leu    UUA, UUG, CUU, CUC, CUA, CUG
Arg    CGU, CGC, CGA, CGG, AGA, AGG    Lys    AAA, AAG
Asn    AAU, AAC                        Met    AUG
Asp    GAU, GAC                        Phe    UUU, UUC
Cys    UGU, UGC                        Pro    CCU, CCC, CCA, CCG
Gln    CAA, CAG                        Ser    UCU, UCC, UCA, UCG, AGU, AGC
Glu    GAA, GAG                        Thr    ACU, ACC, ACA, ACG
Gly    GGU, GGC, GGA, GGG              Trp    UGG
His    CAU, CAC                        Tyr    UAU, UAC
Ile    AUU, AUC, AUA                   Val    GUU, GUC, GUA, GUG
Start  AUG, GUG                        Stop   UAG, UGA, UAA
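The reverse table does not need to be maintained by hand, because it is just the forward table turned inside out. The sketch below assumes the standard genetic code; the names `CODON_TABLE` and `reverse_table` are made up for the example, and start codons are left out because they are a property of position rather than of the table itself.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in U, C, A, G order for each codon position.
AMINO = ("Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
         "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
         "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
         "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly").split()
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the mapping: amino acid -> list of codons that encode it.
reverse_table = defaultdict(list)
for codon, amino in CODON_TABLE.items():
    reverse_table[amino].append(codon)

for amino in sorted(reverse_table):
    print(amino, ", ".join(reverse_table[amino]))
```

The listing has 21 rows: the twenty amino acids plus one row for the stop signal.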


Exceptions

Nature being Nature, there are exceptions to a simple use of the codon table. The first exception is relatively easy to understand: once you begin creating a protein at a start codon, any further start codon you hit simply makes the amino acid listed for it in the codon table. In programming terms this is like a routine with multiple entry points. It is also possible for a ribosome to skip a stop codon and exit on the next one. In programming terms this is like a conditional return. Many of the RNA programs have multiple entry and exit points, so it is clear Nature never read a book on structured programming.
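The programming analogy can be made concrete with a toy model. The sketch below assumes the standard genetic code and treats only AUG as an entry point; the function names `entry_points` and `run`, the `skip_stops` parameter and the sample thread are all hypothetical, and a skipped stop codon is simply dropped here, which is a simplification of what a real ribosome does.

```python
from itertools import product
from typing import List

BASES = "UCAG"
# Standard genetic code, listed in U, C, A, G order for each codon position.
AMINO = ("Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
         "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
         "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
         "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly").split()
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def entry_points(rna: str) -> List[int]:
    """Positions of every AUG in the thread, in any reading frame."""
    return [i for i in range(len(rna) - 2) if rna[i:i + 3] == "AUG"]


def run(rna: str, start: int, skip_stops: int = 0) -> List[str]:
    """Decode from `start`, ignoring the first `skip_stops` stop codons."""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "Stop":
            if skip_stops == 0:
                break          # normal return
            skip_stops -= 1    # conditional return: carry on past this stop
            continue
        protein.append(amino)  # a later AUG just adds Met
    return protein


rna = "AUGGCUAUGUUUUAAGGGUGA"
for start in entry_points(rna):
    print(start, run(rna, start))
print(run(rna, 0, skip_stops=1))
```

The same thread yields a long product from the first entry point, a shorter one from the second, and a longer one again when the first stop is skipped.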

The Ribosome also has the ability to make amino acid substitutions so the RNA program is not guaranteed to completely determine the final protein sequence. Other changes and substitutions are made after the amino acid string is created. Sometimes a long RNA sequence is chopped up into pieces where some of the pieces are legitimate proteins and some of the pieces are simply discarded and recycled.

Protein Folding

A typical protein is a few hundred amino acids long and can be manufactured in about a minute. Just creating the string is not enough. Proteins are used as building blocks or molecular keys, so they must be the correct shape to work properly. Some of the amino acids repel each other and some attract, so the protein will start to fold into shape as it is being created. If something interferes with this folding, or some helper molecule is not present to direct the folding, the result will be a protein that does not work as intended. In most cases the protein will do nothing, and disease will result because not enough of the correct protein is manufactured. In rare cases the mis-folded protein will damage some normal process because it competes with the protein that is supposed to do the job.

Genetic Programming

Anyone who has hacked into an application to try to break a license key, or to fix an executable when the source code has been lost, will be right at home trying to figure out how the program of life works. Of course gene hacking brings with it much larger risks than computer hacking. A computer virus is costly and annoying but it can only indirectly threaten people's health. A gene programmed virus can easily kill its author and many others as well. With powerful tools come awesome responsibilities.

Figuring out how to program life is not trivial but for the most part Nature plays by a set of rules and those rules can be coded into tools and utilities that make the task easier. No one writes machine code for computers anymore. We have assemblers for low level work and compilers that allow us to do big projects in a high level language. It is certain we will build assemblers and compilers for genetic programming.
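As a toy illustration of what the lowest level of such an assembler might do, the sketch below turns a list of amino acids back into an RNA instruction string. It assumes the standard genetic code and always picks the first codon listed for each amino acid; a real tool would have to choose among the alternatives far more carefully. The names `PREFERRED` and `assemble` are hypothetical.

```python
from itertools import product
from typing import List

BASES = "UCAG"
# Standard genetic code, listed in U, C, A, G order for each codon position.
AMINO = ("Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
         "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
         "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
         "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly").split()
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# One codon per amino acid: the first one encountered in table order.
PREFERRED = {}
for codon, amino in CODON_TABLE.items():
    PREFERRED.setdefault(amino, codon)


def assemble(protein: List[str]) -> str:
    """Emit one codon per amino acid and finish with a stop codon."""
    return "".join(PREFERRED[amino] for amino in protein) + PREFERRED["Stop"]


rna = assemble(["Met", "Phe", "Trp"])
print(rna)

# Decoding the output with the forward table gives back the input plus the stop.
print([CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3)])
```

Because several codons encode the same amino acid, the assembler has choices to make that the decoder never sees, much as a compiler can emit different machine code for the same source line.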

We now have the complete genomes for a number of different species including humans, mice, rats and fruit flies. This is like having the machine code listings. The most interesting genome project is the one that sequences a big dog and a little dog so we can put the two listings side by side and figure out what bigness looks like in machine code.
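Putting two listings side by side is, at its simplest, a diff. The sketch below assumes the two sequences are the same length and already lined up, which real genomes are not; the function name `differences` and both sample sequences are made up for the example.

```python
from typing import List, Tuple


def differences(a: str, b: str) -> List[Tuple[int, str, str]]:
    """Return (position, base in a, base in b) wherever the two listings differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares listings of equal length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical fragments standing in for the two listings.
big_dog = "AUGGCUUUUUAA"
little_dog = "AUGGCUCUUUAA"

for position, x, y in differences(big_dog, little_dog):
    print(f"position {position}: {x} -> {y}")
```

Finding the positions that differ is the easy part; working out which of them matter is where the hacking begins.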

There are huge benefits (and huge profits) to be gained with careful use of the technology. Like many other things we take from Nature, we can bend the rules and create radically new capabilities. There are simulation programs that predict protein folding, and we will be able to create designer proteins that cure diseases. There is already research on modifying protein coding and manufacture to use amino acids outside of Nature's privileged twenty. In the not too distant future we may even be able to have fun with genetic programming: bring back the dinosaurs, make dogs and cats intelligent, and even create mythical creatures like unicorns, pegasus, or dragons.

Bioinformatics

Genetic programming is now called bioinformatics. It is showing up at many universities as a field of study that crosses computing science, biology and organic chemistry. Some universities are offering degrees in bioinformatics, and there are organizations with lots of free (but very advanced) software and an opportunity to join projects to develop software. Some of these organizations have lecture notes on everything from software that predicts protein folding to how to configure your MySQL database.