What is the genetic code?

This is the $64,000 question: what is the genetic code? The thrust of this rambling page is that we don't really know the answer to this exceptionally important question. The genetic code clearly must involve the process of making protein based on information in DNA, but precisely how is this done? Here are some equally important and related questions:

What is life?

What is DNA?

What is a protein?

Life, as described by Erwin Schrödinger, is an aperiodic crystal. DNA is a crystal, and so is protein. Note the pattern. The cleanest view of the nature of the genetic code, then, is that it is a complex algorithm through which one crystal is formed according to specific properties of another, sovereign crystal. This then requires a language where two crystal types must somehow communicate information. Tough job, don't you think? It's hard to believe there's a way to consistently do it!

Science works by simplifying complexity, but in the case of the genetic code we have oversimplified. The chest-beating certainty of the scientific community has obfuscated the fact that we still do not understand nature's nifty crystal-forming algorithm affectionately referred to as the genetic code. We have mastered the well-worn correlation data between sequential components of each crystal type, but this is not the same as the language that communicates genetic information into a fully formed crystal. Science has dogmatically insisted that they are the same thing, but where's the proof? I doubt that we'll ever see it.

News flash: The earth revolves around the sun. Note to astronomers: lose the epicycles.

All life on this planet is based on a genetic code. It is a system that somehow defines the construction of living things by directing the processes of molecular synthesis and replication. In 1953 James Watson and Francis Crick described a double helix as the structure of a huge molecule called DNA, which was known to reside in the cell nucleus and store the secrets of the genetic code. Excitement grew, and by 1960 leaders in science were predicting that nature would be laid bare within a year, creating justifiable fears. If man actually controlled the genetic code, what would happen to life on earth? Salvador Dalí seemed to anticipate man’s dominion over nature and its relationship to a higher truth, as shown in his painting The Temptation of Saint Anthony.

The predictions and accompanying fears proved unfounded, however, since the code wasn’t completely “broken” for another ten years. Entirely synthetic life has yet to be created, and today, despite tremendous strides in genetic engineering, there is a general disaffection with the code. It appears that the code alone was not enough to allow man dominion over nature. The full glory of protein synthesis remains a mystery, so we have now moved “past the code” and on to proteins themselves. According to conventional thinking, the genetic code is so simple and buttoned down that its logical foundation appears remarkably trivial. Instead, today’s glamour boy is the protein – the idol of proteomics. It is the study of proteins and their many eccentric habits of folding that dominates the search. Proteins are so devilishly complex that “breaking the protein” makes “breaking the code” look like child’s play. Fortunately, we have a tremendous amount of technology to help with the task as compared to 1960, and some of the greatest scientific minds are focused on a solution.

Surprise!
The genetic code is child’s play.
Enter the child.

A funny thing happened on our way to dominion: somebody… everybody forgot to “break” the other half of the code. A central premise of these pages is that the genetic code is far different from our conventional view of it. The genetic code somehow makes proteins, not just sequences of amino acids. (No, they are not the same thing.) I attempt through these writings to illustrate this “obvious” fact, and the implications of having missed it. I also intend to swing a machete in the general direction of any sacred cow that ambles into view.

That’s how children are – childish.

Starting with a modified version of the standard Watson-Crick table we see the set of codons and amino acids that are a part of the genetic code. From the conventional perspective, the genetic code starts and ends with pairing codons and amino acids.

This table, in its various forms, has today come to represent the entire logic of the genetic code. I arranged this table on a somewhat eccentric scheme. The important thing to note, however, is that we can arrange this table any-ol-way we like. There is no “correct” way to arrange and display this data according to our conventional view of the genetic code, and many different ways are in use. Since we can’t say for sure where nature got the data to begin with, and we believe there is no absolute meaning in its arrangement, we are free to view the organization of this data as arbitrary. This is strongly related to the premise that the genetic code is “one-dimensional”, which means that it contains only one degree of freedom with respect to the information it handles. The one dimension is the pairing of codons and amino acids. These two concepts are self-supporting to the point of forming a tautology. If assignments are arbitrary then the code is one-dimensional, and if the code is one-dimensional then assignments are arbitrary. Regardless, the paradigm of a one-dimensional code leaves no room for any absolute foundational logic. Adding, subtracting or shuffling assignments will change the output of the code, but leaves the foundational logic of the code unchanged.
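The two claims in the conventional view described above can be made concrete with a small Python sketch. It assumes the standard codon assignments written in DNA letters (T in place of U); the names `table`, `rearranged`, `reassigned` and `translate` are invented for this illustration. Rearranging the rows of the table leaves every translation unchanged, while reassigning amino acids changes the output but not the look-up logic.

```python
import random
from itertools import product

# Standard codon assignments in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons = ["".join(c) for c in product(BASES, repeat=3)]
table = dict(zip(codons, AMINO))

def translate(dna, lookup):
    """Read the sequence three letters at a time and look each codon up."""
    return "".join(lookup[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# 1. Rearranging the display: shuffle the rows, keep every pairing.
rows = list(table.items())
random.Random(0).shuffle(rows)
rearranged = dict(rows)

# 2. Reassigning: shift the amino-acid column by one position
#    (a made-up alternative code, used only for contrast).
reassigned = dict(zip(codons, AMINO[1:] + AMINO[:1]))

original_output = translate("ATGGCC", table)
rearranged_output = translate("ATGGCC", rearranged)
reassigned_output = translate("ATGGCC", reassigned)
```

Under this one-dimensional picture the row order carries no information at all, which is exactly the premise being challenged in these pages.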

Acceptance of this view is not merited by empirical data, however, and it is extraordinarily detrimental to our study and use of the genetic code. The accepted doctrine has prevented the asking of important and fascinating questions, many of which shall be addressed in these pages. I find the one-dimensional view of things now absurd and untenable, especially in light of discoveries of the past five years. Warning bells should be going off all over the place, but they have not. A view of a one-dimensional code is virtually impossible to rehabilitate in even the broadest of terms. There simply is more than one degree of freedom in translation of genetic information, and all of the information must be embodied in our model of the genetic code.

Some might quibble with the precise language of my description, but the conventional approach is yet unchallenged, and I therefore intend to aggressively challenge it. From a Rafiki perspective the nature of the data in the above table is the furthest thing from arbitrary, and there is indeed a “best” way to arrange and view it. Genetic translation is an objective, molecular process. Genetic information must therefore be founded on objective molecular structures. There are at least two dimensions of information in the genetic code, and surely many more. Open to debate are the natures, forms and actual mechanisms of translation for these additional dimensions of information.

Like the periodic table of chemical elements, there is a sublime logic to the assignment of amino acids to codons. Without this insight we are blind, and the genetic code goes from a periodic table of elements to a table of periodic elements as viewed by Michael Teague.

Table of Periodic Elements
Michael Teague


The conventional view of assignment data becomes particularly dysfunctional when we return to the premise of having a “genetic code” in the first place. We intuitively know that cryptic information is contained in one set of molecules and communicated to another set of molecules. We know this because we can witness the process and its results (protein synthesis). The key questions are, what information is in there, and how does it get communicated? If one accepts conventional wisdom, the answers are, “not much, and with a single, simple set of linear correlations.” These answers are incorrect, and the insistence that we cherish them as we have for so long has led to a truly comical view of the genetic code. More comical is the defense of it, as history will record. All indoctrinees are in the trance of a more than forty-year post-hypnotic suggestion, causing obvious anomalies of the paradigm to go unnoticed. This is most unfortunate, so it is our job to correct it. We will start with some basic questions.

• What is the origin of the language, or how did Life get started?

• With so many alpha-amino acids to choose from, and room for 64 in the code, why does the standard set contain only 20?

• What is the logic behind the arrangement of nucleotides, codons and amino acids?

• Since the mirrors of alpha-amino acids (L and D) are equally stable and exist in equal proportions within the abiotic areas of the universe, why are all of the standard amino acids in the L form?

• In such a beautifully rapid, accurate and efficient information system, why is there such an ugly redundancy?

• With few exceptions, the above system appears to be used in all species and presumably back through time. Given the ravages of evolution - changing properties of organisms rapidly and constantly - one might expect some branching into competing dialects of the genetic language. At least the redundancy of the language should be subject to widespread change, since it has no absolute meaning. How could this exact system exhibit such dogged durability across time and throughout species?

• We now know the shape of DNA – a double helix – and we know the functional significance of this shape, but this is only genetic storage. After all, it is a complex 3D information system, and shape imparts structure, function and meaning. What is the fundamental shape and meaning of the genetic code when it performs its magical role during protein synthesis?

Answers

To pick up a good biochemistry text today one might imagine that either there are generally accepted, plausible answers to these questions, or the questions are too unimportant to merit any attention or real answers. To wit:

Why only 20? “The fact that all living organisms use the same standard amino acids in protein synthesis is evidence that all species on Earth are descended from a common ancestor.”

Why all L-amino acids? “Like modern organisms, the last common ancestor (LCA) must have used L-amino acids and not D-amino acids.”

And this is from an otherwise excellent, up-to-date, advanced-level college biochemistry textbook!

That’s it? That’s the best we can do? At least say, “we don’t know and we don’t care.” We mustn’t pretend to know, or imply it’s unimportant that we don’t know. These aren’t answers; they are fables. They are known as “just so” stories. They equate to, “they are because they are, and they must need to be because they are.” However, not knowing these answers is a very unsettling concept for a lot of very intelligent people, creating more than a small component of denial.

There is another anomaly, a gaping hole so to speak, in the same texts. Leaf through them and what do you see? Information, lots of it and presented beautifully. They illustrate an abundance of knowledge representing some of the greatest achievements of human thought and investigation. The trend is toward shape, fit, three dimensions. There is a tip of the cap to the idea that the meaning of the covered subjects lies in their shapes and their space-occupying attributes. Some even provide 3D glasses, and most offer links to animated web sites to facilitate the spatial effects. The double helix is celebrated and dissected. Proteins are unfolded, folded and fit together. Electron prowling domains are drawn and speculated upon. Yet at the point where the rubber hits the road, where the genetic code performs its magic, the descriptions revert to 1950s flatness and they are presented in living black and white.

Toto, I have a feeling that we're not in Kansas anymore.

Dorothy
The Wizard of Oz

How can it be that the double helix has this wonderful relationship between form and function, yet the nucleic acid complexes have no form-function relationship during protein synthesis? Certainly the form-function of our genetic information storage is important, but what about the genetic processor? It is likely that it has a distinct shape as well, and the shape of the processor is somehow logically related to genetic information storage and retrieval.

The foundation of the central dogma is that the information is “co-linear”. In other words, there is a line of information in DNA that is communicated, somehow, to a line of results in proteins. This is taken to mean that the code is one-dimensional. There is believed to be only one dimension of information passed from line to line - the one dimension being the identity of amino acids or links in the protein chain. Due to faith in co-linearity, and due to the nascent digital information industry in the 1950s, the code itself came to be seen as linear. As I’ve already stated, this is a big mistake. There is nothing really linear about the code, unless you believe that a swarm of bees is in some way linear. Nature ignores lines; it’s all about shapes.

The existence and translation of information in matter is not mystical. It is a nitty-gritty process of quantizing and selecting possibilities from a defined set of possibilities. The mysticism lies in the process by which the universe methodically bootstraps information in an ever more complex cascade of emergence. The correct term, I believe, is sequential. I will grant that the genetic code is to an extent co-sequential, but I will not concede that it is co-linear, because these illusions of linear paradigms are clouding our eyes and our brains. The linear indoctrination process is intense; I know, because I’ve been through it. However, we can loosen the reins on our senses and find some sense in the madness. With the help of some recent discoveries, some clear, rational thinking, and some bodacious art, we can see the order in the chaos.

M.C. Escher
Order and Chaos

In the middle of a table of periodic elements sits logic. As with any pattern there is an organizing force, something that drives the formation of complexity and order. The laws of nature are in play, and the patterns are there for us to see, if only we have the light and courage to look.

Furthermore

Although it is commonly accepted that ‘The Genetic Code’ is presently known, and therefore the fundamental rules for some type of universal translation are also known, this is false. In fact, nothing could be further from the truth. Indeed, we know of a somewhat predictive correlation between levels of translation - between a few limited sets of small organic molecules in selected organisms. We do not even know everything there is to know about this level, despite hyperbole to the contrary. Moreover, we have so far failed to identify the relationships between, and the multi-dimensional meanings in vastly larger, more informative molecular sets. We are far from able to recognize all of the genetic information stored in the double helix - undeniably. We are also unable to interpret or explain the detailed patterns of genetic information spread across past, current and future life in the universe. Much of this confusion is caused by a failure to define terms. Although key terms were initially defined in biology, our knowledge has now outstripped our definitions.

There are many usages of the terms ‘genetic’ and ‘code’ and there are many casually accepted ways to combine the two. Although there is no official definition of anything that can reasonably be called the genetic code, this label over time has been applied to a variety of general and specific concepts, causing a huge element of confusion. Terms absolutely must be defined, especially central terms, and they no longer are. Most of the key terms today are bastardized, squishy, amoeboid and evolving. The most widely accepted usage of ‘the genetic code’ relates to translation of nucleotides into amino acids; however, even an accurate definition regarding this function is lacking from the dictionaries, textbooks and literature. Understandably, the minds of investigators are out of sync on this issue, and many other issues will suffer as a consequence. I mostly get my terms from Webster, but science has a way of mutilating common parlance. As theories and ideas evolve, vestigial terms, like the shells of hermit crabs, become inhabited by the animus of new meaning.

The term ‘genetic’ is used here to mean: Of, relating to, or influenced by the origin or development of something. The term ‘code’ is used here to mean: A systematically arranged and comprehensive collection of rules. ‘The Genetic Code’ could therefore reasonably apply to anything that fairly describes: A systematically arranged and comprehensive collection of rules relating to the origin and development of living organisms. I have let it somewhat out of its classical cage. For the term to be of any use, other than nostalgia, the genetic code must somehow operate on genetic information to animate living things.

We have been told explicitly for decades that the genetic code has been broken, and our expectations have quite understandably soared. Yet the term ‘genetic code’ remains poorly defined by science, much less broken. Theoretically, we should at least be able to make proteins de novo with the genetic code that we supposedly have broken, yet we cannot. The unreasonably high expectations raised by hyperbole about what we actually know have not been met in our quest to learn nature’s expanded secrets behind her codes of organic translation.

Our thinking about the genetic code has, over the decades, become more detailed but progressively more confused. Models of the code are woefully inadequate, and the languages used to describe them are extraordinarily self-contradictory. I am now convinced that science actually has no working paradigm for a consistent exploration of genetic information. I am equally convinced that most people have failed to realize this. Heck, why should they? It was announced long ago that the genetic code had been captured, and this mis-truth has been repeated so many times that people truly believe that the genetic code is leading a peaceful existence in captivity, inside tiny spreadsheets, stored in millions of textbooks around the world. It is embarrassing, but the most honest answer to the question of what the genetic code really is, is that we really don’t know. Anything short of this would be ‘drinking the Kool-Aid.’

We might call something ‘The Genetic Code’ but it does not make it real. Our label and perception of it are necessarily artifacts of the human mind. They are working tools made of the material of the mind, for use by the material of the mind. Most people tend to forget this, and they confuse the issue further when they associate DNA with the genetic code. From here it is easy to confuse progress in studying the former as an understanding of the latter. DNA is a component of the code, but it is not the code. Sequencing genomes and understanding codes are decidedly different activities. One is data collection, and the other is data interpretation.

A broader, more robust, multi-disciplinary approach is called for. Biologists must team with physicists, mathematicians, computer jocks and philosophers to advance the cause. The emperor today is quite naked, so I propose, throughout these pages, bits and pieces of cloth, a more general paradigm, a model of the genetic code that I call the Rafiki code. I have no illusions of ‘breaking the code’ myself, and in fact I do not pretend to actually propose a new comprehensive genetic code here. This, I think, is impossible. I am proposing something far more useful, but far more dangerous. I am proposing that we shift our paradigms of genetic information and the codes used to translate it. A paradigm shift in science, like religion, is usually an ugly, protracted, name-calling endeavor. Therefore, short of that, I hope to accomplish several things:

• Examine the terms currently in use and ask if they are well defined, appropriate and useful.

• Explore a general framework, or paradigm for genetic information that can tie together seemingly disparate observations from separate fields.

• Introduce new concepts, models and investigative tools that can serve as platforms for speculation and further model development.

Mostly, I am issuing an open invitation to all interested parties: Join the fun. There is a gigantic, important and fascinating puzzle of nature yet to be solved. It will likely take a collective imagination to solve it, and you never know who might be the one to add a new piece.

Discovery of the double helix coincided with the introduction of information theory and the dawn of the information age. With them came an explosion of digital computing technology. Broad parallels between organic and digital computers are striking, but the two systems of computation quickly diverge on their details. Computer metaphors of genetic information are useful, but care must be taken in using them. It is helpful to remember that human native languages inform our thinking, but more often they are apt to misinform our ideas of nature’s complex phenomena.

Life is a complex adaptive system rivaled by no artificial system of digital computation. Although it is valid to draw parallels between the two, it is also essential to meticulously highlight the differences. Most genetic metaphors suffer on this score from a sloppy application of native languages. For instance, all digital systems might be called one-dimensional in the sense that they process information in symbolic binary strings. Digital computations require that information be reduced to these symbolic strings of binary digits. Storage and processing can be carried out in this single format because the digital computer is a linear finite state device. It is dependent on the precise state of the device at each step in the process, and every possible state of the device can be reduced to and completely represented by a string of zeros and ones. I see digital computers in this way as sequential, one-dimensional, dynamic mappings of logical relationships. Furthermore, computations are done in digital systems at discrete points by things known as data processors.

Strings and data processors of sorts exist in organic computers, but they have a vast number of informative dimensions, not just one. The genetic code has famously come to also be labeled as one-dimensional, but this is a sloppy use of native language that misinforms our thinking. Genetic information is never reduced to a single format or processed in a single dimension. Nothing about genetic information comes close to one-dimensional. The all-too-familiar ‘linear model’ of the genetic code was proposed many decades ago, but its validity remains to be demonstrated, and many of its original premises have already been disproved.

The term ‘linear’ with respect to the genetic code could be interpreted in several different ways. The nucleotides in DNA tend to ‘line-up’, but in this sense it is more appropriate to use the word ‘sequential’. DNA and all other molecular forms of information should not be considered as well-behaved sequences of points that form lines, but rather as mischievous sequences of shapes that form other shapes. Alternatively, a mathematical relationship is considered linear when every input produces one and only one output. Here again, this description cannot be applied to the genetic code. It is not clear to me what people mean when they describe the genetic code as linear and one-dimensional; nonetheless, this is how it is often described.

Our unfortunate insistence on calling the genetic code one-dimensional owes a good deal to the fact that codon and amino acid correlations were discovered early in investigations of genetic information. Each nucleotide triplet, or codon, is mostly linked in protein translation to one and only one amino acid in a peptide chain. It was erroneously claimed that this was the only sort of information that a codon could carry in translation. Subsequent investigations have proven this to be completely false, and the idea of a one-dimensional codon is now untenable, yet the “linear” philosophical doctrine marches on with nary a question to its validity.

The most detrimental effect of using the terms linear and one-dimensional in reference to the genetic code is that an additional term - ‘simple’ - frequently gets tacked on. I invite anyone to demonstrate or explain the simplicity of genetic information and genetic translation, at any level. Attempts to do so would suggest a failure to recognize that tiny portions of computer programs do not make sense of the entire code. The portion of the genetic code with which we are vaguely familiar is but a tiny subset of the logic behind translation of genetic information. It cannot be removed from the entire code and retain its meaning. Any expectation of simplicity in this, nature’s zenith of complexity, will have a low probability of being met in reality.

Unfortunately, there are few if any competing theories with which to compare the linear model. It is a sparse, comfortable, yet highly imperfect model, but for all practical purposes it is the only model presently considered. Here I attempt to define features that must be addressed by any computational model of genetic information, and with them we can question, examine, support or invalidate general theories.

The most obvious difference between a digital computer and an organic computer is that the former can be called one-dimensional, and the latter clearly cannot. The digital system deals exclusively with binary strings, but the genetic system deals with, among other things, molecular strings. Molecular strings are sequential polymers – macromolecules - that represent some of the many dimensions of genetic information. Each can be seen as a type of animated-floppy-disk of genetic data, and the information contained in the string exists in many dimensions as well. The beauty of molecular strings is that they provide logical, molecular, spatial sequences that can be leveraged in the time dimension. Organic computation always has a vital time element. With molecular strings, information can undergo a stepwise process of translation from one string to another. Each string is a discrete form of information requiring its own finite set of rules for translation.

Humans can easily convert a molecular string into a binary string, but a molecule cannot, nor is it clear why it ever would. Moreover, genetic systems expand information beyond strings, into dimensions that cannot be as easily quantified. Stepwise genetic translations visit many states in many molecular forms, and none of them can be removed from time or space, or from the total system. At the very least, no single form of genetic information or level of translation can be separated from the dimensions of space and time in which it exists, whereas a binary string easily can.

Some less intuitive molecular dimensions of genetic information are contained in quantities, populations, interactions, co-dynamics, proportions and concentrations of various organic and inorganic molecules. With genetic information, time is of the essence, and space is of the essence as well. We can distinguish these sorts of ‘epi-string’ information from the more obvious string information, but we must take care not to consider it ‘epigenetic’ information. Organic data that is not strictly carried in a molecular string can nonetheless still be considered genetic. We will return to this issue shortly.

A particular molecule, hemoglobin for instance, might exist in one of many potential states. Even in its standard form it is an animated molecule that depends on shape and mobility to perform its functions. But as any doctor can tell you, there is vital information in the number of possible states, the probability of each state, and the actual state of the entire molecular population at any given time. Additionally, and more complex, there is entropy in the temporal fluctuations of these molecular states. The ontogeny of hemoglobin populations within an organism is contained in genetic information, and the rules of changing populations within changing environments through time are somehow there as well.

Unfortunately, because of the staggering breadth and complexity of these epi-string dimensions of genetic information, we are precluded from examining the details of these forms and dimensions here. However, we are able to identify some numerical fundamentals of genetic systems, and the dimensions of information contained within many forms of molecular strings are presently not that far from our grasp. We begin with a thumbnail overview of a cell, and an outline of the master loop in a genetic translation program.

Master Loop


These are necessarily over-simplified diagrams to help us build a structure for tracking information as a genetic program processes it. Science must simplify to enlighten. A more accurate map of the genetic code would resemble a map of the World Wide Web, with a riot of interconnecting reversible arrows and overlapping relationships. In mathematics, networks of this kind are the subject of graph theory.

Each level in this master loop has an algorithm, or set of rules that dictate which translations are done to which portions of the data. Each new dimension of data is passed to the molecular forms at the next level. If we consider that the program is given a certain amount of information, INITIAL(Info), then we can see that the information is processed in many forms before a value is returned by the function SURVIVE(Info). If the value returned by this function is TRUE then the loop continues. If the value is FALSE then the loop terminates. We know that every living thing on the planet has a record of perfect, unbroken strings of ancestral TRUE values returned by the function SURVIVE(Info). Therefore, the data can be grouped on this criterion into a common set. But beyond that it is difficult to imagine what the exact nature, origin, or history of the data might be.
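The control flow of this master loop can be sketched in Python. This is a toy rendering of the metaphor only, not a model of any real biology: INITIAL and SURVIVE are the names used in the text, while the `Info` type, the two placeholder levels, the `energy` value and the `max_generations` cap are assumptions invented so that the example runs and terminates.

```python
from typing import Callable, Dict, List

Info = Dict[str, int]  # hypothetical stand-in for "genetic information"

def initial() -> Info:
    """INITIAL(Info): hand the program its starting information (toy value)."""
    return {"energy": 3}

def survive(info: Info) -> bool:
    """SURVIVE(Info): TRUE lets the loop continue, FALSE terminates it."""
    return info["energy"] > 0

def master_loop(info: Info,
                levels: List[Callable[[Info], Info]],
                max_generations: int = 100) -> int:
    """Pass info through every level in turn, then ask SURVIVE.

    Returns the length of the unbroken run of TRUE values.
    """
    unbroken_trues = 0
    while unbroken_trues < max_generations:
        for translate in levels:       # each level applies its own rules
            info = translate(info)     # and hands its result to the next
        if not survive(info):
            break                      # FALSE: the loop terminates
        unbroken_trues += 1            # TRUE: the loop continues
    return unbroken_trues

# Two made-up levels standing in for the many real ones.
def level_one(info: Info) -> Info:
    return {**info, "energy": info["energy"] + 1}

def level_two(info: Info) -> Info:
    return {**info, "energy": info["energy"] - 2}

generations = master_loop(initial(), [level_one, level_two])
```

In this toy run the lineage returns TRUE twice and then FALSE. Every living thing, by contrast, sits at the end of a run of TRUE values that has not yet been broken.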

This is where we run into a sticky problem with our definitions of two words: genetic and epigenetic. Epigenesis is a process of successive differentiation, a form of growth or development like adding layers to an onion. Aristotle first described it, and it remains an important concept today. The master loop of translation is well defined by the term epigenesis, and this process is truly at the heart of genetic translation. However, this observation is insufficient to merit global replacement of the term genetic with epigenetic. We would otherwise be forced to discard the term genetic altogether in this discussion. Note that the first division of a zygote begins a progressive differentiation in an organism. An interpretation this literal would mean that all further translation, the second cellular division and all subsequent translation, including mRNA transcription must be considered epigenetic. At the very least we would need an arbitrary distinction, or judgment call, drawing the line on the meaning of ‘successive differentiation.’ This is a false choice, and making it has negative consequences. These semantics do not provide ample reason to abandon the definition of genetic, in my opinion.

The prefix ‘epi’ means next to, apart from, outside of, or surrounding. I will not confuse it with the concept of multiple iterations, or recursions, because these progressive translations are required in decompressing any genetic message. From this standpoint, I will use the term epigenetic to stand for anything that is apart from or outside of something genetic. The irony is not lost on me that although the hallmark of organic computation is epigenesis, the term epigenetic will stand for that which is outside genetic. Otherwise, we will need one rather impotent genetic code, and thousands of tiny epigenetic codes. The word ‘epigenetic’ here is just innately ambiguous. The following diagram will help us further simplify and visualize the metaphor.

In a broad sense, DNA is data, the genetic code is a processor, and the output is a value returned by the function, SURVIVE(Info). Each piece of data can contribute to and be evaluated by its impact on the return value of SURVIVE, but this is the highest level of processing within an organism, and there are countless billions of computations and translations for virtually all of the parts of genetic information. Because there are vastly more translation paths than possible return values, the genetic code cannot even be close to linear at the highest interpretive level. A changing and unknowable environment obligates us to acknowledge this reality of translation. In other words, despite our desire to believe otherwise, there is no direct cause and effect or one-to-one mapping at higher levels when shuffling any isolated part of genetic information at lower levels. Specific changes in information can change survival and many levels of information in between, but they cannot do it when removed from the context of the whole. Genetic information and its translation are context dependent.

In the most general possible sense, we could expand the metaphor back through all time and across all living things. However, this is not the common interpretation of a genetic code metaphor. Science started its investigation with a more extreme reduction and therefore an oversimplified view of the process. Now a restricted slice is taken from the program to stand for genetic translation. Typically, several levels are compressed into one metaphorical level, and they comprise a single dimension of information for a single act of translation that we commonly call ‘The Genetic Code’.

This paradigm has the undeniable advantage of simplicity. All of the rules of translation from DNA to protein can be considered in one small spreadsheet – a simple, linear, one-dimensional genetic look-up table.

Genetic Look-up Table
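
For concreteness, the look-up table can be written out as a small Python dictionary. The sketch below builds the standard 64-codon table (one-letter amino acid symbols, `*` for stop) and uses it as a pure sequence cipher. The names `CODON_TABLE` and `translate` are my own, and the function deliberately ignores start-codon selection, reading-frame choice, and every non-canonical table. It is meant to show how little the spreadsheet contains, not how a cell makes protein.

```python
from itertools import product

BASES = "UCAG"
# Standard table, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG; '*' marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Map an mRNA string to a primary sequence, one codon at a time."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":      # stop codon ends the chain
            break
        protein.append(aa)
    return "".join(protein)


example = translate("AUGCUUUAA")  # Met, Leu, then a stop codon
```

The entire paradigm fits in a sixty-four-entry dictionary and a loop. Note what the function returns: a string of letters, with no shape, no timing, and no context.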

It might turn out that all of the intermediate translation functions cancel out in a single dimension to leave us with this nifty approximation (in selected organisms) but it is unlikely. In fact, I argue that it is already disproved from many angles. Besides, the low algorithmic complexity of this one-dimensional shortcut ignores entire levels and the dimensions of information they carry. It will therefore necessarily compress all of the data from DNA into amino acid sequences, and it was presumably already compressed to a fair degree before finding its way into DNA. Compression of this sort is always intellectually dangerous, because baby and bathwater can sometimes resemble each other. Some microorganisms are known to use the same nucleotide sequence in making different proteins – talk about compression! There is no obvious need for a genetic translation program in the real world to compress information much further through these levels - unless nature somehow provides no alternatives.

Philosophically, it seems that a genetic program will benefit from preservation, or even expansion of information in one form or another during these translation steps. Furthermore, just by comparing the shape of DNA to the shapes of proteins there appears to be a need for rules of translation pertaining to spatial information, and sadly no clues are given by this spreadsheet. The exact rules must be located somewhere in the compressed levels of the shortcut algorithm, but prolonged searches have yet to find them. They never will. In fact, the first guess made by scientists was that we don’t need any spatial rules at all, that all the information required during folding is somehow contained in this table. The task of finding them here is a fool’s errand; it was wishful thinking in the first place.

The spreadsheet could turn out to be a valid simplification in restricted situations for some dimensions of information, but it seems recklessly narrow as the basis for a universal proclamation as ‘The Genetic Code’. Regardless, it is clearly inadequate as a basis for more robust understanding of genetic translation. Conversely, it might turn out that this facile metaphorical compression is actually harming our understanding of genetic translation. Remarkably, high-level investigators have shown little interest in even asking this question. The dogma is tenaciously thick and definitely hardened around this one.

Granted, the familiar, one-dimensional metaphor is widely cherished, but the terms commonly used to describe and apply it are imprecise at best, and negligently incorrect at worst. At the very least we should change its label from ‘The Genetic Code’ to ‘a compressive shortcut map of codon and amino acid correlations in selected organisms’. It isn’t such a glorious title, but the article ‘The’ in its current title is completely misleading. It indicates that ‘the’ one and only genetic code is completely contained in the spreadsheet, but we have discovered many non-canonical spreadsheets, and we should perhaps either number them (TGC1, TGC2… TGCx) or at least show courtesy to schoolchildren and change ‘The Genetic Code’ to ‘A Genetic Code’. More importantly, we already know that valuable genetic information is missing from these spreadsheets, and it is missing in the exact dimension that they are meant to stand for, as well as many others. Consider the following.

This diagram is a shortcut symbolic representation of translation steps for two nucleotide sequences, A and B, into two distinctly different folded proteins. Both translations pass through a level of polypeptide known as the primary sequence of amino acids. The spreadsheet provides a fairly reliable sequence cipher, and so the one-dimensional paradigm takes the bold step of saying that primary sequence is synonymous with primary structure. How do we know this? We take the next bold step and say that since primary structure determines tertiary structure, which is the ultimate shape of a folded protein, all we need is the primary sequence. Proteins are all about shape, so these are fantastic steps.

If nucleotide sequences A and B are identical, then one should expect that both folded proteins will be identical. However, this is rarely true in nature. More interesting, if a nucleotide sequence change is made in a single organism from A to B in which one codon, say for leucine, is substituted with another codon also for leucine, then the primary sequence will not change. These are called synonymous codons, and this sequence mutation is called a silent mutation. It is considered silent because the change presumably cannot be heard within the language. Several studies have shown that this premise is absolutely false. Identical primary sequences after silent mutations can consistently lead to different folded proteins. These folded proteins have different enzymatic properties, and enzymes are high-level components of organic computers - they process information at levels well beyond string formation. The explanation for this must be found in the spreadsheet if it is to serve as the code for turning DNA into protein.
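
The blind spot is easy to demonstrate. In the sketch below, `PARTIAL_TABLE` is a four-entry excerpt of the standard codon table, and the two sequences are made-up examples that differ by one synonymous leucine codon (CUU versus CUG). The function and variable names are hypothetical. The look-up reports identical primary sequences, so whatever distinguishes the two folded proteins cannot be expressed within the table.

```python
# Excerpt of the standard codon table; only the codons used below are listed.
PARTIAL_TABLE = {"AUG": "M", "CUU": "L", "CUG": "L", "GGC": "G"}


def split_codons(mrna: str) -> list:
    """Cut an mRNA string into consecutive three-letter codons."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


def primary_sequence(mrna: str) -> str:
    """Look up each codon; this is all the one-dimensional model can see."""
    return "".join(PARTIAL_TABLE[c] for c in split_codons(mrna))


seq_a = "AUGCUUGGC"   # made-up sequence A
seq_b = "AUGCUGGGC"   # sequence B: CUU replaced by the synonymous CUG

same_primary = primary_sequence(seq_a) == primary_sequence(seq_b)
changed = [(a, b) for a, b in zip(split_codons(seq_a), split_codons(seq_b))
           if a != b]
```

Here `same_primary` is `True` while `changed` still records the substitution. The information that differs between A and B is present in the nucleotide strings and is erased by the look-up.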

Apart from folded proteins, silent mutations have demonstrated profound influence on translation of genetic information in many ways, including rate of translation, translation fidelity, signal peptide function, and the rules for splicing and amplification. Therefore, regardless of the mechanisms, the information contained in a single codon goes well beyond the primary sequence of amino acids. Silent mutations have been shown to significantly affect the competitive dynamics in natural selection, and they have even been implicated in human diseases, such as Hirschsprung disease and medullary thyroid carcinoma. In other words, a ‘silent mutation’ can certainly change the value returned at the highest level of translation, SURVIVE(Info). If this is silence, I’d hate to hear a really loud noise.

None of these findings is anticipated by, or explainable within, a truly one-dimensional model. The theory of a genetic code with a single dimension of information translated across many levels of organic molecules should be in severe crisis because of these discoveries, yet we choose to cling tightly to it. Why?

More to the point, it was an overly optimistic, grandiose rush to judgment when the title ‘The Genetic Code’ was first applied to the codon map. We might choose to call it such, but it is a gross misnomer nonetheless. By now we should clearly recognize that no human is familiar with any such code of translation on virtually any level. Perhaps man has glimpsed a few faulty lines of code, but despite great scientific strides, the logic behind the entire genetic program remains unarguably hidden from our view. Genetic information is far more complex than our present ability to study it. We certainly should not claim to have cracked the genetic code, and we will surely benefit in the future by speaking in terms of multi-dimensional translation, across entire programs, and throughout time. These are not trivial semantics, because our historically loose choice of terms and our inability to even partially define them have had a dramatic, counterproductive impact on our thinking. We have a map of codon information in one dimension, and we should consider it as such. Nothing more. Even at that, it is flawed.

I have asked dozens of working scientists and scoured numerous references, and I have yet to be given an appropriate, useful definition of the genetic code. Definitions, by definition, are whatever we say they are. It’s like the price of gold: it’s not the inherent value of the metal, it’s whatever we’re willing to accept as a value. The worth of a definition should track with its usefulness in describing something. On that score, I will share what is, in my opinion, the ‘best’ definition that I have found. This definition comes from Horton’s Principles of Biochemistry, Third Edition.

genetic code. The correspondence between a particular three nucleotide codon and the amino acid it specifies. The standard genetic code of 64 codons is used by almost all organisms. The genetic code is used to translate the sequence of nucleotides in mRNA into protein.

This is a most artful state-of-the-art definition. It has one of the required appropriate qualifiers, and it avoids the classic pitfall of including words like universal, linear, one-dimensional and simple, which are thrown around with abandon elsewhere. But what does it really tell us? It tells us that the genetic code is a map with only one dimension of information, that dimension being the correlation between codons and amino acids. It says that there are always sixty-four codons, which is probably wrong, and it should stop there, because from there it becomes even more misleading. The final sentence should be replaced with another qualifier: The genetic code is part of an unknown algorithm that living things somehow use to make proteins. The standing definition here strongly implies that the genetic code and nucleotide sequence information, and nothing else, are all that is required in making proteins. This is the stated rule, and of course there are exceptions to every rule, but in this case the exception is the rule. We have yet to find even a single case where the rule applies.

The definition is invalid because Da Vinci was right. Organic molecules are composites of shapes, not lines of identities. Knowing the primary sequence of amino acids has proven to be entirely different from knowing the shape of a protein. Information in addition to primary sequence must be brought to bear during translation. We presently do not have these extra dimensions of information, and we certainly do not have the rules of their application during translation.

Predicting the primary sequence of a protein is relatively easy, but predicting the shape of a folded protein requires tremendous effort. We must compile gigantic piles of data, which must then be thoroughly and creatively massaged, and even still we have less than sterling results. More curious, the best data for massage is nucleotide sequence data, not amino acid sequence data. This implies that the mysteriously missing information might also be found somewhere in nucleotide sequences, more so than amino acids.

Did we really expect it to be that simple? Consider the problem on its face. DNA, for all intents and purposes, is stored in a single changeless shape, whereas proteins are defined by diverse and dynamic shapes. Whatever the process, the translation of information from DNA to protein involves a shift from monotony to diversity. Molecular identities expand only from four to twenty, but bond identities expand from sixteen to millions. The magic of translation and the expansion of organic information therefore lie in the bonds between the molecules of molecular strings. This should be the focus of any investigation, definition, or map of the genetic code.

There are numerous precedents where man maps nature, but then later discovers new information that must be added to the map. This does not mean that the original map was wrong, or useless; it just means that it is outdated and incomplete. Moreover, maps can never be complete, because in some way they must be an encryption, a compressed view of the universe. The universe itself is perhaps incompressible. We can, however, map slices of the universe and be informed by them - but context remains all-important. I am sure that the 8th century Spanish priest Beatus made a significant and useful contribution to human achievement when he drafted his map of the known world.

World Map, circa 776

His main motivation was religious, and so the symbols are related to each other according to religious as well as geographic meaning. This map might well have been labeled ‘Map of the World’ at the time it was created, because at that time it represented the whole known world. However, Beatus surely labored under serious misperceptions, and he quite obviously lacked some key information. He believed that he was mapping a stationary flat object at the center of the universe. He also erroneously concluded that humans with whom he could speak had surveyed all available landmass in the world. Actually, from that perspective it’s really not a bad map. Incompleteness does not diminish the utility or profundity of the insight behind the map. Counterintuitively, and unfortunately for Beatus’ position in mapmaking history, the earth has now been found to be bigger and rounder than expected, and it was incidentally found to be orbiting the sun. In retrospect, Beatus might have improved his map somewhat by wrapping it around a ball, but this would not have been viewed as an improvement at that time. The addition of an extraneous dimension to 8th-century reality could not possibly have been seen as an improvement. Beatus could have argued in favor of such a thing, but they surely would have killed him for it.

Similarly, the initial efforts to map a universal genetic code are rapidly appearing more parochial. We now know for certain that the scale, scope and number of dimensions for the genetic code in our current map are considerably off, and a substantial quantity of critical information is missing from it. Compared to the Beatus map, the genetic spreadsheet is a far less accurate compression of the universe. Suggesting that extra dimensions should be added to the map won’t get you killed today, but it won’t make you very popular either.

Our present mapping of amino acid correlations does not even visually provide an accurate structure for a code of genetic information, let alone a comprehensive mapping of its logic. It is curiously difficult to find anyone today willing to make a stiff defense of it. The codon map is a familiar icon, but it quickly fails on many important details. It is now difficult to define, and no useful definition of it should include anything but a historical reference to the term ‘the genetic code.’ It would be as if, because of Beatus and his map, we decided to call one hemisphere of earth ‘the world’ and the other hemisphere ‘the epi-world’. Of course the first genetic spreadsheet was made at a time when man was mostly ignorant of the things he was attempting to map. Concepts such as complex object-oriented programming, global computer networking, and the human genome databank were all but a twinkle in the cartographers’ eyes. Nonetheless, symbols in our current map have assumed elements of a religious doctrine.

The Genetic Code, circa 2003

Maps are eminently useful, so it is common to forget what they really are. They are representations of things, but they are not the things themselves. It is easy to get carried away with maps, and begin to let languages cloud our definitions and thinking about the actual things. The word ‘synonymous’ is likely to appear in conjunction with the codon map. Synonymous is a powder keg of a word, especially with maps. Technically it means ‘the same as or similar to’ but this is itself misleading. First of all, people hear ‘the same as’ and forget about ‘or similar to’. Secondly, nothing in the universe can truly be the same as something else. This has proven especially true on an atomic level, and it is this very fact that gives the universe its informative nature.

By way of illustration, a tenet of theoretical physics called the Pauli exclusion principle was the final piece in the quantum puzzle leading to a more accurate mapping of the atom. Wolfgang Pauli contributed the ‘spin quantum number’ to numerically describe electrons within an atom. The Pauli exclusion principle says that no two electrons within an atom can share the same four quantum numbers. By adding one binary digit to the existing three quantum numbers, Pauli made the ultimate statement of dualism, and matter was forever theoretically transformed into a purely informative perception. With the stilted parlance of a logician, and the poetic license of… well, a poet, I will paraphrase and expand the Pauli exclusion principle:

Existence is dualistic.

In everyday language this means that no two things can be the same. Two things that are the same thing are not two things; they are one thing. Interestingly, this does not say that two things that don’t exist cannot be the same thing. There are many ways to exist, but perhaps only one way to not exist. Pauli must have noticed something to this effect in an atom, where we can usually count discrete things that we know as electrons, whatever they might be in reality ‘out there’. Pauli essentially said that irrespective of how similar these things appear, no two of these things are the same thing. Pauli stipulated that every electron must somehow be different from every other electron. We will find it quite useful to invoke this principle as we begin to re-examine our perception of the genetic code.

The precise definitions for any one-dimensional genetic metaphor have now scattered in the wind, but we cling to their memory, and we continue to muddle our thoughts with their nostalgic usage. Attempts are no longer made to even partially define terms in a consistent fashion. It is now freakishly possible to speak of a non-synonymous-synonymous codon. In what scene shall the Mad Hatter appear? We might at least try to pin down self-consistent definitions. In the meantime, the following diagram will illustrate the linear paradigm of genetic translation as classically described.

It is instructive to re-examine the outline of the master loop, and instead of compressing the four levels of translation into one level that stands for the genetic code, we can visualize a small computer metaphor between each level in the loop.

Rather than one set of data that is translated in one dimension from nucleotide sequence to folded protein, we can imagine that genetic information is translated through several forms before folding into proteins. Translation at one level produces data for the next level. The information in the DNA string is preserved as it passes through several different strings, changing molecular form as it goes. Each translation level has its own set of informative rules that determines, or processes, the information for the next level. These algorithms are not responsible for making the subunits in the string in each step, only for aligning them in a sequence. In other words, fully formed, individual molecules, such as tRNA, are available to this process during this phase of translation. In shorthand: three DNA nucleotides = three mRNA nucleotides = one tRNA = one amino acid.

The first level of the process inputs Info from DNA into a function that translates it into mRNA. There is a finite set of rules, or an algorithm to describe this function. The Info from mRNA is fed into another function with another finite set of rules that translate the Info into tRNA. Perhaps investigators have yet to give us enough experimental data at this level to track the string Info accurately through a string made of tRNA molecules, but we can establish parameters at each level and take broad measurements of information entropy. These entropy-tracking protocols can then be used to query the process and broadly investigate the form and flow of information within and between various organic programs provided by nature.
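
One way to make such a protocol tangible is sketched below. It is an assumption on my part that ‘broad measurements of information entropy’ can be approximated by the Shannon entropy of symbol frequencies at each level. The function names, the made-up DNA string, and the four-entry excerpt of the standard codon table are all hypothetical illustration. The tRNA level is modeled only as the anticodon of each codon and ignores wobble pairing.

```python
import math
from collections import Counter


def shannon_entropy(symbols) -> float:
    """Entropy in bits per symbol of the observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: same letters, with T written as U."""
    return coding_dna.replace("T", "U")


def split_codons(mrna: str) -> list:
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


COMPLEMENT = str.maketrans("ACGU", "UGCA")


def anticodon(codon: str) -> str:
    """Reverse complement of a codon, written 5' to 3' (no wobble)."""
    return codon.translate(COMPLEMENT)[::-1]


# Excerpt of the standard codon table; only the codons used below are listed.
PARTIAL_TABLE = {"AUG": "M", "CUU": "L", "CUG": "L", "GGC": "G"}

dna = "ATGCTTCTGGGC"                      # made-up example string
mrna = transcribe(dna)
codon_level = split_codons(mrna)
trna_level = [anticodon(c) for c in codon_level]
amino_level = [PARTIAL_TABLE[c] for c in codon_level]

entropy_by_level = {
    "DNA": shannon_entropy(dna),
    "mRNA": shannon_entropy(mrna),
    "codons": shannon_entropy(codon_level),
    "tRNA": shannon_entropy(trna_level),
    "amino acids": shannon_entropy(amino_level),
}
```

In this toy run the codon and tRNA levels each carry 2.0 bits per symbol, while the amino acid level carries 1.5, because the two leucine codons collapse into a single letter. A crude measurement of this kind already shows where a level of translation compresses the string, which is the sort of query the protocol is meant to support.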

Having said all that (actually, having cut and pasted all that from various writings) I will now propose some choices. We can define the genetic code in a broad sense or very simply in a restricted sense. Traditionally, we have chosen an extremely restricted definition, but even here there are choices. Either the genetic code embodies the rules for sequencing and only sequencing amino acids, or it embodies the rules for making entire proteins. Although we were deluded for quite some time into believing that these choices are identical, they clearly are not. Our definition of the genetic code should at least be taken to include the information in proteins, not just their sequences. Agreeing on a definition is a good first step toward actually understanding the code and how it works.

Material on this Website is copyright Rafiki, Inc. 2003 ©
Last updated September 29, 2003 2:56 PM