
How much genetic information gets into proteins?

Previous sections have described the nature of information in general, and discussed a broad perspective on the genetic code. Now we are curious about the amount of genetic information that finds its way into proteins. Knowing this first may help pin down how it gets there.

Traditionally, this issue is framed in terms of the number of possible combinations in nucleotide sequences, which has come to stand for the range of possible genetic information that must somehow propagate forward in translation. Here we will attack the question in reverse, starting with the number of combinations in protein space and working backward to the nucleotide sequences.

We will consider a protein to be defined as an actual conformation of a string of any number of amino acids. The space of all possible proteins includes all lengths, sequences and space-filling patterns of such biopolymers. If we assume for purposes of discussion that every protein in the space of all possible proteins is equally probable, then the information entropy of each protein is easy to calculate: it is log base 2 of the number of possible proteins. Furthermore, each nucleotide sequence that maps to, or makes, one of these proteins must somehow contain the proper amount of genetic information.
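As a minimal sketch of that calculation, assuming the equal-probability simplification just described: the Shannon entropy of n equally likely outcomes reduces to log base 2 of n. The function names and the toy count of 1,024 proteins are illustrative assumptions, not figures from any dataset.

```python
import math

def shannon_entropy_bits(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over nonzero p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform_entropy_bits(count):
    """Entropy in bits of `count` equally probable outcomes: log2(count)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return math.log2(count)

# Toy check with a hypothetical space of 1,024 equally probable proteins.
n = 1024
print(uniform_entropy_bits(n))             # 10.0
print(shannon_entropy_bits([1 / n] * n))   # approximately 10.0
```

Any departure from equal probability can only lower the entropy below log2(n), so the uniform figure is an upper bound for a space of that size.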

In this over-simplified circumstance we can view the two sets of biopolymers as two distinct alphabets, whose symbol or character usage is evenly distributed. We will call the alphabet of all possible proteins ‘P’ and the alphabet of all possible nucleotide sequences ‘N’. In the simplest possible case there would be one character in N for every character in P. In reality, this probably could never be the case, and empirical data suggest that it is in fact not the case. So now we must begin considering the absolute and relative sizes of the two alphabets, and the logic that maps one to the other.

Consider the processes in nature that restrict the sizes of these two alphabets. Any restrictive process will decrease the amount of genetic information that finds its way into proteins. For instance, let's say that we put restrictions on the number and type of residues that can be included in a protein. Consider an extreme example where the process is limited to strings of only leucine, and the strings can only be between two and nine residues long. How much genetic information can be put into these proteins?

This hypothetical case of severe restriction is merely a subset of the traditional picture, much in the same way that reality must be a subset of theoretical P and N. We will merely stipulate that only twenty amino acids and four nucleotides are available (actually five) in standard genetic processes. We will take the restrictions of the standard set to be synonymous with the term ‘all possible.’ This stipulation means that P and N are in fact vanishingly small subsets of theoretical P and N, but empirically we know that nature has found a way to impose the same set of restrictions in almost every organism.

In our hypothetical case of leucine-only proteins, without considering dimensions of information beyond sequence, there are now only eight ways to make a protein, one for each allowed length. But what about the two obvious conformations of di-leucine, cis and trans di-leucine? If we string four leucines together, then based solely on cis and trans bonds there are eight possible permutations, one choice for each of the three peptide bonds. If we string eight leucines there are 128 distinct cis-trans conformations. Considering only cis and trans bonds, there are 510 conformations in this hypothetically super-restricted protein space, but what is the actual space of all possible poly-leucines given all parameters? I should think that it is actually much higher than 510, and certainly higher than 8. What are the circumstances in which one or more of these poly-leucines might consistently appear within a cell, or can they appear at all? More importantly, if more than eight appear, where could the information to create these restricted proteins get into the system? So generally the question becomes, what are the parameters available in a cell that determine P?
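The cis-trans counts can be tallied with a short sketch. It assumes one independent cis-or-trans choice per peptide bond, so a chain of n residues has n - 1 bonds and 2^(n - 1) patterns; that counting rule and the names below are assumptions made for illustration only.

```python
import math

MIN_LEN, MAX_LEN = 2, 9  # chain lengths allowed in the thought experiment

def cis_trans_count(residues):
    """Patterns when each of the (residues - 1) peptide bonds is cis or trans."""
    return 2 ** (residues - 1)

lengths = range(MIN_LEN, MAX_LEN + 1)
sequence_only = len(lengths)  # one poly-leucine per allowed length
with_cis_trans = sum(cis_trans_count(n) for n in lengths)

print(sequence_only, math.log2(sequence_only))                      # 8 3.0
print(cis_trans_count(2), cis_trans_count(4), cis_trans_count(8))   # 2 8 128
print(with_cis_trans, round(math.log2(with_cis_trans), 2))          # 510 8.99
```

Under that counting, lengths two through nine give 510 patterns in total; dropping the nine-residue chain would give 254. Either way, a sequence-only description carries 3 bits, while even this crude conformational description needs roughly 9.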

Once we impose any restriction on P, beyond the standard stipulation above, we must create a subset of P, call it P’. How are these restrictions imposed, at what cost, and to what benefit? Restricting P’ to discrete amino acid sequences from P is an extremely severe restriction, so one must wonder how such a thing could ever be accomplished. For every sequence there is an incredibly large number of ways that the sequence might fill space. Dogma claims that one of these ways is so much more probable that it will occur in all instances, but this is statistically implausible, made less believable since there is no evidence that it actually happens.

Looking at it from this reverse angle gives one a completely different perspective. Now we are contemplating a search for an active process of placing restrictions. In reality it is a process of eliminating information, and the erasure of information requires energy. From an engineering standpoint, it is difficult to imagine how life could achieve this monumental task, let alone why it would be beneficial to do so. Is it actually so desirable to think that this information must get eliminated? The dogma, however, assumes there need be no mechanism or cost involved in eliminating it. Furthermore, dogma fails to recognize the severity of the restriction being proposed. There must be an equally severe mechanism to impose it. No such restriction has been proven, and no such mechanism has been found.

Consider a single string of 100 amino acids. Let’s suppose that each adjacent pair of amino acids is known to exist in one of 10 possible conformations. There are then 10^100 different conformations for this protein that share the same sequence. Call this number H, for huge, and call the one conformation that is most likely to occur H’. Think of the pre-folded sequence as a ball bearing at the top of a hill. Each potential conformation is like a hole or pocket that might catch the ball bearing as it rolls down the hill. Consider all the possible variables that might affect the ball bearing as it rolls down the hill – spin, humidity, temperature, wind, the gravitational pull of the moon. What are the odds of finding H’ from within H with each roll down the hill? How do we deal with the circumstances when any of these variables, or perhaps all of them change independently?
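To put a rough number on those odds, here is a small sketch that assumes, purely for illustration, that every conformation is equally likely to catch the ball bearing. It uses the round figure of 100 ten-way choices from the example (strictly, 100 residues have 99 adjacent pairs); the constant and variable names are hypothetical.

```python
import math
from fractions import Fraction

CHOICES_PER_STEP = 10  # conformations per adjacent pair (from the thought experiment)
STEPS = 100            # round figure used in the text

H = CHOICES_PER_STEP ** STEPS        # exact big integer, ten to the hundredth power
odds_of_h_prime = Fraction(1, H)     # chance of one specific conformation per roll
bits_to_pick_h_prime = math.log2(H)  # information needed to single out H'

print(len(str(H)) - 1)                 # 100 (the exponent of ten)
print(round(bits_to_pick_h_prime, 1))  # 332.2
```

On this assumption, reliably landing in H’ means supplying or erasing about 332 bits of information for a single 100-residue chain.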

The classic paradigm superficially addresses this problem, and from it we have apparently built the equivalent of a tube for the ball to roll in a straight shot to H’. The tube somehow protects the rolling ball from all variables that might deflect its course down the hill. This tube cannot be free of cost, and it cannot be invisible to observation. Where is this magical tube, and where is the bill for its construction?

It is perhaps easy to shrug off this portion of the thought experiment, as has been done, but keep in mind that we are only halfway finished. P’ sits between P and N, and we have merely restricted P to P’. Now we must consider restricting N to P’. There are approximately three synonymous codons for every amino acid in our string of 100 amino acids, so there are about 3^100 members of N that point to H’. This is roughly 5 x 10^47, far smaller than H but still astronomically large. In reality, each member of N that points to H’ represents a different set of variables to start the journey of the ball bearing down the hill. Now we really do need that protective tube if we are to find H’, against all odds, every time.
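The size of that synonymous set can be checked with a few lines of arithmetic. The sketch below assumes the standard table of 64 codons with 3 stop codons, which leaves 61 sense codons shared among 20 amino acids, and then uses the rounded figure of three synonymous codons per residue from the paragraph above. All names are illustrative.

```python
import math

SENSE_CODONS = 64 - 3  # standard code: 64 codons minus 3 stop codons
AMINO_ACIDS = 20
RESIDUES = 100

mean_synonyms = SENSE_CODONS / AMINO_ACIDS  # 3.05, rounded to 3 in the text
synonymous_sequences = 3 ** RESIDUES        # members of N encoding one amino acid sequence
conformations = 10 ** RESIDUES              # H from the earlier example

print(mean_synonyms)                                # 3.05
print(round(math.log10(synonymous_sequences), 1))   # 47.7
print(round(math.log10(conformations), 1))          # 100.0
```

On these assumptions 3^100 is about 5 x 10^47. The uniform average is only a convenience; real codon degeneracy ranges from one to six codons per amino acid.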

Not all synonymous codons are created equal. We like to see them as such, but it is absurd to do so. The reason is simple: codons specify anti-codons, not amino acids. While two codons might be assigned the same amino acid, they also might be assigned different tRNA to deliver that amino acid. More interesting, the same codon can be assigned different tRNAs. Because of wobble, there are two and one-half times as many anti-codons as codons. In a cell there might be more than one tRNA for every anti-codon. Why the expansion?

There is now a new alphabet to consider, call it T for translation. T represents the alphabet of all possible strings of tRNA molecules. It sits between and translates N into P. As we just saw, T is much larger than N. T is a more accurate representation than N of all the possible variables at the start of the journey down the hill of protein conformations. What this means is that T makes it much harder, not easier for N to find H’, the needle in a universal haystack. One should suspect that the role of T is to ensure that a single H’ does not exist for every sequence of amino acids, because it is a lousy mechanism for restricting N to P’. Life would never select such a lousy mechanism at this critical function. Therefore, we should begin to suspect that the primary role of a robust population of tRNA molecules is to expand P’ beyond the severe restriction of translating sequence-only information.

At some point in translation the importance of sequence gives way to the importance of conformation. It is the role of tRNA to see that this happens logically and consistently. T stands for ticket out of sequence space.

It doesn't really matter when or how actual protein conformations break from their restricted sequence space, just whether or not they do, and to what extent. If more than one conformation somehow consistently emerges from any given sequence, then the information content of all proteins must go up. The mechanism for storing and delivering this information must then be addressed.

These are the kinds of basic questions that must be answered first, and then the process must be worked backward to determine the amount of genetic information that impacts mechanisms of translation. If we return to the idea of a space of all possible proteins we can see that the space of all possible amino acid sequences represents a tiny, tiny, tiny subset of that space. The same questions apply. What possible information handling methods could restrict genetic translation to this extremely restricted subspace? How could the 'one-dimensional model' possibly account for the known instances where translation strays from this restricted subspace? Why would we expect life to choose this as the optimum method for genetic translation?

Life in general emerges from a search for solutions. In the case of translating genetic information, the search begins with the space of all possible proteins. We should expect life to search this space in a practical and clever fashion. Restricting the search to sequence-only is sub-optimal and impractical. Why do we expect life to do so? Hell, why do we virtually insist on believing that it does?

There is no longer any practical utility to calling the genetic code one-dimensional. Originally it gave us license to throw away nucleotide sequence data and concentrate on amino acid sequences in studies of proteins. Curiously, the boys in the field are now scrambling to bring the nucleotide sequences back into the studies. Why? If the information they contain maps so conveniently to the subspace of all possible amino acid sequences, what use are they? The obvious answer is that nucleotide sequences contain more genetic information than can be contained by the subspace of all possible amino acid sequences. These additional dimensions of genetic information are being translated - somehow - into proteins.

To finally answer the question: how much genetic information finds its way into proteins?

It depends. It depends on T, because T has the ability to expand the information that travels from N to P. The population of tRNA in a cell can expand information in nucleotide sequences. More robust tRNA populations can create more diverse sets of proteins. We need to recognize this reality so that we can begin to examine the mechanisms, and ultimately the code behind this translation. What variables can tRNA control, and how does the genetic code specify these variables?


Material on this Website is copyright Rafiki, Inc. 2003 ©
Last updated September 2, 2003 9:58 AM