The Model
Why does biological evolution work? And, for that matter, why does machine learning work? Both are examples of adaptive processes that surprise us with what they manage to achieve. So what’s the essence of what’s going on? I’m going to concentrate here on biological evolution, though much of what I’ll discuss is also relevant to machine learning—but I’ll plan to explore that in more detail elsewhere.
OK, so what is an appropriate minimal model for biology? My core idea here is to think of biological organisms as computational systems that develop by following simple underlying rules. These underlying rules in effect correspond to the genotype of the organism; the result of running them is in effect its phenotype. Cellular automata provide a convenient example of this kind of setup. Here’s an example involving cells with 3 possible colors; the rules are shown on the left, and the behavior they generate is shown on the right:
We’re starting from a single ( ) cell, and we see that from this “seed” a structure is grown—which in this case dies out after 51 steps. And in a sense it’s already remarkable that we can generate a structure that neither goes on forever nor dies out quickly—but instead manages to live (in this case) for exactly 51 steps.
But let’s say we start from the trivial rule that makes any pattern die out immediately. Can we end up “adaptively evolving” to the rule above? Imagine making a sequence of randomly chosen “point mutations”—each changing just one outcome in the rule, as in:
Then suppose that at each step—in a minimal analog of natural selection—we “accept” any mutation that makes the lifetime longer (though not infinite), or at least the same as before, and we reject any mutation that makes the lifetime shorter, or infinite. It turns out that with this procedure we can indeed “adaptively evolve” to the rule above (where here we’re showing only “waypoints” of progressively greater lifetime):
Different sequences of random mutations give different sequences of rules. But the remarkable fact is that in almost all cases it’s possible to “make progress”—and routinely reach rules that give long-lived patterns (here with lifetimes 107, 162 and 723) with elaborate morphological structure:
Is it “obvious” that our simple process of adaptive evolution will be able to successfully “wrangle” things to achieve this? No. But the fact that it can seems to be at the heart of why biological evolution manages to work.
Looking at the sequences of pictures above we see that there are often in effect “different mechanisms” for producing long lifetimes that emerge in different sequences of rules. Typically we first see the mechanism in simple form, then as the adaptive process continues, the mechanism gets progressively more developed, elaborated and built on\[LongDash]not unlike what we often appear to see in the fossil record of biological evolution.
But let’s drill down and look in a little more detail at what’s happening in the simple model we’re using. In the 3-color nearest-neighbor (k = 3, r = 1) cellular automata we’re considering, there are 26 (= 33 – 1) relevant cases in the rule (there’d be 27 if we didn’t insist that
). “Point mutations” affect a single case, changing it to one of two (= 3 – 1) possible alternative outcomes—so that there are altogether 52 (= 26 × 2) possible distinct “point mutations” that can be made to a given rule.
For example, starting from the rule
the results of possible single point mutations are:
And even with such point mutations there’s usually considerable diversity in the behavior they generate:
In quite a few cases the pattern generated is exactly the same as the one for the original rule. In other cases it dies out more quickly—or it doesn’t die out at all (either becoming periodic, or growing forever). And in this particular example, in just one case it achieves “higher fitness”, surviving longer.
If we make a sequence of random mutations, many will produce shorter-lived or infinite lifetime (“tumor”) patterns, and these we’ll reject (or, in biological terms, we can imagine they’re “selected out”):
But still there can be many “neutral mutations” that don’t change the final pattern (or at least give a pattern of the same length). And at first we might think that these don’t achieve anything. But actually they’re critical in allowing single point mutations to build up to larger mutations that can eventually give longer-lived patterns:
Tracing our whole adaptive evolution process above, the total number of point mutations involved in getting from one (increasingly long-lived) “fitness waypoint” to another is:
Here are the underlying rules associated with these fitness waypoints (where the numbers count “cumulative” mutations, ignoring ones that go “back and forth”)
where we’re indicating here the number of actual “accepted mutations” between these “waypoint rules”.
One way to get a sense of what’s going on is to take the whole sequence of (“accepted”) rules in the adaptive evolution process, and plot them in a dimension-reduced rendering of the (27-dimensional) rule space:
There are periods where there’s a lot of “wandering around” going, with many mutations needed to “make progress”. And there are other periods when things go much faster, and fewer mutations are needed.
As another way to see what’s going on, we can plot the maximum lifetime achieved so far against the total number of mutation steps made:
We see plateaus (including an extremely long one) in which “no progress” is made, punctuated by sometimes-quite-large, sudden changes, often brought on by just a single mutation.
If we include “rejected mutations” we see that there’s a lot of activity going on even in the plateaus; it just doesn’t manage to make progress (one can think of each red dot that lies below a plateau as being like a mutation—or an organism—that “doesn’t make it”, and is selected out):
It’s worth noting that there can be multiple different (“phenotype”) patterns that occur across a plateau. Here’s what one sees in the particular example we’re considering:
But even between these “phenotypically different” cases, there can be many “genotypically different” rules. And in a sense this isn’t surprising, because usually only parts of the underlying rule are “coding”; other parts are “noncoding”, in the sense that they’re not sampled during the generation of the pattern from that rule.
And for example this highlights for each “fitness waypoint rule” which cells make use of a “fresh” case in the rule that hasn’t so far been sampled during the generation of the pattern:
And we see that even in the last rule shown here, only 18 of the 26 relevant cases in the rule are actually ever sampled during the generation of the pattern (from the particular, single-red-cell initial condition used). So this means that 8 cases in the rule are “undetermined” from the phenotype, implying that there are 38 = 6561 possible genotypes (i.e. rules) that will give the same result.
So far we’ve mostly been talking about one particular random sequence of mutations. But what happens if we look at many possible such sequences? Here’s how the longest lifetime (or, in effect, “fitness”) increases for 100 different sequences of random mutations:
And what’s perhaps most notable here is that it seems as if these adaptive processes indeed don’t “get stuck”. It may take a while (with the result that there are long plateaus) but these pictures suggest that eventually “adaptive evolution will find a way”, and one will get to rules that show longer lifetimes—as the progressive development of the distribution of lifetimes reflects:
The Multiway Graph of All Possible Mutation Histories
In what we’ve done so far we’ve always been discussing particular paths of adaptive evolution, determined by particular sequences of random mutations. But a powerful way to get a more global view of the process of adaptive evolution is to look—in the spirit of our Physics Project, the ruliad, etc.—not just at individual paths of adaptive evolution, but instead at the multiway graph of all possible paths. (And in making a correspondence with biology, multiway graphs give us a way to talk about adaptive evolution not just of individual sequences of organisms, but also populations.)
To start our discussion, let’s consider not the 3-color cellular automata of the previous section, but instead (nearest-neighbor) 2-color cellular automata—for which there are just 128 possible relevant rules. How are these rules related by point mutations? We can construct a graph of every possible way that one rule from this set can be transformed to another by a single point mutation:
If we imagine 5-bit rather than 7-bit rules, there are only 16 relevant ones, and we can readily see that the graph of possible mutations has the form of a Boolean hypercube:
Let’s say we start from the “null rule” . Then we enumerate the rules obtained by a single point mutation (and therefore directly connected to the null rule in the graph above)—then we see what behavior they produce, say from the initial condition … …:
Some of these rules we can view as “making progress”, in the sense that they yield patterns with longer lifetimes (not impressively longer, just 2 rather than 1). But other rules “make no progress” or generate patterns that “live forever”. Keeping only mutations that don’t lead to shorter or infinite lifetimes, we can construct a multiway graph that shows all possible mutation paths:
Although this is a very small graph (with just 15 rules appearing), we can already see hints of some important phenomena. There are “fitness-neutral” mutations that can “go both ways”. But there are also plenty of mutations that only “go one way”—because the other way they would decrease fitness. And a notable feature of the graph is that once one’s “committed” to a particular part of the graph, one often can’t reach a different one—suggesting an analogy to the existence of distinct branches in the tree of life.
Moving beyond 2-color, nearest-neighbor (k = 2, r = 1) cellular automata, we can consider k = 2, r = ones. A typical such cellular automaton is:
For k = 2, r = 1 there were a total of 128 (= 223 – 1) relevant rules. For k = 2, r = , there are a total of 32,768 (= 224 – 1). Starting with the null rule, and again using initial condition … …, here are a few specific examples of adaptive evolution paths for such cellular automata:
And here is the beginning of the multiway graph for k = 2, r = rules—showing rules reached by up to two mutations starting from the null rule:
This graph contains many examples of “fitness-neutral sets”—rules that have the same fitness and that can be transformed into each other by mutations. A few examples of such fitness-neutral sets:
In the first case here, the “morphology of the phenotypic patterns” is the same for all “genotypic rules” in the fitness-neutral set. But in the other cases there are multiple morphologies within a single fitness-neutral set.
If we included all individual rules we’d get a complete k = 2, r = multiway graph with a total of 1884 nodes. But if we just include one representative from every fitness-neutral set, we get a more manageable multiway graph, with a total of 86 nodes:
Keeping only one representative from pairs of patterns that are related by left-right symmetry, we get a still-simpler graph, now with a total of 49 nodes:
There’s quite a lot of structure in this graph, with both divergence and convergence of possible paths. But overall, there’s a certain sense that different sections of the graph separate into distinct branches in which adaptive evolution in effect “pursues different ideas” about to how increase fitness (i.e. lifetime of patterns).
We can think of fitness-neutral sets as representing a certain kind of equivalence class of rules. There’s quite a range of possible structures to these sets—from ones with a single element to ones with many elements but few distinct morphologies, to ones with different morphologies for every element:
What about larger spaces of rules? For k = 2, r = 2 there are altogether about 2 billion (225 – 1) relevant rules. But if we choose to look only at left-right symmetric ones, this number is reduced to 524,288 (= 219). Here are some examples of sequences of rules produced by adaptive evolution in this case, starting from the null rule, and allowing only mutations that preserve symmetry (and now using a single black cell as the initial condition):
Once again we can identify fitness-neutral sets—though this time, in the vast majority of cases, the patterns generated by all members of a given set are the same:
Reducing out fitness-neutral sets, we can then compute the complete (transitively reduced) multiway graph for symmetric k = 2, r = 2 rules (containing a total of 60 nodes):
By reducing out fitness-neutral sets, we’re creating a multiway graph in which every edge represents a mutation that “makes progress” in increasing fitness. But actual paths of adaptive evolution based on random sequences of mutations can do any amount of “rattling around” within fitness-neutral sets—not to mention “trying” mutations that decrease fitness—before reaching mutations that “make progress”. So this means that even though the reduced multiway graph we’ve drawn suggests that the maximum number of steps (i.e. mutations) needed to adaptively evolve from the null rule to any other is 9, it can actually take any number of steps because of the “rattling around” within fitness-neutral sets.
Here’s an example of a sequence of accepted mutations in a particular adaptive evolution process—with the mutations that “make progress” highlighted, and numbers indicating rejected mutations:
We can see “rattling around” in a fitness-neutral set, with a cycle of morphologies being generated. But while this represents one way to reach the final pattern, there are also plenty of others, potentially involving many fewer mutations. And indeed one can determine from the multiway graph that an absolutely shortest path is:
This involves the sequence of rules:
We’re starting from the null rule, and at each step making a single point mutation (though because of symmetry two bits can sometimes be changed). The first few mutations don’t end up changing the “phenotypic behavior”. But after a while, enough mutations (here 6) have built up that we get morphologically different behavior. And after just 3 more mutations, we end up with our final pattern.
Our original random sequence of mutations gets to the same result, but in a much more tortuous way, doing a total of 169 mutations which often cancel each other out:
In drawing a multiway graph, we’re defining what evolutionary paths are possible. But what about probabilities? If we assume that every point mutation is equally likely, we can in effect “analyze the flow” in the multiway graph, and determine the ultimate probability that each rule will be reached:
The Fitness Landscape
Multiway graphs give a very global view of adaptive evolution. But in understanding the process of adaptive evolution, it’s also often useful to think somewhat more locally. We can imagine that all the possible rules are laid out in a certain space, and that adaptive evolution is trying to find appropriate paths in this space. Potentially we can suppose that there’s a “fitness landscape” defined in the space, and that adaptive evolution is trying to follow a path that progressively ascends to higher peaks of fitness.
Let’s consider again the very first example we gave above—of adaptive evolution in the space of 3-color cellular automata. At each step in this adaptive evolution, there are 52 possible point mutations that can be made to the rule. And one can think of each of these mutations as corresponding to making an “elementary move” in a different direction in the (26-dimensional) space of rules.
Here’s a visual representation of what’s going on, based on the particular path of adaptive evolution from our very first example above:
What we’re showing here is in effect the sequence of “decisions” that are being made to get us from one “fitness waypoint” to another. Different possible mutations are represented by different radial directions, with the length of each line being proportional to the fitness achieved by doing that mutation. At each step the gray disk represents the previous fitness. And what we see is that many possible mutations lead to lower fitness outcomes, shown “within the disk”. But there are at least some mutations that have higher fitness, and “escape the disk”.
In the multiway graph, we’d trace every mutation that leads to higher fitness. But for a particular path of adaptive evolution as we’ve discussed it so far, we imagine we always just pick at random one mutation from this set—as indicated here by a red line. (Later we’ll discuss different strategies.)
Our radial icons can be thought of as giving a representation of the “local derivative” at each point in the space of rules, with longer lines corresponding to directions with larger slopes “up the fitness landscape”.
But what happens if we want to “knit together” these local derivatives to form a picture of the whole space? Needless to say, it’s complicated. And as a first example, consider k = 2, r = 1 cellular automaton rules.
There are a total of 128 relevant such rules, that (as we discussed above) can be thought of as connected by point mutations to form on a 7-dimensional Boolean hypercube. As also discussed above, of all 128 relevant rules, only 15 appear in adaptive evolution processes (the others are in effect never selected because they represent lower fitness). But now we can ask where these rules lie on the whole hypercube:
Each node here represents a rule, with the size of the highlighted nodes indicating their corresponding fitness (computed from lifetime with initial condition … …). The node shown in green corresponds to the null rule.
Rendering this in 3D, with fitness shown as height, we get what we can consider a “fitness landscape”:
And now we can think of our adaptive evolution as proceeding along paths that never go to nodes with lower height on this landscape.
We get a more filled-in “fitness landscape” when we look at k = 2, r = rules (here with initial condition … …):
Adaptive evolution must trace out a “never-go-down” path on this landscape:
Along this path, we can make “derivative” pictures like the ones above to represent “local topography” around each point—indicating which of the possible upwards-on-the-landscape directions is taken:
The rule space over which our “fitness landscape” is defined is ultimately discrete and effectively very high-dimensional (15-dimensional for k = 2, r = rules)—and it’s quite challenging to produce an interpretable visualization of it in 3D. We’d like it if we could lay out our rendering of the rule space so that rules which differ just by one mutation are a fixed (“elementary”) 2D distance apart. In general this won’t be possible, but we’re trying to at least approximate this by finding a good layout for the underlying “mutation graph”.
Using this layout we can in principle make a “fitness landscape surface” by interpolating between discrete points. It’s not clear how meaningful this is, but it’s perhaps useful in engaging our spatial intuition:
We can try machine learning and dimension reduction, operating on the set of “rule vectors” (i.e. outcome lists) that won’t be rejected in our adaptive evolution process—and the results are perhaps slightly better:
By the way, if we use this dimension reduction for rule space, here’s how the behavior of rules lays out:
And here, for comparison, is a feature space plot based on the visual appearance of these patterns:
The Whole Space: Exhaustive Search vs. Adaptive Evolution
In adaptive evolution, we start, say, from the null rule and then make random mutations to try to reach rules with progressively larger fitness. But what about just exhaustively searching the complete space of possible rules? The number of rules rapidly becomes unmanageably big—but some cases are definitely accessible:
For example, there are just 524,288 symmetric k = 2, r = 2 rules—of which 77,624 generate phenotypic patterns with finite lifetimes. Ultimately, though, there are just 77 distinct patterns that appear, with varying multiplicity (which at least in this case is always associated with “unused bits” in the rule):
How do these exhaustive results compare with what’s generated in the multiway graph of adaptive evolutions? They’re almost the same, but for the addition of the two cases
which are generated by rules of the form (where the gray entries don’t matter):
Why don’t such rules ever appear in our adaptive evolution? The reason is that there isn’t a chain of point mutations starting from the null rule that can reach these rules without going through rules that would be rejected by our adaptive evolution process. If we draw a multiway graph that includes every possible “acceptable” rule, then we’ll see a separate part in the graph, with its own root, that contains rules that can’t be reached by our adaptive evolution from the null rule:
So now if we look at all (symmetric k = 2, r = 2) rules, here’s the distribution of lifetimes we get:
<
The maximum, as seen above, is 65. The overall distribution roughly follows a power law, with an exponent around –3:
As we saw above, not all rules make use of all their bits (i.e. outcomes) in generating phenotypic patterns. But what we see is that the larger the lifetime achieved, the more bits tend to be needed:
And in a sense this isn’t surprising: as we’ll discuss later, we can expect to need “more bits in the program” to specify more elaborate behavior—or, in particular, behavior that embodies a larger number for lifetime.
So what about general k = 2, r = 2 rules? There are 231 ≅ 2 billion of these. If we exhaustively search through them, we find about 75 million that have finite lifetimes. The distribution of these lifetimes is:
Again it’s roughly a power law, but with exponent around –3.5:
Here are the actual 100 patterns produced that have the longest lifetimes (in all asymmetric cases there are also rules giving left-right flipped patterns):
It’s interesting to see here a variety of “qualitatively different ideas” being used by different rules. Some (like the one with lifetime 151) we might somehow imagine could have been constructed specifically “for the purpose” of having their particular, long lifetime. But others (like the one with lifetime 308) somehow seem more “coincidental”—behaving in an apparently random way, and then just “happening to die out” after a certain number of steps.
Since we found these rules by exhaustive search, we know they’re the only possible ones with such long lifetimes (at least with k = 2, r = 2). So then we can infer that the ornate structures we see are in some sense necessary to achieve the objective of, say, having a finite lifetime of more than 100 steps. So that means that if we go through a process of adaptive evolution and achieve a lifetime above 100 steps, we see a complex pattern of behavior not because of “complicated choices in our process of adaptive evolution”, but rather because to achieve such a lifetime one has no choice but to use such a complex pattern. Or, in other words, the complexity we see is a reflection of “computational necessity”, not historical accidents of adaptive evolution.
Note also that (as we’ll discuss in more detail below) there are certain behaviors we can get, and others that we cannot. So, for example, there is a rule that gives lifetime 308, but none that gives lifetime 300. (Though, yes, if we used more complicated initial conditions or a more complicated family of rules we could find such a rule.)
Much as we saw in the symmetric k = 2, r = 2 case above, almost any long lifetimes require using all the available bits in the rule:
<
But, needless to say, there’s an exception—a pair of rules with lifetime 84 where the outcome for the case doesn’t matter:
But, OK, so can these long-lifetime rules be reached by single-mutation adaptive evolution from the null rule? Rather than trying to construct the whole multiway graph for the general k = 2, r = 2 case starting from the null rule, we can instead construct what’s in effect an inverse multiway graph in which we start from a given rule, then successively make all single point mutations that reach rules with progressively shorter lifetimes:
And what we see is that at least in this case such a procedure never reaches the null rule. The “furthest” it gets is to lifetime-2 rules, and among these rules the closest to the null rule are:
But it turns out that there’s no way to reach these 2-bit rules by a single point mutation from any of the 26 1-bit rules that aren’t rejected by our adaptive evolution process. And in fact this isn’t just an issue for this particular long-lifetime rule: it turns out that it’s something quite general among k = 2 rules. Constructing the “forward” multiway graph starting from the null rule, we find we can only ever reach lifetime-1 rules.
Ultimately this is a particular feature of rules with just 2 colors—it’s specific to starting with something like the null rule that has lifetime 1—but it’s an illustration of the fact that there can even be large swaths of rule space that can’t be reached by adaptive evolution with single point mutations.
What about symmetric k = 2, r = 2 rules? Well, to maintain symmetry we have to deal with mutations that change not just one but two bits. And the result of this is that (except in the cases we discovered above) the inverse multiway system starting from long-lifetime rules always successfully reaches the null rule:
There’s something else to notice here, however. Looking at this graph, we see that there’s a way to get with just one 2-bit mutation from a lifetime-1 to a lifetime-65 rule:
We didn’t see this in our multiway graph above because we had applied transitive reduction to it. But if we don’t do that, we find that there are a few large lifetime jumps—visible on this plot of possible lifetimes before and after a single point mutation:
Going beyond k = 2, r = 2 rules, we can consider symmetric k = 3, r = 1 rules, of which there are 317, or about 129 million. The distribution of lifetimes in this case is
which again roughly fits a power law, again with exponent around –3.5:
But now the maximum lifetime found is not just 308, but 2194:
One again, there are some different “ideas” on display, with a few curious examples of convergence—such as the rules we see with lifetimes 989 and 990 (as well as 1068 and 1069) which give essentially the same patterns after just exchanging colors, and one step of “prefatory” behavior.
What about general k = 3, r = 1 rules? There are too many to easily search exhaustively. But directed random sampling reveals plenty of long-lifetime examples, such as:
And now the tail of very long lifetimes extends further, for example with:
It’s a little easier to see what the lifetime-10863 rule does if one visualizes it in sections (and changes colors to get more contrast):
Sampling 100 steps out every 2000 (as well as at the very end), we see elaborate alternation between periodic and seemingly random behavior—but none of it gives any clue of the remarkable fact that after 10863 steps the whole pattern will die out:
The Issue of Undecidability
As our example criterion for the “fitness” of cellular automaton rules, we’ve used the lifetimes of the patterns they generate—always assuming that if the patterns don’t terminate at all they should be considered to have fitness zero.
But how can we tell if a pattern is going to terminate? In the previous section, for example, we saw patterns that live a very long time—but do eventually terminate.
Here are some examples of the first 100 steps of patterns generated by a few k = 3, r = 1 symmetric rules:
What will happen with these patterns? We know from what we see here that none of them have lifetimes less than 100 steps. But what would allow us to say more? In a few cases we can see that the patterns are periodic, or have obvious repeating structures, which means they’ll never terminate. But in the other cases there’s no obvious way to predict what will happen. Explicitly running the rules for another 100 steps we discover some more outcomes:
Going to 500 steps there are some surprises. Rule (a) becomes periodic after 388 steps; rules (o) and (v) terminate after 265 and 377 steps, respectively:
But is there a way to systematically say what will happen with all the remaining rules “in the end”? The answer is that in general there is not; it’s something that must be considered undecidable by any finite computation.
Given how comparatively simple the cellular automaton rules we’re considering are, we might have assumed that with all our sophisticated mathematical and computational methods we’d always be able to “jump ahead of them”—and figure out their outcome without the computational effort of explicitly running each step.
But the Principle of Computational Equivalence suggests that pretty much whenever the behavior of these rules isn’t obviously simple, it will in effect be of equal computational sophistication to any other system, and in particular to any methods that we might use to predict it. And the result is the phenomenon of computational irreducibility that implies that in many systems—presumably including most of the cellular automata here—there isn’t any way to figure out their outcome much more efficiently than by explicitly tracing each of their steps. So this means that to know what will happen “in the end”—after an infinite number of steps—might take an unlimited amount of computational effort. Or, in other words, it must be considered effectively undecidable by any finite computation.
As a practical matter we might look at the observed distribution of lifetimes for a particular type of cellular automaton, and become pretty confident that there won’t be longer finite lifetimes for that type of cellular automaton. But for the k = 3, r = 1 rules from the previous section, we might have been fairly confident that a few thousand steps was the longest lifetime that would ever occur—until we discovered the 10,863-step example.
So let’s say we run a particular rule for 10,000 steps and it hasn’t died out. How can we tell if it never will? Well, we have to construct a proof of some kind. And that’s easy to do if we can see that the pattern becomes, say, completely periodic. But in general, computational irreducibility implies we won’t be able to do it. Might there, though, still be special cases where we can? In effect, those would have to correspond to “pockets of computational reducibility” where we manage to find a compressed description of the cellular automaton behavior.
There are cases like this where there isn’t strict periodicity, but where in the end there’s basically repetitive behavior (here with period 480):
And there are cases of nested behavior, which is never periodic, but is nevertheless simple enough to be predictable:
But there are always surprises. Like this example—which eventually resolves to have period 6, but only after 7129 steps:
So what does all this mean for our adaptive evolution process? It implies that in principle we could miss a very long finite lifetime for a particular rule, assuming it to be infinite. In a biological analogy we might have a genome that seems to lead to unbounded perhaps-tumor-like growth—but where actually the growth in the end “unexpectedly” stops.
Computation Theoretic Perspectives and Busy Beavers
What we’re asking about the dying out of patterns in cellular automata is directly analogous to the classic halting problem for Turing machines, or the termination problem for term rewriting, Post tag systems, etc. And in looking for cellular automata that have the longest-lived patterns, we’re studying a cellular automaton analog of the so-called busy beaver problem for Turing machines.
We can summarize the results we’ve found so far (all for single-cell initial conditions):
The profiles (i.e. widths of nonzero cells) for the patterns generated by these rules are
and the “integrals” of these curves are what give the “areas” in the table above.
For the reasons described in the previous section, we can only be certain that we’ve found lower bounds on the actual maximum lifetime—though except in the last few cases listed it seems very likely that we do in fact have the maximum lifetime.
It’s somewhat sobering, though, to compare with known results for maximum (“busy beaver”) lifetimes for Turing machines (where now s is the number of Turing machine states, the Turing machines are started from blank tapes, and they are taken to “halt” when they reach a particular halt state):
Sufficiently small Turing machines can have only modest lifetimes. But even slightly bigger Turing machines can have vastly larger lifetimes. And in fact it’s a consequence of the undecidability of the halting problem for Turing machines that the maximum lifetime grows with the size of the Turing machine faster than any computable function (i.e. any function that can be computed in finite time by a Turing machine, or whose value can be proved by a finite proof in a finite axiom system).
But, OK, the maximum lifetime increases with the “size of the rule” used in a Turing machine, or a cellular automaton. But what defines the “size of a rule”? Presumably it should be roughly the number of independent bits needed to specify the rule (which we can also think of as an approximate measure of its “information content”)—or something like log 2 of the number of possible rules of its type.
At the outset, we might imagine that all 232 k = 2, r = 2 rules would need 32 bits to specify them. But as we discussed above, in some cases some of the bits in the rule don’t matter when it comes to determining the patterns they produce. And what we see is that the more bits that matter (and so have to be specified), the longer the lifetimes that are possible:
So far we’ve only been discussing cellular automata with single-cell initial conditions. But if we use more complicated initial conditions what we’re effectively doing is adding more information content into the system—with the result that maximum lifetimes can potentially get larger. And as an example, here are possible lifetimes for k = 2, r = rules with a sequence of possible initial conditions:
Probabilistic Approximations?
Cellular automata are at their core deterministic systems: given a particular cellular automaton rule and a particular initial condition, every aspect of the behavior that is generated is completely determined. But is there any way that we can approximate this behavior by some probabilistic model? Or might we at least usefully be able to use such a model if we look at the aggregate properties of large numbers of different rules?
One hint along these lines comes from the power-law distributions we found above for the frequencies of different possible lifetimes for cellular automata of given types. And we might wonder whether such distributions—and perhaps even their exponents—could be found from some probabilistic model.
One possible approach is to approximate a cellular automaton by a probabilistic process—say one in which a cell becomes black with probability p if it or either of its neighbors were black on the step before. Here are some examples of what can happen with this (“directed percolation”) setup:
The behavior varies greatly with p; for small p everything dies out, while for large p it fills in:
And indeed the final density—starting from random initial conditions—has a sharp (phase) transition at around p = 0.54 as one varies p:
If instead one starts from a single initial black cell one sees a slightly different transition:
One can also plot the probabilities for different “survival times” or “lifetimes” for the pattern:
And right around the transition the distribution of lifetimes follows a power law—that’s roughly τ–1 (which happens to be what one gets from a mean field theory estimate).
So how does this relate to cellular automata? Let’s say we have a k = 2 rule, and we suppose that the colors of cells can be approximated as somehow random. Then we might suppose that the patterns we get could be like in our probabilistic model. And a potential source for the value of p to use would be the fraction of cases in the rule that give a black cell as output.
Plotting the lifetimes for k = 2, r = 2 rules against these fractions, we see that the longest lifetimes do occur when a little under half the outcomes are black (though notice this is also where the binomial distribution implies the largest number of rules are concentrated):
If we don’t try thinking about the details of cellular automaton evolution, but instead just consider the boundaries of finite-lifetime patterns we generate, we can imagine approximating these (say for symmetric rules) just by random walks—that when they collide correspond to the pattern dying out:
The standard theory of random walks then tells us that the probability to survive τ steps is proportional to τ–3/2 for large τ—a power law, though not immediately the same one as we’ve observed for our cellular automaton lifetimes.
Other Adaptive Evolution Strategies
In what we’ve done so far, we’ve always taken each step of our process of adaptive evolution to pick an outcome of equal or greater fitness. But what if we adopt a “more impatient” procedure in which at each step we insist on an outcome that has strictly greater fitness?
For k = 2 it’s simply not possible with this procedure (at least with a null initial condition) to “escape” the null rule; everything that can be reached with 1 mutation still has lifetime 1. With k = 3 it’s possible to go one step, but only one, as captured by this multiway graph:
But we’re assuming here that we have to reach greater fitness with just one mutation. What if we allow two mutations at a time? Well, then we can “make progress”. And here’s the multiway graph in this case for symmetric k = 2, r = 2 rules:
We don’t reach as many phenotypic patterns as by using single mutations and allowing “fitness-neutral moves”, but where we do get, we get much quicker, without any “back and forth” in fitness-neutral spaces.
If we allow up to 3 mutations, we get still further:
And indeed we seem to get a pretty good representative sampling of “what’s out there” in this rule space, even though we reach only 37 rules, compared to the 77,624 (albeit with many duplicated phenotypic patterns) from our standard approach allowing neutral moves.
For k = 3, r = 1 symmetric rules single mutations can get 2 steps:
But now if we allow up to 2 mutations, we can go much further—and the fact that we now don’t have to deal with neutral moves means we can explicitly construct at least the first few steps of the multiway graph in this case:
We can go further if at each step we just pick a random higher-fitness rule reached with two or fewer mutations:
The adaptive evolution histories we just showed can be generated in effect by randomly trying a series of possibilities at each step, then picking the first one that shows increased fitness. Another approach is to use what amounts to “local exhaustive search”: at each step, look at results from all possible mutations, and pick one that gives the largest fitness. At least in smaller rule spaces, it’s common that there will be several results with the same fitness, and as an example we’ll just pick among these at random:
One might think that this approach would in effect always be an optimization of the adaptive evolution process. But in practice its systematic character can end up making it get stuck, in some sense repeatedly “trying to do the same thing” even if it “isn’t working”.
Something of an opposite approach involves loosening our criteria for which paths can be chosen—and for example allowing paths that temporarily reduce fitness, say by one step of lifetime:
In effect here we’re allowing less-than-maximally-fit organisms to survive. And we can represent the overall structure of what’s happening by a multiway graph—which now includes “backtracking” to lower fitnesses:
But although the details are different, in the end it doesn’t seem as if allowing this kind of backtracking has any dramatic effect. Somehow the basic phenomena around the process of adaptive evolution are strong enough that most of the details of how the adaptive evolution is done don’t ultimately matter much.
An Aside: Sexual Reproduction
In everything we’ve done so far, we’ve been making mutations only to individual rules. But there’s another mechanism that exists in many biological organisms: sexual reproduction, in which in effect a pair of rules (i.e. genomes) get mixed to produce a new rule. As a simple model of the crossover that typically happens with actual genomes, we can take two rules, and splice together the beginning of one with the end of the other:
In general there will be many ways to combine pairs of rules like this. In a direct analogy to our Physics Project, we can represent such “recombinations” as “events” that take two rules and produce one:
The analog of our multiway graph for all possible paths of adaptive evolution by mutations is now what we call in our Physics Project a token-event graph:
In dealing just with mutations we were able to take a single rule and progressively modify it. Now we always have to work with a “population” of rules, combining them two at a time to generate new rules. We can represent conceivable combinations among one set of rules as follows:
There are at this point many different choices we could make about how to set up our model. The particular approach we’ll use selects just n of the = n (n – 1)/2 possible combinations:
Then for each of these selected combinations we attempt a crossover, keeping those “children” (drawn here between their parents) that are not rejected as a result of having lower fitness:
Finally, to “maintain our gene pool”, we carry forward parents selected at random, so that we still end up with n rules. (And, yes, even though we’ve attempted to make this whole procedure as a clean as possible, it’s still a mess—which seems to be inevitable, and which has, as we’ll discuss below, bedeviled computational studies of evolution in the past.)
OK, so what happens when we apply this procedure, say to k = 3, r = 1 rules? We’ll pick 4 rules at random as our initial population (and, yes, two happen to produce the same pattern):
Then in a sequence of steps we’ll successively pick various combinations:
And here are the distinct “phenotype patterns” produced in this process (note that even though there can be multiple copies of the same phenotype pattern, the underlying genotype rules are always distinct):
As a final form of summarization we can just plot the successive fitnesses of the patterns we generate (with the size of each dot reflecting the number of times a particular fitness occurs):
In this case we reach a steady state after 9 steps. The larger the population the longer the adaptive evolution will typically keep going. Here are a couple of examples with population 10, showing all the patterns obtained:
Showing in each case only the longest-lifetime rule found so far we get:
The results aren’t obviously different from what we were finding with mutation alone—even though now we’ve got a much more complicated model, with a whole population of rules rather than a single rule. (One obvious difference, though, is that here we can end up with overall cycles of populations of rules, whereas in the pure-mutation case that can only happen among fitness-neutral rules.)
Here are some additional examples—now obtained after 500 steps with population 25
and with population 50:
And so far as one can tell, even here there are no substantial differences from what we saw with mutation. Certainly there detailed features introduced by sexual reproduction and crossover, but for our purposes in understanding the big picture of what’s happening in adaptive evolution it seems sufficient to do as we have done so far, and consider only mutation.
An Even More Minimal Model
By investigating adaptive evolution in cellular automata we’re already making dramatic simplifications relative, say, to actual biology. But in the effort to understand the essence of phenomena we see, it’s helpful to go even further—and instead of thinking about computational rules and their behavior, just think about vertices on a “mutation graph”, each assigned a certain fitness.
As an example, we can set up a 2D grid, assigning each point a certain random fitness:
And then, starting from a minimum-fitness point, we can follow the same kind of adaptive evolution procedure as above, at each step going to a neighboring point with an equal or greater fitness:
Typically we don’t manage to go far before we get stuck, though with the uniform distribution of fitness values used here, we still usually end on a fairly large fitness value.
We can summarize the possible paths we can take by the multiway graph:
In our cellular automaton rule space—and, for that matter, in biology—neighboring points don’t just have independent random fitnesses; instead, the fitnesses are determined by a definite computational procedure. So as a simple approximation, we can just take the fitness of each point to be a particular function of its graph coordinates. If the function forms something like a “uniform hill”, then the adaptive evolution procedure will just climb it:
But as soon as the function has “systematic bumpiness” there’s a tremendous tendency to quickly get stuck:
And if there’s some “unexpected spot of high fitness” adaptive evolution typically won’t find it (and it certainly won’t if it’s surrounded by a lower-fitness “moat”):
So what happens if we increase the dimensionality of the “mutation space” in which we’re operating? Basically it becomes easier to find a path that increases fitness:
And we can see this, for example, if we look at Boolean hypercubes in increasing numbers of dimensions:
But ultimately this relies on the fact that in the neighborhood reachable by mutations from a given point, there’ll be a “sufficiently random” collection of fitness values that it’ll (likely) be possible to find a “direction” that’s “going up” in fitness. Yet this alone won’t in general be enough, because we also need it to be the case that there’s enough regularity in the fitness landscape that we can systematically navigate it to find its maximum—and that the maximum is not somehow “unexpected and isolated”.
What Can Adaptive Evolution Achieve?
We’ve seen that adaptive evolution can be surprisingly successful at finding cellular automata that produce patterns with long but finite lifetimes. But what about other types of “traits”? What can (and cannot) adaptive evolution ultimately manage to do?
For example, what if we’re trying to find cellular automata whose patterns don’t just live “as long as possible” but instead die after a specific number of steps? It’s clear that within any finite set of rules (say with particular k and r) there’ll only be a limited collection of possible lifetimes. For symmetric k = 2, r = 2 rules, for example, the possible lifetimes are:
But as soon as we’re dealing even with k = 3, r = 1 symmetric rules it’s already in principle possible to get every lifetime up to 100. But what about adaptive evolution? How well does it do at reaching rules with all those lifetimes? Let’s say we do single point mutation as before, but now we “accept” a mutation if it leads not specifically to a larger finite lifetime, but to a lifetime that is closer in absolute magnitude to some desired lifetime. (Strictly, and importantly, in both cases we also allow “fitness-neutral” mutations that leave the lifetime the same.)
Here are examples of what happens if we try to adaptively evolve to get lifetime exactly 50 in k = 3, r = 1 rules:
It gets close—and sometimes it overshoots—but, at least in these particular examples, it never quite makes it. Here’s what we see if we look at the lifetimes achieved with 100 different random sequences of mutations:
Basically they mostly get stuck at lifetimes close to 50, but not exactly 50. It’s not that k = 3, r = 1 rules can’t yield lifetime 50; exhaustive search shows that even many symmetric such rules can:
It’s just that our adaptive evolution process usually gets stuck before it reaches rules like these. Even though there’s usually enough “room to maneuver” in k = 3, r = 1 rule space to get to generally longer lifetimes, there’s not enough to specifically get to lifetime 50.
But what about k = 4, r = 1 rule space? There are now not 1012 but about 1038 possible rules. And in this rule space it becomes quite routine to be able to reach lifetime 50 through adaptive evolution:
It can sometimes take a while, but most of the time in this rule space it’s possible to get exactly to lifetime 50:
What happens with other “lifetime goals”? Even symmetric k = 3, r = 1 rules can achieve many lifetime values:
Indeed, the first “missing” values are 129, 132, 139, etc. And, for example, many multiples of 50 can be achieved:
But it becomes increasingly difficult for adaptive evolution to reach these specific goals. Increasing the size of the rule space always seems to help; so for example with k = 4, r = 1, if one’s aiming for lifetime 100, the actual distribution of lifetimes reached is:
In general the distribution gets broader as the lifetime sought gets larger:
We saw above that across the whole space of, say, k = 4, r = 1 rules, the frequency of progressively larger lifetimes falls roughly according to a power law. So this means that the fractional region in rule space that achieves a given lifetime gets progressively smaller—with the result that typically the paths followed by adaptive evolution are progressively more likely to get stuck before they reach it.
OK, so what about other kinds of objectives? Say ones more related to the morphologies of patterns? As a simple example, let’s consider the objective of maximizing the “widths” of finite-lifetime patterns. We can try to achieve this by adaptive evolution in which we reject any mutations that lead to decreased width (where “width” is defined as the maximum horizontal extent of the pattern). And once again this process manages to “discover” all sorts of “mechanisms” for achieving larger widths (here each pattern is labeled by its height—i.e. lifetime—and width):
There are certain structural constraints here. For example, the width can’t be too large relative to the height—because if it’s too large, patterns tend to grow forever.
But what if we specifically try to select for maximal “pattern aspect ratio” (i.e. ratio of width to height)? In essentially every case so far, adaptive evolution has in effect “invented many different mechanisms” to achieve whatever objective we’ve defined. But here it turns out we essentially see “the same idea” being used over and over again—presumably because this is the only way to achieve our objective given the overall structure of how the underlying rules we’re using work:
What if we ask something more specific? Like, say, that the aspect ratio be as close to 3 as possible. Much of the time the “solution” that adaptive evolution finds is the correct if trivial:
But sometimes it finds another solution—and often a surprisingly elaborate and complicated one:
How about if our goal is an aspect ratio of π ≈ 3.14? It turns out adaptive evolution can still do rather well here even just with the symmetric k = 3, r = 1 rules that we’re using:
We can also ask about properties of the “inside” of the pattern. For example, we can ask to maximize the lengths of uniform runs of nonwhite cells in the center column of the pattern. And, once again, adaptive evolution can successfully lead us to rules (like these random examples) where this is large:
We can go on and get still more detailed, say asking about runs of particular lengths, or the presence or number of particular subpatterns. And eventually—just like when we asked for too long a lifetime—we’ll find that the cases we’re looking for are “too sparse”, and adaptive evolution won’t be able to find them, even if exhaustive search could still identify at least a few examples.
But just what kinds of objectives (or fitness functions) can be handled how well by adaptive evolution, operating for example on the “raw material” of cellular automata? It’s an important question—an analog of which is also central to the investigation of machine learning. But as of now we don’t really have the tools to address it. It’s somehow reminiscent of asking what kinds of functions can be approximated how well by different methods or basis functions. But it’s more complicated. Solving it, though, would tell us a lot about the “reach” of adaptive evolution processes, not only for biology but also for machine learning.
What It Means for What’s Going On in Biology
How do biological organisms manage to be the way they are, with all their complex and seemingly clever solutions to such wide range of challenges? Is it just natural selection that does it, or is there in effect more going on? And if “natural selection does it”, how does it actually manage it?
From the point of view of traditional engineering what we see in biology is often very surprising, and much more complex and “clever” than we’d imagine ever being able to create ourselves. But is the secret of biology in a sense just natural selection? Well, actually, there’s often an analog of natural selection going on even in engineering, as different designs get tried and only some get selected. But at least in traditional engineering a key feature is that one always tries to come up with designs where one can foresee their consequences.
But biology is different. Mutations to genomes just happen, without any notion that their consequences can be foreseen. But still one might assume that—when guided by natural selection—the results wouldn’t be too different to what we’d get in traditional engineering.
But there’s a crucial piece of intuition missing here. And it has to do with how randomly chosen programs behave. We might have assumed (based on our typical experience with programs we explicit construct for particular purposes) that at least a simple random program wouldn’t ever do anything terribly interesting or complicated.
But the surprising discovery I made in the early 1980s is that this isn’t true. And instead, it’s a ubiquitous phenomenon that in the computational universe of possible programs, one can get immense complexity even from very simple programs. So this means that as mutation operates on a genome, it’s essentially inevitable that it’ll end up sampling programs that show highly complex behavior. At the outset, one might have imagined that such complexity could only be achieved by careful design and would inevitably be at best rare. But the surprising fact is that—because of how things fundamentally work in the computational universe—it’s instead easy to get.
But what does complexity have to do with creating “successful organisms”? To create a “successful organism” that can prosper in a particular environment there fundamentally has to be some way to get to a genome that will “solve the necessary problems”. And this is where natural selection comes in. But the fact that it can work is something that’s not at all obvious.
There are really two issues. The first is whether a program (i.e. genome) even exists that will “solve the necessary problems”. And the second is whether such a program can be found by a “thread” of adaptive evolution that goes only through intermediate states that “fit enough” to survive. As it turns out, both these issues are related to the same fundamental features of computation—which are also responsible for the ubiquitous appearance of complexity.
Given some underlying framework—like cellular automata, or like the basic apparatus of life—is there some rule that can be implemented in that framework that will achieve some particular (computational) objective? The Principle of Computational Equivalence says that generically the answer will be yes. In effect, given almost any “underlying hardware”, it’ll ultimately possible to come up with “software” (i.e. a rule) that achieves almost any (“physically possible”) given objective—like growing an organism of at least some kind that can survive in a particular environment. But how can we actually find a rule that achieves this?
In principle we could do exhaustive search. But that will be exponentially difficult—and in all but toy cases will be utterly infeasible in practice. So what about adaptive evolution? Well, that’s the big question. And what we’ve seen here is that—rather surprisingly—simple mutation and selection (i.e. the mechanisms of natural selection) very often provide a dramatic shortcut for finding rules that do what we want.
So why is this? In effect, adaptive evolution is finding a path through rule space that gets to where we want to go. But the surprising part is that it’s managing to do this one step at a time. It’s just trying random variations (i.e. mutations) and as soon as it finds one that’s not a “step down in fitness”, it’ll “take it”, and keep going. At the outset it’s certainly not obvious that this will work. In particular, it could be that at some point there just won’t be any “way forward”. All “directions” will lead only to lower fitness, and in effect the adaptive evolution will get stuck.
But the key observation from the experiments in our simple model here is that this typically doesn’t happen. And there seem to be basically two things going on. The first is that rule space is in effect very high-dimensional. So that means there are “many directions to choose from” in trying to find one that will allow one to “take a step forward”. But on its own this isn’t enough. Because there could be correlations between these directions that would mean that if one’s blocked in one direction one would inevitably be blocked in all others.
So why doesn’t this happen? Well, it seems to be the result of the fundamental computational phenomenon of computational irreducibility. A traditional view based on experience with mathematical science had been that if one knew the underlying rule for a system then this would immediately let one predict what the system would do. But what became clear from my explorations in the 1980s and 1990s is that in the computational universe this generically isn’t true. And instead, that the only way one can systematically find out what most computational systems will do is explicitly to run their rules, step by step, doing in effect the same irreducible amount of computational work that they do.
So if one’s just presented with behavior from the system one won’t be in a position to “decode it” and “see its simple origins”. Unless one’s capable of doing as much computational work as the system itself, one will just have to consider what it’s doing as (more or less) “random”. And indeed this seems to be at the root of many important phenomena, such as the Second Law of thermodynamics. And I also suspect it’s at the root of the effectiveness of adaptive evolution, notably in biology.
Because what computational irreducibility implies is that around every point in rule space there’ll be a certain “effective randomness” to fitnesses one sees. And if there are many dimensions to rule space that means it’s overwhelmingly likely that there’ll be “paths to success” in some directions from that point.
But will the adaptive evolution find them? We’ve assumed that there are a series of mutations to the rule, all happening “at random”. And the point is that if there are n elements in the rule, then after some fraction of n mutations we should find our “success direction”. (If we were doing exhaustive search, we’d instead have to try about kn possible rules.)
At the outset it might seem conceivable that the sequence of mutations could somehow “cleverly probe” the structure of rule space, “knowing” what directions would or would not be successful. But the whole point is that going from a rule (i.e. genotype) to its behavior (i.e. phenotype) is generically a computationally irreducible process. So assuming that mutations are generated in a computationally bounded way it’s inevitable that they can’t “break computational irreducibility” and so will “experience” the fitness landscape in rule space as “effectively random”.
OK, but what about “achieving the characteristics an organism needs”? What seems to be critical is that these characteristics are in a sense computationally simple. We want an organism to live long enough, or be tall enough, or whatever. It’s not that we need the organism to perform some specific computationally irreducible task. Yes, there are all sorts of computationally irreducible processes happening in the actual development and behavior of an organism. But as far as biological evolution is concerned all that matters is ultimately some computationally simple measure of fitness. It’s as if biological evolution is—in the sense of my recent observer theory—a computationally bounded observer of underlying computationally irreducible processes.
And to the observer what emerges is the “simple law” of biological evolution, and the idea that, yes, it is possible just by natural selection to successfully generate all sorts of characteristics.
There are all sorts of consequences of this for thinking about biology. For example, in thinking about where complexity in biology “comes from”. Is it “generated by natural selection”, perhaps reflecting the complicated sequence of historical accidents embodied in the particular collection of mutations that occurred? Or is it from somewhere else?
In the picture we’ve developed here it’s basically from somewhere else—because it’s essentially a reflection of computational irreducibility. Having said that, we should remember that the very possibility of being able to have organisms with such a wide range of different forms and functions is a consequence of the universal computational character of their underlying setup, which in turn is closely tied to computational irreducibility.
And it’s in effect because natural selection is so coarse in its operation that it does not somehow avoid the ubiquitous computational irreducibility that exists in rule space, with the result that when we “look inside” biological systems we tend to see computational irreducibility and the complexity associated with it.
Something that we’ve seen over and over again here is that, yes, adaptive evolution manages to “solve a problem”. But its solution looks very complex to us. There might be some “simple engineering solution”—involving, say, a very regular pattern of behavior. But that’s not what adaptive evolution finds; instead it finds something that to us is typically very surprising—very often an “unexpectedly clever” solution in which lots of pieces fit together just right, in a way that our usual “understand-what’s-going-on” engineering practices would never let us invent.
We might not have expected anything like this to emerge from the simple process of adaptive evolution. But—as the models we’ve studied here highlight—it seems to be an inevitable formal consequence of core features of computational systems. And as soon as we recognize that biological systems can be viewed as computational, then it also becomes something inevitable for them—and something