The thumbnail of this blog and the image above show a zebrafish next to a stylized picture of a human brain. It is meant as a small tribute to Turing as well as a prediction of what may be possible. Below I will discuss what a little striped fish has to do with Turing’s legacy.
Turing’s famous “imitation game” was introduced in his remarkable 1950 essay “Computing Machinery and Intelligence”. The goal of the game was for a machine to behave so that a human interlocutor could not tell it apart from a human (or, more precisely, from either a man or a woman). According to Turing, a machine that could do so consistently could be said to think. No better definition has been proposed since.
In his essay, Turing conjectured that a billion bits of storage would be sufficient for a computer program to play a passable imitation game, and that a machine with that amount of memory could be constructed within 50 years (that is, by the year 2000).1
Turing’s prediction of technological progress was right on target: machines with a billion bits (about 128 megabytes) of storage had indeed become common by 2000.2 On the other hand, none of the AI models available in 2000 came anywhere close to playing the imitation game.3 Passing the Turing test convincingly took another twenty years and required much larger models, such as GPT-3, which had a storage requirement of nearly three trillion bits (175 billion parameters at 16-bit precision). This exceeded Turing’s original estimate by more than three orders of magnitude.
Nevertheless, there is reason to believe that far smaller models can achieve human performance on conversational tasks. Modern architectures and optimization methods have been developed through an evolutionary process of trial and error. Such a process is unlikely to yield optimal results, particularly given that the models and algorithms are quite new, mostly less than ten years old, and rely heavily on hardware (GPUs) that was developed equally recently. Furthermore, the large majority of recent efforts and resources have been concentrated on training progressively larger models on more data. One specific way in which modern models are likely to be sub-optimal is the reliance of training and model selection procedures on recently found heuristics, such as specific parameters of the “scaling laws” and choices of initialization and learning rates. These heuristics are useful to practitioners in dire need of guidance for increasingly costly training runs. Yet they are at best local optima in the vast generalization and optimization landscapes of modern models and, more likely, are not even that, but simply rules of thumb adopted in the absence of a more reliable understanding of the underlying principles.
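For concreteness: the “scaling laws” mentioned above are empirical fits of model loss to model and data size. One widely used parametric form (the Chinchilla-style fit of Hoffmann et al., shown here purely as an illustration, not as a claim of this post) is

L(N, D) = E + A / N^α + B / D^β,

where N is the number of parameters, D is the number of training tokens, and E, A, B, α and β are constants fitted to experimental runs. Treating such fitted constants as design guidance is exactly the kind of heuristic described above.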
There is already evidence that strong performance is possible with much smaller models, particularly when these are obtained (distilled) from larger models. Models such as Phi-3, with 3.8 billion 16-bit parameters, require about 60 billion bits of storage: far less than GPT-3, and only about 60 times more than Turing’s estimate. Yet they are arguably capable of passing Turing’s test. It seems highly plausible that the storage requirements of such models can be decreased by another order of magnitude while maintaining comparable performance.
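As a back-of-the-envelope check of these storage figures, here is a sketch in Python using the parameter counts quoted above (real checkpoints vary in precision and exact size):

```python
# Rough storage arithmetic for the models discussed above.
TURING_ESTIMATE_BITS = 1e9                # Turing's conjectured billion bits

gpt3_bits = 175e9 * 16                    # 175B parameters at 16 bits each
phi3_bits = 3.8e9 * 16                    # 3.8B parameters at 16 bits each

print(f"GPT-3: {gpt3_bits:.1e} bits = {gpt3_bits / TURING_ESTIMATE_BITS:,.0f}x Turing's estimate")
print(f"Phi-3: {phi3_bits:.1e} bits = {phi3_bits / TURING_ESTIMATE_BITS:.0f}x Turing's estimate")

# A billion bits also corresponds to roughly 10^8 weights at 8-16 bits per
# weight, the figure used later in the post for the zebrafish comparison.
print(f"Weights in a billion bits at 8-bit precision: {TURING_ESTIMATE_BITS / 8:.2e}")
```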
Furthermore, Turing’s imitation game requires a model to convincingly impersonate a human in a single language (English) and within the limited scope of questions available to an unaugmented, pre-internet (or even pre-pocket-calculator) human.
Here is an example from Turing’s paper:
Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621. [MB: intentionally incorrect]
The capabilities of even relatively modest modern LLMs far exceed such conversations in many directions, extending to many languages and broad domains of knowledge. Thus, if our only goal were to create a competent English conversationalist of modest abilities, such a model could likely be distilled into a far smaller package of a billion bits. While our understanding of these models is not yet sufficiently precise to perform such a distillation, there appear to be no large technological or conceptual barriers.
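For reference, the core mechanism of distillation is standard and simple to state. Below is a minimal sketch of the usual soft-target objective (in the style of Hinton et al.; the logit shapes and temperature are illustrative placeholders, not a recipe from this post):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs
    (the classic soft-target objective of Hinton et al., 2015)."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradients on a comparable scale across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

# Toy example: random logits stand in for real teacher/student model outputs.
teacher_logits = torch.randn(4, 32000)            # batch of 4, vocabulary of 32,000
student_logits = torch.randn(4, 32000, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                   # gradients flow only to the student
```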
It is time to come back to the TuringFish illustration. A model requiring a billion bits of storage has about 100 million neural weights (assuming 8 or 16 bits per weight) or very roughly the number of synapses in a zebrafish brain. We are thus aiming at a model with the brain capacity of a fish to converse with us in fluent English and to convince us of its humanity!
Of course, such a comparison is facetious, as artificial neural networks are fundamentally different from biological brains. In particular, they operate at a far higher frequency, perhaps a billion times faster. Nevertheless, it hints at real limitations of human minds.
However, there is another reason for the zebrafish to serve as a tribute to Turing. In a very different work, “The Chemical Basis of Morphogenesis”, published in 1952, Turing described a mechanism for pattern formation in living organisms based on reaction-diffusion differential equations. This is how the stripes on a zebrafish are formed.
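Turing’s mechanism can be demonstrated in a few lines of code. The sketch below simulates the Gray-Scott model, one standard reaction-diffusion system (not Turing’s original equations, nor the specific model of zebrafish pigmentation), which develops stripe- and spot-like patterns from a nearly uniform initial state:

```python
import numpy as np

n = 128                                   # grid size
Du, Dv = 0.16, 0.08                       # diffusion rates of the two chemicals
feed, kill = 0.035, 0.060                 # feed/kill rates (pattern-forming regime)

u = np.ones((n, n))                       # "substrate" concentration
v = np.zeros((n, n))                      # "activator" concentration
# Perturb a small square in the middle to break the uniform steady state.
u[n//2-5:n//2+5, n//2-5:n//2+5] = 0.50
v[n//2-5:n//2+5, n//2-5:n//2+5] = 0.25

def laplacian(a):
    """Five-point discrete Laplacian with periodic boundary conditions."""
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
            np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4.0 * a)

for _ in range(10_000):                   # forward-Euler time stepping, dt = 1
    uvv = u * v * v
    u += Du * laplacian(u) - uvv + feed * (1.0 - u)
    v += Dv * laplacian(v) + uvv - (feed + kill) * v

# u now contains a Turing-style pattern; visualize it, for instance, with
# matplotlib: plt.imshow(u, cmap="gray")
```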
The zebrafish, with its tiny brain and its Turing-explained stripes, symbolizes the limits of human intelligence, made painfully evident by recent progress in AI. But it also represents the power of thought in comprehending and molding the world. Perhaps the nature of intelligence itself may one day be understood as a process not so different from the one described by the equations that guide the formation of zebrafish stripes.
1. The timing of 2001: A Space Odyssey fits neatly, but it may be accidental.
2. Turing did not, of course, make a distinction between different types of memory (e.g., RAM vs. hard drive), as none of those technologies existed in 1950, so there is some room for interpretation there.