Never let the future disturb you. You will meet it, if you have to, with the same weapons of reason which today arm you against the present.
Marcus Aurelius
Today, as we look with amazement at the power of Large Language Models, we have arrived at a moment reminiscent of the past revolutions of Copernicus and Darwin. The capabilities of Large Language Models and, just as significantly, the nature of these models and their training process, have far-reaching implications for our understanding of statistics, science, and the human condition.
The great discoveries of Copernicus and Darwin changed humanity’s place in the universe. The Copernican heliocentric system moved our planet away from the center of the world, to a position subordinate to the Sun and equal in standing to several other celestial bodies [1]. Similarly, Darwin’s theory of evolution downgraded “man” from the pinnacle of creation, made in the divine image, to a second cousin of a chimpanzee.
Needless to say, those ideas were not readily accepted. Nearly a hundred years after Copernicus, Galileo was famously put on trial by the Inquisition and forced to renounce heliocentrism. Now, over 160 years after the publication of “The Origin of Species”, resistance to evolutionary ideas remains persistent.
We are now served with a similarly bitter pill. Human linguistic skills, one of the last remaining strongholds of our uniqueness, can be readily replicated by remarkably simple statistical models. The implications of this great discovery are many and far-reaching.
Why is language so central to human identity? Most human behaviors have animal analogues. For instance, human physical abilities are unremarkable. Many animals have impressive agility and strength, and even a fly, with its tiny brain of about 100,000 neurons, is capable of complex locomotion. It is usually thought that the two primary differentiating aspects of human intelligence are language and complex tool use. However, tools seem to be a less foundational aspect of our humanity than language. Tool use is limited for people with few means to construct tools, say, those stranded on a desert island, or with no means to manipulate them, say, the physically disabled. None of this changes our view of their humanity or intelligence. Furthermore, the use of complex modern tools typically requires language instructions, written or oral. The view of linguistic primacy was taken by Turing in his original 1950 paper [2] proposing the “imitation game”, now known as the Turing test. In that paper Turing explicitly identified language with the intellectual component of human behavior. Indeed, no other animal has a means of communication approaching human language in complexity, expressiveness and flexibility. While chimpanzees and gorillas have been taught sign language, their ability was at best comparable to that of a toddler. In contrast, a human child can master any language at a much higher level. Thus, the argument goes, humans alone are genetically endowed with a “universal grammar” machinery that can be applied to learn any natural language.
As of 2023, this linguistic ability is no longer unique. The best current LLMs, such as GPT-4, have linguistic competence comparable to or exceeding that of an average human. Furthermore, their ability extends across numerous languages and nearly every domain of human knowledge not requiring manipulation of physical objects. This competence can be measured through standardized tests, such as the SAT and the bar exam, the ability to translate between a variety of different languages, or answering job interview questions for companies such as Google [3]. The best LLMs are now roughly comparable in ability to an average (“endlessly enthusiastic B/B+”) undergraduate university student across multiple domains of knowledge. Of course, there are still experts who by far surpass these models in any given area of expertise. While true, this is irrelevant for an evaluation of general intelligence, as the same applies to the majority of humans. To sharpen this point, it is hardly a requirement for humans to equal Einstein in physics or Gauss in mathematics to be considered intelligent. Few of us could hope to meet that standard.
Yet the fact that LLMs give us an alternative model of human-level linguistic intelligence does not in itself rise to the level of a Copernicus-Darwin revolution. After all, there is no reason to think that the human brain is uniquely suited to implementing linguistic competence. It is a fairly mainstream view that robots could match human intelligence in principle. Thus machines capable of sophisticated linguistic processing, impressive and consequential as they may be, should not dramatically change our understanding of human intelligence and our place in the universe. What makes LLMs truly revolutionary is the narrow statistical nature of these models and the constrained scope of their training data.
At their core, LLMs are conditional probability models. LLMs generate the next word based on the words [4] (“tokens”) in the “context window”. Specifically, given a sequence of, say, 1,000 words w₁, w₂, …, w₁₀₀₀ within a context window of length 1,000, the model estimates the conditional probability

P(w | w₁, w₂, …, w₁₀₀₀).

To generate the next word, w is simply sampled at random from this probability distribution over the possible words. The context window is then moved forward by one to include the newly generated word w₁₀₀₁, and the following word is sampled from the probability distribution

P(w | w₂, w₃, …, w₁₀₀₁).
The process is repeated until the “end” token is selected. Note that the model is only concerned with predicting the next word. Furthermore, transformer-based models, such as ChatGPT, have no internal memory states: the probability distribution for the next word is a fully deterministic function of the context. While they are trained on a large corpus of data, currently several trillion tokens for the biggest models, they are not provided with any linguistic rules. The training process is the same for any language, natural or artificial, be it English, Chinese, Python or Esperanto. Setting aside the specifics of these models, we see that human-equivalent output, including the ability to write computer code, translate between languages and pass the bar exam, can be obtained from pure statistics of the training corpus, without the need for models specialized to particular tasks or languages. Only the next word needs to be predicted, and even internal memory states are unnecessary.
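To make this procedure concrete, here is a minimal Python sketch of the sampling loop just described. The model function is a hypothetical stand-in for an LLM, assumed to return the conditional distribution over the next word; this is an illustration of the process, not of any particular implementation.

```python
import random

def generate(model, prompt, context_size=1000, end_token="<end>"):
    """Minimal sketch of the sampling loop described above.

    `model(context)` is a hypothetical stand-in for an LLM: it is assumed to
    return a dict mapping each candidate next word to its conditional
    probability P(w | context). Note that there is no hidden state: the
    distribution depends on the context window alone.
    """
    words = list(prompt)
    while True:
        context = words[-context_size:]                 # sliding context window
        probs = model(context)                          # P(w | last context_size words)
        candidates = list(probs.keys())
        weights = list(probs.values())
        next_word = random.choices(candidates, weights=weights)[0]  # sample at random
        if next_word == end_token:                      # stop at the "end" token
            return words
        words.append(next_word)                         # shift the window forward by one
```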
Why is this surprising?
Suppose our vocabulary contains 2,000 different words [5]. Predicting the probability of the next word given the 1,000 previous words [6] is equivalent to filling in a matrix of size

2,000¹⁰⁰⁰ × 2,000.
The rows of the matrix correspond to all possible sequences of 1,000 words, while the columns correspond to the possible next tokens; each entry is the conditional probability of that token given the sequence. The goal of the learning algorithm is to fill in the entries of this matrix based on the training set [7]. While a trillion training examples may seem like massive data by any ordinary standard, the scale of the matrix is unimaginably larger.
To give an analogy, imagine a vast library. A certain book in that library has a certain letter in a particular location. Within that letter there is a molecule of ink. The goal is to reconstruct all the books in the library from that one molecule. While this task may seem absurd, it pales in comparison to the problem we are facing. The ratio of the size of this matrix to a trillion examples is vastly larger than the ratio of the number of molecules in the library, or even in the universe, to that single molecule of ink. One may object that most word sequences are nonsensical and need not be taken into account at all. This does not materially change the nature of the problem. Imagine that instead of 2,000 we only have two “sensible” possibilities for each word in a sentence. In that case the total number of sequences is 2¹⁰⁰⁰, still an unimaginably large number, far exceeding the number of atoms in the observable universe, even if each atom contained its own universe of the same size. In order for such a reconstruction to be at all possible, the texts in the library must be extremely predictable.
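For a sense of scale, the numbers in this argument can be checked with a few lines of Python (using the toy vocabulary and context sizes from above):

```python
import math

# Back-of-the-envelope arithmetic for the matrix described above (toy numbers from the text).
vocab_size = 2_000       # assumed toy vocabulary size
context_length = 1_000   # assumed context window length

rows = context_length * math.log10(vocab_size)   # log10 of the number of rows, 2,000^1,000
sensible = context_length * math.log10(2)        # log10 of 2^1,000 "sensible" sequences

print(f"rows of the matrix   ~ 10^{rows:.0f}")       # ~ 10^3301
print(f"'sensible' sequences ~ 10^{sensible:.0f}")   # ~ 10^301
print("training tokens       ~ 10^12")
print("atoms in the observable universe ~ 10^80")
```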
But why should we be amazed by this predictability? After all, isn't it true that a human brain contains a model capable of generating texts? That model must be encoded in the 10¹⁴ or so synapses within a human skull and thus cannot be arbitrarily complex. Thus, a model capable of generating most books in the library must exist, if not within an individual brain, then at least within the human society of about 10 billion people with their 10²⁴ or so synapses.
Several observations are in order.
It is not a priori obvious that even the totality of all human brains is sufficient for producing these texts. Human text generation is not an instantaneous process; it happens historically. Conceivably, all the material aspects of our society (corn fields, subways, and copper mines) may be required for the task.
The largest current models, as of summer 2023, contain approximately 10¹² parameters, about two orders of magnitude fewer than the number of synapses within an individual human brain, let alone all brains taken together [8]. While it is difficult to directly compare biological and artificial systems, this level of complexity appears comparable to that of a mouse brain. The fact that so few parameters are needed to achieve human (and in many ways super-human) linguistic competence is surprising, as the count falls well below recent estimates of the model size required for human-level intelligence based on “biological anchors”.
Even if a small model with a certain functionality exists, there is no reason to believe that its structure can be recovered from its output. In general, such recovery is a mathematical impossibility. In particular, finding a short program that produces a given output (the length of the shortest such program is known as the Kolmogorov complexity) is not computable. Our brain appears to be a universal computational machine. How else could we play chess and code COBOL with the neural machinery that evolved to sharpen rocks and hunt mammoths? Hence it would seem difficult or impossible to recover the underlying brain “program” from its linguistic output. Note that while this argument has merit, it needs to be tempered: biology did “learn” the model in our brains through its own evolutionary process. Still, a billion years of evolution across trillions of individual organisms is vastly more than the amount of computation or data going into the current models. Furthermore, unlike evolution, our models are trained exclusively on a subset of human writings from the last few hundred years. Building a model from such a narrow slice of data is like trying to recover all of human experience from books on chess. Any success in such an endeavor is surprising.
A striking aspect of these models is that predicting the next word appears to be enough. Even if language were fully probabilistic, why would predicting the next word be sufficient? One may plausibly argue that to produce sensible output we need to consider the probability of a whole sequence or of several sequential sentences, particularly for highly structured tasks, such as generating valid computer code. Yet predicting one word at a time turns out to be sufficient not just for producing meaningful responses but also for writing and analyzing complex computer programs. Furthermore, the probability distribution of the next word produced by the model is a deterministic function of the context. Transformer models, such as ChatGPT, do not have any memory or hidden states and are simply (high-order) Markov chains. Apparently, internal memory is not needed to produce human-quality linguistic output, as long as the context window is large enough.
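The contrast between models with and without internal memory can be made explicit in a short schematic sketch; step_fn and prob_fn below are hypothetical stand-ins for the underlying networks rather than real APIs.

```python
from typing import Callable, Dict, List

class RecurrentGenerator:
    """Schematic stateful generator: the next-word distribution depends on a
    hidden state accumulated over the entire history (e.g., an RNN)."""
    def __init__(self, step_fn: Callable, initial_state):
        self.step_fn = step_fn          # hypothetical stand-in for the recurrent network
        self.state = initial_state
    def next_distribution(self, word: str) -> Dict[str, float]:
        self.state, probs = self.step_fn(self.state, word)   # hidden state is updated
        return probs

class TransformerGenerator:
    """Schematic stateless generator: a Markov chain of order k. The next-word
    distribution is a deterministic function of the last k words and nothing else."""
    def __init__(self, prob_fn: Callable, k: int):
        self.prob_fn = prob_fn          # hypothetical stand-in for the transformer
        self.k = k                      # context window length
    def next_distribution(self, history: List[str]) -> Dict[str, float]:
        return self.prob_fn(tuple(history[-self.k:]))   # same context window, same distribution
```

In the second class, calling next_distribution twice with the same context window necessarily yields the same distribution, which is precisely the (high-order) Markov property discussed above.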
What can we conclude? The power of Large Language Models is far less surprising and puzzling if we view human cognition as a relatively simple process fairly easily amenable to statistical modeling. Taking this perspective points to a true Copernicus-Darwin moment, forcing us to re-evaluate the significance and the scope of human intelligence.
Interestingly, this view is not too removed from Turing’s 1950 paper, where he predicted that the imitation game could be run on a computer with one billion bits of memory by the year 2000. It was not a bad guess. While the best modern models use perhaps one trillion to ten trillion bits, their abilities, such as proficiency in numerous languages, far exceed the requirements of Turing’s imitation game. Furthermore, there is no reason to think that these models are in any respect optimal or cannot be significantly compressed. It is conceivable and perhaps even likely that the imitation game in English can indeed be successfully played with a much smaller but highly optimized billion-bit model. While Turing’s prediction was surprisingly on target, one may suspect that the simple statistical nature of these programs would have come as a surprise even to him.
I will now discuss some implications ordered from less to more speculative.
Statistics. Some of the most important implications of these models are in statistics. Statistical inference has turned out to be far more powerful than even the most starry-eyed statistician or Machine Learning researcher could have expected. The other side of that power is that it highlights a yawning gap in our understanding of statistical inference. That gap has become progressively more obvious with the recent progress of deep learning. Despite significant conceptual and theoretical advances of the last few years, the gap seems only to be growing wider, and is now gazing into us intently. However, the discussion of what is and is not surprising about statistical inference is quite nuanced. Together with some thoughts on narrowing the gap, it will be the subject of a separate forthcoming document.
Linguistics and Natural Language Processing. It has long been argued, most notably by Noam Chomsky, that linguistic structure cannot be acquired purely from the statistics of texts. Chomsky went as far as to dismiss the usefulness of probability entirely [9]: “But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” A more balanced view is expressed by Peter Norvig, who disagrees with Chomsky about the value of probabilistic modeling while agreeing on the limitations of Markov models, in which the probability of the next token depends solely on the preceding context window. Norvig states that “a Markov model of word probabilities cannot model all of language” and posits that more complex structures at multiple scales are required: “What is needed is a probabilistic model that covers words, syntax, semantics, context, discourse, etc.” The view that statistical models more complex than Markov chains are needed to model language has been mainstream in natural language processing; for example, the insufficiency of Markov models is taught in standard NLP classes [10]. More recently, the usefulness of statistical modeling with structure at multiple scales has also become accepted in parts of linguistics. And yet modern LLMs are nothing but Markov models, albeit of a very high order (with a large context window). While it is impossible to say whether they model “all of language”, their linguistic competence suggests that higher-level concepts such as syntax and semantics can emerge naturally from the statistics of word sequences alone, and that externally imposed complex model structure and memory states are not needed. Thus LLM language abilities are a refutation not just of the original Chomskyan view but even of moderate mainstream positions, such as those expressed by Norvig.
Cognitive Science. Having an alternative model of human linguistic ability and, likely, of other types of cognitive behavior is a clear breakthrough for cognitive science and certain areas of the philosophy of mind. While we do not know whether LLMs and natural cognitive systems operate on similar principles, a theory cannot claim that a given principle is necessary for cognition unless it applies equally to natural and artificial systems. Furthermore, the success of LLMs supports various behaviorist theories. Indeed, predicting the next state based on a certain historical context window, the way Markov models do, is exactly how behaviorism suggests modeling animal and human behavior. Some of the states may be private and not easily observable, e.g., modulations of heart rate or concentrations of chemicals in the blood, but even those still manifest themselves in behavior later on. Of course, the predictions of LLMs are based on the aggregate linguistic behavior of millions of people rather than single individuals. Nevertheless, LLMs can somewhat plausibly generate texts pretending to be those of specific historical persons. It is important to note that these models do not appear similar to biological systems and certainly run on very different hardware. Still, having an alternative, easily observable and manipulable model of intelligence should allow for completely new lines of investigation in cognitive science and philosophy. The observed behavior of LLMs also provides indirect support for philosophical ideas such as eliminativism, which dismisses many introspective mental-state concepts of “folk psychology” as mere illusions or constructions. Alternatively, perhaps sufficiently complex Markov chain models can be said to develop mental states. Those states, of course, would have to be fully context-dependent and should in principle be inferable from the model structure given the input.
Further speculation. What follows is some fairly wild speculation about the nature of physics and biology. A sensitive or simply sensible reader who made it this far may want to skip to the end of this document.
As discussed, LLMs learn to produce human-like responses and appear to build world models simply from the statistics of texts. Similarly, diffusion models and other image-generating algorithms learn to produce new and often creative high-quality images by collecting statistics of images. It appears that more complex concepts naturally arise from these simple statistics of data, a phenomenon known in neural networks as emergent behavior. It is tempting to conjecture that every successful learning system, human or artificial, must take advantage of the same patterns in data. The same statistical principles could thus apply to all types of learning, from the evolutionary process to the animal and human brain to artificial systems [11]. Simply organized but potentially large models exposed to sufficient amounts of data learn to recover higher-level structures. Single-cell organisms may become multicellular organisms with their more complex information processing. Similarly, a human brain, or a number of human brains, faced with sufficient data may eventually discover the laws of planetary motion, despite having nothing in their evolutionary history preparing them for such a leap. By the same token, progressively larger statistical models presented with more data discover more complex patterns in language. This is not new, of course. Similar ideas have for years been a playground for imaginative sci-fi writers. However, until very recently they appeared remote from everyday realities. They seem far less so at this point. Furthermore, this point of view suggests that an increase in model complexity may be inevitable and that our control over future technologies is tenuous and ephemeral. It is questionable how much “free will” we have as individuals or as a society. Did our ancestors choose to develop language, fire or agriculture? Did single-cell organisms choose to become multicellular? One can view all of these developments as part of the same learning process. Could our ability to plan for the future be an illusion or, rather, a comforting construction?
It is an old AI trope (originally due to H. Dreyfus) that building statistical models to achieve intelligence is like climbing a tree to reach the Moon. With Large Language Models we seem to have achieved just that. Perhaps the Moon we have reached is just a particularly convincing mirage, or an artful set of decorations in a Nevada desert? Perhaps. Yet, as the evidence mounts, it is more likely that we have misunderstood the nature of the Moon and the trees all along. As in the aftermath of the Copernican and Darwinian revolutions, our world will not be the same.
Acknowledgements. I thank Daniel Hsu, Leon Bergen and Yusu Wang for many discussions and comments.
[1] It is interesting to note that, contrary to popular belief, the heliocentric system did not provide more accurate predictions than the old Ptolemaic system. Copernicus thought that planetary motion was uniform and the orbits were circular. More precise predictions had to wait until Kepler and his laws of planetary motion. However, the heliocentric system was a drastic simplification of the Ptolemaic model.
[2] Turing, Alan (October 1950), "Computing Machinery and Intelligence", Mind, LIX (236): 433–460.
[3] SAT and various other tests: https://cdn.openai.com/papers/gpt-4.pdf; an LLM reportedly passes a Google job interview for a coding position: https://www.cnbc.com/2023/01/31/google-testing-chatgpt-like-chatbot-apprentice-bard-with-employees.html; GPT-4 passes the bar exam: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4389233
[4] Technically, tokens are subword units. Thinking of them as words does not significantly change the discussion.
[5] The actual vocabulary size is about 50,000 for GPT models.
[6] For comparison, the context size of ChatGPT-3 is 4,096 tokens.
[7] Of course, literally filling in this matrix is impossible due to its size. Instead we need a rule for finding its entries on demand.
[8] Note that human brains are about twice as big as ape brains. Thus it is sensible to think that whatever brain circuitry separates us from the apes is contained in about 50% of the brain, which does not significantly change these estimates.
[9] Chomsky, Noam (1969), "Some Empirical Assumptions in Modern Philosophy of Language", in Philosophy, Science and Method: Essays in Honor of Ernest Nagel, St. Martin's Press.
[10] According to course notes of Adityanarayanan Radhakrishnan.
[11] For example, while our brains are significantly larger than those of apes, we are genetically very similar. It is not clear why human mental processes, including those involved in writing Python code or responding to Twitter posts, should be far more complex than those of chimpanzees or gorillas. Perhaps having a larger brain simply unlocks those useful capabilities.
Comments from readers.
I don't agree with "The best current LLMs, such as GPT-4, have linguistic competence comparable to or exceeding that of an average human. Furthermore, their ability extends across numerous languages and nearly every domain of human knowledge not requiring manipulation of physical objects."
It is true, I think, that aspects of linguistic competence are suddenly within reach, but this goes too far. Even human beings who are not yet able to formulate complete sentences in their native language have a working grasp of other aspects, such as narrative and relevance, which the LLMs do not reliably command. It's not that one set of abilities is superior to the other; rather, they are incommensurate, in the same way that grandmaster-level chess is neither harder nor easier than making a cup of tea in an unfamiliar kitchen. We are predisposed to be impressed by things that humans find hard.
I like the claim in https://arxiv.org/pdf/2308.16797.pdf: "these models have achieved a proxy of a formal linguistic competence in the most studied languages. That is, its responses follow linguistic conventions and are fluent and grammatical, but they might be inaccurate or even hallucinate ... they also show signs of functional linguistic competence in its responses, i.e., discursive coherence, narrative structure and linguistic knowledge, even if not fully consistent (sometimes they do not consider context or situated information, and fail to adapt to users and domains)."
Interesting piece, but, in my opinion, a bit too quick to accept "primacy of language" as a foregone conclusion. I find the arguments laid out by Jacob Browning and Yann LeCun here rather persuasive and would be curious to see your response: https://www.noemamag.com/ai-and-the-limits-of-language/. Bottom line is, Moravec's paradox won't go away regardless of how many tokens you include in your context and how much data you train on.
The other thing is Chomsky. By now, I think we can all pretty much agree that it's not very sportsmanlike to beat up on Chomsky, and anyway Chomsky is not the only game in town. In fact, the informational/probabilistic view of language, inspired to a large extent by Shannon, was developed by Zellig Harris (who just happened to be Chomsky's advisor). Fernando Pereira has a nice overview of Harris' ideas: https://www.princeton.edu/~wbialek/rome/refs/pereira_00.pdf. Many people, including myself and Cosma Shalizi, have pointed out that LLMs are an instantiation of Harris' informational linguistics. I gave a talk at the Santa Fe Institute in June, where I discussed some of these things; here are the slides if you're interested: https://uofi.app.box.com/s/r32s6wz579astndv1ghcpeyl6ldaej7w.
Overall, though, I do agree that we are witnessing the emergence of a new paradigm of experimental philosophy, so buckle up, everyone!