Interesting piece, but, in my opinion, a bit too quick to accept "primacy of language" as a foregone conclusion. I find the arguments laid out by Jacob Browning and Yann LeCun here rather persuasive and would be curious to see your response: https://www.noemamag.com/ai-and-the-limits-of-language/. Bottom line is, Moravec's paradox won't go away regardless of how many tokens you include in your context and how much data you train on.
The other thing is Chomsky. By now, I think we can all pretty much agree that it's not very sportsmanlike to beat up on Chomsky, and anyway Chomsky is not the only game in town. In fact, the informational/probabilistic view of language, inspired to a large extent by Shannon, was developed by Zellig Harris (who just happened to be Chomsky's advisor). Fernando Pereira had a nice overview of Harris' ideas: https://www.princeton.edu/~wbialek/rome/refs/pereira_00.pdf. Many people, including myself and Cosma Shalizi, have pointed out that LLMs are an instantiation of Harris' informational linguistics. I gave a talk at the Santa Fe Institute in June, where I discussed some of these things; here are the slides if you're interested: https://uofi.app.box.com/s/r32s6wz579astndv1ghcpeyl6ldaej7w.
Overall, though, I do agree that we are witnessing the emergence of a new paradigm of experimental philosophy, so buckle up, everyone!
Thanks for the comments, Max.
1. Indeed, I have read the linked article before. I am afraid I did not find the argument fully persuasive. They seem to primarily argue that there is intelligence which is not captured in language, e.g., “language doesn’t exhaust intelligence, as is evident from many species, such as corvids, octopi and primates.” I fully agree with that. However, this type of knowledge is possessed by many or maybe all animals. Do we have more of that knowledge than chimpanzees do? If we do not, then we are back to language as a uniquely human trait and the argument I make holds. If we do, how can we measure something like that? We would need a clear definition and/or a method of measurement. I did not see either there, and without one I don't see how any claim can be made.
The question, as I see it, is not whether language represents all of intelligence but whether it (aside from tool use) represents the difference between human and animal intelligence.
2. I might not have made my point clear. The intention was not to criticize Chomsky. I am actually contrasting Chomsky’s and Norvig’s views. While Chomsky’s take is extreme, Norvig represents a balanced mainstream position. A few years ago I would have completely agreed with him. However, my point was that “Thus LLM language abilities are a refutation not just of the original Chomskyan view but even of moderate mainstream positions, such as those expressed by Norvig.” I don’t think Shannon ever claimed language was a Markov chain; do you know if he or anyone else made that claim? I think, at least recently, pretty much everyone agreed it was not, and we had all been wrong.
Thanks for the references and the slides, going to read them shortly!
Thanks for the detailed response, Misha! Let me address your points:
1. I don't see how the arguments advanced by Browning and LeCun are any less persuasive than the ones you bring up. I do not think it is possible to neatly disentangle human linguistic capacity from other modalities. For example, our constant reliance on indexicals is evidence that other sensory modalities are crucially involved as well, e.g., saying things like "wow, look at this!" Moreover, language itself is a tool, as it helps us navigate the world and do a great deal of predicting and controlling. Even your example of someone with limited means to manipulate tools can be turned around to show that such a person will be able to use language as a tool to compensate for their lack of other capabilities, for example, by interacting with others who could provide help. Linguistic ability has been co-evolving with other capabilities in humans. Language acquisition by children involves what Steven Pinker called "semantic bootstrapping": A child acquires the syntax of a language by learning to recognize semantics encoded in various interactions with the world; closing the feedback loop around this leads to further learning of more sophisticated syntactic ability, which leads to more sophisticated elicitation of semantic relations, etc. This is a great deal more realistic than your statement about "a 'universal grammar' machinery that can be applied to learn any natural language" (which, ironically, sounds distinctly Chomskyan!).
2. Regarding Chomsky: criticize all you want, my point was that invoking Chomsky as a foil is essentially a strawman, since we know of several competing approaches explicitly based on statistical modeling -- e.g., that of Zellig Harris, whom I have already mentioned. The notion that generation and recognition of natural language can be implemented statistically using a predictive model does indeed go back to Shannon. These ideas were already present in some form in his 1948 paper that introduced information theory, and in more detail in his 1951 paper "Prediction and Entropy of Printed English" (see, e.g., https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf). This is what he writes in that paper:
"The new method of estimating entropy exploits the fact that anyone speaking a language possesses, implicitly, an enormous knowledge of the statistics of the language. Familiarity with the words, idioms, cliches and grammar enables him to fill in missing or incorrect letters in proof-reading, or to complete an unfinished phrase in conversation. An experimental demonstration of the extent to which English is predictable can be given as follows; Select a short passage unfamiliar to the person who is to do the predicting. He is then asked to guess the first letter in the passage. If the guess is correct he is so informed, and proceeds to guess the second letter. If not, he is told the correct first letter and proceeds to his next guess. This is continued through the text. As the experiment progresses, the subject writes down the correct text up to the current point for use in predicting future letters."
Zellig Harris elaborates on this idea; in particular, he is interested in the mechanisms by which these implicit statistical models of language arise and are reproduced, entrenched, and modified in populations of speakers through their interactions with the world and with each other. My point is that it is simply not accurate to say that the "predict-the-next-token-based-on-context" idea had no precursors before LLMs; there were plenty of precedents -- recall Fred Jelinek's (in)famous quip from 1985, when he was the director of speech recognition at IBM Research: "Every time I fire a linguist, the performance of the speech recognizer goes up."
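Incidentally, the guessing-game experiment Shannon describes above is easy to turn into a small simulation. Here is a minimal sketch in which a crude letter-frequency "predictor" stands in for the human subject; the toy corpus and test passage are my own placeholders, just to illustrate the procedure:

```python
# Sketch of Shannon's guessing game: a frequency-based "predictor" plays
# the role of the human subject.  The corpus and passage are toy
# placeholders, not Shannon's data.
from collections import Counter
import math

corpus = "the quick brown fox jumps over the lazy dog " * 50
passage = "the lazy dog jumps over the quick brown fox"

unigram = Counter(corpus)                   # overall letter frequencies
bigram = Counter(zip(corpus, corpus[1:]))   # which letter follows which

def ranked_guesses(prev):
    """Candidate next letters, most probable first, given the previous letter."""
    follow = Counter({b: c for (a, b), c in bigram.items() if a == prev})
    rest = Counter({ch: c for ch, c in unigram.items() if ch not in follow})
    return [ch for ch, _ in follow.most_common()] + [ch for ch, _ in rest.most_common()]

guess_counts = []
prev = passage[0]
for ch in passage[1:]:
    guesses = ranked_guesses(prev)
    guess_counts.append(guesses.index(ch) + 1)   # guesses needed until correct
    prev = ch

# Shannon turns the statistics of these guess counts into upper and lower
# bounds on the entropy of English; here we just report the entropy of the
# empirical guess distribution.
n = len(guess_counts)
freq = Counter(guess_counts)
guess_entropy = -sum((c / n) * math.log2(c / n) for c in freq.values())
print(f"average guesses: {sum(guess_counts) / n:.2f}, "
      f"guess-distribution entropy: {guess_entropy:.2f} bits/char")
```

The better the predictor (human or machine), the fewer guesses are needed and the tighter the resulting entropy estimate; that is exactly the sense in which next-token prediction and "knowledge of the statistics of the language" are two sides of the same coin.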
Now, as far as whether "language is a Markov chain," let's be terminologically accurate. Language (an open, evolving system consisting of a vocabulary, grammar, syntax, rules of formation, etc., together with the evolving community of users of that language) is not a Markov chain or any other stochastic process per se. This term can be only applied to specific realizations (sequences of sentences or texts) generated by humans, or by LLMs, as the case may be. Shannon in fact was quite certain that we can *model* such realizations well by Markov chains of a sufficiently high order, although it is a bit over the top to say that a given stochastic process with context length in the tens of thousands is a "Markov chain." This, to me, trivializes that concept. I am fairly certain that there are fundamental cognitive and physical limits to the context length a human can attend to, so most likely well-trained LLMs have a great deal of fading memory, and thus I would expect to see a much smaller "effective memory" length, whether for humans or for well-trained LLMs.
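To make "effective memory" a bit more concrete: one rough way to probe it is to fit character-level Markov models of increasing order and watch where the held-out cross-entropy stops improving. A minimal sketch, with the corpus file name and the add-alpha smoothing being assumptions on my part:

```python
# Rough probe of "effective memory": fit order-k character Markov models
# and see where held-out cross-entropy flattens out.  The corpus file and
# the add-alpha smoothing are illustrative assumptions.
from collections import Counter, defaultdict
import math

text = open("corpus.txt").read().lower()   # assumed: any reasonably large plain-text file
split = int(0.9 * len(text))
train, test = text[:split], text[split:]

def cross_entropy(order, alpha=0.5):
    """Bits per character of an add-alpha smoothed order-k Markov model."""
    counts = defaultdict(Counter)
    for i in range(order, len(train)):
        counts[train[i - order:i]][train[i]] += 1
    vocab = len(set(train))
    total_bits, n = 0.0, 0
    for i in range(order, len(test)):
        ctx, ch = test[i - order:i], test[i]
        c = counts.get(ctx, Counter())
        p = (c[ch] + alpha) / (sum(c.values()) + alpha * vocab)
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n

for k in range(8):
    print(f"order {k}: {cross_entropy(k):.2f} bits/char")
```

If the curve levels off at a modest order, that is the "fading memory" I have in mind; the open empirical question is what the analogous curve looks like when the predictor is a transformer rather than a lookup table.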
A more interesting question is not whether language can be modeled by a Markov chain, but what sort of structure can be distilled from the way LLMs attend to the context when determining the probability distribution of the next token. Since, as you correctly point out, systems like ChatGPT have no internal state, I would suspect that they learn implicitly some sort of an equivalence relation on the set of contexts, which may in part be determined by the pattern of interaction among the multiple attention heads and whatnot. And this suggests that, at least to some nontrivial extent, they acquire some semantics from syntax alone. These are all super-interesting questions that, I believe, we should be able to address empirically.
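One crude empirical handle on that "equivalence relation on contexts" idea: compute the next-token distribution a model assigns to a handful of contexts and see which contexts it treats as near-interchangeable. A sketch using GPT-2 as a small, easy-to-run stand-in (the contexts and the symmetrised KL comparison are my own choices, not a claim about how this should be done):

```python
# Probe which contexts a model treats as (nearly) equivalent by comparing
# their next-token distributions.  GPT-2 is a small stand-in; the contexts
# below are toy examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

contexts = [
    "The capital of France is",
    "The capital of Italy is",
    "My favorite breakfast is",
    "Her favorite breakfast is",
]

with torch.no_grad():
    dists = []
    for c in contexts:
        ids = tok(c, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]          # logits for the next token
        dists.append(torch.softmax(logits, dim=-1))

def sym_kl(p, q, eps=1e-9):
    """Symmetrised KL divergence between two next-token distributions."""
    p, q = p + eps, q + eps
    return 0.5 * (torch.sum(p * torch.log(p / q)) + torch.sum(q * torch.log(q / p)))

for i in range(len(contexts)):
    for j in range(i + 1, len(contexts)):
        d = sym_kl(dists[i], dists[j]).item()
        print(f"{contexts[i]!r} vs {contexts[j]!r}: {d:.3f}")
```

Low divergence between a pair of contexts is one operational sense in which the model has lumped them into the same "equivalence class"; whether those classes carry anything we would want to call semantics is exactly the kind of question we should be able to settle empirically.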
Thank you for your thoughtful comments, Max. I agree with many of your points. Let me clarify a few things which perhaps did not come across clearly enough, and add some further thoughts.
I certainly agree that other (non-linguistic) modalities exist and are very important overall. What is not clear to me is how important they are specifically with respect to the _human_ condition, as opposed to other animals. Without a clear test/definition of this importance, how can we say anything with confidence? I think the onus is on them (or on you) to provide such a test. In that paper they seem to make strong claims (e.g., "these systems are doomed to a shallow understanding that will never approximate the full-bodied thinking we see in humans") without defining what shallow or full-bodied might mean or how one can measure it.
“Even your example of someone with limited means to manipulate tools can be turned around to show that such a person will be able to use language as a tool to compensate for their lack of other capabilities, for example, by interacting with others who could provide help.”
Of course, language itself is a tool, probably the most powerful tool we have. But what about a person stranded on an island? They have no physical tools and calling for help is no good. Still, they are just as intelligent as anyone else.
“A child acquires the syntax of a language by learning to recognize semantics encoded in various interactions with the world; closing the feedback loop around this leads to further learning of more sophisticated syntactic ability, which leads to more sophisticated elicitation of semantic relations, etc.”
Sure.
“This is a great deal more realistic than your statement about "a 'universal grammar' machinery that can be applied to learn any natural language"”
The existence of this unique machinery is simply a fact – humans have the ability to acquire language; animals, even apes, do not. There is nothing unrealistic or controversial about it (except for the word "grammar", perhaps).
“which, ironically, sounds distinctly Chomskyan!”
I am not completely unaware of that :)
“Regarding Chomsky: criticize all you want, my point was that invoking Chomsky as a foil is essentially a strawman, since we know of several competing approaches explicitly based on statistical modeling -- e.g., that of Zellig Harris, whom I have already mentioned.”
I don't fully agree with that. Chomsky’s disregard for statistics has become a bit of a caricature. Still, Chomsky has been and still is an extremely influential figure in linguistics. There have definitely been other schools of thought (e.g., Zellig Harris, as you point out), but Chomskyan thinking was dominant and maybe still is.
Also, no matter what Jelinek said, while strong statistical models may exist, it was far from clear they could be learned directly from a corpus or that they would be so simple. This is a tricky question and requires a nuanced discussion. Daniel Hsu and I are writing a follow-up document discussing statistical and computational issues in more detail.
“My point is that it is simply not accurate to say that the "predict-the-next-token-based-on-context" idea had no precursors before LLMs, there were plenty of precedents”
Did something I said come across that way? Of course, the idea of predicting the next token has plenty of precedents, e.g., autoregressive models.
“Now, as far as whether "language is a Markov chain," let's be terminologically accurate. Language (an open, evolving system consisting of a vocabulary, grammar, syntax, rules of formation, etc., together with the evolving community of users of that language) is not a Markov chain or any other stochastic process per se. This term can be only applied to specific realizations (sequences of sentences or texts) generated by humans, or by LLMs, as the case may be.”
Absolutely. That’s why I was careful not to say that language _was_ a Markov chain. The claim was that Markov chain models have linguistic competence comparable to humans: “While it is impossible to say whether they model “all of language”, their linguistic competence suggests that higher level concepts such as syntax and semantics can emerge naturally from the statistics of word sequences alone and that externally imposed complex model structure and memory states are not needed.”
"A more interesting question is not whether language can be modeled by a Markov chain, but what sort of structure can be distilled from the way LLMs attend to the context when determining the probability distribution of the next token. Since, as you correctly point out, systems like ChatGPT have no internal state, I would suspect that they learn implicitly some sort of an equivalence relation on the set of contexts, which may in part be determined by the pattern of interaction among the multiple attention heads and whatnot. And this suggests that, at least to some nontrivial extent, they acquire some semantics from syntax alone. These are all super-interesting questions that, I believe, we should be able to address empirically."
Agreed!
I don't agree with "Best current LLMs, such as GPT4, have linguistic competence comparable or exceeding that of an average human. Furthermore their ability extends across numerous languages and nearly every domain of human knowledge not requiring manipulation of physical objects"
It is true, I think, that aspects of linguistic competence are suddenly within reach, but this goes too far. Even human beings who are not yet able to formulate complete sentences in their native language have a working grasp of other aspects, such as narrative and relevance, which the LLMs do not reliably command. It's not that one set of abilities is superior to the other; it's far more that they are incommensurate, in the same way that grandmaster-level chess is neither harder nor easier than making a cup of tea in an unfamiliar kitchen. We are predisposed to be impressed by things that humans find hard.
I like the claim in https://arxiv.org/pdf/2308.16797.pdf: "these models have achieved a proxy of a formal linguistic competence in the most studied languages. That is, its responses follow linguistic conventions and are fluent and grammatical, but they might be inaccurate or even hallucinate ... they also show signs of functional linguistic competence in its responses, i.e., discursive coherence, narrative structure and linguistic knowledge, even if not fully consistent (sometimes they do not consider context or situated information, and fail to adapt to users and domains)."
Of course these machines are often wrong and not always consistent. So are humans though. I am not sure how to interpret "a proxy" of competence. They seem to be competent as evidenced by them being able to complete many human tasks, such as SAT exams. To give another more specific example, I can have a reasonable conversation with GPT-4 CI about various machine learning topics at the level of at least a good undergraduate student. It can implement various ML methods, plot decision boundaries, etc.
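To be concrete about the kind of request I have in mind, "plot a decision boundary" amounts to something like the following scikit-learn sketch; the dataset and classifier are arbitrary choices on my part, not a transcript of what GPT-4 produced:

```python
# Minimal example of the "implement an ML method and plot its decision
# boundary" kind of task.  The two-moons dataset and RBF SVM are
# arbitrary illustrative choices.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

# Evaluate the classifier on a grid and shade the two predicted regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.title("Decision boundary of an RBF SVM on two-moons data")
plt.show()
```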
I worked on SAT exams at ETS and know how they are graded. The exams test exactly those things that discriminate between more competent and less competent human students, but they do not test the attributes that are shared between nearly all human candidates, because there is no need to. To stand up the claim of human capability, tests should probe those too.
I agree, but what are they, from the language point of view?
The paper identifies discourse coherence and narrative structure as areas in which LLM linguistic performance is promising rather than great.
This discussion gets complex, because if you draw a line around language in the same way that Chomsky and some of his more adherent disciples do, these areas are out of scope, falling into psychology, or theory of mind, or pragmatics. But that is really just a reflection of the fact that Chomsky's research interests lean heavily towards formal syntax, and his prestige (earned) drags many along. Not forever, though: the ranks of Chomskyans who are no longer as adherent as they once were are well populated.
Many linguists, and nearly all cognitive psychologists, disagree with the narrow view of language. If you are talking just about that narrow slice, LLMs are pretty good.
There are certainly aspects of intelligence that these models do not capture. But the discussion gets too complex and the definitions are not clear enough to come to any definite conclusions or to differentiate humans from other animals. That's why I like the Turing test, despite its limited nature.
Yes, I totally agree on that. That's why I stay out of broad claims about human-level. You went beyond what I am comfortable saying.
On specific tasks like GLUE and SuperGLUE it is possible to draw conclusions and clearly delineate their limitations. Both are great initial designs, but we now understand that GLUE has reached asymptote and isn't a helpful benchmark any more. Work on refining benchmarks is central to really understanding what we have and don't have.
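To give a sense of what working with a specific task looks like, here is a minimal sketch that scores a trivial majority-class baseline on one GLUE task (SST-2, chosen arbitrarily) using the Hugging Face datasets package; any real evaluation would of course swap in an actual model:

```python
# Minimal GLUE example: score a majority-class baseline on SST-2 so there
# is a floor to compare real models against.  Requires the `datasets` package.
from collections import Counter
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
train_labels = sst2["train"]["label"]
val_labels = sst2["validation"]["label"]

# Majority-class baseline: always predict the most common training label.
majority = Counter(train_labels).most_common(1)[0][0]
accuracy = sum(int(l == majority) for l in val_labels) / len(val_labels)
print(f"majority-class baseline accuracy on SST-2 validation: {accuracy:.3f}")
```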