Mitigating AI risk has become a topic of intense effort in recent years and months. These well-intentioned efforts grow out of real concern about an uncertain future given the rapid development of AI technologies. In their current form, however, they are unlikely to succeed. The purpose of this document is to make the case that developing a fundamental mathematical theory of deep learning is a prerequisite for managing risk as our society transitions to wide use of AI technology. Theory in this context refers to identifying precise measurable quantities and mathematically describing their patterns, the way the word is used in physics and engineering, rather than to proving rigorous theorems. Recent progress in the theory of statistical inference and optimization of neural networks provides hope that such a theory may indeed be possible. Admittedly, even a comprehensive theory of deep learning cannot guarantee a successful AI transition. If we *do not* have a theory, however, we will certainly be unable to control or defend against misuse of AI systems, whose behavior is already of complexity comparable to, or exceeding, that of humans. Never before has a technology been deployed so widely and so quickly with so little understanding of its fundamentals. Given the societal impact of rapidly developing AI, this is a matter of urgent importance.

Much of the recent AI risk and safety work has centered on making Large Language Models (LLMs) more ethical through interactions with human teachers, in particular so-called Reinforcement Learning from Human Feedback (RLHF). This is not dissimilar to teaching children ethics and civics. Just as with children, though, the results of such education are not entirely certain, while the risks of a rogue AI can be far higher than those of a rogue human. LLMs are trained on trillions of words (“tokens”) containing a large portion of accumulated human knowledge, literature, and history. In contrast, their interactions with actual humans are necessarily far more limited and represent just a tiny fraction of their “experience”. A “teacher” AI model can be used instead of humans to make larger-scale training possible, but the same logic applies to the teacher itself. Given the lack of visibility inside the models, it is impossible to identify what sources of information they draw on, and there is no assurance that they cannot be “jailbroken”. Another method of controlling the output, that of restricting output tokens, is similarly problematic. There is no reason to think that models, when prompted, will not find ways to circumvent censorship, just as humans do. Tellingly, defenses against adversarial examples in computer vision have often been broken even before the work could be presented at a relevant conference (and, indeed, a new set of universal attacks against LLMs was released during the writing of this document). Ultimately, surface-level approaches, such as RLHF or token restrictions, lack insight into the internal operations of the model. They are unlikely to be effective and certainly cannot provide any guarantees. Worse, they offer no guidance for defending against malicious AI models.

One may argue that no deeper insight into the operation of these models is possible. Given their complexity, trying to predict their behavior is comparable to predicting the behavior of humans or animals, a challenging problem even for a fly or a worm. Why should it be easier for artificial neural networks? Yet there is a dramatic difference between natural organisms and artificial learners. The human brain evolved from a collection of single-celled organisms and was only very recently repurposed to play chess and write philosophical treatises. It still uses much of the chemical and electrical machinery originally intended to keep these cells alive and active, to conserve energy, and ultimately to propagate their genes. The mere possibility of such repurposing is truly astonishing and suggests a universality of certain patterns in nature. Yet it is not surprising that these “hybrid” electro-chemical systems are intrinsically complex and not easy to probe or analyze. Despite remarkable progress in neuroscience, the human brain, or even that of a housefly, largely remains a black box. Machine learning models, on the other hand, are mathematical algorithms implemented on hardware specifically designed for certain mathematical operations, such as matrix multiplication. They are much simpler, more purposeful, and, arguably, far more efficient for the task at hand, as they are not constrained by the need for survival, gene propagation, or a billion years of evolutionary baggage. In contrast to biological systems, we (or at least those of us privileged with access to the models and hardware) have full visibility into their innards and can probe every aspect of their operation as they are trained or run. Furthermore, modern architectures, such as transformers, bear at most a passing resemblance to biological neural networks and thus need not inherit their limitations. Our difficulty in understanding the brain need not doom efforts to understand artificial models.

But can their operations be understood mathematically? The fact that the algorithms are just mathematical formulas is no guarantee: simple mathematical rules can produce great complexity. The square grid of Conway’s Game of Life, governed by a handful of simple rules, can for example be rigged to simulate arbitrary computer programs. Nevertheless, mathematical theory has been “unreasonably effective” in describing and helping to engineer the physical processes used in modern technology. Every electric device relies on the laws of electromagnetism, GPS systems use both special and general relativity to compute accurate locations, chip production and imaging technologies use quantum mechanics; the list goes on. While large amounts of hard empirical engineering work are still required to design a bullet train, a space rocket, or a computer chip, we have a solid theoretical grasp of the underlying principles.
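The gap between simple rules and complex behavior is easy to make concrete. The sketch below (a minimal implementation; the grid size, the glider pattern, and the step count are arbitrary illustrative choices) steps the standard B3/S23 Life rule on a small toroidal grid and follows a “glider”, a pattern that travels indefinitely:

```python
import numpy as np

def life_step(grid):
    """One update of Conway's Game of Life (B3/S23) with wraparound edges."""
    # Count the 8 neighbors of every cell via shifted copies of the grid.
    neighbors = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # A cell is born with exactly 3 neighbors, survives with 2 or 3.
    return (neighbors == 3) | (grid & (neighbors == 2))

# A glider: persistent, self-propagating behavior emerging from a
# two-line update rule.
grid = np.zeros((8, 8), dtype=bool)
for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    grid[r, c] = True
for _ in range(4):  # after 4 steps the glider has shifted one cell diagonally
    grid = life_step(grid)
print(int(grid.sum()))  # every phase of the glider has exactly 5 live cells
```

The point is not the specific pattern but the contrast: the update rule is trivial to state, yet the global dynamics it generates are rich enough to be Turing-complete.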

We do not have a comparable understanding of current deep learning systems. Why is that? While mathematics has provided insights into the physical world for thousands of years, perhaps its reign is coming to an end, and these machines are simply too complex to be understood by our squishy brains? This is not an unreasonable position to take. One reason to doubt the effectiveness of mathematical description is that these models are extremely high-dimensional, at least judging by the number of parameters, and are trained on datasets whose scope far exceeds any individual human’s knowledge. Perhaps the complexity of these models is beyond any interpretation accessible to humans?

There are several reasons to think that this is not the case and that simple mathematical principles may well be hidden within the apparent complexity of these systems.

The architectures and methods used to train neural models are based on a composition of relatively simple principles. While the algorithms have been designed *per aspera ad astra*, through intuition and a great deal of painstaking effort and optimization, it is quite plausible that they fundamentally need not be very complex. After all, they are a product of our limited brains, able to explore but a tiny fraction of the potential model space. Once a good solution is found, the social incentives favor building on this success through incremental but sure improvements, which often make the model more complex, rather than the risky effort of looking for fundamentally different solutions. Nearly all effort and funding in LLMs has recently converged on transformer architectures, leading to their seemingly inexorable ascent. Nevertheless, alternatives (such as RWKV) are starting to appear. Similarly, several different types of architectures for vision problems achieve comparable performance. This suggests that there are simpler principles at play and that the architecture-specific “bells and whistles” are of secondary importance. Furthermore, the fact that neural architectures are often successful across multiple application domains suggests that their effectiveness relies on fundamental patterns in data, a “gravitational force” in the data solar system, rather than a serendipitous “alignment of the planets” in specific instances of data analysis. This universality hints at a continuity with other fundamental principles discovered in science and mathematics.

The extraordinary success of LLMs shows just how dramatically we have underestimated the power of statistical inference. Traditionally, we believed that estimating conditional probabilities in a sequence of words was possible only for very short sequences, at the trigram or perhaps quadrigram level (even the devil was stumped by a pentagram in Goethe’s Faust). Yet modern LLMs appear to provide accurate estimates of probabilities for 1000+-grams, and the context sizes of some recent models approach 100,000 tokens. How is this possible in view of the curse of dimensionality? The unavoidable conclusion is that our understanding of dimensionality and complexity is inadequate. It is also clear that these high-dimensional problems must be extremely structured for inference to be possible at all. While the underlying data may or may not be low-dimensional, there must be very few relevant dimensions or directions at each individual point or context; otherwise, even trillions of data points could not provide enough coverage (as we had erroneously believed to be the case in the past). The successful models can be quite different, but they must all rely on this structure for inference. One can perhaps visualize the relevant directions as millions of vanishingly thin low-dimensional strands of spaghetti tangled in a vast high-dimensional space: despite the great overall complexity, only a few data points are needed to recover an individual strand.
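A back-of-the-envelope calculation makes the curse of dimensionality vivid. In the sketch below (the vocabulary size and token count are illustrative assumptions, not measurements of any particular model), the space of possible contexts outgrows any conceivable training set already at the 4-gram level:

```python
# Hypothetical but representative numbers: the count of distinct n-gram
# contexts grows exponentially with n, so directly estimating conditional
# probabilities by counting becomes hopeless almost immediately.
VOCAB = 50_000            # assumed vocabulary size
TRAINING_TOKENS = 10**13  # "trillions of words" of training data

for n in [2, 3, 4, 5]:
    contexts = VOCAB ** (n - 1)            # possible (n-1)-word histories
    coverage = TRAINING_TOKENS / contexts  # average observations per history
    print(f"{n}-gram: {contexts:.1e} contexts, "
          f"~{coverage:.1e} observations per context")
```

Under these assumptions, a 4-gram model already has more contexts than there are training tokens, and a 1000-gram context space is astronomically beyond reach; whatever LLMs exploit, it cannot be naive coverage of context space.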

We thus posit that the structures relevant for inference in data can be represented by a mixture of (linear or non-linear) low-dimensional subspaces. It is important to note that these subspaces are not the same as the data itself. In fact, their dimension must be far lower than the ambient or, perhaps, even the intrinsic dimension of the underlying data, even when the data lie on a low-dimensional manifold. For example, for a linear predictor with a single output, the subspace relevant for inference is one-dimensional, independently of the dimension of the input data or its geometry. Therefore, the predictions of a model must generally be based on just a few directions in some intermediate space, something we should be able to recover and control. This is also indirectly confirmed by the existence of small “adapter” architectures, which are used to adjust models to new tasks based on just a few training examples.
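The one-dimensional case is easy to verify directly. In the following sketch (the dimension and random seed are arbitrary choices), a linear predictor in a 10,000-dimensional ambient space is completely insensitive to any perturbation orthogonal to its single relevant direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                 # high ambient dimension
w = rng.normal(size=d)     # a linear predictor f(x) = w @ x
x = rng.normal(size=d)

# Build a perturbation orthogonal to w by projecting out the w-component.
v = rng.normal(size=d)
v -= (v @ w) / (w @ w) * w
assert abs(v @ w) < 1e-8 * np.linalg.norm(v) * np.linalg.norm(w)

# Moving x arbitrarily far along v leaves the prediction unchanged:
# of the 10,000 ambient directions, exactly one matters for inference.
print(np.allclose(w @ x, w @ (x + 100 * v)))  # True
```

Of course, real networks compose many such maps nonlinearly, but the same logic suggests that at any given input only a few local directions can influence the output, which is what would make recovery and control feasible.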

This rethinking of the role of dimensionality parallels the recent realization that the classical understanding of over-fitting is incomplete and that fitting noisy data exactly is often *benign*, even when the models are seemingly overly complex. Given that our long-standing beliefs about the nature of model selection and model complexity in statistical inference have turned out to be flawed, less suspension of disbelief is needed to suspect that other aspects of statistical inference may be on similarly uncertain ground.

Finally, resistance to mathematical modeling may also stem from our *want* to believe that human consciousness is beyond simple models. “If the human brain were so simple that we could understand it, we would be so simple that we couldn’t,” goes the saying attributed to the physicist E. Pugh. Another widely cited quote, by the physicist M. Kaku, proclaims that the human brain is “the most complicated object in the known universe” (self-evidently, that object is the universe itself). Similarly, proponents of “quantum consciousness” hold that our brain is based on quantum principles and cannot be modeled by a conventional computer. It has thus come as a major shock that mathematical models, even the largest of which (currently with about a trillion parameters) can easily fit on a single laptop hard drive, perform close to humans on many “human” tasks, such as essay writing, SAT tests, and Python coding. There seems to be a desire to extend the penumbra of the mathematical ineffability of the brain to these algorithms. While such a thought may be comforting, it lacks intellectual clarity. The evidence from computer models suggests that our brains may not be as sophisticated as we would like to believe, not the other way around. The human brain may, after all, be simple enough that even we can understand it.
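Returning to benign overfitting: the phenomenon can be sketched numerically in a few lines. Below, a minimum-norm linear model with far more parameters than samples interpolates noisy labels exactly, yet its test error can remain well below that of the trivial zero predictor. The spiked covariance and all constants are illustrative assumptions chosen so that interpolation has a chance to be harmless; this is a toy demonstration, not a claim about real systems.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 100, 2000, 0.5   # far more parameters than samples

# Spiked covariance: a few strong signal directions plus a long, weak tail
# that acts as an implicit regularizer (illustrative numbers).
scales = np.concatenate([10.0 * np.ones(5), 0.1 * np.ones(d - 5)])
beta_true = np.zeros(d)
beta_true[:5] = 1.0 / np.sqrt(5)   # signal lives in the 5 spiked directions

X_train = rng.normal(size=(n, d)) * scales
y_train = X_train @ beta_true + noise * rng.normal(size=n)

# Minimum-norm interpolator: fits every noisy label exactly.
beta_hat = X_train.T @ np.linalg.solve(X_train @ X_train.T, y_train)
print("max train residual:", np.abs(X_train @ beta_hat - y_train).max())

X_test = rng.normal(size=(1000, d)) * scales
test_mse = np.mean((X_test @ beta_hat - X_test @ beta_true) ** 2)
null_mse = np.mean((X_test @ beta_true) ** 2)  # error of predicting zero
print("test MSE:", test_mse, "vs null predictor:", null_mse)
```

Classical intuition says a model that fits noise exactly must predict poorly; here the noise is absorbed mostly by the many weak tail directions while the few strong directions carry the signal, which is exactly the kind of structure the recent theory identifies.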

As we have discussed, current attempts to understand, predict and control behaviors of machine learning models are not based on fundamental principles and are unlikely to be successful. On the other hand, for the reasons given above, there is a real hope that such fundamental principles may exist and can be discovered. This is evidenced by recent progress in the theory of deep learning, including rethinking of overfitting and model complexity, theory of infinitely wide neural networks, optimization for over-parameterized systems and the emerging understanding of low-dimensional feature learning.

Admittedly, even a complete theory of deep learning may not be enough. There has been intense concern about the problem of “AI alignment”, the tension between AI and human interests. While the scope of the AI alignment issue is still uncertain, in view of the lack of *human alignment* even in democratic societies, and given human history, AI will certainly be used as a weapon: for social engineering in, for example, mass impersonation attacks; for designing, or as a component of, cyber-weapons; and in actual warfare. Even deep insight into AI model capabilities may not be sufficient to defend against AI attacks. Furthermore, while there are reasons to be optimistic about the progress of deep learning theory, there is no guarantee that such a theory can be developed in a timely manner, or at all. Yet there is also no alternative. It is difficult to see how deep learning systems, with their human-like complexity, can be controlled and guided in socially acceptable ways, or countered in adversarial situations, without a fundamental understanding of their principles.

These considerations call for a major expansion of the effort to understand the fundamental principles of deep learning. While investments in LLMs and other deep learning systems, including the topic of AI risk, have skyrocketed and now represent a significant portion of all technological investment, barely any funding goes toward the basic mathematical theory of machine learning. In the past, fundamental science often operated on a different timescale from technological development. We do not have that luxury now. Given the unprecedented pace of technological and societal change, it is hard to see what is coming in the next few months and years, let alone decades. Understanding the fundamental theory of deep learning and applying its analyses to real systems is a compelling and urgent need.

**Acknowledgements.** I am grateful to friends and colleagues for many thoughtful and informative discussions. In particular, I would like to thank D. Beaglehole, L. Bergen, A. Boulgakov, S. Dasgupta, Y. Freund, D. Hsu, R. Huerta, J. Maher, A. Radhakrishnan, E. Richard, T. Schramm, Y. Wang for insightful comments on the draft.

## The necessity of machine learning theory in mitigating AI risk
