Distillation Can Reduce the Size and Cost of AI Models

The original version of this story appeared in Quanta Magazine.
Earlier this year, the Chinese AI company DeepSeek launched a chatbot called R1, attracting significant attention. Much of the buzz centered on the claim that a relatively obscure company had built a chatbot that rivaled those from the world's most famous AI firms at a fraction of the computational cost. In response, shares of several Western tech companies dropped sharply; Nvidia, which makes the chips that power leading AI models, lost more stock value in a single day than any company in history.
The attention came with accusations. Reports suggested that DeepSeek had, without permission, extracted knowledge from OpenAI's proprietary o1 model using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had discovered a new, more efficient way to build AI.
However, distillation, also referred to as knowledge distillation, is an established technique in AI, part of computer science research for over a decade, and utilized by major tech corporations on their own models. “Distillation is one of the most crucial tools that companies have today to enhance model efficiency,” remarked Enric Boix-Adsera, a researcher at the University of Pennsylvania’s Wharton School focused on distillation.
Obscure Insights
The concept of distillation originated in a 2015 paper by three Google researchers, including Geoffrey Hinton, often called a godfather of AI and a 2024 Nobel laureate. At the time, researchers frequently ran ensembles of models—"many models connected together," as Oriol Vinyals, a principal scientist at Google DeepMind and co-author of the paper, described it—to boost performance. "However, running all the models in parallel was exceedingly cumbersome and costly," Vinyals noted. "We became interested in distilling that into a single model."
The researchers aimed to tackle a significant limitation of machine-learning algorithms: all incorrect answers were treated as equally wrong, regardless of how far off they were. In an image-classification model, for instance, "mistaking a dog for a fox was penalized just as heavily as mistaking a dog for a pizza," Vinyals explained. The researchers hypothesized that ensemble models carried information about which incorrect answers were less egregious than others, and that a smaller "student" model could use this information from the larger "teacher" model to learn to classify images more efficiently.
After conferring with Hinton, Vinyals devised a method for the large teacher model to convey more information about image categories to a smaller student model. The breakthrough involved focusing on “soft targets” in the teacher model—where it assigns probabilities to various possibilities instead of providing absolute answers. For instance, one model assessed a 30 percent likelihood of an image depicting a dog, a 20 percent chance of a cat, a 5 percent possibility of a cow, and a 0.5 percent likelihood of a car. By utilizing these probabilities, the teacher model effectively indicated to the student that dogs are quite similar to cats, somewhat relatable to cows, and markedly different from cars. This insight allowed the student to learn to recognize images of dogs, cats, cows, and cars more effectively, enabling a complex model to be simplified with minimal loss of accuracy.
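The soft-target idea can be sketched in a few lines of code. The sketch below is illustrative, not the paper's actual implementation: the class names, logits, and the temperature and weighting values are hypothetical, and the loss shown is the commonly used weighted combination of a cross-entropy term against the teacher's softened probabilities and one against the true label.

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing logits by a temperature T > 1 before the softmax spreads
    # probability mass onto the "wrong" classes, exposing the teacher's
    # soft targets (e.g. "dog is more like cat than like car").
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of (a) cross-entropy against the teacher's softened
    distribution and (b) cross-entropy against the true one-hot label."""
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    student_hard = softmax(student_logits, 1.0)
    # Cross-entropy H(p, q) = -sum_i p_i * log(q_i)
    soft_loss = -sum(p * math.log(q)
                     for p, q in zip(soft_targets, student_soft))
    hard_loss = -math.log(student_hard[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical classes: [dog, cat, cow, car]. The teacher's softened
# output gives "cat" far more probability than "car", which is exactly
# the similarity information the student learns from.
teacher_logits = [5.0, 4.0, 1.5, -2.0]
student_logits = [3.0, 1.0, 0.5, 0.0]
probs = softmax(teacher_logits, temperature=4.0)
print([round(p, 2) for p in probs])  # → [0.42, 0.33, 0.18, 0.07]
```

A student trained to minimize this loss over many images absorbs not just the right answers but the teacher's sense of which wrong answers are nearly right.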
Rapid Advancements
Initially, the idea did not gain immediate traction. The paper faced rejection from a conference, leading Vinyals to explore other subjects. However, distillation emerged at a pivotal moment. Around this time, engineers discovered that increasing the amount of training data fed into neural networks significantly enhanced their effectiveness. Consequently, the dimensions and capabilities of models surged, along with the operational costs.
Many researchers turned to distillation to create more compact models. For example, in 2018, Google unveiled a robust language model named BERT, which the company soon began using to analyze billions of web searches. Yet, because BERT was large and expensive to operate, the following year, developers created a smaller version dubbed DistilBERT, which became popular in both commerce and research. Distillation gradually gained widespread adoption and is now provided as a service by firms like Google, OpenAI, and Amazon. The original distillation study, still only available on the arxiv.org preprint server, has been cited over 25,000 times.
Because distillation requires access to the internal workings of the teacher model, an outside party cannot covertly distill data from a proprietary model such as OpenAI's o1, as DeepSeek was alleged to have done. Even so, a student model can still learn a great deal from a teacher simply by prompting it with targeted questions and training on its responses—an almost Socratic form of distillation.
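This query-based approach can be sketched simply. The example below is a toy illustration, not any real provider's API: `query_teacher` stands in for a call to a proprietary model whose weights are hidden, and the canned prompt and answer are invented for demonstration.

```python
def query_teacher(prompt):
    # Hypothetical stand-in for an API call to a proprietary model:
    # only the model's text output is visible, never its internals.
    canned = {"What is 2 + 2?": "4"}
    return canned.get(prompt, "I don't know.")

def build_distillation_set(prompts):
    # Pair each targeted question with the teacher's answer. The
    # resulting (prompt, response) pairs become fine-tuning data for
    # the student -- no access to the teacher's weights required.
    return [(p, query_teacher(p)) for p in prompts]

dataset = build_distillation_set(["What is 2 + 2?"])
print(dataset)  # → [('What is 2 + 2?', '4')]
```

The key contrast with classic distillation is what the student sees: text responses rather than the teacher's full probability distributions, so the signal is coarser but requires no special access.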
Meanwhile, other researchers are uncovering fresh applications. In January, the NovaSky lab at UC Berkeley demonstrated that distillation effectively trains chain-of-thought reasoning models, which employ multistep “thinking” to enhance performance on complex questions. The lab claims that its fully open-source Sky-T1 model cost less than $450 to train while achieving results comparable to a much larger open-source model. “We were genuinely astounded by how well distillation operated in this context,” expressed Dacheng Li, a Berkeley doctoral student and co-student lead of the NovaSky team. “Distillation is a fundamental technique in AI.”
Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.
