Scaling laws in deep learning
Scaling laws in deep learning revolve around a simple yet powerful observation: if you plot a model’s performance (often expressed as a loss metric) against the total compute used, you will frequently see a power-law relationship. In its most straightforward form, this relationship can be written as y = aC^(-b), with y as the loss and C as the compute. Taking logarithms gives log y = log a - b log C, so on a log-log plot the relationship appears as a straight line with slope -b and intercept log a. While the equation might look simple on the surface, it captures a fundamental truth: for many methods, more compute means better performance, and plotting this relationship helps us predict how performance will change as we scale up resources.
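To make this concrete, here is a minimal sketch (the numbers are made up for illustration, not taken from any real training run) of how one might recover a and b from a handful of (compute, loss) measurements by fitting a straight line in log-log space:

```python
# A minimal sketch with hypothetical data: fit y = a * C**(-b) by linear
# regression in log-log space, where it becomes log y = log a - b * log C.
import numpy as np

# Hypothetical measurements: compute budgets and the loss observed at each.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = np.array([3.10, 2.45, 1.94, 1.54, 1.22])

# Straight-line fit in log-log space: slope = -b, intercept = log(a).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
b = -slope
a = np.exp(intercept)

print(f"fitted a ~ {a:.3g}, b ~ {b:.3g}")
# Extrapolate the fitted power law to a larger compute budget.
print(f"predicted loss at C = 1e23: {a * 1e23 ** (-b):.3g}")
```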
In practice, the slope b and the intercept a matter a great deal when comparing two different approaches. Lowering the intercept a corresponds to shifting the performance curve downward, so that you get lower loss for the same compute; if you accomplish that without changing b, you have effectively found a more efficient way to train models. Realistically, though, many tricks that reduce the loss at smaller scales also reduce how much performance can improve with additional compute. This shows up as both a smaller intercept and a smaller slope. Such a method might look brilliant at lower scales, but once someone else throws a massive amount of compute at an older, simpler method, the older method might catch up or even surpass the clever technique. That pattern is a key aspect of what Richard Sutton famously called “The Bitter Lesson,” where many sophisticated methods that work nicely in small-scale experiments get overtaken by brute-force scaling.
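Where exactly that overtaking happens falls straight out of the power law: two curves a1 C^(-b1) and a2 C^(-b2) cross at C* = (a1/a2)^(1/(b1 - b2)). The short sketch below uses made-up constants (nothing here is measured from real models) to show a "clever" method that wins below the crossover and loses above it:

```python
# A minimal sketch with hypothetical constants: two methods whose losses
# follow y = a * C**(-b). "clever" has the lower intercept but a shallower
# slope; "simple" starts out worse but scales better.
def loss(C, a, b):
    """Power-law loss y = a * C**(-b) at compute budget C."""
    return a * C ** (-b)

clever = {"a": 94.0, "b": 0.07}   # better at small scale, shallower slope
simple = {"a": 400.0, "b": 0.10}  # worse at small scale, steeper slope

# The curves cross where a1 * C**(-b1) == a2 * C**(-b2), i.e. at
# C* = (a1 / a2) ** (1 / (b1 - b2)).
crossover = (clever["a"] / simple["a"]) ** (1 / (clever["b"] - simple["b"]))
print(f"crossover at roughly {crossover:.2e} units of compute")

for C in [1e18, 1e20, 1e22]:
    print(f"C = {C:.0e}: clever = {loss(C, **clever):.2f}, "
          f"simple = {loss(C, **simple):.2f}")
# Below the crossover the clever method has lower loss; above it, the
# simple method's steeper slope takes over.
```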
If you manage to discover a method that not only performs well at small scale but also increases the slope b, you have struck gold. This is rare, and when it happens it tends to mark a landmark shift, like the rise of Transformers over LSTMs. Sometimes the new method looks mediocre at small scale, even starting off worse than its predecessor, but its steeper slope lets it pull ahead dramatically once the compute ramps up. Methods that genuinely improve the slope b can reshape the entire playing field.
When people talk about scaling laws being broken or running into a wall, they often mean one of two things. One possibility is that data availability or other practical constraints prevent further scaling, even though the fundamental relationship still holds. Ilya Sutskever has recently touched upon the problem of data scarcity and how it slows or stalls improvements in practice. Another possibility is that a new method appears that achieves the same or better performance at significantly reduced compute, but in many cases this simply lowers the intercept a. If the slope b remains the same, the original scaling law is not truly “broken.” Instead, we have just shifted the performance curve lower, which means we can do more with less, but if we throw more compute at the problem, the gains generally continue.
One of the most potent ways to lower the intercept in real-world terms is through what is sometimes called the “Compute Multiplier.” This idea focuses on making your available compute more effective, thereby lowering the cost or resource demands for the same performance. A classic example is Mixture-of-Experts (MoE) architectures, which often preserve or slightly improve the slope while shifting the intercept down and thus reducing the required compute for a given level of performance.
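To see why a compute multiplier only moves the intercept, note that if an m-fold multiplier makes every unit of compute act like m units, the loss becomes y = a(mC)^(-b) = (a m^(-b)) C^(-b): the intercept shrinks by a factor of m^b while the slope b is untouched. Below is a small numerical check with made-up constants (the 4x multiplier is purely illustrative, not a claim about any particular MoE model):

```python
# A minimal sketch with hypothetical constants: an m-fold compute multiplier
# turns y = a * C**(-b) into y = a * (m*C)**(-b) = (a * m**(-b)) * C**(-b),
# i.e. a smaller intercept with the same slope.
a, b = 200.0, 0.1   # hypothetical baseline power-law constants
m = 4.0             # illustrative 4x compute multiplier

for C in [1e19, 1e20, 1e21]:
    with_multiplier = a * (m * C) ** (-b)
    shifted_intercept = (a * m ** (-b)) * C ** (-b)
    print(f"C = {C:.0e}: {with_multiplier:.4f} == {shifted_intercept:.4f}")
```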
In light of these considerations, it is essential to check whether a new approach can improve performance by scaling up. If it can reduce the intercept while maintaining or increasing the slope, it may be a powerful innovation that has significant practical implications. However, if it reduces both the intercept and the slope, it might shine at small scales but become less competitive at higher scales. History shows that many complex techniques lose out to simpler methods that can readily take advantage of more data and compute, as the simpler approach’s scaling potential eventually overtakes all small-scale optimizations.
When people claim the era of scaling is over, a more careful reading often reveals that practical resource constraints or local shifts in performance curves are the source of that claim. It remains rare to see a method that genuinely breaks the underlying power-law pattern; much more common is that the line shifts downward or the slope changes in limited contexts. True and concerning “breaks” in the scaling law, where the log-log line fundamentally bends and no longer yields improvements with more compute, would mean that deep learning’s progress hits a real ceiling. That scenario would be disappointing not just for the major research labs but for all players, since it would reduce our collective faith in the capacity of brute-force approaches to continue delivering breakthroughs.
Ultimately, scaling remains central to modern deep learning, and any method or technique that wants to remain competitive must align with that fact. It is wise to be cautious about any approach that cannot be scaled effectively, either because it limits the data that can be used or because it is inherently constrained in how far it can grow with more resources. This tension between small-scale cleverness and large-scale brute force reappears time and again in deep learning, forming a recurring cycle in which simpler, scalable methods often eclipse more sophisticated but less scalable ones. The best way to break that cycle is to find a method that shifts the intercept down without sacrificing the slope or, in the rarest of triumphs, raises the slope itself to new heights. That remains the ultimate dream in the world of deep learning.