
Scaling Laws

5 min read

Scaling laws are empirical formulae that relate two or more measurable quantities of a system. In deep learning, “scaling law” most commonly refers to the relationship between the resources we put into a training run (the amount of training data and the model capacity, and hence the compute) and the generalisation error we get on some held-out test data.

In general, more data and more model capacity imply lower generalisation error. This is useful for the following reason: once we’ve trained a model at a few different scales with different amounts of data, we can plot our generalisation error and find the curve of best fit, which is typically a power law of some kind, e.g.:

$$L = A\,n^{-\alpha} + B\,m^{-\beta} + E$$

where $n$ is the data scale (number of training samples), $m$ is the model scale (number of parameters), $E$ is the irreducible error, and the remaining constants ($A$, $B$, $\alpha$, $\beta$) are free to be fitted to the data. Then, if we want to predict what will happen with even more data and compute than we’ve presently invested, all we need to do is plug in the hypothetical values for $m$ and $n$.
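To make this concrete, here is a minimal sketch in Python of what such a fit looks like. Everything in it is invented for illustration: the “measurements” are generated from the law itself plus a little noise, and the fitting is done with scipy’s off-the-shelf curve_fit rather than anything bespoke.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, A, alpha, B, beta, E):
    n, m = X  # n: number of training samples, m: number of parameters
    return A * n**(-alpha) + B * m**(-beta) + E

# Hypothetical measurements from a grid of small-scale training runs
# (generated from the law itself plus noise, purely for illustration).
rng = np.random.default_rng(0)
n_grid, m_grid = np.meshgrid([1e5, 1e6, 1e7], [1e6, 1e7, 1e8])
n, m = n_grid.ravel(), m_grid.ravel()
loss = scaling_law((n, m), 30.0, 0.28, 25.0, 0.32, 1.2)
loss += rng.normal(0.0, 0.01, size=loss.shape)

# Fit A, alpha, B, beta, E to the observed losses.
popt, _ = curve_fit(scaling_law, (n, m), loss, p0=[10.0, 0.3, 10.0, 0.3, 1.0])
print("fitted parameters:", popt)

# Extrapolate: predicted loss for a much larger hypothetical run.
print("predicted loss at n=1e9, m=1e10:", scaling_law((1e9, 1e10), *popt))
```

In practice the losses come from real training runs at several scales; the fit-and-extrapolate step itself stays about this small.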

Power laws like these are used by researchers and AI companies alike to prioritise resources and control costs: knowing how close we are to the irreducible error E tells us whether it’s worthwhile adding more data or more compute. This idea can be incredibly powerful, and it is more likely than not one of the many mathematical tools used behind the scenes of ChatGPT, Gemini, Claude and the like.
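As a toy example of that kind of decision (with coefficients that are entirely made up), we can ask a fitted law how much loss a doubling of data buys versus a doubling of parameters, and how much headroom is left above E:

```python
# Purely hypothetical fitted coefficients, for illustration only.
A, alpha, B, beta, E = 30.0, 0.28, 25.0, 0.32, 1.2
n, m = 1e7, 1e8   # current data and parameter scale

def predicted_loss(n, m):
    return A * n**(-alpha) + B * m**(-beta) + E

current = predicted_loss(n, m)
print(f"current predicted loss:    {current:.3f}")
print(f"gap to irreducible error:  {current - E:.3f}")
print(f"after doubling the data:   {predicted_loss(2 * n, m):.3f}")
print(f"after doubling the model:  {predicted_loss(n, 2 * m):.3f}")
```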

Scaling laws in nature

Scaling laws don’t just appear in the study of artificial neural networks. They emerge in all manner of natural systems: geophysical, biological, sociological. An illustrative example is river networks, where the total length of all streams is proportional to some power of the distance from mouth to source (this is vastly over-simplifying things, but the general idea holds).

The emergence of power laws in nature demonstrates that large and complex systems can often be described by much simpler rules. And with this simplicity comes a great deal of predictive power. It’s perhaps remarkable, then, that simple power laws can extrapolate so well even when the underlying mechanism remains unexplained. The same goes for scaling laws in deep learning.

Scaling laws for general intelligence?

At what point does throwing more data and compute at a problem stop being worthwhile? Present trends in AI indicate that if such a point exists, it hasn’t been reached yet.

But even if the simple mantra “more is better” keeps working for AI, one eventually has to ask “better at what?”. The simple scaling law given at the beginning of this post relates experimental settings to test error or test loss. This typically applies to a situation where the training and the testing data come from the same distribution.

In the real world of AI, things aren’t so neat: chances are, the use cases of the latest viral chatbot sensation lie far outside the training data distribution (by some heuristic measure of “outsideness”). In fact, the “smarter” an AI model is, the better it should do in such situations, since what is intelligence, if not the ability to adapt to situations we’ve not seen before (so-called out-of-distribution generalisation)?

Can such “smart” behaviours be predicted by scaling laws? At the very least, it’s not as simple as when the training and testing environments are similar. So if we’re to make any quantitative prediction about “where” and “when” artificial general intelligence (AGI) should emerge, our theory has to take account of such inhomogeneity.

Meta-learning and synergistic scaling

An interesting route towards improved out-of-distribution generalisation is the concept of meta-learning. Loosely speaking, meta-learning is a sort of “learning of learning”: instead of learning to solve a problem directly, the problem is split inhomogeneously into different tasks, and the new objective is to learn how to learn each individual task. The idea is that if our algorithm sees a completely new task, it will pick it up more readily than if we’d just used a standard training algorithm.
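To make the “learning to learn” structure concrete, here is a toy sketch in the spirit of Reptile, one simple meta-learning algorithm; the tasks, model and numbers are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task is a 1-D regression y = a * x with its own random slope a."""
    a = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=20)
    return x, a * x

def adapt(w, x, y, lr=0.5, steps=10):
    """Inner loop: a few gradient steps on one task, starting from w."""
    for _ in range(steps):
        w -= lr * 2.0 * np.mean((w * x - y) * x)  # gradient of MSE w.r.t. w
    return w

w_meta = 0.0                              # meta-initialisation being learned
for _ in range(1000):                     # outer loop over sampled tasks
    x, y = sample_task()
    w_task = adapt(w_meta, x, y)
    w_meta += 0.1 * (w_task - w_meta)     # Reptile-style update towards the adapted weights

# At test time, a brand-new task is picked up with the same few steps.
x_new, y_new = sample_task()
w_new = adapt(w_meta, x_new, y_new)
print("meta-initialisation:", float(w_meta), "slope after adaptation:", float(w_new))
```

The meta-initialisation never solves any single task outright; its only job is to be a starting point from which a handful of gradient steps can solve whichever task shows up next.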

Another interesting idea is synergy. This is the idea that training on two or more different tasks is better than doubling up on the same task. This is somewhat related to meta-learning, except there isn’t necessarily a fancy algorithm for “learning to learn”. It may be that something that looks like general intelligence will emerge as long as there is sufficient synergy in the training data (and compute scale to match, of course). Scaling laws for predicting synergistic effects are a relatively new topic but I imagine they’ll become more and more relevant as a new “pseudo-science of intelligence” emerges.

Conclusion

This post gives an overview of my understanding and thoughts on scaling laws in deep learning, and where I think they’re likely to take us in the next few years.

Take care!

Jamie
