Neural scaling law


In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost. Some models also exhibit performance gains by scaling inference through increased test-time compute, extending neural scaling laws beyond training to the deployment phase.

Introduction

In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate. Each of these variables can be defined as a real number, usually written as $N$ (number of parameters), $D$ (training dataset size), $C$ (training cost), and $L$ (loss or error rate).
A neural scaling law is a theoretical or empirical statistical law between these parameters. There are also other parameters with other scaling laws.

Size of the model

In most cases, the model's size is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-experts models. With sparse models, during inference, only a fraction of their parameters are used. In comparison, most other kinds of neural networks, such as transformer models, always use all their parameters during inference.
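To make the distinction concrete, the following sketch counts total versus active parameters for a hypothetical mixture-of-experts feed-forward layer with top-k routing; the layer sizes and routing settings are illustrative assumptions, not taken from any particular model.

```python
# Hypothetical mixture-of-experts (MoE) feed-forward layer: the router sends each
# token to only top_k of the num_experts expert blocks, so only a fraction of the
# layer's parameters are "active" for any given token during inference.

def moe_param_counts(d_model: int, d_ff: int, num_experts: int, top_k: int):
    params_per_expert = 2 * d_model * d_ff   # up- and down-projection weight matrices
    total = num_experts * params_per_expert  # parameters that must be stored
    active = top_k * params_per_expert       # parameters actually used for a given token
    return total, active

if __name__ == "__main__":
    # Made-up layer sizes, purely for illustration.
    total, active = moe_param_counts(d_model=4096, d_ff=16384, num_experts=64, top_k=2)
    print(f"total parameters : {total:,}")
    print(f"active per token : {active:,} ({active / total:.1%} of total)")
```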

Size of the training dataset

The size of the training dataset is usually quantified by the number of data points within it. Larger training datasets are typically preferred, as they provide a richer and more diverse source of information from which the model can learn. This can lead to improved generalization performance when the model is applied to new, unseen data. However, increasing the size of the training dataset also increases the computational resources and time required for model training.
With the "pretrain, then finetune" method used for most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes have different effects on model performance. Generally, the finetuning dataset is less than 1% of the size of the pretraining dataset.
In some cases, a small amount of high quality data suffices for finetuning, and more data does not necessarily improve performance.

Cost of training

Training cost is typically measured in terms of time and computational resources. It is important to note that the cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware such as GPUs or TPUs.
The cost of training a neural network model is a function of several factors, including model size, training dataset size, the complexity of the training algorithm, and the computational resources available. In particular, doubling the training dataset size does not necessarily double the cost of training, because one may train the model several times over the same dataset (i.e., for multiple epochs), so the cost depends on the total number of tokens seen rather than on the dataset size alone.
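As a rough illustration of how these quantities interact, the sketch below estimates training compute with the common approximation, also quoted later in this article, of about 6 FLOPs per parameter per training token; the model and dataset sizes are arbitrary example values.

```python
# Rough training-compute estimate using the common ~6 FLOPs per parameter per
# token approximation (forward + backward pass). Tokens seen = dataset size x epochs,
# which is why doubling the dataset need not double the cost: one can instead
# halve the number of epochs over it.

FLOPS_PER_PARAM_PER_TOKEN = 6  # common approximation for dense Transformers

def training_flops(n_params: float, dataset_tokens: float, epochs: float = 1.0) -> float:
    tokens_seen = dataset_tokens * epochs
    return FLOPS_PER_PARAM_PER_TOKEN * n_params * tokens_seen

if __name__ == "__main__":
    # Arbitrary example: a 1-billion-parameter model.
    base = training_flops(n_params=1e9, dataset_tokens=100e9, epochs=2)  # 100B tokens, 2 epochs
    alt  = training_flops(n_params=1e9, dataset_tokens=200e9, epochs=1)  # 200B tokens, 1 epoch
    print(f"{base:.2e} FLOPs vs {alt:.2e} FLOPs")  # identical: same number of tokens seen
```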

Performance

The performance of a neural network model is evaluated based on its ability to accurately predict the output given some input data. Common metrics for evaluating model performance include the loss on held-out test data (for language models, the average negative log-likelihood per token) and task-specific measures such as accuracy and precision.
Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set.
When the performance is a number bounded within the range $[0, 1]$, such as accuracy or precision, it often scales as a sigmoid function of the cost.

Examples

(Hestness, Narang, et al., 2017)

The 2017 paper is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous, mostly theoretical, works had found the loss to fall as a power law of the dataset size, $L \propto D^{-\alpha}$, with exponents such as $\alpha = 0.5$ or $\alpha = 1$, the paper found empirically that the exponent is typically smaller and depends on the task.
Of the factors they varied, only the task can change the exponent. Changing the architecture, optimizers, regularizers, and loss functions would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might follow $L = c_1 D^{-\alpha}$ while another follows $L = c_2 D^{-\alpha}$, with different constants but the same exponent. They also found that, for a given architecture, the number of parameters necessary to reach the lowest achievable loss at a given dataset size grows with that dataset size as another power law, $N \propto D^{\beta}$, for another exponent $\beta$.
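A minimal way to estimate such an exponent from experimental points is a least-squares fit in log-log space, assuming a pure power law $L \approx c\, D^{-\alpha}$; the data in the sketch below are synthetic and used only to illustrate the procedure.

```python
import numpy as np

# Estimate a scaling exponent alpha from (dataset size, loss) pairs, assuming a
# pure power law L = c * D**(-alpha); in log-log space this is a straight line,
# so a linear least-squares fit recovers the exponent as (minus) the slope.

rng = np.random.default_rng(0)
D = np.logspace(6, 9, 8)                       # synthetic dataset sizes (tokens)
true_alpha, true_c = 0.30, 50.0                # made-up "ground truth" for the demo
L = true_c * D**(-true_alpha) * np.exp(rng.normal(0, 0.02, D.shape))  # noisy losses

slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
print(f"estimated alpha = {-slope:.3f}, estimated c = {np.exp(intercept):.1f}")
```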
They studied machine translation with LSTM, generative language modelling with LSTM, ImageNet classification with ResNet, and speech recognition with two hybrid architectures.

(Henighan, Kaplan, et al., 2020)

A 2020 analysis studied the statistical relations between $C$, $N$, $D$, and $L$ over a wide range of values and found similar scaling laws across several orders of magnitude of model size and compute, and over multiple modalities (including text, image, and video modelling).
In particular, the scaling laws it found are:
  • For each modality, they fixed one of the two variables $C$ and $N$, and, varying the other one, found that the achievable test loss satisfies $L = L_0 + (x_0 / x)^{\alpha}$, where $x$ is the varied variable and $L_0$, $x_0$, $\alpha$ are parameters to be found by statistical fitting (a fitting sketch is given after this list). The parameter $\alpha$ is the most important one.
  • * When $N$ is the varied variable, the fitted exponent $\alpha$ depends on the model modality. This corresponds to the $\alpha$ from the Chinchilla scaling paper.
  • * When $C$ is the varied variable, the fitted exponent $\alpha$ likewise depends on the model modality. This corresponds to the $\beta$ from the Chinchilla scaling paper.
  • Given a fixed computing budget, the optimal model parameter count grows consistently as a power law of the budget, roughly $N_{\mathrm{opt}}(C) \propto C^{0.7}$. The proportionality constant varies by a factor of up to 10 for different modalities, and the exponent itself also varies somewhat across modalities. This exponent corresponds to the $\approx 0.5$ exponent from the Chinchilla scaling paper.
  • It's "strongly suggested" that the optimal training dataset size likewise grows as a power law of the compute budget, with an exponent below 1. This exponent corresponds to the $\approx 0.5$ exponent for $D_{\mathrm{opt}}(C)$ from the Chinchilla scaling paper.
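The saturating power-law form above can be fitted to measured losses with a standard nonlinear least-squares routine. The sketch below does this with SciPy on synthetic data, so the recovered numbers are illustrative rather than values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit the saturating power law L(x) = L0 + (x0 / x)**alpha, where x is the varied
# quantity (e.g. parameter count N or compute C) and L0 is the irreducible loss.

def scaling_law(x, L0, x0, alpha):
    return L0 + (x0 / x) ** alpha

rng = np.random.default_rng(1)
x = np.logspace(6, 10, 10)                                # synthetic model sizes
L = scaling_law(x, 2.0, 3.0e8, 0.15) + rng.normal(0, 0.01, x.shape)  # made-up measurements

(L0, x0, alpha), _ = curve_fit(scaling_law, x, L, p0=[2.0, 1e8, 0.2], maxfev=10000)
print(f"L0 = {L0:.3f}, x0 = {x0:.3e}, alpha = {alpha:.3f}")
```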
The scaling law $L \propto C^{-0.048}$ (for language modelling) was confirmed during the training of GPT-3.

Chinchilla scaling (Hoffmann, et al., 2022)

One particular scaling law states that, for a large language model autoregressively trained for one epoch, with a cosine learning rate schedule, we have
$$C = C_0 N D, \qquad L = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E,$$
where the variables are
  • $C$ is the cost of training the model, in FLOPs.
  • $N$ is the number of parameters in the model.
  • $D$ is the number of tokens in the training set.
  • $L$ is the average negative log-likelihood loss per token (in nats), achieved by the trained LLM on the test dataset.
  • * $E$ represents the loss of an ideal generative process on the test data
  • * $A / N^{\alpha}$ captures the fact that a Transformer language model with $N$ parameters underperforms the ideal generative process
  • * $B / D^{\beta}$ captures the fact that the model trained on $D$ tokens underperforms the ideal generative process
and the statistical parameters are
  • $C_0 = 6$, meaning that it costs 6 FLOPs per parameter to train on one token. This was estimated by Kaplan et al. Note that training cost is much higher than inference cost, as training entails both forward and backward passes, whereas inference costs 1 to 2 FLOPs per parameter to infer on one token.
  • $\alpha = 0.34$, $\beta = 0.28$, $A = 406.4$, $B = 410.7$, $E = 1.69$.
However, Besiroglu et al. claim that the statistical estimation is slightly off, and that refitting the same data yields somewhat different parameter values.
The statistical laws were fitted over experimental data spanning models from 70 million to over 16 billion parameters, trained on 5 billion to 500 billion tokens.
Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed $C$, we can uniquely solve for the 4 variables that minimize $L$. This provides us with the optimal $N_{\mathrm{opt}}(C)$ and $D_{\mathrm{opt}}(C)$ for any fixed $C$:
$$N_{\mathrm{opt}}(C) = G\left(\frac{C}{6}\right)^{a}, \qquad D_{\mathrm{opt}}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \qquad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\ a = \frac{\beta}{\alpha+\beta},\ b = \frac{\alpha}{\alpha+\beta}.$$
Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:
$$N_{\mathrm{opt}}(C) \approx 0.6\, C^{0.45}, \qquad D_{\mathrm{opt}}(C) \approx 0.3\, C^{0.55}, \qquad L_{\mathrm{opt}}(C) \approx 1070\, C^{-0.154} + 1.7.$$
Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on.
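The closed-form optimum above is straightforward to evaluate numerically. The sketch below does so using the fitted parameter values quoted earlier in this section; the outputs should be read as estimates implied by that particular fit, not exact prescriptions.

```python
# Compute-optimal allocation under the Chinchilla parametric fit
# L(N, D) = E + A / N**alpha + B / D**beta, with C = 6 * N * D.

A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
a = beta / (alpha + beta)      # exponent for N_opt(C)
b = alpha / (alpha + beta)     # exponent for D_opt(C)

def chinchilla_optimal(C: float):
    """Return (N_opt, D_opt, L_opt) for a training budget of C FLOPs."""
    N = G * (C / 6.0) ** a
    D = (C / 6.0) ** b / G
    L = E + A / N**alpha + B / D**beta
    return N, D, L

if __name__ == "__main__":
    for C in (1e21, 1e23, 1e25):   # example budgets
        N, D, L = chinchilla_optimal(C)
        print(f"C = {C:.0e} FLOPs -> N_opt ~ {N:.2e} params, D_opt ~ {D:.2e} tokens, L ~ {L:.3f}")
```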
There are other estimates for the "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of $L(N, D)$. One can also directly fit a statistical law for $N_{\mathrm{opt}}(C)$ and $D_{\mathrm{opt}}(C)$ without going through the detour, for which one obtains approximately $N_{\mathrm{opt}}(C) \propto C^{0.50}$ and $D_{\mathrm{opt}}(C) \propto C^{0.50}$, or as tabulated:
N_opt (parameters) | C (FLOP) | C relative to the FLOPs of training Gopher | D_opt (tokens)
400 million | 1.92e+19 | 1/29968 | 8.0 billion
1 billion | 1.21e+20 | 1/5706 | 20.2 billion
10 billion | 1.23e+22 | 1/2819 | 205.1 billion
67 billion | 5.76e+23 | 1 | 1.5 trillion
175 billion | 3.85e+24 | 6.7 | 3.7 trillion
280 billion | 9.90e+24 | 17.2 | 5.9 trillion
520 billion | 3.43e+25 | 59.5 | 11.0 trillion
1 trillion | 1.27e+26 | 221.3 | 21.2 trillion
10 trillion | 1.30e+28 | 22515.9 | 216.2 trillion
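Because the table's budgets, model sizes, and token counts are tied together by $C = C_0 N D$ with $C_0 = 6$, each row can be cross-checked directly; the short sketch below does this for a few rows, with the values copied from the table above.

```python
# Cross-check a few rows of the table above against the relation C = 6 * N * D
# used throughout this section: given N and C, the implied token count C / (6N)
# should match the tabulated D_opt.

rows = [                      # (N parameters, C in FLOPs, tabulated D in tokens)
    (400e6, 1.92e19, 8.0e9),
    (175e9, 3.85e24, 3.7e12),
    (1e12,  1.27e26, 21.2e12),
]

for n_params, flops, d_tab in rows:
    d_implied = flops / (6 * n_params)
    print(f"N = {n_params:.0e}: implied D = {d_implied:.2e} tokens (table: {d_tab:.2e})")
```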

Discrepancy

The Chinchilla scaling law analysis for training transformer language models suggests that for a given training compute budget, to achieve the minimal pretraining loss for that budget, the number of model parameters $N$ and the number of training tokens $D$ should be scaled in equal proportions, $N_{\mathrm{opt}}(C) \propto C^{0.5}$, $D_{\mathrm{opt}}(C) \propto C^{0.5}$.
This conclusion differs from the analysis conducted by Kaplan et al., which found that $N$ should be increased more quickly than $D$, with $N_{\mathrm{opt}}(C) \propto C^{0.73}$ and $D_{\mathrm{opt}}(C) \propto C^{0.27}$.
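To see how consequential the differing exponents are, the sketch below extrapolates both prescriptions for the optimal model size from a common anchor point; the anchor budget and model size are arbitrary illustrative choices, not values from either paper.

```python
# Illustration of how far apart the two prescriptions drift: Chinchilla suggests
# N_opt grows like C**0.5, Kaplan et al. like C**0.73. We anchor both curves to
# agree at an arbitrary reference budget (purely for illustration) and compare
# the recommended model sizes at larger budgets.

C_REF, N_REF = 1e21, 1e9      # arbitrary anchor: both rules give a 1B-parameter model at 1e21 FLOPs

def n_opt(C: float, exponent: float) -> float:
    return N_REF * (C / C_REF) ** exponent

for C in (1e23, 1e25):
    chinchilla = n_opt(C, 0.50)
    kaplan = n_opt(C, 0.73)
    print(f"C = {C:.0e}: Chinchilla ~ {chinchilla:.1e} params, "
          f"Kaplan ~ {kaplan:.1e} params (ratio {kaplan / chinchilla:.0f}x)")
```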
This discrepancy can primarily be attributed to the two studies using different methods for measuring model size. Kaplan et al.:
  • did not count the parameters in the token embedding layer, which leads to biased coefficients when the analysis is performed at smaller model sizes;
  • studied smaller models than the Chinchilla group, magnifying the effect;
  • assumed that $C \approx 6ND$ with $N$ taken to be this non-embedding parameter count.
Secondary effects also arise due to differences in hyperparameter tuning and learning rate schedules. Kaplan et al.:
  • used a warmup schedule that was too long for smaller models, making them appear less efficient;
  • did not fully tune the optimization hyperparameters.