Best Learning Rates

Learning rates are inelegant. As far as I know, the best advice for picking a good one is “start with 0.01, see how it performs, and tweak accordingly.” But what if 0.01 is a terrible learning rate, off by orders of magnitude? In practice we mitigate this by normalising inputs and initialising weights sensibly, the intuition being that this keeps gradients within a few orders of magnitude of 1.

But still, isn’t there a best learning rate: the one which, at each step, decreases the objective the most? Shouldn’t we use that learning rate, and how much better would it be if we did? For one thing, it would mean one fewer hyperparameter.

A simple algorithm

Here is my proposal for a learning rate schedule: at each step of gradient descent, if increasing the current learning rate by 1% would be better, do so; otherwise, decrease it by 1%. The desired behaviour is that, regardless of its initial value, the learning rate will random-walk its way to the best value.

Sadly this doesn’t eliminate a hyperparameter as hoped: we must still choose an initial learning rate, \alpha_0. However, as we will see, the algorithm is more robust to this choice than vanilla gradient descent is to its learning rate.
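To make the rule concrete, here is a minimal sketch in numpy. The post doesn’t spell out what “better” means, so this assumes it means that a trial step taken with the raised rate lowers the loss; that costs one extra loss evaluation per step. All names here are my own, not the author’s.

```python
import numpy as np

def adaptive_lr_sgd(theta, lr, loss_fn, grad_fn, n_steps, factor=1.01):
    """Gradient descent where the learning rate random-walks by +/-1%.

    Assumed reading of the rule: raise the rate if a trial step taken
    with the raised rate lowers the loss, otherwise lower the rate.
    """
    for _ in range(n_steps):
        g = grad_fn(theta)
        loss_here = loss_fn(theta)          # usually free alongside the gradient
        trial = theta - factor * lr * g     # trial step with the raised rate
        if loss_fn(trial) < loss_here:      # raising the rate was better
            theta, lr = trial, factor * lr
        else:                               # otherwise back the rate off
            lr = lr / factor
            theta = theta - lr * g
    return theta, lr

# Toy check on a quadratic bowl: even a terrible initial rate walks upward.
loss_fn = lambda th: 0.5 * np.sum(th ** 2)
grad_fn = lambda th: th
theta, lr = adaptive_lr_sgd(np.ones(10), 1e-6, loss_fn, grad_fn, n_steps=500)
print(lr, loss_fn(theta))
```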

Example 1

I implemented this optimiser for a single-hidden-layer MLP on MNIST. One small addition was to smooth the gradients, to avoid wild updates at gradient spikes. Code can be found here.
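For what it’s worth, the gradient smoothing could be as simple as an exponential moving average of the gradients; the smoothing constant below is an assumption, as the post doesn’t state the value used.

```python
import numpy as np

class GradientSmoother:
    """Exponential moving average of gradients, damping sudden spikes.

    beta = 0.9 is assumed; the post does not give its constant.
    """
    def __init__(self, beta=0.9):
        self.beta = beta
        self.avg = None

    def __call__(self, g):
        if self.avg is None:
            self.avg = g
        else:
            self.avg = self.beta * self.avg + (1 - self.beta) * g
        return self.avg

smooth = GradientSmoother()
g = smooth(np.random.randn(20))  # use in place of the raw gradient
```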

Table 1: Hyperparameters

| hyperparameter | value |
|---|---|
| n_epochs | 10 |
| hidden_layer | 20 |
| batch_size | 1000 |
| n_run | 10 |
Table 2: Vanilla gradient descent, with gradient smoothing

| lr | accuracy (mean) | accuracy (std) | loss (mean) | loss (std) |
|---|---|---|---|---|
| 0.000010 | 21.7% | 4.6% | 2.42 | 0.09 |
| 0.000022 | 23.3% | 7.3% | 2.16 | 0.11 |
| 0.000046 | 38.8% | 10.2% | 1.79 | 0.18 |
| 0.000100 | 51.2% | 6.8% | 1.42 | 0.14 |
| 0.000215 | 63.5% | 6.9% | 1.08 | 0.10 |
| 0.000464 | 58.0% | 12.4% | 1.15 | 0.26 |
| 0.001000 | 41.4% | 14.8% | 1.55 | 0.35 |
| 0.002154 | 19.2% | 8.0% | 2.07 | 0.18 |
| 0.004642 | 18.9% | 12.9% | 2.12 | 0.30 |
| 0.010000 | 11.2% | 0.0% | 2.30 | 0.00 |
Table 3: Proposed new algorithm, with gradient smoothing

| lr0 | accuracy (mean) | accuracy (std) | loss (mean) | loss (std) |
|---|---|---|---|---|
| 0.000010 | 73.2% | 6.1% | 0.85 | 0.15 |
| 0.000022 | 81.5% | 5.2% | 0.64 | 0.11 |
| 0.000046 | 80.0% | 5.0% | 0.67 | 0.12 |
| 0.000100 | 87.4% | 2.3% | 0.47 | 0.08 |
| 0.000215 | 87.5% | 2.2% | 0.45 | 0.06 |
| 0.000464 | 82.0% | 9.3% | 0.58 | 0.21 |
| 0.001000 | 47.8% | 17.7% | 1.38 | 0.38 |
| 0.002154 | 30.5% | 19.5% | 1.85 | 0.44 |
| 0.004642 | 12.3% | 3.3% | 2.27 | 0.10 |
| 0.010000 | 11.2% | 0.0% | 2.31 | 0.03 |

[loss-lr0.png]

Figure 1: Log loss for \alpha_0 = 2.15 \times 10^{-3}

[lr-lr0.png]

Figure 2: Log learning rate for \alpha_0 = 2.15 \times 10^{-3}

We can see that in the best-performing case the learning rate doesn’t vary by more than a single order of magnitude, and for this reason I find the size of the performance increase surprising.

After rising to a peak, the learning rate tends to zero. This is encouraging: as the parameters approach a minimum, the best learning rate also tends to zero. Does this schedule provably share that property?

Example 2

If computing the loss is as computationally expensive as computing the gradient, then this algorithm doubles the computation per step. We can easily reduce this overhead by updating the learning rate less frequently. I re-ran Example 1, updating the learning rate only every ten steps. We still improve on vanilla SGD, but the outperformance is not so stark. However, if we increase the learning rate’s learning rate, lrlr (in this case stepping the learning rate by 10% instead of 1%), we recover most of the outperformance we had before.
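In code, the amortised variant is a small change to the sketch above: probe the learning rate only every `period` steps, moving it by the larger factor lrlr when we do. Again, this is a reconstruction under the same assumptions, not the author’s exact implementation.

```python
def adaptive_lr_sgd_amortised(theta, lr, loss_fn, grad_fn, n_steps,
                              period=10, lrlr=1.10):
    """As above, but the rate is only probed every `period` steps,
    and by the larger factor lrlr (110% rather than 101%)."""
    for step in range(n_steps):
        g = grad_fn(theta)
        if step % period == 0:                   # probe the rate this step
            trial = theta - lrlr * lr * g
            if loss_fn(trial) < loss_fn(theta):  # raising the rate was better
                theta, lr = trial, lrlr * lr
                continue
            lr = lr / lrlr                       # otherwise back it off
        theta = theta - lr * g
    return theta, lr
```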

Table 4: Updating the learning rate every ten steps, lrlr = 101%

| lr0 | accuracy (mean) | accuracy (std) | loss (mean) | loss (std) |
|---|---|---|---|---|
| 0.000010 | 22.1% | 7.4% | 2.31 | 0.09 |
| 0.000022 | 32.5% | 6.6% | 1.94 | 0.15 |
| 0.000046 | 42.2% | 6.4% | 1.70 | 0.18 |
| 0.000100 | 50.9% | 8.5% | 1.39 | 0.18 |
| 0.000215 | 68.4% | 7.8% | 0.98 | 0.18 |
| 0.000464 | 66.5% | 14.6% | 0.94 | 0.28 |
| 0.001000 | 50.4% | 20.3% | 1.35 | 0.48 |
| 0.002154 | 30.7% | 14.9% | 1.81 | 0.37 |
| 0.004642 | 12.7% | 3.2% | 2.26 | 0.09 |
| 0.010000 | 14.6% | 5.0% | 2.22 | 0.12 |
Table 5: Updating the learning rate every ten steps, lrlr = 110%

| lr0 | accuracy (mean) | accuracy (std) | loss (mean) | loss (std) |
|---|---|---|---|---|
| 0.000010 | 69.5% | 8.1% | 0.92 | 0.18 |
| 0.000022 | 79.0% | 7.3% | 0.70 | 0.16 |
| 0.000046 | 82.8% | 5.6% | 0.60 | 0.11 |
| 0.000100 | 84.5% | 3.3% | 0.53 | 0.07 |
| 0.000215 | 84.6% | 7.8% | 0.53 | 0.19 |
| 0.000464 | 77.9% | 7.5% | 0.69 | 0.20 |
| 0.001000 | 51.5% | 17.6% | 1.31 | 0.43 |
| 0.002154 | 25.1% | 12.4% | 1.95 | 0.30 |
| 0.004642 | 14.1% | 4.6% | 2.22 | 0.13 |
| 0.010000 | 14.7% | 8.6% | 2.21 | 0.23 |

Conclusions

It remains to be seen whether the performance increase extends to other test beds. But given how easy this is to implement, I’m already excited to use it going forward.

Online tuning of optimiser hyperparameters can be framed as a reinforcement learning problem, with the reward at each step being the negative of the loss. We have presented a simple solution to this specific problem, but we could try other algorithms from the literature. It would then be natural to extend the approach to multiple hyperparameters; momentum, for example, would be a clear next step.
