Learning rates are inelegant. As far as I know, the best advice for picking a good one is “start with 0.01, see how it performs, and tweak accordingly.” What if 0.01 is a terrible learning rate, off by orders of magnitude? In practice we mitigate this by normalising inputs and initialising weights sensibly, the intuition being that this keeps gradients within a few orders of magnitude of 1.

But still, isn’t there a *best* learning rate — the one which, when used, minimises the objective most effectively? Isn’t there a best learning rate for each step? Shouldn’t we use that learning rate, and how much better would it be if we did? For one, it would mean one fewer hyperparameter.

## A simple algorithm

Here is my proposal for a learning rate schedule: For each step of gradient descent, if increasing the current learning rate by 1% would be better, then do so. Otherwise decrease the learning rate by 1%. The desired behaviour is that regardless of the initial value, the learning rate will random walk its way to the *best* value.
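A minimal sketch of this schedule on a toy quadratic (the function and variable names here are mine, not from any particular library):

```python
def adaptive_lr_gd(grad_fn, loss_fn, x0, lr0, n_steps):
    """Gradient descent where the learning rate random-walks by ±1% per step,
    keeping the 1% increase only when it would lower the loss."""
    x, lr = x0, lr0
    for _ in range(n_steps):
        g = grad_fn(x)
        # Would a 1% larger learning rate give a lower loss on this step?
        if loss_fn(x - 1.01 * lr * g) < loss_fn(x - lr * g):
            lr *= 1.01
        else:
            lr *= 0.99
        x = x - lr * g
    return x, lr

# Toy example: minimise f(x) = x^2 from a deliberately bad initial lr.
x, lr = adaptive_lr_gd(grad_fn=lambda x: 2 * x,
                       loss_fn=lambda x: x ** 2,
                       x0=5.0, lr0=1e-4, n_steps=1000)
```

Even starting four orders of magnitude too low, the learning rate climbs towards a good value and the iterate converges.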

Sadly this doesn’t eliminate a hyperparameter as hoped: we must still choose an initial learning rate, lr0. However, as we will see, the algorithm is more robust to this choice than vanilla gradient descent is to its fixed learning rate.

## Example 1

I implemented this optimiser for a single-hidden-layer MLP on MNIST. One small addition: I smoothed the gradients, to avoid wild updates at gradient spikes. Code can be found here.
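The smoothing isn’t specified above; an exponential moving average is one plausible way to do it (the class name and the value of `beta` are my own, hypothetical choices):

```python
class GradientSmoother:
    """Exponential moving average of gradients, to damp sudden spikes."""

    def __init__(self, beta=0.9):
        self.beta = beta  # hypothetical smoothing constant
        self.avg = None

    def update(self, grad):
        # First call: initialise the average to the raw gradient.
        if self.avg is None:
            self.avg = grad
        else:
            self.avg = self.beta * self.avg + (1 - self.beta) * grad
        return self.avg
```

A spike of 100 after a run of gradients near 1 is damped to roughly 10.9 rather than passed through directly.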

Hyperparameter | Value |
---|---|
n_epochs | 10 |
hidden_layer | 20 |
batch_size | 1000 |
n_run | 10 |

lr | accuracy_mean | accuracy_std | loss_mean | loss_std |
---|---|---|---|---|
0.000010 | 21.7% | 4.6% | 2.42 | 0.09 |
0.000022 | 23.3% | 7.3% | 2.16 | 0.11 |
0.000046 | 38.8% | 10.2% | 1.79 | 0.18 |
0.000100 | 51.2% | 6.8% | 1.42 | 0.14 |
0.000215 | 63.5% | 6.9% | 1.08 | 0.10 |
0.000464 | 58.0% | 12.4% | 1.15 | 0.26 |
0.001000 | 41.4% | 14.8% | 1.55 | 0.35 |
0.002154 | 19.2% | 8.0% | 2.07 | 0.18 |
0.004642 | 18.9% | 12.9% | 2.12 | 0.30 |
0.010000 | 11.2% | 0.0% | 2.30 | 0.00 |

lr0 | accuracy_mean | accuracy_std | loss_mean | loss_std |
---|---|---|---|---|
0.000010 | 73.2% | 6.1% | 0.85 | 0.15 |
0.000022 | 81.5% | 5.2% | 0.64 | 0.11 |
0.000046 | 80.0% | 5.0% | 0.67 | 0.12 |
0.000100 | 87.4% | 2.3% | 0.47 | 0.08 |
0.000215 | 87.5% | 2.2% | 0.45 | 0.06 |
0.000464 | 82.0% | 9.3% | 0.58 | 0.21 |
0.001000 | 47.8% | 17.7% | 1.38 | 0.38 |
0.002154 | 30.5% | 19.5% | 1.85 | 0.44 |
0.004642 | 12.3% | 3.3% | 2.27 | 0.10 |
0.010000 | 11.2% | 0.0% | 2.31 | 0.03 |

Figure 1: Log loss.

Figure 2: Log learning rate.

We can see that in the best-performing case the learning rate varies by less than an order of magnitude, which makes the size of the performance increase surprising.

After rising to a peak, the learning rate tends to zero. This is what we would want: as the parameters approach a minimum, the *best* learning rate tends to zero. Does this schedule provably share that property?

## Example 2

Suppose computing the loss is as computationally expensive as computing the gradient; then this algorithm doubles the computation. We can easily reduce this overhead by updating the learning rate less frequently. I re-ran Example 1, updating the learning rate every ten steps. We still improve on vanilla SGD, but the outperformance is not so stark. However, if we also increase the learning rate’s learning rate, lr_lr (in this case stepping the learning rate by 10% instead of 1%), then we recover most of the outperformance we had before.
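This variant, testing the learning rate only every k-th step and stepping it by 10%, can be sketched as follows (the function name and signature are mine):

```python
def step(x, lr, grad_fn, loss_fn, t, k=10, lr_lr=0.10):
    """One SGD step; the learning rate is re-tested only every k-th step,
    stepping by ±lr_lr (here 10%) instead of ±1%."""
    g = grad_fn(x)
    if t % k == 0:  # amortise the extra loss evaluations over k steps
        up = (1 + lr_lr) * lr
        if loss_fn(x - up * g) < loss_fn(x - lr * g):
            lr = up
        else:
            lr = (1 - lr_lr) * lr
    return x - lr * g, lr

# Toy run on f(x) = x^2, again from a deliberately low initial lr.
x, lr = 5.0, 1e-3
for t in range(500):
    x, lr = step(x, lr, grad_fn=lambda x: 2 * x,
                 loss_fn=lambda x: x ** 2, t=t)
```

The extra loss evaluation now happens on only one step in ten, while the larger multiplier lets the learning rate cover the same ground per epoch.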

lr0 | accuracy_mean | accuracy_std | loss_mean | loss_std |
---|---|---|---|---|
0.000010 | 22.1% | 7.4% | 2.31 | 0.09 |
0.000022 | 32.5% | 6.6% | 1.94 | 0.15 |
0.000046 | 42.2% | 6.4% | 1.70 | 0.18 |
0.000100 | 50.9% | 8.5% | 1.39 | 0.18 |
0.000215 | 68.4% | 7.8% | 0.98 | 0.18 |
0.000464 | 66.5% | 14.6% | 0.94 | 0.28 |
0.001000 | 50.4% | 20.3% | 1.35 | 0.48 |
0.002154 | 30.7% | 14.9% | 1.81 | 0.37 |
0.004642 | 12.7% | 3.2% | 2.26 | 0.09 |
0.010000 | 14.6% | 5.0% | 2.22 | 0.12 |

lr0 | accuracy_mean | accuracy_std | loss_mean | loss_std |
---|---|---|---|---|
0.000010 | 69.5% | 8.1% | 0.92 | 0.18 |
0.000022 | 79.0% | 7.3% | 0.70 | 0.16 |
0.000046 | 82.8% | 5.6% | 0.60 | 0.11 |
0.000100 | 84.5% | 3.3% | 0.53 | 0.07 |
0.000215 | 84.6% | 7.8% | 0.53 | 0.19 |
0.000464 | 77.9% | 7.5% | 0.69 | 0.20 |
0.001000 | 51.5% | 17.6% | 1.31 | 0.43 |
0.002154 | 25.1% | 12.4% | 1.95 | 0.30 |
0.004642 | 14.1% | 4.6% | 2.22 | 0.13 |
0.010000 | 14.7% | 8.6% | 2.21 | 0.23 |

## Conclusions

We could see if the performance increase extends to other test beds. However, given how easy this is to implement, I’m already excited to use this going forward.

Optimisers that tune their hyperparameters online can be framed as an RL problem, with the reward at each step being the negative of the loss. We have presented a simple solution to this specific problem, but we could try other algorithms from the literature. It would then be natural to extend this to multiple hyperparameters; momentum, for example, would be a clear next step.

I went to a seminar where someone was proposing almost exactly this! You can check out their results here: https://arxiv.org/abs/1911.02319
