James Martens has been publishing bags of tips and tricks for large-scale non-convex optimization that occurs in training deep learning network and recurrent neural network (RNN). They were able to train deep learning network without pre-training and better than the state-of-the-art, and also for RNN, much better than back-propagation through time. Basically, it’s the use of 2nd order (curvature information) via heuristic modifications of the conjugate gradient (CG) method. CG is Hessian-free since one only needs to evaluate Hessian in the direction of a single direction which is much cheaper than computing the full Hessian (often it is prohibitive for large-scale problems). The objective function is repeatedly locally approximated as a quadratic function , and minimized. Some of the tricks are:
- Use conjugate gradient instead of other quasi-Newton methods like L-BFGS, or nonlinear conjugate gradient.
- Use Gauss-Newton approximation. For non-convex problems, the Hessian can have negative eigenvalues which can lead to erratic behavior of the CG step which assumes positive definite . Hence, they propose using the Gauss-Newton approximation which discards the second-order derivatives, and is guaranteed to be positive definite. In the following Hessian, the second term is simply ignored.
- Use fraction of improvement as termination condition for CG (instead of the regular residual norm condition).
- Add regularization (dampening) on the Hessian (or its approximation), and update its trust-region parameter via Levenberg-Marquardt style heuristic.
- Do semi-online, mini-batch updates.
- For training RNNs, use structural dampening which limits changing parameters too much that are highly sensitive.
James Martens. Deep learning via Hessian-free optimization. ICML 2010
James Martens, Ilya Sutskever. Learning Recurrent Neural Networks with Hessian-Free Optimization. ICML 2011