Lab Meeting 6/10/2013: Hessian-free optimization

James Martens has been publishing a bag of tips and tricks for the large-scale non-convex optimization problems that arise in training deep neural networks and recurrent neural networks (RNNs). With these, they were able to train deep networks without pre-training, beating the state of the art, and to train RNNs far better than back-propagation through time. Basically, the method uses 2nd-order (curvature) information via heuristic modifications of the conjugate gradient (CG) method. The approach is Hessian-free because CG only ever needs the Hessian applied to a single direction (a Hessian-vector product), which is much cheaper than computing the full Hessian (often prohibitive for large-scale problems). The objective function f(\theta) is repeatedly approximated locally by a quadratic q_\theta(p) = f(\theta) + \nabla f(\theta)^\top p + \frac{1}{2} p^\top B p, which is then (approximately) minimized over the step p. Some of the tricks are:

  1. Use linear conjugate gradient, run on the quadratic model above, rather than quasi-Newton methods like L-BFGS or nonlinear conjugate gradient (a minimal sketch of this CG inner loop appears after the list).
  2. Use the Gauss-Newton approximation. For non-convex problems the Hessian can have negative eigenvalues, which leads to erratic behavior of CG, which assumes a positive definite B. Hence, they propose the Gauss-Newton approximation, which discards the second-order derivatives of the network and is guaranteed to be positive semi-definite (positive definite once damping is added). In the following Hessian, the second term is simply dropped (see the Gauss-Newton product sketch after the list).
    H_{ij} = \frac{\partial^2 L(g)}{\partial \theta_i \partial \theta_j} = L''(g) \frac{\partial g}{\partial \theta_i}\frac{\partial g}{\partial \theta_j} + L'(g) \frac{\partial^2 g}{\partial \theta_i \partial \theta_j}
  3. Use the fraction of improvement of the quadratic objective over recent iterations as the termination condition for CG, instead of the usual residual-norm condition (this rule is included in the CG sketch below).
  4. Add regularization (damping) to the Hessian or its approximation, B + \lambda I, and update the damping parameter \lambda, which acts as an implicit trust-region control, via a Levenberg-Marquardt style heuristic (see the damping-update sketch after the list).
  5. Work semi-online: compute the updates on mini-batches of data rather than the full training set.
  6. For training RNNs, use structural damping, which penalizes parameter changes that would cause large changes in the hidden-state sequence, i.e., it keeps the optimizer from moving too far along directions to which the hidden dynamics are highly sensitive (a schematic form is given after the list).
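
Since tips 1 and 3 both live inside the CG inner loop, here is a minimal numpy sketch of that loop: B is touched only through matrix-vector products, and the stopping test is the relative-improvement rule rather than the residual norm. All names here are illustrative (not from the papers), the finite-difference Hessian-vector product stands in for the exact R-operator/Pearlmutter product the papers actually use, and the constants follow my reading of the ICML 2010 paper.

```python
import numpy as np

def hess_vec_product(grad_fn, theta, v, eps=1e-4):
    """Hessian-vector product H v, approximated by central finite differences
    of the gradient.  (The papers compute it exactly with the R-operator /
    Pearlmutter trick; this stand-in only needs a gradient function.)"""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def cg_minimize(matvec, grad, x0, max_iters=250, min_window=10, eps_improve=5e-4):
    """Linear CG on the quadratic model q(p) = grad^T p + 0.5 p^T B p
    (the constant f(theta) is dropped), where B is accessed only through
    matvec(p) = B p.  Terminates on relative improvement of q (tip 3)
    instead of the usual residual-norm test."""
    x = x0.copy()
    r = matvec(x) + grad          # gradient of q at x
    d = -r                        # initial search direction
    q_vals = []
    for i in range(max_iters):
        Bd = matvec(d)
        dBd = d @ Bd
        if dBd <= 0:              # negative curvature along d: stop here
            break
        alpha = (r @ r) / dBd
        x = x + alpha * d
        r_new = r + alpha * Bd
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
        q_vals.append(0.5 * x @ (r + grad))    # q(x), reusing r = B x + grad
        k = max(min_window, i // 10)
        if i >= k and q_vals[-1] < 0 and \
           (q_vals[-1] - q_vals[-1 - k]) / q_vals[-1] < k * eps_improve:
            break                 # progress over the last k steps is negligible
    return x
```

In the outer loop this would be called roughly as cg_minimize(lambda v: hess_vec_product(grad_f, theta, v) + lmbda * v, grad_f(theta), np.zeros_like(theta)), with lmbda the damping parameter of tip 4, and the returned p taken as the update direction.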
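
For tip 2, the Gauss-Newton product B v = J^\top L''(g) J v (with J = \partial g / \partial \theta) is what actually gets handed to CG; it needs one Jacobian-vector and one transposed-Jacobian-vector product, never the matrix itself. A toy sketch for logistic regression, where J = X and the loss curvature is diagonal; the names are mine, not from the papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gn_vector_product(X, theta, v):
    """Gauss-Newton product G v = J^T L''(g) J v for logistic regression with
    cross-entropy loss: g = X @ theta are the logits, J = dg/dtheta = X, and
    L''(g) = diag(sigma(g) * (1 - sigma(g))).  The dropped second term of the
    Hessian, L'(g) * d^2 g / dtheta^2, happens to vanish here because g is
    linear in theta; in a deep network it is exactly the (possibly indefinite)
    part that the Gauss-Newton approximation throws away."""
    g = X @ theta
    s = sigmoid(g)
    Jv = X @ v                          # J v  (directional derivative of the outputs)
    return X.T @ (s * (1.0 - s) * Jv)   # J^T diag(L''(g)) J v, PSD by construction
```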
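
For tip 4, the damped curvature is B + \lambda I (i.e., the matvec given to CG returns B p + \lambda p), and \lambda is adjusted after each outer step from the reduction ratio \rho = (actual decrease of f) / (decrease predicted by q). A sketch, with the 1/4, 3/4, 3/2, 2/3 constants that, to the best of my recollection, match the ICML 2010 paper:

```python
def reduction_ratio(f, theta, p, model_decrease):
    """rho = (f(theta + p) - f(theta)) / (q(p) - q(0)); model_decrease is the
    (negative) change predicted by the quadratic model for the step p."""
    return (f(theta + p) - f(theta)) / model_decrease

def update_damping(lmbda, rho):
    """Levenberg-Marquardt style heuristic: if the quadratic model predicted
    the objective well (rho near 1), relax the damping and allow bolder steps;
    if it over-promised (rho small or negative), tighten it."""
    if rho > 0.75:
        lmbda *= 2.0 / 3.0
    elif rho < 0.25:
        lmbda *= 3.0 / 2.0
    return lmbda
```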
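
For tip 6, the idea is, schematically (the exact scaling of the terms is spelled out in the ICML 2011 paper), to add to the damped quadratic a penalty on how much the RNN's hidden-state sequence changes when the parameters move by p:

    \hat{q}_\theta(p) = q_\theta(p) + \frac{\lambda}{2} \|p\|^2 + \mu\, S(\theta, \theta + p)

where S measures the distance between the hidden-state sequences produced by \theta and \theta + p, so directions to which the hidden dynamics are very sensitive are penalized even when the parameter change itself is small.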

References:

  • James Martens. Deep learning via Hessian-free optimization. ICML 2010.
  • James Martens, Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. ICML 2011.