# Lab Meeting 6/10/2013: Hessian free optimization

James Martens has been publishing bags of tips and tricks for large-scale non-convex optimization that occurs in training deep learning network and recurrent neural network (RNN). They were able to train deep learning network without pre-training and better than the state-of-the-art, and also for RNN, much better than back-propagation through time. Basically, it’s the use of 2nd order (curvature information) via heuristic modifications of the conjugate gradient (CG) method. CG is Hessian-free since one only needs to evaluate Hessian in the direction of a single direction which is much cheaper than computing the full Hessian (often it is prohibitive for large-scale problems). The objective function $f(\theta)$ is repeatedly locally approximated as a quadratic function $q_\theta(p) = f(\theta) + \nabla f(\theta)^\top p + \frac{1}{2} p^\top B p$, and minimized. Some of the tricks are:

2. Use Gauss-Newton approximation. For non-convex problems, the Hessian can have negative eigenvalues which can lead to erratic behavior of the CG step which assumes positive definite $B$. Hence, they propose using the Gauss-Newton approximation which discards the second-order derivatives, and is guaranteed to be positive definite. In the following Hessian, the second term is simply ignored. $H_{ij} = \frac{\partial^2 L(g)}{\partial \theta_i \partial \theta_j} = L''(g) \frac{\partial g}{\partial \theta_i}\frac{\partial g}{\partial \theta_j} + L'(g) \frac{\partial^2 g}{\partial \theta_i \partial \theta_j}$