
Noise correlation in V1: 1D-dynamics explains differences between anesthetized and awake

We discussed state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons; it is called noise correlation because the signal (stimulus-driven) component has been subtracted out. In a 2010 Science paper [2], noise correlations were reported to be much smaller than previously thought, around 0.01 compared to the usual 0.1-0.2 range, which stirred up the field (see [3] for a list of values). In this paper, the authors argue that the discrepancy is due to population-level covariations during anesthesia (they used sufentanil).
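
To make the quantity concrete, here is a minimal numpy sketch (not from the paper) of how pairwise spike-count noise correlation is typically computed: subtract the mean response for each stimulus, then correlate the residuals across trials. The array shapes and Poisson toy data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy spike counts: trials x 2 neurons x stimuli (placeholder data).
counts = rng.poisson(5.0, size=(200, 2, 6)).astype(float)

# Remove the signal (stimulus-driven) component: subtract each
# (neuron, stimulus) mean across trials.
resid = counts - counts.mean(axis=0, keepdims=True)

# z-score within each stimulus, then pool residuals over stimuli;
# the mean product is the pair's noise correlation (~0 here by construction).
z = resid / resid.std(axis=0, keepdims=True)
r_noise = np.mean(z[:, 0, :] * z[:, 1, :])
print(r_noise)
```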

Lab Meeting 6/10/2013: Hessian-free optimization

James Martens has been publishing a bag of tips and tricks for the large-scale non-convex optimization problems that arise in training deep networks and recurrent neural networks (RNNs). With these, deep networks could be trained without pre-training to better than state-of-the-art performance, and RNNs could be trained much better than with back-propagation through time. The basic idea is to use 2nd-order (curvature) information via heuristic modifications of the conjugate gradient (CG) method. CG is "Hessian-free" since one only needs Hessian-vector products along single directions, which is much cheaper than computing the full Hessian (often prohibitive for large-scale problems). The objective function f(\theta) is repeatedly approximated locally by a quadratic q_\theta(p) = f(\theta) + \nabla f(\theta)^\top p + \frac{1}{2} p^\top B p, which is then minimized (see the sketch after the list). Some of the tricks are:

  1. Use linear conjugate gradient on the quadratic approximation, rather than quasi-Newton methods like L-BFGS or nonlinear conjugate gradient.
  2. Use the Gauss-Newton approximation. For non-convex problems, the Hessian can have negative eigenvalues, which can lead to erratic behavior of CG (which assumes a positive definite B). Hence, they propose using the Gauss-Newton approximation, which discards the second-order derivatives of the model and is guaranteed to be positive semidefinite (for a convex loss). In the following Hessian, the second term is simply ignored.
    H_{ij} = \frac{\partial^2 L(g)}{\partial \theta_i \partial \theta_j} = L''(g) \frac{\partial g}{\partial \theta_i}\frac{\partial g}{\partial \theta_j} + L'(g) \frac{\partial^2 g}{\partial \theta_i \partial \theta_j}
  3. Use fraction of improvement as termination condition for CG (instead of the regular residual norm condition).
  4. Add regularization (damping) to the Hessian (or its approximation), and update the corresponding trust-region parameter via a Levenberg-Marquardt style heuristic.
  5. Do semi-online, mini-batch updates.
  6. For training RNNs, use structural damping, which penalizes large changes to parameters to which the hidden-state dynamics are highly sensitive.
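
To make tricks 1-4 concrete, here is a minimal numpy sketch of a damped Gauss-Newton step computed by matrix-free CG. This is my own toy version, not Martens' implementation; the Jacobian, curvatures, gradient, and damping value are placeholders.

```python
import numpy as np

def gauss_newton_vec(J, Lpp, v):
    # Gauss-Newton-vector product G v = J^T diag(L''(g)) J v; dropping the
    # L'(g) d^2g/dtheta^2 term keeps G positive semidefinite.
    return J.T @ (Lpp * (J @ v))

def cg_minimize(Bv, grad, max_iters=50, tol=1e-4):
    # Linear CG on q(p) = grad^T p + 0.5 p^T B p, touching B only through
    # matrix-vector products Bv(.) -- the "Hessian-free" part.
    p = np.zeros_like(grad)
    r = -grad.copy()                          # residual of B p = -grad
    d = r.copy()
    q_old = 0.0
    for _ in range(max_iters):
        Bd = Bv(d)
        alpha = (r @ r) / (d @ Bd)
        p = p + alpha * d
        r_new = r - alpha * Bd
        q = 0.5 * (p @ Bv(p)) + grad @ p
        # Trick 3: terminate on small relative improvement of q,
        # not on the usual residual-norm condition.
        if q_old != 0.0 and (q_old - q) / max(abs(q), 1e-12) < tol:
            break
        q_old = q
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

rng = np.random.default_rng(0)
J = rng.standard_normal((100, 10))   # toy Jacobian dg/dtheta at current theta
Lpp = np.ones(100)                   # L''(g); all ones for squared loss
grad = rng.standard_normal(10)       # toy gradient of f at current theta
lam = 1.0                            # damping strength (adapted LM-style)
step = cg_minimize(lambda v: gauss_newton_vec(J, Lpp, v) + lam * v, grad)
```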

References:

  • James Martens. Deep learning via Hessian-free optimization. ICML 2010
  • James Martens, Ilya Sutskever. Learning Recurrent Neural Networks with Hessian-Free Optimization. ICML 2011

Lab Meeting 3/25/2013: Demixing PCA (dPCA)

If you had lots of spike trains over 4 seconds for 800 neurons, 6 stimulus conditions, and 2 behavioral choices, how would you visualize your data? Unsupervised dimensionality reduction techniques such as principal component analysis (PCA) find orthonormal basis vectors that capture the most variance in the data, but the resulting components are not necessarily interpretable. What one wants to be able to say is something like:

“Along this direction, the population dynamics seems to encode stimulus, and along this other orthogonal dimension, neurons are modulated by the motor behavior…”
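
For contrast, here is a minimal sketch of what plain PCA does with data of this shape: all task variables are flattened together into "samples", so the leading components generally mix stimulus, choice, and time rather than demixing them. The array dimensions follow the example above; the random data is a placeholder.

```python
import numpy as np

# Toy trial-averaged rates: 800 neurons x 6 stimuli x 2 choices x 40 time bins.
N, S, C, T = 800, 6, 2, 40
X = np.random.default_rng(0).standard_normal((N, S, C, T))

# Plain PCA: flatten conditions and time, center each neuron, and take the
# leading left singular vectors as population-level directions.
Xf = X.reshape(N, -1)
Xf = Xf - Xf.mean(axis=1, keepdims=True)
U, svals, _ = np.linalg.svd(Xf, full_matrices=False)
components = U[:, :10]                         # top 10 directions
var_explained = svals[:10] ** 2 / (svals ** 2).sum()
```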


Lab Meeting 2/4/2013: Asymptotically optimal tuning curve in the Lp sense for a Poisson neuron

An optimal tuning curve is the best transformation of the stimulus into a neural firing pattern (usually a firing rate) under given constraints and an optimality criterion. The following paper, which I saw at NIPS 2012, is related to what we are doing, so we took a deeper look at it.

Wang, Stocker & Lee (NIPS 2012), Optimal neural tuning curves for arbitrary stimulus distributions: Discrimax, infomax and minimum Lp loss.

The paper assumes a single neuron encoding a one-dimensional stimulus governed by a distribution \pi(s). The neuron is assumed to be Poisson (pure rate code). As constraints, the neuron’s tuning curve h(s) is smooth and monotonically increasing (with h'(s) > c), and has bounded minimum and maximum firing rates. The authors assume the asymptotic regime for MLE decoding, where the observation time T is long enough for the asymptotic normality theory (and convergence of p-th moments) of the MLE to apply.

The authors show that there is a 1-to-1 mapping between the tuning curve and the Fisher information under these constraints. (Recall that for a Poisson neuron observed for time T, the Fisher information is I(s) = T h'(s)^2 / h(s) = 4T \left( \frac{\mathrm{d}}{\mathrm{d}s} \sqrt{h(s)} \right)^2, which is why everything is naturally expressed in terms of \sqrt{h}.) Then, for various loss functions, they derive the optimal tuning curve using calculus of variations. In general, to minimize the Lp loss E\left[ |\hat s - s|^p \right] under the constraints, the optimal tuning curve, expressed through its square root, is:

\sqrt{h(s)} = \sqrt{h_{min}} + (\sqrt{h_{max}} - \sqrt{h_{min}}) \frac{\int_{-\infty}^s \pi(t)^{1/(p+1)} \mathrm{d}t}{\int_{-\infty}^\infty \pi(t)^{1/(p+1)} \mathrm{d}t}

Furthermore, in the limit p \to 0, the optimal solution corresponds to the infomax solution (i.e., the optimum under a mutual information criterion). However, all the analysis holds only in the asymptotic limit, where the Cramér-Rao bound is attained by the MLE. For mutual information, unlike the noiseless case where the optimal tuning curve is the stimulus CDF (Laughlin), under Poisson noise it turns out to be the square of the stimulus CDF. I have plotted the differences below for a normal distribution (left) and a mixture of normals (right):

[Figure: comparison of optimal tuning curves for a normal distribution (left) and a mixture of normals (right)]
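
The closed-form solution is easy to evaluate numerically; here is a sketch that computes curves like those in the figure (the h_{min}, h_{max} values and stimulus grid are arbitrary placeholder choices):

```python
import numpy as np
from scipy.stats import norm

def optimal_tuning_curve(s, pdf, p, h_min=1.0, h_max=100.0):
    # sqrt(h) interpolates between sqrt(h_min) and sqrt(h_max) along the
    # normalized cumulative integral of pi(s)^(1/(p+1)).
    w = pdf(s) ** (1.0 / (p + 1.0))
    F = np.cumsum(w)
    F = F / F[-1]
    sqrt_h = np.sqrt(h_min) + (np.sqrt(h_max) - np.sqrt(h_min)) * F
    return sqrt_h ** 2

s = np.linspace(-5.0, 5.0, 2001)
h_l2 = optimal_tuning_curve(s, norm.pdf, p=2)        # L2 loss ("discrimax")
h_mi = optimal_tuning_curve(s, norm.pdf, p=1e-9)     # p -> 0: infomax limit
```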

The results are very nice, and I’d like to see more results with stimulus noise and with population tuning assumptions.

BCM rule and information maximization

As the first paper in our summer reading series on information-theoretic learning and synaptic plasticity of spiking neurons, we discussed one of the earliest papers:

Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, Wulfram Gerstner. Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission. Proceedings of the National Academy of Sciences of the United States of America, Vol. 102, No. 14. (05 April 2005), pp. 5239-5244, doi:10.1073/pnas.0500495102

The BCM rule is a rate-based synaptic plasticity rule that stabilizes the naive Hebbian learning rule \Delta w \propto x y, where x and y are the firing rates of the pre-synaptic and post-synaptic neurons, respectively. The Hebbian rule suffers from positive feedback: a high firing rate makes the synapse stronger, which in turn makes the firing rate higher. The BCM rule fixes this with a sliding threshold controlled by the output (post-synaptic) firing rate – a higher firing rate raises the threshold, which widens the range of depression (a toy simulation is sketched below).
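
Here is a minimal rate-based sketch of the classic BCM dynamics. This is a toy implementation; the input statistics, learning rates, and the choice of a threshold tracking \langle y^2 \rangle are standard textbook conventions, not taken from the paper discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 10
eta, tau = 1e-3, 1e-2                   # learning rate, threshold time scale
w = rng.uniform(0.0, 0.1, n_in)         # synaptic weights
theta = 1.0                             # sliding modification threshold

for _ in range(20000):
    x = rng.exponential(1.0, n_in)      # pre-synaptic firing rates
    y = max(w @ x, 0.0)                 # post-synaptic rate (rectified linear)
    # BCM: depression when y < theta, potentiation when y > theta.
    w += eta * x * y * (y - theta)
    w = np.clip(w, 0.0, None)
    # Sliding threshold tracks a running average of y^2, curbing the
    # positive feedback of plain Hebbian learning.
    theta += tau * (y ** 2 - theta)
```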

This paper by Toyoizumi et al. derives a learning rule from the first principle of maximizing the mutual information between the input spike trains and the output spike train. This alone would favor high firing rates, so they include a penalty for deviating from a target firing rate distribution. The cost function is:

L = I(\mathbf{X};\mathbf{Y}) - \gamma D_{KL}(P(Y)||\tilde P(Y))

where I denotes mutual information, D_{KL} is the Kullback-Leibler divergence, \gamma sets the trade-off between the two terms, and \tilde P(Y) is the target spike rate distribution. The paper uses an escape-rate approximation of an integrate-and-fire neuron, with an extra spike-history-dependent rate modulation term, to define the conditional probability P(Y|X). The synaptic plasticity rule is then obtained by taking the derivative \frac{\partial L}{\partial w_i}; replacing the expectation with a time average and applying a small-step-size approximation, they arrive at an online per-synapse learning rule that reduces to BCM in the special case of an LNP neuron (no refractoriness). Next meeting (this Friday), we will dig deeper into the details of the derivation.

Using size-biased sampling for certain expectations

Let \{\pi_i\}_i be a well-defined infinite discrete probability distribution (e.g., a draw from a Dirichlet process (DP)). We are interested in evaluating expectations of the form E\left[ \sum_i f(\pi_i) \right] for some function f; we are especially interested in f(\pi) = -\pi \log \pi, which gives Shannon’s entropy. Following [1], we can re-write it as

E\left[ \sum_i \frac{f(\pi_i)}{\pi_i} \pi_i \right] = E\left[ E[ \frac{f(X)}{X} | \{\pi_i\}]\right]

where X is a random variable that takes the value \pi_i with probability \pi_i. This random variable is better known as the first size-biased sample \tilde\pi_1, defined by \Pr[ \tilde\pi_1 = \pi_i | \{\pi_i\}_i] = \pi_i; in other words, it picks one of the probabilities \pi_i from \{\pi_i\}_i with probability equal to its own value.

For the Pitman-Yor process (PY) with discount parameter d and concentration parameter \alpha (the Dirichlet process is the special case d = 0), size-biased samples are naturally obtained from the stick-breaking construction. Given a sequence of independent random variables V_n \sim \mathrm{Beta}(1-d, \alpha+n d), if we define \pi_i = V_i \prod_{k=1}^{i-1} (1 - V_k), then \{\pi_i\}_i is invariant under size-biased permutation [2], and the \pi_i form a sequence of size-biased samples. In our case, we only need the first size-biased sample, which is simply distributed as V_1.

Using this trick, we can compute the expected entropy under the PY without complicated integrals over the infinite simplex. We used this trick and its extensions in computing the PY-based entropy estimator (a numerical sanity check is sketched below).
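
Spelling the trick out: with f(\pi) = -\pi \log \pi, the expected entropy becomes E[-\log \tilde\pi_1] with \tilde\pi_1 \sim \mathrm{Beta}(1-d, \alpha+d), which evaluates to \psi(\alpha+1) - \psi(1-d), where \psi is the digamma function. Here is a small Monte Carlo check of that identity via truncated stick breaking; the truncation level and parameter values are arbitrary choices.

```python
import numpy as np
from scipy.special import digamma

def py_entropy_mc(alpha, d, n_sticks=2000, n_draws=2000, seed=0):
    # Sample PY weights by truncated stick breaking, average their entropy.
    rng = np.random.default_rng(seed)
    ks = np.arange(1, n_sticks + 1)
    V = rng.beta(1.0 - d, alpha + ks * d, size=(n_draws, n_sticks))
    sticks = np.cumprod(1.0 - V, axis=1) / (1.0 - V)   # prod_{k<i} (1 - V_k)
    pi = sticks * V
    H = -np.sum(pi * np.log(np.maximum(pi, 1e-300)), axis=1)
    return H.mean()

alpha, d = 3.0, 0.3
print(py_entropy_mc(alpha, d))                    # Monte Carlo estimate
print(digamma(alpha + 1.0) - digamma(1.0 - d))    # closed form via the trick
```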

  1. Jim Pitman, Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, Vol. 25, No. 2. (April 1997), pp. 855-900, doi:10.1214/aop/1024404422
  2. Mihael Perman, Jim Pitman, Marc Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, Vol. 92, No. 1. (21 March 1992), pp. 21-39, doi:10.1007/BF01205234