
Noise correlation in V1: 1D-dynamics explains differences between anesthetized and awake

We discussed the state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons; it is called noise correlation because the signal (stimulus-driven) component has been subtracted out. In a 2010 Science paper [2], noise correlations were shown to be much smaller than previously reported (around 0.01, compared to the usual 0.1–0.2 range), which stirred up the field (see [3] for a list of values). In this paper, the authors argue that this difference in noise correlation magnitude is due to population-level covariations during anesthesia (they used sufentanil).
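
Computing noise correlation is straightforward once the stimulus-driven component is removed; below is a minimal sketch using the common recipe of z-scoring spike counts within each stimulus condition before correlating (an illustration, not necessarily the exact pipeline of [1]):

```python
import numpy as np

def noise_correlation(counts_a, counts_b):
    """Spike-count noise correlation between two simultaneously recorded
    neurons. counts_a, counts_b: (n_conditions, n_trials) arrays.
    Z-scoring within each stimulus condition subtracts out the signal
    (stimulus-driven) component before correlating across all trials."""
    za = (counts_a - counts_a.mean(1, keepdims=True)) / counts_a.std(1, keepdims=True)
    zb = (counts_b - counts_b.mean(1, keepdims=True)) / counts_b.std(1, keepdims=True)
    return np.corrcoef(za.ravel(), zb.ravel())[0, 1]
```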


Lab Meeting 6/10/2013: Hessian free optimization

James Martens has been publishing bags of tips and tricks for the large-scale non-convex optimization problems that arise in training deep neural networks and recurrent neural networks (RNNs). With these, he was able to train deep networks without pre-training to better than state-of-the-art performance, and to train RNNs far more effectively than back-propagation through time. The core idea is to use 2nd-order (curvature) information via heuristic modifications of the conjugate gradient (CG) method. CG is “Hessian-free” because it only needs the Hessian-vector product along a single direction, which is much cheaper than computing the full Hessian (often prohibitive for large-scale problems). The objective function f(\theta) is repeatedly approximated locally by a quadratic q_\theta(p) = f(\theta) + \nabla f(\theta)^\top p + \frac{1}{2} p^\top B p, which is then minimized. Some of the tricks (see the sketch after this list) are:

  1. Use linear conjugate gradient on the quadratic model rather than quasi-Newton methods like L-BFGS or nonlinear conjugate gradient.
  2. Use the Gauss-Newton approximation. For non-convex problems, the Hessian can have negative eigenvalues, which can lead to erratic behavior of CG (which assumes a positive definite B). Hence, they propose using the Gauss-Newton approximation, which discards the second-order derivatives and is guaranteed to be positive semidefinite (for a convex loss L). In the following Hessian, the second term is simply ignored.
    H_{ij} = \frac{\partial^2 L(g)}{\partial \theta_i \partial \theta_j} = L''(g) \frac{\partial g}{\partial \theta_i}\frac{\partial g}{\partial \theta_j} + L'(g) \frac{\partial^2 g}{\partial \theta_i \partial \theta_j}
  3. Use the relative improvement of the quadratic objective as the termination condition for CG (instead of the usual residual-norm condition).
  4. Add regularization (damping) to the Hessian (or its approximation), and update the damping/trust-region parameter via a Levenberg-Marquardt style heuristic.
  5. Do semi-online, mini-batch updates.
  6. For training RNNs, use structural damping, which penalizes large changes in parameters to which the network dynamics are highly sensitive.
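
To make tricks 1 and 3 concrete, here is a stripped-down sketch of the inner CG loop on the quadratic model q_\theta(p). It assumes a user-supplied function computing the Hessian-vector (or Gauss-Newton-vector) product; Martens computes these products exactly with automatic differentiation, and the damping, line search, preconditioning, and mini-batch machinery are all omitted here. The termination test is a simplified per-iteration version of trick 3.

```python
import numpy as np

def hessian_free_cg(grad, Bv, max_iter=250, rel_tol=5e-4):
    """Approximately minimize q(p) = grad^T p + 0.5 p^T B p by linear CG.
    grad : gradient of f at theta (1D array).
    Bv   : function v -> B @ v (curvature-vector product); the matrix B
           itself is never formed, hence "Hessian-free".
    Terminates on the relative improvement of q (trick 3) rather than
    on the residual norm."""
    p = np.zeros_like(grad)
    r = grad.copy()               # residual r = B p + grad (p = 0 initially)
    d = -r                        # first search direction
    q_prev = 0.0                  # q(0) = 0
    for i in range(max_iter):
        Bd = Bv(d)
        alpha = (r @ r) / (d @ Bd)        # exact minimizer along d
        p += alpha * d
        r_new = r + alpha * Bd
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
        q = 0.5 * p @ (r + grad)          # q(p), using r = B p + grad
        if i > 0 and q_prev - q < rel_tol * abs(q):
            break                          # progress has flattened out
        q_prev = q
    return p
```

The outer loop then updates \theta \leftarrow \theta + p, with B damped and the damping parameter adapted per trick 4.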

References:

  • James Martens. Deep learning via Hessian-free optimization. ICML 2010
  • James Martens, Ilya Sutskever. Learning Recurrent Neural Networks with Hessian-Free Optimization. ICML 2011

Lab Meeting 3/25/2013: Demixing PCA (dPCA)

If you had lots of spike trains over 4 seconds for 800 neurons, 6 stimulus conditions, and 2 behavioral choices, how would you visualize your data? Unsupervised dimensionality reduction techniques, such as principal component analysis (PCA), find orthonormal basis vectors that capture the most variance of the data, but the results are not necessarily interpretable. What one wants to be able to say is something like:

“Along this direction, the population dynamics seems to encode stimulus, and along this other orthogonal dimension, neurons are modulated by the motor behavior…”
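
The core demixing idea can be caricatured as follows; this is only a sketch under my own assumed data layout, not the actual dPCA algorithm of the paper. One builds marginalizations of the trial-averaged data in which all task factors but one are averaged out, then looks for axes whose variance comes from a single marginalization.

```python
import numpy as np

# Hypothetical layout: trial-averaged rates with shape
# (neurons, time, stimulus, decision), as in the 800-neuron example.
def marginalization(X, keep):
    """Average X over all task axes not in `keep`, subtract the grand
    mean, and flatten to (neurons, everything_else)."""
    drop = tuple(ax for ax in range(1, X.ndim) if ax not in keep)
    grand = X.mean(axis=tuple(range(1, X.ndim)), keepdims=True)
    M = X.mean(axis=drop, keepdims=True) - grand
    return M.reshape(X.shape[0], -1)

X = np.random.randn(800, 40, 6, 2)        # fake data, for shapes only
M_stim = marginalization(X, keep=(1, 2))  # stimulus-related variance
M_dec  = marginalization(X, keep=(1, 3))  # decision-related variance
# PCA on each marginalization gives candidate "stimulus" and "decision"
# directions; dPCA proper finds such axes jointly under one objective.
```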


Lab Meeting 2/4/2013: Asymptotically optimal tuning curve in Lp sense for a Poisson neuron

An optimal tuning curve is the best transformation of the stimulus into a neural firing pattern (usually a firing rate) under given constraints and an optimality criterion. The following paper, which I saw at NIPS 2012, is related to what we are doing, so we took a deeper look at it.

Wang, Stocker & Lee (NIPS 2012), Optimal neural tuning curves for arbitrary stimulus distributions: Discrimax, infomax and minimum Lp loss.

The paper assumes a single neuron encoding a one-dimensional stimulus governed by a distribution \pi(s). The neuron is assumed to be Poisson (a pure rate code). As constraints, the neuron’s tuning curve h(s) is smooth, monotonically increasing (with h'(s) > c), and has limited minimum and maximum firing rates. The authors assume an asymptotic regime for MLE decoding, where the observation time T is long enough for the asymptotic normality (and convergence of p-th moments) of the MLE to apply.

The authors show that there is a 1-to-1 mapping between the tuning curve and the Fisher information I under these constraints. Then, for various loss functions, they derive the optimal tuning curve using the calculus of variations. In general, to minimize the Lp loss E\left[ |\hat s - s|^p \right] under the constraints, the optimal tuning curve satisfies:

\sqrt{h(s)} = \sqrt{h_{min}} + \left(\sqrt{h_{max}} - \sqrt{h_{min}}\right) \frac{\int_{-\infty}^s \pi(t)^{1/(p+1)} \mathrm{d}t}{\int_{-\infty}^\infty \pi(t)^{1/(p+1)} \mathrm{d}t}

Furthermore, in the limit p \to 0, the optimal solution corresponds to the infomax solution (i.e., the optimum for the mutual information loss). However, all of the analysis holds only in the asymptotic limit, where the Cramér-Rao bound is attained by the MLE. For mutual information, unlike the noiseless case where the optimal tuning curve is the stimulus CDF (Laughlin), under Poisson noise the optimal tuning curve turns out to be the square of the stimulus CDF. I have plotted the differences below for a normal distribution (left) and a mixture of normals (right):

[Figure: Comparison of optimal tuning curves for a normal distribution (left) and a mixture of normals (right).]
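
For a concrete feel, here is a small numerical sketch of the formula above (the grid, h_{min}, and h_{max} are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def optimal_tuning_curve(s, pdf, p, h_min=0.5, h_max=100.0):
    """Evaluate the Lp-optimal tuning curve on a uniform grid s by
    discretizing the integral of pi(t)^(1/(p+1)) with a cumulative sum."""
    w = pdf(s) ** (1.0 / (p + 1.0))
    F = np.cumsum(w)
    F /= F[-1]                            # normalized "generalized CDF"
    return (np.sqrt(h_min) + (np.sqrt(h_max) - np.sqrt(h_min)) * F) ** 2

s = np.linspace(-5, 5, 1001)
h_discrimax = optimal_tuning_curve(s, norm.pdf, p=2)     # L2 (discrimax)
h_infomax   = optimal_tuning_curve(s, norm.pdf, p=1e-9)  # p -> 0 (infomax)
# As p -> 0, F tends to the stimulus CDF, so sqrt(h) is affine in the CDF,
# recovering the square-of-CDF solution (up to the h_min/h_max scaling).
```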

The results are very nice, and I’d like to see more results with stimulus noise and with population tuning assumptions.

BCM rule and information maximization

As the first paper in our summer reading series on information theoretic learning and synaptic plasticity of spiking neurons, we discussed one of the earliest papers:

Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, Wulfram Gerstner. Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission. Proceedings of the National Academy of Sciences of the United States of America, Vol. 102, No. 14. (05 April 2005), pp. 5239-5244, doi:10.1073/pnas.0500495102

The BCM rule is a rate-based synaptic plasticity rule that is a stable version of the naive Hebbian learning rule \Delta w \propto x y, where x and y are the rates of the pre-synaptic and post-synaptic neurons, respectively. The Hebbian rule suffers from positive feedback: a high firing rate makes the synapse stronger, which in turn makes the firing rate higher. The BCM rule fixes this with a sliding threshold controlled by the output (post-synaptic) firing rate: a higher firing rate pushes the threshold higher, which widens the range of depression.
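
For reference, here is a toy rate-based BCM sketch of my own (the common textbook form \Delta w \propto x y (y - \theta) with the threshold tracking \langle y^2 \rangle; an illustration, not the spiking rule derived in the paper):

```python
import numpy as np

def bcm_step(w, x, theta, lr=1e-4, tau_theta=100.0, dt=1.0):
    """One update of a toy rate-based BCM rule.
    w : synaptic weights, x : pre-synaptic rates, theta : sliding threshold.
    Depression for y < theta, potentiation for y > theta; theta tracks
    <y^2>, so sustained high output raises it and widens the depression
    range, stabilizing the Hebbian positive feedback."""
    y = max(float(w @ x), 0.0)                     # post-synaptic rate
    w = w + lr * x * y * (y - theta)               # BCM weight change
    theta = theta + (dt / tau_theta) * (y**2 - theta)
    return w, theta, y
```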

This paper by Toyoizumi et al. derives a learning rule from the first principle of maximizing the mutual information between the input spike trains and the output spike train. This alone would favor high firing rates, so they include a penalty for them. The cost function is:

L = I(\mathbf{X};\mathbf{Y}) - \gamma D_{KL}(P(Y)||\tilde P(Y))

where I denotes mutual information, D_{KL} is the Kullback-Leibler divergence, \gamma is the trade-off between the two terms, and \tilde P(Y) is the target spike rate distribution. The paper uses an escape-rate approximation of an integrate-and-fire neuron, with an extra spike-history-dependent rate modulation term, to define the conditional probability P(Y|X). The synaptic plasticity rule is then obtained by taking the derivative \frac{\partial L}{\partial w_i}; replacing the expectation with a time average and applying a small-step-size approximation, they arrive at an online learning rule per synapse that resembles BCM in the special case of an LNP neuron (no refractoriness). Next meeting (this Friday), we will dig deeper into the details of the derivation.

Using size-biased sampling for certain expectations

Let \{\pi_i\}_i be a well defined infinite discrete probability distribution (e.g., a draw from a Dirichlet process (DP)). We are interested in evaluating expectations of the form E\left[ \sum_i f(\pi_i) \right] for some function f (we are especially interested in f(x) = -x \log x, which gives Shannon’s entropy). Following [1], we can rewrite it as

E\left[ \sum_i \frac{f(\pi_i)}{\pi_i} \pi_i \right] = E\left[ E[ \frac{f(X)}{X} | \{\pi_i\}]\right]

where X is a random variable that takes the value \pi_i with probability \pi_i. This random variable X is better known as the first size-biased sample \tilde{\pi}_1, defined by \Pr[ \tilde \pi_1 = \pi_i | \{\pi_i\}_i] = \pi_i. In other words, it picks one of the probabilities \pi_i from \{\pi_i\}_i with probability \pi_i.

For the Pitman-Yor process (PY) with discount parameter d and concentration parameter \alpha (the Dirichlet process is the special case d = 0), size-biased samples are naturally obtained from the stick-breaking construction. Given a sequence of independent random variables V_n \sim Beta(1-d, \alpha+n d), if we define \pi_i = V_i \prod_{k=1}^{i-1} (1 - V_k), then \{\pi_i\}_i is invariant under size-biased permutation [2], and the \pi_i form a sequence of size-biased samples. In our case, we only need the first size-biased sample, which is simply distributed as V_1.

Using this trick, we can compute the entropy of PY without the complicated simplex integrals. We used this and its extension for computing the PY based entropy estimator.
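
Here is a small numerical illustration. For entropy, f(x) = -x \log x so f(X)/X = -\log X, and since the first size-biased sample of PY is V_1 \sim Beta(1-d, \alpha+d), the expected entropy reduces to \psi(\alpha+1) - \psi(1-d) in closed form (using E[-\log V] = \psi(a+b) - \psi(a) for V \sim Beta(a,b)):

```python
import numpy as np
from scipy.special import digamma

def py_expected_entropy(alpha, d):
    """E[-sum_i pi_i log pi_i] under PY(d, alpha): the first size-biased
    sample is V1 ~ Beta(1-d, alpha+d), so the expected entropy is
    E[-log V1] = psi(alpha+1) - psi(1-d)."""
    return digamma(alpha + 1.0) - digamma(1.0 - d)

# Monte Carlo check of the size-biased identity E[sum f(pi_i)] = E[f(V1)/V1].
rng = np.random.default_rng(0)
alpha, d = 2.0, 0.3
v1 = rng.beta(1.0 - d, alpha + d, size=200_000)
print(py_expected_entropy(alpha, d))   # closed form
print(np.mean(-np.log(v1)))            # E[-log V1] estimate
```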

  1. Jim Pitman, Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, Vol. 25, No. 2. (April 1997), pp. 855-900, doi:10.1214/aop/1024404422
  2. Mihael Perman, Jim Pitman, Marc Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, Vol. 92, No. 1. (21 March 1992), pp. 21-39, doi:10.1007/BF01205234

NP Bayes Reading Group: 3rd meeting

Continuing from last week (Generative model diagram for stick breaking), we established that the Chinese restaurant process (CRP) is exchangeable, and that the underlying random measure given by the de Finetti theorem is the Dirichlet process (DP); that is, the CRP is the marginal distribution of the DP. The simple form of the CRP’s conditional distributions makes sampling easy, as in the sketch below. We then discussed the form of the posterior of the DP, whose expectation (marginal) coincides with the CRP. Next week, Kenneth will lead the discussion on applying the theory we have learned so far to practical clustering algorithms.
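
As a reminder of how simple that conditional sampling is, here is a minimal CRP sampler (customer i joins table k with probability n_k/(i+\alpha) and opens a new table with probability \alpha/(i+\alpha)):

```python
import numpy as np

def sample_crp(n, alpha, seed=0):
    """Sample table assignments for n customers from CRP(alpha) using
    the conditional (predictive) distributions directly."""
    rng = np.random.default_rng(seed)
    counts = []                                   # customers per table
    assignments = []
    for i in range(n):
        probs = np.append(counts, alpha) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                      # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments
```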

NP Bayes Reading Group: 2nd meeting

Continuing from last week, we discussed the formulation of generative clustering (a mixture model) with a fixed number of clusters K, using the Dirichlet distribution as a prior for the cluster size distribution, following Jordan’s slides. The definition of the Dirichlet process (DP) and its existence via the Kolmogorov extension theorem were briefly shown. Following (Sethuraman, 1994), we discussed the stick-breaking construction of the DP. Stick breaking provides the size-biased permutation of the Poisson-Dirichlet distribution obtained by the Kingman limit (Kingman, 1975). The following fun facts about the (extended) Dirichlet distribution are from (Sethuraman, 1994).

Fun Fact 1 Let e_j be the n-dimensional vector consisting of 0’s, except for a 1 at the j-th index. Then

Dir(e_j) = e_j

Fun Fact 2 Let

U \sim Dir(\alpha_1, \ldots, \alpha_n), \quad V \sim Dir(\gamma_1, \ldots, \gamma_n), \quad W \sim Beta\left(\sum_i \alpha_i, \sum_j \gamma_j\right).

Then,

W U + (1-W) V \sim Dir(\alpha_1 + \gamma_1, \ldots, \alpha_n + \gamma_n).
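
Fun Fact 2 is easy to check numerically; a quick Monte Carlo comparison of the first two moments (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([2.0, 1.0, 3.0])
g = np.array([1.0, 4.0, 2.0])
n = 500_000

U = rng.dirichlet(a, n)
V = rng.dirichlet(g, n)
W = rng.beta(a.sum(), g.sum(), n)[:, None]
mix = W * U + (1 - W) * V                 # should be Dir(a + g)

ref = rng.dirichlet(a + g, n)
print(mix.mean(0), ref.mean(0))           # first moments agree
print(mix.var(0), ref.var(0))             # so do the variances
```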

Fun Fact 3 Let {\sum_i \gamma_i = 1}. Then,

\sum_j \gamma_j \, Dir\left((\alpha \gamma_1, \ldots, \alpha \gamma_n) + e_j\right) = Dir(\alpha \gamma_1, \ldots, \alpha \gamma_n).

Next week, we will continue the discussion of the DP as a prior for nonparametric Bayesian clustering, the posterior of the DP, and how to do inference with the DP. (Jordan slide #45)

Possible further exploration:

  • Sampling from Poisson-Dirichlet distribution (Donnelly-Tavaré-Griffiths sampling?)
  • Proof of Lemma 3.2 from Sethuraman 1994

The distribution of the number of tables in a CRP has mean \simeq \alpha \log(n) (a simple proof can be found in Teh’s DP notes), and the induced partition follows the Ewens distribution.
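
The \alpha \log(n) mean is easy to verify from the exact form: customer i+1 opens a new table with probability \alpha/(\alpha+i), so E[K_n] = \sum_{i=0}^{n-1} \alpha/(\alpha+i).

```python
import numpy as np

# Exact mean number of tables vs. the alpha*log(n) approximation.
alpha, n = 2.0, 100_000
exact = np.sum(alpha / (alpha + np.arange(n)))
print(exact, alpha * np.log(n))   # close (relatively) for large n
```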

Lab meeting 4/18/2011

Today we discussed Nemenman et al.’s Neural Coding of Natural Stimuli: Information at Sub-Millisecond Resolution, PLoS Comp. Biol. 2008.

Given slowly changing naturalistic stimuli with correlations on the time scale of 55 ms, is there information in the spike trains of the fly H1 neuron at the sub-millisecond time scale? To quantify this, the mutual information was estimated for different word lengths and bin sizes. The main figure (4D) suggests there is information at the smaller time scales, and this is demonstrated directly by choosing a few spike patterns that correspond to the same representation at the coarser time scale and showing the stimuli (velocity) conditioned on those patterns (figure 5).

The mutual information rate is estimated using the NSB entropy estimator, with extra extrapolation/fitting steps to obtain (1) the asymptotic entropy rate (infinite data), (2) the large-word-size limit, and (3) the removal of empirical fluctuations of the estimate due to structure in the stimulus or response. These are more or less empirical approaches to getting better estimates. The mutual information is estimated by taking the difference between the marginal entropy and the conditional entropy, \mathcal{I}(R;S) = H(R) - H(R|S). One quantity that was not extrapolated was the limit of the bin size going to zero.
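
To make the word-based estimation concrete, here is a plug-in sketch of the discretization and entropy steps (the NSB estimator and all of the extrapolations are replaced by the naive plug-in estimate, purely for illustration):

```python
import numpy as np
from collections import Counter

def words_from_spikes(spike_times, t_max, dt, L):
    """Binarize spike times into bins of width dt and slice the binary
    sequence into overlapping words of L bins (word duration T = L*dt)."""
    edges = np.arange(0.0, t_max + dt, dt)
    binned = np.histogram(spike_times, bins=edges)[0] > 0
    return [tuple(binned[i:i + L]) for i in range(len(binned) - L + 1)]

def plugin_entropy_bits(words):
    """Naive plug-in entropy (bits) of the empirical word distribution."""
    counts = np.array(list(Counter(words).values()), dtype=float)
    prob = counts / counts.sum()
    return float(-(prob * np.log2(prob)).sum())

# I(R;S) = H(R) - H(R|S): total entropy from words over the whole
# recording; noise entropy from the word distribution across repeated
# trials at each time slice, averaged over slices.
```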

The NSB estimator is a Bayesian estimator that uses an approximately flat prior on the entropy itself. They show that, in the 1D case, a uniform prior on the probability simplex results in a poor entropy estimator. It would be interesting to see the actual prior distribution over probabilities induced by a flat prior on entropy.

One question about the overall methodology is the estimation of the noise entropy using just 5 seconds of data repeated 100 times. How diverse is the stimulus? How robust is the noise entropy estimated this way?