We discussed state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons (it’s called noise correlation because the signal (stimulus) drive component has been subtracted out). In a 2010 science paper [2], noise correlation was shown to be much smaller than previously reported; in the range of 0.01 compared to the usual 0.10.2 range and stirred up the field (see [3] for a list of values). In this paper, they argue that this difference in noise correlation magnitude is due to population level covariations during anesthesia (they used sufentanil).
Author Archives: memming
Lab Meeting 6/10/2013: Hessian free optimization
James Martens has been publishing bags of tips and tricks for largescale nonconvex optimization that occurs in training deep learning network and recurrent neural network (RNN). They were able to train deep learning network without pretraining and better than the stateoftheart, and also for RNN, much better than backpropagation through time. Basically, it’s the use of 2nd order (curvature information) via heuristic modifications of the conjugate gradient (CG) method. CG is Hessianfree since one only needs to evaluate Hessian in the direction of a single direction which is much cheaper than computing the full Hessian (often it is prohibitive for largescale problems). The objective function is repeatedly locally approximated as a quadratic function , and minimized. Some of the tricks are:
 Use conjugate gradient instead of other quasiNewton methods like LBFGS, or nonlinear conjugate gradient.
 Use GaussNewton approximation. For nonconvex problems, the Hessian can have negative eigenvalues which can lead to erratic behavior of the CG step which assumes positive definite . Hence, they propose using the GaussNewton approximation which discards the secondorder derivatives, and is guaranteed to be positive definite. In the following Hessian, the second term is simply ignored.
 Use fraction of improvement as termination condition for CG (instead of the regular residual norm condition).
 Add regularization (dampening) on the Hessian (or its approximation), and update its trustregion parameter via LevenbergMarquardt style heuristic.
 Do semionline, minibatch updates.
 For training RNNs, use structural dampening which limits changing parameters too much that are highly sensitive.
References:

James Martens. Deep learning via Hessianfree optimization. ICML 2010

James Martens, Ilya Sutskever. Learning Recurrent Neural Networks with HessianFree Optimization. ICML 2011
Lab Meeting 3/25/2013: Demixing PCA (dPCA)
If you had lots of spike trains over 4 seconds for 800 neurons, 6 stimulus conditions, and 2 behavioral choices, how would you visualize your data? Unsupervised dimensionality reduction techniques, such as principal component analysis (PCA) finds orthonormal basis vectors that captures the most variance of the data, but the results are not necessarily interpretable. What one wants is to say is something like:
“Along this direction, the population dynamics seems to encode stimulus, and along this other orthogonal dimension, neurons are modulated by the motor behavior…”
Lab Meeting 2/4/2013: Asymptotically optimal tuning curve in Lp sense for a Poisson neuron
Optimal tuning curve is the best transformation of the stimulus into neural firing pattern (usually firing rate) under certain constraints and optimality criterion. The following paper I saw at NIPS 2012 was related to what we are doing, so we took a deeper look into it.
Wang, Stocker & Lee (NIPS 2012), Optimal neural tuning curves for arbitrary stimulus distributions: Discrimax, infomax and minimum Lp loss.
The paper assumes a single neuron encoding a 1 dimensional stimulus, governed by a distribution . The neuron is assumed to be Poisson (pure rate code). The neuron’s tuning curve is smooth, monotonically increasing (with ), and has a limited minimum and maximum firing rate as its constraint. Authors assume asymptotic regime for MLE decoding where the observation time is long enough to apply asymptotic normality theory (and convergence of pth moments) of MLE.
The authors show that there is a 1to1 mapping between the tuning curve and the Fisher information under these constraints. Then for various loss functions, they derive the optimal tuning curve using calculus of variations. In general, to minimize the Lp loss under the constraints, the optimal (squared) tuning curve is:
Furthermore, in the limit of , the optimal solution corresponds to the infomax solution (i.e., optimum for mutual information loss). However, all the analysis is only in the asymptotic limit, where the CramerRao bound is attained by the MLE. For the case of mutual information, unlike noiseless case where the optimal tuning curve becomes the stimulus CDF (Laughlin), for Poisson noise, it turns out to be the square of the stimulus CDF. I have plotted the differences below for a normal distribution (left) and a mixture of normals (right):
The results are very nice, and I’d like to see more results with stimulus noise and with population tuning assumptions.
Spawning a realistic model of the brain?
I (Memming) presented Eliasmith et al. “A LargeScale Model of the Functioning Brain” Science 2012 for our computational neuroscience journal club. The authors combined their past efforts for building various modules for solving cognitive tasks to build a largescale spiking neuron model called SPAUN.
BCM rule and information maximization
As the first paper in our summer reading series on information theoretic learning and synaptic plasticity of spiking neurons, we discussed one of the earliest papers:
Taro Toyoizumi, JeanPascal Pfister, Kazuyuki Aihara, Wulfram Gerstner. Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission. Proceedings of the National Academy of Sciences of the United States of America, Vol. 102, No. 14. (05 April 2005), pp. 52395244, doi:10.1073/pnas.0500495102
BCM rule is a ratebased synaptic plasticity rule which is a stable version of naive Hebbian learning rule where and are rate of presynaptic neuron and postsynaptic neuron respectively. Hebbian learning rule has positive feedback; high firing rate makes the synapse stronger which in turn makes firing rate higher. BCM rule fixes this with a sliding threshold which is controlled by the output (postsynaptic) firing rate – higher firing rate makes the threshold get higher, which increases the range of depression.
This paper by Toyoizumi et al derives a learning rule from first principle of maximizing mutual information between input spike trains and the output spike train. This alone will prefer high firing rate, so they include a penalty for high firing rates. The cost function is:
where denotes mutual information, is the KullbackLeibler divergence, is the tradeoff between the two terms, and is the target spike rate distribution. The paper utilizes an escape rate approximation of IF neuron with extra spikehistory dependent rate modulation term to define the conditional probability . Then, the synaptic plasticity rule is obtained by taking the derivative , then applying time average instead of expectation, and small step size approximation, they arrive at an online learning rule per synapse that resembles BCM in the special case of LNP neuron (no refractory). Next meeting (this Friday), we will dig deeper into the details of the derivation.
Using sizebiased sampling for certain expectations
Let be a well defined infinite discrete probability distribution (e.g., a draw from Dirichlet process (DP)). We are interested in evaluating the following form of expectations: for some function (we are especially interested when , which gives us Shannon’s entropy). Following [1], we can rewrite it as
where is a random variable that takes the value with probability . This random variable is better known as the first sizebiased sample . It is defined by . In other words, it takes one of the probabilities among with probability .
For PitmanYor process (PY) with discount parameter and concentration parameter (Dirichlet process is a special case where ), the size biased samples are naturally obtained by the stick breaking construction. Given a sequence of independent random variables distributed as , if we define , then the set of is invariant to size biased permutation [2], and they form a sequence of sizebiased samples. In our case, we only need the first size biased sample which is simply distributed as .
Using this trick, we can compute the entropy of PY without the complicated simplex integrals. We used this and its extension for computing the PY based entropy estimator.
 Jim Pitman, Marc Yor. The twoparameter PoissonDirichlet distribution derived from a stable subordinator. The Annals of Probability, Vol. 25, No. 2. (April 1997), pp. 855900, doi:10.1214/aop/1024404422
 Mihael Perman, Jim Pitman, Marc Yor. Sizebiased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, Vol. 92, No. 1. (21 March 1992), pp. 2139, doi:10.1007/BF01205234
NP Bayes Reading Group: 3rd meeting
Continuing from last week (Generative model diagram for stick breaking), we established that the Chinese restaurant process (CRP) is exchangeable, and the underlying process from de Finetti theorem is Dirichlet process (DP), that is, CRP is the marginal distribution of DP. The simple form of conditional distribution of CRP provides easy way of sampling. Then, we discussed the form of posterior of DP, whose expectation (marginal) coincides with the CRP. Next week, Kenneth will lead the discussion on applying the theory we learned so far to practical clustering algorithms.
NP Bayes Reading Group: 2nd meeting
Continuing from last week, we discussed the formulation of generative clustering (mixture model) with fixed number of clusters K using Dirichlet distribution as a prior for cluster size distribution following Jordan’s slides. The definition of Dirichlet process (DP) and its existence was briefly shown via Kolmogorov extension theorem. Following (Sethuraman, 1994), we discussed the stick breaking construction of DP. Stick breaking provides the samplebiased permutation of PoissonDirichlet distribution obtained by Kingman limit (Kingman, 1975). The following fun facts about (extended) Dirichlet distribution are from (Sethuraman, 1994).
Fun Fact 1 Let be a ndimensional vector consisting of 0’s, except having 1 at jth index.
Then,
Next week, we will continue on the discussion of DP as a prior for nonparameteric Bayesian clustering, posterior of DP and how to do inference with DP. (Jordan slide #45)
Possible further exploration:
 Sampling from PoissonDirichlet distribution (DonnellyTavaréGriffiths sampling?)
 Proof of Lemma 3.2 from Sethuraman 1994
The number of tables distribution of a CRP has mean (simple proof can be found in Teh’s DP notes), and follows Ewens distribution.
Lab meeting 4/18/2011
Today we discussed Nemenman et. al.’s Neural Coding of Natural Stimuli: Information at SubMillisecond Resolution PLOS Comp. Bio. 2008.
Given a slowly changing naturalistic stimuli with correlation in the time scale of 55 ms, is there information in the spike trains from fly H1 neuron in the time submillisecond scale. To quantify this, mutual information was estimated with different word length, and bin sizes. The main figure 4D suggests there is information in the smaller time scale, and this is demonstrated directly by choosing a few spike patterns that correspond to same larger time scale representation and showing stimuli (velocity) conditioned on those patterns (figure 5).
Mutual information rate is estimated using NSB entropy estimators with extra steps of extrapolation/fitting to obtain (1) asymptotic entropy rate (infinite data), (2) large word size, (3) remove empirical fluctuations of the estimate due to structure in the stimulus or response. These are more or less empirical approach to get better estimates. The mutual information is estimated by taking the difference between the marginal entropy and conditional entropy . One quantity that was not extrapolated was the limit of bin size going to zero.
NSB estimator is a Bayesian estimator that uses an approximately flat prior on entropy itself. They show that in 1D case, the uniform prior on probability space results in a poor entropy estimator. It would be interesting to see the actual prior distribution over probability given a flat prior on entropy.
One question of the overall methodology is the estimation of noise entropy using just 5 seconds of data repeated 100 times. How diverse is the stimulus? How robust is the estimated noise entropy obtained this way?