We discussed state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons (it's called noise correlation because the signal, i.e. stimulus-driven, component has been subtracted out). In a 2010 Science paper [2], noise correlations were shown to be much smaller than previously reported, around 0.01 compared to the usual 0.1 to 0.2 range, which stirred up the field (see [3] for a list of values). In this paper, the authors argue that this difference in noise correlation magnitude is due to population-level covariations during anesthesia (they used sufentanil).
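As a rough illustration of the definition above, here is a minimal sketch of how a noise correlation could be computed from spike counts: z-score the counts within each stimulus condition, then correlate. The function name `noise_correlation`, the simulated firing rates, and the shared-gain model are all assumptions made for this toy example, not the analysis used in the papers.

```python
import numpy as np

def noise_correlation(counts_a, counts_b, stim_ids):
    """Correlate spike counts of two neurons after z-scoring within each stimulus."""
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    stim_ids = np.asarray(stim_ids)
    z_a = np.zeros_like(counts_a)
    z_b = np.zeros_like(counts_b)
    for s in np.unique(stim_ids):
        idx = stim_ids == s
        # Remove the stimulus-driven mean and normalize the variance per condition.
        # (Assumes each neuron's count varies within every condition.)
        z_a[idx] = (counts_a[idx] - counts_a[idx].mean()) / counts_a[idx].std()
        z_b[idx] = (counts_b[idx] - counts_b[idx].mean()) / counts_b[idx].std()
    return np.corrcoef(z_a, z_b)[0, 1]

rng = np.random.default_rng(0)
stim_ids = np.repeat(np.arange(4), 500)               # 4 stimuli, 500 trials each
rates = np.array([5.0, 10.0, 20.0, 40.0])[stim_ids]   # hypothetical mean counts

# Two conditionally independent Poisson neurons with identical tuning.
r_indep = noise_correlation(rng.poisson(rates), rng.poisson(rates), stim_ids)

# Same neurons, but with a shared trial-to-trial gain (a toy stand-in for
# population-level fluctuations).
gain = np.exp(0.3 * rng.standard_normal(stim_ids.size))
r_shared = noise_correlation(rng.poisson(rates * gain),
                             rng.poisson(rates * gain), stim_ids)

print(f"independent: {r_indep:.3f}, shared gain: {r_shared:.3f}")
```

In this toy setting the shared gain alone is enough to produce a sizable noise correlation between neurons that are otherwise independent.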
Author Archives: memming
Lab Meeting 6/10/2013: Hessian-free optimization
James Martens has been publishing bags of tips and tricks for the large-scale nonconvex optimization problems that arise in training deep networks and recurrent neural networks (RNNs). With them, deep networks could be trained without pretraining and better than the state of the art, and RNNs could be trained much better than with backpropagation through time. Basically, it's the use of second-order (curvature) information via heuristic modifications of the conjugate gradient (CG) method. CG is Hessian-free since it only needs the product of the Hessian with a single direction vector, which is much cheaper than computing the full Hessian (often prohibitive for large-scale problems). The objective function is repeatedly approximated locally by a quadratic function, which is then minimized. Some of the tricks are:
 Use (linear) conjugate gradient instead of quasi-Newton methods like L-BFGS, or nonlinear conjugate gradient.
 Use the Gauss-Newton approximation. For nonconvex problems, the Hessian $H$ can have negative eigenvalues, which can lead to erratic behavior of the CG step, since CG assumes a positive definite $H$. Hence, they propose using the Gauss-Newton approximation, which discards the second-order derivatives of the network and is guaranteed to be positive semidefinite when the loss is convex in the network output. Writing the objective as $f(\theta) = L(z(\theta))$ with Jacobian $J = \partial z / \partial \theta$, the Hessian is $H = J^\top (\nabla_z^2 L) J + \sum_i \frac{\partial L}{\partial z_i} \nabla_\theta^2 z_i$, and the second term is simply ignored.
 Use the fraction of improvement as the termination condition for CG (instead of the regular residual norm condition).
 Add regularization (damping) to the Hessian (or its approximation), and update its trust-region parameter via a Levenberg-Marquardt style heuristic.
 Do semi-online, mini-batch updates.
 For training RNNs, use structural damping, which discourages large changes to parameters that the network is highly sensitive to.
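To make a few of the tricks above concrete, here is a minimal sketch on a toy nonlinear least-squares problem. It combines Gauss-Newton products computed without forming the curvature matrix, linear CG, and a Levenberg-Marquardt style damping update. The function names, the toy exponential model, and the thresholds 0.25 and 0.75 are assumptions chosen for illustration. For simplicity the CG loop stops on the residual norm rather than on the fraction-of-improvement criterion, and there is no mini-batching or structural damping, so this is not the full method from the papers.

```python
import numpy as np

def conjugate_gradient(matvec, b, max_iter=50, tol=1e-16):
    """Solve A x = b for positive definite A, given only the product v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - A x at x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if rs < tol:          # plain residual-norm stopping rule (a simplification)
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def hessian_free_minimize(residual, jacobian, theta, n_steps=50, lam=1.0):
    """Minimize 0.5 * ||residual(theta)||^2 with damped Gauss-Newton steps via CG."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_steps):
        r = residual(theta)
        J = jacobian(theta)
        grad = J.T @ r
        gn_vec = lambda v: J.T @ (J @ v)      # Gauss-Newton product, G is never formed
        delta = conjugate_gradient(lambda v: gn_vec(v) + lam * v, -grad)
        f_old = 0.5 * (r @ r)
        r_new = residual(theta + delta)
        f_new = 0.5 * (r_new @ r_new)
        # Reduction predicted by the local quadratic model.
        predicted = -(grad @ delta + 0.5 * (delta @ gn_vec(delta)))
        rho = (f_old - f_new) / predicted if predicted > 0 else 0.0
        if rho > 0.75:        # model is trustworthy: relax the damping
            lam *= 2.0 / 3.0
        elif rho < 0.25:      # model is poor: increase the damping
            lam *= 1.5
        if f_new < f_old:     # accept only steps that improve the objective
            theta = theta + delta
    return theta

# Toy problem (hypothetical): fit y = a * exp(b * x) to noiseless data.
x = np.linspace(0.0, 1.0, 20)
true_theta = np.array([2.0, -1.5])
y = true_theta[0] * np.exp(true_theta[1] * x)
residual = lambda th: th[0] * np.exp(th[1] * x) - y
jacobian = lambda th: np.column_stack([np.exp(th[1] * x),
                                       th[0] * x * np.exp(th[1] * x)])
theta_hat = hessian_free_minimize(residual, jacobian, [1.0, 0.0])
print(theta_hat)
```

In a real network the two products `J @ v` and `J.T @ u` would come from forward- and reverse-mode differentiation rather than from an explicit Jacobian, which is what keeps the method cheap for large models.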
References:

James Martens. Deep learning via Hessian-free optimization. ICML 2010

James Martens, Ilya Sutskever. Learning Recurrent Neural Networks with Hessian-Free Optimization. ICML 2011
Lab Meeting 3/25/2013: Demixing PCA (dPCA)
If you had lots of spike trains over 4 seconds for 800 neurons, 6 stimulus conditions, and 2 behavioral choices, how would you visualize your data? Unsupervised dimensionality reduction techniques such as principal component analysis (PCA) find orthonormal basis vectors that capture the most variance in the data, but the results are not necessarily interpretable. What one wants to say is something like:
“Along this direction, the population dynamics seems to encode stimulus, and along this other orthogonal dimension, neurons are modulated by the motor behavior…”
Lab Meeting 2/4/2013: Asymptotically optimal tuning curve in Lp sense for a Poisson neuron
An optimal tuning curve is the best transformation of the stimulus into a neural firing pattern (usually a firing rate) under certain constraints and a given optimality criterion. The following paper, which I saw at NIPS 2012, is related to what we are doing, so we took a deeper look into it.
Wang, Stocker & Lee (NIPS 2012), Optimal neural tuning curves for arbitrary stimulus distributions: Discrimax, infomax and minimum Lp loss.
The paper assumes a single neuron encoding a one-dimensional stimulus $s$, governed by a distribution $\pi(s)$. The neuron is assumed to be Poisson (pure rate code). The neuron's tuning curve $h(s)$ is smooth, monotonically increasing (with $h'(s) > 0$), and constrained to lie between a minimum and a maximum firing rate, $h_{\min}$ and $h_{\max}$. The authors assume the asymptotic regime for MLE decoding, where the observation time is long enough to apply the asymptotic normality theory (and convergence of $p$th moments) of the MLE.
The authors show that there is a one-to-one mapping between the tuning curve and the Fisher information under these constraints. Then, for various loss functions, they derive the optimal tuning curve using calculus of variations. In general, to minimize the $L_p$ loss under the constraints, the optimal tuning curve is the square of a rescaled cumulative integral of $\pi(s)^{1/(p+1)}$:
$$h(s) = \left( \sqrt{h_{\min}} + \left( \sqrt{h_{\max}} - \sqrt{h_{\min}} \right) \frac{\int_{-\infty}^{s} \pi(t)^{1/(p+1)} \, dt}{\int_{-\infty}^{\infty} \pi(t)^{1/(p+1)} \, dt} \right)^2$$
Furthermore, in the limit of $p \to 0$, the optimal solution corresponds to the infomax solution (i.e., the optimum for mutual information loss). However, all the analysis is only in the asymptotic limit, where the Cramér-Rao bound is attained by the MLE. For the case of mutual information, unlike the noiseless case, where the optimal tuning curve is the stimulus CDF (Laughlin), for Poisson noise it turns out to be the square of the stimulus CDF. I have plotted the differences below for a normal distribution (left) and a mixture of normals (right):
The results are very nice, and I’d like to see more results with stimulus noise and with population tuning assumptions.
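For readers who want to reproduce curves like these, here is a minimal numerical sketch of the $L_p$-optimal tuning curve for a Poisson neuron. It assumes the form in which the square root of the tuning curve is a rescaled cumulative integral of the stimulus density raised to the power $1/(p+1)$. The function name `optimal_tuning_curve`, the grid, and the rate limits are assumptions made for this example.

```python
import numpy as np

def optimal_tuning_curve(s, prior, p, h_min=0.0, h_max=1.0):
    """Lp-optimal Poisson tuning curve on a grid s, given the stimulus density prior."""
    w = prior ** (1.0 / (p + 1.0))
    # Cumulative trapezoidal integral of w, normalized to run from 0 to 1.
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (w[1:] + w[:-1]) * np.diff(s))])
    frac = cum / cum[-1]
    root = np.sqrt(h_min) + (np.sqrt(h_max) - np.sqrt(h_min)) * frac
    return root ** 2

s = np.linspace(-6.0, 6.0, 2401)
prior = np.exp(-0.5 * s ** 2) / np.sqrt(2.0 * np.pi)    # standard normal stimulus

h_infomax = optimal_tuning_curve(s, prior, p=0.0)   # p -> 0 limit: squared CDF
h_l2 = optimal_tuning_curve(s, prior, p=2.0)        # minimum mean squared error

mid = len(s) // 2                                   # grid point at s = 0
print(h_infomax[mid], h_l2[mid])
```

With `h_min = 0` and `h_max = 1`, the `p = 0` curve is the squared stimulus CDF, while larger `p` spreads the dynamic range further into the tails of the stimulus distribution.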
Spawning a realistic model of the brain?
I (Memming) presented Eliasmith et al., “A Large-Scale Model of the Functioning Brain”, Science 2012, for our computational neuroscience journal club. The authors combined their past efforts in building various modules for solving cognitive tasks into a large-scale spiking neuron model called SPAUN.
BCM rule and information maximization
As the first paper in our summer reading series on information theoretic learning and synaptic plasticity of spiking neurons, we discussed one of the earliest papers:
Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, Wulfram Gerstner. Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission. Proceedings of the National Academy of Sciences of the United States of America, Vol. 102, No. 14. (05 April 2005), pp. 5239–5244, doi:10.1073/pnas.0500495102
The BCM rule is a rate-based synaptic plasticity rule that is a stable version of the naive Hebbian learning rule $\Delta w \propto x y$, where $x$ and $y$ are the firing rates of the presynaptic and postsynaptic neurons, respectively. The Hebbian learning rule has positive feedback: a high firing rate makes the synapse stronger, which in turn makes the firing rate higher. The BCM rule fixes this with a sliding threshold that is controlled by the output (postsynaptic) firing rate: a higher firing rate raises the threshold, which widens the range of depression.
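The contrast between the two rules can be seen in a small rate-based simulation. This is only a sketch under assumptions of my own choosing: a linear neuron, two non-overlapping input patterns, the common textbook form of BCM with weight change proportional to x y (y - theta), and a threshold theta that tracks a running average of y squared. The function name `simulate` and all constants are hypothetical.

```python
import numpy as np

def simulate(rule, n_steps=20000, eta=0.005, theta_rate=0.05, seed=0):
    """Train a linear rate neuron y = w . x with a Hebbian or a BCM update."""
    rng = np.random.default_rng(seed)
    patterns = np.eye(2)              # two non-overlapping input patterns
    w = np.array([0.5, 0.4])
    theta = 0.0                       # sliding threshold (used by BCM only)
    for _ in range(n_steps):
        x = patterns[rng.integers(2)]
        y = w @ x
        if rule == "hebb":
            w = w + eta * x * y       # positive feedback, nothing stops the growth
        else:
            # Potentiate when y exceeds the threshold, depress when below it.
            w = w + eta * x * y * (y - theta)
            # The threshold slides up with recent postsynaptic activity.
            theta += theta_rate * (y ** 2 - theta)
    return w

w_hebb = simulate("hebb")
w_bcm = simulate("bcm")
print("Hebbian weights:", w_hebb)
print("BCM weights:    ", w_bcm)
```

In this toy setting the Hebbian weights grow without bound while the BCM weights remain bounded, because the threshold rises as soon as the output rate does.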
This paper by Toyoizumi et al. derives a learning rule from the first principle of maximizing the mutual information between the input spike trains and the output spike train. This alone would favor high firing rates, so they include a penalty for high firing rates. The cost function is:
$$L = I - \gamma D$$
where $I$ denotes the mutual information between the input $X$ and the output $Y$, $D = D_{KL}(P(Y) \,\|\, \tilde{P}(Y))$ is the Kullback-Leibler divergence, $\gamma$ is the tradeoff between the two terms, and $\tilde{P}$ is the target spike rate distribution. The paper utilizes an escape-rate approximation of an integrate-and-fire (IF) neuron with an extra spike-history-dependent rate modulation term to define the conditional probability $P(Y \mid X)$. The synaptic plasticity rule is then obtained by taking the derivative $dL/dw_j$ with respect to each synaptic weight $w_j$. By applying a time average instead of the expectation, together with a small step size approximation, they arrive at an online learning rule per synapse that resembles BCM in the special case of an LNP neuron (no refractoriness). Next meeting (this Friday), we will dig deeper into the details of the derivation.
Using size-biased sampling for certain expectations
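Let $\pi = (\pi_1, \pi_2, \ldots)$ be a well-defined infinite discrete probability distribution (e.g., a draw from a Dirichlet process (DP)). We are interested in evaluating expectations of the form $E\left[\sum_{i=1}^{\infty} f(\pi_i)\right]$ for some function $f$ (we are especially interested in $f(x) = -x \log x$, which gives us Shannon's entropy). Following [1], we can rewrite it as
$$E\left[\sum_{i=1}^{\infty} f(\pi_i)\right] = E\left[\sum_{i=1}^{\infty} \pi_i \frac{f(\pi_i)}{\pi_i}\right] = E\left[\frac{f(\tilde{\pi}_1)}{\tilde{\pi}_1}\right]$$
where $\tilde{\pi}_1$ is a random variable that takes the value $\pi_i$ with probability $\pi_i$. This random variable is better known as the first size-biased sample $\tilde{\pi}_1$. It is defined by $P(\tilde{\pi}_1 = \pi_i \mid \pi) = \pi_i$. In other words, it takes one of the probabilities among $\{\pi_i\}$ with probability $\pi_i$.
For the Pitman-Yor process (PY) with discount parameter $d$ and concentration parameter $\alpha$ (the Dirichlet process is the special case $d = 0$), the size-biased samples are naturally obtained by the stick-breaking construction. Given a sequence of independent random variables $V_i$ distributed as $\mathrm{Beta}(1 - d, \alpha + i d)$, if we define $\pi_i = V_i \prod_{j=1}^{i-1} (1 - V_j)$, then the set of $\pi_i$ is invariant to size-biased permutation [2], and they form a sequence of size-biased samples. In our case, we only need the first size-biased sample, which is simply distributed as $\mathrm{Beta}(1 - d, \alpha + d)$.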
Using this trick, we can compute the expected entropy under PY without the complicated simplex integrals. We used this and its extension for computing the PY-based entropy estimator.
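As a sanity check of the trick, here is a minimal Monte Carlo sketch. With $f(x) = -x \log x$, the expected entropy reduces to the expectation of $-\log \tilde{\pi}_1$ under a Beta distribution, which can be compared against a brute-force estimate from truncated stick-breaking draws. The function names, sample sizes, and truncation level are assumptions for illustration, and $d$ denotes the discount and $\alpha$ the concentration parameter.

```python
import numpy as np

def expected_entropy_size_biased(alpha, d=0.0, n_samples=200_000, seed=0):
    """E[H] = E[-log(first size-biased sample)], first sample ~ Beta(1 - d, alpha + d)."""
    rng = np.random.default_rng(seed)
    first = rng.beta(1.0 - d, alpha + d, size=n_samples)
    return float(np.mean(-np.log(first)))

def expected_entropy_stick_breaking(alpha, d=0.0, n_draws=5000, n_sticks=100, seed=1):
    """Brute force: average the entropy of many truncated stick-breaking draws."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n_sticks + 1)
    v = rng.beta(1.0 - d, alpha + i * d, size=(n_draws, n_sticks))
    remaining = np.cumprod(1.0 - v, axis=1)           # stick length left after each break
    before = np.concatenate([np.ones((n_draws, 1)), remaining[:, :-1]], axis=1)
    pi = np.clip(v * before, 1e-300, None)            # guard against log(0)
    entropy = -np.sum(pi * np.log(pi), axis=1)        # nats; truncated tail mass is ignored
    return float(entropy.mean())

alpha = 2.0
h_sb = expected_entropy_size_biased(alpha)
h_bf = expected_entropy_stick_breaking(alpha)
# For a DP (d = 0) with integer alpha, digamma(alpha + 1) - digamma(1) is a
# harmonic number, here 1 + 1/2.
h_exact = 1.0 + 0.5
print(h_sb, h_bf, h_exact)
```

The size-biased version needs only one Beta draw per sample, whereas the brute-force version has to simulate and truncate an entire draw from the process.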
 Jim Pitman, Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, Vol. 25, No. 2. (April 1997), pp. 855–900, doi:10.1214/aop/1024404422
 Mihael Perman, Jim Pitman, Marc Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, Vol. 92, No. 1. (21 March 1992), pp. 21–39, doi:10.1007/BF01205234