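Let $\pi = (\pi_1, \pi_2, \ldots)$ be a well-defined infinite discrete probability distribution (e.g., a draw from a Dirichlet process (DP)). We are interested in evaluating expectations of the form $E\left[\sum_{i} f(\pi_i)\right]$ for some function $f$ (we are especially interested in $f(x) = -x \log x$, which gives Shannon's entropy). Following the references listed below, we can rewrite it as
$$E\left[\sum_{i} f(\pi_i)\right] = E\left[\frac{f(\tilde\pi_1)}{\tilde\pi_1}\right],$$
where $\tilde\pi_1$ is a random variable that takes the value $\pi_i$ with probability $\pi_i$. This random variable is better known as the first size-biased sample. It is defined by $P(\tilde\pi_1 = \pi_i \mid \pi) = \pi_i$. In other words, it picks one of the probabilities $\pi_1, \pi_2, \ldots$, choosing $\pi_i$ with probability $\pi_i$. The identity follows by conditioning on $\pi$: $E\left[f(\tilde\pi_1)/\tilde\pi_1 \mid \pi\right] = \sum_i \pi_i \, f(\pi_i)/\pi_i = \sum_i f(\pi_i)$.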
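For the Pitman-Yor process (PY) with discount parameter $d$ and concentration parameter $\alpha$ (the Dirichlet process is the special case $d = 0$), the size-biased samples are naturally obtained by the stick-breaking construction. Given a sequence of independent random variables $V_n \sim \mathrm{Beta}(1 - d, \alpha + n d)$ for $n = 1, 2, \ldots$, if we define $\pi_n = V_n \prod_{i=1}^{n-1} (1 - V_i)$, then the distribution of the sequence $(\pi_n)$ is invariant under size-biased permutation (see the references below), so the $\pi_n$ themselves form a sequence of size-biased samples. In our case, we only need the first size-biased sample, which is simply distributed as $\tilde\pi_1 = V_1 \sim \mathrm{Beta}(1 - d, \alpha + d)$.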
Using this trick, we can compute the expected entropy of a PY draw without complicated integrals over the simplex. We used this identity and its extension to compute the PY-based entropy estimator.
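As a sanity check of the trick, here is a minimal Python sketch using only the standard library. With $f(x) = -x \log x$, the size-biased form reduces to the expectation of $-\log$ of the first size-biased sample, and for a $\mathrm{Beta}(1 - d, \alpha + d)$ variable this is $\psi(\alpha + 1) - \psi(1 - d)$, where $\psi$ is the digamma function. The function names, the hand-rolled `digamma`, the truncation level, and the parameter values are illustrative assumptions, not code from the original analysis.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 (recurrence, then an asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_entropy_closed_form(alpha, d):
    """E[-log V] for V ~ Beta(1 - d, alpha + d), in nats."""
    return digamma(alpha + 1.0) - digamma(1.0 - d)


def expected_entropy_size_biased(alpha, d, n_samples, rng):
    """Monte Carlo estimate of E[f(V)/V] = E[-log V] from size-biased samples."""
    total = 0.0
    for _ in range(n_samples):
        v = rng.betavariate(1.0 - d, alpha + d)
        total -= math.log(v)
    return total / n_samples


def stick_breaking_entropy(alpha, d, n_sticks, rng):
    """Entropy (nats) of one stick-breaking draw truncated at n_sticks."""
    remaining = 1.0
    entropy = 0.0
    for n in range(1, n_sticks + 1):
        v = rng.betavariate(1.0 - d, alpha + n * d)
        p = remaining * v
        if p > 0.0:
            entropy -= p * math.log(p)
        remaining *= 1.0 - v
    return entropy


def expected_entropy_stick_breaking(alpha, d, n_sticks, n_draws, rng):
    """Brute-force average entropy over many truncated draws."""
    draws = (stick_breaking_entropy(alpha, d, n_sticks, rng) for _ in range(n_draws))
    return sum(draws) / n_draws


if __name__ == "__main__":
    rng = random.Random(0)
    alpha, d = 2.0, 0.0  # assumed values; d = 0 is the DP case
    print("closed form  :", expected_entropy_closed_form(alpha, d))
    print("size-biased  :", expected_entropy_size_biased(alpha, d, 20000, rng))
    print("stick-breaking:", expected_entropy_stick_breaking(alpha, d, 200, 1000, rng))
```

The three numbers should agree up to Monte Carlo error. The brute-force version is only practical here because the sticks decay geometrically when $d = 0$; for $d > 0$ the weights have a power-law tail and truncation converges slowly, which is one reason the size-biased form is convenient.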
- Jim Pitman, Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, Vol. 25, No. 2. (April 1997), pp. 855-900, doi:10.1214/aop/1024404422
- Mihael Perman, Jim Pitman, Marc Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, Vol. 92, No. 1. (21 March 1992), pp. 21-39, doi:10.1007/BF01205234
Today we discussed Nemenman et al.'s "Neural Coding of Natural Stimuli: Information at Sub-Millisecond Resolution" (PLoS Computational Biology, 2008).
Given a slowly changing naturalistic stimulus with a correlation time scale of 55 ms, is there information in the spike trains of the fly H1 neuron at the sub-millisecond time scale? To quantify this, mutual information was estimated for different word lengths and bin sizes. The main result, Figure 4D, suggests that there is information at the smaller time scale. This is demonstrated directly by choosing a few spike patterns that share the same representation at a coarser time scale and showing the stimulus (velocity) conditioned on each of those patterns (Figure 5).
The mutual information rate is estimated using the NSB entropy estimator, with extra extrapolation and fitting steps to (1) obtain the asymptotic entropy rate (infinite data), (2) extrapolate to large word size, and (3) remove empirical fluctuations of the estimate due to structure in the stimulus or response. These are more or less empirical approaches to getting better estimates. The mutual information is estimated by taking the difference between the marginal entropy of the response and the conditional (noise) entropy of the response given the stimulus. One limit that was not extrapolated was the bin size going to zero.
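The "marginal minus conditional entropy" step can be sketched in Python as follows. This is not the paper's procedure: it swaps in the naive plug-in entropy estimator for NSB and does none of the extrapolation steps, so it is biased for small samples. The function names, the binary binning, and the non-overlapping word layout are assumptions made for illustration.

```python
import math
from collections import Counter


def plugin_entropy(counts):
    """Plug-in (maximum likelihood) entropy in bits; NOT the NSB estimator."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def spike_words(spike_times, t_start, t_stop, bin_size, word_len):
    """Bin spike times into 0/1 bins, then cut into non-overlapping words."""
    n_bins = int(round((t_stop - t_start) / bin_size))
    bins = [0] * n_bins
    for t in spike_times:
        if t < t_start or t >= t_stop:
            continue  # ignore spikes outside the analysis window
        idx = min(int((t - t_start) / bin_size), n_bins - 1)
        bins[idx] = 1
    return [tuple(bins[i:i + word_len])
            for i in range(0, n_bins - word_len + 1, word_len)]


def mutual_information(trial_words):
    """Information per word, in bits.

    trial_words[r][k] is the word at time position k in repeat r of the
    same stimulus segment.
    """
    n_positions = len(trial_words[0])
    # Marginal (total) entropy: pool words over all repeats and positions.
    pooled = Counter(w for trial in trial_words for w in trial)
    h_total = plugin_entropy(list(pooled.values()))
    # Conditional (noise) entropy: entropy across repeats at a fixed
    # position, averaged over positions.
    h_noise = 0.0
    for k in range(n_positions):
        across_repeats = Counter(trial[k] for trial in trial_words)
        h_noise += plugin_entropy(list(across_repeats.values()))
    h_noise /= n_positions
    return h_total - h_noise
```

Dividing the result by the word duration (`word_len * bin_size`) gives a rate. Repeating the calculation over a grid of `bin_size` and `word_len` values is the kind of comparison the paper makes, although with NSB and extrapolation rather than the plug-in estimate used here.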
The NSB estimator is a Bayesian estimator that uses an approximately flat prior on the entropy itself. They show that, in the 1D case, a uniform prior on probability space results in a poor entropy estimator. It would be interesting to see the actual prior distribution over probabilities implied by a flat prior on entropy.
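One way to see why a uniform prior on the probabilities behaves poorly is to look at the prior on entropy that it implies. The sketch below draws distributions from a symmetric Dirichlet prior and summarizes their entropies. The alphabet size `k = 1000`, the number of draws, and the function names are assumptions chosen for illustration.

```python
import math
import random
import statistics


def sample_dirichlet_entropy(k, beta, rng):
    """Entropy (nats) of one draw from a symmetric Dirichlet(beta) on k bins."""
    # Normalized independent Gamma(beta, 1) variables are Dirichlet distributed.
    g = [rng.gammavariate(beta, 1.0) for _ in range(k)]
    total = sum(g)
    return -sum((x / total) * math.log(x / total) for x in g if x > 0.0)


def entropy_prior_summary(k, beta, n_draws, rng):
    """Sample mean and standard deviation of the implied prior on entropy."""
    h = [sample_dirichlet_entropy(k, beta, rng) for _ in range(n_draws)]
    return statistics.mean(h), statistics.stdev(h)


if __name__ == "__main__":
    rng = random.Random(1)
    k = 1000
    mean_h, sd_h = entropy_prior_summary(k, 1.0, 200, rng)  # beta = 1: uniform
    print("max entropy (log k):", math.log(k))
    print("prior mean entropy :", mean_h)
    print("prior std of entropy:", sd_h)
```

With `beta = 1` the sampled entropies cluster tightly around a single value below `log k`, so the "uninformative" prior on probabilities is in fact very informative about entropy. Changing `beta` moves that cluster, which is the intuition behind mixing over `beta` to flatten the prior on entropy.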
One question about the overall methodology concerns the estimation of the noise entropy from just 5 seconds of data repeated 100 times. How diverse is the stimulus? How robust is the noise entropy estimated this way?