Estimating learnability

TL;DR: you can accurately estimate the hypothetical performance of multivariate linear regression and classification models trained on infinite data with surprisingly little data, O(\sqrt{d}), even when the number of samples (n) is less than the number of features or dimensions (d). 

This week in lab meeting we discussed ‘Estimating learnability in the sublinear data-regime’ by Kong and Valiant published in NeurIPS 2018. The key idea is that with clever statistical methods you can estimate the hypothetical performance of a model trained on infinite data, using only a small amount of training data. The authors provide methods to do so for multivariate linear regression and classification. 

For the multivariate regression case, the authors seek to estimate the explained variance, as quantified by:

r^2 = 1- \frac{E[(y-\beta^Tx)^2]}{E[y^2]}

where y is the (scalar) output, x is the vector (d \times 1 ) of regressors, and \beta is the optimal least squares weight vector.

In the linear regression setting, an accurate estimator of this quantity already exists (but has a key limitation):

\hat{r}^2 = 1 - \frac{\frac{1}{n-d}\sum_i^n(y_i-\hat{\beta}^Tx_{:,i})^2}{\frac{1}{n}\sum_i^ny_i^2},

where \hat{\beta} is the least squares solution estimated on n samples from (x, y) (same as those in the formula above). 

It’s important to note that with finite data, the estimated \hat{\beta} is not the true {\beta}, thus your model wouldn’t actually achieve this performance on held-out data.

The key problem with this simple but effective estimator is that it stops working when you have fewer samples than features (n<d), this is because your regressand matrix (x) is over-complete and there is no unique least-squares solution and if you regularize you can fit the data arbitrarily well (leaving no residuals on which to estimate performance).

Kong and Valiant provide a solution to this problem by not estimating performance on the basis of model predictions but directly on the basis of covariance between your regressors and regressand. To get a flavor for this approach I will provide a short alternative derivation of their estimator in the case where Cov(x) = I, E[x]=0, E[y]=0. First note that if I choose one observation of x and y, then:

E[x_1y_1]  =\beta

because \beta = E[xx^T]^{-1} E[yx] = \Sigma_x^{-1} \Sigma_{x,y}. And with two independent observations of x and y, I can have:

E[(x_1y_1)^Tx_2y_2] = E[x_1y_1]^T E[x_2y_2] = \beta^T \beta  .

Then, divide by an unbiased estimator of the variance of y, the sample variance will work, and you have an estimate of r^2 with an unbiased numerator and denominator.

Kong and Valiant go beyond the restrictive case I describe above to arbitrary covariance and provide an excellent introduction reviewing prior work in this area (note the solution to the identity covariance case was first provided by Lee H Dicker. Variance estimation in high-dimensional linear models. Biometrika, 2014.).