The Loss-Calibrated Bayesian

By Farhan Damani

In lab meeting this week, we discussed loss-calibrated approximate inference in the context of Bayesian decision theory (Lacoste-Julien et al. 2011, Cobb et al. 2018). For many applications, the cost of an incorrect prediction varies depending on the nature of the mistake. Suppose you are in charge of controlling a nuclear power plant with an unknown temperature $\theta$. You observe indirect measurements of the temperature $D$, and use Bayesian inference to infer a posterior distribution over the temperature given the observations, $p(\theta|D)$. The plant is in danger of overheating, and as the operator you can either keep the plant running or shut it down. Keeping the plant running while its temperature exceeds a critical threshold $T_{\text{critic}}$ will cause a nuclear meltdown, incurring a huge loss $L(\theta > T_{\text{critic}}, \text{'on'})$, while shutting off the plant at benign temperatures incurs a minor loss $L(\theta < T_{\text{critic}}, \text{'off'})$.

In figure 1 we observe that the true posterior $p(\theta|D)$ is multi-modal. Standard approximate inference techniques characterize general properties of the posterior, attempting to match either the first or second moment of $p$. Both strategies underestimate the posterior mass in the safety-critical region. In contrast, the approximation shown by the dash-dotted line fails to characterize typical properties of the posterior, but because it is optimized for task-specific utility it leads to the same decision as the true posterior. The point is that the “best” approximate posterior is subjective: it depends on what the approximation will be used for. We should therefore tailor our inferential resources to find an approximation that is well suited to the decision task at hand.
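To make the effect concrete, here is a minimal sketch of the power-plant example. All numbers are hypothetical choices for illustration (the mixture weights, means, threshold, and loss values do not come from the papers), and the two loss entries not mentioned above (running at a safe temperature, shutting down at an unsafe one) are assumed to be zero. The "true" posterior is a two-component Gaussian mixture, and the approximation is a single Gaussian matching its mean and variance.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a univariate Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# Hypothetical bimodal posterior over temperature: a Gaussian mixture
# with a large benign mode and a small mode above the threshold.
WEIGHTS = [0.9, 0.1]
MEANS = [300.0, 420.0]
STDS = [10.0, 10.0]
T_CRITIC = 400.0  # hypothetical critical threshold

# Hypothetical losses; the other two table entries are assumed to be 0.
LOSS_MELTDOWN = 100.0  # L(theta > T_critic, 'on')
LOSS_SHUTDOWN = 5.0    # L(theta < T_critic, 'off')

def mixture_tail_mass(threshold):
    """P(theta > threshold) under the mixture posterior."""
    return sum(w * (1.0 - normal_cdf(threshold, m, s))
               for w, m, s in zip(WEIGHTS, MEANS, STDS))

def moment_matched_gaussian():
    """Mean and std of the single Gaussian matching the mixture's first two moments."""
    mean = sum(w * m for w, m in zip(WEIGHTS, MEANS))
    second_moment = sum(w * (s ** 2 + m ** 2)
                        for w, m, s in zip(WEIGHTS, MEANS, STDS))
    return mean, math.sqrt(second_moment - mean ** 2)

def best_action(p_unsafe):
    """Pick the action with the lower expected loss given P(theta > T_critic)."""
    risk_on = LOSS_MELTDOWN * p_unsafe
    risk_off = LOSS_SHUTDOWN * (1.0 - p_unsafe)
    return "on" if risk_on < risk_off else "off"

# Probability of the unsafe region under the true posterior p ...
tail_p = mixture_tail_mass(T_CRITIC)
# ... and under the moment-matched Gaussian approximation q.
q_mean, q_std = moment_matched_gaussian()
tail_q = 1.0 - normal_cdf(T_CRITIC, q_mean, q_std)

print(f"P(theta > T_critic) under p: {tail_p:.4f} -> action {best_action(tail_p)!r}")
print(f"P(theta > T_critic) under q: {tail_q:.4f} -> action {best_action(tail_q)!r}")
```

With these particular numbers the moment-matched Gaussian assigns roughly a tenth as much mass to the unsafe region as the mixture does, which is enough to flip the decision from shutting the plant down to keeping it running. Other choices of losses can leave the decision unchanged, so this illustrates the failure mode rather than showing that it always occurs.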

Bayesian decision theory extends the Bayesian paradigm by including a task-specific loss function $L(\theta, a)$, or equivalently a utility function $U(\theta, a)$ to be maximized rather than minimized, which tells us the cost of taking action $a \in \mathcal{A}$ when the world is in state $\theta$. According to this view, the optimal action minimizes the posterior risk: $a^* = \underset{a}{\arg \min} \text{ } \mathcal{R}(a)$, where $\mathcal{R}(a) = \mathbb{E}_{p(\theta|D)}[L(\theta, a)]$. Typically, this is computed using a two-step procedure: first approximate the posterior $p(\theta|D)$ with a $q(\theta|D)$, and then minimize the risk under $q$. This approach, however, assumes that our approximate $q$ captures the properties of the posterior that we care about. Which properties matter depends on the utility function, so we should instead jointly optimize the approximate posterior and the action that minimizes the posterior risk. Cobb et al. 2018 show how to derive a variational lower bound that depends on a task-specific utility function. In their setup, minimizing the KL divergence between an approximate posterior $q$ and a calibrated posterior scaled by the utility function results in the standard ELBO objective plus an additional utility-dependent regularization term. This formulation is amenable to stochastic optimization, allowing for the practical deployment of this framework to supervised learning.
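As a sketch of how such a bound can arise, assume a strictly positive utility $U(\theta, a)$ that we want to maximize in expectation, and write $q(\theta)$ for $q(\theta|D)$. The positivity assumption and the notation $\tilde{p}$ below are choices made here for illustration and may differ from the exact presentation in the papers. Applying Jensen's inequality to the log expected utility gives

$$\log \mathbb{E}_{p(\theta|D)}[U(\theta, a)] = \log \int q(\theta) \frac{p(\theta|D) \, U(\theta, a)}{q(\theta)} \, d\theta \geq \mathbb{E}_{q(\theta)}[\log U(\theta, a)] - \text{KL}(q(\theta) \,\|\, p(\theta|D)).$$

The gap in this inequality is exactly $\text{KL}(q \,\|\, \tilde{p})$, where $\tilde{p}(\theta) = p(\theta|D) \, U(\theta, a) / \mathbb{E}_{p(\theta|D)}[U(\theta, a)]$ is the posterior rescaled by the utility. For a fixed action $a$, maximizing the right-hand side over $q$ is therefore the same as minimizing the KL divergence to $\tilde{p}$. Since $-\text{KL}(q(\theta) \,\|\, p(\theta|D))$ equals the standard ELBO minus the constant $\log p(D)$, the objective is the usual ELBO plus the utility-dependent term $\mathbb{E}_{q(\theta)}[\log U(\theta, a)]$.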