Target Reader/Required Knowledge

This post is an introduction to conjugate priors in the context of linear regression. Conjugate priors are a technique from Bayesian statistics/machine learning.

The reader is expected to have some basic knowledge of Bayes’ theorem, probability (conditional probability and the chain rule), machine learning, and a pinch of matrix algebra.

In addition, the code will be in the Julia language, but it can easily be translated to Python/R/MATLAB.

Quick Bayesian refresher

Ever since the advent of computers, Bayesian methods have become more and more important in the fields of Statistics and Engineering. The main case for using these techniques is to reason about the uncertainty of an inference. In particular, we can use prior information about our model, together with new information coming from the data, to update our beliefs and obtain better knowledge about the observed phenomenon.

Bayes’ theorem, viewed from a Machine Learning perspective, can be written as:

\[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D}\mid \theta) p(\theta)}{p(\mathcal{D})} \]

where $\theta$ are the parameters of the model which, we believe, has generated our data $\mathcal{D}$. Thanks to Bayes’ theorem, given our data $\mathcal{D}$, we can learn the distribution of the parameters $\theta$.

Now, let’s examine each term of the first equation:

  • $p(\theta\mid \mathcal{D})$ is called posterior. It represents how much we know about the parameters of the model after seeing the data.

  • $p(\mathcal{D}\mid \theta)$ is called likelihood. It represents how likely it is to see the data, had it been generated by our model using parameters $\theta$.

  • $p(\theta)$ is called prior. It represents our beliefs about the parameters before seeing any data.

  • $p(\mathcal{D})$ is called model evidence or marginal likelihood. It represents the probability of observing our data without any assumption about the parameters of our model. It does not depend on $\theta$ and thus evaluates to just a constant. Because it is constant and usually expensive to compute, it is generally ignored.

Ignoring the marginal likelihood $p(\mathcal{D})$ we usually write Bayes’ theorem as:

\[ p(\theta \mid \mathcal{D}) \propto p(\mathcal{D}\mid \theta) p(\theta) \]

The $\propto$ symbol means proportional to, i.e. equal except for a normalizing constant.
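
As a toy illustration of this proportionality (unrelated to the regression problem below, with a made-up Beta prior and made-up coin-flip data): put a grid over a single parameter, multiply prior and likelihood pointwise, and normalize at the end. The normalizing constant is recovered simply by dividing by the sum.

using Distributions

# Toy example: posterior over a coin's bias θ after seeing 7 heads in 10 flips
θ_grid       = 0:0.01:1
prior        = pdf.(Beta(2, 2), θ_grid)                    # p(θ): mild belief in a fair coin
likelihood   = [pdf(Binomial(10, θ), 7) for θ in θ_grid]   # p(D | θ)
unnormalized = likelihood .* prior                         # ∝ p(θ | D)
posterior    = unnormalized ./ sum(unnormalized)           # normalize so the grid values sum to 1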

Multivariate linear regression

Motivation

Let’s consider the problem of multivariate linear regression. This means that we want to find the best set of intercept and slopes to minimize the distance between our linear model’s predictions and the actual data. But it doesn’t end here: we may also be interested in estimates of the uncertainty of our model, e.g. a credible interval for each parameter.

Another feature we might be interested in is support for streaming data. When deploying our algorithm, we may only have had the opportunity to train it on a small quantity of data compared to what our users create every day, and we want our system to react to newly emerging user behaviours without retraining.

Problem setting

Our data $\mathcal{D}=\{X,Y\}$ contains the predictors (or design matrix) $X \in \mathbb{R}^{n \times d}$, and the response $Y \in \mathbb{R}^{n\times 1}$.

$n$ is the number of observations and $d$ is the number of features.

A single observation (a row of $X$) is $x_i \in \mathbb{R}^{1 \times d}$, $i \in 1,..,n$, and the corresponding response is $y_i \in \mathbb{R}$.

Our model will be $Y = X\beta + \epsilon$ where $\epsilon \sim \mathcal{N}(0,\sigma^2 I)$ is the noise. This can be rewritten as $Y \sim \mathcal{N}(X\beta, \sigma^2 I)$ thus having an $n$-dimensional multivariate Normal distribution.
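
To make the notation concrete, here is a small standalone sketch (the numbers are arbitrary) showing that the two formulations generate data in exactly the same way:

using Distributions, LinearAlgebra

# A tiny toy dataset, just to illustrate the model
n = 4
X = [ones(n) randn(n)]        # design matrix: an intercept column plus one feature
β = [1.0, 2.0]
σ² = 1.0

Y     = X*β + sqrt(σ²)*randn(n)          # Y = Xβ + ε,  ε ~ N(0, σ²I)
Y_alt = rand(MvNormal(X*β, σ²*I(n)))     # equivalently, Y ~ N(Xβ, σ²I)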

Likelihood

Let’s write the likelihood for multivariate linear regression, i.e. how likely it is to observe the data $\mathcal{D}$, given a certain linear model specified by $\beta$.

\[p(\mathcal{D}\mid \theta) = p((X,Y)\mid \beta) = \mathcal{N}(Y\mid X\beta,\sigma^2I) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}(Y-X\beta)^T(Y-X\beta)\right\}\]

The last expression was obtained by plugging in the multivariate Gaussian PDF with mean $\mu=X\beta$ and covariance matrix $\Sigma=\sigma^2 I$.

Also, since the observations $(x_i, y_i)$ are i.i.d., we can factorize the likelihood as:

\[p(\mathcal{D}\mid \theta) = p((X,Y)\mid \beta) = \mathcal{N}(Y\mid X\beta,\sigma^2I) = \prod\limits_{i=1}^{n} \mathcal{N}(y_i \mid x_i\beta, \sigma^2)\]
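
Reusing the toy X, Y, β and σ² from the sketch above, we can check this factorization numerically (on the log scale the product becomes a sum):

# The joint log-density of the n-dimensional Normal...
joint = logpdf(MvNormal(X*β, σ²*I(n)), Y)

# ...equals the sum of the n univariate log-densities
factored = sum(logpdf(Normal(X[i,:]⋅β, sqrt(σ²)), Y[i]) for i in 1:n)

joint ≈ factored    # true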

Visualization

How can we visualize this distribution? For a single pair $(x_i, y_i)$ (with fixed $\beta$), the multivariate Normal collapses to a univariate Normal density. Plotting this density for a grid of values of $x$ and $y$, we can see that the points with the highest density lie on the line $y=1+2x$, as expected since our parameters are $\beta = \{1,2\}$. The variance $\sigma^2=1$, which for now we will treat as a known constant, controls how “fuzzy” the resulting plot is.

using Plots, Distributions, LaTeXStrings, LinearAlgebra, Printf
β = [1.0, 2.0]
2-element Array{Float64,1}:
 1.0
 2.0
heatmap(
    -3:2e-2:3,
    -3:2e-2:3,
    (x,y)->pdf(Normal([1 x]⋅β, 1),y),
    xlabel=L"x",
    ylabel=L"y",
    title=L"p\;(\mathcal{D}=(x,y)\mid\beta=\{1,2\})"
)


Alternatively, we can plot how likely each combination of weights is, given a certain point $(x_i, y_i)$. Notice how, for a single point, many combinations of slope $\beta_1$ and intercept $\beta_0$ are possible. Also notice how these combinations lie on a line: if you increase the intercept, the slope has to go down.

heatmap(
    -3:2e-2:3,
    -3:2e-2:3,
    (β0,β1)->pdf(Normal([1 1]⋅[β0, β1], 1),2),
    xlabel=L"\beta_0",
    ylabel=L"\beta_1",
    title=L"p\;(\mathcal{D}=(x=1,y=2)\mid\beta=\{\beta_0,\beta_1\})"
)


Conjugate priors

Now comes the question of what our prior should look like and how to combine it with the likelihood to obtain a posterior. We could just use a uniform prior, as we have no idea how our $\beta$ is distributed. This is what Vincent D. Warmerdam does in his excellent post on this topic.

Another option is to use what is called a conjugate prior, that is, a specially chosen prior distribution such that, when multiplied with the likelihood, the resulting posterior distribution belongs to the same family as the prior. Why would we want to do so? The main reason here is speed. Since we know the analytic expression for our posterior, almost no computation needs to be performed: it is just a matter of calculating the new distribution’s parameters. This is a breath of fresh air compared to the high cost of the Markov Chain Monte Carlo methods usually used to compute these posteriors. This speed allows us to consider using Bayesian methods in high-throughput streaming contexts.

How do we find these likelihood–prior pairs? The usual approach is to look at the likelihood’s algebraic form and come up with a distribution whose PDF is similar enough that the posterior stays in the same family. We don’t need to do all of this work ourselves, though: we can just look up conjugate pairs on Wikipedia or other sources.

Multivariate Normal prior

For a Normal likelihood with known variance, the conjugate prior is another Normal distribution with parameters $\mu_\beta$ and $\Sigma_\beta$. The parameter $\mu_\beta$ describes the initial values for $\beta$ and $\Sigma_\beta$ describes how uncertain we are of these values.

\[p(\theta) = p(\beta) = \mathcal{N}(\mu_\beta, \Sigma_\beta)\]

heatmap(
    -3:2e-2:3,
    -3:2e-2:3,
    (x,y)->pdf(MvNormal([0.5, 1.5], I(2)),[x,y]),
    xlabel=L"\beta_0",
    ylabel=L"\beta_0",
    title=L"p\;(\beta\mid\mu_\beta=\{0.5, 1.5\}, \Sigma_\beta=I\;)"
)


Using this prior, the formula for our posterior now looks like this:

\[p(\beta \mid (X,Y)) \propto p((X,Y)\mid \beta) p(\beta)\]

\[p(\beta \mid (X,Y)) \propto \mathcal{N}(Y\mid X\beta,\sigma^2 I) \times \mathcal{N}(\beta\mid\mu_\beta,\Sigma_\beta) \propto \mathcal{N}(\beta\mid\mu_\beta^{new},\Sigma_\beta^{new})\]

The posterior depends only on $\mu_\beta^{new}$ and $\Sigma_{\beta}^{new}$, which can be calculated from the prior and the newly observed data. Recall that $\sigma^2$ is the variance of the data model’s noise.

\[ \Sigma_\beta^{new} = (\Sigma_\beta^{-1} + X^TX)^{-1} \sigma^2 \]

\[ \mu_\beta^{new} = (\Sigma_\beta^{-1} + X^TX)^{-1} (\Sigma_\beta^{-1}\mu_\beta + X^TY) \]

Since matrix inversion has cubic time complexity, each update will cost us $O(d^3)$ (plus $O(nd^2)$ to form $X^TX$), where $d$ is the number of features.

Julia implementation

We can now proceed to the implementation. I chose the Julia language because of its excellent speed and scientific libraries. Also I like shiny things and Julia is much newer than Python/R/MATLAB.

First, we generate the data which we will use to verify the implementation of the algorithm. Notice how we save the variance $\sigma^2$, which we will treat as a known constant and use when updating our prior. There are ways to estimate it from the data, e.g. by using a Normal-Inverse-Chi-Squared prior, which we will examine in a future blog post.

Training data

n = 100
X = [ones(n) 1:n]
β = [-13.0, 42.0]
σ² = 250^2
Y = X*β + sqrt(σ²)*randn(n)
scatter(1:n, Y, xlabel="X", ylabel="Y", title="Training data", legend=false)


Prior

First of all, using MvNormal from the Distributions package, let’s define our prior.

μᵦ = [0.0, 0.0]
Σᵦ = [1.0 0.0
      0.0 1.0]
prior = MvNormal(μᵦ, Σᵦ)
FullNormal(
dim: 2
μ: [0.0, 0.0]
Σ: [1.0 0.0; 0.0 1.0]
)

Update step definition

Then, using the posterior hyperparameter update formulas, let’s implement the update function. It takes a prior and our data, and returns the posterior distribution. Notice how, thanks to Julia’s Unicode support, the code closely resembles the math.

\[ \Sigma_\beta^{new} = (\Sigma_\beta^{-1} + X^TX)^{-1} \sigma^2 \]

\[ \mu_\beta^{new} = (\Sigma_\beta^{-1} + X^TX)^{-1} (\Sigma_\beta^{-1}\mu_\beta + X^TY) \]

function update(prior, X, Y, σ²)
    μᵦ = mean(prior)
    Σᵦ = cov(prior)

    # Posterior hyperparameters, following the update formulas above
    Σᵦ_new = (Σᵦ^-1 + X'X)^-1 * σ²
    μᵦ_new = (Σᵦ^-1 + X'X)^-1 * (Σᵦ^-1*μᵦ + X'Y)

    # Symmetric() guards against small numerical asymmetries in Σᵦ_new,
    # which the MvNormal constructor would otherwise reject
    posterior = MvNormal(μᵦ_new, Symmetric(Σᵦ_new))
    return posterior
end
update (generic function with 1 method)

Training

posterior = update(prior, X, Y, σ²)
FullNormal(
dim: 2
μ: [29.797457399445193, 40.716620886842236]
Σ: [2438.8256259319187 -36.40027489487599; -36.40027489487599 0.7280054978975199]
)

Let’s extract the estimates from the posterior, along with their uncertainties (two posterior standard deviations).

@printf("β₀: %.3f ± %.3f (2σ)\n", mean(posterior)[1], 2*sqrt(cov(posterior)[1,1]))
@printf("β₁: %.3f ± %.3f (2σ)\n", mean(posterior)[2], 2*sqrt(cov(posterior)[2,2]))
β₀: 29.797 ± 98.769 (2σ)
β₁: 40.717 ± 1.706 (2σ)

We can see how the parameters we used to generate the data ($-13, 42$) are well within the $2\sigma$ credible interval of our estimates.
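
Because the posterior belongs to the same family as the prior, it can be fed straight back in as the prior for the next chunk of data, which is exactly what the streaming scenario from the motivation needs. Here is a minimal sketch reusing the update function and the training data above (the batch size of 10 is an arbitrary choice):

# Process the data in mini-batches: the posterior after each batch
# becomes the prior for the next one
streaming_posterior = foldl(
    (post, batch) -> update(post, X[batch, :], Y[batch], σ²),
    Iterators.partition(1:n, 10);
    init = prior
)
mean(streaming_posterior)    # point estimates after seeing all of the batches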

Posterior predictive

To use our posterior in a predictive setting, we need the predictive distribution, which can be obtained with the following formula:

\[ p(y_i\mid x_i, \mathcal{D}) = \mathcal{N}(x_i\mu_\beta^{new},\ \sigma^2 + x_i^T\Sigma_\beta^{new} x_i) \]

where $x_i$ is the feature vector for a single observation and $y_i$ is the predicted response.
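
For example, the predictive distribution at a single new input (an arbitrary $x = 50$, using the posterior computed above; the variable names here are just for illustration) is a univariate Normal:

x_new = [1.0, 50.0]                               # feature vector: [intercept, x]
μ_pred  = x_new ⋅ mean(posterior)                 # predictive mean
σ²_pred = σ² + x_new' * cov(posterior) * x_new    # noise variance + parameter uncertainty
predictive = Normal(μ_pred, sqrt(σ²_pred))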

plot(
    1:n,
    X*mean(posterior),
    ribbon=2*sqrt.(
        σ² .+ [X[i,:]'*cov(posterior)*X[i,:] for i in 1:n]
    ), xlabel="X", ylabel="Y", title="Posterior",
    label="Predicted"
)
scatter!(1:n, Y, label="Training data")


Sources/Bibliography