Evidence Lower Bound
Evidence lower bound, commonly known as the ELBO, is one of the fundamental concepts for understanding variational inference and also probabilistic decision theory. I am going to assume that readers are familiar with Bayes' theorem, which is a prerequisite concept. If you are not, I encourage you to go through some material such as this video. Also, go through my article about the Latent Variable Model.
Intractability of Distribution
The main reason the concept of the ELBO was introduced (at least originally) is the intractability of computing certain distributions in higher dimensions. In my article about the latent variable model I mention the intractability of the following quantity in higher dimensions:

$$ p(x) = \int p(x \mid z)\, p(z)\, dz $$
Remember that computing this quantity was only a sub-goal; the main goal was to compute the posterior shown in the following equation:

$$ p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} = \frac{p(x \mid z)\, p(z)}{\int p(x \mid z)\, p(z)\, dz} $$
If we cannot compute the evidence $p(x)$ in the denominator, we cannot compute the posterior $p(z \mid x)$ either.
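To make the integral above concrete, here is a small sketch using a toy one-dimensional latent variable model (the model and its parameters are my own illustrative choices, not from the article): a standard normal prior over $z$ and a Gaussian likelihood for $x$ given $z$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy latent variable model (illustrative choice):
#   z ~ N(0, 1)              (prior)
#   x | z ~ N(z, 0.5^2)      (likelihood)
sigma_x = 0.5
x_obs = 1.3  # a single observed data point

# Naive Monte Carlo estimate of the evidence:
#   p(x) = ∫ p(x|z) p(z) dz ≈ (1/N) Σ p(x | z_i),  z_i ~ p(z)
z_samples = rng.standard_normal(100_000)
evidence_mc = stats.norm.pdf(x_obs, loc=z_samples, scale=sigma_x).mean()

# For this conjugate toy model the evidence is also available in closed form:
#   p(x) = N(x; 0, 1 + sigma_x^2)
evidence_exact = stats.norm.pdf(x_obs, loc=0.0, scale=np.sqrt(1 + sigma_x**2))

print(evidence_mc, evidence_exact)  # the two values should be close
```

In one dimension this is easy; with a high-dimensional $z$, quadrature grids grow exponentially and naive Monte Carlo estimates become extremely high-variance, which is the intractability referred to above.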
A Hand-Wavy Introduction to Variational Inference
An informal definition of variational inference is that it is a way to approximate a complex distribution with a relatively simpler distribution through the process of optimization. But for optimization, we need a loss function that will tell us how “good” our approximation is. If our approximation is good, we should stop the optimization process. As we are trying to compare distributions, the obvious loss function that comes to mind is KL Divergence. As KL Divergence quantifies the discrepancy between two distributions, it can be used as an indicator of how “good” our approximation is. But there is a certain issue with using KL Divergence, which is discussed in the next section. Before we move on, let’s introduce some terminology and symbols. As I said, the surrogate distribution belongs to a simpler class of distributions, so we need to explicitly mention its parameters. Assume the exact posterior distribution is $p(z \mid x)$ and the surrogate distribution is $q_\phi(z)$, where $\phi$ denotes the variational parameters we optimize.
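As a quick illustration (a toy example of my own, not from the article), here is a Monte Carlo estimate of the reverse KL Divergence between a Gaussian surrogate $q_\phi$ and a fixed Gaussian target standing in for the complex distribution; the better the surrogate matches the target, the closer the estimate is to zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target distribution p (stands in for the complex posterior)
p = stats.norm(loc=2.0, scale=0.8)

def reverse_kl(mu, sigma, n=200_000):
    """Monte Carlo estimate of KL(q_phi || p) for a Gaussian q_phi = N(mu, sigma^2)."""
    q = stats.norm(loc=mu, scale=sigma)
    z = q.rvs(size=n, random_state=rng)          # samples from the surrogate
    return np.mean(q.logpdf(z) - p.logpdf(z))    # E_q[log q - log p]

print(reverse_kl(0.0, 1.0))  # poor approximation -> large KL
print(reverse_kl(2.0, 0.8))  # q matches p        -> KL close to 0
```

Note that this only works because we can evaluate the target density `p` explicitly; for a true posterior we cannot, which is exactly the issue discussed next.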
The Issue with using KL Divergence as a Loss Function
If we write down the KL Divergence (the reverse KL Divergence) formula between the surrogate $q_\phi(z)$ and the exact posterior $p(z \mid x)$, we get the following:

$$ D_{KL}\big(q_\phi(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{q_\phi(z)}{p(z \mid x)}\right] $$
Well, calculating this quantity requires knowing the posterior $p(z \mid x)$, which is exactly the intractable distribution we set out to approximate in the first place. The way out of this apparent circularity is shown in the next section.
Derivation of ELBO
Now we can replace $p(z \mid x)$ inside the KL Divergence using Bayes' theorem, i.e. $p(z \mid x) = \frac{p(x, z)}{p(x)}$. After replacing we get the following:

$$ D_{KL}\big(q_\phi(z)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{q_\phi(z)\, p(x)}{p(x, z)}\right] = \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{q_\phi(z)}{p(x, z)}\right] + \log p(x) $$

Notice that we have the quantity $\log p(x)$, the log-evidence, which does not depend on $z$ and therefore comes out of the expectation as a constant. Rearranging gives

$$ \log p(x) = \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{p(x, z)}{q_\phi(z)}\right] + D_{KL}\big(q_\phi(z)\,\|\,p(z \mid x)\big) $$
Now we know that KL Divergence is a non-negative quantity (a simple proof is available in the Jensen's Inequality article), which means

$$ D_{KL}\big(q_\phi(z)\,\|\,p(z \mid x)\big) \;\ge\; 0 $$

This means we can set a lower bound on the log-evidence $\log p(x)$. Therefore we can write the following:

$$ \log p(x) \;\ge\; \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{p(x, z)}{q_\phi(z)}\right] $$
This is the reason behind naming this quantity the evidence lower bound: it imposes a lower bound on the log-evidence. Now, since $\log p(x)$ is a constant with respect to $q_\phi$, maximizing this quantity in turn means minimizing $D_{KL}\big(q_\phi(z)\,\|\,p(z \mid x)\big)$, which is exactly the objective we could not optimize directly.
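We can check this identity numerically. Below is a sketch reusing the earlier toy conjugate Gaussian model (my own illustrative choice), where the evidence, the posterior, and the ELBO are all available in closed form, so we can verify that $\log p(x)$ equals the ELBO plus the KL to the posterior.

```python
import numpy as np
from scipy import stats

# Same toy conjugate model as before (illustrative choice, not from the article):
#   z ~ N(0, 1),   x | z ~ N(z, sigma_x^2)
sigma_x, x_obs = 0.5, 1.3

# Exact quantities available in this conjugate case
log_evidence = stats.norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(1 + sigma_x**2))
post_var = 1.0 / (1.0 + 1.0 / sigma_x**2)       # posterior variance
post_mean = post_var * x_obs / sigma_x**2        # posterior mean

# An arbitrary Gaussian surrogate q_phi(z) = N(m, s^2)
m, s = 0.5, 0.7

def gaussian_kl(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# ELBO = E_q[log p(x|z)] - KL(q || prior), both terms in closed form here
expected_loglik = stats.norm.logpdf(x_obs, loc=m, scale=sigma_x) - s**2 / (2 * sigma_x**2)
elbo = expected_loglik - gaussian_kl(m, s, 0.0, 1.0)

kl_to_posterior = gaussian_kl(m, s, post_mean, np.sqrt(post_var))

print(log_evidence)            # log p(x)
print(elbo + kl_to_posterior)  # matches log p(x)
print(elbo <= log_evidence)    # True: the ELBO never exceeds the log-evidence
```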
So, the mathematical expression of the ELBO is given below:

$$ \text{ELBO}(\phi) = \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{p(x, z)}{q_\phi(z)}\right] = \mathbb{E}_{q_\phi(z)}\big[\log p(x, z) - \log q_\phi(z)\big] $$

We can further simplify this as shown below:

$$ \text{ELBO}(\phi) = \mathbb{E}_{q_\phi(z)}\big[\log p(x \mid z)\big] - D_{KL}\big(q_\phi(z)\,\|\,p(z)\big) $$
This means that when we try to maximize the ELBO we are essentially trying to maximize the expected log-likelihood (the first term) while minimizing the KL Divergence between the surrogate and the prior (the second term).
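When these expectations have no closed form, the ELBO is usually estimated with samples from the surrogate. Here is a minimal sketch, again reusing the same toy Gaussian model (my own illustrative choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy model again (illustrative): z ~ N(0, 1), x | z ~ N(z, 0.5^2)
sigma_x, x_obs = 0.5, 1.3
m, s = 0.5, 0.7  # parameters phi of the Gaussian surrogate q_phi(z) = N(m, s^2)

# Monte Carlo estimate of ELBO = E_q[ log p(x|z) + log p(z) - log q_phi(z) ]
z = rng.normal(m, s, size=200_000)                           # z_i ~ q_phi(z)
log_lik = stats.norm.logpdf(x_obs, loc=z, scale=sigma_x)     # log p(x | z_i)
log_prior = stats.norm.logpdf(z, loc=0.0, scale=1.0)         # log p(z_i)
log_q = stats.norm.logpdf(z, loc=m, scale=s)                 # log q_phi(z_i)

elbo_mc = np.mean(log_lik + log_prior - log_q)
print(elbo_mc)  # ≈ -2.71 for these numbers, matching the closed-form value above
```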
The ELBO is commonly expressed with the symbol $\mathcal{L}$, or $\mathcal{L}(\phi)$ when we want to make the variational parameters explicit.
Connection to VAE
So far we have been talking about the ELBO without any specific application in mind. The Variational Autoencoder (VAE) uses the concept of ELBO directly as its training objective: an encoder network produces the parameters of the surrogate $q_\phi(z \mid x)$, a decoder network defines the likelihood $p_\theta(x \mid z)$, and both networks are trained by maximizing the ELBO.
Image adapted from Kingma, Diederik P., and Max Welling, “Auto-Encoding Variational Bayes.”
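To make the connection concrete, below is a minimal VAE sketch in PyTorch (my own illustrative code, not the reference implementation from the paper). The encoder outputs the mean and log-variance of the Gaussian surrogate $q_\phi(z \mid x)$, the decoder defines a Bernoulli likelihood $p_\theta(x \mid z)$, and training minimizes the negative ELBO, i.e. a reconstruction term plus the KL to the prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: diagonal-Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""

    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)      # mean of q_phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # log-variance of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        x_logits = self.dec(z)
        return x_logits, mu, logvar

def negative_elbo(x, x_logits, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)], estimated with one Monte Carlo sample
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # KL( q_phi(z|x) || N(0, I) ) in closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO

# Usage sketch: one optimization step on a batch of flattened binary images
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()  # stand-in batch; real code would load data
loss = negative_elbo(x, *model(x))
opt.zero_grad()
loss.backward()
opt.step()
```

The KL term has this closed form only because both $q_\phi(z \mid x)$ and the prior $p(z)$ are diagonal Gaussians; with other choices it would be estimated by sampling, just like the reconstruction term.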