Variational message passing
This section briefly describes the variational message passing (VMP) framework, which is currently the only implemented inference engine in BayesPy. The variational Bayesian (VB) inference engine in BayesPy assumes that the posterior approximation factorizes with respect to nodes and plates. VMP is based on updating one node at a time (the plates in one node can be updated simultaneously) and iteratively updating all nodes in turns until convergence.
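For concreteness, here is a minimal sketch of how this engine is typically driven, using the model-construction and VB interface described in the BayesPy user guide; the toy model and the synthetic data below are made up purely for illustration:

```python
import numpy as np
from bayespy.nodes import GaussianARD, Gamma
from bayespy.inference import VB

# A made-up toy model: unknown mean and precision of scalar Gaussian data.
mu = GaussianARD(0, 1e-6)                 # vague prior on the mean
tau = Gamma(1e-6, 1e-6)                   # vague prior on the precision
y = GaussianARD(mu, tau, plates=(100,))   # observation node with 100 plates

y.observe(np.random.normal(5.0, 2.0, size=100))  # synthetic observations

# VMP: each update() call sweeps over the nodes, updating one node at a time
# (all plates of a node are updated simultaneously), repeated until convergence.
Q = VB(y, mu, tau)
Q.update(repeat=50)
```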
Standard update equation
The general update equation for the factorized approximation of node $\mathbf{X}$ is the following:

(1)
$$\log q(\mathbf{X}) = \langle \log p(\mathbf{X} \mid \mathrm{pa}(\mathbf{X})) \rangle + \sum_{\mathbf{Y} \in \mathrm{ch}(\mathbf{X})} \langle \log p(\mathbf{Y} \mid \mathrm{pa}(\mathbf{Y})) \rangle + \mathrm{const},$$

where $\mathrm{pa}(\mathbf{X})$ and $\mathrm{ch}(\mathbf{X})$ are the set of parents and children of $\mathbf{X}$, respectively. Thus, the posterior approximation of a node is updated by taking a sum of the expectations of all log densities in which the node variable appears. The expectations are over the approximate distributions of all variables other than $\mathbf{X}$. Actually, not all the variables are needed, because the non-constant part depends only on the Markov blanket of $\mathbf{X}$. This leads to a local optimization scheme, which uses messages from neighbouring nodes.
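As a concrete example (a toy model chosen here for illustration, not taken from the text above), let $\mu \sim \mathcal{N}(0, 1)$ and let $y_n \sim \mathcal{N}(\mu, 1)$, $n = 1, \ldots, N$, be observed. The update (1) for $q(\mu)$ collects only the log densities in which $\mu$ appears, and because the $y_n$ are observed the expectations are trivial:

$$\log q(\mu) = \log p(\mu) + \sum_{n=1}^{N} \log p(y_n \mid \mu) + \mathrm{const} = -\tfrac{1}{2}\mu^2 - \tfrac{1}{2} \sum_{n=1}^{N} (y_n - \mu)^2 + \mathrm{const},$$

so $q(\mu)$ is Gaussian with precision $1 + N$ and mean $\frac{1}{1+N}\sum_{n=1}^{N} y_n$. Only the Markov blanket of $\mu$ (here, its children $y_n$) enters the update.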
The messages are simple for conjugate exponential family models. An exponential family distribution has the following log probability density function:
(2)
$$\log p(\mathbf{X} \mid \boldsymbol{\Theta}) = \mathbf{u}_{\mathbf{X}}(\mathbf{X})^{\mathrm{T}} \boldsymbol{\phi}_{\mathbf{X}}(\boldsymbol{\Theta}) + g_{\mathbf{X}}(\boldsymbol{\Theta}) + f_{\mathbf{X}}(\mathbf{X}),$$

where $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_j\}$ is the set of parents, $\mathbf{u}_{\mathbf{X}}$ is the sufficient statistic vector, $\boldsymbol{\phi}_{\mathbf{X}}$ is the natural parameter vector, $g_{\mathbf{X}}$ is the negative log normalizer, and $f_{\mathbf{X}}$ is the log base function. Note that the log density is linear with respect to the terms that are functions of $\mathbf{X}$: $\mathbf{u}_{\mathbf{X}}(\mathbf{X})$ and $f_{\mathbf{X}}(\mathbf{X})$. If a parent has a conjugate prior, (2) is also linear with respect to the parent's sufficient statistic vector. Thus, (2) can be re-organized with respect to a parent $\boldsymbol{\theta}_j$ as

$$\log p(\mathbf{X} \mid \boldsymbol{\Theta}) = \mathbf{u}_{\boldsymbol{\theta}_j}(\boldsymbol{\theta}_j)^{\mathrm{T}} \boldsymbol{\phi}_{\mathbf{X} \rightarrow \boldsymbol{\theta}_j}\big(\mathbf{X}, \{\boldsymbol{\theta}_k\}_{k \neq j}\big) + \mathrm{const},$$

where $\mathbf{u}_{\boldsymbol{\theta}_j}$ is the sufficient statistic vector of $\boldsymbol{\theta}_j$ and the constant part is constant with respect to $\boldsymbol{\theta}_j$. Thus, the update equation (1) for $\mathbf{X}$ can be written as

$$\log q(\mathbf{X}) = \mathbf{u}_{\mathbf{X}}(\mathbf{X})^{\mathrm{T}} \Big( \langle \boldsymbol{\phi}_{\mathbf{X}} \rangle + \sum_{\mathbf{Y} \in \mathrm{ch}(\mathbf{X})} \langle \boldsymbol{\phi}_{\mathbf{Y} \rightarrow \mathbf{X}} \rangle \Big) + f_{\mathbf{X}}(\mathbf{X}) + \mathrm{const},$$

where the summation is over all the child nodes of $\mathbf{X}$. Because of the conjugacy, $\boldsymbol{\phi}_{\mathbf{X}}$ depends (multi)linearly on the parents' sufficient statistic vectors. Similarly, $\boldsymbol{\phi}_{\mathbf{Y} \rightarrow \mathbf{X}}$ depends (multi)linearly on the expectations of the children's and co-parents' sufficient statistics. This gives the following update equation for the natural parameter vector of the posterior approximation $q(\mathbf{X})$:

(3)
$$\tilde{\boldsymbol{\phi}}_{\mathbf{X}} = \langle \boldsymbol{\phi}_{\mathbf{X}} \rangle + \sum_{\mathbf{Y} \in \mathrm{ch}(\mathbf{X})} \langle \boldsymbol{\phi}_{\mathbf{Y} \rightarrow \mathbf{X}} \rangle.$$
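To make (3) concrete, the following sketch accumulates the prior natural parameters and the child messages in plain NumPy for a made-up conjugate example (a Gaussian mean $\mu \sim \mathcal{N}(m_0, \tau_0^{-1})$ with observed children $y_n \sim \mathcal{N}(\mu, \tau^{-1})$ and known $\tau$); this is not BayesPy's internal code, just a numerical illustration of the update:

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 2.0                     # known noise precision of the observations
m0, tau0 = 0.0, 0.1           # prior mean and precision of mu
y = rng.normal(3.0, 1.0, size=50)

# Natural parameters of the prior p(mu) = N(m0, 1/tau0), matching the
# sufficient statistics u(mu) = [mu, mu^2]: phi = [tau0*m0, -tau0/2].
phi_prior = np.array([tau0 * m0, -tau0 / 2])

# Message from each observed child y_n to mu: [tau*y_n, -tau/2].
messages = np.array([[tau * yn, -tau / 2] for yn in y])

# Update (3): posterior natural parameters = prior + sum of child messages.
phi_post = phi_prior + messages.sum(axis=0)

# Convert back to mean and precision and compare with the textbook posterior.
tau_post = -2 * phi_post[1]
mean_post = phi_post[0] / tau_post
print(mean_post, tau_post)
print((tau0 * m0 + tau * y.sum()) / (tau0 + len(y) * tau), tau0 + len(y) * tau)
```

Both printed lines agree, because for observed children the message expectations reduce to plugging in the data.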
Variational messages
The update equation (3) leads to a message passing scheme: the term $\langle \boldsymbol{\phi}_{\mathbf{X}} \rangle$ is a function of the parents' sufficient statistic vectors and the term $\langle \boldsymbol{\phi}_{\mathbf{Y} \rightarrow \mathbf{X}} \rangle$ can be interpreted as a message from the child node $\mathbf{Y}$. Thus, the message from the child node $\mathbf{Y}$ to the parent node $\mathbf{X}$ is

$$\mathbf{m}_{\mathbf{Y} \rightarrow \mathbf{X}} \equiv \langle \boldsymbol{\phi}_{\mathbf{Y} \rightarrow \mathbf{X}} \rangle,$$

which can be computed as a function of the sufficient statistic vectors of the co-parent nodes of $\mathbf{X}$ and the sufficient statistic vector of the child node $\mathbf{Y}$. The message from the parent node $\mathbf{X}$ to the child node $\mathbf{Y}$ is simply the expectation of the sufficient statistic vector:

$$\mathbf{m}_{\mathbf{X} \rightarrow \mathbf{Y}} \equiv \langle \mathbf{u}_{\mathbf{X}} \rangle.$$

In order to compute the expectation of the sufficient statistic vector we need to write $q(\mathbf{X})$ as

$$\log q(\mathbf{X}) = \mathbf{u}_{\mathbf{X}}(\mathbf{X})^{\mathrm{T}} \tilde{\boldsymbol{\phi}}_{\mathbf{X}} + \tilde{g}_{\mathbf{X}}(\tilde{\boldsymbol{\phi}}_{\mathbf{X}}) + f_{\mathbf{X}}(\mathbf{X}),$$

where $\tilde{\boldsymbol{\phi}}_{\mathbf{X}}$ is the natural parameter vector of $q(\mathbf{X})$. Now, the expectation of the sufficient statistic vector is defined as

(4)
$$\langle \mathbf{u}_{\mathbf{X}} \rangle = -\frac{\partial \tilde{g}_{\mathbf{X}}}{\partial \tilde{\boldsymbol{\phi}}_{\mathbf{X}}}(\tilde{\boldsymbol{\phi}}_{\mathbf{X}}).$$
We call this expectation of the sufficient statistic vector the moments vector.
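For example (a standard univariate Gaussian case, spelled out here for illustration), if $q(x)$ is Gaussian with sufficient statistics $\mathbf{u}(x) = [x, x^2]^{\mathrm{T}}$ and natural parameters $\tilde{\boldsymbol{\phi}} = [\tilde{\phi}_1, \tilde{\phi}_2]^{\mathrm{T}} = [\tau\mu, -\tau/2]^{\mathrm{T}}$, then $\tilde{g}(\tilde{\boldsymbol{\phi}}) = \frac{\tilde{\phi}_1^2}{4\tilde{\phi}_2} + \frac{1}{2}\log(-2\tilde{\phi}_2)$ and (4) gives

$$\langle \mathbf{u} \rangle = -\frac{\partial \tilde{g}}{\partial \tilde{\boldsymbol{\phi}}} = \begin{bmatrix} -\dfrac{\tilde{\phi}_1}{2\tilde{\phi}_2} \\[6pt] \dfrac{\tilde{\phi}_1^2}{4\tilde{\phi}_2^2} - \dfrac{1}{2\tilde{\phi}_2} \end{bmatrix} = \begin{bmatrix} \mu \\ \mu^2 + \tau^{-1} \end{bmatrix},$$

i.e. the familiar first and second moments $\langle x \rangle$ and $\langle x^2 \rangle$.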
Lower bound
Computing the VB lower bound is not necessary in order to find the posterior approximation, although it is extremely useful in monitoring convergence and possible bugs. The VB lower bound can be written as
$$\mathcal{L} = \langle \log p(\mathbf{Y}, \mathbf{X}) \rangle - \langle \log q(\mathbf{X}) \rangle,$$

where $\mathbf{Y}$ is the set of all observed variables and $\mathbf{X}$ is the set of all latent variables. It can also be written as

$$\mathcal{L} = \sum_{\mathbf{y} \in \mathbf{Y}} \langle \log p(\mathbf{y} \mid \mathrm{pa}(\mathbf{y})) \rangle + \sum_{\mathbf{x} \in \mathbf{X}} \Big( \langle \log p(\mathbf{x} \mid \mathrm{pa}(\mathbf{x})) \rangle - \langle \log q(\mathbf{x}) \rangle \Big),$$

which shows that observed and latent variables contribute differently to the lower bound. These contributions have simple forms for exponential family nodes. Observed exponential family nodes contribute to the lower bound as follows:

$$\langle \log p(\mathbf{y} \mid \mathrm{pa}(\mathbf{y})) \rangle = \mathbf{u}_{\mathbf{y}}(\mathbf{y})^{\mathrm{T}} \langle \boldsymbol{\phi}_{\mathbf{y}} \rangle + \langle g_{\mathbf{y}} \rangle + f_{\mathbf{y}}(\mathbf{y}),$$

where $\mathbf{y}$ is the observed data. On the other hand, latent exponential family nodes contribute to the lower bound as follows:

$$\langle \log p(\mathbf{x} \mid \mathrm{pa}(\mathbf{x})) \rangle - \langle \log q(\mathbf{x}) \rangle = \langle \mathbf{u}_{\mathbf{x}} \rangle^{\mathrm{T}} \langle \boldsymbol{\phi}_{\mathbf{x}} \rangle + \langle g_{\mathbf{x}} \rangle - \langle \mathbf{u}_{\mathbf{x}} \rangle^{\mathrm{T}} \tilde{\boldsymbol{\phi}}_{\mathbf{x}} - \tilde{g}_{\mathbf{x}}(\tilde{\boldsymbol{\phi}}_{\mathbf{x}}).$$
If a node is partially observed and partially unobserved, these formulas are applied plate-wise appropriately.
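In practice, the lower bound is mostly used to monitor convergence and to catch bugs: each VMP update must not decrease it. A minimal sketch, assuming the VB engine's update and compute_lowerbound methods behave as in the user guide (the toy model is the same synthetic one as in the sketch at the top of this section):

```python
import numpy as np
from bayespy.nodes import GaussianARD, Gamma
from bayespy.inference import VB

# Toy model: unknown mean and precision of scalar Gaussian observations.
mu = GaussianARD(0, 1e-6)
tau = Gamma(1e-6, 1e-6)
y = GaussianARD(mu, tau, plates=(100,))
y.observe(np.random.normal(0.0, 1.0, size=100))

Q = VB(y, mu, tau)
previous = -np.inf
for _ in range(100):
    Q.update()                      # one VMP sweep over all nodes
    L = Q.compute_lowerbound()      # current value of the VB lower bound
    if L < previous - 1e-8:
        # The bound must increase monotonically; a decrease indicates a bug.
        raise RuntimeError("Lower bound decreased")
    if L - previous < 1e-6:
        break                       # converged
    previous = L
```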
Terms
To summarize, implementing VMP requires writing the following terms for each stochastic exponential family node:
$\langle \boldsymbol{\phi} \rangle$: the expectation of the prior natural parameter vector
    Computed as a function of the messages from parents.
$\tilde{\boldsymbol{\phi}}$: the natural parameter vector of the posterior approximation
    Computed as a sum of $\langle \boldsymbol{\phi} \rangle$ and the messages from children.
$\langle \mathbf{u} \rangle$: the posterior moments vector
    Computed as a function of $\tilde{\boldsymbol{\phi}}$ as defined in (4).
$\mathbf{u}(\mathbf{y})$: the moments vector for given data $\mathbf{y}$
    Computed as a function of the observed data $\mathbf{y}$.
$\langle g \rangle$: the expectation of the negative log normalizer of the prior
    Computed as a function of parent moments.
$\tilde{g}$: the negative log normalizer of the posterior approximation
    Computed as a function of $\tilde{\boldsymbol{\phi}}$.
$f(\mathbf{y})$: the log base measure for given data $\mathbf{y}$
    Computed as a function of the observed data $\mathbf{y}$.
$\mathbf{m}_{\mathbf{X} \rightarrow \boldsymbol{\theta}}$: the message to a parent $\boldsymbol{\theta}$
    Computed as a function of the moments of this node and the other parents.
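The following skeleton is purely illustrative: the class and method names are hypothetical and only show how these terms could be organized for a univariate Gaussian node with a stochastic mean parent and a known precision; BayesPy's actual node classes (under bayespy.inference.vmp.nodes) structure these computations differently in detail:

```python
import numpy as np

class ToyGaussianTerms:
    """Hypothetical skeleton (not BayesPy's actual class layout): the VMP terms
    for a univariate Gaussian node x ~ N(mu, 1/tau) with a stochastic mean
    parent mu and a known precision tau. Sufficient statistics u(x) = [x, x**2]."""

    def __init__(self, tau):
        self.tau = tau

    def compute_phi(self, mu_moments):
        # <phi>: expectation of the prior natural parameters, from the parent
        # moments mu_moments = [<mu>, <mu^2>]
        return np.array([self.tau * mu_moments[0], -self.tau / 2])

    def compute_phi_tilde(self, phi, child_messages):
        # phi-tilde: <phi> plus the sum of the child messages, eq. (3)
        return phi + np.sum(child_messages, axis=0)

    def compute_moments(self, phi_tilde):
        # <u>: posterior moments from phi-tilde, eq. (4) for the Gaussian case
        tau = -2 * phi_tilde[1]
        mean = phi_tilde[0] / tau
        return np.array([mean, mean**2 + 1 / tau])

    def compute_fixed_moments(self, y):
        # u(y): moments of an observed value
        return np.array([y, y**2])

    def compute_g(self, mu_moments):
        # <g>: expectation of the negative log normalizer of the prior
        return -0.5 * self.tau * mu_moments[1] + 0.5 * np.log(self.tau)

    def compute_g_tilde(self, phi_tilde):
        # g-tilde: negative log normalizer of the posterior approximation
        return phi_tilde[0]**2 / (4 * phi_tilde[1]) + 0.5 * np.log(-2 * phi_tilde[1])

    def compute_f(self, y):
        # f(y): log base measure of the Gaussian (constant for every y)
        return -0.5 * np.log(2 * np.pi)

    def compute_message_to_parent(self, x_moments):
        # m: message to the mean parent, [tau*<x>, -tau/2]
        return np.array([self.tau * x_moments[0], -self.tau / 2])
```

Given such functions, a generic VMP engine only needs to route messages between nodes and apply (3) and (4).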
Deterministic nodes require only the following terms:
$\langle \mathbf{u} \rangle$: the posterior moments vector
    Computed as a function of the messages from the parents.
$\mathbf{m}_{\mathbf{X} \rightarrow \boldsymbol{\theta}}$: the message to a parent $\boldsymbol{\theta}$
    Computed as a function of the messages from the other parents and all children.
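As an equally hypothetical illustration, a deterministic node computing $y = a x$ for a fixed scalar $a$ and a univariate Gaussian parent only needs to map the parent's moments forward and the children's messages backward:

```python
import numpy as np

class ToyScaleNode:
    """Hypothetical deterministic node y = a*x for a univariate Gaussian parent
    with moments [<x>, <x^2>]. It has no natural parameters of its own."""

    def __init__(self, a):
        self.a = a

    def compute_moments(self, x_moments):
        # <u_y> = [a*<x>, a^2*<x^2>], a function of the parent's moments only
        return np.array([self.a * x_moments[0], self.a**2 * x_moments[1]])

    def compute_message_to_parent(self, child_message):
        # A child message [m1, m2] that multiplies u_y(y) = [y, y^2] is rewritten
        # in terms of u_x(x) = [x, x^2]: m1*(a*x) + m2*(a*x)^2.
        return np.array([self.a * child_message[0], self.a**2 * child_message[1]])
```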