Performing inference¶
Approximation of the posterior distribution can be divided into several steps:
1. Observe some nodes
2. Choose the inference engine
3. Initialize the posterior approximation
4. Run the inference algorithm
In order to illustrate these steps, we’ll be using the PCA model constructed in the previous section.
Observing nodes¶
First, let us generate some toy data:
>>> c = np.random.randn(10, 2)
>>> x = np.random.randn(2, 100)
>>> data = np.dot(c, x) + 0.1*np.random.randn(10, 100)
The data is provided by simply calling the observe method of a stochastic node:
>>> Y.observe(data)
It is important that the shape of the data array matches the plates and shape of the node Y. For instance, if Y were a Wishart node for 3×3 matrices with plates (5,1,10), the full shape of Y would be (5,1,10,3,3). The data array should have this shape exactly, that is, no broadcasting rules are applied.
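For the PCA model of the previous section, Y has scalar elements and plates (10, 100), so the toy data generated above has exactly the required shape:
>>> data.shape
(10, 100)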
Missing values¶
It is possible to mark missing values by providing a mask which is a boolean array:
>>> Y.observe(data, mask=[[True], [False], [False], [True], [True],
... [False], [True], [True], [True], [False]])
True means that the value is observed and False means that the value is missing. The shape of the above mask is (10,1), which broadcasts to the plates of Y, (10,100). Thus, the above mask means that the second, third, sixth and tenth rows of the data matrix are missing.
The mask is applied to the plates, not to the data array directly. This means that it is not possible to observe a random variable partially: each repetition defined by the plates is either fully observed or fully missing. It is often possible to circumvent this seemingly tight restriction by adding an observable child node which factorizes more.
The shape of the mask is broadcasted to the plates using standard NumPy broadcasting rules. So, if the variable has plates (5,1,10), the mask could have the shape (), (1,), (1,1), (1,1,1), (10,), (1,10), (1,1,10), (5,1,1) or (5,1,10). In order to speed up the inference, missing values are automatically integrated out if they are not needed as latent variables by child nodes. This leads to faster convergence and more accurate approximations.
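For example, a mask of shape (1, 100) would broadcast over the ten rows and mark whole columns of the data matrix as observed or missing. A minimal sketch (column_mask is only an illustration; we do not re-observe Y here):
>>> column_mask = np.ones((1, 100), dtype=bool)  # broadcasts to the plates (10, 100) of Y
>>> column_mask[0, 90:] = False                  # mark the last ten columns as missing
>>> column_mask.shape
(1, 100)
Such a mask could then be passed to Y.observe(data, mask=column_mask).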
Choosing the inference method¶
Inference methods can be found in the bayespy.inference package. Currently, only variational Bayesian approximation is implemented (bayespy.inference.VB). The inference engine is constructed by giving the stochastic nodes of the model.
>>> from bayespy.inference import VB
>>> Q = VB(Y, C, X, alpha, tau)
There is no need to give any deterministic nodes. Currently, the inference engine does not automatically search for stochastic parents and children, so it is important that all stochastic nodes of the model are given explicitly. This should be made more robust in future versions.
A node of the model can be obtained by using the name of the node as a key:
>>> Q['X']
<bayespy.inference.vmp.nodes.gaussian.GaussianARD object at 0x...>
Note that the returned object is the same as the node object itself:
>>> Q['X'] is X
True
Thus, one may use the object X
when it is available. However, if the model
and the inference engine are constructed in another function or module, the node
object may not be available directly and this feature becomes useful.
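For instance, if the model and the engine were built inside a helper function, a node could still be retrieved by its name (get_latent_states below is just an illustrative helper, not part of BayesPy):
>>> def get_latent_states(engine):
...     # Fetch the latent-state node by the name given to it at construction time
...     return engine['X']
>>> get_latent_states(Q) is X
True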
Initializing the posterior approximation¶
The inference engines give some initialization to the stochastic nodes by default. However, the inference algorithms can be sensitive to the initialization, so it is sometimes necessary to have better control over it. For VB, the following initialization methods are available:
initialize_from_prior: Use the current states of the parent nodes to update the node. This is the default initialization.
initialize_from_parameters: Use the given parameter values for the distribution.
initialize_from_value: Use the given value for the variable.
initialize_from_random: Draw a random value for the variable. The random sample is drawn from the current state of the node's distribution.
Note that initialize_from_value and initialize_from_random initialize the distribution with a value of the variable instead of parameters of the distribution. Thus, after the initialization the distribution is actually a delta distribution with a peak at that value. Such a state does not have proper natural parameter values nor a normalization, so the VB lower bound terms are np.nan for this initial state.
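For instance, tau could be initialized to a fixed value (shown only as an illustration; we immediately restore the default prior-based initialization so that the running example is unchanged):
>>> tau.initialize_from_value(10)  # now a delta peak at 10, so its lower bound term would be np.nan
>>> tau.initialize_from_prior()    # restore the default initialization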
These initialization methods can also be used to perform slightly more complex initializations. For instance, a Gaussian distribution could be initialized with a random mean and variance 0.1. In our PCA model, this can be obtained by:
>>> X.initialize_from_parameters(np.random.randn(1, 100, D), 10)
Note that the shape of the random mean is the concatenation of the plates (1, 100) and the variable shape (D,). In addition, GaussianARD uses precision instead of variance as its second parameter, so we passed a precision of 10 to get a variance of 0.1. This random initialization is important in our PCA model because the default initialization gives both C and X zero mean. If one of them has zero mean when the other is updated, the updated variable also gets zero mean. This would lead to an update algorithm in which both means remain zero and effectively no latent space is found. Thus, it is important to give a non-zero random initialization to X if C is updated before X the first time. It is typical that at least some nodes need to be initialized with some randomness.
By default, nodes are initialized with the method initialize_from_prior. This method is not very time consuming, but if for any reason you want to avoid that default initialization computation, you can provide initialize=False when creating the stochastic node. However, the node then does not have a proper state, which leads to errors in VB learning unless the distribution is initialized using one of the methods above.
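As a small sketch (X_uninit is just a throwaway node mirroring how X was created in the previous section, not part of the model), a node can be created without the default initialization and then given a proper state explicitly:
>>> from bayespy.nodes import GaussianARD
>>> X_uninit = GaussianARD(0, 1, shape=(D,), plates=(1, 100), initialize=False)
>>> X_uninit.initialize_from_parameters(np.ones((1, 100, D)), 1)  # now it has a proper state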
Running the inference algorithm¶
The approximation methods are based on iterative algorithms, which can be run using the update method. By default, it takes one iteration step, updating all nodes once:
>>> Q.update()
Iteration 1: loglike=-9.305259e+02 (... seconds)
The loglike value reports the VB lower bound. The nodes are updated in the same order as they were given when creating Q. If you want to change the order or update only some of the nodes, you can give the nodes you want to update as arguments, and they are updated in the given order:
>>> Q.update(C, X)
Iteration 2: loglike=-8.818976e+02 (... seconds)
It is also possible to give the same node several times:
>>> Q.update(C, X, C, tau)
Iteration 3: loglike=-8.071222e+02 (... seconds)
Note that each call to update is counted as one iteration step although not all variables are necessarily updated. Instead of doing one iteration step at a time, the repeat keyword argument can be used to perform several iteration steps:
>>> Q.update(repeat=10)
Iteration 4: loglike=-7.167588e+02 (... seconds)
Iteration 5: loglike=-6.827873e+02 (... seconds)
Iteration 6: loglike=-6.259477e+02 (... seconds)
Iteration 7: loglike=-4.725400e+02 (... seconds)
Iteration 8: loglike=-3.270816e+02 (... seconds)
Iteration 9: loglike=-2.208865e+02 (... seconds)
Iteration 10: loglike=-1.658761e+02 (... seconds)
Iteration 11: loglike=-1.469468e+02 (... seconds)
Iteration 12: loglike=-1.420311e+02 (... seconds)
Iteration 13: loglike=-1.405139e+02 (... seconds)
The VB algorithm stops automatically if it converges, that is, the relative change in the lower bound is below some threshold:
>>> Q.update(repeat=1000)
Iteration 14: loglike=-1.396481e+02 (... seconds)
...
Iteration 488: loglike=-1.224106e+02 (... seconds)
Converged at iteration 488.
Now the algorithm stopped before taking 1000 iteration steps because it converged. The relative tolerance can be adjusted by providing the tol keyword argument to the update method:
>>> Q.update(repeat=10000, tol=1e-6)
Iteration 489: loglike=-1.224094e+02 (... seconds)
...
Iteration 847: loglike=-1.222506e+02 (... seconds)
Converged at iteration 847.
Making the tolerance smaller may improve the result, but it may also significantly increase the number of iteration steps needed for convergence.
Instead of using the update method of the inference engine VB, it is possible to call the update methods of the nodes directly:
>>> C.update()
or
>>> Q['C'].update()
However, this is not recommended, because the update
method of the inference
engine VB
is a wrapper which, in addition to calling the nodes’ update
methods, checks for convergence and does a few other useful minor things. But
if for any reason these direct update methods are needed, they can be used.
Parameter expansion¶
Sometimes the VB algorithm converges very slowly. This may happen when the variables are strongly coupled in the true posterior but factorized in the approximate posterior. This coupling leads to zigzagging of the variational parameters, which makes the progress slow. One solution to this problem is to use parameter expansion. The idea is to add an auxiliary variable which parameterizes the posterior approximations of several variables. Optimizing this auxiliary variable then optimizes those posterior approximations jointly, leading to faster convergence.
The parameter expansion is model specific. Currently in BayesPy, only state-space models have built-in parameter expansions available. These state-space models contain a variable which is a dot product of two variables plus some noise (here in the notation of our PCA model):
y_{mn} = \mathbf{c}_m^T \mathbf{x}_n + \text{noise}
The parameter expansion can be motivated by noticing that we can add an auxiliary rotation matrix \mathbf{R} which rotates the variables \mathbf{C} and \mathbf{X} so that the dot product is unaffected:
\mathbf{c}_m^T \mathbf{x}_n = \mathbf{c}_m^T \mathbf{R}^{-1} \mathbf{R} \mathbf{x}_n = (\mathbf{R}^{-T} \mathbf{c}_m)^T (\mathbf{R} \mathbf{x}_n)
Now, applying this rotation to the posterior approximations q(\mathbf{C}) and q(\mathbf{X}), and optimizing the VB lower bound with respect to the rotation, leads to a parameterized joint optimization of q(\mathbf{C}) and q(\mathbf{X}).
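As a quick numeric sanity check of this identity (purely illustrative; R_test is just an arbitrary invertible matrix, not one produced by BayesPy):
>>> c_test = np.arange(1.0, D + 1)                     # an arbitrary loading vector
>>> x_test = np.ones(D)                                # an arbitrary latent state
>>> R_test = np.eye(D) + np.diag(np.ones(D - 1), k=1)  # an arbitrary invertible matrix
>>> lhs = np.dot(c_test, x_test)
>>> rhs = np.dot(np.linalg.solve(R_test.T, c_test), np.dot(R_test, x_test))
>>> np.allclose(lhs, rhs)
True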
The available parameter expansion methods are in the module transformations:
>>> from bayespy.inference.vmp import transformations
First, you create the rotation transformations for the two variables:
>>> rotX = transformations.RotateGaussianARD(X)
>>> rotC = transformations.RotateGaussianARD(C, alpha)
Here, the rotation for C is also given the ARD parameters alpha so that they are updated simultaneously. In addition to RotateGaussianARD, a few other built-in rotations are defined, for instance RotateGaussian and RotateGaussianMarkovChain. It is extremely important that the model satisfies the assumptions made by the rotation class, and the user is mostly responsible for checking this. The optimizer for the rotations is constructed by giving the two rotations and the dimensionality of the rotated space:
>>> R = transformations.RotationOptimizer(rotC, rotX, D)
Now, calling the rotate method will find the optimal rotation and update the relevant nodes (X, C and alpha) accordingly:
>>> R.rotate()
Let us see how our iteration would have gone if we had used this parameter expansion. First, let us re-initialize our nodes and VB algorithm:
>>> alpha.initialize_from_prior()
>>> C.initialize_from_prior()
>>> X.initialize_from_parameters(np.random.randn(1, 100, D), 10)
>>> tau.initialize_from_prior()
>>> Q = VB(Y, C, X, alpha, tau)
Then, the rotation is set to run after each iteration step:
>>> Q.callback = R.rotate
Now the iteration converges to the relative tolerance much faster:
>>> Q.update(repeat=1000, tol=1e-6)
Iteration 1: loglike=-9.363...e+02 (... seconds)
...
Iteration 18: loglike=-1.221354e+02 (... seconds)
Converged at iteration 18.
The convergence took 18 iterations with rotations, compared to 488 or 847 iterations without the parameter expansion. In addition, the lower bound is slightly improved. Comparing the number of iteration steps is fair in this case because the cost per iteration step is approximately the same with or without the parameter expansion. Sometimes the parameter expansion has the drawback that it converges to a bad local optimum. Usually, this can be solved by updating the nodes near the observations a few times before starting to update the hyperparameters and to use the parameter expansion, as sketched below. In any case, the parameter expansion is practically necessary for state-space models in order to converge to a proper solution in a reasonable time.
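A sketch of that warm-up trick in our PCA example might look like the following (illustrative only; the number of warm-up iterations is a judgment call and the iteration printout is omitted here):
>>> # Warm up the nodes closest to the observations before updating the hyperparameters
>>> Q.update(C, X, tau, repeat=5)
>>> # Then enable the parameter expansion and update all nodes until convergence
>>> Q.callback = R.rotate
>>> Q.update(repeat=1000, tol=1e-6)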