Log Likelihood estimation

Purpose

Performing likelihood ratio tests and computing information criteria for a given model requires computation of the log-likelihood

${\cal L}{\cal L}_y(\hat{\theta}) = \log({\cal L}_y(\hat{\theta})) \triangleq \log(p(y;\hat{\theta}))$

where $\hat{\theta}$ is the vector of population parameter estimates for the model being considered. The log-likelihood cannot be computed in closed form for nonlinear mixed effects models. It can however be estimated in a general framework for all kinds of data and models using the importance sampling Monte Carlo method. This method has the advantage of providing an unbiased estimate of the log-likelihood – even for nonlinear models – whose variance can be controlled by the Monte Carlo size.

Two different algorithms are proposed to estimate the log-likelihood: by linearization and by Importance sampling. The estimated log-likelihoods are computed and stored in the LLInformation folder in the result folder. In this folder, two files are stored:

logLikelihood.txt containing the OFV (objective function value), AIC, and BIC.
individualLL.txt containing the -2LL for each individual.

Log-likelihood by importance sampling

The observed log-likelihood ${\cal LL}(\theta;y)=\log({\cal L}(\theta;y))$ can be estimated without requiring approximation of the model, using a Monte Carlo approach. Since

${\cal LL}(\theta;y) = \log(p(y;\theta)) = \sum_{i=1}^{N} \log(p (y_i;\theta))$

we can estimate $\log(p(y_i;\theta))$ for each individual and derive an estimate of the log-likelihood as the sum of these individual log-likelihoods. We will now explain how to estimate $\log(p(y_i;\theta))$ for any individual i. Using the $\phi$ -representation of the model (the individual parameters are transformed to be Gaussian), notice first that $p(y_i;\theta)$ can be decomposed as follows:

$p(y_i;\theta) = \int p(y_i,\phi_i;\theta)d\phi_i = \int p(y_i|\phi_i;\theta)p(\phi_i;\theta)d\phi_i = \mathbb{E}_{p_{\phi_i}}\left(p(y_i|\phi_i;\theta)\right)$

Thus, $p(y_i;\theta)$ is expressed as a mean. It can therefore be approximated by an empirical mean using a Monte Carlo procedure:

Draw M independent values $\phi_i^{(1)}$ , $\phi_i^{(2)}$ , …, $\phi_i^{(M)}$ from the marginal distribution $p_{\phi_i}(.;\theta)$ .
Estimate $p(y_i;\theta)$ with $\hat{p}_{i,M}=\frac{1}{M}\sum_{m=1}^{M}p(y_i | \phi_i^{(m)};\theta)$

By construction, this estimator is unbiased, and consistent since its variance decreases as 1/M:

$\mathbb{E}\left(\hat{p}_{i,M}\right)=\mathbb{E}_{p_{\phi_i}}\left(p(y_i|\phi_i^{(m)};\theta)\right) = p(y_i;\theta) ~~~~\mbox{Var}\left(\hat{p}_{i,M}\right) = \frac{1}{M} \mbox{Var}_{p_{\phi_i}}\left(p(y_i|\phi_i^{(m)};\theta)\right)$

We could consider ourselves satisfied with this estimator since we “only” have to select M large enough to get an estimator with a small variance. Nevertheless, it is possible to improve the statistical properties of this estimator.

The problem is that it is not possible to generate the $\phi_i^{(m)}$ with this conditional distribution, since that would require to compute a normalizing constant, which here is precisely $p(y_i;\theta)$ .

Nevertheless, this conditional distribution can be estimated using the Metropolis-Hastings algorithm described in the Metropolis-Hastings algorithm for simulating the individual parameters and a practical proposal “close” to the optimal proposal $p_{\phi_i|y_i}$ can be derived. We can then expect to get a very accurate estimate with a relatively small Monte Carlo size M.

The mean and variance of the conditional distribution $p_{\phi_i|y_i}$ are estimated by Metropolis-Hastings for each individual i. Then, the $\phi_i^{(m)}$ are drawn with a noncentral student t-distribution:

$\phi_i^{(m)} = \mu_i + \sigma_i \times T_{i,m}$

where $\mu_i$ and $\sigma^2_i$ are estimates of $\mathbb{E}\left(\phi_i|y_i;\theta\right)$ and $\mbox{Var}\left(\phi_i|y_i;\theta\right)$ , and $(T_{i,m})$ is a sequence of i.i.d. random variables distributed with a Student’s t-distribution with $\nu$ degrees of freedom.

Remark: Even if $\hat{\cal L}_y(\theta)=\prod_{i=1}^{N}\hat{p}_{i,M}$ is an unbiased estimator of ${\cal L}_y(\theta)$ , $\hat{\cal LL}_y(\theta)$ is a biased estimator of ${\cal LL}_y(\theta)$ . Indeed, by Jensen’s inequality, we have :

$\mathbb{E}\left(\log(\hat{\cal L}_y(\theta))\right) \leq \log \left(\mathbb{E}\left(\hat{\cal L}_y(\theta)\right)\right)=\log\left({\cal L}_y(\theta)\right)$

However, the bias decreases as M increases and also if $\hat{\cal L}_y(\theta)$ is close to ${\cal L}_y(\theta)$ . It is therefore highly recommended to use a proposal as close as possible to the conditional distribution $p_{\phi_i|y_i}$ , which means having to estimate this conditional distribution before estimating the log-likelihood (i.e run task “individual parameter” with “Cond. distribution” option).

Remark: The standard error of all the draws is proposed. It is a representation of impact of the variability of the draws of the proposed population parameters and not of the uncertainty of the model.

Advance settings for the log-likelihood

A t-distribution is used as proposal. The number of degrees of freedom of this distribution can be either fixed or optimized. In such a case, the default possible values are 2, 5, 10 and 20 degree of freedom. A distribution with a small number of degree of freedom (i.e. heavy tails) should be avoided in case of stiff ODE’s defined models. We recommend to set a degree of freedom at 5.

Log-likelihood by linearization

The likelihood of the nonlinear mixed effects model cannot be computed in a closed-form. An alternative is to approximate this likelihood by the likelihood of the Gaussian model deduced from the nonlinear mixed effects model after linearization of the function f (defining the structural model) around the predictions of the individual parameters $(\phi_i; 1 \leq i \leq N)$ .
Notice that the log-likelihood can not be computed by linearization for discrete outputs (categorical, count, etc.) nor for mixture models or when the posterior distribution have been estimated for some parameters with priors.

Best practices: When should I use the linearization and when should I use the importance sampling?

Firstly, it is only possible to use the linearization algorithm for the continuous data. In that case, this method is generally much faster than importance sampling method and also gives good estimates of the LL. The LL calculation by model linearization will generally be able to identify the main features of the model. More precise– and time-consuming – estimation procedures such as stochastic approximation and importance sampling will have very limited impact in terms of decisions for these most obvious features. Selection of the final model should instead use the unbiased estimator obtained by Monte Carlo. In the warfarin example, the evaluation of the log-likelihood (along with the AIC and BIC) is presented with the CPU time.

Method	-2LL	AIC	BIC	CPU time [s]
Linearization	2178.78	2220.78	2251.56	1.5
Important sampling	2119.76	2161.76	2192.54	27.1