Details of the parameter estimation

The probability model

Maximum likelihood estimates are based on the probability model for the observed responses. In the probability model the distribution of the responses is expressed as a function of one or more parameters.

For a continuous distribution the probability density is a function of the responses, given the parameters. The likelihood function is the same expression as the probability density but regarding the observed values as fixed and the parameters as varying.

In general a mixed-effects model incorporates two random variables: $\mathcal{B}$, the $q$-dimensional vector of random effects, and $\mathcal{Y}$, the $n$-dimensional response vector. The value, $\bf y$, of $\mathcal{Y}$ is observed; the value, $\bf b$, of $\mathcal{B}$ is not.

Linear Mixed-Effects Models

In a linear mixed model the unconditional distribution of $\mathcal{B}$ and the conditional distribution, $(\mathcal{Y} | \mathcal{B}=\bf{b})$, are both multivariate Gaussian distributions,

\[\begin{aligned} (\mathcal{Y} | \mathcal{B}=\bf{b}) &\sim\mathcal{N}(\bf{ X\beta + Z b},\sigma^2\bf{I})\\\\ \mathcal{B}&\sim\mathcal{N}(\bf{0},\Sigma_\theta) . \end{aligned}\]

The conditional mean of $\mathcal Y$, given $\mathcal B=\bf b$, is the linear predictor, $\bf X\bf\beta+\bf Z\bf b$, which depends on the $p$-dimensional fixed-effects parameter, $\bf \beta$, and on $\bf b$. The model matrices, $\bf X$ and $\bf Z$, of dimension $n\times p$ and $n\times q$, respectively, are determined from the formula for the model and the values of covariates. Although the matrix $\bf Z$ can be large (i.e. both $n$ and $q$ can be large), it is sparse (i.e. most of the elements in the matrix are zero).

The relative covariance factor, $\Lambda_\theta$, is a $q\times q$ lower-triangular matrix, depending on the variance-component parameter, $\bf\theta$, and generating the symmetric $q\times q$ variance-covariance matrix, $\Sigma_\theta$, as

\[\Sigma_\theta=\sigma^2\Lambda_\theta\Lambda_\theta'\]

The spherical random effects, $\mathcal{U}\sim\mathcal{N}(\bf{0},\sigma^2\bf{I}_q)$, determine $\mathcal B$ according to

\[\mathcal{B}=\Lambda_\theta\mathcal{U}.\]

The penalized residual sum of squares (PRSS),

\[r^2(\theta,\beta,\bf{u})=\|\bf{y} - \bf{X}\beta -\bf{Z}\Lambda_\theta\bf{u}\|^2+\|\bf{u}\|^2,\]

is the sum of the residual sum of squares, measuring fidelity of the model to the data, and a penalty on the size of $\bf u$, measuring the complexity of the model. Minimizing $r^2$ with respect to $\bf u$,

\[r^2_{\beta,\theta} =\min_{\bf{u}}\left(\|\bf{y} -\bf{X}{\beta} -\bf{Z}\Lambda_\theta\bf{u}\|^2+\|\bf{u}\|^2\right)\]

is a direct (i.e. non-iterative) computation. The particular method used to solve this generates a blocked Cholesky factor, $\bf{L}_\theta$, which is a lower-triangular $q\times q$ matrix satisfying

\[\bf{L}_\theta\bf{L}_\theta'=\Lambda_\theta'\bf{Z}'\bf{Z}\Lambda_\theta+\bf{I}_q .\]

where ${\bf I}_q$ is the $q\times q$ identity matrix.
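As a rough illustration of why this minimization is direct, the sketch below solves the penalized least squares problem for fixed $\beta$ and $\theta$ on a small made-up data set using dense matrices from the LinearAlgebra standard library. The data, the values of β and θ, and the names u_mode and prss are assumptions chosen for the example; the package itself works with sparse, blocked matrices rather than the dense ones used here.

```julia
using LinearAlgebra

# Toy data (made up for illustration): 3 groups with 2 observations each
y = [1.0, 2.0, 3.0, 5.0, 4.0, 6.0]
X = ones(6, 1)                        # intercept-only fixed-effects model matrix
refs = [1, 1, 2, 2, 3, 3]             # group index of each observation
Z = [Float64(refs[i] == j) for i in 1:6, j in 1:3]   # indicator matrix

β = [3.0]                             # held fixed for this illustration
θ = 0.8                               # held fixed for this illustration
Λ = Diagonal(fill(θ, 3))              # Λ_θ for a simple, scalar term

ZΛ = Z * Λ
A = ZΛ' * ZΛ + I                      # Λ'Z'ZΛ + I_q
L = cholesky(Symmetric(A)).L          # lower-triangular factor with L * L' ≈ A

# Normal equations (Λ'Z'ZΛ + I) u = Λ'Z'(y - Xβ), solved by two triangular solves
rhs = ZΛ' * (y - X * β)
u_mode = L' \ (L \ rhs)

# Penalized residual sum of squares as a function of u
prss(u) = sum(abs2, y - X * β - ZΛ * u) + sum(abs2, u)
```

At u_mode the gradient of the PRSS with respect to $\bf u$ vanishes, so no iteration is needed once the factor is available.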

Negative twice the log-likelihood of the parameters, given the data, $\bf y$, is

\[d({\bf\theta},{\bf\beta},\sigma|{\bf y}) =n\log(2\pi\sigma^2)+\log(|{\bf L}_\theta|^2)+\frac{r^2_{\beta,\theta}}{\sigma^2}.\]

where $|{\bf L}_\theta|$ denotes the determinant of ${\bf L}_\theta$. Because ${\bf L}_\theta$ is triangular, its determinant is the product of its diagonal elements.
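A minimal sketch of how the term $\log(|{\bf L}_\theta|^2)$ can be evaluated from the diagonal alone is given below. The matrix is an arbitrary lower-triangular example, not one taken from a fitted model, and the name logdet_sq is chosen only for this illustration.

```julia
using LinearAlgebra

# An arbitrary lower-triangular matrix with positive diagonal (illustrative values)
L = LowerTriangular([2.0 0.0 0.0;
                     0.5 1.5 0.0;
                     0.1 0.3 3.0])

# log(|L|^2) as twice the sum of the logs of the diagonal elements
logdet_sq = 2 * sum(log, diag(L))
```

Summing logarithms avoids forming the determinant itself, which could overflow or underflow when $q$ is large.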

Because the conditional mean, $\bf\mu_{\mathcal Y|\mathcal B=\bf b}=\bf X\bf\beta+\bf Z\Lambda_\theta\bf u$, is a linear function of both $\bf\beta$ and $\bf u$, minimization of the PRSS with respect to both $\bf\beta$ and $\bf u$ to produce

\[r^2_\theta =\min_{{\bf\beta},{\bf u}}\left(\|{\bf y} -{\bf X}{\bf\beta} -{\bf Z}\Lambda_\theta{\bf u}\|^2+\|{\bf u}\|^2\right)\]

is also a direct calculation. The values of $\bf u$ and $\bf\beta$ that provide this minimum are called, respectively, the conditional mode, $\tilde{\bf u}_\theta$, of the spherical random effects and the conditional estimate, $\widehat{\bf\beta}_\theta$, of the fixed effects. At the conditional estimate of the fixed effects the objective is

\[d({\bf\theta},\widehat{\beta}_\theta,\sigma|{\bf y}) =n\log(2\pi\sigma^2)+\log(|{\bf L}_\theta|^2)+\frac{r^2_\theta}{\sigma^2}.\]

Minimizing this expression with respect to $\sigma^2$ produces the conditional estimate

\[\widehat{\sigma^2}_\theta=\frac{r^2_\theta}{n}\]

which provides the profiled log-likelihood on the deviance scale as

\[\tilde{d}(\theta|{\bf y})=d(\theta,\widehat{\beta}_\theta,\widehat{\sigma}_\theta|{\bf y}) =\log(|{\bf L}_\theta|^2)+n\left[1+\log\left(\frac{2\pi r^2_\theta}{n}\right)\right],\]

a function of $\bf\theta$ alone.
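The two steps of this profiling can be written out explicitly. Only the first and last terms of the objective involve $\sigma^2$, so setting the derivative with respect to $\sigma^2$ to zero and substituting the solution gives

\[\begin{aligned} \frac{\partial d}{\partial\sigma^2} &=\frac{n}{\sigma^2}-\frac{r^2_\theta}{\sigma^4}=0 \quad\Longrightarrow\quad \widehat{\sigma^2}_\theta=\frac{r^2_\theta}{n},\\ n\log\left(2\pi\widehat{\sigma^2}_\theta\right)+\frac{r^2_\theta}{\widehat{\sigma^2}_\theta} &=n\log\left(\frac{2\pi r^2_\theta}{n}\right)+n =n\left[1+\log\left(\frac{2\pi r^2_\theta}{n}\right)\right]. \end{aligned}\]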

The MLE of $\bf\theta$, written $\widehat{\bf\theta}$, is the value that minimizes this profiled objective. We determine this value by numerical optimization. In the process of evaluating $\tilde{d}(\widehat{\theta}|{\bf y})$ we determine $\widehat{\beta}=\widehat{\beta}_{\widehat\theta}$, $\tilde{\bf u}_{\widehat{\theta}}$ and $r^2_{\widehat{\theta}}$, from which we can evaluate $\widehat{\sigma}=\sqrt{r^2_{\widehat{\theta}}/n}$.
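The following sketch evaluates the profiled objective for a single scalar random-effects term on made-up data and locates its minimum by a crude grid search. It is meant only to show how $\widehat\beta_\theta$, $\tilde{\bf u}_\theta$ and $r^2_\theta$ fall out of one evaluation. The data, the function name profiled_deviance, the stacked least squares formulation and the grid are all assumptions for the illustration; the package uses a blocked Cholesky factor and a derivative-free optimizer rather than a grid.

```julia
using LinearAlgebra

# Toy data (made up): 4 groups with 3 observations each
y = [10.2, 9.8, 10.5, 12.1, 12.6, 11.9, 8.7, 9.1, 8.9, 11.0, 10.6, 11.3]
n = length(y)
refs = repeat(1:4, inner = 3)
X = ones(n, 1)
Z = [Float64(refs[i] == j) for i in 1:n, j in 1:4]
q = size(Z, 2)

# Profiled objective for one scalar random-effects term, where Λ_θ = θ I
function profiled_deviance(θ)
    ZΛ = θ * Z
    # Joint penalized least squares over (u, β): penalty rows stacked under data rows
    M = [ZΛ X; Matrix(1.0I, q, q) zeros(q, size(X, 2))]
    rhs = [y; zeros(q)]
    coef = M \ rhs                       # least squares solution
    r2 = sum(abs2, rhs - M * coef)       # minimum PRSS, r²_θ
    L = cholesky(Symmetric(ZΛ' * ZΛ + I)).L
    logdetL2 = 2 * sum(log, diag(L))     # log(|L_θ|²)
    return logdetL2 + n * (1 + log(2π * r2 / n)), coef, r2
end

# Crude grid search standing in for a numerical optimizer
θgrid = 0.0:0.01:10.0
objs = [profiled_deviance(θ)[1] for θ in θgrid]
θhat = θgrid[argmin(objs)]

# Quantities obtained as by-products of the evaluation at θhat
_, coef, r2 = profiled_deviance(θhat)
u_mode, βhat = coef[1:q], coef[q+1:end]
σhat = sqrt(r2 / n)
```

The same value of the objective can be obtained from the marginal Gaussian distribution of the response, which provides a useful check on any implementation of the profiled form.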

The elements of the conditional mode of $\mathcal B$, evaluated at the parameter estimates,

\[\tilde{\bf b}_{\widehat{\theta}}=\Lambda_{\widehat{\theta}}\tilde{\bf u}_{\widehat{\theta}}\]

are sometimes called the best linear unbiased predictors or BLUPs of the random effects. Although BLUPs is an appealing acronym, I don’t find the term particularly instructive (what is a “linear unbiased predictor” and in what sense are these the “best”?) and prefer the term “conditional modes”, because these are the values of $\bf b$ that maximize the density of the conditional distribution $\mathcal{B} | \mathcal{Y} = {\bf y}$. For a linear mixed model, where all the conditional and unconditional distributions are Gaussian, these values are also the conditional means.

Internal structure of $\Lambda_\theta$ and $\bf Z$

In the types of LinearMixedModel available through the MixedModels package, groups of random effects and the corresponding columns of the model matrix, $\bf Z$, are associated with random-effects terms in the model formula.

For the simple example

using BenchmarkTools, DataFrames, MixedModels

dyestuff = MixedModels.dataset(:dyestuff)
fm1 = fit(MixedModel, @formula(yield ~ 1 + (1|batch)), dyestuff)

Linear mixed model fit by maximum likelihood
 yield ~ 1 + (1 | batch)
   logLik   -2 logLik     AIC       AICc        BIC    
  -163.6635   327.3271   333.3271   334.2501   337.5307

Variance components:
            Column    Variance Std.Dev.
batch    (Intercept)  1388.3333 37.2603
Residual              2451.2500 49.5101
 Number of obs: 30; levels of grouping factors: 6

  Fixed-effects parameters:
────────────────────────────────────────────────
              Coef.  Std. Error      z  Pr(>|z|)
────────────────────────────────────────────────
(Intercept)  1527.5     17.6946  86.33    <1e-99
────────────────────────────────────────────────

the only random-effects term in the formula is (1|batch), a simple, scalar random-effects term.

t1 = first(fm1.reterms);
Int.(t1)  # convert to integers for more compact display

30×6 Matrix{Int64}:
 1  0  0  0  0  0
 1  0  0  0  0  0
 1  0  0  0  0  0
 1  0  0  0  0  0
 1  0  0  0  0  0
 0  1  0  0  0  0
 0  1  0  0  0  0
 0  1  0  0  0  0
 0  1  0  0  0  0
 0  1  0  0  0  0
 ⋮              ⋮
 0  0  0  0  1  0
 0  0  0  0  1  0
 0  0  0  0  1  0
 0  0  0  0  1  0
 0  0  0  0  0  1
 0  0  0  0  0  1
 0  0  0  0  0  1
 0  0  0  0  0  1
 0  0  0  0  0  1

MixedModels.ReMat — Type

ReMat{T,S} <: AbstractMatrix{T}

A section of a model matrix generated by a random-effects term.

Fields

trm: the grouping factor as a StatsModels.CategoricalTerm
refs: indices into the levels of the grouping factor as a Vector{Int32}
levels: the levels of the grouping factor
z: transpose of the model matrix generated by the left-hand side of the term
wtz: a weighted copy of z (z and wtz are the same object for unweighted cases)
λ: a LowerTriangular matrix of size S×S
inds: a Vector{Int} of linear indices of the potential nonzeros in λ
adjA: the adjoint of the matrix as a SparseMatrixCSC{T}

This RandomEffectsTerm contributes a block of columns to the model matrix $\bf Z$ and a diagonal block to $\Lambda_\theta$. In this case the diagonal block of $\Lambda_\theta$ (which is also the only block) is a multiple of the $6\times6$ identity matrix where the multiple is

t1.λ

1×1 LinearAlgebra.LowerTriangular{Float64, Matrix{Float64}}:
 0.7525806757718846

Because there is only one random-effects term in the model, the matrix $\bf Z$ is the indicator matrix shown as the result of Int.(t1), but stored in a special sparse format. Furthermore, there is only one block in $\Lambda_\theta$.
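To make the structure concrete, here is a small sketch that rebuilds a dense version of this indicator matrix from a vector of group indices. The variable names are chosen for the example and the dense storage is an assumption made for clarity; it is not how the package stores the term.

```julia
using LinearAlgebra

# Group indices for 30 observations in 6 batches of 5, matching the display above
refs = repeat(1:6, inner = 5)

# Dense indicator matrix: Z[i, j] is 1 exactly when observation i is in group j
Z = [Int(refs[i] == j) for i in eachindex(refs), j in 1:6]

# Λ_θ for a simple, scalar term is θ times the identity (θ from the t1.λ output)
θ = 0.7525806757718846
Λ = θ * I(6)

# Multiplying by Z only looks up the group of each observation
u = collect(1.0:6.0)
Zu = Z * u
```

Because each row of the matrix has a single nonzero entry, the product Z * u is the same as the lookup u[refs], which is why storing the group indices is enough for a term of this kind.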

For a vector-valued random-effects term, as in

sleepstudy = MixedModels.dataset(:sleepstudy)
fm2 = fit(MixedModel, @formula(reaction ~ 1+days+(1+days|subj)), sleepstudy)

Linear mixed model fit by maximum likelihood
 reaction ~ 1 + days + (1 + days | subj)
   logLik   -2 logLik     AIC       AICc        BIC    
  -875.9697  1751.9393  1763.9393  1764.4249  1783.0971

Variance components:
            Column    Variance Std.Dev.   Corr.
subj     (Intercept)  565.51069 23.78047
         days          32.68212  5.71683 +0.08
Residual              654.94145 25.59182
 Number of obs: 180; levels of grouping factors: 18

  Fixed-effects parameters:
──────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)
──────────────────────────────────────────────────
(Intercept)  251.405      6.63226  37.91    <1e-99
days          10.4673     1.50224   6.97    <1e-11
──────────────────────────────────────────────────

the model matrix $\bf Z$ is of the form

t21 = first(fm2.reterms);
Int.(t21) # convert to integers for more compact display

180×36 Matrix{Int64}:
 1  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 1  1  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  2  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  3  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  4  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  5  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 1  6  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  7  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  8  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  9  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 ⋮              ⋮              ⋮        ⋱     ⋮              ⋮              ⋮
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  2
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  3
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  4
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  1  5
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  6
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  7
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  8
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  9

and $\Lambda_\theta$ is a $36\times36$ block diagonal matrix with $18$ diagonal blocks, all of the form

t21.λ

2×2 LinearAlgebra.LowerTriangular{Float64, Matrix{Float64}}:
 0.929221    ⋅ 
 0.0181684  0.222645

The $\theta$ vector is

MixedModels.getθ(t21)

3-element Vector{Float64}:
 0.9292213288149662
 0.018168393450877257
 0.22264486671069741

Random-effects terms in the model formula that have the same grouping factor are amalgamated into a single ReMat object.

fm3 = fit(MixedModel, @formula(reaction ~ 1+days+(1|subj) + (0+days|subj)), sleepstudy)
t31 = first(fm3.reterms);
Int.(t31)

180×36 Matrix{Int64}:
 1  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 1  1  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  2  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  3  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  4  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  5  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 1  6  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  7  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  8  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  9  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 ⋮              ⋮              ⋮        ⋱     ⋮              ⋮              ⋮
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  1
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  2
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  3
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  4
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  1  5
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  6
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  7
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  8
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  1  9

For this model the matrix $\bf Z$ is the same as that of model fm2 but the diagonal blocks of $\Lambda_\theta$ are themselves diagonal.

t31.λ
MixedModels.getθ(t31)

2-element Vector{Float64}:
 0.9458180658294862
 0.22692714882505358

Random-effects terms with distinct grouping factors generate distinct elements of the reterms field of the LinearMixedModel object. Multiple ReMat objects are sorted by decreasing numbers of random effects.

penicillin = MixedModels.dataset(:penicillin)
fm4 = fit(MixedModel,
    @formula(diameter ~ 1 + (1|plate) + (1|sample)),
    penicillin)
Int.(first(fm4.reterms))
Int.(last(fm4.reterms))

144×6 Matrix{Int64}:
 1  0  0  0  0  0
 0  1  0  0  0  0
 0  0  1  0  0  0
 0  0  0  1  0  0
 0  0  0  0  1  0
 0  0  0  0  0  1
 1  0  0  0  0  0
 0  1  0  0  0  0
 0  0  1  0  0  0
 0  0  0  1  0  0
 ⋮              ⋮
 0  0  0  1  0  0
 0  0  0  0  1  0
 0  0  0  0  0  1
 1  0  0  0  0  0
 0  1  0  0  0  0
 0  0  1  0  0  0
 0  0  0  1  0  0
 0  0  0  0  1  0
 0  0  0  0  0  1

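Note that the order of the ReMat objects in fm4.reterms is determined by the number of random effects, not by the position of the terms in the formula: the term for plate, which has more levels than sample, would come first even if (1|sample) were written before (1|plate). Only the value of the last expression, the indicator matrix for sample, is displayed above.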

Progress of the optimization

An optional named argument, verbose=true, in the call to fit for a LinearMixedModel causes printing of the objective and the $\theta$ parameter at each evaluation during the optimization. (Not illustrated here.)

A shorter summary of the optimization process is always available as an

MixedModels.OptSummary — Type

OptSummary

Summary of an NLopt optimization

Fields

initial: a copy of the initial parameter values in the optimization
lowerbd: lower bounds on the parameter values
ftol_rel: as in NLopt
ftol_abs: as in NLopt
xtol_rel: as in NLopt
xtol_abs: as in NLopt
initial_step: as in NLopt
maxfeval: as in NLopt (maxeval)
maxtime: as in NLopt
final: a copy of the final parameter values from the optimization
fmin: the final value of the objective
feval: the number of function evaluations
optimizer: the name of the optimizer used, as a Symbol
returnvalue: the return value, as a Symbol
nAGQ: number of adaptive Gauss-Hermite quadrature points in deviance evaluation for GLMMs
REML: use the REML criterion for LMM fits

The latter two fields are model characteristics and not related directly to the NLopt package or algorithms.

object, which is the optsum member of the LinearMixedModel.

fm2.optsum

Initial parameter vector: [1.0, 0.0, 1.0]
Initial objective value:  1784.642296192471

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [0.0, -Inf, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10, 1.0e-10]
initial_step:             [0.75, 1.0, 0.75]
maxfeval:                 -1
maxtime:                  -1.0

Function evaluations:     57
Final parameter vector:   [0.9292213288149662, 0.018168393450877257, 0.22264486671069741]
Final objective value:    1751.9393444647023
Return code:              FTOL_REACHED

A blocked Cholesky factor

A LinearMixedModel object contains two blocked matrices: a symmetric matrix A (only the lower triangle is stored) and a lower-triangular L, which is the lower Cholesky factor of the updated and inflated A.

MixedModels.BlockDescription — Type

BlockDescription

Description of blocks of A and L in a LinearMixedModel

Fields

blknms: Vector{String} of block names
blkrows: Vector{Int} of the number of rows in each block
ALtypes: Matrix{String} of datatypes for blocks in A and L.

When a block in L is the same type as the corresponding block in A, it is described with a single name, such as Dense. When the types differ the entry in ALtypes is of the form Diag/Dense, as determined by a shorttype method.

shows the structure of the blocks

BlockDescription(fm2)

rows:     subj         fixed     
  36:   BlkDiag    
   2:    Dense         Dense

The operation of installing a new value of the variance parameters, θ, and updating L

MixedModels.setθ! — Function

setθ!(m::LinearMixedModel, v)

Install v as the θ parameters in m.

setθ!(bsamp::MixedModelBootstrap, i::Integer)

Install the values of the i'th θ value of bsamp.bstr in bsamp.λ

MixedModels.updateL! — Function

updateL!(m::LinearMixedModel)

Update the blocked lower Cholesky factor, m.L, from m.A and m.reterms (used for λ only)

This is the crucial step in evaluating the objective, given a new parameter value.

is the central step in evaluating the objective (negative twice the log-likelihood).

Typically, the (1,1) block is the largest block in A and L and it has a special form, either Diagonal or

MixedModels.UniformBlockDiagonal — Type

UniformBlockDiagonal{T}

Homogeneous block diagonal matrices. k diagonal blocks each of size m×m

providing a compact representation and fast matrix multiplication or solutions of linear systems of equations.

Modifying the optimization process

The OptSummary object contains both input and output fields for the optimizer. To modify the optimization process the input fields can be changed after constructing the model but before fitting it.

Suppose, for example, that the user wishes to try a Nelder-Mead optimization method instead of the default BOBYQA (Bounded Optimization BY Quadratic Approximation) method.

fm2 = LinearMixedModel(@formula(reaction ~ 1+days+(1+days|subj)), sleepstudy);
fm2.optsum.optimizer = :LN_NELDERMEAD;
fit!(fm2)
fm2.optsum

Initial parameter vector: [1.0, 0.0, 1.0]
Initial objective value:  1784.642296192471

Optimizer (from NLopt):   LN_NELDERMEAD
Lower bounds:             [0.0, -Inf, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10, 1.0e-10]
initial_step:             [0.75, 1.0, 0.75]
maxfeval:                 -1
maxtime:                  -1.0

Function evaluations:     140
Final parameter vector:   [0.9292360739538559, 0.018168794976407835, 0.22264111430139058]
Final objective value:    1751.9393444750306
Return code:              FTOL_REACHED

The parameter estimates are quite similar to those using :LN_BOBYQA but at the expense of 140 function evaluations for :LN_NELDERMEAD versus 57 for :LN_BOBYQA.

Run time can be constrained with maxfeval and maxtime.

See the documentation for the NLopt package for details about the various settings.

Convergence to singular covariance matrices

To ensure identifiability of $\Sigma_\theta=\sigma^2\Lambda_\theta \Lambda_\theta'$, the elements of $\theta$ corresponding to diagonal elements of $\Lambda_\theta$ are constrained to be non-negative. For example, in a trivial case of a single, simple, scalar, random-effects term as in fm1, the one-dimensional $\theta$ vector is the ratio of the standard deviation of the random effects to the residual standard deviation, $\sigma$ (for fm1, $37.2603/49.5101\approx0.7526$, the value shown for t1.λ). It happens that $-\theta$ produces the same log-likelihood but, by convention, we define the standard deviation to be the positive square root of the variance. Requiring the diagonal elements of $\Lambda_\theta$ to be non-negative is a generalization of using this positive square root.
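The sign ambiguity can be seen numerically in a short sketch. It uses the two standard deviations printed for fm1 and a dense indicator matrix with the same layout as the dyestuff data; the helper name inflated is made up for this example.

```julia
using LinearAlgebra

# θ for fm1 as the ratio of the two standard deviations in the fm1 output
θ = 37.2603 / 49.5101

# Dense indicator matrix with the dyestuff layout: 6 batches of 5 observations
refs = repeat(1:6, inner = 5)
Z = [Float64(refs[i] == j) for i in 1:30, j in 1:6]

# Λ'Z'ZΛ + I for a scalar term, the matrix whose Cholesky factor enters the objective
inflated(t) = (t * Z)' * (t * Z) + I

same = inflated(θ) ≈ inflated(-θ)
```

Since $\theta$ enters this matrix, and $\Sigma_\theta$, only through its square, changing the sign changes neither of them.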

If the optimization converges on the boundary of the feasible region, that is if one or more of the diagonal elements of $\Lambda_\theta$ is zero at convergence, the covariance matrix $\Sigma_\theta$ will be singular. This means that there will be linear combinations of random effects that are constant. Usually convergence to a singular covariance matrix is a sign of an over-specified model.
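A small sketch with illustrative values shows what a zero on the diagonal implies. The block λ below is made up, not taken from a fitted model.

```julia
using LinearAlgebra

# A 2×2 relative covariance factor with a zero on the diagonal (illustrative values)
λ = LowerTriangular([1.0 0.0; 0.5 0.0])
Σrel = λ * λ'                  # relative covariance matrix, Σ/σ²

# b = λ u depends on u[1] only, so b[2] - 0.5 * b[1] is always zero
u = [0.3, -1.2]                # any value of the spherical random effects
b = λ * u
constant_combination = b[2] - 0.5 * b[1]
```

Here Σrel has rank 1, and the linear combination b[2] - 0.5 * b[1] is zero whatever the value of u, which is the sense in which some linear combinations of the random effects are constant.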

Singularity can be checked with the issingular predicate function.

MixedModels.issingular — Function

issingular(m::MixedModel, θ=m.θ)

Test whether the model m is singular if the parameter vector is θ.

Equality comparisons are used b/c small non-negative θ values are replaced by 0 in fit!.

issingular(bsamp::MixedModelBootstrap)

Test each bootstrap sample for singularity of the corresponding fit.

Equality comparisons are used b/c small non-negative θ values are replaced by 0 in fit!.

Generalized Linear Mixed-Effects Models

In a generalized linear model the responses are modelled as coming from a particular distribution, such as Bernoulli for binary responses or Poisson for responses that represent counts. The scalar distributions of individual responses differ only in their means, which are determined by a linear predictor expression $\eta=\bf X\beta$, where, as before, $\bf X$ is a model matrix derived from the values of covariates and $\beta$ is a vector of coefficients.

The unconstrained components of $\eta$ are mapped to the, possibly constrained, components of the mean response, $\mu$, via a scalar function, $g^{-1}$, applied to each component of $\eta$. For historical reasons, the inverse of this function, taking components of $\mu$ to the corresponding components of $\eta$, is called the link function, and the more frequently used map from $\eta$ to $\mu$ is the inverse link.
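For concreteness, the sketch below writes out the logit link and its inverse, the pair that appears as LogitLink() in the Bernoulli fits on this page, along with the inverse of the log link commonly paired with the Poisson family. The function names logit and invlogit are chosen for this example and are not taken from the package.

```julia
# Logit link g and its inverse g⁻¹ (names chosen for this sketch)
logit(μ) = log(μ / (1 - μ))
invlogit(η) = 1 / (1 + exp(-η))

η = [-2.0, 0.0, 1.5]           # unconstrained linear predictor values
μ = invlogit.(η)               # means, each strictly between 0 and 1

# Inverse of the log link: maps any real η to a strictly positive mean
μ_count = exp.(η)
```

Applying logit to μ recovers η, which is the sense in which the two maps are inverses of each other.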

A generalized linear mixed-effects model (GLMM) is defined, for the purposes of this package, by

\[\begin{aligned} (\mathcal{Y} | \mathcal{B}=\bf{b}) &\sim\mathcal{D}(\bf{g^{-1}(X\beta + Z b)},\phi)\\\\ \mathcal{B}&\sim\mathcal{N}(\bf{0},\Sigma_\theta) . \end{aligned}\]

where $\mathcal{D}$ indicates the distribution family parameterized by the mean and, when needed, a common scale parameter, $\phi$. (There is no scale parameter for Bernoulli or for Poisson. Specifying the mean completely determines the distribution.)

Distributions.Bernoulli — Type

Bernoulli(p)

A Bernoulli distribution is parameterized by a success rate p, which takes value 1 with probability p and 0 with probability 1-p.

\[P(X = k) = \begin{cases} 1 - p & \quad \text{for } k = 0, \\ p & \quad \text{for } k = 1. \end{cases}\]

Bernoulli()    # Bernoulli distribution with p = 0.5
Bernoulli(p)   # Bernoulli distribution with success rate p

params(d)      # Get the parameters, i.e. (p,)
succprob(d)    # Get the success rate, i.e. p
failprob(d)    # Get the failure rate, i.e. 1 - p

External links:

Bernoulli distribution on Wikipedia

Distributions.Poisson — Type

Poisson(λ)

A Poisson distribution describes the number of independent events occurring within a unit time interval, given the average rate of occurrence λ.

\[P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad \text{ for } k = 0,1,2,\ldots.\]

Poisson()        # Poisson distribution with rate parameter 1
Poisson(lambda)       # Poisson distribution with rate parameter lambda

params(d)        # Get the parameters, i.e. (λ,)
mean(d)          # Get the mean arrival rate, i.e. λ

External links:

Poisson distribution on Wikipedia

A GeneralizedLinearMixedModel object is generated from a formula, data frame and distribution family.

verbagg = MixedModels.dataset(:verbagg)
const vaform = @formula(r2 ~ 1 + anger + gender + btype + situ + (1|subj) + (1|item));
mdl = GeneralizedLinearMixedModel(vaform, verbagg, Bernoulli());
typeof(mdl)

GeneralizedLinearMixedModel{Float64}

A separate call to fit! can be used to fit the model. This involves optimizing an objective function, the Laplace approximation to the deviance, with respect to the parameters, which are $\beta$, the fixed-effects coefficients, and $\theta$, the covariance parameters. The starting estimate for $\beta$ is determined by fitting a GLM to the fixed-effects part of the formula

mdl.β

6-element Vector{Float64}:
  0.2060530221032275
  0.03994037605114987
  0.23131667674984469
 -0.7941857249205363
 -1.5391882085456918
 -0.7766556048305914

and the starting estimate for $\theta$, which is a vector of the two standard deviations of the random effects, is chosen to be

mdl.θ

2-element Vector{Float64}:
 1.0
 1.0

The Laplace approximation to the deviance requires determining the conditional modes of the random effects. These are the values that maximize the conditional density of the random effects, given the model parameters and the data. This is done using Penalized Iteratively Reweighted Least Squares (PIRLS). In most cases PIRLS is fast and stable. It is simply a penalized version of the IRLS algorithm used in fitting GLMs.
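The idea can be sketched for a Bernoulli response with the logit link and a single scalar random-effects term. What follows is a bare-bones illustration on made-up data with $\beta$ and $\theta$ held fixed. It is not the package's implementation: the function name conditional_mode, the dense matrices, the stopping rule and the absence of any step-halving safeguard are all assumptions made to keep the example short.

```julia
using LinearAlgebra

invlogit(η) = 1 / (1 + exp(-η))

# Toy binary data (made up): 3 groups with 4 observations each
y = Float64[1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
n = length(y)
refs = repeat(1:3, inner = 4)
X = ones(n, 1)
Z = [Float64(refs[i] == j) for i in 1:n, j in 1:3]
β = [0.0]                      # held fixed in this sketch
θ = 1.0                        # held fixed in this sketch
ZΛ = θ * Z                     # Z Λ_θ for a simple, scalar term

# Penalized IRLS for the conditional mode of the spherical random effects u
function conditional_mode(y, X, ZΛ, β; maxiter = 50, tol = 1e-10)
    u = zeros(size(ZΛ, 2))
    for _ in 1:maxiter
        μ = invlogit.(X * β + ZΛ * u)
        w = μ .* (1 .- μ)                    # IRLS weights for the logit link
        grad = ZΛ' * (y - μ) - u             # gradient of the penalized log-likelihood
        H = ZΛ' * (w .* ZΛ) + I              # Λ'Z'WZΛ + I
        δ = cholesky(Symmetric(H)) \ grad    # penalized weighted least squares step
        u += δ
        norm(δ) < tol && break
    end
    return u
end

u_mode = conditional_mode(y, X, ZΛ, β)
```

Each pass solves a weighted version of the penalized least squares problem from the linear case, with the weights recomputed from the current means, and at convergence the penalized gradient is zero.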

The distinction between the "fast" and "slow" algorithms in the MixedModels package (nAGQ=0 or nAGQ=1 in lme4) is whether the fixed-effects parameters, $\beta$, are optimized in PIRLS or in the nonlinear optimizer. In a call to the pirls! function the first argument is a GeneralizedLinearMixedModel, which is modified during the function call. (By convention, the names of such mutating functions end in ! as a warning to the user that they can modify an argument, usually the first argument.) The second and third arguments are optional logical values indicating if $\beta$ is to be varied and if verbose output is to be printed.

pirls!(mdl, true, true)

deviance(mdl)

8201.848559060621

mdl.β

6-element Vector{Float64}:
  0.21853493716518088
  0.05143854258081083
  0.2902245416630167
 -0.9791237061899788
 -1.9540167628140055
 -0.9794925718036899

mdl.θ # current values of the standard deviations of the random effects

2-element Vector{Float64}:
 1.0
 1.0

If the optimization with respect to $\beta$ is performed within PIRLS then the nonlinear optimization of the Laplace approximation to the deviance requires optimization with respect to $\theta$ only. This is the "fast" algorithm. Given a value of $\theta$, PIRLS is used to determine the conditional estimate of $\beta$ and the conditional mode of the random effects, b.

mdl.b # conditional modes of b

2-element Vector{Matrix{Float64}}:
 [-0.600771603848884 -1.932268086621969 … -0.1445537397533549 -0.5752238433557038]
 [-0.1863641874790143 0.021422773585949458 … 0.6410383402098077 0.6496779078972804]

fit!(mdl, fast=true);

Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
  r2 ~ 1 + anger + gender + btype + situ + (1 | subj) + (1 | item)
  Distribution: Bernoulli{Float64}
  Link: LogitLink()


   logLik    deviance     AIC       AICc        BIC    
 -4075.7917  8151.5833  8167.5833  8167.6024  8223.0537
Variance components:
        Column   Variance Std.Dev. 
subj (Intercept)  1.794431 1.339564
item (Intercept)  0.246843 0.496833

 Number of obs: 7584; levels of grouping factors: 316, 24

Fixed-effects parameters:
─────────────────────────────────────────────────────
                   Coef.  Std. Error      z  Pr(>|z|)
─────────────────────────────────────────────────────
(Intercept)    0.208273    0.405425    0.51    0.6075
anger          0.0543791   0.0167533   3.25    0.0012
gender: M      0.304089    0.191223    1.59    0.1118
btype: scold  -1.0165      0.257531   -3.95    <1e-04
btype: shout  -2.0218      0.259235   -7.80    <1e-14
situ: self    -1.01344     0.210888   -4.81    <1e-05
─────────────────────────────────────────────────────

The optimization process is summarized by

mdl.LMM.optsum

Initial parameter vector: [1.0, 1.0]
Initial objective value:  8201.848559060621

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [0.0, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10]
initial_step:             [0.75, 0.75]
maxfeval:                 -1
maxtime:                  -1.0

Function evaluations:     37
Final parameter vector:   [1.3395639000126478, 0.4968327838843539]
Final objective value:    8151.583340131867
Return code:              FTOL_REACHED

As one would hope, given the name of the option, this fit is comparatively fast.

@btime fit(MixedModel, vaform, verbagg, Bernoulli(), fast=true)

Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
  r2 ~ 1 + anger + gender + btype + situ + (1 | subj) + (1 | item)
  Distribution: Bernoulli{Float64}
  Link: LogitLink()


   logLik    deviance     AIC       AICc        BIC    
 -4075.7917  8151.5833  8167.5833  8167.6024  8223.0537
Variance components:
        Column   Variance Std.Dev. 
subj (Intercept)  1.794431 1.339564
item (Intercept)  0.246843 0.496833

 Number of obs: 7584; levels of grouping factors: 316, 24

Fixed-effects parameters:
─────────────────────────────────────────────────────
                   Coef.  Std. Error      z  Pr(>|z|)
─────────────────────────────────────────────────────
(Intercept)    0.208273    0.405425    0.51    0.6075
anger          0.0543791   0.0167533   3.25    0.0012
gender: M      0.304089    0.191223    1.59    0.1118
btype: scold  -1.0165      0.257531   -3.95    <1e-04
btype: shout  -2.0218      0.259235   -7.80    <1e-14
situ: self    -1.01344     0.210888   -4.81    <1e-05
─────────────────────────────────────────────────────

The alternative algorithm is to use PIRLS to find the conditional mode of the random effects, given $\beta$ and $\theta$ and then use the general nonlinear optimizer to fit with respect to both $\beta$ and $\theta$.

mdl1 = @btime fit(MixedModel, vaform, verbagg, Bernoulli())

Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
  r2 ~ 1 + anger + gender + btype + situ + (1 | subj) + (1 | item)
  Distribution: Bernoulli{Float64}
  Link: LogitLink()


   logLik    deviance     AIC       AICc        BIC    
 -4075.6999  8151.3998  8167.3998  8167.4188  8222.8702
Variance components:
        Column   Variance Std.Dev. 
subj (Intercept)  1.794973 1.339766
item (Intercept)  0.245327 0.495305

 Number of obs: 7584; levels of grouping factors: 316, 24

Fixed-effects parameters:
─────────────────────────────────────────────────────
                   Coef.  Std. Error      z  Pr(>|z|)
─────────────────────────────────────────────────────
(Intercept)    0.195555     0.40519    0.48    0.6294
anger          0.0575541    0.016758   3.43    0.0006
gender: M      0.320784     0.191266   1.68    0.0935
btype: scold  -1.05826      0.256805  -4.12    <1e-04
btype: shout  -2.10475      0.258529  -8.14    <1e-15
situ: self    -1.05498      0.210303  -5.02    <1e-06
─────────────────────────────────────────────────────

This fit provided slightly better results (Laplace approximation to the deviance of 8151.400 versus 8151.583) but took 6 times as long. That is not terribly important when the times involved are a few seconds but can be important when the fit requires many hours or days of computing time.

The comparison of the slow and fast fit is available in the optimization summary after the slow fit.

mdl1.LMM.optsum

Initial parameter vector: [0.2060530221032275, 0.03994037605114987, 0.23131667674984469, -0.7941857249205363, -1.5391882085456918, -0.7766556048305914, 1.0, 1.0]
Initial objective value:  8204.421187737946

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [-Inf, -Inf, -Inf, -Inf, -Inf, -Inf, 0.0, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10]
initial_step:             [0.2060530221032275, 0.03994037605114987, 0.23131667674984469, -0.7941857249205363, -1.5391882085456918, -0.7766556048305914, 0.75, 0.75]
maxfeval:                 -1
maxtime:                  -1.0

Function evaluations:     188
Final parameter vector:   [0.1955554704948119, 0.05755412761885973, 0.3207843518569843, -1.0582595252774376, -2.1047524824609853, -1.0549789653925743, 1.339766125847893, 0.4953047709862237]
Final objective value:    8151.399795553173
Return code:              FTOL_REACHED