Prediction and simulation in Mixed-Effects Models

We recommend the MixedModelsSim.jl package and associated documentation for useful tools in constructing designs to simulate. For now, we'll use the sleep study data as a starting point.

using DataFrames
using MixedModels
using StatsBase
# use a DataFrame to make it easier to change things later
slp = DataFrame(MixedModels.dataset(:sleepstudy))
slpm = fit(MixedModel, @formula(reaction ~ 1 + days + (1|subj)), slp)
Linear mixed model fit by maximum likelihood
 reaction ~ 1 + days + (1 | subj)
   logLik   -2 logLik     AIC       AICc        BIC    
  -897.0393  1794.0786  1802.0786  1802.3072  1814.8505

Variance components:
            Column    Variance Std.Dev.
subj     (Intercept)  1296.8707 36.0121
Residual               954.5278 30.8954
 Number of obs: 180; levels of grouping factors: 18

  Fixed-effects parameters:
──────────────────────────────────────────────────
                Coef.  Std. Error      z  Pr(>|z|)
──────────────────────────────────────────────────
(Intercept)  251.405     9.50619   26.45    <1e-99
days          10.4673    0.801735  13.06    <1e-38
──────────────────────────────────────────────────

Prediction

The simplest form of prediction are the fitted values from the model: they are indeed the model's predictions for the observed data.

predict(slpm) ≈ fitted(slpm)
true

When generalizing to new data, we need to consider what happens if there are new, previously unobserved levels of the grouping variable(s). MixedModels.jl provides three options:

  1. :error: error on encountering unobserved levels
  2. :population: use population values (i.e. only the fixed effects) for observations with unobserved levels
  3. :missing: return missing for observations with unobserved levels.

Providing either no prediction (:error, :missing) or providing the population-level values seem to be the most reasonable ways for predicting new values. For simulating new values based on previous estimates of the variance components, use simulate.

In the case where there are no new levels of the grouping variable, all three of these methods provide the same results:

predict(slpm, slp; new_re_levels=:population) ≈ fitted(slpm)
true
predict(slpm, slp; new_re_levels=:missing) ≈ fitted(slpm)
true
predict(slpm, slp; new_re_levels=:error) ≈ fitted(slpm)
true

In the case where there are new levels of the grouping variable, these methods differ.

# create a new level
slp2 = transform(slp, :subj => ByRow(x -> (x == "S308" ? "NEW" : x)) => :subj)
180×3 DataFrame
 Row │ subj    days  reaction
     │ String  Int8  Float64
─────┼────────────────────────
   1 │ NEW        0   249.56
   2 │ NEW        1   258.705
   3 │ NEW        2   250.801
   4 │ NEW        3   321.44
   5 │ NEW        4   356.852
   6 │ NEW        5   414.69
   7 │ NEW        6   382.204
   8 │ NEW        7   290.149
  ⋮  │   ⋮      ⋮       ⋮
 174 │ S372       3   310.632
 175 │ S372       4   287.173
 176 │ S372       5   329.608
 177 │ S372       6   334.482
 178 │ S372       7   343.22
 179 │ S372       8   369.142
 180 │ S372       9   364.124
              165 rows omitted
try
  predict(slpm, slp2; new_re_levels=:error)
catch e
  show(e)
end
ArgumentError("New level enountered in subj")
predict(slpm, slp2; new_re_levels=:missing)
180-element Vector{Union{Missing, Float64}}:
    missing
    missing
    missing
    missing
    missing
    missing
    missing
    missing
    missing
    missing
   ⋮
 279.922125577311
 290.38941153690695
 300.85669749650293
 311.32398345609886
 321.79126941569484
 332.2585553752908
 342.72584133488675
 353.19312729448274
 363.66041325407866
predict(slpm, slp2; new_re_levels=:population)
180-element Vector{Float64}:
 251.4051048484854
 261.8723908080814
 272.3396767676773
 282.8069627272733
 293.2742486868692
 303.7415346464652
 314.2088206060612
 324.6761065656571
 335.1433925252531
 345.610678484849
   ⋮
 279.922125577311
 290.38941153690695
 300.85669749650293
 311.32398345609886
 321.79126941569484
 332.2585553752908
 342.72584133488675
 353.19312729448274
 363.66041325407866
Note

Currently, we do not support predicting based on a subset of the random effects.

Note

predict is deterministic (within the constraints of floating point) and never adds noise to the result. If you want to construct prediction intervals, then simulate will generate new data with noise (including new values of the random effects).

For generalized linear mixed models, there is an additional keyword argument to predict: type specifies whether the predictions are returned on the scale of the linear predictor (:linpred) or on the level of the response (:response) (i.e. the level at which the values were originally observed).

cbpp = DataFrame(MixedModels.dataset(:cbpp))
cbpp.rate = cbpp.incid ./ cbpp.hsz
gm = fit(MixedModel, @formula(rate ~ 1 + period + (1|herd)), cbpp, Binomial(), wts=float(cbpp.hsz))
predict(gm, cbpp; type=:response) ≈ fitted(gm)
false
logit(x) = log(x / (1 - x))
predict(gm, cbpp; type=:linpred) ≈ logit.(fitted(gm))
false

Simulation

In contrast to predict, simulate and simulate! introduce randomness. This randomness occurs both at the level of the observation-level (residual) variance and at the level of the random effects, where new conditional modes are sampled based on the specified covariance parameter (θ; see Details of the parameter estimation), which defaults to the estimated value of the model. For reproducibility, we specify a pseudorandom generator here; if none is provided, the global PRNG is taken as the default.

The simplest example of simulate takes a fitted model and generates a new response vector based on the existing model matrices combined with noise.

using Random
ynew = simulate(MersenneTwister(42), slpm)
180-element Vector{Float64}:
 283.00864553839773
 296.92520636955913
 321.96088453031666
 322.33650388844745
 396.9842691073609
 317.1515378204674
 348.51327076550524
 378.2824583013714
 302.3001679632582
 425.39060424137836
   ⋮
 314.68559367410927
 350.03193420209277
 293.64981943623627
 375.40195060061774
 324.5823728056021
 311.10278099074145
 338.76223251677374
 345.3556040031296
 301.2871942587682

The simulated response can also be placed in a pre-allocated vector:

ynew2 = zeros(nrow(slp))
simulate!(MersenneTwister(42), ynew2, slpm)
ynew2 ≈ ynew
true

Or even directly replace the previous response vector in a model, at which point the model must be refit to the new values:

slpm2 = deepcopy(slpm)
refit!(simulate!(MersenneTwister(42), slpm2))
Linear mixed model fit by maximum likelihood
 reaction ~ 1 + days + (1 | subj)
   logLik   -2 logLik     AIC       AICc        BIC    
  -903.1987  1806.3974  1814.3974  1814.6259  1827.1692

Variance components:
            Column    Variance Std.Dev.
subj     (Intercept)  2420.4311 49.1979
Residual               964.3707 31.0543
 Number of obs: 180; levels of grouping factors: 18

  Fixed-effects parameters:
───────────────────────────────────────────────────
                 Coef.  Std. Error      z  Pr(>|z|)
───────────────────────────────────────────────────
(Intercept)  263.59      12.3684    21.31    <1e-99
days           9.35638    0.805858  11.61    <1e-30
───────────────────────────────────────────────────

This inplace simulation actually forms the basis of parametricbootstrap.

Finally, we can also simulate the response from entirely new data.

df = DataFrame(days = repeat(1:10, outer=20), subj=repeat(1:20, inner=10))
df[!, :subj] = string.("S", lpad.(df.subj, 2, "0"))
df[!, :reaction] .= 0
df
200×3 DataFrame
 Row │ days   subj    reaction
     │ Int64  String  Int64
─────┼─────────────────────────
   1 │     1  S01            0
   2 │     2  S01            0
   3 │     3  S01            0
   4 │     4  S01            0
   5 │     5  S01            0
   6 │     6  S01            0
   7 │     7  S01            0
   8 │     8  S01            0
  ⋮  │   ⋮      ⋮        ⋮
 194 │     4  S20            0
 195 │     5  S20            0
 196 │     6  S20            0
 197 │     7  S20            0
 198 │     8  S20            0
 199 │     9  S20            0
 200 │    10  S20            0
               185 rows omitted
ysim = simulate(MersenneTwister(42), slpm, df)
200-element Vector{Float64}:
 262.93578249909865
 276.85234333026
 301.8880214910175
 302.2636408491483
 376.91140606806186
 297.07867478116833
 328.4404077262061
 358.2095952620723
 282.2273049239591
 405.3177412020793
   ⋮
 336.4096572490652
 274.01980636706577
 232.79273156578114
 281.77385851062616
 296.8405267395686
 379.16601401109307
 305.03738358148354
 358.38445371575614
 381.723677467381

Note that this is a convenience method for creating a new model and then using the parameters from the old model to call simulate on that model. In other words, this method incurs the cost of constructing a new model and then discarding it. If you could re-use that model (e.g., fitting that model as part of a simulation study), it often makes sense to do these steps to perform these steps explicitly and avoid the unnecessary construction and discarding of an intermediate model:

msim = LinearMixedModel(@formula(reaction ~ 1 + days + (1|subj)), df)
simulate!(MersenneTwister(42), msim; θ=slpm.θ, β=slpm.β, σ=slpm.σ)
response(msim) ≈ ysim
true
fit!(msim)
Linear mixed model fit by maximum likelihood
 reaction ~ 1 + days + (1 | subj)
   logLik   -2 logLik     AIC       AICc        BIC    
  -996.0696  1992.1392  2000.1392  2000.3443  2013.3325

Variance components:
            Column    Variance Std.Dev.
subj     (Intercept)   663.5403 25.7593
Residual              1012.9883 31.8275
 Number of obs: 200; levels of grouping factors: 20

  Fixed-effects parameters:
───────────────────────────────────────────────────
                 Coef.  Std. Error      z  Pr(>|z|)
───────────────────────────────────────────────────
(Intercept)  259.607      7.53747   34.44    <1e-99
days           9.46755    0.783538  12.08    <1e-32
───────────────────────────────────────────────────

For simulating from generalized linear mixed models, there is no type option because the observation-level always occurs at the level of the response and not of the linear predictor.

Warning

Simulating the model response in place may not yield the same result as simulating into a pre-allocated or new vector, depending on choice of pseudorandom number generator. Random number generation in Julia allows optimization based on type, and the internal storage type of the model response (currently a view into a matrix storing the concatenated fixed-effects model matrix and the response) may not match the type of a pre-allocated or new vector. See also discussion here.

Note

All the methods that take new data as a table construct an additional MixedModel behind the scenes, even when the new data is exactly the same as the data that the model was fitted to. For the simulation methods in particular, these thus form a convenience wrapper for constructing a new model and calling simulate without new data on that model with the parameters from the original model.