Prediction and simulation in Mixed-Effects Models
We recommend the MixedModelsSim.jl package and associated documentation for useful tools in constructing designs to simulate. For now, we'll use the sleep study data as a starting point.
using DataFrames
using MixedModels
using MixedModelsDatasets
using StatsBase
# use a DataFrame to make it easier to change things later
slp = DataFrame(MixedModelsDatasets.dataset(:sleepstudy))
slpm = fit(MixedModel, @formula(reaction ~ 1 + days + (1|subj)), slp)
Linear mixed model fit by maximum likelihood
reaction ~ 1 + days + (1 | subj)
logLik -2 logLik AIC AICc BIC
-897.0393 1794.0786 1802.0786 1802.3072 1814.8505
Variance components:
Column Variance Std.Dev.
subj (Intercept) 1296.8700 36.0121
Residual 954.5278 30.8954
Number of obs: 180; levels of grouping factors: 18
Fixed-effects parameters:
──────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|)
──────────────────────────────────────────────────
(Intercept) 251.405 9.50619 26.45 <1e-99
days 10.4673 0.801735 13.06 <1e-38
──────────────────────────────────────────────────
Prediction
The simplest form of prediction are the fitted values from the model: they are indeed the model's predictions for the observed data.
predict(slpm) ≈ fitted(slpm)
true
When generalizing to new data, we need to consider what happens if there are new, previously unobserved levels of the grouping variable(s). MixedModels.jl provides three options:
:error
: error on encountering unobserved levels:population
: use population values (i.e. only the fixed effects) for observations with unobserved levels:missing
: returnmissing
for observations with unobserved levels.
Providing either no prediction (:error
, :missing
) or providing the population-level values seem to be the most reasonable ways for predicting new values. For simulating new values based on previous estimates of the variance components, use simulate
.
In the case where there are no new levels of the grouping variable, all three of these methods provide the same results:
predict(slpm, slp; new_re_levels=:population) ≈ fitted(slpm)
true
predict(slpm, slp; new_re_levels=:missing) ≈ fitted(slpm)
true
predict(slpm, slp; new_re_levels=:error) ≈ fitted(slpm)
true
In the case where there are new levels of the grouping variable, these methods differ.
# create a new level
slp2 = transform(slp, :subj => ByRow(x -> (x == "S308" ? "NEW" : x)) => :subj)
180×3 DataFrame
Row │ subj days reaction
│ String Int8 Float64
─────┼────────────────────────
1 │ NEW 0 249.56
2 │ NEW 1 258.705
3 │ NEW 2 250.801
4 │ NEW 3 321.44
5 │ NEW 4 356.852
6 │ NEW 5 414.69
7 │ NEW 6 382.204
8 │ NEW 7 290.149
⋮ │ ⋮ ⋮ ⋮
174 │ S372 3 310.632
175 │ S372 4 287.173
176 │ S372 5 329.608
177 │ S372 6 334.482
178 │ S372 7 343.22
179 │ S372 8 369.142
180 │ S372 9 364.124
165 rows omitted
try
predict(slpm, slp2; new_re_levels=:error)
catch e
show(e)
end
ArgumentError("New level encountered in subj")
predict(slpm, slp2; new_re_levels=:missing)
180-element Vector{Union{Missing, Float64}}:
missing
missing
missing
missing
missing
missing
missing
missing
missing
missing
⋮
279.9221249015633
290.3894108611592
300.85669682075513
311.323982780351
321.79126873994693
332.25855469954286
342.7258406591388
353.1931266187347
363.6604125783306
predict(slpm, slp2; new_re_levels=:population)
180-element Vector{Float64}:
251.4051048484857
261.8723908080816
272.33967676767753
282.80696272727346
293.2742486868693
303.74153464646525
314.2088206060612
324.6761065656571
335.14339252525303
345.6106784848489
⋮
279.9221249015633
290.3894108611592
300.85669682075513
311.323982780351
321.79126873994693
332.25855469954286
342.7258406591388
353.1931266187347
363.6604125783306
predict
is deterministic (within the constraints of floating point) and never adds noise to the result. If you want to construct prediction intervals, then simulate
will generate new data with noise (including new values of the random effects).
For generalized linear mixed models, there is an additional keyword argument to predict
: type
specifies whether the predictions are returned on the scale of the linear predictor (:linpred
) or on the level of the response (:response)
(i.e. the level at which the values were originally observed).
cbpp = DataFrame(MixedModelsDatasets.dataset(:cbpp))
cbpp.rate = cbpp.incid ./ cbpp.hsz
gm = fit(MixedModel, @formula(rate ~ 1 + period + (1|herd)), cbpp, Binomial(), wts=float(cbpp.hsz))
predict(gm, cbpp; type=:response) ≈ fitted(gm)
false
logit(x) = log(x / (1 - x))
predict(gm, cbpp; type=:linpred) ≈ logit.(fitted(gm))
false
Simulation
In contrast to predict
, simulate
and simulate!
introduce randomness. This randomness occurs both at the level of the observation-level (residual) variance and at the level of the random effects, where new conditional modes are sampled based on the specified covariance parameter (θ; see Details of the parameter estimation), which defaults to the estimated value of the model. For reproducibility, we specify a pseudorandom generator here; if none is provided, the global PRNG is taken as the default.
The simplest example of simulate
takes a fitted model and generates a new response vector based on the existing model matrices combined with noise.
using Random
ynew = simulate(MersenneTwister(42), slpm)
180-element Vector{Float64}:
283.00863302238184
296.92519394126754
321.9608724725383
322.3364915740109
396.9842584252065
317.1515248417407
348.5132583181804
378.2824463449458
302.3001538081875
425.3905929506171
⋮
314.6855897970077
350.03193095773236
293.64981449171586
375.4019474690626
324.5823681153572
311.10277569146297
338.7622276547382
345.35559904256996
301.2871879112192
The simulated response can also be placed in a pre-allocated vector:
ynew2 = zeros(nrow(slp))
simulate!(MersenneTwister(42), ynew2, slpm)
ynew2 ≈ ynew
true
Or even directly replace the previous response vector in a model, at which point the model must be refit to the new values:
slpm2 = deepcopy(slpm)
refit!(simulate!(MersenneTwister(42), slpm2))
Linear mixed model fit by maximum likelihood
reaction ~ 1 + days + (1 | subj)
logLik -2 logLik AIC AICc BIC
-903.1987 1806.3974 1814.3974 1814.6259 1827.1692
Variance components:
Column Variance Std.Dev.
subj (Intercept) 2420.4301 49.1979
Residual 964.3708 31.0543
Number of obs: 180; levels of grouping factors: 18
Fixed-effects parameters:
───────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|)
───────────────────────────────────────────────────
(Intercept) 263.59 12.3684 21.31 <1e-99
days 9.35638 0.805859 11.61 <1e-30
───────────────────────────────────────────────────
This inplace simulation actually forms the basis of parametricbootstrap
.
Finally, we can also simulate the response from entirely new data.
df = DataFrame(days = repeat(1:10, outer=20), subj=repeat(1:20, inner=10))
df[!, :subj] = string.("S", lpad.(df.subj, 2, "0"))
df[!, :reaction] .= 0
df
200×3 DataFrame
Row │ days subj reaction
│ Int64 String Int64
─────┼─────────────────────────
1 │ 1 S01 0
2 │ 2 S01 0
3 │ 3 S01 0
4 │ 4 S01 0
5 │ 5 S01 0
6 │ 6 S01 0
7 │ 7 S01 0
8 │ 8 S01 0
⋮ │ ⋮ ⋮ ⋮
194 │ 4 S20 0
195 │ 5 S20 0
196 │ 6 S20 0
197 │ 7 S20 0
198 │ 8 S20 0
199 │ 9 S20 0
200 │ 10 S20 0
185 rows omitted
ysim = simulate(MersenneTwister(42), slpm, df)
200-element Vector{Float64}:
262.9357775452213
276.852338464107
301.8880169953778
302.2636360968504
376.911402948046
297.07866936458015
328.4404028410199
358.20959086778527
282.22729833102693
405.31773747345653
⋮
336.4096598692897
274.0198071343374
232.79273101832615
281.77385894268156
296.8405272885986
379.16601638767025
305.03738380655915
358.38445503137996
381.7236791103727
Note that this is a convenience method for creating a new model and then using the parameters from the old model to call simulate
on that model. In other words, this method incurs the cost of constructing a new model and then discarding it. If you could re-use that model (e.g., fitting that model as part of a simulation study), it often makes sense to do these steps to perform these steps explicitly and avoid the unnecessary construction and discarding of an intermediate model:
msim = LinearMixedModel(@formula(reaction ~ 1 + days + (1|subj)), df)
simulate!(MersenneTwister(42), msim; θ=slpm.θ, β=slpm.β, σ=slpm.σ)
response(msim) ≈ ysim
true
fit!(msim)
Linear mixed model fit by maximum likelihood
reaction ~ 1 + days + (1 | subj)
logLik -2 logLik AIC AICc BIC
-996.0696 1992.1392 2000.1392 2000.3443 2013.3325
Variance components:
Column Variance Std.Dev.
subj (Intercept) 663.5400 25.7593
Residual 1012.9883 31.8275
Number of obs: 200; levels of grouping factors: 20
Fixed-effects parameters:
───────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|)
───────────────────────────────────────────────────
(Intercept) 259.607 7.53747 34.44 <1e-99
days 9.46755 0.783538 12.08 <1e-32
───────────────────────────────────────────────────
For simulating from generalized linear mixed models, there is no type
option because the observation-level always occurs at the level of the response and not of the linear predictor.
Simulating the model response in place may not yield the same result as simulating into a pre-allocated or new vector, depending on choice of pseudorandom number generator. Random number generation in Julia allows optimization based on type, and the internal storage type of the model response (currently a view into a matrix storing the concatenated fixed-effects model matrix and the response) may not match the type of a pre-allocated or new vector. See also discussion here.
All the methods that take new data as a table construct an additional MixedModel
behind the scenes, even when the new data is exactly the same as the data that the model was fitted to. For the simulation methods in particular, these thus form a convenience wrapper for constructing a new model and calling simulate
without new data on that model with the parameters from the original model.