Formula syntax
MixedModels.jl uses the variant of the Wilkinson-Rogers (1973) notation for models of (co)variance implemented by StatsModels.jl. Additionally, MixedModels.jl extends this syntax to use the pipe `|` as the grouping operator. Further extensions are provided by RegressionFormulae.jl, in particular the use of the slash `/` as the nesting operator and the use of the caret `^` to indicate main effects and interactions up to a specified order. Currently, MixedModels.jl loads RegressionFormulae.jl by default, though this may change in a future release. If you require specific functionality from RegressionFormulae.jl, it is best to load it directly so that you can control the version used.
General rules
- "Addition" (`+`) indicates additive, i.e., main effects: `a + b` indicates main effects of `a` and `b`.
- "Multiplication" (`*`) indicates crossing: main effects and interactions between two terms: `a * b` indicates main effects of `a` and `b` as well as their interaction.
- Usual algebraic rules apply (associativity and distributivity):
  - `(a + b) * c` is equivalent to `a * c + b * c`.
  - `a * b * c` corresponds to main effects of `a`, `b`, and `c`, as well as all three two-way interactions and the three-way interaction.
- Categorical terms are expanded into the associated indicators/contrast variables. See the StatsModels.jl documentation on contrasts for more information.
- Interactions are expressed with the ampersand (`&`). (This is in contrast to R, which uses the colon `:` for this operation.) `a&b` is the interaction of `a` and `b`. For categorical terms, appropriate combinations of indicators/contrast variables are generated.
- Tilde (`~`) is used to separate response from predictors.
- The intercept is indicated by `1`.
- `y ~ 1 + (a + b) * c` is read as:
  - The response variable is `y`.
  - The model contains an intercept.
  - The model contains main effects of `a`, `b`, and `c`.
  - The model contains interactions between `a` and `c` and between `b` and `c` but not `a` and `b`.
- An intercept is included by default, i.e. there is an implicit `1 +` in every formula. The intercept may be suppressed by including a `0 +` in the formula. (In contrast to R, the use of `-1` is not supported.)
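The expansion rules above can be checked interactively. The following is a minimal sketch, assuming StatsModels.jl is installed and that its `@formula` macro applies the crossing and distributive rewrites when the formula is parsed; the variable names `y`, `a`, `b`, and `c` are placeholders and no data is needed to construct a formula.

```julia
using StatsModels

# Compact form: (a + b) crossed with c.
f_compact = @formula(y ~ 1 + (a + b) * c)

# The same model written out term by term, with & for interactions.
f_expanded = @formula(y ~ 1 + a + b + c + a & c + b & c)

# Print both formulas so that the right-hand sides can be compared.
println(f_compact)
println(f_expanded)

# Suppress the intercept with `0 +` (rather than `-1`, as in R).
f_no_intercept = @formula(y ~ 0 + a + b)
println(f_no_intercept)
```

If the rules hold as described, the first two formulas show the same set of right-hand-side terms.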
MixedModels.jl provided extensions
- The pipe operator (`|`) indicates grouping or blocking.
  - `(1 + a | subject)` indicates "by-subject random effects for the intercept and main effect `a`".
  - This is in line with the usual statistical reading of `|` as "conditional on".
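As an illustration of the grouping operator, here is a short sketch that fits a model with by-subject random intercepts and slopes. It assumes the `sleepstudy` example dataset bundled with MixedModels.jl, with columns `reaction`, `days`, and `subj`; any table with a response, a predictor, and a grouping column would do in its place.

```julia
using MixedModels

# Example dataset shipped with MixedModels.jl (assumed columns: reaction, days, subj).
sleepstudy = MixedModels.dataset(:sleepstudy)

# Fixed effects: intercept and days.
# Random effects: intercept and days, grouped by (conditional on) subj.
m = fit(MixedModel, @formula(reaction ~ 1 + days + (1 + days | subj)), sleepstudy)

println(m)
```

The term `(1 + days | subj)` plays the role of `(1 + a | subject)` in the description above.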
RegressionFormulae.jl provided extensions
- "Exponentiation" (`^`) works like repeated multiplication and generates all multiplicative and additive terms up to the given order.
  - `(a + b + c)^2` generates `a + b + c + a&b + a&c + b&c`, but not `a&b&c`.
  - The presence of interaction terms within the base will result in redundant terms and is currently unsupported.
- `fulldummy(a)` assigns "contrasts" to `a` that include all indicator columns (dummy variables) and an intercept column. The resulting overparameterization is generally useful in the fixed effects only as part of nesting.
- The slash operator (`/`) indicates nesting: `a / b` is read as "`b` is nested within `a`".
  - `a / b` expands to `a + fulldummy(a) & b`.
  - It is generally not necessary to specify nesting in the blocking variables when the inner levels are unique across outer levels. In other words, in a study with children (`C1`, `C2`, etc.) nested within schools (`S1`, `S2`, etc.),
    - it is not necessary to specify the nesting when `C1` identifies a unique child across schools. In this case, intercept-only random effects terms can be written as `(1|C) + (1|S)`.
    - it is necessary to specify the nesting when child identifiers are re-used across schools, e.g. `C1` refers to a child in `S1` and a different child in `S2`. In this case, the nested syntax `(1|S/C)` expands to `(1|S) + (1|S&C)`. The interaction term in the second blocking variable generates unique labels for each child across schools.
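The equivalence between the nested syntax and its expansion can be sketched with a dataset in which inner labels are re-used across outer levels. The example below assumes the `pastes` dataset bundled with MixedModels.jl and assumes it has columns named `strength`, `batch`, and `cask`, with cask labels repeated across batches (analogous to child identifiers re-used across schools). Check the column names in your installed version before relying on it.

```julia
using MixedModels

# Example dataset shipped with MixedModels.jl
# (assumed columns: strength, batch, cask; cask labels repeat across batches).
pastes = MixedModels.dataset(:pastes)

# Nested syntax: cask nested within batch.
nested = fit(MixedModel, @formula(strength ~ 1 + (1 | batch / cask)), pastes)

# Explicit expansion: one term per batch, one per batch-by-cask combination.
explicit = fit(MixedModel, @formula(strength ~ 1 + (1 | batch) + (1 | batch & cask)), pastes)

# Both specifications describe the same model, so the objectives should agree.
println(objective(nested))
println(objective(explicit))
```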
Mixed models in Wilkinson-Rogers and mathematical notation
Models fit with MixedModels.jl are generally linear mixed-effects models with unconstrained random effects covariance matrices and homoskedastic, normally distributed residuals. Under these assumptions, the model specification
response ~ 1 + (age + sex) * education * n_children + (1 | subject)
corresponds to the statistical model
\[\begin{align*} \left(Y |\mathcal{B}=b\right) &\sim N\left(X\beta + Zb, \sigma^2 I \right) \\ \mathcal{B} &\sim N\left(0, G\right) \end{align*}\]
for which we wish to obtain the maximum-likelihood estimates for $G$ and thus the fixed-effects $\beta$.
- The model contains no restrictions on $G$, except that it is positive semidefinite.
- The response $Y$ is the vector of observed values of the response variable (here, `response`).
- The fixed-effects design matrix $X$ consists of columns for
- the intercept, age, sex, education, and number of children (contrast coded as appropriate)
- the interactions of all lower-order terms, excluding any interaction between age and sex
- The random-effects design matrix $Z$ includes a column for
- the intercept for each subject
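For reference, integrating the random effects out of the conditional specification above gives the marginal distribution of the response. This is a standard consequence of the two normal distributions in the model, written here with the same symbols; $Z^\top$ denotes the transpose of $Z$.

\[\begin{align*} Y &\sim N\left(X\beta, \; Z G Z^\top + \sigma^2 I \right) \end{align*}\]

In other words, the fixed effects determine the mean of $Y$, while the random effects contribute only to its covariance.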