# StatsModels.jl API

## Formulae and terms

`StatsModels.@formula`

— Macro`@formula(ex)`

Capture and parse a formula expression as a `Formula` struct.

A formula is an abstract specification of a dependence between *left-hand* and *right-hand* side variables as in, e.g., a regression model. Each side specifies at a high level how tabular data is to be converted to a numerical matrix suitable for modeling. This specification looks something like Julia code and is represented as a Julia `Expr`, but uses special syntax. The `@formula` macro takes an expression like `y ~ 1 + a*b`, transforms it according to the formula syntax rules into a lowered form (like `y ~ 1 + a + b + a&b`), and constructs a `Formula` struct which captures the original expression, the lowered expression, and the left- and right-hand sides.

Operators that have special interpretations in this syntax are:

- `~` is the formula separator, where it is a binary operator (the first argument is the left-hand side, and the second is the right-hand side).
- `+` concatenates variables as columns when generating a model matrix.
- `&` represents an *interaction* between two or more variables, which corresponds to a row-wise Kronecker product of the individual terms (or element-wise product if all terms involved are continuous/scalar).
- `*` expands to all main effects and interactions: `a*b` is equivalent to `a+b+a&b`, `a*b*c` to `a+b+c+a&b+a&c+b&c+a&b&c`, etc.
- `1`, `0`, and `-1` indicate the presence (for `1`) or absence (for `0` and `-1`) of an intercept column.

The rules that are applied are:

- The associative rule (un-nests nested calls to `+`, `&`, and `*`).
- The distributive rule (interactions `&` distribute over concatenation `+`).
- The `*` rule expands `a*b` to `a+b+a&b` (recursively).
- Subtraction is converted to addition and negation, so `x-1` becomes `x + -1` (applies only to subtraction of literal 1).
- Single-argument `&` calls are stripped, so `&(x)` becomes the main effect `x`.
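As a quick check of the `*` rule, the following sketch builds a formula and compares its right-hand side with the expanded terms written out by hand. It assumes a recent StatsModels release in which the right-hand side of a formula is stored as a tuple of terms; the variable names `y`, `a`, and `b` are placeholders.

```julia
using StatsModels

f = @formula(y ~ 1 + a*b)

# `a*b` expands to the main effects plus the interaction: a + b + a&b
expanded = term(1) + term(:a) + term(:b) + term(:a) & term(:b)

# The macro's right-hand side should match the hand-expanded terms
f.rhs == expanded
```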

`StatsModels.term`

— Function`term(x)`

Wrap argument in an appropriate `AbstractTerm` type: `Symbol`s and `AbstractString`s become `Term`s, and `Number`s become `ConstantTerm`s. Any `AbstractTerm`s are unchanged. `AbstractString`s are converted to symbols before wrapping.

**Example**

```
julia> ts = term.((1, :a, "b"))
1
a(unknown)
b(unknown)
julia> typeof(ts)
Tuple{ConstantTerm{Int64}, Term, Term}
```

`StatsAPI.coefnames`

— Function`coefnames(model::StatisticalModel)`

Return the names of the coefficients.

`StatsModels.modelcols`

— Function`modelcols(t::AbstractTerm, data)`

Create a numerical "model columns" representation of data based on an `AbstractTerm`. `data` can either be a whole table (a property-accessible collection of iterable columns or iterable collection of property-accessible rows, as defined by Tables.jl) or a single row (in the form of a `NamedTuple` of scalar values). Tables will be converted to a `NamedTuple` of `Vector`s (e.g., a `Tables.ColumnTable`).

`modelcols(ts::NTuple{N, AbstractTerm}, data) where N`

When a tuple of terms is provided, `modelcols` broadcasts over the individual terms. To create a single matrix, wrap the tuple in a `MatrixTerm`.

**Example**

```
julia> using StableRNGs; rng = StableRNG(1);
julia> d = (a = [1:9;], b = rand(rng, 9), c = repeat(["d","e","f"], 3));
julia> ts = apply_schema(term.((:a, :b, :c)), schema(d))
a(continuous)
b(continuous)
c(DummyCoding:3→2)
julia> cols = modelcols(ts, d)
([1, 2, 3, 4, 5, 6, 7, 8, 9], [0.5851946422124186, 0.07733793456911231, 0.7166282400543453, 0.3203570514066232, 0.6530930076222579, 0.2366391513734556, 0.7096838914472361, 0.5577872440804086, 0.05079002172175784], [0.0 0.0; 1.0 0.0; … ; 1.0 0.0; 0.0 1.0])
julia> reduce(hcat, cols)
9×4 Matrix{Float64}:
1.0 0.585195 0.0 0.0
2.0 0.0773379 1.0 0.0
3.0 0.716628 0.0 1.0
4.0 0.320357 0.0 0.0
5.0 0.653093 1.0 0.0
6.0 0.236639 0.0 1.0
7.0 0.709684 0.0 0.0
8.0 0.557787 1.0 0.0
9.0 0.05079 0.0 1.0
julia> modelcols(MatrixTerm(ts), d)
9×4 Matrix{Float64}:
1.0 0.585195 0.0 0.0
2.0 0.0773379 1.0 0.0
3.0 0.716628 0.0 1.0
4.0 0.320357 0.0 0.0
5.0 0.653093 1.0 0.0
6.0 0.236639 0.0 1.0
7.0 0.709684 0.0 0.0
8.0 0.557787 1.0 0.0
9.0 0.05079 0.0 1.0
```

`StatsModels.termnames`

— Function`termnames(model::StatisticalModel)`

Return the names of terms used in the formula of `model`.

This is a convenience method for `termnames(formula(model))`, which returns a two-tuple of `termnames` applied to the left and right hand sides of the formula.

For `RegressionModel`s with only continuous predictors, this is the same as `(responsename(model), coefnames(model))` and `coefnames(formula(model))`.

For models with categorical predictors, the returned names reflect the variable name and not the coefficients resulting from the choice of contrast coding.

See also `coefnames`.
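The difference between term names and coefficient names is easiest to see with a categorical predictor. The sketch below uses made-up data and applies a schema without a model context, so no implicit intercept is added. It assumes a StatsModels version that provides `termnames`.

```julia
using StatsModels

# Illustrative data: one continuous predictor `x` and one categorical predictor `g`
d = (y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     x = [0.1, 0.4, 0.2, 0.9, 0.5, 0.7],
     g = ["a", "b", "c", "a", "b", "c"])

f = apply_schema(@formula(y ~ x + g), schema(d))

# Variable-level names: the response plus one name per predictor variable
termnames(f)

# Coefficient-level names: `g` contributes one name per contrast column
coefnames(f)   # ("y", ["x", "g: b", "g: c"]) with the default DummyCoding
```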

`termnames(t::FormulaTerm)`

Return a two-tuple of `termnames` applied to the left and right hand sides of the formula.

Until `apply_schema` has been called, literal `1` in formulae is interpreted as `ConstantTerm(1)` and will thus appear as `"1"` in the returned term names.

```
julia> termnames(@formula(y ~ 1 + x * y + (1+x|g)))
("y", ["1", "x", "y", "x & y", "(1 + x) | g"])
```

Similarly, formulae with an implicit intercept will not have a `"1"` in their variable names, because the implicit intercept does not exist until `apply_schema` is called (and may not exist for certain model contexts).

```
julia> termnames(@formula(y ~ x * y + (1+x|g)))
("y", ["x", "y", "x & y", "(1 + x) | g"])
```

`termnames(term::AbstractTerm)`

Return the name of the statistical variable associated with a term.

Return value is either a `String`, an iterable of `String`s, or the empty vector if there is no associated variable (e.g. `termnames(InterceptTerm{false}())`).

### Higher-order terms

`StatsModels.FormulaTerm`

— Type`FormulaTerm{L,R} <: AbstractTerm`

Represents an entire formula, with a left- and right-hand side. These can be of any type (captured by the type parameters).

**Fields**

- `lhs::L`: The left-hand side (e.g., response)
- `rhs::R`: The right-hand side (e.g., predictors)
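A `FormulaTerm` can also be assembled from terms directly, without the macro. The following is a minimal sketch with placeholder variable names; it relies on the `~` and `+` methods that StatsModels defines for terms.

```julia
using StatsModels

# The same formula, built with the macro and from terms at run time
f_macro = @formula(y ~ a + b)
f_terms = term(:y) ~ term(:a) + term(:b)

f_macro.lhs   # the response term
f_macro.rhs   # the predictor terms, collected in a tuple

f_macro == f_terms
```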

`StatsModels.InteractionTerm`

— Type`InteractionTerm{Ts} <: AbstractTerm`

Represents an *interaction* between two or more individual terms.

Generated by combining multiple `AbstractTerm`s with `&` (which is what calls to `&` in a `@formula` lower to).

**Fields**

- `terms::Ts`: the terms that participate in the interaction.

**Example**

```
julia> using StableRNGs; rng = StableRNG(1);
julia> d = (y = rand(rng, 9), a = 1:9, b = rand(rng, 9), c = repeat(["d","e","f"], 3));
julia> t = InteractionTerm(term.((:a, :b, :c)))
a(unknown) & b(unknown) & c(unknown)
julia> t == term(:a) & term(:b) & term(:c)
true
julia> t = apply_schema(t, schema(d))
a(continuous) & b(continuous) & c(DummyCoding:3→2)
julia> modelcols(t, d)
9×2 Matrix{Float64}:
0.0 0.0
1.88748 0.0
0.0 1.33701
0.0 0.0
0.725357 0.0
0.0 0.126744
0.0 0.0
4.93994 0.0
0.0 4.33378
julia> modelcols(t.terms, d)
([1, 2, 3, 4, 5, 6, 7, 8, 9], [0.236781883208121, 0.9437409715735081, 0.4456708824294644, 0.7636794266904741, 0.14507148958283067, 0.021124039581375875, 0.15254507694061115, 0.617492416565387, 0.48153065407402607], [0.0 0.0; 1.0 0.0; … ; 1.0 0.0; 0.0 1.0])
```

`StatsModels.FunctionTerm`

— Type`FunctionTerm{F,Args} <: AbstractTerm`

Represents a call to a Julia function. The first type parameter is the type of the captured function (e.g., `typeof(log)`), and the second is the type of the captured arguments (e.g., a `Vector` of `AbstractTerm`s).

Nested function calls are captured as further `FunctionTerm`s.

**Fields**

- `f::F`: the captured function (e.g., `log`)
- `args::Args`: the arguments of the call passed to `@formula`, each captured as an `AbstractTerm`. Usually this is a `Vector{<:AbstractTerm}`.
- `exorig::Expr`: the original expression passed to `@formula`

**Type parameters**

- `F`: the type of the captured function (e.g., `typeof(log)`)
- `Args`: the type of container of captured arguments.

**Example**

```
julia> f = @formula(y ~ log(1 + a + b))
FormulaTerm
Response:
y(unknown)
Predictors:
(a,b)->log(1 + a + b)
julia> typeof(f.rhs)
FunctionTerm{typeof(log), Vector{FunctionTerm{typeof(+), Vector{AbstractTerm}}}}
julia> typeof(only(f.rhs.args))
FunctionTerm{typeof(+), Vector{AbstractTerm}}
julia> only(f.rhs.args).args
3-element Vector{AbstractTerm}:
1
a(unknown)
b(unknown)
julia> f.rhs.f(1 + 3 + 4)
2.0794415416798357
julia> modelcols(f.rhs, (a=3, b=4))
2.0794415416798357
julia> modelcols(f.rhs, (a=[3, 4], b=[4, 5]))
2-element Vector{Float64}:
2.0794415416798357
2.302585092994046
```

### Placeholder terms

`StatsModels.Term`

— Type`Term <: AbstractTerm`

A placeholder for a variable in a formula where the type (and necessary data invariants) is not yet known. This will be converted to a `ContinuousTerm` or `CategoricalTerm` by `apply_schema`.

**Fields**

- `sym::Symbol`: The name of the data column this term refers to.

`StatsModels.ConstantTerm`

— Type`ConstantTerm{T<:Number} <: AbstractTerm`

Represents a literal number in a formula. By default, it will be converted to an `InterceptTerm` by `apply_schema`.

**Fields**

- `n::T`: The number represented by this term.

### Concrete terms

These are all generated by `apply_schema`.

`StatsModels.ContinuousTerm`

— Type`ContinuousTerm <: AbstractTerm`

Represents a continuous variable, with a name and summary statistics.

**Fields**

- `sym::Symbol`: The name of the variable
- `mean::T`: Mean
- `var::T`: Variance
- `min::T`: Minimum value
- `max::T`: Maximum value

`StatsModels.CategoricalTerm`

— Type`CategoricalTerm{C,T,N} <: AbstractTerm`

Represents a categorical term, with a name and `ContrastsMatrix`.

**Fields**

- `sym::Symbol`: The name of the variable
- `contrasts::ContrastsMatrix`: A contrasts matrix that captures the unique values this variable takes on and how they are mapped onto numerical predictors.

`StatsModels.InterceptTerm`

— Type`InterceptTerm{HasIntercept} <: AbstractTerm`

Represents the presence or (explicit) absence of an "intercept" term in a regression model. These terms are generated from `ConstantTerm`s in a formula by `apply_schema(::ConstantTerm, schema, ::Type{<:StatisticalModel})`. A `1` yields `InterceptTerm{true}`, and `0` or `-1` yield `InterceptTerm{false}` (which explicitly omits an intercept for models which implicitly include one via the `implicit_intercept` trait).
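The effect of an implicit versus an explicitly omitted intercept can be seen by counting model matrix columns. This sketch uses made-up data and the generic `StatsModels.StatisticalModel` as the model context; a concrete model type from a modeling package would normally be used instead.

```julia
using StatsModels

d = (y = [1.0, 2.0, 3.0, 4.0], a = [0.5, 1.5, 2.5, 3.5])
sch = schema(d)
Mod = StatsModels.StatisticalModel

# No intercept mentioned: the model context adds one implicitly
f_implicit = apply_schema(@formula(y ~ a), sch, Mod)

# Explicit 0: the intercept is omitted
f_none = apply_schema(@formula(y ~ 0 + a), sch, Mod)

size(modelcols(f_implicit.rhs, d))   # (4, 2): intercept column plus `a`
size(modelcols(f_none.rhs, d))       # (4, 1): only `a`
```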

`ShiftedArrays.lead`

— Function`lead(term, nsteps::Integer)`

This `@formula` term is used to introduce lead variables. For example, `lead(x,1)` effectively adds a new column containing the value of the `x` column from the next row. If there is no such row (e.g. because this is the last row), then the lead column will contain `missing` for that entry.

Note: this is only a basic row-wise lead operation. It is up to the user to ensure that data is sorted by the temporal variable, and that observations are spaced with regular time steps (which may require adding extra rows filled with `missing` values).

`ShiftedArrays.lag`

— Function`lag(term, nsteps::Integer)`

This `@formula` term is used to introduce lagged variables. For example, `lag(x,1)` effectively adds a new column containing the value of the `x` column from the previous row. If there is no such row (e.g. because this is the first row), then the lagged column will contain `missing` for that entry.

Note: this is only a basic row-wise lag operation. It is up to the user to ensure that data is sorted by the temporal variable, and that observations are spaced with regular time steps (which may require adding extra rows filled with `missing` values).
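A minimal sketch of a lagged predictor on a small made-up table, assuming the rows are already sorted in time. The explicit `using StatsModels: lag` line is a precaution so that `lag` is in scope inside `@formula`; it should be harmless if `lag` is already exported.

```julia
using StatsModels
using StatsModels: lag   # make sure `lag` resolves inside @formula

# Illustrative data, one row per time step
d = (y = [1.0, 2.0, 3.0, 4.0], x = [11, 12, 13, 14])

f = apply_schema(@formula(y ~ lag(x, 1)), schema(d))

# The first entry is `missing`; each later row holds the previous row's `x`
X = modelcols(f.rhs, d)
```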

`StatsModels.MatrixTerm`

— Type`MatrixTerm{Ts} <: AbstractTerm`

A collection of terms that should be combined to produce a single numeric matrix.

A matrix term is created by `apply_schema` from a tuple of terms using `collect_matrix_terms`, which pulls out all the terms that are matrix terms as determined by the trait function `is_matrix_term`, which is true by default for all `AbstractTerm`s.

`StatsModels.collect_matrix_terms`

— Function

```
collect_matrix_terms(ts::TupleTerm)
collect_matrix_terms(t::AbstractTerm) = collect_matrix_terms((t, ))
```

Depending on whether the component terms are matrix terms (meaning they have `is_matrix_term(T) == true`), `collect_matrix_terms` will return

- A single `MatrixTerm` (if all components are matrix terms)
- A tuple of the components (if none of them are matrix terms)
- A tuple of terms, with all matrix terms collected into a single `MatrixTerm` in the first element of the tuple, and the remaining non-matrix terms passed through unchanged.

Because all terms are matrix terms by default (that is, `is_matrix_term(::Type{<:AbstractTerm}) = true`), the first case is by far the most common. The others are provided only for convenience when dealing with specialized terms that can't be concatenated into a single model matrix, like random effects terms in MixedModels.jl.

`StatsModels.is_matrix_term`

— Function`is_matrix_term(::Type{<:AbstractTerm})`

Does this type of term get concatenated with other matrix terms into a single model matrix? This controls the behavior of `collect_matrix_terms`, which collects all of its arguments for which `is_matrix_term` returns `true` into a `MatrixTerm`, and returns the rest unchanged.

Since all "normal" terms which describe one or more model matrix columns are matrix terms, this defaults to `true` for any `AbstractTerm`.

An example of a non-matrix term is a random effect term in MixedModels.jl.
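The sketch below shows how the trait changes what `collect_matrix_terms` returns. `SideTerm` is a hypothetical term type invented for this illustration; it is not part of StatsModels.

```julia
using StatsModels

# Hypothetical term type that opts out of the model matrix (illustration only)
struct SideTerm <: StatsModels.AbstractTerm end
StatsModels.is_matrix_term(::Type{SideTerm}) = false

# All components are matrix terms: a single MatrixTerm is returned
all_matrix = StatsModels.collect_matrix_terms((term(:a), term(:b)))

# Mixed case: matrix terms are gathered into the first element,
# and the non-matrix term is passed through unchanged
mixed = StatsModels.collect_matrix_terms((term(:a), SideTerm(), term(:b)))
```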

### Protection

These functions give more fine-grained control over whether function calls are treated as normal Julia calls ("protected" and captured as `FunctionTerm`s) or as `@formula` syntax ("unprotected").

`StatsModels.protect`

— Function`protect(term::T)`

Create a `Protected` context for interpreting `term` (and descendants) during `apply_schema`.

Outside a `@formula`, acts as a constructor for the singleton `Protected{T}`.

**Example**

```
julia> d = (y=rand(4), a=[1:4;], b=rand(4));
julia> f = @formula(y ~ 1 + protect(a+b));
julia> modelmatrix(f.rhs, d)
4×2 Matrix{Float64}:
1.0 1.91493
1.0 2.19281
1.0 3.77018
1.0 4.78052
julia> d.a .+ d.b
4-element Vector{Float64}:
1.9149290036628313
2.1928081162458755
3.7701803478856664
4.7805192636751865
```

`StatsModels.unprotect`

— Function

```
unprotect(term)
unprotect(::Protected{T})
```

Inside a `@formula`, removes `Protected` status for the argument term(s). This allows the `@formula`-specific interpretation of calls to `+`, `&`, `*`, and `~` to be restored inside an otherwise `Protected` context.

When called outside a `@formula`, unwraps `Protected{T}` to `T`.

**Example**

```
julia> d = (y=rand(4), a=[1.:4;], b=rand(4));
julia> f = @formula(y ~ 1 - unprotect(a&b));
julia> modelmatrix(f, d)
4×1 Matrix{Float64}:
0.08507099633716864
0.6143837675082491
-1.310541043656999
-2.1220770547007453
julia> 1 .- d.a .* d.b
4-element Vector{Float64}:
0.08507099633716864
0.6143837675082491
-1.310541043656999
-2.1220770547007453
```

`StatsModels.@support_unprotect`

— Macro`StatsModels.@support_unprotect f sch_types...`

Generate methods necessary for function `f` to support `unprotect` inside of a `@formula` with a schema of types `sch_types`. If not specified, `sch_types` defaults to `Schema, FullRank` (the two schema types defined in StatsModels itself).

Any function call that occurs as a child of a protected call is also protected by default. In order to support *unprotecting* functions/operators that work directly on `Term`s (like the built-in "special" operators `+`, `&`, `*`, and `~`), we need to add methods for `apply_schema(::FunctionTerm{typeof(f)}, ...)` that call `f` on the captured arguments before further schema application.

This macro generates the necessary method for `f`. For this to do something reasonable, a few conditions must be met:

1. Methods must exist for `f(args::AbstractTerm...)` matching the specific signatures that users provide when calling `f` in `@formula` (and usually, returning an `AbstractTerm` of some kind).
2. The custom term type returned by `new_term = f(args::AbstractTerm...)` needs to do something reasonable when `modelcols` is called on it.
3. The thing returned by `modelcols(new_term, data)` needs to be something that can be consumed as input to whatever the parent call was for `f` in the original formula expression.

To take a concrete example, if we have a function `g` that can do something meaningful with the output of `modelcols(::InteractionTerm, ...)`, then when a user provides something like

`@formula(g(unprotect(a & b)))`

that gets lowered to

`FunctionTerm(g, [FunctionTerm(&, [Term(:a), Term(:b)], ...)], ...)`

and we need to convert it to something like

`FunctionTerm(g, [Term(:a) & Term(:b)], ...)`

during schema application, which is what the method generated by `@support_unprotect &` does.

`StatsModels.Protected`

— Type`struct Protected{Ctx}`

Represent a context in which `@formula` DSL syntax (e.g. `&` to construct `InteractionTerm` rather than bitwise-and) and `apply_schema` transformations should not apply. This is automatically applied to the arguments of a `FunctionTerm`, meaning that by default calls to `+`, `&`, or `~` inside a `FunctionTerm` will be interpreted as calls to the normal Julia functions, rather than term union, interaction, or formula separation.

The only special behavior with `apply_schema` inside a `Protected` context is when a call to `unprotect` is encountered. At that point, everything below the call to `unprotect` is treated as formula-specific syntax.

A `Protected` context is created inside a `FunctionTerm` automatically, but can be manually created with a call to `protect`.

## Schema

`StatsModels.Schema`

— Type`StatsModels.Schema`

Struct that wraps a `Dict` mapping `Term`s to their concrete forms. This exists mainly for dispatch purposes and to support possibly more sophisticated behavior in the future.

A `Schema` behaves for all intents and purposes like an immutable `Dict`, and delegates the constructor as well as `getindex`, `get`, `merge!`, `merge`, `keys`, and `haskey` to the wrapped `Dict`.

`StatsModels.schema`

— Function

```
schema([terms::AbstractVector{<:AbstractTerm}, ]data, hints::Dict{Symbol})
schema(term::AbstractTerm, data, hints::Dict{Symbol})
```

Compute all the invariants necessary to fit a model with `terms`. A schema is a dict that maps `Term`s to their concrete instantiations (either `CategoricalTerm`s or `ContinuousTerm`s). "Hints" may optionally be supplied in the form of a `Dict` mapping term names (as `Symbol`s) to term or contrast types. If a hint is not provided for a variable, the appropriate term type will be guessed based on the data type from the data column: any numeric data is assumed to be continuous, and any non-numeric data is assumed to be categorical.

Returns a `StatsModels.Schema`, which is a wrapper around a `Dict` mapping `Term`s to their concrete instantiations (`ContinuousTerm` or `CategoricalTerm`).

**Example**

```
julia> using StableRNGs, StatsBase; rng = StableRNG(1);  # `sample` comes from StatsBase
julia> d = (x=sample(rng, [:a, :b, :c], 10), y=rand(rng, 10));
julia> ts = [Term(:x), Term(:y)];
julia> schema(ts, d)
StatsModels.Schema with 2 entries:
x => x
y => y
julia> schema(ts, d, Dict(:x => HelmertCoding()))
StatsModels.Schema with 2 entries:
x => x
y => y
julia> schema(term(:y), d, Dict(:y => CategoricalTerm))
StatsModels.Schema with 1 entry:
y => y
```

Note that concrete `ContinuousTerm` and `CategoricalTerm` and un-typed `Term`s print the same in a container, but when printed alone are different:

```
julia> sch = schema(ts, d)
StatsModels.Schema with 2 entries:
x => x
y => y
julia> term(:x)
x(unknown)
julia> sch[term(:x)]
x(DummyCoding:3→2)
julia> sch[term(:y)]
y(continuous)
```

`StatsModels.concrete_term`

— Function`concrete_term(t::Term, data[, hint])`

Create a concrete term from the placeholder `t` based on a data source and optional hint. If `data` is a table, `getproperty` is used to extract the appropriate column.

The `hint` can be a `Dict{Symbol}` of hints, or a specific hint: a concrete term type (`ContinuousTerm` or `CategoricalTerm`), or an instance of some `<:AbstractContrasts`, in which case a `CategoricalTerm` will be created using those contrasts.

If no hint is provided (or `hint==nothing`), the `eltype` of the data is used: `Number`s are assumed to be continuous, and all others are assumed to be categorical.

**Example**

```
julia> concrete_term(term(:a), [1, 2, 3])
a(continuous)
julia> concrete_term(term(:a), [1, 2, 3], nothing)
a(continuous)
julia> concrete_term(term(:a), [1, 2, 3], CategoricalTerm)
a(DummyCoding:3→2)
julia> concrete_term(term(:a), [1, 2, 3], EffectsCoding())
a(EffectsCoding:3→2)
julia> concrete_term(term(:a), [1, 2, 3], Dict(:a=>EffectsCoding()))
a(EffectsCoding:3→2)
julia> concrete_term(term(:a), (a = [1, 2, 3], b = [0.0, 0.5, 1.0]))
a(continuous)
```

`StatsModels.apply_schema`

— Function`apply_schema(t, schema::StatsModels.Schema[, Mod::Type = Nothing])`

Return a new term that is the result of applying `schema` to term `t` with destination model (type) `Mod`. If `Mod` is omitted, `Nothing` will be used.

When `t` is a `ContinuousTerm` or `CategoricalTerm` already, the term will be returned unchanged *unless* a matching term is found in the schema. This allows selective re-setting of a schema to change the contrast coding or levels of a categorical term, or to change a continuous term to categorical or vice versa.

When defining behavior for custom term types, it's best to dispatch on `StatsModels.Schema` for the second argument. Leaving it as `::Any` will work in *most* cases, but cause method ambiguity in some.

`apply_schema(t::AbstractTerm, schema::StatsModels.FullRank, Mod::Type)`

Apply a schema, under the assumption that when a less-than-full rank model matrix would be produced, categorical terms should be "promoted" to full rank (where a categorical variable with $k$ levels would produce $k$ columns, instead of $k-1$ in the standard contrast coding schemes). This step is applied automatically when `Mod <: StatisticalModel`, but other types of models can opt in by adding a method like

```
StatsModels.apply_schema(t::FormulaTerm, schema::StatsModels.Schema, Mod::Type{<:MyModelType}) =
    apply_schema(t, StatsModels.FullRank(schema), Mod)
```

See the section on Modeling categorical data in the docs for more information on how promotion of categorical variables works.
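To illustrate the promotion, the sketch below compares the number of columns produced for a three-level categorical variable with and without a model context. The data are made up, and the generic `StatsModels.StatisticalModel` stands in for a concrete model type.

```julia
using StatsModels

d = (y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     c = ["d", "e", "f", "d", "e", "f"])
sch = schema(d)

# No model context: `c` keeps the default DummyCoding with k - 1 = 2 columns
plain = apply_schema(@formula(y ~ c), sch)

# StatisticalModel context with the intercept suppressed:
# `c` is promoted to k = 3 indicator columns so the matrix stays full rank
promoted = apply_schema(@formula(y ~ 0 + c), sch, StatsModels.StatisticalModel)

size(modelcols(plain.rhs, d), 2)      # 2
size(modelcols(promoted.rhs, d), 2)   # 3
```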

## Modeling

`StatsAPI.fit`

— Function

```
fit(Mod::Type{<:StatisticalModel}, f::FormulaTerm, data, args...;
    contrasts::Dict{Symbol}, kwargs...)
```

Convert tabular data into a numeric response vector and predictor matrix using the formula `f`, and then `fit` the specified model type, wrapping the result in a `TableRegressionModel` or `TableStatisticalModel` (as appropriate).

This is intended as a backstop for modeling packages that implement model types that are subtypes of `StatsAPI.StatisticalModel` but do not explicitly support the full StatsModels terms-based interface. Currently this works by creating a `ModelFrame` from the formula and data, and then converting this to a `ModelMatrix`, but this is an internal implementation detail which may change in the near future.

Fit a statistical model.

`StatsAPI.gvif`

— Function`gvif(m::RegressionModel; scale=false)`

Compute the generalized variance inflation factor (GVIF).

If `scale=true`, then each GVIF is scaled by the degrees of freedom for (number of coefficients associated with) the predictor: $\mathrm{GVIF}^{1/(2\,\mathrm{df})}$.

The GVIF measures the increase in the variance of a (group of) parameter's estimate in a model with multiple parameters relative to the variance of a parameter's estimate in a model containing only that parameter. For continuous, numerical predictors, the GVIF is the same as the VIF, but for categorical predictors, the GVIF provides a single number for the entire group of contrast-coded coefficients associated with a categorical predictor.

See also `vif`.

**References**

Fox, J., & Monette, G. (1992). Generalized Collinearity Diagnostics. Journal of the American Statistical Association, 87(417), 178. doi:10.2307/2290467

`StatsModels.lrtest`

— Function`lrtest(mods::StatisticalModel...; atol::Real=0.0)`

For each sequential pair of statistical models in `mods...`, perform a likelihood ratio test to determine if the first one fits significantly better than the next.

A table is returned containing degrees of freedom (DOF), difference in DOF from the preceding model, log-likelihood, deviance, chi-squared statistic (i.e. absolute value of twice the difference in log-likelihood) and p-value for the comparison between the two models.

Optional keyword argument `atol` controls the numerical tolerance when testing whether the models are nested.

**Examples**

Suppose we want to compare the effects of two or more treatments on some result. Our null hypothesis is that `Result ~ 1` fits the data as well as `Result ~ 1 + Treatment`.

```
julia> using DataFrames, GLM
julia> dat = DataFrame(Result=[1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1],
Treatment=[1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
Other=string.([1, 1, 2, 1, 2, 1, 3, 1, 1, 2, 2, 1]));
julia> nullmodel = glm(@formula(Result ~ 1), dat, Binomial(), LogitLink());
julia> model = glm(@formula(Result ~ 1 + Treatment), dat, Binomial(), LogitLink());
julia> bigmodel = glm(@formula(Result ~ 1 + Treatment + Other), dat, Binomial(), LogitLink());
julia> lrtest(nullmodel, model, bigmodel)
Likelihood-ratio test: 3 models fitted on 12 observations
────────────────────────────────────────────────────
DOF ΔDOF LogLik Deviance Chisq p(>Chisq)
────────────────────────────────────────────────────
[1] 1 -8.1503 16.3006
[2] 2 1 -7.9780 15.9559 0.3447 0.5571
[3] 4 2 -7.0286 14.0571 1.8988 0.3870
────────────────────────────────────────────────────
julia> lrtest(bigmodel, model, nullmodel)
Likelihood-ratio test: 3 models fitted on 12 observations
────────────────────────────────────────────────────
DOF ΔDOF LogLik Deviance Chisq p(>Chisq)
────────────────────────────────────────────────────
[1] 4 -7.0286 14.0571
[2] 2 -2 -7.9780 15.9559 1.8988 0.3870
[3] 1 -1 -8.1503 16.3006 0.3447 0.5571
────────────────────────────────────────────────────
```

`StatsModels.formula`

— Function`formula(model)`

Retrieve the formula from a fitted or specified model.

`StatsAPI.modelmatrix`

— Function`modelmatrix(model::RegressionModel)`

Return the model matrix (a.k.a. the design matrix).

`StatsAPI.response`

— Function`response(model::RegressionModel)`

Return the model response (a.k.a. the dependent variable).

`StatsAPI.vif`

— Function`vif(m::RegressionModel)`

Compute the variance inflation factor (VIF).

The VIF measures the increase in the variance of a parameter's estimate in a model with multiple parameters relative to the variance of a parameter's estimate in a model containing only that parameter.

See also `gvif`.

This method will fail if there is (numerically) perfect multicollinearity, i.e. rank deficiency. In that case though, the VIF is not particularly informative anyway.

### Traits

`StatsModels.implicit_intercept`

— Function

```
implicit_intercept(T::Type)
implicit_intercept(x::T) = implicit_intercept(T)
```

Return `true` if models of type `T` should include an implicit intercept even if none is specified in the formula. This is `true` by default for all `T<:StatisticalModel`, and `false` for others. To specify that a model type `T` includes an intercept even if one is not specified explicitly in the formula, overload this function for the corresponding type: `implicit_intercept(::Type{<:T}) = true`.

If a model has an implicit intercept, it can be explicitly excluded by using `0` in the formula, which generates `InterceptTerm{false}` with `apply_schema`.

`StatsModels.drop_intercept`

— Function

```
drop_intercept(T::Type)
drop_intercept(x::T) = drop_intercept(T)
```

Define whether a given model automatically drops the intercept. Returns `false` by default. To specify that a model type `T` drops the intercept, overload this function for the corresponding type: `drop_intercept(::Type{<:T}) = true`.

Models that drop the intercept will be fitted without one: the intercept term will be removed even if explicitly provided by the user. Categorical variables will be expanded in the rank-reduced form (contrasts for `n` levels will only produce `n-1` columns).

### Wrappers

These are internal implementation details that are likely to change in the near future. In particular, the `ModelFrame` and `ModelMatrix` wrappers are dispreferred in favor of using terms directly, and can in most cases be replaced by something like

```
# instead of ModelMatrix(ModelFrame(f::FormulaTerm, data, model=MyModel))
sch = schema(f, data)
f = apply_schema(f, sch, MyModel)
response, predictors = modelcols(f, data)
```
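A runnable version of the snippet above, as a sketch with a small made-up table and the generic `StatsModels.StatisticalModel` standing in for `MyModel`:

```julia
using StatsModels

d = (y = [1.0, 2.0, 3.0, 4.0], x = [0.5, 1.5, 2.5, 3.5])

f = @formula(y ~ 1 + x)
sch = schema(f, d)                                      # data invariants for the terms in `f`
f = apply_schema(f, sch, StatsModels.StatisticalModel)  # concrete terms for this model context

# For a FormulaTerm, modelcols returns the response and the predictor matrix
y, X = modelcols(f, d)
```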

`StatsModels.ModelFrame`

— Type`ModelFrame(formula, data; model=StatisticalModel, contrasts=Dict())`

Wrapper that encapsulates a `FormulaTerm`, schema, data table, and model type.

This wrapper encapsulates all the information that's required to transform data of the same structure as the wrapped data frame into a model matrix (the `FormulaTerm`), as well as the information about how that formula term was instantiated (the schema and model type).

Creating a model frame involves first extracting the `schema` for the data (using any contrasts provided as hints), and then applying that schema with `apply_schema` to the formula in the context of the provided model type.

**Constructors**

`ModelFrame(f::FormulaTerm, data; model::Type{M} = StatisticalModel, contrasts::Dict = Dict())`

**Fields**

- `f::FormulaTerm`: Formula whose left hand side is the *response* and right hand side are the *predictors*.
- `schema::Any`: The schema that was applied to generate `f`.
- `data::D`: The data table being modeled. The only restriction is that `data` is a table (`Tables.istable(data) == true`)
- `model::Type{M}`: The type of the model that will be fit from this model frame.

**Examples**

```
julia> df = (x = 1:4, y = 5:8)
julia> mf = ModelFrame(@formula(y ~ 1 + x), df)
```

`StatsModels.ModelMatrix`

— Type`ModelMatrix(mf::ModelFrame)`

Convert a `ModelFrame` into a numeric matrix suitable for modeling.

**Fields**

- `m::AbstractMatrix{<:AbstractFloat}`: the generated numeric matrix
- `assign::Vector{Int}`: the index of the term corresponding to each column of `m`.

**Constructors**

```
ModelMatrix(mf::ModelFrame)
# Specify the type of the resulting matrix (default Matrix{Float64})
ModelMatrix{T <: AbstractMatrix{<:AbstractFloat}}(mf::ModelFrame)
```

`StatsModels.TableStatisticalModel`

— Type

Wrapper for a `StatisticalModel` that has been fit from a `@formula` and tabular data.

Most functions from the StatsBase API are simply delegated to the wrapped model, with the exception of functions like `fit`, `predict`, and `coefnames`, where the tabular nature of the data means that additional processing, or information provided by the formula, is required.

**Fields**

- `model::M`: the wrapped `StatisticalModel`.
- `mf::ModelFrame`: encapsulates the formula, schema, and model type.
- `mm::ModelMatrix{T}`: the model matrix that the model was fit from.

`StatsModels.TableRegressionModel`

— Type

Wrapper for a `RegressionModel` that has been fit from a `@formula` and tabular data.

Most functions from the StatsBase API are simply delegated to the wrapped model, with the exception of functions like `fit`, `predict`, and `coefnames`, where the tabular nature of the data means that additional processing, or information provided by the formula, is required.

**Fields**

- `model::M`: the wrapped `RegressionModel`.
- `mf::ModelFrame`: encapsulates the formula, schema, and model type.
- `mm::ModelMatrix{T}`: the model matrix that the model was fit from.