# Modeling categorical data

To convert categorical data into a numerical representation suitable for modeling, `StatsModels` implements a variety of **contrast coding systems**. Each contrast coding system maps a categorical vector with $k$ levels onto $k-1$ linearly independent model matrix columns.

The following contrast coding systems are implemented: `DummyCoding`, `EffectsCoding`, `HelmertCoding`, `SeqDiffCoding`, `HypothesisCoding`, and `StatsModels.ContrastsCoding`.

## How to specify contrast coding

The default contrast coding system is `DummyCoding`. To override this, use the `contrasts` argument when constructing a `ModelFrame`:

```
mf = ModelFrame(@formula(y ~ 1 + x), df, contrasts = Dict(:x => EffectsCoding()))
```

To change the contrast coding for one or more variables in place, use `StatsModels.setcontrasts!`:

`StatsModels.setcontrasts!` — Function

```
setcontrasts!(mf::ModelFrame; kwargs...)
setcontrasts!(mf::ModelFrame, contrasts::Dict{Symbol})
```

Update the contrasts used for coding categorical variables in a `ModelFrame` in place. This is accomplished by computing a new schema based on the provided contrasts and the `ModelFrame`'s data, and applying it to the `ModelFrame`'s `FormulaTerm`.

Note that only the `ModelFrame` itself is mutated: because `AbstractTerm`s are immutable, any changes will produce a copy.
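As a sketch of typical usage (the data frame `df` and its columns here are made up for illustration; this assumes DataFrames is available):

```
using StatsModels, DataFrames

# Hypothetical data: numeric response `y`, 4-level categorical `x`
df = DataFrame(y = rand(8), x = repeat(["a", "b", "c", "d"], 2))
mf = ModelFrame(@formula(y ~ 1 + x), df)

# Re-code `x` with effects coding, mutating `mf` in place
StatsModels.setcontrasts!(mf, Dict(:x => EffectsCoding()))
```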

## Interface

`StatsModels.AbstractContrasts` — Type

Interface to describe contrast coding systems for categorical variables.

Concrete subtypes of `AbstractContrasts` describe a particular way of converting a categorical data vector into numeric columns in a `ModelMatrix`. Each instantiation optionally includes the levels to generate columns for and the base level. If not specified, these will be taken from the data when a `ContrastsMatrix` is generated (during `ModelFrame` construction).

**Constructors**

For `C <: AbstractContrasts`:

```
C() # levels are inferred later
C(levels = ::Vector{Any}) # levels checked against data later
C(base = ::Any) # specify base level
C(levels = ::Vector{Any}, base = ::Any) # specify levels and base
```

**Arguments**

- `levels`: Optionally, the data levels can be specified here. This allows you to specify the order of the levels. If specified, the levels will be checked against the levels actually present in the data when the `ContrastsMatrix` is constructed. Any mismatch will result in an error, because levels missing in the data would lead to empty columns in the model matrix, and levels missing from the contrasts would lead to empty or undefined rows.
- `base`: The base level may also be specified. The actual interpretation of this depends on the particular contrast type, but in general it can be thought of as a "reference" level. It defaults to the first level.
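For example, both keywords can be combined to fix the level order and pick a non-default reference level (a sketch; the choice of `"c"` as base is arbitrary, for illustration only):

```
using StatsModels

# Explicit level order, with "c" as the base ("reference") level
contr = DummyCoding(base = "c", levels = ["a", "b", "c", "d"])

# The base level's row is all zeros; columns code "a", "b", and "d"
StatsModels.ContrastsMatrix(contr, ["a", "b", "c", "d"]).matrix
```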

**Contrast coding systems**

- `DummyCoding` - Code each non-base level as a 0-1 indicator column.
- `EffectsCoding` - Code each non-base level as 1, and base as -1.
- `HelmertCoding` - Code each non-base level as the difference from the mean of the lower levels.
- `SeqDiffCoding` - Code for differences between sequential levels of the variable.
- `HypothesisCoding` - Manually specify contrasts via a hypothesis matrix, which gives the weighting for the average response for each level.
- `StatsModels.ContrastsCoding` - Manually specify the contrasts matrix, which is directly copied into the model matrix.

The last two coding types, `HypothesisCoding` and `StatsModels.ContrastsCoding`, provide a way to manually specify a contrasts matrix. For a variable `x` with `k` levels, a contrasts matrix `M` is a `k×k-1` matrix that maps the `k` levels onto `k-1` model matrix columns. Specifically, let `X` be the full-rank indicator matrix for `x`, where `X[i,j] = 1` if `x[i] == levels(x)[j]`, and 0 otherwise. Then the model matrix columns generated by the contrasts matrix `M` are `Y = X * M`.
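A small worked sketch of `Y = X * M` using plain matrices (no StatsModels machinery), for `x = ["a", "b", "a", "c"]` with levels `["a", "b", "c"]` and dummy coding with base `"a"`:

```
# Indicator matrix X: X[i, j] = 1 if x[i] == levels(x)[j]
X = [1 0 0;   # "a"
     0 1 0;   # "b"
     1 0 0;   # "a"
     0 0 1]   # "c"

# Dummy-coding contrasts matrix M (3 levels -> 2 columns; base "a" row is zero)
M = [0 0;
     1 0;
     0 1]

# Model matrix columns: row i is the coded value of observation i
Y = X * M
```

Here each row of `Y` is `[1, 0]` for `"b"`, `[0, 1]` for `"c"`, and `[0, 0]` for the base level `"a"`.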

The *hypothesis matrix* is the `k-1×k` matrix that gives the weighted combinations of group mean responses that are represented by regression coefficients for the generated contrasts. The contrasts matrix is the generalized pseudo-inverse (e.g. `LinearAlgebra.pinv`) of the hypothesis matrix. See `HypothesisCoding` or Schad et al. (2020) for more information.

**Extending**

The easiest way to specify custom contrasts is with `HypothesisCoding` or `StatsModels.ContrastsCoding`. But if you want to actually implement a custom contrast coding system, you can subtype `AbstractContrasts`. This requires a constructor, a `contrasts_matrix` method for constructing the actual contrasts matrix that maps from levels to `ModelMatrix` column values, and (optionally) a `coefnames` method:

```
mutable struct MyCoding <: AbstractContrasts
...
end
contrasts_matrix(C::MyCoding, baseind, n) = ...
coefnames(C::MyCoding, levels, baseind) = ...
```

**References**

Schad, D. J., Vasishth, S., Hohenstein, S., & Kliegl, R. (2020). How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. *Journal of Memory and Language, 110*, 104038. https://doi.org/10.1016/j.jml.2019.104038

`StatsModels.ContrastsMatrix` — Type

An instantiation of a contrast coding system for particular levels.

This type is used internally for generating model matrices based on categorical data, and **most users will not need to deal with it directly**. Conceptually, a `ContrastsMatrix` object stands for an instantiation of a contrast coding *system* for a particular set of categorical *data levels*.

If levels are specified in the `AbstractContrasts`, those will be used, and likewise for the base level (which defaults to the first level).

**Constructors**

```
ContrastsMatrix(contrasts::AbstractContrasts, levels::AbstractVector)
ContrastsMatrix(contrasts_matrix::ContrastsMatrix, levels::AbstractVector)
```

**Arguments**

- `contrasts::AbstractContrasts`: The contrast coding system to use.
- `levels::AbstractVector`: The levels to generate contrasts for.
- `contrasts_matrix::ContrastsMatrix`: Constructing a `ContrastsMatrix` from another will check that the levels match. This is used, for example, in constructing a model matrix from a `ModelFrame` using different data.

## Contrast coding systems

`StatsModels.DummyCoding` — Type

```
DummyCoding([base[, levels]])
DummyCoding(; base=nothing, levels=nothing)
```

Dummy coding generates one indicator column (1 or 0) for each non-base level.

If `levels` are omitted or `nothing`, they are determined from the data by calling the `levels` function on the data when constructing `ContrastsMatrix`. If `base` is omitted or `nothing`, the first level is used as the base.

Columns have non-zero mean and are collinear with an intercept column (and lower-order columns for interactions) but are orthogonal to each other. In a regression model, dummy coding leads to an intercept that is the mean of the dependent variable for the base level.

Also known as "treatment coding" or "one-hot encoding".

**Examples**

```
julia> StatsModels.ContrastsMatrix(DummyCoding(), ["a", "b", "c", "d"]).matrix
4×3 Matrix{Float64}:
 0.0  0.0  0.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0
```

`StatsModels.EffectsCoding` — Type

```
EffectsCoding([base[, levels]])
EffectsCoding(; base=nothing, levels=nothing)
```

Effects coding generates columns that code each non-base level as the deviation from the base level. For each non-base level `x` of `variable`, a column is generated with 1 where `variable .== x` and -1 where `variable .== base`.

`EffectsCoding` is like `DummyCoding`, but using -1 for the base level instead of 0.

If `levels` are omitted or `nothing`, they are determined from the data by calling the `levels` function when constructing `ContrastsMatrix`. If `base` is omitted or `nothing`, the first level is used as the base.

When all levels are equally frequent, effects coding generates model matrix columns that are mean centered (have mean 0). For more than two levels the generated columns are not orthogonal. In a regression model with an effects-coded variable, the intercept corresponds to the grand mean.

Also known as "sum coding" or "simple coding". Note though that the default in R and SPSS is to use the *last* level as the base. Here we use the *first* level as the base, for consistency with other coding systems.

**Examples**

```
julia> StatsModels.ContrastsMatrix(EffectsCoding(), ["a", "b", "c", "d"]).matrix
4×3 Matrix{Float64}:
 -1.0  -1.0  -1.0
  1.0   0.0   0.0
  0.0   1.0   0.0
  0.0   0.0   1.0
```

`StatsModels.HelmertCoding` — Type

```
HelmertCoding([base[, levels]])
HelmertCoding(; base=nothing, levels=nothing)
```

Helmert coding codes each level as the difference from the average of the lower levels.

If `levels` are omitted or `nothing`, they are determined from the data by calling the `levels` function when constructing `ContrastsMatrix`. If `base` is omitted or `nothing`, the first level is used as the base. For each non-base level, Helmert coding generates a column with -1 for each of the n levels below it, n for that level, and 0 above.

When all levels are equally frequent, Helmert coding generates columns that are mean-centered (mean 0) and orthogonal.

**Examples**

```
julia> StatsModels.ContrastsMatrix(HelmertCoding(), ["a", "b", "c", "d"]).matrix
4×3 Matrix{Float64}:
 -1.0  -1.0  -1.0
  1.0  -1.0  -1.0
  0.0   2.0  -1.0
  0.0   0.0   3.0
```

`StatsModels.SeqDiffCoding` — Type

`SeqDiffCoding([base[, levels]])`

Code each level in order to test "sequential difference" hypotheses, which compare each level to the level below it (starting with the second level). Specifically, the $n$th predictor tests the hypothesis that the difference between levels $n$ and $n+1$ is zero.

Differences are computed in order of `levels`. If `levels` are omitted or `nothing`, they are determined from the data by calling the `levels` function when constructing `ContrastsMatrix`. If `base` is omitted or `nothing`, the first level is used as the base.

**Examples**

```
julia> seqdiff = StatsModels.ContrastsMatrix(SeqDiffCoding(), ["a", "b", "c", "d"]).matrix
4×3 Matrix{Float64}:
 -0.75  -0.5  -0.25
  0.25  -0.5  -0.25
  0.25   0.5  -0.25
  0.25   0.5   0.75
```

The interpretation of sequential difference coding may be hard to see from the contrasts matrix itself. The corresponding hypothesis matrix shows a clearer picture. From the rows of the hypothesis matrix, we can see that these contrasts test the difference between the first and second levels, the second and third, and the third and fourth, respectively:

```
julia> StatsModels.hypothesis_matrix(seqdiff)
3×4 Matrix{Int64}:
 -1   1   0  0
  0  -1   1  0
  0   0  -1  1
```

`StatsModels.HypothesisCoding` — Type

`HypothesisCoding(hypotheses::AbstractMatrix; levels=nothing, labels=nothing)`

Specify how to code a categorical variable in terms of a *hypothesis matrix*. For a variable with $k$ levels, this should be a $(k-1) \times k$ matrix. Each row of the matrix corresponds to a hypothesis about the mean outcomes under each of the $k$ levels of the predictor. The entries in the row give the weights assigned to each of these $k$ means, and the corresponding predictor in a regression model estimates the weighted sum of these cell means.

For instance, if we have a variable which has four levels A, B, C, and D, and we want to test the hypothesis that the difference between the average outcomes for levels A and B is different from zero, the corresponding row of the hypothesis matrix would be `[-1, 1, 0, 0]`. Likewise, to test whether the difference between B and C is different from zero, the hypothesis vector would be `[0, -1, 1, 0]`. To test each "successive difference" hypothesis, the full hypothesis matrix would be

```
julia> sdiff_hypothesis = [-1  1  0  0
                            0 -1  1  0
                            0  0 -1  1];
```

Contrasts are derived from the hypothesis matrix by taking the pseudoinverse:

```
julia> using LinearAlgebra

julia> sdiff_contrasts = pinv(sdiff_hypothesis)
4×3 Matrix{Float64}:
 -0.75  -0.5  -0.25
  0.25  -0.5  -0.25
  0.25   0.5  -0.25
  0.25   0.5   0.75
```

The above matrix is what is produced by constructing a `ContrastsMatrix` from a `HypothesisCoding` instance:

```
julia> seqdiff_hyp = HypothesisCoding(sdiff_hypothesis;
                                      levels=["a", "b", "c", "d"],
                                      labels=["b-a", "c-b", "d-c"]);

julia> StatsModels.ContrastsMatrix(seqdiff_hyp, ["a", "b", "c", "d"]).matrix
4×3 Matrix{Float64}:
 -0.75  -0.5  -0.25
  0.25  -0.5  -0.25
  0.25   0.5  -0.25
  0.25   0.5   0.75
```

The interpretation of such "sequential difference" contrasts is clear when expressed as a hypothesis matrix, but it is not obvious just from looking at the contrasts matrix. For this reason `HypothesisCoding` is preferred over `ContrastsCoding` for specifying custom contrast coding schemes.

Keyword arguments `levels` and `labels` give the names (in order) of the hypothesis matrix columns (corresponding to levels of the data) and rows (corresponding to the tested hypotheses). The `labels` also determine the names of the model matrix columns generated by these contrasts.

**References**

Schad, D. J., Vasishth, S., Hohenstein, S., & Kliegl, R. (2020). How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. *Journal of Memory and Language, 110*, 104038. https://doi.org/10.1016/j.jml.2019.104038

`StatsModels.hypothesis_matrix` — Function

```
hypothesis_matrix(cmat::AbstractMatrix; intercept=needs_intercept(cm), tolerance=1e-5)
hypothesis_matrix(contrasts::AbstractContrasts, n; baseind=1, kwargs...)
hypothesis_matrix(cmat::ContrastsMatrix; kwargs...)
```

Compute the hypothesis matrix for a contrasts matrix using the generalized pseudo-inverse (`LinearAlgebra.pinv`). `intercept` determines whether a column of ones is included before taking the pseudoinverse, which is needed for contrasts where the columns are not orthogonal to the intercept (e.g., have non-zero mean). If `tolerance != 0` (the default), the hypotheses are rounded to `Int`s if possible and `Rational`s if not, using the given tolerance. If `tolerance == 0`, then the hypothesis matrix is returned as-is.

The orientation of the hypothesis matrix is *opposite* that of the contrast matrix: each row of the contrasts matrix is a data level and each column is a predictor, whereas each row of the hypothesis matrix is the interpretation of a predictor with weights for each level given in the columns.

Note that this assumes a *balanced design* where there are the same number of observations in every cell. This is only important for non-orthogonal contrasts (including contrasts that are not orthogonal with the intercept).

**Examples**

```
julia> cmat = StatsModels.contrasts_matrix(DummyCoding(), 1, 4)
4×3 Matrix{Float64}:
 0.0  0.0  0.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0

julia> StatsModels.hypothesis_matrix(cmat)
4×4 Matrix{Int64}:
  1  0  0  0
 -1  1  0  0
 -1  0  1  0
 -1  0  0  1
```

For non-centered contrasts like `DummyCoding`, without including the intercept the hypothesis matrix is incorrect. So while `intercept=true` is the default for non-centered contrasts, you can see the (wrong) hypothesis matrix when ignoring it by forcing `intercept=false`:

```
julia> StatsModels.hypothesis_matrix(cmat, intercept=false)
3×4 Matrix{Int64}:
 0  1  0  0
 0  0  1  0
 0  0  0  1
```

The default behavior is to coerce to the nearest integer or rational value, with the tolerance given by the `tolerance` kwarg (defaults to `1e-5`). The raw pseudo-inverse matrix can be obtained as `Float64` by setting `tolerance=0`:

```
julia> StatsModels.hypothesis_matrix(cmat, tolerance=0) # ugly
4×4 Matrix{Float64}:
  1.0  -2.23753e-16   6.91749e-18  -1.31485e-16
 -1.0   1.0          -2.42066e-16   9.93754e-17
 -1.0   4.94472e-17   1.0           9.93754e-17
 -1.0   1.04958e-16  -1.31044e-16   1.0
```

Finally, the hypothesis matrix for a constructed `ContrastsMatrix` (as stored by `CategoricalTerm`s) can also be extracted:

```
julia> StatsModels.hypothesis_matrix(StatsModels.ContrastsMatrix(DummyCoding(), ["a", "b", "c", "d"]))
4×4 Matrix{Int64}:
  1  0  0  0
 -1  1  0  0
 -1  0  1  0
 -1  0  0  1
```

### Special internal contrasts

`StatsModels.FullDummyCoding` — Type

`FullDummyCoding()`

Full-rank dummy coding generates one indicator (1 or 0) column for each level, **including** the base level. This is sometimes known as one-hot encoding.

Not exported but included here for the sake of completeness. Needed internally for some situations where a categorical variable with $k$ levels needs to be converted into $k$ model matrix columns instead of the standard $k-1$. This occurs when there are missing lower-order terms, as discussed below in Categorical variables in `Formula`s.

**Examples**

```
julia> StatsModels.ContrastsMatrix(StatsModels.FullDummyCoding(), ["a", "b", "c", "d"]).matrix
4×4 Matrix{Float64}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0
```

`StatsModels.ContrastsCoding` — Type

```
StatsModels.ContrastsCoding(mat::AbstractMatrix[, levels])
StatsModels.ContrastsCoding(mat::AbstractMatrix; levels=nothing)
```

Coding by manual specification of the contrasts matrix. For `k` levels, the contrasts must be a `k` by `k-1` matrix. The contrasts in this matrix will be copied directly into the model matrix; if you want to specify your contrasts as hypotheses (i.e., weights assigned to each level's cell mean), you should use `HypothesisCoding` instead.
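A hedged sketch of manual specification (the matrix values here simply reuse the Helmert contrasts shown above, purely for illustration):

```
using StatsModels

# A 4×3 contrasts matrix for a 4-level variable, copied verbatim into the model matrix
mat = [-1.0 -1.0 -1.0;
        1.0 -1.0 -1.0;
        0.0  2.0 -1.0;
        0.0  0.0  3.0]

contr = StatsModels.ContrastsCoding(mat)

# The instantiated ContrastsMatrix holds the same values as `mat`
StatsModels.ContrastsMatrix(contr, ["a", "b", "c", "d"]).matrix
```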

## Further details

### Categorical variables in `Formula`s

Generating model matrices from multiple variables, some of which are categorical, requires special care. The reason for this is that rank-$k-1$ contrasts are appropriate for a categorical variable with $k$ levels when it *aliases* other terms, making it *partially redundant*. Using rank-$k$ for such a redundant variable will generally result in a rank-deficient model matrix and a model that can't be identified.

A categorical variable in a term *aliases* the term that remains when that variable is dropped. For example, with categorical `a`:

- In `a`, the sole variable `a` aliases the intercept term `1`.
- In `a&b`, the variable `a` aliases the main effect term `b`, and vice versa.
- In `a&b&c`, the variable `a` aliases the interaction term `b&c` (regardless of whether `b` and `c` are categorical).

If a categorical variable aliases another term that is present elsewhere in the formula, we call that variable *redundant*. A variable is *non-redundant* when the term that it aliases is *not* present elsewhere in the formula. For categorical `a`, `b`, and `c`:

- In `y ~ 1 + a`, the `a` in the main effect of `a` aliases the intercept `1`.
- In `y ~ 0 + a`, `a` does not alias any other terms and is *non-redundant*.
- In `y ~ 1 + a + a&b`:
  - The `b` in `a&b` is redundant because it aliases the main effect `a`: dropping `b` from `a&b` leaves `a`.
  - The `a` in `a&b` is *non-redundant* because it aliases `b`, which is not present anywhere else in the formula.
When constructing a `ModelFrame` from a `Formula`, each term is checked for non-redundant categorical variables. Any such non-redundant variables are "promoted" to full rank in that term by using `FullDummyCoding` instead of the contrasts used elsewhere for that variable.
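A minimal sketch of this promotion, using a made-up two-level variable (assumes DataFrames is available; column names are arbitrary):

```
using StatsModels, DataFrames

df = DataFrame(y = rand(4), a = ["p", "q", "p", "q"])

# With an intercept, `a` is redundant: one dummy-coded column plus the intercept
modelmatrix(ModelFrame(@formula(y ~ 1 + a), df))

# Without an intercept, `a` is non-redundant and is promoted to FullDummyCoding:
# one indicator column per level
modelmatrix(ModelFrame(@formula(y ~ 0 + a), df))
```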

One additional complexity is introduced by promoting non-redundant variables to full rank. For the purpose of determining redundancy, a full-rank dummy-coded categorical variable *implicitly* introduces the term that it aliases into the formula. Thus, in `y ~ 1 + a + a&b + b&c`:

- In `a&b`, `a` aliases the main effect `b`, which is not explicitly present in the formula. This makes it non-redundant, so its contrast coding is promoted to `FullDummyCoding`, which *implicitly* introduces the main effect of `b`.
- Then, in `b&c`, the variable `c` is now *redundant* because it aliases the main effect of `b`, and so it keeps its original contrast coding system.