Modeling categorical data
To convert categorical data into a numerical representation suitable for modeling, StatsModels
implements a variety of contrast coding systems. Each contrast coding system maps a categorical vector with $k$ levels onto $k-1$ linearly independent model matrix columns.
The following contrast coding systems are implemented: DummyCoding, EffectsCoding, HelmertCoding, and ContrastsCoding.
How to specify contrast coding
The default contrast coding system is DummyCoding. To override this, use the contrasts argument when constructing a ModelFrame:
mf = ModelFrame(@formula(y ~ 1 + x), df, contrasts = Dict(:x => EffectsCoding()))
To change the contrast coding for one or more variables in place, use setcontrasts!:
StatsModels.setcontrasts! - Function.
setcontrasts!(mf::ModelFrame; kwargs...)
setcontrasts!(mf::ModelFrame, contrasts::Dict{Symbol})
Update the contrasts used for coding categorical variables in a ModelFrame in place. This is accomplished by computing a new schema based on the provided contrasts and the ModelFrame's data, and applying it to the ModelFrame's FormulaTerm.
Note that only the ModelFrame itself is mutated: because AbstractTerms are immutable, any changes will produce a copy.
Interface
StatsModels.AbstractContrasts - Type.
Interface to describe contrast coding systems for categorical variables.
Concrete subtypes of AbstractContrasts describe a particular way of converting a categorical data vector into numeric columns in a ModelMatrix. Each instantiation optionally includes the levels to generate columns for and the base level. If not specified, these will be taken from the data when a ContrastsMatrix is generated (during ModelFrame construction).
Constructors
For C <: AbstractContrasts:
C() # levels are inferred later
C(levels = ::Vector{Any}) # levels checked against data later
C(base = ::Any) # specify base level
C(levels = ::Vector{Any}, base = ::Any) # specify levels and base
Arguments
- levels: Optionally, the data levels can be specified here. This allows you to specify the order of the levels. If specified, the levels will be checked against the levels actually present in the data when the ContrastsMatrix is constructed. Any mismatch will result in an error, because levels missing in the data would lead to empty columns in the model matrix, and levels missing from the contrasts would lead to empty or undefined rows.
- base: The base level may also be specified. The actual interpretation of this depends on the particular contrast type, but in general it can be thought of as a "reference" level. It defaults to the first level.
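The level check described above can be pictured with a small stand-alone sketch. The helper check_levels is hypothetical and is not part of StatsModels; it only illustrates the idea that the specified levels and the levels observed in the data must agree in both directions, while the specified order is kept.

```julia
# Hypothetical helper (an assumption for illustration, not StatsModels code).
function check_levels(specified, data)
    observed = unique(data)
    # Levels named in the contrasts but absent from the data (empty columns)
    missing_in_data = setdiff(specified, observed)
    # Levels present in the data but absent from the contrasts (undefined rows)
    missing_in_contrasts = setdiff(observed, specified)
    if !isempty(missing_in_data) || !isempty(missing_in_contrasts)
        throw(ArgumentError("levels mismatch: not in data: $(missing_in_data); " *
                            "not in contrasts: $(missing_in_contrasts)"))
    end
    return specified   # the user-specified order is preserved
end

check_levels(["c", "b", "a"], ["a", "b", "c", "a"])
```

Calling the helper with a level that does not occur in the data, or with data containing an unlisted level, throws an ArgumentError in this sketch.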
Contrast coding systems
- DummyCoding: Code each non-base level as a 0-1 indicator column.
- EffectsCoding: Code each non-base level as 1, and base as -1.
- HelmertCoding: Code each non-base level as the difference from the mean of the lower levels.
- ContrastsCoding: Manually specify contrasts matrix.
The last coding type, ContrastsCoding, provides a way to manually specify a contrasts matrix. For a variable x with k levels, a contrasts matrix M is a k×(k-1) matrix that maps the k levels onto k-1 model matrix columns. Specifically, let X be the full-rank indicator matrix for x, where X[i,j] = 1 if x[i] == levels(x)[j], and 0 otherwise. Then the model matrix columns generated by the contrasts matrix M are Y = X * M.
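The relationship Y = X * M can be checked by hand. The following is a minimal sketch in Base Julia that does not use StatsModels; the data vector, the use of sort(unique(x)) as a stand-in for levels(x), and the particular M (dummy coding with "a" as the base level) are assumptions chosen for illustration.

```julia
# Hand-rolled illustration; StatsModels is not used here.
x = ["a", "b", "c", "b", "a"]     # categorical data
lvls = sort(unique(x))            # stand-in for levels(x): ["a", "b", "c"]

# Full-rank indicator matrix: X[i, j] = 1 if x[i] == lvls[j], and 0 otherwise
X = [xi == l ? 1.0 : 0.0 for xi in x, l in lvls]

# A k×(k-1) contrasts matrix with k = 3: one row per level, one column per
# generated model matrix column (here the rows for "b" and "c" are indicators)
M = [0.0 0.0;
     1.0 0.0;
     0.0 1.0]

# Model matrix columns generated by M: row i of Y is the row of M that
# belongs to the level of x[i]
Y = X * M
```

Because each row of X contains a single 1, multiplying by M simply looks up the row of M for the corresponding level.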
Extending
The easiest way to specify custom contrasts is with ContrastsCoding. But if you want to actually implement a custom contrast coding system, you can subtype AbstractContrasts. This requires a constructor, a contrasts_matrix method for constructing the actual contrasts matrix that maps from levels to ModelMatrix column values, and (optionally) a termnames method:
mutable struct MyCoding <: AbstractContrasts
...
end
contrasts_matrix(C::MyCoding, baseind, n) = ...
termnames(C::MyCoding, levels, baseind) = ...
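As a rough idea of what the body of such a contrasts_matrix method might compute, here is a stand-alone function that takes the same (baseind, n) arguments and returns an n×(n-1) matrix. The function name effects_like_matrix is hypothetical, and the reading of baseind as the index of the base level and n as the number of levels is an assumption; the sketch reproduces the effects-coding pattern and is not the StatsModels implementation.

```julia
using LinearAlgebra

# Hypothetical stand-alone function (not a StatsModels method).
# Assumes baseind is the index of the base level and n the number of levels.
function effects_like_matrix(baseind, n)
    not_base = setdiff(1:n, baseind)           # indices of the non-base levels
    mat = Matrix(1.0I, n, n)[:, not_base]      # indicator column per non-base level
    mat[baseind, :] .= -1.0                    # base level row is coded as -1
    return mat
end

effects_like_matrix(1, 4)
```

In an actual subtype, logic of this kind would live in the contrasts_matrix method that dispatches on the custom type.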
StatsModels.ContrastsMatrix - Type.
An instantiation of a contrast coding system for particular levels.
This type is used internally for generating model matrices based on categorical data, and most users will not need to deal with it directly. Conceptually, a ContrastsMatrix object stands for an instantiation of a contrast coding system for a particular set of categorical data levels.
If levels are specified in the AbstractContrasts, those will be used, and likewise for the base level (which defaults to the first level).
Constructors
ContrastsMatrix(contrasts::AbstractContrasts, levels::AbstractVector)
ContrastsMatrix(contrasts_matrix::ContrastsMatrix, levels::AbstractVector)
Arguments
- contrasts::AbstractContrasts: The contrast coding system to use.
- levels::AbstractVector: The levels to generate contrasts for.
- contrasts_matrix::ContrastsMatrix: Constructing a ContrastsMatrix from another will check that the levels match. This is used, for example, in constructing a model matrix from a ModelFrame using different data.
Contrast coding systems
StatsModels.DummyCoding - Type.
DummyCoding([base[, levels]])
Dummy coding generates one indicator column (1 or 0) for each non-base level.
Columns have non-zero mean and are collinear with an intercept column (and lower-order columns for interactions) but are orthogonal to each other. In a regression model, dummy coding leads to an intercept that is the mean of the dependent variable for the base level.
Also known as "treatment coding" or "one-hot encoding".
Examples
julia> StatsModels.ContrastsMatrix(DummyCoding(), ["a", "b", "c", "d"]).matrix
4×3 Array{Float64,2}:
0.0 0.0 0.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
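The statement about the intercept can be verified numerically. The sketch below uses only Base Julia and the standard libraries, with made-up toy data and a hand-written dummy contrasts matrix; it fits an ordinary least-squares model with an intercept column and compares the intercept with the mean of the base level.

```julia
using LinearAlgebra, Statistics

# Toy data (values are made up): three observations in each of three levels
x = repeat(["a", "b", "c"], inner = 3)
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 11.0, 12.0]
lvls = ["a", "b", "c"]

# Indicator matrix and dummy contrasts with "a" as the base level
X = [xi == l ? 1.0 : 0.0 for xi in x, l in lvls]
D = [0.0 0.0;
     1.0 0.0;
     0.0 1.0]

# Least-squares fit of y on an intercept column plus the dummy-coded columns
A = hcat(ones(length(y)), X * D)
β = A \ y

# Mean of y within the base level; β[1] agrees with it up to rounding
base_mean = mean(y[x .== "a"])
```

The remaining coefficients are the differences between each non-base level mean and the base level mean.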
StatsModels.EffectsCoding - Type.
EffectsCoding([base[, levels]])
Effects coding generates columns that code each non-base level as the deviation from the base level. For each non-base level x of variable, a column is generated with 1 where variable .== x and -1 where variable .== base.
EffectsCoding is like DummyCoding, but using -1 for the base level instead of 0.
When all levels are equally frequent, effects coding generates model matrix columns that are mean centered (have mean 0). For more than two levels the generated columns are not orthogonal. In a regression model with an effects-coded variable, the intercept corresponds to the grand mean.
Also known as "sum coding" or "simple coding". Note though that the default in R and SPSS is to use the last level as the base. Here we use the first level as the base, for consistency with other coding systems.
Examples
julia> StatsModels.ContrastsMatrix(EffectsCoding(), ["a", "b", "c", "d"]).matrix
4×3 Array{Float64,2}:
-1.0 -1.0 -1.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
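The properties claimed for effects coding can be checked directly on the matrix shown above. This is a small sketch using only the standard libraries; it assumes one observation per level (so all levels are equally frequent and the model matrix columns are just the columns of the contrasts matrix), and the response values are made up.

```julia
using LinearAlgebra, Statistics

E = [-1.0 -1.0 -1.0;
      1.0  0.0  0.0;
      0.0  1.0  0.0;
      0.0  0.0  1.0]

# Each column has mean 0 when every level occurs equally often
col_means = mean(E, dims = 1)

# Entries of E'E are dot products between columns; the off-diagonal entries
# are non-zero, so the columns are not orthogonal
G = E' * E

# With one (made-up) response value per level, the fitted intercept is the
# grand mean of the level means
y = [1.0, 2.0, 3.0, 6.0]
β = hcat(ones(4), E) \ y
grand_mean = mean(y)
```

Here G has 2 on the diagonal and 1 everywhere else, and β[1] matches grand_mean up to rounding.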
StatsModels.HelmertCoding - Type.
HelmertCoding([base[, levels]])
Helmert coding codes each level as the difference from the average of the lower levels.
For each non-base level, Helmert coding generates a column with -1 for each of n levels below, n for that level, and 0 above.
When all levels are equally frequent, Helmert coding generates columns that are mean-centered (mean 0) and orthogonal.
Examples
julia> StatsModels.ContrastsMatrix(HelmertCoding(), ["a", "b", "c", "d"]).matrix
4×3 Array{Float64,2}:
-1.0 -1.0 -1.0
1.0 -1.0 -1.0
0.0 2.0 -1.0
0.0 0.0 3.0
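The column rule stated above is enough to rebuild this matrix by hand. The helper helmert_matrix below is hypothetical (it is not the StatsModels implementation) and assumes the first level is the base; the checks afterwards assume one observation per level, so that the model matrix columns coincide with the columns of the contrasts matrix.

```julia
using LinearAlgebra

# Hypothetical helper: column j belongs to level j+1 and has -1 for each of
# the j levels below it, j for the level itself, and 0 for the levels above.
function helmert_matrix(k)
    H = zeros(k, k - 1)
    for j in 1:(k - 1)
        H[1:j, j] .= -1.0
        H[j + 1, j] = j
    end
    return H
end

H = helmert_matrix(4)

# Columns sum to zero (mean-centered) and H'H is diagonal (orthogonal columns)
column_sums = sum(H, dims = 1)
orthogonal = isdiag(H' * H)
```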
StatsModels.ContrastsCoding - Type.
ContrastsCoding(mat::Matrix[, base[, levels]])
Coding by manual specification of contrasts matrix. For k levels, the contrasts must be a k by k-1 Matrix.
Special internal contrasts
StatsModels.FullDummyCoding - Type.
FullDummyCoding()
Full-rank dummy coding generates one indicator (1 or 0) column for each level, including the base level. This is sometimes known as one-hot encoding.
Not exported but included here for the sake of completeness. Needed internally for some situations where a categorical variable with $k$ levels needs to be converted into $k$ model matrix columns instead of the standard $k-1$. This occurs when there are missing lower-order terms, as discussed below in Categorical variables in Formulas.
Examples
julia> StatsModels.ContrastsMatrix(StatsModels.FullDummyCoding(), ["a", "b", "c", "d"]).matrix
4×4 Array{Float64,2}:
1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0
0.0 0.0 1.0 0.0
0.0 0.0 0.0 1.0
Further details
Categorical variables in Formulas
Generating model matrices from multiple variables, some of which are categorical, requires special care. The reason for this is that rank-$k-1$ contrasts are appropriate for a categorical variable with $k$ levels when it aliases other terms, making it partially redundant. Using rank-$k$ for such a redundant variable will generally result in a rank-deficient model matrix and a model that can't be identified.
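The rank deficiency mentioned above is easy to reproduce. The following sketch uses only Base Julia and LinearAlgebra with a made-up categorical vector: it combines an intercept column first with all $k$ indicator columns and then with only $k-1$ of them, and compares the ranks.

```julia
using LinearAlgebra

x = ["a", "b", "c", "a", "b", "c"]
lvls = ["a", "b", "c"]

# Rank-k (full-rank) coding: one indicator column per level
X = [xi == l ? 1.0 : 0.0 for xi in x, l in lvls]
intercept = ones(length(x))

# Intercept plus all k = 3 indicator columns: 4 columns but only rank 3,
# because the indicator columns add up to the intercept column
full = hcat(intercept, X)
rank_full = rank(full)

# Intercept plus k-1 = 2 columns (base level dropped): 3 columns, rank 3
reduced = hcat(intercept, X[:, 2:end])
rank_reduced = rank(reduced)
```

The first matrix has more columns than its rank, so its coefficients cannot be uniquely identified; the second does not have this problem.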
A categorical variable in a term aliases the term that remains when that variable is dropped. For example, with categorical a:
- In a, the sole variable a aliases the intercept term 1.
- In a&b, the variable a aliases the main effect term b, and vice versa.
- In a&b&c, the variable a aliases the interaction term b&c (regardless of whether b and c are categorical).
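The aliasing rule in the list above amounts to removing one variable from a term. Here is a minimal sketch under an assumed representation that is not StatsModels' own: a term is a Set of variable names, and the empty Set stands for the intercept term 1. The function name aliased_term is hypothetical.

```julia
# Hypothetical representation: a term is a Set of variable names, and the
# empty Set stands for the intercept term 1.
aliased_term(term, var) = setdiff(term, [var])

from_a   = aliased_term(Set([:a]), :a)           # empty Set: the intercept 1
from_ab  = aliased_term(Set([:a, :b]), :a)       # Set([:b]): the main effect b
from_ba  = aliased_term(Set([:a, :b]), :b)       # Set([:a]): the main effect a
from_abc = aliased_term(Set([:a, :b, :c]), :a)   # Set([:b, :c]): the interaction b&c
```

Using Sets means the order of variables within an interaction does not matter when two terms are compared.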
If a categorical variable aliases another term that is present elsewhere in the formula, we call that variable redundant. A variable is non-redundant when the term that it aliases is not present elsewhere in the formula. For categorical a, b, and c:
- In y ~ 1 + a, the a in the main effect of a aliases the intercept 1 and is therefore redundant.
- In y ~ 0 + a, a aliases the intercept, which is not present in the formula, and so a is non-redundant.
- In y ~ 1 + a + a&b:
  - The b in a&b is redundant because it aliases the main effect a: dropping b from a&b leaves a.
  - The a in a&b is non-redundant because it aliases b, which is not present anywhere else in the formula.
When constructing a ModelFrame from a Formula, each term is checked for non-redundant categorical variables. Any such non-redundant variables are "promoted" to full rank in that term by using FullDummyCoding instead of the contrasts used elsewhere for that variable.
One additional complexity is introduced by promoting non-redundant variables to full rank. For the purpose of determining redundancy, a full-rank dummy coded categorical variable implicitly introduces the term that it aliases into the formula. Thus, in y ~ 1 + a + a&b + b&c:
- In a&b, a aliases the main effect b, which is not explicitly present in the formula. This makes it non-redundant and so its contrast coding is promoted to FullDummyCoding, which implicitly introduces the main effect of b.
- Then, in b&c, the variable c is now redundant because it aliases the main effect of b, and so it keeps its original contrast coding system.
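The promotion rule, including the implicit introduction of aliased terms, can be sketched as a single pass over the terms. This is a simplified illustration and not StatsModels' actual code: the function promoted_variables is hypothetical, terms are represented as Sets of variable names (the empty Set being the intercept), and terms are assumed to be processed left to right with lower-order terms listed first.

```julia
# Simplified sketch of the promotion rule (an assumption-laden illustration).
function promoted_variables(terms, categorical)
    seen = Set{Set{Symbol}}()                # terms present so far, explicit or implicit
    promoted = Pair{Set{Symbol},Symbol}[]    # (term => variable) pairs promoted to full rank
    for term in terms
        for var in sort(collect(term))
            var in categorical || continue
            aliased = setdiff(term, [var])   # term that remains when var is dropped
            if !(aliased in seen)
                push!(promoted, term => var) # non-redundant: promote to full rank
                push!(seen, aliased)         # promotion implicitly introduces `aliased`
            end
        end
        push!(seen, term)
    end
    return promoted
end

intercept = Set{Symbol}()
# y ~ 1 + a + a&b + b&c, with a, b, and c all categorical
terms = [intercept, Set([:a]), Set([:a, :b]), Set([:b, :c])]
result = promoted_variables(terms, Set([:a, :b, :c]))
```

Under this sketch, a is promoted in a&b, which makes c redundant in b&c as described above. The same rule also promotes b in b&c, because the main effect c that it aliases is not present at that point.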