Data Transformations
In general, data transformations change raw feature vectors into a representation that is more suitable for various estimators.
Standardization (a.k.a. Z-score Normalization)
Standardization, also known as Z-score normalization, is a common requirement for many machine learning techniques. These techniques may perform poorly if the individual features do not roughly resemble standard normally distributed data, that is, data with zero mean and unit variance.
Standardization transforms data points into their corresponding standard scores by subtracting the mean and scaling to unit variance.
The standard score, also known as Z-score, is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.
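The definition above can be written as a formula. This is a sketch, with the symbols \mu (mean of the feature) and \sigma (standard deviation of the feature) introduced here for illustration:

```latex
z = \frac{x - \mu}{\sigma}
```

For the row [0.0, 1.0, 2.0] used in the examples on this page, \mu = 1 and \sigma = 1, which gives the standard scores [-1, 0, 1]. The scale values printed in that example (0.5 and 1.0) are consistent with the sample standard deviation, i.e. the n - 1 denominator.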
Standardization can be performed using t = fit(ZScoreTransform, ...) followed by StatsBase.transform(t, ...) or StatsBase.transform!(t, ...). standardize(ZScoreTransform, ...) is a shorthand to perform both operations in a single call.
StatsAPI.fit — Method
fit(ZScoreTransform, X; dims=nothing, center=true, scale=true)
Fit standardization parameters to vector or matrix X and return a ZScoreTransform transformation object.
Keyword arguments
dims: if 1, fit standardization parameters in column-wise fashion; if 2, fit in row-wise fashion. The default is nothing, which is equivalent to dims=2 with a deprecation warning.
center: if true (the default), center data so that its mean is zero.
scale: if true (the default), scale the data so that its variance is equal to one.
Examples
julia> using StatsBase

julia> X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
2×3 Matrix{Float64}:
 0.0  -0.5  0.5
 0.0   1.0  2.0

julia> dt = fit(ZScoreTransform, X, dims=2)
ZScoreTransform{Float64, Vector{Float64}}(2, 2, [0.0, 1.0], [0.5, 1.0])

julia> StatsBase.transform(dt, X)
2×3 Matrix{Float64}:
  0.0  -1.0  1.0
 -1.0   0.0  1.0
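The effect of the dims keyword can be illustrated with a small sketch. The 3×2 matrix below is hypothetical data chosen for illustration (it is not from the original page), and the checks use mean and std from the Statistics standard library:

```julia
using StatsBase, Statistics

# Hypothetical 3×2 data matrix: 3 rows, 2 columns.
X = [1.0 2.0; 3.0 6.0; 5.0 10.0]

# dims=1: parameters are fitted per column, so each column is standardized.
Zcol = standardize(ZScoreTransform, X, dims=1)

# dims=2: parameters are fitted per row, so each row is standardized.
Zrow = standardize(ZScoreTransform, X, dims=2)

# Each column of Zcol should have mean ≈ 0 and standard deviation ≈ 1.
println(all(isapprox.(mean(Zcol, dims=1), 0.0; atol=1e-12)))
println(all(isapprox.(std(Zcol, dims=1), 1.0)))

# Likewise for each row of Zrow.
println(all(isapprox.(mean(Zrow, dims=2), 0.0; atol=1e-12)))
println(all(isapprox.(std(Zrow, dims=2), 1.0)))
```

Passing dims explicitly also avoids the deprecation warning mentioned for the default dims=nothing.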
Unit Range Normalization
Unit range normalization, also known as min-max scaling, is an alternative data transformation which scales features to lie in the interval [0; 1]
.
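As a sketch, min-max scaling of a value x can be written as follows, where x_{\min} and x_{\max} (symbols introduced here for illustration) are the minimum and maximum of the feature:

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

For the row [0.0, 1.0, 2.0] used in the examples on this page, x_{\min} = 0 and x_{\max} = 2, which gives [0, 0.5, 1]. This matches the scale factor 0.5 = 1 / (2 - 0) printed for that row in the fitted UnitRangeTransform.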
Unit range normalization can be performed using t = fit(UnitRangeTransform, ...) followed by StatsBase.transform(t, ...) or StatsBase.transform!(t, ...). standardize(UnitRangeTransform, ...) is a shorthand to perform both operations in a single call.
StatsAPI.fit — Method
fit(UnitRangeTransform, X; dims=nothing, unit=true)
Fit scaling parameters to vector or matrix X and return a UnitRangeTransform transformation object.
Keyword arguments
dims: if 1, fit standardization parameters in column-wise fashion; if 2, fit in row-wise fashion. The default is nothing.
unit: if true (the default), shift the minimum data to zero.
Examples
julia> using StatsBase

julia> X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
2×3 Matrix{Float64}:
 0.0  -0.5  0.5
 0.0   1.0  2.0

julia> dt = fit(UnitRangeTransform, X, dims=2)
UnitRangeTransform{Float64, Vector{Float64}}(2, 2, true, [-0.5, 0.0], [1.0, 0.5])

julia> StatsBase.transform(dt, X)
2×3 Matrix{Float64}:
 0.5  0.0  1.0
 0.0  0.5  1.0
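One reason to keep fit and transform as separate steps is that a fitted transformation can be applied to data it was not fitted on. The sketch below reuses the matrix from the example above. The new column x_new is hypothetical data introduced for illustration:

```julia
using StatsBase

X_train = [0.0 -0.5 0.5; 0.0 1.0 2.0]
t = fit(UnitRangeTransform, X_train, dims=2)  # one (min, scale) pair per row

# Hypothetical new observation: a 2×1 matrix with one value for each row.
x_new = reshape([1.0, 4.0], 2, 1)

# The stored minimum and scale of each row are applied to the new values.
# Row 1: (1.0 - (-0.5)) * 1.0 = 1.5; row 2: (4.0 - 0.0) * 0.5 = 2.0.
y_new = StatsBase.transform(t, x_new)
println(y_new)
```

Because both new values lie above the maxima seen during fitting, the results fall outside [0, 1]. The interval is only guaranteed for the data the transformation was fitted on.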
Methods
StatsBase.transform — Function
transform(t::AbstractDataTransform, x)
Return a standardized copy of vector or matrix x using transformation t.
StatsBase.transform! — Function
transform!(t::AbstractDataTransform, x)
Apply transformation t to vector or matrix x in place.
StatsBase.reconstruct — Function
reconstruct(t::AbstractDataTransform, y)
Return a reconstruction of the data on its original scale from a transformed vector or matrix y using transformation t.
StatsBase.reconstruct! — Function
reconstruct!(t::AbstractDataTransform, y)
Reconstruct the data on its original scale in place from a transformed vector or matrix y using transformation t.
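The four methods above can be combined into a round trip. The following is a minimal sketch using the matrix from the earlier examples. All calls are written with the StatsBase. prefix, as elsewhere on this page:

```julia
using StatsBase

X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
t = fit(ZScoreTransform, X, dims=2)

# Copying variants: the input is left unchanged.
Y = StatsBase.transform(t, X)          # standardized copy of X
X_back = StatsBase.reconstruct(t, Y)   # back on the original scale
println(X_back ≈ X)

# In-place variants: the argument itself is overwritten.
W = copy(X)
StatsBase.transform!(t, W)     # W now holds the standardized values
StatsBase.reconstruct!(t, W)   # W is restored to the original scale
println(W ≈ X)
```

The comparison uses ≈ rather than == because the round trip involves floating-point arithmetic.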
StatsBase.standardize — Function
standardize(DT, X; dims=nothing, kwargs...)
Return a standardized copy of vector or matrix X along dimensions dims using transformation DT, which is a subtype of AbstractDataTransform:
ZScoreTransform
UnitRangeTransform
Example
julia> using StatsBase

julia> standardize(ZScoreTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0], dims=2)
2×3 Matrix{Float64}:
  0.0  -1.0  1.0
 -1.0   0.0  1.0

julia> standardize(UnitRangeTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0], dims=2)
2×3 Matrix{Float64}:
 0.5  0.0  1.0
 0.0  0.5  1.0
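Since standardize is described as a shorthand for fitting and then transforming, the two routes should give the same result. A small sketch to check this on the example matrix (the names one_call and two_step are illustrative only):

```julia
using StatsBase

X = [0.0 -0.5 0.5; 0.0 1.0 2.0]

for DT in (ZScoreTransform, UnitRangeTransform)
    one_call = standardize(DT, X, dims=2)                   # single call
    two_step = StatsBase.transform(fit(DT, X, dims=2), X)   # fit, then transform
    println(DT, ": ", one_call ≈ two_step)
end
```

The two-step form is the one to use when the fitted object is needed later, for example to transform further data or to call reconstruct.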
Types
StatsBase.UnitRangeTransform — Type
Unit range normalization
StatsBase.ZScoreTransform — Type
Standardization (Z-score transformation)