Data Transformations
In general, data transformations change raw feature vectors into a representation that is more suitable for various estimators.
Standardization
Standardization of a dataset is a common requirement for many machine learning techniques. These techniques might perform poorly if the individual features do not look more or less like standard normally distributed data.
Standardization transforms data points into their corresponding standard scores by subtracting the mean and scaling to unit variance.
The standard score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.
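The definition above can be written as a formula. The symbols here are assumptions for illustration: x is an observation, and μ and σ are the mean and standard deviation of the feature it belongs to.

```latex
z = \frac{x - \mu}{\sigma}
```

For instance, a value of 2.0 in a feature with mean 1.0 and standard deviation 1.0 has a standard score of 1.0.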
Standardization can be performed using fit(ZScoreTransform, ...).
StatsBase.fit — Method.
fit(ZScoreTransform, X; center=true, scale=true)
Fit standardization parameters to X and return a ZScoreTransform transformation object.
Arguments
X: matrix of samples used to fit the transformation parameters.
Keyword arguments
center: if true (the default), center the data so that its mean is zero.
scale: if true (the default), scale the data so that its variance is equal to one.
Examples
julia> using StatsBase
julia> X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
2×3 Array{Float64,2}:
0.0 -0.5 0.5
0.0 1.0 2.0
julia> dt = fit(ZScoreTransform, X)
ZScoreTransform{Float64}(2, [0.0, 1.0], [0.5, 1.0])
julia> StatsBase.transform(dt, X)
2×3 Array{Float64,2}:
0.0 -1.0 1.0
-1.0 0.0 1.0
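As a cross-check of what the fitted transform computes, the same result can be reproduced by hand. This is a minimal sketch rather than the library's implementation. It uses only the Statistics standard library and assumes that each row is treated as one feature and that the sample standard deviation (n - 1 in the denominator) is used, which is consistent with the output shown above. The names mu, sigma and Z are illustrative.

```julia
using Statistics  # standard library: mean, std

X = [0.0 -0.5 0.5; 0.0 1.0 2.0]

# Per-row mean and sample standard deviation (assumption: rows are features).
mu = mean(X, dims=2)     # 2×1 matrix holding 0.0 and 1.0
sigma = std(X, dims=2)   # 2×1 matrix holding 0.5 and 1.0

# Broadcasting subtracts each row's mean and divides by its standard deviation.
Z = (X .- mu) ./ sigma
```

The values of mu and sigma match the two vectors displayed in the fitted ZScoreTransform object, and Z matches the transformed matrix.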
Unit range normalization
Unit range normalization is an alternative data transformation which scales features to lie in the interval [0, 1].
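One common way to write this scaling is min-max rescaling. The symbols are assumptions for illustration: x_min and x_max denote the smallest and largest observed values of the feature that x belongs to.

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

For instance, a feature with values -0.5, 0.0 and 0.5 has minimum -0.5 and range 1.0, so the value 0.0 maps to 0.5.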
Unit range normalization can be performed using fit(UnitRangeTransform, ...).
StatsBase.fit — Method.
fit(UnitRangeTransform, X; center=true, scale=true)
Fit scaling parameters to X and return a UnitRangeTransform transformation object.
Arguments
X: matrix of samples used to fit the transformation parameters.
Keyword arguments
center: if true (the default), center the data around zero.
scale: if true (the default), perform variance scaling.
Examples
julia> using StatsBase
julia> X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
2×3 Array{Float64,2}:
0.0 -0.5 0.5
0.0 1.0 2.0
julia> dt = fit(UnitRangeTransform, X)
UnitRangeTransform{Float64}(2, true, [-0.5, 0.0], [1.0, 0.5])
julia> StatsBase.transform(dt, X)
2×3 Array{Float64,2}:
0.5 0.0 1.0
0.0 0.5 1.0
Additional methods
StatsBase.transform — Function.
transform(t::AbstractDataTransform, x)
Return a row-standardized version of vector or matrix x using the transformation t.
StatsBase.transform! — Function.
transform!(t::AbstractDataTransform, x)
Apply the transformation t to vector or matrix x in place.
StatsBase.reconstruct — Function.
reconstruct(t::AbstractDataTransform, y)
Return a reconstruction of the data on its original scale from a row-transformed vector or matrix y using the transformation t.
StatsBase.reconstruct! — Function.
reconstruct!(t::AbstractDataTransform, y)
Reconstruct the data on its original scale in place from a row-transformed vector or matrix y using the transformation t.
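The transform and reconstruct methods above are intended to be inverses of each other. The following sketch shows a round trip, assuming the StatsBase version documented here, where fit works on rows by default. The variable names Y and Xr are illustrative.

```julia
using StatsBase

X = [0.0 -0.5 0.5; 0.0 1.0 2.0]

dt = fit(ZScoreTransform, X)        # fit per-row mean and scale
Y = StatsBase.transform(dt, X)      # standardized copy; X itself is not modified
Xr = StatsBase.reconstruct(dt, Y)   # map the standardized data back to the original scale

roundtrip_ok = Xr ≈ X               # expected to hold up to floating-point rounding
```

The in-place variants transform! and reconstruct! follow the same pattern but overwrite their array argument, so pass a copy if the original data is still needed.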
StatsBase.standardize — Function.
standardize(DT, X; kwargs...)
Return a row-standardized version of matrix X using the transformation DT, which is a subtype of AbstractDataTransform:
ZScoreTransform
UnitRangeTransform
Example
julia> using StatsBase
julia> standardize(ZScoreTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0])
2×3 Array{Float64,2}:
0.0 -1.0 1.0
-1.0 0.0 1.0
julia> standardize(UnitRangeTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0])
2×3 Array{Float64,2}:
0.5 0.0 1.0
0.0 0.5 1.0