Missing Data and Arrays

Representing missing data

DataArrays.NA (Constant)
NA

A value denoting missingness within the domain of any type.

NAtype

The type of a missing value, NA.


Arrays with possibly missing data

AbstractDataArray{T, N}

An N-dimensional AbstractArray whose entries can take on values of type T or the value NA.

AbstractDataVector{T}

A 1-dimensional AbstractDataArray with element type T.

AbstractDataMatrix{T}

A 2-dimensional AbstractDataArray with element type T.

DataArray{T,N}(d::Array{T,N}, m::AbstractArray{Bool} = falses(size(d)))

Construct a DataArray, an N-dimensional array with element type T that allows missing values. The resulting array uses the data in d, with m as a bitmask that marks missingness: for each index i in d, if m[i] is true, the array contains NA at index i; otherwise it contains d[i].
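
The mask semantics can be pictured with plain Julia arrays. The following is a minimal sketch, not the DataArrays implementation: masked_get is a hypothetical helper, and Julia's built-in missing stands in for NA so that the snippet runs without the package.

```julia
# Hypothetical helper (not part of DataArrays): read index i from a
# data/mask pair, yielding `missing` (a stand-in for NA) where m[i] is true.
masked_get(d, m, i) = m[i] ? missing : d[i]

d = [1, 2, 3]
m = [true, false, true]

# Positions 1 and 3 are masked, so only d[2] is visible.
vals = [masked_get(d, m, i) for i in eachindex(d)]
```

In this sketch d[1] and d[3] still hold values; the mask only hides them from the reader of the array.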

DataArray(T::Type, dims...)

Construct a DataArray with element type T and dimensions specified by dims. All elements default to NA.

Examples

julia> DataArray([1, 2, 3], [true, false, true])
3-element DataArrays.DataArray{Int64,1}:
  NA
 2
  NA

julia> DataArray(Float64, 3, 3)
3×3 DataArrays.DataArray{Float64,2}:
 NA  NA  NA
 NA  NA  NA
 NA  NA  NA

DataVector{T}

A 1-dimensional DataArray with element type T.

DataMatrix{T}

A 2-dimensional DataArray with element type T.

DataArrays.@data (Macro)
@data expr

Create a DataArray based on the given expression.

Examples

julia> @data [1, NA, 3]
3-element DataArrays.DataArray{Int64,1}:
 1
  NA
 3

julia> @data hcat(1:3, 4:6)
3×2 DataArrays.DataArray{Int64,2}:
 1  4
 2  5
 3  6

DataArrays.isna (Function)
isna(x) -> Bool

Determine whether x is missing, i.e. NA.

Examples

julia> isna(1)
false

julia> isna(NA)
true

isna(a::AbstractArray, i) -> Bool

Determine whether the element of a at index i is missing, i.e. NA.

Examples

julia> X = @data [1, 2, NA];

julia> isna(X, 2)
false

julia> isna(X, 3)
true

DataArrays.dropna (Function)
dropna(v::AbstractVector) -> AbstractVector

Return a copy of v with all NA elements removed.

Examples

julia> dropna(@data [NA, 1, NA, 2])
2-element Array{Int64,1}:
 1
 2

julia> dropna([4, 5, 6])
3-element Array{Int64,1}:
 4
 5
 6

DataArrays.padna (Function)
padna(dv::AbstractDataVector, front::Integer, back::Integer) -> DataVector

Pad dv with NA values. front is the number of NAs to add at the beginning of the array and back is the number of NAs to add at the end.

Examples

julia> padna(@data([1, 2, 3]), 1, 2)
6-element DataArrays.DataArray{Int64,1}:
  NA
 1
 2
 3
  NA
  NA

DataArrays.levels (Function)
levels(da::DataArray) -> DataVector

Return a vector of the unique values in da, excluding any NAs.

levels(a::AbstractArray) -> Vector

Equivalent to unique(a).

Examples

julia> levels(@data [1, 2, NA])
2-element DataArrays.DataArray{Int64,1}:
 1
 2

Pooled arrays

PooledDataArray(data::AbstractArray{T}, [pool::Vector{T}], [m::AbstractArray{Bool}], [r::Type])

Construct a PooledDataArray based on the unique values in the given array. PooledDataArrays store categorical data with a limited set of unique values efficiently: rather than storing all length(data) values, a PooledDataArray stores a smaller set of values (typically unique(data)) and an array of references to the stored values.

Optional arguments

  • pool: The possible values of data. Defaults to unique(data).

  • m: A missingness indicator akin to that of DataArray. Defaults to falses(size(data)).

  • r: The integer subtype used to store pool references. Defaults to UInt32.
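
The pool-plus-references layout described above can be sketched with ordinary arrays. This is an illustration under simplifying assumptions (no missing values; the names pool, refs, and rebuilt are chosen for this sketch), not the package's internal code.

```julia
data = repeat(["A", "B"], outer=4)

# Store each distinct value once...
pool = unique(data)
# ...and keep one small integer per element pointing into the pool.
refs = UInt32[findfirst(isequal(x), pool) for x in data]

# The original data can be recovered by looking every reference up in the pool.
rebuilt = [pool[r] for r in refs]
```

The saving comes from refs holding fixed-width integers while each distinct string is stored only once in pool.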

Examples

julia> d = repeat(["A", "B"], outer=4);

julia> p = PooledDataArray(d)
8-element DataArrays.PooledDataArray{String,UInt32,1}:
 "A"
 "B"
 "A"
 "B"
 "A"
 "B"
 "A"
 "B"

PooledDataArray(T::Type, [R::Type=UInt32], [dims...])

Construct a PooledDataArray with element type T, reference storage type R, and dimensions dims. If the dimensions are specified and nonzero, the array is filled with NA values.

Examples

julia> PooledDataArray(Int, 2, 2)
2×2 DataArrays.PooledDataArray{Int64,UInt32,2}:
 NA  NA
 NA  NA

@pdata expr

Create a PooledDataArray based on the given expression.

Examples

julia> @pdata ["Hello", NA, "World"]
3-element DataArrays.PooledDataArray{String,UInt32,1}:
 "Hello"
 NA
 "World"

DataArrays.compact (Function)
compact(d::PooledDataArray)

Return a PooledDataArray with the smallest possible reference type for the data in d.

Note

If the reference type is already the smallest possible for the data, the input array is returned, i.e. the function aliases the input.
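
One way to picture the choice of reference type is to pick the narrowest unsigned integer type able to index every entry of the pool. The helper smallest_reftype below is hypothetical and only sketches that idea; the package's actual selection logic may differ.

```julia
# Hypothetical helper: narrowest unsigned type that can hold indices 1..poolsize.
function smallest_reftype(poolsize::Integer)
    for T in (UInt8, UInt16, UInt32)
        poolsize <= typemax(T) && return T
    end
    return UInt64
end

small = smallest_reftype(2)     # a two-value pool fits in UInt8
large = smallest_reftype(300)   # more than 255 values needs UInt16
```

Under this rule, an array whose reference type already equals the result has nothing to shrink, which corresponds to the case where the input is returned unchanged.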

Examples

julia> p = @pdata(repeat(["A", "B"], outer=4))
8-element DataArrays.PooledDataArray{String,UInt32,1}:
 "A"
 "B"
 "A"
 "B"
 "A"
 "B"
 "A"
 "B"

julia> compact(p) # second type parameter compacts to UInt8 (only need 2 unique values)
8-element DataArrays.PooledDataArray{String,UInt8,1}:
 "A"
 "B"
 "A"
 "B"
 "A"
 "B"
 "A"
 "B"

DataArrays.setlevels (Function)
setlevels(x::PooledDataArray, newpool::Union{AbstractVector, Dict})

Create a new PooledDataArray based on x but with the value pool specified by newpool. newpool can be a Dict that maps existing values to their replacements, or an array whose entries replace the current levels by position, since levels are identified by their order. The pool can be enlarged to contain values not present in the data, but it cannot be reduced to exclude values that are present.

Examples

julia> p = @pdata repeat(["A", "B"], inner=3)
6-element DataArrays.PooledDataArray{String,UInt32,1}:
 "A"
 "A"
 "A"
 "B"
 "B"
 "B"

julia> p2 = setlevels(p, ["C", "D"]) # could also be Dict("A"=>"C", "B"=>"D")
6-element DataArrays.PooledDataArray{String,UInt32,1}:
 "C"
 "C"
 "C"
 "D"
 "D"
 "D"

julia> p3 = setlevels(p2, ["C", "D", "E"])
6-element DataArrays.PooledDataArray{String,UInt32,1}:
 "C"
 "C"
 "C"
 "D"
 "D"
 "D"

julia> p3.pool # the pool can contain values not in the array
3-element Array{String,1}:
 "C"
 "D"
 "E"

DataArrays.setlevels! (Function)
setlevels!(x::PooledDataArray, newpool::Union{AbstractVector, Dict})

Set the value pool for the PooledDataArray x to newpool, modifying x in place. newpool can be a Dict that maps existing values to their replacements, or an array whose entries replace the current levels by position, since levels are identified by their order. The pool can be enlarged to contain values not present in the data, but it cannot be reduced to exclude values that are present.

Examples

julia> p = @pdata repeat(["A", "B"], inner=3)
6-element DataArrays.PooledDataArray{String,UInt32,1}:
 "A"
 "A"
 "A"
 "B"
 "B"
 "B"

julia> setlevels!(p, Dict("A"=>"C"));

julia> p # has been modified
6-element DataArrays.PooledDataArray{String,UInt32,1}:
 "C"
 "C"
 "C"
 "B"
 "B"
 "B"

DataArrays.replace! (Function)
replace!(x::PooledDataArray, from, to)

Replace all occurrences of from in x with to, modifying x in place.

PooledDataVecs(v1, v2) -> (pda1, pda2)

Return a tuple of PooledDataArrays created from the data in v1 and v2, respectively, but sharing a common value pool.
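
Sharing a pool means that equal reference values denote equal data in both arrays. A minimal sketch with plain arrays follows; the names are illustrative, and the order of the shared pool here (first appearance) is an assumption of the sketch rather than documented behavior.

```julia
v1 = ["A", "B", "A"]
v2 = ["B", "C"]

# One pool covering the values of both inputs.
shared_pool = unique(vcat(v1, v2))

# Each vector gets its own references, all pointing into the same pool.
refs1 = [findfirst(isequal(x), shared_pool) for x in v1]
refs2 = [findfirst(isequal(x), shared_pool) for x in v2]

# "B" is encoded by the same reference in both vectors.
same_code = refs1[2] == refs2[1]
```

With a common pool, values from the two arrays can be compared through their integer references alone.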

DataArrays.getpoolidx (Function)
getpoolidx(pda::PooledDataArray, val)

Return the index of val in the value pool for pda. If val is not already in the value pool, pda is modified to include it in the pool.
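
The look-up-or-append behavior can be sketched on a bare vector standing in for the value pool. pool_index! is a hypothetical name used only for this illustration; it is not the package function and ignores reference-type limits.

```julia
# Hypothetical stand-in for the behavior described above, operating on a
# plain Vector instead of a PooledDataArray.
function pool_index!(pool::Vector, val)
    idx = findfirst(isequal(val), pool)
    if idx === nothing
        # val is new: grow the pool and return the index of the new entry.
        push!(pool, val)
        idx = length(pool)
    end
    return idx
end

pool = ["A", "B"]
existing = pool_index!(pool, "B")   # already present, pool unchanged
added = pool_index!(pool, "C")      # not present, pool grows by one
```

The trailing ! follows the Julia convention for functions that may mutate their argument, which is the side effect the description above points out.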

DataArrays.reorder (Function)
reorder(x::PooledDataArray) -> PooledDataArray

Return a PooledDataArray containing the same data as x but with the value pool sorted.
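
Sorting the pool requires remapping the references so that every element still refers to the same value. The sketch below shows one way to do this with plain arrays; pool and refs are illustrative stand-ins, and this is not necessarily how the package performs the reordering.

```julia
pool = ["B", "C", "A"]
refs = [1, 3, 2, 1]            # encodes the data "B", "A", "C", "B"

perm = sortperm(pool)          # old pool indices in sorted order
newpool = pool[perm]           # the sorted pool

# invperm(perm)[k] is the position of old pool entry k in the sorted pool.
newrefs = invperm(perm)[refs]

# The decoded data is unchanged by the reordering.
unchanged = newpool[newrefs] == pool[refs]
```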
