diff --git a/dev/counts/index.html b/dev/counts/index.html index cca73c0b..7312fa15 100644 --- a/dev/counts/index.html +++ b/dev/counts/index.html @@ -1,7 +1,7 @@ Counting Functions · StatsBase.jl

Counting Functions

The package provides functions to count the occurrences of distinct values.

Counting over an Integer Range

StatsBase.countsFunction
counts(x, [wv::AbstractWeights])
 counts(x, levels::UnitRange{<:Integer}, [wv::AbstractWeights])
-counts(x, k::Integer, [wv::AbstractWeights])

Count the number of times each value in x occurs. If levels is provided, only values falling in that range will be considered (the others will be ignored without raising an error or a warning). If an integer k is provided, only values in the range 1:k will be considered.

If a vector of weights wv is provided, the sum of the weights is computed for each value rather than the raw counts.

The output is a vector of length length(levels).

source
StatsBase.proportionsFunction
proportions(x, levels=span(x), [wv::AbstractWeights])

Return the proportion of values in the range levels that occur in x. Equivalent to counts(x, levels) / length(x).

If a vector of weights wv is provided, the proportion of weights is computed rather than the proportion of raw counts.

source
proportions(x, k::Integer, [wv::AbstractWeights])

Return the proportion of integers in 1 to k that occur in x.

If a vector of weights wv is provided, the proportion of weights is computed rather than the proportion of raw counts.

source
StatsBase.addcounts!Method
addcounts!(r, x, levels::UnitRange{<:Integer}, [wv::AbstractWeights])

Add the number of occurrences in x of each value in levels to an existing array r. For each xi ∈ x, if xi == levels[j], then we increment r[j].

If a weighting vector wv is specified, the sum of weights is used rather than the raw counts.

source

Counting over arbitrary distinct values

StatsBase.countmapFunction
countmap(x; alg = :auto)
-countmap(x::AbstractVector, wv::AbstractVector{<:Real})

Return a dictionary mapping each unique value in x to its number of occurrences.

If a weighting vector wv is specified, the sum of weights is used rather than the raw counts.

alg is only allowed for unweighted counting and can be one of:

  • :auto (default): if StatsBase.radixsort_safe(eltype(x)) == true then use :radixsort, otherwise use :dict.

  • :radixsort: if radixsort_safe(eltype(x)) == true then use the radix sort algorithm to sort the input vector which will generally lead to shorter running time for large x with many duplicates. However the radix sort algorithm creates a copy of the input vector and hence uses more RAM. Choose :dict if the amount of available RAM is a limitation.

  • :dict: use Dict-based method which is generally slower but uses less RAM, is safe for any data type, is faster for small arrays, and is faster when there are not many duplicates.

source
StatsBase.proportionmapFunction
proportionmap(x)
-proportionmap(x::AbstractVector, w::AbstractVector{<:Real})

Return a dictionary mapping each unique value in x to its proportion in x.

If a vector of weights w is provided, the proportion of weights is computed rather than the proportion of raw counts.

source
StatsBase.addcounts!Method
addcounts!(dict, x; alg = :auto)
-addcounts!(dict, x, wv)

Add counts based on x to a count map. New entries will be added if new values come up.

If a weighting vector wv is specified, the sum of the weights is used rather than the raw counts.

alg is only allowed for unweighted counting and can be one of:

  • :auto (default): if StatsBase.radixsort_safe(eltype(x)) == true then use :radixsort, otherwise use :dict.

  • :radixsort: if radixsort_safe(eltype(x)) == true then use the radix sort algorithm to sort the input vector which will generally lead to shorter running time for large x with many duplicates. However the radix sort algorithm creates a copy of the input vector and hence uses more RAM. Choose :dict if the amount of available RAM is a limitation.

  • :dict: use Dict-based method which is generally slower but uses less RAM, is safe for any data type, is faster for small arrays, and is faster when there are not many duplicates.

source
+counts(x, k::Integer, [wv::AbstractWeights])

Count the number of times each value in x occurs. If levels is provided, only values falling in that range will be considered (the others will be ignored without raising an error or a warning). If an integer k is provided, only values in the range 1:k will be considered.

If a vector of weights wv is provided, the sum of the weights is computed for each value rather than the raw counts.

The output is a vector of length length(levels).

source
StatsBase.proportionsFunction
proportions(x, levels=span(x), [wv::AbstractWeights])

Return the proportion of values in the range levels that occur in x. Equivalent to counts(x, levels) / length(x).

If a vector of weights wv is provided, the proportion of weights is computed rather than the proportion of raw counts.

source
proportions(x, k::Integer, [wv::AbstractWeights])

Return the proportion of integers in 1 to k that occur in x.

If a vector of weights wv is provided, the proportion of weights is computed rather than the proportion of raw counts.

source
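
As a minimal sketch of how counts and proportions behave over an integer range, assuming small made-up inputs (x and wv below are illustrative, not taken from the package documentation):

```julia
using StatsBase

x = [1, 2, 2, 3, 3, 3]

# Raw counts over the levels 1:4; level 4 never occurs, so its count is 0.
c = counts(x, 1:4)

# With weights, each entry is the sum of the weights of the matching observations.
wv = weights([0.5, 0.5, 1.0, 1.0, 1.0, 2.0])
cw = counts(x, 1:4, wv)

# Proportions divide the counts by the number of observations.
p = proportions(x, 1:4)
```

Here the output vectors all have length(1:4) == 4 entries, one per level.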
StatsBase.addcounts!Method
addcounts!(r, x, levels::UnitRange{<:Integer}, [wv::AbstractWeights])

Add the number of occurrences in x of each value in levels to an existing array r. For each xi ∈ x, if xi == levels[j], then we increment r[j].

If a weighting vector wv is specified, the sum of weights is used rather than the raw counts.

source
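
The accumulating behaviour described above can be sketched as follows, assuming a preallocated integer array r (the input values are made up for illustration):

```julia
using StatsBase

r = zeros(Int, 3)                  # one slot per level in 1:3

addcounts!(r, [1, 2, 2, 5], 1:3)   # 5 lies outside 1:3 and is ignored
first_pass = copy(r)

addcounts!(r, [3, 3], 1:3)         # counts accumulate into the same array
```

Reusing r across calls avoids allocating a new count vector for each batch of data.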

Counting over arbitrary distinct values

StatsBase.countmapFunction
countmap(x; alg = :auto)
+countmap(x::AbstractVector, wv::AbstractVector{<:Real})

Return a dictionary mapping each unique value in x to its number of occurrences.

If a weighting vector wv is specified, the sum of weights is used rather than the raw counts.

alg is only allowed for unweighted counting and can be one of:

  • :auto (default): if StatsBase.radixsort_safe(eltype(x)) == true then use :radixsort, otherwise use :dict.

  • :radixsort: if radixsort_safe(eltype(x)) == true then use the radix sort algorithm to sort the input vector which will generally lead to shorter running time for large x with many duplicates. However the radix sort algorithm creates a copy of the input vector and hence uses more RAM. Choose :dict if the amount of available RAM is a limitation.

  • :dict: use Dict-based method which is generally slower but uses less RAM, is safe for any data type, is faster for small arrays, and is faster when there are not many duplicates.

source
StatsBase.proportionmapFunction
proportionmap(x)
+proportionmap(x::AbstractVector, w::AbstractVector{<:Real})

Return a dictionary mapping each unique value in x to its proportion in x.

If a vector of weights w is provided, the proportion of weights is computed rather than the proportion of raw counts.

source
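
A short sketch of countmap and proportionmap on non-integer values, assuming a small vector of strings and an illustrative weight vector:

```julia
using StatsBase

x = ["a", "b", "a", "c", "a"]

cm = countmap(x)                     # "a" => 3, "b" => 1, "c" => 1
cm_dict = countmap(x; alg = :dict)   # force the Dict-based method

wv = weights([1.0, 2.0, 0.5, 1.0, 0.5])
cmw = countmap(x, wv)                # sums of weights per distinct value

pm = proportionmap(x)                # counts divided by length(x)
```

For string data the :auto and :dict choices should give the same result; the alg keyword only affects how the counting is carried out.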
StatsBase.addcounts!Method
addcounts!(dict, x; alg = :auto)
+addcounts!(dict, x, wv)

Add counts based on x to a count map. New entries will be added if new values come up.

If a weighting vector wv is specified, the sum of the weights is used rather than the raw counts.

alg is only allowed for unweighted counting and can be one of:

  • :auto (default): if StatsBase.radixsort_safe(eltype(x)) == true then use :radixsort, otherwise use :dict.

  • :radixsort: if radixsort_safe(eltype(x)) == true then use the radix sort algorithm to sort the input vector which will generally lead to shorter running time for large x with many duplicates. However the radix sort algorithm creates a copy of the input vector and hence uses more RAM. Choose :dict if the amount of available RAM is a limitation.

  • :dict: use Dict-based method which is generally slower but uses less RAM, is safe for any data type, is faster for small arrays, and is faster when there are not many duplicates.

source
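
Updating an existing count map in place can be sketched like this, assuming an initially empty Dict keyed by strings (the data are made up):

```julia
using StatsBase

cm = Dict{String,Int}()
addcounts!(cm, ["a", "b", "a"])
addcounts!(cm, ["b", "c"])       # "c" is new and gets its own entry
```

This pattern is convenient when the data arrive in batches and a single running count map is wanted.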
diff --git a/dev/cov/index.html b/dev/cov/index.html index 3cf3da76..01c5e241 100644 --- a/dev/cov/index.html +++ b/dev/cov/index.html @@ -1,5 +1,5 @@ -Scatter Matrix and Covariance · StatsBase.jl

Scatter Matrix and Covariance

This package implements functions for computing the scatter matrix as well as the weighted covariance matrix.

StatsBase.scattermatFunction
scattermat(X, [wv::AbstractWeights]; mean=nothing, dims=1)

Compute the scatter matrix, which is an unnormalized covariance matrix. A weighting vector wv can be specified to weight the estimate.

Arguments

  • mean=nothing: a known mean value. nothing indicates that the mean is unknown, and the function will compute the mean. Specifying mean=0 indicates that the data are centered and hence there's no need to subtract the mean.
  • dims=1: the dimension along which the variables are organized. When dims = 1, the variables are considered columns with observations in rows; when dims = 2, variables are in rows with observations in columns.
source
Statistics.covFunction
cov(X, w::AbstractWeights, vardim=1; mean=nothing, corrected=false)

Compute the weighted covariance matrix. As with var and std, the biased covariance matrix (corrected=false) is computed by multiplying scattermat(X, w) by $\frac{1}{\sum{w}}$ to normalize. The unbiased covariance matrix (corrected=true), however, depends on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$
  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
  • Weights: ArgumentError (bias correction not supported)
source
Statistics.covMethod
cov(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute a variance estimate from the observation vector x using the estimator ce.

source
Statistics.covMethod
cov(ce::CovarianceEstimator, x::AbstractVector, y::AbstractVector)

Compute the covariance of the vectors x and y using estimator ce.

source
Statistics.covMethod
cov(ce::CovarianceEstimator, X::AbstractMatrix, [w::AbstractWeights]; mean=nothing, dims::Int=1)

Compute the covariance matrix of the matrix X along dimension dims using estimator ce. A weighting vector w can be specified. The keyword argument mean can be:

  • nothing (default) in which case the mean is estimated and subtracted from the data X,
  • a precalculated mean in which case it is subtracted from the data X. Assuming size(X) is (N,M), mean can either be:
    • when dims=1, an AbstractMatrix of size (1,M),
    • when dims=2, an AbstractVector of length N or an AbstractMatrix of size (N,1).
source
Statistics.varMethod
var(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the variance of the vector x using the estimator ce.

source
Statistics.stdMethod
std(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the standard deviation of the vector x using the estimator ce.

source
Statistics.corFunction
cor(X, w::AbstractWeights, dims=1)

Compute the Pearson correlation matrix of X along the dimension dims with weights w.

source
cor(ce::CovarianceEstimator, x::AbstractVector, y::AbstractVector)

Compute the correlation of the vectors x and y using estimator ce.

source
cor(
+Scatter Matrix and Covariance · StatsBase.jl

Scatter Matrix and Covariance

This package implements functions for computing the scatter matrix as well as the weighted covariance matrix.

StatsBase.scattermatFunction
scattermat(X, [wv::AbstractWeights]; mean=nothing, dims=1)

Compute the scatter matrix, which is an unnormalized covariance matrix. A weighting vector wv can be specified to weight the estimate.

Arguments

  • mean=nothing: a known mean value. nothing indicates that the mean is unknown, and the function will compute the mean. Specifying mean=0 indicates that the data are centered and hence there's no need to subtract the mean.
  • dims=1: the dimension along which the variables are organized. When dims = 1, the variables are considered columns with observations in rows; when dims = 2, variables are in rows with observations in columns.
source
Statistics.covFunction
cov(X, w::AbstractWeights, vardim=1; mean=nothing, corrected=false)

Compute the weighted covariance matrix. As with var and std, the biased covariance matrix (corrected=false) is computed by multiplying scattermat(X, w) by $\frac{1}{\sum{w}}$ to normalize. The unbiased covariance matrix (corrected=true), however, depends on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$
  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
  • Weights: ArgumentError (bias correction not supported)
source
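
The relation between scattermat and the weighted cov can be checked with a small sketch, assuming a 3×2 data matrix with unit frequency weights (values chosen only for illustration):

```julia
using Statistics, StatsBase

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # 3 observations (rows) of 2 variables (columns)
w = fweights([1, 1, 1])

S = scattermat(X, w)                       # unnormalized covariance
C_biased = cov(X, w; corrected = false)    # S / sum(w)
C_unbiased = cov(X, w; corrected = true)   # S / (sum(w) - 1) for FrequencyWeights
```

With unit frequency weights the corrected result coincides with the usual sample covariance.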
Statistics.covMethod
cov(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute a variance estimate from the observation vector x using the estimator ce.

source
Statistics.covMethod
cov(ce::CovarianceEstimator, x::AbstractVector, y::AbstractVector)

Compute the covariance of the vectors x and y using estimator ce.

source
Statistics.covMethod
cov(ce::CovarianceEstimator, X::AbstractMatrix, [w::AbstractWeights]; mean=nothing, dims::Int=1)

Compute the covariance matrix of the matrix X along dimension dims using estimator ce. A weighting vector w can be specified. The keyword argument mean can be:

  • nothing (default) in which case the mean is estimated and subtracted from the data X,
  • a precalculated mean in which case it is subtracted from the data X. Assuming size(X) is (N,M), mean can either be:
    • when dims=1, an AbstractMatrix of size (1,M),
    • when dims=2, an AbstractVector of length N or an AbstractMatrix of size (N,1).
source
Statistics.varMethod
var(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the variance of the vector x using the estimator ce.

source
Statistics.stdMethod
std(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the standard deviation of the vector x using the estimator ce.

source
Statistics.corFunction
cor(X, w::AbstractWeights, dims=1)

Compute the Pearson correlation matrix of X along the dimension dims with weights w.

source
cor(ce::CovarianceEstimator, x::AbstractVector, y::AbstractVector)

Compute the correlation of the vectors x and y using estimator ce.

source
cor(
     ce::CovarianceEstimator, X::AbstractMatrix, [w::AbstractWeights];
     mean=nothing, dims::Int=1
-)

Compute the correlation matrix of the matrix X along dimension dims using estimator ce. A weighting vector w can be specified. The keyword argument mean can be:

  • nothing (default) in which case the mean is estimated and subtracted from the data X,
  • a precalculated mean in which case it is subtracted from the data X. Assuming size(X) is (N,M), mean can either be:
    • when dims=1, an AbstractMatrix of size (1,M),
    • when dims=2, an AbstractVector of length N or an AbstractMatrix of size (N,1).
source
StatsBase.mean_and_covFunction
mean_and_cov(x, [wv::AbstractWeights,] vardim=1; corrected=false) -> (mean, cov)

Return the mean and covariance matrix as a tuple. A weighting vector wv can be specified. vardim designates whether the variables are columns in the matrix (1) or rows (2). Finally, bias correction is applied to the covariance calculation if corrected=true. See cov documentation for more details.

source
StatsBase.cov2corFunction
cov2cor(C::AbstractMatrix, [s::AbstractArray])

Compute the correlation matrix from the covariance matrix C and, optionally, a vector of standard deviations s. Use StatsBase.cov2cor! for an in-place version.

source
StatsBase.cor2covFunction
cor2cov(C, s)

Compute the covariance matrix from the correlation matrix C and a vector of standard deviations s. Use StatsBase.cor2cov! for an in-place version.

source
StatsBase.SimpleCovarianceType
SimpleCovariance(;corrected::Bool=false)

Simple covariance estimator. Estimation calls cov(x; corrected=corrected), cov(x, y; corrected=corrected) or cov(X, w, dims; corrected=corrected) where x, y are vectors, X is a matrix and w is a weighting vector.

source
+)

Compute the correlation matrix of the matrix X along dimension dims using estimator ce. A weighting vector w can be specified. The keyword argument mean can be:

  • nothing (default) in which case the mean is estimated and subtracted from the data X,
  • a precalculated mean in which case it is subtracted from the data X. Assuming size(X) is (N,M), mean can either be:
    • when dims=1, an AbstractMatrix of size (1,M),
    • when dims=2, an AbstractVector of length N or an AbstractMatrix of size (N,1).
source
StatsBase.mean_and_covFunction
mean_and_cov(x, [wv::AbstractWeights,] vardim=1; corrected=false) -> (mean, cov)

Return the mean and covariance matrix as a tuple. A weighting vector wv can be specified. vardim designates whether the variables are columns in the matrix (1) or rows (2). Finally, bias correction is applied to the covariance calculation if corrected=true. See cov documentation for more details.

source
StatsBase.cov2corFunction
cov2cor(C::AbstractMatrix, [s::AbstractArray])

Compute the correlation matrix from the covariance matrix C and, optionally, a vector of standard deviations s. Use StatsBase.cov2cor! for an in-place version.

source
StatsBase.cor2covFunction
cor2cov(C, s)

Compute the covariance matrix from the correlation matrix C and a vector of standard deviations s. Use StatsBase.cor2cov! for an in-place version.

source
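
Converting between covariance and correlation matrices can be sketched as a round trip, assuming a hand-picked 2×2 covariance matrix:

```julia
using LinearAlgebra, StatsBase

C = [4.0 2.0; 2.0 9.0]        # covariance matrix
s = sqrt.(diag(C))            # standard deviations: [2.0, 3.0]

R = cov2cor(C, s)             # correlation matrix, off-diagonal 2 / (2 * 3)
C_back = cor2cov(R, s)        # recovers C
```

Both functions return new matrices, so C and R are left unmodified.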
StatsBase.SimpleCovarianceType
SimpleCovariance(;corrected::Bool=false)

Simple covariance estimator. Estimation calls cov(x; corrected=corrected), cov(x, y; corrected=corrected) or cov(X, w, dims; corrected=corrected) where x, y are vectors, X is a matrix and w is a weighting vector.

source
diff --git a/dev/deviation/index.html b/dev/deviation/index.html index f809d0d7..f3774aee 100644 --- a/dev/deviation/index.html +++ b/dev/deviation/index.html @@ -1,2 +1,2 @@ -Computing Deviations · StatsBase.jl

Computing Deviations

This package provides functions to compute various deviations between arrays:

StatsBase.counteqFunction
counteq(a, b)

Count the number of indices at which the elements of the arrays a and b are equal.

source
StatsBase.countneFunction
countne(a, b)

Count the number of indices at which the elements of the arrays a and b are not equal.

source
StatsBase.sqL2distFunction
sqL2dist(a, b)

Compute the squared L2 distance between two arrays: $\sum_{i=1}^n |a_i - b_i|^2$. Efficient equivalent of sum(abs2, a - b).

source
StatsBase.L2distFunction
L2dist(a, b)

Compute the L2 distance between two arrays: $\sqrt{\sum_{i=1}^n |a_i - b_i|^2}$. Efficient equivalent of sqrt(sum(abs2, a - b)).

source
StatsBase.L1distFunction
L1dist(a, b)

Compute the L1 distance between two arrays: $\sum_{i=1}^n |a_i - b_i|$. Efficient equivalent of sum(abs, a - b).

source
StatsBase.LinfdistFunction
Linfdist(a, b)

Compute the L∞ distance, also called the Chebyshev distance, between two arrays: $\max_{i\in1:n} |a_i - b_i|$. Efficient equivalent of maximum(abs, a - b).

source
StatsBase.gkldivFunction
gkldiv(a, b)

Compute the generalized Kullback-Leibler divergence between two arrays: $\sum_{i=1}^n (a_i \log(a_i/b_i) - a_i + b_i)$. Efficient equivalent of sum(a .* log.(a ./ b) .- a .+ b).

source
StatsBase.meanadFunction
meanad(a, b)

Return the mean absolute deviation between two arrays: mean(abs, a - b).

source
StatsBase.maxadFunction
maxad(a, b)

Return the maximum absolute deviation between two arrays: maximum(abs, a - b).

source
StatsBase.msdFunction
msd(a, b)

Return the mean squared deviation between two arrays: mean(abs2, a - b).

source
StatsBase.rmsdFunction
rmsd(a, b; normalize=false)

Return the root mean squared deviation between two optionally normalized arrays. The root mean squared deviation is computed as sqrt(msd(a, b)).

source
StatsBase.psnrFunction
psnr(a, b, maxv)

Compute the peak signal-to-noise ratio between two arrays a and b. maxv is the maximum possible value either array can take. The PSNR is computed as 10 * log10(maxv^2 / msd(a, b)).

source
Note

All these functions are implemented in a reasonably efficient way without creating any temporary arrays in the middle.

+Computing Deviations · StatsBase.jl

Computing Deviations

This package provides functions to compute various deviations between arrays:

StatsBase.counteqFunction
counteq(a, b)

Count the number of indices at which the elements of the arrays a and b are equal.

source
StatsBase.countneFunction
countne(a, b)

Count the number of indices at which the elements of the arrays a and b are not equal.

source
StatsBase.sqL2distFunction
sqL2dist(a, b)

Compute the squared L2 distance between two arrays: $\sum_{i=1}^n |a_i - b_i|^2$. Efficient equivalent of sum(abs2, a - b).

source
StatsBase.L2distFunction
L2dist(a, b)

Compute the L2 distance between two arrays: $\sqrt{\sum_{i=1}^n |a_i - b_i|^2}$. Efficient equivalent of sqrt(sum(abs2, a - b)).

source
StatsBase.L1distFunction
L1dist(a, b)

Compute the L1 distance between two arrays: $\sum_{i=1}^n |a_i - b_i|$. Efficient equivalent of sum(abs, a - b).

source
StatsBase.LinfdistFunction
Linfdist(a, b)

Compute the L∞ distance, also called the Chebyshev distance, between two arrays: $\max_{i\in1:n} |a_i - b_i|$. Efficient equivalent of maximum(abs, a - b).

source
StatsBase.gkldivFunction
gkldiv(a, b)

Compute the generalized Kullback-Leibler divergence between two arrays: $\sum_{i=1}^n (a_i \log(a_i/b_i) - a_i + b_i)$. Efficient equivalent of sum(a .* log.(a ./ b) .- a .+ b).

source
StatsBase.meanadFunction
meanad(a, b)

Return the mean absolute deviation between two arrays: mean(abs, a - b).

source
StatsBase.maxadFunction
maxad(a, b)

Return the maximum absolute deviation between two arrays: maximum(abs, a - b).

source
StatsBase.msdFunction
msd(a, b)

Return the mean squared deviation between two arrays: mean(abs2, a - b).

source
StatsBase.rmsdFunction
rmsd(a, b; normalize=false)

Return the root mean squared deviation between two optionally normalized arrays. The root mean squared deviation is computed as sqrt(msd(a, b)).

source
StatsBase.psnrFunction
psnr(a, b, maxv)

Compute the peak signal-to-noise ratio between two arrays a and b. maxv is the maximum possible value either array can take. The PSNR is computed as 10 * log10(maxv^2 / msd(a, b)).

source
Note

All these functions are implemented in a reasonably efficient way without creating any temporary arrays in the middle.
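
A brief sketch of several of these functions on two short vectors (a, b, and the peak value 4.0 are made up for illustration):

```julia
using StatsBase

a = [1.0, 2.0, 3.0]
b = [1.0, 4.0, 0.0]

n_eq = counteq(a, b)     # 1: only the first elements match
n_ne = countne(a, b)     # 2
d2 = sqL2dist(a, b)      # 0 + 4 + 9 = 13
d1 = L1dist(a, b)        # 0 + 2 + 3 = 5
dinf = Linfdist(a, b)    # largest absolute difference, 3
m = msd(a, b)            # 13 / 3
r = rmsd(a, b)           # sqrt(13 / 3)
p = psnr(a, b, 4.0)      # 10 * log10(4.0^2 / msd(a, b))
```

Each call should agree with the corresponding broadcast expression, for example sum(abs2, a - b) for sqL2dist.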

diff --git a/dev/empirical/index.html b/dev/empirical/index.html index d5845cb3..1ff968cf 100644 --- a/dev/empirical/index.html +++ b/dev/empirical/index.html @@ -39,7 +39,7 @@ closed: left isdensity: true -julia> # observe isdensity = true and weights tells us the number of observation per binsize in each binsource

Histograms can be fitted to data using the fit method.

StatsAPI.fitMethod
fit(Histogram, data[, weight][, edges]; closed=:left[, nbins])

Fit a histogram to data.

Arguments

  • data: either a vector (for a 1-dimensional histogram), or a tuple of vectors of equal length (for an n-dimensional histogram).

  • weight: an optional AbstractWeights (of the same length as the data vectors), denoting the weight each observation contributes to the bin. If no weight vector is supplied, each observation has weight 1.

  • edges: a vector (typically an AbstractRange object), or tuple of vectors, that gives the edges of the bins along each dimension. If no edges are provided, they are chosen so that approximately nbins bins of equal width are constructed along each dimension.

Note

In most cases, the number of bins will be nbins. However, to ensure that the bins have equal width, more or fewer than nbins bins may be used.

Keyword arguments

  • closed: if :left (the default), the bin intervals are left-closed [a,b); if :right, intervals are right-closed (a,b].

  • nbins: if no edges argument is supplied, the approximate number of bins to use along each dimension (can be either a single integer, or a tuple of integers). If omitted, it is computed using Sturges's formula, i.e. ceil(log2(n)) + 1 with n the number of data points.

Examples

# Univariate
+julia> # observe isdensity = true and weights tells us the number of observation per binsize in each bin
source

Histograms can be fitted to data using the fit method.

StatsAPI.fitMethod
fit(Histogram, data[, weight][, edges]; closed=:left[, nbins])

Fit a histogram to data.

Arguments

  • data: either a vector (for a 1-dimensional histogram), or a tuple of vectors of equal length (for an n-dimensional histogram).

  • weight: an optional AbstractWeights (of the same length as the data vectors), denoting the weight each observation contributes to the bin. If no weight vector is supplied, each observation has weight 1.

  • edges: a vector (typically an AbstractRange object), or tuple of vectors, that gives the edges of the bins along each dimension. If no edges are provided, they are chosen so that approximately nbins bins of equal width are constructed along each dimension.

Note

In most cases, the number of bins will be nbins. However, to ensure that the bins have equal width, more or fewer than nbins bins may be used.

Keyword arguments

  • closed: if :left (the default), the bin intervals are left-closed [a,b); if :right, intervals are right-closed (a,b].

  • nbins: if no edges argument is supplied, the approximate number of bins to use along each dimension (can be either a single integer, or a tuple of integers). If omitted, it is computed using Sturges's formula, i.e. ceil(log2(n)) + 1 with n the number of data points.

Examples

# Univariate
 h = fit(Histogram, rand(100))
 h = fit(Histogram, rand(100), 0:0.1:1.0)
 h = fit(Histogram, rand(100), nbins=10)
@@ -49,4 +49,4 @@
 
 # Multivariate
 h = fit(Histogram, (rand(100),rand(100)))
-h = fit(Histogram, (rand(100),rand(100)),nbins=10)
source

Additional methods

Base.merge!Function
merge!(target::Histogram, others::Histogram...)

Update histogram target by merging it with the histograms others. See merge(histogram::Histogram, others::Histogram...) for details.

source
Base.mergeFunction
merge(h::Histogram, others::Histogram...)

Construct a new histogram by merging h with others. All histograms must have the same binning, shape of weights and properties (closed and isdensity). The weights of all histograms are summed up for each bin, the weights of the resulting histogram will have the same type as those of h.

source
LinearAlgebra.normFunction
norm(h::Histogram)

Calculate the norm of histogram h as the absolute value of its integral.

source
LinearAlgebra.normalizeFunction
normalize(h::Histogram{T,N}; mode::Symbol=:pdf) where {T,N}

Normalize the histogram h.

Valid values for mode are:

  • :pdf: Normalize by sum of weights and bin sizes. Resulting histogram has norm 1 and represents a PDF.
  • :density: Normalize by bin sizes only. Resulting histogram represents count density of input and does not have norm 1. Will not modify the histogram if it already represents a density (h.isdensity == true).
  • :probability: Normalize by sum of weights only. Resulting histogram represents the fraction of probability mass for each bin and does not have norm 1.
  • :none: Leaves histogram unchanged. Useful to simplify code that has to conditionally apply different modes of normalization.

Successive application of both :probability and :density normalization (in any order) is equivalent to :pdf normalization.

source
normalize(h::Histogram{T,N}, aux_weights::Array{T,N}...; mode::Symbol=:pdf) where {T,N}

Normalize the histogram h and rescale one or more auxiliary weight arrays at the same time (aux_weights may, e.g., contain estimated statistical uncertainties). The values of the auxiliary arrays are scaled by the same factor as the corresponding histogram weight values. Returns a tuple of the normalized histogram and scaled auxiliary weights.

source
LinearAlgebra.normalize!Function
normalize!(h::Histogram{T,N}, aux_weights::Array{T,N}...; mode::Symbol=:pdf) where {T<:AbstractFloat,N}

Normalize the histogram h and optionally scale one or more auxiliary weight arrays appropriately. See description of normalize for details. Returns h.

source
Base.zeroFunction
zero(h::Histogram)

Create a new histogram with the same binning, type and shape of weights and the same properties (closed and isdensity) as h, with all weights set to zero.

source

Empirical Cumulative Distribution Function

StatsBase.ecdfFunction
ecdf(X; weights::AbstractWeights)

Return an empirical cumulative distribution function (ECDF) based on a vector of samples given in X. Optionally providing weights returns a weighted ECDF.

Note: this function returns a callable composite type, which can then be applied to evaluate CDF values on other samples.

extrema, minimum, and maximum are supported for obtaining the range over which the function is inside the interval $(0,1)$; the function is defined for the whole real line.

source
+h = fit(Histogram, (rand(100),rand(100)),nbins=10)source

Additional methods

Base.merge!Function
merge!(target::Histogram, others::Histogram...)

Update histogram target by merging it with the histograms others. See merge(histogram::Histogram, others::Histogram...) for details.

source
Base.mergeFunction
merge(h::Histogram, others::Histogram...)

Construct a new histogram by merging h with others. All histograms must have the same binning, shape of weights and properties (closed and isdensity). The weights of all histograms are summed up for each bin, the weights of the resulting histogram will have the same type as those of h.

source
LinearAlgebra.normFunction
norm(h::Histogram)

Calculate the norm of histogram h as the absolute value of its integral.

source
LinearAlgebra.normalizeFunction
normalize(h::Histogram{T,N}; mode::Symbol=:pdf) where {T,N}

Normalize the histogram h.

Valid values for mode are:

  • :pdf: Normalize by sum of weights and bin sizes. Resulting histogram has norm 1 and represents a PDF.
  • :density: Normalize by bin sizes only. Resulting histogram represents count density of input and does not have norm 1. Will not modify the histogram if it already represents a density (h.isdensity == true).
  • :probability: Normalize by sum of weights only. Resulting histogram represents the fraction of probability mass for each bin and does not have norm 1.
  • :none: Leaves histogram unchanged. Useful to simplify code that has to conditionally apply different modes of normalization.

Successive application of both :probability and :density normalization (in any order) is equivalent to :pdf normalization.

source
normalize(h::Histogram{T,N}, aux_weights::Array{T,N}...; mode::Symbol=:pdf) where {T,N}

Normalize the histogram h and rescale one or more auxiliary weight arrays at the same time (aux_weights may, e.g., contain estimated statistical uncertainties). The values of the auxiliary arrays are scaled by the same factor as the corresponding histogram weight values. Returns a tuple of the normalized histogram and scaled auxiliary weights.

source
LinearAlgebra.normalize!Function
normalize!(h::Histogram{T,N}, aux_weights::Array{T,N}...; mode::Symbol=:pdf) where {T<:AbstractFloat,N}

Normalize the histogram h and optionally scale one or more auxiliary weight arrays appropriately. See description of normalize for details. Returns h.

source
Base.zeroFunction
zero(h::Histogram)

Create a new histogram with the same binning, type and shape of weights and the same properties (closed and isdensity) as h, with all weights set to zero.

source
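
Fitting, merging, and normalizing can be sketched together on a tiny data set, assuming explicit edges so that the bin contents are easy to verify by hand (the data are made up):

```julia
using LinearAlgebra, StatsBase

data = [0.1, 0.2, 0.6, 0.9]
h = fit(Histogram, data, 0:0.5:1.0)   # bins [0.0, 0.5) and [0.5, 1.0)

h2 = merge(h, h)                      # same binning, weights added bin by bin
hn = normalize(h; mode = :pdf)        # weights divided by sum(weights) * bin width
```

After :pdf normalization the histogram integrates to one, which norm(hn) can confirm.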

Empirical Cumulative Distribution Function

StatsBase.ecdfFunction
ecdf(X; weights::AbstractWeights)

Return an empirical cumulative distribution function (ECDF) based on a vector of samples given in X. Optionally providing weights returns a weighted ECDF.

Note: this function returns a callable composite type, which can then be applied to evaluate CDF values on other samples.

extrema, minimum, and maximum are supported for obtaining the range over which the function is inside the interval $(0,1)$; the function is defined for the whole real line.

source
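As a brief illustration of the callable returned by ecdf, the sketch below evaluates an ECDF at a few points. The sample values and weights are invented for the example.

```julia
using StatsBase

samples = [1.0, 2.0, 2.0, 4.0]
F = ecdf(samples)

F(0.5)       # 0.0: no sample is <= 0.5
F(2.0)       # 0.75: three of the four samples are <= 2.0
F(4.0)       # 1.0
extrema(F)   # (1.0, 4.0)

# Weighted variant: the first sample carries half of the total weight.
Fw = ecdf(samples; weights=Weights([3.0, 1.0, 1.0, 1.0]))
Fw(1.0)      # 0.5
```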
diff --git a/dev/index.html b/dev/index.html index ec896a6a..34f069f6 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,3 +1,3 @@ Getting Started · StatsBase.jl

Getting Started

StatsBase.jl is a Julia package that provides basic support for statistics. Particularly, it implements a variety of statistics-related functions, such as scalar statistics, high-order moment computation, counting, ranking, covariances, sampling, and empirical density estimation.

Installation

To install StatsBase through the Julia REPL, you can type ] add StatsBase or:

using Pkg
-Pkg.add("StatsBase")

To load the package, use the command:

using StatsBase

Available Features

+Pkg.add("StatsBase")

To load the package, use the command:

using StatsBase

Available Features

diff --git a/dev/misc/index.html b/dev/misc/index.html index c36901c6..d50d39eb 100644 --- a/dev/misc/index.html +++ b/dev/misc/index.html @@ -2,12 +2,12 @@ Miscellaneous Functions · StatsBase.jl

Miscellaneous Functions

StatsBase.rleFunction
rle(v) -> (vals, lens)

Return the run-length encoding of a vector as a tuple. The first element of the tuple is a vector of values of the input and the second is the number of consecutive occurrences of each element.

Examples

julia> using StatsBase
 
 julia> rle([1,1,1,2,2,3,3,3,3,2,2,2])
-([1, 2, 3, 2], [3, 2, 4, 3])
source
StatsBase.inverse_rleFunction
inverse_rle(vals, lens)

Reconstruct a vector from its run-length encoding (see rle). vals is a vector of the values and lens is a vector of the corresponding run lengths.

source
StatsBase.levelsmapFunction
levelsmap(a)

Construct a dictionary that maps each of the n unique values in a to a number between 1 and n.

source
StatsBase.indexmapFunction
indexmap(a)

Construct a dictionary that maps each unique value in a to the index of its first occurrence in a.

source
StatsBase.indicatormatFunction
indicatormat(x, k::Integer; sparse=false)

Construct a boolean matrix I of size (k, length(x)) such that I[x[i], i] = true and all other elements are set to false. If sparse is true, the output will be a sparse matrix, otherwise it will be dense (default).

Examples

julia> using StatsBase
+([1, 2, 3, 2], [3, 2, 4, 3])
source
StatsBase.inverse_rleFunction
inverse_rle(vals, lens)

Reconstruct a vector from its run-length encoding (see rle). vals is a vector of the values and lens is a vector of the corresponding run lengths.

source
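A short round-trip sketch, reusing the vector from the rle example, shows how the two functions relate:

```julia
using StatsBase

v = [1, 1, 1, 2, 2, 3, 3, 3, 3, 2, 2, 2]
vals, lens = rle(v)              # ([1, 2, 3, 2], [3, 2, 4, 3])
restored = inverse_rle(vals, lens)
restored == v                    # true: decoding recovers the input
```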
StatsBase.levelsmapFunction
levelsmap(a)

Construct a dictionary that maps each of the n unique values in a to a number between 1 and n.

source
StatsBase.indexmapFunction
indexmap(a)

Construct a dictionary that maps each unique value in a to the index of its first occurrence in a.

source
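The difference between levelsmap and indexmap is easiest to see on an input with a repeated value. The sketch below uses a made-up vector and assumes levels are numbered in order of first appearance.

```julia
using StatsBase

a = ["b", "a", "b", "c"]

lm = levelsmap(a)   # "b" => 1, "a" => 2, "c" => 3 (consecutive level numbers)
im = indexmap(a)    # "b" => 1, "a" => 2, "c" => 4 (index of first occurrence)
```

The two dictionaries agree until a repeated value appears; after that, indexmap skips positions while levelsmap keeps counting consecutively.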
StatsBase.indicatormatFunction
indicatormat(x, k::Integer; sparse=false)

Construct a boolean matrix I of size (k, length(x)) such that I[x[i], i] = true and all other elements are set to false. If sparse is true, the output will be a sparse matrix, otherwise it will be dense (default).

Examples

julia> using StatsBase
 
 julia> indicatormat([1 2 2], 2)
 2×3 Matrix{Bool}:
  1  0  0
- 0  1  1
source
indicatormat(x, c=sort(unique(x)); sparse=false)

Construct a boolean matrix I of size (length(c), length(x)). Let ci be the index of x[i] in c. Then I[ci, i] = true and all other elements are false.

source
StatsAPI.pairwiseFunction
pairwise(f, x[, y];
+ 0  1  1
source
indicatormat(x, c=sort(unique(x)); sparse=false)

Construct a boolean matrix I of size (length(c), length(x)). Let ci be the index of x[i] in c. Then I[ci, i] = true and all other elements are false.

source
StatsAPI.pairwiseFunction
pairwise(f, x[, y];
          symmetric::Bool=false, skipmissing::Symbol=:none)

Return a matrix holding the result of applying f to all possible pairs of entries in iterators x and y. Rows correspond to entries in x and columns to entries in y. If y is omitted then a square matrix crossing x with itself is returned.

As a special case, if f is cor, diagonal cells for which entries from x and y are identical (according to ===) are set to one even in the presence of missing, NaN or Inf entries.

Keyword arguments

  • symmetric::Bool=false: If true, f is only called for the lower triangle of the matrix, and these values are copied to fill the upper triangle. Only allowed when y is omitted. Defaults to true when f is cor or cov.
  • skipmissing::Symbol=:none: If :none (the default), missing values in inputs are passed to f without any modification. Use :pairwise to skip entries with a missing value in either of the two vectors passed to f for a given pair of vectors in x and y. Use :listwise to skip entries with a missing value in any of the vectors in x or y; note that this might drop a large part of entries. Only allowed when entries in x and y are vectors.

Examples

julia> using StatsBase, Statistics
 
 julia> x = [1 3 7
@@ -30,7 +30,7 @@
 3×3 Matrix{Float64}:
   1.0        0.928571  -0.866025
   0.928571   1.0       -1.0
- -0.866025  -1.0        1.0
source
StatsAPI.pairwise!Function
pairwise!(f, dest::AbstractMatrix, x[, y];
           symmetric::Bool=false, skipmissing::Symbol=:none)

Store in matrix dest the result of applying f to all possible pairs of entries in iterators x and y, and return it. Rows correspond to entries in x and columns to entries in y, and dest must therefore be of size length(x) × length(y). If y is omitted then x is crossed with itself.

As a special case, if f is cor, diagonal cells for which entries from x and y are identical (according to ===) are set to one even in the presence of missing, NaN or Inf entries.

Keyword arguments

  • symmetric::Bool=false: If true, f is only called for the lower triangle of the matrix, and these values are copied to fill the upper triangle. Only allowed when y is omitted. Defaults to true when f is cor or cov.
  • skipmissing::Symbol=:none: If :none (the default), missing values in inputs are passed to f without any modification. Use :pairwise to skip entries with a missing value in either of the two vectors passed to f for a given pair of vectors in x and y. Use :listwise to skip entries with a missing value in any of the vectors in x or y; note that this might drop a large part of entries. Only allowed when entries in x and y are vectors.

Examples

julia> using StatsBase, Statistics
 
 julia> dest = zeros(3, 3);
@@ -59,4 +59,4 @@
 3×3 Matrix{Float64}:
   1.0        0.928571  -0.866025
   0.928571   1.0       -1.0
- -0.866025  -1.0        1.0
source
+ -0.866025  -1.0        1.0
source
diff --git a/dev/multivariate/index.html b/dev/multivariate/index.html index 893f2814..1a76e402 100644 --- a/dev/multivariate/index.html +++ b/dev/multivariate/index.html @@ -1,2 +1,2 @@ -Multivariate Summary Statistics · StatsBase.jl

Multivariate Summary Statistics

This package provides a few methods for summarizing multivariate data.

Partial Correlation

StatsBase.partialcorFunction
partialcor(x, y, Z)

Compute the partial correlation of the vectors x and y given Z, which can be a vector or matrix.

source

Generalizations of Variance

StatsBase.genvarFunction
genvar(X)

Compute the generalized sample variance of X. If X is a vector, one-column matrix, or other iterable, this is equivalent to the sample variance. Otherwise if X is a matrix, this is equivalent to the determinant of the covariance matrix of X.

Note

The generalized sample variance will be 0 if the columns of the matrix of deviations are linearly dependent.

source
StatsBase.totalvarFunction
totalvar(X)

Compute the total sample variance of X. If X is a vector, one-column matrix, or other iterable, this is equivalent to the sample variance. Otherwise if X is a matrix, this is equivalent to the sum of the diagonal elements of the covariance matrix of X.

source
+Multivariate Summary Statistics · StatsBase.jl

Multivariate Summary Statistics

This package provides a few methods for summarizing multivariate data.

Partial Correlation

StatsBase.partialcorFunction
partialcor(x, y, Z)

Compute the partial correlation of the vectors x and y given Z, which can be a vector or matrix.

source

Generalizations of Variance

StatsBase.genvarFunction
genvar(X)

Compute the generalized sample variance of X. If X is a vector, one-column matrix, or other iterable, this is equivalent to the sample variance. Otherwise if X is a matrix, this is equivalent to the determinant of the covariance matrix of X.

Note

The generalized sample variance will be 0 if the columns of the matrix of deviations are linearly dependent.

source
StatsBase.totalvarFunction
totalvar(X)

Compute the total sample variance of X. If X is a vector, one-column matrix, or other iterable, this is equivalent to the sample variance. Otherwise if X is a matrix, this is equivalent to the sum of the diagonal elements of the covariance matrix of X.

source
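The equivalences stated for genvar and totalvar can be checked numerically. The following sketch uses an arbitrary 4×2 matrix chosen only for illustration.

```julia
using StatsBase, Statistics, LinearAlgebra

# Rows are observations, columns are variables (illustrative values).
X = [1.0 2.0;
     3.0 1.0;
     2.0 5.0;
     6.0 4.0]

C  = cov(X)
gv = genvar(X)      # should match det(C)
tv = totalvar(X)    # should match tr(C)

# For a vector, both reduce to the ordinary sample variance.
v   = X[:, 1]
gv1 = genvar(v)
tv1 = totalvar(v)
```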
diff --git a/dev/ranking/index.html b/dev/ranking/index.html index 3b61bed3..0af58ccb 100644 --- a/dev/ranking/index.html +++ b/dev/ranking/index.html @@ -1,2 +1,2 @@ -Rankings and Rank Correlations · StatsBase.jl

Rankings and Rank Correlations

This package implements various strategies for computing ranks and rank correlations.

StatsBase.ordinalrankFunction
ordinalrank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the ordinal ranking ("1234" ranking) of an array. Supports the same keyword arguments as the sort function. All items in x are given distinct, successive ranks based on their position in the sorted vector. Missing values are assigned rank missing.

source
StatsBase.competerankFunction
competerank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the standard competition ranking ("1224" ranking) of an array. Supports the same keyword arguments as the sort function. Equal ("tied") items are given the same rank, and the next rank comes after a gap that is equal to the number of tied items - 1. Missing values are assigned rank missing.

source
StatsBase.denserankFunction
denserank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the dense ranking ("1223" ranking) of an array. Supports the same keyword arguments as the sort function. Equal items receive the same rank, and the next subsequent rank is assigned with no gap. Missing values are assigned rank missing.

source
StatsBase.tiedrankFunction
tiedrank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the tied ranking, also called fractional or "1 2.5 2.5 4" ranking, of an array. Supports the same keyword arguments as the sort function. Equal ("tied") items receive the mean of the ranks they would have been assigned under the ordinal ranking (see ordinalrank). Missing values are assigned rank missing.

source
StatsBase.corspearmanFunction
corspearman(x, y=x)

Compute Spearman's rank correlation coefficient. If x and y are vectors, the output is a float, otherwise it's a matrix corresponding to the pairwise correlations of the columns of x and y.

source
StatsBase.corkendallFunction
corkendall(x, y=x)

Compute Kendall's rank correlation coefficient, τ. x and y must both be either matrices or vectors.

source
+Rankings and Rank Correlations · StatsBase.jl

Rankings and Rank Correlations

This package implements various strategies for computing ranks and rank correlations.

StatsBase.ordinalrankFunction
ordinalrank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the ordinal ranking ("1234" ranking) of an array. Supports the same keyword arguments as the sort function. All items in x are given distinct, successive ranks based on their position in the sorted vector. Missing values are assigned rank missing.

source
StatsBase.competerankFunction
competerank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the standard competition ranking ("1224" ranking) of an array. Supports the same keyword arguments as the sort function. Equal ("tied") items are given the same rank, and the next rank comes after a gap that is equal to the number of tied items - 1. Missing values are assigned rank missing.

source
StatsBase.denserankFunction
denserank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the dense ranking ("1223" ranking) of an array. Supports the same keyword arguments as the sort function. Equal items receive the same rank, and the next subsequent rank is assigned with no gap. Missing values are assigned rank missing.

source
StatsBase.tiedrankFunction
tiedrank(x; lt=isless, by=identity, rev::Bool=false, ...)

Return the tied ranking, also called fractional or "1 2.5 2.5 4" ranking, of an array. Supports the same keyword arguments as the sort function. Equal ("tied") items receive the mean of the ranks they would have been assigned under the ordinal ranking (see ordinalrank). Missing values are assigned rank missing.

source
StatsBase.corspearmanFunction
corspearman(x, y=x)

Compute Spearman's rank correlation coefficient. If x and y are vectors, the output is a float, otherwise it's a matrix corresponding to the pairwise correlations of the columns of x and y.

source
StatsBase.corkendallFunction
corkendall(x, y=x)

Compute Kendall's rank correlation coefficient, τ. x and y must both be either matrices or vectors.

source
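To make the four ranking schemes concrete, the sketch below applies each of them to a small vector with one tie. The input is made up for illustration.

```julia
using StatsBase

x = [10, 20, 20, 30]

r_ordinal = ordinalrank(x)   # [1, 2, 3, 4]: ties broken by position
r_compete = competerank(x)   # [1, 2, 2, 4]: gap after the tie
r_dense   = denserank(x)     # [1, 2, 2, 3]: no gap after the tie
r_tied    = tiedrank(x)      # [1.0, 2.5, 2.5, 4.0]: mean of ordinal ranks 2 and 3
```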
diff --git a/dev/robust/index.html b/dev/robust/index.html index 557046e9..2e819d53 100644 --- a/dev/robust/index.html +++ b/dev/robust/index.html @@ -3,10 +3,10 @@ 3-element Array{Int64,1}: 2 4 - 3source
StatsBase.trim!Function
trim!(x::AbstractVector; prop=0.0, count=0)

A variant of trim that modifies x in place.

source
StatsBase.winsorFunction
winsor(x::AbstractVector; prop=0.0, count=0)

Return an iterator of all elements of x that replaces either count or proportion prop of the highest elements with the previous-highest element and an equal number of the lowest elements with the next-lowest element.

The number of replaced elements could be smaller than specified if several elements equal the lower or upper bound.

To compute the Winsorized mean of x use mean(winsor(x)).

Example

julia> collect(winsor([5,2,3,4,1], prop=0.2))
+ 3
source
StatsBase.trim!Function
trim!(x::AbstractVector; prop=0.0, count=0)

A variant of trim that modifies x in place.

source
StatsBase.winsorFunction
winsor(x::AbstractVector; prop=0.0, count=0)

Return an iterator of all elements of x that replaces either count or proportion prop of the highest elements with the previous-highest element and an equal number of the lowest elements with the next-lowest element.

The number of replaced elements could be smaller than specified if several elements equal the lower or upper bound.

To compute the Winsorized mean of x use mean(winsor(x)).

Example

julia> collect(winsor([5,2,3,4,1], prop=0.2))
 5-element Array{Int64,1}:
  4
  2
  3
  4
- 2
source
StatsBase.winsor!Function
winsor!(x::AbstractVector; prop=0.0, count=0)

A variant of winsor that modifies vector x in place.

source
StatsBase.trimvarFunction
trimvar(x; prop=0.0, count=0)

Compute the variance of the trimmed mean of x. This function uses the Winsorized variance, as described in Wilcox (2010).

source
+ 2
source
StatsBase.winsor!Function
winsor!(x::AbstractVector; prop=0.0, count=0)

A variant of winsor that modifies vector x in place.

source
StatsBase.trimvarFunction
trimvar(x; prop=0.0, count=0)

Compute the variance of the trimmed mean of x. This function uses the Winsorized variance, as described in Wilcox (2010).

source
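As a small usage sketch, the Winsorized and trimmed means can be computed by passing the iterators to mean. The input vector is the one from the winsor example.

```julia
using StatsBase, Statistics

x = [5, 2, 3, 4, 1]

w = collect(winsor(x, prop=0.2))   # [4, 2, 3, 4, 2]: extremes replaced
t = collect(trim(x, prop=0.2))     # [2, 3, 4]: extremes dropped, order kept

winsorized_mean = mean(winsor(x, prop=0.2))   # 3.0
trimmed_mean    = mean(trim(x, prop=0.2))     # 3.0
```

Neither call modifies x; the in-place variants are trim! and winsor!.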
diff --git a/dev/sampling/index.html b/dev/sampling/index.html index 93c0d868..54de9626 100644 --- a/dev/sampling/index.html +++ b/dev/sampling/index.html @@ -1,5 +1,5 @@ -Sampling from Population · StatsBase.jl

Sampling from Population

Sampling API

The package provides functions for sampling from a given population (with or without replacement).

StatsBase.sampleFunction
sample([rng], a, [wv::AbstractWeights])

Select a single random element of a. Sampling probabilities are proportional to the weights given in wv, if provided.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)

Select a random, optionally weighted sample of size n from an array a using a polyalgorithm. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)

Select a random, optionally weighted sample from an array a specifying the dimensions dims of the output array. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
sample([rng], wv::AbstractWeights)

Select a single random integer in 1:length(wv) with probabilities proportional to the weights given in wv.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
StatsBase.sample!Function
sample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)

Draw a random sample of length(x) elements from an array a and store the result in x. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

Output array x must not be the same object as a or wv nor share memory with them, or the result may be incorrect.

source
StatsBase.wsampleFunction
wsample([rng], [a], w)

Select a weighted random sample of size 1 from a with probabilities proportional to the weights given in w. If a is not present, select a random weight from w.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
wsample([rng], [a], w, n::Integer; replace=true, ordered=false)

Select a weighted random sample of size n from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of size n of the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
wsample([rng], [a], w, dims::Dims; replace=true, ordered=false)

Select a weighted random sample from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of the weights given in w. The dimensions of the output are given by dims.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
StatsBase.wsample!Function
wsample!([rng], a, w, x; replace=true, ordered=false)

Select a weighted sample from an array a and store the result in x. Sampling probabilities are proportional to the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source

Algorithms

Internally, this package implements multiple algorithms, and the sample (and sample!) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.

Note that the choices made in sample are decided based on extensive benchmarking (see perf/sampling.jl and perf/wsampling.jl). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.

Here is a list of the algorithms implemented in the package. The functions below are not exported (though they can still be imported from StatsBase via using).

Notations

  • a: source array representing the population
  • x: the destination array
  • wv: the weight vector (of type AbstractWeights), for weighted sampling
  • n: the length of a
  • k: the length of x. For sampling without replacement, k must not exceed n.
  • rng: optional random number generator (defaults to Random.default_rng() on Julia >= 1.3 and Random.GLOBAL_RNG on Julia < 1.3)

All following functions write results to x (pre-allocated) and return x.

Sampling Algorithms (Non-Weighted)

StatsBase.direct_sample!Method
direct_sample!([rng], a::AbstractArray, x::AbstractArray)

Direct sampling: for each j in 1:k, randomly pick i from 1:n, and set x[j] = a[i], with n=length(a) and k=length(x).

This algorithm consumes k random numbers.

source
StatsBase.samplepairFunction
samplepair([rng], n)

Draw a pair of distinct integers between 1 and n without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
samplepair([rng], a)

Draw a pair of distinct elements from the array a without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
StatsBase.knuths_sample!Function
knuths_sample!([rng], a, x)

Knuth's Algorithm S for random sampling without replacement.

Reference: D. Knuth. The Art of Computer Programming. Vol 2, 3.4.2, p.142.

This algorithm consumes length(a) random numbers. It requires no additional memory space. Suitable for the case where memory is tight.

source
StatsBase.fisher_yates_sample!Function
fisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)

Fisher-Yates shuffling (with early termination).

Pseudo-code:

n = length(a)
+Sampling from Population · StatsBase.jl

Sampling from Population

Sampling API

The package provides functions for sampling from a given population (with or without replacement).

StatsBase.sampleFunction
sample([rng], a, [wv::AbstractWeights])

Select a single random element of a. Sampling probabilities are proportional to the weights given in wv, if provided.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)

Select a random, optionally weighted sample of size n from an array a using a polyalgorithm. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)

Select a random, optionally weighted sample from an array a specifying the dimensions dims of the output array. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
sample([rng], wv::AbstractWeights)

Select a single random integer in 1:length(wv) with probabilities proportional to the weights given in wv.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
StatsBase.sample!Function
sample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)

Draw a random sample of length(x) elements from an array a and store the result in x. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

Output array x must not be the same object as a or wv nor share memory with them, or the result may be incorrect.

source
StatsBase.wsampleFunction
wsample([rng], [a], w)

Select a weighted random sample of size 1 from a with probabilities proportional to the weights given in w. If a is not present, select a random weight from w.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
wsample([rng], [a], w, n::Integer; replace=true, ordered=false)

Select a weighted random sample of size n from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of size n of the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
wsample([rng], [a], w, dims::Dims; replace=true, ordered=false)

Select a weighted random sample from a with probabilities proportional to the weights given in w if a is present, otherwise select a random sample of the weights given in w. The dimensions of the output are given by dims.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
StatsBase.wsample!Function
wsample!([rng], a, w, x; replace=true, ordered=false)

Select a weighted sample from an array a and store the result in x. Sampling probabilities are proportional to the weights given in w. replace dictates whether sampling is performed with replacement. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in a) should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
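The calls documented above can be combined as in the following sketch. The population, weights, and seed are arbitrary choices for illustration; a seeded MersenneTwister is used only to make runs repeatable.

```julia
using StatsBase, Random

rng = MersenneTwister(1234)          # illustrative seed
a   = collect(1:10)                  # population
wv  = Weights(collect(1.0:10.0))     # larger values are more likely

one_draw = sample(rng, a)                            # a single element
no_rep   = sample(rng, a, 5; replace=false)          # 5 distinct elements
ordered  = sample(rng, a, wv, 4; replace=false, ordered=true)  # keeps the order of a
grid     = sample(rng, a, (2, 3))                    # 2×3 array, with replacement

x = zeros(Int, 3)
sample!(rng, a, x)                                   # fill a pre-allocated array
```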

Algorithms

Internally, this package implements multiple algorithms, and the sample (and sample!) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.

Note that the choices made in sample are decided based on extensive benchmarking (see perf/sampling.jl and perf/wsampling.jl). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.

Here is a list of the algorithms implemented in the package. The functions below are not exported (though they can still be imported from StatsBase via using).

Notations

  • a: source array representing the population
  • x: the destination array
  • wv: the weight vector (of type AbstractWeights), for weighted sampling
  • n: the length of a
  • k: the length of x. For sampling without replacement, k must not exceed n.
  • rng: optional random number generator (defaults to Random.default_rng() on Julia >= 1.3 and Random.GLOBAL_RNG on Julia < 1.3)

All following functions write results to x (pre-allocated) and return x.

Sampling Algorithms (Non-Weighted)

StatsBase.direct_sample!Method
direct_sample!([rng], a::AbstractArray, x::AbstractArray)

Direct sampling: for each j in 1:k, randomly pick i from 1:n, and set x[j] = a[i], with n=length(a) and k=length(x).

This algorithm consumes k random numbers.

source
StatsBase.samplepairFunction
samplepair([rng], n)

Draw a pair of distinct integers between 1 and n without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
samplepair([rng], a)

Draw a pair of distinct elements from the array a without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.default_rng()).

source
StatsBase.knuths_sample!Function
knuths_sample!([rng], a, x)

Knuth's Algorithm S for random sampling without replacement.

Reference: D. Knuth. The Art of Computer Programming. Vol 2, 3.4.2, p.142.

This algorithm consumes length(a) random numbers. It requires no additional memory space. Suitable for the case where memory is tight.

source
StatsBase.fisher_yates_sample!Function
fisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)

Fisher-Yates shuffling (with early termination).

Pseudo-code:

n = length(a)
 k = length(x)
 
 # Create an array of the indices
@@ -8,4 +8,4 @@
 for i = 1:k
     # swap element `i` with another random element in inds[i:n]
     # set element `i` in `x`
-end

This algorithm consumes k=length(x) random numbers. It uses an integer array of length n=length(a) internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm, especially when n is greater than k. It is $O(n)$ for initialization, plus $O(k)$ for random shuffling.

source
StatsBase.self_avoid_sample!Function
self_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)

Self-avoid sampling: use a set to track the indices that have already been sampled. Each time a new index is drawn, if it has already been sampled, redraw until an unsampled one is obtained.

This algorithm consumes about (or slightly more than) k=length(x) random numbers, and requires $O(k)$ memory to store the set of sampled indices. Very fast when $n \gg k$, with n=length(a).

However, if k is large and approaches $n$, the rejection rate would increase drastically, resulting in poorer performance.

source
StatsBase.seqsample_a!Function
seqsample_a!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(n)$ random numbers, with n=length(a). The outputs are ordered.

source
StatsBase.seqsample_c!Function
seqsample_c!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k^2)$ random numbers, with k=length(x). The outputs are ordered.

source
StatsBase.seqsample_d!Function
seqsample_d!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm D described in the following paper (page 716-17): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k)$ random numbers, with k=length(x). The outputs are ordered.

source
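Because these functions are not exported, they have to be qualified with the module name. The sketch below calls three of the non-weighted, without-replacement algorithms directly; the population, sample size, and seed are arbitrary.

```julia
using StatsBase, Random

rng = MersenneTwister(42)        # illustrative seed
a   = collect(1:100)             # population, n = 100
k   = 5                          # sample size, k <= n

x_fy = zeros(Int, k)
StatsBase.fisher_yates_sample!(rng, a, x_fy)   # O(n) setup, O(k) sampling

x_sa = zeros(Int, k)
StatsBase.self_avoid_sample!(rng, a, x_sa)     # well suited to n much larger than k

x_seq = zeros(Int, k)
StatsBase.seqsample_a!(rng, a, x_seq)          # output follows the order of a
```

Each call fills and returns its destination array, so the result can also be captured from the return value.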

Weighted Sampling Algorithms

StatsBase.direct_sample!Method
direct_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Direct sampling.

Draw each sample by scanning the weight vector.

Noting k=length(x) and n=length(a), this algorithm:

  • consumes k random numbers
  • has time complexity $O(n k)$, as scanning the weight vector each time takes $O(n)$
  • requires no additional memory space.
source
StatsBase.alias_sample!Function
alias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Alias method.

Build an alias table, and sample therefrom.

Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." ACM Transactions on Mathematical Software 3 (3): 253, 1977.

Noting k=length(x) and n=length(a), this algorithm takes $O(n \log n)$ time for building the alias table, and then $O(1)$ to draw each sample. It consumes $2 k$ random numbers.

source
StatsBase.naive_wsample_norep!Function
naive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Naive implementation of weighted sampling without replacement.

It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.

Noting k=length(x) and n=length(a), this algorithm consumes $O(k)$ random numbers, and has overall time complexity $O(n k)$.

source
StatsBase.efraimidis_a_wsample_norep!Function
efraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(n + k \log k)$ processing time to draw $k$ elements. It consumes $n$ random numbers.

source
StatsBase.efraimidis_ares_wsample_norep!Function
efraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(k \log(k) \log(n / k))$ processing time to draw $k$ elements. It consumes $n$ random numbers.

source
+end

This algorithm consumes k=length(x) random numbers. It uses an integer array of length n=length(a) internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm, especially when n is greater than k. It takes $O(n)$ time for initialization, plus $O(k)$ for random shuffling.

source
StatsBase.self_avoid_sample!Function
self_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)

Self-avoid sampling: use a set to track the indices that have already been sampled. Each time a new index is drawn, if it has already been sampled, redraw until an unsampled index is obtained.

This algorithm consumes about (or slightly more than) k=length(x) random numbers, and requires $O(k)$ memory to store the set of sampled indices. Very fast when $n \gg k$, with n=length(a).

However, if k is large and approaches $n$, the rejection rate would increase drastically, resulting in poorer performance.

source
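The rejection loop described above can be sketched in a few lines. This is an illustrative sketch only, not the StatsBase implementation: the name self_avoid_sketch! is hypothetical, and a is assumed to be a 1-indexed vector.

```julia
using Random

# Hypothetical helper illustrating self-avoid sampling; not the StatsBase code.
function self_avoid_sketch!(rng::AbstractRNG, a::AbstractVector, x::AbstractVector)
    n, k = length(a), length(x)
    k <= n || throw(ArgumentError("cannot draw more samples than there are elements"))
    seen = Set{Int}()                 # indices already sampled: O(k) memory
    for i in 1:k
        idx = rand(rng, 1:n)
        while idx in seen             # rejection step: redraw until the index is new
            idx = rand(rng, 1:n)
        end
        push!(seen, idx)
        x[i] = a[idx]
    end
    return x
end

x = self_avoid_sketch!(MersenneTwister(1), collect(1:1000), zeros(Int, 5))
```

With n = 1000 and k = 5 almost every draw is accepted on the first try; as k approaches n the inner while loop runs more and more often, which is the slowdown the docstring warns about.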
StatsBase.seqsample_a!Function
seqsample_a!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(n)$ random numbers, with n=length(a). The outputs are ordered.

source
StatsBase.seqsample_c!Function
seqsample_c!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k^2)$ random numbers, with k=length(x). The outputs are ordered.

source
StatsBase.seqsample_d!Function
seqsample_d!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm D described in the following paper (page 716-17): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k)$ random numbers, with k=length(x). The outputs are ordered.

source

Weighted Sampling Algorithms

StatsBase.direct_sample!Method
direct_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Direct sampling.

Draw each sample by scanning the weight vector.

Noting k=length(x) and n=length(a), this algorithm:

  • consumes k random numbers
  • has time complexity $O(n k)$, as scanning the weight vector each time takes $O(n)$
  • requires no additional memory space.
source
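As a rough illustration of why each draw costs $O(n)$, the scan over the weight vector might look like the sketch below. The function name direct_sample_sketch! is hypothetical and plain real-valued weights are assumed in place of an AbstractWeights object; this is not the StatsBase implementation.

```julia
using Random

# Hypothetical helper illustrating direct weighted sampling with replacement.
function direct_sample_sketch!(rng::AbstractRNG, a::AbstractVector,
                               w::AbstractVector{<:Real}, x::AbstractVector)
    length(a) == length(w) || throw(DimensionMismatch("a and w must have the same length"))
    total = sum(w)
    for i in eachindex(x)
        t = rand(rng) * total              # one random number per sample
        j = 1
        c = w[1]                           # running cumulative weight
        while c <= t && j < length(w)      # O(n) scan of the weight vector
            j += 1
            c += w[j]
        end
        x[i] = a[j]
    end
    return x
end

# All the weight sits on "b", so every draw returns "b".
x = direct_sample_sketch!(MersenneTwister(42), ["a", "b", "c"], [0.0, 1.0, 0.0], fill("", 10))
```

No auxiliary table is built, which matches the "no additional memory" property, at the price of rescanning the weights for every sample.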
StatsBase.alias_sample!Function
alias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Alias method.

Build an alias table, and sample therefrom.

Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." ACM Transactions on Mathematical Software 3 (3): 253, 1977.

Noting k=length(x) and n=length(a), this algorithm takes $O(n \log n)$ time for building the alias table, and then $O(1)$ to draw each sample. It consumes $2 k$ random numbers.

source
StatsBase.naive_wsample_norep!Function
naive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Naive implementation of weighted sampling without replacement.

It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.

Noting k=length(x) and n=length(a), this algorithm consumes $O(k)$ random numbers, and has overall time complexity $O(n k)$.

source
StatsBase.efraimidis_a_wsample_norep!Function
efraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(n + k \log k)$ processing time to draw $k$ elements. It consumes $n$ random numbers.

source
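The idea behind algorithm A in the cited paper is to give each item the key $u_i^{1/w_i}$, with $u_i$ uniform on $(0, 1)$, and keep the $k$ items with the largest keys. The sketch below illustrates that idea under the assumption of strictly positive weights; es_a_sketch is a hypothetical name and the code makes no claim to match the StatsBase implementation or its complexity.

```julia
using Random

# Hypothetical helper illustrating the key-based selection of Efraimidis-Spirakis A.
# Assumes all weights are strictly positive.
function es_a_sketch(rng::AbstractRNG, a::AbstractVector, w::AbstractVector{<:Real}, k::Integer)
    length(a) == length(w) || throw(DimensionMismatch("a and w must have the same length"))
    ks = [rand(rng)^(1 / wi) for wi in w]        # one random number per item: n in total
    idx = partialsortperm(ks, 1:k; rev=true)     # indices of the k largest keys
    return a[idx]
end

s = es_a_sketch(MersenneTwister(0), collect(1:10), fill(1.0, 10), 3)
```

Because each index appears at most once among the k largest keys, the result is a sample without replacement.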
StatsBase.efraimidis_ares_wsample_norep!Function
efraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(k \log(k) \log(n / k))$ processing time to draw $k$ elements. It consumes $n$ random numbers.

source
diff --git a/dev/scalarstats/index.html b/dev/scalarstats/index.html index 952f10e4..576b6b59 100644 --- a/dev/scalarstats/index.html +++ b/dev/scalarstats/index.html @@ -1,13 +1,13 @@ -Scalar Statistics · StatsBase.jl

Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

Weighted sum and mean

Base.sumFunction
sum(v::AbstractArray, w::AbstractWeights{<:Real}; [dims])

Compute the weighted sum of an array v with weights w, optionally over the dimension dims.

source
Base.sum!Function
sum!(R::AbstractArray, A::AbstractArray,
+Scalar Statistics · StatsBase.jl

Scalar Statistics

The package implements functions for computing various statistics over an array of scalar real numbers.

Weighted sum and mean

Base.sumFunction
sum(v::AbstractArray, w::AbstractWeights{<:Real}; [dims])

Compute the weighted sum of an array v with weights w, optionally over the dimension dims.

source
Base.sum!Function
sum!(R::AbstractArray, A::AbstractArray,
      w::AbstractWeights{<:Real}, dim::Int;
-     init::Bool=true)

Compute the weighted sum of A with weights w over the dimension dim and store the result in R. If init=false, the sum is added to R rather than starting from zero.

source
StatsBase.wsumFunction
wsum(v, w::AbstractVector, [dim])

Compute the weighted sum of an array v with weights w, optionally over the dimension dim.

source
StatsBase.wsum!Function
wsum!(R::AbstractArray, A::AbstractArray,
+     init::Bool=true)

Compute the weighted sum of A with weights w over the dimension dim and store the result in R. If init=false, the sum is added to R rather than starting from zero.

source
StatsBase.wsumFunction
wsum(v, w::AbstractVector, [dim])

Compute the weighted sum of an array v with weights w, optionally over the dimension dim.

source
StatsBase.wsum!Function
wsum!(R::AbstractArray, A::AbstractArray,
       w::AbstractVector, dim::Int;
-      init::Bool=true)

Compute the weighted sum of A with weights w over the dimension dim and store the result in R. If init=false, the sum is added to R rather than starting from zero.

source
Statistics.meanFunction
mean(A::AbstractArray, w::AbstractWeights[, dims::Int])

Compute the weighted mean of array A with weight vector w (of type AbstractWeights). If dims is provided, compute the weighted mean along dimension dims.

Examples

n = 20
+      init::Bool=true)

Compute the weighted sum of A with weights w over the dimension dim and store the result in R. If init=false, the sum is added to R rather than starting from zero.

source
Statistics.meanFunction
mean(A::AbstractArray, w::AbstractWeights[, dims::Int])

Compute the weighted mean of array A with weight vector w (of type AbstractWeights). If dims is provided, compute the weighted mean along dimension dims.

Examples

n = 20
 x = rand(n)
 w = rand(n)
-mean(x, weights(w))
source
Statistics.mean!Function
mean!(R::AbstractArray, A::AbstractArray, w::AbstractWeights[; dims=nothing])

Compute the weighted mean of array A with weight vector w (of type AbstractWeights) along dimension dims, and write results to R.

source

Means

The package provides functions to compute means of different kinds.

StatsBase.genmeanFunction
genmean(a, p)

Return the generalized/power mean with exponent p of a real-valued array, i.e. $\left( \frac{1}{n} \sum_{i=1}^n a_i^p \right)^{\frac{1}{p}}$, where n = length(a). It is taken to be the geometric mean when p == 0.

source

Moments and cumulants

Statistics.varFunction
var(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the variance of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample variance is defined as:

\[\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$
  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
  • Weights: ArgumentError (bias correction not supported)
source
var(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the variance of the vector x using the estimator ce.

source
Statistics.stdFunction
std(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the standard deviation of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample standard deviation is defined as:

\[\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }}\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$
  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
  • Weights: ArgumentError (bias correction not supported)
source
std(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the standard deviation of the vector x using the estimator ce.

source
StatsBase.mean_and_varFunction
mean_and_var(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, var)

Return the mean and variance of collection x. If x is an AbstractArray, dim can be specified as a tuple to compute statistics over these dimensions. A weighting vector w can be specified to weight the estimates. Finally, bias correction is applied to the variance calculation if corrected=true. See var documentation for more details.

source
StatsBase.mean_and_stdFunction
mean_and_std(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, std)

Return the mean and standard deviation of collection x. If x is an AbstractArray, dim can be specified as a tuple to compute statistics over these dimensions. A weighting vector w can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if corrected=true. See std documentation for more details.

source
StatsBase.skewnessFunction
skewness(v, [wv::AbstractWeights], m=mean(v))

Compute the standardized skewness of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source
StatsBase.kurtosisFunction
kurtosis(v, [wv::AbstractWeights], m=mean(v))

Compute the excess kurtosis of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source
StatsBase.momentFunction
moment(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order central moment of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source
StatsBase.cumulantFunction
cumulant(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order cumulant of a real-valued array v, optionally specifying a weighting vector wv and a pre-computed mean m.

If k is a range of Integers, then return all the cumulants of orders in this range as a vector.

This quantity is calculated using a recursive definition on lower-order cumulants and central moments.

Reference: Smith, P. J. 1995. A Recursive Formulation of the Old Problem of Obtaining Moments from Cumulants and Vice Versa. The American Statistician, 49(2), 217–218. https://doi.org/10.2307/2684642

source

Measurements of Variation

StatsBase.spanFunction
span(x)

Return the span of a collection, i.e. the range minimum(x):maximum(x). The minimum and maximum of x are computed in one pass using extrema.

source
StatsBase.variationFunction
variation(x, m=mean(x); corrected=true)

Return the coefficient of variation of collection x, optionally specifying a precomputed mean m, and the optional correction parameter corrected. The coefficient of variation is the ratio of the standard deviation to the mean. If corrected is false, then std is calculated with denominator n. Else, the std is calculated with denominator n-1.

source
StatsBase.semFunction
sem(x; mean=nothing)
-sem(x::AbstractArray[, weights::AbstractWeights]; mean=nothing)

Return the standard error of the mean for a collection x. A pre-computed mean may be provided.

When not using weights, this is the (sample) standard deviation divided by the square root of the sample size. If weights are used, the variance of the sample mean is calculated as follows:

  • AnalyticWeights: Not implemented.
  • FrequencyWeights: $\frac{\sum_{i=1}^n w_i (x_i - \bar{x})^2}{(\sum w_i) (\sum w_i - 1)}$
  • ProbabilityWeights: $\frac{n}{n-1} \frac{\sum_{i=1}^n w_i^2 (x_i - \bar{x})^2}{\left( \sum w_i \right)^2}$

The standard error is then the square root of the above quantities.

References

Carl-Erik Särndal, Bengt Swensson, Jan Wretman (1992). Model Assisted Survey Sampling. New York: Springer. pp. 51-53.

source
StatsBase.madFunction
mad(x; center=median(x), normalize=true)

Compute the median absolute deviation (MAD) of collection x around center (by default, around the median).

If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

source
StatsBase.mad!Function
StatsBase.mad!(x; center=median!(x), normalize=true)

Compute the median absolute deviation (MAD) of array x around center (by default, around the median), overwriting x in the process.

If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

source

Z-scores

StatsBase.zscoreFunction
zscore(X, [μ, σ])

Compute the z-scores of X, optionally specifying a precomputed mean μ and standard deviation σ. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

μ and σ should be both scalars or both arrays. The computation is broadcasting. In particular, when μ and σ are arrays, they should have the same size, and size(μ, i) == 1 || size(μ, i) == size(X, i) for each dimension.

source
StatsBase.zscore!Function
zscore!([Z], X, μ, σ)

Compute the z-scores of an array X with mean μ and standard deviation σ. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

If a destination array Z is provided, the scores are stored in Z and it must have the same shape as X. Otherwise X is overwritten.

source
StatsBase.entropyFunction
entropy(p, [b])

Compute the entropy of a collection of probabilities p, optionally specifying a real number b such that the entropy is scaled by 1/log(b). Elements with probability 0 or 1 add 0 to the entropy.

source
StatsBase.crossentropyFunction
crossentropy(p, q, [b])

Compute the cross entropy between p and q, optionally specifying a real number b such that the result is scaled by 1/log(b).

source
StatsBase.kldivergenceFunction
kldivergence(p, q, [b])

Compute the Kullback-Leibler divergence from q to p, also called the relative entropy of p with respect to q, that is the sum pᵢ * log(pᵢ / qᵢ). Optionally a real number b can be specified such that the divergence is scaled by 1/log(b).

source
StatsBase.iqrFunction
iqr(x)

Compute the interquartile range (IQR) of collection x, i.e. the 75th percentile minus the 25th percentile.

source
StatsBase.nquantileFunction
nquantile(x, n::Integer)

Return the n-quantiles of collection x, i.e. the values which partition x into n subsets of nearly equal size.

Equivalent to quantile(x, (0:n)/n). For example, nquantile(x, 5) returns a vector of quantiles, respectively at [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].

source
Statistics.quantileFunction
quantile(v, w::AbstractWeights, p)

Compute the weighted quantiles of a vector v at a specified set of probability values p, using weights given by a weight vector w (of type AbstractWeights). Weights must not be negative. The weights and data vectors must have the same length. NaN is returned if v contains any NaN values. An error is raised if w contains any NaN values.

With FrequencyWeights, the function returns the same result as quantile for a vector with repeated values. Weights must be integers.

With weights other than FrequencyWeights, let $N$ denote the length of the vector, $w$ the vector of weights, $h = p (\sum_{i \le N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$, and $S_k = \sum_{i \le k} w_i$ the cumulative weight for each observation. Define $v_{k+1}$ as the smallest element of v such that $S_{k+1}$ is strictly greater than $h$. The weighted $p$ quantile is given by $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when all weights are equal, the function returns the same result as the unweighted quantile.

source
Statistics.medianMethod
median(v::AbstractVector{<:Real}, w::AbstractWeights)

Compute the weighted median of v with weights w (of type AbstractWeights). See the documentation for quantile for more details.

source
StatsBase.quantilerankFunction
quantilerank(itr, value; method=:inc)

Compute the quantile position in the [0, 1] interval of value relative to collection itr.

Different definitions can be chosen via the method keyword argument. Let count_less be the number of elements of itr that are less than value, count_equal the number of elements of itr that are equal to value, n the length of itr, greatest_smaller the highest value below value and smallest_greater the lowest value above value. Then method supports the following definitions:

  • :inc (default): Return a value in the range 0 to 1 inclusive.

Return count_less / (n - 1) if value ∈ itr, otherwise apply interpolation based on definition 7 of quantile in Hyndman and Fan (1996) (equivalent to Excel PERCENTRANK and PERCENTRANK.INC). This definition corresponds to the lower semi-continuous inverse of quantile with its default parameters.

  • :exc: Return a value in the range 0 to 1 exclusive.

Return (count_less + 1) / (n + 1) if value ∈ itr otherwise apply interpolation based on definition 6 of quantile in Hyndman and Fan (1996) (equivalent to Excel PERCENTRANK.EXC).

  • :compete: Return count_less / (n - 1) if value ∈ itr, otherwise return (count_less - 1) / (n - 1), without interpolation (equivalent to MariaDB PERCENT_RANK, dplyr percent_rank).

  • :tied: Return (count_less + count_equal/2) / n, without interpolation. Based on the definition in Roscoe, J. T. (1975) (equivalent to "mean" kind of SciPy percentileofscore).

  • :strict: Return count_less / n, without interpolation (equivalent to "strict" kind of SciPy percentileofscore).

  • :weak: Return (count_less + count_equal) / n, without interpolation (equivalent to "weak" kind of SciPy percentileofscore).

Note

An ArgumentError is thrown if itr contains NaN or missing values or if itr contains fewer than two elements.

References

Roscoe, J. T. (1975). "Fundamental Research Statistics for the Behavioral Sciences", 2nd ed., New York: Holt, Rinehart and Winston.

Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages", The American Statistician, Vol. 50, No. 4, pp. 361-365.

Examples

julia> using StatsBase
+mean(x, weights(w))
source
Statistics.mean!Function
mean!(R::AbstractArray, A::AbstractArray, w::AbstractWeights[; dims=nothing])

Compute the weighted mean of array A with weight vector w (of type AbstractWeights) along dimension dims, and write results to R.

source

Means

The package provides functions to compute means of different kinds.

StatsBase.genmeanFunction
genmean(a, p)

Return the generalized/power mean with exponent p of a real-valued array, i.e. $\left( \frac{1}{n} \sum_{i=1}^n a_i^p \right)^{\frac{1}{p}}$, where n = length(a). It is taken to be the geometric mean when p == 0.

source
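For illustration, the formula above can be written directly in Julia. The name power_mean_sketch is hypothetical and the input is assumed to be a positive, floating-point array; this is a sketch of the definition rather than the library code.

```julia
# Hypothetical helper: generalized/power mean with exponent p.
# p == 0 is treated as the geometric mean, computed via the mean of logs.
function power_mean_sketch(a, p)
    n = length(a)
    if p == 0
        return exp(sum(log, a) / n)
    end
    return (sum(x -> x^p, a) / n)^(1 / p)
end

a = [1.0, 2.0, 4.0]
arith = power_mean_sketch(a, 1)    # arithmetic mean
geom  = power_mean_sketch(a, 0)    # geometric mean
harm  = power_mean_sketch(a, -1)   # harmonic mean
```

The exponents 1, 0 and -1 recover the arithmetic, geometric and harmonic means respectively, which is a convenient sanity check for any implementation.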

Moments and cumulants

Statistics.varFunction
var(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the variance of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample variance is defined as:

\[\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population variance is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$
  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
  • Weights: ArgumentError (bias correction not supported)
source
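The correction factors listed above can be compared side by side with a small sketch. It assumes plain numeric vectors and selects the factor with a hypothetical correction keyword, whereas the package chooses it from the weight type; it is not the StatsBase implementation.

```julia
# Hypothetical helper illustrating the formulas; not the StatsBase implementation.
function weighted_var_sketch(x, w; correction::Symbol = :none)
    s = sum(w)
    μ = sum(w .* x) / s                    # weighted mean
    ss = sum(w .* (x .- μ) .^ 2)           # weighted sum of squared deviations
    if correction === :none
        return ss / s                      # uncorrected: 1 / Σw
    elseif correction === :analytic
        return ss / (s - sum(abs2, w) / s) # 1 / (Σw - Σw² / Σw)
    elseif correction === :frequency
        return ss / (s - 1)                # 1 / (Σw - 1)
    elseif correction === :probability
        n = count(!iszero, w)
        return ss * n / ((n - 1) * s)      # n / ((n - 1) Σw)
    end
    throw(ArgumentError("unknown correction: $correction"))
end

x = [1.0, 2.0, 3.0, 4.0]
w = ones(4)
v_uncorrected = weighted_var_sketch(x, w)
v_frequency   = weighted_var_sketch(x, w; correction = :frequency)
```

With unit weights, all three corrected variants reduce to the familiar $n - 1$ denominator, while the uncorrected variant divides by $n$.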
var(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the variance of the vector x using the estimator ce.

source
Statistics.stdFunction
std(x::AbstractArray, w::AbstractWeights, [dim]; mean=nothing, corrected=false)

Compute the standard deviation of a real-valued array x, optionally over a dimension dim. Observations in x are weighted using weight vector w. The uncorrected (when corrected=false) sample standard deviation is defined as:

\[\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - μ}\right)^2 }}\]

where $n$ is the length of the input and $μ$ is the mean. The unbiased estimate (when corrected=true) of the population standard deviation is computed by replacing $\frac{1}{\sum{w}}$ with a factor dependent on the type of weights used:

  • AnalyticWeights: $\frac{1}{\sum w - \sum {w^2} / \sum w}$
  • FrequencyWeights: $\frac{1}{\sum{w} - 1}$
  • ProbabilityWeights: $\frac{n}{(n - 1) \sum w}$ where $n$ equals count(!iszero, w)
  • Weights: ArgumentError (bias correction not supported)
source
std(ce::CovarianceEstimator, x::AbstractVector; mean=nothing)

Compute the standard deviation of the vector x using the estimator ce.

source
StatsBase.mean_and_varFunction
mean_and_var(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, var)

Return the mean and variance of collection x. If x is an AbstractArray, dim can be specified as a tuple to compute statistics over these dimensions. A weighting vector w can be specified to weight the estimates. Finally, bias correction is applied to the variance calculation if corrected=true. See var documentation for more details.

source
StatsBase.mean_and_stdFunction
mean_and_std(x, [w::AbstractWeights], [dim]; corrected=true) -> (mean, std)

Return the mean and standard deviation of collection x. If x is an AbstractArray, dim can be specified as a tuple to compute statistics over these dimensions. A weighting vector w can be specified to weight the estimates. Finally, bias correction is applied to the standard deviation calculation if corrected=true. See std documentation for more details.

source
StatsBase.skewnessFunction
skewness(v, [wv::AbstractWeights], m=mean(v))

Compute the standardized skewness of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source
StatsBase.kurtosisFunction
kurtosis(v, [wv::AbstractWeights], m=mean(v))

Compute the excess kurtosis of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source
StatsBase.momentFunction
moment(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order central moment of a real-valued array v, optionally specifying a weighting vector wv and a center m.

source
StatsBase.cumulantFunction
cumulant(v, k, [wv::AbstractWeights], m=mean(v))

Return the kth order cumulant of a real-valued array v, optionally specifying a weighting vector wv and a pre-computed mean m.

If k is a range of Integers, then return all the cumulants of orders in this range as a vector.

This quantity is calculated using a recursive definition on lower-order cumulants and central moments.

Reference: Smith, P. J. 1995. A Recursive Formulation of the Old Problem of Obtaining Moments from Cumulants and Vice Versa. The American Statistician, 49(2), 217–218. https://doi.org/10.2307/2684642

source

Measurements of Variation

StatsBase.spanFunction
span(x)

Return the span of a collection, i.e. the range minimum(x):maximum(x). The minimum and maximum of x are computed in one pass using extrema.

source
StatsBase.variationFunction
variation(x, m=mean(x); corrected=true)

Return the coefficient of variation of collection x, optionally specifying a precomputed mean m, and the optional correction parameter corrected. The coefficient of variation is the ratio of the standard deviation to the mean. If corrected is false, then std is calculated with denominator n. Else, the std is calculated with denominator n-1.

source
StatsBase.semFunction
sem(x; mean=nothing)
+sem(x::AbstractArray[, weights::AbstractWeights]; mean=nothing)

Return the standard error of the mean for a collection x. A pre-computed mean may be provided.

When not using weights, this is the (sample) standard deviation divided by the square root of the sample size. If weights are used, the variance of the sample mean is calculated as follows:

  • AnalyticWeights: Not implemented.
  • FrequencyWeights: $\frac{\sum_{i=1}^n w_i (x_i - \bar{x})^2}{(\sum w_i) (\sum w_i - 1)}$
  • ProbabilityWeights: $\frac{n}{n-1} \frac{\sum_{i=1}^n w_i^2 (x_i - \bar{x})^2}{\left( \sum w_i \right)^2}$

The standard error is then the square root of the above quantities.

References

Carl-Erik Särndal, Bengt Swensson, Jan Wretman (1992). Model Assisted Survey Sampling. New York: Springer. pp. 51-53.

source
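As a sketch of the FrequencyWeights case, the formula above can be evaluated by hand. The name sem_freq_sketch is hypothetical, plain integer counts stand in for a FrequencyWeights object, and $\bar{x}$ is taken to be the weighted mean.

```julia
# Hypothetical helper: standard error of the mean under frequency weights.
function sem_freq_sketch(x, w)
    s = sum(w)
    μ = sum(w .* x) / s                       # weighted mean
    ss = sum(w .* (x .- μ) .^ 2)              # weighted sum of squared deviations
    return sqrt(ss / (s * (s - 1)))           # square root of the variance of the mean
end

# The value 1.0 observed twice and 3.0 observed once.
se = sem_freq_sketch([1.0, 3.0], [2, 1])
```

Since frequency weights are repeat counts, the result should agree with the unweighted standard error computed on the expanded data [1.0, 1.0, 3.0].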
StatsBase.madFunction
mad(x; center=median(x), normalize=true)

Compute the median absolute deviation (MAD) of collection x around center (by default, around the median).

If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

source
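A minimal sketch of the computation, using only the Statistics standard library: the name mad_sketch is hypothetical, and the scaling constant is hard-coded to the approximate value 1.4826 quoted above rather than derived from a normal quantile.

```julia
using Statistics

# Hypothetical helper illustrating the median absolute deviation.
function mad_sketch(x; center = median(x), normalize = true)
    m = median(abs.(x .- center))          # median of absolute deviations
    return normalize ? 1.4826 * m : m      # approximate consistency constant
end

x = [1.0, 2.0, 3.0, 4.0, 100.0]
raw = mad_sketch(x; normalize = false)
scaled = mad_sketch(x)
```

The outlier 100.0 barely affects the result, which is why the MAD is often preferred to the standard deviation as a robust measure of spread.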
StatsBase.mad!Function
StatsBase.mad!(x; center=median!(x), normalize=true)

Compute the median absolute deviation (MAD) of array x around center (by default, around the median), overwriting x in the process.

If normalize is set to true, the MAD is multiplied by 1 / quantile(Normal(), 3/4) ≈ 1.4826, in order to obtain a consistent estimator of the standard deviation under the assumption that the data is normally distributed.

source

Z-scores

StatsBase.zscoreFunction
zscore(X, [μ, σ])

Compute the z-scores of X, optionally specifying a precomputed mean μ and standard deviation σ. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

μ and σ should be both scalars or both arrays. The computation is broadcasting. In particular, when μ and σ are arrays, they should have the same size, and size(μ, i) == 1 || size(μ, i) == size(X, i) for each dimension.

source
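The broadcasting rule can be illustrated with column-wise standardization of a matrix. This sketch uses only the Statistics standard library; zscore_sketch is a hypothetical stand-in, not the package function.

```julia
using Statistics

# Hypothetical helper: z-scores via broadcasting.
zscore_sketch(X, μ, σ) = (X .- μ) ./ σ

X = [1.0 2.0; 3.0 6.0; 5.0 10.0]
μ = mean(X; dims = 1)          # 1×2 array of column means
σ = std(X; dims = 1)           # 1×2 array of column standard deviations
Z = zscore_sketch(X, μ, σ)     # size(μ, 1) == 1 broadcasts over the rows of X
```

Here μ and σ have the same size, and each of their dimensions is either 1 or equal to the corresponding dimension of X, so the broadcast is well defined.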
StatsBase.zscore!Function
zscore!([Z], X, μ, σ)

Compute the z-scores of an array X with mean μ and standard deviation σ. z-scores are the signed number of standard deviations above the mean that an observation lies, i.e. $(x - μ) / σ$.

If a destination array Z is provided, the scores are stored in Z and it must have the same shape as X. Otherwise X is overwritten.

source
StatsBase.entropyFunction
entropy(p, [b])

Compute the entropy of a collection of probabilities p, optionally specifying a real number b such that the entropy is scaled by 1/log(b). Elements with probability 0 or 1 add 0 to the entropy.

source
StatsBase.crossentropyFunction
crossentropy(p, q, [b])

Compute the cross entropy between p and q, optionally specifying a real number b such that the result is scaled by 1/log(b).

source
StatsBase.kldivergenceFunction
kldivergence(p, q, [b])

Compute the Kullback-Leibler divergence from q to p, also called the relative entropy of p with respect to q, that is the sum pᵢ * log(pᵢ / qᵢ). Optionally a real number b can be specified such that the divergence is scaled by 1/log(b).

source
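The three quantities are linked by the identity that the Kullback-Leibler divergence equals the cross entropy minus the entropy. The sketch below checks this numerically; the helper names are hypothetical, and the convention that terms with pᵢ = 0 contribute zero is an assumption of the sketch.

```julia
# Hypothetical helpers; terms with zero probability pᵢ are assumed to contribute 0.
plogq(p, q) = p > 0 ? p * log(q) : zero(float(p))

entropy_sketch(p) = -sum(plogq.(p, p))
crossentropy_sketch(p, q) = -sum(plogq.(p, q))
kl_sketch(p, q) = sum(pᵢ > 0 ? pᵢ * log(pᵢ / qᵢ) : 0.0 for (pᵢ, qᵢ) in zip(p, q))

p = [0.5, 0.5, 0.0]
q = [0.25, 0.5, 0.25]

h_nats = entropy_sketch(p)             # natural-log units
h_bits = entropy_sketch(p) / log(2)    # scaling by 1/log(b) with b = 2
d = kl_sketch(p, q)
```

Dividing by log(b) is all that the optional base argument amounts to, so a fair coin has an entropy of one bit when b = 2.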
StatsBase.iqrFunction
iqr(x)

Compute the interquartile range (IQR) of collection x, i.e. the 75th percentile minus the 25th percentile.

source
StatsBase.nquantileFunction
nquantile(x, n::Integer)

Return the n-quantiles of collection x, i.e. the values which partition x into n subsets of nearly equal size.

Equivalent to quantile(x, (0:n)/n). For example, nquantile(x, 5) returns a vector of quantiles, respectively at [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].

source
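The equivalence with quantile can be tried out using only the Statistics standard library. In this sketch nquantile_sketch is a hypothetical stand-in for the package function, and the probabilities are built from the range 0:n.

```julia
using Statistics

# Hypothetical helper: n-quantiles as quantiles at the probabilities 0, 1/n, ..., 1.
nquantile_sketch(x, n::Integer) = quantile(x, (0:n) / n)

x = collect(1.0:11.0)
q = nquantile_sketch(x, 5)     # quantiles at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
```

The result has n + 1 entries: the minimum, the n - 1 interior cut points, and the maximum.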
Statistics.quantileFunction
quantile(v, w::AbstractWeights, p)

Compute the weighted quantiles of a vector v at a specified set of probability values p, using weights given by a weight vector w (of type AbstractWeights). Weights must not be negative. The weights and data vectors must have the same length. NaN is returned if v contains any NaN values. An error is raised if w contains any NaN values.

With FrequencyWeights, the function returns the same result as quantile for a vector with repeated values. Weights must be integers.

With weights other than FrequencyWeights, let $N$ denote the length of the vector, $w$ the vector of weights, $h = p (\sum_{i \le N} w_i - w_1) + w_1$ the cumulative weight corresponding to the probability $p$, and $S_k = \sum_{i \le k} w_i$ the cumulative weight for each observation. Define $v_{k+1}$ as the smallest element of v such that $S_{k+1}$ is strictly greater than $h$. The weighted $p$ quantile is given by $v_k + \gamma (v_{k+1} - v_k)$ with $\gamma = (h - S_k)/(S_{k+1} - S_k)$. In particular, when all weights are equal, the function returns the same result as the unweighted quantile.

source
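The interpolation rule for weights other than FrequencyWeights can be written out directly. The function below is a minimal sketch of that formula, not the library's implementation: the name weighted_quantile_sketch is hypothetical, weights are assumed strictly positive, and it takes a plain vector of weights instead of an AbstractWeights object.

```julia
using Statistics

# Sketch of v_k + γ (v_{k+1} - v_k) with γ = (h - S_k) / (S_{k+1} - S_k).
function weighted_quantile_sketch(v, w, p)
    order = sortperm(v)
    vs, ws = v[order], w[order]            # data and weights in ascending data order
    S = cumsum(ws)                         # S_k: cumulative weights
    h = p * (S[end] - ws[1]) + ws[1]       # cumulative weight matching probability p
    j = findfirst(>(h), S)                 # index k + 1: first S strictly greater than h
    j === nothing && return float(vs[end]) # p == 1: no such index, return the maximum
    j == 1 && return float(vs[1])
    γ = (h - S[j-1]) / (S[j] - S[j-1])
    return vs[j-1] + γ * (vs[j] - vs[j-1])
end

# With equal weights the sketch agrees with the unweighted quantile.
equal = weighted_quantile_sketch([5, 1, 3, 2, 4], ones(5), 0.1)
plain = quantile([1, 2, 3, 4, 5], 0.1)

# With unequal weights the interpolation shifts toward heavier observations.
unequal = weighted_quantile_sketch([1, 2, 3], [1.0, 1.0, 2.0], 0.5)
```

In the unequal case, h = 2.5 falls between S₂ = 2 and S₃ = 4, so γ = 0.25 and the sketch returns 2.25.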
Statistics.medianMethod
median(v::AbstractVector{<:Real}, w::AbstractWeights)

Compute the weighted median of v with weights w (of type AbstractWeights). See the documentation for quantile for more details.

source
StatsBase.quantilerankFunction
quantilerank(itr, value; method=:inc)

Compute the quantile position in the [0, 1] interval of value relative to collection itr.

Different definitions can be chosen via the method keyword argument. Let count_less be the number of elements of itr that are less than value, count_equal the number of elements of itr that are equal to value, n the length of itr, greatest_smaller the highest value below value and smallest_greater the lowest value above value. Then method supports the following definitions:

  • :inc (default): Return a value in the range 0 to 1 inclusive.

Return count_less / (n - 1) if value ∈ itr, otherwise apply interpolation based on definition 7 of quantile in Hyndman and Fan (1996) (equivalent to Excel PERCENTRANK and PERCENTRANK.INC). This definition corresponds to the lower semi-continuous inverse of quantile with its default parameters.

  • :exc: Return a value in the range 0 to 1 exclusive.

Return (count_less + 1) / (n + 1) if value ∈ itr otherwise apply interpolation based on definition 6 of quantile in Hyndman and Fan (1996) (equivalent to Excel PERCENTRANK.EXC).

  • :compete: Return count_less / (n - 1) if value ∈ itr, otherwise return (count_less - 1) / (n - 1), without interpolation (equivalent to MariaDB PERCENT_RANK, dplyr percent_rank).

  • :tied: Return (count_less + count_equal/2) / n, without interpolation.

Based on the definition in Roscoe, J. T. (1975) (equivalent to "mean" kind of SciPy percentileofscore).

  • :strict: Return count_less / n, without interpolation (equivalent to "strict" kind of SciPy percentileofscore).

  • :weak: Return (count_less + count_equal) / n, without interpolation (equivalent to "weak" kind of SciPy percentileofscore).
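For a value that occurs in itr, the counting formulas above can be evaluated by hand. The following sketch does so in plain Julia without calling quantilerank; it covers only the cases that need no interpolation, and the variable names simply follow the definitions in the text.

```julia
# Data taken from the v1 vector used in the examples of this docstring.
itr = [1, 1, 1, 2, 3, 4, 8, 11, 12, 13]
value = 2

n = length(itr)
count_less  = count(<(value), itr)     # elements strictly below value
count_equal = count(==(value), itr)    # elements equal to value

rank_inc    = count_less / (n - 1)                  # :inc, valid because value ∈ itr
rank_exc    = (count_less + 1) / (n + 1)            # :exc, valid because value ∈ itr
rank_tied   = (count_less + count_equal / 2) / n    # :tied
rank_strict = count_less / n                        # :strict
rank_weak   = (count_less + count_equal) / n        # :weak
```

Here count_less is 3 and count_equal is 1, so the strict, tied, and weak ranks are 0.3, 0.35, and 0.4 respectively.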

Note

An ArgumentError is thrown if itr contains NaN or missing values or if itr contains fewer than two elements.

References

Roscoe, J. T. (1975). "Fundamental Research Statistics for the Behavioral Sciences", 2nd ed., New York: Holt, Rinehart and Winston.

Hyndman, R. J. and Fan, Y. (1996). "Sample Quantiles in Statistical Packages", The American Statistician, Vol. 50, No. 4, pp. 361-365.

Examples

julia> using StatsBase
 
 julia> v1 = [1, 1, 1, 2, 3, 4, 8, 11, 12, 13];
 
@@ -29,9 +29,9 @@
 julia> quantilerank.(Ref(v3), [4, 8])
 2-element Vector{Float64}:
  0.3333333333333333
- 0.8888888888888888
source

Mode and Modes

StatsBase.modeFunction
mode(a, [r])
-mode(a::AbstractArray, wv::AbstractWeights)

Return the mode (most common number) of an array, optionally over a specified range r or weighted via a vector wv. If several modes exist, the first one (in order of appearance) is returned.

source
StatsBase.modesFunction
modes(a, [r])::Vector
-modes(a::AbstractArray, wv::AbstractWeights)::Vector

Return all modes (most common numbers) of an array, optionally over a specified range r or weighted via vector wv.

source

Summary Statistics

StatsBase.summarystatsFunction
summarystats(a)

Compute summary statistics for a real-valued array a. Returns a SummaryStats object containing the number of observations, number of missing observations, standard deviation, mean, minimum, 25th percentile, median, 75th percentile, and maximum.

source
DataAPI.describeFunction
describe(a)

Pretty-print the summary statistics provided by summarystats: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

source

Reliability Measures

StatsBase.cronbachalphaFunction
cronbachalpha(covmatrix::AbstractMatrix{<:Real})

Calculate Cronbach's alpha (1951) from a covariance matrix covmatrix according to the formula:

\[\rho = \frac{k}{k-1} (1 - \frac{\sum^k_{i=1} \sigma^2_i}{\sum_{i=1}^k \sum_{j=1}^k \sigma_{ij}})\]

where $k$ is the number of items, i.e. columns, $\sigma_i^2$ the item variance, and $\sigma_{ij}$ the inter-item covariance.

Returns a CronbachAlpha object that holds:

  • alpha: the Cronbach's alpha score for all items, i.e. columns, in covmatrix; and
  • dropped: a vector giving Cronbach's alpha scores if a specific item, i.e. column, is dropped from covmatrix.

Example

julia> using StatsBase
+ 0.8888888888888888
source

Mode and Modes

StatsBase.modeFunction
mode(a, [r])
+mode(a::AbstractArray, wv::AbstractWeights)

Return the mode (most common number) of an array, optionally over a specified range r or weighted via a vector wv. If several modes exist, the first one (in order of appearance) is returned.

source
StatsBase.modesFunction
modes(a, [r])::Vector
+modes(a::AbstractArray, wv::AbstractWeights)::Vector

Return all modes (most common numbers) of an array, optionally over a specified range r or weighted via vector wv.

source
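The difference between a single mode and all modes can be sketched with a plain dictionary of counts. This is an illustration of the described behavior (ties resolved by order of appearance), not the library's implementation; the input array is hypothetical.

```julia
a = [3, 1, 3, 2, 1]    # hypothetical data: 3 and 1 both occur twice

# Count occurrences of each value.
counts = Dict{Int,Int}()
for x in a
    counts[x] = get(counts, x, 0) + 1
end

maxcount = maximum(values(counts))

# All values reaching the maximum count, kept in order of first appearance.
all_modes = unique(filter(x -> counts[x] == maxcount, a))

# A single mode: the first of the tied values to appear in a.
first_mode = first(all_modes)
```

For this input all_modes is [3, 1] and first_mode is 3.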

Summary Statistics

StatsBase.summarystatsFunction
summarystats(a)

Compute summary statistics for a real-valued array a. Returns a SummaryStats object containing the number of observations, number of missing observations, standard deviation, mean, minimum, 25th percentile, median, 75th percentile, and maximum.

source
DataAPI.describeFunction
describe(a)

Pretty-print the summary statistics provided by summarystats: the mean, minimum, 25th percentile, median, 75th percentile, and maximum.

source

Reliability Measures

StatsBase.cronbachalphaFunction
cronbachalpha(covmatrix::AbstractMatrix{<:Real})

Calculate Cronbach's alpha (1951) from a covariance matrix covmatrix according to the formula:

\[\rho = \frac{k}{k-1} (1 - \frac{\sum^k_{i=1} \sigma^2_i}{\sum_{i=1}^k \sum_{j=1}^k \sigma_{ij}})\]

where $k$ is the number of items, i.e. columns, $\sigma_i^2$ the item variance, and $\sigma_{ij}$ the inter-item covariance.

Returns a CronbachAlpha object that holds:

  • alpha: the Cronbach's alpha score for all items, i.e. columns, in covmatrix; and
  • dropped: a vector giving Cronbach's alpha scores if a specific item, i.e. column, is dropped from covmatrix.
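The formula can be checked by hand on a small covariance matrix. The sketch below evaluates it in plain Julia without calling cronbachalpha; the helper name alpha_sketch and the 3×3 matrix are assumptions made for this illustration.

```julia
# Hypothetical covariance matrix: item variances 2, inter-item covariances 1.
covmatrix = [2.0 1.0 1.0;
             1.0 2.0 1.0;
             1.0 1.0 2.0]

# k / (k - 1) * (1 - sum of item variances / sum of all covariance entries)
function alpha_sketch(C)
    k = size(C, 1)
    item_var = sum(C[i, i] for i in 1:k)
    return k / (k - 1) * (1 - item_var / sum(C))
end

alpha = alpha_sketch(covmatrix)

# Alpha recomputed after dropping each item (row and column) in turn.
dropped = [alpha_sketch(covmatrix[setdiff(1:3, i), setdiff(1:3, i)]) for i in 1:3]
```

With these numbers alpha is 0.75, and dropping any single item lowers it to 2/3 because the remaining two items carry less shared covariance.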

Example

julia> using StatsBase
 
 julia> cov_X = [10 6 6 6;
                 6 11 6 6;
@@ -45,4 +45,4 @@
 item 1: 0.7500
 item 2: 0.7606
 item 3: 0.7714
-item 4: 0.7826
source
+item 4: 0.7826
source
diff --git a/dev/search/index.html b/dev/search/index.html index 7ccd66b0..b694a71e 100644 --- a/dev/search/index.html +++ b/dev/search/index.html @@ -1,2 +1,2 @@ -Search · StatsBase.jl

Loading search...

    +Search · StatsBase.jl

    Loading search...

      diff --git a/dev/signalcorr/index.html b/dev/signalcorr/index.html index 7673bc1d..168db370 100644 --- a/dev/signalcorr/index.html +++ b/dev/signalcorr/index.html @@ -1,2 +1,2 @@ -Correlation Analysis of Signals · StatsBase.jl

      Correlation Analysis of Signals

      The package provides functions to perform correlation analysis of sequential signals.

      Autocovariance and Autocorrelation

      StatsBase.autocovFunction
      autocov(x, [lags]; demean=true)

      Compute the autocovariance of a vector or matrix x, optionally specifying the lags at which to compute the autocovariance. demean denotes whether the mean of x should be subtracted from x before computing the autocovariance.

      If x is a vector, return a vector of the same length as lags. If x is a matrix, return a matrix of size (length(lags), size(x,2)), where each column in the result corresponds to a column in x.

      When left unspecified, the lags used are the integers from 0 to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is not normalized. See autocor for a function with normalization.

      source
      StatsBase.autocov!Function
      autocov!(r, x, lags; demean=true)

      Compute the autocovariance of a vector or matrix x at lags and store the result in r. demean denotes whether the mean of x should be subtracted from x before computing the autocovariance.

      If x is a vector, r must be a vector of the same length as lags. If x is a matrix, r must be a matrix of size (length(lags), size(x,2)), where each column of r corresponds to a column of x.

      The output is not normalized. See autocor! for a method with normalization.

      source
      StatsBase.autocorFunction
      autocor(x, [lags]; demean=true)

      Compute the autocorrelation function (ACF) of a vector or matrix x, optionally specifying the lags. demean denotes whether the mean of x should be subtracted from x before computing the ACF.

      If x is a vector, return a vector of the same length as lags. If x is a matrix, return a matrix of size (length(lags), size(x,2)), where each column in the result corresponds to a column in x.

      When left unspecified, the lags used are the integers from 0 to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is normalized by the variance of x, i.e. so that the lag 0 autocorrelation is 1. See autocov for the unnormalized form.

      source
      StatsBase.autocor!Function
      autocor!(r, x, lags; demean=true)

      Compute the autocorrelation function (ACF) of a vector or matrix x at lags and store the result in r. demean denotes whether the mean of x should be subtracted from x before computing the ACF.

      If x is a vector, r must be a vector of the same length as lags. If x is a matrix, r must be a matrix of size (length(lags), size(x,2)), where each column of r corresponds to a column of x.

      The output is normalized by the variance of x, i.e. so that the lag 0 autocorrelation is 1. See autocov! for the unnormalized form.

      source

      Cross-covariance and Cross-correlation

      StatsBase.crosscovFunction
      crosscov(x, y, [lags]; demean=true)

      Compute the cross covariance function (CCF) between real-valued vectors or matrices x and y, optionally specifying the lags. demean specifies whether the respective means of x and y should be subtracted from them before computing their CCF.

      If both x and y are vectors, return a vector of the same length as lags. Otherwise, compute cross covariances between each pair of columns in x and y.

      When left unspecified, the lags used are the integers from -min(size(x,1)-1, 10*log10(size(x,1))) to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is not normalized. See crosscor for a function with normalization.

      source
      StatsBase.crosscov!Function
      crosscov!(r, x, y, lags; demean=true)

      Compute the cross covariance function (CCF) between real-valued vectors or matrices x and y at lags and store the result in r. demean specifies whether the respective means of x and y should be subtracted from them before computing their CCF.

      If both x and y are vectors, r must be a vector of the same length as lags. If x is a matrix and y is a vector, r must be a matrix of size (length(lags), size(x, 2)); if x is a vector and y is a matrix, r must be a matrix of size (length(lags), size(y, 2)). If both x and y are matrices, r must be a three-dimensional array of size (length(lags), size(x, 2), size(y, 2)).

      The output is not normalized. See crosscor! for a function with normalization.

      source
      StatsBase.crosscorFunction
      crosscor(x, y, [lags]; demean=true)

      Compute the cross correlation between real-valued vectors or matrices x and y, optionally specifying the lags. demean specifies whether the respective means of x and y should be subtracted from them before computing their cross correlation.

      If both x and y are vectors, return a vector of the same length as lags. Otherwise, compute cross correlations between each pair of columns in x and y.

      When left unspecified, the lags used are the integers from -min(size(x,1)-1, 10*log10(size(x,1))) to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is normalized by sqrt(var(x)*var(y)). See crosscov for the unnormalized form.

      source
      StatsBase.crosscor!Function
      crosscor!(r, x, y, lags; demean=true)

      Compute the cross correlation between real-valued vectors or matrices x and y at lags and store the result in r. demean specifies whether the respective means of x and y should be subtracted from them before computing their cross correlation.

      If both x and y are vectors, r must be a vector of the same length as lags. If x is a matrix and y is a vector, r must be a matrix of size (length(lags), size(x, 2)); if x is a vector and y is a matrix, r must be a matrix of size (length(lags), size(y, 2)). If both x and y are matrices, r must be a three-dimensional array of size (length(lags), size(x, 2), size(y, 2)).

      The output is normalized by sqrt(var(x)*var(y)). See crosscov! for the unnormalized form.

      source

      Partial Autocorrelation Function

      StatsBase.pacfFunction
      pacf(X, lags; method=:regression)

      Compute the partial autocorrelation function (PACF) of a real-valued vector or matrix X at lags. method designates the estimation method. Recognized values are :regression, which computes the partial autocorrelations via successive regression models, and :yulewalker, which computes the partial autocorrelations using the Yule-Walker equations.

      If X is a vector, return a vector of the same length as lags. If X is a matrix, return a matrix of size (length(lags), size(X, 2)), where each column in the result corresponds to a column in X.

      source
      StatsBase.pacf!Function
      pacf!(r, X, lags; method=:regression)

      Compute the partial autocorrelation function (PACF) of a matrix X at lags and store the result in r. method designates the estimation method. Recognized values are :regression, which computes the partial autocorrelations via successive regression models, and :yulewalker, which computes the partial autocorrelations using the Yule-Walker equations.

      r must be a matrix of size (length(lags), size(X, 2)).

      source
      +Correlation Analysis of Signals · StatsBase.jl

      Correlation Analysis of Signals

      The package provides functions to perform correlation analysis of sequential signals.

      Autocovariance and Autocorrelation

      StatsBase.autocovFunction
      autocov(x, [lags]; demean=true)

      Compute the autocovariance of a vector or matrix x, optionally specifying the lags at which to compute the autocovariance. demean denotes whether the mean of x should be subtracted from x before computing the autocovariance.

      If x is a vector, return a vector of the same length as lags. If x is a matrix, return a matrix of size (length(lags), size(x,2)), where each column in the result corresponds to a column in x.

      When left unspecified, the lags used are the integers from 0 to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is not normalized. See autocor for a function with normalization.

      source
      StatsBase.autocov!Function
      autocov!(r, x, lags; demean=true)

      Compute the autocovariance of a vector or matrix x at lags and store the result in r. demean denotes whether the mean of x should be subtracted from x before computing the autocovariance.

      If x is a vector, r must be a vector of the same length as lags. If x is a matrix, r must be a matrix of size (length(lags), size(x,2)), where each column of r corresponds to a column of x.

      The output is not normalized. See autocor! for a method with normalization.

      source
      StatsBase.autocorFunction
      autocor(x, [lags]; demean=true)

      Compute the autocorrelation function (ACF) of a vector or matrix x, optionally specifying the lags. demean denotes whether the mean of x should be subtracted from x before computing the ACF.

      If x is a vector, return a vector of the same length as lags. If x is a matrix, return a matrix of size (length(lags), size(x,2)), where each column in the result corresponds to a column in x.

      When left unspecified, the lags used are the integers from 0 to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is normalized by the variance of x, i.e. so that the lag 0 autocorrelation is 1. See autocov for the unnormalized form.

      source
      StatsBase.autocor!Function
      autocor!(r, x, lags; demean=true)

      Compute the autocorrelation function (ACF) of a vector or matrix x at lags and store the result in r. demean denotes whether the mean of x should be subtracted from x before computing the ACF.

      If x is a vector, r must be a vector of the same length as lags. If x is a matrix, r must be a matrix of size (length(lags), size(x,2)), where each column of r corresponds to a column of x.

      The output is normalized by the variance of x, i.e. so that the lag 0 autocorrelation is 1. See autocov! for the unnormalized form.

      source
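The relationship between the autocovariance and the autocorrelation (the latter normalized so that lag 0 equals 1) can be sketched in plain Julia. This is not the library's code: the helper autocov_sketch is hypothetical, and dividing the lagged sum of products by the full series length is an assumption of this sketch.

```julia
x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical signal
n = length(x)

# demean=true: subtract the mean before forming lagged products.
z = x .- sum(x) / n

# Lag-l autocovariance: products of the series with itself shifted by l,
# divided by the series length (assumed scaling).
autocov_sketch(z, l) = sum(z[1:end-l] .* z[1+l:end]) / length(z)

lags = 0:2
acov = [autocov_sketch(z, l) for l in lags]

# Dividing by the lag-0 value gives the autocorrelation.
acor = acov ./ acov[1]
```

For this signal the sketch gives autocovariances 2.0, 0.8, and -0.2, hence autocorrelations 1.0, 0.4, and -0.1.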

      Cross-covariance and Cross-correlation

      StatsBase.crosscovFunction
      crosscov(x, y, [lags]; demean=true)

      Compute the cross covariance function (CCF) between real-valued vectors or matrices x and y, optionally specifying the lags. demean specifies whether the respective means of x and y should be subtracted from them before computing their CCF.

      If both x and y are vectors, return a vector of the same length as lags. Otherwise, compute cross covariances between each pair of columns in x and y.

      When left unspecified, the lags used are the integers from -min(size(x,1)-1, 10*log10(size(x,1))) to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is not normalized. See crosscor for a function with normalization.

      source
      StatsBase.crosscov!Function
      crosscov!(r, x, y, lags; demean=true)

      Compute the cross covariance function (CCF) between real-valued vectors or matrices x and y at lags and store the result in r. demean specifies whether the respective means of x and y should be subtracted from them before computing their CCF.

      If both x and y are vectors, r must be a vector of the same length as lags. If x is a matrix and y is a vector, r must be a matrix of size (length(lags), size(x, 2)); if x is a vector and y is a matrix, r must be a matrix of size (length(lags), size(y, 2)). If both x and y are matrices, r must be a three-dimensional array of size (length(lags), size(x, 2), size(y, 2)).

      The output is not normalized. See crosscor! for a function with normalization.

      source
      StatsBase.crosscorFunction
      crosscor(x, y, [lags]; demean=true)

      Compute the cross correlation between real-valued vectors or matrices x and y, optionally specifying the lags. demean specifies whether the respective means of x and y should be subtracted from them before computing their cross correlation.

      If both x and y are vectors, return a vector of the same length as lags. Otherwise, compute cross correlations between each pair of columns in x and y.

      When left unspecified, the lags used are the integers from -min(size(x,1)-1, 10*log10(size(x,1))) to min(size(x,1)-1, 10*log10(size(x,1))).

      The output is normalized by sqrt(var(x)*var(y)). See crosscov for the unnormalized form.

      source
      StatsBase.crosscor!Function
      crosscor!(r, x, y, lags; demean=true)

      Compute the cross correlation between real-valued vectors or matrices x and y at lags and store the result in r. demean specifies whether the respective means of x and y should be subtracted from them before computing their cross correlation.

      If both x and y are vectors, r must be a vector of the same length as lags. If x is a matrix and y is a vector, r must be a matrix of size (length(lags), size(x, 2)); if x is a vector and y is a matrix, r must be a matrix of size (length(lags), size(y, 2)). If both x and y are matrices, r must be a three-dimensional array of size (length(lags), size(x, 2), size(y, 2)).

      The output is normalized by sqrt(var(x)*var(y)). See crosscov! for the unnormalized form.

      source

      Partial Autocorrelation Function

      StatsBase.pacfFunction
      pacf(X, lags; method=:regression)

      Compute the partial autocorrelation function (PACF) of a real-valued vector or matrix X at lags. method designates the estimation method. Recognized values are :regression, which computes the partial autocorrelations via successive regression models, and :yulewalker, which computes the partial autocorrelations using the Yule-Walker equations.

      If X is a vector, return a vector of the same length as lags. If X is a matrix, return a matrix of size (length(lags), size(X, 2)), where each column in the result corresponds to a column in X.

      source
      StatsBase.pacf!Function
      pacf!(r, X, lags; method=:regression)

      Compute the partial autocorrelation function (PACF) of a matrix X at lags and store the result in r. method designates the estimation method. Recognized values are :regression, which computes the partial autocorrelations via successive regression models, and :yulewalker, which computes the partial autocorrelations using the Yule-Walker equations.

      r must be a matrix of size (length(lags), size(X, 2)).

      source
      diff --git a/dev/statmodels/index.html b/dev/statmodels/index.html index 38abd506..efd9677a 100644 --- a/dev/statmodels/index.html +++ b/dev/statmodels/index.html @@ -4,4 +4,4 @@ adjr²(model::StatisticalModel, variant::Symbol)

      Adjusted pseudo-coefficient of determination (adjusted pseudo R-squared). For nonlinear models, one of the several pseudo R² definitions must be chosen via variant. The only currently supported variants are :MacFadden, defined as $1 - (\log (L) - k)/\log (L_0)$, and :devianceratio, defined as $1 - (D/(n-k))/(D_0/(n-1))$. In these formulas, $L$ is the likelihood of the model, $L_0$ that of the null model (the model including only the intercept), $D$ is the deviance of the model, $D_0$ is the deviance of the null model, $n$ is the number of observations (given by nobs) and $k$ is the number of consumed degrees of freedom of the model (as returned by dof).

      StatsAPI.aicFunction
      aic(model::StatisticalModel)

      Akaike's Information Criterion, defined as $-2 \log L + 2k$, with $L$ the likelihood of the model, and $k$ its number of consumed degrees of freedom (as returned by dof).

      StatsAPI.aiccFunction
      aicc(model::StatisticalModel)

      Corrected Akaike's Information Criterion for small sample sizes (Hurvich and Tsai 1989), defined as $-2 \log L + 2k + 2k(k+1)/(n-k-1)$, with $L$ the likelihood of the model, $k$ its number of consumed degrees of freedom (as returned by dof), and $n$ the number of observations (as returned by nobs).

      StatsAPI.bicFunction
      bic(model::StatisticalModel)

      Bayesian Information Criterion, defined as $-2 \log L + k \log n$, with $L$ the likelihood of the model, $k$ its number of consumed degrees of freedom (as returned by dof), and $n$ the number of observations (as returned by nobs).
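To make the AIC and BIC definitions concrete, the sketch below evaluates $-2 \log L + 2k$ and $-2 \log L + k \log n$ from raw numbers instead of a fitted model. The log-likelihood, degrees of freedom, and sample size are invented for the illustration; with a real model they would come from loglikelihood, dof, and nobs.

```julia
# Hypothetical fit summary (not taken from any real model).
loglik = -120.5    # log L
k = 3              # consumed degrees of freedom
n = 50             # number of observations

aic_value = -2 * loglik + 2 * k          # Akaike's Information Criterion
bic_value = -2 * loglik + k * log(n)     # Bayesian Information Criterion
```

Here aic_value is 247.0. Since log(50) exceeds 2, the BIC penalizes each parameter more heavily than the AIC for this sample size.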

      StatsAPI.coefFunction
      coef(model::StatisticalModel)

      Return the coefficients of the model.

      StatsAPI.coefnamesFunction
      coefnames(model::StatisticalModel)

      Return the names of the coefficients.

      StatsAPI.coeftableFunction
      coeftable(model::StatisticalModel; level::Real=0.95)

      Return a table with coefficients and related statistics of the model. level determines the level for confidence intervals (by default, 95%).

      The returned CoefTable object implements the Tables.jl interface, and can be converted e.g. to a DataFrame via using DataFrames; DataFrame(coeftable(model)).

      StatsAPI.confintFunction
      confint(model::StatisticalModel; level::Real=0.95)

      Compute confidence intervals for coefficients, with confidence level level (by default 95%).

      StatsAPI.devianceFunction
      deviance(model::StatisticalModel)

      Return the deviance of the model relative to a reference, which is usually, when applicable, the saturated model. It is equal, up to a constant, to $-2 \log L$, with $L$ the likelihood of the model.

      StatsAPI.dofFunction
      dof(model::StatisticalModel)

      Return the number of degrees of freedom consumed in the model, including when applicable the intercept and the distribution's dispersion parameter.

      StatsAPI.fitFunction

      Fit a statistical model.

      StatsAPI.fit!Function

      Fit a statistical model in-place.

      StatsAPI.informationmatrixFunction
      informationmatrix(model::StatisticalModel; expected::Bool = true)

      Return the information matrix of the model. By default the Fisher information matrix is returned, while the observed information matrix can be requested with expected = false.

      StatsAPI.isfittedFunction
      isfitted(model::StatisticalModel)

      Indicate whether the model has been fitted.

      StatsAPI.islinearFunction
      islinear(model::StatisticalModel)

      Indicate whether the model is linear.

      StatsAPI.loglikelihoodFunction
      loglikelihood(model::StatisticalModel)
       loglikelihood(model::StatisticalModel, observation)

      Return the log-likelihood of the model.

      With an observation argument, return the contribution of observation to the log-likelihood of model.

      If observation is a Colon, return a vector of each observation's contribution to the log-likelihood of the model. In other words, this is the vector of the pointwise log-likelihood contributions.

      In general, sum(loglikelihood(model, :)) == loglikelihood(model).

      StatsAPI.mssFunction
      mss(model::StatisticalModel)

      Return the model sum of squares.

      StatsAPI.nobsFunction
      nobs(model::StatisticalModel)

      Return the number of independent observations on which the model was fitted. Be careful when using this information, as the definition of an independent observation may vary depending on the model, on the format used to pass the data, on the sampling plan (if specified), etc.

      StatsAPI.nulldevianceFunction
      nulldeviance(model::StatisticalModel)

      Return the deviance of the null model, obtained by dropping all independent variables present in model.

      If model includes an intercept, the null model is the one with only the intercept; otherwise, it is the one without any predictor (not even the intercept).

      StatsAPI.nullloglikelihoodFunction
      nullloglikelihood(model::StatisticalModel)

      Return the log-likelihood of the null model, obtained by dropping all independent variables present in model.

      If model includes an intercept, the null model is the one with only the intercept; otherwise, it is the one without any predictor (not even the intercept).

      StatsAPI.r2Function
      r2(model::StatisticalModel)
       r²(model::StatisticalModel)

      Coefficient of determination (R-squared).

      For a linear model, the R² is defined as $ESS/TSS$, with $ESS$ the explained sum of squares and $TSS$ the total sum of squares.

      r2(model::StatisticalModel, variant::Symbol)
      -r²(model::StatisticalModel, variant::Symbol)

      Pseudo-coefficient of determination (pseudo R-squared).

      For nonlinear models, one of several pseudo R² definitions must be chosen via variant. Supported variants are:

      • :MacFadden (a.k.a. likelihood ratio index), defined as $1 - \log (L)/\log (L_0)$;
      • :CoxSnell, defined as $1 - (L_0/L)^{2/n}$;
      • :Nagelkerke, defined as $(1 - (L_0/L)^{2/n})/(1 - L_0^{2/n})$;
      • :devianceratio, defined as $1 - D/D_0$.

      In the above formulas, $L$ is the likelihood of the model, $L_0$ is the likelihood of the null model (the model with only an intercept), $D$ is the deviance of the model (from the saturated model), $D_0$ is the deviance of the null model, $n$ is the number of observations (given by nobs).

      The Cox-Snell and the deviance ratio variants both match the classical definition of R² for linear models.

      StatsAPI.rssFunction
      rss(model::StatisticalModel)

      Return the residual sum of squares of the model.

      StatsAPI.scoreFunction
      score(model::StatisticalModel)

      Return the score of the model, that is the gradient of the log-likelihood with respect to the coefficients.

      StatsAPI.stderrorFunction
      stderror(model::StatisticalModel)

      Return the standard errors for the coefficients of the model.

      StatsAPI.vcovFunction
      vcov(model::StatisticalModel)

      Return the variance-covariance matrix for the coefficients of the model.

      StatsAPI.weightsFunction
      weights(model::StatisticalModel)

      Return the weights used in the model.

      RegressionModel extends StatisticalModel by implementing the following additional methods.

      StatsAPI.crossmodelmatrixFunction
      crossmodelmatrix(model::RegressionModel)

      Return X'X where X is the model matrix of model. This function will return a pre-computed matrix stored in model if possible.

      StatsAPI.dof_residualFunction
      dof_residual(model::RegressionModel)

      Return the residual degrees of freedom of the model.

      StatsAPI.fittedFunction
      fitted(model::RegressionModel)

      Return the fitted values of the model.

      StatsAPI.leverageFunction
      leverage(model::RegressionModel)

      Return the diagonal of the projection matrix of the model.

      StatsAPI.cooksdistanceFunction
      cooksdistance(model::RegressionModel)

      Compute Cook's distance for each observation in linear model model, giving an estimate of the influence of each data point.

      StatsAPI.meanresponseFunction
      meanresponse(model::RegressionModel)

      Return the mean of the response.

      StatsAPI.modelmatrixFunction
      modelmatrix(model::RegressionModel)

      Return the model matrix (a.k.a. the design matrix).

      StatsAPI.responseFunction
      response(model::RegressionModel)

      Return the model response (a.k.a. the dependent variable).

      StatsAPI.responsenameFunction
      responsename(model::RegressionModel)

      Return the name of the model response (a.k.a. the dependent variable).

      StatsAPI.predictFunction
      predict(model::RegressionModel, [newX])

      Form the predicted response of model. An object with new covariate values newX can be supplied, which should have the same type and structure as that used to fit model; e.g. for a GLM it would generally be a DataFrame with the same variable names as the original predictors.

      StatsAPI.predict!Function
      predict!

      In-place version of predict.

      StatsAPI.residualsFunction
      residuals(model::RegressionModel)

      Return the residuals of the model.

      An exception type is provided to signal convergence failures during model estimation:

      StatsBase.ConvergenceExceptionType
      ConvergenceException(iters::Int, lastchange::Real=NaN, tol::Real=NaN)

      The fitting procedure failed to converge in iters number of iterations, i.e. the lastchange between the cost of the final and penultimate iteration was greater than specified tolerance tol.

      source
      +r²(model::StatisticalModel, variant::Symbol)

      Pseudo-coefficient of determination (pseudo R-squared).

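      For nonlinear models, one of several pseudo R² definitions must be chosen via variant. Supported variants are:

      • :MacFadden (a.k.a. likelihood ratio index), defined as $1 - \log (L)/\log (L_0)$;
      • :CoxSnell, defined as $1 - (L_0/L)^{2/n}$;
      • :Nagelkerke, defined as $(1 - (L_0/L)^{2/n})/(1 - L_0^{2/n})$;
      • :devianceratio, defined as $1 - D/D_0$.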

      In the above formulas, $L$ is the likelihood of the model, $L_0$ is the likelihood of the null model (the model with only an intercept), $D$ is the deviance of the model (from the saturated model), $D_0$ is the deviance of the null model, $n$ is the number of observations (given by nobs).

      The Cox-Snell and the deviance ratio variants both match the classical definition of R² for linear models.
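
      For reference, the standard definitions of these pseudo R² measures, written with the symbols defined above, can be sketched as follows. The subscript labels are descriptive names chosen here and are an assumption; they are not necessarily the exact Symbol values accepted by variant.

```latex
\begin{aligned}
R^2_{\text{McFadden}} &= 1 - \frac{\log L}{\log L_0} \\
R^2_{\text{Cox-Snell}} &= 1 - \left(\frac{L_0}{L}\right)^{2/n} \\
R^2_{\text{Nagelkerke}} &= \frac{1 - (L_0/L)^{2/n}}{1 - L_0^{2/n}} \\
R^2_{\text{deviance ratio}} &= 1 - \frac{D}{D_0}
\end{aligned}
```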

      StatsAPI.rssFunction
      rss(model::StatisticalModel)

      Return the residual sum of squares of the model.

      StatsAPI.scoreFunction
      score(model::StatisticalModel)

      Return the score of the model, that is the gradient of the log-likelihood with respect to the coefficients.

      StatsAPI.stderrorFunction
      stderror(model::StatisticalModel)

      Return the standard errors for the coefficients of the model.

      StatsAPI.vcovFunction
      vcov(model::StatisticalModel)

      Return the variance-covariance matrix for the coefficients of the model.

      StatsAPI.weightsFunction
      weights(model::StatisticalModel)

      Return the weights used in the model.

      RegressionModel extends StatisticalModel by implementing the following additional methods.

      StatsAPI.crossmodelmatrixFunction
      crossmodelmatrix(model::RegressionModel)

      Return X'X where X is the model matrix of model. This function will return a pre-computed matrix stored in model if possible.

      StatsAPI.dof_residualFunction
      dof_residual(model::RegressionModel)

      Return the residual degrees of freedom of the model.

      StatsAPI.fittedFunction
      fitted(model::RegressionModel)

      Return the fitted values of the model.

      StatsAPI.leverageFunction
      leverage(model::RegressionModel)

      Return the diagonal of the projection matrix of the model.
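
      As an illustration of what crossmodelmatrix and leverage compute, the sketch below works directly on a small model matrix with LinearAlgebra rather than on a fitted model. It assumes an ordinary least squares fit with a full-rank X; the matrix X and the variable names are made up for the example.

```julia
using LinearAlgebra

# Hypothetical model matrix: an intercept column plus one covariate.
X = [1.0 0.0;
     1.0 1.0;
     1.0 2.0]

XtX = X' * X            # the quantity X'X described for crossmodelmatrix
H = X * (XtX \ X')      # projection ("hat") matrix X (X'X)^{-1} X'
h = diag(H)             # diagonal of the projection matrix, i.e. the leverages

# For a full-rank X the leverages sum to the number of columns of X.
println(h)
println(sum(h))
```

      A fitted RegressionModel may store or factorize these quantities differently; the sketch only shows the underlying linear algebra.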

      StatsAPI.cooksdistanceFunction
      cooksdistance(model::RegressionModel)

      Compute Cook's distance for each observation in the linear model model, giving an estimate of the influence of each data point.

      StatsAPI.meanresponseFunction
      meanresponse(model::RegressionModel)

      Return the mean of the response.

      StatsAPI.modelmatrixFunction
      modelmatrix(model::RegressionModel)

      Return the model matrix (a.k.a. the design matrix).

      StatsAPI.responseFunction
      response(model::RegressionModel)

      Return the model response (a.k.a. the dependent variable).

      StatsAPI.responsenameFunction
      responsename(model::RegressionModel)

      Return the name of the model response (a.k.a. the dependent variable).

      StatsAPI.predictFunction
      predict(model::RegressionModel, [newX])

      Form the predicted response of model. An object with new covariate values newX can be supplied, which should have the same type and structure as that used to fit model; e.g. for a GLM it would generally be a DataFrame with the same variable names as the original predictors.

      StatsAPI.predict!Function
      predict!

      In-place version of predict.

      StatsAPI.residualsFunction
      residuals(model::RegressionModel)

      Return the residuals of the model.

      An exception type is provided to signal convergence failures during model estimation:

      StatsBase.ConvergenceExceptionType
      ConvergenceException(iters::Int, lastchange::Real=NaN, tol::Real=NaN)

      The fitting procedure failed to converge within iters iterations, i.e. the change lastchange in the cost between the penultimate and final iterations was greater than the specified tolerance tol.

      source
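
      A minimal sketch of how a fitting routine might signal non-convergence with this exception. The fixedpoint function is hypothetical and exists only for illustration; it is not part of StatsBase. Only the ConvergenceException constructor shown above is used.

```julia
using StatsBase: ConvergenceException

# Hypothetical fixed-point solver: iterate x = g(x) until the change is at most tol.
function fixedpoint(g, x0::Float64; maxiter::Int=100, tol::Float64=1e-8)
    x = x0
    lastchange = NaN
    for _ in 1:maxiter
        xnew = g(x)
        lastchange = abs(xnew - x)
        x = xnew
        lastchange <= tol && return x
    end
    # Reached only when the last change is still above the tolerance.
    throw(ConvergenceException(maxiter, lastchange, tol))
end

root = fixedpoint(cos, 1.0)     # converges to the fixed point of cos

# A diverging iteration triggers the exception.
err = try
    fixedpoint(x -> 2x, 1.0; maxiter=5)
catch e
    e
end
```

      The caught exception carries iters, lastchange and tol, so a caller can report how far the procedure was from convergence.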
      diff --git a/dev/transformations/index.html b/dev/transformations/index.html
      index 455d695a..20a69aa3 100644
      --- a/dev/transformations/index.html
      +++ b/dev/transformations/index.html
      @@ -12,7 +12,7 @@
       julia> StatsBase.transform(dt, X)
       2×3 Matrix{Float64}:
         0.0  -1.0  1.0
      - -1.0   0.0  1.0
      source

      Unit Range Normalization

      Unit range normalization, also known as min-max scaling, is an alternative data transformation which scales features to lie in the interval [0; 1].

      Unit range normalization can be performed using t = fit(UnitRangeTransform, ...) followed by StatsBase.transform(t, ...) or StatsBase.transform!(t, ...). standardize(UnitRangeTransform, ...) is a shorthand to perform both operations in a single call.

      StatsAPI.fitMethod
      fit(UnitRangeTransform, X; dims=nothing, unit=true)

      Fit scaling parameters to a vector or matrix X and return a UnitRangeTransform transformation object.

      Keyword arguments

      • dims: if 1, fit standardization parameters column-wise; if 2, fit them row-wise. The default is nothing.

      • unit: if true (the default), shift the data so that the minimum is zero.

      Examples

      julia> using StatsBase
      + -1.0   0.0  1.0
      source

      Unit Range Normalization

      Unit range normalization, also known as min-max scaling, is an alternative data transformation which scales features to lie in the interval [0; 1].

      Unit range normalization can be performed using t = fit(UnitRangeTransform, ...) followed by StatsBase.transform(t, ...) or StatsBase.transform!(t, ...). standardize(UnitRangeTransform, ...) is a shorthand to perform both operations in a single call.

      StatsAPI.fitMethod
      fit(UnitRangeTransform, X; dims=nothing, unit=true)

      Fit scaling parameters to a vector or matrix X and return a UnitRangeTransform transformation object.

      Keyword arguments

      • dims: if 1, fit standardization parameters column-wise; if 2, fit them row-wise. The default is nothing.

      • unit: if true (the default), shift the data so that the minimum is zero.

      Examples

      julia> using StatsBase
       
       julia> X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
       2×3 Matrix{Float64}:
      @@ -25,7 +25,7 @@
       julia> StatsBase.transform(dt, X)
       2×3 Matrix{Float64}:
        0.5  0.0  1.0
      - 0.0  0.5  1.0
      source

      Methods

      StatsBase.transformFunction
      transform(t::AbstractDataTransform, x)

      Return a standardized copy of vector or matrix x using transformation t.

      source
      StatsBase.transform!Function
      transform!(t::AbstractDataTransform, x)

      Apply transformation t to vector or matrix x in place.

      source
      StatsBase.reconstructFunction
      reconstruct(t::AbstractDataTransform, y)

      Return a reconstruction of the data on its original scale from a transformed vector or matrix y using transformation t.

      source
      StatsBase.reconstruct!Function
      reconstruct!(t::AbstractDataTransform, y)

      Reconstruct the data on its original scale in place from a transformed vector or matrix y using transformation t.

      source
      StatsBase.standardizeFunction
      standardize(DT, X; dims=nothing, kwargs...)

      Return a standardized copy of vector or matrix X along dimensions dims using transformation DT which is a subtype of AbstractDataTransform:

      • ZScoreTransform
      • UnitRangeTransform

      Example

      julia> using StatsBase
      + 0.0  0.5  1.0
      source

      Methods

      StatsBase.transformFunction
      transform(t::AbstractDataTransform, x)

      Return a standardized copy of vector or matrix x using transformation t.

      source
      StatsBase.transform!Function
      transform!(t::AbstractDataTransform, x)

      Apply transformation t to vector or matrix x in place.

      source
      StatsBase.reconstructFunction
      reconstruct(t::AbstractDataTransform, y)

      Return a reconstruction of the data on its original scale from a transformed vector or matrix y using transformation t.

      source
      StatsBase.reconstruct!Function
      reconstruct!(t::AbstractDataTransform, y)

      Reconstruct the data on its original scale in place from a transformed vector or matrix y using transformation t.

      source
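
      The methods above can be combined into a round trip: fit a transformation, apply it, then map the result back to the original scale. The sketch below reuses the matrix from the UnitRangeTransform example; the variable names are chosen for illustration.

```julia
using StatsBase

X = [0.0 -0.5 0.5; 0.0 1.0 2.0]

# Fit one (min, scale) pair per row, then scale each row to the unit range.
t = fit(UnitRangeTransform, X, dims=2)
Y = StatsBase.transform(t, X)

# Map the transformed data back to the original scale.
Xr = StatsBase.reconstruct(t, Y)

println(Y)
println(Xr ≈ X)
```

      The calls are qualified with StatsBase. to avoid clashes with other packages that export a function named transform.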
      StatsBase.standardizeFunction
      standardize(DT, X; dims=nothing, kwargs...)

      Return a standardized copy of vector or matrix X along dimensions dims using transformation DT which is a subtype of AbstractDataTransform:

      • ZScoreTransform
      • UnitRangeTransform

      Example

      julia> using StatsBase
       
       julia> standardize(ZScoreTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0], dims=2)
       2×3 Matrix{Float64}:
      @@ -35,4 +35,4 @@
       julia> standardize(UnitRangeTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0], dims=2)
       2×3 Matrix{Float64}:
        0.5  0.0  1.0
      - 0.0  0.5  1.0
      source

      Types

      StatsBase.UnitRangeTransformType

      Unit range normalization

      source
      StatsBase.ZScoreTransformType

      Standardization (Z-score transformation)

      source
      + 0.0  0.5  1.0
      source

      Types

      StatsBase.UnitRangeTransformType

      Unit range normalization

      source
      StatsBase.ZScoreTransformType

      Standardization (Z-score transformation)

      source
      diff --git a/dev/weights/index.html b/dev/weights/index.html
      index 29f500f5..a5f13eeb 100644
      --- a/dev/weights/index.html
      +++ b/dev/weights/index.html
      @@ -40,7 +40,7 @@
       length
       isempty
       values
      -sum

      The following constructors are provided:

      StatsBase.AnalyticWeightsType
      AnalyticWeights(vs, wsum=sum(vs))

      Construct an AnalyticWeights vector with weight values vs. A precomputed sum may be provided as wsum.

      Analytic weights describe a non-random relative importance (usually between 0 and 1) for each observation. These weights may also be referred to as reliability weights, precision weights or inverse variance weights. These are typically used when the observations being weighted are aggregate values (e.g., averages) with differing variances.

      source
      StatsBase.FrequencyWeightsType
      FrequencyWeights(vs, wsum=sum(vs))

      Construct a FrequencyWeights vector with weight values vs. A precomputed sum may be provided as wsum.

      Frequency weights describe the number of times (or frequency) each observation was observed. These weights may also be referred to as case weights or repeat weights.

      source
      StatsBase.ProbabilityWeightsType
      ProbabilityWeights(vs, wsum=sum(vs))

      Construct a ProbabilityWeights vector with weight values vs. A precomputed sum may be provided as wsum.

      Probability weights represent the inverse of the sampling probability for each observation, providing a correction mechanism for under- or over-sampling certain population groups. These weights may also be referred to as sampling weights.

      source
      StatsBase.UnitWeightsType
      UnitWeights{T}(s)

      Construct a UnitWeights vector with length s and weight elements of type T. All weight elements are identically one.

      source
      StatsBase.WeightsType
      Weights(vs, wsum=sum(vs))

      Construct a Weights vector with weight values vs. A precomputed sum may be provided as wsum.

      The Weights type describes a generic weights vector which does not support all operations possible for FrequencyWeights, AnalyticWeights and ProbabilityWeights.

      source
      StatsBase.aweightsFunction
      aweights(vs)

      Construct an AnalyticWeights vector from array vs. See the documentation for AnalyticWeights for more details.

      source
      StatsBase.fweightsFunction
      fweights(vs)

      Construct a FrequencyWeights vector from a given array. See the documentation for FrequencyWeights for more details.

      source
      StatsBase.pweightsFunction
      pweights(vs)

      Construct a ProbabilityWeights vector from a given array. See the documentation for ProbabilityWeights for more details.

      source
      StatsBase.eweightsFunction
      eweights(t::AbstractArray{<:Integer}, λ::Real; scale=false)
      +sum

      The following constructors are provided:

      StatsBase.AnalyticWeightsType
      AnalyticWeights(vs, wsum=sum(vs))

      Construct an AnalyticWeights vector with weight values vs. A precomputed sum may be provided as wsum.

      Analytic weights describe a non-random relative importance (usually between 0 and 1) for each observation. These weights may also be referred to as reliability weights, precision weights or inverse variance weights. These are typically used when the observations being weighted are aggregate values (e.g., averages) with differing variances.

      source
      StatsBase.FrequencyWeightsType
      FrequencyWeights(vs, wsum=sum(vs))

      Construct a FrequencyWeights vector with weight values vs. A precomputed sum may be provided as wsum.

      Frequency weights describe the number of times (or frequency) each observation was observed. These weights may also be referred to as case weights or repeat weights.

      source
      StatsBase.ProbabilityWeightsType
      ProbabilityWeights(vs, wsum=sum(vs))

      Construct a ProbabilityWeights vector with weight values vs. A precomputed sum may be provided as wsum.

      Probability weights represent the inverse of the sampling probability for each observation, providing a correction mechanism for under- or over-sampling certain population groups. These weights may also be referred to as sampling weights.

      source
      StatsBase.UnitWeightsType
      UnitWeights{T}(s)

      Construct a UnitWeights vector with length s and weight elements of type T. All weight elements are identically one.

      source
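
      A short sketch of constructing some of these weight vectors and passing them to a weighted mean. The data and weight values are made up for illustration, and Statistics is loaded explicitly so that mean is available regardless of what StatsBase re-exports.

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 3.0]

# Frequency weights: the first observation was seen twice.
fw = FrequencyWeights([2, 1, 1])
m_fw = mean(x, fw)              # (2*1 + 1*2 + 1*3) / 4

# Analytic weights expressing relative importance.
aw = AnalyticWeights([0.5, 0.25, 0.25])
m_aw = mean(x, aw)

# Unit weights: every observation counts once.
uw = UnitWeights{Float64}(3)
m_uw = mean(x, uw)

println((m_fw, m_aw, m_uw, sum(fw)))
```

      The weighted mean depends only on the relative weights, so the frequency and analytic weights above give the same result; the weight types are expected to differ in operations such as bias-corrected variance.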
      StatsBase.eweightsFunction
      eweights(t::AbstractArray{<:Integer}, λ::Real; scale=false)
       eweights(t::AbstractVector{T}, r::StepRange{T}, λ::Real; scale=false) where T
       eweights(n::Integer, λ::Real; scale=false)

      Construct a Weights vector whose weights decay exponentially for past observations, so that larger integer values i in t (more recent observations) receive larger weights. The integer value n represents the number of past observations to consider. n defaults to maximum(t) - minimum(t) + 1 if only t is passed in and the elements are integers, and to length(r) if a superset range r is also passed in. If n is explicitly passed instead of t, t defaults to 1:n.

      If scale is true then for each element i in t the weight value is computed as:

      $(1 - λ)^{n - i}$

      If scale is false then each value is computed as:

      $λ (1 - λ)^{1 - i}$

      Arguments

      • t::AbstractVector: temporal indices or timestamps
      • r::StepRange: a larger range to use when constructing weights from a subset of timestamps
      • n::Integer: the number of past events to consider
      • λ::Real: a smoothing factor or rate parameter such that $0 < λ ≤ 1$. As this value approaches 0, the resulting weights will be almost equal, while values closer to 1 will put greater weight on the tail elements of the vector.

      Keyword arguments

      • scale::Bool: whether to scale the weights to lie between 0 and 1 (default: false)
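
      The two formulas above can be evaluated directly in plain Julia, without calling eweights. The sketch below assumes t = 1:n and uses illustrative values for λ and n; it shows that the scaled and unscaled weights differ only by a constant factor.

```julia
λ = 0.3
n = 10
t = 1:n

# scale = true: (1 - λ)^(n - i), so the most recent index gets weight 1.
scaled = [(1 - λ)^(n - i) for i in t]

# scale = false: λ (1 - λ)^(1 - i).
unscaled = [λ * (1 - λ)^(1 - i) for i in t]

# The ratio is the same for every i: λ (1 - λ)^(1 - n).
ratio = unscaled ./ scaled

println(scaled[end-1:end])
println(ratio[1] ≈ ratio[end])
```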

      Examples

      julia> eweights(1:10, 0.3; scale=true)
       10-element Weights{Float64,Float64,Array{Float64,1}}:
      @@ -53,7 +53,7 @@
        0.3429999999999999
        0.48999999999999994
        0.7
      - 1.0

      Links

      • https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
      • https://en.wikipedia.org/wiki/Exponential_smoothing
      source
      StatsBase.uweightsFunction
      uweights(s::Integer)
      + 1.0

      Links

      • https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
      • https://en.wikipedia.org/wiki/Exponential_smoothing
      source
      StatsBase.uweightsFunction
      uweights(s::Integer)
       uweights(::Type{T}, s::Integer) where T<:Real

      Construct a UnitWeights vector with length s and weight elements of type T. All weight elements are identically one.

      Examples

      julia> uweights(3)
       3-element UnitWeights{Int64}:
        1
      @@ -64,4 +64,4 @@
       3-element UnitWeights{Float64}:
        1.0
        1.0
      - 1.0
      source
      StatsAPI.weightsMethod
      weights(vs::AbstractArray{<:Real})

      Construct a Weights vector from array vs. See the documentation for Weights for more details.

      source
      + 1.0
      source
      StatsAPI.weightsMethod
      weights(vs::AbstractArray{<:Real})

      Construct a Weights vector from array vs. See the documentation for Weights for more details.

      source