simple-statistics-docs 7.8.7

min

The min is the lowest number in the array. This runs in O(n), linear time, with respect to the length of the array.

min(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more data points

Returns

number: minimum value

Throws

Error: if the length of x is less than one

Example

min([1, 5, -10, 100, 2]); // => -10

max

This computes the maximum number in an array.

This runs in O(n), linear time, with respect to the length of the array.

max(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more data points

Returns

number: maximum value

Throws

Error: if the length of x is less than one

Example

max([1, 2, 3, 4]);
// => 4

sum

Our default sum is the Kahan-Babuska algorithm. This method is an improvement over the classical Kahan summation algorithm. It aims at computing the sum of a list of numbers while correcting for floating-point errors. Traditionally, sums are calculated as many successive additions, each one with its own floating-point roundoff. These losses in precision accumulate as the count of numbers increases. The Kahan-Babuska algorithm tracks that lost precision in a separate correction term, so it is more accurate than calculating sums by simple addition.

This runs in O(n), linear time, with respect to the length of the array.

sum(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: sum of all input numbers

Example

sum([1, 2, 3]); // => 6
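
The compensation idea described for sum can be sketched as follows. This is a minimal illustration of Kahan-Babuska (Neumaier) summation, not the library's source; the name compensatedSum is hypothetical.

```javascript
// Hypothetical sketch of Kahan-Babuska (Neumaier) compensated summation.
function compensatedSum(values) {
  let total = 0;      // running sum, subject to roundoff
  let correction = 0; // accumulated low-order bits lost by each addition
  for (const value of values) {
    const next = total + value;
    if (Math.abs(total) >= Math.abs(value)) {
      // low-order bits of value were lost in the addition
      correction += (total - next) + value;
    } else {
      // low-order bits of total were lost in the addition
      correction += (value - next) + total;
    }
    total = next;
  }
  return total + correction;
}

compensatedSum([1, 1e100, 1, -1e100]); // => 2
[1, 1e100, 1, -1e100].reduce((a, b) => a + b, 0); // => 0 with simple addition
```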

sumSimple

The simple sum of an array is the result of adding all numbers together, starting from zero.

This runs in O(n), linear time, with respect to the length of the array.

sumSimple(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: sum of all input numbers

Example

sumSimple([1, 2, 3]); // => 6

quantile

This is a population quantile, since this library assumes that the entire dataset is known. It is an implementation of the Quantiles of a Population algorithm from Wikipedia.

The sample x is a one-dimensional array of numbers, and p is either a decimal number from 0 to 1 or an array of decimal numbers from 0 to 1. In terms of a k/q quantile, p = k/q: it is the same quantity expressed as a decimal value rather than a fraction. When p is an array, the result of the function is also an array containing the appropriate quantiles in input order.

quantile(x: Array<number>, p: (Array<number> | number)): number

Parameters

x (Array<number>) sample of one or more numbers

p ((Array<number> | number)) the desired quantile, as a number between 0 and 1

Returns

number: quantile

Example

quantile([3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20], 0.5); // => 9
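
As a rough sketch of the population-quantile rule, assuming the Wikipedia formulation (with idx = n * p, round a fractional idx up, and average the two neighbours when idx is whole): the helper names below are hypothetical, and the library may treat some edge cases differently.

```javascript
// Hypothetical helper: quantile of an already sorted array.
function quantileOfSorted(sorted, p) {
  if (sorted.length === 0) throw new Error("sample must not be empty");
  if (p < 0 || p > 1) throw new Error("p must be between 0 and 1");
  if (p === 0) return sorted[0];
  if (p === 1) return sorted[sorted.length - 1];
  const idx = sorted.length * p;
  // fractional position: round up to the next element (1-based), so ceil - 1
  if (idx % 1 !== 0) return sorted[Math.ceil(idx) - 1];
  // whole position: average the element at idx and the one after it (1-based)
  return (sorted[idx - 1] + sorted[idx]) / 2;
}

// Hypothetical wrapper: sorts a copy, then accepts a single p or an array of p.
function populationQuantile(sample, p) {
  const sorted = sample.slice().sort((a, b) => a - b);
  return Array.isArray(p)
    ? p.map((each) => quantileOfSorted(sorted, each))
    : quantileOfSorted(sorted, p);
}

populationQuantile([3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20], 0.5); // => 9
populationQuantile([3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20], [0.25, 0.75]); // => [7, 15]
```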

product

The product of an array is the result of multiplying all numbers together, starting from one, the multiplicative identity.

This runs in O(n), linear time, with respect to the length of the array.

product(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: product of all input numbers

Example

product([1, 2, 3, 4]); // => 24

minSorted

The minimum is the lowest number in the array. With a sorted array, the first element in the array is always the smallest, so this calculation can be done in one step, or constant time.

minSorted(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: minimum value

Example

minSorted([-100, -10, 1, 2, 5]); // => -100

maxSorted

The maximum is the highest number in the array. With a sorted array, the last element in the array is always the largest, so this calculation can be done in one step, or constant time.

maxSorted(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: maximum value

Example

maxSorted([-100, -10, 1, 2, 5]); // => 5

quantileSorted

This is the internal implementation of quantiles: when you know that the input is already sorted, you don't need to re-sort it, and the computations are faster.

quantileSorted(x: Array<number>, p: number): number

Parameters

x (Array<number>) sample of one or more data points

p (number) desired quantile: a number between 0 and 1, inclusive

Returns

number: quantile value

Throws

Error: if p is outside of the range from 0 to 1
Error: if x is empty

Example

quantileSorted([3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20], 0.5); // => 9

mean

The mean, also known as average, is the sum of all values over the number of values. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

This runs in O(n), linear time, with respect to the length of the array.

mean(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more data points

Returns

number: mean

Throws

Error: if the length of x is less than one

Example

mean([0, 10]); // => 5

addToMean

When adding a new value to a list, one does not necessarily have to recompute the mean of the list in linear time. Instead, this function computes the new mean from the current mean, the number of elements in the list that produced it, and the new value to add.

addToMean(mean: number, n: number, newValue: number): number

Since: 2.5.0

Parameters

mean (number) current mean

n (number) number of items in the list

newValue (number) the added value

Returns

number: the new mean

Example

addToMean(14, 5, 53); // => 20.5
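
The constant-time update can be sketched as below. The function name incrementalMean is hypothetical; the formula is the standard running-mean update, which is algebraically the same as (mean * n + newValue) / (n + 1).

```javascript
// Hypothetical sketch: update a mean of n items with one new value in O(1).
function incrementalMean(mean, n, newValue) {
  return mean + (newValue - mean) / (n + 1);
}

incrementalMean(14, 5, 53); // => 20.5
```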

mode

The mode is the number that appears in a list the highest number of times. There can be multiple modes in a list: in the event of a tie, this algorithm will return the most recently seen mode.

This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

This runs in O(n log(n)) because it needs to sort the array internally before running an O(n) search to find the mode.

mode(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: mode

Example

mode([0, 0, 1]); // => 0

modeSorted

The mode is the number that appears in a list the highest number of times. There can be multiple modes in a list: in the event of a tie, this algorithm will return the most recently seen mode.

This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

This runs in O(n) because the input is sorted.

modeSorted(sorted: Array<number>): number

Parameters

sorted (Array<number>) a sample of one or more data points

Returns

number: mode

Throws

Error: if sorted is empty

Example

modeSorted([0, 0, 1]); // => 0

modeFast

The mode is the number that appears in a list the highest number of times. There can be multiple modes in a list: in the event of a tie, this algorithm will return the most recently seen mode.

modeFast uses a Map object to keep track of the mode, instead of the approach used with mode, a sorted array. As a result, it is faster than mode and supports any data type that can be compared with ==. It also requires a JavaScript environment with support for Map, and will throw an error if Map is not available.

This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

modeFast(x: Array<any>): any?

Parameters

x (Array<any>) a sample of one or more data points

Returns

any?: mode

Throws

ReferenceError: if the JavaScript environment doesn't support Map
Error: if x is empty

Example

modeFast(['rabbits', 'rabbits', 'squirrels']); // => 'rabbits'
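
A Map-based count, as described above, might look like the following sketch. The name modeWithMap is hypothetical, and the tie-breaking here (the first value to reach the highest count wins) is an assumption of this sketch rather than a statement about the library.

```javascript
// Hypothetical sketch: find the most frequent value with a single pass and a Map.
function modeWithMap(values) {
  if (values.length === 0) throw new Error("values must not be empty");
  const counts = new Map();
  let best;
  let bestCount = 0;
  for (const value of values) {
    const count = (counts.get(value) || 0) + 1;
    counts.set(value, count);
    // strict comparison: a later value only wins by exceeding the best count
    if (count > bestCount) {
      best = value;
      bestCount = count;
    }
  }
  return best;
}

modeWithMap(['rabbits', 'rabbits', 'squirrels']); // => 'rabbits'
```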

median

The median is the middle number of a list. This is often a good indicator of 'the middle' when there are outliers that skew the mean() value. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The median isn't necessarily one of the elements in the list: the value can be the average of two elements if the list has an even length and the two central values are different.

median(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: median value

Example

median([10, 2, 5, 100, 2, 1]); // => 3.5
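
To illustrate the even-length case, here is a minimal sketch (the name medianSketch is hypothetical). Note the numeric comparator: JavaScript's default sort compares elements as strings.

```javascript
// Hypothetical sketch: median of an unsorted array of numbers.
function medianSketch(values) {
  if (values.length === 0) throw new Error("values must not be empty");
  // sort a copy numerically so the caller's array is not modified
  const sorted = values.slice().sort((a, b) => a - b);
  const middle = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[middle]                             // odd length: the central element
    : (sorted[middle - 1] + sorted[middle]) / 2; // even length: mean of the two central elements
}

medianSketch([10, 2, 5, 100, 2, 1]); // => 3.5
```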

medianSorted

The median is the middle number of a list. This is often a good indicator of 'the middle' when there are outliers that skew the mean() value. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The median isn't necessarily one of the elements in the list: the value can be the average of two elements if the list has an even length and the two central values are different.

medianSorted(sorted: Array<number>): number

Parameters

sorted (Array<number>) input

Returns

number: median value

Example

medianSorted([1, 2, 2, 5, 10, 100]); // => 3.5

harmonicMean

The Harmonic Mean is a mean function typically used to find the average of rates. This mean is calculated by taking the reciprocal of the arithmetic mean of the reciprocals of the input numbers.

This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

This runs in O(n), linear time, with respect to the length of the array.

harmonicMean(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more data points

Returns

number: harmonic mean

Throws

Error: if x is empty
Error: if x contains a negative number

Example

harmonicMean([2, 3]).toFixed(2) // => '2.40'

geometricMean

The Geometric Mean is a mean function that is more useful for numbers in different ranges.

This is the nth root of the product of the n input numbers.

The geometric mean is often useful for proportional growth: given growth rates for multiple years, like 80%, 16.66% and 42.85%, a simple mean will incorrectly estimate an average growth rate, whereas a geometric mean will correctly estimate a growth rate that, over those years, will yield the same end value.

This runs in O(n), linear time, with respect to the length of the array.

geometricMean(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more data points

Returns

number: geometric mean

Throws

Error: if x is empty
Error: if x contains a negative number

Example

var growthRates = [1.80, 1.166666, 1.428571];
var averageGrowth = ss.geometricMean(growthRates);
var averageGrowthRates = [averageGrowth, averageGrowth, averageGrowth];
var startingValue = 10;
var startingValueMean = 10;
growthRates.forEach(function(rate) {
  startingValue *= rate;
});
averageGrowthRates.forEach(function(rate) {
  startingValueMean *= rate;
});
startingValueMean === startingValue;
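
The nth-root definition can be sketched directly. The name geometricMeanSketch is hypothetical. Because floating-point roundoff can make two mathematically equal results differ in the last bits, the comparison below uses a small tolerance instead of strict equality.

```javascript
// Hypothetical sketch: nth root of the product of n non-negative numbers.
function geometricMeanSketch(values) {
  if (values.length === 0) throw new Error("values must not be empty");
  let product = 1;
  for (const value of values) {
    if (value < 0) throw new Error("values must not be negative");
    product *= value;
  }
  return Math.pow(product, 1 / values.length);
}

// Applying the average rate three times ends close to the compounded rates.
const rates = [1.80, 1.166666, 1.428571];
const averageRate = geometricMeanSketch(rates);
const compounded = rates.reduce((value, rate) => value * rate, 10);
const averaged = 10 * Math.pow(averageRate, rates.length);
const closeEnough = Math.abs(compounded - averaged) < 1e-9;
```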

rootMeanSquare

The Root Mean Square (RMS) is a mean function used as a measure of the magnitude of a set of numbers, regardless of their sign. This is the square root of the mean of the squares of the input numbers. This runs in O(n), linear time, with respect to the length of the array.

rootMeanSquare(x: Array<number>): number

Parameters

x (Array<number>) a sample of one or more data points

Returns

number: root mean square

Throws

Error: if x is empty

Example

rootMeanSquare([-1, 1, -1, 1]); // => 1

sampleSkewness

Skewness is a measure of the extent to which a probability distribution of a real-valued random variable "leans" to one side of the mean. The skewness value can be positive or negative, or even undefined.

Implementation is based on the adjusted Fisher-Pearson standardized moment coefficient, which is the version found in Excel and several statistical packages including Minitab, SAS and SPSS.

sampleSkewness(x: Array<number>): number

Since: 4.1.0

Parameters

x (Array<number>) a sample of 3 or more data points

Returns

number: sample skewness

Throws

Error: if x has length less than 3

Example

sampleSkewness([2, 4, 6, 3, 1]); // => 0.590128656384365

variance

The variance is the mean of the squared deviations from the mean: the sum of squared deviations divided by the number of values.

This is an implementation of population variance, not sample variance: see the sampleVariance method if you want a sample measure.

variance(x: Array<number>): number

Parameters

x (Array<number>) a population of one or more data points

Returns

number: variance: a value greater than or equal to zero. Zero indicates that all values are identical.

Throws

Error: if x's length is 0

Example

variance([1, 2, 3, 4, 5, 6]); // => 2.9166666666666665

sampleVariance

The sample variance is the sum of squared deviations from the mean, divided by the number of values minus one. The sample variance is distinguished from the variance by the usage of Bessel's Correction: instead of dividing the sum of squared deviations by the length of the input, it is divided by the length minus one. This corrects the bias that arises when estimating a population's variance from a sample rather than from the full population.

References:

Wolfram MathWorld on Sample Variance

sampleVariance(x: Array<number>): number

Parameters

x (Array<number>) a sample of two or more data points

Returns

number: sample variance

Throws

Error: if the length of x is less than 2

Example

sampleVariance([1, 2, 3, 4, 5]); // => 2.5
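
The only difference between the population variance and the sample variance is the divisor, as this sketch shows. The function names are hypothetical and the code is an illustration, not the library's implementation.

```javascript
// Hypothetical helper: sum of squared deviations from the mean.
function sumOfSquaredDeviations(values) {
  const mean = values.reduce((total, value) => total + value, 0) / values.length;
  return values.reduce((total, value) => total + (value - mean) ** 2, 0);
}

// Population variance: divide by n.
function populationVarianceSketch(values) {
  if (values.length === 0) throw new Error("need at least one value");
  return sumOfSquaredDeviations(values) / values.length;
}

// Sample variance with Bessel's correction: divide by n - 1.
function sampleVarianceSketch(values) {
  if (values.length < 2) throw new Error("need at least two values");
  return sumOfSquaredDeviations(values) / (values.length - 1);
}

populationVarianceSketch([2, 4, 4, 4, 5, 5, 7, 9]); // => 4
sampleVarianceSketch([1, 2, 3, 4, 5]); // => 2.5
```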

standardDeviation

The standard deviation is the square root of the variance. This is also known as the population standard deviation. It's useful for measuring the amount of variation or dispersion in a set of values.

Standard deviation is only appropriate for full-population knowledge: for samples of a population, sampleStandardDeviation is more appropriate.

standardDeviation(x: Array<number>): number

Parameters

x (Array<number>) input

Returns

number: standard deviation

Example

variance([2, 4, 4, 4, 5, 5, 7, 9]); // => 4
standardDeviation([2, 4, 4, 4, 5, 5, 7, 9]); // => 2

sampleStandardDeviation

The sample standard deviation is the square root of the sample variance.

sampleStandardDeviation(x: Array<number>): number

Parameters

x (Array<number>) input array

Returns

number: sample standard deviation

Example

sampleStandardDeviation([2, 4, 4, 4, 5, 5, 7, 9]).toFixed(2);
// => '2.14'

medianAbsoluteDeviation

The Median Absolute Deviation is a robust measure of statistical dispersion. It is more resilient to outliers than the standard deviation.

medianAbsoluteDeviation(x: Array<number>): number

Parameters

x (Array<number>) input array

Returns

number: median absolute deviation

Example

medianAbsoluteDeviation([1, 1, 2, 2, 4, 6, 9]); // => 1
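
The median absolute deviation is the median of the absolute distances from the median. A minimal sketch, with hypothetical helper names:

```javascript
// Hypothetical helper: median of an unsorted array of numbers.
function medianOf(values) {
  const sorted = values.slice().sort((a, b) => a - b);
  const middle = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[middle]
    : (sorted[middle - 1] + sorted[middle]) / 2;
}

// Hypothetical sketch: median of the absolute deviations from the median.
function medianAbsoluteDeviationSketch(values) {
  if (values.length === 0) throw new Error("values must not be empty");
  const center = medianOf(values);
  return medianOf(values.map((value) => Math.abs(value - center)));
}

medianAbsoluteDeviationSketch([1, 1, 2, 2, 4, 6, 9]); // => 1
```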

interquartileRange

The Interquartile range is a measure of statistical dispersion, or how scattered, spread, or concentrated a distribution is. It's computed as the difference between the third quartile and first quartile.

interquartileRange(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more numbers

Returns

number: interquartile range: the span between lower and upper quartile, 0.25 and 0.75

Example

interquartileRange([0, 1, 2, 3]); // => 2

sumNthPowerDeviations

The sum of deviations to the Nth power. When n=2 it's the sum of squared deviations. When n=3 it's the sum of cubed deviations.

sumNthPowerDeviations(x: Array<number>, n: number): number

Parameters

x (Array<number>)

n (number) power

Returns

number: sum of nth power deviations

Example

var input = [1, 2, 3];
// since the variance of a set is the mean of the squared
// deviations, we can calculate that with sumNthPowerDeviations:
sumNthPowerDeviations(input, 2) / input.length;

zScore

The Z-Score, or Standard Score.

The standard score is the number of standard deviations an observation or datum is above or below the mean. Thus, a positive standard score represents a datum above the mean, while a negative standard score represents a datum below the mean. It is a dimensionless quantity obtained by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation.

The z-score is only defined if one knows the population parameters; if one only has a sample set, then the analogous computation with sample mean and sample standard deviation yields the Student's t-statistic.

zScore(x: number, mean: number, standardDeviation: number): number

Parameters

x (number)

mean (number)

standardDeviation (number)

Returns

number: z score

Example

zScore(78, 80, 5); // => -0.4

sampleCorrelation

The correlation is a measure of how correlated two datasets are, expressed as a value between -1 and 1.

sampleCorrelation(x: Array<number>, y: Array<number>): number

Parameters

x (Array<number>) first input

y (Array<number>) second input

Returns

number: sample correlation

Example

sampleCorrelation([1, 2, 3, 4, 5, 6], [2, 2, 3, 4, 5, 60]).toFixed(2);
// => '0.69'

sampleCovariance

Sample covariance of two datasets: how much do the two datasets move together? x and y are two datasets, represented as arrays of numbers.

sampleCovariance(x: Array<number>, y: Array<number>): number

Parameters

x (Array<number>) a sample of two or more data points

y (Array<number>) a sample of two or more data points

Returns

number: sample covariance

Throws

Error: if x and y do not have equal lengths
Error: if x or y have length of one or less

Example

sampleCovariance([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]); // => -3.5
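
How sample covariance and sample correlation relate can be sketched as follows. The function names are hypothetical; the sketch assumes the usual definitions, with the n - 1 divisor for the covariance and the correlation taken as the covariance divided by the product of the two sample standard deviations.

```javascript
function meanOf(values) {
  return values.reduce((total, value) => total + value, 0) / values.length;
}

// Hypothetical sketch: sum of products of paired deviations, divided by n - 1.
function sampleCovarianceSketch(x, y) {
  if (x.length !== y.length) throw new Error("x and y must have equal lengths");
  if (x.length < 2) throw new Error("need at least two data points");
  const meanX = meanOf(x);
  const meanY = meanOf(y);
  let total = 0;
  for (let i = 0; i < x.length; i++) {
    total += (x[i] - meanX) * (y[i] - meanY);
  }
  return total / (x.length - 1);
}

// Hypothetical sketch: covariance scaled by both sample standard deviations.
function sampleCorrelationSketch(x, y) {
  const sdX = Math.sqrt(sampleCovarianceSketch(x, x));
  const sdY = Math.sqrt(sampleCovarianceSketch(y, y));
  return sampleCovarianceSketch(x, y) / (sdX * sdY);
}

sampleCovarianceSketch([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]); // => -3.5
```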

rSquared

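The R Squared value of data compared with a function f measures how well f fits the data. It is one minus the ratio of two sums: the sum of the squared differences between the prediction and the actual value, divided by the sum of the squared differences between the actual values and their mean. A value of 1 indicates a perfect fit.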

rSquared(x: Array<Array<number>>, func: Function): number

Parameters

x (Array<Array<number>>) input data: this should be doubly-nested

func (Function) function called on [i][0] values within the dataset

Returns

number: r-squared value

Example

var samples = [[0, 0], [1, 1]];
var regressionLine = linearRegressionLine(linearRegression(samples));
rSquared(samples, regressionLine); // = 1 this line is a perfect fit
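
A minimal sketch of the coefficient of determination, assuming the standard definition 1 - SSres / SStot. The name rSquaredSketch is hypothetical, and the sketch does not handle data whose y values are all identical.

```javascript
// Hypothetical sketch: data is an array of [x, y] pairs, f maps x to a predicted y.
function rSquaredSketch(data, f) {
  const meanY = data.reduce((total, point) => total + point[1], 0) / data.length;
  let residualSum = 0; // squared differences between prediction and actual value
  let totalSum = 0;    // squared differences between actual value and the mean
  for (const [x, y] of data) {
    residualSum += (y - f(x)) ** 2;
    totalSum += (y - meanY) ** 2;
  }
  return 1 - residualSum / totalSum;
}

rSquaredSketch([[0, 0], [1, 1]], (x) => x); // => 1
rSquaredSketch([[0, 0], [1, 1]], () => 0.5); // => 0, no better than predicting the mean
```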

linearRegression

Simple linear regression is a simple way to find a fitted line between a set of coordinates. This algorithm finds the slope and y-intercept of a regression line using the least sum of squares.

linearRegression(data: Array<Array<number>>): Object

Parameters

data (Array<Array<number>>) an array of two-element arrays, like [[0, 1], [2, 3]]

Returns

Object: object containing slope and intercept of regression line

Example

linearRegression([[0, 0], [1, 1]]); // => { m: 1, b: 0 }
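
The least-squares slope and intercept have a closed form, sketched below. The name leastSquaresLine is hypothetical; the sketch assumes at least two points with distinct x values and returns the same m and b shape shown in the example.

```javascript
// Hypothetical sketch: ordinary least squares for [x, y] pairs.
function leastSquaresLine(data) {
  const n = data.length;
  let sumX = 0;
  let sumY = 0;
  let sumXX = 0;
  let sumXY = 0;
  for (const [x, y] of data) {
    sumX += x;
    sumY += y;
    sumXX += x * x;
    sumXY += x * y;
  }
  // slope minimizing the sum of squared vertical distances
  const m = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
  // the fitted line passes through the point (mean x, mean y)
  const b = sumY / n - (m * sumX) / n;
  return { m, b };
}

leastSquaresLine([[0, 0], [1, 1]]); // => { m: 1, b: 0 }
leastSquaresLine([[0, 1], [1, 3], [2, 5]]); // => { m: 2, b: 1 }
```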

linearRegressionLine

Given the output of linearRegression: an object with m and b values indicating slope and intercept, respectively, generate a line function that translates x values into y values.

linearRegressionLine(mb: Object): Function

Parameters

mb (Object) object with m and b members, representing slope and intercept of desired line

Returns

Function: method that computes y-value at any given x-value on the line.

Example

var l = linearRegressionLine(linearRegression([[0, 0], [1, 1]]));
l(0) // = 0
l(2) // = 2
linearRegressionLine({ b: 0, m: 1 })(1); // => 1
linearRegressionLine({ b: 1, m: 1 })(1); // => 2

shuffle

A Fisher-Yates shuffle is a fast way to create a random permutation of a finite set. This is a wrapper around shuffleInPlace that adds the guarantee that it will not modify its input.

shuffle(x: Array, randomSource: Function): Array

Parameters

x (Array) sample of 0 or more numbers

randomSource (Function = Math.random) an optional entropy source that returns numbers between 0 inclusive and 1 exclusive: the range [0, 1)

Returns

Array: shuffled version of input

Example

var shuffled = shuffle([1, 2, 3, 4]);
shuffled; // = [2, 3, 1, 4] or any other random permutation

shuffleInPlace

A Fisher-Yates shuffle in-place - which means that it will change the order of the original array by reference.

This is an algorithm that generates a random permutation of a set.

shuffleInPlace(x: Array, randomSource: Function): Array

Parameters

x (Array) sample of one or more numbers

randomSource (Function = Math.random) an optional entropy source that returns numbers between 0 inclusive and 1 exclusive: the range [0, 1)

Returns

Array: x

Example

var x = [1, 2, 3, 4];
shuffleInPlace(x);
// x is shuffled to a value like [2, 1, 4, 3]
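
The relationship between the in-place shuffle and the copying shuffle can be sketched like this. The function names are hypothetical. Passing a deterministic randomSource, as in the last line, is one way to make a shuffle reproducible in tests.

```javascript
// Hypothetical sketch: Fisher-Yates shuffle that reorders x itself.
function shuffleInPlaceSketch(x, randomSource = Math.random) {
  for (let i = x.length - 1; i > 0; i--) {
    // pick a position from the not-yet-fixed prefix [0, i]
    const j = Math.floor(randomSource() * (i + 1));
    const swapped = x[i];
    x[i] = x[j];
    x[j] = swapped;
  }
  return x;
}

// Hypothetical sketch: shuffle a copy so the input is left untouched.
function shuffleSketch(x, randomSource = Math.random) {
  return shuffleInPlaceSketch(x.slice(), randomSource);
}

shuffleSketch([1, 2, 3, 4], () => 0); // => [2, 3, 4, 1] with this fixed source
```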

sampleWithReplacement

Sampling with replacement is a type of sampling that allows the same item to be picked out of a population more than once.

sampleWithReplacement(x: Array<any>, n: number, randomSource: Function): Array

Parameters

x (Array<any>) an array of any kind of value

n (number) count of how many elements to take

randomSource (Function = Math.random) an optional entropy source that returns numbers between 0 inclusive and 1 exclusive: the range [0, 1)

Returns

Array: n sampled items from the population

Example

var values = [1, 2, 3, 4];
sampleWithReplacement(values, 2); // returns 2 random values, like [2, 4];

sample

Create a simple random sample from a given array of n elements.

The sampled values will be in any order, not necessarily the order they appear in the input.

sample(x: Array<any>, n: number, randomSource: Function): Array

Parameters

x (Array<any>) input array. can contain any type

n (number) count of how many elements to take

randomSource (Function = Math.random) an optional entropy source that returns numbers between 0 inclusive and 1 exclusive: the range [0, 1)

Returns

Array: subset of n elements in original array

Example

var values = [1, 2, 4, 5, 6, 7, 8, 9];
sample(values, 3); // returns 3 random values, like [2, 5, 8];

BayesianClassifier

Bayesian Classifier

This is a naïve Bayesian classifier that takes singly-nested objects.

new BayesianClassifier()

Example

var bayes = new BayesianClassifier();
bayes.train({
  species: 'Cat'
}, 'animal');
var result = bayes.score({
  species: 'Cat'
})
// result
// {
//   animal: 1
// }

Instance Members

▸ train(item, category)

Train the classifier with a new item, which has a single dimension of JavaScript literal keys and values.

train(item: Object, category: string): undefined

Parameters

item (Object) an object with singly-deep properties

category (string) the category this item belongs to

Returns

undefined: adds the item to the classifier

▸ score(item)

Generate a score of how well this item matches all possible categories based on its attributes.

score(item: Object): Object

Parameters

item (Object) an item in the same format as with train

Returns

Object: probabilities that this item belongs to each category.

PerceptronModel

This is a single-layer Perceptron Classifier that takes arrays of numbers and predicts whether they should be classified as either 0 or 1 (negative or positive examples).

new PerceptronModel()

Example

// Create the model
var p = new PerceptronModel();
// Train the model with input with a diagonal boundary.
for (var i = 0; i < 5; i++) {
    p.train([1, 1], 1);
    p.train([0, 1], 0);
    p.train([1, 0], 0);
    p.train([0, 0], 0);
}
p.predict([0, 0]); // 0
p.predict([0, 1]); // 0
p.predict([1, 0]); // 0
p.predict([1, 1]); // 1

Instance Members

▸ predict(features)

Predict: Use an array of features with the weight array and bias to predict whether an example is labeled 0 or 1.

predict(features: Array<number>): number

Parameters

features (Array<number>) an array of features as numbers

Returns

number: 1 if the score is over 0, otherwise 0

▸ train(features, label)

Train the classifier with a new example, which is a numeric array of features and a 0 or 1 label.

train(features: Array<number>, label: number): PerceptronModel

Parameters

features (Array<number>) an array of features as numbers

label (number) either 0 or 1

Returns

PerceptronModel: this
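
The predict and train behavior described above follows the classic perceptron learning rule, which can be sketched as below. The class name PerceptronSketch is hypothetical, and starting from zero weights and zero bias is an assumption of this sketch, not a statement about how the library initializes its model.

```javascript
// Hypothetical sketch of a single-layer perceptron with 0/1 labels.
class PerceptronSketch {
  constructor() {
    this.weights = [];
    this.bias = 0;
  }

  // Weighted sum of the features plus the bias; 1 if the score is over 0.
  predict(features) {
    let score = this.bias;
    for (let i = 0; i < features.length; i++) {
      score += (this.weights[i] || 0) * features[i];
    }
    return score > 0 ? 1 : 0;
  }

  // On a wrong prediction, nudge weights and bias toward the correct label.
  train(features, label) {
    if (this.weights.length !== features.length) {
      this.weights = features.map(() => 0);
      this.bias = 0;
    }
    const gradient = label - this.predict(features); // -1, 0, or 1
    if (gradient !== 0) {
      for (let i = 0; i < features.length; i++) {
        this.weights[i] += gradient * features[i];
      }
      this.bias += gradient;
    }
    return this;
  }
}

const sketch = new PerceptronSketch();
for (let round = 0; round < 5; round++) {
  sketch.train([1, 1], 1).train([0, 1], 0).train([1, 0], 0).train([0, 0], 0);
}
sketch.predict([1, 1]); // => 1
sketch.predict([0, 1]); // => 0
```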

bernoulliDistribution

The Bernoulli distribution is the discrete probability distribution of a random variable which takes value 1 with success probability p and value 0 with failure probability q = 1 - p. It can be used, for example, to represent the toss of a coin, where "1" is defined to mean "heads" and "0" is defined to mean "tails" (or vice versa). It is a special case of a Binomial Distribution where n = 1.

bernoulliDistribution(p: number): Array<number>

Parameters

p (number) input value, between 0 and 1 inclusive

Returns

Array<number>: values of bernoulli distribution at this point

Throws

Error: if p is outside the range from 0 to 1

Example

bernoulliDistribution(0.3); // => [0.7, 0.3]

binomialDistribution

The Binomial Distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments (n is given by trials), each of which yields success with the probability given by probability. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when trials = 1, the Binomial Distribution is a Bernoulli Distribution.

binomialDistribution(trials: number, probability: number): Array<number>

Parameters

trials (number) number of trials to simulate

probability (number)

Returns

Array<number>: output
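
As an illustration of the distribution itself, the sketch below evaluates the binomial probability mass function for every possible number of successes. The name binomialPmf and the output shape (one entry per k from 0 to trials) are assumptions of this sketch; the library's returned array may be shaped differently.

```javascript
// Hypothetical sketch: P(X = k) for k = 0..trials, as an array indexed by k.
function binomialPmf(trials, probability) {
  if (probability < 0 || probability > 1) throw new Error("probability must be between 0 and 1");
  if (!Number.isInteger(trials) || trials < 0) throw new Error("trials must be a non-negative integer");
  const result = [];
  let coefficient = 1; // binomial coefficient C(trials, k), updated incrementally
  for (let k = 0; k <= trials; k++) {
    result.push(
      coefficient * Math.pow(probability, k) * Math.pow(1 - probability, trials - k)
    );
    coefficient = (coefficient * (trials - k)) / (k + 1);
  }
  return result;
}

binomialPmf(2, 0.5); // => [0.25, 0.5, 0.25]
```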

poissonDistribution

The Poisson Distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.

The Poisson Distribution is characterized by the strictly positive mean arrival or occurrence rate, λ.

poissonDistribution(lambda: number): Array<number>

Parameters

lambda (number) the mean rate λ of the Poisson distribution

Returns

Array<number>: values of poisson distribution at that point

chiSquaredDistributionTable

Percentage Points of the χ2 (Chi-Squared) Distribution

The χ2 (Chi-Squared) Distribution is used in the common chi-squared tests for goodness of fit of an observed distribution to a theoretical one, the independence of two criteria of classification of qualitative data, and in confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation.

Values from Appendix 1, Table III of William W. Hines & Douglas C. Montgomery, "Probability and Statistics in Engineering and Management Science", Wiley (1980).

chiSquaredDistributionTable

standardNormalTable

A standard normal table, also called the unit normal table or Z table, is a mathematical table for the values of Φ (phi), which are the values of the cumulative distribution function of the normal distribution. It is used to find the probability that a statistic is observed below, above, or between values on the standard normal distribution, and by extension, any normal distribution.

standardNormalTable

tTest

This computes a one-sample t-test, comparing the mean of a sample, x, to a known value, expectedValue.

In this case, we're trying to determine whether the population mean is equal to the value that we know, which is expectedValue here. Usually the result is used to look up a p-value, which, for a certain level of significance, will let you determine whether the null hypothesis can or cannot be rejected.

tTest(x: Array<number>, expectedValue: number): number

Parameters

x (Array<number>) sample of one or more numbers

expectedValue (number) expected value of the population mean

Returns

number: value

Example

tTest([1, 2, 3, 4, 5, 6], 3.385).toFixed(2); // => '0.16'
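
The statistic has the form (sample mean - expected value) / (standard deviation / sqrt(n)). The sketch below uses a hypothetical function name, and it uses the population standard deviation (dividing by n) because that choice reproduces the '0.16' in the example above; that is an inference from the example, not a confirmed description of the library. Many textbooks use the sample standard deviation (dividing by n - 1) instead, which gives a slightly smaller value.

```javascript
// Hypothetical sketch of a one-sample t statistic.
function oneSampleT(x, expectedValue) {
  const n = x.length;
  const mean = x.reduce((total, value) => total + value, 0) / n;
  // assumption: population standard deviation (divide by n)
  const variance = x.reduce((total, value) => total + (value - mean) ** 2, 0) / n;
  const standardError = Math.sqrt(variance) / Math.sqrt(n);
  return (mean - expectedValue) / standardError;
}

oneSampleT([1, 2, 3, 4, 5, 6], 3.385).toFixed(2); // => '0.16'
```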

tTestTwoSample

This computes a two-sample t-test. It tests whether "mean(X) - mean(Y) = difference" (in the most common case, difference == 0, to test whether two samples are likely to be taken from populations with the same mean value), with no prior knowledge of the standard deviations of the two samples other than the fact that they have the same standard deviation.

Usually the results here are used to look up a p-value, which, for a certain level of significance, will let you determine that the null hypothesis can or cannot be rejected.

difference can be omitted if it equals 0.

This is used to reject a null hypothesis that the two populations that have been sampled into sampleX and sampleY are equal to each other.

tTestTwoSample(sampleX: Array<number>, sampleY: Array<number>, difference: number): (number | null)

Parameters

sampleX (Array<number>) a sample as an array of numbers

sampleY (Array<number>) a sample as an array of numbers

difference (number = 0)

Returns

(number | null): test result

Example

tTestTwoSample([1, 2, 3, 4], [3, 4, 5, 6], 0); // => -2.1908902300206643
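
Under the equal-standard-deviation assumption stated above, the two samples' variances are pooled, weighted by their degrees of freedom. A minimal sketch with hypothetical function names; returning null for an empty sample is an assumption based on the (number | null) return type.

```javascript
function averageOf(values) {
  return values.reduce((total, value) => total + value, 0) / values.length;
}

// Sample variance with the n - 1 divisor.
function unbiasedVariance(values) {
  const mean = averageOf(values);
  return values.reduce((total, value) => total + (value - mean) ** 2, 0) / (values.length - 1);
}

// Hypothetical sketch of a pooled-variance two-sample t statistic.
function twoSampleT(sampleX, sampleY, difference = 0) {
  const n = sampleX.length;
  const m = sampleY.length;
  if (n === 0 || m === 0) return null;
  // pool the two variances, weighted by degrees of freedom
  const pooled =
    ((n - 1) * unbiasedVariance(sampleX) + (m - 1) * unbiasedVariance(sampleY)) /
    (n + m - 2);
  return (
    (averageOf(sampleX) - averageOf(sampleY) - difference) /
    Math.sqrt(pooled * (1 / n + 1 / m))
  );
}

twoSampleT([1, 2, 3, 4], [3, 4, 5, 6]); // approximately -2.19
```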

cumulativeStdNormalProbability

Cumulative Standard Normal Probability

Since probability tables cannot be printed for every normal distribution, as there are an infinite variety of normal distributions, it is common practice to convert a normal to a standard normal and then use the standard normal table to find probabilities.

You can use 0.5 + 0.5 * errorFunction(z / Math.sqrt(2)) to calculate the probability instead of looking it up in a table.

cumulativeStdNormalProbability(z: number): number

Parameters

z (number)

Returns

number: cumulative standard normal probability

kernelDensityEstimation

Kernel density estimation is a useful tool for, among other things, estimating the shape of the underlying probability distribution from a sample.

kernelDensityEstimation

Parameters

X (any) sample values

kernel (any) The kernel function to use. If a function is provided, it should return non-negative values and integrate to 1. Defaults to 'gaussian'.

bandwidthMethod (any) The "bandwidth selection" method to use, or a fixed bandwidth value. Defaults to "nrd", the commonly-used "normal reference distribution" rule-of-thumb.

Returns

Function: An estimated probability density function for the given sample. The returned function runs in O(X.length).
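
To show why the returned function runs in time proportional to the sample size, here is a minimal sketch of a Gaussian kernel density estimate with a fixed bandwidth. The names are hypothetical, and bandwidth selection rules such as "nrd" are deliberately left out.

```javascript
// Standard normal density, used as the kernel.
function gaussianKernel(u) {
  return Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
}

// Hypothetical sketch: returns a density function built from the sample.
function gaussianKde(sample, bandwidth) {
  if (sample.length === 0) throw new Error("sample must not be empty");
  if (!(bandwidth > 0)) throw new Error("bandwidth must be positive");
  return function density(x) {
    // one kernel evaluation per sample point: O(sample.length) per call
    let total = 0;
    for (const xi of sample) {
      total += gaussianKernel((x - xi) / bandwidth);
    }
    return total / (sample.length * bandwidth);
  };
}

const density = gaussianKde([1, 2, 2, 3, 7], 0.8);
density(2) > density(5); // => true: more sample mass lies near 2 than near 5
```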

errorFunction

Gaussian error function

The errorFunction(x/(sd * Math.sqrt(2))) is the probability that a value in a normal distribution with standard deviation sd is within x of the mean.

This function returns a numerical approximation to the exact value. It uses Horner's method to evaluate the polynomial of τ (tau).

errorFunction(x: number): number

Parameters

x (number) input

Returns

number: error estimation

Example

errorFunction(1).toFixed(2); // => '0.84'
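
A polynomial approximation evaluated with Horner's method can be sketched as below. This sketch uses the Abramowitz and Stegun formula 7.1.26, which is a swapped-in, well-known approximation (absolute error around 1.5e-7) and may not be the polynomial the library uses; the function names are hypothetical. The second function applies the identity 0.5 + 0.5 * erf(z / sqrt(2)) for the cumulative standard normal probability.

```javascript
// Hypothetical sketch: erf via Abramowitz & Stegun 7.1.26.
function erfApprox(x) {
  const sign = x < 0 ? -1 : 1; // erf is an odd function
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  // Horner's method: nest the multiplications instead of computing powers of t
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Cumulative standard normal probability from the error function.
function standardNormalCdf(z) {
  return 0.5 + 0.5 * erfApprox(z / Math.SQRT2);
}

erfApprox(1).toFixed(2); // => '0.84'
standardNormalCdf(1.96).toFixed(3); // => '0.975'
```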

inverseErrorFunction

The Inverse Gaussian error function returns a numerical approximation to the value that would have caused errorFunction() to return x.

inverseErrorFunction(x: number): number

Parameters

x (number) value of error function

Returns

number: estimated inverted value

probit

The Probit is the inverse of cumulativeStdNormalProbability(), and is also known as the normal quantile function.

It returns the number of standard deviations from the mean where the p'th quantile of values can be found in a normal distribution. So, for example, probit(0.5 + 0.6827/2) ≈ 1 because 68.27% of values are normally found within 1 standard deviation above or below the mean.

probit(p: number): number

Parameters

p (number)

Returns

number: probit

ckmeans

Ckmeans clustering is an improvement on heuristic-based clustering approaches like Jenks. The algorithm was developed by Haizhou Wang and Mingzhou Song as a dynamic programming approach to the problem of clustering numeric data into groups with the least within-group sum-of-squared-deviations.

Minimizing the difference within groups (what Wang & Song refer to as withinss, or within sum-of-squares) means that groups are optimally homogeneous within and the data is split into representative groups. This is very useful for visualization, where you may want to represent a continuous variable in discrete color or style groups. This function can provide groups that emphasize differences between data.

Being a dynamic programming approach, this algorithm is based on two matrices that store incrementally-computed values for squared deviations and backtracking indexes.

This implementation is based on Ckmeans 3.4.6, which introduced a new divide and conquer approach that improved runtime from O(kn^2) to O(kn log(n)).

Unlike the original implementation, this implementation does not include any code to automatically determine the optimal number of clusters: this information needs to be explicitly provided.

References

Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming Haizhou Wang and Mingzhou Song ISSN 2073-4859

from The R Journal Vol. 3/2, December 2011

ckmeans(x: Array<number>, nClusters: number): Array<Array<number>>

Parameters

x (Array<number>) input data, as an array of number values

nClusters (number) number of desired classes. This cannot be greater than the number of values in the data array.

Returns

Array<Array<number>>: clustered input

Throws

Error: if the number of requested clusters is higher than the size of the data

Example

ckmeans([-1, 2, -1, 2, 4, 5, 6, -1, 2, -1], 3);
// The input, clustered into groups of similar numbers.
// => [[-1, -1, -1, -1], [2, 2, 2], [4, 5, 6]]
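
To make the objective concrete, the sketch below evaluates the within-group sum of squared deviations for a given grouping. The helper name withinSumOfSquares is hypothetical and not part of the library; it only measures the quantity that ckmeans minimizes and does not perform any clustering itself.

```javascript
// Hypothetical helper (not part of simple-statistics): evaluates the
// within-group sum of squared deviations for an already-clustered input.
function withinSumOfSquares(groups) {
  let total = 0;
  for (const group of groups) {
    const mean = group.reduce((acc, v) => acc + v, 0) / group.length;
    for (const v of group) {
      total += (v - mean) * (v - mean);
    }
  }
  return total;
}

// The grouping from the example above, and a variant that moves one 2
// into the first group.
const optimal = [[-1, -1, -1, -1], [2, 2, 2], [4, 5, 6]];
const worse = [[-1, -1, -1, -1, 2], [2, 2], [4, 5, 6]];

console.log(withinSumOfSquares(optimal)); // 2
console.log(withinSumOfSquares(worse) > withinSumOfSquares(optimal)); // true
```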

equalIntervalBreaks

Given an array of x, this will find the extent of the x and return an array of breaks that can be used to categorize the x into a number of classes. The returned array will always be 1 longer than the number of classes because it includes the minimum value.

equalIntervalBreaks(x: Array<number>, nClasses: number): Array<number>

Parameters

x (Array<number>) an array of number values

nClasses (number) number of desired classes

Returns

Array<number>: array of class break positions

Example

equalIntervalBreaks([1, 2, 3, 4, 5, 6], 4); // => [1, 2.25, 3.5, 4.75, 6]
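
A minimal sketch of how equal-width breaks can be derived from the extent, assuming the class width is simply (max - min) / nClasses. The function name equalIntervalBreaksSketch is hypothetical; this is an illustration, not the library source.

```javascript
// Hypothetical sketch (not the library source): equal-width class breaks.
function equalIntervalBreaksSketch(x, nClasses) {
  const min = Math.min(...x);
  const max = Math.max(...x);
  const width = (max - min) / nClasses;

  // The first break is the minimum, which is why the result has
  // nClasses + 1 entries.
  const breaks = [min];
  for (let i = 1; i < nClasses; i++) {
    breaks.push(min + i * width);
  }
  // Push the maximum directly so rounding cannot move the last break.
  breaks.push(max);
  return breaks;
}

console.log(equalIntervalBreaksSketch([1, 2, 3, 4, 5, 6], 4));
// [1, 2.25, 3.5, 4.75, 6]
```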

chunk

Split an array into chunks of a specified size. This function has the same behavior as PHP's array_chunk function, and thus will insert smaller-sized chunks at the end if the input size is not divisible by the chunk size.

x is expected to be an array, and chunkSize a number. The x array can contain any kind of data.

chunk(x: Array, chunkSize: number): Array<Array>

Parameters

x (Array) a sample

chunkSize (number) size of each output array; must be a positive integer

Returns

Array<Array>: a chunked array

Throws

Error: if chunk size is less than 1 or not an integer

Example

chunk([1, 2, 3, 4, 5, 6], 2);
// => [[1, 2], [3, 4], [5, 6]]
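
The case where the input size is not divisible by the chunk size is easiest to see in code. The following is a sketch with a hypothetical name, chunkSketch, written to follow the behavior described above rather than copied from the library.

```javascript
// Hypothetical sketch (not the library source) of array chunking.
function chunkSketch(x, chunkSize) {
  if (chunkSize < 1 || !Number.isInteger(chunkSize)) {
    throw new Error('chunk size must be a positive integer');
  }
  const output = [];
  for (let start = 0; start < x.length; start += chunkSize) {
    // slice stops at the end of the array, so the last chunk may be shorter.
    output.push(x.slice(start, start + chunkSize));
  }
  return output;
}

console.log(chunkSketch([1, 2, 3, 4, 5], 2)); // [[1, 2], [3, 4], [5]]
```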

chiSquaredGoodnessOfFit

The χ2 (Chi-Squared) Goodness-of-Fit Test uses a measure of goodness of fit which is the sum of differences between observed and expected outcome frequencies (that is, counts of observations), each squared and divided by the number of observations expected given the hypothesized distribution. The resulting χ2 statistic, chiSquared, can be compared to the chi-squared distribution to determine the goodness of fit. In order to determine the degrees of freedom of the chi-squared distribution, one takes the total number of observed frequencies and subtracts the number of estimated parameters. The test statistic follows, approximately, a chi-square distribution with (k − c) degrees of freedom where k is the number of non-empty cells and c is the number of estimated parameters for the distribution.

chiSquaredGoodnessOfFit(data: Array<number>, distributionType: Function, significance: number): number

Parameters

data (Array<number>)

distributionType (Function) a function that returns a point in a distribution: for instance, binomial, bernoulli, or poisson

significance (number)

Returns

number: chi squared goodness of fit

Example

// Data from Poisson goodness-of-fit example 10-19 in William W. Hines & Douglas C. Montgomery,
// "Probability and Statistics in Engineering and Management Science", Wiley (1980).
var data1019 = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3
];
ss.chiSquaredGoodnessOfFit(data1019, ss.poissonDistribution, 0.05); //= false
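
The statistic itself is a short sum. The sketch below computes it from matching arrays of observed and expected counts; the helper chiSquaredStatistic and the counts are made up for illustration and are not part of the library, which additionally builds the expected counts from the hypothesized distribution and compares the result against a chi-squared table.

```javascript
// Hypothetical helper (not part of simple-statistics): the chi-squared
// statistic for matching arrays of observed and expected counts.
function chiSquaredStatistic(observed, expected) {
  let chiSquared = 0;
  for (let i = 0; i < observed.length; i++) {
    const difference = observed[i] - expected[i];
    // Squared difference, scaled by the expected count for that cell.
    chiSquared += (difference * difference) / expected[i];
  }
  return chiSquared;
}

// Made-up counts, for illustration only.
console.log(chiSquaredStatistic([10, 20, 30], [20, 20, 20])); // 10
```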

epsilon

We use ε, epsilon, as a stopping criterion when we want to iterate until we're "close enough". Epsilon is a very small number: for simple-statistics, that number is 0.0001.

This is used in calculations like the binomialDistribution, in which the process of finding a value is iterative: it progresses until it is close enough.

Below is an example of using epsilon in gradient descent, where we're trying to find a local minimum of a function whose derivative is given by the fDerivative function.

epsilon

Type: number

Example

// From calculation, we expect that the local minimum occurs at x=9/4
var x_old = 0;
// The algorithm starts at x=6
var x_new = 6;
var stepSize = 0.01;

function fDerivative(x) {
  return 4 * Math.pow(x, 3) - 9 * Math.pow(x, 2);
}

// The loop runs until the difference between the previous
// value and the current value is smaller than epsilon - a rough
// measure of 'close enough'
while (Math.abs(x_new - x_old) > ss.epsilon) {
  x_old = x_new;
  x_new = x_old - stepSize * fDerivative(x_old);
}

console.log('Local minimum occurs at', x_new);

factorial

A Factorial, usually written n!, is the product of all positive integers less than or equal to n. Often factorial is implemented recursively, but this iterative approach is significantly faster and simpler.

factorial(n: number): number

Parameters

n (number) input, must be an integer 0 or greater

Returns

number: factorial: n!

Throws

Error: if n is less than 0 or not an integer

Example

factorial(5); // => 120
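
A sketch of the iterative approach, under the assumption that the input checks mirror the Throws entry above. The name factorialSketch is hypothetical and this is not the library source.

```javascript
// Hypothetical sketch (not the library source) of an iterative factorial.
function factorialSketch(n) {
  if (n < 0 || !Number.isInteger(n)) {
    throw new Error('factorial requires a non-negative integer');
  }
  // Multiply upward from 2; the loop body never runs for 0 and 1.
  let accumulator = 1;
  for (let i = 2; i <= n; i++) {
    accumulator *= i;
  }
  return accumulator;
}

console.log(factorialSketch(5)); // 120
console.log(factorialSketch(0)); // 1 (the empty product)
```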

gamma

Compute the gamma function of a value using Nemes' approximation. The gamma of n is equivalent to (n-1)!, but unlike the factorial function, gamma is defined for all real n except zero and negative integers (where NaN is returned). Note, the gamma function is also well-defined for complex numbers, though this implementation currently does not handle complex numbers as input values. Nemes' approximation is defined here as Theorem 2.2. Negative values use Euler's reflection formula for computation.

gamma(n: number): number

Parameters

n (number) Any real number except for zero and negative integers.

Returns

number: The gamma of the input value.

Example

gamma(11.5); // 11899423.084037038
gamma(-11.5); // 2.29575810481609e-8
gamma(5); // 24

uniqueCountSorted

For a sorted input, counting the number of unique values is possible in linear time and constant memory. This is a simple implementation of the algorithm.

Values are compared with ===, so objects and other non-primitive values are not handled in any special way.

uniqueCountSorted(x: Array<any>): number

Parameters

x (Array<any>) an array of any kind of value

Returns

number: count of unique values

Example

uniqueCountSorted([1, 2, 3]); // => 3
uniqueCountSorted([1, 1, 1]); // => 1
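
The idea can be sketched as one pass over the array that compares each element with the last distinct value seen. The name uniqueCountSortedSketch is hypothetical; this illustrates the technique and is not the library source.

```javascript
// Hypothetical sketch (not the library source): count distinct values
// in a sorted array with a single pass and two variables of state.
function uniqueCountSortedSketch(x) {
  let uniqueValueCount = 0;
  let lastSeenValue;
  for (let i = 0; i < x.length; i++) {
    // In sorted input, equal values are adjacent, so a change from the
    // previous distinct value always marks a new unique value.
    if (i === 0 || x[i] !== lastSeenValue) {
      lastSeenValue = x[i];
      uniqueValueCount++;
    }
  }
  return uniqueValueCount;
}

console.log(uniqueCountSortedSketch([1, 2, 3])); // 3
console.log(uniqueCountSortedSketch([1, 1, 1])); // 1
```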

approxEqual

Approximate equality.

approxEqual(actual: number, expected: number, tolerance: number): boolean

Parameters

actual (number) The value to be tested.

expected (number) The reference value.

tolerance (number = epsilon) The acceptable relative difference.

Returns

boolean: Whether numbers are within tolerance.
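
Example

A sketch of how such a check might work, assuming the test is "relative error at most tolerance" with the conventional relative error |(A-E)/E|. The names approxEqualSketch, relativeErrorSketch and EPSILON are hypothetical stand-ins, not the library's own code.

```javascript
// Hypothetical sketch (not the library source). EPSILON mirrors the
// 0.0001 value documented for epsilon.
const EPSILON = 0.0001;

// Conventional relative error, with 0 returned when both values are zero.
function relativeErrorSketch(actual, expected) {
  if (actual === 0 && expected === 0) {
    return 0;
  }
  return Math.abs((actual - expected) / expected);
}

// Assumption: two numbers are "approximately equal" when their relative
// error does not exceed the tolerance.
function approxEqualSketch(actual, expected, tolerance = EPSILON) {
  return relativeErrorSketch(actual, expected) <= tolerance;
}

console.log(approxEqualSketch(100.005, 100)); // true
console.log(approxEqualSketch(101, 100)); // false
console.log(approxEqualSketch(101, 100, 0.05)); // true
```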

bisect

The bisection method is a root-finding method that repeatedly bisects an interval to find the root.

This function returns a numerical approximation to the exact value.

bisect(func: Function, start: number, end: number, maxIterations: number, errorTolerance: number): number

Parameters

func (Function) input function

start (number) start of interval

end (number) end of interval

maxIterations (number) the maximum number of iterations

errorTolerance (number) the error tolerance

Returns

number: estimated root value

Throws

TypeError: Argument func must be a function

Example

bisect(Math.cos, 0, 4, 100, 0.003); // => 1.572265625
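
The loop behind the method can be sketched as follows, assuming func is continuous and changes sign between start and end, and that iteration stops once half the interval width drops below errorTolerance. The name bisectSketch and the error messages are hypothetical; this is not the library source.

```javascript
// Hypothetical sketch (not the library source) of the bisection method.
// Assumes func is continuous and changes sign between start and end.
function bisectSketch(func, start, end, maxIterations, errorTolerance) {
  if (typeof func !== 'function') {
    throw new TypeError('func must be a function');
  }
  for (let i = 0; i < maxIterations; i++) {
    const midpoint = (start + end) / 2;
    const value = func(midpoint);
    // Stop on an exact root or when the interval is narrow enough.
    if (value === 0 || (end - start) / 2 < errorTolerance) {
      return midpoint;
    }
    // Keep the half-interval whose endpoints still have opposite signs.
    if (Math.sign(value) === Math.sign(func(start))) {
      start = midpoint;
    } else {
      end = midpoint;
    }
  }
  throw new Error('maximum number of iterations exceeded');
}

// cos has a root at pi / 2 inside [0, 4].
const root = bisectSketch(Math.cos, 0, 4, 100, 0.003);
console.log(Math.abs(root - Math.PI / 2) < 0.003); // true
```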

coefficientOfVariation

The coefficient of variation is the ratio of the standard deviation to the mean. See https://en.wikipedia.org/wiki/Coefficient_of_variation

coefficientOfVariation(x: Array): number

Parameters

x (Array) input

Returns

number: coefficient of variation

Example

coefficientOfVariation([1, 2, 3, 4]).toFixed(3); // => 0.516
coefficientOfVariation([1, 2, 3, 4, 5]).toFixed(3); // => 0.527
coefficientOfVariation([-1, 0, 1, 2, 3, 4]).toFixed(3); // => 1.247
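
The example values above are consistent with the sample standard deviation (n - 1 denominator). The sketch below makes that assumption explicit; coefficientOfVariationSketch is a hypothetical name and not the library source.

```javascript
// Hypothetical sketch (not the library source). Assumes the sample
// standard deviation, i.e. the sum of squares is divided by n - 1.
function coefficientOfVariationSketch(x) {
  const mean = x.reduce((acc, v) => acc + v, 0) / x.length;
  const sumOfSquares = x.reduce((acc, v) => acc + (v - mean) * (v - mean), 0);
  const sampleStandardDeviation = Math.sqrt(sumOfSquares / (x.length - 1));
  return sampleStandardDeviation / mean;
}

// toFixed returns a string, so the printed value is '0.516'.
console.log(coefficientOfVariationSketch([1, 2, 3, 4]).toFixed(3));
```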

combinationsReplacement

Implementation of combinations with replacement. Combinations are unique subsets of a collection: in this case, k elements taken from x at a time. 'With replacement' means that a given element can be chosen multiple times. Unlike permutations, order doesn't matter for combinations.

combinationsReplacement(x: Array, k: int): Array<Array>

Parameters

x (Array) any type of data

k (int) the number of objects in each group (with replacement)

Returns

Array<Array>: array of combinations

Example

combinationsReplacement([1, 2], 2); // => [[1, 1], [1, 2], [2, 2]]
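
One way to generate such combinations is a short recursion in which each element may be reused by the recursive call. This is a sketch with a hypothetical name, combinationsReplacementSketch, and not necessarily how the library implements it.

```javascript
// Hypothetical sketch (not the library source): recursive generation of
// k-element combinations with replacement.
function combinationsReplacementSketch(x, k) {
  if (k === 0) {
    return [[]];
  }
  const result = [];
  for (let i = 0; i < x.length; i++) {
    // Slicing from i (not i + 1) lets x[i] be chosen again, while never
    // going back to earlier elements keeps each combination unique.
    const tails = combinationsReplacementSketch(x.slice(i), k - 1);
    for (const tail of tails) {
      result.push([x[i], ...tail]);
    }
  }
  return result;
}

console.log(combinationsReplacementSketch([1, 2], 2));
// [[1, 1], [1, 2], [2, 2]]
```

Slicing from i + 1 instead would forbid reuse and produce ordinary combinations.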

combinations

Implementation of combinations. Combinations are unique subsets of a collection: in this case, k elements taken from x at a time. https://en.wikipedia.org/wiki/Combination

combinations(x: Array, k: int): Array<Array>

Parameters

x (Array) any type of data

k (int) the number of objects in each group (without replacement)

Returns

Array<Array>: array of combinations

Example

combinations([1, 2, 3], 2); // => [[1,2], [1,3], [2,3]]

combineMeans

When combining two lists of values for which one already knows the means, one does not necessarily have to recompute the mean of the combined lists in linear time. They can instead use this function to compute the combined mean by providing the mean & number of values of the first list and the mean & number of values of the second list.

combineMeans(mean1: number, n1: number, mean2: number, n2: number): number

Since: 3.0.0

Parameters

mean1 (number) mean of the first list

n1 (number) number of items in the first list

mean2 (number) mean of the second list

n2 (number) number of items in the second list

Returns

number: the combined mean

Example

combineMeans(5, 3, 4, 3); // => 4.5
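
The computation is a weighted average of the two means. A sketch, with the hypothetical name combineMeansSketch:

```javascript
// Hypothetical sketch (not the library source): weighted average of means.
function combineMeansSketch(mean1, n1, mean2, n2) {
  // mean * n recovers each list's sum without revisiting its values.
  return (mean1 * n1 + mean2 * n2) / (n1 + n2);
}

console.log(combineMeansSketch(5, 3, 4, 3)); // 4.5
```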

combineVariances

When combining two lists of values for which one already knows the variances, one does not necessarily have to recompute the variance of the combined lists in linear time. They can instead use this function to compute the combined variance by providing the variance, mean & number of values of the first list and the variance, mean & number of values of the second list.

combineVariances(variance1: number, mean1: number, n1: number, variance2: number, mean2: number, n2: number): number

Since: 3.0.0

Parameters

variance1 (number) variance of the first list

mean1 (number) mean of the first list

n1 (number) number of items in the first list

variance2 (number) variance of the second list

mean2 (number) mean of the second list

n2 (number) number of items in the second list

Returns

number: the combined variance

Example

combineVariances(14 / 3, 5, 3, 8 / 3, 4, 3); // => 47 / 12
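
A sketch of the underlying formula, assuming population variances (sum of squares divided by n): each list contributes its own variance plus the squared distance of its mean from the combined mean, weighted by its size. The name combineVariancesSketch is hypothetical.

```javascript
// Hypothetical sketch (not the library source). Assumes population
// variances, i.e. each variance was computed by dividing by n.
function combineVariancesSketch(variance1, mean1, n1, variance2, mean2, n2) {
  const combinedMean = (mean1 * n1 + mean2 * n2) / (n1 + n2);
  return (
    (n1 * (variance1 + Math.pow(mean1 - combinedMean, 2)) +
      n2 * (variance2 + Math.pow(mean2 - combinedMean, 2))) /
    (n1 + n2)
  );
}

const combined = combineVariancesSketch(14 / 3, 5, 3, 8 / 3, 4, 3);
// Compare with a tolerance because 47 / 12 is not exactly representable.
console.log(Math.abs(combined - 47 / 12) < 1e-12); // true
```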

cumulativeStdLogisticProbability

Logistic Cumulative Distribution Function

cumulativeStdLogisticProbability(x: number): number

Parameters

x (number)

Returns

number: cumulative standard logistic probability

euclideanDistance

Calculate Euclidean distance between two points.

euclideanDistance(left: Array<number>, right: Array<number>): number

Parameters

left (Array<number>) First N-dimensional point.

right (Array<number>) Second N-dimensional point.

Returns

number: Distance.
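
Example

A minimal sketch of the calculation, assuming both points have the same number of dimensions. The name euclideanDistanceSketch is hypothetical and this is not the library source.

```javascript
// Hypothetical sketch (not the library source): square root of the sum
// of squared coordinate differences.
function euclideanDistanceSketch(left, right) {
  let sumOfSquares = 0;
  for (let i = 0; i < left.length; i++) {
    const difference = left[i] - right[i];
    sumOfSquares += difference * difference;
  }
  return Math.sqrt(sumOfSquares);
}

console.log(euclideanDistanceSketch([0, 0], [3, 4])); // 5
```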

extentSorted

The extent is the lowest & highest number in the array. With a sorted array, the first element in the array is always the lowest while the last element is always the largest, so this calculation can be done in one step, or constant time.

extentSorted(x: Array<number>): Array<number>

Parameters

x (Array<number>) input

Returns

Array<number>: minimum & maximum value

Example

extentSorted([-100, -10, 1, 2, 5]); // => [-100, 5]

extent

This computes the minimum & maximum number in an array.

This runs in O(n), linear time, with respect to the length of the array.

extent(x: Array<number>): Array<number>

Parameters

x (Array<number>) sample of one or more data points

Returns

Array<number>: minimum & maximum value

Throws

Error: if the length of x is less than one

Example

extent([1, 2, 3, 4]);
// => [1, 4]

gammaln

Compute the logarithm of the gamma function of a value using Lanczos' approximation. This function takes as input any real value n greater than 0. This function is useful for values of n too large for the normal gamma function (n > 165). The code is based on Lanczos' gamma approximation, defined here.

gammaln(n: number): number

Parameters

n (number) Any real number greater than zero.

Returns

number: The logarithm of gamma of the input value.

Example

gammaln(500); // 2605.1158503617335
gammaln(2.4); // 0.21685932244884043

jenks

The jenks natural breaks optimization is an algorithm commonly used in cartography and visualization to decide upon groupings of data values that minimize variance within themselves and maximize variation between themselves.

For instance, cartographers often use jenks in order to choose which values are assigned to which colors in a choropleth map.

jenks(data: Array<number>, nClasses: number): Array<number>

Parameters

data (Array<number>) input data, as an array of number values

nClasses (number) number of desired classes

Returns

Array<number>: array of class break positions

Example

// split data into 3 break points
jenks([1, 2, 4, 5, 7, 9, 10, 20], 3) // = [1, 7, 20, 20]

kMeansReturn

Type: Object

Properties

labels (Array<number>) : The labels.

centroids (Array<Array<number>>) : The cluster centroids.

kMeansCluster

Perform k-means clustering.

kMeansCluster(points: Array<Array<number>>, numCluster: number, randomSource: Function): kMeansReturn

Parameters

points (Array<Array<number>>) N-dimensional coordinates of points to be clustered.

numCluster (number) How many clusters to create.

randomSource (Function = Math.random) An optional entropy source that generates uniform values in [0, 1).

Returns

kMeansReturn: Labels (same length as points) and centroids (numCluster of them).

Throws

Error: If any centroids wind up friendless (i.e., without associated points).

Example

kMeansCluster([[0.0, 0.5], [1.0, 0.5]], 2); // => {labels: [0, 1], centroids: [[0.0, 0.5], [1.0, 0.5]]}

logAverage

The log average is an equivalent way of computing the geometric mean of an array suitable for large or small products.

It's found by calculating the average logarithm of the elements and exponentiating.

logAverage(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more data points

Returns

number: geometric mean

Throws

Error: if x is empty
Error: if x contains a negative number
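
Example

A sketch of the calculation and of why it helps with large products. The name logAverageSketch and the error messages are hypothetical; this is not the library source.

```javascript
// Hypothetical sketch (not the library source): geometric mean computed
// as the exponential of the average logarithm.
function logAverageSketch(x) {
  if (x.length === 0) {
    throw new Error('logAverage requires at least one data point');
  }
  let logSum = 0;
  for (const value of x) {
    if (value < 0) {
      throw new Error('logAverage requires only non-negative numbers');
    }
    logSum += Math.log(value);
  }
  return Math.exp(logSum / x.length);
}

// Geometric mean of 1, 4 and 16 is 4 (up to floating-point rounding).
const result = logAverageSketch([1, 4, 16]);
console.log(Math.abs(result - 4) < 1e-12); // true

// The direct product overflows, while the log average stays finite.
const big = [1e200, 1e200];
console.log(big[0] * big[1]); // Infinity
console.log(Number.isFinite(logAverageSketch(big))); // true
```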

logit

The Logit is the inverse of cumulativeStdLogisticProbability, and is also known as the logistic quantile function.

logit(p: number): number

Parameters

p (number)

Returns

number: logit
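
Example

The inverse relationship can be checked numerically. The sketch below uses the textbook formulas for the standard logistic CDF, 1 / (1 + e^-x), and the logit, ln(p / (1 - p)); the function names are hypothetical and the code is not taken from the library.

```javascript
// Hypothetical sketches (not the library source), using textbook formulas.
function cumulativeStdLogisticProbabilitySketch(x) {
  return 1 / (Math.exp(-x) + 1);
}

function logitSketch(p) {
  return Math.log(p / (1 - p));
}

console.log(logitSketch(0.5)); // 0
console.log(cumulativeStdLogisticProbabilitySketch(0)); // 0.5

// Applying one function after the other recovers the input, up to
// floating-point rounding.
const roundTrip = cumulativeStdLogisticProbabilitySketch(logitSketch(0.9));
console.log(Math.abs(roundTrip - 0.9) < 1e-12); // true
```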

meanSimple

The mean, also known as average, is the sum of all values over the number of values. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The simple mean uses the successive addition method internally to calculate its result. Errors in floating-point addition are not accounted for, so if precision is required, the standard mean method should be used instead.

This runs in O(n), linear time, with respect to the length of the array.

meanSimple(x: Array<number>): number

Parameters

x (Array<number>) sample of one or more data points

Returns

number: mean

Throws

Error: if the length of x is less than one

Example

meanSimple([0, 10]); // => 5

permutationTest

Conducts a permutation test to determine if two data sets are significantly different from each other, using the difference of means between the groups as the test statistic. The function allows for the following hypotheses:

two_tail = Null hypothesis: the two distributions are equal.
greater = Null hypothesis: observations from sampleX tend to be smaller than those from sampleY.
less = Null hypothesis: observations from sampleX tend to be greater than those from sampleY.

Learn more about one-tail vs two-tail tests.

permutationTest(sampleX: Array<number>, sampleY: Array<number>, alternative: string, k: number, randomSource: Function): number

Parameters

sampleX (Array<number>) first dataset (e.g. treatment data)

sampleY (Array<number>) second dataset (e.g. control data)

alternative (string) alternative hypothesis, either 'two_sided' (default), 'greater', or 'less'

k (number) number of values in permutation distribution.

randomSource (Function = Math.random) an optional entropy source

Returns

number: p-value. The probability of observing a difference between groups as or more extreme than the one observed, assuming the null hypothesis.

Example

var control = [2, 5, 3, 6, 7, 2, 5];
var treatment = [20, 5, 13, 12, 7, 2, 2];
permutationTest(control, treatment); // ~0.1324

permutationsHeap

Implementation of Heap's Algorithm for generating permutations.

permutationsHeap(elements: Array): Array<Array>

Parameters

elements (Array) any type of data

Returns

Array<Array>: array of permutations
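
Example

A sketch of the non-recursive form of Heap's algorithm, which produces each new permutation from the previous one with a single swap. The name permutationsHeapSketch is hypothetical and the order of the output is a property of this sketch, not a documented guarantee of the library.

```javascript
// Hypothetical sketch (not the library source): iterative Heap's algorithm.
function permutationsHeapSketch(elements) {
  const items = elements.slice();
  // counters[i] tracks how many swaps have been done at level i.
  const counters = new Array(items.length).fill(0);
  const permutations = [items.slice()];

  let i = 0;
  while (i < items.length) {
    if (counters[i] < i) {
      // Swap with index 0 when i is even, otherwise with counters[i].
      const swapIndex = i % 2 === 0 ? 0 : counters[i];
      const temp = items[swapIndex];
      items[swapIndex] = items[i];
      items[i] = temp;
      permutations.push(items.slice());
      counters[i]++;
      i = 0;
    } else {
      counters[i] = 0;
      i++;
    }
  }
  return permutations;
}

const permutations = permutationsHeapSketch([1, 2, 3]);
console.log(permutations.length); // 6
```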

quantileRankSorted

This function returns the quantile in which one would find the given value in the given array. With a sorted array, leveraging binary search, we can find this information in logarithmic time.

quantileRankSorted(x: Array<number>, value: any): number

Parameters

x (Array<number>) input

value (any)

Returns

number: quantile rank of value, between 0 and 1

Example

quantileRankSorted([1, 2, 3, 4], 3); // => 0.75
quantileRankSorted([1, 2, 3, 3, 4], 3); // => 0.7
quantileRankSorted([1, 2, 3, 4], 6); // => 1
quantileRankSorted([1, 2, 3, 3, 5], 4); // => 0.8
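
The sketch below shows one way to get these results with two binary searches: if the value is absent, its rank is the share of elements below it; if it is present, the 1-based ranks of its occurrences are averaged. All function names are hypothetical, and the tie handling is an assumption chosen to reproduce the examples above rather than a statement about the library's code.

```javascript
// Hypothetical sketch (not the library source).

// Index of the first element >= value in sorted x.
function lowerBound(x, value) {
  let lo = 0;
  let hi = x.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (x[mid] < value) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Index of the first element > value in sorted x.
function upperBound(x, value) {
  let lo = 0;
  let hi = x.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (x[mid] <= value) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

function quantileRankSortedSketch(x, value) {
  const below = lowerBound(x, value);
  const atOrBelow = upperBound(x, value);
  if (below === atOrBelow) {
    // value is absent: its rank is the share of elements below it.
    return below / x.length;
  }
  // value is present: average the 1-based ranks of its occurrences.
  return (below + 1 + atOrBelow) / 2 / x.length;
}

console.log(quantileRankSortedSketch([1, 2, 3, 4], 3)); // 0.75
console.log(quantileRankSortedSketch([1, 2, 3, 3, 4], 3)); // 0.7
```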

quantileRank

This function returns the quantile in which one would find the given value in the given array. It will copy and sort your array before each run, so if you know your array is already sorted, you should use quantileRankSorted instead.

quantileRank(x: Array<number>, value: any): number

Parameters

x (Array<number>) input

value (any)

Returns

number: quantile rank of value, between 0 and 1

Example

quantileRank([4, 3, 1, 2], 3); // => 0.75
quantileRank([4, 3, 2, 3, 1], 3); // => 0.7
quantileRank([2, 4, 1, 3], 6); // => 1
quantileRank([5, 3, 1, 2, 3], 4); // => 0.8

quickselect

Rearrange items in arr so that all items in [left, k] range are the smallest. The k-th element will have the (k - left + 1)-th smallest value in [left, right].

Implements Floyd-Rivest selection algorithm https://en.wikipedia.org/wiki/Floyd-Rivest_algorithm

quickselect(arr: Array<number>, k: number, left: number?, right: number?): void

Parameters

arr (Array<number>) input array

k (number) pivot index

left (number?) left index

right (number?) right index

Returns

void: mutates input array

Example

var arr = [65, 28, 59, 33, 21, 56, 22, 95, 50, 12, 90, 53, 28, 77, 39];
quickselect(arr, 8);
// = [39, 28, 28, 33, 21, 12, 22, 50, 53, 56, 59, 65, 90, 77, 95]

relativeError

Relative error.

This is more difficult to calculate than it first appears [1,2]. The usual formula for the relative error between an actual value A and an expected value E is |(A-E)/E|, but:

If the expected value is 0, any other value has infinite relative error, which is counter-intuitive: if the expected voltage is 0, getting 1/10th of a volt doesn't feel like an infinitely large error.
This formula does not satisfy the mathematical definition of a metric [3]. [4] solved this problem by defining the relative error as |ln(|A/E|)|, but that formula only works if all values are positive: for example, it reports the relative error of -10 and 10 as 0.

Our implementation sticks with convention and returns:

0 if the actual and expected values are both zero
Infinity if the actual value is non-zero and the expected value is zero
|(A-E)/E| in all other cases

[1] https://math.stackexchange.com/questions/677852/how-to-calculate-relative-error-when-true-value-is-zero
[2] https://en.wikipedia.org/wiki/Relative_change_and_difference
[3] https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition
[4] F.W.J. Olver: "A New Approach to Error Arithmetic." SIAM Journal on Numerical Analysis, 15(2), 1978, 10.1137/0715024.

relativeError(actual: number, expected: number): number

Parameters

actual (number) The actual value.

expected (number) The expected value.

Returns

number: The relative error.
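
Example

The three cases listed above translate into a few lines. This is a sketch with a hypothetical name, relativeErrorSketch, written from the description rather than from the library source.

```javascript
// Hypothetical sketch (not the library source) of the three documented cases.
function relativeErrorSketch(actual, expected) {
  // Case 1: both values are zero.
  if (actual === 0 && expected === 0) {
    return 0;
  }
  // Cases 2 and 3: when only expected is zero, the division yields
  // Infinity; otherwise this is the usual |(A-E)/E|.
  return Math.abs((actual - expected) / expected);
}

console.log(relativeErrorSketch(0, 0)); // 0
console.log(relativeErrorSketch(0.1, 0)); // Infinity
console.log(relativeErrorSketch(110, 100)); // 0.1
```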

sampleKurtosis

Kurtosis is a measure of the heaviness of a distribution's tails relative to its variance. The kurtosis value can be positive or negative, or even undefined.

Implementation is based on Fisher's excess kurtosis definition and uses unbiased moment estimators. This is the version found in Excel and available in several statistical packages, including SAS and SciPy.

sampleKurtosis(x: Array<number>): number

Parameters

x (Array<number>) a sample of 4 or more data points

Returns

number: sample kurtosis

Throws

Error: if x has length less than 4

Example

sampleKurtosis([1, 2, 2, 3, 5]); // => 1.4555765595463122
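
For reference, the sketch below uses the common closed form of the bias-corrected excess kurtosis (the one documented for Excel's KURT), which reproduces the example value above. The name sampleKurtosisSketch is hypothetical and the code is not taken from the library.

```javascript
// Hypothetical sketch (not the library source): bias-corrected excess
// kurtosis, using the sample variance s^2 with an n - 1 denominator.
function sampleKurtosisSketch(x) {
  const n = x.length;
  if (n < 4) {
    throw new Error('sampleKurtosis requires at least four data points');
  }
  const mean = x.reduce((acc, v) => acc + v, 0) / n;

  let sumSquaredDeviations = 0;
  let sumFourthDeviations = 0;
  for (const v of x) {
    const deviation = v - mean;
    sumSquaredDeviations += deviation * deviation;
    sumFourthDeviations += deviation * deviation * deviation * deviation;
  }
  const sampleVariance = sumSquaredDeviations / (n - 1);

  // n(n+1) / ((n-1)(n-2)(n-3)) * sum(d^4) / s^4 - 3(n-1)^2 / ((n-2)(n-3))
  return (
    ((n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3))) *
      (sumFourthDeviations / (sampleVariance * sampleVariance)) -
    (3 * (n - 1) * (n - 1)) / ((n - 2) * (n - 3))
  );
}

const kurtosis = sampleKurtosisSketch([1, 2, 2, 3, 5]);
console.log(Math.abs(kurtosis - 1.4555765595463122) < 1e-9); // true
```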

sampleRankCorrelation

The rank correlation is a measure of the strength of the monotonic relationship between two arrays.

sampleRankCorrelation(x: Array<number>, y: Array<number>): number

Parameters

x (Array<number>) first input

y (Array<number>) second input

Returns

number: sample rank correlation

silhouetteMetric

Calculate the silhouette metric for a set of N-dimensional points arranged in groups. The metric is the largest individual silhouette value for the data.

silhouetteMetric(points: Array<Array<number>>, labels: Array<number>): number

Parameters

points (Array<Array<number>>) N-dimensional coordinates of points.

labels (Array<number>) Labels of points. This must be the same length as points, and values must lie in [0..G-1], where G is the number of groups.

Returns

number: The silhouette metric for the groupings.

Example

silhouetteMetric([[0.25], [0.75]], [0, 0]); // => 1.0

silhouette

Calculate the silhouette values for clustered data.

silhouette(points: Array<Array<number>>, labels: Array<number>): Array<number>

Parameters

points (Array<Array<number>>) N-dimensional coordinates of points.

labels (Array<number>) Labels of points. This must be the same length as points, and values must lie in [0..G-1], where G is the number of groups.

Returns

Array<number>: The silhouette value for each point.

Example

silhouette([[0.25], [0.75]], [0, 0]); // => [1.0, 1.0]

subtractFromMean

When removing a value from a list, one does not necessarily have to recompute the mean of the list in linear time. They can instead use this function to compute the new mean by providing the current mean, the number of elements in the list that produced it and the value to remove.

subtractFromMean(mean: number, n: number, value: number): number

Since: 3.0.0

Parameters

mean (number) current mean

n (number) number of items in the list

value (number) the value to remove

Returns

number: the new mean

Example

subtractFromMean(20.5, 6, 53); // => 14
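
The update recovers the list's sum from the mean, removes the value, and divides by the new count. A sketch with the hypothetical name subtractFromMeanSketch:

```javascript
// Hypothetical sketch (not the library source).
function subtractFromMeanSketch(mean, n, value) {
  // mean * n is the original sum; n - 1 is the count after removal.
  return (mean * n - value) / (n - 1);
}

console.log(subtractFromMeanSketch(20.5, 6, 53)); // 14
```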

wilcoxonRankSum

This function calculates the Wilcoxon rank sum statistic for the first sample with respect to the second. The Wilcoxon rank sum test is a non-parametric alternative to the t-test which is equivalent to the Mann-Whitney U test. The statistic is calculated by pooling all the observations together, ranking them, and then summing the ranks associated with one of the samples. If this rank sum is sufficiently large or small we reject the hypothesis that the two samples come from the same distribution in favor of the alternative that one is shifted with respect to the other.

wilcoxonRankSum(sampleX: Array<number>, sampleY: Array<number>): number

Parameters

sampleX (Array<number>) a sample as an array of numbers

sampleY (Array<number>) a sample as an array of numbers

Returns

number: rank sum for sampleX

Example

wilcoxonRankSum([1, 4, 8], [9, 12, 15]); // => 6
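
The pooling, ranking and summing steps can be sketched as below. Giving tied observations the average of their ranks is a common convention and is an assumption here; the name wilcoxonRankSumSketch is hypothetical and the code is not the library source.

```javascript
// Hypothetical sketch (not the library source) of the rank sum statistic.
function wilcoxonRankSumSketch(sampleX, sampleY) {
  // Pool both samples, remembering which sample each value came from.
  const pooled = sampleX
    .map((value) => ({ value, fromX: true }))
    .concat(sampleY.map((value) => ({ value, fromX: false })))
    .sort((a, b) => a.value - b.value);

  let rankSum = 0;
  let i = 0;
  while (i < pooled.length) {
    // Find the run of tied values starting at index i.
    let j = i;
    while (j + 1 < pooled.length && pooled[j + 1].value === pooled[i].value) {
      j++;
    }
    // Assumption: ties share the average of the 1-based ranks i+1..j+1.
    const averageRank = (i + 1 + (j + 1)) / 2;
    for (let k = i; k <= j; k++) {
      if (pooled[k].fromX) {
        rankSum += averageRank;
      }
    }
    i = j + 1;
  }
  return rankSum;
}

console.log(wilcoxonRankSumSketch([1, 4, 8], [9, 12, 15])); // 6
console.log(wilcoxonRankSumSketch([9, 12, 15], [1, 4, 8])); // 15
```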