# Statistics

This section of the user guide covers the core statistical functions available in math expressions.

## Descriptive Statistics

The `describe` function returns descriptive statistics for a numeric array. The statistics are returned as a single tuple of name/value pairs.

Below is a simple example that selects a random sample of documents, vectorizes the price_f field in the result set and uses the `describe` function to return descriptive statistics about the vector:

``````let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
b=col(a, price_f),
c=describe(b))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"c": {
"sumsq": 4999.041975263254,
"max": 0.99995726,
"var": 0.08344429493940454,
"geometricMean": 0.36696588922559575,
"sum": 7497.460565552007,
"kurtosis": -1.2000739963006035,
"N": 15000,
"min": 0.00012338161,
"mean": 0.49983070437013266,
"popVar": 0.08343873198640858,
"skewness": -0.001735537500095477,
"stdev": 0.28886726179926403
}
},
{
"EOF": true,
"RESPONSE_TIME": 305
}
]
}
}``````
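
To make the fields in the tuple above easier to interpret, here is a minimal Python sketch of how several of them can be computed. It is not Solr code. The function name `describe` is reused only for illustration, and the definitions are assumptions: `var` and `stdev` use the sample form (dividing by N - 1) and `popVar` the population form (dividing by N), which is consistent with the numbers in the response above. Skewness and kurtosis are omitted.

```python
import math
import statistics

def describe(values):
    """Summary statistics for a non-empty list of positive numbers (illustrative only)."""
    n = len(values)
    return {
        "N": n,
        "min": min(values),
        "max": max(values),
        "sum": math.fsum(values),
        "sumsq": math.fsum(v * v for v in values),
        "mean": statistics.fmean(values),
        "var": statistics.variance(values),      # sample variance, divides by N - 1
        "popVar": statistics.pvariance(values),  # population variance, divides by N
        "stdev": statistics.stdev(values),       # square root of the sample variance
        # Geometric mean via the mean of the logs; requires strictly positive values.
        "geometricMean": math.exp(math.fsum(math.log(v) for v in values) / n),
    }

stats = describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(stats["mean"], stats["popVar"], stats["var"])
```

On a large sample, `var` and `popVar` differ only slightly, as they do in the response above.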

## Histograms and Frequency Tables

Histograms and frequency tables are tools for understanding the distribution of a random variable.

The `hist` function creates a histogram designed for usage with continuous data. The `freqTable` function creates a frequency table for use with discrete data.

### Histograms

Below is an example that selects a random sample, creates a vector from the result set and uses the `hist` function to return a histogram with 5 bins. The `hist` function returns a list of tuples with summary statistics for each bin.

``````let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
b=col(a, price_f),
c=hist(b, 5))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"c": [
{
"prob": 0.2057939717603699,
"min": 0.000010371208,
"max": 0.19996578,
"mean": 0.10010319358402578,
"var": 0.003366805016271609,
"cumProb": 0.10293732468049072,
"sum": 309.0185585938884,
"stdev": 0.058024176136086666,
"N": 3087
},
{
"prob": 0.19381868629885585,
"min": 0.20007741,
"max": 0.3999073,
"mean": 0.2993590803885827,
"var": 0.003401644034068929,
"cumProb": 0.3025295802728267,
"sum": 870.5362057700005,
"stdev": 0.0583236147205309,
"N": 2908
},
{
"prob": 0.20565789836690007,
"min": 0.39995712,
"max": 0.5999038,
"mean": 0.4993620963792545,
"var": 0.0033158364923609046,
"cumProb": 0.5023006239697967,
"sum": 1540.5320673300018,
"stdev": 0.05758330046429177,
"N": 3085
},
{
"prob": 0.19437108496008693,
"min": 0.6000449,
"max": 0.79973197,
"mean": 0.7001752711861512,
"var": 0.0033895105082360185,
"cumProb": 0.7026537198687285,
"sum": 2042.4112660500066,
"stdev": 0.058219502816805456,
"N": 2917
},
{
"prob": 0.20019582213899467,
"min": 0.7999126,
"max": 0.99987316,
"mean": 0.8985428275824184,
"var": 0.003312360017780078,
"cumProb": 0.899450457219298,
"sum": 2698.3241112299997,
"stdev": 0.05755310606544253,
"N": 3003
}
]
},
{
"EOF": true,
"RESPONSE_TIME": 322
}
]
}
}``````

The `col` function can be used to vectorize a column of data from the list of tuples returned by the `hist` function.

In the example below, the N field, which is the number of observations in each bin, is returned as a vector.

``````let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
b=col(a, price_f),
c=hist(b, 11),
d=col(c, N))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"d": [
1387,
1396,
1391,
1357,
1384,
1360,
1367,
1375,
1307,
1310,
1366
]
},
{
"EOF": true,
"RESPONSE_TIME": 307
}
]
}
}``````
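
The per-bin counts can be pictured with a short Python sketch of equal-width binning. This is an illustration under stated assumptions, not the `hist` implementation: the helper `histogram_counts` is hypothetical, the bins are assumed to span the minimum to the maximum of the data, and the maximum value is placed in the last bin. How `hist` treats values that fall exactly on a bin boundary is not stated in this guide.

```python
def histogram_counts(values, bins):
    """Count observations in `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical, so they all fall into the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would compute index == bins, so clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

counts = histogram_counts([1, 2, 2, 3, 5, 6, 8, 9, 10, 11], 5)
print(counts)  # [3, 1, 2, 1, 3]
```

The counts always sum to the number of observations, which is a quick sanity check for any binning scheme.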

### Frequency Tables

The `freqTable` function returns a frequency distribution for a discrete data set. The `freqTable` function doesn’t create bins like the histogram. Instead, it counts the occurrences of each discrete data value and returns a list of tuples with the frequency statistics for each value. Fields from a frequency table can be vectorized using the `col` function in the same manner as a histogram.

Below is a simple example of a frequency table built from a random sample of a discrete variable.

``````let(a=random(collection1, q="*:*", rows="15000", fl="day_i"),
b=col(a, day_i),
c=freqTable(b))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"c": [
{
"pct": 0.0318,
"count": 477,
"cumFreq": 477,
"cumPct": 0.0318,
"value": 0
},
{
"pct": 0.033133333333333334,
"count": 497,
"cumFreq": 974,
"cumPct": 0.06493333333333333,
"value": 1
},
{
"pct": 0.03426666666666667,
"count": 514,
"cumFreq": 1488,
"cumPct": 0.0992,
"value": 2
},
{
"pct": 0.0346,
"count": 519,
"cumFreq": 2007,
"cumPct": 0.1338,
"value": 3
},
{
"pct": 0.03133333333333333,
"count": 470,
"cumFreq": 2477,
"cumPct": 0.16513333333333333,
"value": 4
},
{
"pct": 0.03333333333333333,
"count": 500,
"cumFreq": 2977,
"cumPct": 0.19846666666666668,
"value": 5
}
]
},
{
"EOF": true,
"RESPONSE_TIME": 281
}
]
}
}``````
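
The relationship between the fields in each tuple can be sketched in Python. This is an illustration rather than Solr code, and the helper `freq_table` is hypothetical. It assumes that `pct` is the count divided by the total number of observations and that `cumFreq` and `cumPct` are running totals in ascending order of value, which matches the figures in the response above (for example, 477 / 15000 = 0.0318).

```python
from collections import Counter

def freq_table(values):
    """Build frequency rows for discrete values, ordered by value."""
    total = len(values)
    rows = []
    running = 0
    for value, count in sorted(Counter(values).items()):
        running += count
        rows.append({
            "value": value,
            "count": count,
            "pct": count / total,       # share of all observations
            "cumFreq": running,         # running count up to and including this value
            "cumPct": running / total,  # running share
        })
    return rows

table = freq_table([0, 1, 1, 2, 2, 2, 3, 3, 3, 3])
for row in table:
    print(row)
```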

## Percentiles

The `percentile` function returns the estimated value for a specific percentile in a sample set. The example below returns the estimate for the 95th percentile of the price_f field.

``````let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
b=col(a, price_f),
c=percentile(b, 95))``````

When this expression is sent to the `/stream` handler it responds with:

`````` {
"result-set": {
"docs": [
{
"c": 312.94
},
{
"EOF": true,
"RESPONSE_TIME": 286
}
]
}
}``````

The `percentile` function also operates on an array of percentile values. The example below computes the 20th, 40th, 60th and 80th percentiles for a random sample of the response_d field:

``````let(a=random(collection2, q="*:*", rows="15000", fl="response_d"),
b=col(a, response_d),
c=percentile(b, array(20,40,60,80)))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"c": [
818.0835543394625,
843.5590348165282,
866.1789509894824,
892.5033386599067
]
},
{
"EOF": true,
"RESPONSE_TIME": 291
}
]
}
}``````
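
One common way to estimate a percentile from a sample is to sort the values and interpolate linearly between neighboring order statistics. The Python sketch below shows that approach. It is an assumption for illustration only: this guide does not state which estimation method `percentile` uses, so results from Solr may differ slightly from this sketch.

```python
def percentile(values, p):
    """Estimate the p-th percentile (0 <= p <= 100) by linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the requested percentile within the sorted sample.
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

sample = [50, 10, 40, 20, 30]
median = percentile(sample, 50)
several = [percentile(sample, p) for p in (20, 40, 60, 80)]
print(median, several)
```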

## Covariance and Correlation

Covariance and correlation measure how random variables move together.

### Covariance and Covariance Matrices

The `cov` function calculates the covariance of two sample sets of data.

In the example below, covariance is calculated for two numeric arrays created by the `array` function. It's important to note that vectorized data from SolrCloud collections can be used with any function that operates on arrays.

``````let(a=array(1, 2, 3, 4, 5),
b=array(100, 200, 300, 400, 500),
c=cov(a, b))``````

When this expression is sent to the `/stream` handler it responds with:

`````` {
"result-set": {
"docs": [
{
"c": 250
},
{
"EOF": true,
"RESPONSE_TIME": 286
}
]
}
}``````

If a matrix is passed to the `cov` function it will automatically compute a covariance matrix for the columns of the matrix.

Notice that in the example below, three numeric arrays are added as rows in a matrix. The matrix is then transposed to turn the rows into columns, and the covariance matrix is computed for the columns of the matrix.

``````let(a=array(1, 2, 3, 4, 5),
b=array(100, 200, 300, 400, 500),
c=array(30, 40, 80, 90, 110),
d=transpose(matrix(a, b, c)),
e=cov(d))``````

When this expression is sent to the `/stream` handler it responds with:

`````` {
"result-set": {
"docs": [
{
"e": [
[
2.5,
250,
52.5
],
[
250,
25000,
5250
],
[
52.5,
5250,
1150
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 2
}
]
}
}``````
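
The covariance matrix above can be checked by hand. The Python sketch below recomputes it for the same three arrays, assuming the sample form of covariance (dividing by N - 1). The helpers `cov` and `cov_matrix` are hypothetical names for this illustration and are not Solr code.

```python
def cov(x, y):
    """Sample covariance of two equal-length sequences (divides by N - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def cov_matrix(columns):
    """Pairwise covariance of every column with every other column."""
    return [[cov(ci, cj) for cj in columns] for ci in columns]

a = [1, 2, 3, 4, 5]
b = [100, 200, 300, 400, 500]
c = [30, 40, 80, 90, 110]
matrix = cov_matrix([a, b, c])
for row in matrix:
    print(row)
```

The diagonal holds the variance of each column, and the matrix is symmetric because `cov(x, y)` equals `cov(y, x)`.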

### Correlation and Correlation Matrices

Correlation is a measure of covariance that has been scaled between -1 and 1.

Three correlation types are supported:

• pearsons (default)

• kendalls

• spearmans

The type of correlation is specified by adding the `type` named parameter in the function call. The example below demonstrates the use of the `type` named parameter.

``````let(a=array(1, 2, 3, 4, 5),
b=array(100, 200, 300, 400, 5000),
c=corr(a, b, type=pearsons))``````

When this expression is sent to the `/stream` handler it responds with:

`````` {
"result-set": {
"docs": [
{
"c": 0.7432941462471664
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}``````

Like the `cov` function, the `corr` function automatically builds a correlation matrix if a matrix is passed as a parameter. The correlation matrix is built by correlating the columns of the matrix passed in.
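
The difference between the correlation types can be illustrated with a short Python sketch. It is not Solr code, and the helper names are hypothetical. Pearson's correlation is computed on the raw values, while Spearman's is sketched here as Pearson's correlation of the ranks, assuming there are no tied values. The sample data is monotonic but not linear, so the two measures disagree.

```python
import math

def pearson(x, y):
    """Pearson's correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank each value from 1..n in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's correlation: Pearson's correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # increases with x, but not along a straight line
print(pearson(x, y), spearman(x, y))
```

Because rank-based measures depend only on ordering, they are less sensitive to outliers and to non-linear but monotonic relationships.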

## Statistical Inference Tests

Statistical inference tests test a hypothesis on random samples and return p-values, which can be used to infer whether the result is likely to hold for the entire population.

The following statistical inference tests are available:

• `anova`: One-way ANOVA tests if there is a statistically significant difference in the means of two or more random samples.

• `ttest`: The T-test tests if there is a statistically significant difference in the means of two random samples.

• `pairedTtest`: The paired t-test tests if there is a statistically significant difference in the means of two random samples with paired data.

• `gTestDataSet`: The G-test tests if two samples of binned discrete data were drawn from the same population.

• `chiSquareDataset`: The Chi-Squared test tests if two samples of binned discrete data were drawn from the same population.

• `mannWhitney`: The Mann-Whitney test is a non-parametric test that tests if two samples of continuous data were drawn from the same population. The Mann-Whitney test is often used instead of the T-test when the underlying assumptions of the T-test are not met.

• `ks`: The Kolmogorov-Smirnov test tests if two samples of continuous data were drawn from the same distribution.

Below is a simple example of a T-test performed on two random samples. The returned p-value of .93 means we cannot reject the null hypothesis that the two samples have the same mean. In other words, the test finds no statistically significant difference in the means.

``````let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
b=random(collection1, q="*:*", rows="1500", fl="price_f"),
c=col(a, price_f),
d=col(b, price_f),
e=ttest(c, d))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"e": {
"p-value": 0.9350135639249795,
"t-statistic": 0.081545541074817
}
},
{
"EOF": true,
"RESPONSE_TIME": 48
}
]
}
}``````
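
The `t-statistic` field can be understood as the difference in sample means divided by its estimated standard error. The Python sketch below computes it in the unequal-variance (Welch) form. This is an assumption for illustration: the guide does not state which variant `ttest` uses, and the helper `welch_t` is a hypothetical name. The p-value is omitted because it requires the cumulative distribution function of the t-distribution, which is not in the Python standard library.

```python
import math
import statistics

def welch_t(x, y):
    """t-statistic for a difference in means, without assuming equal variances."""
    # Squared standard error of each sample mean: sample variance divided by sample size.
    se_x = statistics.variance(x) / len(x)
    se_y = statistics.variance(y) / len(y)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(se_x + se_y)

first = [1, 2, 3, 4, 5]
second = [2, 3, 4, 5, 6]
t = welch_t(first, second)
print(t)
```

A t-statistic near zero, as in the response above, indicates that the observed difference in means is small relative to the variability of the samples.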

## Transformations

In statistical analysis it's often useful to transform data sets before performing statistical calculations. The statistical function library includes the following commonly used transformations:

• `rank`: Returns a numeric array with the rank-transformed value of each element of the original array.

• `log`: Returns a numeric array with the natural log of each element of the original array.

• `log10`: Returns a numeric array with the base 10 log of each element of the original array.

• `sqrt`: Returns a numeric array with the square root of each element of the original array.

• `cbrt`: Returns a numeric array with the cube root of each element of the original array.

• `recip`: Returns a numeric array with the reciprocal of each element of the original array.

Below is an example of a T-test performed on log-transformed data sets:

``````let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
b=random(collection1, q="*:*", rows="1500", fl="price_f"),
c=log(col(a, price_f)),
d=log(col(b, price_f)),
e=ttest(c, d))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"e": {
"p-value": 0.9655110070265056,
"t-statistic": -0.04324265449471238
}
},
{
"EOF": true,
"RESPONSE_TIME": 58
}
]
}
}``````

## Back Transformations

Vectors that have been transformed with the `log`, `log10`, `sqrt` and `cbrt` functions can be back-transformed using the `pow` function.

The example below shows how to back transform data that has been transformed by the `sqrt` function.

``````let(echo="b,c",
a=array(100, 200, 300),
b=sqrt(a),
c=pow(b, 2))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"b": [
10,
14.142135623730951,
17.320508075688775
],
"c": [
100,
200.00000000000003,
300.00000000000006
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}``````

The example below shows how to back transform data that has been transformed by the `log10` function.

``````let(echo="b,c",
a=array(100, 200, 300),
b=log10(a),
c=pow(10, b))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"b": [
2,
2.3010299956639813,
2.4771212547196626
],
"c": [
100,
200.00000000000003,
300.0000000000001
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}``````

Vectors that have been transformed with the `recip` function can be back-transformed by taking the reciprocal of the reciprocal.

The example below shows the back-transformation of data that has been transformed by the `recip` function.

``````let(echo="b,c",
a=array(100, 200, 300),
b=recip(a),
c=recip(b))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"b": [
0.01,
0.005,
0.0033333333333333335
],
"c": [
100,
200,
300
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}``````
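
The small discrepancies in the responses above, such as 200.00000000000003, come from floating-point rounding rather than from the back-transformation itself. The Python sketch below round-trips the same three values through each transformation and its inverse. It is an illustration only: the inverses are written with Python operators, not Solr functions, and the exact digits may differ from the Solr output.

```python
import math

a = [100.0, 200.0, 300.0]

# Each list applies a transformation and then its inverse to every element.
sqrt_round_trip = [math.sqrt(v) ** 2 for v in a]
cbrt_round_trip = [(v ** (1.0 / 3.0)) ** 3 for v in a]
log10_round_trip = [10 ** math.log10(v) for v in a]
log_round_trip = [math.exp(math.log(v)) for v in a]
recip_round_trip = [1 / (1 / v) for v in a]

def close_to_original(values):
    """True when every round-tripped value matches the original within rounding error."""
    return all(math.isclose(v, original) for v, original in zip(values, a))

print(close_to_original(sqrt_round_trip), close_to_original(log10_round_trip))
```

When comparing back-transformed values with the originals, a tolerance-based comparison is safer than exact equality.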

## Z-scores

The `zscores` function converts a numeric array to an array of z-scores. The z-score is the number of standard deviations a number is from the mean.

The example below computes the z-scores for the values in an array.

``````let(a=array(1,2,3),
b=zscores(a))``````

When this expression is sent to the `/stream` handler it responds with:

``````{
"result-set": {
"docs": [
{
"b": [
-1,
0,
1
]
},
{
"EOF": true,
"RESPONSE_TIME": 27
}
]
}
}``````
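
The calculation can be sketched in Python as follows. This is not Solr code, and the function name is reused only for illustration. The sketch assumes the sample standard deviation (dividing by N - 1), which reproduces the response above for the array 1, 2, 3.

```python
import statistics

def zscores(values):
    """Convert each value to its distance from the mean in sample standard deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation, divides by N - 1
    return [(v - mean) / stdev for v in values]

b = zscores([1, 2, 3])
print(b)  # [-1.0, 0.0, 1.0]
```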