ClickHouse/docs/en/sql-reference/aggregate-functions/reference/kolmogorovsmirnovtest.md
Nikita Mikhaylov da72eb630e Done
2023-04-27 18:14:46 +02:00

slug: /en/sql-reference/aggregate-functions/reference/kolmogorovsmirnovtest
sidebar_position: 300
sidebar_label: kolmogorovSmirnovTest

kolmogorovSmirnovTest

Applies Kolmogorov-Smirnov's test to samples from two populations.

Syntax

kolmogorovSmirnovTest([alternative, computation_method])(sample_data, sample_index)

Values of both samples are in the sample_data column. If sample_index equals 0, the value in that row belongs to the sample from the first population. Otherwise, it belongs to the sample from the second population. Samples must come from continuous, one-dimensional probability distributions.
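The row layout described above (one column of values, one column marking which sample each value belongs to) can be illustrated outside ClickHouse. The Python sketch below is not ClickHouse's implementation. It assumes the textbook definition of the two-sided two-sample statistic, the largest absolute gap between the two empirical CDFs, and the name ks_statistic is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(rows):
    """Two-sided two-sample KS statistic from (sample_data, sample_index) rows."""
    # Split rows as the text describes: index 0 -> first sample, anything else -> second.
    first = sorted(v for v, idx in rows if idx == 0)
    second = sorted(v for v, idx in rows if idx != 0)
    if not first or not second:
        raise ValueError("both samples must be non-empty")

    def ecdf(sample, x):
        # Empirical CDF: fraction of observations less than or equal to x.
        return bisect_right(sample, x) / len(sample)

    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    return max(abs(ecdf(first, x) - ecdf(second, x)) for x in first + second)


# Illustrative data only: three values per sample.
rows = [(1.0, 0), (2.0, 0), (3.0, 0), (2.5, 1), (3.5, 1), (4.5, 1)]
print(ks_statistic(rows))
```

The statistic is the first element of the tuple the function returns in ClickHouse; computing the p-value from it is a separate step.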

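Arguments

  • sample_data - sample data (the values of both samples). Integer, Float or Decimal.
  • sample_index - sample index (0 for the first sample, any other value for the second). Integer.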

Parameters

  • alternative — alternative hypothesis. (Optional, default: 'two-sided'.) String. Let F(x) and G(x) be the CDFs of the first and second distributions respectively.
    • 'two-sided' - the null hypothesis is that both samples come from the same distribution, i.e. F(x) = G(x) for all x. The alternative is that the distributions are not identical.
    • 'greater' - the null hypothesis is that values in the first sample are stochastically smaller than those in the second one, i.e. the CDF of the first distribution lies above, and hence to the left of, that of the second one. This means that F(x) >= G(x) for all x. The alternative in this case is that F(x) < G(x) for at least one x.
    • 'less' - the null hypothesis is that values in the first sample are stochastically greater than those in the second one, i.e. the CDF of the first distribution lies below, and hence to the right of, that of the second one. This means that F(x) <= G(x) for all x. The alternative in this case is that F(x) > G(x) for at least one x.
  • computation_method — the method used to compute p-value. (Optional, default: 'auto'.) String.
    • 'exact' - the p-value is calculated using the exact probability distribution of the test statistic. Computationally intensive and wasteful except for small samples.
    • 'asymp' ('asymptotic') - the p-value is calculated using an approximation. For large sample sizes, the exact and asymptotic p-values are very similar.
    • 'auto' - the 'exact' method is used when the size of the larger sample is less than 10,000.
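To give a feel for what an asymptotic approximation looks like, the Python sketch below uses the classical limiting distributions from the statistics literature (Kolmogorov's series for the two-sided statistic, Smirnov's exponential for a one-sided statistic). This is an assumption for illustration: the source does not say which formulas ClickHouse uses internally, and all function names here are hypothetical. The sketch also deliberately does not map D_plus and D_minus to the 'greater' and 'less' parameter values.

```python
import math


def ks_directional(first, second):
    """Return (D_plus, D_minus) = (max(F - G), max(G - F)) for empirical CDFs F, G."""
    n, m = len(first), len(second)
    d_plus = d_minus = 0.0
    for x in set(first) | set(second):
        f = sum(v <= x for v in first) / n   # empirical CDF of the first sample at x
        g = sum(v <= x for v in second) / m  # empirical CDF of the second sample at x
        d_plus = max(d_plus, f - g)
        d_minus = max(d_minus, g - f)
    return d_plus, d_minus


def asymptotic_p_two_sided(d, n, m, terms=100):
    """Kolmogorov's limit: p ~ 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here, and the limit is 1 to within about 1e-12.
        return 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, p))


def asymptotic_p_one_sided(d, n, m):
    """Smirnov's limit for a one-sided statistic: p ~ exp(-2 * lam^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    return math.exp(-2.0 * lam * lam)


d_plus, d_minus = ks_directional([1.0, 2.0, 3.0], [2.5, 3.5, 4.5])
print(d_plus, d_minus)
print(asymptotic_p_two_sided(max(d_plus, d_minus), 3, 3))
```

With only three values per sample, as in the usage lines, an asymptotic p-value is a poor approximation; the approximation is meant for large samples, which is why an exact method exists for small ones.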

Returned values

Tuple with two elements:

  • calculated statistic. Float64.
  • calculated p-value. Float64.

Example

Query:

SELECT kolmogorovSmirnovTest('less', 'exact')(value, num)
FROM
(
    SELECT
        randNormal(0, 10) AS value,
        0 AS num
    FROM numbers(10000)
    UNION ALL
    SELECT
        randNormal(0, 10) AS value,
        1 AS num
    FROM numbers(10000)
)

Result:

┌─kolmogorovSmirnovTest('less', 'exact')(value, num)─┐
│ (0.009899999999999996,0.37528595205132287)         │
└────────────────────────────────────────────────────┘

Note: The p-value is greater than 0.05 (the threshold for a 95% confidence level), so the null hypothesis is not rejected.

Query:

SELECT kolmogorovSmirnovTest('two-sided', 'exact')(value, num)
FROM
(
    SELECT
        randStudentT(10) AS value,
        0 AS num
    FROM numbers(100)
    UNION ALL
    SELECT
        randNormal(0, 10) AS value,
        1 AS num
    FROM numbers(100)
)

Result:

┌─kolmogorovSmirnovTest('two-sided', 'exact')(value, num)─┐
│ (0.4100000000000002,6.61735760482795e-8)                │
└─────────────────────────────────────────────────────────┘

Note: The p-value is less than 0.05 (the threshold for a 95% confidence level), so the null hypothesis is rejected.

See Also