Merge pull request #65292 from Blargian/doc_varpop_fix

[Docs] Fix nonsense `varPop`, `varPopStable` docs
Alexey Milovidov 2024-06-22 11:15:22 +00:00 committed by GitHub
commit a9ce6845ba
6 changed files with 150 additions and 147 deletions


@ -25,7 +25,7 @@ stddevPop(x)
**Returned value**
Square root of standard deviation of `x`. [Float64](../../data-types/float.md).
- Square root of the population variance of `x`, that is, the population standard deviation. [Float64](../../data-types/float.md).
**Example**


@ -4,30 +4,25 @@ slug: "/en/sql-reference/aggregate-functions/reference/varpop"
sidebar_position: 32
---
This page covers the `varPop` and `varPopStable` functions available in ClickHouse.
## varPop
Calculates the population covariance between two data columns. The population covariance measures the degree to which two variables vary together. Calculates the amount `Σ((x - x̅)^2) / n`, where `n` is the sample size and `x̅`is the average value of `x`.
Calculates the population variance.
**Syntax**
```sql
covarPop(x, y)
varPop(x)
```
Alias: `VAR_POP`.
**Parameters**
- `x`: The first data column. [Numeric](../../../native-protocol/columns.md)
- `y`: The second data column. [Numeric](../../../native-protocol/columns.md)
- `x`: Population of values to find the population variance of. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
Returns an integer of type `Float64`.
**Implementation details**
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the slower but more stable [`varPopStable`](#varpopstable) function.
- Returns the population variance of `x`. [`Float64`](../../data-types/float.md).
**Example**
@ -37,69 +32,21 @@ Query:
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x Int32,
y Int32
x UInt8,
)
ENGINE = Memory;
INSERT INTO test_data VALUES (1, 2), (2, 3), (3, 5), (4, 6), (5, 8);
INSERT INTO test_data VALUES (3), (3), (3), (4), (4), (5), (5), (7), (11), (15);
SELECT
covarPop(x, y) AS covar_pop
varPop(x) AS var_pop
FROM test_data;
```
Result:
```response
3
```
## varPopStable
Calculates population covariance between two data columns using a stable, numerically accurate method to calculate the variance. This function is designed to provide reliable results even with large datasets or values that might cause numerical instability in other implementations.
**Syntax**
```sql
covarPopStable(x, y)
```
**Parameters**
- `x`: The first data column. [String literal](../../syntax#syntax-string-literal)
- `y`: The second data column. [Expression](../../syntax#syntax-expressions)
**Returned value**
Returns an integer of type `Float64`.
**Implementation details**
Unlike [`varPop`](#varpop), this function uses a stable, numerically accurate algorithm to calculate the population variance to avoid issues like catastrophic cancellation or loss of precision. This function also handles `NaN` and `Inf` values correctly, excluding them from calculations.
**Example**
Query:
```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x Int32,
y Int32
)
ENGINE = Memory;
INSERT INTO test_data VALUES (1, 2), (2, 9), (9, 5), (4, 6), (5, 8);
SELECT
covarPopStable(x, y) AS covar_pop_stable
FROM test_data;
```
Result:
```response
0.5999999999999999
┌─var_pop─┐
│ 14.4 │
└─────────┘
```
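As a sanity check, the `14.4` shown above can be reproduced by hand. This assumes the standard definition of population variance, $\frac{1}{n}\sum (x_i - \bar{x})^2$, which is not quoted from this diff, and uses the ten values inserted into `test_data`:

$$
\bar{x} = \frac{3+3+3+4+4+5+5+7+11+15}{10} = \frac{60}{10} = 6
$$

$$
\frac{1}{10}\sum_{i=1}^{10} (x_i - \bar{x})^2 = \frac{9+9+9+4+4+1+1+1+25+81}{10} = \frac{144}{10} = 14.4
$$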


@ -0,0 +1,52 @@
---
title: "varPopStable"
slug: "/en/sql-reference/aggregate-functions/reference/varpopstable"
sidebar_position: 32
---
## varPopStable
Returns the population variance. Unlike [`varPop`](../reference/varpop.md), this function uses a [numerically stable](https://en.wikipedia.org/wiki/Numerical_stability) algorithm. It is slower but produces a smaller computational error.
**Syntax**
```sql
varPopStable(x)
```
Alias: `VAR_POP_STABLE`.
**Parameters**
- `x`: Population of values to find the population variance of. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
- Returns the population variance of `x`. [Float64](../../data-types/float.md).
**Example**
Query:
```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x UInt8,
)
ENGINE = Memory;
INSERT INTO test_data VALUES (3),(3),(3),(4),(4),(5),(5),(7),(11),(15);
SELECT
varPopStable(x) AS var_pop_stable
FROM test_data;
```
Result:
```response
┌─var_pop_stable─┐
│ 14.4 │
└────────────────┘
```
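The trade-off between a fast, unstable algorithm and a slower, stable one can be illustrated outside ClickHouse. The Python sketch below is not ClickHouse's implementation, which this diff does not show. It assumes the textbook one-pass formula `E[x^2] - E[x]^2` as a stand-in for an unstable approach and Welford's algorithm as a stand-in for a stable one. All function names are hypothetical.

```python
def var_pop_naive(values):
    """One-pass E[x^2] - E[x]^2; prone to catastrophic cancellation."""
    n = len(values)
    mean_of_squares = sum(v * v for v in values) / n
    mean = sum(values) / n
    return mean_of_squares - mean * mean


def var_pop_welford(values):
    """Welford's algorithm: updates a running mean and sum of squared deviations."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    count = 0
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return m2 / count


# Same values as in the documentation example.
data = [3, 3, 3, 4, 4, 5, 5, 7, 11, 15]
print(var_pop_naive(data), var_pop_welford(data))

# Shifting every value by a large constant leaves the true variance unchanged,
# but the naive formula now subtracts two numbers near 1e18 and loses all precision.
shifted = [v + 1e9 for v in data]
print(var_pop_naive(shifted), var_pop_welford(shifted))
```

On the original data both functions agree with the documented `14.4` up to floating-point rounding. On the shifted data only the Welford variant stays close to it, which is the kind of input where a stable function is worth its extra cost.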


@ -4,8 +4,6 @@ slug: /en/sql-reference/aggregate-functions/reference/varsamp
sidebar_position: 33
---
This page contains information on the `varSamp` and `varSampStable` ClickHouse functions.
## varSamp
Calculate the sample variance of a data set.
@ -13,24 +11,27 @@ Calculate the sample variance of a data set.
**Syntax**
```sql
varSamp(expr)
varSamp(x)
```
Alias: `VAR_SAMP`.
**Parameters**
- `expr`: An expression representing the data set for which you want to calculate the sample variance. [Expression](../../syntax#syntax-expressions)
- `x`: The sample of values for which you want to calculate the sample variance. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
Returns a Float64 value representing the sample variance of the input data set.
- Returns the sample variance of the input data set `x`. [Float64](../../data-types/float.md).
**Implementation details**
The `varSamp()` function calculates the sample variance using the following formula:
The `varSamp` function calculates the sample variance using the following formula:
```plaintext
∑(x - mean(x))^2 / (n - 1)
```
$$
\frac{\sum (x - \text{mean}(x))^2}{n - 1}
$$
Where:
@ -38,91 +39,29 @@ Where:
- `mean(x)` is the arithmetic mean of the data set.
- `n` is the number of data points in the data set.
The function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use the [`varPop()` function](./varpop#varpop) instead.
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the slower but more stable [`varSampStable`](#varsampstable) function.
The function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use [`varPop`](../reference/varpop.md) instead.
**Example**
Query:
```sql
CREATE TABLE example_table
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
id UInt64,
value Float64
x Float64
)
ENGINE = MergeTree
ORDER BY id;
ENGINE = Memory;
INSERT INTO example_table VALUES (1, 10.5), (2, 12.3), (3, 9.8), (4, 11.2), (5, 10.7);
INSERT INTO test_data VALUES (10.5), (12.3), (9.8), (11.2), (10.7);
SELECT varSamp(value) FROM example_table;
SELECT round(varSamp(x),3) AS var_samp FROM test_data;
```
Response:
```response
0.8650000000000091
┌─var_samp─┐
│ 0.865 │
└──────────┘
```
## varSampStable
Calculate the sample variance of a data set using a numerically stable algorithm.
**Syntax**
```sql
varSampStable(expr)
```
**Parameters**
- `expr`: An expression representing the data set for which you want to calculate the sample variance. [Expression](../../syntax#syntax-expressions)
**Returned value**
The `varSampStable` function returns a Float64 value representing the sample variance of the input data set.
**Implementation details**
The `varSampStable` function calculates the sample variance using the same formula as the [`varSamp`](#varsamp) function:
```plaintext
∑(x - mean(x))^2 / (n - 1)
```
Where:
- `x` is each individual data point in the data set.
- `mean(x)` is the arithmetic mean of the data set.
- `n` is the number of data points in the data set.
The difference between `varSampStable` and `varSamp` is that `varSampStable` is designed to provide a more deterministic and stable result when dealing with floating-point arithmetic. It uses an algorithm that minimizes the accumulation of rounding errors, which can be particularly important when dealing with large data sets or data with a wide range of values.
Like `varSamp`, the `varSampStable` function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use the [`varPopStable`](./varpop#varpopstable) function instead.
**Example**
Query:
```sql
CREATE TABLE example_table
(
id UInt64,
value Float64
)
ENGINE = MergeTree
ORDER BY id;
INSERT INTO example_table VALUES (1, 10.5), (2, 12.3), (3, 9.8), (4, 11.2), (5, 10.7);
SELECT varSampStable(value) FROM example_table;
```
Response:
```response
0.865
```
This query calculates the sample variance of the `value` column in the `example_table` using the `varSampStable()` function. The result shows that the sample variance of the values `[10.5, 12.3, 9.8, 11.2, 10.7]` is approximately 0.865, which may differ slightly from the result of `varSamp` due to the more precise handling of floating-point arithmetic.


@ -0,0 +1,63 @@
---
title: "varSampStable"
slug: /en/sql-reference/aggregate-functions/reference/varsampstable
sidebar_position: 33
---
## varSampStable
Calculates the sample variance of a data set. Unlike [`varSamp`](../reference/varsamp.md), this function uses a numerically stable algorithm. It is slower but produces a smaller computational error.
**Syntax**
```sql
varSampStable(x)
```
Alias: `VAR_SAMP_STABLE`.
**Parameters**
- `x`: The sample of values for which you want to calculate the sample variance. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
- Returns the sample variance of the input data set. [Float64](../../data-types/float.md).
**Implementation details**
The `varSampStable` function calculates the sample variance using the same formula as [`varSamp`](../reference/varsamp.md):
$$
\frac{\sum (x - \text{mean}(x))^2}{n - 1}
$$
Where:
- `x` is each individual data point in the data set.
- `mean(x)` is the arithmetic mean of the data set.
- `n` is the number of data points in the data set.
**Example**
Query:
```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x Float64
)
ENGINE = Memory;
INSERT INTO test_data VALUES (10.5), (12.3), (9.8), (11.2), (10.7);
SELECT round(varSampStable(x),3) AS var_samp_stable FROM test_data;
```
Response:
```response
┌─var_samp_stable─┐
│ 0.865 │
└─────────────────┘
```
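The documented result can be cross-checked by evaluating the sample-variance formula directly. The Python sketch below is only a reviewer's check, not how ClickHouse computes the value. The helper name `sample_variance` is hypothetical, and raising an error for fewer than two values is a choice made for this sketch rather than documented ClickHouse behavior.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


# Same values as in the documentation example.
data = [10.5, 12.3, 9.8, 11.2, 10.7]

result = sample_variance(data)
print(round(result, 3))  # 0.865

# The standard library uses the same (n - 1) denominator.
print(math.isclose(result, statistics.variance(data)))  # True
```

By hand: the mean is 10.9, the squared deviations sum to 3.46, and 3.46 / 4 = 0.865, matching the rounded value in the response above.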


@ -2852,7 +2852,9 @@ variantElement
variantType
varint
varpop
varpopstable
varsamp
varsampstable
vectorized
vectorscan
vendoring