Merge branch 'master' into multi_auth_methods

Arthur Passos 2024-06-22 10:52:47 -03:00
commit 9d19001945
8 changed files with 157 additions and 150 deletions

View File

@@ -25,7 +25,7 @@ stddevPop(x)
**Returned value**
Square root of standard deviation of `x`. [Float64](../../data-types/float.md).
- Square root of the population variance of `x`, i.e. its population standard deviation. [Float64](../../data-types/float.md).
**Example**

View File

@@ -4,30 +4,25 @@ slug: "/en/sql-reference/aggregate-functions/reference/varpop"
sidebar_position: 32
---
This page covers the `varPop` and `varPopStable` functions available in ClickHouse.
## varPop
Calculates the population covariance between two data columns. The population covariance measures the degree to which two variables vary together. Calculates the amount `Σ((x - x̅)^2) / n`, where `n` is the sample size and `x̅`is the average value of `x`.
Calculates the population variance.
**Syntax**
```sql
covarPop(x, y)
varPop(x)
```
Alias: `VAR_POP`.
**Parameters**
- `x`: The first data column. [Numeric](../../../native-protocol/columns.md)
- `y`: The second data column. [Numeric](../../../native-protocol/columns.md)
- `x`: Population of values to find the population variance of. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
Returns an integer of type `Float64`.
**Implementation details**
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the slower but more stable [`varPopStable`](#varpopstable) function.
- Returns the population variance of `x`. [`Float64`](../../data-types/float.md).
**Example**
@@ -37,69 +32,21 @@ Query:
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x Int32,
y Int32
x UInt8
)
ENGINE = Memory;
INSERT INTO test_data VALUES (1, 2), (2, 3), (3, 5), (4, 6), (5, 8);
INSERT INTO test_data VALUES (3), (3), (3), (4), (4), (5), (5), (7), (11), (15);
SELECT
covarPop(x, y) AS covar_pop
varPop(x) AS var_pop
FROM test_data;
```
Result:
```response
3
```
## varPopStable
Calculates population covariance between two data columns using a stable, numerically accurate method to calculate the variance. This function is designed to provide reliable results even with large datasets or values that might cause numerical instability in other implementations.
**Syntax**
```sql
covarPopStable(x, y)
```
**Parameters**
- `x`: The first data column. [String literal](../../syntax#syntax-string-literal)
- `y`: The second data column. [Expression](../../syntax#syntax-expressions)
**Returned value**
Returns an integer of type `Float64`.
**Implementation details**
Unlike [`varPop`](#varpop), this function uses a stable, numerically accurate algorithm to calculate the population variance to avoid issues like catastrophic cancellation or loss of precision. This function also handles `NaN` and `Inf` values correctly, excluding them from calculations.
**Example**
Query:
```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x Int32,
y Int32
)
ENGINE = Memory;
INSERT INTO test_data VALUES (1, 2), (2, 9), (9, 5), (4, 6), (5, 8);
SELECT
covarPopStable(x, y) AS covar_pop_stable
FROM test_data;
```
Result:
```response
0.5999999999999999
┌─var_pop─┐
│ 14.4 │
└─────────┘
```
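As a sanity check on the new example, the value `14.4` follows directly from the definition of population variance (mean of the squared deviations from the mean). The sketch below is a minimal Python reproduction; the helper name `population_variance` is illustrative and not part of ClickHouse.

```python
def population_variance(values):
    """Population variance: mean of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Same ten values as the INSERT statement in the example above.
data = [3, 3, 3, 4, 4, 5, 5, 7, 11, 15]
var_pop = population_variance(data)
print(var_pop)
```

The mean of the ten values is 6, the squared deviations sum to 144, and 144 / 10 gives 14.4, matching the `var_pop` column in the result.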

View File

@@ -0,0 +1,52 @@
---
title: "varPopStable"
slug: "/en/sql-reference/aggregate-functions/reference/varpopstable"
sidebar_position: 32
---
## varPopStable
Returns the population variance. Unlike [`varPop`](../reference/varpop.md), this function uses a [numerically stable](https://en.wikipedia.org/wiki/Numerical_stability) algorithm. It is slower but has a lower computational error.
**Syntax**
```sql
varPopStable(x)
```
Alias: `VAR_POP_STABLE`.
**Parameters**
- `x`: Population of values to find the population variance of. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
- Returns the population variance of `x`. [Float64](../../data-types/float.md).
**Example**
Query:
```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x UInt8
)
ENGINE = Memory;
INSERT INTO test_data VALUES (3),(3),(3),(4),(4),(5),(5),(7),(11),(15);
SELECT
varPopStable(x) AS var_pop_stable
FROM test_data;
```
Result:
```response
┌─var_pop_stable─┐
│ 14.4 │
└────────────────┘
```
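To illustrate what "numerically stable" buys in practice, the following Python sketch contrasts a one-pass sum-of-squares formula with Welford's online algorithm. This is an assumption-laden illustration: the diff does not say which algorithms ClickHouse uses internally, and both function names are hypothetical.

```python
def naive_var_pop(values):
    """One-pass 'sum of squares' formula: E[x^2] - (E[x])^2."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return total_sq / n - (total / n) ** 2


def welford_var_pop(values):
    """Welford's online algorithm for the population variance."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return m2 / count


# Small spread on top of a large offset: the true population variance is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
naive = naive_var_pop(data)
stable = welford_var_pop(data)
print(naive, stable)
```

With values near 1e9, the squares are near 1e18, where adjacent double-precision numbers are more than 100 apart, so the subtraction in the naive formula cannot resolve a variance of 22.5. Welford's update works on deviations from the running mean and recovers 22.5 for this input.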

View File

@@ -4,8 +4,6 @@ slug: /en/sql-reference/aggregate-functions/reference/varsamp
sidebar_position: 33
---
This page contains information on the `varSamp` and `varSampStable` ClickHouse functions.
## varSamp
Calculate the sample variance of a data set.
@@ -13,24 +11,27 @@ Calculate the sample variance of a data set.
**Syntax**
```sql
varSamp(expr)
varSamp(x)
```
Alias: `VAR_SAMP`.
**Parameters**
- `expr`: An expression representing the data set for which you want to calculate the sample variance. [Expression](../../syntax#syntax-expressions)
- `x`: The sample of values for which you want to calculate the sample variance. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
Returns a Float64 value representing the sample variance of the input data set.
- Returns the sample variance of the input data set `x`. [Float64](../../data-types/float.md).
**Implementation details**
The `varSamp()` function calculates the sample variance using the following formula:
The `varSamp` function calculates the sample variance using the following formula:
```plaintext
∑(x - mean(x))^2 / (n - 1)
```
$$
\frac{\sum{(x - \text{mean}(x))^2}}{n - 1}
$$
Where:
@@ -38,91 +39,29 @@ Where:
- `mean(x)` is the arithmetic mean of the data set.
- `n` is the number of data points in the data set.
The function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use the [`varPop()` function](./varpop#varpop) instead.
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the slower but more stable [`varSampStable`](#varsampstable) function.
The function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use [`varPop`](../reference/varpop.md) instead.
**Example**
Query:
```sql
CREATE TABLE example_table
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
id UInt64,
value Float64
x Float64
)
ENGINE = MergeTree
ORDER BY id;
ENGINE = Memory;
INSERT INTO example_table VALUES (1, 10.5), (2, 12.3), (3, 9.8), (4, 11.2), (5, 10.7);
INSERT INTO test_data VALUES (10.5), (12.3), (9.8), (11.2), (10.7);
SELECT varSamp(value) FROM example_table;
SELECT round(varSamp(x),3) AS var_samp FROM test_data;
```
Response:
```response
0.8650000000000091
┌─var_samp─┐
│ 0.865 │
└──────────┘
```
## varSampStable
Calculate the sample variance of a data set using a numerically stable algorithm.
**Syntax**
```sql
varSampStable(expr)
```
**Parameters**
- `expr`: An expression representing the data set for which you want to calculate the sample variance. [Expression](../../syntax#syntax-expressions)
**Returned value**
The `varSampStable` function returns a Float64 value representing the sample variance of the input data set.
**Implementation details**
The `varSampStable` function calculates the sample variance using the same formula as the [`varSamp`](#varsamp) function:
```plaintext
∑(x - mean(x))^2 / (n - 1)
```
Where:
- `x` is each individual data point in the data set.
- `mean(x)` is the arithmetic mean of the data set.
- `n` is the number of data points in the data set.
The difference between `varSampStable` and `varSamp` is that `varSampStable` is designed to provide a more deterministic and stable result when dealing with floating-point arithmetic. It uses an algorithm that minimizes the accumulation of rounding errors, which can be particularly important when dealing with large data sets or data with a wide range of values.
Like `varSamp`, the `varSampStable` function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use the [`varPopStable`](./varpop#varpopstable) function instead.
**Example**
Query:
```sql
CREATE TABLE example_table
(
id UInt64,
value Float64
)
ENGINE = MergeTree
ORDER BY id;
INSERT INTO example_table VALUES (1, 10.5), (2, 12.3), (3, 9.8), (4, 11.2), (5, 10.7);
SELECT varSampStable(value) FROM example_table;
```
Response:
```response
0.865
```
This query calculates the sample variance of the `value` column in the `example_table` using the `varSampStable()` function. The result shows that the sample variance of the values `[10.5, 12.3, 9.8, 11.2, 10.7]` is approximately 0.865, which may differ slightly from the result of `varSamp` due to the more precise handling of floating-point arithmetic.

View File

@ -0,0 +1,63 @@
---
title: "varSampStable"
slug: /en/sql-reference/aggregate-functions/reference/varsampstable
sidebar_position: 33
---
## varSampStable
Calculates the sample variance of a data set. Unlike [`varSamp`](../reference/varsamp.md), this function uses a numerically stable algorithm. It is slower but has a lower computational error.
**Syntax**
```sql
varSampStable(x)
```
Alias: `VAR_SAMP_STABLE`.
**Parameters**
- `x`: The sample of values for which you want to calculate the sample variance. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).
**Returned value**
- Returns the sample variance of the input data set. [Float64](../../data-types/float.md).
**Implementation details**
The `varSampStable` function calculates the sample variance using the same formula as [`varSamp`](../reference/varsamp.md):
$$
\frac{\sum{(x - \text{mean}(x))^2}}{n - 1}
$$
Where:
- `x` is each individual data point in the data set.
- `mean(x)` is the arithmetic mean of the data set.
- `n` is the number of data points in the data set.
**Example**
Query:
```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
x Float64
)
ENGINE = Memory;
INSERT INTO test_data VALUES (10.5), (12.3), (9.8), (11.2), (10.7);
SELECT round(varSampStable(x),3) AS var_samp_stable FROM test_data;
```
Response:
```response
┌─var_samp_stable─┐
│ 0.865 │
└─────────────────┘
```
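The `0.865` in the result can be cross-checked against the sample-variance definition, which divides by `n - 1` rather than `n`. The Python sketch below does this by hand and compares with the standard library's `statistics.variance`; the helper name `sample_variance` is illustrative only.

```python
import statistics


def sample_variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


# Same five values as the INSERT statement in the example above.
data = [10.5, 12.3, 9.8, 11.2, 10.7]
by_hand = round(sample_variance(data), 3)
by_stdlib = round(statistics.variance(data), 3)
print(by_hand, by_stdlib)
```

The mean is 10.9, the squared deviations sum to 3.46, and 3.46 / 4 gives 0.865. Rounding to three places, as the SQL example does with `round(..., 3)`, hides the last-digit floating-point noise.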

View File

@@ -202,7 +202,10 @@ uint64_t readU64(std::string_view & sp)
{
SAFE_CHECK(sp.size() >= N, "underflow");
uint64_t x = 0;
memcpy(&x, sp.data(), N);
if constexpr (std::endian::native == std::endian::little)
memcpy(&x, sp.data(), N);
else
memcpy(reinterpret_cast<char*>(&x) + sizeof(uint64_t) - N, sp.data(), N);
sp.remove_prefix(N);
return x;
}
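
The effect of the endianness branch can be modelled outside C++. The Python sketch below is a hypothetical stand-in, not the project's code: it assumes `N` is a byte count of at most 8 (passed here as the argument `n`) and emulates placing those bytes into a zeroed 8-byte buffer the way each `memcpy` branch does.

```python
import struct


def read_u64_prefix(buf: bytes, n: int, byteorder: str) -> int:
    """Interpret the first n (<= 8) bytes of buf as an unsigned 64-bit value."""
    if not 0 <= n <= 8 or len(buf) < n:
        raise ValueError("underflow")
    padding = b"\x00" * (8 - n)
    if byteorder == "little":
        # Low-order bytes live at the lowest addresses: copy to the start.
        raw, fmt = buf[:n] + padding, "<Q"
    else:
        # Low-order bytes live at the highest addresses: copy to the end.
        raw, fmt = padding + buf[:n], ">Q"
    return struct.unpack(fmt, raw)[0]


little = read_u64_prefix(b"\x01\x02\x03", 3, "little")
big = read_u64_prefix(b"\x01\x02\x03", 3, "big")
print(hex(little), hex(big))
```

Copying to the start of the buffer on a big-endian host, as the single `memcpy` did before this change, would put the bytes in the high-order positions and yield `0x0102030000000000` for this input instead of `0x10203`.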

View File

@@ -1065,18 +1065,19 @@ def main() -> int:
)
# rerun helper check
# FIXME: remove rerun_helper check and rely on ci cache only
if check_name not in (
CI.JobNames.BUILD_CHECK,
): # we might want to rerun build report job
rerun_helper = RerunHelper(commit, check_name_with_group)
if rerun_helper.is_already_finished_by_status():
print("WARNING: Rerunning job with GH status ")
status = rerun_helper.get_finished_status()
assert status
previous_status = status.state
print("::group::Commit Status")
print(status)
print("::endgroup::")
# FIXME: try rerun, even if status is present. To enable manual restart via GH interface
# previous_status = status.state
# ci cache check
if not previous_status and not ci_settings.no_ci_cache:

View File

@@ -2852,7 +2852,9 @@ variantElement
variantType
varint
varpop
varpopstable
varsamp
varsampstable
vectorized
vectorscan
vendoring