Merge branch 'master' into extract-text-from-html

This commit is contained in:
Alexey Milovidov 2021-02-28 04:03:32 +03:00
commit 107d8ec811
21 changed files with 532 additions and 39 deletions

View File

@ -18,7 +18,8 @@ RUN apt-get update \
curl \
tar \
krb5-user \
iproute2
iproute2 \
lsof
RUN rm -rf \
/var/lib/apt/lists/* \
/var/cache/debconf \

File diff suppressed because one or more lines are too long

View File

@ -20,5 +20,6 @@ The list of documented datasets:
- [Terabyte of Click Logs from Criteo](../../getting-started/example-datasets/criteo.md)
- [AMPLab Big Data Benchmark](../../getting-started/example-datasets/amplab-benchmark.md)
- [Brown University Benchmark](../../getting-started/example-datasets/brown-benchmark.md)
- [Cell Towers](../../getting-started/example-datasets/cell-towers.md)
[Original article](https://clickhouse.tech/docs/en/getting_started/example_datasets) <!--hide-->

View File

@ -9,7 +9,7 @@ Hash functions can be used for the deterministic pseudo-random shuffling of elem
## halfMD5 {#hash-functions-halfmd5}
[Interprets](../../sql-reference/functions/type-conversion-functions.md#type_conversion_function-reinterpretAsString) all the input parameters as strings and calculates the [MD5](https://en.wikipedia.org/wiki/MD5) hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the resulting string, and interprets them as `UInt64` in big-endian byte order.
[Interprets](../../sql-reference/functions/type-conversion-functions.md#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the [MD5](https://en.wikipedia.org/wiki/MD5) hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the resulting string, and interprets them as `UInt64` in big-endian byte order.
``` sql
halfMD5(par1, ...)
@ -54,7 +54,7 @@ sipHash64(par1,...)
This is a cryptographic hash function. It works at least three times faster than the [MD5](#hash_functions-md5) function.
Function [interprets](../../sql-reference/functions/type-conversion-functions.md#type_conversion_function-reinterpretAsString) all the input parameters as strings and calculates the hash value for each of them. Then combines hashes by the following algorithm:
Function [interprets](../../sql-reference/functions/type-conversion-functions.md#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the hash value for each of them. Then combines hashes by the following algorithm:
1. After hashing all the input parameters, the function gets the array of hashes.
2. Function takes the first and the second elements and calculates a hash for the array of them.

View File

@ -174,4 +174,129 @@ Result:
└──────────────────────────────┴───────────────────────────────────┘
```
## mapContains {#mapcontains}
Determines whether the `map` contains the `key` parameter.
**Syntax**
``` sql
mapContains(map, key)
```
**Parameters**
- `map` — Map. [Map](../../sql-reference/data-types/map.md).
- `key` — Key. Type matches the type of keys of `map` parameter.
**Returned value**
- `1` if `map` contains `key`, `0` if not.
Type: [UInt8](../../sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
CREATE TABLE test (a Map(String,String)) ENGINE = Memory;
INSERT INTO test VALUES ({'name':'eleven','age':'11'}), ({'number':'twelve','position':'6.0'});
SELECT mapContains(a, 'name') FROM test;
```
Result:
```text
┌─mapContains(a, 'name')─┐
│ 1 │
│ 0 │
└────────────────────────┘
```
## mapKeys {#mapkeys}
Returns all keys from the `map` parameter.
**Syntax**
```sql
mapKeys(map)
```
**Parameters**
- `map` — Map. [Map](../../sql-reference/data-types/map.md).
**Returned value**
- Array containing all keys from the `map`.
Type: [Array](../../sql-reference/data-types/array.md).
**Example**
Query:
```sql
CREATE TABLE test (a Map(String,String)) ENGINE = Memory;
INSERT INTO test VALUES ({'name':'eleven','age':'11'}), ({'number':'twelve','position':'6.0'});
SELECT mapKeys(a) FROM test;
```
Result:
```text
┌─mapKeys(a)────────────┐
│ ['name','age'] │
│ ['number','position'] │
└───────────────────────┘
```
## mapValues {#mapvalues}
Returns all values from the `map` parameter.
**Syntax**
```sql
mapKeys(map)
```
**Parameters**
- `map` — Map. [Map](../../sql-reference/data-types/map.md).
**Returned value**
- Array containing all the values from `map`.
Type: [Array](../../sql-reference/data-types/array.md).
**Example**
Query:
```sql
CREATE TABLE test (a Map(String,String)) ENGINE = Memory;
INSERT INTO test VALUES ({'name':'eleven','age':'11'}), ({'number':'twelve','position':'6.0'});
SELECT mapValues(a) FROM test;
```
Result:
```text
┌─mapValues(a)─────┐
│ ['eleven','11'] │
│ ['twelve','6.0'] │
└──────────────────┘
```
[Original article](https://clickhouse.tech/docs/en/sql-reference/functions/tuple-map-functions/) <!--hide-->

View File

@ -81,5 +81,5 @@ The `TTL` is no longer there, so the second row is not deleted:
### See Also
- More about the [TTL-expression](../../../sql-reference/statements/create/table#ttl-expression).
- Modify column [with TTL](../../../sql-reference/statements/alter/column#alter_modify-column).
- More about the [TTL-expression](../../../../sql-reference/statements/create/table#ttl-expression).
- Modify column [with TTL](../../../../sql-reference/statements/alter/column#alter_modify-column).

View File

@ -22,6 +22,7 @@ toc_title: "\u041a\u043b\u0438\u0435\u043d\u0442\u0441\u043a\u0438\u0435\u0020\u
- [seva-code/php-click-house-client](https://packagist.org/packages/seva-code/php-click-house-client)
- [SeasClick C++ client](https://github.com/SeasX/SeasClick)
- [glushkovds/phpclickhouse-laravel](https://packagist.org/packages/glushkovds/phpclickhouse-laravel)
- [kolya7k ClickHouse PHP extension](https://github.com//kolya7k/clickhouse-php)
- Go
- [clickhouse](https://github.com/kshvakov/clickhouse/)
- [go-clickhouse](https://github.com/roistat/go-clickhouse)

View File

@ -176,4 +176,129 @@ select mapPopulateSeries([1,2,4], [11,22,44], 5) as res, toTypeName(res) as type
└──────────────────────────────┴───────────────────────────────────┘
```
[Оригинальная статья](https://clickhouse.tech/docs/en/query_language/functions/tuple-map-functions/) <!--hide-->
## mapContains {#mapcontains}
Определяет, содержит ли контейнер `map` ключ `key`.
**Синтаксис**
``` sql
mapContains(map, key)
```
**Параметры**
- `map` — контейнер Map. [Map](../../sql-reference/data-types/map.md).
- `key` — ключ. Тип соответстует типу ключей параметра `map`.
**Возвращаемое значение**
- `1` если `map` включает `key`, иначе `0`.
Тип: [UInt8](../../sql-reference/data-types/int-uint.md).
**Пример**
Запрос:
```sql
CREATE TABLE test (a Map(String,String)) ENGINE = Memory;
INSERT INTO test VALUES ({'name':'eleven','age':'11'}), ({'number':'twelve','position':'6.0'});
SELECT mapContains(a, 'name') FROM test;
```
Результат:
```text
┌─mapContains(a, 'name')─┐
│ 1 │
│ 0 │
└────────────────────────┘
```
## mapKeys {#mapkeys}
Возвращает все ключи контейнера `map`.
**Синтаксис**
```sql
mapKeys(map)
```
**Параметры**
- `map` — контейнер Map. [Map](../../sql-reference/data-types/map.md).
**Возвращаемое значение**
- Массив со всеми ключами контейнера `map`.
Тип: [Array](../../sql-reference/data-types/array.md).
**Пример**
Запрос:
```sql
CREATE TABLE test (a Map(String,String)) ENGINE = Memory;
INSERT INTO test VALUES ({'name':'eleven','age':'11'}), ({'number':'twelve','position':'6.0'});
SELECT mapKeys(a) FROM test;
```
Результат:
```text
┌─mapKeys(a)────────────┐
│ ['name','age'] │
│ ['number','position'] │
└───────────────────────┘
```
## mapValues {#mapvalues}
Возвращает все значения контейнера `map`.
**Синтаксис**
```sql
mapKeys(map)
```
**Параметры**
- `map` — контейнер Map. [Map](../../sql-reference/data-types/map.md).
**Возвращаемое значение**
- Массив со всеми значениями контейнера `map`.
Тип: [Array](../../sql-reference/data-types/array.md).
**Примеры**
Запрос:
```sql
CREATE TABLE test (a Map(String,String)) ENGINE = Memory;
INSERT INTO test VALUES ({'name':'eleven','age':'11'}), ({'number':'twelve','position':'6.0'});
SELECT mapValues(a) FROM test;
```
Результат:
```text
┌─mapValues(a)─────┐
│ ['eleven','11'] │
│ ['twelve','6.0'] │
└──────────────────┘
```
[Оригинальная статья](https://clickhouse.tech/docs/ru/sql-reference/functions/tuple-map-functions/) <!--hide-->

View File

@ -391,6 +391,9 @@ public:
virtual void multi(
const Requests & requests,
MultiCallback callback) = 0;
/// Expire session and finish all pending requests
virtual void finalize() = 0;
};
}

View File

@ -30,7 +30,7 @@ using TestKeeperRequestPtr = std::shared_ptr<TestKeeperRequest>;
*
* NOTE: You can add various failure modes for better testing.
*/
class TestKeeper : public IKeeper
class TestKeeper final : public IKeeper
{
public:
TestKeeper(const String & root_path_, Poco::Timespan operation_timeout_);
@ -83,6 +83,7 @@ public:
const Requests & requests,
MultiCallback callback) override;
void finalize() override;
struct Node
{
@ -130,7 +131,6 @@ private:
void pushRequest(RequestInfo && request);
void finalize();
ThreadFromGlobalPool processing_thread;

View File

@ -44,7 +44,7 @@ static void check(Coordination::Error code, const std::string & path)
}
void ZooKeeper::init(const std::string & implementation_, const std::string & hosts_, const std::string & identity_,
void ZooKeeper::init(const std::string & implementation_, const Strings & hosts_, const std::string & identity_,
int32_t session_timeout_ms_, int32_t operation_timeout_ms_, const std::string & chroot_)
{
log = &Poco::Logger::get("ZooKeeper");
@ -60,13 +60,16 @@ void ZooKeeper::init(const std::string & implementation_, const std::string & ho
if (hosts.empty())
throw KeeperException("No hosts passed to ZooKeeper constructor.", Coordination::Error::ZBADARGUMENTS);
std::vector<std::string> hosts_strings;
splitInto<','>(hosts_strings, hosts);
Coordination::ZooKeeper::Nodes nodes;
nodes.reserve(hosts_strings.size());
nodes.reserve(hosts.size());
Strings shuffled_hosts = hosts;
/// Shuffle the hosts to distribute the load among ZooKeeper nodes.
pcg64 generator(randomSeed());
std::shuffle(shuffled_hosts.begin(), shuffled_hosts.end(), generator);
bool dns_error = false;
for (auto & host_string : hosts_strings)
for (auto & host_string : shuffled_hosts)
{
try
{
@ -109,9 +112,9 @@ void ZooKeeper::init(const std::string & implementation_, const std::string & ho
Poco::Timespan(0, operation_timeout_ms_ * 1000));
if (chroot.empty())
LOG_TRACE(log, "Initialized, hosts: {}", hosts);
LOG_TRACE(log, "Initialized, hosts: {}", fmt::join(hosts, ","));
else
LOG_TRACE(log, "Initialized, hosts: {}, chroot: {}", hosts, chroot);
LOG_TRACE(log, "Initialized, hosts: {}, chroot: {}", fmt::join(hosts, ","), chroot);
}
else if (implementation == "testkeeper")
{
@ -128,7 +131,16 @@ void ZooKeeper::init(const std::string & implementation_, const std::string & ho
throw KeeperException("Zookeeper root doesn't exist. You should create root node " + chroot + " before start.", Coordination::Error::ZNONODE);
}
ZooKeeper::ZooKeeper(const std::string & hosts_, const std::string & identity_, int32_t session_timeout_ms_,
ZooKeeper::ZooKeeper(const std::string & hosts_string, const std::string & identity_, int32_t session_timeout_ms_,
int32_t operation_timeout_ms_, const std::string & chroot_, const std::string & implementation_)
{
Strings hosts_strings;
splitInto<','>(hosts_strings, hosts_string);
init(implementation_, hosts_strings, identity_, session_timeout_ms_, operation_timeout_ms_, chroot_);
}
ZooKeeper::ZooKeeper(const Strings & hosts_, const std::string & identity_, int32_t session_timeout_ms_,
int32_t operation_timeout_ms_, const std::string & chroot_, const std::string & implementation_)
{
init(implementation_, hosts_, identity_, session_timeout_ms_, operation_timeout_ms_, chroot_);
@ -141,8 +153,6 @@ struct ZooKeeperArgs
Poco::Util::AbstractConfiguration::Keys keys;
config.keys(config_name, keys);
std::vector<std::string> hosts_strings;
session_timeout_ms = Coordination::DEFAULT_SESSION_TIMEOUT_MS;
operation_timeout_ms = Coordination::DEFAULT_OPERATION_TIMEOUT_MS;
implementation = "zookeeper";
@ -150,7 +160,7 @@ struct ZooKeeperArgs
{
if (startsWith(key, "node"))
{
hosts_strings.push_back(
hosts.push_back(
(config.getBool(config_name + "." + key + ".secure", false) ? "secure://" : "") +
config.getString(config_name + "." + key + ".host") + ":"
+ config.getString(config_name + "." + key + ".port", "2181")
@ -180,17 +190,6 @@ struct ZooKeeperArgs
throw KeeperException(std::string("Unknown key ") + key + " in config file", Coordination::Error::ZBADARGUMENTS);
}
/// Shuffle the hosts to distribute the load among ZooKeeper nodes.
pcg64 generator(randomSeed());
std::shuffle(hosts_strings.begin(), hosts_strings.end(), generator);
for (auto & host : hosts_strings)
{
if (!hosts.empty())
hosts += ',';
hosts += host;
}
if (!chroot.empty())
{
if (chroot.front() != '/')
@ -200,7 +199,7 @@ struct ZooKeeperArgs
}
}
std::string hosts;
Strings hosts;
std::string identity;
int session_timeout_ms;
int operation_timeout_ms;
@ -922,6 +921,10 @@ Coordination::Error ZooKeeper::tryMultiNoThrow(const Coordination::Requests & re
}
}
void ZooKeeper::finalize()
{
impl->finalize();
}
size_t KeeperMultiException::getFailedOpIndex(Coordination::Error exception_code, const Coordination::Responses & responses)
{
@ -1000,4 +1003,5 @@ Coordination::RequestPtr makeCheckRequest(const std::string & path, int version)
request->version = version;
return request;
}
}

View File

@ -50,7 +50,14 @@ class ZooKeeper
public:
using Ptr = std::shared_ptr<ZooKeeper>;
ZooKeeper(const std::string & hosts_, const std::string & identity_ = "",
/// hosts_string -- comma separated [secure://]host:port list
ZooKeeper(const std::string & hosts_string, const std::string & identity_ = "",
int32_t session_timeout_ms_ = Coordination::DEFAULT_SESSION_TIMEOUT_MS,
int32_t operation_timeout_ms_ = Coordination::DEFAULT_OPERATION_TIMEOUT_MS,
const std::string & chroot_ = "",
const std::string & implementation_ = "zookeeper");
ZooKeeper(const Strings & hosts_, const std::string & identity_ = "",
int32_t session_timeout_ms_ = Coordination::DEFAULT_SESSION_TIMEOUT_MS,
int32_t operation_timeout_ms_ = Coordination::DEFAULT_OPERATION_TIMEOUT_MS,
const std::string & chroot_ = "",
@ -247,10 +254,12 @@ public:
/// Like the previous one but don't throw any exceptions on future.get()
FutureMulti tryAsyncMulti(const Coordination::Requests & ops);
void finalize();
private:
friend class EphemeralNodeHolder;
void init(const std::string & implementation_, const std::string & hosts_, const std::string & identity_,
void init(const std::string & implementation_, const Strings & hosts_, const std::string & identity_,
int32_t session_timeout_ms_, int32_t operation_timeout_ms_, const std::string & chroot_);
/// The following methods don't throw exceptions but return error codes.
@ -266,7 +275,7 @@ private:
std::unique_ptr<Coordination::IKeeper> impl;
std::string hosts;
Strings hosts;
std::string identity;
int32_t session_timeout_ms;
int32_t operation_timeout_ms;

View File

@ -88,7 +88,7 @@ using namespace DB;
/** Usage scenario: look at the documentation for IKeeper class.
*/
class ZooKeeper : public IKeeper
class ZooKeeper final : public IKeeper
{
public:
struct Node
@ -167,6 +167,20 @@ public:
const Requests & requests,
MultiCallback callback) override;
/// Without forcefully invalidating (finalizing) ZooKeeper session before
/// establishing a new one, there was a possibility that server is using
/// two ZooKeeper sessions simultaneously in different parts of code.
/// This is strong antipattern and we always prevented it.
/// ZooKeeper is linearizeable for writes, but not linearizeable for
/// reads, it only maintains "sequential consistency": in every session
/// you observe all events in order but possibly with some delay. If you
/// perform write in one session, then notify different part of code and
/// it will do read in another session, that read may not see the
/// already performed write.
void finalize() override { finalize(false, false); }
private:
String root_path;
ACLs default_acls;

View File

@ -65,10 +65,8 @@ void CheckConstraintsBlockOutputStream::write(const Block & block)
/// Check if constraint value is nullable
const auto & null_map = column_nullable->getNullMapColumn();
const auto & data = null_map.getData();
const auto * it = std::find(data.begin(), data.end(), true);
bool null_map_contains_null = it != data.end();
const PaddedPODArray<UInt8> & data = null_map.getData();
bool null_map_contains_null = !memoryIsZero(data.raw_data(), data.size() * sizeof(UInt8));
if (null_map_contains_null)
throw Exception(

View File

@ -1661,7 +1661,12 @@ void Context::resetZooKeeper() const
static void reloadZooKeeperIfChangedImpl(const ConfigurationPtr & config, const std::string & config_name, zkutil::ZooKeeperPtr & zk)
{
if (!zk || zk->configChanged(*config, config_name))
{
if (zk)
zk->finalize();
zk = std::make_shared<zkutil::ZooKeeper>(*config, config_name);
}
}
void Context::reloadZooKeeperIfChanged(const ConfigurationPtr & config) const

View File

@ -144,6 +144,12 @@ static const auto MUTATIONS_FINALIZING_IDLE_SLEEP_MS = 5 * 1000;
void StorageReplicatedMergeTree::setZooKeeper()
{
/// Every ReplicatedMergeTree table is using only one ZooKeeper session.
/// But if several ReplicatedMergeTree tables are using different
/// ZooKeeper sessions, some queries like ATTACH PARTITION FROM may have
/// strange effects. So we always use only one session for all tables.
/// (excluding auxiliary zookeepers)
std::lock_guard lock(current_zookeeper_mutex);
if (zookeeper_name == default_zookeeper_name)
{

View File

@ -74,6 +74,9 @@ def test_reload_zookeeper(start_cluster):
with pytest.raises(QueryRuntimeException):
node.query("SELECT COUNT() FROM test_table", settings={"select_sequential_consistency" : 1})
def get_active_zk_connections():
return str(node.exec_in_container(['bash', '-c', 'lsof -a -i4 -i6 -itcp -w | grep 2181 | grep ESTABLISHED | wc -l'], privileged=True, user='root')).strip()
## set config to zoo2, server will be normal
new_config = """
<yandex>
@ -89,5 +92,10 @@ def test_reload_zookeeper(start_cluster):
node.replace_config("/etc/clickhouse-server/conf.d/zookeeper.xml", new_config)
node.query("SYSTEM RELOAD CONFIG")
active_zk_connections = get_active_zk_connections()
assert active_zk_connections == '1', "Total connections to ZooKeeper not equal to 1, {}".format(active_zk_connections)
assert_eq_with_retry(node, "SELECT COUNT() FROM test_table", '1000', retry_count=120, sleep_time=0.5)
active_zk_connections = get_active_zk_connections()
assert active_zk_connections == '1', "Total connections to ZooKeeper not equal to 1, {}".format(active_zk_connections)

View File

@ -0,0 +1,23 @@
},
{
"datetime": "2020-12-12",
"pipeline": "test-pipeline",
"host": "clickhouse-test-host-001.clickhouse.com",
"home": "clickhouse",
"detail": "clickhouse",
"row_number": "999998"
},
{
"datetime": "2020-12-12",
"pipeline": "test-pipeline",
"host": "clickhouse-test-host-001.clickhouse.com",
"home": "clickhouse",
"detail": "clickhouse",
"row_number": "999999"
}
],
"rows": 1000000,
"rows_before_limit_at_least": 1048080,

View File

@ -0,0 +1,7 @@
#!/usr/bin/env bash
CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh
${CLICKHOUSE_CURL} -sS -H 'Accept-Encoding: gzip' "${CLICKHOUSE_URL}&enable_http_compression=1&http_zlib_compression_level=1" -d "SELECT toDate('2020-12-12') as datetime, 'test-pipeline' as pipeline, 'clickhouse-test-host-001.clickhouse.com' as host, 'clickhouse' as home, 'clickhouse' as detail, number as row_number FROM numbers(1000000) FORMAT JSON" | gzip -d | tail -n30 | head -n23

View File

@ -0,0 +1,23 @@
},
{
"datetime": "2020-12-12",
"pipeline": "test-pipeline",
"host": "clickhouse-test-host-001.clickhouse.com",
"home": "clickhouse",
"detail": "clickhouse",
"row_number": "999998"
},
{
"datetime": "2020-12-12",
"pipeline": "test-pipeline",
"host": "clickhouse-test-host-001.clickhouse.com",
"home": "clickhouse",
"detail": "clickhouse",
"row_number": "999999"
}
],
"rows": 1000000,
"rows_before_limit_at_least": 1048080,

View File

@ -0,0 +1,7 @@
#!/usr/bin/env bash
CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh
${CLICKHOUSE_CURL} -sS -H 'Accept-Encoding: zstd' "${CLICKHOUSE_URL}&enable_http_compression=1" -d "SELECT toDate('2020-12-12') as datetime, 'test-pipeline' as pipeline, 'clickhouse-test-host-001.clickhouse.com' as host, 'clickhouse' as home, 'clickhouse' as detail, number as row_number FROM numbers(1000000) FORMAT JSON" | zstd -d | tail -n30 | head -n23