Compare commits

...

14 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Raúl Marín | 7645025359 | Merge f6610790a8 into 44b4bd38b9 | 2024-11-21 04:28:40 +01:00 |
| Mikhail Artemenko | 44b4bd38b9 | Merge pull request #72045 from ClickHouse/issues/70174/cluster_versions (Enable cluster table functions for DataLake Storages) | 2024-11-20 21:22:37 +00:00 |
| Raúl Marín | f6610790a8 | Fix broken check | 2024-11-20 18:13:44 +01:00 |
| Raúl Marín | 27fb90bb58 | Fix bugs when using UDF in join on expression with the old analyzer | 2024-11-20 17:37:32 +01:00 |
| Mikhail Artemenko | 4ccebd9a24 | fix syntax for iceberg in docs | 2024-11-20 11:15:39 +00:00 |
| Mikhail Artemenko | 99177c0daf | remove icebergCluster alias | 2024-11-20 11:15:12 +00:00 |
| Mikhail Artemenko | 0951991c1d | update aspell-dict.txt | 2024-11-19 13:10:42 +00:00 |
| Mikhail Artemenko | 19aec5e572 | Merge branch 'issues/70174/cluster_versions' of github.com:ClickHouse/ClickHouse into issues/70174/cluster_versions | 2024-11-19 12:51:56 +00:00 |
| Mikhail Artemenko | a367de9977 | add docs | 2024-11-19 12:49:59 +00:00 |
| Mikhail Artemenko | 6894e280b2 | fix pr issues | 2024-11-19 12:34:42 +00:00 |
| Mikhail Artemenko | 39ebe113d9 | Merge branch 'master' into issues/70174/cluster_versions | 2024-11-19 11:28:46 +00:00 |
| robot-clickhouse | 014608fb6b | Automatic style fix | 2024-11-18 17:51:51 +00:00 |
| Mikhail Artemenko | a29ded4941 | add test for iceberg | 2024-11-18 17:39:46 +00:00 |
| Mikhail Artemenko | d2efae7511 | enable cluster versions for datalake storages | 2024-11-18 17:35:21 +00:00 |
24 changed files with 535 additions and 44 deletions

View File

@@ -12,7 +12,7 @@ tests/ci/cancel_and_rerun_workflow_lambda/app.py
 - Backward Incompatible Change
 - Build/Testing/Packaging Improvement
 - Documentation (changelog entry is not required)
-- Critical Bug Fix (crash, data loss, RBAC)
+- Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR
 - Bug Fix (user-visible misbehavior in an official stable release)
 - CI Fix or Improvement (changelog entry is not required)
 - Not for changelog (changelog entry is not required)

View File

@@ -49,4 +49,4 @@ LIMIT 2
 **See Also**
 - [DeltaLake engine](/docs/en/engines/table-engines/integrations/deltalake.md)
+- [DeltaLake cluster table function](/docs/en/sql-reference/table-functions/deltalakeCluster.md)

View File

@@ -0,0 +1,30 @@
---
slug: /en/sql-reference/table-functions/deltalakeCluster
sidebar_position: 46
sidebar_label: deltaLakeCluster
title: "deltaLakeCluster Table Function"
---
This is an extension to the [deltaLake](/docs/en/sql-reference/table-functions/deltalake.md) table function.
Allows processing files from [Delta Lake](https://github.com/delta-io/delta) tables in Amazon S3 in parallel from many nodes in a specified cluster. On the initiator it creates a connection to all nodes in the cluster and dispatches each file dynamically. On a worker node it asks the initiator for the next task to process and processes it. This is repeated until all tasks are finished.
**Syntax**
``` sql
deltaLakeCluster(cluster_name, url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])
```
**Arguments**
- `cluster_name` — Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers.
- The description of all other arguments coincides with the description of the arguments in the equivalent [deltaLake](/docs/en/sql-reference/table-functions/deltalake.md) table function.
**Returned value**
A table with the specified structure for reading data from the cluster in the specified Delta Lake table in S3.
**See Also**
- [deltaLake engine](/docs/en/engines/table-engines/integrations/deltalake.md)
- [deltaLake table function](/docs/en/sql-reference/table-functions/deltalake.md)
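As a quick illustration of the syntax above, a minimal sketch of a call might look like the following; the cluster name, bucket URL, and credentials are placeholders, not values taken from this change:

```sql
-- Read a Delta Lake table on S3 from every node of a hypothetical cluster 'my_cluster'.
-- The URL and credentials below are placeholders.
SELECT count()
FROM deltaLakeCluster(
    'my_cluster',
    'https://example-bucket.s3.amazonaws.com/delta_table/',
    'ACCESS_KEY_ID',
    'SECRET_ACCESS_KEY'
);
```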

View File

@@ -29,4 +29,4 @@ A table with the specified structure for reading data in the specified Hudi tabl
 **See Also**
 - [Hudi engine](/docs/en/engines/table-engines/integrations/hudi.md)
+- [Hudi cluster table function](/docs/en/sql-reference/table-functions/hudiCluster.md)

View File

@@ -0,0 +1,30 @@
---
slug: /en/sql-reference/table-functions/hudiCluster
sidebar_position: 86
sidebar_label: hudiCluster
title: "hudiCluster Table Function"
---
This is an extension to the [hudi](/docs/en/sql-reference/table-functions/hudi.md) table function.
Allows processing files from Apache [Hudi](https://hudi.apache.org/) tables in Amazon S3 in parallel from many nodes in a specified cluster. On the initiator it creates a connection to all nodes in the cluster and dispatches each file dynamically. On a worker node it asks the initiator for the next task to process and processes it. This is repeated until all tasks are finished.
**Syntax**
``` sql
hudiCluster(cluster_name, url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])
```
**Arguments**
- `cluster_name` — Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers.
- The description of all other arguments coincides with the description of the arguments in the equivalent [hudi](/docs/en/sql-reference/table-functions/hudi.md) table function.
**Returned value**
A table with the specified structure for reading data from the cluster in the specified Hudi table in S3.
**See Also**
- [Hudi engine](/docs/en/engines/table-engines/integrations/hudi.md)
- [Hudi table function](/docs/en/sql-reference/table-functions/hudi.md)
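For orientation, a minimal sketch of a call, with a placeholder cluster name, URL, and credentials (none of these values come from the change itself):

```sql
-- Read a Hudi table on S3 from every node of a hypothetical cluster 'my_cluster'.
SELECT count()
FROM hudiCluster(
    'my_cluster',
    'https://example-bucket.s3.amazonaws.com/hudi_table/',
    'ACCESS_KEY_ID',
    'SECRET_ACCESS_KEY'
);
```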

View File

@@ -72,3 +72,4 @@ Table function `iceberg` is an alias to `icebergS3` now.
 **See Also**
 - [Iceberg engine](/docs/en/engines/table-engines/integrations/iceberg.md)
+- [Iceberg cluster table function](/docs/en/sql-reference/table-functions/icebergCluster.md)

View File

@@ -0,0 +1,43 @@
---
slug: /en/sql-reference/table-functions/icebergCluster
sidebar_position: 91
sidebar_label: icebergCluster
title: "icebergCluster Table Function"
---
This is an extension to the [iceberg](/docs/en/sql-reference/table-functions/iceberg.md) table function.
Allows processing files from Apache [Iceberg](https://iceberg.apache.org/) tables in parallel from many nodes in a specified cluster. On the initiator it creates a connection to all nodes in the cluster and dispatches each file dynamically. On a worker node it asks the initiator for the next task to process and processes it. This is repeated until all tasks are finished.
**Syntax**
``` sql
icebergS3Cluster(cluster_name, url [, NOSIGN | access_key_id, secret_access_key, [session_token]] [,format] [,compression_method])
icebergS3Cluster(cluster_name, named_collection[, option=value [,..]])
icebergAzureCluster(cluster_name, connection_string|storage_account_url, container_name, blobpath, [,account_name], [,account_key] [,format] [,compression_method])
icebergAzureCluster(cluster_name, named_collection[, option=value [,..]])
icebergHDFSCluster(cluster_name, path_to_table, [,format] [,compression_method])
icebergHDFSCluster(cluster_name, named_collection[, option=value [,..]])
```
**Arguments**
- `cluster_name` — Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers.
- The description of all other arguments coincides with the description of the arguments in the equivalent [iceberg](/docs/en/sql-reference/table-functions/iceberg.md) table function.
**Returned value**
A table with the specified structure for reading data from the cluster in the specified Iceberg table.
**Examples**
```sql
SELECT * FROM icebergS3Cluster('cluster_simple', 'http://test.s3.amazonaws.com/clickhouse-bucket/test_table', 'test', 'test')
```
**See Also**
- [Iceberg engine](/docs/en/engines/table-engines/integrations/iceberg.md)
- [Iceberg table function](/docs/en/sql-reference/table-functions/iceberg.md)
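The syntax block above also accepts a named collection in place of explicit credentials. A hedged sketch of that form, where the collection name `iceberg_conf` and its keys are illustrative assumptions rather than part of this change:

```sql
-- Hypothetical named collection holding the S3 endpoint and credentials.
CREATE NAMED COLLECTION iceberg_conf AS
    url = 'http://test.s3.amazonaws.com/clickhouse-bucket/test_table',
    access_key_id = 'test',
    secret_access_key = 'test';

SELECT * FROM icebergS3Cluster('cluster_simple', iceberg_conf, format = 'Parquet');
```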

View File

@@ -26,14 +26,6 @@ void UserDefinedSQLFunctionVisitor::visit(ASTPtr & ast)
 {
     chassert(ast);

-    if (const auto * function = ast->template as<ASTFunction>())
-    {
-        std::unordered_set<std::string> udf_in_replace_process;
-        auto replace_result = tryToReplaceFunction(*function, udf_in_replace_process);
-        if (replace_result)
-            ast = replace_result;
-    }
-
     for (auto & child : ast->children)
     {
         if (!child)
@@ -48,6 +40,14 @@ void UserDefinedSQLFunctionVisitor::visit(ASTPtr & ast)
         if (new_ptr != old_ptr)
             ast->updatePointerToChild(old_ptr, new_ptr);
     }
+
+    if (const auto * function = ast->template as<ASTFunction>())
+    {
+        std::unordered_set<std::string> udf_in_replace_process;
+        auto replace_result = tryToReplaceFunction(*function, udf_in_replace_process);
+        if (replace_result)
+            ast = replace_result;
+    }
 }

 void UserDefinedSQLFunctionVisitor::visit(IAST * ast)

View File

@@ -70,7 +70,7 @@ struct JoinedElement
         join->strictness = JoinStrictness::All;

         join->on_expression = on_expression;
-        join->children.push_back(join->on_expression);
+        join->children = {join->on_expression};

         return true;
     }

View File

@@ -130,12 +130,25 @@ void ASTColumnDeclaration::formatImpl(const FormatSettings & format_settings, Fo

 void ASTColumnDeclaration::forEachPointerToChild(std::function<void(void **)> f)
 {
-    f(reinterpret_cast<void **>(&default_expression));
-    f(reinterpret_cast<void **>(&comment));
-    f(reinterpret_cast<void **>(&codec));
-    f(reinterpret_cast<void **>(&statistics_desc));
-    f(reinterpret_cast<void **>(&ttl));
-    f(reinterpret_cast<void **>(&collation));
-    f(reinterpret_cast<void **>(&settings));
+    auto visit_child = [&f](ASTPtr & member)
+    {
+        IAST * new_member_ptr = member.get();
+        f(reinterpret_cast<void **>(&new_member_ptr));
+        if (new_member_ptr != member.get())
+        {
+            if (new_member_ptr)
+                member = new_member_ptr->ptr();
+            else
+                member.reset();
+        }
+    };
+
+    visit_child(default_expression);
+    visit_child(comment);
+    visit_child(codec);
+    visit_child(statistics_desc);
+    visit_child(ttl);
+    visit_child(collation);
+    visit_child(settings);
 }
 }

View File

@@ -61,6 +61,29 @@ ASTPtr ASTTableJoin::clone() const
     return res;
 }

+void ASTTableJoin::forEachPointerToChild(std::function<void(void **)> f)
+{
+    IAST * new_using_expression_list = using_expression_list.get();
+    f(reinterpret_cast<void **>(&new_using_expression_list));
+    if (new_using_expression_list != using_expression_list.get())
+    {
+        if (new_using_expression_list)
+            using_expression_list = new_using_expression_list->ptr();
+        else
+            using_expression_list.reset();
+    }
+
+    IAST * new_on_expression = on_expression.get();
+    f(reinterpret_cast<void **>(&new_on_expression));
+    if (new_on_expression != on_expression.get())
+    {
+        if (new_on_expression)
+            on_expression = new_on_expression->ptr();
+        else
+            on_expression.reset();
+    }
+}
+
 void ASTArrayJoin::updateTreeHashImpl(SipHash & hash_state, bool ignore_aliases) const
 {
     hash_state.update(kind);

View File

@@ -80,6 +80,9 @@ struct ASTTableJoin : public IAST
     void formatImplAfterTable(const FormatSettings & settings, FormatState & state, FormatStateStacked frame) const;
     void formatImpl(const FormatSettings & settings, FormatState & state, FormatStateStacked frame) const override;
     void updateTreeHashImpl(SipHash & hash_state, bool ignore_aliases) const override;
+
+protected:
+    void forEachPointerToChild(std::function<void(void **)> f) override;
 };

 /// Specification of ARRAY JOIN.

View File

@@ -226,6 +226,26 @@ template class TableFunctionObjectStorage<HDFSClusterDefinition, StorageHDFSConf
 #endif
 template class TableFunctionObjectStorage<LocalDefinition, StorageLocalConfiguration>;

+#if USE_AVRO && USE_AWS_S3
+template class TableFunctionObjectStorage<IcebergS3ClusterDefinition, StorageS3IcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_AZURE_BLOB_STORAGE
+template class TableFunctionObjectStorage<IcebergAzureClusterDefinition, StorageAzureIcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_HDFS
+template class TableFunctionObjectStorage<IcebergHDFSClusterDefinition, StorageHDFSIcebergConfiguration>;
+#endif
+
+#if USE_PARQUET && USE_AWS_S3
+template class TableFunctionObjectStorage<DeltaLakeClusterDefinition, StorageS3DeltaLakeConfiguration>;
+#endif
+
+#if USE_AWS_S3
+template class TableFunctionObjectStorage<HudiClusterDefinition, StorageS3HudiConfiguration>;
+#endif
+
 #if USE_AVRO
 void registerTableFunctionIceberg(TableFunctionFactory & factory)
 {

View File

@@ -96,7 +96,7 @@ void registerTableFunctionObjectStorageCluster(TableFunctionFactory & factory)
        {
            .documentation = {
                .description=R"(The table function can be used to read the data stored on HDFS in parallel for many nodes in a specified cluster.)",
-               .examples{{"HDFSCluster", "SELECT * FROM HDFSCluster(cluster_name, uri, format)", ""}}},
+               .examples{{"HDFSCluster", "SELECT * FROM HDFSCluster(cluster, uri, format)", ""}}},
            .allow_readonly = false
        }
    );
@@ -105,15 +105,77 @@ void registerTableFunctionObjectStorageCluster(TableFunctionFactory & factory)
     UNUSED(factory);
 }

+#if USE_AVRO
+void registerTableFunctionIcebergCluster(TableFunctionFactory & factory)
+{
+    UNUSED(factory);
+
 #if USE_AWS_S3
-template class TableFunctionObjectStorageCluster<S3ClusterDefinition, StorageS3Configuration>;
+    factory.registerFunction<TableFunctionIcebergS3Cluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Iceberg table stored on S3 object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"icebergS3Cluster", "SELECT * FROM icebergS3Cluster(cluster, url, [, NOSIGN | access_key_id, secret_access_key, [session_token]], format, [,compression])", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
 #endif

 #if USE_AZURE_BLOB_STORAGE
-template class TableFunctionObjectStorageCluster<AzureClusterDefinition, StorageAzureConfiguration>;
+    factory.registerFunction<TableFunctionIcebergAzureCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Iceberg table stored on Azure object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"icebergAzureCluster", "SELECT * FROM icebergAzureCluster(cluster, connection_string|storage_account_url, container_name, blobpath, [account_name, account_key, format, compression])", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
 #endif

 #if USE_HDFS
-template class TableFunctionObjectStorageCluster<HDFSClusterDefinition, StorageHDFSConfiguration>;
+    factory.registerFunction<TableFunctionIcebergHDFSCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Iceberg table stored on HDFS virtual filesystem in parallel for many nodes in a specified cluster.)",
+            .examples{{"icebergHDFSCluster", "SELECT * FROM icebergHDFSCluster(cluster, uri, [format], [structure], [compression_method])", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
 #endif
 }
+#endif
+
+#if USE_AWS_S3
+#if USE_PARQUET
+void registerTableFunctionDeltaLakeCluster(TableFunctionFactory & factory)
+{
+    factory.registerFunction<TableFunctionDeltaLakeCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the DeltaLake table stored on object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"deltaLakeCluster", "SELECT * FROM deltaLakeCluster(cluster, url, access_key_id, secret_access_key)", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
+}
+#endif
+
+void registerTableFunctionHudiCluster(TableFunctionFactory & factory)
+{
+    factory.registerFunction<TableFunctionHudiCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Hudi table stored on object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"hudiCluster", "SELECT * FROM hudiCluster(cluster, url, access_key_id, secret_access_key)", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
+}
+#endif
+
+void registerDataLakeClusterTableFunctions(TableFunctionFactory & factory)
+{
+    UNUSED(factory);
+#if USE_AVRO
+    registerTableFunctionIcebergCluster(factory);
+#endif
+#if USE_AWS_S3
+#if USE_PARQUET
+    registerTableFunctionDeltaLakeCluster(factory);
+#endif
+    registerTableFunctionHudiCluster(factory);
+#endif
+}
+}

View File

@@ -33,6 +33,36 @@ struct HDFSClusterDefinition
     static constexpr auto storage_type_name = "HDFSCluster";
 };

+struct IcebergS3ClusterDefinition
+{
+    static constexpr auto name = "icebergS3Cluster";
+    static constexpr auto storage_type_name = "IcebergS3Cluster";
+};
+
+struct IcebergAzureClusterDefinition
+{
+    static constexpr auto name = "icebergAzureCluster";
+    static constexpr auto storage_type_name = "IcebergAzureCluster";
+};
+
+struct IcebergHDFSClusterDefinition
+{
+    static constexpr auto name = "icebergHDFSCluster";
+    static constexpr auto storage_type_name = "IcebergHDFSCluster";
+};
+
+struct DeltaLakeClusterDefinition
+{
+    static constexpr auto name = "deltaLakeCluster";
+    static constexpr auto storage_type_name = "DeltaLakeS3Cluster";
+};
+
+struct HudiClusterDefinition
+{
+    static constexpr auto name = "hudiCluster";
+    static constexpr auto storage_type_name = "HudiS3Cluster";
+};
+
 /**
  * Class implementing s3/hdfs/azureBlobStorageCluster(...) table functions,
  * which allow to process many files from S3/HDFS/Azure blob storage on a specific cluster.
@@ -79,4 +109,25 @@ using TableFunctionAzureBlobCluster = TableFunctionObjectStorageCluster<AzureClu
 #if USE_HDFS
 using TableFunctionHDFSCluster = TableFunctionObjectStorageCluster<HDFSClusterDefinition, StorageHDFSConfiguration>;
 #endif
+
+#if USE_AVRO && USE_AWS_S3
+using TableFunctionIcebergS3Cluster = TableFunctionObjectStorageCluster<IcebergS3ClusterDefinition, StorageS3IcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_AZURE_BLOB_STORAGE
+using TableFunctionIcebergAzureCluster = TableFunctionObjectStorageCluster<IcebergAzureClusterDefinition, StorageAzureIcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_HDFS
+using TableFunctionIcebergHDFSCluster = TableFunctionObjectStorageCluster<IcebergHDFSClusterDefinition, StorageHDFSIcebergConfiguration>;
+#endif
+
+#if USE_AWS_S3 && USE_PARQUET
+using TableFunctionDeltaLakeCluster = TableFunctionObjectStorageCluster<DeltaLakeClusterDefinition, StorageS3DeltaLakeConfiguration>;
+#endif
+
+#if USE_AWS_S3
+using TableFunctionHudiCluster = TableFunctionObjectStorageCluster<HudiClusterDefinition, StorageS3HudiConfiguration>;
+#endif
 }

View File

@@ -66,6 +66,7 @@ void registerTableFunctions(bool use_legacy_mongodb_integration [[maybe_unused]]
     registerTableFunctionObjectStorage(factory);
     registerTableFunctionObjectStorageCluster(factory);
     registerDataLakeTableFunctions(factory);
+    registerDataLakeClusterTableFunctions(factory);
 }

 }

View File

@@ -70,6 +70,7 @@ void registerTableFunctionExplain(TableFunctionFactory & factory);
 void registerTableFunctionObjectStorage(TableFunctionFactory & factory);
 void registerTableFunctionObjectStorageCluster(TableFunctionFactory & factory);
 void registerDataLakeTableFunctions(TableFunctionFactory & factory);
+void registerDataLakeClusterTableFunctions(TableFunctionFactory & factory);

 void registerTableFunctionTimeSeries(TableFunctionFactory & factory);

View File

@@ -56,7 +56,9 @@ LABEL_CATEGORIES = {
         "Bug Fix (user-visible misbehaviour in official stable or prestable release)",
         "Bug Fix (user-visible misbehavior in official stable or prestable release)",
     ],
-    "pr-critical-bugfix": ["Critical Bug Fix (crash, LOGICAL_ERROR, data loss, RBAC)"],
+    "pr-critical-bugfix": [
+        "Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR"
+    ],
     "pr-build": [
         "Build/Testing/Packaging Improvement",
         "Build Improvement",

View File

@@ -0,0 +1,20 @@
<clickhouse>
<remote_servers>
<cluster_simple>
<shard>
<replica>
<host>node1</host>
<port>9000</port>
</replica>
<replica>
<host>node2</host>
<port>9000</port>
</replica>
<replica>
<host>node3</host>
<port>9000</port>
</replica>
</shard>
</cluster_simple>
</remote_servers>
</clickhouse>
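With this file in place, the test cluster can be sanity-checked from any node. For example (a sketch assuming a running server that loaded this config):

```sql
-- List the shards and replicas of the test cluster defined above.
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster_simple';
```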

View File

@@ -0,0 +1,6 @@
<clickhouse>
<query_log>
<database>system</database>
<table>query_log</table>
</query_log>
</clickhouse>
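This is the log the new test inspects to confirm that secondary queries ran on every replica. A rough sketch of such a check against the standard system.query_log schema (the function-name filter is only an example):

```sql
-- Count secondary (non-initial) icebergS3Cluster queries recorded on this replica.
-- Run `SYSTEM FLUSH LOGS` first so recent entries are flushed to the table.
SELECT count()
FROM system.query_log
WHERE type = 'QueryStart'
  AND positionCaseInsensitive(query, 'icebergS3Cluster') != 0
  AND NOT is_initial_query;
```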

View File

@@ -73,14 +73,38 @@ def started_cluster():
     cluster.add_instance(
         "node1",
         main_configs=[
+            "configs/config.d/query_log.xml",
+            "configs/config.d/cluster.xml",
             "configs/config.d/named_collections.xml",
             "configs/config.d/filesystem_caches.xml",
         ],
         user_configs=["configs/users.d/users.xml"],
         with_minio=True,
         with_azurite=True,
-        stay_alive=True,
         with_hdfs=with_hdfs,
+        stay_alive=True,
+    )
+    cluster.add_instance(
+        "node2",
+        main_configs=[
+            "configs/config.d/query_log.xml",
+            "configs/config.d/cluster.xml",
+            "configs/config.d/named_collections.xml",
+            "configs/config.d/filesystem_caches.xml",
+        ],
+        user_configs=["configs/users.d/users.xml"],
+        stay_alive=True,
+    )
+    cluster.add_instance(
+        "node3",
+        main_configs=[
+            "configs/config.d/query_log.xml",
+            "configs/config.d/cluster.xml",
+            "configs/config.d/named_collections.xml",
+            "configs/config.d/filesystem_caches.xml",
+        ],
+        user_configs=["configs/users.d/users.xml"],
+        stay_alive=True,
     )

     logging.info("Starting cluster...")
@@ -182,6 +206,7 @@ def get_creation_expression(
     cluster,
     format="Parquet",
     table_function=False,
+    run_on_cluster=False,
     **kwargs,
 ):
     if storage_type == "s3":
@@ -189,7 +214,11 @@ def get_creation_expression(
             bucket = kwargs["bucket"]
         else:
             bucket = cluster.minio_bucket
-
+        print(bucket)
+        if run_on_cluster:
+            assert table_function
+            return f"icebergS3Cluster('cluster_simple', s3, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'http://minio1:9001/{bucket}/')"
+        else:
             if table_function:
                 return f"icebergS3(s3, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'http://minio1:9001/{bucket}/')"
             else:
@@ -197,7 +226,14 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergS3(s3, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'http://minio1:9001/{bucket}/')"""
     elif storage_type == "azure":
+        if run_on_cluster:
+            assert table_function
+            return f"""
+                icebergAzureCluster('cluster_simple', azure, container = '{cluster.azure_container_name}', storage_account_url = '{cluster.env_variables["AZURITE_STORAGE_ACCOUNT_URL"]}', blob_path = '/iceberg_data/default/{table_name}/', format={format})
+            """
+        else:
             if table_function:
                 return f"""
                     icebergAzure(azure, container = '{cluster.azure_container_name}', storage_account_url = '{cluster.env_variables["AZURITE_STORAGE_ACCOUNT_URL"]}', blob_path = '/iceberg_data/default/{table_name}/', format={format})
@@ -207,7 +243,14 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergAzure(azure, container = {cluster.azure_container_name}, storage_account_url = '{cluster.env_variables["AZURITE_STORAGE_ACCOUNT_URL"]}', blob_path = '/iceberg_data/default/{table_name}/', format={format})"""
     elif storage_type == "hdfs":
+        if run_on_cluster:
+            assert table_function
+            return f"""
+                icebergHDFSCluster('cluster_simple', hdfs, filename= 'iceberg_data/default/{table_name}/', format={format}, url = 'hdfs://hdfs1:9000/')
+            """
+        else:
             if table_function:
                 return f"""
                     icebergHDFS(hdfs, filename= 'iceberg_data/default/{table_name}/', format={format}, url = 'hdfs://hdfs1:9000/')
@@ -217,7 +260,10 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergHDFS(hdfs, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'hdfs://hdfs1:9000/');"""
     elif storage_type == "local":
+        assert not run_on_cluster
+
         if table_function:
             return f"""
                 icebergLocal(local, path = '/iceberg_data/default/{table_name}/', format={format})
@@ -227,6 +273,7 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergLocal(local, path = '/iceberg_data/default/{table_name}/', format={format});"""
+
     else:
         raise Exception(f"Unknown iceberg storage type: {storage_type}")
@@ -492,6 +539,108 @@ def test_types(started_cluster, format_version, storage_type):
     )


+@pytest.mark.parametrize("format_version", ["1", "2"])
+@pytest.mark.parametrize("storage_type", ["s3", "azure", "hdfs"])
+def test_cluster_table_function(started_cluster, format_version, storage_type):
+    if is_arm() and storage_type == "hdfs":
+        pytest.skip("Disabled test IcebergHDFS for aarch64")
+
+    instance = started_cluster.instances["node1"]
+    spark = started_cluster.spark_session
+
+    TABLE_NAME = (
+        "test_iceberg_cluster_"
+        + format_version
+        + "_"
+        + storage_type
+        + "_"
+        + get_uuid_str()
+    )
+
+    def add_df(mode):
+        write_iceberg_from_df(
+            spark,
+            generate_data(spark, 0, 100),
+            TABLE_NAME,
+            mode=mode,
+            format_version=format_version,
+        )
+
+        files = default_upload_directory(
+            started_cluster,
+            storage_type,
+            f"/iceberg_data/default/{TABLE_NAME}/",
+            f"/iceberg_data/default/{TABLE_NAME}/",
+        )
+
+        logging.info(f"Adding another dataframe. result files: {files}")
+
+        return files
+
+    files = add_df(mode="overwrite")
+    for i in range(1, len(started_cluster.instances)):
+        files = add_df(mode="append")
+
+    logging.info(f"Setup complete. files: {files}")
+    assert len(files) == 5 + 4 * (len(started_cluster.instances) - 1)
+
+    clusters = instance.query(f"SELECT * FROM system.clusters")
+    logging.info(f"Clusters setup: {clusters}")
+
+    # Regular Query only node1
+    table_function_expr = get_creation_expression(
+        storage_type, TABLE_NAME, started_cluster, table_function=True
+    )
+    select_regular = (
+        instance.query(f"SELECT * FROM {table_function_expr}").strip().split()
+    )
+
+    # Cluster Query with node1 as coordinator
+    table_function_expr_cluster = get_creation_expression(
+        storage_type,
+        TABLE_NAME,
+        started_cluster,
+        table_function=True,
+        run_on_cluster=True,
+    )
+    select_cluster = (
+        instance.query(f"SELECT * FROM {table_function_expr_cluster}").strip().split()
+    )
+
+    # Simple size check
+    assert len(select_regular) == 600
+    assert len(select_cluster) == 600
+
+    # Actual check
+    assert select_cluster == select_regular
+
+    # Check query_log
+    for replica in started_cluster.instances.values():
+        replica.query("SYSTEM FLUSH LOGS")
+
+    for node_name, replica in started_cluster.instances.items():
+        cluster_secondary_queries = (
+            replica.query(
+                f"""
+                SELECT query, type, is_initial_query, read_rows, read_bytes FROM system.query_log
+                WHERE
+                    type = 'QueryStart' AND
+                    positionCaseInsensitive(query, '{storage_type}Cluster') != 0 AND
+                    position(query, '{TABLE_NAME}') != 0 AND
+                    position(query, 'system.query_log') = 0 AND
+                    NOT is_initial_query
+            """
+            )
+            .strip()
+            .split("\n")
+        )
+
+        logging.info(
+            f"[{node_name}] cluster_secondary_queries: {cluster_secondary_queries}"
+        )
+        assert len(cluster_secondary_queries) == 1
+
+
 @pytest.mark.parametrize("format_version", ["1", "2"])
 @pytest.mark.parametrize("storage_type", ["s3", "azure", "hdfs", "local"])
 def test_delete_files(started_cluster, format_version, storage_type):

View File

@@ -0,0 +1,7 @@
SELECT 1
FROM
(
SELECT 1 AS c0
) AS v0
ALL INNER JOIN v0 AS vx ON c0 = vx.c0
1

View File

@@ -0,0 +1,21 @@
#!/usr/bin/env bash
CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CUR_DIR"/../shell_config.sh
$CLICKHOUSE_CLIENT -q "
CREATE VIEW v0 AS SELECT 1 AS c0;
CREATE FUNCTION ${CLICKHOUSE_DATABASE}_second AS (x, y) -> y;
CREATE FUNCTION ${CLICKHOUSE_DATABASE}_equals AS (x, y) -> x = y;
-- SET optimize_rewrite_array_exists_to_has = 1;
EXPLAIN SYNTAX SELECT 1 FROM v0 JOIN v0 vx ON ${CLICKHOUSE_DATABASE}_second(v0.c0, vx.c0); -- { serverError INVALID_JOIN_ON_EXPRESSION }
EXPLAIN SYNTAX SELECT 1 FROM v0 JOIN v0 vx ON ${CLICKHOUSE_DATABASE}_equals(v0.c0, vx.c0);
SELECT 1 FROM v0 JOIN v0 vx ON ${CLICKHOUSE_DATABASE}_equals(v0.c0, vx.c0);
DROP view v0;
DROP FUNCTION ${CLICKHOUSE_DATABASE}_second;
DROP FUNCTION ${CLICKHOUSE_DATABASE}_equals;
"

View File

@@ -244,7 +244,10 @@ Deduplication
 DefaultTableEngine
 DelayedInserts
 DeliveryTag
+Deltalake
 DeltaLake
+deltalakeCluster
+deltaLakeCluster
 Denormalize
 DestroyAggregatesThreads
 DestroyAggregatesThreadsActive
@@ -377,10 +380,15 @@ Homebrew's
 HorizontalDivide
 Hostname
 HouseOps
+hudi
 Hudi
+hudiCluster
+HudiCluster
 HyperLogLog
 Hypot
 IANA
+icebergCluster
+IcebergCluster
 IDE
 IDEs
 IDNA