Compare commits

...

14 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Raúl Marín | 7645025359 | Merge f6610790a8 into 44b4bd38b9 | 2024-11-21 04:28:40 +01:00 |
| Mikhail Artemenko | 44b4bd38b9 | Merge pull request #72045 from ClickHouse/issues/70174/cluster_versions (Enable cluster table functions for DataLake Storages) | 2024-11-20 21:22:37 +00:00 |
| Raúl Marín | f6610790a8 | Fix broken check | 2024-11-20 18:13:44 +01:00 |
| Raúl Marín | 27fb90bb58 | Fix bugs when using UDF in join on expression with the old analyzer | 2024-11-20 17:37:32 +01:00 |
| Mikhail Artemenko | 4ccebd9a24 | fix syntax for iceberg in docs | 2024-11-20 11:15:39 +00:00 |
| Mikhail Artemenko | 99177c0daf | remove icebergCluster alias | 2024-11-20 11:15:12 +00:00 |
| Mikhail Artemenko | 0951991c1d | update aspell-dict.txt | 2024-11-19 13:10:42 +00:00 |
| Mikhail Artemenko | 19aec5e572 | Merge branch 'issues/70174/cluster_versions' of github.com:ClickHouse/ClickHouse into issues/70174/cluster_versions | 2024-11-19 12:51:56 +00:00 |
| Mikhail Artemenko | a367de9977 | add docs | 2024-11-19 12:49:59 +00:00 |
| Mikhail Artemenko | 6894e280b2 | fix pr issues | 2024-11-19 12:34:42 +00:00 |
| Mikhail Artemenko | 39ebe113d9 | Merge branch 'master' into issues/70174/cluster_versions | 2024-11-19 11:28:46 +00:00 |
| robot-clickhouse | 014608fb6b | Automatic style fix | 2024-11-18 17:51:51 +00:00 |
| Mikhail Artemenko | a29ded4941 | add test for iceberg | 2024-11-18 17:39:46 +00:00 |
| Mikhail Artemenko | d2efae7511 | enable cluster versions for datalake storages | 2024-11-18 17:35:21 +00:00 |
24 changed files with 535 additions and 44 deletions

View File

@@ -12,7 +12,7 @@ tests/ci/cancel_and_rerun_workflow_lambda/app.py
 - Backward Incompatible Change
 - Build/Testing/Packaging Improvement
 - Documentation (changelog entry is not required)
-- Critical Bug Fix (crash, data loss, RBAC)
+- Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR
 - Bug Fix (user-visible misbehavior in an official stable release)
 - CI Fix or Improvement (changelog entry is not required)
 - Not for changelog (changelog entry is not required)

View File

@@ -49,4 +49,4 @@ LIMIT 2
 **See Also**
 - [DeltaLake engine](/docs/en/engines/table-engines/integrations/deltalake.md)
+- [DeltaLake cluster table function](/docs/en/sql-reference/table-functions/deltalakeCluster.md)

View File

@@ -0,0 +1,30 @@
---
slug: /en/sql-reference/table-functions/deltalakeCluster
sidebar_position: 46
sidebar_label: deltaLakeCluster
title: "deltaLakeCluster Table Function"
---
This is an extension to the [deltaLake](/docs/en/sql-reference/table-functions/deltalake.md) table function.
Allows processing files from [Delta Lake](https://github.com/delta-io/delta) tables in Amazon S3 in parallel from many nodes in a specified cluster. On the initiator it creates a connection to all nodes in the cluster and dispatches each file dynamically. On a worker node it asks the initiator for the next task to process and processes it. This is repeated until all tasks are finished.
**Syntax**
``` sql
deltaLakeCluster(cluster_name, url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])
```
**Arguments**
- `cluster_name` — Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers.
- The description of all other arguments coincides with the description of the arguments in the equivalent [deltaLake](/docs/en/sql-reference/table-functions/deltalake.md) table function.
**Returned value**
A table with the specified structure for reading data from the cluster in the specified Delta Lake table in S3.
**See Also**
- [deltaLake engine](/docs/en/engines/table-engines/integrations/deltalake.md)
- [deltaLake table function](/docs/en/sql-reference/table-functions/deltalake.md)
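As a quick illustration of the syntax above, a minimal sketch of a call might look like the following; the cluster name, bucket URL, and credentials are placeholders, not values taken from this change:

```sql
-- Read a Delta Lake table on S3 from every node of a hypothetical cluster 'my_cluster'.
-- The URL and credentials below are placeholders.
SELECT count()
FROM deltaLakeCluster(
    'my_cluster',
    'https://example-bucket.s3.amazonaws.com/delta_table/',
    'ACCESS_KEY_ID',
    'SECRET_ACCESS_KEY'
);
```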

View File

@@ -29,4 +29,4 @@ A table with the specified structure for reading data in the specified Hudi tabl
 **See Also**
 - [Hudi engine](/docs/en/engines/table-engines/integrations/hudi.md)
+- [Hudi cluster table function](/docs/en/sql-reference/table-functions/hudiCluster.md)

View File

@@ -0,0 +1,30 @@
---
slug: /en/sql-reference/table-functions/hudiCluster
sidebar_position: 86
sidebar_label: hudiCluster
title: "hudiCluster Table Function"
---
This is an extension to the [hudi](/docs/en/sql-reference/table-functions/hudi.md) table function.
Allows processing files from Apache [Hudi](https://hudi.apache.org/) tables in Amazon S3 in parallel from many nodes in a specified cluster. On the initiator it creates a connection to all nodes in the cluster and dispatches each file dynamically. On a worker node it asks the initiator for the next task to process and processes it. This is repeated until all tasks are finished.
**Syntax**
``` sql
hudiCluster(cluster_name, url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])
```
**Arguments**
- `cluster_name` — Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers.
- The description of all other arguments coincides with the description of the arguments in the equivalent [hudi](/docs/en/sql-reference/table-functions/hudi.md) table function.
**Returned value**
A table with the specified structure for reading data from the cluster in the specified Hudi table in S3.
**See Also**
- [Hudi engine](/docs/en/engines/table-engines/integrations/hudi.md)
- [Hudi table function](/docs/en/sql-reference/table-functions/hudi.md)
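For orientation, a minimal sketch of a call, with a placeholder cluster name, URL, and credentials (none of these values come from the change itself):

```sql
-- Read a Hudi table on S3 from every node of a hypothetical cluster 'my_cluster'.
SELECT count()
FROM hudiCluster(
    'my_cluster',
    'https://example-bucket.s3.amazonaws.com/hudi_table/',
    'ACCESS_KEY_ID',
    'SECRET_ACCESS_KEY'
);
```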

View File

@@ -72,3 +72,4 @@ Table function `iceberg` is an alias to `icebergS3` now.
 **See Also**
 - [Iceberg engine](/docs/en/engines/table-engines/integrations/iceberg.md)
+- [Iceberg cluster table function](/docs/en/sql-reference/table-functions/icebergCluster.md)

View File

@@ -0,0 +1,43 @@
---
slug: /en/sql-reference/table-functions/icebergCluster
sidebar_position: 91
sidebar_label: icebergCluster
title: "icebergCluster Table Function"
---
This is an extension to the [iceberg](/docs/en/sql-reference/table-functions/iceberg.md) table function.
Allows processing files from Apache [Iceberg](https://iceberg.apache.org/) tables in parallel from many nodes in a specified cluster. On the initiator it creates a connection to all nodes in the cluster and dispatches each file dynamically. On a worker node it asks the initiator for the next task to process and processes it. This is repeated until all tasks are finished.
**Syntax**
``` sql
icebergS3Cluster(cluster_name, url [, NOSIGN | access_key_id, secret_access_key, [session_token]] [,format] [,compression_method])
icebergS3Cluster(cluster_name, named_collection[, option=value [,..]])
icebergAzureCluster(cluster_name, connection_string|storage_account_url, container_name, blobpath, [,account_name], [,account_key] [,format] [,compression_method])
icebergAzureCluster(cluster_name, named_collection[, option=value [,..]])
icebergHDFSCluster(cluster_name, path_to_table, [,format] [,compression_method])
icebergHDFSCluster(cluster_name, named_collection[, option=value [,..]])
```
**Arguments**
- `cluster_name` — Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers.
- The description of all other arguments coincides with the description of the arguments in the equivalent [iceberg](/docs/en/sql-reference/table-functions/iceberg.md) table function.
**Returned value**
A table with the specified structure for reading data from the cluster in the specified Iceberg table.
**Examples**
```sql
SELECT * FROM icebergS3Cluster('cluster_simple', 'http://test.s3.amazonaws.com/clickhouse-bucket/test_table', 'test', 'test')
```
**See Also**
- [Iceberg engine](/docs/en/engines/table-engines/integrations/iceberg.md)
- [Iceberg table function](/docs/en/sql-reference/table-functions/iceberg.md)
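The syntax block above also accepts a named collection in place of explicit credentials. A hedged sketch of that form, where the collection name `iceberg_conf` and its keys are illustrative assumptions rather than part of this change:

```sql
-- Hypothetical named collection holding the S3 endpoint and credentials.
CREATE NAMED COLLECTION iceberg_conf AS
    url = 'http://test.s3.amazonaws.com/clickhouse-bucket/test_table',
    access_key_id = 'test',
    secret_access_key = 'test';

SELECT * FROM icebergS3Cluster('cluster_simple', iceberg_conf, format = 'Parquet');
```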

View File

@@ -26,14 +26,6 @@ void UserDefinedSQLFunctionVisitor::visit(ASTPtr & ast)
 {
     chassert(ast);

-    if (const auto * function = ast->template as<ASTFunction>())
-    {
-        std::unordered_set<std::string> udf_in_replace_process;
-        auto replace_result = tryToReplaceFunction(*function, udf_in_replace_process);
-        if (replace_result)
-            ast = replace_result;
-    }
-
     for (auto & child : ast->children)
     {
         if (!child)
@@ -48,6 +40,14 @@ void UserDefinedSQLFunctionVisitor::visit(ASTPtr & ast)
         if (new_ptr != old_ptr)
             ast->updatePointerToChild(old_ptr, new_ptr);
     }
+
+    if (const auto * function = ast->template as<ASTFunction>())
+    {
+        std::unordered_set<std::string> udf_in_replace_process;
+        auto replace_result = tryToReplaceFunction(*function, udf_in_replace_process);
+        if (replace_result)
+            ast = replace_result;
+    }
 }

 void UserDefinedSQLFunctionVisitor::visit(IAST * ast)

View File

@@ -70,7 +70,7 @@ struct JoinedElement
         join->strictness = JoinStrictness::All;

         join->on_expression = on_expression;
-        join->children.push_back(join->on_expression);
+        join->children = {join->on_expression};

         return true;
     }

View File

@@ -130,12 +130,25 @@ void ASTColumnDeclaration::formatImpl(const FormatSettings & format_settings, Fo

 void ASTColumnDeclaration::forEachPointerToChild(std::function<void(void **)> f)
 {
-    f(reinterpret_cast<void **>(&default_expression));
-    f(reinterpret_cast<void **>(&comment));
-    f(reinterpret_cast<void **>(&codec));
-    f(reinterpret_cast<void **>(&statistics_desc));
-    f(reinterpret_cast<void **>(&ttl));
-    f(reinterpret_cast<void **>(&collation));
-    f(reinterpret_cast<void **>(&settings));
+    auto visit_child = [&f](ASTPtr & member)
+    {
+        IAST * new_member_ptr = member.get();
+        f(reinterpret_cast<void **>(&new_member_ptr));
+        if (new_member_ptr != member.get())
+        {
+            if (new_member_ptr)
+                member = new_member_ptr->ptr();
+            else
+                member.reset();
+        }
+    };
+
+    visit_child(default_expression);
+    visit_child(comment);
+    visit_child(codec);
+    visit_child(statistics_desc);
+    visit_child(ttl);
+    visit_child(collation);
+    visit_child(settings);
 }
 }

View File

@@ -61,6 +61,29 @@ ASTPtr ASTTableJoin::clone() const
     return res;
 }

+void ASTTableJoin::forEachPointerToChild(std::function<void(void **)> f)
+{
+    IAST * new_using_expression_list = using_expression_list.get();
+    f(reinterpret_cast<void **>(&new_using_expression_list));
+    if (new_using_expression_list != using_expression_list.get())
+    {
+        if (new_using_expression_list)
+            using_expression_list = new_using_expression_list->ptr();
+        else
+            using_expression_list.reset();
+    }
+
+    IAST * new_on_expression = on_expression.get();
+    f(reinterpret_cast<void **>(&new_on_expression));
+    if (new_on_expression != on_expression.get())
+    {
+        if (new_on_expression)
+            on_expression = new_on_expression->ptr();
+        else
+            on_expression.reset();
+    }
+}
+
 void ASTArrayJoin::updateTreeHashImpl(SipHash & hash_state, bool ignore_aliases) const
 {
     hash_state.update(kind);

View File

@@ -80,6 +80,9 @@ struct ASTTableJoin : public IAST
     void formatImplAfterTable(const FormatSettings & settings, FormatState & state, FormatStateStacked frame) const;
     void formatImpl(const FormatSettings & settings, FormatState & state, FormatStateStacked frame) const override;
     void updateTreeHashImpl(SipHash & hash_state, bool ignore_aliases) const override;
+
+protected:
+    void forEachPointerToChild(std::function<void(void **)> f) override;
 };

 /// Specification of ARRAY JOIN.

View File

@@ -226,6 +226,26 @@ template class TableFunctionObjectStorage<HDFSClusterDefinition, StorageHDFSConf
 #endif
 template class TableFunctionObjectStorage<LocalDefinition, StorageLocalConfiguration>;

+#if USE_AVRO && USE_AWS_S3
+template class TableFunctionObjectStorage<IcebergS3ClusterDefinition, StorageS3IcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_AZURE_BLOB_STORAGE
+template class TableFunctionObjectStorage<IcebergAzureClusterDefinition, StorageAzureIcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_HDFS
+template class TableFunctionObjectStorage<IcebergHDFSClusterDefinition, StorageHDFSIcebergConfiguration>;
+#endif
+
+#if USE_PARQUET && USE_AWS_S3
+template class TableFunctionObjectStorage<DeltaLakeClusterDefinition, StorageS3DeltaLakeConfiguration>;
+#endif
+
+#if USE_AWS_S3
+template class TableFunctionObjectStorage<HudiClusterDefinition, StorageS3HudiConfiguration>;
+#endif
+
 #if USE_AVRO
 void registerTableFunctionIceberg(TableFunctionFactory & factory)
 {

View File

@@ -96,7 +96,7 @@ void registerTableFunctionObjectStorageCluster(TableFunctionFactory & factory)
        {
            .documentation = {
                .description=R"(The table function can be used to read the data stored on HDFS in parallel for many nodes in a specified cluster.)",
-               .examples{{"HDFSCluster", "SELECT * FROM HDFSCluster(cluster_name, uri, format)", ""}}},
+               .examples{{"HDFSCluster", "SELECT * FROM HDFSCluster(cluster, uri, format)", ""}}},
            .allow_readonly = false
        }
    );
@@ -105,15 +105,77 @@ void registerTableFunctionObjectStorageCluster(TableFunctionFactory & factory)
     UNUSED(factory);
 }

+#if USE_AVRO
+void registerTableFunctionIcebergCluster(TableFunctionFactory & factory)
+{
+    UNUSED(factory);
+
 #if USE_AWS_S3
-template class TableFunctionObjectStorageCluster<S3ClusterDefinition, StorageS3Configuration>;
+    factory.registerFunction<TableFunctionIcebergS3Cluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Iceberg table stored on S3 object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"icebergS3Cluster", "SELECT * FROM icebergS3Cluster(cluster, url, [, NOSIGN | access_key_id, secret_access_key, [session_token]], format, [,compression])", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
 #endif

 #if USE_AZURE_BLOB_STORAGE
-template class TableFunctionObjectStorageCluster<AzureClusterDefinition, StorageAzureConfiguration>;
+    factory.registerFunction<TableFunctionIcebergAzureCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Iceberg table stored on Azure object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"icebergAzureCluster", "SELECT * FROM icebergAzureCluster(cluster, connection_string|storage_account_url, container_name, blobpath, [account_name, account_key, format, compression])", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
 #endif

 #if USE_HDFS
-template class TableFunctionObjectStorageCluster<HDFSClusterDefinition, StorageHDFSConfiguration>;
+    factory.registerFunction<TableFunctionIcebergHDFSCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Iceberg table stored on HDFS virtual filesystem in parallel for many nodes in a specified cluster.)",
+            .examples{{"icebergHDFSCluster", "SELECT * FROM icebergHDFSCluster(cluster, uri, [format], [structure], [compression_method])", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
 #endif
 }
+#endif
+
+#if USE_AWS_S3
+#if USE_PARQUET
+void registerTableFunctionDeltaLakeCluster(TableFunctionFactory & factory)
+{
+    factory.registerFunction<TableFunctionDeltaLakeCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the DeltaLake table stored on object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"deltaLakeCluster", "SELECT * FROM deltaLakeCluster(cluster, url, access_key_id, secret_access_key)", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
+}
+#endif
+
+void registerTableFunctionHudiCluster(TableFunctionFactory & factory)
+{
+    factory.registerFunction<TableFunctionHudiCluster>(
+        {.documentation
+         = {.description = R"(The table function can be used to read the Hudi table stored on object store in parallel for many nodes in a specified cluster.)",
+            .examples{{"hudiCluster", "SELECT * FROM hudiCluster(cluster, url, access_key_id, secret_access_key)", ""}},
+            .categories{"DataLake"}},
+         .allow_readonly = false});
+}
+#endif
+
+void registerDataLakeClusterTableFunctions(TableFunctionFactory & factory)
+{
+    UNUSED(factory);
+#if USE_AVRO
+    registerTableFunctionIcebergCluster(factory);
+#endif
+#if USE_AWS_S3
+#if USE_PARQUET
+    registerTableFunctionDeltaLakeCluster(factory);
+#endif
+    registerTableFunctionHudiCluster(factory);
+#endif
+}
+}

View File

@@ -33,6 +33,36 @@ struct HDFSClusterDefinition
     static constexpr auto storage_type_name = "HDFSCluster";
 };

+struct IcebergS3ClusterDefinition
+{
+    static constexpr auto name = "icebergS3Cluster";
+    static constexpr auto storage_type_name = "IcebergS3Cluster";
+};
+
+struct IcebergAzureClusterDefinition
+{
+    static constexpr auto name = "icebergAzureCluster";
+    static constexpr auto storage_type_name = "IcebergAzureCluster";
+};
+
+struct IcebergHDFSClusterDefinition
+{
+    static constexpr auto name = "icebergHDFSCluster";
+    static constexpr auto storage_type_name = "IcebergHDFSCluster";
+};
+
+struct DeltaLakeClusterDefinition
+{
+    static constexpr auto name = "deltaLakeCluster";
+    static constexpr auto storage_type_name = "DeltaLakeS3Cluster";
+};
+
+struct HudiClusterDefinition
+{
+    static constexpr auto name = "hudiCluster";
+    static constexpr auto storage_type_name = "HudiS3Cluster";
+};
+
 /**
  * Class implementing s3/hdfs/azureBlobStorageCluster(...) table functions,
  * which allow to process many files from S3/HDFS/Azure blob storage on a specific cluster.
@@ -79,4 +109,25 @@ using TableFunctionAzureBlobCluster = TableFunctionObjectStorageCluster<AzureClu
 #if USE_HDFS
 using TableFunctionHDFSCluster = TableFunctionObjectStorageCluster<HDFSClusterDefinition, StorageHDFSConfiguration>;
 #endif
+
+#if USE_AVRO && USE_AWS_S3
+using TableFunctionIcebergS3Cluster = TableFunctionObjectStorageCluster<IcebergS3ClusterDefinition, StorageS3IcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_AZURE_BLOB_STORAGE
+using TableFunctionIcebergAzureCluster = TableFunctionObjectStorageCluster<IcebergAzureClusterDefinition, StorageAzureIcebergConfiguration>;
+#endif
+
+#if USE_AVRO && USE_HDFS
+using TableFunctionIcebergHDFSCluster = TableFunctionObjectStorageCluster<IcebergHDFSClusterDefinition, StorageHDFSIcebergConfiguration>;
+#endif
+
+#if USE_AWS_S3 && USE_PARQUET
+using TableFunctionDeltaLakeCluster = TableFunctionObjectStorageCluster<DeltaLakeClusterDefinition, StorageS3DeltaLakeConfiguration>;
+#endif
+
+#if USE_AWS_S3
+using TableFunctionHudiCluster = TableFunctionObjectStorageCluster<HudiClusterDefinition, StorageS3HudiConfiguration>;
+#endif
 }

View File

@@ -66,6 +66,7 @@ void registerTableFunctions(bool use_legacy_mongodb_integration [[maybe_unused]]
     registerTableFunctionObjectStorage(factory);
     registerTableFunctionObjectStorageCluster(factory);
     registerDataLakeTableFunctions(factory);
+    registerDataLakeClusterTableFunctions(factory);
 }

 }

View File

@@ -70,6 +70,7 @@ void registerTableFunctionExplain(TableFunctionFactory & factory);
 void registerTableFunctionObjectStorage(TableFunctionFactory & factory);
 void registerTableFunctionObjectStorageCluster(TableFunctionFactory & factory);
 void registerDataLakeTableFunctions(TableFunctionFactory & factory);
+void registerDataLakeClusterTableFunctions(TableFunctionFactory & factory);

 void registerTableFunctionTimeSeries(TableFunctionFactory & factory);

View File

@@ -56,7 +56,9 @@ LABEL_CATEGORIES = {
         "Bug Fix (user-visible misbehaviour in official stable or prestable release)",
         "Bug Fix (user-visible misbehavior in official stable or prestable release)",
     ],
-    "pr-critical-bugfix": ["Critical Bug Fix (crash, LOGICAL_ERROR, data loss, RBAC)"],
+    "pr-critical-bugfix": [
+        "Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR"
+    ],
     "pr-build": [
         "Build/Testing/Packaging Improvement",
         "Build Improvement",

View File

@@ -0,0 +1,20 @@
<clickhouse>
<remote_servers>
<cluster_simple>
<shard>
<replica>
<host>node1</host>
<port>9000</port>
</replica>
<replica>
<host>node2</host>
<port>9000</port>
</replica>
<replica>
<host>node3</host>
<port>9000</port>
</replica>
</shard>
</cluster_simple>
</remote_servers>
</clickhouse>
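With this file in place, the test cluster can be sanity-checked from any node. For example (a sketch assuming a running server that loaded this config):

```sql
-- List the shards and replicas of the test cluster defined above.
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster_simple';
```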

View File

@@ -0,0 +1,6 @@
<clickhouse>
<query_log>
<database>system</database>
<table>query_log</table>
</query_log>
</clickhouse>
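This is the log the new test inspects to confirm that secondary queries ran on every replica. A rough sketch of such a check against the standard system.query_log schema (the function-name filter is only an example):

```sql
-- Count secondary (non-initial) icebergS3Cluster queries recorded on this replica.
-- Run `SYSTEM FLUSH LOGS` first so recent entries are flushed to the table.
SELECT count()
FROM system.query_log
WHERE type = 'QueryStart'
  AND positionCaseInsensitive(query, 'icebergS3Cluster') != 0
  AND NOT is_initial_query;
```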

View File

@@ -73,14 +73,38 @@ def started_cluster():
     cluster.add_instance(
         "node1",
         main_configs=[
+            "configs/config.d/query_log.xml",
+            "configs/config.d/cluster.xml",
             "configs/config.d/named_collections.xml",
             "configs/config.d/filesystem_caches.xml",
         ],
         user_configs=["configs/users.d/users.xml"],
         with_minio=True,
         with_azurite=True,
-        stay_alive=True,
         with_hdfs=with_hdfs,
+        stay_alive=True,
+    )
+    cluster.add_instance(
+        "node2",
+        main_configs=[
+            "configs/config.d/query_log.xml",
+            "configs/config.d/cluster.xml",
+            "configs/config.d/named_collections.xml",
+            "configs/config.d/filesystem_caches.xml",
+        ],
+        user_configs=["configs/users.d/users.xml"],
+        stay_alive=True,
+    )
+    cluster.add_instance(
+        "node3",
+        main_configs=[
+            "configs/config.d/query_log.xml",
+            "configs/config.d/cluster.xml",
+            "configs/config.d/named_collections.xml",
+            "configs/config.d/filesystem_caches.xml",
+        ],
+        user_configs=["configs/users.d/users.xml"],
+        stay_alive=True,
     )

     logging.info("Starting cluster...")
@@ -182,6 +206,7 @@ def get_creation_expression(
     cluster,
     format="Parquet",
     table_function=False,
+    run_on_cluster=False,
     **kwargs,
 ):
     if storage_type == "s3":
@@ -189,7 +214,11 @@ def get_creation_expression(
             bucket = kwargs["bucket"]
         else:
             bucket = cluster.minio_bucket
-
+        print(bucket)
+        if run_on_cluster:
+            assert table_function
+            return f"icebergS3Cluster('cluster_simple', s3, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'http://minio1:9001/{bucket}/')"
+        else:
             if table_function:
                 return f"icebergS3(s3, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'http://minio1:9001/{bucket}/')"
             else:
@@ -197,7 +226,14 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergS3(s3, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'http://minio1:9001/{bucket}/')"""
     elif storage_type == "azure":
+        if run_on_cluster:
+            assert table_function
+            return f"""
+                icebergAzureCluster('cluster_simple', azure, container = '{cluster.azure_container_name}', storage_account_url = '{cluster.env_variables["AZURITE_STORAGE_ACCOUNT_URL"]}', blob_path = '/iceberg_data/default/{table_name}/', format={format})
+            """
+        else:
             if table_function:
                 return f"""
                     icebergAzure(azure, container = '{cluster.azure_container_name}', storage_account_url = '{cluster.env_variables["AZURITE_STORAGE_ACCOUNT_URL"]}', blob_path = '/iceberg_data/default/{table_name}/', format={format})
@@ -207,7 +243,14 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergAzure(azure, container = {cluster.azure_container_name}, storage_account_url = '{cluster.env_variables["AZURITE_STORAGE_ACCOUNT_URL"]}', blob_path = '/iceberg_data/default/{table_name}/', format={format})"""
     elif storage_type == "hdfs":
+        if run_on_cluster:
+            assert table_function
+            return f"""
+                icebergHDFSCluster('cluster_simple', hdfs, filename= 'iceberg_data/default/{table_name}/', format={format}, url = 'hdfs://hdfs1:9000/')
+            """
+        else:
             if table_function:
                 return f"""
                     icebergHDFS(hdfs, filename= 'iceberg_data/default/{table_name}/', format={format}, url = 'hdfs://hdfs1:9000/')
@@ -217,7 +260,10 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergHDFS(hdfs, filename = 'iceberg_data/default/{table_name}/', format={format}, url = 'hdfs://hdfs1:9000/');"""
     elif storage_type == "local":
+        assert not run_on_cluster
+
         if table_function:
             return f"""
                 icebergLocal(local, path = '/iceberg_data/default/{table_name}/', format={format})
@@ -227,6 +273,7 @@ def get_creation_expression(
                 DROP TABLE IF EXISTS {table_name};
                 CREATE TABLE {table_name}
                 ENGINE=IcebergLocal(local, path = '/iceberg_data/default/{table_name}/', format={format});"""
+
     else:
         raise Exception(f"Unknown iceberg storage type: {storage_type}")
@@ -492,6 +539,108 @@ def test_types(started_cluster, format_version, storage_type):
     )


+@pytest.mark.parametrize("format_version", ["1", "2"])
+@pytest.mark.parametrize("storage_type", ["s3", "azure", "hdfs"])
+def test_cluster_table_function(started_cluster, format_version, storage_type):
+    if is_arm() and storage_type == "hdfs":
+        pytest.skip("Disabled test IcebergHDFS for aarch64")
+
+    instance = started_cluster.instances["node1"]
+    spark = started_cluster.spark_session
+
+    TABLE_NAME = (
+        "test_iceberg_cluster_"
+        + format_version
+        + "_"
+        + storage_type
+        + "_"
+        + get_uuid_str()
+    )
+
+    def add_df(mode):
+        write_iceberg_from_df(
+            spark,
+            generate_data(spark, 0, 100),
+            TABLE_NAME,
+            mode=mode,
+            format_version=format_version,
+        )
+
+        files = default_upload_directory(
+            started_cluster,
+            storage_type,
+            f"/iceberg_data/default/{TABLE_NAME}/",
+            f"/iceberg_data/default/{TABLE_NAME}/",
+        )
+
+        logging.info(f"Adding another dataframe. result files: {files}")
+
+        return files
+
+    files = add_df(mode="overwrite")
+    for i in range(1, len(started_cluster.instances)):
+        files = add_df(mode="append")
+
+    logging.info(f"Setup complete. files: {files}")
+    assert len(files) == 5 + 4 * (len(started_cluster.instances) - 1)
+
+    clusters = instance.query(f"SELECT * FROM system.clusters")
+    logging.info(f"Clusters setup: {clusters}")
+
+    # Regular Query only node1
+    table_function_expr = get_creation_expression(
+        storage_type, TABLE_NAME, started_cluster, table_function=True
+    )
+    select_regular = (
+        instance.query(f"SELECT * FROM {table_function_expr}").strip().split()
+    )
+
+    # Cluster Query with node1 as coordinator
+    table_function_expr_cluster = get_creation_expression(
+        storage_type,
+        TABLE_NAME,
+        started_cluster,
+        table_function=True,
+        run_on_cluster=True,
+    )
+    select_cluster = (
+        instance.query(f"SELECT * FROM {table_function_expr_cluster}").strip().split()
+    )
+
+    # Simple size check
+    assert len(select_regular) == 600
+    assert len(select_cluster) == 600
+
+    # Actual check
+    assert select_cluster == select_regular
+
+    # Check query_log
+    for replica in started_cluster.instances.values():
+        replica.query("SYSTEM FLUSH LOGS")
+
+    for node_name, replica in started_cluster.instances.items():
+        cluster_secondary_queries = (
+            replica.query(
+                f"""
+                SELECT query, type, is_initial_query, read_rows, read_bytes FROM system.query_log
+                WHERE
+                    type = 'QueryStart' AND
+                    positionCaseInsensitive(query, '{storage_type}Cluster') != 0 AND
+                    position(query, '{TABLE_NAME}') != 0 AND
+                    position(query, 'system.query_log') = 0 AND
+                    NOT is_initial_query
+            """
+            )
+            .strip()
+            .split("\n")
+        )
+
+        logging.info(
+            f"[{node_name}] cluster_secondary_queries: {cluster_secondary_queries}"
+        )
+        assert len(cluster_secondary_queries) == 1
+
+
 @pytest.mark.parametrize("format_version", ["1", "2"])
 @pytest.mark.parametrize("storage_type", ["s3", "azure", "hdfs", "local"])
 def test_delete_files(started_cluster, format_version, storage_type):

View File

@@ -0,0 +1,7 @@
SELECT 1
FROM
(
SELECT 1 AS c0
) AS v0
ALL INNER JOIN v0 AS vx ON c0 = vx.c0
1

View File

@@ -0,0 +1,21 @@
#!/usr/bin/env bash
CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CUR_DIR"/../shell_config.sh
$CLICKHOUSE_CLIENT -q "
CREATE VIEW v0 AS SELECT 1 AS c0;
CREATE FUNCTION ${CLICKHOUSE_DATABASE}_second AS (x, y) -> y;
CREATE FUNCTION ${CLICKHOUSE_DATABASE}_equals AS (x, y) -> x = y;
-- SET optimize_rewrite_array_exists_to_has = 1;
EXPLAIN SYNTAX SELECT 1 FROM v0 JOIN v0 vx ON ${CLICKHOUSE_DATABASE}_second(v0.c0, vx.c0); -- { serverError INVALID_JOIN_ON_EXPRESSION }
EXPLAIN SYNTAX SELECT 1 FROM v0 JOIN v0 vx ON ${CLICKHOUSE_DATABASE}_equals(v0.c0, vx.c0);
SELECT 1 FROM v0 JOIN v0 vx ON ${CLICKHOUSE_DATABASE}_equals(v0.c0, vx.c0);
DROP view v0;
DROP FUNCTION ${CLICKHOUSE_DATABASE}_second;
DROP FUNCTION ${CLICKHOUSE_DATABASE}_equals;
"

View File

@@ -244,7 +244,10 @@ Deduplication
 DefaultTableEngine
 DelayedInserts
 DeliveryTag
+Deltalake
 DeltaLake
+deltalakeCluster
+deltaLakeCluster
 Denormalize
 DestroyAggregatesThreads
 DestroyAggregatesThreadsActive
@@ -377,10 +380,15 @@ Homebrew's
 HorizontalDivide
 Hostname
 HouseOps
+hudi
 Hudi
+hudiCluster
+HudiCluster
 HyperLogLog
 Hypot
 IANA
+icebergCluster
+IcebergCluster
 IDE
 IDEs
 IDNA