Merge branch 'master' into fix_unicode_bar

Alexander Gololobov 2022-12-22 14:53:33 +01:00 committed by GitHub
commit 0ae7f03516
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
186 changed files with 3869 additions and 1759 deletions

View File

@@ -842,7 +842,7 @@ jobs:
docker ps --quiet | xargs --no-run-if-empty docker kill ||:
docker ps --all --quiet | xargs --no-run-if-empty docker rm -f ||:
sudo rm -fr "$TEMP_PATH" "$CACHES_PATH"
BuilderBinAmd64SSE2:
BuilderBinAmd64Compat:
needs: [DockerHubPush]
runs-on: [self-hosted, builder]
steps:
@@ -853,7 +853,7 @@ jobs:
IMAGES_PATH=${{runner.temp}}/images_path
REPO_COPY=${{runner.temp}}/build_check/ClickHouse
CACHES_PATH=${{runner.temp}}/../ccaches
BUILD_NAME=binary_amd64sse2
BUILD_NAME=binary_amd64_compat
EOF
- name: Download changed images
uses: actions/download-artifact@v2
@@ -1017,7 +1017,7 @@ jobs:
- BuilderBinFreeBSD
# - BuilderBinGCC
- BuilderBinPPC64
- BuilderBinAmd64SSE2
- BuilderBinAmd64Compat
- BuilderBinAarch64V80Compat
- BuilderBinClangTidy
- BuilderDebShared

View File

@@ -901,7 +901,7 @@ jobs:
docker ps --quiet | xargs --no-run-if-empty docker kill ||:
docker ps --all --quiet | xargs --no-run-if-empty docker rm -f ||:
sudo rm -fr "$TEMP_PATH" "$CACHES_PATH"
BuilderBinAmd64SSE2:
BuilderBinAmd64Compat:
needs: [DockerHubPush, FastTest, StyleCheck]
runs-on: [self-hosted, builder]
steps:
@@ -912,7 +912,7 @@ jobs:
IMAGES_PATH=${{runner.temp}}/images_path
REPO_COPY=${{runner.temp}}/build_check/ClickHouse
CACHES_PATH=${{runner.temp}}/../ccaches
BUILD_NAME=binary_amd64sse2
BUILD_NAME=binary_amd64_compat
EOF
- name: Download changed images
uses: actions/download-artifact@v2
@@ -1071,7 +1071,7 @@ jobs:
- BuilderBinFreeBSD
# - BuilderBinGCC
- BuilderBinPPC64
- BuilderBinAmd64SSE2
- BuilderBinAmd64Compat
- BuilderBinAarch64V80Compat
- BuilderBinClangTidy
- BuilderDebShared

View File

@@ -17,6 +17,9 @@
### <a id="2212"></a> ClickHouse release 22.12, 2022-12-15
#### Backward Incompatible Change
* Add `GROUP BY ALL` syntax: [#37631](https://github.com/ClickHouse/ClickHouse/issues/37631). [#42265](https://github.com/ClickHouse/ClickHouse/pull/42265) ([刘陶峰](https://github.com/taofengliu)). If you have a column or an alias named `all` and are doing `GROUP BY all` without intending to group by all the columns, the query will have different semantics (see the sketch below). To keep the old semantics, put `all` into backticks or double quotes `"all"` to make it an identifier instead of a keyword.
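A minimal sketch of the two behaviours (the `events` table and its columns are illustrative):
```sql
-- New: GROUP BY ALL groups by every non-aggregate expression in the SELECT list.
SELECT user_id, count() FROM events GROUP BY ALL;
-- If `all` is a real column, quote it to group by that single column as before:
SELECT `all`, count() FROM events GROUP BY `all`;
```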
#### Upgrade Notes
* Fixed backward incompatibility in (de)serialization of states of `min`, `max`, `any*`, `argMin`, `argMax` aggregate functions with `String` argument. The incompatibility affects 22.9, 22.10 and 22.11 branches (fixed since 22.9.6, 22.10.4 and 22.11.2 respectively). Some minor releases of 22.3, 22.7 and 22.8 branches are also affected: 22.3.13...22.3.14 (fixed since 22.3.15), 22.8.6...22.8.9 (fixed since 22.8.10), 22.7.6 and newer (will not be fixed in 22.7, we recommend upgrading from 22.7.* to 22.8.10 or newer). This release note does not concern users that have never used affected versions. Incompatible versions append an extra `'\0'` to strings when reading states of the aggregate functions mentioned above. For example, if an older version saved the state of `anyState('foobar')` to `state_column`, then an incompatible version will print `'foobar\0'` on `anyMerge(state_column)`. Incompatible versions also write states of the aggregate functions without the trailing `'\0'`. Newer versions (that have the fix) can correctly read data written by all versions, including incompatible ones, except for one corner case: if an incompatible version saved a state with a string that actually ends with a null character, then newer versions will trim the trailing `'\0'` when reading the state of the affected aggregate function. For example, if an incompatible version saved the state of `anyState('abrac\0dabra\0')` to `state_column`, then newer versions will print `'abrac\0dabra'` on `anyMerge(state_column)` (see the sketch below). The issue also affects distributed queries when an incompatible version works in a cluster together with older or newer versions. [#43038](https://github.com/ClickHouse/ClickHouse/pull/43038) ([Alexander Tokmakov](https://github.com/tavplubix), [Raúl Marín](https://github.com/Algunenano)). Note: all the official ClickHouse builds already include the patches. This is not necessarily true for unofficial third-party builds, which should be avoided.
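A hedged illustration of the behaviour described above (`states` and `state_column` are hypothetical; the stored state was written as `anyState('foobar')`):
```sql
-- On an affected (incompatible) version:
SELECT anyMerge(state_column) FROM states;  -- returns 'foobar\0'
-- On a fixed version the same query returns 'foobar'; only a string that
-- genuinely ends in '\0' AND was written by an affected version loses its
-- trailing null byte on read.
SELECT anyMerge(state_column) FROM states;
```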

View File

@@ -16,6 +16,6 @@ ClickHouse® is an open-source column-oriented database management system that a
* [Contacts](https://clickhouse.com/company/contact) can help to get your questions answered if there are any.
## Upcoming events
* [**v22.12 Release Webinar**](https://clickhouse.com/company/events/v22-12-release-webinar) 22.12 is the ClickHouse Christmas release. There are plenty of gifts (a new JOIN algorithm among them) and we adopted something from MongoDB. Original creator, co-founder, and CTO of ClickHouse Alexey Milovidov will walk us through the highlights of the release.
* **Recording available**: [**v22.12 Release Webinar**](https://www.youtube.com/watch?v=sREupr6uc2k) 22.12 is the ClickHouse Christmas release. There are plenty of gifts (a new JOIN algorithm among them) and we adopted something from MongoDB. Original creator, co-founder, and CTO of ClickHouse Alexey Milovidov will walk us through the highlights of the release.
* [**ClickHouse Meetup at the CHEQ office in Tel Aviv**](https://www.meetup.com/clickhouse-tel-aviv-user-group/events/289599423/) - Jan 16 - We are very excited to be holding our next in-person ClickHouse meetup at the CHEQ office in Tel Aviv! Hear from CHEQ, ServiceNow and Contentsquare, and enjoy a deep-dive presentation from ClickHouse CTO Alexey Milovidov. Join us for a fun evening of talks, food and discussion!
* [**ClickHouse Meetup at Microsoft Office in Seattle**](https://www.meetup.com/clickhouse-seattle-user-group/events/290310025/) - Jan 18 - Keep an eye on this space as we will be announcing speakers soon!

View File

@@ -131,7 +131,7 @@ def parse_env_variables(
ARM_V80COMPAT_SUFFIX = "-aarch64-v80compat"
FREEBSD_SUFFIX = "-freebsd"
PPC_SUFFIX = "-ppc64le"
AMD64_SSE2_SUFFIX = "-amd64sse2"
AMD64_COMPAT_SUFFIX = "-amd64-compat"
result = []
result.append("OUTPUT_DIR=/output")
@@ -144,7 +144,7 @@ def parse_env_variables(
is_cross_arm_v80compat = compiler.endswith(ARM_V80COMPAT_SUFFIX)
is_cross_ppc = compiler.endswith(PPC_SUFFIX)
is_cross_freebsd = compiler.endswith(FREEBSD_SUFFIX)
is_amd64_sse2 = compiler.endswith(AMD64_SSE2_SUFFIX)
is_amd64_compat = compiler.endswith(AMD64_COMPAT_SUFFIX)
if is_cross_darwin:
cc = compiler[: -len(DARWIN_SUFFIX)]
@@ -197,8 +197,8 @@ def parse_env_variables(
cmake_flags.append(
"-DCMAKE_TOOLCHAIN_FILE=/build/cmake/linux/toolchain-ppc64le.cmake"
)
elif is_amd64_sse2:
cc = compiler[: -len(AMD64_SSE2_SUFFIX)]
elif is_amd64_compat:
cc = compiler[: -len(AMD64_COMPAT_SUFFIX)]
result.append("DEB_ARCH=amd64")
cmake_flags.append("-DNO_SSE3_OR_HIGHER=1")
else:
@@ -358,7 +358,7 @@ if __name__ == "__main__":
"clang-15-aarch64",
"clang-15-aarch64-v80compat",
"clang-15-ppc64le",
"clang-15-amd64sse2",
"clang-15-amd64-compat",
"clang-15-freebsd",
"gcc-11",
),

View File

@@ -17,6 +17,7 @@ ENV S3_URL="https://clickhouse-datasets.s3.amazonaws.com"
ENV DATASETS="hits visits"
RUN npm install -g azurite
RUN npm install tslib
COPY run.sh /
CMD ["/bin/bash", "/run.sh"]

View File

@@ -80,6 +80,7 @@ ENV MINIO_ROOT_PASSWORD="clickhouse"
ENV EXPORT_S3_STORAGE_POLICIES=1
RUN npm install -g azurite
RUN npm install tslib
COPY run.sh /
COPY setup_minio.sh /

View File

@@ -9,14 +9,22 @@ if [ "${OS}" = "Linux" ]
then
if [ "${ARCH}" = "x86_64" -o "${ARCH}" = "amd64" ]
then
DIR="amd64"
# Require at least x86-64 + SSE4.2 (introduced in 2006). On older hardware fall back to plain x86-64 (introduced in 1999) which
# guarantees at least SSE2. The caveat is that plain x86-64 builds are much less tested than SSE 4.2 builds.
HAS_SSE42=$(grep sse4_2 /proc/cpuinfo)
if [ "${HAS_SSE42}" ]
then
DIR="amd64"
else
DIR="amd64compat"
fi
elif [ "${ARCH}" = "aarch64" -o "${ARCH}" = "arm64" ]
then
# If the system has >=ARMv8.2 (https://en.wikipedia.org/wiki/AArch64), choose the corresponding build, else fall back to a v8.0
# compat build. Unfortunately, the ARM ISA level cannot be read directly, we need to guess from the "features" in /proc/cpuinfo.
# Also, the flags in /proc/cpuinfo are named differently than the flags passed to the compiler (cmake/cpu_features.cmake).
ARMV82=$(grep -m 1 'Features' /proc/cpuinfo | awk '/asimd/ && /sha1/ && /aes/ && /atomics/ && /lrcpc/')
if [ "${ARMV82}" ]
HAS_ARMV82=$(grep -m 1 'Features' /proc/cpuinfo | awk '/asimd/ && /sha1/ && /aes/ && /atomics/ && /lrcpc/')
if [ "${HAS_ARMV82}" ]
then
DIR="aarch64"
else

View File

@@ -1,15 +1,15 @@
---
slug: /en/development/build-cross-osx
sidebar_position: 66
title: How to Build ClickHouse on Linux for Mac OS X
sidebar_label: Build on Linux for Mac OS X
title: How to Build ClickHouse on Linux for macOS
sidebar_label: Build on Linux for macOS
---
This is for the case when you have a Linux machine and want to use it to build a `clickhouse` binary that will run on macOS.
This is intended for continuous integration checks that run on Linux servers. If you want to build ClickHouse directly on Mac OS X, then proceed with [another instruction](../development/build-osx.md).
This is intended for continuous integration checks that run on Linux servers. If you want to build ClickHouse directly on macOS, then proceed with [another instruction](../development/build-osx.md).
The cross-build for Mac OS X is based on the [Build instructions](../development/build.md), follow them first.
The cross-build for macOS is based on the [Build instructions](../development/build.md), follow them first.
## Install Clang-14

View File

@@ -1,9 +1,9 @@
---
slug: /en/development/build-osx
sidebar_position: 65
sidebar_label: Build on Mac OS X
title: How to Build ClickHouse on Mac OS X
description: How to build ClickHouse on Mac OS X
sidebar_label: Build on macOS
title: How to Build ClickHouse on macOS
description: How to build ClickHouse on macOS
---
:::info You don't have to build ClickHouse yourself!

View File

@@ -7,7 +7,7 @@ description: Prerequisites and an overview of how to build ClickHouse
# Getting Started Guide for Building ClickHouse
The building of ClickHouse is supported on Linux, FreeBSD and Mac OS X.
The building of ClickHouse is supported on Linux, FreeBSD and macOS.
If you use Windows, you need to create a virtual machine with Ubuntu. To start working with a virtual machine please install VirtualBox. You can download Ubuntu from the website: https://www.ubuntu.com/#download. Please create a virtual machine from the downloaded image (you should reserve at least 4GB of RAM for it). To run a command-line terminal in Ubuntu, please locate a program containing the word “terminal” in its name (gnome-terminal, konsole etc.) or just press Ctrl+Alt+T.
@@ -194,7 +194,7 @@ In this case, ClickHouse will use config files located in the current directory.
To connect to ClickHouse with clickhouse-client in another terminal, navigate to `ClickHouse/build/programs/` and run `./clickhouse client`.
If you get `Connection refused` message on Mac OS X or FreeBSD, try specifying host address 127.0.0.1:
If you get a `Connection refused` message on macOS or FreeBSD, try specifying the host address 127.0.0.1:
clickhouse client --host 127.0.0.1
@@ -213,7 +213,7 @@ You can also run your custom-built ClickHouse binary with the config file from t
## IDE (Integrated Development Environment) {#ide-integrated-development-environment}
If you do not know which IDE to use, we recommend that you use CLion. CLion is commercial software, but it offers 30 days free trial period. It is also free of charge for students. CLion can be used both on Linux and on Mac OS X.
If you do not know which IDE to use, we recommend that you use CLion. CLion is commercial software, but it offers a 30-day free trial. It is also free of charge for students. CLion can be used both on Linux and on macOS.
KDevelop and QTCreator are other great alternative IDEs for developing ClickHouse. KDevelop comes in as a very handy IDE, although unstable. If KDevelop crashes a while after opening a project, click the “Stop All” button as soon as it has opened the list of project files. After doing so, KDevelop should be fine to work with.

View File

@@ -139,7 +139,7 @@ If the system clickhouse-server is already running and you do not want to stop i
Build tests allow checking that the build is not broken on various alternative configurations and on some foreign systems. These tests are automated as well.
Examples:
- cross-compile for Darwin x86_64 (Mac OS X)
- cross-compile for Darwin x86_64 (macOS)
- cross-compile for FreeBSD x86_64
- cross-compile for Linux AArch64
- build on Ubuntu with libraries from system packages (discouraged)

View File

@@ -9,7 +9,7 @@ slug: /en/install
You have three options for getting up and running with ClickHouse:
- **[ClickHouse Cloud](https://clickhouse.com/cloud/):** The official ClickHouse as a service, built by, maintained, and supported by the creators of ClickHouse
- **[Self-managed ClickHouse](#self-managed-install):** ClickHouse can run on any Linux, FreeBSD, or Mac OS X with x86-64, ARM, or PowerPC64LE CPU architecture
- **[Self-managed ClickHouse](#self-managed-install):** ClickHouse can run on any Linux, FreeBSD, or macOS with x86-64, ARM, or PowerPC64LE CPU architecture
- **[Docker Image](https://hub.docker.com/r/clickhouse/clickhouse-server/):** Read the guide with the official image in Docker Hub
## ClickHouse Cloud
@@ -257,7 +257,7 @@ To run ClickHouse inside Docker follow the guide on [Docker Hub](https://hub.doc
### From Sources {#from-sources}
To manually compile ClickHouse, follow the instructions for [Linux](/docs/en/development/build.md) or [Mac OS X](/docs/en/development/build-osx.md).
To manually compile ClickHouse, follow the instructions for [Linux](/docs/en/development/build.md) or [macOS](/docs/en/development/build-osx.md).
You can compile packages and install them or use programs without installing packages.
@@ -352,7 +352,7 @@ To continue experimenting, you can download one of the test data sets or go thro
## Recommendations for Self-Managed ClickHouse
ClickHouse can run on any Linux, FreeBSD, or Mac OS X with x86-64, ARM, or PowerPC64LE CPU architecture.
ClickHouse can run on any Linux, FreeBSD, or macOS with x86-64, ARM, or PowerPC64LE CPU architecture.
ClickHouse uses all hardware resources available to process data.

View File

@@ -890,7 +890,7 @@ The maximum number of open files.
By default: `maximum`.
We recommend using this option in Mac OS X since the `getrlimit()` function returns an incorrect value.
We recommend using this option on macOS since the `getrlimit()` function returns an incorrect value.
**Example**

View File

@@ -3447,13 +3447,45 @@ Default value: 2.
## compatibility {#compatibility}
This setting changes other settings according to the provided ClickHouse version.
If a behaviour in ClickHouse was changed by using a different default value for some setting, this compatibility setting allows you to use default values from previous versions for all the settings that were not set by the user.
The `compatibility` setting causes ClickHouse to use the default settings of a previous version of ClickHouse, where the previous version is provided as the setting's value.
This setting takes ClickHouse version number as a string, like `21.3`, `21.8`. Empty value means that this setting is disabled.
If settings are set to non-default values, then those settings are honored (only settings that have not been modified are affected by the `compatibility` setting).
This setting takes a ClickHouse version number as a string, like `22.3`, `22.8`. An empty value means that this setting is disabled.
Disabled by default.
:::note
In ClickHouse Cloud the compatibility setting must be set by ClickHouse Cloud support. Please [open a case](https://clickhouse.cloud/support) to have it set.
:::
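For example, a minimal sketch (any released version string can stand in for `22.8`):
```sql
-- Settings not explicitly set by the user fall back to their 22.8 defaults.
SET compatibility = '22.8';
```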
## allow_settings_after_format_in_insert {#allow_settings_after_format_in_insert}
Controls whether `SETTINGS` after `FORMAT` in `INSERT` queries is allowed. It is not recommended to use this, since part of `SETTINGS` may be interpreted as values.
Example:
```sql
INSERT INTO FUNCTION null('foo String') SETTINGS max_threads=1 VALUES ('bar');
```
But the following query will work only with `allow_settings_after_format_in_insert` enabled:
```sql
SET allow_settings_after_format_in_insert=1;
INSERT INTO FUNCTION null('foo String') VALUES ('bar') SETTINGS max_threads=1;
```
Possible values:
- 0 — Disallow.
- 1 — Allow.
Default value: `0`.
!!! note "Warning"
Use this setting only for backward compatibility if your use cases depend on the old syntax.
# Format settings {#format-settings}
## input_format_skip_unknown_fields {#input_format_skip_unknown_fields}
@@ -3588,6 +3620,13 @@ y Nullable(String)
z IPv4
```
## schema_inference_make_columns_nullable {#schema_inference_make_columns_nullable}
Controls making inferred types `Nullable` in schema inference for formats without information about nullability.
If the setting is enabled, the inferred type will be `Nullable` only if the column contains `NULL` in a sample that is parsed during schema inference.
Default value: `false`.
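A small sketch using the `format` table function (the inline JSON sample is illustrative; whether the inferred columns are wrapped in `Nullable` follows this setting):
```sql
SET schema_inference_make_columns_nullable = 1;
DESC format(JSONEachRow, '{"x" : 1, "y" : null}');
```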
## input_format_try_infer_integers {#input_format_try_infer_integers}
If enabled, ClickHouse will try to infer integers instead of floats in schema inference for text formats. If all numbers in the column from the input data are integers, the result type will be `Int64`; if at least one number is a float, the result type will be `Float64`.
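For instance, a sketch with inline data (inferred types may additionally be wrapped in `Nullable` depending on `schema_inference_make_columns_nullable`):
```sql
SET input_format_try_infer_integers = 1;
DESC format(JSONEachRow, '{"n" : 1}\n{"n" : 2}');   -- all values are integers: Int64
DESC format(JSONEachRow, '{"n" : 1}\n{"n" : 2.5}'); -- one value is a float: Float64
```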

View File

@@ -1104,6 +1104,7 @@ Using replacement fields, you can define a pattern for the resulting string. “
| %d | day of the month, zero-padded (01-31) | 02 |
| %D | Short MM/DD/YY date, equivalent to %m/%d/%y | 01/02/18 |
| %e | day of the month, space-padded ( 1-31) | &nbsp; 2 |
| %f | fractional second from the fractional part of DateTime64 | 1234560 |
| %F | short YYYY-MM-DD date, equivalent to %Y-%m-%d | 2018-01-02 |
| %G | four-digit year format for ISO week number, calculated from the week-based year [defined by the ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Week_dates) standard, normally useful only with %V | 2018 |
| %g | two-digit year format, aligned to ISO 8601, abbreviated from four-digit notation | 18 |
@@ -1143,6 +1144,20 @@ Result:
└────────────────────────────────────────────┘
```
Query:
``` sql
SELECT formatDateTime(toDateTime64('2010-01-04 12:34:56.123456', 7), '%f')
```
Result:
```
┌─formatDateTime(toDateTime64('2010-01-04 12:34:56.123456', 7), '%f')─┐
│ 1234560 │
└─────────────────────────────────────────────────────────────────────┘
```
## dateName
Returns the specified part of a date.

View File

@@ -595,9 +595,9 @@ SELECT xxHash64('')
**Returned value**
A `Uint32` or `Uint64` data type hash value.
A `UInt32` or `UInt64` data type hash value.
Type: `xxHash`.
Type: `UInt32` for `xxHash32` and `UInt64` for `xxHash64`.
**Example**

View File

@@ -75,6 +75,10 @@ void SettingsProfileElement::init(const ASTSettingsProfileElement & ast, const A
}
}
bool SettingsProfileElement::isConstraint() const
{
return this->writability || !this->min_value.isNull() || !this->max_value.isNull();
}
std::shared_ptr<ASTSettingsProfileElement> SettingsProfileElement::toAST() const
{
@@ -213,7 +217,7 @@ SettingsConstraints SettingsProfileElements::toSettingsConstraints(const AccessC
{
SettingsConstraints res{access_control};
for (const auto & elem : *this)
if (!elem.setting_name.empty() && elem.setting_name != ALLOW_BACKUP_SETTING_NAME)
if (!elem.setting_name.empty() && elem.isConstraint() && elem.setting_name != ALLOW_BACKUP_SETTING_NAME)
res.set(
elem.setting_name,
elem.min_value,

View File

@@ -44,6 +44,8 @@ struct SettingsProfileElement
std::shared_ptr<ASTSettingsProfileElement> toAST() const;
std::shared_ptr<ASTSettingsProfileElement> toASTWithNames(const AccessControl & access_control) const;
bool isConstraint() const;
private:
void init(const ASTSettingsProfileElement & ast, const AccessControl * access_control);
};
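The net effect of `isConstraint()` is that a profile element contributes to `SettingsConstraints` only when it actually carries a constraint (`MIN`, `MAX`, or writability), not when it merely assigns a value. A hedged SQL illustration (the profile name is hypothetical):
```sql
-- The MIN/MAX clause makes this element a constraint; a bare `max_threads = 8`
-- assignment alone would no longer be registered as one.
CREATE SETTINGS PROFILE limited_threads SETTINGS max_threads = 8 MIN 1 MAX 16;
```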

View File

@@ -91,7 +91,8 @@ bool SortNode::isEqualImpl(const IQueryTreeNode & rhs) const
void SortNode::updateTreeHashImpl(HashState & hash_state) const
{
hash_state.update(sort_direction);
hash_state.update(nulls_sort_direction);
/// Use a deterministic value if `nulls_sort_direction` is `nullopt`.
hash_state.update(nulls_sort_direction.value_or(sort_direction));
hash_state.update(with_fill);
if (collator)

View File

@@ -0,0 +1,76 @@
#pragma once
#include <Common/hex.h>
namespace DB
{
static void inline hexStringDecode(const char * pos, const char * end, char *& out, size_t word_size = 2)
{
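/// Odd number of hex digits: decode the leading digit on its own, as if it were zero-padded.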
if ((end - pos) & 1)
{
*out = unhex(*pos);
++out;
++pos;
}
while (pos < end)
{
*out = unhex2(pos);
pos += word_size;
++out;
}
*out = '\0';
++out;
}
static void inline binStringDecode(const char * pos, const char * end, char *& out)
{
if (pos == end)
{
*out = '\0';
++out;
return;
}
UInt8 left = 0;
/// end - pos is the length of the input.
/// Consume the leading (length & 7) bits first so that the remaining length is a multiple of 8.
/// E.g. if the length is 9 and the input is "101000001":
/// first left_cnt is 1; after consuming the leading '1', pos is 1 and left is 1;
/// then left_cnt is 0 and the remaining input is "01000001".
for (UInt8 left_cnt = (end - pos) & 7; left_cnt > 0; --left_cnt)
{
left = left << 1;
if (*pos != '0')
left += 1;
++pos;
}
if (left != 0 || end - pos == 0)
{
*out = left;
++out;
}
assert((end - pos) % 8 == 0);
while (end - pos != 0)
{
UInt8 c = 0;
for (UInt8 i = 0; i < 8; ++i)
{
c = c << 1;
if (*pos != '0')
c += 1;
++pos;
}
*out = c;
++out;
}
*out = '\0';
++out;
}
}
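These helpers appear to back SQL-level functions such as `unhex` and `unbin`; a sketch of the grouping rules described above (expected results reasoned from the code, not verified output):
```sql
SELECT hex(unhex('161'));       -- odd digit count: leading '1' becomes its own byte, giving '0161'
SELECT hex(unbin('101000001')); -- 9 bits: the leading bit becomes its own byte, giving '0141'
```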

View File

@@ -240,24 +240,52 @@ TEST(Common, RWLockPerfTestReaders)
for (auto pool_size : pool_sizes)
{
Stopwatch watch(CLOCK_MONOTONIC_COARSE);
auto func = [&] ()
{
for (auto i = 0; i < cycles; ++i)
{
auto lock = fifo_lock->getLock(RWLockImpl::Read, RWLockImpl::NO_QUERY);
}
};
std::list<std::thread> threads;
for (size_t thread = 0; thread < pool_size; ++thread)
threads.emplace_back(func);
for (auto & thread : threads)
thread.join();
auto total_time = watch.elapsedSeconds();
std::cout << "Threads " << pool_size << ", total_time " << std::setprecision(2) << total_time << "\n";
}
}
TEST(Common, RWLockNotUpgradeableWithNoQuery)
{
updatePHDRCache();
static auto rw_lock = RWLockImpl::create();
std::thread read_thread([&] ()
{
auto lock = rw_lock->getLock(RWLockImpl::Read, RWLockImpl::NO_QUERY, std::chrono::duration<int, std::milli>(50000));
auto sleep_for = std::chrono::duration<int, std::milli>(5000);
std::this_thread::sleep_for(sleep_for);
});
{
auto sleep_for = std::chrono::duration<int, std::milli>(500);
std::this_thread::sleep_for(sleep_for);
Stopwatch watch(CLOCK_MONOTONIC_COARSE);
auto get_lock = rw_lock->getLock(RWLockImpl::Write, RWLockImpl::NO_QUERY, std::chrono::duration<int, std::milli>(50000));
EXPECT_NE(get_lock.get(), nullptr);
/// It took some time
EXPECT_GT(watch.elapsedMilliseconds(), 3000);
}
read_thread.join();
}

View File

@@ -620,6 +620,7 @@ static constexpr UInt64 operator""_GiB(unsigned long long value)
M(Bool, enable_filesystem_cache_on_lower_level, true, "If read buffer supports caching inside threadpool, allow it to do it, otherwise cache outside of threadpool. Do not use this setting, it is needed for testing", 0) \
M(Bool, skip_download_if_exceeds_query_cache, true, "Skip download from remote filesystem if exceeds query cache size", 0) \
M(UInt64, max_query_cache_size, (128UL * 1024 * 1024 * 1024), "Max remote filesystem cache size that can be used by a single query", 0) \
M(Bool, throw_on_error_from_cache_on_write_operations, false, "Ignore error from cache when caching on write operations (INSERT, merges)", 0) \
\
M(Bool, load_marks_asynchronously, false, "Load MergeTree marks asynchronously", 0) \
\
@@ -764,6 +765,7 @@ static constexpr UInt64 operator""_GiB(unsigned long long value)
M(Bool, input_format_arrow_skip_columns_with_unsupported_types_in_schema_inference, false, "Skip columns with unsupported types while schema inference for format Arrow", 0) \
M(String, column_names_for_schema_inference, "", "The list of column names to use in schema inference for formats without column names. The format: 'column1,column2,column3,...'", 0) \
M(String, schema_inference_hints, "", "The list of column names and types to use in schema inference for formats without column names. The format: 'column_name1 column_type1, column_name2 column_type2, ...'", 0) \
M(Bool, schema_inference_make_columns_nullable, true, "If set to true, all inferred types will be Nullable in schema inference for formats without information about nullability.", 0) \
M(Bool, input_format_json_read_bools_as_numbers, true, "Allow to parse bools as numbers in JSON input formats", 0) \
M(Bool, input_format_json_try_infer_numbers_from_strings, true, "Try to infer numbers from string fields while schema inference", 0) \
M(Bool, input_format_json_validate_types_from_metadata, true, "For JSON/JSONCompact/JSONColumnsWithMetadata input formats this controls whether format parser should check if data types from input metadata match data types of the corresponding columns from the table", 0) \

View File

@@ -3,6 +3,7 @@
#include <cstdint>
#include <string>
#include <vector>
#include <unordered_set>
#include <base/strong_typedef.h>
#include <base/Decimal.h>
#include <base/defines.h>
@@ -93,4 +94,5 @@
/// Not a data type in the database, defined just for convenience.
using Strings = std::vector<String>;
using TypeIndexesSet = std::unordered_set<TypeIndex>;
}

View File

@@ -1025,9 +1025,6 @@ void BaseDaemon::setupWatchdog()
#if defined(OS_LINUX)
if (0 != prctl(PR_SET_PDEATHSIG, SIGKILL))
logger().warning("Cannot do prctl to ask termination with parent.");
if (getppid() == 1)
throw Poco::Exception("Parent watchdog process has exited.");
#endif
{

View File

@@ -41,6 +41,8 @@ public:
SerializationPtr doGetDefaultSerialization() const override;
bool hasNullableSubcolumns() const { return is_nullable; }
const String & getSchemaFormat() const { return schema_format; }
};
}

View File

@@ -8,74 +8,62 @@
namespace DB
{
void transformTypesRecursively(DataTypes & types, std::function<void(DataTypes &)> transform_simple_types, std::function<void(DataTypes &)> transform_complex_types)
void transformTypesRecursively(DataTypes & types, std::function<void(DataTypes &, const TypeIndexesSet &)> transform_simple_types, std::function<void(DataTypes &, const TypeIndexesSet &)> transform_complex_types)
{
TypeIndexesSet type_indexes;
for (const auto & type : types)
type_indexes.insert(type->getTypeId());
/// Arrays
if (type_indexes.contains(TypeIndex::Array))
{
/// Arrays
bool have_array = false;
bool all_arrays = true;
DataTypes nested_types;
for (const auto & type : types)
/// All types are Array
if (type_indexes.size() == 1)
{
if (const DataTypeArray * type_array = typeid_cast<const DataTypeArray *>(type.get()))
{
have_array = true;
nested_types.push_back(type_array->getNestedType());
}
else
all_arrays = false;
DataTypes nested_types;
for (const auto & type : types)
nested_types.push_back(typeid_cast<const DataTypeArray *>(type.get())->getNestedType());
transformTypesRecursively(nested_types, transform_simple_types, transform_complex_types);
for (size_t i = 0; i != types.size(); ++i)
types[i] = std::make_shared<DataTypeArray>(nested_types[i]);
}
if (have_array)
{
if (all_arrays)
{
transformTypesRecursively(nested_types, transform_simple_types, transform_complex_types);
for (size_t i = 0; i != types.size(); ++i)
types[i] = std::make_shared<DataTypeArray>(nested_types[i]);
}
if (transform_complex_types)
transform_complex_types(types, type_indexes);
if (transform_complex_types)
transform_complex_types(types);
return;
}
return;
}
/// Tuples
if (type_indexes.contains(TypeIndex::Tuple))
{
/// Tuples
bool have_tuple = false;
bool all_tuples = true;
size_t tuple_size = 0;
std::vector<DataTypes> nested_types;
for (const auto & type : types)
/// All types are Tuple
if (type_indexes.size() == 1)
{
if (const DataTypeTuple * type_tuple = typeid_cast<const DataTypeTuple *>(type.get()))
{
if (!have_tuple)
{
tuple_size = type_tuple->getElements().size();
nested_types.resize(tuple_size);
for (size_t elem_idx = 0; elem_idx < tuple_size; ++elem_idx)
nested_types[elem_idx].reserve(types.size());
}
else if (tuple_size != type_tuple->getElements().size())
return;
std::vector<DataTypes> nested_types;
const DataTypeTuple * type_tuple = typeid_cast<const DataTypeTuple *>(types[0].get());
size_t tuple_size = type_tuple->getElements().size();
nested_types.resize(tuple_size);
for (size_t elem_idx = 0; elem_idx < tuple_size; ++elem_idx)
nested_types[elem_idx].reserve(types.size());
have_tuple = true;
bool sizes_are_equal = true;
for (const auto & type : types)
{
type_tuple = typeid_cast<const DataTypeTuple *>(type.get());
if (type_tuple->getElements().size() != tuple_size)
{
sizes_are_equal = false;
break;
}
for (size_t elem_idx = 0; elem_idx < tuple_size; ++elem_idx)
nested_types[elem_idx].emplace_back(type_tuple->getElements()[elem_idx]);
}
else
all_tuples = false;
}
if (have_tuple)
{
if (all_tuples)
if (sizes_are_equal)
{
std::vector<DataTypes> transposed_nested_types(types.size());
for (size_t elem_idx = 0; elem_idx < tuple_size; ++elem_idx)
@@ -88,56 +76,47 @@ void transformTypesRecursively(DataTypes & types, std::function<void(DataTypes &
for (size_t i = 0; i != types.size(); ++i)
types[i] = std::make_shared<DataTypeTuple>(transposed_nested_types[i]);
}
if (transform_complex_types)
transform_complex_types(types);
return;
}
if (transform_complex_types)
transform_complex_types(types, type_indexes);
return;
}
/// Maps
if (type_indexes.contains(TypeIndex::Map))
{
/// Maps
bool have_maps = false;
bool all_maps = true;
DataTypes key_types;
DataTypes value_types;
key_types.reserve(types.size());
value_types.reserve(types.size());
for (const auto & type : types)
/// All types are Map
if (type_indexes.size() == 1)
{
if (const DataTypeMap * type_map = typeid_cast<const DataTypeMap *>(type.get()))
DataTypes key_types;
DataTypes value_types;
key_types.reserve(types.size());
value_types.reserve(types.size());
for (const auto & type : types)
{
have_maps = true;
const DataTypeMap * type_map = typeid_cast<const DataTypeMap *>(type.get());
key_types.emplace_back(type_map->getKeyType());
value_types.emplace_back(type_map->getValueType());
}
else
all_maps = false;
transformTypesRecursively(key_types, transform_simple_types, transform_complex_types);
transformTypesRecursively(value_types, transform_simple_types, transform_complex_types);
for (size_t i = 0; i != types.size(); ++i)
types[i] = std::make_shared<DataTypeMap>(key_types[i], value_types[i]);
}
if (have_maps)
{
if (all_maps)
{
transformTypesRecursively(key_types, transform_simple_types, transform_complex_types);
transformTypesRecursively(value_types, transform_simple_types, transform_complex_types);
if (transform_complex_types)
transform_complex_types(types, type_indexes);
for (size_t i = 0; i != types.size(); ++i)
types[i] = std::make_shared<DataTypeMap>(key_types[i], value_types[i]);
}
if (transform_complex_types)
transform_complex_types(types);
return;
}
return;
}
/// Nullable
if (type_indexes.contains(TypeIndex::Nullable))
{
/// Nullable
bool have_nullable = false;
std::vector<UInt8> is_nullable;
is_nullable.reserve(types.size());
DataTypes nested_types;
@@ -146,7 +125,6 @@ void transformTypesRecursively(DataTypes & types, std::function<void(DataTypes &
{
if (const DataTypeNullable * type_nullable = typeid_cast<const DataTypeNullable *>(type.get()))
{
have_nullable = true;
is_nullable.push_back(1);
nested_types.push_back(type_nullable->getNestedType());
}
@@ -157,28 +135,28 @@ void transformTypesRecursively(DataTypes & types, std::function<void(DataTypes &
}
}
if (have_nullable)
transformTypesRecursively(nested_types, transform_simple_types, transform_complex_types);
for (size_t i = 0; i != types.size(); ++i)
{
transformTypesRecursively(nested_types, transform_simple_types, transform_complex_types);
for (size_t i = 0; i != types.size(); ++i)
{
if (is_nullable[i])
types[i] = makeNullable(nested_types[i]);
else
types[i] = nested_types[i];
}
return;
if (is_nullable[i])
types[i] = makeNullable(nested_types[i]);
else
types[i] = nested_types[i];
}
if (transform_complex_types)
transform_complex_types(types, type_indexes);
return;
}
transform_simple_types(types);
transform_simple_types(types, type_indexes);
}
void callOnNestedSimpleTypes(DataTypePtr & type, std::function<void(DataTypePtr &)> callback)
{
DataTypes types = {type};
transformTypesRecursively(types, [callback](auto & data_types){ callback(data_types[0]); }, {});
transformTypesRecursively(types, [callback](auto & data_types, const TypeIndexesSet &){ callback(data_types[0]); }, {});
}
}

View File

@@ -12,7 +12,7 @@ namespace DB
/// If not all types are the same complex type (Array/Map/Tuple), this function won't be called for nested types.
/// Function transform_simple_types will be applied to the resulting simple types after all recursive calls.
/// Function transform_complex_types will be applied to complex types (Array/Map/Tuple) after the recursive call to their nested types.
void transformTypesRecursively(DataTypes & types, std::function<void(DataTypes &)> transform_simple_types, std::function<void(DataTypes &)> transform_complex_types);
void transformTypesRecursively(DataTypes & types, std::function<void(DataTypes &, const TypeIndexesSet &)> transform_simple_types, std::function<void(DataTypes &, const TypeIndexesSet &)> transform_complex_types);
void callOnNestedSimpleTypes(DataTypePtr & type, std::function<void(DataTypePtr &)> callback);

View File

@@ -256,6 +256,9 @@ void DatabaseOrdinary::startupTables(ThreadPool & thread_pool, LoadingStrictness
auto startup_one_table = [&](const StoragePtr & table)
{
/// Since the startup() method can use physical paths on disk, we don't allow any exclusive actions (rename, drop and so on)
/// until startup has finished.
auto table_lock_holder = table->lockForShare(RWLockImpl::NO_QUERY, getContext()->getSettingsRef().lock_acquire_timeout);
table->startup();
logAboutProgress(log, ++tables_processed, total_tables, watch);
};

View File

@@ -44,10 +44,10 @@ FileSegmentRangeWriter::FileSegmentRangeWriter(
const String & source_path_)
: cache(cache_)
, key(key_)
, log(&Poco::Logger::get("FileSegmentRangeWriter"))
, cache_log(cache_log_)
, query_id(query_id_)
, source_path(source_path_)
, current_file_segment_it(file_segments_holder.file_segments.end())
{
}
@@ -56,69 +56,68 @@ bool FileSegmentRangeWriter::write(const char * data, size_t size, size_t offset
if (finalized)
return false;
if (expected_write_offset != offset)
{
throw Exception(
ErrorCodes::LOGICAL_ERROR,
"Cannot write file segment at offset {}, because expected write offset is: {}",
offset, expected_write_offset);
}
auto & file_segments = file_segments_holder.file_segments;
if (current_file_segment_it == file_segments.end())
if (file_segments.empty() || file_segments.back()->isDownloaded())
{
current_file_segment_it = allocateFileSegment(current_file_segment_write_offset, is_persistent);
}
else
{
auto file_segment = *current_file_segment_it;
assert(file_segment->getCurrentWriteOffset() == current_file_segment_write_offset);
if (current_file_segment_write_offset != offset)
{
throw Exception(
ErrorCodes::LOGICAL_ERROR,
"Cannot write file segment at offset {}, because current write offset is: {}",
offset, current_file_segment_write_offset);
}
if (file_segment->range().size() == file_segment->getDownloadedSize())
{
completeFileSegment(*file_segment);
current_file_segment_it = allocateFileSegment(current_file_segment_write_offset, is_persistent);
}
allocateFileSegment(expected_write_offset, is_persistent);
}
auto & file_segment = *current_file_segment_it;
auto downloader = file_segment->getOrSetDownloader();
if (downloader != FileSegment::getCallerId())
throw Exception(ErrorCodes::LOGICAL_ERROR, "Failed to set a downloader. ({})", file_segment->getInfoForLog());
auto & file_segment = file_segments.back();
SCOPE_EXIT({
if (file_segment->isDownloader())
file_segment->completePartAndResetDownloader();
if (file_segments.back()->isDownloader())
file_segments.back()->completePartAndResetDownloader();
});
bool reserved = file_segment->reserve(size);
if (!reserved)
while (size > 0)
{
file_segment->completeWithState(FileSegment::State::PARTIALLY_DOWNLOADED_NO_CONTINUATION);
appendFilesystemCacheLog(*file_segment);
size_t available_size = file_segment->range().size() - file_segment->getDownloadedSize();
if (available_size == 0)
{
completeFileSegment(*file_segment);
file_segment = allocateFileSegment(expected_write_offset, is_persistent);
continue;
}
LOG_DEBUG(
&Poco::Logger::get("FileSegmentRangeWriter"),
"Unsuccessful space reservation attempt (size: {}, file segment info: {}",
size, file_segment->getInfoForLog());
if (!file_segment->isDownloader()
&& file_segment->getOrSetDownloader() != FileSegment::getCallerId())
{
throw Exception(ErrorCodes::LOGICAL_ERROR,
"Failed to set a downloader. ({})", file_segment->getInfoForLog());
}
return false;
}
size_t size_to_write = std::min(available_size, size);
try
{
file_segment->write(data, size, offset);
}
catch (...)
{
bool reserved = file_segment->reserve(size_to_write);
if (!reserved)
{
file_segment->completeWithState(FileSegment::State::PARTIALLY_DOWNLOADED_NO_CONTINUATION);
appendFilesystemCacheLog(*file_segment);
LOG_DEBUG(
log, "Failed to reserve space in cache (size: {}, file segment info: {}",
size, file_segment->getInfoForLog());
return false;
}
file_segment->write(data, size_to_write, offset);
file_segment->completePartAndResetDownloader();
throw;
}
file_segment->completePartAndResetDownloader();
current_file_segment_write_offset += size;
size -= size_to_write;
expected_write_offset += size_to_write;
offset += size_to_write;
data += size_to_write;
}
return true;
}
@@ -129,10 +128,10 @@ void FileSegmentRangeWriter::finalize()
return;
auto & file_segments = file_segments_holder.file_segments;
if (file_segments.empty() || current_file_segment_it == file_segments.end())
if (file_segments.empty())
return;
completeFileSegment(**current_file_segment_it);
completeFileSegment(*file_segments.back());
finalized = true;
}
@@ -149,7 +148,7 @@ FileSegmentRangeWriter::~FileSegmentRangeWriter()
}
}
FileSegments::iterator FileSegmentRangeWriter::allocateFileSegment(size_t offset, bool is_persistent)
FileSegmentPtr & FileSegmentRangeWriter::allocateFileSegment(size_t offset, bool is_persistent)
{
/**
* Allocate a new file segment starting at `offset`.
@@ -168,7 +167,8 @@ FileSegments::iterator FileSegmentRangeWriter::allocateFileSegment(size_t offset
auto file_segment = cache->createFileSegmentForDownload(
key, offset, cache->max_file_segment_size, create_settings, cache_lock);
return file_segments_holder.add(std::move(file_segment));
auto & file_segments = file_segments_holder.file_segments;
return *file_segments.insert(file_segments.end(), file_segment);
}
void FileSegmentRangeWriter::appendFilesystemCacheLog(const FileSegment & file_segment)
@@ -199,7 +199,7 @@ void FileSegmentRangeWriter::appendFilesystemCacheLog(const FileSegment & file_s
void FileSegmentRangeWriter::completeFileSegment(FileSegment & file_segment)
{
/// File segment can be detached if space reservation failed.
if (file_segment.isDetached())
if (file_segment.isDetached() || file_segment.isCompleted())
return;
file_segment.completeWithoutState();
@@ -223,6 +223,7 @@ CachedOnDiskWriteBufferFromFile::CachedOnDiskWriteBufferFromFile(
, is_persistent_cache_file(is_persistent_cache_file_)
, query_id(query_id_)
, enable_cache_log(!query_id_.empty() && settings_.enable_filesystem_cache_log)
, throw_on_error_from_cache(settings_.throw_on_error_from_cache)
{
}
@@ -246,11 +247,11 @@ void CachedOnDiskWriteBufferFromFile::nextImpl()
}
/// Write data to cache.
cacheData(working_buffer.begin(), size);
cacheData(working_buffer.begin(), size, throw_on_error_from_cache);
current_download_offset += size;
}
void CachedOnDiskWriteBufferFromFile::cacheData(char * data, size_t size)
void CachedOnDiskWriteBufferFromFile::cacheData(char * data, size_t size, bool throw_on_error)
{
if (cache_in_error_state_or_disabled)
return;
@@ -285,11 +286,17 @@ void CachedOnDiskWriteBufferFromFile::cacheData(char * data, size_t size)
return;
}
if (throw_on_error)
throw;
tryLogCurrentException(__PRETTY_FUNCTION__);
return;
}
catch (...)
{
if (throw_on_error)
throw;
tryLogCurrentException(__PRETTY_FUNCTION__);
return;
}

View File

@@ -39,7 +39,7 @@ public:
~FileSegmentRangeWriter();
private:
FileSegments::iterator allocateFileSegment(size_t offset, bool is_persistent);
FileSegmentPtr & allocateFileSegment(size_t offset, bool is_persistent);
void appendFilesystemCacheLog(const FileSegment & file_segment);
@@ -48,14 +48,14 @@ private:
FileCache * cache;
FileSegment::Key key;
Poco::Logger * log;
std::shared_ptr<FilesystemCacheLog> cache_log;
String query_id;
String source_path;
FileSegmentsHolder file_segments_holder{};
FileSegments::iterator current_file_segment_it;
size_t current_file_segment_write_offset = 0;
size_t expected_write_offset = 0;
bool finalized = false;
};
@@ -81,7 +81,7 @@ public:
void finalizeImpl() override;
private:
void cacheData(char * data, size_t size);
void cacheData(char * data, size_t size, bool throw_on_error);
Poco::Logger * log;
@@ -95,6 +95,7 @@ private:
bool enable_cache_log;
bool throw_on_error_from_cache;
bool cache_in_error_state_or_disabled = false;
std::unique_ptr<FileSegmentRangeWriter> cache_writer;

View File

@@ -7,6 +7,8 @@ namespace DB
{
static const uint8_t BSON_DOCUMENT_END = 0x00;
static const size_t BSON_OBJECT_ID_SIZE = 12;
static const size_t BSON_DB_POINTER_SIZE = 12;
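/// Per the BSON spec, ObjectId is a fixed 12-byte value; a DBPointer is a namespace string followed by a 12-byte ObjectId.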
using BSONSizeT = uint32_t;
static const BSONSizeT MAX_BSON_SIZE = std::numeric_limits<BSONSizeT>::max();

View File

@@ -32,6 +32,16 @@ namespace ErrorCodes
extern const int BAD_ARGUMENTS;
}
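/// Splits a composite column name such as "nested_field" or "nested.field" at the first '_' or '.'
/// into the outer CapnProto field name and the remaining nested path.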
std::pair<String, String> splitCapnProtoFieldName(const String & name)
{
const auto * begin = name.data();
const auto * end = name.data() + name.size();
const auto * it = find_first_symbols<'_', '.'>(begin, end);
String first = String(begin, it);
String second = it == end ? "" : String(it + 1, end);
return {first, second};
}
capnp::StructSchema CapnProtoSchemaParser::getMessageSchema(const FormatSchemaInfo & schema_info)
{
capnp::ParsedSchema schema;
@@ -201,9 +211,9 @@ static bool checkEnums(const capnp::Type & capnp_type, const DataTypePtr column_
return result;
}
static bool checkCapnProtoType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message);
static bool checkCapnProtoType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message, const String & column_name);
static bool checkNullableType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message)
static bool checkNullableType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message, const String & column_name)
{
if (!capnp_type.isStruct())
return false;
@@ -222,9 +232,9 @@ static bool checkNullableType(const capnp::Type & capnp_type, const DataTypePtr
auto nested_type = assert_cast<const DataTypeNullable *>(data_type.get())->getNestedType();
if (first.getType().isVoid())
return checkCapnProtoType(second.getType(), nested_type, mode, error_message);
return checkCapnProtoType(second.getType(), nested_type, mode, error_message, column_name);
if (second.getType().isVoid())
return checkCapnProtoType(first.getType(), nested_type, mode, error_message);
return checkCapnProtoType(first.getType(), nested_type, mode, error_message, column_name);
return false;
}
@@ -260,7 +270,7 @@ static bool checkTupleType(const capnp::Type & capnp_type, const DataTypePtr & d
{
KJ_IF_MAYBE(field, struct_schema.findFieldByName(name))
{
if (!checkCapnProtoType(field->getType(), nested_types[tuple_data_type->getPositionByName(name)], mode, error_message))
if (!checkCapnProtoType(field->getType(), nested_types[tuple_data_type->getPositionByName(name)], mode, error_message, name))
return false;
}
else
@@ -273,16 +283,28 @@ static bool checkTupleType(const capnp::Type & capnp_type, const DataTypePtr & d
return true;
}
static bool checkArrayType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message)
static bool checkArrayType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message, const String & column_name)
{
if (!capnp_type.isList())
return false;
auto list_schema = capnp_type.asList();
auto nested_type = assert_cast<const DataTypeArray *>(data_type.get())->getNestedType();
return checkCapnProtoType(list_schema.getElementType(), nested_type, mode, error_message);
auto [field_name, nested_name] = splitCapnProtoFieldName(column_name);
if (!nested_name.empty() && list_schema.getElementType().isStruct())
{
auto struct_schema = list_schema.getElementType().asStruct();
KJ_IF_MAYBE(field, struct_schema.findFieldByName(nested_name))
return checkCapnProtoType(field->getType(), nested_type, mode, error_message, nested_name);
error_message += "Element type of List {} doesn't contain field with name " + nested_name;
return false;
}
return checkCapnProtoType(list_schema.getElementType(), nested_type, mode, error_message, column_name);
}
static bool checkCapnProtoType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message)
static bool checkCapnProtoType(const capnp::Type & capnp_type, const DataTypePtr & data_type, FormatSettings::EnumComparingMode mode, String & error_message, const String & column_name)
{
switch (data_type->getTypeId())
{
@@ -301,9 +323,11 @@ static bool checkCapnProtoType(const capnp::Type & capnp_type, const DataTypePtr
case TypeIndex::Int16:
return capnp_type.isInt16();
case TypeIndex::Date32: [[fallthrough]];
case TypeIndex::Decimal32: [[fallthrough]];
case TypeIndex::Int32:
return capnp_type.isInt32();
case TypeIndex::DateTime64: [[fallthrough]];
case TypeIndex::Decimal64: [[fallthrough]];
case TypeIndex::Int64:
return capnp_type.isInt64();
case TypeIndex::Float32:
@@ -318,15 +342,15 @@ static bool checkCapnProtoType(const capnp::Type & capnp_type, const DataTypePtr
return checkTupleType(capnp_type, data_type, mode, error_message);
case TypeIndex::Nullable:
{
auto result = checkNullableType(capnp_type, data_type, mode, error_message);
auto result = checkNullableType(capnp_type, data_type, mode, error_message, column_name);
if (!result)
error_message += "Nullable can be represented only as a named union of type Void and nested type";
return result;
}
case TypeIndex::Array:
return checkArrayType(capnp_type, data_type, mode, error_message);
return checkArrayType(capnp_type, data_type, mode, error_message, column_name);
case TypeIndex::LowCardinality:
return checkCapnProtoType(capnp_type, assert_cast<const DataTypeLowCardinality *>(data_type.get())->getDictionaryType(), mode, error_message);
return checkCapnProtoType(capnp_type, assert_cast<const DataTypeLowCardinality *>(data_type.get())->getDictionaryType(), mode, error_message, column_name);
case TypeIndex::FixedString: [[fallthrough]];
case TypeIndex::String:
return capnp_type.isText() || capnp_type.isData();
@@ -335,19 +359,9 @@ static bool checkCapnProtoType(const capnp::Type & capnp_type, const DataTypePtr
}
}
static std::pair<String, String> splitFieldName(const String & name)
{
const auto * begin = name.data();
const auto * end = name.data() + name.size();
const auto * it = find_first_symbols<'_', '.'>(begin, end);
String first = String(begin, it);
String second = it == end ? "" : String(it + 1, end);
return {first, second};
}
capnp::DynamicValue::Reader getReaderByColumnName(const capnp::DynamicStruct::Reader & struct_reader, const String & name)
{
auto [field_name, nested_name] = splitFieldName(name);
auto [field_name, nested_name] = splitCapnProtoFieldName(name);
KJ_IF_MAYBE(field, struct_reader.getSchema().findFieldByName(field_name))
{
capnp::DynamicValue::Reader field_reader;
@@ -363,6 +377,20 @@ capnp::DynamicValue::Reader getReaderByColumnName(const capnp::DynamicStruct::Re
if (nested_name.empty())
return field_reader;
/// Support reading Nested as List of Structs.
if (field_reader.getType() == capnp::DynamicValue::LIST)
{
auto list_schema = field->getType().asList();
if (!list_schema.getElementType().isStruct())
throw Exception(ErrorCodes::CAPN_PROTO_BAD_CAST, "Element type of List {} is not a struct", field_name);
auto struct_schema = list_schema.getElementType().asStruct();
KJ_IF_MAYBE(nested_field, struct_schema.findFieldByName(nested_name))
return field_reader;
throw Exception(ErrorCodes::THERE_IS_NO_COLUMN, "Element type of List {} doesn't contain field with name \"{}\"", field_name, nested_name);
}
if (field_reader.getType() != capnp::DynamicValue::STRUCT)
throw Exception(ErrorCodes::CAPN_PROTO_BAD_CAST, "Field {} is not a struct", field_name);
@@ -374,13 +402,28 @@ capnp::DynamicValue::Reader getReaderByColumnName(const capnp::DynamicStruct::Re
std::pair<capnp::DynamicStruct::Builder, capnp::StructSchema::Field> getStructBuilderAndFieldByColumnName(capnp::DynamicStruct::Builder struct_builder, const String & name)
{
auto [field_name, nested_name] = splitFieldName(name);
auto [field_name, nested_name] = splitCapnProtoFieldName(name);
KJ_IF_MAYBE(field, struct_builder.getSchema().findFieldByName(field_name))
{
if (nested_name.empty())
return {struct_builder, *field};
auto field_builder = struct_builder.get(*field);
/// Support reading Nested as List of Structs.
if (field_builder.getType() == capnp::DynamicValue::LIST)
{
auto list_schema = field->getType().asList();
if (!list_schema.getElementType().isStruct())
throw Exception(ErrorCodes::CAPN_PROTO_BAD_CAST, "Element type of List {} is not a struct", field_name);
auto struct_schema = list_schema.getElementType().asStruct();
KJ_IF_MAYBE(nested_field, struct_schema.findFieldByName(nested_name))
return {struct_builder, *field};
throw Exception(ErrorCodes::THERE_IS_NO_COLUMN, "Element type of List {} doesn't contain field with name \"{}\"", field_name, nested_name);
}
if (field_builder.getType() != capnp::DynamicValue::STRUCT)
throw Exception(ErrorCodes::CAPN_PROTO_BAD_CAST, "Field {} is not a struct", field_name);
@@ -390,13 +433,27 @@ std::pair<capnp::DynamicStruct::Builder, capnp::StructSchema::Field> getStructBu
throw Exception(ErrorCodes::THERE_IS_NO_COLUMN, "Capnproto struct doesn't contain field with name {}", field_name);
}
static capnp::StructSchema::Field getFieldByName(const capnp::StructSchema & schema, const String & name)
static std::pair<capnp::StructSchema::Field, String> getFieldByName(const capnp::StructSchema & schema, const String & name)
{
auto [field_name, nested_name] = splitFieldName(name);
auto [field_name, nested_name] = splitCapnProtoFieldName(name);
KJ_IF_MAYBE(field, schema.findFieldByName(field_name))
{
if (nested_name.empty())
return *field;
return {*field, name};
/// Support reading Nested as List of Structs.
if (field->getType().isList())
{
auto list_schema = field->getType().asList();
if (!list_schema.getElementType().isStruct())
throw Exception(ErrorCodes::CAPN_PROTO_BAD_CAST, "Element type of List {} is not a struct", field_name);
auto struct_schema = list_schema.getElementType().asStruct();
KJ_IF_MAYBE(nested_field, struct_schema.findFieldByName(nested_name))
return {*field, name};
throw Exception(ErrorCodes::THERE_IS_NO_COLUMN, "Element type of List {} doesn't contain field with name \"{}\"", field_name, nested_name);
}
if (!field->getType().isStruct())
throw Exception(ErrorCodes::CAPN_PROTO_BAD_CAST, "Field {} is not a struct", field_name);
@@ -416,8 +473,8 @@ void checkCapnProtoSchemaStructure(const capnp::StructSchema & schema, const Blo
String additional_error_message;
for (auto & [name, type] : names_and_types)
{
auto field = getFieldByName(schema, name);
if (!checkCapnProtoType(field.getType(), type, mode, additional_error_message))
auto [field, field_name] = getFieldByName(schema, name);
if (!checkCapnProtoType(field.getType(), type, mode, additional_error_message, field_name))
{
auto e = Exception(
ErrorCodes::CAPN_PROTO_BAD_CAST,

View File

@@ -30,6 +30,8 @@ public:
capnp::StructSchema getMessageSchema(const FormatSchemaInfo & schema_info);
};
std::pair<String, String> splitCapnProtoFieldName(const String & name);
bool compareEnumNames(const String & first, const String & second, FormatSettings::EnumComparingMode mode);
std::pair<capnp::DynamicStruct::Builder, capnp::StructSchema::Field> getStructBuilderAndFieldByColumnName(capnp::DynamicStruct::Builder struct_builder, const String & name);

View File

@@ -1,21 +1,11 @@
#include <Formats/EscapingRuleUtils.h>
#include <Formats/JSONUtils.h>
#include <Formats/ReadSchemaUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <DataTypes/Serializations/SerializationNullable.h>
#include <DataTypes/DataTypeString.h>
#include <DataTypes/DataTypesNumber.h>
#include <DataTypes/DataTypeNullable.h>
#include <DataTypes/DataTypeFactory.h>
#include <DataTypes/DataTypeArray.h>
#include <DataTypes/DataTypeNothing.h>
#include <DataTypes/DataTypeTuple.h>
#include <DataTypes/DataTypeDate.h>
#include <DataTypes/DataTypeDateTime64.h>
#include <DataTypes/DataTypeLowCardinality.h>
#include <DataTypes/DataTypeMap.h>
#include <DataTypes/DataTypeObject.h>
#include <DataTypes/getLeastSupertype.h>
#include <DataTypes/transformTypesRecursively.h>
#include <IO/ReadHelpers.h>
#include <IO/WriteHelpers.h>
#include <IO/ReadBufferFromString.h>
@@ -261,556 +251,76 @@ String readStringByEscapingRule(ReadBuffer & buf, FormatSettings::EscapingRule e
return readByEscapingRule<true>(buf, escaping_rule, format_settings);
}
void transformInferredTypesIfNeededImpl(DataTypes & types, const FormatSettings & settings, bool is_json, const std::unordered_set<const IDataType *> * numbers_parsed_from_json_strings = nullptr)
{
/// Do nothing if we didn't try to infer something special.
if (!settings.try_infer_integers && !settings.try_infer_dates && !settings.try_infer_datetimes && !is_json)
return;
auto transform_simple_types = [&](DataTypes & data_types)
{
/// If we have floats and integers convert them all to float.
if (settings.try_infer_integers)
{
bool have_floats = false;
bool have_integers = false;
for (const auto & type : data_types)
{
have_floats |= isFloat(type);
have_integers |= isInteger(type) && !isBool(type);
}
if (have_floats && have_integers)
{
for (auto & type : data_types)
{
if (isInteger(type))
type = std::make_shared<DataTypeFloat64>();
}
}
}
/// If we have only dates and datetimes, convert dates to datetime.
/// If we have dates/datetimes and something else, convert them to String.
/// There is a special case when we inferred both Date/DateTime and Int64 from Strings,
/// for example: "arr: ["2020-01-01", "2000"]" -> Tuple(Date, Int64),
/// so if we have Date/DateTime and something else (not only String) we should
/// convert Date/DateTime back to String; then we will be able to
/// convert Int64 back to String as well.
if (settings.try_infer_dates || settings.try_infer_datetimes)
{
bool have_dates = false;
bool have_datetimes = false;
bool all_dates_or_datetimes = true;
for (const auto & type : data_types)
{
have_dates |= isDate(type);
have_datetimes |= isDateTime64(type);
all_dates_or_datetimes &= isDate(type) || isDateTime64(type);
}
if (!all_dates_or_datetimes && (have_dates || have_datetimes))
{
for (auto & type : data_types)
{
if (isDate(type) || isDateTime64(type))
type = std::make_shared<DataTypeString>();
}
}
else if (have_dates && have_datetimes)
{
for (auto & type : data_types)
{
if (isDate(type))
type = std::make_shared<DataTypeDateTime64>(9);
}
}
}
if (!is_json)
return;
/// Check settings specific for JSON formats.
/// If we have numbers and strings, convert numbers to strings.
if (settings.json.try_infer_numbers_from_strings || settings.json.read_numbers_as_strings)
{
bool have_strings = false;
bool have_numbers = false;
for (const auto & type : data_types)
{
have_strings |= isString(type);
have_numbers |= isNumber(type);
}
if (have_strings && have_numbers)
{
for (auto & type : data_types)
{
if (isNumber(type)
&& (settings.json.read_numbers_as_strings || !numbers_parsed_from_json_strings
|| numbers_parsed_from_json_strings->contains(type.get())))
type = std::make_shared<DataTypeString>();
}
}
}
if (settings.json.read_bools_as_numbers)
{
/// Note that have_floats and have_integers cannot both be true,
/// because one of the previous checks converts integers to floats
/// when both are present.
bool have_floats = false;
bool have_integers = false;
bool have_bools = false;
for (const auto & type : data_types)
{
have_floats |= isFloat(type);
have_integers |= isInteger(type) && !isBool(type);
have_bools |= isBool(type);
}
if (have_bools && (have_integers || have_floats))
{
for (auto & type : data_types)
{
if (isBool(type))
{
if (have_integers)
type = std::make_shared<DataTypeInt64>();
else
type = std::make_shared<DataTypeFloat64>();
}
}
}
}
};
auto transform_complex_types = [&](DataTypes & data_types)
{
if (!is_json)
return;
bool have_maps = false;
bool have_objects = false;
bool have_strings = false;
bool are_maps_equal = true;
DataTypePtr first_map_type;
for (const auto & type : data_types)
{
if (isMap(type))
{
if (!have_maps)
{
first_map_type = type;
have_maps = true;
}
else
{
are_maps_equal &= type->equals(*first_map_type);
}
}
else if (isObject(type))
{
have_objects = true;
}
else if (isString(type))
{
have_strings = true;
}
}
if (have_maps && (have_objects || !are_maps_equal))
{
for (auto & type : data_types)
{
if (isMap(type))
type = std::make_shared<DataTypeObject>("json", true);
}
}
if (settings.json.read_objects_as_strings && have_strings && (have_maps || have_objects))
{
for (auto & type : data_types)
{
if (isMap(type) || isObject(type))
type = std::make_shared<DataTypeString>();
}
}
};
transformTypesRecursively(types, transform_simple_types, transform_complex_types);
}
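As a rough standalone sketch of the integer/float unification rule implemented by transform_simple_types above (a toy type enum stands in for the real IDataType hierarchy; all names here are hypothetical):

#include <iostream>
#include <vector>

/// Toy stand-ins for inferred types (hypothetical, not ClickHouse's IDataType).
enum class ToyType { Int64, Float64, String, Bool };

/// Mirrors the rule above: if a column was inferred as Int64 in some rows and
/// Float64 in others, promote every integer to Float64 so a common type exists.
void unifyIntsAndFloats(std::vector<ToyType> & types)
{
    bool have_floats = false;
    bool have_integers = false;
    for (auto t : types)
    {
        have_floats |= (t == ToyType::Float64);
        have_integers |= (t == ToyType::Int64);
    }
    if (have_floats && have_integers)
        for (auto & t : types)
            if (t == ToyType::Int64)
                t = ToyType::Float64;
}

int main()
{
    std::vector<ToyType> types{ToyType::Int64, ToyType::Float64, ToyType::Int64};
    unifyIntsAndFloats(types);
    std::cout << (types[0] == ToyType::Float64) << '\n'; /// prints 1
}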
void transformInferredTypesIfNeeded(DataTypes & types, const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule)
{
transformInferredTypesIfNeededImpl(types, settings, escaping_rule == FormatSettings::EscapingRule::JSON);
}
void transformInferredTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule)
{
DataTypes types = {first, second};
transformInferredTypesIfNeeded(types, settings, escaping_rule);
first = std::move(types[0]);
second = std::move(types[1]);
}
void transformInferredJSONTypesIfNeeded(DataTypes & types, const FormatSettings & settings, const std::unordered_set<const IDataType *> * numbers_parsed_from_json_strings)
{
transformInferredTypesIfNeededImpl(types, settings, true, numbers_parsed_from_json_strings);
}
void transformInferredJSONTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings)
{
DataTypes types = {first, second};
transformInferredJSONTypesIfNeeded(types, settings);
first = std::move(types[0]);
second = std::move(types[1]);
}
bool tryInferDate(const std::string_view & field)
{
ReadBufferFromString buf(field);
DayNum tmp;
return tryReadDateText(tmp, buf) && buf.eof();
}
bool tryInferDateTime(const std::string_view & field, const FormatSettings & settings)
{
if (field.empty())
return false;
ReadBufferFromString buf(field);
Float64 tmp_float;
/// Check if it's just a number, and if so, don't try to infer DateTime from it,
/// because we can interpret this number as a timestamp and it will lead to
/// inferring DateTime instead of simple Int64/Float64 in some cases.
if (tryReadFloatText(tmp_float, buf) && buf.eof())
return false;
buf.seek(0, SEEK_SET); /// Return position to the beginning
DateTime64 tmp;
switch (settings.date_time_input_format)
{
case FormatSettings::DateTimeInputFormat::Basic:
if (tryReadDateTime64Text(tmp, 9, buf) && buf.eof())
return true;
break;
case FormatSettings::DateTimeInputFormat::BestEffort:
if (tryParseDateTime64BestEffort(tmp, 9, buf, DateLUT::instance(), DateLUT::instance("UTC")) && buf.eof())
return true;
break;
case FormatSettings::DateTimeInputFormat::BestEffortUS:
if (tryParseDateTime64BestEffortUS(tmp, 9, buf, DateLUT::instance(), DateLUT::instance("UTC")) && buf.eof())
return true;
break;
}
return false;
}
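The "reject plain numbers first" guard above can be shown in isolation (std::from_chars stands in for ClickHouse's tryReadFloatText; the actual DateTime parsing is out of scope here):

#include <charconv>
#include <iostream>
#include <string_view>

/// Returns true iff the whole field parses as a number. Mirrors the guard above:
/// a bare number like "1670000000" must stay Int64/Float64 instead of being
/// interpreted as a timestamp and inferred as DateTime.
static bool isPlainNumber(std::string_view field)
{
    double value = 0;
    auto [ptr, ec] = std::from_chars(field.data(), field.data() + field.size(), value);
    return ec == std::errc() && ptr == field.data() + field.size();
}

int main()
{
    std::cout << isPlainNumber("1670000000") << '\n'; /// 1: skip DateTime inference
    std::cout << isPlainNumber("2022-12-22") << '\n'; /// 0: worth trying the DateTime parsers
}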
DataTypePtr tryInferDateOrDateTime(const std::string_view & field, const FormatSettings & settings)
{
if (settings.try_infer_dates && tryInferDate(field))
return makeNullable(std::make_shared<DataTypeDate>());
if (settings.try_infer_datetimes && tryInferDateTime(field, settings))
return makeNullable(std::make_shared<DataTypeDateTime64>(9));
return nullptr;
}
static DataTypePtr determineDataTypeForSingleFieldImpl(ReadBufferFromString & buf, const FormatSettings & settings)
{
if (buf.eof())
return nullptr;
/// Array
if (checkChar('[', buf))
{
skipWhitespaceIfAny(buf);
DataTypes nested_types;
bool first = true;
while (!buf.eof() && *buf.position() != ']')
{
if (!first)
{
skipWhitespaceIfAny(buf);
if (!checkChar(',', buf))
return nullptr;
skipWhitespaceIfAny(buf);
}
else
first = false;
auto nested_type = determineDataTypeForSingleFieldImpl(buf, settings);
if (!nested_type)
return nullptr;
nested_types.push_back(nested_type);
}
if (buf.eof())
return nullptr;
++buf.position();
if (nested_types.empty())
return std::make_shared<DataTypeArray>(std::make_shared<DataTypeNothing>());
transformInferredTypesIfNeeded(nested_types, settings);
auto least_supertype = tryGetLeastSupertype(nested_types);
if (!least_supertype)
return nullptr;
return std::make_shared<DataTypeArray>(least_supertype);
}
/// Tuple
if (checkChar('(', buf))
{
skipWhitespaceIfAny(buf);
DataTypes nested_types;
bool first = true;
while (!buf.eof() && *buf.position() != ')')
{
if (!first)
{
skipWhitespaceIfAny(buf);
if (!checkChar(',', buf))
return nullptr;
skipWhitespaceIfAny(buf);
}
else
first = false;
auto nested_type = determineDataTypeForSingleFieldImpl(buf, settings);
if (!nested_type)
return nullptr;
nested_types.push_back(nested_type);
}
if (buf.eof() || nested_types.empty())
return nullptr;
++buf.position();
return std::make_shared<DataTypeTuple>(nested_types);
}
/// Map
if (checkChar('{', buf))
{
skipWhitespaceIfAny(buf);
DataTypes key_types;
DataTypes value_types;
bool first = true;
while (!buf.eof() && *buf.position() != '}')
{
if (!first)
{
skipWhitespaceIfAny(buf);
if (!checkChar(',', buf))
return nullptr;
skipWhitespaceIfAny(buf);
}
else
first = false;
auto key_type = determineDataTypeForSingleFieldImpl(buf, settings);
if (!key_type)
return nullptr;
key_types.push_back(key_type);
skipWhitespaceIfAny(buf);
if (!checkChar(':', buf))
return nullptr;
skipWhitespaceIfAny(buf);
auto value_type = determineDataTypeForSingleFieldImpl(buf, settings);
if (!value_type)
return nullptr;
value_types.push_back(value_type);
}
if (buf.eof())
return nullptr;
++buf.position();
skipWhitespaceIfAny(buf);
if (key_types.empty())
return std::make_shared<DataTypeMap>(std::make_shared<DataTypeNothing>(), std::make_shared<DataTypeNothing>());
transformInferredTypesIfNeeded(key_types, settings);
transformInferredTypesIfNeeded(value_types, settings);
auto key_least_supertype = tryGetLeastSupertype(key_types);
auto value_least_supertype = tryGetLeastSupertype(value_types);
if (!key_least_supertype || !value_least_supertype)
return nullptr;
if (!DataTypeMap::checkKeyType(key_least_supertype))
return nullptr;
return std::make_shared<DataTypeMap>(key_least_supertype, value_least_supertype);
}
/// String
if (*buf.position() == '\'')
{
++buf.position();
String field;
while (!buf.eof())
{
char * next_pos = find_first_symbols<'\\', '\''>(buf.position(), buf.buffer().end());
field.append(buf.position(), next_pos);
buf.position() = next_pos;
if (!buf.hasPendingData())
continue;
if (*buf.position() == '\'')
break;
field.push_back(*buf.position());
if (*buf.position() == '\\')
++buf.position();
}
if (buf.eof())
return nullptr;
++buf.position();
if (auto type = tryInferDateOrDateTime(field, settings))
return type;
return std::make_shared<DataTypeString>();
}
/// Bool
if (checkStringCaseInsensitive("true", buf) || checkStringCaseInsensitive("false", buf))
return DataTypeFactory::instance().get("Bool");
/// Null
if (checkStringCaseInsensitive("NULL", buf))
return std::make_shared<DataTypeNothing>();
/// Number
Float64 tmp;
auto * pos_before_float = buf.position();
if (tryReadFloatText(tmp, buf))
{
if (settings.try_infer_integers)
{
auto * float_end_pos = buf.position();
buf.position() = pos_before_float;
Int64 tmp_int;
if (tryReadIntText(tmp_int, buf) && buf.position() == float_end_pos)
return std::make_shared<DataTypeInt64>();
buf.position() = float_end_pos;
}
return std::make_shared<DataTypeFloat64>();
}
return nullptr;
}
static DataTypePtr determineDataTypeForSingleField(ReadBufferFromString & buf, const FormatSettings & settings)
{
return makeNullableRecursivelyAndCheckForNothing(determineDataTypeForSingleFieldImpl(buf, settings));
}
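The recursive structure of determineDataTypeForSingleFieldImpl can be condensed into a toy recursive-descent classifier (type names as strings; it covers only arrays, quoted strings, bools, NULL and Int64/Float64, while the real code additionally handles tuples, maps, dates and the format settings):

#include <cctype>
#include <charconv>
#include <iostream>
#include <optional>
#include <string>
#include <string_view>

static std::optional<std::string> inferField(std::string_view s, size_t & pos);

static void skipWs(std::string_view s, size_t & pos)
{
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos])))
        ++pos;
}

/// Toy classifier: consumes one value from `s` starting at `pos` and returns a type name.
static std::optional<std::string> inferField(std::string_view s, size_t & pos)
{
    skipWs(s, pos);
    if (pos >= s.size())
        return std::nullopt;

    /// Array: infer element types recursively and require a single common type.
    if (s[pos] == '[')
    {
        ++pos;
        std::optional<std::string> nested;
        bool first = true;
        while (true)
        {
            skipWs(s, pos);
            if (pos >= s.size())
                return std::nullopt;
            if (s[pos] == ']')
            {
                ++pos;
                break;
            }
            if (!first)
            {
                if (s[pos] != ',')
                    return std::nullopt;
                ++pos;
            }
            first = false;
            auto elem = inferField(s, pos);
            if (!elem || (nested && *nested != *elem))
                return std::nullopt; /// The real code tries type transforms and supertypes here.
            nested = elem;
        }
        return "Array(" + nested.value_or("Nothing") + ")";
    }

    /// Quoted string (toy: no escape handling).
    if (s[pos] == '\'')
    {
        auto end = s.find('\'', pos + 1);
        if (end == std::string_view::npos)
            return std::nullopt;
        pos = end + 1;
        return "String";
    }

    if (s.substr(pos, 4) == "NULL") { pos += 4; return "Nothing"; }
    if (s.substr(pos, 4) == "true") { pos += 4; return "Bool"; }
    if (s.substr(pos, 5) == "false") { pos += 5; return "Bool"; }

    /// Number: prefer Int64 and fall back to Float64, like try_infer_integers.
    long long int_value = 0;
    auto [int_end, int_ec] = std::from_chars(s.data() + pos, s.data() + s.size(), int_value);
    double float_value = 0;
    auto [float_end, float_ec] = std::from_chars(s.data() + pos, s.data() + s.size(), float_value);
    if (float_ec != std::errc())
        return std::nullopt;
    bool is_int = int_ec == std::errc() && int_end == float_end;
    pos = float_end - s.data();
    return is_int ? "Int64" : "Float64";
}

int main()
{
    size_t p1 = 0, p2 = 0;
    std::cout << inferField("['a', 'b']", p1).value_or("?") << '\n'; /// Array(String)
    std::cout << inferField("[1, 2.5]", p2).value_or("?") << '\n';   /// "?": the toy has no int->float transform
}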
DataTypePtr determineDataTypeByEscapingRule(const String & field, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule)
DataTypePtr tryInferDataTypeByEscapingRule(const String & field, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule, JSONInferenceInfo * json_info)
{
switch (escaping_rule)
{
case FormatSettings::EscapingRule::Quoted:
{
ReadBufferFromString buf(field);
auto type = determineDataTypeForSingleField(buf, format_settings);
return buf.eof() ? type : nullptr;
}
return tryInferDataTypeForSingleField(field, format_settings);
case FormatSettings::EscapingRule::JSON:
return JSONUtils::getDataTypeFromField(field, format_settings);
return tryInferDataTypeForSingleJSONField(field, format_settings, json_info);
case FormatSettings::EscapingRule::CSV:
{
if (!format_settings.csv.use_best_effort_in_schema_inference)
return makeNullable(std::make_shared<DataTypeString>());
return std::make_shared<DataTypeString>();
if (field.empty() || field == format_settings.csv.null_representation)
if (field.empty())
return nullptr;
if (field == format_settings.bool_false_representation || field == format_settings.bool_true_representation)
return DataTypeFactory::instance().get("Nullable(Bool)");
if (field == format_settings.csv.null_representation)
return makeNullable(std::make_shared<DataTypeNothing>());
if (field == format_settings.bool_false_representation || field == format_settings.bool_true_representation)
return DataTypeFactory::instance().get("Bool");
/// In CSV complex types are serialized in quotes. If we have quotes, we should try to infer type
/// from data inside quotes.
if (field.size() > 1 && ((field.front() == '\'' && field.back() == '\'') || (field.front() == '"' && field.back() == '"')))
{
auto data = std::string_view(field.data() + 1, field.size() - 2);
if (auto date_type = tryInferDateOrDateTime(data, format_settings))
/// First, try to infer dates and datetimes.
if (auto date_type = tryInferDateOrDateTimeFromString(data, format_settings))
return date_type;
ReadBufferFromString buf(data);
/// Try to determine the type of value inside quotes
auto type = determineDataTypeForSingleField(buf, format_settings);
auto type = tryInferDataTypeForSingleField(data, format_settings);
if (!type)
return nullptr;
/// If it's a number or tuple in quotes or there is some unread data in buffer, we determine it as a string.
if (isNumber(removeNullable(type)) || isTuple(type) || !buf.eof())
return makeNullable(std::make_shared<DataTypeString>());
/// If we couldn't infer any type or it's a number or tuple in quotes, we determine it as a string.
if (!type || isNumber(removeNullable(type)) || isTuple(type))
return std::make_shared<DataTypeString>();
return type;
}
/// Case when the CSV value is not in quotes. Check if it's a number, and if not, treat it as a string.
if (format_settings.try_infer_integers)
{
ReadBufferFromString buf(field);
Int64 tmp_int;
if (tryReadIntText(tmp_int, buf) && buf.eof())
return makeNullable(std::make_shared<DataTypeInt64>());
}
auto type = tryInferNumberFromString(field, format_settings);
ReadBufferFromString buf(field);
Float64 tmp;
if (tryReadFloatText(tmp, buf) && buf.eof())
return makeNullable(std::make_shared<DataTypeFloat64>());
if (!type)
return std::make_shared<DataTypeString>();
return makeNullable(std::make_shared<DataTypeString>());
return type;
}
case FormatSettings::EscapingRule::Raw: [[fallthrough]];
case FormatSettings::EscapingRule::Escaped:
{
if (!format_settings.tsv.use_best_effort_in_schema_inference)
return makeNullable(std::make_shared<DataTypeString>());
return std::make_shared<DataTypeString>();
if (field.empty() || field == format_settings.tsv.null_representation)
if (field.empty())
return nullptr;
if (field == format_settings.bool_false_representation || field == format_settings.bool_true_representation)
return DataTypeFactory::instance().get("Nullable(Bool)");
if (field == format_settings.tsv.null_representation)
return makeNullable(std::make_shared<DataTypeNothing>());
if (auto date_type = tryInferDateOrDateTime(field, format_settings))
if (field == format_settings.bool_false_representation || field == format_settings.bool_true_representation)
return DataTypeFactory::instance().get("Bool");
if (auto date_type = tryInferDateOrDateTimeFromString(field, format_settings))
return date_type;
ReadBufferFromString buf(field);
auto type = determineDataTypeForSingleField(buf, format_settings);
if (!buf.eof())
return makeNullable(std::make_shared<DataTypeString>());
auto type = tryInferDataTypeForSingleField(field, format_settings);
if (!type)
return std::make_shared<DataTypeString>();
return type;
}
default:
@ -818,15 +328,34 @@ DataTypePtr determineDataTypeByEscapingRule(const String & field, const FormatSe
}
}
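tryInferNumberFromString, which replaces the removed inline Int64/Float64 probing in the CSV branch, can be approximated standalone (std::from_chars stands in for tryReadIntText/tryReadFloatText; names are illustrative):

#include <charconv>
#include <iostream>
#include <optional>
#include <string>
#include <string_view>

/// Hypothetical stand-in for tryInferNumberFromString: Int64 is preferred when
/// integer inference is enabled, Float64 otherwise; nullopt means "not a number".
static std::optional<std::string> inferNumber(std::string_view field, bool try_infer_integers)
{
    const char * begin = field.data();
    const char * end = begin + field.size();
    if (try_infer_integers)
    {
        long long int_value = 0;
        auto [ptr, ec] = std::from_chars(begin, end, int_value);
        if (ec == std::errc() && ptr == end)
            return "Int64";
    }
    double float_value = 0;
    auto [ptr, ec] = std::from_chars(begin, end, float_value);
    if (ec == std::errc() && ptr == end)
        return "Float64";
    return std::nullopt; /// The caller falls back to String.
}

int main()
{
    std::cout << inferNumber("42", true).value_or("String") << '\n';    /// Int64
    std::cout << inferNumber("42.5", true).value_or("String") << '\n';  /// Float64
    std::cout << inferNumber("42abc", true).value_or("String") << '\n'; /// String
}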
DataTypes determineDataTypesByEscapingRule(const std::vector<String> & fields, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule)
DataTypes tryInferDataTypesByEscapingRule(const std::vector<String> & fields, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule, JSONInferenceInfo * json_info)
{
DataTypes data_types;
data_types.reserve(fields.size());
for (const auto & field : fields)
data_types.push_back(determineDataTypeByEscapingRule(field, format_settings, escaping_rule));
data_types.push_back(tryInferDataTypeByEscapingRule(field, format_settings, escaping_rule, json_info));
return data_types;
}
void transformInferredTypesByEscapingRuleIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule, JSONInferenceInfo * json_info)
{
switch (escaping_rule)
{
case FormatSettings::EscapingRule::JSON:
transformInferredJSONTypesIfNeeded(first, second, settings, json_info);
break;
case FormatSettings::EscapingRule::Escaped: [[fallthrough]];
case FormatSettings::EscapingRule::Raw: [[fallthrough]];
case FormatSettings::EscapingRule::Quoted: [[fallthrough]];
case FormatSettings::EscapingRule::CSV:
transformInferredTypesIfNeeded(first, second, settings);
break;
default:
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Cannot transform inferred types for value with {} escaping rule", escapingRuleToString(escaping_rule));
}
}
DataTypePtr getDefaultDataTypeForEscapingRule(FormatSettings::EscapingRule escaping_rule)
{
switch (escaping_rule)
@ -834,7 +363,7 @@ DataTypePtr getDefaultDataTypeForEscapingRule(FormatSettings::EscapingRule escap
case FormatSettings::EscapingRule::CSV:
case FormatSettings::EscapingRule::Escaped:
case FormatSettings::EscapingRule::Raw:
return makeNullable(std::make_shared<DataTypeString>());
return std::make_shared<DataTypeString>();
default:
return nullptr;
}
@ -851,9 +380,10 @@ DataTypes getDefaultDataTypeForEscapingRules(const std::vector<FormatSettings::E
String getAdditionalFormatInfoForAllRowBasedFormats(const FormatSettings & settings)
{
return fmt::format(
"schema_inference_hints={}, max_rows_to_read_for_schema_inference={}",
"schema_inference_hints={}, max_rows_to_read_for_schema_inference={}, schema_inference_make_columns_nullable={}",
settings.schema_inference_hints,
settings.max_rows_to_read_for_schema_inference);
settings.max_rows_to_read_for_schema_inference,
settings.schema_inference_make_columns_nullable);
}
String getAdditionalFormatInfoByEscapingRule(const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule)
@ -890,7 +420,11 @@ String getAdditionalFormatInfoByEscapingRule(const FormatSettings & settings, Fo
settings.csv.tuple_delimiter);
break;
case FormatSettings::EscapingRule::JSON:
result += fmt::format(", try_infer_numbers_from_strings={}, read_bools_as_numbers={}", settings.json.try_infer_numbers_from_strings, settings.json.read_bools_as_numbers);
result += fmt::format(
", try_infer_numbers_from_strings={}, read_bools_as_numbers={}, try_infer_objects={}",
settings.json.try_infer_numbers_from_strings,
settings.json.read_bools_as_numbers,
settings.json.try_infer_objects);
break;
default:
break;

View File

@ -1,6 +1,7 @@
#pragma once
#include <Formats/FormatSettings.h>
#include <Formats/SchemaInferenceUtils.h>
#include <DataTypes/IDataType.h>
#include <DataTypes/Serializations/ISerialization.h>
#include <IO/ReadBuffer.h>
@ -38,45 +39,17 @@ String readFieldByEscapingRule(ReadBuffer & buf, FormatSettings::EscapingRule es
/// Try to determine the type of the field written by a specific escaping rule.
/// If we cannot, return nullptr.
/// - For Quoted escaping rule we can interpret a single field as a constant
/// expression and get its type by evaluating this expression.
/// - For JSON escaping rule we can use a JSON parser to parse a single field
/// and then convert the JSON type of this field to a ClickHouse type.
/// - For CSV escaping rule we can do the following:
/// - If the field is an unquoted string, then we try to parse it as a number,
/// and if we cannot, treat it as a String.
/// - If the field is a string in quotes, then we try to use some
/// tweaks and heuristics to determine the type inside quotes, and if we can't or
/// the result is a number or tuple (we don't parse numbers in quotes and don't
/// support tuples in CSV) we treat it as a String.
/// - If input_format_csv_use_best_effort_in_schema_inference is disabled, we
/// treat everything as a string.
/// - For TSV and TSVRaw we try to use some tweaks and heuristics to determine the type
/// of value if setting input_format_tsv_use_best_effort_in_schema_inference is enabled,
/// otherwise we treat everything as a string.
DataTypePtr determineDataTypeByEscapingRule(const String & field, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule);
DataTypes determineDataTypesByEscapingRule(const std::vector<String> & fields, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule);
/// See tryInferDataTypeForSingle(JSON)Field in SchemaInferenceUtils.h
DataTypePtr tryInferDataTypeByEscapingRule(const String & field, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule, JSONInferenceInfo * json_info = nullptr);
DataTypes tryInferDataTypesByEscapingRule(const std::vector<String> & fields, const FormatSettings & format_settings, FormatSettings::EscapingRule escaping_rule, JSONInferenceInfo * json_info = nullptr);
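A condensed model of the quoted-CSV heuristic described above (hypothetical helper names; type names as strings instead of IDataType):

#include <cctype>
#include <iostream>
#include <optional>
#include <string>
#include <string_view>

static std::optional<std::string> inferInsideQuotes(std::string_view data);

/// Mirrors the CSV rule: values in quotes may hold complex types, but a quoted
/// number (or tuple) must stay a String, because CSV serializes numbers unquoted.
static std::string inferCSVField(std::string_view field)
{
    if (field.size() > 1 && field.front() == '"' && field.back() == '"')
    {
        auto data = field.substr(1, field.size() - 2);
        auto type = inferInsideQuotes(data);
        if (!type || *type == "Int64" || *type == "Float64" || type->rfind("Tuple", 0) == 0)
            return "String";
        return *type;
    }
    return "String"; /// Unquoted: the real code first tries number inference.
}

/// Toy placeholder for the recursive single-field inference.
static std::optional<std::string> inferInsideQuotes(std::string_view data)
{
    if (!data.empty() && data.front() == '[')
        return "Array(...)";
    if (!data.empty() && (std::isdigit(static_cast<unsigned char>(data.front())) || data.front() == '-'))
        return "Int64";
    return std::nullopt;
}

int main()
{
    std::cout << inferCSVField("\"[1,2,3]\"") << '\n'; /// Array(...)
    std::cout << inferCSVField("\"123\"") << '\n';     /// String: a quoted number stays a string
}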
/// Check if we need to transform types inferred from data and transform it if necessary.
/// See transformInferred(JSON)TypesIfNeeded in SchemaInferenceUtils.h
void transformInferredTypesByEscapingRuleIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule, JSONInferenceInfo * json_info = nullptr);
DataTypePtr getDefaultDataTypeForEscapingRule(FormatSettings::EscapingRule escaping_rule);
DataTypes getDefaultDataTypeForEscapingRules(const std::vector<FormatSettings::EscapingRule> & escaping_rules);
/// Try to infer Date or Datetime from string if corresponding settings are enabled.
DataTypePtr tryInferDateOrDateTime(const std::string_view & field, const FormatSettings & settings);
/// Check if we need to transform types inferred from data and transform it if necessary.
/// It's used when we try to infer non-ordinary types from other types.
/// For example, for dates inferred from strings, we should check if dates were inferred from all strings
/// in the same way and, if not, transform the inferred dates back to strings.
/// For example, if we have an array of strings and we tried to infer dates from them,
/// to make the result type Array(Date) we should ensure that all strings were
/// successfully parsed as dates; if not, convert all dates back to strings and make the result type Array(String).
void transformInferredTypesIfNeeded(DataTypes & types, const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule = FormatSettings::EscapingRule::Escaped);
void transformInferredTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule = FormatSettings::EscapingRule::Escaped);
/// Same as transformInferredTypesIfNeeded but takes into account settings that are special for JSON formats.
void transformInferredJSONTypesIfNeeded(DataTypes & types, const FormatSettings & settings, const std::unordered_set<const IDataType *> * numbers_parsed_from_json_strings = nullptr);
void transformInferredJSONTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings);
String getAdditionalFormatInfoForAllRowBasedFormats(const FormatSettings & settings);
String getAdditionalFormatInfoByEscapingRule(const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule);

View File

@ -169,6 +169,7 @@ FormatSettings getFormatSettings(ContextPtr context, const Settings & settings)
format_settings.max_rows_to_read_for_schema_inference = settings.input_format_max_rows_to_read_for_schema_inference;
format_settings.column_names_for_schema_inference = settings.column_names_for_schema_inference;
format_settings.schema_inference_hints = settings.schema_inference_hints;
format_settings.schema_inference_make_columns_nullable = settings.schema_inference_make_columns_nullable;
format_settings.mysql_dump.table_name = settings.input_format_mysql_dump_table_name;
format_settings.mysql_dump.map_column_names = settings.input_format_mysql_dump_map_column_names;
format_settings.sql_insert.max_batch_size = settings.output_format_sql_insert_max_batch_size;
@ -182,6 +183,7 @@ FormatSettings getFormatSettings(ContextPtr context, const Settings & settings)
format_settings.bson.output_string_as_string = settings.output_format_bson_string_as_string;
format_settings.bson.skip_fields_with_unsupported_types_in_schema_inference = settings.input_format_bson_skip_fields_with_unsupported_types_in_schema_inference;
format_settings.max_binary_string_size = settings.format_binary_max_string_size;
format_settings.max_parser_depth = context->getSettingsRef().max_parser_depth;
/// Validate avro_schema_registry_url with RemoteHostFilter when non-empty and in Server context
if (format_settings.schema.is_server)

View File

@ -71,6 +71,8 @@ struct FormatSettings
Raw
};
bool schema_inference_make_columns_nullable = true;
DateTimeOutputFormat date_time_output_format = DateTimeOutputFormat::Simple;
bool input_format_ipv4_default_on_conversion_error = false;
@ -81,6 +83,8 @@ struct FormatSettings
UInt64 max_binary_string_size = 0;
UInt64 max_parser_depth = DBMS_DEFAULT_MAX_PARSER_DEPTH;
struct
{
UInt64 row_group_size = 1000000;

View File

@ -6,19 +6,13 @@
#include <IO/WriteBufferValidUTF8.h>
#include <DataTypes/Serializations/SerializationNullable.h>
#include <DataTypes/DataTypeNullable.h>
#include <DataTypes/DataTypesNumber.h>
#include <DataTypes/DataTypeString.h>
#include <DataTypes/DataTypeArray.h>
#include <DataTypes/DataTypeTuple.h>
#include <DataTypes/DataTypeMap.h>
#include <DataTypes/DataTypeObject.h>
#include <DataTypes/DataTypeFactory.h>
#include <Common/JSONParsers/SimdJSONParser.h>
#include <Common/JSONParsers/RapidJSONParser.h>
#include <Common/JSONParsers/DummyJSONParser.h>
#include <base/find_symbols.h>
#include <Common/logger_useful.h>
namespace DB
{
@ -122,206 +116,6 @@ namespace JSONUtils
return {loadAtPosition(in, memory, pos), number_of_rows};
}
template <const char opening_bracket, const char closing_bracket>
static String readJSONEachRowLineIntoStringImpl(ReadBuffer & in)
{
Memory memory;
fileSegmentationEngineJSONEachRowImpl<opening_bracket, closing_bracket>(in, memory, 0, 1, 1);
return String(memory.data(), memory.size());
}
template <class Element>
DataTypePtr getDataTypeFromFieldImpl(const Element & field, const FormatSettings & settings, std::unordered_set<const IDataType *> & numbers_parsed_from_json_strings)
{
if (field.isNull())
return nullptr;
if (field.isBool())
return DataTypeFactory::instance().get("Nullable(Bool)");
if (field.isInt64() || field.isUInt64())
{
if (settings.try_infer_integers)
return makeNullable(std::make_shared<DataTypeInt64>());
return makeNullable(std::make_shared<DataTypeFloat64>());
}
if (field.isDouble())
return makeNullable(std::make_shared<DataTypeFloat64>());
if (field.isString())
{
if (auto date_type = tryInferDateOrDateTime(field.getString(), settings))
return date_type;
if (!settings.json.try_infer_numbers_from_strings)
return makeNullable(std::make_shared<DataTypeString>());
ReadBufferFromString buf(field.getString());
if (settings.try_infer_integers)
{
Int64 tmp_int;
if (tryReadIntText(tmp_int, buf) && buf.eof())
{
auto type = std::make_shared<DataTypeInt64>();
numbers_parsed_from_json_strings.insert(type.get());
return makeNullable(type);
}
}
Float64 tmp;
if (tryReadFloatText(tmp, buf) && buf.eof())
{
auto type = std::make_shared<DataTypeFloat64>();
numbers_parsed_from_json_strings.insert(type.get());
return makeNullable(type);
}
return makeNullable(std::make_shared<DataTypeString>());
}
if (field.isArray())
{
auto array = field.getArray();
/// Return nullptr in case of empty array because we cannot determine nested type.
if (array.size() == 0)
return nullptr;
DataTypes nested_data_types;
/// If this array contains fields with different types we will treat it as Tuple.
bool are_types_the_same = true;
for (const auto element : array)
{
auto type = getDataTypeFromFieldImpl(element, settings, numbers_parsed_from_json_strings);
if (!type)
return nullptr;
if (!nested_data_types.empty() && !type->equals(*nested_data_types.back()))
are_types_the_same = false;
nested_data_types.push_back(std::move(type));
}
if (!are_types_the_same)
{
auto nested_types_copy = nested_data_types;
transformInferredJSONTypesIfNeeded(nested_types_copy, settings, &numbers_parsed_from_json_strings);
are_types_the_same = true;
for (size_t i = 1; i < nested_types_copy.size(); ++i)
are_types_the_same &= nested_types_copy[i]->equals(*nested_types_copy[i - 1]);
if (are_types_the_same)
nested_data_types = std::move(nested_types_copy);
}
if (!are_types_the_same)
return std::make_shared<DataTypeTuple>(nested_data_types);
return std::make_shared<DataTypeArray>(nested_data_types.back());
}
if (field.isObject())
{
auto object = field.getObject();
DataTypes value_types;
for (const auto key_value_pair : object)
{
auto type = getDataTypeFromFieldImpl(key_value_pair.second, settings, numbers_parsed_from_json_strings);
if (!type)
{
/// If we couldn't infer nested type and Object type is not enabled,
/// we can't determine the type of this JSON field.
if (!settings.json.try_infer_objects)
{
/// If read_objects_as_strings is enabled, we can read objects into strings.
if (settings.json.read_objects_as_strings)
return makeNullable(std::make_shared<DataTypeString>());
return nullptr;
}
continue;
}
if (settings.json.try_infer_objects && isObject(type))
return std::make_shared<DataTypeObject>("json", true);
value_types.push_back(type);
}
if (value_types.empty())
return nullptr;
transformInferredJSONTypesIfNeeded(value_types, settings, &numbers_parsed_from_json_strings);
bool are_types_equal = true;
for (size_t i = 1; i < value_types.size(); ++i)
are_types_equal &= value_types[i]->equals(*value_types[0]);
if (!are_types_equal)
{
if (!settings.json.try_infer_objects)
{
/// If read_objects_as_strings is enabled, we can read objects into strings.
if (settings.json.read_objects_as_strings)
return makeNullable(std::make_shared<DataTypeString>());
return nullptr;
}
return std::make_shared<DataTypeObject>("json", true);
}
return std::make_shared<DataTypeMap>(std::make_shared<DataTypeString>(), value_types[0]);
}
throw Exception{ErrorCodes::INCORRECT_DATA, "Unexpected JSON type"};
}
auto getJSONParserAndElement()
{
#if USE_SIMDJSON
return std::pair<SimdJSONParser, SimdJSONParser::Element>();
#elif USE_RAPIDJSON
return std::pair<RapidJSONParser, RapidJSONParser::Element>();
#else
return std::pair<DummyJSONParser, DummyJSONParser::Element>();
#endif
}
DataTypePtr getDataTypeFromField(const String & field, const FormatSettings & settings)
{
auto [parser, element] = getJSONParserAndElement();
bool parsed = parser.parse(field, element);
if (!parsed)
throw Exception(ErrorCodes::INCORRECT_DATA, "Cannot parse JSON object here: {}", field);
std::unordered_set<const IDataType *> numbers_parsed_from_json_strings;
return getDataTypeFromFieldImpl(element, settings, numbers_parsed_from_json_strings);
}
template <class Extractor, const char opening_bracket, const char closing_bracket>
static DataTypes determineColumnDataTypesFromJSONEachRowDataImpl(ReadBuffer & in, const FormatSettings & settings, bool /*json_strings*/, Extractor & extractor)
{
String line = readJSONEachRowLineIntoStringImpl<opening_bracket, closing_bracket>(in);
auto [parser, element] = getJSONParserAndElement();
bool parsed = parser.parse(line, element);
if (!parsed)
throw Exception(ErrorCodes::INCORRECT_DATA, "Cannot parse JSON object here: {}", line);
auto fields = extractor.extract(element);
DataTypes data_types;
data_types.reserve(fields.size());
std::unordered_set<const IDataType *> numbers_parsed_from_json_strings;
for (const auto & field : fields)
data_types.push_back(getDataTypeFromFieldImpl(field, settings, numbers_parsed_from_json_strings));
/// TODO: For JSONStringsEachRow/JSONCompactStringsEachRow all types will be strings.
/// Should we try to parse data inside strings somehow in this case?
return data_types;
}
std::pair<bool, size_t> fileSegmentationEngineJSONEachRow(ReadBuffer & in, DB::Memory<> & memory, size_t min_bytes, size_t max_rows)
{
return fileSegmentationEngineJSONEachRowImpl<'{', '}'>(in, memory, min_bytes, 1, max_rows);
@ -333,68 +127,56 @@ namespace JSONUtils
return fileSegmentationEngineJSONEachRowImpl<'[', ']'>(in, memory, min_bytes, min_rows, max_rows);
}
struct JSONEachRowFieldsExtractor
NamesAndTypesList readRowAndGetNamesAndDataTypesForJSONEachRow(ReadBuffer & in, const FormatSettings & settings, JSONInferenceInfo * inference_info)
{
template <class Element>
std::vector<Element> extract(const Element & element)
skipWhitespaceIfAny(in);
assertChar('{', in);
bool first = true;
NamesAndTypesList names_and_types;
String field;
while (!in.eof() && *in.position() != '}')
{
/// {..., "<column_name>" : <value>, ...}
if (!first)
skipComma(in);
else
first = false;
if (!element.isObject())
throw Exception(ErrorCodes::INCORRECT_DATA, "Root JSON value is not an object");
auto object = element.getObject();
std::vector<Element> fields;
fields.reserve(object.size());
column_names.reserve(object.size());
for (const auto & key_value_pair : object)
{
column_names.emplace_back(key_value_pair.first);
fields.push_back(key_value_pair.second);
}
return fields;
auto name = readFieldName(in);
auto type = tryInferDataTypeForSingleJSONField(in, settings, inference_info);
names_and_types.emplace_back(name, type);
}
std::vector<String> column_names;
};
if (in.eof())
throw Exception(ErrorCodes::INCORRECT_DATA, "Unexpected EOF while reading JSON object");
NamesAndTypesList readRowAndGetNamesAndDataTypesForJSONEachRow(ReadBuffer & in, const FormatSettings & settings, bool json_strings)
{
JSONEachRowFieldsExtractor extractor;
auto data_types
= determineColumnDataTypesFromJSONEachRowDataImpl<JSONEachRowFieldsExtractor, '{', '}'>(in, settings, json_strings, extractor);
NamesAndTypesList result;
for (size_t i = 0; i != extractor.column_names.size(); ++i)
result.emplace_back(extractor.column_names[i], data_types[i]);
return result;
assertChar('}', in);
return names_and_types;
}
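The control flow of the new row reader above ('{', then name:value pairs separated by commas, then '}') can be shown with a toy scanner; the real readFieldName and tryInferDataTypeForSingleJSONField additionally handle escaping and full JSON values, which are stubbed out here:

#include <cctype>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

static void skipWs(std::string_view s, size_t & pos)
{
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos])))
        ++pos;
}

/// Reads a double-quoted JSON string (toy: no escape handling).
static bool readQuoted(std::string_view s, size_t & pos, std::string & out)
{
    if (pos >= s.size() || s[pos] != '"')
        return false;
    auto end = s.find('"', pos + 1);
    if (end == std::string_view::npos)
        return false;
    out = std::string(s.substr(pos + 1, end - pos - 1));
    pos = end + 1;
    return true;
}

/// Mirrors the loop above, but only collects field names; values are skipped
/// rather than typed in this toy version.
static std::vector<std::string> readRowFieldNames(std::string_view row)
{
    std::vector<std::string> names;
    size_t pos = 0;
    skipWs(row, pos);
    if (pos >= row.size() || row[pos++] != '{')
        return names;
    bool first = true;
    while (pos < row.size() && row[pos] != '}')
    {
        skipWs(row, pos);
        if (!first && pos < row.size() && row[pos] == ',')
        {
            ++pos;
            skipWs(row, pos);
        }
        first = false;
        std::string name;
        if (!readQuoted(row, pos, name))
            break;
        names.push_back(name);
        skipWs(row, pos);
        if (pos < row.size() && row[pos] == ':')
            ++pos;
        /// Skip the value: the toy just advances to the next ',' or '}'.
        while (pos < row.size() && row[pos] != ',' && row[pos] != '}')
            ++pos;
    }
    return names;
}

int main()
{
    for (const auto & n : readRowFieldNames(R"({"x": 1, "s": "abc"})"))
        std::cout << n << '\n'; /// x, s
}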
struct JSONCompactEachRowFieldsExtractor
DataTypes readRowAndGetDataTypesForJSONCompactEachRow(ReadBuffer & in, const FormatSettings & settings, JSONInferenceInfo * inference_info)
{
template <class Element>
std::vector<Element> extract(const Element & element)
skipWhitespaceIfAny(in);
assertChar('[', in);
bool first = true;
DataTypes types;
String field;
while (!in.eof() && *in.position() != ']')
{
/// [..., <value>, ...]
if (!element.isArray())
throw Exception(ErrorCodes::INCORRECT_DATA, "Root JSON value is not an array");
auto array = element.getArray();
std::vector<Element> fields;
fields.reserve(array.size());
for (size_t i = 0; i != array.size(); ++i)
fields.push_back(array[i]);
return fields;
if (!first)
skipComma(in);
else
first = false;
auto type = tryInferDataTypeForSingleJSONField(in, settings, inference_info);
types.push_back(std::move(type));
}
};
DataTypes readRowAndGetDataTypesForJSONCompactEachRow(ReadBuffer & in, const FormatSettings & settings, bool json_strings)
{
JSONCompactEachRowFieldsExtractor extractor;
return determineColumnDataTypesFromJSONEachRowDataImpl<JSONCompactEachRowFieldsExtractor, '[', ']'>(in, settings, json_strings, extractor);
if (in.eof())
throw Exception(ErrorCodes::INCORRECT_DATA, "Unexpected EOF while reading JSON array");
assertChar(']', in);
return types;
}
bool nonTrivialPrefixAndSuffixCheckerJSONEachRowImpl(ReadBuffer & buf)
{
/// For JSONEachRow we can safely skip whitespace characters

View File

@ -13,24 +13,21 @@
namespace DB
{
struct JSONInferenceInfo;
namespace JSONUtils
{
std::pair<bool, size_t> fileSegmentationEngineJSONEachRow(ReadBuffer & in, DB::Memory<> & memory, size_t min_bytes, size_t max_rows);
std::pair<bool, size_t> fileSegmentationEngineJSONCompactEachRow(ReadBuffer & in, DB::Memory<> & memory, size_t min_bytes, size_t min_rows, size_t max_rows);
/// Parse JSON from a string and convert its type to a ClickHouse type. Make the result type always Nullable.
/// JSON array with different nested types is treated as Tuple.
/// If cannot convert (for example when field contains null), return nullptr.
DataTypePtr getDataTypeFromField(const String & field, const FormatSettings & settings);
/// Read row in JSONEachRow format and try to determine type for each field.
/// Return list of names and types.
/// If we cannot determine the type of some field, return nullptr for it.
NamesAndTypesList readRowAndGetNamesAndDataTypesForJSONEachRow(ReadBuffer & in, const FormatSettings & settings, bool json_strings);
NamesAndTypesList readRowAndGetNamesAndDataTypesForJSONEachRow(ReadBuffer & in, const FormatSettings & settings, JSONInferenceInfo * inference_info);
/// Read row in JSONCompactEachRow format and try to determine type for each field.
/// If we cannot determine the type of some field, return nullptr for it.
DataTypes readRowAndGetDataTypesForJSONCompactEachRow(ReadBuffer & in, const FormatSettings & settings, bool json_strings);
DataTypes readRowAndGetDataTypesForJSONCompactEachRow(ReadBuffer & in, const FormatSettings & settings, JSONInferenceInfo * inference_info);
bool nonTrivialPrefixAndSuffixCheckerJSONEachRowImpl(ReadBuffer & buf);

View File

@ -197,69 +197,6 @@ ColumnsDescription readSchemaFromFormat(const String & format_name, const std::o
return readSchemaFromFormat(format_name, format_settings, read_buffer_iterator, retry, context, buf_out);
}
DataTypePtr makeNullableRecursivelyAndCheckForNothing(DataTypePtr type)
{
if (!type)
return nullptr;
WhichDataType which(type);
if (which.isNothing())
return nullptr;
if (which.isNullable())
{
const auto * nullable_type = assert_cast<const DataTypeNullable *>(type.get());
return makeNullableRecursivelyAndCheckForNothing(nullable_type->getNestedType());
}
if (which.isArray())
{
const auto * array_type = assert_cast<const DataTypeArray *>(type.get());
auto nested_type = makeNullableRecursivelyAndCheckForNothing(array_type->getNestedType());
return nested_type ? std::make_shared<DataTypeArray>(nested_type) : nullptr;
}
if (which.isTuple())
{
const auto * tuple_type = assert_cast<const DataTypeTuple *>(type.get());
DataTypes nested_types;
for (const auto & element : tuple_type->getElements())
{
auto nested_type = makeNullableRecursivelyAndCheckForNothing(element);
if (!nested_type)
return nullptr;
nested_types.push_back(nested_type);
}
return std::make_shared<DataTypeTuple>(std::move(nested_types));
}
if (which.isMap())
{
const auto * map_type = assert_cast<const DataTypeMap *>(type.get());
auto key_type = makeNullableRecursivelyAndCheckForNothing(map_type->getKeyType());
auto value_type = makeNullableRecursivelyAndCheckForNothing(map_type->getValueType());
return key_type && value_type ? std::make_shared<DataTypeMap>(removeNullable(key_type), value_type) : nullptr;
}
if (which.isLowCardinality())
{
const auto * lc_type = assert_cast<const DataTypeLowCardinality *>(type.get());
auto nested_type = makeNullableRecursivelyAndCheckForNothing(lc_type->getDictionaryType());
return nested_type ? std::make_shared<DataTypeLowCardinality>(nested_type) : nullptr;
}
return makeNullable(type);
}
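The shape of this recursion (reintroduced as makeNullableRecursively in SchemaInferenceUtils.h) is easy to show over plain type-name strings; this toy handles only the Array case and the scalar fallback:

#include <iostream>
#include <optional>
#include <string>

/// Toy version over type names: Nothing -> nullopt, Array(T) -> Array(Nullable(T)),
/// anything else -> Nullable(T). Tuple, Map and LowCardinality follow the same shape.
static std::optional<std::string> makeNullableRecursivelyToy(const std::string & type)
{
    if (type == "Nothing")
        return std::nullopt;
    if (type.rfind("Array(", 0) == 0 && type.back() == ')')
    {
        auto nested = makeNullableRecursivelyToy(type.substr(6, type.size() - 7));
        if (!nested)
            return std::nullopt;
        return "Array(" + *nested + ")";
    }
    return "Nullable(" + type + ")";
}

int main()
{
    std::cout << makeNullableRecursivelyToy("Array(String)").value_or("<null>") << '\n';  /// Array(Nullable(String))
    std::cout << makeNullableRecursivelyToy("Array(Nothing)").value_or("<null>") << '\n'; /// <null>
}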
NamesAndTypesList getNamesAndRecursivelyNullableTypes(const Block & header)
{
NamesAndTypesList result;
for (auto & [name, type] : header.getNamesAndTypesList())
result.emplace_back(name, makeNullableRecursivelyAndCheckForNothing(type));
return result;
}
SchemaCache::Key getKeyForSchemaCache(const String & source, const String & format, const std::optional<FormatSettings> & format_settings, const ContextPtr & context)
{
return getKeysForSchemaCache({source}, format, format_settings, context).front();

View File

@ -35,21 +35,7 @@ ColumnsDescription readSchemaFromFormat(
ContextPtr & context,
std::unique_ptr<ReadBuffer> & buf_out);
/// Make type Nullable recursively:
/// - Type -> Nullable(type)
/// - Array(Type) -> Array(Nullable(Type))
/// - Tuple(Type1, ..., TypeN) -> Tuple(Nullable(Type1), ..., Nullable(TypeN))
/// - Map(KeyType, ValueType) -> Map(KeyType, Nullable(ValueType))
/// - LowCardinality(Type) -> LowCardinality(Nullable(Type))
/// If type is Nothing or one of the nested types is Nothing, return nullptr.
DataTypePtr makeNullableRecursivelyAndCheckForNothing(DataTypePtr type);
/// Call makeNullableRecursivelyAndCheckForNothing for all types
/// in the block and return names and types.
NamesAndTypesList getNamesAndRecursivelyNullableTypes(const Block & header);
SchemaCache::Key getKeyForSchemaCache(const String & source, const String & format, const std::optional<FormatSettings> & format_settings, const ContextPtr & context);
SchemaCache::Keys getKeysForSchemaCache(const Strings & sources, const String & format, const std::optional<FormatSettings> & format_settings, const ContextPtr & context);
void splitSchemaCacheKey(const String & key, String & source, String & format, String & additional_format_info);
}

File diff suppressed because it is too large

View File

@ -0,0 +1,93 @@
#pragma once
#include <DataTypes/IDataType.h>
#include <IO/ReadBuffer.h>
namespace DB
{
/// Struct with some additional information about inferred types for JSON formats.
struct JSONInferenceInfo
{
/// We store numbers that were parsed from strings.
/// It's used during type transformation to change such numbers back to strings if needed.
std::unordered_set<const IDataType *> numbers_parsed_from_json_strings;
/// Indicates whether we are currently inferring the type of a Map/Object key.
bool is_object_key = false;
};
/// Try to determine datatype of the value in buffer/string. If the type cannot be inferred, return nullptr.
/// In general, it tries to parse a type using the following logic:
/// If we see '[', we try to parse an array of values and recursively determine datatype for each element.
/// If we see '(', we try to parse a tuple of values and recursively determine datatype for each element.
/// If we see '{', we try to parse a Map of keys and values and recursively determine datatype for each key/value.
/// If we see a quote '\'', we treat it as a string and read until next quote.
/// If we see NULL, we return Nullable(Nothing).
/// Otherwise we try to read a number.
DataTypePtr tryInferDataTypeForSingleField(ReadBuffer & buf, const FormatSettings & settings);
DataTypePtr tryInferDataTypeForSingleField(std::string_view field, const FormatSettings & settings);
/// The same as tryInferDataTypeForSingleField, but for JSON values.
DataTypePtr tryInferDataTypeForSingleJSONField(ReadBuffer & buf, const FormatSettings & settings, JSONInferenceInfo * json_info);
DataTypePtr tryInferDataTypeForSingleJSONField(std::string_view field, const FormatSettings & settings, JSONInferenceInfo * json_info);
/// Try to parse Date or DateTime value from a string.
DataTypePtr tryInferDateOrDateTimeFromString(std::string_view field, const FormatSettings & settings);
/// Try to parse a number value from a string. By default, it tries to parse Float64,
/// but if setting try_infer_integers is enabled, it also tries to parse Int64.
DataTypePtr tryInferNumberFromString(std::string_view field, const FormatSettings & settings);
/// It takes two types inferred for the same column and tries to transform them to a common type if possible.
/// It's also used when we try to infer non-ordinary types from other types.
/// Example 1:
/// Dates inferred from strings. In this case we should check if dates were inferred from all strings
/// in the same way and if not, transform inferred dates back to strings.
/// For example, when we have Array(Date) (like `['2020-01-01', '2020-02-02']`) and Array(String) (like `['string', 'abc']`)
/// we will convert the first type to Array(String).
/// Example 2:
/// When we have integers and floats for the same value, we should convert all integers to floats.
/// For example, when we have Array(Int64) (like `[123, 456]`) and Array(Float64) (like `[42.42, 4.42]`)
/// we will convert the first type to Array(Float64)
/// Example 3:
/// When we have not complete types like Nullable(Nothing), Array(Nullable(Nothing)) or Tuple(UInt64, Nullable(Nothing)),
/// we try to complete them using the other type.
/// For example, if we have Tuple(UInt64, Nullable(Nothing)) and Tuple(Nullable(Nothing), String) we will convert both
/// types to common type Tuple(Nullable(UInt64), Nullable(String))
void transformInferredTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings);
/// The same as transformInferredTypesIfNeeded but uses some specific transformations for JSON.
/// Example 1:
/// When we have numbers inferred from strings and strings, we convert all such numbers back to string.
/// For example, if we have Array(Int64) (like `['123', '456']`) and Array(String) (like `['str', 'abc']`)
/// we will convert the first type to Array(String). Note that we collect information about numbers inferred
/// from strings in json_info during inference and use it here, so we will know that Array(Int64) contains
/// integers inferred from strings.
/// Example 2:
/// When we have maps with different value types, we convert all types to JSON object type.
/// For example, if we have Map(String, UInt64) (like `{"a" : 123}`) and Map(String, String) (like `{"b" : 'abc'}`)
/// we will convert both types to Object('JSON').
void transformInferredJSONTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings, JSONInferenceInfo * json_info);
/// Check if type is Tuple(...), try to transform nested types to find a common type for them and if all nested types
/// are the same after transform, we convert this tuple to an Array with common nested type.
/// For example, if we have Tuple(String, Nullable(Nothing)) we will convert it to Array(String).
/// It's used when all rows were read and we have a Tuple in the result type that can actually be an Array.
void transformJSONTupleToArrayIfPossible(DataTypePtr & data_type, const FormatSettings & settings, JSONInferenceInfo * json_info);
/// Make type Nullable recursively:
/// - Type -> Nullable(type)
/// - Array(Type) -> Array(Nullable(Type))
/// - Tuple(Type1, ..., TypeN) -> Tuple(Nullable(Type1), ..., Nullable(TypeN))
/// - Map(KeyType, ValueType) -> Map(KeyType, Nullable(ValueType))
/// - LowCardinality(Type) -> LowCardinality(Nullable(Type))
DataTypePtr makeNullableRecursively(DataTypePtr type);
/// Call makeNullableRecursively for all types
/// in the block and return names and types.
NamesAndTypesList getNamesAndRecursivelyNullableTypes(const Block & header);
/// Check if type contains Nothing, like Array(Tuple(Nullable(Nothing), String))
bool checkIfTypeIsComplete(const DataTypePtr & type);
}
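The tuple-to-array collapse described for transformJSONTupleToArrayIfPossible reduces to a simple check once the element types are known; over type-name strings:

#include <iostream>
#include <string>
#include <vector>

/// Toy: given element type names of an inferred Tuple, collapse to Array(T) when
/// they all match (the real code first applies the JSON transforms described above).
static std::string tupleToArrayIfPossible(const std::vector<std::string> & elements)
{
    bool all_same = !elements.empty();
    for (size_t i = 1; i < elements.size(); ++i)
        all_same &= (elements[i] == elements[0]);
    if (all_same)
        return "Array(" + elements[0] + ")";
    std::string result = "Tuple(";
    for (size_t i = 0; i < elements.size(); ++i)
        result += (i ? ", " : "") + elements[i];
    return result + ")";
}

int main()
{
    std::cout << tupleToArrayIfPossible({"String", "String"}) << '\n'; /// Array(String)
    std::cout << tupleToArrayIfPossible({"UInt64", "String"}) << '\n'; /// Tuple(UInt64, String)
}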

View File

@ -685,37 +685,27 @@ public:
}
else if constexpr (std::is_same_v<ResultDataType, DataTypeDateTime64>)
{
if (typeid_cast<const DataTypeDateTime64 *>(arguments[0].type.get()))
static constexpr auto target_scale = std::invoke(
[]() -> std::optional<UInt32>
{
if constexpr (std::is_base_of_v<AddNanosecondsImpl, Transform>)
return 9;
else if constexpr (std::is_base_of_v<AddMicrosecondsImpl, Transform>)
return 6;
else if constexpr (std::is_base_of_v<AddMillisecondsImpl, Transform>)
return 3;
return {};
});
auto timezone = extractTimeZoneNameFromFunctionArguments(arguments, 2, 0);
if (const auto* datetime64_type = typeid_cast<const DataTypeDateTime64 *>(arguments[0].type.get()))
{
const auto & datetime64_type = assert_cast<const DataTypeDateTime64 &>(*arguments[0].type);
auto from_scale = datetime64_type.getScale();
auto scale = from_scale;
if (std::is_same_v<Transform, AddNanosecondsImpl>)
scale = 9;
else if (std::is_same_v<Transform, AddMicrosecondsImpl>)
scale = 6;
else if (std::is_same_v<Transform, AddMillisecondsImpl>)
scale = 3;
scale = std::max(scale, from_scale);
return std::make_shared<DataTypeDateTime64>(scale, extractTimeZoneNameFromFunctionArguments(arguments, 2, 0));
const auto from_scale = datetime64_type->getScale();
return std::make_shared<DataTypeDateTime64>(std::max(from_scale, target_scale.value_or(from_scale)), std::move(timezone));
}
else
{
auto scale = DataTypeDateTime64::default_scale;
if (std::is_same_v<Transform, AddNanosecondsImpl>)
scale = 9;
else if (std::is_same_v<Transform, AddMicrosecondsImpl>)
scale = 6;
else if (std::is_same_v<Transform, AddMillisecondsImpl>)
scale = 3;
return std::make_shared<DataTypeDateTime64>(scale, extractTimeZoneNameFromFunctionArguments(arguments, 2, 0));
}
return std::make_shared<DataTypeDateTime64>(target_scale.value_or(DataTypeDateTime64::default_scale), std::move(timezone));
}
throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected result type in datetime add interval function");
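The std::invoke-on-a-constexpr-lambda pattern used for target_scale can be reproduced standalone (the transform tags below are hypothetical stand-ins for the Add*Impl types; constexpr std::invoke assumes C++20):

#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>
#include <type_traits>

/// Hypothetical stand-ins for the Add*Impl transform tags.
struct AddNanoseconds {};
struct AddMicroseconds {};
struct AddSeconds {};

template <typename Transform>
static uint32_t resultScale(uint32_t default_scale)
{
    /// Same shape as above: the immediately-invoked lambda picks the sub-second
    /// scale for the transform at compile time, or "no opinion" for coarser ones.
    static constexpr auto target_scale = std::invoke(
        []() -> std::optional<uint32_t>
        {
            if constexpr (std::is_same_v<Transform, AddNanoseconds>)
                return 9;
            if constexpr (std::is_same_v<Transform, AddMicroseconds>)
                return 6;
            return {};
        });
    return target_scale.value_or(default_scale);
}

int main()
{
    std::cout << resultScale<AddNanoseconds>(3) << '\n'; /// 9
    std::cout << resultScale<AddSeconds>(3) << '\n';     /// 3: falls back to the given default
}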

View File

@ -4,7 +4,7 @@
#include <Columns/ColumnVector.h>
#include <Columns/ColumnsNumber.h>
#include <Common/BitHelpers.h>
#include <Common/hex.h>
#include <Common/BinStringDecodeHelper.h>
#include <DataTypes/DataTypeString.h>
#include <Functions/FunctionFactory.h>
#include <Functions/IFunction.h>
@ -126,20 +126,7 @@ struct UnhexImpl
static void decode(const char * pos, const char * end, char *& out)
{
if ((end - pos) & 1)
{
*out = unhex(*pos);
++out;
++pos;
}
while (pos < end)
{
*out = unhex2(pos);
pos += word_size;
++out;
}
*out = '\0';
++out;
hexStringDecode(pos, end, out, word_size);
}
};
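The removed loop now lives in Common/BinStringDecodeHelper.h as hexStringDecode; a self-contained version of the same algorithm (with a local digit helper instead of ClickHouse's unhex/unhex2) might look like:

#include <iostream>

/// Minimal stand-in for ClickHouse's unhex helpers.
static unsigned char unhexDigit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return 0;
}

/// Same logic as the removed UnhexImpl::decode: an odd-length input contributes
/// one leading half-byte, then the rest is decoded two digits at a time.
static void hexStringDecodeSketch(const char * pos, const char * end, char *& out)
{
    if ((end - pos) & 1)
    {
        *out++ = static_cast<char>(unhexDigit(*pos));
        ++pos;
    }
    while (pos < end)
    {
        *out++ = static_cast<char>((unhexDigit(pos[0]) << 4) | unhexDigit(pos[1]));
        pos += 2;
    }
    *out++ = '\0';
}

int main()
{
    const char input[] = "414243";
    char buf[8];
    char * out = buf;
    hexStringDecodeSketch(input, input + 6, out);
    std::cout << buf << '\n'; /// ABC
}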
@ -233,52 +220,7 @@ struct UnbinImpl
static void decode(const char * pos, const char * end, char *& out)
{
if (pos == end)
{
*out = '\0';
++out;
return;
}
UInt8 left = 0;
/// end - pos is the length of the input.
/// We first consume (length & 7) leading bits so that the remaining length is a multiple of 8.
/// E.g. if the length is 9 and the input is "101000001",
/// left_cnt starts at 1, and left becomes 1 after consuming the first bit;
/// then left_cnt is 0 and the remaining input is '01000001'.
for (UInt8 left_cnt = (end - pos) & 7; left_cnt > 0; --left_cnt)
{
left = left << 1;
if (*pos != '0')
left += 1;
++pos;
}
if (left != 0 || end - pos == 0)
{
*out = left;
++out;
}
assert((end - pos) % 8 == 0);
while (end - pos != 0)
{
UInt8 c = 0;
for (UInt8 i = 0; i < 8; ++i)
{
c = c << 1;
if (*pos != '0')
c += 1;
++pos;
}
*out = c;
++out;
}
*out = '\0';
++out;
binStringDecode(pos, end, out);
}
};
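Likewise, binStringDecode keeps the removed algorithm; a standalone sketch:

#include <cassert>
#include <iostream>

/// Same algorithm as the removed UnbinImpl::decode: consume (length % 8) leading
/// bits into one byte, then decode the rest eight bits at a time.
static void binStringDecodeSketch(const char * pos, const char * end, char *& out)
{
    if (pos == end)
    {
        *out++ = '\0';
        return;
    }
    unsigned char left = 0;
    for (long left_cnt = (end - pos) & 7; left_cnt > 0; --left_cnt)
    {
        left = static_cast<unsigned char>(left << 1);
        if (*pos != '0')
            left += 1;
        ++pos;
    }
    if (left != 0 || end - pos == 0)
        *out++ = static_cast<char>(left);
    assert((end - pos) % 8 == 0);
    while (pos != end)
    {
        unsigned char c = 0;
        for (int i = 0; i < 8; ++i)
        {
            c = static_cast<unsigned char>(c << 1);
            if (*pos != '0')
                c += 1;
            ++pos;
        }
        *out++ = static_cast<char>(c);
    }
    *out++ = '\0';
}

int main()
{
    const char input[] = "101000001"; /// 9 bits: a leading '1', then 01000001 == 'A'
    char buf[8];
    char * out = buf;
    binStringDecodeSketch(input, input + 9, out);
    std::cout << static_cast<int>(buf[0]) << ' ' << buf[1] << '\n'; /// 1 A
}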

View File

@ -2,6 +2,7 @@
#include <Functions/FunctionHelpers.h>
#include <Functions/FunctionFactory.h>
#include <DataTypes/DataTypeArray.h>
#include <Interpreters/ArrayJoinAction.h>
namespace DB
@ -52,11 +53,11 @@ public:
DataTypePtr getReturnTypeImpl(const DataTypes & arguments) const override
{
const DataTypeArray * arr = checkAndGetDataType<DataTypeArray>(arguments[0].get());
const auto & arr = getArrayJoinDataType(arguments[0]);
if (!arr)
throw Exception("Argument for function " + getName() + " must be Array.", ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT);
throw Exception("Argument for function " + getName() + " must be Array or Map", ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT);
return arr->getNestedType();
}
ColumnPtr executeImpl(const ColumnsWithTypeAndName &, const DataTypePtr &, size_t /*input_rows_count*/) const override

View File

@ -9,6 +9,7 @@
#include <Interpreters/castColumn.h>
#include <Interpreters/Context.h>
#include <numeric>
#include <vector>
namespace DB
@ -56,7 +57,7 @@ private:
for (const auto & arg : arguments)
{
if (!isUnsignedInteger(arg))
if (!isInteger(arg))
throw Exception{"Illegal type " + arg->getName() + " of argument of function " + getName(),
ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT};
}
@ -72,8 +73,12 @@ private:
{
const auto & in_data = in->getData();
const auto total_values = std::accumulate(std::begin(in_data), std::end(in_data), size_t{},
[this] (const size_t lhs, const size_t rhs)
[this] (const size_t lhs, const T rhs)
{
if (rhs < 0)
throw Exception{"A call to function " + getName() + " overflows, only support positive values when only end is provided",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
const auto sum = lhs + rhs;
if (sum < lhs)
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
@ -96,7 +101,7 @@ private:
IColumn::Offset offset{};
for (size_t row_idx = 0, rows = in->size(); row_idx < rows; ++row_idx)
{
for (size_t elem_idx = 0, elems = in_data[row_idx]; elem_idx < elems; ++elem_idx)
for (T elem_idx = 0, elems = in_data[row_idx]; elem_idx < elems; ++elem_idx)
out_data[offset + elem_idx] = static_cast<T>(elem_idx);
offset += in_data[row_idx];
@ -121,15 +126,20 @@ private:
size_t total_values = 0;
size_t pre_values = 0;
std::vector<size_t> row_length(input_rows_count);
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
if (start < end_data[row_idx] && step == 0)
if (step == 0)
throw Exception{"A call to function " + getName() + " overflows, the 3rd argument step can't be zero",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
pre_values += start >= end_data[row_idx] ? 0
: (end_data[row_idx] - start - 1) / step + 1;
if (start < end_data[row_idx] && step > 0)
row_length[row_idx] = (static_cast<__int128_t>(end_data[row_idx]) - static_cast<__int128_t>(start) - 1) / static_cast<__int128_t>(step) + 1;
else if (start > end_data[row_idx] && step < 0)
row_length[row_idx] = (static_cast<__int128_t>(end_data[row_idx]) - static_cast<__int128_t>(start) + 1) / static_cast<__int128_t>(step) + 1;
pre_values += row_length[row_idx];
if (pre_values < total_values)
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
@ -151,15 +161,8 @@ private:
IColumn::Offset offset{};
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
for (size_t st = start, ed = end_data[row_idx]; st < ed; st += step)
{
out_data[offset++] = static_cast<T>(st);
if (st > st + step)
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
}
for (size_t idx = 0; idx < row_length[row_idx]; idx++)
out_data[offset++] = static_cast<T>(start + idx * step);
out_offsets[row_idx] = offset;
}
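The row-length computation introduced above replaces per-element overflow checks with arithmetic widened to __int128 (a GCC/Clang extension); in isolation the idea is:

#include <cstdint>
#include <iostream>

/// Number of elements of range(start, end, step). Widening to __int128 makes
/// end - start safe even for extreme Int64 inputs (same pattern as above).
static size_t rangeLength(int64_t start, int64_t end, int64_t step)
{
    if (step > 0 && start < end)
        return static_cast<size_t>((static_cast<__int128>(end) - start - 1) / step + 1);
    if (step < 0 && start > end)
        return static_cast<size_t>((static_cast<__int128>(end) - start + 1) / step + 1);
    return 0; /// Empty range; step == 0 is rejected earlier with an exception.
}

int main()
{
    std::cout << rangeLength(0, 10, 3) << '\n';                /// 4: 0, 3, 6, 9
    std::cout << rangeLength(10, 0, -3) << '\n';               /// 4: 10, 7, 4, 1
    std::cout << rangeLength(INT64_MIN, INT64_MAX, 1) << '\n'; /// 2^64 - 1, computed without overflow
}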
@ -180,19 +183,25 @@ private:
size_t total_values = 0;
size_t pre_values = 0;
std::vector<size_t> row_length(input_rows_count);
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
if (start_data[row_idx] < end_data[row_idx] && step == 0)
if (step == 0)
throw Exception{"A call to function " + getName() + " overflows, the 3rd argument step can't be zero",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
pre_values += start_data[row_idx] >= end_data[row_idx] ? 0
: (end_data[row_idx] - start_data[row_idx] - 1) / step + 1;
if (start_data[row_idx] < end_data[row_idx] && step > 0)
row_length[row_idx] = (static_cast<__int128_t>(end_data[row_idx]) - static_cast<__int128_t>(start_data[row_idx]) - 1) / static_cast<__int128_t>(step) + 1;
else if (start_data[row_idx] > end_data[row_idx] && step < 0)
row_length[row_idx] = (static_cast<__int128_t>(end_data[row_idx]) - static_cast<__int128_t>(start_data[row_idx]) + 1) / static_cast<__int128_t>(step) + 1;
pre_values += row_length[row_idx];
if (pre_values < total_values)
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
total_values = pre_values;
if (total_values > max_elements)
@ -210,15 +219,8 @@ private:
IColumn::Offset offset{};
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
for (size_t st = start_data[row_idx], ed = end_data[row_idx]; st < ed; st += step)
{
out_data[offset++] = static_cast<T>(st);
if (st > st + step)
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
}
for (size_t idx = 0; idx < row_length[row_idx]; idx++)
out_data[offset++] = static_cast<T>(start_data[row_idx] + idx * step);
out_offsets[row_idx] = offset;
}
@ -239,15 +241,20 @@ private:
size_t total_values = 0;
size_t pre_values = 0;
std::vector<size_t> row_length(input_rows_count);
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
if (start < end_data[row_idx] && step_data[row_idx] == 0)
if (step_data[row_idx] == 0)
throw Exception{"A call to function " + getName() + " overflows, the 3rd argument step can't be zero",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
pre_values += start >= end_data[row_idx] ? 0
: (end_data[row_idx] - start - 1) / step_data[row_idx] + 1;
if (start < end_data[row_idx] && step_data[row_idx] > 0)
row_length[row_idx] = (static_cast<__int128_t>(end_data[row_idx]) - static_cast<__int128_t>(start) - 1) / static_cast<__int128_t>(step_data[row_idx]) + 1;
else if (start > end_data[row_idx] && step_data[row_idx] < 0)
row_length[row_idx] = (static_cast<__int128_t>(end_data[row_idx]) - static_cast<__int128_t>(start) + 1) / static_cast<__int128_t>(step_data[row_idx]) + 1;
pre_values += row_length[row_idx];
if (pre_values < total_values)
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
@ -269,15 +276,8 @@ private:
IColumn::Offset offset{};
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
for (size_t st = start, ed = end_data[row_idx]; st < ed; st += step_data[row_idx])
{
out_data[offset++] = static_cast<T>(st);
if (st > st + step_data[row_idx])
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
}
for (size_t idx = 0; idx < row_length[row_idx]; idx++)
out_data[offset++] = static_cast<T>(start + idx * step_data[row_idx]);
out_offsets[row_idx] = offset;
}
@ -301,15 +301,19 @@ private:
size_t total_values = 0;
size_t pre_values = 0;
std::vector<size_t> row_length(input_rows_count);
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
if (start_data[row_idx] < end_start[row_idx] && step_data[row_idx] == 0)
throw Exception{"A call to function " + getName() + " overflows, the 3rd argument step can't be zero",
if (step_data[row_idx] == 0)
throw Exception{"A call to function " + getName() + " overflows, the 3rd argument step can't less or equal to zero",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
if (start_data[row_idx] < end_start[row_idx] && step_data[row_idx] > 0)
row_length[row_idx] = (static_cast<__int128_t>(end_start[row_idx]) - static_cast<__int128_t>(start_data[row_idx]) - 1) / static_cast<__int128_t>(step_data[row_idx]) + 1;
else if (start_data[row_idx] > end_start[row_idx] && step_data[row_idx] < 0)
row_length[row_idx] = (static_cast<__int128_t>(end_start[row_idx]) - static_cast<__int128_t>(start_data[row_idx]) + 1) / static_cast<__int128_t>(step_data[row_idx]) + 1;
pre_values += start_data[row_idx] >= end_start[row_idx] ? 0
: (end_start[row_idx] -start_data[row_idx] - 1) / (step_data[row_idx]) + 1;
pre_values += row_length[row_idx];
if (pre_values < total_values)
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
@ -331,15 +335,8 @@ private:
IColumn::Offset offset{};
for (size_t row_idx = 0; row_idx < input_rows_count; ++row_idx)
{
for (size_t st = start_data[row_idx], ed = end_start[row_idx]; st < ed; st += step_data[row_idx])
{
out_data[offset++] = static_cast<T>(st);
if (st > st + step_data[row_idx])
throw Exception{"A call to function " + getName() + " overflows, investigate the values of arguments you are passing",
ErrorCodes::ARGUMENT_OUT_OF_BOUND};
}
for (size_t idx = 0; idx < row_length[row_idx]; idx++)
out_data[offset++] = static_cast<T>(start_data[row_idx] + idx * step_data[row_idx]);
out_offsets[row_idx] = offset;
}
@ -351,23 +348,20 @@ private:
DataTypePtr elem_type = checkAndGetDataType<DataTypeArray>(result_type.get())->getNestedType();
WhichDataType which(elem_type);
if (!which.isUInt8()
&& !which.isUInt16()
&& !which.isUInt32()
&& !which.isUInt64())
if (!which.isNativeUInt() && !which.isNativeInt())
{
throw Exception{"Illegal columns of arguments of function " + getName()
+ ", the function only implemented for unsigned integers up to 64 bit", ErrorCodes::ILLEGAL_COLUMN};
+ ", the function only implemented for unsigned/signed integers up to 64 bit",
ErrorCodes::ILLEGAL_COLUMN};
}
ColumnPtr res;
if (arguments.size() == 1)
{
const auto * col = arguments[0].column.get();
if (!((res = executeInternal<UInt8>(col))
|| (res = executeInternal<UInt16>(col))
|| (res = executeInternal<UInt32>(col))
|| (res = executeInternal<UInt64>(col))))
if (!((res = executeInternal<UInt8>(col)) || (res = executeInternal<UInt16>(col)) || (res = executeInternal<UInt32>(col))
|| (res = executeInternal<UInt64>(col)) || (res = executeInternal<Int8>(col)) || (res = executeInternal<Int16>(col))
|| (res = executeInternal<Int32>(col)) || (res = executeInternal<Int64>(col))))
{
throw Exception{"Illegal column " + col->getName() + " of argument of function " + getName(), ErrorCodes::ILLEGAL_COLUMN};
}
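/// A minimal standalone sketch (illustrative names, not ClickHouse API) of the
/// short-circuit dispatch idiom used above: each tryExecute<T> returns a non-null
/// pointer only when the runtime column type matches T, so the chain of || stops
/// at the first matching type and `res` keeps that result.
#include <memory>

struct Column { virtual ~Column() = default; };
struct ColumnInt32 : Column {};
struct ColumnInt64 : Column {};

template <typename T>
std::shared_ptr<Column> tryExecute(const Column * col)
{
    /// Succeeds only if the runtime type matches T, mirroring checkAndGetColumn.
    return dynamic_cast<const T *>(col) ? std::make_shared<T>() : nullptr;
}

bool dispatch(const Column * col)
{
    std::shared_ptr<Column> res;
    if (!((res = tryExecute<ColumnInt32>(col))
        || (res = tryExecute<ColumnInt64>(col))))
        return false; /// unsupported column type
    return true;
}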
@ -402,44 +396,93 @@ private:
bool is_step_const = isColumnConst(*column_ptrs[2]);
if (is_start_const && is_step_const)
{
UInt64 start = assert_cast<const ColumnConst &>(*column_ptrs[0]).getUInt(0);
UInt64 step = assert_cast<const ColumnConst &>(*column_ptrs[2]).getUInt(0);
if ((res = executeConstStartStep<UInt8>(column_ptrs[1], start, step, input_rows_count)) ||
(res = executeConstStartStep<UInt16>(column_ptrs[1], start, step, input_rows_count)) ||
(res = executeConstStartStep<UInt32>(column_ptrs[1], static_cast<UInt32>(start), static_cast<UInt32>(step), input_rows_count)) ||
(res = executeConstStartStep<UInt64>(column_ptrs[1], start, step, input_rows_count)))
if (which.isNativeUInt())
{
UInt64 start = assert_cast<const ColumnConst &>(*column_ptrs[0]).getUInt(0);
UInt64 step = assert_cast<const ColumnConst &>(*column_ptrs[2]).getUInt(0);
if ((res = executeConstStartStep<UInt8>(column_ptrs[1], start, step, input_rows_count))
|| (res = executeConstStartStep<UInt16>(column_ptrs[1], start, step, input_rows_count))
|| (res = executeConstStartStep<UInt32>(
column_ptrs[1], static_cast<UInt32>(start), static_cast<UInt32>(step), input_rows_count))
|| (res = executeConstStartStep<UInt64>(column_ptrs[1], start, step, input_rows_count)))
{
}
}
else if (which.isNativeInt())
{
Int64 start = assert_cast<const ColumnConst &>(*column_ptrs[0]).getInt(0);
Int64 step = assert_cast<const ColumnConst &>(*column_ptrs[2]).getInt(0);
if ((res = executeConstStartStep<Int8>(column_ptrs[1], start, step, input_rows_count))
|| (res = executeConstStartStep<Int16>(column_ptrs[1], start, step, input_rows_count))
|| (res = executeConstStartStep<Int32>(
column_ptrs[1], static_cast<Int32>(start), static_cast<Int32>(step), input_rows_count))
|| (res = executeConstStartStep<Int64>(column_ptrs[1], start, step, input_rows_count)))
{
}
}
}
else if (is_start_const && !is_step_const)
{
UInt64 start = assert_cast<const ColumnConst &>(*column_ptrs[0]).getUInt(0);
if ((res = executeConstStart<UInt8>(column_ptrs[1], column_ptrs[2], start, input_rows_count)) ||
(res = executeConstStart<UInt16>(column_ptrs[1], column_ptrs[2], start, input_rows_count)) ||
(res = executeConstStart<UInt32>(column_ptrs[1], column_ptrs[2], static_cast<UInt32>(start), input_rows_count)) ||
(res = executeConstStart<UInt64>(column_ptrs[1], column_ptrs[2], start, input_rows_count)))
if (which.isNativeUInt())
{
UInt64 start = assert_cast<const ColumnConst &>(*column_ptrs[0]).getUInt(0);
if ((res = executeConstStart<UInt8>(column_ptrs[1], column_ptrs[2], start, input_rows_count))
|| (res = executeConstStart<UInt16>(column_ptrs[1], column_ptrs[2], start, input_rows_count))
|| (res = executeConstStart<UInt32>(column_ptrs[1], column_ptrs[2], static_cast<UInt32>(start), input_rows_count))
|| (res = executeConstStart<UInt64>(column_ptrs[1], column_ptrs[2], start, input_rows_count)))
{
}
}
else if (which.isNativeInt())
{
Int64 start = assert_cast<const ColumnConst &>(*column_ptrs[0]).getInt(0);
if ((res = executeConstStart<Int8>(column_ptrs[1], column_ptrs[2], start, input_rows_count))
|| (res = executeConstStart<Int16>(column_ptrs[1], column_ptrs[2], start, input_rows_count))
|| (res = executeConstStart<Int32>(column_ptrs[1], column_ptrs[2], static_cast<Int32>(start), input_rows_count))
|| (res = executeConstStart<Int64>(column_ptrs[1], column_ptrs[2], start, input_rows_count)))
{
}
}
}
else if (!is_start_const && is_step_const)
{
UInt64 step = assert_cast<const ColumnConst &>(*column_ptrs[2]).getUInt(0);
if ((res = executeConstStep<UInt8>(column_ptrs[0], column_ptrs[1], step, input_rows_count)) ||
(res = executeConstStep<UInt16>(column_ptrs[0], column_ptrs[1], step, input_rows_count)) ||
(res = executeConstStep<UInt32>(column_ptrs[0], column_ptrs[1], static_cast<UInt32>(step), input_rows_count)) ||
(res = executeConstStep<UInt64>(column_ptrs[0], column_ptrs[1], step, input_rows_count)))
if (which.isNativeUInt())
{
UInt64 step = assert_cast<const ColumnConst &>(*column_ptrs[2]).getUInt(0);
if ((res = executeConstStep<UInt8>(column_ptrs[0], column_ptrs[1], step, input_rows_count))
|| (res = executeConstStep<UInt16>(column_ptrs[0], column_ptrs[1], step, input_rows_count))
|| (res = executeConstStep<UInt32>(column_ptrs[0], column_ptrs[1], static_cast<UInt32>(step), input_rows_count))
|| (res = executeConstStep<UInt64>(column_ptrs[0], column_ptrs[1], step, input_rows_count)))
{
}
}
else if (which.isNativeInt())
{
Int64 step = assert_cast<const ColumnConst &>(*column_ptrs[2]).getInt(0);
if ((res = executeConstStep<Int8>(column_ptrs[0], column_ptrs[1], step, input_rows_count))
|| (res = executeConstStep<Int16>(column_ptrs[0], column_ptrs[1], step, input_rows_count))
|| (res = executeConstStep<Int32>(column_ptrs[0], column_ptrs[1], static_cast<Int32>(step), input_rows_count))
|| (res = executeConstStep<Int64>(column_ptrs[0], column_ptrs[1], step, input_rows_count)))
{
}
}
}
else
{
if ((res = executeGeneric<UInt8>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count)) ||
(res = executeGeneric<UInt16>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count)) ||
(res = executeGeneric<UInt32>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count)) ||
(res = executeGeneric<UInt64>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count)))
if ((res = executeGeneric<UInt8>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count))
|| (res = executeGeneric<UInt16>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count))
|| (res = executeGeneric<UInt32>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count))
|| (res = executeGeneric<UInt64>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count))
|| (res = executeGeneric<Int8>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count))
|| (res = executeGeneric<Int16>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count))
|| (res = executeGeneric<Int32>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count))
|| (res = executeGeneric<Int64>(column_ptrs[0], column_ptrs[1], column_ptrs[2], input_rows_count)))
{
}
}

View File

@ -48,7 +48,6 @@ template <> struct ActionValueTypeMap<DataTypeUInt64> { using ActionValueTyp
template <> struct ActionValueTypeMap<DataTypeDate> { using ActionValueType = UInt16; };
template <> struct ActionValueTypeMap<DataTypeDate32> { using ActionValueType = Int32; };
template <> struct ActionValueTypeMap<DataTypeDateTime> { using ActionValueType = UInt32; };
// TODO(vnemkov): to add sub-second format instruction, make that DateTime64 and do some math in Action<T>.
template <> struct ActionValueTypeMap<DataTypeDateTime64> { using ActionValueType = Int64; };
@ -113,16 +112,16 @@ private:
class Action
{
public:
using Func = void (*)(char *, Time, const DateLUTImpl &);
using Func = void (*)(char *, Time, UInt64, UInt32, const DateLUTImpl &);
Func func;
size_t shift;
explicit Action(Func func_, size_t shift_ = 0) : func(func_), shift(shift_) {}
void perform(char *& target, Time source, const DateLUTImpl & timezone)
void perform(char *& target, Time source, UInt64 fractional_second, UInt32 scale, const DateLUTImpl & timezone)
{
func(target, source, timezone);
func(target, source, fractional_second, scale, timezone);
target += shift;
}
@ -148,30 +147,30 @@ private:
}
public:
static void noop(char *, Time, const DateLUTImpl &)
static void noop(char *, Time, UInt64, UInt32, const DateLUTImpl &)
{
}
static void century(char * target, Time source, const DateLUTImpl & timezone)
static void century(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
auto year = ToYearImpl::execute(source, timezone);
auto century = year / 100;
writeNumber2(target, century);
}
static void dayOfMonth(char * target, Time source, const DateLUTImpl & timezone)
static void dayOfMonth(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToDayOfMonthImpl::execute(source, timezone));
}
static void americanDate(char * target, Time source, const DateLUTImpl & timezone)
static void americanDate(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToMonthImpl::execute(source, timezone));
writeNumber2(target + 3, ToDayOfMonthImpl::execute(source, timezone));
writeNumber2(target + 6, ToYearImpl::execute(source, timezone) % 100);
}
static void dayOfMonthSpacePadded(char * target, Time source, const DateLUTImpl & timezone)
static void dayOfMonthSpacePadded(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
auto day = ToDayOfMonthImpl::execute(source, timezone);
if (day < 10)
@ -180,101 +179,107 @@ private:
writeNumber2(target, day);
}
static void ISO8601Date(char * target, Time source, const DateLUTImpl & timezone) // NOLINT
static void ISO8601Date(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone) // NOLINT
{
writeNumber4(target, ToYearImpl::execute(source, timezone));
writeNumber2(target + 5, ToMonthImpl::execute(source, timezone));
writeNumber2(target + 8, ToDayOfMonthImpl::execute(source, timezone));
}
static void dayOfYear(char * target, Time source, const DateLUTImpl & timezone)
static void dayOfYear(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber3(target, ToDayOfYearImpl::execute(source, timezone));
}
static void month(char * target, Time source, const DateLUTImpl & timezone)
static void month(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToMonthImpl::execute(source, timezone));
}
static void dayOfWeek(char * target, Time source, const DateLUTImpl & timezone)
static void dayOfWeek(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
*target += ToDayOfWeekImpl::execute(source, timezone);
}
static void dayOfWeek0To6(char * target, Time source, const DateLUTImpl & timezone)
static void dayOfWeek0To6(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
auto day = ToDayOfWeekImpl::execute(source, timezone);
*target += (day == 7 ? 0 : day);
}
static void ISO8601Week(char * target, Time source, const DateLUTImpl & timezone) // NOLINT
static void ISO8601Week(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone) // NOLINT
{
writeNumber2(target, ToISOWeekImpl::execute(source, timezone));
}
static void ISO8601Year2(char * target, Time source, const DateLUTImpl & timezone) // NOLINT
static void ISO8601Year2(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone) // NOLINT
{
writeNumber2(target, ToISOYearImpl::execute(source, timezone) % 100);
}
static void ISO8601Year4(char * target, Time source, const DateLUTImpl & timezone) // NOLINT
static void ISO8601Year4(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone) // NOLINT
{
writeNumber4(target, ToISOYearImpl::execute(source, timezone));
}
static void year2(char * target, Time source, const DateLUTImpl & timezone)
static void year2(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToYearImpl::execute(source, timezone) % 100);
}
static void year4(char * target, Time source, const DateLUTImpl & timezone)
static void year4(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber4(target, ToYearImpl::execute(source, timezone));
}
static void hour24(char * target, Time source, const DateLUTImpl & timezone)
static void hour24(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToHourImpl::execute(source, timezone));
}
static void hour12(char * target, Time source, const DateLUTImpl & timezone)
static void hour12(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
auto x = ToHourImpl::execute(source, timezone);
writeNumber2(target, x == 0 ? 12 : (x > 12 ? x - 12 : x));
}
static void minute(char * target, Time source, const DateLUTImpl & timezone)
static void minute(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToMinuteImpl::execute(source, timezone));
}
static void AMPM(char * target, Time source, const DateLUTImpl & timezone) // NOLINT
static void AMPM(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone) // NOLINT
{
auto hour = ToHourImpl::execute(source, timezone);
if (hour >= 12)
*target = 'P';
}
static void hhmm24(char * target, Time source, const DateLUTImpl & timezone)
static void hhmm24(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToHourImpl::execute(source, timezone));
writeNumber2(target + 3, ToMinuteImpl::execute(source, timezone));
}
static void second(char * target, Time source, const DateLUTImpl & timezone)
static void second(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
writeNumber2(target, ToSecondImpl::execute(source, timezone));
}
static void ISO8601Time(char * target, Time source, const DateLUTImpl & timezone) // NOLINT
static void fractionalSecond(char * target, Time /*source*/, UInt64 fractional_second, UInt32 scale, const DateLUTImpl & /*timezone*/)
{
for (Int64 i = scale, value = fractional_second; i > 0; --i, value /= 10)
target[i - 1] += value % 10;
}
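/// Worked example (hypothetical values) of the digit fill above: the target buffer is
/// pre-filled with '0' characters, so with scale = 3 and fractional_second = 7 the loop
/// adds digits from the right: "000" -> target[2] += 7, target[1] += 0, target[0] += 0,
/// producing "007".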
static void ISO8601Time(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone) // NOLINT
{
writeNumber2(target, ToHourImpl::execute(source, timezone));
writeNumber2(target + 3, ToMinuteImpl::execute(source, timezone));
writeNumber2(target + 6, ToSecondImpl::execute(source, timezone));
}
static void timezoneOffset(char * target, Time source, const DateLUTImpl & timezone)
static void timezoneOffset(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
auto offset = TimezoneOffsetImpl::execute(source, timezone);
if (offset < 0)
@ -287,7 +292,7 @@ private:
writeNumber2(target + 3, offset % 3600 / 60);
}
static void quarter(char * target, Time source, const DateLUTImpl & timezone)
static void quarter(char * target, Time source, UInt64 /*fractional_second*/, UInt32 /*scale*/, const DateLUTImpl & timezone)
{
*target += ToQuarterImpl::execute(source, timezone);
}
@ -426,9 +431,15 @@ public:
String pattern = pattern_column->getValue<String>();
UInt32 scale [[maybe_unused]] = 0;
if constexpr (std::is_same_v<DataType, DataTypeDateTime64>)
{
scale = times->getScale();
}
using T = typename ActionValueTypeMap<DataType>::ActionValueType;
std::vector<Action<T>> instructions;
String pattern_to_fill = parsePattern(pattern, instructions);
String pattern_to_fill = parsePattern(pattern, instructions, scale);
size_t result_size = pattern_to_fill.size();
const DateLUTImpl * time_zone_tmp = nullptr;
@ -444,12 +455,6 @@ public:
const DateLUTImpl & time_zone = *time_zone_tmp;
const auto & vec = times->getData();
UInt32 scale [[maybe_unused]] = 0;
if constexpr (std::is_same_v<DataType, DataTypeDateTime64>)
{
scale = times->getScale();
}
auto col_res = ColumnString::create();
auto & dst_data = col_res->getChars();
auto & dst_offsets = col_res->getOffsets();
@ -484,16 +489,16 @@ public:
{
if constexpr (std::is_same_v<DataType, DataTypeDateTime64>)
{
const auto c = DecimalUtils::split(vec[i], scale);
for (auto & instruction : instructions)
{
const auto c = DecimalUtils::split(vec[i], scale);
instruction.perform(pos, static_cast<Int64>(c.whole), time_zone);
instruction.perform(pos, static_cast<Int64>(c.whole), c.fractional, scale, time_zone);
}
}
else
{
for (auto & instruction : instructions)
instruction.perform(pos, static_cast<UInt32>(vec[i]), time_zone);
instruction.perform(pos, static_cast<UInt32>(vec[i]), 0, 0, time_zone);
}
dst_offsets[i] = pos - begin;
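/// Illustration (hypothetical value): for a DateTime64(3) holding 1670000000123,
/// DecimalUtils::split(vec[i], 3) yields c.whole == 1670000000 (seconds) and
/// c.fractional == 123, which the fractionalSecond instruction renders as "123".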
@ -504,7 +509,7 @@ public:
}
template <typename T>
String parsePattern(const String & pattern, std::vector<Action<T>> & instructions) const
String parsePattern(const String & pattern, std::vector<Action<T>> & instructions, UInt32 scale) const
{
String result;
@ -573,6 +578,16 @@ public:
result.append(" 0");
break;
// Fractional seconds
case 'f':
{
/// If the time data type has no fractional part, then we print '0' as the fractional part.
const auto actual_scale = std::max<UInt32>(1, scale);
instructions.emplace_back(&Action<T>::fractionalSecond, actual_scale);
result.append(actual_scale, '0');
break;
}
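// Hypothetical usage of the new specifier: formatDateTime(toDateTime64('2022-12-12 12:12:12.123', 3), '%S.%f')
// is expected to produce '12.123'; for types without a fractional part, %f prints a single '0'.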
// Short YYYY-MM-DD date, equivalent to %Y-%m-%d 2001-08-23
case 'F':
instructions.emplace_back(&Action<T>::ISO8601Date, 10);

View File

@ -319,12 +319,17 @@ template void readStringUntilEOFInto<PaddedPODArray<UInt8>>(PaddedPODArray<UInt8
/** Parse the escape sequence, which can be simple (one character after backslash) or more complex (multiple characters).
* It is assumed that the cursor is located on the `\` symbol
*/
template <typename Vector>
static void parseComplexEscapeSequence(Vector & s, ReadBuffer & buf)
template <typename Vector, typename ReturnType = void>
static ReturnType parseComplexEscapeSequence(Vector & s, ReadBuffer & buf)
{
++buf.position();
if (buf.eof())
throw Exception("Cannot parse escape sequence", ErrorCodes::CANNOT_PARSE_ESCAPE_SEQUENCE);
{
if constexpr (std::is_same_v<ReturnType, void>)
throw Exception("Cannot parse escape sequence", ErrorCodes::CANNOT_PARSE_ESCAPE_SEQUENCE);
else
return ReturnType(false);
}
char char_after_backslash = *buf.position();
@ -363,6 +368,8 @@ static void parseComplexEscapeSequence(Vector & s, ReadBuffer & buf)
s.push_back(decoded_char);
++buf.position();
}
return ReturnType(true);
}
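/// A standalone sketch (illustrative names) of the ReturnType convention adopted above:
/// a single body either throws (ReturnType = void) or reports failure via bool, selected
/// at compile time. Note that `return ReturnType(true);` is valid even when ReturnType is
/// void, because returning a void expression from a void function is allowed in C++.
#include <stdexcept>
#include <type_traits>

template <typename ReturnType = void>
ReturnType parseDigit(char c, int & out)
{
    if (c < '0' || c > '9')
    {
        if constexpr (std::is_same_v<ReturnType, void>)
            throw std::runtime_error("Cannot parse digit");
        else
            return ReturnType(false);
    }
    out = c - '0';
    return ReturnType(true);
}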
@ -521,14 +528,18 @@ template void readEscapedStringInto<NullOutput>(NullOutput & s, ReadBuffer & buf
* backslash escape sequences are also parsed,
* that could be slightly confusing.
*/
template <char quote, bool enable_sql_style_quoting, typename Vector>
static void readAnyQuotedStringInto(Vector & s, ReadBuffer & buf)
template <char quote, bool enable_sql_style_quoting, typename Vector, typename ReturnType = void>
static ReturnType readAnyQuotedStringInto(Vector & s, ReadBuffer & buf)
{
static constexpr bool throw_exception = std::is_same_v<ReturnType, void>;
if (buf.eof() || *buf.position() != quote)
{
throw ParsingException(ErrorCodes::CANNOT_PARSE_QUOTED_STRING,
"Cannot parse quoted string: expected opening quote '{}', got '{}'",
std::string{quote}, buf.eof() ? "EOF" : std::string{*buf.position()});
if constexpr (throw_exception)
throw ParsingException(ErrorCodes::CANNOT_PARSE_QUOTED_STRING,
"Cannot parse quoted string: expected opening quote '{}', got '{}'",
std::string{quote}, buf.eof() ? "EOF" : std::string{*buf.position()});
else
return ReturnType(false);
}
++buf.position();
@ -554,15 +565,26 @@ static void readAnyQuotedStringInto(Vector & s, ReadBuffer & buf)
continue;
}
return;
return ReturnType(true);
}
if (*buf.position() == '\\')
parseComplexEscapeSequence(s, buf);
{
if constexpr (throw_exception)
parseComplexEscapeSequence<Vector, ReturnType>(s, buf);
else
{
if (!parseComplexEscapeSequence<Vector, ReturnType>(s, buf))
return ReturnType(false);
}
}
}
throw ParsingException("Cannot parse quoted string: expected closing quote",
ErrorCodes::CANNOT_PARSE_QUOTED_STRING);
if constexpr (throw_exception)
throw ParsingException("Cannot parse quoted string: expected closing quote",
ErrorCodes::CANNOT_PARSE_QUOTED_STRING);
else
return ReturnType(false);
}
template <bool enable_sql_style_quoting, typename Vector>
@ -571,6 +593,14 @@ void readQuotedStringInto(Vector & s, ReadBuffer & buf)
readAnyQuotedStringInto<'\'', enable_sql_style_quoting>(s, buf);
}
template <typename Vector>
bool tryReadQuotedStringInto(Vector & s, ReadBuffer & buf)
{
return readAnyQuotedStringInto<'\'', false, Vector, bool>(s, buf);
}
template bool tryReadQuotedStringInto(String & s, ReadBuffer & buf);
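/// Hypothetical usage of the non-throwing variant:
///     ReadBufferFromString buf("'abc'");
///     String s;
///     bool ok = tryReadQuotedStringInto(s, buf); /// ok == true, s == "abc"
/// On malformed input (e.g. a missing closing quote) it returns false instead of throwing.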
template <bool enable_sql_style_quoting, typename Vector>
void readDoubleQuotedStringInto(Vector & s, ReadBuffer & buf)
{
@ -934,6 +964,7 @@ template void readJSONStringInto<PaddedPODArray<UInt8>, void>(PaddedPODArray<UIn
template bool readJSONStringInto<PaddedPODArray<UInt8>, bool>(PaddedPODArray<UInt8> & s, ReadBuffer & buf);
template void readJSONStringInto<NullOutput>(NullOutput & s, ReadBuffer & buf);
template void readJSONStringInto<String>(String & s, ReadBuffer & buf);
template bool readJSONStringInto<String, bool>(String & s, ReadBuffer & buf);
template <typename Vector, typename ReturnType>
ReturnType readJSONObjectPossiblyInvalid(Vector & s, ReadBuffer & buf)
@ -1501,6 +1532,43 @@ static void readParsedValueInto(Vector & s, ReadBuffer & buf, ParseFunc parse_fu
peekable_buf.position() = end;
}
template <typename Vector>
static void readQuotedStringFieldInto(Vector & s, ReadBuffer & buf)
{
assertChar('\'', buf);
s.push_back('\'');
while (!buf.eof())
{
char * next_pos = find_first_symbols<'\\', '\''>(buf.position(), buf.buffer().end());
s.append(buf.position(), next_pos);
buf.position() = next_pos;
if (!buf.hasPendingData())
continue;
if (*buf.position() == '\'')
break;
s.push_back(*buf.position());
if (*buf.position() == '\\')
{
++buf.position();
if (!buf.eof())
{
s.push_back(*buf.position());
++buf.position();
}
}
}
if (buf.eof())
return;
++buf.position();
s.push_back('\'');
}
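/// Unlike readQuotedStringInto, this keeps the field in its original quoted form:
/// for the input 'ab\'c' it appends 'ab\'c' verbatim, escape sequences included,
/// so the field can later be re-parsed as a quoted literal.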
template <char opening_bracket, char closing_bracket, typename Vector>
static void readQuotedFieldInBracketsInto(Vector & s, ReadBuffer & buf)
{
@ -1518,20 +1586,19 @@ static void readQuotedFieldInBracketsInto(Vector & s, ReadBuffer & buf)
if (!buf.hasPendingData())
continue;
s.push_back(*buf.position());
if (*buf.position() == '\'')
{
readQuotedStringInto<false>(s, buf);
s.push_back('\'');
readQuotedStringFieldInto(s, buf);
}
else if (*buf.position() == opening_bracket)
{
s.push_back(opening_bracket);
++balance;
++buf.position();
}
else if (*buf.position() == closing_bracket)
{
s.push_back(closing_bracket);
--balance;
++buf.position();
}
@ -1554,11 +1621,7 @@ void readQuotedFieldInto(Vector & s, ReadBuffer & buf)
/// - Number: integer, float, decimal.
if (*buf.position() == '\'')
{
s.push_back('\'');
readQuotedStringInto<false>(s, buf);
s.push_back('\'');
}
readQuotedStringFieldInto(s, buf);
else if (*buf.position() == '[')
readQuotedFieldInBracketsInto<'[', ']'>(s, buf);
else if (*buf.position() == '(')

View File

@ -613,6 +613,9 @@ bool tryReadJSONStringInto(Vector & s, ReadBuffer & buf)
return readJSONStringInto<Vector, bool>(s, buf);
}
template <typename Vector>
bool tryReadQuotedStringInto(Vector & s, ReadBuffer & buf);
/// Reads a chunk of data between {} in such a way
/// that it has a balanced sequence of {}.
/// So it may form a JSON object, but the object may be incorrect.

View File

@ -15,6 +15,8 @@ struct WriteSettings
bool enable_filesystem_cache_on_write_operations = false;
bool enable_filesystem_cache_log = false;
bool is_file_cache_persistent = false;
bool throw_on_error_from_cache = false;
bool s3_allow_parallel_part_upload = true;
/// Monitoring

View File

@ -9,6 +9,7 @@
#include <Functions/FunctionsLogical.h>
#include <Functions/CastOverloadResolver.h>
#include <Interpreters/Context.h>
#include <Interpreters/ArrayJoinAction.h>
#include <IO/WriteBufferFromString.h>
#include <IO/Operators.h>
#include <Core/SortDescription.h>
@ -141,7 +142,7 @@ const ActionsDAG::Node & ActionsDAG::addAlias(const Node & child, std::string al
const ActionsDAG::Node & ActionsDAG::addArrayJoin(const Node & child, std::string result_name)
{
const DataTypeArray * array_type = typeid_cast<const DataTypeArray *>(child.result_type.get());
const auto & array_type = getArrayJoinDataType(child.result_type);
if (!array_type)
throw Exception("ARRAY JOIN requires array argument", ErrorCodes::TYPE_MISMATCH);
@ -463,11 +464,10 @@ static ColumnWithTypeAndName executeActionForHeader(const ActionsDAG::Node * nod
auto key = arguments.at(0);
key.column = key.column->convertToFullColumnIfConst();
const ColumnArray * array = typeid_cast<const ColumnArray *>(key.column.get());
const auto * array = getArrayJoinColumnRawPtr(key.column);
if (!array)
throw Exception(ErrorCodes::TYPE_MISMATCH,
"ARRAY JOIN of not array: {}", node->result_name);
"ARRAY JOIN of not array nor map: {}", node->result_name);
res_column.column = array->getDataPtr()->cloneEmpty();
break;
}
@ -1537,12 +1537,39 @@ ActionsDAG::SplitResult ActionsDAG::splitActionsBeforeArrayJoin(const NameSet &
return res;
}
ActionsDAG::NodeRawConstPtrs ActionsDAG::getParents(const Node * target) const
{
NodeRawConstPtrs parents;
for (const auto & node : getNodes())
{
for (const auto & child : node.children)
{
if (child == target)
{
parents.push_back(&node);
break;
}
}
}
return parents;
}
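/// Illustration (hypothetical DAG): for outputs [c, materialize(c)] where c is a constant
/// sorting column, getParents(c) returns {materialize(c)}, and both nodes are added to
/// split_nodes below so that materialization does not break the header.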
ActionsDAG::SplitResult ActionsDAG::splitActionsBySortingDescription(const NameSet & sort_columns) const
{
std::unordered_set<const Node *> split_nodes;
for (const auto & sort_column : sort_columns)
if (const auto * node = tryFindInOutputs(sort_column))
{
split_nodes.insert(node);
/// Sorting can materialize const columns, so if we have a const expression used in sorting,
/// we should also add all its parents, otherwise we can break the header
/// (a function can expect a const column, but will get a materialized one).
if (node->column && isColumnConst(*node->column))
{
auto parents = getParents(node);
split_nodes.insert(parents.begin(), parents.end());
}
}
else
throw Exception(ErrorCodes::LOGICAL_ERROR,
"Sorting column {} wasn't found in the ActionsDAG's outputs. DAG:\n{}",

View File

@ -343,6 +343,8 @@ public:
const ContextPtr & context);
private:
NodeRawConstPtrs getParents(const Node * target) const;
Node & addNode(Node node);
#if USE_EMBEDDED_COMPILER

View File

@ -1,6 +1,8 @@
#include <Common/typeid_cast.h>
#include <Columns/ColumnArray.h>
#include <DataTypes/DataTypeArray.h>
#include <DataTypes/DataTypeMap.h>
#include <Columns/ColumnArray.h>
#include <Columns/ColumnMap.h>
#include <DataTypes/DataTypesNumber.h>
#include <Functions/FunctionFactory.h>
#include <Interpreters/Context.h>
@ -16,6 +18,46 @@ namespace ErrorCodes
extern const int TYPE_MISMATCH;
}
std::shared_ptr<const DataTypeArray> getArrayJoinDataType(DataTypePtr type)
{
if (const auto * array_type = typeid_cast<const DataTypeArray *>(type.get()))
return std::shared_ptr<const DataTypeArray>{type, array_type};
else if (const auto * map_type = typeid_cast<const DataTypeMap *>(type.get()))
{
const auto & nested_type = map_type->getNestedType();
const auto * nested_array_type = typeid_cast<const DataTypeArray *>(nested_type.get());
return std::shared_ptr<const DataTypeArray>{nested_type, nested_array_type};
}
else
return nullptr;
}
ColumnPtr getArrayJoinColumn(const ColumnPtr & column)
{
if (typeid_cast<const ColumnArray *>(column.get()))
return column;
else if (const auto * map = typeid_cast<const ColumnMap *>(column.get()))
return map->getNestedColumnPtr();
else
return nullptr;
}
const ColumnArray * getArrayJoinColumnRawPtr(const ColumnPtr & column)
{
if (const auto & col_arr = getArrayJoinColumn(column))
return typeid_cast<const ColumnArray *>(col_arr.get());
return nullptr;
}
ColumnWithTypeAndName convertArrayJoinColumn(const ColumnWithTypeAndName & src_col)
{
ColumnWithTypeAndName array_col;
array_col.name = src_col.name;
array_col.type = getArrayJoinDataType(src_col.type);
array_col.column = getArrayJoinColumn(src_col.column->convertToFullColumnIfConst());
return array_col;
}
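/// Hypothetical example of the new Map support: for a column m of type Map(String, UInt8),
/// ARRAY JOIN iterates the nested Array(Tuple(String, UInt8)), e.g.
///     SELECT kv FROM (SELECT map('a', 1, 'b', 2) AS m) ARRAY JOIN m AS kv;
/// yields the rows ('a', 1) and ('b', 2).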
ArrayJoinAction::ArrayJoinAction(const NameSet & array_joined_columns_, bool array_join_is_left, ContextPtr context)
: columns(array_joined_columns_)
, is_left(array_join_is_left)
@ -28,13 +70,12 @@ ArrayJoinAction::ArrayJoinAction(const NameSet & array_joined_columns_, bool arr
{
function_length = FunctionFactory::instance().get("length", context);
function_greatest = FunctionFactory::instance().get("greatest", context);
function_arrayResize = FunctionFactory::instance().get("arrayResize", context);
function_array_resize = FunctionFactory::instance().get("arrayResize", context);
}
else if (is_left)
function_builder = FunctionFactory::instance().get("emptyArrayToSingle", context);
}
void ArrayJoinAction::prepare(ColumnsWithTypeAndName & sample) const
{
for (auto & current : sample)
@ -42,11 +83,13 @@ void ArrayJoinAction::prepare(ColumnsWithTypeAndName & sample) const
if (!columns.contains(current.name))
continue;
const DataTypeArray * array_type = typeid_cast<const DataTypeArray *>(&*current.type);
if (!array_type)
throw Exception("ARRAY JOIN requires array argument", ErrorCodes::TYPE_MISMATCH);
current.type = array_type->getNestedType();
current.column = nullptr;
if (const auto & type = getArrayJoinDataType(current.type))
{
current.column = nullptr;
current.type = type->getNestedType();
}
else
throw Exception("ARRAY JOIN requires array or map argument", ErrorCodes::TYPE_MISMATCH);
}
}
@ -55,10 +98,10 @@ void ArrayJoinAction::execute(Block & block)
if (columns.empty())
throw Exception("No arrays to join", ErrorCodes::LOGICAL_ERROR);
ColumnPtr any_array_ptr = block.getByName(*columns.begin()).column->convertToFullColumnIfConst();
const ColumnArray * any_array = typeid_cast<const ColumnArray *>(&*any_array_ptr);
ColumnPtr any_array_map_ptr = block.getByName(*columns.begin()).column->convertToFullColumnIfConst();
const auto * any_array = getArrayJoinColumnRawPtr(any_array_map_ptr);
if (!any_array)
throw Exception("ARRAY JOIN of not array: " + *columns.begin(), ErrorCodes::TYPE_MISMATCH);
throw Exception("ARRAY JOIN requires array or map argument", ErrorCodes::TYPE_MISMATCH);
/// If LEFT ARRAY JOIN, then we create columns in which empty arrays are replaced by arrays with one element - the default value.
std::map<String, ColumnPtr> non_empty_array_columns;
@ -78,7 +121,8 @@ void ArrayJoinAction::execute(Block & block)
{
auto & src_col = block.getByName(name);
ColumnsWithTypeAndName tmp_block{src_col}; //, {{}, uint64, {}}};
ColumnWithTypeAndName array_col = convertArrayJoinColumn(src_col);
ColumnsWithTypeAndName tmp_block{array_col}; //, {{}, uint64, {}}};
auto len_col = function_length->build(tmp_block)->execute(tmp_block, uint64, rows);
ColumnsWithTypeAndName tmp_block2{column_of_max_length, {len_col, uint64, {}}};
@ -89,28 +133,35 @@ void ArrayJoinAction::execute(Block & block)
{
auto & src_col = block.getByName(name);
ColumnsWithTypeAndName tmp_block{src_col, column_of_max_length};
src_col.column = function_arrayResize->build(tmp_block)->execute(tmp_block, src_col.type, rows);
any_array_ptr = src_col.column->convertToFullColumnIfConst();
ColumnWithTypeAndName array_col = convertArrayJoinColumn(src_col);
ColumnsWithTypeAndName tmp_block{array_col, column_of_max_length};
array_col.column = function_array_resize->build(tmp_block)->execute(tmp_block, array_col.type, rows);
src_col = std::move(array_col);
any_array_map_ptr = src_col.column->convertToFullColumnIfConst();
}
any_array = typeid_cast<const ColumnArray *>(&*any_array_ptr);
any_array = getArrayJoinColumnRawPtr(any_array_map_ptr);
if (!any_array)
throw Exception("ARRAY JOIN requires array or map argument", ErrorCodes::TYPE_MISMATCH);
}
else if (is_left)
{
for (const auto & name : columns)
{
auto src_col = block.getByName(name);
ColumnsWithTypeAndName tmp_block{src_col};
non_empty_array_columns[name] = function_builder->build(tmp_block)->execute(tmp_block, src_col.type, src_col.column->size());
const auto & src_col = block.getByName(name);
ColumnWithTypeAndName array_col = convertArrayJoinColumn(src_col);
ColumnsWithTypeAndName tmp_block{array_col};
non_empty_array_columns[name] = function_builder->build(tmp_block)->execute(tmp_block, array_col.type, array_col.column->size());
}
any_array_ptr = non_empty_array_columns.begin()->second->convertToFullColumnIfConst();
any_array = &typeid_cast<const ColumnArray &>(*any_array_ptr);
any_array_map_ptr = non_empty_array_columns.begin()->second->convertToFullColumnIfConst();
any_array = getArrayJoinColumnRawPtr(any_array_map_ptr);
if (!any_array)
throw Exception("ARRAY JOIN requires array or map argument", ErrorCodes::TYPE_MISMATCH);
}
size_t num_columns = block.columns();
for (size_t i = 0; i < num_columns; ++i)
{
@ -118,18 +169,30 @@ void ArrayJoinAction::execute(Block & block)
if (columns.contains(current.name))
{
if (!typeid_cast<const DataTypeArray *>(&*current.type))
throw Exception("ARRAY JOIN of not array: " + current.name, ErrorCodes::TYPE_MISMATCH);
if (const auto & type = getArrayJoinDataType(current.type))
{
ColumnPtr array_ptr;
if (typeid_cast<const DataTypeArray *>(current.type.get()))
{
array_ptr = (is_left && !is_unaligned) ? non_empty_array_columns[current.name] : current.column;
array_ptr = array_ptr->convertToFullColumnIfConst();
}
else
{
ColumnPtr map_ptr = current.column->convertToFullColumnIfConst();
const ColumnMap & map = typeid_cast<const ColumnMap &>(*map_ptr);
array_ptr = (is_left && !is_unaligned) ? non_empty_array_columns[current.name] : map.getNestedColumnPtr();
}
ColumnPtr array_ptr = (is_left && !is_unaligned) ? non_empty_array_columns[current.name] : current.column;
array_ptr = array_ptr->convertToFullColumnIfConst();
const ColumnArray & array = typeid_cast<const ColumnArray &>(*array_ptr);
if (!is_unaligned && !array.hasEqualOffsets(*any_array))
throw Exception("Sizes of ARRAY-JOIN-ed arrays do not match", ErrorCodes::SIZES_OF_ARRAYS_DOESNT_MATCH);
const ColumnArray & array = typeid_cast<const ColumnArray &>(*array_ptr);
if (!is_unaligned && !array.hasEqualOffsets(typeid_cast<const ColumnArray &>(*any_array_ptr)))
throw Exception("Sizes of ARRAY-JOIN-ed arrays do not match", ErrorCodes::SIZES_OF_ARRAYS_DOESNT_MATCH);
current.column = typeid_cast<const ColumnArray &>(*array_ptr).getDataPtr();
current.type = typeid_cast<const DataTypeArray &>(*current.type).getNestedType();
current.column = typeid_cast<const ColumnArray &>(*array_ptr).getDataPtr();
current.type = type->getNestedType();
}
else
throw Exception("ARRAY JOIN of not array nor map: " + current.name, ErrorCodes::TYPE_MISMATCH);
}
else
{

View File

@ -11,6 +11,15 @@ namespace DB
class IFunctionOverloadResolver;
using FunctionOverloadResolverPtr = std::shared_ptr<IFunctionOverloadResolver>;
class DataTypeArray;
class ColumnArray;
std::shared_ptr<const DataTypeArray> getArrayJoinDataType(DataTypePtr type);
const ColumnArray * getArrayJoinColumnRawPtr(const ColumnPtr & column);
/// If the input ARRAY JOIN column has a Map type, convert it to the corresponding Array type.
/// Otherwise do nothing.
ColumnWithTypeAndName convertArrayJoinColumn(const ColumnWithTypeAndName & src_col);
class ArrayJoinAction
{
public:
@ -21,7 +30,7 @@ public:
/// For unaligned [LEFT] ARRAY JOIN
FunctionOverloadResolverPtr function_length;
FunctionOverloadResolverPtr function_greatest;
FunctionOverloadResolverPtr function_arrayResize;
FunctionOverloadResolverPtr function_array_resize;
/// For LEFT ARRAY JOIN.
FunctionOverloadResolverPtr function_builder;

View File

@ -18,7 +18,6 @@ namespace DB
{
namespace ErrorCodes
{
extern const int REMOTE_FS_OBJECT_CACHE_ERROR;
extern const int LOGICAL_ERROR;
}
@ -98,7 +97,7 @@ void FileCache::assertInitialized(std::lock_guard<std::mutex> & /* cache_lock */
if (initialization_exception)
std::rethrow_exception(initialization_exception);
else
throw Exception(ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR, "Cache not initialized");
throw Exception(ErrorCodes::LOGICAL_ERROR, "Cache not initialized");
}
}
@ -541,12 +540,12 @@ FileSegmentPtr FileCache::createFileSegmentForDownload(
#endif
if (size > max_file_segment_size)
throw Exception(ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR, "Requested size exceeds max file segment size");
throw Exception(ErrorCodes::LOGICAL_ERROR, "Requested size exceeds max file segment size");
auto * cell = getCell(key, offset, cache_lock);
if (cell)
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Cache cell already exists for key `{}` and offset {}",
key.toString(), offset);
@ -738,7 +737,7 @@ bool FileCache::tryReserveForMainList(
auto * cell = getCell(entry_key, entry_offset, cache_lock);
if (!cell)
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Cache became inconsistent. Key: {}, offset: {}",
key.toString(), offset);
@ -964,7 +963,7 @@ void FileCache::remove(
catch (...)
{
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Removal of cached file failed. Key: {}, offset: {}, path: {}, error: {}",
key.toString(), offset, cache_file_path, getCurrentExceptionMessage(false));
}
@ -981,7 +980,7 @@ void FileCache::loadCacheInfoIntoMemory(std::lock_guard<std::mutex> & cache_lock
/// cache_base_path / key_prefix / key / offset
if (!files.empty())
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Cache initialization is partially made. "
"This can be a result of a failed first attempt to initialize cache. "
"Please, check log for error messages");
@ -1214,7 +1213,7 @@ FileCache::FileSegmentCell::FileSegmentCell(
}
default:
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Can create cell with either EMPTY, DOWNLOADED, DOWNLOADING state, got: {}",
FileSegment::stateToString(file_segment->download_state));
}

View File

@ -19,7 +19,6 @@ namespace DB
namespace ErrorCodes
{
extern const int REMOTE_FS_OBJECT_CACHE_ERROR;
extern const int LOGICAL_ERROR;
}
@ -66,7 +65,7 @@ FileSegment::FileSegment(
default:
{
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Can only create cell with either EMPTY, DOWNLOADED or SKIP_CACHE state");
}
}
@ -278,7 +277,7 @@ void FileSegment::resetRemoteFileReader()
void FileSegment::write(const char * from, size_t size, size_t offset)
{
if (!size)
throw Exception(ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR, "Writing zero size is not allowed");
throw Exception(ErrorCodes::LOGICAL_ERROR, "Writing zero size is not allowed");
{
std::unique_lock segment_lock(mutex);
@ -294,7 +293,7 @@ void FileSegment::write(const char * from, size_t size, size_t offset)
size_t first_non_downloaded_offset = getFirstNonDownloadedOffsetUnlocked(segment_lock);
if (offset != first_non_downloaded_offset)
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Attempt to write {} bytes to offset: {}, but current write offset is {}",
size, offset, first_non_downloaded_offset);
@ -304,7 +303,7 @@ void FileSegment::write(const char * from, size_t size, size_t offset)
if (free_reserved_size < size)
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Not enough space is reserved. Available: {}, expected: {}", free_reserved_size, size);
if (current_downloaded_size == range().size())
@ -364,7 +363,7 @@ FileSegment::State FileSegment::wait()
return download_state;
if (download_state == State::EMPTY)
throw Exception(ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR, "Cannot wait on a file segment with empty state");
throw Exception(ErrorCodes::LOGICAL_ERROR, "Cannot wait on a file segment with empty state");
if (download_state == State::DOWNLOADING)
{
@ -382,7 +381,7 @@ FileSegment::State FileSegment::wait()
bool FileSegment::reserve(size_t size_to_reserve)
{
if (!size_to_reserve)
throw Exception(ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR, "Zero space reservation is not allowed");
throw Exception(ErrorCodes::LOGICAL_ERROR, "Zero space reservation is not allowed");
size_t expected_downloaded_size;
@ -396,7 +395,7 @@ bool FileSegment::reserve(size_t size_to_reserve)
if (expected_downloaded_size + size_to_reserve > range().size())
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Attempt to reserve space too much space ({}) for file segment with range: {} (downloaded size: {})",
size_to_reserve, range().toString(), downloaded_size);
@ -434,9 +433,6 @@ void FileSegment::setDownloadedUnlocked([[maybe_unused]] std::unique_lock<std::m
if (is_downloaded)
return;
setDownloadState(State::DOWNLOADED);
is_downloaded = true;
if (cache_writer)
{
cache_writer->finalize();
@ -498,7 +494,7 @@ void FileSegment::completeWithState(State state)
{
cv.notify_all();
throw Exception(
ErrorCodes::REMOTE_FS_OBJECT_CACHE_ERROR,
ErrorCodes::LOGICAL_ERROR,
"Cannot complete file segment with state: {}", stateToString(state));
}
@ -559,8 +555,7 @@ void FileSegment::completeBasedOnCurrentState(std::lock_guard<std::mutex> & cach
{
if (is_last_holder)
cache->remove(key(), offset(), cache_lock, segment_lock);
return;
break;
}
case State::DOWNLOADED:
{
@ -613,6 +608,7 @@ void FileSegment::completeBasedOnCurrentState(std::lock_guard<std::mutex> & cach
}
}
is_completed = true;
LOG_TEST(log, "Completed file segment: {}", getInfoForLogUnlocked(segment_lock));
}
@ -748,6 +744,12 @@ bool FileSegment::isDetached() const
return is_detached;
}
bool FileSegment::isCompleted() const
{
std::unique_lock segment_lock(mutex);
return is_completed;
}
void FileSegment::detach(std::lock_guard<std::mutex> & /* cache_lock */, std::unique_lock<std::mutex> & segment_lock)
{
if (is_detached)

View File

@ -181,6 +181,8 @@ public:
bool isDetached() const;
bool isCompleted() const;
void assertCorrectness() const;
/**
@ -294,6 +296,7 @@ private:
/// "detached" file segment means that it is not owned by cache ("detached" from cache).
/// In general case, all file segments are owned by cache.
bool is_detached = false;
bool is_completed = false;
bool is_downloaded{false};
@ -317,11 +320,6 @@ struct FileSegmentsHolder : private boost::noncopyable
String toString();
FileSegments::iterator add(FileSegmentPtr && file_segment)
{
return file_segments.insert(file_segments.end(), file_segment);
}
FileSegments file_segments{};
};

View File

@ -3743,6 +3743,8 @@ WriteSettings Context::getWriteSettings() const
res.enable_filesystem_cache_on_write_operations = settings.enable_filesystem_cache_on_write_operations;
res.enable_filesystem_cache_log = settings.enable_filesystem_cache_log;
res.throw_on_error_from_cache = settings.throw_on_error_from_cache_on_write_operations;
res.s3_allow_parallel_part_upload = settings.s3_allow_parallel_part_upload;
res.remote_throttler = getRemoteWriteThrottler();

View File

@ -620,9 +620,9 @@ static void executeAction(const ExpressionActions::Action & action, ExecutionCon
array_join_key.column = array_join_key.column->convertToFullColumnIfConst();
const ColumnArray * array = typeid_cast<const ColumnArray *>(array_join_key.column.get());
const auto * array = getArrayJoinColumnRawPtr(array_join_key.column);
if (!array)
throw Exception("ARRAY JOIN of not array: " + action.node->result_name, ErrorCodes::TYPE_MISMATCH);
throw Exception("ARRAY JOIN of not array nor map: " + action.node->result_name, ErrorCodes::TYPE_MISMATCH);
for (auto & column : columns)
if (column.column)
@ -635,7 +635,7 @@ static void executeAction(const ExpressionActions::Action & action, ExecutionCon
auto & res_column = columns[action.result_position];
res_column.column = array->getDataPtr();
res_column.type = assert_cast<const DataTypeArray &>(*array_join_key.type).getNestedType();
res_column.type = getArrayJoinDataType(array_join_key.type)->getNestedType();
res_column.name = action.node->result_name;
num_rows = res_column.column->size();
@ -1008,7 +1008,7 @@ ExpressionActionsChain::ArrayJoinStep::ArrayJoinStep(ArrayJoinActionPtr array_jo
if (array_join->columns.contains(column.name))
{
const auto * array = typeid_cast<const DataTypeArray *>(column.type.get());
const auto & array = getArrayJoinDataType(column.type);
column.type = array->getNestedType();
/// Arrays are materialized
column.column = nullptr;

View File

@ -15,6 +15,7 @@
#include <DataTypes/DataTypeNullable.h>
#include <Columns/IColumn.h>
#include <Interpreters/Aggregator.h>
#include <Interpreters/ArrayJoinAction.h>
#include <Interpreters/Context.h>
#include <Interpreters/ConcurrentHashJoin.h>
@ -33,6 +34,7 @@
#include <Interpreters/replaceForPositionalArguments.h>
#include <Processors/QueryPlan/ExpressionStep.h>
#include <Processors/QueryPlan/AggregatingStep.h>
#include <AggregateFunctions/AggregateFunctionFactory.h>
#include <AggregateFunctions/parseAggregateFunctionParameters.h>
@ -1831,7 +1833,7 @@ ExpressionAnalysisResult::ExpressionAnalysisResult(
ssize_t where_step_num = -1;
ssize_t having_step_num = -1;
auto finalize_chain = [&](ExpressionActionsChain & chain)
auto finalize_chain = [&](ExpressionActionsChain & chain) -> ColumnsWithTypeAndName
{
if (prewhere_step_num >= 0)
{
@ -1852,7 +1854,9 @@ ExpressionAnalysisResult::ExpressionAnalysisResult(
finalize(chain, prewhere_step_num, where_step_num, having_step_num, query);
auto res = chain.getLastStep().getResultColumns();
chain.clear();
return res;
};
{
@ -1970,7 +1974,55 @@ ExpressionAnalysisResult::ExpressionAnalysisResult(
if (settings.group_by_use_nulls)
query_analyzer.appendGroupByModifiers(before_aggregation, chain, only_types);
finalize_chain(chain);
auto columns_before_aggregation = finalize_chain(chain);
/// Here we want to check that columns after aggregation have the same types as
/// were promised in query_analyzer.aggregated_columns
/// Ideally, they should be equal. In practice, this may not be true.
/// As an example, we don't build sets for IN inside ExpressionAnalysis::analyzeAggregation,
/// so constant folding for the expression (1 IN 1) does not work. This may change the return type
/// of functions with a LowCardinality argument: the function "substr(toLowCardinality('abc'), 1 IN 1)"
/// should normally return LowCardinality(String) when (1 IN 1) is constant, but without the built set
/// the (1 IN 1) constant is not propagated and "substr" returns the String type.
/// See 02503_in_lc_const_args_bug.sql
///
/// As a temporary solution, we add converting actions to the next chain.
/// Hopefully, later we can
/// * use a new analyzer where this issue is absent
/// * or remove ExpressionActionsChain completely and re-implement its logic on top of the query plan
{
for (auto & col : columns_before_aggregation)
if (!col.column)
col.column = col.type->createColumn();
Block header_before_aggregation(std::move(columns_before_aggregation));
auto keys = query_analyzer.aggregationKeys().getNames();
const auto & aggregates = query_analyzer.aggregates();
bool has_grouping = query_analyzer.group_by_kind != GroupByKind::ORDINARY;
auto actual_header = Aggregator::Params::getHeader(
header_before_aggregation, /*only_merge*/ false, keys, aggregates, /*final*/ true);
actual_header = AggregatingStep::appendGroupingColumn(
std::move(actual_header), keys, has_grouping, settings.group_by_use_nulls);
Block expected_header;
for (const auto & expected : query_analyzer.aggregated_columns)
expected_header.insert(ColumnWithTypeAndName(expected.type, expected.name));
if (!blocksHaveEqualStructure(actual_header, expected_header))
{
auto converting = ActionsDAG::makeConvertingActions(
actual_header.getColumnsWithTypeAndName(),
expected_header.getColumnsWithTypeAndName(),
ActionsDAG::MatchColumnsMode::Name,
true);
auto & step = chain.lastStep(query_analyzer.aggregated_columns);
auto & actions = step.actions();
actions = ActionsDAG::merge(std::move(*actions), std::move(*converting));
}
}
if (query_analyzer.appendHaving(chain, only_types || !second_stage))
{

View File

@ -220,8 +220,13 @@ bool isStorageTouchedByMutations(
if (all_commands_can_be_skipped)
return false;
/// We must read with one thread because it guarantees that the
/// output stream will be sorted after reading from MergeTree parts.
/// Disable all settings that can enable reading with several streams.
context_copy->setSetting("max_streams_to_max_threads_ratio", 1);
context_copy->setSetting("max_threads", 1);
context_copy->setSetting("allow_asynchronous_read_from_io_pool_for_merge_tree", false);
context_copy->setSetting("max_streams_for_merge_tree_reading", Field(0));
ASTPtr select_query = prepareQueryAffectedAST(commands, storage, context_copy);

View File

@ -8,6 +8,7 @@
#include <Parsers/DumpASTNode.h>
#include <Common/typeid_cast.h>
#include <Common/StringUtils/StringUtils.h>
#include <Common/BinStringDecodeHelper.h>
#include <Parsers/ASTAsterisk.h>
#include <Parsers/ASTCollation.h>
@ -986,6 +987,38 @@ bool ParserUnsignedInteger::parseImpl(Pos & pos, ASTPtr & node, Expected & expec
return true;
}
inline static bool makeStringLiteral(IParser::Pos & pos, ASTPtr & node, String str)
{
auto literal = std::make_shared<ASTLiteral>(str);
literal->begin = pos;
literal->end = ++pos;
node = literal;
return true;
}
inline static bool makeHexOrBinStringLiteral(IParser::Pos & pos, ASTPtr & node, bool hex, size_t word_size)
{
const char * str_begin = pos->begin + 2;
const char * str_end = pos->end - 1;
if (str_begin == str_end)
return makeStringLiteral(pos, node, "");
PODArray<UInt8> res;
res.resize((pos->size() + word_size) / word_size + 1);
char * res_begin = reinterpret_cast<char *>(res.data());
char * res_pos = res_begin;
if (hex)
{
hexStringDecode(str_begin, str_end, res_pos);
}
else
{
binStringDecode(str_begin, str_end, res_pos);
}
return makeStringLiteral(pos, node, String(reinterpret_cast<char *>(res.data()), (res_pos - res_begin - 1)));
}
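/// Hypothetical examples of the literals handled here: x'48656C6C6F' decodes to the
/// string 'Hello' (word_size 2, hex) and b'01100001' decodes to 'a' (word_size 8, binary).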
bool ParserStringLiteral::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
@ -996,6 +1029,18 @@ bool ParserStringLiteral::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
if (pos->type == TokenType::StringLiteral)
{
if (*pos->begin == 'x' || *pos->begin == 'X')
{
constexpr size_t word_size = 2;
return makeHexOrBinStringLiteral(pos, node, true, word_size);
}
if (*pos->begin == 'b' || *pos->begin == 'B')
{
constexpr size_t word_size = 8;
return makeHexOrBinStringLiteral(pos, node, false, word_size);
}
ReadBufferFromMemory in(pos->begin, pos->size());
try
@ -1022,11 +1067,7 @@ bool ParserStringLiteral::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
s = String(pos->begin + heredoc_size, pos->size() - heredoc_size * 2);
}
auto literal = std::make_shared<ASTLiteral>(s);
literal->begin = pos;
literal->end = ++pos;
node = literal;
return true;
return makeStringLiteral(pos, node, s);
}
template <typename Collection>

View File

@ -1,3 +1,4 @@
#include <cassert>
#include <base/defines.h>
#include <Parsers/Lexer.h>
#include <Common/StringUtils/StringUtils.h>
@ -44,6 +45,36 @@ Token quotedString(const char *& pos, const char * const token_begin, const char
}
}
Token quotedHexOrBinString(const char *& pos, const char * const token_begin, const char * const end)
{
constexpr char quote = '\'';
assert(pos[1] == quote);
bool hex = (*pos == 'x' || *pos == 'X');
pos += 2;
if (hex)
{
while (pos < end && isHexDigit(*pos))
++pos;
}
else
{
pos = find_first_not_symbols<'0', '1'>(pos, end);
}
if (pos >= end || *pos != quote)
{
pos = end;
return Token(TokenType::ErrorSingleQuoteIsNotClosed, token_begin, end);
}
++pos;
return Token(TokenType::StringLiteral, token_begin, pos);
}
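/// E.g. for the input x'AB', the lexer consumes the prefix, the hex digits and the closing
/// quote, and emits a single StringLiteral token spanning the whole x'AB' sequence.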
}
@ -420,6 +451,12 @@ Token Lexer::nextTokenImpl()
return Token(TokenType::DollarSign, token_begin, ++pos);
}
}
if (pos + 2 < end && pos[1] == '\'' && (*pos == 'x' || *pos == 'b' || *pos == 'X' || *pos == 'B'))
{
return quotedHexOrBinString(pos, token_begin, end);
}
if (isWordCharASCII(*pos) || *pos == '$')
{
++pos;

View File

@ -17,7 +17,7 @@ namespace ErrorCodes
namespace DB
{
static bool parseQueryWithOnClusterAndMaybeTable(std::shared_ptr<ASTSystemQuery> & res, IParser::Pos & pos,
[[nodiscard]] static bool parseQueryWithOnClusterAndMaybeTable(std::shared_ptr<ASTSystemQuery> & res, IParser::Pos & pos,
Expected & expected, bool require_table, bool allow_string_literal)
{
/// Better form for user: SYSTEM <ACTION> table ON CLUSTER cluster
@ -71,7 +71,7 @@ enum class SystemQueryTargetType
Disk
};
static bool parseQueryWithOnClusterAndTarget(std::shared_ptr<ASTSystemQuery> & res, IParser::Pos & pos, Expected & expected, SystemQueryTargetType target_type)
[[nodiscard]] static bool parseQueryWithOnClusterAndTarget(std::shared_ptr<ASTSystemQuery> & res, IParser::Pos & pos, Expected & expected, SystemQueryTargetType target_type)
{
/// Better form for user: SYSTEM <ACTION> target_name ON CLUSTER cluster
/// Query rewritten form + form while executing on cluster: SYSTEM <ACTION> ON CLUSTER cluster target_name
@ -136,7 +136,7 @@ static bool parseQueryWithOnClusterAndTarget(std::shared_ptr<ASTSystemQuery> & r
return true;
}
static bool parseQueryWithOnCluster(std::shared_ptr<ASTSystemQuery> & res, IParser::Pos & pos,
[[nodiscard]] static bool parseQueryWithOnCluster(std::shared_ptr<ASTSystemQuery> & res, IParser::Pos & pos,
Expected & expected)
{
String cluster_str;
@ -196,7 +196,8 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
}
case Type::DROP_REPLICA:
{
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
ASTPtr ast;
if (!ParserStringLiteral{}.parse(pos, ast, expected))
@ -239,7 +240,8 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
case Type::RESTART_REPLICA:
case Type::SYNC_REPLICA:
{
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
if (!parseDatabaseAndTableAsAST(pos, expected, res->database, res->table))
return false;
break;
@ -247,7 +249,8 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
case Type::SYNC_DATABASE_REPLICA:
{
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
if (!parseDatabaseAsAST(pos, expected, res->database))
return false;
break;
@ -310,7 +313,8 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
}
else
{
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
if (ParserKeyword{"ON VOLUME"}.ignore(pos, expected))
{
if (!parse_on_volume())
@ -335,13 +339,15 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
case Type::START_REPLICATED_SENDS:
case Type::STOP_REPLICATION_QUEUES:
case Type::START_REPLICATION_QUEUES:
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
parseDatabaseAndTableAsAST(pos, expected, res->database, res->table);
break;
case Type::SUSPEND:
{
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
ASTPtr seconds;
if (!(ParserKeyword{"FOR"}.ignore(pos, expected)
@ -360,7 +366,8 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
ASTPtr ast;
if (path_parser.parse(pos, ast, expected))
res->filesystem_cache_path = ast->as<ASTLiteral>()->value.safeGet<String>();
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
break;
}
case Type::DROP_SCHEMA_CACHE:
@ -397,7 +404,8 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
default:
{
parseQueryWithOnCluster(res, pos, expected);
if (!parseQueryWithOnCluster(res, pos, expected))
return false;
break;
}
}
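The hunks above make the ParserSystemQuery helpers [[nodiscard]] and update every call site to propagate a failed parse instead of silently continuing. A minimal standalone sketch of the pattern (toy parser types, not the real ClickHouse API), showing how the attribute turns a dropped result into a compiler warning:

#include <iostream>
#include <string>

// Toy stand-ins for the real parser types (assumption: simplified for illustration).
struct Pos { const char * p; };

[[nodiscard]] static bool parseKeyword(Pos & pos, const std::string & kw)
{
    // Advance only on a full prefix match; report success to the caller.
    if (std::string(pos.p).rfind(kw, 0) != 0)
        return false;
    pos.p += kw.size();
    return true;
}

static bool parseOnCluster(Pos & pos)
{
    // Before the patch the result of the inner parse could be ignored,
    // leaving `pos` unchanged while the caller continued as if it matched.
    // With [[nodiscard]] the compiler warns unless the result is checked:
    if (!parseKeyword(pos, "ON CLUSTER "))
        return false;
    return parseKeyword(pos, "my_cluster");
}

int main()
{
    Pos pos{"ON CLUSTER my_cluster"};
    std::cout << (parseOnCluster(pos) ? "parsed" : "no match") << '\n';
}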

View File

@ -1,6 +1,5 @@
#include <Processors/Formats/ISchemaReader.h>
#include <Formats/ReadSchemaUtils.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <DataTypes/DataTypeString.h>
#include <Interpreters/parseColumnsListForTableFunction.h>
#include <boost/algorithm/string.hpp>
@ -11,65 +10,29 @@ namespace DB
namespace ErrorCodes
{
extern const int ONLY_NULLS_WHILE_READING_SCHEMA;
extern const int TYPE_MISMATCH;
extern const int INCORRECT_DATA;
extern const int EMPTY_DATA_PASSED;
extern const int BAD_ARGUMENTS;
}
void chooseResultColumnType(
DataTypePtr & type,
DataTypePtr & new_type,
std::function<void(DataTypePtr &, DataTypePtr &)> transform_types_if_needed,
const DataTypePtr & default_type,
const String & column_name,
size_t row)
void checkFinalInferredType(DataTypePtr & type, const String & name, const FormatSettings & settings, const DataTypePtr & default_type, size_t rows_read)
{
if (!type)
{
type = new_type;
return;
}
if (!new_type || type->equals(*new_type))
return;
transform_types_if_needed(type, new_type);
if (type->equals(*new_type))
return;
/// If the new type and the previous type for this column are different,
/// we will use default type if we have it or throw an exception.
if (default_type)
type = default_type;
else
{
throw Exception(
ErrorCodes::TYPE_MISMATCH,
"Automatically defined type {} for column '{}' in row {} differs from type defined by previous rows: {}. "
"You can specify the type for this column using setting schema_inference_hints",
type->getName(),
column_name,
row,
new_type->getName());
}
}
void checkResultColumnTypeAndAppend(NamesAndTypesList & result, DataTypePtr & type, const String & name, const DataTypePtr & default_type, size_t rows_read)
{
if (!type)
if (!checkIfTypeIsComplete(type))
{
if (!default_type)
throw Exception(
ErrorCodes::ONLY_NULLS_WHILE_READING_SCHEMA,
"Cannot determine type for column '{}' by first {} rows of data, most likely this column contains only Nulls or empty "
"Arrays/Maps. You can specify the type for this column using setting schema_inference_hints",
"Arrays/Maps. You can specify the type for this column using setting schema_inference_hints. "
"If your data contains complex JSON objects, try enabling one of the settings allow_experimental_object_type/input_format_json_read_objects_as_strings",
name,
rows_read);
type = default_type;
}
result.emplace_back(name, type);
if (settings.schema_inference_make_columns_nullable)
type = makeNullableRecursively(type);
}
IIRowSchemaReader::IIRowSchemaReader(ReadBuffer & in_, const FormatSettings & format_settings_, DataTypePtr default_type_)
@ -88,6 +51,11 @@ void IIRowSchemaReader::setContext(ContextPtr & context)
}
}
void IIRowSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredTypesIfNeeded(type, new_type, format_settings);
}
IRowSchemaReader::IRowSchemaReader(ReadBuffer & in_, const FormatSettings & format_settings_)
: IIRowSchemaReader(in_, format_settings_), column_names(splitColumnNames(format_settings.column_names_for_schema_inference))
{
@ -160,23 +128,28 @@ NamesAndTypesList IRowSchemaReader::readSchema()
if (new_data_types.size() != data_types.size())
throw Exception(ErrorCodes::INCORRECT_DATA, "Rows have different amount of values");
for (size_t i = 0; i != data_types.size(); ++i)
for (field_index = 0; field_index != data_types.size(); ++field_index)
{
/// Check if we couldn't determine the type of this column in a new row
/// or the type for this column was taken from hints.
if (!new_data_types[i] || hints.contains(column_names[i]))
if (!new_data_types[field_index] || hints.contains(column_names[field_index]))
continue;
auto transform_types_if_needed = [&](DataTypePtr & type, DataTypePtr & new_type){ transformTypesIfNeeded(type, new_type, i); };
chooseResultColumnType(data_types[i], new_data_types[i], transform_types_if_needed, getDefaultType(i), std::to_string(i + 1), rows_read);
chooseResultColumnType(*this, data_types[field_index], new_data_types[field_index], getDefaultType(field_index), std::to_string(field_index + 1), rows_read);
}
}
NamesAndTypesList result;
for (size_t i = 0; i != data_types.size(); ++i)
for (field_index = 0; field_index != data_types.size(); ++field_index)
{
/// Check that we could determine the type of this column.
checkResultColumnTypeAndAppend(result, data_types[i], column_names[i], getDefaultType(i), rows_read);
/// Don't check/change types from hints.
if (!hints.contains(column_names[field_index]))
{
transformFinalTypeIfNeeded(data_types[field_index]);
/// Check that we could determine the type of this column.
checkFinalInferredType(data_types[field_index], column_names[field_index], format_settings, getDefaultType(field_index), rows_read);
}
result.emplace_back(column_names[field_index], data_types[field_index]);
}
return result;
@ -208,11 +181,6 @@ DataTypePtr IRowSchemaReader::getDefaultType(size_t column) const
return nullptr;
}
void IRowSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t)
{
transformInferredTypesIfNeeded(type, new_type, format_settings);
}
IRowWithNamesSchemaReader::IRowWithNamesSchemaReader(ReadBuffer & in_, const FormatSettings & format_settings_, DataTypePtr default_type_)
: IIRowSchemaReader(in_, format_settings_, default_type_)
{
@ -245,7 +213,6 @@ NamesAndTypesList IRowWithNamesSchemaReader::readSchema()
names_order.push_back(name);
}
auto transform_types_if_needed = [&](DataTypePtr & type, DataTypePtr & new_type){ transformTypesIfNeeded(type, new_type); };
for (rows_read = 1; rows_read < max_rows_to_read; ++rows_read)
{
auto new_names_and_types = readRowAndGetNamesAndDataTypes(eof);
@ -277,7 +244,7 @@ NamesAndTypesList IRowWithNamesSchemaReader::readSchema()
continue;
auto & type = it->second;
chooseResultColumnType(type, new_type, transform_types_if_needed, default_type, name, rows_read);
chooseResultColumnType(*this, type, new_type, default_type, name, rows_read);
}
}
@ -285,20 +252,21 @@ NamesAndTypesList IRowWithNamesSchemaReader::readSchema()
if (names_to_types.empty())
throw Exception(ErrorCodes::EMPTY_DATA_PASSED, "Cannot read rows from the data");
NamesAndTypesList result;
NamesAndTypesList result = getStaticNamesAndTypes();
for (auto & name : names_order)
{
auto & type = names_to_types[name];
/// Check that we could determine the type of this column.
checkResultColumnTypeAndAppend(result, type, name, default_type, rows_read);
/// Don't check/change types from hints.
if (!hints.contains(name))
{
transformFinalTypeIfNeeded(type);
/// Check that we could determine the type of this column.
checkFinalInferredType(type, name, format_settings, default_type, rows_read);
}
result.emplace_back(name, type);
}
return result;
}
void IRowWithNamesSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredTypesIfNeeded(type, new_type, format_settings);
}
}
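The schema-reader refactoring above replaces checkResultColumnTypeAndAppend with checkFinalInferredType: once all rows are read, an incomplete type either falls back to the default type or raises, and only then is the column optionally wrapped in Nullable. A simplified standalone model of that control flow (toy types; the real code works on DataTypePtr and checkIfTypeIsComplete):

#include <memory>
#include <stdexcept>
#include <string>

// Toy type model (assumption): a null pointer means the type could not be inferred.
struct Type { std::string name; bool nullable = false; };
using TypePtr = std::shared_ptr<Type>;

static void checkFinalInferredType(TypePtr & type, const std::string & column,
                                   bool make_nullable, const TypePtr & default_type)
{
    if (!type)  // stands in for !checkIfTypeIsComplete(type)
    {
        if (!default_type)
            throw std::runtime_error("Cannot determine type for column '" + column + "'");
        type = default_type;  // fall back instead of failing
    }
    if (make_nullable)  // mirrors settings.schema_inference_make_columns_nullable
        type = std::make_shared<Type>(Type{type->name, /*nullable=*/true});
}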

View File

@ -9,6 +9,11 @@
namespace DB
{
namespace ErrorCodes
{
extern const int TYPE_MISMATCH;
}
/// Base class for schema inference for the data in some specific format.
/// It reads some data from read buffer and try to determine the schema
/// from read data.
@ -45,10 +50,14 @@ public:
bool needContext() const override { return !hints_str.empty(); }
void setContext(ContextPtr & context) override;
virtual void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type);
protected:
void setMaxRowsToRead(size_t max_rows) override { max_rows_to_read = max_rows; }
size_t getNumRowsRead() const override { return rows_read; }
virtual void transformFinalTypeIfNeeded(DataTypePtr &) {}
size_t max_rows_to_read;
size_t rows_read = 0;
DataTypePtr default_type;
@ -82,7 +91,7 @@ protected:
void setColumnNames(const std::vector<String> & names) { column_names = names; }
virtual void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t index);
size_t field_index;
private:
DataTypePtr getDefaultType(size_t column) const;
@ -111,7 +120,10 @@ protected:
/// Set eof = true if can't read more data.
virtual NamesAndTypesList readRowAndGetNamesAndDataTypes(bool & eof) = 0;
virtual void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type);
/// Get special static types that have the same name/type for each row.
/// For example, in JSONObjectEachRow format we have a static column with
/// type String and a name taken from a setting, used for object keys.
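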
virtual NamesAndTypesList getStaticNamesAndTypes() { return {}; }
};
/// Base class for schema inference for formats that don't need any data to
@ -125,16 +137,46 @@ public:
virtual ~IExternalSchemaReader() = default;
};
template <class SchemaReader>
void chooseResultColumnType(
SchemaReader & schema_reader,
DataTypePtr & type,
DataTypePtr & new_type,
std::function<void(DataTypePtr &, DataTypePtr &)> transform_types_if_needed,
const DataTypePtr & default_type,
const String & column_name,
size_t row);
size_t row)
{
if (!type)
{
type = new_type;
return;
}
void checkResultColumnTypeAndAppend(
NamesAndTypesList & result, DataTypePtr & type, const String & name, const DataTypePtr & default_type, size_t rows_read);
if (!new_type || type->equals(*new_type))
return;
schema_reader.transformTypesIfNeeded(type, new_type);
if (type->equals(*new_type))
return;
/// If the new type and the previous type for this column are different,
/// we will use default type if we have it or throw an exception.
if (default_type)
type = default_type;
else
{
throw Exception(
ErrorCodes::TYPE_MISMATCH,
"Automatically defined type {} for column '{}' in row {} differs from type defined by previous rows: {}. "
"You can specify the type for this column using setting schema_inference_hints",
type->getName(),
column_name,
row,
new_type->getName());
}
}
void checkFinalInferredType(DataTypePtr & type, const String & name, const FormatSettings & settings, const DataTypePtr & default_type, size_t rows_read);
Strings splitColumnNames(const String & column_names_str);
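chooseResultColumnType is now a header template parameterized on the schema reader, so per-format type coercion goes through the reader's transformTypesIfNeeded virtual instead of an allocated std::function callback. A reduced sketch of the same shape under toy types (not the ClickHouse classes):

#include <memory>
#include <stdexcept>
#include <string>

struct Type { std::string name; };
using TypePtr = std::shared_ptr<Type>;

template <class SchemaReader>
void chooseResultColumnType(SchemaReader & reader, TypePtr & type, TypePtr & new_type,
                            const TypePtr & default_type, const std::string & column)
{
    if (!type) { type = new_type; return; }
    if (!new_type || type->name == new_type->name) return;
    reader.transformTypesIfNeeded(type, new_type);  // reader-specific coercion, no std::function
    if (type->name == new_type->name) return;
    if (default_type) { type = default_type; return; }
    throw std::runtime_error("Type mismatch in column '" + column + "'");
}

// Any reader exposing transformTypesIfNeeded plugs in directly:
struct ToyReader
{
    void transformTypesIfNeeded(TypePtr & a, TypePtr & b)
    {
        if (a->name == "Int64" && b->name == "Float64") a = b;  // widen Int64 to Float64
    }
};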

View File

@ -3,7 +3,7 @@
#if USE_ARROW
#include <Formats/FormatFactory.h>
#include <Formats/ReadSchemaUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <IO/ReadBufferFromMemory.h>
#include <IO/WriteHelpers.h>
#include <IO/copyData.h>

View File

@ -18,6 +18,7 @@
#include <Columns/ColumnMap.h>
#include <DataTypes/DataTypeString.h>
#include <DataTypes/DataTypeFixedString.h>
#include <DataTypes/DataTypeUUID.h>
#include <DataTypes/DataTypeDateTime64.h>
#include <DataTypes/DataTypeLowCardinality.h>
@ -282,7 +283,7 @@ static void readAndInsertString(ReadBuffer & in, IColumn & column, BSONType bson
}
else if (bson_type == BSONType::OBJECT_ID)
{
readAndInsertStringImpl<is_fixed_string>(in, column, 12);
readAndInsertStringImpl<is_fixed_string>(in, column, BSON_OBJECT_ID_SIZE);
}
else
{
@ -664,7 +665,7 @@ static void skipBSONField(ReadBuffer & in, BSONType type)
}
case BSONType::OBJECT_ID:
{
in.ignore(12);
in.ignore(BSON_OBJECT_ID_SIZE);
break;
}
case BSONType::REGEXP:
@ -677,7 +678,7 @@ static void skipBSONField(ReadBuffer & in, BSONType type)
{
BSONSizeT size;
readBinary(size, in);
in.ignore(size + 12);
in.ignore(size + BSON_DB_POINTER_SIZE);
break;
}
case BSONType::JAVA_SCRIPT_CODE_W_SCOPE:
@ -772,37 +773,41 @@ DataTypePtr BSONEachRowSchemaReader::getDataTypeFromBSONField(BSONType type, boo
case BSONType::DOUBLE:
{
in.ignore(sizeof(Float64));
return makeNullable(std::make_shared<DataTypeFloat64>());
return std::make_shared<DataTypeFloat64>();
}
case BSONType::BOOL:
{
in.ignore(sizeof(UInt8));
return makeNullable(DataTypeFactory::instance().get("Bool"));
return DataTypeFactory::instance().get("Bool");
}
case BSONType::INT64:
{
in.ignore(sizeof(Int64));
return makeNullable(std::make_shared<DataTypeInt64>());
return std::make_shared<DataTypeInt64>();
}
case BSONType::DATETIME:
{
in.ignore(sizeof(Int64));
return makeNullable(std::make_shared<DataTypeDateTime64>(6, "UTC"));
return std::make_shared<DataTypeDateTime64>(6, "UTC");
}
case BSONType::INT32:
{
in.ignore(sizeof(Int32));
return makeNullable(std::make_shared<DataTypeInt32>());
return std::make_shared<DataTypeInt32>();
}
case BSONType::SYMBOL: [[fallthrough]];
case BSONType::JAVA_SCRIPT_CODE: [[fallthrough]];
case BSONType::OBJECT_ID: [[fallthrough]];
case BSONType::STRING:
{
BSONSizeT size;
readBinary(size, in);
in.ignore(size);
return makeNullable(std::make_shared<DataTypeString>());
return std::make_shared<DataTypeString>();
}
case BSONType::OBJECT_ID:
{
in.ignore(BSON_OBJECT_ID_SIZE);
return makeNullable(std::make_shared<DataTypeFixedString>(BSON_OBJECT_ID_SIZE));
}
case BSONType::DOCUMENT:
{
@ -856,10 +861,10 @@ DataTypePtr BSONEachRowSchemaReader::getDataTypeFromBSONField(BSONType type, boo
{
case BSONBinarySubtype::BINARY_OLD: [[fallthrough]];
case BSONBinarySubtype::BINARY:
return makeNullable(std::make_shared<DataTypeString>());
return std::make_shared<DataTypeString>();
case BSONBinarySubtype::UUID_OLD: [[fallthrough]];
case BSONBinarySubtype::UUID:
return makeNullable(std::make_shared<DataTypeUUID>());
return std::make_shared<DataTypeUUID>();
default:
throw Exception(ErrorCodes::UNKNOWN_TYPE, "BSON binary subtype {} is not supported", getBSONBinarySubtypeName(subtype));
}
@ -954,6 +959,7 @@ void registerInputFormatBSONEachRow(FormatFactory & factory)
"BSONEachRow",
[](ReadBuffer & buf, const Block & sample, IRowInputFormat::Params params, const FormatSettings & settings)
{ return std::make_shared<BSONEachRowRowInputFormat>(buf, sample, std::move(params), settings); });
factory.registerFileExtension("bson", "BSONEachRow");
}
void registerFileSegmentationEngineBSONEachRow(FormatFactory & factory)
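Two mechanical themes in the BSON hunks: the magic number 12 becomes the named BSON_OBJECT_ID_SIZE/BSON_DB_POINTER_SIZE constants, and leaf inference now returns plain (non-Nullable) types, with nullability applied once at the end of inference. A small standalone sketch of the skip logic under such named constants (toy buffer handling, not the real ReadBuffer API; little-endian size prefix assumed):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Sizes from the BSON spec; the patch names them instead of writing 12 inline.
constexpr std::size_t BSON_OBJECT_ID_SIZE = 12;
constexpr std::size_t BSON_DB_POINTER_SIZE = BSON_OBJECT_ID_SIZE;  // a DBPointer ends with an ObjectId

// Toy skip over a DBPointer: a length-prefixed string followed by an ObjectId.
static const char * skipDBPointer(const char * p)
{
    std::uint32_t size;                   // stands in for BSONSizeT
    std::memcpy(&size, p, sizeof(size));  // read the length prefix
    return p + sizeof(size) + size + BSON_DB_POINTER_SIZE;
}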

View File

@ -274,15 +274,15 @@ void CSVFormatReader::skipPrefixBeforeHeader()
}
CSVSchemaReader::CSVSchemaReader(ReadBuffer & in_, bool with_names_, bool with_types_, const FormatSettings & format_setting_)
CSVSchemaReader::CSVSchemaReader(ReadBuffer & in_, bool with_names_, bool with_types_, const FormatSettings & format_settings_)
: FormatWithNamesAndTypesSchemaReader(
in_,
format_setting_,
format_settings_,
with_names_,
with_types_,
&reader,
getDefaultDataTypeForEscapingRule(FormatSettings::EscapingRule::CSV))
, reader(in_, format_setting_)
, reader(in_, format_settings_)
{
}
@ -293,7 +293,7 @@ DataTypes CSVSchemaReader::readRowAndGetDataTypes()
return {};
auto fields = reader.readRow();
return determineDataTypesByEscapingRule(fields, reader.getFormatSettings(), FormatSettings::EscapingRule::CSV);
return tryInferDataTypesByEscapingRule(fields, reader.getFormatSettings(), FormatSettings::EscapingRule::CSV);
}

View File

@ -75,7 +75,7 @@ public:
class CSVSchemaReader : public FormatWithNamesAndTypesSchemaReader
{
public:
CSVSchemaReader(ReadBuffer & in_, bool with_names_, bool with_types_, const FormatSettings & format_setting_);
CSVSchemaReader(ReadBuffer & in_, bool with_names_, bool with_types_, const FormatSettings & format_settings_);
private:
DataTypes readRowAndGetDataTypes() override;

View File

@ -99,6 +99,12 @@ static void insertSignedInteger(IColumn & column, const DataTypePtr & column_typ
case TypeIndex::DateTime64:
assert_cast<ColumnDecimal<DateTime64> &>(column).insertValue(value);
break;
case TypeIndex::Decimal32:
assert_cast<ColumnDecimal<Decimal32> &>(column).insertValue(static_cast<Int32>(value));
break;
case TypeIndex::Decimal64:
assert_cast<ColumnDecimal<Decimal64> &>(column).insertValue(value);
break;
default:
throw Exception(ErrorCodes::LOGICAL_ERROR, "Column type is not a signed integer.");
}
@ -178,14 +184,14 @@ static void insertEnum(IColumn & column, const DataTypePtr & column_type, const
}
}
static void insertValue(IColumn & column, const DataTypePtr & column_type, const capnp::DynamicValue::Reader & value, FormatSettings::EnumComparingMode enum_comparing_mode)
static void insertValue(IColumn & column, const DataTypePtr & column_type, const String & column_name, const capnp::DynamicValue::Reader & value, FormatSettings::EnumComparingMode enum_comparing_mode)
{
if (column_type->lowCardinality())
{
auto & lc_column = assert_cast<ColumnLowCardinality &>(column);
auto tmp_column = lc_column.getDictionary().getNestedColumn()->cloneEmpty();
auto dict_type = assert_cast<const DataTypeLowCardinality *>(column_type.get())->getDictionaryType();
insertValue(*tmp_column, dict_type, value, enum_comparing_mode);
insertValue(*tmp_column, dict_type, column_name, value, enum_comparing_mode);
lc_column.insertFromFullColumn(*tmp_column, 0);
return;
}
@ -226,7 +232,7 @@ static void insertValue(IColumn & column, const DataTypePtr & column_type, const
auto & nested_column = column_array.getData();
auto nested_type = assert_cast<const DataTypeArray *>(column_type.get())->getNestedType();
for (const auto & nested_value : list_value)
insertValue(nested_column, nested_type, nested_value, enum_comparing_mode);
insertValue(nested_column, nested_type, column_name, nested_value, enum_comparing_mode);
break;
}
case capnp::DynamicValue::Type::STRUCT:
@ -243,11 +249,11 @@ static void insertValue(IColumn & column, const DataTypePtr & column_type, const
auto & nested_column = nullable_column.getNestedColumn();
auto nested_type = assert_cast<const DataTypeNullable *>(column_type.get())->getNestedType();
auto nested_value = struct_value.get(field);
insertValue(nested_column, nested_type, nested_value, enum_comparing_mode);
insertValue(nested_column, nested_type, column_name, nested_value, enum_comparing_mode);
nullable_column.getNullMapData().push_back(0);
}
}
else
else if (isTuple(column_type))
{
auto & tuple_column = assert_cast<ColumnTuple &>(column);
const auto * tuple_type = assert_cast<const DataTypeTuple *>(column_type.get());
@ -255,9 +261,16 @@ static void insertValue(IColumn & column, const DataTypePtr & column_type, const
insertValue(
tuple_column.getColumn(i),
tuple_type->getElements()[i],
tuple_type->getElementNames()[i],
struct_value.get(tuple_type->getElementNames()[i]),
enum_comparing_mode);
}
else
{
/// It can be a nested column from a Nested type.
auto [field_name, nested_name] = splitCapnProtoFieldName(column_name);
insertValue(column, column_type, nested_name, struct_value.get(nested_name), enum_comparing_mode);
}
break;
}
default:
@ -278,7 +291,7 @@ bool CapnProtoRowInputFormat::readRow(MutableColumns & columns, RowReadExtension
for (size_t i = 0; i != columns.size(); ++i)
{
auto value = getReaderByColumnName(root_reader, column_names[i]);
insertValue(*columns[i], column_types[i], value, format_settings.capn_proto.enum_comparing_mode);
insertValue(*columns[i], column_types[i], column_names[i], value, format_settings.capn_proto.enum_comparing_mode);
}
}
catch (const kj::Exception & e)
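The CapnProto input changes thread the ClickHouse column name down the recursion so a flattened Nested column such as n.x can be matched against a field inside a Cap'n Proto struct. A toy model of the split performed by splitCapnProtoFieldName (hypothetical simplification; the real helper's exact contract may differ):

#include <string>
#include <utility>

// Toy split: "n.x" -> {"n", "x"}; a plain name has an empty nested part.
static std::pair<std::string, std::string> splitFieldName(const std::string & name)
{
    auto pos = name.find('.');
    if (pos == std::string::npos)
        return {name, {}};
    return {name.substr(0, pos), name.substr(pos + 1)};
}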

View File

@ -92,6 +92,7 @@ static std::optional<capnp::DynamicValue::Reader> convertToDynamicValue(
const ColumnPtr & column,
const DataTypePtr & data_type,
size_t row_num,
const String & column_name,
capnp::DynamicValue::Builder builder,
FormatSettings::EnumComparingMode enum_comparing_mode,
std::vector<std::unique_ptr<String>> & temporary_text_data_storage)
@ -103,15 +104,12 @@ static std::optional<capnp::DynamicValue::Reader> convertToDynamicValue(
const auto * lc_column = assert_cast<const ColumnLowCardinality *>(column.get());
const auto & dict_type = assert_cast<const DataTypeLowCardinality *>(data_type.get())->getDictionaryType();
size_t index = lc_column->getIndexAt(row_num);
return convertToDynamicValue(lc_column->getDictionary().getNestedColumn(), dict_type, index, builder, enum_comparing_mode, temporary_text_data_storage);
return convertToDynamicValue(lc_column->getDictionary().getNestedColumn(), dict_type, index, column_name, builder, enum_comparing_mode, temporary_text_data_storage);
}
switch (builder.getType())
{
case capnp::DynamicValue::Type::INT:
/// We allow output DateTime64 as Int64.
if (WhichDataType(data_type).isDateTime64())
return capnp::DynamicValue::Reader(assert_cast<const ColumnDecimal<DateTime64> *>(column.get())->getElement(row_num));
return capnp::DynamicValue::Reader(column->getInt(row_num));
case capnp::DynamicValue::Type::UINT:
return capnp::DynamicValue::Reader(column->getUInt(row_num));
@ -150,7 +148,7 @@ static std::optional<capnp::DynamicValue::Reader> convertToDynamicValue(
{
auto struct_builder = builder.as<capnp::DynamicStruct>();
auto nested_struct_schema = struct_builder.getSchema();
/// Struct can be represent Tuple or Naullable (named union with two fields)
/// Struct can represent Tuple, Nullable (named union with two fields) or a single column when it contains one nested column.
if (data_type->isNullable())
{
const auto * nullable_type = assert_cast<const DataTypeNullable *>(data_type.get());
@ -167,12 +165,12 @@ static std::optional<capnp::DynamicValue::Reader> convertToDynamicValue(
struct_builder.clear(value_field);
const auto & nested_column = nullable_column->getNestedColumnPtr();
auto value_builder = initStructFieldBuilder(nested_column, row_num, struct_builder, value_field);
auto value = convertToDynamicValue(nested_column, nullable_type->getNestedType(), row_num, value_builder, enum_comparing_mode, temporary_text_data_storage);
auto value = convertToDynamicValue(nested_column, nullable_type->getNestedType(), row_num, column_name, value_builder, enum_comparing_mode, temporary_text_data_storage);
if (value)
struct_builder.set(value_field, *value);
}
}
else
else if (isTuple(data_type))
{
const auto * tuple_data_type = assert_cast<const DataTypeTuple *>(data_type.get());
auto nested_types = tuple_data_type->getElements();
@ -182,11 +180,21 @@ static std::optional<capnp::DynamicValue::Reader> convertToDynamicValue(
auto pos = tuple_data_type->getPositionByName(name);
auto field_builder
= initStructFieldBuilder(nested_columns[pos], row_num, struct_builder, nested_struct_schema.getFieldByName(name));
auto value = convertToDynamicValue(nested_columns[pos], nested_types[pos], row_num, field_builder, enum_comparing_mode, temporary_text_data_storage);
auto value = convertToDynamicValue(nested_columns[pos], nested_types[pos], row_num, column_name, field_builder, enum_comparing_mode, temporary_text_data_storage);
if (value)
struct_builder.set(name, *value);
}
}
else
{
/// It can be a nested column from a Nested type.
auto [field_name, nested_name] = splitCapnProtoFieldName(column_name);
auto nested_field = nested_struct_schema.getFieldByName(nested_name);
auto field_builder = initStructFieldBuilder(column, row_num, struct_builder, nested_field);
auto value = convertToDynamicValue(column, data_type, row_num, nested_name, field_builder, enum_comparing_mode, temporary_text_data_storage);
if (value)
struct_builder.set(nested_field, *value);
}
return std::nullopt;
}
case capnp::DynamicValue::Type::LIST:
@ -213,7 +221,7 @@ static std::optional<capnp::DynamicValue::Reader> convertToDynamicValue(
else
value_builder = list_builder[i];
auto value = convertToDynamicValue(nested_column, nested_type, offset + i, value_builder, enum_comparing_mode, temporary_text_data_storage);
auto value = convertToDynamicValue(nested_column, nested_type, offset + i, column_name, value_builder, enum_comparing_mode, temporary_text_data_storage);
if (value)
list_builder.set(i, *value);
}
@ -231,11 +239,19 @@ void CapnProtoRowOutputFormat::write(const Columns & columns, size_t row_num)
/// See comment in convertToDynamicValue() for more details.
std::vector<std::unique_ptr<String>> temporary_text_data_storage;
capnp::DynamicStruct::Builder root = message.initRoot<capnp::DynamicStruct>(schema);
/// Some columns can share the same field builder, for example when we have a
/// column with Nested type that was flattened into several columns.
std::unordered_map<size_t, capnp::DynamicValue::Builder> field_builders;
for (size_t i = 0; i != columns.size(); ++i)
{
auto [struct_builder, field] = getStructBuilderAndFieldByColumnName(root, column_names[i]);
auto field_builder = initStructFieldBuilder(columns[i], row_num, struct_builder, field);
auto value = convertToDynamicValue(columns[i], column_types[i], row_num, field_builder, format_settings.capn_proto.enum_comparing_mode, temporary_text_data_storage);
if (!field_builders.contains(field.getIndex()))
{
auto field_builder = initStructFieldBuilder(columns[i], row_num, struct_builder, field);
field_builders[field.getIndex()] = field_builder;
}
auto value = convertToDynamicValue(columns[i], column_types[i], row_num, column_names[i], field_builders[field.getIndex()], format_settings.capn_proto.enum_comparing_mode, temporary_text_data_storage);
if (value)
struct_builder.set(field, *value);
}
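On the output side, several flattened columns of one Nested group map to the same Cap'n Proto field, so the writer caches one builder per field index instead of re-initializing the builder for each column, which would discard data already written by a sibling column. A reduced sketch of the caching pattern:

#include <cstddef>
#include <string>
#include <unordered_map>

struct Builder { std::string state; };  // stand-in for capnp::DynamicValue::Builder

// Initialize a field's builder once; later columns of the same Nested group reuse it.
static Builder & getOrInitBuilder(std::unordered_map<std::size_t, Builder> & builders,
                                  std::size_t field_index)
{
    auto [it, inserted] = builders.try_emplace(field_index);
    if (inserted)
        it->second.state = "initialized";  // stands in for initStructFieldBuilder()
    return it->second;
}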

View File

@ -1,6 +1,7 @@
#include <Processors/Formats/Impl/CustomSeparatedRowInputFormat.h>
#include <Processors/Formats/Impl/TemplateRowInputFormat.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Formats/registerWithNamesAndTypes.h>
#include <IO/Operators.h>
@ -370,12 +371,12 @@ DataTypes CustomSeparatedSchemaReader::readRowAndGetDataTypes()
first_row = false;
auto fields = reader.readRow();
return determineDataTypesByEscapingRule(fields, reader.getFormatSettings(), reader.getEscapingRule());
return tryInferDataTypesByEscapingRule(fields, reader.getFormatSettings(), reader.getEscapingRule(), &json_inference_info);
}
void CustomSeparatedSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t)
void CustomSeparatedSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredTypesIfNeeded(type, new_type, format_settings, reader.getEscapingRule());
transformInferredTypesByEscapingRuleIfNeeded(type, new_type, format_settings, reader.getEscapingRule(), &json_inference_info);
}
void registerInputFormatCustomSeparated(FormatFactory & factory)

View File

@ -2,6 +2,7 @@
#include <Processors/Formats/RowInputFormatWithNamesAndTypes.h>
#include <Formats/ParsedTemplateFormatString.h>
#include <Formats/SchemaInferenceUtils.h>
#include <IO/PeekableReadBuffer.h>
#include <IO/ReadHelpers.h>
@ -100,11 +101,12 @@ public:
private:
DataTypes readRowAndGetDataTypes() override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t) override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type) override;
PeekableReadBuffer buf;
CustomSeparatedFormatReader reader;
bool first_row = true;
JSONInferenceInfo json_inference_info;
};
}

View File

@ -2,8 +2,11 @@
#include <Processors/Formats/ISchemaReader.h>
#include <Formats/JSONUtils.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Interpreters/parseColumnsListForTableFunction.h>
#include <IO/ReadHelpers.h>
#include <base/find_symbols.h>
#include <Common/logger_useful.h>
namespace DB
{
@ -170,19 +173,25 @@ JSONColumnsSchemaReaderBase::JSONColumnsSchemaReaderBase(
ReadBuffer & in_, const FormatSettings & format_settings_, std::unique_ptr<JSONColumnsReaderBase> reader_)
: ISchemaReader(in_)
, format_settings(format_settings_)
, hints_str(format_settings_.schema_inference_hints)
, reader(std::move(reader_))
, column_names_from_settings(splitColumnNames(format_settings_.column_names_for_schema_inference))
{
}
void JSONColumnsSchemaReaderBase::chooseResulType(DataTypePtr & type, DataTypePtr & new_type, const String & column_name, size_t row) const
void JSONColumnsSchemaReaderBase::setContext(ContextPtr & ctx)
{
auto convert_types_if_needed = [&](DataTypePtr & first, DataTypePtr & second)
ColumnsDescription columns;
if (tryParseColumnsListFromString(hints_str, columns, ctx))
{
DataTypes types = {first, second};
transformInferredJSONTypesIfNeeded(types, format_settings);
};
chooseResultColumnType(type, new_type, convert_types_if_needed, nullptr, column_name, row);
for (const auto & [name, type] : columns.getAll())
hints[name] = type;
}
}
void JSONColumnsSchemaReaderBase::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredJSONTypesIfNeeded(type, new_type, format_settings, &inference_info);
}
NamesAndTypesList JSONColumnsSchemaReaderBase::readSchema()
@ -220,9 +229,18 @@ NamesAndTypesList JSONColumnsSchemaReaderBase::readSchema()
if (!names_to_types.contains(column_name))
names_order.push_back(column_name);
rows_in_block = 0;
auto column_type = readColumnAndGetDataType(column_name, rows_in_block, format_settings.max_rows_to_read_for_schema_inference - total_rows_read);
chooseResulType(names_to_types[column_name], column_type, column_name, total_rows_read + 1);
if (const auto it = hints.find(column_name); it != hints.end())
{
names_to_types[column_name] = it->second;
}
else
{
rows_in_block = 0;
auto column_type = readColumnAndGetDataType(
column_name, rows_in_block, format_settings.max_rows_to_read_for_schema_inference - total_rows_read);
chooseResultColumnType(*this, names_to_types[column_name], column_type, nullptr, column_name, total_rows_read + 1);
}
++iteration;
}
while (!reader->checkChunkEndOrSkipColumnDelimiter());
@ -237,8 +255,14 @@ NamesAndTypesList JSONColumnsSchemaReaderBase::readSchema()
for (auto & name : names_order)
{
auto & type = names_to_types[name];
/// Check that we could determine the type of this column.
checkResultColumnTypeAndAppend(result, type, name, nullptr, format_settings.max_rows_to_read_for_schema_inference);
/// Don't check/change types from hints.
if (!hints.contains(name))
{
transformJSONTupleToArrayIfPossible(type, format_settings, &inference_info);
/// Check that we could determine the type of this column.
checkFinalInferredType(type, name, format_settings, nullptr, format_settings.max_rows_to_read_for_schema_inference);
}
result.emplace_back(name, type);
}
return result;
@ -262,8 +286,8 @@ DataTypePtr JSONColumnsSchemaReaderBase::readColumnAndGetDataType(const String &
}
readJSONField(field, in);
DataTypePtr field_type = JSONUtils::getDataTypeFromField(field, format_settings);
chooseResulType(column_type, field_type, column_name, rows_read);
DataTypePtr field_type = tryInferDataTypeForSingleJSONField(field, format_settings, &inference_info);
chooseResultColumnType(*this, column_type, field_type, nullptr, column_name, rows_read);
++rows_read;
}
while (!reader->checkColumnEndOrSkipFieldDelimiter());
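JSONColumnsSchemaReaderBase now honors schema_inference_hints: the hint string is parsed once in setContext into a name-to-type map, and a hinted column skips inference entirely. A standalone sketch of the lookup-or-infer split (toy representation; the real code parses hints via tryParseColumnsListFromString):

#include <functional>
#include <string>
#include <unordered_map>

using Hints = std::unordered_map<std::string, std::string>;  // column name -> type name (toy)

// Hinted columns are taken verbatim; everything else is inferred from the data.
static std::string resolveColumnType(const Hints & hints, const std::string & name,
                                     const std::function<std::string()> & infer)
{
    if (auto it = hints.find(name); it != hints.end())
        return it->second;  // hinted: trust the user, skip reading the column
    return infer();         // otherwise infer from up to max_rows_to_read rows
}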

View File

@ -1,6 +1,7 @@
#pragma once
#include <Formats/FormatSettings.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Processors/Formats/IInputFormat.h>
#include <Processors/Formats/ISchemaReader.h>
@ -76,18 +77,23 @@ class JSONColumnsSchemaReaderBase : public ISchemaReader
public:
JSONColumnsSchemaReaderBase(ReadBuffer & in_, const FormatSettings & format_settings_, std::unique_ptr<JSONColumnsReaderBase> reader_);
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type);
bool needContext() const override { return !hints_str.empty(); }
void setContext(ContextPtr & ctx) override;
private:
NamesAndTypesList readSchema() override;
/// Read whole column in the block (up to max_rows_to_read rows) and extract the data type.
DataTypePtr readColumnAndGetDataType(const String & column_name, size_t & rows_read, size_t max_rows_to_read);
/// Choose result type for column from two inferred types from different rows.
void chooseResulType(DataTypePtr & type, DataTypePtr & new_type, const String & column_name, size_t row) const;
const FormatSettings format_settings;
String hints_str;
std::unordered_map<String, DataTypePtr> hints;
std::unique_ptr<JSONColumnsReaderBase> reader;
Names column_names_from_settings;
JSONInferenceInfo inference_info;
};
}

View File

@ -7,6 +7,7 @@
#include <Formats/verbosePrintString.h>
#include <Formats/JSONUtils.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Formats/registerWithNamesAndTypes.h>
#include <DataTypes/NestedUtils.h>
#include <DataTypes/Serializations/SerializationNullable.h>
@ -202,12 +203,17 @@ DataTypes JSONCompactEachRowRowSchemaReader::readRowAndGetDataTypes()
if (in.eof())
return {};
return JSONUtils::readRowAndGetDataTypesForJSONCompactEachRow(in, format_settings, reader.yieldStrings());
return JSONUtils::readRowAndGetDataTypesForJSONCompactEachRow(in, format_settings, &inference_info);
}
void JSONCompactEachRowRowSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t)
void JSONCompactEachRowRowSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredJSONTypesIfNeeded(type, new_type, format_settings);
transformInferredJSONTypesIfNeeded(type, new_type, format_settings, &inference_info);
}
void JSONCompactEachRowRowSchemaReader::transformFinalTypeIfNeeded(DataTypePtr & type)
{
transformJSONTupleToArrayIfPossible(type, format_settings, &inference_info);
}
void registerInputFormatJSONCompactEachRow(FormatFactory & factory)

View File

@ -4,6 +4,7 @@
#include <Processors/Formats/RowInputFormatWithNamesAndTypes.h>
#include <Processors/Formats/ISchemaReader.h>
#include <Formats/FormatSettings.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Common/HashTable/HashMap.h>
namespace DB
@ -80,10 +81,12 @@ public:
private:
DataTypes readRowAndGetDataTypes() override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t) override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type) override;
void transformFinalTypeIfNeeded(DataTypePtr & type) override;
JSONCompactEachRowFormatReader reader;
bool first_row = true;
JSONInferenceInfo inference_info;
};
}

View File

@ -4,6 +4,7 @@
#include <Processors/Formats/Impl/JSONEachRowRowInputFormat.h>
#include <Formats/JSONUtils.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Formats/FormatFactory.h>
#include <DataTypes/NestedUtils.h>
#include <DataTypes/Serializations/SerializationNullable.h>
@ -300,9 +301,8 @@ void JSONEachRowRowInputFormat::readSuffix()
assertEOF(*in);
}
JSONEachRowSchemaReader::JSONEachRowSchemaReader(ReadBuffer & in_, bool json_strings_, const FormatSettings & format_settings_)
JSONEachRowSchemaReader::JSONEachRowSchemaReader(ReadBuffer & in_, const FormatSettings & format_settings_)
: IRowWithNamesSchemaReader(in_, format_settings_)
, json_strings(json_strings_)
{
}
@ -336,12 +336,17 @@ NamesAndTypesList JSONEachRowSchemaReader::readRowAndGetNamesAndDataTypes(bool &
return {};
}
return JSONUtils::readRowAndGetNamesAndDataTypesForJSONEachRow(in, format_settings, json_strings);
return JSONUtils::readRowAndGetNamesAndDataTypesForJSONEachRow(in, format_settings, &inference_info);
}
void JSONEachRowSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredJSONTypesIfNeeded(type, new_type, format_settings);
transformInferredJSONTypesIfNeeded(type, new_type, format_settings, &inference_info);
}
void JSONEachRowSchemaReader::transformFinalTypeIfNeeded(DataTypePtr & type)
{
transformJSONTupleToArrayIfPossible(type, format_settings, &inference_info);
}
void registerInputFormatJSONEachRow(FormatFactory & factory)
@ -391,11 +396,11 @@ void registerNonTrivialPrefixAndSuffixCheckerJSONEachRow(FormatFactory & factory
void registerJSONEachRowSchemaReader(FormatFactory & factory)
{
auto register_schema_reader = [&](const String & format_name, bool json_strings)
auto register_schema_reader = [&](const String & format_name)
{
factory.registerSchemaReader(format_name, [json_strings](ReadBuffer & buf, const FormatSettings & settings)
factory.registerSchemaReader(format_name, [](ReadBuffer & buf, const FormatSettings & settings)
{
return std::make_unique<JSONEachRowSchemaReader>(buf, json_strings, settings);
return std::make_unique<JSONEachRowSchemaReader>(buf, settings);
});
factory.registerAdditionalInfoForSchemaCacheGetter(format_name, [](const FormatSettings & settings)
{
@ -403,10 +408,10 @@ void registerJSONEachRowSchemaReader(FormatFactory & factory)
});
};
register_schema_reader("JSONEachRow", false);
register_schema_reader("JSONLines", false);
register_schema_reader("NDJSON", false);
register_schema_reader("JSONStringsEachRow", true);
register_schema_reader("JSONEachRow");
register_schema_reader("JSONLines");
register_schema_reader("NDJSON");
register_schema_reader("JSONStringsEachRow");
}
}

View File

@ -4,6 +4,7 @@
#include <Processors/Formats/IRowInputFormat.h>
#include <Processors/Formats/ISchemaReader.h>
#include <Formats/FormatSettings.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Common/HashTable/HashMap.h>
@ -94,15 +95,16 @@ protected:
class JSONEachRowSchemaReader : public IRowWithNamesSchemaReader
{
public:
JSONEachRowSchemaReader(ReadBuffer & in_, bool json_strings, const FormatSettings & format_settings_);
JSONEachRowSchemaReader(ReadBuffer & in_, const FormatSettings & format_settings_);
private:
NamesAndTypesList readRowAndGetNamesAndDataTypes(bool & eof) override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type) override;
void transformFinalTypeIfNeeded(DataTypePtr & type) override;
bool json_strings;
bool first_row = true;
bool data_in_square_brackets = false;
JSONInferenceInfo inference_info;
};
}

View File

@ -2,6 +2,7 @@
#include <Formats/JSONUtils.h>
#include <Formats/FormatFactory.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <DataTypes/DataTypeString.h>
namespace DB
@ -85,15 +86,25 @@ NamesAndTypesList JSONObjectEachRowSchemaReader::readRowAndGetNamesAndDataTypes(
JSONUtils::skipComma(in);
JSONUtils::readFieldName(in);
auto names_and_types = JSONUtils::readRowAndGetNamesAndDataTypesForJSONEachRow(in, format_settings, false);
return JSONUtils::readRowAndGetNamesAndDataTypesForJSONEachRow(in, format_settings, &inference_info);
}
NamesAndTypesList JSONObjectEachRowSchemaReader::getStaticNamesAndTypes()
{
if (!format_settings.json_object_each_row.column_for_object_name.empty())
names_and_types.emplace_front(format_settings.json_object_each_row.column_for_object_name, std::make_shared<DataTypeString>());
return names_and_types;
return {{format_settings.json_object_each_row.column_for_object_name, std::make_shared<DataTypeString>()}};
return {};
}
void JSONObjectEachRowSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredJSONTypesIfNeeded(type, new_type, format_settings);
transformInferredJSONTypesIfNeeded(type, new_type, format_settings, &inference_info);
}
void JSONObjectEachRowSchemaReader::transformFinalTypeIfNeeded(DataTypePtr & type)
{
transformJSONTupleToArrayIfPossible(type, format_settings, &inference_info);
}
void registerInputFormatJSONObjectEachRow(FormatFactory & factory)

View File

@ -4,6 +4,7 @@
#include <Processors/Formats/Impl/JSONEachRowRowInputFormat.h>
#include <Processors/Formats/ISchemaReader.h>
#include <Formats/FormatSettings.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Common/HashTable/HashMap.h>
@ -41,9 +42,12 @@ public:
private:
NamesAndTypesList readRowAndGetNamesAndDataTypes(bool & eof) override;
NamesAndTypesList getStaticNamesAndTypes() override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type) override;
void transformFinalTypeIfNeeded(DataTypePtr & type) override;
bool first_row = true;
JSONInferenceInfo inference_info;
};
std::optional<size_t> getColumnIndexForJSONObjectEachRowObjectName(const Block & header, const FormatSettings & settings);

View File

@ -247,6 +247,14 @@ static void insertNull(IColumn & column, DataTypePtr type)
static void insertUUID(IColumn & column, DataTypePtr type, const char * value, size_t size)
{
auto insert_func = [&](IColumn & column_, DataTypePtr type_)
{
insertUUID(column_, type_, value, size);
};
if (checkAndInsertNullable(column, type, insert_func) || checkAndInsertLowCardinality(column, type, insert_func))
return;
if (!isUUID(type))
throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert MessagePack UUID into column with type {}.", type->getName());
ReadBufferFromMemory buf(value, size);
@ -470,16 +478,16 @@ DataTypePtr MsgPackSchemaReader::getDataType(const msgpack::object & object)
{
case msgpack::type::object_type::POSITIVE_INTEGER: [[fallthrough]];
case msgpack::type::object_type::NEGATIVE_INTEGER:
return makeNullable(std::make_shared<DataTypeInt64>());
return std::make_shared<DataTypeInt64>();
case msgpack::type::object_type::FLOAT32:
return makeNullable(std::make_shared<DataTypeFloat32>());
return std::make_shared<DataTypeFloat32>();
case msgpack::type::object_type::FLOAT64:
return makeNullable(std::make_shared<DataTypeFloat64>());
return std::make_shared<DataTypeFloat64>();
case msgpack::type::object_type::BOOLEAN:
return makeNullable(std::make_shared<DataTypeUInt8>());
return std::make_shared<DataTypeUInt8>();
case msgpack::type::object_type::BIN: [[fallthrough]];
case msgpack::type::object_type::STR:
return makeNullable(std::make_shared<DataTypeString>());
return std::make_shared<DataTypeString>();
case msgpack::type::object_type::ARRAY:
{
msgpack::object_array object_array = object.via.array;

View File

@ -435,7 +435,7 @@ DataTypes MySQLDumpSchemaReader::readRowAndGetDataTypes()
skipFieldDelimiter(in);
readQuotedField(value, in);
auto type = determineDataTypeByEscapingRule(value, format_settings, FormatSettings::EscapingRule::Quoted);
auto type = tryInferDataTypeByEscapingRule(value, format_settings, FormatSettings::EscapingRule::Quoted);
data_types.push_back(std::move(type));
}
skipEndOfRow(in, table_name);

View File

@ -3,7 +3,7 @@
#if USE_ORC
#include <Formats/FormatFactory.h>
#include <Formats/ReadSchemaUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <IO/ReadBufferFromMemory.h>
#include <IO/WriteHelpers.h>
#include <IO/copyData.h>
@ -101,7 +101,7 @@ static size_t countIndicesForType(std::shared_ptr<arrow::DataType> type)
if (type->id() == arrow::Type::MAP)
{
auto * map_type = static_cast<arrow::MapType *>(type.get());
return countIndicesForType(map_type->key_type()) + countIndicesForType(map_type->item_type());
return countIndicesForType(map_type->key_type()) + countIndicesForType(map_type->item_type()) + 1;
}
return 1;
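The ORC fix counts one extra index for Map columns: Arrow models a map as a list of key/value structs, so the container itself occupies an index slot in addition to the key and value subtrees. A tiny recursive model of the corrected count (toy type tree, not the arrow API):

#include <cstddef>
#include <memory>

// Toy type tree: either a primitive or a map with key/value subtrees.
struct ToyType
{
    enum Kind { Primitive, Map } kind = Primitive;
    std::shared_ptr<ToyType> key, value;  // set when kind == Map
};

static std::size_t countIndices(const ToyType & t)
{
    if (t.kind == ToyType::Map)
        // key subtree + value subtree + 1 for the map node itself (the fix).
        return countIndices(*t.key) + countIndices(*t.value) + 1;
    return 1;
}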

View File

@ -4,7 +4,7 @@
#if USE_PARQUET
#include <Formats/FormatFactory.h>
#include <Formats/ReadSchemaUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <IO/ReadBufferFromMemory.h>
#include <IO/copyData.h>
#include <arrow/api.h>

View File

@ -3,6 +3,7 @@
#include <Processors/Formats/Impl/RegexpRowInputFormat.h>
#include <DataTypes/Serializations/SerializationNullable.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <Formats/newLineSegmentationEngine.h>
#include <IO/ReadHelpers.h>
@ -155,15 +156,15 @@ DataTypes RegexpSchemaReader::readRowAndGetDataTypes()
for (size_t i = 0; i != field_extractor.getMatchedFieldsSize(); ++i)
{
String field(field_extractor.getField(i));
data_types.push_back(determineDataTypeByEscapingRule(field, format_settings, format_settings.regexp.escaping_rule));
data_types.push_back(tryInferDataTypeByEscapingRule(field, format_settings, format_settings.regexp.escaping_rule, &json_inference_info));
}
return data_types;
}
void RegexpSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t)
void RegexpSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredTypesIfNeeded(type, new_type, format_settings, format_settings.regexp.escaping_rule);
transformInferredTypesByEscapingRuleIfNeeded(type, new_type, format_settings, format_settings.regexp.escaping_rule, &json_inference_info);
}

View File

@ -5,12 +5,13 @@
#include <string>
#include <vector>
#include <Core/Block.h>
#include <IO/PeekableReadBuffer.h>
#include <Processors/Formats/IRowInputFormat.h>
#include <Processors/Formats/ISchemaReader.h>
#include <Formats/FormatSettings.h>
#include <Formats/FormatFactory.h>
#include <IO/PeekableReadBuffer.h>
#include <Formats/ParsedTemplateFormatString.h>
#include <Formats/SchemaInferenceUtils.h>
namespace DB
@ -81,12 +82,13 @@ public:
private:
DataTypes readRowAndGetDataTypes() override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t) override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type) override;
using EscapingRule = FormatSettings::EscapingRule;
RegexpFieldExtractor field_extractor;
PeekableReadBuffer buf;
JSONInferenceInfo json_inference_info;
};
}

View File

@ -249,7 +249,7 @@ NamesAndTypesList TSKVSchemaReader::readRowAndGetNamesAndDataTypes(bool & eof)
if (has_value)
{
readEscapedString(value, in);
names_and_types.emplace_back(std::move(name), determineDataTypeByEscapingRule(value, format_settings, FormatSettings::EscapingRule::Escaped));
names_and_types.emplace_back(std::move(name), tryInferDataTypeByEscapingRule(value, format_settings, FormatSettings::EscapingRule::Escaped));
}
else
{

View File

@ -268,7 +268,7 @@ DataTypes TabSeparatedSchemaReader::readRowAndGetDataTypes()
return {};
auto fields = reader.readRow();
return determineDataTypesByEscapingRule(fields, reader.getFormatSettings(), reader.getEscapingRule());
return tryInferDataTypesByEscapingRule(fields, reader.getFormatSettings(), reader.getEscapingRule());
}
void registerInputFormatTabSeparated(FormatFactory & factory)

View File

@ -2,6 +2,7 @@
#include <Formats/FormatFactory.h>
#include <Formats/verbosePrintString.h>
#include <Formats/EscapingRuleUtils.h>
#include <Formats/SchemaInferenceUtils.h>
#include <IO/Operators.h>
#include <DataTypes/DataTypeNothing.h>
#include <DataTypes/Serializations/SerializationNullable.h>
@ -511,16 +512,16 @@ DataTypes TemplateSchemaReader::readRowAndGetDataTypes()
format_reader.skipDelimiter(i);
updateFormatSettingsIfNeeded(row_format.escaping_rules[i], format_settings, row_format, default_csv_delimiter, i);
field = readFieldByEscapingRule(buf, row_format.escaping_rules[i], format_settings);
data_types.push_back(determineDataTypeByEscapingRule(field, format_settings, row_format.escaping_rules[i]));
data_types.push_back(tryInferDataTypeByEscapingRule(field, format_settings, row_format.escaping_rules[i], &json_inference_info));
}
format_reader.skipRowEndDelimiter();
return data_types;
}
void TemplateSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t column_idx)
void TemplateSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
transformInferredTypesIfNeeded(type, new_type, format_settings, row_format.escaping_rules[column_idx]);
transformInferredTypesByEscapingRuleIfNeeded(type, new_type, format_settings, row_format.escaping_rules[field_index], &json_inference_info);
}
static ParsedTemplateFormatString fillResultSetFormat(const FormatSettings & settings)

View File

@ -5,6 +5,7 @@
#include <Processors/Formats/ISchemaReader.h>
#include <Formats/FormatSettings.h>
#include <Formats/ParsedTemplateFormatString.h>
#include <Formats/SchemaInferenceUtils.h>
#include <IO/ReadHelpers.h>
#include <IO/PeekableReadBuffer.h>
#include <Interpreters/Context.h>
@ -121,13 +122,14 @@ public:
DataTypes readRowAndGetDataTypes() override;
private:
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type, size_t column_idx) override;
void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type) override;
PeekableReadBuffer buf;
const ParsedTemplateFormatString format;
const ParsedTemplateFormatString row_format;
TemplateFormatReader format_reader;
bool first_row = true;
JSONInferenceInfo json_inference_info;
const char default_csv_delimiter;
};

View File

@ -593,14 +593,14 @@ DataTypes ValuesSchemaReader::readRowAndGetDataTypes()
{
if (!data_types.empty())
{
skipWhitespaceIfAny(buf);
assertChar(',', buf);
skipWhitespaceIfAny(buf);
}
readQuotedField(value, buf);
auto type = determineDataTypeByEscapingRule(value, format_settings, FormatSettings::EscapingRule::Quoted);
auto type = tryInferDataTypeByEscapingRule(value, format_settings, FormatSettings::EscapingRule::Quoted);
data_types.push_back(std::move(type));
skipWhitespaceIfAny(buf);
}
assertChar(')', buf);

View File

@ -79,9 +79,9 @@ Block generateOutputHeader(const Block & input_header, const Names & keys, bool
}
static Block appendGroupingColumn(Block block, const Names & keys, const GroupingSetsParamsList & params, bool use_nulls)
Block AggregatingStep::appendGroupingColumn(Block block, const Names & keys, bool has_grouping, bool use_nulls)
{
if (params.empty())
if (!has_grouping)
return block;
return generateOutputHeader(block, keys, use_nulls);
@ -104,7 +104,7 @@ AggregatingStep::AggregatingStep(
bool memory_bound_merging_of_aggregation_results_enabled_)
: ITransformingStep(
input_stream_,
appendGroupingColumn(params_.getHeader(input_stream_.header, final_), params_.keys, grouping_sets_params_, group_by_use_nulls_),
appendGroupingColumn(params_.getHeader(input_stream_.header, final_), params_.keys, !grouping_sets_params_.empty(), group_by_use_nulls_),
getTraits(should_produce_results_in_order_of_bucket_number_),
false)
, params(std::move(params_))
@ -469,7 +469,7 @@ void AggregatingStep::updateOutputStream()
{
output_stream = createOutputStream(
input_streams.front(),
appendGroupingColumn(params.getHeader(input_streams.front().header, final), params.keys, grouping_sets_params, group_by_use_nulls),
appendGroupingColumn(params.getHeader(input_streams.front().header, final), params.keys, !grouping_sets_params.empty(), group_by_use_nulls),
getDataStreamTraits());
}
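appendGroupingColumn becomes a public static member taking a plain has_grouping flag instead of the whole GroupingSetsParamsList, so other query-plan code can compute the output header without building the params. A toy mirror of the new decision (simplified types):

struct Block { bool has_grouping_set_column = false; };

// Mirrors the new static signature: a flag instead of the GroupingSetsParamsList.
static Block appendGroupingColumn(Block block, bool has_grouping, bool use_nulls)
{
    if (!has_grouping)
        return block;                       // no GROUPING SETS: header is unchanged
    (void)use_nulls;                        // would control nullability of the keys
    block.has_grouping_set_column = true;   // stands in for generateOutputHeader()
    return block;
}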

View File

@ -42,6 +42,8 @@ public:
bool should_produce_results_in_order_of_bucket_number_,
bool memory_bound_merging_of_aggregation_results_enabled_);
static Block appendGroupingColumn(Block block, const Names & keys, bool has_grouping, bool use_nulls);
String getName() const override { return "Aggregating"; }
void transformPipeline(QueryPipelineBuilder & pipeline, const BuildQueryPipelineSettings &) override;

View File

@ -297,9 +297,14 @@ bool MergeFromLogEntryTask::finalize(ReplicatedMergeMutateTaskBase::PartLogWrite
{
part = merge_task->getFuture().get();
/// Task is not needed
merge_task.reset();
storage.merger_mutator.renameMergedTemporaryPart(part, parts, NO_TRANSACTION_PTR, *transaction_ptr);
/// Why do we reset the task here? Because it holds a shared pointer to the part, and tryRemovePartImmediately would
/// not be able to remove the part and would throw an exception (because someone holds the pointer).
///
/// Why can we not reset the task right after obtaining the part from getFuture()? Because the task holds an RAII wrapper for
/// temp directories which guards the temporary dir from background removal. So this is the right place to reset the task,
/// and it is really needed.
merge_task.reset();
try
{

View File

@ -1348,6 +1348,8 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
loadDataPartsFromDisk(
broken_parts_to_detach, duplicate_parts_to_remove, pool, num_parts, parts_queue, skip_sanity_checks, settings);
bool is_static_storage = isStaticStorage();
if (settings->in_memory_parts_enable_wal)
{
std::map<String, MutableDataPartsVector> disk_wal_part_map;
@ -1376,13 +1378,13 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
ErrorCodes::CORRUPTED_DATA);
write_ahead_log = std::make_shared<MergeTreeWriteAheadLog>(*this, disk_ptr, it->name());
for (auto && part : write_ahead_log->restore(metadata_snapshot, getContext(), part_lock))
for (auto && part : write_ahead_log->restore(metadata_snapshot, getContext(), part_lock, is_static_storage))
disk_wal_parts.push_back(std::move(part));
}
else
{
MergeTreeWriteAheadLog wal(*this, disk_ptr, it->name());
for (auto && part : wal.restore(metadata_snapshot, getContext(), part_lock))
for (auto && part : wal.restore(metadata_snapshot, getContext(), part_lock, is_static_storage))
disk_wal_parts.push_back(std::move(part));
}
}
@ -1408,11 +1410,17 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
return;
}
for (auto & part : broken_parts_to_detach)
part->renameToDetached("broken-on-start"); /// detached parts must not have '_' in prefixes
if (!is_static_storage)
{
for (auto & part : broken_parts_to_detach)
{
/// detached parts must not have '_' in prefixes
part->renameToDetached("broken-on-start");
}
for (auto & part : duplicate_parts_to_remove)
part->remove();
for (auto & part : duplicate_parts_to_remove)
part->remove();
}
auto deactivate_part = [&] (DataPartIteratorByStateAndInfo it)
{
@ -2167,6 +2175,8 @@ size_t MergeTreeData::clearEmptyParts()
void MergeTreeData::rename(const String & new_table_path, const StorageID & new_table_id)
{
LOG_INFO(log, "Renaming table to path {} with ID {}", new_table_path, new_table_id.getFullTableName());
auto disks = getStoragePolicy()->getDisks();
for (const auto & disk : disks)

View File

@ -9,6 +9,10 @@
#include <Parsers/ASTLiteral.h>
#include <Parsers/ASTSelectQuery.h>
#include <Functions/FunctionFactory.h>
#include <Planner/PlannerActionsVisitor.h>
#include <Storages/MergeTree/MergeTreeIndexUtils.h>
namespace DB
{
@ -242,67 +246,78 @@ MergeTreeIndexGranulePtr MergeTreeIndexAggregatorSet::getGranuleAndReset()
MergeTreeIndexConditionSet::MergeTreeIndexConditionSet(
const String & index_name_,
const Block & index_sample_block_,
const Block & index_sample_block,
size_t max_rows_,
const SelectQueryInfo & query,
const SelectQueryInfo & query_info,
ContextPtr context)
: index_name(index_name_)
, max_rows(max_rows_)
, index_sample_block(index_sample_block_)
{
for (const auto & name : index_sample_block.getNames())
if (!key_columns.contains(name))
key_columns.insert(name);
const auto & select = query.query->as<ASTSelectQuery &>();
if (select.where() && select.prewhere())
expression_ast = makeASTFunction(
"and",
select.where()->clone(),
select.prewhere()->clone());
else if (select.where())
expression_ast = select.where()->clone();
else if (select.prewhere())
expression_ast = select.prewhere()->clone();
useless = checkASTUseless(expression_ast);
/// Do not proceed if index is useless for this query.
if (useless)
ASTPtr ast_filter_node = buildFilterNode(query_info.query);
if (!ast_filter_node)
return;
/// Replace logical functions with bit functions.
/// Working with UInt8: last bit = can be true, previous = can be false (Like src/Storages/MergeTree/BoolMask.h).
traverseAST(expression_ast);
if (context->getSettingsRef().allow_experimental_analyzer)
{
if (!query_info.filter_actions_dag)
return;
auto syntax_analyzer_result = TreeRewriter(context).analyze(
expression_ast, index_sample_block.getNamesAndTypesList());
actions = ExpressionAnalyzer(expression_ast, syntax_analyzer_result, context).getActions(true);
if (checkDAGUseless(*query_info.filter_actions_dag->getOutputs().at(0), context))
return;
const auto * filter_node = query_info.filter_actions_dag->getOutputs().at(0);
auto filter_actions_dag = ActionsDAG::buildFilterActionsDAG({filter_node}, {}, context);
const auto * filter_actions_dag_node = filter_actions_dag->getOutputs().at(0);
std::unordered_map<const ActionsDAG::Node *, const ActionsDAG::Node *> node_to_result_node;
filter_actions_dag->getOutputs()[0] = &traverseDAG(*filter_actions_dag_node, filter_actions_dag, context, node_to_result_node);
filter_actions_dag->removeUnusedActions();
actions = std::make_shared<ExpressionActions>(filter_actions_dag);
}
else
{
if (checkASTUseless(ast_filter_node))
return;
auto expression_ast = ast_filter_node->clone();
/// Replace logical functions with bit functions.
/// Working with UInt8: last bit = can be true, previous = can be false (Like src/Storages/MergeTree/BoolMask.h).
traverseAST(expression_ast);
auto syntax_analyzer_result = TreeRewriter(context).analyze(expression_ast, index_sample_block.getNamesAndTypesList());
actions = ExpressionAnalyzer(expression_ast, syntax_analyzer_result, context).getActions(true);
}
}
bool MergeTreeIndexConditionSet::alwaysUnknownOrTrue() const
{
return useless;
return isUseless();
}
bool MergeTreeIndexConditionSet::mayBeTrueOnGranule(MergeTreeIndexGranulePtr idx_granule) const
{
if (useless)
if (isUseless())
return true;
auto granule = std::dynamic_pointer_cast<MergeTreeIndexGranuleSet>(idx_granule);
if (!granule)
throw Exception(
"Set index condition got a granule with the wrong type.", ErrorCodes::LOGICAL_ERROR);
throw Exception(ErrorCodes::LOGICAL_ERROR,
"Set index condition got a granule with the wrong type");
if (useless || granule->empty() || (max_rows != 0 && granule->size() > max_rows))
if (isUseless() || granule->empty() || (max_rows != 0 && granule->size() > max_rows))
return true;
Block result = granule->block;
actions->execute(result);
auto column
= result.getByName(expression_ast->getColumnName()).column->convertToFullColumnIfConst()->convertToFullColumnIfLowCardinality();
const auto & filter_node_name = actions->getActionsDAG().getOutputs().at(0)->result_name;
auto column = result.getByName(filter_node_name).column->convertToFullColumnIfConst()->convertToFullColumnIfLowCardinality();
if (column->onlyNull())
return false;
@@ -318,17 +333,214 @@ bool MergeTreeIndexConditionSet::mayBeTrueOnGranule(MergeTreeIndexGranulePtr idx
}
if (!col_uint8)
throw Exception("ColumnUInt8 expected as Set index condition result.", ErrorCodes::LOGICAL_ERROR);
throw Exception(ErrorCodes::LOGICAL_ERROR,
"ColumnUInt8 expected as Set index condition result");
const auto & condition = col_uint8->getData();
size_t column_size = column->size();
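/// The low bit of the mask means "can be true"; a single such non-NULL row keeps the granule.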
for (size_t i = 0; i < column->size(); ++i)
for (size_t i = 0; i < column_size; ++i)
if ((!null_map || (*null_map)[i] == 0) && condition[i] & 1)
return true;
return false;
}
const ActionsDAG::Node & MergeTreeIndexConditionSet::traverseDAG(const ActionsDAG::Node & node,
ActionsDAGPtr & result_dag,
const ContextPtr & context,
std::unordered_map<const ActionsDAG::Node *, const ActionsDAG::Node *> & node_to_result_node) const
{
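/// Rewrite a filter DAG node into the BoolMask form: logical operators become
/// __bitBoolMask* functions, supported atoms are wrapped with __bitWrapperFunc,
/// and anything the index cannot analyze collapses to a constant "unknown" mask.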
auto result_node_it = node_to_result_node.find(&node);
if (result_node_it != node_to_result_node.end())
return *result_node_it->second;
const ActionsDAG::Node * result_node = nullptr;
if (const auto * operator_node_ptr = operatorFromDAG(node, result_dag, context, node_to_result_node))
{
result_node = operator_node_ptr;
}
else if (const auto * atom_node_ptr = atomFromDAG(node, result_dag, context))
{
result_node = atom_node_ptr;
if (atom_node_ptr->type == ActionsDAG::ActionType::INPUT ||
atom_node_ptr->type == ActionsDAG::ActionType::FUNCTION)
{
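/// A plain boolean atom: wrap it into the two-bit mask
/// (last bit = can be true, previous = can be false).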
auto bit_wrapper_function = FunctionFactory::instance().get("__bitWrapperFunc", context);
result_node = &result_dag->addFunction(bit_wrapper_function, {atom_node_ptr}, {});
}
}
else
{
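/// The node cannot be analyzed by this index: substitute the constant UNKNOWN_FIELD
/// mask, which never filters out a granule.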
ColumnWithTypeAndName unknown_field_column_with_type;
unknown_field_column_with_type.name = calculateConstantActionNodeName(UNKNOWN_FIELD);
unknown_field_column_with_type.type = std::make_shared<DataTypeUInt8>();
unknown_field_column_with_type.column = unknown_field_column_with_type.type->createColumnConst(1, UNKNOWN_FIELD);
result_node = &result_dag->addColumn(unknown_field_column_with_type);
}
node_to_result_node.emplace(&node, result_node);
return *result_node;
}
const ActionsDAG::Node * MergeTreeIndexConditionSet::atomFromDAG(const ActionsDAG::Node & node, ActionsDAGPtr & result_dag, const ContextPtr & context) const
{
/// Function, literal or column
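/// An atom is usable if it is a constant, an index key column, or a function
/// whose arguments are all usable atoms; otherwise nullptr is returned.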
const auto * node_to_check = &node;
while (node_to_check->type == ActionsDAG::ActionType::ALIAS)
node_to_check = node_to_check->children[0];
if (node_to_check->column && isColumnConst(*node_to_check->column))
return &node;
RPNBuilderTreeContext tree_context(context);
RPNBuilderTreeNode tree_node(node_to_check, tree_context);
auto column_name = tree_node.getColumnName();
if (key_columns.contains(column_name))
{
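/// The atom is over an index key column; make sure it is registered as an input of the result DAG.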
const auto * result_node = node_to_check;
if (node.type != ActionsDAG::ActionType::INPUT)
result_node = &result_dag->addInput(column_name, node.result_type);
return result_node;
}
if (node.type != ActionsDAG::ActionType::FUNCTION)
return nullptr;
const auto & arguments = node.children;
size_t arguments_size = arguments.size();
ActionsDAG::NodeRawConstPtrs children(arguments_size);
for (size_t i = 0; i < arguments_size; ++i)
{
children[i] = atomFromDAG(*arguments[i], result_dag, context);
if (!children[i])
return nullptr;
}
return &result_dag->addFunction(node.function_builder, children, {});
}
const ActionsDAG::Node * MergeTreeIndexConditionSet::operatorFromDAG(const ActionsDAG::Node & node,
ActionsDAGPtr & result_dag,
const ContextPtr & context,
std::unordered_map<const ActionsDAG::Node *, const ActionsDAG::Node *> & node_to_result_node) const
{
/// Functions AND, OR, NOT. Replace with bit*.
const auto * node_to_check = &node;
while (node_to_check->type == ActionsDAG::ActionType::ALIAS)
node_to_check = node_to_check->children[0];
if (node_to_check->column && isColumnConst(*node_to_check->column))
return nullptr;
if (node_to_check->type != ActionsDAG::ActionType::FUNCTION)
return nullptr;
auto function_name = node_to_check->function->getName();
const auto & arguments = node_to_check->children;
size_t arguments_size = arguments.size();
if (function_name == "not")
{
if (arguments_size != 1)
return nullptr;
auto bit_swap_last_two_function = FunctionFactory::instance().get("__bitSwapLastTwo", context);
return &result_dag->addFunction(bit_swap_last_two_function, {arguments[0]}, {});
}
else if (function_name == "and" || function_name == "indexHint" || function_name == "or")
{
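/// For index analysis indexHint is treated like a conjunction of its arguments.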
if (arguments_size < 2)
return nullptr;
ActionsDAG::NodeRawConstPtrs children;
children.resize(arguments_size);
for (size_t i = 0; i < arguments_size; ++i)
children[i] = &traverseDAG(*arguments[i], result_dag, context, node_to_result_node);
FunctionOverloadResolverPtr function;
if (function_name == "and" || function_name == "indexHint")
function = FunctionFactory::instance().get("__bitBoolMaskAnd", context);
else
function = FunctionFactory::instance().get("__bitBoolMaskOr", context);
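/// Fold the N-ary operator into a right-to-left chain of binary calls:
/// op(a, b, c, d) -> f(a, f(b, f(c, d))).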
const auto * last_argument = children.back();
children.pop_back();
const auto * before_last_argument = children.back();
children.pop_back();
while (true)
{
last_argument = &result_dag->addFunction(function, {before_last_argument, last_argument}, {});
if (children.empty())
break;
before_last_argument = children.back();
children.pop_back();
}
return last_argument;
}
return nullptr;
}
bool MergeTreeIndexConditionSet::checkDAGUseless(const ActionsDAG::Node & node, const ContextPtr & context, bool atomic) const
{
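/// DAG counterpart of checkASTUseless: returns true when the subtree cannot restrict
/// the index, i.e. no index key column is reachable through supported operators.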
const auto * node_to_check = &node;
while (node_to_check->type == ActionsDAG::ActionType::ALIAS)
node_to_check = node_to_check->children[0];
RPNBuilderTreeContext tree_context(context);
RPNBuilderTreeNode tree_node(node_to_check, tree_context);
if (node.column && isColumnConst(*node.column))
{
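/// A top-level constant true (e.g. WHERE 1) cannot narrow the search, so the index is
/// useless; inside an atom (atomic == true) a constant alone does not make it useless.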
Field literal;
node.column->get(0, literal);
return !atomic && literal.safeGet<bool>();
}
else if (node.type == ActionsDAG::ActionType::FUNCTION)
{
auto column_name = tree_node.getColumnName();
if (key_columns.contains(column_name))
return false;
auto function_name = node.function_builder->getName();
const auto & arguments = node.children;
if (function_name == "and" || function_name == "indexHint")
return std::all_of(arguments.begin(), arguments.end(), [&, atomic](const auto & arg) { return checkDAGUseless(*arg, context, atomic); });
else if (function_name == "or")
return std::any_of(arguments.begin(), arguments.end(), [&, atomic](const auto & arg) { return checkDAGUseless(*arg, context, atomic); });
else if (function_name == "not")
return checkDAGUseless(*arguments.at(0), context, atomic);
else
return std::any_of(arguments.begin(), arguments.end(),
[&](const auto & arg) { return checkDAGUseless(*arg, context, true /*atomic*/); });
}
auto column_name = tree_node.getColumnName();
return !key_columns.contains(column_name);
}
void MergeTreeIndexConditionSet::traverseAST(ASTPtr & node) const
{
if (operatorFromAST(node))
@@ -465,7 +677,7 @@ bool MergeTreeIndexConditionSet::checkASTUseless(const ASTPtr & node, bool atomi
else if (const auto * literal = node->as<ASTLiteral>())
return !atomic && literal->value.safeGet<bool>();
else if (const auto * identifier = node->as<ASTIdentifier>())
return key_columns.find(identifier->getColumnName()) == std::end(key_columns);
return !key_columns.contains(identifier->getColumnName());
else
return true;
}
Some files were not shown because too many files have changed in this diff