mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-22 07:31:57 +00:00
Merge pull request #42033 from mark-polokhov/BSONEachRow
Add BSONEachRow input/output format
This commit is contained in:
commit
98d6b96c82
@ -13,7 +13,7 @@ The supported formats are:
| Format | Input | Output |
|-------------------------------------------------------------------------------------------|------|--------|
| [TabSeparated](#tabseparated) | ✔ | ✔ |
| [TabSeparatedRaw](#tabseparatedraw) | ✔ | ✔ |
| [TabSeparatedWithNames](#tabseparatedwithnames) | ✔ | ✔ |
| [TabSeparatedWithNamesAndTypes](#tabseparatedwithnamesandtypes) | ✔ | ✔ |
| [TabSeparatedRawWithNames](#tabseparatedrawwithnames) | ✔ | ✔ |
@ -48,6 +48,7 @@ The supported formats are:
| [JSONCompactStringsEachRowWithNames](#jsoncompactstringseachrowwithnames) | ✔ | ✔ |
| [JSONCompactStringsEachRowWithNamesAndTypes](#jsoncompactstringseachrowwithnamesandtypes) | ✔ | ✔ |
| [JSONObjectEachRow](#jsonobjecteachrow) | ✔ | ✔ |
| [BSONEachRow](#bsoneachrow) | ✔ | ✔ |
| [TSKV](#tskv) | ✔ | ✔ |
| [Pretty](#pretty) | ✗ | ✔ |
| [PrettyNoEscapes](#prettynoescapes) | ✗ | ✔ |
@ -1210,6 +1211,69 @@ SELECT * FROM json_each_row_nested
- [output_format_json_array_of_rows](../operations/settings/settings.md#output_format_json_array_of_rows) - output a JSON array of all rows in JSONEachRow(Compact) format. Default value - `false`.
- [output_format_json_validate_utf8](../operations/settings/settings.md#output_format_json_validate_utf8) - enables validation of UTF-8 sequences in JSON output formats (note that it doesn't impact formats JSON/JSONCompact/JSONColumnsWithMetadata, they always validate utf8). Default value - `false`.
## BSONEachRow {#bsoneachrow}

In this format, ClickHouse formats/parses data as a sequence of BSON documents without any separator between them.
Each row is formatted as a single document and each column is formatted as a single BSON document field with the column name as the key.
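Per the BSON spec, each document begins with a little-endian int32 holding its total size (including the size field itself and the trailing `\x00` terminator), which is what lets consecutive documents be split without any separator. A minimal sketch of walking such a stream (hypothetical helper names, not ClickHouse code; a little-endian host is assumed):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>
#include <vector>

// Read the little-endian int32 size prefix of a BSON document.
// Hypothetical helper, not part of ClickHouse.
static uint32_t readDocSize(const unsigned char * p)
{
    uint32_t size;
    std::memcpy(&size, p, sizeof(size)); // assumes a little-endian host
    return size;
}

// Count back-to-back BSON documents in a buffer (no separators between them).
static std::size_t countDocuments(const std::vector<unsigned char> & buf)
{
    std::size_t count = 0;
    std::size_t pos = 0;
    while (pos + 4 <= buf.size())
    {
        uint32_t size = readDocSize(buf.data() + pos);
        if (size < 5 || pos + size > buf.size())
            break; // malformed or truncated document
        pos += size;
        ++count;
    }
    return count;
}
```

The minimum valid size is 5: four bytes for the size field plus the terminating `\x00`.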

For output it uses the following correspondence between ClickHouse types and BSON types:

| ClickHouse type | BSON Type |
|-----------------|-----------|
| [Bool](../sql-reference/data-types/boolean.md) | `\x08` boolean |
| [Int8/UInt8](../sql-reference/data-types/int-uint.md) | `\x10` int32 |
| [Int16/UInt16](../sql-reference/data-types/int-uint.md) | `\x10` int32 |
| [Int32](../sql-reference/data-types/int-uint.md) | `\x10` int32 |
| [UInt32](../sql-reference/data-types/int-uint.md) | `\x12` int64 |
| [Int64/UInt64](../sql-reference/data-types/int-uint.md) | `\x12` int64 |
| [Float32/Float64](../sql-reference/data-types/float.md) | `\x01` double |
| [Date](../sql-reference/data-types/date.md)/[Date32](../sql-reference/data-types/date32.md) | `\x10` int32 |
| [DateTime](../sql-reference/data-types/datetime.md) | `\x12` int64 |
| [DateTime64](../sql-reference/data-types/datetime64.md) | `\x09` datetime |
| [Decimal32](../sql-reference/data-types/decimal.md) | `\x10` int32 |
| [Decimal64](../sql-reference/data-types/decimal.md) | `\x12` int64 |
| [Decimal128](../sql-reference/data-types/decimal.md) | `\x05` binary, `\x00` binary subtype, size = 16 |
| [Decimal256](../sql-reference/data-types/decimal.md) | `\x05` binary, `\x00` binary subtype, size = 32 |
| [Int128/UInt128](../sql-reference/data-types/int-uint.md) | `\x05` binary, `\x00` binary subtype, size = 16 |
| [Int256/UInt256](../sql-reference/data-types/int-uint.md) | `\x05` binary, `\x00` binary subtype, size = 32 |
| [String](../sql-reference/data-types/string.md)/[FixedString](../sql-reference/data-types/fixedstring.md) | `\x05` binary, `\x00` binary subtype, or `\x02` string if the setting `output_format_bson_string_as_string` is enabled |
| [UUID](../sql-reference/data-types/uuid.md) | `\x05` binary, `\x04` uuid subtype, size = 16 |
| [Array](../sql-reference/data-types/array.md) | `\x04` array |
| [Tuple](../sql-reference/data-types/tuple.md) | `\x04` array |
| [Named Tuple](../sql-reference/data-types/tuple.md) | `\x03` document |
| [Map](../sql-reference/data-types/map.md) (with String keys) | `\x03` document |
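As a concrete illustration of the output side: a single int32 element on the wire is the type byte `\x10`, the null-terminated column name, and then the little-endian value. A sketch (hypothetical helper on a little-endian host, not ClickHouse's serializer):

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append one BSON int32 element: type byte 0x10, null-terminated key,
// little-endian value. Hypothetical sketch of the output correspondence.
static void writeInt32Element(std::vector<unsigned char> & out, const std::string & key, int32_t value)
{
    out.push_back(0x10);                           // \x10 int32
    out.insert(out.end(), key.begin(), key.end()); // e_name (cstring)
    out.push_back(0x00);                           // key terminator
    unsigned char bytes[4];
    std::memcpy(bytes, &value, 4);                 // assumes little-endian host
    out.insert(out.end(), bytes, bytes + 4);
}
```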

For input it uses the following correspondence between BSON types and ClickHouse types:

| BSON Type | ClickHouse Type |
|-----------|-----------------|
| `\x01` double | [Float32/Float64](../sql-reference/data-types/float.md) |
| `\x02` string | [String](../sql-reference/data-types/string.md)/[FixedString](../sql-reference/data-types/fixedstring.md) |
| `\x03` document | [Map](../sql-reference/data-types/map.md)/[Named Tuple](../sql-reference/data-types/tuple.md) |
| `\x04` array | [Array](../sql-reference/data-types/array.md)/[Tuple](../sql-reference/data-types/tuple.md) |
| `\x05` binary, `\x00` binary subtype | [String](../sql-reference/data-types/string.md)/[FixedString](../sql-reference/data-types/fixedstring.md) |
| `\x05` binary, `\x02` old binary subtype | [String](../sql-reference/data-types/string.md)/[FixedString](../sql-reference/data-types/fixedstring.md) |
| `\x05` binary, `\x03` old uuid subtype | [UUID](../sql-reference/data-types/uuid.md) |
| `\x05` binary, `\x04` uuid subtype | [UUID](../sql-reference/data-types/uuid.md) |
| `\x07` ObjectId | [String](../sql-reference/data-types/string.md)/[FixedString](../sql-reference/data-types/fixedstring.md) |
| `\x08` boolean | [Bool](../sql-reference/data-types/boolean.md) |
| `\x09` datetime | [DateTime64](../sql-reference/data-types/datetime64.md) |
| `\x0A` null value | [NULL](../sql-reference/data-types/nullable.md) |
| `\x0D` JavaScript code | [String](../sql-reference/data-types/string.md)/[FixedString](../sql-reference/data-types/fixedstring.md) |
| `\x0E` symbol | [String](../sql-reference/data-types/string.md)/[FixedString](../sql-reference/data-types/fixedstring.md) |
| `\x10` int32 | [Int32/UInt32](../sql-reference/data-types/int-uint.md)/[Decimal32](../sql-reference/data-types/decimal.md) |
| `\x12` int64 | [Int64/UInt64](../sql-reference/data-types/int-uint.md)/[Decimal64](../sql-reference/data-types/decimal.md)/[DateTime64](../sql-reference/data-types/datetime64.md) |

Other BSON types are not supported. This format also performs conversion between different integer types (for example, you can insert a BSON int32 value into a ClickHouse UInt8 column).
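That relaxed integer conversion amounts to a narrowing cast. A sketch (hypothetical helper; the real reader inserts into the target column via `assert_cast` instead):

```cpp
#include <cstdint>

// Sketch of the relaxed integer conversion on input: any BSON integer
// value may land in any ClickHouse integer column, truncating on overflow.
// Hypothetical helper, not the ClickHouse implementation.
template <typename Target>
Target convertBSONInt(int64_t wire_value)
{
    return static_cast<Target>(wire_value);
}
```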
Big integers and decimals (Int128/UInt128/Int256/UInt256/Decimal128/Decimal256) can be parsed from a BSON Binary value with the `\x00` binary subtype. In this case the format validates that the size of the binary data equals the size of the expected value.
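That size check can be sketched as follows (hypothetical helper; `__int128` is a GCC/Clang extension standing in for ClickHouse's Int128, and a little-endian host is assumed):

```cpp
#include <cstring>
#include <stdexcept>
#include <vector>

// Sketch: parse an Int128 from the payload of a BSON binary value with
// subtype \x00, validating that the payload is exactly 16 bytes.
// Hypothetical helper; __int128 stands in for ClickHouse's Int128.
static __int128 parseInt128FromBinary(const std::vector<unsigned char> & payload)
{
    if (payload.size() != sizeof(__int128))
        throw std::runtime_error("Unexpected binary size for Int128");
    __int128 value;
    std::memcpy(&value, payload.data(), sizeof(value)); // little-endian host assumed
    return value;
}
```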

Note: this format doesn't work properly on big-endian platforms.

### BSON format settings {#bson-format-settings}

- [output_format_bson_string_as_string](../operations/settings/settings.md#output_format_bson_string_as_string) - use BSON String type instead of Binary for String columns. Default value - `false`.
- [input_format_bson_skip_fields_with_unsupported_types_in_schema_inference](../operations/settings/settings.md#input_format_bson_skip_fields_with_unsupported_types_in_schema_inference) - allow skipping columns with unsupported types during schema inference for the BSONEachRow format. Default value - `false`.
## Native {#native}

The most efficient format. Data is written and read by blocks in binary format. For each block, the number of rows, number of columns, column names and types, and parts of columns in this block are recorded one after another. In other words, this format is “columnar” – it does not convert columns to rows. This is the format used in the native interface for interaction between servers, for using the command-line client, and for C++ clients.
@ -4784,7 +4784,7 @@ Possible values:

Default value: 1.

## SQLInsert format settings {$sqlinsert-format-settings}
## SQLInsert format settings {#sqlinsert-format-settings}

### output_format_sql_insert_max_batch_size {#output_format_sql_insert_max_batch_size}
@ -4815,3 +4815,17 @@ Default value: `false`.

Quote column names with "`" characters.

Default value: `true`.

## BSONEachRow format settings {#bson-each-row-format-settings}

### output_format_bson_string_as_string {#output_format_bson_string_as_string}

Use BSON String type instead of Binary for String columns.

Disabled by default.

### input_format_bson_skip_fields_with_unsupported_types_in_schema_inference {#input_format_bson_skip_fields_with_unsupported_types_in_schema_inference}

Allow skipping columns with unsupported types during schema inference for the BSONEachRow format.

Disabled by default.
@ -667,9 +667,15 @@ Names Block::getDataTypeNames() const
}

std::unordered_map<String, size_t> Block::getNamesToIndexesMap() const
Block::NameMap Block::getNamesToIndexesMap() const
{
    return index_by_name;
    NameMap res;
    res.reserve(index_by_name.size());

    for (const auto & [name, index] : index_by_name)
        res[name] = index;

    return res;
}
@ -5,6 +5,8 @@
#include <Core/ColumnsWithTypeAndName.h>
#include <Core/NamesAndTypes.h>

#include <Common/HashTable/HashMap.h>

#include <initializer_list>
#include <list>
#include <map>

@ -93,7 +95,10 @@ public:
    Names getNames() const;
    DataTypes getDataTypes() const;
    Names getDataTypeNames() const;
    std::unordered_map<String, size_t> getNamesToIndexesMap() const;

    /// Hash table match `column name -> position in the block`.
    using NameMap = HashMap<StringRef, size_t, StringRefHash>;
    NameMap getNamesToIndexesMap() const;

    Serializations getSerializations() const;
@ -851,6 +851,9 @@ static constexpr UInt64 operator""_GiB(unsigned long long value)
    M(Bool, output_format_sql_insert_include_column_names, true, "Include column names in INSERT query", 0) \
    M(Bool, output_format_sql_insert_use_replace, false, "Use REPLACE statement instead of INSERT", 0) \
    M(Bool, output_format_sql_insert_quote_names, true, "Quote column names with '`' characters", 0) \
    \
    M(Bool, output_format_bson_string_as_string, false, "Use BSON String type instead of Binary for String columns.", 0) \
    M(Bool, input_format_bson_skip_fields_with_unsupported_types_in_schema_inference, false, "Skip fields with unsupported types while schema inference for format BSON.", 0) \

// End of FORMAT_FACTORY_SETTINGS
// Please add settings non-related to formats into the COMMON_SETTINGS above.
@ -284,7 +284,7 @@ std::vector<DictionaryAttribute> DictionaryStructure::getAttributes(
    std::unordered_set<String> attribute_names;
    std::vector<DictionaryAttribute> res_attributes;

    const FormatSettings format_settings;
    const FormatSettings format_settings = {};

    for (const auto & config_elem : config_elems)
    {

@ -62,7 +62,7 @@ struct ExternalQueryBuilder

private:
    const FormatSettings format_settings;
    const FormatSettings format_settings = {};

    void composeLoadAllQuery(WriteBuffer & out) const;
@ -74,7 +74,6 @@ void registerDictionarySourceMongoDB(DictionarySourceFactory & factory)
// Poco/MongoDB/BSONWriter.h:54: void writeCString(const std::string & value);
// src/IO/WriteHelpers.h:146 #define writeCString(s, buf)
#include <IO/WriteHelpers.h>
#include <Processors/Transforms/MongoDBSource.h>

namespace DB

@ -1,5 +1,6 @@
#pragma once

#include <Processors/Transforms/MongoDBSource.h>
#include <Core/Block.h>

#include "DictionaryStructure.h"
106
src/Formats/BSONTypes.cpp
Normal file
@ -0,0 +1,106 @@
#include <Formats/BSONTypes.h>
#include <Common/Exception.h>
#include <Common/hex.h>

namespace DB
{

namespace ErrorCodes
{
    extern const int UNKNOWN_TYPE;
}

static std::string byteToHexString(uint8_t byte)
{
    return "0x" + getHexUIntUppercase(byte);
}
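`getHexUIntUppercase` is a ClickHouse helper; for reference, a standalone equivalent of the function above using `snprintf`:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Standalone equivalent of byteToHexString above, using snprintf in place
// of ClickHouse's getHexUIntUppercase helper. Formats a byte as "0xNN".
static std::string byteToHexStringStd(uint8_t byte)
{
    char buf[5]; // "0x" + two hex digits + NUL
    std::snprintf(buf, sizeof(buf), "0x%02X", byte);
    return buf;
}
```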

BSONType getBSONType(uint8_t value)
{
    if ((value >= 0x01 && value <= 0x13) || value == 0xFF || value == 0x7f)
        return BSONType(value);

    throw Exception(ErrorCodes::UNKNOWN_TYPE, "Unknown BSON type: {}", byteToHexString(value));
}

BSONBinarySubtype getBSONBinarySubtype(uint8_t value)
{
    if (value <= 0x07)
        return BSONBinarySubtype(value);

    throw Exception(ErrorCodes::UNKNOWN_TYPE, "Unknown BSON binary subtype: {}", byteToHexString(value));
}

std::string getBSONTypeName(BSONType type)
{
    switch (type)
    {
        case BSONType::BINARY:
            return "Binary";
        case BSONType::SYMBOL:
            return "Symbol";
        case BSONType::ARRAY:
            return "Array";
        case BSONType::DOCUMENT:
            return "Document";
        case BSONType::TIMESTAMP:
            return "Timestamp";
        case BSONType::INT64:
            return "Int64";
        case BSONType::INT32:
            return "Int32";
        case BSONType::BOOL:
            return "Bool";
        case BSONType::DOUBLE:
            return "Double";
        case BSONType::STRING:
            return "String";
        case BSONType::DECIMAL128:
            return "Decimal128";
        case BSONType::JAVA_SCRIPT_CODE_W_SCOPE:
            return "JavaScript code w/ scope";
        case BSONType::JAVA_SCRIPT_CODE:
            return "JavaScript code";
        case BSONType::DB_POINTER:
            return "DBPointer";
        case BSONType::REGEXP:
            return "Regexp";
        case BSONType::DATETIME:
            return "Datetime";
        case BSONType::OBJECT_ID:
            return "ObjectId";
        case BSONType::UNDEFINED:
            return "Undefined";
        case BSONType::NULL_VALUE:
            return "Null";
        case BSONType::MAX_KEY:
            return "Max key";
        case BSONType::MIN_KEY:
            return "Min key";
    }
}

std::string getBSONBinarySubtypeName(BSONBinarySubtype subtype)
{
    switch (subtype)
    {
        case BSONBinarySubtype::BINARY:
            return "Binary";
        case BSONBinarySubtype::FUNCTION:
            return "Function";
        case BSONBinarySubtype::BINARY_OLD:
            return "Binary (Old)";
        case BSONBinarySubtype::UUID_OLD:
            return "UUID (Old)";
        case BSONBinarySubtype::UUID:
            return "UUID";
        case BSONBinarySubtype::MD5:
            return "MD5";
        case BSONBinarySubtype::ENCRYPTED_BSON_VALUE:
            return "Encrypted BSON value";
        case BSONBinarySubtype::COMPRESSED_BSON_COLUMN:
            return "Compressed BSON column";
    }
}

}
57
src/Formats/BSONTypes.h
Normal file
@ -0,0 +1,57 @@
#pragma once

#include <cstdint>
#include <limits>
#include <string>

namespace DB
{

static const uint8_t BSON_DOCUMENT_END = 0x00;
using BSONSizeT = uint32_t;
static const BSONSizeT MAX_BSON_SIZE = std::numeric_limits<BSONSizeT>::max();

/// See details on https://bsonspec.org/spec.html
enum class BSONType
{
    DOUBLE = 0x01,
    STRING = 0x02,
    DOCUMENT = 0x03,
    ARRAY = 0x04,
    BINARY = 0x05,
    UNDEFINED = 0x06,
    OBJECT_ID = 0x07,
    BOOL = 0x08,
    DATETIME = 0x09,
    NULL_VALUE = 0x0A,
    REGEXP = 0x0B,
    DB_POINTER = 0x0C,
    JAVA_SCRIPT_CODE = 0x0D,
    SYMBOL = 0x0E,
    JAVA_SCRIPT_CODE_W_SCOPE = 0x0F,
    INT32 = 0x10,
    TIMESTAMP = 0x11,
    INT64 = 0x12,
    DECIMAL128 = 0x13,
    MIN_KEY = 0xFF,
    MAX_KEY = 0x7F,
};

enum class BSONBinarySubtype
{
    BINARY = 0x00,
    FUNCTION = 0x01,
    BINARY_OLD = 0x02,
    UUID_OLD = 0x03,
    UUID = 0x04,
    MD5 = 0x05,
    ENCRYPTED_BSON_VALUE = 0x06,
    COMPRESSED_BSON_COLUMN = 0x07,
};

BSONType getBSONType(uint8_t value);
std::string getBSONTypeName(BSONType type);

BSONBinarySubtype getBSONBinarySubtype(uint8_t value);
std::string getBSONBinarySubtypeName(BSONBinarySubtype subtype);

}
@ -18,7 +18,7 @@ void ColumnMapping::setupByHeader(const Block & header)
}

void ColumnMapping::addColumns(
    const Names & column_names, const std::unordered_map<String, size_t> & column_indexes_by_names, const FormatSettings & settings)
    const Names & column_names, const Block::NameMap & column_indexes_by_names, const FormatSettings & settings)
{
    std::vector<bool> read_columns(column_indexes_by_names.size(), false);

@ -26,8 +26,8 @@ void ColumnMapping::addColumns(
    {
        names_of_columns.push_back(name);

        const auto column_it = column_indexes_by_names.find(name);
        if (column_it == column_indexes_by_names.end())
        const auto * column_it = column_indexes_by_names.find(name);
        if (!column_it)
        {
            if (settings.skip_unknown_fields)
            {

@ -41,7 +41,7 @@ void ColumnMapping::addColumns(
                name, column_indexes_for_input_fields.size());
        }

        const auto column_index = column_it->second;
        const auto column_index = column_it->getMapped();

        if (read_columns[column_index])
            throw Exception("Duplicate field found while parsing format header: " + name, ErrorCodes::INCORRECT_DATA);

@ -28,7 +28,7 @@ struct ColumnMapping
    void setupByHeader(const Block & header);

    void addColumns(
        const Names & column_names, const std::unordered_map<String, size_t> & column_indexes_by_names, const FormatSettings & settings);
        const Names & column_names, const Block::NameMap & column_indexes_by_names, const FormatSettings & settings);

    void insertDefaultsForNotSeenColumns(MutableColumns & columns, std::vector<UInt8> & read_columns);
};
@ -834,17 +834,23 @@ DataTypes getDefaultDataTypeForEscapingRules(const std::vector<FormatSettings::E
    return data_types;
}

String getAdditionalFormatInfoForAllRowBasedFormats(const FormatSettings & settings)
{
    return fmt::format(
        "schema_inference_hints={}, max_rows_to_read_for_schema_inference={}",
        settings.schema_inference_hints,
        settings.max_rows_to_read_for_schema_inference);
}

String getAdditionalFormatInfoByEscapingRule(const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule)
{
    String result;
    String result = getAdditionalFormatInfoForAllRowBasedFormats(settings);
    /// First, settings that are common for all text formats:
    result = fmt::format(
        "schema_inference_hints={}, try_infer_integers={}, try_infer_dates={}, try_infer_datetimes={}, max_rows_to_read_for_schema_inference={}",
        settings.schema_inference_hints,
    result += fmt::format(
        ", try_infer_integers={}, try_infer_dates={}, try_infer_datetimes={}",
        settings.try_infer_integers,
        settings.try_infer_dates,
        settings.try_infer_datetimes,
        settings.max_rows_to_read_for_schema_inference);
        settings.try_infer_datetimes);

    /// Second, format-specific settings:
    switch (escaping_rule)

@ -77,6 +77,7 @@ void transformInferredTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, c
void transformInferredJSONTypesIfNeeded(DataTypes & types, const FormatSettings & settings, const std::unordered_set<const IDataType *> * numbers_parsed_from_json_strings = nullptr);
void transformInferredJSONTypesIfNeeded(DataTypePtr & first, DataTypePtr & second, const FormatSettings & settings);

String getAdditionalFormatInfoForAllRowBasedFormats(const FormatSettings & settings);
String getAdditionalFormatInfoByEscapingRule(const FormatSettings & settings, FormatSettings::EscapingRule escaping_rule);

void checkSupportedDelimiterAfterField(FormatSettings::EscapingRule escaping_rule, const String & delimiter, const DataTypePtr & type);
@ -178,6 +178,8 @@ FormatSettings getFormatSettings(ContextPtr context, const Settings & settings)
    format_settings.try_infer_integers = settings.input_format_try_infer_integers;
    format_settings.try_infer_dates = settings.input_format_try_infer_dates;
    format_settings.try_infer_datetimes = settings.input_format_try_infer_datetimes;
    format_settings.bson.output_string_as_string = settings.output_format_bson_string_as_string;
    format_settings.bson.skip_fields_with_unsupported_types_in_schema_inference = settings.input_format_bson_skip_fields_with_unsupported_types_in_schema_inference;

    /// Validate avro_schema_registry_url with RemoteHostFilter when non-empty and in Server context
    if (format_settings.schema.is_server)

@ -303,6 +303,12 @@ struct FormatSettings
        bool use_replace = false;
        bool quote_names = true;
    } sql_insert;

    struct
    {
        bool output_string_as_string;
        bool skip_fields_with_unsupported_types_in_schema_inference;
    } bson;
};

}
@ -231,7 +231,14 @@ namespace JSONUtils
    {
        auto type = getDataTypeFromFieldImpl(key_value_pair.second, settings, numbers_parsed_from_json_strings);
        if (!type)
        {
            /// If we couldn't infer nested type and Object type is not enabled,
            /// we can't determine the type of this JSON field.
            if (!settings.json.try_infer_objects)
                return nullptr;

            continue;
        }

        if (settings.json.try_infer_objects && isObject(type))
            return std::make_shared<DataTypeObject>("json", true);
@ -19,6 +19,7 @@ void registerFileSegmentationEngineJSONCompactEachRow(FormatFactory & factory);
void registerFileSegmentationEngineHiveText(FormatFactory & factory);
#endif
void registerFileSegmentationEngineLineAsString(FormatFactory & factory);
void registerFileSegmentationEngineBSONEachRow(FormatFactory & factory);

/// Formats for both input/output.

@ -49,6 +50,8 @@ void registerInputFormatJSONColumns(FormatFactory & factory);
void registerOutputFormatJSONColumns(FormatFactory & factory);
void registerInputFormatJSONCompactColumns(FormatFactory & factory);
void registerOutputFormatJSONCompactColumns(FormatFactory & factory);
void registerInputFormatBSONEachRow(FormatFactory & factory);
void registerOutputFormatBSONEachRow(FormatFactory & factory);
void registerInputFormatJSONColumnsWithMetadata(FormatFactory & factory);
void registerOutputFormatJSONColumnsWithMetadata(FormatFactory & factory);
void registerInputFormatProtobuf(FormatFactory & factory);

@ -136,7 +139,7 @@ void registerTSKVSchemaReader(FormatFactory & factory);
void registerValuesSchemaReader(FormatFactory & factory);
void registerTemplateSchemaReader(FormatFactory & factory);
void registerMySQLSchemaReader(FormatFactory & factory);
void registerBSONEachRowSchemaReader(FormatFactory & factory);

void registerFileExtensions(FormatFactory & factory);

@ -155,6 +158,7 @@ void registerFormats()
    registerFileSegmentationEngineHiveText(factory);
#endif
    registerFileSegmentationEngineLineAsString(factory);
    registerFileSegmentationEngineBSONEachRow(factory);

    registerInputFormatNative(factory);

@ -184,6 +188,8 @@ void registerFormats()
    registerOutputFormatJSONColumns(factory);
    registerInputFormatJSONCompactColumns(factory);
    registerOutputFormatJSONCompactColumns(factory);
    registerInputFormatBSONEachRow(factory);
    registerOutputFormatBSONEachRow(factory);
    registerInputFormatJSONColumnsWithMetadata(factory);
    registerOutputFormatJSONColumnsWithMetadata(factory);
    registerInputFormatProtobuf(factory);

@ -267,6 +273,7 @@ void registerFormats()
    registerValuesSchemaReader(factory);
    registerTemplateSchemaReader(factory);
    registerMySQLSchemaReader(factory);
    registerBSONEachRowSchemaReader(factory);
}

}
@ -57,7 +57,7 @@ namespace DB

    if (unlikely(begin > end))
    {
        const FormatSettings default_format;
        const FormatSettings default_format{};
        WriteBufferFromOwnString buf_begin, buf_end;
        begin_serializaion->serializeTextQuoted(*(arguments[0].column), i, buf_begin, default_format);
        end_serialization->serializeTextQuoted(*(arguments[1].column), i, buf_end, default_format);
@ -1278,6 +1278,25 @@ void skipToUnescapedNextLineOrEOF(ReadBuffer & buf)
    }
}

void skipNullTerminated(ReadBuffer & buf)
{
    while (!buf.eof())
    {
        char * next_pos = find_first_symbols<'\0'>(buf.position(), buf.buffer().end());
        buf.position() = next_pos;

        if (!buf.hasPendingData())
            continue;

        if (*buf.position() == '\0')
        {
            ++buf.position();
            return;
        }
    }
}
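Outside of ClickHouse's ReadBuffer machinery, the same skip-to-after-`\0` behavior can be sketched over a plain byte range, with `memchr` standing in for `find_first_symbols`:

```cpp
#include <cstddef>
#include <cstring>

// Standalone sketch of skipNullTerminated over a [pos, end) range:
// return the position just past the next '\0', or end if there is none.
static const char * skipNullTerminatedStd(const char * pos, const char * end)
{
    const void * found = std::memchr(pos, '\0', static_cast<std::size_t>(end - pos));
    if (!found)
        return end;
    return static_cast<const char *>(found) + 1;
}
```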

void saveUpToPosition(ReadBuffer & in, Memory<> & memory, char * current)
{
    assert(current >= in.position());

@ -1448,6 +1448,8 @@ void skipToCarriageReturnOrEOF(ReadBuffer & buf);
/// Skip to next character after next unescaped \n. If no \n in stream, skip to end. Does not throw on invalid escape sequences.
void skipToUnescapedNextLineOrEOF(ReadBuffer & buf);

/// Skip to next character after next \0. If no \0 in stream, skip to end.
void skipNullTerminated(ReadBuffer & buf);

/** This function just copies the data from buffer's internal position (in.position())
 * to current position (from arguments) into memory.
978
src/Processors/Formats/Impl/BSONEachRowRowInputFormat.cpp
Normal file
@ -0,0 +1,978 @@
#include <IO/ReadBufferFromString.h>

#include <Formats/FormatFactory.h>
#include <Formats/FormatSettings.h>
#include <Formats/BSONTypes.h>
#include <Formats/EscapingRuleUtils.h>
#include <Processors/Formats/Impl/BSONEachRowRowInputFormat.h>
#include <IO/ReadHelpers.h>

#include <Columns/ColumnsNumber.h>
#include <Columns/ColumnNullable.h>
#include <Columns/ColumnLowCardinality.h>
#include <Columns/ColumnString.h>
#include <Columns/ColumnFixedString.h>
#include <Columns/ColumnDecimal.h>
#include <Columns/ColumnArray.h>
#include <Columns/ColumnTuple.h>
#include <Columns/ColumnMap.h>

#include <DataTypes/DataTypeString.h>
#include <DataTypes/DataTypeUUID.h>
#include <DataTypes/DataTypeDateTime64.h>
#include <DataTypes/DataTypeLowCardinality.h>
#include <DataTypes/DataTypeNullable.h>
#include <DataTypes/DataTypeArray.h>
#include <DataTypes/DataTypeTuple.h>
#include <DataTypes/DataTypeMap.h>
#include <DataTypes/DataTypeFactory.h>
#include <DataTypes/getLeastSupertype.h>


namespace DB
{

namespace ErrorCodes
{
    extern const int INCORRECT_DATA;
    extern const int ILLEGAL_COLUMN;
    extern const int TOO_LARGE_STRING_SIZE;
    extern const int UNKNOWN_TYPE;
}

namespace
{
    enum
    {
        UNKNOWN_FIELD = size_t(-1),
    };
}

BSONEachRowRowInputFormat::BSONEachRowRowInputFormat(
    ReadBuffer & in_, const Block & header_, Params params_, const FormatSettings & format_settings_)
    : IRowInputFormat(header_, in_, std::move(params_))
    , format_settings(format_settings_)
    , name_map(header_.getNamesToIndexesMap())
    , prev_positions(header_.columns())
    , types(header_.getDataTypes())
{
}

inline size_t BSONEachRowRowInputFormat::columnIndex(const StringRef & name, size_t key_index)
{
    /// Optimization by caching the order of fields (which is almost always the same)
    /// and a quick check to match the next expected field, instead of searching the hash table.

    if (prev_positions.size() > key_index && prev_positions[key_index] && name == prev_positions[key_index]->getKey())
    {
        return prev_positions[key_index]->getMapped();
    }
    else
    {
        auto * it = name_map.find(name);

        if (it)
        {
            if (key_index < prev_positions.size())
                prev_positions[key_index] = it;

            return it->getMapped();
        }
        else
            return UNKNOWN_FIELD;
    }
}
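The caching idea above can be sketched with `std::unordered_map` standing in for ClickHouse's `HashMap` (its nodes are pointer-stable, so remembering entry addresses is safe); the names here are hypothetical:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch of the columnIndex field-order cache: remember the map entry seen
// at each key position, so documents with a stable field order skip the
// hash lookup entirely. Hypothetical helper, not the ClickHouse class.
struct ColumnLookup
{
    static constexpr std::size_t UNKNOWN_FIELD = static_cast<std::size_t>(-1);

    std::unordered_map<std::string, std::size_t> name_map;
    std::vector<const std::pair<const std::string, std::size_t> *> prev_positions;

    explicit ColumnLookup(std::unordered_map<std::string, std::size_t> map)
        : name_map(std::move(map)), prev_positions(name_map.size(), nullptr) {}

    std::size_t columnIndex(const std::string & name, std::size_t key_index)
    {
        if (key_index < prev_positions.size() && prev_positions[key_index]
            && name == prev_positions[key_index]->first)
            return prev_positions[key_index]->second; // fast path: same field order as before

        auto it = name_map.find(name);
        if (it == name_map.end())
            return UNKNOWN_FIELD;
        if (key_index < prev_positions.size())
            prev_positions[key_index] = &*it; // cache for the next document
        return it->second;
    }
};
```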
|
||||
|
||||
/// Read the field name. Resulting StringRef is valid only before next read from buf.
|
||||
static StringRef readBSONKeyName(ReadBuffer & in, String & key_holder)
|
||||
{
|
||||
// This is just an optimization: try to avoid copying the name into key_holder
|
||||
|
||||
if (!in.eof())
|
||||
{
|
||||
char * next_pos = find_first_symbols<0>(in.position(), in.buffer().end());
|
||||
|
||||
if (next_pos != in.buffer().end())
|
||||
{
|
||||
StringRef res(in.position(), next_pos - in.position());
|
||||
in.position() = next_pos + 1;
|
||||
return res;
|
||||
}
|
||||
}
|
||||
|
||||
key_holder.clear();
|
||||
readNullTerminated(key_holder, in);
|
||||
return key_holder;
|
||||
}
|
||||
|
||||
static UInt8 readBSONType(ReadBuffer & in)
|
||||
{
|
||||
UInt8 type;
|
||||
readBinary(type, in);
|
||||
return type;
|
||||
}
|
||||
|
||||
static size_t readBSONSize(ReadBuffer & in)
|
||||
{
|
||||
BSONSizeT size;
|
||||
readBinary(size, in);
|
||||
return size;
|
||||
}
|
||||
|
||||
template <typename T>
static void readAndInsertInteger(ReadBuffer & in, IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    /// We allow to read any integer into any integer column.
    /// For example we can read BSON Int32 into ClickHouse UInt8.

    if (bson_type == BSONType::INT32)
    {
        UInt32 value;
        readBinary(value, in);
        assert_cast<ColumnVector<T> &>(column).insertValue(static_cast<T>(value));
    }
    else if (bson_type == BSONType::INT64)
    {
        UInt64 value;
        readBinary(value, in);
        assert_cast<ColumnVector<T> &>(column).insertValue(static_cast<T>(value));
    }
    else if (bson_type == BSONType::BOOL)
    {
        UInt8 value;
        readBinary(value, in);
        assert_cast<ColumnVector<T> &>(column).insertValue(static_cast<T>(value));
    }
    else
    {
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into column with type {}", getBSONTypeName(bson_type), data_type->getName());
    }
}

template <typename T>
static void readAndInsertDouble(ReadBuffer & in, IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    if (bson_type != BSONType::DOUBLE)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into column with type {}", getBSONTypeName(bson_type), data_type->getName());

    Float64 value;
    readBinary(value, in);
    assert_cast<ColumnVector<T> &>(column).insertValue(static_cast<T>(value));
}

template <typename DecimalType, BSONType expected_bson_type>
static void readAndInsertSmallDecimal(ReadBuffer & in, IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    if (bson_type != expected_bson_type)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into column with type {}", getBSONTypeName(bson_type), data_type->getName());

    DecimalType value;
    readBinary(value, in);
    assert_cast<ColumnDecimal<DecimalType> &>(column).insertValue(value);
}

static void readAndInsertDateTime64(ReadBuffer & in, IColumn & column, BSONType bson_type)
{
    if (bson_type != BSONType::INT64 && bson_type != BSONType::DATETIME)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into DateTime64 column", getBSONTypeName(bson_type));

    DateTime64 value;
    readBinary(value, in);
    assert_cast<DataTypeDateTime64::ColumnType &>(column).insertValue(value);
}

template <typename ColumnType>
static void readAndInsertBigInteger(ReadBuffer & in, IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    if (bson_type != BSONType::BINARY)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into column with type {}", getBSONTypeName(bson_type), data_type->getName());

    auto size = readBSONSize(in);
    auto subtype = getBSONBinarySubtype(readBSONType(in));
    if (subtype != BSONBinarySubtype::BINARY)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON Binary subtype {} into column with type {}", getBSONBinarySubtypeName(subtype), data_type->getName());

    using ValueType = typename ColumnType::ValueType;

    if (size != sizeof(ValueType))
        throw Exception(
            ErrorCodes::INCORRECT_DATA,
            "Cannot parse value of type {}, size of binary data is not equal to the binary size of expected value: {} != {}",
            data_type->getName(),
            size,
            sizeof(ValueType));

    ValueType value;
    readBinary(value, in);
    assert_cast<ColumnType &>(column).insertValue(value);
}

template <bool is_fixed_string>
static void readAndInsertStringImpl(ReadBuffer & in, IColumn & column, size_t size)
{
    if constexpr (is_fixed_string)
    {
        auto & fixed_string_column = assert_cast<ColumnFixedString &>(column);
        size_t n = fixed_string_column.getN();
        if (size > n)
            throw Exception("Too large string for FixedString column", ErrorCodes::TOO_LARGE_STRING_SIZE);

        auto & data = fixed_string_column.getChars();

        size_t old_size = data.size();
        data.resize_fill(old_size + n);

        try
        {
            in.readStrict(reinterpret_cast<char *>(data.data() + old_size), size);
        }
        catch (...)
        {
            /// Restore column state in case of any exception.
            data.resize_assume_reserved(old_size);
            throw;
        }
    }
    else
    {
        auto & column_string = assert_cast<ColumnString &>(column);
        auto & data = column_string.getChars();
        auto & offsets = column_string.getOffsets();

        size_t old_chars_size = data.size();
        size_t offset = old_chars_size + size + 1;
        offsets.push_back(offset);

        try
        {
            data.resize(offset);
            in.readStrict(reinterpret_cast<char *>(&data[offset - size - 1]), size);
            data.back() = 0;
        }
        catch (...)
        {
            /// Restore column state in case of any exception.
            offsets.pop_back();
            data.resize_assume_reserved(old_chars_size);
            throw;
        }
    }
}

template <bool is_fixed_string>
static void readAndInsertString(ReadBuffer & in, IColumn & column, BSONType bson_type)
{
    if (bson_type == BSONType::STRING || bson_type == BSONType::SYMBOL || bson_type == BSONType::JAVA_SCRIPT_CODE)
    {
        auto size = readBSONSize(in);
        readAndInsertStringImpl<is_fixed_string>(in, column, size - 1);
        assertChar(0, in);
    }
    else if (bson_type == BSONType::BINARY)
    {
        auto size = readBSONSize(in);
        auto subtype = getBSONBinarySubtype(readBSONType(in));
        if (subtype == BSONBinarySubtype::BINARY || subtype == BSONBinarySubtype::BINARY_OLD)
            readAndInsertStringImpl<is_fixed_string>(in, column, size);
        else
            throw Exception(
                ErrorCodes::ILLEGAL_COLUMN,
                "Cannot insert BSON Binary subtype {} into String column",
                getBSONBinarySubtypeName(subtype));
    }
    else if (bson_type == BSONType::OBJECT_ID)
    {
        readAndInsertStringImpl<is_fixed_string>(in, column, 12);
    }
    else
    {
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into String column", getBSONTypeName(bson_type));
    }
}

static void readAndInsertUUID(ReadBuffer & in, IColumn & column, BSONType bson_type)
{
    if (bson_type == BSONType::BINARY)
    {
        auto size = readBSONSize(in);
        auto subtype = getBSONBinarySubtype(readBSONType(in));
        if (subtype == BSONBinarySubtype::UUID || subtype == BSONBinarySubtype::UUID_OLD)
        {
            if (size != sizeof(UUID))
                throw Exception(
                    ErrorCodes::INCORRECT_DATA,
                    "Cannot parse value of type UUID, size of binary data is not equal to the binary size of UUID value: {} != {}",
                    size,
                    sizeof(UUID));

            UUID value;
            readBinary(value, in);
            assert_cast<ColumnUUID &>(column).insertValue(value);
        }
        else
        {
            throw Exception(
                ErrorCodes::ILLEGAL_COLUMN,
                "Cannot insert BSON Binary subtype {} into UUID column",
                getBSONBinarySubtypeName(subtype));
        }
    }
    else
    {
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into UUID column", getBSONTypeName(bson_type));
    }
}

void BSONEachRowRowInputFormat::readArray(IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    if (bson_type != BSONType::ARRAY)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into Array column", getBSONTypeName(bson_type));

    const auto * data_type_array = assert_cast<const DataTypeArray *>(data_type.get());
    const auto & nested_type = data_type_array->getNestedType();
    auto & array_column = assert_cast<ColumnArray &>(column);
    auto & nested_column = array_column.getData();

    size_t document_start = in->count();
    BSONSizeT document_size;
    readBinary(document_size, *in);
    while (in->count() - document_start + sizeof(BSON_DOCUMENT_END) != document_size)
    {
        auto nested_bson_type = getBSONType(readBSONType(*in));
        readBSONKeyName(*in, current_key_name);
        readField(nested_column, nested_type, nested_bson_type);
    }

    assertChar(BSON_DOCUMENT_END, *in);
    array_column.getOffsets().push_back(array_column.getData().size());
}

void BSONEachRowRowInputFormat::readTuple(IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    if (bson_type != BSONType::ARRAY && bson_type != BSONType::DOCUMENT)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into Tuple column", getBSONTypeName(bson_type));

    /// When BSON type is ARRAY, names in nested document are not useful
    /// (most likely they are just sequential numbers).
    bool use_key_names = bson_type == BSONType::DOCUMENT;

    const auto * data_type_tuple = assert_cast<const DataTypeTuple *>(data_type.get());
    auto & tuple_column = assert_cast<ColumnTuple &>(column);
    size_t read_nested_columns = 0;

    size_t document_start = in->count();
    BSONSizeT document_size;
    readBinary(document_size, *in);
    while (in->count() - document_start + sizeof(BSON_DOCUMENT_END) != document_size)
    {
        auto nested_bson_type = getBSONType(readBSONType(*in));
        auto name = readBSONKeyName(*in, current_key_name);

        size_t index = read_nested_columns;
        if (use_key_names)
        {
            auto try_get_index = data_type_tuple->tryGetPositionByName(name.toString());
            if (!try_get_index)
                throw Exception(
                    ErrorCodes::INCORRECT_DATA,
                    "Cannot parse tuple column with type {} from BSON array/embedded document field: tuple doesn't have element with name \"{}\"",
                    data_type->getName(),
                    name);
            index = *try_get_index;
        }

        if (index >= data_type_tuple->getElements().size())
            throw Exception(
                ErrorCodes::INCORRECT_DATA,
                "Cannot parse tuple column with type {} from BSON array/embedded document field: the number of fields in the BSON document exceeds the number of fields in the tuple",
                data_type->getName());

        readField(tuple_column.getColumn(index), data_type_tuple->getElement(index), nested_bson_type);
        ++read_nested_columns;
    }

    assertChar(BSON_DOCUMENT_END, *in);

    if (read_nested_columns != data_type_tuple->getElements().size())
        throw Exception(
            ErrorCodes::INCORRECT_DATA,
            "Cannot parse tuple column with type {} from BSON array/embedded document field, the number of fields in tuple and BSON document doesn't match: {} != {}",
            data_type->getName(),
            data_type_tuple->getElements().size(),
            read_nested_columns);
}

void BSONEachRowRowInputFormat::readMap(IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    if (bson_type != BSONType::DOCUMENT)
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON {} into Map column", getBSONTypeName(bson_type));

    const auto * data_type_map = assert_cast<const DataTypeMap *>(data_type.get());
    const auto & key_data_type = data_type_map->getKeyType();
    if (!isStringOrFixedString(key_data_type))
        throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Only maps with String key type are supported in BSON, got key type: {}", key_data_type->getName());

    const auto & value_data_type = data_type_map->getValueType();
    auto & column_map = assert_cast<ColumnMap &>(column);
    auto & key_column = column_map.getNestedData().getColumn(0);
    auto & value_column = column_map.getNestedData().getColumn(1);
    auto & offsets = column_map.getNestedColumn().getOffsets();

    size_t document_start = in->count();
    BSONSizeT document_size;
    readBinary(document_size, *in);
    while (in->count() - document_start + sizeof(BSON_DOCUMENT_END) != document_size)
    {
        auto nested_bson_type = getBSONType(readBSONType(*in));
        auto name = readBSONKeyName(*in, current_key_name);
        key_column.insertData(name.data, name.size);
        readField(value_column, value_data_type, nested_bson_type);
    }

    assertChar(BSON_DOCUMENT_END, *in);
    offsets.push_back(key_column.size());
}


bool BSONEachRowRowInputFormat::readField(IColumn & column, const DataTypePtr & data_type, BSONType bson_type)
{
    if (bson_type == BSONType::NULL_VALUE)
    {
        if (data_type->isNullable())
        {
            column.insertDefault();
            return true;
        }

        if (!format_settings.null_as_default)
            throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Cannot insert BSON Null value into non-nullable column with type {}", data_type->getName());

        column.insertDefault();
        return false;
    }

    switch (data_type->getTypeId())
    {
        case TypeIndex::Nullable:
        {
            auto & nullable_column = assert_cast<ColumnNullable &>(column);
            auto & nested_column = nullable_column.getNestedColumn();
            const auto & nested_type = assert_cast<const DataTypeNullable *>(data_type.get())->getNestedType();
            nullable_column.getNullMapColumn().insertValue(0);
            return readField(nested_column, nested_type, bson_type);
        }
        case TypeIndex::LowCardinality:
        {
            auto & lc_column = assert_cast<ColumnLowCardinality &>(column);
            auto tmp_column = lc_column.getDictionary().getNestedColumn()->cloneEmpty();
            const auto & dict_type = assert_cast<const DataTypeLowCardinality *>(data_type.get())->getDictionaryType();
            auto res = readField(*tmp_column, dict_type, bson_type);
            lc_column.insertFromFullColumn(*tmp_column, 0);
            return res;
        }
        case TypeIndex::Int8:
        {
            readAndInsertInteger<Int8>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::UInt8:
        {
            readAndInsertInteger<UInt8>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Int16:
        {
            readAndInsertInteger<Int16>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Date: [[fallthrough]];
        case TypeIndex::UInt16:
        {
            readAndInsertInteger<UInt16>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Date32: [[fallthrough]];
        case TypeIndex::Int32:
        {
            readAndInsertInteger<Int32>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::DateTime: [[fallthrough]];
        case TypeIndex::UInt32:
        {
            readAndInsertInteger<UInt32>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Int64:
        {
            readAndInsertInteger<Int64>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::UInt64:
        {
            readAndInsertInteger<UInt64>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Int128:
        {
            readAndInsertBigInteger<ColumnInt128>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::UInt128:
        {
            readAndInsertBigInteger<ColumnUInt128>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Int256:
        {
            readAndInsertBigInteger<ColumnInt256>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::UInt256:
        {
            readAndInsertBigInteger<ColumnUInt256>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Float32:
        {
            readAndInsertDouble<Float32>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Float64:
        {
            readAndInsertDouble<Float64>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Decimal32:
        {
            readAndInsertSmallDecimal<Decimal32, BSONType::INT32>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Decimal64:
        {
            readAndInsertSmallDecimal<Decimal64, BSONType::INT64>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Decimal128:
        {
            readAndInsertBigInteger<ColumnDecimal<Decimal128>>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Decimal256:
        {
            readAndInsertBigInteger<ColumnDecimal<Decimal256>>(*in, column, data_type, bson_type);
            return true;
        }
        case TypeIndex::DateTime64:
        {
            readAndInsertDateTime64(*in, column, bson_type);
            return true;
        }
        case TypeIndex::FixedString:
        {
            readAndInsertString<true>(*in, column, bson_type);
            return true;
        }
        case TypeIndex::String:
        {
            readAndInsertString<false>(*in, column, bson_type);
            return true;
        }
        case TypeIndex::UUID:
        {
            readAndInsertUUID(*in, column, bson_type);
            return true;
        }
        case TypeIndex::Array:
        {
            readArray(column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Tuple:
        {
            readTuple(column, data_type, bson_type);
            return true;
        }
        case TypeIndex::Map:
        {
            readMap(column, data_type, bson_type);
            return true;
        }
        default:
        {
            throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Type {} is not supported in BSON format", data_type->getName());
        }
    }
}

static void skipBSONField(ReadBuffer & in, BSONType type)
{
    switch (type)
    {
        case BSONType::DOUBLE:
        {
            in.ignore(sizeof(Float64));
            break;
        }
        case BSONType::BOOL:
        {
            in.ignore(sizeof(UInt8));
            break;
        }
        case BSONType::INT64: [[fallthrough]];
        case BSONType::DATETIME: [[fallthrough]];
        case BSONType::TIMESTAMP:
        {
            in.ignore(sizeof(UInt64));
            break;
        }
        case BSONType::INT32:
        {
            in.ignore(sizeof(Int32));
            break;
        }
        case BSONType::JAVA_SCRIPT_CODE: [[fallthrough]];
        case BSONType::SYMBOL: [[fallthrough]];
        case BSONType::STRING:
        {
            BSONSizeT size;
            readBinary(size, in);
            in.ignore(size);
            break;
        }
        case BSONType::DOCUMENT: [[fallthrough]];
        case BSONType::ARRAY:
        {
            BSONSizeT size;
            readBinary(size, in);
            in.ignore(size - sizeof(size));
            break;
        }
        case BSONType::BINARY:
        {
            BSONSizeT size;
            readBinary(size, in);
            in.ignore(size + 1);
            break;
        }
        case BSONType::MIN_KEY: [[fallthrough]];
        case BSONType::MAX_KEY: [[fallthrough]];
        case BSONType::UNDEFINED: [[fallthrough]];
        case BSONType::NULL_VALUE:
        {
            break;
        }
        case BSONType::OBJECT_ID:
        {
            in.ignore(12);
            break;
        }
        case BSONType::REGEXP:
        {
            skipNullTerminated(in);
            skipNullTerminated(in);
            break;
        }
        case BSONType::DB_POINTER:
        {
            BSONSizeT size;
            readBinary(size, in);
            in.ignore(size + 12);
            break;
        }
        case BSONType::JAVA_SCRIPT_CODE_W_SCOPE:
        {
            BSONSizeT size;
            readBinary(size, in);
            in.ignore(size - sizeof(size));
            break;
        }
        case BSONType::DECIMAL128:
        {
            in.ignore(16);
            break;
        }
    }
}

void BSONEachRowRowInputFormat::skipUnknownField(BSONType type, const String & key_name)
{
    if (!format_settings.skip_unknown_fields)
        throw Exception(ErrorCodes::INCORRECT_DATA, "Unknown field found while parsing BSONEachRow format: {}", key_name);

    skipBSONField(*in, type);
}

void BSONEachRowRowInputFormat::syncAfterError()
{
    /// Skip all remaining bytes in current document
    size_t already_read_bytes = in->count() - current_document_start;
    in->ignore(current_document_size - already_read_bytes);
}

bool BSONEachRowRowInputFormat::readRow(MutableColumns & columns, RowReadExtension & ext)
{
    size_t num_columns = columns.size();

    read_columns.assign(num_columns, false);
    seen_columns.assign(num_columns, false);

    if (in->eof())
        return false;

    size_t key_index = 0;

    current_document_start = in->count();
    readBinary(current_document_size, *in);
    while (in->count() - current_document_start + sizeof(BSON_DOCUMENT_END) != current_document_size)
    {
        auto type = getBSONType(readBSONType(*in));
        auto name = readBSONKeyName(*in, current_key_name);
        auto index = columnIndex(name, key_index);

        if (index == UNKNOWN_FIELD)
        {
            current_key_name.assign(name.data, name.size);
            skipUnknownField(BSONType(type), current_key_name);
        }
        else
        {
            seen_columns[index] = true;
            read_columns[index] = readField(*columns[index], types[index], BSONType(type));
        }

        ++key_index;
    }

    assertChar(BSON_DOCUMENT_END, *in);

    const auto & header = getPort().getHeader();
    /// Fill non-visited columns with the default values.
    for (size_t i = 0; i < num_columns; ++i)
        if (!seen_columns[i])
            header.getByPosition(i).type->insertDefaultInto(*columns[i]);

    if (format_settings.defaults_for_omitted_fields)
        ext.read_columns = read_columns;
    else
        ext.read_columns.assign(read_columns.size(), true);

    return true;
}

BSONEachRowSchemaReader::BSONEachRowSchemaReader(ReadBuffer & in_, const FormatSettings & settings_)
    : IRowWithNamesSchemaReader(in_, settings_)
{
}

DataTypePtr BSONEachRowSchemaReader::getDataTypeFromBSONField(BSONType type, bool allow_to_skip_unsupported_types, bool & skip)
{
    switch (type)
    {
        case BSONType::DOUBLE:
        {
            in.ignore(sizeof(Float64));
            return makeNullable(std::make_shared<DataTypeFloat64>());
        }
        case BSONType::BOOL:
        {
            in.ignore(sizeof(UInt8));
            return makeNullable(DataTypeFactory::instance().get("Bool"));
        }
        case BSONType::INT64:
        {
            in.ignore(sizeof(Int64));
            return makeNullable(std::make_shared<DataTypeInt64>());
        }
        case BSONType::DATETIME:
        {
            in.ignore(sizeof(Int64));
            return makeNullable(std::make_shared<DataTypeDateTime64>(6, "UTC"));
        }
        case BSONType::INT32:
        {
            in.ignore(sizeof(Int32));
            return makeNullable(std::make_shared<DataTypeInt32>());
        }
        case BSONType::SYMBOL: [[fallthrough]];
        case BSONType::JAVA_SCRIPT_CODE: [[fallthrough]];
        case BSONType::OBJECT_ID: [[fallthrough]];
        case BSONType::STRING:
        {
            BSONSizeT size;
            readBinary(size, in);
            in.ignore(size);
            return makeNullable(std::make_shared<DataTypeString>());
        }
        case BSONType::DOCUMENT:
        {
            auto nested_names_and_types = getDataTypesFromBSONDocument(false);
            auto nested_types = nested_names_and_types.getTypes();
            bool types_are_equal = true;
            if (nested_types.empty() || !nested_types[0])
                return nullptr;

            for (size_t i = 1; i != nested_types.size(); ++i)
            {
                if (!nested_types[i])
                    return nullptr;

                types_are_equal &= nested_types[i]->equals(*nested_types[0]);
            }

            if (types_are_equal)
                return std::make_shared<DataTypeMap>(std::make_shared<DataTypeString>(), nested_types[0]);

            return std::make_shared<DataTypeTuple>(std::move(nested_types), nested_names_and_types.getNames());
        }
        case BSONType::ARRAY:
        {
            auto nested_types = getDataTypesFromBSONDocument(false).getTypes();
            bool types_are_equal = true;
            if (nested_types.empty() || !nested_types[0])
                return nullptr;

            for (size_t i = 1; i != nested_types.size(); ++i)
            {
                if (!nested_types[i])
                    return nullptr;

                types_are_equal &= nested_types[i]->equals(*nested_types[0]);
            }

            if (types_are_equal)
                return std::make_shared<DataTypeArray>(nested_types[0]);

            return std::make_shared<DataTypeTuple>(std::move(nested_types));
        }
        case BSONType::BINARY:
        {
            BSONSizeT size;
            readBinary(size, in);
            auto subtype = getBSONBinarySubtype(readBSONType(in));
            in.ignore(size);
            switch (subtype)
            {
                case BSONBinarySubtype::BINARY_OLD: [[fallthrough]];
                case BSONBinarySubtype::BINARY:
                    return makeNullable(std::make_shared<DataTypeString>());
                case BSONBinarySubtype::UUID_OLD: [[fallthrough]];
                case BSONBinarySubtype::UUID:
                    return makeNullable(std::make_shared<DataTypeUUID>());
                default:
                    throw Exception(ErrorCodes::UNKNOWN_TYPE, "BSON binary subtype {} is not supported", getBSONBinarySubtypeName(subtype));
            }
        }
        case BSONType::NULL_VALUE:
        {
            return nullptr;
        }
        default:
        {
            if (!allow_to_skip_unsupported_types)
                throw Exception(ErrorCodes::UNKNOWN_TYPE, "BSON type {} is not supported", getBSONTypeName(type));

            skip = true;
            skipBSONField(in, type);
            return nullptr;
        }
    }
}

NamesAndTypesList BSONEachRowSchemaReader::getDataTypesFromBSONDocument(bool allow_to_skip_unsupported_types)
{
    size_t document_start = in.count();
    BSONSizeT document_size;
    readBinary(document_size, in);
    NamesAndTypesList names_and_types;
    while (in.count() - document_start + sizeof(BSON_DOCUMENT_END) != document_size)
    {
        auto bson_type = getBSONType(readBSONType(in));
        String name;
        readNullTerminated(name, in);
        bool skip = false;
        auto type = getDataTypeFromBSONField(bson_type, allow_to_skip_unsupported_types, skip);
        if (!skip)
            names_and_types.emplace_back(name, type);
    }

    assertChar(BSON_DOCUMENT_END, in);

    return names_and_types;
}

NamesAndTypesList BSONEachRowSchemaReader::readRowAndGetNamesAndDataTypes(bool & eof)
{
    if (in.eof())
    {
        eof = true;
        return {};
    }

    return getDataTypesFromBSONDocument(format_settings.bson.skip_fields_with_unsupported_types_in_schema_inference);
}

void BSONEachRowSchemaReader::transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type)
{
    DataTypes types = {type, new_type};
    /// Try to find the least common supertype: for example, Int32 and Int64 are merged into Int64.
    auto least_supertype = tryGetLeastSupertype(types);
    if (least_supertype)
        type = new_type = least_supertype;
}

static std::pair<bool, size_t>
fileSegmentationEngineBSONEachRow(ReadBuffer & in, DB::Memory<> & memory, size_t min_bytes, size_t max_rows)
{
    size_t number_of_rows = 0;

    while (!in.eof() && memory.size() < min_bytes && number_of_rows < max_rows)
    {
        BSONSizeT document_size;
        readBinary(document_size, in);
        if (min_bytes != 0 && document_size > 10 * min_bytes)
            throw ParsingException(
                ErrorCodes::INCORRECT_DATA,
                "Size of BSON document is extremely large. Expected not greater than {} bytes, but current is {} bytes per row. Increase "
                "the value of the setting 'min_chunk_bytes_for_parallel_parsing' or check your data manually; most likely the BSON is malformed",
                min_bytes, document_size);

        size_t old_size = memory.size();
        memory.resize(old_size + document_size);
        memcpy(memory.data() + old_size, reinterpret_cast<char *>(&document_size), sizeof(document_size));
        in.readStrict(memory.data() + old_size + sizeof(document_size), document_size - sizeof(document_size));
        ++number_of_rows;
    }

    return {!in.eof(), number_of_rows};
}

void registerInputFormatBSONEachRow(FormatFactory & factory)
{
    factory.registerInputFormat(
        "BSONEachRow",
        [](ReadBuffer & buf, const Block & sample, IRowInputFormat::Params params, const FormatSettings & settings)
        { return std::make_shared<BSONEachRowRowInputFormat>(buf, sample, std::move(params), settings); });
}

void registerFileSegmentationEngineBSONEachRow(FormatFactory & factory)
{
    factory.registerFileSegmentationEngine("BSONEachRow", &fileSegmentationEngineBSONEachRow);
}

void registerBSONEachRowSchemaReader(FormatFactory & factory)
{
    factory.registerSchemaReader("BSONEachRow", [](ReadBuffer & buf, const FormatSettings & settings)
    {
        return std::make_unique<BSONEachRowSchemaReader>(buf, settings);
    });
    factory.registerAdditionalInfoForSchemaCacheGetter("BSONEachRow", [](const FormatSettings & settings)
    {
        String result = getAdditionalFormatInfoForAllRowBasedFormats(settings);
        return result + fmt::format(", skip_fields_with_unsupported_types_in_schema_inference={}",
            settings.bson.skip_fields_with_unsupported_types_in_schema_inference);
    });
}

}

115
src/Processors/Formats/Impl/BSONEachRowRowInputFormat.h
Normal file
@ -0,0 +1,115 @@
|
||||
#pragma once
|
||||
|
||||
#include <Core/Block.h>
|
||||
#include <Formats/FormatSettings.h>
|
||||
#include <Formats/BSONTypes.h>
|
||||
#include <Processors/Formats/IRowInputFormat.h>
|
||||
#include <Processors/Formats/ISchemaReader.h>
|
||||
#include <Common/HashTable/HashMap.h>
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
/*
|
||||
* Class for parsing data in BSON format.
|
||||
* Each row is parsed as a separate BSON document.
|
||||
* Each column is parsed as a single field with column name as a key.
|
||||
* It uses the following correspondence between BSON types and ClickHouse types:
|
||||
*
|
||||
* BSON Type | ClickHouse Type
|
||||
* \x01 double | Float32/Float64
|
||||
* \x02 string | String/FixedString
|
||||
* \x03 document | Map/Named Tuple
|
||||
* \x04 array | Array/Tuple
|
||||
* \x05 binary, \x00 binary subtype | String/FixedString
|
||||
* \x05 binary, \x02 old binary subtype | String/FixedString
|
||||
* \x05 binary, \x03 old uuid subtype | UUID
|
||||
* \x05 binary, \x04 uuid subtype | UUID
|
||||
* \x07 ObjectId | String
|
||||
* \x08 boolean | Bool
|
||||
* \x09 datetime | DateTime64
|
||||
* \x0A null value | NULL
|
||||
* \x0D JavaScript code | String
|
||||
* \x0E symbol | String/FixedString
|
||||
* \x10 int32 | Int32/Decimal32
|
||||
* \x12 int64 | Int64/Decimal64/DateTime64
|
||||
* \x11 uint64 | UInt64
|
||||
*
|
||||
* Other BSON types are not supported.
|
||||
* Also, we perform conversion between different integer types
|
||||
* (for example, you can insert BSON int32 value into ClickHouse UInt8)
|
||||
* Big integers and decimals Int128/UInt128/Int256/UInt256/Decimal128/Decimal256
|
||||
* can be parsed from BSON Binary value with \x00 binary subtype. In this case
|
||||
* we validate that the size of binary data equals the size of expected value.
|
||||
*
|
||||
* Note: this format will not work on Big-Endian platforms.
|
||||
*/
|
||||
|
||||
class ReadBuffer;

class BSONEachRowRowInputFormat final : public IRowInputFormat
{
public:
    BSONEachRowRowInputFormat(
        ReadBuffer & in_, const Block & header_, Params params_, const FormatSettings & format_settings_);

    String getName() const override { return "BSONEachRowRowInputFormat"; }
    void resetParser() override { }

private:
    void readPrefix() override { }
    void readSuffix() override { }

    bool readRow(MutableColumns & columns, RowReadExtension & ext) override;
    bool allowSyncAfterError() const override { return true; }
    void syncAfterError() override;

    size_t columnIndex(const StringRef & name, size_t key_index);

    using ColumnReader = std::function<void(StringRef name, BSONType type)>;

    bool readField(IColumn & column, const DataTypePtr & data_type, BSONType bson_type);
    void skipUnknownField(BSONType type, const String & key_name);

    void readTuple(IColumn & column, const DataTypePtr & data_type, BSONType bson_type);
    void readArray(IColumn & column, const DataTypePtr & data_type, BSONType bson_type);
    void readMap(IColumn & column, const DataTypePtr & data_type, BSONType bson_type);

    const FormatSettings format_settings;

    /// Buffer for the field name read from the stream. Used when we have to copy it.
    String current_key_name;

    /// Set of columns for which values were read. The rest will be filled with default values.
    std::vector<UInt8> read_columns;
    /// Set of columns already seen in the current row. An exception is thrown if there is more than one column with the same name.
    std::vector<UInt8> seen_columns;
    /// These sets may differ, because if null_as_default=1, read_columns[i] will be false and seen_columns[i] will be true
    /// for a row like {..., "non-nullable column name" : null, ...}

    /// Hash table that maps field name -> position in the block.
    Block::NameMap name_map;

    /// Cached search results for the previous row (keyed by index in the BSON document) - used as a hint.
    std::vector<Block::NameMap::LookupResult> prev_positions;

    DataTypes types;

    size_t current_document_start;
    BSONSizeT current_document_size;
};

class BSONEachRowSchemaReader : public IRowWithNamesSchemaReader
{
public:
    BSONEachRowSchemaReader(ReadBuffer & in_, const FormatSettings & settings_);

private:
    NamesAndTypesList readRowAndGetNamesAndDataTypes(bool & eof) override;
    void transformTypesIfNeeded(DataTypePtr & type, DataTypePtr & new_type) override;

    NamesAndTypesList getDataTypesFromBSONDocument(bool skip_unsupported_types);
    DataTypePtr getDataTypeFromBSONField(BSONType type, bool skip_unsupported_types, bool & skip);
};

}
527 src/Processors/Formats/Impl/BSONEachRowRowOutputFormat.cpp (new file)
@@ -0,0 +1,527 @@
#include <Processors/Formats/Impl/BSONEachRowRowOutputFormat.h>

#include <Formats/FormatFactory.h>
#include <Formats/BSONTypes.h>

#include <Columns/ColumnArray.h>
#include <Columns/ColumnNullable.h>
#include <Columns/ColumnString.h>
#include <Columns/ColumnFixedString.h>
#include <Columns/ColumnLowCardinality.h>
#include <Columns/ColumnTuple.h>
#include <Columns/ColumnMap.h>
#include <Columns/ColumnDecimal.h>

#include <DataTypes/DataTypeNullable.h>
#include <DataTypes/DataTypeLowCardinality.h>
#include <DataTypes/DataTypeArray.h>
#include <DataTypes/DataTypeTuple.h>
#include <DataTypes/DataTypeMap.h>

#include <IO/WriteHelpers.h>
#include <IO/WriteBufferValidUTF8.h>


namespace DB
{

namespace ErrorCodes
{
    extern const int INCORRECT_DATA;
    extern const int ILLEGAL_COLUMN;
    extern const int LOGICAL_ERROR;
}

/// In BSON all keys must be valid UTF-8 sequences.
static String toValidUTF8String(const String & name)
{
    WriteBufferFromOwnString buf;
    WriteBufferValidUTF8 validating_buf(buf);
    writeString(name, validating_buf);
    validating_buf.finalize();
    return buf.str();
}
BSONEachRowRowOutputFormat::BSONEachRowRowOutputFormat(
    WriteBuffer & out_, const Block & header_, const RowOutputFormatParams & params_, const FormatSettings & settings_)
    : IRowOutputFormat(header_, out_, params_), settings(settings_)
{
    const auto & sample = getPort(PortKind::Main).getHeader();
    fields.reserve(sample.columns());
    for (const auto & field : sample.getNamesAndTypes())
        fields.emplace_back(toValidUTF8String(field.name), field.type);
}

static void writeBSONSize(size_t size, WriteBuffer & buf)
{
    if (size > MAX_BSON_SIZE)
        throw Exception(ErrorCodes::INCORRECT_DATA, "Too large document/value size: {}. Maximum allowed size: {}.", size, MAX_BSON_SIZE);

    writePODBinary<BSONSizeT>(BSONSizeT(size), buf);
}

template <typename Type>
static void writeBSONType(Type type, WriteBuffer & buf)
{
    UInt8 value = UInt8(type);
    writeBinary(value, buf);
}

static void writeBSONTypeAndKeyName(BSONType type, const String & name, WriteBuffer & buf)
{
    writeBSONType(type, buf);
    writeString(name, buf);
    writeChar(0x00, buf);
}

template <typename ColumnType, typename ValueType>
static void writeBSONNumber(BSONType type, const IColumn & column, size_t row_num, const String & name, WriteBuffer & buf)
{
    writeBSONTypeAndKeyName(type, name, buf);
    writePODBinary<ValueType>(assert_cast<const ColumnType &>(column).getElement(row_num), buf);
}

template <typename StringColumnType>
static void writeBSONString(const IColumn & column, size_t row_num, const String & name, WriteBuffer & buf, bool as_bson_string)
{
    const auto & string_column = assert_cast<const StringColumnType &>(column);
    StringRef data = string_column.getDataAt(row_num);
    if (as_bson_string)
    {
        writeBSONTypeAndKeyName(BSONType::STRING, name, buf);
        writeBSONSize(data.size + 1, buf);
        writeString(data, buf);
        writeChar(0x00, buf);
    }
    else
    {
        writeBSONTypeAndKeyName(BSONType::BINARY, name, buf);
        writeBSONSize(data.size, buf);
        writeBSONType(BSONBinarySubtype::BINARY, buf);
        writeString(data, buf);
    }
}

template <class ColumnType>
static void writeBSONBigInteger(const IColumn & column, size_t row_num, const String & name, WriteBuffer & buf)
{
    writeBSONTypeAndKeyName(BSONType::BINARY, name, buf);
    writeBSONSize(sizeof(typename ColumnType::ValueType), buf);
    writeBSONType(BSONBinarySubtype::BINARY, buf);
    auto data = assert_cast<const ColumnType &>(column).getDataAt(row_num);
    buf.write(data.data, data.size);
}
size_t BSONEachRowRowOutputFormat::countBSONFieldSize(const IColumn & column, const DataTypePtr & data_type, size_t row_num, const String & name)
{
    size_t size = 1; // Field type
    size += name.size() + 1; // Field name and \0
    switch (column.getDataType())
    {
        case TypeIndex::Int8: [[fallthrough]];
        case TypeIndex::Int16: [[fallthrough]];
        case TypeIndex::UInt16: [[fallthrough]];
        case TypeIndex::Date: [[fallthrough]];
        case TypeIndex::Date32: [[fallthrough]];
        case TypeIndex::Decimal32: [[fallthrough]];
        case TypeIndex::Int32:
        {
            return size + sizeof(Int32);
        }
        case TypeIndex::UInt8:
        {
            if (isBool(data_type))
                return size + 1;

            return size + sizeof(Int32);
        }
        case TypeIndex::Float32: [[fallthrough]];
        case TypeIndex::Float64: [[fallthrough]];
        case TypeIndex::UInt32: [[fallthrough]];
        case TypeIndex::Int64: [[fallthrough]];
        case TypeIndex::UInt64: [[fallthrough]];
        case TypeIndex::DateTime: [[fallthrough]];
        case TypeIndex::Decimal64: [[fallthrough]];
        case TypeIndex::DateTime64:
        {
            return size + sizeof(UInt64);
        }
        case TypeIndex::Int128: [[fallthrough]];
        case TypeIndex::UInt128: [[fallthrough]];
        case TypeIndex::Decimal128:
        {
            return size + sizeof(BSONSizeT) + 1 + sizeof(UInt128); // Size of a binary + binary subtype + 16 bytes of value
        }
        case TypeIndex::Int256: [[fallthrough]];
        case TypeIndex::UInt256: [[fallthrough]];
        case TypeIndex::Decimal256:
        {
            return size + sizeof(BSONSizeT) + 1 + sizeof(UInt256); // Size of a binary + binary subtype + 32 bytes of value
        }
        case TypeIndex::String:
        {
            const auto & string_column = assert_cast<const ColumnString &>(column);
            return size + sizeof(BSONSizeT) + string_column.getDataAt(row_num).size + 1; // Size of data + data + \0 or BSON subtype (in case of BSON binary)
        }
        case TypeIndex::FixedString:
        {
            const auto & string_column = assert_cast<const ColumnFixedString &>(column);
            return size + sizeof(BSONSizeT) + string_column.getN() + 1; // Size of data + data + \0 or BSON subtype (in case of BSON binary)
        }
        case TypeIndex::UUID:
        {
            return size + sizeof(BSONSizeT) + 1 + sizeof(UUID); // Size of data + BSON binary subtype + 16 bytes of value
        }
        case TypeIndex::LowCardinality:
        {
            const auto & lc_column = assert_cast<const ColumnLowCardinality &>(column);
            auto dict_type = assert_cast<const DataTypeLowCardinality *>(data_type.get())->getDictionaryType();
            auto dict_column = lc_column.getDictionary().getNestedColumn();
            size_t index = lc_column.getIndexAt(row_num);
            return countBSONFieldSize(*dict_column, dict_type, index, name);
        }
        case TypeIndex::Nullable:
        {
            auto nested_type = removeNullable(data_type);
            const ColumnNullable & column_nullable = assert_cast<const ColumnNullable &>(column);
            if (column_nullable.isNullAt(row_num))
                return size; /// Null has no value, just type
            return countBSONFieldSize(column_nullable.getNestedColumn(), nested_type, row_num, name);
        }
        case TypeIndex::Array:
        {
            size += sizeof(BSONSizeT); // Size of a document

            const auto & nested_type = assert_cast<const DataTypeArray *>(data_type.get())->getNestedType();
            const ColumnArray & column_array = assert_cast<const ColumnArray &>(column);
            const IColumn & nested_column = column_array.getData();
            const ColumnArray::Offsets & offsets = column_array.getOffsets();
            size_t offset = offsets[row_num - 1];
            size_t array_size = offsets[row_num] - offset;

            for (size_t i = 0; i < array_size; ++i)
                size += countBSONFieldSize(nested_column, nested_type, offset + i, std::to_string(i)); // Add size of each value from array

            return size + sizeof(BSON_DOCUMENT_END); // Add final \0
        }
        case TypeIndex::Tuple:
        {
            size += sizeof(BSONSizeT); // Size of a document

            const auto * tuple_type = assert_cast<const DataTypeTuple *>(data_type.get());
            const auto & nested_types = tuple_type->getElements();
            bool have_explicit_names = tuple_type->haveExplicitNames();
            const auto & nested_names = tuple_type->getElementNames();
            const auto & tuple_column = assert_cast<const ColumnTuple &>(column);
            const auto & nested_columns = tuple_column.getColumns();

            for (size_t i = 0; i < nested_columns.size(); ++i)
            {
                String key_name = have_explicit_names ? toValidUTF8String(nested_names[i]) : std::to_string(i);
                size += countBSONFieldSize(*nested_columns[i], nested_types[i], row_num, key_name); // Add size of each value from tuple
            }

            return size + sizeof(BSON_DOCUMENT_END); // Add final \0
        }
        case TypeIndex::Map:
        {
            size += sizeof(BSONSizeT); // Size of a document

            const auto & map_type = assert_cast<const DataTypeMap &>(*data_type);
            if (!isStringOrFixedString(map_type.getKeyType()))
                throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Only maps with String key type are supported in BSON, got key type: {}", map_type.getKeyType()->getName());
            const auto & value_type = map_type.getValueType();

            const auto & map_column = assert_cast<const ColumnMap &>(column);
            const auto & nested_column = map_column.getNestedColumn();
            const auto & key_value_columns = map_column.getNestedData().getColumns();
            const auto & key_column = key_value_columns[0];
            const auto & value_column = key_value_columns[1];
            const auto & offsets = nested_column.getOffsets();
            size_t offset = offsets[row_num - 1];
            size_t map_size = offsets[row_num] - offset;

            for (size_t i = 0; i < map_size; ++i)
            {
                String key = toValidUTF8String(key_column->getDataAt(offset + i).toString());
                size += countBSONFieldSize(*value_column, value_type, offset + i, key);
            }

            return size + sizeof(BSON_DOCUMENT_END); // Add final \0
        }
        default:
            throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Type {} is not supported in BSON output format", data_type->getName());
    }
}
void BSONEachRowRowOutputFormat::serializeField(const IColumn & column, const DataTypePtr & data_type, size_t row_num, const String & name)
{
    switch (column.getDataType())
    {
        case TypeIndex::Float32:
        {
            writeBSONNumber<ColumnFloat32, double>(BSONType::DOUBLE, column, row_num, name, out);
            break;
        }
        case TypeIndex::Float64:
        {
            writeBSONNumber<ColumnFloat64, double>(BSONType::DOUBLE, column, row_num, name, out);
            break;
        }
        case TypeIndex::Int8:
        {
            writeBSONNumber<ColumnInt8, Int32>(BSONType::INT32, column, row_num, name, out);
            break;
        }
        case TypeIndex::UInt8:
        {
            if (isBool(data_type))
                writeBSONNumber<ColumnUInt8, bool>(BSONType::BOOL, column, row_num, name, out);
            else
                writeBSONNumber<ColumnUInt8, Int32>(BSONType::INT32, column, row_num, name, out);
            break;
        }
        case TypeIndex::Int16:
        {
            writeBSONNumber<ColumnInt16, Int32>(BSONType::INT32, column, row_num, name, out);
            break;
        }
        case TypeIndex::Date: [[fallthrough]];
        case TypeIndex::UInt16:
        {
            writeBSONNumber<ColumnUInt16, Int32>(BSONType::INT32, column, row_num, name, out);
            break;
        }
        case TypeIndex::Date32: [[fallthrough]];
        case TypeIndex::Int32:
        {
            writeBSONNumber<ColumnInt32, Int32>(BSONType::INT32, column, row_num, name, out);
            break;
        }
        case TypeIndex::DateTime: [[fallthrough]];
        case TypeIndex::UInt32:
        {
            writeBSONNumber<ColumnUInt32, Int64>(BSONType::INT64, column, row_num, name, out);
            break;
        }
        case TypeIndex::Int64:
        {
            writeBSONNumber<ColumnInt64, Int64>(BSONType::INT64, column, row_num, name, out);
            break;
        }
        case TypeIndex::UInt64:
        {
            writeBSONNumber<ColumnUInt64, UInt64>(BSONType::INT64, column, row_num, name, out);
            break;
        }
        case TypeIndex::Int128:
        {
            writeBSONBigInteger<ColumnInt128>(column, row_num, name, out);
            break;
        }
        case TypeIndex::UInt128:
        {
            writeBSONBigInteger<ColumnUInt128>(column, row_num, name, out);
            break;
        }
        case TypeIndex::Int256:
        {
            writeBSONBigInteger<ColumnInt256>(column, row_num, name, out);
            break;
        }
        case TypeIndex::UInt256:
        {
            writeBSONBigInteger<ColumnUInt256>(column, row_num, name, out);
            break;
        }
        case TypeIndex::Decimal32:
        {
            writeBSONNumber<ColumnDecimal<Decimal32>, Decimal32>(BSONType::INT32, column, row_num, name, out);
            break;
        }
        case TypeIndex::DateTime64:
        {
            writeBSONNumber<ColumnDecimal<DateTime64>, Decimal64>(BSONType::DATETIME, column, row_num, name, out);
            break;
        }
        case TypeIndex::Decimal64:
        {
            writeBSONNumber<ColumnDecimal<Decimal64>, Decimal64>(BSONType::INT64, column, row_num, name, out);
            break;
        }
        case TypeIndex::Decimal128:
        {
            writeBSONBigInteger<ColumnDecimal<Decimal128>>(column, row_num, name, out);
            break;
        }
        case TypeIndex::Decimal256:
        {
            writeBSONBigInteger<ColumnDecimal<Decimal256>>(column, row_num, name, out);
            break;
        }
        case TypeIndex::String:
        {
            writeBSONString<ColumnString>(column, row_num, name, out, settings.bson.output_string_as_string);
            break;
        }
        case TypeIndex::FixedString:
        {
            writeBSONString<ColumnFixedString>(column, row_num, name, out, settings.bson.output_string_as_string);
            break;
        }
        case TypeIndex::UUID:
        {
            writeBSONTypeAndKeyName(BSONType::BINARY, name, out);
            writeBSONSize(sizeof(UUID), out);
            writeBSONType(BSONBinarySubtype::UUID, out);
            writeBinary(assert_cast<const ColumnUUID &>(column).getElement(row_num), out);
            break;
        }
        case TypeIndex::LowCardinality:
        {
            const auto & lc_column = assert_cast<const ColumnLowCardinality &>(column);
            auto dict_type = assert_cast<const DataTypeLowCardinality *>(data_type.get())->getDictionaryType();
            auto dict_column = lc_column.getDictionary().getNestedColumn();
            size_t index = lc_column.getIndexAt(row_num);
            serializeField(*dict_column, dict_type, index, name);
            break;
        }
        case TypeIndex::Nullable:
        {
            auto nested_type = removeNullable(data_type);
            const ColumnNullable & column_nullable = assert_cast<const ColumnNullable &>(column);
            if (!column_nullable.isNullAt(row_num))
                serializeField(column_nullable.getNestedColumn(), nested_type, row_num, name);
            else
                writeBSONTypeAndKeyName(BSONType::NULL_VALUE, name, out);
            break;
        }
        case TypeIndex::Array:
        {
            const auto & nested_type = assert_cast<const DataTypeArray *>(data_type.get())->getNestedType();
            const ColumnArray & column_array = assert_cast<const ColumnArray &>(column);
            const IColumn & nested_column = column_array.getData();
            const ColumnArray::Offsets & offsets = column_array.getOffsets();
            size_t offset = offsets[row_num - 1];
            size_t array_size = offsets[row_num] - offset;

            writeBSONTypeAndKeyName(BSONType::ARRAY, name, out);

            size_t document_size = sizeof(BSONSizeT);
            for (size_t i = 0; i < array_size; ++i)
                document_size += countBSONFieldSize(nested_column, nested_type, offset + i, std::to_string(i)); // Add size of each value from array
            document_size += sizeof(BSON_DOCUMENT_END); // Add final \0

            writeBSONSize(document_size, out);

            for (size_t i = 0; i < array_size; ++i)
                serializeField(nested_column, nested_type, offset + i, std::to_string(i));

            writeChar(BSON_DOCUMENT_END, out);
            break;
        }
        case TypeIndex::Tuple:
        {
            const auto * tuple_type = assert_cast<const DataTypeTuple *>(data_type.get());
            const auto & nested_types = tuple_type->getElements();
            bool have_explicit_names = tuple_type->haveExplicitNames();
            const auto & nested_names = tuple_type->getElementNames();
            const auto & tuple_column = assert_cast<const ColumnTuple &>(column);
            const auto & nested_columns = tuple_column.getColumns();

            BSONType bson_type = have_explicit_names ? BSONType::DOCUMENT : BSONType::ARRAY;
            writeBSONTypeAndKeyName(bson_type, name, out);

            size_t document_size = sizeof(BSONSizeT);
            for (size_t i = 0; i < nested_columns.size(); ++i)
            {
                String key_name = have_explicit_names ? toValidUTF8String(nested_names[i]) : std::to_string(i);
                document_size += countBSONFieldSize(*nested_columns[i], nested_types[i], row_num, key_name); // Add size of each value from tuple
            }
            document_size += sizeof(BSON_DOCUMENT_END); // Add final \0

            writeBSONSize(document_size, out);

            for (size_t i = 0; i < nested_columns.size(); ++i)
                serializeField(*nested_columns[i], nested_types[i], row_num, toValidUTF8String(nested_names[i]));

            writeChar(BSON_DOCUMENT_END, out);
            break;
        }
        case TypeIndex::Map:
        {
            const auto & map_type = assert_cast<const DataTypeMap &>(*data_type);
            if (!isStringOrFixedString(map_type.getKeyType()))
                throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Only maps with String key type are supported in BSON, got key type: {}", map_type.getKeyType()->getName());
            const auto & value_type = map_type.getValueType();

            const auto & map_column = assert_cast<const ColumnMap &>(column);
            const auto & nested_column = map_column.getNestedColumn();
            const auto & key_value_columns = map_column.getNestedData().getColumns();
            const auto & key_column = key_value_columns[0];
            const auto & value_column = key_value_columns[1];
            const auto & offsets = nested_column.getOffsets();
            size_t offset = offsets[row_num - 1];
            size_t map_size = offsets[row_num] - offset;

            writeBSONTypeAndKeyName(BSONType::DOCUMENT, name, out);

            size_t document_size = sizeof(BSONSizeT);
            for (size_t i = 0; i < map_size; ++i)
            {
                String key = toValidUTF8String(key_column->getDataAt(offset + i).toString());
                document_size += countBSONFieldSize(*value_column, value_type, offset + i, key);
            }
            document_size += sizeof(BSON_DOCUMENT_END);

            writeBSONSize(document_size, out);

            for (size_t i = 0; i < map_size; ++i)
            {
                String key = toValidUTF8String(key_column->getDataAt(offset + i).toString());
                serializeField(*value_column, value_type, offset + i, key);
            }

            writeChar(BSON_DOCUMENT_END, out);
            break;
        }
        default:
            throw Exception(ErrorCodes::ILLEGAL_COLUMN, "Type {} is not supported in BSON output format", data_type->getName());
    }
}
void BSONEachRowRowOutputFormat::write(const Columns & columns, size_t row_num)
{
    /// We should calculate and write the document size before its content.
    size_t document_size = sizeof(BSONSizeT);
    for (size_t i = 0; i != columns.size(); ++i)
        document_size += countBSONFieldSize(*columns[i], fields[i].type, row_num, fields[i].name);
    document_size += sizeof(BSON_DOCUMENT_END);

    size_t document_start = out.count();
    writeBSONSize(document_size, out);

    for (size_t i = 0; i != columns.size(); ++i)
        serializeField(*columns[i], fields[i].type, row_num, fields[i].name);

    writeChar(BSON_DOCUMENT_END, out);

    size_t actual_document_size = out.count() - document_start;
    if (actual_document_size != document_size)
        throw Exception(
            ErrorCodes::LOGICAL_ERROR,
            "The actual size of the BSON document does not match the estimated size: {} != {}",
            actual_document_size,
            document_size);
}

void registerOutputFormatBSONEachRow(FormatFactory & factory)
{
    factory.registerOutputFormat(
        "BSONEachRow",
        [](WriteBuffer & buf, const Block & sample, const RowOutputFormatParams & params, const FormatSettings & _format_settings)
        { return std::make_shared<BSONEachRowRowOutputFormat>(buf, sample, params, _format_settings); });
    factory.markOutputFormatSupportsParallelFormatting("BSONEachRow");
}

}
69 src/Processors/Formats/Impl/BSONEachRowRowOutputFormat.h (new file)
@@ -0,0 +1,69 @@
#pragma once

#include <Core/Block.h>
#include <Formats/FormatSettings.h>
#include <IO/WriteBuffer.h>
#include <Processors/Formats/IRowOutputFormat.h>
#include <Formats/BSONTypes.h>


namespace DB
{

/*
 * Class for formatting data in BSON format.
 * Each row is formatted as a separate BSON document.
 * Each column is formatted as a single field with the column name as a key.
 * It uses the following correspondence between ClickHouse types and BSON types:
 *
 * ClickHouse type         | BSON Type
 * Bool                    | \x08 boolean
 * Int8/UInt8              | \x10 int32
 * Int16/UInt16            | \x10 int32
 * Int32                   | \x10 int32
 * UInt32                  | \x12 int64
 * Int64                   | \x12 int64
 * UInt64                  | \x11 uint64
 * Float32/Float64         | \x01 double
 * Date/Date32             | \x10 int32
 * DateTime                | \x12 int64
 * DateTime64              | \x09 datetime
 * Decimal32               | \x10 int32
 * Decimal64               | \x12 int64
 * Decimal128              | \x05 binary, \x00 binary subtype, size = 16
 * Decimal256              | \x05 binary, \x00 binary subtype, size = 32
 * Int128/UInt128          | \x05 binary, \x00 binary subtype, size = 16
 * Int256/UInt256          | \x05 binary, \x00 binary subtype, size = 32
 * String/FixedString      | \x05 binary, \x00 binary subtype, or \x02 string if the setting output_format_bson_string_as_string is enabled
 * UUID                    | \x05 binary, \x04 uuid subtype, size = 16
 * Array                   | \x04 array
 * Tuple                   | \x04 array
 * Named Tuple             | \x03 document
 * Map (with String keys)  | \x03 document
 *
 * Note: on Big-Endian platforms this format will not work properly.
 */

class BSONEachRowRowOutputFormat final : public IRowOutputFormat
{
public:
    BSONEachRowRowOutputFormat(
        WriteBuffer & out_, const Block & header_, const RowOutputFormatParams & params_, const FormatSettings & settings_);

    String getName() const override { return "BSONEachRowRowOutputFormat"; }

private:
    void write(const Columns & columns, size_t row_num) override;
    void writeField(const IColumn &, const ISerialization &, size_t) override { }

    void serializeField(const IColumn & column, const DataTypePtr & data_type, size_t row_num, const String & name);

    /// Count the field size in bytes that we will get after serialization in BSON format.
    /// It's needed to calculate the document size before the actual serialization,
    /// because in BSON format we should write the size of the document before its content.
    size_t countBSONFieldSize(const IColumn & column, const DataTypePtr & data_type, size_t row_num, const String & name);

    NamesAndTypes fields;
    FormatSettings settings;
};

}
@@ -13,7 +13,7 @@ namespace ErrorCodes
     extern const int CANNOT_SKIP_UNKNOWN_FIELD;
 }
 
-BinaryRowInputFormat::BinaryRowInputFormat(ReadBuffer & in_, Block header, Params params_, bool with_names_, bool with_types_, const FormatSettings & format_settings_)
+BinaryRowInputFormat::BinaryRowInputFormat(ReadBuffer & in_, const Block & header, Params params_, bool with_names_, bool with_types_, const FormatSettings & format_settings_)
     : RowInputFormatWithNamesAndTypes(
         header,
         in_,
@@ -20,7 +20,7 @@ class ReadBuffer;
 class BinaryRowInputFormat final : public RowInputFormatWithNamesAndTypes
 {
 public:
-    BinaryRowInputFormat(ReadBuffer & in_, Block header, Params params_, bool with_names_, bool with_types_, const FormatSettings & format_settings_);
+    BinaryRowInputFormat(ReadBuffer & in_, const Block & header, Params params_, bool with_names_, bool with_types_, const FormatSettings & format_settings_);
 
     String getName() const override { return "BinaryRowInputFormat"; }
 
@@ -72,10 +72,10 @@ JSONColumnsBlockInputFormatBase::JSONColumnsBlockInputFormatBase(
     : IInputFormat(header_, in_)
     , format_settings(format_settings_)
     , fields(header_.getNamesAndTypes())
+    , name_to_index(header_.getNamesToIndexesMap())
     , serializations(header_.getSerializations())
     , reader(std::move(reader_))
 {
-    name_to_index = getPort().getHeader().getNamesToIndexesMap();
 }
 
 size_t JSONColumnsBlockInputFormatBase::readColumn(
@@ -125,7 +125,7 @@ Chunk JSONColumnsBlockInputFormatBase::generate()
     {
         /// Check if this name appears in header. If no, skip this column or throw
         /// an exception according to setting input_format_skip_unknown_fields
-        if (!name_to_index.contains(*column_name))
+        if (!name_to_index.has(*column_name))
         {
             if (!format_settings.skip_unknown_fields)
                 throw Exception(ErrorCodes::INCORRECT_DATA, "Unknown column found in input data: {}", *column_name);
@@ -60,7 +60,7 @@ protected:
     const FormatSettings format_settings;
     const NamesAndTypes fields;
     /// Maps column names and their positions in header.
-    std::unordered_map<String, size_t> name_to_index;
+    Block::NameMap name_to_index;
     Serializations serializations;
     std::unique_ptr<JSONColumnsReaderBase> reader;
     BlockMissingValues block_missing_values;
@@ -37,25 +37,25 @@ JSONEachRowRowInputFormat::JSONEachRowRowInputFormat(
     Params params_,
     const FormatSettings & format_settings_,
     bool yield_strings_)
-    : IRowInputFormat(header_, in_, std::move(params_)), format_settings(format_settings_), name_map(header_.columns()), yield_strings(yield_strings_)
+    : IRowInputFormat(header_, in_, std::move(params_))
+    , format_settings(format_settings_)
+    , prev_positions(header_.columns())
+    , yield_strings(yield_strings_)
 {
-    size_t num_columns = getPort().getHeader().columns();
-    for (size_t i = 0; i < num_columns; ++i)
+    name_map = getPort().getHeader().getNamesToIndexesMap();
+    if (format_settings_.import_nested_json)
     {
-        const String & column_name = columnName(i);
-        name_map[column_name] = i; /// NOTE You could place names more cache-locally.
-        if (format_settings_.import_nested_json)
+        for (size_t i = 0; i != header_.columns(); ++i)
         {
-            const auto split = Nested::splitName(column_name);
+            const StringRef column_name = header_.getByPosition(i).name;
+            const auto split = Nested::splitName(column_name.toView());
             if (!split.second.empty())
             {
-                const StringRef table_name(column_name.data(), split.first.size());
+                const StringRef table_name(column_name.data, split.first.size());
                 name_map[table_name] = NESTED_FIELD;
             }
         }
     }
-
-    prev_positions.resize(num_columns);
 }
 
 const String & JSONEachRowRowInputFormat::columnName(size_t i) const
@@ -32,6 +32,7 @@
 #include <Columns/ColumnLowCardinality.h>
 
 #include <Formats/MsgPackExtensionTypes.h>
+#include <Formats/EscapingRuleUtils.h>
 
 namespace DB
 {
@@ -552,12 +553,9 @@ void registerMsgPackSchemaReader(FormatFactory & factory)
     });
     factory.registerAdditionalInfoForSchemaCacheGetter("MsgPack", [](const FormatSettings & settings)
     {
-        return fmt::format(
-            "number_of_columns={}, schema_inference_hints={}, max_rows_to_read_for_schema_inference={}",
-            settings.msgpack.number_of_columns,
-            settings.schema_inference_hints,
-            settings.max_rows_to_read_for_schema_inference);
-    });
+        String result = getAdditionalFormatInfoForAllRowBasedFormats(settings);
+        return result + fmt::format(", number_of_columns={}", settings.msgpack.number_of_columns);
+    });
 }
 
 }
@@ -35,9 +35,9 @@ MySQLDumpRowInputFormat::MySQLDumpRowInputFormat(ReadBuffer & in_, const Block &
     : IRowInputFormat(header_, in_, params_)
     , table_name(format_settings_.mysql_dump.table_name)
     , types(header_.getDataTypes())
+    , column_indexes_by_names(header_.getNamesToIndexesMap())
     , format_settings(format_settings_)
 {
-    column_indexes_by_names = getPort().getHeader().getNamesToIndexesMap();
 }
 
@@ -22,7 +22,7 @@ private:

     String table_name;
     DataTypes types;
-    std::unordered_map<String, size_t> column_indexes_by_names;
+    Block::NameMap column_indexes_by_names;
     const FormatSettings format_settings;
 };

@@ -30,8 +30,8 @@ RowInputFormatWithNamesAndTypes::RowInputFormatWithNamesAndTypes(
     , with_names(with_names_)
     , with_types(with_types_)
     , format_reader(std::move(format_reader_))
+    , column_indexes_by_names(header_.getNamesToIndexesMap())
 {
-    column_indexes_by_names = getPort().getHeader().getNamesToIndexesMap();
 }

 void RowInputFormatWithNamesAndTypes::readPrefix()
@@ -59,7 +59,7 @@ private:
     std::unique_ptr<FormatWithNamesAndTypesReader> format_reader;

 protected:
-    std::unordered_map<String, size_t> column_indexes_by_names;
+    Block::NameMap column_indexes_by_names;
 };

 /// Base class for parsing data in input formats with -WithNames and -WithNamesAndTypes suffixes.
@@ -3,11 +3,8 @@
 #include <string>
 #include <vector>

-#include <Common/logger_useful.h>
 #include <Poco/MongoDB/Connection.h>
 #include <Poco/MongoDB/Cursor.h>
-#include <Poco/MongoDB/Element.h>
 #include <Poco/MongoDB/Database.h>
-#include <Poco/MongoDB/ObjectId.h>

 #include <Columns/ColumnNullable.h>
@@ -18,7 +15,6 @@
 #include <Common/quoteString.h>
 #include <base/range.h>
-#include <Poco/URI.h>
 #include <Poco/Util/AbstractConfiguration.h>
 #include <Poco/Version.h>

 // only after poco
@@ -1,5 +1,7 @@
 #pragma once

+#include <Poco/MongoDB/Element.h>
+
 #include <Core/Block.h>
 #include <Processors/ISource.h>
 #include <Core/ExternalResultDescription.h>
@@ -73,7 +73,7 @@ private:
     static bool hasStateColumn(const Names & column_names, const StorageSnapshotPtr & storage_snapshot);

 protected:
-    const FormatSettings format_settings;
+    const FormatSettings format_settings = {};

     StorageSystemPartsBase(const StorageID & table_id_, NamesAndTypesList && columns_);

@@ -1,3 +1,5 @@
+#include <TableFunctions/TableFunctionMongoDB.h>
+
 #include <Common/Exception.h>

 #include <Interpreters/evaluateConstantExpression.h>
@@ -7,7 +9,6 @@
 #include <Parsers/ASTLiteral.h>
 #include <Parsers/ASTIdentifier.h>

-#include <TableFunctions/TableFunctionMongoDB.h>
 #include <TableFunctions/TableFunctionFactory.h>
 #include <Interpreters/parseColumnsListForTableFunction.h>
 #include <TableFunctions/registerTableFunctions.h>
@@ -1,8 +1,8 @@
 #pragma once

-#include <Storages/StorageMongoDB.h>
 #include <TableFunctions/ITableFunction.h>
 #include <Storages/ExternalDataSourceConfiguration.h>
+#include <Storages/StorageMongoDB.h>

 namespace DB
 {
@@ -1,6 +1,7 @@
 Arrow
 ArrowStream
 Avro
+BSONEachRow
 CSV
 CSVWithNames
 CSVWithNamesAndTypes
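The BSONEachRow format added to this list writes one BSON document per row, back to back with no separator. As an illustration of the wire layout (a sketch of the BSON spec, not ClickHouse code), a minimal encoder for int32-only documents:

```python
import struct

def bson_document(pairs):
    """Encode a dict of str -> int32 as a minimal BSON document:
    int32 total length, then elements, then a trailing 0x00 byte."""
    body = b""
    for key, value in pairs.items():
        # element: type byte (0x10 = int32), C-string key, int32 value
        body += b"\x10" + key.encode() + b"\x00" + struct.pack("<i", value)
    # the total length counts the length field itself and the trailing NUL
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

# A BSONEachRow-style stream is just documents concatenated, no separator:
stream = bson_document({"x": 1}) + bson_document({"x": 2})
```

Because every document starts with its own length, a reader never needs a delimiter to find where the next row begins.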
252 tests/queries/0_stateless/02475_bson_each_row_format.reference (Normal file)
@@ -0,0 +1,252 @@
Integers
false 0 0 0 0 0 0 0 0
true 1 1 1 1 1 1 1 1
true 2 2 2 2 2 2 2 2
true 3 3 3 3 3 3 3 3
true 4 4 4 4 4 4 4 4
bool Nullable(Bool)
int8 Nullable(Int32)
uint8 Nullable(Int32)
int16 Nullable(Int32)
uint16 Nullable(Int32)
int32 Nullable(Int32)
uint32 Nullable(Int64)
int64 Nullable(Int64)
uint64 Nullable(Int64)
false 0 0 0 0 0 0 0 0
true 1 1 1 1 1 1 1 1
true 2 2 2 2 2 2 2 2
true 3 3 3 3 3 3 3 3
true 4 4 4 4 4 4 4 4
Integers conversion
1 4294967295
1 -1
1 65535
1 -1
1 255
1 -1
uint64 Nullable(Int64)
int64 Nullable(Int64)
4294967297 -4294967297
Floats
0 0
0.5 0.5
0.6666667 0.6666666666666666
0.75 0.75
0.8 0.8
float32 Nullable(Float64)
float64 Nullable(Float64)
0 0
0.5 0.5
0.6666666865348816 0.6666666666666666
0.75 0.75
0.800000011920929 0.8
Big integers
0 0 0 0
-10000000000000000000000 10000000000000000000000 -100000000000000000000000000000000000000000000 100000000000000000000000000000000000000000000
-20000000000000000000000 20000000000000000000000 -200000000000000000000000000000000000000000000 200000000000000000000000000000000000000000000
-30000000000000000000000 30000000000000000000000 -300000000000000000000000000000000000000000000 300000000000000000000000000000000000000000000
-40000000000000000000000 40000000000000000000000 -400000000000000000000000000000000000000000000 400000000000000000000000000000000000000000000
int128 Nullable(String)
uint128 Nullable(String)
int256 Nullable(String)
uint256 Nullable(String)
Dates
1970-01-01 1970-01-01 1970-01-01 00:00:00 1970-01-01 00:00:00.000000
1970-01-02 1970-01-02 1970-01-01 00:00:01 1970-01-01 00:00:01.000000
1970-01-03 1970-01-03 1970-01-01 00:00:02 1970-01-01 00:00:02.000000
1970-01-04 1970-01-04 1970-01-01 00:00:03 1970-01-01 00:00:03.000000
1970-01-05 1970-01-05 1970-01-01 00:00:04 1970-01-01 00:00:04.000000
date Nullable(Int32)
date32 Nullable(Int32)
datetime Nullable(Int64)
datetime64 Nullable(DateTime64(6, \'UTC\'))
0 0 0 1970-01-01 00:00:00.000000
1 1 1 1970-01-01 00:00:01.000000
2 2 2 1970-01-01 00:00:02.000000
3 3 3 1970-01-01 00:00:03.000000
4 4 4 1970-01-01 00:00:04.000000
Decimals
0 0 0 0
42.422 42.424242 42.424242424242 42.424242424242424242424242
84.844 84.848484 84.848484848484 84.848484848484848484848484
127.266 127.272726 127.272727272726 127.272727272727272727272726
169.688 169.696968 169.696969696968 169.696969696969696969696968
decimal32 Nullable(Int32)
decimal64 Nullable(Int64)
decimal128 Nullable(String)
decimal256 Nullable(String)
Strings
 \0\0\0\0\0
HelloWorld b\0\0\0\0
HelloWorldHelloWorld cc\0\0\0
HelloWorldHelloWorldHelloWorld ddd\0\0
HelloWorldHelloWorldHelloWorldHelloWorld eeee\0
 \0\0\0\0\0
HelloWorld b\0\0\0\0
HelloWorldHelloWorld cc\0\0\0
HelloWorldHelloWorldHelloWorld ddd\0\0
HelloWorldHelloWorldHelloWorldHelloWorld eeee\0
str Nullable(String)
fixstr Nullable(String)
 \0\0\0\0\0
HelloWorld b\0\0\0\0
HelloWorldHelloWorld cc\0\0\0
HelloWorldHelloWorldHelloWorld ddd\0\0
HelloWorldHelloWorldHelloWorldHelloWorld eeee\0
UUID
b86d5c23-4b87-4465-8f33-4a685fa1c868
uuid Nullable(UUID)
b86d5c23-4b87-4465-8f33-4a685fa1c868
LowCardinality
a
b
c
a
b
lc Nullable(String)
a
b
c
a
b
Nullable
0
\N
2
\N
4
0
0
2
0
4
FAIL
null Nullable(Int64)
0
\N
2
\N
4
LowCardinality(Nullable)
a
\N
c
\N
b
lc Nullable(String)
a
\N
c
\N
b
Array
[] ['Hello']
[0] ['Hello']
[0,1] ['Hello']
[0,1,2] ['Hello']
[0,1,2,3] ['Hello']
arr1 Array(Nullable(Int64))
arr2 Array(Nullable(String))
[] ['Hello']
[0] ['Hello']
[0,1] ['Hello']
[0,1,2] ['Hello']
[0,1,2,3] ['Hello']
Tuple
(0,'Hello')
(1,'Hello')
(2,'Hello')
(3,'Hello')
(4,'Hello')
('Hello',0)
('Hello',1)
('Hello',2)
('Hello',3)
('Hello',4)
OK
OK
tuple Tuple(x Nullable(Int64), s Nullable(String))
(0,'Hello')
(1,'Hello')
(2,'Hello')
(3,'Hello')
(4,'Hello')
(0,'Hello')
(1,'Hello')
(2,'Hello')
(3,'Hello')
(4,'Hello')
(0,'Hello')
(1,'Hello')
(2,'Hello')
(3,'Hello')
(4,'Hello')
OK
OK
tuple Tuple(Nullable(Int64), Nullable(String))
(0,'Hello')
(1,'Hello')
(2,'Hello')
(3,'Hello')
(4,'Hello')
Map
OK
OK
{'a':0,'b':1}
{'a':1,'b':2}
{'a':2,'b':3}
{'a':3,'b':4}
{'a':4,'b':5}
map Map(String, Nullable(Int64))
{'a':0,'b':1}
{'a':1,'b':2}
{'a':2,'b':3}
{'a':3,'b':4}
{'a':4,'b':5}
Nested types
[[],[0]] ((0,'Hello'),'Hello') {'a':{'a.a':0,'a.b':1},'b':{'b.a':0,'b.b':1}}
[[0],[0,1]] ((1,'Hello'),'Hello') {'a':{'a.a':1,'a.b':2},'b':{'b.a':1,'b.b':2}}
[[0,1],[0,1,2]] ((2,'Hello'),'Hello') {'a':{'a.a':2,'a.b':3},'b':{'b.a':2,'b.b':3}}
[[0,1,2],[0,1,2,3]] ((3,'Hello'),'Hello') {'a':{'a.a':3,'a.b':4},'b':{'b.a':3,'b.b':4}}
[[0,1,2,3],[0,1,2,3,4]] ((4,'Hello'),'Hello') {'a':{'a.a':4,'a.b':5},'b':{'b.a':4,'b.b':5}}
nested1 Array(Array(Nullable(Int64)))
nested2 Tuple(Tuple(x Nullable(Int64), s Nullable(String)), Nullable(String))
nested3 Map(String, Map(String, Nullable(Int64)))
[[],[0]] ((0,'Hello'),'Hello') {'a':{'a.a':0,'a.b':1},'b':{'b.a':0,'b.b':1}}
[[0],[0,1]] ((1,'Hello'),'Hello') {'a':{'a.a':1,'a.b':2},'b':{'b.a':1,'b.b':2}}
[[0,1],[0,1,2]] ((2,'Hello'),'Hello') {'a':{'a.a':2,'a.b':3},'b':{'b.a':2,'b.b':3}}
[[0,1,2],[0,1,2,3]] ((3,'Hello'),'Hello') {'a':{'a.a':3,'a.b':4},'b':{'b.a':3,'b.b':4}}
[[0,1,2,3],[0,1,2,3,4]] ((4,'Hello'),'Hello') {'a':{'a.a':4,'a.b':5},'b':{'b.a':4,'b.b':5}}
[({'a':[],'b':[0]},[{'c':([],[0])},{'d':([0,1],[0,1,2])}])]
[({'a':[0],'b':[0,1]},[{'c':([0],[0,1])},{'d':([0,1,2],[0,1,2,3])}])]
[({'a':[0,1],'b':[0,1,2]},[{'c':([0,1],[0,1,2])},{'d':([0,1,2,3],[0,1,2,3,4])}])]
[({'a':[0,1,2],'b':[0,1,2,3]},[{'c':([0,1,2],[0,1,2,3])},{'d':([0,1,2,3,4],[0,1,2,3,4,5])}])]
[({'a':[0,1,2,3],'b':[0,1,2,3,4]},[{'c':([0,1,2,3],[0,1,2,3,4])},{'d':([0,1,2,3,4,5],[0,1,2,3,4,5,6])}])]
nested Array(Tuple(Map(String, Array(Nullable(Int64))), Array(Map(String, Array(Array(Nullable(Int64)))))))
[({'a':[],'b':[0]},[{'c':[[],[0]]},{'d':[[0,1],[0,1,2]]}])]
[({'a':[0],'b':[0,1]},[{'c':[[0],[0,1]]},{'d':[[0,1,2],[0,1,2,3]]}])]
[({'a':[0,1],'b':[0,1,2]},[{'c':[[0,1],[0,1,2]]},{'d':[[0,1,2,3],[0,1,2,3,4]]}])]
[({'a':[0,1,2],'b':[0,1,2,3]},[{'c':[[0,1,2],[0,1,2,3]]},{'d':[[0,1,2,3,4],[0,1,2,3,4,5]]}])]
[({'a':[0,1,2,3],'b':[0,1,2,3,4]},[{'c':[[0,1,2,3],[0,1,2,3,4]]},{'d':[[0,1,2,3,4,5],[0,1,2,3,4,5,6]]}])]
Schema inference
x Nullable(Int32)
x Nullable(Int64)
x Nullable(Int64)
FAIL
x Array(Nullable(Int32))
x Array(Nullable(Int64))
x Array(Nullable(Int64))
FAIL
OK
OK
OK
OK
Sync after error
OK
0 42 []
1 42 [0]
2 42 [0,1]
0 42 []
1 42 [0]
2 42 [0,1]
199 tests/queries/0_stateless/02475_bson_each_row_format.sh (Executable file)
@@ -0,0 +1,199 @@
#!/usr/bin/env bash
# Tags: no-parallel

CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh

echo "Integers"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select number::Bool as bool, number::Int8 as int8, number::UInt8 as uint8, number::Int16 as int16, number::UInt16 as uint16, number::Int32 as int32, number::UInt32 as uint32, number::Int64 as int64, number::UInt64 as uint64 from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'bool Bool, int8 Int8, uint8 UInt8, int16 Int16, uint16 UInt16, int32 Int32, uint32 UInt32, int64 Int64, uint64 UInt64')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Integers conversion"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'uint64 UInt64, int64 Int64') select 4294967297, -4294967297 settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'uint64 UInt32, int64 UInt32')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'uint64 Int32, int64 Int32')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'uint64 UInt16, int64 UInt16')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'uint64 Int16, int64 Int16')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'uint64 UInt8, int64 UInt8')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'uint64 Int8, int64 Int8')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Floats"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'float32 Float32, float64 Float64') select number / (number + 1), number / (number + 1) from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'float32 Float32, float64 Float64')";

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Big integers"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'int128 Int128, uint128 UInt128, int256 Int256, uint256 UInt256') select number * -10000000000000000000000::Int128 as int128, number * 10000000000000000000000::UInt128 as uint128, number * -100000000000000000000000000000000000000000000::Int256 as int256, number * 100000000000000000000000000000000000000000000::UInt256 as uint256 from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'int128 Int128, uint128 UInt128, int256 Int256, uint256 UInt256')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"

echo "Dates"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'date Date, date32 Date32, datetime DateTime(\'UTC\'), datetime64 DateTime64(6, \'UTC\')') select number, number, number, number from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'date Date, date32 Date32, datetime DateTime(\'UTC\'), datetime64 DateTime64(6, \'UTC\')')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Decimals"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'decimal32 Decimal32(3), decimal64 Decimal64(6), decimal128 Decimal128(12), decimal256 Decimal256(24)') select number * 42.422::Decimal32(3) as decimal32, number * 42.424242::Decimal64(6) as decimal64, number * 42.424242424242::Decimal128(12) as decimal128, number * 42.424242424242424242424242::Decimal256(24) as decimal256 from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'decimal32 Decimal32(3), decimal64 Decimal64(6), decimal128 Decimal128(12), decimal256 Decimal256(24)')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"

echo "Strings"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'str String, fixstr FixedString(5)') select repeat('HelloWorld', number), repeat(char(97 + number), number % 6) from numbers(5) settings engine_file_truncate_on_insert=1, output_format_bson_string_as_string=0"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'str String, fixstr FixedString(5)')"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'str String, fixstr FixedString(5)') select repeat('HelloWorld', number), repeat(char(97 + number), number % 6) from numbers(5) settings engine_file_truncate_on_insert=1, output_format_bson_string_as_string=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'str String, fixstr FixedString(5)')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "UUID"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'uuid UUID') select 'b86d5c23-4b87-4465-8f33-4a685fa1c868'::UUID settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'uuid UUID')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "LowCardinality"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'lc LowCardinality(String)') select char(97 + number % 3)::LowCardinality(String) from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'lc LowCardinality(String)')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Nullable"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'null Nullable(UInt32)') select number % 2 ? NULL : number from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'null Nullable(UInt32)')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'null UInt32')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'null UInt32') settings input_format_null_as_default=0" 2>&1 | grep -q -F "INCORRECT_DATA" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "LowCardinality(Nullable)"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'lc LowCardinality(Nullable(String))') select number % 2 ? NULL : char(97 + number % 3)::LowCardinality(String) from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'lc LowCardinality(Nullable(String))')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Array"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'arr1 Array(UInt64), arr2 Array(String)') select range(number), ['Hello'] from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'arr1 Array(UInt64), arr2 Array(String)') settings engine_file_truncate_on_insert=1"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Tuple"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'tuple Tuple(x UInt64, s String)') select tuple(number, 'Hello') from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(x UInt64, s String)')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(s String, x UInt64)')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(x UInt64)')" 2>&1 | grep -q -F "INCORRECT_DATA" && echo "OK" || echo "FAIL"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(x UInt64, b String)')" 2>&1 | grep -q -F "INCORRECT_DATA" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'tuple Tuple(UInt64, String)') select tuple(number, 'Hello') from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(x UInt64, s String)')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(UInt64, String)')"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(UInt64)')" 2>&1 | grep -q -F "INCORRECT_DATA" && echo "OK" || echo "FAIL"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'tuple Tuple(UInt64, String, UInt64)')" 2>&1 | grep -q -F "INCORRECT_DATA" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Map"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'map Map(UInt64, UInt64)') select map(1, number, 2, number + 1) from numbers(5) settings engine_file_truncate_on_insert=1" 2>&1 | grep -q -F "ILLEGAL_COLUMN" && echo "OK" || echo "FAIL"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'map Map(String, UInt64)') select map('a', number, 'b', number + 1) from numbers(5) settings engine_file_truncate_on_insert=1"

$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'map Map(UInt64, UInt64)')" 2>&1 | grep -q -F "ILLEGAL_COLUMN" && echo "OK" || echo "FAIL"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'map Map(String, UInt64)')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Nested types"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'nested1 Array(Array(UInt32)), nested2 Tuple(Tuple(x UInt32, s String), String), nested3 Map(String, Map(String, UInt32))') select [range(number), range(number + 1)], tuple(tuple(number, 'Hello'), 'Hello'), map('a', map('a.a', number, 'a.b', number + 1), 'b', map('b.a', number, 'b.b', number + 1)) from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'nested1 Array(Array(UInt32)), nested2 Tuple(Tuple(x UInt32, s String), String), nested3 Map(String, Map(String, UInt32))')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow, auto, 'nested Array(Tuple(Map(String, Array(UInt32)), Array(Map(String, Tuple(Array(UInt64), Array(UInt64))))))') select [(map('a', range(number), 'b', range(number + 1)), [map('c', (range(number), range(number + 1))), map('d', (range(number + 2), range(number + 3)))])] from numbers(5) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow, auto, 'nested Array(Tuple(Map(String, Array(UInt32)), Array(Map(String, Tuple(Array(UInt64), Array(UInt64))))))')"

$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "select * from file(02475_data.bsonEachRow)"

echo "Schema inference"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select number::Bool as x from numbers(2) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select number::Int32 as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select number::UInt32 as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select number::Int64 as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select number::UInt64 as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)" 2>&1 | grep -q -F "TYPE_MISMATCH" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select [number::Bool] as x from numbers(2) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select [number::Int32] as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select [number::UInt32] as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select [number::Int64] as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select [number::UInt64] as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)" 2>&1 | grep -q -F "TYPE_MISMATCH" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select [] as x from numbers(2) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)" 2>&1 | grep -q -F "ONLY_NULLS_WHILE_READING_SCHEMA" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select NULL as x from numbers(2) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)" 2>&1 | grep -q -F "ONLY_NULLS_WHILE_READING_SCHEMA" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select [NULL, 1] as x from numbers(2) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)" 2>&1 | grep -q -F "ONLY_NULLS_WHILE_READING_SCHEMA" && echo "OK" || echo "FAIL"

$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select tuple(1, 'str') as x from numbers(2) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q "insert into function file(02475_data.bsonEachRow) select tuple(1) as x from numbers(2)"
$CLICKHOUSE_CLIENT -q "desc file(02475_data.bsonEachRow)" 2>&1 | grep -q -F "TYPE_MISMATCH" && echo "OK" || echo "FAIL"

echo "Sync after error"
$CLICKHOUSE_CLIENT -q "insert into function file(data.bsonEachRow) select number, 42::Int128 as int, range(number) as arr from numbers(3) settings engine_file_truncate_on_insert=1"
$CLICKHOUSE_CLIENT -q " insert into function file(data.bsonEachRow) select number, 'Hello' as int, range(number) as arr from numbers(2) settings engine_file_truncate_on_insert=0"
$CLICKHOUSE_CLIENT -q "insert into function file(data.bsonEachRow) select number, 42::Int128 as int, range(number) as arr from numbers(3) settings engine_file_truncate_on_insert=0"
$CLICKHOUSE_CLIENT -q "select * from file(data.bsonEachRow, auto, 'number UInt64, int Int128, arr Array(UInt64)') settings input_format_allow_errors_num=0" 2>&1 | grep -q -F "INCORRECT_DATA" && echo "OK" || echo "FAIL"
$CLICKHOUSE_CLIENT -q "select * from file(data.bsonEachRow, auto, 'number UInt64, int Int128, arr Array(UInt64)') settings input_format_allow_errors_num=2"
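The "Sync after error" case in the script above works because each BSON document starts with its own little-endian int32 length: after a bad document, the reader can jump straight to the next document boundary and continue. A minimal sketch of that segmentation (the helper name is ours, not ClickHouse code):

```python
import struct

def split_bson_stream(data):
    """Split a separator-less BSON stream into whole documents
    using each document's leading int32 little-endian length."""
    docs, pos = [], 0
    while pos < len(data):
        (length,) = struct.unpack_from("<i", data, pos)
        docs.append(data[pos:pos + length])  # one complete document
        pos += length                        # skip to the next boundary
    return docs

# The smallest valid BSON document is 5 bytes: length 5, empty body, NUL.
docs = split_bson_stream(b"\x05\x00\x00\x00\x00" * 3)
```

The same property is what lets the parallel-parsing test below cut the input into chunks at document boundaries.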
@@ -0,0 +1 @@
OK
34 tests/queries/1_stateful/00176_bson_parallel_parsing.sh (Executable file)
@@ -0,0 +1,34 @@
#!/usr/bin/env bash

CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh

$CLICKHOUSE_CLIENT -q "DROP TABLE IF EXISTS parsing_bson"
$CLICKHOUSE_CLIENT -q "CREATE TABLE parsing_bson(WatchID UInt64, ClientIP6 FixedString(16), EventTime DateTime, Title String) ENGINE=Memory()"

$CLICKHOUSE_CLIENT --max_threads=0 --max_block_size=65505 --output_format_parallel_formatting=false -q \
"SELECT WatchID, ClientIP6, EventTime, Title FROM test.hits ORDER BY UserID LIMIT 100000 Format BSONEachRow" > 00176_data.bson

cat 00176_data.bson | $CLICKHOUSE_CLIENT --max_threads=0 --max_block_size=65505 --input_format_parallel_parsing=false -q "INSERT INTO parsing_bson FORMAT BSONEachRow"

checksum1=$($CLICKHOUSE_CLIENT -q "SELECT * FROM parsing_bson ORDER BY WatchID;" | md5sum)
$CLICKHOUSE_CLIENT -q "TRUNCATE TABLE parsing_bson;"

cat 00176_data.bson | $CLICKHOUSE_CLIENT --max_threads=0 --max_block_size=65505 --input_format_parallel_parsing=true -q "INSERT INTO parsing_bson FORMAT BSONEachRow"

checksum2=$($CLICKHOUSE_CLIENT -q "SELECT * FROM parsing_bson ORDER BY WatchID;" | md5sum)

if [[ "$checksum1" == "$checksum2" ]];
then
    echo "OK"
else
    echo "FAIL"
fi

$CLICKHOUSE_CLIENT -q "DROP TABLE parsing_bson"

rm 00176_data.bson
@@ -9,6 +9,9 @@ AddressSanitizer
 AppleClang
 ArrowStream
 AvroConfluent
+BSON
+BSONEachRow
+Bool
 CCTOOLS
 CLion
 CMake
@@ -95,6 +98,7 @@ NEKUDOTAYIM
 NULLIF
 NVME
 NuRaft
+ObjectId
 Ok
 OpenSUSE
 OpenStack
@@ -190,6 +194,8 @@ bools
 boringssl
 brotli
 buildable
+bson
+bsoneachrow
 camelCase
 capn
 capnproto
@@ -450,6 +456,7 @@ subquery
 subseconds
 substring
 subtree
+subtype
 sudo
 symlink
 symlinks
@@ -482,6 +489,7 @@ userspace
 userver
 utils
 uuid
+uint
 variadic
 varint
 vectorized