Merge branch 'master' into issue-5286

Ivan Lezhankin 2019-08-08 12:57:00 +03:00
commit e90bf74ce8
142 changed files with 1855 additions and 8208 deletions


@@ -1,3 +1,42 @@
## ClickHouse release 19.11.5.28, 2019-08-05
### Bug Fix
* Fixed the possibility of hanging queries when the server is overloaded. [#6301](https://github.com/yandex/ClickHouse/pull/6301) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed FPE in the `yandexConsistentHash` function. This fixes [#6304](https://github.com/yandex/ClickHouse/issues/6304). [#6126](https://github.com/yandex/ClickHouse/pull/6126) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed a bug in the conversion of `LowCardinality` types in `AggregateFunctionFactory`. This fixes [#6257](https://github.com/yandex/ClickHouse/issues/6257). [#6281](https://github.com/yandex/ClickHouse/pull/6281) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
* Fixed parsing of `bool` settings from `true` and `false` strings in configuration files. [#6278](https://github.com/yandex/ClickHouse/pull/6278) ([alesapin](https://github.com/alesapin))
* Fixed a rare bug with incompatible stream headers in queries to a `Distributed` table over a `MergeTree` table when part of the `WHERE` clause moves to `PREWHERE`. [#6236](https://github.com/yandex/ClickHouse/pull/6236) ([alesapin](https://github.com/alesapin))
* Fixed overflow in integer division of a signed type by an unsigned type (illustrated below). This fixes [#6214](https://github.com/yandex/ClickHouse/issues/6214). [#6233](https://github.com/yandex/ClickHouse/pull/6233) ([alexey-milovidov](https://github.com/alexey-milovidov))
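The last fix above concerns C++'s usual arithmetic conversions: when a signed operand is divided by an unsigned operand of equal rank, the signed value is converted to unsigned first, so a negative dividend silently wraps. A minimal standalone illustration of the conversion rule (generic C++, not the ClickHouse code path):

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    int64_t lhs = -1;
    uint64_t rhs = 2;
    // Usual arithmetic conversions turn lhs into uint64_t: -1 wraps to
    // 18446744073709551615, so the quotient is huge instead of 0.
    std::cout << lhs / rhs << '\n'; // prints 9223372036854775807
}
```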
### Backward Incompatible Change
* `Kafka` is still broken.
## ClickHouse release 19.11.4.24, 2019-08-01
### Bug Fix
* Fixed a bug with writing secondary index marks with adaptive granularity. [#6126](https://github.com/yandex/ClickHouse/pull/6126) ([alesapin](https://github.com/alesapin))
* Fixed `WITH ROLLUP` and `WITH CUBE` modifiers of `GROUP BY` with two-level aggregation. [#6225](https://github.com/yandex/ClickHouse/pull/6225) ([Anton Popov](https://github.com/CurtizJ))
* Fixed a hang in the `JSONExtractRaw` function. This fixes [#6195](https://github.com/yandex/ClickHouse/issues/6195). [#6198](https://github.com/yandex/ClickHouse/pull/6198) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed a segfault in `ExternalLoader::reloadOutdated()`. [#6082](https://github.com/yandex/ClickHouse/pull/6082) ([Vitaly Baranov](https://github.com/vitlibar))
* Fixed the case when the server could close listening sockets but not shut down, continuing to serve remaining queries; you could end up with two running clickhouse-server processes. Sometimes the server returned the error `bad_function_call` for remaining queries. [#6231](https://github.com/yandex/ClickHouse/pull/6231) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed a useless and incorrect condition on the update field for the initial loading of external dictionaries via ODBC, MySQL, ClickHouse and HTTP. This fixes [#6069](https://github.com/yandex/ClickHouse/issues/6069). [#6083](https://github.com/yandex/ClickHouse/pull/6083) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed an irrelevant exception when casting `LowCardinality(Nullable)` to a non-Nullable column if it doesn't contain NULLs (e.g. in a query like `SELECT CAST(CAST('Hello' AS LowCardinality(Nullable(String))) AS String)`). [#6094](https://github.com/yandex/ClickHouse/issues/6094) [#6119](https://github.com/yandex/ClickHouse/pull/6119) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
* Fixed a non-deterministic result of the `uniq` aggregate function in extremely rare cases. The bug was present in all ClickHouse versions. [#6058](https://github.com/yandex/ClickHouse/pull/6058) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed a segfault when a CIDR prefix that is slightly too large is passed to the function `IPv6CIDRToRange`. [#6068](https://github.com/yandex/ClickHouse/pull/6068) ([Guillaume Tassery](https://github.com/YiuRULE))
* Fixed a small memory leak when the server throws many exceptions from many different contexts. [#6144](https://github.com/yandex/ClickHouse/pull/6144) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed the situation when the consumer got paused before subscription and was not resumed afterwards. [#6075](https://github.com/yandex/ClickHouse/pull/6075) ([Ivan](https://github.com/abyss7)) Note that Kafka is broken in this version.
* Clear the Kafka data buffer left over from a previous read operation that completed with an error. [#6026](https://github.com/yandex/ClickHouse/pull/6026) ([Nikolay](https://github.com/bopohaa)) Note that Kafka is broken in this version.
* Since `StorageMergeTree::background_task_handle` is initialized in `startup()`, `MergeTreeBlockOutputStream::write()` may try to use it before initialization; the fix is simply to check whether the handle is initialized before using it (see the sketch below). [#6080](https://github.com/yandex/ClickHouse/pull/6080) ([Ivan](https://github.com/abyss7))
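A minimal sketch of that guard in generic C++ (hypothetical names, not the actual ClickHouse types): the handle created in `startup()` is used only if it already exists.

```cpp
#include <optional>

// Hypothetical stand-ins for the storage and its background task handle.
struct BackgroundTaskHandle { void wake() {} };

class Storage
{
    std::optional<BackgroundTaskHandle> background_task_handle; // set in startup()

public:
    void startup() { background_task_handle.emplace(); }

    void write()
    {
        // write() can be reached before startup() has run, so guard the handle.
        if (background_task_handle)
            background_task_handle->wake();
    }
};

int main()
{
    Storage storage;
    storage.write();   // safe even before startup()
    storage.startup();
    storage.write();   // now wakes the background task
}
```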
### Build/Testing/Packaging Improvement
* Added official `rpm` packages. [#5740](https://github.com/yandex/ClickHouse/pull/5740) ([proller](https://github.com/proller)) ([alesapin](https://github.com/alesapin))
* Added the ability to build `.rpm` and `.tgz` packages with the `packager` script. [#5769](https://github.com/yandex/ClickHouse/pull/5769) ([alesapin](https://github.com/alesapin))
* Fixes for the "Arcadia" build system. [#6223](https://github.com/yandex/ClickHouse/pull/6223) ([proller](https://github.com/proller))
### Backward Incompatible Change
* `Kafka` is broken in this version.
## ClickHouse release 19.11.3.11, 2019-07-18
### New Feature
@@ -35,6 +74,7 @@
* clickhouse-copier: Fixed a use-after-free on shutdown. [#5752](https://github.com/yandex/ClickHouse/pull/5752) ([proller](https://github.com/proller))
* Updated `simdjson`. Fixed an issue where some invalid JSONs with zero bytes parsed successfully. [#5938](https://github.com/yandex/ClickHouse/pull/5938) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fixed shutdown of SystemLogs. [#5802](https://github.com/yandex/ClickHouse/pull/5802) ([Anton Popov](https://github.com/CurtizJ))
* Fixed hanging when the condition in `invalidate_query` depends on a dictionary. [#6011](https://github.com/yandex/ClickHouse/pull/6011) ([Vitaly Baranov](https://github.com/vitlibar))
### Improvement
* Allow unresolvable addresses in cluster configuration. They will be considered unavailable, and resolution will be retried at every connection attempt. This is especially useful for Kubernetes. This fixes [#5714](https://github.com/yandex/ClickHouse/issues/5714). [#5924](https://github.com/yandex/ClickHouse/pull/5924) ([alexey-milovidov](https://github.com/alexey-milovidov))
@@ -55,13 +95,12 @@
* Inverted `ngramSearch` to be more intuitive. [#5807](https://github.com/yandex/ClickHouse/pull/5807) ([Danila Kutenin](https://github.com/danlark1))
* Added user parsing in the HDFS engine builder. [#5946](https://github.com/yandex/ClickHouse/pull/5946) ([akonyaev90](https://github.com/akonyaev90))
* Updated the default value of the `max_ast_elements` parameter. [#5933](https://github.com/yandex/ClickHouse/pull/5933) ([Artem Konovalov](https://github.com/izebit))
* Added a notion of obsolete settings. The obsolete setting `allow_experimental_low_cardinality_type` can be used with no effect. [0f15c01c6802f7ce1a1494c12c846be8c98944cd](https://github.com/yandex/ClickHouse/commit/0f15c01c6802f7ce1a1494c12c846be8c98944cd) [Alexey Milovidov](https://github.com/alexey-milovidov)
### Performance Improvement
* Increased the number of streams to SELECT from a Merge table for a more uniform distribution of threads. Added the setting `max_streams_multiplier_for_merge_tables`. This fixes [#5797](https://github.com/yandex/ClickHouse/issues/5797). [#5915](https://github.com/yandex/ClickHouse/pull/5915) ([alexey-milovidov](https://github.com/alexey-milovidov))
### Build/Testing/Packaging Improvement
* Added official `rpm` packages. [#5740](https://github.com/yandex/ClickHouse/pull/5740) ([proller](https://github.com/proller)) ([alesapin](https://github.com/alesapin))
* Added the ability to build `.rpm` and `.tgz` packages with the `packager` script. [#5769](https://github.com/yandex/ClickHouse/pull/5769) ([alesapin](https://github.com/alesapin))
* Added a backward compatibility test for client-server interaction with different versions of ClickHouse. [#5868](https://github.com/yandex/ClickHouse/pull/5868) ([alesapin](https://github.com/alesapin))
* Test coverage information is now collected in every commit and pull request. [#5896](https://github.com/yandex/ClickHouse/pull/5896) ([alesapin](https://github.com/alesapin))
* Cooperate with AddressSanitizer to support our custom allocators (`Arena` and `ArenaWithFreeLists`) for better debugging of use-after-free errors. [#5728](https://github.com/yandex/ClickHouse/pull/5728) ([akuzm](https://github.com/akuzm))

File diff suppressed because it is too large.

contrib/fastops vendored

@@ -1 +1 @@
Subproject commit d2c85c5d6549cfd648a7f31ef7b14341881ff8ae
Subproject commit 88752a5e03cf34639a4a37a4b41d8b463fffd2b5


@@ -3,9 +3,8 @@ set(LIBRARY_DIR ${ClickHouse_SOURCE_DIR}/contrib/fastops)
set(SRCS "")
if(HAVE_AVX)
set (SRCS ${SRCS} ${LIBRARY_DIR}/fastops/avx/ops_avx.cpp ${LIBRARY_DIR}/fastops/core/FastIntrinsics.cpp)
set (SRCS ${SRCS} ${LIBRARY_DIR}/fastops/avx/ops_avx.cpp)
set_source_files_properties(${LIBRARY_DIR}/fastops/avx/ops_avx.cpp PROPERTIES COMPILE_FLAGS "-mavx -DNO_AVX2")
set_source_files_properties(${LIBRARY_DIR}/fastops/core/FastIntrinsics.cpp PROPERTIES COMPILE_FLAGS "-mavx -DNO_AVX2")
endif()
if(HAVE_AVX2)


@@ -5,7 +5,6 @@
#include <DataStreams/copyData.h>
#include <DataTypes/DataTypeFactory.h>
#include "ODBCBlockInputStream.h"
#include <Formats/BinaryRowInputStream.h>
#include <Formats/FormatFactory.h>
#include <IO/WriteBufferFromHTTPServerResponse.h>
#include <IO/WriteHelpers.h>


@@ -299,7 +299,7 @@ void MySQLHandler::authenticate(const HandshakeResponse & handshake_response, co
}
password.resize(plaintext_size);
for (int i = 0; i < plaintext_size; i++)
for (int i = 0; i < plaintext_size; ++i)
{
password[i] = plaintext[i] ^ static_cast<unsigned char>(scramble[i % scramble.size()]);
}
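The loop above recovers the password by XOR-ing the decrypted payload with the repeating scramble from the handshake; the hunk itself only switches `i++` to the idiomatic `++i`. A standalone sketch of the descrambling step with hypothetical inputs (XOR is its own inverse, so applying the scramble twice restores the original):

```cpp
#include <cassert>
#include <string>

int main()
{
    // Hypothetical 20-byte scramble from the handshake and a sample password.
    std::string scramble = "abcdefghijklmnopqrst";
    std::string original = "secret";

    auto xor_with_scramble = [&](std::string s)
    {
        for (size_t i = 0; i < s.size(); ++i)
            s[i] = s[i] ^ static_cast<unsigned char>(scramble[i % scramble.size()]);
        return s;
    };

    std::string plaintext = xor_with_scramble(original);  // what the client sends
    std::string password = xor_with_scramble(plaintext);  // what the server recovers
    assert(password == original);
}
```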


@@ -696,6 +696,7 @@ int Server::main(const std::vector<std::string> & /*args*/)
LOG_INFO(log, "Listening https://" + address.toString());
#else
UNUSED(port);
throw Exception{"HTTPS protocol is disabled because Poco library was built without NetSSL support.",
ErrorCodes::SUPPORT_IS_DISABLED};
#endif
@@ -732,6 +733,7 @@
new Poco::Net::TCPServerParams));
LOG_INFO(log, "Listening for connections with secure native protocol (tcp_secure): " + address.toString());
#else
UNUSED(port);
throw Exception{"SSL support for TCP protocol is disabled because Poco library was built without NetSSL support.",
ErrorCodes::SUPPORT_IS_DISABLED};
#endif
@@ -768,6 +770,7 @@
LOG_INFO(log, "Listening for secure replica communication (interserver) https://" + address.toString());
#else
UNUSED(port);
throw Exception{"SSL support for TCP protocol is disabled because Poco library was built without NetSSL support.",
ErrorCodes::SUPPORT_IS_DISABLED};
#endif
@@ -788,6 +791,7 @@
LOG_INFO(log, "Listening for MySQL compatibility protocol: " + address.toString());
#else
UNUSED(port);
throw Exception{"SSL support for MySQL protocol is disabled because Poco library was built without NetSSL support.",
ErrorCodes::SUPPORT_IS_DISABLED};
#endif
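Each of these hunks adds `UNUSED(port)` to the `#else` branch so the compiler does not warn about an unused variable when the build lacks NetSSL support and control falls straight through to the `throw`. A generic sketch of the pattern (the `UNUSED` macro and feature flag here are local stand-ins, not the ClickHouse definitions):

```cpp
#include <stdexcept>

// Stand-in for ClickHouse's UNUSED macro: evaluate and discard the argument.
#define UNUSED(x) (void)(x)

void listenSecure(int port)
{
#if defined(USE_SSL) // hypothetical feature flag for this sketch
    startTlsListener(port);
#else
    UNUSED(port); // otherwise the compiler warns that `port` is unused here
    throw std::runtime_error("SSL support is disabled in this build.");
#endif
}

int main()
{
    try { listenSecure(9440); } catch (const std::exception &) {}
}
```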


@@ -707,7 +707,7 @@ void Dwarf::LineNumberVM::init()
lineRange_ = read<uint8_t>(header);
opcodeBase_ = read<uint8_t>(header);
SAFE_CHECK(opcodeBase_ != 0, "invalid opcode base");
standardOpcodeLengths_ = reinterpret_cast<const uint8_t *>(header.data());
standardOpcodeLengths_ = reinterpret_cast<const uint8_t *>(header.data()); //-V506
header.remove_prefix(opcodeBase_ - 1);
// We don't want to use heap, so we don't keep an unbounded amount of state.


@@ -127,7 +127,7 @@ namespace ErrorCodes
extern const int INCORRECT_DATA = 117;
extern const int ENGINE_REQUIRED = 119;
extern const int CANNOT_INSERT_VALUE_OF_DIFFERENT_SIZE_INTO_TUPLE = 120;
extern const int UNKNOWN_SET_DATA_VARIANT = 121;
extern const int UNSUPPORTED_JOIN_KEYS = 121;
extern const int INCOMPATIBLE_COLUMNS = 122;
extern const int UNKNOWN_TYPE_OF_AST_NODE = 123;
extern const int INCORRECT_ELEMENT_OF_SET = 124;


@@ -475,6 +475,9 @@ static ReturnType checkBlockStructure(const Block & lhs, const Block & rhs, cons
return on_error("Block structure mismatch in " + context_description + " stream: different types:\n"
+ lhs.dumpStructure() + "\n" + rhs.dumpStructure(), ErrorCodes::BLOCKS_HAVE_DIFFERENT_STRUCTURE);
if (!actual.column || !expected.column)
continue;
if (actual.column->getName() != expected.column->getName())
return on_error("Block structure mismatch in " + context_description + " stream: different columns:\n"
+ lhs.dumpStructure() + "\n" + rhs.dumpStructure(), ErrorCodes::BLOCKS_HAVE_DIFFERENT_STRUCTURE);
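The added `continue` skips the value comparison when either side has no materialized column: in ClickHouse a header block describes structure only, so its column pointers may be empty, and dereferencing them here would crash. A rough standalone illustration of the guard (simplified stand-in types, not the real `Block`):

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in: the column payload may be null in header-only blocks.
struct ColumnWithTypeAndName
{
    std::shared_ptr<std::string> column;
    std::string type;
};

bool sameStructure(const std::vector<ColumnWithTypeAndName> & lhs,
                   const std::vector<ColumnWithTypeAndName> & rhs)
{
    if (lhs.size() != rhs.size())
        return false;
    for (size_t i = 0; i < lhs.size(); ++i)
    {
        if (lhs[i].type != rhs[i].type)
            return false;
        if (!lhs[i].column || !rhs[i].column)
            continue; // nothing more to compare without materialized columns
        if (*lhs[i].column != *rhs[i].column)
            return false;
    }
    return true;
}

int main()
{
    std::vector<ColumnWithTypeAndName> header{{nullptr, "UInt64"}};
    std::vector<ColumnWithTypeAndName> data{{std::make_shared<std::string>("x"), "UInt64"}};
    std::cout << std::boolalpha << sameStructure(header, data) << '\n'; // true
}
```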


@@ -32,7 +32,7 @@ public:
String getName() const override { return "NullAndDoCopy"; }
Block getHeader() const override { return {}; }
Block getHeader() const override { return input->getHeader(); }
protected:
Block readImpl() override


@@ -1,4 +1,5 @@
#include <DataStreams/AddingDefaultBlockOutputStream.h>
#include <DataStreams/ConvertingBlockInputStream.h>
#include <DataStreams/PushingToViewsBlockOutputStream.h>
#include <DataStreams/SquashingBlockInputStream.h>
#include <DataTypes/NestedUtils.h>
@@ -192,6 +193,7 @@ void PushingToViewsBlockOutputStream::process(const Block & block, size_t view_n
/// and two-level aggregation is triggered).
in = std::make_shared<SquashingBlockInputStream>(
in, context.getSettingsRef().min_insert_block_size_rows, context.getSettingsRef().min_insert_block_size_bytes);
in = std::make_shared<ConvertingBlockInputStream>(context, in, view.out->getHeader(), ConvertingBlockInputStream::MatchColumnsMode::Position);
in->readPrefix();


@@ -4,8 +4,8 @@
namespace DB
{
SquashingBlockOutputStream::SquashingBlockOutputStream(BlockOutputStreamPtr & dst, const Block & header, size_t min_block_size_rows, size_t min_block_size_bytes)
: output(dst), header(header), transform(min_block_size_rows, min_block_size_bytes)
SquashingBlockOutputStream::SquashingBlockOutputStream(BlockOutputStreamPtr dst, Block header, size_t min_block_size_rows, size_t min_block_size_bytes)
: output(std::move(dst)), header(std::move(header)), transform(min_block_size_rows, min_block_size_bytes)
{
}
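This hunk switches the constructor from reference parameters to by-value parameters that are moved into the members, the standard C++ sink-argument idiom: lvalue arguments are copied once, rvalues are moved all the way through, and the constructor can no longer retain a caller's reference. A minimal illustration with generic types:

```cpp
#include <memory>
#include <string>
#include <utility>

struct IBlockOutputStream {};
using BlockOutputStreamPtr = std::shared_ptr<IBlockOutputStream>;
struct Block { std::string data; };

class SquashingBlockOutputStream
{
public:
    // Sink arguments: take by value, then move into the members.
    SquashingBlockOutputStream(BlockOutputStreamPtr dst, Block header)
        : output(std::move(dst)), header(std::move(header)) {}

private:
    BlockOutputStreamPtr output;
    Block header;
};

int main()
{
    auto stream = std::make_shared<IBlockOutputStream>();
    SquashingBlockOutputStream s(std::move(stream), Block{"header"}); // two moves, no copies
}
```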


@@ -12,7 +12,7 @@ namespace DB
class SquashingBlockOutputStream : public IBlockOutputStream
{
public:
SquashingBlockOutputStream(BlockOutputStreamPtr & dst, const Block & header, size_t min_block_size_rows, size_t min_block_size_bytes);
SquashingBlockOutputStream(BlockOutputStreamPtr dst, Block header, size_t min_block_size_rows, size_t min_block_size_bytes);
Block getHeader() const override { return header; }
void write(const Block & block) override;


@@ -1,91 +0,0 @@
#include <IO/ReadBuffer.h>
#include <IO/ReadHelpers.h>
#include <Formats/BinaryRowInputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
namespace DB
{
BinaryRowInputStream::BinaryRowInputStream(ReadBuffer & istr_, const Block & header_, bool with_names_, bool with_types_)
: istr(istr_), header(header_), with_names(with_names_), with_types(with_types_)
{
}
bool BinaryRowInputStream::read(MutableColumns & columns, RowReadExtension &)
{
if (istr.eof())
return false;
size_t num_columns = columns.size();
for (size_t i = 0; i < num_columns; ++i)
header.getByPosition(i).type->deserializeBinary(*columns[i], istr);
return true;
}
void BinaryRowInputStream::readPrefix()
{
/// NOTE The header is completely ignored. This can be easily improved.
UInt64 columns = 0;
String tmp;
if (with_names || with_types)
{
readVarUInt(columns, istr);
}
if (with_names)
{
for (size_t i = 0; i < columns; ++i)
{
readStringBinary(tmp, istr);
}
}
if (with_types)
{
for (size_t i = 0; i < columns; ++i)
{
readStringBinary(tmp, istr);
}
}
}
void registerInputFormatRowBinary(FormatFactory & factory)
{
factory.registerInputFormat("RowBinary", [](
ReadBuffer & buf,
const Block & sample,
const Context &,
UInt64 max_block_size,
UInt64 rows_portion_size,
FormatFactory::ReadCallback callback,
const FormatSettings & settings)
{
return std::make_shared<BlockInputStreamFromRowInputStream>(
std::make_shared<BinaryRowInputStream>(buf, sample, false, false),
sample, max_block_size, rows_portion_size, callback, settings);
});
factory.registerInputFormat("RowBinaryWithNamesAndTypes", [](
ReadBuffer & buf,
const Block & sample,
const Context &,
UInt64 max_block_size,
UInt64 rows_portion_size,
FormatFactory::ReadCallback callback,
const FormatSettings & settings)
{
return std::make_shared<BlockInputStreamFromRowInputStream>(
std::make_shared<BinaryRowInputStream>(buf, sample, true, true),
sample, max_block_size, rows_portion_size, callback, settings);
});
}
}


@@ -1,30 +0,0 @@
#pragma once
#include <Formats/IRowInputStream.h>
#include <Core/Block.h>
namespace DB
{
class ReadBuffer;
/** A stream for inputting data in a binary line-by-line format.
*/
class BinaryRowInputStream : public IRowInputStream
{
public:
BinaryRowInputStream(ReadBuffer & istr_, const Block & sample_, bool with_names_, bool with_types_);
bool read(MutableColumns & columns, RowReadExtension &) override;
void readPrefix() override;
private:
ReadBuffer & istr;
Block header;
bool with_names;
bool with_types;
};
}


@@ -1,77 +0,0 @@
#include <IO/WriteBuffer.h>
#include <IO/WriteHelpers.h>
#include <Columns/IColumn.h>
#include <DataTypes/IDataType.h>
#include <Formats/BinaryRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
namespace DB
{
BinaryRowOutputStream::BinaryRowOutputStream(WriteBuffer & ostr_, const Block & sample_, bool with_names_, bool with_types_)
: ostr(ostr_), with_names(with_names_), with_types(with_types_), sample(sample_)
{
}
void BinaryRowOutputStream::writePrefix()
{
size_t columns = sample.columns();
if (with_names || with_types)
{
writeVarUInt(columns, ostr);
}
if (with_names)
{
for (size_t i = 0; i < columns; ++i)
{
writeStringBinary(sample.safeGetByPosition(i).name, ostr);
}
}
if (with_types)
{
for (size_t i = 0; i < columns; ++i)
{
writeStringBinary(sample.safeGetByPosition(i).type->getName(), ostr);
}
}
}
void BinaryRowOutputStream::flush()
{
ostr.next();
}
void BinaryRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
type.serializeBinary(column, row_num, ostr);
}
void registerOutputFormatRowBinary(FormatFactory & factory)
{
factory.registerOutputFormat("RowBinary", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings &)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<BinaryRowOutputStream>(buf, sample, false, false), sample);
});
factory.registerOutputFormat("RowBinaryWithNamesAndTypes", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings &)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<BinaryRowOutputStream>(buf, sample, true, true), sample);
});
}
}


@@ -1,37 +0,0 @@
#pragma once
#include <Formats/IRowOutputStream.h>
#include <Core/Block.h>
namespace DB
{
class IColumn;
class IDataType;
class WriteBuffer;
/** A stream for outputting data in a binary line-by-line format.
*/
class BinaryRowOutputStream : public IRowOutputStream
{
public:
BinaryRowOutputStream(WriteBuffer & ostr_, const Block & sample_, bool with_names_, bool with_types_);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writePrefix() override;
void flush() override;
String getContentType() const override { return "application/octet-stream"; }
protected:
WriteBuffer & ostr;
bool with_names;
bool with_types;
const Block sample;
};
}


@@ -1,134 +0,0 @@
#include <Formats/CSVRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <IO/WriteHelpers.h>
namespace DB
{
CSVRowOutputStream::CSVRowOutputStream(WriteBuffer & ostr_, const Block & sample_, bool with_names_, const FormatSettings & format_settings)
: ostr(ostr_), sample(sample_), with_names(with_names_), format_settings(format_settings)
{
size_t columns = sample.columns();
data_types.resize(columns);
for (size_t i = 0; i < columns; ++i)
data_types[i] = sample.safeGetByPosition(i).type;
}
void CSVRowOutputStream::flush()
{
ostr.next();
}
void CSVRowOutputStream::writePrefix()
{
size_t columns = sample.columns();
if (with_names)
{
for (size_t i = 0; i < columns; ++i)
{
writeCSVString(sample.safeGetByPosition(i).name, ostr);
writeChar(i == columns - 1 ? '\n' : format_settings.csv.delimiter, ostr);
}
}
}
void CSVRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
type.serializeAsTextCSV(column, row_num, ostr, format_settings);
}
void CSVRowOutputStream::writeFieldDelimiter()
{
writeChar(format_settings.csv.delimiter, ostr);
}
void CSVRowOutputStream::writeRowEndDelimiter()
{
writeChar('\n', ostr);
}
void CSVRowOutputStream::writeSuffix()
{
writeTotals();
writeExtremes();
}
void CSVRowOutputStream::writeTotals()
{
if (totals)
{
size_t columns = totals.columns();
writeChar('\n', ostr);
writeRowStartDelimiter();
for (size_t j = 0; j < columns; ++j)
{
if (j != 0)
writeFieldDelimiter();
writeField(*totals.getByPosition(j).column.get(), *totals.getByPosition(j).type.get(), 0);
}
writeRowEndDelimiter();
}
}
void CSVRowOutputStream::writeExtremes()
{
if (extremes)
{
size_t rows = extremes.rows();
size_t columns = extremes.columns();
writeChar('\n', ostr);
for (size_t i = 0; i < rows; ++i)
{
if (i != 0)
writeRowBetweenDelimiter();
writeRowStartDelimiter();
for (size_t j = 0; j < columns; ++j)
{
if (j != 0)
writeFieldDelimiter();
writeField(*extremes.getByPosition(j).column.get(), *extremes.getByPosition(j).type.get(), i);
}
writeRowEndDelimiter();
}
}
}
void registerOutputFormatCSV(FormatFactory & factory)
{
for (bool with_names : {false, true})
{
factory.registerOutputFormat(with_names ? "CSVWithNames" : "CSV", [=](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<CSVRowOutputStream>(buf, sample, with_names, format_settings), sample);
});
}
}
}


@@ -1,56 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <Formats/IRowOutputStream.h>
#include <Formats/FormatSettings.h>
namespace DB
{
class WriteBuffer;
/** The stream for outputting data in csv format.
* Does not conform with https://tools.ietf.org/html/rfc4180 because it uses LF, not CR LF.
*/
class CSVRowOutputStream : public IRowOutputStream
{
public:
/** with_names - output in the first line a header with column names
* with_types - output in the next line header with the names of the types
*/
CSVRowOutputStream(WriteBuffer & ostr_, const Block & sample_, bool with_names_, const FormatSettings & format_settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeFieldDelimiter() override;
void writeRowEndDelimiter() override;
void writePrefix() override;
void writeSuffix() override;
void flush() override;
void setTotals(const Block & totals_) override { totals = totals_; }
void setExtremes(const Block & extremes_) override { extremes = extremes_; }
/// https://www.iana.org/assignments/media-types/text/csv
String getContentType() const override
{
return String("text/csv; charset=UTF-8; header=") + (with_names ? "present" : "absent");
}
protected:
void writeTotals();
void writeExtremes();
WriteBuffer & ostr;
const Block sample;
bool with_names;
const FormatSettings format_settings;
DataTypes data_types;
Block totals;
Block extremes;
};
}


@@ -1,332 +0,0 @@
#include "CapnProtoRowInputStream.h"
#if USE_CAPNP
#include <IO/ReadBuffer.h>
#include <Interpreters/Context.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
#include <Formats/FormatSchemaInfo.h>
#include <capnp/serialize.h>
#include <capnp/dynamic.h>
#include <capnp/common.h>
#include <boost/algorithm/string.hpp>
#include <boost/range/join.hpp>
#include <common/logger_useful.h>
namespace DB
{
namespace ErrorCodes
{
extern const int BAD_TYPE_OF_FIELD;
extern const int BAD_ARGUMENTS;
extern const int THERE_IS_NO_COLUMN;
extern const int LOGICAL_ERROR;
}
CapnProtoRowInputStream::NestedField split(const Block & header, size_t i)
{
CapnProtoRowInputStream::NestedField field = {{}, i};
// Remove leading dot in field definition, e.g. ".msg" -> "msg"
String name(header.safeGetByPosition(i).name);
if (name.size() > 0 && name[0] == '.')
name.erase(0, 1);
boost::split(field.tokens, name, boost::is_any_of("._"));
return field;
}
Field convertNodeToField(capnp::DynamicValue::Reader value)
{
switch (value.getType())
{
case capnp::DynamicValue::UNKNOWN:
throw Exception("Unknown field type", ErrorCodes::BAD_TYPE_OF_FIELD);
case capnp::DynamicValue::VOID:
return Field();
case capnp::DynamicValue::BOOL:
return value.as<bool>() ? 1u : 0u;
case capnp::DynamicValue::INT:
return value.as<int64_t>();
case capnp::DynamicValue::UINT:
return value.as<uint64_t>();
case capnp::DynamicValue::FLOAT:
return value.as<double>();
case capnp::DynamicValue::TEXT:
{
auto arr = value.as<capnp::Text>();
return String(arr.begin(), arr.size());
}
case capnp::DynamicValue::DATA:
{
auto arr = value.as<capnp::Data>().asChars();
return String(arr.begin(), arr.size());
}
case capnp::DynamicValue::LIST:
{
auto listValue = value.as<capnp::DynamicList>();
Array res(listValue.size());
for (auto i : kj::indices(listValue))
res[i] = convertNodeToField(listValue[i]);
return res;
}
case capnp::DynamicValue::ENUM:
return value.as<capnp::DynamicEnum>().getRaw();
case capnp::DynamicValue::STRUCT:
{
auto structValue = value.as<capnp::DynamicStruct>();
const auto & fields = structValue.getSchema().getFields();
Field field = Tuple(TupleBackend(fields.size()));
TupleBackend & tuple = get<Tuple &>(field).toUnderType();
for (auto i : kj::indices(fields))
tuple[i] = convertNodeToField(structValue.get(fields[i]));
return field;
}
case capnp::DynamicValue::CAPABILITY:
throw Exception("CAPABILITY type not supported", ErrorCodes::BAD_TYPE_OF_FIELD);
case capnp::DynamicValue::ANY_POINTER:
throw Exception("ANY_POINTER type not supported", ErrorCodes::BAD_TYPE_OF_FIELD);
}
return Field();
}
capnp::StructSchema::Field getFieldOrThrow(capnp::StructSchema node, const std::string & field)
{
KJ_IF_MAYBE(child, node.findFieldByName(field))
return *child;
else
throw Exception("Field " + field + " doesn't exist in schema " + node.getShortDisplayName().cStr(), ErrorCodes::THERE_IS_NO_COLUMN);
}
void CapnProtoRowInputStream::createActions(const NestedFieldList & sorted_fields, capnp::StructSchema reader)
{
/// Columns in a table can map to fields in Cap'n'Proto or to structs.
/// Store common parents and their tokens in order to backtrack.
std::vector<capnp::StructSchema::Field> parents;
std::vector<std::string> parent_tokens;
capnp::StructSchema cur_reader = reader;
for (const auto & field : sorted_fields)
{
if (field.tokens.empty())
throw Exception("Logical error in CapnProtoRowInputStream", ErrorCodes::LOGICAL_ERROR);
// Backtrack to common parent
while (field.tokens.size() < parent_tokens.size() + 1
|| !std::equal(parent_tokens.begin(), parent_tokens.end(), field.tokens.begin()))
{
actions.push_back({Action::POP});
parents.pop_back();
parent_tokens.pop_back();
if (parents.empty())
{
cur_reader = reader;
break;
}
else
cur_reader = parents.back().getType().asStruct();
}
// Go forward
while (parent_tokens.size() + 1 < field.tokens.size())
{
const auto & token = field.tokens[parents.size()];
auto node = getFieldOrThrow(cur_reader, token);
if (node.getType().isStruct())
{
// Descend to field structure
parents.emplace_back(node);
parent_tokens.emplace_back(token);
cur_reader = node.getType().asStruct();
actions.push_back({Action::PUSH, node});
}
else if (node.getType().isList())
{
break; // Collect list
}
else
throw Exception("Field " + token + " is neither Struct nor List", ErrorCodes::BAD_TYPE_OF_FIELD);
}
// Read field from the structure
auto node = getFieldOrThrow(cur_reader, field.tokens[parents.size()]);
if (node.getType().isList() && actions.size() > 0 && actions.back().field == node)
{
// The field list here flattens Nested elements into multiple arrays
// In order to map Nested types in Cap'nProto back, they need to be collected
// Since the field names are sorted, the order of field positions must be preserved
// For example, if the fields are { b @0 :Text, a @1 :Text }, the `a` would come first
// even though its position is second.
auto & columns = actions.back().columns;
auto it = std::upper_bound(columns.cbegin(), columns.cend(), field.pos);
columns.insert(it, field.pos);
}
else
{
actions.push_back({Action::READ, node, {field.pos}});
}
}
}
CapnProtoRowInputStream::CapnProtoRowInputStream(ReadBuffer & istr_, const Block & header_, const FormatSchemaInfo& info)
: istr(istr_), header(header_), parser(std::make_shared<SchemaParser>())
{
// Parse the schema and fetch the root object
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
auto schema = parser->impl.parseDiskFile(info.schemaPath(), info.absoluteSchemaPath(), {});
#pragma GCC diagnostic pop
root = schema.getNested(info.messageName()).asStruct();
/**
* The schema typically consists of fields in various nested structures.
* Here we gather the list of fields and sort them in a way so that fields in the same structure are adjacent,
* and the nesting level doesn't decrease to make traversal easier.
*/
NestedFieldList list;
size_t num_columns = header.columns();
for (size_t i = 0; i < num_columns; ++i)
list.push_back(split(header, i));
// Order list first by value of strings then by length of string vector.
std::sort(list.begin(), list.end(), [](const NestedField & a, const NestedField & b) { return a.tokens < b.tokens; });
createActions(list, root);
}
kj::Array<capnp::word> CapnProtoRowInputStream::readMessage()
{
uint32_t segment_count;
istr.readStrict(reinterpret_cast<char*>(&segment_count), sizeof(uint32_t));
// one for segmentCount and one because segmentCount starts from 0
const auto prefix_size = (2 + segment_count) * sizeof(uint32_t);
const auto words_prefix_size = (segment_count + 1) / 2 + 1;
auto prefix = kj::heapArray<capnp::word>(words_prefix_size);
auto prefix_chars = prefix.asChars();
::memcpy(prefix_chars.begin(), &segment_count, sizeof(uint32_t));
// read size of each segment
for (size_t i = 0; i <= segment_count; ++i)
istr.readStrict(prefix_chars.begin() + ((i + 1) * sizeof(uint32_t)), sizeof(uint32_t));
// calculate size of message
const auto expected_words = capnp::expectedSizeInWordsFromPrefix(prefix);
const auto expected_bytes = expected_words * sizeof(capnp::word);
const auto data_size = expected_bytes - prefix_size;
auto msg = kj::heapArray<capnp::word>(expected_words);
auto msg_chars = msg.asChars();
// read full message
::memcpy(msg_chars.begin(), prefix_chars.begin(), prefix_size);
istr.readStrict(msg_chars.begin() + prefix_size, data_size);
return msg;
}
bool CapnProtoRowInputStream::read(MutableColumns & columns, RowReadExtension &)
{
if (istr.eof())
return false;
auto array = readMessage();
#if CAPNP_VERSION >= 8000
capnp::UnalignedFlatArrayMessageReader msg(array);
#else
capnp::FlatArrayMessageReader msg(array);
#endif
std::vector<capnp::DynamicStruct::Reader> stack;
stack.push_back(msg.getRoot<capnp::DynamicStruct>(root));
for (auto action : actions)
{
switch (action.type)
{
case Action::READ:
{
Field value = convertNodeToField(stack.back().get(action.field));
if (action.columns.size() > 1)
{
// Nested columns must be flattened into several arrays
// e.g. Array(Tuple(x ..., y ...)) -> Array(x ...), Array(y ...)
const Array & collected = DB::get<const Array &>(value);
size_t size = collected.size();
// The flattened array contains an array of a part of the nested tuple
Array flattened(size);
for (size_t column_index = 0; column_index < action.columns.size(); ++column_index)
{
// Populate array with a single tuple elements
for (size_t off = 0; off < size; ++off)
{
const TupleBackend & tuple = DB::get<const Tuple &>(collected[off]).toUnderType();
flattened[off] = tuple[column_index];
}
auto & col = columns[action.columns[column_index]];
col->insert(flattened);
}
}
else
{
auto & col = columns[action.columns[0]];
col->insert(value);
}
break;
}
case Action::POP:
stack.pop_back();
break;
case Action::PUSH:
stack.push_back(stack.back().get(action.field).as<capnp::DynamicStruct>());
break;
}
}
return true;
}
void registerInputFormatCapnProto(FormatFactory & factory)
{
factory.registerInputFormat(
"CapnProto",
[](ReadBuffer & buf,
const Block & sample,
const Context & context,
UInt64 max_block_size,
UInt64 rows_portion_size,
FormatFactory::ReadCallback callback,
const FormatSettings & settings)
{
return std::make_shared<BlockInputStreamFromRowInputStream>(
std::make_shared<CapnProtoRowInputStream>(buf, sample, FormatSchemaInfo(context, "CapnProto")),
sample,
max_block_size,
rows_portion_size,
callback,
settings);
});
}
}
#else
namespace DB
{
class FormatFactory;
void registerInputFormatCapnProto(FormatFactory &) {}
}
#endif // USE_CAPNP


@@ -1,76 +0,0 @@
#pragma once
#include "config_formats.h"
#if USE_CAPNP
#include <Core/Block.h>
#include <Formats/IRowInputStream.h>
#include <capnp/schema-parser.h>
namespace DB
{
class FormatSchemaInfo;
class ReadBuffer;
/** A stream for reading messages in Cap'n Proto format in given schema.
* Like Protocol Buffers and Thrift (but unlike JSON or MessagePack),
* Cap'n Proto messages are strongly-typed and not self-describing.
* The schema in this case cannot be compiled in, so it uses a runtime schema parser.
* See https://capnproto.org/cxx.html
*/
class CapnProtoRowInputStream : public IRowInputStream
{
public:
struct NestedField
{
std::vector<std::string> tokens;
size_t pos;
};
using NestedFieldList = std::vector<NestedField>;
/** schema_dir - base path for schema files
* schema_file - location of the capnproto schema, e.g. "schema.capnp"
* root_object - name to the root object, e.g. "Message"
*/
CapnProtoRowInputStream(ReadBuffer & istr_, const Block & header_, const FormatSchemaInfo & info);
bool read(MutableColumns & columns, RowReadExtension &) override;
private:
kj::Array<capnp::word> readMessage();
// Build a traversal plan from a sorted list of fields
void createActions(const NestedFieldList & sortedFields, capnp::StructSchema reader);
/* Action for state machine for traversing nested structures. */
using BlockPositionList = std::vector<size_t>;
struct Action
{
enum Type { POP, PUSH, READ };
Type type{};
capnp::StructSchema::Field field{};
BlockPositionList columns{};
};
// Wrapper for classes that could throw in destructor
// https://github.com/capnproto/capnproto/issues/553
template <typename T>
struct DestructorCatcher
{
T impl;
template <typename ... Arg>
DestructorCatcher(Arg && ... args) : impl(kj::fwd<Arg>(args)...) {}
~DestructorCatcher() noexcept try { } catch (...) { return; }
};
using SchemaParser = DestructorCatcher<capnp::SchemaParser>;
ReadBuffer & istr;
Block header;
std::shared_ptr<SchemaParser> parser;
capnp::StructSchema root;
std::vector<Action> actions;
};
}
#endif // USE_CAPNP


@@ -5,6 +5,10 @@
#include <Formats/FormatSettings.h>
#include <Formats/FormatFactory.h>
#include <Processors/Formats/IRowInputFormat.h>
#include <Processors/Formats/InputStreamFromInputFormat.h>
#include <Processors/Formats/OutputStreamToOutputFormat.h>
#include <DataStreams/SquashingBlockOutputStream.h>
#include <DataStreams/NativeBlockInputStream.h>
namespace DB
@@ -18,7 +22,6 @@ namespace ErrorCodes
extern const int FORMAT_IS_NOT_SUITABLE_FOR_OUTPUT;
}
const FormatFactory::Creators & FormatFactory::getCreators(const String & name) const
{
auto it = dict.find(name);
@@ -27,13 +30,6 @@ const FormatFactory::Creators & FormatFactory::getCreators(const String & name)
throw Exception("Unknown format " + name, ErrorCodes::UNKNOWN_FORMAT);
}
const FormatFactory::ProcessorCreators & FormatFactory::getProcessorCreators(const String & name) const
{
auto it = processors_dict.find(name);
if (processors_dict.end() != it)
return it->second;
throw Exception("Unknown format " + name, ErrorCodes::UNKNOWN_FORMAT);
}
static FormatSettings getInputFormatSetting(const Settings & settings)
{
@@ -83,38 +79,54 @@ BlockInputStreamPtr FormatFactory::getInput(
UInt64 rows_portion_size,
ReadCallback callback) const
{
const auto & input_getter = getCreators(name).first;
if (!input_getter)
throw Exception("Format " + name + " is not suitable for input", ErrorCodes::FORMAT_IS_NOT_SUITABLE_FOR_INPUT);
if (name == "Native")
return std::make_shared<NativeBlockInputStream>(buf, sample, 0);
const Settings & settings = context.getSettingsRef();
FormatSettings format_settings = getInputFormatSetting(settings);
if (!getCreators(name).input_processor_creator)
return getInput(name, buf, sample, context, max_block_size, rows_portion_size, std::move(callback));
return input_getter(
buf, sample, context, max_block_size, rows_portion_size, callback ? callback : ReadCallback(), format_settings);
auto format = getInputFormat(name, buf, sample, context, max_block_size, rows_portion_size, std::move(callback));
return std::make_shared<InputStreamFromInputFormat>(std::move(format));
}
BlockOutputStreamPtr FormatFactory::getOutput(const String & name, WriteBuffer & buf, const Block & sample, const Context & context) const
{
const auto & output_getter = getCreators(name).second;
if (!output_getter)
throw Exception("Format " + name + " is not suitable for output", ErrorCodes::FORMAT_IS_NOT_SUITABLE_FOR_OUTPUT);
if (name == "PrettyCompactMonoBlock")
{
/// TODO: rewrite
auto format = getOutputFormat("PrettyCompact", buf, sample, context);
auto res = std::make_shared<SquashingBlockOutputStream>(
std::make_shared<OutputStreamToOutputFormat>(format),
sample, context.getSettingsRef().output_format_pretty_max_rows, 0);
const Settings & settings = context.getSettingsRef();
FormatSettings format_settings = getOutputFormatSetting(settings);
res->disableFlush();
return std::make_shared<MaterializingBlockOutputStream>(res, sample);
}
if (!getCreators(name).output_processor_creator)
return getOutput(name, buf, sample, context);
auto format = getOutputFormat(name, buf, sample, context);
/** Materialization is needed, because formats can use the functions `IDataType`,
* which only work with full columns.
*/
return std::make_shared<MaterializingBlockOutputStream>(
output_getter(buf, sample, context, format_settings), sample);
return std::make_shared<MaterializingBlockOutputStream>(std::make_shared<OutputStreamToOutputFormat>(format), sample);
}
InputFormatPtr FormatFactory::getInputFormat(const String & name, ReadBuffer & buf, const Block & sample, const Context & context, UInt64 max_block_size) const
InputFormatPtr FormatFactory::getInputFormat(
const String & name,
ReadBuffer & buf,
const Block & sample,
const Context & context,
UInt64 max_block_size,
UInt64 rows_portion_size,
ReadCallback callback) const
{
const auto & input_getter = getProcessorCreators(name).first;
const auto & input_getter = getCreators(name).input_processor_creator;
if (!input_getter)
throw Exception("Format " + name + " is not suitable for input", ErrorCodes::FORMAT_IS_NOT_SUITABLE_FOR_INPUT);
@@ -125,6 +137,10 @@ InputFormatPtr FormatFactory::getInputFormat(const String & name, ReadBuffer & b
params.max_block_size = max_block_size;
params.allow_errors_num = format_settings.input_allow_errors_num;
params.allow_errors_ratio = format_settings.input_allow_errors_ratio;
params.rows_portion_size = rows_portion_size;
params.callback = std::move(callback);
params.max_execution_time = settings.max_execution_time;
params.timeout_overflow_mode = settings.timeout_overflow_mode;
return input_getter(buf, sample, context, params, format_settings);
}
@@ -132,7 +148,7 @@ InputFormatPtr FormatFactory::getInputFormat(const String & name, ReadBuffer & b
OutputFormatPtr FormatFactory::getOutputFormat(const String & name, WriteBuffer & buf, const Block & sample, const Context & context) const
{
const auto & output_getter = getProcessorCreators(name).second;
const auto & output_getter = getCreators(name).output_processor_creator;
if (!output_getter)
throw Exception("Format " + name + " is not suitable for output", ErrorCodes::FORMAT_IS_NOT_SUITABLE_FOR_OUTPUT);
@@ -148,7 +164,7 @@ OutputFormatPtr FormatFactory::getOutputFormat(const String & name, WriteBuffer
void FormatFactory::registerInputFormat(const String & name, InputCreator input_creator)
{
auto & target = dict[name].first;
auto & target = dict[name].inout_creator;
if (target)
throw Exception("FormatFactory: Input format " + name + " is already registered", ErrorCodes::LOGICAL_ERROR);
target = std::move(input_creator);
@@ -156,7 +172,7 @@ void FormatFactory::registerInputFormat(const String & name, InputCreator input_
void FormatFactory::registerOutputFormat(const String & name, OutputCreator output_creator)
{
auto & target = dict[name].second;
auto & target = dict[name].output_creator;
if (target)
throw Exception("FormatFactory: Output format " + name + " is already registered", ErrorCodes::LOGICAL_ERROR);
target = std::move(output_creator);
@@ -164,7 +180,7 @@ void FormatFactory::registerOutputFormat(const String & name, OutputCreator outp
void FormatFactory::registerInputFormatProcessor(const String & name, InputProcessorCreator input_creator)
{
auto & target = processors_dict[name].first;
auto & target = dict[name].input_processor_creator;
if (target)
throw Exception("FormatFactory: Input format " + name + " is already registered", ErrorCodes::LOGICAL_ERROR);
target = std::move(input_creator);
@@ -172,7 +188,7 @@ void FormatFactory::registerInputFormatProcessor(const String & name, InputProce
void FormatFactory::registerOutputFormatProcessor(const String & name, OutputProcessorCreator output_creator)
{
auto & target = processors_dict[name].second;
auto & target = dict[name].output_processor_creator;
if (target)
throw Exception("FormatFactory: Output format " + name + " is already registered", ErrorCodes::LOGICAL_ERROR);
target = std::move(output_creator);
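These hunks fold the separate `processors_dict` into the single `dict`, whose values now hold all four creator slots in one `Creators` struct (declared later in FormatFactory.h); each `register*` function writes into its own slot and rejects duplicates. A schematic sketch of that registry shape (toy creator types, not the real signatures):

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Toy creator type standing in for the four real creator signatures.
using Creator = std::function<int()>;

struct Creators
{
    Creator input_creator;
    Creator output_creator;
    Creator input_processor_creator;
    Creator output_processor_creator;
};

class FormatRegistry
{
    std::unordered_map<std::string, Creators> dict; // one map for all four slots

public:
    void registerInputProcessor(const std::string & name, Creator creator)
    {
        auto & target = dict[name].input_processor_creator;
        if (target)
            throw std::runtime_error("Input format " + name + " is already registered");
        target = std::move(creator);
    }
};

int main()
{
    FormatRegistry registry;
    registry.registerInputProcessor("Native", [] { return 1; });
}
```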
@@ -183,22 +199,8 @@ void FormatFactory::registerOutputFormatProcessor(const String & name, OutputPro
void registerInputFormatNative(FormatFactory & factory);
void registerOutputFormatNative(FormatFactory & factory);
void registerInputFormatRowBinary(FormatFactory & factory);
void registerOutputFormatRowBinary(FormatFactory & factory);
void registerInputFormatTabSeparated(FormatFactory & factory);
void registerOutputFormatTabSeparated(FormatFactory & factory);
void registerInputFormatValues(FormatFactory & factory);
void registerOutputFormatValues(FormatFactory & factory);
void registerInputFormatCSV(FormatFactory & factory);
void registerOutputFormatCSV(FormatFactory & factory);
void registerInputFormatTSKV(FormatFactory & factory);
void registerOutputFormatTSKV(FormatFactory & factory);
void registerInputFormatJSONEachRow(FormatFactory & factory);
void registerOutputFormatJSONEachRow(FormatFactory & factory);
void registerInputFormatParquet(FormatFactory & factory);
void registerOutputFormatParquet(FormatFactory & factory);
void registerInputFormatProtobuf(FormatFactory & factory);
void registerOutputFormatProtobuf(FormatFactory & factory);
void registerInputFormatProcessorNative(FormatFactory & factory);
void registerOutputFormatProcessorNative(FormatFactory & factory);
@@ -221,17 +223,7 @@ void registerOutputFormatProcessorProtobuf(FormatFactory & factory);
/// Output only (presentational) formats.
void registerOutputFormatPretty(FormatFactory & factory);
void registerOutputFormatPrettyCompact(FormatFactory & factory);
void registerOutputFormatPrettySpace(FormatFactory & factory);
void registerOutputFormatVertical(FormatFactory & factory);
void registerOutputFormatJSON(FormatFactory & factory);
void registerOutputFormatJSONCompact(FormatFactory & factory);
void registerOutputFormatXML(FormatFactory & factory);
void registerOutputFormatODBCDriver(FormatFactory & factory);
void registerOutputFormatODBCDriver2(FormatFactory & factory);
void registerOutputFormatNull(FormatFactory & factory);
void registerOutputFormatMySQLWire(FormatFactory & factory);
void registerOutputFormatProcessorPretty(FormatFactory & factory);
void registerOutputFormatProcessorPrettyCompact(FormatFactory & factory);
@@ -246,34 +238,14 @@ void registerOutputFormatProcessorNull(FormatFactory & factory);
void registerOutputFormatProcessorMySQLWrite(FormatFactory & factory);
/// Input only formats.
void registerInputFormatCapnProto(FormatFactory & factory);
void registerInputFormatProcessorCapnProto(FormatFactory & factory);
FormatFactory::FormatFactory()
{
registerInputFormatNative(*this);
registerOutputFormatNative(*this);
registerInputFormatRowBinary(*this);
registerOutputFormatRowBinary(*this);
registerInputFormatTabSeparated(*this);
registerOutputFormatTabSeparated(*this);
registerInputFormatValues(*this);
registerOutputFormatValues(*this);
registerInputFormatCSV(*this);
registerOutputFormatCSV(*this);
registerInputFormatTSKV(*this);
registerOutputFormatTSKV(*this);
registerInputFormatJSONEachRow(*this);
registerOutputFormatJSONEachRow(*this);
registerInputFormatProtobuf(*this);
registerOutputFormatProtobuf(*this);
registerInputFormatCapnProto(*this);
registerInputFormatParquet(*this);
registerOutputFormatParquet(*this);
registerOutputFormatMySQLWire(*this);
registerInputFormatProcessorNative(*this);
registerOutputFormatProcessorNative(*this);
@@ -295,15 +267,7 @@ FormatFactory::FormatFactory()
registerInputFormatProcessorParquet(*this);
registerOutputFormatProcessorParquet(*this);
registerOutputFormatPretty(*this);
registerOutputFormatPrettyCompact(*this);
registerOutputFormatPrettySpace(*this);
registerOutputFormatVertical(*this);
registerOutputFormatJSON(*this);
registerOutputFormatJSONCompact(*this);
registerOutputFormatXML(*this);
registerOutputFormatODBCDriver(*this);
registerOutputFormatODBCDriver2(*this);
registerOutputFormatNull(*this);
registerOutputFormatProcessorPretty(*this);


@@ -70,11 +70,15 @@ private:
const Context & context,
const FormatSettings & settings)>;
using Creators = std::pair<InputCreator, OutputCreator>;
using ProcessorCreators = std::pair<InputProcessorCreator, OutputProcessorCreator>;
struct Creators
{
InputCreator inout_creator;
OutputCreator output_creator;
InputProcessorCreator input_processor_creator;
OutputProcessorCreator output_processor_creator;
};
using FormatsDictionary = std::unordered_map<String, Creators>;
using FormatProcessorsDictionary = std::unordered_map<String, ProcessorCreators>;
public:
BlockInputStreamPtr getInput(
@@ -89,8 +93,14 @@ public:
BlockOutputStreamPtr getOutput(const String & name, WriteBuffer & buf,
const Block & sample, const Context & context) const;
InputFormatPtr getInputFormat(const String & name, ReadBuffer & buf,
const Block & sample, const Context & context, UInt64 max_block_size) const;
InputFormatPtr getInputFormat(
const String & name,
ReadBuffer & buf,
const Block & sample,
const Context & context,
UInt64 max_block_size,
UInt64 rows_portion_size = 0,
ReadCallback callback = {}) const;
OutputFormatPtr getOutputFormat(const String & name, WriteBuffer & buf,
const Block & sample, const Context & context) const;
@@ -108,14 +118,13 @@ public:
}
private:
/// FormatsDictionary dict;
FormatsDictionary dict;
FormatProcessorsDictionary processors_dict;
FormatFactory();
friend class ext::singleton<FormatFactory>;
const Creators & getCreators(const String & name) const;
const ProcessorCreators & getProcessorCreators(const String & name) const;
};
}


@@ -1,120 +0,0 @@
#include <Formats/JSONCompactRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <IO/WriteHelpers.h>
namespace DB
{
JSONCompactRowOutputStream::JSONCompactRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & settings_)
: JSONRowOutputStream(ostr_, sample_, settings_)
{
}
void JSONCompactRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
type.serializeAsTextJSON(column, row_num, *ostr, settings);
++field_number;
}
void JSONCompactRowOutputStream::writeFieldDelimiter()
{
writeCString(", ", *ostr);
}
void JSONCompactRowOutputStream::writeRowStartDelimiter()
{
if (row_count > 0)
writeCString(",\n", *ostr);
writeCString("\t\t[", *ostr);
}
void JSONCompactRowOutputStream::writeRowEndDelimiter()
{
writeChar(']', *ostr);
field_number = 0;
++row_count;
}
void JSONCompactRowOutputStream::writeTotals()
{
if (totals)
{
writeCString(",\n", *ostr);
writeChar('\n', *ostr);
writeCString("\t\"totals\": [", *ostr);
size_t totals_columns = totals.columns();
for (size_t i = 0; i < totals_columns; ++i)
{
if (i != 0)
writeChar(',', *ostr);
const ColumnWithTypeAndName & column = totals.safeGetByPosition(i);
column.type->serializeAsTextJSON(*column.column.get(), 0, *ostr, settings);
}
writeChar(']', *ostr);
}
}
static void writeExtremesElement(const char * title, const Block & extremes, size_t row_num, WriteBuffer & ostr, const FormatSettings & settings)
{
writeCString("\t\t\"", ostr);
writeCString(title, ostr);
writeCString("\": [", ostr);
size_t extremes_columns = extremes.columns();
for (size_t i = 0; i < extremes_columns; ++i)
{
if (i != 0)
writeChar(',', ostr);
const ColumnWithTypeAndName & column = extremes.safeGetByPosition(i);
column.type->serializeAsTextJSON(*column.column.get(), row_num, ostr, settings);
}
writeChar(']', ostr);
}
void JSONCompactRowOutputStream::writeExtremes()
{
if (extremes)
{
writeCString(",\n", *ostr);
writeChar('\n', *ostr);
writeCString("\t\"extremes\":\n", *ostr);
writeCString("\t{\n", *ostr);
writeExtremesElement("min", extremes, 0, *ostr, settings);
writeCString(",\n", *ostr);
writeExtremesElement("max", extremes, 1, *ostr, settings);
writeChar('\n', *ostr);
writeCString("\t}", *ostr);
}
}
void registerOutputFormatJSONCompact(FormatFactory & factory)
{
factory.registerOutputFormat("JSONCompact", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<JSONCompactRowOutputStream>(buf, sample, format_settings), sample);
});
}
}


@@ -1,31 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <IO/WriteBuffer.h>
#include <IO/WriteBufferValidUTF8.h>
#include <Formats/JSONRowOutputStream.h>
namespace DB
{
struct FormatSettings;
/** The stream for outputting data in the JSONCompact format.
*/
class JSONCompactRowOutputStream : public JSONRowOutputStream
{
public:
JSONCompactRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeFieldDelimiter() override;
void writeRowStartDelimiter() override;
void writeRowEndDelimiter() override;
protected:
void writeTotals() override;
void writeExtremes() override;
};
}


@@ -1,272 +0,0 @@
#include <IO/ReadHelpers.h>
#include <Formats/JSONEachRowRowInputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
#include <DataTypes/NestedUtils.h>
namespace DB
{
namespace ErrorCodes
{
extern const int INCORRECT_DATA;
extern const int CANNOT_READ_ALL_DATA;
extern const int LOGICAL_ERROR;
}
namespace
{
enum
{
UNKNOWN_FIELD = size_t(-1),
NESTED_FIELD = size_t(-2)
};
}
JSONEachRowRowInputStream::JSONEachRowRowInputStream(ReadBuffer & istr_, const Block & header_, const FormatSettings & format_settings)
: istr(istr_), header(header_), format_settings(format_settings), name_map(header.columns())
{
/// In this format, BOM at beginning of stream cannot be confused with value, so it is safe to skip it.
skipBOMIfExists(istr);
size_t num_columns = header.columns();
for (size_t i = 0; i < num_columns; ++i)
{
const String & colname = columnName(i);
name_map[colname] = i; /// NOTE You could place names more cache-locally.
if (format_settings.import_nested_json)
{
const auto splitted = Nested::splitName(colname);
if (!splitted.second.empty())
{
const StringRef table_name(colname.data(), splitted.first.size());
name_map[table_name] = NESTED_FIELD;
}
}
}
prev_positions.assign(num_columns, name_map.end());
}
const String & JSONEachRowRowInputStream::columnName(size_t i) const
{
return header.getByPosition(i).name;
}
inline size_t JSONEachRowRowInputStream::columnIndex(const StringRef & name, size_t key_index)
{
/// Optimization by caching the order of fields (which is almost always the same)
/// and a quick check to match the next expected field, instead of searching the hash table.
if (prev_positions.size() > key_index
&& prev_positions[key_index] != name_map.end()
&& name == prev_positions[key_index]->getFirst())
{
return prev_positions[key_index]->getSecond();
}
else
{
const auto it = name_map.find(name);
if (name_map.end() != it)
{
if (key_index < prev_positions.size())
prev_positions[key_index] = it;
return it->getSecond();
}
else
return UNKNOWN_FIELD;
}
}
/** Read the field name and convert it to column name
* (taking into account the current nested name prefix)
* Resulting StringRef is valid only before next read from buf.
*/
StringRef JSONEachRowRowInputStream::readColumnName(ReadBuffer & buf)
{
// This is just an optimization: try to avoid copying the name into current_column_name
if (nested_prefix_length == 0 && buf.position() + 1 < buf.buffer().end())
{
char * next_pos = find_first_symbols<'\\', '"'>(buf.position() + 1, buf.buffer().end());
if (next_pos != buf.buffer().end() && *next_pos != '\\')
{
/// The most likely option is that there is no escape sequence in the key name, and the entire name is placed in the buffer.
assertChar('"', buf);
StringRef res(buf.position(), next_pos - buf.position());
buf.position() = next_pos + 1;
return res;
}
}
current_column_name.resize(nested_prefix_length);
readJSONStringInto(current_column_name, buf);
return current_column_name;
}
static inline void skipColonDelimeter(ReadBuffer & istr)
{
skipWhitespaceIfAny(istr);
assertChar(':', istr);
skipWhitespaceIfAny(istr);
}
void JSONEachRowRowInputStream::skipUnknownField(const StringRef & name_ref)
{
if (!format_settings.skip_unknown_fields)
throw Exception("Unknown field found while parsing JSONEachRow format: " + name_ref.toString(), ErrorCodes::INCORRECT_DATA);
skipJSONField(istr, name_ref);
}
void JSONEachRowRowInputStream::readField(size_t index, MutableColumns & columns)
{
if (read_columns[index])
throw Exception("Duplicate field found while parsing JSONEachRow format: " + columnName(index), ErrorCodes::INCORRECT_DATA);
try
{
header.getByPosition(index).type->deserializeAsTextJSON(*columns[index], istr, format_settings);
}
catch (Exception & e)
{
e.addMessage("(while read the value of key " + columnName(index) + ")");
throw;
}
read_columns[index] = true;
}
inline bool JSONEachRowRowInputStream::advanceToNextKey(size_t key_index)
{
skipWhitespaceIfAny(istr);
if (istr.eof())
throw Exception("Unexpected end of stream while parsing JSONEachRow format", ErrorCodes::CANNOT_READ_ALL_DATA);
else if (*istr.position() == '}')
{
++istr.position();
return false;
}
if (key_index > 0)
{
assertChar(',', istr);
skipWhitespaceIfAny(istr);
}
return true;
}
void JSONEachRowRowInputStream::readJSONObject(MutableColumns & columns)
{
assertChar('{', istr);
for (size_t key_index = 0; advanceToNextKey(key_index); ++key_index)
{
StringRef name_ref = readColumnName(istr);
const size_t column_index = columnIndex(name_ref, key_index);
if (unlikely(ssize_t(column_index) < 0))
{
/// name_ref may point directly to the input buffer
/// and input buffer may be filled with new data on next read
/// If we want to use name_ref after another reads from buffer, we must copy it to temporary string.
current_column_name.assign(name_ref.data, name_ref.size);
name_ref = StringRef(current_column_name);
skipColonDelimeter(istr);
if (column_index == UNKNOWN_FIELD)
skipUnknownField(name_ref);
else if (column_index == NESTED_FIELD)
readNestedData(name_ref.toString(), columns);
else
throw Exception("Logical error: illegal value of column_index", ErrorCodes::LOGICAL_ERROR);
}
else
{
skipColonDelimeter(istr);
readField(column_index, columns);
}
}
}
void JSONEachRowRowInputStream::readNestedData(const String & name, MutableColumns & columns)
{
current_column_name = name;
current_column_name.push_back('.');
nested_prefix_length = current_column_name.size();
readJSONObject(columns);
nested_prefix_length = 0;
}
bool JSONEachRowRowInputStream::read(MutableColumns & columns, RowReadExtension & ext)
{
skipWhitespaceIfAny(istr);
/// We consume `,`, `;`, or `\n` before scanning a new row, instead of scanning to the next row at the end.
/// The reason is that if we want to read an exact number of rows with LIMIT x
/// from a streaming table engine with a text data format, like File or Kafka,
/// then seeking to the next `,`, `;`, or `\n` would trigger reading of an extra row at the end.
/// The semicolon is accepted for convenience, as it may appear at the end of an INSERT query.
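/// For example, the input '{"x":1},\n{"x":2};' is read as exactly two rows.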
if (!istr.eof() && (*istr.position() == ',' || *istr.position() == ';'))
++istr.position();
skipWhitespaceIfAny(istr);
if (istr.eof())
return false;
size_t num_columns = columns.size();
/// Set of columns for which the values were read. The rest will be filled with default values.
read_columns.assign(num_columns, false);
nested_prefix_length = 0;
readJSONObject(columns);
/// Fill non-visited columns with the default values.
for (size_t i = 0; i < num_columns; ++i)
if (!read_columns[i])
header.getByPosition(i).type->insertDefaultInto(*columns[i]);
/// return info about defaults set
ext.read_columns = read_columns;
return true;
}
void JSONEachRowRowInputStream::syncAfterError()
{
skipToUnescapedNextLineOrEOF(istr);
}
void registerInputFormatJSONEachRow(FormatFactory & factory)
{
factory.registerInputFormat("JSONEachRow", [](
ReadBuffer & buf,
const Block & sample,
const Context &,
UInt64 max_block_size,
UInt64 rows_portion_size,
FormatFactory::ReadCallback callback,
const FormatSettings & settings)
{
return std::make_shared<BlockInputStreamFromRowInputStream>(
std::make_shared<JSONEachRowRowInputStream>(buf, sample, settings),
sample, max_block_size, rows_portion_size, callback, settings);
});
}
}

View File

@ -1,68 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <Formats/IRowInputStream.h>
#include <Formats/FormatSettings.h>
#include <Common/HashTable/HashMap.h>
namespace DB
{
class ReadBuffer;
/** A stream for reading data in JSON format, where each row is represented by a separate JSON object.
* Objects can be separated by a line feed, any number of other whitespace characters, and optionally a comma.
* Fields can be listed in any order (the order may differ from line to line),
* and some fields may be missing.
*/
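/// Example input (column names are illustrative):
///   {"id": 1, "name": "Alice"}
///   {"name": "Bob", "id": 2}
///   {"id": 3}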
class JSONEachRowRowInputStream : public IRowInputStream
{
public:
JSONEachRowRowInputStream(ReadBuffer & istr_, const Block & header_, const FormatSettings & format_settings);
bool read(MutableColumns & columns, RowReadExtension & ext) override;
bool allowSyncAfterError() const override { return true; }
void syncAfterError() override;
private:
const String & columnName(size_t i) const;
size_t columnIndex(const StringRef & name, size_t key_index);
bool advanceToNextKey(size_t key_index);
void skipUnknownField(const StringRef & name_ref);
StringRef readColumnName(ReadBuffer & buf);
void readField(size_t index, MutableColumns & columns);
void readJSONObject(MutableColumns & columns);
void readNestedData(const String & name, MutableColumns & columns);
private:
ReadBuffer & istr;
Block header;
const FormatSettings format_settings;
/// Buffer for the field name read from the stream. Used when the name has to be copied.
/// Also, if processing of Nested data is in progress, it holds the common prefix
/// of the nested column names (so that appending the field name to it produces
/// the full column name)
String current_column_name;
/// If processing Nested data, holds the length of the common prefix
/// of the names of related nested columns. For example, for a table
/// created as follows
/// CREATE TABLE t (n Nested (i Int32, s String))
/// the nested column names are 'n.i' and 'n.s' and the nested prefix is 'n.'
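/// With such a table, an input row {"n": {"i": 1, "s": "a"}} fills the columns 'n.i' and 'n.s'.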
size_t nested_prefix_length = 0;
std::vector<UInt8> read_columns;
/// Hash table mapping `field name -> position in the block`. NOTE: a perfect hash map could be used here.
using NameMap = HashMap<StringRef, size_t, StringRefHash>;
NameMap name_map;
/// Cached search results from the previous row (keyed by index in the JSON object), used as a hint.
std::vector<NameMap::iterator> prev_positions;
};
}

View File

@ -1,67 +0,0 @@
#include <IO/WriteHelpers.h>
#include <IO/WriteBufferValidUTF8.h>
#include <Formats/JSONEachRowRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
namespace DB
{
JSONEachRowRowOutputStream::JSONEachRowRowOutputStream(WriteBuffer & ostr_, const Block & sample, const FormatSettings & settings)
: ostr(ostr_), settings(settings)
{
size_t columns = sample.columns();
fields.resize(columns);
for (size_t i = 0; i < columns; ++i)
{
WriteBufferFromString out(fields[i]);
writeJSONString(sample.getByPosition(i).name, out, settings);
}
}
void JSONEachRowRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
writeString(fields[field_number], ostr);
writeChar(':', ostr);
type.serializeAsTextJSON(column, row_num, ostr, settings);
++field_number;
}
void JSONEachRowRowOutputStream::writeFieldDelimiter()
{
writeChar(',', ostr);
}
void JSONEachRowRowOutputStream::writeRowStartDelimiter()
{
writeChar('{', ostr);
}
void JSONEachRowRowOutputStream::writeRowEndDelimiter()
{
writeCString("}\n", ostr);
field_number = 0;
}
void registerOutputFormatJSONEachRow(FormatFactory & factory)
{
factory.registerOutputFormat("JSONEachRow", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<JSONEachRowRowOutputStream>(buf, sample, format_settings), sample);
});
}
}

View File

@ -1,39 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <IO/WriteBuffer.h>
#include <Formats/IRowOutputStream.h>
#include <Formats/FormatSettings.h>
namespace DB
{
/** The stream for outputting data in JSON format, one object per line.
* Does not validate UTF-8.
*/
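/// Example output (column names are illustrative):
///   {"id":1,"name":"Alice"}
///   {"id":2,"name":"Bob"}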
class JSONEachRowRowOutputStream : public IRowOutputStream
{
public:
JSONEachRowRowOutputStream(WriteBuffer & ostr_, const Block & sample, const FormatSettings & settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeFieldDelimiter() override;
void writeRowStartDelimiter() override;
void writeRowEndDelimiter() override;
void flush() override
{
ostr.next();
}
private:
WriteBuffer & ostr;
size_t field_number = 0;
Names fields;
FormatSettings settings;
};
}

View File

@ -1,246 +0,0 @@
#include <IO/WriteHelpers.h>
#include <IO/WriteBufferValidUTF8.h>
#include <Formats/JSONRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
namespace DB
{
JSONRowOutputStream::JSONRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & settings)
: dst_ostr(ostr_), settings(settings)
{
NamesAndTypesList columns(sample_.getNamesAndTypesList());
fields.assign(columns.begin(), columns.end());
bool need_validate_utf8 = false;
for (size_t i = 0; i < sample_.columns(); ++i)
{
if (!sample_.getByPosition(i).type->textCanContainOnlyValidUTF8())
need_validate_utf8 = true;
WriteBufferFromOwnString out;
writeJSONString(fields[i].name, out, settings);
fields[i].name = out.str();
}
if (need_validate_utf8)
{
validating_ostr = std::make_unique<WriteBufferValidUTF8>(dst_ostr);
ostr = validating_ostr.get();
}
else
ostr = &dst_ostr;
}
void JSONRowOutputStream::writePrefix()
{
writeCString("{\n", *ostr);
writeCString("\t\"meta\":\n", *ostr);
writeCString("\t[\n", *ostr);
for (size_t i = 0; i < fields.size(); ++i)
{
writeCString("\t\t{\n", *ostr);
writeCString("\t\t\t\"name\": ", *ostr);
writeString(fields[i].name, *ostr);
writeCString(",\n", *ostr);
writeCString("\t\t\t\"type\": ", *ostr);
writeJSONString(fields[i].type->getName(), *ostr, settings);
writeChar('\n', *ostr);
writeCString("\t\t}", *ostr);
if (i + 1 < fields.size())
writeChar(',', *ostr);
writeChar('\n', *ostr);
}
writeCString("\t],\n", *ostr);
writeChar('\n', *ostr);
writeCString("\t\"data\":\n", *ostr);
writeCString("\t[\n", *ostr);
}
void JSONRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
writeCString("\t\t\t", *ostr);
writeString(fields[field_number].name, *ostr);
writeCString(": ", *ostr);
type.serializeAsTextJSON(column, row_num, *ostr, settings);
++field_number;
}
void JSONRowOutputStream::writeFieldDelimiter()
{
writeCString(",\n", *ostr);
}
void JSONRowOutputStream::writeRowStartDelimiter()
{
if (row_count > 0)
writeCString(",\n", *ostr);
writeCString("\t\t{\n", *ostr);
}
void JSONRowOutputStream::writeRowEndDelimiter()
{
writeChar('\n', *ostr);
writeCString("\t\t}", *ostr);
field_number = 0;
++row_count;
}
void JSONRowOutputStream::writeSuffix()
{
writeChar('\n', *ostr);
writeCString("\t]", *ostr);
writeTotals();
writeExtremes();
writeCString(",\n\n", *ostr);
writeCString("\t\"rows\": ", *ostr);
writeIntText(row_count, *ostr);
writeRowsBeforeLimitAtLeast();
if (settings.write_statistics)
writeStatistics();
writeChar('\n', *ostr);
writeCString("}\n", *ostr);
ostr->next();
}
void JSONRowOutputStream::writeRowsBeforeLimitAtLeast()
{
if (applied_limit)
{
writeCString(",\n\n", *ostr);
writeCString("\t\"rows_before_limit_at_least\": ", *ostr);
writeIntText(rows_before_limit, *ostr);
}
}
void JSONRowOutputStream::writeTotals()
{
if (totals)
{
writeCString(",\n", *ostr);
writeChar('\n', *ostr);
writeCString("\t\"totals\":\n", *ostr);
writeCString("\t{\n", *ostr);
size_t totals_columns = totals.columns();
for (size_t i = 0; i < totals_columns; ++i)
{
const ColumnWithTypeAndName & column = totals.safeGetByPosition(i);
if (i != 0)
writeCString(",\n", *ostr);
writeCString("\t\t", *ostr);
writeJSONString(column.name, *ostr, settings);
writeCString(": ", *ostr);
column.type->serializeAsTextJSON(*column.column.get(), 0, *ostr, settings);
}
writeChar('\n', *ostr);
writeCString("\t}", *ostr);
}
}
static void writeExtremesElement(const char * title, const Block & extremes, size_t row_num, WriteBuffer & ostr, const FormatSettings & settings)
{
writeCString("\t\t\"", ostr);
writeCString(title, ostr);
writeCString("\":\n", ostr);
writeCString("\t\t{\n", ostr);
size_t extremes_columns = extremes.columns();
for (size_t i = 0; i < extremes_columns; ++i)
{
const ColumnWithTypeAndName & column = extremes.safeGetByPosition(i);
if (i != 0)
writeCString(",\n", ostr);
writeCString("\t\t\t", ostr);
writeJSONString(column.name, ostr, settings);
writeCString(": ", ostr);
column.type->serializeAsTextJSON(*column.column.get(), row_num, ostr, settings);
}
writeChar('\n', ostr);
writeCString("\t\t}", ostr);
}
void JSONRowOutputStream::writeExtremes()
{
if (extremes)
{
writeCString(",\n", *ostr);
writeChar('\n', *ostr);
writeCString("\t\"extremes\":\n", *ostr);
writeCString("\t{\n", *ostr);
writeExtremesElement("min", extremes, 0, *ostr, settings);
writeCString(",\n", *ostr);
writeExtremesElement("max", extremes, 1, *ostr, settings);
writeChar('\n', *ostr);
writeCString("\t}", *ostr);
}
}
void JSONRowOutputStream::onProgress(const Progress & value)
{
progress.incrementPiecewiseAtomically(value);
}
void JSONRowOutputStream::writeStatistics()
{
writeCString(",\n\n", *ostr);
writeCString("\t\"statistics\":\n", *ostr);
writeCString("\t{\n", *ostr);
writeCString("\t\t\"elapsed\": ", *ostr);
writeText(watch.elapsedSeconds(), *ostr);
writeCString(",\n", *ostr);
writeCString("\t\t\"rows_read\": ", *ostr);
writeText(progress.read_rows.load(), *ostr);
writeCString(",\n", *ostr);
writeCString("\t\t\"bytes_read\": ", *ostr);
writeText(progress.read_bytes.load(), *ostr);
writeChar('\n', *ostr);
writeCString("\t}", *ostr);
}
void registerOutputFormatJSON(FormatFactory & factory)
{
factory.registerOutputFormat("JSON", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<JSONRowOutputStream>(buf, sample, format_settings), sample);
});
}
}

View File

@ -1,74 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <IO/Progress.h>
#include <IO/WriteBuffer.h>
#include <Common/Stopwatch.h>
#include <Formats/IRowOutputStream.h>
#include <Formats/FormatSettings.h>
namespace DB
{
/** Stream for outputting data in JSON format.
*/
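/// The result is a single JSON object of the form (sketched):
///   { "meta": [...], "data": [...], "rows": N }
/// with optional "totals", "extremes", "rows_before_limit_at_least" and "statistics" sections.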
class JSONRowOutputStream : public IRowOutputStream
{
public:
JSONRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeFieldDelimiter() override;
void writeRowStartDelimiter() override;
void writeRowEndDelimiter() override;
void writePrefix() override;
void writeSuffix() override;
void flush() override
{
ostr->next();
if (validating_ostr)
dst_ostr.next();
}
void setRowsBeforeLimit(size_t rows_before_limit_) override
{
applied_limit = true;
rows_before_limit = rows_before_limit_;
}
void setTotals(const Block & totals_) override { totals = totals_; }
void setExtremes(const Block & extremes_) override { extremes = extremes_; }
void onProgress(const Progress & value) override;
String getContentType() const override { return "application/json; charset=UTF-8"; }
protected:
void writeRowsBeforeLimitAtLeast();
virtual void writeTotals();
virtual void writeExtremes();
void writeStatistics();
WriteBuffer & dst_ostr;
std::unique_ptr<WriteBuffer> validating_ostr; /// Validates UTF-8 sequences, replaces bad sequences with replacement character.
WriteBuffer * ostr;
size_t field_number = 0;
size_t row_count = 0;
bool applied_limit = false;
size_t rows_before_limit = 0;
NamesAndTypes fields;
Block totals;
Block extremes;
Progress progress;
Stopwatch watch;
FormatSettings settings;
};
}

View File

@ -1,85 +0,0 @@
#include "MySQLWireBlockOutputStream.h"
#include <Core/MySQLProtocol.h>
#include <Interpreters/ProcessList.h>
#include <iomanip>
#include <sstream>
namespace DB
{
using namespace MySQLProtocol;
MySQLWireBlockOutputStream::MySQLWireBlockOutputStream(WriteBuffer & buf, const Block & header, Context & context)
: header(header)
, context(context)
, packet_sender(buf, context.mysql.sequence_id)
{
packet_sender.max_packet_size = context.mysql.max_packet_size;
}
void MySQLWireBlockOutputStream::writePrefix()
{
if (header.columns() == 0)
return;
packet_sender.sendPacket(LengthEncodedNumber(header.columns()));
for (const ColumnWithTypeAndName & column : header.getColumnsWithTypeAndName())
{
ColumnDefinition column_definition(column.name, CharacterSet::binary, 0, ColumnType::MYSQL_TYPE_STRING, 0, 0);
packet_sender.sendPacket(column_definition);
}
if (!(context.mysql.client_capabilities & Capability::CLIENT_DEPRECATE_EOF))
{
packet_sender.sendPacket(EOF_Packet(0, 0));
}
}
void MySQLWireBlockOutputStream::write(const Block & block)
{
size_t rows = block.rows();
for (size_t i = 0; i < rows; i++)
{
ResultsetRow row_packet;
for (const ColumnWithTypeAndName & column : block)
{
WriteBufferFromOwnString ostr;
column.type->serializeAsText(*column.column.get(), i, ostr, format_settings);
row_packet.appendColumn(std::move(ostr.str()));
}
packet_sender.sendPacket(row_packet);
}
}
void MySQLWireBlockOutputStream::writeSuffix()
{
size_t affected_rows = 0;
std::stringstream human_readable_info;
if (QueryStatus * process_list_elem = context.getProcessListElement())
{
CurrentThread::finalizePerformanceCounters();
QueryStatusInfo info = process_list_elem->getInfo();
affected_rows = info.written_rows;
human_readable_info << std::fixed << std::setprecision(3)
<< "Read " << info.read_rows << " rows, " << formatReadableSizeWithBinarySuffix(info.read_bytes) << " in " << info.elapsed_seconds << " sec., "
<< static_cast<size_t>(info.read_rows / info.elapsed_seconds) << " rows/sec., "
<< formatReadableSizeWithBinarySuffix(info.read_bytes / info.elapsed_seconds) << "/sec.";
}
if (header.columns() == 0)
packet_sender.sendPacket(OK_Packet(0x0, context.mysql.client_capabilities, affected_rows, 0, 0, "", human_readable_info.str()), true);
else
if (context.mysql.client_capabilities & CLIENT_DEPRECATE_EOF)
packet_sender.sendPacket(OK_Packet(0xfe, context.mysql.client_capabilities, affected_rows, 0, 0, "", human_readable_info.str()), true);
else
packet_sender.sendPacket(EOF_Packet(0, 0), true);
}
void MySQLWireBlockOutputStream::flush()
{
packet_sender.out->next();
}
}

View File

@ -1,36 +0,0 @@
#pragma once
#include <Core/MySQLProtocol.h>
#include <DataStreams/IBlockOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/FormatSettings.h>
#include <Interpreters/Context.h>
namespace DB
{
/** Interface for writing rows in MySQL Client/Server Protocol format.
*/
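/// Packet sequence: LengthEncodedNumber(column count), one ColumnDefinition per column,
/// an optional EOF packet, then one ResultsetRow per row, terminated by an OK packet
/// (or an EOF packet for clients without CLIENT_DEPRECATE_EOF).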
class MySQLWireBlockOutputStream : public IBlockOutputStream
{
public:
MySQLWireBlockOutputStream(WriteBuffer & buf, const Block & header, Context & context);
Block getHeader() const { return header; }
void write(const Block & block);
void writePrefix();
void writeSuffix();
void flush();
private:
Block header;
Context & context;
MySQLProtocol::PacketSender packet_sender;
FormatSettings format_settings;
};
using MySQLWireBlockOutputStreamPtr = std::shared_ptr<MySQLWireBlockOutputStream>;
}

View File

@ -1,19 +0,0 @@
#include <Formats/MySQLWireBlockOutputStream.h>
namespace DB
{
void registerOutputFormatMySQLWire(FormatFactory & factory)
{
factory.registerOutputFormat("MySQLWire", [](
WriteBuffer & buf,
const Block & sample,
const Context & context,
const FormatSettings &)
{
return std::make_shared<MySQLWireBlockOutputStream>(buf, sample, const_cast<Context &>(context));
});
}
}

View File

@ -1,103 +0,0 @@
#include <Core/Block.h>
#include <Formats/FormatFactory.h>
#include <Formats/ODBCDriver2BlockOutputStream.h>
#include <IO/WriteBuffer.h>
#include <IO/WriteHelpers.h>
#include <DataTypes/DataTypeLowCardinality.h>
namespace DB
{
ODBCDriver2BlockOutputStream::ODBCDriver2BlockOutputStream(
WriteBuffer & out_, const Block & header_, const FormatSettings & format_settings)
: out(out_), header(header_), format_settings(format_settings)
{
}
void ODBCDriver2BlockOutputStream::flush()
{
out.next();
}
void writeODBCString(WriteBuffer & out, const std::string & str)
{
writeIntBinary(Int32(str.size()), out);
out.write(str.data(), str.size());
}
static void writeRow(const Block & block, size_t row_idx, WriteBuffer & out, const FormatSettings & format_settings, std::string & buffer)
{
size_t columns = block.columns();
for (size_t column_idx = 0; column_idx < columns; ++column_idx)
{
buffer.clear();
const ColumnWithTypeAndName & col = block.getByPosition(column_idx);
if (col.column->isNullAt(row_idx))
{
writeIntBinary(Int32(-1), out);
}
else
{
{
WriteBufferFromString text_out(buffer);
col.type->serializeAsText(*col.column, row_idx, text_out, format_settings);
}
writeODBCString(out, buffer);
}
}
}
void ODBCDriver2BlockOutputStream::write(const Block & block)
{
String text_value;
const size_t rows = block.rows();
for (size_t i = 0; i < rows; ++i)
writeRow(block, i, out, format_settings, text_value);
}
void ODBCDriver2BlockOutputStream::writeSuffix()
{
if (totals)
write(totals);
}
void ODBCDriver2BlockOutputStream::writePrefix()
{
const size_t columns = header.columns();
/// Number of header rows.
writeIntBinary(Int32(2), out);
/// Names of columns.
/// Number of columns + 1 for first name column.
writeIntBinary(Int32(columns + 1), out);
writeODBCString(out, "name");
for (size_t i = 0; i < columns; ++i)
{
const ColumnWithTypeAndName & col = header.getByPosition(i);
writeODBCString(out, col.name);
}
/// Types of columns.
writeIntBinary(Int32(columns + 1), out);
writeODBCString(out, "type");
for (size_t i = 0; i < columns; ++i)
{
auto type = header.getByPosition(i).type;
if (type->lowCardinality())
type = recursiveRemoveLowCardinality(type);
writeODBCString(out, type->getName());
}
}
void registerOutputFormatODBCDriver2(FormatFactory & factory)
{
factory.registerOutputFormat(
"ODBCDriver2", [](WriteBuffer & buf, const Block & sample, const Context &, const FormatSettings & format_settings)
{
return std::make_shared<ODBCDriver2BlockOutputStream>(buf, sample, format_settings);
});
}
}

View File

@ -1,51 +0,0 @@
#pragma once
#include <string>
#include <Core/Block.h>
#include <DataStreams/IBlockOutputStream.h>
#include <Formats/FormatSettings.h>
namespace DB
{
class WriteBuffer;
/** A data format designed to simplify the implementation of the ODBC driver.
* The ODBC driver is designed to be built for different platforms without dependencies on the main code,
* so the format is made to be as easy to parse as possible.
* A header with the required information is output first.
* The data is then output in the order of the rows. Each value is output as follows: length as Int32 (-1 for NULL), then the data in text form.
*/
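/// For example (a sketch), a data row with the values 42 and NULL is encoded as:
///   Int32(2) "42"  Int32(-1)
/// i.e. each non-NULL value is a length-prefixed text string; NULL is a bare length of -1.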
class ODBCDriver2BlockOutputStream final : public IBlockOutputStream
{
public:
ODBCDriver2BlockOutputStream(WriteBuffer & out_, const Block & header_, const FormatSettings & format_settings);
Block getHeader() const override
{
return header;
}
void write(const Block & block) override;
void writePrefix() override;
void writeSuffix() override;
void flush() override;
std::string getContentType() const override
{
return "application/octet-stream";
}
void setTotals(const Block & totals_) override { totals = totals_; }
private:
WriteBuffer & out;
const Block header;
const FormatSettings format_settings;
protected:
Block totals;
};
}

View File

@ -1,74 +0,0 @@
#include <IO/WriteBuffer.h>
#include <IO/WriteHelpers.h>
#include <Core/Block.h>
#include <Formats/ODBCDriverBlockOutputStream.h>
#include <Formats/FormatFactory.h>
namespace DB
{
ODBCDriverBlockOutputStream::ODBCDriverBlockOutputStream(WriteBuffer & out_, const Block & header_, const FormatSettings & format_settings)
: out(out_), header(header_), format_settings(format_settings)
{
}
void ODBCDriverBlockOutputStream::flush()
{
out.next();
}
void ODBCDriverBlockOutputStream::write(const Block & block)
{
const size_t rows = block.rows();
const size_t columns = block.columns();
String text_value;
for (size_t i = 0; i < rows; ++i)
{
for (size_t j = 0; j < columns; ++j)
{
text_value.resize(0);
const ColumnWithTypeAndName & col = block.getByPosition(j);
{
WriteBufferFromString text_out(text_value);
col.type->serializeAsText(*col.column, i, text_out, format_settings);
}
writeStringBinary(text_value, out);
}
}
}
void ODBCDriverBlockOutputStream::writePrefix()
{
const size_t columns = header.columns();
/// Number of columns.
writeVarUInt(columns, out);
/// Names and types of columns.
for (size_t i = 0; i < columns; ++i)
{
const ColumnWithTypeAndName & col = header.getByPosition(i);
writeStringBinary(col.name, out);
writeStringBinary(col.type->getName(), out);
}
}
void registerOutputFormatODBCDriver(FormatFactory & factory)
{
factory.registerOutputFormat("ODBCDriver", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<ODBCDriverBlockOutputStream>(buf, sample, format_settings);
});
}
}

View File

@ -1,39 +0,0 @@
#pragma once
#include <string>
#include <DataStreams/IBlockOutputStream.h>
#include <Formats/FormatSettings.h>
#include <Core/Block.h>
namespace DB
{
class WriteBuffer;
/** A data format designed to simplify the implementation of the ODBC driver.
* The ODBC driver is designed to be built for different platforms without dependencies on the main code,
* so the format is made to be as easy to parse as possible.
* A header with the required information is output first.
* The data is then output in the order of the rows. Each value is output as follows: length in VarUInt format, then the data in text form.
*/
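/// Header layout (a sketch, per ODBCDriverBlockOutputStream::writePrefix): VarUInt(number of columns),
/// then for each column its name and its type name, each as a length-prefixed string.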
class ODBCDriverBlockOutputStream : public IBlockOutputStream
{
public:
ODBCDriverBlockOutputStream(WriteBuffer & out_, const Block & header_, const FormatSettings & format_settings);
Block getHeader() const override { return header; }
void write(const Block & block) override;
void writePrefix() override;
void flush() override;
std::string getContentType() const override { return "application/octet-stream"; }
private:
WriteBuffer & out;
const Block header;
const FormatSettings format_settings;
};
}

View File

@ -1,494 +0,0 @@
#include "ParquetBlockInputStream.h"
#if USE_PARQUET
# include <algorithm>
# include <iterator>
# include <vector>
// TODO: clean up includes
# include <Columns/ColumnNullable.h>
# include <Columns/ColumnString.h>
# include <Columns/ColumnsNumber.h>
# include <Columns/IColumn.h>
# include <Core/ColumnWithTypeAndName.h>
# include <DataTypes/DataTypeDate.h>
# include <DataTypes/DataTypeDateTime.h>
# include <DataTypes/DataTypeFactory.h>
# include <DataTypes/DataTypeNullable.h>
# include <DataTypes/DataTypeString.h>
# include <DataTypes/DataTypesDecimal.h>
# include <DataTypes/DataTypesNumber.h>
# include <Formats/FormatFactory.h>
# include <IO/BufferBase.h>
# include <IO/ReadBufferFromMemory.h>
# include <IO/WriteBufferFromString.h>
# include <IO/WriteHelpers.h>
# include <IO/copyData.h>
# include <Interpreters/castColumn.h>
# include <arrow/api.h>
# include <parquet/arrow/reader.h>
# include <parquet/file_reader.h>
# include <common/DateLUTImpl.h>
# include <ext/range.h>
namespace DB
{
namespace ErrorCodes
{
extern const int UNKNOWN_TYPE;
extern const int VALUE_IS_OUT_OF_RANGE_OF_DATA_TYPE;
extern const int CANNOT_READ_ALL_DATA;
extern const int EMPTY_DATA_PASSED;
extern const int SIZES_OF_COLUMNS_DOESNT_MATCH;
extern const int CANNOT_CONVERT_TYPE;
extern const int CANNOT_INSERT_NULL_IN_ORDINARY_COLUMN;
extern const int THERE_IS_NO_COLUMN;
}
ParquetBlockInputStream::ParquetBlockInputStream(ReadBuffer & istr_, const Block & header_, const Context & context_)
: istr{istr_}, header{header_}, context{context_}
{
}
Block ParquetBlockInputStream::getHeader() const
{
return header;
}
/// Inserts numeric data right into internal column data to reduce overhead
template <typename NumericType, typename VectorType = ColumnVector<NumericType>>
void fillColumnWithNumericData(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & internal_column)
{
auto & column_data = static_cast<VectorType &>(*internal_column).getData();
column_data.reserve(arrow_column->length());
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
std::shared_ptr<arrow::Array> chunk = arrow_column->data()->chunk(chunk_i);
/// buffers[0] is a null bitmap and buffers[1] are actual values
std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1];
const auto * raw_data = reinterpret_cast<const NumericType *>(buffer->data());
column_data.insert_assume_reserved(raw_data, raw_data + chunk->length());
}
}
/// Inserts chars and offsets right into internal column data to reduce overhead.
/// Internal offsets are shifted by one to the right in comparison with Arrow ones, so the last offset maps to the end of all chars.
/// Also, internal strings are null-terminated.
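/// For example, for the strings "ab" and "c": chars = "ab\0c\0", offsets = [3, 5].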
void fillColumnWithStringData(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & internal_column)
{
PaddedPODArray<UInt8> & column_chars_t = static_cast<ColumnString &>(*internal_column).getChars();
PaddedPODArray<UInt64> & column_offsets = static_cast<ColumnString &>(*internal_column).getOffsets();
size_t chars_t_size = 0;
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
arrow::BinaryArray & chunk = static_cast<arrow::BinaryArray &>(*(arrow_column->data()->chunk(chunk_i)));
const size_t chunk_length = chunk.length();
chars_t_size += chunk.value_offset(chunk_length - 1) + chunk.value_length(chunk_length - 1);
chars_t_size += chunk_length; /// additional space for null bytes
}
column_chars_t.reserve(chars_t_size);
column_offsets.reserve(arrow_column->length());
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
arrow::BinaryArray & chunk = static_cast<arrow::BinaryArray &>(*(arrow_column->data()->chunk(chunk_i)));
std::shared_ptr<arrow::Buffer> buffer = chunk.value_data();
const size_t chunk_length = chunk.length();
for (size_t offset_i = 0; offset_i != chunk_length; ++offset_i)
{
if (!chunk.IsNull(offset_i) && buffer)
{
const UInt8 * raw_data = buffer->data() + chunk.value_offset(offset_i);
column_chars_t.insert_assume_reserved(raw_data, raw_data + chunk.value_length(offset_i));
}
column_chars_t.emplace_back('\0');
column_offsets.emplace_back(column_chars_t.size());
}
}
}
void fillColumnWithBooleanData(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & internal_column)
{
auto & column_data = static_cast<ColumnVector<UInt8> &>(*internal_column).getData();
column_data.resize(arrow_column->length());
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
arrow::BooleanArray & chunk = static_cast<arrow::BooleanArray &>(*(arrow_column->data()->chunk(chunk_i)));
/// buffers[0] is a null bitmap and buffers[1] are actual values
std::shared_ptr<arrow::Buffer> buffer = chunk.data()->buffers[1];
for (size_t bool_i = 0; bool_i != static_cast<size_t>(chunk.length()); ++bool_i)
column_data[bool_i] = chunk.Value(bool_i);
}
}
/// Arrow stores Parquet::DATE in Int32, while ClickHouse stores Date in UInt16. Therefore, the value range must be checked before saving.
void fillColumnWithDate32Data(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & internal_column)
{
PaddedPODArray<UInt16> & column_data = static_cast<ColumnVector<UInt16> &>(*internal_column).getData();
column_data.reserve(arrow_column->length());
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
arrow::Date32Array & chunk = static_cast<arrow::Date32Array &>(*(arrow_column->data()->chunk(chunk_i)));
for (size_t value_i = 0, length = static_cast<size_t>(chunk.length()); value_i < length; ++value_i)
{
UInt32 days_num = static_cast<UInt32>(chunk.Value(value_i));
if (days_num > DATE_LUT_MAX_DAY_NUM)
{
// TODO: will it rollback correctly?
throw Exception{"Input value " + std::to_string(days_num) + " of a column \"" + arrow_column->name()
+ "\" is greater than "
"max allowed Date value, which is "
+ std::to_string(DATE_LUT_MAX_DAY_NUM),
ErrorCodes::VALUE_IS_OUT_OF_RANGE_OF_DATA_TYPE};
}
column_data.emplace_back(days_num);
}
}
}
/// Arrow stores Parquet::DATETIME in Int64, while ClickHouse stores DateTime in UInt32. Therefore, the value should be checked before saving.
void fillColumnWithDate64Data(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & internal_column)
{
auto & column_data = static_cast<ColumnVector<UInt32> &>(*internal_column).getData();
column_data.reserve(arrow_column->length());
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
auto & chunk = static_cast<arrow::Date64Array &>(*(arrow_column->data()->chunk(chunk_i)));
for (size_t value_i = 0, length = static_cast<size_t>(chunk.length()); value_i < length; ++value_i)
{
auto timestamp = static_cast<UInt32>(chunk.Value(value_i) / 1000); // Always? in ms
column_data.emplace_back(timestamp);
}
}
}
void fillColumnWithTimestampData(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & internal_column)
{
auto & column_data = static_cast<ColumnVector<UInt32> &>(*internal_column).getData();
column_data.reserve(arrow_column->length());
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
auto & chunk = static_cast<arrow::TimestampArray &>(*(arrow_column->data()->chunk(chunk_i)));
const auto & type = static_cast<const ::arrow::TimestampType &>(*chunk.type());
UInt32 divide = 1;
const auto unit = type.unit();
switch (unit)
{
case arrow::TimeUnit::SECOND:
divide = 1;
break;
case arrow::TimeUnit::MILLI:
divide = 1000;
break;
case arrow::TimeUnit::MICRO:
divide = 1000000;
break;
case arrow::TimeUnit::NANO:
divide = 1000000000;
break;
}
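/// For example, a MILLI timestamp 1546300800123 becomes the DateTime value 1546300800.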
for (size_t value_i = 0, length = static_cast<size_t>(chunk.length()); value_i < length; ++value_i)
{
auto timestamp = static_cast<UInt32>(chunk.Value(value_i) / divide); // ms! TODO: check other 's' 'ns' ...
column_data.emplace_back(timestamp);
}
}
}
void fillColumnWithDecimalData(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & internal_column)
{
auto & column = static_cast<ColumnDecimal<Decimal128> &>(*internal_column);
auto & column_data = column.getData();
column_data.reserve(arrow_column->length());
for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->data()->num_chunks()); chunk_i < num_chunks; ++chunk_i)
{
auto & chunk = static_cast<arrow::DecimalArray &>(*(arrow_column->data()->chunk(chunk_i)));
for (size_t value_i = 0, length = static_cast<size_t>(chunk.length()); value_i < length; ++value_i)
{
column_data.emplace_back(
chunk.IsNull(value_i) ? Decimal128(0) : *reinterpret_cast<const Decimal128 *>(chunk.Value(value_i))); // TODO: copy column
}
}
}
/// Creates a null bytemap from arrow's null bitmap
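/// For example, Arrow values [1, NULL, 2] produce the ClickHouse null bytemap [0, 1, 0].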
void fillByteMapFromArrowColumn(std::shared_ptr<arrow::Column> & arrow_column, MutableColumnPtr & bytemap)
{
PaddedPODArray<UInt8> & bytemap_data = static_cast<ColumnVector<UInt8> &>(*bytemap).getData();
bytemap_data.reserve(arrow_column->length());
for (size_t chunk_i = 0; chunk_i != static_cast<size_t>(arrow_column->data()->num_chunks()); ++chunk_i)
{
std::shared_ptr<arrow::Array> chunk = arrow_column->data()->chunk(chunk_i);
for (size_t value_i = 0; value_i != static_cast<size_t>(chunk->length()); ++value_i)
bytemap_data.emplace_back(chunk->IsNull(value_i));
}
}
# define FOR_ARROW_NUMERIC_TYPES(M) \
M(arrow::Type::UINT8, UInt8) \
M(arrow::Type::INT8, Int8) \
M(arrow::Type::UINT16, UInt16) \
M(arrow::Type::INT16, Int16) \
M(arrow::Type::UINT32, UInt32) \
M(arrow::Type::INT32, Int32) \
M(arrow::Type::UINT64, UInt64) \
M(arrow::Type::INT64, Int64) \
M(arrow::Type::FLOAT, Float32) \
M(arrow::Type::DOUBLE, Float64)
//M(arrow::Type::HALF_FLOAT, Float32) // TODO
using NameToColumnPtr = std::unordered_map<std::string, std::shared_ptr<arrow::Column>>;
Block ParquetBlockInputStream::readImpl()
{
static const std::unordered_map<arrow::Type::type, std::shared_ptr<IDataType>> arrow_type_to_internal_type = {
//{arrow::Type::DECIMAL, std::make_shared<DataTypeDecimal>()},
{arrow::Type::UINT8, std::make_shared<DataTypeUInt8>()},
{arrow::Type::INT8, std::make_shared<DataTypeInt8>()},
{arrow::Type::UINT16, std::make_shared<DataTypeUInt16>()},
{arrow::Type::INT16, std::make_shared<DataTypeInt16>()},
{arrow::Type::UINT32, std::make_shared<DataTypeUInt32>()},
{arrow::Type::INT32, std::make_shared<DataTypeInt32>()},
{arrow::Type::UINT64, std::make_shared<DataTypeUInt64>()},
{arrow::Type::INT64, std::make_shared<DataTypeInt64>()},
{arrow::Type::HALF_FLOAT, std::make_shared<DataTypeFloat32>()},
{arrow::Type::FLOAT, std::make_shared<DataTypeFloat32>()},
{arrow::Type::DOUBLE, std::make_shared<DataTypeFloat64>()},
{arrow::Type::BOOL, std::make_shared<DataTypeUInt8>()},
//{arrow::Type::DATE32, std::make_shared<DataTypeDate>()},
{arrow::Type::DATE32, std::make_shared<DataTypeDate>()},
//{arrow::Type::DATE32, std::make_shared<DataTypeDateTime>()},
{arrow::Type::DATE64, std::make_shared<DataTypeDateTime>()},
{arrow::Type::TIMESTAMP, std::make_shared<DataTypeDateTime>()},
//{arrow::Type::TIME32, std::make_shared<DataTypeDateTime>()},
{arrow::Type::STRING, std::make_shared<DataTypeString>()},
{arrow::Type::BINARY, std::make_shared<DataTypeString>()},
//{arrow::Type::FIXED_SIZE_BINARY, std::make_shared<DataTypeString>()},
//{arrow::Type::UUID, std::make_shared<DataTypeString>()},
// TODO: add other types that are convertible to internal ones:
// 0. ENUM?
// 1. UUID -> String
// 2. JSON -> String
// Full list of types: contrib/arrow/cpp/src/arrow/type.h
};
Block res;
if (!istr.eof())
{
/*
First we load the whole stream into a string (this is bad: it limits the .parquet file size to roughly half of the available RAM).
Then we produce blocks for every row group (don't load big .parquet files with a single row group - they can consume 10x+ of the .parquet file size in RAM).
*/
if (row_group_current < row_group_total)
throw Exception{"Got new data, but data from previous chunks not readed " + std::to_string(row_group_current) + "/"
+ std::to_string(row_group_total),
ErrorCodes::CANNOT_READ_ALL_DATA};
file_data.clear();
{
WriteBufferFromString file_buffer(file_data);
copyData(istr, file_buffer);
}
buffer = std::make_unique<arrow::Buffer>(file_data);
// TODO: maybe use parquet::RandomAccessSource?
auto reader = parquet::ParquetFileReader::Open(std::make_shared<::arrow::io::BufferReader>(*buffer));
file_reader = std::make_unique<parquet::arrow::FileReader>(::arrow::default_memory_pool(), std::move(reader));
row_group_total = file_reader->num_row_groups();
row_group_current = 0;
}
if (row_group_current >= row_group_total)
return res;
// TODO: also catch a ParquetException thrown by filereader?
//arrow::Status read_status = filereader.ReadTable(&table);
std::shared_ptr<arrow::Table> table;
arrow::Status read_status = file_reader->ReadRowGroup(row_group_current, &table);
if (!read_status.ok())
throw Exception{"Error while reading parquet data: " + read_status.ToString(), ErrorCodes::CANNOT_READ_ALL_DATA};
if (0 == table->num_rows())
throw Exception{"Empty table in input data", ErrorCodes::EMPTY_DATA_PASSED};
if (header.columns() > static_cast<size_t>(table->num_columns()))
// TODO: What if some columns were not present? Insert NULLs? What if a column is not nullable?
throw Exception{"Number of columns in the input data is less than the table requires", ErrorCodes::SIZES_OF_COLUMNS_DOESNT_MATCH};
++row_group_current;
NameToColumnPtr name_to_column_ptr;
for (size_t i = 0, num_columns = static_cast<size_t>(table->num_columns()); i < num_columns; ++i)
{
std::shared_ptr<arrow::Column> arrow_column = table->column(i);
name_to_column_ptr[arrow_column->name()] = arrow_column;
}
for (size_t column_i = 0, columns = header.columns(); column_i < columns; ++column_i)
{
ColumnWithTypeAndName header_column = header.getByPosition(column_i);
if (name_to_column_ptr.find(header_column.name) == name_to_column_ptr.end())
// TODO: What if some columns were not present? Insert NULLs? What if a column is not nullable?
throw Exception{"Column \"" + header_column.name + "\" is not present in input data", ErrorCodes::THERE_IS_NO_COLUMN};
std::shared_ptr<arrow::Column> arrow_column = name_to_column_ptr[header_column.name];
arrow::Type::type arrow_type = arrow_column->type()->id();
// TODO: check if a column is const?
if (!header_column.type->isNullable() && arrow_column->null_count())
{
throw Exception{"Can not insert NULL data into non-nullable column \"" + header_column.name + "\"",
ErrorCodes::CANNOT_INSERT_NULL_IN_ORDINARY_COLUMN};
}
const bool target_column_is_nullable = header_column.type->isNullable() || arrow_column->null_count();
DataTypePtr internal_nested_type;
if (arrow_type == arrow::Type::DECIMAL)
{
const auto decimal_type = static_cast<arrow::DecimalType *>(arrow_column->type().get());
internal_nested_type = std::make_shared<DataTypeDecimal<Decimal128>>(decimal_type->precision(), decimal_type->scale());
}
else if (arrow_type_to_internal_type.find(arrow_type) != arrow_type_to_internal_type.end())
{
internal_nested_type = arrow_type_to_internal_type.at(arrow_type);
}
else
{
throw Exception{"The type \"" + arrow_column->type()->name() + "\" of an input column \"" + arrow_column->name()
+ "\" is not supported for conversion from a Parquet data format",
ErrorCodes::CANNOT_CONVERT_TYPE};
}
const DataTypePtr internal_type = target_column_is_nullable ? makeNullable(internal_nested_type) : internal_nested_type;
const std::string internal_nested_type_name = internal_nested_type->getName();
const DataTypePtr column_nested_type = header_column.type->isNullable()
? static_cast<const DataTypeNullable *>(header_column.type.get())->getNestedType()
: header_column.type;
const DataTypePtr column_type = header_column.type;
const std::string column_nested_type_name = column_nested_type->getName();
ColumnWithTypeAndName column;
column.name = header_column.name;
column.type = internal_type;
/// Data
MutableColumnPtr read_column = internal_nested_type->createColumn();
switch (arrow_type)
{
case arrow::Type::STRING:
case arrow::Type::BINARY:
//case arrow::Type::FIXED_SIZE_BINARY:
fillColumnWithStringData(arrow_column, read_column);
break;
case arrow::Type::BOOL:
fillColumnWithBooleanData(arrow_column, read_column);
break;
case arrow::Type::DATE32:
fillColumnWithDate32Data(arrow_column, read_column);
break;
case arrow::Type::DATE64:
fillColumnWithDate64Data(arrow_column, read_column);
break;
case arrow::Type::TIMESTAMP:
fillColumnWithTimestampData(arrow_column, read_column);
break;
case arrow::Type::DECIMAL:
//fillColumnWithNumericData<Decimal128, ColumnDecimal<Decimal128>>(arrow_column, read_column); // Have problems with trash values under NULL, but faster
fillColumnWithDecimalData(arrow_column, read_column /*, internal_nested_type*/);
break;
# define DISPATCH(ARROW_NUMERIC_TYPE, CPP_NUMERIC_TYPE) \
case ARROW_NUMERIC_TYPE: \
fillColumnWithNumericData<CPP_NUMERIC_TYPE>(arrow_column, read_column); \
break;
FOR_ARROW_NUMERIC_TYPES(DISPATCH)
# undef DISPATCH
// TODO: support TIMESTAMP_MICROS and TIMESTAMP_MILLIS with truncated micro- and milliseconds?
// TODO: read JSON as a string?
// TODO: read UUID as a string?
default:
throw Exception{"Unsupported parquet type \"" + arrow_column->type()->name() + "\" of an input column \""
+ arrow_column->name() + "\"",
ErrorCodes::UNKNOWN_TYPE};
}
if (column.type->isNullable())
{
MutableColumnPtr null_bytemap = DataTypeUInt8().createColumn();
fillByteMapFromArrowColumn(arrow_column, null_bytemap);
column.column = ColumnNullable::create(std::move(read_column), std::move(null_bytemap));
}
else
{
column.column = std::move(read_column);
}
column.column = castColumn(column, column_type, context);
column.type = column_type;
res.insert(std::move(column));
}
return res;
}
void registerInputFormatParquet(FormatFactory & factory)
{
factory.registerInputFormat(
"Parquet",
[](ReadBuffer & buf,
const Block & sample,
const Context & context,
UInt64 /* max_block_size */,
UInt64 /* rows_portion_size */,
FormatFactory::ReadCallback /* callback */,
const FormatSettings & /* settings */) { return std::make_shared<ParquetBlockInputStream>(buf, sample, context); });
}
}
#else
namespace DB
{
class FormatFactory;
void registerInputFormatParquet(FormatFactory &)
{
}
}
#endif

View File

@ -1,43 +0,0 @@
#pragma once
#include "config_formats.h"
#if USE_PARQUET
# include <DataStreams/IBlockInputStream.h>
namespace parquet { namespace arrow { class FileReader; } }
namespace arrow { class Buffer; }
namespace DB
{
class Context;
class ParquetBlockInputStream : public IBlockInputStream
{
public:
ParquetBlockInputStream(ReadBuffer & istr_, const Block & header_, const Context & context_);
String getName() const override { return "Parquet"; }
Block getHeader() const override;
protected:
Block readImpl() override;
private:
ReadBuffer & istr;
Block header;
// TODO: check that this class implements every part of its parent
const Context & context;
std::unique_ptr<parquet::arrow::FileReader> file_reader;
std::string file_data;
std::unique_ptr<arrow::Buffer> buffer;
int row_group_total = 0;
int row_group_current = 0;
};
}
#endif

View File

@ -1,450 +0,0 @@
#include "ParquetBlockOutputStream.h"
#if USE_PARQUET
// TODO: clean includes
# include <Columns/ColumnDecimal.h>
# include <Columns/ColumnFixedString.h>
# include <Columns/ColumnNullable.h>
# include <Columns/ColumnString.h>
# include <Columns/ColumnVector.h>
# include <Columns/ColumnsNumber.h>
# include <Core/ColumnWithTypeAndName.h>
# include <Core/callOnTypeIndex.h>
# include <DataTypes/DataTypeDateTime.h>
# include <DataTypes/DataTypeNullable.h>
# include <DataTypes/DataTypesDecimal.h>
# include <DataStreams/SquashingBlockOutputStream.h>
# include <Formats/FormatFactory.h>
# include <IO/WriteHelpers.h>
# include <arrow/api.h>
# include <arrow/io/api.h>
# include <arrow/util/decimal.h>
# include <parquet/arrow/writer.h>
# include <parquet/exception.h>
# include <parquet/util/memory.h>
namespace DB
{
namespace ErrorCodes
{
extern const int UNKNOWN_EXCEPTION;
extern const int UNKNOWN_TYPE;
}
ParquetBlockOutputStream::ParquetBlockOutputStream(WriteBuffer & ostr, const Block & header, const FormatSettings & format_settings) : ostr{ostr}, header{header}, format_settings{format_settings}
{
}
void ParquetBlockOutputStream::flush()
{
ostr.next();
}
void checkStatus(arrow::Status & status, const std::string & column_name)
{
if (!status.ok())
throw Exception{"Error with a parquet column \"" + column_name + "\": " + status.ToString(), ErrorCodes::UNKNOWN_EXCEPTION};
}
template <typename NumericType, typename ArrowBuilderType>
void fillArrowArrayWithNumericColumnData(
ColumnPtr write_column, std::shared_ptr<arrow::Array> & arrow_array, const PaddedPODArray<UInt8> * null_bytemap)
{
const PaddedPODArray<NumericType> & internal_data = static_cast<const ColumnVector<NumericType> &>(*write_column).getData();
ArrowBuilderType builder;
arrow::Status status;
const UInt8 * arrow_null_bytemap_raw_ptr = nullptr;
PaddedPODArray<UInt8> arrow_null_bytemap;
if (null_bytemap)
{
/// Invert values, since Arrow uses 1 for a non-null value, while ClickHouse uses 1 for a null
arrow_null_bytemap.reserve(null_bytemap->size());
for (size_t i = 0, size = null_bytemap->size(); i < size; ++i)
arrow_null_bytemap.emplace_back(1 ^ (*null_bytemap)[i]);
arrow_null_bytemap_raw_ptr = arrow_null_bytemap.data();
}
status = builder.AppendValues(internal_data.data(), internal_data.size(), arrow_null_bytemap_raw_ptr);
checkStatus(status, write_column->getName());
status = builder.Finish(&arrow_array);
checkStatus(status, write_column->getName());
}
template <typename ColumnType>
void fillArrowArrayWithStringColumnData(
ColumnPtr write_column, std::shared_ptr<arrow::Array> & arrow_array, const PaddedPODArray<UInt8> * null_bytemap)
{
const auto & internal_column = static_cast<const ColumnType &>(*write_column);
arrow::StringBuilder builder;
arrow::Status status;
for (size_t string_i = 0, size = internal_column.size(); string_i < size; ++string_i)
{
if (null_bytemap && (*null_bytemap)[string_i])
{
status = builder.AppendNull();
}
else
{
StringRef string_ref = internal_column.getDataAt(string_i);
status = builder.Append(string_ref.data, string_ref.size);
}
checkStatus(status, write_column->getName());
}
status = builder.Finish(&arrow_array);
checkStatus(status, write_column->getName());
}
void fillArrowArrayWithDateColumnData(
ColumnPtr write_column, std::shared_ptr<arrow::Array> & arrow_array, const PaddedPODArray<UInt8> * null_bytemap)
{
const PaddedPODArray<UInt16> & internal_data = static_cast<const ColumnVector<UInt16> &>(*write_column).getData();
//arrow::Date32Builder date_builder;
arrow::UInt16Builder builder;
arrow::Status status;
for (size_t value_i = 0, size = internal_data.size(); value_i < size; ++value_i)
{
if (null_bytemap && (*null_bytemap)[value_i])
status = builder.AppendNull();
else
/// The UInt16 day number is written as-is (the commented-out Date32Builder would implicitly convert UInt16 to Int32)
status = builder.Append(internal_data[value_i]);
checkStatus(status, write_column->getName());
}
status = builder.Finish(&arrow_array);
checkStatus(status, write_column->getName());
}
void fillArrowArrayWithDateTimeColumnData(
ColumnPtr write_column, std::shared_ptr<arrow::Array> & arrow_array, const PaddedPODArray<UInt8> * null_bytemap)
{
auto & internal_data = static_cast<const ColumnVector<UInt32> &>(*write_column).getData();
//arrow::Date64Builder builder;
arrow::UInt32Builder builder;
arrow::Status status;
for (size_t value_i = 0, size = internal_data.size(); value_i < size; ++value_i)
{
if (null_bytemap && (*null_bytemap)[value_i])
status = builder.AppendNull();
else
/// The UInt32 timestamp is written as-is
//status = date_builder.Append(static_cast<int64_t>(internal_data[value_i]) * 1000); // now ms. TODO check other units
status = builder.Append(internal_data[value_i]);
checkStatus(status, write_column->getName());
}
status = builder.Finish(&arrow_array);
checkStatus(status, write_column->getName());
}
template <typename DataType>
void fillArrowArrayWithDecimalColumnData(
ColumnPtr write_column,
std::shared_ptr<arrow::Array> & arrow_array,
const PaddedPODArray<UInt8> * null_bytemap,
const DataType * decimal_type)
{
const auto & column = static_cast<const typename DataType::ColumnType &>(*write_column);
arrow::DecimalBuilder builder(arrow::decimal(decimal_type->getPrecision(), decimal_type->getScale()));
arrow::Status status;
for (size_t value_i = 0, size = column.size(); value_i < size; ++value_i)
{
if (null_bytemap && (*null_bytemap)[value_i])
status = builder.AppendNull();
else
status = builder.Append(
arrow::Decimal128(reinterpret_cast<const uint8_t *>(&column.getElement(value_i).value))); // TODO: try copy column
checkStatus(status, write_column->getName());
}
status = builder.Finish(&arrow_array);
checkStatus(status, write_column->getName());
/* TODO column copy
const auto & internal_data = static_cast<const typename DataType::ColumnType &>(*write_column).getData();
//ArrowBuilderType numeric_builder;
arrow::DecimalBuilder builder(arrow::decimal(decimal_type->getPrecision(), decimal_type->getScale()));
arrow::Status status;
const uint8_t * arrow_null_bytemap_raw_ptr = nullptr;
PaddedPODArray<UInt8> arrow_null_bytemap;
if (null_bytemap)
{
/// Invert values since Arrow interprets 1 as a non-null value, while CH as a null
arrow_null_bytemap.reserve(null_bytemap->size());
for (size_t i = 0, size = null_bytemap->size(); i < size; ++i)
arrow_null_bytemap.emplace_back(1 ^ (*null_bytemap)[i]);
arrow_null_bytemap_raw_ptr = arrow_null_bytemap.data();
}
status = builder.AppendValues(reinterpret_cast<const uint8_t*>(internal_data.data()), internal_data.size(), arrow_null_bytemap_raw_ptr);
checkStatus(status, write_column->getName());
status = builder.Finish(&arrow_array);
checkStatus(status, write_column->getName());
*/
}
# define FOR_INTERNAL_NUMERIC_TYPES(M) \
M(UInt8, arrow::UInt8Builder) \
M(Int8, arrow::Int8Builder) \
M(UInt16, arrow::UInt16Builder) \
M(Int16, arrow::Int16Builder) \
M(UInt32, arrow::UInt32Builder) \
M(Int32, arrow::Int32Builder) \
M(UInt64, arrow::UInt64Builder) \
M(Int64, arrow::Int64Builder) \
M(Float32, arrow::FloatBuilder) \
M(Float64, arrow::DoubleBuilder)
const std::unordered_map<String, std::shared_ptr<arrow::DataType>> internal_type_to_arrow_type = {
{"UInt8", arrow::uint8()},
{"Int8", arrow::int8()},
{"UInt16", arrow::uint16()},
{"Int16", arrow::int16()},
{"UInt32", arrow::uint32()},
{"Int32", arrow::int32()},
{"UInt64", arrow::uint64()},
{"Int64", arrow::int64()},
{"Float32", arrow::float32()},
{"Float64", arrow::float64()},
//{"Date", arrow::date64()},
//{"Date", arrow::date32()},
{"Date", arrow::uint16()}, // CHECK
//{"DateTime", arrow::date64()}, // BUG! saves as date32
{"DateTime", arrow::uint32()},
// TODO: ClickHouse can actually store non-utf8 strings!
{"String", arrow::utf8()},
{"FixedString", arrow::utf8()},
};
const PaddedPODArray<UInt8> * extractNullBytemapPtr(ColumnPtr column)
{
ColumnPtr null_column = static_cast<const ColumnNullable &>(*column).getNullMapColumnPtr();
const PaddedPODArray<UInt8> & null_bytemap = static_cast<const ColumnVector<UInt8> &>(*null_column).getData();
return &null_bytemap;
}
class OstreamOutputStream : public parquet::OutputStream
{
public:
explicit OstreamOutputStream(WriteBuffer & ostr_) : ostr(ostr_) {}
virtual ~OstreamOutputStream() {}
virtual void Close() {}
virtual int64_t Tell() { return total_length; }
virtual void Write(const uint8_t * data, int64_t length)
{
ostr.write(reinterpret_cast<const char *>(data), length);
total_length += length;
}
private:
WriteBuffer & ostr;
int64_t total_length = 0;
PARQUET_DISALLOW_COPY_AND_ASSIGN(OstreamOutputStream);
};
void ParquetBlockOutputStream::write(const Block & block)
{
block.checkNumberOfRows();
const size_t columns_num = block.columns();
/// For arrow::Schema and arrow::Table creation
std::vector<std::shared_ptr<arrow::Field>> arrow_fields;
std::vector<std::shared_ptr<arrow::Array>> arrow_arrays;
arrow_fields.reserve(columns_num);
arrow_arrays.reserve(columns_num);
for (size_t column_i = 0; column_i < columns_num; ++column_i)
{
// TODO: constructed every iteration
const ColumnWithTypeAndName & column = block.safeGetByPosition(column_i);
const bool is_column_nullable = column.type->isNullable();
const auto & column_nested_type
= is_column_nullable ? static_cast<const DataTypeNullable *>(column.type.get())->getNestedType() : column.type;
const std::string column_nested_type_name = column_nested_type->getFamilyName();
if (isDecimal(column_nested_type))
{
const auto add_decimal_field = [&](const auto & types) -> bool {
using Types = std::decay_t<decltype(types)>;
using ToDataType = typename Types::LeftType;
if constexpr (
std::is_same_v<
ToDataType,
DataTypeDecimal<
Decimal32>> || std::is_same_v<ToDataType, DataTypeDecimal<Decimal64>> || std::is_same_v<ToDataType, DataTypeDecimal<Decimal128>>)
{
const auto & decimal_type = static_cast<const ToDataType *>(column_nested_type.get());
arrow_fields.emplace_back(std::make_shared<arrow::Field>(
column.name, arrow::decimal(decimal_type->getPrecision(), decimal_type->getScale()), is_column_nullable));
}
return false;
};
callOnIndexAndDataType<void>(column_nested_type->getTypeId(), add_decimal_field);
}
else
{
if (internal_type_to_arrow_type.find(column_nested_type_name) == internal_type_to_arrow_type.end())
{
throw Exception{"The type \"" + column_nested_type_name + "\" of a column \"" + column.name
+ "\""
" is not supported for conversion into a Parquet data format",
ErrorCodes::UNKNOWN_TYPE};
}
arrow_fields.emplace_back(std::make_shared<arrow::Field>(column.name, internal_type_to_arrow_type.at(column_nested_type_name), is_column_nullable));
}
std::shared_ptr<arrow::Array> arrow_array;
ColumnPtr nested_column
= is_column_nullable ? static_cast<const ColumnNullable &>(*column.column).getNestedColumnPtr() : column.column;
const PaddedPODArray<UInt8> * null_bytemap = is_column_nullable ? extractNullBytemapPtr(column.column) : nullptr;
if ("String" == column_nested_type_name)
{
fillArrowArrayWithStringColumnData<ColumnString>(nested_column, arrow_array, null_bytemap);
}
else if ("FixedString" == column_nested_type_name)
{
fillArrowArrayWithStringColumnData<ColumnFixedString>(nested_column, arrow_array, null_bytemap);
}
else if ("Date" == column_nested_type_name)
{
fillArrowArrayWithDateColumnData(nested_column, arrow_array, null_bytemap);
}
else if ("DateTime" == column_nested_type_name)
{
fillArrowArrayWithDateTimeColumnData(nested_column, arrow_array, null_bytemap);
}
else if (isDecimal(column_nested_type))
{
auto fill_decimal = [&](const auto & types) -> bool
{
using Types = std::decay_t<decltype(types)>;
using ToDataType = typename Types::LeftType;
if constexpr (
std::is_same_v<
ToDataType,
DataTypeDecimal<
Decimal32>> || std::is_same_v<ToDataType, DataTypeDecimal<Decimal64>> || std::is_same_v<ToDataType, DataTypeDecimal<Decimal128>>)
{
const auto & decimal_type = static_cast<const ToDataType *>(column_nested_type.get());
fillArrowArrayWithDecimalColumnData(nested_column, arrow_array, null_bytemap, decimal_type);
}
return false;
};
callOnIndexAndDataType<void>(column_nested_type->getTypeId(), fill_decimal);
}
# define DISPATCH(CPP_NUMERIC_TYPE, ARROW_BUILDER_TYPE) \
else if (#CPP_NUMERIC_TYPE == column_nested_type_name) \
{ \
fillArrowArrayWithNumericColumnData<CPP_NUMERIC_TYPE, ARROW_BUILDER_TYPE>(nested_column, arrow_array, null_bytemap); \
}
FOR_INTERNAL_NUMERIC_TYPES(DISPATCH)
# undef DISPATCH
else
{
throw Exception{"Internal type \"" + column_nested_type_name + "\" of a column \"" + column.name
+ "\""
" is not supported for conversion into a Parquet data format",
ErrorCodes::UNKNOWN_TYPE};
}
arrow_arrays.emplace_back(std::move(arrow_array));
}
std::shared_ptr<arrow::Schema> arrow_schema = std::make_shared<arrow::Schema>(std::move(arrow_fields));
std::shared_ptr<arrow::Table> arrow_table = arrow::Table::Make(arrow_schema, arrow_arrays);
auto sink = std::make_shared<OstreamOutputStream>(ostr);
if (!file_writer)
{
parquet::WriterProperties::Builder builder;
#if USE_SNAPPY
builder.compression(parquet::Compression::SNAPPY);
#endif
auto props = builder.build();
auto status = parquet::arrow::FileWriter::Open(
*arrow_table->schema(),
arrow::default_memory_pool(),
sink,
props, /*parquet::default_writer_properties(),*/
parquet::arrow::default_arrow_writer_properties(),
&file_writer);
if (!status.ok())
throw Exception{"Error while opening a table: " + status.ToString(), ErrorCodes::UNKNOWN_EXCEPTION};
}
// TODO: calculate row_group_size depending on the number of rows and the table size
auto status = file_writer->WriteTable(*arrow_table, format_settings.parquet.row_group_size);
if (!status.ok())
throw Exception{"Error while writing a table: " + status.ToString(), ErrorCodes::UNKNOWN_EXCEPTION};
}
void ParquetBlockOutputStream::writeSuffix()
{
if (file_writer)
{
auto status = file_writer->Close();
if (!status.ok())
throw Exception{"Error while closing a table: " + status.ToString(), ErrorCodes::UNKNOWN_EXCEPTION};
}
}
void registerOutputFormatParquet(FormatFactory & factory)
{
factory.registerOutputFormat(
"Parquet", [](WriteBuffer & buf, const Block & sample, const Context & /*context*/, const FormatSettings & format_settings)
{
BlockOutputStreamPtr impl = std::make_shared<ParquetBlockOutputStream>(buf, sample, format_settings);
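/// Squash incoming blocks until at least row_group_size rows have accumulated,
/// so that each write to the underlying stream can produce one full Parquet row group.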
auto res = std::make_shared<SquashingBlockOutputStream>(impl, impl->getHeader(), format_settings.parquet.row_group_size, 0);
res->disableFlush();
return res;
});
}
}
#else
namespace DB
{
class FormatFactory;
void registerOutputFormatParquet(FormatFactory &)
{
}
}
#endif

View File

@ -1,46 +0,0 @@
#pragma once
#include "config_formats.h"
#if USE_PARQUET
# include <DataStreams/IBlockOutputStream.h>
# include <Formats/FormatSettings.h>
namespace arrow
{
class Array;
class DataType;
}
namespace parquet
{
namespace arrow
{
class FileWriter;
}
}
namespace DB
{
class ParquetBlockOutputStream : public IBlockOutputStream
{
public:
ParquetBlockOutputStream(WriteBuffer & ostr_, const Block & header_, const FormatSettings & format_settings);
Block getHeader() const override { return header; }
void write(const Block & block) override;
void writeSuffix() override;
void flush() override;
String getContentType() const override { return "application/octet-stream"; }
private:
WriteBuffer & ostr;
Block header;
const FormatSettings format_settings;
std::unique_ptr<parquet::arrow::FileWriter> file_writer;
};
}
#endif

View File

@ -1,277 +0,0 @@
#include <sys/ioctl.h>
#include <port/unistd.h>
#include <Formats/PrettyBlockOutputStream.h>
#include <Formats/FormatFactory.h>
#include <IO/WriteBuffer.h>
#include <IO/WriteHelpers.h>
#include <IO/WriteBufferFromString.h>
#include <Common/UTF8Helpers.h>
namespace DB
{
namespace ErrorCodes
{
extern const int ILLEGAL_COLUMN;
}
PrettyBlockOutputStream::PrettyBlockOutputStream(
WriteBuffer & ostr_, const Block & header_, const FormatSettings & format_settings)
: ostr(ostr_), header(header_), format_settings(format_settings)
{
struct winsize w;
if (0 == ioctl(STDOUT_FILENO, TIOCGWINSZ, &w))
terminal_width = w.ws_col;
}
void PrettyBlockOutputStream::flush()
{
ostr.next();
}
/// Evaluate the visible width of the values and column names.
/// Note that the number of code points is only a rough approximation of the visible string width.
void PrettyBlockOutputStream::calculateWidths(
const Block & block, WidthsPerColumn & widths, Widths & max_widths, Widths & name_widths, const FormatSettings & format_settings)
{
size_t rows = block.rows();
size_t columns = block.columns();
widths.resize(columns);
max_widths.resize_fill(columns);
name_widths.resize(columns);
/// Calculate widths of all values.
String serialized_value;
size_t prefix = 2; // Tab character adjustment
for (size_t i = 0; i < columns; ++i)
{
const ColumnWithTypeAndName & elem = block.getByPosition(i);
widths[i].resize(rows);
for (size_t j = 0; j < rows; ++j)
{
{
WriteBufferFromString out(serialized_value);
elem.type->serializeAsText(*elem.column, j, out, format_settings);
}
widths[i][j] = std::min<UInt64>(format_settings.pretty.max_column_pad_width,
UTF8::computeWidth(reinterpret_cast<const UInt8 *>(serialized_value.data()), serialized_value.size(), prefix));
max_widths[i] = std::max(max_widths[i], widths[i][j]);
}
/// And also calculate widths for names of columns.
{
// name string doesn't contain Tab, no need to pass `prefix`
name_widths[i] = std::min<UInt64>(format_settings.pretty.max_column_pad_width,
UTF8::computeWidth(reinterpret_cast<const UInt8 *>(elem.name.data()), elem.name.size()));
max_widths[i] = std::max(max_widths[i], name_widths[i]);
}
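/// Advance the tab-stop prefix past this column; the extra 3 presumably accounts for the " │ " delimiter written between columns.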
prefix += max_widths[i] + 3;
}
}
void PrettyBlockOutputStream::write(const Block & block)
{
UInt64 max_rows = format_settings.pretty.max_rows;
if (total_rows >= max_rows)
{
total_rows += block.rows();
return;
}
size_t rows = block.rows();
size_t columns = block.columns();
WidthsPerColumn widths;
Widths max_widths;
Widths name_widths;
calculateWidths(block, widths, max_widths, name_widths, format_settings);
/// Create separators
std::stringstream top_separator;
std::stringstream middle_names_separator;
std::stringstream middle_values_separator;
std::stringstream bottom_separator;
top_separator << "";
middle_names_separator << "";
middle_values_separator << "";
bottom_separator << "";
for (size_t i = 0; i < columns; ++i)
{
if (i != 0)
{
top_separator << "";
middle_names_separator << "";
middle_values_separator << "";
bottom_separator << "";
}
for (size_t j = 0; j < max_widths[i] + 2; ++j)
{
top_separator << "";
middle_names_separator << "";
middle_values_separator << "";
bottom_separator << "";
}
}
top_separator << "\n";
middle_names_separator << "\n";
middle_values_separator << "\n";
bottom_separator << "\n";
std::string top_separator_s = top_separator.str();
std::string middle_names_separator_s = middle_names_separator.str();
std::string middle_values_separator_s = middle_values_separator.str();
std::string bottom_separator_s = bottom_separator.str();
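/// Assembled, these separators produce the usual Pretty grid, e.g. for a single one-character column:
/// ┏━━━┓ (top), ┡━━━┩ (under the names), ├───┤ (between rows), └───┘ (bottom).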
/// Output the block
writeString(top_separator_s, ostr);
/// Names
writeCString("", ostr);
for (size_t i = 0; i < columns; ++i)
{
if (i != 0)
writeCString("", ostr);
const ColumnWithTypeAndName & col = block.getByPosition(i);
if (format_settings.pretty.color)
writeCString("\033[1m", ostr);
if (col.type->shouldAlignRightInPrettyFormats())
{
for (size_t k = 0; k < max_widths[i] - name_widths[i]; ++k)
writeChar(' ', ostr);
writeString(col.name, ostr);
}
else
{
writeString(col.name, ostr);
for (size_t k = 0; k < max_widths[i] - name_widths[i]; ++k)
writeChar(' ', ostr);
}
if (format_settings.pretty.color)
writeCString("\033[0m", ostr);
}
writeCString("\n", ostr);
writeString(middle_names_separator_s, ostr);
for (size_t i = 0; i < rows && total_rows + i < max_rows; ++i)
{
if (i != 0)
writeString(middle_values_separator_s, ostr);
writeCString("", ostr);
for (size_t j = 0; j < columns; ++j)
{
if (j != 0)
writeCString("", ostr);
writeValueWithPadding(block.getByPosition(j), i, widths[j].empty() ? max_widths[j] : widths[j][i], max_widths[j]);
}
writeCString("\n", ostr);
}
writeString(bottom_separator_s, ostr);
total_rows += rows;
}
void PrettyBlockOutputStream::writeValueWithPadding(const ColumnWithTypeAndName & elem, size_t row_num, size_t value_width, size_t pad_to_width)
{
auto writePadding = [&]()
{
for (size_t k = 0; k < pad_to_width - value_width; ++k)
writeChar(' ', ostr);
};
if (elem.type->shouldAlignRightInPrettyFormats())
{
writePadding();
elem.type->serializeAsText(*elem.column.get(), row_num, ostr, format_settings);
}
else
{
elem.type->serializeAsText(*elem.column.get(), row_num, ostr, format_settings);
writePadding();
}
}
void PrettyBlockOutputStream::writeSuffix()
{
if (total_rows >= format_settings.pretty.max_rows)
{
writeCString(" Showed first ", ostr);
writeIntText(format_settings.pretty.max_rows, ostr);
writeCString(".\n", ostr);
}
total_rows = 0;
writeTotals();
writeExtremes();
}
void PrettyBlockOutputStream::writeTotals()
{
if (totals)
{
writeCString("\nTotals:\n", ostr);
write(totals);
}
}
void PrettyBlockOutputStream::writeExtremes()
{
if (extremes)
{
writeCString("\nExtremes:\n", ostr);
write(extremes);
}
}
void registerOutputFormatPretty(FormatFactory & factory)
{
factory.registerOutputFormat("Pretty", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<PrettyBlockOutputStream>(buf, sample, format_settings);
});
factory.registerOutputFormat("PrettyNoEscapes", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
FormatSettings changed_settings = format_settings;
changed_settings.pretty.color = false;
return std::make_shared<PrettyBlockOutputStream>(buf, sample, changed_settings);
});
}
}

View File

@ -1,55 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <DataStreams/IBlockOutputStream.h>
#include <Formats/FormatSettings.h>
namespace DB
{
class WriteBuffer;
class Context;
/** Prints the result in the form of beautiful tables.
*/
class PrettyBlockOutputStream : public IBlockOutputStream
{
public:
/// no_escapes - do not use ANSI escape sequences - to display in the browser, not in the console.
PrettyBlockOutputStream(WriteBuffer & ostr_, const Block & header_, const FormatSettings & format_settings);
Block getHeader() const override { return header; }
void write(const Block & block) override;
void writeSuffix() override;
void flush() override;
void setTotals(const Block & totals_) override { totals = totals_; }
void setExtremes(const Block & extremes_) override { extremes = extremes_; }
protected:
void writeTotals();
void writeExtremes();
WriteBuffer & ostr;
const Block header;
size_t total_rows = 0;
size_t terminal_width = 0;
const FormatSettings format_settings;
Block totals;
Block extremes;
using Widths = PODArray<size_t>;
using WidthsPerColumn = std::vector<Widths>;
static void calculateWidths(
const Block & block, WidthsPerColumn & widths, Widths & max_widths, Widths & name_widths, const FormatSettings & format_settings);
void writeValueWithPadding(const ColumnWithTypeAndName & elem, size_t row_num, size_t value_width, size_t pad_to_width);
};
}

View File

@ -1,161 +0,0 @@
#include <IO/WriteBuffer.h>
#include <IO/WriteHelpers.h>
#include <DataStreams/SquashingBlockOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/PrettyCompactBlockOutputStream.h>
namespace DB
{
namespace ErrorCodes
{
extern const int ILLEGAL_COLUMN;
}
void PrettyCompactBlockOutputStream::writeHeader(
const Block & block,
const Widths & max_widths,
const Widths & name_widths)
{
/// Names
writeCString("┌─", ostr);
for (size_t i = 0; i < max_widths.size(); ++i)
{
if (i != 0)
writeCString("─┬─", ostr);
const ColumnWithTypeAndName & col = block.getByPosition(i);
if (col.type->shouldAlignRightInPrettyFormats())
{
for (size_t k = 0; k < max_widths[i] - name_widths[i]; ++k)
writeCString("", ostr);
if (format_settings.pretty.color)
writeCString("\033[1m", ostr);
writeString(col.name, ostr);
if (format_settings.pretty.color)
writeCString("\033[0m", ostr);
}
else
{
if (format_settings.pretty.color)
writeCString("\033[1m", ostr);
writeString(col.name, ostr);
if (format_settings.pretty.color)
writeCString("\033[0m", ostr);
for (size_t k = 0; k < max_widths[i] - name_widths[i]; ++k)
writeCString("", ostr);
}
}
writeCString("─┐\n", ostr);
}
void PrettyCompactBlockOutputStream::writeBottom(const Widths & max_widths)
{
/// Create delimiters
std::stringstream bottom_separator;
bottom_separator << "";
for (size_t i = 0; i < max_widths.size(); ++i)
{
if (i != 0)
bottom_separator << "";
for (size_t j = 0; j < max_widths[i] + 2; ++j)
bottom_separator << "";
}
bottom_separator << "\n";
writeString(bottom_separator.str(), ostr);
}
void PrettyCompactBlockOutputStream::writeRow(
size_t row_num,
const Block & block,
const WidthsPerColumn & widths,
const Widths & max_widths)
{
size_t columns = max_widths.size();
writeCString("", ostr);
for (size_t j = 0; j < columns; ++j)
{
if (j != 0)
writeCString("", ostr);
writeValueWithPadding(block.getByPosition(j), row_num, widths[j].empty() ? max_widths[j] : widths[j][row_num], max_widths[j]);
}
writeCString("\n", ostr);
}
void PrettyCompactBlockOutputStream::write(const Block & block)
{
UInt64 max_rows = format_settings.pretty.max_rows;
if (total_rows >= max_rows)
{
total_rows += block.rows();
return;
}
size_t rows = block.rows();
WidthsPerColumn widths;
Widths max_widths;
Widths name_widths;
calculateWidths(block, widths, max_widths, name_widths, format_settings);
writeHeader(block, max_widths, name_widths);
for (size_t i = 0; i < rows && total_rows + i < max_rows; ++i)
writeRow(i, block, widths, max_widths);
writeBottom(max_widths);
total_rows += rows;
}
void registerOutputFormatPrettyCompact(FormatFactory & factory)
{
factory.registerOutputFormat("PrettyCompact", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<PrettyCompactBlockOutputStream>(buf, sample, format_settings);
});
factory.registerOutputFormat("PrettyCompactNoEscapes", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
FormatSettings changed_settings = format_settings;
changed_settings.pretty.color = false;
return std::make_shared<PrettyCompactBlockOutputStream>(buf, sample, changed_settings);
});
factory.registerOutputFormat("PrettyCompactMonoBlock", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
BlockOutputStreamPtr impl = std::make_shared<PrettyCompactBlockOutputStream>(buf, sample, format_settings);
auto res = std::make_shared<SquashingBlockOutputStream>(impl, impl->getHeader(), format_settings.pretty.max_rows, 0);
res->disableFlush();
return res;
});
}
}

View File

@ -1,25 +0,0 @@
#pragma once
#include <Formats/PrettyBlockOutputStream.h>
namespace DB
{
/** Prints the result in the form of beautiful tables, but with fewer delimiter lines.
*/
class PrettyCompactBlockOutputStream : public PrettyBlockOutputStream
{
public:
PrettyCompactBlockOutputStream(WriteBuffer & ostr_, const Block & header_, const FormatSettings & format_settings_)
: PrettyBlockOutputStream(ostr_, header_, format_settings_) {}
void write(const Block & block) override;
protected:
void writeHeader(const Block & block, const Widths & max_widths, const Widths & name_widths);
void writeBottom(const Widths & max_widths);
void writeRow(size_t row_num, const Block & block, const WidthsPerColumn & widths, const Widths & max_widths);
};
}

View File

@ -1,116 +0,0 @@
#include <IO/WriteBuffer.h>
#include <IO/WriteHelpers.h>
#include <Formats/FormatFactory.h>
#include <Formats/PrettySpaceBlockOutputStream.h>
namespace DB
{
void PrettySpaceBlockOutputStream::write(const Block & block)
{
UInt64 max_rows = format_settings.pretty.max_rows;
if (total_rows >= max_rows)
{
total_rows += block.rows();
return;
}
size_t rows = block.rows();
size_t columns = block.columns();
WidthsPerColumn widths;
Widths max_widths;
Widths name_widths;
calculateWidths(block, widths, max_widths, name_widths, format_settings);
/// Names
for (size_t i = 0; i < columns; ++i)
{
if (i != 0)
writeCString(" ", ostr);
const ColumnWithTypeAndName & col = block.getByPosition(i);
if (col.type->shouldAlignRightInPrettyFormats())
{
for (ssize_t k = 0; k < std::max(static_cast<ssize_t>(0), static_cast<ssize_t>(max_widths[i] - name_widths[i])); ++k)
writeChar(' ', ostr);
if (format_settings.pretty.color)
writeCString("\033[1m", ostr);
writeString(col.name, ostr);
if (format_settings.pretty.color)
writeCString("\033[0m", ostr);
}
else
{
if (format_settings.pretty.color)
writeCString("\033[1m", ostr);
writeString(col.name, ostr);
if (format_settings.pretty.color)
writeCString("\033[0m", ostr);
for (ssize_t k = 0; k < std::max(static_cast<ssize_t>(0), static_cast<ssize_t>(max_widths[i] - name_widths[i])); ++k)
writeChar(' ', ostr);
}
}
writeCString("\n\n", ostr);
for (size_t i = 0; i < rows && total_rows + i < max_rows; ++i)
{
for (size_t j = 0; j < columns; ++j)
{
if (j != 0)
writeCString(" ", ostr);
writeValueWithPadding(block.getByPosition(j), i, widths[j].empty() ? max_widths[j] : widths[j][i], max_widths[j]);
}
writeChar('\n', ostr);
}
total_rows += rows;
}
void PrettySpaceBlockOutputStream::writeSuffix()
{
if (total_rows >= format_settings.pretty.max_rows)
{
writeCString("\nShowed first ", ostr);
writeIntText(format_settings.pretty.max_rows, ostr);
writeCString(".\n", ostr);
}
total_rows = 0;
writeTotals();
writeExtremes();
}
void registerOutputFormatPrettySpace(FormatFactory & factory)
{
factory.registerOutputFormat("PrettySpace", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
return std::make_shared<PrettySpaceBlockOutputStream>(buf, sample, format_settings);
});
factory.registerOutputFormat("PrettySpaceNoEscapes", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & format_settings)
{
FormatSettings changed_settings = format_settings;
changed_settings.pretty.color = false;
return std::make_shared<PrettySpaceBlockOutputStream>(buf, sample, changed_settings);
});
}
}

View File

@ -1,21 +0,0 @@
#pragma once
#include <Formats/PrettyBlockOutputStream.h>
namespace DB
{
/** Prints the result, aligned with spaces.
*/
class PrettySpaceBlockOutputStream : public PrettyBlockOutputStream
{
public:
PrettySpaceBlockOutputStream(WriteBuffer & ostr_, const Block & header_, const FormatSettings & format_settings_)
: PrettyBlockOutputStream(ostr_, header_, format_settings_) {}
void write(const Block & block) override;
void writeSuffix() override;
};
}

View File

@ -1,94 +0,0 @@
#include "ProtobufRowInputStream.h"
#if USE_PROTOBUF
#include <Core/Block.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/FormatSchemaInfo.h>
#include <Formats/ProtobufSchemas.h>
namespace DB
{
ProtobufRowInputStream::ProtobufRowInputStream(ReadBuffer & in_, const Block & header, const FormatSchemaInfo & format_schema)
: data_types(header.getDataTypes()), reader(in_, ProtobufSchemas::instance().getMessageTypeForFormatSchema(format_schema), header.getNames())
{
}
ProtobufRowInputStream::~ProtobufRowInputStream() = default;
bool ProtobufRowInputStream::read(MutableColumns & columns, RowReadExtension & extra)
{
if (!reader.startMessage())
return false; // EOF reached, no more messages.
// Set of columns for which the values were read. The rest will be filled with default values.
auto & read_columns = extra.read_columns;
read_columns.assign(columns.size(), false);
// Read values from this message and put them to the columns while it's possible.
size_t column_index;
while (reader.readColumnIndex(column_index))
{
bool allow_add_row = !static_cast<bool>(read_columns[column_index]);
do
{
bool row_added;
data_types[column_index]->deserializeProtobuf(*columns[column_index], reader, allow_add_row, row_added);
if (row_added)
{
read_columns[column_index] = true;
allow_add_row = false;
}
} while (reader.canReadMoreValues());
}
// Fill non-visited columns with the default values.
for (column_index = 0; column_index < read_columns.size(); ++column_index)
if (!read_columns[column_index])
data_types[column_index]->insertDefaultInto(*columns[column_index]);
reader.endMessage();
return true;
}
bool ProtobufRowInputStream::allowSyncAfterError() const
{
return true;
}
void ProtobufRowInputStream::syncAfterError()
{
reader.endMessage(true);
}
void registerInputFormatProtobuf(FormatFactory & factory)
{
factory.registerInputFormat("Protobuf", [](
ReadBuffer & buf,
const Block & sample,
const Context & context,
UInt64 max_block_size,
UInt64 rows_portion_size,
FormatFactory::ReadCallback callback,
const FormatSettings & settings)
{
return std::make_shared<BlockInputStreamFromRowInputStream>(
std::make_shared<ProtobufRowInputStream>(buf, sample, FormatSchemaInfo(context, "Protobuf")),
sample, max_block_size, rows_portion_size, callback, settings);
});
}
}
#else
namespace DB
{
class FormatFactory;
void registerInputFormatProtobuf(FormatFactory &) {}
}
#endif

View File

@ -1,40 +0,0 @@
#pragma once
#include "config_formats.h"
#if USE_PROTOBUF
#include <DataTypes/IDataType.h>
#include <Formats/IRowInputStream.h>
#include <Formats/ProtobufReader.h>
namespace DB
{
class Block;
class FormatSchemaInfo;
/** Stream designed to deserialize data from the google protobuf format.
* Each row is read as a separate message.
* These messages are delimited as described in the documentation:
* https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/util/delimited_message_util.h
* Parsing data in this format requires the 'format_schema' setting to be set, e.g.
* INSERT INTO table FORMAT Protobuf SETTINGS format_schema = 'schema:Message'
* where schema is the name of "schema.proto" file specifying protobuf schema.
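* (In this delimited encoding, each message is preceded by its byte length encoded as a varint.)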
*/
class ProtobufRowInputStream : public IRowInputStream
{
public:
ProtobufRowInputStream(ReadBuffer & in_, const Block & header, const FormatSchemaInfo & format_schema);
~ProtobufRowInputStream() override;
bool read(MutableColumns & columns, RowReadExtension & extra) override;
bool allowSyncAfterError() const override;
void syncAfterError() override;
private:
DataTypes data_types;
ProtobufReader reader;
};
}
#endif

View File

@ -1,55 +0,0 @@
#include <Formats/FormatFactory.h>
#include "config_formats.h"
#if USE_PROTOBUF
#include "ProtobufRowOutputStream.h"
#include <Core/Block.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <Formats/FormatSchemaInfo.h>
#include <Formats/ProtobufSchemas.h>
#include <google/protobuf/descriptor.h>
namespace DB
{
ProtobufRowOutputStream::ProtobufRowOutputStream(WriteBuffer & out, const Block & header, const FormatSchemaInfo & format_schema)
: data_types(header.getDataTypes()), writer(out, ProtobufSchemas::instance().getMessageTypeForFormatSchema(format_schema), header.getNames())
{
value_indices.resize(header.columns());
}
void ProtobufRowOutputStream::write(const Block & block, size_t row_num)
{
writer.startMessage();
std::fill(value_indices.begin(), value_indices.end(), 0);
size_t column_index;
while (writer.writeField(column_index))
data_types[column_index]->serializeProtobuf(
*block.getByPosition(column_index).column, row_num, writer, value_indices[column_index]);
writer.endMessage();
}
void registerOutputFormatProtobuf(FormatFactory & factory)
{
factory.registerOutputFormat(
"Protobuf", [](WriteBuffer & buf, const Block & header, const Context & context, const FormatSettings &)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<ProtobufRowOutputStream>(buf, header, FormatSchemaInfo(context, "Protobuf")), header);
});
}
}
#else
namespace DB
{
class FormatFactory;
void registerOutputFormatProtobuf(FormatFactory &) {}
}
#endif

View File

@ -1,44 +0,0 @@
#pragma once
#include <DataTypes/IDataType.h>
#include <Formats/IRowOutputStream.h>
#include <Formats/ProtobufWriter.h>
namespace google
{
namespace protobuf
{
class Message;
}
}
namespace DB
{
class Block;
class FormatSchemaInfo;
/** Stream designed to serialize data in the google protobuf format.
* Each row is written as a separate message.
* These messages are delimited as described in the documentation:
* https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/util/delimited_message_util.h
* Serializing in the protobuf format requires the 'format_schema' setting to be set, e.g.
* SELECT * from table FORMAT Protobuf SETTINGS format_schema = 'schema:Message'
* where schema is the name of "schema.proto" file specifying protobuf schema.
*/
class ProtobufRowOutputStream : public IRowOutputStream
{
public:
ProtobufRowOutputStream(WriteBuffer & out, const Block & header, const FormatSchemaInfo & format_schema);
void write(const Block & block, size_t row_num) override;
std::string getContentType() const override { return "application/octet-stream"; }
private:
DataTypes data_types;
ProtobufWriter writer;
std::vector<size_t> value_indices;
};
}

View File

@ -1,211 +0,0 @@
#include <IO/ReadHelpers.h>
#include <Formats/TSKVRowInputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
namespace DB
{
namespace ErrorCodes
{
extern const int INCORRECT_DATA;
extern const int CANNOT_PARSE_ESCAPE_SEQUENCE;
extern const int CANNOT_READ_ALL_DATA;
extern const int CANNOT_PARSE_INPUT_ASSERTION_FAILED;
}
TSKVRowInputStream::TSKVRowInputStream(ReadBuffer & istr_, const Block & header_, const FormatSettings & format_settings)
: istr(istr_), header(header_), format_settings(format_settings), name_map(header.columns())
{
/// In this format, we assume that a column name cannot contain a BOM,
/// so a BOM at the beginning of the stream cannot be confused with a field name, and it is safe to skip it.
skipBOMIfExists(istr);
size_t num_columns = header.columns();
for (size_t i = 0; i < num_columns; ++i)
name_map[header.safeGetByPosition(i).name] = i; /// NOTE You could place names more cache-locally.
}
/** Read the field name in the `tskv` format.
* Return true if the field is followed by an equal sign,
* otherwise (field with no value) return false.
* The reference to the field name will be written to `ref`.
* A temporary `tmp` buffer can also be used to copy the field name to it.
* When reading, skips the name and the equal sign after it.
*/
static bool readName(ReadBuffer & buf, StringRef & ref, String & tmp)
{
tmp.clear();
while (!buf.eof())
{
const char * next_pos = find_first_symbols<'\t', '\n', '\\', '='>(buf.position(), buf.buffer().end());
if (next_pos == buf.buffer().end())
{
tmp.append(buf.position(), next_pos - buf.position());
buf.next();
continue;
}
/// Came to the end of the name.
if (*next_pos != '\\')
{
bool have_value = *next_pos == '=';
if (tmp.empty())
{
/// No need to copy data, you can refer directly to the `buf`.
ref = StringRef(buf.position(), next_pos - buf.position());
buf.position() += next_pos + have_value - buf.position();
}
else
{
/// Copy the data to a temporary string and return a reference to it.
tmp.append(buf.position(), next_pos - buf.position());
buf.position() += next_pos + have_value - buf.position();
ref = StringRef(tmp);
}
return have_value;
}
/// The name has an escape sequence.
else
{
tmp.append(buf.position(), next_pos - buf.position());
buf.position() += next_pos + 1 - buf.position();
if (buf.eof())
throw Exception("Cannot parse escape sequence", ErrorCodes::CANNOT_PARSE_ESCAPE_SEQUENCE);
tmp.push_back(parseEscapeSequence(*buf.position()));
++buf.position();
continue;
}
}
throw Exception("Unexpected end of stream while reading key name from TSKV format", ErrorCodes::CANNOT_READ_ALL_DATA);
}
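/// For example, given the input "host\=name=value", readName returns true with `ref` == "host=name"
/// (the escaped '=' is unescaped into `tmp`) and leaves the buffer positioned at "value".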
bool TSKVRowInputStream::read(MutableColumns & columns, RowReadExtension & ext)
{
if (istr.eof())
return false;
size_t num_columns = columns.size();
/// Set of columns for which the values were read. The rest will be filled with default values.
read_columns.assign(num_columns, false);
if (unlikely(*istr.position() == '\n'))
{
/// An empty string. It is permissible, but it is unclear why.
++istr.position();
}
else
{
while (true)
{
StringRef name_ref;
bool has_value = readName(istr, name_ref, name_buf);
ssize_t index = -1;
if (has_value)
{
/// NOTE Optimization is possible by caching the order of fields (which is almost always the same)
/// and quickly checking for the next expected field, instead of searching the hash table.
auto it = name_map.find(name_ref);
if (name_map.end() == it)
{
if (!format_settings.skip_unknown_fields)
throw Exception("Unknown field found while parsing TSKV format: " + name_ref.toString(), ErrorCodes::INCORRECT_DATA);
/// If the key is not found, skip the value.
NullSink sink;
readEscapedStringInto(sink, istr);
}
else
{
index = it->getSecond();
if (read_columns[index])
throw Exception("Duplicate field found while parsing TSKV format: " + name_ref.toString(), ErrorCodes::INCORRECT_DATA);
read_columns[index] = true;
header.getByPosition(index).type->deserializeAsTextEscaped(*columns[index], istr, format_settings);
}
}
else
{
/// The only field that may appear without a value is the useless "tskv" fragment, which is ignored.
if (!(name_ref.size == 4 && 0 == memcmp(name_ref.data, "tskv", 4)))
throw Exception("Found field without value while parsing TSKV format: " + name_ref.toString(), ErrorCodes::INCORRECT_DATA);
}
if (istr.eof())
{
throw Exception("Unexpected end of stream after field in TSKV format: " + name_ref.toString(), ErrorCodes::CANNOT_READ_ALL_DATA);
}
else if (*istr.position() == '\t')
{
++istr.position();
continue;
}
else if (*istr.position() == '\n')
{
++istr.position();
break;
}
else
{
/// Garbage may have been written into the column; remove it.
if (index >= 0)
{
columns[index]->popBack(1);
read_columns[index] = false;
}
throw Exception("Found garbage after field in TSKV format: " + name_ref.toString(), ErrorCodes::CANNOT_PARSE_INPUT_ASSERTION_FAILED);
}
}
}
/// Fill the columns that were not encountered with default values.
for (size_t i = 0; i < num_columns; ++i)
if (!read_columns[i])
header.getByPosition(i).type->insertDefaultInto(*columns[i]);
/// return info about defaults set
ext.read_columns = read_columns;
return true;
}
void TSKVRowInputStream::syncAfterError()
{
skipToUnescapedNextLineOrEOF(istr);
}
void registerInputFormatTSKV(FormatFactory & factory)
{
factory.registerInputFormat("TSKV", [](
ReadBuffer & buf,
const Block & sample,
const Context &,
UInt64 max_block_size,
UInt64 rows_portion_size,
FormatFactory::ReadCallback callback,
const FormatSettings & settings)
{
return std::make_shared<BlockInputStreamFromRowInputStream>(
std::make_shared<TSKVRowInputStream>(buf, sample, settings),
sample, max_block_size, rows_portion_size, callback, settings);
});
}
}

View File

@ -1,48 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <Formats/IRowInputStream.h>
#include <Formats/FormatSettings.h>
#include <Common/HashTable/HashMap.h>
namespace DB
{
class ReadBuffer;
/** Stream for reading data in TSKV format.
* TSKV is a very inefficient data format.
* Similar to TSV, but each field is written as key=value.
* Fields can be listed in any order (and the order may differ from line to line),
* and some fields may be missing.
* An equal sign can be escaped in the field name.
* Additionally, a useless "tskv" fragment may appear as an extra element; it must be ignored.
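* Example of a row (with \t denoting the tab separator): "tskv\tstatus=200\tpath=/index".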
*/
class TSKVRowInputStream : public IRowInputStream
{
public:
TSKVRowInputStream(ReadBuffer & istr_, const Block & header_, const FormatSettings & format_settings);
bool read(MutableColumns & columns, RowReadExtension &) override;
bool allowSyncAfterError() const override { return true; }
void syncAfterError() override;
private:
ReadBuffer & istr;
Block header;
const FormatSettings format_settings;
/// Buffer for a field name read from the stream. Used when the name has to be copied.
String name_buf;
/// Hash table matching `field name -> position in the block`. NOTE You can use perfect hash map.
using NameMap = HashMap<StringRef, size_t, StringRefHash>;
NameMap name_map;
std::vector<UInt8> read_columns;
};
}

View File

@ -1,56 +0,0 @@
#include <IO/WriteHelpers.h>
#include <IO/WriteBufferFromString.h>
#include <Formats/TSKVRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
namespace DB
{
TSKVRowOutputStream::TSKVRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & format_settings_)
: TabSeparatedRowOutputStream(ostr_, sample_, false, false, format_settings_)
{
NamesAndTypesList columns(sample_.getNamesAndTypesList());
fields.assign(columns.begin(), columns.end());
for (auto & field : fields)
{
WriteBufferFromOwnString wb;
writeAnyEscapedString<'='>(field.name.data(), field.name.data() + field.name.size(), wb);
writeCString("=", wb);
field.name = wb.str();
}
}
void TSKVRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
writeString(fields[field_number].name, ostr);
type.serializeAsTextEscaped(column, row_num, ostr, format_settings);
++field_number;
}
void TSKVRowOutputStream::writeRowEndDelimiter()
{
writeChar('\n', ostr);
field_number = 0;
}
void registerOutputFormatTSKV(FormatFactory & factory)
{
factory.registerOutputFormat("TSKV", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<TSKVRowOutputStream>(buf, sample, settings), sample);
});
}
}

View File

@ -1,27 +0,0 @@
#pragma once
#include <Formats/FormatSettings.h>
#include <Formats/TabSeparatedRowOutputStream.h>
namespace DB
{
/** The stream for outputting data in the TSKV format.
* TSKV is similar to TabSeparated, but before every value, its name and equal sign are specified: name=value.
* This format is very inefficient.
*/
class TSKVRowOutputStream : public TabSeparatedRowOutputStream
{
public:
TSKVRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & format_settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeRowEndDelimiter() override;
protected:
NamesAndTypes fields;
size_t field_number = 0;
};
}

View File

@ -1,25 +0,0 @@
#pragma once
#include <Formats/TabSeparatedRowOutputStream.h>
namespace DB
{
struct FormatSettings;
/** A stream for outputting data in tsv format, but without escaping individual values.
* (That is, the output is irreversible.)
*/
class TabSeparatedRawRowOutputStream : public TabSeparatedRowOutputStream
{
public:
TabSeparatedRawRowOutputStream(WriteBuffer & ostr_, const Block & sample_, bool with_names_, bool with_types_, const FormatSettings & format_settings_)
: TabSeparatedRowOutputStream(ostr_, sample_, with_names_, with_types_, format_settings_) {}
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override
{
type.serializeAsText(column, row_num, ostr, format_settings);
}
};
}

View File

@ -1,179 +0,0 @@
#include <Formats/TabSeparatedRowOutputStream.h>
#include <Formats/TabSeparatedRawRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <IO/WriteHelpers.h>
namespace DB
{
TabSeparatedRowOutputStream::TabSeparatedRowOutputStream(
WriteBuffer & ostr_, const Block & sample_, bool with_names_, bool with_types_, const FormatSettings & format_settings)
: ostr(ostr_), sample(sample_), with_names(with_names_), with_types(with_types_), format_settings(format_settings)
{
}
void TabSeparatedRowOutputStream::flush()
{
ostr.next();
}
void TabSeparatedRowOutputStream::writePrefix()
{
size_t columns = sample.columns();
if (with_names)
{
for (size_t i = 0; i < columns; ++i)
{
writeEscapedString(sample.safeGetByPosition(i).name, ostr);
writeChar(i == columns - 1 ? '\n' : '\t', ostr);
}
}
if (with_types)
{
for (size_t i = 0; i < columns; ++i)
{
writeEscapedString(sample.safeGetByPosition(i).type->getName(), ostr);
writeChar(i == columns - 1 ? '\n' : '\t', ostr);
}
}
}
void TabSeparatedRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
type.serializeAsTextEscaped(column, row_num, ostr, format_settings);
}
void TabSeparatedRowOutputStream::writeFieldDelimiter()
{
writeChar('\t', ostr);
}
void TabSeparatedRowOutputStream::writeRowEndDelimiter()
{
writeChar('\n', ostr);
}
void TabSeparatedRowOutputStream::writeSuffix()
{
writeTotals();
writeExtremes();
}
void TabSeparatedRowOutputStream::writeTotals()
{
if (totals)
{
size_t columns = totals.columns();
writeChar('\n', ostr);
writeRowStartDelimiter();
for (size_t j = 0; j < columns; ++j)
{
if (j != 0)
writeFieldDelimiter();
writeField(*totals.getByPosition(j).column.get(), *totals.getByPosition(j).type.get(), 0);
}
writeRowEndDelimiter();
}
}
void TabSeparatedRowOutputStream::writeExtremes()
{
if (extremes)
{
size_t rows = extremes.rows();
size_t columns = extremes.columns();
writeChar('\n', ostr);
for (size_t i = 0; i < rows; ++i)
{
if (i != 0)
writeRowBetweenDelimiter();
writeRowStartDelimiter();
for (size_t j = 0; j < columns; ++j)
{
if (j != 0)
writeFieldDelimiter();
writeField(*extremes.getByPosition(j).column.get(), *extremes.getByPosition(j).type.get(), i);
}
writeRowEndDelimiter();
}
}
}
void registerOutputFormatTabSeparated(FormatFactory & factory)
{
for (auto name : {"TabSeparated", "TSV"})
{
factory.registerOutputFormat(name, [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<TabSeparatedRowOutputStream>(buf, sample, false, false, settings), sample);
});
}
for (auto name : {"TabSeparatedRaw", "TSVRaw"})
{
factory.registerOutputFormat(name, [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<TabSeparatedRawRowOutputStream>(buf, sample, false, false, settings), sample);
});
}
for (auto name : {"TabSeparatedWithNames", "TSVWithNames"})
{
factory.registerOutputFormat(name, [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<TabSeparatedRowOutputStream>(buf, sample, true, false, settings), sample);
});
}
for (auto name : {"TabSeparatedWithNamesAndTypes", "TSVWithNamesAndTypes"})
{
factory.registerOutputFormat(name, [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<TabSeparatedRowOutputStream>(buf, sample, true, true, settings), sample);
});
}
}
}

View File

@ -1,51 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <Formats/FormatSettings.h>
#include <Formats/IRowOutputStream.h>
namespace DB
{
class WriteBuffer;
/** A stream for outputting data in tsv format.
*/
class TabSeparatedRowOutputStream : public IRowOutputStream
{
public:
/** with_names - output a header with the column names on the first line
* with_types - output a header with the type names on the next line
*/
TabSeparatedRowOutputStream(WriteBuffer & ostr_, const Block & sample_, bool with_names_, bool with_types_, const FormatSettings & format_settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeFieldDelimiter() override;
void writeRowEndDelimiter() override;
void writePrefix() override;
void writeSuffix() override;
void flush() override;
void setTotals(const Block & totals_) override { totals = totals_; }
void setExtremes(const Block & extremes_) override { extremes = extremes_; }
/// https://www.iana.org/assignments/media-types/text/tab-separated-values
String getContentType() const override { return "text/tab-separated-values; charset=UTF-8"; }
protected:
void writeTotals();
void writeExtremes();
WriteBuffer & ostr;
const Block sample;
bool with_names;
bool with_types;
const FormatSettings format_settings;
Block totals;
Block extremes;
};
}

View File

@ -1,168 +0,0 @@
#include <IO/ReadHelpers.h>
#include <Interpreters/evaluateConstantExpression.h>
#include <Interpreters/Context.h>
#include <Interpreters/convertFieldToType.h>
#include <Parsers/TokenIterator.h>
#include <Parsers/ExpressionListParsers.h>
#include <Formats/ValuesRowInputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
#include <Common/FieldVisitors.h>
#include <Core/Block.h>
#include <Common/typeid_cast.h>
namespace DB
{
namespace ErrorCodes
{
extern const int CANNOT_PARSE_INPUT_ASSERTION_FAILED;
extern const int CANNOT_PARSE_QUOTED_STRING;
extern const int CANNOT_PARSE_NUMBER;
extern const int CANNOT_PARSE_DATE;
extern const int CANNOT_PARSE_DATETIME;
extern const int CANNOT_READ_ARRAY_FROM_TEXT;
extern const int SYNTAX_ERROR;
extern const int VALUE_IS_OUT_OF_RANGE_OF_DATA_TYPE;
}
ValuesRowInputStream::ValuesRowInputStream(ReadBuffer & istr_, const Block & header_, const Context & context_, const FormatSettings & format_settings)
: istr(istr_), header(header_), context(std::make_unique<Context>(context_)), format_settings(format_settings)
{
/// In this format, BOM at beginning of stream cannot be confused with value, so it is safe to skip it.
skipBOMIfExists(istr);
}
bool ValuesRowInputStream::read(MutableColumns & columns, RowReadExtension &)
{
size_t num_columns = columns.size();
skipWhitespaceIfAny(istr);
if (istr.eof() || *istr.position() == ';')
return false;
/** Usually this format is handled by the fast streaming parser.
* As an exception, it also supports evaluating arbitrary expressions in place of values.
* This is very inefficient. But if there are no expressions, there is no overhead.
*/
ParserExpression parser;
assertChar('(', istr);
for (size_t i = 0; i < num_columns; ++i)
{
skipWhitespaceIfAny(istr);
char * prev_istr_position = istr.position();
size_t prev_istr_bytes = istr.count() - istr.offset();
bool rollback_on_exception = false;
try
{
header.getByPosition(i).type->deserializeAsTextQuoted(*columns[i], istr, format_settings);
rollback_on_exception = true;
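/// From this point on, a value has been inserted into the column,
/// so it must be popped back if a later assertion throws.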
skipWhitespaceIfAny(istr);
if (i != num_columns - 1)
assertChar(',', istr);
else
assertChar(')', istr);
}
catch (const Exception & e)
{
if (!format_settings.values.interpret_expressions)
throw;
/** The normal streaming parser could not parse the value.
* Let's try to parse it with a SQL parser as a constant expression.
* This is an exceptional case.
*/
if (e.code() == ErrorCodes::CANNOT_PARSE_INPUT_ASSERTION_FAILED
|| e.code() == ErrorCodes::CANNOT_PARSE_QUOTED_STRING
|| e.code() == ErrorCodes::CANNOT_PARSE_NUMBER
|| e.code() == ErrorCodes::CANNOT_PARSE_DATE
|| e.code() == ErrorCodes::CANNOT_PARSE_DATETIME
|| e.code() == ErrorCodes::CANNOT_READ_ARRAY_FROM_TEXT)
{
/// TODO Case when the expression does not fit entirely in the buffer.
/// If the beginning of the value is no longer in the buffer.
if (istr.count() - istr.offset() != prev_istr_bytes)
throw;
if (rollback_on_exception)
columns[i]->popBack(1);
const IDataType & type = *header.getByPosition(i).type;
Expected expected;
Tokens tokens(prev_istr_position, istr.buffer().end());
TokenIterator token_iterator(tokens);
ASTPtr ast;
if (!parser.parse(token_iterator, ast, expected))
throw Exception("Cannot parse expression of type " + type.getName() + " here: "
+ String(prev_istr_position, std::min(SHOW_CHARS_ON_SYNTAX_ERROR, istr.buffer().end() - prev_istr_position)),
ErrorCodes::SYNTAX_ERROR);
istr.position() = const_cast<char *>(token_iterator->begin);
std::pair<Field, DataTypePtr> value_raw = evaluateConstantExpression(ast, *context);
Field value = convertFieldToType(value_raw.first, type, value_raw.second.get());
/// Check that we are indeed allowed to insert a NULL.
if (value.isNull())
{
if (!type.isNullable())
throw Exception{"Expression returns value " + applyVisitor(FieldVisitorToString(), value)
+ ", that is out of range of type " + type.getName()
+ ", at: " + String(prev_istr_position, std::min(SHOW_CHARS_ON_SYNTAX_ERROR, istr.buffer().end() - prev_istr_position)),
ErrorCodes::VALUE_IS_OUT_OF_RANGE_OF_DATA_TYPE};
}
columns[i]->insert(value);
skipWhitespaceIfAny(istr);
if (i != num_columns - 1)
assertChar(',', istr);
else
assertChar(')', istr);
}
else
throw;
}
}
skipWhitespaceIfAny(istr);
if (!istr.eof() && *istr.position() == ',')
++istr.position();
return true;
}
void registerInputFormatValues(FormatFactory & factory)
{
factory.registerInputFormat("Values", [](
ReadBuffer & buf,
const Block & sample,
const Context & context,
UInt64 max_block_size,
UInt64 rows_portion_size,
FormatFactory::ReadCallback callback,
const FormatSettings & settings)
{
return std::make_shared<BlockInputStreamFromRowInputStream>(
std::make_shared<ValuesRowInputStream>(buf, sample, context, settings),
sample, max_block_size, rows_portion_size, callback, settings);
});
}
}

View File

@ -1,35 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <Formats/IRowInputStream.h>
#include <Formats/FormatSettings.h>
namespace DB
{
class Context;
class ReadBuffer;
/** Stream to read data in VALUES format (as in INSERT query).
*/
class ValuesRowInputStream : public IRowInputStream
{
public:
/** Data is parsed using a fast streaming parser.
* If interpret_expressions is true, it will additionally try to use the SQL parser and interpreter
* when the streaming parser fails to parse a field (this is very slow).
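* For example, (1, 'hello') is handled by the streaming parser alone, while (1 + 1, upper('a'))
* falls back to expression evaluation, since those fields are expressions rather than literals.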
*/
ValuesRowInputStream(ReadBuffer & istr_, const Block & header_, const Context & context_, const FormatSettings & format_settings);
bool read(MutableColumns & columns, RowReadExtension &) override;
private:
ReadBuffer & istr;
Block header;
std::unique_ptr<Context> context; /// pimpl
const FormatSettings format_settings;
};
}

View File

@ -1,63 +0,0 @@
#include <Formats/ValuesRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <IO/WriteHelpers.h>
#include <Columns/IColumn.h>
#include <DataTypes/IDataType.h>
namespace DB
{
ValuesRowOutputStream::ValuesRowOutputStream(WriteBuffer & ostr_, const FormatSettings & format_settings)
: ostr(ostr_), format_settings(format_settings)
{
}
void ValuesRowOutputStream::flush()
{
ostr.next();
}
void ValuesRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
type.serializeAsTextQuoted(column, row_num, ostr, format_settings);
}
void ValuesRowOutputStream::writeFieldDelimiter()
{
writeChar(',', ostr);
}
void ValuesRowOutputStream::writeRowStartDelimiter()
{
writeChar('(', ostr);
}
void ValuesRowOutputStream::writeRowEndDelimiter()
{
writeChar(')', ostr);
}
void ValuesRowOutputStream::writeRowBetweenDelimiter()
{
writeCString(",", ostr);
}
void registerOutputFormatValues(FormatFactory & factory)
{
factory.registerOutputFormat("Values", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<ValuesRowOutputStream>(buf, settings), sample);
});
}
}

View File

@ -1,33 +0,0 @@
#pragma once
#include <Formats/FormatSettings.h>
#include <Formats/IRowOutputStream.h>
namespace DB
{
class WriteBuffer;
/** A stream for outputting data in the VALUES format (as in an INSERT query).
*/
class ValuesRowOutputStream : public IRowOutputStream
{
public:
ValuesRowOutputStream(WriteBuffer & ostr_, const FormatSettings & format_settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeFieldDelimiter() override;
void writeRowStartDelimiter() override;
void writeRowEndDelimiter() override;
void writeRowBetweenDelimiter() override;
void flush() override;
private:
WriteBuffer & ostr;
const FormatSettings format_settings;
};
}

View File

@ -1,184 +0,0 @@
#include <cmath>
#include <IO/WriteHelpers.h>
#include <IO/WriteBufferFromString.h>
#include <Formats/VerticalRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <Common/UTF8Helpers.h>
namespace DB
{
VerticalRowOutputStream::VerticalRowOutputStream(
WriteBuffer & ostr_, const Block & sample_, const FormatSettings & format_settings)
: ostr(ostr_), sample(sample_), format_settings(format_settings)
{
size_t columns = sample.columns();
using Widths = std::vector<size_t>;
Widths name_widths(columns);
size_t max_name_width = 0;
String serialized_value;
for (size_t i = 0; i < columns; ++i)
{
/// Note that the number of code points is only a rough approximation of the visible string width.
const String & name = sample.getByPosition(i).name;
name_widths[i] = UTF8::computeWidth(reinterpret_cast<const UInt8 *>(name.data()), name.size());
if (name_widths[i] > max_name_width)
max_name_width = name_widths[i];
}
names_and_paddings.resize(columns);
for (size_t i = 0; i < columns; ++i)
{
WriteBufferFromString out(names_and_paddings[i]);
writeString(sample.getByPosition(i).name, out);
writeCString(": ", out);
}
for (size_t i = 0; i < columns; ++i)
{
size_t new_size = max_name_width - name_widths[i] + names_and_paddings[i].size();
names_and_paddings[i].resize(new_size, ' ');
}
}
void VerticalRowOutputStream::flush()
{
ostr.next();
}
void VerticalRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
if (row_number > format_settings.pretty.max_rows)
return;
writeString(names_and_paddings[field_number], ostr);
writeValue(column, type, row_num);
writeChar('\n', ostr);
++field_number;
}
void VerticalRowOutputStream::writeValue(const IColumn & column, const IDataType & type, size_t row_num) const
{
type.serializeAsText(column, row_num, ostr, format_settings);
}
void VerticalRowOutputStream::writeRowStartDelimiter()
{
++row_number;
if (row_number > format_settings.pretty.max_rows)
return;
writeCString("Row ", ostr);
writeIntText(row_number, ostr);
writeCString(":\n", ostr);
size_t width = log10(row_number + 1) + 1 + strlen("Row :");
for (size_t i = 0; i < width; ++i)
writeCString("", ostr);
writeChar('\n', ostr);
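/// Produces, e.g., "Row 1:" followed by a line of '─' characters of matching width.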
}
void VerticalRowOutputStream::writeRowBetweenDelimiter()
{
if (row_number > format_settings.pretty.max_rows)
return;
writeCString("\n", ostr);
field_number = 0;
}
void VerticalRowOutputStream::writeSuffix()
{
if (row_number > format_settings.pretty.max_rows)
{
writeCString("Showed first ", ostr);
writeIntText(format_settings.pretty.max_rows, ostr);
writeCString(".\n", ostr);
}
if (totals || extremes)
{
writeCString("\n", ostr);
writeTotals();
writeExtremes();
}
}
void VerticalRowOutputStream::writeSpecialRow(const Block & block, size_t row_num, const char * title)
{
writeCString("\n", ostr);
row_number = 0;
field_number = 0;
size_t columns = block.columns();
writeCString(title, ostr);
writeCString(":\n", ostr);
size_t width = strlen(title) + 1;
for (size_t i = 0; i < width; ++i)
writeCString("", ostr);
writeChar('\n', ostr);
for (size_t i = 0; i < columns; ++i)
{
if (i != 0)
writeFieldDelimiter();
auto & col = block.getByPosition(i);
writeField(*col.column, *col.type, row_num);
}
}
void VerticalRowOutputStream::writeTotals()
{
if (totals)
{
writeSpecialRow(totals, 0, "Totals");
}
}
void VerticalRowOutputStream::writeExtremes()
{
if (extremes)
{
writeSpecialRow(extremes, 0, "Min");
writeSpecialRow(extremes, 1, "Max");
}
}
void registerOutputFormatVertical(FormatFactory & factory)
{
factory.registerOutputFormat("Vertical", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<VerticalRowOutputStream>(buf, sample, settings), sample);
});
}
}

View File

@ -1,55 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <Formats/FormatSettings.h>
#include <Formats/IRowOutputStream.h>
namespace DB
{
class WriteBuffer;
class Context;
/** Stream to output data in format "each value in separate row".
* Usable to show few rows with many columns.
*/
class VerticalRowOutputStream : public IRowOutputStream
{
public:
VerticalRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & format_settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeRowStartDelimiter() override;
void writeRowBetweenDelimiter() override;
void writeSuffix() override;
void flush() override;
void setTotals(const Block & totals_) override { totals = totals_; }
void setExtremes(const Block & extremes_) override { extremes = extremes_; }
protected:
virtual void writeValue(const IColumn & column, const IDataType & type, size_t row_num) const;
void writeTotals();
void writeExtremes();
/// For totals and extremes.
void writeSpecialRow(const Block & block, size_t row_num, const char * title);
WriteBuffer & ostr;
const Block sample;
const FormatSettings format_settings;
size_t field_number = 0;
size_t row_number = 0;
using NamesAndPaddings = std::vector<String>;
NamesAndPaddings names_and_paddings;
Block totals;
Block extremes;
};
}

View File

@ -1,240 +0,0 @@
#include <IO/WriteHelpers.h>
#include <IO/WriteBufferValidUTF8.h>
#include <Formats/XMLRowOutputStream.h>
#include <Formats/FormatFactory.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
namespace DB
{
XMLRowOutputStream::XMLRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & format_settings)
: dst_ostr(ostr_), format_settings(format_settings)
{
NamesAndTypesList columns(sample_.getNamesAndTypesList());
fields.assign(columns.begin(), columns.end());
field_tag_names.resize(sample_.columns());
bool need_validate_utf8 = false;
for (size_t i = 0; i < sample_.columns(); ++i)
{
if (!sample_.getByPosition(i).type->textCanContainOnlyValidUTF8())
need_validate_utf8 = true;
/// As element names, we will use the column name if it has a valid form, or "field" otherwise.
/// The condition below is more strict than the XML standard requires.
bool is_column_name_suitable = true;
const char * begin = fields[i].name.data();
const char * end = begin + fields[i].name.size();
for (const char * pos = begin; pos != end; ++pos)
{
char c = *pos;
if (!(isAlphaASCII(c)
|| (pos != begin && isNumericASCII(c))
|| c == '_'
|| c == '-'
|| c == '.'))
{
is_column_name_suitable = false;
break;
}
}
field_tag_names[i] = is_column_name_suitable
? fields[i].name
: "field";
}
if (need_validate_utf8)
{
validating_ostr = std::make_unique<WriteBufferValidUTF8>(dst_ostr);
ostr = validating_ostr.get();
}
else
ostr = &dst_ostr;
}
void XMLRowOutputStream::writePrefix()
{
writeCString("<?xml version='1.0' encoding='UTF-8' ?>\n", *ostr);
writeCString("<result>\n", *ostr);
writeCString("\t<meta>\n", *ostr);
writeCString("\t\t<columns>\n", *ostr);
for (size_t i = 0; i < fields.size(); ++i)
{
writeCString("\t\t\t<column>\n", *ostr);
writeCString("\t\t\t\t<name>", *ostr);
writeXMLString(fields[i].name, *ostr);
writeCString("</name>\n", *ostr);
writeCString("\t\t\t\t<type>", *ostr);
writeXMLString(fields[i].type->getName(), *ostr);
writeCString("</type>\n", *ostr);
writeCString("\t\t\t</column>\n", *ostr);
}
writeCString("\t\t</columns>\n", *ostr);
writeCString("\t</meta>\n", *ostr);
writeCString("\t<data>\n", *ostr);
}
void XMLRowOutputStream::writeField(const IColumn & column, const IDataType & type, size_t row_num)
{
writeCString("\t\t\t<", *ostr);
writeString(field_tag_names[field_number], *ostr);
writeCString(">", *ostr);
type.serializeAsTextXML(column, row_num, *ostr, format_settings);
writeCString("</", *ostr);
writeString(field_tag_names[field_number], *ostr);
writeCString(">\n", *ostr);
++field_number;
}
void XMLRowOutputStream::writeRowStartDelimiter()
{
writeCString("\t\t<row>\n", *ostr);
}
void XMLRowOutputStream::writeRowEndDelimiter()
{
writeCString("\t\t</row>\n", *ostr);
field_number = 0;
++row_count;
}
void XMLRowOutputStream::writeSuffix()
{
writeCString("\t</data>\n", *ostr);
writeTotals();
writeExtremes();
writeCString("\t<rows>", *ostr);
writeIntText(row_count, *ostr);
writeCString("</rows>\n", *ostr);
writeRowsBeforeLimitAtLeast();
if (format_settings.write_statistics)
writeStatistics();
writeCString("</result>\n", *ostr);
ostr->next();
}
void XMLRowOutputStream::writeRowsBeforeLimitAtLeast()
{
if (applied_limit)
{
writeCString("\t<rows_before_limit_at_least>", *ostr);
writeIntText(rows_before_limit, *ostr);
writeCString("</rows_before_limit_at_least>\n", *ostr);
}
}
void XMLRowOutputStream::writeTotals()
{
if (totals)
{
writeCString("\t<totals>\n", *ostr);
size_t totals_columns = totals.columns();
for (size_t i = 0; i < totals_columns; ++i)
{
const ColumnWithTypeAndName & column = totals.safeGetByPosition(i);
writeCString("\t\t<", *ostr);
writeString(field_tag_names[i], *ostr);
writeCString(">", *ostr);
column.type->serializeAsTextXML(*column.column.get(), 0, *ostr, format_settings);
writeCString("</", *ostr);
writeString(field_tag_names[i], *ostr);
writeCString(">\n", *ostr);
}
writeCString("\t</totals>\n", *ostr);
}
}
static void writeExtremesElement(
const char * title, const Block & extremes, size_t row_num, const Names & field_tag_names, WriteBuffer & ostr, const FormatSettings & format_settings)
{
writeCString("\t\t<", ostr);
writeCString(title, ostr);
writeCString(">\n", ostr);
size_t extremes_columns = extremes.columns();
for (size_t i = 0; i < extremes_columns; ++i)
{
const ColumnWithTypeAndName & column = extremes.safeGetByPosition(i);
writeCString("\t\t\t<", ostr);
writeString(field_tag_names[i], ostr);
writeCString(">", ostr);
column.type->serializeAsTextXML(*column.column.get(), row_num, ostr, format_settings);
writeCString("</", ostr);
writeString(field_tag_names[i], ostr);
writeCString(">\n", ostr);
}
writeCString("\t\t</", ostr);
writeCString(title, ostr);
writeCString(">\n", ostr);
}
void XMLRowOutputStream::writeExtremes()
{
if (extremes)
{
writeCString("\t<extremes>\n", *ostr);
writeExtremesElement("min", extremes, 0, field_tag_names, *ostr, format_settings);
writeExtremesElement("max", extremes, 1, field_tag_names, *ostr, format_settings);
writeCString("\t</extremes>\n", *ostr);
}
}
void XMLRowOutputStream::onProgress(const Progress & value)
{
progress.incrementPiecewiseAtomically(value);
}
void XMLRowOutputStream::writeStatistics()
{
writeCString("\t<statistics>\n", *ostr);
writeCString("\t\t<elapsed>", *ostr);
writeText(watch.elapsedSeconds(), *ostr);
writeCString("</elapsed>\n", *ostr);
writeCString("\t\t<rows_read>", *ostr);
writeText(progress.read_rows.load(), *ostr);
writeCString("</rows_read>\n", *ostr);
writeCString("\t\t<bytes_read>", *ostr);
writeText(progress.read_bytes.load(), *ostr);
writeCString("</bytes_read>\n", *ostr);
writeCString("\t</statistics>\n", *ostr);
}
void registerOutputFormatXML(FormatFactory & factory)
{
factory.registerOutputFormat("XML", [](
WriteBuffer & buf,
const Block & sample,
const Context &,
const FormatSettings & settings)
{
return std::make_shared<BlockOutputStreamFromRowOutputStream>(
std::make_shared<XMLRowOutputStream>(buf, sample, settings), sample);
});
}
}
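For reference, writeField above brackets each value in a tag derived from the column name and relies on serializeAsTextXML to escape the payload. A minimal standalone sketch of that pattern, with hypothetical names (escapeXML, writeFieldSketch) that are not part of the ClickHouse API:

#include <iostream>
#include <string>

// Escape the five XML special characters, as any XML text serializer must.
static std::string escapeXML(const std::string & s)
{
    std::string out;
    for (char c : s)
    {
        switch (c)
        {
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '&': out += "&amp;"; break;
            case '\'': out += "&apos;"; break;
            case '"': out += "&quot;"; break;
            default: out += c;
        }
    }
    return out;
}

// Emit one field the way writeField does: <tag>escaped value</tag>.
static void writeFieldSketch(const std::string & tag, const std::string & value)
{
    std::cout << "\t\t\t<" << tag << ">" << escapeXML(value) << "</" << tag << ">\n";
}

int main()
{
    writeFieldSketch("name", "a < b & c");  // prints <name>a &lt; b &amp; c</name>
}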

View File

@ -1,74 +0,0 @@
#pragma once
#include <Core/Block.h>
#include <IO/Progress.h>
#include <IO/WriteBuffer.h>
#include <Common/Stopwatch.h>
#include <Formats/FormatSettings.h>
#include <Formats/IRowOutputStream.h>
namespace DB
{
/** A stream for outputting data in XML format.
*/
class XMLRowOutputStream : public IRowOutputStream
{
public:
XMLRowOutputStream(WriteBuffer & ostr_, const Block & sample_, const FormatSettings & format_settings);
void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
void writeRowStartDelimiter() override;
void writeRowEndDelimiter() override;
void writePrefix() override;
void writeSuffix() override;
void flush() override
{
ostr->next();
if (validating_ostr)
dst_ostr.next();
}
void setRowsBeforeLimit(size_t rows_before_limit_) override
{
applied_limit = true;
rows_before_limit = rows_before_limit_;
}
void setTotals(const Block & totals_) override { totals = totals_; }
void setExtremes(const Block & extremes_) override { extremes = extremes_; }
void onProgress(const Progress & value) override;
String getContentType() const override { return "application/xml; charset=UTF-8"; }
protected:
void writeRowsBeforeLimitAtLeast();
virtual void writeTotals();
virtual void writeExtremes();
void writeStatistics();
WriteBuffer & dst_ostr;
std::unique_ptr<WriteBuffer> validating_ostr; /// Validates UTF-8 sequences, replaces bad sequences with replacement character.
WriteBuffer * ostr;
size_t field_number = 0;
size_t row_count = 0;
bool applied_limit = false;
size_t rows_before_limit = 0;
NamesAndTypes fields;
Names field_tag_names;
Block totals;
Block extremes;
Progress progress;
Stopwatch watch;
const FormatSettings format_settings;
};
}

View File

@ -13,11 +13,12 @@
#include <DataTypes/DataTypeString.h>
#include <Formats/TabSeparatedRowInputStream.h>
#include <Formats/TabSeparatedRowOutputStream.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <DataStreams/copyData.h>
#include <Processors/Formats/Impl/TabSeparatedRowOutputFormat.h>
#include <Processors/Formats/OutputStreamToOutputFormat.h>
int main(int, char **)
@ -46,10 +47,9 @@ try
RowInputStreamPtr row_input = std::make_shared<TabSeparatedRowInputStream>(in_buf, sample, false, false, format_settings);
BlockInputStreamFromRowInputStream block_input(row_input, sample, DEFAULT_INSERT_BLOCK_SIZE, 0, []{}, format_settings);
RowOutputStreamPtr row_output = std::make_shared<TabSeparatedRowOutputStream>(out_buf, sample, false, false, format_settings);
BlockOutputStreamFromRowOutputStream block_output(row_output, sample);
BlockOutputStreamPtr block_output = std::make_shared<OutputStreamToOutputFormat>(std::make_shared<TabSeparatedRowOutputFormat>(out_buf, sample, false, false, format_settings));
copyData(block_input, block_output);
copyData(block_input, *block_output);
}
catch (const DB::Exception & e)
{

View File

@ -10,11 +10,12 @@
#include <DataTypes/DataTypeString.h>
#include <Formats/TabSeparatedRowInputStream.h>
#include <Formats/TabSeparatedRowOutputStream.h>
#include <Formats/BlockInputStreamFromRowInputStream.h>
#include <Formats/BlockOutputStreamFromRowOutputStream.h>
#include <DataStreams/copyData.h>
#include <Processors/Formats/OutputStreamToOutputFormat.h>
#include <Processors/Formats/Impl/TabSeparatedRowOutputFormat.h>
using namespace DB;
@ -40,12 +41,12 @@ try
FormatSettings format_settings;
RowInputStreamPtr row_input = std::make_shared<TabSeparatedRowInputStream>(in_buf, sample, false, false, format_settings);
RowOutputStreamPtr row_output = std::make_shared<TabSeparatedRowOutputStream>(out_buf, sample, false, false, format_settings);
BlockInputStreamFromRowInputStream block_input(row_input, sample, DEFAULT_INSERT_BLOCK_SIZE, 0, []{}, format_settings);
BlockOutputStreamFromRowOutputStream block_output(row_output, sample);
copyData(block_input, block_output);
BlockOutputStreamPtr block_output = std::make_shared<OutputStreamToOutputFormat>(
std::make_shared<TabSeparatedRowOutputFormat>(out_buf, sample, false, false, format_settings));
copyData(block_input, *block_output);
return 0;
}
catch (...)

View File

@ -1632,6 +1632,7 @@ private:
using ToDataType = DataTypeDecimal<FieldType>;
TypeIndex type_index = from_type->getTypeId();
UInt32 precision = to_type->getPrecision();
UInt32 scale = to_type->getScale();
WhichDataType which(type_index);
@ -1645,9 +1646,9 @@ private:
throw Exception{"Conversion from " + from_type->getName() + " to " + to_type->getName() + " is not supported",
ErrorCodes::CANNOT_CONVERT_TYPE};
return [type_index, scale] (Block & block, const ColumnNumbers & arguments, const size_t result, size_t input_rows_count)
return [type_index, precision, scale] (Block & block, const ColumnNumbers & arguments, const size_t result, size_t input_rows_count)
{
callOnIndexAndDataType<ToDataType>(type_index, [&](const auto & types) -> bool
auto res = callOnIndexAndDataType<ToDataType>(type_index, [&](const auto & types) -> bool
{
using Types = std::decay_t<decltype(types)>;
using LeftDataType = typename Types::LeftType;
@ -1656,6 +1657,14 @@ private:
ConvertImpl<LeftDataType, RightDataType, NameCast>::execute(block, arguments, result, input_rows_count, scale);
return true;
});
/// Additionally, check whether callOnIndexAndDataType matched no type at all.
if (!res)
{
auto to = DataTypeDecimal<FieldType>(precision, scale);
throw Exception{"Conversion from " + std::string(getTypeName(type_index)) + " to " + to.getName() +
" is not supported", ErrorCodes::CANNOT_CONVERT_TYPE};
}
};
}
@ -2022,6 +2031,11 @@ private:
const auto & tmp_res = tmp_block.getByPosition(tmp_res_index);
/// May happen in fuzz tests. For debugging purposes.
if (!tmp_res.column)
throw Exception("Couldn't convert " + block.getByPosition(arguments[0]).type->getName() + " to "
+ nested_type->getName() + " in prepareRemoveNullable wrapper.", ErrorCodes::LOGICAL_ERROR);
res.column = wrapInNullable(tmp_res.column, Block({block.getByPosition(arguments[0]), tmp_res}), {0}, 1, input_rows_count);
};
}

View File

@ -36,6 +36,10 @@ const UInt8 geohash_base32_decode_lookup_table[256] = {
const size_t BITS_PER_SYMBOL = 5;
const size_t MAX_PRECISION = 12;
const size_t MAX_BITS = MAX_PRECISION * BITS_PER_SYMBOL * 1.5;
const Float64 LON_MIN = -180;
const Float64 LON_MAX = 180;
const Float64 LAT_MIN = -90;
const Float64 LAT_MAX = 90;
using Encoded = std::array<UInt8, MAX_BITS>;
@ -62,9 +66,9 @@ inline Encoded encodeCoordinate(Float64 coord, Float64 min, Float64 max, UInt8 b
Encoded result;
result.fill(0);
for (int i = 0; i < bits; ++i)
for (size_t i = 0; i < bits; ++i)
{
Float64 mid = (max + min) / 2;
const Float64 mid = (max + min) / 2;
if (coord >= mid)
{
result[i] = 1;
@ -83,7 +87,7 @@ inline Encoded encodeCoordinate(Float64 coord, Float64 min, Float64 max, UInt8 b
inline Float64 decodeCoordinate(const Encoded & coord, Float64 min, Float64 max, UInt8 bits)
{
Float64 mid = (max + min) / 2;
for (int i = 0; i < bits; ++i)
for (size_t i = 0; i < bits; ++i)
{
const auto c = coord[i];
if (c == 1)
@ -148,7 +152,7 @@ inline void base32Encode(const Encoded & binary, UInt8 precision, char * out)
{
extern const char geohash_base32_encode_lookup_table[32];
for (UInt8 i = 0; i < precision * BITS_PER_SYMBOL; i += 5)
for (UInt8 i = 0; i < precision * BITS_PER_SYMBOL; i += BITS_PER_SYMBOL)
{
UInt8 v = binary[i];
v <<= 1;
@ -187,24 +191,38 @@ inline Encoded base32Decode(const char * encoded_string, size_t encoded_length)
return result;
}
inline Float64 getMaxSpan(CoordType type)
{
if (type == LONGITUDE)
{
return LON_MAX - LON_MIN;
}
return LAT_MAX - LAT_MIN;
}
namespace DB
inline Float64 getSpan(UInt8 precision, CoordType type)
{
const auto bits = singleCoordBitsPrecision(precision, type);
// Since every extra bit of precision halves the span, divide the max span by 2^bits.
return ldexp(getMaxSpan(type), -1 * bits);
}
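The comment above is the whole idea: each additional bit of precision halves the coordinate span, so the span at a given precision is max_span / 2^bits, and ldexp(x, -bits) computes that exactly without a pow call. A small self-contained check; the bit count 15 is an assumed value for illustration only (the real one comes from singleCoordBitsPrecision, which is not shown in this diff):

#include <cmath>
#include <cstdio>

int main()
{
    const double lon_max_span = 360.0;  // LON_MAX - LON_MIN
    // Suppose a given precision assigns 15 bits to the longitude coordinate.
    const int bits = 15;
    const double span = std::ldexp(lon_max_span, -bits);  // 360 / 2^15
    std::printf("%.12f\n", span);  // 0.010986328125
}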
namespace GeoUtils
{
size_t geohashEncode(Float64 longitude, Float64 latitude, UInt8 precision, char *& out)
inline UInt8 geohashPrecision(UInt8 precision)
{
if (precision == 0 || precision > MAX_PRECISION)
{
precision = MAX_PRECISION;
}
return precision;
}
inline size_t geohashEncodeImpl(Float64 longitude, Float64 latitude, UInt8 precision, char * out)
{
const Encoded combined = merge(
encodeCoordinate(longitude, -180, 180, singleCoordBitsPrecision(precision, LONGITUDE)),
encodeCoordinate(latitude, -90, 90, singleCoordBitsPrecision(precision, LATITUDE)),
encodeCoordinate(longitude, LON_MIN, LON_MAX, singleCoordBitsPrecision(precision, LONGITUDE)),
encodeCoordinate(latitude, LAT_MIN, LAT_MAX, singleCoordBitsPrecision(precision, LATITUDE)),
precision);
base32Encode(combined, precision, out);
@ -212,9 +230,28 @@ size_t geohashEncode(Float64 longitude, Float64 latitude, UInt8 precision, char
return precision;
}
}
namespace DB
{
namespace ErrorCodes
{
extern const int ARGUMENT_OUT_OF_BOUND;
}
namespace GeoUtils
{
size_t geohashEncode(Float64 longitude, Float64 latitude, UInt8 precision, char * out)
{
precision = geohashPrecision(precision);
return geohashEncodeImpl(longitude, latitude, precision, out);
}
void geohashDecode(const char * encoded_string, size_t encoded_len, Float64 * longitude, Float64 * latitude)
{
const UInt8 precision = std::min(encoded_len, MAX_PRECISION);
const UInt8 precision = std::min(encoded_len, static_cast<size_t>(MAX_PRECISION));
if (precision == 0)
{
return;
@ -223,8 +260,89 @@ void geohashDecode(const char * encoded_string, size_t encoded_len, Float64 * lo
Encoded lat_encoded, lon_encoded;
std::tie(lon_encoded, lat_encoded) = split(base32Decode(encoded_string, precision), precision);
*longitude = decodeCoordinate(lon_encoded, -180, 180, singleCoordBitsPrecision(precision, LONGITUDE));
*latitude = decodeCoordinate(lat_encoded, -90, 90, singleCoordBitsPrecision(precision, LATITUDE));
*longitude = decodeCoordinate(lon_encoded, LON_MIN, LON_MAX, singleCoordBitsPrecision(precision, LONGITUDE));
*latitude = decodeCoordinate(lat_encoded, LAT_MIN, LAT_MAX, singleCoordBitsPrecision(precision, LATITUDE));
}
GeohashesInBoxPreparedArgs geohashesInBoxPrepare(const Float64 longitude_min,
const Float64 latitude_min,
const Float64 longitude_max,
const Float64 latitude_max,
UInt8 precision)
{
precision = geohashPrecision(precision);
if (longitude_max < longitude_min || latitude_max < latitude_min)
{
return {};
}
const auto lon_step = getSpan(precision, LONGITUDE);
const auto lat_step = getSpan(precision, LATITUDE);
// Align max to the right (or upper) border of the geohash grid cell to ensure that the cell is included in the result.
Float64 lon_min = floor(longitude_min / lon_step) * lon_step;
Float64 lat_min = floor(latitude_min / lat_step) * lat_step;
Float64 lon_max = ceil(longitude_max / lon_step) * lon_step;
Float64 lat_max = ceil(latitude_max / lat_step) * lat_step;
const auto lon_span = lon_max - lon_min;
const auto lat_span = lat_max - lat_min;
// in case of a very small (or zero) span, produce at least 1 item.
const auto items_count = std::max(size_t{1}, static_cast<size_t>(ceil(lon_span/lon_step * lat_span/lat_step)));
return GeohashesInBoxPreparedArgs{
items_count,
precision,
lon_min,
lat_min,
lon_max,
lat_max,
lon_step,
lat_step
};
}
UInt64 geohashesInBox(const GeohashesInBoxPreparedArgs & args, char * out)
{
if (args.items_count == 0
|| args.precision == 0
|| args.precision > MAX_PRECISION
|| args.latitude_min > args.latitude_max
|| args.longitude_min > args.longitude_max
|| args.longitude_step <= 0
|| args.latitude_step <= 0)
{
return 0;
}
UInt64 items = 0;
for (auto lon = args.longitude_min; lon < args.longitude_max; lon += args.longitude_step)
{
for (auto lat = args.latitude_min; lat < args.latitude_max; lat += args.latitude_step)
{
assert(items <= args.items_count);
size_t l = geohashEncodeImpl(lon, lat, args.precision, out);
out += l;
*out = '\0';
++out;
++items;
}
}
if (items == 0 && args.items_count != 0)
{
size_t l = geohashEncodeImpl(args.longitude_min, args.latitude_min, args.precision, out);
out += l;
*out = '\0';
++out;
++items;
}
return items;
}
}
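Note the output contract of geohashesInBox: the caller provides one flat buffer, and each geohash is written as exactly args.precision characters followed by a terminating zero, back to back. A sketch of that packed layout with a stand-in encoder (the repeated letters are placeholders, not real geohashes):

#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    // geohashesInBox packs its output as items_count null-terminated strings,
    // each exactly precision bytes long, laid out back to back.
    const size_t items_count = 3, precision = 5;
    std::vector<char> buf(items_count * (precision + 1));
    char * out = buf.data();
    for (size_t i = 0; i < items_count; ++i)
    {
        std::memset(out, 'a' + static_cast<char>(i), precision);  // stand-in for a real geohash
        out += precision;
        *out++ = '\0';
    }
    for (const char * p = buf.data(); p < buf.data() + buf.size(); p += precision + 1)
        std::puts(p);  // aaaaa, bbbbb, ccccc
}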

View File

@ -706,10 +706,33 @@ std::string serialize(Polygon && polygon)
return result;
}
size_t geohashEncode(Float64 longitude, Float64 latitude, UInt8 precision, char *& out);
size_t geohashEncode(Float64 longitude, Float64 latitude, UInt8 precision, char * out);
void geohashDecode(const char * encoded_string, size_t encoded_len, Float64 * longitude, Float64 * latitude);
std::vector<std::pair<Float64, Float64>> geohashCoverBox(Float64 longitude_min, Float64 latitude_min, Float64 longitude_max, Float64 latitude_max, UInt8 precision, UInt32 max_items = 0);
struct GeohashesInBoxPreparedArgs
{
UInt64 items_count = 0;
UInt8 precision = 0;
Float64 longitude_min = 0.0;
Float64 latitude_min = 0.0;
Float64 longitude_max = 0.0;
Float64 latitude_max = 0.0;
Float64 longitude_step = 0.0;
Float64 latitude_step = 0.0;
};
GeohashesInBoxPreparedArgs geohashesInBoxPrepare(const Float64 longitude_min,
const Float64 latitude_min,
Float64 longitude_max,
Float64 latitude_max,
UInt8 precision);
UInt64 geohashesInBox(const GeohashesInBoxPreparedArgs & estimation, char * out);
} /// GeoUtils

View File

@ -0,0 +1,169 @@
#include <Functions/IFunction.h>
#include <Functions/FunctionFactory.h>
#include <Functions/FunctionHelpers.h>
#include <Functions/GeoUtils.h>
#include <Columns/ColumnArray.h>
#include <Columns/ColumnString.h>
#include <DataTypes/DataTypeArray.h>
#include <DataTypes/DataTypeString.h>
#include <memory>
#include <string>
namespace DB
{
namespace ErrorCodes
{
extern const int LOGICAL_ERROR;
extern const int ILLEGAL_TYPE_OF_ARGUMENT;
extern const int TOO_LARGE_ARRAY_SIZE;
}
class FunctionGeohashesInBox : public IFunction
{
public:
static constexpr auto name = "geohashesInBox";
static FunctionPtr create(const Context &) { return std::make_shared<FunctionGeohashesInBox>(); }
String getName() const override { return name; }
size_t getNumberOfArguments() const override { return 5; }
DataTypePtr getReturnTypeImpl(const DataTypes & arguments) const override
{
validateArgumentType(*this, arguments, 0, isFloat, "float");
validateArgumentType(*this, arguments, 1, isFloat, "float");
validateArgumentType(*this, arguments, 2, isFloat, "float");
validateArgumentType(*this, arguments, 3, isFloat, "float");
validateArgumentType(*this, arguments, 4, isUInt8, "integer");
if (!(arguments[0]->equals(*arguments[1]) &&
arguments[0]->equals(*arguments[2]) &&
arguments[0]->equals(*arguments[3])))
{
throw Exception("Illegal type of argument of " + getName() +
" all coordinate arguments must have the same type, instead they are:" +
arguments[0]->getName() + ", " +
arguments[1]->getName() + ", " +
arguments[2]->getName() + ", " +
arguments[3]->getName() + ".",
ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT);
}
return std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>());
}
bool useDefaultImplementationForConstants() const override { return true; }
template <typename LonAndLatType, typename PrecisionType>
void execute(const IColumn * lon_min_column,
const IColumn * lat_min_column,
const IColumn * lon_max_column,
const IColumn * lat_max_column,
const IColumn * precision_column,
ColumnPtr & result)
{
static constexpr size_t max_array_size = 10'000'000;
const auto * lon_min = checkAndGetColumn<ColumnVector<LonAndLatType>>(lon_min_column);
const auto * lat_min = checkAndGetColumn<ColumnVector<LonAndLatType>>(lat_min_column);
const auto * lon_max = checkAndGetColumn<ColumnVector<LonAndLatType>>(lon_max_column);
const auto * lat_max = checkAndGetColumn<ColumnVector<LonAndLatType>>(lat_max_column);
auto * precision = checkAndGetColumn<ColumnVector<PrecisionType>>(precision_column);
if (precision == nullptr)
{
precision = checkAndGetColumnConstData<ColumnVector<PrecisionType>>(precision_column);
}
if (!lon_min || !lat_min || !lon_max || !lat_max || !precision)
{
throw Exception("Unsupported argument types for function " + getName() + " : " +
lon_min_column->getName() + ", " +
lat_min_column->getName() + ", " +
lon_max_column->getName() + ", " +
lat_max_column->getName() + ".",
ErrorCodes::LOGICAL_ERROR);
}
const size_t total_rows = lat_min->size();
auto col_res = ColumnArray::create(ColumnString::create());
ColumnString & res_strings = typeid_cast<ColumnString &>(col_res->getData());
ColumnArray::Offsets & res_offsets = col_res->getOffsets();
ColumnString::Chars & res_strings_chars = res_strings.getChars();
ColumnString::Offsets & res_strings_offsets = res_strings.getOffsets();
for (size_t row = 0; row < total_rows; ++row)
{
const Float64 lon_min_value = lon_min->getElement(row);
const Float64 lat_min_value = lat_min->getElement(row);
const Float64 lon_max_value = lon_max->getElement(row);
const Float64 lat_max_value = lat_max->getElement(row);
const auto prepared_args = GeoUtils::geohashesInBoxPrepare(
lon_min_value, lat_min_value, lon_max_value, lat_max_value,
precision->getElement(row % precision->size()));
if (prepared_args.items_count > max_array_size)
{
throw Exception(getName() + " would produce " + std::to_string(prepared_args.items_count) +
" array elements, which is bigger than the allowed maximum of " + std::to_string(max_array_size),
ErrorCodes::TOO_LARGE_ARRAY_SIZE);
}
res_strings_offsets.reserve(res_strings_offsets.size() + prepared_args.items_count);
res_strings_chars.resize(res_strings_chars.size() + prepared_args.items_count * (prepared_args.precision + 1));
const auto starting_offset = res_strings_offsets.empty() ? 0 : res_strings_offsets.back();
char * out = reinterpret_cast<char *>(res_strings_chars.data() + starting_offset);
// Actually write geohashes into preallocated buffer.
GeoUtils::geohashesInBox(prepared_args, out);
for (size_t i = 1; i <= prepared_args.items_count; ++i)
{
res_strings_offsets.push_back(starting_offset + (prepared_args.precision + 1) * i);
}
res_offsets.push_back((res_offsets.empty() ? 0 : res_offsets.back()) + prepared_args.items_count);
}
if (!res_strings_offsets.empty() && res_strings_offsets.back() != res_strings_chars.size())
{
throw Exception("String column size mismatch (internal logical error)", ErrorCodes::LOGICAL_ERROR);
}
if (!res_offsets.empty() && res_offsets.back() != res_strings.size())
{
throw Exception("Arrary column size mismatch (internal logical error)" +
std::to_string(res_offsets.back()) + " != " + std::to_string(res_strings.size()),
ErrorCodes::LOGICAL_ERROR);
}
result = std::move(col_res);
}
void executeImpl(Block & block, const ColumnNumbers & arguments, size_t result, size_t /*input_rows_count*/) override
{
const IColumn * lon_min = block.getByPosition(arguments[0]).column.get();
const IColumn * lat_min = block.getByPosition(arguments[1]).column.get();
const IColumn * lon_max = block.getByPosition(arguments[2]).column.get();
const IColumn * lat_max = block.getByPosition(arguments[3]).column.get();
const IColumn * prec = block.getByPosition(arguments[4]).column.get();
ColumnPtr & res = block.getByPosition(result).column;
if (checkColumn<ColumnVector<Float32>>(lon_min))
{
execute<Float32, UInt8>(lon_min, lat_min, lon_max, lat_max, prec, res);
}
else
{
execute<Float64, UInt8>(lon_min, lat_min, lon_max, lat_max, prec, res);
}
}
};
void registerFunctionGeohashesInBox(FunctionFactory & factory)
{
factory.registerFunction<FunctionGeohashesInBox>();
}
}
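The offset bookkeeping in execute() above follows the ClickHouse string-column layout: one flat byte buffer holding all values (each zero-terminated) plus a vector of cumulative end offsets, so string i spans [offsets[i-1], offsets[i]). A reduced sketch with plain STL containers standing in for ColumnString:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    // ClickHouse-style string column: one flat byte buffer plus cumulative offsets.
    std::vector<char> chars;
    std::vector<uint64_t> offsets;
    for (const std::string & s : {"u09t", "u09w"})
    {
        chars.insert(chars.end(), s.begin(), s.end());
        chars.push_back('\0');            // each value is zero-terminated
        offsets.push_back(chars.size());  // offset points one past the terminator
    }
    for (size_t i = 0; i < offsets.size(); ++i)
    {
        const uint64_t begin = i == 0 ? 0 : offsets[i - 1];
        std::printf("row %zu: %s\n", i, chars.data() + begin);
    }
}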

View File

@ -10,6 +10,7 @@ void registerFunctionPointInEllipses(FunctionFactory & factory);
void registerFunctionPointInPolygon(FunctionFactory & factory);
void registerFunctionGeohashEncode(FunctionFactory & factory);
void registerFunctionGeohashDecode(FunctionFactory & factory);
void registerFunctionGeohashesInBox(FunctionFactory & factory);
#if USE_H3
void registerFunctionGeoToH3(FunctionFactory &);
@ -22,6 +23,7 @@ void registerFunctionsGeo(FunctionFactory & factory)
registerFunctionPointInPolygon(factory);
registerFunctionGeohashEncode(factory);
registerFunctionGeohashDecode(factory);
registerFunctionGeohashesInBox(factory);
#if USE_H3
registerFunctionGeoToH3(factory);

View File

@ -1,9 +1,11 @@
#include <Functions/IFunction.h>
#include <Functions/FunctionFactory.h>
#include <Functions/FunctionHelpers.h>
#include <Columns/ColumnString.h>
#include <Columns/ColumnsNumber.h>
#include <Columns/ColumnsCommon.h>
#include <DataTypes/DataTypesNumber.h>
#include <IO/WriteHelpers.h>
namespace DB
@ -13,6 +15,7 @@ namespace ErrorCodes
{
extern const int ILLEGAL_COLUMN;
extern const int ILLEGAL_TYPE_OF_ARGUMENT;
extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH;
extern const int FUNCTION_THROW_IF_VALUE_IS_NON_ZERO;
}
@ -32,46 +35,70 @@ public:
return name;
}
bool isVariadic() const override { return true; }
size_t getNumberOfArguments() const override
{
return 1;
return 0;
}
DataTypePtr getReturnTypeImpl(const DataTypes & arguments) const override
{
if (!isNativeNumber(arguments.front()))
const size_t number_of_arguments = arguments.size();
if (number_of_arguments < 1 || number_of_arguments > 2)
throw Exception{"Number of arguments for function " + getName() + " doesn't match: passed "
+ toString(number_of_arguments) + ", should be 1 or 2",
ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH};
if (!isNativeNumber(arguments[0]))
throw Exception{"Argument for function " + getName() + " must be number", ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT};
if (number_of_arguments > 1 && !isString(arguments[1]))
throw Exception{"Illegal type " + arguments[1]->getName() + " of argument of function " + getName(),
ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT};
return std::make_shared<DataTypeUInt8>();
}
bool useDefaultImplementationForConstants() const override { return true; }
ColumnNumbers getArgumentsThatAreAlwaysConstant() const override { return {1}; }
void executeImpl(Block & block, const ColumnNumbers & arguments, size_t result, size_t /*input_rows_count*/) override
{
std::optional<String> custom_message;
if (arguments.size() == 2)
{
auto * msg_column = checkAndGetColumnConst<ColumnString>(block.getByPosition(arguments[1]).column.get());
if (!msg_column)
throw Exception{"Second argument for function " + getName() + " must be constant String", ErrorCodes::ILLEGAL_COLUMN};
custom_message = msg_column->getValue<String>();
}
const auto in = block.getByPosition(arguments.front()).column.get();
if ( !execute<UInt8>(block, in, result)
&& !execute<UInt16>(block, in, result)
&& !execute<UInt32>(block, in, result)
&& !execute<UInt64>(block, in, result)
&& !execute<Int8>(block, in, result)
&& !execute<Int16>(block, in, result)
&& !execute<Int32>(block, in, result)
&& !execute<Int64>(block, in, result)
&& !execute<Float32>(block, in, result)
&& !execute<Float64>(block, in, result))
if ( !execute<UInt8>(block, in, result, custom_message)
&& !execute<UInt16>(block, in, result, custom_message)
&& !execute<UInt32>(block, in, result, custom_message)
&& !execute<UInt64>(block, in, result, custom_message)
&& !execute<Int8>(block, in, result, custom_message)
&& !execute<Int16>(block, in, result, custom_message)
&& !execute<Int32>(block, in, result, custom_message)
&& !execute<Int64>(block, in, result, custom_message)
&& !execute<Float32>(block, in, result, custom_message)
&& !execute<Float64>(block, in, result, custom_message))
throw Exception{"Illegal column " + in->getName() + " of first argument of function " + getName(), ErrorCodes::ILLEGAL_COLUMN};
}
template <typename T>
bool execute(Block & block, const IColumn * in_untyped, const size_t result)
bool execute(Block & block, const IColumn * in_untyped, const size_t result, const std::optional<String> & message)
{
if (const auto in = checkAndGetColumn<ColumnVector<T>>(in_untyped))
{
const auto & in_data = in->getData();
if (!memoryIsZero(in_data.data(), in_data.size() * sizeof(in_data[0])))
throw Exception("Value passed to 'throwIf' function is non zero", ErrorCodes::FUNCTION_THROW_IF_VALUE_IS_NON_ZERO);
throw Exception{message.value_or("Value passed to '" + getName() + "' function is non zero"),
ErrorCodes::FUNCTION_THROW_IF_VALUE_IS_NON_ZERO};
/// We return a non-constant column to avoid constant folding.
block.getByPosition(result).column = ColumnUInt8::create(in_data.size(), 0);
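throwIf validates the whole column in one pass: memoryIsZero scans the raw bytes of the numeric column, and the function throws as soon as any byte is nonzero. memoryIsZero itself is not part of this diff; a naive behavioral stand-in (the real helper is optimized) might look like:

#include <cstddef>
#include <cstdint>

// Naive equivalent of memoryIsZero: true iff every byte in [data, data + size) is zero.
// (Only a behavioral sketch; the real helper is vectorized.)
static bool memoryIsZeroSketch(const void * data, size_t size)
{
    const auto * p = static_cast<const uint8_t *>(data);
    for (size_t i = 0; i < size; ++i)
        if (p[i] != 0)
            return false;
    return true;
}

int main()
{
    const int64_t values[] = {0, 0, 0, 7};
    return memoryIsZeroSketch(values, sizeof(values)) ? 0 : 1;  // exits 1: a nonzero value is present
}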

View File

@ -6,6 +6,7 @@
#include <fstream>
#include <streambuf>
#include <cstdlib>
#include <functional>
namespace
{

View File

@ -1040,23 +1040,35 @@ void ExpressionAnalyzer::collectUsedColumns()
/// You need to read at least one column to find the number of rows.
if (select_query && required.empty())
{
/// We will find a column with minimum compressed size. Because it is the column that is cheapest to read.
size_t min_data_compressed = 0;
String min_column_name;
/// We will find a column with minimum <compressed_size, type_size, uncompressed_size>.
/// Because it is the column that is cheapest to read.
struct ColumnSizeTuple
{
size_t compressed_size;
size_t type_size;
size_t uncompressed_size;
String name;
bool operator<(const ColumnSizeTuple & that) const
{
return std::tie(compressed_size, type_size, uncompressed_size)
< std::tie(that.compressed_size, that.type_size, that.uncompressed_size);
}
};
std::vector<ColumnSizeTuple> columns;
if (storage)
{
auto column_sizes = storage->getColumnSizes();
for (auto & [column_name, column_size] : column_sizes)
for (auto & source_column : source_columns)
{
if (min_data_compressed == 0 || min_data_compressed > column_size.data_compressed)
{
min_data_compressed = column_size.data_compressed;
min_column_name = column_name;
}
auto c = column_sizes.find(source_column.name);
if (c == column_sizes.end())
continue;
size_t type_size = source_column.type->haveMaximumSizeOfValue() ? source_column.type->getMaximumSizeOfValueInMemory() : 100;
columns.emplace_back(ColumnSizeTuple{c->second.data_compressed, type_size, c->second.data_uncompressed, source_column.name});
}
}
if (min_data_compressed > 0)
required.insert(min_column_name);
if (!columns.empty())
required.insert(std::min_element(columns.begin(), columns.end())->name);
else
/// If we have no information about column sizes, choose a column with the minimum size of its data type.
required.insert(ExpressionActions::getSmallestColumn(source_columns));
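The std::tie comparison above makes the preference lexicographic: compressed size decides first, then the in-memory type size, then the uncompressed size, and std::min_element picks the cheapest column by that ordering. A reduced sketch of the same idiom:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <tuple>
#include <vector>

struct ColumnSizeTuple
{
    size_t compressed_size, type_size, uncompressed_size;
    std::string name;
    bool operator<(const ColumnSizeTuple & that) const
    {
        return std::tie(compressed_size, type_size, uncompressed_size)
            < std::tie(that.compressed_size, that.type_size, that.uncompressed_size);
    }
};

int main()
{
    std::vector<ColumnSizeTuple> columns{
        {100, 8, 400, "a"},
        {100, 4, 800, "b"},   // same compressed size, smaller type size: wins
        {200, 1, 50,  "c"},
    };
    std::printf("%s\n", std::min_element(columns.begin(), columns.end())->name.c_str());  // prints b
}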

View File

@ -26,7 +26,7 @@ namespace DB
namespace ErrorCodes
{
extern const int UNKNOWN_SET_DATA_VARIANT;
extern const int UNSUPPORTED_JOIN_KEYS;
extern const int LOGICAL_ERROR;
extern const int SET_SIZE_LIMIT_EXCEEDED;
extern const int TYPE_MISMATCH;
@ -770,7 +770,7 @@ IColumn::Filter switchJoinRightColumns(
#undef M
default:
throw Exception("Unknown JOIN keys variant.", ErrorCodes::UNKNOWN_SET_DATA_VARIANT);
throw Exception("Unsupported JOIN keys. Type: " + toString(static_cast<UInt32>(type)), ErrorCodes::UNSUPPORTED_JOIN_KEYS);
}
}
@ -1350,7 +1350,8 @@ private:
APPLY_FOR_JOIN_VARIANTS(M)
#undef M
default:
throw Exception("Unknown JOIN keys variant.", ErrorCodes::UNKNOWN_SET_DATA_VARIANT);
throw Exception("Unsupported JOIN keys. Type: " + toString(static_cast<UInt32>(parent.type)),
ErrorCodes::UNSUPPORTED_JOIN_KEYS);
}
__builtin_unreachable();

View File

@ -8,7 +8,6 @@ namespace DB
namespace ErrorCodes
{
extern const int UNKNOWN_SET_DATA_VARIANT;
extern const int LOGICAL_ERROR;
}

View File

@ -185,9 +185,12 @@ void PipelineExecutor::expandPipeline(Stack & stack, UInt64 pid)
graph.emplace_back(processor.get(), graph.size());
}
processors.insert(processors.end(), new_processors.begin(), new_processors.end());
UInt64 num_processors = processors.size();
{
std::lock_guard guard(processors_mutex);
processors.insert(processors.end(), new_processors.begin(), new_processors.end());
}
UInt64 num_processors = processors.size();
for (UInt64 node = 0; node < num_processors; ++node)
{
if (addEdges(node))
@ -374,6 +377,16 @@ void PipelineExecutor::doExpandPipeline(ExpandPipelineTask * task, bool processi
}
}
void PipelineExecutor::cancel()
{
cancelled = true;
finish();
std::lock_guard guard(processors_mutex);
for (auto & processor : processors)
processor->cancel();
}
void PipelineExecutor::finish()
{
{

View File

@ -35,14 +35,11 @@ public:
const Processors & getProcessors() const { return processors; }
/// Cancel execution. May be called from another thread.
void cancel()
{
cancelled = true;
finish();
}
void cancel();
private:
Processors & processors;
std::mutex processors_mutex;
struct Edge
{
@ -75,8 +72,8 @@ private:
std::exception_ptr exception;
std::function<void()> job;
IProcessor * processor;
UInt64 processors_id;
IProcessor * processor = nullptr;
UInt64 processors_id = 0;
/// Counters for profiling.
size_t num_executed_jobs = 0;

View File

@ -27,6 +27,12 @@ public:
: ISource(std::move(header)), in(in)
{
}
virtual const BlockMissingValues & getMissingValues() const
{
static const BlockMissingValues none;
return none;
}
};
}

View File

@ -58,6 +58,17 @@ public:
virtual std::string getContentType() const { return "text/plain; charset=UTF-8"; }
InputPort & getPort(PortKind kind) { return *std::next(inputs.begin(), kind); }
public:
/// Compatible to IBlockOutputStream interface
void write(const Block & block) { consume(Chunk(block.getColumns(), block.rows())); }
virtual void doWritePrefix() {}
virtual void doWriteSuffix() { finalize(); }
void setTotals(const Block & totals) { consumeTotals(Chunk(totals.getColumns(), totals.rows())); }
void setExtremes(const Block & extremes) { consumeExtremes(Chunk(extremes.getColumns(), extremes.rows())); }
};
}

View File

@ -1,5 +1,6 @@
#include <Processors/Formats/IRowInputFormat.h>
#include <IO/WriteHelpers.h> // toString
#include <common/logger_useful.h>
namespace DB
@ -16,6 +17,7 @@ namespace ErrorCodes
extern const int CANNOT_PARSE_UUID;
extern const int TOO_LARGE_STRING_SIZE;
extern const int INCORRECT_NUMBER_OF_COLUMNS;
extern const int TIMEOUT_EXCEEDED;
}
@ -32,6 +34,33 @@ static bool isParseError(int code)
}
static bool handleOverflowMode(OverflowMode mode, const String & message, int code)
{
switch (mode)
{
case OverflowMode::THROW:
throw Exception(message, code);
case OverflowMode::BREAK:
return false;
default:
throw Exception("Logical error: unknown overflow mode", ErrorCodes::LOGICAL_ERROR);
}
}
static bool checkTimeLimit(const IRowInputFormat::Params & params, const Stopwatch & stopwatch)
{
if (params.max_execution_time != 0
&& stopwatch.elapsed() > static_cast<UInt64>(params.max_execution_time.totalMicroseconds()) * 1000)
return handleOverflowMode(params.timeout_overflow_mode,
"Timeout exceeded: elapsed " + toString(stopwatch.elapsedSeconds())
+ " seconds, maximum: " + toString(params.max_execution_time.totalMicroseconds() / 1000000.0),
ErrorCodes::TIMEOUT_EXCEEDED);
return true;
}
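checkTimeLimit is invoked once per rows_portion_size rows: if the elapsed time exceeds max_execution_time, THROW mode raises an exception while BREAK mode returns false so the caller stops reading and keeps what it has. A dependency-free sketch of that control flow, using std::chrono in place of Stopwatch and Poco::Timespan:

#include <chrono>
#include <stdexcept>

enum class OverflowMode { THROW, BREAK };

// Returns true if reading may continue; false means "stop gracefully" (BREAK mode).
static bool checkTimeLimitSketch(std::chrono::steady_clock::time_point start,
                                 std::chrono::milliseconds max_execution_time,
                                 OverflowMode mode)
{
    if (max_execution_time.count() == 0)
        return true;  // no limit configured
    if (std::chrono::steady_clock::now() - start <= max_execution_time)
        return true;
    if (mode == OverflowMode::THROW)
        throw std::runtime_error("Timeout exceeded");
    return false;  // OverflowMode::BREAK: return what was read so far
}

int main()
{
    const auto start = std::chrono::steady_clock::now();
    while (checkTimeLimitSketch(start, std::chrono::milliseconds(10), OverflowMode::BREAK))
    {
        /* read one portion of rows */
    }
}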
Chunk IRowInputFormat::generate()
{
if (total_rows == 0)
@ -43,12 +72,19 @@ Chunk IRowInputFormat::generate()
MutableColumns columns = header.cloneEmptyColumns();
size_t prev_rows = total_rows;
auto chunk_missing_values = std::make_unique<ChunkMissingValues>();
///auto chunk_missing_values = std::make_unique<ChunkMissingValues>();
try
{
for (size_t rows = 0; rows < params.max_block_size; ++rows)
for (size_t rows = 0, batch = 0; rows < params.max_block_size; ++rows, ++batch)
{
if (params.rows_portion_size && batch == params.rows_portion_size)
{
batch = 0;
if (!checkTimeLimit(params, total_stopwatch) || isCancelled())
break;
}
try
{
++total_rows;
@ -56,6 +92,8 @@ Chunk IRowInputFormat::generate()
RowReadExtension info;
if (!readRow(columns, info))
break;
if (params.callback)
params.callback();
for (size_t column_idx = 0; column_idx < info.read_columns.size(); ++column_idx)
{
@ -64,7 +102,7 @@ Chunk IRowInputFormat::generate()
size_t column_size = columns[column_idx]->size();
if (column_size == 0)
throw Exception("Unexpected empty column", ErrorCodes::INCORRECT_NUMBER_OF_COLUMNS);
chunk_missing_values->setBit(column_idx, column_size - 1);
block_missing_values.setBit(column_idx, column_size - 1);
}
}
}
@ -134,12 +172,18 @@ Chunk IRowInputFormat::generate()
if (columns.empty() || columns[0]->empty())
{
if (params.allow_errors_num > 0 || params.allow_errors_ratio > 0)
{
Logger * log = &Logger::get("BlockInputStreamFromRowInputStream");
LOG_TRACE(log, "Skipped " << num_errors << " rows with errors while reading the input stream");
}
readSuffix();
return {};
}
Chunk chunk(std::move(columns), total_rows - prev_rows);
chunk.setChunkInfo(std::move(chunk_missing_values));
//chunk.setChunkInfo(std::move(chunk_missing_values));
return chunk;
}

View File

@ -3,6 +3,9 @@
#include <string>
#include <Columns/IColumn.h>
#include <Processors/Formats/IInputFormat.h>
#include <DataStreams/SizeLimits.h>
#include <Poco/Timespan.h>
#include <Common/Stopwatch.h>
namespace DB
@ -23,6 +26,14 @@ struct RowInputFormatParams
UInt64 allow_errors_num;
Float64 allow_errors_ratio;
UInt64 rows_portion_size;
using ReadCallback = std::function<void()>;
ReadCallback callback;
Poco::Timespan max_execution_time = 0;
OverflowMode timeout_overflow_mode = OverflowMode::THROW;
};
///Row oriented input format: reads data row by row.
@ -61,11 +72,16 @@ protected:
/// If not implemented, returns empty string.
virtual std::string getDiagnosticInfo() { return {}; }
const BlockMissingValues & getMissingValues() const override { return block_missing_values; }
private:
Params params;
Stopwatch total_stopwatch {CLOCK_MONOTONIC_COARSE};
size_t total_rows = 0;
size_t num_errors = 0;
BlockMissingValues block_missing_values;
};
}

View File

@ -4,6 +4,7 @@
#include <Formats/verbosePrintString.h>
#include <Processors/Formats/Impl/CSVRowInputFormat.h>
#include <Formats/FormatFactory.h>
#include <DataTypes/DataTypeNullable.h>
namespace DB
@ -18,16 +19,66 @@ namespace ErrorCodes
CSVRowInputFormat::CSVRowInputFormat(
ReadBuffer & in_, Block header, Params params, bool with_names_, const FormatSettings & format_settings)
: IRowInputFormat(std::move(header), in_, params), with_names(with_names_), format_settings(format_settings)
: IRowInputFormat(std::move(header), in_, std::move(params))
, with_names(with_names_)
, format_settings(format_settings)
{
auto & sample = getPort().getHeader();
size_t num_columns = sample.columns();
data_types.resize(num_columns);
column_indexes_by_names.reserve(num_columns);
column_idx_to_nullable_column_idx.resize(num_columns);
for (size_t i = 0; i < num_columns; ++i)
data_types[i] = sample.safeGetByPosition(i).type;
{
const auto & column_info = sample.getByPosition(i);
data_types[i] = column_info.type;
column_indexes_by_names.emplace(column_info.name, i);
/// If input_format_null_as_default=1 we need ColumnNullable of type DataTypeNullable(nested_type)
/// to parse the value as nullable before inserting it into the corresponding column of a non-nullable type.
/// Constructing a temporary column for each row is slow, so we prepare it here.
if (format_settings.csv.null_as_default && !column_info.type->isNullable() && column_info.type->canBeInsideNullable())
{
column_idx_to_nullable_column_idx[i] = nullable_columns.size();
nullable_types.emplace_back(std::make_shared<DataTypeNullable>(column_info.type));
nullable_columns.emplace_back(nullable_types.back()->createColumn());
}
}
}
/// Map an input file column to a table column, based on its name.
void CSVRowInputFormat::addInputColumn(const String & column_name)
{
const auto column_it = column_indexes_by_names.find(column_name);
if (column_it == column_indexes_by_names.end())
{
if (format_settings.skip_unknown_fields)
{
column_indexes_for_input_fields.push_back(std::nullopt);
return;
}
throw Exception(
"Unknown field found in CSV header: '" + column_name + "' " +
"at position " + std::to_string(column_indexes_for_input_fields.size()) +
"\nSet the 'input_format_skip_unknown_fields' parameter explicitly to ignore and proceed",
ErrorCodes::INCORRECT_DATA
);
}
const auto column_index = column_it->second;
if (read_columns[column_index])
throw Exception("Duplicate field found while parsing CSV header: " + column_name, ErrorCodes::INCORRECT_DATA);
read_columns[column_index] = true;
column_indexes_for_input_fields.emplace_back(column_index);
}
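addInputColumn resolves each header name through the name-to-index map built in the constructor: a known name maps the file column to a table column, while an unknown name either becomes std::nullopt (so the file column is skipped during parsing) or raises an error, depending on input_format_skip_unknown_fields. A reduced sketch of that mapping step:

#include <cstdio>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

int main()
{
    // Table columns, by position.
    const std::unordered_map<std::string, size_t> indexes_by_name{{"id", 0}, {"value", 1}};
    const bool skip_unknown_fields = true;

    // CSV header as read from the file; "extra" does not exist in the table.
    std::vector<std::optional<size_t>> input_to_table;
    for (const std::string & name : {"value", "extra", "id"})
    {
        auto it = indexes_by_name.find(name);
        if (it == indexes_by_name.end())
        {
            if (!skip_unknown_fields)
                throw std::runtime_error("Unknown field found in CSV header: " + name);
            input_to_table.push_back(std::nullopt);  // this file column will be skipped
        }
        else
            input_to_table.push_back(it->second);
    }
    for (size_t i = 0; i < input_to_table.size(); ++i)
        std::printf("file column %zu -> %s\n", i,
                    input_to_table[i] ? std::to_string(*input_to_table[i]).c_str() : "skip");
}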
static void skipEndOfLine(ReadBuffer & istr)
{
/// \n (Unix) or \r\n (DOS/Windows) or \n\r (Mac OS Classic)
@ -106,28 +157,111 @@ void CSVRowInputFormat::readPrefix()
size_t num_columns = data_types.size();
String tmp;
auto & header = getPort().getHeader();
if (with_names)
skipRow(in, format_settings.csv, num_columns);
{
/// This CSV file has a header row with column names. Depending on the
/// settings, use it or skip it.
if (format_settings.with_names_use_header)
{
/// Look at the file header to see which columns we have there.
/// The missing columns are filled with defaults.
read_columns.assign(header.columns(), false);
do
{
String column_name;
skipWhitespacesAndTabs(in);
readCSVString(column_name, in, format_settings.csv);
skipWhitespacesAndTabs(in);
addInputColumn(column_name);
}
while (checkChar(format_settings.csv.delimiter, in));
skipDelimiter(in, format_settings.csv.delimiter, true);
for (auto read_column : read_columns)
{
if (!read_column)
{
have_always_default_columns = true;
break;
}
}
return;
}
else
skipRow(in, format_settings.csv, num_columns);
}
/// The default: map each column of the file to the column of the table with
/// the same index.
read_columns.assign(header.columns(), true);
column_indexes_for_input_fields.resize(header.columns());
for (size_t i = 0; i < column_indexes_for_input_fields.size(); ++i)
{
column_indexes_for_input_fields[i] = i;
}
}
bool CSVRowInputFormat::readRow(MutableColumns & columns, RowReadExtension &)
bool CSVRowInputFormat::readRow(MutableColumns & columns, RowReadExtension & ext)
{
if (in.eof())
return false;
updateDiagnosticInfo();
size_t size = data_types.size();
/// Track whether we have to fill any columns in this row with default
/// values. If not, we return an empty column mask to the caller, so that
/// it doesn't have to check it.
bool have_default_columns = have_always_default_columns;
for (size_t i = 0; i < size; ++i)
const auto delimiter = format_settings.csv.delimiter;
for (size_t file_column = 0; file_column < column_indexes_for_input_fields.size(); ++file_column)
{
skipWhitespacesAndTabs(in);
data_types[i]->deserializeAsTextCSV(*columns[i], in, format_settings);
skipWhitespacesAndTabs(in);
const auto & table_column = column_indexes_for_input_fields[file_column];
const bool is_last_file_column =
file_column + 1 == column_indexes_for_input_fields.size();
skipDelimiter(in, format_settings.csv.delimiter, i + 1 == size);
if (table_column)
{
skipWhitespacesAndTabs(in);
read_columns[*table_column] = readField(*columns[*table_column], data_types[*table_column],
is_last_file_column, *table_column);
if (!read_columns[*table_column])
have_default_columns = true;
skipWhitespacesAndTabs(in);
}
else
{
/// We never read this column from the file, just skip it.
String tmp;
readCSVString(tmp, in, format_settings.csv);
}
skipDelimiter(in, delimiter, is_last_file_column);
}
if (have_default_columns)
{
for (size_t i = 0; i < read_columns.size(); i++)
{
if (!read_columns[i])
{
/// The column value for this row is going to be overwritten
/// with default by the caller, but the general assumption is
/// that the column size increases for each row, so we have
/// to insert something. Since we do not care about the exact
/// value, we do not have to use the default value specified by
/// the data type, and can just use IColumn::insertDefault().
columns[i]->insertDefault();
}
}
ext.read_columns = read_columns;
}
return true;
@ -190,93 +324,126 @@ String CSVRowInputFormat::getDiagnosticInfo()
return out.str();
}
bool CSVRowInputFormat::parseRowAndPrintDiagnosticInfo(MutableColumns & columns,
/** gcc-7 generates wrong code with optimization level greater than 1.
* See tests: dbms/src/IO/tests/write_int.cpp
* and dbms/tests/queries/0_stateless/00898_parsing_bad_diagnostic_message.sh
* This is a compiler bug. The bug is not present in gcc-8 and clang-8.
* Nevertheless, we don't need high optimization of this function.
*/
bool OPTIMIZE(1) CSVRowInputFormat::parseRowAndPrintDiagnosticInfo(MutableColumns & columns,
WriteBuffer & out, size_t max_length_of_column_name, size_t max_length_of_data_type_name)
{
const char delimiter = format_settings.csv.delimiter;
auto & header = getPort().getHeader();
size_t size = data_types.size();
for (size_t i = 0; i < size; ++i)
for (size_t file_column = 0; file_column < column_indexes_for_input_fields.size(); ++file_column)
{
if (i == 0 && in.eof())
if (file_column == 0 && in.eof())
{
out << "<End of stream>\n";
return false;
}
out << "Column " << i << ", " << std::string((i < 10 ? 2 : i < 100 ? 1 : 0), ' ')
<< "name: " << header.safeGetByPosition(i).name << ", " << std::string(max_length_of_column_name - header.safeGetByPosition(i).name.size(), ' ')
<< "type: " << data_types[i]->getName() << ", " << std::string(max_length_of_data_type_name - data_types[i]->getName().size(), ' ');
BufferBase::Position prev_position = in.position();
BufferBase::Position curr_position = in.position();
std::exception_ptr exception;
try
if (column_indexes_for_input_fields[file_column].has_value())
{
skipWhitespacesAndTabs(in);
prev_position = in.position();
data_types[i]->deserializeAsTextCSV(*columns[i], in, format_settings);
curr_position = in.position();
skipWhitespacesAndTabs(in);
}
catch (...)
{
exception = std::current_exception();
}
const auto & table_column = *column_indexes_for_input_fields[file_column];
const auto & current_column_type = data_types[table_column];
const bool is_last_file_column =
file_column + 1 == column_indexes_for_input_fields.size();
const bool at_delimiter = *in.position() == delimiter;
const bool at_last_column_line_end = is_last_file_column
&& (*in.position() == '\n' || *in.position() == '\r'
|| in.eof());
if (curr_position < prev_position)
throw Exception("Logical error: parsing is non-deterministic.", ErrorCodes::LOGICAL_ERROR);
auto & header = getPort().getHeader();
out << "Column " << file_column << ", " << std::string((file_column < 10 ? 2 : file_column < 100 ? 1 : 0), ' ')
<< "name: " << header.safeGetByPosition(table_column).name << ", " << std::string(max_length_of_column_name - header.safeGetByPosition(table_column).name.size(), ' ')
<< "type: " << current_column_type->getName() << ", " << std::string(max_length_of_data_type_name - current_column_type->getName().size(), ' ');
if (isNumber(data_types[i]) || isDateOrDateTime(data_types[i]))
{
/// An empty string instead of a value.
if (curr_position == prev_position)
if (format_settings.csv.empty_as_default
&& (at_delimiter || at_last_column_line_end))
{
out << "ERROR: text ";
verbosePrintString(prev_position, std::min(prev_position + 10, in.buffer().end()), out);
out << " is not like " << data_types[i]->getName() << "\n";
return false;
columns[table_column]->insertDefault();
}
}
out << "parsed text: ";
verbosePrintString(prev_position, curr_position, out);
if (exception)
{
if (data_types[i]->getName() == "DateTime")
out << "ERROR: DateTime must be in YYYY-MM-DD hh:mm:ss or NNNNNNNNNN (unix timestamp, exactly 10 digits) format.\n";
else if (data_types[i]->getName() == "Date")
out << "ERROR: Date must be in YYYY-MM-DD format.\n";
else
out << "ERROR\n";
return false;
}
out << "\n";
if (data_types[i]->haveMaximumSizeOfValue())
{
if (*curr_position != '\n' && *curr_position != '\r' && *curr_position != delimiter)
{
out << "ERROR: garbage after " << data_types[i]->getName() << ": ";
verbosePrintString(curr_position, std::min(curr_position + 10, in.buffer().end()), out);
BufferBase::Position prev_position = in.position();
BufferBase::Position curr_position = in.position();
std::exception_ptr exception;
try
{
skipWhitespacesAndTabs(in);
prev_position = in.position();
readField(*columns[table_column], current_column_type, is_last_file_column, table_column);
curr_position = in.position();
skipWhitespacesAndTabs(in);
}
catch (...)
{
exception = std::current_exception();
}
if (curr_position < prev_position)
throw Exception("Logical error: parsing is non-deterministic.", ErrorCodes::LOGICAL_ERROR);
if (isNativeNumber(current_column_type) || isDateOrDateTime(current_column_type))
{
/// An empty string instead of a value.
if (curr_position == prev_position)
{
out << "ERROR: text ";
verbosePrintString(prev_position, std::min(prev_position + 10, in.buffer().end()), out);
out << " is not like " << current_column_type->getName() << "\n";
return false;
}
}
out << "parsed text: ";
verbosePrintString(prev_position, curr_position, out);
if (exception)
{
if (current_column_type->getName() == "DateTime")
out << "ERROR: DateTime must be in YYYY-MM-DD hh:mm:ss or NNNNNNNNNN (unix timestamp, exactly 10 digits) format.\n";
else if (current_column_type->getName() == "Date")
out << "ERROR: Date must be in YYYY-MM-DD format.\n";
else
out << "ERROR\n";
return false;
}
out << "\n";
if (data_types[i]->getName() == "DateTime")
out << "ERROR: DateTime must be in YYYY-MM-DD hh:mm:ss or NNNNNNNNNN (unix timestamp, exactly 10 digits) format.\n";
else if (data_types[i]->getName() == "Date")
out << "ERROR: Date must be in YYYY-MM-DD format.\n";
if (current_column_type->haveMaximumSizeOfValue()
&& *curr_position != '\n' && *curr_position != '\r'
&& *curr_position != delimiter)
{
out << "ERROR: garbage after " << current_column_type->getName() << ": ";
verbosePrintString(curr_position, std::min(curr_position + 10, in.buffer().end()), out);
out << "\n";
return false;
if (current_column_type->getName() == "DateTime")
out << "ERROR: DateTime must be in YYYY-MM-DD hh:mm:ss or NNNNNNNNNN (unix timestamp, exactly 10 digits) format.\n";
else if (current_column_type->getName() == "Date")
out << "ERROR: Date must be in YYYY-MM-DD format.\n";
return false;
}
}
}
else
{
static const String skipped_column_str = "<SKIPPED COLUMN>";
out << "Column " << file_column << ", " << std::string((file_column < 10 ? 2 : file_column < 100 ? 1 : 0), ' ')
<< "name: " << skipped_column_str << ", " << std::string(max_length_of_column_name - skipped_column_str.length(), ' ')
<< "type: " << skipped_column_str << ", " << std::string(max_length_of_data_type_name - skipped_column_str.length(), ' ');
String tmp;
readCSVString(tmp, in, format_settings.csv);
}
/// Delimiters
if (i + 1 == size)
if (file_column + 1 == column_indexes_for_input_fields.size())
{
if (in.eof())
return false;
@ -294,8 +461,8 @@ bool CSVRowInputFormat::parseRowAndPrintDiagnosticInfo(MutableColumns & columns,
out << "ERROR: There is no line feed. ";
verbosePrintString(in.position(), in.position() + 1, out);
out << " found instead.\n"
" It's like your file has more columns than expected.\n"
"And if your file have right number of columns, maybe it have unquoted string value with comma.\n";
" It's like your file has more columns than expected.\n"
"And if your file have right number of columns, maybe it have unquoted string value with comma.\n";
return false;
}
@ -313,8 +480,8 @@ bool CSVRowInputFormat::parseRowAndPrintDiagnosticInfo(MutableColumns & columns,
if (*in.position() == '\n' || *in.position() == '\r')
{
out << "ERROR: Line feed found where delimiter (" << delimiter << ") is expected."
" It's like your file has less columns than expected.\n"
"And if your file have right number of columns, maybe it have unescaped quotes in values.\n";
" It's like your file has less columns than expected.\n"
"And if your file have right number of columns, maybe it have unescaped quotes in values.\n";
}
else
{
@ -347,6 +514,45 @@ void CSVRowInputFormat::updateDiagnosticInfo()
pos_of_current_row = in.position();
}
bool CSVRowInputFormat::readField(IColumn & column, const DataTypePtr & type, bool is_last_file_column, size_t column_idx)
{
const bool at_delimiter = *in.position() == format_settings.csv.delimiter;
const bool at_last_column_line_end = is_last_file_column
&& (*in.position() == '\n' || *in.position() == '\r'
|| in.eof());
if (format_settings.csv.empty_as_default
&& (at_delimiter || at_last_column_line_end))
{
/// Treat empty unquoted column value as default value, if
/// specified in the settings. Tuple columns might seem
/// problematic, because they are never quoted but still contain
/// commas, which might be also used as delimiters. However,
/// they do not contain empty unquoted fields, so this check
/// works for tuples as well.
return false;
}
else if (column_idx_to_nullable_column_idx[column_idx])
{
/// If value is null but type is not nullable then use default value instead.
const size_t nullable_idx = *column_idx_to_nullable_column_idx[column_idx];
auto & tmp_col = *nullable_columns[nullable_idx];
nullable_types[nullable_idx]->deserializeAsTextCSV(tmp_col, in, format_settings);
Field value = tmp_col[0];
tmp_col.popBack(1); /// do not store copy of values in memory
if (value.isNull())
return false;
column.insert(value);
return true;
}
else
{
/// Read the column normally.
type->deserializeAsTextCSV(column, in, format_settings);
return true;
}
}
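The null_as_default branch parses the text into the pre-built temporary Nullable column, then either inserts the parsed value into the real column or reports false so the caller fills in the default. A reduced model of that flow, with std::optional standing in for the Nullable column and \N as the assumed NULL literal:

#include <cstdio>
#include <optional>
#include <string>
#include <vector>

// Stand-in for "deserialize as Nullable": "\\N" reads as NULL, anything else as an int.
static std::optional<int> parseNullableInt(const std::string & text)
{
    if (text == "\\N")
        return std::nullopt;
    return std::stoi(text);
}

int main()
{
    std::vector<int> column;  // the real, non-nullable column
    for (const std::string & field : {"1", "\\N", "3"})
    {
        std::optional<int> value = parseNullableInt(field);  // the temporary nullable column
        column.push_back(value.value_or(0));                 // NULL becomes the default, as with insertDefault()
        std::printf("read=%d value=%d\n", value.has_value(), column.back());
    }
}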
void registerInputFormatProcessorCSV(FormatFactory & factory)
{
@ -359,7 +565,7 @@ void registerInputFormatProcessorCSV(FormatFactory & factory)
IRowInputFormat::Params params,
const FormatSettings & settings)
{
return std::make_shared<CSVRowInputFormat>(buf, sample, params, with_names, settings);
return std::make_shared<CSVRowInputFormat>(buf, sample, std::move(params), with_names, settings);
});
}
}

View File

@ -36,8 +36,26 @@ private:
const FormatSettings format_settings;
/// For convenient diagnostics in case of an error.
using IndexesMap = std::unordered_map<String, size_t>;
IndexesMap column_indexes_by_names;
/// Maps indexes of columns in the input file to indexes of table columns
using OptionalIndexes = std::vector<std::optional<size_t>>;
OptionalIndexes column_indexes_for_input_fields;
/// Tracks which columns we have read in a single read() call.
/// For columns that are never read, it is initialized to false when we
/// read the file header, and never changed afterwards.
/// For other columns, it is updated on each read() call.
std::vector<UInt8> read_columns;
/// Whether we have any columns that are not read from file at all,
/// and must be always initialized with defaults.
bool have_always_default_columns = false;
void addInputColumn(const String & column_name);
/// For convenient diagnostics in case of an error.
size_t row_num = 0;
/// How many bytes were read, not counting those that are still in the buffer.
@ -47,10 +65,17 @@ private:
char * pos_of_current_row = nullptr;
char * pos_of_prev_row = nullptr;
/// For setting input_format_null_as_default
DataTypes nullable_types;
MutableColumns nullable_columns;
OptionalIndexes column_idx_to_nullable_column_idx;
void updateDiagnosticInfo();
bool parseRowAndPrintDiagnosticInfo(MutableColumns & columns,
WriteBuffer & out, size_t max_length_of_column_name, size_t max_length_of_data_type_name);
bool readField(IColumn & column, const DataTypePtr & type, bool is_last_file_column, size_t column_idx);
};
}

View File

@ -1,9 +1,9 @@
#include "config_formats.h"
#include <Processors/Formats/Impl/CapnProtoRowInputFormat.h> // Y_IGNORE
#if USE_CAPNP
#include <IO/ReadBuffer.h>
#include <Interpreters/Context.h>
#include <Processors/Formats/Impl/CapnProtoRowInputFormat.h> // Y_IGNORE
#include <Formats/FormatFactory.h>
#include <Formats/FormatSchemaInfo.h>
#include <capnp/serialize.h> // Y_IGNORE
@ -179,7 +179,7 @@ void CapnProtoRowInputFormat::createActions(const NestedFieldList & sorted_field
}
CapnProtoRowInputFormat::CapnProtoRowInputFormat(ReadBuffer & in_, Block header, Params params, const FormatSchemaInfo & info)
: IRowInputFormat(std::move(header), in_, params), parser(std::make_shared<SchemaParser>())
: IRowInputFormat(std::move(header), in_, std::move(params)), parser(std::make_shared<SchemaParser>())
{
// Parse the schema and fetch the root object
@ -206,28 +206,42 @@ CapnProtoRowInputFormat::CapnProtoRowInputFormat(ReadBuffer & in_, Block header,
createActions(list, root);
}
kj::Array<capnp::word> CapnProtoRowInputFormat::readMessage()
{
uint32_t segment_count;
in.readStrict(reinterpret_cast<char*>(&segment_count), sizeof(uint32_t));
// (2 + segment_count): one uint32 for the segment count field itself, plus one extra because segment_count is stored as the number of segments minus one.
const auto prefix_size = (2 + segment_count) * sizeof(uint32_t);
const auto words_prefix_size = (segment_count + 1) / 2 + 1;
auto prefix = kj::heapArray<capnp::word>(words_prefix_size);
auto prefix_chars = prefix.asChars();
::memcpy(prefix_chars.begin(), &segment_count, sizeof(uint32_t));
// read size of each segment
for (size_t i = 0; i <= segment_count; ++i)
in.readStrict(prefix_chars.begin() + ((i + 1) * sizeof(uint32_t)), sizeof(uint32_t));
// calculate size of message
const auto expected_words = capnp::expectedSizeInWordsFromPrefix(prefix);
const auto expected_bytes = expected_words * sizeof(capnp::word);
const auto data_size = expected_bytes - prefix_size;
auto msg = kj::heapArray<capnp::word>(expected_words);
auto msg_chars = msg.asChars();
// read full message
::memcpy(msg_chars.begin(), prefix_chars.begin(), prefix_size);
in.readStrict(msg_chars.begin() + prefix_size, data_size);
return msg;
}
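readMessage first consumes the Cap'n Proto segment table: one uint32 whose stored value is the segment count minus one, followed by one uint32 size per segment, padded so the table ends on an 8-byte word boundary; hence the (2 + segment_count) * sizeof(uint32_t) bytes and (segment_count + 1) / 2 + 1 words above. A standalone check of that arithmetic:

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
{
    // Cap'n Proto framing: uint32 segment count (stored as N - 1), then N segment
    // sizes as uint32s, padded to a whole number of 8-byte words.
    for (uint32_t segment_count = 0; segment_count < 5; ++segment_count)  // value as read from the wire
    {
        const size_t prefix_bytes = (2 + segment_count) * sizeof(uint32_t);  // count field + (N) sizes
        const size_t prefix_words = (segment_count + 1) / 2 + 1;             // same table, padded, in words
        std::printf("count field=%u -> %zu bytes, %zu words (%zu padded bytes)\n",
                    segment_count, prefix_bytes, prefix_words, prefix_words * 8);
    }
}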
bool CapnProtoRowInputFormat::readRow(MutableColumns & columns, RowReadExtension &)
{
if (in.eof())
return false;
// Read from underlying buffer directly
auto buf = in.buffer();
auto base = reinterpret_cast<const capnp::word *>(in.position());
// Check if there's enough bytes in the buffer to read the full message
kj::Array<capnp::word> heap_array;
auto array = kj::arrayPtr(base, buf.size() - in.offset());
auto expected_words = capnp::expectedSizeInWordsFromPrefix(array);
if (expected_words * sizeof(capnp::word) > array.size())
{
// We'll need to reassemble the message in a contiguous buffer
heap_array = kj::heapArray<capnp::word>(expected_words);
in.readStrict(heap_array.asChars().begin(), heap_array.asChars().size());
array = heap_array.asPtr();
}
auto array = readMessage();
#if CAPNP_VERSION >= 8000
capnp::UnalignedFlatArrayMessageReader msg(array);
@ -281,13 +295,6 @@ bool CapnProtoRowInputFormat::readRow(MutableColumns & columns, RowReadExtension
}
}
// Advance buffer position if used directly
if (heap_array.size() == 0)
{
auto parsed = (msg.getEnd() - base) * sizeof(capnp::word);
in.position() += parsed;
}
return true;
}
@ -297,7 +304,7 @@ void registerInputFormatProcessorCapnProto(FormatFactory & factory)
"CapnProto",
[](ReadBuffer & buf, const Block & sample, const Context & context, IRowInputFormat::Params params, const FormatSettings &)
{
return std::make_shared<CapnProtoRowInputFormat>(buf, sample, params, FormatSchemaInfo(context, "capnp"));
return std::make_shared<CapnProtoRowInputFormat>(buf, sample, std::move(params), FormatSchemaInfo(context, "CapnProto"));
});
}

View File

@ -40,6 +40,8 @@ public:
bool readRow(MutableColumns & columns, RowReadExtension &) override;
private:
kj::Array<capnp::word> readMessage();
// Build a traversal plan from a sorted list of fields
void createActions(const NestedFieldList & sortedFields, capnp::StructSchema reader);

View File

@ -34,8 +34,6 @@ void JSONCompactRowOutputFormat::writeTotalsFieldDelimiter()
void JSONCompactRowOutputFormat::writeRowStartDelimiter()
{
if (row_count > 0)
writeCString(",\n", *ostr);
writeCString("\t\t[", *ostr);
}

View File

@ -28,7 +28,7 @@ enum
JSONEachRowRowInputFormat::JSONEachRowRowInputFormat(
ReadBuffer & in_, const Block & header, Params params, const FormatSettings & format_settings)
: IRowInputFormat(header, in_, params), format_settings(format_settings), name_map(header.columns())
: IRowInputFormat(header, in_, std::move(params)), format_settings(format_settings), name_map(header.columns())
{
/// In this format, BOM at beginning of stream cannot be confused with value, so it is safe to skip it.
skipBOMIfExists(in);
@ -263,7 +263,7 @@ void registerInputFormatProcessorJSONEachRow(FormatFactory & factory)
IRowInputFormat::Params params,
const FormatSettings & settings)
{
return std::make_shared<JSONEachRowRowInputFormat>(buf, sample, params, settings);
return std::make_shared<JSONEachRowRowInputFormat>(buf, sample, std::move(params), settings);
});
}

View File

@ -13,7 +13,7 @@ class ReadBuffer;
/** A stream for reading data in JSON format, where each row is represented by a separate JSON object.
* Objects can be separated by feed return, other whitespace characters in any number and possibly a comma.
* Objects can be separated by line feed, other whitespace characters in any number and possibly a comma.
* Fields can be listed in any order (including, in different lines there may be different order),
* and some fields may be missing.
*/

View File

@ -24,6 +24,11 @@ public:
void writeRowStartDelimiter() override;
void writeRowEndDelimiter() override;
protected:
/// No totals and extremes.
void consumeTotals(Chunk) override {}
void consumeExtremes(Chunk) override {}
private:
size_t field_number = 0;
Names fields;


@ -17,39 +17,47 @@ using namespace MySQLProtocol;
MySQLOutputFormat::MySQLOutputFormat(WriteBuffer & out_, const Block & header, const Context & context, const FormatSettings & settings)
: IOutputFormat(header, out_)
, context(context)
, packet_sender(std::make_shared<PacketSender>(out, const_cast<uint8_t &>(context.mysql.sequence_id))) /// TODO: fix it
, packet_sender(out, const_cast<uint8_t &>(context.mysql.sequence_id)) /// TODO: fix it
, format_settings(settings)
{
packet_sender.max_packet_size = context.mysql.max_packet_size;
}
void MySQLOutputFormat::initialize()
{
if (initialized)
return;
initialized = true;
auto & header = getPort(PortKind::Main).getHeader();
if (header.columns())
{
packet_sender.sendPacket(LengthEncodedNumber(header.columns()));
for (const ColumnWithTypeAndName & column : header.getColumnsWithTypeAndName())
{
ColumnDefinition column_definition(column.name, CharacterSet::binary, 0, ColumnType::MYSQL_TYPE_STRING,
0, 0);
packet_sender.sendPacket(column_definition);
}
if (!(context.mysql.client_capabilities & Capability::CLIENT_DEPRECATE_EOF))
{
packet_sender.sendPacket(EOF_Packet(0, 0));
}
}
}
void MySQLOutputFormat::consume(Chunk chunk)
{
initialize();
auto & header = getPort(PortKind::Main).getHeader();
if (!initialized)
{
initialized = true;
if (header.columns())
{
packet_sender->sendPacket(LengthEncodedNumber(header.columns()));
for (const ColumnWithTypeAndName & column : header.getColumnsWithTypeAndName())
{
ColumnDefinition column_definition(column.name, CharacterSet::binary, 0, ColumnType::MYSQL_TYPE_STRING,
0, 0);
packet_sender->sendPacket(column_definition);
}
if (!(context.mysql.client_capabilities & Capability::CLIENT_DEPRECATE_EOF))
{
packet_sender->sendPacket(EOF_Packet(0, 0));
}
}
}
size_t rows = chunk.getNumRows();
auto & columns = chunk.getColumns();
@ -58,14 +66,11 @@ void MySQLOutputFormat::consume(Chunk chunk)
ResultsetRow row_packet;
for (size_t col = 0; col < columns.size(); ++col)
{
String column_value;
WriteBufferFromString ostr(column_value);
WriteBufferFromOwnString ostr;
header.getByPosition(col).type->serializeAsText(*columns[col], i, ostr, format_settings);
ostr.finish();
row_packet.appendColumn(std::move(column_value));
row_packet.appendColumn(std::move(ostr.str()));
}
packet_sender->sendPacket(row_packet);
packet_sender.sendPacket(row_packet);
}
}
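The switch to WriteBufferFromOwnString above lets the buffer own its backing string, replacing the external String plus explicit finish() pair. A minimal usage sketch, assuming ClickHouse's WriteHelpers:

    #include <IO/WriteBufferFromString.h>
    #include <IO/WriteHelpers.h>

    DB::WriteBufferFromOwnString ostr;
    DB::writeString("hello", ostr);                // serialize into the buffer
    std::string value = std::move(ostr.str());     // take the accumulated string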
@ -84,15 +89,19 @@ void MySQLOutputFormat::finalize()
<< formatReadableSizeWithBinarySuffix(info.read_bytes / info.elapsed_seconds) << "/sec.";
}
auto & header = getPort(PortKind::Main).getHeader();
const auto & header = getPort(PortKind::Main).getHeader();
if (header.columns() == 0)
packet_sender->sendPacket(OK_Packet(0x0, context.mysql.client_capabilities, affected_rows, 0, 0, "", human_readable_info.str()), true);
packet_sender.sendPacket(OK_Packet(0x0, context.mysql.client_capabilities, affected_rows, 0, 0, "", human_readable_info.str()), true);
else
if (context.mysql.client_capabilities & CLIENT_DEPRECATE_EOF)
packet_sender->sendPacket(OK_Packet(0xfe, context.mysql.client_capabilities, affected_rows, 0, 0, "", human_readable_info.str()), true);
packet_sender.sendPacket(OK_Packet(0xfe, context.mysql.client_capabilities, affected_rows, 0, 0, "", human_readable_info.str()), true);
else
packet_sender->sendPacket(EOF_Packet(0, 0), true);
packet_sender.sendPacket(EOF_Packet(0, 0), true);
}
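Taken together, initialize(), consume() and finalize() emit a standard MySQL text-protocol result set. The packet sequence, as implemented above:

    LengthEncodedNumber(column_count)
    column_count x ColumnDefinition     (column name, binary charset, MYSQL_TYPE_STRING)
    EOF_Packet                          (only if the client lacks CLIENT_DEPRECATE_EOF)
    ResultsetRow                        (one per row, every column serialized as text)
    OK_Packet(0xfe) or EOF_Packet       (terminator, chosen by CLIENT_DEPRECATE_EOF;
                                         a result set with no columns sends a single
                                         OK_Packet(0x0) instead)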
void MySQLOutputFormat::flush()
{
packet_sender.out->next();
}
void registerOutputFormatProcessorMySQLWrite(FormatFactory & factory)


@ -25,13 +25,17 @@ public:
void consume(Chunk) override;
void finalize() override;
void flush() override;
void doWritePrefix() override { initialize(); }
void initialize();
private:
bool initialized = false;
const Context & context;
std::shared_ptr<MySQLProtocol::PacketSender> packet_sender;
MySQLProtocol::PacketSender packet_sender;
FormatSettings format_settings;
};


@ -6,6 +6,7 @@
#include <Core/iostream_debug_helpers.h>
#include <DataTypes/DataTypeLowCardinality.h>
namespace DB
@ -38,7 +39,7 @@ void ODBCDriver2BlockOutputFormat::writeRow(const Block & header, const Columns
{
{
WriteBufferFromString text_out(buffer);
header.getByPosition(row_idx).type->serializeAsText(*column, row_idx, text_out, format_settings);
header.getByPosition(column_idx).type->serializeAsText(*column, row_idx, text_out, format_settings);
}
writeODBCString(out, buffer);
}
@ -95,8 +96,10 @@ void ODBCDriver2BlockOutputFormat::writePrefix()
writeODBCString(out, "type");
for (size_t i = 0; i < columns; ++i)
{
const ColumnWithTypeAndName & col = header.getByPosition(i);
writeODBCString(out, col.type->getName());
auto type = header.getByPosition(i).type;
if (type->lowCardinality())
type = recursiveRemoveLowCardinality(type);
writeODBCString(out, type->getName());
}
}
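With this change the "type" header row advertises dictionary-free type names, so ODBC clients never see LowCardinality wrappers. A sketch of the effect, assuming ClickHouse's DataTypeFactory:

    #include <DataTypes/DataTypeFactory.h>
    #include <DataTypes/DataTypeLowCardinality.h>

    auto type = DB::DataTypeFactory::instance().get("LowCardinality(String)");
    if (type->lowCardinality())
        type = DB::recursiveRemoveLowCardinality(type);
    // type->getName() now returns "String"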


@ -1,8 +1,7 @@
#include "config_formats.h"
#include <Processors/Formats/Impl/ParquetBlockInputFormat.h>
#if USE_PARQUET
#include "ParquetBlockInputFormat.h"
#include <algorithm>
#include <iterator>
#include <vector>
@ -29,15 +28,9 @@
#include <common/DateLUTImpl.h>
#include <ext/range.h>
#include <arrow/api.h>
//#include <arrow/buffer.h>
//#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
//#include <parquet/arrow/writer.h>
//#include <parquet/exception.h>
#include <parquet/file_reader.h>
#include <Core/iostream_debug_helpers.h> // REMOVE ME
namespace DB
{
namespace ErrorCodes


@ -1,11 +1,9 @@
#pragma once
#include <Common/config.h>
#include "config_formats.h"
#if USE_PARQUET
# include <Processors/Formats/IInputFormat.h>
//# include <parquet/file_reader.h>
//# include <parquet/arrow/reader.h>
//# include <arrow/buffer.h>
#include <Processors/Formats/IInputFormat.h>
namespace parquet { namespace arrow { class FileReader; } }
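The forward declaration above keeps the heavy parquet headers out of this public header; only the .cpp needs the full definition. A sketch of the pattern (hypothetical member name):

    #include <memory>

    namespace parquet { namespace arrow { class FileReader; } }   // incomplete type

    class ParquetBlockInputFormat
    {
    public:
        ~ParquetBlockInputFormat();   // defined in the .cpp, where FileReader is complete

    private:
        // A pointer member only needs the forward declaration here.
        std::unique_ptr<parquet::arrow::FileReader> file_reader;
    };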


@ -1,7 +1,7 @@
#include "config_formats.h"
#include <Processors/Formats/Impl/ParquetBlockOutputFormat.h>
#if USE_PARQUET
# include "ParquetBlockOutputFormat.h"
// TODO: clean includes
# include <Columns/ColumnDecimal.h>

Some files were not shown because too many files have changed in this diff.