ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-11-28 02:21:59 +00:00

Robert Schulze 514d4d2187 Implement ProtobufList - fixes ClickHouse#16436	2022-03-14 08:04:58 +01:00

Introduce IO format "ProtobufList" with protobuf schema

    // schemafile.proto
    message Envelope {
      message MessageType {
        uint32 colA = 1;
        string colB = 2;
      }
      repeated MessageType mt = 1;
    }

where "Envelope" is a hard-coded/expected top-level message and "MessageType" is a message with a user-provided name containing the table fields to export/import, e.g.

    SELECT * FROM db1.tab1 FORMAT ProtobufList SETTINGS format_schema = 'schemafile:MessageType'

As a result, the new format wraps a list of messages (one per row) into a single, containing message. Compare that to the schema of the existing IO formats "Protobuf" and "ProtobufSingle":

    message MessageType {
      uint32 colA = 1;
      string colB = 2;
    }

The new format does not save space compared to the existing formats, but it is conceptually a bit more beautiful and also more convenient.

Implementation details:

- Created new files ProtobufList(Input|Output)Format which use the existing ProtobufSerializer mechanism. The goal was to reuse as much code as possible and avoid copypasta.
- I was torn between inheriting from I(Input|Output)Format vs. IRow(Input|Output)Format for ProtobufList(Input|Output)Format. The former is chunk-based, which can be better for performance. Since the ProtobufSerializer mechanism is row-based but data is generally passed around in chunks, I chose the latter to leverage the existing chunk <--> row mapping code in IRow(Input|Output)Format.
- A new ProtobufSerializer called ProtobufSerializerEnvelope was introduced (--> ProtobufSerializer.cpp). It represents the top-level message which encloses the list of inner nested messages, i.e. the rows.
- With the new format, parsing the schema file and matching the fields in the schema file to table columns works like for the old formats. The only difference is that parsing starts one level below the "Envelope" (--> ProtobufSchema.cpp). This is more natural than forcing customers to have table columns start with "Envelope".
- Creation of the ProtobufSerializer tree also works like before. What is different is that we finally add a ProtobufSerializerEnvelope as the new root of the tree. Its only purpose is to write/read the top-level message when the first/last row is written/read.

Caveats:

- The low-level serialization code in ProtobufWriter uses an internal buffer which is flushed to the output file only in endMessage(). In the existing "Protobuf" format, this happens once per row; in the new format, this happens only at the end of the serialization, since row-level messages now call start/endNestedMessage(). As a future TODO, the buffer should also be flushed in start/endNestedMessage() to reduce memory consumption.
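The envelope layout described in the commit message can be illustrated by encoding the example schema by hand in the protobuf wire format. This is a minimal sketch, not ClickHouse code: the function names are hypothetical, and the length-delimited comparison assumes that the existing "Protobuf" format prefixes each row message with its varint length. Any extra framing ClickHouse may put around the envelope is not modelled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_row(col_a: int, col_b: str) -> bytes:
    """Encode one MessageType { uint32 colA = 1; string colB = 2; }."""
    payload = col_b.encode("utf-8")
    return (
        b"\x08" + encode_varint(col_a)                     # field 1, varint
        + b"\x12" + encode_varint(len(payload)) + payload  # field 2, length-delimited
    )


def encode_envelope(rows) -> bytes:
    """Encode Envelope { repeated MessageType mt = 1; }, one entry per row."""
    out = bytearray()
    for col_a, col_b in rows:
        row = encode_row(col_a, col_b)
        out += b"\x0a" + encode_varint(len(row)) + row     # field 1, length-delimited
    return bytes(out)


def encode_length_delimited(rows) -> bytes:
    """Assumed framing of the row-per-message stream: varint length, then the row."""
    out = bytearray()
    for col_a, col_b in rows:
        row = encode_row(col_a, col_b)
        out += encode_varint(len(row)) + row
    return bytes(out)


rows = [(1, "a"), (300, "bc")]
envelope = encode_envelope(rows)
stream = encode_length_delimited(rows)
print(envelope.hex())               # 0a0508011201610a0708ac0212026263
print(len(envelope), len(stream))   # 16 14
```

Under these assumptions each row costs one extra tag byte inside the envelope, which is consistent with the remark that the new format does not save space.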
examples	fix code	2021-12-14 17:37:31 +08:00
Executors	Fix memory accounting for queries that uses < max_untracker_memory	2022-01-28 16:25:33 +03:00
Formats	Implement ProtobufList - fixes ClickHouse#16436	2022-03-14 08:04:58 +01:00
Merges	Fix clang-tidy issue	2022-02-06 22:36:35 -08:00
QueryPlan	avoid unnecessary copying of Settings	2022-02-10 12:13:51 +03:00
Sinks	Fix MV query with multiple chunk result.	2021-12-16 21:17:05 +03:00
Sources	to MaterializeMySQL_support_set_and_other_dataType	2022-01-21 12:24:12 +08:00
tests	Force PipeLineExecutor creators to pass a QueryStatus	2021-12-09 10:02:32 +01:00
Transforms	Merge pull request #34601 from CurtizJ/filtering-by-sparse-columns	2022-02-15 23:26:13 +03:00
TTL	Move code around to avoid dlsym on Musl	2021-12-24 12:25:27 +03:00
Chunk.cpp	Merge remote-tracking branch 'upstream/master' into HEAD	2021-12-01 15:49:02 +03:00
Chunk.h	add comments	2021-12-08 18:56:30 +03:00
CMakeLists.txt	move to examples everywhere	2021-04-27 01:51:42 +03:00
ConcatProcessor.cpp	Use Resize instead of Concat in InterpreterInsertQuery.	2020-06-05 12:30:16 +03:00
ConcatProcessor.h	Fix half of typos	2020-08-08 03:47:03 +03:00
DelayedPortsProcessor.cpp	Update sort to pdqsort	2022-01-30 19:49:48 +00:00
DelayedPortsProcessor.h	Colse input ports in DelayedPortsProcessor as soon as all outputs are finished.	2021-02-03 14:57:22 +03:00
ForkProcessor.cpp	clang-tidy move fix build	2022-02-08 21:21:32 +00:00
ForkProcessor.h
IAccumulatingTransform.cpp	Fix some tests.	2021-09-06 23:13:06 +03:00
IAccumulatingTransform.h	Fix typos reported by codespell	2020-10-27 12:04:03 +01:00
IInflatingTransform.cpp
IInflatingTransform.h
IProcessor.cpp	Fixing build and tests.	2020-12-09 17:11:20 +03:00
IProcessor.h	mirror changes in code and comment	2021-01-22 09:13:22 +00:00
ISimpleTransform.cpp
ISimpleTransform.h	support adding defaults in async inserts	2021-09-08 17:08:57 +03:00
ISink.cpp	Fix more tests.	2021-07-26 17:47:29 +03:00
ISink.h	Fix more tests.	2021-07-26 17:47:29 +03:00
ISource.cpp	Add async status to RemoteSource.	2020-12-04 13:52:57 +03:00
ISource.h	better	2021-03-11 18:22:24 +03:00
LimitTransform.cpp	Minor fixes for min/sim hash	2020-12-29 13:16:43 +03:00
LimitTransform.h	Avoid overflow in LIMIT #10470 #11372	2020-07-12 08:18:01 +03:00
OffsetTransform.cpp	Fix comment	2020-07-13 01:32:24 +03:00
OffsetTransform.h	Fix build	2020-07-13 02:53:13 +03:00
Port.cpp	Try fix #23029	2021-04-20 14:55:23 +03:00
Port.h	Rewrite distributed DDL to Processors	2021-07-18 00:45:07 +03:00
QueueBuffer.h	Rewrite distributed DDL to Processors	2021-07-18 00:45:07 +03:00
ResizeProcessor.cpp	Fix possible Pipeline stuck in case of StrictResize processor.	2021-12-06 14:53:39 +03:00
ResizeProcessor.h	Fix half of typos	2020-08-08 03:47:03 +03:00
RowsBeforeLimitCounter.h	Added RemoteSource.	2020-06-03 22:50:11 +03:00