mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-28 02:21:59 +00:00
514d4d2187
Introduce IO format "ProtobufList"

Introduce IO format "ProtobufList" with protobuf schema

    // schemafile.proto
    message Envelope {
      message MessageType {
        uint32 colA = 1;
        string colB = 2;
      }
      repeated MessageType mt = 1;
    }

where "Envelope" is a hard-coded/expected top-level message and "MessageType" is a message with a user-provided name that contains the table fields to export/import, e.g.

    SELECT * FROM db1.tab1 FORMAT ProtobufList SETTINGS format_schema = 'schemafile:MessageType'

As a result, the new format wraps a list of messages (one per row) into a single containing message. Compare that to the schema of the existing IO formats "Protobuf" and "ProtobufSingle":

    message MessageType {
      uint32 colA = 1;
      string colB = 2;
    }

The new format does not save space compared to the existing formats, but it is conceptually a bit more beautiful and also more convenient.

Implementation details:

- Created new files ProtobufList(Input|Output)Format which use the existing ProtobufSerializer mechanism. The goal was to reuse as much code as possible and avoid copy-paste.
- I was torn between inheriting from I(Input|Output)Format vs. IRow(Input|Output)Format for ProtobufList(Input|Output)Format. The former is chunk-based, which can be better for performance. Since the ProtobufSerializer mechanism is row-based but data is generally passed around in chunks, I chose the latter to leverage the existing chunk <--> row mapping code in IRow(Input|Output)Format.
- A new ProtobufSerializer called ProtobufSerializerEnvelope was introduced (--> ProtobufSerializer.cpp). It represents the top-level message which encloses the list of inner nested messages, i.e. the rows.
- With the new format, parsing the schema file and matching the fields in the schema file to table columns works as for the old formats. The only difference is that parsing starts one level below the "Envelope" (--> ProtobufSchema.cpp). This is more natural than forcing customers to have table columns start with "Envelope".
- Creation of the ProtobufSerializer tree also works as before. What is different is that we finally add a ProtobufSerializerEnvelope as the new root of the tree. Its only purpose is to write/read the top-level message when the first/last row is written/read.

Caveats:

- The low-level serialization code in ProtobufWriter uses an internal buffer which is flushed to the output file only in endMessage(). In the existing "Protobuf" format, this happens once per row; in the new format, it happens only at the end of the serialization, since row-level messages now call start/endNestedMessage(). As a future TODO, the buffer should also be flushed in start/endNestedMessage() to reduce memory consumption.
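To illustrate why wrapping the rows in "Envelope" does not save space, here is a minimal sketch of the two framings at the protobuf wire level. It is not ClickHouse's ProtobufWriter or ProtobufSerializerEnvelope code: `Row`, `appendVarint`, `encodeRow`, `encodeEnvelope`, `encodeLengthDelimited` and `selfCheck` are hypothetical names made up for this example. The sketch assumes that the existing "Protobuf" format prefixes each row with a varint length, and it always writes both fields, even when they hold default values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row type mirroring the example MessageType (colA, colB).
struct Row
{
    uint32_t colA;
    std::string colB;
};

// Append a base-128 varint, the integer encoding of the protobuf wire format.
void appendVarint(std::string & out, uint64_t value)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Encode one MessageType body:
// field 1 (colA) is a varint, tag 0x08; field 2 (colB) is length-delimited, tag 0x12.
std::string encodeRow(const Row & row)
{
    std::string out;
    out.push_back('\x08');
    appendVarint(out, row.colA);
    out.push_back('\x12');
    appendVarint(out, row.colB.size());
    out += row.colB;
    return out;
}

// ProtobufList-style framing: every row is one element of Envelope.mt,
// i.e. field 1 with wire type 2 (tag 0x0A), followed by its length and body.
std::string encodeEnvelope(const std::vector<Row> & rows)
{
    std::string out;
    for (const Row & row : rows)
    {
        const std::string body = encodeRow(row);
        out.push_back('\x0A');
        appendVarint(out, body.size());
        out += body;
    }
    return out;
}

// Assumed framing of the existing "Protobuf" format: a bare varint length before each row.
std::string encodeLengthDelimited(const std::vector<Row> & rows)
{
    std::string out;
    for (const Row & row : rows)
    {
        const std::string body = encodeRow(row);
        appendVarint(out, body.size());
        out += body;
    }
    return out;
}

// Self-check: under these assumptions the envelope costs exactly one extra tag byte per row.
void selfCheck()
{
    const std::vector<Row> rows = {{1, "ab"}, {300, "xyz"}};
    assert(encodeEnvelope(rows).size() == encodeLengthDelimited(rows).size() + rows.size());
}
```

For a single row with colA = 1 and colB = "ab", `encodeEnvelope` produces the bytes `0A 06 08 01 12 02 61 62`, while `encodeLengthDelimited` produces the same bytes without the leading `0A`. The practical benefit of the envelope is that the whole output is a single valid protobuf message, so a standard protobuf parser can read it without custom length-prefix handling.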
||
---|---|---|
examples | ||
Executors | ||
Formats | ||
Merges | ||
QueryPlan | ||
Sinks | ||
Sources | ||
tests | ||
Transforms | ||
TTL | ||
Chunk.cpp | ||
Chunk.h | ||
CMakeLists.txt | ||
ConcatProcessor.cpp | ||
ConcatProcessor.h | ||
DelayedPortsProcessor.cpp | ||
DelayedPortsProcessor.h | ||
ForkProcessor.cpp | ||
ForkProcessor.h | ||
IAccumulatingTransform.cpp | ||
IAccumulatingTransform.h | ||
IInflatingTransform.cpp | ||
IInflatingTransform.h | ||
IProcessor.cpp | ||
IProcessor.h | ||
ISimpleTransform.cpp | ||
ISimpleTransform.h | ||
ISink.cpp | ||
ISink.h | ||
ISource.cpp | ||
ISource.h | ||
LimitTransform.cpp | ||
LimitTransform.h | ||
OffsetTransform.cpp | ||
OffsetTransform.h | ||
Port.cpp | ||
Port.h | ||
QueueBuffer.h | ||
ResizeProcessor.cpp | ||
ResizeProcessor.h | ||
RowsBeforeLimitCounter.h |