ClickHouse/src/Processors
Robert Schulze 514d4d2187
Implement ProtobufList - fixes ClickHouse#16436
Introduce IO format "ProtobufList" with protobuf schema

    // schemafile.proto
    message Envelope {
      message MessageType {
        uint32 colA = 1;
        string colB = 2;
      }
      repeated MessageType mt = 1;
    }

where "Envelope" is a hard-coded/expected top-level message and
"MessageType" is a message with a user-provided name containing the
table fields to export/import, e.g.

    SELECT * FROM db1.tab1 FORMAT ProtobufList SETTINGS format_schema =
    'schemafile:MessageType'

As a result, the new format wraps a list of messages (one per row) into
a single, containing message. Compare that to the schema of the existing
IO formats "Protobuf" and "ProtobufSingle":

    message MessageType {
      uint32 colA = 1;
      string colB = 2;
    }

The new format does not save space compared to the existing formats, but
it is conceptually a bit more beautiful and also more convenient.
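
To make the comparison concrete, here is a minimal Python sketch that
hand-encodes two rows with the standard protobuf wire encoding. It
assumes that the existing "Protobuf" format writes each row as a
varint-length-prefixed message and that the new format writes each row
as an occurrence of field 1 ("mt") of the Envelope. The helper names are
hypothetical, and the encoder always writes both fields, which ignores
default-value omission. It illustrates the wire layout only, not the
ClickHouse implementation.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_row(col_a: int, col_b: str) -> bytes:
    """Encode one MessageType: colA is field 1 (varint), colB is field 2 (bytes)."""
    text = col_b.encode("utf-8")
    return (
        b"\x08" + encode_varint(col_a)               # tag: field 1, wire type 0
        + b"\x12" + encode_varint(len(text)) + text  # tag: field 2, wire type 2
    )


def as_delimited_stream(rows) -> bytes:
    """Assumed layout of "Protobuf": <length><row> repeated, no enclosing message."""
    out = bytearray()
    for col_a, col_b in rows:
        payload = encode_row(col_a, col_b)
        out += encode_varint(len(payload)) + payload
    return bytes(out)


def as_envelope(rows) -> bytes:
    """Assumed layout of "ProtobufList": each row is field 1 of Envelope."""
    out = bytearray()
    for col_a, col_b in rows:
        payload = encode_row(col_a, col_b)
        out += b"\x0a" + encode_varint(len(payload)) + payload  # tag: field 1, wire type 2
    return bytes(out)


rows = [(1, "a"), (2, "bc")]
delimited = as_delimited_stream(rows)
envelope = as_envelope(rows)
```

Under these assumptions the enveloped form costs one extra tag byte per
row, which is consistent with the statement that the new format does not
save space. Its advantage is that the output is one self-describing
message that any protobuf library can parse in a single call.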

Implementation details:

- Created new files ProtobufList(Input|Output)Format which use the
  existing ProtobufSerializer mechanism. The goal was to reuse as much
  code as possible and avoid copypasta.

- I was torn between inheriting from I(Input|Output)Format vs.
  IRow(Input|Output)Format for ProtobufList(Input|Output)Format. The
  former is chunk-based which can be better for performance. Since the
  ProtobufSerializer mechanism is row-based but data is generally passed
  around in chunks, I chose the latter to leverage the existing
  chunk <--> row mapping code in IRow(Input|Output)Format.

- A new ProtobufSerializer called ProtobufSerializerEnvelope was
  introduced (--> ProtobufSerializer.cpp). It represents the top-level
  message which encloses the list of inner nested messages, i.e. the
  rows.

- With the new format, parsing the schema file and matching the fields in
  the schema file to table columns works as in the old formats. The only
  difference is that parsing starts one level below the "Envelope" (-->
  ProtobufSchema.cpp). This is more natural than forcing customers to
  have table columns start with "Envelope".

- Creation of the ProtobufSerializer tree also works like before. The
  difference is that, as a final step, we add a
  ProtobufSerializerEnvelope as the new root of the tree. Its only
  purpose is to write/read the top-level message when the first/last
  row is written/read.
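
The interplay of the points above can be sketched in a few lines of
Python. This is an illustrative model only: the class and function names
(RowSerializer, EnvelopeSerializer, write_chunk) are hypothetical and do
not mirror the C++ interfaces, and closing the envelope through an
explicit finalize() call is an assumption about how the last row is
detected. It shows a row-based serializer being driven from chunks, with
an envelope node at the root that opens the top-level message once and
closes it once.

```python
class RowSerializer:
    """Hypothetical stand-in for the existing per-row serializer tree."""

    def __init__(self, events):
        self.events = events

    def write_row(self, row_num):
        # Under the envelope, each row is written as a nested message.
        self.events.append("start nested")
        self.events.append(f"row {row_num}")
        self.events.append("end nested")


class EnvelopeSerializer:
    """Hypothetical root node: wraps all rows in one top-level message."""

    def __init__(self, inner, events):
        self.inner = inner
        self.events = events

    def write_row(self, row_num):
        if row_num == 0:
            self.events.append("start Envelope")  # only before the first row
        self.inner.write_row(row_num)

    def finalize(self):
        self.events.append("end Envelope")  # only after the last row


def write_chunk(serializer, chunk, first_row_num):
    """Chunk-to-row mapping: feed a chunk to a row-based serializer."""
    for offset, _row in enumerate(chunk):
        serializer.write_row(first_row_num + offset)
    return first_row_num + len(chunk)


events = []
root = EnvelopeSerializer(RowSerializer(events), events)

next_row = 0
for chunk in ([("a",), ("b",)], [("c",)]):  # two chunks, three rows in total
    next_row = write_chunk(root, chunk, next_row)
root.finalize()
```

The point of the sketch is that the row serializer needs no knowledge of
chunk boundaries, and that the envelope is emitted exactly once
regardless of how many chunks arrive.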

Caveats:

- The low-level serialization code in ProtobufWriter uses an internal
  buffer which is flushed to the output file only in endMessage().
  In the existing "Protobuf" format, this happens once per row; in the
  new format, it happens only at the end of the serialization because
  row-level messages now call start/endNestedMessage(). As a future
  TODO, the buffer should also be flushed in start/endNestedMessage()
  to reduce memory consumption.
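
The memory effect described in this caveat can be illustrated with a toy
model in Python. ToyWriter and its methods are hypothetical and only
mimic the behaviour stated above (flush in end_message(), no flush in
end_nested_message()). They are not the ProtobufWriter API, and real
byte counts will differ.

```python
class ToyWriter:
    """Toy writer whose internal buffer is flushed only in end_message()."""

    def __init__(self):
        self.buffer = bytearray()   # internal buffer
        self.flushed = bytearray()  # stands in for the output file
        self.peak_buffered = 0      # largest buffer size observed

    def write(self, data: bytes):
        self.buffer += data
        self.peak_buffered = max(self.peak_buffered, len(self.buffer))

    def start_message(self):
        pass

    def end_message(self):
        self.flushed += self.buffer  # the only place where data is flushed
        self.buffer.clear()

    def start_nested_message(self):
        pass

    def end_nested_message(self):
        pass  # no flush here, so nested rows accumulate in the buffer


def write_per_row(rows) -> ToyWriter:
    """One top-level message per row, as in the existing "Protobuf" format."""
    writer = ToyWriter()
    for row in rows:
        writer.start_message()
        writer.write(row)
        writer.end_message()
    return writer


def write_enveloped(rows) -> ToyWriter:
    """One top-level message around all rows, as in the new format."""
    writer = ToyWriter()
    writer.start_message()
    for row in rows:
        writer.start_nested_message()
        writer.write(row)
        writer.end_nested_message()
    writer.end_message()
    return writer


rows = [b"x" * 10 for _ in range(1000)]
per_row = write_per_row(rows)
enveloped = write_enveloped(rows)
```

In this model both writers emit the same number of bytes, but the
per-row writer never buffers more than one row while the enveloped
writer buffers all rows before flushing, which is the growth the TODO
aims to remove.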
2022-03-14 08:04:58 +01:00
examples fix code 2021-12-14 17:37:31 +08:00
Executors Fix memory accounting for queries that uses < max_untracker_memory 2022-01-28 16:25:33 +03:00
Formats Implement ProtobufList - fixes ClickHouse#16436 2022-03-14 08:04:58 +01:00
Merges Fix clang-tidy issue 2022-02-06 22:36:35 -08:00
QueryPlan avoid unnecessary copying of Settings 2022-02-10 12:13:51 +03:00
Sinks Fix MV query with multiple chunk result. 2021-12-16 21:17:05 +03:00
Sources to MaterializeMySQL_support_set_and_other_dataType 2022-01-21 12:24:12 +08:00
tests Force PipeLineExecutor creators to pass a QueryStatus 2021-12-09 10:02:32 +01:00
Transforms Merge pull request #34601 from CurtizJ/filtering-by-sparse-columns 2022-02-15 23:26:13 +03:00
TTL Move code around to avoid dlsym on Musl 2021-12-24 12:25:27 +03:00
Chunk.cpp Merge remote-tracking branch 'upstream/master' into HEAD 2021-12-01 15:49:02 +03:00
Chunk.h add comments 2021-12-08 18:56:30 +03:00
CMakeLists.txt move to examples everywhere 2021-04-27 01:51:42 +03:00
ConcatProcessor.cpp Use Resize instead of Concat in InterpreterInsertQuery. 2020-06-05 12:30:16 +03:00
ConcatProcessor.h Fix half of typos 2020-08-08 03:47:03 +03:00
DelayedPortsProcessor.cpp Update sort to pdqsort 2022-01-30 19:49:48 +00:00
DelayedPortsProcessor.h Colse input ports in DelayedPortsProcessor as soon as all outputs are finished. 2021-02-03 14:57:22 +03:00
ForkProcessor.cpp clang-tidy move fix build 2022-02-08 21:21:32 +00:00
ForkProcessor.h
IAccumulatingTransform.cpp Fix some tests. 2021-09-06 23:13:06 +03:00
IAccumulatingTransform.h Fix typos reported by codespell 2020-10-27 12:04:03 +01:00
IInflatingTransform.cpp
IInflatingTransform.h
IProcessor.cpp Fixing build and tests. 2020-12-09 17:11:20 +03:00
IProcessor.h mirror changes in code and comment 2021-01-22 09:13:22 +00:00
ISimpleTransform.cpp
ISimpleTransform.h support adding defaults in async inserts 2021-09-08 17:08:57 +03:00
ISink.cpp Fix more tests. 2021-07-26 17:47:29 +03:00
ISink.h Fix more tests. 2021-07-26 17:47:29 +03:00
ISource.cpp Add async status to RemoteSource. 2020-12-04 13:52:57 +03:00
ISource.h better 2021-03-11 18:22:24 +03:00
LimitTransform.cpp Minor fixes for min/sim hash 2020-12-29 13:16:43 +03:00
LimitTransform.h Avoid overflow in LIMIT #10470 #11372 2020-07-12 08:18:01 +03:00
OffsetTransform.cpp Fix comment 2020-07-13 01:32:24 +03:00
OffsetTransform.h Fix build 2020-07-13 02:53:13 +03:00
Port.cpp Try fix #23029 2021-04-20 14:55:23 +03:00
Port.h Rewrite distributed DDL to Processors 2021-07-18 00:45:07 +03:00
QueueBuffer.h Rewrite distributed DDL to Processors 2021-07-18 00:45:07 +03:00
ResizeProcessor.cpp Fix possible Pipeline stuck in case of StrictResize processor. 2021-12-06 14:53:39 +03:00
ResizeProcessor.h Fix half of typos 2020-08-08 03:47:03 +03:00
RowsBeforeLimitCounter.h Added RemoteSource. 2020-06-03 22:50:11 +03:00