2380 Commits

Author SHA1 Message Date
Robert Schulze
514d4d2187
Implement ProtobufList - fixes ClickHouse#16436
Introduce IO format "ProtobufList" with protobuf schema

    // schemafile.proto
    message Envelope {
      message MessageType {
        uint32 colA = 1;
        string colB = 2;
      }
      repeated MessageType mt = 1;
    }

where "Envelope" is a hard-coded/expected top-level message and
"MessageType" is a message with user-provided name containing the table
fields to export/import, e.g.

    SELECT * FROM db1.tab1 FORMAT ProtobufList SETTINGS format_schema =
    'schemafile:MessageType'

As a result, the new format wraps a list of messages (one per row) into
a single, containing message. Compare that to the schema of the existing
IO formats "Protobuf" and "ProtobufSingle":

    message MessageType {
      uint32 colA = 1;
      string colB = 2;
    }

The new format does not save space compared to the existing formats, but
it is conceptually a bit more beautiful and also more convenient.
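
To make the schema difference concrete, here is a minimal Python sketch that hand-encodes rows with the standard protobuf wire format, once as a stream of length-prefixed MessageType messages and once wrapped in Envelope. The helper names are hypothetical and the field numbers come from the schemas above. The exact framing ClickHouse writes is an assumption here. For simplicity, the sketch always writes both fields, even when they hold default values.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_row(col_a: int, col_b: str) -> bytes:
    """Encode one MessageType: colA is field 1 (varint), colB is field 2 (bytes)."""
    payload = col_b.encode("utf-8")
    # Tag byte = (field_number << 3) | wire_type:
    # 0x08 = field 1, varint; 0x12 = field 2, length-delimited.
    return b"\x08" + encode_varint(col_a) + b"\x12" + encode_varint(len(payload)) + payload


def encode_length_prefixed(rows) -> bytes:
    """Row-per-message style: each row is preceded by its length (assumed framing)."""
    encoded = [encode_row(a, b) for a, b in rows]
    return b"".join(encode_varint(len(r)) + r for r in encoded)


def encode_envelope(rows) -> bytes:
    """Envelope style: each row is an element of the repeated field `mt` (field 1)."""
    encoded = [encode_row(a, b) for a, b in rows]
    # 0x0A = field 1, length-delimited (a nested message).
    return b"".join(b"\x0a" + encode_varint(len(r)) + r for r in encoded)


rows = [(1, "a"), (2, "bc")]
stream = encode_length_prefixed(rows)
envelope = encode_envelope(rows)
print(len(stream), len(envelope))
```

Under these assumptions the envelope encoding costs one extra tag byte per row, which matches the remark that the new format does not save space.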

Implementation details:

- Created new files ProtobufList(Input|Output)Format which use the
  existing ProtobufSerializer mechanism. The goal was to reuse as much
  code as possible and avoid copypasta.

- I was torn between inheriting from I(Input|Output)Format vs.
  IRow(Input|Output)Format for ProtobufList(Input|Output)Format. The
  former is chunk-based which can be better for performance. Since the
  ProtobufSerializer mechanism is row-based but data is generally passed
  around in chunks, I chose the latter to leverage the existing
  chunk <--> row mapping code in IRow(Input|Output)Format.

- A new ProtobufSerializer called ProtobufSerializerEnvelope was
  introduced (--> ProtobufSerializer.cpp). It represents the top-level
  message which encloses the list of inner nested messages, i.e. the
  rows.

- With the new format, parsing the schema file and matching the fields in
  the schema file to table columns works as it does for the old formats. The
  only difference is that parsing starts one level below the "Envelope" (-->
  ProtobufSchema.cpp). This is more natural than forcing customers to
  have table columns start with "Envelope".

- Creation of the ProtobufSerializer tree also works like before. What
  is different is that, as a final step, we add a ProtobufSerializerEnvelope
  as the new root of the tree. Its only purpose is to write/read the
  top-level message for the first/last row to write/read.

Caveats:

- The low-level serialization code in ProtobufWriter uses an internal
  buffer which is flushed to the output file only in endMessage().
  In the existing "Protobuf" format, this happens once per row; in the
  new format, this happens only at the end of the serialization
  since row-level messages now call start/endNestedMessage(). As a
  future TODO, the buffer should also be flushed in
  start/endNestedMessage() to reduce memory consumption.
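
The buffering caveat can be illustrated with a toy model. The sketch below is not the ProtobufWriter code: the class, its methods, and the flush_on_nested switch are hypothetical, and length prefixes are ignored entirely. It only shows how peak buffered bytes depend on where the flush happens.

```python
import io


class BufferedMessageWriter:
    """Toy writer: bytes pile up in memory and reach `out` only when flushed."""

    def __init__(self, out, flush_on_nested=False):
        self.out = out
        self.flush_on_nested = flush_on_nested  # hypothetical switch for the TODO
        self.buffer = bytearray()
        self.peak_buffered = 0

    def write(self, data: bytes) -> None:
        self.buffer += data
        self.peak_buffered = max(self.peak_buffered, len(self.buffer))

    def _flush(self) -> None:
        self.out.write(bytes(self.buffer))
        self.buffer.clear()

    def end_message(self) -> None:
        # Top-level message finished: always flush.
        self._flush()

    def end_nested_message(self) -> None:
        # Nested message finished: flush only if the switch is enabled.
        if self.flush_on_nested:
            self._flush()


def write_one_message_per_row(rows, writer):
    """Each row is its own top-level message, so the buffer is flushed per row."""
    for row in rows:
        writer.write(row)
        writer.end_message()


def write_enveloped(rows, writer):
    """Rows are nested messages inside a single top-level message."""
    for row in rows:
        writer.write(row)
        writer.end_nested_message()
    writer.end_message()


rows = [b"x" * 10] * 100
per_row = BufferedMessageWriter(io.BytesIO())
write_one_message_per_row(rows, per_row)
enveloped = BufferedMessageWriter(io.BytesIO())
write_enveloped(rows, enveloped)
print(per_row.peak_buffered, enveloped.peak_buffered)
```

In this model the enveloped writer holds the whole result in memory unless it also flushes at nested-message boundaries, which is the improvement the TODO describes.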
2022-03-14 08:04:58 +01:00
Robert Schulze
f0ba39b071
Clean up some header includes and make formatting more consistent 2022-03-13 20:24:12 +01:00
Kruglov Pavel
5e8b2228e0
Merge pull request #34561 from bigo-sg/arrow_type_timestamp
Implement transformation between CH DateTime64 and arrow timestamp column
2022-02-17 16:55:17 +03:00
Anton Popov
72e75fdaf5
Merge pull request #34601 from CurtizJ/filtering-by-sparse-columns
Support filtering by sparse columns without conversion to full
2022-02-15 23:26:13 +03:00
Anton Popov
7cddae1351 return back result_size_hint 2022-02-15 15:12:25 +03:00
Anton Popov
5c316ffabe support filtering by sparse columns without conversion to full 2022-02-15 14:30:54 +03:00
Kruglov Pavel
cf454a6539
Merge pull request #34532 from CurtizJ/fix-aggregation-in-order-3
Fix aggregation in order with distributed_aggregation_memory_efficient=0
2022-02-15 14:26:15 +03:00
taiyang-li
e53719a86b remove comments 2022-02-13 17:13:23 +08:00
taiyang-li
aabf2aac69 finish all tests 2022-02-13 17:06:58 +08:00
taiyang-li
6559941972 support datetime64 when transform ch chunk to arrow table 2022-02-13 14:56:01 +08:00
alexey-milovidov
4a2c69c073
Merge pull request #34067 from Algunenano/mv_fixes
Fix `parallel_view_processing=0` and `view_duration_ms` in views log
2022-02-12 22:36:41 +03:00
Anton Popov
357bdd69c4 fix aggregation in order with distributed_aggregation_memory_efficient=0 2022-02-11 18:09:13 +03:00
Vladimir C
a2b1900333
Merge pull request #34496 from Avogar/jsonl
Support .jsonl extension for JSONEachRow format
2022-02-11 15:44:31 +01:00
W
7cb0433fae
Update buildPushingToViewsChain.h
typo
2022-02-11 14:34:20 +08:00
avogar
9e58ae7577 Support jsonl extension for JSONEachRow format 2022-02-10 16:00:37 +03:00
Kruglov Pavel
a4f5610764
Merge pull request #34476 from CurtizJ/avoid-settings-copy
Avoid unnecessary copying of `Settings`
2022-02-10 14:13:46 +03:00
Anton Popov
298838f891 avoid unnecessary copying of Settings 2022-02-10 12:13:51 +03:00
mergify[bot]
d78525bd10
Merge branch 'master' into fix-removing-order-in-CreatingSetsTransform 2022-02-09 13:55:52 +00:00
Azat Khuzhin
4fa2ae76bc Fix memory leak in AggregatingInOrderTransform
Reproducer:

    # NOTE: we need clickhouse from 33957 since right now LSan is broken due to getauxval().
    $ url=https://s3.amazonaws.com/clickhouse-builds/33957/e04b862673644d313712607a0078f5d1c48b5377/package_asan/clickhouse
    $ wget $url -O clickhouse-asan
    $ chmod +x clickhouse-asan
    $ ./clickhouse-asan server &

    $ ./clickhouse-asan client
    :) create table data (key Int, value String) engine=MergeTree() order by key
    :) insert into data select number%5, toString(number) from numbers(10e6)

    # usually one query is enough; the benchmark is just for stability of the results
    # note that if the exception does not come from AggregatingInOrderTransform, add --continue_on_errors and wait
    $ ./clickhouse-asan benchmark --query 'select key, uniqCombined64(value), groupArray(value) from data group by key' --optimize_aggregation_in_order=1 --memory_tracker_fault_probability=0.01 --max_untracked_memory='2Mi'

LSan report:

    ==24595==ERROR: LeakSanitizer: detected memory leaks

    Direct leak of 3932160 byte(s) in 6 object(s) allocated from:
        0 0xcadba93 in realloc ()
        1 0xcc108d9 in Allocator<false, false>::realloc() obj-x86_64-linux-gnu/../src/Common/Allocator.h:134:30
        2 0xde19eae in void DB::PODArrayBase<>::realloc<DB::Arena*&>(unsigned long, DB::Arena*&) obj-x86_64-linux-gnu/../src/Common/PODArray.h:161:25
        3 0xde5f039 in void DB::PODArrayBase<>::reserveForNextSize<DB::Arena*&>(DB::Arena*&) obj-x86_64-linux-gnu/../src/Common/PODArray.h
        4 0xde5f039 in void DB::PODArray<>::push_back<>(DB::GroupArrayNodeString*&, DB::Arena*&) obj-x86_64-linux-gnu/../src/Common/PODArray.h:432:19
        5 0xde5f039 in DB::GroupArrayGeneralImpl<>::add() const obj-x86_64-linux-gnu/../src/AggregateFunctions/AggregateFunctionGroupArray.h:465:31
        6 0xde5f039 in DB::IAggregateFunctionHelper<>::addBatchSinglePlaceFromInterval() const obj-x86_64-linux-gnu/../src/AggregateFunctions/IAggregateFunction.h:481:53
        7 0x299df134 in DB::Aggregator::executeOnIntervalWithoutKeyImpl() obj-x86_64-linux-gnu/../src/Interpreters/Aggregator.cpp:869:31
        8 0x2ca75f7d in DB::AggregatingInOrderTransform::consume() obj-x86_64-linux-gnu/../src/Processors/Transforms/AggregatingInOrderTransform.cpp:124:13

    ...

    SUMMARY: AddressSanitizer: 4523184 byte(s) leaked in 12 allocation(s).

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-02-09 09:23:56 +03:00
Maksim Kita
b8e350054c clang-tidy move fix build 2022-02-08 21:21:32 +00:00
Nikolai Kochetov
82a7d70a31
Merge branch 'master' into fix-removing-order-in-CreatingSetsTransform 2022-02-08 19:29:03 +03:00
Nikolai Kochetov
d2d47b9595 Fixing build. 2022-02-08 16:27:33 +00:00
Maksim Kita
4bb69bcb15
Merge pull request #34398 from DevTeamBK/input_format
Method called on already moved
2022-02-08 15:20:07 +01:00
Nikolai Kochetov
7e54dafdc1 Fix wrong destruction order in CreatingSetsTransform. 2022-02-08 10:41:07 +00:00
Kruglov Pavel
b4fec2af7c
Merge pull request #34065 from Avogar/msgpack
Support UUID in MsgPack format
2022-02-08 11:42:17 +03:00
Rajkumar
6b3adbb0de Method called on already moved 2022-02-07 19:50:34 -08:00
avogar
a4c7ecde87 Make better 2022-02-07 17:51:26 +03:00
avogar
c3d30fd502 Fix comments 2022-02-07 17:11:44 +03:00
Kruglov Pavel
34a17075d3 Fix error messages 2022-02-07 17:11:44 +03:00
avogar
77b42bb9ff Support UUID in MsgPack format 2022-02-07 17:11:44 +03:00
HeenaBansal2009
eeec2478ba Fix clang-tidy issue 2022-02-06 22:36:35 -08:00
Alexey Milovidov
f98010e374 Small improvements 2022-02-06 07:14:01 +03:00
Alexey Milovidov
4a83dbc514 Fix linkage 2022-02-04 00:26:44 +03:00
Alexey Milovidov
c426f11096 Maybe better 2022-02-04 00:20:16 +03:00
Alexey Milovidov
7c12f5f37a Fix terribly low performance of LineAsString format 2022-02-04 00:07:31 +03:00
Anton Popov
9b844c6b42
Merge pull request #32748 from CurtizJ/read-in-order-fixed-prefix
Support `optimize_read_in_order` if prefix of sorting key is already sorted
2022-02-03 18:17:08 +03:00
mergify[bot]
150d7ba8b5
Merge branch 'master' into mv_fixes 2022-02-03 00:41:52 +00:00
Azat Khuzhin
1d19851590 Disable data skipping indexes by default for queries with FINAL
This patch adds the use_skip_indexes_if_final setting, which is OFF by
default, since skipping data for queries with FINAL may produce
incorrect results.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-02-02 13:31:50 +03:00
mergify[bot]
4f0e011958
Merge branch 'master' into mv_fixes 2022-01-31 23:53:35 +00:00
Alexey Milovidov
e4e7169277 Remove some strange code 2022-02-01 02:52:36 +03:00
Alexey Milovidov
83136f3515 Allow \r in the middle of the line in format Regexp 2022-02-01 02:49:26 +03:00
Alexey Milovidov
872d0a0fbe Improve performance of format Regexp 2022-02-01 02:07:48 +03:00
alesapin
dd61d1c2de
Merge pull request #34172 from ClickHouse/fix_race_in_some_engines
Fix benign race condition for storage HDFS, S3, URL
2022-01-31 22:41:54 +03:00
alesapin
93c0700c4c Fix typo 2022-01-31 16:46:58 +03:00
alesapin
056b9e335f Fix comment 2022-01-31 16:39:42 +03:00
alesapin
31753afb7e Fix cancel logic in parallel parsing 2022-01-31 16:38:15 +03:00
Maksim Kita
5ef83deaa6 Update sort to pdqsort 2022-01-30 19:49:48 +00:00
Anton Popov
b950a12cb3
Merge pull request #34068 from CurtizJ/fix-async-insert-native
Fix asynchronous inserts with `Native` format
2022-01-29 01:24:53 +03:00
Azat Khuzhin
1519985c98 Fix possible "Can't attach query to the thread, it is already attached"
After detachQueryIfNotDetached() was removed, it is not enough to
use attachTo() for ThreadPool (scheduleOrThrowOnError()), since the query
may already be attached if the thread is doing multiple jobs, so
CurrentThread::attachToIfDetached() should be used instead.
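
The difference between the two calls can be sketched with a toy model. The class below is hypothetical and only mimics the behaviour described above: a strict attach that fails on an already attached thread, and an idempotent variant that is safe when one pool thread runs several jobs of the same query.

```python
class ToyCurrentThread:
    """Toy stand-in for per-thread state; not the ClickHouse CurrentThread API."""

    def __init__(self):
        self.thread_group = None

    def attach_to(self, thread_group):
        # Strict variant: attaching twice is treated as an error.
        if self.thread_group is not None:
            raise RuntimeError("Can't attach query to the thread, it is already attached")
        self.thread_group = thread_group

    def attach_to_if_detached(self, thread_group):
        # Idempotent variant: do nothing if the thread is already attached.
        if self.thread_group is None:
            self.thread_group = thread_group


def run_jobs(thread, thread_group, jobs, attach):
    """Run several jobs on the same pool thread, attaching before each one."""
    results = []
    for job in jobs:
        attach(thread, thread_group)
        results.append(job())
    return results


thread = ToyCurrentThread()
jobs = [lambda: 1, lambda: 2]
# The idempotent call survives a second job on the same thread.
print(run_jobs(thread, "query-group", jobs, ToyCurrentThread.attach_to_if_detached))
```

With the strict attach_to, the second job on the same thread would raise, which mirrors the CI failure quoted in the subject line.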

This should fix all the places from the failures on CI [1]:

    $ fgrep DB::CurrentThread::attachTo -A1 ~/Downloads/47.txt  | fgrep -v attachTo | cut -d' ' -f5,6 | sort | uniq -c
         92 --
          2 /fasttest-workspace/build/../../ClickHouse/contrib/libcxx/include/deque:1393: DB::ParallelParsingInputFormat::parserThreadFunction(std::__1::shared_ptr<DB::ThreadGroupStatus>,
          4 /fasttest-workspace/build/../../ClickHouse/src/Storages/MergeTree/MergeTreeData.cpp:1595: void
         87 /fasttest-workspace/build/../../ClickHouse/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:993: void

  [1]: https://github.com/ClickHouse/ClickHouse/runs/4954466034?check_suite_focus=true

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-01-28 16:25:33 +03:00
Azat Khuzhin
b0c862c297 Fix memory accounting for queries that use < max_untracked_memory
MemoryTracker starts accounting memory directly only after per-thread
allocation exceeds max_untracked_memory (or memory_profiler_step).

But memory under this limit should be accounted too, and there is
code to do this in the ThreadStatus dtor; however, because
PullingAsyncPipelineExecutor detached the query from the thread group, that
memory was not accounted.

So remove CurrentThread::detachQueryIfNotDetached() from threads that
use ThreadFromGlobalPool, since it has ThreadStatus, and the query will
be detached using CurrentThread::defaultThreadDeleter.

Note that before this patch memory accounting worked for HTTP queries,
because it was accounted from ParallelFormattingOutputFormat, but
not for TCP.
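
The accounting gap can be shown with a toy model. Everything below is a hypothetical simplification rather than the MemoryTracker or ThreadStatus code: small allocations are counted per thread, reported to a shared tracker once they exceed a threshold, and the remainder is reported when the thread finishes, but only if the thread is still attached.

```python
class ToyMemoryTracker:
    """Shared per-query counter of accounted bytes."""

    def __init__(self):
        self.accounted = 0


class ToyThreadStatus:
    """Per-thread state with a local counter of not-yet-reported bytes."""

    def __init__(self, tracker, max_untracked):
        self.tracker = tracker  # becomes None once the query is detached
        self.max_untracked = max_untracked
        self.untracked = 0

    def allocate(self, size):
        self.untracked += size
        # Report to the shared tracker only after crossing the threshold.
        if self.untracked > self.max_untracked and self.tracker is not None:
            self.tracker.accounted += self.untracked
            self.untracked = 0

    def detach_query(self):
        # Early detach: the local remainder has nowhere to go afterwards.
        self.tracker = None

    def finish(self):
        # Stands in for the destructor: report the remainder if still attached.
        if self.tracker is not None:
            self.tracker.accounted += self.untracked
        self.untracked = 0


def run_small_query(detach_early):
    tracker = ToyMemoryTracker()
    thread = ToyThreadStatus(tracker, max_untracked=100)
    thread.allocate(60)  # stays below the threshold
    if detach_early:
        thread.detach_query()
    thread.finish()
    return tracker.accounted


print(run_small_query(detach_early=True), run_small_query(detach_early=False))
```

In this model a query that allocates less than the threshold is accounted only when the thread stays attached until it finishes, which is the behaviour the patch restores.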

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-01-28 16:25:33 +03:00