Introduce IO format "ProtobufList" with protobuf schema
// schemafile.proto
message Envelope {
    message MessageType {
        uint32 colA = 1;
        string colB = 2;
    }
    repeated MessageType mt = 1;
}
where "Envelope" is a hard-coded/expected top-level message and
"MessageType" is a message with user-provided name containing the table
fields to export/import, e.g.
SELECT * FROM db1.tab1 FORMAT ProtobufList SETTINGS format_schema =
'schemafile:MessageType'
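The import direction works symmetrically; as a minimal sketch (assuming
the same schema file and a table with compatible columns):
INSERT INTO db1.tab1 SETTINGS format_schema = 'schemafile:MessageType'
FORMAT ProtobufList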
As a result, the new format wraps a list of messages (one per row) in
a single containing message. Compare that to the schema of the existing
IO formats "Protobuf" and "ProtobufSingle":
message MessageType {
    uint32 colA = 1;
    string colB = 2;
}
The new format does not save space compared to the existing formats, but
it is conceptually a bit more beautiful and also more convenient.
Implementation details:
- Created new files ProtobufList(Input|Output)Format which use the
existing ProtobufSerializer mechanism. The goal was to reuse as much
code as possible and avoid copy-paste.
- I was torn between inheriting from I(Input|Output)Format vs.
IRow(Input|Output)Format for ProtobufList(Input|Output)Format. The
former is chunk-based, which can be better for performance. Since the
ProtobufSerializer mechanism is row-based but data is generally passed
around in chunks, I went with the latter to leverage the existing
chunk <--> row mapping code in IRow(Input|Output)Format.
- A new ProtobufSerializer called ProtobufSerializerEnvelope was
introduced (--> ProtobufSerializer.cpp). It represents the top-level
message which encloses the list of inner nested messages, i.e. the
rows.
- With the new format, parsing the schema file and matching the fields
in the schema file to table columns works the same way as for the old
formats. The only difference is that parsing starts one level below the
"Envelope" (--> ProtobufSchema.cpp). This is more natural than forcing
customers to have table columns start with "Envelope".
- Creation of the ProtobufSerializer tree also works like before. What
is different is that we now add a ProtobufSerializerEnvelope as the new
root of the tree. Its only purpose is to write/read the top-level
message when the first/last row is written/read.
Caveats:
- The low-level serialization code in ProtobufWriter uses an internal
buffer which is flushed to the output file only in endMessage().
In the existing "Protobuf" format this happens once per row; in the
new format it happens only at the end of the serialization, since
row-level messages now call start/endNestedMessage(). As a future
TODO, the buffer should also be flushed in start/endNestedMessage()
to reduce memory consumption.
Reproducer:
# NOTE: we need clickhouse from 33957 since right now LSan is broken due to getauxval().
$ url=https://s3.amazonaws.com/clickhouse-builds/33957/e04b862673644d313712607a0078f5d1c48b5377/package_asan/clickhouse
$ wget $url -O clickhouse-asan
$ chmod +x clickhouse-asan
$ ./clickhouse-asan server &
$ ./clickhouse-asan client
:) create table data (key Int, value String) engine=MergeTree() order by key
:) insert into data select number%5, toString(number) from numbers(10e6)
# usually one query is enough; benchmark is used just for the stability of the results
# note: if the exception did not happen in AggregatingInOrderTransform, then add --continue_on_errors and wait
$ ./clickhouse-asan benchmark --query 'select key, uniqCombined64(value), groupArray(value) from data group by key' --optimize_aggregation_in_order=1 --memory_tracker_fault_probability=0.01 --max_untracked_memory='2Mi'
LSan report:
==24595==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 3932160 byte(s) in 6 object(s) allocated from:
0 0xcadba93 in realloc ()
1 0xcc108d9 in Allocator<false, false>::realloc() obj-x86_64-linux-gnu/../src/Common/Allocator.h:134:30
2 0xde19eae in void DB::PODArrayBase<>::realloc<DB::Arena*&>(unsigned long, DB::Arena*&) obj-x86_64-linux-gnu/../src/Common/PODArray.h:161:25
3 0xde5f039 in void DB::PODArrayBase<>::reserveForNextSize<DB::Arena*&>(DB::Arena*&) obj-x86_64-linux-gnu/../src/Common/PODArray.h
4 0xde5f039 in void DB::PODArray<>::push_back<>(DB::GroupArrayNodeString*&, DB::Arena*&) obj-x86_64-linux-gnu/../src/Common/PODArray.h:432:19
5 0xde5f039 in DB::GroupArrayGeneralImpl<>::add() const obj-x86_64-linux-gnu/../src/AggregateFunctions/AggregateFunctionGroupArray.h:465:31
6 0xde5f039 in DB::IAggregateFunctionHelper<>::addBatchSinglePlaceFromInterval() const obj-x86_64-linux-gnu/../src/AggregateFunctions/IAggregateFunction.h:481:53
7 0x299df134 in DB::Aggregator::executeOnIntervalWithoutKeyImpl() obj-x86_64-linux-gnu/../src/Interpreters/Aggregator.cpp:869:31
8 0x2ca75f7d in DB::AggregatingInOrderTransform::consume() obj-x86_64-linux-gnu/../src/Processors/Transforms/AggregatingInOrderTransform.cpp:124:13
...
SUMMARY: AddressSanitizer: 4523184 byte(s) leaked in 12 allocation(s).
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This patch adds the use_skip_indexes_if_final setting, which is OFF by
default, since skipping data for queries with FINAL may produce
incorrect results.
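A hypothetical usage sketch (table, index, and column names are made
up); setting it to 1 opts a FINAL query back into using skip indexes:
CREATE TABLE tab (key Int, value String,
    INDEX value_idx value TYPE bloom_filter GRANULARITY 1)
    ENGINE = ReplacingMergeTree() ORDER BY key;
SELECT * FROM tab FINAL WHERE value = '42'
    SETTINGS use_skip_indexes_if_final = 1; -- may prune granules needed for a correct FINAL result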
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
After detachQueryIfNotDetached() had been removed, it is not enough to
use attachTo() for ThreadPool (scheduleOrThrowOnError()), since the
query may already be attached if the thread is doing multiple jobs, so
CurrentThread::attachToIfDetached() should be used instead.
This should fix all the places found in the failures on CI [1]:
$ fgrep DB::CurrentThread::attachTo -A1 ~/Downloads/47.txt | fgrep -v attachTo | cut -d' ' -f5,6 | sort | uniq -c
92 --
2 /fasttest-workspace/build/../../ClickHouse/contrib/libcxx/include/deque:1393: DB::ParallelParsingInputFormat::parserThreadFunction(std::__1::shared_ptr<DB::ThreadGroupStatus>,
4 /fasttest-workspace/build/../../ClickHouse/src/Storages/MergeTree/MergeTreeData.cpp:1595: void
87 /fasttest-workspace/build/../../ClickHouse/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:993: void
[1]: https://github.com/ClickHouse/ClickHouse/runs/4954466034?check_suite_focus=true
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
MemoryTracker starts accounting memory directly only after the
per-thread allocation exceeds max_untracked_memory (or
memory_profiler_step). But even memory under this limit should be
accounted too, and there is code to do this in the ThreadStatus dtor;
however, because PullingAsyncPipelineExecutor detached the query from
the thread group, that memory was not accounted.
So remove CurrentThread::detachQueryIfNotDetached() from threads that
use ThreadFromGlobalPool, since it has ThreadStatus, and the query will
be detached using CurrentThread::defaultThreadDeleter.
Note that before this patch memory accounting worked for HTTP queries
(because memory had been accounted from ParallelFormattingOutputFormat),
but not for TCP.
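One way to observe the effect (a sketch, assuming query_log is enabled)
is to compare the memory_usage reported for queries executed over the
native TCP protocol before and after this patch:
SELECT query, memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 5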
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>