In #21643 async_socket_for_remote=1 was fixed to avoid leaving the
connection in the unsynchronised state.
But one should not try to wait for the current packet in case of timeout
because this will exceed the timeout.
Anyway if the timeout is exceeded, then the connection will be shutdown
(disconnected), so it will not left in an unsynchronised state.
Before this patch for distributed queries, that requires cancellation
(simple select from multiple shards with limit, i.e. `select * from
remote('127.{2,3}', system.numbers) limit 100`) it is very easy to
trigger the situation when remote shard is in the middle of sending Data
block while the initiator already send Cancel and expecting some new
packet, but it will receive not new packet, but part of the Data block
that was in the middle of sending before cancellation, and this will
lead to some various errors, like:
- Unknown packet X from server Y
- Unexpected packet from server Y
- and a lot more...
Fix this, by correctly waiting for the pending packet before
cancellation.
It is not very easy to write a test, since localhost is too fast.
Also note, that it is not possible to get these errors with hedged
requests (use_hedged_requests=1) since handle fibers correctly.
But it had been disabled by default for 21.3 in #21534, while
async_socket_for_remote is enabled by default.
The Year 1925 is a starting point because most of the timezones
switched to saner (mostly 15-minutes based) offsets somewhere
during 1924 or before. And that significantly simplifies implementation.
2238 is to simplify arithmetics for sanitizing LUT index access;
there are less than 0x1ffff days from 1925.
* Extended DateLUTImpl internal LUT to 0x1ffff items, some of which
represent negative (pre-1970) time values.
As a collateral benefit, Date now correctly supports dates up to 2149
(instead of 2106).
* Added a new strong typedef ExtendedDayNum, which represents dates
pre-1970 and post 2149.
* Functions that used to return DayNum now return ExtendedDayNum.
* Refactored DateLUTImpl to untie DayNum from the dual role of being
a value and an index (due to negative time). Index is now a different
type LUTIndex with explicit conversion functions from DatNum, time_t,
and ExtendedDayNum.
* Updated DateLUTImpl to properly support values close to epoch start
(1970-01-01 00:00), including negative ones.
* Reduced resolution of DateLUTImpl::Values::time_at_offset_change
to multiple of 15-minutes to allow storing 64-bits of time_t in
DateLUTImpl::Value while keeping same size.
* Minor performance updates to DateLUTImpl when building month LUT
by skipping non-start-of-month days.
* Fixed extractTimeZoneFromFunctionArguments to work correctly
with DateTime64.
* New unit-tests and stateless integration tests for both DateTime
and DateTime64.
Since this hides real problems, since destructor does final flush and if
it fails, then data will be lost.
One of such examples if MEMORY_LIMIT_EXCEEDED exception, so lock
exceptions from destructors, by using
MemoryTracker::LockExceptionInThread to block these exception, and allow
others (so std::terminate will be called, since this is c++11 with
noexcept for destructors by default).
Here is an example, that leads to empty block in the distributed batch:
2021.01.21 12:43:18.619739 [ 46468 ] {7bd60d75-ebcb-45d2-874d-260df9a4ddac} <Error> virtual DB::CompressedWriteBuffer::~CompressedWriteBuffer(): Code: 241, e.displayText() = DB::Exception: Memory limit (for user) exceeded: would use 332.07 GiB (attempt to allocate chunk of 4355342 bytes), maximum: 256.00 GiB, Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception<>() @ 0x86f7b88 in /usr/bin/clickhouse
...
4. void DB::PODArrayBase<>::resize<>(unsigned long) @ 0xe9e878d in /usr/bin/clickhouse
5. DB::CompressedWriteBuffer::nextImpl() @ 0xe9f0296 in /usr/bin/clickhouse
6. DB::CompressedWriteBuffer::~CompressedWriteBuffer() @ 0xe9f0415 in /usr/bin/clickhouse
7. DB::DistributedBlockOutputStream::writeToShard() @ 0xf6bed4a in /usr/bin/clickhouse
Logging from PushingToViewsBlockOutputStream::write() is incorrect,
since it does not takes into account squashing (default to 1<<20 rows
and 256<<20), whle write will be done from
PushingToViewsBlockOutputStream::writeSuffix().
Fixes: #19378
* add the query data deduplication excluding duplicated parts in MergeTree family engines.
query deduplication is based on parts' UUID which should be enabled first with merge_tree setting
assign_part_uuids=1
allow_experimental_query_deduplication setting is to enable part deduplication, default ot false.
data part UUID is a mechanism of giving a data part a unique identifier.
Having UUID and deduplication mechanism provides a potential of moving parts
between shards preserving data consistency on a read path:
duplicated UUIDs will cause root executor to retry query against on of the replica explicitly
asking to exclude encountered duplicated fingerprints during a distributed query execution.
NOTE: this implementation don't provide any knobs to lock part and hence its UUID. Any mutations/merge will
update part's UUID.
* add _part_uuid virtual column, allowing to use UUIDs in predicates.
Signed-off-by: Aleksei Semiglazov <asemiglazov@cloudflare.com>
address comments
Since right now theses queries are not logged and there is no way to
determine the time that it takes.
Proper fix will be to account them in system.processes (and all other
places), but this is not that easy.