When I tried to add cool new clang-tidy 14 warnings, I noticed that the
current clang-tidy settings already produce a ton of warnings. This
commit addresses many of these. Almost all of them were non-critical,
i.e. C vs. C++ style casts.
* Add a warning if parallel_distributed_insert_select was ignored
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Respect max_distributed_depth for parallel_distributed_insert_select
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Print warning for non applied parallel_distributed_insert_select only for initial query
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Remove Cluster::getHashOfAddresses()
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Forbid parallel_distributed_insert_select for remote()/cluster() with different addresses
Before it uses empty cluster name (getClusterName()) which is not
correct, compare all addresses instead.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Fix max_distributed_depth check
max_distributed_depth=1 must mean not more then one distributed query,
not two, since max_distributed_depth=0 means no limit, and
distribute_depth is 0 for the first query.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Fix INSERT INTO remote()/cluster() with parallel_distributed_insert_select
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Add a test for parallel_distributed_insert_select with cluster()/remote()
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Return <remote> instead of empty cluster name in Distributed engine
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
* Make user with sharding_key and w/o in remote()/cluster() identical
Before with sharding_key the user was "default", while w/o it it was
empty.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Defines.h is a very common header, so lots of modules will be recompiled
on changes.
Move macros for protocol into separate header, this should significantly
decreases number of units to compile on it's changes.
UBsan reports:
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/Storages/Distributed/DirectoryMonitor.cpp:435:53 in
../src/Storages/Distributed/DirectoryMonitor.cpp:435: runtime error: 1.15292e+19 is outside the range of representable values of type 'long'
0 0x1df0c286 in DB::StorageDistributedDirectoryMonitor::run() obj-x86_64-linux-gnu/../src/Storages/Distributed/DirectoryMonitor.cpp:435:53
It is pretty easy to reproduce by limiting max_server_memory_usage
before staring the test.
- create directory monitors in parallel (this also includes rmdir in
case of directory is empty, since even if the directory is empty it
may take some time to remove it, due to waiting for journal or if the
directory is large, i.e. it had lots of files before, since remember
ext4 does not truncate the directory size on each unlink [1])
- initialize increment in parallel too (since it does readdir())
[1]: https://lore.kernel.org/linux-ext4/930A5754-5CE6-4567-8CF0-62447C97825C@dilger.ca/
Broken batches may be because of abnormal server shutdown (and lack of
fsync), and ignoring the whole batch is not great in this case, so apply
the same split logic here too.
v2: rename exception
v3: catch missing exception
v4: fix marking the file as broken multiple times (fixes
test_insert_distributed_async_send with setting enabled)
Add distributed_directory_monitor_split_batch_on_failure setting (OFF by
default), that will split the batch and send files one by one in case of
retriable errors.
v2: more error codes
Under use_compact_format_in_distributed_parts_names=1 and
internal_replication=true the server encodes all replicas for the
directory name for async INSERT into Distributed, and the directory name
looks like:
shard1_replica1,shard1_replica2,shard3_replica3
This is required for creating connections (to specific replicas only),
but in case of internal_replication=true, this can be avoided, since
this path will always includes all replicas.
This patch replaces all replicas with "_all_replicas" marker.
Note, that initial problem was that this path may overflow the NAME_MAX
if you will have more then 15 replicas, and the server will fail to
create the directory.
Also note, that changed directory name should not be a problem, since:
- empty directories will be removed since #16729
- and replicas encoded in the directory name is also supported anyway.
Number of files for asynchronous insertion into Distributed tables that
has been marked as broken. This metric will starts from 0 on start.
Number of files for every shard is summed.
Before this patch (and after #22208) the INSERT may fail with "Cannot
schedule a task" because the pool in DistributedBlockOutputStream
already throws exception and simply fail in writeSuffix().
INSERT into Distributed with insert_distributed_sync=1 stores the
distributed batches on the disk for sending in background.
But types may be a little bit different for the Distributed and it's
underlying table, so the initiator need to know whether conversion is
required or not.
Before this patch those on disk distributed batches contains header,
which includes dumpStructure() for the block in that batch, however it
checks not only names and types and plus dumpStructure() is a debug
method.
So instead of storing string representation for the block header we
should store empty block in the file header (note, that we cannot store
the empty block not in header, since this will require reading all
blocks from file, due to some trickery of the readers interface).
Note, that this patch also contains tiny refactoring:
- s/header/distributed_header/
v1: dumpNamesAndTypes()
v2: dump empty block into the batch itself
v3: move empty block into the header
Add two new settings for the Distributed engine:
- bytes_to_delay_insert
- max_delay_to_insert
If at the beginning of INSERT there will be too much pending data, more
then bytes_to_delay_insert, then the INSERT will wait until it will be
shrinked, and not more then max_delay_to_insert seconds.
If after this there will be still too much pending, it will throw an
exception.
Also new profile events were added (by analogy to the MergeTree):
- DistributedDelayedInserts (although you can use system.errors instead
of this, but still)
- DistributedRejectedInserts
- DistributedDelayedInsertsMilliseconds
Previous patch fixes the inaccuracy, but it's done using iterating over
directory on each request (to system.distribution_queue or to check
bytes_to_throw_insert), and like previous patch alredy stated, it may
have pretty huge overhead (especially when you have lots of distributed
files pending).
This patch remove that recalculation (but it will still be done, and
if there is different, there will be a log message), and replace it with
proper account at INSERT time (and after file has been sent, or marked
as broken).
So now system.distribution_queue will show accurate statistics, so tests
does not requires sleep anymore.
But note that with too much distributed pending this will iterate over
all directories.
Right now with distributed_directory_monitor_batch_inserts=1 and
insert_distributed_sync=0 INSERT into Distributed table will store
blocks that should be sent to remote (and in case of
prefer_localhost_replica=0 to the localhost too) on the local
filesystem, and sent it in background.
However there is no limit for this storage, and if the remote is
unavailable (or some other error), these pending blocks may take
significant space, and this is not always desired behaviour.
Add new Distributed setting - bytes_to_throw_insert, that will set the
limit for how much pending bytes is allowed, if the limit will be
reached an exception will be throw.
By default was set to 0, to avoid surprises.
It is possible to get corruption (even though it is very unlikely, and
initially it wasn't corruption) just before the data block goes in the
file on disk, and in case of batching, it will break the packets, since
it will write the packet type but will not write any data after.
- the sender will got ATTEMPT_TO_READ_AFTER_EOF (added in
946c275dfb) when the client just go
away, i.e. server had been restarted, and this is incorrect to mark the
file as broken in this case.
- since #18853 the file will be checked on the sender locally, and
in case the file was truncated CANNOT_READ_ALL_DATA will be thrown.
But before #18853 the sender will not receive
ATTEMPT_TO_READ_AFTER_EOF from the client in case of file was truncated
on the sender, since the client will just wait for more data, IOW just hang.
- and I don't see how ATTEMPT_TO_READ_AFTER_EOF can be received while
reading local file.
Before this patch the DirectoryMonitor was checking the compressed file
by reading it one more time (since w/o this receiver may stuck on
truncated file), while this is ineffective and we can just check the
checksums before sending.
But note that this may decrease batch size that is used for sending over
network.
Before this patch StorageDistributedDirectoryMonitor reading .bin files
in batch mode, just to calculate number of bytes/rows, this is very
ineffective, let's just store them in the header (rows/bytes).
This is already done for distributed_directory_monitor_batch_inserts=1,
so let's do the same for the non batched mode, since otherwise in case
the file will be truncated the receiver will just stuck (since it will
wait for the block, but the sender will not send it).
Two new settings (by analogy with MergeTree family) has been added:
- `fsync_after_insert` - Do fsync for every inserted. Will decreases
performance of inserts.
- `fsync_tmp_directory` - Do fsync for temporary directory (that is used
for async INSERT only) after all part operations (writes, renames,
etc.).
Refs: #17380 (p1)
* Distributed insertion to one random shard
* add some tests
* add some documentation
* Respect shards' weights
* fine locking
Co-authored-by: Ivan Lezhankin <ilezhankin@yandex-team.ru>
The following headers are pretty generic, so use forward declaration as
much as possible:
- Context.h
- Settings.h
- ConnectionTimeouts.h
(Also this shows that some missing some includes -- this has been fixed)
And split ConnectionTimeouts.h into ConnectionTimeoutsContext.h (since
module part cannot be added for it, due to recursive build dependencies
that will be introduced)
Also remove Settings from the RemoteBlockInputStream/RemoteQueryExecutor
and just pass the context, since settings was passed only in speicifc
places, that can allow making a copy of Context (i.e. Copier).
Approx results (How much units will be recompiled after changing file X?):
- ConnectionTimeouts.h
- mainline: 100
- Context.h:
- mainline: ~800
- patched: 415
- Settings.h:
- mainline: 900-1K
- patched: 440 (most of them because of the Context.h)
Before this patch use_compact_format_in_distributed_parts_names was
applied only from default profile (at server start) for
internal_replication=1, and was ignored on INSERT.
Add inter-server cluster secret, it is used for Distributed queries
inside cluster, you can configure in the configuration file:
<remote_servers>
<logs>
<shard>
<secret>foobar</secret> <!-- empty -- works as before -->
...
</shard>
</logs>
</remote_servers>
And this will allow clickhouse to make sure that the query was not
faked, and was issued from the node that knows the secret. And since
trust appeared it can use initial_user for query execution, this will
apply correct *_for_user (since with inter-server secret enabled, the
query will be executed from the same user on the shards as on initator,
unlike "default" user w/o it).
v2: Change user to the initial_user for Distributed queries if secret match
v3: Add Protocol::Cluster package
v4: Drop Protocol::Cluster and use plain Protocol::Hello + user marker
v5: Do not use user from Hello for cluster-secure (superfluous)
CurrentMetrics::Increment add amount for specified metric only for the
lifetime of the object, but this is not the intention, since
DistributedFilesToInsert is a gauge and after #10263 it can exit from
the callback (and enter again later, for example after SYSTEM STOP
DISTRIBUTED SEND it will always exit from it, until SYSTEM START
DISTRIBUTED SEND).
So make Increment member of a class (this will also fix possible issues
with substructing value on DROP TABLE).