Before this patch the DirectoryMonitor was checking the compressed file
by reading it one more time (since w/o this receiver may stuck on
truncated file), while this is ineffective and we can just check the
checksums before sending.
But note that this may decrease batch size that is used for sending over
network.
Before this patch StorageDistributedDirectoryMonitor reading .bin files
in batch mode, just to calculate number of bytes/rows, this is very
ineffective, let's just store them in the header (rows/bytes).
This is already done for distributed_directory_monitor_batch_inserts=1,
so let's do the same for the non batched mode, since otherwise in case
the file will be truncated the receiver will just stuck (since it will
wait for the block, but the sender will not send it).
Two new settings (by analogy with MergeTree family) has been added:
- `fsync_after_insert` - Do fsync for every inserted. Will decreases
performance of inserts.
- `fsync_tmp_directory` - Do fsync for temporary directory (that is used
for async INSERT only) after all part operations (writes, renames,
etc.).
Refs: #17380 (p1)
* Distributed insertion to one random shard
* add some tests
* add some documentation
* Respect shards' weights
* fine locking
Co-authored-by: Ivan Lezhankin <ilezhankin@yandex-team.ru>
The following headers are pretty generic, so use forward declaration as
much as possible:
- Context.h
- Settings.h
- ConnectionTimeouts.h
(Also this shows that some missing some includes -- this has been fixed)
And split ConnectionTimeouts.h into ConnectionTimeoutsContext.h (since
module part cannot be added for it, due to recursive build dependencies
that will be introduced)
Also remove Settings from the RemoteBlockInputStream/RemoteQueryExecutor
and just pass the context, since settings was passed only in speicifc
places, that can allow making a copy of Context (i.e. Copier).
Approx results (How much units will be recompiled after changing file X?):
- ConnectionTimeouts.h
- mainline: 100
- Context.h:
- mainline: ~800
- patched: 415
- Settings.h:
- mainline: 900-1K
- patched: 440 (most of them because of the Context.h)
Before this patch use_compact_format_in_distributed_parts_names was
applied only from default profile (at server start) for
internal_replication=1, and was ignored on INSERT.
Add inter-server cluster secret, it is used for Distributed queries
inside cluster, you can configure in the configuration file:
<remote_servers>
<logs>
<shard>
<secret>foobar</secret> <!-- empty -- works as before -->
...
</shard>
</logs>
</remote_servers>
And this will allow clickhouse to make sure that the query was not
faked, and was issued from the node that knows the secret. And since
trust appeared it can use initial_user for query execution, this will
apply correct *_for_user (since with inter-server secret enabled, the
query will be executed from the same user on the shards as on initator,
unlike "default" user w/o it).
v2: Change user to the initial_user for Distributed queries if secret match
v3: Add Protocol::Cluster package
v4: Drop Protocol::Cluster and use plain Protocol::Hello + user marker
v5: Do not use user from Hello for cluster-secure (superfluous)
CurrentMetrics::Increment add amount for specified metric only for the
lifetime of the object, but this is not the intention, since
DistributedFilesToInsert is a gauge and after #10263 it can exit from
the callback (and enter again later, for example after SYSTEM STOP
DISTRIBUTED SEND it will always exit from it, until SYSTEM START
DISTRIBUTED SEND).
So make Increment member of a class (this will also fix possible issues
with substructing value on DROP TABLE).
In this case error_count can be decreased before checking it for
rescheduling send.
And actually this can be a problem not only when
distributed_directory_monitor_{max_,}sleep_time_ms > 5min, because all
threads can be occupated and the real timeout between sends will be > 5min.
1. Moved Volume to separate file
2. Created IVolume interface and implemented current behaviour in implementation of new interface — VolumeJBOD
3. Replaced all old volume usages with new VolumeJBOD. Where it is unnecessary to have JBOD — left just IVolume.
4. Removed old Volume completely
5. Moved StoragePolicy to separated files
6. Moved DiskSelector to separated files
7. Removed DiskSpaceMonitor file
This also fixes hardlink code (when one file should be sent to multiple
servers, i.e. internal_replication == false) of writeToShard() with
distributed_storage_policy (i.e. when StorageDistributed::getPath() will
path to different filesystems).
Plus also cleanup DistributedBlockOutputStream::writeToShard() a little.
Since in this case it will be scheduled from the
DistributedBlockOutputStream with the
distributed_directory_monitor_max_sleep_time_ms, and this will overwrite
timer that was set by the DistributedBlockOutputStream, not good.
Before this patch each INSERT query re-schedule distributed send, thus
each time it resets the timer, while this is not the expected behaviour,
since in on frequent INSERT distributed sends will not be triggered at
all.
Fix this by not resetting the timer.
After #8756 the problem with 1 thread for each (distributed table, disk)
for distributed sends became even worse (since there can be multiple
disks), so use predefined thread pool for this tasks, that can be
controlled with background_distributed_schedule_pool_size knob.
Add missing conversion (via ConvertingBlockInputStream) for INSERT into
remote nodes (for sync insert, async insert and async batch insert),
like for local nodes (in DistributedBlockOutputStream::writeBlockConverted).
This is required when the structure of the Distributed table differs
from the structure of the local table.
And also add a warning message, to highlight this in logs (since this
works slower).
Fixes: #19888