UBSan reports:
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/Storages/Distributed/DirectoryMonitor.cpp:435:53 in
../src/Storages/Distributed/DirectoryMonitor.cpp:435: runtime error: 1.15292e+19 is outside the range of representable values of type 'long'
0 0x1df0c286 in DB::StorageDistributedDirectoryMonitor::run() obj-x86_64-linux-gnu/../src/Storages/Distributed/DirectoryMonitor.cpp:435:53
It is pretty easy to reproduce by limiting max_server_memory_usage
before starting the test.
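A minimal sketch of how such a conversion can be made safe, assuming the
out-of-range value comes from a floating-point computation such as an
exponentially growing retry delay (the report itself does not say what
is computed at that line). The name backoffDelay and its parameters are
hypothetical and are not taken from DirectoryMonitor.cpp:

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>

// Hypothetical helper: exponential backoff delay that saturates at max_delay.
// The clamp is done in floating point, before the conversion to an integer,
// so the cast never sees a value outside the range of the target type.
inline std::chrono::milliseconds backoffDelay(
    std::chrono::milliseconds base,
    std::chrono::milliseconds max_delay,
    std::uint64_t error_count)
{
    const double max_ms = static_cast<double>(max_delay.count());
    const double wanted_ms
        = static_cast<double>(base.count()) * std::exp2(static_cast<double>(error_count));

    // Written as !(a < b) so that NaN and +inf also take the saturating branch.
    if (!(wanted_ms < max_ms))
        return max_delay;

    return std::chrono::milliseconds(static_cast<std::int64_t>(wanted_ms));
}
```

With a base of 100ms and a cap of 30s, three errors give 800ms, while 60
errors (or 5000) give the cap instead of an undefined conversion.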
Broken batches may be caused by an abnormal server shutdown (and the
lack of fsync), and ignoring the whole batch is not great in this case,
so apply the same split logic here too.
v2: rename exception
v3: catch missing exception
v4: fix marking the file as broken multiple times (fixes
test_insert_distributed_async_send with setting enabled)
Add distributed_directory_monitor_split_batch_on_failure setting (OFF by
default) that splits the batch and sends files one by one in case of
retriable errors.
v2: more error codes
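The split logic can be sketched roughly as follows. This is only an
illustration under assumptions: SendResult, BatchOutcome, sendBatch and
the two callbacks are hypothetical names, not the ones used by the
directory monitor, and the real set of retriable error codes is not
modelled here:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical outcome of a single send attempt.
enum class SendResult { Ok, RetriableError, BrokenFile };

using Files = std::vector<std::string>;

struct BatchOutcome
{
    Files sent;
    Files broken;   // to be marked as broken
    Files pending;  // left on disk for a later retry
};

// send_many sends all files as one batch, send_one sends a single file.
inline BatchOutcome sendBatch(
    const Files & files,
    bool split_batch_on_failure,
    const std::function<SendResult(const Files &)> & send_many,
    const std::function<SendResult(const std::string &)> & send_one)
{
    BatchOutcome out;

    const SendResult whole = send_many(files);
    if (whole == SendResult::Ok)
    {
        out.sent = files;
        return out;
    }

    if (!split_batch_on_failure)
    {
        // Without splitting, every file in the batch shares the same fate.
        (whole == SendResult::BrokenFile ? out.broken : out.pending) = files;
        return out;
    }

    // With splitting, retry the files one by one so that a single bad file
    // does not affect the rest of the batch.
    for (const auto & file : files)
    {
        switch (send_one(file))
        {
            case SendResult::Ok: out.sent.push_back(file); break;
            case SendResult::BrokenFile: out.broken.push_back(file); break;
            case SendResult::RetriableError: out.pending.push_back(file); break;
        }
    }
    return out;
}
```

The point of the sketch is the granularity: with the setting enabled a
failure is attributed to individual files rather than to the whole
batch.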
Under use_compact_format_in_distributed_parts_names=1 and
internal_replication=true the server encodes all replicas in the
directory name for async INSERT into Distributed, so the directory name
looks like:
shard1_replica1,shard1_replica2,shard3_replica3
This is required for creating connections (to specific replicas only),
but in case of internal_replication=true this can be avoided, since
this path will always include all replicas.
This patch replaces all replicas with the "_all_replicas" marker.
Note that the initial problem was that this path may overflow NAME_MAX
if you have more than 15 replicas, and the server will fail to create
the directory.
Also note that the changed directory name should not be a problem, since:
- empty directories are removed since #16729
- and replicas encoded in the directory name are still supported anyway.
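To illustrate the NAME_MAX problem, here is a small sketch. Everything
in it is an assumption made for the example: the helper names, the exact
form of the marker-based name ("shard<N>_all_replicas") and the limit of
255 bytes, which is the usual NAME_MAX on Linux:

```cpp
#include <cstddef>
#include <string>

// Assumed limit; the real value comes from NAME_MAX in <limits.h>.
constexpr std::size_t kAssumedNameMax = 255;

// Hypothetical helper: build a compact directory name for one shard,
// either listing every replica or using a single marker for all of them.
inline std::string compactDirectoryName(unsigned shard, unsigned replicas, bool use_marker)
{
    if (use_marker)
        return "shard" + std::to_string(shard) + "_all_replicas";

    std::string name;
    for (unsigned replica = 1; replica <= replicas; ++replica)
    {
        if (!name.empty())
            name += ',';
        name += "shard" + std::to_string(shard) + "_replica" + std::to_string(replica);
    }
    return name;
}

inline bool fitsInNameMax(const std::string & name)
{
    return name.size() <= kAssumedNameMax;
}
```

Under these assumptions the listing form for shard 1 takes 245 bytes
with 15 replicas and 262 bytes with 16, while the marker form does not
grow with the number of replicas.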
Number of files for asynchronous insertion into Distributed tables that
have been marked as broken. This metric starts from 0 on server start.
The number of files is summed over all shards.
INSERT into Distributed with insert_distributed_sync=0 stores the
distributed batches on disk for sending in the background.
But types may be slightly different for the Distributed table and its
underlying table, so the initiator needs to know whether conversion is
required or not.
Before this patch those on-disk distributed batches contained a header
that included dumpStructure() for the block in that batch. However,
comparing it checks more than just names and types, and dumpStructure()
is a debug method anyway.
So instead of storing a string representation of the block header we
should store an empty block in the file header (note that we cannot
store the empty block outside the header, since this would require
reading all blocks from the file, due to some trickery of the readers
interface).
Note, that this patch also contains tiny refactoring:
- s/header/distributed_header/
v1: dumpNamesAndTypes()
v2: dump empty block into the batch itself
v3: move empty block into the header
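The check that the stored empty block makes possible can be sketched
like this. ColumnDescription, Header and needsConversion are
hypothetical stand-ins for the example, not the Block interface; the
only point is that the comparison is limited to names and types:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one column of an (empty) block.
struct ColumnDescription
{
    std::string name;
    std::string type;
};

using Header = std::vector<ColumnDescription>;

// Conversion is needed as soon as names or types differ; nothing else
// (constness, number of rows, ...) takes part in the comparison.
inline bool needsConversion(const Header & stored, const Header & target)
{
    if (stored.size() != target.size())
        return true;

    for (std::size_t i = 0; i < stored.size(); ++i)
        if (stored[i].name != target[i].name || stored[i].type != target[i].type)
            return true;

    return false;
}
```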
The previous patch fixes the inaccuracy, but it does so by iterating
over the directory on each request (to system.distribution_queue or to
check bytes_to_throw_insert), and as the previous patch already stated,
this may have pretty huge overhead (especially when you have lots of
distributed files pending).
This patch removes that recalculation (though it will still be done, and
if there is a difference, there will be a log message) and replaces it
with proper accounting at INSERT time (and after a file has been sent
or marked as broken).
So now system.distribution_queue shows accurate statistics, and tests
do not require sleep anymore.
But note that with too many pending distributed files this will iterate
over all directories.
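The accounting idea can be sketched as a pair of counters updated at the
points named above. The class and method names are hypothetical, and the
real code may well use different synchronization:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-directory counters for pending distributed files.
class PendingFilesAccounting
{
public:
    // Called at INSERT time, once the file is on disk.
    void onFileWritten(std::uint64_t size)
    {
        files_count.fetch_add(1, std::memory_order_relaxed);
        bytes_count.fetch_add(size, std::memory_order_relaxed);
    }

    // Called after the file has been sent or marked as broken.
    void onFileSentOrBroken(std::uint64_t size)
    {
        files_count.fetch_sub(1, std::memory_order_relaxed);
        bytes_count.fetch_sub(size, std::memory_order_relaxed);
    }

    std::uint64_t files() const { return files_count.load(std::memory_order_relaxed); }
    std::uint64_t bytes() const { return bytes_count.load(std::memory_order_relaxed); }

    // Compare against the result of a directory scan; a mismatch is what
    // would be reported in the log.
    bool matchesScan(std::uint64_t scanned_files, std::uint64_t scanned_bytes) const
    {
        return files() == scanned_files && bytes() == scanned_bytes;
    }

private:
    std::atomic<std::uint64_t> files_count{0};
    std::atomic<std::uint64_t> bytes_count{0};
};
```

Reading the counters is then O(1) per request instead of a directory
iteration.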
Add missing conversion (via ConvertingBlockInputStream) for INSERT into
remote nodes (for sync insert, async insert and async batch insert),
like for local nodes (in DistributedBlockOutputStream::writeBlockConverted).
This is required when the structure of the Distributed table differs
from the structure of the local table.
And also add a warning message, to highlight this in logs (since this
works slower).
Fixes: #19888
It is possible to get corruption (even though it is very unlikely, and
initially it wasn't corruption) just before the data block goes into the
file on disk, and in case of batching it will break the packets, since
the packet type will be written but no data after it.
- the sender will get ATTEMPT_TO_READ_AFTER_EOF (added in
946c275dfb) when the client just goes away, i.e. the server had been
restarted, and it is incorrect to mark the file as broken in this case.
- since #18853 the file is checked on the sender locally, and if the
file was truncated CANNOT_READ_ALL_DATA will be thrown.
But before #18853 the sender would not receive
ATTEMPT_TO_READ_AFTER_EOF from the client if the file was truncated on
the sender, since the client would just wait for more data, IOW just
hang.
- and I don't see how ATTEMPT_TO_READ_AFTER_EOF can be received while
reading a local file.
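The reasoning in the list above boils down to a small classification,
sketched here. SendError and shouldMarkFileAsBroken are hypothetical
names, only the two codes discussed above are modelled, and everything
else is lumped into Other and treated as "not broken" purely for the
sake of the example:

```cpp
// Hypothetical subset of error codes relevant to the discussion above.
enum class SendError
{
    CannotReadAllData,      // local file is truncated (checked on the sender)
    AttemptToReadAfterEof,  // the other side went away, e.g. server restart
    Other
};

// Only an error that proves the local file is damaged marks it as broken;
// a peer that disappeared says nothing about the file itself.
inline bool shouldMarkFileAsBroken(SendError error)
{
    switch (error)
    {
        case SendError::CannotReadAllData: return true;
        case SendError::AttemptToReadAfterEof: return false;
        case SendError::Other: return false;
    }
    return false;
}
```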
Before this patch the DirectoryMonitor was checking the compressed file
by reading it one more time (since w/o this the receiver may get stuck
on a truncated file), but this is inefficient and we can just check the
checksums before sending.
But note that this may decrease the batch size used for sending over
the network.
Before this patch StorageDistributedDirectoryMonitor was reading .bin
files in batch mode just to calculate the number of bytes/rows; this is
very inefficient, so let's just store them in the header (rows/bytes).
This is already done for distributed_directory_monitor_batch_inserts=1,
so let's do the same for the non-batched mode, since otherwise, if the
file is truncated, the receiver will just get stuck (since it will wait
for the block, but the sender will not send it).
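Storing rows/bytes in the header means the monitor only has to read a
few bytes per file. A minimal sketch, assuming a made-up layout of two
fixed-width integers in host byte order in front of the payload (the
real .bin header format is different and is not reproduced here):

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical header: just the two totals discussed above.
struct FileHeader
{
    std::uint64_t rows;
    std::uint64_t bytes;
};

// Write the header followed by the (opaque) payload.
inline void writeFile(std::ostream & out, const FileHeader & header, const std::string & payload)
{
    out.write(reinterpret_cast<const char *>(&header.rows), sizeof(header.rows));
    out.write(reinterpret_cast<const char *>(&header.bytes), sizeof(header.bytes));
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Read only the header; the payload is not touched.
// Returns false if the stream is too short to contain a full header.
inline bool readHeader(std::istream & in, FileHeader & header)
{
    FileHeader tmp{0, 0};
    in.read(reinterpret_cast<char *>(&tmp.rows), sizeof(tmp.rows));
    in.read(reinterpret_cast<char *>(&tmp.bytes), sizeof(tmp.bytes));
    if (!in)
        return false;
    header = tmp;
    return true;
}
```

The totals for a batch can then be summed from the headers alone,
without decompressing or parsing any blocks.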