When compression is used, ZstdDeflatingAppendableWriteBuffer is
used, but it has a leak: it frees ZSTD_CCtx only if there was a
write, which is not correct, since the context is created anyway.
This was found with a jemalloc profile, keeper-bench, and the following
keeper settings:
- force_sync=false
- snapshot_distance=100
- reserved_log_items=0
- rotate_log_storage_interval=100
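The leak pattern described above can be modelled with a small sketch. It is written in Python for brevity and is not ClickHouse code: FakeContext is a hypothetical stand-in for ZSTD_CCtx, and LeakyWriter, FixedWriter and leaked_contexts are illustrative names. It only shows why a context that is created unconditionally has to be freed unconditionally.

```python
class FakeContext:
    """Hypothetical stand-in for ZSTD_CCtx; live contexts are tracked in a set."""

    def __init__(self, registry):
        self.registry = registry
        registry.add(self)

    def free(self):
        self.registry.discard(self)


class LeakyWriter:
    """Models the bug: the context is always created but only sometimes freed."""

    def __init__(self, registry):
        self.ctx = FakeContext(registry)  # created whether or not anything is written
        self.wrote = False

    def write(self, data):
        self.wrote = True

    def close(self):
        if self.wrote:  # bug: nothing is freed when there was no write
            self.ctx.free()


class FixedWriter(LeakyWriter):
    """Models the fix: freeing mirrors creation."""

    def close(self):
        self.ctx.free()  # freed unconditionally


def leaked_contexts(writer_cls, do_write):
    """Return how many contexts are still alive after the writer is closed."""
    registry = set()
    writer = writer_cls(registry)
    if do_write:
        writer.write(b"x")
    writer.close()
    return len(registry)
```

With this model, leaked_contexts(LeakyWriter, False) leaves one context behind, while FixedWriter leaves none in either case.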
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
For an unaligned offset pread() may return EINVAL even if the offset is
past EOF, although it should not, since otherwise there is no ability to
rely on read() == 0 meaning EOF (with a pread() loop).
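The kind of pread() loop meant here can be sketched as follows. This is a minimal illustration in Python, not the ClickHouse reader; read_until_eof is a hypothetical name, and the file is opened without O_DIRECT, so the sketch shows the expected semantics rather than the kernel bug.

```python
import os


def read_until_eof(fd, offset, chunk_size=4096):
    """Read from `offset` until pread() returns no data, which signals EOF."""
    chunks = []
    while True:
        data = os.pread(fd, chunk_size, offset)
        if not data:  # a zero-length result is the only EOF signal this loop has
            break
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks)
```

If pread() fails with EINVAL for an offset past EOF instead of returning zero bytes, the loop gets an error (OSError in Python) where it expected its termination condition.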
Here is a reproducer for the problem on 4.9.0-12-amd64:
$ head -c27 /dev/urandom > /tmp/pread.issue
$ xfs_io
xfs_io> open -d /tmp/pread.issue
xfs_io> pread 1000 4096
pread: Invalid argument
And this is how it should work:
xfs_io> pread 29 4096
read 0/4096 bytes at offset 29
Note, here I use interactive mode since we had an old xfs_io that does not
allow executing multiple commands at once, and to avoid the EMFILE issue.
Here is some history of the patches that affect this behaviour in the
Linux kernel:
- the issue had been introduced in
torvalds/linux@9fe55eea7e v3.14
("Fix race when checking i_size on direct i/o read")
- an attempt to fix it had been made in
torvalds/linux@74cedf9b6c v4.4
("direct-io: Fix negative return from dio read beyond eof")
- but this wasn't enough, since the alignment check came earlier, so
it was eventually fixed in
torvalds/linux@41b21af388 v5.10
("direct-io: defer alignment check until after the EOF check")
Someone may ask why CI does not show the issue, since neither of its
kernels has the last patch:
- it had a 4.19 kernel when CI was in Yandex
- now it has 5.4 when CI is in AWS
But this bug requires the following conditions to be met:
- index_granularity_bytes=0
- min_merge_bytes_to_use_direct_io=1
Which were not covered by CI yet.
Right now streams rely on the correct file size, not the number of bytes
that will be read from the stream, to work around a bug in the Linux
kernel that may return EINVAL for pread() with an offset past EOF.
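The idea of relying on a known file size can be sketched like this. It is an assumption-laden illustration in Python, not the actual stream code: read_range is a hypothetical helper, and the point is only that no pread() is issued at or past the known size, so the EOF case never reaches the kernel.

```python
import os


def read_range(fd, offset, length, file_size):
    """Read up to `length` bytes at `offset`, never requesting bytes past `file_size`."""
    end = min(offset + length, file_size)
    chunks = []
    while offset < end:  # once offset reaches the known size, no pread() is issued
        data = os.pread(fd, end - offset, offset)
        if not data:  # file is shorter than file_size claimed; stop instead of spinning
            break
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks)
```

For a 27-byte file, read_range(fd, 1000, 4096, 27) returns an empty result without calling pread() at all.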
v2: Swap read_hint and file_size (since it is easy to miss something)
Previously the first argument to readFile()/createReadBufferFromFileBase()
was read_hint, not file_size, so let's preserve that order, since
otherwise it is easy to miss something.
This will also fix the 02051_read_settings test automatically, because now
MergeTreeReaderStream will pass estimated_sum_mark_range_bytes as
read_hint, not file_size. Previously this caused one of the following errors:
- Attempt to read after EOF w/ O_DIRECT
- and LOGICAL_ERROR while adjusting granules w/o O_DIRECT
This will also improve the zero-length read guard (via
ReadBufferFromEmptyFile) that was added in #30190
v3: fix for other storages that weren't enabled in fast-test
v4: ignore ENOENT/ENOTSUP in readFile
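One possible reading of the v2 and v4 notes, sketched in Python under stated assumptions: file_size_or_none and open_for_read are hypothetical helpers, not ClickHouse functions, and the assumption is that the ignored errors come from looking up the file size. read_hint is kept as the first optional argument and file_size as the second, and an empty file yields an empty buffer so that no read is ever issued for it.

```python
import errno
import io
import os


def file_size_or_none(path):
    """Return the file size, or None if the lookup fails with ENOENT or ENOTSUP."""
    try:
        return os.stat(path).st_size
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ENOTSUP):
            return None  # size unknown; the caller proceeds without it
        raise


def open_for_read(path, read_hint=None, file_size=None):
    """Hypothetical opener: read_hint first, file_size second (read_hint is unused here)."""
    if file_size is None:
        file_size = file_size_or_none(path)
    if file_size == 0:
        return io.BytesIO(b"")  # empty file: hand back an empty buffer, no reads issued
    return open(path, "rb")
```

Passing both values by keyword, as in open_for_read(path, read_hint=n, file_size=m), avoids mixing up the two numbers.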