ASan founds [1]:
==553==
ERROR: AddressSanitizer: heap-use-after-free on address 0x61e004694080 at pc 0x000029150af2 bp 0x7f70b3f8ef10 sp 0x7f70b3f8ef08
READ of size 8 at 0x61e004694080 thread T477 (QueryPipelineEx)
0 0x29150af1 in DB::MergeTreeDataWriter::writeTempPart() >
1 0x293b8e43 in DB::MergeTreeSink::consume(DB::Chunk) obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergeTreeSink.cpp:27:65
2 0x29dac73b in DB::SinkToStorage::onConsume(DB::Chunk) obj-x86_64-linux-gnu/../src/Processors/Sinks/SinkToStorage.cpp:18:5
3 0x29c72dd2 in DB::ExceptionKeepingTransform::work()::$_1::operator()() const obj-x86_64-linux-gnu/../src/Processors/Transforms/ExceptionKeepingTransform.cpp:151:51
0x61e004694080 is located 2048 bytes inside of 2480-byte region [0x61e004693880,0x61e004694230)
freed by thread T199 (BgSchPool) here:
...
4 0x26220f20 in DB::DatabaseCatalog::TableMarkedAsDropped::~TableMarkedAsDropped() obj-x86_64-linux-gnu/../src/Interpreters/DatabaseCatalog.h:248:12
5 0x26220f20 in DB::DatabaseCatalog::dropTableDataTask() obj-x86_64-linux-gnu/../src/Interpreters/DatabaseCatalog.cpp:908:1
[1]: https://s3.amazonaws.com/clickhouse-test-reports/33201/4f04d6af61eabf4899eb8188150dc862aaab80fc/stress_test__address__actions_.html
There was a fix in #32572, but it was not complete (yes it reduced the
race window a lot, but not completely), since the inner table still can
go away after the INSERT chain was built, to fix this obtain the
reference earlier.
Follow-up for: #32572 (cc @tavplubix)
For unaligned offset pread() may return EINVAL even if the offset pass
EOF, although it should not, since otherwise there is no abiliity to
rely on read() == 0 is EOF (with pread() loop).
Here is a reproducer for the problem on 4.9.0-12-amd64:
$ head -c27 /dev/urandom > /tmp/pread.issue
$ xfs_io
xfs_io> open -d /tmp/pread.issue
xfs_io> pread 1000 4096
pread: Invalid argument
And this is how it should work:
xfs_io> pread 29 4096
read 0/4096 bytes at offset 29
Note, here I use interactive mode since we had old xfs_io that does not
allow to execute multiple commands at once, and to avoid EMFILE issue
Here is some history of a patches that affects this behaviour in the
linux kernel:
- the issue had been introduced in
torvalds/linux@9fe55eea7e v3.14
("Fix race when checking i_size on direct i/o read")
- an attempt to fix it had been made in
torvalds/linux@74cedf9b6c v4.4
("direct-io: Fix negative return from dio read beyond eof")
- but this wasn't enough, since alignment check was earlier, so
eventually fixed in
torvalds/linux@41b21af388 v5.10
("direct-io: defer alignment check until after the EOF check")
Someone may ask why CI does not shows the issue, since:
- it had 4.19 kernel when CI was in yandex
- now it has 5.4 when CI is in AWS
Since both of those kernels does not have the last patch.
But, this bug requires the following conditions to met:
- index_granularity_bytes=0
- min_merge_bytes_to_use_direct_io=1
Which was not covered by CI yet.
Right now streams relies on correct file size not the number of bytes
that will be read from the stream, to overcome one bug in the linux
kernel that may return EIINVAL for pread() with offset pass the EOF.
v2: Swap read_hint and file_size (since it is easy to miss something)
Before the first argument to readFile()/createReadBufferFromFileBase()
was read_hint, not the file_size, and let's preserve the order, since
it is easy to miss something
This will also fix 02051_read_settings test automatically because now
MergeTreeReaderStream will pass estimated_sum_mark_range_bytes to
read_hint not file_size, previously it cause on of the following errors:
- Attempt to read after EOF w/ O_DIRECT
- and LOGICAL_ERROR while adjusting granulas w/o O_DIRECT
This will also improve zero-length reads guard (via
ReadBufferFromEmptyFile), that had been added in #30190
v3: fix for other storages that wasn't enabled in fast-test
v4: ignore ENOENT/ENOTSUP in readFile