There are two possible cases for execution merges/mutations:
1) from background thread
2) from OPTIMIZE TABLE query
1) is pretty simple, it's memory tracking structure is as follow:
current_thread::memory_tracker = level=Thread / description="(for thread)" ==
background_thread_memory_tracker = level=Thread / description="(for thread)"
current_thread::memory_tracker.parent = level=Global / description="(total)"
So as you can see it is pretty simple and MemoryTrackerThreadSwitcher
does not do anything icky for this case.
2) is complex, it's memory tracking structure is as follow:
current_thread::memory_tracker = level=Thread / description="(for thread)"
current_thread::memory_tracker.parent = level=Process / description="(for query)" ==
background_thread_memory_tracker = level=Process / description="(for query)"
Before this patch to track memory (and related things, like sampling,
profiling and so on) for OPTIMIZE TABLE query dirty hacks was done to
do this, since current_thread memory_tracker was of Thread scope, that
does not have any limits.
And so if will change parent for it to Merge/Mutate memory tracker
(which also does not have some of settings) it will not be correctly
tracked.
To address this Merge/Mutate was set as parent not to the
current_thread memory_tracker but to it's parent, since it's scope is
Process with all settings.
But that parent's memory_tracker is the memory_tracker of the
thread_group, and so if you will have nested ThreadPool inside
merge/mutate (this is the case for s3 async writes, which has been
added in #33291) you may get use-after-free of memory_tracker.
Consider the following example:
MemoryTrackerThreadSwitcher()
thread_group.memory_tracker.parent = merge_list_entry->memory_tracker
(see also background_thread_memory_tracker above)
CurrentThread::attachTo()
current_thread.memory_tracker.parent = thread_group.memory_tracker
CurrentThread::detachQuery()
current_thread.memory_tracker.parent = thread_group.memory_tracker.parent
# and this is equal to merge_list_entry->memory_tracker
~MemoryTrackerThreadSwitcher()
thread_group.memory_tracker = thread_group.memory_tracker.parent
So after the following we will get incorrect memory_tracker (from the
mege_list_entry) when the next job in that ThreadPool will not have
thread_group, since in this case it will not try to update the
current_thread.memory_tracker.parent and use-after-free will happens.
So to address the (2) issue, settings from the parent memory_tracker
should be copied to the merge_list_entry->memory_tracker, to avoid
playing with parent memory tracker.
Note, that settings from the query (OPTIMIZE TABLE) is not available at
that time, so it cannot be used (instead of parent's memory tracker
settings).
v2: remove memory_tracker.setOrRaiseHardLimit() from settings
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Note, that there is no need to disable JEMALLOC_PURGE_MADVISE_FREE,
since jemalloc does check in runtime, and ClickHouse already
successfully works w/o this change.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>