Merge pull request #34355 from azat/processors-profiling

Profiling on Processors level
2024-12-11 17:02:25 +00:00 · 2022-04-07 12:13:14 +02:00
commit 3e1b3f14c0
parent d21ed546a4 acd48ebe2d
28 changed files with 430 additions and 21 deletions
--- a/docs/en/operations/settings/settings.md
+++ b/docs/en/operations/settings/settings.md
@ -1062,6 +1062,15 @@ Result:
 └─────────────┴───────────┘
 ```
 ## log_processors_profiles {#settings-log_processors_profiles}
 Writes the time each processor spent executing and waiting for data to the `system.processors_profile_log` table.
 See also:
 -   [`system.processors_profile_log`](../../operations/system-tables/processors_profile_log.md#system-processors_profile_log)
 -   [`EXPLAIN PIPELINE`](../../sql-reference/statements/explain.md#explain-pipeline)
 ## max_insert_block_size {#settings-max_insert_block_size}
 The size of blocks (in a count of rows) to form for insertion into a table.
--- a/docs/en/operations/system-tables/processors_profile_log.md
+++ b/docs/en/operations/system-tables/processors_profile_log.md
@ -0,0 +1,75 @@
 # system.processors_profile_log {#system-processors_profile_log}
 This table contains profiling information at the processor level (the processors are the ones shown by [`EXPLAIN PIPELINE`](../../sql-reference/statements/explain.md#explain-pipeline)).
 Columns:
 -   `event_date` ([Date](../../sql-reference/data-types/date.md)) — The date when the event happened.
 -   `event_time` ([DateTime](../../sql-reference/data-types/datetime.md)) — The date and time when the event happened.
 -   `event_time_microseconds` ([DateTime64](../../sql-reference/data-types/datetime64.md)) — The date and time, with microsecond precision, when the event happened.
 -   `id` ([UInt64](../../sql-reference/data-types/int-uint.md)) — ID of the processor.
 -   `parent_ids` ([Array(UInt64)](../../sql-reference/data-types/array.md)) — IDs of the parent processors, that is, the processors connected to this processor's output ports.
 -   `query_id` ([String](../../sql-reference/data-types/string.md)) — ID of the query.
 -   `name` ([LowCardinality(String)](../../sql-reference/data-types/lowcardinality.md)) — Name of the processor.
 -   `elapsed_us` ([UInt64](../../sql-reference/data-types/int-uint.md)) — Number of microseconds this processor spent executing.
 -   `input_wait_elapsed_us` ([UInt64](../../sql-reference/data-types/int-uint.md)) — Number of microseconds this processor spent waiting for data from another processor.
 -   `output_wait_elapsed_us` ([UInt64](../../sql-reference/data-types/int-uint.md)) — Number of microseconds this processor spent waiting because its output port was full.
 **Example**
 Query:
 ``` sql
 EXPLAIN PIPELINE
 SELECT sleep(1)
 ┌─explain─────────────────────────┐
 │ (Expression)                    │
 │ ExpressionTransform             │
 │   (SettingQuotaAndLimits)       │
 │     (ReadFromStorage)           │
 │     SourceFromSingleChunk 0 → 1 │
 └─────────────────────────────────┘
 SELECT sleep(1)
 SETTINGS log_processors_profiles = 1
 Query id: feb5ed16-1c24-4227-aa54-78c02b3b27d4
 ┌─sleep(1)─┐
 │        0 │
 └──────────┘
 1 rows in set. Elapsed: 1.018 sec.
 SELECT
    name,
    elapsed_us,
    input_wait_elapsed_us,
    output_wait_elapsed_us
 FROM system.processors_profile_log
 WHERE query_id = 'feb5ed16-1c24-4227-aa54-78c02b3b27d4'
 ORDER BY name ASC
 ```
 Result:
 ``` text
 ┌─name────────────────────┬─elapsed_us─┬─input_wait_elapsed_us─┬─output_wait_elapsed_us─┐
 │ ExpressionTransform     │    1000497 │                  2823 │                    197 │
 │ LazyOutputFormat        │         36 │               1002188 │                      0 │
 │ LimitsCheckingTransform │         10 │               1002994 │                    106 │
 │ NullSource              │          5 │               1002074 │                      0 │
 │ NullSource              │          1 │               1002084 │                      0 │
 │ SourceFromSingleChunk   │         45 │                  4736 │                1000819 │
 └─────────────────────────┴────────────┴───────────────────────┴────────────────────────┘
 ```
 Here you can see:
 -  `ExpressionTransform` was executing the `sleep(1)` function, so its `work` takes about 1e6 us, and therefore `elapsed_us` > 1e6.
 -  `SourceFromSingleChunk` has to wait, because `ExpressionTransform` does not accept any data while it executes `sleep(1)`, so it stays in the `PortFull` state for about 1e6 us, and therefore `output_wait_elapsed_us` > 1e6.
 -  `LimitsCheckingTransform`/`NullSource`/`LazyOutputFormat` have to wait until `ExpressionTransform` finishes executing `sleep(1)` before they can process the result, so `input_wait_elapsed_us` > 1e6.
 **See Also**
 -   [`EXPLAIN PIPELINE`](../../sql-reference/statements/explain.md#explain-pipeline)
--- a/programs/server/config.xml
+++ b/programs/server/config.xml
@ -1042,6 +1042,15 @@
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </session_log> -->
    <!-- Profiling on Processors level. -->
    <processors_profile_log>
        <database>system</database>
        <table>processors_profile_log</table>
        <partition_by>toYYYYMM(event_date)</partition_by>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </processors_profile_log>
    <!-- <top_level_domains_path>/var/lib/clickhouse/top_level_domains/</top_level_domains_path> -->
    <!-- Custom TLD lists.
         Format: <name>/path/to/file</name>
--- a/src/Common/SystemLogBase.cpp
+++ b/src/Common/SystemLogBase.cpp
@ -9,6 +9,7 @@
 #include <Interpreters/SessionLog.h>
 #include <Interpreters/TextLog.h>
 #include <Interpreters/TraceLog.h>
 #include <Interpreters/ProcessorsProfileLog.h>
 #include <Interpreters/ZooKeeperLog.h>
 #include <Common/MemoryTrackerBlockerInThread.h>
--- a/src/Common/SystemLogBase.h
+++ b/src/Common/SystemLogBase.h
@ -24,6 +24,7 @@
    M(SessionLogElement) \
    M(TraceLogElement) \
    M(ZooKeeperLogElement) \
    M(ProcessorProfileLogElement) \
    M(TextLogElement)
 namespace Poco
--- a/src/Core/Settings.h
+++ b/src/Core/Settings.h
@ -195,6 +195,7 @@ class IColumn;
    M(UInt64, log_queries_cut_to_length, 100000, "If query length is greater than specified threshold (in bytes), then cut query when writing to query log. Also limit length of printed query in ordinary text log.", 0) \
    M(Float, log_queries_probability, 1., "Log queries with the specified probability.", 0) \
    \
    M(Bool, log_processors_profiles, false, "Log Processors profile events.", 0) \
    M(DistributedProductMode, distributed_product_mode, DistributedProductMode::DENY, "How are distributed subqueries performed inside IN or JOIN sections?", IMPORTANT) \
    \
    M(UInt64, max_concurrent_queries_for_all_users, 0, "The maximum number of concurrent requests for all users.", 0) \
--- a/src/Interpreters/Context.cpp
+++ b/src/Interpreters/Context.cpp
@ -2474,6 +2474,17 @@ std::shared_ptr<ZooKeeperLog> Context::getZooKeeperLog() const
 }
 std::shared_ptr<ProcessorsProfileLog> Context::getProcessorsProfileLog() const
 {
    auto lock = getLock();
    if (!shared->system_logs)
        return {};
    return shared->system_logs->processors_profile_log;
 }
 CompressionCodecPtr Context::chooseCompressionCodec(size_t part_size, double part_size_ratio) const
 {
    auto lock = getLock();
--- a/src/Interpreters/Context.h
+++ b/src/Interpreters/Context.h
@ -80,6 +80,7 @@ class AsynchronousMetricLog;
 class OpenTelemetrySpanLog;
 class ZooKeeperLog;
 class SessionLog;
 class ProcessorsProfileLog;
 struct MergeTreeSettings;
 class StorageS3Settings;
 class IDatabase;
@ -800,6 +801,7 @@ public:
    std::shared_ptr<OpenTelemetrySpanLog> getOpenTelemetrySpanLog() const;
    std::shared_ptr<ZooKeeperLog> getZooKeeperLog() const;
    std::shared_ptr<SessionLog> getSessionLog() const;
    std::shared_ptr<ProcessorsProfileLog> getProcessorsProfileLog() const;
    /// Returns an object used to log operations with parts, if possible.
    /// Provide table name to make required checks.
--- a/src/Interpreters/InterpreterSystemQuery.cpp
+++ b/src/Interpreters/InterpreterSystemQuery.cpp
@ -29,6 +29,7 @@
 #include <Interpreters/AsynchronousMetricLog.h>
 #include <Interpreters/OpenTelemetrySpanLog.h>
 #include <Interpreters/ZooKeeperLog.h>
 #include <Interpreters/ProcessorsProfileLog.h>
 #include <Interpreters/JIT/CompiledExpressionCache.h>
 #include <Access/ContextAccess.h>
 #include <Access/Common/AllowedClientHosts.h>
@ -443,7 +444,8 @@ BlockIO InterpreterSystemQuery::execute()
                [&] { if (auto opentelemetry_span_log = getContext()->getOpenTelemetrySpanLog()) opentelemetry_span_log->flush(true); },
                [&] { if (auto query_views_log = getContext()->getQueryViewsLog()) query_views_log->flush(true); },
                [&] { if (auto zookeeper_log = getContext()->getZooKeeperLog()) zookeeper_log->flush(true); },
-                [&] { if (auto session_log = getContext()->getSessionLog()) session_log->flush(true); }
+                [&] { if (auto session_log = getContext()->getSessionLog()) session_log->flush(true); },
                [&] { if (auto processors_profile_log = getContext()->getProcessorsProfileLog()) processors_profile_log->flush(true); }
            );
            break;
        }
--- a/src/Interpreters/ProcessorsProfileLog.cpp
+++ b/src/Interpreters/ProcessorsProfileLog.cpp
@ -0,0 +1,70 @@
 #include <Interpreters/ProcessorsProfileLog.h>
 #include <Common/ClickHouseRevision.h>
 #include <DataTypes/DataTypeDate.h>
 #include <DataTypes/DataTypeDateTime.h>
 #include <DataTypes/DataTypeDateTime64.h>
 #include <DataTypes/DataTypeLowCardinality.h>
 #include <DataTypes/DataTypeString.h>
 #include <DataTypes/DataTypesNumber.h>
 #include <DataTypes/DataTypeNullable.h>
 #include <DataTypes/DataTypeArray.h>
 #include <base/logger_useful.h>
 #include <array>
 namespace DB
 {
 NamesAndTypesList ProcessorProfileLogElement::getNamesAndTypes()
 {
    return
    {
        {"event_date", std::make_shared<DataTypeDate>()},
        {"event_time", std::make_shared<DataTypeDateTime>()},
        {"event_time_microseconds", std::make_shared<DataTypeDateTime64>(6)},
        {"id", std::make_shared<DataTypeUInt64>()},
        {"parent_ids", std::make_shared<DataTypeArray>(std::make_shared<DataTypeUInt64>())},
        {"query_id", std::make_shared<DataTypeString>()},
        {"name", std::make_shared<DataTypeLowCardinality>(std::make_shared<DataTypeString>())},
        {"elapsed_us", std::make_shared<DataTypeUInt64>()},
        {"input_wait_elapsed_us", std::make_shared<DataTypeUInt64>()},
        {"output_wait_elapsed_us", std::make_shared<DataTypeUInt64>()},
    };
 }
 void ProcessorProfileLogElement::appendToBlock(MutableColumns & columns) const
 {
    size_t i = 0;
    columns[i++]->insert(DateLUT::instance().toDayNum(event_time).toUnderType());
    columns[i++]->insert(event_time);
    columns[i++]->insert(event_time_microseconds);
    columns[i++]->insert(id);
    {
        Array parent_ids_array;
        parent_ids_array.reserve(parent_ids.size());
        for (const UInt64 parent : parent_ids)
            parent_ids_array.emplace_back(parent);
        columns[i++]->insert(parent_ids_array);
    }
    columns[i++]->insertData(query_id.data(), query_id.size());
    columns[i++]->insertData(processor_name.data(), processor_name.size());
    columns[i++]->insert(elapsed_us);
    columns[i++]->insert(input_wait_elapsed_us);
    columns[i++]->insert(output_wait_elapsed_us);
 }
 ProcessorsProfileLog::ProcessorsProfileLog(ContextPtr context_, const String & database_name_,
        const String & table_name_, const String & storage_def_,
        size_t flush_interval_milliseconds_)
  : SystemLog<ProcessorProfileLogElement>(context_, database_name_, table_name_,
        storage_def_, flush_interval_milliseconds_)
 {
 }
 }
--- a/src/Interpreters/ProcessorsProfileLog.h
+++ b/src/Interpreters/ProcessorsProfileLog.h
@ -0,0 +1,46 @@
 #pragma once
 #include <Interpreters/SystemLog.h>
 #include <Core/NamesAndTypes.h>
 #include <Core/NamesAndAliases.h>
 #include <Processors/IProcessor.h>
 namespace DB
 {
 struct ProcessorProfileLogElement
 {
    time_t event_time{};
    Decimal64 event_time_microseconds{};
    UInt64 id;
    std::vector<UInt64> parent_ids;
    String query_id;
    String processor_name;
    /// Microseconds spent in IProcessor::work()
    UInt64 elapsed_us{};
    /// Microseconds spent waiting for input (IProcessor::NeedData)
    UInt64 input_wait_elapsed_us{};
    /// Microseconds spent waiting on a full output port (IProcessor::PortFull)
    UInt64 output_wait_elapsed_us{};
    static std::string name() { return "ProcessorsProfileLog"; }
    static NamesAndTypesList getNamesAndTypes();
    static NamesAndAliases getNamesAndAliases() { return {}; }
    void appendToBlock(MutableColumns & columns) const;
 };
 class ProcessorsProfileLog : public SystemLog<ProcessorProfileLogElement>
 {
 public:
    ProcessorsProfileLog(
        ContextPtr context_,
        const String & database_name_,
        const String & table_name_,
        const String & storage_def_,
        size_t flush_interval_milliseconds_);
 };
 }
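Review note on counter width: the `IProcessor` getters return `uint64_t` and the table columns are declared as `UInt64`, so it is worth knowing how far a narrower microsecond counter would reach. The sketch below only works through that arithmetic; `maxWholeSeconds`, `narrowTo32`, and `counterWidthExample` are hypothetical helper names and are not part of this change.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: largest number of whole seconds that a microsecond
// counter of type T can represent before it overflows.
template <typename T>
constexpr uint64_t maxWholeSeconds()
{
    return static_cast<uint64_t>(std::numeric_limits<T>::max()) / 1000000ULL;
}

// Hypothetical helper: narrowing a 64-bit microsecond count to 32 bits
// keeps only the low 32 bits (unsigned conversion is modular).
constexpr uint32_t narrowTo32(uint64_t us)
{
    return static_cast<uint32_t>(us);
}

// A 32-bit microsecond counter covers 4294 whole seconds, about 71.6 minutes.
static_assert(maxWholeSeconds<uint32_t>() == 4294, "32-bit microsecond range");

inline void counterWidthExample()
{
    // 2^32 microseconds wraps to zero once narrowed to 32 bits.
    assert(narrowTo32(4294967296ULL) == 0);
    // Five microseconds past the wrap point shows up as just 5.
    assert(narrowTo32(4294967296ULL + 5) == 5);
}
```

A processor that runs or waits for longer than about 71.6 minutes would therefore report a wrapped value if its time passed through a 32-bit field, whereas a 64-bit field matches both the getters and the column type.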
--- a/src/Interpreters/SystemLog.cpp
+++ b/src/Interpreters/SystemLog.cpp
@ -9,6 +9,7 @@
 #include <Interpreters/SessionLog.h>
 #include <Interpreters/TextLog.h>
 #include <Interpreters/TraceLog.h>
 #include <Interpreters/ProcessorsProfileLog.h>
 #include <Interpreters/ZooKeeperLog.h>
 #include <Interpreters/InterpreterCreateQuery.h>
 #include <Interpreters/InterpreterRenameQuery.h>
@ -201,6 +202,7 @@ SystemLogs::SystemLogs(ContextPtr global_context, const Poco::Util::AbstractConf
    query_views_log = createSystemLog<QueryViewsLog>(global_context, "system", "query_views_log", config, "query_views_log");
    zookeeper_log = createSystemLog<ZooKeeperLog>(global_context, "system", "zookeeper_log", config, "zookeeper_log");
    session_log = createSystemLog<SessionLog>(global_context, "system", "session_log", config, "session_log");
    processors_profile_log = createSystemLog<ProcessorsProfileLog>(global_context, "system", "processors_profile_log", config, "processors_profile_log");
    if (query_log)
        logs.emplace_back(query_log.get());
@ -226,6 +228,8 @@ SystemLogs::SystemLogs(ContextPtr global_context, const Poco::Util::AbstractConf
        logs.emplace_back(zookeeper_log.get());
    if (session_log)
        logs.emplace_back(session_log.get());
    if (processors_profile_log)
        logs.emplace_back(processors_profile_log.get());
    try
    {
--- a/src/Interpreters/SystemLog.h
+++ b/src/Interpreters/SystemLog.h
@ -43,6 +43,7 @@ class OpenTelemetrySpanLog;
 class QueryViewsLog;
 class ZooKeeperLog;
 class SessionLog;
 class ProcessorsProfileLog;
 /// System logs should be destroyed in destructor of the last Context and before tables,
 ///  because SystemLog destruction makes insert query while flushing data into underlying tables
@ -70,6 +71,8 @@ struct SystemLogs
    std::shared_ptr<ZooKeeperLog> zookeeper_log;
    /// Login, LogOut and Login failure events
    std::shared_ptr<SessionLog> session_log;
    /// Used to log processors profiling
    std::shared_ptr<ProcessorsProfileLog> processors_profile_log;
    std::vector<ISystemLog *> logs;
 };
--- a/src/Interpreters/executeQuery.cpp
+++ b/src/Interpreters/executeQuery.cpp
@ -45,6 +45,7 @@
 #include <Interpreters/OpenTelemetrySpanLog.h>
 #include <Interpreters/ProcessList.h>
 #include <Interpreters/QueryLog.h>
 #include <Interpreters/ProcessorsProfileLog.h>
 #include <Interpreters/ReplaceQueryParameterVisitor.h>
 #include <Interpreters/SelectQueryOptions.h>
 #include <Interpreters/executeQuery.h>
@ -811,6 +812,7 @@ static std::tuple<ASTPtr, BlockIO> executeQueryImpl(
                 log_queries,
                 log_queries_min_type = settings.log_queries_min_type,
                 log_queries_min_query_duration_ms = settings.log_queries_min_query_duration_ms.totalMilliseconds(),
                 log_processors_profiles = settings.log_processors_profiles,
                 status_info_to_query_log,
                 pulling_pipeline = pipeline.pulling()
            ]
@ -866,6 +868,44 @@ static std::tuple<ASTPtr, BlockIO> executeQueryImpl(
                    if (auto query_log = context->getQueryLog())
                        query_log->add(elem);
                }
                if (log_processors_profiles)
                {
                    if (auto processors_profile_log = context->getProcessorsProfileLog())
                    {
                        ProcessorProfileLogElement processor_elem;
                        processor_elem.event_time = time_in_seconds(finish_time);
                        processor_elem.event_time_microseconds = time_in_microseconds(finish_time);
                        processor_elem.query_id = elem.client_info.current_query_id;
                        auto get_proc_id = [](const IProcessor & proc) -> UInt64
                        {
                            return reinterpret_cast<std::uintptr_t>(&proc);
                        };
                        for (const auto & processor : query_pipeline.getProcessors())
                        {
                            std::vector<UInt64> parents;
                            for (const auto & port : processor->getOutputs())
                            {
                                if (!port.isConnected())
                                    continue;
                                const IProcessor & next = port.getInputPort().getProcessor();
                                parents.push_back(get_proc_id(next));
                            }
                            processor_elem.id = get_proc_id(*processor);
                            processor_elem.parent_ids = std::move(parents);
                            processor_elem.processor_name = processor->getName();
                            processor_elem.elapsed_us = processor->getElapsedUs();
                            processor_elem.input_wait_elapsed_us = processor->getInputWaitElapsedUs();
                            processor_elem.output_wait_elapsed_us = processor->getOutputWaitElapsedUs();
                            processors_profile_log->add(processor_elem);
                        }
                    }
                }
                if (auto opentelemetry_span_log = context->getOpenTelemetrySpanLog();
                    context->query_trace_context.trace_id != UUID()
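Review note on `id` and `parent_ids`: the hunk above derives a processor's ID from its address and lists as parents the processors on the other side of its connected output ports. A minimal sketch of the same idea follows, assuming a simplified graph model; `Node`, `nodeId`, `parentIds`, and `parentIdsExample` are illustrative names, not ClickHouse types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a processor: only the downstream links are modelled.
struct Node
{
    std::vector<const Node *> outputs;   // nullptr models a disconnected output port
};

// The ID is the object's address, so it is unique only among objects
// that are alive at the same time.
inline uint64_t nodeId(const Node & node)
{
    return reinterpret_cast<std::uintptr_t>(&node);
}

// Collect the IDs of the nodes that consume this node's output.
inline std::vector<uint64_t> parentIds(const Node & node)
{
    std::vector<uint64_t> parents;
    for (const Node * next : node.outputs)
    {
        if (!next)
            continue;                    // skip ports that are not connected
        parents.push_back(nodeId(*next));
    }
    return parents;
}

// Usage: a source feeding a sink; the source's only parent is the sink.
inline void parentIdsExample()
{
    Node sink;
    Node source{{&sink, nullptr}};
    assert(parentIds(source) == std::vector<uint64_t>{nodeId(sink)});
    assert(parentIds(sink).empty());
}
```

Because the IDs are addresses, they are best used to join rows that share a `query_id`; comparing them across queries is not meaningful in this model.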
--- a/src/Processors/Executors/ExecutingGraph.cpp
+++ b/src/Processors/Executors/ExecutingGraph.cpp
@ -10,7 +10,9 @@ namespace ErrorCodes
    extern const int LOGICAL_ERROR;
 }
-ExecutingGraph::ExecutingGraph(Processors & processors_) : processors(processors_)
+ExecutingGraph::ExecutingGraph(Processors & processors_, bool profile_processors_)
    : processors(processors_)
    , profile_processors(profile_processors_)
 {
    uint64_t num_processors = processors.size();
    nodes.reserve(num_processors);
@ -263,7 +265,33 @@ bool ExecutingGraph::updateNode(uint64_t pid, Queue & queue, Queue & async_queue
                try
                {
-                    node.last_processor_status = node.processor->prepare(node.updated_input_ports, node.updated_output_ports);
+                    auto & processor = *node.processor;
                    IProcessor::Status last_status = node.last_processor_status;
                    IProcessor::Status status = processor.prepare(node.updated_input_ports, node.updated_output_ports);
                    node.last_processor_status = status;
                    if (profile_processors)
                    {
                        /// NeedData
                        if (last_status != IProcessor::Status::NeedData && status == IProcessor::Status::NeedData)
                        {
                            processor.input_wait_watch.restart();
                        }
                        else if (last_status == IProcessor::Status::NeedData && status != IProcessor::Status::NeedData)
                        {
                            processor.input_wait_elapsed_us += processor.input_wait_watch.elapsedMicroseconds();
                        }
                        /// PortFull
                        if (last_status != IProcessor::Status::PortFull && status == IProcessor::Status::PortFull)
                        {
                            processor.output_wait_watch.restart();
                        }
                        else if (last_status == IProcessor::Status::PortFull && status != IProcessor::Status::PortFull)
                        {
                            processor.output_wait_elapsed_us += processor.output_wait_watch.elapsedMicroseconds();
                        }
                    }
                }
                catch (...)
                {
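Review note on the wait accounting: the profiling branch above restarts a watch when a processor enters `NeedData` or `PortFull`, and adds the elapsed time when it leaves that status. A minimal sketch of that transition logic follows. `Status`, `WaitAccumulator`, and the explicit `now_us` timestamps are assumptions made so the example is deterministic; the actual change uses `Stopwatch` members on `IProcessor`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status enum; it mirrors only the names used in this note.
enum class Status { NeedData, PortFull, Ready, Finished };

// Accumulates the total time spent in one tracked status.
struct WaitAccumulator
{
    explicit WaitAccumulator(Status tracked_) : tracked(tracked_) {}

    Status tracked;               // the status whose duration is accumulated
    uint64_t started_at_us = 0;   // timestamp of the last transition into `tracked`
    uint64_t total_us = 0;        // accumulated time spent in `tracked`

    // Call after every prepare() with the previous and the new status.
    void onTransition(Status last, Status now, uint64_t now_us)
    {
        if (last != tracked && now == tracked)
        {
            started_at_us = now_us;              // entered: restart the watch
        }
        else if (last == tracked && now != tracked)
        {
            assert(now_us >= started_at_us);
            total_us += now_us - started_at_us;  // left: add the elapsed interval
        }
        // tracked -> tracked and other -> other change nothing.
    }
};
```

In this sketch an interval that is still open when execution stops is never added to the total; whether that matters in the real executor depends on the final status each processor reaches.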
--- a/src/Processors/Executors/ExecutingGraph.h
+++ b/src/Processors/Executors/ExecutingGraph.h
@ -123,7 +123,7 @@ public:
    using ProcessorsMap = std::unordered_map<const IProcessor *, uint64_t>;
    ProcessorsMap processors_map;
-    explicit ExecutingGraph(Processors & processors_);
+    explicit ExecutingGraph(Processors & processors_, bool profile_processors_);
    const Processors & getProcessors() const { return processors; }
@ -153,6 +153,8 @@ private:
    std::mutex processors_mutex;
    UpgradableMutex nodes_mutex;
    const bool profile_processors;
 };
 }
--- a/src/Processors/Executors/ExecutionThreadContext.cpp
+++ b/src/Processors/Executors/ExecutionThreadContext.cpp
@ -54,8 +54,13 @@ static void executeJob(IProcessor * processor)
 bool ExecutionThreadContext::executeTask()
 {
    std::optional<Stopwatch> execution_time_watch;
 #ifndef NDEBUG
-    Stopwatch execution_time_watch;
+    execution_time_watch.emplace();
 #else
    if (profile_processors)
        execution_time_watch.emplace();
 #endif
    try
@ -69,8 +74,11 @@ bool ExecutionThreadContext::executeTask()
        node->exception = std::current_exception();
    }
    if (profile_processors)
        node->processor->elapsed_us += execution_time_watch->elapsedMicroseconds();
 #ifndef NDEBUG
-    execution_time_ns += execution_time_watch.elapsed();
+    execution_time_ns += execution_time_watch->elapsed();
 #endif
    return node->exception == nullptr;
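Review note on the optional stopwatch: `executeTask` constructs the watch only when it is needed, so release builds with profiling disabled do not read the clock. A minimal sketch of that pattern follows, assuming a tiny stopwatch built on `std::chrono::steady_clock`; `SimpleStopwatch` and `runProfiled` are hypothetical names, and ClickHouse's own `Stopwatch` has a richer interface.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical minimal stopwatch that starts on construction.
class SimpleStopwatch
{
public:
    SimpleStopwatch() : start(std::chrono::steady_clock::now()) {}

    uint64_t elapsedMicroseconds() const
    {
        auto delta = std::chrono::steady_clock::now() - start;
        return static_cast<uint64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(delta).count());
    }

private:
    std::chrono::steady_clock::time_point start;
};

// Runs `job` and adds its duration to `elapsed_us` only when profiling is on.
template <typename Job>
void runProfiled(bool profile, uint64_t & elapsed_us, Job && job)
{
    std::optional<SimpleStopwatch> watch;
    if (profile)
        watch.emplace();   // the clock is read only when profiling is enabled

    job();

    if (profile)
    {
        assert(watch.has_value());
        elapsed_us += watch->elapsedMicroseconds();
    }
}
```

Unlike the hunk above, this sketch does not catch exceptions thrown by the job, so a throwing job would skip the accounting step.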
--- a/src/Processors/Executors/ExecutionThreadContext.h
+++ b/src/Processors/Executors/ExecutionThreadContext.h
@ -35,6 +35,7 @@ public:
 #endif
    const size_t thread_number;
    const bool profile_processors;
    void wait(std::atomic_bool & finished);
    void wakeUp();
@ -55,7 +56,10 @@ public:
    void setException(std::exception_ptr exception_) { exception = std::move(exception_); }
    void rethrowExceptionIfHas();
-    explicit ExecutionThreadContext(size_t thread_number_) : thread_number(thread_number_) {}
+    explicit ExecutionThreadContext(size_t thread_number_, bool profile_processors_)
        : thread_number(thread_number_)
        , profile_processors(profile_processors_)
    {}
 };
 }
--- a/src/Processors/Executors/ExecutorTasks.cpp
+++ b/src/Processors/Executors/ExecutorTasks.cpp
@ -137,7 +137,7 @@ void ExecutorTasks::pushTasks(Queue & queue, Queue & async_queue, ExecutionThrea
    }
 }
-void ExecutorTasks::init(size_t num_threads_)
+void ExecutorTasks::init(size_t num_threads_, bool profile_processors)
 {
    num_threads = num_threads_;
    threads_queue.init(num_threads);
@ -148,7 +148,7 @@ void ExecutorTasks::init(size_t num_threads_)
        executor_contexts.reserve(num_threads);
        for (size_t i = 0; i < num_threads; ++i)
-            executor_contexts.emplace_back(std::make_unique<ExecutionThreadContext>(i));
+            executor_contexts.emplace_back(std::make_unique<ExecutionThreadContext>(i, profile_processors));
    }
 }
--- a/src/Processors/Executors/ExecutorTasks.h
+++ b/src/Processors/Executors/ExecutorTasks.h
@ -53,7 +53,7 @@ public:
    void tryGetTask(ExecutionThreadContext & context);
    void pushTasks(Queue & queue, Queue & async_queue, ExecutionThreadContext & context);
-    void init(size_t num_threads_);
+    void init(size_t num_threads_, bool profile_processors);
    void fill(Queue & queue);
    void processAsyncTasks();
--- a/src/Processors/Executors/PipelineExecutor.cpp
+++ b/src/Processors/Executors/PipelineExecutor.cpp
@ -8,6 +8,7 @@
 #include <QueryPipeline/printPipeline.h>
 #include <Processors/ISource.h>
 #include <Interpreters/ProcessList.h>
 #include <Interpreters/Context.h>
 #include <Interpreters/OpenTelemetrySpanLog.h>
 #include <base/scope_guard_safe.h>
@ -27,9 +28,12 @@ namespace ErrorCodes
 PipelineExecutor::PipelineExecutor(Processors & processors, QueryStatus * elem)
    : process_list_element(elem)
 {
    if (process_list_element)
        profile_processors = process_list_element->getContext()->getSettingsRef().log_processors_profiles;
    try
    {
-        graph = std::make_unique<ExecutingGraph>(processors);
+        graph = std::make_unique<ExecutingGraph>(processors, profile_processors);
    }
    catch (Exception & exception)
    {
@ -259,7 +263,7 @@ void PipelineExecutor::initializeExecution(size_t num_threads)
    Queue queue;
    graph->initializeExecution(queue);
-    tasks.init(num_threads);
+    tasks.init(num_threads, profile_processors);
    tasks.fill(queue);
 }
--- a/src/Processors/Executors/PipelineExecutor.h
+++ b/src/Processors/Executors/PipelineExecutor.h
@ -56,6 +56,8 @@ private:
    /// Flag that checks that initializeExecution was called.
    bool is_execution_initialized = false;
    /// system.processors_profile_log
    bool profile_processors = false;
    std::atomic_bool cancelled = false;
--- a/src/Processors/Executors/PushingPipelineExecutor.cpp
+++ b/src/Processors/Executors/PushingPipelineExecutor.cpp
@ -15,16 +15,16 @@ namespace ErrorCodes
 class PushingSource : public ISource
 {
 public:
-    explicit PushingSource(const Block & header, std::atomic_bool & need_data_flag_)
+    explicit PushingSource(const Block & header, std::atomic_bool & input_wait_flag_)
        : ISource(header)
-        , need_data_flag(need_data_flag_)
+        , input_wait_flag(input_wait_flag_)
    {}
    String getName() const override { return "PushingSource"; }
    void setData(Chunk chunk)
    {
-        need_data_flag = false;
+        input_wait_flag = false;
        data = std::move(chunk);
    }
@ -34,7 +34,7 @@ protected:
    {
        auto status = ISource::prepare();
        if (status == Status::Ready)
-            need_data_flag = true;
+            input_wait_flag = true;
        return status;
    }
@ -46,7 +46,7 @@ protected:
 private:
    Chunk data;
-    std::atomic_bool & need_data_flag;
+    std::atomic_bool & input_wait_flag;
 };
@ -55,7 +55,7 @@ PushingPipelineExecutor::PushingPipelineExecutor(QueryPipeline & pipeline_) : pi
    if (!pipeline.pushing())
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Pipeline for PushingPipelineExecutor must be pushing");
-    pushing_source = std::make_shared<PushingSource>(pipeline.input->getHeader(), need_data_flag);
+    pushing_source = std::make_shared<PushingSource>(pipeline.input->getHeader(), input_wait_flag);
    connect(pushing_source->getPort(), *pipeline.input);
    pipeline.processors.emplace_back(pushing_source);
 }
@ -86,7 +86,7 @@ void PushingPipelineExecutor::start()
    started = true;
    executor = std::make_shared<PipelineExecutor>(pipeline.processors, pipeline.process_list_element);
-    if (!executor->executeStep(&need_data_flag))
+    if (!executor->executeStep(&input_wait_flag))
        throw Exception(ErrorCodes::LOGICAL_ERROR,
                        "Pipeline for PushingPipelineExecutor was finished before all data was inserted");
 }
@ -98,7 +98,7 @@ void PushingPipelineExecutor::push(Chunk chunk)
    pushing_source->setData(std::move(chunk));
-    if (!executor->executeStep(&need_data_flag))
+    if (!executor->executeStep(&input_wait_flag))
        throw Exception(ErrorCodes::LOGICAL_ERROR,
                        "Pipeline for PushingPipelineExecutor was finished before all data was inserted");
 }
--- a/src/Processors/Executors/PushingPipelineExecutor.h
+++ b/src/Processors/Executors/PushingPipelineExecutor.h
@ -47,7 +47,7 @@ public:
 private:
    QueryPipeline & pipeline;
-    std::atomic_bool need_data_flag = false;
+    std::atomic_bool input_wait_flag = false;
    std::shared_ptr<PushingSource> pushing_source;
    PipelineExecutorPtr executor;
--- a/src/Processors/IProcessor.h
+++ b/src/Processors/IProcessor.h
@ -2,6 +2,7 @@
 #include <memory>
 #include <Processors/Port.h>
 #include <Common/Stopwatch.h>
 class EventCounter;
@ -299,14 +300,33 @@ public:
    IQueryPlanStep * getQueryPlanStep() const { return query_plan_step; }
    size_t getQueryPlanStepGroup() const { return query_plan_step_group; }
    uint64_t getElapsedUs() const { return elapsed_us; }
    uint64_t getInputWaitElapsedUs() const { return input_wait_elapsed_us; }
    uint64_t getOutputWaitElapsedUs() const { return output_wait_elapsed_us; }
 protected:
    virtual void onCancel() {}
 private:
    /// For:
    /// - elapsed_us
    friend class ExecutionThreadContext;
    /// For
    /// - input_wait_elapsed_us
    /// - output_wait_elapsed_us
    friend class ExecutingGraph;
    std::atomic<bool> is_cancelled{false};
    std::string processor_description;
    /// For processors_profile_log
    uint64_t elapsed_us = 0;
    Stopwatch input_wait_watch;
    uint64_t input_wait_elapsed_us = 0;
    Stopwatch output_wait_watch;
    uint64_t output_wait_elapsed_us = 0;
    size_t stream_number = NO_STREAM;
    IQueryPlanStep * query_plan_step = nullptr;
--- a/tests/integration/test_input_format_parallel_parsing_memory_tracking/configs/conf.xml
+++ b/tests/integration/test_input_format_parallel_parsing_memory_tracking/configs/conf.xml
@ -18,6 +18,7 @@
    <part_log remove="remove" />
    <crash_log remove="remove" />
    <opentelemetry_span_log remove="remove" />
    <processors_profile_log remove="remove" />
    <!-- just in case it will be enabled by default -->
    <zookeeper_log remove="remove" />
 </clickhouse>
--- a/tests/queries/0_stateless/02210_processors_profile_log.reference
+++ b/tests/queries/0_stateless/02210_processors_profile_log.reference
@ -0,0 +1,38 @@
 -- { echo }
 EXPLAIN PIPELINE SELECT sleep(1);
 (Expression)
 ExpressionTransform
  (SettingQuotaAndLimits)
    (ReadFromStorage)
    SourceFromSingleChunk 0 → 1
 SELECT sleep(1) SETTINGS log_processors_profiles=true, log_queries=1, log_queries_min_type='QUERY_FINISH';
 0
 SYSTEM FLUSH LOGS;
 WITH
    (
        SELECT query_id
        FROM system.query_log
        WHERE current_database = currentDatabase() AND Settings['log_processors_profiles']='1'
    ) AS query_id_
 SELECT
    name,
    multiIf(
        -- ExpressionTransform executes sleep(),
        -- so IProcessor::work() will spend 1 sec.
        name = 'ExpressionTransform', elapsed_us>1e6,
        -- SourceFromSingleChunk, that feed data to ExpressionTransform,
        -- will feed first block and then wait in PortFull.
        name = 'SourceFromSingleChunk', output_wait_elapsed_us>1e6,
        -- NullSource/LazyOutputFormatLazyOutputFormat are the outputs
        -- so they cannot starts to execute before sleep(1) will be executed.
        input_wait_elapsed_us>1e6)
    elapsed
 FROM system.processors_profile_log
 WHERE query_id = query_id_
 ORDER BY name;
 ExpressionTransform	1
 LazyOutputFormat	1
 LimitsCheckingTransform	1
 NullSource	1
 NullSource	1
 SourceFromSingleChunk	1
--- a/tests/queries/0_stateless/02210_processors_profile_log.sql
+++ b/tests/queries/0_stateless/02210_processors_profile_log.sql
@ -0,0 +1,28 @@
 -- { echo }
 EXPLAIN PIPELINE SELECT sleep(1);
 SELECT sleep(1) SETTINGS log_processors_profiles=true, log_queries=1, log_queries_min_type='QUERY_FINISH';
 SYSTEM FLUSH LOGS;
 WITH
    (
        SELECT query_id
        FROM system.query_log
        WHERE current_database = currentDatabase() AND Settings['log_processors_profiles']='1'
    ) AS query_id_
 SELECT
    name,
    multiIf(
        -- ExpressionTransform executes sleep(),
        -- so IProcessor::work() will spend 1 sec.
        name = 'ExpressionTransform', elapsed_us>1e6,
        -- SourceFromSingleChunk, that feed data to ExpressionTransform,
        -- will feed first block and then wait in PortFull.
        name = 'SourceFromSingleChunk', output_wait_elapsed_us>1e6,
        -- NullSource/LazyOutputFormatLazyOutputFormat are the outputs
        -- so they cannot starts to execute before sleep(1) will be executed.
        input_wait_elapsed_us>1e6)
    elapsed
 FROM system.processors_profile_log
 WHERE query_id = query_id_
 ORDER BY name;