Merge branch 'master' into DOCSUP-13927-document-system_views
Commit dfdf262f8b
@@ -18,9 +18,6 @@
    <!-- One NUMA node w/o hyperthreading -->
    <max_threads>12</max_threads>

    <!-- mmap shows some improvements in perf tests -->
    <min_bytes_to_use_mmap_io>64Mi</min_bytes_to_use_mmap_io>

    <!-- disable jit for perf tests -->
    <compile_expressions>0</compile_expressions>
    <compile_aggregate_expressions>0</compile_aggregate_expressions>
16 docs/en/interfaces/third-party/gui.md vendored

@@ -190,4 +190,20 @@ SeekTable is [free](https://www.seektable.com/help/cloud-pricing) for personal/i

[Chadmin](https://github.com/bun4uk/chadmin) is a simple UI for visualizing the queries currently running on your ClickHouse cluster, viewing information about them, and killing them if needed.

### DBM {#dbm}

[DBM](https://dbm.incubator.edurt.io/) is a visual management tool for ClickHouse.

Features:

- Supports query history (pagination, clear all, etc.)
- Supports running the selected SQL clause
- Supports terminating queries
- Supports table management (metadata, delete, preview)
- Supports database management (delete, create)
- Supports custom queries
- Supports managing multiple data sources (connection test, monitoring)
- Supports monitoring (processes, connections, queries)
- Supports data migration

[Original article](https://clickhouse.tech/docs/en/interfaces/third-party/gui/) <!--hide-->
@@ -36,6 +36,9 @@ Additional join types available in ClickHouse:

- `LEFT ANY JOIN`, `RIGHT ANY JOIN` and `INNER ANY JOIN`, partially (for the opposite side of `LEFT` and `RIGHT`) or completely (for `INNER` and `FULL`) disable the Cartesian product for standard `JOIN` types.
- `ASOF JOIN` and `LEFT ASOF JOIN`, joining sequences with a non-exact match. `ASOF JOIN` usage is described below.

!!! note "Note"
    When [join_algorithm](../../../operations/settings/settings.md#settings-join_algorithm) is set to `partial_merge`, `RIGHT JOIN` and `FULL JOIN` are supported only with `ALL` strictness (`SEMI`, `ANTI`, `ANY`, and `ASOF` are not supported).

## Settings {#join-settings}

The default join strictness can be overridden using the [join_default_strictness](../../../operations/settings/settings.md#settings-join_default_strictness) setting.
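A minimal sketch of that override, assuming a session-level `SET` and hypothetical tables `t1` and `t2`:

```sql
-- Treat a bare JOIN as ANY JOIN for this session (the default strictness is ALL).
SET join_default_strictness = 'ANY';

SELECT t1.id, t2.value
FROM t1 JOIN t2 ON t1.id = t2.id;  -- executed as ANY JOIN
```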
@@ -36,6 +36,9 @@ FROM <left_table>

- `LEFT ANY JOIN`, `RIGHT ANY JOIN` and `INNER ANY JOIN`, partially (for the opposite side of `LEFT` and `RIGHT`) or completely (for `INNER` and `FULL`) disable the Cartesian product for standard `JOIN` types.
- `ASOF JOIN` and `LEFT ASOF JOIN`, for joining sequences by a non-exact match. `ASOF JOIN` usage is described below.

!!! note "Note"
    When the [join_algorithm](../../../operations/settings/settings.md#settings-join_algorithm) setting is set to `partial_merge`, `RIGHT JOIN` and `FULL JOIN` are supported only with `ALL` strictness (`SEMI`, `ANTI`, `ANY` and `ASOF` are not supported).

## Settings {#join-settings}

The default strictness can be overridden using the [join_default_strictness](../../../operations/settings/settings.md#settings-join_default_strictness) setting.
@@ -3,15 +3,55 @@ toc_priority: 32
toc_title: Atomic
---

# Atomic {#atomic}

It supports non-blocking [DROP TABLE](#drop-detach-table) and [RENAME TABLE](#rename-table) queries and atomic [EXCHANGE TABLES t1 AND t2](#exchange-tables) queries. The `Atomic` database engine is used by default.

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE test [ENGINE = Atomic];
```

## Specifics and Recommendations {#specifics-and-recommendations}

### Table UUID {#table-uuid}

Every table in an `Atomic` database has a persistent [UUID](../../sql-reference/data-types/uuid.md) and stores its data in the directory `/clickhouse_path/store/xxx/xxxyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy/`, where `xxxyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy` is the UUID of the table.

Usually the UUID is generated automatically, but it can also be specified explicitly when creating the table in the same way (this is not recommended). The [show_table_uuid_in_table_create_query_if_not_nil](../../operations/settings/settings.md#show_table_uuid_in_table_create_query_if_not_nil) setting makes `SHOW CREATE` display the UUID. For example:

```sql
CREATE TABLE name UUID '28f1c61c-2970-457a-bffe-454156ddcfef' (n UInt64) ENGINE = ...;
```
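A minimal sketch of making `SHOW CREATE` display the UUID, assuming a session-level `SET` and the table `name` from the example above:

```sql
SET show_table_uuid_in_table_create_query_if_not_nil = 1;
SHOW CREATE TABLE name;
```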
### RENAME TABLES {#rename-table}

`RENAME` queries are executed without changing the UUID or moving table data. They do not wait for queries that are using the table to finish and are executed immediately.

### DROP/DETACH TABLES {#drop-detach-table}

On `DROP TABLE` no data is removed. The `Atomic` database just marks the table as dropped by moving its metadata to `/clickhouse_path/metadata_dropped/` and notifies a background thread. The delay before the table data is finally deleted is specified by the [database_atomic_delay_before_drop_table_sec](../../operations/server-configuration-parameters/settings.md#database_atomic_delay_before_drop_table_sec) setting.

A synchronous mode can be requested with the `SYNC` modifier, controlled by the [database_atomic_wait_for_drop_and_detach_synchronously](../../operations/settings/settings.md#database_atomic_wait_for_drop_and_detach_synchronously) setting. In this case `DROP` waits for the running `SELECT`, `INSERT` and other queries that use the table to finish. The table is actually removed when it is no longer in use.
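A minimal sketch of the synchronous form, with a hypothetical table `t`:

```sql
-- Wait for queries using the table to finish and remove the data immediately.
DROP TABLE t SYNC;
```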
### EXCHANGE TABLES {#exchange-tables}

`EXCHANGE` swaps tables atomically. So instead of this non-atomic operation:

```sql
RENAME TABLE new_table TO tmp, old_table TO new_table, tmp TO old_table;
```

you can use a single atomic query:

``` sql
EXCHANGE TABLES new_table AND old_table;
```

### ReplicatedMergeTree in Atomic Database {#replicatedmergetree-in-atomic-database}

For [ReplicatedMergeTree](../table-engines/mergetree-family/replication.md#table_engines-replication) tables, it is recommended not to specify the engine parameters, i.e. the path in ZooKeeper and the replica name. In this case the configuration parameters [default_replica_path](../../operations/server-configuration-parameters/settings.md#default_replica_path) and [default_replica_name](../../operations/server-configuration-parameters/settings.md#default_replica_name) are used. If you want to specify the engine parameters explicitly, it is recommended to use the `{uuid}` macro. This makes sure that a unique path is automatically generated in ZooKeeper for each table.
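A sketch of an explicit engine specification that uses the `{uuid}` macro (the ZooKeeper path shown is the documented default; the table and column names are placeholders):

```sql
CREATE TABLE t (n UInt64)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY n;
```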
## See Also

- [system.databases](../../operations/system-tables/databases.md) system table
@@ -1,11 +1,29 @@
---
toc_folder_title: Database Engines
toc_priority: 27
toc_title: Introduction
---

# Database Engines {#database-engines}

Database engines allow you to work with tables.

By default, ClickHouse uses the [Atomic](../../engines/database-engines/atomic.md) database engine, which provides configurable [table engines](../../engines/table-engines/index.md) and an [SQL dialect](../../sql-reference/syntax.md).

You can also use the following database engines:

- [MySQL](../../engines/database-engines/mysql.md)

- [MaterializeMySQL](../../engines/database-engines/materialize-mysql.md)

- [Lazy](../../engines/database-engines/lazy.md)

- [Atomic](../../engines/database-engines/atomic.md)

- [PostgreSQL](../../engines/database-engines/postgresql.md)

- [MaterializedPostgreSQL](../../engines/database-engines/materialized-postgresql.md)

- [Replicated](../../engines/database-engines/replicated.md)

[Original article](https://clickhouse.tech/docs/en/database_engines/) <!--hide-->
@@ -1,16 +1,18 @@
---
toc_priority: 31
toc_title: Lazy
---

# Lazy {#lazy}

Keeps tables in RAM only for `expiration_time_in_seconds` seconds after the last access. Can be used only with \*Log tables.

It is optimized for storing many small \*Log tables, for which there is a long time interval between accesses.

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE testlazy ENGINE = Lazy(expiration_time_in_seconds);
```
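A minimal usage sketch (the 60-second expiration value and the table name are placeholders; tables inside a `Lazy` database must use a \*Log engine, such as `TinyLog`):

```sql
CREATE DATABASE testlazy ENGINE = Lazy(60);

-- Only *Log engines are allowed inside a Lazy database.
CREATE TABLE testlazy.events_hypothetical (ts DateTime, msg String) ENGINE = TinyLog;
```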
[Original article](https://clickhouse.tech/docs/en/database_engines/lazy/) <!--hide-->
197 docs/zh/engines/database-engines/materialize-mysql.md Normal file

@@ -0,0 +1,197 @@
---
toc_priority: 29
toc_title: "[experimental] MaterializedMySQL"
---

# [experimental] MaterializedMySQL {#materialized-mysql}

**This is an experimental feature and should not be used in production.**

Creates a ClickHouse database containing all the tables of a MySQL database and all the data in those tables.

The ClickHouse server works as a MySQL replica. It reads the binlog and performs DDL and DML queries.

This feature is experimental.

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MaterializeMySQL('host:port', ['database' | database], 'user', 'password') [SETTINGS ...]
```

**Engine Parameters**

- `host:port` — MySQL server endpoint.
- `database` — MySQL database name.
- `user` — MySQL user.
- `password` — user password.

**Engine Settings**

- `max_rows_in_buffer` — maximum number of rows that can be cached in memory (for a single table and for cached data that cannot be queried). When this number of rows is exceeded, the data is materialized. Default value: `65505`.
- `max_bytes_in_buffer` — maximum number of bytes that can be cached in memory (for a single table and for cached data that cannot be queried). When this number of bytes is exceeded, the data is materialized. Default value: `1048576`.
- `max_rows_in_buffers` — maximum number of rows that can be cached in memory (for the whole database and for cached data that cannot be queried). When this number of rows is exceeded, the data is materialized. Default value: `65505`.
- `max_bytes_in_buffers` — maximum number of bytes that can be cached in memory (for the whole database and for cached data that cannot be queried). When this number of bytes is exceeded, the data is materialized. Default value: `1048576`.
- `max_flush_data_time` — maximum number of milliseconds that data can be cached in memory (for the whole database and for cached data that cannot be queried). When this interval is exceeded, the data is materialized. Default value: `1000`.
- `max_wait_time_when_mysql_unavailable` — retry interval in milliseconds when MySQL is unavailable. A negative value disables retries. Default value: `1000`.
- `allows_query_when_mysql_lost` — allows querying the materialized tables when MySQL is lost. Default value: `0` (`false`).

```sql
CREATE DATABASE mysql ENGINE = MaterializeMySQL('localhost:3306', 'db', 'user', '***')
     SETTINGS
        allows_query_when_mysql_lost=true,
        max_wait_time_when_mysql_unavailable=10000;
```

**MySQL Server-Side Configuration**

For `MaterializeMySQL` to work correctly, a few mandatory settings must be set on the `MySQL` side:

- `default_authentication_plugin = mysql_native_password`, because `MaterializeMySQL` can only authenticate with this method.
- `gtid_mode = on`, because GTID-based logging is required for correct `MaterializeMySQL` replication. Note that when setting this mode to `On`, you should also specify `enforce_gtid_consistency = on`.

## Virtual Columns {#virtual-columns}

When the `MaterializeMySQL` database engine is used, [ReplacingMergeTree](../../engines/table-engines/mergetree-family/replacingmergetree.md) tables are used together with the virtual `_sign` and `_version` columns.

- `_version` — synchronization version. Type: [UInt64](../../sql-reference/data-types/int-uint.md).
- `_sign` — deletion mark. Type: [Int8](../../sql-reference/data-types/int-uint.md). Possible values:
    - `1` — the row is not deleted,
    - `-1` — the row is deleted.

## Supported Data Types {#data_types-support}

| MySQL                   | ClickHouse                                                   |
|-------------------------|--------------------------------------------------------------|
| TINY                    | [Int8](../../sql-reference/data-types/int-uint.md)           |
| SHORT                   | [Int16](../../sql-reference/data-types/int-uint.md)          |
| INT24                   | [Int32](../../sql-reference/data-types/int-uint.md)          |
| LONG                    | [UInt32](../../sql-reference/data-types/int-uint.md)         |
| LONGLONG                | [UInt64](../../sql-reference/data-types/int-uint.md)         |
| FLOAT                   | [Float32](../../sql-reference/data-types/float.md)           |
| DOUBLE                  | [Float64](../../sql-reference/data-types/float.md)           |
| DECIMAL, NEWDECIMAL     | [Decimal](../../sql-reference/data-types/decimal.md)         |
| DATE, NEWDATE           | [Date](../../sql-reference/data-types/date.md)               |
| DATETIME, TIMESTAMP     | [DateTime](../../sql-reference/data-types/datetime.md)       |
| DATETIME2, TIMESTAMP2   | [DateTime64](../../sql-reference/data-types/datetime64.md)   |
| ENUM                    | [Enum](../../sql-reference/data-types/enum.md)               |
| STRING                  | [String](../../sql-reference/data-types/string.md)           |
| VARCHAR, VAR_STRING     | [String](../../sql-reference/data-types/string.md)           |
| BLOB                    | [String](../../sql-reference/data-types/string.md)           |
| BINARY                  | [FixedString](../../sql-reference/data-types/fixedstring.md) |

Other types are not supported. If a MySQL table contains a column of such a type, ClickHouse throws an "Unhandled data type" exception and stops replication.

[Nullable](../../sql-reference/data-types/nullable.md) is supported.

## Specifics and Recommendations {#specifics-and-recommendations}

### Compatibility Restrictions

Apart from the data type limitations, there are a few restrictions compared with a `MySQL` database that should be resolved before replication is possible:

- Every table in `MySQL` must contain a `PRIMARY KEY`.

- Replication does not work for tables that contain rows with `ENUM` field values outside the range specified in the `ENUM` signature.

### DDL Queries {#ddl-queries}

MySQL DDL queries are converted into the corresponding ClickHouse DDL queries ([ALTER](../../sql-reference/statements/alter/index.md), [CREATE](../../sql-reference/statements/create/index.md), [DROP](../../sql-reference/statements/drop.md), [RENAME](../../sql-reference/statements/rename.md)). If ClickHouse cannot parse a DDL query, the query is ignored.

### Data Replication {#data-replication}

`MaterializeMySQL` does not support direct `INSERT`, `DELETE` and `UPDATE` queries. However, they are supported in terms of data replication:

- A MySQL `INSERT` query is converted into an `INSERT` with `_sign=1`.

- A MySQL `DELETE` query is converted into an `INSERT` with `_sign=-1`.

- A MySQL `UPDATE` query is converted into an `INSERT` with `_sign=-1` and an `INSERT` with `_sign=1`.

### Selecting from MaterializeMySQL Tables {#select}

A `SELECT` query on a `MaterializeMySQL` table has some specifics:

- If `_version` is not specified in the `SELECT` query, the [FINAL](../../sql-reference/statements/select/from.md#select-from-final) modifier is used, so only rows with `MAX(_version)` are selected.

- If `_sign` is not specified in the `SELECT` query, `WHERE _sign=1` is used by default, so deleted rows are not included in the result set.

- The result includes column comments if they exist in the MySQL database tables.
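A sketch that combines the two points above, using the `mysql.test` table from the example below: selecting `_sign` and `_version` explicitly means the implicit `FINAL` / `WHERE _sign = 1` is not applied, so the raw replicated rows become visible.

```sql
-- Inspect raw replicated rows, including rows marked as deleted (_sign = -1).
SELECT a, b, _sign, _version
FROM mysql.test
ORDER BY _version;
```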
### Index Conversion {#index-conversion}

MySQL `PRIMARY KEY` and `INDEX` clauses are converted into an `ORDER BY` tuple in the ClickHouse table.

ClickHouse has only one physical order, determined by the `ORDER BY` clause. To create a new physical order, use [materialized views](../../sql-reference/statements/create/view.md#materialized).

**Notes**

- Rows with `_sign=-1` are not physically deleted from the table.
- Cascading `UPDATE/DELETE` queries are not supported by the `MaterializeMySQL` engine.
- Replication can easily be broken.
- Manual operations on the database and tables are forbidden.
- `MaterializeMySQL` is affected by the [optimize_on_insert](../../operations/settings/settings.md#optimize-on-insert) setting. Data is merged into the corresponding table in the `MaterializeMySQL` database when a table in the MySQL server changes.

## Examples of Use {#examples-of-use}

Queries in MySQL:

``` sql
mysql> CREATE DATABASE db;
mysql> CREATE TABLE db.test (a INT PRIMARY KEY, b INT);
mysql> INSERT INTO db.test VALUES (1, 11), (2, 22);
mysql> DELETE FROM db.test WHERE a=1;
mysql> ALTER TABLE db.test ADD COLUMN c VARCHAR(16);
mysql> UPDATE db.test SET c='Wow!', b=222;
mysql> SELECT * FROM test;
```

```text
+---+------+------+
| a |    b |    c |
+---+------+------+
| 2 |  222 | Wow! |
+---+------+------+
```

Database in ClickHouse, exchanging data with the MySQL server.

The database and the table created:

``` sql
CREATE DATABASE mysql ENGINE = MaterializeMySQL('localhost:3306', 'db', 'user', '***');
SHOW TABLES FROM mysql;
```

``` text
┌─name─┐
│ test │
└──────┘
```

After inserting the data:

``` sql
SELECT * FROM mysql.test;
```

``` text
┌─a─┬──b─┐
│ 1 │ 11 │
│ 2 │ 22 │
└───┴────┘
```

After deleting the data, adding the column and updating:

``` sql
SELECT * FROM mysql.test;
```

``` text
┌─a─┬───b─┬─c────┐
│ 2 │ 222 │ Wow! │
└───┴─────┴──────┘
```

[Original article](https://clickhouse.tech/docs/en/engines/database-engines/materialize-mysql/) <!--hide-->
85 docs/zh/engines/database-engines/materialized-postgresql.md Normal file

@@ -0,0 +1,85 @@
---
toc_priority: 30
toc_title: MaterializedPostgreSQL
---

# [experimental] MaterializedPostgreSQL {#materialize-postgresql}

Creates a ClickHouse database with an initial data dump of the PostgreSQL database tables and starts the replication process, i.e. it runs a background job to apply new changes as they happen to the PostgreSQL database tables in the remote PostgreSQL database.

The ClickHouse server works as a PostgreSQL replica. It reads the WAL and performs DML queries. DDL is not replicated, but can be handled (as described below).

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MaterializedPostgreSQL('host:port', ['database' | database], 'user', 'password') [SETTINGS ...]
```

**Engine Parameters**

- `host:port` — PostgreSQL server endpoint.
- `database` — PostgreSQL database name.
- `user` — PostgreSQL user.
- `password` — user password.

## Settings {#settings}

- [materialized_postgresql_max_block_size](../../operations/settings/settings.md#materialized-postgresql-max-block-size)

- [materialized_postgresql_tables_list](../../operations/settings/settings.md#materialized-postgresql-tables-list)

- [materialized_postgresql_allow_automatic_update](../../operations/settings/settings.md#materialized-postgresql-allow-automatic-update)

``` sql
CREATE DATABASE database1
ENGINE = MaterializedPostgreSQL('postgres1:5432', 'postgres_database', 'postgres_user', 'postgres_password')
SETTINGS materialized_postgresql_max_block_size = 65536,
         materialized_postgresql_tables_list = 'table1,table2,table3';

SELECT * FROM database1.table1;
```

## Requirements {#requirements}

- Set [wal_level](https://www.postgresql.org/docs/current/runtime-config-wal.html) to `logical` and `max_replication_slots` to `2` in the PostgreSQL configuration file.

- Every replicated table must have one of the following [replica identities](https://www.postgresql.org/docs/10/sql-altertable.html#SQL-CREATETABLE-REPLICA-IDENTITY):

1. **default** (primary key)

2. **index**

``` bash
postgres# CREATE TABLE postgres_table (a Integer NOT NULL, b Integer, c Integer NOT NULL, d Integer, e Integer NOT NULL);
postgres# CREATE unique INDEX postgres_table_index on postgres_table(a, c, e);
postgres# ALTER TABLE postgres_table REPLICA IDENTITY USING INDEX postgres_table_index;
```

The primary key is always checked first. If it does not exist, then the index defined as the replica identity index is checked.
If an index is used as the replica identity, there must be only one such index in the table.
You can check which type is used for a specific table with the following command:

``` bash
postgres# SELECT CASE relreplident
          WHEN 'd' THEN 'default'
          WHEN 'n' THEN 'nothing'
          WHEN 'f' THEN 'full'
          WHEN 'i' THEN 'index'
       END AS replica_identity
FROM pg_class
WHERE oid = 'postgres_table'::regclass;
```

## Warning {#warning}

1. [**TOAST**](https://www.postgresql.org/docs/9.5/storage-toast.html) values are not converted. The default value for the data type will be used.

## Example of Use {#example-of-use}

``` sql
CREATE DATABASE postgresql_db
ENGINE = MaterializedPostgreSQL('postgres1:5432', 'postgres_database', 'postgres_user', 'postgres_password');

SELECT * FROM postgresql_db.postgres_table;
```
@@ -1,6 +1,11 @@
---
toc_priority: 30
toc_title: MySQL
---

# MySQL {#mysql}

The MySQL engine maps tables on a remote MySQL server into ClickHouse and lets you run `INSERT` and `SELECT` queries on them, making it easy to exchange data between ClickHouse and MySQL.

The `MySQL` database engine translates its queries into MySQL syntax and sends them to the MySQL server, so you can perform operations such as `SHOW TABLES` or `SHOW CREATE TABLE`.

@@ -10,67 +15,86 @@

- `CREATE TABLE`
- `ALTER`

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MySQL('host:port', ['database' | database], 'user', 'password')
```

**Engine Parameters**

- `host:port` — MySQL server endpoint.
- `database` — MySQL database name.
- `user` — MySQL user.
- `password` — user password.

## Supported Data Types {#data_types-support}

| MySQL                            | ClickHouse                                                    |
|----------------------------------|---------------------------------------------------------------|
| UNSIGNED TINYINT                 | [UInt8](../../sql-reference/data-types/int-uint.md)           |
| TINYINT                          | [Int8](../../sql-reference/data-types/int-uint.md)            |
| UNSIGNED SMALLINT                | [UInt16](../../sql-reference/data-types/int-uint.md)          |
| SMALLINT                         | [Int16](../../sql-reference/data-types/int-uint.md)           |
| UNSIGNED INT, UNSIGNED MEDIUMINT | [UInt32](../../sql-reference/data-types/int-uint.md)          |
| INT, MEDIUMINT                   | [Int32](../../sql-reference/data-types/int-uint.md)           |
| UNSIGNED BIGINT                  | [UInt64](../../sql-reference/data-types/int-uint.md)          |
| BIGINT                           | [Int64](../../sql-reference/data-types/int-uint.md)           |
| FLOAT                            | [Float32](../../sql-reference/data-types/float.md)            |
| DOUBLE                           | [Float64](../../sql-reference/data-types/float.md)            |
| DATE                             | [Date](../../sql-reference/data-types/date.md)                |
| DATETIME, TIMESTAMP              | [DateTime](../../sql-reference/data-types/datetime.md)        |
| BINARY                           | [FixedString](../../sql-reference/data-types/fixedstring.md)  |

All other MySQL data types are converted to [String](../../sql-reference/data-types/string.md).

[Nullable](../../sql-reference/data-types/nullable.md) is supported.

## Global Variables Support {#global-variables-support}

For better compatibility, you can address global variables in MySQL style, as `@@identifier`.

The following variables are supported:
- `version`
- `max_allowed_packet`

!!! warning "Warning"
    For now these variables are stubs and do not correspond to anything.

Example:

``` sql
SELECT @@version;
```

## Examples of Use {#examples-of-use}

Table in MySQL:

``` text
mysql> USE test;
Database changed

mysql> CREATE TABLE `mysql_table` (
    ->   `int_id` INT NOT NULL AUTO_INCREMENT,
    ->   `float` FLOAT NOT NULL,
    ->   PRIMARY KEY (`int_id`));
Query OK, 0 rows affected (0,09 sec)

mysql> insert into mysql_table (`int_id`, `float`) VALUES (1,2);
Query OK, 1 row affected (0,00 sec)

mysql> select * from mysql_table;
+--------+-------+
| int_id | value |
+--------+-------+
|      1 |     2 |
+--------+-------+
1 row in set (0,00 sec)
```

Database in ClickHouse, exchanging data with the MySQL server:

``` sql
CREATE DATABASE mysql_db ENGINE = MySQL('localhost:3306', 'test', 'my_user', 'user_password')
```
138 docs/zh/engines/database-engines/postgresql.md Normal file

@@ -0,0 +1,138 @@
---
toc_priority: 35
toc_title: PostgreSQL
---

# PostgreSQL {#postgresql}

Allows connecting to a remote [PostgreSQL](https://www.postgresql.org) server. Supports read and write operations (`SELECT` and `INSERT` queries) to exchange data between ClickHouse and PostgreSQL.

Gives real-time access to the table list and table structure from the remote PostgreSQL with the help of `SHOW TABLES` and `DESCRIBE TABLE` queries.

Supports table structure modifications (`ALTER TABLE ... ADD|DROP COLUMN`). If the `use_table_cache` parameter (see the engine parameters below) is set to `1`, the table structure is cached and is not checked for modifications, but can be updated with `DETACH` and `ATTACH` queries.

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE test_database
ENGINE = PostgreSQL('host:port', 'database', 'user', 'password'[, `use_table_cache`]);
```

**Engine Parameters**

- `host:port` — PostgreSQL server endpoint.
- `database` — remote database name.
- `user` — PostgreSQL user.
- `password` — user password.
- `use_table_cache` — defines whether the database table structure is cached or not. Optional. Default value: `0`.

## Supported Data Types {#data_types-support}

| PostgreSQL       | ClickHouse                                                     |
|------------------|----------------------------------------------------------------|
| DATE             | [Date](../../sql-reference/data-types/date.md)                 |
| TIMESTAMP        | [DateTime](../../sql-reference/data-types/datetime.md)         |
| REAL             | [Float32](../../sql-reference/data-types/float.md)             |
| DOUBLE           | [Float64](../../sql-reference/data-types/float.md)             |
| DECIMAL, NUMERIC | [Decimal](../../sql-reference/data-types/decimal.md)           |
| SMALLINT         | [Int16](../../sql-reference/data-types/int-uint.md)            |
| INTEGER          | [Int32](../../sql-reference/data-types/int-uint.md)            |
| BIGINT           | [Int64](../../sql-reference/data-types/int-uint.md)            |
| SERIAL           | [UInt32](../../sql-reference/data-types/int-uint.md)           |
| BIGSERIAL        | [UInt64](../../sql-reference/data-types/int-uint.md)           |
| TEXT, CHAR       | [String](../../sql-reference/data-types/string.md)             |
| INTEGER          | Nullable([Int32](../../sql-reference/data-types/int-uint.md))  |
| ARRAY            | [Array](../../sql-reference/data-types/array.md)               |

## Examples of Use {#examples-of-use}

Database in ClickHouse, exchanging data with the PostgreSQL server:

``` sql
CREATE DATABASE test_database
ENGINE = PostgreSQL('postgres1:5432', 'test_database', 'postgres', 'mysecretpassword', 1);
```

``` sql
SHOW DATABASES;
```

``` text
┌─name──────────┐
│ default       │
│ test_database │
│ system        │
└───────────────┘
```

``` sql
SHOW TABLES FROM test_database;
```

``` text
┌─name───────┐
│ test_table │
└────────────┘
```

Reading from the PostgreSQL table:

``` sql
SELECT * FROM test_database.test_table;
```

``` text
┌─id─┬─value─┐
│  1 │     2 │
└────┴───────┘
```

Writing to the PostgreSQL table:

``` sql
INSERT INTO test_database.test_table VALUES (3,4);
SELECT * FROM test_database.test_table;
```

``` text
┌─int_id─┬─value─┐
│      1 │     2 │
│      3 │     4 │
└────────┴───────┘
```

The table structure was modified in PostgreSQL:

``` sql
postgre> ALTER TABLE test_table ADD COLUMN data Text
```

Because the `use_table_cache` parameter was set to `1` when the database was created, the table structure in ClickHouse is cached and therefore not modified:

``` sql
DESCRIBE TABLE test_database.test_table;
```
``` text
┌─name───┬─type──────────────┐
│ id     │ Nullable(Integer) │
│ value  │ Nullable(Integer) │
└────────┴───────────────────┘
```

After detaching the table and attaching it again, the structure is updated:

``` sql
DETACH TABLE test_database.test_table;
ATTACH TABLE test_database.test_table;
DESCRIBE TABLE test_database.test_table;
```
``` text
┌─name───┬─type──────────────┐
│ id     │ Nullable(Integer) │
│ value  │ Nullable(Integer) │
│ data   │ Nullable(String)  │
└────────┴───────────────────┘
```

[Original article](https://clickhouse.tech/docs/en/database-engines/postgresql/) <!--hide-->
116 docs/zh/engines/database-engines/replicated.md Normal file

@@ -0,0 +1,116 @@
# [experimental] Replicated {#replicated}

The engine is based on the [Atomic](../../engines/database-engines/atomic.md) engine. It supports replication of metadata by writing the DDL log to ZooKeeper and executing it on all replicas of a given database.

One ClickHouse server can have several replicated databases running and being updated at the same time, but the same replicated database cannot have multiple replicas on one server.

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE testdb ENGINE = Replicated('zoo_path', 'shard_name', 'replica_name') [SETTINGS ...]
```

**Engine Parameters**

- `zoo_path` — ZooKeeper path. The same ZooKeeper path corresponds to the same database.
- `shard_name` — shard name. Database replicas are grouped into shards by `shard_name`.
- `replica_name` — replica name. Replica names must be different for all replicas of the same shard.

!!! note "Warning"
    For [ReplicatedMergeTree](../table-engines/mergetree-family/replication.md#table_engines-replication) tables, if no arguments are provided, the default arguments `/clickhouse/tables/{uuid}/{shard}` and `{replica}` are used. They can be changed in the server settings [default_replica_path](../../operations/server-configuration-parameters/settings.md#default_replica_path) and [default_replica_name](../../operations/server-configuration-parameters/settings.md#default_replica_name). The macro `{uuid}` is expanded to the table's uuid, while `{shard}` and `{replica}` are expanded to the values from the server config, not from the database engine arguments. In the future it will be possible to use the `shard_name` and `replica_name` of the Replicated database here.

## Specifics and Recommendations {#specifics-and-recommendations}

DDL queries with a `Replicated` database work in a similar way to [ON CLUSTER](../../sql-reference/distributed-ddl.md) queries, with minor differences.

First, the DDL request is attempted on the initiator (the host that originally received the request from the user). If the request is not fulfilled there, the user immediately receives an error and the other hosts do not try to fulfill it. If the request was successfully completed on the initiator, all other hosts automatically retry until they complete it. The initiator tries to wait for the query to be completed on the other hosts (no longer than [distributed_ddl_task_timeout](../../operations/settings/settings.md#distributed_ddl_task_timeout)) and returns a table with the query execution status on each host.

The behavior in case of errors is regulated by the [distributed_ddl_output_mode](../../operations/settings/settings.md#distributed_ddl_output_mode) setting. For a `Replicated` database it is better to set it to `null_status_on_timeout`: if some hosts did not have time to execute the request within [distributed_ddl_task_timeout](../../operations/settings/settings.md#distributed_ddl_task_timeout), no exception is thrown, but their status is shown as `NULL` in the table.
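A minimal sketch of applying that recommendation for the current session, assuming the setting is changed at session level:

```sql
-- Report timed-out hosts with a NULL status instead of throwing an exception.
SET distributed_ddl_output_mode = 'null_status_on_timeout';
```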
The [system.clusters](../../operations/system-tables/clusters.md) system table contains a cluster named like the replicated database, which contains all the replicas of the database. This cluster is updated automatically when replicas are created/deleted, and it can be used for [Distributed](../../engines/table-engines/special/distributed.md#distributed) tables.

When a new replica of the database is created, this replica creates the tables by itself. If the replica has been unavailable for a long time and has lagged behind the replication log, it checks its local metadata against the current metadata in ZooKeeper, moves extra tables with data to a separate non-replicated database (so as not to accidentally delete anything superfluous), creates the missing tables, and updates the table names if they have been renamed. The data is replicated at the `ReplicatedMergeTree` level, i.e. if the table is not replicated, the data will not be replicated (the database is only responsible for metadata).

## Usage Example {#usage-example}

Creating a cluster with three hosts:

``` sql
node1 :) CREATE DATABASE r ENGINE=Replicated('some/path/r','shard1','replica1');
node2 :) CREATE DATABASE r ENGINE=Replicated('some/path/r','shard1','other_replica');
node3 :) CREATE DATABASE r ENGINE=Replicated('some/path/r','other_shard','{replica}');
```

Running the DDL query:

``` sql
CREATE TABLE r.rmt (n UInt64) ENGINE=ReplicatedMergeTree ORDER BY n;
```

``` text
┌─────hosts────────────┬──status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ shard1|replica1      │    0    │       │          2          │        0         │
│ shard1|other_replica │    0    │       │          1          │        0         │
│ other_shard|r1       │    0    │       │          0          │        0         │
└──────────────────────┴─────────┴───────┴─────────────────────┴──────────────────┘
```

Showing the system table:

``` sql
SELECT cluster, shard_num, replica_num, host_name, host_address, port, is_local
FROM system.clusters WHERE cluster='r';
```

``` text
┌─cluster─┬─shard_num─┬─replica_num─┬─host_name─┬─host_address─┬─port─┬─is_local─┐
│ r       │     1     │      1      │   node3   │  127.0.0.1   │ 9002 │    0     │
│ r       │     2     │      1      │   node2   │  127.0.0.1   │ 9001 │    0     │
│ r       │     2     │      2      │   node1   │  127.0.0.1   │ 9000 │    1     │
└─────────┴───────────┴─────────────┴───────────┴──────────────┴──────┴──────────┘
```

Creating a distributed table and inserting data:

``` sql
node2 :) CREATE TABLE r.d (n UInt64) ENGINE=Distributed('r','r','rmt', n % 2);
node3 :) INSERT INTO r SELECT * FROM numbers(10);
node1 :) SELECT materialize(hostName()) AS host, groupArray(n) FROM r.d GROUP BY host;
```

``` text
┌─hosts─┬─groupArray(n)─┐
│ node1 │  [1,3,5,7,9]  │
│ node2 │  [0,2,4,6,8]  │
└───────┴───────────────┘
```

Adding a replica on one more host:

``` sql
node4 :) CREATE DATABASE r ENGINE=Replicated('some/path/r','other_shard','r2');
```

The cluster configuration will look like this:

``` text
┌─cluster─┬─shard_num─┬─replica_num─┬─host_name─┬─host_address─┬─port─┬─is_local─┐
│ r       │     1     │      1      │   node3   │  127.0.0.1   │ 9002 │    0     │
│ r       │     1     │      2      │   node4   │  127.0.0.1   │ 9003 │    0     │
│ r       │     2     │      1      │   node2   │  127.0.0.1   │ 9001 │    0     │
│ r       │     2     │      2      │   node1   │  127.0.0.1   │ 9000 │    1     │
└─────────┴───────────┴─────────────┴───────────┴──────────────┴──────┴──────────┘
```

The distributed table also gets data from the new host:

```sql
node2 :) SELECT materialize(hostName()) AS host, groupArray(n) FROM r.d GROUP BY host;
```

```text
┌─hosts─┬─groupArray(n)─┐
│ node2 │  [1,3,5,7,9]  │
│ node4 │  [0,2,4,6,8]  │
└───────┴───────────────┘
```
16 docs/zh/interfaces/third-party/gui.md vendored

@@ -99,4 +99,20 @@ ClickHouse web interface [Tabix](https://github.com/tabixio/tabix).

- Refactoring.
- Search and navigation.

### DBM {#dbm}

[DBM](https://dbm.incubator.edurt.io/) is a visual management tool for ClickHouse.

Features:

- Supports query history (pagination, clear all, etc.)
- Supports running the selected SQL clause (multiple windows, etc.)
- Supports terminating queries
- Supports table management
- Supports database management
- Supports custom queries
- Supports managing multiple data sources (connection test, monitoring)
- Supports monitoring (processes, connections, queries)
- Supports data migration

[Original article](https://clickhouse.tech/docs/zh/interfaces/third-party/gui/) <!--hide-->
21 docs/zh/sql-reference/distributed-ddl.md Normal file

@@ -0,0 +1,21 @@
---
toc_priority: 32
toc_title: Distributed DDL
---

# Distributed DDL Queries (ON CLUSTER Clause) {#distributed-ddl-queries-on-cluster-clause}

By default the `CREATE`, `DROP`, `ALTER` and `RENAME` queries affect only the current server where they are executed. In a cluster setup, it is possible to run such queries in a distributed manner with the `ON CLUSTER` clause.

For example, the following query creates the `all_hits` `Distributed` table on each host in `cluster`:

``` sql
CREATE TABLE IF NOT EXISTS all_hits ON CLUSTER cluster (p Date, i Int32) ENGINE = Distributed(cluster, default, hits)
```

In order to run these queries correctly, each host must have the same cluster definition (to simplify syncing configs, you can use substitutions from ZooKeeper). They must also connect to the ZooKeeper servers.

The local version of the query will eventually be executed on each host in the cluster, even if some hosts are currently not available.

!!! warning "Warning"
    The order for executing queries within a single host is guaranteed.
23 docs/zh/sql-reference/statements/alter/index.md Normal file

@@ -0,0 +1,23 @@
---
toc_hidden_folder: true
toc_priority: 42
toc_title: INDEX
---

# Manipulating Data Skipping Indices {#manipulations-with-data-skipping-indices}

The following operations are available (a combined sketch follows this list):

- `ALTER TABLE [db].name ADD INDEX name expression TYPE type GRANULARITY value [FIRST|AFTER name]` - Adds the index description to the table metadata.

- `ALTER TABLE [db].name DROP INDEX name` - Removes the index description from the table metadata and deletes the index files from disk.

- `ALTER TABLE [db.]table MATERIALIZE INDEX name IN PARTITION partition_name` - Rebuilds the secondary index `name` in the partition `partition_name`. Implemented as a [mutation](../../../sql-reference/statements/alter/index.md#mutations).
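A minimal sketch of the full cycle; the table, column, index and partition names are hypothetical placeholders:

```sql
-- Add a minmax skipping index on a hypothetical column and build it for existing data.
ALTER TABLE db.table_hypothetical ADD INDEX idx_value value TYPE minmax GRANULARITY 4;
ALTER TABLE db.table_hypothetical MATERIALIZE INDEX idx_value IN PARTITION '2021-01-01';

-- Remove the index again.
ALTER TABLE db.table_hypothetical DROP INDEX idx_value;
```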
The first two commands are lightweight: they only change metadata or remove files.

Also, they are replicated, syncing indices metadata via ZooKeeper.

!!! note "Note"
    Index manipulation is supported only for tables with the [`*MergeTree`](../../../engines/table-engines/mergetree-family/mergetree.md) engine (including [replicated](../../../engines/table-engines/mergetree-family/replication.md) variants).
29 docs/zh/sql-reference/statements/create/database.md Normal file

@@ -0,0 +1,29 @@
---
toc_priority: 35
toc_title: DATABASE
---

# CREATE DATABASE {#query-language-create-database}

Creates a new database.

``` sql
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster] [ENGINE = engine(...)]
```

## Clauses {#clauses}

### IF NOT EXISTS {#if-not-exists}

If the `db_name` database already exists, ClickHouse does not create a new database and:

- Does not throw an exception if the clause is specified.
- Throws an exception if the clause is not specified.

### ON CLUSTER {#on-cluster}

ClickHouse creates the `db_name` database on all the servers of the specified cluster. More details in the [Distributed DDL](../../../sql-reference/distributed-ddl.md) article.

### ENGINE {#engine}

[MySQL](../../../engines/database-engines/mysql.md) allows you to retrieve data from a remote MySQL server. By default, ClickHouse uses its own [database engine](../../../engines/database-engines/index.md). There is also a [lazy](../../../engines/database-engines/lazy.md) engine.
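A sketch that combines the clauses described above, naming the default `Atomic` engine explicitly; the database and cluster names are placeholders:

```sql
CREATE DATABASE IF NOT EXISTS db_hypothetical ON CLUSTER cluster ENGINE = Atomic;
```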
13 docs/zh/sql-reference/statements/create/index.md Normal file

@@ -0,0 +1,13 @@
---
toc_folder_title: CREATE
toc_priority: 34
toc_title: Overview
---

# CREATE Queries {#create-queries}

CREATE queries include the following subset:

- [DATABASE](../../../sql-reference/statements/create/database.md)

[Original article](https://clickhouse.tech/docs/zh/sql-reference/statements/create/) <!--hide-->
243 docs/zh/sql-reference/statements/create/view.md Normal file

@@ -0,0 +1,243 @@
---
toc_priority: 37
toc_title: VIEW
---

# CREATE VIEW {#create-view}

Creates a new view. There are two types of views: normal and materialized.

## Normal {#normal}

Syntax:

``` sql
CREATE [OR REPLACE] VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] AS SELECT ...
```

Normal views do not store any data. They just perform a read from another table on each access. In other words, a normal view is nothing more than a saved query. When reading from a view, this saved query is used as a subquery in the [FROM](../../../sql-reference/statements/select/from.md) clause.

For example, assume you have created a view:

``` sql
CREATE VIEW view AS SELECT ...
```

and written a query:

``` sql
SELECT a, b, c FROM view
```

This query is fully equivalent to using the subquery:

``` sql
SELECT a, b, c FROM (SELECT ...)
```

## Materialized {#materialized}

``` sql
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
```

Materialized views store data transformed by the corresponding [SELECT](../../../sql-reference/statements/select/index.md) query.

When creating a materialized view without `TO [db].[table]`, you must specify `ENGINE` – the table engine for storing data.

When creating a materialized view with `TO [db].[table]`, you must not use `POPULATE`.

A materialized view works as follows: when inserting data into the table specified in `SELECT`, part of the inserted data is transformed by this `SELECT` query, and the result is inserted into the view.

!!! important "Important"
    Materialized views in ClickHouse are more like insert triggers. If there is some aggregation in the view query, it is applied only to the batch of freshly inserted data. Any changes to existing data of the source table (like update, delete, drop partition, etc.) do not change the materialized view.

If you specify `POPULATE`, the existing table data is inserted into the view when creating it, as if running `CREATE TABLE ... AS SELECT ...`. Otherwise, the view contains only the data inserted into the table after the view was created. We **do not recommend** using `POPULATE`, since data inserted into the table during the view creation will not be inserted into it.

A `SELECT` query can contain `DISTINCT`, `GROUP BY`, `ORDER BY`, `LIMIT`, etc. Note that the corresponding conversions are performed independently on each block of inserted data. For example, if `GROUP BY` is set, data is aggregated during insertion, but only within a single packet of inserted data. The data will not be further aggregated. The exception is when using an `ENGINE` that performs data aggregation independently, such as `SummingMergeTree`.

The execution of [ALTER](../../../sql-reference/statements/alter/index.md) queries on materialized views has limitations, so they might be inconvenient. If the materialized view uses the construction `TO [db.]name`, you can `DETACH` the view, run `ALTER` for the target table, and then `ATTACH` the previously detached (`DETACH`) view.
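A sketch of that workaround with hypothetical database, table and view names (the view writes into an explicit target table via `TO`):

```sql
-- Hypothetical target table and materialized view.
CREATE TABLE db.target_table (x Int8, total UInt64) ENGINE = SummingMergeTree ORDER BY x;
CREATE MATERIALIZED VIEW db.mv_hypothetical TO db.target_table
AS SELECT x, count() AS total FROM db.source_table GROUP BY x;

-- To ALTER the target table: detach the view, alter the target, re-attach the view.
DETACH TABLE db.mv_hypothetical;
ALTER TABLE db.target_table ADD COLUMN note String;
ATTACH TABLE db.mv_hypothetical;
```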
Note that materialized views are influenced by the [optimize_on_insert](../../../operations/settings/settings.md#optimize-on-insert) setting. The data is merged before being inserted into the view.

Views look the same as normal tables. For example, they are listed in the result of a `SHOW TABLES` query.

To delete a view, use [DROP VIEW](../../../sql-reference/statements/drop.md#drop-view). `DROP TABLE` also works for views.

## Live View (Experimental) {#live-view}

!!! important "Important"
    This is an experimental feature that may change in backwards-incompatible ways in future releases.
    Enable the use of live views and `WATCH` queries with the [allow_experimental_live_view](../../../operations/settings/settings.md#allow-experimental-live-view) setting. Enter the command `set allow_experimental_live_view = 1`.

```sql
CREATE LIVE VIEW [IF NOT EXISTS] [db.]table_name [WITH [TIMEOUT [value_in_sec] [AND]] [REFRESH [value_in_sec]]] AS SELECT ...
```

Live views store the result of the corresponding [SELECT](../../../sql-reference/statements/select/index.md) query and are updated any time the result of the query changes. The query result, as well as the partial results needed to combine with new data, are stored in memory, providing increased performance for repeated queries. Live views can provide push notifications when the query result changes, using the [WATCH](../../../sql-reference/statements/watch.md) query.

Live views are triggered by inserts into the innermost table specified in the query.

Live views work similarly to how queries on distributed tables work. But instead of combining partial results from different servers, they combine the partial result from current data with the partial result from new data. When a live view query includes a subquery, the cached partial result is only stored for the innermost subquery.

!!! info "Limitations"
    - [Table functions](../../../sql-reference/table-functions/index.md) are not supported as the innermost table.
    - Tables that do not have inserts, such as a [dictionary](../../../sql-reference/dictionaries/index.md), a [system table](../../../operations/system-tables/index.md), a [normal view](#normal) or a [materialized view](#materialized), do not trigger a live view.
    - Only queries where one can combine partial results from the old data plus partial results from the new data will work. Live views do not work for queries that require the complete data set to compute the final result or where the aggregation state must be preserved.
    - Do not work with replicated or distributed tables where inserts are performed on different nodes.
    - Cannot be triggered by multiple tables.

    [WITH REFRESH](#live-view-with-refresh) forces periodic updates of a live view, which in some cases can be used as a workaround.

### Monitoring Changes {#live-view-monitoring}

You can monitor changes in the `LIVE VIEW` query result using the [WATCH](../../../sql-reference/statements/watch.md) query.

```sql
WATCH [db.]live_view
```

**Example:**

```sql
CREATE TABLE mt (x Int8) Engine = MergeTree ORDER BY x;
CREATE LIVE VIEW lv AS SELECT sum(x) FROM mt;
```
Watch a live view while doing parallel inserts into the source table.

```sql
WATCH lv;
```

```bash
┌─sum(x)─┬─_version─┐
│      1 │        1 │
└────────┴──────────┘
┌─sum(x)─┬─_version─┐
│      2 │        2 │
└────────┴──────────┘
┌─sum(x)─┬─_version─┐
│      6 │        3 │
└────────┴──────────┘
...
```

```sql
INSERT INTO mt VALUES (1);
INSERT INTO mt VALUES (2);
INSERT INTO mt VALUES (3);
```

Add the [EVENTS](../../../sql-reference/statements/watch.md#events-clause) clause to obtain change events only.

```sql
WATCH [db.]live_view EVENTS;
```

**Example:**

```sql
WATCH lv EVENTS;
```

```bash
┌─version─┐
│       1 │
└─────────┘
┌─version─┐
│       2 │
└─────────┘
┌─version─┐
│       3 │
└─────────┘
...
```

You can execute a [SELECT](../../../sql-reference/statements/select/index.md) query on a live view in the same way as on any regular view or table. If the query result is cached, it returns the result immediately without running the stored query on the underlying tables.

```sql
SELECT * FROM [db.]live_view WHERE ...
```

### Force Refresh {#live-view-alter-refresh}

You can force a live view refresh using the `ALTER LIVE VIEW [db.]table_name REFRESH` statement.

### WITH TIMEOUT Clause {#live-view-with-timeout}

When a live view is created with the `WITH TIMEOUT` clause, it is dropped automatically after the specified number of seconds has passed since the last [WATCH](../../../sql-reference/statements/watch.md) query that watched the live view.

```sql
CREATE LIVE VIEW [db.]table_name WITH TIMEOUT [value_in_sec] AS SELECT ...
```

If the timeout value is not specified, the value specified by [temporary_live_view_timeout](../../../operations/settings/settings.md#temporary-live-view-timeout) is used.

**Example:**

```sql
CREATE TABLE mt (x Int8) Engine = MergeTree ORDER BY x;
CREATE LIVE VIEW lv WITH TIMEOUT 15 AS SELECT sum(x) FROM mt;
```

### WITH REFRESH Clause {#live-view-with-refresh}

When a live view is created with the `WITH REFRESH` clause, it is refreshed automatically after the specified number of seconds has passed since the last refresh or trigger.

```sql
CREATE LIVE VIEW [db.]table_name WITH REFRESH [value_in_sec] AS SELECT ...
```

If the refresh value is not specified, the value specified by [periodic_live_view_refresh](../../../operations/settings/settings.md#periodic-live-view-refresh) is used.

**Example:**

```sql
CREATE LIVE VIEW lv WITH REFRESH 5 AS SELECT now();
WATCH lv
```

```bash
┌───────────────now()─┬─_version─┐
│ 2021-02-21 08:47:05 │        1 │
└─────────────────────┴──────────┘
┌───────────────now()─┬─_version─┐
│ 2021-02-21 08:47:10 │        2 │
└─────────────────────┴──────────┘
┌───────────────now()─┬─_version─┐
│ 2021-02-21 08:47:15 │        3 │
└─────────────────────┴──────────┘
```

You can combine the `WITH TIMEOUT` and `WITH REFRESH` clauses using the `AND` clause.

```sql
CREATE LIVE VIEW [db.]table_name WITH TIMEOUT [value_in_sec] AND REFRESH [value_in_sec] AS SELECT ...
```

**Example:**

```sql
CREATE LIVE VIEW lv WITH TIMEOUT 15 AND REFRESH 5 AS SELECT now();
```

After 15 seconds, the live view is dropped automatically if there are no active `WATCH` queries.

```sql
WATCH lv
```

```
Code: 60. DB::Exception: Received from localhost:9000. DB::Exception: Table default.lv does not exist..
```

### Usage {#live-view-usage}

The most typical uses of live view tables include:

- Providing push notifications for query result changes to avoid polling.
- Caching results of the most frequent queries to provide immediate query results.
- Watching for table changes and triggering follow-up select queries.
- Watching metrics from system tables using periodic refresh.

[Original article](https://clickhouse.tech/docs/en/sql-reference/statements/create/view/) <!--hide-->
100 docs/zh/sql-reference/statements/drop.md Normal file

@@ -0,0 +1,100 @@
---
toc_priority: 44
toc_title: DROP
---

# DROP Statements {#drop}

Deletes an existing entity. If the `IF EXISTS` clause is specified, these queries do not return an error if the entity does not exist.

## DROP DATABASE {#drop-database}

Deletes all tables inside the `db` database, then deletes the `db` database itself.

Syntax:

``` sql
DROP DATABASE [IF EXISTS] db [ON CLUSTER cluster]
```

## DROP TABLE {#drop-table}

Deletes a table.

Syntax:

``` sql
DROP [TEMPORARY] TABLE [IF EXISTS] [db.]name [ON CLUSTER cluster]
```

## DROP DICTIONARY {#drop-dictionary}

Deletes a dictionary.

Syntax:

``` sql
DROP DICTIONARY [IF EXISTS] [db.]name
```

## DROP USER {#drop-user-statement}

Deletes a user.

Syntax:

``` sql
DROP USER [IF EXISTS] name [,...] [ON CLUSTER cluster_name]
```

## DROP ROLE {#drop-role-statement}

Deletes a role. The deleted role is revoked from all the entities where it was assigned.

Syntax:

``` sql
DROP ROLE [IF EXISTS] name [,...] [ON CLUSTER cluster_name]
```

## DROP ROW POLICY {#drop-row-policy-statement}

Deletes a row policy. The deleted row policy is revoked from all the entities where it was assigned.

Syntax:

``` sql
DROP [ROW] POLICY [IF EXISTS] name [,...] ON [database.]table [,...] [ON CLUSTER cluster_name]
```

## DROP QUOTA {#drop-quota-statement}

Deletes a quota. The deleted quota is revoked from all the entities where it was assigned.

Syntax:

``` sql
DROP QUOTA [IF EXISTS] name [,...] [ON CLUSTER cluster_name]
```

## DROP SETTINGS PROFILE {#drop-settings-profile-statement}

Deletes a settings profile. The deleted settings profile is revoked from all the entities where it was assigned.

Syntax:

``` sql
DROP [SETTINGS] PROFILE [IF EXISTS] name [,...] [ON CLUSTER cluster_name]
```

## DROP VIEW {#drop-view}

Deletes a view. Views can also be deleted with the `DROP TABLE` command, but `DROP VIEW` checks that `[db.]name` is a view.

Syntax:

``` sql
DROP VIEW [IF EXISTS] [db.]name [ON CLUSTER cluster]
```

[Original article](https://clickhouse.tech/docs/zh/sql-reference/statements/drop/) <!--hide-->
42 docs/zh/sql-reference/statements/exchange.md Normal file

@@ -0,0 +1,42 @@
---
toc_priority: 49
toc_title: EXCHANGE
---

# EXCHANGE Statement {#exchange}

Exchanges the names of two tables or dictionaries atomically.
This task can also be accomplished with a [RENAME](./rename.md) query, but in that case the operation is not atomic.

!!! note "Note"
    The `EXCHANGE` query is supported only by the [Atomic](../../engines/database-engines/atomic.md) database engine.

**Syntax**

```sql
EXCHANGE TABLES|DICTIONARIES [db0.]name_A AND [db1.]name_B
```

## EXCHANGE TABLES {#exchange_tables}

Exchanges the names of two tables.

**Syntax**

```sql
EXCHANGE TABLES [db0.]table_A AND [db1.]table_B
```

## EXCHANGE DICTIONARIES {#exchange_dictionaries}

Exchanges the names of two dictionaries.

**Syntax**

```sql
EXCHANGE DICTIONARIES [db0.]dict_A AND [db1.]dict_B
```

**See Also**

- [Dictionaries](../../sql-reference/dictionaries/index.md)
61 docs/zh/sql-reference/statements/rename.md Normal file

@@ -0,0 +1,61 @@
---
toc_priority: 48
toc_title: RENAME
---

# RENAME Statement {#misc_operations-rename}

Renames databases, tables, or dictionaries. Several entities can be renamed in a single query.
Note that a `RENAME` query with several entities is a non-atomic operation. To exchange entity names atomically, use the [EXCHANGE](./exchange.md) statement.

!!! note "Note"
    The `RENAME` query is supported only by the [Atomic](../../engines/database-engines/atomic.md) database engine.

**Syntax**

```sql
RENAME DATABASE|TABLE|DICTIONARY name TO new_name [,...] [ON CLUSTER cluster]
```

## RENAME DATABASE {#misc_operations-rename_database}

Renames a database.

**Syntax**

```sql
RENAME DATABASE atomic_database1 TO atomic_database2 [,...] [ON CLUSTER cluster]
```

## RENAME TABLE {#misc_operations-rename_table}

Renames one or more tables.

Renaming a table is a lightweight operation. If a different database is passed after `TO`, the table is moved to this database. However, the directories containing the databases must reside in the same file system. Otherwise, an error is returned.
If several tables are renamed in one query, the operation is not atomic. It may be partially executed, and queries in other sessions may get a `Table ... does not exist ...` error.

**Syntax**

``` sql
RENAME TABLE [db1.]name1 TO [db2.]name2 [,...] [ON CLUSTER cluster]
```

**Example**

```sql
RENAME TABLE table_A TO table_A_bak, table_B TO table_B_bak;
```

## RENAME DICTIONARY {#rename_dictionary}

Renames one or more dictionaries. This query can be used to move dictionaries between databases.

**Syntax**

```sql
RENAME DICTIONARY [db0.]dict_A TO [db1.]dict_B [,...] [ON CLUSTER cluster]
```

**See Also**

- [Dictionaries](../../sql-reference/dictionaries/index.md)
4 docs/zh/sql-reference/statements/watch.md Normal file

@@ -0,0 +1,4 @@
---
toc_priority: 53
toc_title: WATCH
---
@@ -2285,7 +2285,10 @@ private:
         if (!pager.empty())
         {
             signal(SIGPIPE, SIG_IGN);
-            pager_cmd = ShellCommand::execute(pager, true);
+
+            ShellCommand::Config config(pager);
+            config.pipe_stdin_only = true;
+            pager_cmd = ShellCommand::execute(config);
             out_buf = &pager_cmd->in;
         }
         else
@@ -766,6 +766,12 @@ if (ThreadFuzzer::instance().isEffective())
        fs::create_directories(dictionaries_lib_path);
    }

    {
        std::string user_scripts_path = config().getString("user_scripts_path", path / "user_scripts/");
        global_context->setUserScriptsPath(user_scripts_path);
        fs::create_directories(user_scripts_path);
    }

    /// top_level_domains_lists
    {
        const std::string & top_level_domains_path = config().getString("top_level_domains_path", path / "top_level_domains/");
@@ -44,7 +44,7 @@ void processTableFiles(const fs::path & path, const String & files_prefix, Strin
         writeIntText(fs::file_size(file_path), metadata_buf);
         writeChar('\n', metadata_buf);
 
-        auto src_buf = createReadBufferFromFileBase(file_path, fs::file_size(file_path), 0, 0, nullptr);
+        auto src_buf = createReadBufferFromFileBase(file_path, {}, fs::file_size(file_path));
         auto dst_buf = create_dst_buf(remote_file_name);
 
         copyData(*src_buf, *dst_buf);
@@ -2,7 +2,6 @@
#include <AggregateFunctions/AggregateFunctionIf.h>
#include "AggregateFunctionNull.h"


namespace DB
{
@@ -46,6 +45,29 @@ public:
    }
};

/** Given an array of flags, checks if it's all zeros
  * When the buffer is all zeros, this is slightly faster than doing a memcmp since doesn't require allocating memory
  * When the buffer has values, this is much faster since it avoids visiting all memory (and the allocation and function calls)
  */
static bool ALWAYS_INLINE inline is_all_zeros(const UInt8 * flags, size_t size)
{
    size_t unroll_size = size - size % 8;
    size_t i = 0;
    while (i < unroll_size)
    {
        UInt64 v = *reinterpret_cast<const UInt64 *>(&flags[i]);
        if (v)
            return false;
        i += 8;
    }

    for (; i < size; i++)
        if (flags[i])
            return false;

    return true;
}

/** There are two cases: for single argument and variadic.
  * Code for single argument is much more efficient.
  */
@@ -112,6 +134,38 @@ public:
        }
    }

    void addBatchSinglePlace(size_t batch_size, AggregateDataPtr place, const IColumn ** columns, Arena * arena, ssize_t) const override
    {
        const ColumnNullable * column = assert_cast<const ColumnNullable *>(columns[0]);
        const UInt8 * null_map = column->getNullMapData().data();
        const IColumn * columns_param[] = {&column->getNestedColumn()};

        const IColumn * filter_column = columns[num_arguments - 1];
        if (const ColumnNullable * nullable_column = typeid_cast<const ColumnNullable *>(filter_column))
            filter_column = nullable_column->getNestedColumnPtr().get();
        if constexpr (result_is_nullable)
        {
            /// We need to check if there is work to do as otherwise setting the flag would be a mistake,
            /// it would mean that the return value would be the default value of the nested type instead of NULL
            if (is_all_zeros(assert_cast<const ColumnUInt8 *>(filter_column)->getData().data(), batch_size))
                return;
        }

        /// Combine the 2 flag arrays so we can call a simplified version (one check vs 2)
        /// Note that now the null map will contain 0 if not null and not filtered, or 1 for null or filtered (or both)
        const auto * filter_flags = assert_cast<const ColumnUInt8 *>(filter_column)->getData().data();
        auto final_nulls = std::make_unique<UInt8[]>(batch_size);
        for (size_t i = 0; i < batch_size; ++i)
            final_nulls[i] = (!!null_map[i]) | (!filter_flags[i]);

        this->nested_function->addBatchSinglePlaceNotNull(
            batch_size, this->nestedPlace(place), columns_param, final_nulls.get(), arena, -1);

        if constexpr (result_is_nullable)
            if (!memoryIsByte(null_map, batch_size, 1))
                this->setFlag(place);
    }

#if USE_EMBEDDED_COMPILER

    void compileAdd(llvm::IRBuilderBase & builder, llvm::Value * aggregate_data_ptr, const DataTypes & arguments_types, const std::vector<llvm::Value *> & argument_values) const override
@@ -1,5 +1,6 @@
#pragma once

#include <memory>
#include <experimental/type_traits>
#include <type_traits>
@ -96,8 +97,9 @@ struct AggregateFunctionSumData
|
||||
Impl::add(sum, local_sum);
|
||||
}
|
||||
|
||||
template <typename Value>
|
||||
void NO_SANITIZE_UNDEFINED NO_INLINE addManyNotNull(const Value * __restrict ptr, const UInt8 * __restrict null_map, size_t count)
|
||||
template <typename Value, bool add_if_zero>
|
||||
void NO_SANITIZE_UNDEFINED NO_INLINE
|
||||
addManyConditional_internal(const Value * __restrict ptr, const UInt8 * __restrict condition_map, size_t count)
|
||||
{
|
||||
const auto * end = ptr + count;
|
||||
|
||||
@ -110,10 +112,10 @@ struct AggregateFunctionSumData
|
||||
T local_sum{};
|
||||
while (ptr < end)
|
||||
{
|
||||
T multiplier = !*null_map;
|
||||
T multiplier = !*condition_map == add_if_zero;
|
||||
Impl::add(local_sum, *ptr * multiplier);
|
||||
++ptr;
|
||||
++null_map;
|
||||
++condition_map;
|
||||
}
|
||||
Impl::add(sum, local_sum);
|
||||
return;
|
||||
@ -130,13 +132,13 @@ struct AggregateFunctionSumData
|
||||
{
|
||||
for (size_t i = 0; i < unroll_count; ++i)
|
||||
{
|
||||
if (!null_map[i])
|
||||
if (!condition_map[i] == add_if_zero)
|
||||
{
|
||||
Impl::add(partial_sums[i], ptr[i]);
|
||||
}
|
||||
}
|
||||
ptr += unroll_count;
|
||||
null_map += unroll_count;
|
||||
condition_map += unroll_count;
|
||||
}
|
||||
|
||||
for (size_t i = 0; i < unroll_count; ++i)
|
||||
@ -146,14 +148,26 @@ struct AggregateFunctionSumData
|
||||
T local_sum{};
|
||||
while (ptr < end)
|
||||
{
|
||||
if (!*null_map)
|
||||
if (!*condition_map == add_if_zero)
|
||||
Impl::add(local_sum, *ptr);
|
||||
++ptr;
|
||||
++null_map;
|
||||
++condition_map;
|
||||
}
|
||||
Impl::add(sum, local_sum);
|
||||
}
|
||||
|
||||
template <typename Value>
|
||||
void ALWAYS_INLINE addManyNotNull(const Value * __restrict ptr, const UInt8 * __restrict null_map, size_t count)
|
||||
{
|
||||
return addManyConditional_internal<Value, true>(ptr, null_map, count);
|
||||
}
|
||||
|
||||
template <typename Value>
|
||||
void ALWAYS_INLINE addManyConditional(const Value * __restrict ptr, const UInt8 * __restrict cond_map, size_t count)
|
||||
{
|
||||
return addManyConditional_internal<Value, false>(ptr, cond_map, count);
|
||||
}
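A small sketch of the predicate used by addManyConditional_internal above, assuming plain UInt8 condition bytes (names are illustrative): with add_if_zero = true a value is summed when its byte is 0 (null-map semantics, 0 means "not NULL"), and with add_if_zero = false when its byte is non-zero (filter semantics, 1 means "keep"). Writing it as a multiplier keeps the hot loop branch-free, which is what lets the compiler vectorize it.

#include <cstdint>
#include <cstdio>

template <bool add_if_zero>
long long conditionalSum(const long long * values, const uint8_t * condition_map, size_t count)
{
    long long sum = 0;
    for (size_t i = 0; i < count; ++i)
    {
        /// Same expression as in the loop above: 1 when the row should be added, 0 otherwise.
        long long multiplier = (!condition_map[i]) == add_if_zero;
        sum += values[i] * multiplier;
    }
    return sum;
}

int main()
{
    long long values[] = {10, 20, 30};
    uint8_t null_map[] = {0, 1, 0};   /// second value is NULL
    uint8_t filter[]   = {1, 0, 1};   /// second value is filtered out

    printf("%lld\n", conditionalSum<true>(values, null_map, 3));   /// 40: skip NULL rows
    printf("%lld\n", conditionalSum<false>(values, filter, 3));    /// 40: keep only filtered-in rows
}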

    void NO_SANITIZE_UNDEFINED merge(const AggregateFunctionSumData & rhs)
    {
        Impl::add(sum, rhs.sum);
@@ -229,8 +243,8 @@ struct AggregateFunctionSumKahanData
        }
    }

    template <typename Value>
    void NO_INLINE addManyNotNull(const Value * __restrict ptr, const UInt8 * __restrict null_map, size_t count)
    template <typename Value, bool add_if_zero>
    void NO_INLINE addManyConditional_internal(const Value * __restrict ptr, const UInt8 * __restrict condition_map, size_t count)
    {
        constexpr size_t unroll_count = 4;
        T partial_sums[unroll_count]{};
@@ -242,10 +256,10 @@ struct AggregateFunctionSumKahanData
        while (ptr < unrolled_end)
        {
            for (size_t i = 0; i < unroll_count; ++i)
                if (!null_map[i])
                if ((!condition_map[i]) == add_if_zero)
                    addImpl(ptr[i], partial_sums[i], partial_compensations[i]);
            ptr += unroll_count;
            null_map += unroll_count;
            condition_map += unroll_count;
        }

        for (size_t i = 0; i < unroll_count; ++i)
@@ -253,13 +267,25 @@ struct AggregateFunctionSumKahanData

        while (ptr < end)
        {
            if (!*null_map)
            if ((!*condition_map) == add_if_zero)
                addImpl(*ptr, sum, compensation);
            ++ptr;
            ++null_map;
            ++condition_map;
        }
    }

    template <typename Value>
    void ALWAYS_INLINE addManyNotNull(const Value * __restrict ptr, const UInt8 * __restrict null_map, size_t count)
    {
        return addManyConditional_internal<Value, true>(ptr, null_map, count);
    }

    template <typename Value>
    void ALWAYS_INLINE addManyConditional(const Value * __restrict ptr, const UInt8 * __restrict cond_map, size_t count)
    {
        return addManyConditional_internal<Value, false>(ptr, cond_map, count);
    }

    void ALWAYS_INLINE mergeImpl(T & to_sum, T & to_compensation, T from_sum, T from_compensation)
    {
        auto raw_sum = to_sum + from_sum;

@@ -352,40 +378,38 @@ public:
        this->data(place).add(column.getData()[row_num]);
    }

    /// Vectorized version when there is no GROUP BY keys.
    void addBatchSinglePlace(
        size_t batch_size, AggregateDataPtr place, const IColumn ** columns, Arena * arena, ssize_t if_argument_pos) const override
        size_t batch_size, AggregateDataPtr place, const IColumn ** columns, Arena *, ssize_t if_argument_pos) const override
    {
        const auto & column = assert_cast<const ColVecType &>(*columns[0]);
        if (if_argument_pos >= 0)
        {
            const auto & flags = assert_cast<const ColumnUInt8 &>(*columns[if_argument_pos]).getData();
            for (size_t i = 0; i < batch_size; ++i)
            {
                if (flags[i])
                    add(place, columns, i, arena);
            }
            this->data(place).addManyConditional(column.getData().data(), flags.data(), batch_size);
        }
        else
        {
            const auto & column = assert_cast<const ColVecType &>(*columns[0]);
            this->data(place).addMany(column.getData().data(), batch_size);
        }
    }

    void addBatchSinglePlaceNotNull(
        size_t batch_size, AggregateDataPtr place, const IColumn ** columns, const UInt8 * null_map, Arena * arena, ssize_t if_argument_pos)
        size_t batch_size, AggregateDataPtr place, const IColumn ** columns, const UInt8 * null_map, Arena *, ssize_t if_argument_pos)
        const override
    {
        const auto & column = assert_cast<const ColVecType &>(*columns[0]);
        if (if_argument_pos >= 0)
        {
            const auto & flags = assert_cast<const ColumnUInt8 &>(*columns[if_argument_pos]).getData();
            /// Merge the 2 sets of flags (null and if) into a single one. This allows us to use parallelizable sums when available
            const auto * if_flags = assert_cast<const ColumnUInt8 &>(*columns[if_argument_pos]).getData().data();
            auto final_flags = std::make_unique<UInt8[]>(batch_size);
            for (size_t i = 0; i < batch_size; ++i)
                if (!null_map[i] && flags[i])
                    add(place, columns, i, arena);
                final_flags[i] = (!null_map[i]) & if_flags[i];

            this->data(place).addManyConditional(column.getData().data(), final_flags.get(), batch_size);
        }
        else
        {
            const auto & column = assert_cast<const ColVecType &>(*columns[0]);
            this->data(place).addManyNotNull(column.getData().data(), null_map, batch_size);
        }
    }
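The two batch entry points above use opposite flag polarities: addManyNotNull skips rows whose byte is 1 (a null map), while addManyConditional keeps rows whose byte is 1 (an -If filter). A minimal sketch of the flag merge done in addBatchSinglePlaceNotNull, assuming plain byte vectors (not part of the commit):

#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<uint8_t> null_map = {0, 0, 1, 0};   /// 1 = NULL, skip
    std::vector<uint8_t> if_flags = {1, 0, 1, 1};   /// 1 = keep
    std::vector<uint8_t> final_flags(null_map.size());

    /// 1 = row participates in the sum, exactly as in addBatchSinglePlaceNotNull
    for (size_t i = 0; i < null_map.size(); ++i)
        final_flags[i] = (!null_map[i]) & if_flags[i];

    for (uint8_t v : final_flags)
        printf("%d ", v);   /// prints: 1 0 0 1
    printf("\n");
}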

@@ -190,6 +190,7 @@ public:
        size_t batch_size, AggregateDataPtr place, const IColumn ** columns, Arena * arena, ssize_t if_argument_pos = -1) const = 0;

    /** The same for a single place, when we need to aggregate only filtered data.
      * Instead of using an if-column, the condition is combined inside the null_map.
      */
    virtual void addBatchSinglePlaceNotNull(
        size_t batch_size,

@@ -41,7 +41,7 @@ std::unique_ptr<ReadBuffer> BackupEntryFromImmutableFile::getReadBuffer() const
    if (disk)
        return disk->readFile(file_path);
    else
        return createReadBufferFromFileBase(file_path, 0, 0, 0, nullptr);
    return createReadBufferFromFileBase(file_path, {}, 0);
}

}

@@ -10,7 +10,7 @@ namespace
{
    String readFile(const String & file_path)
    {
        auto buf = createReadBufferFromFileBase(file_path, 0, 0, 0, nullptr);
        auto buf = createReadBufferFromFileBase(file_path, {}, 0);
        String s;
        readStringUntilEOF(s, *buf);
        return s;

@@ -113,7 +113,11 @@ std::unique_ptr<ShellCommand> IBridgeHelper::startBridgeCommand()

    LOG_TRACE(getLog(), "Starting {}", serviceAlias());

    return ShellCommand::executeDirect(path.string(), cmd_args, ShellCommandDestructorStrategy(true));
    ShellCommand::Config command_config(path.string());
    command_config.arguments = cmd_args;
    command_config.terminate_in_destructor_strategy = ShellCommand::DestructorStrategy(true);

    return ShellCommand::executeDirect(command_config);
}

}

@@ -85,7 +85,7 @@ size_t countBytesInFilterWithNull(const IColumn::Filter & filt, const UInt8 * nu
    /// TODO Add duff device for tail?
#endif

    for (; pos < end; ++pos)
    for (; pos < end; ++pos, ++pos2)
        count += (*pos & ~*pos2) != 0;

    return count;
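A scalar reference for what countBytesInFilterWithNull computes, assuming two parallel byte arrays (not part of the commit): a row is counted when its filter byte is set and its null-map byte is not, which is why the tail loop above must advance both cursors in step (the ++pos2 this hunk adds).

#include <cstdint>
#include <cstdio>
#include <vector>

static size_t countBytesInFilterWithNullRef(const std::vector<uint8_t> & filt, const std::vector<uint8_t> & null_map)
{
    size_t count = 0;
    for (size_t i = 0; i < filt.size(); ++i)
        count += (filt[i] & ~null_map[i]) != 0;   /// filtered-in and not NULL
    return count;
}

int main()
{
    std::vector<uint8_t> filt     = {1, 1, 0, 1};
    std::vector<uint8_t> null_map = {0, 1, 0, 0};
    printf("%zu\n", countBytesInFilterWithNullRef(filt, null_map));   /// prints 2
}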

@@ -76,6 +76,7 @@
    M(ActiveAsyncDrainedConnections, "Number of active connections drained asynchronously.") \
    M(SyncDrainedConnections, "Number of connections drained synchronously.") \
    M(ActiveSyncDrainedConnections, "Number of active connections drained synchronously.") \
    M(AsynchronousReadWait, "Number of threads waiting for asynchronous read.") \

namespace CurrentMetrics
{

@@ -583,6 +583,8 @@
    M(612, OBJECT_ALREADY_STORED_ON_DISK) \
    M(613, OBJECT_WAS_NOT_STORED_ON_DISK) \
    M(614, POSTGRESQL_CONNECTION_FAILURE) \
    M(615, CANNOT_ADVISE) \
    M(616, UNKNOWN_READ_METHOD) \
    \
    M(999, KEEPER_EXCEPTION) \
    M(1000, POCO_EXCEPTION) \

@@ -251,6 +251,15 @@
    \
    M(SleepFunctionCalls, "Number of times a sleep function (sleep, sleepEachRow) has been called.") \
    M(SleepFunctionMicroseconds, "Time spent sleeping due to a sleep function call.") \
    \
    M(ThreadPoolReaderPageCacheHit, "Number of times the read inside ThreadPoolReader was done from page cache.") \
    M(ThreadPoolReaderPageCacheHitBytes, "Number of bytes read inside ThreadPoolReader when it was done from page cache.") \
    M(ThreadPoolReaderPageCacheHitElapsedMicroseconds, "Time spent reading data from page cache in ThreadPoolReader.") \
    M(ThreadPoolReaderPageCacheMiss, "Number of times the read inside ThreadPoolReader was not done from page cache and was handed off to a thread pool.") \
    M(ThreadPoolReaderPageCacheMissBytes, "Number of bytes read inside ThreadPoolReader when the read was not done from page cache and was handed off to a thread pool.") \
    M(ThreadPoolReaderPageCacheMissElapsedMicroseconds, "Time spent reading data inside the asynchronous job in ThreadPoolReader - when the read was not done from page cache.") \
    \
    M(AsynchronousReadWaitMicroseconds, "Time spent waiting for asynchronous reads.") \


namespace ProfileEvents

@@ -173,7 +173,7 @@ void QueryProfilerBase<ProfilerImpl>::tryCleanup()
}

template class QueryProfilerBase<QueryProfilerReal>;
template class QueryProfilerBase<QueryProfilerCpu>;
template class QueryProfilerBase<QueryProfilerCPU>;

QueryProfilerReal::QueryProfilerReal(const UInt64 thread_id, const UInt32 period)
    : QueryProfilerBase(thread_id, CLOCK_MONOTONIC, period, SIGUSR1)
@@ -185,11 +185,11 @@ void QueryProfilerReal::signalHandler(int sig, siginfo_t * info, void * context)
    writeTraceInfo(TraceType::Real, sig, info, context);
}

QueryProfilerCpu::QueryProfilerCpu(const UInt64 thread_id, const UInt32 period)
QueryProfilerCPU::QueryProfilerCPU(const UInt64 thread_id, const UInt32 period)
    : QueryProfilerBase(thread_id, CLOCK_THREAD_CPUTIME_ID, period, SIGUSR2)
{}

void QueryProfilerCpu::signalHandler(int sig, siginfo_t * info, void * context)
void QueryProfilerCPU::signalHandler(int sig, siginfo_t * info, void * context)
{
    DENY_ALLOCATIONS_IN_SCOPE;
    writeTraceInfo(TraceType::CPU, sig, info, context);

@@ -26,7 +26,7 @@ namespace DB
  * 3. write collected stack trace to trace_pipe for TraceCollector
  *
  * Destructor tries to unset timer and restore previous signal handler.
  * Note that signal handler implementation is defined by template parameter. See QueryProfilerReal and QueryProfilerCpu.
  * Note that signal handler implementation is defined by template parameter. See QueryProfilerReal and QueryProfilerCPU.
  */
template <typename ProfilerImpl>
class QueryProfilerBase
@@ -62,10 +62,10 @@ public:
};

/// Query profiler with timer based on CPU clock
class QueryProfilerCpu : public QueryProfilerBase<QueryProfilerCpu>
class QueryProfilerCPU : public QueryProfilerBase<QueryProfilerCPU>
{
public:
    QueryProfilerCpu(const UInt64 thread_id, const UInt32 period);
    QueryProfilerCPU(const UInt64 thread_id, const UInt32 period);

    static void signalHandler(int sig, siginfo_t * info, void * context);
};
@ -20,10 +20,12 @@ namespace
|
||||
/// By these return codes from the child process, we learn (for sure) about errors when creating it.
|
||||
enum class ReturnCodes : int
|
||||
{
|
||||
CANNOT_DUP_STDIN = 0x55555555, /// The value is not important, but it is chosen so that it's rare to conflict with the program return code.
|
||||
CANNOT_DUP_STDOUT = 0x55555556,
|
||||
CANNOT_DUP_STDERR = 0x55555557,
|
||||
CANNOT_EXEC = 0x55555558,
|
||||
CANNOT_DUP_STDIN = 0x55555555, /// The value is not important, but it is chosen so that it's rare to conflict with the program return code.
|
||||
CANNOT_DUP_STDOUT = 0x55555556,
|
||||
CANNOT_DUP_STDERR = 0x55555557,
|
||||
CANNOT_EXEC = 0x55555558,
|
||||
CANNOT_DUP_READ_DESCRIPTOR = 0x55555559,
|
||||
CANNOT_DUP_WRITE_DESCRIPTOR = 0x55555560,
|
||||
};
|
||||
}
|
||||
|
||||
@ -39,12 +41,12 @@ namespace ErrorCodes
|
||||
extern const int CANNOT_CREATE_CHILD_PROCESS;
|
||||
}
|
||||
|
||||
ShellCommand::ShellCommand(pid_t pid_, int & in_fd_, int & out_fd_, int & err_fd_, ShellCommandDestructorStrategy destructor_strategy_)
|
||||
: pid(pid_)
|
||||
, destructor_strategy(destructor_strategy_)
|
||||
, in(in_fd_)
|
||||
ShellCommand::ShellCommand(pid_t pid_, int & in_fd_, int & out_fd_, int & err_fd_, const ShellCommand::Config & config_)
|
||||
: in(in_fd_)
|
||||
, out(out_fd_)
|
||||
, err(err_fd_)
|
||||
, pid(pid_)
|
||||
, config(config_)
|
||||
{
|
||||
}
|
||||
|
||||
@ -58,9 +60,9 @@ ShellCommand::~ShellCommand()
|
||||
if (wait_called)
|
||||
return;
|
||||
|
||||
if (destructor_strategy.terminate_in_destructor)
|
||||
if (config.terminate_in_destructor_strategy.terminate_in_destructor)
|
||||
{
|
||||
size_t try_wait_timeout = destructor_strategy.wait_for_normal_exit_before_termination_seconds;
|
||||
size_t try_wait_timeout = config.terminate_in_destructor_strategy.wait_for_normal_exit_before_termination_seconds;
|
||||
bool process_terminated_normally = tryWaitProcessWithTimeout(try_wait_timeout);
|
||||
|
||||
if (!process_terminated_normally)
|
||||
@ -149,8 +151,7 @@ void ShellCommand::logCommand(const char * filename, char * const argv[])
|
||||
std::unique_ptr<ShellCommand> ShellCommand::executeImpl(
|
||||
const char * filename,
|
||||
char * const argv[],
|
||||
bool pipe_stdin_only,
|
||||
ShellCommandDestructorStrategy terminate_in_destructor_strategy)
|
||||
const Config & config)
|
||||
{
|
||||
logCommand(filename, argv);
|
||||
|
||||
@ -168,9 +169,18 @@ std::unique_ptr<ShellCommand> ShellCommand::executeImpl(
|
||||
PipeFDs pipe_stdout;
|
||||
PipeFDs pipe_stderr;
|
||||
|
||||
std::vector<std::unique_ptr<PipeFDs>> read_pipe_fds;
|
||||
std::vector<std::unique_ptr<PipeFDs>> write_pipe_fds;
|
||||
|
||||
for (size_t i = 0; i < config.read_fds.size(); ++i)
|
||||
read_pipe_fds.emplace_back(std::make_unique<PipeFDs>());
|
||||
|
||||
for (size_t i = 0; i < config.write_fds.size(); ++i)
|
||||
write_pipe_fds.emplace_back(std::make_unique<PipeFDs>());
|
||||
|
||||
pid_t pid = reinterpret_cast<pid_t(*)()>(real_vfork)();
|
||||
|
||||
if (-1 == pid)
|
||||
if (pid == -1)
|
||||
throwFromErrno("Cannot vfork", ErrorCodes::CANNOT_FORK);
|
||||
|
||||
if (0 == pid)
|
||||
@ -184,7 +194,7 @@ std::unique_ptr<ShellCommand> ShellCommand::executeImpl(
|
||||
if (STDIN_FILENO != dup2(pipe_stdin.fds_rw[0], STDIN_FILENO))
|
||||
_exit(int(ReturnCodes::CANNOT_DUP_STDIN));
|
||||
|
||||
if (!pipe_stdin_only)
|
||||
if (!config.pipe_stdin_only)
|
||||
{
|
||||
if (STDOUT_FILENO != dup2(pipe_stdout.fds_rw[1], STDOUT_FILENO))
|
||||
_exit(int(ReturnCodes::CANNOT_DUP_STDOUT));
|
||||
@ -193,6 +203,24 @@ std::unique_ptr<ShellCommand> ShellCommand::executeImpl(
|
||||
_exit(int(ReturnCodes::CANNOT_DUP_STDERR));
|
||||
}
|
||||
|
||||
for (size_t i = 0; i < config.read_fds.size(); ++i)
|
||||
{
|
||||
auto & fds = *read_pipe_fds[i];
|
||||
auto fd = config.read_fds[i];
|
||||
|
||||
if (fd != dup2(fds.fds_rw[1], fd))
|
||||
_exit(int(ReturnCodes::CANNOT_DUP_READ_DESCRIPTOR));
|
||||
}
|
||||
|
||||
for (size_t i = 0; i < config.write_fds.size(); ++i)
|
||||
{
|
||||
auto & fds = *write_pipe_fds[i];
|
||||
auto fd = config.write_fds[i];
|
||||
|
||||
if (fd != dup2(fds.fds_rw[0], fd))
|
||||
_exit(int(ReturnCodes::CANNOT_DUP_WRITE_DESCRIPTOR));
|
||||
}
|
||||
|
||||
// Reset the signal mask: it may be non-empty and will be inherited
|
||||
// by the child process, which might not expect this.
|
||||
sigset_t mask;
|
||||
@ -207,35 +235,49 @@ std::unique_ptr<ShellCommand> ShellCommand::executeImpl(
|
||||
}
|
||||
|
||||
std::unique_ptr<ShellCommand> res(new ShellCommand(
|
||||
pid, pipe_stdin.fds_rw[1], pipe_stdout.fds_rw[0], pipe_stderr.fds_rw[0], terminate_in_destructor_strategy));
|
||||
pid,
|
||||
pipe_stdin.fds_rw[1],
|
||||
pipe_stdout.fds_rw[0],
|
||||
pipe_stderr.fds_rw[0],
|
||||
config));
|
||||
|
||||
for (size_t i = 0; i < config.read_fds.size(); ++i)
|
||||
{
|
||||
auto & fds = *read_pipe_fds[i];
|
||||
auto fd = config.read_fds[i];
|
||||
res->read_fds.emplace(fd, fds.fds_rw[0]);
|
||||
}
|
||||
|
||||
for (size_t i = 0; i < config.write_fds.size(); ++i)
|
||||
{
|
||||
auto & fds = *write_pipe_fds[i];
|
||||
auto fd = config.write_fds[i];
|
||||
res->write_fds.emplace(fd, fds.fds_rw[1]);
|
||||
}
|
||||
|
||||
LOG_TRACE(getLogger(), "Started shell command '{}' with pid {}", filename, pid);
|
||||
return res;
|
||||
}
|
||||
|
||||
|
||||
std::unique_ptr<ShellCommand> ShellCommand::execute(
|
||||
const std::string & command,
|
||||
bool pipe_stdin_only,
|
||||
ShellCommandDestructorStrategy terminate_in_destructor_strategy)
|
||||
std::unique_ptr<ShellCommand> ShellCommand::execute(const ShellCommand::Config & config)
|
||||
{
|
||||
/// Arguments in non-constant chunks of memory (as required for `execv`).
|
||||
/// Moreover, their copying must be done before calling `vfork`, so after `vfork` do a minimum of things.
|
||||
std::vector<char> argv0("sh", &("sh"[3]));
|
||||
std::vector<char> argv1("-c", &("-c"[3]));
|
||||
std::vector<char> argv2(command.data(), command.data() + command.size() + 1);
|
||||
auto config_copy = config;
|
||||
config_copy.command = "/bin/sh";
|
||||
config_copy.arguments = {"-c", config.command};
|
||||
|
||||
char * const argv[] = { argv0.data(), argv1.data(), argv2.data(), nullptr };
|
||||
for (const auto & argument : config.arguments)
|
||||
config_copy.arguments.emplace_back(argument);
|
||||
|
||||
return executeImpl("/bin/sh", argv, pipe_stdin_only, terminate_in_destructor_strategy);
|
||||
return executeDirect(config_copy);
|
||||
}
|
||||
|
||||
|
||||
std::unique_ptr<ShellCommand> ShellCommand::executeDirect(
|
||||
const std::string & path,
|
||||
const std::vector<std::string> & arguments,
|
||||
ShellCommandDestructorStrategy terminate_in_destructor_strategy)
|
||||
std::unique_ptr<ShellCommand> ShellCommand::executeDirect(const ShellCommand::Config & config)
|
||||
{
|
||||
const auto & path = config.command;
|
||||
const auto & arguments = config.arguments;
|
||||
|
||||
size_t argv_sum_size = path.size() + 1;
|
||||
for (const auto & arg : arguments)
|
||||
argv_sum_size += arg.size() + 1;
|
||||
@ -255,7 +297,7 @@ std::unique_ptr<ShellCommand> ShellCommand::executeDirect(
|
||||
|
||||
argv[arguments.size() + 1] = nullptr;
|
||||
|
||||
return executeImpl(path.data(), argv.data(), false, terminate_in_destructor_strategy);
|
||||
return executeImpl(path.data(), argv.data(), config);
|
||||
}
|
||||
|
||||
|
||||
@ -307,6 +349,10 @@ void ShellCommand::wait()
|
||||
throw Exception("Cannot dup2 stderr of child process", ErrorCodes::CANNOT_CREATE_CHILD_PROCESS);
|
||||
case int(ReturnCodes::CANNOT_EXEC):
|
||||
throw Exception("Cannot execv in child process", ErrorCodes::CANNOT_CREATE_CHILD_PROCESS);
|
||||
case int(ReturnCodes::CANNOT_DUP_READ_DESCRIPTOR):
|
||||
throw Exception("Cannot dup2 read descriptor of child process", ErrorCodes::CANNOT_CREATE_CHILD_PROCESS);
|
||||
case int(ReturnCodes::CANNOT_DUP_WRITE_DESCRIPTOR):
|
||||
throw Exception("Cannot dup2 write descriptor of child process", ErrorCodes::CANNOT_CREATE_CHILD_PROCESS);
|
||||
default:
|
||||
throw Exception("Child process was exited with return code " + toString(retcode), ErrorCodes::CHILD_WAS_NOT_EXITED_NORMALLY);
|
||||
}
|
||||
|
@@ -23,29 +23,76 @@ namespace DB
 * The second difference is that it allows working simultaneously with the stdin, stdout, and stderr of the running process,
 * and also obtaining the return code and completion status.
 */

struct ShellCommandDestructorStrategy final
{
    explicit ShellCommandDestructorStrategy(bool terminate_in_destructor_, size_t wait_for_normal_exit_before_termination_seconds_ = 0)
        : terminate_in_destructor(terminate_in_destructor_)
        , wait_for_normal_exit_before_termination_seconds(wait_for_normal_exit_before_termination_seconds_)
    {
    }

    bool terminate_in_destructor;

    /// If terminate_in_destructor is true, the destructor waits this many seconds for a normal exit before sending SIGTERM to the created process.
    size_t wait_for_normal_exit_before_termination_seconds = 0;
};

class ShellCommand final
{
private:
    pid_t pid;
    bool wait_called = false;
    ShellCommandDestructorStrategy destructor_strategy;
public:

    ShellCommand(pid_t pid_, int & in_fd_, int & out_fd_, int & err_fd_, ShellCommandDestructorStrategy destructor_strategy_);
    ~ShellCommand();

    struct DestructorStrategy final
    {
        explicit DestructorStrategy(bool terminate_in_destructor_, size_t wait_for_normal_exit_before_termination_seconds_ = 0)
            : terminate_in_destructor(terminate_in_destructor_)
            , wait_for_normal_exit_before_termination_seconds(wait_for_normal_exit_before_termination_seconds_)
        {
        }

        bool terminate_in_destructor;

        /// If terminate_in_destructor is true, the destructor waits this many seconds for a normal exit before sending SIGTERM to the created process.
        size_t wait_for_normal_exit_before_termination_seconds = 0;
    };

    struct Config
    {
        Config(const std::string & command_)
            : command(command_)
        {}

        Config(const char * command_)
            : command(command_)
        {}

        std::string command;

        std::vector<std::string> arguments;

        std::vector<int> read_fds;

        std::vector<int> write_fds;

        bool pipe_stdin_only = false;

        DestructorStrategy terminate_in_destructor_strategy = DestructorStrategy(false);
    };

    /// Run the command using /bin/sh -c.
    /// If terminate_in_destructor is true, send a terminate signal in the destructor and don't wait for the process.
    static std::unique_ptr<ShellCommand> execute(const Config & config);

    /// Run the executable with the specified arguments. `arguments` - without argv[0].
    /// If terminate_in_destructor is true, send a terminate signal in the destructor and don't wait for the process.
    static std::unique_ptr<ShellCommand> executeDirect(const Config & config);

    /// Wait for the process to end, throw an exception if the exit code is not 0 or if the process did not finish by itself.
    void wait();

    /// Wait for the process to finish and return the exit code. Throws an exception if the process did not terminate normally.
    int tryWait();

    WriteBufferFromFile in;    /// If the command reads from stdin, do not forget to call in.close() after writing all the data there.
    ReadBufferFromFile out;
    ReadBufferFromFile err;

    std::unordered_map<int, ReadBufferFromFile> read_fds;
    std::unordered_map<int, WriteBufferFromFile> write_fds;
private:

    pid_t pid;
    Config config;
    bool wait_called = false;

    ShellCommand(pid_t pid_, int & in_fd_, int & out_fd_, int & err_fd_, const Config & config);

    bool tryWaitProcessWithTimeout(size_t timeout_in_seconds);

@@ -54,28 +101,7 @@ private:
    /// Print command name and the list of arguments to log. NOTE: No escaping of arguments is performed.
    static void logCommand(const char * filename, char * const argv[]);

    static std::unique_ptr<ShellCommand> executeImpl(const char * filename, char * const argv[], bool pipe_stdin_only, ShellCommandDestructorStrategy terminate_in_destructor_strategy);

public:
    WriteBufferFromFile in;    /// If the command reads from stdin, do not forget to call in.close() after writing all the data there.
    ReadBufferFromFile out;
    ReadBufferFromFile err;

    ~ShellCommand();

    /// Run the command using /bin/sh -c.
    /// If terminate_in_destructor is true, send a terminate signal in the destructor and don't wait for the process.
    static std::unique_ptr<ShellCommand> execute(const std::string & command, bool pipe_stdin_only = false, ShellCommandDestructorStrategy terminate_in_destructor_strategy = ShellCommandDestructorStrategy(false));

    /// Run the executable with the specified arguments. `arguments` - without argv[0].
    /// If terminate_in_destructor is true, send a terminate signal in the destructor and don't wait for the process.
    static std::unique_ptr<ShellCommand> executeDirect(const std::string & path, const std::vector<std::string> & arguments, ShellCommandDestructorStrategy terminate_in_destructor_strategy = ShellCommandDestructorStrategy(false));

    /// Wait for the process to end, throw an exception if the exit code is not 0 or if the process did not finish by itself.
    void wait();

    /// Wait for the process to finish and return the exit code. Throws an exception if the process did not terminate normally.
    int tryWait();
    static std::unique_ptr<ShellCommand> executeImpl(const char * filename, char * const argv[], const Config & config);
};
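A short usage sketch of the new Config-based API, mirroring the call sites changed elsewhere in this commit (the executable path and arguments are only examples):

#include <Common/ShellCommand.h>
#include <IO/ReadHelpers.h>

void runEcho()
{
    DB::ShellCommand::Config config("/bin/echo");
    config.arguments = {"Hello, world!"};
    /// Ask the destructor to SIGTERM the child if we abandon it, after waiting up to 10 seconds for a normal exit.
    config.terminate_in_destructor_strategy = DB::ShellCommand::DestructorStrategy(true, 10);

    auto command = DB::ShellCommand::executeDirect(config);

    std::string result;
    DB::readStringUntilEOF(result, command->out);
    command->wait();
}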
@ -158,7 +158,7 @@ std::pair<bool, std::string> StudentTTest::compareAndReport(size_t confidence_le
|
||||
|
||||
if (mean_difference > mean_confidence_interval && (mean_difference - mean_confidence_interval > 0.0001)) /// difference must be more than 0.0001, to take into account connection latency.
|
||||
{
|
||||
ss << "Difference at " << confidence_level[confidence_level_index] << "% confidence : ";
|
||||
ss << "Difference at " << confidence_level[confidence_level_index] << "% confidence: ";
|
||||
ss << std::fixed << std::setprecision(8) << "mean difference is " << mean_difference << ", but confidence interval is " << mean_confidence_interval;
|
||||
return {false, ss.str()};
|
||||
}
|
||||
|
@ -29,7 +29,7 @@ namespace DB
|
||||
class QueryStatus;
|
||||
class ThreadStatus;
|
||||
class QueryProfilerReal;
|
||||
class QueryProfilerCpu;
|
||||
class QueryProfilerCPU;
|
||||
class QueryThreadLog;
|
||||
struct OpenTelemetrySpanHolder;
|
||||
class TasksStatsCounters;
|
||||
@ -140,7 +140,7 @@ protected:
|
||||
|
||||
// CPU and Real time query profilers
|
||||
std::unique_ptr<QueryProfilerReal> query_profiler_real;
|
||||
std::unique_ptr<QueryProfilerCpu> query_profiler_cpu;
|
||||
std::unique_ptr<QueryProfilerCPU> query_profiler_cpu;
|
||||
|
||||
Poco::Logger * log = nullptr;
|
||||
|
||||
|
@ -7,7 +7,9 @@
|
||||
|
||||
TEST(LocalAddress, SmokeTest)
|
||||
{
|
||||
auto cmd = DB::ShellCommand::executeDirect("/bin/hostname", {"-i"});
|
||||
DB::ShellCommand::Config config("/bin/hostname");
|
||||
config.arguments = {"-i"};
|
||||
auto cmd = DB::ShellCommand::executeDirect(config);
|
||||
std::string address_str;
|
||||
DB::readString(address_str, cmd->out);
|
||||
cmd->wait();
|
||||
|
@ -28,7 +28,9 @@ TEST(ShellCommand, Execute)
|
||||
|
||||
TEST(ShellCommand, ExecuteDirect)
|
||||
{
|
||||
auto command = ShellCommand::executeDirect("/bin/echo", {"Hello, world!"});
|
||||
ShellCommand::Config config("/bin/echo");
|
||||
config.arguments = {"Hello, world!"};
|
||||
auto command = ShellCommand::executeDirect(config);
|
||||
|
||||
std::string res;
|
||||
readStringUntilEOF(res, command->out);
|
||||
|
@ -32,7 +32,8 @@ protected:
|
||||
|
||||
/// Read compressed data into compressed_buffer. Get size of decompressed data from block header. Checksum if need.
|
||||
///
|
||||
/// If always_copy is true then even if the compressed block is already stored in compressed_in.buffer() it will be copied into own_compressed_buffer.
|
||||
/// If always_copy is true then even if the compressed block is already stored in compressed_in.buffer()
|
||||
/// it will be copied into own_compressed_buffer.
|
||||
/// This is required for CheckingCompressedReadBuffer, since this is just a proxy.
|
||||
///
|
||||
/// Returns number of compressed bytes read.
|
||||
|
@ -36,6 +36,13 @@ bool CompressedReadBufferFromFile::nextImpl()
|
||||
return true;
|
||||
}
|
||||
|
||||
|
||||
void CompressedReadBufferFromFile::prefetch()
|
||||
{
|
||||
file_in.prefetch();
|
||||
}
|
||||
|
||||
|
||||
CompressedReadBufferFromFile::CompressedReadBufferFromFile(std::unique_ptr<ReadBufferFromFileBase> buf, bool allow_different_codecs_)
|
||||
: BufferWithOwnMemory<ReadBuffer>(0), p_file_in(std::move(buf)), file_in(*p_file_in)
|
||||
{
|
||||
@ -46,14 +53,11 @@ CompressedReadBufferFromFile::CompressedReadBufferFromFile(std::unique_ptr<ReadB
|
||||
|
||||
CompressedReadBufferFromFile::CompressedReadBufferFromFile(
|
||||
const std::string & path,
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size,
|
||||
size_t direct_io_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache,
|
||||
size_t buf_size,
|
||||
bool allow_different_codecs_)
|
||||
: BufferWithOwnMemory<ReadBuffer>(0)
|
||||
, p_file_in(createReadBufferFromFileBase(path, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache, buf_size))
|
||||
, p_file_in(createReadBufferFromFileBase(path, settings, estimated_size))
|
||||
, file_in(*p_file_in)
|
||||
{
|
||||
compressed_in = &file_in;
|
||||
|
@ -1,7 +1,8 @@
|
||||
#pragma once
|
||||
|
||||
#include "CompressedReadBufferBase.h"
|
||||
#include <Compression/CompressedReadBufferBase.h>
|
||||
#include <IO/ReadBufferFromFileBase.h>
|
||||
#include <IO/ReadSettings.h>
|
||||
#include <time.h>
|
||||
#include <memory>
|
||||
|
||||
@ -28,13 +29,13 @@ private:
|
||||
size_t size_compressed = 0;
|
||||
|
||||
bool nextImpl() override;
|
||||
void prefetch() override;
|
||||
|
||||
public:
|
||||
CompressedReadBufferFromFile(std::unique_ptr<ReadBufferFromFileBase> buf, bool allow_different_codecs_ = false);
|
||||
|
||||
CompressedReadBufferFromFile(
|
||||
const std::string & path, size_t estimated_size, size_t direct_io_threshold, size_t mmap_threshold, MMappedFileCache * mmap_cache,
|
||||
size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE, bool allow_different_codecs_ = false);
|
||||
const std::string & path, const ReadSettings & settings, size_t estimated_size, bool allow_different_codecs_ = false);
|
||||
|
||||
void seek(size_t offset_in_compressed_file, size_t offset_in_decompressed_block);
|
||||
|
||||
|
@ -37,7 +37,7 @@ int main(int argc, char ** argv)
|
||||
path,
|
||||
[&]()
|
||||
{
|
||||
return createReadBufferFromFileBase(path, 0, 0, 0, nullptr);
|
||||
return createReadBufferFromFileBase(path, {}, 0);
|
||||
},
|
||||
&cache
|
||||
);
|
||||
@ -56,7 +56,7 @@ int main(int argc, char ** argv)
|
||||
path,
|
||||
[&]()
|
||||
{
|
||||
return createReadBufferFromFileBase(path, 0, 0, 0, nullptr);
|
||||
return createReadBufferFromFileBase(path, {}, 0);
|
||||
},
|
||||
&cache
|
||||
);
|
||||
|
@@ -3,6 +3,7 @@
#include <Core/BaseSettings.h>
#include <Core/SettingsEnums.h>
#include <Core/Defines.h>
#include <IO/ReadSettings.h>


namespace Poco::Util
@@ -499,6 +500,11 @@ class IColumn;
    M(UInt64, function_range_max_elements_in_block, 500000000, "Maximum number of values generated by function 'range' per block of data (sum of array sizes for every row in a block, see also 'max_block_size' and 'min_insert_block_size_rows'). It is a safety threshold.", 0) \
    M(ShortCircuitFunctionEvaluation, short_circuit_function_evaluation, ShortCircuitFunctionEvaluation::ENABLE, "Setting for short-circuit function evaluation configuration. Possible values: 'enable', 'disable', 'force_enable'", 0) \
    \
    M(String, local_filesystem_read_method, "pread", "Method of reading data from local filesystem, one of: read, pread, mmap, pread_threadpool.", 0) \
    M(Bool, local_filesystem_read_prefetch, false, "Should use prefetching when reading data from local filesystem.", 0) \
    M(Bool, remote_filesystem_read_prefetch, true, "Should use prefetching when reading data from remote filesystem.", 0) \
    M(Int64, read_priority, 0, "Priority to read data from local filesystem. Only supported for 'pread_threadpool' method.", 0) \
    \
    /** Experimental functions */ \
    M(Bool, allow_experimental_funnel_functions, false, "Enable experimental functions for funnel analysis.", 0) \
    M(Bool, allow_experimental_nlp_functions, false, "Enable experimental functions for natural language processing.", 0) \
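These settings feed the ReadSettings struct that this commit threads through the read path: file and disk reads now take (path, ReadSettings, estimated_size) instead of the old (buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache) argument list. A minimal sketch of the new call shape, assuming default-constructed settings; the exact header names and the std::string helper are assumptions, while createReadBufferFromFileBase(path, settings, estimated_size), local_fs_buffer_size, and adjustBufferSize() are the parts shown in this diff:

#include <string>

#include <IO/ReadSettings.h>
#include <IO/ReadHelpers.h>
#include <IO/createReadBufferFromFileBase.h>

std::string readWholeFile(const std::string & path)
{
    DB::ReadSettings settings;   /// defaults; callers can tune e.g. settings.local_fs_buffer_size
    auto buf = DB::createReadBufferFromFileBase(path, settings, /* estimated_size = */ 0);

    std::string content;
    DB::readStringUntilEOF(content, *buf);
    return content;
}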

src/DataStreams/ShellCommandSource.h (new file, 110 lines)
@@ -0,0 +1,110 @@
#pragma once

#include <memory>

#include <common/logger_useful.h>
#include <Common/ShellCommand.h>
#include <Common/ThreadPool.h>
#include <IO/ReadHelpers.h>
#include <Formats/FormatFactory.h>
#include <Processors/ISimpleTransform.h>
#include <Processors/Sources/SourceWithProgress.h>
#include <Processors/Formats/IInputFormat.h>
#include <Processors/QueryPipeline.h>
#include <Processors/Executors/PullingPipelineExecutor.h>


namespace DB
{

/** A stream that runs a child process, sends data to its stdin in background threads,
  * and receives data from its stdout.
  */
class ShellCommandSource final : public SourceWithProgress
{
public:
    using SendDataTask = std::function<void (void)>;

    ShellCommandSource(
        ContextPtr context,
        const std::string & format,
        const Block & sample_block,
        std::unique_ptr<ShellCommand> command_,
        Poco::Logger * log_,
        std::vector<SendDataTask> && send_data_tasks,
        size_t max_block_size = DEFAULT_BLOCK_SIZE)
        : SourceWithProgress(sample_block)
        , command(std::move(command_))
        , log(log_)
    {
        for (auto && send_data_task : send_data_tasks)
            send_data_threads.emplace_back([task = std::move(send_data_task)]() { task(); });

        pipeline.init(Pipe(FormatFactory::instance().getInput(format, command->out, sample_block, context, max_block_size)));
        executor = std::make_unique<PullingPipelineExecutor>(pipeline);
    }

    ShellCommandSource(
        ContextPtr context,
        const std::string & format,
        const Block & sample_block,
        std::unique_ptr<ShellCommand> command_,
        Poco::Logger * log_,
        size_t max_block_size = DEFAULT_BLOCK_SIZE)
        : SourceWithProgress(sample_block)
        , command(std::move(command_))
        , log(log_)
    {
        pipeline.init(Pipe(FormatFactory::instance().getInput(format, command->out, sample_block, context, max_block_size)));
        executor = std::make_unique<PullingPipelineExecutor>(pipeline);
    }

    ~ShellCommandSource() override
    {
        for (auto & thread : send_data_threads)
            if (thread.joinable())
                thread.join();
    }

protected:
    Chunk generate() override
    {
        Chunk chunk;
        executor->pull(chunk);
        return chunk;
    }

public:
    Status prepare() override
    {
        auto status = SourceWithProgress::prepare();

        if (status == Status::Finished)
        {
            std::string err;
            readStringUntilEOF(err, command->err);
            if (!err.empty())
                LOG_ERROR(log, "Having stderr: {}", err);

            for (auto & thread : send_data_threads)
                if (thread.joinable())
                    thread.join();

            command->wait();
        }

        return status;
    }

    String getName() const override { return "ShellCommandSource"; }

private:

    QueryPipeline pipeline;
    std::unique_ptr<PullingPipelineExecutor> executor;
    std::unique_ptr<ShellCommand> command;
    std::vector<ThreadFromGlobalPool> send_data_threads;
    Poco::Logger * log;
};

}
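A sketch of how a caller wires this source into a Pipe, following the pattern ExecutableDictionarySource uses later in this commit (the executable path, the format name, and any include beyond ShellCommandSource.h are placeholders or assumptions):

#include <DataStreams/ShellCommandSource.h>
#include <Processors/Pipe.h>

DB::Pipe buildPipeFromCommand(DB::ContextPtr context, const DB::Block & sample_block, Poco::Logger * log)
{
    DB::ShellCommand::Config config("/path/to/generator");   /// placeholder executable
    auto command = DB::ShellCommand::execute(config);

    /// The source parses the child's stdout in the given format and owns the process until it finishes.
    return DB::Pipe(std::make_unique<DB::ShellCommandSource>(
        context, "TabSeparated", sample_block, std::move(command), log));
}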
@ -4,6 +4,7 @@
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
void formatBlock(BlockOutputStreamPtr & out, const Block & block)
|
||||
{
|
||||
out->writePrefix();
|
||||
|
@ -4,6 +4,9 @@
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
class Block;
|
||||
|
||||
void formatBlock(BlockOutputStreamPtr & out, const Block & block);
|
||||
|
||||
}
|
||||
|
@ -1,33 +1,26 @@
|
||||
#include "ExecutableDictionarySource.h"
|
||||
|
||||
#include <functional>
|
||||
#include <common/scope_guard.h>
|
||||
#include <DataStreams/OwningBlockInputStream.h>
|
||||
#include <DataStreams/formatBlock.h>
|
||||
#include <Processors/ISimpleTransform.h>
|
||||
#include <Processors/QueryPipeline.h>
|
||||
#include <Processors/Executors/PullingPipelineExecutor.h>
|
||||
#include <Processors/Sources/SourceWithProgress.h>
|
||||
#include <Processors/Formats/IInputFormat.h>
|
||||
#include <Interpreters/Context.h>
|
||||
#include <Formats/FormatFactory.h>
|
||||
#include <IO/WriteHelpers.h>
|
||||
#include <IO/ReadHelpers.h>
|
||||
#include <Common/ShellCommand.h>
|
||||
#include <Common/ThreadPool.h>
|
||||
#include <common/logger_useful.h>
|
||||
#include <common/LocalDateTime.h>
|
||||
#include "DictionarySourceFactory.h"
|
||||
#include "DictionarySourceHelpers.h"
|
||||
#include "DictionaryStructure.h"
|
||||
#include "registerDictionaries.h"
|
||||
#include <Common/ShellCommand.h>
|
||||
|
||||
#include <DataStreams/ShellCommandSource.h>
|
||||
#include <DataStreams/formatBlock.h>
|
||||
|
||||
#include <Interpreters/Context.h>
|
||||
#include <IO/WriteHelpers.h>
|
||||
#include <IO/ReadHelpers.h>
|
||||
|
||||
#include <Dictionaries/DictionarySourceFactory.h>
|
||||
#include <Dictionaries/DictionarySourceHelpers.h>
|
||||
#include <Dictionaries/DictionaryStructure.h>
|
||||
#include <Dictionaries/registerDictionaries.h>
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
static const UInt64 max_block_size = 8192;
|
||||
|
||||
namespace ErrorCodes
|
||||
{
|
||||
extern const int LOGICAL_ERROR;
|
||||
@ -35,42 +28,6 @@ namespace ErrorCodes
|
||||
extern const int UNSUPPORTED_METHOD;
|
||||
}
|
||||
|
||||
namespace
|
||||
{
|
||||
/// Owns ShellCommand and calls wait for it.
|
||||
class ShellCommandOwningTransform final : public ISimpleTransform
|
||||
{
|
||||
private:
|
||||
Poco::Logger * log;
|
||||
std::unique_ptr<ShellCommand> command;
|
||||
public:
|
||||
ShellCommandOwningTransform(const Block & header, Poco::Logger * log_, std::unique_ptr<ShellCommand> command_)
|
||||
: ISimpleTransform(header, header, true), log(log_), command(std::move(command_))
|
||||
{
|
||||
}
|
||||
|
||||
String getName() const override { return "ShellCommandOwningTransform"; }
|
||||
void transform(Chunk &) override {}
|
||||
|
||||
Status prepare() override
|
||||
{
|
||||
auto status = ISimpleTransform::prepare();
|
||||
if (status == Status::Finished)
|
||||
{
|
||||
std::string err;
|
||||
readStringUntilEOF(err, command->err);
|
||||
if (!err.empty())
|
||||
LOG_ERROR(log, "Having stderr: {}", err);
|
||||
|
||||
command->wait();
|
||||
}
|
||||
|
||||
return status;
|
||||
}
|
||||
};
|
||||
|
||||
}
|
||||
|
||||
ExecutableDictionarySource::ExecutableDictionarySource(
|
||||
const DictionaryStructure & dict_struct_,
|
||||
const Configuration & configuration_,
|
||||
@ -112,9 +69,11 @@ Pipe ExecutableDictionarySource::loadAll()
|
||||
throw Exception(ErrorCodes::UNSUPPORTED_METHOD, "ExecutableDictionarySource with implicit_key does not support loadAll method");
|
||||
|
||||
LOG_TRACE(log, "loadAll {}", toString());
|
||||
auto process = ShellCommand::execute(configuration.command);
|
||||
Pipe pipe(FormatFactory::instance().getInput(configuration.format, process->out, sample_block, context, max_block_size));
|
||||
pipe.addTransform(std::make_shared<ShellCommandOwningTransform>(pipe.getHeader(), log, std::move(process)));
|
||||
|
||||
ShellCommand::Config config(configuration.command);
|
||||
auto process = ShellCommand::execute(config);
|
||||
|
||||
Pipe pipe(std::make_unique<ShellCommandSource>(context, configuration.format, sample_block, std::move(process), log));
|
||||
return pipe;
|
||||
}
|
||||
|
||||
@ -131,86 +90,12 @@ Pipe ExecutableDictionarySource::loadUpdatedAll()
|
||||
command_with_update_field += " " + configuration.update_field + " " + DB::toString(LocalDateTime(update_time - configuration.update_lag));
|
||||
|
||||
LOG_TRACE(log, "loadUpdatedAll {}", command_with_update_field);
|
||||
auto process = ShellCommand::execute(command_with_update_field);
|
||||
|
||||
Pipe pipe(FormatFactory::instance().getInput(configuration.format, process->out, sample_block, context, max_block_size));
|
||||
pipe.addTransform(std::make_shared<ShellCommandOwningTransform>(pipe.getHeader(), log, std::move(process)));
|
||||
ShellCommand::Config config(command_with_update_field);
|
||||
auto process = ShellCommand::execute(config);
|
||||
Pipe pipe(std::make_unique<ShellCommandSource>(context, configuration.format, sample_block, std::move(process), log));
|
||||
return pipe;
|
||||
}
|
||||
|
||||
namespace
|
||||
{
|
||||
/** A stream, that runs child process and sends data to its stdin in background thread,
|
||||
* and receives data from its stdout.
|
||||
*
|
||||
* TODO: implement without background thread.
|
||||
*/
|
||||
class SourceWithBackgroundThread final : public SourceWithProgress
|
||||
{
|
||||
public:
|
||||
SourceWithBackgroundThread(
|
||||
ContextPtr context,
|
||||
const std::string & format,
|
||||
const Block & sample_block,
|
||||
const std::string & command_str,
|
||||
Poco::Logger * log_,
|
||||
std::function<void(WriteBufferFromFile &)> && send_data_)
|
||||
: SourceWithProgress(sample_block)
|
||||
, log(log_)
|
||||
, command(ShellCommand::execute(command_str))
|
||||
, send_data(std::move(send_data_))
|
||||
, thread([this] { send_data(command->in); })
|
||||
{
|
||||
pipeline.init(Pipe(FormatFactory::instance().getInput(format, command->out, sample_block, context, max_block_size)));
|
||||
executor = std::make_unique<PullingPipelineExecutor>(pipeline);
|
||||
}
|
||||
|
||||
~SourceWithBackgroundThread() override
|
||||
{
|
||||
if (thread.joinable())
|
||||
thread.join();
|
||||
}
|
||||
|
||||
protected:
|
||||
Chunk generate() override
|
||||
{
|
||||
Chunk chunk;
|
||||
executor->pull(chunk);
|
||||
return chunk;
|
||||
}
|
||||
|
||||
public:
|
||||
Status prepare() override
|
||||
{
|
||||
auto status = SourceWithProgress::prepare();
|
||||
|
||||
if (status == Status::Finished)
|
||||
{
|
||||
std::string err;
|
||||
readStringUntilEOF(err, command->err);
|
||||
if (!err.empty())
|
||||
LOG_ERROR(log, "Having stderr: {}", err);
|
||||
|
||||
if (thread.joinable())
|
||||
thread.join();
|
||||
|
||||
command->wait();
|
||||
}
|
||||
|
||||
return status;
|
||||
}
|
||||
|
||||
String getName() const override { return "SourceWithBackgroundThread"; }
|
||||
|
||||
Poco::Logger * log;
|
||||
QueryPipeline pipeline;
|
||||
std::unique_ptr<PullingPipelineExecutor> executor;
|
||||
std::unique_ptr<ShellCommand> command;
|
||||
std::function<void(WriteBufferFromFile &)> send_data;
|
||||
ThreadFromGlobalPool thread;
|
||||
};
|
||||
}
|
||||
|
||||
Pipe ExecutableDictionarySource::loadIds(const std::vector<UInt64> & ids)
|
||||
{
|
||||
LOG_TRACE(log, "loadIds {} size = {}", toString(), ids.size());
|
||||
@ -229,14 +114,21 @@ Pipe ExecutableDictionarySource::loadKeys(const Columns & key_columns, const std
|
||||
|
||||
Pipe ExecutableDictionarySource::getStreamForBlock(const Block & block)
|
||||
{
|
||||
Pipe pipe(std::make_unique<SourceWithBackgroundThread>(
|
||||
context, configuration.format, sample_block, configuration.command, log,
|
||||
[block, this](WriteBufferFromFile & out) mutable
|
||||
{
|
||||
auto output_stream = context->getOutputStream(configuration.format, out, block.cloneEmpty());
|
||||
formatBlock(output_stream, block);
|
||||
out.close();
|
||||
}));
|
||||
ShellCommand::Config config(configuration.command);
|
||||
auto process = ShellCommand::execute(config);
|
||||
auto * process_in = &process->in;
|
||||
|
||||
ShellCommandSource::SendDataTask task = {[process_in, block, this]()
|
||||
{
|
||||
auto & out = *process_in;
|
||||
auto output_stream = context->getOutputStream(configuration.format, out, block.cloneEmpty());
|
||||
formatBlock(output_stream, block);
|
||||
out.close();
|
||||
}};
|
||||
|
||||
std::vector<ShellCommandSource::SendDataTask> tasks = {task};
|
||||
|
||||
Pipe pipe(std::make_unique<ShellCommandSource>(context, configuration.format, sample_block, std::move(process), log, std::move(tasks)));
|
||||
|
||||
if (configuration.implicit_key)
|
||||
pipe.addTransform(std::make_shared<TransformWithAdditionalColumns>(block, pipe.getHeader()));
|
||||
|
@ -220,9 +220,9 @@ Pipe ExecutablePoolDictionarySource::getStreamForBlock(const Block & block)
|
||||
std::unique_ptr<ShellCommand> process;
|
||||
bool result = process_pool->tryBorrowObject(process, [this]()
|
||||
{
|
||||
bool terminate_in_destructor = true;
|
||||
ShellCommandDestructorStrategy strategy { terminate_in_destructor, configuration.command_termination_timeout };
|
||||
auto shell_command = ShellCommand::execute(configuration.command, false, strategy);
|
||||
ShellCommand::Config config(configuration.command);
|
||||
config.terminate_in_destructor_strategy = ShellCommand::DestructorStrategy{ true /*terminate_in_destructor*/, configuration.command_termination_timeout };
|
||||
auto shell_command = ShellCommand::execute(config);
|
||||
return shell_command;
|
||||
}, configuration.max_command_execution_time * 10000);
|
||||
|
||||
|
@ -88,19 +88,16 @@ std::shared_ptr<FileDownloadMetadata> DiskCacheWrapper::acquireDownloadMetadata(
|
||||
std::unique_ptr<ReadBufferFromFileBase>
|
||||
DiskCacheWrapper::readFile(
|
||||
const String & path,
|
||||
size_t buf_size,
|
||||
size_t estimated_size,
|
||||
size_t direct_io_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache) const
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size) const
|
||||
{
|
||||
if (!cache_file_predicate(path))
|
||||
return DiskDecorator::readFile(path, buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache);
|
||||
return DiskDecorator::readFile(path, settings, estimated_size);
|
||||
|
||||
LOG_DEBUG(log, "Read file {} from cache", backQuote(path));
|
||||
|
||||
if (cache_disk->exists(path))
|
||||
return cache_disk->readFile(path, buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache);
|
||||
return cache_disk->readFile(path, settings, estimated_size);
|
||||
|
||||
auto metadata = acquireDownloadMetadata(path);
|
||||
|
||||
@ -134,8 +131,8 @@ DiskCacheWrapper::readFile(
|
||||
|
||||
auto tmp_path = path + ".tmp";
|
||||
{
|
||||
auto src_buffer = DiskDecorator::readFile(path, buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache);
|
||||
auto dst_buffer = cache_disk->writeFile(tmp_path, buf_size, WriteMode::Rewrite);
|
||||
auto src_buffer = DiskDecorator::readFile(path, settings, estimated_size);
|
||||
auto dst_buffer = cache_disk->writeFile(tmp_path, settings.local_fs_buffer_size, WriteMode::Rewrite);
|
||||
copyData(*src_buffer, *dst_buffer);
|
||||
}
|
||||
cache_disk->moveFile(tmp_path, path);
|
||||
@ -158,9 +155,9 @@ DiskCacheWrapper::readFile(
|
||||
}
|
||||
|
||||
if (metadata->status == DOWNLOADED)
|
||||
return cache_disk->readFile(path, buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache);
|
||||
return cache_disk->readFile(path, settings, estimated_size);
|
||||
|
||||
return DiskDecorator::readFile(path, buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache);
|
||||
return DiskDecorator::readFile(path, settings, estimated_size);
|
||||
}
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase>
|
||||
@ -180,7 +177,7 @@ DiskCacheWrapper::writeFile(const String & path, size_t buf_size, WriteMode mode
|
||||
[this, path, buf_size, mode]()
|
||||
{
|
||||
/// Copy file from cache to actual disk when cached buffer is finalized.
|
||||
auto src_buffer = cache_disk->readFile(path, buf_size, 0, 0, 0, nullptr);
|
||||
auto src_buffer = cache_disk->readFile(path, ReadSettings(), 0);
|
||||
auto dst_buffer = DiskDecorator::writeFile(path, buf_size, mode);
|
||||
copyData(*src_buffer, *dst_buffer);
|
||||
dst_buffer->finalize();
|
||||
|
@ -36,11 +36,8 @@ public:
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase> readFile(
|
||||
const String & path,
|
||||
size_t buf_size,
|
||||
size_t estimated_size,
|
||||
size_t direct_io_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache) const override;
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size) const override;
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase> writeFile(const String & path, size_t buf_size, WriteMode mode) override;
|
||||
|
||||
|
@ -115,9 +115,9 @@ void DiskDecorator::listFiles(const String & path, std::vector<String> & file_na
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase>
|
||||
DiskDecorator::readFile(
|
||||
const String & path, size_t buf_size, size_t estimated_size, size_t direct_io_threshold, size_t mmap_threshold, MMappedFileCache * mmap_cache) const
|
||||
const String & path, const ReadSettings & settings, size_t estimated_size) const
|
||||
{
|
||||
return delegate->readFile(path, buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache);
|
||||
return delegate->readFile(path, settings, estimated_size);
|
||||
}
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase>
|
||||
|
@ -37,11 +37,8 @@ public:
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase> readFile(
|
||||
const String & path,
|
||||
size_t buf_size,
|
||||
size_t estimated_size,
|
||||
size_t direct_io_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache) const override;
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size) const override;
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase> writeFile(
|
||||
const String & path,
|
||||
@ -64,7 +61,8 @@ public:
|
||||
void sync(int fd) const;
|
||||
String getUniqueId(const String & path) const override { return delegate->getUniqueId(path); }
|
||||
bool checkUniqueId(const String & id) const override { return delegate->checkUniqueId(id); }
|
||||
DiskType::Type getType() const override { return delegate->getType(); }
|
||||
DiskType getType() const override { return delegate->getType(); }
|
||||
bool isRemote() const override { return delegate->isRemote(); }
|
||||
bool supportZeroCopyReplication() const override { return delegate->supportZeroCopyReplication(); }
|
||||
void onFreeze(const String & path) override;
|
||||
SyncGuardPtr getDirectorySyncGuard(const String & path) const override;
|
||||
|
@ -238,18 +238,15 @@ void DiskEncrypted::copy(const String & from_path, const std::shared_ptr<IDisk>
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase> DiskEncrypted::readFile(
|
||||
const String & path,
|
||||
size_t buf_size,
|
||||
size_t estimated_size,
|
||||
size_t aio_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache) const
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size) const
|
||||
{
|
||||
auto wrapped_path = wrappedPath(path);
|
||||
auto buffer = delegate->readFile(wrapped_path, buf_size, estimated_size, aio_threshold, mmap_threshold, mmap_cache);
|
||||
auto settings = current_settings.get();
|
||||
auto buffer = delegate->readFile(wrapped_path, settings, estimated_size);
|
||||
auto encryption_settings = current_settings.get();
|
||||
FileEncryption::Header header = readHeader(*buffer);
|
||||
String key = getKey(path, header, *settings);
|
||||
return std::make_unique<ReadBufferFromEncryptedFile>(buf_size, std::move(buffer), key, header);
|
||||
String key = getKey(path, header, *encryption_settings);
|
||||
return std::make_unique<ReadBufferFromEncryptedFile>(settings.local_fs_buffer_size, std::move(buffer), key, header);
|
||||
}
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase> DiskEncrypted::writeFile(const String & path, size_t buf_size, WriteMode mode)
|
||||
@ -265,7 +262,7 @@ std::unique_ptr<WriteBufferFromFileBase> DiskEncrypted::writeFile(const String &
|
||||
if (old_file_size)
|
||||
{
|
||||
/// Append mode: we continue to use the same header.
|
||||
auto read_buffer = delegate->readFile(wrapped_path, FileEncryption::Header::kSize);
|
||||
auto read_buffer = delegate->readFile(wrapped_path, ReadSettings().adjustBufferSize(FileEncryption::Header::kSize));
|
||||
header = readHeader(*read_buffer);
|
||||
key = getKey(path, header, *settings);
|
||||
}
|
||||
|
@ -121,11 +121,8 @@ public:
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase> readFile(
|
||||
const String & path,
|
||||
size_t buf_size,
|
||||
size_t estimated_size,
|
||||
size_t aio_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache) const override;
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size) const override;
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase> writeFile(
|
||||
const String & path,
|
||||
@ -215,7 +212,8 @@ public:
|
||||
|
||||
void applyNewSettings(const Poco::Util::AbstractConfiguration & config, ContextPtr context, const String & config_prefix, const DisksMap & map) override;
|
||||
|
||||
DiskType::Type getType() const override { return DiskType::Type::Encrypted; }
|
||||
DiskType getType() const override { return DiskType::Encrypted; }
|
||||
bool isRemote() const override { return delegate->isRemote(); }
|
||||
|
||||
SyncGuardPtr getDirectorySyncGuard(const String & path) const override;
|
||||
|
||||
|
@ -259,11 +259,9 @@ void DiskLocal::replaceFile(const String & from_path, const String & to_path)
|
||||
fs::rename(from_file, to_file);
|
||||
}
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase>
|
||||
DiskLocal::readFile(
|
||||
const String & path, size_t buf_size, size_t estimated_size, size_t direct_io_threshold, size_t mmap_threshold, MMappedFileCache * mmap_cache) const
|
||||
std::unique_ptr<ReadBufferFromFileBase> DiskLocal::readFile(const String & path, const ReadSettings & settings, size_t estimated_size) const
|
||||
{
|
||||
return createReadBufferFromFileBase(fs::path(disk_path) / path, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache, buf_size);
|
||||
return createReadBufferFromFileBase(fs::path(disk_path) / path, settings, estimated_size);
|
||||
}
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase>
|
||||
|
@ -73,11 +73,8 @@ public:
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase> readFile(
|
||||
const String & path,
|
||||
size_t buf_size,
|
||||
size_t estimated_size,
|
||||
size_t direct_io_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache) const override;
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size) const override;
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase> writeFile(
|
||||
const String & path,
|
||||
@ -99,7 +96,8 @@ public:
|
||||
|
||||
void truncateFile(const String & path, size_t size) override;
|
||||
|
||||
DiskType::Type getType() const override { return DiskType::Type::Local; }
|
||||
DiskType getType() const override { return DiskType::Local; }
|
||||
bool isRemote() const override { return false; }
|
||||
|
||||
bool supportZeroCopyReplication() const override { return false; }
|
||||
|
||||
|
@ -313,7 +313,7 @@ void DiskMemory::replaceFileImpl(const String & from_path, const String & to_pat
files.insert(std::move(node));
}

std::unique_ptr<ReadBufferFromFileBase> DiskMemory::readFile(const String & path, size_t /*buf_size*/, size_t, size_t, size_t, MMappedFileCache *) const
std::unique_ptr<ReadBufferFromFileBase> DiskMemory::readFile(const String & path, const ReadSettings &, size_t) const
{
std::lock_guard lock(mutex);
@ -64,11 +64,8 @@ public:
std::unique_ptr<ReadBufferFromFileBase> readFile(
const String & path,
size_t buf_size,
size_t estimated_size,
size_t direct_io_threshold,
size_t mmap_threshold,
MMappedFileCache * mmap_cache) const override;
const ReadSettings & settings,
size_t estimated_size) const override;

std::unique_ptr<WriteBufferFromFileBase> writeFile(
const String & path,

@ -90,7 +87,8 @@ public:
void truncateFile(const String & path, size_t size) override;

DiskType::Type getType() const override { return DiskType::Type::RAM; }
DiskType getType() const override { return DiskType::RAM; }
bool isRemote() const override { return false; }

bool supportZeroCopyReplication() const override { return false; }
@ -187,11 +187,10 @@ void DiskRestartProxy::listFiles(const String & path, std::vector<String> & file
}

std::unique_ptr<ReadBufferFromFileBase> DiskRestartProxy::readFile(
const String & path, size_t buf_size, size_t estimated_size, size_t direct_io_threshold, size_t mmap_threshold, MMappedFileCache * mmap_cache)
const
const String & path, const ReadSettings & settings, size_t estimated_size) const
{
ReadLock lock (mutex);
auto impl = DiskDecorator::readFile(path, buf_size, estimated_size, direct_io_threshold, mmap_threshold, mmap_cache);
auto impl = DiskDecorator::readFile(path, settings, estimated_size);
return std::make_unique<RestartAwareReadBuffer>(*this, std::move(impl));
}
@ -45,11 +45,8 @@ public:
void listFiles(const String & path, std::vector<String> & file_names) override;
std::unique_ptr<ReadBufferFromFileBase> readFile(
const String & path,
size_t buf_size,
size_t estimated_size,
size_t direct_io_threshold,
size_t mmap_threshold,
MMappedFileCache * mmap_cache) const override;
const ReadSettings & settings,
size_t estimated_size) const override;
std::unique_ptr<WriteBufferFromFileBase> writeFile(const String & path, size_t buf_size, WriteMode mode) override;
void removeFile(const String & path) override;
void removeFileIfExists(const String & path) override;
@ -5,37 +5,34 @@
namespace DB
{

struct DiskType
enum class DiskType
{
enum class Type
{
Local,
RAM,
S3,
HDFS,
Encrypted,
WebServer
};

static String toString(Type disk_type)
{
switch (disk_type)
{
case Type::Local:
return "local";
case Type::RAM:
return "memory";
case Type::S3:
return "s3";
case Type::HDFS:
return "hdfs";
case Type::Encrypted:
return "encrypted";
case Type::WebServer:
return "web";
}
__builtin_unreachable();
}
Local,
RAM,
S3,
HDFS,
Encrypted,
WebServer,
};

inline String toString(DiskType disk_type)
{
switch (disk_type)
{
case DiskType::Local:
return "local";
case DiskType::RAM:
return "memory";
case DiskType::S3:
return "s3";
case DiskType::HDFS:
return "hdfs";
case DiskType::Encrypted:
return "encrypted";
case DiskType::WebServer:
return "web";
}
__builtin_unreachable();
}

}
@ -226,7 +226,7 @@ bool DiskWebServer::exists(const String & path) const
}

std::unique_ptr<ReadBufferFromFileBase> DiskWebServer::readFile(const String & path, size_t buf_size, size_t, size_t, size_t, MMappedFileCache *) const
std::unique_ptr<ReadBufferFromFileBase> DiskWebServer::readFile(const String & path, const ReadSettings & read_settings, size_t) const
{
LOG_DEBUG(log, "Read from file by path: {}", path);

@ -237,7 +237,7 @@ std::unique_ptr<ReadBufferFromFileBase> DiskWebServer::readFile(const String & p
RemoteMetadata meta(uri, fs::path(path).parent_path() / fs::path(path).filename());
meta.remote_fs_objects.emplace_back(std::make_pair(getFileName(path), file.size));

auto reader = std::make_unique<ReadBufferFromWebServer>(uri, meta, getContext(), settings->max_read_tries, buf_size);
auto reader = std::make_unique<ReadBufferFromWebServer>(uri, meta, getContext(), settings->max_read_tries, read_settings.remote_fs_buffer_size);
return std::make_unique<SeekAvoidingReadBuffer>(std::move(reader), settings->min_bytes_for_seek);
}

@ -107,14 +107,14 @@ public:
String getFileName(const String & path) const;

DiskType::Type getType() const override { return DiskType::Type::WebServer; }
DiskType getType() const override { return DiskType::WebServer; }

bool isRemote() const override { return true; }

std::unique_ptr<ReadBufferFromFileBase> readFile(const String & path,
size_t buf_size,
size_t estimated_size,
size_t aio_threshold,
size_t mmap_threshold,
MMappedFileCache * mmap_cache) const override;
const ReadSettings & settings,
size_t estimated_size) const override;

/// Disk info

const String & getName() const final override { return name; }
@ -93,15 +93,15 @@ DiskHDFS::DiskHDFS(
}

std::unique_ptr<ReadBufferFromFileBase> DiskHDFS::readFile(const String & path, size_t buf_size, size_t, size_t, size_t, MMappedFileCache *) const
std::unique_ptr<ReadBufferFromFileBase> DiskHDFS::readFile(const String & path, const ReadSettings & read_settings, size_t) const
{
auto metadata = readMeta(path);

LOG_DEBUG(log,
LOG_TRACE(log,
"Read from file by path: {}. Existing HDFS objects: {}",
backQuote(metadata_path + path), metadata.remote_fs_objects.size());

auto reader = std::make_unique<ReadIndirectBufferFromHDFS>(config, remote_fs_root_path, metadata, buf_size);
auto reader = std::make_unique<ReadIndirectBufferFromHDFS>(config, remote_fs_root_path, metadata, read_settings.remote_fs_buffer_size);
return std::make_unique<SeekAvoidingReadBuffer>(std::move(reader), settings->min_bytes_for_seek);
}

@ -114,7 +114,7 @@ std::unique_ptr<WriteBufferFromFileBase> DiskHDFS::writeFile(const String & path
auto file_name = getRandomName();
auto hdfs_path = remote_fs_root_path + file_name;

LOG_DEBUG(log, "{} to file by path: {}. HDFS path: {}", mode == WriteMode::Rewrite ? "Write" : "Append",
LOG_TRACE(log, "{} to file by path: {}. HDFS path: {}", mode == WriteMode::Rewrite ? "Write" : "Append",
backQuote(metadata_path + path), hdfs_path);

/// Single O_WRONLY in libhdfs adds O_TRUNC

@ -42,17 +42,15 @@ public:
const String & metadata_path_,
const Poco::Util::AbstractConfiguration & config_);

DiskType::Type getType() const override { return DiskType::Type::HDFS; }
DiskType getType() const override { return DiskType::HDFS; }
bool isRemote() const override { return true; }

bool supportZeroCopyReplication() const override { return true; }

std::unique_ptr<ReadBufferFromFileBase> readFile(
const String & path,
size_t buf_size,
size_t estimated_size,
size_t direct_io_threshold,
size_t mmap_threshold,
MMappedFileCache * mmap_cache) const override;
const ReadSettings & settings,
size_t estimated_size) const override;

std::unique_ptr<WriteBufferFromFileBase> writeFile(const String & path, size_t buf_size, WriteMode mode) override;
@ -8,24 +8,34 @@
#include <Common/Exception.h>
#include <Disks/Executor.h>
#include <Disks/DiskType.h>
#include <IO/ReadSettings.h>

#include <memory>
#include <mutex>
#include <utility>
#include <boost/noncopyable.hpp>
#include "Poco/Util/AbstractConfiguration.h"
#include <Poco/Timestamp.h>
#include <filesystem>

namespace fs = std::filesystem;

namespace Poco
{
namespace Util
{
class AbstractConfiguration;
}
}

namespace CurrentMetrics
{
extern const Metric DiskSpaceReservedForMerge;
extern const Metric DiskSpaceReservedForMerge;
}

namespace DB
{

class IDiskDirectoryIterator;
using DiskDirectoryIteratorPtr = std::unique_ptr<IDiskDirectoryIterator>;
@ -155,11 +165,8 @@ public:
/// Open the file for read and return ReadBufferFromFileBase object.
virtual std::unique_ptr<ReadBufferFromFileBase> readFile(
const String & path,
size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE,
size_t estimated_size = 0,
size_t direct_io_threshold = 0,
size_t mmap_threshold = 0,
MMappedFileCache * mmap_cache = nullptr) const = 0;
const ReadSettings & settings = ReadSettings{},
size_t estimated_size = 0) const = 0;

/// Open the file for write and return WriteBufferFromFileBase object.
virtual std::unique_ptr<WriteBufferFromFileBase> writeFile(

@ -210,7 +217,10 @@ public:
virtual void truncateFile(const String & path, size_t size);

/// Return disk type - "local", "s3", etc.
virtual DiskType::Type getType() const = 0;
virtual DiskType getType() const = 0;

/// Involves network interaction.
virtual bool isRemote() const = 0;

/// Whether this disk support zero-copy replication.
/// Overrode in remote fs disks.

@ -240,7 +250,7 @@ public:
virtual SyncGuardPtr getDirectorySyncGuard(const String & path) const;

/// Applies new settings for disk in runtime.
virtual void applyNewSettings(const Poco::Util::AbstractConfiguration &, ContextPtr, const String &, const DisksMap &) { }
virtual void applyNewSettings(const Poco::Util::AbstractConfiguration &, ContextPtr, const String &, const DisksMap &) {}

protected:
friend class DiskDecorator;
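Note: the hunk above replaces the positional buffer-size and threshold arguments of IDisk::readFile with a single ReadSettings parameter. A minimal sketch of what a caller looks like after this change; the helper name, disk pointer and path are illustrative assumptions, not code from this commit:

#include <Disks/IDisk.h>
#include <IO/ReadSettings.h>

namespace DB
{
/// Hypothetical helper (not part of this commit): open a file on any IDisk
/// using the new readFile(path, ReadSettings, estimated_size) signature.
std::unique_ptr<ReadBufferFromFileBase> openForRead(const DiskPtr & disk, const String & path)
{
    ReadSettings settings;
    settings.local_fs_buffer_size = DBMS_DEFAULT_BUFFER_SIZE; /// replaces the old buf_size argument
    settings.direct_io_threshold = 0;                         /// old positional thresholds are now fields
    return disk->readFile(path, settings, /* estimated_size = */ 0);
}
}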
@ -39,7 +39,6 @@ public:
/// mutations files
virtual DiskPtr getAnyDisk() const = 0;
virtual DiskPtr getDiskByName(const String & disk_name) const = 0;
virtual Disks getDisksByType(DiskType::Type type) const = 0;
/// Get free space from most free disk
virtual UInt64 getMaxUnreservedFreeSpace() const = 0;
/// Reserves space on any volume with index > min_volume_index or returns nullptr
@ -184,7 +184,7 @@ void DiskS3::removeFromRemoteFS(RemoteFSPathKeeperPtr fs_paths_keeper)
|
||||
if (s3_paths_keeper)
|
||||
s3_paths_keeper->removePaths([&](S3PathKeeper::Chunk && chunk)
|
||||
{
|
||||
LOG_DEBUG(log, "Remove AWS keys {}", S3PathKeeper::getChunkKeys(chunk));
|
||||
LOG_TRACE(log, "Remove AWS keys {}", S3PathKeeper::getChunkKeys(chunk));
|
||||
Aws::S3::Model::Delete delkeys;
|
||||
delkeys.SetObjects(chunk);
|
||||
/// TODO: Make operation idempotent. Do not throw exception if key is already deleted.
|
||||
@ -221,15 +221,16 @@ void DiskS3::moveFile(const String & from_path, const String & to_path, bool sen
|
||||
fs::rename(fs::path(metadata_path) / from_path, fs::path(metadata_path) / to_path);
|
||||
}
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase> DiskS3::readFile(const String & path, size_t buf_size, size_t, size_t, size_t, MMappedFileCache *) const
|
||||
std::unique_ptr<ReadBufferFromFileBase> DiskS3::readFile(const String & path, const ReadSettings & read_settings, size_t) const
|
||||
{
|
||||
auto settings = current_settings.get();
|
||||
auto metadata = readMeta(path);
|
||||
|
||||
LOG_DEBUG(log, "Read from file by path: {}. Existing S3 objects: {}",
|
||||
LOG_TRACE(log, "Read from file by path: {}. Existing S3 objects: {}",
|
||||
backQuote(metadata_path + path), metadata.remote_fs_objects.size());
|
||||
|
||||
auto reader = std::make_unique<ReadIndirectBufferFromS3>(settings->client, bucket, metadata, settings->s3_max_single_read_retries, buf_size);
|
||||
auto reader = std::make_unique<ReadIndirectBufferFromS3>(
|
||||
settings->client, bucket, metadata, settings->s3_max_single_read_retries, read_settings.remote_fs_buffer_size);
|
||||
return std::make_unique<SeekAvoidingReadBuffer>(std::move(reader), settings->min_bytes_for_seek);
|
||||
}
|
||||
|
||||
@ -251,7 +252,7 @@ std::unique_ptr<WriteBufferFromFileBase> DiskS3::writeFile(const String & path,
|
||||
s3_path = "r" + revisionToString(revision) + "-file-" + s3_path;
|
||||
}
|
||||
|
||||
LOG_DEBUG(log, "{} to file by path: {}. S3 path: {}",
|
||||
LOG_TRACE(log, "{} to file by path: {}. S3 path: {}",
|
||||
mode == WriteMode::Rewrite ? "Write" : "Append", backQuote(metadata_path + path), remote_fs_root_path + s3_path);
|
||||
|
||||
auto s3_buffer = std::make_unique<WriteBufferFromS3>(
|
||||
@ -351,7 +352,7 @@ void DiskS3::findLastRevision()
|
||||
{
|
||||
auto revision_prefix = revision + "1";
|
||||
|
||||
LOG_DEBUG(log, "Check object exists with revision prefix {}", revision_prefix);
|
||||
LOG_TRACE(log, "Check object exists with revision prefix {}", revision_prefix);
|
||||
|
||||
/// Check file or operation with such revision prefix exists.
|
||||
if (checkObjectExists(bucket, remote_fs_root_path + "r" + revision_prefix)
|
||||
@ -405,7 +406,7 @@ void DiskS3::updateObjectMetadata(const String & key, const ObjectMetadata & met
|
||||
|
||||
void DiskS3::migrateFileToRestorableSchema(const String & path)
|
||||
{
|
||||
LOG_DEBUG(log, "Migrate file {} to restorable schema", metadata_path + path);
|
||||
LOG_TRACE(log, "Migrate file {} to restorable schema", metadata_path + path);
|
||||
|
||||
auto meta = readMeta(path);
|
||||
|
||||
@ -422,7 +423,7 @@ void DiskS3::migrateToRestorableSchemaRecursive(const String & path, Futures & r
|
||||
{
|
||||
checkStackSize(); /// This is needed to prevent stack overflow in case of cyclic symlinks.
|
||||
|
||||
LOG_DEBUG(log, "Migrate directory {} to restorable schema", metadata_path + path);
|
||||
LOG_TRACE(log, "Migrate directory {} to restorable schema", metadata_path + path);
|
||||
|
||||
bool dir_contains_only_files = true;
|
||||
for (auto it = iterateDirectory(path); it->isValid(); it->next())
|
||||
@ -595,7 +596,7 @@ void DiskS3::copyObjectMultipartImpl(const String & src_bucket, const String & s
|
||||
std::optional<Aws::S3::Model::HeadObjectResult> head,
|
||||
std::optional<std::reference_wrapper<const ObjectMetadata>> metadata) const
|
||||
{
|
||||
LOG_DEBUG(log, "Multipart copy upload has created. Src Bucket: {}, Src Key: {}, Dst Bucket: {}, Dst Key: {}, Metadata: {}",
|
||||
LOG_TRACE(log, "Multipart copy upload has created. Src Bucket: {}, Src Key: {}, Dst Bucket: {}, Dst Key: {}, Metadata: {}",
|
||||
src_bucket, src_key, dst_bucket, dst_key, metadata ? "REPLACE" : "NOT_SET");
|
||||
|
||||
auto settings = current_settings.get();
|
||||
@ -669,7 +670,7 @@ void DiskS3::copyObjectMultipartImpl(const String & src_bucket, const String & s
|
||||
|
||||
throwIfError(outcome);
|
||||
|
||||
LOG_DEBUG(log, "Multipart copy upload has completed. Src Bucket: {}, Src Key: {}, Dst Bucket: {}, Dst Key: {}, "
|
||||
LOG_TRACE(log, "Multipart copy upload has completed. Src Bucket: {}, Src Key: {}, Dst Bucket: {}, Dst Key: {}, "
|
||||
"Upload_id: {}, Parts: {}", src_bucket, src_key, dst_bucket, dst_key, multipart_upload_id, part_tags.size());
|
||||
}
|
||||
}
|
||||
@ -871,7 +872,7 @@ void DiskS3::processRestoreFiles(const String & source_bucket, const String & so
|
||||
metadata.addObject(relative_key, head_result.GetContentLength());
|
||||
metadata.save();
|
||||
|
||||
LOG_DEBUG(log, "Restored file {}", path);
|
||||
LOG_TRACE(log, "Restored file {}", path);
|
||||
}
|
||||
}
|
||||
|
||||
@ -918,7 +919,7 @@ void DiskS3::restoreFileOperations(const RestoreInformation & restore_informatio
|
||||
if (exists(from_path))
|
||||
{
|
||||
moveFile(from_path, to_path, send_metadata);
|
||||
LOG_DEBUG(log, "Revision {}. Restored rename {} -> {}", revision, from_path, to_path);
|
||||
LOG_TRACE(log, "Revision {}. Restored rename {} -> {}", revision, from_path, to_path);
|
||||
|
||||
if (restore_information.detached && isDirectory(to_path))
|
||||
{
|
||||
@ -945,7 +946,7 @@ void DiskS3::restoreFileOperations(const RestoreInformation & restore_informatio
|
||||
{
|
||||
createDirectories(directoryPath(dst_path));
|
||||
createHardLink(src_path, dst_path, send_metadata);
|
||||
LOG_DEBUG(log, "Revision {}. Restored hardlink {} -> {}", revision, src_path, dst_path);
|
||||
LOG_TRACE(log, "Revision {}. Restored hardlink {} -> {}", revision, src_path, dst_path);
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -976,7 +977,7 @@ void DiskS3::restoreFileOperations(const RestoreInformation & restore_informatio
|
||||
|
||||
auto detached_path = pathToDetached(path);
|
||||
|
||||
LOG_DEBUG(log, "Move directory to 'detached' {} -> {}", path, detached_path);
|
||||
LOG_TRACE(log, "Move directory to 'detached' {} -> {}", path, detached_path);
|
||||
|
||||
fs::path from_path = fs::path(metadata_path) / path;
|
||||
fs::path to_path = fs::path(metadata_path) / detached_path;
|
||||
|
@ -76,11 +76,8 @@ public:
|
||||
|
||||
std::unique_ptr<ReadBufferFromFileBase> readFile(
|
||||
const String & path,
|
||||
size_t buf_size,
|
||||
size_t estimated_size,
|
||||
size_t direct_io_threshold,
|
||||
size_t mmap_threshold,
|
||||
MMappedFileCache * mmap_cache) const override;
|
||||
const ReadSettings & settings,
|
||||
size_t estimated_size) const override;
|
||||
|
||||
std::unique_ptr<WriteBufferFromFileBase> writeFile(
|
||||
const String & path,
|
||||
@ -97,7 +94,8 @@ public:
|
||||
void createHardLink(const String & src_path, const String & dst_path) override;
|
||||
void createHardLink(const String & src_path, const String & dst_path, bool send_metadata);
|
||||
|
||||
DiskType::Type getType() const override { return DiskType::Type::S3; }
|
||||
DiskType getType() const override { return DiskType::S3; }
|
||||
bool isRemote() const override { return true; }
|
||||
|
||||
bool supportZeroCopyReplication() const override { return true; }
|
||||
|
||||
|
@ -39,7 +39,7 @@ void checkWriteAccess(IDisk & disk)
|
||||
|
||||
void checkReadAccess(const String & disk_name, IDisk & disk)
|
||||
{
|
||||
auto file = disk.readFile("test_acl", DBMS_DEFAULT_BUFFER_SIZE);
|
||||
auto file = disk.readFile("test_acl");
|
||||
String buf(4, '0');
|
||||
file->readStrict(buf.data(), 4);
|
||||
if (buf != "test")
|
||||
|
@ -157,17 +157,6 @@ Disks StoragePolicy::getDisks() const
|
||||
}
|
||||
|
||||
|
||||
Disks StoragePolicy::getDisksByType(DiskType::Type type) const
|
||||
{
|
||||
Disks res;
|
||||
for (const auto & volume : volumes)
|
||||
for (const auto & disk : volume->getDisks())
|
||||
if (disk->getType() == type)
|
||||
res.push_back(disk);
|
||||
return res;
|
||||
}
|
||||
|
||||
|
||||
DiskPtr StoragePolicy::getAnyDisk() const
|
||||
{
|
||||
/// StoragePolicy must contain at least one Volume
|
||||
|
@ -47,9 +47,6 @@ public:
|
||||
/// Returns disks ordered by volumes priority
|
||||
Disks getDisks() const override;
|
||||
|
||||
/// Returns disks by type ordered by volumes priority
|
||||
Disks getDisksByType(DiskType::Type type) const override;
|
||||
|
||||
/// Returns any disk
|
||||
/// Used when it's not important, for example for
|
||||
/// mutations files
|
||||
|
@ -53,7 +53,7 @@ TEST(DiskTestHDFS, WriteReadHDFS)
|
||||
|
||||
{
|
||||
DB::String result;
|
||||
auto in = disk.readFile(file_name, 1024, 1024, 1024, 1024, nullptr);
|
||||
auto in = disk.readFile(file_name, {}, 1024);
|
||||
readString(result, *in);
|
||||
EXPECT_EQ("Test write to file", result);
|
||||
}
|
||||
@ -76,7 +76,7 @@ TEST(DiskTestHDFS, RewriteFileHDFS)
|
||||
|
||||
{
|
||||
String result;
|
||||
auto in = disk.readFile(file_name, 1024, 1024, 1024, 1024, nullptr);
|
||||
auto in = disk.readFile(file_name, {}, 1024);
|
||||
readString(result, *in);
|
||||
EXPECT_EQ("Text10", result);
|
||||
readString(result, *in);
|
||||
@ -104,7 +104,7 @@ TEST(DiskTestHDFS, AppendFileHDFS)
|
||||
|
||||
{
|
||||
String result, expected;
|
||||
auto in = disk.readFile(file_name, 1024, 1024, 1024, 1024, nullptr);
|
||||
auto in = disk.readFile(file_name, {}, 1024);
|
||||
|
||||
readString(result, *in);
|
||||
EXPECT_EQ("Text0123456789", result);
|
||||
@ -131,7 +131,7 @@ TEST(DiskTestHDFS, SeekHDFS)
|
||||
/// Test SEEK_SET
|
||||
{
|
||||
String buf(4, '0');
|
||||
std::unique_ptr<DB::SeekableReadBuffer> in = disk.readFile(file_name, 1024, 1024, 1024, 1024, nullptr);
|
||||
std::unique_ptr<DB::SeekableReadBuffer> in = disk.readFile(file_name, {}, 1024);
|
||||
|
||||
in->seek(5, SEEK_SET);
|
||||
|
||||
@ -141,7 +141,7 @@ TEST(DiskTestHDFS, SeekHDFS)
|
||||
|
||||
/// Test SEEK_CUR
|
||||
{
|
||||
std::unique_ptr<DB::SeekableReadBuffer> in = disk.readFile(file_name, 1024, 1024, 1024, 1024, nullptr);
|
||||
std::unique_ptr<DB::SeekableReadBuffer> in = disk.readFile(file_name, {}, 1024);
|
||||
String buf(4, '0');
|
||||
|
||||
in->readStrict(buf.data(), 4);
|
||||
|
@ -22,6 +22,7 @@ SRCS(
|
||||
IVolume.cpp
|
||||
LocalDirectorySyncGuard.cpp
|
||||
ReadIndirectBufferFromRemoteFS.cpp
|
||||
ReadIndirectBufferFromWebServer.cpp
|
||||
SingleDiskVolume.cpp
|
||||
StoragePolicy.cpp
|
||||
TemporaryFileOnDisk.cpp
|
||||
|
@ -6,6 +6,7 @@
# include <sys/syscall.h>
# include <unistd.h>
# include <utility>

/** Small wrappers for asynchronous I/O.

@ -50,7 +51,19 @@ AIOContext::AIOContext(unsigned int nr_events)
AIOContext::~AIOContext()
{
io_destroy(ctx);
if (ctx)
io_destroy(ctx);
}

AIOContext::AIOContext(AIOContext && rhs)
{
*this = std::move(rhs);
}

AIOContext & AIOContext::operator=(AIOContext && rhs)
{
std::swap(ctx, rhs.ctx);
return *this;
}

#elif defined(OS_FREEBSD)

@ -33,10 +33,13 @@ int io_getevents(aio_context_t ctx, long min_nr, long max_nr, io_event * events,
struct AIOContext : private boost::noncopyable
{
aio_context_t ctx;
aio_context_t ctx = 0;

AIOContext(unsigned int nr_events = 128);
AIOContext() {}
AIOContext(unsigned int nr_events);
~AIOContext();
AIOContext(AIOContext && rhs);
AIOContext & operator=(AIOContext && rhs);
};

#elif defined(OS_FREEBSD)
@ -1,172 +0,0 @@
|
||||
#if defined(OS_LINUX) || defined(__FreeBSD__)
|
||||
|
||||
#include <Common/Exception.h>
|
||||
#include <common/logger_useful.h>
|
||||
#include <Common/MemorySanitizer.h>
|
||||
#include <Poco/Logger.h>
|
||||
#include <boost/range/iterator_range.hpp>
|
||||
#include <errno.h>
|
||||
|
||||
#include <IO/AIOContextPool.h>
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
namespace ErrorCodes
|
||||
{
|
||||
extern const int CANNOT_IO_SUBMIT;
|
||||
extern const int CANNOT_IO_GETEVENTS;
|
||||
}
|
||||
|
||||
|
||||
AIOContextPool::~AIOContextPool()
|
||||
{
|
||||
cancelled.store(true, std::memory_order_relaxed);
|
||||
io_completion_monitor.join();
|
||||
}
|
||||
|
||||
|
||||
void AIOContextPool::doMonitor()
|
||||
{
|
||||
/// continue checking for events unless cancelled
|
||||
while (!cancelled.load(std::memory_order_relaxed))
|
||||
waitForCompletion();
|
||||
|
||||
/// wait until all requests have been completed
|
||||
while (!promises.empty())
|
||||
waitForCompletion();
|
||||
}
|
||||
|
||||
|
||||
void AIOContextPool::waitForCompletion()
|
||||
{
|
||||
/// array to hold completion events
|
||||
std::vector<io_event> events(max_concurrent_events);
|
||||
|
||||
try
|
||||
{
|
||||
const auto num_events = getCompletionEvents(events.data(), max_concurrent_events);
|
||||
fulfillPromises(events.data(), num_events);
|
||||
notifyProducers(num_events);
|
||||
}
|
||||
catch (...)
|
||||
{
|
||||
/// there was an error, log it, return to any producer and continue
|
||||
reportExceptionToAnyProducer();
|
||||
tryLogCurrentException("AIOContextPool::waitForCompletion()");
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
int AIOContextPool::getCompletionEvents(io_event events[], const int max_events) const
|
||||
{
|
||||
timespec timeout{timeout_sec, 0};
|
||||
|
||||
auto num_events = 0;
|
||||
|
||||
/// request 1 to `max_events` events
|
||||
while ((num_events = io_getevents(aio_context.ctx, 1, max_events, events, &timeout)) < 0)
|
||||
if (errno != EINTR)
|
||||
throwFromErrno("io_getevents: Failed to wait for asynchronous IO completion", ErrorCodes::CANNOT_IO_GETEVENTS, errno);
|
||||
|
||||
/// Unpoison the memory returned from a non-instrumented system call.
|
||||
__msan_unpoison(events, sizeof(*events) * num_events);
|
||||
|
||||
return num_events;
|
||||
}
|
||||
|
||||
|
||||
void AIOContextPool::fulfillPromises(const io_event events[], const int num_events)
|
||||
{
|
||||
if (num_events == 0)
|
||||
return;
|
||||
|
||||
const std::lock_guard lock{mutex};
|
||||
|
||||
/// look at returned events and find corresponding promise, set result and erase promise from map
|
||||
for (const auto & event : boost::make_iterator_range(events, events + num_events))
|
||||
{
|
||||
/// get id from event
|
||||
#if defined(__FreeBSD__)
|
||||
const auto completed_id = (reinterpret_cast<struct iocb *>(event.udata))->aio_data;
|
||||
#else
|
||||
const auto completed_id = event.data;
|
||||
#endif
|
||||
|
||||
/// set value via promise and release it
|
||||
const auto it = promises.find(completed_id);
|
||||
if (it == std::end(promises))
|
||||
{
|
||||
LOG_ERROR(&Poco::Logger::get("AIOcontextPool"), "Found io_event with unknown id {}", completed_id);
|
||||
continue;
|
||||
}
|
||||
|
||||
#if defined(__FreeBSD__)
|
||||
it->second.set_value(aio_return(reinterpret_cast<struct aiocb *>(event.udata)));
|
||||
#else
|
||||
it->second.set_value(event.res);
|
||||
#endif
|
||||
promises.erase(it);
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
void AIOContextPool::notifyProducers(const int num_producers) const
|
||||
{
|
||||
if (num_producers == 0)
|
||||
return;
|
||||
|
||||
if (num_producers > 1)
|
||||
have_resources.notify_all();
|
||||
else
|
||||
have_resources.notify_one();
|
||||
}
|
||||
|
||||
|
||||
void AIOContextPool::reportExceptionToAnyProducer()
|
||||
{
|
||||
const std::lock_guard lock{mutex};
|
||||
|
||||
const auto any_promise_it = std::begin(promises);
|
||||
any_promise_it->second.set_exception(std::current_exception());
|
||||
}
|
||||
|
||||
|
||||
std::future<AIOContextPool::BytesRead> AIOContextPool::post(struct iocb & iocb)
|
||||
{
|
||||
std::unique_lock lock{mutex};
|
||||
|
||||
/// get current id and increment it by one
|
||||
const auto request_id = next_id;
|
||||
++next_id;
|
||||
|
||||
/// create a promise and put request in "queue"
|
||||
promises.emplace(request_id, std::promise<BytesRead>{});
|
||||
/// store id in AIO request for further identification
|
||||
iocb.aio_data = request_id;
|
||||
|
||||
struct iocb * requests[] { &iocb };
|
||||
|
||||
/// submit a request
|
||||
while (io_submit(aio_context.ctx, 1, requests) < 0)
|
||||
{
|
||||
if (errno == EAGAIN)
|
||||
/// wait until at least one event has been completed (or a spurious wakeup) and try again
|
||||
have_resources.wait(lock);
|
||||
else if (errno != EINTR)
|
||||
throwFromErrno("io_submit: Failed to submit a request for asynchronous IO", ErrorCodes::CANNOT_IO_SUBMIT);
|
||||
}
|
||||
|
||||
return promises[request_id].get_future();
|
||||
}
|
||||
|
||||
AIOContextPool & AIOContextPool::instance()
|
||||
{
|
||||
static AIOContextPool instance;
|
||||
return instance;
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
#endif
|
@ -1,53 +0,0 @@
|
||||
#pragma once
|
||||
|
||||
#if defined(OS_LINUX) || defined(__FreeBSD__)
|
||||
|
||||
#include <condition_variable>
|
||||
#include <future>
|
||||
#include <mutex>
|
||||
#include <map>
|
||||
#include <IO/AIO.h>
|
||||
#include <Common/ThreadPool.h>
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
class AIOContextPool : private boost::noncopyable
|
||||
{
|
||||
static const auto max_concurrent_events = 128;
|
||||
static const auto timeout_sec = 1;
|
||||
|
||||
AIOContext aio_context{max_concurrent_events};
|
||||
|
||||
using ID = size_t;
|
||||
using BytesRead = ssize_t;
|
||||
|
||||
/// Autoincremental id used to identify completed requests
|
||||
ID next_id{};
|
||||
mutable std::mutex mutex;
|
||||
mutable std::condition_variable have_resources;
|
||||
std::map<ID, std::promise<BytesRead>> promises;
|
||||
|
||||
std::atomic<bool> cancelled{false};
|
||||
ThreadFromGlobalPool io_completion_monitor{&AIOContextPool::doMonitor, this};
|
||||
|
||||
~AIOContextPool();
|
||||
|
||||
void doMonitor();
|
||||
void waitForCompletion();
|
||||
int getCompletionEvents(io_event events[], const int max_events) const;
|
||||
void fulfillPromises(const io_event events[], const int num_events);
|
||||
void notifyProducers(const int num_producers) const;
|
||||
void reportExceptionToAnyProducer();
|
||||
|
||||
public:
|
||||
static AIOContextPool & instance();
|
||||
|
||||
/// Request AIO read operation for iocb, returns a future with number of bytes read
|
||||
std::future<BytesRead> post(struct iocb & iocb);
|
||||
};
|
||||
|
||||
}
|
||||
|
||||
#endif
|
106 src/IO/AsynchronousReadBufferFromFile.cpp Normal file
@ -0,0 +1,106 @@
|
||||
#include <fcntl.h>
|
||||
|
||||
#include <IO/AsynchronousReadBufferFromFile.h>
|
||||
#include <IO/WriteHelpers.h>
|
||||
#include <Common/ProfileEvents.h>
|
||||
#include <errno.h>
|
||||
|
||||
|
||||
namespace ProfileEvents
|
||||
{
|
||||
extern const Event FileOpen;
|
||||
}
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
namespace ErrorCodes
|
||||
{
|
||||
extern const int FILE_DOESNT_EXIST;
|
||||
extern const int CANNOT_OPEN_FILE;
|
||||
extern const int CANNOT_CLOSE_FILE;
|
||||
}
|
||||
|
||||
|
||||
AsynchronousReadBufferFromFile::AsynchronousReadBufferFromFile(
|
||||
AsynchronousReaderPtr reader_,
|
||||
Int32 priority_,
|
||||
const std::string & file_name_,
|
||||
size_t buf_size,
|
||||
int flags,
|
||||
char * existing_memory,
|
||||
size_t alignment)
|
||||
: AsynchronousReadBufferFromFileDescriptor(std::move(reader_), priority_, -1, buf_size, existing_memory, alignment), file_name(file_name_)
|
||||
{
|
||||
ProfileEvents::increment(ProfileEvents::FileOpen);
|
||||
|
||||
#ifdef __APPLE__
|
||||
bool o_direct = (flags != -1) && (flags & O_DIRECT);
|
||||
if (o_direct)
|
||||
flags = flags & ~O_DIRECT;
|
||||
#endif
|
||||
fd = ::open(file_name.c_str(), flags == -1 ? O_RDONLY | O_CLOEXEC : flags | O_CLOEXEC);
|
||||
|
||||
if (-1 == fd)
|
||||
throwFromErrnoWithPath("Cannot open file " + file_name, file_name,
|
||||
errno == ENOENT ? ErrorCodes::FILE_DOESNT_EXIST : ErrorCodes::CANNOT_OPEN_FILE);
|
||||
#ifdef __APPLE__
|
||||
if (o_direct)
|
||||
{
|
||||
if (fcntl(fd, F_NOCACHE, 1) == -1)
|
||||
throwFromErrnoWithPath("Cannot set F_NOCACHE on file " + file_name, file_name, ErrorCodes::CANNOT_OPEN_FILE);
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
|
||||
AsynchronousReadBufferFromFile::AsynchronousReadBufferFromFile(
|
||||
AsynchronousReaderPtr reader_,
|
||||
Int32 priority_,
|
||||
int & fd_,
|
||||
const std::string & original_file_name,
|
||||
size_t buf_size,
|
||||
char * existing_memory,
|
||||
size_t alignment)
|
||||
:
|
||||
AsynchronousReadBufferFromFileDescriptor(std::move(reader_), priority_, fd_, buf_size, existing_memory, alignment),
|
||||
file_name(original_file_name.empty() ? "(fd = " + toString(fd_) + ")" : original_file_name)
|
||||
{
|
||||
fd_ = -1;
|
||||
}
|
||||
|
||||
|
||||
AsynchronousReadBufferFromFile::~AsynchronousReadBufferFromFile()
|
||||
{
|
||||
/// Must wait for events in flight before closing the file.
|
||||
finalize();
|
||||
|
||||
if (fd < 0)
|
||||
return;
|
||||
|
||||
::close(fd);
|
||||
}
|
||||
|
||||
|
||||
void AsynchronousReadBufferFromFile::close()
|
||||
{
|
||||
if (fd < 0)
|
||||
return;
|
||||
|
||||
if (0 != ::close(fd))
|
||||
throw Exception("Cannot close file", ErrorCodes::CANNOT_CLOSE_FILE);
|
||||
|
||||
fd = -1;
|
||||
}
|
||||
|
||||
|
||||
AsynchronousReadBufferFromFileWithDescriptorsCache::~AsynchronousReadBufferFromFileWithDescriptorsCache()
|
||||
{
|
||||
/// Must wait for events in flight before potentially closing the file by destroying OpenedFilePtr.
|
||||
finalize();
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
70 src/IO/AsynchronousReadBufferFromFile.h Normal file
@ -0,0 +1,70 @@
|
||||
#pragma once
|
||||
|
||||
#include <IO/AsynchronousReadBufferFromFileDescriptor.h>
|
||||
#include <IO/OpenedFileCache.h>
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
class AsynchronousReadBufferFromFile : public AsynchronousReadBufferFromFileDescriptor
|
||||
{
|
||||
protected:
|
||||
std::string file_name;
|
||||
|
||||
public:
|
||||
explicit AsynchronousReadBufferFromFile(
|
||||
AsynchronousReaderPtr reader_, Int32 priority_,
|
||||
const std::string & file_name_, size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE, int flags = -1,
|
||||
char * existing_memory = nullptr, size_t alignment = 0);
|
||||
|
||||
/// Use pre-opened file descriptor.
|
||||
explicit AsynchronousReadBufferFromFile(
|
||||
AsynchronousReaderPtr reader_, Int32 priority_,
|
||||
int & fd, /// Will be set to -1 if constructor didn't throw and ownership of file descriptor is passed to the object.
|
||||
const std::string & original_file_name = {},
|
||||
size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE,
|
||||
char * existing_memory = nullptr, size_t alignment = 0);
|
||||
|
||||
~AsynchronousReadBufferFromFile() override;
|
||||
|
||||
/// Close file before destruction of object.
|
||||
void close();
|
||||
|
||||
std::string getFileName() const override
|
||||
{
|
||||
return file_name;
|
||||
}
|
||||
};
|
||||
|
||||
|
||||
/** Similar to AsynchronousReadBufferFromFile but also transparently shares open file descriptors.
|
||||
*/
|
||||
class AsynchronousReadBufferFromFileWithDescriptorsCache : public AsynchronousReadBufferFromFileDescriptor
|
||||
{
|
||||
private:
|
||||
std::string file_name;
|
||||
OpenedFileCache::OpenedFilePtr file;
|
||||
|
||||
public:
|
||||
AsynchronousReadBufferFromFileWithDescriptorsCache(
|
||||
AsynchronousReaderPtr reader_, Int32 priority_,
|
||||
const std::string & file_name_, size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE, int flags = -1,
|
||||
char * existing_memory = nullptr, size_t alignment = 0)
|
||||
: AsynchronousReadBufferFromFileDescriptor(std::move(reader_), priority_, -1, buf_size, existing_memory, alignment),
|
||||
file_name(file_name_)
|
||||
{
|
||||
file = OpenedFileCache::instance().get(file_name, flags);
|
||||
fd = file->getFD();
|
||||
}
|
||||
|
||||
~AsynchronousReadBufferFromFileWithDescriptorsCache() override;
|
||||
|
||||
std::string getFileName() const override
|
||||
{
|
||||
return file_name;
|
||||
}
|
||||
};
|
||||
|
||||
}
|
||||
|
204 src/IO/AsynchronousReadBufferFromFileDescriptor.cpp Normal file
@ -0,0 +1,204 @@
|
||||
#include <errno.h>
|
||||
#include <time.h>
|
||||
#include <optional>
|
||||
#include <Common/ProfileEvents.h>
|
||||
#include <Common/Stopwatch.h>
|
||||
#include <Common/Exception.h>
|
||||
#include <Common/CurrentMetrics.h>
|
||||
#include <IO/AsynchronousReadBufferFromFileDescriptor.h>
|
||||
#include <IO/WriteHelpers.h>
|
||||
|
||||
|
||||
namespace ProfileEvents
|
||||
{
|
||||
extern const Event AsynchronousReadWaitMicroseconds;
|
||||
}
|
||||
|
||||
namespace CurrentMetrics
|
||||
{
|
||||
extern const Metric AsynchronousReadWait;
|
||||
}
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
namespace ErrorCodes
|
||||
{
|
||||
extern const int ARGUMENT_OUT_OF_BOUND;
|
||||
}
|
||||
|
||||
|
||||
std::string AsynchronousReadBufferFromFileDescriptor::getFileName() const
|
||||
{
|
||||
return "(fd = " + toString(fd) + ")";
|
||||
}
|
||||
|
||||
|
||||
std::future<IAsynchronousReader::Result> AsynchronousReadBufferFromFileDescriptor::readInto(char * data, size_t size)
|
||||
{
|
||||
IAsynchronousReader::Request request;
|
||||
request.descriptor = std::make_shared<IAsynchronousReader::LocalFileDescriptor>(fd);
|
||||
request.buf = data;
|
||||
request.size = size;
|
||||
request.offset = file_offset_of_buffer_end;
|
||||
request.priority = priority;
|
||||
|
||||
return reader->submit(request);
|
||||
}
|
||||
|
||||
|
||||
void AsynchronousReadBufferFromFileDescriptor::prefetch()
|
||||
{
|
||||
if (prefetch_future.valid())
|
||||
return;
|
||||
|
||||
/// Will request the same amount of data that is read in nextImpl.
|
||||
prefetch_buffer.resize(internal_buffer.size());
|
||||
prefetch_future = readInto(prefetch_buffer.data(), prefetch_buffer.size());
|
||||
}
|
||||
|
||||
|
||||
bool AsynchronousReadBufferFromFileDescriptor::nextImpl()
|
||||
{
|
||||
if (prefetch_future.valid())
|
||||
{
|
||||
/// Read request already in flight. Wait for its completion.
|
||||
|
||||
size_t size = 0;
|
||||
{
|
||||
Stopwatch watch;
|
||||
CurrentMetrics::Increment metric_increment{CurrentMetrics::AsynchronousReadWait};
|
||||
size = prefetch_future.get();
|
||||
ProfileEvents::increment(ProfileEvents::AsynchronousReadWaitMicroseconds, watch.elapsedMicroseconds());
|
||||
}
|
||||
|
||||
prefetch_future = {};
|
||||
file_offset_of_buffer_end += size;
|
||||
|
||||
if (size)
|
||||
{
|
||||
prefetch_buffer.swap(memory);
|
||||
set(memory.data(), memory.size());
|
||||
working_buffer.resize(size);
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
else
|
||||
{
|
||||
/// No pending request. Do synchronous read.
|
||||
|
||||
auto size = readInto(memory.data(), memory.size()).get();
|
||||
file_offset_of_buffer_end += size;
|
||||
|
||||
if (size)
|
||||
{
|
||||
set(memory.data(), memory.size());
|
||||
working_buffer.resize(size);
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
void AsynchronousReadBufferFromFileDescriptor::finalize()
|
||||
{
|
||||
if (prefetch_future.valid())
|
||||
{
|
||||
prefetch_future.wait();
|
||||
prefetch_future = {};
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
AsynchronousReadBufferFromFileDescriptor::~AsynchronousReadBufferFromFileDescriptor()
|
||||
{
|
||||
finalize();
|
||||
}
|
||||
|
||||
|
||||
/// If 'offset' is small enough to stay in buffer after seek, then true seek in file does not happen.
|
||||
off_t AsynchronousReadBufferFromFileDescriptor::seek(off_t offset, int whence)
|
||||
{
|
||||
size_t new_pos;
|
||||
if (whence == SEEK_SET)
|
||||
{
|
||||
assert(offset >= 0);
|
||||
new_pos = offset;
|
||||
}
|
||||
else if (whence == SEEK_CUR)
|
||||
{
|
||||
new_pos = file_offset_of_buffer_end - (working_buffer.end() - pos) + offset;
|
||||
}
|
||||
else
|
||||
{
|
||||
throw Exception("ReadBufferFromFileDescriptor::seek expects SEEK_SET or SEEK_CUR as whence", ErrorCodes::ARGUMENT_OUT_OF_BOUND);
|
||||
}
|
||||
|
||||
/// Position is unchanged.
|
||||
if (new_pos + (working_buffer.end() - pos) == file_offset_of_buffer_end)
|
||||
return new_pos;
|
||||
|
||||
if (file_offset_of_buffer_end - working_buffer.size() <= static_cast<size_t>(new_pos)
|
||||
&& new_pos <= file_offset_of_buffer_end)
|
||||
{
|
||||
/// Position is still inside the buffer.
|
||||
/// Probably it is at the end of the buffer - then we will load data on the following 'next' call.
|
||||
|
||||
pos = working_buffer.end() - file_offset_of_buffer_end + new_pos;
|
||||
assert(pos >= working_buffer.begin());
|
||||
assert(pos <= working_buffer.end());
|
||||
|
||||
return new_pos;
|
||||
}
|
||||
else
|
||||
{
|
||||
if (prefetch_future.valid())
|
||||
{
|
||||
//std::cerr << "Ignoring prefetched data" << "\n";
|
||||
prefetch_future.wait();
|
||||
prefetch_future = {};
|
||||
}
|
||||
|
||||
/// Position is out of the buffer, we need to do real seek.
|
||||
off_t seek_pos = required_alignment > 1
|
||||
? new_pos / required_alignment * required_alignment
|
||||
: new_pos;
|
||||
|
||||
off_t offset_after_seek_pos = new_pos - seek_pos;
|
||||
|
||||
/// First put position at the end of the buffer so the next read will fetch new data to the buffer.
|
||||
pos = working_buffer.end();
|
||||
|
||||
/// Just update the info about the next position in file.
|
||||
|
||||
file_offset_of_buffer_end = seek_pos;
|
||||
|
||||
if (offset_after_seek_pos > 0)
|
||||
ignore(offset_after_seek_pos);
|
||||
|
||||
return seek_pos;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
void AsynchronousReadBufferFromFileDescriptor::rewind()
|
||||
{
|
||||
if (prefetch_future.valid())
|
||||
{
|
||||
prefetch_future.wait();
|
||||
prefetch_future = {};
|
||||
}
|
||||
|
||||
/// Clearing the buffer with existing data. New data will be read on subsequent call to 'next'.
|
||||
working_buffer.resize(0);
|
||||
pos = working_buffer.begin();
|
||||
file_offset_of_buffer_end = 0;
|
||||
}
|
||||
|
||||
}
|
||||
|
70 src/IO/AsynchronousReadBufferFromFileDescriptor.h Normal file
@ -0,0 +1,70 @@
|
||||
#pragma once
|
||||
|
||||
#include <IO/ReadBufferFromFileBase.h>
|
||||
#include <IO/AsynchronousReader.h>
|
||||
#include <Interpreters/Context.h>
|
||||
|
||||
#include <optional>
|
||||
#include <unistd.h>
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
/** Use ready file descriptor. Does not open or close a file.
|
||||
*/
|
||||
class AsynchronousReadBufferFromFileDescriptor : public ReadBufferFromFileBase
|
||||
{
|
||||
protected:
|
||||
AsynchronousReaderPtr reader;
|
||||
Int32 priority;
|
||||
|
||||
Memory<> prefetch_buffer;
|
||||
std::future<IAsynchronousReader::Result> prefetch_future;
|
||||
|
||||
const size_t required_alignment = 0; /// For O_DIRECT both file offsets and memory addresses have to be aligned.
|
||||
size_t file_offset_of_buffer_end = 0; /// What offset in file corresponds to working_buffer.end().
|
||||
int fd;
|
||||
|
||||
bool nextImpl() override;
|
||||
|
||||
/// Name or some description of file.
|
||||
std::string getFileName() const override;
|
||||
|
||||
void finalize();
|
||||
|
||||
public:
|
||||
AsynchronousReadBufferFromFileDescriptor(
|
||||
AsynchronousReaderPtr reader_, Int32 priority_,
|
||||
int fd_, size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE, char * existing_memory = nullptr, size_t alignment = 0)
|
||||
: ReadBufferFromFileBase(buf_size, existing_memory, alignment),
|
||||
reader(std::move(reader_)), priority(priority_), required_alignment(alignment), fd(fd_)
|
||||
{
|
||||
}
|
||||
|
||||
~AsynchronousReadBufferFromFileDescriptor() override;
|
||||
|
||||
void prefetch() override;
|
||||
|
||||
int getFD() const
|
||||
{
|
||||
return fd;
|
||||
}
|
||||
|
||||
off_t getPosition() override
|
||||
{
|
||||
return file_offset_of_buffer_end - (working_buffer.end() - pos);
|
||||
}
|
||||
|
||||
/// If 'offset' is small enough to stay in buffer after seek, then true seek in file does not happen.
|
||||
off_t seek(off_t off, int whence) override;
|
||||
|
||||
/// Seek to the beginning, discarding already read data if any. Useful to reread file that changes on every read.
|
||||
void rewind();
|
||||
|
||||
private:
|
||||
std::future<IAsynchronousReader::Result> readInto(char * data, size_t size);
|
||||
};
|
||||
|
||||
}
|
||||
|
69 src/IO/AsynchronousReader.h Normal file
@ -0,0 +1,69 @@
|
||||
#pragma once
|
||||
|
||||
#include <Core/Types.h>
|
||||
#include <optional>
|
||||
#include <memory>
|
||||
#include <future>
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
/** Interface for asynchronous reads from file descriptors.
|
||||
* It can abstract Linux AIO, io_uring or normal reads from separate thread pool,
|
||||
* and also reads from non-local filesystems.
|
||||
* The implementation not necessarily to be efficient for large number of small requests,
|
||||
* instead it should be ok for moderate number of sufficiently large requests
|
||||
* (e.g. read 1 MB of data 50 000 times per seconds; BTW this is normal performance for reading from page cache).
|
||||
* For example, this interface may not suffice if you want to serve 10 000 000 of 4 KiB requests per second.
|
||||
* This interface is fairly limited.
|
||||
*/
|
||||
class IAsynchronousReader
|
||||
{
|
||||
public:
|
||||
/// For local filesystems, the file descriptor is simply integer
|
||||
/// but it can be arbitrary opaque object for remote filesystems.
|
||||
struct IFileDescriptor
|
||||
{
|
||||
virtual ~IFileDescriptor() = default;
|
||||
};
|
||||
|
||||
using FileDescriptorPtr = std::shared_ptr<IFileDescriptor>;
|
||||
|
||||
struct LocalFileDescriptor : public IFileDescriptor
|
||||
{
|
||||
LocalFileDescriptor(int fd_) : fd(fd_) {}
|
||||
int fd;
|
||||
};
|
||||
|
||||
/// Read from file descriptor at specified offset up to size bytes into buf.
|
||||
/// Some implementations may require alignment and it is responsibility of
|
||||
/// the caller to provide conforming requests.
|
||||
struct Request
|
||||
{
|
||||
FileDescriptorPtr descriptor;
|
||||
size_t offset = 0;
|
||||
size_t size = 0;
|
||||
char * buf = nullptr;
|
||||
int64_t priority = 0;
|
||||
};
|
||||
|
||||
/// Less than requested amount of data can be returned.
|
||||
/// If size is zero - the file has ended.
|
||||
/// (for example, EINTR must be handled by implementation automatically)
|
||||
using Result = size_t;
|
||||
|
||||
/// Submit request and obtain a handle. This method don't perform any waits.
|
||||
/// If this method did not throw, the caller must wait for the result with 'wait' method
|
||||
/// or destroy the whole reader before destroying the buffer for request.
|
||||
/// The method can be called concurrently from multiple threads.
|
||||
virtual std::future<Result> submit(Request request) = 0;
|
||||
|
||||
/// Destructor must wait for all not completed request and ignore the results.
|
||||
/// It may also cancel the requests.
|
||||
virtual ~IAsynchronousReader() = default;
|
||||
};
|
||||
|
||||
using AsynchronousReaderPtr = std::shared_ptr<IAsynchronousReader>;
|
||||
|
||||
}
|
@ -48,18 +48,22 @@ struct Memory : boost::noncopyable, Allocator
dealloc();
}

Memory(Memory && rhs) noexcept
{
*this = std::move(rhs);
}

Memory & operator=(Memory && rhs) noexcept
void swap(Memory & rhs) noexcept
{
std::swap(m_capacity, rhs.m_capacity);
std::swap(m_size, rhs.m_size);
std::swap(m_data, rhs.m_data);
std::swap(alignment, rhs.alignment);
}

Memory(Memory && rhs) noexcept
{
swap(rhs);
}

Memory & operator=(Memory && rhs) noexcept
{
swap(rhs);
return *this;
}
@ -65,6 +65,12 @@ public:
it->second = res;
return res;
}

static OpenedFileCache & instance()
{
static OpenedFileCache res;
return res;
}
};

using OpenedFileCachePtr = std::shared_ptr<OpenedFileCache>;
@ -197,6 +197,11 @@ public:
return read(to, n);
}

/** Do something to allow faster subsequent call to 'nextImpl' if possible.
* It's used for asynchronous readers with double-buffering.
*/
virtual void prefetch() {}

protected:
/// The number of bytes to ignore from the initial position of `working_buffer`
/// buffer. Apparently this is an additional out-parameter for nextImpl(),
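Note: the prefetch() hook added above is a no-op for plain buffers and is overridden by asynchronous readers to start the next block early. A rough sketch of the double-buffering idea it enables; the helper below is illustrative only and not part of this commit:

#include <IO/ReadBuffer.h>
#include <string>

/// Illustrative only: drain a ReadBuffer while hinting that the next block
/// may be fetched in the background by readers that override prefetch().
void drainWithPrefetch(DB::ReadBuffer & in, std::string & out)
{
    while (!in.eof())
    {
        in.prefetch();                      /// no-op by default, double-buffered read for async readers
        out.append(in.position(), in.available());
        in.position() = in.buffer().end();  /// mark the current working buffer as consumed
    }
}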
@ -88,7 +88,4 @@ void ReadBufferFromFile::close()
|
||||
metric_increment.destroy();
|
||||
}
|
||||
|
||||
|
||||
OpenedFileCache ReadBufferFromFilePReadWithCache::cache;
|
||||
|
||||
}
|
||||
|
@ -4,10 +4,6 @@
|
||||
#include <IO/OpenedFileCache.h>
|
||||
#include <Common/CurrentMetrics.h>
|
||||
|
||||
#ifndef O_DIRECT
|
||||
#define O_DIRECT 00040000
|
||||
#endif
|
||||
|
||||
|
||||
namespace CurrentMetrics
|
||||
{
|
||||
@ -65,21 +61,19 @@ public:
|
||||
|
||||
/** Similar to ReadBufferFromFilePRead but also transparently shares open file descriptors.
|
||||
*/
|
||||
class ReadBufferFromFilePReadWithCache : public ReadBufferFromFileDescriptorPRead
|
||||
class ReadBufferFromFilePReadWithDescriptorsCache : public ReadBufferFromFileDescriptorPRead
|
||||
{
|
||||
private:
|
||||
static OpenedFileCache cache;
|
||||
|
||||
std::string file_name;
|
||||
OpenedFileCache::OpenedFilePtr file;
|
||||
|
||||
public:
|
||||
ReadBufferFromFilePReadWithCache(const std::string & file_name_, size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE, int flags = -1,
|
||||
ReadBufferFromFilePReadWithDescriptorsCache(const std::string & file_name_, size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE, int flags = -1,
|
||||
char * existing_memory = nullptr, size_t alignment = 0)
|
||||
: ReadBufferFromFileDescriptorPRead(-1, buf_size, existing_memory, alignment),
|
||||
file_name(file_name_)
|
||||
{
|
||||
file = cache.get(file_name, flags);
|
||||
file = OpenedFileCache::instance().get(file_name, flags);
|
||||
fd = file->getFD();
|
||||
}
|
||||
|
||||
|
@ -7,8 +7,14 @@
|
||||
#include <functional>
|
||||
#include <string>
|
||||
|
||||
#include <sys/stat.h>
|
||||
#include <sys/types.h>
|
||||
#include <fcntl.h>
|
||||
|
||||
#ifndef O_DIRECT
|
||||
#define O_DIRECT 00040000
|
||||
#endif
|
||||
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
@ -6,13 +6,9 @@
|
||||
#include <Common/Exception.h>
|
||||
#include <Common/CurrentMetrics.h>
|
||||
#include <IO/ReadBufferFromFileDescriptor.h>
|
||||
#include <IO/WriteBufferFromFile.h>
|
||||
#include <IO/WriteHelpers.h>
|
||||
#include <sys/stat.h>
|
||||
#include <Common/UnicodeBar.h>
|
||||
#include <Common/TerminalSize.h>
|
||||
#include <IO/Operators.h>
|
||||
#include <IO/Progress.h>
|
||||
#include <sys/stat.h>
|
||||
|
||||
|
||||
namespace ProfileEvents
|
||||
@ -39,6 +35,7 @@ namespace ErrorCodes
|
||||
extern const int CANNOT_SEEK_THROUGH_FILE;
|
||||
extern const int CANNOT_SELECT;
|
||||
extern const int CANNOT_FSTAT;
|
||||
extern const int CANNOT_ADVISE;
|
||||
}
|
||||
|
||||
|
||||
@ -111,6 +108,20 @@ bool ReadBufferFromFileDescriptor::nextImpl()
|
||||
}
|
||||
|
||||
|
||||
void ReadBufferFromFileDescriptor::prefetch()
|
||||
{
|
||||
#if defined(POSIX_FADV_WILLNEED)
|
||||
/// For direct IO, loading data into page cache is pointless.
|
||||
if (required_alignment)
|
||||
return;
|
||||
|
||||
/// Ask OS to prefetch data into page cache.
|
||||
if (0 != posix_fadvise(fd, file_offset_of_buffer_end, internal_buffer.size(), POSIX_FADV_WILLNEED))
|
||||
throwFromErrno("Cannot posix_fadvise", ErrorCodes::CANNOT_ADVISE);
|
||||
#endif
|
||||
}
|
||||
|
||||
|
||||
/// If 'offset' is small enough to stay in buffer after seek, then true seek in file does not happen.
|
||||
off_t ReadBufferFromFileDescriptor::seek(off_t offset, int whence)
|
||||
{
|
||||
@ -133,16 +144,15 @@ off_t ReadBufferFromFileDescriptor::seek(off_t offset, int whence)
|
||||
if (new_pos + (working_buffer.end() - pos) == file_offset_of_buffer_end)
|
||||
return new_pos;
|
||||
|
||||
/// file_offset_of_buffer_end corresponds to working_buffer.end(); it's a past-the-end pos,
|
||||
/// so the second inequality is strict.
|
||||
if (file_offset_of_buffer_end - working_buffer.size() <= static_cast<size_t>(new_pos)
|
||||
&& new_pos < file_offset_of_buffer_end)
|
||||
&& new_pos <= file_offset_of_buffer_end)
|
||||
{
|
||||
/// Position is still inside the buffer.
|
||||
/// Probably it is at the end of the buffer - then we will load data on the following 'next' call.
|
||||
|
||||
pos = working_buffer.end() - file_offset_of_buffer_end + new_pos;
|
||||
assert(pos >= working_buffer.begin());
|
||||
assert(pos < working_buffer.end());
|
||||
assert(pos <= working_buffer.end());
|
||||
|
||||
return new_pos;
|
||||
}
|
||||
|
@ -21,6 +21,7 @@ protected:
|
||||
int fd;
|
||||
|
||||
bool nextImpl() override;
|
||||
void prefetch() override;
|
||||
|
||||
/// Name or some description of file.
|
||||
std::string getFileName() const override;
|
||||
|
@ -76,7 +76,7 @@ bool ReadBufferFromS3::nextImpl()
|
||||
ProfileEvents::increment(ProfileEvents::S3ReadMicroseconds, watch.elapsedMicroseconds());
|
||||
ProfileEvents::increment(ProfileEvents::S3ReadRequestsErrors, 1);
|
||||
|
||||
LOG_INFO(log, "Caught exception while reading S3 object. Bucket: {}, Key: {}, Offset: {}, Attempt: {}, Message: {}",
|
||||
LOG_DEBUG(log, "Caught exception while reading S3 object. Bucket: {}, Key: {}, Offset: {}, Attempt: {}, Message: {}",
|
||||
bucket, key, getPosition(), attempt, e.message());
|
||||
|
||||
if (attempt + 1 == max_single_read_retries)
|
||||
|
32 src/IO/ReadSettings.cpp Normal file
@ -0,0 +1,32 @@
#include <IO/ReadSettings.h>
#include <Common/Exception.h>

namespace DB
{

namespace ErrorCodes
{
extern const int UNKNOWN_READ_METHOD;
}

const char * toString(ReadMethod read_method)
{
switch (read_method)
{
#define CASE_READ_METHOD(NAME) case ReadMethod::NAME: return #NAME;
FOR_EACH_READ_METHOD(CASE_READ_METHOD)
#undef CASE_READ_METHOD
}
__builtin_unreachable();
}

ReadMethod parseReadMethod(const std::string & name)
{
#define CASE_READ_METHOD(NAME) if (name == #NAME) return ReadMethod::NAME;
FOR_EACH_READ_METHOD(CASE_READ_METHOD)
#undef CASE_READ_METHOD
throw Exception(ErrorCodes::UNKNOWN_READ_METHOD, "Unknown read method '{}'", name);
}

}
80 src/IO/ReadSettings.h Normal file
@ -0,0 +1,80 @@
#pragma once

#include <algorithm>
#include <cstddef>
#include <string>
#include <Core/Defines.h>


namespace DB
{

#define FOR_EACH_READ_METHOD(M) \
    /** Simple synchronous reads with 'read'. \
        Can use direct IO after specified size. Can use prefetch by asking OS to perform readahead. */ \
    M(read) \
    \
    /** Simple synchronous reads with 'pread'. \
        In contrast to 'read', shares a single file descriptor from multiple threads. \
        Can use direct IO after specified size. Can use prefetch by asking OS to perform readahead. */ \
    M(pread) \
    \
    /** Use mmap after specified size or simple synchronous reads with 'pread'. \
        Can use prefetch by asking OS to perform readahead. */ \
    M(mmap) \
    \
    /** Checks if data is in page cache with 'preadv2' on modern Linux kernels. \
        If data is in page cache, read from the same thread. \
        If not, offload IO to a separate threadpool. \
        Can do prefetch with double buffering. \
        Can use specified priorities and limit the number of concurrent reads. */ \
    M(pread_threadpool) \
    \
    /** Asynchronous reader with a fake backend that is in fact synchronous. \
        Only used for testing purposes. */ \
    M(pread_fake_async) \


enum class ReadMethod
{
#define DEFINE_READ_METHOD(NAME) NAME,
    FOR_EACH_READ_METHOD(DEFINE_READ_METHOD)
#undef DEFINE_READ_METHOD
};

const char * toString(ReadMethod read_method);
ReadMethod parseReadMethod(const std::string & name);


class MMappedFileCache;

struct ReadSettings
{
    /// Method to use when reading from local filesystem.
    ReadMethod local_fs_method = ReadMethod::pread;

    size_t local_fs_buffer_size = DBMS_DEFAULT_BUFFER_SIZE;
    size_t remote_fs_buffer_size = DBMS_DEFAULT_BUFFER_SIZE;

    bool local_fs_prefetch = false;
    bool remote_fs_prefetch = false;

    /// For 'read', 'pread' and 'pread_threadpool' methods.
    size_t direct_io_threshold = 0;

    /// For 'mmap' method.
    size_t mmap_threshold = 0;
    MMappedFileCache * mmap_cache = nullptr;

    /// For 'pread_threadpool' method. Lower value means higher priority.
    size_t priority = 0;

    ReadSettings adjustBufferSize(size_t file_size) const
    {
        ReadSettings res = *this;
        res.local_fs_buffer_size = std::min(file_size, local_fs_buffer_size);
        res.remote_fs_buffer_size = std::min(file_size, remote_fs_buffer_size);
        return res;
    }
};

}
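The point of adjustBufferSize is that small files should not get the full default read buffer: both buffer sizes are clamped to the file size. A minimal standalone sketch, using a stripped-down struct and a hypothetical constant standing in for DBMS_DEFAULT_BUFFER_SIZE (the real value comes from Core/Defines.h), illustrates the clamping:

#include <algorithm>
#include <cstddef>
#include <iostream>

/// Hypothetical stand-in for DBMS_DEFAULT_BUFFER_SIZE; the real constant is defined in Core/Defines.h.
static constexpr size_t DEFAULT_BUFFER_SIZE = 1048576;

struct MiniReadSettings
{
    size_t local_fs_buffer_size = DEFAULT_BUFFER_SIZE;
    size_t remote_fs_buffer_size = DEFAULT_BUFFER_SIZE;

    /// Same idea as ReadSettings::adjustBufferSize: never allocate a buffer larger than the file.
    MiniReadSettings adjustBufferSize(size_t file_size) const
    {
        MiniReadSettings res = *this;
        res.local_fs_buffer_size = std::min(file_size, local_fs_buffer_size);
        res.remote_fs_buffer_size = std::min(file_size, remote_fs_buffer_size);
        return res;
    }
};

int main()
{
    MiniReadSettings settings;
    MiniReadSettings adjusted = settings.adjustBufferSize(4096);    /// a 4 KiB file
    std::cout << adjusted.local_fs_buffer_size << '\n';             /// prints 4096, not the default
    return 0;
}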
89 src/IO/SynchronousReader.cpp Normal file
@ -0,0 +1,89 @@
#include <IO/SynchronousReader.h>
#include <Common/assert_cast.h>
#include <Common/Exception.h>
#include <Common/CurrentMetrics.h>
#include <Common/ProfileEvents.h>
#include <Common/Stopwatch.h>
#include <common/errnoToString.h>
#include <unordered_map>
#include <mutex>
#include <unistd.h>
#include <fcntl.h>


namespace ProfileEvents
{
    extern const Event ReadBufferFromFileDescriptorRead;
    extern const Event ReadBufferFromFileDescriptorReadFailed;
    extern const Event ReadBufferFromFileDescriptorReadBytes;
    extern const Event DiskReadElapsedMicroseconds;
    extern const Event Seek;
}

namespace CurrentMetrics
{
    extern const Metric Read;
}

namespace DB
{

namespace ErrorCodes
{
    extern const int CANNOT_READ_FROM_FILE_DESCRIPTOR;
    extern const int CANNOT_ADVISE;
}


std::future<IAsynchronousReader::Result> SynchronousReader::submit(Request request)
{
    int fd = assert_cast<const LocalFileDescriptor &>(*request.descriptor).fd;

#if defined(POSIX_FADV_WILLNEED)
    if (0 != posix_fadvise(fd, request.offset, request.size, POSIX_FADV_WILLNEED))
        throwFromErrno("Cannot posix_fadvise", ErrorCodes::CANNOT_ADVISE);
#endif

    return std::async(std::launch::deferred, [fd, request]
    {
        ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorRead);
        Stopwatch watch(CLOCK_MONOTONIC);

        size_t bytes_read = 0;
        while (!bytes_read)
        {
            ssize_t res = 0;

            {
                CurrentMetrics::Increment metric_increment{CurrentMetrics::Read};
                res = ::pread(fd, request.buf, request.size, request.offset);
            }
            if (!res)
                break;

            if (-1 == res && errno != EINTR)
            {
                ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadFailed);
                throwFromErrno(fmt::format("Cannot read from file {}", fd), ErrorCodes::CANNOT_READ_FROM_FILE_DESCRIPTOR);
            }

            if (res > 0)
                bytes_read += res;
        }

        ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadBytes, bytes_read);

        /// It reports real time spent including the time spent while thread was preempted doing nothing.
        /// And it is Ok for the purpose of this watch (it is used to lower the number of threads to read from tables).
        /// Sometimes it is better to use taskstats::blkio_delay_total, but it is quite expensive to get it
        /// (TaskStatsInfoGetter has about 500K RPS).
        watch.stop();
        ProfileEvents::increment(ProfileEvents::DiskReadElapsedMicroseconds, watch.elapsedMicroseconds());

        return bytes_read;
    });
}

}
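SynchronousReader wraps a plain pread loop (retrying on EINTR, stopping on EOF) in a std::future created with std::launch::deferred, so the read actually runs on the thread that later calls get(). A minimal self-contained sketch of that pattern, with hypothetical names and without the ClickHouse profiling counters or descriptor wrappers, might look like this:

#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <future>
#include <iostream>
#include <stdexcept>
#include <string>
#include <unistd.h>
#include <vector>

/// Reads up to `size` bytes at `offset`, retrying on EINTR as the diff does.
static std::future<size_t> submitRead(int fd, char * buf, size_t size, off_t offset)
{
    return std::async(std::launch::deferred, [fd, buf, size, offset]
    {
        size_t bytes_read = 0;
        while (!bytes_read)
        {
            ssize_t res = ::pread(fd, buf, size, offset);
            if (res == 0)
                break;                               /// end of file
            if (res == -1 && errno != EINTR)
                throw std::runtime_error(std::string("pread failed: ") + std::strerror(errno));
            if (res > 0)
                bytes_read += res;
        }
        return bytes_read;
    });
}

int main()
{
    int fd = ::open("/etc/hostname", O_RDONLY);      /// any readable file will do
    if (fd == -1)
        return 1;

    std::vector<char> buf(4096);
    auto future = submitRead(fd, buf.data(), buf.size(), 0);
    size_t n = future.get();                         /// deferred: the read runs here, on the calling thread
    std::cout << std::string(buf.data(), n);

    ::close(fd);
    return 0;
}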
Some files were not shown because too many files have changed in this diff.