Merge pull request #70135 from ClickHouse/cache-for-object-storage-table-engines
Allow caching of files read by object storage table engines and data lakes, using a hash of the ETag and file path as the cache key.
Commit: ad7c74c549
@ -63,7 +63,34 @@ Currently there are 3 ways to authenticate:
- `SAS Token` - Can be used by providing an `endpoint`, `connection_string` or `storage_account_url`. It is identified by presence of '?' in the url.
- `Workload Identity` - Can be used by providing an `endpoint` or `storage_account_url`. If `use_workload_identity` parameter is set in config, ([workload identity](https://github.com/Azure/azure-sdk-for-cpp/tree/main/sdk/identity/azure-identity#authenticate-azure-hosted-applications)) is used for authentication.

### Data cache {#data-cache}

The `Azure` table engine supports data caching on local disk.
See filesystem cache configuration options and usage in this [section](/docs/en/operations/storing-data.md/#using-local-cache).
Caching is keyed by the path and ETag of the storage object, so ClickHouse will not read a stale cached version.

To enable caching, use the settings `filesystem_cache_name = '<name>'` and `enable_filesystem_cache = 1`.

```sql
SELECT *
FROM azureBlobStorage('DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite1:10000/devstoreaccount1/;', 'test_container', 'test_table', 'CSV')
SETTINGS filesystem_cache_name = 'cache_for_azure', enable_filesystem_cache = 1;
```

There are two ways to define a cache in the configuration file:

1. Add the following section to the ClickHouse configuration file:

``` xml
<clickhouse>
    <filesystem_caches>
        <cache_for_azure>
            <path>path to cache directory</path>
            <max_size>10Gi</max_size>
        </cache_for_azure>
    </filesystem_caches>
</clickhouse>
```

2. Reuse cache configuration (and therefore cache storage) from the ClickHouse `storage_configuration` section, [described here](/docs/en/operations/storing-data.md/#using-local-cache).

## See also
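The documentation text above says the cache is keyed by both the object path and its ETag, which is why a re-uploaded object (new ETag) can never hit a stale cache entry. Below is a small standalone C++ sketch of that keying idea, for illustration only: `std::hash` and a 64-bit key stand in for the 128-bit SipHash and `FileCacheKey::fromKey` that the actual change uses further down in this diff.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// Combine the object path and its ETag into a single cache key, so that a
// changed object (new ETag) maps to a different key and stale cached bytes
// are never served. std::hash is a stand-in for the 128-bit SipHash used by
// the real implementation.
static std::uint64_t makeCacheKey(const std::string & path, const std::string & etag)
{
    std::uint64_t h = std::hash<std::string>{}(path);
    // Simple hash combination (boost-style); the real code feeds both values
    // into one SipHash state instead.
    h ^= std::hash<std::string>{}(etag) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    return h;
}

int main()
{
    const std::string path = "test_container/test_table/data.csv";
    std::cout << std::hex
              << makeCacheKey(path, "\"0x8D9F1\"") << '\n'   // initial upload
              << makeCacheKey(path, "\"0x8DA02\"") << '\n';  // same path, new ETag -> new key
}
```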
@ -48,6 +48,10 @@ Using named collections:
CREATE TABLE deltalake ENGINE=DeltaLake(deltalake_conf, filename = 'test_table')
```

### Data cache {#data-cache}

The `DeltaLake` table engine and table function support data caching, same as the `S3`, `AzureBlobStorage`, and `HDFS` storages. See [here](../../../engines/table-engines/integrations/s3.md#data-cache).

## See also

- [deltaLake table function](../../../sql-reference/table-functions/deltalake.md)
@ -63,6 +63,10 @@ CREATE TABLE iceberg_table ENGINE=IcebergS3(iceberg_conf, filename = 'test_table

The `Iceberg` table engine is now an alias for `IcebergS3`.

### Data cache {#data-cache}

The `Iceberg` table engine and table function support data caching, same as the `S3`, `AzureBlobStorage`, and `HDFS` storages. See [here](../../../engines/table-engines/integrations/s3.md#data-cache).

## See also

- [iceberg table function](/docs/en/sql-reference/table-functions/iceberg.md)
@ -26,6 +26,7 @@ SELECT * FROM s3_engine_table LIMIT 2;
│ two  │     2 │
└──────┴───────┘
```

## Create Table {#creating-a-table}

``` sql
@ -43,6 +44,37 @@ CREATE TABLE s3_engine_table (name String, value UInt32)
- `aws_access_key_id`, `aws_secret_access_key` - Long-term credentials for the [AWS](https://aws.amazon.com/) account user. You can use these to authenticate your requests. Parameter is optional. If credentials are not specified, they are used from the configuration file. For more information see [Using S3 for Data Storage](../mergetree-family/mergetree.md#table_engine-mergetree-s3).
- `compression` — Compression type. Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. Parameter is optional. By default, it will auto-detect compression by file extension.

### Data cache {#data-cache}

The `S3` table engine supports data caching on local disk.
See filesystem cache configuration options and usage in this [section](/docs/en/operations/storing-data.md/#using-local-cache).
Caching is keyed by the path and ETag of the storage object, so ClickHouse will not read a stale cached version.

To enable caching, use the settings `filesystem_cache_name = '<name>'` and `enable_filesystem_cache = 1`.

```sql
SELECT *
FROM s3('http://minio:10000/clickhouse//test_3.csv', 'minioadmin', 'minioadminpassword', 'CSV')
SETTINGS filesystem_cache_name = 'cache_for_s3', enable_filesystem_cache = 1;
```

There are two ways to define a cache in the configuration file:

1. Add the following section to the ClickHouse configuration file:

``` xml
<clickhouse>
    <filesystem_caches>
        <cache_for_s3>
            <path>path to cache directory</path>
            <max_size>10Gi</max_size>
        </cache_for_s3>
    </filesystem_caches>
</clickhouse>
```

2. Reuse cache configuration (and therefore cache storage) from the ClickHouse `storage_configuration` section, [described here](/docs/en/operations/storing-data.md/#using-local-cache).

### PARTITION BY

`PARTITION BY` — Optional. In most cases you don't need a partition key, and if it is needed you generally don't need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use too granular partitioning. Don't partition your data by client identifiers or names (instead, make client identifier or name the first column in the ORDER BY expression).
@ -1496,6 +1496,8 @@ try

NamedCollectionFactory::instance().loadIfNot();

FileCacheFactory::instance().loadDefaultCaches(config());

/// Initialize main config reloader.
std::string include_from_path = config().getString("include_from", "/etc/metrika.xml");
@ -4812,6 +4812,9 @@ Max attempts to read with backoff
)", 0) \
M(Bool, enable_filesystem_cache, true, R"(
Use cache for remote filesystem. This setting does not turn on/off cache for disks (must be done via disk config), but allows to bypass cache for some queries if intended
)", 0) \
M(String, filesystem_cache_name, "", R"(
Filesystem cache name to use for stateless table engines or data lakes
)", 0) \
M(Bool, enable_filesystem_cache_on_write_operations, false, R"(
Write into cache on write operations. To actually work this setting requires be added to disk config too
@ -73,6 +73,7 @@ static std::initializer_list<std::pair<ClickHouseVersion, SettingsChangesHistory
{"mongodb_throw_on_unsupported_query", false, true, "New setting."},
{"enable_parallel_replicas", false, false, "Parallel replicas with read tasks became the Beta tier feature."},
{"parallel_replicas_mode", "read_tasks", "read_tasks", "This setting was introduced as a part of making parallel replicas feature Beta"},
{"filesystem_cache_name", "", "", "Filesystem cache name to use for stateless table engines or data lakes"},
{"restore_replace_external_dictionary_source_to_null", false, false, "New setting."},
{"show_create_query_identifier_quoting_rule", "when_necessary", "when_necessary", "New setting."},
{"show_create_query_identifier_quoting_style", "Backticks", "Backticks", "New setting."},
@ -31,7 +31,7 @@ CachedObjectStorage::CachedObjectStorage(

FileCache::Key CachedObjectStorage::getCacheKey(const std::string & path) const
{
return cache->createKeyForPath(path);
return FileCacheKey::fromPath(path);
}

ObjectStorageKey
@ -71,7 +71,7 @@ std::unique_ptr<ReadBufferFromFileBase> CachedObjectStorage::readObject( /// NOL
{
if (cache->isInitialized())
{
auto cache_key = cache->createKeyForPath(object.remote_path);
auto cache_key = FileCacheKey::fromPath(object.remote_path);
auto global_context = Context::getGlobalContextInstance();
auto modified_read_settings = read_settings.withNestedBuffer();
@ -122,11 +122,6 @@ FileCache::FileCache(const std::string & cache_name, const FileCacheSettings & s
query_limit = std::make_unique<FileCacheQueryLimit>();
}

FileCache::Key FileCache::createKeyForPath(const String & path)
{
return Key(path);
}

const FileCache::UserInfo & FileCache::getCommonUser()
{
static UserInfo user(getCommonUserID(), 0);
@ -1168,7 +1163,7 @@ void FileCache::removeFileSegment(const Key & key, size_t offset, const UserID &

void FileCache::removePathIfExists(const String & path, const UserID & user_id)
{
removeKeyIfExists(createKeyForPath(path), user_id);
removeKeyIfExists(Key::fromPath(path), user_id);
}

void FileCache::removeAllReleasable(const UserID & user_id)
@ -88,8 +88,6 @@ public:

const String & getBasePath() const;

static Key createKeyForPath(const String & path);

static const UserInfo & getCommonUser();

static const UserInfo & getInternalUser();
@ -1,5 +1,6 @@
#include "FileCacheFactory.h"
#include "FileCache.h"
#include <Poco/Util/AbstractConfiguration.h>

namespace DB
{
@ -43,6 +44,16 @@ FileCacheFactory::CacheByName FileCacheFactory::getAll()
return caches_by_name;
}

FileCachePtr FileCacheFactory::get(const std::string & cache_name)
{
std::lock_guard lock(mutex);

auto it = caches_by_name.find(cache_name);
if (it == caches_by_name.end())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "There is no cache by name `{}`", cache_name);
return it->second->cache;
}

FileCachePtr FileCacheFactory::getOrCreate(
const std::string & cache_name,
const FileCacheSettings & file_cache_settings,
@ -202,4 +213,20 @@ void FileCacheFactory::clear()
caches_by_name.clear();
}

void FileCacheFactory::loadDefaultCaches(const Poco::Util::AbstractConfiguration & config)
{
Poco::Util::AbstractConfiguration::Keys cache_names;
config.keys(FILECACHE_DEFAULT_CONFIG_PATH, cache_names);
auto * log = &Poco::Logger::get("FileCacheFactory");
LOG_DEBUG(log, "Will load {} caches from default cache config", cache_names.size());
for (const auto & name : cache_names)
{
FileCacheSettings settings;
const auto & config_path = fmt::format("{}.{}", FILECACHE_DEFAULT_CONFIG_PATH, name);
settings.loadFromConfig(config, config_path);
auto cache = getOrCreate(name, settings, config_path);
cache->initialize();
LOG_DEBUG(log, "Loaded cache `{}` from default cache config", name);
}
}
}
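For orientation, here is a hedged, standalone sketch of what `loadDefaultCaches` above accomplishes: every entry under the top-level `filesystem_caches` config section becomes a named cache that is created and initialized at startup, so a query can later refer to it with `SETTINGS filesystem_cache_name = '<name>'`. The types (`CacheRegistry`, `CacheSettings`) and the `std::map` standing in for `Poco::Util::AbstractConfiguration` are inventions for this sketch, not ClickHouse APIs.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Invented stand-ins: a flat settings struct and a cache that only records
// whether it has been initialized.
struct CacheSettings { std::string path; std::uint64_t max_size = 0; };
struct Cache { CacheSettings settings; bool initialized = false; void initialize() { initialized = true; } };

class CacheRegistry
{
public:
    std::shared_ptr<Cache> getOrCreate(const std::string & name, const CacheSettings & settings)
    {
        auto & cache = caches[name];
        if (!cache)
            cache = std::make_shared<Cache>(Cache{settings, false});
        return cache;
    }

    // Analogue of loadDefaultCaches: one named, pre-initialized cache per
    // config entry, addressable by name later on.
    void loadDefaultCaches(const std::map<std::string, CacheSettings> & config_section)
    {
        for (const auto & [name, settings] : config_section)
        {
            auto cache = getOrCreate(name, settings);
            cache->initialize();
            std::cout << "Loaded cache `" << name << "` from default cache config\n";
        }
    }

private:
    std::map<std::string, std::shared_ptr<Cache>> caches;
};

int main()
{
    CacheRegistry registry;
    registry.loadDefaultCaches({{"cache_for_s3", {"/var/lib/clickhouse/cache_s3", 10ULL << 30}}});
}
```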
@ -44,6 +44,8 @@ public:
const FileCacheSettings & file_cache_settings,
const std::string & config_path);

FileCachePtr get(const std::string & cache_name);

FileCachePtr create(
const std::string & cache_name,
const FileCacheSettings & file_cache_settings,
@ -53,8 +55,12 @@ public:

FileCacheDataPtr getByName(const std::string & cache_name);

void loadDefaultCaches(const Poco::Util::AbstractConfiguration & config);

void updateSettingsFromConfig(const Poco::Util::AbstractConfiguration & config);

void remove(FileCachePtr cache);

void clear();

private:
@ -12,11 +12,6 @@ namespace ErrorCodes
extern const int BAD_ARGUMENTS;
}

FileCacheKey::FileCacheKey(const std::string & path)
: key(sipHash128(path.data(), path.size()))
{
}

FileCacheKey::FileCacheKey(const UInt128 & key_)
: key(key_)
{
@ -32,6 +27,16 @@ FileCacheKey FileCacheKey::random()
return FileCacheKey(UUIDHelpers::generateV4().toUnderType());
}

FileCacheKey FileCacheKey::fromPath(const std::string & path)
{
return FileCacheKey(sipHash128(path.data(), path.size()));
}

FileCacheKey FileCacheKey::fromKey(const UInt128 & key)
{
return FileCacheKey(key);
}

FileCacheKey FileCacheKey::fromKeyString(const std::string & key_str)
{
if (key_str.size() != 32)
@ -14,16 +14,16 @@ struct FileCacheKey

FileCacheKey() = default;

explicit FileCacheKey(const std::string & path);

explicit FileCacheKey(const UInt128 & key_);

static FileCacheKey random();
static FileCacheKey fromPath(const std::string & path);
static FileCacheKey fromKey(const UInt128 & key);
static FileCacheKey fromKeyString(const std::string & key_str);

bool operator==(const FileCacheKey & other) const { return key == other.key; }
bool operator<(const FileCacheKey & other) const { return key < other.key; }

static FileCacheKey fromKeyString(const std::string & key_str);
private:
explicit FileCacheKey(const UInt128 & key_);
};

using FileCacheKeyAndOffset = std::pair<FileCacheKey, size_t>;
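The header change above replaces the public `FileCacheKey` constructors with named factory methods and moves the raw `UInt128` constructor into the private section. A standalone sketch of that named-constructor pattern, with `std::uint64_t`/`std::hash` standing in for the real `UInt128`/`sipHash128` and with invented names (`CacheKey`), might look like this:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Callers must say *how* the key is derived (fromPath vs fromKey); the
// raw-value constructor is private, so an unhashed path can no longer be
// passed where an already-computed hash is expected.
class CacheKey
{
public:
    static CacheKey fromPath(const std::string & path)
    {
        return CacheKey(std::hash<std::string>{}(path)); // hash the path here
    }

    static CacheKey fromKey(std::uint64_t precomputed_hash)
    {
        return CacheKey(precomputed_hash); // caller already hashed (e.g. path + ETag)
    }

    bool operator==(const CacheKey & other) const { return key == other.key; }

private:
    explicit CacheKey(std::uint64_t key_) : key(key_) {}
    std::uint64_t key;
};

int main()
{
    auto a = CacheKey::fromPath("some/object");
    auto b = CacheKey::fromKey(0xABCDEFULL);
    (void)(a == b);
}
```

The point of the pattern is that call sites state explicitly whether a value is a path still to be hashed or an already-computed hash, which is what lets the new `createReadBuffer` code pass a combined path+ETag hash through `fromKey`.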
@ -15,10 +15,12 @@ static constexpr size_t FILECACHE_BYPASS_THRESHOLD = 256 * 1024 * 1024;
static constexpr double FILECACHE_DEFAULT_FREE_SPACE_SIZE_RATIO = 0; /// Disabled.
static constexpr double FILECACHE_DEFAULT_FREE_SPACE_ELEMENTS_RATIO = 0; /// Disabled.
static constexpr int FILECACHE_DEFAULT_FREE_SPACE_REMOVE_BATCH = 10;
static constexpr auto FILECACHE_DEFAULT_CONFIG_PATH = "filesystem_caches";

class FileCache;
using FileCachePtr = std::shared_ptr<FileCache>;

struct FileCacheSettings;
struct FileCacheKey;

}
@ -372,7 +372,7 @@ TEST_F(FileCacheTest, LRUPolicy)
std::cerr << "Step 1\n";
auto cache = DB::FileCache("1", settings);
cache.initialize();
auto key = DB::FileCache::createKeyForPath("key1");
auto key = DB::FileCacheKey::fromPath("key1");

auto get_or_set = [&](size_t offset, size_t size)
{
@ -736,7 +736,7 @@ TEST_F(FileCacheTest, LRUPolicy)

auto cache2 = DB::FileCache("2", settings);
cache2.initialize();
auto key = DB::FileCache::createKeyForPath("key1");
auto key = DB::FileCacheKey::fromPath("key1");

/// Get [2, 29]
assertEqual(
@ -755,7 +755,7 @@ TEST_F(FileCacheTest, LRUPolicy)
fs::create_directories(settings2.base_path);
auto cache2 = DB::FileCache("3", settings2);
cache2.initialize();
auto key = DB::FileCache::createKeyForPath("key1");
auto key = DB::FileCacheKey::fromPath("key1");

/// Get [0, 24]
assertEqual(
@ -770,7 +770,7 @@ TEST_F(FileCacheTest, LRUPolicy)

auto cache = FileCache("4", settings);
cache.initialize();
const auto key = FileCache::createKeyForPath("key10");
const auto key = FileCacheKey::fromPath("key10");
const auto key_path = cache.getKeyPath(key, user);

cache.removeAllReleasable(user.user_id);
@ -794,7 +794,7 @@ TEST_F(FileCacheTest, LRUPolicy)

auto cache = DB::FileCache("5", settings);
cache.initialize();
const auto key = FileCache::createKeyForPath("key10");
const auto key = FileCacheKey::fromPath("key10");
const auto key_path = cache.getKeyPath(key, user);

cache.removeAllReleasable(user.user_id);
@ -833,7 +833,7 @@ TEST_F(FileCacheTest, writeBuffer)
segment_settings.kind = FileSegmentKind::Ephemeral;
segment_settings.unbounded = true;

auto cache_key = FileCache::createKeyForPath(key);
auto cache_key = FileCacheKey::fromPath(key);
auto holder = cache.set(cache_key, 0, 3, segment_settings, user);
/// The same is done in TemporaryDataOnDisk::createStreamToCacheFile.
std::filesystem::create_directories(cache.getKeyPath(cache_key, user));
@ -961,7 +961,7 @@ TEST_F(FileCacheTest, temporaryData)
const auto user = FileCache::getCommonUser();
auto tmp_data_scope = std::make_shared<TemporaryDataOnDiskScope>(nullptr, &file_cache, TemporaryDataOnDiskSettings{});

auto some_data_holder = file_cache.getOrSet(FileCache::createKeyForPath("some_data"), 0, 5_KiB, 5_KiB, CreateFileSegmentSettings{}, 0, user);
auto some_data_holder = file_cache.getOrSet(FileCacheKey::fromPath("some_data"), 0, 5_KiB, 5_KiB, CreateFileSegmentSettings{}, 0, user);

{
ASSERT_EQ(some_data_holder->size(), 5);
@ -1103,7 +1103,7 @@ TEST_F(FileCacheTest, CachedReadBuffer)
auto cache = std::make_shared<DB::FileCache>("8", settings);
cache->initialize();

auto key = cache->createKeyForPath(file_path);
auto key = DB::FileCacheKey::fromPath(file_path);
const auto user = FileCache::getCommonUser();

{
@ -1219,7 +1219,7 @@ TEST_F(FileCacheTest, SLRUPolicy)
{
auto cache = DB::FileCache(std::to_string(++file_cache_name), settings);
cache.initialize();
auto key = FileCache::createKeyForPath("key1");
auto key = FileCacheKey::fromPath("key1");

auto add_range = [&](size_t offset, size_t size)
{
@ -1342,7 +1342,7 @@ TEST_F(FileCacheTest, SLRUPolicy)

std::string data1(15, '*');
auto file1 = write_file("test1", data1);
auto key1 = cache->createKeyForPath(file1);
auto key1 = DB::FileCacheKey::fromPath(file1);

read_and_check(file1, key1, data1);

@ -1358,7 +1358,7 @@ TEST_F(FileCacheTest, SLRUPolicy)

std::string data2(10, '*');
auto file2 = write_file("test2", data2);
auto key2 = cache->createKeyForPath(file2);
auto key2 = DB::FileCacheKey::fromPath(file2);

read_and_check(file2, key2, data2);
@ -14,6 +14,7 @@
#include <IO/ReadBufferFromFileBase.h>
#include <IO/ReadHelpers.h>
#include <Storages/ObjectStorage/DataLakes/Common.h>
#include <Storages/ObjectStorage/StorageObjectStorageSource.h>

#include <Processors/Formats/Impl/ArrowBufferedStreams.h>
#include <Processors/Formats/Impl/ParquetBlockInputFormat.h>
@ -185,7 +186,8 @@ struct DeltaLakeMetadataImpl
std::set<String> & result)
{
auto read_settings = context->getReadSettings();
auto buf = object_storage->readObject(StoredObject(metadata_file_path), read_settings);
StorageObjectStorageSource::ObjectInfo object_info(metadata_file_path);
auto buf = StorageObjectStorageSource::createReadBuffer(object_info, object_storage, context, log);

char c;
while (!buf->eof())
@ -492,7 +494,8 @@ struct DeltaLakeMetadataImpl

String json_str;
auto read_settings = context->getReadSettings();
auto buf = object_storage->readObject(StoredObject(last_checkpoint_file), read_settings);
StorageObjectStorageSource::ObjectInfo object_info(last_checkpoint_file);
auto buf = StorageObjectStorageSource::createReadBuffer(object_info, object_storage, context, log);
readJSONObjectPossiblyInvalid(json_str, *buf);

const JSON json(json_str);
@ -557,7 +560,8 @@ struct DeltaLakeMetadataImpl
LOG_TRACE(log, "Using checkpoint file: {}", checkpoint_path.string());

auto read_settings = context->getReadSettings();
auto buf = object_storage->readObject(StoredObject(checkpoint_path), read_settings);
StorageObjectStorageSource::ObjectInfo object_info(checkpoint_path);
auto buf = StorageObjectStorageSource::createReadBuffer(object_info, object_storage, context, log);
auto format_settings = getFormatSettings(context);

/// Force nullable, because this parquet file for some reason does not have nullable
@ -26,6 +26,7 @@
#include <Processors/Formats/Impl/AvroRowInputFormat.h>
#include <Storages/ObjectStorage/DataLakes/IcebergMetadata.h>
#include <Storages/ObjectStorage/DataLakes/Common.h>
#include <Storages/ObjectStorage/StorageObjectStorageSource.h>

#include <Poco/JSON/Array.h>
#include <Poco/JSON/Object.h>
@ -387,9 +388,13 @@ DataLakeMetadataPtr IcebergMetadata::create(
ContextPtr local_context)
{
const auto [metadata_version, metadata_file_path] = getMetadataFileAndVersion(object_storage, *configuration);
LOG_DEBUG(getLogger("IcebergMetadata"), "Parse metadata {}", metadata_file_path);
auto read_settings = local_context->getReadSettings();
auto buf = object_storage->readObject(StoredObject(metadata_file_path), read_settings);

auto log = getLogger("IcebergMetadata");
LOG_DEBUG(log, "Parse metadata {}", metadata_file_path);

StorageObjectStorageSource::ObjectInfo object_info(metadata_file_path);
auto buf = StorageObjectStorageSource::createReadBuffer(object_info, object_storage, local_context, log);

String json_str;
readJSONObjectPossiblyInvalid(json_str, *buf);

@ -456,8 +461,8 @@ Strings IcebergMetadata::getDataFiles() const
LOG_TEST(log, "Collect manifest files from manifest list {}", manifest_list_file);

auto context = getContext();
auto read_settings = context->getReadSettings();
auto manifest_list_buf = object_storage->readObject(StoredObject(manifest_list_file), read_settings);
StorageObjectStorageSource::ObjectInfo object_info(manifest_list_file);
auto manifest_list_buf = StorageObjectStorageSource::createReadBuffer(object_info, object_storage, context, log);
auto manifest_list_file_reader = std::make_unique<avro::DataFileReaderBase>(std::make_unique<AvroInputStreamReadBufferAdapter>(*manifest_list_buf));

auto data_type = AvroSchemaReader::avroNodeToDataType(manifest_list_file_reader->dataSchema().root()->leafAt(0));
@ -487,7 +492,8 @@ Strings IcebergMetadata::getDataFiles() const
{
LOG_TEST(log, "Process manifest file {}", manifest_file);

auto buffer = object_storage->readObject(StoredObject(manifest_file), read_settings);
StorageObjectStorageSource::ObjectInfo manifest_object_info(manifest_file);
auto buffer = StorageObjectStorageSource::createReadBuffer(manifest_object_info, object_storage, context, log);
auto manifest_file_reader = std::make_unique<avro::DataFileReaderBase>(std::make_unique<AvroInputStreamReadBufferAdapter>(*buffer));

/// Manifest file should always have table schema in avro file metadata. By now we don't support tables with evolved schema,
@ -150,7 +150,7 @@ std::unique_ptr<ReadBuffer> ReadBufferIterator::recreateLastReadBuffer()
auto context = getContext();

const auto & path = current_object_info->isArchive() ? current_object_info->getPathToArchive() : current_object_info->getPath();
auto impl = object_storage->readObject(StoredObject(path), context->getReadSettings());
auto impl = StorageObjectStorageSource::createReadBuffer(*current_object_info, object_storage, context, getLogger("ReadBufferIterator"));

const auto compression_method = chooseCompressionMethod(current_object_info->getFileName(), configuration->compression_method);
const auto zstd_window = static_cast<int>(context->getSettingsRef()[Setting::zstd_window_log_max]);
@ -276,11 +276,7 @@ ReadBufferIterator::Data ReadBufferIterator::next()
else
{
compression_method = chooseCompressionMethod(filename, configuration->compression_method);
read_buf = object_storage->readObject(
StoredObject(current_object_info->getPath()),
getContext()->getReadSettings(),
{},
current_object_info->metadata->size_bytes);
read_buf = StorageObjectStorageSource::createReadBuffer(*current_object_info, object_storage, getContext(), getLogger("ReadBufferIterator"));
}

if (!query_settings.skip_empty_files || !read_buf->eof())
@ -7,6 +7,9 @@
#include <Processors/Executors/PullingPipelineExecutor.h>
#include <Processors/Transforms/ExtractColumnsTransform.h>
#include <IO/ReadBufferFromFileBase.h>
#include <Interpreters/Cache/FileCacheFactory.h>
#include <Interpreters/Cache/FileCache.h>
#include <Disks/IO/CachedOnDiskReadBufferFromFile.h>
#include <IO/Archives/createArchiveReader.h>
#include <Formats/FormatFactory.h>
#include <Disks/IO/AsynchronousBoundedReadBuffer.h>
@ -37,6 +40,7 @@ namespace Setting
extern const SettingsUInt64 max_download_buffer_size;
extern const SettingsMaxThreads max_threads;
extern const SettingsBool use_cache_for_count_from_files;
extern const SettingsString filesystem_cache_name;
}

namespace ErrorCodes
@ -420,44 +424,110 @@ std::future<StorageObjectStorageSource::ReaderHolder> StorageObjectStorageSource
return create_reader_scheduler([=, this] { return createReader(); }, Priority{});
}

std::unique_ptr<ReadBuffer> StorageObjectStorageSource::createReadBuffer(
const ObjectInfo & object_info, const ObjectStoragePtr & object_storage, const ContextPtr & context_, const LoggerPtr & log)
std::unique_ptr<ReadBufferFromFileBase> StorageObjectStorageSource::createReadBuffer(
ObjectInfo & object_info, const ObjectStoragePtr & object_storage, const ContextPtr & context_, const LoggerPtr & log)
{
const auto & settings = context_->getSettingsRef();
const auto & read_settings = context_->getReadSettings();

const auto filesystem_cache_name = settings[Setting::filesystem_cache_name].value;
bool use_cache = read_settings.enable_filesystem_cache
&& !filesystem_cache_name.empty()
&& (object_storage->getType() == ObjectStorageType::Azure
|| object_storage->getType() == ObjectStorageType::S3);

if (!object_info.metadata)
{
if (!use_cache)
{
return object_storage->readObject(StoredObject(object_info.getPath()), read_settings);
}
object_info.metadata = object_storage->getObjectMetadata(object_info.getPath());
}

const auto & object_size = object_info.metadata->size_bytes;

auto read_settings = context_->getReadSettings().adjustBufferSize(object_size);
auto modified_read_settings = read_settings.adjustBufferSize(object_size);
/// FIXME: Changing this setting to default value breaks something around parquet reading
read_settings.remote_read_min_bytes_for_seek = read_settings.remote_fs_buffer_size;
modified_read_settings.remote_read_min_bytes_for_seek = modified_read_settings.remote_fs_buffer_size;
/// User's object may change, don't cache it.
read_settings.enable_filesystem_cache = false;
read_settings.use_page_cache_for_disks_without_file_cache = false;

const bool object_too_small = object_size <= 2 * context_->getSettingsRef()[Setting::max_download_buffer_size];
const bool use_prefetch = object_too_small
&& read_settings.remote_fs_method == RemoteFSReadMethod::threadpool
&& read_settings.remote_fs_prefetch;

if (use_prefetch)
read_settings.remote_read_buffer_use_external_buffer = true;

auto impl = object_storage->readObject(StoredObject(object_info.getPath(), "", object_size), read_settings);
modified_read_settings.use_page_cache_for_disks_without_file_cache = false;

// Create a read buffer that will prefetch the first ~1 MB of the file.
// When reading lots of tiny files, this prefetching almost doubles the throughput.
// For bigger files, parallel reading is more useful.
if (!use_prefetch)
const bool object_too_small = object_size <= 2 * context_->getSettingsRef()[Setting::max_download_buffer_size];
const bool use_prefetch = object_too_small
&& modified_read_settings.remote_fs_method == RemoteFSReadMethod::threadpool
&& modified_read_settings.remote_fs_prefetch;

/// FIXME: Use async buffer if use_cache,
/// because CachedOnDiskReadBufferFromFile does not work as an independent buffer currently.
const bool use_async_buffer = use_prefetch || use_cache;

if (use_async_buffer)
modified_read_settings.remote_read_buffer_use_external_buffer = true;

std::unique_ptr<ReadBufferFromFileBase> impl;
if (use_cache)
{
if (object_info.metadata->etag.empty())
{
LOG_WARNING(log, "Cannot use filesystem cache, no etag specified");
}
else
{
SipHash hash;
hash.update(object_info.getPath());
hash.update(object_info.metadata->etag);

const auto cache_key = FileCacheKey::fromKey(hash.get128());
auto cache = FileCacheFactory::instance().get(filesystem_cache_name);

auto read_buffer_creator = [path = object_info.getPath(), object_size, modified_read_settings, object_storage]()
{
return object_storage->readObject(StoredObject(path, "", object_size), modified_read_settings);
};

impl = std::make_unique<CachedOnDiskReadBufferFromFile>(
object_info.getPath(),
cache_key,
cache,
FileCache::getCommonUser(),
read_buffer_creator,
modified_read_settings,
std::string(CurrentThread::getQueryId()),
object_size,
/* allow_seeks */true,
/* use_external_buffer */true,
/* read_until_position */std::nullopt,
context_->getFilesystemCacheLog());

LOG_TEST(log, "Using filesystem cache `{}` (path: {}, etag: {}, hash: {})",
filesystem_cache_name, object_info.getPath(),
object_info.metadata->etag, toString(hash.get128()));
}
}

if (!impl)
impl = object_storage->readObject(StoredObject(object_info.getPath(), "", object_size), modified_read_settings);

if (!use_async_buffer)
return impl;

LOG_TRACE(log, "Downloading object of size {} with initial prefetch", object_size);

auto & reader = context_->getThreadPoolReader(FilesystemReaderType::ASYNCHRONOUS_REMOTE_FS_READER);
impl = std::make_unique<AsynchronousBoundedReadBuffer>(
std::move(impl), reader, read_settings,
std::move(impl), reader, modified_read_settings,
context_->getAsyncReadCounters(),
context_->getFilesystemReadPrefetchesLog());

impl->setReadUntilEnd();
impl->prefetch(DEFAULT_PREFETCH_PRIORITY);
if (use_prefetch)
{
impl->setReadUntilEnd();
impl->prefetch(DEFAULT_PREFETCH_PRIORITY);
}
return impl;
}

@ -787,8 +857,7 @@ StorageObjectStorageSource::ArchiveIterator::createArchiveReader(ObjectInfoPtr o
/* path_to_archive */object_info->getPath(),
/* archive_read_function */[=, this]()
{
StoredObject stored_object(object_info->getPath(), "", size);
return object_storage->readObject(stored_object, getContext()->getReadSettings());
return StorageObjectStorageSource::createReadBuffer(*object_info, object_storage, getContext(), logger);
},
/* archive_size */size);
}
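To summarize the control flow that the new `createReadBuffer` implements, the sketch below models the choice between a plain remote read, a prefetched read for small objects, and a cache-backed read. It is a simplified standalone model under stated assumptions: the enum, struct, and function names are invented for illustration, the real code constructs `CachedOnDiskReadBufferFromFile` / `AsynchronousBoundedReadBuffer` objects instead of returning a tag, and in the real function a missing ETag only logs a warning and falls back rather than being checked up front.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Invented tags standing in for the three buffer configurations built above.
enum class BufferPlan { PlainRemote, PrefetchedRemote, CachedRemote };

struct ObjectInfoModel
{
    std::string etag;          // empty when the storage did not report one
    std::uint64_t size = 0;    // object size in bytes
};

BufferPlan chooseBuffer(
    const ObjectInfoModel & object,
    bool cache_enabled,              // enable_filesystem_cache + non-empty filesystem_cache_name, S3/Azure only
    std::uint64_t max_download_buffer_size,
    bool threadpool_reader_with_prefetch)
{
    // Caching needs the ETag: it is part of the cache key, so a changed
    // object never hits a stale entry.
    if (cache_enabled && !object.etag.empty())
        return BufferPlan::CachedRemote;

    // Small objects are cheaper to prefetch whole; large ones benefit more
    // from parallel range reads.
    const bool object_too_small = object.size <= 2 * max_download_buffer_size;
    if (object_too_small && threadpool_reader_with_prefetch)
        return BufferPlan::PrefetchedRemote;

    return BufferPlan::PlainRemote;
}

int main()
{
    std::cout << static_cast<int>(chooseBuffer({"\"etag\"", 1 << 20}, true, 10 << 20, true)) << '\n';  // cached
    std::cout << static_cast<int>(chooseBuffer({"", 1 << 20}, false, 10 << 20, true)) << '\n';         // prefetched
    std::cout << static_cast<int>(chooseBuffer({"", 1 << 30}, false, 10 << 20, true)) << '\n';         // plain
}
```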
@ -66,6 +66,11 @@ public:
const ObjectInfo & object_info,
bool include_connection_info = true);

static std::unique_ptr<ReadBufferFromFileBase> createReadBuffer(
ObjectInfo & object_info,
const ObjectStoragePtr & object_storage,
const ContextPtr & context_,
const LoggerPtr & log);
protected:
const String name;
ObjectStoragePtr object_storage;
@ -135,11 +140,6 @@ protected:
ReaderHolder createReader();

std::future<ReaderHolder> createReaderAsync();
static std::unique_ptr<ReadBuffer> createReadBuffer(
const ObjectInfo & object_info,
const ObjectStoragePtr & object_storage,
const ContextPtr & context_,
const LoggerPtr & log);

void addNumRowsToCache(const ObjectInfo & object_info, size_t num_rows);
void lazyInitialize();
@ -401,7 +401,7 @@ Chunk SystemRemoteDataPathsSource::generate()

if (cache)
{
auto cache_paths = cache->tryGetCachePaths(cache->createKeyForPath(object.remote_path));
auto cache_paths = cache->tryGetCachePaths(FileCacheKey::fromPath(object.remote_path));
col_cache_paths->insert(Array(cache_paths.begin(), cache_paths.end()));
}
else
@ -1412,7 +1412,7 @@ def test_parallel_read(cluster):

    res = azure_query(
        node,
        f"select count() from azureBlobStorage('{connection_string}', 'cont', 'test_parallel_read.parquet')",
        f"select count() from azureBlobStorage('{connection_string}', 'cont', 'test_parallel_read.parquet') settings remote_filesystem_read_method='read'",
    )
    assert int(res) == 10000
    assert_logs_contain_with_retry(node, "AzureBlobStorage readBigAt read bytes")
@ -0,0 +1,8 @@
<clickhouse>
    <filesystem_caches>
        <cache1>
            <max_size>1Gi</max_size>
            <path>cache1</path>
        </cache1>
    </filesystem_caches>
</clickhouse>
@ -5,6 +5,7 @@ import os
import random
import string
import time
import uuid
from datetime import datetime

import delta
@ -70,7 +71,10 @@ def started_cluster():
    cluster = ClickHouseCluster(__file__, with_spark=True)
    cluster.add_instance(
        "node1",
        main_configs=["configs/config.d/named_collections.xml"],
        main_configs=[
            "configs/config.d/named_collections.xml",
            "configs/config.d/filesystem_caches.xml",
        ],
        user_configs=["configs/users.d/users.xml"],
        with_minio=True,
        stay_alive=True,
@ -826,3 +830,64 @@ def test_complex_types(started_cluster):
            f"SELECT metadata FROM deltaLake('http://{started_cluster.minio_ip}:{started_cluster.minio_port}/root/{table_name}' , 'minio', 'minio123')"
        )
    )


@pytest.mark.parametrize("storage_type", ["s3"])
def test_filesystem_cache(started_cluster, storage_type):
    instance = started_cluster.instances["node1"]
    spark = started_cluster.spark_session
    minio_client = started_cluster.minio_client
    TABLE_NAME = randomize_table_name("test_filesystem_cache")
    bucket = started_cluster.minio_bucket

    if not minio_client.bucket_exists(bucket):
        minio_client.make_bucket(bucket)

    parquet_data_path = create_initial_data_file(
        started_cluster,
        instance,
        "SELECT number, toString(number) FROM numbers(100)",
        TABLE_NAME,
    )

    write_delta_from_file(spark, parquet_data_path, f"/{TABLE_NAME}")
    upload_directory(minio_client, bucket, f"/{TABLE_NAME}", "")
    create_delta_table(instance, TABLE_NAME, bucket=bucket)

    query_id = f"{TABLE_NAME}-{uuid.uuid4()}"
    instance.query(
        f"SELECT * FROM {TABLE_NAME} SETTINGS filesystem_cache_name = 'cache1'",
        query_id=query_id,
    )

    instance.query("SYSTEM FLUSH LOGS")

    count = int(
        instance.query(
            f"SELECT ProfileEvents['CachedReadBufferCacheWriteBytes'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
    assert 0 < int(
        instance.query(
            f"SELECT ProfileEvents['S3GetObject'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )

    query_id = f"{TABLE_NAME}-{uuid.uuid4()}"
    instance.query(
        f"SELECT * FROM {TABLE_NAME} SETTINGS filesystem_cache_name = 'cache1'",
        query_id=query_id,
    )

    instance.query("SYSTEM FLUSH LOGS")

    assert count == int(
        instance.query(
            f"SELECT ProfileEvents['CachedReadBufferReadFromCacheBytes'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
    assert 0 == int(
        instance.query(
            f"SELECT ProfileEvents['S3GetObject'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
@ -0,0 +1,8 @@
<clickhouse>
    <filesystem_caches>
        <cache1>
            <max_size>1Gi</max_size>
            <path>cache1</path>
        </cache1>
    </filesystem_caches>
</clickhouse>
@ -72,7 +72,10 @@ def started_cluster():
    with_hdfs = False
    cluster.add_instance(
        "node1",
        main_configs=["configs/config.d/named_collections.xml"],
        main_configs=[
            "configs/config.d/named_collections.xml",
            "configs/config.d/filesystem_caches.xml",
        ],
        user_configs=["configs/users.d/users.xml"],
        with_minio=True,
        with_azurite=True,
@ -870,3 +873,66 @@ def test_restart_broken_s3(started_cluster):
    )

    assert int(instance.query(f"SELECT count() FROM {TABLE_NAME}")) == 100


@pytest.mark.parametrize("storage_type", ["s3"])
def test_filesystem_cache(started_cluster, storage_type):
    instance = started_cluster.instances["node1"]
    spark = started_cluster.spark_session
    TABLE_NAME = "test_filesystem_cache_" + storage_type + "_" + get_uuid_str()

    write_iceberg_from_df(
        spark,
        generate_data(spark, 0, 10),
        TABLE_NAME,
        mode="overwrite",
        format_version="1",
        partition_by="a",
    )

    default_upload_directory(
        started_cluster,
        storage_type,
        f"/iceberg_data/default/{TABLE_NAME}/",
        f"/iceberg_data/default/{TABLE_NAME}/",
    )

    create_iceberg_table(storage_type, instance, TABLE_NAME, started_cluster)

    query_id = f"{TABLE_NAME}-{uuid.uuid4()}"
    instance.query(
        f"SELECT * FROM {TABLE_NAME} SETTINGS filesystem_cache_name = 'cache1'",
        query_id=query_id,
    )

    instance.query("SYSTEM FLUSH LOGS")

    count = int(
        instance.query(
            f"SELECT ProfileEvents['CachedReadBufferCacheWriteBytes'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
    assert 0 < int(
        instance.query(
            f"SELECT ProfileEvents['S3GetObject'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )

    query_id = f"{TABLE_NAME}-{uuid.uuid4()}"
    instance.query(
        f"SELECT * FROM {TABLE_NAME} SETTINGS filesystem_cache_name = 'cache1'",
        query_id=query_id,
    )

    instance.query("SYSTEM FLUSH LOGS")

    assert count == int(
        instance.query(
            f"SELECT ProfileEvents['CachedReadBufferReadFromCacheBytes'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
    assert 0 == int(
        instance.query(
            f"SELECT ProfileEvents['S3GetObject'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
@ -0,0 +1,8 @@
<clickhouse>
    <filesystem_caches>
        <cache1>
            <max_size>1Gi</max_size>
            <path>cache1</path>
        </cache1>
    </filesystem_caches>
</clickhouse>
@ -56,6 +56,7 @@ def started_cluster():
            "configs/named_collections.xml",
            "configs/schema_cache.xml",
            "configs/blob_log.xml",
            "configs/filesystem_caches.xml",
        ],
        user_configs=[
            "configs/access.xml",
@ -2394,3 +2395,61 @@ def test_respect_object_existence_on_partitioned_write(started_cluster):
    )

    assert int(result) == 44


def test_filesystem_cache(started_cluster):
    id = uuid.uuid4()
    bucket = started_cluster.minio_bucket
    instance = started_cluster.instances["dummy"]
    table_name = f"test_filesystem_cache-{uuid.uuid4()}"

    instance.query(
        f"insert into function s3('http://{started_cluster.minio_host}:{started_cluster.minio_port}/{bucket}/{table_name}.tsv', auto, 'x UInt64') select number from numbers(100) SETTINGS s3_truncate_on_insert=1"
    )

    query_id = f"{table_name}-{uuid.uuid4()}"
    instance.query(
        f"select * from s3('http://{started_cluster.minio_host}:{started_cluster.minio_port}/{bucket}/{table_name}.tsv') SETTINGS filesystem_cache_name = 'cache1', enable_filesystem_cache=1",
        query_id=query_id,
    )

    instance.query("SYSTEM FLUSH LOGS")

    count = int(
        instance.query(
            f"SELECT ProfileEvents['CachedReadBufferCacheWriteBytes'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )

    assert count == 290
    assert 0 < int(
        instance.query(
            f"SELECT ProfileEvents['S3GetObject'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )

    instance.query("SYSTEM DROP SCHEMA CACHE")

    query_id = f"{table_name}-{uuid.uuid4()}"
    instance.query(
        f"select * from s3('http://{started_cluster.minio_host}:{started_cluster.minio_port}/{bucket}/{table_name}.tsv') SETTINGS filesystem_cache_name = 'cache1', enable_filesystem_cache=1",
        query_id=query_id,
    )

    instance.query("SYSTEM FLUSH LOGS")

    assert count * 2 == int(
        instance.query(
            f"SELECT ProfileEvents['CachedReadBufferReadFromCacheBytes'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
    assert 0 == int(
        instance.query(
            f"SELECT ProfileEvents['CachedReadBufferCacheWriteBytes'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )
    assert 0 == int(
        instance.query(
            f"SELECT ProfileEvents['S3GetObject'] FROM system.query_log WHERE query_id = '{query_id}' AND type = 'QueryFinish'"
        )
    )