Commit Graph

595 Commits

Author SHA1 Message Date
Antonio Andelic
175a1db787 Support specifying users for s3 settings 2024-02-19 16:43:27 +01:00
Kruglov Pavel
1dbfeafb42 Merge branch 'master' into auto-format-detection 2024-02-13 19:08:33 +01:00
kssenii
5cd757dd9a Merge remote-tracking branch 'origin/master' into s3-queue-parallelize-ordered-mode 2024-01-26 16:33:22 +01:00
Kruglov Pavel
946c4e0495 Merge branch 'master' into auto-format-detection 2024-01-26 15:51:35 +01:00
Kruglov Pavel
46a6b84a5a Merge branch 'master' into auto-format-detection 2024-01-25 22:11:07 +01:00
Maksim Kita
5bb734a4bb ActionsDAG buildFilterActionsDAG refactoring 2024-01-25 18:24:14 +03:00
Maksim Kita
2a327107b6 Updated implementation 2024-01-25 14:31:49 +03:00
avogar
11f1ea50d7 Fix tests 2024-01-24 17:55:31 +00:00
kssenii
7f8f379d7f Parallel & distributed processing for ordered mode 2024-01-24 16:32:15 +01:00
avogar
617cc514b7 Try to detect file format automatically during schema inference if it's unknown 2024-01-23 18:59:39 +00:00
Michael Kolupaev
fd361273f0 Fix StorageURL forgetting headers on server restart 2024-01-19 11:35:12 -08:00
pufit
6cf55b82f4 Merge pull request #58539 from canhld94/file_custom_compress_level
Allow explicitly set compression level in output format
2024-01-09 13:43:38 -05:00
Sema Checherinda
7c7e72c4b7 Merge pull request #58573 from ClickHouse/chesema-s3-client-creation
fix and test that S3Clients are reused
2024-01-09 13:28:20 +01:00
Sema Checherinda
6f626d8294 fix auth_settings.hasUpdates function 2024-01-07 23:00:26 +01:00
Sema Checherinda
d5f86f671d fix and test that S3Clients are reused 2024-01-07 02:19:06 +01:00
Nikolai Kochetov
494a32f4e4 Review fixes 2024-01-04 14:41:04 +00:00
Duc Canh Le
2e14cfb526 add settings for output compression level and window size
Signed-off-by: Duc Canh Le <duccanh.le@ahrefs.com>
2024-01-04 08:16:00 +00:00
Nikolai Kochetov
d06de83ac1 Fix KeyCondition for file/url/s3 2024-01-03 17:44:28 +00:00
Nikolai Kochetov
aee933a437 Merge branch 'master' into hdfs-virtuals 2024-01-03 12:16:57 +00:00
Nikolai Kochetov
1b20ce5162 Cleanup 2024-01-02 17:50:06 +00:00
Nikolai Kochetov
c808b03e55 Remove unneeded code 2024-01-02 17:27:33 +00:00
Nikolai Kochetov
3e3fed1cbe Add reading step to URL 2024-01-02 15:18:13 +00:00
Nikolai Kochetov
b95bdef09e Update StorageS3 and StorageS3Cluster 2023-12-29 17:41:11 +00:00
Azat Khuzhin
8c54380d80 Avoid sending ComposeObject requests after upload to GCS
This should not be required anymore, but it is left as an option, since
it is likely still required for old files.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-12-29 11:53:49 +01:00
Azat Khuzhin
f4a7789cd4 Convert various S3::Client settings into separate ClientSettings struct
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-12-29 11:53:49 +01:00
Shani Elharrar
679a0e1300 StorageS3 / TableFunctionS3: Allow passing session_token to AuthSettings
This can help users who want to pass temporary credentials issued by AWS
in order to load data from S3 without changing configuration or creating
an IAM User.

Fixes #57848
2023-12-19 08:06:36 +02:00
Shani Elharrar
c696c0bfe7 S3Common.AuthSettings: Allow passing SESSION_TOKEN to AWSCredentials
This sets up the infrastructure for loading session_token and passing it directly
to all AWSCredentials instances that are created using the AuthSettings.

The default SESSION_TOKEN is set to an empty string as documented in AWS SDK
reference: https://sdk.amazonaws.com/cpp/api/0.12.9/d4/d27/class_aws_1_1_auth_1_1_a_w_s_credentials.html
2023-12-17 10:29:15 +02:00
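As a rough illustration of the mechanism above (not the actual ClickHouse code): the AWS SDK for C++ accepts the session token as the third argument of the Aws::Auth::AWSCredentials constructor, so an AuthSettings-like struct only needs to carry one extra field through. The struct and function names below are hypothetical.

    #include <aws/core/auth/AWSCredentials.h>

    // Hypothetical stand-in for an AuthSettings-like struct; only the idea of
    // forwarding a (possibly empty) session token is taken from the commits above.
    struct AuthSettingsSketch
    {
        Aws::String access_key_id;
        Aws::String secret_access_key;
        Aws::String session_token; // defaults to an empty string, as in the AWS SDK
    };

    // Every client built from these settings can then use temporary STS credentials.
    Aws::Auth::AWSCredentials makeCredentials(const AuthSettingsSketch & auth)
    {
        return Aws::Auth::AWSCredentials(auth.access_key_id, auth.secret_access_key, auth.session_token);
    }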
avogar
ee7af95bc0 Merge branch 'master' of github.com:ClickHouse/ClickHouse into schema-inference-union 2023-12-08 20:29:28 +00:00
Alexey Milovidov
10d65a1ade Merge pull request #55559 from azat/s3-fix-excessive-reads
Add ability to disable checksums for S3 to avoid excessive input file read
2023-12-05 06:34:21 +01:00
Nikolai Kochetov
0b4131546a Merge pull request #56813 from jsc0218/SystemTablesFilterEngine
Able to Filter Engine When Scanning System Tables
2023-12-01 16:02:27 +01:00
avogar
4d9a1b50f9 Add information about new _size virtual column in file/s3/url/hdfs/azure table functions 2023-11-28 18:15:07 +00:00
Nikolai Kochetov
08a7575984 Re-implement filtering a bit. 2023-11-28 16:17:35 +00:00
Azat Khuzhin
4a02de4674 Add ability to disable checksums for S3 to avoid excessive input file read
The AWS S3 client can read the input file multiple times; this is required to:
- calculate checksums
- calculate the signature (done only for HTTP, since ClickHouse uses
  PayloadSigningPolicy::Never)

This means that to send a file to S3 it is read 3 times over HTTP
and 2 times over HTTPS.

By overriding GetChecksumAlgorithmName() to return an empty string,
checksums can be disabled, and the input file is read only once.

Even though the HTTPS layer adds an extra integrity check, some may
still consider disabling checksums too risky, although the ClickHouse
internal format (for MergeTree) has checksums of its own, and more.

Here is an example stacktrace of this excessive read:

<details>

<summary>stacktrace</summary>

    (lldb) bt
    * thread 383, name = 'BackupWorker', stop reason = breakpoint 1.1
      * frame 0: 0x00000000103c5fc0 clickhouse`DB::StdStreamBufFromReadBuffer::seekpos() + 32 at StdStreamBufFromReadBuffer.cpp:67
        frame 1: 0x000000001777f7f8 clickhouse`std::__1::basic_istream<char, std::__1::char_traits<char>>::tellg() [inlined] std::__1::basic_streambuf<char, std::__1::char_traits<char>>::pubseekoff[abi:v15000](this=<unavailable>, __off=0, __way=cur, __which=8) + 120 at streambuf:162
        frame 2: 0x000000001777f7e3 clickhouse`std::__1::basic_istream<char, std::__1::char_traits<char>>::tellg() + 99 at istream:1249
        frame 3: 0x00000000152e4979 clickhouse`Aws::Utils::Crypto::MD5OpenSSLImpl::Calculate() + 57 at CryptoImpl.cpp:223
        frame 4: 0x00000000152dedee clickhouse`Aws::Utils::Crypto::MD5::Calculate() + 14 at MD5.cpp:30
        frame 5: 0x00000000152db5ac clickhouse`Aws::Utils::HashingUtils::CalculateMD5() + 44 at HashingUtils.cpp:235
        frame 6: 0x000000001528b97b clickhouse`Aws::Client::AWSClient::AddChecksumToRequest() const + 507 at AWSClient.cpp:772
        frame 7: 0x000000001528ded2 clickhouse`Aws::Client::AWSClient::BuildHttpRequest() const + 1682 at AWSClient.cpp:930
        frame 8: 0x00000000100b864f clickhouse`DB::S3::Client::BuildHttpRequest() const + 15 at Client.cpp:622
        frame 9: 0x0000000015286a41 clickhouse`Aws::Client::AWSClient::AttemptOneRequest(this=0x00007ffde2f8f000, httpRequest=<unavailable>, request=<unavailable>, signerName=<unavailable>, signerRegionOverride=<unavailable>, signerServiceNameOverride="s3") const + 65 at AWSClient.cpp:491
        frame 10: 0x00000000152845b9 clickhouse`Aws::Client::AWSClient::AttemptExhaustively(this=0x00007ffde2f8f000, uri=0x00007ffdd4d44f38, request=0x00007ffdd4d45d10, method=HTTP_PUT, signerName="SignatureV4", signerRegionOverride="us-east-1", signerServiceNameOverride="s3") const + 1337 at AWSClient.cpp:272
        frame 11: 0x0000000015298d0d clickhouse`Aws::Client::AWSXMLClient::MakeRequest() const + 45 at AWSXmlClient.cpp:99
        frame 12: 0x0000000015298cb5 clickhouse`Aws::Client::AWSXMLClient::MakeRequest() const + 309 at AWSXmlClient.cpp:66
        frame 13: 0x0000000015354b23 clickhouse`Aws::S3::S3Client::PutObject(this=0x00007ffde2f8f000, request=0x00007ffdd4d45d10) const + 2659 at S3Client.cpp:1731
        frame 14: 0x00000000100b174f clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const [inlined]
        frame 15: 0x00000000100b173a clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const + 41 at Client.cpp:578
        frame 16: 0x00000000100b1711 clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const + 981 at Client.cpp:508
        frame 17: 0x00000000100b133c clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const [inlined]
        frame 18: 0x00000000100b133c clickhouse`DB::S3::Client::PutObject() const + 28 at Client.cpp:418
        frame 19: 0x00000000103b96d6 clickhouse`DB::copyDataToS3File()

</details>

This new behaviour can be enabled with `s3_disable_checksum=true`.

Note that I've checked this implementation with GCS/R2/S3/MinIO and it
works everywhere.
2023-11-26 19:20:19 +01:00
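A minimal sketch of the override described above, assuming a thin wrapper over an SDK request type; the class name and the disable_checksum flag are hypothetical, only GetChecksumAlgorithmName() itself comes from the AWS SDK request interface.

    #include <aws/s3/model/PutObjectRequest.h>

    // Sketch: report "no checksum algorithm" when checksums are disabled, so the
    // SDK's AddChecksumToRequest() does not re-read the input stream to compute an MD5.
    class PutObjectRequestSketch : public Aws::S3::Model::PutObjectRequest
    {
    public:
        explicit PutObjectRequestSketch(bool disable_checksum_) : disable_checksum(disable_checksum_) {}

        Aws::String GetChecksumAlgorithmName() const override
        {
            if (disable_checksum)
                return ""; // empty string => no checksum, the body is read only once
            return Aws::S3::Model::PutObjectRequest::GetChecksumAlgorithmName();
        }

    private:
        bool disable_checksum = false; // driven by a setting such as s3_disable_checksum
    };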
Kruglov Pavel
b84e3cf683 Merge branch 'master' into size-virtual-column 2023-11-22 19:25:00 +01:00
avogar
4a86f4a7b9 Fix style changes 2023-11-22 18:24:34 +00:00
avogar
007353a2dd Add _size virtual column to s3/file/hdfs/url/azureBlobStorage engines 2023-11-22 18:12:36 +00:00
vdimir
ffbe85d3a0 Merge pull request #56668 from ClickHouse/vdimir/analyzer_s3_partition_pruning
Analyzer: filtering by virtual columns for StorageS3
2023-11-22 16:44:44 +01:00
vdimir
15234474d7 Implement system table blob_storage_log 2023-11-21 09:18:25 +00:00
vdimir
31a6c7c1c4 Style changes around filterKeysForPartitionPruning 2023-11-20 18:08:45 +00:00
vdimir
95e9a27417 Remove ast based code from filterKeysForPartitionPruning 2023-11-20 17:59:58 +00:00
vdimir
a915eeded8 StorageS3 use filters from SourceStepWithFilter 2023-11-20 17:59:58 +00:00
vdimir
1cddfb1e6b rewrite getBlockWithVirtuals for S3Storage 2023-11-20 17:59:57 +00:00
vdimir
cbb2e02c03 Analyzer: partition pruning for S3 2023-11-20 17:59:53 +00:00
avogar
f537bad469 Merge branch 'master' of github.com:ClickHouse/ClickHouse into schema-inference-union 2023-11-20 14:32:50 +00:00
avogar
872556a5d4 Merge branch 'master' of github.com:ClickHouse/ClickHouse into schema-inference-union 2023-11-20 14:03:36 +00:00
Sema Checherinda
f999337dae Revert "Revert "s3 adaptive timeouts"" 2023-11-20 14:53:22 +01:00
Alexander Tokmakov
5031f239c3 Revert "s3 adaptive timeouts" 2023-11-20 14:28:59 +01:00
Sema Checherinda
a950595c24 Merge pull request #56314 from CheSema/s3-aggressive-timeouts
s3 adaptive timeouts
2023-11-19 14:12:14 +01:00
Alexey Milovidov
d56cbda185 Add metrics for the number of queued jobs, which is useful for the IO thread pool 2023-11-18 19:07:59 +01:00
Sema Checherinda
8d36fd6e54 get rid of client_with_long_timeout_ptr 2023-11-14 11:34:12 +01:00