This should not be required anymore, but leave it as an option, since
likely this is required for old files.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This should reduce amount of code that should be recompiled on
Exception.h changes (and everything else that had been included there).
This will actually not help a lot, because it is also included into
PODArray.h and ThreadPool.h at least... Sigh.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Reading from S3 has retries on almost any exception, except for
CANNOT_ALLOCATE_MEMORY, but the problem is that this exception may not
be even delivered to ReadBufferFromS3::processException(), due to
std::istream will hide it (see [1]).
[1]: 1834e42289/libcxx/include/istream (L1059)
So let's pass throught them to see at least some information instead of
just:
Message: Cannot read from istream at offset 0
And here as an example of exception:
<details>
(lldb) bt
* thread 244, name = 'MergeMutate', stop reason = breakpoint 6.1
* frame 0: 0x000000001aa9c791 clickhouse`::__cxa_throw(thrown_object=0x00007fc3b68fdb00, tinfo=0x000000000548af50, dest=(clickhouse`Poco::Net::NetException::~NetException() at NetException.cpp:26))(void *)) at cxa_exception.cpp:258:33
frame 1: 0x0000000017c55b2b clickhouse`Poco::Net::SocketImpl::error(code=-1232086272, arg=0x00007fc3be4f4d30) at SocketImpl.cpp:985:3
frame 2: 0x0000000017c1f7ff clickhouse`Poco::Net::SecureSocketImpl::handleError(int) [inlined] Poco::Net::SocketImpl::error(code=<unavailable>) at SocketImpl.cpp:923:2
frame 3: 0x0000000017c1f7e4 clickhouse`Poco::Net::SecureSocketImpl::handleError(this=0x00007fc35ae42b30, rc=-1) at SecureSocketImpl.cpp:531
frame 4: 0x0000000017c208f9 clickhouse`Poco::Net::SecureSocketImpl::receiveBytes(this=0x00007fc35ae42b30, buffer=0x00007fc346e402c4, length=131068, flags=<unavailable>) at SecureSocketImpl.cpp:353:10
frame 5: 0x0000000017c3b388 clickhouse`Poco::Net::HTTPSession::read(char*, long) [inlined] Poco::Net::StreamSocket::receiveBytes(buffer=<unavailable>, length=<unavailable>, flags=0) at StreamSocket.cpp:130:17
frame 6: 0x0000000017c3b379 clickhouse`Poco::Net::HTTPSession::read(char*, long) [inlined] Poco::Net::HTTPSession::receive(this=<unavailable>, buffer=<unavailable>, length=<unavailable>) at HTTPSession.cpp:166
frame 7: 0x0000000017c3b379 clickhouse`Poco::Net::HTTPSession::read(this=0x00007fc3973e9098, buffer=<unavailable>, length=<unavailable>) at HTTPSession.cpp:144
frame 8: 0x0000000017c0f706 clickhouse`Poco::Net::HTTPSClientSession::read(this=<unavailable>, buffer=<unavailable>, length=<unavailable>) at HTTPSClientSession.cpp:195:23
frame 9: 0x0000000017c32eb4 clickhouse`Poco::Net::HTTPFixedLengthStreamBuf::readFromDevice(this=0x00007fc361bd6708, buffer=<unavailable>, length=<unavailable>) at HTTPFixedLengthStream.cpp:53:16
frame 10: 0x0000000017c2a368 clickhouse`Poco::BasicBufferedStreamBuf<char, std::__1::char_traits<char>, Poco::Net::HTTPBufferAllocator>::underflow(this=0x00007fc361bd6708) at BufferedStreamBuf.h:97:17
frame 11: 0x0000000008b2abca clickhouse`std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow(this=0x00007fc361bd6708) at streambuf:445:9
frame 12: 0x0000000008b2ab5a clickhouse`std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsgetn(this=0x00007fc361bd6708, __s="P, __n=1048576) at streambuf:422:25
frame 13: 0x0000000008b2ca0e clickhouse`std::__1::basic_istream<char, std::__1::char_traits<char> >::read(char*, long) [inlined] std::__1::basic_streambuf<char, std::__1::char_traits<char> >::sgetn[abi:v15000](this=<unavailable>, __s="!P, __n=1048576) at streambuf:204:14
frame 14: 0x0000000008b2ca02 clickhouse`std::__1::basic_istream<char, std::__1::char_traits<char> >::read(this=0x00007fc35ae42d40, __s="!P, __n=1048576) at istream:1052
frame 15: 0x0000000010de0b9f clickhouse`DB::ReadBufferFromIStream::nextImpl(this=0x00007fc3663d2380) at ReadBufferFromIStream.cpp:15:10
</details>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
AWS S3 client can read file multiple times, this is required for:
- calculate checksums
- calculate signature (done only for HTTP, since ClickHouse uses
PayloadSigningPolicy::Never)
So this means that for HTTP, to send file to S3 it will be read 3x
times, and for HTTPS 2x times.
By overriding GetChecksumAlgorithmName() to return empty string,
checksums can be disabled, and the input file will be read only once.
And even though additional https layer adds extra integrity layer,
someone still may find this too risky I guess, even though ClickHouse
internal format (for MergeTree) has checksums, and more.
Here is an example stacktrace of this excessive read:
<details>
<summary>stacktrace</summary>
(lldb) bt
* thread 383, name = 'BackupWorker', stop reason = breakpoint 1.1
* frame 0: 0x00000000103c5fc0 clickhouse`DB::StdStreamBufFromReadBuffer::seekpos() + 32 at StdStreamBufFromReadBuffer.cpp:67
frame 1: 0x000000001777f7f8 clickhouse`std::__1::basic_istream<char, std::__1::char_traits<char>>::tellg() [inlined] std::__1::basic_streambuf<char, std::__1::char_traits<char>>::pubseekoff[abi:v15000](this=<unavailable>, __off=0, __way=cur, __which=8) + 120 at streambuf:162
frame 2: 0x000000001777f7e3 clickhouse`std::__1::basic_istream<char, std::__1::char_traits<char>>::tellg() + 99 at istream:1249
frame 3: 0x00000000152e4979 clickhouse`Aws::Utils::Crypto::MD5OpenSSLImpl::Calculate() + 57 at CryptoImpl.cpp:223
frame 4: 0x00000000152dedee clickhouse`Aws::Utils::Crypto::MD5::Calculate() + 14 at MD5.cpp:30
frame 5: 0x00000000152db5ac clickhouse`Aws::Utils::HashingUtils::CalculateMD5() + 44 at HashingUtils.cpp:235
frame 6: 0x000000001528b97b clickhouse`Aws::Client::AWSClient::AddChecksumToRequest() const + 507 at AWSClient.cpp:772
frame 7: 0x000000001528ded2 clickhouse`Aws::Client::AWSClient::BuildHttpRequest() const + 1682 at AWSClient.cpp:930
frame 8: 0x00000000100b864f clickhouse`DB::S3::Client::BuildHttpRequest() const + 15 at Client.cpp:622
frame 9: 0x0000000015286a41 clickhouse`Aws::Client::AWSClient::AttemptOneRequest(this=0x00007ffde2f8f000, httpRequest=<unavailable>, request=<unavailable>, signerName=<unavailable>, signerRegionOverride=<unavailable>, signerServiceNameOverride="s3") const + 65 at AWSClient.cpp:491
frame 10: 0x00000000152845b9 clickhouse`Aws::Client::AWSClient::AttemptExhaustively(this=0x00007ffde2f8f000, uri=0x00007ffdd4d44f38, request=0x00007ffdd4d45d10, method=HTTP_PUT, signerName="SignatureV4", signerRegionOverride="us-east-1", signerServiceNameOverride="s3") const + 1337 at AWSClient.cpp:272
frame 11: 0x0000000015298d0d clickhouse`Aws::Client::AWSXMLClient::MakeRequest() const + 45 at AWSXmlClient.cpp:99
frame 12: 0x0000000015298cb5 clickhouse`Aws::Client::AWSXMLClient::MakeRequest() const + 309 at AWSXmlClient.cpp:66
frame 13: 0x0000000015354b23 clickhouse`Aws::S3::S3Client::PutObject(this=0x00007ffde2f8f000, request=0x00007ffdd4d45d10) const + 2659 at S3Client.cpp:1731
frame 14: 0x00000000100b174f clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const [inlined]
frame 15: 0x00000000100b173a clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const + 41 at Client.cpp:578
frame 16: 0x00000000100b1711 clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const + 981 at Client.cpp:508
frame 17: 0x00000000100b133c clickhouse`DB::S3::Client::PutObject(DB::S3::ExtendedRequest<Aws::S3::Model::PutObjectRequest> const&) const [inlined]
frame 18: 0x00000000100b133c clickhouse`DB::S3::Client::PutObject() const + 28 at Client.cpp:418
frame 19: 0x00000000103b96d6 clickhouse`DB::copyDataToS3File()
</details>
This new behaviour could be enabled with `s3_disable_checksum=true`.
Note, that I've checked this implementation with GCS/R2/S3/MinIO and it
works everywhere.
* initial commit. integ tests passing, need to re-run unit & my own personal tests
* partial refactoring to remove Protocol::ANY
* improve naming
* remove all usages of ProxyConfiguration::Protocol::ANY
* fix ut
* blabla
* support url functions as well
* support for HTTPS requests over HTTP proxy with tunneling off
* remove gtestabc
* fix silly mistake
* ...
* remove usages of httpclientsession::proxyconfig in src/
* got you
* remove stale comment
* it seems like I need reasonable defaults
* fix ut
* add some comments
* remove no longer needed header
* matrix out
* add https over http proxy with no tunneling
* soem docs
* partial refactoring
* rename to use_tunneling_for_https_requests_over_http_proxy
* improve docs
* use shorter version
* remove useless test
* rename the setting
* update
* fix typo
* fix setting docs typo
* move ); up
* move ) up