Occasionally, 02479_mysql_connect_to_self fails on CI [1].
[1]: https://github.com/ClickHouse/ClickHouse/issues/50911
The problem was indeed query profiler and EINTR, but not in a way you
may think.
For such failures you may see the following trace in trace_log:
contrib/openssl/crypto/bio/bss_sock.c:127::sock_read
contrib/openssl/crypto/bio/bio_meth.c:121::bread_conv
contrib/openssl/crypto/bio/bio_lib.c:285::bio_read_intern
contrib/openssl/crypto/bio/bio_lib.c:311::BIO_read
contrib/openssl/ssl/record/methods/tls_common.c:398::tls_default_read_n
contrib/openssl/ssl/record/methods/tls_common.c:575::tls_get_more_records
contrib/openssl/ssl/record/methods/tls_common.c:1122::tls_read_record
contrib/openssl/ssl/record/rec_layer_s3.c:645::ssl3_read_bytes
contrib/openssl/ssl/s3_lib.c:4527::ssl3_read_internal
contrib/openssl/ssl/s3_lib.c:4551::ssl3_read
contrib/openssl/ssl/ssl_lib.c:2343::ssl_read_internal
contrib/openssl/ssl/ssl_lib.c:2357::SSL_read
contrib/mariadb-connector-c/libmariadb/secure/openssl.c:729::ma_tls_read
contrib/mariadb-connector-c/libmariadb/ma_tls.c:90::ma_pvio_tls_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:250::ma_pvio_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:297::ma_pvio_cache_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:373::ma_real_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:427::ma_net_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:192::ma_net_safe_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2138::mthd_my_read_query_result
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2212::mysql_real_query
src/Common/mysqlxx/Query.cpp:56::mysqlxx::Query::executeImpl()
src/Common/mysqlxx/Query.cpp:73::mysqlxx::Query::use()
src/Processors/Sources/MySQLSource.cpp:50::DB::MySQLSource::Connection::Connection()
After which the connection will fail with:
Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)
But, if you will take a look at ma_tls_read() you will see that it has
proper retries for SSL_ERROR_WANT_READ (and EINTR is just a special case
of it), but still, for some reason it fails.
And the reason is the units of the read/write timeout, ma_tls_read()
calls poll(read_timeout) in case of SSL_ERROR_WANT_READ, but, it
incorrectly assume that the timeout is in milliseconds, but that timeout
was in seconds, this bug had been fixed in [2], and now it works like a
charm!
[2]: https://github.com/ClickHouse/mariadb-connector-c/pull/17
I've verified it with patching openssl library:
diff --git a/crypto/bio/bss_sock.c b/crypto/bio/bss_sock.c
index 82f7be85ae..3d2f3926a0 100644
--- a/crypto/bio/bss_sock.c
+++ b/crypto/bio/bss_sock.c
@@ -124,7 +124,24 @@ static int sock_read(BIO *b, char *out, int outl)
ret = ktls_read_record(b->num, out, outl);
else
# endif
- ret = readsocket(b->num, out, outl);
+ {
+ static int i = 0;
+ static int j = 0;
+ if (!(++j % 5))
+ {
+ fprintf(stderr, "sock_read: inject EAGAIN with ret=0\n");
+ ret = 0;
+ errno = EAGAIN;
+ }
+ else if (!(++i % 3))
+ {
+ fprintf(stderr, "sock_read: inject EAGAIN with ret=-1\n");
+ ret = -1;
+ errno = EAGAIN;
+ }
+ else
+ ret = readsocket(b->num, out, outl);
+ }
BIO_clear_retry_flags(b);
if (ret <= 0) {
if (BIO_sock_should_retry(ret))
And before this patch (well, not the patch itself, but the referenced
patch in mariadb-connector-c) if you will pass read_write_timeout=1 it
will fail:
SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)
Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)
But after, it always works:
$ ch benchmark -c10 -q "SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connection_pool_size=1, connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)"
^CStopping launch of queries. SIGINT received.
Queries executed: 478.
localhost:9000, queries: 478, QPS: 120.171, RPS: 120.171, MiB/s: 0.001, result RPS: 120.171, result MiB/s: 0.001.
0.000% 0.014 sec.
10.000% 0.058 sec.
20.000% 0.065 sec.
30.000% 0.073 sec.
40.000% 0.079 sec.
50.000% 0.087 sec.
60.000% 0.089 sec.
70.000% 0.091 sec.
80.000% 0.095 sec.
90.000% 0.100 sec.
95.000% 0.102 sec.
99.000% 0.105 sec.
99.900% 0.140 sec.
99.990% 0.140 sec.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Right now it fails due to [1] with the following error:
/usr/work/ClickHouse/contrib/libbcrypt/crypt_blowfish/ow-crypt.h:27:14: error: conflicting types for 'crypt_r'
27 | extern char *crypt_r(__const char *key, __const char *setting, void *data);
| ^
/usr/include/unistd.h:500:7: note: previous declaration is here
500 | char *crypt_r(const char *, const char *, struct crypt_data *);
| ^
[1]: 5f521d7ba7
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
From this macros sizeof(unw_context_t)/sizeof(unw_cursor_t) is depends
(_LIBUNWIND_CONTEXT_SIZE/_LIBUNWIND_CURSOR_SIZE).
So it should be not only PRIVATE but for INTERFACE as well.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
The problem was due to incorrect unwinding due from signal handlers,
which leads to incorrect DWARF (FDE/CIE) interpretation.
After this patch I was not able to reproduce the crash for couple of
hours, while before it was very stable (I've reduced the minimal
threshold for query_profiler_real_time_period_ns), using simply:
$ clickhouse-benchmark --port 19000 -q "SELECT * FROM remote('127.{1..10}', system, one)" --query_profiler_real_time_period_ns=1
Note, I'm using here remote() for fibers, that has stack with guard
pages that helps with reproducing the crash more faster.
P.S. I also have another implementation of this fix, without patching
unwind and using info from signal context directly, and even though it
is better, because you don't need to trip extra frames and you can use
all the 45 frames for something useful, it is too complex, so let's go
with a simpler patch first, and I think it could be even backported.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Make ClickHouse compilable and runnable on loongarch64
So far only basic functionality was tested (on real hw),
clickhouse server runs, exceptions works, client works,
simple tests works.
This commit adresses below leaksan find.
I tried to reproduce this locally but 02802_clickhouse_disks_s3_copy.sh
but couldn't. clickhouse-disks did not do anything useful (it would not
even print logging), neither with a standard build nor with a leaksan
build. Could not find further documentation of it or even what this tool
is supposed to do, perhaps it is just for internal use. Also,
- line numbers in the leaksan report were partially missing,
- I am not really sure how Sha256HMACOpenSSLImpl::Calculate is calling
into hmac_init (there must be some sort of static initialization
somewhere but I did not find it), and
- my fix is in a weird place due to other restrictions (see the commit
in the aws-sdk-cpp contrib repo).
The chance that this fix fixes the leak are low.
If it doesn't work, add "# Tag no-asan" + a comment to
02802_clickhouse_disks_s3_copy.sh and don't worry further.
EDIT: The commit fixes the issue, everything is good.
https://s3.amazonaws.com/clickhouse-test-reports/59870/b452e3d1ab87b8cc5810693aeea28f69ad28d671/stateless_tests__asan__[3_4].html
2024-03-19 08:34:03 =================================================================
2024-03-19 08:34:03 ==13149==ERROR: LeakSanitizer: detected memory leaks
2024-03-19 08:34:03
2024-03-19 08:34:03 Direct leak of 904 byte(s) in 1 object(s) allocated from:
2024-03-19 08:34:03 #0 0x55f9cb5a18ee in malloc (/usr/bin/clickhouse+0xa9a48ee) (BuildId: b3766b865d6580f6dcba75acf37673d4aeedc2b6)
2024-03-19 08:34:03 #1 0x55fa01e34070 in CRYPTO_malloc build_docker/./contrib/openssl/crypto/mem.c:202:11
2024-03-19 08:34:03 #2 0x55fa01e34070 in CRYPTO_zalloc build_docker/./contrib/openssl/crypto/mem.c:222:11
2024-03-19 08:34:03 #3 0x55fa01d6dcca in ossl_err_get_state_int build_docker/./contrib/openssl/crypto/err/err.c:691:17
2024-03-19 08:34:03 #4 0x55fa01d71748 in ERR_set_mark build_docker/./contrib/openssl/crypto/err/err_mark.c:19:10
2024-03-19 08:34:03 #5 0x55fa01f4735b in ossl_prov_digest_load_from_params build_docker/./contrib/openssl/providers/common/provider_util.c:194:5
2024-03-19 08:34:03 #6 0x55fa01ff467a in hmac_set_ctx_params build_docker/./contrib/openssl/providers/implementations/macs/hmac_prov.c:307:10
2024-03-19 08:34:03 #7 0x55fa01ff3ef2 in hmac_init build_docker/./contrib/openssl/providers/implementations/macs/hmac_prov.c:169:37
2024-03-19 08:34:03 #8 0x55f9fb83c75b in Aws::Utils::Crypto::Sha256HMACOpenSSLImpl::Calculate(Aws::Utils::Array<unsigned char> const&, Aws::Utils::Array<unsigned char> const&) (/usr/bin/clickhouse+0x3ac3f75b) (BuildId: b3766b865d6580f6dcba75acf37673d4aeedc2b6)
2024-03-19 08:34:03 #9 0x55f9fb82ebeb in Aws::Utils::Crypto::Sha256HMAC::Calculate(Aws::Utils::Array<unsigned char> const&, Aws::Utils::Array<unsigned char> const&) (/usr/bin/clickhouse+0x3ac31beb) (BuildId: b3766b865d6580f6dcba75acf37673d4aeedc2b6)
2024-03-19 08:34:03 #10 0x55f9fb6d5afd in Aws::Client::AWSAuthV4Signer::ComputeHash(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) const (/usr/bin/clickhouse+0x3aad8afd) (BuildId: b3766b865d6580f6dcba75acf37673d4aeedc2b6)
2024-03-19 08:34:03 #11 0x55f9fb6e0bad in Aws::Client::AWSAuthV4Signer::SignRequest(Aws::Http::HttpRequest&, char const*, char const*, bool) const (/usr/bin/clickhouse+0x3aae3bad) (BuildId: b3766b865d6580f6dcba75acf37673d4aeedc2b6)
2024-03-19 08:34:03 #12 0x55f9fb72651d in bool smithy::components::tracing::TracingUtils::MakeCallWithTiming<bool>(std::__1::function<bool ()>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, smithy::components::tracing::Meter const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>&&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) (/usr/bin/clickhouse+0x3ab2951d) (BuildId: b3766b865d6580f6dcba75acf37673d4aeedc2b6)
2024-03-19 08:34:03 #13 0x55f9fb70adcb in Aws::Client::AWSClient::AttemptOneRequest(std::__1::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const (/usr/bin/clickhouse+0x3ab0ddcb) (BuildId: b3766b865d6580f6dcba75acf37673d4aeedc2b6)
2024-03-19 08:34:03 #14 0x55f9fb6fdb17 in Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char cons