Occasionally, 02479_mysql_connect_to_self fails on CI [1].
[1]: https://github.com/ClickHouse/ClickHouse/issues/50911
The problem was indeed query profiler and EINTR, but not in a way you
may think.
For such failures you may see the following trace in trace_log:
contrib/openssl/crypto/bio/bss_sock.c:127::sock_read
contrib/openssl/crypto/bio/bio_meth.c:121::bread_conv
contrib/openssl/crypto/bio/bio_lib.c:285::bio_read_intern
contrib/openssl/crypto/bio/bio_lib.c:311::BIO_read
contrib/openssl/ssl/record/methods/tls_common.c:398::tls_default_read_n
contrib/openssl/ssl/record/methods/tls_common.c:575::tls_get_more_records
contrib/openssl/ssl/record/methods/tls_common.c:1122::tls_read_record
contrib/openssl/ssl/record/rec_layer_s3.c:645::ssl3_read_bytes
contrib/openssl/ssl/s3_lib.c:4527::ssl3_read_internal
contrib/openssl/ssl/s3_lib.c:4551::ssl3_read
contrib/openssl/ssl/ssl_lib.c:2343::ssl_read_internal
contrib/openssl/ssl/ssl_lib.c:2357::SSL_read
contrib/mariadb-connector-c/libmariadb/secure/openssl.c:729::ma_tls_read
contrib/mariadb-connector-c/libmariadb/ma_tls.c:90::ma_pvio_tls_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:250::ma_pvio_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:297::ma_pvio_cache_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:373::ma_real_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:427::ma_net_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:192::ma_net_safe_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2138::mthd_my_read_query_result
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2212::mysql_real_query
src/Common/mysqlxx/Query.cpp:56::mysqlxx::Query::executeImpl()
src/Common/mysqlxx/Query.cpp:73::mysqlxx::Query::use()
src/Processors/Sources/MySQLSource.cpp:50::DB::MySQLSource::Connection::Connection()
After which the connection will fail with:
Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)
But, if you will take a look at ma_tls_read() you will see that it has
proper retries for SSL_ERROR_WANT_READ (and EINTR is just a special case
of it), but still, for some reason it fails.
And the reason is the units of the read/write timeout, ma_tls_read()
calls poll(read_timeout) in case of SSL_ERROR_WANT_READ, but, it
incorrectly assume that the timeout is in milliseconds, but that timeout
was in seconds, this bug had been fixed in [2], and now it works like a
charm!
[2]: https://github.com/ClickHouse/mariadb-connector-c/pull/17
I've verified it with patching openssl library:
diff --git a/crypto/bio/bss_sock.c b/crypto/bio/bss_sock.c
index 82f7be85ae..3d2f3926a0 100644
--- a/crypto/bio/bss_sock.c
+++ b/crypto/bio/bss_sock.c
@@ -124,7 +124,24 @@ static int sock_read(BIO *b, char *out, int outl)
ret = ktls_read_record(b->num, out, outl);
else
# endif
- ret = readsocket(b->num, out, outl);
+ {
+ static int i = 0;
+ static int j = 0;
+ if (!(++j % 5))
+ {
+ fprintf(stderr, "sock_read: inject EAGAIN with ret=0\n");
+ ret = 0;
+ errno = EAGAIN;
+ }
+ else if (!(++i % 3))
+ {
+ fprintf(stderr, "sock_read: inject EAGAIN with ret=-1\n");
+ ret = -1;
+ errno = EAGAIN;
+ }
+ else
+ ret = readsocket(b->num, out, outl);
+ }
BIO_clear_retry_flags(b);
if (ret <= 0) {
if (BIO_sock_should_retry(ret))
And before this patch (well, not the patch itself, but the referenced
patch in mariadb-connector-c) if you will pass read_write_timeout=1 it
will fail:
SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)
Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)
But after, it always works:
$ ch benchmark -c10 -q "SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connection_pool_size=1, connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)"
^CStopping launch of queries. SIGINT received.
Queries executed: 478.
localhost:9000, queries: 478, QPS: 120.171, RPS: 120.171, MiB/s: 0.001, result RPS: 120.171, result MiB/s: 0.001.
0.000% 0.014 sec.
10.000% 0.058 sec.
20.000% 0.065 sec.
30.000% 0.073 sec.
40.000% 0.079 sec.
50.000% 0.087 sec.
60.000% 0.089 sec.
70.000% 0.091 sec.
80.000% 0.095 sec.
90.000% 0.100 sec.
95.000% 0.102 sec.
99.000% 0.105 sec.
99.900% 0.140 sec.
99.990% 0.140 sec.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Right now it fails due to [1] with the following error:
/usr/work/ClickHouse/contrib/libbcrypt/crypt_blowfish/ow-crypt.h:27:14: error: conflicting types for 'crypt_r'
27 | extern char *crypt_r(__const char *key, __const char *setting, void *data);
| ^
/usr/include/unistd.h:500:7: note: previous declaration is here
500 | char *crypt_r(const char *, const char *, struct crypt_data *);
| ^
[1]: 5f521d7ba7
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
From this macros sizeof(unw_context_t)/sizeof(unw_cursor_t) is depends
(_LIBUNWIND_CONTEXT_SIZE/_LIBUNWIND_CURSOR_SIZE).
So it should be not only PRIVATE but for INTERFACE as well.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
The problem was due to incorrect unwinding due from signal handlers,
which leads to incorrect DWARF (FDE/CIE) interpretation.
After this patch I was not able to reproduce the crash for couple of
hours, while before it was very stable (I've reduced the minimal
threshold for query_profiler_real_time_period_ns), using simply:
$ clickhouse-benchmark --port 19000 -q "SELECT * FROM remote('127.{1..10}', system, one)" --query_profiler_real_time_period_ns=1
Note, I'm using here remote() for fibers, that has stack with guard
pages that helps with reproducing the crash more faster.
P.S. I also have another implementation of this fix, without patching
unwind and using info from signal context directly, and even though it
is better, because you don't need to trip extra frames and you can use
all the 45 frames for something useful, it is too complex, so let's go
with a simpler patch first, and I think it could be even backported.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>