Occasionally, 02479_mysql_connect_to_self fails on CI [1].
[1]: https://github.com/ClickHouse/ClickHouse/issues/50911
The problem was indeed query profiler and EINTR, but not in a way you
may think.
For such failures you may see the following trace in trace_log:
contrib/openssl/crypto/bio/bss_sock.c:127::sock_read
contrib/openssl/crypto/bio/bio_meth.c:121::bread_conv
contrib/openssl/crypto/bio/bio_lib.c:285::bio_read_intern
contrib/openssl/crypto/bio/bio_lib.c:311::BIO_read
contrib/openssl/ssl/record/methods/tls_common.c:398::tls_default_read_n
contrib/openssl/ssl/record/methods/tls_common.c:575::tls_get_more_records
contrib/openssl/ssl/record/methods/tls_common.c:1122::tls_read_record
contrib/openssl/ssl/record/rec_layer_s3.c:645::ssl3_read_bytes
contrib/openssl/ssl/s3_lib.c:4527::ssl3_read_internal
contrib/openssl/ssl/s3_lib.c:4551::ssl3_read
contrib/openssl/ssl/ssl_lib.c:2343::ssl_read_internal
contrib/openssl/ssl/ssl_lib.c:2357::SSL_read
contrib/mariadb-connector-c/libmariadb/secure/openssl.c:729::ma_tls_read
contrib/mariadb-connector-c/libmariadb/ma_tls.c:90::ma_pvio_tls_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:250::ma_pvio_read
contrib/mariadb-connector-c/libmariadb/ma_pvio.c:297::ma_pvio_cache_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:373::ma_real_read
contrib/mariadb-connector-c/libmariadb/ma_net.c:427::ma_net_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:192::ma_net_safe_read
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2138::mthd_my_read_query_result
contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2212::mysql_real_query
src/Common/mysqlxx/Query.cpp:56::mysqlxx::Query::executeImpl()
src/Common/mysqlxx/Query.cpp:73::mysqlxx::Query::use()
src/Processors/Sources/MySQLSource.cpp:50::DB::MySQLSource::Connection::Connection()
After which the connection will fail with:
Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)
But, if you will take a look at ma_tls_read() you will see that it has
proper retries for SSL_ERROR_WANT_READ (and EINTR is just a special case
of it), but still, for some reason it fails.
And the reason is the units of the read/write timeout, ma_tls_read()
calls poll(read_timeout) in case of SSL_ERROR_WANT_READ, but, it
incorrectly assume that the timeout is in milliseconds, but that timeout
was in seconds, this bug had been fixed in [2], and now it works like a
charm!
[2]: https://github.com/ClickHouse/mariadb-connector-c/pull/17
I've verified it with patching openssl library:
diff --git a/crypto/bio/bss_sock.c b/crypto/bio/bss_sock.c
index 82f7be85ae..3d2f3926a0 100644
--- a/crypto/bio/bss_sock.c
+++ b/crypto/bio/bss_sock.c
@@ -124,7 +124,24 @@ static int sock_read(BIO *b, char *out, int outl)
ret = ktls_read_record(b->num, out, outl);
else
# endif
- ret = readsocket(b->num, out, outl);
+ {
+ static int i = 0;
+ static int j = 0;
+ if (!(++j % 5))
+ {
+ fprintf(stderr, "sock_read: inject EAGAIN with ret=0\n");
+ ret = 0;
+ errno = EAGAIN;
+ }
+ else if (!(++i % 3))
+ {
+ fprintf(stderr, "sock_read: inject EAGAIN with ret=-1\n");
+ ret = -1;
+ errno = EAGAIN;
+ }
+ else
+ ret = readsocket(b->num, out, outl);
+ }
BIO_clear_retry_flags(b);
if (ret <= 0) {
if (BIO_sock_should_retry(ret))
And before this patch (well, not the patch itself, but the referenced
patch in mariadb-connector-c) if you will pass read_write_timeout=1 it
will fail:
SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)
Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)
But after, it always works:
$ ch benchmark -c10 -q "SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connection_pool_size=1, connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)"
^CStopping launch of queries. SIGINT received.
Queries executed: 478.
localhost:9000, queries: 478, QPS: 120.171, RPS: 120.171, MiB/s: 0.001, result RPS: 120.171, result MiB/s: 0.001.
0.000% 0.014 sec.
10.000% 0.058 sec.
20.000% 0.065 sec.
30.000% 0.073 sec.
40.000% 0.079 sec.
50.000% 0.087 sec.
60.000% 0.089 sec.
70.000% 0.091 sec.
80.000% 0.095 sec.
90.000% 0.100 sec.
95.000% 0.102 sec.
99.000% 0.105 sec.
99.900% 0.140 sec.
99.990% 0.140 sec.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Right now it fails due to [1] with the following error:
/usr/work/ClickHouse/contrib/libbcrypt/crypt_blowfish/ow-crypt.h:27:14: error: conflicting types for 'crypt_r'
27 | extern char *crypt_r(__const char *key, __const char *setting, void *data);
| ^
/usr/include/unistd.h:500:7: note: previous declaration is here
500 | char *crypt_r(const char *, const char *, struct crypt_data *);
| ^
[1]: 5f521d7ba7
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
From this macros sizeof(unw_context_t)/sizeof(unw_cursor_t) is depends
(_LIBUNWIND_CONTEXT_SIZE/_LIBUNWIND_CURSOR_SIZE).
So it should be not only PRIVATE but for INTERFACE as well.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>