We have an issue when using an external dictionary. Occasionally the library bridge is called with extDict_libClone and fails with "Unknown library method 'extDict_libClone'". It looks like at some point `else if (method == "extDict_libNew")` was changed to `if (lib_new)` without any handling for extDict_libClone inside the new if/else statement, so the bridge reports extDict_libClone as an unknown method.
So there is a two-line fix to handle extDict_libClone properly.
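To illustrate the kind of dispatch involved, here is a minimal sketch, not the actual bridge code. The names `DictMethod`, `classifyMethod` and `handleRequest`, as well as the "ok: ..." response strings, are hypothetical; only the method names and the error text come from the description and logs.

```cpp
#include <string>

// Hypothetical classification of the "method" URL parameter.
enum class DictMethod { LibNew, LibClone, Unknown };

DictMethod classifyMethod(const std::string & method)
{
    if (method == "extDict_libNew")
        return DictMethod::LibNew;
    else if (method == "extDict_libClone") // the branch that must not be lost
        return DictMethod::LibClone;
    return DictMethod::Unknown;
}

// Builds a response body; unknown methods yield the error text seen in the logs.
std::string handleRequest(const std::string & method)
{
    switch (classifyMethod(method))
    {
        case DictMethod::LibNew:
            return "ok: created"; // placeholder response
        case DictMethod::LibClone:
            return "ok: cloned"; // placeholder response
        case DictMethod::Unknown:
            break;
    }
    return "Unknown library method '" + method + "'";
}
```

With the clone branch in place, `handleRequest("extDict_libClone")` no longer falls through to the error path.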
Error logs that we have:
```
2022.12.16 14:17:44.285088 [ 393573 ] {} <Error> ExternalDictionaries: Could not update cache dictionary 'dict.vhash_s', next update is scheduled at 2022-12-16 14:18:00: Code: 86. DB::Exception: Received error from remote server /extdict_request?version=1&dictionary_id=be2b2cd1-ba57-4658-8d1b-35ef40ab005b&method=extDict_libClone&from_dictionary_id=c3537142-eaa9-4deb-9b65-47eb8ea1dee6. HTTP status code: 500 Internal Server Error, body: Unknown library method 'extDict_libClone'
2022.12.16 14:17:44.387049 [ 399133 ] {} <Error> ExternalDictionaries: Could not update cache dictionary 'dict.vhash_s', next update is scheduled at 2022-12-16 14:17:51: Code: 86. DB::Exception: Received error from remote server /extdict_request?version=1&dictionary_id=0df866ac-6c94-4974-a76c-3940522091b9&method=extDict_libClone&from_dictionary_id=c3537142-eaa9-4deb-9b65-47eb8ea1dee6. HTTP status code: 500 Internal Server Error, body: Unknown library method 'extDict_libClone'
2022.12.16 14:17:44.488468 [ 397769 ] {} <Error> ExternalDictionaries: Could not update cache dictionary 'dict.vhash_s', next update is scheduled at 2022-12-16 14:19:38: Code: 86. DB::Exception: Received error from remote server /extdict_request?version=1&dictionary_id=2d8af321-b669-4526-982b-42c0fabf0e8d&method=extDict_libClone&from_dictionary_id=c3537142-eaa9-4deb-9b65-47eb8ea1dee6. HTTP status code: 500 Internal Server Error, body: Unknown library method 'extDict_libClone'
2022.12.16 14:17:44.489935 [ 398226 ] {datamarts_v_dwh_node0032-241534:0x552da2_1_11} <Error> executeQuery: Code: 510. DB::Exception: Update failed for dictionary 'dict.vhash_s': Code: 510. DB::Exception: Update failed for dictionary dict.vhash_s : Code: 86. DB::Exception: Received error from remote server /extdict_request?version=1&dictionary_id=be2b2cd1-ba57-4658-8d1b-35ef40ab005b&method=extDict_libClone&from_dictionary_id=c3537142-eaa9-4deb-9b65-47eb8ea1dee6. HTTP status code: 500 Internal Server Error, body: Unknown library method 'extDict_libClone'
```
- This commit restores statements "SYSTEM RELOAD MODEL(S)" which provide
a mechanism to update a model explicitly. It also saves potentially
unnecessary reloads of a model from disk after its initial load.
To keep the complexity low, the semantics of "SYSTEM RELOAD MODEL(S)"
were changed from eager to lazy. This means that both statements
previously immediately reloaded the specified/all models, whereas now
the statements only trigger an unload and the first call to
catboostEvaluate() does the actual load.
- Monitoring view SYSTEM.MODELS is also restored but with some obsolete
fields removed. The view was not documented in the past and for now it
remains undocumented. The commit is thus not considered a breach of
ClickHouse's public interface.
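The lazy reload semantics described above can be sketched roughly as follows. This is an illustration under assumptions, not the real implementation: `LazyModelCache` is a hypothetical class, and a model is represented by a plain `int` returned from a caller-supplied loader.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical cache illustrating "reload only unloads, first use loads".
class LazyModelCache
{
public:
    using Loader = std::function<int(const std::string &)>;

    explicit LazyModelCache(Loader loader_) : loader(std::move(loader_)) {}

    // Analogue of SYSTEM RELOAD MODEL <path>: only drops the cached entry.
    void reload(const std::string & path) { models.erase(path); }

    // Analogue of SYSTEM RELOAD MODELS: drops every cached entry.
    void reloadAll() { models.clear(); }

    // Analogue of the first catboostEvaluate() call: load on demand, then reuse.
    int get(const std::string & path)
    {
        auto it = models.find(path);
        if (it == models.end())
            it = models.emplace(path, loader(path)).first;
        return it->second;
    }

private:
    Loader loader;
    std::map<std::string, int> models;
};
```

In this sketch `reload()` never invokes the loader itself, so a reload of a model that is never evaluated again costs no disk access.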
This commit moves the catboost model evaluation out of the server
process into the library-bridge binary. This serves two goals: On the
one hand, crashes / memory corruptions of the catboost library no longer
affect the server. On the other hand, we can forbid loading dynamic
libraries in the server (catboost was the last consumer of this
functionality), thus improving security.
SQL syntax:
SELECT
catboostEvaluate('/path/to/model.bin', FEAT_1, ..., FEAT_N) > 0 AS prediction,
ACTION AS target
FROM amazon_train
LIMIT 10
Required configuration:
<catboost_lib_path>/path/to/libcatboostmodel.so</catboost_lib_path>
*** Implementation Details ***
The internal protocol between the server and the library-bridge is
simple:
- HTTP GET on path "/extdict_ping":
A ping, used during the handshake to check if the library-bridge runs.
- HTTP POST on path "/extdict_request":
(1) Send a "catboost_GetTreeCount" request from the server to the
bridge, containing a library path (e.g. /home/user/libcatboost.so) and
a model path (e.g. /home/user/model.bin). First, this unloads the
catboost library handler associated to the model path (if it was
loaded), then loads the catboost library handler associated to the
model path, then executes GetTreeCount() on the library handler and
finally sends the result back to the server. Step (1) is called once
by the server from FunctionCatBoostEvaluate::getReturnTypeImpl(). The
library path handler is unloaded in the beginning because it contains
state which may no longer be valid if the user runs
catboost("/path/to/model.bin", ...) more than once and if "model.bin"
was updated in between.
(2) Send "catboost_Evaluate" from the server to the bridge, containing
the model path and the features to run the inference on. Step (2)
is called multiple times (once per chunk) by the server from function
FunctionCatBoostEvaluate::executeImpl(). The library handler for the
given model path is expected to be already loaded by Step (1).
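The two request types above can be pictured with the following sketch of the bridge-side bookkeeping. It is a simplified model under assumptions: `BridgeSketch` and `ModelHandler` are hypothetical names, the tree count of 100 is a placeholder, and the "evaluation" just sums the features instead of calling into the catboost library.

```cpp
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for a loaded catboost library handler.
struct ModelHandler
{
    std::string library_path;
    int tree_count;
};

class BridgeSketch
{
public:
    // Step (1): unload any stale handler, load a fresh one, return its tree count.
    int getTreeCount(const std::string & library_path, const std::string & model_path)
    {
        handlers.erase(model_path);
        ++load_count;
        // Placeholder for loading the library and calling GetTreeCount() on it.
        ModelHandler handler{library_path, 100};
        handlers[model_path] = handler;
        return handler.tree_count;
    }

    // Step (2): evaluate one chunk; fails if step (1) did not load the handler.
    bool evaluate(const std::string & model_path, const std::vector<double> & features, double & result) const
    {
        if (handlers.find(model_path) == handlers.end())
            return false;
        // Placeholder computation standing in for the real model evaluation.
        result = std::accumulate(features.begin(), features.end(), 0.0);
        return true;
    }

    int loadCount() const { return load_count; }

private:
    std::map<std::string, ModelHandler> handlers;
    int load_count = 0;
};
```

Calling `getTreeCount()` twice for the same model path loads the handler twice, which mirrors the unload-then-load behavior that guards against a model file updated between queries.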
Fixes #27870
- In general, it is expected that clickhouse-*-bridges and
clickhouse-server were built from the same source version (e.g. are
upgraded "atomically"). If that is not the case, we should at least
be able to detect the mismatch and abort.
- This commit adds a URL parameter "version", defined in a header shared
by the server and bridges. The bridge returns an error in case of
mismatch.
- The version is *not* sent and checked for "ping" requests (used for
handshake), only for regular requests sent after handshake. This is
because the internally thrown server-side exception due to HTTP
failure does not propagate the exact HTTP error (it only stores the
error as text), and as a result, the server-side handshake code
simply retries in case of error with exponential backoff and finally
fails with a "timeout error". This is reasonable as pings typically
fail due to timeouts. However, without a rework of HTTP exceptions,
version mismatch during ping would also appear as "timeout" which is
too misleading. The behavior may be changed later if needed.
- Note that introducing a version parameter does not represent a
protocol upgrade itself. Bridges older than the server will simply
ignore the field. Only servers older than the bridges receive an error
but such a situation should never occur in practice.
- Rename generic file and identifier names in library-bridge to
something more dictionary-specific. This is needed because later on,
catboost will be integrated into library-bridge.
- Also: Some smaller fixes like typos and un-inlining non-performance
critical code.
- The logic remains unchanged in this commit.
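As a rough illustration of the "version" URL parameter check described earlier (skipped for pings, enforced for regular requests), consider the sketch below. The constant `BRIDGE_PROTOCOL_VERSION`, the function `checkVersion` and the error messages are assumptions made for this example; only the paths and the parameter name come from the text and logs.

```cpp
#include <map>
#include <string>

// Assumed value; the real constant is defined in a header shared by server and bridges.
constexpr int BRIDGE_PROTOCOL_VERSION = 1;

// Returns an empty string if the request may proceed, otherwise an error message.
std::string checkVersion(const std::string & path, const std::map<std::string, std::string> & params)
{
    // Pings carry no version and are never rejected for version reasons.
    if (path == "/extdict_ping")
        return "";

    auto it = params.find("version");
    if (it == params.end())
        return "No 'version' in request URL"; // e.g. a server older than the bridge

    const std::string expected = std::to_string(BRIDGE_PROTOCOL_VERSION);
    if (it->second != expected)
        return "Version mismatch: bridge expects " + expected + ", got " + it->second;

    return "";
}
```

A request such as `/extdict_request?version=1&method=extDict_libClone` would pass this check, while the same request without `version` would be rejected.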