The test is a stress test which queries all system tables in parallel.
System table SYSTEM.MODELS is now populated using the library-bridge.
This made the test fail with error:
2022-09-12 23:51:34 Code: 410. DB::Exception: Received from localhost:9000. DB::Exception: BridgeHelper: clickhouse-library-bridge is not responding. (EXTERNAL_SERVER_IS_NOT_RESPONDING)
2022-09-12 23:51:34 (query: SELECT * FROM system.models LIMIT 10000 FORMAT Null)
Looking at the logs, the server tried to start the library-bridge when
querying SYSTEM.MODELS but multiple handshake attempts failed.
So (most likely) the infrastructure is slow or (unlikely) there are
settings in CI (e.g. a firewall) preventing communication with the
bridge. Because 1. other system tables are also excluded (zookeeper,
merge_tree_metadata_cache), 2. SYSTEM.MODELS is tested in integration
test "test_catboost_evaluate" and 3. the test runs locally just fine, I
am excluding SYSTEM.MODELS from it.
- This commit restores statements "SYSTEM RELOAD MODEL(S)" which provide
a mechanism to update a model explicitly. It also saves potentially
unnecessary reloads of a model from disk after it's initial load.
To keep the complexity low, the semantics of "SYSTEM RELOAD MODEL(S)
was changed from eager to lazy. This means that both statements
previously immedately reloaded the specified/all models, whereas now
the statements only trigger an unload and the first call to
catboostEvaluate() does the actual load.
- Monitoring view SYSTEM.MODELS is also restored but with some obsolete
fields removed. The view was not documented in the past and for now it
remains undocumented. The commit is thus not considered a breach of
ClickHouse's public interface.
- The deleted function modelEvaluate() was superseded by
catboostEvaluate().
- Also delete the external model repository, as modelEvaluate() was it's
last user. Additionally remove the system view SYSTEM.MODELS for
inspecting the repository.
- SYSTEM RELOAD MODELS is also obsolete. HOWEVER, it was retained and
made a no-op instead of deleted.
Why?
The reason is that RBAC in distributed setups works by storing
privileges (granted and revoked) as plain SQL statements in Keeper.
Nodes read these statements at startup and parse them. If a privilege
for SYSTEM RELOAD MODELS exists but parser doesn't recognize it
nodes would fail to come up.
Considered but rejected alternatives:
- Ignore SYSTEM RELOAD MODELS during parsing RBAC privileges and
return an error for regular SYSTEM RELOAD MODELS SQL. Special-case
of no-op behavior, too brittle.
- Remove SYSTEM RELOAD MODELS manually from Keeper via command-line
manipulation of Keeper nodes or via SQL by dropping the privileges.
Needs user intervention during upgrade.
This commit moves the catboost model evaluation out of the server
process into the library-bridge binary. This serves two goals: On the
one hand, crashes / memory corruptions of the catboost library no longer
affect the server. On the other hand, we can forbid loading dynamic
libraries in the server (catboost was the last consumer of this
functionality), thus improving security.
SQL syntax:
SELECT
catboostEvaluate('/path/to/model.bin', FEAT_1, ..., FEAT_N) > 0 AS prediction,
ACTION AS target
FROM amazon_train
LIMIT 10
Required configuration:
<catboost_lib_path>/path/to/libcatboostmodel.so</catboost_lib_path>
*** Implementation Details ***
The internal protocol between the server and the library-bridge is
simple:
- HTTP GET on path "/extdict_ping":
A ping, used during the handshake to check if the library-bridge runs.
- HTTP POST on path "extdict_request"
(1) Send a "catboost_GetTreeCount" request from the server to the
bridge, containing a library path (e.g /home/user/libcatboost.so) and
a model path (e.g. /home/user/model.bin). Rirst, this unloads the
catboost library handler associated to the model path (if it was
loaded), then loads the catboost library handler associated to the
model path, then executes GetTreeCount() on the library handler and
finally sends the result back to the server. Step (1) is called once
by the server from FunctionCatBoostEvaluate::getReturnTypeImpl(). The
library path handler is unloaded in the beginning because it contains
state which may no longer be valid if the user runs
catboost("/path/to/model.bin", ...) more than once and if "model.bin"
was updated in between.
(2) Send "catboost_Evaluate" from the server to the bridge, containing
the model path and the features to run the interference on. Step (2)
is called multiple times (once per chunk) by the server from function
FunctionCatBoostEvaluate::executeImpl(). The library handler for the
given model path is expected to be already loaded by Step (1).
Fixes#27870
Force-inlining code compiled for SSE3 into "generic"
(non-platform-specific) code works for standard x86 builds where
everything is compiled with SSE 4.2 (and smaller). It no longer works
if we compile everything only with SSE 2.