Commit Graph

117 Commits

Author SHA1 Message Date
Robert Schulze
6b2b3c1eb3
feat: implement catboost in library-bridge
This commit moves the catboost model evaluation out of the server
process into the library-bridge binary. This serves two goals: On the
one hand, crashes / memory corruptions of the catboost library no longer
affect the server. On the other hand, we can forbid loading dynamic
libraries in the server (catboost was the last consumer of this
functionality), thus improving security.

SQL syntax:

  SELECT
    catboostEvaluate('/path/to/model.bin', FEAT_1, ..., FEAT_N) > 0 AS prediction,
    ACTION AS target
  FROM amazon_train
  LIMIT 10

Required configuration:

  <catboost_lib_path>/path/to/libcatboostmodel.so</catboost_lib_path>

*** Implementation Details ***

The internal protocol between the server and the library-bridge is
simple:

- HTTP GET on path "/extdict_ping":
  A ping, used during the handshake to check if the library-bridge runs.

- HTTP POST on path "/extdict_request":
  (1) Send a "catboost_GetTreeCount" request from the server to the
      bridge, containing a library path (e.g. /home/user/libcatboost.so) and
      a model path (e.g. /home/user/model.bin). First, this unloads the
      catboost library handler associated with the model path (if it was
      loaded), then loads the catboost library handler associated with the
      model path, then executes GetTreeCount() on the library handler and
      finally sends the result back to the server. Step (1) is called once
      by the server from FunctionCatBoostEvaluate::getReturnTypeImpl(). The
      library handler is unloaded at the beginning because it contains
      state which may no longer be valid if the user runs
      catboostEvaluate('/path/to/model.bin', ...) more than once and if "model.bin"
      was updated in between.
  (2) Send "catboost_Evaluate" from the server to the bridge, containing
      the model path and the features to run the inference on. Step (2)
      is called multiple times (once per chunk) by the server from function
      FunctionCatBoostEvaluate::executeImpl(). The library handler for the
      given model path is expected to be already loaded by Step (1).
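
A minimal sketch of the bridge-side bookkeeping behind steps (1) and (2),
assuming an in-memory map from model path to handler. All names
(FakeCatBoostHandler, FakeBridge, handleGetTreeCount, handleEvaluate) are
hypothetical, the tree count of 42 is a hard-coded placeholder, and the
"inference" merely sums the features of each row; no real catboost library is
loaded.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a loaded catboost library plus model.
struct FakeCatBoostHandler
{
    std::string library_path;
    std::string model_path;
    size_t tree_count;

    size_t getTreeCount() const { return tree_count; }

    // Placeholder "inference": returns the sum of the features of each row.
    std::vector<double> evaluate(const std::vector<std::vector<double>> & rows) const
    {
        std::vector<double> result;
        for (const auto & row : rows)
        {
            double sum = 0;
            for (double value : row)
                sum += value;
            result.push_back(sum);
        }
        return result;
    }
};

class FakeBridge
{
public:
    // Step (1): drop a possibly stale handler, load a fresh one, return the tree count.
    size_t handleGetTreeCount(const std::string & library_path, const std::string & model_path)
    {
        handlers.erase(model_path); // no-op if nothing was loaded for this model
        ++loads;
        auto handler = std::make_shared<FakeCatBoostHandler>(
            FakeCatBoostHandler{library_path, model_path, 42}); // 42 is a placeholder
        handlers[model_path] = handler;
        return handler->getTreeCount();
    }

    // Step (2): the handler must already have been loaded by step (1).
    std::vector<double> handleEvaluate(
        const std::string & model_path, const std::vector<std::vector<double>> & rows) const
    {
        auto it = handlers.find(model_path);
        if (it == handlers.end())
            throw std::runtime_error("no handler loaded for model " + model_path);
        return it->second->evaluate(rows);
    }

    size_t loads = 0; // counts how often a handler was (re)loaded

private:
    std::map<std::string, std::shared_ptr<FakeCatBoostHandler>> handlers;
};
```

Calling handleGetTreeCount() twice for the same model path reloads the handler
each time, which mirrors the "unload first" rule above; handleEvaluate() for a
model path that was never loaded fails.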

Fixes #27870
2022-08-29 20:26:45 +00:00
Robert Schulze
810221baf2
Assume unversioned server has version=0 and use tryParse() instead of from_chars() 2022-08-10 07:39:32 +00:00
Robert Schulze
e0d5020a92
Add simple versioning to the *-bridge-to-server protocol
- In general, it is expected that clickhouse-*-bridges and
  clickhouse-server were built from the same source version (i.e. are
  upgraded "atomically"). If that is not the case, we should at least
  be able to detect the mismatch and abort.

- This commit adds a URL parameter "version", defined in a header shared
  by the server and bridges. The bridge returns an error in case of
  mismatch.

- The version is *not* sent and checked for "ping" requests (used for
  handshake), only for regular requests sent after handshake. This is
  because the internally thrown server-side exception due to HTTP
  failure does not propagate the exact HTTP error (it only stores the
  error as text), and as a result, the server-side handshake code
  simply retries in case of error with exponential backoff and finally
  fails with a "timeout error". This is reasonable as pings typically
  fail due to timeouts. However, without a rework of HTTP exceptions,
  version mismatch during ping would also appear as "timeout" which is
  too misleading. The behavior may be changed later if needed.

- Note that introducing a version parameter does not represent a
  protocol upgrade itself. Bridges older than the server will simply
  ignore the field. Only servers older than the bridges receive an error
  but such a situation should never occur in practice.
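
A minimal sketch of the bridge-side check for regular (non-ping) requests,
assuming the URL parameters are already available as a string map. The
constant value 1 and the names BRIDGE_PROTOCOL_VERSION, tryParseVersion and
checkVersion are hypothetical; a missing parameter is treated as version 0,
i.e. as coming from an unversioned server.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Assumed value; per the commit, the real constant lives in a header shared by server and bridges.
constexpr size_t BRIDGE_PROTOCOL_VERSION = 1;

// Hypothetical helper: parses a short non-negative decimal number, returns false on anything else.
inline bool tryParseVersion(const std::string & text, size_t & out)
{
    if (text.empty() || text.size() > 9)
        return false;
    size_t value = 0;
    for (char c : text)
    {
        if (c < '0' || c > '9')
            return false;
        value = value * 10 + static_cast<size_t>(c - '0');
    }
    out = value;
    return true;
}

enum class VersionCheck { Ok, Mismatch, Malformed };

// Checks the "version" URL parameter of a regular request.
inline VersionCheck checkVersion(const std::map<std::string, std::string> & url_params)
{
    size_t server_version = 0; // an unversioned server is assumed to have version 0
    auto it = url_params.find("version");
    if (it != url_params.end() && !tryParseVersion(it->second, server_version))
        return VersionCheck::Malformed;
    return server_version == BRIDGE_PROTOCOL_VERSION ? VersionCheck::Ok : VersionCheck::Mismatch;
}
```

In this sketch the bridge would answer Mismatch and Malformed with an HTTP
error; ping requests would never reach checkVersion().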
2022-08-08 19:40:37 +00:00
Robert Schulze
9952ab1099
Prefix class names "LibraryBridge*Handler" with "ExternalDictionary"
- necessary to disambiguate the names from "CatBoost"-"LibraryBridgeHandler"
  which will be added in a later step
2022-08-08 17:16:46 +00:00
Robert Schulze
20bb8a248e
Prepare server-side BridgeHelper for catboost integration
Wall of text, sorry, but I also had to document some stuff for myself:

There are three ways to communicate data using HTTP:
- the HTTP verb: for our purposes, PUT and GET,
- the HTTP path: '/ping', '/request' etc.,
- the HTTP URL parameter(s), e.g. 'method=libNew&dictionary_id=1234'

The bridge will use different handlers for communication with the
external dictionary library and for communication with the catboost
library. Handlers are created based on a combination of the HTTP verb
and the HTTP path. More specifically, there will be the combinations
- GET + '/extdict_ping'
- PUT + '/extdict_request'
- GET + '/catboost_ping'
- PUT + '/catboost_request'.
For each combination, the bridge expects a certain set of URL
parameters, e.g. for the first combination parameter "dictionary_id" is
expected.

Starting with this commit, the library-bridge creates handlers based on
the first two combinations (the latter two combinations will be added
later). This makes the handler creation mechanism consistent with its
counterpart in xdbc-bridge.
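
A minimal sketch of handler selection by verb and path, using the four
combinations listed above exactly as written (GET for pings, PUT for
requests). The enum HandlerKind and the function selectHandler are
hypothetical names; the real bridge creates handler objects rather than
returning an enum.

```cpp
#include <string>

enum class HandlerKind
{
    ExternalDictionaryPing,
    ExternalDictionaryRequest,
    CatBoostPing,
    CatBoostRequest,
    Unknown
};

// Hypothetical dispatch function: picks a handler kind from the HTTP verb and the HTTP path.
inline HandlerKind selectHandler(const std::string & verb, const std::string & path)
{
    if (verb == "GET")
    {
        if (path == "/extdict_ping")
            return HandlerKind::ExternalDictionaryPing;
        if (path == "/catboost_ping")
            return HandlerKind::CatBoostPing;
    }
    else if (verb == "PUT")
    {
        if (path == "/extdict_request")
            return HandlerKind::ExternalDictionaryRequest;
        if (path == "/catboost_request")
            return HandlerKind::CatBoostRequest;
    }
    // Any other verb/path combination has no handler.
    return HandlerKind::Unknown;
}
```

URL parameters such as "dictionary_id" play no role in the selection; they are
only validated by the handler that was picked.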

For that, it was necessary to make both IBridgeHelper methods
"getMainURI()" and "getPingURI()" pure virtual so that derived classes
(LibraryBridgeHelper and XDBCBridgeHelper) must provide custom URLs with
custom paths.
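
A minimal sketch of that interface change, assuming URIs are plain strings.
The class names BridgeHelperSketch and LibraryBridgeHelperSketch as well as
the host/port constructor are hypothetical; only the method names and the two
"/extdict_*" paths are taken from this message.

```cpp
#include <string>
#include <utility>

// Base class: derived helpers must provide both URIs themselves.
class BridgeHelperSketch
{
public:
    virtual ~BridgeHelperSketch() = default;
    virtual std::string getPingURI() const = 0;
    virtual std::string getMainURI() const = 0;
};

// Hypothetical library-bridge helper with its own custom paths.
class LibraryBridgeHelperSketch : public BridgeHelperSketch
{
public:
    LibraryBridgeHelperSketch(std::string host_, unsigned port_)
        : host(std::move(host_)), port(port_)
    {
    }

    std::string getPingURI() const override { return base() + "/extdict_ping"; }
    std::string getMainURI() const override { return base() + "/extdict_request"; }

private:
    // Builds the scheme/host/port prefix shared by both URIs.
    std::string base() const { return "http://" + host + ":" + std::to_string(port); }

    std::string host;
    unsigned port;
};
```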

Side note 1: Previously, LibraryBridgeHelper sent HTTP URL parameter
"method=ping" during handshake (PING) but the library-bridge ignored
that parameter. We now omit this parameter, i.e.
LibraryBridgeHelper::PING was removed. Again, this makes things
consistent with xdbc-bridge.

Side note 2: xdbc-bridge is unchanged in this commit. Therefore,
XDBCBridgeHelper now uses the HTTP paths previously in the base class.
For some reason, XDBCBridgeHelper did not use
IBridgeHelper::getMainURI() - it generates the URLs by itself. I kept it
that way for now but provided an implementation of getMainURI() anyway.
2022-08-04 19:29:51 +00:00
Robert Schulze
ea73b98fb9
Prepare library-bridge for catboost integration
- Rename generic file and identifier names in library-bridge to
  something more dictionary-specific. This is needed because later on,
  catboost will be integrated into library-bridge.

- Also: Some smaller fixes like typos and un-inlining non-performance
  critical code.

- The logic remains unchanged in this commit.
2022-08-04 19:26:51 +00:00
Robert Schulze
1a7727a254
Prefix overridden add_executable() command with "clickhouse_"
A simple HelloWorld program with zero includes except iostream triggers
a build of ca. 2000 source files. The reason is that ClickHouse's
top-level CMakeLists.txt overrides "add_executable()" to link all
binaries against "clickhouse_new_delete". This links against
"clickhouse_common_io", which in turn has lots of 3rd party library
dependencies ... Without linking "clickhouse_new_delete", the number of
compiled files for "HelloWorld" goes down to ca. 70.

As an example, the self-extracting-executable needs none of its current
dependencies but other programs may also benefit.

In order to restore access to the original "add_executable()", the
overriding version is now prefixed. There is precedent for a
"clickhouse_" prefix (as opposed to "ch_"), for example
"clickhouse_split_debug_symbols". In general prefixing makes sense also
because overriding CMake commands relies on undocumented behavior and is
considered not-so-great practice (*).

(*) https://crascit.com/2018/09/14/do-not-redefine-cmake-commands/
2022-07-11 19:36:18 +02:00
Robert Schulze
bb358617e1
Better naming for stuff related to split debug symbols
The previous name was slightly misleading, e.g. it is not about
"installing stripped binaries" but about splitting debug symbols from the
binary.
2022-06-30 23:41:27 +02:00
Robert Schulze
5a4f21c50f
Support for Clang Thread Safety Analysis (TSA)
- TSA is a static analyzer built by Google which finds race conditions
  and deadlocks at compile time.

- It works by associating a shared member variable with a
  synchronization primitive that protects it. The compiler can then
  check at each access if proper locking happened before. Good
  introductions are [0] and [1].

- TSA requires some help by the programmer via annotations. Luckily,
  LLVM's libcxx already has annotations for std::mutex, std::lock_guard,
  std::shared_mutex and std::scoped_lock. This commit enables them
  (--> contrib/libcxx-cmake/CMakeLists.txt).

- Further, this commit adds convenience macros for the low-level
  annotations for use in ClickHouse (--> base/defines.h). For
  demonstration, they are leveraged in a few places.

- As we compile with "-Wall -Wextra -Weverything", the required compiler
  flag "-Wthread-safety-analysis" was already enabled. Negative checks
  are an experimental feature of TSA and disabled
  (--> cmake/warnings.cmake). Compile times did not increase noticeably.

- TSA is used in a few places with simple locking. I tried TSA also
  where locking is more complex. The problem was usually that it is
  unclear which data is protected by which lock :-(. But there was
  definitely some weird code where locking looked broken. So there is
  some potential to find bugs.
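
A minimal sketch of what such annotations look like on a class with simple
locking. The macro names SKETCH_GUARDED_BY and SKETCH_REQUIRES are
hypothetical stand-ins for the convenience macros in base/defines.h; they
wrap Clang's guarded_by and requires_capability attributes and expand to
nothing on other compilers. Clang only checks them when
-Wthread-safety-analysis is on and std::mutex itself is annotated.

```cpp
#include <mutex>

#if defined(__clang__)
#    define SKETCH_GUARDED_BY(x) __attribute__((guarded_by(x)))
#    define SKETCH_REQUIRES(x) __attribute__((requires_capability(x)))
#else
#    define SKETCH_GUARDED_BY(x)
#    define SKETCH_REQUIRES(x)
#endif

class Counter
{
public:
    void increment()
    {
        std::lock_guard<std::mutex> lock(mutex);
        incrementLocked(); // fine: the lock is held here
    }

    int get() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return value;
    }

private:
    mutable std::mutex mutex;

    // Every access to 'value' must happen while 'mutex' is held.
    int value SKETCH_GUARDED_BY(mutex) = 0;

    // Callers must already hold 'mutex' when calling this helper.
    void incrementLocked() SKETCH_REQUIRES(mutex) { ++value; }
};
```

Removing the std::lock_guard from increment() would leave the program
compilable, but with the analysis active Clang would warn that
incrementLocked() is called without holding the mutex.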

*** Limitations of TSA besides the ones listed in [1]:

- The programmer needs to know which lock protects which piece of shared
  data. This is not always easy for large classes.

- Two synchronization primitives used in ClickHouse are not annotated in
  libcxx:
  (1) std::unique_lock: A releasable lock handle, often used together with
      std::condition_variable, e.g. to solve producer-consumer problems.
  (2) std::recursive_mutex: A re-entrant mutex variant. Its usage can be
      considered a design flaw + typically it is slower than a standard
      mutex. In this commit, one std::recursive_mutex was converted to
      std::mutex and annotated with TSA.

- For free-standing functions (e.g. helper functions) which are passed
  shared data members, it can be tricky to specify the associated lock.
  This is because the annotations use the normal C++ rules for symbol
  resolution.

[0] https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42958.pdf
2022-06-20 16:13:25 +02:00
Amos Bird
4a5e4274f0
base should not depend on Common 2022-04-29 10:26:35 +08:00
Robert Schulze
118e94523c
Activate clang-tidy warning "readability-container-contains"
This check suggests replacing <Container>.count() by
<Container>.contains() which is more expressive and in case of
multimaps/multisets also faster.
2022-04-18 23:53:11 +02:00
alesapin
e790a73081 Simplify strip for new packages 2022-03-23 15:14:30 +01:00
Mikhail f. Shiryaev
1d362796df
Fix strip bug 2022-03-22 11:10:02 +01:00
Mikhail f. Shiryaev
fa2a9bb9aa
Separate BUILD_STRIPPED_BINARIES_PREFIX to option and parameter 2022-03-22 11:10:02 +01:00
alesapin
e53578910b Add ability to strip binaries in cmake 2022-03-10 22:23:28 +01:00
Azat Khuzhin
3b3635c6d5 Fix formatting error in logging messages
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-02-01 14:30:04 +03:00
Azat Khuzhin
bedf208cbd Use fmt::runtime() for LOG_* for non constexpr
Here is the one-liner:

    $ gg 'LOG_\(DEBUG\|TRACE\|INFO\|TEST\|WARNING\|ERROR\|FATAL\)([^,]*, [a-zA-Z]' -- :*.cpp :*.h | cut -d: -f1 | sort -u | xargs -r sed -E -i 's#(LOG_[A-Z]*)\(([^,]*), ([A-Za-z][^,)]*)#\1(\2, fmt::runtime(\3)#'

Note that I tried to do this with coccinelle (a tool for semantic
patching), but it cannot parse C++:

    $ cat fmt.cocci
    @@
    expression log;
    expression var;
    @@

    -LOG_DEBUG(log, var)
    +LOG_DEBUG(log, fmt::runtime(var))

I've also tried to use some macros/templates magic to do this implicitly
in logger_useful.h, but I failed to do so, and apparently it is not
possible for now.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>

v2: manual fixes
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-02-01 14:30:03 +03:00
Nikolai Kochetov
a08c98d760 Move some files. 2021-10-16 17:03:50 +03:00
Nikolai Kochetov
ab28c6c855 Remove BlockInputStream interfaces. 2021-10-14 13:25:43 +03:00
Nikolai Kochetov
ec18340351 Remove streams from formats. 2021-10-11 19:11:50 +03:00
Alexey Milovidov
fe6b7c77c7 Rename "common" to "base" 2021-10-02 10:13:14 +03:00
kssenii
294695bb7d Review fixes 2021-08-02 13:40:58 +00:00
kssenii
9c6a8b0059 Restore previous ids passing 2021-08-01 08:59:19 +00:00
kssenii
130253e3b9 Fix bridge-server interaction in case of metadata inconsistency 2021-08-01 08:59:16 +00:00
Raúl Marín
2442216472 Fix style too 2021-07-28 11:39:53 +02:00
kssenii
3fe5e8d1ce Fix 2021-07-28 08:30:58 +00:00
kssenii
6c220c8b35 Fix ids parsing 2021-07-27 20:54:21 +00:00
Nikolai Kochetov
d03bcebc8e Remove debug logging. 2021-07-23 12:05:42 +03:00
Nikolai Kochetov
f38de35b14 Rename some constants. 2021-07-21 19:13:17 +03:00
alexey-milovidov
04be5437d9
Merge pull request #25296 from abyss7/http-issues
Add settings for HTTP header limitations
2021-06-17 01:50:48 +03:00
Maksim Kita
67e9b85951 Merge ext into common 2021-06-16 23:28:41 +03:00
Ivan Lezhankin
ba08a580f8 Add test 2021-06-16 17:33:14 +03:00
Ivan Lezhankin
b182d87d9c Add settings for HTTP header limitations 2021-06-15 17:33:46 +03:00
Alexey Milovidov
e905883c75 More fixes for PVS-Studio 2021-05-08 19:12:31 +03:00
Maksim Kita
dcf41db1ae
Merge pull request #23048 from kitaisreal/library-dictionary-bridge-library-interface
LibraryDictionary bridge library interface
2021-04-14 11:23:29 +03:00
Maksim Kita
616d7d19f8 LibraryDictionary bridge library interface 2021-04-13 22:53:36 +03:00
kssenii
7a287e6fe9 Merge branch 'master' of https://github.com/ClickHouse/ClickHouse into nanodbc 2021-04-11 21:36:08 +00:00
Ivan
495c6e03aa
Replace all Context references with std::weak_ptr (#22297)
* Replace all Context references with std::weak_ptr

* Fix shared context captured by value

* Fix build

* Fix Context with named sessions

* Fix copy context

* Fix gcc build

* Merge with master and fix build

* Fix gcc-9 build
2021-04-11 02:33:54 +03:00
kssenii
7a89948801 Fix 2021-04-07 07:16:50 +00:00
kssenii
4419a430cb Less dependencies 2021-04-06 20:15:32 +00:00
kssenii
b629f5c64d Add const 2021-04-05 14:15:10 +00:00
kssenii
89a2e94364 Fixes 2021-04-05 14:08:49 +00:00
kssenii
02c6332e86 Pass null values properly 2021-04-02 18:45:42 +03:00
kssenii
ab3caf7b3c Add exception message 2021-03-26 16:16:31 +00:00
kssenii
6d41f7356b Better way to pass attributes 2021-03-24 19:32:31 +00:00
kssenii
7da36be9b6 Better 2021-03-24 09:23:29 +00:00
kssenii
c008f054ae Pass sample_block only once 2021-03-24 08:41:42 +00:00
kssenii
1ef3c1f780 Use binary format for params 2021-03-24 07:55:21 +00:00
kssenii
b8d9b97903 Better 2021-03-23 15:43:14 +00:00
kssenii
e877402406 Better 2021-03-22 15:58:20 +00:00
kssenii
472ce89b75 Small fixes 2021-03-22 14:39:17 +00:00
kssenii
4ba83aa87a Small improvement 2021-03-17 08:20:14 +00:00
kssenii
323fb54a8e Fix split build finally 2021-03-12 21:12:34 +00:00
kssenii
02ed33936a Fix split build 2021-03-12 12:54:49 +00:00
kssenii
4d29241d5a Try fix build 2021-03-10 21:14:09 +00:00
kssenii
e77ff68582 Better 2021-03-10 18:26:15 +00:00
kssenii
38f7f37468 Add loadKeys method 2021-03-10 13:23:07 +00:00
kssenii
8a9fa561c7 Fix style check, build, ya check 2021-03-07 20:07:42 +00:00
kssenii
b7b15fe920 Remove unused 2021-03-07 15:23:20 +00:00
kssenii
9e9bf2bb75 Common base for bridges part 3 2021-03-07 13:55:40 +00:00
kssenii
1c4d4c8e54 Better handlers 2021-03-07 11:31:55 +00:00
kssenii
61d8e27ea7 Common base for bridges part 2 2021-03-07 11:21:49 +00:00
kssenii
f83a5d83a2 Better 2021-03-06 18:44:40 +00:00
kssenii
94af06588e Make libraries storage as singleton 2021-03-06 18:38:01 +00:00
kssenii
2c080da51b Better 2021-03-05 15:37:43 +00:00
kssenii
e0cda1033a More methods 2021-03-05 10:43:47 +00:00
kssenii
dd4a7b6e3d First version 2021-03-05 10:19:01 +00:00