ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-11-21 23:21:59 +00:00

Author	SHA1	Message	Date
Robert Schulze	35a37c91f8	chore: incorporate review feedback	2022-08-29 20:27:06 +00:00
robot-clickhouse	64fa077148	style: fix style	2022-08-29 20:27:06 +00:00
Robert Schulze	6b2b3c1eb3	feat: implement catboost in library-bridge This commit moves the catboost model evaluation out of the server process into the library-bridge binary. This serves two goals: On the one hand, crashes / memory corruptions of the catboost library no longer affect the server. On the other hand, we can forbid loading dynamic libraries in the server (catboost was the last consumer of this functionality), thus improving security. SQL syntax: SELECT catboostEvaluate('/path/to/model.bin', FEAT_1, ..., FEAT_N) > 0 AS prediction, ACTION AS target FROM amazon_train LIMIT 10 Required configuration: <catboost_lib_path>/path/to/libcatboostmodel.so</catboost_lib_path> * Implementation Details * The internal protocol between the server and the library-bridge is simple: - HTTP GET on path "/extdict_ping": A ping, used during the handshake to check if the library-bridge runs. - HTTP POST on path "extdict_request" (1) Send a "catboost_GetTreeCount" request from the server to the bridge, containing a library path (e.g. /home/user/libcatboost.so) and a model path (e.g. /home/user/model.bin). First, this unloads the catboost library handler associated with the model path (if it was loaded), then loads the catboost library handler associated with the model path, then executes GetTreeCount() on the library handler and finally sends the result back to the server. Step (1) is called once by the server from FunctionCatBoostEvaluate::getReturnTypeImpl(). The library path handler is unloaded in the beginning because it contains state which may no longer be valid if the user runs catboost("/path/to/model.bin", ...) more than once and if "model.bin" was updated in between. (2) Send "catboost_Evaluate" from the server to the bridge, containing the model path and the features to run the inference on. Step (2) is called multiple times (once per chunk) by the server from function FunctionCatBoostEvaluate::executeImpl(). 
The library handler for the given model path is expected to be already loaded by Step (1). Fixes #27870	2022-08-29 20:26:45 +00:00
Robert Schulze	810221baf2	Assume unversioned server has version=0 and use tryParse() instead of from_chars()	2022-08-10 07:39:32 +00:00
Robert Schulze	e0d5020a92	Add simple versioning to the -bridge-to-server protocol - In general, it is expected that clickhouse--bridges and clickhouse-server were built from the same source version (e.g. are upgraded "atomically"). If that is not the case, we should at least be able to detect the mismatch and abort. - This commit adds a URL parameter "version", defined in a header shared by the server and bridges. The bridge returns an error in case of mismatch. - The version is not sent or checked for "ping" requests (used for handshake), only for regular requests sent after the handshake. This is because the internally thrown server-side exception due to HTTP failure does not propagate the exact HTTP error (it only stores the error as text), and as a result, the server-side handshake code simply retries in case of error with exponential backoff and finally fails with a "timeout error". This is reasonable as pings typically fail due to timeouts. However, without a rework of HTTP exceptions, a version mismatch during ping would also appear as "timeout", which is too misleading. The behavior may be changed later if needed. - Note that introducing a version parameter does not represent a protocol upgrade itself. Bridges older than the server will simply ignore the field. Only servers older than the bridges receive an error but such a situation should never occur in practice.	2022-08-08 19:40:37 +00:00
Robert Schulze	9952ab1099	Prefix class names "LibraryBridge*Handler" with "ExternalDictionary" - necessary to disambiguate the names from "CatBoost"-"LibraryBridgeHandler" which will be added in a next step	2022-08-08 17:16:46 +00:00
Robert Schulze	20bb8a248e	Prepare server-side BridgeHelper for catboost integration Wall of text, sorry, but I also had to document some stuff for myself: There are three ways to communicate data using HTTP: - the HTTP verb: for our purposes, PUT and GET, - the HTTP path: '/ping', '/request' etc., - the HTTP URL parameter(s), e.g. 'method=libNew&dictionary_id=1234' The bridge will use different handlers for communication with the external dictionary library and for communication with the catboost library. Handlers are created based on a combination of the HTTP verb and the HTTP path. More specifically, there will be combinations - GET + '/extdict_ping' - PUT + '/extdict_request' - GET + '/catboost_ping' - PUT + '/catboost_request'. For each combination, the bridge expects a certain set of URL parameters, e.g. for the first combination parameter "dictionary_id" is expected. Starting with this commit, the library-bridge creates handlers based on the first two combinations (the latter two combinations will be added later). This makes the handler creation mechanism consistent with its counterpart in xdbc-bridge. For that, it was necessary to make both IBridgeHelper methods "getMainURI()" and "getPingURI()" pure virtual so that derived classes (LibraryBridgeHelper and XDBCBridgeHelper) must provide custom URLs with custom paths. Side note 1: Previously, LibraryBridgeHelper sent HTTP URL parameter "method=ping" during handshake (PING) but the library-bridge ignored that parameter. We now omit this parameter, i.e. LibraryBridgeHelper::PING was removed. Again, this makes things consistent with xdbc-bridge. Side note 2: xdbc-bridge is unchanged in this commit. Therefore, XDBCBridgeHelper now uses the HTTP paths previously in the base class. For some funny reason, XDBCBridgeHelper did not use IBridgeHelper::getMainURI() - it generates the URLs by itself. I kept it that way for now but provided an implementation of getMainURI() anyways.	2022-08-04 19:29:51 +00:00
Robert Schulze	ea73b98fb9	Prepare library-bridge for catboost integration - Rename generic file and identifier names in library-bridge to something more dictionary-specific. This is needed because later on, catboost will be integrated into library-bridge. - Also: Some smaller fixes like typos and un-inlining non-performance critical code. - The logic remains unchanged in this commit.	2022-08-04 19:26:51 +00:00
Robert Schulze	1a7727a254	Prefix overridden add_executable() command with "clickhouse_" A simple HelloWorld program with zero includes except iostream triggers a build of ca. 2000 source files. The reason is that ClickHouse's top-level CMakeLists.txt overrides "add_executable()" to link all binaries against "clickhouse_new_delete". This links against "clickhouse_common_io", which in turn has lots of 3rd party library dependencies ... Without linking "clickhouse_new_delete", the number of compiled files for "HelloWorld" goes down to ca. 70. As an example, the self-extracting-executable needs none of its current dependencies but other programs may also benefit. In order to restore access to the original "add_executable()", the overriding version is now prefixed. There is precedent for a "clickhouse_" prefix (as opposed to "ch_"), for example "clickhouse_split_debug_symbols". In general prefixing makes sense also because overriding CMake commands relies on undocumented behavior and is considered not-so-great practice (*). (*) https://crascit.com/2018/09/14/do-not-redefine-cmake-commands/	2022-07-11 19:36:18 +02:00
Robert Schulze	bb358617e1	Better naming for stuff related to split debug symbols The previous name was slightly misleading, e.g. it is not about "installing stripped binaries" but about splitting debug symbols from the binary.	2022-06-30 23:41:27 +02:00
Robert Schulze	5a4f21c50f	Support for Clang Thread Safety Analysis (TSA) - TSA is a static analyzer built by Google which finds race conditions and deadlocks at compile time. - It works by associating a shared member variable with a synchronization primitive that protects it. The compiler can then check at each access if proper locking happened before. Good introductions are [0] and [1]. - TSA requires some help by the programmer via annotations. Luckily, LLVM's libcxx already has annotations for std::mutex, std::lock_guard, std::shared_mutex and std::scoped_lock. This commit enables them (--> contrib/libcxx-cmake/CMakeLists.txt). - Further, this commit adds convenience macros for the low-level annotations for use in ClickHouse (--> base/defines.h). For demonstration, they are leveraged in a few places. - As we compile with "-Wall -Wextra -Weverything", the required compiler flag "-Wthread-safety-analysis" was already enabled. Negative checks are an experimental feature of TSA and disabled (--> cmake/warnings.cmake). Compile times did not increase noticeably. - TSA is used in a few places with simple locking. I tried TSA also where locking is more complex. The problem was usually that it is unclear which data is protected by which lock :-(. But there was definitely some weird code where locking looked broken. So there is some potential to find bugs. *** Limitations of TSA besides the ones listed in [1]: - The programmer needs to know which lock protects which piece of shared data. This is not always easy for large classes. - Two synchronization primitives used in ClickHouse are not annotated in libcxx: (1) std::unique_lock: A releasable lock handle often used together with std::condition_variable, e.g. to solve producer-consumer problems. (2) std::recursive_mutex: A re-entrant mutex variant. Its usage can be considered a design flaw + typically it is slower than a standard mutex. 
In this commit, one std::recursive_mutex was converted to std::mutex and annotated with TSA. - For free-standing functions (e.g. helper functions) which are passed shared data members, it can be tricky to specify the associated lock. This is because the annotations use the normal C++ rules for symbol resolution. [0] https://clang.llvm.org/docs/ThreadSafetyAnalysis.html [1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42958.pdf	2022-06-20 16:13:25 +02:00
Amos Bird	4a5e4274f0	base should not depend on Common	2022-04-29 10:26:35 +08:00
Robert Schulze	118e94523c	Activate clang-tidy warning "readability-container-contains" This check suggests replacing <Container>.count() by <Container>.contains(), which is more expressive and in case of multimaps/multisets also faster.	2022-04-18 23:53:11 +02:00
alesapin	e790a73081	Simplify strip for new packages	2022-03-23 15:14:30 +01:00
Mikhail f. Shiryaev	1d362796df	Fix strip bug	2022-03-22 11:10:02 +01:00
Mikhail f. Shiryaev	fa2a9bb9aa	Separate BUILD_STRIPPED_BINARIES_PREFIX to option and parameter	2022-03-22 11:10:02 +01:00
alesapin	e53578910b	Add ability to strip binaries in cmake	2022-03-10 22:23:28 +01:00
Azat Khuzhin	3b3635c6d5	Fix formatting error in logging messages Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2022-02-01 14:30:04 +03:00
Azat Khuzhin	bedf208cbd	Use fmt::runtime() for LOG_* for non constexpr Here is a oneliner: $ gg 'LOG_$DEBUG\\|TRACE\\|INFO\\|TEST\\|WARNING\\|ERROR\\|FATAL$([^,], [a-zA-Z]' -- :.cpp :.h \| cut -d: -f1 \| sort -u \| xargs -r sed -E -i 's#(LOG_[A-Z])\(([^,]), ([A-Za-z][^,)])#\1(\2, fmt::runtime(\3)#' Note that I tried to do this with coccinelle (a tool for semantic patching), but it cannot parse C++: $ cat fmt.cocci @@ expression log; expression var; @@ -LOG_DEBUG(log, var) +LOG_DEBUG(log, fmt::runtime(var)) I've also tried to use some macros/templates magic to do this implicitly in logger_useful.h, but I failed to do so, and apparently it is not possible for now. Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com> v2: manual fixes Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2022-02-01 14:30:03 +03:00
Nikolai Kochetov	a08c98d760	Move some files.	2021-10-16 17:03:50 +03:00
Nikolai Kochetov	ab28c6c855	Remove BlockInputStream interfaces.	2021-10-14 13:25:43 +03:00
Nikolai Kochetov	ec18340351	Remove streams from formats.	2021-10-11 19:11:50 +03:00
Alexey Milovidov	fe6b7c77c7	Rename "common" to "base"	2021-10-02 10:13:14 +03:00
kssenii	294695bb7d	Review fixes	2021-08-02 13:40:58 +00:00
kssenii	9c6a8b0059	Restore previous ids passing	2021-08-01 08:59:19 +00:00
kssenii	130253e3b9	Fix bridge-server interaction in case of metadata inconsistency	2021-08-01 08:59:16 +00:00
Raúl Marín	2442216472	Fix style too	2021-07-28 11:39:53 +02:00
kssenii	3fe5e8d1ce	Fix	2021-07-28 08:30:58 +00:00
kssenii	6c220c8b35	Fix ids parsing	2021-07-27 20:54:21 +00:00
Nikolai Kochetov	d03bcebc8e	Remove debug logging.	2021-07-23 12:05:42 +03:00
Nikolai Kochetov	f38de35b14	Rename some constants.	2021-07-21 19:13:17 +03:00
alexey-milovidov	04be5437d9	Merge pull request #25296 from abyss7/http-issues Add settings for HTTP header limitations	2021-06-17 01:50:48 +03:00
Maksim Kita	67e9b85951	Merge ext into common	2021-06-16 23:28:41 +03:00
Ivan Lezhankin	ba08a580f8	Add test	2021-06-16 17:33:14 +03:00
Ivan Lezhankin	b182d87d9c	Add settings for HTTP header limitations	2021-06-15 17:33:46 +03:00
Alexey Milovidov	e905883c75	More fixes for PVS-Studio	2021-05-08 19:12:31 +03:00
Maksim Kita	dcf41db1ae	Merge pull request #23048 from kitaisreal/library-dictionary-bridge-library-interface LibraryDictionary bridge library interface	2021-04-14 11:23:29 +03:00
Maksim Kita	616d7d19f8	LibraryDictionary bridge library interface	2021-04-13 22:53:36 +03:00
kssenii	7a287e6fe9	Merge branch 'master' of https://github.com/ClickHouse/ClickHouse into nanodbc	2021-04-11 21:36:08 +00:00
Ivan	495c6e03aa	Replace all Context references with std::weak_ptr (#22297) * Replace all Context references with std::weak_ptr * Fix shared context captured by value * Fix build * Fix Context with named sessions * Fix copy context * Fix gcc build * Merge with master and fix build * Fix gcc-9 build	2021-04-11 02:33:54 +03:00
kssenii	7a89948801	Fix	2021-04-07 07:16:50 +00:00
kssenii	4419a430cb	Less dependencies	2021-04-06 20:15:32 +00:00
kssenii	b629f5c64d	Add const	2021-04-05 14:15:10 +00:00
kssenii	89a2e94364	Fixes	2021-04-05 14:08:49 +00:00
kssenii	02c6332e86	Pass null values properly	2021-04-02 18:45:42 +03:00
kssenii	ab3caf7b3c	Add exception message	2021-03-26 16:16:31 +00:00
kssenii	6d41f7356b	Better way to pass attributes	2021-03-24 19:32:31 +00:00
kssenii	7da36be9b6	Better	2021-03-24 09:23:29 +00:00
kssenii	c008f054ae	Pass sample_block only once	2021-03-24 08:41:42 +00:00
kssenii	1ef3c1f780	Use binary format for params	2021-03-24 07:55:21 +00:00
kssenii	b8d9b97903	Better	2021-03-23 15:43:14 +00:00
kssenii	e877402406	Better	2021-03-22 15:58:20 +00:00
kssenii	472ce89b75	Small fixes	2021-03-22 14:39:17 +00:00
kssenii	4ba83aa87a	Small improvement	2021-03-17 08:20:14 +00:00
kssenii	323fb54a8e	Fix split build finally	2021-03-12 21:12:34 +00:00
kssenii	02ed33936a	Fix split build	2021-03-12 12:54:49 +00:00
kssenii	4d29241d5a	Try fix build	2021-03-10 21:14:09 +00:00
kssenii	e77ff68582	Better	2021-03-10 18:26:15 +00:00
kssenii	38f7f37468	Add loadKeys method	2021-03-10 13:23:07 +00:00
kssenii	8a9fa561c7	Fix style check, build, ya check	2021-03-07 20:07:42 +00:00
kssenii	b7b15fe920	Remove unused	2021-03-07 15:23:20 +00:00
kssenii	9e9bf2bb75	Common base for bridges part 3	2021-03-07 13:55:40 +00:00
kssenii	1c4d4c8e54	Better handlers	2021-03-07 11:31:55 +00:00
kssenii	61d8e27ea7	Common base for bridges part 2	2021-03-07 11:21:49 +00:00
kssenii	f83a5d83a2	Better	2021-03-06 18:44:40 +00:00
kssenii	94af06588e	Make libraries storage as singleton	2021-03-06 18:38:01 +00:00
kssenii	2c080da51b	Better	2021-03-05 15:37:43 +00:00
kssenii	e0cda1033a	More methods	2021-03-05 10:43:47 +00:00
kssenii	dd4a7b6e3d	First version	2021-03-05 10:19:01 +00:00
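
The handler selection and version check described in commits 20bb8a248e, e0d5020a92 and 810221baf2 can be sketched roughly as below. This is an illustration only, not the code from the repository: HandlerKind, selectHandler() and versionAccepted() are hypothetical names, and the request verb is a parameter because the log names PUT in one commit and POST in another. The paths are the four listed in 20bb8a248e.

```cpp
#include <optional>
#include <string_view>

// Hypothetical handler kinds, one per verb/path combination from the log.
enum class HandlerKind
{
    ExtDictPing,
    ExtDictRequest,
    CatBoostPing,
    CatBoostRequest,
};

// Choose a handler from the HTTP verb and the HTTP path.
// request_verb is an assumption: the commit messages mention both PUT and POST.
std::optional<HandlerKind> selectHandler(
    std::string_view verb, std::string_view path, std::string_view request_verb = "POST")
{
    if (verb == "GET")
    {
        if (path == "/extdict_ping")
            return HandlerKind::ExtDictPing;
        if (path == "/catboost_ping")
            return HandlerKind::CatBoostPing;
    }
    else if (verb == request_verb)
    {
        if (path == "/extdict_request")
            return HandlerKind::ExtDictRequest;
        if (path == "/catboost_request")
            return HandlerKind::CatBoostRequest;
    }
    return std::nullopt; // unknown combination: no handler
}

// Version check on the bridge side: pings are never checked, regular requests
// must match exactly, and a missing "version" parameter is treated as 0.
bool versionAccepted(HandlerKind kind, std::optional<unsigned> sent_version, unsigned bridge_version)
{
    if (kind == HandlerKind::ExtDictPing || kind == HandlerKind::CatBoostPing)
        return true;
    return sent_version.value_or(0) == bridge_version;
}
```

With these definitions, selectHandler("GET", "/catboost_ping") yields the ping handler, and versionAccepted() rejects a regular request that carries no version whenever the bridge version is non-zero.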


119 Commits