Applying a CatBoost Model in ClickHouse
CatBoost is a free and open-source gradient boosting library developed at Yandex for machine learning.
This guide shows how to apply pre-trained models in ClickHouse, so that you can run model inference directly from SQL.
To apply a CatBoost model in ClickHouse:
- Create a Table.
- Insert the Data to the Table.
- Configure the Model.
- Run the Model Inference from SQL.
For more information about training CatBoost models, see Training and applying models.
Prerequisites
If you do not have Docker yet, install it.
!!! note "Note" Docker is a software platform that allows you to create containers that isolate a CatBoost and ClickHouse installation from the rest of the system.
Before applying a CatBoost model:
1. Pull the Docker image from the registry:
$ docker pull yandex/tutorial-catboost-clickhouse
This Docker image contains everything you need to run CatBoost and ClickHouse: code, runtime, libraries, environment variables, and configuration files.
2. Make sure the Docker image has been successfully pulled:
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
yandex/tutorial-catboost-clickhouse latest 3e5ad9fae997 19 months ago 1.58GB
3. Start a Docker container based on this image:
$ docker run -it -p 8888:8888 yandex/tutorial-catboost-clickhouse
4. Install packages:
$ sudo apt-get install dirmngr
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
$ sudo apt-get update
$ sudo apt-get install -y clickhouse-server clickhouse-client
$ sudo service clickhouse-server start
$ clickhouse-client
For more information, see Quick Start.
1. Create a Table
To create a ClickHouse table for the train sample:
1. Start a ClickHouse client:
$ clickhouse client
!!! note "Note" The ClickHouse server is already running inside the Docker container.
2. Create the table using the command:
:) CREATE TABLE amazon_train
(
date Date MATERIALIZED today(),
ACTION UInt8,
RESOURCE UInt32,
MGR_ID UInt32,
ROLE_ROLLUP_1 UInt32,
ROLE_ROLLUP_2 UInt32,
ROLE_DEPTNAME UInt32,
ROLE_TITLE UInt32,
ROLE_FAMILY_DESC UInt32,
ROLE_FAMILY UInt32,
ROLE_CODE UInt32
)
ENGINE = MergeTree(date, date, 8192)
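The `MergeTree(date, date, 8192)` clause above uses the legacy engine syntax (date column, primary key, index granularity). On current ClickHouse versions this form is deprecated; an equivalent modern definition would look like this (a sketch, not part of the original tutorial):

```sql
-- Modern equivalent of the legacy MergeTree(date, date, 8192) clause:
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)  -- legacy syntax partitioned by month of the date column
ORDER BY date                -- the second legacy argument was the primary key
SETTINGS index_granularity = 8192
```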
2. Insert the Data into the Table
To insert the data:
1. Exit the ClickHouse client:
:) exit
2. Upload the data:
$ clickhouse client --host 127.0.0.1 --query 'INSERT INTO amazon_train FORMAT CSVWithNames' < ~/amazon/train.csv
3. Make sure the data has been uploaded:
$ clickhouse client
:) SELECT count() FROM amazon_train
SELECT count()
FROM amazon_train
┌─count()─┐
│   32769 │
└─────────┘
3. Configure the Model
This step is optional: the Docker container contains all configuration files.
Create a config file in the `models` folder (for example, `models/config_model.xml`) with the model configuration:
<models>
<model>
<!-- Model type. Now catboost only. -->
<type>catboost</type>
<!-- Model name. -->
<name>amazon</name>
<!-- Path to trained model. -->
<path>/home/catboost/tutorial/catboost_model.bin</path>
<!-- Update interval. -->
<lifetime>0</lifetime>
</model>
</models>
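If you edit this file by hand, a few lines of Python can sanity-check that it parses and contains the fields ClickHouse expects (a sketch; the config contents are the ones shown above, inlined here so the snippet is self-contained):

```python
import xml.etree.ElementTree as ET

# The same configuration as in the file above, inlined for illustration.
config = """
<models>
    <model>
        <type>catboost</type>
        <name>amazon</name>
        <path>/home/catboost/tutorial/catboost_model.bin</path>
        <lifetime>0</lifetime>
    </model>
</models>
"""

root = ET.fromstring(config)  # or ET.parse("models/config_model.xml").getroot()
for model in root.findall("model"):
    # Every <model> entry needs a type, a unique name, and a path to the .bin file.
    print(model.findtext("type"), model.findtext("name"), model.findtext("path"))
```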
!!! note "Note" To show the contents of the config file in the Docker container, run `cat models/amazon_model.xml`.
The ClickHouse config file should already have this setting:
// ../../etc/clickhouse-server/config.xml
<models_config>/home/catboost/models/*_model.xml</models_config>
To check it, run `tail ../../etc/clickhouse-server/config.xml`.
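The `*_model.xml` pattern means that every file in `/home/catboost/models` whose name ends in `_model.xml` is loaded, so both `config_model.xml` and `amazon_model.xml` would be picked up. A quick illustration of which names match (the file names here are hypothetical):

```python
from fnmatch import fnmatch

# Hypothetical contents of the models folder.
names = ["amazon_model.xml", "config_model.xml", "notes.txt", "model.xml"]

# ClickHouse loads every file matching the *_model.xml glob.
matched = [n for n in names if fnmatch(n, "*_model.xml")]
print(matched)  # model.xml does not match: it lacks the "_model" suffix
```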
4. Run the Model Inference from SQL
To test, start the ClickHouse client:
$ clickhouse client
Let's make sure that the model is working:
:) SELECT
modelEvaluate('amazon',
RESOURCE,
MGR_ID,
ROLE_ROLLUP_1,
ROLE_ROLLUP_2,
ROLE_DEPTNAME,
ROLE_TITLE,
ROLE_FAMILY_DESC,
ROLE_FAMILY,
ROLE_CODE) > 0 AS prediction,
ACTION AS target
FROM amazon_train
LIMIT 10
!!! note "Note" The modelEvaluate function returns a tuple with per-class raw predictions for multiclass models.
Let's predict probability:
:) SELECT
modelEvaluate('amazon',
RESOURCE,
MGR_ID,
ROLE_ROLLUP_1,
ROLE_ROLLUP_2,
ROLE_DEPTNAME,
ROLE_TITLE,
ROLE_FAMILY_DESC,
ROLE_FAMILY,
ROLE_CODE) AS prediction,
1. / (1 + exp(-prediction)) AS probability,
ACTION AS target
FROM amazon_train
LIMIT 10
!!! note "Note" More information about the exp() function.
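The query above applies the standard logistic sigmoid, 1 / (1 + exp(-x)), to the raw CatBoost score. The same conversion in Python, with made-up raw scores for illustration:

```python
import math

def sigmoid(raw_score: float) -> float:
    """Convert a raw CatBoost margin into a probability: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-raw_score))

# A raw score of 0 sits exactly on the decision boundary (probability 0.5),
# which is why the earlier query used `prediction > 0` as the class threshold.
for raw in (-2.0, 0.0, 2.0):
    print(round(sigmoid(raw), 4))
```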
Let's calculate LogLoss on the sample:
:) SELECT -avg(tg * log(prob) + (1 - tg) * log(1 - prob)) AS logloss
FROM
(
SELECT
modelEvaluate('amazon',
RESOURCE,
MGR_ID,
ROLE_ROLLUP_1,
ROLE_ROLLUP_2,
ROLE_DEPTNAME,
ROLE_TITLE,
ROLE_FAMILY_DESC,
ROLE_FAMILY,
ROLE_CODE) AS prediction,
1. / (1. + exp(-prediction)) AS prob,
ACTION AS tg
FROM amazon_train
)
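The query computes the standard binary log loss, -avg(tg * log(prob) + (1 - tg) * log(1 - prob)). The same computation in Python on a few hypothetical (target, probability) pairs, not the amazon_train data:

```python
import math

def log_loss(targets, probs):
    """Binary log loss: -mean(t*log(p) + (1-t)*log(1-p)), matching the SQL above."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(targets, probs))
    return -total / len(targets)

# Hypothetical labels and predicted probabilities for illustration.
targets = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.8, 0.6]
print(round(log_loss(targets, probs), 4))
```

Lower values are better; a model that always predicted probability 0.5 would score log(2) ≈ 0.693 regardless of the labels.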