---
slug: /en/operations/monitoring
sidebar_position: 45
sidebar_label: Monitoring
description: You can monitor the utilization of hardware resources and also ClickHouse server metrics.
---
# Monitoring

import SelfManaged from '@site/docs/en/_snippets/_self_managed_only_automated.md';

<SelfManaged />

You can monitor:
- Utilization of hardware resources.
- ClickHouse server metrics.
## Built-in observability dashboard
<img width="400" alt="Screenshot 2023-11-12 at 6 08 58 PM" src="https://github.com/ClickHouse/ClickHouse/assets/3936029/2bd10011-4a47-4b94-b836-d44557c7fdc1" />

ClickHouse comes with a built-in observability dashboard, available at `$HOST:$PORT/dashboard` (requires a user and password), which shows the following metrics:
- Queries/second
- CPU usage (cores)
- Queries running
- Merges running
- Selected bytes/second
- IO wait
- CPU wait
- OS CPU Usage (userspace)
- OS CPU Usage (kernel)
- Read from disk
- Read from filesystem
- Memory (tracked)
- Inserted rows/second
- Total MergeTree parts
- Max parts for partition
## Resource Utilization {#resource-utilization}
ClickHouse also monitors the state of hardware resources by itself, such as:
- Load and temperature on processors.
- Utilization of the storage system, RAM, and network.

This data is collected in the `system.asynchronous_metric_log` table.
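
As a rough illustration of reading this table, the sketch below builds a query for the recent history of one metric and sends it over the ClickHouse HTTP interface. It assumes the table exposes `metric`, `value`, and `event_time` columns, that a metric named `LoadAverage1` exists on your server, and that the HTTP interface listens at a URL such as `http://localhost:8123/`. The function names are hypothetical helpers, not part of ClickHouse.

```python
import json
import re
import urllib.request


def build_metric_history_query(metric: str, minutes: int = 10) -> str:
    """Build a query for recent values of one asynchronous metric.

    Column names (metric, value, event_time) are assumptions; check them
    against your server version before relying on this.
    """
    # Reject anything that is not a plain identifier, so the metric name
    # can be embedded in the SQL text safely.
    if not re.fullmatch(r"[A-Za-z0-9_.]+", metric):
        raise ValueError(f"unexpected metric name: {metric!r}")
    return (
        "SELECT event_time, value "
        "FROM system.asynchronous_metric_log "
        f"WHERE metric = '{metric}' "
        f"AND event_time > now() - INTERVAL {int(minutes)} MINUTE "
        "ORDER BY event_time "
        "FORMAT JSONEachRow"
    )


def run_query(base_url: str, sql: str, user: str = "default", password: str = "") -> list:
    """POST a query to the HTTP interface and parse JSONEachRow output."""
    request = urllib.request.Request(
        base_url,
        data=sql.encode("utf-8"),
        headers={"X-ClickHouse-User": user, "X-ClickHouse-Key": password},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        text = response.read().decode("utf-8")
    # JSONEachRow returns one JSON object per line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

For example, `run_query("http://localhost:8123/", build_metric_history_query("LoadAverage1"))` would return one dictionary per logged sample, provided the server is reachable and the credentials are valid.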
## ClickHouse Server Metrics {#clickhouse-server-metrics}
The ClickHouse server has built-in instruments for monitoring its own state.

To track server events, use server logs. See the [logger](../operations/server-configuration-parameters/settings.md#server_configuration_parameters-logger) section of the configuration file.

ClickHouse collects:
- Different metrics of how the server uses computational resources.
- Common statistics on query processing.

You can find metrics in the [system.metrics](../operations/system-tables/metrics.md#system_tables-metrics), [system.events](../operations/system-tables/events.md#system_tables-events), and [system.asynchronous_metrics](../operations/system-tables/asynchronous_metrics.md#system_tables-asynchronous_metrics) tables.
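
One way to take a combined snapshot of these three tables is sketched below. It assumes the name columns are `metric` (in `system.metrics` and `system.asynchronous_metrics`) and `event` (in `system.events`), each paired with a `value` column. The query text and the `parse_snapshot` helper are illustrative only; the query can be sent through `clickhouse-client` or the HTTP interface.

```python
import json

# A single query that tags each row with the table it came from.
# toFloat64 gives all three branches the same value type.
SNAPSHOT_QUERY = (
    "SELECT 'metrics' AS source, metric AS name, toFloat64(value) AS value "
    "FROM system.metrics "
    "UNION ALL "
    "SELECT 'events' AS source, event AS name, toFloat64(value) AS value "
    "FROM system.events "
    "UNION ALL "
    "SELECT 'asynchronous_metrics' AS source, metric AS name, toFloat64(value) AS value "
    "FROM system.asynchronous_metrics "
    "FORMAT JSONEachRow"
)


def parse_snapshot(json_each_row: str) -> dict:
    """Turn JSONEachRow output into {(source, name): value}."""
    snapshot = {}
    for line in json_each_row.splitlines():
        if not line.strip():
            continue  # skip blank lines between or after rows
        row = json.loads(line)
        snapshot[(row["source"], row["name"])] = float(row["value"])
    return snapshot
```

Keying by `(source, name)` keeps entries apart if the same name happens to appear in more than one table.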

You can configure ClickHouse to export metrics to [Graphite](https://github.com/graphite-project). See the [Graphite section](../operations/server-configuration-parameters/settings.md#server_configuration_parameters-graphite) in the ClickHouse server configuration file. Before configuring export of metrics, you should set up Graphite by following their official [guide](https://graphite.readthedocs.io/en/latest/install.html).

You can configure ClickHouse to export metrics to [Prometheus](https://prometheus.io). See the [Prometheus section](../operations/server-configuration-parameters/settings.md#server_configuration_parameters-prometheus) in the ClickHouse server configuration file. Before configuring export of metrics, you should set up Prometheus by following their official [guide](https://prometheus.io/docs/prometheus/latest/installation/).

Additionally, you can monitor server availability through the HTTP API. Send an `HTTP GET` request to `/ping`. If the server is available, it responds with `200 OK`.
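
A minimal availability probe along these lines might look as follows. The base URL (for example `http://localhost:8123`) and the helper names are assumptions for illustration; only the `/ping` path and the `200 OK` response come from the description above.

```python
import urllib.error
import urllib.request


def ping_url(base_url: str) -> str:
    """Return the /ping URL for a server base URL."""
    return base_url.rstrip("/") + "/ping"


def is_server_available(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET /ping answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, DNS failures, and
        # non-2xx responses (HTTPError is a subclass of URLError).
        return False
```

Treating every failure as "unavailable" keeps the probe simple. A monitoring system that needs to tell a timeout apart from a refused connection would have to inspect the exception instead.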

To monitor servers in a cluster configuration, you should set the [max_replica_delay_for_distributed_queries](../operations/settings/settings.md#settings-max_replica_delay_for_distributed_queries) parameter and use the HTTP resource `/replicas_status`. A request to `/replicas_status` returns `200 OK` if the replica is available and is not delayed behind the other replicas. If a replica is delayed, it returns `503 HTTP_SERVICE_UNAVAILABLE` with information about the gap.
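
The replica check can be sketched in the same way. The helper below is hypothetical. It returns a flag together with the response body, so that the delay information sent with a `503` response is not lost. The exact format of that body is not specified here, so it is passed through as plain text.

```python
import urllib.error
import urllib.request
from typing import Tuple


def check_replica(base_url: str, timeout: float = 2.0) -> Tuple[bool, str]:
    """Query /replicas_status and return (is_healthy, details)."""
    url = base_url.rstrip("/") + "/replicas_status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip()
            return response.status == 200, body
    except urllib.error.HTTPError as error:
        # A delayed replica answers with 503; the body describes the gap.
        body = error.read().decode("utf-8", errors="replace").strip()
        return False, body
    except (urllib.error.URLError, OSError) as error:
        # The server could not be reached at all.
        return False, f"unreachable: {error}"
```

A caller could run `check_replica` against each replica's base URL and alert on any `False` result, logging the returned details.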