mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-12-01 03:52:15 +00:00
Docs: Update vector search docs
parent
6c1e44a82d
commit
46a3e3e795
@ -1,42 +1,47 @@
|
|||||||
# Approximate Nearest Neighbor Search Indexes [experimental]
|
# Approximate Nearest Neighbor Search Indexes [experimental]
|
||||||
|
|
||||||
Nearest neighborhood search is the problem of finding the M closest points for a given point in an N-dimensional vector space. The most
|
Nearest neighbor search is the problem of finding the M closest vectors to a given vector in an N-dimensional vector space. The most
|
||||||
straightforward approach to solve this problem is a brute force search where the distance between all points in the vector space and the
|
straightforward approach to solve this problem is an exhaustive (brute-force) search which computes the distance between the reference
|
||||||
reference point is computed. This method guarantees perfect accuracy, but it is usually too slow for practical applications. Thus, nearest
|
vector and all other points in the vector space. While this method guarantees a perfectly accurate result, it is usually too slow for
|
||||||
neighborhood search problems are often solved with [approximative algorithms](https://github.com/erikbern/ann-benchmarks). Approximative
|
practical applications. As an alternative, [approximate algorithms](https://github.com/erikbern/ann-benchmarks) use greedy heuristics to
|
||||||
nearest neighborhood search techniques, in conjunction with [embedding
|
find the M closest vectors much faster. This enables semantic search over picture, song, and text
|
||||||
methods](https://cloud.google.com/architecture/overview-extracting-and-serving-feature-embeddings-for-machine-learning) allow to search huge
|
[embeddings](https://cloud.google.com/architecture/overview-extracting-and-serving-feature-embeddings-for-machine-learning) in milliseconds.
|
||||||
amounts of media (pictures, songs, articles, etc.) in milliseconds.
|
|
||||||
|
|
||||||
Blogs:
|
Blogs:
|
||||||
- [Vector Search with ClickHouse - Part 1](https://clickhouse.com/blog/vector-search-clickhouse-p1)
|
- [Vector Search with ClickHouse - Part 1](https://clickhouse.com/blog/vector-search-clickhouse-p1)
|
||||||
- [Vector Search with ClickHouse - Part 2](https://clickhouse.com/blog/vector-search-clickhouse-p2)
|
- [Vector Search with ClickHouse - Part 2](https://clickhouse.com/blog/vector-search-clickhouse-p2)
|
||||||
|
|
||||||
|
|
||||||
In terms of SQL, the nearest neighborhood problem can be expressed as follows:
|
In terms of SQL, a nearest neighbor search can be expressed as follows:
|
||||||
|
|
||||||
``` sql
|
``` sql
|
||||||
SELECT *
|
SELECT [...]
|
||||||
FROM table
|
FROM table, [...]
|
||||||
ORDER BY Distance(vectors, Point)
|
ORDER BY DistanceFunction(vectors, reference_vector)
|
||||||
LIMIT N
|
LIMIT N
|
||||||
```
|
```
|
||||||
|
|
||||||
`vectors` contains N-dimensional values of type [Array(Float32)](../../../sql-reference/data-types/array.md) or Array(Float64), for example
|
where
|
||||||
embeddings. Function `Distance` computes the distance between two vectors. Often, the Euclidean (L2) distance is chosen as distance function
|
- `DistanceFunction` computes a distance between two vectors (e.g. the
|
||||||
but [other distance functions](/docs/en/sql-reference/functions/distance-functions.md) are also possible. `Point` is the reference point,
|
[L2Distance](../../../sql-reference/functions/distance-functions.md#L2Distance) or
|
||||||
e.g. `(0.17, 0.33, ...)`, and `N` limits the number of search results.
|
[cosineDistance](../../../sql-reference/functions/distance-functions.md#cosineDistance)),
|
||||||
|
- `vectors` is a column of type [Array(Float64)](../../../sql-reference/data-types/array.md) or
|
||||||
|
[Array(Float32)](../../../sql-reference/data-types/array.md), or [Array(BFloat16)](../../../sql-reference/data-types/array.md), typically
|
||||||
|
storing embeddings,
|
||||||
|
- `reference_vector` is a literal of type [Array(Float64)](../../../sql-reference/data-types/array.md) or
|
||||||
|
[Array(Float32)](../../../sql-reference/data-types/array.md), or [Array(BFloat16)](../../../sql-reference/data-types/array.md), and
|
||||||
|
- `N` is a constant integer restricting the number of returned results.
|
||||||
|
|
||||||
This query returns the top-`N` closest points to the reference point. Parameter `N` limits the number of returned values which is useful for
|
The query returns the `N` closest points in `vectors` to `reference_vector`.
|
||||||
situations where `MaxDistance` is difficult to determine in advance.
|
|
||||||
|
|
||||||
With brute force search, the query is expensive (linear in the number of points) because the distance between all points in `vectors` and
|
Exhaustive search computes the distance between `reference_vector` and all vectors in `vectors`. As such, its runtime is linear in the
|
||||||
`Point` must be computed. To speed this process up, Approximate Nearest Neighbor Search Indexes (ANN indexes) store a compact representation
|
number of stored vectors. Approximate search relies on special data structures (e.g. graphs, random forests, etc.) which make it possible to find the
|
||||||
of the search space (using clustering, search trees, etc.) which allows to compute an approximate answer much quicker (in sub-linear time).
|
closest vectors to a given reference vector quickly (i.e. in sub-linear time). ClickHouse provides such a data structure in the form of
|
||||||
|
"vector similarity indexes", a type of [skipping index](mergetree.md#table_engine-mergetree-data_skipping-indexes).
|
||||||
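As a minimal sketch of the exhaustive search described above, the generic query can be run against a small table. The table name `tab`, its columns, and the inserted values are made up for illustration; only the query shape and the `L2Distance` function come from the text.

```sql
-- Hypothetical table and data, for illustration only.
CREATE TABLE tab
(
    id Int32,
    vectors Array(Float32)
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO tab VALUES (0, [1.0, 0.0]), (1, [1.1, 0.0]), (2, [1.2, 0.0]), (3, [0.0, 2.0]);

-- Without a vector similarity index this is an exhaustive search:
-- the distance to the reference vector is computed for every row.
WITH [1.0, 0.0] AS reference_vector
SELECT id, L2Distance(vectors, reference_vector) AS dist
FROM tab
ORDER BY dist ASC
LIMIT 2;
```

The cost of this query grows linearly with the number of rows in `tab`, which is the motivation for the index discussed next.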
|
|
||||||
# Creating and Using Vector Similarity Indexes
|
# Creating and Using Vector Similarity Indexes
|
||||||
|
|
||||||
Syntax to create a vector similarity index over an [Array(Float32)](../../../sql-reference/data-types/array.md) column:
|
Syntax to create a vector similarity index:
|
||||||
|
|
||||||
```sql
|
```sql
|
||||||
CREATE TABLE table
|
CREATE TABLE table
|
||||||
@ -49,19 +54,26 @@ ENGINE = MergeTree
|
|||||||
ORDER BY id;
|
ORDER BY id;
|
||||||
```
|
```
|
||||||
|
|
||||||
Parameters:
|
:::note
|
||||||
- `method`: Supports currently only `hnsw`.
|
USearch indexes are currently experimental, to use them you first need to `SET allow_experimental_vector_similarity_index = 1`.
|
||||||
- `distance_function`: either `L2Distance` (the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) - the length of a
|
:::
|
||||||
line between two points in Euclidean space), or `cosineDistance` (the [cosine
|
|
||||||
distance](https://en.wikipedia.org/wiki/Cosine_similarity#Cosine_distance)- the angle between two non-zero vectors).
|
The index can be built on a column of type [Array(Float64)](../../../sql-reference/data-types/array.md),
|
||||||
|
[Array(Float32)](../../../sql-reference/data-types/array.md), or [Array(BFloat16)](../../../sql-reference/data-types/array.md).
|
||||||
|
|
||||||
|
Index parameters:
|
||||||
|
- `method`: Currently only `hnsw` is supported.
|
||||||
|
- `distance_function`: either `L2Distance` (the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance): the length of a line
|
||||||
|
between two points in Euclidean space), or `cosineDistance` (the [cosine
|
||||||
|
distance](https://en.wikipedia.org/wiki/Cosine_similarity#Cosine_distance): the angle between two non-zero vectors).
|
||||||
- `quantization`: either `f64`, `f32`, `f16`, `bf16`, or `i8` for storing vectors with reduced precision (optional, default: `bf16`)
|
- `quantization`: either `f64`, `f32`, `f16`, `bf16`, or `i8` for storing vectors with reduced precision (optional, default: `bf16`)
|
||||||
- `hnsw_max_connections_per_layer`: the number of neighbors per HNSW graph node, also known as `M` in the [HNSW
|
- `hnsw_max_connections_per_layer`: the number of neighbors per HNSW graph node, also known as `M` in the [HNSW
|
||||||
paper](https://doi.org/10.1109/TPAMI.2018.2889473) (optional, default: 32)
|
paper](https://doi.org/10.1109/TPAMI.2018.2889473). Optional, default: `32`. Value `0` means using the default value.
|
||||||
- `hnsw_candidate_list_size_for_construction`: the size of the dynamic candidate list when constructing the HNSW graph, also known as
|
- `hnsw_candidate_list_size_for_construction`: the size of the dynamic candidate list when constructing the HNSW graph, also known as
|
||||||
`ef_construction` in the original [HNSW paper](https://doi.org/10.1109/TPAMI.2018.2889473) (optional, default: 128)
|
`ef_construction` in the original [HNSW paper](https://doi.org/10.1109/TPAMI.2018.2889473). Optional, default: `128`. Value 0 means using
|
||||||
|
the default value.
|
||||||
|
|
||||||
Values 0 for parameters `hnsw_max_connections_per_layer` and `hnsw_candidate_list_size_for_construction` means using the default values of
|
For normalized data, `L2Distance` is usually the best choice, otherwise `cosineDistance` is recommended to compensate for scale.
|
||||||
these parameters.
|
|
||||||
|
|
||||||
Example:
|
Example:
|
||||||
|
|
||||||
@ -76,53 +88,31 @@ ENGINE = MergeTree
|
|||||||
ORDER BY id;
|
ORDER BY id;
|
||||||
```
|
```
|
||||||
|
|
||||||
Vector similarity indexes are based on the [USearch library](https://github.com/unum-cloud/usearch), which implements the [HNSW
|
|
||||||
algorithm](https://arxiv.org/abs/1603.09320), i.e., a hierarchical graph where each point represents a vector and the edges represent
|
|
||||||
similarity. Such hierarchical structures can be very efficient on large collections. They may often fetch 0.05% or less data from the
|
|
||||||
overall dataset, while still providing 99% recall. This is especially useful when working with high-dimensional vectors, that are expensive
|
|
||||||
to load and compare. The library also has several hardware-specific SIMD optimizations to accelerate further distance computations on modern
|
|
||||||
Arm (NEON and SVE) and x86 (AVX2 and AVX-512) CPUs and OS-specific optimizations to allow efficient navigation around immutable persistent
|
|
||||||
files, without loading them into RAM.
|
|
||||||
|
|
||||||
USearch indexes are currently experimental, to use them you first need to `SET allow_experimental_vector_similarity_index = 1`.
|
|
||||||
|
|
||||||
Vector similarity indexes currently support two distance functions:
|
|
||||||
- `L2Distance`, also called Euclidean distance, is the length of a line segment between two points in Euclidean space
|
|
||||||
([Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)).
|
|
||||||
- `cosineDistance`, also called cosine similarity, is the cosine of the angle between two (non-zero) vectors
|
|
||||||
([Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity)).
|
|
||||||
|
|
||||||
Vector similarity indexes allows storing the vectors in reduced precision formats. Supported scalar kinds are `f64`, `f32`, `f16`, `bf16`,
|
|
||||||
and `i8`. If no scalar kind was specified during index creation, `bf16` is used as default.
|
|
||||||
|
|
||||||
For normalized data, `L2Distance` is usually a better choice, otherwise `cosineDistance` is recommended to compensate for scale. If no
|
|
||||||
distance function was specified during index creation, `L2Distance` is used as default.
|
|
||||||
|
|
||||||
:::note
|
|
||||||
All arrays must have same length. To avoid errors, you can use a
|
All arrays must have same length. To avoid errors, you can use a
|
||||||
[CONSTRAINT](/docs/en/sql-reference/statements/create/table.md#constraints), for example, `CONSTRAINT constraint_name_1 CHECK
|
[CONSTRAINT](/docs/en/sql-reference/statements/create/table.md#constraints), for example, `CONSTRAINT constraint_name_1 CHECK
|
||||||
length(vectors) = 256`. Also, empty `Arrays` and unspecified `Array` values in INSERT statements (i.e. default values) are not supported.
|
length(vectors) = 256`. Empty `Arrays` and unspecified `Array` values in INSERT statements (i.e. default values) are not supported either.
|
||||||
:::
|
|
||||||
|
|
||||||
:::note
|
Vector similarity indexes are based on the [USearch library](https://github.com/unum-cloud/usearch), which implements the [HNSW
|
||||||
The vector similarity index currently does not work with per-table, non-default `index_granularity` settings (see
|
algorithm](https://arxiv.org/abs/1603.09320), i.e., a hierarchical graph where each node represents a vector and the edges between nodes
|
||||||
[here](https://github.com/ClickHouse/ClickHouse/pull/51325#issuecomment-1605920475)). If necessary, the value must be changed in config.xml.
|
represent similarity. Such hierarchical structures can be very efficient on large collections. They may often fetch 0.05% or less data from
|
||||||
:::
|
the overall dataset, while still providing 99% recall. This is especially useful when working with high-dimensional vectors which are
|
||||||
|
expensive to load and compare. USearch also utilizes SIMD to accelerate distance computations on modern x86 (AVX2 and AVX-512) and ARM (NEON
|
||||||
|
and SVE) CPUs.
|
||||||
|
|
||||||
Vector index creation is known to be slow. To speed the process up, index creation can be parallelized. The maximum number of threads can be
|
Vector similarity indexes are built during column insertion and merge. The HNSW algorithm is known to provide slow inserts. As a result,
|
||||||
configured using server configuration
|
`INSERT` and `OPTIMIZE` statements on tables with a vector similarity index will be slower than for ordinary tables. Vector similarity indexes
|
||||||
setting [max_build_vector_similarity_index_thread_pool_size](../../../operations/server-configuration-parameters/settings.md#server_configuration_parameters_max_build_vector_similarity_index_thread_pool_size).
|
are ideally used only with immutable or rarely changed data, or when there are far more read requests than write requests. Three
|
||||||
|
additional techniques are recommended to speed up index creation:
|
||||||
|
- Index creation can be parallelized. The maximum number of threads can be configured using server setting
|
||||||
|
[max_build_vector_similarity_index_thread_pool_size](../../../operations/server-configuration-parameters/settings.md#server_configuration_parameters_max_build_vector_similarity_index_thread_pool_size).
|
||||||
|
- Index creation on newly inserted parts may be disabled using setting `materialize_skip_indexes_on_insert`. Search on such parts will fall
|
||||||
|
back to exact search but as inserted parts are typically small compared to the total table size, the performance impact is negligible.
|
||||||
|
- As parts are incrementally merged into bigger parts, and these new parts are merged into even bigger parts ("write amplification"),
|
||||||
|
vector similarity indexes are possibly built multiple times for the same vectors. To avoid that, you may suppress merges during insert
|
||||||
|
using statement [`SYSTEM STOP MERGES`](../../../sql-reference/statements/system.md), and restart merges once all data has been
|
||||||
|
inserted using `SYSTEM START MERGES`.
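The last two techniques can be combined roughly as follows. This is a sketch under assumptions: the table `tab` and its data are hypothetical, and the index definition assumes the index type is named `vector_similarity` and takes `method` and `distance_function` as its first two arguments, which may differ between ClickHouse versions.

```sql
-- Required while the feature is experimental (setting named in the text above).
SET allow_experimental_vector_similarity_index = 1;

-- Hypothetical table; the index type name and argument order are assumptions.
CREATE TABLE tab
(
    id Int32,
    vectors Array(Float32),
    INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance')
)
ENGINE = MergeTree
ORDER BY id;

-- Do not build skipping indexes on newly inserted parts.
SET materialize_skip_indexes_on_insert = 0;

-- Suppress background merges for this table while data is loaded.
SYSTEM STOP MERGES tab;

INSERT INTO tab VALUES (0, [1.0, 0.0]), (1, [1.1, 0.0]), (2, [1.2, 0.0]);

-- Allow merges again once all data has been inserted.
SYSTEM START MERGES tab;
```

With this ordering the index is not built on insert, so the expensive index construction is left to the merges that run after `SYSTEM START MERGES`.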
|
||||||
|
|
||||||
ANN indexes are built during column insertion and merge. As a result, `INSERT` and `OPTIMIZE` statements will be slower than for ordinary
|
Vector similarity indexes support this type of query:
|
||||||
tables. ANNIndexes are ideally used only with immutable or rarely changed data, respectively when are far more read requests than write
|
|
||||||
requests.
|
|
||||||
|
|
||||||
:::tip
|
|
||||||
To reduce the cost of building vector similarity indexes, consider setting `materialize_skip_indexes_on_insert` which disables the
|
|
||||||
construction of skipping indexes on newly inserted parts. Search would fall back to exact search but as inserted parts are typically small
|
|
||||||
compared to the total table size, the performance impact of that would be negligible.
|
|
||||||
|
|
||||||
ANN indexes support this type of query:
|
|
||||||
|
|
||||||
``` sql
|
``` sql
|
||||||
WITH [...] AS reference_vector
|
WITH [...] AS reference_vector
|
||||||
@ -134,44 +124,33 @@ LIMIT N
|
|||||||
SETTINGS enable_analyzer = 0; -- Temporary limitation, will be lifted
|
SETTINGS enable_analyzer = 0; -- Temporary limitation, will be lifted
|
||||||
```
|
```
|
||||||
|
|
||||||
:::tip
|
|
||||||
To avoid writing out large vectors, you can use [query
|
|
||||||
parameters](/docs/en/interfaces/cli.md#queries-with-parameters-cli-queries-with-parameters), e.g.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
clickhouse-client --param_vec='hello' --query="SELECT * FROM table WHERE L2Distance(vectors, {vec: Array(Float32)}) < 1.0"
|
|
||||||
```
|
|
||||||
:::
|
|
||||||
|
|
||||||
To search using a different value of HNSW parameter `hnsw_candidate_list_size_for_search` (default: 256), also known as `ef_search` in the
|
To search using a different value of HNSW parameter `hnsw_candidate_list_size_for_search` (default: 256), also known as `ef_search` in the
|
||||||
original [HNSW paper](https://doi.org/10.1109/TPAMI.2018.2889473), run the `SELECT` query with `SETTINGS hnsw_candidate_list_size_for_search
|
original [HNSW paper](https://doi.org/10.1109/TPAMI.2018.2889473), run the `SELECT` query with `SETTINGS hnsw_candidate_list_size_for_search
|
||||||
= <value>`.
|
= <value>`.
|
||||||
|
|
||||||
**Restrictions**: Approximate algorithms used to determine the nearest neighbors require a limit, hence queries without `LIMIT` clause
|
**Restrictions**: Approximate vector search algorithms require a limit, hence queries without a `LIMIT` clause cannot utilize vector
|
||||||
cannot utilize ANN indexes. Also, ANN indexes are only used if the query has a `LIMIT` value smaller than setting
|
similarity indexes. The limit must also be smaller than setting `max_limit_for_ann_queries` (default: 100).
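For illustration, a query that satisfies these restrictions and overrides the search-time candidate list size might look like the sketch below. The table `tab`, the column `vectors`, the reference vector, and the value `512` are placeholders chosen for this example, not recommendations.

```sql
-- Assumes a hypothetical table `tab` with a column `vectors` of type Array(Float32)
-- that carries a vector similarity index using L2Distance.
WITH [0.17, 0.33] AS reference_vector
SELECT id
FROM tab
ORDER BY L2Distance(vectors, reference_vector)
LIMIT 10  -- a LIMIT is mandatory and must stay below max_limit_for_ann_queries
SETTINGS
    hnsw_candidate_list_size_for_search = 512,  -- placeholder value for ef_search
    enable_analyzer = 0;                        -- temporary limitation noted above
```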
|
||||||
`max_limit_for_ann_queries` (default: 1 million rows). This is a safeguard to prevent large memory allocations by external libraries for
|
|
||||||
approximate neighbor search.
|
|
||||||
|
|
||||||
**Differences to Skip Indexes** Similar to regular [skip indexes](https://clickhouse.com/docs/en/optimize/skipping-indexes), ANN indexes are
|
**Differences to Regular Skip Indexes** Similar to regular [skip indexes](https://clickhouse.com/docs/en/optimize/skipping-indexes), vector
|
||||||
constructed over granules and each indexed block consists of `GRANULARITY = [N]`-many granules (`[N]` = 1 by default for normal skip
|
similarity indexes are constructed over granules and each indexed block consists of `GRANULARITY = [N]`-many granules (`[N]` = 1 by default
|
||||||
indexes). For example, if the primary index granularity of the table is 8192 (setting `index_granularity = 8192`) and `GRANULARITY = 2`,
|
for normal skip indexes). For example, if the primary index granularity of the table is 8192 (setting `index_granularity = 8192`) and
|
||||||
then each indexed block will contain 16384 rows. However, data structures and algorithms for approximate neighborhood search (usually
|
`GRANULARITY = 2`, then each indexed block will contain 16384 rows. However, data structures and algorithms for approximate nearest neighbor
|
||||||
provided by external libraries) are inherently row-oriented. They store a compact representation of a set of rows and also return rows for
|
search are inherently row-oriented. They store a compact representation of a set of rows and also return rows for vector search queries.
|
||||||
ANN queries. This causes some rather unintuitive differences in the way ANN indexes behave compared to normal skip indexes.
|
This causes some rather unintuitive differences in the way vector similarity indexes behave compared to normal skip indexes.
|
||||||
|
|
||||||
When a user defines an ANN index on a column, ClickHouse internally creates an ANN "sub-index" for each index block. The sub-index is "local"
|
When a user defines a vector similarity index on a column, ClickHouse internally creates a vector similarity "sub-index" for each index
|
||||||
in the sense that it only knows about the rows of its containing index block. In the previous example and assuming that a column has 65536
|
block. The sub-index is "local" in the sense that it only knows about the rows of its containing index block. In the previous example and
|
||||||
rows, we obtain four index blocks (spanning eight granules) and an ANN sub-index for each index block. A sub-index is theoretically able to
|
assuming that a column has 65536 rows, we obtain four index blocks (spanning eight granules) and a vector similarity sub-index for each
|
||||||
return the rows with the N closest points within its index block directly. However, since ClickHouse loads data from disk to memory at the
|
index block. A sub-index is theoretically able to return the rows with the N closest points within its index block directly. However, since
|
||||||
granularity of granules, sub-indexes extrapolate matching rows to granule granularity. This is different from regular skip indexes which
|
ClickHouse loads data from disk to memory at the granularity of granules, sub-indexes extrapolate matching rows to granule granularity. This
|
||||||
skip data at the granularity of index blocks.
|
is different from regular skip indexes which skip data at the granularity of index blocks.
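The numbers in this example can be checked with a small calculation, using only the values stated above (`index_granularity = 8192`, `GRANULARITY = 2`, 65536 rows):

```sql
SELECT
    8192 * 2                AS rows_per_index_block,  -- index_granularity * GRANULARITY = 16384
    intDiv(65536, 8192)     AS granules,              -- 8 granules in total
    intDiv(65536, 8192 * 2) AS index_blocks;          -- 4 index blocks, hence 4 sub-indexes
```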
|
||||||
|
|
||||||
The `GRANULARITY` parameter determines how many ANN sub-indexes are created. Bigger `GRANULARITY` values mean fewer but larger ANN
|
The `GRANULARITY` parameter determines how many vector similarity sub-indexes are created. Bigger `GRANULARITY` values mean fewer but larger
|
||||||
sub-indexes, up to the point where a column (or a column's data part) has only a single sub-index. In that case, the sub-index has a
|
vector similarity sub-indexes, up to the point where a column (or a column's data part) has only a single sub-index. In that case, the
|
||||||
"global" view of all column rows and can directly return all granules of the column (part) with relevant rows (there are at most
|
sub-index has a "global" view of all column rows and can directly return all granules of the column (part) with relevant rows (there are at
|
||||||
`LIMIT [N]`-many such granules). In a second step, ClickHouse will load these granules and identify the actually best rows by performing a
|
most `LIMIT [N]`-many such granules). In a second step, ClickHouse will load these granules and identify the actually best rows by
|
||||||
brute-force distance calculation over all rows of the granules. With a small `GRANULARITY` value, each of the sub-indexes returns up to
|
performing a brute-force distance calculation over all rows of the granules. With a small `GRANULARITY` value, each of the sub-indexes
|
||||||
`LIMIT N`-many granules. As a result, more granules need to be loaded and post-filtered. Note that the search accuracy is with both cases
|
returns up to `LIMIT N`-many granules. As a result, more granules need to be loaded and post-filtered. Note that the search accuracy is in
|
||||||
equally good, only the processing performance differs. It is generally recommended to use a large `GRANULARITY` for ANN indexes and fall
|
both cases equally good, only the processing performance differs. It is generally recommended to use a large `GRANULARITY` for vector
|
||||||
back to a smaller `GRANULARITY` values only in case of problems like excessive memory consumption of the ANN structures. If no `GRANULARITY`
|
similarity indexes and fall back to smaller `GRANULARITY` values only in case of problems like excessive memory consumption of the vector
|
||||||
was specified for ANN indexes, the default value is 100 million.
|
similarity structures. If no `GRANULARITY` was specified for vector similarity indexes, the default value is 100 million.
|
||||||
|