Docs: Update vector search docs

Robert Schulze, 2024-11-25 12:36:44 +00:00
commit 46a3e3e795 (parent 6c1e44a82d)

# Approximate Nearest Neighbor Search Indexes [experimental]
Nearest neighborhood search is the problem of finding the M closest vectors to a given vector in an N-dimensional vector space. The most
straightforward approach to solve this problem is an exhaustive (brute-force) search which computes the distance between the reference
vector and all other vectors in the vector space. While this method guarantees a perfectly accurate result, it is usually too slow for
practical applications. As an alternative, [approximative algorithms](https://github.com/erikbern/ann-benchmarks) use greedy heuristics to
find the M closest vectors much faster. This enables semantic search of picture, song, and text
[embeddings](https://cloud.google.com/architecture/overview-extracting-and-serving-feature-embeddings-for-machine-learning) in milliseconds.
Blogs:
- [Vector Search with ClickHouse - Part 1](https://clickhouse.com/blog/vector-search-clickhouse-p1)
- [Vector Search with ClickHouse - Part 2](https://clickhouse.com/blog/vector-search-clickhouse-p2)
In terms of SQL, a nearest neighborhood search can be expressed as follows:
``` sql
SELECT [...]
FROM table, [...]
ORDER BY DistanceFunction(vectors, reference_vector)
LIMIT N
```
where
- `DistanceFunction` computes a distance between two vectors (e.g.
  [L2Distance](../../../sql-reference/functions/distance-functions.md#L2Distance) or
  [cosineDistance](../../../sql-reference/functions/distance-functions.md#cosineDistance)),
- `vectors` is a column of type [Array(Float64)](../../../sql-reference/data-types/array.md),
  [Array(Float32)](../../../sql-reference/data-types/array.md), or [Array(BFloat16)](../../../sql-reference/data-types/array.md), typically
  storing embeddings,
- `reference_vector` is a literal of type [Array(Float64)](../../../sql-reference/data-types/array.md),
  [Array(Float32)](../../../sql-reference/data-types/array.md), or [Array(BFloat16)](../../../sql-reference/data-types/array.md), and
- `N` is a constant integer restricting the number of returned results.
The query returns the `N` closest points in `vectors` to `reference_vector`.
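To make the template concrete, here is a hedged example: the table `songs`, column `embedding`, and the 3-dimensional reference vector are
illustrative assumptions (real embeddings typically have hundreds of dimensions), and `L2Distance` stands in for `DistanceFunction`:

```sql
-- Hypothetical brute-force search: the 5 nearest songs by Euclidean distance.
SELECT id, title
FROM songs
ORDER BY L2Distance(embedding, [0.17, 0.33, 0.25]::Array(Float32))
LIMIT 5;
```

Without an index, such a query scans all rows and computes every distance, which is exactly the exhaustive search discussed next.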
Exhaustive search computes the distance between `reference_vector` and all vectors in `vectors`. As such, its runtime is linear in the
number of stored vectors. Approximate search relies on special data structures (e.g. graphs, random forests, etc.) which allow finding the
closest vectors to a given reference vector quickly (i.e. in sub-linear time). ClickHouse provides such a data structure in the form of
"vector similarity indexes", a type of [skipping index](mergetree.md#table_engine-mergetree-data_skipping-indexes).
# Creating and Using Vector Similarity Indexes
Syntax to create a vector similarity index:
```sql
CREATE TABLE table
(
  id Int64,
  vectors Array(Float32),
  INDEX index_name vectors TYPE vector_similarity(method, distance_function[, quantization, hnsw_max_connections_per_layer, hnsw_candidate_list_size_for_construction]) [GRANULARITY N]
)
ENGINE = MergeTree
ORDER BY id;
```
:::note
USearch indexes are currently experimental; to use them, you first need to `SET allow_experimental_vector_similarity_index = 1`.
:::

The index can be built on a column of type [Array(Float64)](../../../sql-reference/data-types/array.md),
[Array(Float32)](../../../sql-reference/data-types/array.md), or [Array(BFloat16)](../../../sql-reference/data-types/array.md).

Index parameters:
- `method`: Currently only `hnsw` is supported.
- `distance_function`: either `L2Distance` (the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance): the length of a line
  between two points in Euclidean space), or `cosineDistance` (the [cosine
  distance](https://en.wikipedia.org/wiki/Cosine_similarity#Cosine_distance): the angle between two non-zero vectors).
- `quantization`: either `f64`, `f32`, `f16`, `bf16`, or `i8` for storing vectors with reduced precision (optional, default: `bf16`)
- `hnsw_max_connections_per_layer`: the number of neighbors per HNSW graph node, also known as `M` in the [HNSW
  paper](https://doi.org/10.1109/TPAMI.2018.2889473). Optional, default: `32`. Value `0` means using the default value.
- `hnsw_candidate_list_size_for_construction`: the size of the dynamic candidate list when constructing the HNSW graph, also known as
  `ef_construction` in the original [HNSW paper](https://doi.org/10.1109/TPAMI.2018.2889473). Optional, default: `128`. Value `0` means
  using the default value.
For normalized data, `L2Distance` is usually the best choice, otherwise `cosineDistance` is recommended to compensate for scale.
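To illustrate that recommendation with a hedged sketch (the `staging_table` source and the column names are hypothetical), vectors can be
L2-normalized at insert time, after which ordering by `L2Distance` matches ordering by `cosineDistance`:

```sql
-- Hypothetical: L2-normalize embeddings on the way in (assumes non-zero vectors).
INSERT INTO table
SELECT id, arrayMap(x -> x / L2Norm(vectors), vectors)
FROM staging_table;
```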
Example:
```sql
CREATE TABLE table
(
  id Int64,
  vectors Array(Float32),
  INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance')
)
ENGINE = MergeTree
ORDER BY id;
```
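The defaults can also be overridden explicitly. The following variant is a hedged sketch (the index name and parameter values are
illustrative, not taken from the original text) that specifies the quantization and both HNSW tuning knobs positionally:

```sql
-- Hypothetical: f32 quantization, 64 connections per layer, ef_construction = 256.
CREATE TABLE table_tuned
(
  id Int64,
  vectors Array(Float32),
  INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance', 'f32', 64, 256)
)
ENGINE = MergeTree
ORDER BY id;
```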
:::note
All arrays must have the same length. To avoid errors, you can use a
[CONSTRAINT](/docs/en/sql-reference/statements/create/table.md#constraints), for example, `CONSTRAINT constraint_name_1 CHECK
length(vectors) = 256`. Empty `Arrays` and unspecified `Array` values in INSERT statements (i.e. default values) are also not supported.
:::
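To make the constraint concrete, here is a hedged sketch of a table definition combining such a `CONSTRAINT` with a vector similarity index
(the names and the dimension 256 are illustrative):

```sql
-- Hypothetical: reject inserts whose embedding is not 256-dimensional.
CREATE TABLE table_checked
(
  id Int64,
  vectors Array(Float32),
  CONSTRAINT vectors_len CHECK length(vectors) = 256,
  INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance')
)
ENGINE = MergeTree
ORDER BY id;
```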
Vector similarity indexes are based on the [USearch library](https://github.com/unum-cloud/usearch), which implements the [HNSW
algorithm](https://arxiv.org/abs/1603.09320), i.e., a hierarchical graph where each node represents a vector and the edges between nodes
represent similarity. Such hierarchical structures can be very efficient on large collections. They may often fetch 0.05% or less data from
the overall dataset, while still providing 99% recall. This is especially useful when working with high-dimensional vectors which are
expensive to load and compare. USearch also utilizes SIMD to accelerate distance computations on modern x86 (AVX2 and AVX-512) and ARM (NEON
and SVE) CPUs.
Vector similarity indexes are built during column insertion and merge. The HNSW algorithm is known to provide slow inserts. As a result,
`INSERT` and `OPTIMIZE` statements on tables with a vector similarity index will be slower than for ordinary tables. Vector similarity
indexes are ideally used only with immutable or rarely changed data, i.e. when there are far more read requests than write requests. Three
additional techniques are recommended to speed up index creation:
- Index creation can be parallelized. The maximum number of threads can be configured using server setting
  [max_build_vector_similarity_index_thread_pool_size](../../../operations/server-configuration-parameters/settings.md#server_configuration_parameters_max_build_vector_similarity_index_thread_pool_size).
- Index creation on newly inserted parts may be disabled using setting `materialize_skip_indexes_on_insert`. Search on such parts will fall
  back to exact search, but as inserted parts are typically small compared to the total table size, the performance impact is negligible.
- As parts are incrementally merged into bigger parts, and these new parts are merged into even bigger parts ("write amplification"),
  vector similarity indexes are possibly built multiple times for the same vectors. To avoid that, you may suppress merges during insert
  using statement [`SYSTEM STOP MERGES`](../../../sql-reference/statements/system.md), and start merges once all data has been inserted
  using `SYSTEM START MERGES`, as shown in the sketch below.
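As a hedged sketch only (the table names, index setup, and the final `OPTIMIZE` are illustrative assumptions, not a recipe from the
original text), the three techniques might be combined for a bulk load like this:

```sql
-- Hypothetical bulk-load recipe combining the techniques above.
SYSTEM STOP MERGES table;                        -- avoid repeated index builds during merges

INSERT INTO table
SETTINGS materialize_skip_indexes_on_insert = 0  -- no index build on freshly inserted parts
SELECT id, vectors FROM staging_table;

SYSTEM START MERGES table;
OPTIMIZE TABLE table FINAL;                      -- merge parts; the index is built during the merge
```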
Vector similarity indexes support this type of query:
``` sql
WITH [...] AS reference_vector
SELECT [...]
FROM table, [...]
ORDER BY DistanceFunction(vectors, reference_vector)
LIMIT N
SETTINGS enable_analyzer = 0; -- Temporary limitation, will be lifted
```
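A concrete instance of this template, as a hedged sketch (the table layout and the 3-dimensional reference vector are hypothetical):

```sql
-- Hypothetical: provide the reference vector via WITH and search the index.
WITH [0.17, 0.33, 0.25]::Array(Float32) AS reference_vector
SELECT id
FROM table
ORDER BY cosineDistance(vectors, reference_vector)
LIMIT 10
SETTINGS enable_analyzer = 0; -- Temporary limitation, will be lifted
```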
To search using a different value of HNSW parameter `hnsw_candidate_list_size_for_search` (default: 256), also known as `ef_search` in the
original [HNSW paper](https://doi.org/10.1109/TPAMI.2018.2889473), run the `SELECT` query with `SETTINGS hnsw_candidate_list_size_for_search
= <value>`.
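For example (a hedged sketch with hypothetical names; a larger candidate list trades query speed for recall):

```sql
-- Hypothetical: search with a larger candidate list for higher recall.
WITH [0.17, 0.33, 0.25]::Array(Float32) AS reference_vector
SELECT id
FROM table
ORDER BY L2Distance(vectors, reference_vector)
LIMIT 10
SETTINGS hnsw_candidate_list_size_for_search = 512, enable_analyzer = 0;
```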
**Restrictions**: Approximate vector search algorithms require a limit, hence queries without a `LIMIT` clause cannot utilize vector
similarity indexes. The limit must also be smaller than setting `max_limit_for_ann_queries` (default: 100).
**Differences to Regular Skip Indexes** Similar to regular [skip indexes](https://clickhouse.com/docs/en/optimize/skipping-indexes), vector
similarity indexes are constructed over granules and each indexed block consists of `GRANULARITY = [N]`-many granules (`[N]` = 1 by default
for normal skip indexes). For example, if the primary index granularity of the table is 8192 (setting `index_granularity = 8192`) and
`GRANULARITY = 2`, then each indexed block will contain 16384 rows. However, data structures and algorithms for approximate neighborhood
search are inherently row-oriented. They store a compact representation of a set of rows and also return rows for vector search queries.
This causes some rather unintuitive differences in the way vector similarity indexes behave compared to normal skip indexes.
When a user defines a vector similarity index on a column, ClickHouse internally creates a vector similarity "sub-index" for each index
block. The sub-index is "local" in the sense that it only knows about the rows of its containing index block. In the previous example and
assuming that a column has 65536 rows, we obtain four index blocks (spanning eight granules) and a vector similarity sub-index for each
index block. A sub-index is theoretically able to return the rows with the N closest points within its index block directly. However, since
ClickHouse loads data from disk to memory at the granularity of granules, sub-indexes extrapolate matching rows to granule granularity. This
is different from regular skip indexes which skip data at the granularity of index blocks.
The `GRANULARITY` parameter determines how many vector similarity sub-indexes are created. Bigger `GRANULARITY` values mean fewer but larger
vector similarity sub-indexes, up to the point where a column (or a column's data part) has only a single sub-index. In that case, the
sub-index has a "global" view of all column rows and can directly return all granules of the column (part) with relevant rows (there are at
most `LIMIT [N]`-many such granules). In a second step, ClickHouse will load these granules and identify the actually best rows by
performing a brute-force distance calculation over all rows of the granules. With a small `GRANULARITY` value, each of the sub-indexes
returns up to `LIMIT N`-many granules. As a result, more granules need to be loaded and post-filtered. Note that the search accuracy is
equally good in both cases, only the processing performance differs. It is generally recommended to use a large `GRANULARITY` for vector
similarity indexes and fall back to smaller `GRANULARITY` values only in case of problems like excessive memory consumption of the vector
similarity structures. If no `GRANULARITY` was specified for vector similarity indexes, the default value is 100 million.
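As a hedged sketch (the names are illustrative), an explicit `GRANULARITY` is appended to the index definition:

```sql
-- Hypothetical: a very large GRANULARITY yields one "global" sub-index per part.
CREATE TABLE table_global
(
  id Int64,
  vectors Array(Float32),
  INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance') GRANULARITY 100000000
)
ENGINE = MergeTree
ORDER BY id;
```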