ClickHouse/docs/en/engines/table-engines/mergetree-family/annindexes.md
Robert Schulze ce8b39487e
Update docs/en/engines/table-engines/mergetree-family/annindexes.md
Co-authored-by: Nikita Taranov <nickita.taranov@gmail.com>
2023-06-06 09:55:50 +02:00

Approximate Nearest Neighbor Search Indexes [experimental]

Nearest neighbor search refers to the problem of finding the point(s) with the smallest distance to a given point in an n-dimensional space. Since exact search is usually too slow in practice, the task is often solved with approximate algorithms. A popular use case of neighbor search is finding similar pictures (texts) for a given picture (text). Pictures (texts) can be decomposed into embeddings, and instead of comparing pictures (texts) pixel-by-pixel (character-by-character), only the embeddings are compared.

In terms of SQL, the problem can be expressed as follows:

SELECT *
FROM table
WHERE L2Distance(column, Point) < MaxDistance
LIMIT N

SELECT *
FROM table
ORDER BY L2Distance(column, Point)
LIMIT N

The queries are expensive because the L2 (Euclidean) distance between Point and all points in column must be computed. To speed this process up, Approximate Nearest Neighbor Search Indexes (ANN indexes) store a compact representation of the search space (using clustering, search trees, etc.) which allows an approximate answer to be computed quickly.
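To make the cost concrete, the following is a minimal Python sketch of the exact search that the ORDER BY query above describes. It is not how ClickHouse evaluates the query; the `rows` list and the function names are hypothetical stand-ins for a table and its distance function.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest(rows, point, n):
    """Return the n rows closest to point.

    One distance is computed per row, so the cost grows linearly
    with the number of rows (and with the vector dimension).
    """
    return sorted(rows, key=lambda row: l2_distance(row[1], point))[:n]

# Hypothetical (id, embedding) rows standing in for a table.
rows = [(1, [0.0, 0.0]), (2, [1.0, 1.0]), (3, [0.2, 0.1]), (4, [5.0, 5.0])]
nearest_ids = [row_id for row_id, _ in exact_nearest(rows, [0.0, 0.1], 2)]
print(nearest_ids)
```

An ANN index avoids this full scan by consulting a precomputed structure, at the price of possibly missing some of the true nearest neighbors.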

Creating ANN Indexes

As long as ANN indexes are experimental, you first need to SET allow_experimental_annoy_index = 1.

Syntax to create an ANN index over an Array column:

CREATE TABLE table
(
  `id` Int64,
  `embedding` Array(Float32),
  INDEX <ann_index_name> embedding TYPE <ann_index_type>(<ann_index_parameters>) GRANULARITY <N>
)
ENGINE = MergeTree
ORDER BY id;

Syntax to create an ANN index over a Tuple column:

CREATE TABLE table
(
  `id` Int64,
  `embedding` Tuple(Float32[, Float32[, ...]]),
  INDEX <ann_index_name> embedding TYPE <ann_index_type>(<ann_index_parameters>) GRANULARITY <N>
)
ENGINE = MergeTree
ORDER BY id;

ANN indexes are built during column insertion and merge. As a result, INSERT and OPTIMIZE statements will be slower than for ordinary tables. ANN indexes are ideally used only with immutable or rarely changed data, i.e. when there are many more read requests than write requests.

Similar to regular skip indexes, ANN indexes are constructed over granules and each indexed block consists of GRANULARITY = <N>-many granules. For example, if the primary index granularity of the table is 8192 (setting index_granularity = 8192) and GRANULARITY = 2, then each indexed block will consist of 16384 rows. However, unlike skip indexes, ANN indexes are not only able to skip the entire indexed block, they are also able to skip individual granules in indexed blocks. As a result, the GRANULARITY parameter has a different meaning in ANN indexes than in normal skip indexes. Basically, the bigger GRANULARITY is chosen, the more data is provided to a single ANN index, and the higher the chance that, with the right hyperparameters, the index will capture the structure of the data well.
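The arithmetic behind the example can be sketched in a few lines of Python. The helper names are hypothetical, and the block count assumes a single data part with fixed-size granules, which is a simplification.

```python
def rows_per_indexed_block(index_granularity, granularity):
    """Rows covered by one indexed block: GRANULARITY granules of index_granularity rows."""
    return index_granularity * granularity

def indexed_blocks_needed(total_rows, index_granularity, granularity):
    """Number of indexed blocks for total_rows (assumes one part, fixed-size granules)."""
    block = rows_per_indexed_block(index_granularity, granularity)
    return -(-total_rows // block)  # ceiling division

print(rows_per_indexed_block(8192, 2))            # rows per indexed block
print(indexed_blocks_needed(1_000_000, 8192, 2))  # blocks for one million rows
```

A larger GRANULARITY therefore means fewer but bigger ANN index structures over the same data.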

Using ANN Indexes

ANN indexes support two types of queries:

  • WHERE queries:

    SELECT *
    FROM table
    WHERE DistanceFunction(column, Point) < MaxDistance
    LIMIT N
    
  • ORDER BY queries:

    SELECT *
    FROM table
    [WHERE ...]
    ORDER BY DistanceFunction(column, Point)
    LIMIT N
    

DistanceFunction is a distance function, Point is a reference vector (e.g. (0.17, 0.33, ...)) and MaxDistance is a floating point value which restricts the size of the neighbourhood.

:::tip To avoid writing out large vectors, you can use query parameters, e.g.

clickhouse-client --param_vec='[0.17, 0.33]' --query="SELECT * FROM table WHERE L2Distance(embedding, {vec: Array(Float32)}) < 1.0"

:::
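When the reference vector comes from a program rather than being typed by hand, the parameter value can be generated. The sketch below is a hypothetical helper that formats a Python list as an array literal and assembles the argument list for clickhouse-client. It assumes the value passed via --param_vec must be parseable as Array(Float32), and it only builds the command; it does not run it.

```python
import shlex

def format_vector_param(vec):
    """Format a sequence of numbers as an array literal such as [0.17,0.33]."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def build_command(vec, max_distance):
    """Build the clickhouse-client argument list for a WHERE-type ANN query."""
    query = (
        "SELECT * FROM table "
        "WHERE L2Distance(embedding, {vec: Array(Float32)}) < " + str(max_distance)
    )
    return [
        "clickhouse-client",
        "--param_vec=" + format_vector_param(vec),
        "--query=" + query,
    ]

args = build_command([0.17, 0.33], 1.0)
print(shlex.join(args))  # shell-quoted command line for inspection
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell-quoting problems with the braces and brackets.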

ANN indexes cannot speed up queries that contain both a WHERE DistanceFunction(column, Point) < MaxDistance and an ORDER BY DistanceFunction(column, Point) clause. Also, the approximate algorithms used to determine the nearest neighbors require a limit, hence queries that use an ANN index must have a LIMIT clause.

An ANN index is only used if the query has a LIMIT value smaller than setting max_limit_for_ann_queries (default: 1 million rows). This is a safety measure which helps to avoid large memory consumption by external libraries for approximate neighbor search.
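The rules above can be summarized as a small decision function. This is only an illustration of the conditions stated in this section, written in Python with hypothetical parameter names; the actual check inside ClickHouse may consider more than this.

```python
def ann_index_applicable(has_distance_where, has_distance_order_by, limit,
                         max_limit_for_ann_queries=1_000_000):
    """Return True if, per the rules in this section, an ANN index could be used.

    has_distance_where:    query has WHERE DistanceFunction(column, Point) < MaxDistance
    has_distance_order_by: query has ORDER BY DistanceFunction(column, Point)
    limit:                 value of the LIMIT clause, or None if there is none
    """
    if limit is None:
        return False  # a LIMIT clause is required
    if limit >= max_limit_for_ann_queries:
        return False  # LIMIT must be smaller than the setting
    if has_distance_where and has_distance_order_by:
        return False  # both clause types together are not supported
    return has_distance_where or has_distance_order_by

print(ann_index_applicable(True, False, 10))
print(ann_index_applicable(True, True, 10))
```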

Available ANN Indexes

Annoy

(currently disabled on ARM due to memory safety problems with the algorithm)

This type of ANN index implements the Annoy algorithm, which recursively divides the space using random linear surfaces (lines in 2D, planes in 3D, etc.).
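The recursive division can be illustrated with a simplified Python sketch. It is not the Annoy library's implementation: it builds a single tree, picks two random points per node, and splits the remaining points by the hyperplane equidistant to them. All names (`build_tree`, `leaf_size`, the grid of sample points) are assumptions made for illustration.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def build_tree(points, leaf_size, rng):
    """Recursively split points by random hyperplanes until leaves are small."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    a, b = rng.sample(points, 2)
    # Hyperplane equidistant to a and b: normal (a - b), passing through their midpoint.
    normal = [x - y for x, y in zip(a, b)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [p for p in points if dot(normal, p) < offset]
    right = [p for p in points if dot(normal, p) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": points}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, leaf_size, rng),
        "right": build_tree(right, leaf_size, rng),
    }

def leaves(tree):
    """Collect the point lists stored in the leaves of a tree."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["left"]) + leaves(tree["right"])

rng = random.Random(0)
points = [(float(i % 5), float(i // 5)) for i in range(25)]  # 5x5 grid in 2D
tree = build_tree(points, leaf_size=4, rng=rng)
print(len(leaves(tree)))
```

A search descends the tree towards the leaf on the query point's side of each hyperplane and only compares against the points found there. Because a single tree can separate true neighbors, several independently built trees are combined.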

Syntax to create an Annoy index over an Array column:

CREATE TABLE table
(
  id Int64,
  embedding Array(Float32),
  INDEX <ann_index_name> embedding TYPE annoy([DistanceName[, NumTrees]]) GRANULARITY N
)
ENGINE = MergeTree
ORDER BY id;

Syntax to create an Annoy index over a Tuple column:

CREATE TABLE table
(
  id Int64,
  embedding Tuple(Float32[, Float32[, ...]]),
  INDEX <ann_index_name> embedding TYPE annoy([DistanceName[, NumTrees]]) GRANULARITY N
)
ENGINE = MergeTree
ORDER BY id;

Parameter DistanceName is the name of a distance function (default: L2Distance). Annoy currently supports L2Distance and cosineDistance as distance functions. Parameter NumTrees (default: 100) is the number of trees which the algorithm will create. Higher values of NumTrees mean slower CREATE and SELECT statements (approximately linearly), but increase the accuracy of search results.

:::note Indexes over columns of type Array will generally work faster than indexes on Tuple columns. All arrays must have the same length. Use CONSTRAINT to avoid errors. For example, CONSTRAINT constraint_name_1 CHECK length(embedding) = 256.

:::

Setting annoy_index_search_k_nodes (default: NumTrees * LIMIT) determines how many tree nodes are inspected during SELECTs. It can be used to balance runtime and accuracy at runtime.

Example:

SELECT *
FROM table_name
[WHERE ...]
ORDER BY L2Distance(column, Point)
LIMIT N
SETTINGS annoy_index_search_k_nodes=100
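For comparison with an explicit value such as the 100 used above, the default stated in this section (NumTrees * LIMIT) can be computed as follows. The function name is hypothetical.

```python
def default_search_k_nodes(num_trees, limit):
    """Default of annoy_index_search_k_nodes as described above: NumTrees * LIMIT."""
    return num_trees * limit

# With the default NumTrees of 100 and LIMIT 10:
print(default_search_k_nodes(100, 10))
```

Setting annoy_index_search_k_nodes below this value inspects fewer nodes, which trades accuracy for speed.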