diff --git a/docs/en/engines/table-engines/mergetree-family/annindexes.md b/docs/en/engines/table-engines/mergetree-family/annindexes.md index 3819d63d6bf..7e74a082209 100644 --- a/docs/en/engines/table-engines/mergetree-family/annindexes.md +++ b/docs/en/engines/table-engines/mergetree-family/annindexes.md @@ -2,6 +2,15 @@ The main task that indexes help to solve is to find the nearest neighbors for multidimensional data. An example of such a problem could be similar pictures or texts, for which the problem is reduced to finding the nearest [embeddings](https://cloud.google.com/architecture/overview-extracting-and-serving-feature-embeddings-for-machine-learning). They can be created from data using [UDF](../../../sql-reference/functions/index.md#executable-user-defined-functions). +Next query can find closest neighbor in L2 space: +``` sql +SELECT * +FROM table_name +WHERE L2Distance(Column, TargetEmbedding) < Value +LIMIT N +``` +But it will take some time for execution because of the long calculation of the distance between `TargetEmbedding` and all other vectors. This is where indexes can help. As they store the overall data structure using some methods (clustering, building search trees, etc.), they can give only approximate results for finding the nearest neighbors. + ## Indexes Structure Approximate Nearest Neighbor Search Indexes (`ANNIndexes`) are simmilar to skip indexes. They are constructed by some granules and determine which of them should be skipped. Compared to skip indices, ANN indices use their results not only to skip some group of granules, but also to select particular granules from a set of granules. @@ -12,26 +21,27 @@ Approximate Nearest Neighbor Search Indexes (`ANNIndexes`) are simmilar to skip ``` sql SELECT * FROM table_name - WHERE DistanceFunction(Column, TargetVector) < Value + WHERE DistanceFunction(Column, TargetEmbedding) < Value LIMIT N ``` - ###### Type 2: Order by ``` sql SELECT * FROM table_name [WHERE ...] 
- ORDER BY DistanceFunction(Column, TargetVector) + ORDER BY DistanceFunction(Column, TargetEmbedding) LIMIT N ``` -In these queries, `DistanceFunction` is selected from tuples of distance functions. `TargetVector` is a known embedding (something like `(0.1, 0.1, ... )`). `Value` - a float value that will bound the neighbourhood. +In these queries, `DistanceFunction` is selected from [tuples of distance functions](../../../sql-reference/functions/tuple-functions/#l1norm). `TargetEmbedding` is a known embedding (something like `(0.1, 0.1, ... )`). `Value` - a float value that will bound the neighbourhood. !!! note "Note" - ANNIndex can't speed up query that satisfies both types and they work only for Tuples. All queries must have the limit, as algorithms are used to find nearest neighbors and need a specific number of them. + ANNIndex can't speed up query that satisfies both types(`where + order by`, only one of them). All queries must have the limit, as algorithms are used to find nearest neighbors and need a specific number of them. -Both types of queries are handled the same way. The indexes get `n` neighbors (where `n` is taken from the `LIMIT` section) and work with them. In `ORDER BY` query they remember the numbers of all parts of the granule that have at least one of neighbor. In `WHERE` query they remember only those parts that satisfy the requirements. +Both types of queries are handled the same way. The indexes get `n` neighbors (where `n` is taken from the `LIMIT` clause) and work with them. In `ORDER BY` query they remember the numbers of all parts of the granule that have at least one of neighbor. In `WHERE` query they remember only those parts that satisfy the requirements. -###### Create table with ANNIndex -``` +## Create table with ANNIndex + +```sql CREATE TABLE t ( `id` Int64, @@ -41,8 +51,19 @@ CREATE TABLE t ENGINE = MergeTree ORDER BY id; ``` - -Number of granules in granularity should be large. 
With greater `GRANULARITY` indexes remember the data structure better. But some indexes can't be built if they don't have enough data, so this granule will always participate in the query. For more information, see the description of indexes. + +```sql +CREATE TABLE t +( + `id` Int64, + `number` Array(Float32), + INDEX x number TYPE annoy GRANULARITY N +) +ENGINE = MergeTree +ORDER BY id; +``` + +Number of granules in granularity should be large. With greater `GRANULARITY` indexes remember the data structure better. The `GRANULARITY` indicates how many granules will be used to construct the index. The more data is provided for the index, the more of it can be handled by one index and the more chances that with the right hyperparameters the index will remember the data structure better. But some indexes can't be built if they don't have enough data, so this granule will always participate in the query. For more information, see the description of indexes. As the indexes are built only during insertions into table, `INSERT` and `OPTIMIZE` queries are slower than for ordinary table. At this stage indexes remember all the information about the given data. ANNIndexes should be used if you have immutable or rarely changed data and many read requests. @@ -69,6 +90,17 @@ CREATE TABLE t ENGINE = MergeTree ORDER BY id; ``` + +```sql +CREATE TABLE t +( + id Int64, + number Array(Float32), + INDEX x number TYPE annoy(T) GRANULARITY N +) +ENGINE = MergeTree +ORDER BY id; +``` Parameter `T` is the number of trees which algorithm will create. The bigger it is, the slower (approximately linear) it works (in both `CREATE` and `SELECT` requests), but the better accuracy you get (adjusted for randomness). In the `SELECT` in the settings (`ann_index_params`) you can specify the size of the internal buffer (more details in the description above or in the [original repository](https://github.com/spotify/annoy)). 
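
For context on the documentation hunks above: the exact query they describe has to compute the distance from `TargetEmbedding` to every stored vector. A minimal C++ sketch of that linear scan, which is the work an ANN index tries to avoid, could look like the following. The names `l2Distance` and `bruteForceNearest` are hypothetical and are not part of ClickHouse; the sketch assumes all embeddings have the same dimension.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Embedding = std::vector<float>;

// Euclidean (L2) distance between two embeddings of equal dimension.
float l2Distance(const Embedding & a, const Embedding & b)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        const float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Exact counterpart of `ORDER BY L2Distance(Column, TargetEmbedding) LIMIT n`.
// Every row is visited, so the cost grows linearly with the number of rows.
std::vector<std::size_t> bruteForceNearest(
    const std::vector<Embedding> & rows, const Embedding & target, std::size_t n)
{
    std::vector<std::pair<float, std::size_t>> scored;
    scored.reserve(rows.size());
    for (std::size_t i = 0; i < rows.size(); ++i)
        scored.emplace_back(l2Distance(rows[i], target), i);

    // Only the n closest rows need to be ordered.
    n = std::min(n, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + n, scored.end());

    std::vector<std::size_t> result;
    result.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        result.push_back(scored[i].second);
    return result;
}
```

An index such as Annoy trades this exact scan for a precomputed structure, which is why the documentation describes the results as approximate.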
diff --git a/src/Storages/MergeTree/CommonANNIndexes.cpp b/src/Storages/MergeTree/CommonANNIndexes.cpp index 012b81be21d..5f6334adf97 100644 --- a/src/Storages/MergeTree/CommonANNIndexes.cpp +++ b/src/Storages/MergeTree/CommonANNIndexes.cpp @@ -21,6 +21,34 @@ namespace ErrorCodes extern const int INCORRECT_QUERY; } +namespace +{ + +template +void extractTargetVectorFromLiteral(ANN::ANNQueryInformation::Embedding & target, Literal literal) +{ + Float64 float_element_of_target_vector; + Int64 int_element_of_target_vector; + + for (const auto & value : literal.value()) + { + if (value.tryGet(float_element_of_target_vector)) + { + target.emplace_back(float_element_of_target_vector); + } + else if (value.tryGet(int_element_of_target_vector)) + { + target.emplace_back(static_cast(int_element_of_target_vector)); + } + else + { + throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported."); + } + } +} + +} + namespace ApproximateNearestNeighbour { @@ -91,7 +119,7 @@ String ANNCondition::getMetricName() const { if (index_is_useful && query_information.has_value()) { - return query_information->metric_name; + return query_information->metric_name; } throw Exception(ErrorCodes::LOGICAL_ERROR, "Metric name was requested for useless or uninitialized index."); } @@ -100,7 +128,7 @@ float ANNCondition::getPValueForLpDistance() const { if (index_is_useful && query_information.has_value()) { - return query_information->p_for_lp_dist; + return query_information->p_for_lp_dist; } throw Exception(ErrorCodes::LOGICAL_ERROR, "P from LPDistance was requested for useless or uninitialized index."); } @@ -351,7 +379,6 @@ bool ANNCondition::matchRPNWhere(RPN & rpn, ANNQueryInformation & expr) } auto iter = rpn.begin(); - bool identifier_found = false; // Query starts from operator less if (iter->function != RPNElement::FUNCTION_COMPARISON) @@ -381,13 +408,7 @@ bool ANNCondition::matchRPNWhere(RPN & rpn, ANNQueryInformation & expr) 
} auto end = rpn.end(); - if (!matchMainParts(iter, end, expr, identifier_found)) - { - return false; - } - - // Final checks of correctness - if (!identifier_found || expr.target.empty()) + if (!matchMainParts(iter, end, expr)) { return false; } @@ -417,9 +438,8 @@ bool ANNCondition::matchRPNOrderBy(RPN & rpn, ANNQueryInformation & expr) auto iter = rpn.begin(); auto end = rpn.end(); - bool identifier_found = false; - return ANNCondition::matchMainParts(iter, end, expr, identifier_found); + return ANNCondition::matchMainParts(iter, end, expr); } // Returns true and stores Length if we have valid LIMIT clause in query @@ -435,8 +455,10 @@ bool ANNCondition::matchRPNLimit(RPNElement & rpn, UInt64 & limit) } /* Matches dist function, target vector, column name */ -bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr, bool & identifier_found) +bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr) { + bool identifier_found = false; + // Matches DistanceFunc->[Column]->[Tuple(array)Func]->TargetVector(floats)->[Column] if (iter->function != RPNElement::FUNCTION_DISTANCE) { @@ -457,7 +479,6 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN ++iter; } - if (iter->function == RPNElement::FUNCTION_IDENTIFIER) { identifier_found = true; @@ -470,50 +491,15 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN ++iter; } - ///TODO: optimize if (iter->function == RPNElement::FUNCTION_LITERAL_TUPLE) { - Float64 float_element_of_target_vector; - Int64 int_element_of_target_vector; - - for (const auto & value : iter->tuple_literal.value()) - { - if (value.tryGet(float_element_of_target_vector)) - { - expr.target.emplace_back(float_element_of_target_vector); - } - else if (value.tryGet(int_element_of_target_vector)) - { - expr.target.emplace_back(static_cast(int_element_of_target_vector)); - } - else - { - throw 
Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported."); - } - } + extractTargetVectorFromLiteral(expr.target, iter->tuple_literal); ++iter; } if (iter->function == RPNElement::FUNCTION_LITERAL_ARRAY) { - Float64 float_element_of_target_vector; - Int64 int_element_of_target_vector; - - for (const auto & value : iter->array_literal.value()) - { - if (value.tryGet(float_element_of_target_vector)) - { - expr.target.emplace_back(float_element_of_target_vector); - } - else if (value.tryGet(int_element_of_target_vector)) - { - expr.target.emplace_back(static_cast(int_element_of_target_vector)); - } - else - { - throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported."); - } - } + extractTargetVectorFromLiteral(expr.target, iter->array_literal); ++iter; } @@ -541,6 +527,12 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN ++iter; } + // Final checks of correctness + if (!identifier_found || expr.target.empty()) + { + return false; + } + return true; } diff --git a/src/Storages/MergeTree/CommonANNIndexes.h b/src/Storages/MergeTree/CommonANNIndexes.h index b4e3ed26c0a..b1db9fac08c 100644 --- a/src/Storages/MergeTree/CommonANNIndexes.h +++ b/src/Storages/MergeTree/CommonANNIndexes.h @@ -20,11 +20,11 @@ namespace ApproximateNearestNeighbour * 3) name of column with embeddings * 4) type of query * 5) Number of elements, that should be taken (limit) - * + * * And two optional parameters: * 1) p for LpDistance function * 2) distance to compare with (only for where queries) - */ + */ struct ANNQueryInformation { using Embedding = std::vector; @@ -190,7 +190,7 @@ private: // Checks that at least one rpn is matching for index // New RPNs for other query types can be added here - bool matchAllRPNS(); + bool matchAllRPNS(); // Returns true and stores ANNExpr if the query has valid WHERE section static bool matchRPNWhere(RPN & 
rpn, ANNQueryInformation & expr); @@ -202,8 +202,7 @@ private: static bool matchRPNLimit(RPNElement & rpn, UInt64 & limit); /* Matches dist function, target vector, column name */ - ///TODO: identifier_found - static bool matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr, bool & identifier_found); + static bool matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr); // Gets float or int from AST node static float getFloatOrIntLiteralOrPanic(RPN::iterator& iter); @@ -226,8 +225,8 @@ public: virtual std::vector getUsefulRanges(MergeTreeIndexGranulePtr idx_granule) const = 0; }; -} // namespace ApproximateNearestNeighbour +} namespace ANN = ApproximateNearestNeighbour; -} // namespace DB +} diff --git a/src/Storages/MergeTree/MergeTreeIndexAnnoy.cpp b/src/Storages/MergeTree/MergeTreeIndexAnnoy.cpp index ffef89746ef..2f34809e993 100644 --- a/src/Storages/MergeTree/MergeTreeIndexAnnoy.cpp +++ b/src/Storages/MergeTree/MergeTreeIndexAnnoy.cpp @@ -130,7 +130,7 @@ void MergeTreeIndexAggregatorAnnoy::update(const Block & block, size_t * pos, si if (*pos >= block.rows()) throw Exception( ErrorCodes::LOGICAL_ERROR, - "The provided position is not less than the number of block rows. Position: {}, Block rows: {}.", + "The provided position is not less than the number of block rows. 
Position: {}, Block rows: {}.", toString(*pos), toString(block.rows())); size_t rows_read = std::min(limit, block.rows() - *pos); @@ -173,7 +173,7 @@ void MergeTreeIndexAggregatorAnnoy::update(const Block & block, size_t * pos, si if (!column_tuple) throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type was given to index."); - + const auto & columns = column_tuple->getColumns(); std::vector> data{column_tuple->size(), std::vector()}; @@ -192,7 +192,7 @@ void MergeTreeIndexAggregatorAnnoy::update(const Block & block, size_t * pos, si } for (const auto& item : data) { - index_base->add_item(index_base->get_n_items(), &item[0]); + index_base->add_item(index_base->get_n_items(), item.data()); } } @@ -226,8 +226,8 @@ std::vector MergeTreeIndexConditionAnnoy::getUsefulRanges(MergeTreeIndex = condition.queryHasWhereClause() ? std::optional(condition.getComparisonDistanceForWhereQuery()) : std::nullopt; if (comp_dist && comp_dist.value() < 0) - throw Exception(ErrorCodes::LOGICAL_ERROR, "Attemp to optimize query with where without distance"); - + throw Exception(ErrorCodes::LOGICAL_ERROR, "Attempt to optimize query with where without distance"); + std::vector target_vec = condition.getTargetVector(); auto granule = std::dynamic_pointer_cast(idx_granule); @@ -261,7 +261,7 @@ std::vector MergeTreeIndexConditionAnnoy::getUsefulRanges(MergeTreeIndex throw Exception("Setting of the annoy index should be int", ErrorCodes::INCORRECT_QUERY); } } - annoy->get_nns_by_vector(&target_vec[0], 1, k_search, &items, &dist); + annoy->get_nns_by_vector(target_vec.data(), 1, k_search, &items, &dist); std::unordered_set result; for (size_t i = 0; i < items.size(); ++i) { diff --git a/tests/queries/0_stateless/02354_annoy.sql b/tests/queries/0_stateless/02354_annoy.sql index 4d05fd73f31..007ba5f7c2d 100644 --- a/tests/queries/0_stateless/02354_annoy.sql +++ b/tests/queries/0_stateless/02354_annoy.sql @@ -33,4 +33,4 @@ FROM 02354_annoy WHERE L2Distance(embedding, [0.0, 0.0, 10.0]) < 1.0 
LIMIT 5;
-DROP TABLE IF EXISTS 02354_annoy;
\ No newline at end of file
+DROP TABLE IF EXISTS 02354_annoy;
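
The main refactoring in `CommonANNIndexes.cpp` replaces two identical loops (one over `tuple_literal`, one over `array_literal`) with a single templated helper. The idea can be sketched in isolation as below. This is only an illustration under stated assumptions: `Value` is a hypothetical stand-in built on `std::variant` for ClickHouse's field type, `std::optional<std::vector<Value>>` stands in for the literal types, and `std::invalid_argument` stands in for the `INCORRECT_QUERY` exception.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a literal element: a float, an int, or something unsupported.
using Value = std::variant<double, std::int64_t, std::string>;
using Embedding = std::vector<float>;

// One template serves every literal type that exposes value() returning a range of Value,
// so the tuple and array branches no longer need their own copies of the loop.
template <typename Literal>
void extractTargetVector(Embedding & target, const Literal & literal)
{
    for (const auto & value : literal.value())
    {
        if (const auto * as_float = std::get_if<double>(&value))
            target.emplace_back(static_cast<float>(*as_float));
        else if (const auto * as_int = std::get_if<std::int64_t>(&value))
            target.emplace_back(static_cast<float>(*as_int));
        else
            throw std::invalid_argument(
                "Wrong type of elements in target vector. Only float or int are supported.");
    }
}
```

Keeping the helper in an anonymous namespace, as the patch does, limits it to the translation unit and avoids exposing it in `CommonANNIndexes.h`.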