refactoring

2024-11-22 07:31:57 +00:00 · 2022-07-12 13:57:16 +03:00 · 0e89fbff53
commit 0e89fbff53
parent fb5272971c
5 changed files with 97 additions and 74 deletions
--- a/docs/en/engines/table-engines/mergetree-family/annindexes.md
+++ b/docs/en/engines/table-engines/mergetree-family/annindexes.md
@@ -2,6 +2,15 @@

 The main task that indexes help to solve is to find the nearest neighbors for multidimensional data. An example of such a problem could be similar pictures or texts, for which the problem is reduced to finding the nearest [embeddings](https://cloud.google.com/architecture/overview-extracting-and-serving-feature-embeddings-for-machine-learning). They can be created from data using [UDF](../../../sql-reference/functions/index.md#executable-user-defined-functions).

+Next query can find closest neighbor in L2 space:
+``` sql 
+SELECT * 
+FROM table_name 
+WHERE L2Distance(Column, TargetEmbedding) < Value 
+LIMIT N
+```
+But it will take some time for execution because of the long calculation of the distance between `TargetEmbedding` and all other vectors. This is where indexes can help. As they  store the overall data structure using some methods (clustering, building search trees, etc.), they can give only approximate results for finding the nearest neighbors.
+
 ## Indexes Structure

 Approximate Nearest Neighbor Search Indexes (`ANNIndexes`) are simmilar to skip indexes. They are constructed by some granules and determine which of them should be skipped. Compared to skip indices, ANN indices use their results not only to skip some group of granules, but also to select particular granules from a set of granules.
@@ -12,26 +21,27 @@ Approximate Nearest Neighbor Search Indexes (`ANNIndexes`) are simmilar to skip
   ``` sql 
   SELECT * 
   FROM table_name 
-   WHERE DistanceFunction(Column, TargetVector) < Value 
+   WHERE DistanceFunction(Column, TargetEmbedding) < Value 
   LIMIT N
   ```
 - ###### Type 2: Order by
  ``` sql
  SELECT * 
  FROM table_name [WHERE ...] 
-  ORDER BY DistanceFunction(Column, TargetVector) 
+  ORDER BY DistanceFunction(Column, TargetEmbedding) 
  LIMIT N
  ```

-In these queries, `DistanceFunction` is selected from tuples of distance functions. `TargetVector` is a known embedding (something like `(0.1, 0.1, ... )`). `Value` - a float value that will bound the neighbourhood.
+In these queries, `DistanceFunction` is selected from [tuples of distance functions](../../../sql-reference/functions/tuple-functions/#l1norm). `TargetEmbedding` is a known embedding (something like `(0.1, 0.1, ... )`). `Value` - a float value that will bound the neighbourhood.

 !!! note "Note"
-    ANNIndex can't speed up query that satisfies both types and they work only for Tuples. All queries must have the limit, as algorithms are used to find nearest neighbors and need a specific number of them.
+    ANNIndex can't speed up query that satisfies both types(`where + order by`, only one of them). All queries must have the limit, as algorithms are used to find nearest neighbors and need a specific number of them.

-Both types of queries are handled the same way. The indexes get `n` neighbors (where `n` is taken from the `LIMIT` section) and work with them. In `ORDER BY` query they remember the numbers of all parts of the granule that have at least one of neighbor. In `WHERE` query they remember only those parts that satisfy the requirements.
+Both types of queries are handled the same way. The indexes get `n` neighbors (where `n` is taken from the `LIMIT` clause) and work with them. In `ORDER BY` query they remember the numbers of all parts of the granule that have at least one of neighbor. In `WHERE` query they remember only those parts that satisfy the requirements.

-###### Create table with ANNIndex
-```
+## Create table with ANNIndex
+
+```sql
 CREATE TABLE t
 (
  `id` Int64,
@@ -42,7 +52,18 @@ ENGINE = MergeTree
 ORDER BY id;
 ```

-Number of granules in granularity should be large. With greater `GRANULARITY` indexes remember the data structure better. But some indexes can't be built if they don't have enough data, so this granule will always participate in the query. For more information, see the description of indexes.
+```sql
+CREATE TABLE t
+(
+  `id` Int64,
+  `number` Array(Float32),
+  INDEX x number TYPE annoy GRANULARITY N
+)
+ENGINE = MergeTree
+ORDER BY id;
+```
+
+Number of granules in granularity should be large. With greater `GRANULARITY` indexes remember the data structure better. The `GRANULARITY` indicates how many granules will be used to construct the index. The more data is provided for the index, the more of it can be handled by one index and the more chances that with the right hyperparameters the index will remember the data structure better. But some indexes can't be built if they don't have enough data, so this granule will always participate in the query. For more information, see the description of indexes.

 As the indexes are built only during insertions into table, `INSERT` and `OPTIMIZE` queries are slower than for ordinary table. At this stage indexes remember all the information about the given data. ANNIndexes should be used if you have immutable or rarely changed data and many read requests.
    
@@ -69,6 +90,17 @@ CREATE TABLE t
 ENGINE = MergeTree
 ORDER BY id;
 ```
+
+```sql
+CREATE TABLE t
+(
+  id Int64,
+  number Array(Float32),
+  INDEX x number TYPE annoy(T) GRANULARITY N
+)
+ENGINE = MergeTree
+ORDER BY id;
+```
 Parameter `T` is the number of trees which algorithm will create. The bigger it is, the slower (approximately linear) it works (in both `CREATE` and `SELECT` requests), but the better accuracy you get (adjusted for randomness).

 In the `SELECT` in the settings (`ann_index_params`) you can specify the size of the internal buffer (more details in the description above or in the [original repository](https://github.com/spotify/annoy)).
--- a/src/Storages/MergeTree/CommonANNIndexes.cpp
+++ b/src/Storages/MergeTree/CommonANNIndexes.cpp
@@ -21,6 +21,34 @@ namespace ErrorCodes
    extern const int INCORRECT_QUERY;
 }

+namespace
+{
+
+template <typename Literal>
+void extractTargetVectorFromLiteral(ANN::ANNQueryInformation::Embedding & target, Literal literal)
+{
+    Float64 float_element_of_target_vector;
+    Int64 int_element_of_target_vector;
+    
+    for (const auto & value : literal.value())
+    {
+        if (value.tryGet(float_element_of_target_vector))
+        {
+            target.emplace_back(float_element_of_target_vector);
+        }
+        else if (value.tryGet(int_element_of_target_vector))
+        {
+            target.emplace_back(static_cast<float>(int_element_of_target_vector));
+        }
+        else
+        {
+            throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported.");
+        }
+    }
+}
+
+}
+
 namespace ApproximateNearestNeighbour
 {

@@ -351,7 +379,6 @@ bool ANNCondition::matchRPNWhere(RPN & rpn, ANNQueryInformation & expr)
    }

    auto iter = rpn.begin();
-    bool identifier_found = false;

    // Query starts from operator less
    if (iter->function != RPNElement::FUNCTION_COMPARISON)
@@ -381,13 +408,7 @@ bool ANNCondition::matchRPNWhere(RPN & rpn, ANNQueryInformation & expr)
    }

    auto end = rpn.end();
-    if (!matchMainParts(iter, end, expr, identifier_found))
-    {
-        return false;
-    }
-
-    // Final checks of correctness
-    if (!identifier_found || expr.target.empty())
+    if (!matchMainParts(iter, end, expr))
    {
        return false;
    }
@@ -417,9 +438,8 @@ bool ANNCondition::matchRPNOrderBy(RPN & rpn, ANNQueryInformation & expr)

    auto iter = rpn.begin();
    auto end = rpn.end();
-    bool identifier_found = false;

-    return ANNCondition::matchMainParts(iter, end, expr, identifier_found);
+    return ANNCondition::matchMainParts(iter, end, expr);
 }

 // Returns true and stores Length if we have valid LIMIT clause in query
@@ -435,8 +455,10 @@ bool ANNCondition::matchRPNLimit(RPNElement & rpn, UInt64 & limit)
 }

 /* Matches dist function, target vector, column name */
-bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr, bool & identifier_found)
+bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr)
 {
+    bool identifier_found = false;
+
    // Matches DistanceFunc->[Column]->[Tuple(array)Func]->TargetVector(floats)->[Column]
    if (iter->function != RPNElement::FUNCTION_DISTANCE)
    {
@@ -457,7 +479,6 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN
        ++iter;
    }

-
    if (iter->function == RPNElement::FUNCTION_IDENTIFIER)
    {
        identifier_found = true;
@@ -470,50 +491,15 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN
        ++iter;
    }

-    ///TODO: optimize
    if (iter->function == RPNElement::FUNCTION_LITERAL_TUPLE)
    {
-        Float64 float_element_of_target_vector;
-        Int64 int_element_of_target_vector;
-        
-        for (const auto & value : iter->tuple_literal.value())
-        {
-            if (value.tryGet(float_element_of_target_vector))
-            {
-                expr.target.emplace_back(float_element_of_target_vector);
-            }
-            else if (value.tryGet(int_element_of_target_vector))
-            {
-                expr.target.emplace_back(static_cast<float>(int_element_of_target_vector));
-            }
-            else
-            {
-                throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported.");
-            }
-        }
+        extractTargetVectorFromLiteral(expr.target, iter->tuple_literal);
        ++iter;
    }

    if (iter->function == RPNElement::FUNCTION_LITERAL_ARRAY)
    {
-        Float64 float_element_of_target_vector;
-        Int64 int_element_of_target_vector;
-        
-        for (const auto & value : iter->array_literal.value())
-        {
-            if (value.tryGet(float_element_of_target_vector))
-            {
-                expr.target.emplace_back(float_element_of_target_vector);
-            }
-            else if (value.tryGet(int_element_of_target_vector))
-            {
-                expr.target.emplace_back(static_cast<float>(int_element_of_target_vector));
-            }
-            else
-            {
-                throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported.");
-            }
-        }
+        extractTargetVectorFromLiteral(expr.target, iter->array_literal);
        ++iter;
    }

@@ -541,6 +527,12 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN
        ++iter;
    }

+    // Final checks of correctness
+    if (!identifier_found || expr.target.empty())
+    {
+        return false;
+    }
+
    return true;
 }

--- a/src/Storages/MergeTree/CommonANNIndexes.h
+++ b/src/Storages/MergeTree/CommonANNIndexes.h
@@ -202,8 +202,7 @@ private:
    static bool matchRPNLimit(RPNElement & rpn, UInt64 & limit);

    /* Matches dist function, target vector, column name */
-    ///TODO: identifier_found
-    static bool matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr, bool & identifier_found);
+    static bool matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr);

    // Gets float or int from AST node
    static float getFloatOrIntLiteralOrPanic(RPN::iterator& iter);
@@ -226,8 +225,8 @@ public:
    virtual std::vector<size_t> getUsefulRanges(MergeTreeIndexGranulePtr idx_granule) const = 0;
 };

-} // namespace ApproximateNearestNeighbour
+}

 namespace ANN = ApproximateNearestNeighbour;

-} // namespace DB
+}
--- a/src/Storages/MergeTree/MergeTreeIndexAnnoy.cpp
+++ b/src/Storages/MergeTree/MergeTreeIndexAnnoy.cpp
@@ -192,7 +192,7 @@ void MergeTreeIndexAggregatorAnnoy::update(const Block & block, size_t * pos, si
        }
        for (const auto& item : data)
        {
-            index_base->add_item(index_base->get_n_items(), &item[0]);
+            index_base->add_item(index_base->get_n_items(), item.data());
        }
    }

@@ -226,7 +226,7 @@ std::vector<size_t> MergeTreeIndexConditionAnnoy::getUsefulRanges(MergeTreeIndex
        = condition.queryHasWhereClause() ? std::optional<float>(condition.getComparisonDistanceForWhereQuery()) : std::nullopt;

    if (comp_dist && comp_dist.value() < 0)
-        throw Exception(ErrorCodes::LOGICAL_ERROR, "Attemp to optimize query with where without distance");
+        throw Exception(ErrorCodes::LOGICAL_ERROR, "Attempt to optimize query with where without distance");

    std::vector<float> target_vec = condition.getTargetVector();

@@ -261,7 +261,7 @@ std::vector<size_t> MergeTreeIndexConditionAnnoy::getUsefulRanges(MergeTreeIndex
            throw Exception("Setting of the annoy index should be int", ErrorCodes::INCORRECT_QUERY);
        }
    }
-    annoy->get_nns_by_vector(&target_vec[0], 1, k_search, &items, &dist);
+    annoy->get_nns_by_vector(target_vec.data(), 1, k_search, &items, &dist);
    std::unordered_set<size_t> result;
    for (size_t i = 0; i < items.size(); ++i)
    {
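
Reviewer note: the main change in `CommonANNIndexes.cpp` replaces two copy-pasted loops (one for `tuple_literal`, one for `array_literal`) with a single function template. The sketch below illustrates that pattern outside ClickHouse. Everything in it is an assumption made for illustration: `Value` is a `std::variant` standing in for the real `Field` type, `std::vector` and `std::deque` stand in for the tuple and array payloads, `std::invalid_argument` stands in for `Exception` with `INCORRECT_QUERY`, and the literal is taken by const reference where the commit takes it by value.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a dynamically typed field:
// it may hold a float, an integer, or an unsupported type.
using Value = std::variant<double, std::int64_t, std::string>;

using Embedding = std::vector<float>;

// One template covers every optional-like literal whose value()
// is iterable over Value, so the loop body is written only once.
template <typename Literal>
void extractTargetVector(Embedding & target, const Literal & literal)
{
    for (const auto & value : literal.value())
    {
        if (const auto * as_float = std::get_if<double>(&value))
            target.emplace_back(static_cast<float>(*as_float));
        else if (const auto * as_int = std::get_if<std::int64_t>(&value))
            target.emplace_back(static_cast<float>(*as_int));
        else
            throw std::invalid_argument(
                "Wrong type of elements in target vector. Only float or int are supported.");
    }
}
```

Calling `extractTargetVector(target, tuple_literal)` and `extractTargetVector(target, array_literal)` with two different container types appends to the same `target`, which mirrors how the refactored `matchMainParts` handles both literal kinds with one helper.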