refactoring

FArthur-cmd 2022-07-12 13:57:16 +03:00
parent fb5272971c
commit 0e89fbff53
5 changed files with 97 additions and 74 deletions

@ -2,6 +2,15 @@
The main task that indexes help to solve is finding the nearest neighbors in multidimensional data. One example is searching for similar pictures or texts, where the problem reduces to finding the nearest [embeddings](https://cloud.google.com/architecture/overview-extracting-and-serving-feature-embeddings-for-machine-learning). Embeddings can be created from data using a [UDF](../../../sql-reference/functions/index.md#executable-user-defined-functions).
The following query finds the closest neighbors in L2 space:
``` sql
SELECT *
FROM table_name
WHERE L2Distance(Column, TargetEmbedding) < Value
LIMIT N
```
However, this query takes a long time to execute because the distance between `TargetEmbedding` and every other vector must be calculated. This is where indexes can help. Because they store only an overall structure of the data (built by clustering, search trees, etc.), they return only approximate nearest neighbors.
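To make the cost of the exhaustive scan concrete, here is a minimal C++ sketch of what evaluating such a query without an index amounts to. The names `l2Distance` and `bruteForceSearch` are hypothetical and are not part of ClickHouse; the sketch only assumes that every stored vector has the same dimension as the target.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Euclidean (L2) distance between two vectors of equal size.
float l2Distance(const std::vector<float> & a, const std::vector<float> & b)
{
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i)
    {
        const float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Exhaustive scan: visits rows in order and returns up to `limit` row numbers
// whose distance to `target` is below `max_distance`.
std::vector<size_t> bruteForceSearch(
    const std::vector<std::vector<float>> & rows,
    const std::vector<float> & target,
    float max_distance,
    size_t limit)
{
    std::vector<size_t> result;
    for (size_t row = 0; row < rows.size() && result.size() < limit; ++row)
        if (l2Distance(rows[row], target) < max_distance)
            result.push_back(row);
    return result;
}
```

Every call computes one distance per visited row, so the work grows with both the number of rows and the vector dimension. An index avoids most of these distance computations at the price of approximate answers.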
## Indexes Structure
Approximate Nearest Neighbor Search Indexes (`ANNIndexes`) are similar to skip indexes. They are built over groups of granules and determine which of them should be skipped. Compared to skip indexes, ANN indexes use their results not only to skip groups of granules but also to select particular granules from a set of granules.
@ -12,26 +21,27 @@ Approximate Nearest Neighbor Search Indexes (`ANNIndexes`) are simmilar to skip
``` sql
SELECT *
FROM table_name
WHERE DistanceFunction(Column, TargetVector) < Value
WHERE DistanceFunction(Column, TargetEmbedding) < Value
LIMIT N
```
- ###### Type 2: Order by
``` sql
SELECT *
FROM table_name [WHERE ...]
ORDER BY DistanceFunction(Column, TargetVector)
ORDER BY DistanceFunction(Column, TargetEmbedding)
LIMIT N
```
In these queries, `DistanceFunction` is selected from tuples of distance functions. `TargetVector` is a known embedding (something like `(0.1, 0.1, ... )`). `Value` - a float value that will bound the neighbourhood.
In these queries, `DistanceFunction` is one of the [distance functions for tuples](../../../sql-reference/functions/tuple-functions/#l1norm). `TargetEmbedding` is a known embedding (something like `(0.1, 0.1, ... )`). `Value` is a float that bounds the neighbourhood.
!!! note "Note"
ANNIndex can't speed up query that satisfies both types and they work only for Tuples. All queries must have the limit, as algorithms are used to find nearest neighbors and need a specific number of them.
ANNIndex can't speed up a query that matches both types (`WHERE` + `ORDER BY`); only one of them is supported. All queries must have a `LIMIT`, because the algorithms search for nearest neighbors and need a specific number of them.
Both types of queries are handled the same way. The indexes get `n` neighbors (where `n` is taken from the `LIMIT` section) and work with them. In `ORDER BY` query they remember the numbers of all parts of the granule that have at least one of neighbor. In `WHERE` query they remember only those parts that satisfy the requirements.
Both types of queries are handled the same way. The indexes get `n` neighbors (where `n` is taken from the `LIMIT` clause) and work with them. In an `ORDER BY` query they remember the numbers of all parts of the granule that contain at least one neighbor. In a `WHERE` query they remember only those parts that satisfy the requirements.
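The selection step described above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the ClickHouse implementation: the function `selectParts`, the parameter `rows_per_part`, and the idea that a neighbor's row number divided by the part size gives its part number are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical sketch: maps the neighbors returned by an index to the parts that must be read.
// `rows_per_part` (rows in one part of the indexed block) is an assumption of this example.
// `max_distance` is set for a WHERE query and empty for an ORDER BY query.
std::set<size_t> selectParts(
    const std::vector<size_t> & neighbor_rows,
    const std::vector<float> & distances,
    size_t rows_per_part,
    std::optional<float> max_distance)
{
    assert(neighbor_rows.size() == distances.size());
    assert(rows_per_part > 0);

    std::set<size_t> parts;
    for (size_t i = 0; i < neighbor_rows.size(); ++i)
    {
        // WHERE query: drop neighbors outside the bound. ORDER BY query: keep every neighbor.
        if (max_distance && distances[i] >= *max_distance)
            continue;
        parts.insert(neighbor_rows[i] / rows_per_part);
    }
    return parts;
}
```

With neighbors in rows 1, 9 and 17, distances 0.5, 1.5 and 3.0, and 8 rows per part, the `ORDER BY` form keeps parts 0, 1 and 2, while a `WHERE` bound of 2.0 keeps only parts 0 and 1.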
###### Create table with ANNIndex
```
## Create table with ANNIndex
```sql
CREATE TABLE t
(
`id` Int64,
@ -42,7 +52,18 @@ ENGINE = MergeTree
ORDER BY id;
```
Number of granules in granularity should be large. With greater `GRANULARITY` indexes remember the data structure better. But some indexes can't be built if they don't have enough data, so this granule will always participate in the query. For more information, see the description of indexes.
```sql
CREATE TABLE t
(
`id` Int64,
`number` Array(Float32),
INDEX x number TYPE annoy GRANULARITY N
)
ENGINE = MergeTree
ORDER BY id;
```
The number of granules in `GRANULARITY` should be large. `GRANULARITY` indicates how many granules are used to construct one index. The more data one index covers, the better the chance that, with the right hyperparameters, it remembers the data structure well. However, some indexes can't be built if they don't have enough data; in that case the granule always participates in the query. For more information, see the description of each index.
Because the indexes are built only during insertions into the table, `INSERT` and `OPTIMIZE` queries are slower than for an ordinary table. At this stage the indexes remember all the information about the given data. Use ANNIndexes if your data is immutable or rarely changes and you have many read requests.
@ -69,6 +90,17 @@ CREATE TABLE t
ENGINE = MergeTree
ORDER BY id;
```
```sql
CREATE TABLE t
(
id Int64,
number Array(Float32),
INDEX x number TYPE annoy(T) GRANULARITY N
)
ENGINE = MergeTree
ORDER BY id;
```
Parameter `T` is the number of trees the algorithm creates. The larger it is, the slower the index works (approximately linearly, in both `CREATE` and `SELECT` requests), but the better the accuracy (adjusted for randomness).
In a `SELECT` query, the `ann_index_params` setting lets you specify the size of the internal buffer (more details in the description above or in the [original repository](https://github.com/spotify/annoy)).

@ -21,6 +21,34 @@ namespace ErrorCodes
extern const int INCORRECT_QUERY;
}
namespace
{
template <typename Literal>
void extractTargetVectorFromLiteral(ANN::ANNQueryInformation::Embedding & target, Literal literal)
{
Float64 float_element_of_target_vector;
Int64 int_element_of_target_vector;
for (const auto & value : literal.value())
{
if (value.tryGet(float_element_of_target_vector))
{
target.emplace_back(float_element_of_target_vector);
}
else if (value.tryGet(int_element_of_target_vector))
{
target.emplace_back(static_cast<float>(int_element_of_target_vector));
}
else
{
throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported.");
}
}
}
}
namespace ApproximateNearestNeighbour
{
@ -351,7 +379,6 @@ bool ANNCondition::matchRPNWhere(RPN & rpn, ANNQueryInformation & expr)
}
auto iter = rpn.begin();
bool identifier_found = false;
// Query starts from operator less
if (iter->function != RPNElement::FUNCTION_COMPARISON)
@ -381,13 +408,7 @@ bool ANNCondition::matchRPNWhere(RPN & rpn, ANNQueryInformation & expr)
}
auto end = rpn.end();
if (!matchMainParts(iter, end, expr, identifier_found))
{
return false;
}
// Final checks of correctness
if (!identifier_found || expr.target.empty())
if (!matchMainParts(iter, end, expr))
{
return false;
}
@ -417,9 +438,8 @@ bool ANNCondition::matchRPNOrderBy(RPN & rpn, ANNQueryInformation & expr)
auto iter = rpn.begin();
auto end = rpn.end();
bool identifier_found = false;
return ANNCondition::matchMainParts(iter, end, expr, identifier_found);
return ANNCondition::matchMainParts(iter, end, expr);
}
// Returns true and stores Length if we have valid LIMIT clause in query
@ -435,8 +455,10 @@ bool ANNCondition::matchRPNLimit(RPNElement & rpn, UInt64 & limit)
}
/* Matches dist function, target vector, column name */
bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr, bool & identifier_found)
bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr)
{
bool identifier_found = false;
// Matches DistanceFunc->[Column]->[Tuple(array)Func]->TargetVector(floats)->[Column]
if (iter->function != RPNElement::FUNCTION_DISTANCE)
{
@ -457,7 +479,6 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN
++iter;
}
if (iter->function == RPNElement::FUNCTION_IDENTIFIER)
{
identifier_found = true;
@ -470,50 +491,15 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN
++iter;
}
///TODO: optimize
if (iter->function == RPNElement::FUNCTION_LITERAL_TUPLE)
{
Float64 float_element_of_target_vector;
Int64 int_element_of_target_vector;
for (const auto & value : iter->tuple_literal.value())
{
if (value.tryGet(float_element_of_target_vector))
{
expr.target.emplace_back(float_element_of_target_vector);
}
else if (value.tryGet(int_element_of_target_vector))
{
expr.target.emplace_back(static_cast<float>(int_element_of_target_vector));
}
else
{
throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported.");
}
}
extractTargetVectorFromLiteral(expr.target, iter->tuple_literal);
++iter;
}
if (iter->function == RPNElement::FUNCTION_LITERAL_ARRAY)
{
Float64 float_element_of_target_vector;
Int64 int_element_of_target_vector;
for (const auto & value : iter->array_literal.value())
{
if (value.tryGet(float_element_of_target_vector))
{
expr.target.emplace_back(float_element_of_target_vector);
}
else if (value.tryGet(int_element_of_target_vector))
{
expr.target.emplace_back(static_cast<float>(int_element_of_target_vector));
}
else
{
throw Exception(ErrorCodes::INCORRECT_QUERY, "Wrong type of elements in target vector. Only float or int are supported.");
}
}
extractTargetVectorFromLiteral(expr.target, iter->array_literal);
++iter;
}
@ -541,6 +527,12 @@ bool ANNCondition::matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANN
++iter;
}
// Final checks of correctness
if (!identifier_found || expr.target.empty())
{
return false;
}
return true;
}
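
The deduplication in this file (one templated helper replacing two identical loops over the tuple and array literals) can be tried outside ClickHouse with the self-contained sketch below. It is an illustration only: `Value`, `Literal` and `extractTarget` are hypothetical stand-ins built on `std::variant` and `std::optional`, not the real `Field` or RPN element types, and `std::invalid_argument` replaces the `INCORRECT_QUERY` exception.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a literal element: a float, an integer, or a string.
using Value = std::variant<double, int64_t, std::string>;
// Hypothetical stand-in for the optional tuple/array literal stored in an RPN element.
using Literal = std::optional<std::vector<Value>>;

// One helper serves both literal kinds, so the conversion rules live in a single place.
void extractTarget(std::vector<float> & target, const Literal & literal)
{
    // Assumption of this sketch: the caller has already checked that the literal is present.
    assert(literal.has_value());
    for (const auto & value : *literal)
    {
        if (const auto * as_float = std::get_if<double>(&value))
            target.emplace_back(static_cast<float>(*as_float));
        else if (const auto * as_int = std::get_if<int64_t>(&value))
            target.emplace_back(static_cast<float>(*as_int));
        else
            throw std::invalid_argument("Wrong type of elements in target vector. Only float or int are supported.");
    }
}
```

Passing a literal holding `0.5` and the integer `2` appends `0.5f` and `2.0f` to the target; a string element raises the exception, mirroring the error path in the diff.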

@ -202,8 +202,7 @@ private:
static bool matchRPNLimit(RPNElement & rpn, UInt64 & limit);
/* Matches dist function, target vector, column name */
///TODO: identifier_found
static bool matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr, bool & identifier_found);
static bool matchMainParts(RPN::iterator & iter, RPN::iterator & end, ANNQueryInformation & expr);
// Gets float or int from AST node
static float getFloatOrIntLiteralOrPanic(RPN::iterator& iter);
@ -226,8 +225,8 @@ public:
virtual std::vector<size_t> getUsefulRanges(MergeTreeIndexGranulePtr idx_granule) const = 0;
};
} // namespace ApproximateNearestNeighbour
}
namespace ANN = ApproximateNearestNeighbour;
} // namespace DB
}

@ -192,7 +192,7 @@ void MergeTreeIndexAggregatorAnnoy::update(const Block & block, size_t * pos, si
}
for (const auto& item : data)
{
index_base->add_item(index_base->get_n_items(), &item[0]);
index_base->add_item(index_base->get_n_items(), item.data());
}
}
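
The switch from `&item[0]` to `item.data()` in this file matters mostly for empty vectors: indexing element 0 of an empty `std::vector` is undefined behavior, while `data()` is always valid to call. The sketch below shows the pattern with a hypothetical raw-pointer consumer, `sumBuffer`, standing in for a C-style API; neither function is part of ClickHouse or Annoy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical consumer of a raw buffer, standing in for a C-style API.
float sumBuffer(const float * values, size_t count)
{
    assert(values != nullptr || count == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i)
        sum += values[i];
    return sum;
}

float sumVector(const std::vector<float> & item)
{
    // data() is well defined even when the vector is empty; &item[0] is not.
    return sumBuffer(item.data(), item.size());
}
```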
@ -226,7 +226,7 @@ std::vector<size_t> MergeTreeIndexConditionAnnoy::getUsefulRanges(MergeTreeIndex
= condition.queryHasWhereClause() ? std::optional<float>(condition.getComparisonDistanceForWhereQuery()) : std::nullopt;
if (comp_dist && comp_dist.value() < 0)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Attemp to optimize query with where without distance");
throw Exception(ErrorCodes::LOGICAL_ERROR, "Attempt to optimize query with where without distance");
std::vector<float> target_vec = condition.getTargetVector();
@ -261,7 +261,7 @@ std::vector<size_t> MergeTreeIndexConditionAnnoy::getUsefulRanges(MergeTreeIndex
throw Exception("Setting of the annoy index should be int", ErrorCodes::INCORRECT_QUERY);
}
}
annoy->get_nns_by_vector(&target_vec[0], 1, k_search, &items, &dist);
annoy->get_nns_by_vector(target_vec.data(), 1, k_search, &items, &dist);
std::unordered_set<size_t> result;
for (size_t i = 0; i < items.size(); ++i)
{