Index types 'annoy' and 'usearch' were removed and replaced by
'vector_similarity' indexes in an earlier commit.
This means unfortuantely, that if customers have tables with these
indexes and upgrade, their database might not start anymore - the
system loads the metadata at startup, thinks something is wrong with
such tables, and halts immediately.
This commit adds support for loading and attaching such indexes back.
Data insert or use (search) return an error which recommends a migration
to 'vector_similarity' indexes. The implementation is generally similar
to what has recently been implemented for 'full_text' indexes [1, 2].
[1] https://github.com/ClickHouse/ClickHouse/pull/64656
[2] https://github.com/ClickHouse/ClickHouse/pull/64846
First, index type "vector_similarity" is more speaking and user-friendly
than "usearch". Second, we should not expose the name of the library
doing the job (usearch). Of course, the docs will continue to mention
usearch (credit where credit is due).
Existing setting `allow_experimental_usearch_index` was marked obsolete.
A new settings `allow_experimental_vector_similarity_index` was added.
These kind of vector search similarity queries are rather obscure and
rare in practice. They require the user to specify a maximum distance
which is not intuitive to obtain. Furthermore, these queries are not
natively supported in USearch, so the vector search index had to emulate
these queries.
Therefore simplifying the code base and restricting vector search to
ORDER-BY queries only.
Indexes for approximate nearest neighbourhood (ANN) search (USearch) can
be build on columns of type Array(Float32) or Tuple(Float32[, Float32[, ...]]).
In practice, Arrays(Float32) is the only relevant data type.
Arrays store high-dimensional embeddings consecutively (--> cache
locality) and the additional flexibility of different data types in a
tuple is not needed for vector search.
Therefore removing support for ANN indexes over tuple columns to
simplify the code, tests and docs.
Annoy indexes fell out of favor in the community, at least when it comes
to vector databases. Such indexes work okay-ish low dimensions but they
suffers badly from a curse of dimensionality which makes them inapt for
a high number of dimensions.
Now that Annoy is gone, issue (*) also disappears and we can drop
'no-ubsan', 'no-cpu-aarch64', and 'no-asan' from tests.
(*) spotify/annoy#456
Registers usearch and annoy properly via configure_config.cmake and
config.h.in like all other 3rd party libs, instead of (mis)using
target_compile_definitions.
No directory 'SimSIMD-map' exists, the build only worked because SimSIMD
support in usearch was (accidentally?) disabled. This commit corrects
the build description. SimSIMD support in usearch will be enabled by a
later commit.
CI found [1] failure of the test:
2024-08-11 21:06:07 /usr/share/clickhouse-test/queries/0_stateless/02122_join_group_by_timeout.sh: line 51: 52614 Killed timeout -s KILL $MAX_PROCESS_WAIT $CLICKHOUSE_CLIENT -q "SELECT a.name as n
And the problem is not the server, but the client, since query executed
for ~1 second:
2024.08.11 21:06:02.284318 [ 49232 ] {ba989ee2-f615-49ca-bcd8-31b3916aeb2c} <Debug> executeQuery: (from [::1]:54144) (comment: 02122_join_group_by_timeout.sh) SELECT a.name as n FROM ( SELECT 'Name' as name, number FROM system.numbers LIMIT 2000000 ) AS a, ( SELECT 'Name' as name2, number FROM system.numbers LIMIT 2000000 ) as b FORMAT Null SETTINGS max_execution_time = 1, timeout_overflow_mode = 'break' (stage: Complete)
2024.08.11 21:06:03.331249 [ 49232 ] {ba989ee2-f615-49ca-bcd8-31b3916aeb2c} <Debug> executeQuery: Read 517104 rows, 3.95 MiB in 1.072023 sec., 482362.78512681165 rows/sec., 3.68 MiB/sec.
[1]: https://s3.amazonaws.com/clickhouse-test-reports/67134/18da3f0ab63da1eef9396627d0dfd56cf5356f65/stateless_tests__msan__[1_4].html
So instead of using timeout, let's use time from the system.query_log
instead.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>