ClickHouse/docs/en/engines/table-engines/mergetree-family/invertedindexes.md

---
slug: /en/engines/table-engines/mergetree-family/invertedindexes
sidebar_label:  Inverted Indexes
description: Quickly find search terms in text.
keywords: [full-text search, text search, inverted, index, indices]
---

# Full-text Search using Inverted Indexes [experimental]

Inverted indexes are an experimental type of [secondary indexes](/docs/en/engines/table-engines/mergetree-family/mergetree.md/#available-types-of-indices) which provide fast text search
capabilities for [String](/docs/en/sql-reference/data-types/string.md) or [FixedString](/docs/en/sql-reference/data-types/fixedstring.md)
columns. The main idea of an inverted index is to store a mapping from "terms" to the rows which contain these terms. "Terms" are
tokenized cells of the string column. For example, the string cell "I will be a little late" is by default tokenized into six terms "I", "will",
"be", "a", "little" and "late". Another kind of tokenizer is n-grams. For example, the result of 3-gram tokenization will be 21 terms "I w",
" wi", "wil", "ill", "ll ", "l b", " be" etc. The more fine-granular the input strings are tokenized, the bigger but also the more
useful the resulting inverted index will be.

:::warning
Inverted indexes are experimental and should not be used in production environments yet. They may change in the future in backward-incompatible
ways, for example with respect to their DDL/DQL syntax or performance/compression characteristics.
:::

## Usage

To use inverted indexes, first enable them in the configuration:

```sql
SET allow_experimental_inverted_index = true;
```

An inverted index can be defined on a string column using the following syntax

``` sql
CREATE TABLE tab
(
    `key` UInt64,
    `str` String,
    INDEX inv_idx(str) TYPE inverted(0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY key
```

where `N` specifies the tokenizer:

- `inverted(0)` (or shorter: `inverted()`) set the tokenizer to "tokens", i.e. split strings along spaces,
- `inverted(N)` with `N` between 2 and 8 sets the tokenizer to "ngrams(N)"

Being a type of skipping index, inverted indexes can be dropped or added to a column after table creation:

``` sql
ALTER TABLE tab DROP INDEX inv_idx;
ALTER TABLE tab ADD INDEX inv_idx(s) TYPE inverted(2);
```

To use the index, no special functions or syntax are required. Typical string search predicates automatically leverage the index. As
examples, consider:

```sql
INSERT INTO tab(key, str) values (1, 'Hello World');
SELECT * from tab WHERE str == 'Hello World';
SELECT * from tab WHERE str IN ('Hello', 'World');
SELECT * from tab WHERE str LIKE '%Hello%';
SELECT * from tab WHERE multiSearchAny(str, ['Hello', 'World']);
SELECT * from tab WHERE hasToken(str, 'Hello');
```

The inverted index also works on columns of type `Array(String)`, `Array(FixedString)`, `Map(String)` and `Map(String)`.

Like for other secondary indices, each column part has its own inverted index. Furthermore, each inverted index is internally divided into
"segments". The existence and size of the segments are generally transparent to users but the segment size determines the memory consumption
during index construction (e.g. when two parts are merged). Configuration parameter "max_digestion_size_per_segment" (default: 256 MB)
controls the amount of data read consumed from the underlying column before a new segment is created. Incrementing the parameter raises the
intermediate memory consumption for index construction but also improves lookup performance since fewer segments need to be checked on
average to evaluate a query.

## Full-text search of the Hacker News dataset

Let's look at the performance improvements of inverted indexes on a large dataset with lots of text. We will use 28.7M rows of comments on the popular Hacker News website. Here is the table without an inverted index:

```sql
CREATE TABLE hackernews (
    id UInt64,
    deleted UInt8,
    type String,
    author String,
    timestamp DateTime,
    comment String,
	dead UInt8,
	parent UInt64,
	poll UInt64,
    children Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32
)
ENGINE = MergeTree
ORDER BY (type, author);
```

The 28.7M rows are in a Parquet file in S3 - let's insert them into the `hackernews` table:

```sql
INSERT INTO hackernews
	SELECT * FROM s3Cluster(
        'default',
        'https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet',
        'Parquet',
        '
    id UInt64,
    deleted UInt8,
    type String,
    by String,
    time DateTime,
    text String,
	dead UInt8,
	parent UInt64,
	poll UInt64,
    kids Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32');
```

Consider the following simple search for the term `ClickHouse` (and its varied upper and lower cases) in the `comment` column:

```sql
SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse');
```

Notice it takes 3 seconds to execute the query:

```response
┌─count()─┐
│    1145 │
└─────────┘

1 row in set. Elapsed: 3.001 sec. Processed 28.74 million rows, 9.75 GB (9.58 million rows/s., 3.25 GB/s.)
```

We will use `ALTER TABLE` and add an inverted index on the lowercase of the `comment` column, then materialize it (which can take a while - wait for it to materialize):

```sql
ALTER TABLE hackernews
     ADD INDEX comment_lowercase(lower(comment)) TYPE inverted;

ALTER TABLE hackernews MATERIALIZE INDEX comment_lowercase;
```

We run the same query...

```sql
SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse')
```

...and notice the query executes 4x faster:

```response
┌─count()─┐
│    1145 │
└─────────┘

1 row in set. Elapsed: 0.747 sec. Processed 4.49 million rows, 1.77 GB (6.01 million rows/s., 2.37 GB/s.)
```

:::note
Unlike other secondary indices, inverted indexes (for now) map to row numbers (row ids) instead of granule ids. The reason for this design
is performance. In practice, users often search for multiple terms at once. For example, filter predicate `WHERE s LIKE '%little%' OR s LIKE
'%big%'` can be evaluated directly using an inverted index by forming the union of the row id lists for terms "little" and "big". This also
means that the parameter `GRANULARITY` supplied to index creation has no meaning (it may be removed from the syntax in the future).
:::
edits, add frontmatter 2023-01-26 13:58:10 +00:00			`---`
			`slug: /en/engines/table-engines/mergetree-family/invertedindexes`
capitalize sidebar 2023-01-26 14:05:28 +00:00			`sidebar_label: Inverted Indexes`
add keywords 2023-01-26 15:06:06 +00:00			`description: Quickly find search terms in text.`
Update invertedindexes.md 2023-02-02 03:55:58 +00:00			`keywords: [full-text search, text search, inverted, index, indices]`
edits, add frontmatter 2023-01-26 13:58:10 +00:00			`---`

Update invertedindexes.md 2023-02-02 03:55:58 +00:00			`# Full-text Search using Inverted Indexes [experimental]`
Initial inverted index docs 2023-01-19 18:38:07 +00:00
Update docs/en/engines/table-engines/mergetree-family/invertedindexes.md 2023-01-26 15:51:08 +00:00			`Inverted indexes are an experimental type of [secondary indexes](/docs/en/engines/table-engines/mergetree-family/mergetree.md/#available-types-of-indices) which provide fast text search`
add keywords 2023-01-26 15:06:06 +00:00			`capabilities for [String](/docs/en/sql-reference/data-types/string.md) or [FixedString](/docs/en/sql-reference/data-types/fixedstring.md)`
edits, add frontmatter 2023-01-26 13:58:10 +00:00			`columns. The main idea of an inverted index is to store a mapping from "terms" to the rows which contain these terms. "Terms" are`
			`tokenized cells of the string column. For example, the string cell "I will be a little late" is by default tokenized into six terms "I", "will",`
			`"be", "a", "little" and "late". Another kind of tokenizer is n-grams. For example, the result of 3-gram tokenization will be 21 terms "I w",`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			`" wi", "wil", "ill", "ll ", "l b", " be" etc. The more fine-granular the input strings are tokenized, the bigger but also the more`
Update invertedindexes.md 2023-02-02 03:55:58 +00:00			`useful the resulting inverted index will be.`
Initial inverted index docs 2023-01-19 18:38:07 +00:00
			`:::warning`
edits, add frontmatter 2023-01-26 13:58:10 +00:00			`Inverted indexes are experimental and should not be used in production environments yet. They may change in the future in backward-incompatible`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			`ways, for example with respect to their DDL/DQL syntax or performance/compression characteristics.`
			`:::`

			`## Usage`

			`To use inverted indexes, first enable them in the configuration:`

			```sql
			`SET allow_experimental_inverted_index = true;`
			```

			`An inverted index can be defined on a string column using the following syntax`

			``` sql
Update invertedindexes.md 2023-01-26 19:05:26 +00:00			`CREATE TABLE tab`
			`(`
			`key` UInt64,
			`str` String,
Update docs/en/engines/table-engines/mergetree-family/invertedindexes.md 2023-01-26 19:07:13 +00:00			`INDEX inv_idx(str) TYPE inverted(0) GRANULARITY 1`
Update invertedindexes.md 2023-01-26 19:05:26 +00:00			`)`
			`ENGINE = MergeTree`
			`ORDER BY key`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			```

			where `N` specifies the tokenizer:

			- `inverted(0)` (or shorter: `inverted()`) set the tokenizer to "tokens", i.e. split strings along spaces,
			- `inverted(N)` with `N` between 2 and 8 sets the tokenizer to "ngrams(N)"

edits, add frontmatter 2023-01-26 13:58:10 +00:00			`Being a type of skipping index, inverted indexes can be dropped or added to a column after table creation:`
Initial inverted index docs 2023-01-19 18:38:07 +00:00
			``` sql
Fix broken examples for inverted indexes 2023-01-29 23:12:32 +00:00			`ALTER TABLE tab DROP INDEX inv_idx;`
Update invertedindexes.md 2023-02-02 03:55:58 +00:00			`ALTER TABLE tab ADD INDEX inv_idx(s) TYPE inverted(2);`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			```

			`To use the index, no special functions or syntax are required. Typical string search predicates automatically leverage the index. As`
			`examples, consider:`

			```sql
Fix broken examples for inverted indexes 2023-01-29 23:12:32 +00:00			`INSERT INTO tab(key, str) values (1, 'Hello World');`
			`SELECT * from tab WHERE str == 'Hello World';`
			`SELECT * from tab WHERE str IN ('Hello', 'World');`
			`SELECT * from tab WHERE str LIKE '%Hello%';`
			`SELECT * from tab WHERE multiSearchAny(str, ['Hello', 'World']);`
			`SELECT * from tab WHERE hasToken(str, 'Hello');`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			```

			The inverted index also works on columns of type `Array(String)`, `Array(FixedString)`, `Map(String)` and `Map(String)`.

			`Like for other secondary indices, each column part has its own inverted index. Furthermore, each inverted index is internally divided into`
edits, add frontmatter 2023-01-26 13:58:10 +00:00			`"segments". The existence and size of the segments are generally transparent to users but the segment size determines the memory consumption`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			`during index construction (e.g. when two parts are merged). Configuration parameter "max_digestion_size_per_segment" (default: 256 MB)`
			`controls the amount of data read consumed from the underlying column before a new segment is created. Incrementing the parameter raises the`
edits, add frontmatter 2023-01-26 13:58:10 +00:00			`intermediate memory consumption for index construction but also improves lookup performance since fewer segments need to be checked on`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			`average to evaluate a query.`

Update invertedindexes.md 2023-02-02 03:55:58 +00:00			`## Full-text search of the Hacker News dataset`

			`Let's look at the performance improvements of inverted indexes on a large dataset with lots of text. We will use 28.7M rows of comments on the popular Hacker News website. Here is the table without an inverted index:`

			```sql
			`CREATE TABLE hackernews (`
			`id UInt64,`
			`deleted UInt8,`
			`type String,`
			`author String,`
			`timestamp DateTime,`
			`comment String,`
			`dead UInt8,`
			`parent UInt64,`
			`poll UInt64,`
			`children Array(UInt32),`
			`url String,`
			`score UInt32,`
			`title String,`
			`parts Array(UInt32),`
			`descendants UInt32`
			`)`
			`ENGINE = MergeTree`
			`ORDER BY (type, author);`
			```

			The 28.7M rows are in a Parquet file in S3 - let's insert them into the `hackernews` table:

			```sql
			`INSERT INTO hackernews`
			`SELECT * FROM s3Cluster(`
			`'default',`
			`'https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet',`
			`'Parquet',`
			`'`
			`id UInt64,`
			`deleted UInt8,`
			`type String,`
			`by String,`
			`time DateTime,`
			`text String,`
			`dead UInt8,`
			`parent UInt64,`
			`poll UInt64,`
			`kids Array(UInt32),`
			`url String,`
			`score UInt32,`
			`title String,`
			`parts Array(UInt32),`
			`descendants UInt32');`
			```

			Consider the following simple search for the term `ClickHouse` (and its varied upper and lower cases) in the `comment` column:

			```sql
			`SELECT count()`
			`FROM hackernews`
			`WHERE hasToken(lower(comment), 'clickhouse');`
			```

			`Notice it takes 3 seconds to execute the query:`

			```response
			`┌─count()─┐`
			`│ 1145 │`
			`└─────────┘`

			`1 row in set. Elapsed: 3.001 sec. Processed 28.74 million rows, 9.75 GB (9.58 million rows/s., 3.25 GB/s.)`
			```

			We will use `ALTER TABLE` and add an inverted index on the lowercase of the `comment` column, then materialize it (which can take a while - wait for it to materialize):

			```sql
			`ALTER TABLE hackernews`
			`ADD INDEX comment_lowercase(lower(comment)) TYPE inverted;`

			`ALTER TABLE hackernews MATERIALIZE INDEX comment_lowercase;`
			```

			`We run the same query...`

			```sql
			`SELECT count()`
			`FROM hackernews`
			`WHERE hasToken(lower(comment), 'clickhouse')`
			```

			`...and notice the query executes 4x faster:`

			```response
			`┌─count()─┐`
			`│ 1145 │`
			`└─────────┘`

			`1 row in set. Elapsed: 0.747 sec. Processed 4.49 million rows, 1.77 GB (6.01 million rows/s., 2.37 GB/s.)`
			```

			`:::note`
Initial inverted index docs 2023-01-19 18:38:07 +00:00			`Unlike other secondary indices, inverted indexes (for now) map to row numbers (row ids) instead of granule ids. The reason for this design`
			is performance. In practice, users often search for multiple terms at once. For example, filter predicate `WHERE s LIKE '%little%' OR s LIKE
edits, add frontmatter 2023-01-26 13:58:10 +00:00			'%big%'` can be evaluated directly using an inverted index by forming the union of the row id lists for terms "little" and "big". This also
			means that the parameter `GRANULARITY` supplied to index creation has no meaning (it may be removed from the syntax in the future).
Update invertedindexes.md 2023-02-02 03:55:58 +00:00			`:::`