ClickHouse/docs/en/getting-started/example-datasets/laion.md

# Laion-400M dataset

The [Laion-400M dataset](https://laion.ai/blog/laion-400-open-dataset/) contains 400 million images with English image captions. Laion nowadays provides [an even larger dataset](https://laion.ai/blog/laion-5b/) but working with it will be similar.

The dataset contains the image URL, embeddings for both the image and the image caption, a similarity score between the image and the image caption, as well as metadata, e.g. the image width/height, the licence and a NSFW flag. We can use the dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse.

## Data preparation

The embeddings and the metadata are stored in separate files in the raw data. A data preparation step downloads the data, merges the files,
converts them to CSV and imports them into ClickHouse. You can use the following `download.sh` script for that:

```bash
number=${1}
if [[ $number == '' ]]; then
    number=1
fi;
wget --tries=100 https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/img_emb/img_emb_${number}.npy          # download image embedding
wget --tries=100 https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/text_emb/text_emb_${number}.npy        # download text embedding
wget --tries=100 https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/metadata/metadata_${number}.parquet    # download metadata
python3 process.py $number # merge files and convert to CSV
```
Script `process.py` is defined as follows:

```python
import pandas as pd
import numpy as np
import os
import sys

str_i = str(sys.argv[1])
npy_file = "img_emb_" + str_i + '.npy'
metadata_file = "metadata_" + str_i + '.parquet'
text_npy =  "text_emb_" + str_i + '.npy'

# load all files
im_emb = np.load(npy_file)
text_emb = np.load(text_npy) 
data = pd.read_parquet(metadata_file)

# combine files
data = pd.concat([data, pd.DataFrame({"image_embedding" : [*im_emb]}), pd.DataFrame({"text_embedding" : [*text_emb]})], axis=1, copy=False)

# columns to be imported into ClickHouse
data = data[['url', 'caption', 'NSFW', 'similarity', "image_embedding", "text_embedding"]]

# transform np.arrays to lists
data['image_embedding'] = data['image_embedding'].apply(lambda x: list(x))
data['text_embedding'] = data['text_embedding'].apply(lambda x: list(x))

# this small hack is needed becase caption sometimes contains all kind of quotes
data['caption'] = data['caption'].apply(lambda x: x.replace("'", " ").replace('"', " "))

# export data as CSV file
data.to_csv(str_i + '.csv', header=False)

# removed raw data files
os.system(f"rm {npy_file} {metadata_file} {text_npy}")
```

To start the data preparation pipeline, run:

```bash
seq 0 409 | xargs -P1 -I{} bash -c './download.sh {}'
```

The dataset is split into 410 files, each file contains ca. 1 million rows. If you like to work with a smaller subset of the data, simply adjust the limits, e.g. `seq 0 9 | ...`.

(The python script above is very slow (~2-10 minutes per file), takes a lot of memory (41 GB per file), and the resulting csv files are big (10 GB each), so be careful. If you have enough RAM, increase the `-P1` number for more parallelism. If this is still too slow, consider coming up with a better ingestion procedure - maybe converting the .npy files to parquet, then doing all the other processing with clickhouse.)

## Create table

To create a table without indexes, run:

```sql
CREATE TABLE laion
(
    `id` Int64,
    `url` String,
    `caption` String,
    `NSFW` String,
    `similarity` Float32,
    `image_embedding` Array(Float32),
    `text_embedding` Array(Float32)
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192
```

To import the CSV files into ClickHouse:

```sql
INSERT INTO laion FROM INFILE '{path_to_csv_files}/*.csv'
```

## Run a brute-force ANN search (without ANN index)

To run a brute-force approximate nearest neighbor search, run:

```sql
SELECT url, caption FROM laion ORDER BY L2Distance(image_embedding, {target:Array(Float32)}) LIMIT 30
```

`target` is an array of 512 elements and a client parameter. A convenient way to obtain such arrays will be presented at the end of the article. For now, we can run the embedding of a random cat picture as `target`.

**Result**

```
┌─url───────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption────────────────────────────────────────────────────────────────┐
│ https://s3.amazonaws.com/filestore.rescuegroups.org/6685/pictures/animals/13884/13884995/63318230_463x463.jpg │ Adoptable Female Domestic Short Hair                                   │
│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/6/239905226.jpg                                        │ Adopt A Pet :: Marzipan - New York, NY                                 │
│ http://d1n3ar4lqtlydb.cloudfront.net/9/2/4/248407625.jpg                                                      │ Adopt A Pet :: Butterscotch - New Castle, DE                           │
│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/e/e/c/245615237.jpg                                        │ Adopt A Pet :: Tiggy - Chicago, IL                                     │
│ http://pawsofcoronado.org/wp-content/uploads/2012/12/rsz_pumpkin.jpg                                          │ Pumpkin an orange tabby  kitten for adoption                           │
│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/7/8/3/188700997.jpg                                        │ Adopt A Pet :: Brian the Brad Pitt of cats - Frankfort, IL             │
│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/d/191533561.jpg                                        │ Domestic Shorthair Cat for adoption in Mesa, Arizona - Charlie         │
│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/0/1/2/221698235.jpg                                        │ Domestic Shorthair Cat for adoption in Marietta, Ohio - Daisy (Spayed) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

8 rows in set. Elapsed: 6.432 sec. Processed 19.65 million rows, 43.96 GB (3.06 million rows/s., 6.84 GB/s.)
```

## Run a ANN with an ANN index

Create a new table with an ANN index and insert the data from the existing table:

```sql
CREATE TABLE laion_annoy
(
    `id` Int64,
    `url` String,
    `caption` String,
    `NSFW` String,
    `similarity` Float32,
    `image_embedding` Array(Float32),
    `text_embedding` Array(Float32),
    INDEX annoy_image image_embedding TYPE annoy(),
    INDEX annoy_text text_embedding TYPE annoy()
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192;

INSERT INTO laion_annoy SELECT * FROM laion;
```

By default, Annoy indexes use the L2 distance as metric. Further tuning knobs for index creation and search are described in the Annoy index [documentation](../../engines/table-engines/mergetree-family/annindexes.md). Let's check now again with the same query:

```sql
SELECT url, caption FROM laion_annoy ORDER BY l2Distance(image_embedding, {target:Array(Float32)}) LIMIT 8
```

**Result**

```
┌─url──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption──────────────────────────────────────────────────────────────┐
│ http://tse1.mm.bing.net/th?id=OIP.R1CUoYp_4hbeFSHBaaB5-gHaFj                                                                                                                         │ bed bugs and pets can cats carry bed bugs pets adviser               │
│ http://pet-uploads.adoptapet.com/1/9/c/1963194.jpg?336w                                                                                                                              │ Domestic Longhair Cat for adoption in Quincy, Massachusetts - Ashley │
│ https://thumbs.dreamstime.com/t/cat-bed-12591021.jpg                                                                                                                                 │ Cat on bed Stock Image                                               │
│ https://us.123rf.com/450wm/penta/penta1105/penta110500004/9658511-portrait-of-british-short-hair-kitten-lieing-at-sofa-on-sun.jpg                                                    │ Portrait of british short hair kitten lieing at sofa on sun.         │
│ https://www.easypetmd.com/sites/default/files/Wirehaired%20Vizsla%20(2).jpg                                                                                                          │ Vizsla (Wirehaired) image 3                                          │
│ https://images.ctfassets.net/yixw23k2v6vo/0000000200009b8800000000/7950f4e1c1db335ef91bb2bc34428de9/dog-cat-flickr-Impatience_1.jpg?w=600&h=400&fm=jpg&fit=thumb&q=65&fl=progressive │ dog and cat image                                                    │
│ https://i1.wallbox.ru/wallpapers/small/201523/eaa582ee76a31fd.jpg                                                                                                                    │ cats, kittens, faces, tonkinese                                      │
│ https://www.baxterboo.com/images/breeds/medium/cairn-terrier.jpg                                                                                                                     │ Cairn Terrier Photo                                                  │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘

8 rows in set. Elapsed: 0.641 sec. Processed 22.06 thousand rows, 49.36 MB (91.53 thousand rows/s., 204.81 MB/s.)
```

The speed increased significantly at the cost of less accurate results. This is because the ANN index only provide approximate search results. Note the example searched for similar image embeddings, yet it is also possible to search for positive image caption embeddings.

## Creating embeddings with UDFs

One usually wants to create embeddings for new images or new image captions and search for similar image / image caption pairs in the data. We can use [UDF](../../sql-reference/functions/index.md#sql-user-defined-functions) to create the `target` vector without leaving the client. It is important to use the same model to create the data and new embeddings for searches. The following scripts utilize the `ViT-B/32` model which also underlies the dataset.

### Text embeddings

First, store the following Python script in the `user_scripts/` directory of your ClickHouse data path and make it executable (`chmod +x encode_text.py`).

`encode_text.py`:

```python
#!/usr/bin/python3
import clip
import torch
import numpy as np
import sys

if __name__ == '__main__':
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    for text in sys.stdin:
        inputs = clip.tokenize(text)
        with torch.no_grad():
            text_features = model.encode_text(inputs)[0].tolist()
            print(text_features)
        sys.stdout.flush()
```

Then create `encode_text_function.xml` in a location referenced by `<user_defined_executable_functions_config>/path/to/*_function.xml</user_defined_executable_functions_config>` in your ClickHouse server configuration file.

```xml
<functions>
    <function>
        <type>executable</type>
        <name>encode_text</name>
        <return_type>Array(Float32)</return_type>
        <argument>
            <type>String</type>
            <name>text</name>
        </argument>
        <format>TabSeparated</format>
        <command>encode_text.py</command>
        <command_read_timeout>1000000</command_read_timeout>
    </function>
</functions>
```

You can now simply use:

```sql
SELECT encode_text('cat');
```
The first run will be slow because it loads the model, but repeated runs will be fast. We can then copy the output to `SET param_target=...` and can easily write queries.

### Image embeddings

Image embeddings can be created similarly but we will provide the Python script the path to a local image instead of the image caption text.

`encode_image.py`

```python
#!/usr/bin/python3
import clip
import torch
import numpy as np
from PIL import Image
import sys

if __name__ == '__main__':
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    for text in sys.stdin:
        image = preprocess(Image.open(text.strip())).unsqueeze(0).to(device)
        with torch.no_grad():
            image_features = model.encode_image(image)[0].tolist()
            print(image_features)
        sys.stdout.flush()
```

`encode_image_function.xml`

```xml
<functions>
    <function>
        <type>executable_pool</type>
        <name>encode_image</name>
        <return_type>Array(Float32)</return_type>
        <argument>
            <type>String</type>
            <name>path</name>
        </argument>
        <format>TabSeparated</format>
        <command>encode_image.py</command>
        <command_read_timeout>1000000</command_read_timeout>
    </function>
</functions>
```

Then run this query:

```sql
SELECT encode_image('/path/to/your/image');
```
add laion documentation 2023-01-24 14:47:04 +00:00			`# Laion-400M dataset`

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`The [Laion-400M dataset](https://laion.ai/blog/laion-400-open-dataset/) contains 400 million images with English image captions. Laion nowadays provides [an even larger dataset](https://laion.ai/blog/laion-5b/) but working with it will be similar.`
add laion documentation 2023-01-24 14:47:04 +00:00
Fix style 2023-08-29 10:35:56 +00:00			`The dataset contains the image URL, embeddings for both the image and the image caption, a similarity score between the image and the image caption, as well as metadata, e.g. the image width/height, the licence and a NSFW flag. We can use the dataset to demonstrate [approximate nearest neighbor search](../../engines/table-engines/mergetree-family/annindexes.md) in ClickHouse.`
add laion documentation 2023-01-24 14:47:04 +00:00
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`## Data preparation`
add laion documentation 2023-01-24 14:47:04 +00:00
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`The embeddings and the metadata are stored in separate files in the raw data. A data preparation step downloads the data, merges the files,`
			converts them to CSV and imports them into ClickHouse. You can use the following `download.sh` script for that:
add laion documentation 2023-01-24 14:47:04 +00:00
			```bash
Fix issue #61351 2024-03-14 05:53:26 +00:00			`number=${1}`
			`if [[ $number == '' ]]; then`
			`number=1`
			`fi;`
			`wget --tries=100 https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/img_emb/img_emb_${number}.npy # download image embedding`
			`wget --tries=100 https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/text_emb/text_emb_${number}.npy # download text embedding`
			`wget --tries=100 https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/metadata/metadata_${number}.parquet # download metadata`
			`python3 process.py $number # merge files and convert to CSV`
add laion documentation 2023-01-24 14:47:04 +00:00			```
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			Script `process.py` is defined as follows:
add laion documentation 2023-01-24 14:47:04 +00:00
			```python
			`import pandas as pd`
			`import numpy as np`
			`import os`
			`import sys`

			`str_i = str(sys.argv[1])`
			`npy_file = "img_emb_" + str_i + '.npy'`
			`metadata_file = "metadata_" + str_i + '.parquet'`
			`text_npy = "text_emb_" + str_i + '.npy'`

			`# load all files`
			`im_emb = np.load(npy_file)`
			`text_emb = np.load(text_npy)`
			`data = pd.read_parquet(metadata_file)`

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`# combine files`
small improvement 2023-01-24 14:58:31 +00:00			`data = pd.concat([data, pd.DataFrame({"image_embedding" : [im_emb]}), pd.DataFrame({"text_embedding" : [text_emb]})], axis=1, copy=False)`
add laion documentation 2023-01-24 14:47:04 +00:00
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`# columns to be imported into ClickHouse`
			`data = data[['url', 'caption', 'NSFW', 'similarity', "image_embedding", "text_embedding"]]`
add laion documentation 2023-01-24 14:47:04 +00:00
			`# transform np.arrays to lists`
			`data['image_embedding'] = data['image_embedding'].apply(lambda x: list(x))`
			`data['text_embedding'] = data['text_embedding'].apply(lambda x: list(x))`

			`# this small hack is needed becase caption sometimes contains all kind of quotes`
			`data['caption'] = data['caption'].apply(lambda x: x.replace("'", " ").replace('"', " "))`

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`# export data as CSV file`
add laion documentation 2023-01-24 14:47:04 +00:00			`data.to_csv(str_i + '.csv', header=False)`

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`# removed raw data files`
add laion documentation 2023-01-24 14:47:04 +00:00			`os.system(f"rm {npy_file} {metadata_file} {text_npy}")`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`To start the data preparation pipeline, run:`

add laion documentation 2023-01-24 14:47:04 +00:00			```bash
Add warnings about ingestion script speed and memory usage in Laion dataset instructions The command given in the instructions would run 100 instances of a script that take 41 GB each. I'm not sure how the author of the instructions was able to run it successfully. 2023-08-31 19:07:51 +00:00			`seq 0 409 \| xargs -P1 -I{} bash -c './download.sh {}'`
add laion documentation 2023-01-24 14:47:04 +00:00			```

Fix style 2023-08-29 10:35:56 +00:00			The dataset is split into 410 files, each file contains ca. 1 million rows. If you like to work with a smaller subset of the data, simply adjust the limits, e.g. `seq 0 9 \| ...`.
add laion documentation 2023-01-24 14:47:04 +00:00
Add warnings about ingestion script speed and memory usage in Laion dataset instructions The command given in the instructions would run 100 instances of a script that take 41 GB each. I'm not sure how the author of the instructions was able to run it successfully. 2023-08-31 19:07:51 +00:00			(The python script above is very slow (~2-10 minutes per file), takes a lot of memory (41 GB per file), and the resulting csv files are big (10 GB each), so be careful. If you have enough RAM, increase the `-P1` number for more parallelism. If this is still too slow, consider coming up with a better ingestion procedure - maybe converting the .npy files to parquet, then doing all the other processing with clickhouse.)

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`## Create table`
add laion documentation 2023-01-24 14:47:04 +00:00
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`To create a table without indexes, run:`
add laion documentation 2023-01-24 14:47:04 +00:00
			```sql
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`CREATE TABLE laion`
add laion documentation 2023-01-24 14:47:04 +00:00			`(`
			`id` Int64,
			`url` String,
			`caption` String,
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`NSFW` String,
add laion documentation 2023-01-24 14:47:04 +00:00			`similarity` Float32,
			`image_embedding` Array(Float32),
			`text_embedding` Array(Float32)
			`)`
			`ENGINE = MergeTree`
			`ORDER BY id`
			`SETTINGS index_granularity = 8192`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`To import the CSV files into ClickHouse:`
add laion documentation 2023-01-24 14:47:04 +00:00
			```sql
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`INSERT INTO laion FROM INFILE '{path_to_csv_files}/*.csv'`
add laion documentation 2023-01-24 14:47:04 +00:00			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`## Run a brute-force ANN search (without ANN index)`
add laion documentation 2023-01-24 14:47:04 +00:00
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`To run a brute-force approximate nearest neighbor search, run:`
add laion documentation 2023-01-24 14:47:04 +00:00
			```sql
Various fixups 2023-08-31 19:18:42 +00:00			`SELECT url, caption FROM laion ORDER BY L2Distance(image_embedding, {target:Array(Float32)}) LIMIT 30`
add laion documentation 2023-01-24 14:47:04 +00:00			```

Various fixups 2023-08-31 19:18:42 +00:00			`target` is an array of 512 elements and a client parameter. A convenient way to obtain such arrays will be presented at the end of the article. For now, we can run the embedding of a random cat picture as `target`.
add laion documentation 2023-01-24 14:47:04 +00:00
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`Result`
add laion documentation 2023-01-24 14:47:04 +00:00
			```
			┌─url───────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption────────────────────────────────────────────────────────────────┐
			`│ https://s3.amazonaws.com/filestore.rescuegroups.org/6685/pictures/animals/13884/13884995/63318230_463x463.jpg │ Adoptable Female Domestic Short Hair │`
			`│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/6/239905226.jpg │ Adopt A Pet :: Marzipan - New York, NY │`
			`│ http://d1n3ar4lqtlydb.cloudfront.net/9/2/4/248407625.jpg │ Adopt A Pet :: Butterscotch - New Castle, DE │`
			`│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/e/e/c/245615237.jpg │ Adopt A Pet :: Tiggy - Chicago, IL │`
			`│ http://pawsofcoronado.org/wp-content/uploads/2012/12/rsz_pumpkin.jpg │ Pumpkin an orange tabby kitten for adoption │`
			`│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/7/8/3/188700997.jpg │ Adopt A Pet :: Brian the Brad Pitt of cats - Frankfort, IL │`
			`│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/8/b/d/191533561.jpg │ Domestic Shorthair Cat for adoption in Mesa, Arizona - Charlie │`
			`│ https://s3.amazonaws.com/pet-uploads.adoptapet.com/0/1/2/221698235.jpg │ Domestic Shorthair Cat for adoption in Marietta, Ohio - Daisy (Spayed) │`
			└───────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

			`8 rows in set. Elapsed: 6.432 sec. Processed 19.65 million rows, 43.96 GB (3.06 million rows/s., 6.84 GB/s.)`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`## Run a ANN with an ANN index`
add laion documentation 2023-01-24 14:47:04 +00:00
Various fixups 2023-08-31 19:18:42 +00:00			`Create a new table with an ANN index and insert the data from the existing table:`
add laion documentation 2023-01-24 14:47:04 +00:00
			```sql
Various fixups 2023-08-31 19:18:42 +00:00			`CREATE TABLE laion_annoy`
add laion documentation 2023-01-24 14:47:04 +00:00			`(`
			`id` Int64,
			`url` String,
			`caption` String,
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`NSFW` String,
add laion documentation 2023-01-24 14:47:04 +00:00			`similarity` Float32,
			`image_embedding` Array(Float32),
			`text_embedding` Array(Float32),
Various fixups 2023-08-31 19:18:42 +00:00			`INDEX annoy_image image_embedding TYPE annoy(),`
			`INDEX annoy_text text_embedding TYPE annoy()`
add laion documentation 2023-01-24 14:47:04 +00:00			`)`
			`ENGINE = MergeTree`
			`ORDER BY id`
Various fixups 2023-08-31 19:18:42 +00:00			`SETTINGS index_granularity = 8192;`

			`INSERT INTO laion_annoy SELECT * FROM laion;`
add laion documentation 2023-01-24 14:47:04 +00:00			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`By default, Annoy indexes use the L2 distance as metric. Further tuning knobs for index creation and search are described in the Annoy index [documentation](../../engines/table-engines/mergetree-family/annindexes.md). Let's check now again with the same query:`
add laion documentation 2023-01-24 14:47:04 +00:00
			```sql
Various fixups 2023-08-31 19:18:42 +00:00			`SELECT url, caption FROM laion_annoy ORDER BY l2Distance(image_embedding, {target:Array(Float32)}) LIMIT 8`
add laion documentation 2023-01-24 14:47:04 +00:00			```

			`Result`

			```
			┌─url──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─caption──────────────────────────────────────────────────────────────┐
			`│ http://tse1.mm.bing.net/th?id=OIP.R1CUoYp_4hbeFSHBaaB5-gHaFj │ bed bugs and pets can cats carry bed bugs pets adviser │`
			`│ http://pet-uploads.adoptapet.com/1/9/c/1963194.jpg?336w │ Domestic Longhair Cat for adoption in Quincy, Massachusetts - Ashley │`
			`│ https://thumbs.dreamstime.com/t/cat-bed-12591021.jpg │ Cat on bed Stock Image │`
			`│ https://us.123rf.com/450wm/penta/penta1105/penta110500004/9658511-portrait-of-british-short-hair-kitten-lieing-at-sofa-on-sun.jpg │ Portrait of british short hair kitten lieing at sofa on sun. │`
			`│ https://www.easypetmd.com/sites/default/files/Wirehaired%20Vizsla%20(2).jpg │ Vizsla (Wirehaired) image 3 │`
			`│ https://images.ctfassets.net/yixw23k2v6vo/0000000200009b8800000000/7950f4e1c1db335ef91bb2bc34428de9/dog-cat-flickr-Impatience_1.jpg?w=600&h=400&fm=jpg&fit=thumb&q=65&fl=progressive │ dog and cat image │`
			`│ https://i1.wallbox.ru/wallpapers/small/201523/eaa582ee76a31fd.jpg │ cats, kittens, faces, tonkinese │`
			`│ https://www.baxterboo.com/images/breeds/medium/cairn-terrier.jpg │ Cairn Terrier Photo │`
			└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘

			`8 rows in set. Elapsed: 0.641 sec. Processed 22.06 thousand rows, 49.36 MB (91.53 thousand rows/s., 204.81 MB/s.)`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`The speed increased significantly at the cost of less accurate results. This is because the ANN index only provide approximate search results. Note the example searched for similar image embeddings, yet it is also possible to search for positive image caption embeddings.`
add laion documentation 2023-01-24 14:47:04 +00:00
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`## Creating embeddings with UDFs`
add laion documentation 2023-01-24 14:47:04 +00:00
Fix style 2023-08-29 10:35:56 +00:00			One usually wants to create embeddings for new images or new image captions and search for similar image / image caption pairs in the data. We can use [UDF](../../sql-reference/functions/index.md#sql-user-defined-functions) to create the `target` vector without leaving the client. It is important to use the same model to create the data and new embeddings for searches. The following scripts utilize the `ViT-B/32` model which also underlies the dataset.
add laion documentation 2023-01-24 14:47:04 +00:00
			`### Text embeddings`

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			First, store the following Python script in the `user_scripts/` directory of your ClickHouse data path and make it executable (`chmod +x encode_text.py`).

add laion documentation 2023-01-24 14:47:04 +00:00			`encode_text.py`:
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00
add laion documentation 2023-01-24 14:47:04 +00:00			```python
			`#!/usr/bin/python3`
			`import clip`
			`import torch`
			`import numpy as np`
			`import sys`

			`if __name__ == '__main__':`
			`device = "cuda" if torch.cuda.is_available() else "cpu"`
			`model, preprocess = clip.load("ViT-B/32", device=device)`
			`for text in sys.stdin:`
			`inputs = clip.tokenize(text)`
			`with torch.no_grad():`
			`text_features = model.encode_text(inputs)[0].tolist()`
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`print(text_features)`
add laion documentation 2023-01-24 14:47:04 +00:00			`sys.stdout.flush()`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			Then create `encode_text_function.xml` in a location referenced by `<user_defined_executable_functions_config>/path/to/*_function.xml</user_defined_executable_functions_config>` in your ClickHouse server configuration file.

add laion documentation 2023-01-24 14:47:04 +00:00			```xml
			`<functions>`
			`<function>`
			`<type>executable</type>`
			`<name>encode_text</name>`
			`<return_type>Array(Float32)</return_type>`
			`<argument>`
			`<type>String</type>`
			`<name>text</name>`
			`</argument>`
			`<format>TabSeparated</format>`
			`<command>encode_text.py</command>`
			`<command_read_timeout>1000000</command_read_timeout>`
			`</function>`
			`</functions>`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`You can now simply use:`
add laion documentation 2023-01-24 14:47:04 +00:00
			```sql
			`SELECT encode_text('cat');`
			```
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			The first run will be slow because it loads the model, but repeated runs will be fast. We can then copy the output to `SET param_target=...` and can easily write queries.
add laion documentation 2023-01-24 14:47:04 +00:00
			`### Image embeddings`

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`Image embeddings can be created similarly but we will provide the Python script the path to a local image instead of the image caption text.`

			`encode_image.py`
add laion documentation 2023-01-24 14:47:04 +00:00
			```python
			`#!/usr/bin/python3`
			`import clip`
			`import torch`
			`import numpy as np`
			`from PIL import Image`
			`import sys`

			`if __name__ == '__main__':`
			`device = "cuda" if torch.cuda.is_available() else "cpu"`
			`model, preprocess = clip.load("ViT-B/32", device=device)`
			`for text in sys.stdin:`
			`image = preprocess(Image.open(text.strip())).unsqueeze(0).to(device)`
			`with torch.no_grad():`
			`image_features = model.encode_image(image)[0].tolist()`
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`print(image_features)`
add laion documentation 2023-01-24 14:47:04 +00:00			`sys.stdout.flush()`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`encode_image_function.xml`

add laion documentation 2023-01-24 14:47:04 +00:00			```xml
			`<functions>`
			`<function>`
			`<type>executable_pool</type>`
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`<name>encode_image</name>`
add laion documentation 2023-01-24 14:47:04 +00:00			`<return_type>Array(Float32)</return_type>`
			`<argument>`
			`<type>String</type>`
			`<name>path</name>`
			`</argument>`
			`<format>TabSeparated</format>`
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`<command>encode_image.py</command>`
add laion documentation 2023-01-24 14:47:04 +00:00			`<command_read_timeout>1000000</command_read_timeout>`
			`</function>`
			`</functions>`
			```

Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`Then run this query:`

add laion documentation 2023-01-24 14:47:04 +00:00			```sql
Dataset docs: Update + fix LAION-400M tutorial 2023-08-29 10:16:23 +00:00			`SELECT encode_image('/path/to/your/image');`
add laion documentation 2023-01-24 14:47:04 +00:00			```