
# External Data for Query Processing

ClickHouse allows sending a server the data that is needed for processing a query, together with a `SELECT` query. This data is put in a temporary table (see the section "Temporary tables") and can be used in the query (for example, in `IN` operators).

For example, if you have a text file with important user identifiers, you can upload it to the server along with a query that filters by this list.

If you need to run more than one query with a large volume of external data, don't use this feature. It is better to upload the data to the DB ahead of time.

External data can be uploaded using the command-line client (in non-interactive mode), or using the HTTP interface.

In the command-line client, you can specify a parameters section in the format

```bash
--external --file=... [--name=...] [--format=...] [--types=...|--structure=...]
```

You may have multiple sections like this, one for each table being transmitted.

`--external` marks the beginning of a clause. `--file` is the path to the file with the table dump, or `-`, which refers to stdin. Only a single table can be retrieved from stdin.

The following parameters are optional: `--name` is the name of the table; if omitted, `_data` is used. `--format` is the data format in the file; if omitted, `TabSeparated` is used.

One of the following parameters is required: `--types` is a list of comma-separated column types, for example `UInt64,String`; the columns will be named `_1`, `_2`, and so on. `--structure` is the table structure in the format `UserID UInt64, URL String`; it defines the column names and types.

The files specified in `--file` will be parsed using the format specified in `--format`, with the data types specified in `--types` or `--structure`. The table will be uploaded to the server and will be accessible there as a temporary table with the name specified in `--name`.

Examples:

```bash
echo -ne "1\n2\n3\n" | clickhouse-client --query="SELECT count() FROM test.visits WHERE TraficSourceID IN _data" --external --file=- --types=Int8
849897
cat /etc/passwd | sed 's/:/\t/g' | clickhouse-client --query="SELECT shell, count() AS c FROM passwd GROUP BY shell ORDER BY c DESC" --external --file=- --name=passwd --structure='login String, unused String, uid UInt16, gid UInt16, comment String, home String, shell String'
/bin/sh 20
/bin/false      5
/bin/bash       4
/usr/sbin/nologin       1
/bin/sync       1
```
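
Since several `--external` sections can be given in one invocation, multiple external tables can be combined in a single query. The following is a minimal sketch only; the file names, table names, and columns are hypothetical:

```bash
# Pass two external tables at once and filter one by the identifiers in the other
# (visits.tsv, ids.tsv, and the column layout are hypothetical).
clickhouse-client --query="SELECT count() FROM visits WHERE UserID IN ids" \
    --external --file=visits.tsv --name=visits --structure='UserID UInt64, URL String' \
    --external --file=ids.tsv --name=ids --types=UInt64
```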

When using the HTTP interface, external data is passed in the multipart/form-data format. Each table is transmitted as a separate file. The table name is taken from the file name. The query string is passed the parameters `name_format`, `name_types`, and `name_structure`, where `name` is the name of the table that these parameters correspond to. The meaning of the parameters is the same as when using the command-line client.

Example:

```bash
cat /etc/passwd | sed 's/:/\t/g' > passwd.tsv

curl -F 'passwd=@passwd.tsv;' 'http://localhost:8123/?query=SELECT+shell,+count()+AS+c+FROM+passwd+GROUP+BY+shell+ORDER+BY+c+DESC&passwd_structure=login+String,+unused+String,+uid+UInt16,+gid+UInt16,+comment+String,+home+String,+shell+String'
/bin/sh 20
/bin/false      5
/bin/bash       4
/usr/sbin/nologin       1
/bin/sync       1
```
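
If only the column types are provided via `name_types` instead of `name_structure`, the columns are named `_1`, `_2`, and so on, just as with the command-line client. A minimal sketch, assuming a hypothetical file `ids.tsv` with one numeric ID per line:

```bash
# Hypothetical single-column external table passed over HTTP; only its type is declared,
# so the column is addressed as _1 in the query.
echo -ne "1\n2\n3\n" > ids.tsv

curl -F 'ids=@ids.tsv;' 'http://localhost:8123/?query=SELECT+count()+FROM+ids+WHERE+_1+IN+(1,+3)&ids_types=UInt64'
```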

For distributed query processing, the temporary tables are sent to all the remote servers.
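
Usage is unchanged when the query targets a distributed table; the sketch below assumes a hypothetical `Distributed` table named `visits_all`:

```bash
# The temporary table `ids` built from stdin is sent to every remote server
# that participates in the query (the table name visits_all is hypothetical).
echo -ne "1\n2\n3\n" | clickhouse-client \
    --query="SELECT count() FROM visits_all WHERE UserID IN ids" \
    --external --file=- --name=ids --types=UInt64
```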