S3 Table Engine
This engine provides integration with the Amazon S3 ecosystem. It is similar to the HDFS engine, but provides S3-specific features.
Create Table
CREATE TABLE s3_engine_table (name String, value UInt32)
ENGINE = S3(path, [NOSIGN | aws_access_key_id, aws_secret_access_key,] format [, compression])
[PARTITION BY expr]
[SETTINGS ...]
Engine parameters
- path - Bucket URL with path to file. Supports the following wildcards in readonly mode: *, ?, {abc,def} and {N..M}, where N, M are numbers and 'abc', 'def' are strings. For more information see below.
- NOSIGN - If this keyword is provided in place of credentials, requests are not signed.
- format - The format of the file.
- aws_access_key_id, aws_secret_access_key - Long-term credentials for the AWS account user. You can use these to authenticate your requests. Parameter is optional. If credentials are not specified, they are taken from the configuration file. For more information see Using S3 for Data Storage.
- compression - Compression type. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. Parameter is optional. By default, compression is autodetected from the file extension.
PARTITION BY
PARTITION BY - Optional. In most cases you don't need a partition key, and if one is needed, you generally don't need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use too granular partitioning. Don't partition your data by client identifiers or names (instead, make client identifier or name the first column in the ORDER BY expression).
For partitioning by month, use the toYYYYMM(date_column) expression, where date_column is a column with a date of the type Date. The partition names here have the "YYYYMM" format.
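The monthly partitioning described above might look like the sketch below. The bucket URL, table name and column names are hypothetical. The example also assumes that the {_partition_id} path placeholder, which the s3 table function documents for partitioned writes, is honored by this engine in the same way; treat that as an assumption to verify against your ClickHouse version.

```sql
-- Hypothetical table: bucket URL, table name and columns are placeholders.
-- Assumption: {_partition_id} in the path is replaced by the partition name
-- on INSERT, so each month is written to its own object.
CREATE TABLE s3_monthly_events (event_date Date, name String, value UInt32)
ENGINE = S3('https://my-bucket.s3.amazonaws.com/events/month_{_partition_id}.csv', 'CSV')
PARTITION BY toYYYYMM(event_date);

-- The two rows fall into partitions named 202401 and 202402.
INSERT INTO s3_monthly_events VALUES
    ('2024-01-15', 'one', 1),
    ('2024-02-03', 'two', 2);
```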
Example
CREATE TABLE s3_engine_table (name String, value UInt32)
ENGINE=S3('https://clickhouse-public-datasets.s3.amazonaws.com/my-test-bucket-768/test-data.csv.gz', 'CSV', 'gzip')
SETTINGS input_format_with_names_use_header = 0;
INSERT INTO s3_engine_table VALUES ('one', 1), ('two', 2), ('three', 3);
SELECT * FROM s3_engine_table LIMIT 2;
┌─name─┬─value─┐
│ one  │     1 │
│ two  │     2 │
└──────┴───────┘
Virtual columns
- _path - Path to the file.
- _file - Name of the file.
For more information about virtual columns see here.
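As a minimal sketch, the virtual columns can be read from the s3_engine_table defined in the example above. Virtual columns are not returned by SELECT *, so they are named explicitly here.

```sql
-- _path and _file must be listed explicitly; SELECT * omits virtual columns.
SELECT name, value, _path, _file
FROM s3_engine_table
LIMIT 2;
```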
Implementation Details
- Reads and writes can be parallel.
- Not supported:
  - ALTER and SELECT...SAMPLE operations.
  - Indexes.
  - Zero-copy replication is possible, but not supported.
:::note Zero-copy replication is not ready for production
Zero-copy replication is disabled by default in ClickHouse version 22.8 and higher. This feature is not recommended for production use.
:::
Wildcards In Path
The path argument can specify multiple files using bash-like wildcards. To be processed, a file must exist and match the whole path pattern. The list of files is determined during SELECT (not at CREATE time).
- * - Substitutes any number of any characters except /, including the empty string.
- ? - Substitutes any single character.
- {some_string,another_string,yet_another_one} - Substitutes any of the strings 'some_string', 'another_string', 'yet_another_one'.
- {N..M} - Substitutes any number in the range from N to M, including both borders. N and M can have leading zeroes, e.g. 000..078.
Constructions with {} are similar to the remote table function.
:::note
If the listing of files contains number ranges with leading zeros, use the construction with braces for each digit separately or use ?.
:::
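The two alternatives named in the note could be written as follows. This is only a sketch: the bucket URL and table names are hypothetical, and the files are assumed to be named file-000.csv through file-999.csv.

```sql
-- One brace range per digit: each {0..9} matches exactly one digit.
CREATE TABLE files_by_digit (name String, value UInt32)
ENGINE = S3('https://my-bucket.s3.amazonaws.com/my_folder/file-{0..9}{0..9}{0..9}.csv', 'CSV');

-- Looser alternative: ? matches any single character, not only digits,
-- so it also picks up objects such as file-abc.csv if they exist.
CREATE TABLE files_by_question_mark (name String, value UInt32)
ENGINE = S3('https://my-bucket.s3.amazonaws.com/my_folder/file-???.csv', 'CSV');
```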
Example with wildcards 1
Create a table with files named file-000.csv, file-001.csv, … , file-999.csv:
CREATE TABLE big_table (name String, value UInt32)
ENGINE = S3('https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/my_folder/file-{000..999}.csv', 'CSV');
Example with wildcards 2
Suppose we have several files in CSV format with the following URIs on S3:
- 'https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/some_folder/some_file_1.csv'
- 'https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/some_folder/some_file_2.csv'
- 'https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/some_folder/some_file_3.csv'
- 'https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/another_folder/some_file_1.csv'
- 'https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/another_folder/some_file_2.csv'
- 'https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/another_folder/some_file_3.csv'
There are several ways to make a table consisting of all six files:
- Specify the range of file suffixes:
CREATE TABLE table_with_range (name String, value UInt32)
ENGINE = S3('https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/{some,another}_folder/some_file_{1..3}.csv', 'CSV');
- Take all files with the some_file_ prefix (there should be no extra files with this prefix in either folder):
CREATE TABLE table_with_question_mark (name String, value UInt32)
ENGINE = S3('https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/{some,another}_folder/some_file_?.csv', 'CSV');
- Take all the files in both folders (all files should satisfy the format and schema described in the query):
CREATE TABLE table_with_asterisk (name String, value UInt32)
ENGINE = S3('https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/{some,another}_folder/*', 'CSV');
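Because the file listing is resolved at SELECT time, one way to check which objects a pattern actually picked up is to group by the _path virtual column. The sketch below uses table_with_asterisk from the last example. With the six files listed above it would be expected to return six paths, but the result depends on what is in the bucket when the query runs.

```sql
-- One output row per matched object, with the number of rows read from it.
SELECT _path, count() AS row_count
FROM table_with_asterisk
GROUP BY _path
ORDER BY _path;
```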
S3-related Settings
The following settings can be set before query execution or placed into the configuration file.
- s3_max_single_part_upload_size - The maximum size of object to upload using singlepart upload to S3. Default value is 64Mb.
- s3_min_upload_part_size - The minimum size of part to upload during S3 Multipart upload. Default value is 512Mb.
- s3_max_redirects - Max number of S3 redirect hops allowed. Default value is 10.
- s3_single_read_retries - The maximum number of attempts during single read. Default value is 4.
- s3_max_put_rps - Maximum PUT requests per second rate before throttling. Default value is 0 (unlimited).
- s3_max_put_burst - Max number of requests that can be issued simultaneously before hitting the request per second limit. By default (0 value) equals s3_max_put_rps.
- s3_max_get_rps - Maximum GET requests per second rate before throttling. Default value is 0 (unlimited).
- s3_max_get_burst - Max number of requests that can be issued simultaneously before hitting the request per second limit. By default (0 value) equals s3_max_get_rps.
Security consideration: if a malicious user can specify arbitrary S3 URLs, s3_max_redirects must be set to zero to avoid SSRF attacks; alternatively, remote_host_filter must be specified in the server configuration.
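As a sketch of setting these values before query execution, the statements below use the standard SET statement and a per-query SETTINGS clause. The numeric values are illustrative only, and s3_engine_table is the table from the earlier example.

```sql
-- Session-level: applies to every later query in this session.
SET s3_max_redirects = 0;   -- do not follow S3 redirects (see the security note above)
SET s3_max_get_rps = 100;   -- illustrative cap on GET requests per second

-- Query-level: applies to this statement only.
SELECT count()
FROM s3_engine_table
SETTINGS s3_single_read_retries = 8;
```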
Endpoint-based Settings
The following settings can be specified in the configuration file for a given endpoint (which will be matched by exact prefix of a URL):
- endpoint - Specifies prefix of an endpoint. Mandatory.
- access_key_id and secret_access_key - Specifies credentials to use with given endpoint. Optional.
- use_environment_credentials - If set to true, S3 client will try to obtain credentials from environment variables and Amazon EC2 metadata for given endpoint. Optional, default value is false.
- region - Specifies S3 region name. Optional.
- use_insecure_imds_request - If set to true, S3 client will use insecure IMDS request while obtaining credentials from Amazon EC2 metadata. Optional, default value is false.
- expiration_window_seconds - Grace period for checking if expiration-based credentials have expired. Optional, default value is 120.
- no_sign_request - Ignore all the credentials so requests are not signed. Useful for accessing public buckets.
- header - Adds specified HTTP header to a request to given endpoint. Optional, can be specified multiple times.
- server_side_encryption_customer_key_base64 - If specified, required headers for accessing S3 objects with SSE-C encryption will be set. Optional.
- max_single_read_retries - The maximum number of attempts during single read. Default value is 4. Optional.
- max_put_rps, max_put_burst, max_get_rps and max_get_burst - Throttling settings (see description above) to use for specific endpoint instead of per query. Optional.
Example:
<s3>
    <endpoint-name>
        <endpoint>https://clickhouse-public-datasets.s3.amazonaws.com/my-test-bucket-768/</endpoint>
        <!-- <access_key_id>ACCESS_KEY_ID</access_key_id> -->
        <!-- <secret_access_key>SECRET_ACCESS_KEY</secret_access_key> -->
        <!-- <region>us-west-1</region> -->
        <!-- <use_environment_credentials>false</use_environment_credentials> -->
        <!-- <use_insecure_imds_request>false</use_insecure_imds_request> -->
        <!-- <expiration_window_seconds>120</expiration_window_seconds> -->
        <!-- <no_sign_request>false</no_sign_request> -->
        <!-- <header>Authorization: Bearer SOME-TOKEN</header> -->
        <!-- <server_side_encryption_customer_key_base64>BASE64-ENCODED-KEY</server_side_encryption_customer_key_base64> -->
        <!-- <max_single_read_retries>4</max_single_read_retries> -->
    </endpoint-name>
</s3>
Accessing public buckets
ClickHouse tries to fetch credentials from many different types of sources.
Sometimes this causes problems when accessing public buckets: the client returns a 403 error code.
This issue can be avoided by using the NOSIGN keyword, which forces the client to ignore all credentials and not sign the requests.
CREATE TABLE big_table (name String, value UInt32)
ENGINE = S3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/aapl_stock.csv', NOSIGN, 'CSVWithNames');