S3
This engine provides integration with the Amazon S3 ecosystem. It is similar to the HDFS engine, but provides S3-specific features.
Usage
ENGINE = S3(path, [aws_access_key_id, aws_secret_access_key,] format, structure, [compression])
Input parameters
- path: Bucket URL with the path to the file. Supports the following wildcards in read-only mode: *, ?, {abc,def} and {N..M}, where N and M are numbers and 'abc' and 'def' are strings.
- format: The format of the file.
- structure: Structure of the table, in the format 'column1_name column1_type, column2_name column2_type, ...'.
- compression: Optional. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. By default, compression is autodetected from the file extension.
Example:
1. Set up the s3_engine_table table:
CREATE TABLE s3_engine_table (name String, value UInt32) ENGINE=S3('https://storage.yandexcloud.net/my-test-bucket-768/test-data.csv.gz', 'CSV', 'name String, value UInt32', 'gzip')
2. Fill the file:
INSERT INTO s3_engine_table VALUES ('one', 1), ('two', 2), ('three', 3)
3. Query the data:
SELECT * FROM s3_engine_table LIMIT 2
┌─name─┬─value─┐
│ one │ 1 │
│ two │ 2 │
└──────┴───────┘
Implementation Details
- Reads and writes can be parallel
- Not supported:
    - ALTER and SELECT...SAMPLE operations.
    - Indexes.
    - Replication.
Globs in path
Multiple path components can have globs. To be processed, a file must exist and match the whole path pattern. The list of files is determined during SELECT (not at CREATE time).
- *: Substitutes any number of any characters except /, including the empty string.
- ?: Substitutes any single character.
- {some_string,another_string,yet_another_one}: Substitutes any of the strings 'some_string', 'another_string', 'yet_another_one'.
- {N..M}: Substitutes any number in the range from N to M, including both borders. N and M can have leading zeroes, e.g. 000..078.
Constructions with {} are similar to the remote table function.
Example
- Suppose we have several files in CSV format with the following URIs on S3:
- ‘https://storage.yandexcloud.net/my-test-bucket-768/some_prefix/some_file_1.csv’
- ‘https://storage.yandexcloud.net/my-test-bucket-768/some_prefix/some_file_2.csv’
- ‘https://storage.yandexcloud.net/my-test-bucket-768/some_prefix/some_file_3.csv’
- ‘https://storage.yandexcloud.net/my-test-bucket-768/another_prefix/some_file_1.csv’
- ‘https://storage.yandexcloud.net/my-test-bucket-768/another_prefix/some_file_2.csv’
- ‘https://storage.yandexcloud.net/my-test-bucket-768/another_prefix/some_file_3.csv’
- There are several ways to make a table consisting of all six files:
CREATE TABLE table_with_range (name String, value UInt32) ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/{some,another}_prefix/some_file_{1..3}', 'CSV')
- Another way:
CREATE TABLE table_with_question_mark (name String, value UInt32) ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/{some,another}_prefix/some_file_?', 'CSV')
- A table consisting of all the files in both directories (all files must match the format and schema described in the query):
CREATE TABLE table_with_asterisk (name String, value UInt32) ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/{some,another}_prefix/*', 'CSV')
!!! warning "Warning"
If the listing of files contains number ranges with leading zeros, use the construction with braces for each digit separately or use ?.
Example
Create a table with files named file-000.csv, file-001.csv, … , file-999.csv:
CREATE TABLE big_table (name String, value UInt32) ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/big_prefix/file-{000..999}.csv', 'CSV')
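Following the warning above, the same files could also be addressed with one brace range per digit, or with ? placeholders. This is only a sketch: the table names big_table_per_digit and big_table_question_marks are hypothetical, and the bucket and prefix are reused from the example above.

```sql
-- Hypothetical table name. One {0..9} range per digit covers file-000.csv ... file-999.csv.
CREATE TABLE big_table_per_digit (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/big_prefix/file-{0..9}{0..9}{0..9}.csv', 'CSV');

-- Hypothetical table name. Each ? matches any single character, not only a digit,
-- so this pattern is broader than the brace form above.
CREATE TABLE big_table_question_marks (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/big_prefix/file-???.csv', 'CSV');
```

The brace form is stricter, which may be preferable when the prefix also contains files whose names are not purely numeric.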
Virtual Columns
- _path: Path to the file.
- _file: Name of the file.
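As a rough illustration, the virtual columns can be selected by name like ordinary columns, for example to see how many rows each matched file contributes. The query assumes the table_with_asterisk table from the glob example exists; the alias row_count is introduced here for illustration only.

```sql
-- Count rows per source file using the _path and _file virtual columns.
-- Assumes table_with_asterisk from the glob example above.
SELECT
    _path,
    _file,
    count() AS row_count
FROM table_with_asterisk
GROUP BY _path, _file
ORDER BY _path;
```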
See Also
S3-related settings
The following settings can be set before query execution or placed into the configuration file.
- s3_max_single_part_upload_size: Default value is 64Mb. The maximum size of an object to upload using singlepart upload to S3.
- s3_min_upload_part_size: Default value is 512Mb. The minimum size of a part to upload during S3 multipart upload.
- s3_max_redirects: Default value is 10. The maximum number of S3 redirect hops allowed.
Security consideration: if a malicious user can specify arbitrary S3 URLs, s3_max_redirects must be set to zero to avoid SSRF attacks; alternatively, remote_host_filter must be specified in the server configuration.
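A minimal sketch of the session-level form: the settings listed above can be changed with SET before running a query. The values are illustrative assumptions rather than recommendations, and s3_engine_table is the table from the first example.

```sql
-- Disallow following S3 redirects for this session (see the security consideration above).
SET s3_max_redirects = 0;

-- Illustrative sizes in bytes: 32 MiB single-part limit, 256 MiB minimum multipart part size.
SET s3_max_single_part_upload_size = 33554432;
SET s3_min_upload_part_size = 268435456;

-- Subsequent queries in the same session use the settings above.
SELECT * FROM s3_engine_table LIMIT 2;
```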
Endpoint-based settings
The following settings can be specified in the configuration file for a given endpoint (which is matched by the exact prefix of a URL):
- endpoint: Mandatory. Specifies the prefix of an endpoint.
- access_key_id and secret_access_key: Optional. Specify the credentials to use with the given endpoint.
- use_environment_credentials: Optional, default value is false. If set to true, the S3 client will try to obtain credentials from environment variables and Amazon EC2 metadata for the given endpoint.
- header: Optional, can be specified multiple times. Adds the specified HTTP header to a request to the given endpoint.
- server_side_encryption_customer_key_base64: Optional. If specified, the headers required for accessing S3 objects with SSE-C encryption will be set.
Example:
<s3>
    <endpoint-name>
        <endpoint>https://storage.yandexcloud.net/my-test-bucket-768/</endpoint>
        <!-- <access_key_id>ACCESS_KEY_ID</access_key_id> -->
        <!-- <secret_access_key>SECRET_ACCESS_KEY</secret_access_key> -->
        <!-- <use_environment_credentials>false</use_environment_credentials> -->
        <!-- <header>Authorization: Bearer SOME-TOKEN</header> -->
        <!-- <server_side_encryption_customer_key_base64>BASE64-ENCODED-KEY</server_side_encryption_customer_key_base64> -->
    </endpoint-name>
</s3>