---
toc_priority: 4
toc_title: S3
---

# S3 Table Engine
This engine provides integration with the Amazon S3 ecosystem. It is similar to the HDFS engine, but provides S3-specific features.

## Create Table

``` sql
CREATE TABLE s3_engine_table (name String, value UInt32)
ENGINE = S3(path, [aws_access_key_id, aws_secret_access_key,] format, structure, [compression])
```

**Engine parameters**

- `path` — Bucket URL with a path to a file. Supports the following wildcards in read-only mode: `*`, `?`, `{abc,def}` and `{N..M}` where `N`, `M` — numbers, `'abc'`, `'def'` — strings. For more information see below.
- `format` — The format of the file.
- `structure` — Structure of the table. Format `'column1_name column1_type, column2_name column2_type, ...'`.
- `compression` — Compression type. Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. Parameter is optional. By default, it will autodetect compression by file extension.

**Example:**

1. Set up the `s3_engine_table` table:

``` sql
CREATE TABLE s3_engine_table (name String, value UInt32) ENGINE=S3('https://storage.yandexcloud.net/my-test-bucket-768/test-data.csv.gz', 'CSV', 'name String, value UInt32', 'gzip')
```

2. Fill the file:

``` sql
INSERT INTO s3_engine_table VALUES ('one', 1), ('two', 2), ('three', 3)
```

3. Query the data:

``` sql
SELECT * FROM s3_engine_table LIMIT 2
```

``` text
┌─name─┬─value─┐
│ one  │     1 │
│ two  │     2 │
└──────┴───────┘
```
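
The optional `aws_access_key_id` and `aws_secret_access_key` arguments from the syntax above can also be passed directly in the engine definition. A minimal sketch, reusing the bucket and file from the example; the table name and key values are placeholders:

``` sql
CREATE TABLE s3_engine_table_with_keys (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/test-data.csv.gz',
            'ACCESS_KEY_ID', 'SECRET_ACCESS_KEY',   -- placeholder credentials
            'CSV', 'name String, value UInt32', 'gzip')
```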

## Virtual columns

- `_path` — Path to the file.
- `_file` — Name of the file.

For more information about virtual columns see here.
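
As an illustrative query (a sketch, assuming the `s3_engine_table` from the example above), the virtual columns can be selected explicitly to see which S3 object each row was read from:

``` sql
-- _path and _file are provided by the engine; name and value are regular columns
SELECT _path, _file, name, value
FROM s3_engine_table
LIMIT 2
```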

## Implementation Details

- Reads and writes can be parallel.
- Not supported:
    - `ALTER` and `SELECT...SAMPLE` operations.
    - Indexes.
    - Replication.

## Wildcards In Path

The `path` argument can specify multiple files using bash-like wildcards. To be processed, a file must exist and match the whole path pattern. The listing of files is determined during `SELECT` (not at `CREATE` moment).

- `*` — Substitutes any number of any characters except `/` including empty string.
- `?` — Substitutes any single character.
- `{some_string,another_string,yet_another_one}` — Substitutes any of strings `'some_string', 'another_string', 'yet_another_one'`.
- `{N..M}` — Substitutes any number in range from N to M including both borders. N and M can have leading zeroes, e.g. `000..078`.

Constructions with `{}` are similar to the remote table function.

## S3-related Settings

The following settings can be set before query execution or placed into the configuration file.

- `s3_max_single_part_upload_size` — The maximum size of an object to upload using single-part upload to S3. Default value is `64Mb`.
- `s3_min_upload_part_size` — The minimum size of a part to upload during multipart upload to S3 Multipart upload. Default value is `512Mb`.
- `s3_max_redirects` — Maximum number of S3 redirect hops allowed. Default value is `10`.

Security consideration: if a malicious user can specify arbitrary S3 URLs, `s3_max_redirects` must be set to zero to avoid SSRF attacks; alternatively, `remote_host_filter` must be specified in the server configuration.
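
A minimal sketch of changing these settings at the session level before running a query; it assumes the size setting is given in bytes, and the values are placeholders, not recommendations:

``` sql
-- Disable redirect following (see the security consideration above)
SET s3_max_redirects = 0;
-- Lower the multipart upload part size for this session (value assumed to be in bytes)
SET s3_min_upload_part_size = 33554432;
```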

## Endpoint-based Settings

The following settings can be specified in the configuration file for a given endpoint (which will be matched by exact prefix of a URL):

- `endpoint` — Specifies prefix of an endpoint. Mandatory.
- `access_key_id` and `secret_access_key` — Specifies credentials to use with given endpoint. Optional.
- `use_environment_credentials` — If set to `true`, S3 client will try to obtain credentials from environment variables and Amazon EC2 metadata for given endpoint. Optional, default value is `false`.
- `header` — Adds specified HTTP header to a request to given endpoint. Optional, can be specified multiple times.
- `server_side_encryption_customer_key_base64` — If specified, required headers for accessing S3 objects with SSE-C encryption will be set. Optional.

**Example:**

``` xml
<s3>
    <endpoint-name>
        <endpoint>https://storage.yandexcloud.net/my-test-bucket-768/</endpoint>
        <!-- <access_key_id>ACCESS_KEY_ID</access_key_id> -->
        <!-- <secret_access_key>SECRET_ACCESS_KEY</secret_access_key> -->
        <!-- <use_environment_credentials>false</use_environment_credentials> -->
        <!-- <header>Authorization: Bearer SOME-TOKEN</header> -->
        <!-- <server_side_encryption_customer_key_base64>BASE64-ENCODED-KEY</server_side_encryption_customer_key_base64> -->
    </endpoint-name>
</s3>
```
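
A hedged sketch of how this is used, assuming the endpoint above is configured: because the table URL starts with the configured `endpoint` prefix, the credentials and headers from the configuration apply to the request, so none need to appear in the DDL (the table name is a placeholder):

``` sql
-- No access keys in the engine arguments; they come from the matching <endpoint-name> entry
CREATE TABLE s3_endpoint_table (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/test-data.csv.gz', 'CSV', 'name String, value UInt32', 'gzip')
```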

## Usage

Suppose we have several files in CSV format with the following URIs on S3:

- 'https://storage.yandexcloud.net/my-test-bucket-768/some_prefix/some_file_1.csv'
- 'https://storage.yandexcloud.net/my-test-bucket-768/some_prefix/some_file_2.csv'
- 'https://storage.yandexcloud.net/my-test-bucket-768/some_prefix/some_file_3.csv'
- 'https://storage.yandexcloud.net/my-test-bucket-768/another_prefix/some_file_1.csv'
- 'https://storage.yandexcloud.net/my-test-bucket-768/another_prefix/some_file_2.csv'
- 'https://storage.yandexcloud.net/my-test-bucket-768/another_prefix/some_file_3.csv'

There are several ways to make a table consisting of all six files:

``` sql
CREATE TABLE table_with_range (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/{some,another}_prefix/some_file_{1..3}', 'CSV');
```

Another way:

``` sql
CREATE TABLE table_with_question_mark (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/{some,another}_prefix/some_file_?', 'CSV');
```

A table consisting of all the files in both directories (all files must match the format and schema described in the query):

``` sql
CREATE TABLE table_with_asterisk (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/{some,another}_prefix/*', 'CSV');
```

!!! warning "Warning"
    If the listing of files contains number ranges with leading zeros, use the construction with braces for each digit separately or use `?`.

Create a table with files named `file-000.csv`, `file-001.csv`, … , `file-999.csv`:

``` sql
CREATE TABLE big_table (name String, value UInt32)
ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/big_prefix/file-{000..999}.csv', 'CSV');
```