Merge pull request #28716 from olgarev/revolg-DOCSUP-13742-partitions_in_s3_table_function

This commit is contained in:
Vladimir C 2021-09-10 17:57:58 +03:00 committed by GitHub
commit 5b967d91ba
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 51 additions and 3 deletions

View File

@ -210,4 +210,4 @@ ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/big_prefix/file-
## See also
- [S3 table function](../../../sql-reference/table-functions/s3.md)
- [s3 table function](../../../sql-reference/table-functions/s3.md)

View File

@ -3,7 +3,7 @@ toc_priority: 45
toc_title: s3
---
# S3 Table Function {#s3-table-function}
# s3 Table Function {#s3-table-function}
Provides table-like interface to select/insert files in [Amazon S3](https://aws.amazon.com/s3/). This table function is similar to [hdfs](../../sql-reference/table-functions/hdfs.md), but provides S3-specific features.
@ -125,6 +125,30 @@ INSERT INTO FUNCTION s3('https://storage.yandexcloud.net/my-test-bucket-768/test
SELECT name, value FROM existing_table;
```
## Partitioned Write {#partitioned-write}
If you specify `PARTITION BY` expression when inserting data into `S3` table, a separate file is created for each partition value. Splitting the data into separate files helps to improve reading operations efficiency.
**Examples**
1. Using partition ID in a key creates separate files:
```sql
INSERT INTO TABLE FUNCTION
s3('http://bucket.amazonaws.com/my_bucket/file_{_partition_id}.csv', 'CSV', 'a String, b UInt32, c UInt32')
PARTITION BY a VALUES ('x', 2, 3), ('x', 4, 5), ('y', 11, 12), ('y', 13, 14), ('z', 21, 22), ('z', 23, 24);
```
As a result, the data is written into three files: `file_x.csv`, `file_y.csv`, and `file_z.csv`.
2. Using partition ID in a bucket name creates files in different buckets:
```sql
INSERT INTO TABLE FUNCTION
s3('http://bucket.amazonaws.com/my_bucket_{_partition_id}/file.csv', 'CSV', 'a UInt32, b UInt32, c UInt32')
PARTITION BY a VALUES (1, 2, 3), (1, 4, 5), (10, 11, 12), (10, 13, 14), (20, 21, 22), (20, 23, 24);
```
As a result, the data is written into three files in different buckets: `my_bucket_1/file.csv`, `my_bucket_10/file.csv`, and `my_bucket_20/file.csv`.
**See Also**
- [S3 engine](../../engines/table-engines/integrations/s3.md)

View File

@ -151,4 +151,4 @@ ENGINE = S3('https://storage.yandexcloud.net/my-test-bucket-768/big_prefix/file-
**Смотрите также**
- [Табличная функция S3](../../../sql-reference/table-functions/s3.md)
- [Табличная функция s3](../../../sql-reference/table-functions/s3.md)

View File

@ -133,6 +133,30 @@ INSERT INTO FUNCTION s3('https://storage.yandexcloud.net/my-test-bucket-768/test
SELECT name, value FROM existing_table;
```
## Партиционирование при записи данных {#partitioned-write}
Если при добавлении данных в таблицу S3 указать выражение `PARTITION BY`, то для каждого значения ключа партиционирования создается отдельный файл. Это повышает эффективность операций чтения.
**Примеры**
1. При использовании ID партиции в имени ключа создаются отдельные файлы:
```sql
INSERT INTO TABLE FUNCTION
s3('http://bucket.amazonaws.com/my_bucket/file_{_partition_id}.csv', 'CSV', 'a UInt32, b UInt32, c UInt32')
PARTITION BY a VALUES ('x', 2, 3), ('x', 4, 5), ('y', 11, 12), ('y', 13, 14), ('z', 21, 22), ('z', 23, 24);
```
В результате данные будут записаны в три файла: `file_x.csv`, `file_y.csv` и `file_z.csv`.
2. При использовании ID партиции в названии бакета создаются файлы в разных бакетах:
```sql
INSERT INTO TABLE FUNCTION
s3('http://bucket.amazonaws.com/my_bucket_{_partition_id}/file.csv', 'CSV', 'a UInt32, b UInt32, c UInt32')
PARTITION BY a VALUES (1, 2, 3), (1, 4, 5), (10, 11, 12), (10, 13, 14), (20, 21, 22), (20, 23, 24);
```
В результате будут созданы три файла в разных бакетах: `my_bucket_1/file.csv`, `my_bucket_10/file.csv` и `my_bucket_20/file.csv`.
**Смотрите также**
- [Движок таблиц S3](../../engines/table-engines/integrations/s3.md)