---
slug: /en/engines/table-engines/integrations/hdfs
sidebar_position: 80
sidebar_label: HDFS
---

# HDFS

This engine provides integration with the [Apache Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) ecosystem by allowing you to manage data on [HDFS](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) via ClickHouse. This engine is similar to the [File](../../../engines/table-engines/special/file.md#table_engines-file) and [URL](../../../engines/table-engines/special/url.md#table_engines-url) engines, but provides Hadoop-specific features.

## Usage {#usage}

``` sql
ENGINE = HDFS(URI, format)
```

**Engine Parameters**

- `URI` - the whole file URI in HDFS. The path part of `URI` may contain globs. In this case the table is read-only.
- `format` - specifies one of the available file formats. To perform
`SELECT` queries, the format must be supported for input, and to perform
`INSERT` queries – for output. The available formats are listed in the
[Formats](../../../interfaces/formats.md#formats) section.
- [PARTITION BY expr]

### PARTITION BY

`PARTITION BY` — Optional. In most cases you don't need a partition key, and if it is needed you generally don't need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use too granular partitioning. Don't partition your data by client identifiers or names (instead, make client identifier or name the first column in the ORDER BY expression).

For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](/docs/en/sql-reference/data-types/date.md). The partition names here have the `"YYYYMM"` format.
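
As a rough illustration of the clause above, a partitioned table might be declared as follows. This is a sketch under assumptions: the table name `hdfs_partitioned_table`, the `date` column, and the file path are hypothetical, and it assumes that your ClickHouse version supports the `{_partition_id}` placeholder in the HDFS path for partitioned writes. Verify this against your version before relying on it.

``` sql
-- Hypothetical table: each month is written to a separate file.
-- Assumption: `{_partition_id}` in the path is replaced with the partition name (for example, 202401).
CREATE TABLE hdfs_partitioned_table (date Date, name String, value UInt32)
ENGINE = HDFS('hdfs://hdfs1:9000/partitioned/data_{_partition_id}.tsv', 'TSV')
PARTITION BY toYYYYMM(date)
```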

**Example:**

**1.** Set up the `hdfs_engine_table` table:

``` sql
CREATE TABLE hdfs_engine_table (name String, value UInt32) ENGINE=HDFS('hdfs://hdfs1:9000/other_storage', 'TSV')
```

**2.** Fill the file:

``` sql
INSERT INTO hdfs_engine_table VALUES ('one', 1), ('two', 2), ('three', 3)
```

**3.** Query the data:

``` sql
SELECT * FROM hdfs_engine_table LIMIT 2
```

``` text
┌─name─┬─value─┐
│ one  │     1 │
│ two  │     2 │
└──────┴───────┘
```

## Implementation Details {#implementation-details}

- Reads and writes can be parallel.
- Not supported:
    - `ALTER` and `SELECT...SAMPLE` operations.
    - Indexes.
- [Zero-copy](../../../operations/storing-data.md#zero-copy) replication is possible, but not recommended.

  :::note Zero-copy replication is not ready for production
  Zero-copy replication is disabled by default in ClickHouse version 22.8 and higher.  This feature is not recommended for production use.
  :::

**Globs in path**

Multiple path components can have globs. To be processed, a file must exist and match the whole path pattern. The list of files is determined during `SELECT` (not at `CREATE` time).

- `*` — Substitutes any number of any characters except `/`, including the empty string.
- `?` — Substitutes any single character.
- `{some_string,another_string,yet_another_one}` — Substitutes any of the strings `'some_string', 'another_string', 'yet_another_one'`.
- `{N..M}` — Substitutes any number in the range from N to M, including both borders.

Constructions with `{}` are similar to the [remote](../../../sql-reference/table-functions/remote.md) table function.

**Example**

1.  Suppose we have several files in TSV format with the following URIs on HDFS:

    - 'hdfs://hdfs1:9000/some_dir/some_file_1'
    - 'hdfs://hdfs1:9000/some_dir/some_file_2'
    - 'hdfs://hdfs1:9000/some_dir/some_file_3'
    - 'hdfs://hdfs1:9000/another_dir/some_file_1'
    - 'hdfs://hdfs1:9000/another_dir/some_file_2'
    - 'hdfs://hdfs1:9000/another_dir/some_file_3'

1.  There are several ways to make a table consisting of all six files:

<!-- -->

``` sql
CREATE TABLE table_with_range (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV')
```

Another way:

``` sql
CREATE TABLE table_with_question_mark (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_?', 'TSV')
```

The table consists of all the files in both directories (all files must match the format and schema described in the query):

``` sql
CREATE TABLE table_with_asterisk (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV')
```

:::note
If the listing of files contains number ranges with leading zeros, use the construction with braces for each digit separately or use `?`.
:::

**Example**

Create a table from files named `file000`, `file001`, … , `file999`:

``` sql
CREATE TABLE big_table (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/big_dir/file{0..9}{0..9}{0..9}', 'CSV')
```
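
The note above also mentions `?` as an alternative to per-digit braces. A minimal sketch of that variant follows; the table name `big_table_question_mark` is hypothetical. Keep in mind that `?` matches any single character, not only digits, so this pattern would also pick up a file such as `fileabc` if one existed in the directory.

``` sql
-- Hypothetical variant of the table above: three `?` globs instead of three {0..9} ranges.
CREATE TABLE big_table_question_mark (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/big_dir/file???', 'CSV')
```
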

## Configuration {#configuration}

Similar to GraphiteMergeTree, the HDFS engine supports extended configuration using the ClickHouse config file. There are two configuration keys that you can use: global (`hdfs`) and user-level (`hdfs_*`). The global configuration is applied first, and then the user-level configuration is applied (if it exists).

``` xml
  <!-- Global configuration options for HDFS engine type -->
  <hdfs>
    <hadoop_kerberos_keytab>/tmp/keytab/clickhouse.keytab</hadoop_kerberos_keytab>
    <hadoop_kerberos_principal>clickuser@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
    <hadoop_security_authentication>kerberos</hadoop_security_authentication>
  </hdfs>

  <!-- Configuration specific for user "root" -->
  <hdfs_root>
    <hadoop_kerberos_principal>root@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
  </hdfs_root>
```

### Configuration Options {#configuration-options}

#### Supported by libhdfs3 {#supported-by-libhdfs3}


| **parameter**                                         | **default value**       |
| -                                                  | -                    |
| rpc\_client\_connect\_tcpnodelay                      | true                    |
| dfs\_client\_read\_shortcircuit                       | true                    |
| output\_replace-datanode-on-failure                   | true                    |
| input\_notretry-another-node                          | false                   |
| input\_localread\_mappedfile                          | true                    |
| dfs\_client\_use\_legacy\_blockreader\_local          | false                   |
| rpc\_client\_ping\_interval                           | 10  * 1000              |
| rpc\_client\_connect\_timeout                         | 600 * 1000              |
| rpc\_client\_read\_timeout                            | 3600 * 1000             |
| rpc\_client\_write\_timeout                           | 3600 * 1000             |
| rpc\_client\_socket\_linger\_timeout                  | -1                      |
| rpc\_client\_connect\_retry                           | 10                      |
| rpc\_client\_timeout                                  | 3600 * 1000             |
| dfs\_default\_replica                                 | 3                       |
| input\_connect\_timeout                               | 600 * 1000              |
| input\_read\_timeout                                  | 3600 * 1000             |
| input\_write\_timeout                                 | 3600 * 1000             |
| input\_localread\_default\_buffersize                 | 1 * 1024 * 1024         |
| dfs\_prefetchsize                                     | 10                      |
| input\_read\_getblockinfo\_retry                      | 3                       |
| input\_localread\_blockinfo\_cachesize                | 1000                    |
| input\_read\_max\_retry                               | 60                      |
| output\_default\_chunksize                            | 512                     |
| output\_default\_packetsize                           | 64 * 1024               |
| output\_default\_write\_retry                         | 10                      |
| output\_connect\_timeout                              | 600 * 1000              |
| output\_read\_timeout                                 | 3600 * 1000             |
| output\_write\_timeout                                | 3600 * 1000             |
| output\_close\_timeout                                | 3600 * 1000             |
| output\_packetpool\_size                              | 1024                    |
| output\_heartbeat\_interval                           | 10 * 1000               |
| dfs\_client\_failover\_max\_attempts                  | 15                      |
| dfs\_client\_read\_shortcircuit\_streams\_cache\_size | 256                     |
| dfs\_client\_socketcache\_expiryMsec                  | 3000                    |
| dfs\_client\_socketcache\_capacity                    | 16                      |
| dfs\_default\_blocksize                               | 64 * 1024 * 1024        |
| dfs\_default\_uri                                     | "hdfs://localhost:9000" |
| hadoop\_security\_authentication                      | "simple"                |
| hadoop\_security\_kerberos\_ticket\_cache\_path       | ""                      |
| dfs\_client\_log\_severity                            | "INFO"                  |
| dfs\_domain\_socket\_path                             | ""                      |


[HDFS Configuration Reference](https://hawq.apache.org/docs/userguide/2.3.0.0-incubating/reference/HDFSConfigurationParameterReference.html) might explain some parameters.


#### ClickHouse extras {#clickhouse-extras}

| **parameter**                                         | **default value**       |
| -                                                  | -                    |
| hadoop\_kerberos\_keytab                              | ""                      |
| hadoop\_kerberos\_principal                           | ""                      |
| libhdfs3\_conf                                        | ""                      |

### Limitations {#limitations}

- `hadoop_security_kerberos_ticket_cache_path` and `libhdfs3_conf` can be set only globally, not per user.

## Kerberos support {#kerberos-support}

If the `hadoop_security_authentication` parameter is set to `kerberos`, ClickHouse authenticates via Kerberos.
The relevant parameters are listed in [ClickHouse extras](#clickhouse-extras); `hadoop_security_kerberos_ticket_cache_path` may also be useful.
Note that, due to libhdfs3 limitations, only the old-fashioned approach is supported:
datanode communications are not secured by SASL (`HADOOP_SECURE_DN_USER` is a reliable indicator of this
security approach). Use `tests/integration/test_storage_kerberized_hdfs/hdfs_configs/bootstrap.sh` for reference.

If `hadoop_kerberos_keytab`, `hadoop_kerberos_principal` or `hadoop_security_kerberos_ticket_cache_path` are specified, Kerberos authentication will be used. `hadoop_kerberos_keytab` and `hadoop_kerberos_principal` are mandatory in this case.

## HDFS Namenode HA support {#namenode-ha}

libhdfs3 supports HDFS namenode HA.

- Copy `hdfs-site.xml` from an HDFS node to `/etc/clickhouse-server/`.
- Add the following fragment to the ClickHouse config file:

``` xml
  <hdfs>
    <libhdfs3_conf>/etc/clickhouse-server/hdfs-site.xml</libhdfs3_conf>
  </hdfs>
```

- Then use the value of the `dfs.nameservices` tag in `hdfs-site.xml` as the namenode address in the HDFS URI. For example, replace `hdfs://appadmin@192.168.101.11:8020/abc/` with `hdfs://appadmin@my_nameservice/abc/`.
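
With the steps above in place, a table definition could refer to the nameservice instead of a single namenode. The following is only a sketch: the table name `hdfs_ha_table` and the file name `some_file` are hypothetical, and `my_nameservice` stands for whatever `dfs.nameservices` contains in your `hdfs-site.xml`.

``` sql
-- Hypothetical table: the URI uses the nameservice name rather than a host:port pair.
CREATE TABLE hdfs_ha_table (name String, value UInt32) ENGINE = HDFS('hdfs://appadmin@my_nameservice/abc/some_file', 'TSV')
```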


## Virtual Columns {#virtual-columns}

- `_path` — Path to the file. Type: `LowCardinality(String)`.
- `_file` — Name of the file. Type: `LowCardinality(String)`.
- `_size` — Size of the file in bytes. Type: `Nullable(UInt64)`. If the size is unknown, the value is `NULL`.
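
Virtual columns are not returned by `SELECT *` and have to be named explicitly. As a small sketch, the query below counts rows per source file; it assumes the `table_with_asterisk` table from the glob examples exists, and the alias `row_count` is chosen here only for illustration.

``` sql
-- Count how many rows each underlying HDFS file contributes.
SELECT _path, _file, count() AS row_count
FROM table_with_asterisk
GROUP BY _path, _file
ORDER BY _path
```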

## Storage Settings {#storage-settings}

- [hdfs_truncate_on_insert](/docs/en/operations/settings/settings.md#hdfs-truncate-on-insert) - allows truncating the file before inserting into it. Disabled by default.
- [hdfs_create_new_file_on_insert](/docs/en/operations/settings/settings.md#hdfs_create_new_file_on_insert) - allows creating a new file on each insert if the format has a suffix. Disabled by default.
- [hdfs_skip_empty_files](/docs/en/operations/settings/settings.md#hdfs_skip_empty_files) - allows skipping empty files while reading. Disabled by default.
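
These settings can be changed per session or per query like any other ClickHouse setting. A minimal sketch, assuming the `hdfs_engine_table` table from the usage example already has a file written by an earlier `INSERT`:

``` sql
-- Assumption: the target file already exists from a previous INSERT.
-- With truncation enabled, the file is truncated before the new rows are written.
SET hdfs_truncate_on_insert = 1;
INSERT INTO hdfs_engine_table VALUES ('four', 4);
```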

**See Also**

- [Virtual columns](../../../engines/table-engines/index.md#table_engines-virtual_columns)
-												Get rid of toc_en.yml (#10023)


											
										
										
											2020-04-03 13:23:32 +00:00
+								---
-												add slugs

											
										
										
											2022-08-28 14:53:34 +00:00
+								slug: /en/engines/table-engines/integrations/hdfs
-												Alphabetize table functions and engines

											
										
										
											2023-06-23 13:16:22 +00:00
+								sidebar_position: 80
-												Removed /ja folder, cleaned up /ru markdown

											
										
										
											2022-04-09 13:29:05 +00:00
+								sidebar_label: HDFS
-												Get rid of toc_en.yml (#10023)


											
										
										
											2020-04-03 13:23:32 +00:00
+								---
-												Remove H1 anchor tags from docs

											
										
										
											2022-06-02 10:55:18 +00:00
+								# HDFS
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
-												Docs: Fix formatting in HDFS engine

											
										
										
											2021-12-16 14:46:12 +00:00
+								This engine provides integration with the [Apache Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) ecosystem by allowing to manage data on [HDFS](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) via ClickHouse. This engine is similar to the [File](../../../engines/table-engines/special/file.md#table_engines-file) and [URL](../../../engines/table-engines/special/url.md#table_engines-url) engines, but provides Hadoop-specific features.
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								## Usage {#usage}
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` sql
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
+								ENGINE = HDFS(URI, format)
 								```
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
-												insert_quorum_parallel-EdTranRus

											
										
										
											2021-12-23 23:28:39 +00:00
+								**Engine Parameters**
 								- `URI` - whole file URI in HDFS. The path part of `URI` may contain globs. In this case the table would be readonly.
-												Minor fixups

											
										
										
											2023-04-19 16:10:51 +00:00
+								- `format` - specifies one of the available file formats. To perform
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
+								`SELECT` queries, the format must be supported for input, and to perform
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								`INSERT` queries – for output. The available formats are listed in the
-												Get rid of toc_en.yml (#10023)


											
										
										
											2020-04-03 13:23:32 +00:00
+								[Formats](../../../interfaces/formats.md#formats) section.
-												add PARTITION BY to s3 and hdfs docs

											
										
										
											2023-01-25 14:09:28 +00:00
+								- [PARTITION BY expr]
 								### PARTITION BY
 								`PARTITION BY` — Optional. In most cases you don't need a partition key, and if it is needed you generally don't need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use too granular partitioning. Don't partition your data by client identifiers or names (instead, make client identifier or name the first column in the ORDER BY expression).
 								For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](/docs/en/sql-reference/data-types/date.md). The partition names here have the `"YYYYMM"` format.
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
 								**Example:**
-												Update docs/en/operations/table_engines/hdfs.md

Co-Authored-By: Ivan Blinkov <github@blinkov.ru>
											
										
										
											2019-09-04 13:26:52 +00:00
+								**1.** Set up the `hdfs_engine_table` table:
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` sql
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
+								CREATE TABLE hdfs_engine_table (name String, value UInt32) ENGINE=HDFS('hdfs://hdfs1:9000/other_storage', 'TSV')
 								```
-												Improvement

											
										
										
											2019-09-04 19:55:56 +00:00
 								**2.** Fill file:
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
 								``` sql
-												Improvement

											
										
										
											2019-09-04 19:55:56 +00:00
+								INSERT INTO hdfs_engine_table VALUES ('one', 1), ('two', 2), ('three', 3)
 								```
 								**3.** Query the data:
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` sql
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
+								SELECT * FROM hdfs_engine_table LIMIT 2
 								```
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` text
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
+								┌─name─┬─value─┐
 								│ one  │     1 │
 								│ two  │     2 │
 								└──────┴───────┘
 								```
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								## Implementation Details {#implementation-details}
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
-												Docs: Replace annoying three spaces in enumerations by a single space

											
										
										
											2023-04-19 15:55:29 +00:00
+								- Reads and writes can be parallel.
 								- Not supported:
 								    - `ALTER` and `SELECT...SAMPLE` operations.
 								    - Indexes.
 								    - [Zero-copy](../../../operations/storing-data.md#zero-copy) replication is possible, but not recommended.
-												Alphabetize table functions and engines

											
										
										
											2023-06-23 13:16:22 +00:00
-												standardize admonitions

											
										
										
											2023-03-27 18:54:05 +00:00
+								  :::note Zero-copy replication is not ready for production
-												update for setting change

											
										
										
											2022-08-18 20:05:44 +00:00
+								  Zero-copy replication is disabled by default in ClickHouse version 22.8 and higher.  This feature is not recommended for production use.
 								  :::
-												Add docs for hdfs and fix some review comments

											
										
										
											2019-09-03 14:23:51 +00:00
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
+								**Globs in path**
 								Multiple path components can have globs. For being processed file should exists and matches to the whole path pattern. Listing of files determines during `SELECT` (not at `CREATE` moment).
-												Docs: Replace annoying three spaces in enumerations by a single space

											
										
										
											2023-04-19 15:55:29 +00:00
+								- `*` — Substitutes any number of any characters except `/` including empty string.
 								- `?` — Substitutes any single character.
 								- `{some_string,another_string,yet_another_one}` — Substitutes any of strings `'some_string', 'another_string', 'yet_another_one'`.
 								- `{N..M}` — Substitutes any number in range from N to M including both borders.
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
-												[docs] replace underscores with hyphens (#10606)

* Replace underscores with hyphens

* remove temporary code

* fix style check

* fix collapse
											
										
										
											2020-04-30 18:19:18 +00:00
+								Constructions with `{}` are similar to the [remote](../../../sql-reference/table-functions/remote.md) table function.
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
 								**Example**
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+.  Suppose we have several files in TSV format with the following URIs on HDFS:
-												Minor fixups

											
										
										
											2023-04-19 16:10:51 +00:00
+								    - 'hdfs://hdfs1:9000/some_dir/some_file_1'
 								    - 'hdfs://hdfs1:9000/some_dir/some_file_2'
 								    - 'hdfs://hdfs1:9000/some_dir/some_file_3'
 								    - 'hdfs://hdfs1:9000/another_dir/some_file_1'
 								    - 'hdfs://hdfs1:9000/another_dir/some_file_2'
 								    - 'hdfs://hdfs1:9000/another_dir/some_file_3'
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+.  There are several ways to make a table consisting of all six files:
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								<!-- -->
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` sql
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
+								CREATE TABLE table_with_range (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV')
 								```
 								Another way:
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` sql
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
+								CREATE TABLE table_with_question_mark (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_?', 'TSV')
 								```
 								Table consists of all the files in both directories (all files should satisfy format and schema described in query):
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` sql
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
+								CREATE TABLE table_with_asterisk (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV')
 								```
-												standardize admonitions

											
										
										
											2023-03-27 18:54:05 +00:00
+								:::note
-												Removed /ja folder, cleaned up /ru markdown

											
										
										
											2022-04-09 13:29:05 +00:00
+								If the listing of files contains number ranges with leading zeros, use the construction with braces for each digit separately or use `?`.
 								:::
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
 								**Example**
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								Create table with files named `file000`, `file001`, … , `file999`:
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
-												Normalization for en markdown (#9763)


											
										
										
											2020-03-20 10:10:48 +00:00
+								``` sql
-												Update hdfs.md
											
										
										
											2020-08-28 07:56:44 +00:00
+								CREATE TABLE big_table (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/big_dir/file{0..9}{0..9}{0..9}', 'CSV')
-												cleanup hdfs docs

											
										
										
											2019-09-20 11:26:00 +00:00
+								```
-												HADOOP_SECURE_DN_USER way, kinit thread, docker capabilities

											
										
										
											2020-09-28 17:20:04 +00:00
+								## Configuration {#configuration}
 								Similar to GraphiteMergeTree, the HDFS engine supports extended configuration using the ClickHouse config file. There are two configuration keys that you can use: global (`hdfs`) and user-level (`hdfs_*`). The global configuration is applied first, and then the user-level configuration is applied (if it exists).
 								``` xml
 								  <!-- Global configuration options for HDFS engine type -->
 								  <hdfs>
 									<hadoop_kerberos_keytab>/tmp/keytab/clickhouse.keytab</hadoop_kerberos_keytab>
 									<hadoop_kerberos_principal>clickuser@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
 									<hadoop_security_authentication>kerberos</hadoop_security_authentication>
 								  </hdfs>
 								  <!-- Configuration specific for user "root" -->
 								  <hdfs_root>
 									<hadoop_kerberos_principal>root@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
 								  </hdfs_root>
 								```
-												Docs for HDFS

											
										
										
											2021-08-01 02:55:24 +00:00
+								### Configuration Options {#configuration-options}
 								#### Supported by libhdfs3 {#supported-by-libhdfs3}
-												HADOOP_SECURE_DN_USER way, kinit thread, docker capabilities

											
										
										
											2020-09-28 17:20:04 +00:00
 								| **parameter**                                         | **default value**       |
-												Minor fixups

											
										
										
											2023-04-19 16:10:51 +00:00
+								| -                                                  | -                    |
-												HADOOP_SECURE_DN_USER way, kinit thread, docker capabilities

											
										
										
											2020-09-28 17:20:04 +00:00
+								| rpc\_client\_connect\_tcpnodelay                      | true                    |
 								| dfs\_client\_read\_shortcircuit                       | true                    |
 								| output\_replace-datanode-on-failure                   | true                    |
 								| input\_notretry-another-node                          | false                   |
 								| input\_localread\_mappedfile                          | true                    |
 								| dfs\_client\_use\_legacy\_blockreader\_local          | false                   |
 								| rpc\_client\_ping\_interval                           | 10  * 1000              |
 								| rpc\_client\_connect\_timeout                         | 600 * 1000              |
 								| rpc\_client\_read\_timeout                            | 3600 * 1000             |
 								| rpc\_client\_write\_timeout                           | 3600 * 1000             |
-												CI: Fix aspell on nested docs

											
										
										
											2023-06-02 11:30:05 +00:00
+								| rpc\_client\_socket\_linger\_timeout                  | -1                      |
-												HADOOP_SECURE_DN_USER way, kinit thread, docker capabilities

											
										
										
											2020-09-28 17:20:04 +00:00
+								| rpc\_client\_connect\_retry                           | 10                      |
 								| rpc\_client\_timeout                                  | 3600 * 1000             |
 								| dfs\_default\_replica                                 | 3                       |
 								| input\_connect\_timeout                               | 600 * 1000              |
 								| input\_read\_timeout                                  | 3600 * 1000             |
 								| input\_write\_timeout                                 | 3600 * 1000             |
 								| input\_localread\_default\_buffersize                 | 1 * 1024 * 1024         |
 								| dfs\_prefetchsize                                     | 10                      |
 								| input\_read\_getblockinfo\_retry                      | 3                       |
| input\_localread\_blockinfo\_cachesize                | 1000                    |
| input\_read\_max\_retry                               | 60                      |
| output\_default\_chunksize                            | 512                     |
| output\_default\_packetsize                           | 64 * 1024               |
| output\_default\_write\_retry                         | 10                      |
| output\_connect\_timeout                              | 600 * 1000              |
| output\_read\_timeout                                 | 3600 * 1000             |
| output\_write\_timeout                                | 3600 * 1000             |
| output\_close\_timeout                                | 3600 * 1000             |
| output\_packetpool\_size                              | 1024                    |
| output\_heartbeat\_interval                           | 10 * 1000               |
| dfs\_client\_failover\_max\_attempts                  | 15                      |
| dfs\_client\_read\_shortcircuit\_streams\_cache\_size | 256                     |
| dfs\_client\_socketcache\_expiryMsec                  | 3000                    |
| dfs\_client\_socketcache\_capacity                    | 16                      |
| dfs\_default\_blocksize                               | 64 * 1024 * 1024        |
| dfs\_default\_uri                                     | "hdfs://localhost:9000" |
| hadoop\_security\_authentication                      | "simple"                |
| hadoop\_security\_kerberos\_ticket\_cache\_path       | ""                      |
| dfs\_client\_log\_severity                            | "INFO"                  |
| dfs\_domain\_socket\_path                             | ""                      |

[HDFS Configuration Reference](https://hawq.apache.org/docs/userguide/2.3.0.0-incubating/reference/HDFSConfigurationParameterReference.html) might explain some parameters.

#### ClickHouse extras {#clickhouse-extras}

| **parameter**                                         | **default value**       |
| -                                                     | -                       |
| hadoop\_kerberos\_keytab                              | ""                      |
| hadoop\_kerberos\_principal                           | ""                      |
| libhdfs3\_conf                                        | ""                      |

### Limitations {#limitations}

* `hadoop_security_kerberos_ticket_cache_path` and `libhdfs3_conf` can only be set globally, not per user.

## Kerberos support {#kerberos-support}

If the `hadoop_security_authentication` parameter is set to `kerberos`, ClickHouse authenticates via Kerberos.
The relevant parameters are listed under [ClickHouse extras](#clickhouse-extras); `hadoop_security_kerberos_ticket_cache_path` may also be of help.
Note that, due to libhdfs3 limitations, only the old-fashioned approach is supported:
datanode communications are not secured by SASL (`HADOOP_SECURE_DN_USER` is a reliable indicator of this
security approach). Use `tests/integration/test_storage_kerberized_hdfs/hdfs_configs/bootstrap.sh` for reference.

If `hadoop_kerberos_keytab`, `hadoop_kerberos_principal` or `hadoop_security_kerberos_ticket_cache_path` is specified, Kerberos authentication is used. `hadoop_kerberos_keytab` and `hadoop_kerberos_principal` are mandatory in this case.

## HDFS Namenode HA support {#namenode-ha}

libhdfs3 supports HDFS namenode HA.

- Copy `hdfs-site.xml` from an HDFS node to `/etc/clickhouse-server/`.
- Add the following piece to the ClickHouse config file:

``` xml
  <hdfs>
    <libhdfs3_conf>/etc/clickhouse-server/hdfs-site.xml</libhdfs3_conf>
  </hdfs>
```

- Then use the `dfs.nameservices` tag value from `hdfs-site.xml` as the namenode address in the HDFS URI. For example, replace `hdfs://appadmin@192.168.101.11:8020/abc/` with `hdfs://appadmin@my_nameservice/abc/`.
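
As a minimal sketch of the last step, a table could be created against the nameservice rather than a single namenode address. The table name `hdfs_ha_table`, the file name `data.tsv`, and the column list are hypothetical; `my_nameservice` stands for whatever `dfs.nameservices` is set to in your `hdfs-site.xml`.

``` sql
-- Hypothetical table; 'my_nameservice' must match dfs.nameservices in hdfs-site.xml.
CREATE TABLE hdfs_ha_table (name String, value UInt32)
ENGINE = HDFS('hdfs://appadmin@my_nameservice/abc/data.tsv', 'TSV');

-- Reads go through the nameservice rather than a fixed namenode host.
SELECT * FROM hdfs_ha_table LIMIT 10;
```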

## Virtual Columns {#virtual-columns}

- `_path` — Path to the file. Type: `LowCardinality(String)`.
- `_file` — Name of the file. Type: `LowCardinality(String)`.
- `_size` — Size of the file in bytes. Type: `Nullable(UInt64)`. If the size is unknown, the value is `NULL`.
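
Virtual columns can be selected alongside regular columns. A minimal sketch, assuming the `hdfs_engine_table` table from the example earlier on this page (columns `name` and `value`):

``` sql
-- Virtual columns are not returned by SELECT *; they must be named explicitly.
SELECT _path, _file, _size, name, value
FROM hdfs_engine_table
LIMIT 5;

-- Count rows per source file, which is useful when the URI contains globs.
SELECT _file, count() AS row_count
FROM hdfs_engine_table
GROUP BY _file;
```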

## Storage Settings {#storage-settings}

- [hdfs_truncate_on_insert](/docs/en/operations/settings/settings.md#hdfs-truncate-on-insert) - allows truncating the file before inserting into it. Disabled by default.
- [hdfs_create_multiple_files](/docs/en/operations/settings/settings.md#hdfs_allow_create_multiple_files) - allows creating a new file on each insert if the format has a suffix. Disabled by default.
- [hdfs_skip_empty_files](/docs/en/operations/settings/settings.md#hdfs_skip_empty_files) - allows skipping empty files while reading. Disabled by default.
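
These settings can be changed per session or per query. A minimal sketch, again assuming the `hdfs_engine_table` table from the example earlier on this page; the inserted values are purely illustrative:

``` sql
-- Truncate the target file before inserting into it.
SET hdfs_truncate_on_insert = 1;
INSERT INTO hdfs_engine_table VALUES ('one', 1), ('two', 2);

-- Skip empty files for this query only.
SELECT * FROM hdfs_engine_table SETTINGS hdfs_skip_empty_files = 1;
```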

**See Also**

- [Virtual columns](../../../engines/table-engines/index.md#table_engines-virtual_columns)