mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-11-15 03:53:41 +00:00

Yarik Briukhovetskyi edccd30016

Merge branch 'ClickHouse:master' into hive_style_partitioning

2024-08-08 14:52:33 +02:00

4.8 KiB

Raw Blame History

slug	sidebar_position	sidebar_label
/en/sql-reference/table-functions/hdfs	80	hdfs

hdfs

Creates a table from files in HDFS. This table function is similar to url and file ones.

hdfs(URI, format, structure)

Input parameters

URI — The relative URI to the file in HDFS. Path to file support following globs in readonly mode: *, ?, {abc,def} and {N..M} where N, M — numbers, `'abc', 'def' — strings.
format — The format of the file.
structure — Structure of the table. Format 'column1_name column1_type, column2_name column2_type, ...'.

Returned value

A table with the specified structure for reading or writing data in the specified file.

Example

Table from hdfs://hdfs1:9000/test and selection of the first two rows from it:

SELECT *
FROM hdfs('hdfs://hdfs1:9000/test', 'TSV', 'column1 UInt32, column2 UInt32, column3 UInt32')
LIMIT 2

┌─column1─┬─column2─┬─column3─┐
│       1 │       2 │       3 │
│       3 │       2 │       1 │
└─────────┴─────────┴─────────┘

Globs in path

Paths may use globbing. Files must match the whole path pattern, not only the suffix or prefix.

* — Represents arbitrarily many characters except / but including the empty string.
** — Represents all files inside a folder recursively.
? — Represents an arbitrary single character.
{some_string,another_string,yet_another_one} — Substitutes any of strings 'some_string', 'another_string', 'yet_another_one'. The strings can contain the / symbol.
{N..M} — Represents any number >= N and <= M.

Constructions with {} are similar to the remote and file table functions.

Example

Suppose that we have several files with following URIs on HDFS:

‘hdfs://hdfs1:9000/some_dir/some_file_1’
‘hdfs://hdfs1:9000/some_dir/some_file_2’
‘hdfs://hdfs1:9000/some_dir/some_file_3’
‘hdfs://hdfs1:9000/another_dir/some_file_1’
‘hdfs://hdfs1:9000/another_dir/some_file_2’
‘hdfs://hdfs1:9000/another_dir/some_file_3’

Query the amount of rows in these files:

SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV', 'name String, value UInt32')

Query the amount of rows in all files of these two directories:

SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV', 'name String, value UInt32')

:::note If your listing of files contains number ranges with leading zeros, use the construction with braces for each digit separately or use ?. :::

Example

Query the data from files named file000, file001, ... , file999:

SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/big_dir/file{0..9}{0..9}{0..9}', 'CSV', 'name String, value UInt32')

Virtual Columns

_path — Path to the file. Type: LowCardinalty(String).
_file — Name of the file. Type: LowCardinalty(String).
_size — Size of the file in bytes. Type: Nullable(UInt64). If the size is unknown, the value is NULL.
_time — Last modified time of the file. Type: Nullable(DateTime). If the time is unknown, the value is NULL.

Hive-style partitioning

When setting use_hive_partitioning is set to 1, ClickHouse will detect Hive-style partitioning in the path (/name=value/) and will allow to use partition columns as virtual columns in the query. These virtual columns will have the same names as in the partitioned path, but starting with _.

Example

Use virtual column, created with Hive-style partitioning

SET use_hive_partitioning = 1;
SELECT * from HDFS('hdfs://hdfs1:9000/data/path/date=*/country=*/code=*/*.parquet') where _date > '2020-01-01' and _country = 'Netherlands' and _code = 42;

Storage Settings

hdfs_truncate_on_insert - allows to truncate file before insert into it. Disabled by default.
hdfs_create_multiple_files - allows to create a new file on each insert if format has suffix. Disabled by default.
hdfs_skip_empty_files - allows to skip empty files while reading. Disabled by default.
ignore_access_denied_multidirectory_globs - allows to ignore permission denied errors for multi-directory globs.

See Also

Virtual columns

4.8 KiB Raw Blame History Unescape Escape

hdfs

Globs in path

Virtual Columns

Hive-style partitioning

Storage Settings

4.8 KiB

Raw Blame History