Commit Graph

411 Commits

Author SHA1 Message Date
Dmitry Kardymon
ed318d1035 Add input_format_csv_ignore_extra_columns setting (prototype) 2023-06-14 10:35:36 +00:00
kevinyhzou
f3b99156ac review fix 2023-06-14 10:48:21 +08:00
Kruglov Pavel
607f337d67
Merge pull request #50592 from Avogar/max-bytes-to-read-in-schema-inference
Add setting to limit the number of bytes to read in schema inference
2023-06-13 16:47:57 +02:00
kevinyhzou
911f8ad8dc use whitespace or tab as field delimiter 2023-06-12 11:57:52 +08:00
kevinyhzou
48e1b21aab Add feature to support read csv by space & tab delimiter 2023-06-08 20:34:30 +08:00
Kruglov Pavel
1baa6404e6
Merge branch 'master' into skip-trailing-empty-lines 2023-06-06 19:39:34 +02:00
avogar
df50833b70 Allow to skip trailing empty lines in CSV/TSV/CustomeSeparated formats 2023-06-06 17:33:05 +00:00
Kruglov Pavel
af880a6f3b
Merge branch 'master' into max-bytes-to-read-in-schema-inference 2023-06-06 14:47:58 +02:00
Nikita Mikhaylov
e87348010d
Rework loading and removing of data parts for MergeTree tables. (#49474)
Co-authored-by: Sergei Trifonov <sergei@clickhouse.com>
2023-06-06 14:42:56 +02:00
avogar
33e51d4f3b Add setting to limit the number of bytes to read in schema inference 2023-06-05 15:22:04 +00:00
Alexey Gerasimchuk
9958731c27
Merge branch 'master' into ADQM-830 2023-06-05 07:46:47 +10:00
Michael Kolupaev
b51064a508 Get rid of SeekableReadBufferFactory, add SeekableReadBuffer::readBigAt() instead 2023-06-01 18:48:30 -07:00
Alexey Gerasimchuck
75791d7a63 Added input_format_csv_trim_whitespaces parameter 2023-05-25 07:51:32 +00:00
Michael Kolupaev
6fd5d8e8ba Add setting output_format_parquet_compliant_nested_types to produce more compatible Parquet files 2023-05-19 18:39:50 +00:00
Alexey Milovidov
f6144ee32b
Revert "Make Pretty formats even prettier." 2023-05-13 02:45:07 +03:00
Alexey Milovidov
ef16077c72
Merge branch 'master' into pretty-time-squashing 2023-05-06 18:20:49 +03:00
Alexey Milovidov
90b0de5677 Make Pretty prettier 2023-05-05 06:36:53 +02:00
Michael Kolupaev
3bd1489f18 Propagate input_format_parquet_preserve_order to parallelizeOutputAfterReading() 2023-05-05 04:20:27 +00:00
Michael Kolupaev
eb3b774ad0 Better control over Parquet row group size 2023-05-04 14:59:55 -07:00
Nikita Mikhaylov
954e3b724c
Speedup outdated parts loading (#49317) 2023-05-03 18:56:45 +02:00
Michael Kolupaev
87be78e6de Better 2023-04-17 04:58:32 +00:00
Michael Kolupaev
e133633359 Parallel decoding with one row group per thread 2023-04-17 04:58:32 +00:00
Michael Kolupaev
683077890f Highly questionable refactoring (getInputMultistream() nonsense) 2023-04-17 04:58:32 +00:00
Michael Kolupaev
2d4fe85513 Something 2023-04-17 04:58:32 +00:00
Alexey Milovidov
bb6b775884 Merge branch 'master' into fuzzer-of-data-formats 2023-03-15 12:42:00 +01:00
Alexey Milovidov
f331b9b398 Fix errors and add tests 2023-03-13 23:49:28 +01:00
Alexey Milovidov
14647525f8 Merge branch 'fix-bson-bug' of github.com:Avogar/ClickHouse into fuzzer-of-data-formats 2023-03-13 22:45:00 +01:00
avogar
4213ec609f Proper fix for bug in parquet, revert reverted #45878 2023-03-13 18:22:09 +00:00
Alexey Milovidov
f33b651686 Add fuzzer for data formats 2023-03-13 04:51:50 +01:00
avogar
5a18acde90 Revert #45878 and add a test 2023-03-11 21:15:14 +00:00
Kruglov Pavel
fe973f3d6f
Merge branch 'master' into native-types-conversions 2023-03-09 13:03:25 +01:00
Kruglov Pavel
69a1309ade
Merge branch 'master' into native-types-conversions 2023-03-07 20:06:17 +01:00
avogar
5ab5902f38 Allow control compression in Parquet/ORC/Arrow output formats, support more compression for input formats 2023-03-01 21:27:46 +00:00
avogar
ab899bf2f3 Allow types conversion in Native input format 2023-02-27 19:28:19 +00:00
Kruglov Pavel
443dedddca
Merge branch 'master' into use-parquet-2 2023-02-27 14:31:43 +01:00
avogar
54622566df Add setting to change parquet version 2023-02-23 16:14:10 +00:00
Kruglov Pavel
9866ecfe8b
Merge branch 'master' into null-as-default-all-formats 2023-02-20 20:49:30 +01:00
Geoff Genz
be8bf3a6a3
Merge branch 'master' into http_client_version 2023-02-13 08:43:59 -07:00
avogar
d1efd02480 Extend setting input_format_null_as_default for more formats 2023-02-10 16:41:09 +00:00
Geoff Genz
99c3ff53c5 Merge remote-tracking branch 'origin/master' into http_client_version
# Conflicts:
#	src/Interpreters/Context.cpp
#	src/Interpreters/Context.h
2023-02-10 04:35:53 -07:00
Geoff Genz
7ed8ed0284 Add support for client_protocol_version sent with HTTP 2023-02-10 03:47:06 -07:00
Kruglov Pavel
4e2918cee3
Merge branch 'master' into parquet-fixed-binary 2023-02-08 12:31:13 +01:00
liuneng
17fc22a21e add parquet max_block_size setting 2023-02-01 18:29:20 +08:00
Alexey Milovidov
04078dbed3 Remove trash 2023-01-29 22:43:36 +01:00
Azat Khuzhin
1a8437f2c9 Add ability to ignore unknown keys in JSON object for named tuples
This can be useful in case your input JSON is complex, while you need
only few fields in it.

This behaviour is controlled by the
input_format_json_ignore_unknown_keys_in_named_tuple setting name, that
is turned OFF by default.

This will, almost, allow to parse gharchive dataset without jq. "almost"
because of two things:
- Tuple cannot be Nullable, so such keys with Tuple type in ClickHouse
  cannot be `null` in JSON
- You cannot use dot.dot notation to extract columns for file() engine,
  only tupleElement()

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-27 10:01:08 +01:00
Kruglov Pavel
23c12ac8ee
Merge branch 'master' into parquet-fixed-binary 2023-01-24 16:51:05 +01:00
Kruglov Pavel
cd1cd904a7
Merge branch 'master' into tsv-csv-detect-header 2023-01-23 23:49:56 +01:00
Alexander Tokmakov
70d1adfe4b
Better formatting for exception messages (#45449)
* save format string for NetException

* format exceptions

* format exceptions 2

* format exceptions 3

* format exceptions 4

* format exceptions 5

* format exceptions 6

* fix

* format exceptions 7

* format exceptions 8

* Update MergeTreeIndexGin.cpp

* Update AggregateFunctionMap.cpp

* Update AggregateFunctionMap.cpp

* fix
2023-01-24 00:13:58 +03:00
avogar
5bf4704e7a Support FixedSizeBinary type in Parquet/Arrow 2023-01-16 21:01:31 +00:00
Kruglov Pavel
e9d6590926
Merge branch 'master' into tsv-csv-detect-header 2023-01-16 17:50:24 +01:00
avogar
87b934c472 Insert default values in case of missing tuple elements in JSONEachRow 2023-01-12 16:36:44 +00:00
Kruglov Pavel
05a11ff4a4
Merge branch 'master' into tsv-csv-detect-header 2023-01-12 12:35:18 +01:00
Alexey Milovidov
0d39d26a34 Don't fix parallel formatting 2023-01-09 06:15:20 +01:00
avogar
7fcdb08ec6 Detect header in CSV/TSV/CustomSeparated files automatically 2023-01-05 22:57:25 +00:00
Kruglov Pavel
0a43976977
Merge branch 'master' into validate-types 2023-01-02 16:10:14 +01:00
Kruglov Pavel
4982d132fb
Merge branch 'master' into validate-types 2022-12-30 17:52:13 +01:00
Kruglov Pavel
894726bd8f
Merge branch 'master' into improve-streaming-engines 2022-12-29 22:59:45 +01:00
Raúl Marín
5de11979ce
Unify query elapsed time measurements (#43455)
* Unify query elapsed time reporting

* add-test: Make shell tests executable

* Add some tests around query elapsed time

* Style and ubsan
2022-12-28 21:01:41 +01:00
avogar
4ab3e90382 Validate types in table function arguments/CAST function arguments/JSONAsObject schema inference 2022-12-21 21:21:30 +00:00
Kruglov Pavel
5e01a3d74e
Merge branch 'master' into improve-streaming-engines 2022-12-21 10:51:50 +01:00
Kruglov Pavel
c5b2e4cc23
Merge branch 'master' into improve-streaming-engines 2022-12-15 18:44:35 +01:00
avogar
739ad23b1f Make better, fix bugs, improve error messages 2022-12-12 22:00:45 +00:00
avogar
cd4fa00d2c Merge branch 'master' of github.com:ClickHouse/ClickHouse into refactor-schema-inference 2022-12-09 14:45:10 +00:00
avogar
d0f9bb2ec2 Allow to parse JSON objects into Strings 2022-12-08 18:58:18 +00:00
avogar
7375a7d429 Refactor and improve schema inference for text formats 2022-12-07 21:19:27 +00:00
Kruglov Pavel
c35b2a6495
Add a limit for string size in RowBinary format (#43842) 2022-12-02 13:57:11 +01:00
Anton Popov
fe5fff0347
Merge pull request #43329 from xiedeyantu/support_nested_column
s3 table function can support select nested column using {column_name}.{subcolumn_name}
2022-11-29 22:27:19 +01:00
xiedeyantu
304b6ebf3a s3 table function can support select nested column using {column_name}.{subcolumn_name} 2022-11-23 23:36:12 +08:00
Kruglov Pavel
98d6b96c82
Merge pull request #42033 from mark-polokhov/BSONEachRow
Add BSONEachRow input/output format
2022-11-22 14:45:21 +01:00
Vitaly Baranov
ce81166c7e Fix style. 2022-11-16 01:35:11 +01:00
Vitaly Baranov
8e99f5fea3 Move maskSensitiveInfoInQueryForLogging() to src/Parsers/ 2022-11-14 18:55:19 +01:00
avogar
9e89af28c6 Refactor BSONEachRow format, fix bugs, support more data types, support parallel parsing and schema inference 2022-11-10 20:15:14 +00:00
Kruglov Pavel
b124875257
Merge branch 'master' into improve-streaming-engines 2022-11-03 13:22:06 +01:00
avogar
8e13d1f1ec Improve and refactor Kafka/StorageMQ/NATS and data formats 2022-10-28 16:41:10 +00:00
Alexey Milovidov
f88ed8195b Fix trash 2022-10-17 04:21:08 +02:00
Kruglov Pavel
6fc12dd922
Merge pull request #41703 from Avogar/json-object-each-row
Add setting to obtain object name as column value in JSONObjectEachRow format
2022-10-14 20:11:04 +02:00
Vitaly Baranov
f65d3ff95a Fix parallel parsing: segmentator now checks max_block_size. 2022-09-30 22:34:03 +02:00
Kruglov Pavel
f1ac2d66be
Merge branch 'master' into json-object-each-row 2022-09-28 14:15:02 +02:00
avogar
76be0d2ee1 Infer Object type only when allow_experimental_object_type is enabled 2022-09-27 23:07:36 +00:00
avogar
d3d06251a3 Add setting to obtain object name as column value in JSONObjectEachRow format 2022-09-22 16:48:54 +00:00
Kruglov Pavel
22e11aef2d
Merge pull request #40910 from Avogar/new-json-formats
Add new JSON formats, add improvements and refactoring
2022-09-21 14:19:08 +02:00
avogar
868ce8bc16 Fix comments, make better naming, add docs, add setting output_format_json_quote_64bit_floats 2022-09-20 13:49:17 +00:00
Alexey Milovidov
da01982652
Merge pull request #41046 from azat/build/llvm-15
Switch to llvm/clang 15
2022-09-16 07:31:06 +03:00
Azat Khuzhin
e8d7403a38 Suppress warning in FormatFactory::getFormatFromFileDescriptor() for FreeBSD
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-09-10 21:38:35 +02:00
zhenjial
bd9fabc3f7 code optimization, add test 2022-09-09 23:27:42 +08:00
zhenjial
469ceaa156 code optimization 2022-09-09 00:47:43 +08:00
avogar
c380decbbb Make better, add new settings 2022-09-08 16:07:20 +00:00
zhenjial
0f788d98f5 new implementation 2022-09-06 20:39:54 +08:00
avogar
b94e896c1c Remove logs 2022-09-01 19:01:27 +00:00
avogar
afc34dca41 Add new JSON formats, add improvements and refactoring 2022-09-01 19:00:24 +00:00
avogar
612ffaffde Make schema inference cache better, respect format settings that can change the schema 2022-08-19 16:39:13 +00:00
Nikolai Kochetov
5a85531ef7
Merge pull request #38286 from Avogar/schema-inference-cache
Add schema inference cache for s3/hdfs/file/url
2022-08-18 13:07:50 +02:00
avogar
8dd54c043d Merge branch 'master' of github.com:ClickHouse/ClickHouse into schema-inference-cache 2022-08-17 11:47:40 +00:00
avogar
e1ff996ec3 Allow to specify structure hints in schema inference 2022-08-16 09:46:57 +00:00
Kruglov Pavel
6b2186bfeb
Merge branch 'master' into numbers-schema-inference 2022-08-02 19:34:53 +02:00
Kruglov Pavel
381ea139c2
Merge branch 'master' into schema-inference-cache 2022-07-27 11:35:36 +02:00
avogar
784ee11594 Add settings to skip fields with unsupported types in Protobuf/CapnProto schema inference 2022-07-20 11:16:25 +00:00
Kruglov Pavel
b38241b08a
Merge branch 'master' into schema-inference-cache 2022-07-14 12:29:54 +02:00
avogar
7cde9d3b40 Add new features in schema inference 2022-07-13 15:57:55 +00:00
avogar
5b0fd31c64 Put column names in quotes 2022-06-30 16:14:30 +00:00
avogar
9bb68bc6de Add SQLInsert output format 2022-06-27 18:31:57 +00:00
avogar
5155262a16 Add some additional information to cache keys 2022-06-27 12:43:24 +00:00
Alexey Milovidov
5e9e5a4eaf
Merge pull request #37525 from Avogar/avro-structs
Support Maps and Records, allow to insert null as default in Avro format
2022-06-15 00:04:29 +03:00
Robert Schulze
1a0b5f33b3
More consistent use of platform macros
cmake/target.cmake defines macros for the supported platforms, this
commit changes predefined system macros to our own macros.

__linux__ --> OS_LINUX
__APPLE__ --> OS_DARWIN
__FreeBSD__ --> OS_FREEBSD
2022-06-10 10:22:31 +02:00
Kruglov Pavel
7cc87d9a65
Merge pull request #37537 from Avogar/skip-first-lines
Allow to skip some of the first lines in CSV/TSV formats
2022-05-31 14:26:21 +02:00
Kruglov Pavel
0615866aea
Merge pull request #37450 from Avogar/check-format-on-storage-creation
Check format name on storage creation
2022-05-30 14:23:20 +02:00
Alexey Milovidov
c50791dd3b Fix clang-tidy-14, part 1 2022-05-27 22:52:14 +02:00
avogar
4c9812d4c1 Allow to skip some of the first rows in CSV/TSV formats 2022-05-25 15:00:11 +00:00
avogar
038a422aeb Add setting to insert null as default 2022-05-25 12:56:59 +00:00
avogar
f782fa31c6 Merge branch 'master' of github.com:ClickHouse/ClickHouse into check-format-on-storage-creation 2022-05-25 08:42:54 +00:00
avogar
37b66c8a9e Check format name on storage creation 2022-05-23 12:48:48 +00:00
Kruglov Pavel
f539fb835d
Merge branch 'master' into formats-with-names 2022-05-23 12:14:20 +02:00
avogar
a4cf07708c Fix comments 2022-05-20 14:57:27 +00:00
avogar
a0369fb9a6 Allow to use String type instead of Binary in Arrow/Parquet/ORC formats 2022-05-18 14:51:21 +00:00
avogar
68bb07d166 Better naming 2022-05-13 18:39:19 +00:00
avogar
b17fec659a Improve performance and memory usage for select of subset of columns for some formats 2022-05-13 13:51:28 +00:00
Kruglov Pavel
77e55c344c
Merge pull request #36667 from Avogar/mysqldump-format
Add MySQLDump input format
2022-05-04 19:49:48 +02:00
Dmitry Novik
5ba7a55c18
Merge pull request #36650 from bigo-sg/hive_text_parallel_parsing
Parallel parsing of hive text format
2022-05-03 15:56:28 +02:00
Kruglov Pavel
d613f7eab0
Merge branch 'master' into mysqldump-format 2022-05-02 13:31:57 +02:00
Jakub Kuklis
a1f2dd6d34 Adding two settings in place of one, improvements to the test clarity 2022-04-29 10:01:51 +02:00
Jakub Kuklis
507ba1042c Adding a setting to enable Google wrappers special treatment 2022-04-29 10:01:51 +02:00
avogar
d295de1689 Fix comments and test 2022-04-28 14:59:35 +00:00
Kruglov Pavel
4d08587559
Merge branch 'master' into mysqldump-format 2022-04-28 15:58:18 +02:00
avogar
33d845dade Add MySQLDump input format 2022-04-26 10:42:56 +00:00
taiyang-li
b7cc344d62 remove useless codes 2022-04-26 14:42:43 +08:00
taiyang-li
99dee35b6e parallel parsing of hive text format 2022-04-26 14:33:10 +08:00
avogar
1c065f8c7a Some refactoring around schema inference with globs 2022-04-13 17:02:48 +00:00
avogar
d2017a63b1 Merge branch 'master' of github.com:ClickHouse/ClickHouse into improve-schema-inference 2022-04-07 11:36:40 +00:00
Kruglov Pavel
ec2213493f
Merge branch 'master' into allow-read-bools-as-numbers 2022-04-06 14:53:02 +02:00
Kruglov Pavel
9141066de3
Merge branch 'master' into improve-schema-inference 2022-04-06 13:51:07 +02:00
Maksim Kita
371cdc956a Added input format settings for parsing invalid IPv4, IPv6 addresses as default values 2022-03-30 12:54:19 +02:00
avogar
3fc36627b3 Allow to infer and parse bools as numbers in JSON input formats 2022-03-29 17:37:31 +00:00
Kruglov Pavel
d45143ffe0
Merge branch 'master' into improve-schema-inference 2022-03-25 12:05:40 +01:00
avogar
557edbd172 Add some improvements and fixes in schema inference 2022-03-24 12:54:12 +00:00
Antonio Andelic
0c23cd7b94 Add support for case insensitive column matching in arrow 2022-03-22 10:55:10 +00:00
Antonio Andelic
f75b054255 Allow case insensitive column matching 2022-03-21 07:47:37 +00:00
Antonio Andelic
607f785e48 Revert "Merge pull request #35145 from bigo-sg/lower-column-name"
This reverts commit ebf72bf61d, reversing
changes made to f1b812bdc1.
2022-03-17 12:31:43 +00:00
shuchaome
46cb4483a6 Optimise by lowering schema on the beginning. Add a functional test. 2022-03-11 14:34:46 +08:00
shuchaome
56795b831d add setting to lower column case when reading parquet/orc file 2022-03-09 16:07:02 +08:00
Maksim Kita
b1a956c5f1 clang-tidy check performance-move-const-arg fix 2022-03-02 18:15:27 +00:00
avogar
77b42bb9ff Support UUID in MsgPack format 2022-02-07 17:11:44 +03:00
Kruglov Pavel
7873b4475f
Merge branch 'master' into autodetect-format 2022-01-25 10:56:52 +03:00
avogar
a6740d2f9a Detect format and schema for stdin in clickhouse-local 2022-01-25 10:25:37 +03:00
avogar
1f49acc164 Better naming 2022-01-24 16:28:36 +03:00
Kruglov Pavel
a7df9cd53a
Merge branch 'master' into formats-with-suffixes 2022-01-14 21:03:49 +03:00
avogar
253035a5df Fix 2022-01-14 19:17:06 +03:00
Kruglov Pavel
d2e9f37bee
Merge branch 'master' into format-by-extention 2022-01-14 18:36:23 +03:00
avogar
89a181bd19 Make better 2022-01-14 18:16:18 +03:00
avogar
817a314263 Fix tests and style 2022-01-14 17:46:24 +03:00
Kruglov Pavel
5a908e8edd
Merge branch 'master' into formats-with-suffixes 2022-01-14 16:45:20 +03:00