Commit Graph

913 Commits

Author SHA1 Message Date
fenglv
2392d4e2b5 fix 2022-04-16 16:08:28 +00:00
fenglv
58111115c5 fix style 2022-04-16 06:21:09 +00:00
fenglv
74ef1b0198 Add aliases JSONLines and NDJSON for JSONEachRow 2022-04-16 06:01:07 +00:00
Anton Popov
2de6668b3f show names of erroneous files 2022-04-16 00:10:47 +00:00
Alexey Milovidov
cbeeb7ec4f Remove Arcadia 2022-04-16 00:20:47 +02:00
avogar
42726639f3 Check ORC/Parquet/Arrow format magic bytes before loading file in memory 2022-04-13 19:27:38 +00:00
avogar
8b60aeb7bc Improve schema inference for json objects 2022-04-13 19:13:40 +00:00
Alexey Milovidov
a54c01cf72 Remove useless code in ReplicatedMergeTreeRestartingThread 2022-04-11 00:44:30 +02:00
avogar
1c783ed88a Resolve conflicts 2022-04-07 12:17:48 +00:00
avogar
d2017a63b1 Merge branch 'master' of github.com:ClickHouse/ClickHouse into improve-schema-inference 2022-04-07 11:36:40 +00:00
Kruglov Pavel
f3f8f27db5
Merge pull request #35735 from Avogar/allow-read-bools-as-numbers
Allow to infer and parse bools as numbers in JSON input formats
2022-04-07 13:20:49 +02:00
taiyang-li
2ef316801c Merge branch 'master' into use_minmax_index 2022-04-07 10:53:25 +08:00
Kruglov Pavel
ec2213493f
Merge branch 'master' into allow-read-bools-as-numbers 2022-04-06 14:53:02 +02:00
Kruglov Pavel
9141066de3
Merge branch 'master' into improve-schema-inference 2022-04-06 13:51:07 +02:00
taiyang-li
acb9f1632e suppoort skip splits in orc and parquet 2022-04-06 16:40:22 +08:00
mergify[bot]
1e43e26fa1
Merge branch 'master' into fix-order 2022-04-02 12:00:29 +00:00
avogar
ab2a963287 Merge branch 'master' of github.com:ClickHouse/ClickHouse into allow-read-bools-as-numbers 2022-03-31 14:09:43 +00:00
mergify[bot]
24ade25d61
Merge branch 'master' into improve-schema-inference 2022-03-31 13:42:47 +00:00
avogar
3fc36627b3 Allow to infer and parse bools as numbers in JSON input formats 2022-03-29 17:37:31 +00:00
avogar
ce97ccbfb9 Improve schema inference for JSONEachRow and TSKV formats 2022-03-29 14:47:51 +00:00
Antonio Andelic
9990abb76a Use compile-time check for Exception messages, fix wrong messages 2022-03-29 13:16:11 +00:00
avogar
97f5033ea9 Fix tests 2022-03-29 13:07:37 +00:00
mergify[bot]
343588de2c
Merge branch 'master' into improve-schema-inference 2022-03-29 13:06:00 +00:00
Anton Popov
d677635cd8
Merge pull request #35592 from CurtizJ/dynamic-columns-4
Add parallel parsing and schema inference for format `JSONAsObject`
2022-03-28 19:29:55 +02:00
avogar
6fb3c3be04 Fix comments and build 2022-03-25 12:02:21 +00:00
Kruglov Pavel
d45143ffe0
Merge branch 'master' into improve-schema-inference 2022-03-25 12:05:40 +01:00
Anton Popov
78100abc5f add parallel parsing and schema inference for type Object 2022-03-24 17:51:35 +00:00
avogar
557edbd172 Add some improvements and fixes in schema inference 2022-03-24 12:54:12 +00:00
mergify[bot]
bf90edc362
Merge branch 'master' into case-insensitive-column-matching 2022-03-24 08:00:42 +00:00
Kruglov Pavel
826b933b08
Merge pull request #35332 from Avogar/fix-tskv-schema-inference
Fix schema inference for TSKV format while using small max_read_buffer_size
2022-03-23 18:37:07 +01:00
Antonio Andelic
052057f2ef Address PR comments 2022-03-23 15:42:46 +00:00
Antonio Andelic
6b6190554b Fix conversion of arrow to CH column with hint header 2022-03-22 11:15:48 +00:00
Antonio Andelic
0c23cd7b94 Add support for case insensitive column matching in arrow 2022-03-22 10:55:10 +00:00
Antonio Andelic
ca7844e338 Fix tests 2022-03-22 09:27:20 +00:00
Antonio Andelic
6cebb6bc88 Merge branch 'master' into case-insensitive-column-matching 2022-03-22 07:36:35 +00:00
Antonio Andelic
cb3703b46e Style fix 2022-03-21 12:54:56 +00:00
Antonio Andelic
0457a3998a remove old test 2022-03-21 11:58:55 +00:00
Kruglov Pavel
1645b7083f
Update src/Processors/Formats/Impl/TSKVRowInputFormat.cpp
Co-authored-by: Antonio Andelic <antonio2368@users.noreply.github.com>
2022-03-21 12:44:12 +01:00
Kruglov Pavel
0b381ebd26
Update src/Processors/Formats/Impl/TSKVRowInputFormat.cpp
Co-authored-by: Antonio Andelic <antonio2368@users.noreply.github.com>
2022-03-21 12:44:06 +01:00
Kruglov Pavel
f67b8c0bad
Update src/Processors/Formats/Impl/TSKVRowInputFormat.cpp
Co-authored-by: Antonio Andelic <antonio2368@users.noreply.github.com>
2022-03-21 12:44:00 +01:00
Antonio Andelic
0c74fa2c19 Remove unecessary code 2022-03-21 08:38:15 +00:00
tavplubix
716c6f0ffa
Merge pull request #35406 from Avogar/fix-parquet
Fix working with unneeded columns in Arrow/Parquet/ORC formats
2022-03-21 11:36:54 +03:00
Antonio Andelic
29d2bf7d1a Merge branch 'master' into case-insensitive-column-matching 2022-03-21 08:17:27 +00:00
Antonio Andelic
d73c906e68 Format code 2022-03-21 07:50:17 +00:00
Antonio Andelic
f75b054255 Allow case insensitive column matching 2022-03-21 07:47:37 +00:00
avogar
58f2aca120 Fix tests 2022-03-18 19:04:16 +00:00
avogar
cffa2096de Fix working with unneeded columns in Arrow/Parquet/ORC formats 2022-03-18 13:07:54 +00:00
Kruglov Pavel
aa3c05e9d4
Merge pull request #35152 from rschu1ze/protobuf-batch-write
ProtobufList
2022-03-18 13:24:34 +01:00
Antonio Andelic
607f785e48 Revert "Merge pull request #35145 from bigo-sg/lower-column-name"
This reverts commit ebf72bf61d, reversing
changes made to f1b812bdc1.
2022-03-17 12:31:43 +00:00
Anton Popov
2ced42ed41 add experimental settings for Object type 2022-03-16 16:51:23 +00:00
Anton Popov
0ba78c3c3a Merge remote-tracking branch 'upstream/master' into HEAD 2022-03-16 15:28:09 +00:00
avogar
f7c5fe14e4 Fix schema inference for TSKV format while using small max_read_buffer_size 2022-03-16 13:53:50 +00:00
Robert Schulze
0d2ece6d91
Merge branch 'ClickHouse:master' into protobuf-batch-write 2022-03-16 09:43:33 +01:00
Robert Schulze
23122cb327
Fix review comments
ParquetBlockOutputFormat.cpp:
- undo unrelated formatting

ProtobufSerializer.cpp:
- undef debug tracing
- simplify logic in writeRow()

ProtobufSchemas.cpp:
- restore original search in cache by message type
2022-03-15 11:27:17 +01:00
Maksim Kita
2665724301 Fix clang-tidy warnings in Parsers, Processors, QueryPipeline folders 2022-03-14 18:17:35 +00:00
Anton Popov
36ec379aeb Merge remote-tracking branch 'upstream/master' into HEAD 2022-03-14 16:28:35 +00:00
Antonio Andelic
ebf72bf61d
Merge pull request #35145 from bigo-sg/lower-column-name
add setting to lower column case when reading parquet/orc file
2022-03-14 11:25:03 +01:00
Robert Schulze
514d4d2187
Implement ProtobufList - fixes ClickHouse#16436
Introduce IO format "ProtobufList" with protobuf schema

    // schemafile.proto
    message Envelope {
      message MessageType {
        uint32 colA = 1;
        string colB = 2;
      }
      repeated MessageType mt = 1;
    }

where "Envelope" is a hard-coded/expected top-level message and
"MessageType" is a message with user-provided name containing the table
fields to export/import, e.g.

    SELECT * FROM db1.tab1 FORMAT ProtobufList SETTINGS format_schema =
    'schemafile:MessageType'

As a result, the new format wraps a list of messages (one per row) into
a single, containing message. Compare that to the schema of the existing
IO formats "Protobuf" and "ProtobufSingle":

    message MessageType {
      uint32 colA = 1;
      string colB = 2;
    }

The new format does not save space compared to the existing formats, but
it is conceptually a bit more beautiful and also more convenenient.

Implementation details:

- Created new files ProtobufList(Input|Output)Format which use the
  existing ProtobufSerializer mechanism. The goal was to reuse as much
  code as possible and avoid copypasta.

- I was torn between inheriting from I(Input|Output)Format vs.
  IRow(Input|Output)Format for ProtobufList(Input|Output)Format. The
  former is chunk-based which can be better for performance. Since the
  ProtobufSerializer mechanism is row-based but data is generally passed
  around in chunks, I decided for the latter to leverage the existing
  chunk <--> row mapping code in IRow(InputOutput)Format.

- A new ProtobufSerializer called ProtobufSerializerEnvelope was
  introduced (--> ProtobufSerializer.cpp). It represents the top-level
  message which encloses the list of inner nested messages, i.e. the
  rows.

- With the new format, parsing the schema file and matching the fields in
  the schema file to table column works like for the old formats. The only
  difference is that parsing starts one level below the "Envelope" (-->
  ProtobufSchema.cpp). This is more natural than forcing customers to
  have table columns start with "Envelope".

- Creation of the ProtobufSerializer tree also works like before. What
  is different is that we finally add a ProtobufSerializerEnvelope as
  new root of the tree. It's only purpose is to write/read the top-level
  message for the first/last row to write/read.

Caveats:

- The low-level serialization code in ProtobufWriter uses an internal
  buffer which is flushed to the output file only in endMessage().
  In the existing "Protobuf" format, this happens once per row, in the
  new format this happens only at the end of the serialization
  since row-level messages now call start/endNestedMessage(). As a
  future TODO to, the buffer should be flushed also in
  start/endNestedMessage() to reduce memory consumption.
2022-03-14 08:04:58 +01:00
Maksim Kita
ce0c8e5597
Update JSONRowOutputFormat.cpp 2022-03-14 00:58:36 +01:00
Robert Schulze
f0ba39b071
Clean up some header includes and make formatting more consistent 2022-03-13 20:24:12 +01:00
zhanghuajie
53a8987b3b fix build fail with gcc --fix warnings without disabling some parameters 2022-03-11 21:59:19 +08:00
shuchaome
7a3623d216 fix bug 2022-03-11 17:26:13 +08:00
shuchaome
46cb4483a6 Optimise by lowering schema on the beginning. Add a functional test. 2022-03-11 14:34:46 +08:00
shuchaome
b7cd85df6b remove unused column_names in ORCBlockInputFormat 2022-03-09 18:16:22 +08:00
shuchaome
bb50133424
Apply suggestions from code review
Co-authored-by: Antonio Andelic <antonio2368@users.noreply.github.com>
2022-03-09 17:32:27 +08:00
shuchaome
9647818adc add unlikely for performance 2022-03-09 17:02:07 +08:00
shuchaome
8027bb1e32 modify code style 2022-03-09 16:32:18 +08:00
shuchaome
56795b831d add setting to lower column case when reading parquet/orc file 2022-03-09 16:07:02 +08:00
zhanghuajie
11dde7c127 fix build fail with gcc 2022-03-08 22:34:51 +08:00
Anton Popov
df3b07fe7c Merge remote-tracking branch 'upstream/master' into HEAD 2022-03-03 22:25:28 +00:00
Maksim Kita
b1a956c5f1 clang-tidy check performance-move-const-arg fix 2022-03-02 18:15:27 +00:00
Anton Popov
2758db5341 add more comments 2022-03-01 19:32:55 +03:00
Anton Popov
fcdebea925 Merge remote-tracking branch 'upstream/master' into HEAD 2022-02-25 13:41:30 +03:00
taiyang-li
e53719a86b remove comments 2022-02-13 17:13:23 +08:00
taiyang-li
aabf2aac69 finish all tests 2022-02-13 17:06:58 +08:00
taiyang-li
6559941972 support datetime64 when transform ch chunk to arrow table 2022-02-13 14:56:01 +08:00
avogar
9e58ae7577 Support jsonl extension for JSONEachRow format 2022-02-10 16:00:37 +03:00
Anton Popov
18940b8637 Merge remote-tracking branch 'upstream/master' into HEAD 2022-02-09 23:38:38 +03:00
Nikolai Kochetov
82a7d70a31
Merge branch 'master' into fix-removing-order-in-CreatingSetsTransform 2022-02-08 19:29:03 +03:00
Nikolai Kochetov
d2d47b9595 Fixing build. 2022-02-08 16:27:33 +00:00
Maksim Kita
4bb69bcb15
Merge pull request #34398 from DevTeamBK/input_format
Method called on already moved
2022-02-08 15:20:07 +01:00
Rajkumar
6b3adbb0de Method called on already moved 2022-02-07 19:50:34 -08:00
avogar
a4c7ecde87 Make better 2022-02-07 17:51:26 +03:00
avogar
c3d30fd502 Fix comments 2022-02-07 17:11:44 +03:00
Kruglov Pavel
34a17075d3 FIx error messages 2022-02-07 17:11:44 +03:00
avogar
77b42bb9ff Support UUID in MsgPack format 2022-02-07 17:11:44 +03:00
Alexey Milovidov
f98010e374 Small improvements 2022-02-06 07:14:01 +03:00
Alexey Milovidov
4a83dbc514 Fix linkage 2022-02-04 00:26:44 +03:00
Alexey Milovidov
c426f11096 Maybe better 2022-02-04 00:20:16 +03:00
Alexey Milovidov
7c12f5f37a Fix terribly low performance of LineAsString format 2022-02-04 00:07:31 +03:00
Anton Popov
836a348a9c Merge remote-tracking branch 'upstream/master' into HEAD 2022-02-01 15:23:07 +03:00
Alexey Milovidov
e4e7169277 Remove some strange code 2022-02-01 02:52:36 +03:00
Alexey Milovidov
83136f3515 Allow \r in the middle of the line in format Regexp 2022-02-01 02:49:26 +03:00
Alexey Milovidov
872d0a0fbe Improve performance of format Regexp 2022-02-01 02:07:48 +03:00
alesapin
dd61d1c2de
Merge pull request #34172 from ClickHouse/fix_race_in_some_engines
Fix benign race condition for storage HDFS, S3, URL
2022-01-31 22:41:54 +03:00
alesapin
93c0700c4c Fix typo 2022-01-31 16:46:58 +03:00
alesapin
056b9e335f Fix comment 2022-01-31 16:39:42 +03:00
alesapin
31753afb7e Fix cancel logic in parallel parsing 2022-01-31 16:38:15 +03:00
Maksim Kita
5ef83deaa6 Update sort to pdqsort 2022-01-30 19:49:48 +00:00
Anton Popov
78b9f15abb Merge remote-tracking branch 'upstream/master' into HEAD 2022-01-30 03:24:37 +03:00
Anton Popov
b950a12cb3
Merge pull request #34068 from CurtizJ/fix-async-insert-native
Fix asynchronous inserts with `Native` format
2022-01-29 01:24:53 +03:00
Azat Khuzhin
1519985c98 Fix possible "Can't attach query to the thread, it is already attached"
After detachQueryIfNotDetached() had been removed it is not enough to
use attachTo() for ThreadPool (scheduleOrThrowOnError()) since the query
may be already attached, if the thread doing multiple jobs, so
CurrentThread::attachToIfDetached() should be used instead.

This should fix all the places from the failures on CI [1]:

    $ fgrep DB::CurrentThread::attachTo -A1 ~/Downloads/47.txt  | fgrep -v attachTo | cut -d' ' -f5,6 | sort | uniq -c
         92 --
          2 /fasttest-workspace/build/../../ClickHouse/contrib/libcxx/include/deque:1393: DB::ParallelParsingInputFormat::parserThreadFunction(std::__1::shared_ptr<DB::ThreadGroupStatus>,
          4 /fasttest-workspace/build/../../ClickHouse/src/Storages/MergeTree/MergeTreeData.cpp:1595: void
         87 /fasttest-workspace/build/../../ClickHouse/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:993: void

  [1]: https://github.com/ClickHouse/ClickHouse/runs/4954466034?check_suite_focus=true

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-01-28 16:25:33 +03:00
Azat Khuzhin
b0c862c297 Fix memory accounting for queries that uses < max_untracker_memory
MemoryTracker starts accounting memory directly only after per-thread
allocation exceeded max_untracker_memory (or memory_profiler_step).

But even memory under this limit should be accounted too, and there is
code to do this in ThreadStatus dtor, however due to
PullingAsyncPipelineExecutor detached the query from thread group that
memory was not accounted.

So remove CurrentThread::detachQueryIfNotDetached() from threads that
uses ThreadFromGlobalPool since it has ThreadStatus, and the query will
be detached using CurrentThread::defaultThreadDeleter.

Note, that before this patch memory accounting works for HTTP queries
due to it had been accounted from ParallelFormattingOutputFormat, but
not for TCP.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-01-28 16:25:33 +03:00
Anton Popov
6c0959b907 fix asynchronous inserts with Native format 2022-01-28 03:25:15 +03:00
Kruglov Pavel
9f12f4af13
Merge pull request #33302 from Avogar/formats-with-suffixes
Allow to create new files on insert for File/S3/HDFS engines
2022-01-25 10:56:15 +03:00
avogar
1f49acc164 Better naming 2022-01-24 16:28:36 +03:00
Anton Popov
e8ce091e68 Merge remote-tracking branch 'upstream/master' into HEAD 2022-01-21 20:11:18 +03:00
avogar
67e396f8f4 Fix schema inference for JSONEachRow and JSONCompactEachRow 2022-01-20 16:31:24 +03:00
mergify[bot]
b318f9b5db
Merge branch 'master' into formats-with-suffixes 2022-01-18 12:17:07 +00:00
Anton Popov
a25f2518e3
Merge pull request #33141 from 1over/feature_default_keyword
Add support of DEFAULT keyword for INSERT
2022-01-18 02:04:37 +03:00
Kruglov Pavel
a7df9cd53a
Merge branch 'master' into formats-with-suffixes 2022-01-14 21:03:49 +03:00
avogar
253035a5df Fix 2022-01-14 19:17:06 +03:00
Kruglov Pavel
d2e9f37bee
Merge branch 'master' into format-by-extention 2022-01-14 18:36:23 +03:00
avogar
89a181bd19 Make better 2022-01-14 18:16:18 +03:00
Kruglov Pavel
5a908e8edd
Merge branch 'master' into formats-with-suffixes 2022-01-14 16:45:20 +03:00
Kruglov Pavel
d54a430d9c
Merge pull request #33566 from Avogar/fix-avro
Fix segfault in Avro
2022-01-14 16:01:56 +03:00
Kseniia Sumarokova
5da673c3a5
Merge pull request #31104 from bigo-sg/hive_table
Implement hive table engine
2022-01-14 09:39:17 +03:00
Kruglov Pavel
305d58a762
Merge pull request #33524 from Avogar/stacktrace-in-client
Don't print exception twice in client in case of exception in parallel parsing
2022-01-13 15:50:42 +03:00
Nikolai Kochetov
872ee5dc09
Update src/Processors/Formats/Impl/AvroRowOutputFormat.h
Co-authored-by: Bharat Nallan <bharatnc@gmail.com>
2022-01-13 12:55:14 +03:00
avogar
c5ea4b1bc0 Fix segfault in Avro 2022-01-12 18:34:28 +03:00
avogar
8390e9ad60 Detect format by file name in file/hdfs/s3/url table functions 2022-01-12 18:29:31 +03:00
lgbo-ustc
5c71d3687a fixed some bugs
1. interagtion test for test_hive_query failed
2. nullptr reference in arrowSchemaToCHHeader
2022-01-12 17:01:05 +08:00
taiyang-li
66813a3aa9 merge master 2022-01-12 16:56:29 +08:00
avogar
9915ce7ded Fix segfault in arrowSchemaToCHHeader 2022-01-11 20:30:35 +03:00
avogar
0ae0aa712b Don't print exception twice in client in case of exception in parallel parsing 2022-01-11 18:37:07 +03:00
李扬
2df2442ad0
Merge branch 'master' into hive_table 2022-01-04 01:26:16 -06:00
taiyang-li
8730dda895 fix hivte text 2022-01-01 09:16:30 +08:00
taiyang-li
1e102bc1b2 merge master 2022-01-01 09:01:06 +08:00
alexey-milovidov
34b934a1e0
Merge pull request #33331 from ClickHouse/serxa/line-as-string-output-format
Add LineAsString output format
2021-12-31 14:38:36 +03:00
Sergei Trifonov
f1d398ae4b Add LineAsString output format 2021-12-30 20:38:54 +03:00
avogar
97788b9c21 Allow to create new files on insert for File/S3/HDFS engines 2021-12-29 21:19:13 +03:00
avogar
364b4f5d36 Fix special build 2021-12-29 12:21:01 +03:00
Kruglov Pavel
cb0ed7fcb7 Fix typo 2021-12-29 12:21:01 +03:00
avogar
26abf7aa62 Remove code duplication, use simdjson and rapidjson instead of Poco 2021-12-29 12:21:01 +03:00
avogar
74f09d6476 Fix tests 2021-12-29 12:18:56 +03:00
avogar
aaf9f85c67 Add more tests and fixes 2021-12-29 12:18:56 +03:00
avogar
dd994aa761 Add some tests and some fixes 2021-12-29 12:18:56 +03:00
avogar
8112a71233 Implement schema inference for most input formats 2021-12-29 12:18:56 +03:00
kssenii
1f6ca619b7 Allow some killing 2021-12-27 22:42:56 +03:00
taiyang-li
9036b18c2f merge master 2021-12-27 15:12:48 +08:00
alexey-milovidov
c583ea7e6b
Merge pull request #32484 from Algunenano/libcxx13_take2
libc++ 13 compatibility
2021-12-25 10:14:12 +03:00
Andrii Buriachevskyi
e8cc6df7bb Add support of DEFAULT keyword for INSERT 2021-12-24 13:10:19 +01:00
Alexey Milovidov
29d28c531f Move code around to avoid dlsym on Musl 2021-12-24 12:25:27 +03:00
Raúl Marín
77db850c0b Merge remote-tracking branch 'blessed/master' into libcxx13_take2 2021-12-23 12:42:39 +01:00
Raúl Marín
88b8fd8b60 Merge remote-tracking branch 'blessed/master' into libcxx13_take2 2021-12-23 09:16:19 +01:00
Alexey Milovidov
f37ff32c37 Whitespaces 2021-12-23 01:33:47 +03:00
mreddy017
3e50217501 Remove the additional white space as per the pipeline build error. 2021-12-23 01:30:56 +03:00
mreddy017
10eb2dbdb7 Addressing review comments 2021-12-23 01:30:56 +03:00
Harry-Lee
846c46ac4b Fix issue #80: union index out of boundary 2021-12-23 01:30:56 +03:00
taiyang-li
2597925724 merge master 2021-12-21 15:55:39 +08:00