ClickHouse/docs/changelogs/v23.4.1.1943-stable.md

sidebar_position: 1
sidebar_label: 2023

2023 Changelog

ClickHouse release v23.4.1.1943-stable (3920eb987f) FIXME as compared to v23.3.1.2823-lts (46e85357ce)

Backward Incompatible Change

  • If the path in the cache configuration is not empty and is not an absolute path, it is placed in <clickhouse server data directory>/caches/<path_from_cache_config>. #48784 (Kseniia Sumarokova).
  • Added compatibility setting parallelize_output_from_storages to restore the behavior from before #48727. #49101 (Igor Nikonov).

New Feature

  • Add extractKeyValuePairs function to extract key-value pairs from strings. Input strings might contain noise (e.g. log files; they do not need to be 100% formatted in key-value-pair format), and the algorithm looks for key-value pairs matching the arguments passed to the function. As of now, the function accepts the following arguments: data_column (mandatory), key_value_pair_delimiter (defaults to :), pair_delimiters (defaults to \space \, \;) and quoting_character (defaults to double quotes). #43606 (Arthur Passos).
  • Add MemoryTracker for the background tasks (merges and mutations). Introduces merges_mutations_memory_usage_soft_limit and merges_mutations_memory_usage_to_ram_ratio settings that represent the soft memory limit for merges and mutations. If this limit is reached, ClickHouse won't schedule new merge or mutation tasks. Also, the MergesMutationsMemoryTracking metric is introduced to allow observing the current memory usage of background tasks. Closes #45710. #46089 (Dmitry Novik).
  • Support new aggregate functions quantileGK/quantilesGK, similar to approx_percentile in Spark. For the Greenwald-Khanna algorithm, see http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf. #46428 (李扬).
  • Add statement SHOW COLUMNS which shows distilled information from system.columns. #48017 (Robert Schulze).
  • Added LIGHTWEIGHT and PULL modifiers for SYSTEM SYNC REPLICA query. LIGHTWEIGHT version waits for fetches and drop-ranges only (merges and mutations are ignored). PULL version pulls new entries from ZooKeeper and does not wait for them. Fixes #47794. #48085 (Alexander Tokmakov).
  • Add kafkaMurmurHash function for compatibility with Kafka DefaultPartitioner. Closes #47834. #48185 (Nikolay Degterinsky).
  • Allow to easily create a user with the same grants as the current user by using GRANT CURRENT GRANTS. #48262 (pufit).
  • Add statistical aggregate function kolmogorovSmirnovTest. Closes #48228. #48325 (FFFFFFFHHHHHHH).
  • Added a lost_part_count column to the system.replicas table. The column value shows the total number of lost parts in the corresponding table. The value is stored in ZooKeeper and can be used for monitoring instead of the non-persistent ReplicatedDataLoss profile event. #48526 (Sergei Trifonov).
  • Add soundex function. Closes #39880. #48567 (FriendLey).
  • Support map type for JSONExtract. #48629 (李扬).
  • Add PrettyJSONEachRow format to output pretty JSON with newline delimiters and 4-space indents. #48898 (Kruglov Pavel).
  • Add ParquetMetadata input format to read Parquet file metadata. #48911 (Kruglov Pavel).
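
As an illustration of the soundex entry above, the following is a minimal Python sketch of the classic American Soundex code (first letter plus three digits). It is not taken from the ClickHouse sources; the helper name `soundex` and the handling of non-ASCII input are assumptions made for this example, and the server-side function may differ in edge cases.

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter followed by three digits."""
    # Consonant groups and their digits; vowels, Y, H and W have no digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    # Assumption for this sketch: ignore everything except ASCII letters.
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")  # vowels and Y map to ""
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept
    return result.ljust(4, "0")  # pad short codes with zeros


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak", "Lee"):
        print(name, soundex(name))
```

Names that sound alike, such as "Robert" and "Rupert", map to the same code, which is what makes the function useful for fuzzy matching of names.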

Performance Improvement

  • Reading files in Parquet format is now much faster. IO and decoding are parallelized (controlled by max_threads setting), and only required data ranges are read. #47964 (Michael Kolupaev).
  • Only check dependencies if necessary when applying ALTER TABLE queries. #48062 (Raúl Marín).
  • Optimize function mapUpdate. #48118 (Anton Popov).
  • An internal query to the local replica is now sent explicitly, and data from it is received through the loopback interface. The setting prefer_localhost_replica is not respected for parallel replicas. This is needed for better scheduling and makes the code cleaner: the initiator is only responsible for coordinating the reading process and merging results, continuously answering requests while all the secondary queries read the data. Note: using the loopback interface is less performant, but without it some replicas could starve for tasks, which could lead to even slower query execution and underutilized resources. The initialization of the coordinator is now even more lazy. All incoming requests contain information about the reading algorithm, and the coordinator is initialized with it when the first request arrives. If any replica decides to read with a different algorithm, an exception is thrown and the query is aborted. #48246 (Nikita Mikhaylov).
  • Do not build set for the right side of IN clause with subquery when it is used only for analysis of skip indexes and they are disabled by setting (use_skip_indexes=0). Previously it might affect the performance of queries. #48299 (Anton Popov).
  • Query processing is parallelized right after reading FROM file(...). Related to #38755. #48525 (Igor Nikonov).
  • Query processing is parallelized right after reading from a data source. Affected data sources are mostly simple or external storages like table functions url, file. #48727 (Igor Nikonov).
  • Use the correct memory order for the counter in numbers_mt(). #48729 (Igor Nikonov).
  • Lowered contention of ThreadPool mutex (may increase performance for a huge amount of small jobs). #48750 (Sergei Trifonov).
  • Simplify accounting of approximate size of granule in prefetched read pool. #49051 (Nikita Taranov).

Improvement

  • Support config sections keeper/keeper_server as an alternative to zookeeper. Closes #34766, #34767. #35113 (李扬).
  • Many issues in the help of ClickHouse applications were fixed. Help is now written to stdout from all tools. Status code for clickhouse help invocation is now 0. Updated help for clickhouse-local, clickhouse-benchmark, clickhouse-client, clickhouse hash, clickhouse su, clickhouse-install. #45819 (Ilya Yatsishin).
  • Entries in the query cache are now squashed to max_block_size and compressed. #45912 (Robert Schulze).
  • It is possible to set secure flag in named_collections for a dictionary with a ClickHouse table source. Addresses #38450. #46323 (Ilya Golshtein).
  • Functions replaceOne(), replaceAll(), replaceRegexpOne() and replaceRegexpAll() can now be called with non-const pattern and replacement arguments. #46589 (Robert Schulze).
  • Bump internal ZSTD from 1.5.4 to 1.5.5. #46797 (Robert Schulze).
  • If we run a mutation with IN (subquery) like this: ALTER TABLE t UPDATE col='new value' WHERE id IN (SELECT id FROM huge_table) and the table t has multiple parts, then for each part a set for the subquery SELECT id FROM huge_table is built in memory. If there are many parts, this might consume a lot of memory (and lead to an OOM) and CPU. The solution is to introduce a short-lived cache of sets that are currently being built by mutation tasks. If another task of the same mutation is executed concurrently, it can look up the set in the cache, wait for it to be built and reuse it. #46835 (Alexander Gololobov).
  • Added configurable retries for all operations with [Zoo]Keeper for Backup queries. #47224 (Nikita Mikhaylov).
  • Add async connection to socket and async writing to socket. Make creating connections and sending query/external tables async across shards. Refactor code with fibers. Closes #46931. We will be able to increase connect_timeout_with_failover_ms by default after this PR (https://github.com/ClickHouse/ClickHouse/issues/5188). #47229 (Kruglov Pavel).
  • Formatter '%M' in function formatDateTime() now prints the month name instead of the minutes. This makes the behavior consistent with MySQL. The previous behavior can be restored using setting "formatdatetime_parsedatetime_m_is_month_name = 0". #47246 (Robert Schulze).
  • Several improvements around data lakes: made StorageIceberg work with non-partitioned data; added support for Iceberg format version V2 (previously only V1 was supported); added support for reading partitioned data for DeltaLake/Hudi; made reading of DeltaLake metadata faster by using Delta's checkpoint files; fixed incorrect Hudi reads (previously it incorrectly chose which data to read and therefore could correctly read only small tables); made these engines pick up updates of changed data (previously the state was set on table creation); added proper testing for Iceberg/DeltaLake/Hudi using Spark. #47307 (Kseniia Sumarokova).
  • Enable use_environment_credentials for S3 by default, so the entire provider chain is constructed by default. #47397 (Antonio Andelic).
  • Currently, the JSON_VALUE function is similar to Spark's get_json_object function, which supports getting a value from a JSON string by a path like '$.key'. There are still some differences: 1. Spark's get_json_object returns null when the path does not exist, while JSON_VALUE returns an empty string; 2. Spark's get_json_object can return a complex-type value, such as a JSON object/array, while JSON_VALUE returns an empty string. #47494 (KevinyhZou).
  • Add CNF/constraint optimizer in new analyzer. #47617 (Antonio Andelic).
  • More flexible propagation of the insert table structure to table functions for use_structure_from_insertion_table_in_table_functions. Fixed a bug with name mapping and using virtual columns. The 'auto' setting is no longer needed. #47962 (Yakov Olkhovskiy).
  • Do not continue retrying to connect to ZK if the query is killed or over limits. #47985 (Raúl Marín).
  • Added functions to work with columns of type Map: mapConcat, mapSort, mapExists. #48071 (Anton Popov).
  • Support Enum output/input in BSONEachRow, allow all map key types and avoid extra calculations on output. #48122 (Kruglov Pavel).
  • Support more ClickHouse types in ORC/Arrow/Parquet formats: Enum(8|16), (U)Int(128|256), Decimal256 (for ORC), allow reading IPv4 from Int32 values (ORC outputs IPv4 as Int32 and we couldn't read it back), fix reading Nullable(IPv6) from binary data for ORC. #48126 (Kruglov Pavel).
  • Add columns perform_ttl_move_on_insert and load_balancing to table system.storage_policies, and change the type of column volume_type to Enum8. #48167 (lizhuoyu5).
  • Added support for the BACKUP ALL command, which backs up all tables and databases, including temporary and system ones. #48189 (Vitaly Baranov).
  • Function mapFromArrays now supports Map type as input. #48207 (李扬).
  • The output of SHOW PROCESSLIST is now sorted. #48241 (Robert Schulze).
  • Per-query/per-server throttling for remote IO/local IO/BACKUPs (server settings: max_remote_read_network_bandwidth_for_server, max_remote_write_network_bandwidth_for_server, max_local_read_bandwidth_for_server, max_local_write_bandwidth_for_server, max_backup_bandwidth_for_server, settings: max_remote_read_network_bandwidth, max_remote_write_network_bandwidth, max_local_read_bandwidth, max_local_write_bandwidth, max_backup_bandwidth). #48242 (Azat Khuzhin).
  • Support more types in CapnProto format: Map, (U)Int(128|256), Decimal(128|256). Allow integer conversions during input/output. #48257 (Kruglov Pavel).
  • It is now possible to define per-user quotas in the query cache. #48284 (Robert Schulze).
  • Don't throw CURRENT_WRITE_BUFFER_IS_EXHAUSTED for normal behaviour. #48288 (Raúl Marín).
  • Add new setting keeper_map_strict_mode which enforces extra guarantees on operations made on top of KeeperMap tables. #48293 (Antonio Andelic).
  • Check that the primary key type for simple dictionaries is a native unsigned integer type. Add setting check_dictionary_primary_key for compatibility (set check_dictionary_primary_key = false to disable the check). #48335 (lizhuoyu5).
  • Don't replicate mutations for KeeperMap because it's unnecessary. #48354 (Antonio Andelic).
  • Allow writing/reading an unnamed tuple as a nested Message in Protobuf format. Tuple elements and Message fields are matched by position. #48390 (Kruglov Pavel).
  • Support additional_table_filters and additional_result_filter settings in the new planner. Also, add a documentation entry for additional_result_filter. #48405 (Dmitry Novik).
  • parseDateTime() now understands format string '%f' (fractional seconds). #48420 (Robert Schulze).
  • Format string "%f" in formatDateTime() now prints "000000" if the formatted value has no fractional seconds. The previous behavior (single zero) can be restored using setting "formatdatetime_f_prints_single_zero = 1". #48422 (Robert Schulze).
  • Don't replicate DELETE and TRUNCATE for KeeperMap. #48434 (Antonio Andelic).
  • Generate valid Decimals and Bools in generateRandom function. #48436 (Kruglov Pavel).
  • Allow trailing commas in expression list of SELECT query, for example SELECT a, b, c, FROM table. Closes #37802. #48438 (Nikolay Degterinsky).
  • Override CLICKHOUSE_USER and CLICKHOUSE_PASSWORD environment variables with --user and --password client parameters. Closes #38909. #48440 (Nikolay Degterinsky).
  • Added retries to loading of data parts in MergeTree tables in case of retryable errors. #48442 (Anton Popov).
  • Add support for Date, Date32, DateTime, DateTime64 data types to arrayMin, arrayMax, arrayDifference functions. Closes #21645. #48445 (Nikolay Degterinsky).
  • Reduce memory usage for multiple ALTER DELETE mutations. #48522 (Nikolai Kochetov).
  • Primary/secondary indices and sorting keys with identical expressions are now rejected. This behavior can be disabled using setting allow_suspicious_indices. #48536 (凌涛).
  • Just fix small typo in comment around lockForAlter method in IStorage.h. #48559 (artem-pershin).
  • Add support for {server_uuid} macro. It is useful for identifying replicas in autoscaled clusters when new replicas are constantly added and removed in runtime. This closes #48554. #48563 (Alexey Milovidov).
  • The installation script will create a hard link instead of copying if it is possible. #48578 (Alexey Milovidov).
  • Support SHOW TABLE syntax meaning the same as SHOW CREATE TABLE. Closes #48580. #48591 (flynn).
  • HTTP temporary buffer support working with fs cache. #48664 (Vladimir C).
  • Make schema inference work for CREATE AS SELECT. Closes #47599. #48679 (flynn).
  • Added a replicated_max_mutations_in_one_entry setting for ReplicatedMergeTree that allows limiting the number of mutation commands per one MUTATE_PART entry (default is 10000). #48731 (Alexander Tokmakov).
  • In AggregateFunction types, don't count unused arena bytes as read_bytes. #48745 (Raúl Marín).
  • Fix some mysql related settings not being handled with mysql dictionary source + named collection. Closes #48402. #48759 (Kseniia Sumarokova).
  • Fix squashing in query cache. #48763 (Robert Schulze).
  • Support the following new JSONPath formats: '$.1key' (path element begins with a number); '$[key]', '$["key"]', '$[\'key\']', '$["key 123"]' (path element enclosed in []). #48768 (lgbo).
  • If a user set max_single_part_upload_size to a very large value, it can lead to a crash due to a bug in the AWS S3 SDK. This fixes #47679. #48816 (Alexey Milovidov).
  • Not for changelog. #48824 (Yakov Olkhovskiy).
  • Fix data race in StorageRabbitMQ (report), refactor the code. #48845 (Kseniia Sumarokova).
  • Add aliases name and part_name for system.parts and system.part_log. Closes #48718. #48850 (sichenzhao).
  • Functions "arrayDifference()", "arrayCumSum()" and "arrayCumSumNonNegative()" now support input arrays of wide integer types (U)Int128/256. #48866 (cluster).
  • Multi-line history in clickhouse-client is now no longer padded. This makes pasting more natural. #48870 (Joanna Hulboj).
  • Not for changelog. #48873 (Yakov Olkhovskiy).
  • Implement a slight improvement for the rare case when ClickHouse is run inside LXC and LXCFS is used. LXCFS has an issue: sometimes it returns the error "Transport endpoint is not connected" on reading from a file inside /proc. This error was correctly logged into ClickHouse's server log. We now additionally work around this issue by reopening the file. This is a minuscule change. #48922 (Real).
  • Improve memory accounting for prefetches. Randomize prefetch settings in CI. #48973 (Kseniia Sumarokova).
  • Correctly set headers for native copy operations on GCS. #48981 (Antonio Andelic).
  • Add support for specifying setting names in the command line with dashes instead of underscores, for example, --max-threads instead of --max_threads. Additionally, support Unicode dash characters in place of --; this is useful when you communicate with a team in another company, and a manager from that team copy-pasted code from MS Word. #48985 (alekseygolub).
  • Add fallback to password authentication when authentication with SSL user certificate has failed. Closes #48974. #48989 (Nikolay Degterinsky).
  • Increase default value for connect_timeout_with_failover_ms to 1000 ms (because of adding async connections in https://github.com/ClickHouse/ClickHouse/pull/47229). Closes #5188. #49009 (Kruglov Pavel).
  • Improve the embedded dashboard. Closes #46671. #49036 (Kevin Zhang).
  • Add profile events for log messages, so you can easily see the count of log messages by severity. #49042 (Alexey Milovidov).
  • The bitCount function now supports the FixedString and String data types. #49044 (flynn).
  • In previous versions, the LineAsString format behaved differently depending on whether parallel parsing was enabled, in the presence of DOS or MacOS Classic line breaks. This closes #49039. #49052 (Alexey Milovidov).
  • The exception message about the unparsed query parameter will also tell about the name of the parameter. Reimplement #48878. Close #48772. #49061 (Alexey Milovidov).
  • Added field rows with number of rows parsed from asynchronous insert to system.asynchronous_insert_log. #49120 (Anton Popov).
  • 1. Bump Intel QPL from v1.0.0 to v1.1.0 (fixes build issue #47877). 2. The DEFLATE_QPL codec now respects the maximum hardware jobs returned by libaccel_config. #49126 (jasperzhu).
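
The short-lived set cache described in the IN (subquery) mutation entry above (#46835) can be pictured with a small Python sketch. This is only an illustration of the idea, where the first task builds the set and concurrent tasks wait for it and reuse it. It is not ClickHouse's implementation: the class `SharedSetCache`, its methods, and the `demo` helper are hypothetical names invented for this example.

```python
import threading
import time
from concurrent.futures import Future


class SharedSetCache:
    """Hypothetical cache of sets that are being built or were just built."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> Future that resolves to a frozenset

    def get_or_build(self, key, build):
        # Only the first caller for a key becomes the owner and builds the set.
        with self._lock:
            future = self._entries.get(key)
            owner = future is None
            if owner:
                future = Future()
                self._entries[key] = future
        if owner:
            try:
                future.set_result(frozenset(build()))
            except Exception as exc:
                future.set_exception(exc)
                self.drop(key)  # let a later task retry the build
        # Owner and waiters alike get the same object (or the same error).
        return future.result()

    def drop(self, key):
        # Called when the mutation finishes, so the cache stays short-lived.
        with self._lock:
            self._entries.pop(key, None)


def demo(num_parts=8):
    """Simulate one mutation running over several parts concurrently."""
    cache = SharedSetCache()
    builds = []

    def build_ids():
        builds.append(1)   # count how many times the subquery "runs"
        time.sleep(0.05)   # stand-in for an expensive subquery
        return range(1000)

    results = [None] * num_parts

    def mutate_part(i):
        results[i] = cache.get_or_build("SELECT id FROM huge_table", build_ids)

    threads = [threading.Thread(target=mutate_part, args=(i,))
               for i in range(num_parts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(builds), results
```

Calling `demo()` runs eight simulated part tasks against one cache; the subquery stand-in is evaluated once and every task receives the same set object, so memory use no longer grows with the number of parts.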

Build/Testing/Packaging Improvement

  • Reduce the number of dependencies in the header files to speed up the build. #47984 (Dmitry Novik).
  • Randomize compression of marks and indices in tests. #48286 (Alexey Milovidov).
  • Randomize vertical merges from compact to wide parts in tests. #48287 (Raúl Marín).
  • With the current approach, all ports are calculated at the beginning and could overlap or even be hijacked (see the report for "port is already allocated"). This is possibly the reason for #45368. #48393 (Mikhail f. Shiryaev).
  • Update time zones. The following were updated: Africa/Cairo, Africa/Casablanca, Africa/El_Aaiun, America/Bogota, America/Cambridge_Bay, America/Ciudad_Juarez, America/Godthab, America/Inuvik, America/Iqaluit, America/Nuuk, America/Ojinaga, America/Pangnirtung, America/Rankin_Inlet, America/Resolute, America/Whitehorse, America/Yellowknife, Asia/Gaza, Asia/Hebron, Asia/Kuala_Lumpur, Asia/Singapore, Canada/Yukon, Egypt, Europe/Kirov, Europe/Volgograd, Singapore. #48572 (Alexey Milovidov).
  • Support for CRC32 checksum in HDFS. Fix performance issues. #48614 (Alexey Milovidov).
  • Remove remainders of GCC support. #48671 (Robert Schulze).
  • Add CI run with new analyzer infrastructure enabled. #48719 (Dmitry Novik).
  • Not for changelog. #48879 (larryluogit).
  • After the recent update, the dockerd requires --tlsverify=false together with the http port explicitly. #48924 (Mikhail f. Shiryaev).
  • Run more functional tests concurrently. #48970 (alesapin).
  • Fix glibc compatibility check: replace preadv from musl. #49144 (alesapin).
  • Use position independent encoding/code for sanitizers (at least msan :D) build to avoid issues with maximum relocation size. #49145 (alesapin).

Bug Fix (user-visible misbehavior in an official stable release)

Build Improvement

  • Fixed hashing issue in creating partition IDs for s390x. #48134 (Harry Lee).

NO CL ENTRY

NOT FOR CHANGELOG / INSIGNIFICANT