ClickHouse/docs/changelogs/v23.9.1.1854-stable.md
2024-01-23 10:31:32 -05:00

sidebar_position: 1
sidebar_label: 2023

2023 Changelog

ClickHouse release v23.9.1.1854-stable (8f9a227de1) as compared to v23.8.1.2992-lts (ebc7d9a9f3)

Backward Incompatible Change

  • Remove the status_info configuration option and dictionaries status from the default Prometheus handler. #54090 (Alexey Milovidov).
  • The experimental parts metadata cache is removed from the codebase. #54215 (Alexey Milovidov).
  • Disable setting input_format_json_try_infer_numbers_from_strings by default, so numbers are no longer inferred from strings in JSON formats by default. This avoids possible parsing errors when sample data contains strings that look like numbers. #55099 (Kruglov Pavel).
  • IPv6 bloom filter indexes created prior to March 2023 are not compatible with the current version and have to be rebuilt. #54200 (Yakov Olkhovskiy).
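
The risk behind the input_format_json_try_infer_numbers_from_strings change can be illustrated with a minimal Python sketch. This is not ClickHouse's inference code; the function names and sample rows are hypothetical and only show how a type inferred from a numeric-looking sample can fail on later data.

```python
import json

def infer_type_from_strings(sample_rows, key):
    """Hypothetical inference: call the column Int64 if every sampled string parses as an integer."""
    try:
        for row in sample_rows:
            int(row[key])
        return "Int64"
    except ValueError:
        return "String"

def parse_value(value, type_name):
    """Parse one value according to the inferred type."""
    return int(value) if type_name == "Int64" else value

# Every string in the sample happens to look like a number.
sample = [json.loads(line) for line in ('{"id": "42"}', '{"id": "43"}')]
inferred = infer_type_from_strings(sample, "id")

# A later row contains a string that is not a number.
later_row = json.loads('{"id": "44a"}')
try:
    parse_value(later_row["id"], inferred)
    parse_failed = False
except ValueError:
    parse_failed = True
```

Keeping such columns as String, as the new default does, lets the later row load without an error.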

New Feature

  • Added new type of authentication based on SSH keys. It works only for Native TCP protocol. #41109 (George Gamezardashvili).
  • Added IO Scheduling support for remote disks. Storage configuration for disk types s3, s3_plain, hdfs and azure_blob_storage can now contain read_resource and write_resource elements holding resource names. Scheduling policies for these resources can be configured in a separate server configuration section resources. Queries can be marked using setting workload and classified using server configuration section workload_classifiers to achieve diverse resource scheduling goals. More details in docs/en/operations/workload-scheduling.md. #47009 (Sergei Trifonov).
  • Added a new column _block_number. Resolves #44532. #47532 (SmitaRKulkarni).
  • Add options partial_result_update_duration_ms and max_rows_in_partial_result to show updates of a partial result of output table in real-time during query execution. #48607 (Alexey Perevyshin).
  • Support case-insensitive and dot-all matching modes in RegExpTree dictionaries. #50906 (Johann Gan).
  • Add support for ALTER TABLE MODIFY COMMENT. Note: something similar was added by an external contributor a long time ago, but the feature did not work at all and only confused users. This closes #36377. #51304 (Alexey Milovidov).
  • Added "GCD" aka. "greatest common divisor" as a new data compression codec. The codec computes the GCD of all column values, and then divides each value by the GCD. The GCD codec is a data preparation codec (similar to Delta and DoubleDelta) and cannot be used stand-alone. It works with data of integer, decimal and date/time types. A viable use case for the GCD codec is column values that change (increase/decrease) in multiples of the GCD, e.g. 24 - 28 - 16 - 24 - 8 - 24 (assuming GCD = 4). #53149 (Alexander Nam).
  • Two new type aliases "DECIMAL(P)" (as shortcut for "DECIMAL(P, 0)") and "DECIMAL" (as shortcut for "DECIMAL(10, 0)") were added. This makes ClickHouse more compatible with MySQL's SQL dialect. #53328 (Val Doroshchuk).
  • Added a new system log table backup_log to track all BACKUP and RESTORE operations. #53638 (Victor Krasnov).
  • Added a format setting "output_format_markdown_escape_special_characters" (default: false). The setting controls whether special characters like "!", "#", "$" etc. are escaped (i.e. prefixed by a backslash) in the "Markdown" output format. #53860 (irenjj).
  • Add function decodeHTMLComponent. #54097 (Bharat Nallan).
  • Added peak_threads_usage to query_log table. #54335 (Alexey Gerasimchuck).
  • Add SHOW FUNCTIONS support to clickhouse-client. #54337 (Julia Kartseva).
  • Improved schema inference from JSON formats: 1) It is now possible to infer named Tuples from JSON objects without the experimental JSON type, under the setting input_format_json_try_infer_named_tuples_from_objects in JSON formats. Previously, without the experimental JSON type, JSON objects could only be inferred as Strings or Maps; now they can be inferred as named Tuples. The resulting Tuple type will contain all keys of the objects that were read in the data sample during schema inference. This can be useful for reading structured JSON data without sparse objects. The setting is enabled by default. 2) Allow parsing a JSON array into a column of type String under the setting input_format_json_read_arrays_as_strings. This can help with reading arrays whose values have different types. 3) Allow using type String for JSON keys with unknown types (null/[]/{}) in sample data under the setting input_format_json_infer_incomplete_types_as_strings. In JSON formats, any value can now be read into a String column, and using type String for unknown types avoids the error Cannot determine type for column 'column_name' by first 25000 rows of data, most likely this column contains only Nulls or empty Arrays/Maps during schema inference, so the data is read successfully. #54427 (Kruglov Pavel).
  • Added function "toDaysSinceYearZero" with alias "TO_DAYS()" (for compatibility with MySQL) which returns the number of days passed since 0001-01-01. #54479 (Robert Schulze).
  • Added functions YYYYMMDDToDate(), YYYYMMDDToDate32(), YYYYMMDDhhmmssToDateTime() and YYYYMMDDhhmmssToDateTime64(). They convert a date or date with time encoded as integer (e.g. 20230911) into a native date or date with time. As such, they provide the opposite functionality of existing functions toYYYYMMDD() and toYYYYMMDDhhmmss(). #54509 (Robert Schulze).
  • Added "bandwidth_limit" IO scheduling node type. It allows you to specify max_speed and max_burst constraints on traffic passing through this node. More details in docs/en/operations/workload-scheduling.md. #54618 (Sergei Trifonov).
  • Function toDaysSinceYearZero() now supports arguments of type DateTime and DateTime64. #54856 (Serge Klochkov).
  • Allow S3-style URLs for table functions s3, gcs, oss. URL is automatically converted to HTTP. Example: 's3://clickhouse-public-datasets/hits.csv' is converted to 'https://clickhouse-public-datasets.s3.amazonaws.com/hits.csv'. #54931 (Yarik Briukhovetskyi).
  • Add several string distance functions, including byteHammingDistance, byteJaccardIndex, byteEditDistance. #54935 (flynn).
  • Add new setting print_pretty_type_names to pretty-print deeply nested types like Tuple/Map/Array. #55095 (Kruglov Pavel).
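
The idea behind the GCD codec entry above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the changelog (compute the GCD of all values, divide each value by it), not ClickHouse's actual codec; the function names are hypothetical.

```python
from functools import reduce
from math import gcd

def gcd_encode(values):
    """Return (divisor, quotients), where divisor is the GCD of all values."""
    divisor = reduce(gcd, values, 0)
    if divisor == 0:
        # Empty or all-zero input: there is nothing to divide by, keep values as they are.
        return 1, list(values)
    return divisor, [v // divisor for v in values]

def gcd_decode(divisor, quotients):
    """Restore the original values by multiplying back."""
    return [q * divisor for q in quotients]

# The example sequence from the changelog entry.
values = [24, 28, 16, 24, 8, 24]
divisor, encoded = gcd_encode(values)
decoded = gcd_decode(divisor, encoded)
```

The quotients are smaller numbers than the originals, which is why the codec is meant to be combined with a general-purpose compression codec rather than used alone.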
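
The S3-style URL conversion mentioned above can be sketched as follows. This is a hedged illustration based only on the single example in the changelog; the function name is hypothetical, and the mappings for gcs and oss are not given in the changelog, so they are deliberately left out.

```python
from urllib.parse import urlsplit

def s3_style_to_https(url):
    """Convert 's3://bucket/key' into a virtual-hosted HTTPS URL (assumed mapping for the s3 scheme only)."""
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"unsupported scheme in this sketch: {parts.scheme!r}")
    bucket = parts.netloc   # the bucket name sits where a host would normally be
    return f"https://{bucket}.s3.amazonaws.com{parts.path}"

converted = s3_style_to_https("s3://clickhouse-public-datasets/hits.csv")
```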

Performance Improvement

  • Improve performance of sorting for decimal columns. Improve performance of insertion into MergeTree if ORDER BY contains Decimal column. Improve performance of sorting when data is already sorted or almost sorted. #35961 (Maksim Kita).
  • Improve performance for huge query analysis. Fixes #51224. #51469 (frinkr).
  • Add rewriter for new analyzer. #52082 (JackyWoo).
  • Add rewriter for both old and new analyzer. Add setting optimize_uniq_to_count. #52645 (JackyWoo).
  • Remove manual calls to mmap/mremap/munmap and delegate all this work to jemalloc. #52792 (Nikita Taranov).
  • roaringBitmaps are now optimized before serialization. #52842 (UnamedRus).
  • Optimize group by constant keys. Will optimize queries with group by _file/_path after https://github.com/ClickHouse/ClickHouse/pull/53529. #53549 (Kruglov Pavel).
  • Speed up reading from S3 by enabling prefetches by default. #53709 (Alexey Milovidov).
  • Do not implicitly read pk and version columns in lonely parts if unnecessary. #53919 (Duc Canh Le).
  • Fixed high CPU consumption when working with NATS. #54399 (Vasilev Pyotr).
  • Since we use separate instructions for executing toString() with datetime argument, it is possible to improve performance a bit for non-datetime arguments and have some parts of the code cleaner. Follows up #53680. #54443 (Yarik Briukhovetskyi).
  • Instead of serializing JSON elements into a std::stringstream, put the serialization result into ColumnString directly. #54613 (lgbo).
  • Enable ORDER BY optimization for reading data in corresponding order from a MergeTree table in case that the table is behind a view. #54628 (Vitaly Baranov).
  • Improve JSON SQL functions by reusing GeneratorJSONPath. GeneratorJSONPath's constructor makes several make_shared calls, so constructing it repeatedly performs badly. #54735 (lgbo).

Improvement

  • Keeper improvement: Add a createIfNotExists Keeper command. #48855 (Konstantin Bogdanov).
  • Add IF EMPTY clause for DROP TABLE queries. #48915 (Pavel Novitskiy).
  • The Keeper dynamically adjusts log levels. #50372 (helifu).
  • Allow replacing long file names of columns in MergeTree data parts with hashes of the names. This helps avoid the File name too long error in some cases. #50612 (Anton Popov).
  • Allow specifying the expiration date and, optionally, the time for user credentials with VALID UNTIL datetime clause. #51261 (Nikolay Degterinsky).
  • Add setting ignore_access_denied_multidirectory_globs. #52839 (Andrey Zvonov).
  • Output valid JSON/XML on exception during HTTP query execution. Add setting http_write_exception_in_output_format to enable/disable this behaviour (enabled by default). #52853 (Kruglov Pavel).
  • More precise Integer type inference, fix #51236. #53003 (Chen768959).
  • Keeper tries to batch flush requests for better performance. #53049 (Antonio Andelic).
  • Introduced resolving of charsets in the string literals for MaterializedMySQL. #53220 (Val Doroshchuk).
  • Fix a subtle issue with a rarely used EmbeddedRocksDB table engine in an extremely rare scenario: sometimes the EmbeddedRocksDB table engine does not close files correctly in NFS after running DROP TABLE. #53502 (Mingliang Pan).
  • SQL functions "toString(datetime)" and "formatDateTime()" now support non-constant timezone arguments. #53680 (Yarik Briukhovetskyi).
  • RESTORE TABLE ON CLUSTER must create replicated tables with a matching UUID on hosts. Otherwise the macro {uuid} in ZooKeeper path can't work correctly after RESTORE. This PR implements that. #53765 (Vitaly Baranov).
  • Added restore setting restore_broken_parts_as_detached: if it's true, the RESTORE process won't stop on broken parts while restoring; instead, all the broken parts will be copied to the detached folder with the prefix 'broken-from-backup'. If it's false, the RESTORE process will stop on the first broken part (if any). The default value is false. #53877 (Vitaly Baranov).
  • The creation of Annoy indexes can now be parallelized using setting max_threads_for_annoy_index_creation. #54047 (Robert Schulze).
  • The MySQL interface gained a minimal implementation of prepared statements, just enough to allow a connection from Tableau Online to ClickHouse via the MySQL connector. #54115 (Serge Klochkov).
  • Replaced the library to handle (encode/decode) base64 values from Turbo-Base64 to aklomp-base64. Both are SIMD-accelerated on x86 and ARM but 1. the license of the latter (BSD-2) is more favorable for ClickHouse, Turbo64 switched in the meantime to GPL-3, 2. with more GitHub stars, aklomp-base64 seems more future-proof, 3. aklomp-base64 has a slightly nicer API (which is arguably subjective), and 4. aklomp-base64 does not require us to hack around bugs (like non-threadsafe initialization). Note: aklomp-base64 rejects unpadded base64 values whereas Turbo-Base64 decodes them on a best-effort basis. RFC-4648 leaves it open whether padding is mandatory or not, but depending on the context this may be a behavioral change to be aware of. #54119 (Mikhail Koviazin).
  • Add elapsed_ns to HTTP headers X-ClickHouse-Progress and X-ClickHouse-Summary. #54179 (joelynch).
  • Implementation of reconfig (https://github.com/ClickHouse/ClickHouse/pull/49450), sync, and exists commands for keeper-client. #54201 (pufit).
  • "clickhouse-local" and "clickhouse-client" now allow specifying the "--query" parameter multiple times, e.g. './clickhouse-client --query "SELECT 1" --query "SELECT 2"'. This syntax is slightly more intuitive than './clickhouse-client --multiquery "SELECT 1;SELECT 2"', a bit easier to script (e.g. "queries.push_back('--query "$q"')") and more consistent with the behavior of the existing parameter "--queries-file" (e.g. "./clickhouse client --queries-file queries1.sql --queries-file queries2.sql"). #54249 (Robert Schulze).
  • Add sub-second precision to formatReadableTimeDelta. #54250 (Andrey Zvonov).
  • Fix wrong reallocation in HashedArrayDictionary. #54254 (Vitaly Baranov).
  • Enable allow_remove_stale_moving_parts by default. #54260 (vdimir).
  • Fix using count from cache and improve progress bar for reading from archives. #54271 (Kruglov Pavel).
  • Add support for S3 credentials using SSO. To define a profile to be used with SSO, set AWS_PROFILE environment variable. #54347 (Antonio Andelic).
  • Support NULL as default for nested types Array/Tuple/Map for input formats. Closes #51100. #54351 (Kruglov Pavel).
  • A bug fix classified as an improvement because it may not be possible to add a test covering the case. The issue was introduced in https://github.com/ClickHouse/ClickHouse/pull/45878, which is when ClickHouse started reading Arrow data in batches. #54370 (Arthur Passos).
  • Add STD alias to stddevPop function for MySQL compatibility. Closes #54274. #54382 (Nikolay Degterinsky).
  • Add addDate function for compatibility with MySQL and subDate for consistency. Reference #54275. #54400 (Nikolay Degterinsky).
  • Parse data in JSON format as JSONEachRow if parsing the metadata fails. This allows reading files with the .json extension even if the real format is JSONEachRow. Closes #45740. #54405 (Kruglov Pavel).
  • Pass http retry timeout as milliseconds. #54438 (Duc Canh Le).
  • Support SAMPLE BY for VIEW. #54477 (Azat Khuzhin).
  • Add modification_time into system.detached_parts. #54506 (Azat Khuzhin).
  • Added a setting "splitby_max_substrings_includes_remaining_string" which controls if functions "splitBy*()" with argument "max_substring" > 0 include the remaining string (if any) in the result array (Python/Spark semantics) or not. The default behavior does not change. #54518 (Robert Schulze).
  • clickhouse-client now processes files in parallel in case of INFILE 'glob_expression'. Closes #54218. #54533 (Max K.).
  • Allow to use primary key for IN function where primary key column types are different from IN function right side column types. Example: SELECT id FROM test_table WHERE id IN (SELECT '5'). Closes #48936. #54544 (Maksim Kita).
  • Better integer type inference for Int64/UInt64 fields. Continuation of https://github.com/ClickHouse/ClickHouse/pull/53003. It now also works for nested types like Arrays of Arrays and for functions like map/tuple. Issue: #51236. #54553 (Kruglov Pavel).
  • HashJoin tries to shrink internal buffers consuming half of maximal available memory (set by max_bytes_in_join). #54584 (vdimir).
  • Added array operations for multiplication, division and modulo by a scalar. They work in either order, for example 5 * [5, 5] and [5, 5] * 5 are both possible. #54608 (Yarik Briukhovetskyi).
  • Added function timestamp for compatibility with MySQL. Closes #54275. #54639 (Nikolay Degterinsky).
  • Respect max_block_size for array join to avoid possible OOM. Closes #54290. #54664 (李扬).
  • Add optional version argument to rm command in keeper-client to support safer deletes. #54708 (János Benjamin Antal).
  • Disable killing the server by systemd (that may lead to data loss when using Buffer tables). #54744 (Azat Khuzhin).
  • Added field "is_deterministic" to system table "system.functions" which indicates whether the result of a function is stable between two invocations (given exactly the same inputs) or not. #54766 (Robert Schulze).
  • Made the views in schema "information_schema" more compatible with the equivalent views in MySQL (i.e. modified and extended them) up to a point where Tableau Online is able to connect to ClickHouse. More specifically: 1. The type of field "information_schema.tables.table_type" changed from Enum8 to String. 2. Added fields "table_comment" and "table_collation" to view "information_schema.tables". 3. Added views "information_schema.key_column_usage" and "referential_constraints". 4. Replaced uppercase aliases in "information_schema" views with concrete uppercase columns. #54773 (Serge Klochkov).
  • The query cache now returns an error if the user tries to cache the result of a query with a non-deterministic function such as "now()", "randomString()" and "dictGet()". Compared to the previous behavior (silently not caching the result), this reduces confusion and surprise for users. #54801 (Robert Schulze).
  • Forbid special columns for file/s3/url/... storages, fix insert into ephemeral columns from files. Closes #53477. #54803 (Kruglov Pavel).
  • More configurable collecting metadata for backup. #54804 (Vitaly Baranov).
  • clickhouse-local's log file (if enabled with --server_logs_file flag) will now prefix each line with timestamp, thread id, etc, just like clickhouse-server. #54807 (Michael Kolupaev).
  • Reuse HTTP connections in s3 table function. #54812 (Michael Kolupaev).
  • Avoid excessive calls to getifaddrs in isLocalAddress. #54819 (Duc Canh Le).
  • Field "is_obsolete" in system.merge_tree_settings is now 1 for obsolete merge tree settings. Previously, only the description indicated that the setting is obsolete. #54837 (Robert Schulze).
  • Make it possible to use plural when using interval literals. INTERVAL 2 HOURS should be equivalent to INTERVAL 2 HOUR. #54860 (Jordi Villar).
  • Replace the linear method in MergeTreeRangeReader::Stream::ceilRowsToCompleteGranules with a binary search. #54869 (usurai).
  • Always allow the creation of a projection with Nullable PK. This fixes #54814. #54895 (Amos Bird).
  • Retry backup S3 operations after connection reset failure. #54900 (Vitaly Baranov).
  • Make the exception message exact when the maximum value of a setting is less than its minimum value. #54925 (János Benjamin Antal).
  • LIKE, match, and other regular expressions matching functions now allow matching with patterns containing non-UTF-8 substrings by falling back to binary matching. Example: you can use string LIKE '\xFE\xFF%' to detect BOM. This closes #54486. #54942 (Alexey Milovidov).
  • Added the ContextLockWaitMicroseconds event to ProfileEvents. #55029 (Maksim Kita).
  • Added field "is_deterministic" to system table "system.functions" which indicates whether the result of a function is stable between two invocations (given exactly the same inputs) or not. #55035 (Robert Schulze).
  • View information_schema.tables now has a new field data_length which shows the approximate size of the data on disk. Required to run queries generated by Amazon QuickSight. #55037 (Robert Schulze).
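
The two behaviors selected by splitby_max_substrings_includes_remaining_string can be contrasted with a small Python sketch. It models the semantics as described in the entry above (remaining string dropped by default, or kept as in Python/Spark); the function and parameter names are hypothetical and this is not ClickHouse code.

```python
def split_by_char(separator, text, max_substrings=0, includes_remaining_string=False):
    """Split text on separator, returning at most max_substrings items when max_substrings > 0."""
    if max_substrings <= 0:
        # No limit: return every substring.
        return text.split(separator)
    if includes_remaining_string:
        # Python/Spark semantics: the last element keeps the unsplit remainder.
        return text.split(separator, max_substrings - 1)
    # Assumed default semantics: anything past the limit is dropped.
    return text.split(separator)[:max_substrings]

default_result = split_by_char("=", "a=b=c", 2)
with_remainder = split_by_char("=", "a=b=c", 2, includes_remaining_string=True)
```

Here default_result is ['a', 'b'], while with_remainder is ['a', 'b=c'].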

Build/Testing/Packaging Improvement

  • ClickHouse is built with Musl instead of GLibc by default. #52550 (Alexey Milovidov).
  • ClickHouse is built with Musl instead of GLibc. #52721 (Azat Khuzhin).
  • Bumped the compiler of official and continuous integration builds of ClickHouse from Clang 16 to 17. #53831 (Robert Schulze).
  • Fix flaky test. wait_resolver function was asserting the response to be == proxy1, but it might actually return proxy2. Account for that as well. #54191 (Arthur Passos).
  • Regenerated tld data for lookups (tldLookup.generated.cpp). #54269 (Bharat Nallan).
  • Report properly timeout for check itself in fast_test_check/stress_check. #54278 (Igor Nikonov).
  • Suddenly, test_host_regexp_multiple_ptr_records_concurrent became flaky. #54307 (Arthur Passos).
  • Fixed precise float parsing issue on s390x. #54330 (Harry Lee).
  • Enrich changed_images.json with the latest tag from master for images that are not changed in the pull request. #54369 (János Benjamin Antal).
  • Fixed endian issue in jemalloc_bins system table for s390x. #54517 (Harry Lee).
  • Fixed random generation issue for UInt256 and IPv4 on s390x. #54576 (Harry Lee).
  • Remove redundant clickhouse-keeper-client symlink. #54587 (Tomas Barton).
  • Use /usr/bin/env to resolve bash. #54603 (Fionera).
  • Move all tests/ci/*.lib files to stateless-tests image. Closes #54540. #54668 (Kruglov Pavel).
  • We build and upload them for every push, which isn't worth it. #54675 (Mikhail f. Shiryaev).
  • Fixed SimHash function endian issue for s390x. #54793 (Harry Lee).
  • Do not clone the fast tests repo twice; parallelize submodules checkout; use the current user in the fast-tests container. #54849 (Mikhail f. Shiryaev).
  • Avoid running pull request ci workflow for fixes touching .md files only. #54914 (Max K.).
  • CMake added PROFILE_CPU option needed to perform perf record without using DWARF call graph. #54917 (Maksim Kita).
  • Use --gtest_output='json:' to parse unit test results. #54922 (Mikhail f. Shiryaev).
  • Added support for additional scripts (you need to mount a volume) to extend the build process. #55000 (Nikita Mikhaylov).
  • If the linker is different than LLD, stop with a fatal error. #55036 (Alexey Milovidov).

Bug Fix (user-visible misbehavior in an official stable release)

NO CL ENTRY

NOT FOR CHANGELOG / INSIGNIFICANT