Merge remote-tracking branch 'upstream/master' into DateTime64_fixes_comparison

This commit is contained in:
Vasily Nemkov 2020-11-23 15:49:36 +03:00
commit c375dc6b14
14 changed files with 224 additions and 93 deletions


@ -1,9 +1,9 @@
# This strings autochanged from release_lib.sh:
SET(VERSION_REVISION 54443)
SET(VERSION_REVISION 54444)
SET(VERSION_MAJOR 20)
SET(VERSION_MINOR 12)
SET(VERSION_MINOR 13)
SET(VERSION_PATCH 1)
SET(VERSION_GITHASH c53725fb1f846fda074347607ab582fbb9c6f7a1)
SET(VERSION_DESCRIBE v20.12.1.1-prestable)
SET(VERSION_STRING 20.12.1.1)
SET(VERSION_GITHASH e581f9ccfc5c64867b0f488cce72412fd2966471)
SET(VERSION_DESCRIBE v20.13.1.1-prestable)
SET(VERSION_STRING 20.13.1.1)
# end of autochange

contrib/cctz

@ -1 +1 @@
Subproject commit 7a2db4ece6e0f1b246173cbdb62711ae258ee841
Subproject commit 260ba195ef6c489968bae8c88c62a67cdac5ff9d

debian/changelog

@ -1,5 +1,5 @@
clickhouse (20.12.1.1) unstable; urgency=low
clickhouse (20.13.1.1) unstable; urgency=low
* Modified source code
-- clickhouse-release <clickhouse-release@yandex-team.ru> Thu, 05 Nov 2020 21:52:47 +0300
-- clickhouse-release <clickhouse-release@yandex-team.ru> Mon, 23 Nov 2020 10:29:24 +0300


@ -1,7 +1,7 @@
FROM ubuntu:18.04
ARG repository="deb https://repo.clickhouse.tech/deb/stable/ main/"
ARG version=20.12.1.*
ARG version=20.13.1.*
RUN apt-get update \
&& apt-get install --yes --no-install-recommends \


@ -1,7 +1,7 @@
FROM ubuntu:20.04
ARG repository="deb https://repo.clickhouse.tech/deb/stable/ main/"
ARG version=20.12.1.*
ARG version=20.13.1.*
ARG gosu_ver=1.10
RUN apt-get update \


@ -1,7 +1,7 @@
FROM ubuntu:18.04
ARG repository="deb https://repo.clickhouse.tech/deb/stable/ main/"
ARG version=20.12.1.*
ARG version=20.13.1.*
RUN apt-get update && \
apt-get install -y apt-transport-https dirmngr && \


@ -1,42 +1,42 @@
# ClickHouse obfuscator
Simple tool for table data obfuscation.
It reads input table and produces output table, that retain some properties of input, but contains different data.
It allows to publish almost real production data for usage in benchmarks.
It is designed to retain the following properties of data:
- cardinalities of values (number of distinct values) for every column and for every tuple of columns;
- conditional cardinalities: number of distinct values of one column under condition on value of another column;
- probability distributions of absolute value of integers; sign of signed integers; exponent and sign for floats;
- probability distributions of length of strings;
- probability of zero values of numbers; empty strings and arrays, NULLs;
- data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across table; continuity of floating point values.
- date component of DateTime values;
- UTF-8 validity of string values;
- string values continue to look somewhat natural.
Most of the properties above are viable for performance testing:
reading data, filtering, aggregation and sorting will work at almost the same speed
as on original data due to saved cardinalities, magnitudes, compression ratios, etc.
It works in deterministic fashion: you define a seed value and transform is totally determined by input data and by seed.
Some transforms are one to one and could be reversed, so you need to have large enough seed and keep it in secret.
It use some cryptographic primitives to transform data, but from the cryptographic point of view,
It doesn't do anything properly and you should never consider the result as secure, unless you have other reasons for it.
It may retain some data you don't want to publish.
It always leave numbers 0, 1, -1 as is. Also it leaves dates, lengths of arrays and null flags exactly as in source data.
For example, you have a column IsMobile in your table with values 0 and 1. In transformed data, it will have the same value.
So, the user will be able to count exact ratio of mobile traffic.
Another example, suppose you have some private data in your table, like user email and you don't want to publish any single email address.
If your table is large enough and contain multiple different emails and there is no email that have very high frequency than all others,
It will perfectly anonymize all data. But if you have small amount of different values in a column, it can possibly reproduce some of them.
And you should take care and look at exact algorithm, how this tool works, and probably fine tune some of it command line parameters.
This tool works fine only with reasonable amount of data (at least 1000s of rows).
# ClickHouse obfuscator
A simple tool for table data obfuscation.
It reads an input table and produces an output table that retains some properties of the input but contains different data.
It allows publishing almost real production data for usage in benchmarks.
It is designed to retain the following properties of data:
- cardinalities of values (number of distinct values) for every column and every tuple of columns;
- conditional cardinalities: number of distinct values of one column under the condition on the value of another column;
- probability distributions of the absolute value of integers; the sign of signed integers; exponent and sign for floats;
- probability distributions of the length of strings;
- probability of zero values of numbers; empty strings and arrays, `NULL`s;
- data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across the table; continuity of floating-point values;
- date component of `DateTime` values;
- UTF-8 validity of string values;
- string values look natural.
Most of the properties above are viable for performance testing:
reading data, filtering, aggregation, and sorting will work at almost the same speed
as on original data due to saved cardinalities, magnitudes, compression ratios, etc.
It works in a deterministic fashion: you define a seed value and the transformation is determined by input data and by seed.
Some transformations are one-to-one and could be reversed, so you need to use a large seed and keep it secret.
It uses some cryptographic primitives to transform data, but from a cryptographic point of view it doesn't do so properly, so you should not consider the result secure unless you have another reason to. The result may retain some data you don't want to publish.
It always leaves 0, 1, -1 numbers, dates, lengths of arrays, and null flags exactly as in source data.
For example, you have a column `IsMobile` in your table with values 0 and 1. In transformed data, it will have the same value.
So, the user will be able to count the exact ratio of mobile traffic.
As another example, suppose you have some private data in your table, like user emails, and you don't want to publish any single email address.
If your table is large enough, contains many different emails, and no email has a much higher frequency than all the others, it will anonymize all data. But if you have a small number of different values in a column, it can reproduce some of them.
You should look at how this tool's algorithm works and fine-tune its command line parameters.
This tool works fine only with an average amount of data (at least 1000s of rows).
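
The README text above describes a deterministic, seed-driven transformation that leaves 0, 1, and -1 untouched and keeps the sign and magnitude of integers. The block below is only a toy sketch of that idea, not the obfuscator's actual algorithm: the function names `fnv1a` and `obfuscateInteger` and the choice of an FNV-1a hash as the mixer are assumptions made purely for illustration, and nothing here should be considered secure.

```javascript
// Toy illustration only; NOT the algorithm used by clickhouse-obfuscator.

// FNV-1a 32-bit hash of a string, used here as a cheap deterministic mixer.
function fnv1a(text) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash;
}

// Hypothetical transform: leaves 0, 1 and -1 as is, keeps the sign,
// and keeps the number of decimal digits of the absolute value.
function obfuscateInteger(value, seed) {
    if (value === 0 || value === 1 || value === -1) {
        return value;
    }
    const magnitude = Math.abs(value);
    const digits = String(magnitude).length;
    const low = 10 ** (digits - 1);   // smallest number with that many digits
    const span = 9 * low;             // count of numbers with that many digits
    const mixed = fnv1a(seed + ':' + magnitude);
    return Math.sign(value) * (low + (mixed % span));
}

// The same seed and input always give the same output.
console.log(obfuscateInteger(12345, 'my-secret-seed'));
console.log(obfuscateInteger(12345, 'my-secret-seed'));
console.log(obfuscateInteger(-42, 'my-secret-seed'));
```

Because the output depends only on the input value and the seed, repeated runs agree with each other, which is the property the README calls deterministic. It also shows why the seed has to stay secret: anyone who knows it can recompute the mapping.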


@ -0,0 +1,43 @@
# ClickHouse obfuscator
A simple tool for obfuscating table data.
It reads the data of an input table and creates an output table that retains some properties of the input data but contains different data.
This makes it possible to publish almost real data and use it in performance tests.
The obfuscator is designed to retain the following properties of data:
- cardinality (number of unique values) for every column and every tuple of columns;
- conditional cardinality: the number of unique values of one column depending on the value of another column;
- probability distributions of the absolute value of integers; the sign of Int-type numbers; the exponent and sign for floating-point numbers;
- probability distribution of string lengths;
- probability of zero values of numbers; empty strings and arrays, `NULL`;
- data compression ratio with the LZ77 algorithm and the entropy family of codecs;
- continuity (magnitude of difference) of time values in the table; continuity of floating-point values;
- the date from `DateTime` values;
- UTF-8 encoding of string values;
- string values look natural.
Most of the properties listed above are suitable for performance testing. Reading data, filtering, aggregation, and sorting will work at almost the same speed as on the original data, thanks to the preserved cardinality, magnitude, compression ratio, etc.
It works deterministically. You set a seed value, and the transformation is fully determined by the input data and the seed.
Some transformations are one-to-one and can be reversed. Therefore, you need to use a large seed value and keep it secret.
The obfuscator uses some cryptographic primitives to transform data, but from a cryptographic point of view the result is not secure. It may retain data that should not be published.
It always leaves the numbers 0, 1, -1, dates, array lengths, and null flags unchanged.
For example, if you have a column `IsMobile` in a table with values 0 and 1, it will have the same values in the transformed data.
Thus, the user will be able to calculate the exact ratio of mobile traffic.
Consider a case where you have some personal data in a table (for example, user email addresses) and you do not want to publish it.
If your table is large enough and contains many different email addresses, and none of them occurs frequently, the obfuscator will fully anonymize all data. But if a column has a small number of distinct values, it may reproduce some of them.
In that case, you should look at how the tool's algorithm works and tune the command-line parameters.
The obfuscator is useful for working with a medium amount of data (at least 1000 rows).


@ -6,7 +6,7 @@
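We use RoaringBitmap to store the bitmap object. When the cardinality is less than or equal to 32, it is stored as a Set. When the cardinality is greater than 32, it is stored as a RoaringBitmap. This is why low-cardinality sets are stored faster.
For more information on RoaringBitmap, see [呻吟声](https://github.com/RoaringBitmap/CRoaring).
For more information on RoaringBitmap, see [RoaringBitmap](https://github.com/RoaringBitmap/CRoaring).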
## bitmapBuild {#bitmapbuild}


@ -1,6 +1,7 @@
<html> <!-- TODO If I write DOCTYPE HTML something changes but I don't know what. -->
<head>
<meta charset="UTF-8">
<link rel="icon" href="">
<title>ClickHouse Query</title>
<!-- Code Style:
@ -21,26 +22,11 @@
<!-- Development Roadmap:
1. Add indication that the query was sent and when the query has been finished.
Do not use any animated spinners. Just a text or check mark.
Eliminate race conditions (results from the previous query should be ignored on arrival, the previous request should be cancelled).
2. Support readonly servers.
1. Support readonly servers.
Check if readonly = 1 (with SELECT FROM system.settings) to avoid sending settings. It can be done once on address/credentials change.
It can be done in background, e.g. wait 100 ms after address/credentials change and do the check.
Also it can provide visual indication that credentials are correct.
3. Add history in localstorage. Integrate with history API.
There can be a counter in localstorage, that will be appended to location #fragment.
The 'back', 'forward' buttons in browser should work.
Also there should be UI element to list all the queries from history and select from the list.
4. Trivial sharing capabilities.
Sharing is only possible when system.query_log is accessible. Read the X-ClickHouse-QueryId from the response.
Share button will: - emit SYSTEM FLUSH LOGS if not readonly; - find the query in the query_log;
- generate an URL with the query id and: server address if not equal to the URL's host; user name if not default;
indication that password should be entered in case of non-empty password.
-->
<style type="text/css">
@ -273,6 +259,22 @@
{
color: var(--null-color);
}
#hourglass
{
display: none;
padding-left: 1rem;
font-size: 110%;
color: #888;
}
#check-mark
{
display: none;
padding-left: 1rem;
font-size: 110%;
color: #080;
}
</style>
</head>
@ -286,6 +288,8 @@
<div id="run_div">
<button class="shadow" id="run">Run</button>
<span class="hint">&nbsp;(Ctrl+Enter)</span>
<span id="hourglass"></span>
<span id="check-mark"></span>
<span id="stats"></span>
<span id="toggle-dark">🌑</span><span id="toggle-light">🌞</span>
</div>
@ -299,50 +303,117 @@
<script type="text/javascript">
/// Incremental request number. When response is received,
/// if its request number does not equal the current request number, the response will be ignored.
/// This is to avoid race conditions.
var request_num = 0;
/// Save query in history only if it is different.
var previous_query = '';
/// Substitute the address of the server where the page is served.
if (location.protocol != 'file:') {
document.getElementById('url').value = location.origin;
}
function post()
/// Substitute user name if it's specified in the query string
var user_from_url = (new URL(window.location)).searchParams.get('user');
if (user_from_url) {
document.getElementById('user').value = user_from_url;
}
function postImpl(posted_request_num, query)
{
/// TODO: Avoid race condition on subsequent requests when responses may come out of order.
/// TODO: Check if URL already contains query string (append parameters).
var user = document.getElementById('user').value;
var password = document.getElementById('password').value;
var url = document.getElementById('url').value +
/// Ask server to allow cross-domain requests.
'?add_http_cors_header=1' +
'&user=' + encodeURIComponent(document.getElementById('user').value) +
'&password=' + encodeURIComponent(document.getElementById('password').value) +
'&user=' + encodeURIComponent(user) +
'&password=' + encodeURIComponent(password) +
'&default_format=JSONCompact' +
/// Safety settings to prevent results that browser cannot display.
'&max_result_rows=1000&max_result_bytes=10000000&result_overflow_mode=break';
var query = document.getElementById('query').value;
var xhr = new XMLHttpRequest;
xhr.open('POST', url, true);
xhr.send(query);
xhr.onreadystatechange = function()
{
if (this.readyState === XMLHttpRequest.DONE) {
if (this.status === 200) {
var json;
try { json = JSON.parse(this.response); } catch (e) {}
if (json !== undefined && json.statistics !== undefined) {
renderResult(json);
} else {
renderUnparsedResult(this.response);
}
} else {
/// TODO: Proper rendering of network errors.
renderError(this.response);
if (posted_request_num != request_num) {
return;
} else if (this.readyState === XMLHttpRequest.DONE) {
renderResponse(this.status, this.response);
/// The query is saved in browser history (in state JSON object)
/// as well as in URL fragment identifier.
if (query != previous_query) {
previous_query = query;
var title = "ClickHouse Query: " + query;
history.pushState(
{
query: query,
status: this.status,
response: this.response.length > 100000 ? null : this.response /// Lower than the browser's limit.
},
title,
window.location.pathname + '?user=' + encodeURIComponent(user) + '#' + window.btoa(query));
document.title = title;
}
} else {
//console.log(this);
}
}
document.getElementById('check-mark').style.display = 'none';
document.getElementById('hourglass').style.display = 'inline';
xhr.send(query);
}
function renderResponse(status, response) {
document.getElementById('hourglass').style.display = 'none';
if (status === 200) {
var json;
try { json = JSON.parse(response); } catch (e) {}
if (json !== undefined && json.statistics !== undefined) {
renderResult(json);
} else {
renderUnparsedResult(response);
}
document.getElementById('check-mark').style.display = 'inline';
} else {
/// TODO: Proper rendering of network errors.
renderError(response);
}
}
window.onpopstate = function(event) {
if (!event.state) {
return;
}
document.getElementById('query').value = event.state.query;
if (!event.state.response) {
clear();
return;
}
renderResponse(event.state.status, event.state.response);
};
if (window.location.hash) {
document.getElementById('query').value = window.atob(window.location.hash.substr(1));
}
function post()
{
++request_num;
var query = document.getElementById('query').value;
postImpl(request_num, query);
}
document.getElementById('run').onclick = function()
@ -350,7 +421,7 @@
post();
}
document.getElementById('query').onkeypress = function(event)
document.onkeypress = function(event)
{
/// Firefox has code 13 for Enter and Chromium has code 10.
if (event.ctrlKey && (event.charCode == 13 || event.charCode == 10)) {
@ -372,6 +443,9 @@
document.getElementById('error').style.display = 'none';
document.getElementById('stats').innerText = '';
document.getElementById('hourglass').style.display = 'none';
document.getElementById('check-mark').style.display = 'none';
}
function renderResult(response)
@ -443,7 +517,7 @@
function renderError(response)
{
clear();
document.getElementById('error').innerText = response;
document.getElementById('error').innerText = response ? response : "No response.";
document.getElementById('error').style.display = 'block';
}
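
The request-number guard added to the play page in this diff can be illustrated in isolation. The sketch below is a minimal, self-contained version of the idea, assuming hypothetical names (`guardLatest`, `rendered`) that do not appear in the commit; it swaps the XMLHttpRequest for plain callbacks so that out-of-order arrival can be simulated directly.

```javascript
// Hypothetical names throughout; this is not the page's actual code.
let requestNum = 0;

// Wrap a callback so that it only fires if no newer request
// has been issued since this one.
function guardLatest(callback) {
    const issued = ++requestNum;
    return function (...args) {
        if (issued !== requestNum) {
            return undefined; // A newer request exists: drop the stale response.
        }
        return callback(...args);
    };
}

// Simulate two requests whose responses arrive out of order.
const rendered = [];
const onFirst = guardLatest((text) => rendered.push(text));
const onSecond = guardLatest((text) => rendered.push(text));

onSecond('second result');
onFirst('first result'); // Ignored: the second request superseded it.

console.log(rendered);
```

The response to the first request arrives last and is dropped, so only `second result` is rendered. Note that this guard only ignores stale responses; it does not cancel the earlier request on the server.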


@ -225,7 +225,7 @@ void JSONCompactEachRowRowInputFormat::readField(size_t index, MutableColumns &
}
catch (Exception & e)
{
e.addMessage("(while read the value of key " + getPort().getHeader().getByPosition(index).name + ")");
e.addMessage("(while reading the value of key " + getPort().getHeader().getByPosition(index).name + ")");
throw;
}
}


@ -163,7 +163,7 @@ void JSONEachRowRowInputFormat::readField(size_t index, MutableColumns & columns
}
catch (Exception & e)
{
e.addMessage("(while read the value of key " + columnName(index) + ")");
e.addMessage("(while reading the value of key " + columnName(index) + ")");
throw;
}
}


@ -102,7 +102,7 @@ bool RegexpRowInputFormat::readField(size_t index, MutableColumns & columns)
}
catch (Exception & e)
{
e.addMessage("(while read the value of column " + getPort().getHeader().getByPosition(index).name + ")");
e.addMessage("(while reading the value of column " + getPort().getHeader().getByPosition(index).name + ")");
throw;
}
return read;


@ -93,6 +93,7 @@ const char * auto_contributors[] {
"Anton Zhabolenko",
"Ariel Robaldo",
"Arsen Hakobyan",
"ArtCorp",
"Artem Andreenko",
"Artem Gavrilov",
"Artem Hnilov",
@ -104,6 +105,7 @@ const char * auto_contributors[] {
"Arthur Petukhovsky",
"Arthur Tokarchuk",
"Arthur Wong",
"Artur",
"Artur Beglaryan",
"AsiaKorushkina",
"Atri Sharma",
@ -259,6 +261,7 @@ const char * auto_contributors[] {
"Jochen Schalanda",
"John",
"Jonatas Freitas",
"Kang Liu",
"Karl Pietrzak",
"Keiji Yoshida",
"Kiran",
@ -353,6 +356,7 @@ const char * auto_contributors[] {
"NeZeD [Mac Pro]",
"Neeke Gao",
"Nico Mandery",
"Nico Piderman",
"Nicolae Vartolomei",
"Nik",
"Nikhil Raman",
@ -379,7 +383,9 @@ const char * auto_contributors[] {
"Olga Revyakina",
"Orivej Desh",
"Oskar Wojciski",
"OuO",
"Paramtamtam",
"Patrick Zippenfenig",
"Pavel",
"Pavel Kartaviy",
"Pavel Kartavyy",
@ -461,6 +467,7 @@ const char * auto_contributors[] {
"Tsarkova Anastasia",
"Ubuntu",
"Ubus",
"V",
"VDimir",
"Vadim",
"Vadim Plakhtinskiy",
@ -527,6 +534,7 @@ const char * auto_contributors[] {
"Yury Stankevich",
"Zhichang Yu",
"Zhipeng",
"a.palagashvili",
"abdrakhmanov",
"abyss7",
"achimbab",
@ -546,6 +554,7 @@ const char * auto_contributors[] {
"ana-uvarova",
"andrei-karpliuk",
"andrewsg",
"annvsh",
"anrodigina",
"antikvist",
"anton",
@ -613,8 +622,10 @@ const char * auto_contributors[] {
"ggerogery",
"giordyb",
"glockbender",
"gyuton",
"hao.he",
"hcz",
"heng zhao",
"hexiaoting",
"hotid",
"hustnn",
@ -694,6 +705,7 @@ const char * auto_contributors[] {
"pufit",
"pyos",
"qianlixiang",
"qianmoQ",
"quid",
"r1j1k",
"rainbowsysu",
@ -716,6 +728,7 @@ const char * auto_contributors[] {
"spyros87",
"stavrolia",
"stepenhu",
"su-houzhen",
"sundy",
"sundy-li",
"sundyli",
@ -727,6 +740,7 @@ const char * auto_contributors[] {
"tavplubix",
"topvisor",
"tyrionhuang",
"ubuntu",
"unegare",
"unknown",
"urgordeadbeef",