Merge branch 'master' of github.com:yandex/ClickHouse

BayoNet 2018-10-22 19:18:01 +03:00
commit bd81b72e4a
77 changed files with 3758 additions and 1243 deletions

View File

@ -1,3 +1,119 @@
## ClickHouse release 18.14.9, 2018-10-16
### New features:
* The `WITH CUBE` modifier for `GROUP BY` (the alternative syntax `GROUP BY CUBE(...)` is also available); see the sketch after this list.
* Added the `formatDateTime` function. [Alexandr Krasheninnikov]()
* Added the `JDBC` table engine and the `jdbc` table function (requires installing clickhouse-jdbc-bridge). [Alexandr Krasheninnikov]()
* Added functions for working with the ISO week number: `toISOWeek`, `toISOYear`, `toStartOfISOYear`.
* Added the `toDayOfYear` function.
* Added support for `Nullable` columns in `MySQL` and `ODBC` tables.
* Nested data structures can now be read as nested objects in the `JSONEachRow` format. Added the `input_format_import_nested_json` setting. [Veloman Yunkan]()
* Parallel processing of multiple `MATERIALIZED VIEW`s during data insertion, controlled by the `parallel_view_processing` setting. [Marek Vavruša]()
* Added the `SYSTEM FLUSH LOGS` query (forced flushing of logs to system tables such as `query_log`).
* The predefined `database` and `table` macros can now be used when declaring `Replicated` tables.
* Added the ability to read `Decimal` values in engineering notation (with a decimal exponent).
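A minimal sketch of a few of these features in use; the `sales` table and its columns are hypothetical and serve only as an illustration:

```sql
-- WITH CUBE: aggregate over all combinations of the grouping keys
SELECT region, product, sum(amount)
FROM sales
GROUP BY region, product WITH CUBE;

-- New date and time helpers
SELECT
    formatDateTime(now(), '%Y-%m-%d %H:%M:%S'),
    toISOWeek(today()),
    toStartOfISOYear(today()),
    toDayOfYear(today());

-- Force system log tables such as query_log to be flushed
SYSTEM FLUSH LOGS;
```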
### Experimental features:
* `GROUP BY` optimization for `LowCardinality` data types.
* Optimized expression evaluation for `LowCardinality` data types.
### Improvements:
* Significantly reduced memory consumption for queries with `ORDER BY` and `LIMIT`. See the `max_bytes_before_remerge_sort` setting.
* If the `JOIN` type is not specified (`LEFT`, `INNER`, ...), `INNER JOIN` is assumed.
* Qualified asterisks now work correctly in queries with `JOIN`. [Winter Zhang]()
* The `ODBC` table engine correctly chooses the identifier quoting method for the SQL dialect of the remote database. [Alexandr Krasheninnikov]()
* The `compile_expressions` setting (JIT compilation of expressions) is enabled by default.
* Fixed behavior for simultaneous DROP DATABASE/TABLE IF EXISTS and CREATE DATABASE/TABLE IF NOT EXISTS. Previously, a `CREATE DATABASE ... IF NOT EXISTS` query could return the error message "File ... already exists", and the `CREATE TABLE ... IF NOT EXISTS` and `DROP TABLE IF EXISTS` queries could return `Table ... is creating or attaching right now`.
* LIKE and IN expressions with a constant right-hand side are passed through to the remote server when querying MySQL and ODBC tables.
* Comparisons with constant expressions in a WHERE clause are passed through to the remote server when querying MySQL and ODBC tables. Previously, only comparisons with constants were passed through.
* Correct calculation of row width in the terminal for `Pretty` formats, including strings with hieroglyphs. [Amos Bird]()
* `ON CLUSTER` can be specified for `ALTER UPDATE` queries.
* Improved performance of reading data in the `JSONEachRow` format.
* Added the `LENGTH` and `CHARACTER_LENGTH` function synonyms for compatibility. The `CONCAT` function is no longer case-sensitive (see the sketch after this list).
* Added the `TIMESTAMP` synonym for the `DateTime` type.
* Server logs always reserve space for the query_id, even when a log line is not related to a query. This makes it easier to parse server text logs with third-party tools.
* Memory consumption by a query is logged each time it crosses the next whole-gigabyte mark.
* Added a compatibility mode for the case when a client library using the Native protocol mistakenly sends fewer columns than the server expects for an INSERT query. This scenario was possible when using the clickhouse-cpp library; previously it caused the server to crash.
* In `clickhouse-copier`, the user-defined WHERE expression can now use the `partition_key` alias (for additional filtering by partitions of the source table). This is useful if the partitioning scheme changes during copying, but only slightly.
* The worker thread of the `Kafka` engine has been moved to the background thread pool in order to automatically reduce the data read rate under heavy load. [Marek Vavruša]()
* Support for reading `Tuple` values and `Nested` structures as `struct` in the `Cap'n'Proto` format. [Marek Vavruša]()
* The `biz` domain was added to the list of top-level domains for the `firstSignificantSubdomain` function. [decaseal]()
* In the configuration of external dictionaries, an empty `null_value` is interpreted as the default value of the data type.
* Support for the `intDiv` and `intDivOrZero` functions for `Decimal`.
* Support for the `Date`, `DateTime`, `UUID`, and `Decimal` types as the key for the `sumMap` aggregate function.
* Support for the `Decimal` data type in external dictionaries.
* Support for the `Decimal` data type in `SummingMergeTree` tables.
* Added a specialization for `UUID` in the `if` function.
* Reduced the number of `open` and `close` system calls when reading from `MergeTree` family tables.
* A `TRUNCATE TABLE` query can be run on any replica (the query is forwarded to the leader replica). [Kirill Shvakov]()
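A short sketch of the compatibility aliases and the extended `sumMap` keys; the table definitions are hypothetical:

```sql
-- LENGTH / CHARACTER_LENGTH synonyms; CONCAT is now case-insensitive
SELECT LENGTH('abc'), CHARACTER_LENGTH('abc'), CONCAT('foo', 'bar');

-- TIMESTAMP as a synonym for the DateTime type
CREATE TABLE ts_demo (t TIMESTAMP) ENGINE = Memory;

-- Date values as sumMap keys
CREATE TABLE visits (d Array(Date), hits Array(UInt64)) ENGINE = Memory;
SELECT sumMap(d, hits) FROM visits;
```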
### Bug fixes:
* Fixed an issue with `Dictionary` tables for `range_hashed` dictionaries. The error appeared in version 18.12.17.
* Fixed an error when loading `range_hashed` dictionaries (the message `Unsupported type Nullable(...)`). The error appeared in version 18.12.17.
* Fixed incorrect results of the `pointInPolygon` function caused by the accumulation of rounding errors for polygons with a large number of closely located vertices.
* If, after merging data parts, the checksum of the resulting part differs from the result of the same merge on another replica, the merge result is discarded and the part is downloaded from the other replica instead (this is the correct behavior). However, after downloading, the part could not be added to the working set because of a "part already exists" error (the merged part was deleted with some delay rather than immediately). This led to cyclic attempts to download the same data.
* Fixed incorrect accounting of total memory consumption by queries (which broke the `max_memory_usage_for_all_queries` setting and the `MemoryTracking` metric). The error appeared in version 18.12.13. [Marek Vavruša]()
* Fixed `CREATE TABLE ... ON CLUSTER ... AS SELECT ...` queries. The error appeared in version 18.12.13.
* Fixed unnecessary preparation of data structures for a `JOIN` on the server that initiates the query when the `JOIN` is performed only on remote servers.
* Fixed bugs in the `Kafka` engine: inoperability after an exception when starting to read data, and a lockup on shutdown. [Marek Vavruša]()
* The optional `schema` parameter (the `Cap'n'Proto` format schema) was not passed for `Kafka` tables. [Vojtech Splichal]
* If the ZooKeeper ensemble contains servers that accept a connection but then immediately close it instead of responding to the handshake, ClickHouse chooses another server to connect to. Previously, this produced the error `Cannot read all data. Bytes read: 0. Bytes expected: 4.` and the server could not start.
* If the ZooKeeper ensemble contains servers for which the DNS query returns an error, such servers are skipped.
* Fixed type conversion between `Date` and `DateTime` when inserting data in the `VALUES` format (when `input_format_values_interpret_expressions = 1`). Previously, the conversion was performed between the numeric value of the number of days since the Unix epoch and the Unix timestamp, which led to unexpected results.
* Fixed type conversion between `Decimal` and integers.
* Fixed errors in the `enable_optimize_predicate_expression` setting. [Winter Zhang]()
* The `enable_optimize_predicate_expression` setting is disabled by default.
* Fixed a CSV parsing error for floating-point numbers when a non-default CSV delimiter such as `;` is used.
* Fixed the `arrayCumSumNonNegative` function (it does not accumulate negative values once the accumulator drops below zero).
* Fixed `Merge` tables on top of `Distributed` tables when `PREWHERE` is used.
* Bug fixes in the `ALTER UPDATE` query.
* Fixed bugs in the `odbc` table function that appeared in version 18.12.
* Fixed aggregate functions with `StateArray` combinators.
* Fixed a crash when dividing a `Decimal` value by zero.
* Fixed type deduction for operations with `Decimal` and integer arguments.
* Fixed a segfault during `GROUP BY` on `Decimal128`.
* The `log_query_threads` setting (logging information about each thread of query execution) now takes effect only if the `log_queries` setting (logging information about queries) is set to 1 (see the sketch after this list). Since `log_query_threads` is enabled by default, information about threads was previously logged even when query logging was disabled.
* Fixed an error in the distributed operation of the quantiles aggregate function (error messages like `Not found column quantile...`).
* Fixed a compatibility issue on clusters that mix version 18.12.17 servers with older servers: for distributed queries with GROUP BY on keys of both fixed and non-fixed length, incompletely aggregated data (the same aggregation keys in two different rows) could be returned when the amount of data being aggregated was large.
* Fixed handling of substitutions in `clickhouse-performance-test` when a query contains only some of the substitutions declared in the test.
* Fixed an error when using `FINAL` together with `PREWHERE`.
* Fixed an error when using `PREWHERE` over columns added by `ALTER`.
* Added a check for the absence of `arrayJoin` in `DEFAULT` and `MATERIALIZED` expressions. Previously, `arrayJoin` caused an error when inserting data.
* Added a check for the absence of `arrayJoin` in a `PREWHERE` clause. Previously, it led to messages like `Size ... doesn't match` or `Unknown compression method` when executing queries.
* Fixed a segfault that could occur in rare cases after an optimization that replaces AND chains of expression-equals-constant comparisons with the corresponding IN expression. [liuyimin]()
* Minor fixes to `clickhouse-benchmark`: previously, client information was not sent to the server; the number of executed queries is now counted more accurately on shutdown and when limiting the number of iterations.
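One of the fixes above concerns the interplay of two logging settings; a sketch of the combination that is now required:

```sql
-- log_query_threads only takes effect when query logging itself is enabled
SET log_queries = 1;
SET log_query_threads = 1;
```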
### Backward incompatible changes:
* Removed the `allow_experimental_decimal_type` setting. The `Decimal` data type is available by default (see the sketch below).
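A minimal sketch of the new default: `Decimal` columns can be declared without any extra setting (the table name is hypothetical):

```sql
-- No allow_experimental_decimal_type needed anymore
CREATE TABLE dec_demo (x Decimal(18, 4)) ENGINE = Memory;
INSERT INTO dec_demo VALUES (1.2345);
SELECT sum(x) FROM dec_demo;
```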
## ClickHouse release 18.12.17, 2018-09-16
### New features:
* `invalidate_query` (the ability to specify a query to check whether an external dictionary needs to be updated) is implemented for the `clickhouse` source.
* Added the ability to use the `UInt*`, `Int*`, and `DateTime` data types (along with `Date`) as the key of a `range_hashed` external dictionary that defines range boundaries. `NULL` can now be used to denote an open range. [Vasily Nemkov]()
* The `Decimal` type now supports the `var*` and `stddev*` aggregate functions.
* The `Decimal` type now supports mathematical functions (`exp`, `sin`, and so on); see the sketch after this list.
* The `partition_id` column was added to the `system.part_log` table.
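A sketch of the new `Decimal` support, reusing the hypothetical `dec_demo` table from above; reading `system.part_log` assumes `part_log` is enabled in the server config:

```sql
-- var*/stddev* aggregates and math functions now accept Decimal arguments
SELECT varSamp(x), stddevSamp(x), exp(x) FROM dec_demo;

-- partition_id is now recorded in system.part_log
SELECT partition_id FROM system.part_log LIMIT 1;
```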
### Bug fixes:
* Fixed `Merge` tables on top of `Distributed` tables. [Winter Zhang]()
* Fixed an incompatibility (an unnecessary dependency on the `glibc` version) that made it impossible to run ClickHouse on `Ubuntu Precise` and older. The incompatibility appeared in version 18.12.13.
* Fixed errors in the `enable_optimize_predicate_expression` setting. [Winter Zhang]()
* Fixed a minor loss of backward compatibility that appeared when running a cluster of replicas older than 18.12.13 and simultaneously creating a new replica of a table on a server with a newer version (the message `Can not clone replica, because the ... updated to new ClickHouse version` is shown, which is logical but should not happen).
### Backward incompatible changes:
* The `enable_optimize_predicate_expression` setting is enabled by default, which is admittedly optimistic. If query analysis errors related to column name resolution occur, set `enable_optimize_predicate_expression` to 0 (see the sketch below). [Winter Zhang]()
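The workaround mentioned above, as a one-liner:

```sql
-- Disable predicate pushdown if column name resolution errors appear
SET enable_optimize_predicate_expression = 0;
```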
## ClickHouse release 18.12.14, 2018-09-13
### New features:

View File

@ -147,7 +147,7 @@ set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${COMPILER_FLAGS} -fn
set (CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO} -O3 ${CMAKE_C_FLAGS_ADD}")
set (CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -O0 -g3 -ggdb3 -fno-inline ${CMAKE_C_FLAGS_ADD}")
if (MAKE_STATIC_LIBRARIES AND NOT APPLE AND NOT (CMAKE_CXX_COMPILER_ID STREQUAL "Clang" AND OS_FREEBSD))
if (MAKE_STATIC_LIBRARIES AND NOT APPLE AND NOT (COMPILER_CLANG AND OS_FREEBSD))
set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -static-libgcc -static-libstdc++")
# Along with executables, we also build example of shared library for "library dictionary source"; and it also should be self-contained.
@ -158,7 +158,7 @@ set(THREADS_PREFER_PTHREAD_FLAG ON)
include (cmake/test_compiler.cmake)
if (OS_LINUX AND CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
if (OS_LINUX AND COMPILER_CLANG)
set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS}")
option (USE_LIBCXX "Use libc++ and libc++abi instead of libstdc++ (only make sense on Linux with Clang)" ${HAVE_LIBCXX})

View File

@ -12,3 +12,4 @@ ClickHouse is an open-source column-oriented database management system that all
## Upcoming Meetups
* [Beijing on October 28](http://www.clickhouse.com.cn/topic/5ba0e3f99d28dfde2ddc62a1)
* [Amsterdam on November 15](https://events.yandex.com/events/meetings/15-11-2018/)

View File

@ -2,9 +2,9 @@
#if USE_POCO_SQLODBC || USE_POCO_DATAODBC
#if USE_POCO_SQLODBC
#include <Poco/SQL/ODBC/ODBCException.h>
#include <Poco/SQL/ODBC/SessionImpl.h>
#include <Poco/SQL/ODBC/Utility.h>
#include <Poco/SQL/ODBC/ODBCException.h> // Y_IGNORE
#include <Poco/SQL/ODBC/SessionImpl.h> // Y_IGNORE
#include <Poco/SQL/ODBC/Utility.h> // Y_IGNORE
#define POCO_SQL_ODBC_CLASS Poco::SQL::ODBC
#endif
#if USE_POCO_DATAODBC

View File

@ -2,9 +2,9 @@
#if USE_POCO_SQLODBC || USE_POCO_DATAODBC
#if USE_POCO_SQLODBC
#include <Poco/SQL/ODBC/ODBCException.h>
#include <Poco/SQL/ODBC/SessionImpl.h>
#include <Poco/SQL/ODBC/Utility.h>
#include <Poco/SQL/ODBC/ODBCException.h> // Y_IGNORE
#include <Poco/SQL/ODBC/SessionImpl.h> // Y_IGNORE
#include <Poco/SQL/ODBC/Utility.h> // Y_IGNORE
#define POCO_SQL_ODBC_CLASS Poco::SQL::ODBC
#endif
#if USE_POCO_DATAODBC

View File

@ -8,7 +8,7 @@
#if USE_POCO_SQLODBC || USE_POCO_DATAODBC
#if USE_POCO_SQLODBC
#include <Poco/SQL/ODBC/Utility.h>
#include <Poco/SQL/ODBC/Utility.h> // Y_IGNORE
#endif
#if USE_POCO_DATAODBC
#include <Poco/Data/ODBC/Utility.h>

View File

@ -19,6 +19,6 @@ list(REMOVE_ITEM clickhouse_aggregate_functions_headers
FactoryHelpers.h
)
add_library(clickhouse_aggregate_functions ${clickhouse_aggregate_functions_sources})
add_library(clickhouse_aggregate_functions ${LINK_MODE} ${clickhouse_aggregate_functions_sources})
target_link_libraries(clickhouse_aggregate_functions dbms)
target_include_directories (clickhouse_aggregate_functions BEFORE PRIVATE ${COMMON_INCLUDE_DIR})

View File

@ -73,7 +73,7 @@ DataTypePtr FieldToDataType::operator() (const DecimalField<Decimal64> & x) cons
DataTypePtr FieldToDataType::operator() (const DecimalField<Decimal128> & x) const
{
using Type = DataTypeDecimal<Decimal64>;
using Type = DataTypeDecimal<Decimal128>;
return std::make_shared<Type>(Type::maxPrecision(), x.getScale());
}

View File

@ -50,7 +50,7 @@ add_headers_and_sources(clickhouse_functions ${FUNCTIONS_GENERATED_DIR})
list(REMOVE_ITEM clickhouse_functions_sources IFunction.cpp FunctionFactory.cpp FunctionHelpers.cpp)
list(REMOVE_ITEM clickhouse_functions_headers IFunction.h FunctionFactory.h FunctionHelpers.h)
add_library(clickhouse_functions ${clickhouse_functions_sources})
add_library(clickhouse_functions ${LINK_MODE} ${clickhouse_functions_sources})
target_link_libraries(clickhouse_functions PUBLIC dbms PRIVATE ${CONSISTENT_HASHING_LIBRARY} consistent-hashing-sumbur ${FARMHASH_LIBRARIES} ${METROHASH_LIBRARIES} murmurhash)

View File

@ -0,0 +1,646 @@
#include <Functions/FunctionFactory.h>
#include <Functions/FunctionsMiscellaneous.h>
#include <AggregateFunctions/AggregateFunctionFactory.h>
#include <DataTypes/DataTypeSet.h>
#include <DataTypes/DataTypesNumber.h>
#include <DataTypes/DataTypeFunction.h>
#include <DataTypes/DataTypeTuple.h>
#include <DataTypes/DataTypeLowCardinality.h>
#include <DataTypes/FieldToDataType.h>
#include <DataStreams/LazyBlockInputStream.h>
#include <Columns/ColumnSet.h>
#include <Columns/ColumnConst.h>
#include <Columns/ColumnsNumber.h>
#include <Storages/StorageSet.h>
#include <Parsers/ASTFunction.h>
#include <Common/typeid_cast.h>
#include <Parsers/DumpASTNode.h>
#include <Parsers/ASTIdentifier.h>
#include <Parsers/ASTLiteral.h>
#include <Parsers/ASTSelectQuery.h>
#include <Parsers/ASTSubquery.h>
#include <Parsers/ASTTablesInSelectQuery.h>
#include <Interpreters/ProjectionManipulation.h>
#include <Interpreters/ExpressionActions.h>
#include <Interpreters/QueryNormalizer.h>
#include <Interpreters/ActionsVisitor.h>
#include <Interpreters/InterpreterSelectWithUnionQuery.h>
#include <Interpreters/Set.h>
#include <Interpreters/evaluateConstantExpression.h>
#include <Interpreters/convertFieldToType.h>
#include <Interpreters/interpretSubquery.h>
namespace DB
{
namespace ErrorCodes
{
extern const int UNKNOWN_IDENTIFIER;
extern const int NOT_AN_AGGREGATE;
extern const int UNEXPECTED_EXPRESSION;
extern const int TYPE_MISMATCH;
extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH;
extern const int ILLEGAL_TYPE_OF_ARGUMENT;
}
/// defined in ExpressionAnalyzer.cpp
NamesAndTypesList::iterator findColumn(const String & name, NamesAndTypesList & cols);
void makeExplicitSet(const ASTFunction * node, const Block & sample_block, bool create_ordered_set,
const Context & context, const SizeLimits & size_limits, PreparedSets & prepared_sets)
{
const IAST & args = *node->arguments;
if (args.children.size() != 2)
throw Exception("Wrong number of arguments passed to function in", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH);
const ASTPtr & left_arg = args.children.at(0);
const ASTPtr & right_arg = args.children.at(1);
auto getTupleTypeFromAst = [&context](const ASTPtr & tuple_ast) -> DataTypePtr
{
auto ast_function = typeid_cast<const ASTFunction *>(tuple_ast.get());
if (ast_function && ast_function->name == "tuple" && !ast_function->arguments->children.empty())
{
/// Won't parse all values of outer tuple.
auto element = ast_function->arguments->children.at(0);
std::pair<Field, DataTypePtr> value_raw = evaluateConstantExpression(element, context);
return std::make_shared<DataTypeTuple>(DataTypes({value_raw.second}));
}
return evaluateConstantExpression(tuple_ast, context).second;
};
const DataTypePtr & left_arg_type = sample_block.getByName(left_arg->getColumnName()).type;
const DataTypePtr & right_arg_type = getTupleTypeFromAst(right_arg);
std::function<size_t(const DataTypePtr &)> getTupleDepth;
getTupleDepth = [&getTupleDepth](const DataTypePtr & type) -> size_t
{
if (auto tuple_type = typeid_cast<const DataTypeTuple *>(type.get()))
return 1 + (tuple_type->getElements().empty() ? 0 : getTupleDepth(tuple_type->getElements().at(0)));
return 0;
};
size_t left_tuple_depth = getTupleDepth(left_arg_type);
size_t right_tuple_depth = getTupleDepth(right_arg_type);
DataTypes set_element_types = {left_arg_type};
auto left_tuple_type = typeid_cast<const DataTypeTuple *>(left_arg_type.get());
if (left_tuple_type && left_tuple_type->getElements().size() != 1)
set_element_types = left_tuple_type->getElements();
for (auto & element_type : set_element_types)
if (const auto * low_cardinality_type = typeid_cast<const DataTypeLowCardinality *>(element_type.get()))
element_type = low_cardinality_type->getDictionaryType();
ASTPtr elements_ast = nullptr;
/// 1 in 1; (1, 2) in (1, 2); identity(tuple(tuple(tuple(1)))) in tuple(tuple(tuple(1))); etc.
if (left_tuple_depth == right_tuple_depth)
{
ASTPtr exp_list = std::make_shared<ASTExpressionList>();
exp_list->children.push_back(right_arg);
elements_ast = exp_list;
}
/// 1 in (1, 2); (1, 2) in ((1, 2), (3, 4)); etc.
else if (left_tuple_depth + 1 == right_tuple_depth)
{
ASTFunction * set_func = typeid_cast<ASTFunction *>(right_arg.get());
if (!set_func || set_func->name != "tuple")
throw Exception("Incorrect type of 2nd argument for function " + node->name
+ ". Must be subquery or set of elements with type " + left_arg_type->getName() + ".",
ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT);
elements_ast = set_func->arguments;
}
else
throw Exception("Invalid types for IN function: "
+ left_arg_type->getName() + " and " + right_arg_type->getName() + ".",
ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT);
SetPtr set = std::make_shared<Set>(size_limits, create_ordered_set);
set->createFromAST(set_element_types, elements_ast, context);
prepared_sets[right_arg->range] = std::move(set);
}
static String getUniqueName(const Block & block, const String & prefix)
{
int i = 1;
while (block.has(prefix + toString(i)))
++i;
return prefix + toString(i);
}
ScopeStack::ScopeStack(const ExpressionActionsPtr & actions, const Context & context_)
: context(context_)
{
stack.emplace_back();
stack.back().actions = actions;
const Block & sample_block = actions->getSampleBlock();
for (size_t i = 0, size = sample_block.columns(); i < size; ++i)
stack.back().new_columns.insert(sample_block.getByPosition(i).name);
}
void ScopeStack::pushLevel(const NamesAndTypesList & input_columns)
{
stack.emplace_back();
Level & prev = stack[stack.size() - 2];
ColumnsWithTypeAndName all_columns;
NameSet new_names;
for (NamesAndTypesList::const_iterator it = input_columns.begin(); it != input_columns.end(); ++it)
{
all_columns.emplace_back(nullptr, it->type, it->name);
new_names.insert(it->name);
stack.back().new_columns.insert(it->name);
}
const Block & prev_sample_block = prev.actions->getSampleBlock();
for (size_t i = 0, size = prev_sample_block.columns(); i < size; ++i)
{
const ColumnWithTypeAndName & col = prev_sample_block.getByPosition(i);
if (!new_names.count(col.name))
all_columns.push_back(col);
}
stack.back().actions = std::make_shared<ExpressionActions>(all_columns, context);
}
size_t ScopeStack::getColumnLevel(const std::string & name)
{
for (int i = static_cast<int>(stack.size()) - 1; i >= 0; --i)
if (stack[i].new_columns.count(name))
return i;
throw Exception("Unknown identifier: " + name, ErrorCodes::UNKNOWN_IDENTIFIER);
}
void ScopeStack::addAction(const ExpressionAction & action)
{
size_t level = 0;
Names required = action.getNeededColumns();
for (size_t i = 0; i < required.size(); ++i)
level = std::max(level, getColumnLevel(required[i]));
Names added;
stack[level].actions->add(action, added);
stack[level].new_columns.insert(added.begin(), added.end());
for (size_t i = 0; i < added.size(); ++i)
{
const ColumnWithTypeAndName & col = stack[level].actions->getSampleBlock().getByName(added[i]);
for (size_t j = level + 1; j < stack.size(); ++j)
stack[j].actions->addInput(col);
}
}
ExpressionActionsPtr ScopeStack::popLevel()
{
ExpressionActionsPtr res = stack.back().actions;
stack.pop_back();
return res;
}
const Block & ScopeStack::getSampleBlock() const
{
return stack.back().actions->getSampleBlock();
}
ActionsVisitor::ActionsVisitor(
const Context & context_, SizeLimits set_size_limit_, bool is_conditional_tree, size_t subquery_depth_,
const NamesAndTypesList & source_columns_, const ExpressionActionsPtr & actions,
PreparedSets & prepared_sets_, SubqueriesForSets & subqueries_for_sets_,
bool no_subqueries_, bool only_consts_, bool no_storage_or_local_, std::ostream * ostr_)
: context(context_),
set_size_limit(set_size_limit_),
subquery_depth(subquery_depth_),
source_columns(source_columns_),
prepared_sets(prepared_sets_),
subqueries_for_sets(subqueries_for_sets_),
no_subqueries(no_subqueries_),
only_consts(only_consts_),
no_storage_or_local(no_storage_or_local_),
visit_depth(0),
ostr(ostr_),
actions_stack(actions, context)
{
if (is_conditional_tree)
projection_manipulator = std::make_shared<ConditionalTree>(actions_stack, context);
else
projection_manipulator = std::make_shared<DefaultProjectionManipulator>(actions_stack);
}
void ActionsVisitor::visit(const ASTPtr & ast)
{
DumpASTNode dump(*ast, ostr, visit_depth, "getActions");
String ast_column_name;
auto getColumnName = [&ast, &ast_column_name]()
{
if (ast_column_name.empty())
ast_column_name = ast->getColumnName();
return ast_column_name;
};
/// If the result of the calculation already exists in the block.
if ((typeid_cast<ASTFunction *>(ast.get()) || typeid_cast<ASTLiteral *>(ast.get()))
&& projection_manipulator->tryToGetFromUpperProjection(getColumnName()))
return;
if (typeid_cast<ASTIdentifier *>(ast.get()))
{
if (!only_consts && !projection_manipulator->tryToGetFromUpperProjection(getColumnName()))
{
/// The requested column is not in the block.
/// If such a column exists in the table, then the user probably forgot to surround it with an aggregate function or add it to GROUP BY.
bool found = false;
for (const auto & column_name_type : source_columns)
if (column_name_type.name == getColumnName())
found = true;
if (found)
throw Exception("Column " + getColumnName() + " is not under aggregate function and not in GROUP BY.",
ErrorCodes::NOT_AN_AGGREGATE);
}
}
else if (ASTFunction * node = typeid_cast<ASTFunction *>(ast.get()))
{
if (node->name == "lambda")
throw Exception("Unexpected lambda expression", ErrorCodes::UNEXPECTED_EXPRESSION);
/// Function arrayJoin.
if (node->name == "arrayJoin")
{
if (node->arguments->children.size() != 1)
throw Exception("arrayJoin requires exactly 1 argument", ErrorCodes::TYPE_MISMATCH);
ASTPtr arg = node->arguments->children.at(0);
visit(arg);
if (!only_consts)
{
String result_name = projection_manipulator->getColumnName(getColumnName());
actions_stack.addAction(ExpressionAction::copyColumn(projection_manipulator->getColumnName(arg->getColumnName()), result_name));
NameSet joined_columns;
joined_columns.insert(result_name);
actions_stack.addAction(ExpressionAction::arrayJoin(joined_columns, false, context));
}
return;
}
if (functionIsInOrGlobalInOperator(node->name))
{
/// Let's find the type of the first argument (then getActionsImpl will be called again and will not affect anything).
visit(node->arguments->children.at(0));
if (!no_subqueries)
{
/// Transform tuple or subquery into a set.
makeSet(node, actions_stack.getSampleBlock());
}
else
{
if (!only_consts)
{
/// We are in the part of the tree that we are not going to compute. We only need to define the types.
/// Do not execute subqueries or create sets. We treat "IN" as the "ignore" function.
actions_stack.addAction(ExpressionAction::applyFunction(
FunctionFactory::instance().get("ignore", context),
{ node->arguments->children.at(0)->getColumnName() },
projection_manipulator->getColumnName(getColumnName()),
projection_manipulator->getProjectionSourceColumn()));
}
return;
}
}
/// A special function `indexHint`. Everything that is inside it is not calculated
/// (and is used only for index analysis, see KeyCondition).
if (node->name == "indexHint")
{
actions_stack.addAction(ExpressionAction::addColumn(ColumnWithTypeAndName(
ColumnConst::create(ColumnUInt8::create(1, 1), 1), std::make_shared<DataTypeUInt8>(),
projection_manipulator->getColumnName(getColumnName())), projection_manipulator->getProjectionSourceColumn(), false));
return;
}
if (AggregateFunctionFactory::instance().isAggregateFunctionName(node->name))
return;
/// Context object that we pass to function should live during query.
const Context & function_context = context.hasQueryContext()
? context.getQueryContext()
: context;
const FunctionBuilderPtr & function_builder = FunctionFactory::instance().get(node->name, function_context);
auto projection_action = getProjectionAction(node->name, actions_stack, projection_manipulator, getColumnName(), function_context);
Names argument_names;
DataTypes argument_types;
bool arguments_present = true;
/// If the function has an argument-lambda expression, you need to determine its type before the recursive call.
bool has_lambda_arguments = false;
for (size_t arg = 0; arg < node->arguments->children.size(); ++arg)
{
auto & child = node->arguments->children[arg];
auto child_column_name = child->getColumnName();
ASTFunction * lambda = typeid_cast<ASTFunction *>(child.get());
if (lambda && lambda->name == "lambda")
{
/// If the argument is a lambda expression, just remember its approximate type.
if (lambda->arguments->children.size() != 2)
throw Exception("lambda requires two arguments", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH);
ASTFunction * lambda_args_tuple = typeid_cast<ASTFunction *>(lambda->arguments->children.at(0).get());
if (!lambda_args_tuple || lambda_args_tuple->name != "tuple")
throw Exception("First argument of lambda must be a tuple", ErrorCodes::TYPE_MISMATCH);
has_lambda_arguments = true;
argument_types.emplace_back(std::make_shared<DataTypeFunction>(DataTypes(lambda_args_tuple->arguments->children.size())));
/// The actual name will be chosen in the loop below.
argument_names.emplace_back();
}
else if (prepared_sets.count(child->range) && functionIsInOrGlobalInOperator(node->name) && arg == 1)
{
ColumnWithTypeAndName column;
column.type = std::make_shared<DataTypeSet>();
const SetPtr & set = prepared_sets[child->range];
/// If the argument is a set given by an enumeration of values (so, the set was already built), give it a unique name,
/// so that sets with the same literal representation do not fuse together (they can have different types).
if (!set->empty())
column.name = getUniqueName(actions_stack.getSampleBlock(), "__set");
else
column.name = child_column_name;
column.name = projection_manipulator->getColumnName(column.name);
if (!actions_stack.getSampleBlock().has(column.name))
{
column.column = ColumnSet::create(1, set);
actions_stack.addAction(ExpressionAction::addColumn(column, projection_manipulator->getProjectionSourceColumn(), false));
}
argument_types.push_back(column.type);
argument_names.push_back(column.name);
}
else
{
/// If the argument is not a lambda expression, call it recursively and find out its type.
projection_action->preArgumentAction();
visit(child);
std::string name = projection_manipulator->getColumnName(child_column_name);
projection_action->postArgumentAction(child_column_name);
if (actions_stack.getSampleBlock().has(name))
{
argument_types.push_back(actions_stack.getSampleBlock().getByName(name).type);
argument_names.push_back(name);
}
else
{
if (only_consts)
{
arguments_present = false;
}
else
{
throw Exception("Unknown identifier: " + name + ", projection layer " + projection_manipulator->getProjectionExpression() , ErrorCodes::UNKNOWN_IDENTIFIER);
}
}
}
}
if (only_consts && !arguments_present)
return;
if (has_lambda_arguments && !only_consts)
{
function_builder->getLambdaArgumentTypes(argument_types);
/// Call recursively for lambda expressions.
for (size_t i = 0; i < node->arguments->children.size(); ++i)
{
ASTPtr child = node->arguments->children[i];
ASTFunction * lambda = typeid_cast<ASTFunction *>(child.get());
if (lambda && lambda->name == "lambda")
{
const DataTypeFunction * lambda_type = typeid_cast<const DataTypeFunction *>(argument_types[i].get());
ASTFunction * lambda_args_tuple = typeid_cast<ASTFunction *>(lambda->arguments->children.at(0).get());
ASTs lambda_arg_asts = lambda_args_tuple->arguments->children;
NamesAndTypesList lambda_arguments;
for (size_t j = 0; j < lambda_arg_asts.size(); ++j)
{
ASTIdentifier * identifier = typeid_cast<ASTIdentifier *>(lambda_arg_asts[j].get());
if (!identifier)
throw Exception("lambda argument declarations must be identifiers", ErrorCodes::TYPE_MISMATCH);
String arg_name = identifier->name;
lambda_arguments.emplace_back(arg_name, lambda_type->getArgumentTypes()[j]);
}
projection_action->preArgumentAction();
actions_stack.pushLevel(lambda_arguments);
visit(lambda->arguments->children.at(1));
ExpressionActionsPtr lambda_actions = actions_stack.popLevel();
String result_name = projection_manipulator->getColumnName(lambda->arguments->children.at(1)->getColumnName());
lambda_actions->finalize(Names(1, result_name));
DataTypePtr result_type = lambda_actions->getSampleBlock().getByName(result_name).type;
Names captured;
Names required = lambda_actions->getRequiredColumns();
for (const auto & required_arg : required)
if (findColumn(required_arg, lambda_arguments) == lambda_arguments.end())
captured.push_back(required_arg);
/// We cannot use `getColumnName()` as the name here,
/// because it does not uniquely identify the expression (the types of the arguments can vary).
String lambda_name = getUniqueName(actions_stack.getSampleBlock(), "__lambda");
auto function_capture = std::make_shared<FunctionCapture>(
lambda_actions, captured, lambda_arguments, result_type, result_name);
actions_stack.addAction(ExpressionAction::applyFunction(function_capture, captured, lambda_name,
projection_manipulator->getProjectionSourceColumn()));
argument_types[i] = std::make_shared<DataTypeFunction>(lambda_type->getArgumentTypes(), result_type);
argument_names[i] = lambda_name;
projection_action->postArgumentAction(lambda_name);
}
}
}
if (only_consts)
{
for (const auto & argument_name : argument_names)
{
if (!actions_stack.getSampleBlock().has(argument_name))
{
arguments_present = false;
break;
}
}
}
if (arguments_present)
{
projection_action->preCalculation();
if (projection_action->isCalculationRequired())
{
actions_stack.addAction(
ExpressionAction::applyFunction(function_builder,
argument_names,
projection_manipulator->getColumnName(getColumnName()),
projection_manipulator->getProjectionSourceColumn()));
}
}
}
else if (ASTLiteral * literal = typeid_cast<ASTLiteral *>(ast.get()))
{
DataTypePtr type = applyVisitor(FieldToDataType(), literal->value);
ColumnWithTypeAndName column;
column.column = type->createColumnConst(1, convertFieldToType(literal->value, *type));
column.type = type;
column.name = getColumnName();
actions_stack.addAction(ExpressionAction::addColumn(column, "", false));
projection_manipulator->tryToGetFromUpperProjection(column.name);
}
else
{
for (auto & child : ast->children)
{
/// Do not go to FROM, JOIN, UNION.
if (!typeid_cast<const ASTTableExpression *>(child.get())
&& !typeid_cast<const ASTSelectQuery *>(child.get()))
visit(child);
}
}
}
void ActionsVisitor::makeSet(const ASTFunction * node, const Block & sample_block)
{
/** You need to convert the right argument to a set.
* This can be a table name, a value, a value enumeration, or a subquery.
* The enumeration of values is parsed as a function `tuple`.
*/
const IAST & args = *node->arguments;
const ASTPtr & arg = args.children.at(1);
/// Already converted.
if (prepared_sets.count(arg->range))
return;
/// If it's a subquery or a table name for a SELECT.
const ASTIdentifier * identifier = typeid_cast<const ASTIdentifier *>(arg.get());
if (typeid_cast<const ASTSubquery *>(arg.get()) || identifier)
{
/// We get the stream of blocks for the subquery. Create Set and put it in place of the subquery.
String set_id = arg->getColumnName();
/// A special case is if the name of the table is specified on the right side of the IN statement,
/// and the table has the type Set (a previously prepared set).
if (identifier)
{
auto database_table = getDatabaseAndTableNameFromIdentifier(*identifier);
StoragePtr table = context.tryGetTable(database_table.first, database_table.second);
if (table)
{
StorageSet * storage_set = dynamic_cast<StorageSet *>(table.get());
if (storage_set)
{
prepared_sets[arg->range] = storage_set->getSet();
return;
}
}
}
SubqueryForSet & subquery_for_set = subqueries_for_sets[set_id];
/// If you already created a Set with the same subquery / table.
if (subquery_for_set.set)
{
prepared_sets[arg->range] = subquery_for_set.set;
return;
}
SetPtr set = std::make_shared<Set>(set_size_limit, false);
/** The following happens for GLOBAL INs:
* - in the addExternalStorage function, the IN (SELECT ...) subquery is replaced with IN _data1,
* in the subquery_for_set object, this subquery is set as source and the temporary table _data1 as the table.
* - this function sees the expression IN _data1.
*/
if (!subquery_for_set.source && no_storage_or_local)
{
auto interpreter = interpretSubquery(arg, context, subquery_depth, {});
subquery_for_set.source = std::make_shared<LazyBlockInputStream>(
interpreter->getSampleBlock(), [interpreter]() mutable { return interpreter->execute().in; });
/** Why is LazyBlockInputStream used?
*
* The fact is that when processing a query of the form
* SELECT ... FROM remote_test WHERE column GLOBAL IN (subquery),
* if the distributed remote_test table contains localhost as one of the servers,
* the query will be interpreted locally again (and not sent over TCP, as in the case of a remote server).
*
* The query execution pipeline will be:
* CreatingSets
* subquery execution, filling the temporary table with _data1 (1)
* CreatingSets
* reading from the table _data1, creating the set (2)
* read from the table subordinate to remote_test.
*
* (The second part of the pipeline under CreateSets is a reinterpretation of the query inside StorageDistributed,
* the query differs in that the database name and tables are replaced with subordinates, and the subquery is replaced with _data1.)
*
* But when creating the pipeline, when creating the source (2), it will be found that the _data1 table is empty
* (because the query has not started yet), and empty source will be returned as the source.
* And then, when the query is executed, an empty set will be created in step (2).
*
* Therefore, we make the initialization of step (2) lazy
* - so that it does not occur until step (1) is completed, on which the table will be populated.
*
* Note: this solution is not very good; it needs more thought.
*/
}
subquery_for_set.set = set;
prepared_sets[arg->range] = set;
}
else
{
/// An explicit enumeration of values in parentheses.
makeExplicitSet(node, sample_block, false, context, set_size_limit, prepared_sets);
}
}
}

View File

@ -0,0 +1,121 @@
#pragma once
#include <Parsers/StringRange.h>
namespace DB
{
class Context;
class ASTFunction;
struct ProjectionManipulatorBase;
class ExpressionActions;
using ExpressionActionsPtr = std::shared_ptr<ExpressionActions>;
class Set;
using SetPtr = std::shared_ptr<Set>;
/// Will compare sets by their position in the query string. It's possible because IAST::clone() doesn't change IAST::range.
/// This should be taken into account when we want to modify the part of the AST that contains sets.
using PreparedSets = std::unordered_map<StringRange, SetPtr, StringRangePointersHash, StringRangePointersEqualTo>;
class Join;
using JoinPtr = std::shared_ptr<Join>;
/// Information on what to do when executing a subquery in the [GLOBAL] IN/JOIN section.
struct SubqueryForSet
{
/// The source is obtained using the InterpreterSelectQuery subquery.
BlockInputStreamPtr source;
/// If set, build it from result.
SetPtr set;
JoinPtr join;
/// Apply these actions to the joined block.
ExpressionActionsPtr joined_block_actions;
/// Rename column from joined block from this list.
NamesWithAliases joined_block_aliases;
/// If set, put the result into the table.
/// This is a temporary table for transferring to remote servers for distributed query processing.
StoragePtr table;
};
/// ID of subquery -> what to do with it.
using SubqueriesForSets = std::unordered_map<String, SubqueryForSet>;
/// The case of an explicit enumeration of values.
void makeExplicitSet(const ASTFunction * node, const Block & sample_block, bool create_ordered_set,
const Context & context, const SizeLimits & limits, PreparedSets & prepared_sets);
/** For ActionsVisitor
* A stack of ExpressionActions corresponding to nested lambda expressions.
* The new action should be added to the highest possible level.
* For example, in the expression "select arrayMap(x -> x + column1 * column2, array1)"
* calculation of the product must be done outside the lambda expression (it does not depend on x),
* and the calculation of the sum is inside (depends on x).
*/
struct ScopeStack
{
struct Level
{
ExpressionActionsPtr actions;
NameSet new_columns;
};
using Levels = std::vector<Level>;
Levels stack;
const Context & context;
ScopeStack(const ExpressionActionsPtr & actions, const Context & context_);
void pushLevel(const NamesAndTypesList & input_columns);
size_t getColumnLevel(const std::string & name);
void addAction(const ExpressionAction & action);
ExpressionActionsPtr popLevel();
const Block & getSampleBlock() const;
};
/// Collects ExpressionAction objects from the AST. Returns PreparedSets and SubqueriesForSets too.
/// After the AST is visited, the source ExpressionActions should be updated with the popActionsLevel() method.
class ActionsVisitor
{
public:
ActionsVisitor(const Context & context_, SizeLimits set_size_limit_, bool is_conditional_tree, size_t subquery_depth_,
const NamesAndTypesList & source_columns_, const ExpressionActionsPtr & actions,
PreparedSets & prepared_sets_, SubqueriesForSets & subqueries_for_sets_,
bool no_subqueries_, bool only_consts_, bool no_storage_or_local_, std::ostream * ostr_ = nullptr);
void visit(const ASTPtr & ast);
ExpressionActionsPtr popActionsLevel() { return actions_stack.popLevel(); }
private:
const Context & context;
SizeLimits set_size_limit;
size_t subquery_depth;
const NamesAndTypesList & source_columns;
PreparedSets & prepared_sets;
SubqueriesForSets & subqueries_for_sets;
const bool no_subqueries;
const bool only_consts;
const bool no_storage_or_local;
mutable size_t visit_depth;
std::ostream * ostr;
ScopeStack actions_stack;
std::shared_ptr<ProjectionManipulatorBase> projection_manipulator;
void makeSet(const ASTFunction * node, const Block & sample_block);
};
}

View File

@ -0,0 +1,94 @@
#pragma once
namespace DB
{
/// Fills array_join_result_to_source: which array columns to replicate, and what to name them after replication.
class ArrayJoinedColumnsVisitor
{
public:
ArrayJoinedColumnsVisitor(NameToNameMap & array_join_name_to_alias_,
NameToNameMap & array_join_alias_to_name_,
NameToNameMap & array_join_result_to_source_)
: array_join_name_to_alias(array_join_name_to_alias_),
array_join_alias_to_name(array_join_alias_to_name_),
array_join_result_to_source(array_join_result_to_source_)
{}
void visit(ASTPtr & ast) const
{
if (!tryVisit<ASTTablesInSelectQuery>(ast) &&
!tryVisit<ASTIdentifier>(ast))
visitChildren(ast);
}
private:
NameToNameMap & array_join_name_to_alias;
NameToNameMap & array_join_alias_to_name;
NameToNameMap & array_join_result_to_source;
void visit(const ASTTablesInSelectQuery *, ASTPtr &) const
{}
void visit(const ASTIdentifier * node, ASTPtr &) const
{
if (node->general())
{
auto splitted = Nested::splitName(node->name); /// ParsedParams, Key1
if (array_join_alias_to_name.count(node->name))
{
/// ARRAY JOIN was written with an array column. Example: SELECT K1 FROM ... ARRAY JOIN ParsedParams.Key1 AS K1
array_join_result_to_source[node->name] = array_join_alias_to_name[node->name]; /// K1 -> ParsedParams.Key1
}
else if (array_join_alias_to_name.count(splitted.first) && !splitted.second.empty())
{
/// ARRAY JOIN was written with a nested table. Example: SELECT PP.KEY1 FROM ... ARRAY JOIN ParsedParams AS PP
array_join_result_to_source[node->name] /// PP.Key1 -> ParsedParams.Key1
= Nested::concatenateName(array_join_alias_to_name[splitted.first], splitted.second);
}
else if (array_join_name_to_alias.count(node->name))
{
/** Example: SELECT ParsedParams.Key1 FROM ... ARRAY JOIN ParsedParams.Key1 AS PP.Key1.
* That is, the query uses the original array, replicated by itself.
*/
array_join_result_to_source[ /// PP.Key1 -> ParsedParams.Key1
array_join_name_to_alias[node->name]] = node->name;
}
else if (array_join_name_to_alias.count(splitted.first) && !splitted.second.empty())
{
/** Example: SELECT ParsedParams.Key1 FROM ... ARRAY JOIN ParsedParams AS PP.
*/
array_join_result_to_source[ /// PP.Key1 -> ParsedParams.Key1
Nested::concatenateName(array_join_name_to_alias[splitted.first], splitted.second)] = node->name;
}
}
}
void visit(const ASTSubquery *, ASTPtr &) const
{}
void visit(const ASTSelectQuery *, ASTPtr &) const
{}
void visitChildren(ASTPtr & ast) const
{
for (auto & child : ast->children)
if (!tryVisit<ASTSubquery>(child) &&
!tryVisit<ASTSelectQuery>(child))
visit(child);
}
template <typename T>
bool tryVisit(ASTPtr & ast) const
{
if (const T * t = typeid_cast<const T *>(ast.get()))
{
visit(t, ast);
return true;
}
return false;
}
};
}

View File

@ -156,7 +156,12 @@ void Clusters::updateClusters(Poco::Util::AbstractConfiguration & config, const
impl.clear();
for (const auto & key : config_keys)
{
if (key.find('.') != String::npos)
throw Exception("Cluster names with dots are not supported: `" + key + "`", ErrorCodes::SYNTAX_ERROR);
impl.emplace(key, std::make_shared<Cluster>(config, settings, config_name + "." + key));
}
}
Clusters::Impl Clusters::getContainer() const

View File

@ -35,7 +35,7 @@ static ASTPtr addTypeConversion(std::unique_ptr<ASTLiteral> && ast, const String
return res;
}
void ExecuteScalarSubqueriesVisitor::visit(ASTSubquery * subquery, ASTPtr & ast, const DumpASTNode &) const
void ExecuteScalarSubqueriesVisitor::visit(const ASTSubquery * subquery, ASTPtr & ast, const DumpASTNode &) const
{
Context subquery_context = context;
Settings subquery_settings = context.getSettings();
@ -101,12 +101,12 @@ void ExecuteScalarSubqueriesVisitor::visit(ASTSubquery * subquery, ASTPtr & ast,
}
void ExecuteScalarSubqueriesVisitor::visit(ASTTableExpression *, ASTPtr &, const DumpASTNode &) const
void ExecuteScalarSubqueriesVisitor::visit(const ASTTableExpression *, ASTPtr &, const DumpASTNode &) const
{
/// Don't descend into subqueries in FROM section.
}
void ExecuteScalarSubqueriesVisitor::visit(ASTFunction * func, ASTPtr & ast, const DumpASTNode &) const
void ExecuteScalarSubqueriesVisitor::visit(const ASTFunction * func, ASTPtr & ast, const DumpASTNode &) const
{
/// Don't descend into subqueries in arguments of IN operator.
/// But if an argument is not a subquery, there may be scalar subqueries deeper inside it, and we need to descend into them.

View File

@ -54,9 +54,9 @@ private:
mutable size_t visit_depth;
std::ostream * ostr;
void visit(ASTSubquery * subquery, ASTPtr & ast, const DumpASTNode & dump) const;
void visit(ASTFunction * func, ASTPtr & ast, const DumpASTNode &) const;
void visit(ASTTableExpression *, ASTPtr &, const DumpASTNode &) const;
void visit(const ASTSubquery * subquery, ASTPtr & ast, const DumpASTNode & dump) const;
void visit(const ASTFunction * func, ASTPtr & ast, const DumpASTNode &) const;
void visit(const ASTTableExpression *, ASTPtr &, const DumpASTNode &) const;
void visitChildren(ASTPtr & ast) const
{
@ -67,7 +67,7 @@ private:
template <typename T>
bool tryVisit(ASTPtr & ast, const DumpASTNode & dump) const
{
if (T * t = typeid_cast<T *>(ast.get()))
if (const T * t = typeid_cast<const T *>(ast.get()))
{
visit(t, ast, dump);
return true;

File diff suppressed because it is too large.

View File

@ -2,10 +2,10 @@
#include <Interpreters/AggregateDescription.h>
#include <Interpreters/Settings.h>
#include <Core/Block.h>
#include <Interpreters/ExpressionActions.h>
#include <Interpreters/ProjectionManipulation.h>
#include <Parsers/StringRange.h>
#include <Interpreters/ActionsVisitor.h>
#include <Core/Block.h>
#include <Parsers/ASTTablesInSelectQuery.h>
namespace DB
@ -16,18 +16,9 @@ class Context;
class ExpressionActions;
struct ExpressionActionsChain;
class Join;
using JoinPtr = std::shared_ptr<Join>;
class IAST;
using ASTPtr = std::shared_ptr<IAST>;
class Set;
using SetPtr = std::shared_ptr<Set>;
/// Will compare sets by their position in the query string. It's possible because IAST::clone() doesn't change IAST::range.
/// This should be taken into account when we want to modify the part of the AST that contains sets.
using PreparedSets = std::unordered_map<StringRange, SetPtr, StringRangePointersHash, StringRangePointersEqualTo>;
class IBlockInputStream;
using BlockInputStreamPtr = std::shared_ptr<IBlockInputStream>;
@ -39,58 +30,12 @@ class ASTFunction;
class ASTExpressionList;
class ASTSelectQuery;
struct ProjectionManipulatorBase;
using ProjectionManipulatorPtr = std::shared_ptr<ProjectionManipulatorBase>;
/** Information on what to do when executing a subquery in the [GLOBAL] IN/JOIN section.
*/
struct SubqueryForSet
inline SizeLimits getSetSizeLimits(const Settings & settings)
{
/// The source is obtained using the InterpreterSelectQuery subquery.
BlockInputStreamPtr source;
return SizeLimits(settings.max_rows_in_set, settings.max_bytes_in_set, settings.set_overflow_mode);
}
/// If set, build it from result.
SetPtr set;
JoinPtr join;
/// Apply these actions to the joined block.
ExpressionActionsPtr joined_block_actions;
/// Rename column from joined block from this list.
NamesWithAliases joined_block_aliases;
/// If set, put the result into the table.
/// This is a temporary table for transferring to remote servers for distributed query processing.
StoragePtr table;
};
/// ID of subquery -> what to do with it.
using SubqueriesForSets = std::unordered_map<String, SubqueryForSet>;
struct ScopeStack
{
struct Level
{
ExpressionActionsPtr actions;
NameSet new_columns;
};
using Levels = std::vector<Level>;
Levels stack;
const Context & context;
ScopeStack(const ExpressionActionsPtr & actions, const Context & context_);
void pushLevel(const NamesAndTypesList & input_columns);
size_t getColumnLevel(const std::string & name);
void addAction(const ExpressionAction & action);
ExpressionActionsPtr popLevel();
const Block & getSampleBlock() const;
};
/** Transforms an expression from a syntax tree into a sequence of actions to execute it.
*
@ -303,7 +248,6 @@ private:
/// All new temporary tables obtained by performing the GLOBAL IN/JOIN subqueries.
Tables external_tables;
size_t external_table_id = 1;
/// Predicate optimizer overrides the subqueries
bool rewrite_subqueries = false;
@ -341,21 +285,14 @@ private:
void optimizeIfWithConstantConditionImpl(ASTPtr & current_ast);
bool tryExtractConstValueFromCondition(const ASTPtr & condition, bool & value) const;
void makeSet(const ASTFunction * node, const Block & sample_block);
/// Adds a list of ALIAS columns from the table.
void addAliasColumns();
/// Replacing scalar subqueries with constant values.
void executeScalarSubqueries();
void executeScalarSubqueriesImpl(ASTPtr & ast);
/// Find global subqueries in the GLOBAL IN/JOIN sections. Fills in external_tables.
void initGlobalSubqueriesAndExternalTables();
void initGlobalSubqueries(ASTPtr & ast);
/// Finds in the query the usage of external tables (as table identifiers). Fills in external_tables.
void findExternalTables(ASTPtr & ast);
/** Initialize InterpreterSelectQuery for a subquery in the GLOBAL IN/JOIN section,
* create a temporary table of type Memory and store it in the external_tables dictionary.
@ -363,20 +300,16 @@ private:
void addExternalStorage(ASTPtr & subquery_or_table_name);
void getArrayJoinedColumns();
void getArrayJoinedColumnsImpl(const ASTPtr & ast);
void addMultipleArrayJoinAction(ExpressionActionsPtr & actions) const;
void addJoinAction(ExpressionActionsPtr & actions, bool only_types) const;
bool isThereArrayJoin(const ASTPtr & ast);
void getActionsImpl(const ASTPtr & ast, bool no_subqueries, bool only_consts, ScopeStack & actions_stack,
ProjectionManipulatorPtr projection_manipulator);
/// If ast is ASTSelectQuery with JOIN, add actions for JOIN key columns.
void getActionsFromJoinKeys(const ASTTableJoin & table_join, bool no_subqueries, bool only_consts, ExpressionActionsPtr & actions);
void getActionsFromJoinKeys(const ASTTableJoin & table_join, bool no_subqueries, ExpressionActionsPtr & actions);
void getRootActions(const ASTPtr & ast, bool no_subqueries, bool only_consts, ExpressionActionsPtr & actions);
void getRootActions(const ASTPtr & ast, bool no_subqueries, ExpressionActionsPtr & actions, bool only_consts = false);
void getActionsBeforeAggregation(const ASTPtr & ast, ExpressionActionsPtr & actions, bool no_subqueries);
@ -389,26 +322,12 @@ private:
void getAggregates(const ASTPtr & ast, ExpressionActionsPtr & actions);
void assertNoAggregates(const ASTPtr & ast, const char * description);
/** Get a set of necessary columns to read from the table.
* In this case, the columns specified in ignored_names are considered unnecessary. And the ignored_names parameter can be modified.
* The set of columns available_joined_columns are the columns available from JOIN, they are not needed for reading from the main table.
* Put in required_joined_columns the set of columns available from JOIN and needed.
*/
void getRequiredSourceColumnsImpl(const ASTPtr & ast,
const NameSet & available_columns, NameSet & required_source_columns, NameSet & ignored_names,
const NameSet & available_joined_columns, NameSet & required_joined_columns);
/// columns - the columns that are present before the transformations begin.
void initChain(ExpressionActionsChain & chain, const NamesAndTypesList & columns) const;
void assertSelect() const;
void assertAggregation() const;
/** Create Set from an explicit enumeration of values in the query.
* If create_ordered_set = true - create a data structure suitable for using the index.
*/
void makeExplicitSet(const ASTFunction * node, const Block & sample_block, bool create_ordered_set);
/**
* Create Set from a subquery or a table expression in the query. The created set is suitable for using the index.
* The set will not be created if its size hits the limit.
@ -427,6 +346,8 @@ private:
* This is the case when we have DISTINCT or arrayJoin: we require more columns in SELECT even if we need less columns in result.
*/
void removeUnneededColumnsFromSelectClause();
bool isRemoteStorage() const;
};
}

View File

@ -0,0 +1,47 @@
#pragma once
namespace DB
{
/// Finds in the query the usage of external tables (as table identifiers). Fills in external_tables.
class ExternalTablesVisitor
{
public:
ExternalTablesVisitor(const Context & context_, Tables & tables)
: context(context_),
external_tables(tables)
{}
void visit(ASTPtr & ast) const
{
/// Traverse from the bottom. Intentionally go into subqueries.
for (auto & child : ast->children)
visit(child);
tryVisit<ASTIdentifier>(ast);
}
private:
const Context & context;
Tables & external_tables;
void visit(const ASTIdentifier * node, ASTPtr &) const
{
if (node->special())
if (StoragePtr external_storage = context.tryGetExternalTable(node->name))
external_tables[node->name] = external_storage;
}
template <typename T>
bool tryVisit(ASTPtr & ast) const
{
if (const T * t = typeid_cast<const T *>(ast.get()))
{
visit(t, ast);
return true;
}
return false;
}
};
}

View File

@ -0,0 +1,167 @@
#pragma once
namespace DB
{
/// Converts GLOBAL subqueries to external tables and puts them into the external_tables dictionary: name -> StoragePtr.
class GlobalSubqueriesVisitor
{
public:
GlobalSubqueriesVisitor(const Context & context_, size_t subquery_depth_, bool is_remote_,
Tables & tables, SubqueriesForSets & subqueries_for_sets_, bool & has_global_subqueries_)
: context(context_),
subquery_depth(subquery_depth_),
is_remote(is_remote_),
external_table_id(1),
external_tables(tables),
subqueries_for_sets(subqueries_for_sets_),
has_global_subqueries(has_global_subqueries_)
{}
void visit(ASTPtr & ast) const
{
/// Recursive calls. We do not go into subqueries.
for (auto & child : ast->children)
if (!typeid_cast<ASTSelectQuery *>(child.get()))
visit(child);
/// Bottom-up actions.
if (tryVisit<ASTFunction>(ast) ||
tryVisit<ASTTablesInSelectQueryElement>(ast))
{}
}
private:
const Context & context;
size_t subquery_depth;
bool is_remote;
mutable size_t external_table_id = 1;
Tables & external_tables;
SubqueriesForSets & subqueries_for_sets;
bool & has_global_subqueries;
/// GLOBAL IN
void visit(ASTFunction * func, ASTPtr &) const
{
if (func->name == "globalIn" || func->name == "globalNotIn")
{
addExternalStorage(func->arguments->children.at(1));
has_global_subqueries = true;
}
}
/// GLOBAL JOIN
void visit(ASTTablesInSelectQueryElement * table_elem, ASTPtr &) const
{
if (table_elem->table_join
&& static_cast<const ASTTableJoin &>(*table_elem->table_join).locality == ASTTableJoin::Locality::Global)
{
addExternalStorage(table_elem->table_expression);
has_global_subqueries = true;
}
}
template <typename T>
bool tryVisit(ASTPtr & ast) const
{
if (T * t = typeid_cast<T *>(ast.get()))
{
visit(t, ast);
return true;
}
return false;
}
void addExternalStorage(ASTPtr & subquery_or_table_name_or_table_expression) const
{
/// With nondistributed queries, creating temporary tables does not make sense.
if (!is_remote)
return;
ASTPtr subquery;
ASTPtr table_name;
ASTPtr subquery_or_table_name;
if (typeid_cast<const ASTIdentifier *>(subquery_or_table_name_or_table_expression.get()))
{
table_name = subquery_or_table_name_or_table_expression;
subquery_or_table_name = table_name;
}
else if (auto ast_table_expr = typeid_cast<const ASTTableExpression *>(subquery_or_table_name_or_table_expression.get()))
{
if (ast_table_expr->database_and_table_name)
{
table_name = ast_table_expr->database_and_table_name;
subquery_or_table_name = table_name;
}
else if (ast_table_expr->subquery)
{
subquery = ast_table_expr->subquery;
subquery_or_table_name = subquery;
}
}
else if (typeid_cast<const ASTSubquery *>(subquery_or_table_name_or_table_expression.get()))
{
subquery = subquery_or_table_name_or_table_expression;
subquery_or_table_name = subquery;
}
if (!subquery_or_table_name)
throw Exception("Logical error: unknown AST element passed to ExpressionAnalyzer::addExternalStorage method",
ErrorCodes::LOGICAL_ERROR);
if (table_name)
{
/// If this is already an external table, you do not need to add anything. Just remember its presence.
if (external_tables.end() != external_tables.find(static_cast<const ASTIdentifier &>(*table_name).name))
return;
}
/// Generate the name for the external table.
String external_table_name = "_data" + toString(external_table_id);
while (external_tables.count(external_table_name))
{
++external_table_id;
external_table_name = "_data" + toString(external_table_id);
}
auto interpreter = interpretSubquery(subquery_or_table_name, context, subquery_depth, {});
Block sample = interpreter->getSampleBlock();
NamesAndTypesList columns = sample.getNamesAndTypesList();
StoragePtr external_storage = StorageMemory::create(external_table_name, ColumnsDescription{columns});
external_storage->startup();
/** We replace the subquery with the name of the temporary table.
 * It is in this form that the query will be sent to the remote server.
 * The temporary table itself will be shipped to the remote server, where,
 * instead of running a subquery, it only needs to be read.
 */
auto database_and_table_name = ASTIdentifier::createSpecial(external_table_name);
if (auto ast_table_expr = typeid_cast<ASTTableExpression *>(subquery_or_table_name_or_table_expression.get()))
{
ast_table_expr->subquery.reset();
ast_table_expr->database_and_table_name = database_and_table_name;
ast_table_expr->children.clear();
ast_table_expr->children.emplace_back(database_and_table_name);
}
else
subquery_or_table_name_or_table_expression = database_and_table_name;
external_tables[external_table_name] = external_storage;
subqueries_for_sets[external_table_name].source = interpreter->execute().in;
subqueries_for_sets[external_table_name].table = external_storage;
/** NOTE If the query was written as IN tmp_table - an existing temporary (but not external) table -
 * then a new temporary table (for example, _data1) will be created,
 * and the data will then be copied into it.
 * Maybe this can be avoided.
 */
}
};
}

View File

@ -5,7 +5,7 @@
#include <DataTypes/DataTypesNumber.h>
#include <Functions/FunctionFactory.h>
#include <Interpreters/ExpressionActions.h>
#include <Interpreters/ExpressionAnalyzer.h>
#include <Interpreters/ActionsVisitor.h>
#include <Interpreters/ProjectionManipulation.h>
#include <Common/Exception.h>
#include <Common/typeid_cast.h>

View File

@ -1,13 +1,18 @@
#pragma once
#include <string>
#include <vector>
#include <memory>
#include <unordered_map>
namespace DB
{
class ExpressionAnalyzer;
class ExpressionAnalyzer;
class Context;
struct ScopeStack;
namespace ErrorCodes
{
extern const int CONDITIONAL_TREE_PARENT_NOT_FOUND;

View File

@ -67,18 +67,18 @@ void QueryAliasesVisitor::getNodeAlias(const ASTPtr & ast, Aliases & aliases, co
}
else if (auto subquery = typeid_cast<ASTSubquery *>(ast.get()))
{
/// Set unique aliases for all subqueries. This is needed, because content of subqueries could change after recursive analysis,
/// and auto-generated column names could become incorrect.
/// Set unique aliases for all subqueries. This is needed, because:
/// 1) content of subqueries could change after recursive analysis, and auto-generated column names could become incorrect
/// 2) result of different scalar subqueries can be cached inside expressions compilation cache and must have different names
if (subquery->alias.empty())
{
size_t subquery_index = 1;
static std::atomic_uint64_t subquery_index = 1;
while (true)
{
alias = "_subquery" + toString(subquery_index);
if (!aliases.count("_subquery" + toString(subquery_index)))
alias = "_subquery" + std::to_string(subquery_index++);
if (!aliases.count(alias))
break;
++subquery_index;
}
subquery->setAlias(alias);

View File

@ -0,0 +1,136 @@
#pragma once
namespace DB
{
namespace ErrorCodes
{
extern const int TYPE_MISMATCH;
}
/** Get the set of columns that must be read from the table.
 * Columns listed in ignored_names are considered unnecessary (and the ignored_names parameter may be modified).
 * available_joined_columns are the columns available from JOIN; they do not need to be read from the main table.
 * Columns that are available from JOIN and actually needed are put into required_joined_columns.
 */
class RequiredSourceColumnsVisitor
{
public:
RequiredSourceColumnsVisitor(const NameSet & available_columns_, NameSet & required_source_columns_, NameSet & ignored_names_,
const NameSet & available_joined_columns_, NameSet & required_joined_columns_)
: available_columns(available_columns_),
required_source_columns(required_source_columns_),
ignored_names(ignored_names_),
available_joined_columns(available_joined_columns_),
required_joined_columns(required_joined_columns_)
{}
/** Find all the identifiers in the query.
* We will use depth first search in AST.
* In this case
* - for lambda functions we will not take formal parameters;
* - do not go into subqueries (they have their own identifiers);
* - there is an exception for the ARRAY JOIN clause (its identifiers are handled slightly differently);
* - we put identifiers available from JOIN in required_joined_columns.
*/
void visit(const ASTPtr & ast) const
{
if (!tryVisit<ASTIdentifier>(ast) &&
!tryVisit<ASTFunction>(ast))
visitChildren(ast);
}
private:
const NameSet & available_columns;
NameSet & required_source_columns;
NameSet & ignored_names;
const NameSet & available_joined_columns;
NameSet & required_joined_columns;
void visit(const ASTIdentifier * node, const ASTPtr &) const
{
if (node->general()
&& !ignored_names.count(node->name)
&& !ignored_names.count(Nested::extractTableName(node->name)))
{
if (!available_joined_columns.count(node->name)
|| available_columns.count(node->name)) /// Read the column from the left table if it is present there.
required_source_columns.insert(node->name);
else
required_joined_columns.insert(node->name);
}
}
void visit(const ASTFunction * node, const ASTPtr & ast) const
{
if (node->name == "lambda")
{
if (node->arguments->children.size() != 2)
throw Exception("lambda requires two arguments", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH);
ASTFunction * lambda_args_tuple = typeid_cast<ASTFunction *>(node->arguments->children.at(0).get());
if (!lambda_args_tuple || lambda_args_tuple->name != "tuple")
throw Exception("First argument of lambda must be a tuple", ErrorCodes::TYPE_MISMATCH);
/// You do not need to add formal parameters of the lambda expression to required_source_columns.
Names added_ignored;
for (auto & child : lambda_args_tuple->arguments->children)
{
ASTIdentifier * identifier = typeid_cast<ASTIdentifier *>(child.get());
if (!identifier)
throw Exception("lambda argument declarations must be identifiers", ErrorCodes::TYPE_MISMATCH);
String & name = identifier->name;
if (!ignored_names.count(name))
{
ignored_names.insert(name);
added_ignored.push_back(name);
}
}
visit(node->arguments->children.at(1));
for (size_t i = 0; i < added_ignored.size(); ++i)
ignored_names.erase(added_ignored[i]);
return;
}
/// A special function `indexHint`. Everything that is inside it is not calculated
/// (and is used only for index analysis, see KeyCondition).
if (node->name == "indexHint")
return;
visitChildren(ast);
}
void visitChildren(const ASTPtr & ast) const
{
for (auto & child : ast->children)
{
/** We do not descend into the ARRAY JOIN section, because we need to look at the names of non-ARRAY-JOIN columns.
 * `collectUsedColumns` will send us there separately.
*/
if (!typeid_cast<const ASTSelectQuery *>(child.get())
&& !typeid_cast<const ASTArrayJoin *>(child.get())
&& !typeid_cast<const ASTTableExpression *>(child.get())
&& !typeid_cast<const ASTTableJoin *>(child.get()))
visit(child);
}
}
template <typename T>
bool tryVisit(const ASTPtr & ast) const
{
if (const T * t = typeid_cast<const T *>(ast.get()))
{
visit(t, ast);
return true;
}
return false;
}
};
}

View File

@ -0,0 +1,130 @@
#include <Common/typeid_cast.h>
#include <IO/WriteHelpers.h>
#include <Storages/IStorage.h>
#include <Parsers/ASTFunction.h>
#include <Parsers/ASTIdentifier.h>
#include <Parsers/ASTSelectQuery.h>
#include <Parsers/ASTSelectWithUnionQuery.h>
#include <Parsers/ASTSubquery.h>
#include <Interpreters/interpretSubquery.h>
#include <Interpreters/evaluateQualified.h>
namespace DB
{
std::shared_ptr<InterpreterSelectWithUnionQuery> interpretSubquery(
const ASTPtr & table_expression, const Context & context, size_t subquery_depth, const Names & required_source_columns)
{
/// Either a subquery or a table name. A table name is treated like the subquery `SELECT * FROM t`.
const ASTSubquery * subquery = typeid_cast<const ASTSubquery *>(table_expression.get());
const ASTFunction * function = typeid_cast<const ASTFunction *>(table_expression.get());
const ASTIdentifier * table = typeid_cast<const ASTIdentifier *>(table_expression.get());
if (!subquery && !table && !function)
throw Exception("Table expression is undefined, Method: ExpressionAnalyzer::interpretSubquery." , ErrorCodes::LOGICAL_ERROR);
/** A subquery in the IN / JOIN section does not have any restrictions on the maximum size of the result,
 * because its result is not the result of the entire query.
 * Instead, the constraints
 *   max_rows_in_set, max_bytes_in_set, set_overflow_mode,
 *   max_rows_in_join, max_bytes_in_join, join_overflow_mode
 * apply; they are checked separately (in the Set and Join objects).
 */
Context subquery_context = context;
Settings subquery_settings = context.getSettings();
subquery_settings.max_result_rows = 0;
subquery_settings.max_result_bytes = 0;
/// Computing `extremes` makes no sense and is not necessary (if you do it, the `extremes` of the subquery could be returned instead of those of the whole query).
subquery_settings.extremes = 0;
subquery_context.setSettings(subquery_settings);
ASTPtr query;
if (table || function)
{
/// create ASTSelectQuery for "SELECT * FROM table" as if written by hand
const auto select_with_union_query = std::make_shared<ASTSelectWithUnionQuery>();
query = select_with_union_query;
select_with_union_query->list_of_selects = std::make_shared<ASTExpressionList>();
const auto select_query = std::make_shared<ASTSelectQuery>();
select_with_union_query->list_of_selects->children.push_back(select_query);
const auto select_expression_list = std::make_shared<ASTExpressionList>();
select_query->select_expression_list = select_expression_list;
select_query->children.emplace_back(select_query->select_expression_list);
NamesAndTypesList columns;
/// get columns list for target table
if (function)
{
auto query_context = const_cast<Context *>(&context.getQueryContext());
const auto & storage = query_context->executeTableFunction(table_expression);
columns = storage->getColumns().ordinary;
select_query->addTableFunction(*const_cast<ASTPtr *>(&table_expression));
}
else
{
auto database_table = getDatabaseAndTableNameFromIdentifier(*table);
const auto & storage = context.getTable(database_table.first, database_table.second);
columns = storage->getColumns().ordinary;
select_query->replaceDatabaseAndTable(database_table.first, database_table.second);
}
select_expression_list->children.reserve(columns.size());
/// manually substitute column names in place of asterisk
for (const auto & column : columns)
select_expression_list->children.emplace_back(std::make_shared<ASTIdentifier>(column.name));
}
else
{
query = subquery->children.at(0);
/** Columns with the same name can be specified in a subquery. For example, SELECT x, x FROM t.
 * This is bad, because the result of such a query cannot be saved to a table, since a table cannot have columns with identical names.
 * Saving to a table is required for GLOBAL subqueries.
 *
 * To avoid this situation, we rename the duplicate columns.
 */
std::set<std::string> all_column_names;
std::set<std::string> assigned_column_names;
if (ASTSelectWithUnionQuery * select_with_union = typeid_cast<ASTSelectWithUnionQuery *>(query.get()))
{
if (ASTSelectQuery * select = typeid_cast<ASTSelectQuery *>(select_with_union->list_of_selects->children.at(0).get()))
{
for (auto & expr : select->select_expression_list->children)
all_column_names.insert(expr->getAliasOrColumnName());
for (auto & expr : select->select_expression_list->children)
{
auto name = expr->getAliasOrColumnName();
if (!assigned_column_names.insert(name).second)
{
size_t i = 1;
while (all_column_names.end() != all_column_names.find(name + "_" + toString(i)))
++i;
name = name + "_" + toString(i);
expr = expr->clone(); /// Cancels fuse of the same expressions in the tree.
expr->setAlias(name);
all_column_names.insert(name);
assigned_column_names.insert(name);
}
}
}
}
}
return std::make_shared<InterpreterSelectWithUnionQuery>(
query, subquery_context, required_source_columns, QueryProcessingStage::Complete, subquery_depth + 1);
}
}

View File

@ -0,0 +1,14 @@
#pragma once
#include <Parsers/IAST.h>
#include <Interpreters/InterpreterSelectWithUnionQuery.h>
namespace DB
{
class Context;
std::shared_ptr<InterpreterSelectWithUnionQuery> interpretSubquery(
const ASTPtr & table_expression, const Context & context, size_t subquery_depth, const Names & required_source_columns);
}

View File

@ -651,7 +651,7 @@ BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreams(
/// Let's estimate total number of rows for progress bar.
const size_t total_rows = data.index_granularity * sum_marks;
LOG_TRACE(log, "Reading approx. " << total_rows << " rows");
LOG_TRACE(log, "Reading approx. " << total_rows << " rows with " << num_streams << " streams");
for (size_t i = 0; i < num_streams; ++i)
{
@ -681,6 +681,8 @@ BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreams(
while (need_marks > 0 && !parts.empty())
{
RangesInDataPart part = parts.back();
parts.pop_back();
size_t & marks_in_part = sum_marks_in_parts.back();
/// We will not take too few rows from a part.
@ -704,7 +706,6 @@ BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreams(
ranges_to_get_from_part = part.ranges;
need_marks -= marks_in_part;
parts.pop_back();
sum_marks_in_parts.pop_back();
}
else
@ -727,6 +728,7 @@ BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreams(
if (range.begin == range.end)
part.ranges.pop_back();
}
parts.emplace_back(part);
}
BlockInputStreamPtr source_stream = std::make_shared<MergeTreeBlockInputStream>(

View File

@ -1,5 +1,5 @@
include(${ClickHouse_SOURCE_DIR}/cmake/dbms_glob_sources.cmake)
add_headers_and_sources(storages_system .)
add_library(clickhouse_storages_system ${storages_system_headers} ${storages_system_sources})
add_library(clickhouse_storages_system ${LINK_MODE} ${storages_system_headers} ${storages_system_sources})
target_link_libraries(clickhouse_storages_system dbms)

View File

@ -4,5 +4,5 @@ add_headers_and_sources(clickhouse_table_functions .)
list(REMOVE_ITEM clickhouse_table_functions_sources ITableFunction.cpp TableFunctionFactory.cpp)
list(REMOVE_ITEM clickhouse_table_functions_headers ITableFunction.h TableFunctionFactory.h)
add_library(clickhouse_table_functions ${clickhouse_table_functions_sources})
add_library(clickhouse_table_functions ${LINK_MODE} ${clickhouse_table_functions_sources})
target_link_libraries(clickhouse_table_functions clickhouse_storages_system dbms ${Poco_Foundation_LIBRARY})

View File

@ -0,0 +1 @@
00443_optimize_final_vertical_merge.reference

View File

@ -0,0 +1,7 @@
#!/usr/bin/env bash
set -e
CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
env CLICKHOUSE_CLIENT_OPT="--merge_tree_uniform_read_distribution=0" bash $CURDIR/00443_optimize_final_vertical_merge.sh

View File

@ -23,7 +23,7 @@ SELECT * FROM (SELECT 1 AS id UNION ALL SELECT 2) WHERE id = 1;
SELECT * FROM (SELECT arrayJoin([1, 2, 3]) AS id) WHERE id = 1;
SELECT id FROM (SELECT arrayJoin([1, 2, 3]) AS id) WHERE id = 1;
SELECT * FROM (SELECT 1 AS id, (SELECT 1)) WHERE _subquery1 = 1;
SELECT * FROM (SELECT 1 AS id, (SELECT 1) as subquery) WHERE subquery = 1;
SELECT * FROM (SELECT toUInt64(b) AS a, sum(id) AS b FROM test.test) WHERE a = 3;
SELECT * FROM (SELECT toUInt64(b), sum(id) AS b FROM test.test) WHERE `toUInt64(sum(id))` = 3;
SELECT date, id, name, value FROM (SELECT date, name, value, min(id) AS id FROM test.test GROUP BY date, name, value) WHERE id = 1;

View File

@ -0,0 +1,23 @@
SET compile_expressions = 1;
SET min_count_to_compile = 1;
SET optimize_move_to_prewhere = 0;
SET enable_optimize_predicate_expression=0;
DROP TABLE IF EXISTS test.dt;
DROP TABLE IF EXISTS test.testx;
CREATE TABLE test.dt(tkey Int32) ENGINE = MergeTree order by tuple();
INSERT INTO test.dt VALUES (300000);
CREATE TABLE test.testx(t Int32, a UInt8) ENGINE = MergeTree ORDER BY tuple();
INSERT INTO test.testx VALUES (100000, 0);
SELECT COUNT(*) FROM test.testx WHERE NOT a AND t < (SELECT tkey FROM test.dt);
DROP TABLE test.dt;
CREATE TABLE test.dt(tkey Int32) ENGINE = MergeTree order by tuple();
INSERT INTO test.dt VALUES (0);
SELECT COUNT(*) FROM test.testx WHERE NOT a AND t < (SELECT tkey FROM test.dt);
DROP TABLE IF EXISTS test.dt;
DROP TABLE IF EXISTS test.testx;

View File

@ -2,7 +2,7 @@
export CLICKHOUSE_BINARY=${CLICKHOUSE_BINARY:="clickhouse"}
export CLICKHOUSE_CLIENT=${CLICKHOUSE_CLIENT:="${CLICKHOUSE_BINARY}-client"}
export CLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL=${CLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL:="warning"}
export CLICKHOUSE_CLIENT="${CLICKHOUSE_CLIENT} --send_logs_level=${CLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL}"
export CLICKHOUSE_CLIENT="${CLICKHOUSE_CLIENT} --send_logs_level=${CLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL} ${CLICKHOUSE_CLIENT_OPT}"
export CLICKHOUSE_LOCAL=${CLICKHOUSE_LOCAL:="${CLICKHOUSE_BINARY}-local"}
export CLICKHOUSE_CONFIG=${CLICKHOUSE_CONFIG:="/etc/clickhouse-server/config.xml"}

View File

@ -5,6 +5,9 @@ RUN apt-get update -y \
apt-get install --yes --no-install-recommends \
bash \
cmake \
ccache \
distcc \
distcc-pump \
curl \
gcc-7 \
g++-7 \

View File

@ -5,6 +5,9 @@ RUN apt-get update -y \
apt-get install --yes --no-install-recommends \
bash \
fakeroot \
ccache \
distcc \
distcc-pump \
cmake \
curl \
gcc-7 \

View File

@ -52,7 +52,7 @@ def run_image_with_env(image_name, output, env_variables, ch_root):
subprocess.check_call(cmd, shell=True)
def parse_env_variables(build_type, compiler, sanitizer, package_type):
def parse_env_variables(build_type, compiler, sanitizer, package_type, cache, distcc_hosts):
result = []
if package_type == "deb":
result.append("DEB_CC={}".format(compiler))
@ -66,18 +66,28 @@ def parse_env_variables(build_type, compiler, sanitizer, package_type):
if build_type:
result.append("BUILD_TYPE={}".format(build_type))
if cache:
result.append("CCACHE_PREFIX={}".format(cache))
if distcc_hosts:
hosts_with_params = ["{}/24,lzo".format(host) for host in distcc_hosts] + ["localhost/`nproc`"]
result.append('DISTCC_HOSTS="{}"'.format(" ".join(hosts_with_params)))
elif cache == "distcc":
result.append('DISTCC_HOSTS="{}"'.format("localhost/`nproc`"))
return result
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
parser = argparse.ArgumentParser(description="ClickHouse building script via docker")
parser.add_argument("--package-type", choices=IMAGE_MAP.keys())
parser.add_argument("--package-type", choices=IMAGE_MAP.keys(), required=True)
parser.add_argument("--clickhouse-repo-path", default="../../")
parser.add_argument("--output-dir", required=True)
parser.add_argument("--build-type", choices=("debug", ""), default="")
parser.add_argument("--compiler", choices=("clang-6.0", "gcc-7", "gcc-8"), default="gcc-7")
parser.add_argument("--sanitizer", choices=("address", "thread", "memory", "undefined", ""), default="")
parser.add_argument("--cache", choices=("ccache", "distcc"))
parser.add_argument("--distcc-hosts", nargs="+")
parser.add_argument("--force-build-image", action="store_true")
args = parser.parse_args()
@ -95,5 +105,6 @@ if __name__ == "__main__":
if not check_image_exists_locally(image_name) or args.force_build_image:
if not pull_image(image_name) or args.force_build_image:
build_image(image_name, dockerfile)
run_image_with_env(image_name, args.output_dir, parse_env_variables(args.build_type, args.compiler, args.sanitizer, args.package_type), ch_root)
env_prepared = parse_env_variables(args.build_type, args.compiler, args.sanitizer, args.package_type, args.cache, args.distcc_hosts)
run_image_with_env(image_name, args.output_dir, env_prepared, ch_root)
logging.info("Output placed into {}".format(args.output_dir))

View File

@ -4,7 +4,7 @@
Signed fixed-point numbers that keep precision during add, subtract, and multiply operations. For division, the least significant digits are discarded (not rounded).
## Параметры
## Parameters
- P - precision. Valid range: [ 1 : 38 ]. Determines how many decimal digits the number can have (including the fraction).
- S - scale. Valid range: [ 0 : P ]. Determines how many decimal digits the fraction can have.

View File

@ -7,15 +7,15 @@ nav:
- '性能': 'introduction/performance.md'
- 'Yandex.Metrica使用案例': 'introduction/ya_metrika_task.md'
- '起步':
- '入门指南':
- '部署运行': 'getting_started/index.md'
- '示例数据集':
- 'OnTime': 'getting_started/example_datasets/ontime.md'
- 'New York Taxi data': 'getting_started/example_datasets/nyc_taxi.md'
- 'AMPLab Big Data Benchmark': 'getting_started/example_datasets/amplab_benchmark.md'
- 'WikiStat': 'getting_started/example_datasets/wikistat.md'
- 'Terabyte click logs from Criteo': 'getting_started/example_datasets/criteo.md'
- 'Star Schema Benchmark': 'getting_started/example_datasets/star_schema.md'
- '航班飞行数据': 'getting_started/example_datasets/ontime.md'
- '纽约市出租车数据': 'getting_started/example_datasets/nyc_taxi.md'
- 'AMPLab大数据基准测试': 'getting_started/example_datasets/amplab_benchmark.md'
- '维基访问数据': 'getting_started/example_datasets/wikistat.md'
- 'Criteo TB级别点击日志': 'getting_started/example_datasets/criteo.md'
- 'Star Schema基准测试': 'getting_started/example_datasets/star_schema.md'
- '客户端':
- '介绍': 'interfaces/index.md'

View File

@ -64,7 +64,7 @@
}
body {
font: 400 14pt/200% 'Yandex Sans Text Web', Arial, sans-serif;
font: 300 14pt/200% 'Yandex Sans Text Web', Arial, sans-serif;
}
body.md-lang-zh {

View File

@ -7,6 +7,10 @@
"footer.next": "Next",
"meta.comments": "Comments",
"meta.source": "Source",
"nav.multi_page": "Multi page version",
"nav.pdf": "PDF version",
"nav.single_page": "Single page version",
"nav.source": "ClickHouse source code",
"search.placeholder": "Search",
"search.result.placeholder": "Type to start searching",
"search.result.none": "No matching documents",

View File

@ -8,6 +8,10 @@
"footer.next": "بعدی",
"meta.comments": "نظرات",
"meta.source": "منبع",
"nav.multi_page": "نسخه چند صفحه ای",
"nav.pdf": "نسخه PDF",
"nav.single_page": "نسخه تک صفحه",
"nav.source": "کد منبع کلیک",
"search.language": "",
"search.pipeline.stopwords": false,
"search.pipeline.trimmer": false,

View File

@ -7,6 +7,10 @@
"footer.next": "Вперед",
"meta.comments": "Комментарии",
"meta.source": "Исходный код",
"nav.multi_page": "Многостраничная версия",
"nav.pdf": "PDF версия",
"nav.single_page": "Одностраничная версия",
"nav.source": "Исходный код ClickHouse",
"search.placeholder": "Поиск",
"search.result.placeholder": "Начните печатать для поиска",
"search.result.none": "Совпадений не найдено",

View File

@ -7,6 +7,10 @@
"footer.next": "前进",
"meta.comments": "评论",
"meta.source": "来源",
"nav.multi_page": "多页版本",
"nav.pdf": "PDF版本",
"nav.single_page": "单页版本",
"nav.source": "ClickHouse源代码",
"search.placeholder": "搜索",
"search.result.placeholder": "键入以开始搜索",
"search.result.none": "没有找到符合条件的结果",

View File

@ -15,35 +15,19 @@
<ul id="md-extra-nav" class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item md-nav__item--active">
{% if config.theme.language == 'ru' %}
{% if config.extra.single_page %}
<a href="{{ base_url }}" class="md-nav__link md-nav__link--active">Многостраничная версия</a>
<a href="{{ base_url }}" class="md-nav__link md-nav__link--active">{{ lang.t("nav.multi_page") }}</a>
{% else %}
<a href="{{ base_url }}/single/" class="md-nav__link md-nav__link--active">Одностраничная версия</a>
<a href="{{ base_url }}/single/" class="md-nav__link md-nav__link--active">{{ lang.t("nav.single_page") }}</a>
{% endif %}
{% else %}
{% if config.extra.single_page %}
<a href="{{ base_url }}" class="md-nav__link md-nav__link--active">Multi page version</a>
{% else %}
<a href="{{ base_url }}/single/" class="md-nav__link md-nav__link--active">Single page version</a>
{% endif %}
{% endif %}
</li>
<li class="md-nav__item md-nav__item--active">
{% if config.theme.language == 'ru' %}
<a href="{{ base_url }}/single/clickhouse_{{ config.theme.language }}.pdf" class="md-nav__link md-nav__link--active">PDF версия</a>
{% else %}
<a href="{{ base_url }}/single/clickhouse_{{ config.theme.language }}.pdf" class="md-nav__link md-nav__link--active">PDF version</a>
{% endif %}
<a href="{{ base_url }}/single/clickhouse_{{ config.theme.language }}.pdf" class="md-nav__link md-nav__link--active">{{ lang.t("nav.pdf") }}</a>
</li>
{% if config.repo_url %}
<li class="md-nav__item md-nav__item--active">
<a href="{{ config.repo_url }}" rel="external nofollow" target="_blank" class="md-nav__link">
{% if config.theme.language == 'ru' %}
Исходники ClickHouse
{% else %}
ClickHouse sources
{% endif %}
{{ lang.t("nav.source") }}
</a>
</li>
{% endif %}

View File

@ -1 +0,0 @@
../../en/data_types/array.md

View File

@ -0,0 +1,84 @@
<a name="data_type-array"></a>
# Array(T)
An array of elements of type `T`.
`T` can be any type, including an array type. However, multidimensional arrays are not recommended: ClickHouse has limited support for them. For example, multidimensional arrays cannot be stored in `MergeTree` tables.
## Creating an array
You can use the array function to create an array:
```
array(T)
```
You can also use square brackets:
```
[]
```
An example of creating an array:
```
:) SELECT array(1, 2) AS x, toTypeName(x)
SELECT
[1, 2] AS x,
toTypeName(x)
┌─x─────┬─toTypeName(array(1, 2))─┐
│ [1,2] │ Array(UInt8) │
└───────┴─────────────────────────┘
1 rows in set. Elapsed: 0.002 sec.
:) SELECT [1, 2] AS x, toTypeName(x)
SELECT
[1, 2] AS x,
toTypeName(x)
┌─x─────┬─toTypeName([1, 2])─┐
│ [1,2] │ Array(UInt8) │
└───────┴────────────────────┘
1 rows in set. Elapsed: 0.002 sec.
```
## Working with data types
When creating an array on the fly, ClickHouse automatically detects the element types and picks the narrowest data type that can store all of the listed elements. If the elements include [NULL](../query_language/syntax.md#null-literal) or a [Nullable](nullable.md#data_type-nullable) type, the element type of the array becomes [Nullable](nullable.md#data_type-nullable).
If ClickHouse cannot determine the data type, it throws an exception. This happens, for instance, when you try to create an array containing strings and numbers at the same time (`SELECT array(1, 'a')`).
An example of automatic data type detection:
```
:) SELECT array(1, 2, NULL) AS x, toTypeName(x)
SELECT
[1, 2, NULL] AS x,
toTypeName(x)
┌─x──────────┬─toTypeName(array(1, 2, NULL))─┐
│ [1,2,NULL] │ Array(Nullable(UInt8)) │
└────────────┴───────────────────────────────┘
1 rows in set. Elapsed: 0.002 sec.
```
If you try to create an array of incompatible data types, ClickHouse throws an exception:
```
:) SELECT array(1, 'a')
SELECT [1, 'a']
Received exception from server (version 1.1.54388):
Code: 386. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception: There is no supertype for types UInt8, String because some of them are String/FixedString and some of them are not.
0 rows in set. Elapsed: 0.246 sec.
```

View File

@ -1 +0,0 @@
../../en/data_types/boolean.md

View File

@ -0,0 +1,3 @@
# Boolean Values
There is no separate type for boolean values. You can use the UInt8 type, restricting the values to 0 or 1.
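For example (a minimal illustration; comparison operators already return UInt8 values of 0 or 1):
```sql
SELECT 1 < 2 AS flag, toTypeName(flag)  -- returns flag = 1 with type UInt8
```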

View File

@ -1 +0,0 @@
../../en/data_types/date.md

View File

@ -0,0 +1,5 @@
# Date
A date. Stored as two bytes containing the number of days since 1970-01-01 (unsigned). Allows storing values from just after the start of the Unix epoch up to an upper threshold defined by a compile-time constant (currently this is the year 2106, but the last fully supported year is 2105). The minimum value is output as 0000-00-00.
The date value is stored without the time zone.
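A small sketch of the day-counter representation (output formatting may differ between versions):
```sql
SELECT toDate('1970-01-02') AS d, toUInt16(d) AS days_since_epoch  -- days_since_epoch = 1
```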

View File

@ -1 +0,0 @@
../../en/data_types/datetime.md

View File

@ -0,0 +1,13 @@
<a name="data_type-datetime"></a>
# DateTime
A timestamp. Stored as four bytes containing an (unsigned) Unix timestamp. Allows storing values in the same range as the Date type. The minimum value is output as 0000-00-00 00:00:00. The value is precise to the second (leap seconds are not accounted for).
## Time zones
The system time zone at client or server startup is used to convert a timestamp between its text form (broken down into components) and its binary form and back. In text form, information about daylight saving time is lost.
By default, the client switches to the server's time zone when it connects. You can change this behavior by enabling the client command-line option `--use_client_time_zone`.
So when working with textual dates (for example, when saving text dumps), keep in mind that results may be ambiguous around daylight saving time changes, and there may be problems matching data if the time zone has changed.
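For example, the underlying Unix timestamp can be seen by casting the value to a number (a minimal sketch; the numeric value depends on the session time zone):
```sql
SELECT toDateTime('2018-10-22 12:00:00') AS t, toUInt32(t) AS unix_ts  -- unix_ts is the stored Unix timestamp
```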

View File

@ -1 +0,0 @@
../../en/data_types/decimal.md

View File

@ -0,0 +1,97 @@
<a name="data_type-decimal"></a>
# Decimal(P, S), Decimal32(S), Decimal64(S), Decimal128(S)
Signed fixed-point numbers that keep precision during add, subtract, and multiply operations. For division, the least significant digits are discarded (not rounded).
## Parameters
- P - precision. Valid range: [1:38]. Determines how many decimal digits the number can have (including the fraction).
- S - scale. Valid range: [0:P]. Determines how many decimal digits the fraction can have.
Depending on the value of the P parameter, Decimal(P, S) is a synonym for:
- P from [ 1 : 9 ] - for Decimal32(S)
- P from [ 10 : 18 ] - for Decimal64(S)
- P from [ 19 : 38 ] - for Decimal128(S)
## Decimal value ranges
- Decimal32(S) - ( -1 * 10^(9 - S), 1 * 10^(9 - S) )
- Decimal64(S) - ( -1 * 10^(18 - S), 1 * 10^(18 - S) )
- Decimal128(S) - ( -1 * 10^(38 - S), 1 * 10^(38 - S) )
例如Decimal32(4) 可以表示 -99999.9999 至 99999.9999 的数值步长为0.0001。
## 内部表示方式
数据采用与自身位宽相同的有符号整数存储。这个数在内存中实际范围会高于上述范围,从 String 转换到十进制数的时候会做对应的检查。
由于现代CPU不支持128位数字因此 Decimal128 上的操作由软件模拟。所以 Decimal128 的运算速度明显慢于 Decimal32/Decimal64。
## 运算和结果类型
对Decimal的二进制运算导致更宽的结果类型无论参数的顺序如何
- Decimal64(S1) <op> Decimal32(S2) -> Decimal64(S)
- Decimal128(S1) <op> Decimal32(S2) -> Decimal128(S)
- Decimal128(S1) <op> Decimal64(S2) -> Decimal128(S)
Rules for the scale:
- addition, subtraction: S = max(S1, S2).
- multiplication: S = S1 + S2.
- division: S = S1.
For similar operations between a Decimal and an integer, the result is a Decimal of the same size as the Decimal argument.
Operations between Decimal and Float32/Float64 are not defined. If you need them, explicitly cast one of the arguments using toDecimal32, toDecimal64, toDecimal128, or toFloat32, toFloat64. Keep in mind that the result will lose precision, and type conversion is a computationally expensive operation.
Some functions on Decimal return a Float64 result (for example, var or stddev). For some of them, intermediate calculations still happen in Decimal, so results over the same data in Float64 and Decimal may differ, even though the returned type is the same.
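These rules can be checked directly. A minimal sketch (the exact rendering of the type names may vary between versions):
```sql
SELECT
    toTypeName(toDecimal32(2, 2) + toDecimal32(3, 3)) AS sum_type,     -- scale = max(2, 3) = 3
    toTypeName(toDecimal32(2, 2) * toDecimal32(3, 3)) AS product_type  -- scale = 2 + 3 = 5
```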
## Overflow checks
During operations on Decimal, integer overflow can occur. Excess digits in the fraction are discarded (not rounded). Excess digits in the integer part lead to an exception.
```
SELECT toDecimal32(2, 4) AS x, x / 3
```
```
┌──────x─┬─divide(toDecimal32(2, 4), 3)─┐
│ 2.0000 │ 0.6666 │
└────────┴──────────────────────────────┘
```
```
SELECT toDecimal32(4.2, 8) AS x, x * x
```
```
DB::Exception: Scale is out of bounds.
```
```
SELECT toDecimal32(4.2, 8) AS x, 6 * x
```
```
DB::Exception: Decimal math overflow.
```
Overflow checks slow down computations. If overflows are known to be impossible, the checks can be disabled with the `decimal_check_overflow` setting; in that case, an overflow produces an incorrect result:
```
SET decimal_check_overflow = 0;
SELECT toDecimal32(4.2, 8) AS x, 6 * x
```
```
┌──────────x─┬─multiply(6, toDecimal32(4.2, 8))─┐
│ 4.20000000 │ -17.74967296 │
└────────────┴──────────────────────────────────┘
```
Overflow checks happen not only in arithmetic operations, but also in comparisons:
```
SELECT toDecimal32(1, 8) < 100
```
```
DB::Exception: Can't compare.
```

View File

@ -1 +0,0 @@
../../en/data_types/enum.md

docs/zh/data_types/enum.md
View File

@ -0,0 +1,116 @@
<a name="data_type-enum"></a>
# Enum8, Enum16
包括 `Enum8``Enum16` 类型。`Enum` 保存 `'string'= integer` 的对应关系。在 ClickHouse 中,尽管用户使用的是字符串常量,但所有含有 `Enum` 数据类型的操作都是按照包含整数的值来执行。这在性能方面比使用 `String` 数据类型更有效。
- `Enum8``'String'= Int8` 对描述。
- `Enum16``'String'= Int16` 对描述。
## 用法示例
创建一个带有一个枚举 `Enum8('hello' = 1, 'world' = 2)` 类型的列:
```
CREATE TABLE t_enum
(
x Enum8('hello' = 1, 'world' = 2)
)
ENGINE = TinyLog
```
The `x` column can only store the values listed in the type definition: `'hello'` or `'world'`. If you try to save any other value, ClickHouse throws an exception.
```
:) INSERT INTO t_enum VALUES ('hello'), ('world'), ('hello')
INSERT INTO t_enum VALUES
Ok.
3 rows in set. Elapsed: 0.002 sec.
:) insert into t_enum values('a')
INSERT INTO t_enum VALUES
Exception on client:
Code: 49. DB::Exception: Unknown element 'a' for type Enum8('hello' = 1, 'world' = 2)
```
When you query data from the table, ClickHouse outputs the string values of the `Enum`.
```
SELECT * FROM t_enum
┌─x─────┐
│ hello │
│ world │
│ hello │
└───────┘
```
If you need to see the numeric equivalents of the rows, you must cast the `Enum` values to an integer type.
```
SELECT CAST(x, 'Int8') FROM t_enum
┌─CAST(x, 'Int8')─┐
│ 1 │
│ 2 │
│ 1 │
└─────────────────┘
```
To create an Enum value in a query, you also need to use `CAST`:
```
SELECT toTypeName(CAST('a', 'Enum8(\'a\' = 1, \'b\' = 2)'))
┌─toTypeName(CAST('a', 'Enum8(\'a\' = 1, \'b\' = 2)'))─┐
│ Enum8('a' = 1, 'b' = 2) │
└──────────────────────────────────────────────────────┘
```
## General rules and usage
Each value of an `Enum8` is in the range `-128 ... 127`; for `Enum16`, the range is `-32768 ... 32767`. All strings and numbers must be distinct. An empty string is allowed. When the Enum type is specified (in a table definition), the numbers can be in any arbitrary order; the order does not matter.
Neither the strings nor the numeric values in an `Enum` can be [NULL](../query_language/syntax.md#null-literal).
An `Enum` can be wrapped in the [Nullable](nullable.md#data_type-nullable) type. So if you create a table with the query
```
CREATE TABLE t_enum_nullable
(
x Nullable( Enum8('hello' = 1, 'world' = 2) )
)
ENGINE = TinyLog
```
it can store not only `'hello'` and `'world'`, but also `NULL`.
```
INSERT INTO t_enum_null Values('hello'),('world'),(NULL)
```
In memory, an `Enum` column is stored the same way as `Int8` or `Int16` columns of the corresponding numeric values.
When reading in text form, ClickHouse parses the value as a string and looks it up among the strings of the Enum to find the corresponding numeric value; an exception is thrown if it is not found.
When writing in text form, the value is written out as the corresponding string. If the column data contains garbage (numbers that are not from the valid set), an exception is thrown. When reading and writing in binary form, an Enum behaves the same way as the `Int8` or `Int16` type.
The implicit default value is the value with the smallest number.
During `ORDER BY`, `GROUP BY`, `IN`, `DISTINCT`, and so on, Enums behave like the corresponding numbers; for example, ORDER BY sorts them numerically. Equality and comparison operators work on Enums the same way they do on the underlying numeric values.
Enum values cannot be compared with numbers. Enums can be compared with a constant string; if the string is not a valid value for the Enum, an exception is thrown. The IN operator is supported with the Enum on the left-hand side and a set of strings on the right-hand side.
Most operations with numbers and strings are not defined for Enums; for example, you cannot add a number to an Enum. However, an Enum has a natural `toString` function that returns its string value.
Enum values are also convertible to numeric types using the `toT` functions, where T is a numeric type. When `T` matches the Enum's underlying numeric type, the conversion is zero-cost.
The set of values of an `Enum` can be changed with `ALTER` at no cost. Members can be added or removed with `ALTER` (removal is safe only if the removed value has never been used in the table). As a safeguard, changing the numeric value of a previously defined Enum member throws an exception.
Using `ALTER`, an `Enum8` can be changed to an `Enum16` and vice versa, just like an `Int8` to an `Int16`.

View File

@ -1 +0,0 @@
../../en/data_types/fixedstring.md

View File

@ -0,0 +1,9 @@
# FixedString(N)
A fixed-length string of N bytes. N must be a strictly positive natural number.
When the server reads a string shorter than N bytes (for example, when parsing INSERT data), the string is padded with null bytes at the end up to N bytes.
When the server reads a string longer than N bytes, an error message is returned.
When the server writes a string (for example, when outputting the result of a SELECT query), null bytes are not trimmed from the end; they are output as-is.
Note that this behavior differs from MySQL's CHAR type (where strings are padded with spaces, and the spaces are trimmed on output).
Fewer functions work with `FixedString(N)` than with `String`, so it is less convenient to use.

View File

@ -1 +0,0 @@
../../en/data_types/float.md

View File

@ -0,0 +1,72 @@
# Float32, Float64
[Floating point numbers](https://en.wikipedia.org/wiki/IEEE_754).
These types are the same as the corresponding C types:
- `Float32` - `float`
- `Float64` - `double`
We recommend storing data in integer form whenever possible. For example, convert fixed-precision numbers to integer values, such as monetary amounts or page load times in milliseconds.
## Using floating-point numbers
- Computations with floating-point numbers may produce rounding errors.
```sql
SELECT 1 - 0.9
```
```
┌───────minus(1, 0.9)─┐
│ 0.09999999999999998 │
└─────────────────────┘
```
- The result of a computation depends on the computation method (the processor type and architecture of the computer system).
- Floating-point computations may produce values such as infinity (`INF`) and "not-a-number" (`NaN`). Take this into account when processing the results of such computations.
- When parsing floating-point numbers from text, the result may not be the nearest machine-representable value.
## NaN and Inf
In contrast to standard SQL, ClickHouse supports the following categories of floating-point numbers:
- `Inf` - positive infinity.
```sql
SELECT 0.5 / 0
```
```
┌─divide(0.5, 0)─┐
│ inf │
└────────────────┘
```
- `-Inf` - negative infinity.
```sql
SELECT -0.5 / 0
```
```
┌─divide(-0.5, 0)─┐
│ -inf │
└─────────────────┘
```
- `NaN` - not a number.
```
SELECT 0 / 0
```
```
┌─divide(0, 0)─┐
│ nan │
└──────────────┘
```
See the rules for `NaN` sorting in the section [ORDER BY clause](../query_language/select.md#query_language-queries-order_by).

View File

@ -1 +0,0 @@
../../en/data_types/index.md

View File

@ -0,0 +1,7 @@
<a name="data_types"></a>
# Data types
ClickHouse can store various types of data in table cells.
This section describes the supported data types and any special considerations for using and/or implementing them, where applicable.

View File

@ -1 +0,0 @@
../../en/data_types/int_uint.md

View File

@ -0,0 +1,19 @@
<a name="data_type-int"></a>
# UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64
Fixed-length integers, either signed or unsigned.
## Int ranges
- Int8 - [-128 : 127]
- Int16 - [-32768 : 32767]
- Int32 - [-2147483648 : 2147483647]
- Int64 - [-9223372036854775808 : 9223372036854775807]
## UInt ranges
- UInt8 - [0 : 255]
- UInt16 - [0 : 65535]
- UInt32 - [0 : 4294967295]
- UInt64 - [0 : 18446744073709551615]
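For example (a minimal illustration; ClickHouse picks the smallest of these types that fits a literal):
```sql
SELECT toTypeName(255) AS t1, toTypeName(-1) AS t2, toTypeName(12345678901) AS t3  -- UInt8, Int8, UInt64
```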

View File

@ -1 +0,0 @@
../../../en/data_types/nested_data_structures/aggregatefunction.md

View File

@ -0,0 +1,3 @@
# AggregateFunction(name, types_of_arguments...)
The intermediate state of an aggregate function. To get it, use an aggregate function with the '-State' suffix. For more information, see "AggregatingMergeTree".
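As a minimal sketch (the `agg_visits` and `raw_visits` tables and their columns here are hypothetical), states produced with the '-State' suffix can be stored and later finalized with the '-Merge' suffix:
```sql
CREATE TABLE agg_visits
(
    date Date,
    visitors AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree(date, (date), 8192);

-- store the intermediate state of uniq() rather than its final value
INSERT INTO agg_visits
SELECT date, uniqState(UserID) AS visitors FROM raw_visits GROUP BY date;

-- finalize the stored states
SELECT date, uniqMerge(visitors) FROM agg_visits GROUP BY date;
```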

View File

@ -1 +0,0 @@
../../../en/data_types/nested_data_structures/index.md

View File

@ -0,0 +1 @@
# Nested data structures

View File

@ -1 +0,0 @@
../../../en/data_types/nested_data_structures/nested.md

View File

@ -0,0 +1,97 @@
# Nested(Name1 Type1, Name2 Type2, ...)
A nested data structure is like a nested table. The parameters of a nested data structure (column names and types) are specified the same way as in a CREATE query. Each table row can correspond to any number of rows in a nested data structure.
Example:
```sql
CREATE TABLE test.visits
(
CounterID UInt32,
StartDate Date,
Sign Int8,
IsNew UInt8,
VisitID UInt64,
UserID UInt64,
...
Goals Nested
(
ID UInt32,
Serial UInt32,
EventTime DateTime,
Price Int64,
OrderID String,
CurrencyID UInt32
),
...
) ENGINE = CollapsingMergeTree(StartDate, intHash32(UserID), (CounterID, StartDate, intHash32(UserID), VisitID), 8192, Sign)
```
This example declares the `Goals` nested data structure, which contains data about conversions (goals reached). Each row in the 'visits' table can correspond to zero or any number of conversions.
Only a single nesting level is supported. A column of a nested structure whose type is an array is equivalent to a multidimensional array, so support for this is limited (storing such columns in tables with the MergeTree engine is not supported).
In most cases, when working with a nested data structure, you refer to its individual columns; their names are formed by joining the nested structure name and the column name with a dot. These columns make up a set of matching types. All column arrays of a single nested structure have the same length.
Example:
```sql
SELECT
Goals.ID,
Goals.EventTime
FROM test.visits
WHERE CounterID = 101500 AND length(Goals.ID) < 5
LIMIT 10
```
```text
┌─Goals.ID───────────────────────┬─Goals.EventTime───────────────────────────────────────────────────────────────────────────┐
│ [1073752,591325,591325] │ ['2014-03-17 16:38:10','2014-03-17 16:38:48','2014-03-17 16:42:27'] │
│ [1073752] │ ['2014-03-17 00:28:25'] │
│ [1073752] │ ['2014-03-17 10:46:20'] │
│ [1073752,591325,591325,591325] │ ['2014-03-17 13:59:20','2014-03-17 22:17:55','2014-03-17 22:18:07','2014-03-17 22:18:51'] │
│ [] │ [] │
│ [1073752,591325,591325] │ ['2014-03-17 11:37:06','2014-03-17 14:07:47','2014-03-17 14:36:21'] │
│ [] │ [] │
│ [] │ [] │
│ [591325,1073752] │ ['2014-03-17 00:46:05','2014-03-17 00:46:05'] │
│ [1073752,591325,591325,591325] │ ['2014-03-17 13:28:33','2014-03-17 13:30:26','2014-03-17 18:51:21','2014-03-17 18:51:45'] │
└────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘
```
So the simplest way to think of a nested data structure is as a set of multiple column arrays of the same length.
The only place where a SELECT query can refer to the name of an entire nested data structure, rather than individual columns, is the ARRAY JOIN clause. For more information, see "ARRAY JOIN clause". Example:
```sql
SELECT
Goal.ID,
Goal.EventTime
FROM test.visits
ARRAY JOIN Goals AS Goal
WHERE CounterID = 101500 AND length(Goals.ID) < 5
LIMIT 10
```
```text
┌─Goal.ID─┬──────Goal.EventTime─┐
│ 1073752 │ 2014-03-17 16:38:10 │
│ 591325 │ 2014-03-17 16:38:48 │
│ 591325 │ 2014-03-17 16:42:27 │
│ 1073752 │ 2014-03-17 00:28:25 │
│ 1073752 │ 2014-03-17 10:46:20 │
│ 1073752 │ 2014-03-17 13:59:20 │
│ 591325 │ 2014-03-17 22:17:55 │
│ 591325 │ 2014-03-17 22:18:07 │
│ 591325 │ 2014-03-17 22:18:51 │
│ 1073752 │ 2014-03-17 11:37:06 │
└─────────┴─────────────────────┘
```
You cannot SELECT an entire nested data structure; you can only explicitly list the individual columns that are part of it.
For an INSERT query, pass all of the component column arrays of a nested data structure separately (as if they were individual column arrays). During insertion, the system checks that they have the same length.
For a DESCRIBE query, the columns of a nested data structure are listed separately in the same way.
ALTER queries for elements of nested data structures are very limited.

View File

@ -1 +0,0 @@
../../en/data_types/nullable.md

View File

@ -0,0 +1,57 @@
<a name="data_type-nullable"></a>
# Nullable(TypeName)
Allows storing a special marker ([NULL](../query_language/syntax.md#null-literal)) that denotes a "missing value" alongside the normal values of `TypeName`. For example, a `Nullable(Int8)` column can store `Int8` values, and rows without a value store `NULL`.
`TypeName` cannot be one of the composite data types [Array](array.md#data_type-array) and [Tuple](tuple.md#data_type-tuple). Composite data types can contain `Nullable` values, such as `Array(Nullable(Int8))`.
`Nullable` fields cannot be included in table indexes.
Unless stated otherwise in the ClickHouse server configuration, `NULL` is the default value for any `Nullable` type.
## Storage features
To store `Nullable` values in a table column, ClickHouse uses a separate file with `NULL` masks in addition to the normal file with values. Entries in the mask file allow ClickHouse to distinguish between `NULL` and the default value of the corresponding data type for each table row. Because of the additional file, a `Nullable` column consumes extra storage space compared to a similar normal one.
!!! note
    Using `Nullable` almost always has a negative effect on performance; keep this in mind when designing your databases.
## Usage example
```
:) CREATE TABLE t_null(x Int8, y Nullable(Int8)) ENGINE TinyLog
CREATE TABLE t_null
(
x Int8,
y Nullable(Int8)
)
ENGINE = TinyLog
Ok.
0 rows in set. Elapsed: 0.012 sec.
:) INSERT INTO t_null VALUES (1, NULL)
INSERT INTO t_null VALUES
Ok.
1 rows in set. Elapsed: 0.007 sec.
:) SELECT x + y FROM t_null
SELECT x + y
FROM t_null
┌─plus(x, y)─┐
│ ᴺᵁᴸᴸ │
│ 5 │
└────────────┘
2 rows in set. Elapsed: 0.144 sec.
```

View File

@ -1 +0,0 @@
../../../en/data_types/special_data_types/expression.md

View File

@ -0,0 +1,3 @@
# Expression
Used to represent lambda expressions in higher-order functions.
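For example (a minimal illustration), the lambda passed to a higher-order function such as `arrayMap` has this type:
```sql
SELECT arrayMap(x -> x * 2, [1, 2, 3]) AS doubled  -- the argument x -> x * 2 is an Expression
```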

View File

@ -1 +0,0 @@
../../../en/data_types/special_data_types/index.md

View File

@ -0,0 +1,3 @@
# Special Data Types
Special data type values cannot be stored in a table or output in query results, but can be used as intermediate results while running a query.

View File

@ -1 +0,0 @@
../../../en/data_types/special_data_types/nothing.md

View File

@ -0,0 +1,21 @@
<a name="special_data_type-nothing"></a>
# Nothing
The only purpose of this data type is to represent cases where a value is not expected. You cannot create a value of the `Nothing` type.
For example, the literal [NULL](../../query_language/syntax.md#null-literal) has type `Nullable(Nothing)`. See [Nullable](../../data_types/nullable.md#data_type-nullable) for more details.
The `Nothing` type can also be used to denote empty arrays:
```bash
:) SELECT toTypeName(array())
SELECT toTypeName([])
┌─toTypeName(array())─┐
│ Array(Nothing) │
└─────────────────────┘
1 rows in set. Elapsed: 0.062 sec.
```

View File

@ -1 +0,0 @@
../../../en/data_types/special_data_types/set.md

View File

@ -0,0 +1,4 @@
# Set
Used for the right half of an IN expression.
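For example (a minimal illustration), the right-hand sides of the following expressions are evaluated into Sets during query execution:
```sql
SELECT 1 IN (1, 2, 3) AS hit, 5 IN (1, 2, 3) AS miss  -- hit = 1, miss = 0
```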

View File

@ -1 +0,0 @@
../../en/data_types/string.md

View File

@ -0,0 +1,13 @@
<a name="data_types-string"></a>
# String
Strings of arbitrary length. A string can contain an arbitrary set of bytes, including null bytes. The String type therefore replaces the VARCHAR, BLOB, CLOB, and other types from other DBMSs.
## Encodings
ClickHouse does not have the concept of encodings. Strings can contain an arbitrary set of bytes, which are stored and output as-is.
If you need to store text, we recommend using UTF-8 encoding. At the very least, if your terminal uses UTF-8 (as recommended), you can read and write your values without conversions.
Similarly, certain functions for working with strings come in separate variations that assume the string contains UTF-8 encoded text.
For example, the `length` function calculates the string length in bytes, while the `lengthUTF8` function calculates the length in Unicode code points, assuming the value is UTF-8 encoded.
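For example (a minimal illustration):
```sql
SELECT length('привет') AS bytes, lengthUTF8('привет') AS code_points  -- 12 bytes, 6 code points
```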

View File

@ -1 +0,0 @@
../../en/data_types/tuple.md

View File

@ -0,0 +1,53 @@
<a name="data_type-tuple"></a>
# Tuple(T1, T2, ...)
A tuple of elements, each having an individual [type](index.md#data_types).
Tuples cannot be stored in tables (other than Memory tables). They are used for temporary column grouping. Columns can be grouped by an IN expression in a query, and for specifying certain formal parameters of lambda functions. For more information, see [IN operators](../query_language/select.md#in_operators) and [Higher order functions](../query_language/functions/higher_order_functions.md#higher_order_functions).
Tuples can be the result of a query. In this case, for text formats other than JSON, values are comma-separated in brackets; in JSON formats, tuples are output as arrays (in square brackets).
## Creating a tuple
You can use a function to create a tuple:
```
tuple(T1, T2, ...)
```
An example of creating a tuple:
```
:) SELECT tuple(1,'a') AS x, toTypeName(x)
SELECT
(1, 'a') AS x,
toTypeName(x)
┌─x───────┬─toTypeName(tuple(1, 'a'))─┐
│ (1,'a') │ Tuple(UInt8, String) │
└─────────┴───────────────────────────┘
1 rows in set. Elapsed: 0.021 sec.
```
## Data types in tuples
When creating a tuple on the fly, ClickHouse automatically detects the type of each argument as the minimum of the types that can store the argument value. If the argument is [NULL](../query_language/syntax.md#null-literal), the type of the corresponding tuple element is [Nullable](nullable.md#data_type-nullable).
An example of automatic data type detection:
```
SELECT tuple(1, NULL) AS x, toTypeName(x)
SELECT
(1, NULL) AS x,
toTypeName(x)
┌─x────────┬─toTypeName(tuple(1, NULL))──────┐
│ (1,NULL) │ Tuple(UInt8, Nullable(Nothing)) │
└──────────┴─────────────────────────────────┘
1 rows in set. Elapsed: 0.002 sec.
```

View File

@ -1 +0,0 @@
../../../en/getting_started/example_datasets/amplab_benchmark.md

View File

@ -0,0 +1,123 @@
# AMPLab Big Data Benchmark
See <https://amplab.cs.berkeley.edu/benchmark/>
Sign up for a free account at <https://aws.amazon.com>. You will need a credit card, an email address, and a phone number. Get a new access key at <https://console.aws.amazon.com/iam/home?nc2=h_m_sc#security_credential>
Run the following in the console:
```bash
sudo apt-get install s3cmd
mkdir tiny; cd tiny;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/tiny/ .
cd ..
mkdir 1node; cd 1node;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/1node/ .
cd ..
mkdir 5nodes; cd 5nodes;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/5nodes/ .
cd ..
```
Run the following queries in ClickHouse:
``` sql
CREATE TABLE rankings_tiny
(
pageURL String,
pageRank UInt32,
avgDuration UInt32
) ENGINE = Log;
CREATE TABLE uservisits_tiny
(
sourceIP String,
destinationURL String,
visitDate Date,
adRevenue Float32,
UserAgent String,
cCode FixedString(3),
lCode FixedString(6),
searchWord String,
duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);
CREATE TABLE rankings_1node
(
pageURL String,
pageRank UInt32,
avgDuration UInt32
) ENGINE = Log;
CREATE TABLE uservisits_1node
(
sourceIP String,
destinationURL String,
visitDate Date,
adRevenue Float32,
UserAgent String,
cCode FixedString(3),
lCode FixedString(6),
searchWord String,
duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);
CREATE TABLE rankings_5nodes_on_single
(
pageURL String,
pageRank UInt32,
avgDuration UInt32
) ENGINE = Log;
CREATE TABLE uservisits_5nodes_on_single
(
sourceIP String,
destinationURL String,
visitDate Date,
adRevenue Float32,
UserAgent String,
cCode FixedString(3),
lCode FixedString(6),
searchWord String,
duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);
```
Go back to the console and run the following commands:
```bash
for i in tiny/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_tiny FORMAT CSV"; done
for i in tiny/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_tiny FORMAT CSV"; done
for i in 1node/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_1node FORMAT CSV"; done
for i in 1node/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_1node FORMAT CSV"; done
for i in 5nodes/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_5nodes_on_single FORMAT CSV"; done
for i in 5nodes/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_5nodes_on_single FORMAT CSV"; done
```
Example queries:
``` sql
SELECT pageURL, pageRank FROM rankings_1node WHERE pageRank > 1000
SELECT substring(sourceIP, 1, 8), sum(adRevenue) FROM uservisits_1node GROUP BY substring(sourceIP, 1, 8)
SELECT
sourceIP,
sum(adRevenue) AS totalRevenue,
avg(pageRank) AS pageRank
FROM rankings_1node ALL INNER JOIN
(
SELECT
sourceIP,
destinationURL AS pageURL,
adRevenue
FROM uservisits_1node
WHERE (visitDate > '1980-01-01') AND (visitDate < '1980-04-01')
) USING pageURL
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
```
[Original article](https://clickhouse.yandex/docs/en/getting_started/example_datasets/amplab_benchmark/) <!--hide-->

View File

@ -1 +0,0 @@
../../../en/getting_started/example_datasets/criteo.md

View File

@ -0,0 +1,75 @@
# Terabyte click logs from Criteo
Download the data from <http://labs.criteo.com/downloads/download-terabyte-click-logs/>
Create a table to import the raw log into:
``` sql
CREATE TABLE criteo_log (date Date, clicked UInt8, int1 Int32, int2 Int32, int3 Int32, int4 Int32, int5 Int32, int6 Int32, int7 Int32, int8 Int32, int9 Int32, int10 Int32, int11 Int32, int12 Int32, int13 Int32, cat1 String, cat2 String, cat3 String, cat4 String, cat5 String, cat6 String, cat7 String, cat8 String, cat9 String, cat10 String, cat11 String, cat12 String, cat13 String, cat14 String, cat15 String, cat16 String, cat17 String, cat18 String, cat19 String, cat20 String, cat21 String, cat22 String, cat23 String, cat24 String, cat25 String, cat26 String) ENGINE = Log
```
Download the data:
```bash
for i in {00..23}; do echo $i; zcat datasets/criteo/day_${i#0}.gz | sed -r 's/^/2000-01-'${i/00/24}'\t/' | clickhouse-client --host=example-perftest01j --query="INSERT INTO criteo_log FORMAT TabSeparated"; done
```
Create a table for the converted data:
``` sql
CREATE TABLE criteo
(
date Date,
clicked UInt8,
int1 Int32,
int2 Int32,
int3 Int32,
int4 Int32,
int5 Int32,
int6 Int32,
int7 Int32,
int8 Int32,
int9 Int32,
int10 Int32,
int11 Int32,
int12 Int32,
int13 Int32,
icat1 UInt32,
icat2 UInt32,
icat3 UInt32,
icat4 UInt32,
icat5 UInt32,
icat6 UInt32,
icat7 UInt32,
icat8 UInt32,
icat9 UInt32,
icat10 UInt32,
icat11 UInt32,
icat12 UInt32,
icat13 UInt32,
icat14 UInt32,
icat15 UInt32,
icat16 UInt32,
icat17 UInt32,
icat18 UInt32,
icat19 UInt32,
icat20 UInt32,
icat21 UInt32,
icat22 UInt32,
icat23 UInt32,
icat24 UInt32,
icat25 UInt32,
icat26 UInt32
) ENGINE = MergeTree(date, intHash32(icat1), (date, intHash32(icat1)), 8192)
```
Transform the raw data from the first table and write it into the second table:
``` sql
INSERT INTO criteo SELECT date, clicked, int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, int11, int12, int13, reinterpretAsUInt32(unhex(cat1)) AS icat1, reinterpretAsUInt32(unhex(cat2)) AS icat2, reinterpretAsUInt32(unhex(cat3)) AS icat3, reinterpretAsUInt32(unhex(cat4)) AS icat4, reinterpretAsUInt32(unhex(cat5)) AS icat5, reinterpretAsUInt32(unhex(cat6)) AS icat6, reinterpretAsUInt32(unhex(cat7)) AS icat7, reinterpretAsUInt32(unhex(cat8)) AS icat8, reinterpretAsUInt32(unhex(cat9)) AS icat9, reinterpretAsUInt32(unhex(cat10)) AS icat10, reinterpretAsUInt32(unhex(cat11)) AS icat11, reinterpretAsUInt32(unhex(cat12)) AS icat12, reinterpretAsUInt32(unhex(cat13)) AS icat13, reinterpretAsUInt32(unhex(cat14)) AS icat14, reinterpretAsUInt32(unhex(cat15)) AS icat15, reinterpretAsUInt32(unhex(cat16)) AS icat16, reinterpretAsUInt32(unhex(cat17)) AS icat17, reinterpretAsUInt32(unhex(cat18)) AS icat18, reinterpretAsUInt32(unhex(cat19)) AS icat19, reinterpretAsUInt32(unhex(cat20)) AS icat20, reinterpretAsUInt32(unhex(cat21)) AS icat21, reinterpretAsUInt32(unhex(cat22)) AS icat22, reinterpretAsUInt32(unhex(cat23)) AS icat23, reinterpretAsUInt32(unhex(cat24)) AS icat24, reinterpretAsUInt32(unhex(cat25)) AS icat25, reinterpretAsUInt32(unhex(cat26)) AS icat26 FROM criteo_log;
DROP TABLE criteo_log;
```
[Original article](https://clickhouse.yandex/docs/en/getting_started/example_datasets/criteo/) <!--hide-->

View File

@ -1 +0,0 @@
../../../en/getting_started/example_datasets/nyc_taxi.md

File diff suppressed because one or more lines are too long

View File

@ -1 +0,0 @@
../../../en/getting_started/example_datasets/ontime.md

View File

@ -0,0 +1,318 @@
<a name="example_datasets-ontime"></a>
# OnTime flight data
Downloading the data:
```bash
for s in `seq 1987 2017`
do
for m in `seq 1 12`
do
wget http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
```
(from <https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh> )
Creating a table:
```sql
CREATE TABLE `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
```
Loading data:
```bash
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --host=example-perftest01j --query="INSERT INTO ontime FORMAT CSVWithNames"; done
```
Queries:
Q0.
```sql
select avg(c1) from (select Year, Month, count(*) as c1 from ontime group by Year, Month);
```
Q1. The number of flights per day from 2000 to 2008
```sql
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
```
Q2. The number of flights delayed by more than 10 minutes, grouped by the day of the week, for 2000-2008.
```sql
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC
```
Q3. The number of delays of more than 10 minutes by airport for 2000-2008
```sql
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10
```
Q4. The number of delays of more than 10 minutes by carrier for 2007
```sql
SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC
```
Q5. The percentage of delays of more than 10 minutes by carrier for 2007
```sql
SELECT Carrier, c, c2, c*1000/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year=2007
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year=2007
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
```
A better version of the same query:
```sql
SELECT Carrier, avg(DepDelay > 10) * 1000 AS c3 FROM ontime WHERE Year = 2007 GROUP BY Carrier ORDER BY Carrier
```
Q6. The same as the previous query, but for a broader range of years: 2000-2008
```sql
SELECT Carrier, c, c2, c*1000/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year >= 2000 AND Year <= 2008
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year >= 2000 AND Year <= 2008
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
```
A better version of the same query:
```sql
SELECT Carrier, avg(DepDelay > 10) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ORDER BY Carrier
```
Q7. The percentage of flights delayed by more than 10 minutes, by year
```sql
SELECT Year, c1/c2
FROM
(
select
Year,
count(*)*1000 as c1
from ontime
WHERE DepDelay>10
GROUP BY Year
)
ANY INNER JOIN
(
select
Year,
count(*) as c2
from ontime
GROUP BY Year
) USING (Year)
ORDER BY Year
```
A better version of the same query:
```sql
SELECT Year, avg(DepDelay > 10) FROM ontime GROUP BY Year ORDER BY Year
```
Q8. The most popular destinations (by the number of distinct origin cities) for 2000-2010
```sql
SELECT DestCityName, uniqExact(OriginCityName) AS u FROM ontime WHERE Year >= 2000 and Year <= 2010 GROUP BY DestCityName ORDER BY u DESC LIMIT 10;
```
Q9.
```sql
select Year, count(*) as c1 from ontime group by Year;
```
Q10.
```sql
select
min(Year), max(Year), Carrier, count(*) as cnt,
sum(ArrDelayMinutes>30) as flights_delayed,
round(sum(ArrDelayMinutes>30)/count(*),2) as rate
FROM ontime
WHERE
DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI')
and DestState not in ('AK', 'HI', 'PR', 'VI')
and FlightDate < '2010-01-01'
GROUP by Carrier
HAVING cnt > 100000 and max(Year) > 1990
ORDER by rate DESC
LIMIT 1000;
```
Bonus:
```sql
SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month)
select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month)
SELECT DestCityName, uniqExact(OriginCityName) AS u FROM ontime GROUP BY DestCityName ORDER BY u DESC LIMIT 10;
SELECT OriginCityName, DestCityName, count() AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
SELECT OriginCityName, count() AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
```
This performance test was created by Vadim Tkachenko. See:
- <https://www.percona.com/blog/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/>
- <https://www.percona.com/blog/2009/10/26/air-traffic-queries-in-luciddb/>
- <https://www.percona.com/blog/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/>
- <https://www.percona.com/blog/2014/04/21/using-apache-hadoop-and-impala-together-with-mysql-for-data-analysis/>
- <https://www.percona.com/blog/2016/01/07/apache-spark-with-air-ontime-performance-data/>
- <http://nickmakos.blogspot.ru/2012/08/analyzing-air-traffic-performance-with.html>

View File

@ -1 +0,0 @@
../../../en/getting_started/example_datasets/star_schema.md

View File

@ -0,0 +1,87 @@
# Star Schema Benchmark
Compiling dbgen: <https://github.com/vadimtk/ssb-dbgen>
```bash
git clone git@github.com:vadimtk/ssb-dbgen.git
cd ssb-dbgen
make
```
There may be some warnings during compilation; this is normal.
Place `dbgen` and `dists.dss` on a disk with at least 800 GB of free space.
Generating data:
```bash
./dbgen -s 1000 -T c
./dbgen -s 1000 -T l
```
Creating tables in ClickHouse:
``` sql
CREATE TABLE lineorder (
LO_ORDERKEY UInt32,
LO_LINENUMBER UInt8,
LO_CUSTKEY UInt32,
LO_PARTKEY UInt32,
LO_SUPPKEY UInt32,
LO_ORDERDATE Date,
LO_ORDERPRIORITY String,
LO_SHIPPRIORITY UInt8,
LO_QUANTITY UInt8,
LO_EXTENDEDPRICE UInt32,
LO_ORDTOTALPRICE UInt32,
LO_DISCOUNT UInt8,
LO_REVENUE UInt32,
LO_SUPPLYCOST UInt32,
LO_TAX UInt8,
LO_COMMITDATE Date,
LO_SHIPMODE String
)Engine=MergeTree(LO_ORDERDATE,(LO_ORDERKEY,LO_LINENUMBER,LO_ORDERDATE),8192);
CREATE TABLE customer (
C_CUSTKEY UInt32,
C_NAME String,
C_ADDRESS String,
C_CITY String,
C_NATION String,
C_REGION String,
C_PHONE String,
C_MKTSEGMENT String,
C_FAKEDATE Date
)Engine=MergeTree(C_FAKEDATE,(C_CUSTKEY,C_FAKEDATE),8192);
CREATE TABLE part (
P_PARTKEY UInt32,
P_NAME String,
P_MFGR String,
P_CATEGORY String,
P_BRAND String,
P_COLOR String,
P_TYPE String,
P_SIZE UInt8,
P_CONTAINER String,
P_FAKEDATE Date
)Engine=MergeTree(P_FAKEDATE,(P_PARTKEY,P_FAKEDATE),8192);
CREATE TABLE lineorderd AS lineorder ENGINE = Distributed(perftest_3shards_1replicas, default, lineorder, rand());
CREATE TABLE customerd AS customer ENGINE = Distributed(perftest_3shards_1replicas, default, customer, rand());
CREATE TABLE partd AS part ENGINE = Distributed(perftest_3shards_1replicas, default, part, rand());
```
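These statements use the legacy `MergeTree(date, primary_key, index_granularity)` engine syntax, which partitions the data by month of the given date column. On servers that support the newer clause-based syntax, a roughly equivalent definition of `lineorder` looks like the following sketch (`lineorder_new` is an illustrative name, not part of the benchmark):
```sql
CREATE TABLE lineorder_new (
LO_ORDERKEY UInt32,
LO_LINENUMBER UInt8,
LO_CUSTKEY UInt32,
LO_PARTKEY UInt32,
LO_SUPPKEY UInt32,
LO_ORDERDATE Date,
LO_ORDERPRIORITY String,
LO_SHIPPRIORITY UInt8,
LO_QUANTITY UInt8,
LO_EXTENDEDPRICE UInt32,
LO_ORDTOTALPRICE UInt32,
LO_DISCOUNT UInt8,
LO_REVENUE UInt32,
LO_SUPPLYCOST UInt32,
LO_TAX UInt8,
LO_COMMITDATE Date,
LO_SHIPMODE String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(LO_ORDERDATE)
ORDER BY (LO_ORDERKEY, LO_LINENUMBER, LO_ORDERDATE)
SETTINGS index_granularity = 8192;
```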
For testing on a single node, only the corresponding MergeTree tables need to be created.
For distributed testing, you need to configure the `perftest_3shards_1replicas` cluster in the configuration file,
then create both the MergeTree tables and the Distributed tables on every node.
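Queries against a Distributed table fan out to every shard automatically, so the benchmark queries can target either the local `lineorder` table (single node) or `lineorderd` (cluster). For example:
```sql
SELECT count() FROM lineorderd;
```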
Load the data (for a distributed setup, change 'customer' to 'customerd'):
```bash
cat customer.tbl | sed 's/$/2000-01-01/' | clickhouse-client --query "INSERT INTO customer FORMAT CSV"
cat lineorder.tbl | clickhouse-client --query "INSERT INTO lineorder FORMAT CSV"
```
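After loading, it's worth sanity-checking the row counts (illustrative queries):
```sql
SELECT count() FROM customer;
SELECT count() FROM lineorder;
```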
[Original article](https://clickhouse.yandex/docs/en/getting_started/example_datasets/star_schema/) <!--hide-->

View File

@ -1 +0,0 @@
../../../en/getting_started/example_datasets/wikistat.md

View File

@ -0,0 +1,29 @@
# WikiStat
See: <http://dumps.wikimedia.org/other/pagecounts-raw/>
Create the table:
``` sql
CREATE TABLE wikistat
(
date Date,
time DateTime,
project String,
subproject String,
path String,
hits UInt64,
size UInt64
) ENGINE = MergeTree(date, (path, time), 8192);
```
Load the data:
```bash
for i in {2007..2016}; do for j in {01..12}; do echo $i-$j >&2; curl -sSL "http://dumps.wikimedia.org/other/pagecounts-raw/$i/$i-$j/" | grep -oE 'pagecounts-[0-9]+-[0-9]+\.gz'; done; done | sort | uniq | tee links.txt
cat links.txt | while read link; do wget http://dumps.wikimedia.org/other/pagecounts-raw/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1/')/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1-\2/')/$link; done
ls -1 /opt/wikistat/ | grep gz | while read i; do echo $i; gzip -cd /opt/wikistat/$i | ./wikistat-loader --time="$(echo -n $i | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{2})([0-9]{2})([0-9]{2})\.gz/\1-\2-\3 \4-00-00/')" | clickhouse-client --query="INSERT INTO wikistat FORMAT TabSeparated"; done
```
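Once the data is loaded, a simple aggregation confirms the table is queryable (an illustrative query, not part of the original instructions):
```sql
SELECT project, sum(hits) AS total_hits
FROM wikistat
GROUP BY project
ORDER BY total_hits DESC
LIMIT 10;
```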
[Original article](https://clickhouse.yandex/docs/en/getting_started/example_datasets/wikistat/) <!--hide-->

View File

@ -1 +0,0 @@
../../en/getting_started/index.md

View File

@ -0,0 +1,141 @@
# Getting Started
## System Requirements
To install from the official repositories, you need Linux on the x86_64 architecture with support for the SSE 4.2 instruction set.
To check for SSE 4.2 support:
```bash
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
```
We recommend using Ubuntu or Debian. The terminal must use UTF-8 encoding.
For rpm-based systems, you can use third-party packages (https://packagecloud.io/altinity/clickhouse) or install the Debian packages directly.
ClickHouse also works on FreeBSD and Mac OS X. It can be compiled for x86_64 CPUs without SSE 4.2 support, as well as for AArch64 CPUs.
## Installation
For testing and development, the system can be installed on a single server or on an ordinary PC.
### Installing on Debian/Ubuntu
Add the repository to `/etc/apt/sources.list` (or create a `/etc/apt/sources.list.d/clickhouse.list` file):
```text
deb http://repo.yandex.ru/clickhouse/deb/stable/ main/
```
If you want to use the latest testing version, replace 'stable' with 'testing'.
Then run:
```bash
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4 # optional
sudo apt-get update
sudo apt-get install clickhouse-client clickhouse-server
```
You can also download and install the packages manually from here: <https://repo.yandex.ru/clickhouse/deb/stable/main/>
ClickHouse contains access control settings, located in the `users.xml` file (next to 'config.xml').
By default, access is allowed from anywhere for the `default` user, without a password. See user/default/networks.
For more information, see the "Configuration files" section.
### Installing from Source
To compile ClickHouse, follow the instructions in build.md.
You can compile the packages and install them,
or use the programs directly without installing.
```text
Client: dbms/programs/clickhouse-client
Server: dbms/programs/clickhouse-server
```
Create the following data directories on the server:
```text
/opt/clickhouse/data/default/
/opt/clickhouse/metadata/default/
```
(They are configurable in the server config.)
Run `chown` for the desired user.
The log path can be configured in the server config (src/dbms/programs/server/config.xml).
### Other Installation Methods
Docker image: <https://hub.docker.com/r/yandex/clickhouse-server/>
RPM packages for CentOS or RHEL: <https://github.com/Altinity/clickhouse-rpm-install>
Gentoo: `emerge clickhouse`
## Launch
To start the server as a daemon, run:
```bash
sudo service clickhouse-server start
```
See the logs in the `/var/log/clickhouse-server/` directory.
If the server doesn't start, check the configuration file `/etc/clickhouse-server/config.xml`.
You can also launch the server manually from the console:
```bash
clickhouse-server --config-file=/etc/clickhouse-server/config.xml
```
In this case, the log is printed to the console, which is convenient during development.
If the configuration file is in the current directory, you can omit the `--config-file` parameter; by default it uses `./config.xml`.
You can use the command-line client to connect to the server:
```bash
clickhouse-client
```
By default, it connects to localhost:9000 as the `default` user without a password.
The client can also be used to connect to a remote server, for example:
```bash
clickhouse-client --host=example.com
```
For more information, see the "Command-line client" section.
Check that the system is working:
```bash
milovidov@hostname:~/work/metrica/src/dbms/src/Client$ ./clickhouse-client
ClickHouse client version 0.0.18749.
Connecting to localhost:9000.
Connected to ClickHouse server version 0.0.18749.
:) SELECT 1
SELECT 1
┌─1─┐
│ 1 │
└───┘
1 rows in set. Elapsed: 0.003 sec.
:)
```
**Congratulations, the system works!**
To continue experimenting, you can try downloading one of the test datasets.
[Original article](https://clickhouse.yandex/docs/en/getting_started/) <!--hide-->

View File

@ -173,7 +173,7 @@ clickhouse-client --format_csv_delimiter="|" --query="INSERT INTO test.csv FORMA
When parsing, all values can be parsed either with or without quotes. Both double and single quotes are supported. Rows can also be arranged without quotes; in this case, they are parsed up to the delimiter character or line feed (CR or LF). In violation of the RFC, when parsing rows without quotes, the leading and trailing spaces and tabs are ignored. All line-feed types are supported: Unix (LF), Windows (CR LF) and Mac OS Classic (CR).
`NULL` is formatted as `\N`.
`NULL` is formatted as `\N`.
The CSV format outputs totals and extremes the same way as TabSeparated.
@ -511,7 +511,7 @@ test: string with 'quotes' and with some special
characters
```
Compare with the Vertical format:
Compare with the Vertical format:
```
:) SELECT 'string with \'quotes\' and \t with some special \n characters' AS test FORMAT Vertical;

Some files were not shown because too many files have changed in this diff.