#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include namespace DB { namespace ErrorCodes { extern const int MULTIPLE_EXPRESSIONS_FOR_ALIAS; extern const int UNKNOWN_IDENTIFIER; extern const int CYCLIC_ALIASES; extern const int INCORRECT_RESULT_OF_SCALAR_SUBQUERY; extern const int TOO_MANY_ROWS; extern const int NOT_FOUND_COLUMN_IN_BLOCK; extern const int INCORRECT_ELEMENT_OF_SET; extern const int ALIAS_REQUIRED; extern const int EMPTY_NESTED_TABLE; extern const int NOT_AN_AGGREGATE; extern const int UNEXPECTED_EXPRESSION; extern const int DUPLICATE_COLUMN; extern const int FUNCTION_CANNOT_HAVE_PARAMETERS; extern const int ILLEGAL_AGGREGATION; extern const int SUPPORT_IS_DISABLED; extern const int TOO_DEEP_AST; extern const int TOO_BIG_AST; extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH; } /** Calls to these functions in the GROUP BY statement would be * replaced by their immediate argument. 
  */
const std::unordered_set injective_function_names
{
    "negate",
    "bitNot",
    "reverse",
    "reverseUTF8",
    "toString",
    "toFixedString",
    "IPv4NumToString",
    "IPv4StringToNum",
    "hex",
    "unhex",
    "bitmaskToList",
    "bitmaskToArray",
    "tuple",
    "regionToName",
    "concatAssumeInjective",
};

const std::unordered_set possibly_injective_function_names
{
    "dictGetString",
    "dictGetUInt8",
    "dictGetUInt16",
    "dictGetUInt32",
    "dictGetUInt64",
    "dictGetInt8",
    "dictGetInt16",
    "dictGetInt32",
    "dictGetInt64",
    "dictGetFloat32",
    "dictGetFloat64",
    "dictGetDate",
    "dictGetDateTime"
};

namespace
{

bool functionIsInOperator(const String & name)
{
    return name == "in" || name == "notIn";
}

bool functionIsInOrGlobalInOperator(const String & name)
{
    return name == "in" || name == "notIn" || name == "globalIn" || name == "globalNotIn";
}

void removeDuplicateColumns(NamesAndTypesList & columns)
{
    std::set names;
    for (auto it = columns.begin(); it != columns.end();)
    {
        if (names.emplace(it->name).second)
            ++it;
        else
            columns.erase(it++);
    }
}

}

ExpressionAnalyzer::ExpressionAnalyzer(
    const ASTPtr & ast_,
    const Context & context_,
    const StoragePtr & storage_,
    const NamesAndTypesList & source_columns_,
    const Names & required_result_columns_,
    size_t subquery_depth_,
    bool do_global_,
    const SubqueriesForSets & subqueries_for_set_)
    : ast(ast_), context(context_), settings(context.getSettings()),
    subquery_depth(subquery_depth_),
    source_columns(source_columns_), required_result_columns(required_result_columns_.begin(), required_result_columns_.end()),
    storage(storage_),
    do_global(do_global_), subqueries_for_sets(subqueries_for_set_)
{
    select_query = typeid_cast(ast.get());

    if (!storage && select_query)
    {
        auto select_database = select_query->database();
        auto select_table = select_query->table();

        if (select_table
            && !typeid_cast(select_table.get())
            && !typeid_cast(select_table.get()))
        {
            String database = select_database ?
                typeid_cast(*select_database).name : "";
            const String & table = typeid_cast(*select_table).name;
            storage = context.tryGetTable(database, table);
        }
    }

    if (storage && source_columns.empty())
        source_columns = storage->getColumns().getAllPhysical();
    else
        removeDuplicateColumns(source_columns);

    addAliasColumns();

    translateQualifiedNames();

    /// Depending on the user's profile, check for the rights to execute
    /// distributed subqueries inside the IN or JOIN sections and process these subqueries.
    InJoinSubqueriesPreprocessor(context).process(select_query);

    /// Optimizes logical expressions.
    LogicalExpressionsOptimizer(select_query, settings).perform();

    /// Creates a dictionary `aliases`: alias -> ASTPtr
    addASTAliases(ast);

    /// Common subexpression elimination. Rewrite rules.
    normalizeTree();

    /// Remove unneeded columns according to 'required_source_columns'.
    /// Leave all selected columns in case of DISTINCT; columns that contain arrayJoin function inside.
    /// Must be after 'normalizeTree' (after expanding aliases, so that aliases do not get lost)
    ///  and before 'executeScalarSubqueries', 'analyzeAggregation', etc. to avoid excessive calculations.
    removeUnneededColumnsFromSelectClause();

    /// Executing scalar subqueries - replacing them with constant values.
    executeScalarSubqueries();

    /// Optimize `if` with a constant condition after constants were substituted for scalar subqueries.
    optimizeIfWithConstantCondition();

    /// GROUP BY injective function elimination.
    optimizeGroupBy();

    /// Remove duplicate items from ORDER BY.
    optimizeOrderBy();

    // Remove duplicated elements from LIMIT BY clause.
    optimizeLimitBy();

    /// array_join_alias_to_name, array_join_result_to_source.
    getArrayJoinedColumns();

    /// Delete the unnecessary from `source_columns` list. Create `unknown_required_source_columns`. Form `columns_added_by_join`.
    collectUsedColumns();

    /// external_tables, subqueries_for_sets for global subqueries.
    /// Replaces global subqueries with the generated names of temporary tables that will be sent to remote servers.
    initGlobalSubqueriesAndExternalTables();

    /// has_aggregation, aggregation_keys, aggregate_descriptions, aggregated_columns.
    /// This analysis should be performed after processing global subqueries, because otherwise,
    /// if the aggregate function contains a global subquery, then the `analyzeAggregation` method will save
    /// in `aggregate_descriptions` the information about the parameters of this aggregate function, among which
    /// is the global subquery. Then, when the `initGlobalSubqueriesAndExternalTables` method is called,
    /// the global subquery will be replaced with a temporary table, so `aggregate_descriptions`
    /// will contain out-of-date information, which will lead to an error when the query is executed.
    analyzeAggregation();
}

void ExpressionAnalyzer::translateQualifiedNames()
{
    String database_name;
    String table_name;
    String alias;

    if (!select_query || !select_query->tables || select_query->tables->children.empty())
        return;

    ASTTablesInSelectQueryElement & element = static_cast(*select_query->tables->children[0]);

    if (!element.table_expression)        /// This is ARRAY JOIN without a table at the left side.
        return;

    ASTTableExpression & table_expression = static_cast(*element.table_expression);

    if (table_expression.database_and_table_name)
    {
        const ASTIdentifier & identifier = static_cast(*table_expression.database_and_table_name);

        alias = identifier.tryGetAlias();

        if (table_expression.database_and_table_name->children.empty())
        {
            database_name = context.getCurrentDatabase();
            table_name = identifier.name;
        }
        else
        {
            if (table_expression.database_and_table_name->children.size() != 2)
                throw Exception("Logical error: number of components in table expression not equal to two", ErrorCodes::LOGICAL_ERROR);

            database_name = static_cast(*identifier.children[0]).name;
            table_name = static_cast(*identifier.children[1]).name;
        }
    }
    else if (table_expression.table_function)
    {
        alias = table_expression.table_function->tryGetAlias();
    }
    else if (table_expression.subquery)
    {
        alias = table_expression.subquery->tryGetAlias();
    }
    else
        throw Exception("Logical error: no known elements in ASTTableExpression", ErrorCodes::LOGICAL_ERROR);

    translateQualifiedNamesImpl(ast, database_name, table_name, alias);
}

void ExpressionAnalyzer::translateQualifiedNamesImpl(ASTPtr & ast, const String & database_name, const String & table_name, const String & alias)
{
    if (ASTIdentifier * ident = typeid_cast(ast.get()))
    {
        if (ident->kind == ASTIdentifier::Column)
        {
            /// It is compound identifier
            if (!ast->children.empty())
            {
                size_t num_components = ast->children.size();
                size_t num_qualifiers_to_strip = 0;

                /// database.table.column
                if (num_components >= 3
                    && !database_name.empty()
                    && static_cast(*ast->children[0]).name == database_name
                    && static_cast(*ast->children[1]).name == table_name)
                {
                    num_qualifiers_to_strip = 2;
                }

                /// table.column or alias.column. If num_components > 2, it is like table.nested.column.
                if (num_components >= 2
                    && ((!table_name.empty() && static_cast(*ast->children[0]).name == table_name)
                        || (!alias.empty() && static_cast(*ast->children[0]).name == alias)))
                {
                    num_qualifiers_to_strip = 1;
                }

                if (num_qualifiers_to_strip)
                {
                    /// plain column
                    if (num_components - num_qualifiers_to_strip == 1)
                    {
                        String node_alias = ast->tryGetAlias();
                        ast = ast->children.back();
                        if (!node_alias.empty())
                            ast->setAlias(node_alias);
                    }
                    else    /// nested column
                    {
                        ident->children.erase(ident->children.begin(), ident->children.begin() + num_qualifiers_to_strip);
                        String new_name;
                        for (const auto & child : ident->children)
                        {
                            if (!new_name.empty())
                                new_name += '.';
                            new_name += static_cast(*child.get()).name;
                        }
                        ident->name = new_name;
                    }
                }
            }
        }
    }
    else if (typeid_cast(ast.get()))
    {
        if (ast->children.size() != 1)
            throw Exception("Logical error: qualified asterisk must have exactly one child", ErrorCodes::LOGICAL_ERROR);

        ASTIdentifier * ident = typeid_cast(ast->children[0].get());
        if (!ident)
            throw Exception("Logical error: qualified asterisk must have identifier as its child", ErrorCodes::LOGICAL_ERROR);

        size_t num_components = ident->children.size();
        if (num_components > 2)
            throw Exception("Qualified asterisk cannot have more than two qualifiers", ErrorCodes::UNKNOWN_ELEMENT_IN_AST);

        /// database.table.*, table.* or alias.*
        if (   (num_components == 2
                && !database_name.empty()
                && static_cast(*ident->children[0]).name == database_name
                && static_cast(*ident->children[1]).name == table_name)
            || (num_components == 0
                && ((!table_name.empty() && ident->name == table_name)
                    || (!alias.empty() && ident->name == alias))))
        {
            /// Replace to plain asterisk.
            ast = std::make_shared();
        }
    }
    else
    {
        for (auto & child : ast->children)
        {
            /// Do not go to FROM, JOIN, subqueries.
            if (!typeid_cast(child.get())
                && !typeid_cast(child.get()))
            {
                translateQualifiedNamesImpl(child, database_name, table_name, alias);
            }
        }
    }
}

void ExpressionAnalyzer::optimizeIfWithConstantCondition()
{
    optimizeIfWithConstantConditionImpl(ast, aliases);
}

bool ExpressionAnalyzer::tryExtractConstValueFromCondition(const ASTPtr & condition, bool & value) const
{
    /// numeric constant in condition
    if (const ASTLiteral * literal = typeid_cast(condition.get()))
    {
        if (literal->value.getType() == Field::Types::Int64 ||
            literal->value.getType() == Field::Types::UInt64)
        {
            value = literal->value.get();
            return true;
        }
    }

    /// cast of numeric constant in condition to UInt8
    if (const ASTFunction * function = typeid_cast(condition.get()))
    {
        if (function->name == "CAST")
        {
            if (ASTExpressionList * expr_list = typeid_cast(function->arguments.get()))
            {
                const ASTPtr & type_ast = expr_list->children.at(1);
                if (const ASTLiteral * type_literal = typeid_cast(type_ast.get()))
                {
                    if (type_literal->value.getType() == Field::Types::String &&
                        type_literal->value.get() == "UInt8")
                        return tryExtractConstValueFromCondition(expr_list->children.at(0), value);
                }
            }
        }
    }

    return false;
}

void ExpressionAnalyzer::optimizeIfWithConstantConditionImpl(ASTPtr & current_ast, ExpressionAnalyzer::Aliases & aliases) const
{
    if (!current_ast)
        return;

    for (ASTPtr & child : current_ast->children)
    {
        ASTFunction * function_node = typeid_cast(child.get());
        if (!function_node || function_node->name != "if")
        {
            optimizeIfWithConstantConditionImpl(child, aliases);
            continue;
        }

        optimizeIfWithConstantConditionImpl(function_node->arguments, aliases);
        ASTExpressionList * args = typeid_cast(function_node->arguments.get());

        ASTPtr condition_expr = args->children.at(0);
        ASTPtr then_expr = args->children.at(1);
        ASTPtr else_expr = args->children.at(2);

        bool condition;
        if (tryExtractConstValueFromCondition(condition_expr, condition))
        {
            ASTPtr replace_ast = condition ?
                then_expr : else_expr;
            ASTPtr child_copy = child;
            String replace_alias = replace_ast->tryGetAlias();
            String if_alias = child->tryGetAlias();

            if (replace_alias.empty())
            {
                replace_ast->setAlias(if_alias);
                child = replace_ast;
            }
            else
            {
                /// Only copy of one node is required here.
                /// But IAST has only method for deep copy of subtree.
                /// This can be a reason of performance degradation in case of deep queries.
                ASTPtr replace_ast_deep_copy = replace_ast->clone();
                replace_ast_deep_copy->setAlias(if_alias);
                child = replace_ast_deep_copy;
            }

            if (!if_alias.empty())
            {
                auto alias_it = aliases.find(if_alias);
                if (alias_it != aliases.end() && alias_it->second.get() == child_copy.get())
                    alias_it->second = child;
            }
        }
    }
}

void ExpressionAnalyzer::analyzeAggregation()
{
    /** Find aggregation keys (aggregation_keys), information about aggregate functions (aggregate_descriptions),
      *  as well as a set of columns obtained after the aggregation, if any,
      *  or after all the actions that are usually performed before aggregation (aggregated_columns).
      *
      * Everything below (compiling temporary ExpressionActions) - only for the purpose of query analysis (type output).
      */

    if (select_query && (select_query->group_expression_list || select_query->having_expression))
        has_aggregation = true;

    ExpressionActionsPtr temp_actions = std::make_shared(source_columns, settings);

    if (select_query && select_query->array_join_expression_list())
    {
        getRootActions(select_query->array_join_expression_list(), true, false, temp_actions);
        addMultipleArrayJoinAction(temp_actions);
    }

    if (select_query)
    {
        const ASTTablesInSelectQueryElement * join = select_query->join();
        if (join)
        {
            if (static_cast(*join->table_join).using_expression_list)
                getRootActions(static_cast(*join->table_join).using_expression_list, true, false, temp_actions);

            addJoinAction(temp_actions, true);
        }
    }

    getAggregates(ast, temp_actions);

    if (has_aggregation)
    {
        assertSelect();

        /// Find out aggregation keys.
        if (select_query->group_expression_list)
        {
            NameSet unique_keys;
            ASTs & group_asts = select_query->group_expression_list->children;
            for (ssize_t i = 0; i < ssize_t(group_asts.size()); ++i)
            {
                ssize_t size = group_asts.size();
                getRootActions(group_asts[i], true, false, temp_actions);

                const auto & column_name = group_asts[i]->getColumnName();
                const auto & block = temp_actions->getSampleBlock();

                if (!block.has(column_name))
                    throw Exception("Unknown identifier (in GROUP BY): " + column_name, ErrorCodes::UNKNOWN_IDENTIFIER);

                const auto & col = block.getByName(column_name);

                /// Constant expressions have non-null column pointer at this stage.
                if (const auto is_constexpr = col.column)
                {
                    /// But don't remove last key column if no aggregate functions, otherwise aggregation will not work.
                    if (!aggregate_descriptions.empty() || size > 1)
                    {
                        if (i + 1 < static_cast(size))
                            group_asts[i] = std::move(group_asts.back());

                        group_asts.pop_back();

                        --i;
                        continue;
                    }
                }

                NameAndTypePair key{column_name, col.type};

                /// Aggregation keys are uniqued.
                if (!unique_keys.count(key.name))
                {
                    unique_keys.insert(key.name);
                    aggregation_keys.push_back(key);

                    /// Key is no longer needed, therefore we can save a little by moving it.
                    aggregated_columns.push_back(std::move(key));
                }
            }

            if (group_asts.empty())
            {
                select_query->group_expression_list = nullptr;
                has_aggregation = select_query->having_expression || aggregate_descriptions.size();
            }
        }

        for (size_t i = 0; i < aggregate_descriptions.size(); ++i)
        {
            AggregateDescription & desc = aggregate_descriptions[i];
            aggregated_columns.emplace_back(desc.column_name, desc.function->getReturnType());
        }
    }
    else
    {
        aggregated_columns = temp_actions->getSampleBlock().getNamesAndTypesList();
    }
}

void ExpressionAnalyzer::initGlobalSubqueriesAndExternalTables()
{
    /// Adds existing external tables (not subqueries) to the external_tables dictionary.
    findExternalTables(ast);

    /// Converts GLOBAL subqueries to external tables; Puts them into the external_tables dictionary: name -> StoragePtr.
    initGlobalSubqueries(ast);
}

void ExpressionAnalyzer::initGlobalSubqueries(ASTPtr & ast)
{
    /// Recursive calls. We do not go into subqueries.
    for (auto & child : ast->children)
        if (!typeid_cast(child.get()))
            initGlobalSubqueries(child);

    /// Bottom-up actions.
    if (ASTFunction * node = typeid_cast(ast.get()))
    {
        /// For GLOBAL IN.
        if (do_global && (node->name == "globalIn" || node->name == "globalNotIn"))
            addExternalStorage(node->arguments->children.at(1));
    }
    else if (ASTTablesInSelectQueryElement * node = typeid_cast(ast.get()))
    {
        /// For GLOBAL JOIN.
        if (do_global && node->table_join
            && static_cast(*node->table_join).locality == ASTTableJoin::Locality::Global)
            addExternalStorage(node->table_expression);
    }
}

void ExpressionAnalyzer::findExternalTables(ASTPtr & ast)
{
    /// Traverse from the bottom. Intentionally go into subqueries.
    for (auto & child : ast->children)
        findExternalTables(child);

    /// If table type identifier
    StoragePtr external_storage;

    if (ASTIdentifier * node = typeid_cast(ast.get()))
        if (node->kind == ASTIdentifier::Table)
            if ((external_storage = context.tryGetExternalTable(node->name)))
                external_tables[node->name] = external_storage;
}

static std::pair getDatabaseAndTableNameFromIdentifier(const ASTIdentifier & identifier)
{
    std::pair res;
    res.second = identifier.name;
    if (!identifier.children.empty())
    {
        if (identifier.children.size() != 2)
            throw Exception("Qualified table name could have only two components", ErrorCodes::LOGICAL_ERROR);

        res.first = typeid_cast(*identifier.children[0]).name;
        res.second = typeid_cast(*identifier.children[1]).name;
    }
    return res;
}

static std::shared_ptr interpretSubquery(
    const ASTPtr & subquery_or_table_name, const Context & context, size_t subquery_depth, const Names & required_source_columns)
{
    /// Subquery or table name.
    /// The name of the table is similar to the subquery `SELECT * FROM t`.
    const ASTSubquery * subquery = typeid_cast(subquery_or_table_name.get());
    const ASTIdentifier * table = typeid_cast(subquery_or_table_name.get());

    if (!subquery && !table)
        throw Exception("IN/JOIN supports only SELECT subqueries.", ErrorCodes::BAD_ARGUMENTS);

    /** The subquery in the IN / JOIN section does not have any restrictions on the maximum size of the result.
      * Because the result of this query is not the result of the entire query.
      * Constraints work instead
      *  max_rows_in_set, max_bytes_in_set, set_overflow_mode,
      *  max_rows_in_join, max_bytes_in_join, join_overflow_mode,
      *  which are checked separately (in the Set, Join objects).
      */
    Context subquery_context = context;
    Settings subquery_settings = context.getSettings();
    subquery_settings.max_result_rows = 0;
    subquery_settings.max_result_bytes = 0;
    /// The calculation of `extremes` does not make sense and is not necessary (if you do it, then the `extremes` of the subquery can be taken instead of the whole query).
    subquery_settings.extremes = 0;
    subquery_context.setSettings(subquery_settings);

    ASTPtr query;
    if (table)
    {
        /// create ASTSelectQuery for "SELECT * FROM table" as if written by hand
        const auto select_with_union_query = std::make_shared();
        query = select_with_union_query;

        select_with_union_query->list_of_selects = std::make_shared();

        const auto select_query = std::make_shared();
        select_with_union_query->list_of_selects->children.push_back(select_query);

        const auto select_expression_list = std::make_shared();
        select_query->select_expression_list = select_expression_list;
        select_query->children.emplace_back(select_query->select_expression_list);

        /// get columns list for target table
        auto database_table = getDatabaseAndTableNameFromIdentifier(*table);
        const auto & storage = context.getTable(database_table.first, database_table.second);
        const auto & columns = storage->getColumns().ordinary;
        select_expression_list->children.reserve(columns.size());

        /// manually substitute column names in place of asterisk
        for (const auto & column : columns)
            select_expression_list->children.emplace_back(std::make_shared(column.name));

        select_query->replaceDatabaseAndTable(database_table.first, database_table.second);
    }
    else
    {
        query = subquery->children.at(0);

        /** Columns with the same name can be specified in a subquery. For example, SELECT x, x FROM t
          * This is bad, because the result of such a query can not be saved to the table, because the table can not have the same name columns.
          * Saving to the table is required for GLOBAL subqueries.
          *
          * To avoid this situation, we will rename the same columns.
          */

        std::set all_column_names;
        std::set assigned_column_names;

        if (ASTSelectWithUnionQuery * select_with_union = typeid_cast(query.get()))
        {
            if (ASTSelectQuery * select = typeid_cast(select_with_union->list_of_selects->children.at(0).get()))
            {
                for (auto & expr : select->select_expression_list->children)
                    all_column_names.insert(expr->getAliasOrColumnName());

                for (auto & expr : select->select_expression_list->children)
                {
                    auto name = expr->getAliasOrColumnName();

                    if (!assigned_column_names.insert(name).second)
                    {
                        size_t i = 1;
                        while (all_column_names.end() != all_column_names.find(name + "_" + toString(i)))
                            ++i;

                        name = name + "_" + toString(i);
                        expr = expr->clone();   /// Cancels fuse of the same expressions in the tree.
                        expr->setAlias(name);

                        all_column_names.insert(name);
                        assigned_column_names.insert(name);
                    }
                }
            }
        }
    }

    return std::make_shared(
        query, subquery_context, required_source_columns, QueryProcessingStage::Complete, subquery_depth + 1);
}

void ExpressionAnalyzer::addExternalStorage(ASTPtr & subquery_or_table_name_or_table_expression)
{
    /// With nondistributed queries, creating temporary tables does not make sense.
    if (!(storage && storage->isRemote()))
        return;

    ASTPtr subquery;
    ASTPtr table_name;
    ASTPtr subquery_or_table_name;

    if (typeid_cast(subquery_or_table_name_or_table_expression.get()))
    {
        table_name = subquery_or_table_name_or_table_expression;
        subquery_or_table_name = table_name;
    }
    else if (auto ast_table_expr = typeid_cast(subquery_or_table_name_or_table_expression.get()))
    {
        if (ast_table_expr->database_and_table_name)
        {
            table_name = ast_table_expr->database_and_table_name;
            subquery_or_table_name = table_name;
        }
        else if (ast_table_expr->subquery)
        {
            subquery = ast_table_expr->subquery;
            subquery_or_table_name = subquery;
        }
    }
    else if (typeid_cast(subquery_or_table_name_or_table_expression.get()))
    {
        subquery = subquery_or_table_name_or_table_expression;
        subquery_or_table_name = subquery;
    }

    if (!subquery_or_table_name)
        throw Exception("Logical error: unknown AST element passed to ExpressionAnalyzer::addExternalStorage method", ErrorCodes::LOGICAL_ERROR);

    if (table_name)
    {
        /// If this is already an external table, you do not need to add anything. Just remember its presence.
        if (external_tables.end() != external_tables.find(static_cast(*table_name).name))
            return;
    }

    /// Generate the name for the external table.
    String external_table_name = "_data" + toString(external_table_id);
    while (external_tables.count(external_table_name))
    {
        ++external_table_id;
        external_table_name = "_data" + toString(external_table_id);
    }

    auto interpreter = interpretSubquery(subquery_or_table_name, context, subquery_depth, {});

    Block sample = interpreter->getSampleBlock();
    NamesAndTypesList columns = sample.getNamesAndTypesList();

    StoragePtr external_storage = StorageMemory::create(external_table_name, ColumnsDescription{columns});
    external_storage->startup();

    /** We replace the subquery with the name of the temporary table.
      * It is in this form, the request will go to the remote server.
      * This temporary table will go to the remote server, and on its side,
      *  instead of doing a subquery, you just need to read it.
      */

    auto database_and_table_name = std::make_shared(external_table_name, ASTIdentifier::Table);

    if (auto ast_table_expr = typeid_cast(subquery_or_table_name_or_table_expression.get()))
    {
        ast_table_expr->subquery.reset();
        ast_table_expr->database_and_table_name = database_and_table_name;

        ast_table_expr->children.clear();
        ast_table_expr->children.emplace_back(database_and_table_name);
    }
    else
        subquery_or_table_name_or_table_expression = database_and_table_name;

    external_tables[external_table_name] = external_storage;
    subqueries_for_sets[external_table_name].source = interpreter->execute().in;
    subqueries_for_sets[external_table_name].table = external_storage;

    /** NOTE If it was written IN tmp_table - the existing temporary (but not external) table,
      *  then a new temporary table will be created (for example, _data1),
      *  and the data will then be copied to it.
      * Maybe this can be avoided.
      */
}

static NamesAndTypesList::iterator findColumn(const String & name, NamesAndTypesList & cols)
{
    return std::find_if(cols.begin(), cols.end(),
        [&](const NamesAndTypesList::value_type & val) { return val.name == name; });
}

/// ignore_levels - aliases in how many upper levels of the subtree should be ignored.
/// For example, with ignore_levels=1 ast can not be put in the dictionary, but its children can.
void ExpressionAnalyzer::addASTAliases(ASTPtr & ast, int ignore_levels)
{
    /// Bottom-up traversal. We do not go into subqueries.
    for (auto & child : ast->children)
    {
        int new_ignore_levels = std::max(0, ignore_levels - 1);

        /// The top-level aliases in the ARRAY JOIN section have a special meaning, we will not add them
        ///  (skip the expression list itself and its children).
        if (typeid_cast(ast.get()))
            new_ignore_levels = 3;

        /// Don't descend into table functions and subqueries.
        if (!typeid_cast(child.get())
            && !typeid_cast(child.get()))
            addASTAliases(child, new_ignore_levels);
    }

    if (ignore_levels > 0)
        return;

    String alias = ast->tryGetAlias();
    if (!alias.empty())
    {
        if (aliases.count(alias) && ast->getTreeHash() != aliases[alias]->getTreeHash())
        {
            std::stringstream message;
            message << "Different expressions with the same alias " << backQuoteIfNeed(alias) << ":\n";
            formatAST(*ast, message, false, true);
            message << "\nand\n";
            formatAST(*aliases[alias], message, false, true);
            message << "\n";

            throw Exception(message.str(), ErrorCodes::MULTIPLE_EXPRESSIONS_FOR_ALIAS);
        }

        aliases[alias] = ast;
    }
    else if (auto subquery = typeid_cast(ast.get()))
    {
        /// Set unique aliases for all subqueries. This is needed, because content of subqueries could change after recursive analysis,
        ///  and auto-generated column names could become incorrect.

        size_t subquery_index = 1;
        while (true)
        {
            alias = "_subquery" + toString(subquery_index);
            if (!aliases.count("_subquery" + toString(subquery_index)))
                break;
            ++subquery_index;
        }

        subquery->setAlias(alias);
        subquery->prefer_alias_to_column_name = true;
        aliases[alias] = ast;
    }
}

void ExpressionAnalyzer::normalizeTree()
{
    SetOfASTs tmp_set;
    MapOfASTs tmp_map;
    normalizeTreeImpl(ast, tmp_map, tmp_set, "", 0);

    try
    {
        ast->checkSize(settings.max_expanded_ast_elements);
    }
    catch (Exception & e)
    {
        e.addMessage("(after expansion of aliases)");
        throw;
    }
}

/// finished_asts - already processed vertices (and by what they replaced)
/// current_asts - vertices in the current call stack of this method
/// current_alias - the alias referencing to the ancestor of ast (the deepest ancestor with aliases)
void ExpressionAnalyzer::normalizeTreeImpl(
    ASTPtr & ast, MapOfASTs & finished_asts, SetOfASTs & current_asts, std::string current_alias, size_t level)
{
    if (level > settings.max_ast_depth)
        throw Exception("Normalized AST is too deep. Maximum: " + settings.max_ast_depth.toString(), ErrorCodes::TOO_DEEP_AST);

    if (finished_asts.count(ast))
    {
        ast = finished_asts[ast];
        return;
    }

    ASTPtr initial_ast = ast;
    current_asts.insert(initial_ast.get());

    String my_alias = ast->tryGetAlias();
    if (!my_alias.empty())
        current_alias = my_alias;

    /// rewrite rules that act when you go from top to bottom.
    bool replaced = false;

    ASTIdentifier * identifier_node = nullptr;
    ASTFunction * func_node = nullptr;

    if ((func_node = typeid_cast(ast.get())))
    {
        /// `IN t` can be specified, where t is a table, which is equivalent to `IN (SELECT * FROM t)`.
        if (functionIsInOrGlobalInOperator(func_node->name))
            if (ASTIdentifier * right = typeid_cast(func_node->arguments->children.at(1).get()))
                if (!aliases.count(right->name))
                    right->kind = ASTIdentifier::Table;

        /// Special cases for count function.
        String func_name_lowercase = Poco::toLower(func_node->name);
        if (startsWith(func_name_lowercase, "count"))
        {
            /// Select implementation of countDistinct based on settings.
            /// Important that it is done as query rewrite. It means rewritten query
            ///  will be sent to remote servers during distributed query execution,
            ///  and on all remote servers, function implementation will be same.
            if (endsWith(func_node->name, "Distinct") && func_name_lowercase == "countdistinct")
                func_node->name = settings.count_distinct_implementation;

            /// As special case, treat count(*) as count(), not as count(list of all columns).
            if (func_name_lowercase == "count" && func_node->arguments->children.size() == 1
                && typeid_cast(func_node->arguments->children[0].get()))
            {
                func_node->arguments->children.clear();
            }
        }
    }
    else if ((identifier_node = typeid_cast(ast.get())))
    {
        if (identifier_node->kind == ASTIdentifier::Column)
        {
            /// If it is an alias, but not a parent alias (for constructs like "SELECT column + 1 AS column").
            auto it_alias = aliases.find(identifier_node->name);
            if (it_alias != aliases.end() && current_alias != identifier_node->name)
            {
                /// Let's replace it with the corresponding tree node.
                if (current_asts.count(it_alias->second.get()))
                    throw Exception("Cyclic aliases", ErrorCodes::CYCLIC_ALIASES);

                if (!my_alias.empty() && my_alias != it_alias->second->getAliasOrColumnName())
                {
                    /// Avoid infinite recursion here
                    auto replace_to_identifier = typeid_cast(it_alias->second.get());
                    bool is_cycle = replace_to_identifier &&
                        replace_to_identifier->kind == ASTIdentifier::Column &&
                        replace_to_identifier->name == identifier_node->name;

                    if (!is_cycle)
                    {
                        /// In a construct like "a AS b", where a is an alias, you must set alias b to the result of substituting alias a.
                        ast = it_alias->second->clone();
                        ast->setAlias(my_alias);
                        replaced = true;
                    }
                }
                else
                {
                    ast = it_alias->second;
                    replaced = true;
                }
            }
        }
    }
    else if (ASTExpressionList * node = typeid_cast(ast.get()))
    {
        /// Replace * with a list of columns.
        ASTs & asts = node->children;
        for (int i = static_cast(asts.size()) - 1; i >= 0; --i)
        {
            if (typeid_cast(asts[i].get()))
            {
                ASTs all_columns;

                if (storage)
                {
                    /// If we select from a table, get only not MATERIALIZED, not ALIAS columns.
                    for (const auto & name_type : storage->getColumns().ordinary)
                        all_columns.emplace_back(std::make_shared(name_type.name));
                }
                else
                {
                    for (const auto & name_type : source_columns)
                        all_columns.emplace_back(std::make_shared(name_type.name));
                }

                asts.erase(asts.begin() + i);
                asts.insert(asts.begin() + i, all_columns.begin(), all_columns.end());
            }
        }
    }
    else if (ASTTablesInSelectQueryElement * node = typeid_cast(ast.get()))
    {
        if (node->table_expression)
        {
            auto & database_and_table_name = static_cast(*node->table_expression).database_and_table_name;
            if (database_and_table_name)
            {
                if (ASTIdentifier * right = typeid_cast(database_and_table_name.get()))
                {
                    right->kind = ASTIdentifier::Table;
                }
            }
        }
    }

    /// If we replace the root of the subtree, we will be called again for the new root, in case the alias is replaced by an alias.
    if (replaced)
    {
        normalizeTreeImpl(ast, finished_asts, current_asts, current_alias, level + 1);
        current_asts.erase(initial_ast.get());
        current_asts.erase(ast.get());
        finished_asts[initial_ast] = ast;
        return;
    }

    /// Recursive calls. Don't go into subqueries. Don't go into components of compound identifiers.
    /// We also do not go to the left argument of lambda expressions, so as not to replace the formal parameters
    ///  with aliases in expressions of the form 123 AS x, arrayMap(x -> 1, [2]).

    if (func_node && func_node->name == "lambda")
    {
        /// We skip the first argument. We also assume that the lambda function can not have parameters.
        for (size_t i = 1, size = func_node->arguments->children.size(); i < size; ++i)
        {
            auto & child = func_node->arguments->children[i];

            if (typeid_cast(child.get()) || typeid_cast(child.get()))
                continue;

            normalizeTreeImpl(child, finished_asts, current_asts, current_alias, level + 1);
        }
    }
    else if (identifier_node)
    {
    }
    else
    {
        for (auto & child : ast->children)
        {
            if (typeid_cast(child.get()) || typeid_cast(child.get()))
                continue;

            normalizeTreeImpl(child, finished_asts, current_asts, current_alias, level + 1);
        }
    }

    /// If the WHERE clause or HAVING consists of a single alias, the reference must be replaced not only in children, but also in where_expression and having_expression.
    if (ASTSelectQuery * select = typeid_cast(ast.get()))
    {
        if (select->prewhere_expression)
            normalizeTreeImpl(select->prewhere_expression, finished_asts, current_asts, current_alias, level + 1);
        if (select->where_expression)
            normalizeTreeImpl(select->where_expression, finished_asts, current_asts, current_alias, level + 1);
        if (select->having_expression)
            normalizeTreeImpl(select->having_expression, finished_asts, current_asts, current_alias, level + 1);
    }

    current_asts.erase(initial_ast.get());
    current_asts.erase(ast.get());
    finished_asts[initial_ast] = ast;
}

void ExpressionAnalyzer::addAliasColumns()
{
    if (!select_query)
        return;

    if (!storage)
        return;

    const auto & aliases = storage->getColumns().aliases;
    source_columns.insert(std::end(source_columns), std::begin(aliases), std::end(aliases));
}

void ExpressionAnalyzer::executeScalarSubqueries()
{
    if (!select_query)
        executeScalarSubqueriesImpl(ast);
    else
    {
        for (auto & child : ast->children)
        {
            /// Do not go to FROM, JOIN, UNION.
            if (!typeid_cast(child.get()) && !typeid_cast(child.get()))
            {
                executeScalarSubqueriesImpl(child);
            }
        }
    }
}


static ASTPtr addTypeConversion(std::unique_ptr && ast, const String & type_name)
{
    auto func = std::make_shared();
    ASTPtr res = func;
    func->alias = ast->alias;
    func->prefer_alias_to_column_name = ast->prefer_alias_to_column_name;
    ast->alias.clear();
    func->name = "CAST";
    auto exp_list = std::make_shared();
    func->arguments = exp_list;
    func->children.push_back(func->arguments);
    exp_list->children.emplace_back(ast.release());
    exp_list->children.emplace_back(std::make_shared(type_name));
    return res;
}


void ExpressionAnalyzer::executeScalarSubqueriesImpl(ASTPtr & ast)
{
    /** Replace subqueries that return exactly one row
      * ("scalar" subqueries) with the corresponding constants.
      *
      * If the subquery returns more than one column, it is replaced by a tuple of constants.
      *
      * Features
      *
      * The replacement occurs during query analysis, not during the main execution.
      * This means that the progress indicator will not work while these subqueries are executed,
      * and such queries cannot be aborted.
      *
      * But the query result can be used for the index in the table.
      *
      * Scalar subqueries are executed on the initiator server.
      * The query is sent to remote servers with the constants already substituted.
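      *
      * Illustration of the rewrite (a sketch only; the table `t` and column `x` are hypothetical):
      *   SELECT (SELECT max(x) FROM t) AS m      becomes   SELECT CAST(<value> AS <type of max(x)>) AS m
      *   SELECT (SELECT min(x), max(x) FROM t)   becomes   SELECT tuple(CAST(<v1> AS <type1>), CAST(<v2> AS <type2>))
      * A subquery that returns no rows is replaced by a NULL literal that keeps the subquery's alias.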
*/ if (ASTSubquery * subquery = typeid_cast(ast.get())) { Context subquery_context = context; Settings subquery_settings = context.getSettings(); subquery_settings.max_result_rows = 1; subquery_settings.extremes = 0; subquery_context.setSettings(subquery_settings); ASTPtr query = subquery->children.at(0); BlockIO res = InterpreterSelectWithUnionQuery(query, subquery_context, {}, QueryProcessingStage::Complete, subquery_depth + 1).execute(); Block block; try { block = res.in->read(); if (!block) { /// Interpret subquery with empty result as Null literal auto ast_new = std::make_unique(Null()); ast_new->setAlias(ast->tryGetAlias()); ast = std::move(ast_new); return; } if (block.rows() != 1 || res.in->read()) throw Exception("Scalar subquery returned more than one row", ErrorCodes::INCORRECT_RESULT_OF_SCALAR_SUBQUERY); } catch (const Exception & e) { if (e.code() == ErrorCodes::TOO_MANY_ROWS) throw Exception("Scalar subquery returned more than one row", ErrorCodes::INCORRECT_RESULT_OF_SCALAR_SUBQUERY); else throw; } size_t columns = block.columns(); if (columns == 1) { auto lit = std::make_unique((*block.safeGetByPosition(0).column)[0]); lit->alias = subquery->alias; lit->prefer_alias_to_column_name = subquery->prefer_alias_to_column_name; ast = addTypeConversion(std::move(lit), block.safeGetByPosition(0).type->getName()); } else { auto tuple = std::make_shared(); tuple->alias = subquery->alias; ast = tuple; tuple->name = "tuple"; auto exp_list = std::make_shared(); tuple->arguments = exp_list; tuple->children.push_back(tuple->arguments); exp_list->children.resize(columns); for (size_t i = 0; i < columns; ++i) { exp_list->children[i] = addTypeConversion( std::make_unique((*block.safeGetByPosition(i).column)[0]), block.safeGetByPosition(i).type->getName()); } } } else { /** Don't descend into subqueries in FROM section. */ if (!typeid_cast(ast.get())) { /** Don't descend into subqueries in arguments of IN operator. 
              * But if an argument is not a subquery, then there may be scalar subqueries deeper inside it, and we need to descend into them.
              */
            ASTFunction * func = typeid_cast(ast.get());

            if (func && functionIsInOrGlobalInOperator(func->name))
            {
                for (auto & child : ast->children)
                {
                    if (child != func->arguments)
                        executeScalarSubqueriesImpl(child);
                    else
                        for (size_t i = 0, size = func->arguments->children.size(); i < size; ++i)
                            if (i != 1 || !typeid_cast(func->arguments->children[i].get()))
                                executeScalarSubqueriesImpl(func->arguments->children[i]);
                }
            }
            else
                for (auto & child : ast->children)
                    executeScalarSubqueriesImpl(child);
        }
    }
}


void ExpressionAnalyzer::optimizeGroupBy()
{
    if (!(select_query && select_query->group_expression_list))
        return;

    const auto is_literal = [] (const ASTPtr & ast)
    {
        return typeid_cast(ast.get());
    };

    auto & group_exprs = select_query->group_expression_list->children;

    /// removes the expression at index idx by swapping it with the last one and calling .pop_back()
    const auto remove_expr_at_index = [&group_exprs] (const size_t idx)
    {
        if (idx < group_exprs.size() - 1)
            std::swap(group_exprs[idx], group_exprs.back());

        group_exprs.pop_back();
    };

    /// iterate over each GROUP BY expression, eliminate injective function calls and literals
    for (size_t i = 0; i < group_exprs.size();)
    {
        if (const auto function = typeid_cast(group_exprs[i].get()))
        {
            /// check that the function is injective
            if (possibly_injective_function_names.count(function->name))
            {
                /// do not handle semantic errors here
                if (function->arguments->children.size() < 2)
                {
                    ++i;
                    continue;
                }

                const auto & dict_name = typeid_cast(*function->arguments->children[0])
                    .value.safeGet();

                const auto & dict_ptr = context.getExternalDictionaries().getDictionary(dict_name);

                const auto & attr_name = typeid_cast(*function->arguments->children[1])
                    .value.safeGet();

                if (!dict_ptr->isInjective(attr_name))
                {
                    ++i;
                    continue;
                }
            }
            else if (!injective_function_names.count(function->name))
            {
                ++i;
                continue;
            }

            /// copy shared pointer to args in order to ensure lifetime
            auto
args_ast = function->arguments; /** remove function call and take a step back to ensure * next iteration does not skip not yet processed data */ remove_expr_at_index(i); /// copy non-literal arguments std::remove_copy_if( std::begin(args_ast->children), std::end(args_ast->children), std::back_inserter(group_exprs), is_literal ); } else if (is_literal(group_exprs[i])) { remove_expr_at_index(i); } else { /// if neither a function nor literal - advance to next expression ++i; } } if (group_exprs.empty()) { /** You can not completely remove GROUP BY. Because if there were no aggregate functions, then it turns out that there will be no aggregation. * Instead, leave `GROUP BY const`. * Next, see deleting the constants in the analyzeAggregation method. */ /// You must insert a constant that is not the name of the column in the table. Such a case is rare, but it happens. UInt64 unused_column = 0; String unused_column_name = toString(unused_column); while (source_columns.end() != std::find_if(source_columns.begin(), source_columns.end(), [&unused_column_name](const NameAndTypePair & name_type) { return name_type.name == unused_column_name; })) { ++unused_column; unused_column_name = toString(unused_column); } select_query->group_expression_list = std::make_shared(); select_query->group_expression_list->children.emplace_back(std::make_shared(UInt64(unused_column))); } } void ExpressionAnalyzer::optimizeOrderBy() { if (!(select_query && select_query->order_expression_list)) return; /// Make unique sorting conditions. using NameAndLocale = std::pair; std::set elems_set; ASTs & elems = select_query->order_expression_list->children; ASTs unique_elems; unique_elems.reserve(elems.size()); for (const auto & elem : elems) { String name = elem->children.front()->getColumnName(); const ASTOrderByElement & order_by_elem = typeid_cast(*elem); if (elems_set.emplace(name, order_by_elem.collation ? 
order_by_elem.collation->getColumnName() : "").second) unique_elems.emplace_back(elem); } if (unique_elems.size() < elems.size()) elems = unique_elems; } void ExpressionAnalyzer::optimizeLimitBy() { if (!(select_query && select_query->limit_by_expression_list)) return; std::set elems_set; ASTs & elems = select_query->limit_by_expression_list->children; ASTs unique_elems; unique_elems.reserve(elems.size()); for (const auto & elem : elems) { if (elems_set.emplace(elem->getColumnName()).second) unique_elems.emplace_back(elem); } if (unique_elems.size() < elems.size()) elems = unique_elems; } void ExpressionAnalyzer::makeSetsForIndex() { if (storage && ast && storage->supportsIndexForIn()) makeSetsForIndexImpl(ast, storage->getSampleBlock()); } void ExpressionAnalyzer::tryMakeSetFromSubquery(const ASTPtr & subquery_or_table_name) { BlockIO res = interpretSubquery(subquery_or_table_name, context, subquery_depth + 1, {})->execute(); SetPtr set = std::make_shared(SizeLimits(settings.max_rows_in_set, settings.max_bytes_in_set, settings.set_overflow_mode)); while (Block block = res.in->read()) { /// If the limits have been exceeded, give up and let the default subquery processing actions take place. if (!set->insertFromBlock(block, true)) return; } prepared_sets[subquery_or_table_name.get()] = std::move(set); } void ExpressionAnalyzer::makeSetsForIndexImpl(const ASTPtr & node, const Block & sample_block) { for (auto & child : node->children) { /// Process expression only in current subquery if (!typeid_cast(child.get())) makeSetsForIndexImpl(child, sample_block); } const ASTFunction * func = typeid_cast(node.get()); if (func && functionIsInOperator(func->name)) { const IAST & args = *func->arguments; const ASTPtr & arg = args.children.at(1); if (!prepared_sets.count(arg.get())) /// Not already prepared. 
{ if (typeid_cast(arg.get()) || typeid_cast(arg.get())) { if (settings.use_index_for_in_with_subqueries && storage->mayBenefitFromIndexForIn(args.children.at(0))) tryMakeSetFromSubquery(arg); } else { try { ExpressionActionsPtr temp_actions = std::make_shared(source_columns, settings); getRootActions(func->arguments->children.at(0), true, false, temp_actions); makeExplicitSet(func, temp_actions->getSampleBlock(), true); } catch (const Exception & e) { /// in `sample_block` there are no columns that are added by `getActions` if (e.code() != ErrorCodes::NOT_FOUND_COLUMN_IN_BLOCK && e.code() != ErrorCodes::UNKNOWN_IDENTIFIER) throw; /// TODO: Delete the catch in the next release tryLogCurrentException(&Poco::Logger::get("ExpressionAnalyzer")); } } } } } void ExpressionAnalyzer::makeSet(const ASTFunction * node, const Block & sample_block) { /** You need to convert the right argument to a set. * This can be a table name, a value, a value enumeration, or a subquery. * The enumeration of values is parsed as a function `tuple`. */ const IAST & args = *node->arguments; const ASTPtr & arg = args.children.at(1); /// Already converted. if (prepared_sets.count(arg.get())) return; /// If the subquery or table name for SELECT. const ASTIdentifier * identifier = typeid_cast(arg.get()); if (typeid_cast(arg.get()) || identifier) { /// We get the stream of blocks for the subquery. Create Set and put it in place of the subquery. String set_id = arg->getColumnName(); /// A special case is if the name of the table is specified on the right side of the IN statement, /// and the table has the type Set (a previously prepared set). 
        if (identifier)
        {
            auto database_table = getDatabaseAndTableNameFromIdentifier(*identifier);
            StoragePtr table = context.tryGetTable(database_table.first, database_table.second);

            if (table)
            {
                StorageSet * storage_set = dynamic_cast(table.get());

                if (storage_set)
                {
                    prepared_sets[arg.get()] = storage_set->getSet();
                    return;
                }
            }
        }

        SubqueryForSet & subquery_for_set = subqueries_for_sets[set_id];

        /// If a Set has already been created for the same subquery / table.
        if (subquery_for_set.set)
        {
            prepared_sets[arg.get()] = subquery_for_set.set;
            return;
        }

        SetPtr set = std::make_shared(SizeLimits(settings.max_rows_in_set, settings.max_bytes_in_set, settings.set_overflow_mode));

        /** The following happens for GLOBAL INs:
          * - in the addExternalStorage function, the IN (SELECT ...) subquery is replaced with IN _data1,
          *   in the subquery_for_set object, this subquery is set as source and the temporary table _data1 as the table.
          * - this function sees the expression IN _data1.
          */
        if (!subquery_for_set.source)
        {
            auto interpreter = interpretSubquery(arg, context, subquery_depth, {});
            subquery_for_set.source = std::make_shared(
                interpreter->getSampleBlock(), [interpreter]() mutable { return interpreter->execute().in; });

            /** Why is LazyBlockInputStream used?
              *
              * The fact is that when processing a query of the form
              *  SELECT ... FROM remote_test WHERE column GLOBAL IN (subquery),
              *  if the distributed remote_test table contains localhost as one of the servers,
              *  the query will be interpreted locally again (and not sent over TCP, as in the case of a remote server).
              *
              * The query execution pipeline will be:
              * CreatingSets
              *  subquery execution, filling the temporary table _data1 (1)
              *  CreatingSets
              *   reading from the table _data1, creating the set (2)
              *   read from the table subordinate to remote_test.
              *
              * (The second part of the pipeline under CreatingSets is a reinterpretation of the query inside StorageDistributed;
              *  the query differs in that the database and table names are replaced with the subordinate ones, and the subquery is replaced with _data1.)
              *
              * But when creating the pipeline, when creating the source (2), it will be found that the _data1 table is empty
              *  (because the query has not started yet), and an empty source will be returned as the source.
              * And then, when the query is executed, an empty set will be created in step (2).
              *
              * Therefore, we make the initialization of step (2) lazy
              *  - so that it does not occur until step (1) is completed, at which point the table will be populated.
              *
              * Note: this solution is not very good; a better one is needed.
              */
        }

        subquery_for_set.set = set;
        prepared_sets[arg.get()] = set;
    }
    else
    {
        /// An explicit enumeration of values in parentheses.
        makeExplicitSet(node, sample_block, false);
    }
}

/// The case of an explicit enumeration of values.
void ExpressionAnalyzer::makeExplicitSet(const ASTFunction * node, const Block & sample_block, bool create_ordered_set)
{
    const IAST & args = *node->arguments;

    if (args.children.size() != 2)
        throw Exception("Wrong number of arguments passed to function in", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH);

    const ASTPtr & arg = args.children.at(1);

    DataTypes set_element_types;
    const ASTPtr & left_arg = args.children.at(0);

    const ASTFunction * left_arg_tuple = typeid_cast(left_arg.get());

    /** NOTE If the tuple on the left-hand side is not specified explicitly
      * Example: identity((a, b)) IN ((1, 2), (3, 4))
      *  instead of       (a, b) IN ((1, 2), (3, 4))
      * then set creation doesn't work correctly.
*/ if (left_arg_tuple && left_arg_tuple->name == "tuple") { for (const auto & arg : left_arg_tuple->arguments->children) { const auto & data_type = sample_block.getByName(arg->getColumnName()).type; /// NOTE prevent crash in query: SELECT (1, [1]) in (1, 1) if (const auto array = typeid_cast(data_type.get())) throw Exception("Incorrect element of tuple: " + array->getName(), ErrorCodes::INCORRECT_ELEMENT_OF_SET); set_element_types.push_back(data_type); } } else { DataTypePtr left_type = sample_block.getByName(left_arg->getColumnName()).type; if (const DataTypeArray * array_type = typeid_cast(left_type.get())) set_element_types.push_back(array_type->getNestedType()); else set_element_types.push_back(left_type); } /// The case `x in (1, 2)` distinguishes from the case `x in 1` (also `x in (1)`). bool single_value = false; ASTPtr elements_ast = arg; if (ASTFunction * set_func = typeid_cast(arg.get())) { if (set_func->name == "tuple") { if (set_func->arguments->children.empty()) { /// Empty set. elements_ast = set_func->arguments; } else { /// Distinguish the case `(x, y) in ((1, 2), (3, 4))` from the case `(x, y) in (1, 2)`. ASTFunction * any_element = typeid_cast(set_func->arguments->children.at(0).get()); if (set_element_types.size() >= 2 && (!any_element || any_element->name != "tuple")) single_value = true; else elements_ast = set_func->arguments; } } else { if (set_element_types.size() >= 2) throw Exception("Incorrect type of 2nd argument for function " + node->name + ". Must be subquery or set of " + toString(set_element_types.size()) + "-element tuples.", ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT); single_value = true; } } else if (typeid_cast(arg.get())) { single_value = true; } else { throw Exception("Incorrect type of 2nd argument for function " + node->name + ". 
Must be subquery or set of values.", ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT); } if (single_value) { ASTPtr exp_list = std::make_shared(); exp_list->children.push_back(elements_ast); elements_ast = exp_list; } SetPtr set = std::make_shared(SizeLimits(settings.max_rows_in_set, settings.max_bytes_in_set, settings.set_overflow_mode)); set->createFromAST(set_element_types, elements_ast, context, create_ordered_set); prepared_sets[arg.get()] = std::move(set); } static String getUniqueName(const Block & block, const String & prefix) { int i = 1; while (block.has(prefix + toString(i))) ++i; return prefix + toString(i); } /** For getActionsImpl. * A stack of ExpressionActions corresponding to nested lambda expressions. * The new action should be added to the highest possible level. * For example, in the expression "select arrayMap(x -> x + column1 * column2, array1)" * calculation of the product must be done outside the lambda expression (it does not depend on x), and the calculation of the sum is inside (depends on x). 
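  *
  * Sketch of how addAction places the actions for that example
  * (level 0 is the outermost scope, level 1 is the body of the lambda):
  *   column1 * column2         requires column1, column2 (both known at level 0)  -> added at level 0
  *   x + (column1 * column2)   requires x (introduced at level 1 by pushLevel)    -> added at level 1
  * An action is added at the maximum level among the columns it requires,
  * and every column it produces is then passed as an input to all deeper levels.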
*/ struct ExpressionAnalyzer::ScopeStack { struct Level { ExpressionActionsPtr actions; NameSet new_columns; }; using Levels = std::vector; Levels stack; Settings settings; ScopeStack(const ExpressionActionsPtr & actions, const Settings & settings_) : settings(settings_) { stack.emplace_back(); stack.back().actions = actions; const Block & sample_block = actions->getSampleBlock(); for (size_t i = 0, size = sample_block.columns(); i < size; ++i) stack.back().new_columns.insert(sample_block.getByPosition(i).name); } void pushLevel(const NamesAndTypesList & input_columns) { stack.emplace_back(); Level & prev = stack[stack.size() - 2]; ColumnsWithTypeAndName all_columns; NameSet new_names; for (NamesAndTypesList::const_iterator it = input_columns.begin(); it != input_columns.end(); ++it) { all_columns.emplace_back(nullptr, it->type, it->name); new_names.insert(it->name); stack.back().new_columns.insert(it->name); } const Block & prev_sample_block = prev.actions->getSampleBlock(); for (size_t i = 0, size = prev_sample_block.columns(); i < size; ++i) { const ColumnWithTypeAndName & col = prev_sample_block.getByPosition(i); if (!new_names.count(col.name)) all_columns.push_back(col); } stack.back().actions = std::make_shared(all_columns, settings); } size_t getColumnLevel(const std::string & name) { for (int i = static_cast(stack.size()) - 1; i >= 0; --i) if (stack[i].new_columns.count(name)) return i; throw Exception("Unknown identifier: " + name, ErrorCodes::UNKNOWN_IDENTIFIER); } void addAction(const ExpressionAction & action) { size_t level = 0; Names required = action.getNeededColumns(); for (size_t i = 0; i < required.size(); ++i) level = std::max(level, getColumnLevel(required[i])); Names added; stack[level].actions->add(action, added); stack[level].new_columns.insert(added.begin(), added.end()); for (size_t i = 0; i < added.size(); ++i) { const ColumnWithTypeAndName & col = stack[level].actions->getSampleBlock().getByName(added[i]); for (size_t j = level + 1; j < 
stack.size(); ++j) stack[j].actions->addInput(col); } } ExpressionActionsPtr popLevel() { ExpressionActionsPtr res = stack.back().actions; stack.pop_back(); return res; } const Block & getSampleBlock() const { return stack.back().actions->getSampleBlock(); } }; void ExpressionAnalyzer::getRootActions(const ASTPtr & ast, bool no_subqueries, bool only_consts, ExpressionActionsPtr & actions) { ScopeStack scopes(actions, settings); getActionsImpl(ast, no_subqueries, only_consts, scopes); actions = scopes.popLevel(); } void ExpressionAnalyzer::getArrayJoinedColumns() { if (select_query && select_query->array_join_expression_list()) { ASTs & array_join_asts = select_query->array_join_expression_list()->children; for (const auto & ast : array_join_asts) { const String nested_table_name = ast->getColumnName(); const String nested_table_alias = ast->getAliasOrColumnName(); if (nested_table_alias == nested_table_name && !typeid_cast(ast.get())) throw Exception("No alias for non-trivial value in ARRAY JOIN: " + nested_table_name, ErrorCodes::ALIAS_REQUIRED); if (array_join_alias_to_name.count(nested_table_alias) || aliases.count(nested_table_alias)) throw Exception("Duplicate alias in ARRAY JOIN: " + nested_table_alias, ErrorCodes::MULTIPLE_EXPRESSIONS_FOR_ALIAS); array_join_alias_to_name[nested_table_alias] = nested_table_name; array_join_name_to_alias[nested_table_name] = nested_table_alias; } getArrayJoinedColumnsImpl(ast); /// If the result of ARRAY JOIN is not used, it is necessary to ARRAY-JOIN any column, /// to get the correct number of rows. if (array_join_result_to_source.empty()) { ASTPtr expr = select_query->array_join_expression_list()->children.at(0); String source_name = expr->getColumnName(); String result_name = expr->getAliasOrColumnName(); /// This is an array. if (!typeid_cast(expr.get()) || findColumn(source_name, source_columns) != source_columns.end()) { array_join_result_to_source[result_name] = source_name; } else /// This is a nested table. 
            {
                bool found = false;
                for (const auto & column_name_type : source_columns)
                {
                    auto splitted = Nested::splitName(column_name_type.name);
                    if (splitted.first == source_name && !splitted.second.empty())
                    {
                        array_join_result_to_source[Nested::concatenateName(result_name, splitted.second)] = column_name_type.name;
                        found = true;
                        break;
                    }
                }
                if (!found)
                    throw Exception("No columns in nested table " + source_name, ErrorCodes::EMPTY_NESTED_TABLE);
            }
        }
    }
}


/// Fills array_join_result_to_source: which array columns to replicate, and what to call them after that.
void ExpressionAnalyzer::getArrayJoinedColumnsImpl(const ASTPtr & ast)
{
    if (typeid_cast(ast.get()))
        return;

    if (ASTIdentifier * node = typeid_cast(ast.get()))
    {
        if (node->kind == ASTIdentifier::Column)
        {
            auto splitted = Nested::splitName(node->name);  /// ParsedParams, Key1

            if (array_join_alias_to_name.count(node->name))
            {
                /// ARRAY JOIN was written with an array column. Example: SELECT K1 FROM ... ARRAY JOIN ParsedParams.Key1 AS K1
                array_join_result_to_source[node->name] = array_join_alias_to_name[node->name];    /// K1 -> ParsedParams.Key1
            }
            else if (array_join_alias_to_name.count(splitted.first) && !splitted.second.empty())
            {
                /// ARRAY JOIN was written with a nested table. Example: SELECT PP.Key1 FROM ... ARRAY JOIN ParsedParams AS PP
                array_join_result_to_source[node->name]    /// PP.Key1 -> ParsedParams.Key1
                    = Nested::concatenateName(array_join_alias_to_name[splitted.first], splitted.second);
            }
            else if (array_join_name_to_alias.count(node->name))
            {
                /** Example: SELECT ParsedParams.Key1 FROM ... ARRAY JOIN ParsedParams.Key1 AS PP.Key1.
                  * That is, the query uses the original array, replicated by itself.
                  */
                array_join_result_to_source[    /// PP.Key1 -> ParsedParams.Key1
                    array_join_name_to_alias[node->name]] = node->name;
            }
            else if (array_join_name_to_alias.count(splitted.first) && !splitted.second.empty())
            {
                /** Example: SELECT ParsedParams.Key1 FROM ... ARRAY JOIN ParsedParams AS PP.
*/ array_join_result_to_source[ /// PP.Key1 -> ParsedParams.Key1 Nested::concatenateName(array_join_name_to_alias[splitted.first], splitted.second)] = node->name; } } } else { for (auto & child : ast->children) if (!typeid_cast(child.get()) && !typeid_cast(child.get())) getArrayJoinedColumnsImpl(child); } } void ExpressionAnalyzer::getActionsImpl(const ASTPtr & ast, bool no_subqueries, bool only_consts, ScopeStack & actions_stack) { /// If the result of the calculation already exists in the block. if ((typeid_cast(ast.get()) || typeid_cast(ast.get())) && actions_stack.getSampleBlock().has(ast->getColumnName())) return; if (ASTIdentifier * node = typeid_cast(ast.get())) { std::string name = node->getColumnName(); if (!only_consts && !actions_stack.getSampleBlock().has(name)) { /// The requested column is not in the block. /// If such a column exists in the table, then the user probably forgot to surround it with an aggregate function or add it to GROUP BY. bool found = false; for (const auto & column_name_type : source_columns) if (column_name_type.name == name) found = true; if (found) throw Exception("Column " + name + " is not under aggregate function and not in GROUP BY.", ErrorCodes::NOT_AN_AGGREGATE); } } else if (ASTFunction * node = typeid_cast(ast.get())) { if (node->name == "lambda") throw Exception("Unexpected lambda expression", ErrorCodes::UNEXPECTED_EXPRESSION); /// Function arrayJoin. 
        if (node->name == "arrayJoin")
        {
            if (node->arguments->children.size() != 1)
                throw Exception("arrayJoin requires exactly 1 argument", ErrorCodes::TYPE_MISMATCH);

            ASTPtr arg = node->arguments->children.at(0);
            getActionsImpl(arg, no_subqueries, only_consts, actions_stack);
            if (!only_consts)
            {
                String result_name = node->getColumnName();
                actions_stack.addAction(ExpressionAction::copyColumn(arg->getColumnName(), result_name));
                NameSet joined_columns;
                joined_columns.insert(result_name);
                actions_stack.addAction(ExpressionAction::arrayJoin(joined_columns, false, context));
            }

            return;
        }

        if (functionIsInOrGlobalInOperator(node->name))
        {
            if (!no_subqueries)
            {
                /// Let's find the type of the first argument (then getActionsImpl will be called again and will not affect anything).
                getActionsImpl(node->arguments->children.at(0), no_subqueries, only_consts, actions_stack);

                /// Transform tuple or subquery into a set.
                makeSet(node, actions_stack.getSampleBlock());
            }
            else
            {
                if (!only_consts)
                {
                    /// We are in the part of the tree that we are not going to compute. We only need to determine the types.
                    /// Do not execute subqueries or create sets. We insert an arbitrary column of the correct type.
                    ColumnWithTypeAndName fake_column;
                    fake_column.name = node->getColumnName();
                    fake_column.type = std::make_shared();
                    actions_stack.addAction(ExpressionAction::addColumn(fake_column));
                    getActionsImpl(node->arguments->children.at(0), no_subqueries, only_consts, actions_stack);
                }
                return;
            }
        }

        /// A special function `indexHint`. Everything that is inside it is not calculated
        /// (and is used only for index analysis, see PKCondition).
if (node->name == "indexHint") { actions_stack.addAction(ExpressionAction::addColumn(ColumnWithTypeAndName( ColumnConst::create(ColumnUInt8::create(1, 1), 1), std::make_shared(), node->getColumnName()))); return; } if (AggregateFunctionFactory::instance().isAggregateFunctionName(node->name)) return; const FunctionBuilderPtr & function_builder = FunctionFactory::instance().get(node->name, context); Names argument_names; DataTypes argument_types; bool arguments_present = true; /// If the function has an argument-lambda expression, you need to determine its type before the recursive call. bool has_lambda_arguments = false; for (auto & child : node->arguments->children) { ASTFunction * lambda = typeid_cast(child.get()); if (lambda && lambda->name == "lambda") { /// If the argument is a lambda expression, just remember its approximate type. if (lambda->arguments->children.size() != 2) throw Exception("lambda requires two arguments", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH); ASTFunction * lambda_args_tuple = typeid_cast(lambda->arguments->children.at(0).get()); if (!lambda_args_tuple || lambda_args_tuple->name != "tuple") throw Exception("First argument of lambda must be a tuple", ErrorCodes::TYPE_MISMATCH); has_lambda_arguments = true; argument_types.emplace_back(std::make_shared(DataTypes(lambda_args_tuple->arguments->children.size()))); /// Select the name in the next cycle. argument_names.emplace_back(); } else if (prepared_sets.count(child.get())) { ColumnWithTypeAndName column; column.type = std::make_shared(); const SetPtr & set = prepared_sets[child.get()]; /// If the argument is a set given by an enumeration of values (so, the set was already built), give it a unique name, /// so that sets with the same record do not fuse together (they can have different types). 
if (!set->empty()) column.name = getUniqueName(actions_stack.getSampleBlock(), "__set"); else column.name = child->getColumnName(); if (!actions_stack.getSampleBlock().has(column.name)) { column.column = ColumnSet::create(1, set); actions_stack.addAction(ExpressionAction::addColumn(column)); } argument_types.push_back(column.type); argument_names.push_back(column.name); } else { /// If the argument is not a lambda expression, call it recursively and find out its type. getActionsImpl(child, no_subqueries, only_consts, actions_stack); std::string name = child->getColumnName(); if (actions_stack.getSampleBlock().has(name)) { argument_types.push_back(actions_stack.getSampleBlock().getByName(name).type); argument_names.push_back(name); } else { if (only_consts) { arguments_present = false; } else { throw Exception("Unknown identifier: " + name, ErrorCodes::UNKNOWN_IDENTIFIER); } } } } if (only_consts && !arguments_present) return; if (has_lambda_arguments && !only_consts) { function_builder->getLambdaArgumentTypes(argument_types); /// Call recursively for lambda expressions. 
for (size_t i = 0; i < node->arguments->children.size(); ++i) { ASTPtr child = node->arguments->children[i]; ASTFunction * lambda = typeid_cast(child.get()); if (lambda && lambda->name == "lambda") { const DataTypeFunction * lambda_type = typeid_cast(argument_types[i].get()); ASTFunction * lambda_args_tuple = typeid_cast(lambda->arguments->children.at(0).get()); ASTs lambda_arg_asts = lambda_args_tuple->arguments->children; NamesAndTypesList lambda_arguments; for (size_t j = 0; j < lambda_arg_asts.size(); ++j) { ASTIdentifier * identifier = typeid_cast(lambda_arg_asts[j].get()); if (!identifier) throw Exception("lambda argument declarations must be identifiers", ErrorCodes::TYPE_MISMATCH); String arg_name = identifier->name; lambda_arguments.emplace_back(arg_name, lambda_type->getArgumentTypes()[j]); } actions_stack.pushLevel(lambda_arguments); getActionsImpl(lambda->arguments->children.at(1), no_subqueries, only_consts, actions_stack); ExpressionActionsPtr lambda_actions = actions_stack.popLevel(); String result_name = lambda->arguments->children.at(1)->getColumnName(); lambda_actions->finalize(Names(1, result_name)); DataTypePtr result_type = lambda_actions->getSampleBlock().getByName(result_name).type; Names captured; Names required = lambda_actions->getRequiredColumns(); for (size_t j = 0; j < required.size(); ++j) if (findColumn(required[j], lambda_arguments) == lambda_arguments.end()) captured.push_back(required[j]); /// We can not name `getColumnName()`, /// because it does not uniquely define the expression (the types of arguments can be different). 
String lambda_name = getUniqueName(actions_stack.getSampleBlock(), "__lambda"); auto function_capture = std::make_shared( lambda_actions, captured, lambda_arguments, result_type, result_name); actions_stack.addAction(ExpressionAction::applyFunction(function_capture, captured, lambda_name)); argument_types[i] = std::make_shared(lambda_type->getArgumentTypes(), result_type); argument_names[i] = lambda_name; } } } if (only_consts) { for (size_t i = 0; i < argument_names.size(); ++i) { if (!actions_stack.getSampleBlock().has(argument_names[i])) { arguments_present = false; break; } } } if (arguments_present) actions_stack.addAction(ExpressionAction::applyFunction(function_builder, argument_names, node->getColumnName())); } else if (ASTLiteral * node = typeid_cast(ast.get())) { DataTypePtr type = applyVisitor(FieldToDataType(), node->value); ColumnWithTypeAndName column; column.column = type->createColumnConst(1, convertFieldToType(node->value, *type)); column.type = type; column.name = node->getColumnName(); actions_stack.addAction(ExpressionAction::addColumn(column)); } else { for (auto & child : ast->children) { /// Do not go to FROM, JOIN, UNION. if (!typeid_cast(child.get()) && !typeid_cast(child.get())) getActionsImpl(child, no_subqueries, only_consts, actions_stack); } } } void ExpressionAnalyzer::getAggregates(const ASTPtr & ast, ExpressionActionsPtr & actions) { /// There can not be aggregate functions inside the WHERE and PREWHERE. if (select_query && (ast.get() == select_query->where_expression.get() || ast.get() == select_query->prewhere_expression.get())) { assertNoAggregates(ast, "in WHERE or PREWHERE"); return; } /// If we are not analyzing a SELECT query, but a separate expression, then there can not be aggregate functions in it. 
    if (!select_query)
    {
        assertNoAggregates(ast, "in wrong place");
        return;
    }

    const ASTFunction * node = typeid_cast<const ASTFunction *>(ast.get());
    if (node && AggregateFunctionFactory::instance().isAggregateFunctionName(node->name))
    {
        has_aggregation = true;
        AggregateDescription aggregate;
        aggregate.column_name = node->getColumnName();

        /// Keep aggregate functions unique: skip one that is already registered.
        for (size_t i = 0; i < aggregate_descriptions.size(); ++i)
            if (aggregate_descriptions[i].column_name == aggregate.column_name)
                return;

        const ASTs & arguments = node->arguments->children;
        aggregate.argument_names.resize(arguments.size());
        DataTypes types(arguments.size());

        for (size_t i = 0; i < arguments.size(); ++i)
        {
            /// Aggregate functions cannot be nested inside other aggregate functions.
            assertNoAggregates(arguments[i], "inside another aggregate function");

            getRootActions(arguments[i], true, false, actions);
            const std::string & name = arguments[i]->getColumnName();
            types[i] = actions->getSampleBlock().getByName(name).type;
            aggregate.argument_names[i] = name;
        }

        aggregate.parameters = (node->parameters) ? getAggregateFunctionParametersArray(node->parameters) : Array();
        aggregate.function = AggregateFunctionFactory::instance().get(node->name, types, aggregate.parameters);

        aggregate_descriptions.push_back(aggregate);
    }
    else
    {
        for (const auto & child : ast->children)
            if (!typeid_cast(child.get())
                && !typeid_cast(child.get()))
                getAggregates(child, actions);
    }
}


void ExpressionAnalyzer::assertNoAggregates(const ASTPtr & ast, const char * description)
{
    const ASTFunction * node = typeid_cast<const ASTFunction *>(ast.get());

    if (node && AggregateFunctionFactory::instance().isAggregateFunctionName(node->name))
        throw Exception("Aggregate function " + node->getColumnName()
            + " is found " + String(description) + " in query", ErrorCodes::ILLEGAL_AGGREGATION);

    for (const auto & child : ast->children)
        if (!typeid_cast(child.get())
            && !typeid_cast(child.get()))
            assertNoAggregates(child, description);
}


void ExpressionAnalyzer::assertSelect() const
{
    if (!select_query)
        throw Exception("Not a select query", ErrorCodes::LOGICAL_ERROR);
}

void ExpressionAnalyzer::assertAggregation() const
{
    if (!has_aggregation)
        throw Exception("No aggregation", ErrorCodes::LOGICAL_ERROR);
}

void ExpressionAnalyzer::initChain(ExpressionActionsChain & chain, const NamesAndTypesList & columns) const
{
    if (chain.steps.empty())
    {
        chain.settings = settings;
        chain.steps.emplace_back(std::make_shared<ExpressionActions>(columns, settings));
    }
}

/// "Big" ARRAY JOIN.
void ExpressionAnalyzer::addMultipleArrayJoinAction(ExpressionActionsPtr & actions) const
{
    NameSet result_columns;
    for (const auto & result_source : array_join_result_to_source)
    {
        /// Assign new names to columns, if needed.
        if (result_source.first != result_source.second)
            actions->add(ExpressionAction::copyColumn(result_source.second, result_source.first));

        /// Make ARRAY JOIN (replace arrays with their contents) for the columns under these new names.
        result_columns.insert(result_source.first);
    }

    actions->add(ExpressionAction::arrayJoin(result_columns, select_query->array_join_is_left(), context));
}

bool ExpressionAnalyzer::appendArrayJoin(ExpressionActionsChain & chain, bool only_types)
{
    assertSelect();

    if (!select_query->array_join_expression_list())
        return false;

    initChain(chain, source_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    getRootActions(select_query->array_join_expression_list(), only_types, false, step.actions);

    addMultipleArrayJoinAction(step.actions);

    return true;
}

void ExpressionAnalyzer::addJoinAction(ExpressionActionsPtr & actions, bool only_types) const
{
    if (only_types)
        actions->add(ExpressionAction::ordinaryJoin(nullptr, columns_added_by_join));
    else
        for (auto & subquery_for_set : subqueries_for_sets)
            if (subquery_for_set.second.join)
                actions->add(ExpressionAction::ordinaryJoin(subquery_for_set.second.join, columns_added_by_join));
}

bool ExpressionAnalyzer::appendJoin(ExpressionActionsChain & chain, bool only_types)
{
    assertSelect();

    if (!select_query->join())
        return false;

    initChain(chain, source_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    const ASTTablesInSelectQueryElement & join_element = static_cast<const ASTTablesInSelectQueryElement &>(*select_query->join());
    const ASTTableJoin & join_params = static_cast<const ASTTableJoin &>(*join_element.table_join);
    const ASTTableExpression & table_to_join = static_cast<const ASTTableExpression &>(*join_element.table_expression);

    if (join_params.using_expression_list)
        getRootActions(join_params.using_expression_list, only_types, false, step.actions);

    /// Two JOINs with the same subquery but different USING lists are not supported.
    auto join_hash = join_element.getTreeHash();

    SubqueryForSet & subquery_for_set = subqueries_for_sets[toString(join_hash.first) + "_" + toString(join_hash.second)];

    /// Special case: if a table name is specified on the right of JOIN, the table may have the type Join (a previously prepared mapping).
    /// TODO This syntax does not support specifying a database name.
    if (table_to_join.database_and_table_name)
    {
        auto database_table = getDatabaseAndTableNameFromIdentifier(static_cast<const ASTIdentifier &>(*table_to_join.database_and_table_name));
        StoragePtr table = context.tryGetTable(database_table.first, database_table.second);

        if (table)
        {
            StorageJoin * storage_join = dynamic_cast<StorageJoin *>(table.get());

            if (storage_join)
            {
                storage_join->assertCompatible(join_params.kind, join_params.strictness);
                /// TODO Check the set of keys.

                JoinPtr & join = storage_join->getJoin();
                subquery_for_set.join = join;
            }
        }
    }

    if (!subquery_for_set.join)
    {
        JoinPtr join = std::make_shared<Join>(
            join_key_names_left, join_key_names_right,
            settings.join_use_nulls, SizeLimits(settings.max_rows_in_join, settings.max_bytes_in_join, settings.join_overflow_mode),
            join_params.kind, join_params.strictness);

        Names required_joined_columns(join_key_names_right.begin(), join_key_names_right.end());
        for (const auto & name_type : columns_added_by_join)
            required_joined_columns.push_back(name_type.name);

        /** For GLOBAL JOINs (in the case, for example, of the push method for executing GLOBAL subqueries), the following occurs
          * - in the addExternalStorage function, the JOIN (SELECT ...) subquery is replaced with JOIN _data1,
          *   in the subquery_for_set object this subquery is exposed as source and the temporary table _data1 as the `table`.
          * - this function shows the expression JOIN _data1.
          */
        if (!subquery_for_set.source)
        {
            ASTPtr table;
            if (table_to_join.database_and_table_name)
                table = table_to_join.database_and_table_name;
            else
                table = table_to_join.subquery;

            auto interpreter = interpretSubquery(table, context, subquery_depth, required_joined_columns);
            subquery_for_set.source = std::make_shared(
                interpreter->getSampleBlock(),
                [interpreter]() mutable { return interpreter->execute().in; });
        }

        /// TODO You do not need to set this up when JOIN is only needed on remote servers.
        subquery_for_set.join = join;
        subquery_for_set.join->setSampleBlock(subquery_for_set.source->getHeader());
    }

    addJoinAction(step.actions, false);

    return true;
}

bool ExpressionAnalyzer::appendWhere(ExpressionActionsChain & chain, bool only_types)
{
    assertSelect();

    if (!select_query->where_expression)
        return false;

    initChain(chain, source_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    step.required_output.push_back(select_query->where_expression->getColumnName());
    getRootActions(select_query->where_expression, only_types, false, step.actions);

    return true;
}

bool ExpressionAnalyzer::appendGroupBy(ExpressionActionsChain & chain, bool only_types)
{
    assertAggregation();

    if (!select_query->group_expression_list)
        return false;

    initChain(chain, source_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    ASTs asts = select_query->group_expression_list->children;
    for (size_t i = 0; i < asts.size(); ++i)
    {
        step.required_output.push_back(asts[i]->getColumnName());
        getRootActions(asts[i], only_types, false, step.actions);
    }

    return true;
}

void ExpressionAnalyzer::appendAggregateFunctionsArguments(ExpressionActionsChain & chain, bool only_types)
{
    assertAggregation();

    initChain(chain, source_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    for (size_t i = 0; i < aggregate_descriptions.size(); ++i)
    {
        for (size_t j = 0; j < aggregate_descriptions[i].argument_names.size(); ++j)
        {
            step.required_output.push_back(aggregate_descriptions[i].argument_names[j]);
        }
    }

    getActionsBeforeAggregation(select_query->select_expression_list, step.actions, only_types);

    if (select_query->having_expression)
        getActionsBeforeAggregation(select_query->having_expression, step.actions, only_types);

    if (select_query->order_expression_list)
        getActionsBeforeAggregation(select_query->order_expression_list, step.actions, only_types);
}

bool ExpressionAnalyzer::appendHaving(ExpressionActionsChain & chain, bool only_types)
{
    assertAggregation();

    if (!select_query->having_expression)
        return false;

    initChain(chain, aggregated_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    step.required_output.push_back(select_query->having_expression->getColumnName());
    getRootActions(select_query->having_expression, only_types, false, step.actions);

    return true;
}

void ExpressionAnalyzer::appendSelect(ExpressionActionsChain & chain, bool only_types)
{
    assertSelect();

    initChain(chain, aggregated_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    getRootActions(select_query->select_expression_list, only_types, false, step.actions);

    for (const auto & child : select_query->select_expression_list->children)
        step.required_output.push_back(child->getColumnName());
}

bool ExpressionAnalyzer::appendOrderBy(ExpressionActionsChain & chain, bool only_types)
{
    assertSelect();

    if (!select_query->order_expression_list)
        return false;

    initChain(chain, aggregated_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    getRootActions(select_query->order_expression_list, only_types, false, step.actions);

    ASTs asts = select_query->order_expression_list->children;
    for (size_t i = 0; i < asts.size(); ++i)
    {
        ASTOrderByElement * ast = typeid_cast<ASTOrderByElement *>(asts[i].get());
        if (!ast || ast->children.size() < 1)
            throw Exception("Bad order expression AST", ErrorCodes::UNKNOWN_TYPE_OF_AST_NODE);
        ASTPtr order_expression = ast->children.at(0);
        step.required_output.push_back(order_expression->getColumnName());
    }

    return true;
}

bool ExpressionAnalyzer::appendLimitBy(ExpressionActionsChain & chain, bool only_types)
{
    assertSelect();

    if (!select_query->limit_by_expression_list)
        return false;

    initChain(chain, aggregated_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    getRootActions(select_query->limit_by_expression_list, only_types, false, step.actions);

    for (const auto & child : select_query->limit_by_expression_list->children)
        step.required_output.push_back(child->getColumnName());

    return true;
}

void ExpressionAnalyzer::appendProjectResult(ExpressionActionsChain & chain) const
{
    assertSelect();

    initChain(chain, aggregated_columns);
    ExpressionActionsChain::Step & step = chain.steps.back();

    NamesWithAliases result_columns;

    ASTs asts = select_query->select_expression_list->children;
    for (size_t i = 0; i < asts.size(); ++i)
    {
        String result_name = asts[i]->getAliasOrColumnName();
        if (required_result_columns.empty() || required_result_columns.count(result_name))
        {
            result_columns.emplace_back(asts[i]->getColumnName(), result_name);
            step.required_output.push_back(result_columns.back().second);
        }
    }

    step.actions->add(ExpressionAction::project(result_columns));
}


void ExpressionAnalyzer::getActionsBeforeAggregation(const ASTPtr & ast, ExpressionActionsPtr & actions, bool no_subqueries)
{
    ASTFunction * node = typeid_cast<ASTFunction *>(ast.get());

    if (node && AggregateFunctionFactory::instance().isAggregateFunctionName(node->name))
        for (auto & argument : node->arguments->children)
            getRootActions(argument, no_subqueries, false, actions);
    else
        for (auto & child : ast->children)
            getActionsBeforeAggregation(child, actions, no_subqueries);
}


ExpressionActionsPtr ExpressionAnalyzer::getActions(bool project_result)
{
    ExpressionActionsPtr actions = std::make_shared<ExpressionActions>(source_columns, settings);
    NamesWithAliases result_columns;
    Names result_names;

    ASTs asts;

    if (auto node = typeid_cast(ast.get()))
        asts = node->children;
    else
        asts = ASTs(1, ast);

    for (size_t i = 0; i < asts.size(); ++i)
    {
        std::string name = asts[i]->getColumnName();
        std::string alias;
        if (project_result)
            alias = asts[i]->getAliasOrColumnName();
        else
            alias = name;
        result_columns.emplace_back(name, alias);
        result_names.push_back(alias);
        getRootActions(asts[i], false, false, actions);
    }

    if (project_result)
    {
        actions->add(ExpressionAction::project(result_columns));
    }
    else
    {
        /// We will not delete the original columns.
        for (const auto & column_name_type : source_columns)
            result_names.push_back(column_name_type.name);
    }

    actions->finalize(result_names);

    return actions;
}


ExpressionActionsPtr ExpressionAnalyzer::getConstActions()
{
    ExpressionActionsPtr actions = std::make_shared<ExpressionActions>(NamesAndTypesList(), settings);

    getRootActions(ast, true, true, actions);

    return actions;
}

void ExpressionAnalyzer::getAggregateInfo(Names & key_names, AggregateDescriptions & aggregates) const
{
    for (const auto & name_and_type : aggregation_keys)
        key_names.emplace_back(name_and_type.name);

    aggregates = aggregate_descriptions;
}

void ExpressionAnalyzer::collectUsedColumns()
{
    /** Calculate which columns are required to execute the expression.
      * Then, delete all other columns from the list of available columns.
      * After execution, columns will only contain the list of columns needed to read from the table.
      */

    NameSet required;
    NameSet ignored;

    NameSet available_columns;
    for (const auto & column : source_columns)
        available_columns.insert(column.name);

    if (select_query && select_query->array_join_expression_list())
    {
        ASTs & expressions = select_query->array_join_expression_list()->children;
        for (size_t i = 0; i < expressions.size(); ++i)
        {
            /// Ignore the top-level identifiers from the ARRAY JOIN section.
            /// Then add them separately.
            if (typeid_cast<const ASTIdentifier *>(expressions[i].get()))
            {
                ignored.insert(expressions[i]->getColumnName());
            }
            else
            {
                /// Nothing needs to be ignored for expressions in ARRAY JOIN.
                NameSet empty;
                getRequiredSourceColumnsImpl(expressions[i], available_columns, required, empty, empty, empty);
            }

            ignored.insert(expressions[i]->getAliasOrColumnName());
        }
    }

    /** You also need to ignore the identifiers of the columns that are obtained by JOIN.
      * (Do not assume that they are required for reading from the "left" table).
      */
    NameSet available_joined_columns;
    collectJoinedColumns(available_joined_columns, columns_added_by_join);

    NameSet required_joined_columns;
    getRequiredSourceColumnsImpl(ast, available_columns, required, ignored, available_joined_columns, required_joined_columns);

    for (NamesAndTypesList::iterator it = columns_added_by_join.begin(); it != columns_added_by_join.end();)
    {
        if (required_joined_columns.count(it->name))
            ++it;
        else
            columns_added_by_join.erase(it++);
    }

    /// Insert the columns required for the ARRAY JOIN calculation into the required columns list.
    NameSet array_join_sources;
    for (const auto & result_source : array_join_result_to_source)
        array_join_sources.insert(result_source.second);

    for (const auto & column_name_type : source_columns)
        if (array_join_sources.count(column_name_type.name))
            required.insert(column_name_type.name);

    /// You need to read at least one column to find the number of rows.
    if (select_query && required.empty())
        required.insert(ExpressionActions::getSmallestColumn(source_columns));

    NameSet unknown_required_source_columns = required;

    for (NamesAndTypesList::iterator it = source_columns.begin(); it != source_columns.end();)
    {
        unknown_required_source_columns.erase(it->name);

        if (!required.count(it->name))
            source_columns.erase(it++);
        else
            ++it;
    }

    /// If there are virtual columns among the unknown columns, remove them from the list of unknown columns
    /// and add them to the columns list, so that they are also considered in further processing.
    if (storage)
    {
        for (auto it = unknown_required_source_columns.begin(); it != unknown_required_source_columns.end();)
        {
            if (storage->hasColumn(*it))
            {
                source_columns.push_back(storage->getColumn(*it));
                unknown_required_source_columns.erase(it++);
            }
            else
                ++it;
        }
    }

    if (!unknown_required_source_columns.empty())
        throw Exception("Unknown identifier: " + *unknown_required_source_columns.begin(), ErrorCodes::UNKNOWN_IDENTIFIER);
}

void ExpressionAnalyzer::collectJoinedColumns(NameSet & joined_columns, NamesAndTypesList & joined_columns_name_type)
{
    if (!select_query)
        return;

    const ASTTablesInSelectQueryElement * node = select_query->join();

    if (!node)
        return;

    const ASTTableJoin & table_join = static_cast<const ASTTableJoin &>(*node->table_join);
    const ASTTableExpression & table_expression = static_cast<const ASTTableExpression &>(*node->table_expression);

    Block nested_result_sample;
    if (table_expression.database_and_table_name)
    {
        auto database_table = getDatabaseAndTableNameFromIdentifier(static_cast<const ASTIdentifier &>(*table_expression.database_and_table_name));
        const auto & table = context.getTable(database_table.first, database_table.second);
        nested_result_sample = table->getSampleBlockNonMaterialized();
    }
    else if (table_expression.subquery)
    {
        const auto & subquery = table_expression.subquery->children.at(0);
        nested_result_sample = InterpreterSelectWithUnionQuery::getSampleBlock(subquery, context);
    }

    if (table_join.using_expression_list)
    {
        auto & keys = typeid_cast(*table_join.using_expression_list);
        for (const auto & key : keys.children)
        {
            if (join_key_names_left.end() == std::find(join_key_names_left.begin(), join_key_names_left.end(), key->getColumnName()))
                join_key_names_left.push_back(key->getColumnName());
            else
                throw Exception("Duplicate column " + key->getColumnName() + " in USING list", ErrorCodes::DUPLICATE_COLUMN);

            if (join_key_names_right.end() == std::find(join_key_names_right.begin(), join_key_names_right.end(), key->getAliasOrColumnName()))
                join_key_names_right.push_back(key->getAliasOrColumnName());
            else
                throw Exception("Duplicate column " + key->getAliasOrColumnName() + " in USING list", ErrorCodes::DUPLICATE_COLUMN);
        }
    }

    for (const auto i : ext::range(0, nested_result_sample.columns()))
    {
        const auto & col = nested_result_sample.safeGetByPosition(i);
        if (join_key_names_right.end() == std::find(join_key_names_right.begin(), join_key_names_right.end(), col.name)
            && !joined_columns.count(col.name)) /// Duplicate columns in the subquery for JOIN do not make sense.
        {
            joined_columns.insert(col.name);

            bool make_nullable = settings.join_use_nulls && (table_join.kind == ASTTableJoin::Kind::Left || table_join.kind == ASTTableJoin::Kind::Full);
            joined_columns_name_type.emplace_back(col.name, make_nullable ? makeNullable(col.type) : col.type);
        }
    }
}


Names ExpressionAnalyzer::getRequiredSourceColumns() const
{
    return source_columns.getNames();
}


void ExpressionAnalyzer::getRequiredSourceColumnsImpl(const ASTPtr & ast,
    const NameSet & available_columns, NameSet & required_source_columns, NameSet & ignored_names,
    const NameSet & available_joined_columns, NameSet & required_joined_columns)
{
    /** Find all the identifiers in the query.
      * We will use depth first search in AST.
      * In this case
      * - for lambda functions we will not take formal parameters;
      * - do not go into subqueries (they have their own identifiers);
      * - there is an exception for the ARRAY JOIN clause (it has slightly different identifiers);
      * - we put identifiers available from JOIN in required_joined_columns.
      */

    if (ASTIdentifier * node = typeid_cast<ASTIdentifier *>(ast.get()))
    {
        if (node->kind == ASTIdentifier::Column
            && !ignored_names.count(node->name)
            && !ignored_names.count(Nested::extractTableName(node->name)))
        {
            if (!available_joined_columns.count(node->name)
                || available_columns.count(node->name)) /// Read the column from the left table if it has one.
                required_source_columns.insert(node->name);
            else
                required_joined_columns.insert(node->name);
        }

        return;
    }

    if (ASTFunction * node = typeid_cast<ASTFunction *>(ast.get()))
    {
        if (node->name == "lambda")
        {
            if (node->arguments->children.size() != 2)
                throw Exception("lambda requires two arguments", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH);

            ASTFunction * lambda_args_tuple = typeid_cast<ASTFunction *>(node->arguments->children.at(0).get());

            if (!lambda_args_tuple || lambda_args_tuple->name != "tuple")
                throw Exception("First argument of lambda must be a tuple", ErrorCodes::TYPE_MISMATCH);

            /// You do not need to add formal parameters of the lambda expression in required_source_columns.
            Names added_ignored;
            for (auto & child : lambda_args_tuple->arguments->children)
            {
                ASTIdentifier * identifier = typeid_cast<ASTIdentifier *>(child.get());
                if (!identifier)
                    throw Exception("lambda argument declarations must be identifiers", ErrorCodes::TYPE_MISMATCH);

                String & name = identifier->name;
                if (!ignored_names.count(name))
                {
                    ignored_names.insert(name);
                    added_ignored.push_back(name);
                }
            }

            getRequiredSourceColumnsImpl(node->arguments->children.at(1),
                available_columns, required_source_columns, ignored_names,
                available_joined_columns, required_joined_columns);

            for (size_t i = 0; i < added_ignored.size(); ++i)
                ignored_names.erase(added_ignored[i]);

            return;
        }

        /// A special function `indexHint`. Everything that is inside it is not calculated
        /// (and is used only for index analysis, see PKCondition).
        if (node->name == "indexHint")
            return;
    }

    /// Recursively traverse the expression.
    for (auto & child : ast->children)
    {
        /** We will not go to the ARRAY JOIN section, because we need to look at the names of non-ARRAY-JOIN columns.
          * There, `collectUsedColumns` will send us separately.
          */
        if (!typeid_cast(child.get())
            && !typeid_cast(child.get())
            && !typeid_cast(child.get()))
            getRequiredSourceColumnsImpl(child, available_columns, required_source_columns,
                ignored_names, available_joined_columns, required_joined_columns);
    }
}


static bool hasArrayJoin(const ASTPtr & ast)
{
    if (const ASTFunction * function = typeid_cast<const ASTFunction *>(&*ast))
        if (function->name == "arrayJoin")
            return true;

    for (const auto & child : ast->children)
        if (!typeid_cast(child.get()) && hasArrayJoin(child))
            return true;

    return false;
}


void ExpressionAnalyzer::removeUnneededColumnsFromSelectClause()
{
    if (!select_query)
        return;

    if (required_result_columns.empty() || select_query->distinct)
        return;

    ASTs & elements = select_query->select_expression_list->children;

    elements.erase(std::remove_if(elements.begin(), elements.end(), [this](const auto & node)
    {
        return !required_result_columns.count(node->getAliasOrColumnName()) && !hasArrayJoin(node);
    }), elements.end());
}

}