
ClickHouse query analysis

Here we consider only SELECT queries, i.e. queries that read data from a table. The built-in ClickHouse parser takes a query string as input. A SELECT statement has 14 main clauses: WITH, SELECT, TABLES, PREWHERE, WHERE, GROUP_BY, HAVING, ORDER_BY, LIMIT_BY_OFFSET, LIMIT_BY_LENGTH, LIMIT_BY, LIMIT_OFFSET, LIMIT_LENGTH, SETTINGS. Of these, we analyze the SELECT, TABLES, WHERE, GROUP_BY, HAVING and ORDER_BY clauses, because most of the data is there. We need this data to analyze the structure and to identify values. After parsing a query, the parser produces a tree in which each node is a specific query execution operation, a function over values, a constant, a designation, etc. Nodes also have subtrees that hold their arguments or suboperations. We extract the data we need by traversing this tree.

Schema analysis

We need to determine the possible tables from a query. Given a query string, we can tell which parts of it are table names, and so determine how many tables our database contains. In the ClickHouse parser, TABLES (Figure 1) is the query subtree responsible for the tables we read data from. It contains the main table that the columns come from, as well as the JOIN operations performed in the query. Traversing all nodes in the subtree, we collect the names of the tables and of the databases that contain them, as well as their aliases, i.e. the shortened names chosen by the query author. We may need these names later to determine which table a column belongs to. Thus, we get the set of databases for the query, as well as the tables and the aliases through which the query refers to them.
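The traversal of the TABLES subtree can be sketched roughly as follows. This is only an illustration: the `Node` class, its fields and the node kinds are hypothetical stand-ins, not the ClickHouse parser API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parser tree node (not the ClickHouse AST)."""
    kind: str                      # assumed kinds: "TABLES", "JOIN", "TABLE"
    database: str = ""
    name: str = ""
    alias: str = ""
    children: list = field(default_factory=list)


def collect_tables(node, found=None):
    """Visit every node of the subtree and record (database, table, alias)."""
    if found is None:
        found = []
    if node.kind == "TABLE":
        found.append((node.database, node.name, node.alias))
    for child in node.children:
        collect_tables(child, found)
    return found


# Toy subtree for: FROM db1.table1 AS t1 JOIN db2.table2 AS t2
tables_subtree = Node("TABLES", children=[
    Node("TABLE", database="db1", name="table1", alias="t1"),
    Node("JOIN", children=[
        Node("TABLE", database="db2", name="table2", alias="t2"),
    ]),
])

tables = collect_tables(tables_subtree)
main_table = tables[0][1]          # assumption: the first table found is the main one
aliases = {alias: name for _, name, alias in tables if alias}
databases = {database for database, _, _ in tables if database}
```

The alias map is kept separately because it is needed later to translate aliased column references back to the original table names.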

Then we need to define the set of columns that appear in the query and the tables they can refer to. During normal query execution the set of columns in each table is already known, so the program links each column to its table automatically at runtime. In our case, however, a column cannot always be attributed unambiguously to a specific table. Consider the query SELECT column1, column2, column3 FROM table1 JOIN table2 on table1.column2 = table2.column3. Here we can say which table column2 and column3 belong to, but column1 can belong to either the first or the second table. To interpret such cases unambiguously, we assign undefined columns to the main table of the query; in this example that is table1.

All columns in the tree are in IDENTIFIER type nodes, which are in the SELECT, TABLES, WHERE, GROUP_BY, HAVING, ORDER_BY subtrees. We collect all of them by recursively traversing these subtrees, then split each column into its constituents: the table (if it is explicitly specified with a dot) and the name. Since the table can be an alias, we then replace the alias with the original table name. Finally, columns with no explicit table are assigned to the main query table. We now have a list of all the columns and the tables they belong to.
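A minimal sketch of the three resolution steps (split on the dot, replace aliases, fall back to the main table) might look like this. The function name `resolve_columns` and the aliased identifiers in the example are assumptions made for illustration, not names from the tool.

```python
def resolve_columns(identifiers, aliases, main_table):
    """Map each column identifier to a (table, column) pair."""
    resolved = set()
    for identifier in identifiers:
        # "t1.column2" -> ("t1", ".", "column2"); "column1" -> ("", "", "column1")
        table, dot, column = identifier.rpartition(".")
        if not dot:
            table = main_table                  # no explicit table: use the main table
        else:
            table = aliases.get(table, table)   # replace an alias with the real name
        resolved.add((table, column))
    return sorted(resolved)


# Identifiers as they might be collected from a query such as
# SELECT column1 FROM table1 AS t1 JOIN table2 AS t2 ON t1.column2 = t2.column3
aliases = {"t1": "table1", "t2": "table2"}
identifiers = ["column1", "t1.column2", "t2.column3", "table2.column4", "t1.column2"]
columns = resolve_columns(identifiers, aliases, main_table="table1")
```

A set is used so that a column mentioned several times in the query is recorded only once.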

Column analysis

Then we need to define data types for the columns that are used together with values in the query. An example is the WHERE clause, which tests boolean expressions over its attributes. If the query specifies column > 5, we can conclude that this column contains a numeric value; if the LIKE expression is applied to an attribute, then the attribute has a string type. In this part, we need to extract such expressions from a query and match data types to columns where possible. It is not always possible to decide the type of an attribute unambiguously from the available values. For example, column > 5 is compatible with many numeric types such as UINT8, UINT32, INT32, INT64, etc. Iterating over all possible types would take a long time, so we fix one interpretation per kind of value: INT64 and FLOAT64 for numeric values, STRING for strings, DATE and DATETIME for dates, and ARRAY.

We can determine column types from the boolean, arithmetic and other functions that the query applies to the columns. Such functions are in the SELECT and WHERE subtrees. A function parameter can be a constant, a column or another function (Figure 2). Thus, the following parameters can help to understand the type of the column:

  • The types of arguments that a function can take. For example, the TOSTARTOFMINUTE function (rounds a time down to the start of the minute) can only accept DATETIME, so if the argument of this function is a column, then this column has DATETIME type.
  • The types of the remaining arguments in this function. For example, the EQUALS function requires its arguments to have the same type, so if it is applied to a constant and a column, then we can define the type of the column as the type of the constant.

Thus, for each function we define the possible argument types, the return type, the parameters, and the arguments that must share an identical type. A recursive function handler determines the possible types of the columns used in these functions from the values of the arguments, and then returns the possible types of the function's result. Now, for each column, we have a set of possible value types. We choose one specific type from this set to interpret the query unambiguously.
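The recursive handler can be illustrated with a small sketch. Everything here is an assumption made for the example: the tuple encoding of expressions, the signature tables `FIXED_ARGS` and `SAME_TYPE_ARGS`, the choice of INT64 as the result of comparisons, and the preference order used to pick the final type.

```python
ANY = {"INT64", "FLOAT64", "STRING", "DATE", "DATETIME"}

# Hypothetical signatures: function -> (types allowed for every argument, result types)
FIXED_ARGS = {
    "toStartOfMinute": ({"DATETIME"}, {"DATETIME"}),
    "like": ({"STRING"}, {"INT64"}),
    "plus": ({"INT64", "FLOAT64"}, {"INT64", "FLOAT64"}),
}
# Functions whose arguments must all have the same type (assumed boolean-like result)
SAME_TYPE_ARGS = {"equals", "greater", "less"}

PREFERENCE = ("INT64", "FLOAT64", "STRING", "DATETIME", "DATE")


def infer(expr, column_types):
    """Return the possible result types of expr, narrowing column_types in place."""
    kind = expr[0]
    if kind == "const":
        value = expr[1]
        if isinstance(value, str):
            return {"STRING"}
        return {"FLOAT64"} if isinstance(value, float) else {"INT64"}
    if kind == "column":
        return column_types.setdefault(expr[1], set(ANY))
    # Function call: ("func", name, arg1, arg2, ...)
    name, args = expr[1], expr[2:]
    arg_types = [infer(arg, column_types) for arg in args]
    if name in FIXED_ARGS:
        allowed, result = FIXED_ARGS[name]
    elif name in SAME_TYPE_ARGS:
        allowed, result = set.intersection(*arg_types), {"INT64"}
    else:
        raise ValueError(f"unknown function {name!r}")
    for arg in args:
        if arg[0] == "column":
            column_types[arg[1]] &= allowed    # keep only the types this function permits
    return set(result)


def choose(types):
    """Pick one concrete type from a set of candidates, or None if the set is empty."""
    return next((t for t in PREFERENCE if t in types), None)


# n > 5, name LIKE 'a%', toStartOfMinute(created) = updated
conditions = [
    ("func", "greater", ("column", "n"), ("const", 5)),
    ("func", "like", ("column", "name"), ("const", "a%")),
    ("func", "equals",
     ("func", "toStartOfMinute", ("column", "created")), ("column", "updated")),
]
column_types = {}
for condition in conditions:
    infer(condition, column_types)
chosen = {column: choose(types) for column, types in column_types.items()}
```

This is a single pass: a column narrowed by a later condition does not retroactively narrow columns it was compared with earlier, which a fuller implementation would need to handle.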

Column values definition

At this stage, we already have the structure of the database tables, and we need to fill these tables with values. We should understand which columns depend on each other when a function is executed (for example, a join is done on two columns, which means that they must have the same values). We also need to understand what values the columns must have to fulfill the various conditions during execution. To do this, we search for all comparison operations in our query. If the arguments of the operation are two columns, we consider them linked. If the arguments are a column and a value, we add that value to the possible values of the column, and also add the value with some noise. For a numeric type the noise is a random number, for a date it is a random number of days, etc. Each comparison operation needs its own handler, which generates at least two values: one that satisfies the operation condition and one that does not. For example, for the operation column1 > 5, column1 must be assigned a value greater than 5 and a value less than or equal to 5. The same is true for the operation column2 LIKE some% string: column2 must be assigned a value that satisfies the expression and one that does not.

Now we have a set of linked columns and a set of values. The link between columns is symmetric by construction, but for a complete definition we also need transitivity: if column1 = column2 and column2 = column3, then column1 = column3, yet this does not follow from the construction. Accordingly, we extend the links across all columns. We then combine the values of each column with the values of the columns linked to it. If a column has no values, we generate random values for it.
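The value handlers and the transitive extension of links can be sketched as below. This is an illustrative sketch under stated assumptions: a union-find structure is one common way to obtain the transitive closure and is not necessarily what the tool does, and the names `values_for_comparison`, `ColumnLinks` and `merge_values`, the noise range, and the integer-only handlers are all hypothetical.

```python
import random


def values_for_comparison(op, constant, rng):
    """Return (satisfying, non_satisfying) integers for `column <op> constant`."""
    noise = rng.randint(1, 100)        # at least 1, so the two values always differ
    if op == ">":
        return constant + noise, constant - noise
    if op == "<":
        return constant - noise, constant + noise
    if op == "=":
        return constant, constant + noise
    raise ValueError(f"no handler for operation {op!r}")


class ColumnLinks:
    """Union-find over column names: linked columns end up in one group."""

    def __init__(self):
        self.parent = {}

    def find(self, column):
        self.parent.setdefault(column, column)
        while self.parent[column] != column:
            self.parent[column] = self.parent[self.parent[column]]  # path halving
            column = self.parent[column]
        return column

    def link(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def merge_values(links, values, rng):
    """Give every column the union of the values of all columns linked to it."""
    groups = {}
    for column in list(links.parent) + list(values):
        groups.setdefault(links.find(column), set()).update(values.get(column, ()))
    merged = {}
    for column in list(links.parent):
        group = groups[links.find(column)]
        if not group:
            group.add(rng.randint(0, 100))     # no known values: generate a random one
        merged[column] = set(group)
    return merged


rng = random.Random(0)
links = ColumnLinks()
links.link("column1", "column2")               # column1 = column2
links.link("column2", "column3")               # column2 = column3
good, bad = values_for_comparison(">", 5, rng)  # column1 > 5
merged = merge_values(links, {"column1": {good, bad}, "column4": set()}, rng)
```

Here column3 receives the values generated for column1 even though the two were never compared directly, which is the transitivity the text asks for.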

Generation

We now have a complete view of the database schema as well as a set of values for each table. We generate data as the Cartesian product of the value sets of the columns of a specific table. Thus, for each table we get a set of rows built from the sets of values of its columns. We then generate the queries that create each table and fill it with data: a CREATE query based on the structure of the table and the types of its columns, and then an INSERT query over the set of values, which fills the table with data.
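The generation step can be sketched as follows, assuming the column types and value sets have already been computed. The helper names, the `ENGINE = Log` clause and the ClickHouse type spellings `Int64` and `String` are choices made for this example, not output copied from the tool.

```python
from itertools import product


def sql_literal(value):
    """Render a Python value as a SQL literal (strings quoted and escaped)."""
    if isinstance(value, str):
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return str(value)


def generate_statements(table, column_types, column_values):
    """Build a CREATE and an INSERT statement for one table."""
    columns = list(column_types)
    definitions = ", ".join(f"{c} {column_types[c]}" for c in columns)
    # The engine is an assumption for this sketch
    create = f"CREATE TABLE {table} ({definitions}) ENGINE = Log"
    # Cartesian product of the per-column value sets gives the rows
    rows = product(*(sorted(column_values[c]) for c in columns))
    tuples = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    insert = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {tuples}"
    return create, insert


create, insert = generate_statements(
    "table1",
    {"column1": "Int64", "column2": "String"},
    {"column1": {3, 9}, "column2": {"abc", "xyz"}},
)
```

Note that the number of rows is the product of the sizes of the value sets, so it grows quickly with the number of columns and values per column.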