Merge pull request #59088 from Blargian/#31363_format_template_configure_in_settings

#31363 format template configure in settings
This commit is contained in:
Kruglov Pavel 2024-02-14 13:29:13 +01:00 committed by GitHub
commit c8c9adbd8b
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
10 changed files with 123 additions and 18 deletions

View File

@ -253,7 +253,7 @@ This format is also available under the name `TSVRawWithNamesAndNames`.
This format allows specifying a custom format string with placeholders for values with a specified escaping rule.
It uses settings `format_template_resultset`, `format_template_row`, `format_template_rows_between_delimiter` and some settings of other formats (e.g. `output_format_json_quote_64bit_integers` when using `JSON` escaping, see further)
It uses settings `format_template_resultset`, `format_template_row` (`format_template_row_format`), `format_template_rows_between_delimiter` and some settings of other formats (e.g. `output_format_json_quote_64bit_integers` when using `JSON` escaping, see further)
Setting `format_template_row` specifies the path to the file containing format strings for rows with the following syntax:
@ -279,9 +279,11 @@ the values of `SearchPhrase`, `c` and `price` columns, which are escaped as `Quo
`Search phrase: 'bathroom interior design', count: 2166, ad price: $3;`
In cases where it is challenging or not possible to deploy format output configuration for the template format to a directory on all nodes in a cluster, or if the format is trivial then `format_template_row_format` can be used to set the template string directly in the query, rather than a path to the file which contains it.
The `format_template_rows_between_delimiter` setting specifies the delimiter between rows, which is printed (or expected) after every row except the last one (`\n` by default)
Setting `format_template_resultset` specifies the path to the file, which contains a format string for resultset. Format string for resultset has the same syntax as a format string for row and allows to specify a prefix, a suffix and a way to print some additional information. It contains the following placeholders instead of column names:
Setting `format_template_resultset` specifies the path to the file, which contains a format string for resultset. Setting `format_template_resultset_format` can be used to set the template string for the result set directly in the query itself. Format string for resultset has the same syntax as a format string for row and allows to specify a prefix, a suffix and a way to print some additional information. It contains the following placeholders instead of column names:
- `data` is the rows with data in `format_template_row` format, separated by `format_template_rows_between_delimiter`. This placeholder must be the first placeholder in the format string.
- `totals` is the row with total values in `format_template_row` format (when using WITH TOTALS)

View File

@ -1662,6 +1662,10 @@ Result:
Path to file which contains format string for result set (for Template format).
### format_template_resultset_format {#format_template_resultset_format}
Format string for result set (for Template format)
### format_template_row {#format_template_row}
Path to file which contains format string for rows (for Template format).
@ -1670,6 +1674,10 @@ Path to file which contains format string for rows (for Template format).
Delimiter between rows (for Template format).
### format_template_row_format {#format_template_row_format}
Format string for rows (for Template format)
## CustomSeparated format settings {custom-separated-format-settings}
### format_custom_escaping_rule {#format_custom_escaping_rule}

View File

@ -201,7 +201,7 @@ SELECT * FROM nestedt FORMAT TSV
Этот формат позволяет указать произвольную форматную строку, в которую подставляются значения, сериализованные выбранным способом.
Для этого используются настройки `format_template_resultset`, `format_template_row`, `format_template_rows_between_delimiter` и настройки экранирования других форматов (например, `output_format_json_quote_64bit_integers` при экранировании как в `JSON`, см. далее)
Для этого используются настройки `format_template_resultset`, `format_template_row` (`format_template_row_format`), `format_template_rows_between_delimiter` и настройки экранирования других форматов (например, `output_format_json_quote_64bit_integers` при экранировании как в `JSON`, см. далее)
Настройка `format_template_row` задаёт путь к файлу, содержащему форматную строку для строк таблицы, которая должна иметь вид:
@ -227,9 +227,11 @@ SELECT * FROM nestedt FORMAT TSV
`Search phrase: 'bathroom interior design', count: 2166, ad price: $3;`
В тех случаях, когда не удобно или не возможно указать произвольную форматную строку в файле, можно использовать `format_template_row_format` указать произвольную форматную строку в запросе.
Настройка `format_template_rows_between_delimiter` задаёт разделитель между строками, который выводится (или ожмдается при вводе) после каждой строки, кроме последней. По умолчанию `\n`.
Настройка `format_template_resultset` задаёт путь к файлу, содержащему форматную строку для результата. Форматная строка для результата имеет синтаксис аналогичный форматной строке для строк таблицы и позволяет указать префикс, суффикс и способ вывода дополнительной информации. Вместо имён столбцов в ней указываются следующие имена подстановок:
Настройка `format_template_resultset` задаёт путь к файлу, содержащему форматную строку для результата. Настройка `format_template_resultset_format` используется для установки форматной строки для результата непосредственно в запросе. Форматная строка для результата имеет синтаксис аналогичный форматной строке для строк таблицы и позволяет указать префикс, суффикс и способ вывода дополнительной информации. Вместо имён столбцов в ней указываются следующие имена подстановок:
- `data` - строки с данными в формате `format_template_row`, разделённые `format_template_rows_between_delimiter`. Эта подстановка должна быть первой подстановкой в форматной строке.
- `totals` - строка с тотальными значениями в формате `format_template_row` (при использовании WITH TOTALS)

View File

@ -1093,6 +1093,8 @@ class IColumn;
M(String, format_schema, "", "Schema identifier (used by schema-based formats)", 0) \
M(String, format_template_resultset, "", "Path to file which contains format string for result set (for Template format)", 0) \
M(String, format_template_row, "", "Path to file which contains format string for rows (for Template format)", 0) \
M(String, format_template_row_format, "", "Format string for rows (for Template format)", 0) \
M(String, format_template_resultset_format, "", "Format string for result set (for Template format)", 0) \
M(String, format_template_rows_between_delimiter, "\n", "Delimiter between rows (for Template format)", 0) \
\
M(EscapingRule, format_custom_escaping_rule, "Escaped", "Field escaping rule (for CustomSeparated format)", 0) \

View File

@ -91,6 +91,8 @@ static std::map<ClickHouseVersion, SettingsChangesHistory::SettingsChanges> sett
{"async_insert_busy_timeout_max_ms", 200, 200, "The minimum value of the asynchronous insert timeout in milliseconds; async_insert_busy_timeout_ms is aliased to async_insert_busy_timeout_max_ms"},
{"async_insert_busy_timeout_increase_rate", 0.2, 0.2, "The exponential growth rate at which the adaptive asynchronous insert timeout increases"},
{"async_insert_busy_timeout_decrease_rate", 0.2, 0.2, "The exponential growth rate at which the adaptive asynchronous insert timeout decreases"},
{"format_template_row_format", "", "", "Template row format string can be set directly in query"},
{"format_template_resultset_format", "", "", "Template result set format string can be set in query"},
{"split_parts_ranges_into_intersecting_and_non_intersecting_final", true, true, "Allow to split parts ranges into intersecting and non intersecting during FINAL optimization"},
{"split_intersecting_parts_ranges_into_layers_final", true, true, "Allow to split intersecting parts ranges into layers during FINAL optimization"},
{"azure_max_single_part_copy_size", 256*1024*1024, 256*1024*1024, "The maximum size of object to copy using single part copy to Azure blob storage."},

View File

@ -166,6 +166,8 @@ FormatSettings getFormatSettings(ContextPtr context, const Settings & settings)
format_settings.template_settings.resultset_format = settings.format_template_resultset;
format_settings.template_settings.row_between_delimiter = settings.format_template_rows_between_delimiter;
format_settings.template_settings.row_format = settings.format_template_row;
format_settings.template_settings.row_format_template = settings.format_template_row_format;
format_settings.template_settings.resultset_format_template = settings.format_template_resultset_format;
format_settings.tsv.crlf_end_of_line = settings.output_format_tsv_crlf_end_of_line;
format_settings.tsv.empty_as_default = settings.input_format_tsv_empty_as_default;
format_settings.tsv.enum_as_number = settings.input_format_tsv_enum_as_number;

View File

@ -338,6 +338,8 @@ struct FormatSettings
String resultset_format;
String row_format;
String row_between_delimiter;
String row_format_template;
String resultset_format_template;
} template_settings;
struct

View File

@ -11,6 +11,7 @@ namespace DB
namespace ErrorCodes
{
extern const int SYNTAX_ERROR;
extern const int INVALID_TEMPLATE_FORMAT;
}
TemplateBlockOutputFormat::TemplateBlockOutputFormat(const Block & header_, WriteBuffer & out_, const FormatSettings & settings_,
@ -193,13 +194,25 @@ void registerOutputFormatTemplate(FormatFactory & factory)
const FormatSettings & settings)
{
ParsedTemplateFormatString resultset_format;
auto idx_resultset_by_name = [&](const String & partName)
{
return static_cast<size_t>(TemplateBlockOutputFormat::stringToResultsetPart(partName));
};
if (settings.template_settings.resultset_format.empty())
{
/// Default format string: "${data}"
resultset_format.delimiters.resize(2);
resultset_format.escaping_rules.emplace_back(ParsedTemplateFormatString::EscapingRule::None);
resultset_format.format_idx_to_column_idx.emplace_back(0);
resultset_format.column_names.emplace_back("data");
if (settings.template_settings.resultset_format_template.empty())
{
resultset_format.delimiters.resize(2);
resultset_format.escaping_rules.emplace_back(ParsedTemplateFormatString::EscapingRule::None);
resultset_format.format_idx_to_column_idx.emplace_back(0);
resultset_format.column_names.emplace_back("data");
}
else
{
resultset_format = ParsedTemplateFormatString();
resultset_format.parse(settings.template_settings.resultset_format_template, idx_resultset_by_name);
}
}
else
{
@ -207,20 +220,34 @@ void registerOutputFormatTemplate(FormatFactory & factory)
resultset_format = ParsedTemplateFormatString(
FormatSchemaInfo(settings.template_settings.resultset_format, "Template", false,
settings.schema.is_server, settings.schema.format_schema_path),
[&](const String & partName)
{
return static_cast<size_t>(TemplateBlockOutputFormat::stringToResultsetPart(partName));
});
idx_resultset_by_name);
if (!settings.template_settings.resultset_format_template.empty())
{
throw Exception(DB::ErrorCodes::INVALID_TEMPLATE_FORMAT, "Expected either format_template_resultset or format_template_resultset_format, but not both");
}
}
ParsedTemplateFormatString row_format = ParsedTemplateFormatString(
ParsedTemplateFormatString row_format;
auto idx_row_by_name = [&](const String & colName)
{
return sample.getPositionByName(colName);
};
if (settings.template_settings.row_format.empty())
{
row_format = ParsedTemplateFormatString();
row_format.parse(settings.template_settings.row_format_template, idx_row_by_name);
}
else
{
row_format = ParsedTemplateFormatString(
FormatSchemaInfo(settings.template_settings.row_format, "Template", false,
settings.schema.is_server, settings.schema.format_schema_path),
[&](const String & colName)
{
return sample.getPositionByName(colName);
});
idx_row_by_name);
if (!settings.template_settings.row_format_template.empty())
{
throw Exception(DB::ErrorCodes::INVALID_TEMPLATE_FORMAT, "Expected either format_template_row or format_template_row_format, but not both");
}
}
return std::make_shared<TemplateBlockOutputFormat>(sample, buf, settings, resultset_format, row_format, settings.template_settings.row_between_delimiter);
});

View File

@ -0,0 +1,9 @@
Question: 'How awesome is clickhouse?', Answer: 'unbelievably awesome!', Number of Likes: 456, Date: 2016-01-02;
Question: 'How fast is clickhouse?', Answer: 'Lightning fast!', Number of Likes: 9876543210, Date: 2016-01-03;
Question: 'Is it opensource?', Answer: 'of course it is!', Number of Likes: 789, Date: 2016-01-04
===== Results =====
Question: 'How awesome is clickhouse?', Answer: 'unbelievably awesome!', Number of Likes: 456, Date: 2016-01-02;
Question: 'How fast is clickhouse?', Answer: 'Lightning fast!', Number of Likes: 9876543210, Date: 2016-01-03;
Question: 'Is it opensource?', Answer: 'of course it is!', Number of Likes: 789, Date: 2016-01-04
===================

View File

@ -0,0 +1,49 @@
#!/usr/bin/env bash
# shellcheck disable=SC2016
CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh
# Test format_template_row_format setting
$CLICKHOUSE_CLIENT --query="DROP TABLE IF EXISTS template";
$CLICKHOUSE_CLIENT --query="CREATE TABLE template (question String, answer String, likes UInt64, date Date) ENGINE = Memory";
$CLICKHOUSE_CLIENT --query="INSERT INTO template VALUES
('How awesome is clickhouse?', 'unbelievably awesome!', 456, '2016-01-02'),\
('How fast is clickhouse?', 'Lightning fast!', 9876543210, '2016-01-03'),\
('Is it opensource?', 'of course it is!', 789, '2016-01-04')";
$CLICKHOUSE_CLIENT --query="SELECT * FROM template GROUP BY question, answer, likes, date WITH TOTALS ORDER BY date LIMIT 3 FORMAT Template SETTINGS \
format_template_row_format = 'Question: \${question:Quoted}, Answer: \${answer:Quoted}, Number of Likes: \${likes:Raw}, Date: \${date:Raw}', \
format_template_rows_between_delimiter = ';\n'";
echo -e "\n"
# Test that if both format_template_row_format setting and format_template_row are provided, error is thrown
row_format_file="$CURDIR"/"${CLICKHOUSE_TEST_UNIQUE_NAME}"_template_output_format_row.tmp
echo -ne 'Question: ${question:Quoted}, Answer: ${answer:Quoted}, Number of Likes: ${likes:Raw}, Date: ${date:Raw}' > $row_format_file
$CLICKHOUSE_CLIENT --multiline --multiquery --query "SELECT * FROM template GROUP BY question, answer, likes, date WITH TOTALS ORDER BY date LIMIT 3 FORMAT Template SETTINGS \
format_template_row = '$row_format_file', \
format_template_row_format = 'Question: \${question:Quoted}, Answer: \${answer:Quoted}, Number of Likes: \${likes:Raw}, Date: \${date:Raw}', \
format_template_rows_between_delimiter = ';\n'; --{clientError 474}"
# Test format_template_resultset_format setting
$CLICKHOUSE_CLIENT --query="SELECT * FROM template GROUP BY question, answer, likes, date WITH TOTALS ORDER BY date LIMIT 3 FORMAT Template SETTINGS \
format_template_row_format = 'Question: \${question:Quoted}, Answer: \${answer:Quoted}, Number of Likes: \${likes:Raw}, Date: \${date:Raw}', \
format_template_resultset_format = '===== Results ===== \n\${data}\n===================\n', \
format_template_rows_between_delimiter = ';\n'";
# Test that if both format_template_result_format setting and format_template_resultset are provided, error is thrown
resultset_output_file="$CURDIR"/"$CLICKHOUSE_TEST_UNIQUE_NAME"_template_output_format_resultset.tmp
echo -ne '===== Resultset ===== \n \${data} \n ===============' > $resultset_output_file
$CLICKHOUSE_CLIENT --multiline --multiquery --query "SELECT * FROM template GROUP BY question, answer, likes, date WITH TOTALS ORDER BY date LIMIT 3 FORMAT Template SETTINGS \
format_template_resultset = '$resultset_output_file', \
format_template_resultset_format = '===== Resultset ===== \n \${data} \n ===============', \
format_template_row_format = 'Question: \${question:Quoted}, Answer: \${answer:Quoted}, Number of Likes: \${likes:Raw}, Date: \${date:Raw}', \
format_template_rows_between_delimiter = ';\n'; --{clientError 474}"
$CLICKHOUSE_CLIENT --query="DROP TABLE template";
rm $row_format_file
rm $resultset_output_file