Merge branch 'master' into yandex-to-clickhouse-in-configs

This commit is contained in:
mergify[bot] 2021-09-29 21:48:43 +00:00 committed by GitHub
commit 8852f933bc
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
8 changed files with 832 additions and 54 deletions

View File

@ -3,7 +3,7 @@ toc_priority: 21
toc_title: Menus
---
# New York Public Library "What's on the Menu?" Dataset
# New York Public Library "What's on the Menu?" Dataset {#menus-dataset}
The dataset is created by the New York Public Library. It contains historical data on the menus of hotels, restaurants and cafes with the dishes along with their prices.
@ -11,34 +11,38 @@ Source: http://menus.nypl.org/data
The data is in public domain.
The data is from library's archive and it may be incomplete and difficult for statistical analysis. Nevertheless it is also very yummy.
The size is just 1.3 million records about dishes in the menus (a very small data volume for ClickHouse, but it's still a good example).
The size is just 1.3 million records about dishes in the menus — it's a very small data volume for ClickHouse, but it's still a good example.
## Download the Dataset
## Download the Dataset {#download-dataset}
```
Run the command:
```bash
wget https://s3.amazonaws.com/menusdata.nypl.org/gzips/2021_08_01_07_01_17_data.tgz
```
Replace the link to the up to date link from http://menus.nypl.org/data if needed.
Download size is about 35 MB.
## Unpack the Dataset
## Unpack the Dataset {#unpack-dataset}
```
```bash
tar xvf 2021_08_01_07_01_17_data.tgz
```
Uncompressed size is about 150 MB.
The data is normalized consisted of four tables:
- Menu: information about menus: the name of the restaurant, the date when menu was seen, etc;
- Dish: information about dishes: the name of the dish along with some characteristic;
- MenuPage: information about the pages in the menus; every page belongs to some menu;
- MenuItem: an item of the menu - a dish along with its price on some menu page: links to dish and menu page.
- `Menu` — Information about menus: the name of the restaurant, the date when menu was seen, etc.
- `Dish` — Information about dishes: the name of the dish along with some characteristic.
- `MenuPage` — Information about the pages in the menus, because every page belongs to some menu.
- `MenuItem` — An item of the menu. A dish along with its price on some menu page: links to dish and menu page.
## Create the Tables
## Create the Tables {#create-tables}
```
We use [Decimal](../../sql-reference/data-types/decimal.md) data type to store prices.
```sql
CREATE TABLE dish
(
id UInt32,
@ -101,35 +105,33 @@ CREATE TABLE menu_item
) ENGINE = MergeTree ORDER BY id;
```
We use `Decimal` data type to store prices. Everything else is quite straightforward.
## Import the Data {#import-data}
## Import Data
Upload data into ClickHouse, run:
Upload data into ClickHouse:
```
```bash
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO dish FORMAT CSVWithNames" < Dish.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu FORMAT CSVWithNames" < Menu.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu_page FORMAT CSVWithNames" < MenuPage.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --date_time_input_format best_effort --query "INSERT INTO menu_item FORMAT CSVWithNames" < MenuItem.csv
```
We use `CSVWithNames` format as the data is represented by CSV with header.
We use [CSVWithNames](../../interfaces/formats.md#csvwithnames) format as the data is represented by CSV with header.
We disable `format_csv_allow_single_quotes` as only double quotes are used for data fields and single quotes can be inside the values and should not confuse the CSV parser.
We disable `input_format_null_as_default` as our data does not have NULLs. Otherwise ClickHouse will try to parse `\N` sequences and can be confused with `\` in data.
We disable [input_format_null_as_default](../../operations/settings/settings.md#settings-input-format-null-as-default) as our data does not have [NULL](../../sql-reference/syntax.md#null-literal). Otherwise ClickHouse will try to parse `\N` sequences and can be confused with `\` in data.
The setting `--date_time_input_format best_effort` allows to parse `DateTime` fields in wide variety of formats. For example, ISO-8601 without seconds like '2000-01-01 01:02' will be recognized. Without this setting only fixed DateTime format is allowed.
The setting [date_time_input_format best_effort](../../operations/settings/settings.md#settings-date_time_input_format) allows to parse [DateTime](../../sql-reference/data-types/datetime.md) fields in wide variety of formats. For example, ISO-8601 without seconds like '2000-01-01 01:02' will be recognized. Without this setting only fixed DateTime format is allowed.
## Denormalize the Data
## Denormalize the Data {#denormalize-data}
Data is presented in multiple tables in normalized form. It means you have to perform JOINs if you want to query, e.g. dish names from menu items.
For typical analytical tasks it is way more efficient to deal with pre-JOINed data to avoid doing JOIN every time. It is called "denormalized" data.
Data is presented in multiple tables in [normalized form](https://en.wikipedia.org/wiki/Database_normalization#Normal_forms). It means you have to perform [JOIN](../../sql-reference/statements/select/join.md#select-join) if you want to query, e.g. dish names from menu items.
For typical analytical tasks it is way more efficient to deal with pre-JOINed data to avoid doing `JOIN` every time. It is called "denormalized" data.
We will create a table that will contain all the data JOINed together:
We will create a table `menu_item_denorm` where will contain all the data JOINed together:
```
```sql
CREATE TABLE menu_item_denorm
ENGINE = MergeTree ORDER BY (dish_name, created_at)
AS SELECT
@ -171,21 +173,32 @@ AS SELECT
FROM menu_item
JOIN dish ON menu_item.dish_id = dish.id
JOIN menu_page ON menu_item.menu_page_id = menu_page.id
JOIN menu ON menu_page.menu_id = menu.id
JOIN menu ON menu_page.menu_id = menu.id;
```
## Validate the Data
## Validate the Data {#validate-data}
```
SELECT count() FROM menu_item_denorm
1329175
Query:
```sql
SELECT count() FROM menu_item_denorm;
```
## Run Some Queries
Averaged historical prices of dishes:
Result:
```text
┌─count()─┐
│ 1329175 │
└─────────┘
```
## Run Some Queries {#run-queries}
### Averaged historical prices of dishes {#query-averaged-historical-prices}
Query:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
@ -194,8 +207,12 @@ SELECT
FROM menu_item_denorm
WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022)
GROUP BY d
ORDER BY d ASC
ORDER BY d ASC;
```
Result:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 100, 100)─┐
│ 1850 │ 618 │ 1.5 │ █▍ │
│ 1860 │ 1634 │ 1.29 │ █▎ │
@ -215,15 +232,15 @@ ORDER BY d ASC
│ 2000 │ 2467 │ 11.85 │ ███████████▋ │
│ 2010 │ 597 │ 25.66 │ █████████████████████████▋ │
└──────┴─────────┴──────────────────────┴──────────────────────────────┘
17 rows in set. Elapsed: 0.044 sec. Processed 1.33 million rows, 54.62 MB (30.00 million rows/s., 1.23 GB/s.)
```
Take it with a grain of salt.
### Burger Prices:
### Burger Prices {#query-burger-prices}
```
Query:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
@ -232,8 +249,12 @@ SELECT
FROM menu_item_denorm
WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%burger%')
GROUP BY d
ORDER BY d ASC
ORDER BY d ASC;
```
Result:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)───────────┐
│ 1880 │ 2 │ 0.42 │ ▋ │
│ 1890 │ 7 │ 0.85 │ █▋ │
@ -250,13 +271,13 @@ ORDER BY d ASC
│ 2000 │ 21 │ 7.14 │ ██████████████▎ │
│ 2010 │ 6 │ 18.42 │ ████████████████████████████████████▋ │
└──────┴─────────┴──────────────────────┴───────────────────────────────────────┘
14 rows in set. Elapsed: 0.052 sec. Processed 1.33 million rows, 94.15 MB (25.48 million rows/s., 1.80 GB/s.)
```
### Vodka:
### Vodka {#query-vodka}
```
Query:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
@ -265,8 +286,12 @@ SELECT
FROM menu_item_denorm
WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%vodka%')
GROUP BY d
ORDER BY d ASC
ORDER BY d ASC;
```
Result:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)─┐
│ 1910 │ 2 │ 0 │ │
│ 1920 │ 1 │ 0.3 │ ▌ │
@ -282,11 +307,13 @@ ORDER BY d ASC
To get vodka we have to write `ILIKE '%vodka%'` and this definitely makes a statement.
### Caviar:
### Caviar {#query-caviar}
Let's print caviar prices. Also let's print a name of any dish with caviar.
```
Query:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
@ -296,8 +323,12 @@ SELECT
FROM menu_item_denorm
WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%caviar%')
GROUP BY d
ORDER BY d ASC
ORDER BY d ASC;
```
Result:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)──────┬─any(dish_name)──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1090 │ 1 │ 0 │ │ Caviar │
│ 1880 │ 3 │ 0 │ │ Caviar │
@ -319,6 +350,6 @@ ORDER BY d ASC
At least they have caviar with vodka. Very nice.
### Test it in Playground
## Online Playground {#playground}
The data is uploaded to ClickHouse Playground, [example](https://gh-api.clickhouse.com/play?user=play#U0VMRUNUCiAgICByb3VuZCh0b1VJbnQzMk9yWmVybyhleHRyYWN0KG1lbnVfZGF0ZSwgJ15cXGR7NH0nKSksIC0xKSBBUyBkLAogICAgY291bnQoKSwKICAgIHJvdW5kKGF2ZyhwcmljZSksIDIpLAogICAgYmFyKGF2ZyhwcmljZSksIDAsIDUwLCAxMDApLAogICAgYW55KGRpc2hfbmFtZSkKRlJPTSBtZW51X2l0ZW1fZGVub3JtCldIRVJFIChtZW51X2N1cnJlbmN5IElOICgnRG9sbGFycycsICcnKSkgQU5EIChkID4gMCkgQU5EIChkIDwgMjAyMikgQU5EIChkaXNoX25hbWUgSUxJS0UgJyVjYXZpYXIlJykKR1JPVVAgQlkgZApPUkRFUiBCWSBkIEFTQw==).

View File

@ -386,7 +386,7 @@ The CSV format supports the output of totals and extremes the same way as `TabSe
## CSVWithNames {#csvwithnames}
Also prints the header row, similar to `TabSeparatedWithNames`.
Also prints the header row, similar to [TabSeparatedWithNames](#tabseparatedwithnames).
## CustomSeparated {#format-customseparated}

View File

@ -45,7 +45,7 @@ Configuration template:
- `min_part_size` The minimum size of a data part.
- `min_part_size_ratio` The ratio of the data part size to the table size.
- `method` Compression method. Acceptable values: `lz4`, `lz4hc`, `zstd`.
- `level` Compression level. See [Codecs](../../sql-reference/statements/create/table/#create-query-general-purpose-codecs).
- `level` Compression level. See [Codecs](../../sql-reference/statements/create/table.md#create-query-general-purpose-codecs).
You can configure multiple `<case>` sections.

View File

@ -7,7 +7,8 @@ toc_title: DateTime64
Allows to store an instant in time, that can be expressed as a calendar date and a time of a day, with defined sub-second precision
Tick size (precision): 10<sup>-precision</sup> seconds
Tick size (precision): 10<sup>-precision</sup> seconds. Valid range: [ 0 : 9 ].
Typically are used - 3 (milliseconds), 6 (microseconds), 9 (nanoseconds).
**Syntax:**

View File

@ -0,0 +1,386 @@
---
toc_title: S2 Geometry
---
# Functions for Working with S2 Index {#s2Index}
[S2](https://s2geometry.io/) is a geographical indexing system where all geographical data is represented on a three-dimensional sphere (similar to a globe).
In the S2 library points are represented as unit length vectors called S2 point indices (points on the surface of a three dimensional unit sphere) as opposed to traditional (latitude, longitude) pairs.
## geoToS2 {#geoToS2}
Returns [S2](#s2index) point index corresponding to the provided coordinates `(longitude, latitude)`.
**Syntax**
``` sql
geoToS2(lon, lat)
```
**Arguments**
- `lon` — Longitude. [Float64](../../../sql-reference/data-types/float.md).
- `lat` — Latitude. [Float64](../../../sql-reference/data-types/float.md).
**Returned values**
- S2 point index.
Type: [UInt64](../../../sql-reference/data-types/int-uint.md).
**Example**
Query:
``` sql
SELECT geoToS2(37.79506683, 55.71290588) as s2Index;
```
Result:
``` text
┌─────────────s2Index─┐
│ 4704772434919038107 │
└─────────────────────┘
```
## s2ToGeo {#s2ToGeo}
Returns geo coordinates `(longitude, latitude)` corresponding to the provided [S2](#s2index) point index.
**Syntax**
``` sql
s2ToGeo(s2index)
```
**Arguments**
- `s2Index` — S2 Index. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- A tuple consisting of two values: `tuple(lon,lat)`.
Type: `lon` - [Float64](../../../sql-reference/data-types/float.md). `lat` — [Float64](../../../sql-reference/data-types/float.md).
**Example**
Query:
``` sql
SELECT s2ToGeo(4704772434919038107) as s2Coodrinates;
```
Result:
``` text
┌─s2Coodrinates────────────────────────┐
│ (37.79506681471008,55.7129059052841) │
└──────────────────────────────────────┘
```
## s2GetNeighbors {#s2GetNeighbors}
Returns S2 neighbor indices corresponding to the provided [S2](#s2index)). Each cell in the S2 system is a quadrilateral bounded by four geodesics. So, each cell has 4 neighbors.
**Syntax**
``` sql
s2GetNeighbors(s2index)
```
**Arguments**
- `s2index` — S2 Index. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- An array consisting of the 4 neighbor indices: `array[s2index1, s2index3, s2index2, s2index4]`.
Type: Each S2 index is [UInt64](../../../sql-reference/data-types/int-uint.md).
**Example**
Query:
``` sql
select s2GetNeighbors(5074766849661468672) AS s2Neighbors;
```
Result:
``` text
┌─s2Neighbors───────────────────────────────────────────────────────────────────────┐
│ [5074766987100422144,5074766712222515200,5074767536856236032,5074767261978329088] │
└───────────────────────────────────────────────────────────────────────────────────┘
```
## s2CellsIntersect {#s2CellsIntersect}
Determines if the two provided [S2](#s2index)) cell indices intersect or not.
**Syntax**
``` sql
s2CellsIntersect(s2index1, s2index2)
```
**Arguments**
- `siIndex1`, `s2index2` — S2 Index. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- 1 — If the S2 cell indices intersect.
- 0 — If the S2 cell indices don't intersect.
Type: [UInt8](../../../sql-reference/data-types/int-uint.md).
**Example**
Query:
``` sql
select s2CellsIntersect(9926595209846587392, 9926594385212866560) as intersect;
```
Result:
``` text
┌─intersect─┐
│ 1 │
└───────────┘
```
## s2CapContains {#s2CapContains}
A cap represents a portion of the sphere that has been cut off by a plane. It is defined by a point on a sphere and a radius in degrees.
Determines if a cap contains a s2 point index.
**Syntax**
``` sql
s2CapContains(center, degrees, point)
```
**Arguments**
- `center` - S2 point index corresponding to the cap. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `degrees` - Radius of the cap in degrees. [Float64](../../../sql-reference/data-types/float.md).
- `point` - S2 point index. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- 1 — If the cap contains the S2 point index.
- 0 — If the cap doesn't contain the S2 point index.
Type: [UInt8](../../../sql-reference/data-types/int-uint.md).
**Example**
Query:
``` sql
select s2CapContains(1157339245694594829, 1.0, 1157347770437378819) as capContains;
```
Result:
``` text
┌─capContains─┐
│ 1 │
└─────────────┘
```
## s2CapUnion {#s2CapUnion}
A cap represents a portion of the sphere that has been cut off by a plane. It is defined by a point on a sphere and a radius in degrees.
Determines the smallest cap that contains the given two input caps.
**Syntax**
``` sql
s2CapUnion(center1, radius1, center2, radius2)
```
**Arguments**
- `center1`, `center2` - S2 point indices corresponding to the two input caps. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `radius1`, `radius2` - Radii of the two input caps in degrees. [Float64](../../../sql-reference/data-types/float.md).
**Returned values**
- `center` - S2 point index corresponding the center of the smallest cap containing the two input caps. Type: [UInt64](../../../sql-reference/data-types/int-uint.md).
- `radius` - Radius of the smallest cap containing the two input caps. Type: [Float64](../../../sql-reference/data-types/float.md).
**Example**
Query:
``` sql
SELECT s2CapUnion(3814912406305146967, 1.0, 1157347770437378819, 1.0) AS capUnion;
```
Result:
``` text
┌─capUnion───────────────────────────────┐
│ (4534655147792050737,60.2088283994957) │
└────────────────────────────────────────┘
```
## s2RectAdd{#s2RectAdd}
In the S2 system, a rectangle is represented by a type of S2Region called a S2LatLngRect that represents a rectangle in latitude-longitude space.
Increases the size of the bounding rectangle to include the given S2 point index.
**Syntax**
``` sql
s2RectAdd(s2pointLow, s2pointHigh, s2Point)
```
**Arguments**
- `s2PointLow` - Low S2 point index corresponding to the rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2PointHigh` - High S2 point index corresponding to the rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2Point` - Target S2 point index that the bound rectangle should be grown to include. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- `s2PointLow` - Low S2 cell id corresponding to the grown rectangle. Type: [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2PointHigh` - Hight S2 cell id corresponding to the grown rectangle. Type: [UInt64](../../../sql-reference/data-types/float.md).
**Example**
Query:
``` sql
SELECT s2RectAdd(5178914411069187297, 5177056748191934217, 5179056748191934217) as rectAdd;
```
Result:
``` text
┌─rectAdd───────────────────────────────────┐
│ (5179062030687166815,5177056748191934217) │
└───────────────────────────────────────────┘
```
## s2RectContains{#s2RectContains}
In the S2 system, a rectangle is represented by a type of S2Region called a S2LatLngRect that represents a rectangle in latitude-longitude space.
Determines if a given rectangle contains a S2 point index.
**Syntax**
``` sql
s2RectContains(s2PointLow, s2PointHi, s2Point)
```
**Arguments**
- `s2PointLow` - Low S2 point index corresponding to the rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2PointHigh` - High S2 point index corresponding to the rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2Point` - Target S2 point index. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- 1 — If the rectangle contains the given S2 point.
- 0 — If the rectangle doesn't contain the given S2 point.
**Example**
Query:
``` sql
SELECT s2RectContains(5179062030687166815, 5177056748191934217, 5177914411069187297) AS rectContains
```
Result:
``` text
┌─rectContains─┐
│ 0 │
└──────────────┘
```
## s2RectUinion{#s2RectUnion}
In the S2 system, a rectangle is represented by a type of S2Region called a S2LatLngRect that represents a rectangle in latitude-longitude space.
Returns the smallest rectangle containing the union of this rectangle and the given rectangle.
**Syntax**
``` sql
s2RectUnion(s2Rect1PointLow, s2Rect1PointHi, s2Rect2PointLow, s2Rect2PointHi)
```
**Arguments**
- `s2Rect1PointLow`, `s2Rect1PointHi` - Low and High S2 point indices corresponding to the first rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2Rect2PointLow`, `s2Rect2PointHi` - Low and High S2 point indices corresponding to the second rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- `s2UnionRect2PointLow` - Low S2 cell id corresponding to the union rectangle. Type: [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2UnionRect2PointHi` - High S2 cell id corresponding to the union rectangle. Type: [UInt64](../../../sql-reference/data-types/int-uint.md).
**Example**
Query:
``` sql
SELECT s2RectUnion(5178914411069187297, 5177056748191934217, 5179062030687166815, 5177056748191934217) AS rectUnion
```
Result:
``` text
┌─rectUnion─────────────────────────────────┐
│ (5179062030687166815,5177056748191934217) │
└───────────────────────────────────────────┘
```
## s2RectIntersection{#s2RectIntersection}
Returns the smallest Rectangle containing the intersection of this rectangle and the given rectangle.
**Syntax**
``` sql
s2RectIntersection(s2Rect1PointLow, s2Rect1PointHi, s2Rect2PointLow, s2Rect2PointHi)
```
**Arguments**
- `s2Rect1PointLow`, `s2Rect1PointHi` - Low and High S2 point indices corresponding to the first rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2Rect2PointLow`, `s2Rect2PointHi` - Low and High S2 point indices corresponding to the second rectangle. [UInt64](../../../sql-reference/data-types/int-uint.md).
**Returned values**
- `s2UnionRect2PointLow` - Low S2 cell id corresponding to the rectangle containing the intersection of the given rectangles. Type: [UInt64](../../../sql-reference/data-types/int-uint.md).
- `s2UnionRect2PointHi` - Hi S2 cell id corresponding to the rectangle containing the intersection of the given rectangles. Type: [UInt64](../../../sql-reference/data-types/int-uint.md).
**Example**
Query:
``` sql
SELECT s2RectIntersection(5178914411069187297, 5177056748191934217, 5179062030687166815, 5177056748191934217) AS rectIntersection
```
Result:
``` text
┌─rectIntersection──────────────────────────┐
│ (5178914411069187297,5177056748191934217) │
└───────────────────────────────────────────┘
```

View File

@ -1 +0,0 @@
../../../en/getting-started/example-datasets/menus.md

View File

@ -0,0 +1,360 @@
---
toc_priority: 21
toc_title: Меню
---
# Набор данных публичной библиотеки Нью-Йорка "Что в меню?" {#menus-dataset}
Набор данных создан Нью-Йоркской публичной библиотекой. Он содержит исторические данные о меню отелей, ресторанов и кафе с блюдами, а также их ценами.
Источник: http://menus.nypl.org/data
Эти данные находятся в открытом доступе.
Данные взяты из архива библиотеки, и они могут быть неполными и сложными для статистического анализа. Тем не менее, это тоже очень интересно.
В наборе всего 1,3 миллиона записей о блюдах в меню — очень небольшой объем данных для ClickHouse, но это все равно хороший пример.
## Загрузите набор данных {#download-dataset}
Выполните команду:
```bash
wget https://s3.amazonaws.com/menusdata.nypl.org/gzips/2021_08_01_07_01_17_data.tgz
```
При необходимости замените ссылку на актуальную ссылку с http://menus.nypl.org/data.
Размер архива составляет около 35 МБ.
## Распакуйте набор данных {#unpack-dataset}
```bash
tar xvf 2021_08_01_07_01_17_data.tgz
```
Размер распакованных данных составляет около 150 МБ.
Данные нормализованы и состоят из четырех таблиц:
- `Menu` — информация о меню: название ресторана, дата, когда было просмотрено меню, и т.д.
- `Dish` — информация о блюдах: название блюда вместе с некоторыми характеристиками.
- `MenuPage` — информация о страницах в меню, потому что каждая страница принадлежит какому-либо меню.
- `MenuItem` — один из пунктов меню. Блюдо вместе с его ценой на какой-либо странице меню: ссылки на блюдо и страницу меню.
## Создайте таблицы {#create-tables}
Для хранения цен используется тип данных [Decimal](../../sql-reference/data-types/decimal.md).
```sql
CREATE TABLE dish
(
id UInt32,
name String,
description String,
menus_appeared UInt32,
times_appeared Int32,
first_appeared UInt16,
last_appeared UInt16,
lowest_price Decimal64(3),
highest_price Decimal64(3)
) ENGINE = MergeTree ORDER BY id;
CREATE TABLE menu
(
id UInt32,
name String,
sponsor String,
event String,
venue String,
place String,
physical_description String,
occasion String,
notes String,
call_number String,
keywords String,
language String,
date String,
location String,
location_type String,
currency String,
currency_symbol String,
status String,
page_count UInt16,
dish_count UInt16
) ENGINE = MergeTree ORDER BY id;
CREATE TABLE menu_page
(
id UInt32,
menu_id UInt32,
page_number UInt16,
image_id String,
full_height UInt16,
full_width UInt16,
uuid UUID
) ENGINE = MergeTree ORDER BY id;
CREATE TABLE menu_item
(
id UInt32,
menu_page_id UInt32,
price Decimal64(3),
high_price Decimal64(3),
dish_id UInt32,
created_at DateTime,
updated_at DateTime,
xpos Float64,
ypos Float64
) ENGINE = MergeTree ORDER BY id;
```
## Импортируйте данные {#import-data}
Импортируйте данные в ClickHouse, выполните команды:
```bash
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO dish FORMAT CSVWithNames" < Dish.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu FORMAT CSVWithNames" < Menu.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu_page FORMAT CSVWithNames" < MenuPage.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --date_time_input_format best_effort --query "INSERT INTO menu_item FORMAT CSVWithNames" < MenuItem.csv
```
Поскольку данные представлены в формате CSV с заголовком, используется формат [CSVWithNames](../../interfaces/formats.md#csvwithnames).
Отключите `format_csv_allow_single_quotes`, так как для данных используются только двойные кавычки, а одинарные кавычки могут находиться внутри значений и не должны сбивать с толку CSV-парсер.
Отключите [input_format_null_as_default](../../operations/settings/settings.md#settings-input-format-null-as-default), поскольку в данных нет значений [NULL](../../sql-reference/syntax.md#null-literal).
В противном случае ClickHouse попытается проанализировать последовательности `\N` и может перепутать с `\` в данных.
Настройка [date_time_input_format best_effort](../../operations/settings/settings.md#settings-date_time_input_format) позволяет анализировать поля [DateTime](../../sql-reference/data-types/datetime.md) в самых разных форматах. К примеру, будет распознан ISO-8601 без секунд: '2000-01-01 01:02'. Без этой настройки допускается только фиксированный формат даты и времени.
## Денормализуйте данные {#denormalize-data}
Данные представлены в нескольких таблицах в [нормализованном виде](https://ru.wikipedia.org/wiki/%D0%9D%D0%BE%D1%80%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F_%D1%84%D0%BE%D1%80%D0%BC%D0%B0).
Это означает, что вам нужно использовать условие объединения [JOIN](../../sql-reference/statements/select/join.md#select-join), если вы хотите получить, например, названия блюд из пунктов меню.
Для типовых аналитических задач гораздо эффективнее работать с предварительно объединенными данными, чтобы не использовать `JOIN` каждый раз. Такие данные называются денормализованными.
Создайте таблицу `menu_item_denorm`, которая будет содержать все данные, объединенные вместе:
```sql
CREATE TABLE menu_item_denorm
ENGINE = MergeTree ORDER BY (dish_name, created_at)
AS SELECT
price,
high_price,
created_at,
updated_at,
xpos,
ypos,
dish.id AS dish_id,
dish.name AS dish_name,
dish.description AS dish_description,
dish.menus_appeared AS dish_menus_appeared,
dish.times_appeared AS dish_times_appeared,
dish.first_appeared AS dish_first_appeared,
dish.last_appeared AS dish_last_appeared,
dish.lowest_price AS dish_lowest_price,
dish.highest_price AS dish_highest_price,
menu.id AS menu_id,
menu.name AS menu_name,
menu.sponsor AS menu_sponsor,
menu.event AS menu_event,
menu.venue AS menu_venue,
menu.place AS menu_place,
menu.physical_description AS menu_physical_description,
menu.occasion AS menu_occasion,
menu.notes AS menu_notes,
menu.call_number AS menu_call_number,
menu.keywords AS menu_keywords,
menu.language AS menu_language,
menu.date AS menu_date,
menu.location AS menu_location,
menu.location_type AS menu_location_type,
menu.currency AS menu_currency,
menu.currency_symbol AS menu_currency_symbol,
menu.status AS menu_status,
menu.page_count AS menu_page_count,
menu.dish_count AS menu_dish_count
FROM menu_item
JOIN dish ON menu_item.dish_id = dish.id
JOIN menu_page ON menu_item.menu_page_id = menu_page.id
JOIN menu ON menu_page.menu_id = menu.id;
```
## Проверьте загруженные данные {#validate-data}
Запрос:
```sql
SELECT count() FROM menu_item_denorm;
```
Результат:
```text
┌─count()─┐
│ 1329175 │
└─────────┘
```
## Примеры запросов {#run-queries}
### Усредненные исторические цены на блюда {#query-averaged-historical-prices}
Запрос:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
round(avg(price), 2),
bar(avg(price), 0, 100, 100)
FROM menu_item_denorm
WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022)
GROUP BY d
ORDER BY d ASC;
```
Результат:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 100, 100)─┐
│ 1850 │ 618 │ 1.5 │ █▍ │
│ 1860 │ 1634 │ 1.29 │ █▎ │
│ 1870 │ 2215 │ 1.36 │ █▎ │
│ 1880 │ 3909 │ 1.01 │ █ │
│ 1890 │ 8837 │ 1.4 │ █▍ │
│ 1900 │ 176292 │ 0.68 │ ▋ │
│ 1910 │ 212196 │ 0.88 │ ▊ │
│ 1920 │ 179590 │ 0.74 │ ▋ │
│ 1930 │ 73707 │ 0.6 │ ▌ │
│ 1940 │ 58795 │ 0.57 │ ▌ │
│ 1950 │ 41407 │ 0.95 │ ▊ │
│ 1960 │ 51179 │ 1.32 │ █▎ │
│ 1970 │ 12914 │ 1.86 │ █▋ │
│ 1980 │ 7268 │ 4.35 │ ████▎ │
│ 1990 │ 11055 │ 6.03 │ ██████ │
│ 2000 │ 2467 │ 11.85 │ ███████████▋ │
│ 2010 │ 597 │ 25.66 │ █████████████████████████▋ │
└──────┴─────────┴──────────────────────┴──────────────────────────────┘
```
Просто не принимайте это всерьез.
### Цены на бургеры {#query-burger-prices}
Запрос:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
round(avg(price), 2),
bar(avg(price), 0, 50, 100)
FROM menu_item_denorm
WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%burger%')
GROUP BY d
ORDER BY d ASC;
```
Результат:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)───────────┐
│ 1880 │ 2 │ 0.42 │ ▋ │
│ 1890 │ 7 │ 0.85 │ █▋ │
│ 1900 │ 399 │ 0.49 │ ▊ │
│ 1910 │ 589 │ 0.68 │ █▎ │
│ 1920 │ 280 │ 0.56 │ █ │
│ 1930 │ 74 │ 0.42 │ ▋ │
│ 1940 │ 119 │ 0.59 │ █▏ │
│ 1950 │ 134 │ 1.09 │ ██▏ │
│ 1960 │ 272 │ 0.92 │ █▋ │
│ 1970 │ 108 │ 1.18 │ ██▎ │
│ 1980 │ 88 │ 2.82 │ █████▋ │
│ 1990 │ 184 │ 3.68 │ ███████▎ │
│ 2000 │ 21 │ 7.14 │ ██████████████▎ │
│ 2010 │ 6 │ 18.42 │ ████████████████████████████████████▋ │
└──────┴─────────┴──────────────────────┴───────────────────────────────────────┘
```
### Водка {#query-vodka}
Запрос:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
round(avg(price), 2),
bar(avg(price), 0, 50, 100)
FROM menu_item_denorm
WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%vodka%')
GROUP BY d
ORDER BY d ASC;
```
Результат:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)─┐
│ 1910 │ 2 │ 0 │ │
│ 1920 │ 1 │ 0.3 │ ▌ │
│ 1940 │ 21 │ 0.42 │ ▋ │
│ 1950 │ 14 │ 0.59 │ █▏ │
│ 1960 │ 113 │ 2.17 │ ████▎ │
│ 1970 │ 37 │ 0.68 │ █▎ │
│ 1980 │ 19 │ 2.55 │ █████ │
│ 1990 │ 86 │ 3.6 │ ███████▏ │
│ 2000 │ 2 │ 3.98 │ ███████▊ │
└──────┴─────────┴──────────────────────┴─────────────────────────────┘
```
Чтобы получить водку, мы должны написать `ILIKE '%vodka%'`, и это хорошая идея.
### Икра {#query-caviar}
Посмотрите цены на икру. Получите название любого блюда с икрой.
Запрос:
```sql
SELECT
round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
count(),
round(avg(price), 2),
bar(avg(price), 0, 50, 100),
any(dish_name)
FROM menu_item_denorm
WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%caviar%')
GROUP BY d
ORDER BY d ASC;
```
Результат:
```text
┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)──────┬─any(dish_name)──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1090 │ 1 │ 0 │ │ Caviar │
│ 1880 │ 3 │ 0 │ │ Caviar │
│ 1890 │ 39 │ 0.59 │ █▏ │ Butter and caviar │
│ 1900 │ 1014 │ 0.34 │ ▋ │ Anchovy Caviar on Toast │
│ 1910 │ 1588 │ 1.35 │ ██▋ │ 1/1 Brötchen Caviar │
│ 1920 │ 927 │ 1.37 │ ██▋ │ ASTRAKAN CAVIAR │
│ 1930 │ 289 │ 1.91 │ ███▋ │ Astrachan caviar │
│ 1940 │ 201 │ 0.83 │ █▋ │ (SPECIAL) Domestic Caviar Sandwich │
│ 1950 │ 81 │ 2.27 │ ████▌ │ Beluga Caviar │
│ 1960 │ 126 │ 2.21 │ ████▍ │ Beluga Caviar │
│ 1970 │ 105 │ 0.95 │ █▊ │ BELUGA MALOSSOL CAVIAR AMERICAN DRESSING │
│ 1980 │ 12 │ 7.22 │ ██████████████▍ │ Authentic Iranian Beluga Caviar the world's finest black caviar presented in ice garni and a sampling of chilled 100° Russian vodka │
│ 1990 │ 74 │ 14.42 │ ████████████████████████████▋ │ Avocado Salad, Fresh cut avocado with caviare │
│ 2000 │ 3 │ 7.82 │ ███████████████▋ │ Aufgeschlagenes Kartoffelsueppchen mit Forellencaviar │
│ 2010 │ 6 │ 15.58 │ ███████████████████████████████▏ │ "OYSTERS AND PEARLS" "Sabayon" of Pearl Tapioca with Island Creek Oysters and Russian Sevruga Caviar │
└──────┴─────────┴──────────────────────┴──────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```
По крайней мере, есть икра с водкой. Очень мило.
## Online Playground {#playground}
Этот набор данных доступен в интерактивном ресурсе [Online Playground](https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUCiAgICByb3VuZCh0b1VJbnQzMk9yWmVybyhleHRyYWN0KG1lbnVfZGF0ZSwgJ15cXGR7NH0nKSksIC0xKSBBUyBkLAogICAgY291bnQoKSwKICAgIHJvdW5kKGF2ZyhwcmljZSksIDIpLAogICAgYmFyKGF2ZyhwcmljZSksIDAsIDUwLCAxMDApLAogICAgYW55KGRpc2hfbmFtZSkKRlJPTSBtZW51X2l0ZW1fZGVub3JtCldIRVJFIChtZW51X2N1cnJlbmN5IElOICgnRG9sbGFycycsICcnKSkgQU5EIChkID4gMCkgQU5EIChkIDwgMjAyMikgQU5EIChkaXNoX25hbWUgSUxJS0UgJyVjYXZpYXIlJykKR1JPVVAgQlkgZApPUkRFUiBCWSBkIEFTQw==).

View File

@ -364,7 +364,7 @@ $ clickhouse-client --format_csv_delimiter="|" --query="INSERT INTO test.csv FOR
## CSVWithNames {#csvwithnames}
Выводит также заголовок, аналогично `TabSeparatedWithNames`.
Выводит также заголовок, аналогично [TabSeparatedWithNames](#tabseparatedwithnames).
## CustomSeparated {#format-customseparated}

View File

@ -7,7 +7,8 @@ toc_title: DateTime64
Позволяет хранить момент времени, который может быть представлен как календарная дата и время, с заданной суб-секундной точностью.
Размер тика (точность, precision): 10<sup>-precision</sup> секунд, где precision - целочисленный параметр.
Размер тика (точность, precision): 10<sup>-precision</sup> секунд, где precision - целочисленный параметр. Возможные значения: [ 0 : 9 ].
Обычно используются - 3 (миллисекунды), 6 (микросекунды), 9 (наносекунды).
**Синтаксис:**