---
machine_translated: true
machine_translated_rev: e8cd92bba3269f47787db090899f7c242adf7818
toc_priority: 16
toc_title: New York Taxi Data
---

# New York Taxi Data {#new-york-taxi-data}

This dataset can be obtained in two ways:

- import from raw data
- download of prepared partitions

## How to Import the Raw Data {#how-to-import-the-raw-data}

See https://github.com/toddwschneider/nyc-taxi-data and http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html for a description of the dataset and instructions for downloading it.

The download yields about 227 GB of uncompressed data in CSV files. It takes about an hour over a 1 Gbit connection (parallel downloading from s3.amazonaws.com uses at least half of a 1 Gbit channel).
Some of the files might not download fully. Check the file sizes and re-download any that look suspicious.

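One way to spot incomplete downloads is to list the files that fall below a size threshold. The sketch below assumes the CSV files live in a `data/` directory, as in the commands in this section; the function name `list_small_files` and the 1 MiB threshold are illustrative assumptions, not part of the original instructions.

```bash
# List CSV files under a directory that are smaller than a given size in bytes.
# The function name and the threshold used below are illustrative assumptions.
list_small_files() {
    dir=$1
    min_bytes=$2
    # `-size -Nc` matches files strictly smaller than N bytes.
    find "$dir" -type f -name '*.csv' -size "-${min_bytes}c" | sort
}

# Example: report suspiciously small files under data/, if that directory exists.
if [ -d data ]; then
    list_small_files data 1048576
fi
```

Any file printed by this check is a candidate for re-downloading; a sensible threshold depends on the typical size of the monthly files.
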
Some of the files might contain invalid rows. You can fix them as follows:

``` bash
sed -E '/(.*,){18,}/d' data/yellow_tripdata_2010-02.csv > data/yellow_tripdata_2010-02.csv_
sed -E '/(.*,){18,}/d' data/yellow_tripdata_2010-03.csv > data/yellow_tripdata_2010-03.csv_
mv data/yellow_tripdata_2010-02.csv_ data/yellow_tripdata_2010-02.csv
mv data/yellow_tripdata_2010-03.csv_ data/yellow_tripdata_2010-03.csv
```

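To see what the `sed` expression above removes, it can be tried on a small sample first. The pattern `(.*,){18,}` matches any line that contains at least 18 commas (19 or more fields), and `d` deletes it. The helper names `drop_wide_rows` and `make_row` and the sample rows are made up for illustration.

```bash
# Delete every input line that contains 18 or more commas.
# `drop_wide_rows` is a hypothetical wrapper around the expression used above.
drop_wide_rows() {
    sed -E '/(.*,){18,}/d'
}

# Print one synthetic row with exactly N commas (hypothetical helper).
make_row() {
    awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) printf "x,"; print "end" }'
}

# Feed a 17-comma row and an 18-comma row through the filter and
# print the comma count of each surviving row.
# Only the 17-comma row survives, so this prints 17.
{ make_row 17; make_row 18; } | drop_wide_rows | awk -F, '{ print NF - 1 }'
```
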
Then the data must be pre-processed in PostgreSQL. This creates selections of points in polygons (to match points on the map with the boroughs of New York City) and combines all the data into a single denormalized flat table by using a JOIN. To do this, you need to install PostgreSQL with PostGIS support.

Be careful when running `initialize_database.sh` and manually re-check that all the tables were created correctly.

It takes about 20-30 minutes to process each month's worth of data in PostgreSQL, for a total of about 48 hours.

You can check the number of downloaded rows as follows:

``` bash
$ time psql nyc-taxi-data -c "SELECT count(*) FROM trips;"
   count
------------
 1298979494
(1 row)

real    7m9.164s
```

(This is slightly more than the 1.1 billion rows reported by Mark Litwintschik in a series of blog posts.)

The data in PostgreSQL uses 370 GB of space.

Exporting the data from PostgreSQL:

``` sql
COPY
(
    SELECT trips.id,
           trips.vendor_id,
           trips.pickup_datetime,
           trips.dropoff_datetime,
           trips.store_and_fwd_flag,
           trips.rate_code_id,
           trips.pickup_longitude,
           trips.pickup_latitude,
           trips.dropoff_longitude,
           trips.dropoff_latitude,
           trips.passenger_count,
           trips.trip_distance,
           trips.fare_amount,
           trips.extra,
           trips.mta_tax,
           trips.tip_amount,
           trips.tolls_amount,
           trips.ehail_fee,
           trips.improvement_surcharge,
           trips.total_amount,
           trips.payment_type,
           trips.trip_type,
           trips.pickup,
           trips.dropoff,

           cab_types.type cab_type,

           weather.precipitation_tenths_of_mm rain,
           weather.snow_depth_mm,
           weather.snowfall_mm,
           weather.max_temperature_tenths_degrees_celsius max_temp,
           weather.min_temperature_tenths_degrees_celsius min_temp,
           weather.average_wind_speed_tenths_of_meters_per_second wind,

           pick_up.gid pickup_nyct2010_gid,
           pick_up.ctlabel pickup_ctlabel,
           pick_up.borocode pickup_borocode,
           pick_up.boroname pickup_boroname,
           pick_up.ct2010 pickup_ct2010,
           pick_up.boroct2010 pickup_boroct2010,
           pick_up.cdeligibil pickup_cdeligibil,
           pick_up.ntacode pickup_ntacode,
           pick_up.ntaname pickup_ntaname,
           pick_up.puma pickup_puma,

           drop_off.gid dropoff_nyct2010_gid,
           drop_off.ctlabel dropoff_ctlabel,
           drop_off.borocode dropoff_borocode,
           drop_off.boroname dropoff_boroname,
           drop_off.ct2010 dropoff_ct2010,
           drop_off.boroct2010 dropoff_boroct2010,
           drop_off.cdeligibil dropoff_cdeligibil,
           drop_off.ntacode dropoff_ntacode,
           drop_off.ntaname dropoff_ntaname,
           drop_off.puma dropoff_puma
    FROM trips
    LEFT JOIN cab_types
        ON trips.cab_type_id = cab_types.id
    LEFT JOIN central_park_weather_observations_raw weather
        ON weather.date = trips.pickup_datetime::date
    LEFT JOIN nyct2010 pick_up
        ON pick_up.gid = trips.pickup_nyct2010_gid
    LEFT JOIN nyct2010 drop_off
        ON drop_off.gid = trips.dropoff_nyct2010_gid
) TO '/opt/milovidov/nyc-taxi-data/trips.tsv';
```

The data snapshot is created at a speed of about 50 MB per second. While creating the snapshot, PostgreSQL reads from the disk at a speed of about 28 MB per second.
This takes about 5 hours. The resulting TSV file is 590612904969 bytes.

Create a temporary table in ClickHouse:

``` sql
CREATE TABLE trips
(
    trip_id                 UInt32,
    vendor_id               String,
    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),
    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),
    cab_type                Nullable(String),
    precipitation           Nullable(UInt8),
    snow_depth              Nullable(UInt8),
    snowfall                Nullable(UInt8),
    max_temperature         Nullable(UInt8),
    min_temperature         Nullable(UInt8),
    average_wind_speed      Nullable(UInt8),
    pickup_nyct2010_gid     Nullable(UInt8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(UInt8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),
    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = Log;
```

This table is needed to convert the fields to more appropriate data types and, where possible, to eliminate NULLs.

``` bash
$ time clickhouse-client --query="INSERT INTO trips FORMAT TabSeparated" < trips.tsv

real    75m56.214s
```

Data is read at a speed of 112-140 MB/second.
Loading data into a Log type table in one stream took 76 minutes.
The data in this table uses 142 GB.

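These figures are mutually consistent, which a quick calculation confirms: dividing the TSV size (590612904969 bytes) by the measured load time (75m56.214s) gives the average read rate. The sketch assumes decimal megabytes (1 MB = 10^6 bytes); the function name `throughput_mb_s` is hypothetical.

```bash
# Average throughput in MB/s for a byte count and a duration in seconds.
# Assumption: 1 MB = 10^6 bytes. The function name is illustrative.
throughput_mb_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1e6 }'
}

# 75m56.214s = 75 * 60 + 56.214 = 4556.214 seconds
throughput_mb_s 590612904969 4556.214   # prints 129.6
```

The result, about 129.6 MB/s, falls inside the observed 112-140 MB/second range.
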
(Importing data directly from Postgres is also possible using `COPY ... TO PROGRAM`.)

Unfortunately, all the fields associated with the weather (`precipitation` through `average_wind_speed`) were filled with NULL. Because of this, we will remove them from the final data set.

To start, we'll create a table on a single server. Later we will make the table distributed.

Create and populate a summary table:

``` sql
CREATE TABLE trips_mergetree
ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
AS SELECT

trip_id,
CAST(vendor_id AS Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14)) AS vendor_id,
toDate(pickup_datetime) AS pickup_date,
ifNull(pickup_datetime, toDateTime(0)) AS pickup_datetime,
toDate(dropoff_datetime) AS dropoff_date,
ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
assumeNotNull(store_and_fwd_flag) IN ('Y', '1', '2') AS store_and_fwd_flag,
assumeNotNull(rate_code_id) AS rate_code_id,
assumeNotNull(pickup_longitude) AS pickup_longitude,
assumeNotNull(pickup_latitude) AS pickup_latitude,
assumeNotNull(dropoff_longitude) AS dropoff_longitude,
assumeNotNull(dropoff_latitude) AS dropoff_latitude,
assumeNotNull(passenger_count) AS passenger_count,
assumeNotNull(trip_distance) AS trip_distance,
assumeNotNull(fare_amount) AS fare_amount,
assumeNotNull(extra) AS extra,
assumeNotNull(mta_tax) AS mta_tax,
assumeNotNull(tip_amount) AS tip_amount,
assumeNotNull(tolls_amount) AS tolls_amount,
assumeNotNull(ehail_fee) AS ehail_fee,
assumeNotNull(improvement_surcharge) AS improvement_surcharge,
assumeNotNull(total_amount) AS total_amount,
CAST((assumeNotNull(payment_type) AS pt) IN ('CSH', 'CASH', 'Cash', 'CAS', 'Cas', '1') ? 'CSH' : (pt IN ('CRD', 'Credit', 'Cre', 'CRE', 'CREDIT', '2') ? 'CRE' : (pt IN ('NOC', 'No Charge', 'No', '3') ? 'NOC' : (pt IN ('DIS', 'Dispute', 'Dis', '4') ? 'DIS' : 'UNK'))) AS Enum8('CSH' = 1, 'CRE' = 2, 'UNK' = 0, 'NOC' = 3, 'DIS' = 4)) AS payment_type_,
assumeNotNull(trip_type) AS trip_type,
ifNull(toFixedString(unhex(pickup), 25), toFixedString('', 25)) AS pickup,
ifNull(toFixedString(unhex(dropoff), 25), toFixedString('', 25)) AS dropoff,
CAST(assumeNotNull(cab_type) AS Enum8('yellow' = 1, 'green' = 2, 'uber' = 3)) AS cab_type,

assumeNotNull(pickup_nyct2010_gid) AS pickup_nyct2010_gid,
toFloat32(ifNull(pickup_ctlabel, '0')) AS pickup_ctlabel,
assumeNotNull(pickup_borocode) AS pickup_borocode,
CAST(assumeNotNull(pickup_boroname) AS Enum8('Manhattan' = 1, 'Queens' = 4, 'Brooklyn' = 3, '' = 0, 'Bronx' = 2, 'Staten Island' = 5)) AS pickup_boroname,
toFixedString(ifNull(pickup_ct2010, '000000'), 6) AS pickup_ct2010,
toFixedString(ifNull(pickup_boroct2010, '0000000'), 7) AS pickup_boroct2010,
CAST(assumeNotNull(ifNull(pickup_cdeligibil, ' ')) AS Enum8(' ' = 0, 'E' = 1, 'I' = 2)) AS pickup_cdeligibil,
toFixedString(ifNull(pickup_ntacode, '0000'), 4) AS pickup_ntacode,

CAST(assumeNotNull(pickup_ntaname) AS Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. 
Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105, 'Melrose South-Mott Haven North' = 106, 'Middle Village' = 107, 'Midtown-Midtown South' = 108, 'Midwood' = 109, 'Morningside Heights' = 110, 'Morrisania-Melrose' = 111, 'Mott Haven-Port Morris' = 112, 'Mount Hope' = 113, 'Murray Hill' = 114, 'Murray Hill-Kips Bay' = 115, 'New Brighton-Silver Lake' = 116, 'New Dorp-Midland Beach' = 117, 'New Springville-Bloomfield-Travis' = 118, 'North Corona' = 119, 'North Riverdale-Fieldston-Riverdale' = 120, 'North Side-South Side' = 121, 'Norwood' = 122, 'Oakland Gardens' = 123, 'Oakwood-Oakwood Beach' = 124, 'Ocean Hill' = 125, 'Ocean Parkway South' = 126, 'Old Astoria' = 127, 'Old Town-Dongan Hills-South Beach' = 128, 'Ozone Park' = 129, 'Park Slope-Gowanus' = 130, 'Parkchester' = 131, 'Pelham Bay-Country Club-City Island' = 132, 'Pelham Parkway' = 133, 'Pomonok-Flushing Heights-Hillcrest' = 134, 'Port Richmond' = 135, 'Prospect Heights' = 136, 'Prospect Lefferts Gardens-Wingate' = 137, 'Queens Village' = 138, 'Queensboro Hill' = 139, 
'Queensbridge-Ravenswood-Long Island City' = 140, 'Rego Park' = 141, 'Richmond Hill' = 142, 'Ridgewood' = 143, 'Rikers Island' = 144, 'Rosedale' = 145, 'Rossville-Woodrow' = 146, 'Rugby-Remsen Village' = 147, 'S

toUInt16(ifNull(pickup_puma, '0')) AS pickup_puma,

assumeNotNull(dropoff_nyct2010_gid) AS dropoff_nyct2010_gid,
toFloat32(ifNull(dropoff_ctlabel, '0')) AS dropoff_ctlabel,
assumeNotNull(dropoff_borocode) AS dropoff_borocode,
CAST(assumeNotNull(dropoff_boroname) AS Enum8('Manhattan' = 1, 'Queens' = 4, 'Brooklyn' = 3, '' = 0, 'Bronx' = 2, 'Staten Island' = 5)) AS dropoff_boroname,
toFixedString(ifNull(dropoff_ct2010, '000000'), 6) AS dropoff_ct2010,
toFixedString(ifNull(dropoff_boroct2010, '0000000'), 7) AS dropoff_boroct2010,
CAST(assumeNotNull(ifNull(dropoff_cdeligibil, ' ')) AS Enum8(' ' = 0, 'E' = 1, 'I' = 2)) AS dropoff_cdeligibil,
toFixedString(ifNull(dropoff_ntacode, '0000'), 4) AS dropoff_ntacode,

CAST(assumeNotNull(dropoff_ntaname) AS Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. 
Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105, 'Melrose South-Mott Haven North' = 106, 'Middle Village' = 107, 'Midtown-Midtown South' = 108, 'Midwood' = 109, 'Morningside Heights' = 110, 'Morrisania-Melrose' = 111, 'Mott Haven-Port Morris' = 112, 'Mount Hope' = 113, 'Murray Hill' = 114, 'Murray Hill-Kips Bay' = 115, 'New Brighton-Silver Lake' = 116, 'New Dorp-Midland Beach' = 117, 'New Springville-Bloomfield-Travis' = 118, 'North Corona' = 119, 'North Riverdale-Fieldston-Riverdale' = 120, 'North Side-South Side' = 121, 'Norwood' = 122, 'Oakland Gardens' = 123, 'Oakwood-Oakwood Beach' = 124, 'Ocean Hill' = 125, 'Ocean Parkway South' = 126, 'Old Astoria' = 127, 'Old Town-Dongan Hills-South Beach' = 128, 'Ozone Park' = 129, 'Park Slope-Gowanus' = 130, 'Parkchester' = 131, 'Pelham Bay-Country Club-City Island' = 132, 'Pelham Parkway' = 133, 'Pomonok-Flushing Heights-Hillcrest' = 134, 'Port Richmond' = 135, 'Prospect Heights' = 136, 'Prospect Lefferts Gardens-Wingate' = 137, 'Queens Village' = 138, 'Queensboro Hill' = 139, 
'Queensbridge-Ravenswood-Long Island City' = 140, 'Rego Park' = 141, 'Richmond Hill' = 142, 'Ridgewood' = 143, 'Rikers Island' = 144, 'Rosedale' = 145, 'Rossville-Woodrow' = 146, 'Rugby-Remsen Village' = 147, '

toUInt16(ifNull(dropoff_puma, '0')) AS dropoff_puma

FROM trips
```

This takes 3030 seconds at a speed of about 428,000 rows per second.
To load it faster, you can create the table with the `Log` engine instead of `MergeTree`. In this case, loading takes less than 200 seconds.

The table uses 126 GB of disk space.

``` sql
SELECT formatReadableSize(sum(bytes)) FROM system.parts WHERE table = 'trips_mergetree' AND active
```

``` text
┌─formatReadableSize(sum(bytes))─┐
│ 126.18 GiB                     │
└────────────────────────────────┘
```

Among other things, you can run the OPTIMIZE query on MergeTree. It is not required, though, since everything will work fine without it.

## Download of Prepared Partitions {#download-of-prepared-partitions}

``` bash
$ curl -O https://clickhouse-datasets.s3.yandex.net/trips_mergetree/partitions/trips_mergetree.tar
$ tar xvf trips_mergetree.tar -C /var/lib/clickhouse # path to ClickHouse data directory
$ # check permissions of unpacked data, fix if required
$ sudo service clickhouse-server restart
$ clickhouse-client --query "select count(*) from datasets.trips_mergetree"
```

!!! info "Info"
    If you will run the queries described below, you have to use the full table name, `datasets.trips_mergetree`.

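The "check permissions" step in the download commands above can be done by listing unpacked files whose owner differs from the account the server runs under. The sketch below assumes that the server runs as a `clickhouse` user and that the data directory is `/var/lib/clickhouse`; both are assumptions about a typical package installation, and `list_not_owned_by` is a hypothetical helper name.

```bash
# List everything under a directory that is NOT owned by the given user.
# An empty result means ownership is consistent.
list_not_owned_by() {
    find "$1" ! -user "$2"
}

# Example: assumes the server runs as `clickhouse` and stores data in
# /var/lib/clickhouse (adjust both if your installation differs).
if [ -d /var/lib/clickhouse ] && id clickhouse >/dev/null 2>&1; then
    list_not_owned_by /var/lib/clickhouse clickhouse | head -n 20
fi
```

If the list is not empty, ownership can be corrected (for example with `chown -R`) before the server is restarted.
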
## Results on Single Server {#results-on-single-server}

Q1:

``` sql
SELECT cab_type, count(*) FROM trips_mergetree GROUP BY cab_type
```

0.490 seconds.

Q2:

``` sql
SELECT passenger_count, avg(total_amount) FROM trips_mergetree GROUP BY passenger_count
```

1.224 seconds.

Q3:

``` sql
SELECT passenger_count, toYear(pickup_date) AS year, count(*) FROM trips_mergetree GROUP BY passenger_count, year
```

2.104 seconds.

Q4:

``` sql
SELECT passenger_count, toYear(pickup_date) AS year, round(trip_distance) AS distance, count(*)
FROM trips_mergetree
GROUP BY passenger_count, year, distance
ORDER BY year, count(*) DESC
```

3.593 seconds.

The following server was used:

Two Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60 GHz, 16 physical cores total, 128 GiB RAM, 8x6 TB HD on hardware RAID-5

Execution time is the best of three runs. Starting from the second run, however, queries read data from the file system cache. No further caching occurs: the data is read out and processed in each run.

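The "best of three runs" figure is simply the minimum of the measured times. A minimal sketch of that selection step follows; `best_of` is a hypothetical helper, and the sample values are invented for illustration and are not measurements from this benchmark.

```bash
# Print the smallest of the given run times (in seconds).
# `best_of` is an illustrative helper, not part of any ClickHouse tooling.
best_of() {
    printf '%s\n' "$@" | awk 'NR == 1 || $1 < min { min = $1 } END { print min }'
}

# Three made-up timings for one query; the minimum is reported.
best_of 1.5 1.2 1.3   # prints 1.2
```

Taking the minimum rather than the mean reduces the influence of one-off disturbances, such as the cold file system cache on the first run.
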
Creating a table on three servers:

On each server:

``` sql
CREATE TABLE default.trips_mergetree_third ( trip_id UInt32, vendor_id Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14), pickup_date Date, pickup_datetime DateTime, dropoff_date Date, dropoff_datetime DateTime, store_and_fwd_flag UInt8, rate_code_id UInt8, pickup_longitude Float64, pickup_latitude Float64, dropoff_longitude Float64, dropoff_latitude Float64, passenger_count UInt8, trip_distance Float64, fare_amount Float32, extra Float32, mta_tax Float32, tip_amount Float32, tolls_amount Float32, ehail_fee Float32, improvement_surcharge Float32, total_amount Float32, payment_type_ Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4), trip_type UInt8, pickup FixedString(25), dropoff FixedString(25), cab_type Enum8('yellow' = 1, 'green' = 2, 'uber' = 3), pickup_nyct2010_gid UInt8, pickup_ctlabel Float32, pickup_borocode UInt8, pickup_boroname Enum8('' = 0, 'Manhattan' = 1, 'Bronx' = 2, 'Brooklyn' = 3, 'Queens' = 4, 'Staten Island' = 5), pickup_ct2010 FixedString(6), pickup_boroct2010 FixedString(7), pickup_cdeligibil Enum8(' ' = 0, 'E' = 1, 'I' = 2), pickup_ntacode FixedString(4), pickup_ntaname Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem 
North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. 
Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105,
```

On the source server:

``` sql
CREATE TABLE trips_mergetree_x3 AS trips_mergetree_third ENGINE = Distributed(perftest, default, trips_mergetree_third, rand())
```

The following query redistributes data:

``` sql
INSERT INTO trips_mergetree_x3 SELECT * FROM trips_mergetree
```

This takes 2454 seconds.

On three servers:

Q1: 0.212 seconds.
Q2: 0.438 seconds.
Q3: 0.733 seconds.
Q4: 1.241 seconds.

No surprises here, since the queries scale linearly.

We also have the results from a cluster of 140 servers:

Q1: 0.028 sec.
Q2: 0.043 sec.
Q3: 0.051 sec.
Q4: 0.072 sec.

In this case, the query processing time is determined above all by network latency.
We ran the queries using a client located in a Yandex datacenter in Finland against a cluster in Russia, which added about 20 ms of latency.

## Summary {#summary}

Query execution times in seconds:

| Servers | Q1    | Q2    | Q3    | Q4    |
|---------|-------|-------|-------|-------|
| 1       | 0.490 | 1.224 | 2.104 | 3.593 |
| 3       | 0.212 | 0.438 | 0.733 | 1.241 |
| 140     | 0.028 | 0.043 | 0.051 | 0.072 |

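To put the table in perspective, the speedup relative to a single server can be computed for each query. The sketch below uses only the timings from the table; `speedup` is a hypothetical helper name.

```bash
# Speedup = single-server time / cluster time, rounded to two decimals.
# `speedup` is an illustrative helper.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# Three servers versus one server, Q1 through Q4.
speedup 0.490 0.212   # prints 2.31
speedup 1.224 0.438   # prints 2.79
speedup 2.104 0.733   # prints 2.87
speedup 3.593 1.241   # prints 2.90

# 140 servers versus one server, Q1.
speedup 0.490 0.028   # prints 17.50
```

On three servers the speedup is between about 2.3x and 2.9x, close to the ideal 3x for the heavier queries. On 140 servers the Q1 speedup is only about 17.5x, which fits the observation that network latency dominates the processing time at that scale.
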
[Original article](https://clickhouse.tech/docs/en/getting_started/example_datasets/nyc_taxi/) <!--hide-->