# Usage Recommendations

## CPU

The SSE 4.2 instruction set must be supported. Modern processors (since 2008) support it.

When choosing a processor, prefer a larger number of cores with a slightly lower clock rate over fewer cores with a higher clock rate.
For example, 16 cores at 2600 MHz are better than 8 cores at 3600 MHz.

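
One way to check the SSE 4.2 requirement on Linux is to look for the `sse4_2` flag in `/proc/cpuinfo`. The sketch below wraps that check in a helper; the function name `has_sse42` and its optional file argument are illustrative assumptions, not part of ClickHouse.

```bash
# Report whether a cpuinfo-style file lists the sse4_2 CPU flag.
# The optional argument is a hypothetical hook for testing; it defaults to /proc/cpuinfo.
has_sse42() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    if grep -qw 'sse4_2' "$cpuinfo"; then
        echo "SSE 4.2 supported"
    else
        echo "SSE 4.2 not supported"
    fi
}
```

Calling `has_sse42` with no argument checks the machine it runs on.
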
## Hyper-threading

Don't disable hyper-threading. It helps for some queries, but not for others.

## Turbo Boost

Turbo Boost is highly recommended. It significantly improves performance with a typical load.
You can use `turbostat` to view the CPU's actual clock rate under a load.

## CPU Scaling Governor

Always use the `performance` scaling governor. The `ondemand` scaling governor works much worse under constantly high demand.

```bash
echo 'performance' | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

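
To confirm that the setting took effect on every core, the governor files can be read back. This is a minimal sketch; the helper name `non_performance_governors` and its optional root argument are assumptions made for illustration.

```bash
# Print every scaling_governor file under a sysfs-style root whose value
# is not "performance". No output means all readable cores are set correctly.
# The root argument is a hypothetical testing hook; it defaults to the real sysfs path.
non_performance_governors() {
    local root="${1:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # skip cores without cpufreq support
        if [ "$(cat "$f")" != "performance" ]; then
            echo "$f: $(cat "$f")"
        fi
    done
}
```
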
## CPU Limitations

Processors can overheat. Use `dmesg` to see if the CPU's clock rate was limited due to overheating.
The restriction can also be set externally at the datacenter level. You can use `turbostat` to monitor it under a load.

## RAM

For small amounts of data (up to \~200 GB compressed), it is best to use as much memory as the volume of data.
For large amounts of data and when processing interactive (online) queries, you should use a reasonable amount of RAM (128 GB or more) so the hot data subset will fit in the page cache.
Even for data volumes of \~50 TB per server, using 128 GB of RAM significantly improves query performance compared to 64 GB.

## Swap File

Always disable the swap file. The only reason not to do this is if you are running ClickHouse on your personal laptop.

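
On Linux, active swap areas are listed in `/proc/swaps` below a one-line header, so an empty listing means swap is off. The sketch below checks for that; the function name `swap_active` and its optional file argument are illustrative assumptions.

```bash
# Succeed (exit status 0) if a /proc/swaps-style file lists at least one swap area.
# The first line of the file is a column header, so only later lines count.
# The optional argument is a hypothetical hook for testing.
swap_active() {
    awk 'NR > 1 { found = 1 } END { exit !found }' "${1:-/proc/swaps}"
}
```

For example, `swap_active && echo "swap is still enabled"` prints a warning only when swap is in use. Swap can typically be turned off until the next reboot with `sudo swapoff -a`; making that permanent usually also means removing the swap entry from `/etc/fstab`.
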
## Huge Pages

Always disable transparent huge pages. They interfere with memory allocators, which leads to significant performance degradation.

```bash
echo 'never' | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

Use `perf top` to watch the time spent in the kernel for memory management.
Permanent huge pages also do not need to be allocated.

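
The kernel reports the current transparent huge pages mode in the same file, with the active value in square brackets (for example `always madvise [never]`). A small sketch for reading it back follows; the helper name `active_thp_mode` and its optional file argument are assumptions for illustration.

```bash
# Print the active transparent huge pages mode, i.e. the bracketed word
# in a file such as "always madvise [never]".
# The optional argument is a hypothetical hook for testing.
active_thp_mode() {
    local file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
}
```

After applying the recommendation above, `active_thp_mode` should print `never`.
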
## Storage Subsystem

If your budget allows you to use SSD, use SSD.
If not, use HDD. 7200 RPM SATA HDDs will do.

Prefer many servers with local hard drives over a smaller number of servers with attached disk shelves.
But for storing archives that are rarely queried, shelves will work.

## RAID

When using HDD, you can combine them into RAID-10, RAID-5, RAID-6 or RAID-50.
For Linux, software RAID is better (with `mdadm`). We don't recommend using LVM.
When creating RAID-10, select the `far` layout.
If your budget allows, choose RAID-10.

If you have more than 4 disks, use RAID-6 (preferred) or RAID-50, instead of RAID-5.
When using RAID-5, RAID-6 or RAID-50, always increase `stripe_cache_size`, since the default value is usually not the best choice.

```bash
echo 4096 | sudo tee /sys/block/md2/md/stripe_cache_size
```

Calculate the exact number from the number of devices and the block size, using the formula: `2 * num_devices * chunk_size_in_bytes / 4096`.

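
As a worked instance of the formula, the sketch below evaluates it with shell arithmetic. The function name `stripe_cache_size` and the sample array (8 devices with a 1024 KB chunk) are assumptions chosen for illustration.

```bash
# Evaluate 2 * num_devices * chunk_size_in_bytes / 4096 using integer arithmetic.
stripe_cache_size() {
    local num_devices="$1" chunk_size_in_bytes="$2"
    echo $(( 2 * num_devices * chunk_size_in_bytes / 4096 ))
}

# Hypothetical array: 8 devices, 1024 KB (1048576-byte) chunks.
stripe_cache_size 8 1048576    # prints 4096
```

For that sample array the result happens to equal the value written to `stripe_cache_size` in the command above.
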

A block size of 1024 KB is sufficient for all RAID configurations.
Never set the block size too small or too large.

You can use RAID-0 on SSD.
Regardless of RAID use, always use replication for data security.

Enable NCQ with a long queue. For HDD, choose the CFQ scheduler, and for SSD, choose noop. Don't reduce the 'readahead' setting.
For HDD, enable the write cache.

## File System

Ext4 is the most reliable option. Set the mount options `noatime` and `nobarrier`.
XFS is also suitable, but it hasn't been as thoroughly tested with ClickHouse.
Most other file systems should also work fine. File systems with delayed allocation work better.

## Linux Kernel

Don't use an outdated Linux kernel. In 2015, 3.18.19 was new enough.
Consider using the kernel build from Yandex: <https://github.com/yandex/smart>. It provides at least a 5% performance increase.

## Network

If you are using IPv6, increase the size of the route cache.
The Linux kernel prior to 3.2 had a multitude of problems with its IPv6 implementation.

Use at least a 10 Gb network, if possible. 1 Gb will also work, but it will be much worse for patching replicas with tens of terabytes of data, or for processing distributed queries with a large amount of intermediate data.

## ZooKeeper

You are probably already using ZooKeeper for other purposes. You can use the same installation of ZooKeeper, if it isn't already overloaded.

It's best to use a fresh version of ZooKeeper: 3.4.9 or later. The version in stable Linux distributions may be outdated.

Do not run ZooKeeper on the same servers as ClickHouse, because ZooKeeper is very sensitive to latency and ClickHouse may utilize all available system resources.

With the default settings, ZooKeeper is a time bomb:

> The ZooKeeper server won't delete files from old snapshots and logs when using the default configuration (see autopurge), and this is the responsibility of the operator.

This bomb must be defused.

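
Defusing it comes down to enabling autopurge, which ZooKeeper leaves off unless `autopurge.purgeInterval` is set to a positive number of hours. The sketch below checks a configuration file for that setting; the helper name `autopurge_enabled` is an assumption for illustration, and the check is deliberately simple (it only looks at plain `key=value` lines).

```bash
# Succeed (exit status 0) only if the given zoo.cfg sets
# autopurge.purgeInterval to a positive integer.
# If the key appears more than once, the last occurrence is used.
autopurge_enabled() {
    local cfg="$1" interval
    interval=$(sed -n 's/^[[:space:]]*autopurge\.purgeInterval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg" | tail -n 1)
    [ -n "$interval" ] && [ "$interval" -gt 0 ]
}
```

For example, `autopurge_enabled /path/to/zoo.cfg || echo "autopurge is off"` flags a configuration that still has the default behavior; the path here is a placeholder.
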

If you want to move data between different ZooKeeper clusters, never move it with a hand-written script, because the result will be incorrect for sequential nodes. Never use the `zkcopy` tool, for the same reason: https://github.com/ksprojects/zkcopy/issues/15

If you want to split an existing ZooKeeper cluster in two, the proper way is to increase the number of its replicas and then reconfigure it as two independent clusters.

The ZooKeeper (3.5.1) configuration below is used in the Yandex.Metrica production environment as of May 20, 2017:

zoo.cfg:

```bash
# http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=30000
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10

maxClientCnxns=2000

maxSessionTimeout=60000000
# the directory where the snapshot is stored.
dataDir=/opt/zookeeper/{{ cluster['name'] }}/data
# Place the dataLogDir to a separate physical disc for better performance
dataLogDir=/opt/zookeeper/{{ cluster['name'] }}/logs

autopurge.snapRetainCount=10
autopurge.purgeInterval=1


# To avoid seeks ZooKeeper allocates space in the transaction log file in
# blocks of preAllocSize kilobytes. The default block size is 64M. One reason
# for changing the size of the blocks is to reduce the block size if snapshots
# are taken more often. (Also, see snapCount).
preAllocSize=131072

# Clients can submit requests faster than ZooKeeper can process them,
# especially if there are a lot of clients. To prevent ZooKeeper from running
# out of memory due to queued requests, ZooKeeper will throttle clients so that
# there is no more than globalOutstandingLimit outstanding requests in the
# system. The default limit is 1,000.
#
# ZooKeeper logs transactions to a
# transaction log. After snapCount transactions are written to a log file a
# snapshot is started and a new transaction log file is started. The default
# snapCount is 10,000.
snapCount=3000000

# If this option is defined, requests will be logged to a trace file named
# traceFile.year.month.day.
#traceFile=

# Leader accepts client connections. Default value is "yes". The leader machine
# coordinates updates. For higher update throughput at the slight expense of
# read throughput the leader can be configured to not accept clients and focus
# on coordination.
leaderServes=yes

standaloneEnabled=false
dynamicConfigFile=/etc/zookeeper-{{ cluster['name'] }}/conf/zoo.cfg.dynamic
```

Java version:

```text
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
```

JVM parameters:

```bash
NAME=zookeeper-{{ cluster['name'] }}
ZOOCFGDIR=/etc/$NAME/conf

# TODO this is really ugly
# How to find out which jars are needed?
# It seems that log4j requires the log4j.properties file to be in the classpath
CLASSPATH="$ZOOCFGDIR:/usr/build/classes:/usr/build/lib/*.jar:/usr/share/zookeeper/zookeeper-3.5.1-metrika.jar:/usr/share/zookeeper/slf4j-log4j12-1.7.5.jar:/usr/share/zookeeper/slf4j-api-1.7.5.jar:/usr/share/zookeeper/servlet-api-2.5-20081211.jar:/usr/share/zookeeper/netty-3.7.0.Final.jar:/usr/share/zookeeper/log4j-1.2.16.jar:/usr/share/zookeeper/jline-2.11.jar:/usr/share/zookeeper/jetty-util-6.1.26.jar:/usr/share/zookeeper/jetty-6.1.26.jar:/usr/share/zookeeper/javacc.jar:/usr/share/zookeeper/jackson-mapper-asl-1.9.11.jar:/usr/share/zookeeper/jackson-core-asl-1.9.11.jar:/usr/share/zookeeper/commons-cli-1.2.jar:/usr/src/java/lib/*.jar:/usr/etc/zookeeper"

ZOOCFG="$ZOOCFGDIR/zoo.cfg"
ZOO_LOG_DIR=/var/log/$NAME
USER=zookeeper
GROUP=zookeeper
PIDDIR=/var/run/$NAME
PIDFILE=$PIDDIR/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
JAVA=/usr/bin/java
ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
ZOO_LOG4J_PROP="INFO,ROLLINGFILE"
JMXLOCALONLY=false
JAVA_OPTS="-Xms{{ cluster.get('xms','128M') }} \
-Xmx{{ cluster.get('xmx','1G') }} \
-Xloggc:/var/log/$NAME/zookeeper-gc.log \
-XX:+UseGCLogFileRotation \
-XX:NumberOfGCLogFiles=16 \
-XX:GCLogFileSize=16M \
-verbose:gc \
-XX:+PrintGCTimeStamps \
-XX:+PrintGCDateStamps \
-XX:+PrintGCDetails \
-XX:+PrintTenuringDistribution \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintGCApplicationConcurrentTime \
-XX:+PrintSafepointStatistics \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled"
```

Salt init:

```text
description "zookeeper-{{ cluster['name'] }} centralized coordination service"

start on runlevel [2345]
stop on runlevel [!2345]

respawn

limit nofile 8192 8192

pre-start script
    [ -r "/etc/zookeeper-{{ cluster['name'] }}/conf/environment" ] || exit 0
    . /etc/zookeeper-{{ cluster['name'] }}/conf/environment
    [ -d $ZOO_LOG_DIR ] || mkdir -p $ZOO_LOG_DIR
    chown $USER:$GROUP $ZOO_LOG_DIR
end script

script
    . /etc/zookeeper-{{ cluster['name'] }}/conf/environment
    [ -r /etc/default/zookeeper ] && . /etc/default/zookeeper
    if [ -z "$JMXDISABLE" ]; then
        JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=$JMXLOCALONLY"
    fi
    exec start-stop-daemon --start -c $USER --exec $JAVA --name zookeeper-{{ cluster['name'] }} \
        -- -cp $CLASSPATH $JAVA_OPTS -Dzookeeper.log.dir=${ZOO_LOG_DIR} \
        -Dzookeeper.root.logger=${ZOO_LOG4J_PROP} $ZOOMAIN $ZOOCFG
end script
```