#!/bin/bash
# shellcheck disable=SC2094
# shellcheck disable=SC2086
# shellcheck disable=SC2024

# This script is similar to the script for the common stress test

# Avoid overlaps with previous runs
dmesg --clear

set -x

# core.COMM.PID-TID
sysctl kernel.core_pattern='core.%e.%p-%P'

function install_packages()
{
    dpkg -i $1/clickhouse-common-static_*.deb
    dpkg -i $1/clickhouse-common-static-dbg_*.deb
    dpkg -i $1/clickhouse-server_*.deb
    dpkg -i $1/clickhouse-client_*.deb
}

function configure()
{
    # install test configs
    export USE_DATABASE_ORDINARY=1
    export EXPORT_S3_STORAGE_POLICIES=1
    /usr/share/clickhouse-test/config/install.sh

    # we mount tests folder from repo to /usr/share
    ln -s /usr/share/clickhouse-test/ci/stress.py /usr/bin/stress
    ln -s /usr/share/clickhouse-test/clickhouse-test /usr/bin/clickhouse-test
    ln -s /usr/share/clickhouse-test/ci/download_release_packages.py /usr/bin/download_release_packages
    ln -s /usr/share/clickhouse-test/ci/get_previous_release_tag.py /usr/bin/get_previous_release_tag

    # avoid too slow startup
    sudo cat /etc/clickhouse-server/config.d/keeper_port.xml \
        | sed "s|100000|10000|" \
        > /etc/clickhouse-server/config.d/keeper_port.xml.tmp
    sudo mv /etc/clickhouse-server/config.d/keeper_port.xml.tmp /etc/clickhouse-server/config.d/keeper_port.xml
    sudo chown clickhouse /etc/clickhouse-server/config.d/keeper_port.xml
    sudo chgrp clickhouse /etc/clickhouse-server/config.d/keeper_port.xml

    # for clickhouse-server (via service)
    echo "ASAN_OPTIONS='malloc_context_size=10 verbosity=1 allocator_release_to_os_interval_ms=10000'" >> /etc/environment
    # for clickhouse-client
    export ASAN_OPTIONS='malloc_context_size=10 allocator_release_to_os_interval_ms=10000'

    # since we run clickhouse from root
    sudo chown root: /var/lib/clickhouse

    # Update asynchronous metrics more often, so that information about real memory usage stays fresh (less chance of OOM).
    echo "<clickhouse><asynchronous_metrics_update_period_s>1</asynchronous_metrics_update_period_s></clickhouse>" \
        > /etc/clickhouse-server/config.d/asynchronous_metrics_update_period_s.xml

    local total_mem
    total_mem=$(awk '/MemTotal/ { print $(NF-1) }' /proc/meminfo) # KiB
    total_mem=$(( total_mem*1024 )) # bytes

    # Set maximum memory usage as half of total memory (less chance of OOM).
    #
    # But not via max_server_memory_usage but via max_memory_usage_for_user,
    # so that we can override this setting and execute service queries, like:
    # - hung check
    # - show/drop database
    # - ...
    #
    # So max_memory_usage_for_user will be a soft limit, and
    # max_server_memory_usage will be a hard limit, and queries that should be
    # executed regardless of memory limits will use max_memory_usage_for_user=0,
    # instead of relying on max_untracked_memory

    local max_server_mem
    max_server_mem=$((total_mem*75/100)) # 75%
    echo "Setting max_server_memory_usage=$max_server_mem"
    cat > /etc/clickhouse-server/config.d/max_server_memory_usage.xml <<EOL
<clickhouse>
    <max_server_memory_usage>${max_server_mem}</max_server_memory_usage>
</clickhouse>
EOL

    local max_users_mem
    max_users_mem=$((total_mem*50/100)) # 50%
    echo "Setting max_memory_usage_for_user=$max_users_mem"
    cat > /etc/clickhouse-server/users.d/max_memory_usage_for_user.xml <<EOL
<clickhouse>
    <profiles>
        <default>
            <max_memory_usage_for_user>${max_users_mem}</max_memory_usage_for_user>
        </default>
    </profiles>
</clickhouse>
EOL

    cat > /etc/clickhouse-server/config.d/core.xml <<EOL
<clickhouse>
    <core_dump>
        <size_limit>107374182400</size_limit>
    </core_dump>
    <core_path>$PWD</core_path>
</clickhouse>
EOL

    # Analyzer is not yet ready for testing
    cat > /etc/clickhouse-server/users.d/no_analyzer.xml <<EOL
<clickhouse>
    <profiles>
        <default>
            <allow_experimental_analyzer>0</allow_experimental_analyzer>
        </default>
    </profiles>
</clickhouse>
EOL
}

function stop()
{
    local pid
    # Preserve the pid, since the server can hang after the PID file is deleted.
    pid="$(cat /var/run/clickhouse-server/clickhouse-server.pid)"

    clickhouse stop $max_tries --do-not-kill && return

    if [ -n "$1" ]
    then
        # temporarily disable it in BC check
        clickhouse stop --force
        return
    fi

    # We failed to stop the server with SIGTERM. Maybe it hung, let's collect stacktraces.
    kill -TERM "$(pidof gdb)" ||:
    sleep 5
    echo "thread apply all backtrace (on stop)" >> /test_output/gdb.log
    timeout 30m gdb -batch -ex 'thread apply all backtrace' -p "$pid" | ts '%Y-%m-%d %H:%M:%S' >> /test_output/gdb.log
    clickhouse stop --force
}

function start()
{
    counter=0
    until clickhouse-client --query "SELECT 1"
    do
        if [ "$counter" -gt ${1:-120} ]
        then
            echo "Cannot start clickhouse-server"
            echo -e "Cannot start clickhouse-server\tFAIL" >> /test_output/test_results.tsv
            cat /var/log/clickhouse-server/stdout.log
            tail -n1000 /var/log/clickhouse-server/stderr.log
            tail -n100000 /var/log/clickhouse-server/clickhouse-server.log | grep -F -v -e ' RaftInstance:' -e ' RaftInstance' | tail -n1000
            break
        fi
        # use root to match with current uid
        clickhouse start --user root >/var/log/clickhouse-server/stdout.log 2>>/var/log/clickhouse-server/stderr.log
        sleep 0.5
        counter=$((counter + 1))
    done

    # Set follow-fork-mode to parent, because we attach to clickhouse-server, not to watchdog
    # and clickhouse-server can do fork-exec, for example, to run some bridge.
    # Do not set nostop noprint for all signals, because it may cause gdb to hang;
    # explicitly ignore only the non-fatal signals that are used by the server.
    # The number of SIGRTMIN can be determined only at runtime.
    RTMIN=$(kill -l SIGRTMIN)
    echo "
set follow-fork-mode parent
handle SIGHUP nostop noprint pass
handle SIGINT nostop noprint pass
handle SIGQUIT nostop noprint pass
handle SIGPIPE nostop noprint pass
handle SIGTERM nostop noprint pass
handle SIGUSR1 nostop noprint pass
handle SIGUSR2 nostop noprint pass
handle SIG$RTMIN nostop noprint pass
info signals
continue
backtrace full
thread apply all backtrace full
info registers
disassemble /s
up
disassemble /s
up
disassemble /s
p \"done\"
detach
quit
" > script.gdb

    # FIXME Hung check may work incorrectly because of attached gdb
    # 1. False positives are possible
    # 2. We cannot attach another gdb to get stacktraces if some queries hung
    gdb -batch -command script.gdb -p "$(cat /var/run/clickhouse-server/clickhouse-server.pid)" | ts '%Y-%m-%d %H:%M:%S' >> /test_output/gdb.log &
    sleep 5
    # gdb will send SIGSTOP, spend some time loading debug info and then send SIGCONT, wait for it (up to send_timeout, 300s)
    time clickhouse-client --query "SELECT 'Connected to clickhouse-server after attaching gdb'" ||:
}

# Thread Fuzzer allows checking more permutations of possible thread scheduling
# and finding more potential issues.
# Temporarily disable ThreadFuzzer with tsan because of https://github.com/google/sanitizers/issues/1540
is_tsan_build=$(clickhouse local -q "select value like '% -fsanitize=thread %' from system.build_options where name='CXX_FLAGS'")
if [ "$is_tsan_build" -eq "0" ]; then
    export THREAD_FUZZER_CPU_TIME_PERIOD_US=1000
    export THREAD_FUZZER_SLEEP_PROBABILITY=0.1
    export THREAD_FUZZER_SLEEP_TIME_US=100000

    export THREAD_FUZZER_pthread_mutex_lock_BEFORE_MIGRATE_PROBABILITY=1
    export THREAD_FUZZER_pthread_mutex_lock_AFTER_MIGRATE_PROBABILITY=1
    export THREAD_FUZZER_pthread_mutex_unlock_BEFORE_MIGRATE_PROBABILITY=1
    export THREAD_FUZZER_pthread_mutex_unlock_AFTER_MIGRATE_PROBABILITY=1

    export THREAD_FUZZER_pthread_mutex_lock_BEFORE_SLEEP_PROBABILITY=0.001
    export THREAD_FUZZER_pthread_mutex_lock_AFTER_SLEEP_PROBABILITY=0.001
    export THREAD_FUZZER_pthread_mutex_unlock_BEFORE_SLEEP_PROBABILITY=0.001
    export THREAD_FUZZER_pthread_mutex_unlock_AFTER_SLEEP_PROBABILITY=0.001
    export THREAD_FUZZER_pthread_mutex_lock_BEFORE_SLEEP_TIME_US=10000
    export THREAD_FUZZER_pthread_mutex_lock_AFTER_SLEEP_TIME_US=10000
    export THREAD_FUZZER_pthread_mutex_unlock_BEFORE_SLEEP_TIME_US=10000
    export THREAD_FUZZER_pthread_mutex_unlock_AFTER_SLEEP_TIME_US=10000
fi

azurite-blob --blobHost 0.0.0.0 --blobPort 10000 --debug /azurite_log &
./setup_minio.sh stateless # to have a proper environment

# But we still need the default disk because some tables are loaded only into it
sudo cat /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml \
    | sed "s|<main><disk>s3</disk></main>|<main><disk>s3</disk></main><default><disk>default</disk></default>|" \
    > /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml.tmp
mv /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml.tmp /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml
sudo chown clickhouse /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml
sudo chgrp clickhouse /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml

echo "Get previous release tag"
previous_release_tag=$(dpkg --info package_folder/clickhouse-client*.deb | grep "Version: " | awk '{print $2}' | get_previous_release_tag)
echo $previous_release_tag

echo "Clone previous release repository"
git clone https://github.com/ClickHouse/ClickHouse.git --no-tags --progress --branch=$previous_release_tag --no-recurse-submodules --depth=1 previous_release_repository

echo "Download clickhouse-server from the previous release"
mkdir previous_release_package_folder

echo $previous_release_tag | download_release_packages && echo -e 'Download script exit code\tOK' >> /test_output/test_results.tsv \
    || echo -e 'Download script failed\tFAIL' >> /test_output/test_results.tsv

# Check if we cloned previous release repository successfully
if ! [ "$(ls -A previous_release_repository/tests/queries)" ]
then
    echo -e "Failed to clone previous release tests\tFAIL" >> /test_output/test_results.tsv
elif ! [ "$(ls -A previous_release_package_folder/clickhouse-common-static_*.deb && ls -A previous_release_package_folder/clickhouse-server_*.deb)" ]
then
    echo -e "Failed to download previous release packages\tFAIL" >> /test_output/test_results.tsv
else
    echo -e "Successfully cloned previous release tests\tOK" >> /test_output/test_results.tsv
    echo -e "Successfully downloaded previous release packages\tOK" >> /test_output/test_results.tsv

    # Make the upgrade check more interesting by forcing the Ordinary engine for the system database
    mkdir /var/lib/clickhouse/metadata
    echo "ATTACH DATABASE system ENGINE=Ordinary" > /var/lib/clickhouse/metadata/system.sql

    # Install previous release packages
    install_packages previous_release_package_folder

    # Start server from previous release
    # Previous version may not be ready for fault injections
    export ZOOKEEPER_FAULT_INJECTION=0
    configure

    # Avoid "Setting s3_check_objects_after_upload is neither a builtin setting..."
    rm -f /etc/clickhouse-server/users.d/enable_blobs_check.xml ||:
    rm -f /etc/clickhouse-server/users.d/marks.xml ||:

    # Remove s3 related configs to avoid "there is no disk type `cache`"
    rm -f /etc/clickhouse-server/config.d/storage_conf.xml ||:
    rm -f /etc/clickhouse-server/config.d/azure_storage_conf.xml ||:

    # Turn on after 22.12
    rm -f /etc/clickhouse-server/config.d/compressed_marks_and_index.xml ||:
    # it uses recently introduced settings which previous versions may not have
    rm -f /etc/clickhouse-server/users.d/insert_keeper_retries.xml ||:

    start

    clickhouse-client --query="SELECT 'Server version: ', version()"

    # Install new package before running stress test because we should use new
    # clickhouse-client and new clickhouse-test.
    #
    # But we should leave old binary in /usr/bin/ and debug symbols in
    # /usr/lib/debug/usr/bin (if any) for gdb and internal DWARF parser, so it
    # will print sane stacktraces and also to avoid possible crashes.
    #
    # FIXME: those files can be extracted directly from the debian package, but
    # a better solution would be to use a different PATH instead of playing
    # games with files from packages.
    mv /usr/bin/clickhouse previous_release_package_folder/
    mv /usr/lib/debug/usr/bin/clickhouse.debug previous_release_package_folder/
    install_packages package_folder
    mv /usr/bin/clickhouse package_folder/
    mv /usr/lib/debug/usr/bin/clickhouse.debug package_folder/
    mv previous_release_package_folder/clickhouse /usr/bin/
    mv previous_release_package_folder/clickhouse.debug /usr/lib/debug/usr/bin/clickhouse.debug

    mkdir tmp_stress_output

    stress --test-cmd="/usr/bin/clickhouse-test --queries=\"previous_release_repository/tests/queries\"" --upgrade-check --output-folder tmp_stress_output --global-time-limit=1200 \
        && echo -e 'Test script exit code\tOK' >> /test_output/test_results.tsv \
        || echo -e 'Test script failed\tFAIL' >> /test_output/test_results.tsv

    rm -rf tmp_stress_output

    # We experienced deadlocks in this command in very rare cases. Let's debug it:
    timeout 10m clickhouse-client --query="SELECT 'Tables count:', count() FROM system.tables" ||
    (
        echo "thread apply all backtrace (on select tables count)" >> /test_output/gdb.log
        timeout 30m gdb -batch -ex 'thread apply all backtrace' -p "$(cat /var/run/clickhouse-server/clickhouse-server.pid)" | ts '%Y-%m-%d %H:%M:%S' >> /test_output/gdb.log
        clickhouse stop --force
    )

    stop 1
    mv /var/log/clickhouse-server/clickhouse-server.log /var/log/clickhouse-server/clickhouse-server.stress.log

    # Start new server
    mv package_folder/clickhouse /usr/bin/
    mv package_folder/clickhouse.debug /usr/lib/debug/usr/bin/clickhouse.debug
    # Disable fault injections on start (we don't test them here, and they can lead to tons of requests when the number of tables is huge).
    export ZOOKEEPER_FAULT_INJECTION=0
    configure
    start 500
    clickhouse-client --query "SELECT 'Server successfully started', 'OK'" >> /test_output/test_results.tsv \
        || (echo -e 'Server failed to start\tFAIL' >> /test_output/test_results.tsv \
        && grep -a "<Error>.*Application" /var/log/clickhouse-server/clickhouse-server.log >> /test_output/application_errors.txt)

    # Remove file application_errors.txt if it's empty
    [ -s /test_output/application_errors.txt ] || rm /test_output/application_errors.txt

    clickhouse-client --query="SELECT 'Server version: ', version()"

    # Let the server run for a while before checking the log.
    sleep 60

    stop
    mv /var/log/clickhouse-server/clickhouse-server.log /var/log/clickhouse-server/clickhouse-server.upgrade.log

    # Error messages (we should ignore some errors)
    # FIXME https://github.com/ClickHouse/ClickHouse/issues/38643 ("Unknown index: idx.")
    # FIXME https://github.com/ClickHouse/ClickHouse/issues/39174 ("Cannot parse string 'Hello' as UInt64")
    # FIXME Not sure if it's expected, but some tests from the stress test may not be finished yet when we restart the server.
    #       Let's just ignore all errors from queries ("} <Error> TCPHandler: Code:", "} <Error> executeQuery: Code:")
    # FIXME https://github.com/ClickHouse/ClickHouse/issues/39197 ("Missing columns: 'v3' while processing query: 'v3, k, v1, v2, p'")
    # NOTE  Incompatibility was introduced in https://github.com/ClickHouse/ClickHouse/pull/39263, it's expected
    #       ("This engine is deprecated and is not supported in transactions", "[Queue = DB::MergeMutateRuntimeQueue]: Code: 235. DB::Exception: Part")
    # FIXME https://github.com/ClickHouse/ClickHouse/issues/39174 - bad mutation does not indicate backward incompatibility
    echo "Check for Error messages in server log:"
    zgrep -Fav -e "Code: 236. DB::Exception: Cancelled merging parts" \
               -e "Code: 236. DB::Exception: Cancelled mutating parts" \
               -e "REPLICA_IS_ALREADY_ACTIVE" \
               -e "REPLICA_ALREADY_EXISTS" \
               -e "ALL_REPLICAS_LOST" \
               -e "DDLWorker: Cannot parse DDL task query" \
               -e "RaftInstance: failed to accept a rpc connection due to error 125" \
               -e "UNKNOWN_DATABASE" \
               -e "NETWORK_ERROR" \
               -e "UNKNOWN_TABLE" \
               -e "ZooKeeperClient" \
               -e "KEEPER_EXCEPTION" \
               -e "DirectoryMonitor" \
               -e "TABLE_IS_READ_ONLY" \
               -e "Code: 1000, e.code() = 111, Connection refused" \
               -e "UNFINISHED" \
               -e "NETLINK_ERROR" \
               -e "Renaming unexpected part" \
               -e "PART_IS_TEMPORARILY_LOCKED" \
               -e "and a merge is impossible: we didn't find" \
               -e "found in queue and some source parts for it was lost" \
               -e "is lost forever." \
               -e "Unknown index: idx." \
               -e "Cannot parse string 'Hello' as UInt64" \
               -e "} <Error> TCPHandler: Code:" \
               -e "} <Error> executeQuery: Code:" \
               -e "Missing columns: 'v3' while processing query: 'v3, k, v1, v2, p'" \
               -e "This engine is deprecated and is not supported in transactions" \
               -e "[Queue = DB::MergeMutateRuntimeQueue]: Code: 235. DB::Exception: Part" \
               -e "The set of parts restored in place of" \
               -e "(ReplicatedMergeTreeAttachThread): Initialization failed. Error" \
               -e "Code: 269. DB::Exception: Destination table is myself" \
               -e "Coordination::Exception: Connection loss" \
               -e "MutateFromLogEntryTask" \
               -e "No connection to ZooKeeper, cannot get shared table ID" \
               -e "Session expired" \
        /var/log/clickhouse-server/clickhouse-server.upgrade.log | zgrep -Fa "<Error>" > /test_output/upgrade_error_messages.txt \
        && echo -e 'Error message in logs after server upgrade (see upgrade_error_messages.txt)\tFAIL' >> /test_output/test_results.tsv \
        || echo -e 'No Error messages after server upgrade\tOK' >> /test_output/test_results.tsv

    # Remove file upgrade_error_messages.txt if it's empty
    [ -s /test_output/upgrade_error_messages.txt ] || rm /test_output/upgrade_error_messages.txt

    # Sanitizer asserts
    zgrep -Fa "==================" /var/log/clickhouse-server/stderr.log >> /test_output/tmp
    zgrep -Fa "WARNING" /var/log/clickhouse-server/stderr.log >> /test_output/tmp
    zgrep -Fav -e "ASan doesn't fully support makecontext/swapcontext functions" -e "DB::Exception" /test_output/tmp > /dev/null \
        && echo -e 'Sanitizer assert (in stderr.log)\tFAIL' >> /test_output/test_results.tsv \
        || echo -e 'No sanitizer asserts\tOK' >> /test_output/test_results.tsv
    rm -f /test_output/tmp

    # OOM
    zgrep -Fa " <Fatal> Application: Child process was terminated by signal 9" /var/log/clickhouse-server/clickhouse-server.*.log > /dev/null \
        && echo -e 'OOM killer (or signal 9) in clickhouse-server.log\tFAIL' >> /test_output/test_results.tsv \
        || echo -e 'No OOM messages in clickhouse-server.log\tOK' >> /test_output/test_results.tsv

    # Logical errors
    echo "Check for Logical errors in server log:"
    zgrep -Fa -A20 "Code: 49, e.displayText() = DB::Exception:" /var/log/clickhouse-server/clickhouse-server.*.log > /test_output/logical_errors.txt \
        && echo -e 'Logical error thrown (see server logs or logical_errors.txt)\tFAIL' >> /test_output/test_results.tsv \
        || echo -e 'No logical errors\tOK' >> /test_output/test_results.tsv

    # Remove file logical_errors.txt if it's empty
    [ -s /test_output/logical_errors.txt ] || rm /test_output/logical_errors.txt

    # Crash
    zgrep -Fa "########################################" /var/log/clickhouse-server/clickhouse-server.*.log > /dev/null \
        && echo -e 'Killed by signal (in server logs)\tFAIL' >> /test_output/test_results.tsv \
        || echo -e 'Not crashed\tOK' >> /test_output/test_results.tsv

    # It also checks for crash without stacktrace (printed by watchdog)
    echo "Check for Fatal message in server log:"
    zgrep -Fa " <Fatal> " /var/log/clickhouse-server/clickhouse-server.*.log > /test_output/fatal_messages.txt \
        && echo -e 'Fatal message in server logs (see fatal_messages.txt)\tFAIL' >> /test_output/test_results.tsv \
        || echo -e 'No fatal messages in server logs\tOK' >> /test_output/test_results.tsv

    # Remove file fatal_messages.txt if it's empty
    [ -s /test_output/fatal_messages.txt ] || rm /test_output/fatal_messages.txt

    tar -chf /test_output/coordination.tar /var/lib/clickhouse/coordination ||:
    for table in query_log trace_log
    do
        clickhouse-local --path /var/lib/clickhouse/ --only-system-tables -q "select * from system.$table format TSVWithNamesAndTypes" | pigz > /test_output/$table.backward.tsv.gz ||:
    done
fi

dmesg -T > /test_output/dmesg.log

# OOM in dmesg -- those are real
grep -q -F -e 'Out of memory: Killed process' -e 'oom_reaper: reaped process' -e 'oom-kill:constraint=CONSTRAINT_NONE' /test_output/dmesg.log \
    && echo -e 'OOM in dmesg\tFAIL' >> /test_output/test_results.tsv \
    || echo -e 'No OOM in dmesg\tOK' >> /test_output/test_results.tsv

mv /var/log/clickhouse-server/stderr.log /test_output/

# Write check result into check_status.tsv
clickhouse-local --structure "test String, res String" -q "SELECT 'failure', test FROM table WHERE res != 'OK' order by (lower(test) like '%hung%'), rowNumberInAllBlocks() LIMIT 1" < /test_output/test_results.tsv > /test_output/check_status.tsv
[ -s /test_output/check_status.tsv ] || echo -e "success\tNo errors found" > /test_output/check_status.tsv

# Core dumps
for core in core.*; do
    pigz $core
    mv $core.gz /test_output/
done