Merge branch 'master' of github.com:yandex/ClickHouse

BayoNet 2018-12-10 13:59:30 +03:00
commit 291e531290
95 changed files with 6434 additions and 386 deletions

.gitmodules (vendored): 18 changed lines

@ -31,9 +31,6 @@
[submodule "contrib/ssl"]
path = contrib/ssl
url = https://github.com/ClickHouse-Extras/ssl.git
[submodule "contrib/boost"]
path = contrib/boost
url = https://github.com/ClickHouse-Extras/boost.git
[submodule "contrib/llvm"]
path = contrib/llvm
url = https://github.com/ClickHouse-Extras/llvm
@ -46,6 +43,21 @@
[submodule "contrib/unixodbc"]
path = contrib/unixodbc
url = https://github.com/ClickHouse-Extras/UnixODBC.git
[submodule "contrib/protobuf"]
path = contrib/protobuf
url = https://github.com/ClickHouse-Extras/protobuf.git
[submodule "contrib/boost"]
path = contrib/boost
url = https://github.com/ClickHouse-Extras/boost-extra.git
[submodule "contrib/base64"]
path = contrib/base64
url = https://github.com/aklomp/base64.git
[submodule "contrib/libhdfs3"]
path = contrib/libhdfs3
url = https://github.com/ClickHouse-Extras/libhdfs3.git
[submodule "contrib/libxml2"]
path = contrib/libxml2
url = https://github.com/GNOME/libxml2.git
[submodule "contrib/libgsasl"]
path = contrib/libgsasl
url = https://github.com/ClickHouse-Extras/libgsasl.git


@ -274,6 +274,9 @@ include (cmake/find_rdkafka.cmake)
include (cmake/find_capnp.cmake)
include (cmake/find_llvm.cmake)
include (cmake/find_cpuid.cmake)
include (cmake/find_libgsasl.cmake)
include (cmake/find_libxml2.cmake)
include (cmake/find_hdfs3.cmake)
include (cmake/find_consistent-hashing.cmake)
include (cmake/find_base64.cmake)
if (ENABLE_TESTS)

LICENSE: 190 changed lines

@ -1,4 +1,192 @@
Copyright 2016-2018 YANDEX LLC
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2016-2018 Yandex LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.


@ -1,4 +1,4 @@
# ClickHouse
[![ClickHouse — open source distributed column-oriented DBMS](https://github.com/yandex/ClickHouse/raw/master/website/images/logo-400x240.png)](https://clickhouse.yandex)
ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real time.

cmake/find_hdfs3.cmake (new file): 26 lines

@ -0,0 +1,26 @@
if (NOT ARCH_ARM AND NOT OS_FREEBSD AND NOT APPLE)
option (ENABLE_HDFS "Enable HDFS" ${NOT_UNBUNDLED})
endif ()
if (ENABLE_HDFS AND NOT EXISTS "${ClickHouse_SOURCE_DIR}/contrib/libhdfs3/include/hdfs/hdfs.h")
message (WARNING "submodule contrib/libhdfs3 is missing. To fix, try running: \n git submodule update --init --recursive")
set (ENABLE_HDFS 0)
endif ()
if (ENABLE_HDFS)
option (USE_INTERNAL_HDFS3_LIBRARY "Set to FALSE to use system HDFS3 instead of bundled" ON)
if (NOT USE_INTERNAL_HDFS3_LIBRARY)
find_package(hdfs3)
endif ()
if (HDFS3_LIBRARY AND HDFS3_INCLUDE_DIR)
else ()
set(HDFS3_INCLUDE_DIR "${ClickHouse_SOURCE_DIR}/contrib/libhdfs3/include")
set(HDFS3_LIBRARY hdfs3)
endif()
set (USE_HDFS 1)
endif()
message (STATUS "Using hdfs3: ${HDFS3_INCLUDE_DIR} : ${HDFS3_LIBRARY}")
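For context, a brief sketch (not part of the commit) of how the unbundling switch is driven from the parent scope: pre-seeding the cache variable before the module runs makes its option() call a no-op, so find_package(hdfs3) gets a chance to locate a system copy.

# Hypothetical invocation: prefer a system libhdfs3 over the bundled submodule.
set (USE_INTERNAL_HDFS3_LIBRARY OFF CACHE BOOL "Set to FALSE to use system HDFS3 instead of bundled" FORCE)
include (cmake/find_hdfs3.cmake)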

cmake/find_libgsasl.cmake (new file): 22 lines

@ -0,0 +1,22 @@
if (NOT APPLE)
option (USE_INTERNAL_LIBGSASL_LIBRARY "Set to FALSE to use system libgsasl library instead of bundled" ${NOT_UNBUNDLED})
endif ()
if (USE_INTERNAL_LIBGSASL_LIBRARY AND NOT EXISTS "${ClickHouse_SOURCE_DIR}/contrib/libgsasl/src/gsasl.h")
message (WARNING "submodule contrib/libgsasl is missing. To fix, try running: \n git submodule update --init --recursive")
set (USE_INTERNAL_LIBGSASL_LIBRARY 0)
endif ()
if (NOT USE_INTERNAL_LIBGSASL_LIBRARY)
find_library (LIBGSASL_LIBRARY gsasl)
find_path (LIBGSASL_INCLUDE_DIR NAMES gsasl.h PATHS ${LIBGSASL_INCLUDE_PATHS})
endif ()
if (LIBGSASL_LIBRARY AND LIBGSASL_INCLUDE_DIR)
else ()
set (LIBGSASL_INCLUDE_DIR ${ClickHouse_SOURCE_DIR}/contrib/libgsasl/src ${ClickHouse_SOURCE_DIR}/contrib/libgsasl/linux_x86_64/include)
set (USE_INTERNAL_LIBGSASL_LIBRARY 1)
set (LIBGSASL_LIBRARY libgsasl)
endif ()
message (STATUS "Using libgsasl: ${LIBGSASL_INCLUDE_DIR} : ${LIBGSASL_LIBRARY}")

cmake/find_libxml2.cmake (new file): 20 lines

@ -0,0 +1,20 @@
option (USE_INTERNAL_LIBXML2_LIBRARY "Set to FALSE to use system libxml2 library instead of bundled" ${NOT_UNBUNDLED})
if (USE_INTERNAL_LIBXML2_LIBRARY AND NOT EXISTS "${ClickHouse_SOURCE_DIR}/contrib/libxml2/libxml.h")
message (WARNING "submodule contrib/libxml2 is missing. To fix, try running: \n git submodule update --init --recursive")
set (USE_INTERNAL_LIBXML2_LIBRARY 0)
endif ()
if (NOT USE_INTERNAL_LIBXML2_LIBRARY)
find_library (LIBXML2_LIBRARY libxml2)
find_path (LIBXML2_INCLUDE_DIR NAMES libxml.h PATHS ${LIBXML2_INCLUDE_PATHS})
endif ()
if (LIBXML2_LIBRARY AND LIBXML2_INCLUDE_DIR)
else ()
set (LIBXML2_INCLUDE_DIR ${ClickHouse_SOURCE_DIR}/contrib/libxml2/include ${ClickHouse_SOURCE_DIR}/contrib/libxml2-cmake/linux_x86_64/include)
set (USE_INTERNAL_LIBXML2_LIBRARY 1)
set (LIBXML2_LIBRARY libxml2)
endif ()
message (STATUS "Using libxml2: ${LIBXML2_INCLUDE_DIR} : ${LIBXML2_LIBRARY}")

cmake/find_protobuf.cmake (new file): 80 lines

@ -0,0 +1,80 @@
option (USE_INTERNAL_PROTOBUF_LIBRARY "Set to FALSE to use system protobuf instead of bundled" ON)
if (NOT USE_INTERNAL_PROTOBUF_LIBRARY)
find_package(Protobuf)
endif ()
if (Protobuf_LIBRARY AND Protobuf_INCLUDE_DIR)
else ()
set(Protobuf_INCLUDE_DIR ${CMAKE_SOURCE_DIR}/contrib/protobuf/src)
set(Protobuf_LIBRARY libprotobuf)
set(Protobuf_PROTOC_LIBRARY libprotoc)
set(Protobuf_LITE_LIBRARY libprotobuf-lite)
set(Protobuf_PROTOC_EXECUTABLE ${CMAKE_BINARY_DIR}/contrib/protobuf/cmake/protoc)
if(NOT DEFINED PROTOBUF_GENERATE_CPP_APPEND_PATH)
set(PROTOBUF_GENERATE_CPP_APPEND_PATH TRUE)
endif()
function(PROTOBUF_GENERATE_CPP SRCS HDRS)
    if(NOT ARGN)
        message(SEND_ERROR "Error: PROTOBUF_GENERATE_CPP() called without any proto files")
        return()
    endif()
    if(PROTOBUF_GENERATE_CPP_APPEND_PATH)
        # Create an include path for each file specified
        foreach(FIL ${ARGN})
            get_filename_component(ABS_FIL ${FIL} ABSOLUTE)
            get_filename_component(ABS_PATH ${ABS_FIL} PATH)
            list(FIND _protobuf_include_path ${ABS_PATH} _contains_already)
            if(${_contains_already} EQUAL -1)
                list(APPEND _protobuf_include_path -I ${ABS_PATH})
            endif()
        endforeach()
    else()
        set(_protobuf_include_path -I ${CMAKE_CURRENT_SOURCE_DIR})
    endif()
    if(DEFINED PROTOBUF_IMPORT_DIRS AND NOT DEFINED Protobuf_IMPORT_DIRS)
        set(Protobuf_IMPORT_DIRS "${PROTOBUF_IMPORT_DIRS}")
    endif()
    if(DEFINED Protobuf_IMPORT_DIRS)
        foreach(DIR ${Protobuf_IMPORT_DIRS})
            get_filename_component(ABS_PATH ${DIR} ABSOLUTE)
            list(FIND _protobuf_include_path ${ABS_PATH} _contains_already)
            if(${_contains_already} EQUAL -1)
                list(APPEND _protobuf_include_path -I ${ABS_PATH})
            endif()
        endforeach()
    endif()
    set(${SRCS})
    set(${HDRS})
    foreach(FIL ${ARGN})
        get_filename_component(ABS_FIL ${FIL} ABSOLUTE)
        get_filename_component(FIL_WE ${FIL} NAME_WE)
        list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.cc")
        list(APPEND ${HDRS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.h")
        add_custom_command(
            OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.cc"
                   "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.h"
            COMMAND ${Protobuf_PROTOC_EXECUTABLE}
            ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR} ${_protobuf_include_path} ${ABS_FIL}
            DEPENDS ${ABS_FIL} ${Protobuf_PROTOC_EXECUTABLE}
            COMMENT "Running C++ protocol buffer compiler on ${FIL}"
            VERBATIM )
    endforeach()
    set_source_files_properties(${${SRCS}} ${${HDRS}} PROPERTIES GENERATED TRUE)
    set(${SRCS} ${${SRCS}} PARENT_SCOPE)
    set(${HDRS} ${${HDRS}} PARENT_SCOPE)
endfunction()
endif()
message (STATUS "Using protobuf: ${Protobuf_INCLUDE_DIR} : ${Protobuf_LIBRARY}")
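A minimal usage sketch for the PROTOBUF_GENERATE_CPP function defined above (the proto file and target name here are hypothetical; the commit's real call site is PROTOBUF_GENERATE_CPP(PROTO_SOURCES PROTO_HEADERS ${PROTO_FILES}) in libhdfs3-cmake below):

# Generate demo.pb.cc / demo.pb.h into the current binary dir and compile them.
PROTOBUF_GENERATE_CPP(DEMO_SRCS DEMO_HDRS ${CMAKE_CURRENT_SOURCE_DIR}/demo.proto)
add_library(demo_proto STATIC ${DEMO_SRCS} ${DEMO_HDRS})
target_include_directories(demo_proto PUBLIC ${Protobuf_INCLUDE_DIR} ${CMAKE_CURRENT_BINARY_DIR})
target_link_libraries(demo_proto ${Protobuf_LIBRARY})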


@ -110,6 +110,7 @@ if (USE_INTERNAL_SSL_LIBRARY)
add_subdirectory (ssl)
target_include_directories(${OPENSSL_CRYPTO_LIBRARY} SYSTEM PUBLIC ${OPENSSL_INCLUDE_DIR})
target_include_directories(${OPENSSL_SSL_LIBRARY} SYSTEM PUBLIC ${OPENSSL_INCLUDE_DIR})
set (POCO_SKIP_OPENSSL_FIND 1)
endif ()
if (ENABLE_MYSQL AND USE_INTERNAL_MYSQL_LIBRARY)
@ -192,6 +193,24 @@ if (USE_INTERNAL_LLVM_LIBRARY)
add_subdirectory (llvm/llvm)
endif ()
if (USE_INTERNAL_LIBGSASL_LIBRARY)
add_subdirectory(libgsasl)
endif()
if (USE_INTERNAL_LIBXML2_LIBRARY)
add_subdirectory(libxml2-cmake)
endif ()
if (USE_INTERNAL_HDFS3_LIBRARY)
include(${ClickHouse_SOURCE_DIR}/cmake/find_protobuf.cmake)
if (USE_INTERNAL_PROTOBUF_LIBRARY)
set(protobuf_BUILD_TESTS OFF CACHE INTERNAL "" FORCE)
set(protobuf_BUILD_SHARED_LIBS OFF CACHE INTERNAL "" FORCE)
add_subdirectory(protobuf/cmake)
endif ()
add_subdirectory(libhdfs3-cmake)
endif ()
if (USE_BASE64)
add_subdirectory (base64-cmake)
endif()

contrib/boost (vendored): 2 changed lines

@ -1 +1 @@
Subproject commit 2d5cb2c86f61126f4e1efe9ab97332efd44e7dea
Subproject commit 6883b40449f378019aec792f9983ce3afc7ff16e


@ -42,12 +42,17 @@ ${LIBRARY_DIR}/libs/filesystem/src/windows_file_codecvt.cpp)
add_library(boost_system_internal ${LINK_MODE}
${LIBRARY_DIR}/libs/system/src/error_code.cpp)
add_library(boost_random_internal ${LINK_MODE}
${LIBRARY_DIR}/libs/random/src/random_device.cpp)
target_link_libraries (boost_filesystem_internal PUBLIC boost_system_internal)
target_include_directories (boost_program_options_internal SYSTEM BEFORE PUBLIC ${Boost_INCLUDE_DIRS})
target_include_directories (boost_filesystem_internal SYSTEM BEFORE PUBLIC ${Boost_INCLUDE_DIRS})
target_include_directories (boost_system_internal SYSTEM BEFORE PUBLIC ${Boost_INCLUDE_DIRS})
target_include_directories (boost_random_internal SYSTEM BEFORE PUBLIC ${Boost_INCLUDE_DIRS})
target_compile_definitions (boost_program_options_internal PUBLIC BOOST_SYSTEM_NO_DEPRECATED)
target_compile_definitions (boost_filesystem_internal PUBLIC BOOST_SYSTEM_NO_DEPRECATED)
target_compile_definitions (boost_system_internal PUBLIC BOOST_SYSTEM_NO_DEPRECATED)
target_compile_definitions (boost_random_internal PUBLIC BOOST_SYSTEM_NO_DEPRECATED)

contrib/libgsasl (new vendored submodule): 1 line

@ -0,0 +1 @@
Subproject commit 3b8948a4042e34fb00b4fb987535dc9e02e39040

contrib/libhdfs3 (new vendored submodule): 1 line

@ -0,0 +1 @@
Subproject commit bd6505cbb0c130b0db695305b9a38546fa880e5a

CMake/CMakeTestCompileNestedException.cpp (new file)

@ -0,0 +1,10 @@
#include <exception>
#include <stdexcept>
int main() {
    try {
        throw 2;
    } catch (int) {
        std::throw_with_nested(std::runtime_error("test"));
    }
}

CMake/CMakeTestCompileSteadyClock.cpp (new file)

@ -0,0 +1,7 @@
#include <chrono>
using std::chrono::steady_clock;
void foo(const steady_clock &clock) {
    return;
}

CMake/CMakeTestCompileStrerror.cpp (new file)

@ -0,0 +1,10 @@
#include <string.h>
int main()
{
    // We can't test "char *p = strerror_r()" because that only causes a
    // compiler warning when strerror_r returns an integer.
    char *buf = 0;
    int i = strerror_r(0, buf, 100);
    return i;
}

CMake/CodeCoverage.cmake (new file)

@ -0,0 +1,48 @@
# Check prereqs
FIND_PROGRAM(GCOV_PATH gcov)
FIND_PROGRAM(LCOV_PATH lcov)
FIND_PROGRAM(GENHTML_PATH genhtml)
IF(NOT GCOV_PATH)
MESSAGE(FATAL_ERROR "gcov not found! Aborting...")
ENDIF(NOT GCOV_PATH)
IF(NOT CMAKE_BUILD_TYPE STREQUAL Debug)
MESSAGE(WARNING "Code coverage results with an optimised (non-Debug) build may be misleading")
ENDIF(NOT CMAKE_BUILD_TYPE STREQUAL Debug)
#Setup compiler options
ADD_DEFINITIONS(-fprofile-arcs -ftest-coverage)
SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fprofile-arcs ")
SET(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -fprofile-arcs ")
IF(NOT LCOV_PATH)
MESSAGE(FATAL_ERROR "lcov not found! Aborting...")
ENDIF(NOT LCOV_PATH)
IF(NOT GENHTML_PATH)
MESSAGE(FATAL_ERROR "genhtml not found! Aborting...")
ENDIF(NOT GENHTML_PATH)
#Setup target
ADD_CUSTOM_TARGET(ShowCoverage
#Capturing lcov counters and generating report
COMMAND ${LCOV_PATH} --directory . --capture --output-file CodeCoverage.info
COMMAND ${LCOV_PATH} --remove CodeCoverage.info '${CMAKE_CURRENT_BINARY_DIR}/*' 'test/*' 'mock/*' '/usr/*' '/opt/*' '*ext/rhel5_x86_64*' '*ext/osx*' --output-file CodeCoverage.info.cleaned
COMMAND ${GENHTML_PATH} -o CodeCoverageReport CodeCoverage.info.cleaned
)
ADD_CUSTOM_TARGET(ShowAllCoverage
#Capturing lcov counters and generating report
COMMAND ${LCOV_PATH} -a CodeCoverage.info.cleaned -a CodeCoverage.info.cleaned_withoutHA -o AllCodeCoverage.info
COMMAND sed -e 's|/.*/src|${CMAKE_SOURCE_DIR}/src|' -ig AllCodeCoverage.info
COMMAND ${GENHTML_PATH} -o AllCodeCoverageReport AllCodeCoverage.info
)
ADD_CUSTOM_TARGET(ResetCoverage
#Cleanup lcov
COMMAND ${LCOV_PATH} --directory . --zerocounters
)
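A sketch of how this module is meant to be pulled in (assuming the CMake/ directory is already on CMAKE_MODULE_PATH, as Options.cmake below arranges before its INCLUDE(CodeCoverage)):

# Enable instrumentation at include time, then build and run the tests before:
#   make ShowCoverage   - capture lcov counters and generate the HTML report
#   make ResetCoverage  - zero the lcov counters
SET(CMAKE_BUILD_TYPE Debug)   # a non-Debug build triggers the warning above
INCLUDE(CodeCoverage)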

File diff suppressed because it is too large.

CMake/FindGSasl.cmake (new file)

@ -0,0 +1,26 @@
# - Try to find the GNU sasl library (gsasl)
#
# Once done this will define
#
# GSASL_FOUND - System has gsasl
# GSASL_INCLUDE_DIR - The gsasl include directory
# GSASL_LIBRARIES - The libraries needed to use gsasl
# GSASL_DEFINITIONS - Compiler switches required for using gsasl
IF (GSASL_INCLUDE_DIR AND GSASL_LIBRARIES)
# in cache already
SET(GSasl_FIND_QUIETLY TRUE)
ENDIF (GSASL_INCLUDE_DIR AND GSASL_LIBRARIES)
FIND_PATH(GSASL_INCLUDE_DIR gsasl.h)
FIND_LIBRARY(GSASL_LIBRARIES gsasl)
INCLUDE(FindPackageHandleStandardArgs)
# handle the QUIETLY and REQUIRED arguments and set GSASL_FOUND to TRUE if
# all listed variables are TRUE
FIND_PACKAGE_HANDLE_STANDARD_ARGS(GSASL DEFAULT_MSG GSASL_LIBRARIES GSASL_INCLUDE_DIR)
MARK_AS_ADVANCED(GSASL_INCLUDE_DIR GSASL_LIBRARIES)
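Find modules like this one are consumed through find_package() once their directory is on CMAKE_MODULE_PATH; a sketch, assuming the module file is named FindGSasl.cmake (as the GSasl_FIND_QUIETLY variable above suggests) and with a hypothetical target name:

FIND_PACKAGE(GSasl)
IF (GSASL_FOUND)
    TARGET_INCLUDE_DIRECTORIES(my_target PRIVATE ${GSASL_INCLUDE_DIR})
    TARGET_LINK_LIBRARIES(my_target ${GSASL_LIBRARIES})
ENDIF ()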

CMake/FindGoogleTest.cmake (new file)

@ -0,0 +1,65 @@
include(CheckCXXSourceRuns)
find_path(GTest_INCLUDE_DIR gtest/gtest.h
NO_DEFAULT_PATH
PATHS
"${PROJECT_SOURCE_DIR}/../thirdparty/googletest/googletest/include"
"/usr/local/include"
"/usr/include")
find_path(GMock_INCLUDE_DIR gmock/gmock.h
NO_DEFAULT_PATH
PATHS
"${PROJECT_SOURCE_DIR}/../thirdparty/googletest/googlemock/include"
"/usr/local/include"
"/usr/include")
find_library(Gtest_LIBRARY
NAMES libgtest.a
HINTS
"${PROJECT_SOURCE_DIR}/../thirdparty/googletest/build/googlemock/gtest"
"/usr/local/lib"
"/usr/lib")
find_library(Gmock_LIBRARY
NAMES libgmock.a
HINTS
"${PROJECT_SOURCE_DIR}/../thirdparty/googletest/build/googlemock"
"/usr/local/lib"
"/usr/lib")
message(STATUS "Find GoogleTest include path: ${GTest_INCLUDE_DIR}")
message(STATUS "Find GoogleMock include path: ${GMock_INCLUDE_DIR}")
message(STATUS "Find Gtest library path: ${Gtest_LIBRARY}")
message(STATUS "Find Gmock library path: ${Gmock_LIBRARY}")
set(CMAKE_REQUIRED_INCLUDES ${GTest_INCLUDE_DIR} ${GMock_INCLUDE_DIR})
set(CMAKE_REQUIRED_LIBRARIES ${Gtest_LIBRARY} ${Gmock_LIBRARY} -lpthread)
set(CMAKE_REQUIRED_FLAGS)
check_cxx_source_runs("
#include <gtest/gtest.h>
#include <gmock/gmock.h>
int main(int argc, char *argv[])
{
double pi = 3.14;
EXPECT_EQ(pi, 3.14);
return 0;
}
" GoogleTest_CHECK_FINE)
message(STATUS "GoogleTest check: ${GoogleTest_CHECK_FINE}")
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(
GoogleTest
REQUIRED_VARS
GTest_INCLUDE_DIR
GMock_INCLUDE_DIR
Gtest_LIBRARY
Gmock_LIBRARY
GoogleTest_CHECK_FINE)
set(GoogleTest_INCLUDE_DIR ${GTest_INCLUDE_DIR} ${GMock_INCLUDE_DIR})
set(GoogleTest_LIBRARIES ${Gtest_LIBRARY} ${Gmock_LIBRARY})
mark_as_advanced(
GoogleTest_INCLUDE_DIR
GoogleTest_LIBRARIES)

CMake/FindKERBEROS.cmake (new file)

@ -0,0 +1,23 @@
# - Find kerberos
# Find the native KERBEROS includes and library
#
# KERBEROS_INCLUDE_DIRS - where to find krb5.h, etc.
# KERBEROS_LIBRARIES - List of libraries when using krb5.
# KERBEROS_FOUND - True if krb5 found.
IF (KERBEROS_INCLUDE_DIRS)
# Already in cache, be silent
SET(KERBEROS_FIND_QUIETLY TRUE)
ENDIF (KERBEROS_INCLUDE_DIRS)
FIND_PATH(KERBEROS_INCLUDE_DIRS krb5.h)
SET(KERBEROS_NAMES krb5 k5crypto com_err)
FIND_LIBRARY(KERBEROS_LIBRARIES NAMES ${KERBEROS_NAMES})
# handle the QUIETLY and REQUIRED arguments and set KERBEROS_FOUND to TRUE if
# all listed variables are TRUE
INCLUDE(FindPackageHandleStandardArgs)
FIND_PACKAGE_HANDLE_STANDARD_ARGS(KERBEROS DEFAULT_MSG KERBEROS_LIBRARIES KERBEROS_INCLUDE_DIRS)
MARK_AS_ADVANCED(KERBEROS_LIBRARIES KERBEROS_INCLUDE_DIRS)

CMake/FindSSL.cmake (new file)

@ -0,0 +1,26 @@
# - Try to find the OpenSSL crypto library (ssl)
#
# Once done this will define
#
# SSL_FOUND - System has OpenSSL
# SSL_INCLUDE_DIR - The OpenSSL include directory
# SSL_LIBRARIES - The libraries needed to use OpenSSL
# SSL_DEFINITIONS - Compiler switches required for using OpenSSL
IF (SSL_INCLUDE_DIR AND SSL_LIBRARIES)
# in cache already
SET(SSL_FIND_QUIETLY TRUE)
ENDIF (SSL_INCLUDE_DIR AND SSL_LIBRARIES)
FIND_PATH(SSL_INCLUDE_DIR openssl/opensslv.h)
FIND_LIBRARY(SSL_LIBRARIES crypto)
INCLUDE(FindPackageHandleStandardArgs)
# handle the QUIETLY and REQUIRED arguments and set SSL_FOUND to TRUE if
# all listed variables are TRUE
FIND_PACKAGE_HANDLE_STANDARD_ARGS(SSL DEFAULT_MSG SSL_LIBRARIES SSL_INCLUDE_DIR)
MARK_AS_ADVANCED(SSL_INCLUDE_DIR SSL_LIBRARIES)

CMake/Functions.cmake (new file)

@ -0,0 +1,46 @@
FUNCTION(AUTO_SOURCES RETURN_VALUE PATTERN SOURCE_SUBDIRS)
IF ("${SOURCE_SUBDIRS}" STREQUAL "RECURSE")
SET(PATH ".")
IF (${ARGC} EQUAL 4)
LIST(GET ARGV 3 PATH)
ENDIF ()
ENDIF()
IF ("${SOURCE_SUBDIRS}" STREQUAL "RECURSE")
UNSET(${RETURN_VALUE})
FILE(GLOB SUBDIR_FILES "${PATH}/${PATTERN}")
LIST(APPEND ${RETURN_VALUE} ${SUBDIR_FILES})
FILE(GLOB SUBDIRS RELATIVE ${PATH} ${PATH}/*)
FOREACH(DIR ${SUBDIRS})
IF (IS_DIRECTORY ${PATH}/${DIR})
IF (NOT "${DIR}" STREQUAL "CMAKEFILES")
FILE(GLOB_RECURSE SUBDIR_FILES "${PATH}/${DIR}/${PATTERN}")
LIST(APPEND ${RETURN_VALUE} ${SUBDIR_FILES})
ENDIF()
ENDIF()
ENDFOREACH()
ELSE ()
FILE(GLOB ${RETURN_VALUE} "${PATTERN}")
FOREACH (PATH ${SOURCE_SUBDIRS})
FILE(GLOB SUBDIR_FILES "${PATH}/${PATTERN}")
LIST(APPEND ${RETURN_VALUE} ${SUBDIR_FILES})
ENDFOREACH(PATH ${SOURCE_SUBDIRS})
ENDIF ()
IF (${FILTER_OUT})
LIST(REMOVE_ITEM ${RETURN_VALUE} ${FILTER_OUT})
ENDIF()
SET(${RETURN_VALUE} ${${RETURN_VALUE}} PARENT_SCOPE)
ENDFUNCTION(AUTO_SOURCES)
FUNCTION(CONTAINS_STRING FILE SEARCH RETURN_VALUE)
FILE(STRINGS ${FILE} FILE_CONTENTS REGEX ".*${SEARCH}.*")
IF (FILE_CONTENTS)
SET(${RETURN_VALUE} TRUE PARENT_SCOPE)
ENDIF()
ENDFUNCTION(CONTAINS_STRING)
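A usage sketch for the two helpers above (the source layout and target name are hypothetical):

# Collect *.cpp recursively under src/ and check a file for a marker string.
AUTO_SOURCES(demo_files "*.cpp" "RECURSE" "${CMAKE_CURRENT_SOURCE_DIR}/src")
ADD_LIBRARY(demo STATIC ${demo_files})
CONTAINS_STRING("${CMAKE_CURRENT_SOURCE_DIR}/src/main.cpp" "int main" HAS_MAIN)
IF (HAS_MAIN)
    MESSAGE(STATUS "src/main.cpp defines an entry point")
ENDIF ()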

CMake/Options.cmake (new file)

@ -0,0 +1,169 @@
OPTION(ENABLE_COVERAGE "enable code coverage" OFF)
OPTION(ENABLE_DEBUG "enable debug build" OFF)
OPTION(ENABLE_SSE "enable SSE4.2 builtin functions" ON)
OPTION(ENABLE_FRAME_POINTER "enable frame pointer on 64-bit systems with -fno-omit-frame-pointer; always enabled on 32-bit systems" ON)
OPTION(ENABLE_LIBCPP "use libc++ instead of libstdc++; only valid for the clang compiler" OFF)
OPTION(ENABLE_BOOST "use boost instead of native compiler c++0x support" OFF)
INCLUDE (CheckFunctionExists)
CHECK_FUNCTION_EXISTS(dladdr HAVE_DLADDR)
CHECK_FUNCTION_EXISTS(nanosleep HAVE_NANOSLEEP)
IF(ENABLE_DEBUG STREQUAL ON)
SET(CMAKE_BUILD_TYPE Debug CACHE
STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel." FORCE)
SET(CMAKE_CXX_FLAGS_DEBUG "-g -O0" CACHE STRING "compiler flags for debug" FORCE)
SET(CMAKE_C_FLAGS_DEBUG "-g -O0" CACHE STRING "compiler flags for debug" FORCE)
ELSE(ENABLE_DEBUG STREQUAL ON)
SET(CMAKE_BUILD_TYPE RelWithDebInfo CACHE
STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel." FORCE)
ENDIF(ENABLE_DEBUG STREQUAL ON)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-strict-aliasing")
SET(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fno-strict-aliasing")
IF(ENABLE_COVERAGE STREQUAL ON)
INCLUDE(CodeCoverage)
ENDIF(ENABLE_COVERAGE STREQUAL ON)
IF(ENABLE_FRAME_POINTER STREQUAL ON)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer")
ENDIF(ENABLE_FRAME_POINTER STREQUAL ON)
IF(ENABLE_SSE STREQUAL ON)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse4.2")
ENDIF(ENABLE_SSE STREQUAL ON)
IF(NOT TEST_HDFS_PREFIX)
SET(TEST_HDFS_PREFIX "./" CACHE STRING "default directory prefix used for test." FORCE)
ENDIF(NOT TEST_HDFS_PREFIX)
ADD_DEFINITIONS(-DTEST_HDFS_PREFIX="${TEST_HDFS_PREFIX}")
ADD_DEFINITIONS(-D__STDC_FORMAT_MACROS)
ADD_DEFINITIONS(-D_GNU_SOURCE)
IF(OS_MACOSX AND CMAKE_COMPILER_IS_GNUCXX)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wl,-bind_at_load")
ENDIF(OS_MACOSX AND CMAKE_COMPILER_IS_GNUCXX)
IF(OS_LINUX)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wl,--export-dynamic")
ENDIF(OS_LINUX)
SET(BOOST_ROOT ${CMAKE_PREFIX_PATH})
IF(ENABLE_BOOST STREQUAL ON)
MESSAGE(STATUS "using boost instead of native compiler c++0x support.")
FIND_PACKAGE(Boost 1.50 REQUIRED)
SET(NEED_BOOST true CACHE INTERNAL "boost is required")
ELSE(ENABLE_BOOST STREQUAL ON)
SET(NEED_BOOST false CACHE INTERNAL "boost is required")
ENDIF(ENABLE_BOOST STREQUAL ON)
IF(CMAKE_COMPILER_IS_GNUCXX)
IF(ENABLE_LIBCPP STREQUAL ON)
MESSAGE(FATAL_ERROR "Unsupported: using the GCC compiler with libc++")
ENDIF(ENABLE_LIBCPP STREQUAL ON)
IF((GCC_COMPILER_VERSION_MAJOR EQUAL 4) AND (GCC_COMPILER_VERSION_MINOR EQUAL 4) AND OS_MACOSX)
SET(NEED_GCCEH true CACHE INTERNAL "Explicitly link with gcc_eh")
MESSAGE(STATUS "link with -lgcc_eh for TLS")
ENDIF((GCC_COMPILER_VERSION_MAJOR EQUAL 4) AND (GCC_COMPILER_VERSION_MINOR EQUAL 4) AND OS_MACOSX)
IF((GCC_COMPILER_VERSION_MAJOR LESS 4) OR ((GCC_COMPILER_VERSION_MAJOR EQUAL 4) AND (GCC_COMPILER_VERSION_MINOR LESS 4)))
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall")
IF(NOT ENABLE_BOOST STREQUAL ON)
MESSAGE(STATUS "gcc version is older than 4.6.0, boost is required.")
FIND_PACKAGE(Boost 1.50 REQUIRED)
SET(NEED_BOOST true CACHE INTERNAL "boost is required")
ENDIF(NOT ENABLE_BOOST STREQUAL ON)
ELSEIF((GCC_COMPILER_VERSION_MAJOR EQUAL 4) AND (GCC_COMPILER_VERSION_MINOR LESS 7))
IF(NOT ENABLE_BOOST STREQUAL ON)
MESSAGE(STATUS "gcc version is older than 4.6.0, boost is required.")
FIND_PACKAGE(Boost 1.50 REQUIRED)
SET(NEED_BOOST true CACHE INTERNAL "boost is required")
ENDIF(NOT ENABLE_BOOST STREQUAL ON)
MESSAGE(STATUS "adding c++0x support for gcc compiler")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x")
ELSE((GCC_COMPILER_VERSION_MAJOR LESS 4) OR ((GCC_COMPILER_VERSION_MAJOR EQUAL 4) AND (GCC_COMPILER_VERSION_MINOR LESS 4)))
MESSAGE(STATUS "adding c++0x support for gcc compiler")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x")
ENDIF((GCC_COMPILER_VERSION_MAJOR LESS 4) OR ((GCC_COMPILER_VERSION_MAJOR EQUAL 4) AND (GCC_COMPILER_VERSION_MINOR LESS 4)))
IF(NEED_BOOST)
IF((Boost_MAJOR_VERSION LESS 1) OR ((Boost_MAJOR_VERSION EQUAL 1) AND (Boost_MINOR_VERSION LESS 50)))
MESSAGE(FATAL_ERROR "boost 1.50+ is required")
ENDIF()
ELSE(NEED_BOOST)
IF(HAVE_NANOSLEEP)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_NANOSLEEP")
ELSE(HAVE_NANOSLEEP)
MESSAGE(FATAL_ERROR "nanosleep() is required")
ENDIF(HAVE_NANOSLEEP)
ENDIF(NEED_BOOST)
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall")
ELSEIF(CMAKE_COMPILER_IS_CLANG)
MESSAGE(STATUS "adding c++0x support for clang compiler")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x")
SET(CMAKE_XCODE_ATTRIBUTE_CLANG_CXX_LANGUAGE_STANDARD "c++0x")
IF(ENABLE_LIBCPP STREQUAL ON)
MESSAGE(STATUS "using libc++ instead of libstdc++")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -stdlib=libc++")
SET(CMAKE_XCODE_ATTRIBUTE_CLANG_CXX_LIBRARY "libc++")
ENDIF(ENABLE_LIBCPP STREQUAL ON)
ENDIF(CMAKE_COMPILER_IS_GNUCXX)
TRY_COMPILE(STRERROR_R_RETURN_INT
${CMAKE_CURRENT_BINARY_DIR}
${HDFS3_ROOT_DIR}/CMake/CMakeTestCompileStrerror.cpp
CMAKE_FLAGS "-DCMAKE_CXX_LINK_EXECUTABLE='echo not linking now...'"
OUTPUT_VARIABLE OUTPUT)
MESSAGE(STATUS "Checking whether strerror_r returns an int")
IF(STRERROR_R_RETURN_INT)
MESSAGE(STATUS "Checking whether strerror_r returns an int -- yes")
ELSE(STRERROR_R_RETURN_INT)
MESSAGE(STATUS "Checking whether strerror_r returns an int -- no")
ENDIF(STRERROR_R_RETURN_INT)
TRY_COMPILE(HAVE_STEADY_CLOCK
${CMAKE_CURRENT_BINARY_DIR}
${HDFS3_ROOT_DIR}/CMake/CMakeTestCompileSteadyClock.cpp
CMAKE_FLAGS "-DCMAKE_CXX_LINK_EXECUTABLE='echo not linking now...'"
OUTPUT_VARIABLE OUTPUT)
TRY_COMPILE(HAVE_NESTED_EXCEPTION
${CMAKE_CURRENT_BINARY_DIR}
${HDFS3_ROOT_DIR}/CMake/CMakeTestCompileNestedException.cpp
CMAKE_FLAGS "-DCMAKE_CXX_LINK_EXECUTABLE='echo not linking now...'"
OUTPUT_VARIABLE OUTPUT)
FILE(WRITE ${CMAKE_CURRENT_BINARY_DIR}/test.cpp "#include <boost/chrono.hpp>")
TRY_COMPILE(HAVE_BOOST_CHRONO
${CMAKE_CURRENT_BINARY_DIR}
${CMAKE_CURRENT_BINARY_DIR}/test.cpp
CMAKE_FLAGS "-DCMAKE_CXX_LINK_EXECUTABLE='echo not linking now...'"
-DINCLUDE_DIRECTORIES=${Boost_INCLUDE_DIR}
OUTPUT_VARIABLE OUTPUT)
FILE(WRITE ${CMAKE_CURRENT_BINARY_DIR}/test.cpp "#include <chrono>")
TRY_COMPILE(HAVE_STD_CHRONO
${CMAKE_CURRENT_BINARY_DIR}
${CMAKE_CURRENT_BINARY_DIR}/test.cpp
CMAKE_FLAGS "-DCMAKE_CXX_LINK_EXECUTABLE='echo not linking now...'"
OUTPUT_VARIABLE OUTPUT)
FILE(WRITE ${CMAKE_CURRENT_BINARY_DIR}/test.cpp "#include <boost/atomic.hpp>")
TRY_COMPILE(HAVE_BOOST_ATOMIC
${CMAKE_CURRENT_BINARY_DIR}
${CMAKE_CURRENT_BINARY_DIR}/test.cpp
CMAKE_FLAGS "-DCMAKE_CXX_LINK_EXECUTABLE='echo not linking now...'"
-DINCLUDE_DIRECTORIES=${Boost_INCLUDE_DIR}
OUTPUT_VARIABLE OUTPUT)
FILE(WRITE ${CMAKE_CURRENT_BINARY_DIR}/test.cpp "#include <atomic>")
TRY_COMPILE(HAVE_STD_ATOMIC
${CMAKE_CURRENT_BINARY_DIR}
${CMAKE_CURRENT_BINARY_DIR}/test.cpp
CMAKE_FLAGS "-DCMAKE_CXX_LINK_EXECUTABLE='echo not linking now...'"
OUTPUT_VARIABLE OUTPUT)
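The TRY_COMPILE probes above only set CMake variables; a sketch (an assumption, not shown in the commit) of how results such as HAVE_STD_CHRONO would typically reach the sources through the platform.h.in template that libhdfs3-cmake's CMakeLists configures below:

# platform.h.in would carry matching template lines, e.g.:
#   #cmakedefine HAVE_STD_CHRONO
#   #cmakedefine HAVE_STEADY_CLOCK
#   #cmakedefine STRERROR_R_RETURN_INT
CONFIGURE_FILE(platform.h.in ${CMAKE_CURRENT_BINARY_DIR}/platform.h)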

CMake/Platform.cmake (new file)

@ -0,0 +1,33 @@
IF(CMAKE_SYSTEM_NAME STREQUAL "Linux")
SET(OS_LINUX true CACHE INTERNAL "Linux operating system")
ELSEIF(CMAKE_SYSTEM_NAME STREQUAL "Darwin")
SET(OS_MACOSX true CACHE INTERNAL "Mac Darwin operating system")
ELSE(CMAKE_SYSTEM_NAME STREQUAL "Linux")
MESSAGE(FATAL_ERROR "Unsupported OS: \"${CMAKE_SYSTEM_NAME}\"")
ENDIF(CMAKE_SYSTEM_NAME STREQUAL "Linux")
IF(CMAKE_COMPILER_IS_GNUCXX)
EXECUTE_PROCESS(COMMAND ${CMAKE_CXX_COMPILER} -dumpversion OUTPUT_VARIABLE GCC_COMPILER_VERSION)
IF (NOT GCC_COMPILER_VERSION)
MESSAGE(FATAL_ERROR "Cannot get gcc version")
ENDIF (NOT GCC_COMPILER_VERSION)
STRING(REGEX MATCHALL "[0-9]+" GCC_COMPILER_VERSION ${GCC_COMPILER_VERSION})
LIST(GET GCC_COMPILER_VERSION 0 GCC_COMPILER_VERSION_MAJOR)
LIST(GET GCC_COMPILER_VERSION 1 GCC_COMPILER_VERSION_MINOR)
SET(GCC_COMPILER_VERSION_MAJOR ${GCC_COMPILER_VERSION_MAJOR} CACHE INTERNAL "gcc major version")
SET(GCC_COMPILER_VERSION_MINOR ${GCC_COMPILER_VERSION_MINOR} CACHE INTERNAL "gcc minor version")
MESSAGE(STATUS "checking compiler: GCC (${GCC_COMPILER_VERSION_MAJOR}.${GCC_COMPILER_VERSION_MINOR}.${GCC_COMPILER_VERSION_PATCH})")
ELSE(CMAKE_COMPILER_IS_GNUCXX)
EXECUTE_PROCESS(COMMAND ${CMAKE_C_COMPILER} --version OUTPUT_VARIABLE COMPILER_OUTPUT)
IF(COMPILER_OUTPUT MATCHES "clang")
SET(CMAKE_COMPILER_IS_CLANG true CACHE INTERNAL "using clang as compiler")
MESSAGE(STATUS "checking compiler: CLANG")
ELSE(COMPILER_OUTPUT MATCHES "clang")
MESSAGE(FATAL_ERROR "Unsupported compiler: \"${CMAKE_CXX_COMPILER}\"")
ENDIF(COMPILER_OUTPUT MATCHES "clang")
ENDIF(CMAKE_COMPILER_IS_GNUCXX)
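The cached results are what Options.cmake branches on; a small consuming sketch, mirroring the flags Options.cmake actually sets:

INCLUDE(Platform)
IF (OS_LINUX)
    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wl,--export-dynamic")
ELSEIF (OS_MACOSX AND CMAKE_COMPILER_IS_GNUCXX)
    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wl,-bind_at_load")
ENDIF ()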

contrib/libhdfs3-cmake/CMakeLists.txt (new file)

@ -0,0 +1,212 @@
if (NOT USE_INTERNAL_PROTOBUF_LIBRARY)
# compatible with protobuf that was compiled with the old C++ ABI
set(CMAKE_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=0")
set(CMAKE_C_FLAGS "")
if (NOT (CMAKE_VERSION VERSION_LESS "3.8.0"))
unset(CMAKE_CXX_STANDARD)
endif ()
endif()
SET(WITH_KERBEROS false)
# project and source dir
set(HDFS3_ROOT_DIR ${CMAKE_SOURCE_DIR}/contrib/libhdfs3)
set(HDFS3_SOURCE_DIR ${HDFS3_ROOT_DIR}/src)
set(HDFS3_COMMON_DIR ${HDFS3_SOURCE_DIR}/common)
# module
set(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMake" ${CMAKE_MODULE_PATH})
include(Platform)
include(Options)
# prefer shared libraries
if (WITH_KERBEROS)
find_package(KERBEROS REQUIRED)
endif()
# source
set(PROTO_FILES
#${HDFS3_SOURCE_DIR}/proto/encryption.proto
${HDFS3_SOURCE_DIR}/proto/ClientDatanodeProtocol.proto
${HDFS3_SOURCE_DIR}/proto/hdfs.proto
${HDFS3_SOURCE_DIR}/proto/Security.proto
${HDFS3_SOURCE_DIR}/proto/ProtobufRpcEngine.proto
${HDFS3_SOURCE_DIR}/proto/ClientNamenodeProtocol.proto
${HDFS3_SOURCE_DIR}/proto/IpcConnectionContext.proto
${HDFS3_SOURCE_DIR}/proto/RpcHeader.proto
${HDFS3_SOURCE_DIR}/proto/datatransfer.proto
)
PROTOBUF_GENERATE_CPP(PROTO_SOURCES PROTO_HEADERS ${PROTO_FILES})
configure_file(${HDFS3_SOURCE_DIR}/platform.h.in ${CMAKE_CURRENT_BINARY_DIR}/platform.h)
set(SRCS
${HDFS3_SOURCE_DIR}/network/TcpSocket.cpp
${HDFS3_SOURCE_DIR}/network/DomainSocket.cpp
${HDFS3_SOURCE_DIR}/network/BufferedSocketReader.cpp
${HDFS3_SOURCE_DIR}/client/ReadShortCircuitInfo.cpp
${HDFS3_SOURCE_DIR}/client/Pipeline.cpp
${HDFS3_SOURCE_DIR}/client/Hdfs.cpp
${HDFS3_SOURCE_DIR}/client/Packet.cpp
${HDFS3_SOURCE_DIR}/client/OutputStreamImpl.cpp
${HDFS3_SOURCE_DIR}/client/KerberosName.cpp
${HDFS3_SOURCE_DIR}/client/PacketHeader.cpp
${HDFS3_SOURCE_DIR}/client/LocalBlockReader.cpp
${HDFS3_SOURCE_DIR}/client/UserInfo.cpp
${HDFS3_SOURCE_DIR}/client/RemoteBlockReader.cpp
${HDFS3_SOURCE_DIR}/client/Permission.cpp
${HDFS3_SOURCE_DIR}/client/FileSystemImpl.cpp
${HDFS3_SOURCE_DIR}/client/DirectoryIterator.cpp
${HDFS3_SOURCE_DIR}/client/FileSystemKey.cpp
${HDFS3_SOURCE_DIR}/client/DataTransferProtocolSender.cpp
${HDFS3_SOURCE_DIR}/client/LeaseRenewer.cpp
${HDFS3_SOURCE_DIR}/client/PeerCache.cpp
${HDFS3_SOURCE_DIR}/client/InputStream.cpp
${HDFS3_SOURCE_DIR}/client/FileSystem.cpp
${HDFS3_SOURCE_DIR}/client/InputStreamImpl.cpp
${HDFS3_SOURCE_DIR}/client/Token.cpp
${HDFS3_SOURCE_DIR}/client/PacketPool.cpp
${HDFS3_SOURCE_DIR}/client/OutputStream.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcChannelKey.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcProtocolInfo.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcClient.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcRemoteCall.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcChannel.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcAuth.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcContentWrapper.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcConfig.cpp
${HDFS3_SOURCE_DIR}/rpc/RpcServerInfo.cpp
${HDFS3_SOURCE_DIR}/rpc/SaslClient.cpp
${HDFS3_SOURCE_DIR}/server/Datanode.cpp
${HDFS3_SOURCE_DIR}/server/LocatedBlocks.cpp
${HDFS3_SOURCE_DIR}/server/NamenodeProxy.cpp
${HDFS3_SOURCE_DIR}/server/NamenodeImpl.cpp
${HDFS3_SOURCE_DIR}/server/NamenodeInfo.cpp
${HDFS3_SOURCE_DIR}/common/WritableUtils.cpp
${HDFS3_SOURCE_DIR}/common/ExceptionInternal.cpp
${HDFS3_SOURCE_DIR}/common/SessionConfig.cpp
${HDFS3_SOURCE_DIR}/common/StackPrinter.cpp
${HDFS3_SOURCE_DIR}/common/Exception.cpp
${HDFS3_SOURCE_DIR}/common/Logger.cpp
${HDFS3_SOURCE_DIR}/common/CFileWrapper.cpp
${HDFS3_SOURCE_DIR}/common/XmlConfig.cpp
${HDFS3_SOURCE_DIR}/common/WriteBuffer.cpp
${HDFS3_SOURCE_DIR}/common/HWCrc32c.cpp
${HDFS3_SOURCE_DIR}/common/MappedFileWrapper.cpp
${HDFS3_SOURCE_DIR}/common/Hash.cpp
${HDFS3_SOURCE_DIR}/common/SWCrc32c.cpp
${HDFS3_SOURCE_DIR}/common/Thread.cpp
${HDFS3_SOURCE_DIR}/network/TcpSocket.h
${HDFS3_SOURCE_DIR}/network/BufferedSocketReader.h
${HDFS3_SOURCE_DIR}/network/Socket.h
${HDFS3_SOURCE_DIR}/network/DomainSocket.h
${HDFS3_SOURCE_DIR}/network/Syscall.h
${HDFS3_SOURCE_DIR}/client/InputStreamImpl.h
${HDFS3_SOURCE_DIR}/client/FileSystem.h
${HDFS3_SOURCE_DIR}/client/ReadShortCircuitInfo.h
${HDFS3_SOURCE_DIR}/client/InputStreamInter.h
${HDFS3_SOURCE_DIR}/client/FileSystemImpl.h
${HDFS3_SOURCE_DIR}/client/PacketPool.h
${HDFS3_SOURCE_DIR}/client/Pipeline.h
${HDFS3_SOURCE_DIR}/client/OutputStreamInter.h
${HDFS3_SOURCE_DIR}/client/RemoteBlockReader.h
${HDFS3_SOURCE_DIR}/client/Token.h
${HDFS3_SOURCE_DIR}/client/KerberosName.h
${HDFS3_SOURCE_DIR}/client/DirectoryIterator.h
${HDFS3_SOURCE_DIR}/client/hdfs.h
${HDFS3_SOURCE_DIR}/client/FileSystemStats.h
${HDFS3_SOURCE_DIR}/client/FileSystemKey.h
${HDFS3_SOURCE_DIR}/client/DataTransferProtocolSender.h
${HDFS3_SOURCE_DIR}/client/Packet.h
${HDFS3_SOURCE_DIR}/client/PacketHeader.h
${HDFS3_SOURCE_DIR}/client/FileSystemInter.h
${HDFS3_SOURCE_DIR}/client/LocalBlockReader.h
${HDFS3_SOURCE_DIR}/client/TokenInternal.h
${HDFS3_SOURCE_DIR}/client/InputStream.h
${HDFS3_SOURCE_DIR}/client/PipelineAck.h
${HDFS3_SOURCE_DIR}/client/BlockReader.h
${HDFS3_SOURCE_DIR}/client/Permission.h
${HDFS3_SOURCE_DIR}/client/OutputStreamImpl.h
${HDFS3_SOURCE_DIR}/client/LeaseRenewer.h
${HDFS3_SOURCE_DIR}/client/UserInfo.h
${HDFS3_SOURCE_DIR}/client/PeerCache.h
${HDFS3_SOURCE_DIR}/client/OutputStream.h
${HDFS3_SOURCE_DIR}/client/FileStatus.h
${HDFS3_SOURCE_DIR}/client/DataTransferProtocol.h
${HDFS3_SOURCE_DIR}/client/BlockLocation.h
${HDFS3_SOURCE_DIR}/rpc/RpcConfig.h
${HDFS3_SOURCE_DIR}/rpc/SaslClient.h
${HDFS3_SOURCE_DIR}/rpc/RpcAuth.h
${HDFS3_SOURCE_DIR}/rpc/RpcClient.h
${HDFS3_SOURCE_DIR}/rpc/RpcCall.h
${HDFS3_SOURCE_DIR}/rpc/RpcContentWrapper.h
${HDFS3_SOURCE_DIR}/rpc/RpcProtocolInfo.h
${HDFS3_SOURCE_DIR}/rpc/RpcRemoteCall.h
${HDFS3_SOURCE_DIR}/rpc/RpcServerInfo.h
${HDFS3_SOURCE_DIR}/rpc/RpcChannel.h
${HDFS3_SOURCE_DIR}/rpc/RpcChannelKey.h
${HDFS3_SOURCE_DIR}/server/BlockLocalPathInfo.h
${HDFS3_SOURCE_DIR}/server/LocatedBlocks.h
${HDFS3_SOURCE_DIR}/server/DatanodeInfo.h
${HDFS3_SOURCE_DIR}/server/RpcHelper.h
${HDFS3_SOURCE_DIR}/server/ExtendedBlock.h
${HDFS3_SOURCE_DIR}/server/NamenodeInfo.h
${HDFS3_SOURCE_DIR}/server/NamenodeImpl.h
${HDFS3_SOURCE_DIR}/server/LocatedBlock.h
${HDFS3_SOURCE_DIR}/server/NamenodeProxy.h
${HDFS3_SOURCE_DIR}/server/Datanode.h
${HDFS3_SOURCE_DIR}/server/Namenode.h
${HDFS3_SOURCE_DIR}/common/XmlConfig.h
${HDFS3_SOURCE_DIR}/common/Logger.h
${HDFS3_SOURCE_DIR}/common/WriteBuffer.h
${HDFS3_SOURCE_DIR}/common/HWCrc32c.h
${HDFS3_SOURCE_DIR}/common/Checksum.h
${HDFS3_SOURCE_DIR}/common/SessionConfig.h
${HDFS3_SOURCE_DIR}/common/Unordered.h
${HDFS3_SOURCE_DIR}/common/BigEndian.h
${HDFS3_SOURCE_DIR}/common/Thread.h
${HDFS3_SOURCE_DIR}/common/StackPrinter.h
${HDFS3_SOURCE_DIR}/common/Exception.h
${HDFS3_SOURCE_DIR}/common/WritableUtils.h
${HDFS3_SOURCE_DIR}/common/StringUtil.h
${HDFS3_SOURCE_DIR}/common/LruMap.h
${HDFS3_SOURCE_DIR}/common/Function.h
${HDFS3_SOURCE_DIR}/common/DateTime.h
${HDFS3_SOURCE_DIR}/common/Hash.h
${HDFS3_SOURCE_DIR}/common/SWCrc32c.h
${HDFS3_SOURCE_DIR}/common/ExceptionInternal.h
${HDFS3_SOURCE_DIR}/common/Memory.h
${HDFS3_SOURCE_DIR}/common/FileWrapper.h
)
# target
add_library(hdfs3 STATIC ${SRCS} ${PROTO_SOURCES} ${PROTO_HEADERS})
if (USE_INTERNAL_PROTOBUF_LIBRARY)
add_dependencies(hdfs3 protoc)
endif()
target_include_directories(hdfs3 PRIVATE ${HDFS3_SOURCE_DIR})
target_include_directories(hdfs3 PRIVATE ${HDFS3_COMMON_DIR})
target_include_directories(hdfs3 PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
target_include_directories(hdfs3 PRIVATE ${LIBGSASL_INCLUDE_DIR})
if (WITH_KERBEROS)
target_include_directories(hdfs3 PRIVATE ${KERBEROS_INCLUDE_DIRS})
endif()
target_include_directories(hdfs3 PRIVATE ${LIBXML2_INCLUDE_DIR})
target_link_libraries(hdfs3 ${LIBGSASL_LIBRARY})
if (WITH_KERBEROS)
target_link_libraries(hdfs3 ${KERBEROS_LIBRARIES})
endif()
target_link_libraries(hdfs3 ${LIBXML2_LIBRARY})
# inherit from parent cmake
target_include_directories(hdfs3 PRIVATE ${Boost_INCLUDE_DIRS})
target_include_directories(hdfs3 PRIVATE ${Protobuf_INCLUDE_DIR})
target_include_directories(hdfs3 PRIVATE ${OPENSSL_INCLUDE_DIR})
target_link_libraries(hdfs3 ${Protobuf_LIBRARY})
target_link_libraries(hdfs3 ${OPENSSL_LIBRARIES})
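A consumer-side sketch (my_target is a hypothetical name): linking the hdfs3 target lets CMake propagate the gsasl, libxml2, protobuf, and OpenSSL libraries recorded above, so only the public headers need wiring up.

if (USE_HDFS)
    target_link_libraries (my_target PRIVATE hdfs3)
    target_include_directories (my_target SYSTEM PRIVATE ${HDFS3_INCLUDE_DIR})
endif ()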

contrib/libxml2 (new vendored submodule): 1 line

@ -0,0 +1 @@
Subproject commit 18890f471c420411aa3c989e104d090966ec9dbf

contrib/libxml2-cmake/CMakeLists.txt (new file)

@ -0,0 +1,65 @@
set(LIBXML2_SOURCE_DIR ${CMAKE_SOURCE_DIR}/contrib/libxml2)
set(LIBXML2_BINARY_DIR ${CMAKE_BINARY_DIR}/contrib/libxml2)
set(SRCS
${LIBXML2_SOURCE_DIR}/parser.c
${LIBXML2_SOURCE_DIR}/HTMLparser.c
${LIBXML2_SOURCE_DIR}/buf.c
${LIBXML2_SOURCE_DIR}/xzlib.c
${LIBXML2_SOURCE_DIR}/xmlregexp.c
${LIBXML2_SOURCE_DIR}/entities.c
${LIBXML2_SOURCE_DIR}/rngparser.c
${LIBXML2_SOURCE_DIR}/encoding.c
${LIBXML2_SOURCE_DIR}/legacy.c
${LIBXML2_SOURCE_DIR}/error.c
${LIBXML2_SOURCE_DIR}/debugXML.c
${LIBXML2_SOURCE_DIR}/xpointer.c
${LIBXML2_SOURCE_DIR}/DOCBparser.c
${LIBXML2_SOURCE_DIR}/xmlcatalog.c
${LIBXML2_SOURCE_DIR}/c14n.c
${LIBXML2_SOURCE_DIR}/xmlreader.c
${LIBXML2_SOURCE_DIR}/xmlstring.c
${LIBXML2_SOURCE_DIR}/dict.c
${LIBXML2_SOURCE_DIR}/xpath.c
${LIBXML2_SOURCE_DIR}/tree.c
${LIBXML2_SOURCE_DIR}/trionan.c
${LIBXML2_SOURCE_DIR}/pattern.c
${LIBXML2_SOURCE_DIR}/globals.c
${LIBXML2_SOURCE_DIR}/xmllint.c
${LIBXML2_SOURCE_DIR}/chvalid.c
${LIBXML2_SOURCE_DIR}/relaxng.c
${LIBXML2_SOURCE_DIR}/list.c
${LIBXML2_SOURCE_DIR}/xinclude.c
${LIBXML2_SOURCE_DIR}/xmlIO.c
${LIBXML2_SOURCE_DIR}/triostr.c
${LIBXML2_SOURCE_DIR}/hash.c
${LIBXML2_SOURCE_DIR}/xmlsave.c
${LIBXML2_SOURCE_DIR}/HTMLtree.c
${LIBXML2_SOURCE_DIR}/SAX.c
${LIBXML2_SOURCE_DIR}/xmlschemas.c
${LIBXML2_SOURCE_DIR}/SAX2.c
${LIBXML2_SOURCE_DIR}/threads.c
${LIBXML2_SOURCE_DIR}/runsuite.c
${LIBXML2_SOURCE_DIR}/catalog.c
${LIBXML2_SOURCE_DIR}/uri.c
${LIBXML2_SOURCE_DIR}/xmlmodule.c
${LIBXML2_SOURCE_DIR}/xlink.c
${LIBXML2_SOURCE_DIR}/parserInternals.c
${LIBXML2_SOURCE_DIR}/xmlwriter.c
${LIBXML2_SOURCE_DIR}/xmlunicode.c
${LIBXML2_SOURCE_DIR}/runxmlconf.c
${LIBXML2_SOURCE_DIR}/xmlmemory.c
${LIBXML2_SOURCE_DIR}/nanoftp.c
${LIBXML2_SOURCE_DIR}/xmlschemastypes.c
${LIBXML2_SOURCE_DIR}/valid.c
${LIBXML2_SOURCE_DIR}/nanohttp.c
${LIBXML2_SOURCE_DIR}/schematron.c
)
add_library(libxml2 STATIC ${SRCS})
target_link_libraries(libxml2 ${ZLIB_LIBRARIES})
target_include_directories(libxml2 PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/linux_x86_64/include)
target_include_directories(libxml2 PUBLIC ${LIBXML2_SOURCE_DIR}/include)
target_include_directories(libxml2 PRIVATE ${ZLIB_INCLUDE_DIR}/include)

contrib/libxml2-cmake/linux_x86_64/include/config.h (new file)

@ -0,0 +1,285 @@
/* config.h. Generated from config.h.in by configure. */
/* config.h.in. Generated from configure.ac by autoheader. */
/* Type cast for the gethostbyname() argument */
#define GETHOSTBYNAME_ARG_CAST /**/
/* Define to 1 if you have the <arpa/inet.h> header file. */
#define HAVE_ARPA_INET_H 1
/* Define to 1 if you have the <arpa/nameser.h> header file. */
#define HAVE_ARPA_NAMESER_H 1
/* Whether struct sockaddr::__ss_family exists */
/* #undef HAVE_BROKEN_SS_FAMILY */
/* Define to 1 if you have the <ctype.h> header file. */
#define HAVE_CTYPE_H 1
/* Define to 1 if you have the <dirent.h> header file. */
#define HAVE_DIRENT_H 1
/* Define to 1 if you have the <dlfcn.h> header file. */
#define HAVE_DLFCN_H 1
/* Have dlopen based dso */
#define HAVE_DLOPEN /**/
/* Define to 1 if you have the <dl.h> header file. */
/* #undef HAVE_DL_H */
/* Define to 1 if you have the <errno.h> header file. */
#define HAVE_ERRNO_H 1
/* Define to 1 if you have the <fcntl.h> header file. */
#define HAVE_FCNTL_H 1
/* Define to 1 if you have the <float.h> header file. */
#define HAVE_FLOAT_H 1
/* Define to 1 if you have the `fprintf' function. */
#define HAVE_FPRINTF 1
/* Define to 1 if you have the `ftime' function. */
#define HAVE_FTIME 1
/* Define if getaddrinfo is there */
#define HAVE_GETADDRINFO /**/
/* Define to 1 if you have the `gettimeofday' function. */
#define HAVE_GETTIMEOFDAY 1
/* Define to 1 if you have the <inttypes.h> header file. */
#define HAVE_INTTYPES_H 1
/* Define to 1 if you have the `isascii' function. */
#define HAVE_ISASCII 1
/* Define if isinf is there */
#define HAVE_ISINF /**/
/* Define if isnan is there */
#define HAVE_ISNAN /**/
/* Define if history library is there (-lhistory) */
/* #undef HAVE_LIBHISTORY */
/* Define if pthread library is there (-lpthread) */
#define HAVE_LIBPTHREAD /**/
/* Define if readline library is there (-lreadline) */
/* #undef HAVE_LIBREADLINE */
/* Define to 1 if you have the <limits.h> header file. */
#define HAVE_LIMITS_H 1
/* Define to 1 if you have the `localtime' function. */
#define HAVE_LOCALTIME 1
/* Define to 1 if you have the <lzma.h> header file. */
/* #undef HAVE_LZMA_H */
/* Define to 1 if you have the <malloc.h> header file. */
#define HAVE_MALLOC_H 1
/* Define to 1 if you have the <math.h> header file. */
#define HAVE_MATH_H 1
/* Define to 1 if you have the <memory.h> header file. */
#define HAVE_MEMORY_H 1
/* Define to 1 if you have the `mmap' function. */
#define HAVE_MMAP 1
/* Define to 1 if you have the `munmap' function. */
#define HAVE_MUNMAP 1
/* mmap() is no good without munmap() */
#if defined(HAVE_MMAP) && !defined(HAVE_MUNMAP)
# undef /**/ HAVE_MMAP
#endif
/* Define to 1 if you have the <ndir.h> header file, and it defines `DIR'. */
/* #undef HAVE_NDIR_H */
/* Define to 1 if you have the <netdb.h> header file. */
#define HAVE_NETDB_H 1
/* Define to 1 if you have the <netinet/in.h> header file. */
#define HAVE_NETINET_IN_H 1
/* Define to 1 if you have the <poll.h> header file. */
#define HAVE_POLL_H 1
/* Define to 1 if you have the `printf' function. */
#define HAVE_PRINTF 1
/* Define if <pthread.h> is there */
#define HAVE_PTHREAD_H /**/
/* Define to 1 if you have the `putenv' function. */
#define HAVE_PUTENV 1
/* Define to 1 if you have the `rand' function. */
#define HAVE_RAND 1
/* Define to 1 if you have the `rand_r' function. */
#define HAVE_RAND_R 1
/* Define to 1 if you have the <resolv.h> header file. */
#define HAVE_RESOLV_H 1
/* Have shl_load based dso */
/* #undef HAVE_SHLLOAD */
/* Define to 1 if you have the `signal' function. */
#define HAVE_SIGNAL 1
/* Define to 1 if you have the <signal.h> header file. */
#define HAVE_SIGNAL_H 1
/* Define to 1 if you have the `snprintf' function. */
#define HAVE_SNPRINTF 1
/* Define to 1 if you have the `sprintf' function. */
#define HAVE_SPRINTF 1
/* Define to 1 if you have the `srand' function. */
#define HAVE_SRAND 1
/* Define to 1 if you have the `sscanf' function. */
#define HAVE_SSCANF 1
/* Define to 1 if you have the `stat' function. */
#define HAVE_STAT 1
/* Define to 1 if you have the <stdarg.h> header file. */
#define HAVE_STDARG_H 1
/* Define to 1 if you have the <stdint.h> header file. */
#define HAVE_STDINT_H 1
/* Define to 1 if you have the <stdlib.h> header file. */
#define HAVE_STDLIB_H 1
/* Define to 1 if you have the `strftime' function. */
#define HAVE_STRFTIME 1
/* Define to 1 if you have the <strings.h> header file. */
#define HAVE_STRINGS_H 1
/* Define to 1 if you have the <string.h> header file. */
#define HAVE_STRING_H 1
/* Define to 1 if you have the <sys/dir.h> header file, and it defines `DIR'.
*/
/* #undef HAVE_SYS_DIR_H */
/* Define to 1 if you have the <sys/mman.h> header file. */
#define HAVE_SYS_MMAN_H 1
/* Define to 1 if you have the <sys/ndir.h> header file, and it defines `DIR'.
*/
/* #undef HAVE_SYS_NDIR_H */
/* Define to 1 if you have the <sys/select.h> header file. */
#define HAVE_SYS_SELECT_H 1
/* Define to 1 if you have the <sys/socket.h> header file. */
#define HAVE_SYS_SOCKET_H 1
/* Define to 1 if you have the <sys/stat.h> header file. */
#define HAVE_SYS_STAT_H 1
/* Define to 1 if you have the <sys/timeb.h> header file. */
#define HAVE_SYS_TIMEB_H 1
/* Define to 1 if you have the <sys/time.h> header file. */
#define HAVE_SYS_TIME_H 1
/* Define to 1 if you have the <sys/types.h> header file. */
#define HAVE_SYS_TYPES_H 1
/* Define to 1 if you have the `time' function. */
#define HAVE_TIME 1
/* Define to 1 if you have the <time.h> header file. */
#define HAVE_TIME_H 1
/* Define to 1 if you have the <unistd.h> header file. */
#define HAVE_UNISTD_H 1
/* Whether va_copy() is available */
#define HAVE_VA_COPY 1
/* Define to 1 if you have the `vfprintf' function. */
#define HAVE_VFPRINTF 1
/* Define to 1 if you have the `vsnprintf' function. */
#define HAVE_VSNPRINTF 1
/* Define to 1 if you have the `vsprintf' function. */
#define HAVE_VSPRINTF 1
/* Define to 1 if you have the <zlib.h> header file. */
/* #undef HAVE_ZLIB_H */
/* Whether __va_copy() is available */
/* #undef HAVE___VA_COPY */
/* Define as const if the declaration of iconv() needs const. */
#define ICONV_CONST
/* Define to the sub-directory where libtool stores uninstalled libraries. */
#define LT_OBJDIR ".libs/"
/* Name of package */
#define PACKAGE "libxml2"
/* Define to the address where bug reports for this package should be sent. */
#define PACKAGE_BUGREPORT ""
/* Define to the full name of this package. */
#define PACKAGE_NAME ""
/* Define to the full name and version of this package. */
#define PACKAGE_STRING ""
/* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME ""
/* Define to the home page for this package. */
#define PACKAGE_URL ""
/* Define to the version of this package. */
#define PACKAGE_VERSION ""
/* Type cast for the send() function 2nd arg */
#define SEND_ARG2_CAST /**/
/* Define to 1 if you have the ANSI C header files. */
#define STDC_HEADERS 1
/* Support for IPv6 */
#define SUPPORT_IP6 /**/
/* Define if va_list is an array type */
#define VA_LIST_IS_ARRAY 1
/* Version number of package */
#define VERSION "2.9.8"
/* Determine what socket length (socklen_t) data type is */
#define XML_SOCKLEN_T socklen_t
/* Define for Solaris 2.5.1 so the uint32_t typedef from <sys/synch.h>,
<pthread.h>, or <semaphore.h> is not used. If the typedef were allowed, the
#define below would cause a syntax error. */
/* #undef _UINT32_T */
/* ss_family is not defined here, use __ss_family instead */
/* #undef ss_family */
/* Define to the type of an unsigned integer type of width exactly 32 bits if
such a type exists and the standard includes do not define it. */
/* #undef uint32_t */

contrib/libxml2-cmake/linux_x86_64/include/libxml/xmlversion.h (new file)

@ -0,0 +1,481 @@
/*
* Summary: compile-time version information
* Description: compile-time version information for the XML library
*
* Copy: See Copyright for the status of this software.
*
* Author: Daniel Veillard
*/
#ifndef __XML_VERSION_H__
#define __XML_VERSION_H__
#include <libxml/xmlexports.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* use those to be sure nothing nasty will happen if
* your library and includes mismatch
*/
#ifndef LIBXML2_COMPILING_MSCCDEF
XMLPUBFUN void XMLCALL xmlCheckVersion(int version);
#endif /* LIBXML2_COMPILING_MSCCDEF */
/**
* LIBXML_DOTTED_VERSION:
*
* the version string like "1.2.3"
*/
#define LIBXML_DOTTED_VERSION "2.9.8"
/**
* LIBXML_VERSION:
*
* the version number: 1.2.3 value is 10203
*/
#define LIBXML_VERSION 20908
/**
* LIBXML_VERSION_STRING:
*
* the version number string, 1.2.3 value is "10203"
*/
#define LIBXML_VERSION_STRING "20908"
/**
* LIBXML_VERSION_EXTRA:
*
* extra version information, used to show a CVS compilation
*/
#define LIBXML_VERSION_EXTRA "-GITv2.9.9-rc2-1-g6fc04d71"
/**
* LIBXML_TEST_VERSION:
*
* Macro to check that the libxml version in use is compatible with
* the version the software has been compiled against
*/
#define LIBXML_TEST_VERSION xmlCheckVersion(20908);
#ifndef VMS
#if 0
/**
* WITH_TRIO:
*
* defined if the trio support need to be configured in
*/
#define WITH_TRIO
#else
/**
* WITHOUT_TRIO:
*
* defined if the trio support should not be configured in
*/
#define WITHOUT_TRIO
#endif
#else /* VMS */
/**
* WITH_TRIO:
*
* defined if the trio support need to be configured in
*/
#define WITH_TRIO 1
#endif /* VMS */
/**
* LIBXML_THREAD_ENABLED:
*
* Whether the thread support is configured in
*/
#define LIBXML_THREAD_ENABLED 1
/**
* LIBXML_THREAD_ALLOC_ENABLED:
*
* Whether the allocation hooks are per-thread
*/
#if 0
#define LIBXML_THREAD_ALLOC_ENABLED
#endif
/**
* LIBXML_TREE_ENABLED:
*
* Whether the DOM like tree manipulation API support is configured in
*/
#if 1
#define LIBXML_TREE_ENABLED
#endif
/**
* LIBXML_OUTPUT_ENABLED:
*
* Whether the serialization/saving support is configured in
*/
#if 1
#define LIBXML_OUTPUT_ENABLED
#endif
/**
* LIBXML_PUSH_ENABLED:
*
* Whether the push parsing interfaces are configured in
*/
#if 1
#define LIBXML_PUSH_ENABLED
#endif
/**
* LIBXML_READER_ENABLED:
*
* Whether the xmlReader parsing interface is configured in
*/
#if 1
#define LIBXML_READER_ENABLED
#endif
/**
* LIBXML_PATTERN_ENABLED:
*
* Whether the xmlPattern node selection interface is configured in
*/
#if 1
#define LIBXML_PATTERN_ENABLED
#endif
/**
* LIBXML_WRITER_ENABLED:
*
* Whether the xmlWriter saving interface is configured in
*/
#if 1
#define LIBXML_WRITER_ENABLED
#endif
/**
* LIBXML_SAX1_ENABLED:
*
* Whether the older SAX1 interface is configured in
*/
#if 1
#define LIBXML_SAX1_ENABLED
#endif
/**
* LIBXML_FTP_ENABLED:
*
* Whether the FTP support is configured in
*/
#if 1
#define LIBXML_FTP_ENABLED
#endif
/**
* LIBXML_HTTP_ENABLED:
*
* Whether the HTTP support is configured in
*/
#if 1
#define LIBXML_HTTP_ENABLED
#endif
/**
* LIBXML_VALID_ENABLED:
*
* Whether the DTD validation support is configured in
*/
#if 1
#define LIBXML_VALID_ENABLED
#endif
/**
* LIBXML_HTML_ENABLED:
*
* Whether the HTML support is configured in
*/
#if 1
#define LIBXML_HTML_ENABLED
#endif
/**
* LIBXML_LEGACY_ENABLED:
*
* Whether the deprecated APIs are compiled in for compatibility
*/
#if 1
#define LIBXML_LEGACY_ENABLED
#endif
/**
* LIBXML_C14N_ENABLED:
*
* Whether the Canonicalization support is configured in
*/
#if 1
#define LIBXML_C14N_ENABLED
#endif
/**
* LIBXML_CATALOG_ENABLED:
*
* Whether the Catalog support is configured in
*/
#if 1
#define LIBXML_CATALOG_ENABLED
#endif
/**
* LIBXML_DOCB_ENABLED:
*
* Whether the SGML Docbook support is configured in
*/
#if 1
#define LIBXML_DOCB_ENABLED
#endif
/**
* LIBXML_XPATH_ENABLED:
*
* Whether XPath is configured in
*/
#if 1
#define LIBXML_XPATH_ENABLED
#endif
/**
* LIBXML_XPTR_ENABLED:
*
* Whether XPointer is configured in
*/
#if 1
#define LIBXML_XPTR_ENABLED
#endif
/**
* LIBXML_XINCLUDE_ENABLED:
*
* Whether XInclude is configured in
*/
#if 1
#define LIBXML_XINCLUDE_ENABLED
#endif
/**
* LIBXML_ICONV_ENABLED:
*
* Whether iconv support is available
*/
#if 1
#define LIBXML_ICONV_ENABLED
#endif
/**
* LIBXML_ICU_ENABLED:
*
* Whether icu support is available
*/
#if 0
#define LIBXML_ICU_ENABLED
#endif
/**
* LIBXML_ISO8859X_ENABLED:
*
* Whether ISO-8859-* support is made available in case iconv is not
*/
#if 1
#define LIBXML_ISO8859X_ENABLED
#endif
/**
* LIBXML_DEBUG_ENABLED:
*
* Whether Debugging module is configured in
*/
#if 1
#define LIBXML_DEBUG_ENABLED
#endif
/**
* DEBUG_MEMORY_LOCATION:
*
* Whether the memory debugging is configured in
*/
#if 0
#define DEBUG_MEMORY_LOCATION
#endif
/**
* LIBXML_DEBUG_RUNTIME:
*
* Whether the runtime debugging is configured in
*/
#if 0
#define LIBXML_DEBUG_RUNTIME
#endif
/**
* LIBXML_UNICODE_ENABLED:
*
* Whether the Unicode related interfaces are compiled in
*/
#if 1
#define LIBXML_UNICODE_ENABLED
#endif
/**
* LIBXML_REGEXP_ENABLED:
*
* Whether the regular expressions interfaces are compiled in
*/
#if 1
#define LIBXML_REGEXP_ENABLED
#endif
/**
* LIBXML_AUTOMATA_ENABLED:
*
* Whether the automata interfaces are compiled in
*/
#if 1
#define LIBXML_AUTOMATA_ENABLED
#endif
/**
* LIBXML_EXPR_ENABLED:
*
* Whether the formal expressions interfaces are compiled in
*/
#if 1
#define LIBXML_EXPR_ENABLED
#endif
/**
* LIBXML_SCHEMAS_ENABLED:
*
* Whether the Schemas validation interfaces are compiled in
*/
#if 1
#define LIBXML_SCHEMAS_ENABLED
#endif
/**
* LIBXML_SCHEMATRON_ENABLED:
*
* Whether the Schematron validation interfaces are compiled in
*/
#if 1
#define LIBXML_SCHEMATRON_ENABLED
#endif
/**
* LIBXML_MODULES_ENABLED:
*
* Whether the module interfaces are compiled in
*/
#if 1
#define LIBXML_MODULES_ENABLED
/**
* LIBXML_MODULE_EXTENSION:
*
* the string suffix used by dynamic modules (usually shared libraries)
*/
#define LIBXML_MODULE_EXTENSION ".so"
#endif
/**
* LIBXML_ZLIB_ENABLED:
*
* Whether the Zlib support is compiled in
*/
#if 1
#define LIBXML_ZLIB_ENABLED
#endif
/**
* LIBXML_LZMA_ENABLED:
*
* Whether the Lzma support is compiled in
*/
#if 0
#define LIBXML_LZMA_ENABLED
#endif
#ifdef __GNUC__
/**
* ATTRIBUTE_UNUSED:
*
* Macro used to signal to GCC unused function parameters
*/
#ifndef ATTRIBUTE_UNUSED
# if ((__GNUC__ > 2) || ((__GNUC__ == 2) && (__GNUC_MINOR__ >= 7)))
# define ATTRIBUTE_UNUSED __attribute__((unused))
# else
# define ATTRIBUTE_UNUSED
# endif
#endif
/**
* LIBXML_ATTR_ALLOC_SIZE:
*
* Macro used to indicate to GCC this is an allocator function
*/
#ifndef LIBXML_ATTR_ALLOC_SIZE
# if (!defined(__clang__) && ((__GNUC__ > 4) || ((__GNUC__ == 4) && (__GNUC_MINOR__ >= 3))))
# define LIBXML_ATTR_ALLOC_SIZE(x) __attribute__((alloc_size(x)))
# else
# define LIBXML_ATTR_ALLOC_SIZE(x)
# endif
#else
# define LIBXML_ATTR_ALLOC_SIZE(x)
#endif
/**
* LIBXML_ATTR_FORMAT:
*
 * Macro used to indicate to GCC that the parameters are printf-like
*/
#ifndef LIBXML_ATTR_FORMAT
# if ((__GNUC__ > 3) || ((__GNUC__ == 3) && (__GNUC_MINOR__ >= 3)))
# define LIBXML_ATTR_FORMAT(fmt,args) __attribute__((__format__(__printf__,fmt,args)))
# else
# define LIBXML_ATTR_FORMAT(fmt,args)
# endif
#else
# define LIBXML_ATTR_FORMAT(fmt,args)
#endif
#else /* ! __GNUC__ */
/**
* ATTRIBUTE_UNUSED:
*
* Macro used to signal to GCC unused function parameters
*/
#define ATTRIBUTE_UNUSED
/**
* LIBXML_ATTR_ALLOC_SIZE:
*
* Macro used to indicate to GCC this is an allocator function
*/
#define LIBXML_ATTR_ALLOC_SIZE(x)
/**
* LIBXML_ATTR_FORMAT:
*
 * Macro used to indicate to GCC that the parameters are printf-like
*/
#define LIBXML_ATTR_FORMAT(fmt,args)
#endif /* __GNUC__ */
#ifdef __cplusplus
}
#endif /* __cplusplus */
#endif
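For context, a minimal sketch of how a consumer of this header is expected to guard against header/library skew; the snippet is illustrative and uses only the macros declared above.

```cpp
#include <libxml/xmlversion.h>
#include <libxml/parser.h>

int main()
{
    /// Expands to xmlCheckVersion(20908); reports an error through the
    /// libxml2 error handler if the runtime library is incompatible with
    /// the headers the program was compiled against.
    LIBXML_TEST_VERSION;

    /// ... use libxml2 safely from here ...

    xmlCleanupParser();   /// release global parser state at exit
    return 0;
}
```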

2
contrib/poco vendored

@ -1 +1 @@
Subproject commit 20c1d877773b6a672f1bbfe3290dfea42a117ed5
Subproject commit fe5505e56c27b6ecb0dcbc40c49dc2caf4e9637f

1
contrib/protobuf vendored Submodule

@ -0,0 +1 @@
Subproject commit 12735370922a35f03999afff478e1c6d7aa917a4

View File

@ -264,6 +264,11 @@ target_link_libraries(dbms PRIVATE ${OPENSSL_CRYPTO_LIBRARY} Threads::Threads)
target_include_directories (dbms SYSTEM BEFORE PRIVATE ${DIVIDE_INCLUDE_DIR})
target_include_directories (dbms SYSTEM BEFORE PRIVATE ${SPARCEHASH_INCLUDE_DIR})
if (USE_HDFS)
target_link_libraries (dbms PRIVATE ${HDFS3_LIBRARY})
target_include_directories (dbms SYSTEM BEFORE PRIVATE ${HDFS3_INCLUDE_DIR})
endif()
if (NOT USE_INTERNAL_LZ4_LIBRARY)
target_include_directories (dbms SYSTEM BEFORE PRIVATE ${LZ4_INCLUDE_DIR})
endif ()

View File

@ -36,6 +36,7 @@ public:
const ColumnPtr & getNestedColumn() const override;
const ColumnPtr & getNestedNotNullableColumn() const override { return column_holder; }
bool nestedColumnIsNullable() const override { return is_nullable; }
size_t uniqueInsert(const Field & x) override;
size_t uniqueInsertFrom(const IColumn & src, size_t n) override;

View File

@ -18,6 +18,8 @@ public:
/// The same as getNestedColumn, but removes null map if nested column is nullable.
virtual const ColumnPtr & getNestedNotNullableColumn() const = 0;
virtual bool nestedColumnIsNullable() const = 0;
/// Returns an array with StringRefHash calculated for each row of the getNestedNotNullableColumn() column.
/// Returns nullptr if the nested column doesn't contain strings. Otherwise calculates the hash (if it wasn't calculated yet).
/// Uses a thread-safe cache.

View File

@ -10,16 +10,17 @@ template
typename Cell,
typename Hash = DefaultHash<Key>,
typename Grower = TwoLevelHashTableGrower<>,
typename Allocator = HashTableAllocator
typename Allocator = HashTableAllocator,
template <typename ...> typename ImplTable = HashMapTable
>
class TwoLevelHashMapTable : public TwoLevelHashTable<Key, Cell, Hash, Grower, Allocator, HashMapTable<Key, Cell, Hash, Grower, Allocator>>
class TwoLevelHashMapTable : public TwoLevelHashTable<Key, Cell, Hash, Grower, Allocator, ImplTable<Key, Cell, Hash, Grower, Allocator>>
{
public:
using key_type = Key;
using mapped_type = typename Cell::Mapped;
using value_type = typename Cell::value_type;
using TwoLevelHashTable<Key, Cell, Hash, Grower, Allocator, HashMapTable<Key, Cell, Hash, Grower, Allocator>>::TwoLevelHashTable;
using TwoLevelHashTable<Key, Cell, Hash, Grower, Allocator, ImplTable<Key, Cell, Hash, Grower, Allocator>>::TwoLevelHashTable;
mapped_type & ALWAYS_INLINE operator[](Key x)
{
@ -41,9 +42,10 @@ template
typename Mapped,
typename Hash = DefaultHash<Key>,
typename Grower = TwoLevelHashTableGrower<>,
typename Allocator = HashTableAllocator
typename Allocator = HashTableAllocator,
template <typename ...> typename ImplTable = HashMapTable
>
using TwoLevelHashMap = TwoLevelHashMapTable<Key, HashMapCell<Key, Mapped, Hash>, Hash, Grower, Allocator>;
using TwoLevelHashMap = TwoLevelHashMapTable<Key, HashMapCell<Key, Mapped, Hash>, Hash, Grower, Allocator, ImplTable>;
template
@ -52,6 +54,7 @@ template
typename Mapped,
typename Hash = DefaultHash<Key>,
typename Grower = TwoLevelHashTableGrower<>,
typename Allocator = HashTableAllocator
typename Allocator = HashTableAllocator,
template <typename ...> typename ImplTable = HashMapTable
>
using TwoLevelHashMapWithSavedHash = TwoLevelHashMapTable<Key, HashMapCellWithSavedHash<Key, Mapped, Hash>, Hash, Grower, Allocator>;
using TwoLevelHashMapWithSavedHash = TwoLevelHashMapTable<Key, HashMapCellWithSavedHash<Key, Mapped, Hash>, Hash, Grower, Allocator, ImplTable>;
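The new ImplTable parameter lets the two-level table be assembled from a customized single-level map. A sketch of the intended instantiation, mirroring the aliases this commit adds to Aggregator.h:

```cpp
/// A single-level map that additionally stores data for a NULL key ...
template <typename... Types>
using HashTableWithNullKey = AggregationDataWithNullKey<HashMapTable<Types...>>;

/// ... plugged into the two-level wrapper through the new ImplTable slot.
using AggregatedDataWithNullableUInt64KeyTwoLevel = AggregationDataWithNullKeyTwoLevel<
    TwoLevelHashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>,
        TwoLevelHashTableGrower<>, HashTableAllocator, HashTableWithNullKey>>;
```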

View File

@ -16,3 +16,4 @@
#cmakedefine01 USE_POCO_NETSSL
#cmakedefine01 CLICKHOUSE_SPLIT_BINARY
#cmakedefine01 USE_BASE64
#cmakedefine01 USE_HDFS

View File

@ -154,7 +154,10 @@ Block NativeBlockInputStream::readImpl()
column.column = std::move(read_column);
if (server_revision && server_revision < DBMS_MIN_REVISION_WITH_LOW_CARDINALITY_TYPE)
{
column.column = recursiveLowCardinalityConversion(column.column, column.type, header.getByPosition(i).type);
column.type = header.getByPosition(i).type;
}
res.insert(std::move(column));

View File

@ -0,0 +1,96 @@
#pragma once
#include <Common/config.h>
#if USE_HDFS
#include <IO/ReadBuffer.h>
#include <Poco/URI.h>
#include <hdfs/hdfs.h>
#include <IO/BufferWithOwnMemory.h>
#ifndef O_DIRECT
#define O_DIRECT 00040000
#endif
namespace DB
{
namespace ErrorCodes
{
extern const int BAD_ARGUMENTS;
extern const int NETWORK_ERROR;
}
/** Accepts an HDFS URI and opens the corresponding file.
  * Closes the file by itself (thus "owns" the file handle).
  */
class ReadBufferFromHDFS : public BufferWithOwnMemory<ReadBuffer>
{
protected:
std::string hdfs_uri;
struct hdfsBuilder *builder;
hdfsFS fs;
hdfsFile fin;
public:
ReadBufferFromHDFS(const std::string & hdfs_name_, size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE)
: BufferWithOwnMemory<ReadBuffer>(buf_size), hdfs_uri(hdfs_name_) , builder(hdfsNewBuilder())
{
Poco::URI uri(hdfs_name_);
auto & host = uri.getHost();
auto port = uri.getPort();
auto & path = uri.getPath();
if (host.empty() || port == 0 || path.empty())
{
throw Exception("Illegal HDFS URI: " + hdfs_uri, ErrorCodes::BAD_ARGUMENTS);
}
/// Set the read/connect timeouts; the default value in libhdfs3 is about 1 hour, which is too large.
/// TODO: allow tuning them from query Settings.
hdfsBuilderConfSetStr(builder, "input.read.timeout", "60000"); // 1 min
hdfsBuilderConfSetStr(builder, "input.connect.timeout", "60000"); // 1 min
hdfsBuilderSetNameNode(builder, host.c_str());
hdfsBuilderSetNameNodePort(builder, port);
fs = hdfsBuilderConnect(builder);
if (fs == nullptr)
{
throw Exception("Unable to connect to HDFS: " + String(hdfsGetLastError()), ErrorCodes::NETWORK_ERROR);
}
fin = hdfsOpenFile(fs, path.c_str(), O_RDONLY, 0, 0, 0);
}
ReadBufferFromHDFS(ReadBufferFromHDFS &&) = default;
~ReadBufferFromHDFS() override
{
close();
hdfsFreeBuilder(builder);
}
/// Close the HDFS file before destruction of the object.
void close()
{
hdfsCloseFile(fs, fin);
}
bool nextImpl() override
{
int bytes_read = hdfsRead(fs, fin, internal_buffer.begin(), internal_buffer.size());
if (bytes_read < 0)
{
throw Exception("Fail to read HDFS file: " + hdfs_uri + " " + String(hdfsGetLastError()), ErrorCodes::NETWORK_ERROR);
}
if (bytes_read)
working_buffer.resize(bytes_read);
else
return false;
return true;
}
const std::string & getHDFSUri() const
{
return hdfs_uri;
}
};
}
#endif
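A minimal usage sketch for the buffer above; the namenode address and file path are illustrative, and it assumes the standard ReadBuffer helpers work on top of it.

```cpp
#include <IO/ReadBufferFromHDFS.h>
#include <IO/ReadHelpers.h>

/// Hypothetical example: parse one tab-separated row from an HDFS file.
void readOneRow()
{
    DB::ReadBufferFromHDFS in("hdfs://namenode:9000/data/sample.tsv");

    DB::UInt64 id;
    DB::String name;
    DB::readIntText(id, in);           /// numeric column
    DB::assertChar('\t', in);          /// TSV separator
    DB::readEscapedString(name, in);   /// string column
}
```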

View File

@ -453,6 +453,27 @@ AggregatedDataVariants::Type Aggregator::chooseAggregationMethod()
return AggregatedDataVariants::Type::nullable_keys256;
}
if (has_low_cardinality && params.keys_size == 1)
{
if (types_removed_nullable[0]->isValueRepresentedByNumber())
{
size_t size_of_field = types_removed_nullable[0]->getSizeOfValueInMemory();
if (size_of_field == 1)
return AggregatedDataVariants::Type::low_cardinality_key8;
if (size_of_field == 2)
return AggregatedDataVariants::Type::low_cardinality_key16;
if (size_of_field == 4)
return AggregatedDataVariants::Type::low_cardinality_key32;
if (size_of_field == 8)
return AggregatedDataVariants::Type::low_cardinality_key64;
}
else if (isString(types_removed_nullable[0]))
return AggregatedDataVariants::Type::low_cardinality_key_string;
else if (isFixedString(types_removed_nullable[0]))
return AggregatedDataVariants::Type::low_cardinality_key_fixed_string;
}
/// Fallback case.
return AggregatedDataVariants::Type::serialized;
}
@ -1139,12 +1160,10 @@ void Aggregator::convertToBlockImpl(
convertToBlockImplFinal(method, data, key_columns, final_aggregate_columns);
else
convertToBlockImplNotFinal(method, data, key_columns, aggregate_columns);
/// In order to release memory early.
data.clearAndShrink();
}
template <typename Method, typename Table>
void NO_INLINE Aggregator::convertToBlockImplFinal(
Method & method,
@ -1152,6 +1171,19 @@ void NO_INLINE Aggregator::convertToBlockImplFinal(
MutableColumns & key_columns,
MutableColumns & final_aggregate_columns) const
{
if constexpr (Method::low_cardinality_optimization)
{
if (data.hasNullKeyData())
{
key_columns[0]->insert(Field()); /// Null
for (size_t i = 0; i < params.aggregates_size; ++i)
aggregate_functions[i]->insertResultInto(
data.getNullKeyData() + offsets_of_aggregate_states[i],
*final_aggregate_columns[i]);
}
}
for (const auto & value : data)
{
method.insertKeyIntoColumns(value, key_columns, key_sizes);
@ -1172,6 +1204,17 @@ void NO_INLINE Aggregator::convertToBlockImplNotFinal(
MutableColumns & key_columns,
AggregateColumnsData & aggregate_columns) const
{
if constexpr (Method::low_cardinality_optimization)
{
if (data.hasNullKeyData())
{
key_columns[0]->insert(Field()); /// Null
for (size_t i = 0; i < params.aggregates_size; ++i)
aggregate_columns[i]->push_back(data.getNullKeyData() + offsets_of_aggregate_states[i]);
}
}
for (auto & value : data)
{
method.insertKeyIntoColumns(value, key_columns, key_sizes);
@ -1470,12 +1513,50 @@ BlocksList Aggregator::convertToBlocks(AggregatedDataVariants & data_variants, b
}
template <typename Method, typename Table>
void NO_INLINE Aggregator::mergeDataNullKey(
Table & table_dst,
Table & table_src,
Arena * arena) const
{
if constexpr (Method::low_cardinality_optimization)
{
if (table_src.hasNullKeyData())
{
if (!table_dst.hasNullKeyData())
{
table_dst.hasNullKeyData() = true;
table_dst.getNullKeyData() = table_src.getNullKeyData();
}
else
{
for (size_t i = 0; i < params.aggregates_size; ++i)
aggregate_functions[i]->merge(
table_dst.getNullKeyData() + offsets_of_aggregate_states[i],
table_src.getNullKeyData() + offsets_of_aggregate_states[i],
arena);
for (size_t i = 0; i < params.aggregates_size; ++i)
aggregate_functions[i]->destroy(
table_src.getNullKeyData() + offsets_of_aggregate_states[i]);
}
table_src.hasNullKeyData() = false;
table_src.getNullKeyData() = nullptr;
}
}
}
template <typename Method, typename Table>
void NO_INLINE Aggregator::mergeDataImpl(
Table & table_dst,
Table & table_src,
Arena * arena) const
{
if constexpr (Method::low_cardinality_optimization)
mergeDataNullKey<Method, Table>(table_dst, table_src, arena);
for (auto it = table_src.begin(), end = table_src.end(); it != end; ++it)
{
typename Table::iterator res_it;
@ -1513,6 +1594,10 @@ void NO_INLINE Aggregator::mergeDataNoMoreKeysImpl(
Table & table_src,
Arena * arena) const
{
/// Note: this will create data for the NULL key if it does not exist.
if constexpr (Method::low_cardinality_optimization)
mergeDataNullKey<Method, Table>(table_dst, table_src, arena);
for (auto it = table_src.begin(), end = table_src.end(); it != end; ++it)
{
typename Table::iterator res_it = table_dst.find(it->first, it.getHash());
@ -1543,6 +1628,10 @@ void NO_INLINE Aggregator::mergeDataOnlyExistingKeysImpl(
Table & table_src,
Arena * arena) const
{
/// Note: this will create data for the NULL key if it does not exist.
if constexpr (Method::low_cardinality_optimization)
mergeDataNullKey<Method, Table>(table_dst, table_src, arena);
for (auto it = table_src.begin(); it != table_src.end(); ++it)
{
decltype(it) res_it = table_dst.find(it->first, it.getHash());
@ -2341,6 +2430,15 @@ void NO_INLINE Aggregator::convertBlockToTwoLevelImpl(
/// For every row.
for (size_t i = 0; i < rows; ++i)
{
if constexpr (Method::low_cardinality_optimization)
{
if (state.isNullAt(i))
{
selector[i] = 0;
continue;
}
}
/// Obtain a key. Calculate bucket number from it.
typename Method::Key key = state.getKey(key_columns, params.keys_size, i, key_sizes, keys, *pool);

View File

@ -88,6 +88,56 @@ using AggregatedDataWithStringKeyHash64 = HashMapWithSavedHash<StringRef, Aggreg
using AggregatedDataWithKeys128Hash64 = HashMap<UInt128, AggregateDataPtr, UInt128Hash>;
using AggregatedDataWithKeys256Hash64 = HashMap<UInt256, AggregateDataPtr, UInt256Hash>;
template <typename Base>
struct AggregationDataWithNullKey : public Base
{
using Base::Base;
bool & hasNullKeyData() { return has_null_key_data; }
AggregateDataPtr & getNullKeyData() { return null_key_data; }
bool hasNullKeyData() const { return has_null_key_data; }
const AggregateDataPtr & getNullKeyData() const { return null_key_data; }
private:
bool has_null_key_data = false;
AggregateDataPtr null_key_data = nullptr;
};
template <typename Base>
struct AggregationDataWithNullKeyTwoLevel : public Base
{
using Base::Base;
using Base::impls;
template <typename Other>
explicit AggregationDataWithNullKeyTwoLevel(const Other & other) : Base(other)
{
impls[0].hasNullKeyData() = other.hasNullKeyData();
impls[0].getNullKeyData() = other.getNullKeyData();
}
bool & hasNullKeyData() { return impls[0].hasNullKeyData(); }
AggregateDataPtr & getNullKeyData() { return impls[0].getNullKeyData(); }
bool hasNullKeyData() const { return impls[0].hasNullKeyData(); }
const AggregateDataPtr & getNullKeyData() const { return impls[0].getNullKeyData(); }
};
template <typename ... Types>
using HashTableWithNullKey = AggregationDataWithNullKey<HashMapTable<Types ...>>;
using AggregatedDataWithNullableUInt8Key = AggregationDataWithNullKey<AggregatedDataWithUInt8Key>;
using AggregatedDataWithNullableUInt16Key = AggregationDataWithNullKey<AggregatedDataWithUInt16Key>;
using AggregatedDataWithNullableUInt64Key = AggregationDataWithNullKey<AggregatedDataWithUInt64Key>;
using AggregatedDataWithNullableStringKey = AggregationDataWithNullKey<AggregatedDataWithStringKey>;
using AggregatedDataWithNullableUInt64KeyTwoLevel = AggregationDataWithNullKeyTwoLevel<
TwoLevelHashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>,
TwoLevelHashTableGrower<>, HashTableAllocator, HashTableWithNullKey>>;
using AggregatedDataWithNullableStringKeyTwoLevel = AggregationDataWithNullKeyTwoLevel<
TwoLevelHashMapWithSavedHash<StringRef, AggregateDataPtr, DefaultHash<StringRef>,
TwoLevelHashTableGrower<>, HashTableAllocator, HashTableWithNullKey>>;
/// Cache that can be used by aggregation method states. The object is shared between all threads.
struct AggregationStateCache
{
@ -403,8 +453,10 @@ struct AggregationMethodSingleLowCardinalityColumn : public SingleColumnMethod
ColumnPtr dictionary_holder;
/// Cache AggregateDataPtr for the current column in order to decrease the number of hash table lookups.
PaddedPODArray<AggregateDataPtr> aggregate_data;
PaddedPODArray<AggregateDataPtr> * aggregate_data_cache;
PaddedPODArray<AggregateDataPtr> aggregate_data_cache;
/// Whether the initialized column is nullable.
bool is_nullable = false;
void init(ColumnRawPtrs &)
{
@ -429,7 +481,8 @@ struct AggregationMethodSingleLowCardinalityColumn : public SingleColumnMethod
+ demangle(typeid(cached_val).name()), ErrorCodes::LOGICAL_ERROR);
}
auto * dict = column->getDictionary().getNestedColumn().get();
auto * dict = column->getDictionary().getNestedNotNullableColumn().get();
is_nullable = column->getDictionary().nestedColumnIsNullable();
key = {dict};
bool is_shared_dict = column->isSharedDictionary();
@ -463,8 +516,7 @@ struct AggregationMethodSingleLowCardinalityColumn : public SingleColumnMethod
}
AggregateDataPtr default_data = nullptr;
aggregate_data.assign(key[0]->size(), default_data);
aggregate_data_cache = &aggregate_data;
aggregate_data_cache.assign(key[0]->size(), default_data);
size_of_index_type = column->getSizeOfIndexType();
positions = column->getIndexesPtr().get();
@ -507,10 +559,18 @@ struct AggregationMethodSingleLowCardinalityColumn : public SingleColumnMethod
Arena & pool)
{
size_t row = getIndexAt(i);
if ((*aggregate_data_cache)[row])
if (is_nullable && row == 0)
{
inserted = !data.hasNullKeyData();
data.hasNullKeyData() = true;
return &data.getNullKeyData();
}
if (aggregate_data_cache[row])
{
inserted = false;
return &(*aggregate_data_cache)[row];
return &aggregate_data_cache[row];
}
else
{
@ -527,23 +587,35 @@ struct AggregationMethodSingleLowCardinalityColumn : public SingleColumnMethod
if (inserted)
Base::onNewKey(*it, keys_size, keys, pool);
else
(*aggregate_data_cache)[row] = Base::getAggregateData(it->second);
aggregate_data_cache[row] = Base::getAggregateData(it->second);
return &Base::getAggregateData(it->second);
}
}
ALWAYS_INLINE bool isNullAt(size_t i)
{
if (!is_nullable)
return false;
return getIndexAt(i) == 0;
}
ALWAYS_INLINE void cacheAggregateData(size_t i, AggregateDataPtr data)
{
size_t row = getIndexAt(i);
(*aggregate_data_cache)[row] = data;
aggregate_data_cache[row] = data;
}
template <typename D>
ALWAYS_INLINE AggregateDataPtr * findFromRow(D & data, size_t i)
{
size_t row = getIndexAt(i);
if (!(*aggregate_data_cache)[row])
if (is_nullable && row == 0)
return data.hasNullKeyData() ? &data.getNullKeyData() : nullptr;
if (!aggregate_data_cache[row])
{
ColumnRawPtrs key_columns;
Sizes key_sizes;
@ -558,9 +630,9 @@ struct AggregationMethodSingleLowCardinalityColumn : public SingleColumnMethod
it = data.find(key);
if (it != data.end())
(*aggregate_data_cache)[row] = Base::getAggregateData(it->second);
aggregate_data_cache[row] = Base::getAggregateData(it->second);
}
return &(*aggregate_data_cache)[row];
return &aggregate_data_cache[row];
}
};
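The row == 0 checks above rely on a LowCardinality invariant: when the dictionary is nullable, position 0 of the nested not-nullable dictionary is the placeholder that represents NULL. An illustrative sketch (not engine code):

```cpp
#include <cstddef>
#include <vector>

/// dictionary (nested, not nullable): ["", "foo", "bar"]  -> index 0 stands for NULL
/// positions:                         [1, 0, 2]           -> "foo", NULL, "bar"
bool rowIsNull(const std::vector<size_t> & positions, size_t i, bool is_nullable)
{
    return is_nullable && positions[i] == 0;
}
```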
@ -971,17 +1043,17 @@ struct AggregatedDataVariants : private boost::noncopyable
std::unique_ptr<AggregationMethodKeysFixed<AggregatedDataWithKeys256TwoLevel, true>> nullable_keys256_two_level;
/// Support for low cardinality.
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt8, AggregatedDataWithUInt8Key>>> low_cardinality_key8;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt16, AggregatedDataWithUInt16Key>>> low_cardinality_key16;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt32, AggregatedDataWithUInt64Key>>> low_cardinality_key32;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt64, AggregatedDataWithUInt64Key>>> low_cardinality_key64;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodString<AggregatedDataWithStringKey>>> low_cardinality_key_string;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodFixedString<AggregatedDataWithStringKey>>> low_cardinality_key_fixed_string;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt8, AggregatedDataWithNullableUInt8Key>>> low_cardinality_key8;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt16, AggregatedDataWithNullableUInt16Key>>> low_cardinality_key16;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt32, AggregatedDataWithNullableUInt64Key>>> low_cardinality_key32;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt64, AggregatedDataWithNullableUInt64Key>>> low_cardinality_key64;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodString<AggregatedDataWithNullableStringKey>>> low_cardinality_key_string;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodFixedString<AggregatedDataWithNullableStringKey>>> low_cardinality_key_fixed_string;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt32, AggregatedDataWithUInt64KeyTwoLevel>>> low_cardinality_key32_two_level;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt64, AggregatedDataWithUInt64KeyTwoLevel>>> low_cardinality_key64_two_level;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodString<AggregatedDataWithStringKeyTwoLevel>>> low_cardinality_key_string_two_level;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodFixedString<AggregatedDataWithStringKeyTwoLevel>>> low_cardinality_key_fixed_string_two_level;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt32, AggregatedDataWithNullableUInt64KeyTwoLevel>>> low_cardinality_key32_two_level;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodOneNumber<UInt64, AggregatedDataWithNullableUInt64KeyTwoLevel>>> low_cardinality_key64_two_level;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodString<AggregatedDataWithNullableStringKeyTwoLevel>>> low_cardinality_key_string_two_level;
std::unique_ptr<AggregationMethodSingleLowCardinalityColumn<AggregationMethodFixedString<AggregatedDataWithNullableStringKeyTwoLevel>>> low_cardinality_key_fixed_string_two_level;
std::unique_ptr<AggregationMethodKeysFixed<AggregatedDataWithKeys128, false, true>> low_cardinality_keys128;
std::unique_ptr<AggregationMethodKeysFixed<AggregatedDataWithKeys256, false, true>> low_cardinality_keys256;
@ -1580,6 +1652,13 @@ public:
Arena * arena) const;
protected:
/// Merge NULL key data from hash table `src` into `dst`.
template <typename Method, typename Table>
void mergeDataNullKey(
Table & table_dst,
Table & table_src,
Arena * arena) const;
/// Merge data from hash table `src` into `dst`.
template <typename Method, typename Table>
void mergeDataImpl(

View File

@ -590,7 +590,7 @@ void ExpressionActions::checkLimits(Block & block) const
{
std::stringstream list_of_non_const_columns;
for (size_t i = 0, size = block.columns(); i < size; ++i)
if (!block.safeGetByPosition(i).column->isColumnConst())
if (block.safeGetByPosition(i).column && !block.safeGetByPosition(i).column->isColumnConst())
list_of_non_const_columns << "\n" << block.safeGetByPosition(i).name;
throw Exception("Too many temporary non-const columns:" + list_of_non_const_columns.str()

View File

@ -0,0 +1,179 @@
#include <Common/config.h>
#if USE_HDFS
#include <Storages/StorageFactory.h>
#include <Storages/StorageHDFS.h>
#include <Interpreters/Context.h>
#include <Interpreters/evaluateConstantExpression.h>
#include <Parsers/ASTIdentifier.h>
#include <Parsers/ASTLiteral.h>
#include <IO/ReadBufferFromHDFS.h>
#include <Formats/FormatFactory.h>
#include <DataStreams/IBlockOutputStream.h>
#include <DataStreams/UnionBlockInputStream.h>
#include <DataStreams/IProfilingBlockInputStream.h>
#include <DataStreams/OwningBlockInputStream.h>
#include <Poco/Path.h>
#include <TableFunctions/parseRemoteDescription.h>
#include <Common/typeid_cast.h>
namespace DB
{
namespace ErrorCodes
{
extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH;
extern const int NOT_IMPLEMENTED;
extern const int BAD_ARGUMENTS;
}
StorageHDFS::StorageHDFS(const String & uri_,
const std::string & table_name_,
const String & format_name_,
const ColumnsDescription & columns_,
Context &)
: IStorage(columns_), uri(uri_), format_name(format_name_), table_name(table_name_)
{
}
namespace
{
class StorageHDFSBlockInputStream : public IProfilingBlockInputStream
{
public:
StorageHDFSBlockInputStream(const String & uri,
const String & format,
const String & name_,
const Block & sample_block,
const Context & context,
size_t max_block_size)
: name(name_)
{
/// Assume there is no query string or fragment in the URI. TODO: add a sanity check.
String fuzzy_file_names;
String uri_prefix = uri.substr(0, uri.find_last_of('/'));
if (uri_prefix.length() == uri.length())
{
fuzzy_file_names = uri;
uri_prefix.clear();
}
else
{
uri_prefix += "/";
fuzzy_file_names = uri.substr(uri_prefix.length());
}
std::vector<String> fuzzy_name_list = parseRemoteDescription(fuzzy_file_names, 0, fuzzy_file_names.length(), ',', 100 /* hard-coded max files */);
BlockInputStreams inputs;
for (auto & name : fuzzy_name_list)
{
std::unique_ptr<ReadBuffer> read_buf = std::make_unique<ReadBufferFromHDFS>(uri_prefix + name);
inputs.emplace_back(
std::make_shared<OwningBlockInputStream<ReadBuffer>>(
FormatFactory::instance().getInput(format, *read_buf, sample_block, context, max_block_size),
std::move(read_buf)));
}
if (inputs.empty())
throw Exception("StorageHDFS: no files to read after expanding URI", ErrorCodes::BAD_ARGUMENTS);
if (inputs.size() == 1)
{
reader = inputs[0];
}
else
{
reader = std::make_shared<UnionBlockInputStream>(inputs, nullptr, context.getSettingsRef().max_distributed_connections);
}
}
String getName() const override
{
return name;
}
Block readImpl() override
{
return reader->read();
}
Block getHeader() const override
{
return reader->getHeader();
}
void readPrefixImpl() override
{
reader->readPrefix();
}
void readSuffixImpl() override
{
auto * union_reader = dynamic_cast<UnionBlockInputStream *>(reader.get());
if (union_reader)
union_reader->cancel(false); /// Skip the UnionBlockInputStream readSuffix assertion.
reader->readSuffix();
}
private:
String name;
BlockInputStreamPtr reader;
};
}
BlockInputStreams StorageHDFS::read(
const Names & /*column_names*/,
const SelectQueryInfo & /*query_info*/,
const Context & context,
QueryProcessingStage::Enum /*processed_stage*/,
size_t max_block_size,
unsigned /*num_streams*/)
{
return {std::make_shared<StorageHDFSBlockInputStream>(
uri,
format_name,
getName(),
getSampleBlock(),
context,
max_block_size)};
}
void StorageHDFS::rename(const String & /*new_path_to_db*/, const String & /*new_database_name*/, const String & /*new_table_name*/) {}
BlockOutputStreamPtr StorageHDFS::write(const ASTPtr & /*query*/, const Settings & /*settings*/)
{
throw Exception("StorageHDFS write is not supported yet", ErrorCodes::NOT_IMPLEMENTED);
return {};
}
void registerStorageHDFS(StorageFactory & factory)
{
factory.registerStorage("HDFS", [](const StorageFactory::Arguments & args)
{
ASTs & engine_args = args.engine_args;
if (engine_args.size() != 2)
throw Exception(
"Storage HDFS requires exactly 2 arguments: the URL and the name of the format to use.", ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH);
engine_args[0] = evaluateConstantExpressionOrIdentifierAsLiteral(engine_args[0], args.local_context);
String url = static_cast<const ASTLiteral &>(*engine_args[0]).value.safeGet<String>();
engine_args[1] = evaluateConstantExpressionOrIdentifierAsLiteral(engine_args[1], args.local_context);
String format_name = static_cast<const ASTLiteral &>(*engine_args[1]).value.safeGet<String>();
return StorageHDFS::create(url, args.table_name, format_name, args.columns, args.context);
});
}
}
#endif
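To make the fuzzy-name expansion in the input stream concrete, here is a hedged sketch of how a wildcard URI is split; the host, port, and file names are illustrative, and the helper is simplified relative to the constructor above.

```cpp
#include <TableFunctions/parseRemoteDescription.h>
#include <Core/Types.h>

/// E.g. "hdfs://hdfs1:9000/dir/file{1..2}.tsv" ->
/// {"hdfs://hdfs1:9000/dir/file1.tsv", "hdfs://hdfs1:9000/dir/file2.tsv"}.
std::vector<DB::String> expandUri(const DB::String & uri)
{
    const DB::String uri_prefix = uri.substr(0, uri.find_last_of('/') + 1);
    const DB::String fuzzy = uri.substr(uri_prefix.length());
    auto names = DB::parseRemoteDescription(fuzzy, 0, fuzzy.size(), ',', 100);
    for (auto & name : names)
        name = uri_prefix + name;   /// one ReadBufferFromHDFS is opened per name
    return names;
}
```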

View File

@ -0,0 +1,56 @@
#pragma once
#include <Common/config.h>
#if USE_HDFS
#include <Storages/IStorage.h>
#include <Poco/URI.h>
#include <common/logger_useful.h>
#include <ext/shared_ptr_helper.h>
namespace DB
{
/**
 * This class implements the table engine for external HDFS files.
 * Only the read method is supported for now.
 */
class StorageHDFS : public ext::shared_ptr_helper<StorageHDFS>, public IStorage
{
public:
String getName() const override
{
return "HDFS";
}
String getTableName() const override
{
return table_name;
}
BlockInputStreams read(const Names & column_names,
const SelectQueryInfo & query_info,
const Context & context,
QueryProcessingStage::Enum processed_stage,
size_t max_block_size,
unsigned num_streams) override;
BlockOutputStreamPtr write(const ASTPtr & query, const Settings & settings) override;
void rename(const String & new_path_to_db, const String & new_database_name, const String & new_table_name) override;
protected:
StorageHDFS(const String & uri_,
const String & table_name_,
const String & format_name_,
const ColumnsDescription & columns_,
Context & context_);
private:
String uri;
String format_name;
String table_name;
Logger * log = &Logger::get("StorageHDFS");
};
}
#endif

View File

@ -24,6 +24,10 @@ void registerStorageJoin(StorageFactory & factory);
void registerStorageView(StorageFactory & factory);
void registerStorageMaterializedView(StorageFactory & factory);
#if USE_HDFS
void registerStorageHDFS(StorageFactory & factory);
#endif
#if USE_POCO_SQLODBC || USE_POCO_DATAODBC
void registerStorageODBC(StorageFactory & factory);
#endif
@ -60,6 +64,10 @@ void registerStorages()
registerStorageView(factory);
registerStorageMaterializedView(factory);
#if USE_HDFS
registerStorageHDFS(factory);
#endif
#if USE_POCO_SQLODBC || USE_POCO_DATAODBC
registerStorageODBC(factory);
#endif

View File

@ -3,7 +3,6 @@
#include <string>
#include <memory>
namespace DB
{

View File

@ -0,0 +1,25 @@
#include <Common/config.h>
#if USE_HDFS
#include <Storages/StorageHDFS.h>
#include <TableFunctions/TableFunctionFactory.h>
#include <TableFunctions/TableFunctionHDFS.h>
namespace DB
{
StoragePtr TableFunctionHDFS::getStorage(
const String & source, const String & format, const Block & sample_block, Context & global_context) const
{
return StorageHDFS::create(source,
getName(),
format,
ColumnsDescription{sample_block.getNamesAndTypesList()},
global_context);
}
void registerTableFunctionHDFS(TableFunctionFactory & factory)
{
factory.registerFunction<TableFunctionHDFS>();
}
}
#endif

View File

@ -0,0 +1,32 @@
#pragma once
#include <Common/config.h>
#if USE_HDFS
#include <TableFunctions/ITableFunctionFileLike.h>
#include <Interpreters/Context.h>
#include <Core/Block.h>
namespace DB
{
/* hdfs(URI, format, structure) - creates a temporary storage from an HDFS file,
 * e.g. hdfs('hdfs://hdfs1:9000/some_file', 'TSV', 'id UInt64, text String').
 */
class TableFunctionHDFS : public ITableFunctionFileLike
{
public:
static constexpr auto name = "hdfs";
std::string getName() const override
{
return name;
}
private:
StoragePtr getStorage(
const String & source, const String & format, const Block & sample_block, Context & global_context) const override;
};
}
#endif

View File

@ -11,6 +11,7 @@
#include <TableFunctions/TableFunctionRemote.h>
#include <TableFunctions/TableFunctionFactory.h>
#include <TableFunctions/parseRemoteDescription.h>
namespace DB
@ -22,165 +23,6 @@ namespace ErrorCodes
extern const int BAD_ARGUMENTS;
}
/// The Cartesian product of two sets of rows, the result is written in place of the first argument
static void append(std::vector<String> & to, const std::vector<String> & what, size_t max_addresses)
{
if (what.empty())
return;
if (to.empty())
{
to = what;
return;
}
if (what.size() * to.size() > max_addresses)
throw Exception("Table function 'remote': first argument generates too many result addresses",
ErrorCodes::BAD_ARGUMENTS);
std::vector<String> res;
for (size_t i = 0; i < to.size(); ++i)
for (size_t j = 0; j < what.size(); ++j)
res.push_back(to[i] + what[j]);
to.swap(res);
}
/// Parse number from substring
static bool parseNumber(const String & description, size_t l, size_t r, size_t & res)
{
res = 0;
for (size_t pos = l; pos < r; pos ++)
{
if (!isNumericASCII(description[pos]))
return false;
res = res * 10 + description[pos] - '0';
if (res > 1e15)
return false;
}
return true;
}
/* Parse a string that generates shards and replicas. Separator - one of two characters | or ,
* depending on whether shards or replicas are generated.
* For example:
* host1,host2,... - generates set of shards from host1, host2, ...
* host1|host2|... - generates set of replicas from host1, host2, ...
* abc{8..10}def - generates set of shards abc8def, abc9def, abc10def.
* abc{08..10}def - generates set of shards abc08def, abc09def, abc10def.
* abc{x,yy,z}def - generates set of shards abcxdef, abcyydef, abczdef.
* abc{x|yy|z} def - generates set of replicas abcxdef, abcyydef, abczdef.
* abc{1..9}de{f,g,h} - is a direct product, 27 shards.
* abc{1..9}de{0|1} - is a direct product, 9 shards, in each 2 replicas.
*/
static std::vector<String> parseDescription(const String & description, size_t l, size_t r, char separator, size_t max_addresses)
{
std::vector<String> res;
std::vector<String> cur;
/// An empty substring means a set of an empty string
if (l >= r)
{
res.push_back("");
return res;
}
for (size_t i = l; i < r; ++i)
{
/// Either the numeric interval (8..10) or equivalent expression in brackets
if (description[i] == '{')
{
int cnt = 1;
int last_dot = -1; /// The rightmost pair of points, remember the index of the right of the two
size_t m;
std::vector<String> buffer;
bool have_splitter = false;
/// Look for the corresponding closing bracket
for (m = i + 1; m < r; ++m)
{
if (description[m] == '{') ++cnt;
if (description[m] == '}') --cnt;
if (description[m] == '.' && description[m-1] == '.') last_dot = m;
if (description[m] == separator) have_splitter = true;
if (cnt == 0) break;
}
if (cnt != 0)
throw Exception("Table function 'remote': incorrect brace sequence in first argument",
ErrorCodes::BAD_ARGUMENTS);
/// The presence of a dot - numeric interval
if (last_dot != -1)
{
size_t left, right;
if (description[last_dot - 1] != '.')
throw Exception("Table function 'remote': incorrect argument in braces (only one dot): " + description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (!parseNumber(description, i + 1, last_dot - 1, left))
throw Exception("Table function 'remote': incorrect argument in braces (Incorrect left number): "
+ description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (!parseNumber(description, last_dot + 1, m, right))
throw Exception("Table function 'remote': incorrect argument in braces (Incorrect right number): "
+ description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (left > right)
throw Exception("Table function 'remote': incorrect argument in braces (left number is greater then right): "
+ description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (right - left + 1 > max_addresses)
throw Exception("Table function 'remote': first argument generates too many result addresses",
ErrorCodes::BAD_ARGUMENTS);
bool add_leading_zeroes = false;
size_t len = last_dot - 1 - (i + 1);
/// If the left and right borders have equal numbers, then you must add leading zeros.
if (last_dot - 1 - (i + 1) == m - (last_dot + 1))
add_leading_zeroes = true;
for (size_t id = left; id <= right; ++id)
{
String cur = toString<UInt64>(id);
if (add_leading_zeroes)
{
while (cur.size() < len)
cur = "0" + cur;
}
buffer.push_back(cur);
}
}
else if (have_splitter) /// If there is a current delimiter inside, then generate a set of resulting rows
buffer = parseDescription(description, i + 1, m, separator, max_addresses);
else /// Otherwise just copy, spawn will occur when you call with the correct delimiter
buffer.push_back(description.substr(i, m - i + 1));
/// Add all possible received extensions to the current set of lines
append(cur, buffer, max_addresses);
i = m;
}
else if (description[i] == separator)
{
/// If the delimiter, then add found rows
res.insert(res.end(), cur.begin(), cur.end());
cur.clear();
}
else
{
/// Otherwise, simply append the character to current lines
std::vector<String> buffer;
buffer.push_back(description.substr(i, 1));
append(cur, buffer, max_addresses);
}
}
res.insert(res.end(), cur.begin(), cur.end());
if (res.size() > max_addresses)
throw Exception("Table function 'remote': first argument generates too many result addresses",
ErrorCodes::BAD_ARGUMENTS);
return res;
}
StoragePtr TableFunctionRemote::executeImpl(const ASTPtr & ast_function, const Context & context) const
{
ASTs & args_func = typeid_cast<ASTFunction &>(*ast_function).children;
@ -304,11 +146,11 @@ StoragePtr TableFunctionRemote::executeImpl(const ASTPtr & ast_function, const C
{
/// Create new cluster from the scratch
size_t max_addresses = context.getSettingsRef().table_function_remote_max_addresses;
std::vector<String> shards = parseDescription(cluster_description, 0, cluster_description.size(), ',', max_addresses);
std::vector<String> shards = parseRemoteDescription(cluster_description, 0, cluster_description.size(), ',', max_addresses);
std::vector<std::vector<String>> names;
for (size_t i = 0; i < shards.size(); ++i)
names.push_back(parseDescription(shards[i], 0, shards[i].size(), '|', max_addresses));
names.push_back(parseRemoteDescription(shards[i], 0, shards[i].size(), '|', max_addresses));
if (names.empty())
throw Exception("Shard list is empty after parsing first argument", ErrorCodes::BAD_ARGUMENTS);

View File

@ -0,0 +1,171 @@
#include <TableFunctions/parseRemoteDescription.h>
#include <Common/Exception.h>
#include <IO/WriteHelpers.h>
namespace DB
{
namespace ErrorCodes
{
extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH;
extern const int BAD_ARGUMENTS;
}
/// The Cartesian product of two sets of rows, the result is written in place of the first argument
static void append(std::vector<String> & to, const std::vector<String> & what, size_t max_addresses)
{
if (what.empty())
return;
if (to.empty())
{
to = what;
return;
}
if (what.size() * to.size() > max_addresses)
throw Exception("Table function 'remote': first argument generates too many result addresses",
ErrorCodes::BAD_ARGUMENTS);
std::vector<String> res;
for (size_t i = 0; i < to.size(); ++i)
for (size_t j = 0; j < what.size(); ++j)
res.push_back(to[i] + what[j]);
to.swap(res);
}
/// Parse number from substring
static bool parseNumber(const String & description, size_t l, size_t r, size_t & res)
{
res = 0;
for (size_t pos = l; pos < r; pos ++)
{
if (!isNumericASCII(description[pos]))
return false;
res = res * 10 + description[pos] - '0';
if (res > 1e15)
return false;
}
return true;
}
/* Parse a string that generates shards and replicas. Separator - one of two characters | or ,
* depending on whether shards or replicas are generated.
* For example:
* host1,host2,... - generates set of shards from host1, host2, ...
* host1|host2|... - generates set of replicas from host1, host2, ...
* abc{8..10}def - generates set of shards abc8def, abc9def, abc10def.
* abc{08..10}def - generates set of shards abc08def, abc09def, abc10def.
* abc{x,yy,z}def - generates set of shards abcxdef, abcyydef, abczdef.
* abc{x|yy|z} def - generates set of replicas abcxdef, abcyydef, abczdef.
* abc{1..9}de{f,g,h} - is a direct product, 27 shards.
* abc{1..9}de{0|1} - is a direct product, 9 shards, in each 2 replicas.
*/
std::vector<String> parseRemoteDescription(const String & description, size_t l, size_t r, char separator, size_t max_addresses)
{
std::vector<String> res;
std::vector<String> cur;
/// An empty substring means a set of an empty string
if (l >= r)
{
res.push_back("");
return res;
}
for (size_t i = l; i < r; ++i)
{
/// Either the numeric interval (8..10) or equivalent expression in brackets
if (description[i] == '{')
{
int cnt = 1;
int last_dot = -1; /// The rightmost pair of points, remember the index of the right of the two
size_t m;
std::vector<String> buffer;
bool have_splitter = false;
/// Look for the corresponding closing bracket
for (m = i + 1; m < r; ++m)
{
if (description[m] == '{') ++cnt;
if (description[m] == '}') --cnt;
if (description[m] == '.' && description[m-1] == '.') last_dot = m;
if (description[m] == separator) have_splitter = true;
if (cnt == 0) break;
}
if (cnt != 0)
throw Exception("Table function 'remote': incorrect brace sequence in first argument",
ErrorCodes::BAD_ARGUMENTS);
/// The presence of a dot - numeric interval
if (last_dot != -1)
{
size_t left, right;
if (description[last_dot - 1] != '.')
throw Exception("Table function 'remote': incorrect argument in braces (only one dot): " + description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (!parseNumber(description, i + 1, last_dot - 1, left))
throw Exception("Table function 'remote': incorrect argument in braces (Incorrect left number): "
+ description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (!parseNumber(description, last_dot + 1, m, right))
throw Exception("Table function 'remote': incorrect argument in braces (Incorrect right number): "
+ description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (left > right)
throw Exception("Table function 'remote': incorrect argument in braces (left number is greater then right): "
+ description.substr(i, m - i + 1),
ErrorCodes::BAD_ARGUMENTS);
if (right - left + 1 > max_addresses)
throw Exception("Table function 'remote': first argument generates too many result addresses",
ErrorCodes::BAD_ARGUMENTS);
bool add_leading_zeroes = false;
size_t len = last_dot - 1 - (i + 1);
/// If the left and right numbers have the same number of digits, add leading zeros.
if (last_dot - 1 - (i + 1) == m - (last_dot + 1))
add_leading_zeroes = true;
for (size_t id = left; id <= right; ++id)
{
String cur = toString<UInt64>(id);
if (add_leading_zeroes)
{
while (cur.size() < len)
cur = "0" + cur;
}
buffer.push_back(cur);
}
}
else if (have_splitter) /// If there is a current delimiter inside, then generate a set of resulting rows
buffer = parseRemoteDescription(description, i + 1, m, separator, max_addresses);
else /// Otherwise just copy; expansion will happen on a later call with the correct delimiter
buffer.push_back(description.substr(i, m - i + 1));
/// Add all possible received extensions to the current set of lines
append(cur, buffer, max_addresses);
i = m;
}
else if (description[i] == separator)
{
/// On a delimiter, add the accumulated rows to the result
res.insert(res.end(), cur.begin(), cur.end());
cur.clear();
}
else
{
/// Otherwise, simply append the character to current lines
std::vector<String> buffer;
buffer.push_back(description.substr(i, 1));
append(cur, buffer, max_addresses);
}
}
res.insert(res.end(), cur.begin(), cur.end());
if (res.size() > max_addresses)
throw Exception("Table function 'remote': first argument generates too many result addresses",
ErrorCodes::BAD_ARGUMENTS);
return res;
}
}

View File

@ -0,0 +1,20 @@
#pragma once
#include <Core/Types.h>
#include <vector>
namespace DB
{
/* Parse a string that generates shards and replicas. Separator - one of two characters | or ,
* depending on whether shards or replicas are generated.
* For example:
* host1,host2,... - generates set of shards from host1, host2, ...
* host1|host2|... - generates set of replicas from host1, host2, ...
* abc{8..10}def - generates set of shards abc8def, abc9def, abc10def.
* abc{08..10}def - generates set of shards abc08def, abc09def, abc10def.
* abc{x,yy,z}def - generates set of shards abcxdef, abcyydef, abczdef.
* abc{x|yy|z} def - generates set of replicas abcxdef, abcyydef, abczdef.
* abc{1..9}de{f,g,h} - is a direct product, 27 shards.
* abc{1..9}de{0|1} - is a direct product, 9 shards, in each 2 replicas.
*/
std::vector<String> parseRemoteDescription(const String & description, size_t l, size_t r, char separator, size_t max_addresses);
}
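A short worked example of the grammar documented above; the host names are illustrative.

```cpp
#include <TableFunctions/parseRemoteDescription.h>
#include <Core/Types.h>

void example()
{
    /// ',' separator: expands to {"node1.example.com", "node2.example.com", "node3.example.com"}.
    const DB::String shards = "node{1..3}.example.com";
    auto shard_list = DB::parseRemoteDescription(shards, 0, shards.size(), ',', 100);

    /// '|' separator: expands to {"replicaa", "replicab"}.
    const DB::String replicas = "replica{a|b}";
    auto replica_list = DB::parseRemoteDescription(replicas, 0, replicas.size(), '|', 100);
}
```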

View File

@ -14,6 +14,10 @@ void registerTableFunctionCatBoostPool(TableFunctionFactory & factory);
void registerTableFunctionFile(TableFunctionFactory & factory);
void registerTableFunctionURL(TableFunctionFactory & factory);
#if USE_HDFS
void registerTableFunctionHDFS(TableFunctionFactory & factory);
#endif
#if USE_POCO_SQLODBC || USE_POCO_DATAODBC
void registerTableFunctionODBC(TableFunctionFactory & factory);
#endif
@ -37,6 +41,10 @@ void registerTableFunctions()
registerTableFunctionFile(factory);
registerTableFunctionURL(factory);
#if USE_HDFS
registerTableFunctionHDFS(factory);
#endif
#if USE_POCO_SQLODBC || USE_POCO_DATAODBC
registerTableFunctionODBC(factory);
#endif

View File

@ -33,7 +33,12 @@ set the following environment variables:
### Running with runner script
The only requirement is fresh docker.
The only requirement is a fresh, properly configured Docker installation.
Notes:
* If you want to run integration tests without `sudo`, add your user to the docker group: `sudo usermod -aG docker $USER`. [More information](https://docs.docker.com/install/linux/linux-postinstall/) about docker configuration.
* If you have already run these tests without the `./runner` script, you may have problems with the pytest cache. It can be removed with `rm -r __pycache__ .pytest_cache/`.
* Some tests may require a lot of resources (CPU, RAM, etc.). It is better not to run large tests such as `test_cluster_copier` or `test_distributed_ddl*` on a laptop.
You can run tests via the `./runner` script and pass pytest arguments as the last argument:
```

View File

@ -14,11 +14,13 @@ import xml.dom.minidom
from kazoo.client import KazooClient
from kazoo.exceptions import KazooException
import psycopg2
import requests
import docker
from docker.errors import ContainerError
from .client import Client, CommandRequest
from .hdfs_api import HDFSApi
HELPERS_DIR = p.dirname(__file__)
@ -83,6 +85,7 @@ class ClickHouseCluster:
self.with_postgres = False
self.with_kafka = False
self.with_odbc_drivers = False
self.with_hdfs = False
self.docker_client = None
self.is_up = False
@ -94,7 +97,7 @@ class ClickHouseCluster:
cmd += " client"
return cmd
def add_instance(self, name, config_dir=None, main_configs=[], user_configs=[], macros={}, with_zookeeper=False, with_mysql=False, with_kafka=False, clickhouse_path_dir=None, with_odbc_drivers=False, with_postgres=False, hostname=None, env_variables={}, image="yandex/clickhouse-integration-test", stay_alive=False):
def add_instance(self, name, config_dir=None, main_configs=[], user_configs=[], macros={}, with_zookeeper=False, with_mysql=False, with_kafka=False, clickhouse_path_dir=None, with_odbc_drivers=False, with_postgres=False, with_hdfs=False, hostname=None, env_variables={}, image="yandex/clickhouse-integration-test", stay_alive=False):
"""Add an instance to the cluster.
name - the name of the instance directory and the value of the 'instance' macro in ClickHouse.
@ -148,13 +151,19 @@ class ClickHouseCluster:
self.base_postgres_cmd = ['docker-compose', '--project-directory', self.base_dir, '--project-name',
self.project_name, '--file', p.join(HELPERS_DIR, 'docker_compose_postgres.yml')]
if with_kafka and not self.with_kafka:
self.with_kafka = True
self.base_cmd.extend(['--file', p.join(HELPERS_DIR, 'docker_compose_kafka.yml')])
self.base_kafka_cmd = ['docker-compose', '--project-directory', self.base_dir, '--project-name',
self.project_name, '--file', p.join(HELPERS_DIR, 'docker_compose_kafka.yml')]
if with_hdfs and not self.with_hdfs:
self.with_hdfs = True
self.base_cmd.extend(['--file', p.join(HELPERS_DIR, 'docker_compose_hdfs.yml')])
self.base_hdfs_cmd = ['docker-compose', '--project-directory', self.base_dir, '--project-name',
self.project_name, '--file', p.join(HELPERS_DIR, 'docker_compose_hdfs.yml')]
return instance
@ -212,6 +221,20 @@ class ClickHouseCluster:
raise Exception("Cannot wait ZooKeeper container")
def wait_hdfs_to_start(self, timeout=60):
hdfs_api = HDFSApi("root")
start = time.time()
while time.time() - start < timeout:
try:
hdfs_api.write_data("/somefilewithrandomname222", "1")
print "Connected to HDFS and SafeMode disabled! "
return
except Exception as ex:
print "Can't connect to HDFS " + str(ex)
time.sleep(1)
raise Exception("Can't wait HDFS to start")
def start(self, destroy_dirs=True):
if self.is_up:
return
@ -250,7 +273,11 @@ class ClickHouseCluster:
subprocess_check_call(self.base_kafka_cmd + ['up', '-d', '--force-recreate'])
self.kafka_docker_id = self.get_instance_docker_id('kafka1')
subprocess_check_call(self.base_cmd + ['up', '-d', '--force-recreate'])
if self.with_hdfs and self.base_hdfs_cmd:
subprocess_check_call(self.base_hdfs_cmd + ['up', '-d', '--force-recreate'])
self.wait_hdfs_to_start(120)
subprocess_check_call(self.base_cmd + ['up', '-d', '--no-recreate'])
start_deadline = time.time() + 20.0 # seconds
for instance in self.instances.itervalues():
@ -310,7 +337,6 @@ services:
{name}:
image: {image}
hostname: {hostname}
user: '{uid}'
volumes:
- {binary_path}:/usr/bin/clickhouse:ro
- {configs_dir}:/etc/clickhouse-server/
@ -588,7 +614,6 @@ class ClickHouseInstance:
image=self.image,
name=self.name,
hostname=self.hostname,
uid=os.getuid(),
binary_path=self.server_bin_path,
configs_dir=configs_dir,
config_d_dir=config_d_dir,

View File

@ -0,0 +1,9 @@
version: '2'
services:
hdfs1:
image: sequenceiq/hadoop-docker:2.7.0
restart: always
ports:
- 50075:50075
- 50070:50070
entrypoint: /etc/bootstrap.sh -d

View File

@ -0,0 +1,46 @@
#-*- coding: utf-8 -*-
import requests
import subprocess
from tempfile import NamedTemporaryFile
class HDFSApi(object):
def __init__(self, user):
self.host = "localhost"
self.http_proxy_port = "50070"
self.http_data_port = "50075"
self.user = user
def read_data(self, path):
response = requests.get("http://{host}:{port}/webhdfs/v1{path}?op=OPEN".format(host=self.host, port=self.http_proxy_port, path=path), allow_redirects=False)
if response.status_code != 307:
response.raise_for_status()
additional_params = '&'.join(response.headers['Location'].split('&')[1:2])
response_data = requests.get("http://{host}:{port}/webhdfs/v1{path}?op=OPEN&{params}".format(host=self.host, port=self.http_data_port, path=path, params=additional_params))
if response_data.status_code != 200:
response_data.raise_for_status()
return response_data.text
# The requests library doesn't handle the WebHDFS two-step PUT (redirect plus upload) well, so shell out to curl.
def _curl_to_put(self, filename, path, params):
url = "http://{host}:{port}/webhdfs/v1{path}?op=CREATE&{params}".format(host=self.host, port=self.http_data_port, path=path, params=params)
cmd = "curl -s -i -X PUT -T {fname} '{url}'".format(fname=filename, url=url)
output = subprocess.check_output(cmd, shell=True)
return output
def write_data(self, path, content):
named_file = NamedTemporaryFile()
fpath = named_file.name
named_file.write(content)
named_file.flush()
response = requests.put(
"http://{host}:{port}/webhdfs/v1{path}?op=CREATE".format(host=self.host, port=self.http_proxy_port, path=path, user=self.user),
allow_redirects=False
)
if response.status_code != 307:
response.raise_for_status()
additional_params = '&'.join(response.headers['Location'].split('&')[1:2] + ["user.name={}".format(self.user), "overwrite=true"])
output = self._curl_to_put(fpath, path, additional_params)
if "201 Created" not in output:
raise Exception("Can't create file on hdfs:\n {}".format(output))

View File

@ -1,4 +1,4 @@
FROM ubuntu
FROM ubuntu:18.04
RUN apt-get update && env DEBIAN_FRONTEND=noninteractive apt-get install --yes --force-yes \
@ -16,7 +16,9 @@ RUN apt-get update && env DEBIAN_FRONTEND=noninteractive apt-get install --yes -
module-init-tools \
cgroupfs-mount \
python-pip \
tzdata
tzdata \
libreadline-dev \
libicu-dev
ENV TZ=Europe/Moscow
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
@ -24,7 +26,7 @@ RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN pip install pytest docker-compose==1.22.0 docker dicttoxml kazoo PyMySQL psycopg2
ENV DOCKER_CHANNEL stable
ENV DOCKER_VERSION 18.09.0
ENV DOCKER_VERSION 17.09.1-ce
RUN set -eux; \
\

View File

@ -2,54 +2,94 @@
#-*- coding: utf-8 -*-
import subprocess
import os
import getpass
import argparse
import logging
import signal
import subprocess
CUR_FILE_DIR_PATH = os.path.dirname(os.path.realpath(__file__))
DEFAULT_CLICKHOUSE_ROOT = os.path.abspath(os.path.join(CUR_FILE_DIR_PATH, "../../../"))
CUR_FILE_DIR = os.path.dirname(os.path.realpath(__file__))
DEFAULT_CLICKHOUSE_ROOT = os.path.abspath(os.path.join(CUR_FILE_DIR, "../../../"))
CURRENT_WORK_DIR = os.getcwd()
CONTAINER_NAME = "clickhouse_integration_tests"
DIND_INTEGRATION_TESTS_IMAGE_NAME = "yandex/clickhouse-integration-tests-runner"
def check_args_and_update_paths(args):
if not os.path.isabs(args.binary):
args.binary = os.path.abspath(os.path.join(CURRENT_WORK_DIR, args.binary))
if not os.path.isabs(args.configs_dir):
args.configs_dir = os.path.abspath(os.path.join(CURRENT_WORK_DIR, args.configs_dir))
if not os.path.isabs(args.clickhouse_root):
args.clickhouse_root = os.path.abspath(os.path.join(CURRENT_WORK_DIR, args.clickhouse_root))
for path in [args.binary, args.configs_dir, args.clickhouse_root]:
if not os.path.exists(path):
raise Exception("Path {} doesn't exists".format(path))
def try_rm_container():
try:
subprocess.check_call('docker rm {name}'.format(name=CONTAINER_NAME), shell=True)
except Exception:
pass
def docker_kill_handler_handler(signum, frame):
subprocess.check_call('docker kill $(docker ps -a -q --filter name={name} --format="{{{{.ID}}}}")'.format(name=CONTAINER_NAME), shell=True)
try_rm_container()
raise KeyboardInterrupt("Killed by Ctrl+C")
signal.signal(signal.SIGINT, docker_kill_handler_handler)
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
parser = argparse.ArgumentParser(description="ClickHouse integration tests runner")
parser.add_argument(
"--binary",
default=os.environ.get("CLICKHOUSE_TESTS_SERVER_BIN_PATH", os.environ.get("CLICKHOUSE_TESTS_CLIENT_BIN_PATH", "/usr/bin/clickhouse")),
help="Path to clickhouse binary")
parser.add_argument(
"--configs-dir",
default=os.environ.get("CLICKHOUSE_TESTS_BASE_CONFIG_DIR", "/etc/clickhouse-server"),
help="Path to clickhouse configs directory"
)
default=os.environ.get("CLICKHOUSE_TESTS_BASE_CONFIG_DIR", os.path.join(DEFAULT_CLICKHOUSE_ROOT, "dbms/programs/server")),
help="Path to clickhouse configs directory")
parser.add_argument(
"--clickhouse-root",
default=DEFAULT_CLICKHOUSE_ROOT,
help="Path to repository root folder"
)
help="Path to repository root folder")
parser.add_argument(
"--disable-net-host",
action='store_true',
default=False,
help="Don't use net host in parent docker container"
)
help="Don't use net host in parent docker container")
parser.add_argument('pytest_args', nargs='*', help="args for pytest command")
args = parser.parse_args()
check_args_and_update_paths(args)
net = ""
if not args.disable_net_host:
net = "--net=host"
cmd = "docker run {net} --privileged --volume={bin}:/clickhouse \
--volume={cfg}:/clickhouse-config --volume={pth}:/ClickHouse -e PYTEST_OPTS='{opts}' {img}".format(
cmd = "docker run {net} --name {name} --user={user} --privileged --volume={bin}:/clickhouse \
--volume={cfg}:/clickhouse-config --volume={pth}:/ClickHouse -e PYTEST_OPTS='{opts}' {img} ".format(
net=net,
bin=args.binary,
cfg=args.configs_dir,
pth=args.clickhouse_root,
opts=' '.join(args.pytest_args),
img=DIND_INTEGRATION_TESTS_IMAGE_NAME,
user=getpass.getuser(),
name=CONTAINER_NAME,
)
subprocess.check_call(cmd, shell=True)
try:
subprocess.check_call(cmd, shell=True)
finally:
try_rm_image()

View File

@ -0,0 +1,11 @@
<yandex>
<logger>
<level>trace</level>
<log>/var/log/clickhouse-server/log.log</log>
<errorlog>/var/log/clickhouse-server/log.err.log</errorlog>
<size>1000M</size>
<count>10</count>
<stderr>/var/log/clickhouse-server/stderr.log</stderr>
<stdout>/var/log/clickhouse-server/stdout.log</stdout>
</logger>
</yandex>

View File

@ -0,0 +1,47 @@
import time
import pytest
import requests
from tempfile import NamedTemporaryFile
from helpers.hdfs_api import HDFSApi
import os
from helpers.cluster import ClickHouseCluster
import subprocess
SCRIPT_DIR = os.path.dirname(os.path.realpath(__file__))
cluster = ClickHouseCluster(__file__)
node1 = cluster.add_instance('node1', with_hdfs=True, image='withlibsimage', config_dir="configs", main_configs=['configs/log_conf.xml'])
@pytest.fixture(scope="module")
def started_cluster():
try:
cluster.start()
yield cluster
except Exception as ex:
print(ex)
raise ex
finally:
cluster.shutdown()
def test_read_write_storage(started_cluster):
hdfs_api = HDFSApi("root")
hdfs_api.write_data("/simple_storage", "1\tMark\t72.53\n")
assert hdfs_api.read_data("/simple_storage") == "1\tMark\t72.53\n"
node1.query("create table SimpleHDFSStorage (id UInt32, name String, weight Float64) ENGINE = HDFS('hdfs://hdfs1:9000/simple_storage', 'TSV')")
assert node1.query("select * from SimpleHDFSStorage") == "1\tMark\t72.53\n"
def test_read_write_table(started_cluster):
hdfs_api = HDFSApi("root")
data = "1\tSerialize\t555.222\n2\tData\t777.333\n"
hdfs_api.write_data("/simple_table_function", data)
assert hdfs_api.read_data("/simple_table_function") == data
assert node1.query("select * from hdfs('hdfs://hdfs1:9000/simple_table_function', 'TSV', 'id UInt64, text String, number Float64')") == data

View File

@ -17,7 +17,7 @@ function thread1()
function thread2()
{
seq 1 1000 | sed -r -e 's/.+/SELECT count() FROM test.buffer;/' | ${CLICKHOUSE_CLIENT} --multiquery --server_logs_file='/dev/null' --ignore-error 2>&1 | grep -vP '^0$|^10$|^Received exception|^Code: 60'
seq 1 1000 | sed -r -e 's/.+/SELECT count() FROM test.buffer;/' | ${CLICKHOUSE_CLIENT} --multiquery --server_logs_file='/dev/null' --ignore-error 2>&1 | grep -vP '^0$|^10$|^Received exception|^Code: 60|^Code: 218'
}
thread1 &

View File

@ -24,6 +24,8 @@ RUN apt-get update \
/tmp/* \
&& apt-get clean
RUN mkdir /docker-entrypoint-initdb.d
COPY docker_related_config.xml /etc/clickhouse-server/config.d/
COPY entrypoint.sh /entrypoint.sh
ADD https://github.com/tianon/gosu/releases/download/1.10/gosu-amd64 /bin/gosu

View File

@ -33,6 +33,22 @@ ClickHouse configuration represented with a file "config.xml" ([documentation](h
$ docker run -d --name some-clickhouse-server --ulimit nofile=262144:262144 -v /path/to/your/config.xml:/etc/clickhouse-server/config.xml yandex/clickhouse-server
```
## How to extend this image
If you would like to do additional initialization in an image derived from this one, add one or more `*.sql`, `*.sql.gz`, or `*.sh` scripts under `/docker-entrypoint-initdb.d`. After the entrypoint starts a temporary server, it will run any `*.sql` files, run any executable `*.sh` scripts, and source any non-executable `*.sh` scripts found in that directory to do further initialization before starting the service.
For example, to add an additional user and database, add the following to `/docker-entrypoint-initdb.d/init-db.sh`:
```bash
#!/bin/bash
set -e
clickhouse client -n <<-EOSQL
CREATE DATABASE docker;
CREATE TABLE docker.docker (x Int32) ENGINE = Log;
EOSQL
```
## License
View [license information](https://github.com/yandex/ClickHouse/blob/master/LICENSE) for the software contained in this image.

View File

@ -30,6 +30,36 @@ chown -R $USER:$GROUP \
"$TMP_DIR" \
"$USER_PATH"
if [ -n "$(ls /docker-entrypoint-initdb.d/)" ]; then
gosu clickhouse /usr/bin/clickhouse-server --config-file=$CLICKHOUSE_CONFIG &
pid="$!"
sleep 1
clickhouseclient=( clickhouse client --multiquery )
echo
for f in /docker-entrypoint-initdb.d/*; do
case "$f" in
*.sh)
if [ -x "$f" ]; then
echo "$0: running $f"
"$f"
else
echo "$0: sourcing $f"
. "$f"
fi
;;
*.sql) echo "$0: running $f"; cat "$f" | "${clickhouseclient[@]}" ; echo ;;
*.sql.gz) echo "$0: running $f"; gunzip -c "$f" | "${clickhouseclient[@]}"; echo ;;
*) echo "$0: ignoring $f" ;;
esac
echo
done
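# Stop the temporary server that ran the init scripts; abort startup if it doesn't exit cleanly.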
if ! kill -s TERM "$pid" || ! wait "$pid"; then
echo >&2 'ClickHouse init process failed.'
exit 1
fi
fi
# if no args are passed to `docker run` or the first argument starts with `--`, then the user is passing clickhouse-server arguments
if [[ $# -lt 1 ]] || [[ "$1" == "--"* ]]; then

View File

@ -22,9 +22,8 @@ cd ClickHouse
# How to Build ClickHouse for Development
Build should work on Ubuntu Linux.
The following tutorial is based on the Ubuntu Linux system.
With appropriate changes, it should also work on any other Linux distribution.
The build process is not intended to work on Mac OS X.
Only x86_64 with SSE 4.2 is supported. Support for AArch64 is experimental.
To test for SSE 4.2, do
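```bash
$ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
```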

View File

@ -39,6 +39,7 @@ You can also download and install packages manually from here: <https://repo.yan
Yandex does not run ClickHouse on `rpm` based Linux distributions and `rpm` packages are not as thoroughly tested. So use them at your own risk, but there are many other companies that do successfully run them in production without any major issues.
For CentOS, RHEL or Fedora there are the following options:
* Packages from <https://repo.yandex.ru/clickhouse/rpm/stable/x86_64/> are generated from official `deb` packages by Yandex and have byte-identical binaries.
* Packages from <https://github.com/Altinity/clickhouse-rpm-install> are built by the independent company Altinity and are widely used without any complaints.
* Or you can use Docker (see below).

View File

@ -1,10 +1,12 @@
# Visual Interfaces from Third-party Developers
## Tabix
## Open-Source
### Tabix
Web interface for ClickHouse in the [Tabix](https://github.com/tabixio/tabix) project.
Main features:
Features:
- Works with ClickHouse directly from the browser, without the need to install additional software.
- Query editor with syntax highlighting.
@ -14,11 +16,11 @@ Main features:
[Tabix documentation](https://tabix.io/doc/).
## HouseOps
### HouseOps
[HouseOps](https://github.com/HouseOps/HouseOps) is a UI/IDE for OSX, Linux and Windows.
Main features:
Features:
- Query builder with syntax highlighting. View the response in a table or JSON view.
- Export query results as CSV or JSON.
@ -36,25 +38,27 @@ The following features are planned for development:
- Cluster management.
- Monitoring replicated and Kafka tables.
## DBeaver
## Commercial
### DBeaver
[DBeaver](https://dbeaver.io/) - universal desktop database client with ClickHouse support.
Key features:
Features:
- Query development with syntax highlight.
- Table preview.
- Autocompletion.
## DataGrip
### DataGrip
[DataGrip](https://www.jetbrains.com/datagrip/) - Database IDE from JetBrains with dedicated support for ClickHouse. The same is embedded into other IntelliJ-based tools: PyCharm, IntelliJIDEA, GoLand, PhpStorm etc.
[DataGrip](https://www.jetbrains.com/datagrip/) is a database IDE from JetBrains with dedicated support for ClickHouse. It is also embedded into other IntelliJ-based tools: PyCharm, IntelliJ IDEA, GoLand, PhpStorm and others.
Features:
- Very fast code completion.
- Clickhouse synthax highlighting.
- Specific Clickhouse features support in SQL, i.e. nested columns, table engines.
- ClickHouse syntax highlighting.
- Support for features specific to ClickHouse, for example nested columns, table engines.
- Data Editor.
- Refactorings.
- Search and Navigation.

View File

@ -3,6 +3,11 @@
!!! warning "Disclaimer"
Yandex does **not** maintain the libraries listed below and hasn't done any extensive testing to ensure their quality.
- Relational database management systems
- [MySQL](https://www.mysql.com)
- [ProxySQL](https://github.com/sysown/proxysql/wiki/ClickHouse-Support)
- [PostgreSQL](https://www.postgresql.org)
- [infi.clickhouse_fdw](https://github.com/Infinidat/infi.clickhouse_fdw) (uses [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- Python
- [SQLAlchemy](https://www.sqlalchemy.org)
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse) (uses [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
@ -22,4 +27,4 @@
- [clickhouse_ecto](https://github.com/appodeal/clickhouse_ecto)
[Original article](https://clickhouse.yandex/docs/en/interfaces/third-party/integrations/) <!--hide-->
[Original article](https://clickhouse.yandex/docs/en/interfaces/third-party/integrations/) <!--hide-->

View File

@ -14,7 +14,7 @@ Some column-oriented DBMSs (InfiniDB CE and MonetDB) do not use data compression
## Disk Storage of Data
Mving a data physically sorted by primary key makes it possible to extract data for it's specific values or value ranges with low latency, less than few dozen milliseconds. some column-oriented DBMSs (such as SAP HANA and Google PowerDrill) can only work in RAM. This approach encourages the allocation of a larger hardware budget than is actually necessary for real-time analysis. ClickHouse is designed to work on regular hard drives, which means the cost per GB of data storage is low, but SSD and additional RAM are also fully used if available.
Keeping data physically sorted by primary key makes it possible to extract data for its specific values or value ranges with low latency, in less than a few dozen milliseconds. Some column-oriented DBMSs (such as SAP HANA and Google PowerDrill) can only work in RAM. This approach encourages the allocation of a larger hardware budget than is actually necessary for real-time analysis. ClickHouse is designed to work on regular hard drives, which means the cost per GB of data storage is low, but SSD and additional RAM are also fully used if available.
## Parallel Processing on Multiple Cores

View File

@ -125,10 +125,6 @@ CREATE [MATERIALIZED] VIEW [IF NOT EXISTS] [db.]table_name [TO[db.]name] [ENGINE
Creates a view. There are two types of views: normal and MATERIALIZED.
When creating a materialized view, you must specify ENGINE, the table engine for storing data.
A materialized view works as follows: when inserting data to the table specified in SELECT, part of the inserted data is converted by this SELECT query, and the result is inserted in the view.
Normal views don't store any data, but just perform a read from another table. In other words, a normal view is nothing more than a saved query. When reading from a view, this saved query is used as a subquery in the FROM clause.
As an example, assume you've created a view:
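``` sql
CREATE VIEW view AS SELECT ...
```
and written a query `SELECT a, b, c FROM view`. This query is fully equivalent to using the subquery `SELECT a, b, c FROM (SELECT ...)`.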

View File

@ -48,7 +48,7 @@ The FINAL modifier can be used only for a SELECT from a CollapsingMergeTree tabl
The SAMPLE clause allows for approximate query processing. Approximate query processing is only supported by MergeTree\* type tables, and only if the sampling expression was specified during table creation (see the section "MergeTree engine").
`SAMPLE` has the `format SAMPLE k`, where `k` is a decimal number from 0 to 1, or `SAMPLE n`, where 'n' is a sufficiently large integer.
`SAMPLE` has the format `SAMPLE k`, where `k` is a decimal number from 0 to 1, or `SAMPLE n`, where `n` is a sufficiently large integer.
In the first case, the query will be executed on a fraction `k` of the data. For example, `SAMPLE 0.1` runs the query on 10% of the data.
In the second case, the query will be executed on a sample of no more than `n` rows. For example, `SAMPLE 10000000` runs the query on a maximum of 10,000,000 rows.
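For instance, a minimal sketch, assuming a hypothetical table `hits` that was created with a sampling expression:
``` sql
SELECT count() FROM hits SAMPLE 0.1
```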
@ -437,18 +437,20 @@ Only one `JOIN` can be specified in a query (on a single level). To run multiple
Each time a query is run with the same `JOIN`, the subquery is run again; the result is not cached. To avoid this, use the special 'Join' table engine, which is a prepared array for joining that is always in RAM. For more information, see the section "Table engines, Join".
In some cases, it is more efficient to use `IN` instead of `JOIN`.
Among the various types of `JOIN`, the most efficient is ANY `LEFT JOIN`, then `ANY INNER JOIN`. The least efficient are `ALL LEFT JOIN` and `ALL INNER JOIN`.
Among the various types of `JOIN`, the most efficient is `ANY LEFT JOIN`, then `ANY INNER JOIN`. The least efficient are `ALL LEFT JOIN` and `ALL INNER JOIN`.
If you need a `JOIN` for joining with dimension tables (these are relatively small tables that contain dimension properties, such as names for advertising campaigns), a `JOIN` might not be very convenient due to the bulky syntax and the fact that the right table is re-accessed for every query. For such cases, there is an "external dictionaries" feature that you should use instead of `JOIN`. For more information, see the section [External dictionaries](dicts/external_dicts.md#dicts-external_dicts).
#### NULL processing
The JOIN behavior is affected by the [join_use_nulls](../operations/settings/settings.md#settings-join_use_nulls) setting. With `join_use_nulls=1`, `JOIN` works like in standard SQL.
If the JOIN keys are [Nullable](../data_types/nullable.md#data_types-nullable) fields, the rows where at least one of the keys has the value [NULL](syntax.md#null-literal) are not joined.
<a name="query_language-queries-where"></a>
### WHERE Clause
The JOIN behavior is affected by the [join_use_nulls](../operations/settings/settings.md#settings-join_use_nulls) setting. With `join_use_nulls=1`, `JOIN` works like in standard SQL.
If the JOIN keys are [Nullable](../data_types/nullable.md#data_types-nullable) fields, the rows where at least one of the keys has the value [NULL](syntax.md#null-literal) are not joined.
If there is a WHERE clause, it must contain an expression with the UInt8 type. This is usually an expression with comparison and logical operators.
This expression will be used for filtering data before all other transformations.
@ -696,6 +698,8 @@ The result will be the same as if GROUP BY were specified across all the fields
DISTINCT is not supported if SELECT has at least one array column.
`DISTINCT` works with [NULL](syntax.md#null-literal) as if `NULL` were a specific value, and `NULL=NULL`. In other words, in the `DISTINCT` results, different combinations with `NULL` only occur once.
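For instance, a minimal sketch, assuming a hypothetical table `t_null` with a Nullable column `y`:
``` sql
SELECT DISTINCT y FROM t_null
```
Rows where `y` is `NULL` produce a single `NULL` row in the result.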
### LIMIT Clause
LIMIT m allows you to select the first 'm' rows from the result.
@ -705,8 +709,6 @@ LIMIT n, m allows you to select the first 'm' rows from the result after skippin
If there isn't an ORDER BY clause that explicitly sorts results, the result may be arbitrary and nondeterministic.
`DISTINCT` works with [NULL](syntax.md#null-literal) as if `NULL` were a specific value, and `NULL=NULL`. In other words, in the `DISTINCT` results, different combinations with `NULL` only occur once.
### UNION ALL Clause
You can use UNION ALL to combine any number of queries. Example:
@ -852,7 +854,7 @@ FROM t_null
#### Distributed Subqueries
There are two options for IN-s with subqueries (similar to JOINs): normal `IN` / ` OIN` and `IN GLOBAL` / `GLOBAL JOIN`. They differ in how they are run for distributed query processing.
There are two options for IN-s with subqueries (similar to JOINs): normal `IN` / `JOIN` and `GLOBAL IN` / `GLOBAL JOIN`. They differ in how they are run for distributed query processing.
!!! attention
Remember that the algorithms described below may work differently depending on the [distributed_product_mode](../operations/settings/settings.md#settings-distributed_product_mode) setting.

View File

@ -2,11 +2,14 @@
# interface های visual توسعه دهندگان third-party
## Tabix
## متن باز
### Tabix
interface تحت وب برای ClickHouse در پروژه [Tabix](https://github.com/tabixio/tabix).
### ویژگی ها:
ویژگی ها:
- کار با ClickHouse به صورت مستقیم و از طریق مرورگر، بدون نیاز به نرم افزار اضافی.
- ادیتور query به همراه syntax highlighting.
- ویژگی Auto-completion برای دستورات.
@ -16,11 +19,12 @@ interface تحت وب برای ClickHouse در پروژه [Tabix](https://github
[مستندات Tabix](https://tabix.io/doc/).
## HouseOps
### HouseOps
[HouseOps](https://github.com/HouseOps/HouseOps) نرم افزار Desktop برای سیستم عامل های Linux و OSX و Windows می باشد.
### ویژگی ها:
ویژگی ها:
- ابزار Query builder به همراه syntax highlighting. نمایش نتایج به صورت جدول و JSON Object.
- خروجی نتایج به صورت csv و JSON Object.
- نمایش Processes List ها به همراه توضیحات، ویژگی حالت record و kill کردن process ها.
@ -35,5 +39,30 @@ interface تحت وب برای ClickHouse در پروژه [Tabix](https://github
- مانیتورینگ کافکا و جداول replicate (بزودی);
- و بسیاری از ویژگی های دیگر برای شما.
## تجاری
### DBeaver
[DBeaver](https://dbeaver.io/) - کلاینت دسکتاپ چندمنظوره پایگاه داده با پشتیبانی ClickHouse.
امکانات:
- توسعه پرس و جو با برجسته نحو
- پیش نمایش جدول
- تکمیل خودکار
### DataGrip
[DataGrip](https://www.jetbrains.com/datagrip/) IDE پایگاه داده از JetBrains با پشتیبانی اختصاصی برای ClickHouse است. این ابزار همچنین به سایر ابزارهای مبتنی بر IntelliJ تعبیه شده است: PyCharm، IntelliJ IDEA، GoLand، PhpStorm و دیگران.
امکانات:
- تکمیل کد بسیار سریع
- نحو برجسته ClickHouse.
- پشتیبانی از ویژگی های خاص ClickHouse، برای مثال ستون های تو در تو، موتورهای جدول.
- ویرایشگر داده.
- Refactorings.
- جستجو و ناوبری
</div>
[مقاله اصلی](https://clickhouse.yandex/docs/fa/interfaces/third-party_gui/) <!--hide-->

View File

@ -5,12 +5,17 @@
!!! warning "سلب مسئولیت"
Yandex کتابخانه های ذکر شده در زیر را نگهداری نمی کند و هیچ آزمایش گسترده ای برای اطمینان از کیفیت آنها انجام نداده است.
- سیستم های مدیریت پایگاه داده رابطه ای
- [MySQL](https://www.mysql.com)
- [ProxySQL](https://github.com/sysown/proxysql/wiki/ClickHouse-Support)
- [PostgreSQL](https://www.postgresql.org)
- [infi.clickhouse_fdw](https://github.com/Infinidat/infi.clickhouse_fdw) (استفاده می کند [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- Python
- [SQLAlchemy](https://www.sqlalchemy.org)
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse) (uses [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse) (استفاده می کند [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- Java
- [Hadoop](http://hadoop.apache.org)
- [clickhouse-hdfs-loader](https://github.com/jaykelin/clickhouse-hdfs-loader) (uses [JDBC](../jdbc.md))
- [clickhouse-hdfs-loader](https://github.com/jaykelin/clickhouse-hdfs-loader) (استفاده می کند [JDBC](../jdbc.md))
- Scala
- [Akka](https://akka.io)
- [clickhouse-scala-client](https://github.com/crobox/clickhouse-scala-client)

View File

@ -25,7 +25,7 @@
## Внутреннее представление
Внутри данные представляются как знаковые целые числа соответствующей разрядности. Реальные диапазоны, хранящиеся в ячейках памяти, несколько больше заявленных. Заявленные диапазоны Decimal проверяются только при вводе числа из строкового представления.
Поскольку современные CPU не поддежривают 128-битные числа, операции над Decimal128 эмулируются программно. Decimal128 работает в разы медленней чем Decimal32/Decimal64.
Поскольку современные CPU не поддерживают 128-битные числа, операции над Decimal128 эмулируются программно. Decimal128 работает в разы медленнее, чем Decimal32/Decimal64.
## Операции и типы результата

View File

@ -4,7 +4,7 @@
ClickHouse может работать на любом Linux, FreeBSD или Mac OS X с архитектурой процессора x86\_64.
Хотя предсобранные релизы обычно компилируются с использованием набора инструкций SSE 4.2, что добавляет использование поддерживающего его процессора в список системных требований. Команда для п проверки наличия поддержки инструкций SSE 4.2 на текущем процессоре:
Хотя предсобранные релизы обычно компилируются с использованием набора инструкций SSE 4.2, что добавляет использование поддерживающего его процессора в список системных требований. Команда для проверки наличия поддержки инструкций SSE 4.2 на текущем процессоре:
```bash
$ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
@ -39,6 +39,7 @@ sudo apt-get install clickhouse-client clickhouse-server
Яндекс не использует ClickHouse на поддерживающих `rpm` дистрибутивах Linux, а также `rpm` пакеты менее тщательно тестируются. Таким образом, использовать их стоит на свой страх и риск, но, тем не менее, многие другие компании успешно работают на них в production без каких-либо серьезных проблем.
Для CentOS, RHEL и Fedora возможны следующие варианты:
* Пакеты из <https://repo.yandex.ru/clickhouse/rpm/stable/x86_64/> генерируются на основе официальных `deb` пакетов от Яндекса и содержат в точности тот же исполняемый файл.
* Пакеты из <https://github.com/Altinity/clickhouse-rpm-install> собираются независимой компанией Altinity и широко используются без каких-либо нареканий.
* Либо можно использовать Docker (см. ниже).

View File

@ -1,6 +1,8 @@
# Визуальные интерфейсы от сторонних разработчиков
## Tabix
## С открытым исходным кодом
### Tabix
Веб-интерфейс для ClickHouse в проекте [Tabix](https://github.com/tabixio/tabix).
@ -14,7 +16,7 @@
[Документация Tabix](https://tabix.io/doc/).
## HouseOps
### HouseOps
[HouseOps](https://github.com/HouseOps/HouseOps) — UI/IDE для OSX, Linux и Windows.
@ -39,7 +41,9 @@
- Управление кластером;
- Мониторинг реплицированных и Kafka таблиц.
## DBeaver
## Коммерческие
### DBeaver
[DBeaver](https://dbeaver.io/) - универсальный desktop клиент баз данных с поддержкой ClickHouse.
@ -49,4 +53,17 @@
- Просмотр таблиц;
- Автодополнение команд.
### DataGrip
[DataGrip](https://www.jetbrains.com/datagrip/) — это IDE для баз данных от JetBrains с выделенной поддержкой ClickHouse. Он также встроен в другие инструменты на основе IntelliJ: PyCharm, IntelliJ IDEA, GoLand, PhpStorm и другие.
Основные возможности:
- Очень быстрое дополнение кода.
- Подсветка синтаксиса для SQL диалекта ClickHouse.
- Поддержка функций, специфичных для ClickHouse, например вложенных столбцов, движков таблиц.
- Редактор данных.
- Рефакторинги.
- Поиск и навигация.
[Оригинальная статья](https://clickhouse.yandex/docs/ru/interfaces/third-party_gui/) <!--hide-->

View File

@ -3,12 +3,17 @@
!!! warning "Disclaimer"
Яндекс не поддерживает перечисленные ниже библиотеки и не проводит тщательного тестирования для проверки их качества.
- Реляционные системы управления базами данных
- [MySQL](https://www.mysql.com)
- [ProxySQL](https://github.com/sysown/proxysql/wiki/ClickHouse-Support)
- [PostgreSQL](https://www.postgresql.org)
- [infi.clickhouse_fdw](https://github.com/Infinidat/infi.clickhouse_fdw) (использует [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- Python
- [SQLAlchemy](https://www.sqlalchemy.org)
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse) (uses [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse) (использует [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- Java
- [Hadoop](http://hadoop.apache.org)
- [clickhouse-hdfs-loader](https://github.com/jaykelin/clickhouse-hdfs-loader) (uses [JDBC](../jdbc.md))
- [clickhouse-hdfs-loader](https://github.com/jaykelin/clickhouse-hdfs-loader) (использует [JDBC](../jdbc.md))
- Scala
- [Akka](https://akka.io)
- [clickhouse-scala-client](https://github.com/crobox/clickhouse-scala-client)

View File

@ -141,7 +141,7 @@ default_expression String - выражение для значения по ум
- marks_size (UInt64) - Размер файла с засечками.
- rows (UInt64) - Количество строк.
- bytes (UInt64) - Количество байт в сжатом виде.
- modification_time (DateTime) - Время модификации директории с куском. Обычно соответствует времени создания куска.|
- modification_time (DateTime) - Время модификации директории с куском. Обычно соответствует времени создания куска.
- remove_time (DateTime) - Время, когда кусок стал неактивным.
- refcount (UInt32) - Количество мест, в котором кусок используется. Значение больше 2 говорит о том, что кусок участвует в запросах или в слияниях.
- min_date (Date) - Минимальное значение ключа даты в куске.

View File

@ -14,7 +14,7 @@ CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE = MergeTree()
) ENGINE = SummingMergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]

View File

@ -714,6 +714,8 @@ WHERE и HAVING отличаются тем, что WHERE выполняется
`DISTINCT` работает с [NULL](syntax.md#null-literal) как если бы `NULL` был конкретным значением, причём `NULL=NULL`. Т.е. в результате `DISTINCT` разные комбинации с `NULL` встретятся только по одному разу.
`DISTINCT` работает с [NULL](syntax.md#null-literal) как если бы `NULL` был конкретным значением, причём `NULL=NULL`. Т.е. в результате `DISTINCT` разные комбинации с `NULL` встретятся только по одному разу.
### Секция LIMIT
LIMIT m позволяет выбрать из результата первые m строк.
@ -723,8 +725,6 @@ n и m должны быть неотрицательными целыми чи
При отсутствии секции ORDER BY, однозначно сортирующей результат, результат может быть произвольным и может являться недетерминированным.
`DISTINCT` работает с [NULL](syntax.md#null-literal) как если бы `NULL` был конкретным значением, причём `NULL=NULL`. Т.е. в результате `DISTINCT` разные комбинации с `NULL` встретятся только по одному разу.
### Секция UNION ALL
Произвольное количество запросов может быть объединено с помощью `UNION ALL`. Пример:

View File

@ -2,20 +2,46 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import markdown.inlinepatterns
import markdown.extensions
import markdown.util
class NofollowMixin(object):
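# Adds rel="external nofollow" to links that point outside https://clickhouse.yandex.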
def handleMatch(self, m):
try:
el = super(NofollowMixin, self).handleMatch(m)
except IndexError:
return
if el is not None:
href = el.get('href') or ''
if href.startswith('http') and not href.startswith('https://clickhouse.yandex'):
el.set('rel', 'external nofollow')
return el
class NofollowAutolinkPattern(NofollowMixin, markdown.inlinepatterns.AutolinkPattern):
pass
class NofollowLinkPattern(NofollowMixin, markdown.inlinepatterns.LinkPattern):
pass
class ClickHousePreprocessor(markdown.util.Processor):
def run(self, lines):
for line in lines:
if '<!--hide-->' not in line:
yield line
class ClickHouseMarkdown(markdown.extensions.Extension):
def extendMarkdown(self, md, md_globals):
md.preprocessors['clickhouse'] = ClickHousePreprocessor()
md.inlinePatterns['link'] = NofollowLinkPattern(markdown.inlinepatterns.LINK_RE, md)
md.inlinePatterns['autolink'] = NofollowAutolinkPattern(markdown.inlinepatterns.AUTOLINK_RE, md)
def makeExtension(**kwargs):
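return ClickHouseMarkdown(**kwargs)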

View File

@ -136,10 +136,10 @@
<article class="md-content__inner md-typeset">
{% block content %}
{% if config.extra.single_page %}
<a href="{{ config.repo_url }}tree/master/docs" title="{{ lang.t('edit.link.title') }}" class="md-icon md-content__icon">&#xE3C9;</a>
<a href="{{ config.repo_url }}tree/master/docs" title="{{ lang.t('edit.link.title') }}" class="md-icon md-content__icon" rel="external nofollow">&#xE3C9;</a>
{% else %}
{% if page.edit_url %}
<a href="{{ page.edit_url }}" title="{{ lang.t('edit.link.title') }}" class="md-icon md-content__icon">&#xE3C9;</a>
<a href="{{ page.edit_url }}" title="{{ lang.t('edit.link.title') }}" class="md-icon md-content__icon" rel="external nofollow">&#xE3C9;</a>
{% endif %}
{% endif %}
{% if not "\x3ch1" in page.content %}
@ -155,7 +155,7 @@
<h2 id="__source">{{ lang.t("meta.source") }}</h2>
{% set path = page.meta.path | default([""]) %}
{% set file = page.meta.source %}
<a href="{{ [config.repo_url, path, file] | join('/') }}" title="{{ file }}" class="md-source-file">
<a href="{{ [config.repo_url, path, file] | join('/') }}" title="{{ file }}" class="md-source-file" rel="external nofollow">
{{ file }}
</a>
{% endif %}

View File

@ -40,7 +40,7 @@
</label>
{% endif %}
<a href="{{ base_url }}/{{ nav_item.url }}" title="{{ nav_item.title }}" class="md-nav__link md-nav__link--active">
{{ nav_item.title }}
<strong>{{ nav_item.title }}</strong>
</a>
{% if toc_ | first is defined %}
{% include "partials/toc.html" %}

View File

@ -1 +0,0 @@
../../en/development/build.md

View File

@ -0,0 +1,99 @@
# 如何构建 ClickHouse 发布包
## 安装 Git 和 Pbuilder
```bash
sudo apt-get update
sudo apt-get install git pbuilder debhelper lsb-release fakeroot sudo debian-archive-keyring debian-keyring
```
## 拉取 ClickHouse 源码
```bash
git clone --recursive --branch stable https://github.com/yandex/ClickHouse.git
cd ClickHouse
```
## 运行发布脚本
```bash
./release
```
# 如何在开发过程中编译 ClickHouse
以下教程是在 Ubuntu Linux 中进行编译的示例。
通过适当的更改,它应该可以适用于任何其他的 Linux 发行版。
仅支持具有 SSE 4.2的 x86_64。 对 AArch64 的支持是实验性的。
测试是否支持 SSE 4.2,执行:
```bash
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
```
## 安装 Git 和 CMake
```bash
sudo apt-get install git cmake ninja-build
```
或者在早期版本的系统中用 cmake3 替代 cmake。
## 安装 GCC 7
有几种方法可以做到这一点。
### 安装 PPA 包
```bash
sudo apt-get install software-properties-common
sudo apt-add-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-7 g++-7
```
### 源码安装 gcc
请查看 [ci/build-gcc-from-sources.sh](https://github.com/yandex/ClickHouse/blob/master/ci/build-gcc-from-sources.sh)
## 使用 GCC 7 来编译
```bash
export CC=gcc-7
export CXX=g++-7
```
## 安装所需的工具依赖库
```bash
sudo apt-get install libicu-dev libreadline-dev
```
## 拉取 ClickHouse 源码
```bash
git clone --recursive git@github.com:yandex/ClickHouse.git
# or: git clone --recursive https://github.com/yandex/ClickHouse.git
cd ClickHouse
```
如果需要最新的稳定版本,请切换到 `stable` 分支。
## 编译 ClickHouse
```bash
mkdir build
cd build
cmake ..
ninja
cd ..
```
若要创建一个可执行文件,执行 `ninja clickhouse`。
这个命令会生成 `dbms/programs/clickhouse` 可执行文件,您可以使用 `client` 或 `server` 参数运行它。
[来源文章](https://clickhouse.yandex/docs/en/development/build/) <!--hide-->

View File

@ -1 +0,0 @@
../../en/development/build_osx.md

View File

@ -0,0 +1,83 @@
# 在 Mac OS X 中编译 ClickHouse
ClickHouse 支持在 Mac OS X 10.12 版本中编译。若您在用更早的操作系统版本,可以尝试在指令中使用 `Gentoo Prefix` 和 `clang sl`。
## 安装 Homebrew
```bash
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```
## 安装编译器,工具库
```bash
brew install cmake ninja gcc icu4c mariadb-connector-c openssl libtool gettext readline
```
## 拉取 ClickHouse 源码
```bash
git clone --recursive git@github.com:yandex/ClickHouse.git
# or: git clone --recursive https://github.com/yandex/ClickHouse.git
cd ClickHouse
```
如果需要最新的稳定版本,请切换到 `stable` 分支。
## 编译 ClickHouse
```bash
mkdir build
cd build
cmake .. -DCMAKE_CXX_COMPILER=`which g++-8` -DCMAKE_C_COMPILER=`which gcc-8`
ninja
cd ..
```
## 注意事项
若你想运行 clickhouse-server请先确保增加系统的最大文件数配置。
!!! 注意
可能需要用 sudo
为此,请创建以下文件:
/Library/LaunchDaemons/limit.maxfiles.plist:
``` xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>limit.maxfiles</string>
<key>ProgramArguments</key>
<array>
<string>launchctl</string>
<string>limit</string>
<string>maxfiles</string>
<string>524288</string>
<string>524288</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>ServiceIPC</key>
<false/>
</dict>
</plist>
```
执行以下命令:
``` bash
$ sudo chown root:wheel /Library/LaunchDaemons/limit.maxfiles.plist
```
然后重启。
可以通过 `ulimit -n` 命令来检查是否生效。
[来源文章](https://clickhouse.yandex/docs/en/development/build_osx/) <!--hide-->

View File

@ -1 +0,0 @@
../../en/development/index.md

View File

@ -0,0 +1,4 @@
# ClickHouse 开发
[来源文章](https://clickhouse.yandex/docs/en/development/) <!--hide-->

View File

@ -1 +0,0 @@
../../en/development/style.md

View File

@ -0,0 +1,838 @@
# 如何编写 C++ 代码
## 一般建议
**1.** 以下是建议,而不是要求。
**2.** 如果你在修改代码,遵守已有的风格是有意义的。
**3.** 代码的风格需保持一致。一致的风格有利于阅读代码,并且方便检索代码。
**4.** 许多规则没有逻辑原因; 它们是由既定的做法决定的。
## 格式化
**1.** 大多数格式化可以用 `clang-format` 自动完成。
**2.** 缩进是4个空格。 配置开发环境,使得 TAB 代表添加四个空格。
**3.** 左右花括号需在单独的行。
```cpp
inline void readBoolText(bool & x, ReadBuffer & buf)
{
char tmp = '0';
readChar(tmp, buf);
x = tmp != '0';
}
```
**4.** 若整个方法体仅是一条语句,则可以放到同一行上。在花括号前后放置空格(行尾的空格除外)。
```cpp
inline size_t mask() const { return buf_size() - 1; }
inline size_t place(HashValue x) const { return x & mask(); }
```
**5.** 对于函数,不要在括号周围放置空格。
```cpp
void reinsert(const Value & x)
```
```cpp
memcpy(&buf[place_value], &x, sizeof(x));
```
**6.** 在 `if`、`for`、`while` 和其他表达式中,在开括号前面插入一个空格(与函数声明相反)。
```cpp
for (size_t i = 0; i < rows; i += storage.index_granularity)
```
**7.** 在二元运算符(`+`、`-`、`*`、`/`、`%` 等)和三元运算符 `?:` 周围添加空格。
```cpp
UInt16 year = (s[0] - '0') * 1000 + (s[1] - '0') * 100 + (s[2] - '0') * 10 + (s[3] - '0');
UInt8 month = (s[5] - '0') * 10 + (s[6] - '0');
UInt8 day = (s[8] - '0') * 10 + (s[9] - '0');
```
**8.** 若有换行,新行应该以运算符开头,并且增加对应的缩进。
```cpp
if (elapsed_ns)
message << " ("
<< rows_read_on_server * 1000000000 / elapsed_ns << " rows/s., "
<< bytes_read_on_server * 1000.0 / elapsed_ns << " MB/s.) ";
```
**9.** 如果需要,可以在一行内使用空格来对齐。
```cpp
dst.ClickLogID = click.LogID;
dst.ClickEventID = click.EventID;
dst.ClickGoodEvent = click.GoodEvent;
```
**10.** 不要在 `.` 和 `->` 周围加入空格
如有必要,运算符可以包裹到下一行。 在这种情况下,它前面的偏移量增加。
**11.** 不要使用空格来分开一元运算符 (`--`, `++`, `*`, `&`, ...) 和参数。
**12.** 在逗号后面加一个空格,而不是在之前。同样的规则也适合 `for` 循环中的分号。
**13.** 不要用空格分开 `[]` 运算符。
**14.** 在 `template <...>` 表达式中,在 `template``<` 中加入一个空格,在 `<` 后面或在 `>` 前面都不要有空格。
```cpp
template <typename TKey, typename TValue>
struct AggregatedStatElement
{}
```
**15.** 在类和结构体中,`public`、`private` 以及 `protected` 与 `class/struct` 保持同级,无需缩进,其他代码须缩进。
```cpp
template <typename T>
class MultiVersion
{
public:
/// Version of object for usage. shared_ptr manage lifetime of version.
using Version = std::shared_ptr<const T>;
...
}
```
**16.** 如果对整个文件使用相同的 `namespace`,并且没有其他重要的东西,则 `namespace` 中不需要偏移量。
**17.** 在 `if`, `for`, `while` 中包裹的代码块中,若代码是一个单行的 `statement`,那么大括号是可选的。 可以将 `statement` 放到一行中。这个规则同样适用于嵌套的 `if` `for` `while` ...
但是如果内部 `statement` 包含大括号或 `else`,则外部块应该用大括号括起来。
```cpp
/// Finish write.
for (auto & stream : streams)
stream.second->finalize();
```
**18.** 行的末尾不应该包含空格。
**19.** 源文件应该用 UTF-8 编码。
**20.** 非ASCII字符可用于字符串文字。
```cpp
<< ", " << (timer.elapsed() / chunks_stats.hits) << " μsec/hit.";
```
**21.** 不要在一行中写入多个表达式。
**22.** 将函数内部的代码段分组,并用不超过一个空行将它们分开。
**23.** 将函数、类用一个或两个空行分开。
**24.** `const` 必须写在类型名称之前。
```cpp
//correct
const char * pos
const std::string & s
//incorrect
char const * pos
```
**25.** 声明指针或引用时,`*` 和 `&` 符号两边应该都用空格分隔。
```cpp
//correct
const char * pos
//incorrect
const char* pos
const char *pos
```
**26.** 使用模板类型时,使用 `using` 关键字对它们进行别名(最简单的情况除外)。
换句话说,模板参数仅在 `using` 中指定,并且不在代码中重复。
`using`可以在本地声明,例如在函数内部。
```cpp
//correct
using FileStreams = std::map<std::string, std::shared_ptr<Stream>>;
FileStreams streams;
//incorrect
std::map<std::string, std::shared_ptr<Stream>> streams;
```
**27.** 不要在一个语句中声明不同类型的多个变量。
```cpp
//incorrect
int x, *y;
```
**28.** 不要使用C风格的类型转换。
```cpp
//incorrect
std::cerr << (int)c << std::endl;
//correct
std::cerr << static_cast<int>(c) << std::endl;
```
**29.** 在类和结构体中,在每个可见性范围内分别对成员和函数进行分组。
**30.** 对于小类和结构,没有必要将方法声明与实现分开。
对于任何类或结构中的小方法也是如此。
对于模板化类和结构,不要将方法声明与实现分开(因为否则它们必须在同一个转换单元中定义)
**31.** 您可以将换行规则定在140个字符而不是80个字符。
**32.** 如果不需要 postfix请始终使用前缀增量/减量运算符。
```cpp
for (Names::const_iterator it = column_names.begin(); it != column_names.end(); ++it)
```
## Comments
**1.** 请务必为所有非常重要的代码部分添加注释。
这是非常重要的。 编写注释可能会帮助您意识到代码不是必需的,或者设计错误。
```cpp
/** Part of piece of memory, that can be used.
* For example, if internal_buffer is 1MB, and there was only 10 bytes loaded to buffer from file for reading,
* then working_buffer will have size of only 10 bytes
* (working_buffer.end() will point to position right after those 10 bytes available for read).
*/
```
**2.** 注释可以尽可能详细。
**3.** 在他们描述的代码之前放置注释。 在极少数情况下,注释可以在代码之后,在同一行上。
```cpp
/** Parses and executes the query.
*/
void executeQuery(
ReadBuffer & istr, /// Where to read the query from (and data for INSERT, if applicable)
WriteBuffer & ostr, /// Where to write the result
Context & context, /// DB, tables, data types, engines, functions, aggregate functions...
BlockInputStreamPtr & query_plan, /// Here could be written the description on how query was executed
QueryProcessingStage::Enum stage = QueryProcessingStage::Complete /// Up to which stage process the SELECT query
)
```
**4.** 注释应该只用英文撰写。
**5.** 如果您正在编写库,请在主头文件中包含解释它的详细注释。
**6.** 请勿添加无效的注释。 特别是,不要留下像这样的空注释:
```cpp
/*
* Procedure Name:
* Original procedure name:
* Author:
* Date of creation:
* Dates of modification:
* Modification authors:
* Original file name:
* Purpose:
* Intent:
* Designation:
* Classes used:
* Constants:
* Local variables:
* Parameters:
* Date of creation:
* Purpose:
*/
```
这个示例来源于 [http://home.tamk.fi/~jaalto/course/coding-style/doc/unmaintainable-code/](http://home.tamk.fi/~jaalto/course/coding-style/doc/unmaintainable-code/)。
**7.** 不要在每个文件的开头写入垃圾注释(作者,创建日期...)。
**8.** 单行注释用三个斜杆: `///` ,多行注释以 `/**`开始。 这些注释会当做文档。
注意:您可以使用 Doxygen 从这些注释中生成文档。 但是通常不使用 Doxygen因为在 IDE 中导航代码更方便。
**9.** 多行注释的开头和结尾不得有空行(关闭多行注释的行除外)。
**10.** 要注释掉代码,请使用基本注释,而不是“记录”注释。
**11.** 在提交之前删除代码的无效注释部分。
**12.** 不要在注释或代码中使用亵渎语言。
**13.** 不要使用大写字母。 不要使用过多的标点符号。
```cpp
/// WHAT THE FAIL???
```
**14.** 不要使用注释来制作分隔符。
```cpp
///******************************************************
```
**15.** 不要在注释中开始讨论。
```cpp
/// Why did you do this stuff?
```
**16.** 没有必要在块的末尾写一条注释来描述它的含义。
```cpp
/// for
```
## Names
**1.** 在变量和类成员的名称中使用带下划线的小写字母。
```cpp
size_t max_block_size;
```
**2.** 对于函数(方法)的名称,请使用以小写字母开头的驼峰标识。
```cpp
std::string getName() const override { return "Memory"; }
```
**3.** 对于类结构的名称使用以大写字母开头的驼峰标识。接口名称用I前缀。
```cpp
class StorageMemory : public IStorage
```
**4.** `using` 的命名方式与类相同,或者以 `_t` 结尾。
**5.** 模板类型参数的名称:在简单的情况下,使用`T`; `T``U`; `T1``T2`。
对于更复杂的情况,要么遵循类名规则,要么添加前缀`T`。
```cpp
template <typename TKey, typename TValue>
struct AggregatedStatElement
```
**6.** 模板常量参数的名称:遵循变量名称的规则,或者在简单的情况下使用 `N`
```cpp
template <bool without_www>
struct ExtractDomain
```
**7.** 对于抽象类型(接口),用 `I` 前缀。
```cpp
class IBlockInputStream
```
**8.** 如果在本地使用变量,则可以使用短名称。
在所有其他情况下,请使用能描述含义的名称。
```cpp
bool info_successfully_loaded = false;
```
**9.** `define` 和全局常量的名称使用带下划线的 `ALL_CAPS`
```cpp
#define MAX_SRC_TABLE_NAMES_TO_STORE 1000
```
**10.** 文件名应使用与其内容相同的样式。
如果文件包含单个类,则以与该类名称相同的方式命名该文件。
如果文件包含单个函数,则以与函数名称相同的方式命名文件。
**11.** 如果名称包含缩写,则:
- 对于变量名,缩写应使用小写字母 `mysql_connection`(不是 `mySQL_connection` )。
- 对于类和函数的名称,请将缩写保持为大写字母,如 `MySQLConnection`(而不是 `MySqlConnection`)。
**12.** 仅用于初始化类成员的构造方法参数的命名方式应与类成员相同,但最后使用下划线。
```cpp
FileQueueProcessor(
const std::string & path_,
const std::string & prefix_,
std::shared_ptr<FileHandler> handler_)
: path(path_),
prefix(prefix_),
handler(handler_),
log(&Logger::get("FileQueueProcessor"))
{
}
```
如果构造函数体中未使用该参数,则可以省略下划线后缀。
**13.** 局部变量和类成员的名称没有区别(不需要前缀)。
```cpp
timer (not m_timer)
```
**14.** 对于 `enum` 中的常量请使用带大写字母的驼峰标识。ALL_CAPS 也可以接受。如果 `enum` 是非本地的,请使用 `enum class`
```cpp
enum class CompressionMethod
{
QuickLZ = 0,
LZ4 = 1,
};
```
**15.** 所有名字必须是英文。不允许音译俄语单词。
```
not Stroka
```
**16.** 缩写须是众所周知的(当您可以在维基百科或搜索引擎中轻松找到缩写的含义时)。
```
`AST`, `SQL`.
Not `NVDH` (some random letters)
```
如果缩短版本是常用的,则可以接受不完整的单词。
如果注释中旁边包含全名,您也可以使用缩写。
**17.** C++ 源码文件名称必须为 `.cpp` 拓展名。 头文件必须为 `.h` 拓展名。
## 如何编写代码
**1.** 内存管理。
手动内存释放 (`delete`) 只能在库代码中使用。
在库代码中, `delete` 运算符只能在析构函数中使用。
在应用程序代码中,内存必须由拥有它的对象释放。
示例:
- 最简单的方法是将对象放在堆栈上,或使其成为另一个类的成员。
- 对于大量小对象,请使用容器。
- 对于自动释放少量在堆中的对象,可以用 `shared_ptr/unique_ptr`
**2.** 资源管理。
使用 `RAII` 以及查看以上说明。
**3.** 错误处理。
在大多数情况下,您只需要抛出一个异常,而不需要捕获它(因为`RAII`)。
在离线数据处理应用程序中,通常可以接受不捕获异常。
在处理用户请求的服务器中,通常足以捕获连接处理程序顶层的异常。
在线程函数中,你应该在 `join` 之后捕获并保留所有异常以在主线程中重新抛出它们。
```cpp
/// If there weren't any calculations yet, calculate the first block synchronously
if (!started)
{
calculate();
started = true;
}
else /// If calculations are already in progress, wait for the result
pool.wait();
if (exception)
exception->rethrow();
```
不处理就不要隐藏异常。 永远不要盲目地把所有异常都记录到日志中。
```cpp
//Not correct
catch (...) {}
```
如果您需要忽略某些异常,请仅针对特定异常执行此操作并重新抛出其余异常。
```cpp
catch (const DB::Exception & e)
{
if (e.code() == ErrorCodes::UNKNOWN_AGGREGATE_FUNCTION)
return nullptr;
else
throw;
}
```
当使用具有返回码或 `errno` 的函数时,请始终检查结果并在出现错误时抛出异常。
```cpp
if (0 != close(fd))
throwFromErrno("Cannot close file " + file_name, ErrorCodes::CANNOT_CLOSE_FILE);
```
不要使用断言。
**4.** 异常类型。
不需要在应用程序代码中使用复杂的异常层次结构。 系统管理员应该可以理解异常文本。
**5.** 从析构函数中抛出异常。
不建议这样做,但允许这样做。
按照以下选项:
- 创建一个函数(`done` 或 `finalize`),它将提前完成所有可能导致异常的工作。 如果调用了该函数,则稍后在析构函数中应该没有异常。
- 过于复杂的任务(例如通过网络发送消息)可以放在单独的方法中,类用户必须在销毁之前调用它们。
- 如果析构函数中存在异常,则最好记录它而不是隐藏它(如果 logger 可用)。
- 在简单的应用程序中,依赖于`std::terminate`对于C++ 11中默认情况下为 `noexcept` 的情况)来处理异常是可以接受的。
**6.** 匿名代码块。
您可以在单个函数内创建单独的代码块,以使某些变量成为局部变量,以便在退出块时调用析构函数。
```cpp
Block block = data.in->read();
{
std::lock_guard<std::mutex> lock(mutex);
data.ready = true;
data.block = block;
}
ready_any.set();
```
**7.** 多线程。
在离线数据处理程序中:
- 尝试在单个CPU核心上获得最佳性能。 然后,您可以根据需要并行化代码。
在服务端应用中:
- 使用线程池来处理请求。 此时,我们还没有任何需要用户空间上下文切换的任务。
Fork不用于并行化。
**8.** 同步线程。
通常可以使不同的线程使用不同的存储单元(甚至更好:不同的缓存线),并且不使用任何线程同步(除了`joinAll`)。
如果需要同步,在大多数情况下,在 `lock_guard` 下使用互斥量就足够了。
在其他情况下,使用系统同步原语。不要使用忙等待。
仅在最简单的情况下才应使用原子操作。
除非是您的主要专业领域,否则不要尝试实施无锁数据结构。
**9.** 指针和引用。
大部分情况下,请用引用。
**10.** 常量。
使用 const 引用,指向常量的指针,`const_iterator`和 const 指针。
`const` 视为默认值,仅在必要时使用非 `const`
当按值传递变量时,使用 `const` 通常没有意义。
**11.** 无符号。
必要时使用`unsigned`。
**12.** 数值类型。
使用 `UInt8`、`UInt16`、`UInt32`、`UInt64`、`Int8`、`Int16`、`Int32` 以及 `Int64`,还有 `size_t`、`ssize_t` 和 `ptrdiff_t`。
不要使用这些类型:`signed/unsigned long`、`long long`、`short`、`signed/unsigned char`、`char`。
**13.** 参数传递。
通过引用传递复杂类型 (包括 `std::string`)。
如果函数中传递堆中创建的对象,则使参数类型为 `shared_ptr` 或者 `unique_ptr`.
**14.** 返回值
大部分情况下使用 `return`。不要写 `return std::move(res)`。
如果函数在堆上分配对象并返回它,请使用 `shared_ptr``unique_ptr`
在极少数情况下,您可能需要通过参数返回值。 在这种情况下,参数应该是引用传递的。
```cpp
using AggregateFunctionPtr = std::shared_ptr<IAggregateFunction>;
/** Allows creating an aggregate function by its name.
*/
class AggregateFunctionFactory
{
public:
AggregateFunctionFactory();
AggregateFunctionPtr get(const String & name, const DataTypes & argument_types) const;
```
**15.** 命名空间。
没有必要为应用程序代码使用单独的 `namespace`
小型库也不需要这个。
对于中大型库,须将所有代码放在 `namespace` 中。
在库的 `.h` 文件中,您可以使用 `namespace detail` 来隐藏应用程序代码不需要的实现细节。
`.cpp` 文件中,您可以使用 `static` 或匿名命名空间来隐藏符号。
同样 `namespace` 可用于 `enum` 以防止相应的名称落入外部 `namespace`(但最好使用`enum class`)。
**16.** 延迟初始化。
如果初始化需要参数,那么通常不应该编写默认构造函数。
如果稍后您需要延迟初始化,则可以添加将创建无效对象的默认构造函数。 或者,对于少量对象,您可以使用 `shared_ptr / unique_ptr`
```cpp
Loader(DB::Connection * connection_, const std::string & query, size_t max_block_size_);
/// For deferred initialization
Loader() {}
```
**17.** 虚函数。
如果该类不是用于多态使用,则不需要将函数设置为虚拟。这也适用于析构函数。
**18.** 编码。
在所有情况下使用 UTF-8 编码。使用 `std::string` 和 `char *`。不要使用 `std::wstring` 和 `wchar_t`。
**19.** 日志。
请参阅代码中的示例。
在提交之前,删除所有无意义和调试日志记录,以及任何其他类型的调试输出。
应该避免循环记录日志,即使在 Trace 级别也是如此。
日志必须在任何日志记录级别都可读。
在大多数情况下,只应在应用程序代码中使用日志记录。
日志消息必须用英文写成。
对于系统管理员来说,日志最好是可以理解的。
不要在日志中使用亵渎语言。
在日志中使用UTF-8编码。 在极少数情况下您可以在日志中使用非ASCII字符。
**20.** 输入-输出。
在对应用程序性能至关重要的内部循环中,不要使用 `iostreams`(并且永远不要使用 `stringstream`)。
使用 `DB/IO` 库替代。
**21.** 日期和时间。
参考 `DateLUT` 库。
**22.** 引入头文件。
一直用 `#pragma once` 而不是其他宏。
**23.** using 语法
`using namespace` 不会被使用。 您可以使用特定的 `using`。 但是在类或函数中使它成为局部的。
**24.** 除非必要,否则不要为函数使用 `trailing return type`。
```cpp
auto f() -> void;
```
**25.** 声明和初始化变量。
```cpp
//right way
std::string s = "Hello";
std::string s{"Hello"};
//wrong way
auto s = std::string{"Hello"};
```
**26.** 对于虚函数,在基类中编写 `virtual`,但在后代类中写 `override` 而不是`virtual`。
## 没有用到的 C++ 特性。
**1.** 不使用虚拟继承。
**2.** 不使用 C++03 中的异常标准。
## 平台
**1.** 我们为特定平台编写代码。
但在其他条件相同的情况下,首选跨平台或可移植代码。
**2.** 语言: C++17.
**3.** 编译器: `gcc`。 此时2017年12月代码使用7.2版编译。(它也可以使用`clang 4` 编译)
使用标准库 (`libstdc++` 或 `libc++`)。
**4.** 操作系统Linux Ubuntu不比 Precise 早。
**5.** 代码是为x86_64 CPU架构编写的。
CPU指令集是我们服务器中支持的最小集合。 目前它是SSE 4.2。
**6.** 使用 `-Wall -Wextra -Werror` 编译参数。
**7.** 对所有库使用静态链接,除了那些难以静态连接的库(参见 `ldd` 命令的输出).
**8.** 使用发布的设置来开发和调试代码。
## 工具
**1.** KDevelop 是一个好的 IDE.
**2.** 调试可以使用 `gdb` `valgrind` (`memcheck`) `strace` `-fsanitize=...``tcmalloc_minimal_debug`.
**3.** 对于性能分析,使用 `Linux Perf` `valgrind` (`callgrind`),或者 `strace -cf`
**4.** 源代码用 Git 作版本控制。
**5.** 使用 `CMake` 构建。
**6.** 程序的发布使用 `deb` 安装包。
**7.** 提交到 master 分支的代码不能破坏编译。
虽然只有选定的修订被认为是可行的。
**8.** 尽可能经常地进行提交,即使代码只是部分准备好了。
目的明确的功能,使用分支。
如果 `master` 分支中的代码尚不可构建,请在 `push` 之前将其从构建中排除。您需要在几天内完成或删除它。
**9.** 对于不重要的更改,请使用分支并在服务器上发布它们。
**10.** 未使用的代码将从 repo 中删除。
## 库
**1.** 使用 C++14 标准库(允许实验性功能),以及 `boost` 和 `Poco` 框架。
**2.** 如有必要,您可以使用 OS 包中提供的任何已知库。
如果有一个好的解决方案已经可用,那就使用它,即使这意味着你必须安装另一个库。
(但要准备从代码中删除不好的库)
**3.** 如果软件包没有您需要的软件包或者有过时的版本或错误的编译类型,则可以安装不在软件包中的库。
**4.** 如果库很小并且没有自己的复杂构建系统,请将源文件放在 `contrib` 文件夹中。
**5.** 始终优先考虑已经使用的库。
## 一般建议
**1.** 尽可能精简代码。
**2.** 尝试用最简单的方式实现。
**3.** 在你知道代码是如何工作以及内部循环如何运作之前,不要编写代码。
**4.** 在最简单的情况下,使用 `using` 而不是类或结构。
**5.** 如果可能,不要编写复制构造函数,赋值运算符,析构函数(虚拟函数除外,如果类包含至少一个虚函数),移动构造函数或移动赋值运算符。 换句话说,编译器生成的函数必须正常工作。 您可以使用 `default`
**6.** 鼓励简化代码。 尽可能减小代码的大小。
## 其他建议
**1.** 为 `stddef.h` 中的类型明确指定 `std::`。
不推荐。 换句话说,我们建议写 `size_t` 而不是 `std::size_t`,因为它更短。
也接受添加 `std::`
**2.** 为标准C库中的函数明确指定 `std::`
不推荐。换句话说,写 `memcpy` 而不是`std::memcpy`。
原因是有类似的非标准功能,例如 `memmem`。我们偶尔会使用这些功能。`namespace std`中不存在这些函数。
如果你到处都写 `std::memcpy` 而不是 `memcpy`,那么没有 `std::` 的 `memmem` 会显得很奇怪。
不过,如果您愿意,仍然可以使用 `std::`
**3.** 当标准C++库中提供相同的函数时使用C中的函数。
如果它更高效,这是可以接受的。
例如,使用`memcpy`而不是`std::copy`来复制大块内存。
**4.** 函数的多行参数。
允许以下任何包装样式:
```cpp
function(
T1 x1,
T2 x2)
```
```cpp
function(
size_t left, size_t right,
const & RangesInDataParts ranges,
size_t limit)
```
```cpp
function(size_t left, size_t right,
const & RangesInDataParts ranges,
size_t limit)
```
```cpp
function(size_t left, size_t right,
const & RangesInDataParts ranges,
size_t limit)
```
```cpp
function(
size_t left,
size_t right,
const & RangesInDataParts ranges,
size_t limit)
```
[来源文章](https://clickhouse.yandex/docs/en/development/style/) <!--hide-->

View File

@ -1,6 +1,8 @@
# 第三方开发的可视化界面
## Tabix
## 开源
### Tabix
ClickHouse Web 界面 [Tabix](https://github.com/tabixio/tabix).
@ -15,7 +17,7 @@ ClickHouse Web 界面 [Tabix](https://github.com/tabixio/tabix).
[Tabix 文档](https://tabix.io/doc/).
## HouseOps
### HouseOps
[HouseOps](https://github.com/HouseOps/HouseOps) 是一个交互式 UI/IDE 工具,可以运行在 OSX, Linux and Windows 平台中。
@ -36,5 +38,29 @@ ClickHouse Web 界面 [Tabix](https://github.com/tabixio/tabix).
- 集群管理
- 监控副本情况以及 Kafka 引擎表
## 商业
### DBeaver
[DBeaver](https://dbeaver.io/) - 具有ClickHouse支持的通用桌面数据库客户端。
特征:
- 使用语法高亮显示查询开发。
- 表格预览。
- 自动完成。
### DataGrip
[DataGrip](https://www.jetbrains.com/datagrip/) 是 JetBrains 推出的数据库 IDE对 ClickHouse 提供专门支持。它还嵌入到其他基于 IntelliJ 的工具中PyCharm、IntelliJ IDEA、GoLand、PhpStorm 等。
特征:
- 非常快速的代码完成。
- ClickHouse语法高亮显示。
- 支持ClickHouse特有的功能例如嵌套列表引擎。
- 数据编辑器。
- 重构。
- 搜索和导航。
[来源文章](https://clickhouse.yandex/docs/zh/interfaces/third-party_gui/) <!--hide-->

View File

@ -1,14 +1,19 @@
# 第三方集成库
!!! warning "放弃"
!!! warning "声明"
Yandex不维护下面列出的库也没有进行任何广泛的测试以确保其质量。
- 关系数据库管理系统
- [MySQL](https://www.mysql.com)
- [ProxySQL](https://github.com/sysown/proxysql/wiki/ClickHouse-Support)
- [PostgreSQL](https://www.postgresql.org)
- [infi.clickhouse_fdw](https://github.com/Infinidat/infi.clickhouse_fdw) (使用 [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- Python
- [SQLAlchemy](https://www.sqlalchemy.org)
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse) (uses [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- [sqlalchemy-clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse) (使用 [infi.clickhouse_orm](https://github.com/Infinidat/infi.clickhouse_orm))
- Java
- [Hadoop](http://hadoop.apache.org)
- [clickhouse-hdfs-loader](https://github.com/jaykelin/clickhouse-hdfs-loader) (uses [JDBC](../jdbc.md))
- [clickhouse-hdfs-loader](https://github.com/jaykelin/clickhouse-hdfs-loader) (使用 [JDBC](../jdbc.md))
- Scala
- [Akka](https://akka.io)
- [clickhouse-scala-client](https://github.com/crobox/clickhouse-scala-client)

View File

@ -1,155 +1,164 @@
## 创建数据库
## CREATE DATABASE
创建 `db_name` 数据库。
该查询用于根据指定名称创建数据库。
```sql
``` sql
CREATE DATABASE [IF NOT EXISTS] db_name
```
数据库是一个包含多个表的目录如果在CREATE DATABASE语句中包含`IF NOT EXISTS`,则在数据库已经存在的情况下查询也不会返回错误。
数据库其实只是用于存放表的一个目录。
如果查询中存在`IF NOT EXISTS`,则当数据库已经存在时,该查询不会返回任何错误。
<a name="query_language-queries-create_table"></a>
## 创建表
## CREATE TABLE
`CREATE TABLE` 语句有几种形式.
对于`CREATE TABLE`,存在以下几种方式。
```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1]
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2]
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE = engine
```
如果`db`没有设置, 在数据库`db`中或者当前数据库中, 创建一个表名为`name`的表, 在括号和`engine` 引擎中指定结构。 表的结构是一个列描述的列表。 如果引擎支持索引, 则他们将是表引擎的参数。
在指定的db数据库中创建一个名为name的表如果查询中没有包含db则默认使用当前选择的数据库作为db。后面的是包含在括号中的表结构以及表引擎的声明。
其中表结构声明是一个包含一组列描述声明的组合。如果表引擎是支持索引的,那么可以在表引擎的参数中对其进行说明。
表结构是一个列描述的列表。 如果引擎支持索引, 他们以表引擎的参数表示。
在最简单的情况下,列描述是指`名称 类型`这样的子句。例如: `RegionID UInt32`
但是也可以为列另外定义默认值表达式(见后文)。
在最简单的情况, 一个列描述是'命名类型'。 例如: RegionID UInt32。 对于默认值, 表达式也能够被定义。
```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name AS [db2.]name2 [ENGINE = engine]
``` sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name AS [db2.]name2 [ENGINE = engine]
```
创建一个表, 其结构与另一个表相同。 你能够为此表指定一个不同的引擎。 如果引擎没有被指定, 相同的引擎将被用于`db2。name2`表上
创建一个与`db2.name2`具有相同结构的表,同时你可以对其指定不同的表引擎声明。如果没有表引擎声明,则创建的表将与`db2.name2`使用相同的表引擎
```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name ENGINE = engine AS SELECT ...
``` sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name ENGINE = engine AS SELECT ...
```
创建一个表,其结构类似于 SELECT 查询后的结果, 带有`engine` 引擎, 从 SELECT查询数据填充它。
使用指定的引擎创建一个与`SELECT`子句的结果具有相同结构的表,并使用`SELECT`子句的结果填充它。
在所有情况下,如果`IF NOT EXISTS`被指定, 如果表已经存在, 查询并不返回一个错误。 在这种情况下, 查询并不做任何事情。
以上所有情况,如果指定了`IF NOT EXISTS`,那么在该表已经存在的情况下,查询不会返回任何错误。在这种情况下,查询几乎不会做任何事情。
在`ENGINE`子句后还可能存在一些其他的子句,更详细的信息可以参考[表引擎](../operations/table_engines/index.md#table_engines)中关于建表的描述。
### 默认值
列描述能够为默认值指定一个表达式, 其中一个方法是:DEFAULT expr MATERIALIZED expr ALIAS expr。
例如: URLDomain String DEFAULT domain(URL)
在列描述中你可以通过以下方式之一为列指定默认表达式:`DEFAULT expr``MATERIALIZED expr``ALIAS expr`。
示例:`URLDomain String DEFAULT domain(URL)`
如果默认值的一个表达式没有定义, 如果字段是数字类型, 默认值是将设置为0 如果是字符类型, 则设置为空字符串, 日期类型则设置为 0000-00-00 或者 0000-00-00 00:00:00(时间戳)。 NULLs 则不支持
如果在列描述中未定义任何默认表达式那么系统将会根据类型设置对应的默认值数值类型为零、字符串类型为空字符串、数组类型为空数组、日期类型为0000-00-00以及时间类型为0000-00-00 00:00:00。不支持使用NULL作为普通类型的默认值
如果默认表达式被定义, 字段类型是可选的。 如果没有明确的定义类型, 则将使用默认表达式。 例如: EventDate DEFAULT toDate(EventTime) `Date` 类型将用于 `EventDate` 字段
如果定义了默认表达式,则可以不定义列的类型。如果没有明确的定义列的类型,则使用默认表达式的类型。例如:`EventDate DEFAULT toDate(EventTime)` - 最终EventDate将使用Date作为类型。
如果数据类型和默认表达式被明确定义, 此表达式将使用函数被转换为特定的类型。 例如: Hits UInt32 DEFAULT 0 与 Hits UInt32 DEFAULT toUInt32(0)是等价的
如果同时指定了默认表达式与列的类型,则将使用类型转换函数将默认表达式转换为指定的类型。例如:`Hits UInt32 DEFAULT 0`与`Hits UInt32 DEFAULT toUInt32(0)`意思相同
默认表达是可能被定义为一个任意的表达式,如表的常量和字段。 当创建和更改表结构时, 它将检查表达式是否包含循环。 对于 INSERT操作来说 它将检查表达式是否可解析 所有的字段通过传参后进行计算
默认表达式可以包含常量或表的任意其他列。当创建或更改表结构时系统将会运行检查确保不会包含循环依赖。对于INSERT, 它仅检查表达式是否是可以解析的 - 它们可以从中计算出所有需要的列的默认值
`DEFAULT expr`
正常的默认值。 如果 INSERT 查询并没有指定对应的字段, 它将通过计算对应的表达式来填充
普通的默认值如果INSERT中不包含指定的列那么将通过表达式计算它的默认值并填充它
`物化表达式`
`MATERIALIZED expr`
物化表达式。 此类型字段并没有指定插入操作, 因为它经常执行计算任务。 对一个插入操作, 无字段列表, 那么这些字段将不考虑。 另外, 当在一个SELECT查询语句中使用星号时 此字段并不被替换。 这将保证INSERT INTO SELECT * FROM 的不可变性。
物化表达式被该表达式指定的列不能包含在INSERT的列表中因为它总是被计算出来的。
对于INSERT而言不需要考虑这些列。
另外在SELECT查询中如果包含星号此列不会被用来替换星号这是因为考虑到数据转储在使用`SELECT *`查询出的结果总能够被'INSERT'回表。
`别名表达式`
`ALIAS expr`
别名。 此字段不存储在表中。
此列的值不插入到表中, 当在一个SELECT查询语句中使用星号时此字段并不被替换。
它能够用在 SELECTs中如果别名在查询解析时被扩展
别名。这样的列不会存储在表中。
它的值不能够通过INSERT写入同时使用SELECT查询星号时这些列也不会被用来替换星号。
但是它们可以显式地用于SELECT中在这种情况下在查询分析中别名将被替换。
当使用更新查询添加一个新的字段, 这些列的旧值不被写入。 相反, 新字段没有值,当读取旧值时, 表达式将被计算。 然而,如果运行表达式需要不同的字段, 这些字段将被读取 但是仅读取相关的数据块
当使用ALTER查询对添加新的列时不同于为所有旧数据添加这个列对于需要在旧数据中查询新列只会在查询时动态计算这个新列的值。但是如果新列的默认表示中依赖其他列的值进行计算那么同样会加载这些依赖的列的数据
如果你添加一个新的字段到表中, 然后改变它的默认表达式, 对于使用的旧值将更改(对于此数据, 值不保存在磁盘上)。 当运行背景线程时, 缺少合并数据块的字段数据写入到合并数据块中。
在嵌套数据结构中设置默认值是不允许的。
如果你向表中添加一个新列,并在之后的一段时间后修改它的默认表达式,则旧数据中的值将会被改变。请注意,在运行后台合并时,缺少的列的值将被计算后写入到合并后的数据部分中。
不能够为nested类型的列设置默认值。
### 临时表
在任何情况下, 如果临时表被指定, 一个临时表将被创建。 临时表有如下的特性:
ClickHouse支持临时表其具有以下特征
- 当会话结束后, 临时表将删除,或者连接丢失
- 一个临时表使用内存表引擎创建。 其他的表引擎不支持临时表
- 数据库不能为一个临时表指定。 它将创建在数据库之外
- 如果一个临时表与另外的表有相同的名称 ,一个查询指定了表名并没有指定数据库, 将使用临时表。
- 对于分布式查询处理, 查询中的临时表将被传递给远程服务器。
- 当回话结束时,临时表将随会话一起消失,这包含链接中断
- 临时表仅能够使用Memory表引擎
- 无法为临时表指定数据库。它是在数据库之外创建的
- 如果临时表与另一个表名称相同那么当在查询时没有显式地指定db的情况下将优先使用临时表。
- 对于分布式处理,查询中使用的临时表将被传递到远程服务器。
在大多数情况下, 临时表并不能手工创建, 但当查询外部数据或使用分布式全局(GLOBAL)IN时可以创建临时表。
分布式 DDL 查询 (ON CLUSTER clause)
----------------------------------------------
`CREATE` `DROP` `ALTER``RENAME` 查询支持在集群上分布式执行。 例如, 如下的查询在集群中的每个机器节点上创建了 all_hits Distributed 表:
可以使用下面的语法创建一个临时表:
```sql
CREATE TABLE IF NOT EXISTS all_hits ON CLUSTER cluster (p Date i Int32) ENGINE = Distributed(cluster default hits)
CREATE TEMPORARY TABLE [IF NOT EXISTS] table_name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
)
```
为了正确执行这些语句,每个节点必须有相同的集群设置(为了简化同步配置,可以使用 zookeeper 来替换)。 这些节点也可以连接到ZooKeeper 服务器。
查询语句会在每个节点上执行, 而`ALTER`查询目前暂不支持在同步表(replicated table)上执行。
大多数情况下,临时表不是手动创建的,只有在分布式查询处理中使用`(GLOBAL) IN`时为外部数据创建。更多信息,可以参考相关章节。
## 分布式DDL查询 ON CLUSTER 子句)
对于 `CREATE` `DROP` `ALTER`,以及`RENAME`查询,系统支持其运行在整个集群上。
例如,以下查询将在`cluster`集群的所有节点上创建名为`all_hits`的`Distributed`表:
``` sql
CREATE TABLE IF NOT EXISTS all_hits ON CLUSTER cluster (p Date, i Int32) ENGINE = Distributed(cluster, default, hits)
```
为了能够正确的运行这种查询每台主机必须具有相同的cluster声明为了简化配置的同步你可以使用zookeeper的方式进行配置。同时这些主机还必须链接到zookeeper服务器。
这个查询将最终在集群的每台主机上运行,即使一些主机当前处于不可用状态。同时它还保证了所有的查询在单台主机中的执行顺序。
replicated系列表还没有支持`ALTER`查询。
## CREATE VIEW
```sql
CREATE [MATERIALIZED] VIEW [IF NOT EXISTS] [db.]name [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
``` sql
CREATE [MATERIALIZED] VIEW [IF NOT EXISTS] [db.]table_name [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
```
创建一个视图。 有两种类型的视图: 正常视图和物化(MATERIALIZED)视图。
创建一个视图。它存在两种可选择的类型:普通视图与物化视图。
当创建一个物化视图时, 你必须指定表引擎 此表引擎用于存储数据
普通视图不存储任何数据只是执行从另一个表中的读取。换句话说普通视图只是保存了视图的查询当从视图中查询时此查询被作为子查询用于替换FROM子句。
一个物化视图工作流程如下所示: 当插入数据到SELECT 查询指定的表中时, 插入数据部分通过SELECT查询部分来转换 结果插入到视图中。
举个例子,假设你已经创建了一个视图:
正常视图不保存任何数据, 但是可以从任意表中读取数据。 换句话说,正常视图可以看作是查询结果的一个结果缓存。 当从一个视图中读取数据时, 此查询可以看做是 FROM语句的子查询。
例如, 假设你已经创建了一个视图:
```sql
``` sql
CREATE VIEW view AS SELECT ...
```
写了一个查询语句:
还有一个查询:
```sql
``` sql
SELECT a, b, c FROM view
```
此查询完全等价于子查询:
```sql
这个查询完全等价于:
``` sql
SELECT a, b, c FROM (SELECT ...)
```
物化视图存由SELECT语句查询转换的数据
物化视图存储的数据是相应的SELECT查询转换得来的。
当创建一个物化视图时,你必须指定一个引擎 存储数据的目标引擎
在创建物化视图时,你还必须指定表的引擎 - 将会使用这个表引擎存储数据
一个物化视图使用流程如下: 当插入数据到 SELECT 指定的表时, 插入数据部分通过SELECT 来转换, 同时结果被插入到视图中。
目前物化视图的工作原理当将数据写入到物化视图中SELECT子句所指定的表时插入的数据会通过SELECT子句查询进行转换并将最终结果插入到视图中。
如果创建物化视图时指定了POPULATE子句则在创建时将该表的数据插入到物化视图中。就像使用`CREATE TABLE ... AS SELECT ...`一样。否则物化视图只会包含在物化视图创建后的新写入的数据。我们不推荐使用POPULATE因为在视图创建期间写入的数据将不会写入其中。
如果你指定了 POPULATE 当创建时, 现有的表数据被插入到了视图中, 类似于 `CREATE TABLE ... AS SELECT ...` . 否则, 在创建视图之后,查询仅包含表中插入的数据. 我们不建议使用 POPULATE 在视图创建过程中,插入到表中的数据不插入到其中.
当一个`SELECT`子句包含`DISTINCT`, `GROUP BY`, `ORDER BY`, `LIMIT`时,请注意,这些仅会在插入数据时在每个单独的数据块上执行。例如,如果你在其中包含了`GROUP BY`,则只会在查询期间进行聚合,但聚合范围仅限于单个批的写入数据。数据不会进一步被聚合。但是当你使用一些其他数据聚合引擎时这是例外的,如:`SummingMergeTree`。
一个`SELECT`查询可以包含 `DISTINCT` `GROUP BY` `ORDER BY` `LIMIT`。。。 对应的转换在每个数据块上独立执行。 例如, 如果 GROUP BY 被设置, 数据将在插入过程中进行聚合, 但仅是在一个插入数据包中。数据不再进一步聚合。 当使用一个引擎时, 如SummingMergeTree它将独立执行数据聚合
目前对物化视图执行`ALTER`是不支持的,因此这可能是不方便的。如果物化视图是使用的`TO [db.]name`的方式进行构建的,你可以使用`DETACH`语句先将视图剥离,然后使用`ALTER`运行在目标表上,然后使用`ATTACH`将之前剥离的表重新加载进来。
视图看起来和正常表相同。 例如, 你可以使用 SHOW TABLES来列出视图表的相关信息
视图看起来和普通的表相同。例如,你可以通过`SHOW TABLES`查看到它们
物化视图的`ALTER`查询执行还没有完全开发出来, 因此使用上可能不方便。 如果物化视图使用 `TO [db。]name` 你能够 `DETACH` 视图, 在目标表运行 `ALTER` 然后 `ATTACH` 之前的 `DETACH`视图。
视图看起来和正常表相同。 例如, 你可以使用 `SHOW TABLES` 来列出视图表的相关信息。
因此并没有一个单独的SQL语句来删除视图。 为了删除一个视图, 可以使用 `DROP TABLE`
没有单独的删除视图的语法。如果要删除视图,请使用`DROP TABLE`。
[Original article](https://clickhouse.yandex/docs/en/query_language/create/) <!--hide-->
## INSERT
The INSERT query is mainly used for adding data to the system.

Basic query format:
``` sql
INSERT INTO [db.]table [(c1, c2, c3)] VALUES (v11, v12, v13), (v21, v22, v23), ...
```
The query can specify the list of columns to insert, e.g. `[(c1, c2, c3)]`. Columns that exist in the table but are not in the insert list are filled as follows:

- If a `DEFAULT` expression is defined for the column, its value is computed from that expression.
- If no `DEFAULT` expression is defined, zeros or empty strings are used.

If [strict_insert_defaults=1](../operations/settings/settings.md#settings-strict_insert_defaults), all columns that do not have a `DEFAULT` expression defined must be listed in the query.
Data can be passed to the INSERT in any [input/output format](../interfaces/formats.md#formats) supported by ClickHouse. The name of the format must be specified explicitly in the query:

``` sql
INSERT INTO [db.]table [(c1, c2, c3)] FORMAT format_name data_set
```
For example, the following query uses the same input format as the basic `INSERT ... VALUES`:

``` sql
INSERT INTO [db.]table [(c1, c2, c3)] FORMAT Values (v11, v12, v13), (v21, v22, v23), ...
```
ClickHouse removes all spaces and one line feed (if there is one) before the data. When forming a query, we recommend putting the data on a new line after the format name (this matters if the data begins with whitespace).

Example:

``` sql
INSERT INTO t FORMAT TabSeparated
11 Hello, world!
22 Qwerty
```
When using the command-line client or the HTTP interface, you can send the query text separately from the data. For more information, see the "[Interfaces](../interfaces/index.md#interfaces)" section.
<a name="queries-insert-select"></a>

### Inserting the Results of `SELECT`

``` sql
INSERT INTO [db.]table [(c1, c2, c3)] SELECT ...
```
Columns are mapped according to their position in the SELECT clause, even though their names in the SELECT expression and in the target table may differ. If necessary, type casting is performed.

None of the data formats except Values allow setting values from expressions such as `now()`, `1 + 2`, and so on. The Values format allows limited use of expressions, but this is not recommended, because inefficient code is used for their execution.
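As a sketch of the positional mapping (the `dst` and `src` tables are hypothetical): the values of `x` and `y` are written into `a` and `b` by position, even though the names differ.

``` sql
INSERT INTO dst (a, b) SELECT x, y FROM src
```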
Other queries for modifying parts of data are not supported: `UPDATE`, `DELETE`, `REPLACE`, `MERGE`, `UPSERT`, `INSERT UPDATE`.
However, you can delete old data using `ALTER TABLE ... DROP PARTITION`.
### Performance Considerations

`INSERT` sorts the input data by primary key and splits it into partitions by month. If you insert data belonging to several different months in one batch, it can significantly reduce the performance of the `INSERT` query. To avoid this:

- Add data in fairly large batches, such as 100,000 rows at a time.
- Group data by month before uploading it to ClickHouse.

Performance will not decrease if:

- Data is added in real time.
- The uploaded data is usually sorted by time.
[Original article](https://clickhouse.yandex/docs/en/query_language/insert_into/) <!--hide-->
# SELECT Query Syntax

`SELECT` performs data retrieval.
``` sql
SELECT [DISTINCT] expr_list
[FROM [db.]table | (subquery) | table_function] [FINAL]
[SAMPLE sample_coeff]
[ARRAY JOIN ...]
[GLOBAL] ANY|ALL INNER|LEFT JOIN (subquery)|table USING columns_list
[PREWHERE expr]
[WHERE expr]
[GROUP BY expr_list] [WITH TOTALS]
[HAVING expr]
[ORDER BY expr_list]
[LIMIT [n, ]m]
[UNION ALL ...]
[INTO OUTFILE filename]
[FORMAT format]
[LIMIT n BY columns]
```
All the clauses are optional, except for the required list of expressions (expr_list) immediately after SELECT.
The clauses below are described in roughly the order of query execution.

If the query omits the `DISTINCT`, `GROUP BY`, and `ORDER BY` clauses and the `IN` and `JOIN` subqueries, it is processed completely as a stream, using O(1) amount of RAM.
Otherwise, the query might consume a lot of RAM unless the appropriate restrictions are specified: `max_memory_usage`, `max_rows_to_group_by`, `max_rows_to_sort`, `max_rows_in_distinct`, `max_bytes_in_distinct`, `max_rows_in_set`, `max_bytes_in_set`, `max_rows_in_join`, `max_bytes_in_join`, `max_bytes_before_external_sort`, `max_bytes_before_external_group_by`. They make it possible to use external sorting (saving temporary tables to disk) and external aggregation (`there is currently no equivalent setting for JOIN`). For more information, see the "Settings" section.
### FROM Clause

If the FROM clause is omitted, data is read from the `system.one` table.
The `system.one` table contains exactly one row (it fulfills the same purpose as the DUAL table found in other DBMSs).

The FROM clause specifies the table to read data from, a subquery, or a table function; the ARRAY JOIN and regular JOIN clauses may also be included (see below).

Instead of a table, a subquery may be specified in brackets.
In this case, the subquery processing pipeline is built into the processing pipeline of the external query.
In contrast to standard SQL, a synonym does not need to be specified after a subquery. For compatibility, you can write 'AS alias' after a subquery, but the specified name cannot be used anywhere.

A table function may also be specified instead of a table. For details, see the "Table functions" section.

To execute a query, all the columns listed in the query are extracted from the appropriate table; any columns not needed by the external query are thrown out of subqueries.
If a query does not list any columns (for example, SELECT count() FROM t), some column is extracted from the table anyway (the smallest one is preferred) in order to calculate the number of rows.

The FINAL modifier can be used only for a SELECT from a CollapsingMergeTree table. When you specify FINAL, data is selected fully "collapsed" while the query runs. Keep in mind that in this case the query reads all the related primary-key columns in a single stream and merges the required data, which means the query is processed more slowly. In most cases, you should avoid using FINAL. For more information, see the "CollapsingMergeTree engine" section.
### SAMPLE Clause

The SAMPLE clause allows for approximated query processing. Approximated query processing is supported only by MergeTree\* tables, and only if the sampling expression was specified during table creation (see the "MergeTree engine" section).

`SAMPLE` has the form `SAMPLE k`, where k is a decimal number from 0 to 1, or a sufficiently large positive integer.

When k is a decimal from 0 to 1, the query is executed on 'k' percent of the data. For example, `SAMPLE 0.1` runs the query on 10% of the data.
When k is a sufficiently large positive integer, the query is executed on a sample of at most 'k' rows. For example, `SAMPLE 10000000` runs the query on at most 10,000,000 rows.
Example:
``` sql
SELECT
Title,
count() * 10 AS PageViews
FROM hits_distributed
SAMPLE 0.1
WHERE
CounterID = 34
AND toDate(EventDate) >= toDate('2013-01-29')
AND toDate(EventDate) <= toDate('2013-02-04')
AND NOT DontCountHits
AND NOT Refresh
AND Title != ''
GROUP BY Title
ORDER BY PageViews DESC LIMIT 1000
```
In this example, the query is executed on a sample of 0.1 (10%) of the data. Note that the values of aggregate functions are not corrected automatically, so to get a more precise result, the `count()` value has to be manually multiplied by 10.

An approximation like `SAMPLE 10000000` is not suitable here, since there is no information about which share of the data was processed or what the aggregate functions should be multiplied by.

A sample with the same sampling coefficient always selects the same subset of the possible data: if we could see all the data that might be in the table, the same coefficient would always produce the same result (for tables created with the same sampling expression). In other words, the system samples data in the same way at different times, on different servers, and for different tables.

For example, a sample of user IDs takes rows with the same subset of all possible user IDs from different tables. This means that you can use the sample in a subquery in the IN clause, as well as join the results of a sample with other queries.
<a name="select-array-join"></a>
### ARRAY JOIN Clause

The ARRAY JOIN clause allows executing a JOIN with an array or nested data structure. It is similar to the arrayJoin function, but its functionality is broader.

`ARRAY JOIN` is essentially `INNER JOIN` with an array. Example:
```
:) CREATE TABLE arrays_test (s String, arr Array(UInt8)) ENGINE = Memory
CREATE TABLE arrays_test
(
s String,
arr Array(UInt8)
) ENGINE = Memory
Ok.
0 rows in set. Elapsed: 0.001 sec.
:) INSERT INTO arrays_test VALUES ('Hello', [1,2]), ('World', [3,4,5]), ('Goodbye', [])
INSERT INTO arrays_test VALUES
Ok.
3 rows in set. Elapsed: 0.001 sec.
:) SELECT * FROM arrays_test
SELECT *
FROM arrays_test
┌─s───────┬─arr─────┐
│ Hello │ [1,2] │
│ World │ [3,4,5] │
│ Goodbye │ [] │
└─────────┴─────────┘
3 rows in set. Elapsed: 0.001 sec.
:) SELECT s, arr FROM arrays_test ARRAY JOIN arr
SELECT s, arr
FROM arrays_test
ARRAY JOIN arr
┌─s─────┬─arr─┐
│ Hello │ 1 │
│ Hello │ 2 │
│ World │ 3 │
│ World │ 4 │
│ World │ 5 │
└───────┴─────┘
5 rows in set. Elapsed: 0.001 sec.
```
An alias can be specified for the ARRAY JOIN clause. In this case, an array item can be accessed by this alias, while the array itself is still accessed by the original name. Example:
```
:) SELECT s, arr, a FROM arrays_test ARRAY JOIN arr AS a
SELECT s, arr, a
FROM arrays_test
ARRAY JOIN arr AS a
┌─s─────┬─arr─────┬─a─┐
│ Hello │ [1,2] │ 1 │
│ Hello │ [1,2] │ 2 │
│ World │ [3,4,5] │ 3 │
│ World │ [3,4,5] │ 4 │
│ World │ [3,4,5] │ 5 │
└───────┴─────────┴───┘
5 rows in set. Elapsed: 0.001 sec.
```
When multiple arrays of the same size are comma-separated in the ARRAY JOIN clause, they are joined simultaneously (the direct sum, not the Cartesian product). Example:
```
:) SELECT s, arr, a, num, mapped FROM arrays_test ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num, arrayMap(x -> x + 1, arr) AS mapped
SELECT s, arr, a, num, mapped
FROM arrays_test
ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num, arrayMap(lambda(tuple(x), plus(x, 1)), arr) AS mapped
┌─s─────┬─arr─────┬─a─┬─num─┬─mapped─┐
│ Hello │ [1,2] │ 1 │ 1 │ 2 │
│ Hello │ [1,2] │ 2 │ 2 │ 3 │
│ World │ [3,4,5] │ 3 │ 1 │ 4 │
│ World │ [3,4,5] │ 4 │ 2 │ 5 │
│ World │ [3,4,5] │ 5 │ 3 │ 6 │
└───────┴─────────┴───┴─────┴────────┘
5 rows in set. Elapsed: 0.002 sec.
:) SELECT s, arr, a, num, arrayEnumerate(arr) FROM arrays_test ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num
SELECT s, arr, a, num, arrayEnumerate(arr)
FROM arrays_test
ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num
┌─s─────┬─arr─────┬─a─┬─num─┬─arrayEnumerate(arr)─┐
│ Hello │ [1,2] │ 1 │ 1 │ [1,2] │
│ Hello │ [1,2] │ 2 │ 2 │ [1,2] │
│ World │ [3,4,5] │ 3 │ 1 │ [1,2,3] │
│ World │ [3,4,5] │ 4 │ 2 │ [1,2,3] │
│ World │ [3,4,5] │ 5 │ 3 │ [1,2,3] │
└───────┴─────────┴───┴─────┴─────────────────────┘
5 rows in set. Elapsed: 0.002 sec.
```
ARRAY JOIN also works with nested data structures. Example:
```
:) CREATE TABLE nested_test (s String, nest Nested(x UInt8, y UInt32)) ENGINE = Memory
CREATE TABLE nested_test
(
s String,
nest Nested(
x UInt8,
y UInt32)
) ENGINE = Memory
Ok.
0 rows in set. Elapsed: 0.006 sec.
:) INSERT INTO nested_test VALUES ('Hello', [1,2], [10,20]), ('World', [3,4,5], [30,40,50]), ('Goodbye', [], [])
INSERT INTO nested_test VALUES
Ok.
3 rows in set. Elapsed: 0.001 sec.
:) SELECT * FROM nested_test
SELECT *
FROM nested_test
┌─s───────┬─nest.x──┬─nest.y─────┐
│ Hello │ [1,2] │ [10,20] │
│ World │ [3,4,5] │ [30,40,50] │
│ Goodbye │ [] │ [] │
└─────────┴─────────┴────────────┘
3 rows in set. Elapsed: 0.001 sec.
:) SELECT s, nest.x, nest.y FROM nested_test ARRAY JOIN nest
SELECT s, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN nest
┌─s─────┬─nest.x─┬─nest.y─┐
│ Hello │ 1 │ 10 │
│ Hello │ 2 │ 20 │
│ World │ 3 │ 30 │
│ World │ 4 │ 40 │
│ World │ 5 │ 50 │
└───────┴────────┴────────┘
5 rows in set. Elapsed: 0.001 sec.
```
When the name of a nested data structure is specified in ARRAY JOIN, the meaning is the same as an ARRAY JOIN with all the array elements that it consists of. Example:
```
:) SELECT s, nest.x, nest.y FROM nested_test ARRAY JOIN nest.x, nest.y
SELECT s, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN `nest.x`, `nest.y`
┌─s─────┬─nest.x─┬─nest.y─┐
│ Hello │ 1 │ 10 │
│ Hello │ 2 │ 20 │
│ World │ 3 │ 30 │
│ World │ 4 │ 40 │
│ World │ 5 │ 50 │
└───────┴────────┴────────┘
5 rows in set. Elapsed: 0.001 sec.
```
This variation also makes sense:
```
:) SELECT s, nest.x, nest.y FROM nested_test ARRAY JOIN nest.x
SELECT s, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN `nest.x`
┌─s─────┬─nest.x─┬─nest.y─────┐
│ Hello │ 1 │ [10,20] │
│ Hello │ 2 │ [10,20] │
│ World │ 3 │ [30,40,50] │
│ World │ 4 │ [30,40,50] │
│ World │ 5 │ [30,40,50] │
└───────┴────────┴────────────┘
5 rows in set. Elapsed: 0.001 sec.
```
To make it easier to access the original nested arrays, you can define an alias for the nested data structure. Example:
```
:) SELECT s, n.x, n.y, nest.x, nest.y FROM nested_test ARRAY JOIN nest AS n
SELECT s, `n.x`, `n.y`, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN nest AS n
┌─s─────┬─n.x─┬─n.y─┬─nest.x──┬─nest.y─────┐
│ Hello │ 1 │ 10 │ [1,2] │ [10,20] │
│ Hello │ 2 │ 20 │ [1,2] │ [10,20] │
│ World │ 3 │ 30 │ [3,4,5] │ [30,40,50] │
│ World │ 4 │ 40 │ [3,4,5] │ [30,40,50] │
│ World │ 5 │ 50 │ [3,4,5] │ [30,40,50] │
└───────┴─────┴─────┴─────────┴────────────┘
5 rows in set. Elapsed: 0.001 sec.
```
Example of using the arrayEnumerate function:
```
:) SELECT s, n.x, n.y, nest.x, nest.y, num FROM nested_test ARRAY JOIN nest AS n, arrayEnumerate(nest.x) AS num
SELECT s, `n.x`, `n.y`, `nest.x`, `nest.y`, num
FROM nested_test
ARRAY JOIN nest AS n, arrayEnumerate(`nest.x`) AS num
┌─s─────┬─n.x─┬─n.y─┬─nest.x──┬─nest.y─────┬─num─┐
│ Hello │ 1 │ 10 │ [1,2] │ [10,20] │ 1 │
│ Hello │ 2 │ 20 │ [1,2] │ [10,20] │ 2 │
│ World │ 3 │ 30 │ [3,4,5] │ [30,40,50] │ 1 │
│ World │ 4 │ 40 │ [3,4,5] │ [30,40,50] │ 2 │
│ World │ 5 │ 50 │ [3,4,5] │ [30,40,50] │ 3 │
└───────┴─────┴─────┴─────────┴────────────┴─────┘
5 rows in set. Elapsed: 0.002 sec.
```
A query may contain only one ARRAY JOIN clause.

If the result of an ARRAY JOIN is used in the WHERE/PREWHERE clause, ARRAY JOIN runs before WHERE/PREWHERE; otherwise, it runs after them, in order to reduce computations.
<a name="query-language-join"></a>
### JOIN Clause

The JOIN clause joins data in the normal [SQL JOIN](https://en.wikipedia.org/wiki/Join_(SQL)) sense.

!!! info "Note"
    Not related to [ARRAY JOIN](#select-array-join).
``` sql
SELECT <expr_list>
FROM <left_subquery>
[GLOBAL] [ANY|ALL] INNER|LEFT|RIGHT|FULL|CROSS [OUTER] JOIN <right_subquery>
(ON <expr_list>)|(USING <column_list>) ...
```
A table name can be specified instead of `<left_subquery>` and `<right_subquery>`. This is equivalent to the `SELECT * FROM table` subquery, except in the special case when the table has the [Join](../operations/table_engines/join.md#table-engine-join) engine – a structure prepared for joining.
**Supported types of `JOIN`**

- `INNER JOIN`
- `LEFT OUTER JOIN`
- `RIGHT OUTER JOIN`
- `FULL OUTER JOIN`
- `CROSS JOIN`

The `OUTER` keyword may be omitted; it is implied by default.
**`ANY` and `ALL`**

When `JOIN` is used with the `ALL` modifier and the right table has several rows matching a left-table row, all of them are returned in the result. This is the normal JOIN behavior from standard SQL.
When `JOIN` is used with the `ANY` modifier and the right table has several matching rows, only the first one found is joined. If each left-table row has at most one matching right-table row, `ANY` and `ALL` give the same results.

You can set the default JOIN strictness for the session with the [join_default_strictness](../operations/settings/settings.md#session-setting-join_default_strictness) setting.
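As a sketch of the difference (the `t1` and `t2` tables are hypothetical): if `t2` contains three rows with a given `id`, the `ALL` variant returns three result rows for the matching left-side row, while the `ANY` variant returns only one.

``` sql
SELECT id, value FROM t1 ALL INNER JOIN t2 USING id

SELECT id, value FROM t1 ANY INNER JOIN t2 USING id
```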
**`GLOBAL` distribution**

When using a normal `JOIN`, the query is sent to remote servers. Subqueries are run on each of them in order to build the right table, and the join is performed with that table. In other words, the right table is formed on each server separately.
When using `GLOBAL ... JOIN`, first the requestor server runs a subquery to calculate the right table. This temporary table is passed to each remote server, and queries are run on them using this temporary table.

Be careful when using `GLOBAL`. For more information, see the [Distributed subqueries](#queries-distributed-subqueries) section.
**Usage recommendations**

Remove all columns that the `JOIN` doesn't need from subqueries.

When a `JOIN` is run, there is no optimization of its execution order relative to the other stages of the query: the JOIN runs before filtering in WHERE and before aggregation. To set the processing order explicitly, we recommend running the `JOIN` with subqueries.
Example:
``` sql
SELECT
CounterID,
hits,
visits
FROM
(
SELECT
CounterID,
count() AS hits
FROM test.hits
GROUP BY CounterID
) ANY LEFT JOIN
(
SELECT
CounterID,
sum(Sign) AS visits
FROM test.visits
GROUP BY CounterID
) USING CounterID
ORDER BY hits DESC
LIMIT 10
```
```
┌─CounterID─┬───hits─┬─visits─┐
│ 1143050 │ 523264 │ 13665 │
│ 731962 │ 475698 │ 102716 │
│ 722545 │ 337212 │ 108187 │
│ 722889 │ 252197 │ 10547 │
│ 2237260 │ 196036 │ 9522 │
│ 23057320 │ 147211 │ 7689 │
│ 722818 │ 90109 │ 17847 │
│ 48221 │ 85379 │ 4652 │
│ 19762435 │ 77807 │ 7026 │
│ 722884 │ 77492 │ 11056 │
└───────────┴────────┴────────┘
```
Subqueries don't allow you to set names or use them for referencing a column from a specific subquery.

The columns specified in `USING` must have the same names in both subqueries, while the other columns must be named differently. You can use aliases to change the column names in subqueries (the example uses the aliases 'hits' and 'visits').

The `USING` clause specifies one or more columns to join on, which establishes the equality of these columns between the two tables. The list of columns is set without brackets. More complex join conditions are not supported.

The right table (the subquery result) resides in RAM. If there isn't enough memory, the `JOIN` cannot be run.

Only one `JOIN` can be specified in a query (on a single level). To run multiple `JOIN`s, put them in subqueries.

Each time the same `JOIN` query is run, the right table is calculated again – the result is not cached. To avoid this, use the special Join engine, which is a prepared structure for joining that is always kept in RAM. For more information, see the "Join engine" section.

In some cases, using `IN` instead of `JOIN` is more efficient. Among the various types of JOINs, the most efficient is `ANY LEFT JOIN`, then `ANY INNER JOIN`; the least efficient are `ALL LEFT JOIN` and `ALL INNER JOIN`.

If you need a `JOIN` for joining with dimension tables (relatively small tables that contain dimension properties, such as names of advertising campaigns), a `JOIN` might not be the best choice, due to the bulky syntax and the fact that the right table is re-accessed for every query. For such cases, you should use the "external dictionaries" feature instead of `JOIN`. For more information, see the [External dictionaries](dicts/external_dicts.md#dicts-external_dicts) section.
#### NULL Processing

The behavior of JOIN is affected by the [join_use_nulls](../operations/settings/settings.md#settings-join_use_nulls) setting. With `join_use_nulls=1`, `JOIN` works the same way as in standard SQL.

If the JOIN keys are [Nullable](../data_types/nullable.md#data_types-nullable) fields, the rows where at least one of the keys has the value [NULL](syntax.md#null-literal) are not joined.
<a name="query_language-queries-where"></a>
### WHERE Clause

If there is a WHERE clause, it must contain an expression with the UInt8 type, usually an expression with comparison and logical operators.
This expression is used for filtering data before all other transformations.
If the table engine supports indexes, the expression is evaluated for the ability to use indexes.
<a name="query_language-queries-prewhere"></a>
### PREWHERE Clause

This clause has the same meaning as the WHERE clause. The difference is in which data is read from the table.
When using PREWHERE, only the columns necessary for evaluating the PREWHERE expression are read first. Then the other columns needed by the query are read, but only for the blocks where the PREWHERE expression is true.

It makes sense to use PREWHERE when there is a small number of columns in the filter condition that are not suitable for index filtering but provide strong filtration, because this reduces the amount of data to read.
For example, it is useful to write PREWHERE for queries that extract a large number of columns but filter on only a few of them.

PREWHERE is supported only by the `*MergeTree` family of engines.

A query may specify PREWHERE and WHERE at the same time; in this case, PREWHERE precedes WHERE.

Keep in mind that there is little point in PREWHERE using only columns that already have an index, because when an index is used, only the data blocks that match the index are read anyway.

If the 'optimize_move_to_prewhere' setting is set to 1 and PREWHERE is omitted, the system automatically moves the suitable parts of the WHERE expression to PREWHERE.
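A sketch of the idea (the columns are hypothetical): only the `Referer` column is read first for all blocks; the wide `Title`, `URL`, and `UserAgent` columns are then read only for the blocks where the condition holds.

``` sql
SELECT Title, URL, Referer, UserAgent
FROM hits
PREWHERE Referer LIKE '%example.com%'
WHERE EventDate = toDate('2013-01-29')
```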
### GROUP BY Clause

This is one of the most important parts of a column-oriented DBMS.

If there is a GROUP BY clause, it must contain a list of expressions. Each expression will be referred to here as a "key".
All the expressions in the SELECT, HAVING, and ORDER BY clauses must be calculated from keys or from aggregate functions. In short, every selected column must be either a key or used inside an aggregate function.

If the query's expression list contains only aggregate functions, the GROUP BY clause can be omitted; aggregation by an empty set of keys is then assumed.
Example:
``` sql
SELECT
count(),
median(FetchTiming > 60 ? 60 : FetchTiming),
count() - sum(Refresh)
FROM hits
```
As opposed to the SQL standard, if the table doesn't contain any rows (either there aren't any at all, or there aren't any after filtering with WHERE), an empty result is returned, and not a result containing the initial values of the aggregate functions.

As opposed to MySQL (and conforming to standard SQL), you can't get a value of a non-aggregated column that is not in a key (except for constant expressions). To work around this, you can use aggregate functions such as any (get the first encountered value), max, or min.
Example:
``` sql
SELECT
domainWithoutWWW(URL) AS domain,
count(),
any(Title) AS title -- getting the first occurred page header for each domain.
FROM hits
GROUP BY domain
```
For every different key value encountered, GROUP BY calculates a set of aggregate function values.

GROUP BY is not supported for array columns.

A constant can't be specified as an argument for an aggregate function, e.g. sum(1). In that case, you can omit the constant, e.g. `count()`.
#### NULL Processing

For GROUP BY, ClickHouse interprets [NULL](syntax.md#null-literal) as a value, and `NULL=NULL`.

Here's an example of what this means.

Assume you have this table:
```
┌─x─┬────y─┐
│ 1 │ 2 │
│ 2 │ ᴺᵁᴸᴸ │
│ 3 │ 2 │
│ 3 │ 3 │
│ 3 │ ᴺᵁᴸᴸ │
└───┴──────┘
```
Running the query `SELECT sum(x), y FROM t_null_big GROUP BY y` gives the following result:
```
┌─sum(x)─┬────y─┐
│ 4 │ 2 │
│ 3 │ 3 │
│ 5 │ ᴺᵁᴸᴸ │
└────────┴──────┘
```
You can see that GROUP BY aggregated x for `y=NULL`, as if `NULL` were a regular value.

If you pass several keys to `GROUP BY`, the result will list all the combinations of the selection, as if `NULL` were a specific value.
#### WITH TOTALS Modifier

If the WITH TOTALS modifier is specified, another row will be calculated. This row will have the key columns containing default values (zeros or empty strings), and the columns of aggregate functions with the values calculated across all the selected rows.

This extra row is output separately from the other rows in the JSON\*, TabSeparated\*, and Pretty\* formats.
In the JSON\* formats, this row is output as a separate 'totals' field. In the TabSeparated\* formats, the row comes after the main result, separated from it by an empty row. In the Pretty\* formats, the row is output as a separate table after the main result.

When `WITH TOTALS` is used together with a HAVING clause, its behavior depends on the 'totals_mode' setting.
By default, `totals_mode = 'before_having'`. In this case, `WITH TOTALS` is calculated before HAVING, across at most `max_rows_to_group_by` rows.

When `group_by_overflow_mode = 'any'` and `max_rows_to_group_by` is specified, the other `totals_mode` alternatives behave as follows:

`after_having_exclusive` – calculated after HAVING, across no more rows than `max_rows_to_group_by`.

`after_having_inclusive` – calculated after HAVING, across no fewer rows than `max_rows_to_group_by`.

`after_having_auto` – calculated after HAVING: the number of rows that passed through HAVING is counted, and if it exceeds a certain threshold (50% by default), all the rows are included in 'totals'; otherwise, they are excluded.

`totals_auto_threshold` – 0.5 by default; the parameter for `after_having_auto`.

If `group_by_overflow_mode != 'any'` and `max_rows_to_group_by` is not specified, all the 'after_having' variations are the same.

You can use WITH TOTALS in subqueries, including subqueries in the JOIN clause; in this case, the respective total values are combined.
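A minimal sketch, reusing the `hits` table from the examples above: besides one row per `domain`, an extra totals row with the count across all selected rows is produced.

``` sql
SELECT domainWithoutWWW(URL) AS domain, count() AS views
FROM hits
GROUP BY domain WITH TOTALS
```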
#### GROUP BY in External Memory

You can enable spilling temporary data to disk in order to restrict memory usage during GROUP BY.
The `max_bytes_before_external_group_by` setting determines the RAM consumption threshold for spilling GROUP BY temporary data to the file system. If it is set to 0 (the default), spilling is disabled.

When using `max_bytes_before_external_group_by`, we recommend setting max_memory_usage to about twice as high. This is necessary because aggregation is performed in two stages: (1) reading the data and forming intermediate data, and (2) merging the intermediate data. Spilling to the file system can only occur during stage 1. If no temporary data was spilled, stage 2 might require up to the same amount of memory as stage 1.

For example, if `max_memory_usage` is set to 10000000000 and you want to use external aggregation, it makes sense to set `max_bytes_before_external_group_by` to 10000000000 and `max_memory_usage` to 20000000000. When external aggregation is triggered (if there was at least one spill of temporary data), maximum RAM consumption is only slightly higher than `max_bytes_before_external_group_by`.

With distributed query processing, external aggregation is performed on the remote servers. In order for the requestor server to use only a small amount of RAM, set `distributed_aggregation_memory_efficient` to 1.

When merging data spilled to disk, as well as when merging results from remote servers with the `distributed_aggregation_memory_efficient` setting enabled, up to 1/256 \* the number of threads of the total amount of RAM is consumed.

When external aggregation is enabled, if the data size is below `max_bytes_before_external_group_by` (so nothing was spilled to disk), the query runs just as fast as without external aggregation. If any temporary data was spilled, the run time will be several times longer (approximately three times).

If you have an ORDER BY with a small LIMIT after GROUP BY, the ORDER BY clause will not use much memory.
Otherwise, don't forget to enable external sorting (`max_bytes_before_external_sort`).
### LIMIT N BY Clause

The LIMIT N BY clause is unrelated to LIMIT. LIMIT N BY COLUMNS selects the top N rows for each group of COLUMNS, and the two can be used in the same query. The LIMIT N BY clause can contain any number of grouping expressions.
Example:
``` sql
SELECT
domainWithoutWWW(URL) AS domain,
domainWithoutWWW(REFERRER_URL) AS referrer,
device_type,
count() cnt
FROM hits
GROUP BY domain, referrer, device_type
ORDER BY cnt DESC
LIMIT 5 BY domain, device_type
LIMIT 100
```
The query selects the top 5 most-visited rows for each `domain, device_type` combination, with a total of no more than 100 rows (`LIMIT n BY + LIMIT`).
### HAVING Clause

The HAVING clause filters the result after GROUP BY, similar to the WHERE clause.
WHERE and HAVING differ in that WHERE is executed before aggregation (GROUP BY), while HAVING is executed after it.
If no aggregation is performed, HAVING can't be used.
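For example, to keep only the groups whose aggregate value passes a condition (a sketch reusing the `hits` table from the examples above):

``` sql
SELECT domainWithoutWWW(URL) AS domain, count() AS views
FROM hits
GROUP BY domain
HAVING views > 100
```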
<a name="query_language-queries-order_by"></a>
### ORDER BY Clause

The ORDER BY clause contains a list of expressions, each of which can be followed by DESC or ASC (the sorting direction). If the direction is not specified, ASC is assumed. ASC sorts in ascending order, and DESC in descending order. Example: `ORDER BY Visits DESC, SearchPhrase`

For sorting string values, you can specify a collation; when a collation is specified, sorting is always case-insensitive. A collation must come after ASC or DESC if they are present. Example: `ORDER BY SearchPhrase COLLATE 'tr'` – sort by keyword in ascending order using the Turkish alphabet, case-insensitive, assuming that strings are UTF-8 encoded.

We only recommend using COLLATE for sorting small numbers of rows, since sorting with COLLATE is much less efficient than normal sorting by bytes.

Rows that have identical values for the sorting expressions are output in an arbitrary, nondeterministic order (it may differ each time).
If the ORDER BY clause is omitted, the order of the rows is also undefined.
`NaN``NULL` 的排序规则:
- 当使用`NULLS FIRST`修饰符时,将会先输出`NULL`,然后是`NaN`,最后才是其他值。
- 当使用`NULLS LAST`修饰符时,将会先输出其他值,然后是`NaN`,最后才是`NULL`。
- 默认情况下与使用`NULLS LAST`修饰符相同。
示例:
假设存在如下一张表
```
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │ 2 │
│ 1 │ nan │
│ 2 │ 2 │
│ 3 │ 4 │
│ 5 │ 6 │
│ 6 │ nan │
│ 7 │ ᴺᵁᴸᴸ │
│ 6 │ 7 │
│ 8 │ 9 │
└───┴──────┘
```
Running the query `SELECT * FROM t_null_nan ORDER BY y NULLS FIRST` gives the following result:
```
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 7 │ ᴺᵁᴸᴸ │
│ 1 │ nan │
│ 6 │ nan │
│ 2 │ 2 │
│ 2 │ 2 │
│ 3 │ 4 │
│ 5 │ 6 │
│ 6 │ 7 │
│ 8 │ 9 │
└───┴──────┘
```
When floating-point numbers are sorted, NaNs end up after all other values regardless of the sorting direction. In other words, for ascending order NaNs behave as if they are larger than all other values, while for descending order they behave as if they are smaller.

Less RAM is used if a small enough LIMIT is specified in addition to ORDER BY. Otherwise, the amount of memory used is proportional to the volume of data to sort. For distributed query processing, if GROUP BY is omitted, sorting is partially done on the remote servers, and the results are merged on the requestor server. This means that for distributed sorting, the volume of data to sort can be greater than the amount of memory on a single server.

If there is not enough RAM, sorting can be performed in external memory (creating temporary files on disk). Use the `max_bytes_before_external_sort` setting for this purpose; if it is set to 0 (the default), external sorting is disabled. If it is enabled, when the volume of data to sort reaches the specified number of bytes, the data collected so far is sorted and dumped into a temporary file. After all the data is read, all the temporary files are merged into the final output. The files are written to the /var/lib/clickhouse/tmp/ directory (by default; you can change this with the 'tmp_path' configuration parameter).

Running a query may use more memory than max_bytes_before_external_sort, so this setting must be significantly smaller than max_memory_usage. For example, if your server has 128 GB of RAM for running a query, we recommend setting max_memory_usage to 100 GB and max_bytes_before_external_sort to 80 GB.

External sorting is much less efficient than sorting in RAM.
### SELECT Clause

The expressions in the SELECT clause are analyzed after all the clauses listed above are completed.
More specifically, the expressions above the aggregate functions are analyzed, if there are any aggregate functions.
The aggregate functions and everything below them are calculated during aggregation (GROUP BY).
These expressions then work as if they were applied to separate rows of the result.
### DISTINCT Clause

If DISTINCT is specified, only a single row remains out of each set of fully matching rows in the result.
The result is the same as if GROUP BY were specified across all the fields of the SELECT clause without aggregate functions. But there are several differences from GROUP BY:

- DISTINCT can be used together with GROUP BY.
- When ORDER BY is omitted and LIMIT is defined, the query stops running as soon as the required number of distinct rows has been read.
- Data blocks are output as they are processed, without waiting for the entire query to finish.

DISTINCT is not supported if the SELECT expressions contain at least one array column.

`DISTINCT` works with [NULL](syntax.md#null-literal) as if `NULL` were a specific value, and `NULL=NULL`. In other words, in the `DISTINCT` result, different combinations with `NULL` occur only once.
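A brief sketch (the table `t` and its columns are hypothetical): both queries return each distinct `(a, b)` combination once, but the `DISTINCT` variant can stream blocks and stop early when combined with `LIMIT`.

``` sql
SELECT DISTINCT a, b FROM t

SELECT a, b FROM t GROUP BY a, b
```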
### LIMIT Clause

LIMIT m selects the first m rows from the result.
LIMIT n, m selects m rows from the result, starting after the first n rows.

n and m must be non-negative integers.

If there is no ORDER BY clause that explicitly sorts the results, the result may be arbitrary and nondeterministic.
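For example, to skip the first 5 rows and return the next 10 (deterministic only in combination with `ORDER BY`; the table and column are hypothetical):

``` sql
SELECT * FROM t ORDER BY x LIMIT 5, 10
```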
### UNION ALL Clause

You can use UNION ALL to combine any number of queries. Example:
``` sql
SELECT CounterID, 1 AS table, toInt64(count()) AS c
FROM test.hits
GROUP BY CounterID
UNION ALL
SELECT CounterID, 2 AS table, sum(Sign) AS c
FROM test.visits
GROUP BY CounterID
HAVING c > 0
```
Only UNION ALL is supported. The regular UNION (UNION DISTINCT) is not supported. If you need UNION DISTINCT, you can write SELECT DISTINCT over a subquery containing UNION ALL.

Queries that are parts of UNION ALL can be run simultaneously, and their results can be mixed together.

The structure of the results (the number and type of columns) must match across the queries, but the column names can differ. In this case, the column names of the final result are taken from the first query. Type casting is performed for the union. For example, if the two combined queries contain the same field with compatible `Nullable` and non-`Nullable` types, the resulting field will be of the `Nullable` type.

Queries that are parts of UNION ALL can't be enclosed in brackets. ORDER BY and LIMIT are applied to the separate queries, not to the final result. If you need a transformation of the final result, you can put all the queries with UNION ALL in a subquery in the FROM clause.
### INTO OUTFILE Clause

The `INTO OUTFILE filename` clause (where filename is a string literal) redirects query output to the specified file.
In contrast to MySQL, the file is created on the client side. The query will fail if a file with the same name already exists.
This functionality is available in the command-line client and in clickhouse-local (a query sent via the HTTP interface will fail).

The default output format is TabSeparated (the same as in the command-line client's batch mode).
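A sketch of redirecting a result to a local file from the command-line client (the file name is illustrative):

``` sql
SELECT * FROM hits WHERE EventDate = toDate('2013-01-29') INTO OUTFILE 'hits_2013-01-29.tsv'
```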
### FORMAT Clause

The 'FORMAT format' clause specifies the format of the returned data.
You can use it for convenience, or for creating dumps.
For more information, see the "Input/output formats" section.

If the FORMAT clause is omitted, the default format is used, which depends on both the settings and the client used for accessing the DB. For the HTTP interface and the command-line client in batch mode, the default format is TabSeparated. For the command-line client in interactive mode, the default format is PrettyCompact (it produces compact, human-readable tables).

When using the command-line client, data is passed between the server and the client in an internal efficient format. The client independently interprets the FORMAT clause of the query and formats the data itself (thus relieving the network and the server of the extra load).
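For instance, the same query can return its result as CSV instead of the default format (a sketch over the `hits` table used in earlier examples):

``` sql
SELECT UserID, count() FROM hits GROUP BY UserID FORMAT CSV
```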
<a name="query_language-in_operators"></a>
### IN Operators

The `IN`, `NOT IN`, `GLOBAL IN`, and `GLOBAL NOT IN` operators are covered separately, since their functionality is quite rich.

The left side of the operator is either a single column or a tuple.

Examples:
``` sql
SELECT UserID IN (123, 456) FROM ...
SELECT (CounterID, UserID) IN ((34, 123), (101500, 456)) FROM ...
```
If the left side is a single column that is in the index, and the right side is a set of constants, the system uses the index for processing the query.

Don't list too many values explicitly (i.e. millions). If a data set is large, put it in a temporary table (for example, see the section "External data for query processing"), then use a subquery.

The right side of the operator can be a list of tuples of constant expressions (as in the examples above), the name of a database table, or a SELECT subquery in brackets.

If the right side of the operator is the name of a table (for example, `UserID IN users`), this is equivalent to the subquery `UserID IN (SELECT * FROM users)`. Use this when working with external data that is sent along with the query. For example, the query can be sent together with a temporary 'users' table loaded with the user IDs that the result should be filtered by.

If the right side of the operator is a table with the Set engine (a prepared data set that is always in RAM), the data set will not be created anew for each query.

The subquery may specify more than one column for filtering tuples.
Example:
``` sql
SELECT (CounterID, UserID) IN (SELECT CounterID, UserID FROM ...) FROM ...
```
The left and right sides of the IN operator should have the same type.

The IN operator and its subquery may occur in any part of the query, including in aggregate functions and lambda functions.

Example:
``` sql
SELECT
EventDate,
avg(UserID IN
(
SELECT UserID
FROM test.hits
WHERE EventDate = toDate('2014-03-17')
)) AS ratio
FROM test.hits
GROUP BY EventDate
ORDER BY EventDate ASC
```
```
┌──EventDate─┬────ratio─┐
│ 2014-03-17 │ 1 │
│ 2014-03-18 │ 0.807696 │
│ 2014-03-19 │ 0.755406 │
│ 2014-03-20 │ 0.723218 │
│ 2014-03-21 │ 0.697021 │
│ 2014-03-22 │ 0.647851 │
│ 2014-03-23 │ 0.648416 │
└────────────┴──────────┘
```
For each day after March 17th, this counts the percentage of pageviews made by users who visited the site on March 17th.

A subquery in the IN clause is always run only once, on a single server. There are no dependent subqueries.
#### NULL Processing

During request processing, the IN operator assumes that the result of an operation with [NULL](syntax.md#null-literal) is always `0`, regardless of whether `NULL` is on the left or right side of the operator. `NULL` values are not included in any data set, do not correspond to each other, and cannot be compared.

Here is an example with the `t_null` table:
```
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │ 3 │
└───┴──────┘
```
Running the query `SELECT x FROM t_null WHERE y IN (NULL,3)` gives the following result:
```
┌─x─┐
│ 2 │
└───┘
```
You can see that the row with `y = NULL` is absent from the query result. This is because ClickHouse can't decide whether `NULL` is included in the `(NULL,3)` set, returns `0` as the result of the comparison, and `SELECT` excludes this row from the final output.
```
SELECT y IN (NULL, 3)
FROM t_null
┌─in(y, tuple(NULL, 3))─┐
│ 0 │
│ 1 │
└───────────────────────┘
```
<a name="queries-distributed-subqueries"></a>
#### Distributed Subqueries

There are two options for INs with subqueries (and for JOINs): the normal `IN` / `JOIN`, and `GLOBAL IN` / `GLOBAL JOIN`. They differ in how they are run for distributed query processing.

!!! info "Note"
    Remember that the algorithms described below may work differently depending on the [settings](../operations/settings/settings.md#settings-distributed_product_mode) configuration.

When using the normal IN, the query is sent to remote servers, and each of them runs the subqueries in the `IN` or `JOIN` clause.

When using `GLOBAL IN` / `GLOBAL JOIN`, first all the subqueries for `GLOBAL IN` / `GLOBAL JOIN` are run, and the results are collected into temporary tables. Then the temporary tables are sent to each remote server, where the queries are run using this temporary data.

For a non-distributed query, use the normal `IN` / `JOIN`.

Be careful when using subqueries in the `IN` / `JOIN` clauses for distributed query processing.
Let's look at some examples. Assume that each server in the cluster has a normal **local_table** table, as well as a **distributed_table** Distributed table that looks at all the servers in the cluster.

For a query to **distributed_table**, the query is sent to all the remote servers and run on them using **local_table**.
For example, the query
``` sql
SELECT uniq(UserID) FROM distributed_table
```
is sent to all the remote servers as
``` sql
SELECT uniq(UserID) FROM local_table
```
and run on each of them in parallel until it reaches the stage where intermediate results can be combined. Then the intermediate results are returned to the requestor server and merged on it, and the final result is sent to the client.

Now let's examine a query with IN:
``` sql
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM local_table WHERE CounterID = 34)
```
- calculating the overlap in the audiences of two sites.

This query is sent to all the remote servers as
``` sql
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM local_table WHERE CounterID = 34)
```
In other words, the data set in the IN clause is collected on each server independently, only across the data that is stored locally on each of the servers.

This works correctly and optimally if you have spread the data across the cluster servers such that the data for a single UserID resides entirely on a single server. In that case, all the necessary data is available locally on each server. Otherwise, the result will be inaccurate. We refer to this variation of the query as "local IN".

To correct how the query works when data is spread randomly across the cluster servers, you could specify **distributed_table** inside the subquery. The query would look like this:
``` sql
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```
This query is sent to all the remote servers as
``` sql
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```
The subquery will begin running on each remote server. Since the subquery uses a distributed table, the subquery on each remote server is resent to every remote server as
``` sql
SELECT UserID FROM local_table WHERE CounterID = 34
```
For example, if you have a cluster of 100 servers, executing the entire query will require 10,000 elementary requests, which is generally considered unacceptable.

In such cases, you should use GLOBAL IN instead of IN. Let's look at how it works for the query
``` sql
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID GLOBAL IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```
The requestor server runs the subquery
``` sql
SELECT UserID FROM distributed_table WHERE CounterID = 34
```
and the result is put in a temporary table in RAM. Then the request is sent to each remote server as
``` sql
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID GLOBAL IN _data1
```
The temporary table `_data1` is sent to every remote server together with the query (the name of the temporary table is implementation-defined).

This is more optimal than using the normal IN, but keep the following points in mind:

1. When creating a temporary table, the data is not made unique. To reduce the volume of data transmitted over the network, specify DISTINCT in the subquery. (You don't need to do this for a normal IN.)
2. The temporary table is sent to all the remote servers, and transmission does not account for network topology. For example, if 10 remote servers reside in a datacenter that is very remote from the requestor server, the data will be sent 10 times over the channel to the remote datacenter. Try to avoid large data sets when using GLOBAL IN.
3. When transmitting data to remote servers, restrictions on network bandwidth are not configurable, so you might overload the network.
4. Try to distribute data across servers so that you don't need to use GLOBAL IN on a regular basis.
5. If you need to use GLOBAL IN often, plan the location of the ClickHouse cluster so that replicas do not span multiple datacenters and there is a fast network between them, so that a query can be processed entirely within a single datacenter.
It also makes sense to specify a local table in the `GLOBAL IN` clause, in case this local table is only available on the requestor server and you want to use its data on remote servers.
### Extreme Values

In addition to the results, you can also get the minimum and maximum values of the result columns. To do this, set the **extremes** setting to 1. Minimums and maximums are calculated for numeric types and dates; for other columns, the default values are output.

These extra two rows – the minimums and the maximums – are output in the JSON\*, TabSeparated\*, and Pretty\* formats, separate from the other rows. They are not output in the other formats.

In the JSON\* formats, the extreme values are output in a separate 'extremes' field. In the TabSeparated\* formats, the rows come after the main result and after 'totals' if present, separated by an empty row. In the Pretty\* formats, the rows are output as a separate table after the main result and after 'totals' if present.

Extreme values are calculated for the rows that have passed through LIMIT, including the rows skipped by the offset. In stream requests, the result may also include a small number of rows that passed beyond LIMIT.
### Notes

Unlike MySQL (but conforming to standard SQL), the `GROUP BY` and `ORDER BY` clauses do not support positional arguments.
For example, `GROUP BY 1, 2` will be interpreted as grouping by constants (i.e., aggregation of all rows into one).
You can use AS aliases in any part of a query.

You can put an asterisk in any part of a query instead of an expression. When the query is analyzed, the asterisk is expanded to the list of all table columns (excluding the `MATERIALIZED` and `ALIAS` columns). There are only a few cases when using an asterisk is justified:

- When creating a table dump.
- For tables containing just a few columns, such as system tables.
- For getting information about what columns are in a table. In this case, use `LIMIT 1`. But it is better to use the `DESC TABLE` query.
- When there is strong filtration on a small number of columns using `PREWHERE`.
- In subqueries (since columns that aren't needed by the external query are excluded from subqueries).

In all other cases, we don't recommend using the asterisk, since it only gives you the drawbacks of a column-oriented DBMS instead of its advantages.
[Original article](https://clickhouse.yandex/docs/en/query_language/select/) <!--hide-->
@ -409,6 +409,7 @@ img {
}
#index_ul {
padding-bottom: 30px;
padding-left: 0;
margin: 0 0 30px -16px;
font-size: 90%;


@ -3,6 +3,7 @@
<head>
<meta charset="utf-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta name="viewport" content="width=device-width,initial-scale=1">
<title>ClickHouse — open source distributed column-oriented DBMS</title>