Merge pull request #1844 from bocharov/master

Fix uniqHLL12 and uniqCombined for cardinalities 100M+.
2024-11-22 07:31:57 +00:00 · 2018-02-08 20:01:45 +03:00 · 2018-02-08 20:01:45 +03:00 · b7d0ae49fd
commit b7d0ae49fd
parent 3bb75a9b6e 9963e2f160
6 changed files with 81 additions and 12 deletions
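
The Python test script added below compares linear-counting estimates built from two different 12-bit slices of `intHash32`. As a minimal sketch of that estimator on its own, the estimate is m * ln(m / z), where m is the number of buckets and z is the number of buckets that are still empty. The function name `linear_counting_estimate` and the explicit empty-bucket check are assumptions made for illustration and are not part of this commit.

```python
import math


def linear_counting_estimate(num_buckets, occupied_buckets):
    """Linear counting estimate m * ln(m / z), where z is the empty-bucket count.

    Hypothetical helper for illustration only; not taken from the commit.
    """
    empty = num_buckets - occupied_buckets
    if empty <= 0:
        # With no empty buckets the formula divides by zero, so reject it.
        raise ValueError("linear counting needs at least one empty bucket")
    return num_buckets * math.log(num_buckets / empty)


# m = 4096 corresponds to precision p = 12, as in the test script.
print(linear_counting_estimate(4096, 0))     # no buckets occupied
print(linear_counting_estimate(4096, 2048))  # half of the buckets occupied
```

The explicit check matters because once every bucket is occupied the formula has no finite value, so a caller has to fall back to a different estimator at that point.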
--- a/dbms/scripts/test_intHash32_for_linear_counting.py
+++ b/dbms/scripts/test_intHash32_for_linear_counting.py
@@ -0,0 +1,56 @@
+#!/usr/bin/python3
+import sys
+import math
+import statistics as stat
+
+start = int(sys.argv[1])
+end = int(sys.argv[2])
+
+#Copied from dbms/src/Common/HashTable/Hash.h
+def intHash32(key, salt = 0):
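+    key ^= salt

+    key = ((~key) + (key << 18)) & 0xffffffffffffffff  # mask to emulate UInt64 wraparound
+    key = key ^ (((key >> 31) | (key << 33)) & 0xffffffffffffffff)
+    key = (key * 21) & 0xffffffffffffffff
+    key = key ^ (((key >> 11) | (key << 53)) & 0xffffffffffffffff)
+    key = (key + (key << 6)) & 0xffffffffffffffff
+    key = key ^ (((key >> 22) | (key << 42)) & 0xffffffffffffffff)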
+
+    return key & 0xffffffff
+
+#Number of buckets for precision p = 12, m = 2^p
+m = 4096
+n = start
+c = 0
+m1 = {}
+m2 = {}
+l1 = []
+l2 = []
+while n <= end:
+    c += 1
+
+    h = intHash32(n)
+    #Extract left most 12 bits
+    x1 = (h >> 20) & 0xfff
+    m1[x1] = 1
+    z1 = m - len(m1)
+    #Linear counting formula
+    u1 = int(m * math.log(float(m) / float(z1)))
+    e1 = abs(100*float(u1 - c)/float(c))
+    l1.append(e1)
+    print("%d %d %d %f" % (n, c, u1, e1))
+
+    #Extract right most 12 bits
+    x2 = h & 0xfff
+    m2[x2] = 1
+    z2 = m - len(m2)
+    u2 = int(m * math.log(float(m) / float(z2)))
+    e2 = abs(100*float(u2 - c)/float(c))
+    l2.append(e2)
+    print("%d %d %d %f" % (n, c, u2, e2))
+
+    n += 1
+
+print("Left 12 bits error: min=%f max=%f avg=%f median=%f median_low=%f median_high=%f" % (min(l1), max(l1), stat.mean(l1), stat.median(l1), stat.median_low(l1), stat.median_high(l1)))
+print("Right 12 bits error: min=%f max=%f avg=%f median=%f median_low=%f median_high=%f" % (min(l2), max(l2), stat.mean(l2), stat.median(l2), stat.median_low(l2), stat.median_high(l2)))
--- a/dbms/scripts/test_uniq_functions.sh
+++ b/dbms/scripts/test_uniq_functions.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+for ((p = 2; p <= 10; p++))
+do
+    for ((i = 1; i <= 9; i++))
+    do
+        n=$(( 10**p * i ))
+        echo -n "$n "
+        clickhouse-client -q "select uniqHLL12(number), uniq(number), uniqCombined(number) from numbers($n);"
+    done
+done
--- a/dbms/src/Common/HyperLogLogCounter.h
+++ b/dbms/src/Common/HyperLogLogCounter.h
@@ -305,7 +305,7 @@ public:
        update(bucket, rank);
    }

-    UInt32 size() const
+    UInt64 size() const
    {
        /// Normalizing factor for harmonic mean.
        static constexpr double alpha_m =
@@ -323,7 +323,7 @@ public:

        double final_estimate = fixRawEstimate(raw_estimate);

-        return static_cast<UInt32>(final_estimate + 0.5);
+        return static_cast<UInt64>(final_estimate + 0.5);
    }

    void merge(const HyperLogLogCounter & rhs)
@@ -443,13 +443,12 @@ private:
            return applyBiasCorrection(raw_estimate);
        else if (mode == HyperLogLogMode::FullFeatured)
        {
-            static constexpr bool fix_big_cardinalities = std::is_same_v<HashValueType, UInt32>;
            static constexpr double pow2_32 = 4294967296.0;

            double fixed_estimate;

-            if (fix_big_cardinalities && (raw_estimate > (pow2_32 / 30.0)))
-                fixed_estimate = -pow2_32 * log(1.0 - raw_estimate / pow2_32);
+            if (raw_estimate > (pow2_32 / 30.0))
+                fixed_estimate = raw_estimate;
            else
                fixed_estimate = applyCorrection(raw_estimate);

--- a/dbms/src/Common/HyperLogLogWithSmallSetOptimization.h
+++ b/dbms/src/Common/HyperLogLogWithSmallSetOptimization.h
@@ -80,7 +80,7 @@ public:
            large->insert(value);
    }

-    UInt32 size() const
+    UInt64 size() const
    {
        return !isLarge() ? small.size() : large->size();
    }
--- a/docs/en/agg_functions/reference.md
+++ b/docs/en/agg_functions/reference.md
@@ -129,6 +129,8 @@ This algorithm is also very accurate for data sets with small cardinality (up to

 The result is determinate (it doesn't depend on the order of query processing).

+This function provides excellent accuracy even for data sets with huge cardinality (10B+ elements) and is recommended for use by default.
+
 ## uniqCombined(x)

 Calculates the approximate number of different values of the argument. Works for numbers, strings, dates, date-with-time, and for multiple arguments and tuple arguments.
@@ -137,16 +139,16 @@ A combination of three algorithms is used: array, hash table and [HyperLogLog](h

 The result is determinate (it doesn't depend on the order of query processing).

-The `uniqCombined` function is a good default choice for calculating the number of different values.
+The `uniqCombined` function is a good default choice for calculating the number of different values, but keep in mind that for data sets with large cardinality (200M+) the estimation error only grows, and for data sets with huge cardinality (1B+ elements) it returns a highly inaccurate result.

 ## uniqHLL12(x)

 Uses the [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm to approximate the number of different values of the argument.
-2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB.
+2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (error up to ~10%) for data sets of small cardinality (<10K elements). For data sets with large cardinality (10K - 100M) the result is quite accurate (error up to ~1.6%); beyond that the estimation error only grows, and for data sets with huge cardinality (1B+ elements) it returns a highly inaccurate result.

 The result is determinate (it doesn't depend on the order of query processing).

-In most cases, use the  `uniq` or `uniqCombined` function.
+This function is not recommended for use; in most cases, use the `uniq` or `uniqCombined` function.

 ## uniqExact(x)

--- a/docs/ru/agg_functions/reference.md
+++ b/docs/ru/agg_functions/reference.md
@@ -139,6 +139,8 @@ GROUP BY timeslot

 The result is deterministic (it doesn't depend on the order of query processing).

+This function provides excellent accuracy even for data sets with huge cardinality (10B+ elements) and is recommended for use by default.
+

 ## uniqCombined(x)

@@ -148,17 +150,17 @@ GROUP BY timeslot

 The result is deterministic (it doesn't depend on the order of query processing).

-The `uniqCombined` function is a good default choice for counting the number of different values.
+The `uniqCombined` function is a good default choice for counting the number of different values, but keep in mind that for data sets with large cardinality (200M+) the estimation error only grows, and for data sets with huge cardinality (1B+ elements) the function returns a highly inaccurate result.


 ## uniqHLL12(x)

 Approximately calculates the number of different values of the argument using the [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm.
-2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB.
+2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not accurate (error up to ~10%) for small data sets (<10K elements). For data sets with large cardinality (10K - 100M) the result is fairly accurate (error up to ~1.6%); starting from 100M the estimation error only grows, and for data sets with huge cardinality (1B+ elements) the function returns a highly inaccurate result.

 The result is deterministic (it doesn't depend on the order of query processing).

-In most cases, use the `uniq` or `uniqCombined` function.
+This function is not recommended for use; in most cases, use the `uniq` or `uniqCombined` function.


 ## uniqExact(x)