9 Commits

Author SHA1 Message Date
Azat Khuzhin
5fe44f2736 Fix lowerUTF8()/upperUTF8() when a symbol straddles a 16-byte boundary
In lowerUTF8()/upperUTF8() there is an SSE optimization that handles
16 bytes at a time, but only for ASCII; UTF-8 symbols are converted
one symbol at a time.

Consider the following example:

    КВ АМ И СЖ
             ^ - offset is 15, length of sequence is 2
                 so first byte of a symbol is in first 16 bytes
                 second byte of a symbol is not there

In this case the symbol is handled incorrectly, because the code tries to
process only these 16 bytes without looking ahead.

This was broken by #41286; before that change the code did not look at
the row boundaries but only at the string end, so this situation
was not possible.
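
The boundary condition from the example can be checked with a small standalone sketch. It is not the ClickHouse implementation: the names utf8SeqLength, findStraddlingSymbol and kExample are hypothetical, and the sketch only looks at the first block of a string and reads the sequence length from the lead byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// "КВ АМ И СЖ" written as explicit UTF-8 bytes (17 bytes in total).
const std::string_view kExample =
    "\xD0\x9A\xD0\x92 \xD0\x90\xD0\x9C \xD0\x98 \xD0\xA1\xD0\x96";

// Sequence length implied by a lead byte; continuation or invalid bytes count as 1.
inline std::size_t utf8SeqLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Returns the offset of the first symbol that starts inside the first
// `block` bytes but ends beyond them, or npos if there is no such symbol.
inline std::size_t findStraddlingSymbol(std::string_view s, std::size_t block = 16)
{
    assert(block > 0);
    std::size_t pos = 0;
    while (pos < s.size() && pos < block)
    {
        const std::size_t len = utf8SeqLength(static_cast<unsigned char>(s[pos]));
        if (pos + len > block)
            return pos;  // the tail of this symbol lies in the next block
        pos += len;
    }
    return std::string_view::npos;
}
```

For kExample the function returns 15, the offset from the diagram; a pure-ASCII string never straddles a block, so it yields npos.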

Fixes: #42756
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-10-31 08:17:38 +01:00
Azat Khuzhin
32febf5155 Remove dead code in LowerUpperUTF8Impl::array()
AFAICS it was needed earlier because it was possible to overrun
expected_end: utf8.convert() was called with "src_end - src" rather than
"expected_end - src".

Refs: 5a21f3908b054a0efc90c65a12fbe151c74d90dc:dbms/include/DB/Functions/FunctionsString.h
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-10-31 08:00:29 +01:00
Azat Khuzhin
01b5b9ceba Do not allow invalid sequences to influence other rows in lowerUTF8/upperUTF8
Right now lowerUTF8() and upperUTF8() do not respect row boundaries,
and so one row may break another.
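
A rough illustration of how one row can affect another, assuming rows are stored back to back in a single byte buffer and that a decoder trusts the length encoded in the lead byte. The names leadByteLength, consumedBytes, kTwoRows and kRow0End are hypothetical and do not come from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Two rows in one buffer: row 0 is "a\xD0" (ends with a truncated 2-byte
// sequence), row 1 is "bc". The literal is split so "\xD0" is not merged
// with the following hex-like characters.
const std::string_view kTwoRows = "a\xD0" "bc";
const std::size_t kRow0End = 2;  // row 0 occupies bytes [0, 2)

// Sequence length implied by a lead byte; anything else counts as 1.
inline std::size_t leadByteLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Bytes a decoder would consume for the symbol at `pos` when it is
// allowed to read up to, but not including, `limit`.
inline std::size_t consumedBytes(std::string_view buf, std::size_t pos, std::size_t limit)
{
    assert(pos < limit && limit <= buf.size());
    const std::size_t len = leadByteLength(static_cast<unsigned char>(buf[pos]));
    return std::min(len, limit - pos);
}
```

With the whole buffer as the limit, the truncated sequence at offset 1 consumes 2 bytes and swallows the first byte of row 1; with kRow0End as the limit it consumes only 1 byte and row 1 stays intact.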

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-09-17 11:16:45 +02:00
Maksim Kita
538f8cbaad Fix clang-tidy warnings in Disks, Formats, Functions folders 2022-03-14 18:17:35 +00:00
Alexey Milovidov
0fa5142715 Remove tons of garbage 2021-01-31 05:36:52 +03:00
Alexey Milovidov
d69af4333d Better asserts 2021-01-28 03:46:12 +03:00
Alexey Milovidov
95e15131a8 Fix insufficient args check (trash code) in StringSearcher 2021-01-27 20:32:59 +03:00
Alexey Milovidov
269b6383f5 Check for #pragma once in headers 2020-10-10 21:37:02 +03:00
Ivan Lezhankin
06446b4f08 dbms/ → src/ 2020-04-03 18:14:31 +03:00