#include
#include
#include
#include
#include
#include

/** A function to extract text from HTML or XHTML.
  * It does not necessarily conform 100% to any of the HTML, XML or XHTML standards,
  * but the implementation is reasonably accurate and it is fast.
  *
  * The rules are the following:
  *
  * 1. Comments are skipped. Example: <!-- test -->
  * Comment must end with -->. Nested comments are not possible.
  * Note: constructions like <!--> <!---> are not valid comments in HTML but will be skipped by other rules.
  *
  * 2. CDATA is pasted verbatim.
  * Note: CDATA is XML/XHTML specific. But we still process it for "best-effort" approach.
  *
  * 3. 'script' and 'style' elements are removed with all their content.
  * Note: it's assumed that closing tag cannot appear inside content.
  * For example, in a JS string literal it has to be escaped as "<\/script>".
  * Note: comments and CDATA are possible inside script or style - then closing tags are not searched inside CDATA.
  * Example: <script><![CDATA[</script>]]></script>
  * But still searched inside comments. Sometimes it becomes complicated:
  * <script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>
  * Note: script and style can be the names of XML namespaces - then they are not treated like usual script or style.
  * Example: <script:a>Hello</script:a>.
  * Note: whitespaces are possible after closing tag name: </script > but not before: < / script>.
  *
  * 4. Other tags or tag-like elements are skipped without inner content.
  * Example: <a>.</a>
  * Note: it's expected that this HTML is illegal: <a test=">"></a>
  * Note: it will also skip something like tags: <>, <!>, etc.
  * Note: tag without end will be skipped to the end of input: <hello
  *
  * 5. HTML and XML entities are not decoded.
  * It should be processed by separate function.
  *
  * 6. Whitespaces in text are collapsed or inserted by specific rules.
  * Whitespaces at beginning and at the end are removed.
  * Consecutive whitespaces are collapsed.
  * But if text is separated by other elements and there is no whitespace, it is inserted.
  * It may be unnatural, examples: Hello<b>world</b>, Hello<!-- -->world
  * - in HTML there will be no whitespace, but the function will insert it.
  * But also consider: Hello<p>world</p>, Hello<br>world.
  * This behaviour is reasonable for data analysis, e.g. convert HTML to a bag of words.
  *
  * 7. Also note that correct handling of whitespaces would require
  * support of <pre></pre> and CSS display and white-space properties.
  *
  * Usage example:
  *
  * SELECT extractTextFromHTML(html) FROM url('https://github.com/ClickHouse/ClickHouse', RawBLOB, 'html String')
  *
  * - ClickHouse has an embedded web browser.
  */

namespace DB
{

namespace ErrorCodes
{
    extern const int ILLEGAL_COLUMN;
    extern const int ILLEGAL_TYPE_OF_ARGUMENT;
}

namespace
{

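/// Returns true if the range [s, end) begins with `prefix`.
/// Note the strict `<`: at least one byte must remain after the prefix,
/// so a prefix that ends exactly at `end` does not match.
inline bool startsWith(const char * s, const char * end, const std::string_view prefix)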
{
    return s + prefix.length() < end && 0 == memcmp(s, prefix.data(), prefix.length());
}

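/// If the input at `s` starts with `prefix` (as defined by startsWith), advances `s` past it and returns true.
/// Otherwise leaves `s` unchanged and returns false.
inline bool checkAndSkip(const char * __restrict & s, const char * end, const std::string_view prefix)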
{
    if (startsWith(s, end, prefix))
    {
        s += prefix.length();
        return true;
    }
    return false;
}

bool processComment(const char * __restrict & src, const char * end)
{
    if (!checkAndSkip(src, end, "