#include #include #include #include #include #include

/** A function to extract text from HTML or XHTML.
  * It does not necessarily conform 100% to any of the HTML, XML or XHTML standards,
  * but the implementation is reasonably accurate and it is fast.
  *
  * The rules are the following:
  *
  * 1. Comments are skipped.
  *    A comment starts with <!-- and must end with -->. Nested comments are not possible.
  *    Note: malformed comment-like constructions are not valid comments in HTML
  *    but will be skipped by other rules.
  *
  * 2. CDATA is pasted verbatim.
  *    Note: CDATA is XML/XHTML specific, but we still process it as a "best-effort" approach.
  *
  * 3. 'script' and 'style' elements are removed with all their content.
  *    Note: it's assumed that a closing tag cannot appear inside the content.
  *    For example, in a JS string literal it has to be escaped as "<\/script>".
  *    Note: comments and CDATA are possible inside script or style - then closing tags
  *    are not searched inside CDATA (up to the terminating ]]>),
  *    but they are still searched inside comments. Sometimes it becomes complicated,
  *    e.g. when a JS string literal contains a comment delimiter: var y = "-->"; alert(x + y);
  *    Note: script and style can be the names of XML namespaces (tag names with a
  *    script: or style: prefix) - then they are not treated like the usual script or style.
  *    Note: whitespace is possible after the closing tag name: </script > but not before: < / script>.
  *
  * 4. Other tags or tag-like elements are skipped, but their inner content is kept.
  *    Note: it's expected that this HTML is illegal:
  *    Note: it will also skip tag-like constructs such as <>.
  *    Note: a tag without an end will be skipped to the end of input.
  *
  * 5. HTML and XML entities are not decoded.
  *    They should be processed by a separate function.
  *
  * 6. Whitespace in text is collapsed or inserted by specific rules.
  *    Whitespace at the beginning and at the end is removed.
  *    Consecutive whitespace characters are collapsed.
  *    But if text is separated by other elements and there is no whitespace, it is inserted.
  *    It may be unnatural: when "Hello" and "world" are separated only by markup,
  *    in HTML there will be no whitespace, but the function will insert it.
  *    But also consider "Hello" and "world" separated by an element that renders
  *    as a paragraph or a line break.
  *    This behaviour is reasonable for data analysis, e.g. converting HTML to a bag of words.
  *
  * 7. Also note that correct handling of whitespace would require
  *    support of preformatted elements and the CSS display and white-space properties.
  *
  * Usage example:
  *
  * SELECT extractTextFromHTML(html) FROM url('https://yandex.ru/', RawBLOB, 'html String')
  *
  * - ClickHouse has an embedded web browser.
  */

namespace DB
{

namespace ErrorCodes
{
    extern const int ILLEGAL_COLUMN;
    extern const int ILLEGAL_TYPE_OF_ARGUMENT;
}

namespace
{

/// True if `prefix` occurs at `s` and at least one more byte follows it before `end`
/// (note the strict comparison).
inline bool startsWith(const char * s, const char * end, const char * prefix)
{
    return s + strlen(prefix) < end && 0 == memcmp(s, prefix, strlen(prefix));
}

/// If the input at `s` starts with `prefix`, advance `s` past it and return true.
inline bool checkAndSkip(const char * __restrict & s, const char * end, const char * prefix)
{
    if (startsWith(s, end, prefix))
    {
        s += strlen(prefix);
        return true;
    }
    return false;
}

bool processComment(const char * __restrict & src, const char * end)
{
    if (!checkAndSkip(src, end, "