mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-09-21 01:00:48 +00:00
Some minor changes: words, formatting
parent
80508356c0
commit
97188f5b85
@ -656,23 +656,22 @@ A function to extract text from HTML or XHTML.
It does not necessarily conform 100% to any of the HTML, XML or XHTML standards, but the implementation is reasonably accurate and fast. The rules are the following:
1. Comments are skipped. Example: `<!-- test -->`. A comment must end with `-->`. Nested comments are not possible.
Note: constructions like `<!-->` `<!--->` are not valid comments in HTML but will be skipped by other rules.
Note: constructions like `<!-->` and `<!--->` are not valid comments in HTML but they will be skipped by other rules.
2. CDATA is pasted verbatim. Note: CDATA is XML/XHTML-specific, but it is still processed as a best-effort approach.
3. `script` and `style` elements are removed with all their content. Note: it's assumed that closing tag cannot appear inside content. For example, in JS string literal is has to be escaped as `"<\/script>"`.
Note: comments and CDATA is possible inside script or style - then closing tags are not searched inside CDATA. Example: `<script><![CDATA[</script>]]></script>` But still searched inside comments. Sometimes it becomes complicated: `<script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>`
Note: script and style can be the names of XML namespaces - then they are not treat like usual script or style. Example: `<script:a>Hello</script:a>`.
3. `script` and `style` elements are removed with all their content. Note: it is assumed that the closing tag cannot appear inside the content. For example, in a JS string literal it has to be escaped like `"<\/script>"`.
Note: comments and CDATA are possible inside `script` or `style`; closing tags are then not searched for inside CDATA. Example: `<script><![CDATA[</script>]]></script>`. But they are still searched for inside comments. Sometimes it becomes complicated: `<script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>`
Note: `script` and `style` can be the names of XML namespaces - then they are not treated like usual `script` or `style` elements. Example: `<script:a>Hello</script:a>`.
Note: whitespace is possible after the closing tag name: `</script >`, but not before: `< / script>`.
4. Other tags or tag-like elements are skipped, but their inner content is kept. Example: `<a>.</a>`
Note: it's expected that this HTML is illegal: `<a test=">"></a>`
Note: it will also skip tag-like constructs such as `<>`, `<!>`, etc.
Note: a tag without an end is skipped to the end of input: `<hello `
5. HTML and XML entities are not decoded. It should be processed by separate function.
6. Whitespaces in text are collapsed or inserted by specific rules.
Whitespaces at beginning and at the end are removed.
Consecutive whitespaces are collapsed.
But if text is separated by other elements and there is no whitespace, it is inserted.
It may be unnatural, examples: `Hello<b>world</b>`, `Hello<!-- -->world` - in HTML there will be no whitespace, but the function will insert it. But also consider: `Hello<p>world</p>`, `Hello<br>world`.
This behaviour is reasonable for data analysis, e.g. convert HTML to a bag of words.
5. HTML and XML entities are not decoded. They must be processed by a separate function.
6. Whitespaces in the text are collapsed or inserted by specific rules.
- Whitespaces at the beginning and at the end are removed.
- Consecutive whitespaces are collapsed.
- But if the text is separated by other elements and there is no whitespace, it is inserted.
- It may look unnatural. Examples: `Hello<b>world</b>`, `Hello<!-- -->world`: in HTML there will be no whitespace, but the function will insert it. But also consider: `Hello<p>world</p>`, `Hello<br>world`. This behavior is reasonable for data analysis, e.g. to convert HTML to a bag of words.
7. Also note that correct handling of whitespace would require support for `<pre></pre>` and the CSS `display` and `white-space` properties.
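The rules above can be approximated in a few lines of Python. This is only a rough sketch for illustration, not the ClickHouse implementation: the name `extract_text_sketch` is hypothetical, whitespace inside CDATA is collapsed along with everything else, CDATA and comments inside `script`/`style` are not treated specially, and an unterminated comment, CDATA section or `script`/`style` element is assumed to run to the end of input.

```python
import re

# Opening <script or <style, followed by whitespace, '>' or '/'.
# The lookahead keeps namespaced names such as <script:a> out of this rule.
_OPEN_RAW = re.compile(r"<(script|style)(?=[\s>/])", re.IGNORECASE)


def extract_text_sketch(html: str) -> str:
    """Simplified illustration of rules 1-6 (hypothetical helper)."""
    pieces = []
    i, n = 0, len(html)
    while i < n:
        lt = html.find("<", i)
        if lt == -1:
            pieces.append(html[i:])  # trailing plain text
            break
        pieces.append(html[i:lt])  # plain text before the next '<'
        raw = _OPEN_RAW.match(html, lt)
        if html.startswith("<!--", lt) and not html.startswith(("<!-->", "<!--->"), lt):
            # Rule 1: skip the comment up to '-->' (assumption: to the end if unterminated).
            end = html.find("-->", lt + 4)
            i = n if end == -1 else end + 3
            pieces.append(" ")
        elif html.startswith("<![CDATA[", lt):
            # Rule 2: paste the CDATA content as text.
            end = html.find("]]>", lt + 9)
            pieces.append(html[lt + 9:] if end == -1 else html[lt + 9:end])
            i = n if end == -1 else end + 3
        elif raw:
            # Rule 3: drop script/style with all content; whitespace is allowed
            # after the closing tag name, but not before it.
            closing = re.compile(r"</" + raw.group(1) + r"\s*>", re.IGNORECASE)
            found = closing.search(html, raw.end())
            i = n if found is None else found.end()
            pieces.append(" ")
        else:
            # Rule 4: skip the tag itself; a tag without '>' runs to the end of input.
            gt = html.find(">", lt + 1)
            i = n if gt == -1 else gt + 1
            pieces.append(" ")
    # Rule 6: every skipped element became a space; collapse runs and trim the ends.
    # Rule 5: entities are left untouched.
    return " ".join("".join(pieces).split())
```

Replacing each skipped element with a single space and normalizing at the end is the simplest way to get the "insert whitespace between separated text" behavior; a streaming implementation would track this with a flag instead.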
**Syntax**