Some minor changes: words, formatting

2024-09-21 01:00:48 +00:00 · 2021-04-06 19:10:22 +00:00 · 2021-04-06 19:10:22 +00:00 · 97188f5b85
commit 97188f5b85
parent 80508356c0
1 changed files with 10 additions and 11 deletions
--- a/docs/en/sql-reference/functions/string-functions.md
+++ b/docs/en/sql-reference/functions/string-functions.md
@@ -656,23 +656,22 @@ A function to extract text from HTML or XHTML.
 It does not necessarily 100% conforms to any of the HTML, XML or XHTML standards, but the implementation is reasonably accurate and it is fast. The rules are the following:

 1. Comments are skipped. Example: `<!-- test -->`. Comment must end with `-->`. Nested comments are not possible.
-Note: constructions like `<!-->` `<!--->` are not valid comments in HTML but will be skipped by other rules.
+Note: constructions like `<!-->` and `<!--->` are not valid comments in HTML but they will be skipped by other rules.
 2. CDATA is pasted verbatim. Note: CDATA is XML/XHTML specific. But we still process it for "best-effort" approach.
-3. `script` and `style` elements are removed with all their content. Note: it's assumed that closing tag cannot appear inside content. For example, in JS string literal is has to be escaped as `"<\/script>"`.
-Note: comments and CDATA is possible inside script or style - then closing tags are not searched inside CDATA. Example: `<script><![CDATA[</script>]]></script>` But still searched inside comments. Sometimes it becomes complicated: `<script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>`
-Note: script and style can be the names of XML namespaces - then they are not treat like usual script or style. Example: `<script:a>Hello</script:a>`.
+3. `script` and `style` elements are removed with all their content. Note: it's assumed that the closing tag cannot appear inside the content. For example, in a JS string literal it has to be escaped like `"<\/script>"`.
+Note: comments and CDATA are possible inside `script` or `style`; in that case closing tags are not searched for inside CDATA. Example: `<script><![CDATA[</script>]]></script>`. But they are still searched for inside comments. Sometimes it becomes complicated: `<script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>`
+Note: `script` and `style` can be the names of XML namespaces - then they are not treated like usual `script` or `style` elements. Example: `<script:a>Hello</script:a>`.
 Note: whitespaces are possible after closing tag name: `</script >` but not before: `< / script>`.
 4. Other tags or tag-like elements are skipped without inner content. Example: `<a>.</a>`
 Note: it's expected that this HTML is illegal: `<a test=">"></a>`
 Note: it will also skip something like tags: `<>`, `<!>`, etc.
 Note: tag without end will be skipped to the end of input: `<hello   `
-5. HTML and XML entities are not decoded. It should be processed by separate function.
-6. Whitespaces in text are collapsed or inserted by specific rules.
-Whitespaces at beginning and at the end are removed.
-Consecutive whitespaces are collapsed.
-But if text is separated by other elements and there is no whitespace, it is inserted.
-It may be unnatural, examples: `Hello<b>world</b>`, `Hello<!-- -->world` - in HTML there will be no whitespace, but the function will insert it. But also consider: `Hello<p>world</p>`, `Hello<br>world`.
-This behaviour is reasonable for data analysis, e.g. convert HTML to a bag of words.
+5. HTML and XML entities are not decoded. They must be processed by a separate function.
+6. Whitespaces in the text are collapsed or inserted by specific rules.
+    - Whitespaces at the beginning and at the end are removed.
+    - Consecutive whitespaces are collapsed.
+    - But if the text is separated by other elements and there is no whitespace between the parts, a whitespace is inserted.
+    - This may look unnatural. Examples: `Hello<b>world</b>`, `Hello<!-- -->world`. In HTML there will be no whitespace, but the function will insert it. But also consider: `Hello<p>world</p>`, `Hello<br>world`. This behavior is reasonable for data analysis, e.g. to convert HTML to a bag of words.
 7. Also note that correct handling of whitespaces would require support of `<pre></pre>` and CSS display and white-space properties.

 **Syntax**
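
Review note on the hunk above: the whitespace rules in item 6 can be approximated in a few lines of Python. This is a minimal sketch and not the function's actual implementation. The helper name `strip_tags_and_normalize` is hypothetical, and `script`/`style` removal and CDATA handling (rules 2 and 3) are deliberately left out.

```python
import re


def strip_tags_and_normalize(html: str) -> str:
    """Hypothetical helper illustrating rules 1, 4 and 6 only (an assumption,
    not the documented function's real code)."""
    # Rule 1: drop comments first, so a '>' inside a comment cannot end a match early.
    text = re.sub(r"<!--.*?-->", " ", html, flags=re.DOTALL)
    # Rule 4 (simplified): every remaining tag-like element becomes one space.
    # A tag without an end is skipped to the end of the input.
    text = re.sub(r"<[^>]*(?:>|$)", " ", text)
    # Rule 6: trim both ends and collapse consecutive whitespace.
    return " ".join(text.split())


print(strip_tags_and_normalize("Hello<b>world</b>"))    # -> Hello world
print(strip_tags_and_normalize("Hello<!-- -->world"))   # -> Hello world
print(strip_tags_and_normalize("  a \n <p>b</p>  "))    # -> a b
print(strip_tags_and_normalize("Tom &amp; Jerry"))      # -> Tom &amp; Jerry
```

Replacing each element with a space before collapsing is what makes `Hello<b>world</b>` come out as two words, which matches the bag-of-words motivation given in the list. Entities are left untouched, as in rule 5.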