Some minor changes: words, formatting

Alexey 2021-04-06 19:10:22 +00:00
parent 80508356c0
commit 97188f5b85

@@ -656,23 +656,22 @@ A function to extract text from HTML or XHTML.
 It does not necessarily 100% conforms to any of the HTML, XML or XHTML standards, but the implementation is reasonably accurate and it is fast. The rules are the following:
 1. Comments are skipped. Example: `<!-- test -->`. Comment must end with `-->`. Nested comments are not possible.
-Note: constructions like `<!-->` `<!--->` are not valid comments in HTML but will be skipped by other rules.
+Note: constructions like `<!-->` and `<!--->` are not valid comments in HTML but they will be skipped by other rules.
 2. CDATA is pasted verbatim. Note: CDATA is XML/XHTML specific. But we still process it for "best-effort" approach.
-3. `script` and `style` elements are removed with all their content. Note: it's assumed that closing tag cannot appear inside content. For example, in JS string literal is has to be escaped as `"<\/script>"`.
-Note: comments and CDATA is possible inside script or style - then closing tags are not searched inside CDATA. Example: `<script><![CDATA[</script>]]></script>` But still searched inside comments. Sometimes it becomes complicated: `<script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>`
-Note: script and style can be the names of XML namespaces - then they are not treat like usual script or style. Example: `<script:a>Hello</script:a>`.
+3. `script` and `style` elements are removed with all their content. Note: it's assumed that closing tag cannot appear inside content. For example, in JS string literal has to be escaped like `"<\/script>"`.
+Note: comments and CDATA is possible inside `script` or `style` - then closing tags are not searched inside CDATA. Example: `<script><![CDATA[</script>]]></script>` But they are still searched inside comments. Sometimes it becomes complicated: `<script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>`
+Note: `script` and `style` can be the names of XML namespaces - then they are not treated like usual `script` or `style` elements. Example: `<script:a>Hello</script:a>`.
 Note: whitespaces are possible after closing tag name: `</script >` but not before: `< / script>`.
 4. Other tags or tag-like elements are skipped without inner content. Example: `<a>.</a>`
 Note: it's expected that this HTML is illegal: `<a test=">"></a>`
 Note: it will also skip something like tags: `<>`, `<!>`, etc.
 Note: tag without end will be skipped to the end of input: `<hello `
-5. HTML and XML entities are not decoded. It should be processed by separate function.
-6. Whitespaces in text are collapsed or inserted by specific rules.
-Whitespaces at beginning and at the end are removed.
-Consecutive whitespaces are collapsed.
-But if text is separated by other elements and there is no whitespace, it is inserted.
-It may be unnatural, examples: `Hello<b>world</b>`, `Hello<!-- -->world` - in HTML there will be no whitespace, but the function will insert it. But also consider: `Hello<p>world</p>`, `Hello<br>world`.
-This behaviour is reasonable for data analysis, e.g. convert HTML to a bag of words.
+5. HTML and XML entities are not decoded. They must be processed by separate function.
+6. Whitespaces in the text are collapsed or inserted by specific rules.
+- Whitespaces at the beginning and at the end are removed.
+- Consecutive whitespaces are collapsed.
+- But if the text is separated by other elements and there is no whitespace, it is inserted.
+- It may be unnatural examples: `Hello<b>world</b>`, `Hello<!-- -->world` - in HTML there will be no whitespace, but the function will insert it. But also consider: `Hello<p>world</p>`, `Hello<br>world`. This behavior is reasonable for data analysis, e.g. to convert HTML to a bag of words.
 7. Also note that correct handling of whitespaces would require support of `<pre></pre>` and CSS display and white-space properties.
 **Syntax**
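
Reviewer note: the rules described in the hunk above can be approximated in a few lines, which may help when checking the wording of rules 1, 3, 4 and 6. The following is only a rough Python sketch, not the documented function's implementation. The name `extract_text_sketch` and the regular expression are assumptions made for illustration. CDATA handling (rule 2) and the edge cases in the notes, such as `<!-->`, are deliberately not modeled.

```python
import re

# Hypothetical single-pass pattern: each alternative is one kind of
# construct that the documented rules say is skipped.
_SKIP = re.compile(
    r"<!--.*?-->"                                # rule 1: comment, must end with -->
    r"|<(script|style)(?:\s[^>]*)?>.*?</\1\s*>"  # rule 3: script/style with content
    r"|<[^>]*(?:>|\Z)",                          # rule 4: any other tag; an unclosed
                                                 # tag runs to the end of input
    re.DOTALL | re.IGNORECASE,
)


def extract_text_sketch(html: str) -> str:
    """Approximate rules 1, 3, 4 and 6; entities are left undecoded (rule 5)."""
    # Replacing every skipped construct with a space is what inserts
    # whitespace between text pieces separated by elements (rule 6).
    spaced = _SKIP.sub(" ", html)
    # split() + join trims both ends and collapses consecutive whitespace.
    return " ".join(spaced.split())


print(extract_text_sketch("Hello<b>world</b>"))           # Hello world
print(extract_text_sketch("<script:a>Hello</script:a>"))  # Hello
```

Because `<script:a>` is followed by `:` rather than whitespace or `>`, the sketch treats it as an ordinary tag and keeps its text, which mirrors the namespace note in rule 3.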