# parse-latin

[![Build][build-badge]][build]
[![Coverage][coverage-badge]][coverage]
[![Downloads][downloads-badge]][downloads]
[![Size][size-badge]][size]
[![Chat][chat-badge]][chat]

A Latin-script language parser for [**retext**][retext] producing **[nlcst][]**
nodes.

Whether Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum
penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”),
`parse-latin` does a good job of tokenizing it.

Note also that `parse-latin` does a decent job of tokenizing Latin-like scripts,
such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian
(“Շատ հաճելი է”).

## Install

This package is ESM only: Node 12+ is needed to use it and it must be `import`ed
instead of `require`d.

[npm][]:

```sh
npm install parse-latin
```

## Use

```js
import {inspect} from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'

// Parse a string into an nlcst tree.
const tree = new ParseLatin().parse('A simple sentence.')

// Print a readable representation of the tree.
console.log(inspect(tree))
```

Which, when inspected, yields:

```txt
RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
```
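
To give a feel for working with such a tree, here is a minimal sketch that walks it and collects the words.
It uses a hand-written object literal that mirrors the output above (positions left out) so that it runs without any packages; `toText` and `collectWords` are hypothetical helpers written for this example, not part of `parse-latin`.

```js
// Hand-written tree mirroring the inspected output (positional info omitted).
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the values of all leaf nodes under `node`.
function toText(node) {
  return 'children' in node ? node.children.map(toText).join('') : node.value
}

// Hypothetical helper: collect the text of every `WordNode`, in document order.
function collectWords(node, words = []) {
  if (node.type === 'WordNode') {
    words.push(toText(node))
  } else if ('children' in node) {
    for (const child of node.children) collectWords(child, words)
  }

  return words
}

console.log(collectWords(tree)) // Logs the three words: A, simple, sentence
```

Because every parent node has `children` and every leaf has a `value`, one recursive function is enough to visit the whole tree.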
## API

This package exports the identifier `ParseLatin`.
There is no default export.

### `ParseLatin(value)`

Exposes the functionality needed to tokenize natural Latin-script languages into
a syntax tree.
If `value` is passed here, it does not need to be given to `#parse()`.

#### `ParseLatin#tokenize(value)`

Tokenize `value` (`string`) into letters and numbers (words), white space, and
everything else (punctuation).
The returned nodes are a flat list without paragraphs or sentences.

###### Returns

[`Array.<Node>`][nlcst], a flat list of nodes.

#### `ParseLatin#parse(value)`

Tokenize `value` (`string`) into an [nlcst][] tree.
The returned node is a `RootNode` that contains paragraphs and sentences.

###### Returns

[`Node`][nlcst], the root node.

## Algorithm

> Note: The easiest way to see **how parse-latin tokenizes and parses** is by
> using the [online parser demo][demo], which
> shows the syntax tree corresponding to the typed text.

`parse-latin` splits text into white space, word, and punctuation tokens.
`parse-latin` starts out with a simple definition, one that most other
tokenizers use:

* A “word” is one or more letter or number characters
* A “white space” is one or more white space characters
* A “punctuation” is one or more of anything else
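
The three definitions above can be sketched with a single regular expression.
This is only an illustration under assumptions: `naiveTokenize` and its token shape are hypothetical, and "letter or number" is approximated with the Unicode `\p{L}` and `\p{N}` classes, which is not necessarily how `parse-latin` classifies characters.

```js
// Hypothetical helper, not part of parse-latin: classify runs of characters
// as word, white space, or punctuation, following the naive definitions.
function naiveTokenize(value) {
  // Group 1: letters/numbers, group 2: white space, group 3: anything else.
  const expression = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu
  const tokens = []

  for (const match of value.matchAll(expression)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'
    tokens.push({type, value: match[0]})
  }

  return tokens
}

// Yields word, whiteSpace, word, whiteSpace, word, punctuation.
console.log(naiveTokenize('A simple sentence.'))

// The apostrophe splits the contraction into word, punctuation, word.
console.log(naiveTokenize('she’s'))
```

The second call shows the limit of the naive definition: a contraction comes out as three tokens instead of one word.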

Then, it manipulates and merges those tokens into an [nlcst][] syntax tree,
adding sentences and paragraphs where needed, and handling exceptions to the
simple definition such as:

* Some punctuation marks are part of the word they occur in, such as
  `non-profit`, `she’s`, `G.I.`, `11:00`, `N/A`, `&c`, `nineteenth- and…`
* Some full-stops do not mark a sentence end, such as `1.`, `e.g.`, `id.`
* Although full-stops, question marks, and exclamation marks (sometimes) end a
  sentence, that end might not occur directly after the mark, such as `.)`,
  `."`
* And many more exceptions
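
As a rough sketch of the first exception, the code below merges a punctuation token into its neighbouring words when it sits directly between two of them.
Everything here is an assumption made for illustration: `mergeInnerWordPunctuation`, the token shape, and the `innerWordMarks` set are hypothetical, and the real rules in `parse-latin` cover far more cases (for example `G.I.` and `&c` are not handled here).

```js
// Hypothetical set of marks allowed inside a word; an assumption for this
// example, not the list parse-latin uses.
const innerWordMarks = new Set(['-', '’', "'", ':', '/'])

// Hypothetical helper: join `word, mark, word` sequences into a single word.
// Expects a flat list of `{type, value}` tokens and returns a new list.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0

  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]

    if (
      token.type === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous &&
      previous.type === 'word' &&
      next &&
      next.type === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy the token so the input list is left untouched.
      result.push({...token})
      index++
    }
  }

  return result
}

const merged = mergeInnerWordPunctuation([
  {type: 'word', value: 'non'},
  {type: 'punctuation', value: '-'},
  {type: 'word', value: 'profit'},
  {type: 'whiteSpace', value: ' '},
  {type: 'word', value: 'N'},
  {type: 'punctuation', value: '/'},
  {type: 'word', value: 'A'},
  {type: 'punctuation', value: '.'}
])

console.log(merged.map((token) => token.value))
```

The final full stop stays a separate punctuation token, because it is not followed by a word.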
## License

[MIT][license] © [Titus Wormer][author]

<!-- Definitions -->

[build-badge]: https://github.com/wooorm/parse-latin/workflows/main/badge.svg

[build]: https://github.com/wooorm/parse-latin/actions

[coverage-badge]: https://img.shields.io/codecov/c/github/wooorm/parse-latin.svg

[coverage]: https://codecov.io/github/wooorm/parse-latin

[downloads-badge]: https://img.shields.io/npm/dm/parse-latin.svg

[downloads]: https://www.npmjs.com/package/parse-latin

[size-badge]: https://img.shields.io/bundlephobia/minzip/parse-latin.svg

[size]: https://bundlephobia.com/result?p=parse-latin

[chat-badge]: https://img.shields.io/badge/join%20the%20community-on%20spectrum-7b16ff.svg

[chat]: https://spectrum.chat/unified/retext

[npm]: https://docs.npmjs.com/cli/install

[demo]: https://wooorm.com/parse-latin/

[license]: license

[author]: https://wooorm.com

[retext]: https://github.com/retextjs/retext

[nlcst]: https://github.com/syntax-tree/nlcst