parse-latin

A Latin-script language parser for retext producing nlcst nodes.

Whether the text is Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), parse-latin does a good job of tokenizing it.

Note also that parse-latin does a decent job of tokenizing Latin-like scripts, such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian (“Շատ հաճելի է”).

Install

This package is ESM only: Node 12+ is needed to use it and it must be imported instead of required.

npm:

npm install parse-latin

Use

import {inspect} from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'

const tree = new ParseLatin().parse('A simple sentence.')

console.log(inspect(tree))

Which, when inspecting, yields:

RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)

API

This package exports the following identifiers: ParseLatin. There is no default export.

ParseLatin(value)

Exposes the functionality needed to tokenize natural Latin-script languages into a syntax tree. If value is passed here, it does not need to be passed to #parse().

ParseLatin#tokenize(value)

Tokenize value (string) into letters and numbers (words), white space, and everything else (punctuation). The returned nodes are a flat list without paragraphs or sentences.

Returns

Array.<Node> — Nodes.

ParseLatin#parse(value)

Tokenize value (string) into an nlcst tree. The returned node is a RootNode containing paragraphs and sentences.

Returns

Node — Root node.

Algorithm

Note: The easiest way to see how parse-latin tokenizes and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.

parse-latin splits text into white space, word, and punctuation tokens. parse-latin starts out with a pretty easy definition, one that most other tokenizers use:

  • A “word” is one or more letter or number characters
  • A “white space” is one or more white space characters
  • A “punctuation” is one or more of anything else
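The three definitions above can be illustrated with a small, self-contained sketch. This is not parse-latin's actual implementation: the roughTokenize function and its token shape ({type, value}) are hypothetical, and the sketch assumes that Unicode property escapes (\p{L}, \p{N}) are a good enough stand-in for "letter or number characters". It ignores combining marks, which a real tokenizer would need to handle.

```javascript
// Hypothetical helper, NOT part of parse-latin: splits text into runs of
// letters/numbers ("word"), white space, and anything else ("punctuation").
function roughTokenize(value) {
  // Group 1: word, group 2: white space, group 3: everything else.
  const expression = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu
  const tokens = []

  for (const match of String(value).matchAll(expression)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'

    tokens.push({type, value: match[0]})
  }

  return tokens
}

// Lists six tokens: three words, two white space runs, one punctuation mark.
console.log(roughTokenize('A simple sentence.'))
```

The result is a flat list, comparable in spirit to what #tokenize() returns, without sentences or paragraphs.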

Then, it manipulates and merges those tokens into a (nlcst) syntax tree, adding sentences and paragraphs where needed.

  • Some punctuation marks are part of the word they occur in, such as non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and…
  • Some full-stops do not mark a sentence end, such as 1., e.g., id.
  • Although full-stops, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, such as .), ."
  • And many more exceptions
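To make the first exception above concrete, here is a minimal sketch of one possible merge pass, written under stated assumptions. The function name mergeInnerWordPunctuation, the {type, value} token shape, and the set of joining marks are all hypothetical; parse-latin's own rules are more extensive and are not reproduced here. The sketch only joins a single mark that sits directly between two words, so it handles non-profit and 11:00 but not trailing marks such as the final full stop in G.I.

```javascript
// Assumed, illustrative set of marks that may join two word parts.
const innerWordMarks = new Set(['-', "'", '’', ':', '/'])

// Hypothetical helper, NOT part of parse-latin: merges word + mark + word
// into a single word token. Returns a new list; the input is not changed.
function mergeInnerWordPunctuation(tokens) {
  const result = []
  let index = 0

  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]

    if (
      token.type === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous &&
      previous.type === 'word' &&
      next &&
      next.type === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      previous.value += token.value + next.value
      index += 2
    } else {
      result.push({...token})
      index++
    }
  }

  return result
}

// Hand-written flat tokens for "non-profit at 11:00".
const flat = [
  {type: 'word', value: 'non'},
  {type: 'punctuation', value: '-'},
  {type: 'word', value: 'profit'},
  {type: 'whiteSpace', value: ' '},
  {type: 'word', value: 'at'},
  {type: 'whiteSpace', value: ' '},
  {type: 'word', value: '11'},
  {type: 'punctuation', value: ':'},
  {type: 'word', value: '00'}
]

console.log(mergeInnerWordPunctuation(flat).map((token) => token.value))
```

Sentence and paragraph boundaries would need further passes of a similar kind; the online parser demo remains the reliable way to see what parse-latin actually produces.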

License

MIT © Titus Wormer