R packages by macmillancontentscience

morphemepiece - Morpheme Tokenization

Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.

Last updated 3 years ago

5.04 score 11 stars 8 scripts 231 downloads
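
The lookup-then-fallback idea in the morphemepiece description can be sketched as follows. This is a minimal Python illustration, not the package's R API: the names `morpheme_tokenize` and `fallback_split`, the toy lookup table, and the toy vocabulary are all hypothetical, and a plain greedy longest-match split stands in for the modified wordpiece algorithm, whose details are not given here.

```python
def fallback_split(word, vocab, unk_token="[UNK]"):
    """Stand-in fallback (assumption): greedy longest-match split against a vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches at this position.
            return [unk_token]
    return pieces


def morpheme_tokenize(word, lookup, vocab):
    """Use the lookup table when the word is known; otherwise fall back."""
    if word in lookup:
        return list(lookup[word])
    return fallback_split(word, vocab)


# Toy data, invented for illustration only.
toy_lookup = {"unhappiness": ["un", "happy", "ness"]}
toy_vocab = {"re", "play", "ed"}

print(morpheme_tokenize("unhappiness", toy_lookup, toy_vocab))  # found in the lookup table
print(morpheme_tokenize("replayed", toy_lookup, toy_vocab))     # handled by the fallback
```

The lookup table can record breakdowns that a purely string-based split cannot recover (here, "happy" inside "unhappiness"), which is why it is consulted first.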

wordpiece - R Implementation of Wordpiece Tokenization

Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization conventions are used by default.

Last updated 3 years ago

4.60 score 8 stars 7 scripts 282 downloads
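
For readers unfamiliar with wordpiece tokenization, the sketch below shows the greedy longest-match-first split under the BERT conventions (a `##` prefix on word-internal pieces and `[UNK]` for words that cannot be split). It is a Python illustration under those assumptions, not the R package's interface; the function name and the toy vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Word-internal pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # If any position cannot be matched, the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for illustration only.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}

print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
```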

dlr - Download and Cache Files Safely

The goal of dlr is to provide a friendly wrapper around the common pattern of downloading a file if that file does not already exist locally.

Last updated 3 years ago

4.48 score 1 star 2 dependents 4 scripts 396 downloads
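
The "download a file only if it does not already exist locally" pattern that dlr wraps looks roughly like the Python sketch below. This is not dlr's API: `fetch_if_missing` and its `download` parameter are hypothetical names, and writing to a temporary `.part` file before renaming is one assumed way to keep a failed download from leaving a partial file in the cache.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, dest, download=urlretrieve):
    """Return dest, downloading url to it only when dest does not exist yet."""
    dest = Path(dest)
    if dest.exists():
        # Cached copy found: skip the network entirely.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first (assumption, see lead-in),
    # then move it into place once the download has finished.
    tmp = dest.with_name(dest.name + ".part")
    download(url, tmp)
    tmp.replace(dest)
    return dest


# Example call (placeholder URL and path, shown but not executed):
# vocab_path = fetch_if_missing("https://example.org/vocab.txt", "cache/vocab.txt")
```

A second call with the same destination returns immediately, which is the behavior that makes the pattern useful for large data files.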

piecemaker - Tools for Preparing Text for Tokenizers

Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.

Last updated 2 years ago

3.48 score 2 dependents 6 scripts 337 downloads

morphemepiece.data - Data for Morpheme Tokenization

Provides data about morphemes, the smallest units of meaning in a language.

Last updated 3 years ago

3.18 score 1 star 1 dependent 2 scripts 194 downloads

wordpiece.data - Data for Wordpiece-Style Tokenization

Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.

Last updated 3 years ago

3.18 score 1 dependent 5 scripts 277 downloads