morphemepiece - Morpheme Tokenization
Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.
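The lookup-then-fallback flow described above can be sketched in a few lines. This is a minimal Python illustration, not the package's R API: the `LOOKUP` table, its entries, the `fallback_split` stand-in, and the `tokenize` function are all hypothetical names chosen for the example, and the real fallback is a modified wordpiece algorithm rather than the no-op used here.

```python
from typing import Callable, Dict, List

# Hypothetical lookup table mapping words to morpheme breakdowns.
# The entries are illustrative only, not taken from the package's data.
LOOKUP: Dict[str, List[str]] = {
    "unhappiness": ["un", "happy", "ness"],
    "cats": ["cat", "s"],
}


def fallback_split(word: str) -> List[str]:
    """Stand-in for the fallback tokenizer (assumption: returns the word unsplit)."""
    return [word]


def tokenize(
    text: str,
    lookup: Dict[str, List[str]] = LOOKUP,
    fallback: Callable[[str], List[str]] = fallback_split,
) -> List[str]:
    """Split on whitespace, then use the table first and the fallback second."""
    tokens: List[str] = []
    for word in text.lower().split():
        if word in lookup:
            tokens.extend(lookup[word])    # known word: use stored breakdown
        else:
            tokens.extend(fallback(word))  # unknown word: defer to fallback
    return tokens
```

Passing the fallback in as a parameter keeps the table lookup independent of whichever subword algorithm handles unknown words.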
Last updated 3 years ago
5.04 score 11 stars 8 scripts 223 downloads

wordpiece - R Implementation of Wordpiece Tokenization
Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization conventions are used by default.
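As a rough illustration of what wordpiece tokenization with BERT conventions does to a single word, the sketch below applies greedy longest-match-first lookup against a vocabulary, marks continuation pieces with `##`, and emits `[UNK]` when no split exists. It is written in Python with a hypothetical `wordpiece` function and a toy vocabulary, and it does not reproduce the R package's interface.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces
```

For example, with the toy vocabulary `{"un", "##aff", "##able"}`, the word `unaffable` splits into `un`, `##aff`, `##able`.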
Last updated 3 years ago
4.60 score 8 stars 7 scripts 210 downloads

dlr - Download and Cache Files Safely
The goal of dlr is to provide a friendly wrapper around the common pattern of downloading a file if that file does not already exist locally.
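The pattern being wrapped, download a file only if it is not already present locally, looks roughly like the following. This is a generic Python sketch under stated assumptions: `download_if_missing`, its `fetch` parameter, and the `.part` temporary suffix are hypothetical choices for illustration and say nothing about how dlr itself is implemented.

```python
import os
import urllib.request
from typing import Callable


def download_if_missing(
    url: str,
    path: str,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> str:
    """Return `path`, downloading `url` to it first only when it is absent."""
    if not os.path.exists(path):
        tmp = path + ".part"   # write to a temporary name first
        fetch(url, tmp)        # default: standard-library urlretrieve
        os.replace(tmp, path)  # move into place once the download finishes
    return path
```

Writing to a temporary name and renaming afterwards means an interrupted download does not leave a partial file that later calls would mistake for a valid cached copy.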
Last updated 2 years ago
4.48 score 2 packages 4 scripts 293 downloads

piecemaker - Tools for Preparing Text for Tokenizers
Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.
Last updated 1 year ago
3.48 score 2 packages 6 scripts 249 downloads

morphemepiece.data - Data for Morpheme Tokenization
Provides data about morphemes, the smallest units of meaning in a language.
Last updated 3 years ago
3.18 score 1 stars 1 packages 2 scripts 220 downloads

wordpiece.data - Data for Wordpiece-Style Tokenization
Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
Last updated 3 years ago
3.18 score 1 packages 5 scripts 255 downloads