morphemepiece - Morpheme Tokenization
Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.
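The lookup-then-fallback idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the package's R implementation: the `LOOKUP` table, the `VOCAB` set, the `##` piece markers, and both function names are hypothetical, and the fallback shown is plain greedy longest-match wordpiece rather than the modified variant the package uses.

```python
# Hypothetical toy data; not the real morphemepiece lookup table or vocabulary.
LOOKUP = {"unbreakable": ["un##", "break", "##able"]}
VOCAB = {"walk", "##ing", "##s", "[UNK]"}


def wordpiece_fallback(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first wordpiece split (plain, unmodified variant)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            # Pieces that continue a word carry a "##" prefix.
            candidate = piece if start == 0 else "##" + piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize_word(word, lookup=LOOKUP, vocab=VOCAB):
    """Use the lookup table when the word is listed; otherwise fall back."""
    if word in lookup:
        return lookup[word]
    return wordpiece_fallback(word, vocab)
```

With this toy data, `tokenize_word("unbreakable")` is answered from the lookup table, while `tokenize_word("walking")` goes through the fallback.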
5.04 score 11 stars 9 scripts 268 downloads

dlr - Download and Cache Files Safely
The goal of dlr is to provide a friendly wrapper around the common pattern of downloading a file if that file does not already exist locally.
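The "download only if the file is not already there" pattern might look like the following. This is a sketch in Python under stated assumptions, not dlr's API: `download_if_missing` and its `fetch` parameter are hypothetical names, and writing to a temporary file before renaming is one possible way to avoid leaving a partial file behind.

```python
import os
import shutil
import tempfile
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlopen):
    """Return dest_path, downloading url only when the file is not present."""
    if os.path.exists(dest_path):
        # Already cached locally: skip the network entirely.
        return dest_path

    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)

    # Write to a temporary file in the same directory, then rename, so an
    # interrupted download never leaves a half-written file at dest_path.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp, fetch(url) as response:
            shutil.copyfileobj(response, tmp)
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

The `fetch` argument exists only so the function can be exercised without network access; by default it uses `urllib.request.urlopen`.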
4.48 score 2 dependents 4 scripts 320 downloads

piecemaker - Tools for Preparing Text for Tokenizers
Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.
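As an illustration of what shared preparation steps followed by a simple tokenizer can look like, here is a small Python sketch. The specific steps chosen (Unicode normalization, dropping control characters, spacing out punctuation, collapsing whitespace) and the names `prepare_text` and `simple_tokenize` are assumptions for this example, not a description of piecemaker's actual functions.

```python
import re
import unicodedata


def prepare_text(text):
    """Apply a few common cleanup steps before tokenizing."""
    # Normalize to a single Unicode form so equivalent strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Drop control/format characters, but keep ordinary whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\t\n\r " or unicodedata.category(ch)[0] != "C"
    )
    # Put spaces around punctuation so each mark becomes its own token.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def simple_tokenize(text):
    """Split prepared text on single spaces; empty input gives no tokens."""
    prepared = prepare_text(text)
    return prepared.split(" ") if prepared else []
```

For example, `simple_tokenize("Hello,  world!")` yields the four tokens `Hello`, `,`, `world`, and `!`.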
3.48 score 2 dependents 6 scripts 262 downloads

morphemepiece.data - Data for Morpheme Tokenization
Provides data about morphemes, the smallest units of meaning in a language.
3.18 score 1 star 1 dependent 2 scripts 265 downloads

wordpiece.data - Data for Wordpiece-Style Tokenization
Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
3.18 score 1 dependent 5 scripts 256 downloads