macmillancontentscience
morphemepiece - Morpheme Tokenization

Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.

5.04 score 11 stars 9 scripts 268 downloads
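
The lookup-then-fallback flow described for morphemepiece can be sketched roughly as follows. This is an illustration in Python, not the package's R code: the `LOOKUP` entries, the function names, and the trivial `fallback_tokenize` stand-in are all assumptions made for the example, and the package's actual fallback is a modified wordpiece algorithm.

```python
from typing import Callable, Dict, List

# Hypothetical lookup table: word -> morpheme breakdown (illustrative entries only).
LOOKUP: Dict[str, List[str]] = {
    "unhappiness": ["un", "happy", "ness"],
    "cats": ["cat", "s"],
}


def fallback_tokenize(word: str) -> List[str]:
    """Stand-in for the fallback tokenizer: returns the word as a single piece."""
    return [word]


def tokenize_word(
    word: str,
    lookup: Dict[str, List[str]] = LOOKUP,
    fallback: Callable[[str], List[str]] = fallback_tokenize,
) -> List[str]:
    """Use the lookup table when the word is known, otherwise the fallback."""
    pieces = lookup.get(word)
    if pieces is not None:
        return list(pieces)  # copy so callers cannot mutate the table
    return fallback(word)


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and tokenize each word in turn."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word))
    return tokens
```

Passing the fallback as a parameter keeps the table lookup independent of whichever subword algorithm handles unknown words.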

wordpiece - R Implementation of Wordpiece Tokenization

Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization conventions are used by default.

4.60 score 8 stars 7 scripts 253 downloads
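
For readers unfamiliar with wordpiece, the sketch below shows the greedy longest-match-first strategy used by the reference BERT tokenizer, with the BERT conventions of a `##` prefix on continuation pieces and `[UNK]` for words that cannot be split. It is written in Python for illustration; the function name and the toy vocabulary are assumptions, and it is not the R package's interface.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]", prefix: str = "##") -> List[str]:
    """Split one word into vocabulary pieces by greedy longest-match-first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark non-initial pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, made up for this example.
toy_vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece("unaffable", toy_vocab))
```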

dlr - Download and Cache Files Safely

The goal of dlr is to provide a friendly wrapper around the common pattern of downloading a file if that file does not already exist locally.

4.48 score 2 dependents 4 scripts 320 downloads
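
The download-if-missing pattern that dlr wraps might look like the following in Python. This is a sketch of the general pattern only, not dlr's R interface: `fetch_once` and its `download` parameter are hypothetical names, and writing to a temporary file before renaming is one assumed way to avoid leaving a partial file behind.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch_once(url, dest, download=urlretrieve):
    """Return dest, downloading url to it only if dest does not already exist."""
    dest = Path(dest)
    if dest.exists():
        return dest  # cached copy found, skip the download
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary file in the same directory, then rename it,
    # so an interrupted download never looks like a finished one.
    fd, tmp = tempfile.mkstemp(dir=dest.parent)
    os.close(fd)
    try:
        download(url, tmp)
        os.replace(tmp, dest)
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return dest
```

Accepting the downloader as an argument lets the caching logic be exercised without network access.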

piecemaker - Tools for Preparing Text for Tokenizers

Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.

3.48 score 2 dependents 6 scripts 282 downloads

morphemepiece.data - Data for Morpheme Tokenization

Provides data about morphemes, the smallest units of meaning in a language.

3.18 score 1 star 1 dependent 2 scripts 265 downloads

wordpiece.data - Data for Wordpiece-Style Tokenization

Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.

3.18 score 1 dependent 5 scripts 256 downloads