| Title: | Data for Wordpiece-Style Tokenization |
|---|---|
| Description: | Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format. |
| Authors: | Jonathan Bratt [aut] (ORCID: <https://orcid.org/0000-0003-2859-0076>), Jon Harmon [aut, cre] (ORCID: <https://orcid.org/0000-0003-4781-4346>), Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies) |
| Maintainer: | Jon Harmon <[email protected]> |
| License: | Apache License (>= 2) |
| Version: | 2.0.0 |
| Built: | 2026-05-07 06:36:32 UTC |
| Source: | https://github.com/macmillancontentscience/wordpiece.data |

A wordpiece vocabulary is a named integer vector with class "wordpiece_vocabulary". The names of the vector are the tokens, and the values are the integer identifiers of those tokens. The vocabulary is 0-indexed (the first token has identifier 0) for compatibility with Python implementations.
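
The structure described above can be sketched in base R without loading the package. This is a minimal illustration, not the package's own code: the four tokens, the object `toy_vocab`, and the helper `token_for_id()` are hypothetical, and the second class entry (`"integer"`) is an assumption.

```r
# Toy vocabulary (hypothetical tokens, not the real BERT vocabulary).
tokens <- c("[PAD]", "[UNK]", "the", "##ing")

# Names are the tokens; values are 0-based integer identifiers.
# The class vector is an assumption made for this sketch.
toy_vocab <- structure(
  setNames(seq_along(tokens) - 1L, tokens),
  class = c("wordpiece_vocabulary", "integer")
)

# Look up identifiers by token name.
ids <- unclass(toy_vocab)[c("the", "##ing")]

# Map a 0-based identifier back to its token. R indexes from 1,
# so the identifier is shifted by one (hypothetical helper).
token_for_id <- function(vocab, id) names(vocab)[id + 1L]
token_for_id(toy_vocab, 2L)
```

Because identifiers start at 0 while R positions start at 1, any positional lookup needs the `+ 1L` shift shown in `token_for_id()`; lookups by token name need no adjustment.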

wordpiece_vocab(cased = FALSE)

| cased | Logical; if TRUE, load the cased vocabulary; if FALSE (the default), load the uncased vocabulary. |
|---|---|

A wordpiece_vocabulary.

head(wordpiece_vocab())
head(wordpiece_vocab(cased = TRUE))