Package 'wordpiece.data' reference manual

Title:	Data for Wordpiece-Style Tokenization
Description:	Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
Authors:	Jonathan Bratt [aut] (ORCID: <https://orcid.org/0000-0003-2859-0076>), Jon Harmon [aut, cre] (ORCID: <https://orcid.org/0000-0003-4781-4346>), Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies)
Maintainer:	Jon Harmon <[email protected]>
License:	Apache License (>= 2)
Version:	2.0.0
Built:	2026-05-07 06:36:32 UTC
Source:	https://github.com/macmillancontentscience/wordpiece.data

Help Index

Load a wordpiece Vocabulary

Description

A wordpiece vocabulary is a named integer vector with class "wordpiece_vocabulary". The names of the vector are the tokens, and the values are the integer identifiers of those tokens. The vocabulary is 0-indexed for compatibility with Python implementations.
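To make the structure concrete, here is a minimal sketch in base R that builds a toy object with the same shape. The tokens and the `toy_vocab` name are hypothetical, and attaching the class with `structure()` is an assumption about how such an object could be constructed, not necessarily how the package builds it. It illustrates that ids start at 0, so the token with id `k` sits at R position `k + 1`.

```r
# Toy stand-in for a wordpiece vocabulary (hypothetical tokens,
# not taken from the real BERT vocabularies).
tokens <- c("[PAD]", "[UNK]", "the", "##ing")

# Names are the tokens; values are 0-based integer ids.
# The class vector here is an assumption made for illustration.
toy_vocab <- structure(
  seq_along(tokens) - 1L,
  names = tokens,
  class = c("wordpiece_vocabulary", "integer")
)

# Token -> id: look up by name.
id_of_the <- toy_vocab[["the"]]

# Id -> token: ids are 0-based, R positions are 1-based, so add 1.
token_at_2 <- names(toy_vocab)[2L + 1L]

print(id_of_the)
print(token_at_2)
```

The same name-based lookup should apply to the object returned by `wordpiece_vocab()`, provided it follows the structure described above.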

Usage

wordpiece_vocab(cased = FALSE)

Arguments

cased

Logical; if TRUE, load the cased vocabulary; if FALSE (the default), load the uncased vocabulary.

Value

A wordpiece_vocabulary.

Examples

head(wordpiece_vocab())
head(wordpiece_vocab(cased = TRUE))