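This package applies WordPiece tokenization to input text, given an appropriate WordPiece vocabulary. The BERT tokenization conventions are used. The basic tokenization algorithm is greedy and works word by word: punctuation is split off from the surrounding text, each resulting word that appears in the vocabulary is kept as-is, and any other word is broken, starting from the left, into the longest pieces found in the vocabulary, with every piece after the first prefixed by ##. A word that cannot be represented this way is replaced by an "unknown" token.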
Ideally, a WordPiece vocabulary will be complete enough to represent any word, but this is not required.
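To make the splitting step concrete, here is a minimal base-R sketch of the greedy longest-match-first lookup that BERT-style WordPiece tokenizers use for a single word. The helper name greedy_wordpiece and the toy token list are made up for illustration. This is not the package's implementation, which also handles details such as maximum word length.

```r
# Hypothetical helper, not part of {wordpiece}: split one word into pieces
# using greedy longest-match-first lookup against a character vector of tokens.
greedy_wordpiece <- function(word, tokens, unk_token = "[UNK]") {
  pieces <- character(0)
  start <- 1L
  n <- nchar(word)
  while (start <= n) {
    end <- n
    found <- NA_character_
    # Try the longest remaining substring first, then shrink from the right.
    while (end >= start) {
      candidate <- substr(word, start, end)
      # Pieces that do not begin the word carry the "##" prefix.
      if (start > 1L) candidate <- paste0("##", candidate)
      if (candidate %in% tokens) {
        found <- candidate
        break
      }
      end <- end - 1L
    }
    # No piece matches at this position: the whole word becomes unknown.
    if (is.na(found)) return(unk_token)
    pieces <- c(pieces, found)
    start <- end + 1L
  }
  pieces
}

# Made-up token list for illustration only.
toy_tokens <- c("[UNK]", "ta", "##cos", "like", "!")
greedy_wordpiece("tacos", toy_tokens)  # split into "ta" and "##cos"
greedy_wordpiece("like", toy_tokens)   # found whole
greedy_wordpiece("salsa", toy_tokens)  # no matching pieces, so "[UNK]"
```

With a vocabulary that contains single-character pieces, the last call would instead fall back to a character-by-character split, which is why a more complete vocabulary rarely produces unknown tokens.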
Two vocabularies are provided via the {wordpiece.data} package. These are the WordPiece vocabularies used in Google Research’s BERT models (and most models based on BERT).
library(wordpiece)
# The default vocabulary is uncased.
wordpiece_tokenize(
  "I like tacos!"
)
#> [[1]]
#>     i  like    ta ##cos     !
#>  1045  2066 11937 13186   999
# A cased vocabulary is also provided.
wordpiece_tokenize(
  "I like tacos!",
  vocab = wordpiece_vocab(cased = TRUE)
)
#> [[1]]
#>     I  like    ta ##cos     !
#>   146  1176 27629 13538   106
For the rest of this vignette, we use a tiny vocabulary for illustrative purposes. You should not use this vocabulary for actual tokenization.
The vocabulary is represented by the package as a named integer vector, with a logical attribute is_cased to indicate whether the vocabulary is case sensitive. The names are the actual tokens, and the integer values are the token indices. The integer values would be the input to a BERT model, for example.
A vocabulary can be read from a text file containing a single token per line. The token index is taken to be the line number, starting from zero. These conventions are adopted for compatibility with the vocabulary and file format used in the pretrained BERT checkpoints released by Google Research. The casedness of the vocabulary is inferred from the content of the vocabulary.
# Get path to sample vocabulary included with package.
vocab_path <- system.file("extdata", "tiny_vocab.txt", package = "wordpiece")
# Load the vocabulary.
vocab <- load_vocab(vocab_path)
# Take a peek at the vocabulary.
head(vocab)
#> [1] "[PAD]" "[CLS]" "[SEP]" "!" "." ","
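The file convention described above (one token per line, index equal to the zero-based line number) can be reproduced in a few lines of base R. The sketch below uses a hypothetical helper, read_vocab_sketch, and a made-up four-token file. It follows the named-integer layout from the description and is not the package's load_vocab, whose returned object may be structured differently.

```r
# Hypothetical helper, not the package's load_vocab(): read one token per line
# and assign zero-based indices by line number.
read_vocab_sketch <- function(path) {
  tokens <- trimws(readLines(path, encoding = "UTF-8"))
  stats::setNames(seq_along(tokens) - 1L, tokens)
}

# Write a tiny made-up vocabulary to a temporary file and read it back.
tmp <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[CLS]", "[SEP]", "!"), tmp)
toy_vocab <- read_vocab_sketch(tmp)

toy_vocab[["[PAD]"]]  # first line, so index 0
toy_vocab[["!"]]      # fourth line, so index 3
```

The zero-based offset matters because R vectors are one-based: the position of a token in the vector is always one more than the index a BERT model expects.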
When a text vocabulary is loaded with load_or_retrieve_vocab in an interactive R session, the option is given to cache the vocabulary as an RDS file for faster future loading.
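The idea behind such a cache can be illustrated with base R's saveRDS and readRDS. The helper cached_vocab_sketch and both file paths below are hypothetical. The package manages its own cache location and prompts, which this sketch does not attempt to mirror.

```r
# Hypothetical caching helper, not the package's implementation: parse the
# text file once, then reuse a serialized copy on later calls.
cached_vocab_sketch <- function(txt_path, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))
  }
  tokens <- readLines(txt_path, encoding = "UTF-8")
  saveRDS(tokens, rds_path)
  tokens
}

# Made-up vocabulary file and cache file in the session's temp directory.
txt <- tempfile(fileext = ".txt")
rds <- tempfile(fileext = ".rds")
writeLines(c("[PAD]", "[UNK]", "taco"), txt)

first <- cached_vocab_sketch(txt, rds)   # reads the text file, writes the cache
second <- cached_vocab_sketch(txt, rds)  # served from the RDS cache
identical(first, second)
```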
Tokenize text by calling wordpiece_tokenize on the text, passing the vocabulary as the vocab parameter. The output of wordpiece_tokenize is a list with one element per input text; each element is a named integer vector of token indices, with the tokens as names.
The above vocabulary contained no tokens starting with an uppercase letter, so it was assumed to be uncased. When tokenizing text with an uncased vocabulary, the input is converted to lowercase before any other processing is applied. If the vocabulary contains at least one capitalized token, it will be taken as case-sensitive, and the case of the input text is preserved. Note that in a cased vocabulary, capitalized and uncapitalized versions of the same word are different tokens, and must both be included in the vocabulary to be recognized.
# The above vocabulary was uncased.
attr(vocab, "is_cased")
#> [1] FALSE
# Here is the same vocabulary, but containing the capitalized token "Hi".
vocab_path2 <- system.file("extdata", "tiny_vocab_cased.txt",
                           package = "wordpiece")
vocab_cased <- load_vocab(vocab_path2)
head(vocab_cased)
#> [1] "[PAD]" "[CLS]" "[SEP]" "!" "." ","
# vocab_cased is inferred to be case-sensitive...
attr(vocab_cased, "is_cased")
#> [1] TRUE
# ... so the tokenization will *not* convert strings to lowercase, and so the
# words "I" and "And" are not found in the vocabulary (though "and" still is).
wordpiece_tokenize(text = "And I love tacos and salsa!", vocab = vocab_cased)
#> [[1]]
#> [UNK] [UNK]  love tacos   and     s   ##a   ##l   ##s   ##a     !
#>    64    64     8     9    10    30    38    49    56    38     3
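The casedness rule described above can be approximated in base R as follows. The helper names infer_cased_sketch and normalize_case_sketch are hypothetical, and the package's internal check may differ in details such as Unicode handling.

```r
# Hypothetical helper, not the package's internal check: treat a vocabulary as
# cased if any token starts with an uppercase letter.
infer_cased_sketch <- function(tokens) {
  any(grepl("^[[:upper:]]", tokens))
}

# Hypothetical helper: lowercase the input only for an uncased vocabulary.
normalize_case_sketch <- function(text, is_cased) {
  if (is_cased) text else tolower(text)
}

# Made-up token lists for illustration only.
uncased_tokens <- c("[PAD]", "[UNK]", "hi", "##s")
cased_tokens <- c("[PAD]", "[UNK]", "Hi", "hi")

infer_cased_sketch(uncased_tokens)  # special tokens start with "[", so FALSE
infer_cased_sketch(cased_tokens)    # "Hi" starts with a capital, so TRUE

normalize_case_sketch("And I love tacos", infer_cased_sketch(uncased_tokens))
normalize_case_sketch("And I love tacos", infer_cased_sketch(cased_tokens))
```

Note that special tokens such as [PAD] contain capital letters but begin with a bracket, so under this rule they do not make a vocabulary cased.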
Note that the default value for the unk_token argument, “[UNK]”, is present in the above vocabularies, so it had an integer index in the tokenization. If that token were not in the vocabulary, its index would be coded as NA.
wordpiece_tokenize(text = "I love tacos!",
                   vocab = vocab_cased,
                   unk_token = "[missing]")
#> [[1]]
#> [missing]      love     tacos         !
#>        NA         8         9         3
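The NA arises naturally from name-based lookup in R: indexing a named vector by a name it does not contain returns NA. The following base-R sketch shows the effect with a made-up vocabulary and a hypothetical helper, lookup_ids. It only mimics the final lookup step, not the package's tokenizer.

```r
# A made-up named integer vocabulary for illustration only.
toy_vocab <- c("[UNK]" = 0L, "love" = 1L, "tacos" = 2L, "!" = 3L)

# Hypothetical lookup helper: words missing from the vocabulary are replaced
# by unk_token first, then every token is looked up by name.
lookup_ids <- function(words, vocab, unk_token = "[UNK]") {
  tokens <- ifelse(words %in% names(vocab), words, unk_token)
  stats::setNames(unname(vocab[tokens]), tokens)
}

words <- c("I", "love", "tacos", "!")
with_default <- lookup_ids(words, toy_vocab)
with_missing <- lookup_ids(words, toy_vocab, unk_token = "[missing]")

with_default  # "[UNK]" is in the vocabulary, so it gets an integer index
with_missing  # "[missing]" is not, so its index is NA
```

If downstream code feeds these indices to a model, it is worth checking for NA values whenever a non-default unk_token is used.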
The package defaults are set to be compatible with BERT tokenization. If you have a different use case, be sure to check all parameter values.