Title: | Morpheme Tokenization |
---|---|
Description: | Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table. |
Authors: | Jonathan Bratt [aut, cre] , Jon Harmon [aut] , Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph] |
Maintainer: | Jonathan Bratt <[email protected]> |
License: | Apache License (>= 2) |
Version: | 1.2.3 |
Built: | 2024-10-29 03:46:08 UTC |
Source: | https://github.com/macmillancontentscience/morphemepiece |
Tokenize words into morphemes (the smallest units of meaning).
Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load a different lookup from a file.
load_lookup(lookup_file)
lookup_file | path to lookup file. File is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space. |
The lookup as a named list. Names are words in lookup.
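The file format described above can be illustrated with a small base R sketch. It does not call the package; the file contents (including the "##" markers in the values) and the name toy_lookup are hypothetical, chosen only to show a line with a lookup value and a line without one.

```r
# Write a tiny, made-up lookup file: one word per line, with the lookup
# value (if different from the word) after the first space.
lookup_file <- tempfile(fileext = ".txt")
writeLines(c("walk", "walking walk ##ing", "unhappy un## happy"), lookup_file)

lines <- readLines(lookup_file)

# Position of the first space on each line (-1 when there is none).
first_space <- as.integer(regexpr(" ", lines, fixed = TRUE))
has_value <- first_space > 0

# The word is everything before the first space; the value is the rest.
# Lines without a space map the word to itself.
words <- ifelse(has_value, substr(lines, 1, first_space - 1), lines)
values <- ifelse(has_value, substr(lines, first_space + 1, nchar(lines)), lines)

# A named list whose names are the words in the lookup.
toy_lookup <- as.list(stats::setNames(values, words))
```

With the package installed, the same file could presumably be passed to load_lookup(lookup_file) instead of being parsed by hand.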
Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load (and cache) a different lookup from a file.
load_or_retrieve_lookup(lookup_file)
lookup_file | path to lookup file. File is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space. |
The lookup table as a named character vector.
Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load (and cache) a different vocabulary from a file.
load_or_retrieve_vocab(vocab_file)
vocab_file | path to vocabulary file. File is assumed to be a text file, with one token per line, with the line number (starting at zero) corresponding to the index of that token in the vocabulary. |
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
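The zero-based indexing can be made concrete with a short base R sketch. It does not use the package; the four tokens and the name token_index are hypothetical, and the sketch leaves out the casedness inference because the text does not say how that is done.

```r
# Write a tiny, made-up vocabulary file: one token per line.
vocab_file <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[UNK]", "walk", "##ing"), vocab_file)

tokens <- readLines(vocab_file)

# R vectors are 1-based, so subtract 1 to get the zero-based index that
# corresponds to the line number in the file.
token_index <- stats::setNames(seq_along(tokens) - 1L, tokens)

token_index[["[PAD]"]]  # first line of the file, index 0
token_index[["##ing"]]  # fourth line of the file, index 3
```

Because a pre-trained model only ever sees the integers, reordering the lines of the file would silently change what each index means.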
Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load a different vocabulary from a file.
load_vocab(vocab_file)
vocab_file | path to vocabulary file. File is assumed to be a text file, with one token per line, with the line number (starting at zero) corresponding to the index of that token in the vocabulary. |
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
The morphemepiece cache directory is a platform- and user-specific path where morphemepiece saves caches (such as a downloaded lookup). You can override the default location in a few ways:
Option: morphemepiece.dir. Use set_morphemepiece_cache_dir to set a specific cache directory for this session.
Environment: MORPHEMEPIECE_CACHE_DIR. Set this environment variable to specify a morphemepiece cache directory for all sessions.
Environment: R_USER_CACHE_DIR. Set this environment variable to specify a cache directory root for all packages that use the caching system.
morphemepiece_cache_dir()
A character vector with the normalized path to the cache.
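One way to picture how the three overrides might interact is the base R sketch below. It is not the package's implementation: the function name sketch_cache_dir is hypothetical, and the precedence (option first, then MORPHEMEPIECE_CACHE_DIR, then the R_USER_CACHE_DIR-aware default) is an assumption read off the order of the list above, which the text does not state explicitly. The sketch also skips path normalization.

```r
# Hypothetical resolver; the precedence order is an assumption.
sketch_cache_dir <- function() {
  # 1. Session-level option (what set_morphemepiece_cache_dir is said to set).
  opt <- getOption("morphemepiece.dir")
  if (!is.null(opt)) {
    return(opt)
  }

  # 2. Package-specific environment variable, intended for all sessions.
  env <- Sys.getenv("MORPHEMEPIECE_CACHE_DIR", unset = "")
  if (nzchar(env)) {
    return(env)
  }

  # 3. Platform- and user-specific default. tools::R_user_dir() (R >= 4.0.0)
  #    honours R_USER_CACHE_DIR when that variable is set.
  tools::R_user_dir("morphemepiece", which = "cache")
}

sketch_cache_dir()
```

In practice you would call morphemepiece_cache_dir() to see the path the package actually uses.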
Given a single sequence of text and a morphemepiece vocabulary, tokenizes the text.
morphemepiece_tokenize(
  text,
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup(),
  unk_token = "[UNK]",
  max_chars = 100
)
text | Character scalar; text to tokenize. |
vocab | A morphemepiece vocabulary. |
lookup | A morphemepiece lookup table. |
unk_token | Token to represent unknown words. |
max_chars | Maximum length of word recognized. |
A character vector of tokenized text. (Later, this should be a named integer vector, as in the wordpiece package.)
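The lookup-then-fallback idea behind the tokenizer can be sketched in base R as follows. This is a toy illustration, not the package's algorithm: the fallback here is plain greedy longest-match wordpiece rather than the modified variant the package uses, and the function name toy_tokenize_word, the "##" continuation prefix, the space-separated lookup values, and the treatment of words longer than max_chars as unknown are all assumptions made for the example.

```r
# Toy tokenizer for a single word (hypothetical, for illustration only).
toy_tokenize_word <- function(word, vocab, lookup,
                              unk_token = "[UNK]", max_chars = 100) {
  # Assumption: overly long words are treated as unknown.
  if (nchar(word) > max_chars) {
    return(unk_token)
  }

  # Step 1: if the word is in the lookup, use its stored breakdown.
  if (word %in% names(lookup)) {
    return(strsplit(lookup[[word]], " ", fixed = TRUE)[[1]])
  }

  # Step 2: otherwise fall back on greedy longest-match-first wordpiece.
  tokens <- character(0)
  n <- nchar(word)
  start <- 1
  while (start <= n) {
    end <- n
    piece <- NA_character_
    while (end >= start) {
      candidate <- substr(word, start, end)
      # Pieces that do not start the word carry a "##" prefix.
      if (start > 1) candidate <- paste0("##", candidate)
      if (candidate %in% vocab) {
        piece <- candidate
        break
      }
      end <- end - 1
    }
    # No piece matched at this position: the whole word is unknown.
    if (is.na(piece)) {
      return(unk_token)
    }
    tokens <- c(tokens, piece)
    start <- end + 1
  }
  tokens
}

toy_vocab <- c("un", "##break", "##able", "walk", "##ing")
toy_lookup <- c(walking = "walk ##ing")

toy_tokenize_word("walking", toy_vocab, toy_lookup)      # found in the lookup
toy_tokenize_word("unbreakable", toy_vocab, toy_lookup)  # wordpiece fallback
toy_tokenize_word("xyz", toy_vocab, toy_lookup)          # no pieces match
```

The lookup takes priority so that known words get a linguistically motivated breakdown, while the fallback keeps the tokenizer total: every input word yields either vocabulary pieces or the unknown token.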
We use a character vector with class morphemepiece_vocabulary to provide information about tokens used in morphemepiece_tokenize. This function takes a character vector of tokens and puts it into that format.
prepare_vocab(token_list)
token_list | A character vector of tokens. |
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")
Use this function to override the cache path used by morphemepiece for the current session. Set the MORPHEMEPIECE_CACHE_DIR environment variable for a more permanent change.
set_morphemepiece_cache_dir(cache_dir = NULL)
cache_dir | Character scalar; a path to a cache directory. |
A normalized path to a cache directory. The directory is created if the user has write access and the directory does not exist.