Package: piecemaker 1.0.2.9000
piecemaker: Tools for Preparing Text for Tokenizers
Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.
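To make the idea of shared preparation steps concrete, here is a minimal base R sketch of two such steps (collapsing whitespace and spacing out punctuation) followed by a split on spaces. The `sketch_*` helper names are hypothetical and written only for illustration; they are not piecemaker's implementation, which may handle many more cases.

```r
# Hypothetical helpers in base R; these are NOT piecemaker's own code.
sketch_squish <- function(text) {
  # Collapse runs of whitespace to a single space and trim both ends.
  trimws(gsub("[[:space:]]+", " ", text))
}

sketch_space_punct <- function(text) {
  # Surround each punctuation character with spaces so that it
  # becomes its own piece after splitting.
  gsub("([[:punct:]])", " \\1 ", text)
}

sketch_tokenize <- function(text) {
  # Prepare the text, then split each string on single spaces.
  prepared <- sketch_squish(sketch_space_punct(text))
  strsplit(prepared, " ", fixed = TRUE)
}

tokens <- sketch_tokenize("Hello,   world!")
print(tokens)
# tokens[[1]] is c("Hello", ",", "world", "!")
```

Spacing the punctuation before squishing means the extra spaces introduced around each mark are cleaned up in the same pass.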
Authors:
piecemaker_1.0.2.9000.tar.gz
piecemaker_1.0.2.9000.zip (r-4.5), piecemaker_1.0.2.9000.zip (r-4.4), piecemaker_1.0.2.9000.zip (r-4.3)
piecemaker_1.0.2.9000.tgz (r-4.4-any), piecemaker_1.0.2.9000.tgz (r-4.3-any)
piecemaker_1.0.2.9000.tar.gz (r-4.5-noble), piecemaker_1.0.2.9000.tar.gz (r-4.4-noble)
piecemaker_1.0.2.9000.tgz (r-4.4-emscripten), piecemaker_1.0.2.9000.tgz (r-4.3-emscripten)
piecemaker.pdf | piecemaker.html
piecemaker/json (API)
NEWS
# Install 'piecemaker' in R:
install.packages('piecemaker', repos = c('https://macmillancontentscience.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/macmillancontentscience/piecemaker/issues
Last updated 1 year ago from: b02c1a7492. Checks: OK: 7. Indexed: yes.
Target | Result | Date |
---|---|---|
Doc / Vignettes | OK | Oct 29 2024 |
R-4.5-win | OK | Oct 29 2024 |
R-4.5-linux | OK | Oct 29 2024 |
R-4.4-win | OK | Oct 29 2024 |
R-4.4-mac | OK | Oct 29 2024 |
R-4.3-win | OK | Oct 29 2024 |
R-4.3-mac | OK | Oct 29 2024 |
Exports: prepare_and_tokenize, prepare_text, remove_control_characters, remove_diacritics, remove_replacement_characters, space_cjk, space_punctuation, squish_whitespace, tokenize_space, validate_utf8
Dependencies: cli, glue, lifecycle, magrittr, rlang, stringi, stringr, vctrs
Readme and manuals
Help Manual
Help page | Topics |
---|---|
Split Text on Spaces | prepare_and_tokenize |
Prepare Text for Tokenization | prepare_text |
Remove Non-Character Characters | remove_control_characters |
Remove Diacritical Marks on Characters | remove_diacritics |
Remove the Unicode Replacement Character | remove_replacement_characters |
Add Spaces Around CJK Ideographs | space_cjk |
Add Spaces Around Punctuation | space_punctuation |
Remove Extra Whitespace | squish_whitespace |
Break Text at Spaces | tokenize_space |
Clean Up Text to UTF-8 | validate_utf8 |
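The help topics above suggest a prepare-then-tokenize workflow. The following is a hedged usage sketch, assuming the package is installed and that each function accepts a character vector as its first argument with its other arguments left at their defaults; consult the help pages for the exact signatures and options. The sample sentence is made up for illustration.

```r
# Assumes piecemaker is installed; the guard skips the example otherwise.
if (requireNamespace("piecemaker", quietly = TRUE)) {
  text <- "Some  text, with extra   whitespace."

  # Apply the shared preparation steps with default settings.
  prepared <- piecemaker::prepare_text(text)

  # Split the prepared text on spaces.
  tokens <- piecemaker::tokenize_space(prepared)

  # Prepare and tokenize in a single call.
  combined <- piecemaker::prepare_and_tokenize(text)

  print(prepared)
  print(tokens)
  print(combined)
}
```

Calling the individual steps (for example `squish_whitespace` or `space_punctuation`) directly may be preferable when a downstream tokenizer needs only some of the preparation.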