% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vocab.R
\name{prepare_vocab}
\alias{prepare_vocab}
\title{Format a Token List as a Vocabulary}
\usage{
prepare_vocab(token_list)
}
\arguments{
\item{token_list}{A character vector of tokens.}
}
\value{
The vocabulary as a named integer vector. Names are the tokens in the
vocabulary; values are their integer indices. The casedness of the vocabulary
is inferred and attached as the \code{"is_cased"} attribute.

Note that from the perspective of a neural net, the numeric indices \emph{are}
the tokens, and the mapping from token to index is fixed. Changing the
indexing would break any pre-trained model that uses the vocabulary. This
is why the vocabulary is stored as a named integer vector, and why its
indices start at zero.
}
\description{
We use a special named integer vector with class \code{wordpiece_vocabulary}
to provide information about the tokens used in
\code{\link{wordpiece_tokenize}}. This function takes a character vector of
tokens and converts it to that format.
}
\examples{
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")
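
# A minimal base-R sketch of the structure described under Value, assuming
# zero-based indices assigned in input order. It is an illustration only, not
# the package's implementation; sketch_tokens and sketch_vocab are
# hypothetical names introduced here.
sketch_tokens <- c("some", "example", "tokens")
sketch_vocab <- stats::setNames(seq_along(sketch_tokens) - 1L, sketch_tokens)
# Look up the integer index of a token by name.
sketch_vocab[["example"]]
# Recover the token stored at a given zero-based index.
names(sketch_vocab)[sketch_vocab == 2L]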
}
