% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tokenize.R
\name{tokens}
\alias{tokens}
\alias{tokenize}
\title{Create or coerce an object into class \code{tokens}}
\usage{
tokenize(
  x,
  re_drop_line = NULL,
  line_glue = NULL,
  re_cut_area = NULL,
  re_token_splitter = re("[^_\\\\p{L}\\\\p{N}\\\\p{M}'-]+"),
  re_token_extractor = re("[_\\\\p{L}\\\\p{N}\\\\p{M}'-]+"),
  re_drop_token = NULL,
  re_token_transf_in = NULL,
  token_transf_out = NULL,
  token_to_lower = TRUE,
  perl = TRUE,
  ngram_size = NULL,
  max_skip = 0,
  ngram_sep = "_",
  ngram_n_open = 0,
  ngram_open = "[]"
)
}
\arguments{
\item{x}{Either a character vector or an object of class
\link[NLP:TextDocument]{NLP::TextDocument} that contains the text to be tokenized.}

\item{re_drop_line}{\code{NULL} or character vector. If \code{NULL}, it is ignored.
Otherwise, a character vector (assumed to be of length 1)
containing a regular expression. Lines in \code{x}
that contain a match for \code{re_drop_line} are
treated as not belonging to the corpus and are excluded from the results.}

\item{line_glue}{\code{NULL} or character vector. If \code{NULL}, it is ignored.
Otherwise, all lines in \code{x} are glued together in one
character vector of length 1, with the string \code{line_glue}
pasted in between consecutive lines.
The value of \code{line_glue} can also be equal to the empty string \code{""}.
The 'line glue' operation is conducted immediately after the 'drop line' operation.}

\item{re_cut_area}{\code{NULL} or character vector. If \code{NULL}, it is ignored.
Otherwise, a character vector (assumed to be of length 1) containing a
regular expression; all matches for it in \code{x} are 'cut out' of the text prior
to the identification of the tokens in the text (and are therefore
not taken into account when identifying the tokens).
The 'cut area' operation is conducted immediately after the 'line glue' operation.}

\item{re_token_splitter}{Regular expression or \code{NULL}.
If not \code{NULL}, it identifies the locations where the text in \code{x}
is split into tokens; if \code{NULL}, \code{re_token_extractor} is used
instead. (See Details.)

The 'token identification' operation is conducted immediately after the
'cut area' operation.}

\item{re_token_extractor}{Regular expression that identifies the locations of the
actual tokens. This argument is only used if \code{re_token_splitter} is \code{NULL}.
(See Details.)

The 'token identification' operation is conducted immediately after the
'cut area' operation.}

\item{re_drop_token}{Regular expression or \code{NULL}. If \code{NULL}, it is ignored.
Otherwise, it identifies tokens that are to
be excluded from the results. Any token that contains a match for
\code{re_drop_token} is removed from the results.
The 'drop token' operation is conducted immediately after the 'token identification' operation.}

\item{re_token_transf_in}{Regular expression that identifies areas in the
tokens that are to be transformed. This argument works together with the argument
\code{token_transf_out}.

If both \code{re_token_transf_in} and \code{token_transf_out} are specified
(i.e. neither is \code{NULL} or \code{NA}), then all matches, in the tokens,
for the regular expression \code{re_token_transf_in} are replaced with
the replacement string \code{token_transf_out}.

The 'token transformation' operation is conducted immediately after the
'drop token' operation.}

\item{token_transf_out}{Replacement string. This argument works together with
\code{re_token_transf_in} and is ignored if \code{re_token_transf_in}
is \code{NULL} or \code{NA}.}

\item{token_to_lower}{Logical. Whether tokens must be converted
to lowercase before returning the result.
The 'token to lower' operation is conducted immediately after the
'token transformation' operation.}

\item{perl}{Logical. Whether the PCRE regular expression
flavor is being used in the arguments that contain regular expressions.}

\item{ngram_size}{Argument in support of ngrams/skipgrams (see also \code{max_skip}).

If one wants to identify individual tokens, the value of \code{ngram_size}
should be \code{NULL} or \code{1}. If one wants to retrieve
token ngrams/skipgrams, \code{ngram_size} should be an integer indicating
the size of the ngrams/skipgrams, e.g. \code{2} for bigrams or \code{3} for
trigrams.}

\item{max_skip}{Argument in support of skipgrams. This argument is ignored if
\code{ngram_size} is \code{NULL} or is \code{1}.

If \code{ngram_size} is \code{2} or higher, and \code{max_skip}
is \code{0}, then regular ngrams are retrieved (although they
may contain open slots; see \code{ngram_n_open}).

If \code{ngram_size} is \code{2} or higher, and \code{max_skip}
is \code{1} or higher, then skipgrams are retrieved (which in the
current implementation cannot contain open slots; see \code{ngram_n_open}).

For instance, if \code{ngram_size} is \code{3} and \code{max_skip} is
\code{2}, then 2-skip trigrams are retrieved.
Or if \code{ngram_size} is \code{5} and \code{max_skip} is
\code{3}, then 3-skip 5-grams are retrieved.}

\item{ngram_sep}{Character vector of length 1 containing the string that is used to
separate/link tokens in the representation of ngrams/skipgrams
in the output of this function.}

\item{ngram_n_open}{If \code{ngram_size} is \code{2} or higher, and moreover
\code{ngram_n_open} is a number higher than \code{0}, then
ngrams with 'open slots' in them are retrieved. These
ngrams with 'open slots' are generalizations of fully lexically specific
ngrams (with the generalization being that one or more of the items
in the ngram are replaced by a notation that stands for 'any arbitrary token').

For instance, if \code{ngram_size} is \code{4} and \code{ngram_n_open} is
\code{1}, and if moreover the input contains a
4-gram \code{"it_is_widely_accepted"}, then the output will contain
all modifications of \code{"it_is_widely_accepted"} in which one (since
\code{ngram_n_open} is \code{1}) of the items in this n-gram is
replaced by an open slot. The first and the last item inside
an ngram are never turned into an open slot; only the items in between
are candidates for being turned into open slots. Therefore, in the
example, the output will contain \code{"it_[]_widely_accepted"} and
\code{"it_is_[]_accepted"}.

As a second example, if \code{ngram_size} is \code{5} and
\code{ngram_n_open} is \code{2}, and if moreover the input contains a
5-gram \code{"it_is_widely_accepted_that"}, then the output will contain
\code{"it_[]_[]_accepted_that"}, \code{"it_[]_widely_[]_that"}, and
\code{"it_is_[]_[]_that"}.}

\item{ngram_open}{Character string used to represent open slots in ngrams in the
output of this function.}
}
\value{
An object of class \code{\link{tokens}}, i.e. a sequence of tokens.
It has a number of attributes and methods such as:
\itemize{
\item base \code{\link[=print.types]{print}}, \code{\link[=as_data_frame]{as_data_frame()}}, \code{\link[=summary]{summary()}}
(which returns the number of items), \code{\link[=sort]{sort()}} and \code{\link[=rev]{rev()}},
\item \code{\link[tibble:as_tibble]{tibble::as_tibble()}},
\item an interactive \code{\link[=explore]{explore()}} method,
\item some getters, namely \code{\link[=n_tokens]{n_tokens()}} and \code{\link[=n_types]{n_types()}},
\item subsetting methods such as \code{\link[=keep_types]{keep_types()}}, \code{\link[=keep_pos]{keep_pos()}}, etc., including \verb{[]}
subsetting (see \link{brackets}).
}

Additional manipulation functions include the \code{\link[=trunc_at]{trunc_at()}} method,
\code{\link[=tokens_merge]{tokens_merge()}} and \code{\link[=tokens_merge_all]{tokens_merge_all()}} (which combine token lists), and an
\code{\link[=as_character]{as_character()}} method to convert to a character vector.

Objects of class \code{tokens} can be saved to file with \code{\link[=write_tokens]{write_tokens()}};
these files can be read with \code{\link[=read_tokens]{read_tokens()}}.
}
\description{
\code{tokenize()} splits a text into a sequence of tokens, using regular expressions
to identify them, and returns an object of the class \code{\link{tokens}}.
}
\details{
If the output contains ngrams with open slots, then the order
of the items in the output is no longer meaningful. For instance, suppose
that \code{ngram_size} is \code{5} and \code{ngram_n_open} is \code{2}.
If the input contains a 5-gram \code{"it_is_widely_accepted_that"}, then the output
will contain \code{"it_[]_[]_accepted_that"}, \code{"it_[]_widely_[]_that"} and
\code{"it_is_[]_[]_that"}. The relative order of these three items in the output
must be considered arbitrary.
}
\examples{
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."

tks <- tokenize(toy_corpus)
print(tks, n = 1000)

tks <- tokenize(toy_corpus, re_token_splitter = "\\\\W+")
print(tks, n = 1000)
sort(tks)
summary(tks)

tokenize(toy_corpus, ngram_size = 3)

tokenize(toy_corpus, ngram_size = 3, max_skip = 2)

tokenize(toy_corpus, ngram_size = 3, ngram_n_open = 1)
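
# The calls below are an illustrative sketch of the later pipeline steps
# described under Arguments. They assume that plain character strings are
# accepted as regular expressions here, as in the re_token_splitter
# example above; the patterns themselves are arbitrary illustrations.

# 'drop token': exclude every token that is only one or two characters long
# (this step runs before lowercasing, so the pattern sees the original case)
tokenize(toy_corpus, re_drop_token = "^.{1,2}$")

# 'token transformation': delete a word-final "ly" from every token
tokenize(toy_corpus, re_token_transf_in = "ly$", token_transf_out = "")

# trigrams with one open slot, written with a space as separator
# and "*" as the open-slot symbol instead of the defaults "_" and "[]"
tokenize(toy_corpus, ngram_size = 3, ngram_n_open = 1,
         ngram_sep = " ", ngram_open = "*")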
}
\seealso{
\code{\link[=as_tokens]{as_tokens()}}
}
