% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tika.R
\name{tika}
\alias{tika}
\title{Main R Interface to 'Apache Tika'}
\usage{
tika(input, output = c("text", "jsonRecursive", "xml", "html")[1],
  output_dir = "", return = TRUE, java = rtika::java(),
  jar = rtika::tika_jar(), threads = 2, max_restarts = integer(),
  timeout = 3e+05, max_file_size = integer(), args = character(),
  quiet = TRUE, cleanup = TRUE, lib.loc = .libPaths())
}
\arguments{
\item{input}{Character vector describing the paths to the input documents.
Strings starting with 'http://', 'https://', or 'ftp://' are downloaded to a
temporary directory first. Each file will be read, but not modified.}

\item{output}{Optional character vector of the output format. The default,
\code{"text"}, gets plain text without metadata. \code{"xml"} and
\code{"html"} get \code{XHTML} text with metadata. \code{"jsonRecursive"}
gets \code{XHTML} text and \code{json} metadata.
\code{c("jsonRecursive","text")} or \code{c("J","t")} get plain text and
\code{json} metadata. See the 'Output Details' section.}

\item{output_dir}{Optional directory path to save the converted files in.
Tika may overwrite files so an empty directory is best. See the 'Output
Details' section before using.}

\item{return}{Logical indicating whether an R object should be returned.
Defaults to \code{TRUE}. If set to \code{FALSE}, then \code{output_dir}
(above) must be specified.}

\item{java}{Optional command to invoke Java. For example, it can be the full
path to a particular Java version. See the Installation section below.}

\item{jar}{Optional alternative path to a \code{tika-app-X.XX.jar}. Useful
if this package becomes out of date.}

\item{threads}{Integer of the number of file consumer threads Tika uses.
Defaults to 2.}

\item{max_restarts}{Integer of the maximum number of times the watchdog
process will restart the child process. The default is no limit.}

\item{timeout}{Integer of the number of milliseconds allowed to a parse
before the process is killed and restarted. Defaults to 300000.}

\item{max_file_size}{Integer of the maximum file size in bytes.
Files larger than this are not processed. The default is unlimited.}

\item{args}{Optional character vector of additional arguments passed to Tika,
that may not yet be implemented in this R interface, in the pattern of
\code{c('-arg1','setting1','-arg2','setting2')}.}

\item{quiet}{Logical if Tika command line messages and errors are to be
suppressed. Defaults to \code{TRUE}.}

\item{cleanup}{Logical indicating whether to remove temporary files after
running the command. These files are stored in \code{tempdir()} and can
accumulate over repeated calls. Defaults to \code{TRUE}. Even if set to
\code{FALSE}, the files are automatically removed at the end of the R session.}

\item{lib.loc}{Optional character vector describing the library paths
containing the \code{data.table} package. Normally, it's best to
install the package and leave this parameter alone. The parameter is included
mainly for package testing.}
}
\value{
A character vector in the same order and with the same length as
\code{input}. Unprocessed files are \code{as.character(NA)}.
If \code{return = FALSE}, then a \code{NULL} value is invisibly returned.
See the Output Details section below.
}
\description{
Extract text or metadata from over a thousand file types.
Get either plain text or structured \code{XHTML}.
Metadata includes \code{Content-Type}, character encoding, and Exif data from
jpeg or tiff images. See the long list of supported file types:
\url{https://tika.apache.org/1.19/formats.html}.
}
\section{Output Details}{

If an input file did not exist, could not be downloaded, was a directory, or
Tika could not process it, the result will be \code{as.character(NA)} for
that file.

By default, \code{output = "text"} and this produces plain text with no
metadata. Some formatting is preserved in this case using tabs, newlines and
spaces.

Setting \code{output} to either \code{"xml"} or the shortcut \code{"x"} will
produce a strict form of \code{HTML} known as \code{XHTML}, with metadata in
the \code{head} node and formatted text in the \code{body}.
Content retains more formatting with \code{"xml"}. For example, a Word or
Excel table will become an HTML \code{table}, with table data as text in
\code{td} elements. The \code{"html"} option and its shortcut \code{"h"}
seem to produce the same result as \code{"xml"}.
Parse XHTML output with \code{xml2::read_html}.
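
As a minimal sketch of parsing the \code{XHTML} output, assuming the
\code{xml2} package is installed and that Tika writes metadata as \code{meta}
elements with \code{name} and \code{content} attributes (an assumption about
Tika's output, not something this package guarantees), the metadata and table
cells might be extracted like this:

\preformatted{
# Assumes 'rtika' and 'xml2' are installed and Java is available.
docx <- system.file("extdata", "table.docx", package = "rtika")
xhtml <- rtika::tika(docx, output = "xml")

if (!is.na(xhtml)) {
  doc <- xml2::read_html(xhtml)

  # metadata: assumed to be <meta name="..." content="..."> inside <head>
  meta <- xml2::xml_find_all(doc, "//head/meta")
  data.frame(
    name = xml2::xml_attr(meta, "name"),
    content = xml2::xml_attr(meta, "content")
  )

  # text of every table cell in the body
  xml2::xml_text(xml2::xml_find_all(doc, "//td"))
}
}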

Setting \code{output} to \code{"jsonRecursive"} or its shortcut \code{"J"}
produces a tree structure in \code{json}. Metadata fields are at the top level.
The \code{XHTML} or plain text will be found in the \code{X-TIKA:content}
field. By default the text is \code{XHTML}. This can be changed to plain
text like this: \code{output=c("jsonRecursive","text")} or
\code{output=c("J","t")}. This syntax is meant to mirror Tika's. Parse
\code{json} with \code{jsonlite::fromJSON}.
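
The following sketch shows one way to reach the \code{X-TIKA:content} field,
assuming the \code{jsonlite} package is installed and that
\code{jsonlite::fromJSON} simplifies the result to a data frame with one row
per document or embedded resource:

\preformatted{
# Assumes 'rtika' and 'jsonlite' are installed and Java is available.
pdf <- system.file("extdata", "jsonlite.pdf", package = "rtika")
json <- rtika::tika(pdf, output = c("J", "t"))

if (!is.na(json)) {
  meta <- jsonlite::fromJSON(json)
  # first 200 characters of the plain text of the main document
  substr(meta[["X-TIKA:content"]][1], 1, 200)
}
}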

 If \code{output_dir} is specified, then the converted files will also be
 saved to this directory. It's best to use an empty directory because Tika
 may overwrite existing files. Tika seems to add an extra file extension to
 each file to reduce the chance, but it's still best to use an empty
 directory. The file locations within the \code{output_dir} maintain the same
 general path structure as the input files. Downloaded files have a path
 similar to the \code{tempdir()} that R uses. The original paths are now relative
 to \code{output_dir}.  Files are appended with \code{.txt} for the default
 plain text, but can be \code{.json}, \code{.xml}, or \code{.html} depending
 on the \code{output} setting. One way to get a list of the processed files
 is to use \code{list.files} with \code{recursive=TRUE}.
 If \code{output_dir} is not specified, files are saved to a volatile temp
 directory named by \code{tempdir()} and will be deleted when R shuts down.
 If this function will be run on very large batches repeatedly, these
 temporary files are cleaned up after every call as long as
 \code{cleanup=TRUE}, which is the default.
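
 A minimal sketch of saving converted files and listing them afterwards. The
 directory name \code{tika_out} is hypothetical; any empty directory will do:

\preformatted{
# 'tika_out' is a hypothetical directory name used for illustration.
out_dir <- "tika_out"
dir.create(out_dir, showWarnings = FALSE)

pdf <- system.file("extdata", "jsonlite.pdf", package = "rtika")
# return = FALSE requires output_dir to be set
rtika::tika(pdf, output_dir = out_dir, return = FALSE)

# converted files keep the general path structure of the inputs
list.files(out_dir, recursive = TRUE, full.names = TRUE)
}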
}

\section{Background}{

Tika is a foundational library for several Apache projects such as the Apache
Solr search engine. It has been in development since at least 2007. The most
efficient way I've found to process many thousands of documents is Tika's
'batch' mode, which is the only mode used in \code{rtika}. There are potentially
more things that can be done, given enough time and attention, because
Apache Tika includes many libraries and methods in its .jar file. The source is available at:
\url{https://tika.apache.org/}.
}

\section{Installation}{

 Tika requires Java 8.

 Java installation instructions are at \url{http://openjdk.java.net/install/}
or \url{https://www.java.com/en/download/help/download_options.xml}.

By default, this R package internally invokes Java by calling the \code{java}
command from the command line. To use a particular Java version, pass its
full path in the \code{java} argument of the \code{tika} function.
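
For instance, a call with an explicit Java binary might look like the sketch
below. The path shown is hypothetical and must be replaced with the location
of Java on your system:

\preformatted{
# Hypothetical path; substitute the Java binary installed on your system.
java_bin <- "/usr/lib/jvm/java-8-openjdk-amd64/bin/java"

if (file.exists(java_bin)) {
  pdf <- system.file("extdata", "curl.pdf", package = "rtika")
  text <- rtika::tika(pdf, java = java_bin)
}
}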

Having the \code{data.table} package installed will slightly speed up the
communication between R and Tika. The difference is most noticeable when there
are hundreds of thousands of documents to process.
}

\examples{
\donttest{
#extract text
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)
text = tika(batch)
cat(substr(text[1],45,450))
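
# files that could not be processed are returned as NA;
# a minimal sketch for listing them, using only base R
failed <- batch[is.na(text)]
print(failed)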

# more complex metadata
if (requireNamespace("jsonlite", quietly = TRUE)) {

  # "J" is the shortcut for jsonRecursive, "t" for text
  json <- tika(batch, c("J", "t"))
  metadata <- lapply(json, jsonlite::fromJSON)

  # content type of each document and of its embedded resources
  lapply(metadata, function(x) as.character(x$"Content-Type"))

  # creation dates, where available
  lapply(metadata, function(x) as.character(x$"Creation-Date"))

  # paths of embedded resources
  lapply(metadata, function(x) as.character(x$"X-TIKA:embedded_resource_path"))
}
}
}
