% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/kanjivec.R
\name{kanjivec}
\alias{kanjivec}
\title{Create kanjivec objects from KanjiVG data}
\usage{
kanjivec(
  kanji,
  database = NULL,
  flatten = "intelligent",
  save = FALSE,
  overwrite = FALSE,
  simplify = TRUE
)
}
\arguments{
\item{kanji}{a (vector of) character string(s) of one or several kanji.}

\item{database}{the path to a local copy of (a subset of) the KanjiVG database. It is expected
that the svg files reside at this exact location (not in a subdirectory). If \code{NULL},
an attempt is made to read the svg file(s) from the KanjiVG GitHub repository (after
prompting for confirmation, which can be switched off via the \link[=kanjistat_options]{option}
\code{ask_github}).}

\item{flatten}{logical. Should nodes that are only-children be fused with their parents?
Alternatively, one of the strings \code{"intelligent"}, \code{"inner"} or \code{"leaves"}. Although the first is the default,
it is experimental and its precise meaning will change in the future; see details.}

\item{save}{logical or character. If \code{FALSE}, return the (list of) kanjivec object(s). Otherwise, save the result
as an rds file in the working directory (as kvecsave.rds) or under the file path provided.}

\item{overwrite}{logical. If \code{FALSE}, an error is raised (before any computations are done) if the designated
file path already exists. Otherwise, an existing file is overwritten.}

\item{simplify}{logical. Should a single kanjivec object be returned (instead of a list of one) if \code{kanji}
is a single kanji?}
}
\value{
A list of objects of class \code{kanjivec} or, if only one kanji was specified and
\code{simplify} is \code{TRUE}, a single object of class \code{kanjivec}. If \code{save = TRUE},
the same is (saved and) still returned invisibly.
}
\description{
Create a (list of) kanjivec object(s). Each object represents a kanji as a tree of strokes,
built from the .svg files of the KanjiVG database and enriched with further, derived information.
}
\details{
A kanjivec object contains detailed information on the strokes of which an individual kanji
is composed including their order, a segmentation into reasonable components ("radicals" in a
more general sense of the word), classification of individual strokes, and both
vector data and interpolated points to recreate the actual stroke in a Kyoukashou style font.
For more information on the original data see \url{http://kanjivg.tagaini.net/}. That data
is licenced under Creative Commons BY-SA 3.0 (see licence file of this package).

The original .svg files sometimes contain additional \verb{<g>} elements that provide
information about the current group of strokes rather than establishing a new subgroup
of its own. This happens typically for information that establishes coherence with another
part of the tree (by noting that the current subgroup is also part 2 of something else),
but also for variant information. With the option \code{flatten = TRUE}, the extra hierarchy
level in the tree is avoided, while the original information in the KanjiVG file is kept.
This is achieved by fusing only-children to their parents, giving the new node the name
of the child and all its attributes, but prefixing \code{p.} to the attribute names
of the parent (the parent's "names" attribute is discarded, but can be reconstructed from
the parent's id). Removing several hierarchy levels in sequence can lead to attribute names
with multiple \code{p.} prefixes. Fusing to parents is suppressed if the parent is the
root of the hierarchy (typically for one-stroke kanji), as this could lead to confusing
results.

The options \code{flatten = "inner"} and \code{flatten = "leaves"} implement the above behavior
only for the corresponding type of node (inner nodes or leaves). The option
\code{flatten = "intelligent"} tries to find out in more sophisticated ways which flattening
is desirable and which is not (it flattens rather conservatively). Currently, two kinds of
nodes are flattened away. The first are nodes without an element attribute that have only
one child (one example where this is reasonable is kanji \code{kbase[187, ]}). The second
are nodes with an element attribute and only one child, provided that this child is also an
inner node and has the same element and part attributes as the parent, and that neither has
a number (this would be problematic for any component-building code
in the particular case of kanji \code{kbase[1111, ]}).

A \code{kanjivec} object has components
\describe{
\item{\code{char}}{the kanji (a single character)}
\item{\code{hex}}{its Unicode codepoint (integer of class \code{hexmode})}
\item{\code{padhex}}{the Unicode codepoint padded with zeros to five digits (mode character)}
\item{\code{family}}{the font on which the data is based. Currently only "schoolbook" (to be extended with "kaisho" at some point)}
\item{\code{nstrokes}}{the number of strokes in the kanji}
\item{\code{ncompos}}{a vector of the number of components at each depth of the tree}
\item{\code{nveins}}{the number of veins in the component structure}
\item{\code{strokedend}}{the decomposition tree of the kanji as an object of class \code{dendrogram}}
\item{\code{components}}{the component structure by segmentation depth (components can overlap) in terms
of KanjiVG elements and their depth-first tree coordinates}
\item{\code{veins}}{the veins in the component structure. Each vein is represented as a two-column matrix
that lists in its rows the indices of \code{components} (starting at the root,
which in the component indexing is \code{c(1,1)})}
\item{\code{stroketree}}{the decomposition tree of the kanji, a list containing the full information of
the KanjiVG file (except some top level attributes)}
}
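
As a minimal sketch in base R (an illustration only, not necessarily how \code{kanjivec}
computes these values internally), the relation between the components \code{char}, \code{hex}
and \code{padhex} described above can be reproduced as follows:
\preformatted{
ch <- intToUtf8(0x85e4)            # build a kanji from its codepoint
cp <- as.hexmode(utf8ToInt(ch))    # codepoint as an integer of class hexmode
format(cp)                         # "85e4"
format(cp, width = 5)              # zero-padded to five digits: "085e4"
}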

\code{stroketree} is a close representation of the KanjiVG svg file as a list object with
deeply nested sublists. The XML attributes become attributes of the list and its elements.
The user will usually not have to look at or manipulate \code{stroketree} directly, but
\code{strokedend} and \code{components} are derived from it and other functions may process it
further.

The main differences to the svg file are
\enumerate{
\item The actual strokes are not only given as d-attributes describing Bézier curves
but also as two-column matrices describing discretizations of these curves. These matrices
are the actual contents of the innermost lists in \code{stroketree}, but are more conveniently
accessed via the function \code{\link{get_strokes}}.
\item The positions of the stroke numbers (for plotting) are saved as an attribute \code{strokenum_coords}
of the entire stroke tree rather than as a separate element.
}

\code{strokedend} is easier to examine and work with due to various convenience functions for
dendrograms in the packages \code{stats} and \code{\link{dendextend}}, including \code{\link[utils]{str}}
and \code{\link[stats]{plot.dendrogram}}. The function \code{\link{plot.kanjivec}} with option
\code{type = "dend"} is a wrapper for \code{\link[stats]{plot.dendrogram}} with reasonable presets
for various options.

The label attributes of the nodes of \code{strokedend} are taken from the element (for inner nodes)
and type (for leaves) attributes of the .svg files. They consist of UTF-8 characters representing
kanji parts and a combination of UTF-8 characters representing strokes, and may not render
well in all CJK fonts (see details of \code{\link{plot.kanjivec}}). If element and type are missing
in the .svg file, the label assigned is the second part of the id-attribute, e.g. \code{g5} or \code{s9}.

The \code{components} at a given level can be plotted; see \code{\link{plot.kanjivec}} with
\code{type = "kanji"}. Both \code{components} and \code{veins} serve mainly for the computation
of \link[=kanjidist]{kanji distances}.
}
\examples{
if (interactive()) {
  # Try to load the svg file for the kanji from GitHub.
  res <- kanjivec("\u85e4", database=NULL)
  str(res)
}

fivebetas  # sample kanjivec data
str(fivebetas[[1]])
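
# A short sketch of accessing some of the components listed in the
# details section; it assumes only the component names documented there.
kvec <- fivebetas[[1]]
kvec$char      # the kanji as a single character
kvec$hex       # its Unicode codepoint (class hexmode)
kvec$padhex    # the codepoint padded with zeros to five digits
kvec$nstrokes  # the number of strokes
kvec$ncompos   # the number of components at each depth of the tree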

}
\seealso{
\code{\link{plot.kanjivec}}, \code{\link{str.kanjivec}}
}
