% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/keyword_merge.R
\name{keyword_merge}
\alias{keyword_merge}
\title{Merge keywords that are supposed to have the same meaning}
\usage{
keyword_merge(dt, id = "id", keyword = "keyword", reduce_form = "lemma")
}
\arguments{
\item{dt}{A data.frame containing at least two columns with document ID and keyword.}

\item{id}{A quoted character string specifying the name of the document ID column. Defaults to "id".}

\item{keyword}{A quoted character string specifying the name of the keyword column. Defaults to "keyword".}

\item{reduce_form}{Merge keywords with the same stem ("stem") or lemma ("lemma"). See Details.
Defaults to "lemma". Another, more advanced option is "partof": if a non-unigram (A) is part (a subset) of
another non-unigram (B), the longer one (B) is replaced by the shorter one (A).}
}
\value{
A tbl, namely a tidy table with document ID and merged keyword.
}
\description{
Merge keywords that have a common stem or lemma, and return the majority form of the word. This function
receives a tidy table (data.frame) with document IDs and the keywords to be merged.
}
\details{
While \code{keyword_clean} provides a robust way to lemmatize keywords, the returned token
might not be the most commonly used form. This function first gets the stem or lemma of
every keyword using \code{\link{stem_strings}} or \code{\link{lemmatize_strings}} from the \pkg{textstem} package,
then finds the most frequent form for each stem or lemma (if there is a tie, one is selected at random).
Finally, every keyword is replaced by the most frequent keyword that shares the same stem or lemma.

When \code{reduce_form} is set to "partof", non-unigrams in the same document are compared:
if one non-unigram is a subset of another, both are merged into the shorter one,
which is considered more general (e.g. "time series" and "time series analysis" are
merged into "time series" if they co-occur in the same document). This can reduce redundant
information. The rule is applied only to multi-word phrases, because applying it to single words would
oversimplify the tokens and cause information loss (therefore, "time series" and "time" are not
merged into "time"). This is an advanced option that should be used with caution, as it trades
detailed information for generalization.
}
\examples{
library(akc)

\donttest{
bibli_data_table \%>\%
  keyword_clean(lemmatize = FALSE) \%>\%
  keyword_merge(reduce_form = "stem")

bibli_data_table \%>\%
  keyword_clean(lemmatize = FALSE) \%>\%
  keyword_merge(reduce_form = "lemma")
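
# A hedged sketch of the "partof" option described in Details: within each
# document, a longer multi-word keyword that contains a shorter multi-word
# keyword is replaced by the shorter one. This assumes the same preprocessing
# as the calls above; results depend on the data, so no output is shown.
bibli_data_table \%>\%
  keyword_clean(lemmatize = FALSE) \%>\%
  keyword_merge(reduce_form = "partof")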
}

}
\seealso{
\code{\link[textstem]{stem_strings}}, \code{\link[textstem]{lemmatize_strings}}
}
