% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/LncFinder.R
\name{make_frequencies}
\alias{make_frequencies}
\title{Make the frequencies file for new classifier construction}
\usage{
make_frequencies(cds.seq, mRNA.seq, lncRNA.seq, SS.features = FALSE,
  cds.format = "DNA", lnc.format = "DNA", check.cds = TRUE,
  ignore.illegal = TRUE)
}
\arguments{
\item{cds.seq}{Coding sequences (mRNA without UTRs). Can be a FASTA file loaded
by \code{\link[seqinr]{seqinr-package}} or secondary structure
sequences (Dot-Bracket Notation) obtained from function \code{\link{run_RNAfold}}.
CDs are used only to calculate hexamer frequencies of nucleotide sequences, so
secondary structure is not needed. Parameter \code{cds.format} should be
\code{"SS"} when the input is secondary structure sequences. (See details for
more information.)}

\item{mRNA.seq}{mRNA sequences with Dot-Bracket Notation. The secondary
structure sequences can be obtained from function \code{\link{run_RNAfold}}.
mRNA sequences are used to calculate the acgu-ACGU and acguD frequencies
(see details), so they are required only when \code{SS.features = TRUE}.}

\item{lncRNA.seq}{Long non-coding RNA sequences. Can be a FASTA file loaded by
\code{\link[seqinr]{seqinr-package}} or secondary structure
sequences (Dot-Bracket Notation) obtained from function \code{\link{run_RNAfold}}.
If \code{SS.features = TRUE}, \code{lncRNA.seq} must be RNA sequences with
secondary structure sequences and parameter \code{lnc.format} should be defined
as \code{"SS"}.}

\item{SS.features}{Logical. If \code{SS.features = TRUE}, frequencies of secondary
structure will also be calculated and the model can be built with secondary
structure features. In this case, \code{mRNA.seq} and \code{lncRNA.seq} should
be secondary structure sequences.}

\item{cds.format}{String. Define the format of the sequences of \code{cds.seq}.
Can be \code{"DNA"} or \code{"SS"}. \code{"DNA"} for DNA sequences and \code{"SS"}
for secondary structure sequences.}

\item{lnc.format}{String. Define the format of lncRNAs (\code{lncRNA.seq}).
Can be \code{"DNA"} or \code{"SS"}. \code{"DNA"} for DNA sequences and \code{"SS"}
for secondary structure sequences. This parameter must be defined as \code{"SS"}
when \code{SS.features = TRUE}.}

\item{check.cds}{Logical. Incomplete CDs can lead to a false shift and
inaccurate hexamer frequencies. When \code{check.cds = TRUE}, hexamer frequencies
will be calculated on the longest ORF. Setting this parameter to \code{TRUE} is
strongly recommended when mRNA is used as CDs.}

\item{ignore.illegal}{Logical. If \code{TRUE}, the sequences with non-nucleotide
characters (nucleotide characters: "a", "c", "g", "t") will be ignored when
calculating hexamer frequencies.}
}
\value{
Returns a list which consists of the frequencies of protein-coding sequences
and non-coding sequences.
}
\description{
This function is used to calculate the frequencies of lncRNAs, CDs, and
secondary structure sequences. The frequencies file can be used to build the classifier
using function \code{\link{extract_features}}. Functions \code{make_frequencies} and
\code{extract_features} are useful when users are trying
to build their own model.

NOTE: Function \code{make_frequencies} makes the frequencies file
for building the classifiers of the LncFinder method. If users need to calculate Logarithm-Distance,
Euclidean-Distance, and hexamer score, the frequencies file needs to be computed using function
\code{\link{make_referFreq}}.
}
\details{
This function is used to make the frequencies file for the LncFinder method. This file is needed
when users are trying to build their own model.

To achieve high accuracy, mRNA should not be regarded as CDs and assigned
to parameter \code{cds.seq}. However, CDs of some species may be insufficient
for calculating frequencies, in which case mRNAs can be regarded as CDs with parameter
\code{check.cds = TRUE}. Hexamer frequencies will then be calculated
on the longest ORF.

Because obtaining secondary structure sequences is time-consuming,
users can provide only nucleotide sequences and build a model without secondary
structure features (\code{SS.features = FALSE}). If users want to build a model
with secondary structure features, parameter \code{SS.features} should be set
to \code{TRUE}. In that case, the sequences of \code{mRNA.seq}
and \code{lncRNA.seq} should be secondary structure sequences (Dot-Bracket Notation).
Secondary structure sequences can be obtained by function \code{\link{run_RNAfold}}.

Please note that:

SS.features can improve the performance when the species of unevaluated sequences
is identical to the species of the sequences used to build the model.

However, if users are trying to predict sequences with the model trained on
other species, SS.features may lead to low accuracy.

The frequencies file consists of three groups: Hexamer Frequencies, acgu-ACGU
Frequencies, and acguD Frequencies.

Hexamer Frequencies are calculated on the original nucleotide sequences by
employing a \emph{k}-mer scheme (\emph{k} = 6), with the sliding window moving
3 nt at each step.

For any secondary structure sequence (Dot-Bracket Notation), if one position
is a dot, the corresponding nucleotide of the RNA sequence will be replaced
with character "D". acguD Frequencies are the \emph{k}-mer frequencies
(\emph{k} = 4) calculated on these new sequences.

Similarly, for any secondary structure sequence (Dot-Bracket Notation), if
one position is "(" or ")", the corresponding nucleotide of the RNA sequence
will be replaced with its upper-case form ("A", "C", "G", "U").

A brief example:

DNA Sequence:\code{          5'-   t  a  c  a  g  t  t  a  t  g   -3'}

RNA Sequence:\code{          5'-   u  a  c  a  g  u  u  a  u  g   -3'}

Dot-Bracket Sequence:\code{     5'-   .  .  .  .  (  (  (  (  (  (   -3'}

acguD Sequence:\code{         \{     D, D, D, D, g, u, u, a, u, g   \}}

acgu-ACGU Sequence:\code{    \{     u, a, c, a, G, U, U, A, U, G   \}}
}
\section{References}{

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li.
LncFinder: an integrated platform for long non-coding RNA identification utilizing
sequence intrinsic composition, structural information, and physicochemical property.
\emph{Briefings in Bioinformatics}, 2018, bby065.
}

\examples{
### Only for examples:
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

\dontrun{
### Obtain the secondary structure sequences (Windows OS):
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq <- run_RNAfold(Seqs, RNAfold.path = RNAfold.path, parallel.cores = 2)

### Make frequencies file with secondary structure features:
my_file_1 <- make_frequencies(cds.seq = SS.seq, mRNA.seq = SS.seq,
                              lncRNA.seq = SS.seq, SS.features = TRUE,
                              cds.format = "SS", lnc.format = "SS",
                              check.cds = TRUE, ignore.illegal = FALSE)
}

### Make frequencies file without secondary structure features:
my_file_2 <- make_frequencies(cds.seq = Seqs, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "DNA",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)

### The input of cds.seq and lncRNA.seq can also be secondary structure
### sequences when SS.features = FALSE, such as,
data(demo_SS.seq)
SS.seq <- demo_SS.seq
my_file_3 <- make_frequencies(cds.seq = SS.seq, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "SS",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)
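
### Illustration only: a minimal base-R sketch of the encodings described in
### Details, using the brief example sequence given there. This is not
### LncFinder's internal code; all object names below (rna, db, paired,
### acguD, acguACGU, dna, starts, hexamers) are hypothetical.
rna <- c("u", "a", "c", "a", "g", "u", "u", "a", "u", "g")
db  <- c(".", ".", ".", ".", "(", "(", "(", "(", "(", "(")
paired <- db == "(" | db == ")"

### acguD: nucleotides at unpaired positions (dots) become "D".
acguD <- ifelse(paired, rna, "D")
### acgu-ACGU: nucleotides at paired positions become upper case.
acguACGU <- ifelse(paired, toupper(rna), rna)
paste(acguD, collapse = "")
paste(acguACGU, collapse = "")

### Hexamers (k = 6) read with a 3-nt step, assuming for this sketch that
### the window starts at the first nucleotide of the sequence.
dna <- "tacagttatg"
starts <- seq(1, nchar(dna) - 5, by = 3)
hexamers <- substring(dna, starts, starts + 5)
table(hexamers)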
}
\seealso{
\code{\link{run_RNAfold}}, \code{\link{read_SS}},
         \code{\link{build_model}}, \code{\link{extract_features}},
         \code{\link{make_referFreq}}.
}
\author{
HAN Siyu
}
