\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[a4paper, left=1in, right=1in, top=1in, bottom=1in]{geometry}
\usepackage{amsmath}
\title{Report (Pruning)}
\author{Qi Ji}
\date{May 2020}
\begin{document}
\maketitle
\section{Background knowledge}
We can prune a decision tree to reduce its size and to mitigate overfitting. Generally, there are two kinds of pruning: Pre-pruning and Post-pruning.
\subsection{Pre-pruning}
\begin{itemize}
\item Pre-pruning is also called early stopping.
\item Pre-pruning stops the tree from growing before it fits the training data perfectly.
\item For example, one can set a maximum depth to restrict the growth of the tree (see the sketch after this list).
\end{itemize}
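As a minimal sketch of pre-pruning by depth, using scikit-learn (an assumption of this example: \texttt{X\_train} and \texttt{y\_train} stand for the training features and labels):
\begin{verbatim}
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: cap the depth so the tree stops growing early.
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)
\end{verbatim}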
\subsection{Post-pruning}
\subsubsection{REP(Reduced-Error Pruning)}
REP is one of the simplest forms of Post-pruning.\\
First, calculate the error $E_{r}(t)$ that the subtree rooted at node $t$ would make if it were replaced by a single leaf.
Then, calculate the error $E_{r}(T_{t})$ of each leaf node of the subtree $T_{t}$.
If $$E_{r}(t) < \sum E_{r}(T_{t}),$$
then replace the subtree with a leaf node, whose label is the majority label of the samples in the subtree. Repeat this process from the bottom of the tree to the top.
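To make the bottom-up procedure concrete, here is a minimal Python sketch; the \texttt{Node} class and its field names are our own illustration, not from any library:
\begin{verbatim}
class Node:
    def __init__(self, majority_label, leaf_error, left=None, right=None):
        self.majority_label = majority_label  # majority class at this node
        self.leaf_error = leaf_error  # E_r(t): errors if pruned to a leaf
        self.left = left              # children (None for a leaf)
        self.right = right

def leaves_error(node):
    # sum of E_r over the current leaves of the subtree T_t
    if node.left is None and node.right is None:
        return node.leaf_error
    return leaves_error(node.left) + leaves_error(node.right)

def rep_prune(node):
    # bottom-up: prune the children first, then test this node
    if node.left is None and node.right is None:
        return node
    node.left = rep_prune(node.left)
    node.right = rep_prune(node.right)
    if node.leaf_error < leaves_error(node):
        node.left = node.right = None  # replace the subtree with a leaf
    return node
\end{verbatim}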
\subsubsection{PEP(Pessimistic-Error Pruning)}
For a leaf node with $N$ samples and $E$ errors, its error rate is $e = \frac{E+0.5}{N}$, where $0.5$ is called the penalty factor. For a subtree with $L$ leaf nodes, its error rate is
$$p = \frac{\sum_{i=1}^{L} E_{i} + 0.5L}{\sum_{i=1}^{L} N_{i}}.$$
Suppose that the number of misclassified samples in a subtree follows a binomial distribution $B(N, p)$, where $N = \sum_{i=1}^{L} N_{i}$ is the total number of samples. Then the expectation and standard deviation of the number of errors before pruning are:
$$E_{T} = N \cdot p = N \cdot \frac{\sum_{i=1}^{L} E_{i} + 0.5L}{\sum_{i=1}^{L} N_{i}} = \sum_{i=1}^{L} E_{i} + 0.5L,$$
$$\sigma = \sqrt{N \cdot p \cdot (1-p)}.$$
The expected number of errors after pruning is:
$$E_{t} = N \cdot e = N \cdot \frac{E+0.5}{N} = E + 0.5.$$
If $$E_{t} - E_{T} < \sigma,$$
then prune the subtree. Repeat this process from the top of the tree to the bottom.
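As a small numerical illustration (the numbers are invented for this example): suppose a subtree has $L = 3$ leaves with $(N_{i}, E_{i}) = (10, 1), (6, 2), (4, 1)$, and that collapsing it to a single leaf would produce $E = 6$ errors. Then
$$E_{T} = 4 + 0.5 \cdot 3 = 5.5, \quad p = \frac{5.5}{20} = 0.275, \quad \sigma = \sqrt{20 \cdot 0.275 \cdot 0.725} \approx 2.0,$$
and $E_{t} = 6.5$. Since $E_{t} - E_{T} = 1.0 < \sigma$, the subtree is pruned.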
\subsubsection{CCP(Cost-Complexity Pruning)}
$\alpha$ is a real number called the complexity parameter, defined as
$$\alpha = \frac{R(t) - R(T_{t})}{|T_{t}| - 1},$$ where $R(t)$ is the error of node $t$ after pruning (i.e., if the subtree were replaced by a leaf), $R(T_{t})$ is the error of the original subtree, and $|T_{t}|$ is the number of leaf nodes of the subtree. Intuitively, $\alpha$ measures the increase in error per removed leaf.\\
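As a brief illustration (invented numbers): if pruning a subtree with $|T_{t}| = 5$ leaves raises the error from $R(T_{t}) = 6$ to $R(t) = 10$ misclassified samples, then $\alpha = \frac{10 - 6}{5 - 1} = 1$ extra error per removed leaf.\\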
Step 1: Calculate the value of $\alpha$ for all the subtrees from the bottom to the top, and each time prune the subtree with the minimal $\alpha$. This yields a sequence of trees $\{T_{0}, T_{1}, \dots, T_{M}\}$, where $T_{0}$ is the complete tree and $T_{M}$ is a single root node.\\
Step 2: Pick the best tree from $\{T_{0}, T_{1}, \dots, T_{M}\}$ according to its performance on a validation set.\\
*Note that if we use CCP to prune the tree, we should split a validation set off from the given training set before training begins.
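A minimal sketch of this two-step procedure with scikit-learn (assumptions of this example: \texttt{X} and \texttt{y} stand for the full training features and labels; \texttt{cost\_complexity\_pruning\_path} and \texttt{ccp\_alpha} are scikit-learn's names for the CCP path and the complexity parameter):
\begin{verbatim}
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split a validation set off the training data before training begins.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2)

# Step 1: compute the sequence of effective alphas for the full tree.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X_tr, y_tr)

# Step 2: train one pruned tree per alpha, keep the best on validation.
trees = [DecisionTreeClassifier(ccp_alpha=a).fit(X_tr, y_tr)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_val, y_val))
\end{verbatim}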
\section{Approaches to try}
We adopted Pre-pruning by setting the maximum depth of the tree while building it. We considered three methods of Post-pruning: REP (Reduced-Error Pruning), PEP (Pessimistic-Error Pruning) and CCP (Cost-Complexity Pruning). Eventually we tried PEP and CCP; CCP worked better, so we chose CCP for the pruning.
\end{document}