---
title: "Designing a Classifier Evaluation Workflow"
date: "`r format(Sys.time(), '%b %d, %Y')`"
bibliography: library.bib
biblio-style: apalike
---
```{r, child="_setup.Rmd"}
```
The workflow for comparing classification algorithms starts with identifying potentially usable classifier methods; several candidate approaches are described later in this document. For each candidate classifier, the next steps are (a minimal code sketch follows the list):
* Establish randomized training and test datasets
* Train the classifier algorithm, then evaluate that model on a test dataset
* Calculate an accuracy score using the Gini index or a similar metric
* Select the best-performing classifiers and either refine them or operationalize them for new datasets
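A minimal, hedged sketch of these steps in R is shown below. The data frame `texts`, its `label` column, and the feature columns are placeholders for your own data, and the caret package is just one convenient way to handle the split, training, and scoring.
```{r, eval=FALSE}
library(caret)

# Hypothetical data frame: one row per document, a factor `label`,
# and numeric feature columns (e.g., term frequencies)
set.seed(42)
in_train  <- createDataPartition(texts$label, p = 0.8, list = FALSE)
train_set <- texts[in_train, ]
test_set  <- texts[-in_train, ]

# Train any candidate classifier via caret's common interface
fit <- train(label ~ ., data = train_set, method = "rpart")

# Evaluate on the held-out test set
preds <- predict(fit, newdata = test_set)
confusionMatrix(preds, test_set$label)  # accuracy, kappa, per-class stats
```
Reusing the same split and scoring code for every candidate method keeps the comparison between classifiers fair.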
####\
#Picking the Algorithm(s)
Algorithms approach text classification in several different ways.
The **bag of keywords approach** requires compiling a list of "key terms" that ties the content in question to a certain topic. If one or more qualifying keywords is present in a document, the document is assigned to that topic. One of the many problems with this approach is that identifying and organizing a list of terms (preferred, alternate, etc.) is labor intensive (much like simple pattern matching), and the natural ambiguity of language (one keyword can have several meanings) produces many false positives. These limitations make the bag of keywords approach both poorly scalable and less accurate.
**Rules-based approaches** use linguistic rules that capture all of the elements and attributes of a document to assign it to a category. Rules can be written manually or generated by automatic analysis and then validated manually (reducing manual effort by up to 90%). Rules can be understood and improved (unlike the black-box systems used by statistics-based algorithms). A rules-based approach is flexible, powerful (much more so than a bag of keywords), and easy to express. This approach works best when combined with semantic technology that can uncover deeper connections within the text (meaning, relevance, relationships between concepts, etc.).
**Statistical approaches** use a "training set" of documents covering the same topic to identify key features (usually words). The algorithm (Bayesian, LSA, etc.) calculates term frequencies and then builds implicit rules in order to classify other content. This approach has no understanding of meaning. In addition, systems using these types of text classification algorithms are essentially a "black box": no one can explain why specific terms are selected by the algorithm or how they are weighted. In the event the classification is incorrect, there is no straightforward way to modify a rule for better results.
Algorithms can also be separated into unsupervised and supervised methods.
**Unsupervised learning** uses training data that has not been labeled, classified, or categorized. The algorithm identifies commonalities in the training data and then classifies new texts based on the presence or absence of those commonalities.
Unsupervised learning algorithms appearing frequently in literature are:
* Clustering
+ Hierarchical clustering
+ k-means
+ Mixture models
* Principal component analysis (PCA)
**Supervised learning** uses labeled training data to define important text features. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can then be used to map new examples.
Supervised text classification algorithms appearing regularly in the literature are:
* Naive Bayes
+ Binary independent NB
+ Multinomial NB
* Linear Classifier
* Logistic Regression
* Support vector machines
+ Basic SVM
+ SVM multiple classes
+ SVM multiple kernels
* Decision trees
* Linear Discriminant Analysis
* Nearest neighbor
+ k-Nearest Neighbor
+ Weighted k-Nearest Neighbor
* Similarity Learning
* Relevance feedback
+ Rocchio
+ Rocchio in a query zone
* Ensemble methods
+ Bagging based Models
+ Stacking based Models
+ Boosting based Models
+ Random Forest Models
For reference, I assembled short excerpts from online resources and published articles describing most of these methods.
####\
##Naive Bayes
A classification technique based on Bayes' Theorem with an assumption of independence among predictors: a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. NB can be run on:
* Word count or n-gram count vectors
* Word or n-gram level TF-IDF vectors
* Character Level TF-IDF vectors
+ Would be useful for evaluating use of "?" and "="
Data are organized into two dictionaries: **corpus words** (each stemmed word and the number of its occurrences) and **class words** (each class and the list of stemmed words within it). The algorithm then uses these data structures to classify new text.
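As a hedged sketch (not the exact dictionary scheme described above), a Naive Bayes classifier can be fitted in R with the e1071 package. Here `dtm`, `y`, and `new_dtm` are placeholders for a document-term count matrix, a factor of class labels, and new documents built with the same vocabulary.
```{r, eval=FALSE}
library(e1071)

# Note: e1071::naiveBayes treats numeric predictors as Gaussian;
# multinomial variants live in other packages.
nb_fit <- naiveBayes(x = as.data.frame(as.matrix(dtm)), y = y)

# Classify new documents represented with the same vocabulary
nb_pred <- predict(nb_fit, newdata = as.data.frame(as.matrix(new_dtm)))
table(nb_pred)
```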
####\
##Linear Classifier
Linear classifiers are among the fundamental supervised machine-learning methods due to their relative simplicity and well-known properties.
The goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.
####\
##Logistic Regression
A logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function. LR can be run on:
* Word count or n-gram count vectors
* Word or n-gram level TF-IDF vectors
* Character Level TF-IDF vectors
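A minimal sketch with base R's `glm()` is shown below; `train_df` and `test_df` are placeholders for data frames holding numeric feature columns (e.g., TF-IDF values) plus a 0/1 outcome `label`.
```{r, eval=FALSE}
# Binary logistic regression with base R
lr_fit <- glm(label ~ ., data = train_df, family = binomial(link = "logit"))

# Predicted probabilities on held-out data, thresholded at 0.5
lr_prob <- predict(lr_fit, newdata = test_df, type = "response")
lr_pred <- ifelse(lr_prob > 0.5, 1, 0)
```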
Resources for logistic regression:
* https://www.analyticsvidhya.com/blog/2015/10/basics-logistic-regression/
* Logistic regression in R: https://www.saedsayad.com/logistic_regression.htm
####\
##Support Vector Machine (SVM) Model
This is a supervised machine learning algorithm that can be used for both classification and regression problems. The model finds the hyperplane (or line) that best separates the two classes. The only example I found of SVM was on n-gram level TF-IDF vectors. Support vector machines can be:
* Basic SVM
* SVM multiple classes
* SVM multiple kernels
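A hedged sketch of a basic SVM using the e1071 package follows; `x_train`, `y_train`, and `x_test` are placeholders for your own feature matrices and labels.
```{r, eval=FALSE}
library(e1071)

# Linear-kernel SVM on a (dense) feature matrix
svm_fit <- svm(x = x_train, y = y_train, kernel = "linear", cost = 1)

# Other kernels ("radial", "polynomial") swap in via the `kernel` argument
svm_pred <- predict(svm_fit, newdata = x_test)
```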
####\
##Decision Trees
Flowchart-like structures in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.
A decision tree consists of three types of nodes:
* Decision nodes – typically represented by squares
* Chance nodes – typically represented by circles
* End nodes – typically represented by triangles
Decision tree learning is the construction of such a tree from class-labeled training tuples. The topmost node in a tree is the root node.
The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables, with edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables along the path from the root to the leaf.
A decision tree is a simple representation for classifying examples. For this section, assume that all of the input features have finite discrete domains, and there is a single target feature called the "classification". Each element of the domain of the classification is called a class. A decision tree or a classification tree is a tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with an input feature are labeled with each of the possible values of the target or output feature or the arc leads to a subordinate decision node on a different input feature. Each leaf of the tree is labeled with a class or a probability distribution over the classes.
A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data.
In data mining, decision trees can be described also as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data.
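A hedged sketch of recursive partitioning with the rpart package is shown below; `train_df` (with a factor outcome `label`) and `test_df` are placeholders for your own data.
```{r, eval=FALSE}
library(rpart)
library(rpart.plot)   # optional, for plotting the tree

# Grow a classification tree by recursive partitioning
tree_fit <- rpart(label ~ ., data = train_df, method = "class")

rpart.plot(tree_fit)                                       # inspect the learned rules
tree_pred <- predict(tree_fit, newdata = test_df, type = "class")
```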
####\
##Linear Discriminant Analysis
LDA works when the measurements made on the independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis. The statistical system R includes the packages MASS, ade4, ca, vegan, ExPosition, and FactoMineR, which perform correspondence analysis and multiple correspondence analysis.
As the name indicates, discriminant correspondence analysis (DCA) is an extension of discriminant analysis (DA) and correspondence analysis (CA). Like discriminant analysis, the goal of DCA is to categorize observations in pre-defined groups, and like correspondence analysis, it is used with nominal variables. The main idea behind DCA is to represent each group by the sum of its observations and to perform a simple CA on the groups-by-variables matrix. The original observations are then projected as supplementary elements and each observation is assigned to the closest group. The comparison between the a priori and the a posteriori classifications can be used to assess the quality of the discrimination. A similar procedure can be used to assign new observations to categories. The stability of the analysis can be evaluated using cross-validation techniques such as jackknifing or bootstrapping.
For the R model: https://pbil.univ-lyon1.fr/ade4/ade4-html/discrimin.coa.html
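A minimal sketch of classical LDA on continuous predictors with MASS::lda follows; `train_df` (with factor `label`) and `test_df` are placeholders. The categorical-variable variant described above is implemented by `discrimin.coa()` in the ade4 package linked above.
```{r, eval=FALSE}
library(MASS)

# Classical linear discriminant analysis on continuous predictors
lda_fit  <- lda(label ~ ., data = train_df)
lda_pred <- predict(lda_fit, newdata = test_df)$class

# For categorical predictors, see ade4::discrimin.coa() (discriminant
# correspondence analysis), per the link above.
```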
####\
##k-Nearest Neighbors
k-NN is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
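A hedged sketch with the class package follows; `x_train`, `x_test`, and `y_train` are placeholders for your own feature matrices and labels.
```{r, eval=FALSE}
library(class)

# k-nearest neighbours: there is no explicit training step; the labelled
# examples are passed alongside the test set at prediction time.
knn_pred <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
```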
####\
##Similarity Learning
Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn from examples a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification.
####\
##Relevance Feedback/Rocchio Algorithm
Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.
The Rocchio algorithm is based on a method of relevance feedback. Like many other retrieval systems, the Rocchio feedback approach was developed using the Vector Space Model. The algorithm is based on the assumption that most users have a general conception of which documents should be denoted as relevant or non-relevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and non-relevant documents as a means of increasing the search engine's recall, and possibly the precision as well.
For more:
https://nlp.stanford.edu/IR-book/html/htmledition/rocchio-classification-1.html
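For classification (as opposed to query refinement), the Rocchio approach reduces to a nearest-centroid rule in vector space. A minimal base-R sketch is shown below; `tfidf` (a documents-by-terms TF-IDF matrix), `y` (a factor of class labels), and `new_doc` (one new TF-IDF vector) are placeholders.
```{r, eval=FALSE}
# Compute one centroid (prototype vector) per class
centroids <- sapply(levels(y),
                    function(cl) colMeans(tfidf[y == cl, , drop = FALSE]))

# Assign the new document to the class whose centroid is closest (Euclidean)
dists <- apply(centroids, 2, function(mu) sqrt(sum((new_doc - mu)^2)))
predicted_class <- names(which.min(dists))
```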
####\
##Ensemble Methods
The excerpt below nicely describes how ensemble methods work. It is adapted from *Ensemble Methods Explained in Simple English*, by Tavish Srivastava, posted August 2, 2015.
Ensemble modeling is a powerful way to improve the performance of your model. It usually pays off to apply ensemble learning over and above various models you might be building.
Example: I want to invest in a company XYZ, but I am not sure about its performance. So I look for advice on whether the stock price will increase by more than 6% per annum. I decide to approach various experts with diverse domain experience:
>1. Employee of Company XYZ: This person knows the internal functioning of the company and has insider information about the firm. But he lacks a broader perspective on how competitors are innovating, how the technology is evolving, and what the impact of this evolution will be on Company XYZ's product. In the past, he has been right 70% of the time.
>2. Financial Advisor of Company XYZ: This person has a broader perspective on how the company's strategy will fare in this competitive environment. However, he lacks a view of how the company's internal policies are faring. In the past, he has been right 75% of the time.
>3. Stock Market Trader: This person has observed the company's stock price over the past three years. He knows the seasonality trends and how the overall market is performing, and he has developed a strong intuition about how stocks might vary over time. In the past, he has been right 70% of the time.
>4. Employee of a competitor: This person knows the internal functioning of the competitor firms and is aware of certain changes which are yet to be introduced. He lacks sight of the company in focus and of the external factors that link the competitor's growth to that of the company in question. In the past, he has been right 60% of the time.
>5. Market Research team in the same segment: This team analyzes customer preference for Company XYZ's product over others and how this is changing with time. Because they deal with the customer side, they are unaware of the changes Company XYZ will make in pursuit of its own goals. In the past, they have been right 75% of the time.
>6. Social Media Expert: This person can help us understand how Company XYZ has positioned its products in the market and how customer sentiment toward the company is changing over time. He is unaware of any details beyond digital marketing. In the past, he has been right 65% of the time.
Given the broad spectrum of access we have, we can probably combine all the information and make an informed decision. In a scenario where all 6 experts/teams confirm that it is a good decision (assuming all the predictions are independent of each other), we will get a combined accuracy rate of:
```
1 - 30%*25%*30%*40%*25%*35%
= 1 - 0.0007875
= 0.9992125, i.e. 99.92125%
```
Assumption: The assumption used here, that all the predictions are completely independent, is slightly extreme, as they are likely to be correlated. Even so, it shows how confident we can become by combining various predictions together.
Let us now change the scenario slightly. This time we have 6 experts, all of whom are employees of Company XYZ working in the same division. Each has a 70% propensity to advise correctly.
>*If we combine all this advice together, can we still raise our confidence above 99%?*
Obviously not, as all the predictions are based on a very similar set of information. They are certain to be influenced by the same information, and the only variation in their advice would be due to their personal opinions and the facts they have collected about the firm.
Ensemble learning is the art of combining individual models to improve the stability and predictive power of the overall model.
####\
###Error Sources in Ensemble Learning (Variance vs. Bias)
The error emerging from any model can be broken down into three components mathematically.
**Error of a model.** Why is this important in the current context? To understand what really goes on behind an ensemble model, we first need to understand what causes error in a model.
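For squared-error loss, the standard decomposition (presumably the three components the excerpt refers to) is:
$$\text{Err}(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
The first two terms are defined below; the irreducible error is the noise floor that no model can remove.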
**Bias error** quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model that keeps missing important trends.
**Variance** quantifies how much the predictions made on the same observations differ from each other. A high-variance model will over-fit your training population and perform badly on any observation beyond the training set.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a particular point. As you continue to make your model more complex, you end up **over-fitting your model**, and hence it will start suffering from high variance.
A champion model should maintain a balance between these two types of errors. This is known as the **trade-off management of bias and variance errors.** Ensemble learning is one way to perform this trade-off analysis.
####\
###Commonly used Ensemble learning techniques
**Bagging** trains similar learners on small sample populations and then takes the mean of all the predictions. In generalized bagging, you can use different learners on different populations. As you would expect, this helps to reduce the variance error.
**Boosting** is an iterative technique which adjusts the weight of each observation based on the previous classification. If an observation was classified incorrectly, boosting increases its weight, and vice versa. Boosting generally decreases the bias error and builds strong predictive models; however, boosted models may sometimes over-fit the training data.
**Stacking** is a very interesting way of combining models: a learner is used to combine the output of different learners. This can decrease either the bias or the variance error, depending on the combining learner used.
####\
###Random Forest Models
Random Forest models are a type of ensemble model, specifically a bagging model, and belong to the tree-based model family. RF can be run on:
* Word count or n-gram count vectors
* Word level TF-IDF vectors
Resources for Random Forest:
* https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/
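A hedged sketch with the randomForest package follows; `train_df` (with factor `label`) and `test_df` are placeholders for count or TF-IDF features.
```{r, eval=FALSE}
library(randomForest)

# Bagged ensemble of decision trees
rf_fit  <- randomForest(label ~ ., data = train_df, ntree = 500)
rf_pred <- predict(rf_fit, newdata = test_df)
importance(rf_fit)   # which features the forest relied on most
```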
####\
###Extreme Gradient Boosting Model
Boosting models are another type of tree-based ensemble model. Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms that convert weak learners into strong ones. A weak learner is defined as a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). XGB can be run on:
* Word count or n-gram count vectors
* Word level TF-IDF vectors
* Character Level TF-IDF vectors
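A minimal sketch with the xgboost package is shown below; `x_train`/`x_test` are placeholder numeric feature matrices and `y_train` a 0/1 label vector, and the tuning values are illustrative only.
```{r, eval=FALSE}
library(xgboost)

# Gradient-boosted trees for a binary outcome
xgb_fit <- xgboost(data = as.matrix(x_train), label = y_train,
                   nrounds = 100, objective = "binary:logistic",
                   max_depth = 4, eta = 0.1, verbose = 0)

xgb_prob <- predict(xgb_fit, as.matrix(x_test))
xgb_pred <- as.numeric(xgb_prob > 0.5)
```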
####\
##Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.
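A hedged base-R sketch follows; `tfidf` is a placeholder documents-by-terms matrix and the cluster count is arbitrary.
```{r, eval=FALSE}
# Agglomerative hierarchical clustering on document feature vectors
d  <- dist(tfidf, method = "euclidean")   # pairwise document distances
hc <- hclust(d, method = "ward.D2")

plot(hc)                        # dendrogram of the hierarchy
clusters <- cutree(hc, k = 5)   # cut the tree into, say, 5 clusters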
####\
##k-Means Clustering
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both k-means and Gaussian mixture modeling. They both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular machine learning technique for classification that is often confused with k-means because of the name. Applying the 1-nearest neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm.
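A minimal base-R sketch follows, again with the placeholder matrix `tfidf` and an arbitrary choice of 5 clusters.
```{r, eval=FALSE}
# k-means with several random restarts
set.seed(42)
km <- kmeans(tfidf, centers = 5, nstart = 25)

table(km$cluster)    # how many documents fell into each cluster
km$centers           # cluster centroids in feature space
```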
####\
##Mixture Model
In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.
Topics in a document: Assume that a document is composed of N different words from a total vocabulary of size V, where each word corresponds to one of K possible topics. The distribution of such words could be modelled as a mixture of K different V-dimensional categorical distributions. A model of this sort is commonly termed a topic model. Note that expectation maximization applied to such a model will typically fail to produce realistic results, due (among other things) to the excessive number of parameters. Some sorts of additional assumptions are typically necessary to get good results.
A prior distribution is placed over the parameters describing the topic distributions, using a Dirichlet distribution with a concentration parameter that is set significantly below 1, so as to encourage sparse distributions (where only a small number of words have significantly non-zero probabilities).
Some sort of additional constraint is placed over the topic identities of words, to take advantage of natural clustering.
For example, a Markov chain could be placed on the topic identities (i.e., the latent variables specifying the mixture component of each observation), corresponding to the fact that nearby words belong to similar topics. (This results in a hidden Markov model, specifically one where a prior distribution is placed over state transitions that favors transitions that stay in the same state.)
Another possibility is the latent Dirichlet allocation model, which divides up the words into D different documents and assumes that in each document only a small number of topics occur with any frequency.
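Fitting the topic-model variants above usually relies on specialized packages, but as a minimal, hedged illustration of a mixture model in R, a Gaussian mixture can be fitted by EM with the mclust package. The matrix `x` is a placeholder, e.g. documents projected onto a few principal components.
```{r, eval=FALSE}
library(mclust)

# Gaussian mixture model fitted by EM; the number of components
# is chosen automatically by BIC
gmm <- Mclust(x)

summary(gmm)               # chosen model and number of components
head(gmm$classification)   # most probable subpopulation for each observation
```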
####\
##Principal Component Analysis
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is often used to visualize genetic distance and relatedness between populations. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or by singular value decomposition of a data matrix, usually after a normalization step on the initial data. The normalization of each attribute consists of mean centering (subtracting each variable's measured mean from its data values so that the variable's empirical mean is zero) and, possibly, normalizing each variable's variance to make it equal to 1.

The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score). If component scores are standardized to unit variance, the loadings must contain the data variance (and that is the magnitude of the eigenvalues). If component scores are not standardized (and therefore contain the data variance), the loadings must be unit-scaled ("normalized"); these weights are called eigenvectors, and they are the cosines of the orthogonal rotation of variables into principal components and back.
PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-dimensional picture, a projection of this object when viewed from its most informative viewpoint. This is done by using only the first few principal components so that the dimensionality of the transformed data is reduced.
PCA is closely related to factor analysis. Factor analysis typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.
PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.
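A minimal sketch with base R's `prcomp()` follows; `x` is a placeholder numeric matrix (observations by variables).
```{r, eval=FALSE}
# PCA on centered, scaled variables
pca <- prcomp(x, center = TRUE, scale. = TRUE)

summary(pca)            # variance explained by each component
head(pca$x[, 1:2])      # component scores on the first two PCs
pca$rotation[, 1:2]     # loadings (eigenvectors) for the first two PCs
```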
####\