---
title: "Natural Language Analysis"
author: '[Ryo®, Eng Lian Hu](http://englianhu.wordpress.com) <img src=''figure/me.JPG'' width=''24''> TonyStark®'
date: "9/22/2015"
output:
  html_document:
    fig_height: 3
    fig_width: 5
    highlight: haddock
    theme: cerulean
    toc: yes
  pdf_document:
    fig_height: 3
    fig_width: 5
    highlight: haddock
    toc: yes
---
This is a natural language analysis on matching soccer team names, carried out while doing research on [Betting Strategy and Model Validation](https://github.com/Scibrokes/Betting-Strategy-and-Model-Validation/blob/master/Betting%20Strategy%20and%20Model%20Validation.Rmd). The purpose of writing these functions is to make it easier to scrape team names for further calculation in the future, and so reduce my workload.
The subject relates to the final course, [Data Science Capstone](https://www.coursera.org/course/dsscapstone), on Coursera (Johns Hopkins University), which I have failed a few times and will retake this coming October 2015 (next month).
Note that the `echo=FALSE` and `include=FALSE` parameters were added to the code chunks below to prevent printing of the R code that generated the plots/tables. However, feel free to view the source code via [Natural Language Analysis.Rmd](https://github.com/Scibrokes/Betting-Strategy-and-Model-Validation/Natural Language Analysis.Rmd).
## 1. Setup Options, Loading Required Libraries and Preparing Environment
Set up the `knitr` options and load the required libraries.
```{r load-packages, include=FALSE}
## Write the chunk header as ```{r load-packages, include=FALSE} to hide a particular chunk.
## Set option to omit all warnings.
options(warn=-1)
## Load the packages.
if(!'devtools' %in% installed.packages()){
  install.packages('devtools')}
if(!'BBmisc' %in% installed.packages()){
  install.packages('BBmisc')}
suppressPackageStartupMessages(library('BBmisc'))
pkgs <- c('devtools','stringr','stringi','reshape','reshape2','data.table','sparkline','DT','plyr','dplyr','magrittr','foreach','doParallel','rmarkdown','tidyr','gtable','grid','gridExtra','pander','stringdist','knitr','lubridate','d3Network','networkD3')
suppressAll(lib(pkgs)); rm(pkgs)
```
Create a parallel computing cluster and set up the supporting options.
```{r setting, include=FALSE}
## Preparing the parallel cluster using the cores
doParallel::registerDoParallel(cores = 16)
#' @BiocParallel::register(MulticoreParam(workers=2))
## Make pretty table
## http://kbroman.org/knitr_knutshell/pages/figs_tables.html
## https://cran.r-project.org/web/packages/htmlTable/vignettes/tables.html
##
## knitr configuration
opts_knit$set(progress=FALSE)
opts_chunk$set(echo=TRUE, message=FALSE, tidy=TRUE, comment=NA, fig.path='figure/', fig.keep='high', fig.width=10, fig.height=6, fig.align="center", scrolling='Auto')
## Table width setting
panderOptions('table.split.table', Inf)
```
## 2. Read and Process the Dataset
Read the dataset of worldwide soccer matches from 2011 to 2015, provided by a British betting consultancy referred to here as firm A.
```{r read-datasetA, echo=FALSE, results='asis'}
## Read the datasets
## Refer to **Testing efficiency of coding.Rmd** at chunk `get-data-summary-table-2.1`
source(paste0(getwd(),'/function/readfirmDatasets.R'))
years <- seq(2011,2015)
mbase <- readfirmDatasets(years=years)
dateID <- sort(unique(mbase$datasets$Date))
#'@ pander(head(mbase$datasets)) ## exactly same layout with kable(x)
## Example of the dataset used in the research paper.
## The full data is heavy (it generated a 79MB HTML file against a 5MB limit), so only the head section is shown here.
mbase$datasets %>% head %>% datatable(.,caption="Table 2.1 : Soccer Staking Data from Firm A",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
*table 2.1* `r paste(dim(mbase$datasets), collapse=' x ')`
Because the dataset is very big (`r paste(dim(mbase$datasets), collapse=' x ')`), the webpage kept loading and could not open, so I subset only a few rows of the data frame here.
Read the dataset of worldwide soccer matches from 2011 to 2015, scraped from the [spbo livescore website](http://www.spbo.com/eend0.htm).
```{r read-datasetB, echo=FALSE, results='asis'}
## Load the scraped spbo livescore datasets.
source(paste0(getwd(),'/function/readSPBO2.R'))
spboData <- readSPBO2(dateID=dateID, parallel=FALSE)
## Example of the scraped livescore dataset used in the research paper.
## The full data is heavy (it generated a 79MB HTML file against a 5MB limit), so only the head section is shown here.
spboData %>% head %>% datatable(.,caption="Table 2.2 : SPBO Soccer Data",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
*table 2.2* `r paste(dim(spboData), collapse=' x ')`
Because the dataset is very big (`r paste(dim(spboData), collapse=' x ')`), the webpage kept loading and could not open, so I subset only a few rows of the data frame here.
## 3. Matching the team names
### 3.1 Matching Duplicated Teams' Name
To match a string, we can first apply `match()` or `%in%` to match the team names. Since strings that differ only in capitalisation are not duplicates in R, while in real life they refer to exactly the same team, I apply `tolower()` before matching.
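As a minimal sketch (the team names below are hypothetical, purely for illustration), case-insensitive exact matching looks like this:
```{r tolower-sketch, eval=FALSE}
## Case-insensitive exact matching (hypothetical names, not run).
a <- c('Chelsea', 'Arsenal', 'MAN UTD')
b <- c('chelsea', 'Man Utd', 'Liverpool')
a[tolower(a) %in% tolower(b)]  # "Chelsea" "MAN UTD"
match(tolower(a), tolower(b))  # 1 NA 2 : positions in b, NA when no match
```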
```{r matching-01, echo=FALSE, results='asis'}
## Get and filter the teams' name
## Filter and drop the first-half, corners and other games
teamID <- sort(unique(c(as.character(mbase$datasets$Home), as.character(mbase$datasets$Away))))
teamID <- teamID[!teamID %in% mbase$others]
spboTeam <- sort(c(as.vector(spboData$Home), as.vector(spboData$Away)))
spboTeamID <- sort(unique(spboTeam))
df1 <- data.frame(team=teamID[tolower(teamID) %in% tolower(spboTeamID)], spbo=spboTeamID[tolower(spboTeamID) %in% tolower(teamID)]) %>% tbl_df %>% mutate(team=as.character(team),spbo=as.character(spbo),pass=ifelse(team==spbo,'Duplicated','Capital Letters')) %>% arrange(pass)
row.names(df1) <- NULL
rbind(df1 %>% filter(pass=='Duplicated') %>% head(3),df1 %>% filter(pass=='Capital Letters') %>% head(3)) %>% kable(.,caption='Table 3.1.1 : Exactly match and capital letters difference.')
```
*table 3.1.1* `r paste(dim(df1), collapse=' x ')`
### 3.2 Apply amatch() and stringdist()
One concern is that a second team's name is normally exactly the same as the first team's, only with II, Reserves, etc. appended; for example, *Mainz 05* is a first team, not a fifth reserve team. Scraping more soccer match data makes the matching more accurate: if we only scraped one day of data, how could we match the first team if, say, only the Chelsea reserve team played on that particular date?
Another concern is that a first team named *TSV 1860 Munchen* may have its second/U19 teams termed *1860 Munchen II*, *1860 Munchen U19*, etc. The team name *Lincoln* is supposed to match *Lincoln City* but not *Lincoln United*, while *Lincoln City* will be a closer approximate match to a string like *Lincoln Xxitxx* than to plain *Lincoln*.
Besides, if I set the matching priority to the kick-off date first and the team names later, there is the concern of postponed staked matches (postponed after firm A placed its bets; sometimes firm A places bets on the early market, or the kick-off date is accidentally changed/postponed before kick-off due to snow/downpour/etc.).
I load the [`stringdist`](https://cran.r-project.org/web/packages/stringdist/index.html) package and apply its `amatch()` function for algorithmic matching of the team names. The available distance methods are listed below (a quick sketch of the raw distances follows the list).
* 01. [osa](https://en.wikipedia.org/wiki/Optimal_string_alignment) - Optimal string alignment (restricted Damerau-Levenshtein distance).
* 02. [lv](https://en.wikipedia.org/wiki/Levenshtein_distance) - Levenshtein distance (as in R's native `adist`).
* 03. [dl](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) - Full Damerau-Levenshtein distance.
* 04. [hamming](https://en.wikipedia.org/wiki/Hamming_distance) - Hamming distance (a and b must have the same number of characters).
* 05. [lcs](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem) - Longest common substring distance.
* 06. [qgram](https://en.wikipedia.org/wiki/N-gram) - q-gram distance.
* 07. cosine - Cosine distance between q-gram profiles.
* 08. [jaccard](https://en.wikipedia.org/wiki/Jaccard_index) - Jaccard distance between q-gram profiles.
* 09. [jw](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) - Jaro, or Jaro-Winkler, distance.
* 10. [soundex](https://en.wikipedia.org/wiki/Soundex) - Distance based on soundex encoding.
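As a quick illustrative sketch (not part of the analysis itself; `hamming` is omitted because the two strings differ in length), the raw distances each method returns can be inspected directly with `stringdist()`:
```{r distance-sketch, eval=FALSE}
## Raw distances between two of the team names analysed below.
library(stringdist)
sapply(c('osa','lv','dl','lcs','qgram','cosine','jaccard','jw','soundex'),
       function(m) stringdist('Lincoln', 'Lincoln City', method = m))
## e.g. 'lv' counts the 5 characters of ' City' as insertions,
## while 'jw' returns a normalised value in [0, 1].
```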
Let's take an example below.
```{r matching-02A, echo=FALSE, results='asis'}
## Apply stringdist() to match the most approximate matching team names
method=c('osa','lv','dl','hamming','lcs','qgram','cosine','jaccard','jw','soundex')
#'@ levDist=0.1 # The default maxDist inside amatch() is 0.1.
strList <- function(team_a, team_b, method, levDist=NULL){
  unlist(llply(as.list(method), function(x){
    ## Validate the single method handled in this iteration.
    if(!x %in% c('osa','lv','dl','hamming','lcs','qgram','cosine','jaccard','jw','soundex')){
      stop('Please enter a value within "osa","lv","dl","hamming","lcs","qgram","cosine","jaccard","jw","soundex"!')
    }
    if(is.null(levDist)){
      ## Default to the smallest observed distance for this method.
      levDist <- min(stringdist(team_a, team_b, method=x))
    }else if(!is.numeric(levDist)){
      stop('Please enter a numeric value or just keep the default NULL value for levDist!')
    }
    team_b[amatch(team_a, team_b, method=x, maxDist=levDist)]
  },.parallel=FALSE))
}
## Check which team names include the string 'Lincoln'.
teamID[grep('Lincoln',teamID)]
lst <- list(uniqueID_0.1=strList('Lincoln',spboTeamID,method=method,levDist=0.1),
allElems_0.1=strList('Lincoln',spboTeam,method=method,levDist=0.1),
uniqueID_0.5=strList('Lincoln',spboTeamID,method=method,levDist=0.5),
allElems_0.5=strList('Lincoln',spboTeam,method=method,levDist=0.5),
uniqueID_1.0=strList('Lincoln',spboTeamID,method=method,levDist=1.0),
allElems_1.0=strList('Lincoln',spboTeam,method=method,levDist=1.0),
uniqueID_2.0=strList('Lincoln',spboTeamID,method=method,levDist=2.0),
allElems_2.0=strList('Lincoln',spboTeam,method=method,levDist=2.0),
uniqueID_Inf=strList('Lincoln',spboTeamID,method=method,levDist=Inf),
allElems_Inf=strList('Lincoln',spboTeam,method=method,levDist=Inf))
len <- sapply(lst,length)
n <- max(len)
len <- n-len
df2A <- mapply(function(x,y) c(x, rep(NA, y)), lst, len) %>% data.frame %>% mutate(Matching1='Lincoln',Matching2='Lincoln City',method=method) %>% select(Matching1,method,uniqueID_0.1,allElems_0.1,uniqueID_0.5,allElems_0.5,uniqueID_1.0,allElems_1.0,uniqueID_2.0,allElems_2.0,uniqueID_Inf,allElems_Inf) %>% tbl_df %>% mutate_each(funs(as.character))
rm(lst,len,n)
df2A %>% datatable(.,caption="Table 3.2.1 : StringDist Matching 'Lincoln' ",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
*table 3.2.1* `r paste(dim(df2A), collapse=' x ')`
Above, I simply matched the keyword `Lincoln` against the home and away team names obtained from firm A; next, I repeat the comparison with `Lincoln City`.
```{r matching-02B, echo=FALSE, results='asis'}
## We get 'Lincoln City' from teamID[grep('Lincoln',teamID)]
lst <- list(uniqueID_0.1=strList('Lincoln City',spboTeamID,method=method,levDist=0.1),
allElems_0.1=strList('Lincoln City',spboTeam,method=method,levDist=0.1),
uniqueID_0.5=strList('Lincoln City',spboTeamID,method=method,levDist=0.5),
allElems_0.5=strList('Lincoln City',spboTeam,method=method,levDist=0.5),
uniqueID_1.0=strList('Lincoln City',spboTeamID,method=method,levDist=1.0),
allElems_1.0=strList('Lincoln City',spboTeam,method=method,levDist=1.0),
uniqueID_2.0=strList('Lincoln City',spboTeamID,method=method,levDist=2.0),
allElems_2.0=strList('Lincoln City',spboTeam,method=method,levDist=2.0),
uniqueID_Inf=strList('Lincoln City',spboTeamID,method=method,levDist=Inf),
allElems_Inf=strList('Lincoln City',spboTeam,method=method,levDist=Inf))
len <- sapply(lst,length)
n <- max(len)
len <- n-len
df2B <- mapply(function(x,y) c(x, rep(NA, y)), lst, len) %>% data.frame %>% mutate(Matching1='Lincoln',Matching2='Lincoln City',method=method) %>% select(Matching2,method,uniqueID_0.1,allElems_0.1,uniqueID_0.5,allElems_0.5,uniqueID_1.0,allElems_1.0,uniqueID_2.0,allElems_2.0,uniqueID_Inf,allElems_Inf) %>% tbl_df %>% mutate_each(funs(as.character))
rm(lst,len,n)
df2B %>% datatable(.,caption="Table 3.2.2 : StringDist Matching 'Lincoln City' ",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
*table 3.2.2* `r paste(dim(df2B), collapse=' x ')`
In the two tables above, I apply `stringdist` with maxDist set to the default value `0.1`, then `0.5`, `1.0`, `2.0` and `Inf`, across all 10 methods listed in section 3.2. Since I don't pretend to know exactly how the `stringdist()` algorithm matches strings, I try both the unique team names and all elements (without filtering for uniqueness).
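As a small illustrative sketch of this behaviour (hypothetical candidate set, not drawn from the tables above), `maxDist` controls whether `amatch()` returns the nearest candidate or `NA`:
```{r amatch-sketch, eval=FALSE}
library(stringdist)
## 'lv' distance from 'Lincoln' to 'Lincoln City' is 5 (appending ' City'),
## so a tight maxDist yields NA while Inf always returns the nearest candidate.
amatch('Lincoln', c('Lincoln City','Lincoln United'), method='lv', maxDist=1)    # NA
amatch('Lincoln', c('Lincoln City','Lincoln United'), method='lv', maxDist=Inf)  # 1
```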
### 3.3 Apply agrep()
I also tried simply applying the `agrep()` function to partially match the team names.
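For reference, a minimal sketch (hypothetical names) of how `agrep()` behaves: it performs approximate substring matching, so a short pattern matches any longer name that contains it.
```{r agrep-sketch, eval=FALSE}
## Indices of names approximately containing the pattern.
agrep('Lincoln', c('Lincoln City', 'Lincoln United', 'Liverpool'))  # 1 2
## value=TRUE returns the matched strings instead of indices.
agrep('Lincoln City', c('Lincoln', 'Lincoln City Women'), value=TRUE)  # "Lincoln City Women"
```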
```{r matching-03, echo=FALSE, results='asis'}
## Filter spboTeamID without other games and 1st-half team names
## Apply agrep() to match the most approximate matching team names
## http://stackoverflow.com/questions/21103410/irregular-list-of-lists-to-dataframe
lst <- list(team1=sort(unique(c(teamID[agrep('Lincoln',teamID)]))),spbo1=sort(unique(c(spboTeamID[agrep('Lincoln',spboTeamID)]))), team2=sort(unique(c(teamID[agrep('Lincoln City',teamID)]))),spbo2=sort(unique(c(spboTeamID[agrep('Lincoln City',spboTeamID)]))))
len <- sapply(lst,length)
n <- max(len)
len <- n-len
df3 <- mapply(function(x,y) c(x, rep(NA, y)), lst, len) %>% data.frame %>% mutate(Matching1='Lincoln',Matching2='Lincoln City') %>% select(Matching1,team1,spbo1,Matching2,team2,spbo2) %>% tbl_df %>% mutate_each(funs(as.character))
rm(lst,len,n)
df3 %>% kable(.,caption='Table 3.3.1 : Simply apply agrep().')
```
*table 3.3.1* `r paste(dim(df3), collapse=' x ')`
### 3.4 Apply partialMatch()
Secondly, there is an article, [Merging Data Sets Based on Partially Matched Data Elements](http://www.r-bloggers.com/merging-data-sets-based-on-partially-matched-data-elements/), which applies subsetting to partially match the team names.
```{r matching-04A, echo=FALSE, results='asis'}
## Load the partialMatch() function
source(paste0(getwd(),'/function/partialMatch.R'))
df4 <- partialMatch(iconv(teamID), spboTeamID)
#'@ rbind(df4 %>% filter(pass=='Duplicate') %>% head(3),df4 %>% filter(pass=='Partial') %>% head(3)) %>% kable
df4 %>% datatable(.,caption="Table 3.4.1 : Partial Matching Teams' Name.",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
The table below displays a few matched team names which are not accurate.
```{r matching-04B, echo=FALSE, results='asis'}
df4 %>% filter((teamID %in% teamID[grep('Women|U[0-9]{2}',teamID)])|(spboID %in% spboID[grep('Women|U[0-9]{2}',spboID)]), Match=='Partial') %>% head %>% kable(.,caption='Table 3.4.2 : Inaccuracy of Matching Result.')
```
*table 3.4.2* `r paste(dim(df4), collapse=' x ')`
From the table above we can see that the team `AaB Aalborg` from firm A is matched with `AaB Aalborg U17` from the livescore website, and `Airdrie United` is matched to `Airdrie United Women`, although these are totally different teams; this would lead a researcher to calculate wrong predictive figures for investment.
In order to maximise the number of soccer matches (observations) available for the research, I separate the name matching into a few steps, using `split()` and cross-matching against each other to rearrange the data prior to starting the algorithmic matching function in **section 4 Reprocess the Data** (see the sketch below).
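A minimal sketch of the split-and-cross-match idea (the grouping key here, the first word of each name, is my own illustrative assumption rather than the exact rule used later):
```{r split-sketch, eval=FALSE}
## Group names by their first word so approximate matching only
## compares candidates within the same group.
nm <- c('Lincoln City', 'Lincoln United', 'Mainz 05', '1860 Munchen II')
grp <- split(nm, sapply(strsplit(nm, ' '), `[`, 1))
grp$Lincoln  # candidates to cross-match against any 'Lincoln ...' query
```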
## 4. Reprocess the Data
### 4.1 Decision Tree
```{r matching-05A, echo=FALSE, results='asis'}
source(paste0(getwd(),'/function/makeList.R'))
## The elements are few but the lists are numerous, so setting parallel=FALSE is a few minutes faster.
dfm <- makeList(mbase, spboData, levDist=0.1, parallel=FALSE)
```
I would like to plot a hierarchical chart of the team-name splits used for `agrep`. However, the `rpart` and `randomForest` packages require numeric data, which does not suit this kind of diagram, so here I plot two dynamic network graphs instead.
```{r decission-tree-A, echo=FALSE, results='asis'}
## `rpart` and `randomForest` could plot a decision tree showing how the team names are matched, but they require numeric data and only produce static graphs.
## Here I use `d3Network`/`networkD3` instead.
d3data <- dfm$partialData
#'@ d3data %>% d3SimpleNetwork(.,width='automatic', height=400) ## https://github.com/christophergandrud/d3Network/issues/31
#'@ d3data %>% as.list %>% d3Tree ## https://github.com/christophergandrud/d3Network/issues/31
d3data[1:2] %>% simpleNetwork(.,width=400, height=400)
```
Since the `simpleNetwork()` function only applies to a two-column dataset, I split the data into two graphs.
```{r decission-tree-B, echo=FALSE, results='asis'}
d3data[3:4] %>% simpleNetwork(.,width='auto', height=800)
```
### 4.2 Filtering and Reprocess the Data
Prior to starting the algorithmic string matching, I use an idea from [Merging Data Sets Based on Partially Matched Data Elements](http://www.r-bloggers.com/merging-data-sets-based-on-partially-matched-data-elements/): "Apply signature() from country names to reduce some of the minor differences between strings. In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces. So for example, United Kingdom would become kingdomunited." This minimises the string distance and so maximises the matching result.
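A minimal sketch of that `signature()` idea (my own re-implementation following the article, not the article's exact code):
```{r signature-sketch, eval=FALSE}
## Lower-case, split into words, sort alphabetically, concatenate without spaces.
signature <- function(x){
  sapply(strsplit(tolower(x), '[[:space:]]+'),
         function(w) paste(sort(w), collapse=''))
}
signature('United Kingdom')  # "kingdomunited"
signature('Lincoln City')    # "citylincoln"
```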
Here I `split` the team names into a list and simply apply `grep` and `agrep` as a first round of filtering.
```{r matching-05B, echo=FALSE, results='asis'}
d3data %>% datatable(.,caption="Table 4.2.1 : Reprocess the Teams' Name.",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
### 4.3 StringDist Maximum Likelihood
There is a good example in [How can I match fuzzy match strings from two datasets?](http://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets), which applies `expand.grid()` to build a data frame of candidate pairs and then applies Expectation Maximization theory by running `stringdist()` inside a while loop.
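A condensed sketch of the `expand.grid()` part of that approach (hypothetical names; the EM-style while loop is omitted):
```{r expand-grid-sketch, eval=FALSE}
## Score every (team, spbo) candidate pair, then keep the nearest per team.
library(stringdist)
grid <- expand.grid(team = c('Lincoln City', 'Mainz 05'),
                    spbo = c('Lincoln', 'Lincoln City Women', 'FSV Mainz'),
                    stringsAsFactors = FALSE)
grid$dist <- stringdist(grid$team, grid$spbo, method = 'jw')
do.call(rbind, lapply(split(grid, grid$team),
                      function(d) d[which.min(d$dist), ]))
```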
```{r stringdist, echo=FALSE, results='asis'}
source(paste0(getwd(),'/function/arrTeamID.R'))
strDF <- arrTeamID(mbase, spboData, levDist=0.1, parallel=FALSE)
strDF$result %>% datatable(.,caption="Table 4.3.1 : StringDist Approximately Matched Teams' Name.",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
In the table above, I have matched the team names used in the Section 2 dataset inside [Betting Strategy and Model Validation](https://github.com/englianhu/Betting-Strategy-and-Model-Validation/blob/master/Betting%20Strategy%20and%20Model%20Validation.Rmd). Here I apply method = `r method` inside the `stringdist` function. Feel free to apply the function to scrape and rearrange the team names and soccer score data for your own odds price modelling.
## 5. Result
### 5.1 Checked and Filtered the Teams' Name
Here I manually checked the team names and compiled them into a file, to compare against the accuracy of `stringDist()`.
```{r read-datasetC, echo=FALSE, results='asis'}
tmIDdata <- read.csv(paste0(getwd(),'/datasets/teamID.csv'),header=TRUE,sep=',') %>% tbl_df %>% mutate_each(funs(as.character))
## Kuban Krasnodar is duplicated in tmID but differs in the spbo column.
#'@ dp <- tmIDdata %>% filter(tmID==agrep('Kuban',tmID,value=TRUE))
tmIDdata <- tmIDdata %>% filter(spbo!='Kuban Krasnodar')
tmIDdata %>% datatable(.,caption="Table 5.1.1 : Table of Teams Name (Manually Checked)",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
### 5.2 Comparison of the Model
Firstly, we filter the team names.
```{r compare-datasetA, echo=FALSE, results='asis'}
comp1 <- tmIDdata %>% subset(., .$teamID %in% strDF$result$teamID & !duplicated(.$teamID)) %>% merge(.,strDF$result) %>% tbl_df %>% .[c('teamID','spbo',method)] %>% mutate_each(funs(as.character))
comp1 %>% datatable(.,caption="Table 5.2.1 : Table of Teams Name (stringDistList)",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
Secondly, we compare the accuracy and the number of matched teams.
```{r compare-datasetB, echo=FALSE, results='asis'}
res1 <- sapply(seq(2,ncol(comp1)),function(i) sum(as.numeric(comp1[2]==comp1[i])))
names(res1) <- c('spbo',method)
data.frame(match=names(res1),rate=res1/res1[1],n=res1) %>% tbl_df %>% kable(.,caption='Table 5.2.2 : Summary of Matching Result 1')
```
Similarly to the above, we filter the result of the `partialMatch()` function.
```{r compare-datasetC, echo=FALSE, results='asis'}
comp2 <- tmIDdata %>% subset(., .$teamID %in% df4$teamID & !duplicated(.$teamID)) %>% merge(.,df4) %>% tbl_df %>% .[c('teamID','spbo','spboID')] %>% mutate_each(funs(as.character))
comp2 %>% datatable(.,caption="Table 5.2.3 : Table of Teams Name (PartialMatch)",extensions=c('ColReorder','ColVis','TableTools'),options=list(dom='TC<"clear">rlfrtip',colVis=list(exclude=c(0),activate='mouseover'),tableTools=list(sSwfPath=copySWF(pdf=TRUE)),scrollX=TRUE,scrollCollapse=TRUE))
```
Here we also summarise the table.
```{r compare-datasetD, echo=FALSE, results='asis'}
res2 <- sapply(seq(2,ncol(comp2)),function(i) sum(as.numeric(comp2[2]==comp2[i])))
names(res2) <- c('spbo','PartialMatch')
data.frame(match=names(res2),rate=res2/res2[1],n=res2) %>% tbl_df %>% kable(.,caption='Table 5.2.4 : Summary of Matching Result 2')
```
Based on the two comparisons above, we see that the modified `stringdist()` function, `stringDistList()`, has correctly matched `r res1[2]` teams out of `r res1[1]`, while `partialMatch()` has matched `r res2[2]` teams out of `r res2[1]`. The more teams are correctly matched, the more information we gather to diversify the investment opportunities across different leagues.
### 5.3 Future Works
The approximate matching would be more accurate if I applied multivariate matching on the kick-off time and on both the home and away team names at once. I initially tried to match the team names with the kick-off time as a criterion, but the kick-off time sometimes changes unexpectedly a few hours prior to kick-off.
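A rough sketch of that multivariate idea (the column names and joint scoring rule below are my own assumptions, not an implemented function in this repository):
```{r multivariate-sketch, eval=FALSE}
## Match on kick-off date first, then score home and away names jointly,
## so one badly-spelled name can be rescued by the other keys.
library(stringdist)
matchFixture <- function(bet, score){
  idx <- which(score$Date == bet$Date)
  if(length(idx) == 0) return(NA_integer_)
  d <- stringdist(tolower(bet$Home), tolower(score$Home[idx]), method='jw') +
       stringdist(tolower(bet$Away), tolower(score$Away[idx]), method='jw')
  idx[which.min(d)]  # row index of the best-matching fixture
}
```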
I will also wrap this up as a package for easier loading and logging.
## 6. Appendices
### 6.1 Documenting File Creation
It's useful to record some information about how your file was created.
* File creation date: 2015-10-29
* `r R.version.string`
* R version (short form): `r getRversion()`
* `rmarkdown` package version: `r packageVersion('rmarkdown')`
* File version: 1.0.3
* File latest updated date: `r Sys.Date()`
* Author Profile: [Ryo®, Eng Lian Hu](http://rpubs.com/englianhu/ryoeng)
* GitHub: [Source Code](https://github.com/Scibrokes/Betting-Strategy-and-Model-Validation/blob/master/Natural%20Language%20Analysis.Rmd)
* Additional session information
```{r echo=FALSE, results='asis'}
lubridate::now()
devtools::session_info()$platform
Sys.info()
```
### 6.2 References
* [Merging Data Sets Based on Partially Matched Data Elements](http://www.r-bloggers.com/merging-data-sets-based-on-partially-matched-data-elements/)
* [How can I match fuzzy match strings from two datasets?](http://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets)
* [Fuzzy String Matching – a survival skill to tackle unstructured information](http://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/)
* [Compute Levenshtein distance using R](http://www.yimizhao.com/#!Compute-Levenshtein-distance-using-R/cu6k/55249b460cf215f35a4a815d)
* [d3Network](http://christophergandrud.github.io/d3Network/)
* [Tables with htmlTable and some alternatives](https://cran.r-project.org/web/packages/htmlTable/vignettes/tables.html)
* RStudio Blog - [DT: An R interface to the DataTables library](http://blog.rstudio.org/2015/06/24/dt-an-r-interface-to-the-datatables-library/)
* [DT: An R interface to the DataTables library](https://rstudio.github.io/DT/)