01_exploratory_data_analysis.Rmd
```{r include = FALSE}
if(!knitr:::is_html_output())
{
options("width"=56)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=56, indent = 2), tidy = TRUE)
knitr::opts_chunk$set(fig.pos = 'H')
}
```
# Exploratory Data Analysis {#exploratory_data_analysis}
Listening to the numbers :)
## Profiling, The voice of the numbers {#profiling}
```{r echo=FALSE, out.width="100px"}
knitr::include_graphics("exploratory_data_analysis/Galeano.jpg")
```
> "The voice of the numbers" -- a metaphor by [Eduardo Galeano](https://en.wikipedia.org/wiki/Eduardo_Galeano). Writer and novelist.
The data we explore could be like Egyptian hieroglyphs without a correct interpretation. Profiling is the very first step in a series of iterative stages in the pursuit of finding what the data want to tell us, if we are patient enough to listen.
This chapter will cover, with a few functions, a complete data profiling. This should be the entry step in a data project, where we start by knowing the correct data types and exploring distributions in numerical and categorical variables.
It also focuses on the extraction of semantic conclusions, which is useful when writing a report for non-technical people.
**What are we going to review in this chapter?**
* **Dataset health status**:
+ Getting metrics like total rows, columns, data types, zeros, and missing values
+ How each of the previous items impacts different analyses
+ How to quickly filter and operate on (and with) them, to clean the data
* **Univariate analysis of categorical variables**:
+ Frequency, percentage, cumulative value, and colorful plots
* **Univariate analysis with numerical variables**:
+ Percentile, dispersion, standard deviation, mean, top and bottom values
+ Percentile vs. quantile vs. quartile
+ Kurtosis, skewness, inter-quartile range, variation coefficient
+ Plotting distributions
+ Complete **case study** based on _"Data World"_, data preparation, and data analysis
<br>
Functions summary review in the chapter:
* `df_status(data)`: Profiling dataset structure
* `describe(data)`: Numerical and categorical profiling (quantitative)
* `freq(data)`: Categorical profiling (quantitative and plot).
* `profiling_num(data)`: Profiling for numerical variables (quantitative)
* `plot_num(data)`: Profiling for numerical variables (plots)
Note: `describe` is in the `Hmisc` package while the remaining functions are in `funModeling`.
<br>
### Dataset health status {#dataset-health-status}
The quantity of zeros, `NA`, `Inf`, and unique values, as well as the data type, may lead to a good or a bad model. Here's an approach to cover the very first step in data modeling.
First, we load the `funModeling` and `dplyr` libraries.
```{r, results="hide", message=FALSE, warning=FALSE}
# Loading funModeling!
library(funModeling)
library(dplyr)
data(heart_disease)
```
#### Checking missing values, zeros, data type, and unique values
Probably one of the first steps, when we get a new dataset to analyze, is to know if there are missing values (`NA` in **R**) and the data type.
The `df_status` function, included in `funModeling`, can help us by showing these numbers in absolute and percentage values. It also retrieves the infinite and zero statistics.
```{r df-status, eval=FALSE}
# Profiling the data input
df_status(heart_disease)
```
```{r profiling-data-in-r, echo=FALSE, fig.cap="Dataset health status", out.extra=""}
knitr::include_graphics("exploratory_data_analysis/dataset_profiling.png")
```
* `q_zeros`: quantity of zeros (`p_zeros`: in percent)
* `q_inf`: quantity of infinite values (`p_inf`: in percent)
* `q_na`: quantity of NA (`p_na`: in percent)
* `type`: factor or numeric
* `unique`: quantity of unique values
#### Why are these metrics important?
* **Zeros**: Variables with **lots of zeros** may not be useful for modeling and, in some cases, they may dramatically bias the model.
* **NA**: Several models automatically exclude rows with `NA` (**random forest**, for example). As a result, the final model can be biased due to several missing rows caused by only one variable. For example, if the data contains only one out of 100 variables with 90% of `NA`s, the model will be trained with only 10% of the original rows.
* **Inf**: Infinite values may lead to an unexpected behavior in some functions in R.
* **Type**: Some variables are encoded as numbers, but they are codes or categories and the models **don't handle them** in the same way.
* **Unique**: Factor/categorical variables with a high number of distinct values (~30) tend to overfit when some categories have few observations (**decision trees**, for example).
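As a quick illustration of the `NA` point above, here is a minimal sketch (with simulated data, not one of the book's datasets) showing how a single mostly-missing column shrinks the set of complete rows a model trained with listwise deletion would keep:

```{r}
# Sketch with simulated data: one mostly-missing column shrinks the
# set of complete rows that models using listwise deletion train on
set.seed(7)
d = data.frame(a = rnorm(100), b = rnorm(100))
d$c = ifelse(runif(100) < 0.9, NA, rnorm(100))  # ~90% of NA values
nrow(d)                 # total rows
sum(complete.cases(d))  # rows that survive after excluding NA
```

Even though `a` and `b` are complete, the few non-`NA` cases in `c` decide how many rows remain.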
<br>
#### Filtering unwanted cases
The function `df_status` takes a data frame and returns a _status table_ that can help us quickly remove features (or variables) based on all the metrics described in the last section. For example:
**Removing variables with a _high number_ of zeros**
```{r profiling-data}
# Profiling the Data Input
my_data_status=df_status(heart_disease, print_results = F)
# Removing variables with 60% of zero values
vars_to_remove=filter(my_data_status, p_zeros > 60) %>% .$variable
vars_to_remove
# Keeping all columns except the ones present in 'vars_to_remove' vector
heart_disease_2=select(heart_disease, -one_of(vars_to_remove))
```
**Ordering data by percentage of zeros**
```{r profiling-data-2}
arrange(my_data_status, -p_zeros) %>% select(variable, q_zeros, p_zeros)
```
<br>
The same reasoning applies when we want to remove (or keep) those variables above or below a certain threshold. Please check the missing values chapter to get more information about the implications when dealing with variables containing missing values.
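For instance, the zero-based filter above can be sketched for missing values as well; the 30% threshold here is hypothetical, so pick one suited to your analysis:

```{r}
# Sketch: the same filtering idea applied to missing values.
# The 30% threshold is hypothetical; choose one suited to your analysis.
library(funModeling)
library(dplyr)
data(heart_disease)
status = df_status(heart_disease, print_results = FALSE)
vars_high_na = filter(status, p_na > 30) %>% .$variable
heart_disease_no_na_vars = select(heart_disease, -one_of(vars_high_na))
```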
<br>
#### Going deep into these topics
Values returned by `df_status` are deeply covered in other chapters:
* **Missing values** (NA) treatment, analysis, and imputation are deeply covered in the [Missing Data](#missing_data) chapter.
* **Data type**, its conversions and implications when handling different data types and more are covered in the [Data Types](#data_types) chapter.
* A high number of **unique values** is synonymous with high-cardinality variables. This situation is studied in both chapters:
+ [High Cardinality Variable in Descriptive Stats](#high_cardinality_descriptive_stats)
+ [High Cardinality Variable in Predictive Modeling](#high_cardinality_predictive_modeling)
<br>
#### Getting other common statistics: **total rows**, **total columns** and **column names**:
```{r}
# Total rows
nrow(heart_disease)
# Total columns
ncol(heart_disease)
# Column names
colnames(heart_disease)
```
<br>
---
### Profiling categorical variables {#profiling-categorical-variables}
_Make sure you have the latest 'funModeling' version (>= 1.6)._
Frequency or distribution analysis is made simple by the `freq` function. It retrieves the distribution as a table and a plot (by default), showing the distribution in absolute and relative numbers.
If you want the distribution for two variables:
```{r frequency-analysis-r, fig.height=3, fig.width=5, fig.cap=c('Frequency analysis 1','Frequency analysis 2'), out.extra = ""}
freq(data=heart_disease, input = c('thal','chest_pain'))
```
As with the remaining `funModeling` functions, if `input` is missing, then it will run for all factor or character variables present in the given data frame:
```{r profiling-categorical-variable-4, eval=F}
freq(data=heart_disease)
```
If we only want to print the table and exclude the plot, then we set the `plot` parameter to `FALSE`.
The `freq` function can also handle a **single variable** as input.
By _default_, `NA` values **are considered** in both the table and the plot. To exclude `NA` values, set `na.rm = TRUE`.
Both options are shown in the following example:
```{r profiling-categorical-variable-5, eval=F}
freq(data=heart_disease$thal, plot = FALSE, na.rm = TRUE)
```
If only one variable is provided, then `freq` returns the table as a result; thus, it is easy to perform calculations based on the columns it provides. For example:
* To keep the categories that account for most of the share (roughly 80%), filter by `cumulative_perc < 80`.
* To get the categories belonging to the **long tail**, filter by `percentage < 1`, retrieving those categories appearing less than 1% of the time.
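Both filters can be sketched as follows, assuming the `thal` variable and the column names `percentage` and `cumulative_perc` as they appear in the `freq` output:

```{r}
# Sketch: splitting head and tail categories from the table `freq` returns
library(funModeling)
data(heart_disease)
freq_thal = freq(heart_disease$thal, plot = FALSE)
# categories accounting for most of the share
subset(freq_thal, cumulative_perc < 80)
# long-tail categories, appearing less than 1% of the time
subset(freq_thal, percentage < 1)
```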
In addition, as with the other plot functions in the package, if there is a need to export plots, then add the `path_out` parameter, which will create the folder if it's not yet created.
```{r, eval=F}
freq(data=heart_disease, path_out='my_folder')
```
##### Analysis
The output is ordered by the `frequency` variable, which makes it quick to see the most frequent categories and how much share they represent (the `cumulative_perc` variable). In general terms, we as human beings like order. If the bars are not ordered, then our eyes start moving over all of them to do the comparison, and our brains have to place each bar in relation to the others.
Check the difference for the same data input, first without order and then with order:
```{r univariate-analysis, echo=FALSE, fig.cap="Order and beauty", out.extra = ''}
knitr::include_graphics("exploratory_data_analysis/profiling_text_variable-bw.png")
```
Generally, there are just a few categories that appear most of the time.
A more complete analysis is in [High Cardinality Variable in Descriptive Stats](#high_cardinality_descriptive_stats)
<br>
#### Introducing the `describe` function
This function comes in the `Hmisc` package and allows us to quickly profile a complete dataset for both numerical and categorical variables. In this case, we'll select only two variables and we will analyze the result.
```{r}
# Just keeping two variables to use in this example
heart_disease_3=select(heart_disease, thal, chest_pain)
# Profiling the data!
describe(heart_disease_3)
```
Where:
* `n`: quantity of non-`NA` rows. In this case, it indicates there are `301` patients with a non-missing value.
* `missing`: number of missing values. Summing this indicator to `n` gives us the total number of rows.
* `unique`: number of unique (or distinct) values.
The other information is pretty similar to the `freq` function output; it also returns, between parentheses, the absolute and relative numbers for each different category.
<br>
---
### Profiling numerical variables
This section is separated into two parts:
* Part 1: Introducing the “World Data” case study
* Part 2: Doing the numerical profiling in R
If you don’t want to know how the World Data preparation stage is done, then you can jump to "Part 2: Doing the numerical profiling in R", where the profiling starts.
#### Part 1: Introducing the World Data case study
The dataset contains many indicators regarding world development. Beyond the profiling example, the idea is to provide a ready-to-use table for sociologists, researchers, and others interested in analyzing this kind of data.
The original data source is: [http://databank.worldbank.org](http://databank.worldbank.org/data/reports.aspx?source=2&Topic=11#). There you will find a data dictionary that explains all the variables.
First, we have to do some data wrangling. We are going to keep the newest value per indicator.
```{r}
library(Hmisc)
# Loading data from the book repository without altering the format
data_world=read.csv(file = "https://goo.gl/2TrDgN", header = T, stringsAsFactors = F, na.strings = "..")
# Excluding missing values in Series.Code. The data downloaded from the web page contains four lines with "free-text" at the bottom of the file.
data_world=filter(data_world, Series.Code!="")
# The magical function that keeps the newest values for each metric. If you're not familiar with R, then skip it.
max_ix<-function(d)
{
ix=which(!is.na(d))
res=ifelse(length(ix)==0, NA, d[max(ix)])
return(res)
}
data_world$newest_value=apply(data_world[,5:ncol(data_world)], 1, FUN=max_ix)
# Printing the first three rows
head(data_world, 3)
```
The columns `Series.Name` and `Series.Code` are the indicators to be analyzed.
`Country.Name` and `Country.Code` are the countries. Each row represents a unique combination of country and indicator.
The remaining columns, `X1990..YR1990.` (year 1990), `X2000..YR2000.` (year 2000), `X2007..YR2007.` (year 2007), and so on, indicate the metric value for that year; thus, each column is a year.
<br>
#### Making a data scientist decision
There are many `NA`s because some countries don't have the measure of an indicator in those years. At this point, we need to **make a decision** as data scientists. Probably not the optimal one if we don't ask an expert, e.g., a sociologist.
What to do with the `NA` values? In this case, we are going to keep the **newest value** for all the indicators. Perhaps this is not the best way to extract conclusions for a paper, as we are going to compare some countries with information up to 2016 while others will be updated only to 2009. However, comparing all the indicators with the newest available data is a valid approach for a first analysis.
Another solution could have been to keep the newest value only if it belongs to the last five years. This would reduce the number of countries to analyze.
These questions are impossible to answer for an _artificial intelligence system_, yet the decision can change the results dramatically.
**The last transformation**
The next step will convert the last table from _long_ to _wide_ format. In other words, each row will represent a country and each column an indicator (thanks to the last transformation that has the _newest value_ for each combination of indicator-country).
The indicator names are unclear, so we will "translate" a few of them.
```{r, tidy=FALSE}
# Get the list of indicator descriptions.
names=unique(select(data_world, Series.Name, Series.Code))
head(names, 5)
# Convert a few
df_conv_world=data.frame(
new_name=c("urban_poverty_headcount",
"rural_poverty_headcount",
"gini_index",
"pop_living_slums",
"poverty_headcount_1.9"),
Series.Code=c("SI.POV.URHC",
"SI.POV.RUHC",
"SI.POV.GINI",
"EN.POP.SLUM.UR.ZS",
"SI.POV.DDAY"),
stringsAsFactors = F)
# adding the new indicator value
data_world_2 = left_join(data_world,
df_conv_world,
by="Series.Code",
all.x=T)
data_world_2 =
mutate(data_world_2, Series.Code_2=
ifelse(!is.na(new_name),
as.character(data_world_2$new_name),
data_world_2$Series.Code)
)
```
Any indicator meaning can be checked in data.worldbank.org. For example, if we want to know what `EN.POP.SLUM.UR.ZS` means, then we type: http://data.worldbank.org/indicator/EN.POP.SLUM.UR.ZS
```{r, warning=FALSE}
# The package 'reshape2' contains both 'dcast' and 'melt' functions
library(reshape2)
data_world_wide=dcast(data_world_2, Country.Name ~ Series.Code_2, value.var = "newest_value")
```
Note: To understand more about `long` and `wide` format using `reshape2` package, and how to convert from one to another, please go to http://seananderson.ca/2013/10/19/reshape.html.
Now we have the final table to analyze:
```{r}
# Printing the first three rows
head(data_world_wide, 3)
```
<br>
#### Part 2: Doing the numerical profiling in R {#numerical-profiling-in-r}
We will see the following functions:
* `describe` from `Hmisc`
* `profiling_num` (full univariate analysis) and `plot_num` (histograms) from `funModeling`
We'll pick only two variables as an example:
```{r}
library(Hmisc) # contains the `describe` function
vars_to_profile=c("gini_index", "poverty_headcount_1.9")
data_subset=select(data_world_wide, one_of(vars_to_profile))
# Using `describe` on a complete dataset.
# It can also be run with one variable; for example, describe(data_subset$poverty_headcount_1.9)
describe(data_subset)
```
Taking `poverty_headcount_1.9` (_Poverty headcount ratio at $1.90 a day is the percentage of the population living on less than $1.90 a day at 2011 international prices._), we can describe it as:
* `n`: quantity of non-`NA` rows. In this case, it indicates `116` countries that contain a number.
* `missing`: number of missing values. Summing this indicator to `n` gives us the total number of rows. Almost half of the countries have no data.
* `unique`: number of unique (or distinct) values.
* `Info`: an estimator of the amount of information present in the variable and not important at this point.
* `Mean`: the classical mean or average.
* Numbers `.05`, `.10`, `.25`, `.50`, `.75`, `.90`, and `.95` stand for the percentiles, e.g., `.05` is the 5th percentile. These values are really useful since they help us describe the distribution; percentiles will be covered in depth later on.
* `lowest` and `highest`: the five lowest/highest values. Here, we can spot outliers and data errors. For example, if the variable represents a percentage, then it cannot contain negative values.
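These percentiles and extreme values can be reproduced with base R; the following sketch uses simulated skewed data (not the World Data indicators):

```{r}
# Sketch with simulated skewed data (not the World Data indicators)
set.seed(31)
x = rexp(500, rate = 0.1)
# the same percentiles reported by `describe`
quantile(x, probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95))
# the five lowest/highest values, handy for spotting outliers or data errors
head(sort(x), 5)
tail(sort(x), 5)
```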
<br>
The next function is `profiling_num`, which takes a data frame and retrieves a _big_ table in which it is easy to get overwhelmed by a _sea of metrics_. This is similar to what we see in the movie _The Matrix_.
```{r matrix-movie, out.width="150px", echo=FALSE, fig.cap="The matrix of data", out.extra = ""}
knitr::include_graphics("exploratory_data_analysis/matrix_movie.png")
```
_Picture from the movie: "The Matrix" (1999). The Wachowski Brothers (Directors)._
The idea of the following table is to give the user a **full set of metrics** so that she or he can decide which ones to pick for the study.
Note: Every metric has a lot of statistical theory behind it. Here we'll be covering just a tiny and **oversimplified** approach to introduce the concepts.
```{r}
library(funModeling)
# Full numerical profiling in one function automatically excludes non-numerical variables
profiling_num(data_world_wide)
```
Each indicator has _its raison d'être_:
* `variable`: variable name
* `mean`: the well-known mean or average
* `std_dev`: standard deviation, a measure of **dispersion** or **spread** around the mean value. A value around `0` means almost no variation (thus, it seems more like a constant); on the other side, it is harder to set what _high_ is, but we can tell that the higher the variation the greater the spread.
_Chaos may look like infinite standard variation_. The unit is the same as the mean so that it can be compared.
* `variation_coef`: variation coefficient=`std_dev`/`mean`. Because the `std_dev` is an absolute number, it's good to have an indicator that puts it in a relative number, comparing the `std_dev` against the `mean`. A value of `0.22` indicates the `std_dev` is 22% of the `mean`. If it were close to `0`, then the variable would tend to be more centered around the mean. If we compare two classifiers, then we may prefer the one with less `std_dev` and `variation_coef` in its accuracy.
* `p_01`, `p_05`, `p_25`, `p_50`, `p_75`, `p_95`, `p_99`: **Percentiles** at 1%, 5%, 25%, and so on. Later on in this chapter is a complete review about percentiles.
For a full explanation about percentiles, please go to: [Annex 1: The magic of percentiles](#appendix-percentiles).
* `skewness`: is a measure of _asymmetry_. Close to **0** indicates that the distribution is _equally_ distributed (or symmetrical) around its mean. A **positive number** implies a long tail on the right, whereas a **negative number** means the opposite.
After this section, check the skewness in the plots. The variable `pop_living_slums` is close to 0 ("equally" distributed), `poverty_headcount_1.9` is positive (tail on the right), and `SI.DST.04TH.20` is negative (tail on the left). The further the skewness is from 0, the more likely the distribution is to have **outliers**.
* `kurtosis`: describes the distribution **tails**; keeping it simple, a higher number may indicate the presence of outliers (just as we'll see later for the variable `SI.POV.URGP`, which holds an outlier around the value `50`).
For a complete skewness and kurtosis review, check Refs. [@skew_kurt_1] and [@skew_kurt_2].
* `iqr`: the inter-quartile range is the result of looking at percentiles `0.25` and `0.75` and indicates, in the same variable unit, the dispersion length of 50% of the values. The higher the value the more sparse the variable.
* `range_98` and `range_80`: indicate the range in which `98%` of the values fall. `range_98` removes the bottom and top 1% (thus, the `98%` number). It is good to know the variable range without potential outliers; for example, `pop_living_slums` goes from `0` to `76.15`. This is **more robust** than comparing the **min** and **max** values.
The `range_80` is the same as the `range_98` but without the bottom and top `10%`.
`iqr`, `range_98` and `range_80` are based on percentiles, which we'll be covering later in this chapter.
**Important**: All the metrics are calculated after removing the `NA` values. Otherwise, the table would be filled with `NA`s.
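To demystify a few of these metrics, here is a sketch reproducing them by hand on simulated data (the variable is made up for illustration):

```{r}
# Sketch: reproducing a few profiling_num metrics by hand on simulated data
set.seed(99)
x = c(rnorm(200, mean = 50, sd = 10), NA, NA)
x = x[!is.na(x)]  # as in profiling_num, metrics use only non-NA values
c(mean = mean(x),
  std_dev = sd(x),
  variation_coef = sd(x) / mean(x),
  iqr = IQR(x))
quantile(x, probs = c(0.01, 0.99))  # bounds of range_98
quantile(x, probs = c(0.10, 0.90))  # bounds of range_80
```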
<br>
##### Advice when using `profiling_num`
The idea of `profiling_num` is to provide the data scientist with a full set of metrics so they can select the most relevant ones. This can easily be done using the `select` function from the `dplyr` package.
In addition, we set the `profiling_num` parameter `print_results = FALSE` to avoid printing in the console.
For example, let's keep the `mean`, `p_01`, `p_99`, and `range_80`:
```{r}
my_profiling_table=profiling_num(data_world_wide, print_results = FALSE) %>% select(variable, mean, p_01, p_99, range_80)
# Printing only the first three rows
head(my_profiling_table, 3)
```
Please note that `profiling_num` returns a table, so we can quickly filter cases based on the conditions we set.
<br>
##### Profiling numerical variables by plotting {#plotting-numerical-variable}
Another function in `funModeling` is `plot_num` which takes a dataset and plots the distribution of every numerical variable while automatically excluding the non-numerical ones:
```{r, profiling-numerical-variable-with-histograms, fig.cap="Profiling numerical data", out.extra = ''}
plot_num(data_world_wide)
```
We can adjust the number of bars used in the plot by changing the `bins` parameter (default value is set to 10). For example: `plot_num(data_world_wide, bins = 20)`.
---
<br>
### Final thoughts
Many numbers have appeared here so far, _and even more in the percentile appendix_. The important point is for you to find the right approach to explore your data. This can come from other metrics or other criteria.
The functions `df_status`, `describe`, `freq`, `profiling_num` and `plot_num` can be run at the beginning of a data project.
Regarding the **normal and abnormal behavior** of data, it's important to study both. To describe a dataset in general terms, we should exclude the extreme values, for example, using the `range_98` metric. The mean will likely change after the exclusion.
These analyses are **univariate**; that is, they do not take into account other variables (**multivariate** analysis). This will be part of this book later on. Meanwhile, for the correlation between input (and output) variables, you can check the [Correlation](#correlation) chapter.
<br>
---
```{r echo=FALSE}
knitr::include_graphics("introduction/spacer_bar.png")
```
---
<br>
## Correlation and Relationship {#correlation}
```{r manderbolt-fractal,echo=FALSE, out.width="150px"}
knitr::include_graphics("exploratory_data_analysis/fractal_manderbolt.png")
```
_Mandelbrot fractal, where the chaos expresses its beauty; image source: Wikipedia._
<br>
### What is this about?
This chapter contains both methodological and practical aspects of measuring correlation between variables. We will see that the word _correlation_ can be translated into "**functional relationship**".
On the methodological side you will find Anscombe's Quartet, a set of four plots with dissimilar spatial distributions that share the same correlation measure. We'll go one step further, re-calculating their relationships through a more robust metric (MIC).
We will mention **Information Theory** several times, although for now it is not going to be covered at the mathematical level; that is planned for later. Many algorithms are based on it, even deep learning.
Understanding these concepts in low dimension (two variables) and small data (a bunch of rows) allows us to better understand high-dimensional data. Nonetheless, some real cases are only _small_ data.
From the practical point of view, you'll be able to replicate the analysis with your own data, profiling and exposing their relationships in fancy plots.
<br>
Let's start by loading all the needed libraries.
```{r lib, message=F, results="hide", warning=FALSE}
# Loading needed libraries
library(funModeling) # contains heart_disease data
library(minerva) # contains MIC statistic
library(ggplot2)
library(dplyr)
library(reshape2)
library(gridExtra) # allow us to plot two plots in a row
options(scipen=999) # disable scientific notation
```
<br>
### Linear correlation {#linear-correlation}
Perhaps the most standard correlation measure for numeric variables is the `R statistic` (or Pearson coefficient), which goes from `1` (_positive correlation_) to `-1` (_negative correlation_). A value around `0` implies no linear correlation.
Consider the following example, which calculates the R measure based on a target variable (for example, to do feature engineering). The function `correlation_table` retrieves the R metric for all numeric variables, skipping the categorical/nominal ones.
```{r}
correlation_table(data=heart_disease, target="has_heart_disease")
```
The variable `heart_disease_severity` is the most important numerical variable: the higher its value, the higher the chances of having heart disease (positive correlation). It's just the opposite with `max_heart_rate`, which has a negative correlation.
Squaring this number returns the `R-squared` statistic (aka `R2`), which goes from `0` (_no correlation_) to `1` (_high correlation_).
The R statistic is highly influenced by **outliers** and **non-linear** relationships.
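The sensitivity to outliers can be sketched with simulated data: a single extreme point is enough to break an almost perfect linear correlation.

```{r}
# Sketch: one extreme point breaks a near-perfect linear correlation
set.seed(5)
x = 1:50
y = x + rnorm(50, sd = 2)
r_clean = cor(x, y)    # close to 1 on the clean data
y[50] = -500           # inject a single outlier
r_outlier = cor(x, y)  # the Pearson coefficient degrades sharply
c(r_clean = r_clean, r_outlier = r_outlier)
```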
<br>
#### Correlation on Anscombe's Quartet
Take a look at **Anscombe's quartet**, quoting [Wikipedia](https://en.wikipedia.org/wiki/Anscombe%27s_quartet):
> They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
1973 and still valid, fantastic.
These four relationships are different, but all of them have the same Pearson correlation coefficient (R): `0.816`.
The following example calculates the correlation and plots every pair.
```{r anscombe-data, message=FALSE, tidy=FALSE, fig.cap="Anscombe set", out.extra = ''}
# Reading anscombe quartet data
anscombe_data =
read.delim(file="https://goo.gl/mVLz5L", header = T)
# calculating the Pearson correlation (R) for
# every pair; every value is the same: 0.816.
cor_1 = cor(anscombe_data$x1, anscombe_data$y1)
cor_2 = cor(anscombe_data$x2, anscombe_data$y2)
cor_3 = cor(anscombe_data$x3, anscombe_data$y3)
cor_4 = cor(anscombe_data$x4, anscombe_data$y4)
# defining the function
plot_anscombe <- function(x, y, value, type)
{
# 'anscombe_data' is a global variable, this is
# a bad programming practice ;)
p=ggplot(anscombe_data, aes_string(x,y)) +
geom_smooth(method='lm', fill=NA) +
geom_point(aes(colour=factor(1),
fill = factor(1)),
shape=21, size = 2
) +
ylim(2, 13) +
xlim(4, 19) +
theme_minimal() +
theme(legend.position="none") +
annotate("text",
x = 12,
y =4.5,
label =
sprintf("%s: %s",
type,
round(value,2)
)
)
return(p)
}
# plotting in a 2x2 grid
grid.arrange(plot_anscombe("x1", "y1", cor_1, "R"),
             plot_anscombe("x2", "y2", cor_2, "R"),
             plot_anscombe("x3", "y3", cor_3, "R"),
             plot_anscombe("x4", "y4", cor_4, "R"),
ncol=2,
nrow=2)
```
Four different plots having the same `mean` for every `x` and `y` variable (9 and 7.501, respectively) and the same degree of correlation. You can check all the measures by typing `summary(anscombe_data)`.
This is why it is so important to plot relationships when analyzing correlations.
We'll come back to this data later. It can be improved! First, we'll introduce some concepts of information theory.
<br>
### Correlation based on Information Theory
These relationships can be measured better with [Information Theory](https://en.wikipedia.org/wiki/Information_theory) concepts. One of the many algorithms that measure correlation based on this is **MINE**, an acronym for Maximal Information-based Nonparametric Exploration.
The R implementation can be found in the [minerva](https://cran.r-project.org/web/packages/minerva/index.html) package. It's also available in other languages, like Python.
<br>
#### An example in R: A perfect relationship
Let's plot a non-linear relationship, directly based on a function (negative exponential), and print the MIC value.
```{r mic-non-linear-relationship, message=FALSE, fig.width=4, fig.height=3, fig.cap="A perfect relationship", out.extra = ''}
x=seq(0, 20, length.out=500)
df_exp=data.frame(x=x, y=dexp(x, rate=0.65))
ggplot(df_exp, aes(x=x, y=y)) + geom_line(color='steelblue') + theme_minimal()
# Calculating linear correlation; position [1,2] of the matrix
# contains the correlation of both variables, excluding the
# correlation of each variable against itself.
res_cor_R2=cor(df_exp)[1,2]^2
sprintf("R2: %s", round(res_cor_R2,2))
# now computing the MIC metric
res_mine=mine(df_exp)
sprintf("MIC: %s", res_mine$MIC[1,2])
```
The **MIC** value goes from 0 to 1. A value of 0 implies no correlation and 1 the highest correlation. The interpretation is the same as for R-squared.
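The difference with R-squared shows up in non-linear relationships. The following is a minimal sketch (not part of the original example) comparing both metrics on a symmetric quadratic relationship, assuming the `minerva` package is installed:

```r
library(minerva)

# symmetric quadratic relationship: y is fully determined by x
x = seq(-1, 1, length.out = 500)
y = x^2

# Pearson-based R2 is near 0: the linear fit is flat
round(cor(x, y)^2, 3)

# MIC is near 1: the functional relationship is detected
mine(x, y)$MIC
```

Here a linear metric sees almost nothing, while MIC detects that `y` is a function of `x`.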
<br>
#### Results analysis
The `MIC=1` indicates there is a perfect correlation between the two variables. If we were doing **feature engineering**, this variable should be included.
Beyond a simple correlation, what the MIC says is: "Hey, these two variables show a functional relationship."
In machine learning terms (and oversimplifying): "variable `y` is dependent on variable `x`, and a function -we don't know which one- can be found to model the relationship."
This is tricky because that relationship was indeed created based on a function, an exponential one.
But let's continue with other examples...
<br>
### Adding noise
Noise is an undesired signal added to the original one. In machine learning, noise confuses the model. Concretely: two identical input cases -for example, customers- have different outcomes -one buys and the other doesn't-.
Now we are going to add some noise by creating the `y_noise_1` variable.
```{r noisy-relationship, fig.width=4, fig.height=3, fig.cap="Adding some noise", out.extra = ''}
df_exp$y_noise_1=jitter(df_exp$y, factor = 1000, amount = NULL)
ggplot(df_exp, aes(x=x, y=y_noise_1)) +
geom_line(color='steelblue') + theme_minimal()
```
Let's calculate the correlation and MIC again, this time printing the entire matrix in both cases, which shows the correlation/MIC metric of each input variable against all the others, including themselves.
```{r}
# calculating R squared
res_R2=cor(df_exp)^2
res_R2
# Calculating mine
res_mine_2=mine(df_exp)
# Printing MIC
res_mine_2$MIC
```
Adding noise to the data decreases the MIC value from 1 to 0.7226365 (-27%), and this is great!
R2 also decreased, but just a little bit, from 0.3899148 to 0.3866319 (-0.8%).
**Conclusion:** MIC reflects a noisy relationship much better than R2, and it's helpful for finding correlated associations.
**About the last example:** Generating data based on a function is only for teaching purposes. But the concept of noise in variables is quite common in _almost_ **every data set**, no matter its source. You don't have to do anything to add noise to variables; it's already there.
Machine learning models deal with this noise by approximating the _real_ shape of the data.
It's quite useful to use the MIC measure to get a sense of the information present in a relationship between two variables.
<br>
### Measuring non-linearity (MIC-R2)
The `mine` function returns several metrics; we checked only **MIC**, but due to the nature of the algorithm (you can check the original paper [@ReshefEtAl2011]), it computes other interesting indicators. Check them all by inspecting the `res_mine_2` object.
One of them is `MICR2`, used as a measure of **non-linearity**. It is calculated as MIC - R2. Since R2 measures linearity, a high `MICR2` indicates a non-linear relationship.
We can check this by calculating the MICR2 manually; the following two matrices return the same result:
```{r, eval=FALSE}
# MIC r2: non-linearity metric
round(res_mine_2$MICR2, 3)
# calculating MIC r2 manually
round(res_mine_2$MIC-res_R2, 3)
```
Non-linear relationships are harder to model, even more so when using a linear algorithm like linear regression.
Imagine we need to explain the relationship to another person; we'll need "more words" to do it. It's easier to say: _"A increases as B increases and the ratio is always 3x"_ (if A=1 then B=3, linear).
Compare that to: _"A increases as B increases, but A is almost 0 until B reaches the value 10; then A rises to 300; and when B reaches 15, A goes to 1000."_
```{r measuring-non-linearity, message=FALSE, fig.width=8, fig.height=3, tidy=FALSE, fig.cap="Comparing relationships", out.extra = ''}
# creating data example
df_example=data.frame(x=df_exp$x,
y_exp=df_exp$y,
y_linear=3*df_exp$x+2)
# getting mine metrics
res_mine_3=mine(df_example)
# generating labels to print the results
results_linear =
sprintf("MIC: %s \n MIC-R2 (non-linearity): %s",
res_mine_3$MIC[1,3],
round(res_mine_3$MICR2[1,3],2)
)
results_exp =
sprintf("MIC: %s \n MIC-R2 (non-linearity): %s",
res_mine_3$MIC[1,2],
round(res_mine_3$MICR2[1,2],4)
)
# Plotting results
# Creating plot exponential variable
p_exp=ggplot(df_example, aes(x=x, y=y_exp)) +
geom_line(color='steelblue') +
annotate("text", x = 11, y =0.4, label = results_exp) +
theme_minimal()
# Creating plot linear variable
p_linear=ggplot(df_example, aes(x=x, y=y_linear)) +
geom_line(color='steelblue') +
annotate("text", x = 8, y = 55,
label = results_linear) +
theme_minimal()
grid.arrange(p_exp,p_linear,ncol=2)
```
<br>
Both plots show a perfect correlation (or relationship), holding a MIC=1.
Regarding non-linearity, MICR2 behaves as expected: 0.6101 for `y_exp` and 0 for `y_linear`.
This point is important since the **MIC behaves like R2 does in linear relationships**, and it also adapts quite well to **non-linear** relationships, as we saw before, providing a particular score metric (`MICR2`) to profile the relationship.
<br>
### Measuring information on Anscombe Quartet
Remember the example we reviewed at the beginning? Every pair of the Anscombe Quartet returns an **R2 of 0.816**. But based on its plots, it was clear that not every pair exhibits a good correlation or a similar distribution of `x` and `y`.
But what happens if we measure the relationship with a metric based on Information Theory? Yes, MIC again.
```{r anscombe-set-information, fig.cap="MIC statistic", out.extra = ''}
# calculating the MIC for every pair
mic_1=mine(anscombe_data$x1, anscombe_data$y1, alpha=0.8)$MIC
mic_2=mine(anscombe_data$x2, anscombe_data$y2, alpha=0.8)$MIC
mic_3=mine(anscombe_data$x3, anscombe_data$y3, alpha=0.8)$MIC
mic_4=mine(anscombe_data$x4, anscombe_data$y4, alpha=0.8)$MIC
# plotting MIC in a 2x2 grid
grid.arrange(plot_anscombe("x1", "y1", mic_1, "MIC"),
             plot_anscombe("x2", "y2", mic_2, "MIC"),
             plot_anscombe("x3", "y3", mic_3, "MIC"),
             plot_anscombe("x4", "y4", mic_4, "MIC"),
             ncol=2,
             nrow=2)
```
As you may notice, we increased the `alpha` value to 0.8; this is a good practice -according to the documentation- when analyzing small samples. The default value is 0.6 and its maximum is 1.
In this case, the MIC value spotted the most spurious relationship in the pair `x4 - y4`. Probably due to the few cases per plot (11 rows), the MIC was the same for all the other pairs. Having more cases would show different MIC values.
But when combining the MIC with **MIC-R2** (the non-linearity measurement), new insights appear:
```{r anscombe-set}
# Calculating the MIC for every pair. Note the "MIC-R2" object
# has a hyphen when the input is two vectors, unlike when it
# takes a data frame, in which case it is "MICR2".
mic_r2_1=mine(anscombe_data$x1, anscombe_data$y1, alpha = 0.8)$`MIC-R2`
mic_r2_2=mine(anscombe_data$x2, anscombe_data$y2, alpha = 0.8)$`MIC-R2`
mic_r2_3=mine(anscombe_data$x3, anscombe_data$y3, alpha = 0.8)$`MIC-R2`
mic_r2_4=mine(anscombe_data$x4, anscombe_data$y4, alpha = 0.8)$`MIC-R2`
# Ordering according to mic_r2
df_mic_r2=data.frame(pair=c(1,2,3,4), mic_r2=c(mic_r2_1,mic_r2_2,mic_r2_3,mic_r2_4)) %>% arrange(-mic_r2)
df_mic_r2
```
Ordering decreasingly by their **non-linearity**, the results are consistent with the plots: 2 > 3 > 1 > 4.
Something strange happens with pair 4: a negative number. This is because its MIC is lower than its R2. A relationship worth plotting.
<br>
### Measuring non-monotonicity: MAS measure
MINE can also help us profile time series regarding their non-monotonicity with **MAS** (maximum asymmetry score).
A monotonic series is one that never changes its tendency: it always goes up or always goes down. More on this in [@monotonic_function].
The following example simulates two time series, one non-monotonic (`y_1`) and the other monotonic (`y_2`).
```{r monotonic-non-monotonic-function, fig.height=3.5, fig.width=8, tidy=FALSE, fig.cap="Monotonicity in functions", out.extra = ''}
# creating sample data (simulating time series)
time_x=sort(runif(n=1000, min=0, max=1))
y_1=4*(time_x-0.5)^2
y_2=4*(time_x-0.5)^3
# Calculating MAS for both series
mas_y1=round(mine(time_x,y_1)$MAS,2)
mas_y2=mine(time_x,y_2)$MAS
# Putting all together
df_mono=data.frame(time_x=time_x, y_1=y_1, y_2=y_2)
# Plotting
label_p_y_1 =
sprintf("MAS=%s (goes down \n and up => not-monotonic)",
mas_y1)
p_y_1=ggplot(df_mono, aes(x=time_x, y=y_1)) +
geom_line(color='steelblue') +
theme_minimal() +
annotate("text", x = 0.45, y =0.75,
label = label_p_y_1)
label_p_y_2=
sprintf("MAS=%s (goes up => monotonic)", mas_y2)
p_y_2=ggplot(df_mono, aes(x=time_x, y=y_2)) +
geom_line(color='steelblue') +
theme_minimal() +
annotate("text", x = 0.43, y =0.35,
label = label_p_y_2)
grid.arrange(p_y_1,p_y_2,ncol=2)
```
<br>
From another perspective, MAS is also useful to detect periodic relationships. Let's illustrate this with an example:
```{r periodical-function, fig.height=3.5, fig.width=8, tidy=FALSE, fig.cap="Periodicity in functions", out.extra = '', echo=FALSE}
# creating data for the periodic function
x_per=seq(-2*pi,2*pi,0.2)
df_per=data.frame(x=x_per, y=sin(2*x_per))
# creating data for the non-periodic function
x_non_per=1:100
df_non_per=data.frame(x=x_non_per, y=2*x_non_per*x_non_per+2)
# Calculating MAS for both series
mas_y1_b=mine(df_per$x,df_per$y)$MAS
mas_y2_b=round(mine(df_non_per$x,df_non_per$y)$MAS,2)
# Plotting
p_y_1_b=ggplot(df_per, aes(x=x, y=y)) +
geom_line(color='steelblue') +
theme_minimal() +
ggtitle(sprintf("MAS=%s (periodic)", round(mas_y1_b,2)))
p_y_2_b=ggplot(df_non_per, aes(x=x, y=y)) +
geom_line(color='steelblue') +
theme_minimal() +
ggtitle(sprintf("MAS=%s (non-periodic)", mas_y2_b))
grid.arrange(p_y_1_b, p_y_2_b, ncol=2)
```
#### A more real example: Time Series
Consider the following case, which contains three time series: `y1`, `y2` and `y3`. They can be profiled concerning their non-monotonicity or overall growth trend.
```{r correlation-time-series, fig.height=3, fig.width=5, tidy=FALSE, fig.cap="Time series example", out.extra = ''}
# reading data
df_time_series =
read.delim(file="https://goo.gl/QDUjfd")
# converting to long format so they can be plotted
df_time_series_long=melt(df_time_series, id="time")
# Plotting
plot_time_series =
ggplot(data=df_time_series_long,
aes(x=time, y=value, colour=variable)) +
geom_line() +
theme_minimal() +
scale_color_brewer(palette="Set2")
plot_time_series
```
```{r}
# Calculating and printing MAS values for time series data
mine_ts=mine(df_time_series)
mine_ts$MAS
```
<br>
We need to look at the `time` column: there we've got the MAS value of each series with respect to time.
`y2` is the most monotonic (and least periodic) series, which can be confirmed by looking at it. It seems to always go up.
**MAS summary:**
* MAS ~ 0 indicates a monotonic or non-periodic function ("always" up or down)
* MAS ~ 1 indicates a non-monotonic or periodic function
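As a quick numeric check of this summary, we can compare the MAS of a straight line against a full sine period. This is a sketch, not from the original examples, assuming the `minerva` package is installed:

```r
library(minerva)

x = seq(0, 2*pi, length.out = 300)

# monotonic series (a straight line): MAS close to 0
mine(x, x)$MAS

# periodic series (one full sine cycle): MAS markedly higher
mine(x, sin(x))$MAS
```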
<br>
### Correlation between time series
The MIC metric can also measure the **correlation in time series**. It is not a general-purpose tool, but it can be helpful to compare different series quickly.
This section is based on the same data we used in the MAS example.
```{r time-series, fig.width=4, fig.height=2.5, fig.cap="Time series example", out.extra = ''}
# printing the three time series again
plot_time_series
# Printing MIC values
mine_ts$MIC
```
<br>
Now we need to look at the `y1` column. According to the MIC measure, we can confirm what is shown in the last plot:
`y1` is more similar to `y3` (MIC=0.709) than to `y2` (MIC=0.61).
<br>
#### Going further: Dynamic Time Warping
MIC will not be helpful for more complex scenarios involving time series that vary in speed; there you would use the [dynamic time warping](https://en.wikipedia.org/wiki/Dynamic_time_warping) technique (**DTW**).
Let's use an image to grasp the concept visually:
```{r Dynamic-Time-warping, echo=FALSE, out.width="300px", fig.cap="Dynamic time warping", out.extra = ''}
knitr::include_graphics("exploratory_data_analysis/dynamic_time_warping.png")
```
Image source: _Dynamic time warping: Converting images into time series for data mining_ [@img_time_series].
The last image shows two different approaches to compare time series. The **Euclidean** one is more similar to the MIC measure, while DTW can track similarities occurring at different times.
A nice implementation in **R** is the [dtw package](http://dtw.r-forge.r-project.org).
Finding correlations between time series is another way of performing **time series clustering**.
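As a starting point, here is a minimal sketch with the `dtw` package (assumed installed from CRAN). Two series with the same shape but sampled at different "speeds" are aligned, and the accumulated distance quantifies their dissimilarity after warping:

```r
library(dtw)

# same shape, different speed: the second series is stretched
ts_a = sin(seq(0, 2*pi, length.out = 100))
ts_b = sin(seq(0, 2*pi, length.out = 150))

# align the two series; a plain Euclidean comparison is not even
# defined here since the lengths differ
alignment = dtw(ts_a, ts_b)

# accumulated distance after warping (0 would be a perfect match)
alignment$distance

# alignment$index1 / alignment$index2 hold the warping path
```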
<br>
### Correlation on categorical variables