\documentclass[article]{jss}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% almost as usual
\author{Xiaoyue Cheng\\Iowa State University \And
Dianne Cook\\Iowa State University \And
Heike Hofmann\\Iowa State University}
\title{Visually Exploring Missing Values in Multivariable Data
Using a Graphical User Interface}
%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Xiaoyue Cheng, Dianne Cook, Heike Hofmann} %% comma-separated
\Plaintitle{Visually Exploring Missing Values in Multivariable Data
Using a Graphical User Interface} %% without formatting
\Shorttitle{Visually Exploring Missing Values} %% a short title (if necessary)
%% an abstract and keywords
\Abstract{
Missing values are common in data, and usually require attention
in order to conduct the statistical analysis. One of the first
steps is to explore the structure of the missing values, and how
missingness relates to the other collected variables. This article
describes an \proglang{R} package that provides a graphical user
interface (GUI) designed to help explore the missing data structure
and to examine the results of different imputation methods. The
GUI provides numerical and graphical summaries conditional on
missingness, and includes imputations using fixed values, multiple
imputations and nearest neighbors.
}
\Keywords{missing values, imputation, exploratory data analysis,
statistical graphics, data visualization, graphical user interface}
\Plainkeywords{missing values, imputation, exploratory data analysis,
statistical graphics, visualization, graphical user interface}
%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}
%% The address of (at least) one author should be given
%% in the following format:
\Address{
Xiaoyue Cheng\\
Department of Statistics and Statistical Laboratory\\
Iowa State University\\
2406 Snedecor Hall\\
Ames, IA, 50011, United States of America\\
E-mail: \email{[email protected]}\\
URL: \url{http://xycheng.public.iastate.edu/}\\
\\
Dianne Cook\\
Department of Statistics and Statistical Laboratory\\
Iowa State University\\
2415 Snedecor Hall\\
Ames, IA, 50011, United States of America\\
E-mail: \email{[email protected]}\\
URL: \url{http://dicook.public.iastate.edu/}\\
\\
Heike Hofmann\\
Department of Statistics and Statistical Laboratory\\
Iowa State University\\
2413 Snedecor Hall\\
Ames, IA, 50011, United States of America\\
E-mail: \email{[email protected]}\\
URL: \url{http://hofmann.public.iastate.edu/}\\
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/512/507-7103
%% Fax: +43/512/507-2851
%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}
%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\section{Introduction}\label{introduction}
Missing values are a very common problem affecting data analysis.
Many imputation methods have been developed but little has been
done for exploring the missing value structure visually. Most
plotting methods handle missing values by simply removing the
incomplete records with or without a warning, especially when
the data are continuous. Most statistical functions offer only a
limited set of options for handling missing values, such as deleting
all cases with any missing values, or deleting pairwise or on single
variables only. The issue is that, in order to decide what to do with the missing
values before analyzing the data, we need to understand what the
distribution of the missing values is, and how the missingness
depends on the other collected variables. A few \proglang{R}
packages, like \pkg{Hmisc} \citep{hmisc}, \pkg{norm} \citep{norm},
and \pkg{mice} \citep{mice}, have some routines for summarizing
the number of missings by variable and by case, in preparation
for imputing the missing values. To understand the distribution
of missings versus non-missings it is also important to make plots
of the data.

For model-based imputation methods, it is important to check
assumptions like missing completely at random (MCAR) or missing
at random (MAR). These are not easy to verify. \citet{little1988test}
provided tests of the MCAR assumption, under normality conditions,
and \citet{jaeger2006testing} proposed a test for MAR under some
distributional conditions. Both tests employ inference based on
likelihood ratios, and caution that the tests are sensitive to
model misspecification \citep{little1988test}. Visual exploration
of the missingness can help check the assumptions: it cannot prove
that any randomness assumption holds, but visual checks can be used to
reject the MCAR assumption, or to suggest which dependencies exist
and should be incorporated into imputation for MAR data.

Some existing work describing visual exploration of missingness,
and implementations, can be found in \citet{unwin1996interactive},
\citet{swayne1998missing}, and \citet{templ2008visualization}.
\proglang{MANET} \citep{unwin1996interactive} applies
interactive methods to missing data. It presents segmented
barcharts of missing versus non-missing values for each variable,
and with its many plot types like histograms, scatterplots, and
mosaic plots, encourages the user to select cases that are missing
on any variable to highlight in other plots. This enables the user
to explore the missing status dependence in the distributions of
the complete cases of other variables. \proglang{XGobi}
\citep{swayne1998xgobi}, which implements the ideas described in
\citet{swayne1998missing}, is similar to \proglang{MANET}, but
focuses on interactive graphics for exploring missing values in
real-valued data. It creates a shadow matrix of the original data
where entries are 0 (complete) or 1 (missing value). This additional
data structure allows the user to explore the multivariate pattern
of missing values, the dependence between missing value status
and complete cases, and compare imputation methods. These ideas
were re-implemented in \proglang{GGobi} \citep{STLBC03}.
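
The shadow matrix can be written compactly. In this sketch the symbols
$x_{ij}$, $s_{ij}$, $n$ and $p$ are notation introduced here for
illustration only: for a dataset with $n$ cases and $p$ variables, the
entry for case $i$ and variable $j$ is
\[
s_{ij} = \left\{ \begin{array}{ll}
1 & \mbox{if } x_{ij} \mbox{ is missing,}\\
0 & \mbox{otherwise,}
\end{array} \right.
\qquad i = 1, \ldots, n, \quad j = 1, \ldots, p.
\]
Each column of $(s_{ij})$ can then be plotted or tabulated against the
observed values of the other variables.
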
In the \proglang{R} community, the package \pkg{VIM} \citep{VIM}
provides a graphical user interface via \pkg{VIMGUI} \citep{VIMGUI}
to explore the structure of missing values and the quality of
several single imputation methods (kNN, hotdeck, irmi). Some
packages for multiple imputation have interfaces for easy manipulation,
for example, \pkg{migui}, \code{AmeliaView()} and \pkg{miP}.
The package \pkg{migui} \citep{migui} is an interface for \pkg{mi} \citep{mi},
which implements multiple imputation via Bayesian models and weakly
informative prior distributions. The function \code{AmeliaView()}
in \pkg{Amelia} \citep{amelia} generates a graphical interface
to its ``EM with bootstrapping'' algorithm. The package
\pkg{miP} \citep{mip} adopts \pkg{VIM} to visualize the imputation
results from packages \pkg{mice}, \pkg{mi}, and \pkg{Amelia}.

This current work describes a new package for \proglang{R},
\pkg{MissingDataGUI}, which allows the exploration of missing
value structure, and comparison of different imputations, using
static graphics and numerical summaries. The GUI makes these
methods accessible for novice users. This work builds on the ideas
developed in \citet{unwin1996interactive} and \citet{swayne1998missing}.
The package utilizes routines in \pkg{Hmisc}, \pkg{norm}, \pkg{mice},
and \pkg{mi} for multiple imputation, and provides several other
routines including kNN, random sampling and fixed values for the
single imputation. Section \ref{Functionality} explains the GUI
design, functionality and rationale. Section \ref{Examples} gives
a usage example.
\begin{center}
\begin{figure}[h]
\begin{centering}
\includegraphics[width=.9\textwidth]{graph/fig1-GUI-1-tab1}
\par\end{centering}
\caption{Overview of the missing data GUI. Region 1 contains the
list of variables, variable type, and summary of missings on that
variable. Region 2 has a list of the categorical variables that can
be used for conditioning plots and imputations. Region 3 has a
panel for choosing which type of missingness determines the coloring
in the plots. Region 4 contains a radio button selection
of imputation methods. Region 5 has several plot type selections,
and region 6 allows selecting numeric or graphical summaries and
some output routines. The summaries are displayed in region 7.}
\label{fig:missingGUI}
\end{figure}
\par\end{center}
\section{Functionality}\label{Functionality}
\subsection{Overview of the missing data GUI}
The appearance of the missing data GUI is shown in
Figure~\ref{fig:missingGUI}. (Section \ref{Examples} describes
the dataset.) All variables in the data along with the variable
type and the percentages of \code{NA}'s are listed on the top left
(region 1). The categorical variables (factor, ordinal factor, and
character), auto-detected by their type, are shown on the bottom
left as the potential conditioning variables (region 2). The variables
having missing values are displayed under ``Color.by.the.missing.of''
on the top center (region 3). The graphical summary will distinguish
the imputations from the observations by two colors, yellow (missing)
versus blue (non-missing). This panel is used to choose what missing
structure to color. Selecting the first row ``Missing Any Variables''
means that the color will depend on whether the case has missing
values in any variables. The second row ``Missing on Selected Variables''
means the graph is colored by whether the case has missings in the
selected variable. ``Method'' (region 4) and ``Graph Type'' (region 5)
are two widgets described in Sections \ref{imputation} and
\ref{plottype}. On the top right (region 6) there are five buttons:
``Summary'' opens a window as described in Section \ref{numsum};
``Plot'' produces the plots in the graphics panel on the bottom
right of the GUI (region 7); ``Export data'' saves the imputed data
into a file or to an \proglang{R} data frame; ``Save plot'' saves
the plots in region 7 to PNG files; ``Quit'' destroys the main
GUI window and the derived child windows.
\subsection{Summary of missing values}\label{numsum}
\subsubsection{Numerical summaries}
To investigate missingness in a data set, start by examining the
numerical summaries of the missings. The ``Summary'' button will
open a window with the overall missingness information
(Figure~\ref{fig: num-summry} left panel) or conditional summary
(Figure~\ref{fig: num-summry} right panel), depending on whether
conditioning variables are chosen. Both summary windows present
the percent of the values that are missing, the percent of variables
that contain missing values, the percent of the cases that have at
least one missing value, along with a tabulation of the number of
values missing per case. The style of the table follows the summary
provided by the package \pkg{norm}. In Figure~\ref{fig: num-summry}
(left) it can be seen that two observations have 3
missing values, another two have 2 missing values, 167 observations
have one missing value and 565 are complete. By percentages, 76.8\%
of the cases have no missings. Figure~\ref{fig: num-summry} (right)
is conditioned on the variable ``year'', which produces two boxes,
for 1993 and 1997 respectively. We can see that there are fewer
missing values in 1997 than in 1993, and all the observations with
more than one missing value occurred in 1993.
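
The three percentages reported in the summary window can be written
explicitly. As a sketch, with notation introduced here for illustration
only, let $s_{ij}=1$ if the value of variable $j$ is missing for case $i$
and $s_{ij}=0$ otherwise, for $n$ cases and $p$ variables, and let
$I(\cdot)$ denote the indicator function. Then
\begin{eqnarray*}
\mbox{\% of values missing} & = & \frac{100}{np}\sum_{i=1}^{n}\sum_{j=1}^{p} s_{ij},\\
\mbox{\% of variables with missings} & = & \frac{100}{p}\sum_{j=1}^{p} I\Bigl(\sum_{i=1}^{n} s_{ij} > 0\Bigr),\\
\mbox{\% of cases with missings} & = & \frac{100}{n}\sum_{i=1}^{n} I\Bigl(\sum_{j=1}^{p} s_{ij} > 0\Bigr).
\end{eqnarray*}
For the counts quoted above, $n = 2 + 2 + 167 + 565 = 736$, so the
percentage of complete cases is $100 \times 565/736 \approx 76.8$, and
the percentage of cases with at least one missing value is
$100 \times 171/736 \approx 23.2$.
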
\begin{center}
\begin{figure}[h]
\begin{centering}
\includegraphics[width=0.33\textwidth]{graph/fig2-summary-1}
\includegraphics[width=0.66\textwidth]{graph/fig2-summary-2-condition}
\par\end{centering}
\caption{A numerical summary of missing values in the data is
shown in a pop-up window. The left panel is the overall summary.
The right panel shows the summary conditioned on ``year''. The
percentages of missings by total number of data values, by
variables and by cases, are shown on the top. This dataset has
8 variables and the missing values by variable are summarized
in the bottom table. No cases have more than 3 missing values,
76.8\% of cases are complete, 22.7\% of cases have one missing
value, and only 4 cases have more than one missing value. The
right panel shows that the missing pattern is different for each year.}
\label{fig: num-summry}
\end{figure}
\par\end{center}
\subsubsection{Missingness map}
The missingness map (Figure~\ref{fig:missingmap}) provides a
graphical summary of the missing patterns. Like the shadow
matrix used in \proglang{GGobi}, the missingness map shows
the position of missing values relative to variables and cases.
The \proglang{R} packages \pkg{Amelia} \citep{amelia} and
\pkg{VIM} \citep{VIM} have versions of missingness maps.
Organizing the missing values into blocks can be achieved by
re-ordering variables and case ids, making it easier to see
missing patterns, especially for large data. Two re-ordered
missingness maps are shown in Figure~\ref{fig:missingmap}.
One arranges the variables and cases by the number of missings,
from the largest to the smallest; the other applies hierarchical
clustering to both rows and columns. The strength of the missingness
map is that it reveals whether missings occur on several variables
simultaneously. If so, the shared missing pattern may
indicate some association between the variables. If the missings
occur on several observations simultaneously, this suggests
dependence between those observations.

Figure~\ref{fig:missingmap} displays 245 observations and
34 variables for the dataset \code{brfss} (described in
Section~\ref{Examples}). From the missingness maps we can see
that most of the missings occurred in seven variables. The
missingness on some variables occurs synchronously, indicating
association. Users of the data should check the data collection
procedures for these variables. For example, in this data,
questions about the drinking time and amount (ALCDAY4 and
AVEDRNK2, the top two variables in the right panel) were both
skipped when the subject answered a previous question with
``did not drink in the past 30 days''.
\begin{center}
\begin{figure}[h]
\begin{centering}
\includegraphics[width=0.3\textwidth]{graph/fig5-3-missingmap-1}
\includegraphics[width=0.3\textwidth]{graph/fig5-3-missingmap-2}
\includegraphics[width=0.345\textwidth]{graph/fig5-3-missingmap-3}
\par\end{centering}
\caption{Missingness maps, same data but different ordering
of variables (rows) and cases (columns): (left) raw data order,
(middle) variables and cases sorted by decreasing number of
missing values, (right) sorted by hierarchical clustering of
missingness. From the raw data missingness map, the horizontal
stripes indicate several variables have many missings, and the
vertical stripes near the bottom indicate some structural
missing cases. When variables and cases are sorted by
missingness rate (middle), variables with missings often have
missings on the same cases, and the additional few sporadic
missing values can be easily spotted. Using the clustered
missingness map (right) the blocks of missings on variables
and cases are more easily seen.}
\label{fig:missingmap}
\end{figure}
\par\end{center}
\subsection{Imputation}\label{imputation}
A number of imputation methods are available in the package.
The purpose is two-fold: to enable exploring dependence between
missings or non-missings, and also to produce a complete data
set for later analysis. A few criteria were considered in the
choices of methods to make available and the design: (1) easy
to understand and implement; (2) computing complexity is medium
or low; (3) adaptability to different situations, i.e., no strong
model assumptions. Not all of the imputation methods available
in \proglang{R} are available in the package because (1) there
are too many methods and variations, so it is not practical to
include all, and (2) users may use their own method and import
the result into the missing data GUI for exploration.

The seven imputation methods provided are: ``Below 10\%'',
``Simple'', ``Neighbor'', ``MI:areg'', ``MI:norm'', ``MI:mice'', ``MI:mi''.
``Simple'' and ``Neighbor'' contain more than one method.
Some methods (e.g., ``Below 10\%'') are only suitable for
exploring the missingness patterns, and are not suitable
to use for producing a complete data set for analysis. Three
tab labels interface to the three methods provided by ``Simple'':
overall median, overall mean, and random value (Figure~\ref{fig:missingGUI},
region 7). ``Neighbor'' interfaces to two methods: mean of
the nearest neighbors, and random nearest neighbor. The neighbor
methods also allow the user to change the number of neighbors.
Table~\ref{tab:compare-methods} summarizes and compares the
imputation methods available in the GUI.
\begin{center}
\begin{table}[h]
\begin{centering}
\begin{tabular}{l|l|c|c|c}
\hline
\textbf{\scriptsize{Method}} & \textbf{\scriptsize{Description}} & \textbf{\scriptsize{Deterministic}} & \textbf{\scriptsize{Univariate}} & \textbf{\scriptsize{Multiple imp.}}\tabularnewline
\hline
{\scriptsize{Below 10\%}} & {\scriptsize{below 10\% of the range}} & {\scriptsize{x}} & {\scriptsize{x}} & \tabularnewline
\hline
& {\scriptsize{overall median}} & {\scriptsize{x}} & {\scriptsize{x}} & \tabularnewline
{\scriptsize{Simple}} & {\scriptsize{overall mean}} & {\scriptsize{x}} & {\scriptsize{x}} & \tabularnewline
& {\scriptsize{random value}} & & {\scriptsize{x}} & \tabularnewline
\hline
{\scriptsize{Neighbor}} & {\scriptsize{mean of the nearest neighbors}} & {\scriptsize{x}} & & \tabularnewline
& {\scriptsize{random nearest neighbor}} & & & \tabularnewline
\hline
{\scriptsize{MI:areg}} & {\scriptsize{predictive mean matching}} & & & {\scriptsize{x}}\tabularnewline
\hline
{\scriptsize{MI:norm}} & {\scriptsize{multivariate normal model}} & & & {\scriptsize{x}}\tabularnewline
\hline
{\scriptsize{MI:mice}} & {\scriptsize{multivariate imp. by chained equations}} & & & {\scriptsize{x}}\tabularnewline
\hline
{\scriptsize{MI:mi}} & {\scriptsize{multiple iterative regression imputation}} & & & {\scriptsize{x}}\tabularnewline
\hline
\end{tabular}
\par\end{centering}
\caption{Imputation methods included in the missing data GUI.
Strictly speaking, ``Below 10\%'' is not an imputation method,
but a way to put the missing values in the same graph with the
observations. ``Deterministic'' indicates that the method
has no stochastic component. ``Univariate'' indicates that
the imputation uses only the individual variable where imputation
is needed, rather than other variables as well. ``Multiple
imp.'' indicates that the method is a type of multiple imputation
that provides multiple samples to impute the missings.}
\label{tab:compare-methods}
\end{table}
\par\end{center}
\subsubsection{Univariate imputations}
The simplest start involves setting the missing values to 10\%
below the minimum on each variable. The purpose of this is to
place the missing values into the plot where they can be
distinguished from the non-missing values. In a scatterplot,
all missing values will lie along a vertical line on the left
or a horizontal line on the bottom of the display
(Figure~\ref{fig:univariate-imputation} (a)). This placement
enables the distribution of missings to be compared with the
distribution of non-missings. In the histogram, missing values
will form a bar to the left of other data values. And in the
parallel coordinates plot, the missing values are at the bottom
of each axis.
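
One way to make the placement rule concrete is to read the ``10\% below
the minimum'' above together with the ``below 10\% of the range'' entry
in Table~\ref{tab:compare-methods}. The exact offset in the following
sketch is an assumption made here for illustration and is not confirmed
by the text; $x_j^{\min}$ and $x_j^{\max}$ denote the smallest and
largest observed values of variable $j$.
\[
x_{ij}^{*} = x_j^{\min} - 0.1\,\bigl(x_j^{\max} - x_j^{\min}\bigr)
\quad \mbox{for every case $i$ with $x_{ij}$ missing.}
\]
Under this reading, a variable observed on $[20, 30]$ would have its
missing values drawn at $20 - 0.1 \times 10 = 19$, just outside the
range of the observed data.
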
Using the median, mean, or mode of the complete cases is a simple
way to impute missing values. The software makes some automatic
choices for the user: if the user selects median but the variable
type is nominal, or selects mean but the variable is categorical,
then the mode is returned. In the graph, points and bars are
colored according to the missing status of the case.
Figure~\ref{fig:univariate-imputation} (b) and (c) show examples
of the imputation by the median and mean for real-valued variables.
The ``random value'' method (Figure~\ref{fig:univariate-imputation}
(d)) randomly selects an existing value of the variable to impute
each missing value. When an observation has more than one missing
value, values are sampled independently from each
related variable.
\begin{center}
\begin{figure}[h]
\begin{centering}
\begin{tabular}{cccc}
{\tiny{(a) Below 10\%}} & & & {\tiny{(b) Overall median}}\tabularnewline
\includegraphics[width=0.31\textwidth]{graph/fig3-1-below10} & & & \includegraphics[width=0.31\textwidth]{graph/fig3-2-median}\tabularnewline
{\tiny{(c) Overall mean}} & & & {\tiny{(d) Random value}}\tabularnewline
\includegraphics[width=0.31\textwidth]{graph/fig3-3-mean} & & & \includegraphics[width=0.31\textwidth]{graph/fig3-4-random}\tabularnewline
\end{tabular}
\par\end{centering}
\caption{Four panels of scatterplots displaying the results of
different univariate imputations: (a) 10\% below the minimum
(not strictly an imputation method, it is used for displaying
missings as part of a plot of complete cases); (b) median of
each variable; (c) mean of each variable; (d) random selection
from the existing values.}
\label{fig:univariate-imputation}
\end{figure}
\par\end{center}
These imputation methods operate separately on each variable.
Dependencies between variables are ignored, yielding covariance
and correlation estimates that are potentially very different
from those of the complete cases. This could be a big problem
for some analyses. These methods are not ideal from a statistical
perspective. In some situations, where the inadequate estimation
of covariance does not affect results and conclusions, they can
provide a simple solution that requires few assumptions, but in most
situations they are not advised. For the application here, we are
primarily concerned with providing methods for analysts to
explore the missing value structure, and the plots reveal quite
clearly why these univariate imputation methods are inadequate.
Figure~\ref{fig:univariate-imputation} shows the ``cross structure''
(orange) induced on the pattern of points by mean and median
imputation, and makes it quite clear that the covariance estimates
for the imputed data would not match those of the complete cases well.
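
The distortion caused by mean imputation can be quantified in a simple
case. As a sketch, with notation introduced here for illustration,
suppose a variable has $n_{\mathrm{obs}}$ observed values with mean
$\bar{x}_{\mathrm{obs}}$ and sample variance $s^2_{\mathrm{obs}}$, and
the remaining $n - n_{\mathrm{obs}}$ values are imputed by
$\bar{x}_{\mathrm{obs}}$. The imputed values leave the mean unchanged
and contribute zero to the sum of squares, so the sample variance of the
completed variable is
\[
s^2_{\mathrm{imp}}
= \frac{1}{n-1}\sum_{i \in \mathrm{obs}} (x_i - \bar{x}_{\mathrm{obs}})^2
= \frac{n_{\mathrm{obs}}-1}{n-1}\, s^2_{\mathrm{obs}}
\le s^2_{\mathrm{obs}}.
\]
By the same argument, every case imputed on this variable contributes
zero to its sample cross-product with any other variable, which is the
algebraic counterpart of the cross structure seen in the plots.
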
\subsubsection{Neighbor imputations}
The ``Neighbor'' methods replace a missing value with the mean of,
or a random selection from, its $k$ nearest complete neighbors
(Figure~\ref{fig:neighbor-imputation}). The distance between two
observations is calculated using Euclidean distance on the
standardized variables that have no missings.
Figure~\ref{fig:neighbor-diagram} illustrates the procedure.
Ties are not considered, and only the first $k$ entries are used.
This method requires at least one case in the dataset to be
complete, and no categorical variables can be used. (Ordinal
variables are treated as integers.) If there are fewer than $k$
complete cases, then all of them are used to generate the mean
or a random value. If none of the cases are complete, then the
mean or a random value of the entire data will be returned. By
default $k=5$, but this is the user's choice.
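
The procedure can be summarized in symbols. The notation in this sketch
is introduced here for illustration: $x_{ij}$ is the missing value of
variable $j$ for case $i$, $C$ is the set of variables with no missings,
$z_{ic}$ is the standardized value of variable $c$ for case $i$, and
$N_k(i)$ is the set of the $k$ complete cases with the smallest distance
to case $i$. Drawing the random neighbor with equal probability is an
assumption; the text only states that the selection is random.
\begin{eqnarray*}
d(i,l) & = & \sqrt{\sum_{c \in C} (z_{ic} - z_{lc})^2},\\
x_{ij}^{\mathrm{mean}} & = & \frac{1}{k}\sum_{l \in N_k(i)} x_{lj},\\
x_{ij}^{\mathrm{random}} & = & x_{l^{*}j}, \quad l^{*} \mbox{ drawn from } N_k(i).
\end{eqnarray*}
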
\begin{center}
\begin{figure}[h]
\begin{centering}
\begin{tabular}{cccc}
{\tiny{(a) Mean of the neighbors}} & & & {\tiny{(b) Random neighbor}}\tabularnewline
\includegraphics[width=0.32\textwidth]{graph/fig3-5-knn} & & & \includegraphics[width=0.32\textwidth]{graph/fig3-5-knn-2}\tabularnewline
\end{tabular}
\par\end{centering}
\caption{Scatterplots for nearest neighbor imputation methods: (a) mean of the 5 nearest neighbors, (b) a random value from the 5 nearest neighbors.}
\label{fig:neighbor-imputation}
\end{figure}
\par\end{center}
%The methods should work well when the set of complete observations
%is large, but when there are few fully observations, the imputation
%may be biased due to too much waste information.
The neighbor methods in \pkg{MissingDataGUI} can be seen as two
special cases of hot deck imputation \citep{andridge2010review}.
The neighbor mean method places equal weight on all chosen neighbors,
and the random neighbor method places all the weight on one arbitrary
neighbor. When $k=1$, the methods are deterministic hot deck.
\begin{center}
\begin{figure}[h]
\begin{centering}
\includegraphics[width=1\textwidth]{graph/fig9-diagram}
\par\end{centering}
\caption{Illustration of the $k$ nearest neighbors imputation method.
The shaded entries are the complete observations to rank. The
variables in red frames are used to compute the distance. After
getting the rank of all complete observations, the first $k$ are
used as neighbors.}
\label{fig:neighbor-diagram}
\end{figure}
\par\end{center}
%\clearpage
\subsubsection{Multiple imputations}
Multiple imputation, first proposed by \citet{rubin1978multiple},
is a method to get valid inferences by simulation. Multiple
imputed datasets are generated based on the joint distribution,
and serve a wide variety of analytical purposes. Functions from
four \proglang{R} packages are utilized to implement multiple
imputations in \pkg{MissingDataGUI}. Figure~\ref{fig:multiple-imputation}
demonstrates the results from different multiple imputations
on the same data.

Among the four packages, \pkg{norm} is quite different from
the other three. The ideas behind the package were introduced
by \citet{schafer1998multiple}. It assumes the observations
are sampled from a multivariate normal distribution, and uses
the EM algorithm to estimate the mean and variance-covariance
matrix. It utilizes a data augmentation method to converge in
distribution.

The other packages use a chained equation approach with similar
steps but different settings. A comparison between the three
packages is given in Table~\ref{tab:compare-mi}, based on
\citet{hmisc}, \citet{mice}, and \citet{mi}. The main differences
are that \pkg{Hmisc} provides three models with flexible drawing
methods around the predicted values for quantitative variables,
and applies bootstrap to obtain a sample for every iteration.
The package \pkg{mi} uses a convergence criterion to stop the
iteration with some allowance for special situations. In between
these two is \pkg{mice}: the models provided are more flexible
than those of \pkg{Hmisc}, but not as Bayesian as those of \pkg{mi}.
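
The cycle shared by the three chained equation packages can be sketched
generically. The notation below is introduced here for illustration and
does not correspond to the interface of any of the packages: $x_j$ is
the $j$-th of the $q$ variables with missings, $x_{-j}$ denotes all
other variables with their current imputed values, and
$\hat{\theta}_j^{(t)}$ are the parameters of the model chosen for $x_j$,
fitted in cycle $t$ on the data described in step 3 of
Table~\ref{tab:compare-mi}. In cycle $t$, for $j = 1, \ldots, q$ in turn,
\[
x_j^{\mathrm{mis},(t)} \;\sim\;
p\bigl(x_j \mid x_{-j}, \hat{\theta}_j^{(t)}\bigr),
\]
after which the newly imputed $x_j$ is used when the next variable is
imputed. How the draw is made around the fitted values, and when the
cycles stop, is where the packages differ.
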
By default, $m=3$ chains are imputed and users can choose the
number of chains. Each chain will produce a result shown in a
separate graphical panel. By switching between the panels, the
user can compare the results and observe discrepancies between
the results. Figure~\ref{fig:chaintabs} shows the results of
four different chains produced by \pkg{mice}. Three of the four
produced results where a small clump of imputed values occurred.
\begin{center}
\begin{figure}[!h]
\begin{centering}
\begin{tabular}{cccc}
{\tiny{(a) \pkg{Hmisc}: predictive mean matching}} & & & {\tiny{(b) \pkg{norm}: multivariate normal model}}\tabularnewline
\includegraphics[width=0.32\textwidth]{graph/fig3-6-areg-2} & & & \includegraphics[width=0.32\textwidth]{graph/fig3-7-norm-2}\tabularnewline
{\tiny{(c) \pkg{mice}: chained equations}} & & & {\tiny{(d) \pkg{mi}: iterative regression}}\tabularnewline
\includegraphics[width=0.32\textwidth]{graph/fig3-8-mice-2} & & & \includegraphics[width=0.32\textwidth]{graph/fig3-9-mi-2}\tabularnewline
\end{tabular}
\par\end{centering}
\caption{Scatterplots for the multiple imputations from four
\proglang{R} packages: (a) predictive mean matching by \pkg{Hmisc};
(b) multivariate normal model by \pkg{norm}; (c) multivariate
imputation using chained equations by \pkg{mice}; (d) multiple
iterative regression imputation by \pkg{mi}. All four
imputations are conditioned on year.}
\label{fig:multiple-imputation}
\end{figure}
\par\end{center}
\begin{center}
\begin{table}[!h]
\begin{centering}
\begin{tabular}{l|c|c|c}
\hline
\textbf{\scriptsize{Algorithm steps}} & \textbf{\scriptsize{Hmisc}} & \textbf{\scriptsize{mice}} & \textbf{\scriptsize{mi}}\tabularnewline
\hline
\textbf{\scriptsize{1. Fill in the missing}} & \multicolumn{3}{c}{{\scriptsize{at random}}}\tabularnewline
\hline
\textbf{\scriptsize{2. Specify the model}} & {\scriptsize{pmm/regression/normpmm}} & \multicolumn{2}{c}{{\scriptsize{selectable model or user-specified model}}}\tabularnewline
\hline
\textbf{\scriptsize{~~~~(Default model)}} & \multicolumn{2}{c|}{{\scriptsize{predictive mean matching}}} & {\scriptsize{Bayesian generalized linear models}}\tabularnewline
\hline
\textbf{\scriptsize{3. Decide the data}} & {\scriptsize{a bootstrap sample}} & \multicolumn{2}{c}{{\scriptsize{the entire dataset with the current imputed values}}}\tabularnewline
\hline
\textbf{\scriptsize{4. Iterate imputation}} & \multicolumn{3}{c}{{\scriptsize{in every cycle, variables with missings are imputed sequentially}}}\tabularnewline
\hline
\textbf{\scriptsize{5. Stop when}} & \multicolumn{2}{c|}{{\scriptsize{achieving the max \# of iterations}}} & {\scriptsize{difference of within and between variance is small}}\tabularnewline
\hline
\end{tabular}
\par\end{centering}
\caption{Comparison of the algorithm steps among three multiple
imputation packages that use the chained equation approach.}
\label{tab:compare-mi}
\end{table}
\par\end{center}
\begin{center}
\begin{figure}[!h]
\begin{centering}
\begin{tabular}{cccc}
\includegraphics[width=0.4\textwidth]{graph/fig10-1-chain} & & & \includegraphics[width=0.4\textwidth]{graph/fig10-2-chain}\tabularnewline
%& & & \tabularnewline
\includegraphics[width=0.4\textwidth]{graph/fig10-3-chain} & & & \includegraphics[width=0.4\textwidth]{graph/fig10-4-chain}\tabularnewline
\end{tabular}
\par\end{centering}
\caption{Results of four imputing chains by \pkg{mice}, starting
with the default random seed. Users can switch the panels by
clicking the tabs, or close a panel by hitting the `x' sign.
Focusing on the imputed values where air temperature is around
22 degrees, we see that the first, third and fourth chains cluster
values in a small range of the y-axis, but the second chain spreads
them very evenly in the y-direction.}
\label{fig:chaintabs}
\end{figure}
\par\end{center}
\subsubsection{Conditional on the categorical variables}
When the variables of interest have bimodal or multi-modal distributions,
using measures of center like the mean or median for imputation,
or simulating from an overall estimate as \pkg{norm} does, is
inadequate because the center does not reflect the shape of the
distribution properly. In many situations, the multiple modes arise
from a mixture of groups. Hence, a better imputation method is
to condition on group, and then calculate the statistics within each group.
This is available using the controls ``categorical variables to
condition on''. All categorical variables are listed with checkboxes.
The variables checked will partition the data into blocks and
then the imputation method is implemented in each block of the
data. However, the condition is not used when the method is
``Below 10\%'', since the aim of ``Below 10\%'' is simply to
display the missings away from the non-missings. If the conditioning
factor variable has missing values, then a ``factor = \code{NA}''
group will be generated to calculate the numeric summary or the
imputed values. If the conditioning factor itself is one of the
plotting variables, then a message box will emerge to ask the user
to impute the missing values on the factor before other variables,
and the plots are created without the condition.
The importance of conditioning in the imputation is illustrated
in Figure~\ref{fig: condition}. Without the condition, the imputed
values do not match the distribution of complete values
(Figure~\ref{fig: condition} left). Calculating separately by group
provides a better result (Figure~\ref{fig: condition} right).
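The contrast between unconditional and conditional median imputation
can be sketched in a few lines of base \proglang{R}. This is only a
minimal illustration on made-up values, not the code used inside
\pkg{MissingDataGUI}; the data frame \code{d} and its columns are
hypothetical.
\begin{Code}
# Toy data with two well-separated groups (hypothetical values).
d <- data.frame(
  year  = factor(c(1993, 1993, 1993, 1993, 1997, 1997, 1997, 1997)),
  value = c(10, 11, NA, 12, 30, NA, 31, 32)
)

# Unconditional: every missing value receives the overall median.
overall <- d$value
overall[is.na(overall)] <- median(d$value, na.rm = TRUE)

# Conditional: each missing value receives the median of its own group.
by_group <- ave(d$value, d$year, FUN = function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
})
\end{Code}
In this toy example the unconditional fill places both imputed values
between the two groups, whereas the conditional fill places each one
inside its own group, which is the pattern visible in
Figure~\ref{fig: condition}.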
\begin{center}
\begin{figure}[h]
\begin{centering}
\includegraphics[width=.48\textwidth]{graph/fig4-1-median-uncondition}
\includegraphics[width=.48\textwidth]{graph/fig4-2-median-condition}
\par\end{centering}
\caption{Effect of conditioning on imputed values. The left panel
is the imputation by median without condition and the right one
is conditioned on year. In the left plot we can see that the
imputed values (yellow) fall between the two clusters, at the
overall median. But when the imputation is conditioned on year
(right plot), the imputed values are now better placed into the
two clusters in the data.}
\label{fig: condition}
\end{figure}
\par\end{center}
\subsection{Plot types}\label{plottype}
There are four types of graphs available in \pkg{MissingDataGUI}:
histogram/barchart, spinogram/spineplot, pairwise plots, and
parallel coordinates plot. Figure~\ref{fig:graphtypes} displays
all the graph types. Two color-blind friendly colors distinguish
the observed and imputed values on any chosen variable. In
Figure~\ref{fig:graphtypes}, yellow indicates that
the value was originally missing on humidity.
Separate histograms (continuous variables) and barcharts
(categorical variables) are shown for each of the variables
selected. When the missing values and the complete values share
one bar, the bar is cut into two parts, and the ratio of the
two heights is equal to the ratio of missing and non-missing
values in that bar.
The spinogram (continuous variable) and spineplot (categorical variable),
introduced by \citet{hummel1996linked} and \citet{theus1999visualizing},
use the width of the rectangle to represent count. Height is the
same for all bars, so the focus is on the proportion within each group.
The bars in the spinogram or spineplot are partitioned into two
colors for the missing and non-missing values.
A scatterplot matrix is used to display pairs of variables.
Variable names and scales are placed on the diagonal. For the
continuous variables, the pairwise scatterplots are placed in
the lower triangle, and the contour plots are shown in the upper
triangle. For the categorical variables, barcharts are displayed
in both upper and lower triangles. Bars are colored in proportion
to the missings. The combination of continuous and categorical
variables is displayed as side-by-side boxplots of missing and
non-missing values for each category on the upper triangle and
side-by-side histograms on the lower triangle. Limited space
available to the graphics device limits the number of variables
that can be shown. The upper limit of the number of variables
is set to be 5 and the lower limit is 2.
The parallel coordinates plot by \citet{inselberg1985plane} and
\citet{wegman1990hyperdimensional} can be used to display high-dimensional
data. Though many plot types, like the scatterplot or histogram,
are helpful for revealing the missingness pattern, they are not convenient
for displaying many variables simultaneously. The parallel coordinates
plot can give an overview of a relatively large number of
variables. In \pkg{MissingDataGUI}, the order of the variables
can be chosen in one of two ways: the original order in the data,
or by sorting the variables from the best separator to the worst
of missing values by the $F$-statistic from ANOVA. In
Figure~\ref{fig:graphtypes}, the best separating variable for
the missingness of humidity is humidity itself, because the
``below 10\%'' method makes a big gap between the missing and
non-missing values. ``Below 10\%'' is not an ideal method for
the ordered parallel coordinates plot. However, the plot is
still useful: it reveals that the missingness on humidity
occurred in one year and one location, when sea.surface.temp
and air.temp were low.
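The ordering by $F$-statistic can be illustrated with a short base
\proglang{R} sketch. It assumes a one-way ANOVA of each variable
against an indicator of missingness on the variable of interest; the
toy data and object names are hypothetical, and the package's own
implementation may differ in detail (for instance, it also ranks the
imputed variable itself).
\begin{Code}
# Toy data (hypothetical values): 'humidity' is missing where 'air'
# is low, while 'wind' is unrelated to the missingness.
d <- data.frame(
  air      = c(22, 23, 22.5, 23.5, 27, 28, 27.5, 28.5),
  wind     = c(1, 5, 3, 7, 2, 6, 4, 8),
  humidity = c(NA, NA, NA, NA, 80, 82, 81, 83)
)

# Indicator of missingness on the variable of interest.
miss <- factor(is.na(d$humidity))

# One-way ANOVA F-statistic of each candidate variable on the indicator.
f_stat <- sapply(d[c("air", "wind")], function(v) {
  anova(lm(v ~ miss))[["F value"]][1]
})

# Variables ordered from the best to the worst separator.
ordered_vars <- names(sort(f_stat, decreasing = TRUE))
\end{Code}
Here \code{air} receives a much larger $F$-statistic than \code{wind},
so it would be placed first on the ordered axis.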
\begin{center}
\begin{figure}[h]
\begin{centering}
\includegraphics[width=.24\textwidth]{graph/fig5-1-barchart}
\includegraphics[width=.24\textwidth]{graph/fig5-1-histogram}
\includegraphics[width=.24\textwidth]{graph/fig5-1-spineplot-1}
\includegraphics[width=.24\textwidth]{graph/fig5-1-spinogram-1}
\includegraphics[width=.48\textwidth]{graph/fig5-2-pairwise}
\includegraphics[width=.48\textwidth]{graph/fig5-2-pcp}
\par\end{centering}
\caption{The four types of graphs available: (top, from left
to right) barchart, histogram, spineplot, and spinogram, and
(bottom, left to right) pairwise plots, and two parallel
coordinates plots. The order of variables in the parallel
coordinates plot changes from the original (upper plot) to
being ordered by difference between missings and non-missings.
All the plots use ``below 10\%'' imputation and are colored
by the missingness on humidity.}
\label{fig:graphtypes}
\end{figure}
\par\end{center}
\subsection{Design issues}
%\subsection{Maps to items in GUIs}
The missing data GUI is organized as one window with three tabs.
As shown in Figure~\ref{fig:missingGUI}, the summary tab
includes all the important widgets: the list of variables, radio
buttons for imputation methods, checkboxes for the conditioning variables,
the graphics device, etc. An appropriate layout makes the widgets
less crowded, and is easy to maintain. The other two tabs are
not as critical as the main tab, but also play important roles.
The help tab shown in Figure~\ref{fig: missingGUI-tabs} (left)
has the same layout as the summary tab. The only difference is
that the graphics device is replaced by the help document. The
corresponding help shows up when the user moves the mouse over
a widget.
The Settings tab shown in Figure~\ref{fig: missingGUI-tabs}
(right) allows the user to choose options for the imputation
methods in the package \pkg{mice}, as well as other settings
for the multiple imputation, neighbor selection, and the
display of parallel coordinates plot. To change the imputation
models, users can double click a variable in the left table,
and select any method provided in the pop-up window. The
choices vary depending on the type of the variable.
%\begin{center}
\begin{figure}[!h]
\begin{centering}
\includegraphics[width=0.49\textwidth]{graph/fig1-GUI-tab2}
\includegraphics[width=0.49\textwidth]{graph/fig1-GUI-tab3}
\par\end{centering}
\caption{Subsidiary GUI tabs: (left) help tab, (right) settings
tab. The layout of the help tab mirrors the actual functional
GUI. Mousing over any part of it or clicking the radio/checkbox
items will pop up text explanations in the summary region. All
the widgets have a detailed introduction. The settings tab is
used to make changes to the variable types and algorithm options.
Users can modify the number of imputed sets to generate, the
random number seed, the number of neighbors, and the jitter
setting for parallel coordinates plot.}
\label{fig: missingGUI-tabs}
\end{figure}
%\par\end{center}
\subsection{Data input and output}
Data can be entered as either a data frame or a comma separated
file (csv). The preferred approach is to pass an existing data
frame in \proglang{R} because the types of the variables (e.g.,
factor, numeric) are preserved. \code{MissingDataGUI(data)}
is used to achieve this.
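The reason a data frame is preferred can be seen with a small round
trip through a csv file. The following base \proglang{R} sketch uses a
hypothetical data frame \code{d} and a temporary file; it is meant only
to show how type information can be lost, not to reproduce the import
code of the package.
\begin{Code}
# A small data frame in which 'year' is deliberately stored as a factor.
d <- data.frame(
  year     = factor(c(1993, 1993, 1997)),
  humidity = c(80.1, NA, 85.3)
)

# Round trip through a csv file (temporary file, for illustration only).
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)
back <- read.csv(path)

is.factor(d$year)     # TRUE: the data frame keeps the factor
is.factor(back$year)  # FALSE: the csv file stores plain values only

# With the package installed and a display available, the data frame
# could then be passed on directly:
# MissingDataGUI(d)
\end{Code}
After reading from csv, such a variable would have to be converted back
to a categorical class before it could serve as a conditioning variable.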
If reading from a csv file, \code{MissingDataGUI()} will trigger
the data import GUI (Figure~\ref{fig: import}), from which to
select a file. The ``Open'' button is for choosing files and
the ``Watch Missing Values'' button will launch the missing
data GUI. The file format must be csv, and only one data set
can be imported into the missing data GUI at a time, although
several files can be opened in the data import GUI.
Once values are imputed, and a complete data set created, it can
be saved using the ``Export data'' button (Figure~\ref{fig: export}).
Only the selected variables will be imputed, but users can
choose whether to export the selected columns or all the columns
(with \code{NA}'s remaining in the unselected variables). The shadow
matrix is exported by default, so that analysts can always track
back to find the locations of the real missings. Data can be saved
in three ways: a csv file, an rda file, or a data frame. The
multiple imputed sets from several chains will be saved as a list
in rda format or data frame, or in separate csv files.
The exported data with its shadow matrix can be loaded back
into the GUI, which means that data imputed by other
methods (not provided by the missing data GUI)
can also be imported. Users only need to provide a shadow
matrix which indicates the locations of missings. In other
words, the imported structure should be a data frame or a csv
file with the first $n$ columns being the imputed data and
the next $n$ columns being the shadow matrix.
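As a rough illustration of this layout, the following base
\proglang{R} sketch assembles a data frame with $n$ imputed columns
followed by $n$ shadow columns. The column names, the logical coding of
the shadow matrix, and the median fill standing in for an external
imputation method are all assumptions made for the example; the format
actually written by the ``Export data'' button should be checked
against an exported file.
\begin{Code}
# Original data with missing values (hypothetical toy values).
orig <- data.frame(
  air.temp = c(22.1, NA, 27.4),
  humidity = c(NA, 81.0, 84.5)
)

# Shadow matrix: TRUE where the original value was missing.
shadow <- as.data.frame(is.na(orig))
names(shadow) <- paste0("shadow_", names(orig))  # hypothetical naming

# Imputed data from any external method; column medians stand in here.
imputed <- orig
for (v in names(imputed)) {
  imputed[[v]][is.na(imputed[[v]])] <- median(imputed[[v]], na.rm = TRUE)
}

# First n columns: imputed data; next n columns: shadow matrix.
combined <- cbind(imputed, shadow)
\end{Code}
The shadow columns make it possible to recover which entries of the
first $n$ columns were imputed rather than observed.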
\begin{center}
\begin{figure}[!h]
\begin{centering}
\includegraphics[width=0.8\textwidth]{graph/fig6-open}
\par\end{centering}
\caption{The data import GUI, with file selector, which pops
up upon clicking the ``Open'' button. More than one file
can be listed in the GUI, but only one data set can be
active in the missing data GUI. The first file is automatically
imported if none of the data sets is chosen when the
``Watch Missing Values'' button is hit.}
\label{fig: import}
\end{figure}
\par\end{center}
\begin{center}
\begin{figure}[!h]
\begin{centering}
\includegraphics[width=0.6\textwidth]{graph/fig7-export}
\par\end{centering}
\caption{The data export GUI. By default, all columns are
exported with a shadow matrix. The current working directory
is set to be the location for the exported files. Three
exporting formats are provided.}
\label{fig: export}
\end{figure}
\par\end{center}
\subsection{Additional features of the GUI}
\begin{itemize}
\item Change the variable attributes. Double clicking on any
variable in the top left table of the summary tab will open
an attribute window, as displayed in Figure~\ref{fig: attributes}.
Users can edit the variable name, or assign another class
to the variable. When the class of a variable is switched
from numeric/integer to character/factor/ordinal, the variable
will be automatically loaded into the checkbox group as the
potential conditioning variable.
\item Search for a variable by typing. The variable table,
conditioning checkboxes, and color-by-variable selector allow
text entry to find a variable. This feature is especially
useful when there are many variables in the data.
\item Save the plots. Plots can be saved as png files
using the ``Save plot'' button. The imputation method and plot type
will be auto-completed in the file name.
\end{itemize}
\begin{center}
\begin{figure}[h]
\begin{centering}
\includegraphics[width=0.6\textwidth]{graph/fig8-query}
\par\end{centering}
\caption{The attributes list for variable selection is interactive.
The name can be edited, and the class can be changed to one of
the five classes: integer, numeric, character, factor, or ordinal
(factor). When a numeric variable is changed to a categorical
variable, the widget for conditions will be updated.}
\label{fig: attributes}
\end{figure}
\par\end{center}
\section{Example}\label{Examples}
\subsection{Data}
Two data sets are provided with the package: \code{tao}, which
is used as the example in this section, and \code{brfss}. The
\code{brfss} data is a subset of the 2009 survey from the
Behavioral Risk Factor Surveillance System, an ongoing data
collection program designed to measure behavioral risk factors
for the US adult population (18 years of age or older). The
website for this program is \url{http://www.cdc.gov/BRFSS/index.htm}.
The data \code{tao} is from the Tropical Atmosphere Ocean
project (TAO) \citep{tao}. The TAO array consists of approximately
70 moorings in the Tropical Pacific Ocean, telemetering oceanographic
and meteorological data to shore in real-time via the Argos
satellite system. A subset of data from 6 moorings in 1993
and 1997 is used for the example. The data has 8 variables
(year, latitude, longitude, sea surface temperature, air temperature,
humidity, uwind and vwind) and 736 observations. The numeric
summary of the 8 variables is shown in Figure~\ref{fig: num-summry}.
This subset is provided by \citet{CS07}. We can open the GUI
with the following commands:
\begin{Code}
library("MissingDataGUI")
MissingDataGUI(tao)
\end{Code}
\subsection{Exploring missings}
Three of the 8 variables have missing values. First, let's look
at the distribution of missings on these variables.
Figure~\ref{fig:tao1} (left) shows the pairwise plots of three
variables (sea.surface.temp, air.temp, and humidity) with
missing values on any of the three variables colored in yellow,
and shown as 10\% below the minimum data value. Cases which are
missing on humidity (string of points at bottom of bottom row
of plots) have low values of sea and air temperature. This
suggests a dependence between humidity missingness and the
temperature variables. Imputation methods that incorporate
this dependence may be preferable.
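The ``below 10\%'' placement used in this display can be sketched in
base \proglang{R}. The sketch assumes that missing values are drawn at
the observed minimum minus 10\% of the observed range; that exact
offset is an assumption for illustration and may differ from what the
package computes.
\begin{Code}
# Toy variable with two missing values (hypothetical values).
x <- c(20, 24, NA, 30, NA)

# Assumed placement: minimum minus 10 percent of the observed range.
rng   <- range(x, na.rm = TRUE)
below <- rng[1] - 0.1 * diff(rng)

# Values used for plotting: missings are moved below all observed values.
shown <- ifelse(is.na(x), below, x)
\end{Code}
Because every substituted value lies below the observed minimum, the
missings form a separate string of points along the edge of the plot,
as in Figure~\ref{fig:tao1} (left).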
Figure~\ref{fig:tao1} (right) shows the data imputed with
median values. This imputation imposes a cross structure
on the data, which does not match the shape of the complete
cases. This would not be a recommended method for creating
a complete data set.
\begin{figure*}[htp]
\centerline{\includegraphics[width=0.49\textwidth]{graph/fig4-3-below10-uncondition}
\includegraphics[width=0.49\textwidth]{graph/fig4-1-median-uncondition}}
\caption{(Left) Exploring the effect of missingness (yellow) on
humidity, sea and air temperature. Missings on humidity (the
bottom line of the third row) occur at the lower temperature
values, suggesting a dependence relationship. Missing values
are not missing completely at random. (Right) Imputation using
the medians. Median imputation introduces a cross structure to
the point scatter, and the imputed values don't match the data well.}
\label{fig:tao1}
\end{figure*}
Figure~\ref{fig:tao3} (left) shows the data imputed with
median values conditional on year. This better matches the
distribution of complete cases, although the imputed values
still form bands in the scatterplot. This might be a problem
because repeating a single value shrinks the spread of the completed
data, so variance estimates will be biased downward.
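The effect of a fixed fill value on the variance can be checked on a
tiny example in base \proglang{R}. The numbers below are made up purely
for illustration.
\begin{Code}
# Toy variable with two missing values (hypothetical values).
x <- c(1, 2, 3, 4, 5, NA, NA)

# Sample variance of the observed values only.
var_observed <- var(x, na.rm = TRUE)

# Median imputation: both missings receive the same value.
filled <- x
filled[is.na(filled)] <- median(x, na.rm = TRUE)
var_filled <- var(filled)
\end{Code}
In this example the variance drops from $2.5$ for the observed values
to about $1.67$ after imputation, because the two imputed points add
observations without adding any spread.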
For this data, the better ways to impute the data would take
the strong association between the variables into account.
This suggests that neighbor or multiple imputation might be
the more desirable imputation methods. Figure~\ref{fig:tao3}
(right) shows the results for MI:areg, the regression-based
imputation, conditional on year. The imputed values match
the distribution of complete cases reasonably well. There
are a few slight concerns: some of the imputed values have
lower air temperature values than any of the complete cases,
and the spread of the imputed values is a little greater than
that of the complete cases. But overall, this is probably as good
as it is going to get with imputing the missings for this
data set. It would be reasonable to export the imputed data
for further analysis at this point.
\begin{figure*}[htp]
\centerline{\includegraphics[width=0.49\textwidth]{graph/fig4-2-median-condition}\includegraphics[width=0.49\textwidth]{graph/fig4-4-areg-condition}}
\caption{(Left) Imputation using the median, conditional on year.
Imputed values better match the complete cases, with the
exception of the banding due to a fixed median value.
(Right) Imputation using the multiple imputation MI:areg
conditional on year. The distribution of imputed values is
fairly close to the distribution of complete cases.}
\label{fig:tao3}
\end{figure*}
\subsection{Check assumptions}
In the statistical imputation literature, there are three types of
missing data mechanisms: MCAR (missing completely at random), MAR
(missing at random), and MNAR (missing not at random). Many
imputation methods, including multiple imputation, assume MCAR or MAR.
However, MCAR is the most difficult mechanism to substantiate,
because it requires that missingness be independent of the observed
or other missing values. MAR is less strict, because it allows for