Gooogle (Group Regularization for Zero Inflated Count Regression Models)

Introduction

Zero-inflated count data are common in many fields, including health care research and actuarial science. Zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) regression are commonly used to model such outcomes. These mixture models typically include a logistic component, which models the excess zeros beyond those generated by the count component, and a Poisson or negative binomial component, which models the counts. Several methods have been proposed for variable selection in ZIP and ZINB regression models. However, when the predictors have an inherent grouping structure, these individual variable selection approaches are suboptimal. To perform group variable selection in ZIP/ZINB regression models, we extend several commonly used group regularization methods from the linear regression literature. With this formulation, we can achieve bi-level variable selection in both the zero and count submodels of the corresponding zero-inflated models. The tuning parameter(s) of the final model can be chosen by minimizing AIC or BIC (following Huang et al., 2009).

Installation

You can install Gooogle directly from GitHub as follows:

install.packages("devtools")
devtools::install_github("himelmallick/Gooogle")
library(Gooogle)

Basic Usage

gooogle(data=data,yvar=yvar,xvars=xvars,zvars=xvars,group=rep(1,14),dist="poisson",penalty="gBridge")

data: the dataset (a data frame) to be used for the analysis.
yvar: the name of the outcome variable.
xvars: a vector of variable names to be included in the count model.
zvars: a vector of variable names to be included in the zero model.
group: a vector of integers indicating the grouping structure among the predictors.
dist: the distribution of the count model ('poisson' or 'negbin').
penalty: the penalty to be applied for regularization. For group selection, it is one of 'grLasso', 'grMCP', or 'grSCAD'; for bi-level selection, it is 'gBridge'.

For efficiency, if any coefficients are to be included in the model without being penalized, their grouping index should be zero. group is expected to be a vector of consecutive integers.
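
As a minimal sketch of the grouping convention described above, the following base R snippet builds a group vector in which one predictor is left unpenalized (index 0) and the remaining groups are numbered consecutively. The predictor names are hypothetical and serve only as an illustration; nothing in this snippet calls the package itself.

```r
# Hypothetical predictor names (illustration only, not from a real dataset)
xvars <- c("x1a", "x1b", "x1c", "x2a", "x2b", "x2c", "z1", "z2")

# Two triplets form groups 1 and 2, "z1" is left unpenalized (group 0),
# and "z2" is a singleton group with index 3
group <- c(rep(1:2, each = 3), 0, 3)

# One group index per predictor
stopifnot(length(group) == length(xvars))

# The penalized (non-zero) indices should be consecutive integers 1, 2, ..., G
penalized <- sort(unique(group[group > 0]))
stopifnot(all(penalized == seq_along(penalized)))

# Show which predictors fall into which group
split(xvars, group)
```

Checks like these catch a group vector of the wrong length or with gaps in the indices before it is passed to the fitting function.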

The gooogle function will return a list containing the following objects:

coefficients: a list containing the estimates for the count and logistic submodels.
aic: AIC for the fitted model.
bic: BIC for the fitted model.
loglik: Log-likelihood for the fitted model.
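
For orientation, the textbook definitions of AIC and BIC in terms of a log-likelihood can be sketched in base R as below. This only illustrates the standard formulas: the helper `info_criteria` and its inputs are hypothetical, and how the package counts the degrees of freedom of a penalized fit is not stated here, so the values it returns in aic and bic may be computed differently.

```r
# Textbook information criteria from a log-likelihood (hypothetical helper,
# not part of the Gooogle package)
info_criteria <- function(loglik, df, n) {
  c(aic = -2 * loglik + 2 * df,        # AIC = -2*logLik + 2*df
    bic = -2 * loglik + log(n) * df)   # BIC = -2*logLik + log(n)*df
}

# Made-up inputs, for illustration only
ic <- info_criteria(loglik = -100, df = 5, n = 50)
ic
```

Lower values indicate a better trade-off between fit and model complexity, which is why the tuning parameter is chosen to minimize one of these criteria.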

Examples

As an example on real data, we use the docvisits dataset from the R package zic. Similar to previous studies (Jochmann, 2013), we express each continuous predictor as a group of three cubic spline variables, resulting in 24 candidate predictors: 5 triplets and 9 singleton groups.

####################

# Load the dataset and prepare the variables

library(zic)
library(splines)

data("docvisits")

n<-nrow(docvisits)
age<-bs(docvisits$age,3)[1:n,]
hlth<-bs(docvisits$health,3)[1:n,]
hdeg<-bs(docvisits$hdegree,3)[1:n,]
schl<-bs(docvisits$schooling,3)[1:n,]
hhin<-bs(docvisits$hhincome,3)[1:n,]

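# Select the remaining predictors by name instead of attaching the data frame
others<-c("handicap","married","children","self","civil","bluec","employed","public","addon")
doc.spline<-cbind.data.frame(docvisits$docvisits,age,hlth,hdeg,schl,hhin,docvisits[,others])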

names(doc.spline)[1:16]<-c("docvisits",paste("age",1:3,sep=""),paste("health",1:3,sep=""),paste("hdegree",1:3,sep=""),paste("schooling",1:3,sep=""),paste("hhincome",1:3,sep=""))
data<-doc.spline

#####################################################################

Given the grouping structure among the spline variables for age, health, hdegree, schooling, and hhincome, we can use our algorithm to perform group-level or bi-level variable selection. Below is an example call to the gooogle function using the gBridge penalty.

# Fit the Gooogle method using group bridge penalty

group<-c(rep(1:5,each=3),6:14)

yvar<-names(data)[1]
xvars<-names(data)[-1]
zvars<-xvars

fit.gooogle <- gooogle(data=data,yvar=yvar,xvars=xvars,zvars=zvars,group=group,dist="negbin",penalty="gBridge")
fit.gooogle
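
To double-check that the grouping vector lines up with the predictors, one option is to tabulate the predictor names by group in base R. The sketch below rebuilds the 24 predictor names used above without loading any data, so it runs on its own; the object names `spline_vars`, `single_vars`, and `by_group` are introduced here only for illustration.

```r
# Rebuild the predictor names used in the example (no data needed)
spline_vars <- paste0(rep(c("age", "health", "hdegree", "schooling", "hhincome"),
                          each = 3), 1:3)
single_vars <- c("handicap", "married", "children", "self", "civil",
                 "bluec", "employed", "public", "addon")
xvars <- c(spline_vars, single_vars)

# Five spline triplets (groups 1-5) followed by nine singleton groups (6-14)
group <- c(rep(1:5, each = 3), 6:14)

# One group index per predictor
stopifnot(length(xvars) == length(group))

# List the predictors in each group and count the group sizes
by_group <- split(xvars, group)
lengths(by_group)
```

The first five groups should each contain three spline variables and the remaining nine groups one variable each, matching the 5 triplets and 9 singleton groups described earlier.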

References

Huang, J., Ma, S., Xie, H., and Zhang, C. (2009). A group bridge approach for variable selection. Biometrika 96(2):339–355.

Jochmann, M. (2013). What belongs where? Variable selection for zero-inflated count models with an application to the demand for health care. Computational Statistics 28:1947–1964.

Citation

If you use Gooogle in your work, please cite the following papers:

Chatterjee, S., Chowdhury, S., Mallick, H., Banerjee, P., and Garai, B. (2018). Group Regularization for Zero-inflated Negative Binomial Regression Models with An Application to Healthcare Demand in Germany. Statistics in Medicine 37(20):3012–3026.

Chowdhury, S., Chatterjee, S., Mallick, H., Banerjee, P., and Garai, B. (2019). Group Regularization for Zero-inflated Poisson Regression Models with An Application to Insurance Ratemaking. Journal of Applied Statistics 46(9):1567–1581.

Contact

Feel free to contact us at schatterjee@niu.edu, gg0658@wayne.edu, and/or hmallick@hsph.harvard.edu.
