ELCIC is a robust, consistent, and data-driven model selection criterion based on the empirical likelihood function. The framework adopts plug-in estimators obtained by solving external estimating equations, which need not come from the empirical likelihood itself. This avoids potential convergence issues and allows versatile applications, such as generalized linear models, generalized estimating equations, and penalized regressions. The criterion is derived from the asymptotic expansion of the marginal likelihood under the variable selection framework; more importantly, its model selection consistency is established in a general context.
Installation
if (!require("devtools")) {
install.packages("devtools")
}
devtools::install_github("chencxxy28/ELCIC")
We provide a basic tutorial illustrating the usage of the ELCIC package. For technical details, see Chen et al. (2021). Below, we present three applications of ELCIC: variable selection in generalized linear models, model selection in longitudinal data, and model selection in longitudinal data with missingness.
Variable Selection in Generalized Linear Models (GLMs)
First, let us generate pseudo data from a negative binomial distribution with overdispersion parameter (ov) equal to 8. We provide the function glm.generator to generate responses and covariates. It supports several outcome distributions: Gaussian, Poisson, Binomial, and Negative Binomial.
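To make concrete what such a generator produces, the following base-R sketch draws correlated covariates and negative binomial responses under a log link. It is an illustration under stated assumptions, not the package's implementation: the helper simulate_nb_data is hypothetical, the covariates are assumed Gaussian with an AR(1)-type correlation rho, and ov is assumed to play the role of the size argument of rnbinom.

```r
# Illustrative sketch only: a hand-rolled stand-in for a GLM data generator.
# Assumptions (not taken from the ELCIC source): Gaussian covariates with
# AR(1)-type correlation rho, a log link, and `ov` used as rnbinom()'s `size`.
simulate_nb_data <- function(beta, samplesize, rho, ov) {
  p <- length(beta) - 1                                  # non-intercept covariates
  sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))   # AR(1) correlation matrix
  z <- matrix(rnorm(samplesize * p), nrow = samplesize)  # independent N(0, 1) draws
  x <- cbind(1, z %*% chol(sigma))                       # intercept + correlated covariates
  mu <- as.vector(exp(x %*% beta))                       # mean under the log link
  y <- rnbinom(samplesize, size = ov, mu = mu)           # negative binomial responses
  list(y = y, x = x)
}

set.seed(1)
sim <- simulate_nb_data(beta = c(0.5, 0.5, 0.5, 0), samplesize = 100, rho = 0.5, ov = 8)
dim(sim$x)     # 100 rows, 4 columns (intercept plus 3 covariates)
length(sim$y)  # 100 responses
```

The returned list mirrors the y and x components that the tutorial code extracts from the generator output.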
We first test how ELCIC works when the distribution is correctly specified. The outcomes are generated from a Poisson distribution. We then consider seven candidate models with different covariate combinations and compare ELCIC with the most commonly used criteria: AIC, BIC, and GIC. The function ELCIC.glm produces the values of ELCIC, AIC, BIC, and GIC for the given candidate models. The selection rates of the four criteria are reported based on 20 Monte Carlo runs (may take 1-2 minutes to run).
set.seed(28582)
inter.total=20 #number of Monte Carlo runs
#rows: criteria; columns: candidate models
count.matrix<-matrix(0,nrow=4,ncol=7)
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
rownames(count.matrix)<-c("ELCIC","AIC","BIC","GIC")
colnames(count.matrix)<-candidate.sets
samplesize<-100
ov=2 #not passed to the Poisson generator call below
for(iter in 1:inter.total)
{
  #generate data
  simulated.data<-glm.generator(beta=c(0.5,0.5,0.5,0),samplesize=samplesize,rho=0.5,dist="poisson")
  y<-simulated.data[["y"]]
  x<-simulated.data[["x"]]
  #compute ELCIC, AIC, BIC, and GIC for every candidate model
  criterion.all<-ELCIC.glm(x=x,y=y,candidate.sets=candidate.sets,name.var=NULL,dist="poisson")
  #print(iter)
  #for each criterion (row), find the candidate model with the smallest value
  index.used<-cbind(1:4,apply(criterion.all,1,which.min))
  count.matrix[index.used]<-count.matrix[index.used]+1
}
#selection rate of each candidate model under each criterion
count.matrix/inter.total
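The tally in the loop relies on indexing a matrix with a two-column matrix of (row, column) pairs. The toy example below isolates that step; the criterion values are made up for illustration and the object names ending in .toy are not part of the package.

```r
# Made-up criterion values: 4 criteria (rows) by 3 candidate models (columns).
criterion.toy <- rbind(ELCIC = c(10.2, 8.1, 9.0),
                       AIC   = c(7.5, 7.9, 7.4),
                       BIC   = c(9.3, 8.8, 9.9),
                       GIC   = c(6.0, 6.4, 6.2))
count.toy <- matrix(0, nrow = 4, ncol = 3,
                    dimnames = list(rownames(criterion.toy), NULL))

# Each row of index.toy is a (row, column) pair: criterion i and the
# candidate model that minimizes it.
index.toy <- cbind(1:4, apply(criterion.toy, 1, which.min))

# Matrix indexing updates exactly one cell per criterion.
count.toy[index.toy] <- count.toy[index.toy] + 1
count.toy
```

Repeating this update over the Monte Carlo runs and dividing by the number of runs gives the selection rates.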
Based on the results, ELCIC and BIC perform best, with the highest selection rates among the four criteria. This shows that ELCIC is as powerful as BIC when the distribution is correctly specified. Next, we consider the situation where AIC, BIC, and GIC are based on a misspecified distribution. Accordingly, we generate outcomes from a Negative Binomial distribution with the dispersion parameter equal to 2, while AIC, BIC, and GIC take the Poisson distribution as input (may take 1-2 minutes to run).
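set.seed(28582)
inter.total=20 #number of Monte Carlo runs
#rows: criteria; columns: candidate models
count.matrix<-matrix(0,nrow=4,ncol=7)
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
rownames(count.matrix)<-c("ELCIC","AIC","BIC","GIC")
colnames(count.matrix)<-candidate.sets
samplesize<-100
ov=2 #dispersion parameter of the Negative Binomial outcomes
for(iter in 1:inter.total)
{
  #generate Negative Binomial data
  simulated.data<-glm.generator(beta=c(0.5,0.5,0.5,0),samplesize=samplesize,rho=0.5,dist="NB",ov=ov)
  y<-simulated.data[["y"]]
  x<-simulated.data[["x"]]
  #evaluate the criteria under a (misspecified) Poisson working distribution
  criterion.all<-ELCIC.glm(x=x,y=y,candidate.sets=candidate.sets,name.var=NULL,dist="poisson")
  #print(iter)
  #for each criterion (row), find the candidate model with the smallest value
  index.used<-cbind(1:4,apply(criterion.all,1,which.min))
  count.matrix[index.used]<-count.matrix[index.used]+1
}
#selection rate of each candidate model under each criterion
count.matrix/inter.total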
The results above show that ELCIC has the highest selection rate. This better performance is attributed to its distribution-free property, which is highly valuable in real applications where the underlying distribution is hard to specify correctly.
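One way to see why Poisson-based criteria are misspecified in this setting is to check the overdispersion directly. The base-R sketch below is an illustration rather than part of the package: it simulates Negative Binomial outcomes, fits a Poisson working model, and computes the Pearson dispersion estimate, which should be near 1 under a correct Poisson model. It assumes the dispersion parameter enters as the size argument of rnbinom.

```r
set.seed(2021)
n <- 5000
x <- rnorm(n)
mu <- exp(0.5 + 0.5 * x)                 # log-link mean
y.nb <- rnbinom(n, size = 2, mu = mu)    # assumption: dispersion enters as `size`

# Fit the (misspecified) Poisson working model.
fit.pois <- glm(y.nb ~ x, family = poisson())

# Pearson dispersion estimate: close to 1 for Poisson data,
# clearly above 1 when the outcomes are overdispersed.
dispersion <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
dispersion
```

A value well above 1 indicates that the Poisson likelihood underlying AIC, BIC, and GIC does not match the data, whereas a distribution-free criterion does not rely on that likelihood.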
Model Selection in Longitudinal Data Analysis (Assuming Missing Completely at Random)
Since ELCIC is distribution free, it is well suited to statistical models in a semi-parametric framework. This section evaluates how ELCIC works for joint selection of marginal mean and correlation structures in longitudinal data analysis. We assume the data are complete or missing completely at random. The following packages are used to generate different types of outcomes.
In the following evaluation, we consider the generalized estimating equations (GEE) framework, a semi-parametric framework widely used in longitudinal data analysis. We jointly select the marginal mean and correlation structures. Note that correctly identifying the working correlation structure improves estimation efficiency for the mean structure. We provide the function gee.generator to generate responses and covariates. It supports Gaussian, Poisson, and Binomial outcome distributions and allows both time-dependent and time-independent covariates. We compare ELCIC with another widely adopted information criterion, QIC.
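The three candidate working correlation structures considered in this section are independence, exchangeable, and AR(1). As a reminder of what each implies for repeated measurements, the base-R sketch below builds the corresponding correlation matrices for three time points with correlation 0.4. The helper make_working_cor is hypothetical and written only for this illustration; it is not a function of the ELCIC package.

```r
# Build a working correlation matrix for `time` repeated measurements.
# `make_working_cor` is a hypothetical helper for illustration only.
make_working_cor <- function(structure, time, rho) {
  switch(structure,
    independence = diag(time),            # no within-subject correlation
    exchangeable = {
      m <- matrix(rho, time, time)        # same correlation for every pair
      diag(m) <- 1
      m
    },
    ar1 = rho^abs(outer(seq_len(time), seq_len(time), "-")),  # decays with lag
    stop("unknown structure: ", structure)
  )
}

structures <- c("independence", "exchangeable", "ar1")
cor.list <- lapply(structures, make_working_cor, time = 3, rho = 0.4)
names(cor.list) <- structures
cor.list
```

Under the exchangeable structure every pair of time points shares the same correlation, while under AR(1) the correlation shrinks as the time lag grows.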
set.seed(28582)
inter.total=20 #number of Monte Carlo runs
samplesize<-300 #number of subjects
time=3 #number of repeated measurements per subject
dist="poisson"
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
candidate.cor.sets<-c("independence","exchangeable","ar1")
#rows: correlation structures; columns: candidate mean models
count.matrix.elcic<-count.matrix.qic<-matrix(0,nrow=length(candidate.cor.sets),ncol=length(candidate.sets))
rownames(count.matrix.elcic)<-rownames(count.matrix.qic)<-candidate.cor.sets
colnames(count.matrix.elcic)<-colnames(count.matrix.qic)<-candidate.sets
for(iter in 1:inter.total)
{
  #generate data
  data.corpos<-gee.generator(beta=c(-1,1,0.5,0),samplesize=samplesize,time=time,num.time.dep=2,num.time.indep=1,rho=0.4,x.rho=0.2,dist="poisson",cor.str="exchangeable",x.cor.str="exchangeable")
  y<-data.corpos$y
  x<-data.corpos$x
  id<-data.corpos$id
  #observation indicator: all ones, i.e. no missing outcomes
  r<-rep(1,samplesize*time)
  criterion.elcic<-ELCIC.gee(x=x,y=y,r=r,id=id,time=time,candidate.sets=candidate.sets,name.var=NULL,dist="poisson",candidate.cor.sets=candidate.cor.sets)
  criterion.qic<-QICc.gee(x=x,y=y,id=id,dist=dist,candidate.sets=candidate.sets, name.var=NULL, candidate.cor.sets=candidate.cor.sets)
  #ELCIC: locate the (correlation structure, mean model) cell with the smallest value
  index.used.elcic<-which(criterion.elcic==min(criterion.elcic),arr.ind = TRUE)
  count.matrix.elcic[index.used.elcic]<-count.matrix.elcic[index.used.elcic]+1
  #QIC: match the row name of criterion.qic to a correlation structure, then pick the mean model with the smallest value
  index.used.qic.row<-which(candidate.cor.sets==rownames(criterion.qic))
  index.used.qic.col<-which.min(criterion.qic)
  count.matrix.qic[index.used.qic.row,index.used.qic.col]<-count.matrix.qic[index.used.qic.row,index.used.qic.col]+1
  print(iter)
}
#joint selection rates for each criterion
count.matrix.elcic/inter.total
count.matrix.qic/inter.total
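For the ELCIC tally, the joint minimum over correlation structures and mean models is located with which(..., arr.ind = TRUE), which returns the row and column of the minimizing cell. The toy example below isolates that step; the criterion values and the model labels are made up for illustration.

```r
# Made-up criterion values: 3 correlation structures (rows) by 2 mean models (columns).
# matrix() fills column by column, so the first three numbers form column "model1".
crit.toy <- matrix(c(5.1, 4.2, 4.8,
                     3.9, 3.1, 3.6),
                   nrow = 3,
                   dimnames = list(c("independence", "exchangeable", "ar1"),
                                   c("model1", "model2")))

# Row and column position of the smallest value in the whole matrix.
best <- which(crit.toy == min(crit.toy), arr.ind = TRUE)
best
```

If two cells tie for the minimum, which returns one row per tied cell, so every tied cell would be incremented in the tally.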
The results above show that ELCIC performs joint selection with much higher power than QIC, especially for the marginal mean structure.