I used two nested case-control and six case-cohort methods in the following paper:

Kim RS, A New Comparison of Nested Case-Control and Case-Cohort Designs and Methods (Under Review)

I originally intended to use existing packages but found inaccuracies in some functions. I have discussed the inaccuracies with R community (here) as well as with one of the authors privately. So I ended up writing my own codes for the methods and I am sharing them here with research community.

For each method, one can easily use ‘robust’ or ‘sandwich’ option in software to compute variance estimators that converge to approximate jackknife estimators (or ‘robust’ estimators by Lin & Wei, 1989) as sample size increases to the size of full cohorts. When someone has already reported its use, I indicated the names (e.g. Lin and Ying). Otherwise, I named them Type B variance estimators.

t.entry.nm = NULL;t.exit.nm = “time”;fail.nm <- “status”;match.nm =NULL

dt <- simulation(N=N,beta1=beta1.choices[j],max.censor=max.censor.choice[j])

risk.table <- risk.table.f(fail.nm=fail.nm,data=dt,t.entry.nm=t.entry.nm,t.exit.nm=t.exit.nm)
inclusion.prob <- inclusion.prob.ncc.f(data=dt,t.entry.nm=t.entry.nm, t.exit.nm=t.exit.nm, fail.nm=fail.nm,controls=m,risk.table=risk.table,match.nm=match.nm)
sub.dt.clogit <- nccdata(t.entry.nm = t.entry.nm, t.exit.nm = t.exit.nm, fail.nm=fail.nm,controls = m, match.nm =NULL, data=dt,risk.table=risk.table,inclusion.prob=inclusion.prob)
sub.dt.ncc <- sub.dt.clogit[!duplicated(sub.dt.clogit$Map),]
sub.dt.ncc <- cbind(sub.dt.ncc,cluster.id=1:nrow(sub.dt.ncc))
n.ncc<-nrow(sub.dt.ncc);n0.ncc<-sum(sub.dt.ncc[,fail.nm]==0);N0<-sum(dt[,fail.nm]==0)
cch.p =n0.ncc/N0 #To ensure expected sample size of ncc and cch are the same
CCH <-casecohortdata(dt,cch.p,fail.nm)
CCH <- cbind(CCH,cluster.id=1:nrow(CCH))

The following function produces Self & Prentice estimate for linear coefficients but two types of variances, one by Self & Prentice and one by Lin & Ying.

Nested Case-Control Methods

Thomas (1977)’s Hazard Ratio and Standard Error Estimation (“Conditional Logistic Method”)

First, make a ‘trick’ data set “by including multiple inputs for subjects who are selected multiple times, and recording all randomly selected failures as non-failures.” (Kim 2013)

clogit.fn <- function(formula,data){
 fit <- coxph(formula=formula,data=data,method="exact")
 res <- list(coef=fit$coef, se=sqrt(diag(fit$var)))
 }
formula.clogit <- Surv(time, Fail) ~ X1+X2 + strata(Set)
clogit.fn(formula.clogit,data=sub.dt.clogit)

Samuelsen (1997)’s Hazard Ratio and Kim (2013)’s Standard Error Estimation (“IPW method”)

robust.coxph.fn <-function(formula,sub.dt,wt="wt"){
 if(is.null(wt)) {wt<-rep(1,nrow(sub.dt))} else { wt<-sub.dt[,wt]}
 fit<-coxph(formula=formula,data=sub.dt, weights=wt)
 res<-list(coef=fit$coef,se=sqrt(diag(fit$var)))
}
robust.formula <- Surv(time, status) ~ X1+X2+cluster(cluster.id)
robust.coxph.fn(robust.formula,sub.dt.ncc,wt="wt")

Case-Cohort Methods

Barlow (1994)’s Hazard Ratio and Standard Error Estimation

Barlow (1994) originally proposed two approaches for weighting: 1) using time dependent subcohort proportion π(t) which is the number of subjects at risk among the subcohort at t divided by the number of subjects at risk in the cohort at t, or 2) using the baseline subcohort proportion π. Later Barlow et al. (1999) used only π. Here the program uses the baseline subcohort proportion π.

barlow.dt <- function(CCH,cch.p){
#assumes no ties in failure time & zero entry time
#http://lib.stat.cmu.edu/general/robphreg
eps<-1e-08
wt.temp=1/cch.p
ST <- CCH$status
SC <- CCH$in.subcohort
temp1 <- CCH[ST==1,]
temp1 <- cbind(temp1, T=temp1$time,WWTT=1,start=temp1$time-eps,CASE=1)
situation2<-ST==1 & SC==T
if(sum(situation2)>0){
 temp2 <- CCH[ST==1 & SC==T,]
 temp2 <- cbind(temp2,T=temp2$time-eps,WWTT=wt.temp,start=temp2$t.entry,CASE=0)
 }
 temp3 <- CCH[ST==0 & SC==T,]
 temp3 <- cbind(temp3,T=temp3$time,    WWTT=wt.temp,start=temp3$t.entry,CASE=0)
 if(sum(situation2)>0){
 res<-rbind(temp1,temp2,temp3)
 } else res <-rbind(temp1,temp3)
 return(cbind(res,logwt=log(res$WWTT)))
 }
barlow.fn <-function(formula=formula.Barlow,data=CCH.barlow){
 fit<-coxph(formula=formula,weights=WWTT, data=data)
 res<-list(coef=fit$coef,se=sqrt(diag(fit$var)))
 }

formula.Barlow<-Surv(start,T, CASE) ~ X1+X2+cluster(cluster.id)
CCH.barlow <- barlow.dt(cbind(CCH, t.entry=0), cch.p)
barlow.fn(formula.Barlow, data=CCH.barlow)

Prentice (1986)’s Hazard Ratio and Self & Prentice (1988)’s Standard Error Estimation

prentice.dt <-function(CCH){
 eps  <- 1e-08
 TM   <-CCH$time
 start<-CCH$t.entry
 SC   <-CCH$in.subcohort
 start[SC == FALSE] <- TM[SC == FALSE] - eps
return(cbind(CCH,start,TM))
}

CCH.prentice <- prentice.dt(cbind(CCH,t.entry=0))
prentice <-function(formula.prentice,data=CCH.prentice,cohort.size=nrow(dt)){
 SC <- data$in.subcohort
 fit <- coxph(formula.prentice,data=data,x=T)
 dfbeta <- resid(fit,type="dfbeta")
 d2 <- dfbeta[SC,]
 fit$naive.var <- fit$naive.var + (1- sum(SC)/cohort.size)* t(d2)%*% d2
 fit$SelfPrentice.var <- fit$naive.var
fit$AJK.var <- fit$var
 fit<-fit[c("coefficients","SelfPrentice.var","AJK.var"]
 return(fit)
 }
prentice.fn <-function(formula=formula.prentice,data=CCH.prentice){
 fit<-prentice(formula=formula,data=data)
 res<-list(coef=fit$coef,se=sqrt(diag(fit$SelfPrentice.var)))
 }
formula.prentice <- Surv(start, T, status) ~ X1+X2+cluster(cluster.id)
prentice.fn(formula.prentice,data =CCH.prentice)

Prentice (1986)’s Hazard Ratio and Type B (Approximate Jackknife) Standard Error Estimation

prentice.AJK.fn <-function(formula=formula.prentice,data=CCH.prentice){
 fit<-prentice(formula=formula,data=data)
 res<-list(coef=fit$coef,se=sqrt(diag(fit$AJK.var)))
 }
formula.prentice<-Surv(start, T, status) ~ X1+X2+cluster(cluster.id)
prentice.AJK.fn(formula.prentice,data =CCH.prentice)

Self & Prentice (1998)’s Hazard Ratio and Standard Error Estimation

temp1 <- data.frame(CCH[CCH$status==1, ],         group=1, dummy= -100)
temp2 <- data.frame(CCH[CCH$in.subcohort==TRUE, ],group=0, dummy= 0)
CCH.SP         <- rbind(temp1, temp2)

selfprentice <-function(formula.SP,data=sub.dt,cohort.size=nrow(dt)){
    SC <- data$in.subcohort
   fit <- coxph(formula.SP,data=data,x=T)
dfbeta <- resid(fit,type="dfbeta")
    d2 <- dfbeta[SC,]
fit$SelfPrentice.var <- fit$naive.var + (1- sum(SC)/cohort.size)* t(d2)%*% d2
     fit$LinYing.var <- fit$var
fit<-fit[c("coefficients","SelfPrentice.var","LinYing.var")]
return(fit)
}
selfprentice.fn <-function(formula,data=sub.dt,cohort.size=nrow(dt)){
 fit<-selfprentice(formula=formula,data=data,cohort.size=cohort.size)
 res<-list(coef=fit$coef,se=sqrt(diag(fit$SelfPrentice.var)))
 }
formula.SP <- Surv(time, group) ~ X1+X2+offset(dummy) + cluster(cluster.id)
selfprentice.fn(formula.SP, data=CCH.SP, cohort.size=nrow(dt))

Self & Prentice (1998)’s Hazard Ratio and Lin & Ying (1993)’s (Approximate Jackknife) Standard Error Estimation

linying.fn <-function(formula,data=sub.dt,cohort.size=nrow(dt)){
 fit<-selfprentice(formula=formula,data=data,cohort.size=cohort.size)
 res<-list(coef=fit$coef,se=sqrt(diag(fit$LinYing.var)))
 }
formula.SP <- Surv(time, group) ~ X1+X2+offset(dummy) + cluster(cluster.id)
linying.fn(formula.SP, data =CCH.SP, cohort.size=nrow(dt))

Binder (1992)’s Hazard Ratio and Type B (Approximate Jackknife) Standard Error Estimation (“IPW method”)

robust.coxph.fn <-function(formula,sub.dt,nm="Robust.Cox",wt="wt"){
 if(is.null(wt)) {wt<-rep(1,nrow(sub.dt))} else { wt<-sub.dt[,wt]}
 fit<-coxph(formula=formula,data=sub.dt, weights=wt)
 res<-list(coef=fit$coef,se=sqrt(diag(fit$var)))
}
robust.formula <- Surv(time, status) ~ X1+X2+cluster(cluster.id)
robust.coxph.fn(robust.formula,CCH,wt="wt")

Contact me for beta versions of the following:

Samuelsen (1992)’s Hazard Ratio Standard Error Estimation (“IPW method”)

Binder (1992)’s Hazard Ratio and Lin (2000)’s Standard Error Estimation (“IPW method”)

library(survey)
svycoxph.fn<-function(formula,design=design){
 fit<-svycoxph(formula,design=design)
 list(coef=fit$coef,se=sqrt(diag(fit$var)))
 }
formula <- Surv(time, status) ~ X1+X2
svycoxph.fn(formula,design=svydesign(id=~1, weights=~wt, data=CCH))

Someone recently emailed me for a code to compute inclusion probabilities in nested case-control studies. A nested case-control study design, along with case-cohort study design, is a schema to collect a statistically representative and powerful sub-sample from a cohort. They are commonly used in epidemiological studies to reduce the cost of exposure assessment when the outcome of interest is time-to-event (e.g. time to death, disease incidence, etc.)

It was Samuelsen (1997) who first proposed using IPW method to analyze nested case-control studies. I showed it was just fine to use a simpler variance estimator that can be computed by existing software (Kim 2013). Recently, I also proposed using IPW method to analyze secondary outcomes in nested case-control study designs (Kim 2014). The probability of each subject to be included in the sub-sample must be computed to use inverse probability weighting (IPW) method.

Now back to calculating the inclusion probabilities. In order to compute the probabilities, you need access to the full cohort data. Consider a full cohort data (only 10 subjects shown) that looks like below.

> dt[1:10,]
  X.delta time.to.X Y.delta time.to.Y GENDER HeavyDrinking
1       0      2050       1      1741 female       FALSE
2       0       475       0       438 female        TRUE
3       0      1626       0      1155 female       FALSE
4       0      1018       1      1185   male        TRUE
5       0       427       0       550 female        TRUE
6       0      1207       0      1728 female       FALSE
7       0       490       0       616   male        TRUE
8       0      1219       1      1062   male       FALSE
9       0      1137       0      1382   male        TRUE
10      0       615       0       675   male        TRUE

Consider again a nested case-control study with number of controls at 2 (i.e. m=2) with two matching variables GENDER and HeavyDrinking from this full cohort based on the primary outcome variable (X).

You need the following two functions to calculate the inclusion probability of each subject. The first function creates the table with risk set. The second function calculates the inclusion probabilities.

(For those of you who are performing secondary outcome analysis using nested case-control studies, notice that you do not need to consider failure time with respect to the secondary outcome, Y in the example, when computing inclusion probabilities. )

risk.table.f<-function(fail.nm, data, t.exit.nm, t.entry.nm){
 if(is.null(t.entry.nm)){t.entry<-0} else {t.entry <- data[,t.entry.nm]}
 t.exit<-data[,t.exit.nm]
 FT<-unique(sort(t.exit[data[,fail.nm]==1])) #Failure times
 risk.table<- cbind(FT,
t(sapply(FT, function(x){c(sum(t.exit==x), sum(t.exit>=x & t.entry<=x))}))
)
 risk.table<-data.frame(risk.table)
 colnames(risk.table)<-c("failure.time","cases","at.risk")
 return(risk.table)
}

inclusion.prob.ncc.f <- function(data,t.entry.nm, t.exit.nm, fail.nm, controls, risk.table, match.nm=NULL){
 mm<-length(match.nm)
 t.exit<-data[,t.exit.nm]
 if(is.null(t.entry.nm)){t.entry<-0} else {t.entry<-data[,t.entry.nm]}
 if(mm==0 & is.data.frame(risk.table)){
 CS<-risk.table$cases 
 AR<-risk.table$at.risk
 FT<-risk.table$failure.time
 inclusion.prob<-apply(cbind(t.entry,t.exit), 1, function(x){
 p<- pmin(1, controls*CS/(AR-CS))
 p[FT > x[2] | FT < x[1]]<-0 #zero when not at risk
 1-prod(1-p) #inclusion prob
 })
 inclusion.prob[data[,fail.nm]==1]<-1 
}
if(mm>0 & !is.data.frame(risk.table)){
 match <- data[,match.nm]
 inclusion.prob<-apply(cbind(t.entry,t.exit,match), 1, function(x){i.design.strata <- as.vector(paste(x[-(1:2)],collapse=":"))
 CS<-risk.table[[i.design.strata]]$cases 
 AR<-risk.table[[i.design.strata]]$at.risk
 FT<-risk.table[[i.design.strata]]$failure.time
 p <- pmin(1, controls*CS/(AR-CS) )
 p[FT > as.numeric(x[2]) | FT < as.numeric(x[1])]<- 0 #zero when not at risk
 1-prod(1-p) #inclusion prob
 })
 inclusion.prob[data[,fail.nm]==1]<-1
}
 return(inclusion.prob)
}

Using the two functions, you can compute the inclusion probabilities the following way:

t.entry.nm <- NULL                        #Entry time
t.exit.nm1 <- "time.to.X"                 #Time to  failure
  fail.nm1 <- "X.delta"                   #Indicator for failure
  match.nm <- c("GENDER","HeavyDrinking") #Matching variables
         m <- 2                           #Number of controls

design.strata <- as.vector(apply(dt[,match.nm],1,paste,collapse=":"))

risk.table.strata <- by(dt,design.strata, function(x){risk.table.f(fail.nm=fail.nm1, x, t.entry.nm=t.entry.nm, t.exit.nm=t.exit.nm1)})

inclusion.prob <- inclusion.prob.ncc.f(data=dt, t.entry.nm=t.entry.nm, t.exit.nm=t.exit.nm1, fail.nm=fail.nm1, controls=m, risk.table=risk.table.strata, match.nm=match.nm)

Once you invert the inclusion probabilities, add them as a column (‘wt’) to your nested case-control study data. For our illustration, I’m going to call the dataframe ‘nccdata’. In order to fit the Cox model with IPW method, use the following command. You can find justification for this method in my 2013 article.

fit<-coxph(formula=Surv(time.to.X, delta.X) ~ Gender + HeavyDrinking + cluster(ID), data=nccdata, weights=wt)

If you are performing a secondary outcome analysis from the nested case-control study, use the following command. Notice that the weights are computed based on the primary outcome (X) while the risk sets and failure times are based on the secondary outcome (Y). You can find justification for this method in my 2014 article.

fit<-coxph(formula=Surv(time.to.Y, delta.Y) ~ Gender + HeavyDrinking + cluster(ID), data=nccdata, weights=wt)

References

Samuelsen SO, A Pseudolikelihood Approach to Analysis of Nested Case-Control Studies, Biometrika, 1997: 84(2): 379-94

Kim RS, Analysis of Nested Case-Control Study Designs: Revisiting the Inverse Probability Weighting Method, Communications for Statistical Applications and Methods, 2013: 20(6): 455–66

Kim RS, Kaplan R, Analysis of Secondary Outcomes in Nested Case-Control Study Designs, Statistics in Medicine, 2014:33 (24): 4215-26

Kim RS. R code: computation of inclusion probabilities in nested case-control studies, 2014 Oct. Retrieved from http://missionalconsulting.com/methods

Quantitative and Statistical Consulting Blog

R code: six case-cohort and two nested case-control methods