R code: computation of inclusion probabilities in nested case-control studies

Ryung S. Kim — Tue, 14 Oct 2014 21:54:30 +0000

Someone recently emailed me asking for code to compute inclusion probabilities in nested case-control studies. The nested case-control design, like the case-cohort design, is a scheme for drawing a statistically representative and powerful sub-sample from a cohort. Both designs are commonly used in epidemiological studies to reduce the cost of exposure assessment when the outcome of interest is a time to event (e.g., time to death or to disease incidence).

To use the inverse probability weighting (IPW) method, you must compute the probability that each subject is included in the sub-sample. It was Samuelsen (1997) who first proposed using the IPW method to analyze nested case-control studies. I showed that it is fine to use a simpler variance estimator, one that existing software can compute (Kim 2013). More recently, I also proposed using the IPW method to analyze secondary outcomes in nested case-control study designs (Kim and Kaplan 2014).

Now back to calculating the inclusion probabilities. To compute them, you need access to the full cohort data. Consider full cohort data that look like the following (only the first 10 subjects are shown).

> dt[1:10,]
  X.delta time.to.X Y.delta time.to.Y GENDER HeavyDrinking
1       0      2050       1      1741 female       FALSE
2       0       475       0       438 female        TRUE
3       0      1626       0      1155 female       FALSE
4       0      1018       1      1185   male        TRUE
5       0       427       0       550 female        TRUE
6       0      1207       0      1728 female       FALSE
7       0       490       0       616   male        TRUE
8       0      1219       1      1062   male       FALSE
9       0      1137       0      1382   male        TRUE
10      0       615       0       675   male        TRUE

Now consider a nested case-control study drawn from this full cohort on the primary outcome variable (X), with two controls per case (i.e., m=2) and two matching variables, GENDER and HeavyDrinking.

You need the following two functions to calculate the inclusion probability of each subject. The first function builds the risk table: the number of cases and the number of subjects at risk at each failure time. The second function calculates the inclusion probabilities from that table.

(If you are performing a secondary outcome analysis using nested case-control studies, note that you do not need to consider the failure time of the secondary outcome, Y in this example, when computing inclusion probabilities.)
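In symbols, the quantity that the second function below computes can be written as follows. This is my reading of the code rather than a formula quoted from the cited papers, and the notation is assumed for illustration: within a subject's matching stratum, t_k are the distinct failure times, d_k is the number of cases at t_k, n_k is the number at risk at t_k, m is the number of controls per case, and a_i and b_i are the entry and exit times of subject i.

```latex
\pi_i =
\begin{cases}
1, & \text{if subject } i \text{ is a case},\\[4pt]
1 - \displaystyle\prod_{k:\, a_i \le t_k \le b_i}
  \left(1 - \min\left\{1,\ \dfrac{m\, d_k}{n_k - d_k}\right\}\right), & \text{otherwise.}
\end{cases}
```

Each factor in the product is the probability of not being drawn as a control at failure time t_k, so the product is the probability of never being drawn while at risk.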

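# Build the risk table: for each distinct failure time, the number of cases
# and the number of subjects at risk.
risk.table.f <- function(fail.nm, data, t.exit.nm, t.entry.nm = NULL) {
  # Entry time is 0 (no delayed entry) when no entry-time column is given
  if (is.null(t.entry.nm)) {t.entry <- 0} else {t.entry <- data[, t.entry.nm]}
  t.exit <- data[, t.exit.nm]
  fail <- data[, fail.nm] == 1
  FT <- unique(sort(t.exit[fail]))  # distinct failure times
  risk.table <- cbind(FT,
    t(sapply(FT, function(x) {
      c(sum(t.exit == x & fail),          # cases at time x (censored ties excluded)
        sum(t.exit >= x & t.entry <= x))  # subjects at risk at time x
    }))
  )
  risk.table <- data.frame(risk.table)
  colnames(risk.table) <- c("failure.time", "cases", "at.risk")
  return(risk.table)
}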

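# Inclusion probability of each cohort member for a nested case-control sample
# with `controls` controls per case. Without matching, risk.table is a single
# data frame; with matching, it holds one risk table per stratum label.
inclusion.prob.ncc.f <- function(data, t.entry.nm, t.exit.nm, fail.nm, controls, risk.table, match.nm = NULL) {
  mm <- length(match.nm)
  t.exit <- data[, t.exit.nm]
  if (is.null(t.entry.nm)) {t.entry <- 0} else {t.entry <- data[, t.entry.nm]}
  if (mm == 0 && is.data.frame(risk.table)) {
    # Unmatched design: one risk table for the whole cohort
    CS <- risk.table$cases
    AR <- risk.table$at.risk
    FT <- risk.table$failure.time
    inclusion.prob <- apply(cbind(t.entry, t.exit), 1, function(x) {
      p <- pmin(1, controls * CS / (AR - CS))  # sampling prob. at each failure time
      p[FT > x[2] | FT < x[1]] <- 0            # zero when not at risk
      1 - prod(1 - p)                          # inclusion prob
    })
  } else if (mm > 0 && !is.data.frame(risk.table)) {
    # Matched design: use the risk table of each subject's own stratum
    match <- data[, match.nm]
    inclusion.prob <- apply(cbind(t.entry, t.exit, match), 1, function(x) {
      i.design.strata <- as.vector(paste(x[-(1:2)], collapse = ":"))
      CS <- risk.table[[i.design.strata]]$cases
      AR <- risk.table[[i.design.strata]]$at.risk
      FT <- risk.table[[i.design.strata]]$failure.time
      p <- pmin(1, controls * CS / (AR - CS))
      p[FT > as.numeric(x[2]) | FT < as.numeric(x[1])] <- 0  # zero when not at risk
      1 - prod(1 - p)  # inclusion prob
    })
  } else {
    stop("risk.table must be a data frame without matching, or per-stratum tables with matching")
  }
  inclusion.prob[data[, fail.nm] == 1] <- 1  # cases are always included
  return(inclusion.prob)
}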
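To see what these steps do, here is a minimal hand-worked sketch on a toy cohort. All numbers and variable names in it (time, delta, m, p.draw, incl) are made up for illustration; it assumes no delayed entry and no matching, and it reproduces the logic in base R without calling the two functions above.

```r
# Toy cohort (hypothetical numbers): five subjects, no delayed entry,
# no matching, and m = 1 control sampled per case.
time  <- c(2, 3, 5, 7, 8)   # exit times
delta <- c(0, 1, 0, 1, 0)   # 1 = failure (case), 0 = censored
m     <- 1

# Risk table: distinct failure times, cases, and number at risk at each
ft      <- sort(unique(time[delta == 1]))
cases   <- vapply(ft, function(s) sum(time == s & delta == 1), integer(1))
at.risk <- vapply(ft, function(s) sum(time >= s), integer(1))

# Probability of being drawn as a control at each failure time
p.draw <- pmin(1, m * cases / (at.risk - cases))

# Inclusion probability: 1 for cases; otherwise one minus the probability
# of never being drawn while at risk
incl <- vapply(seq_along(time), function(i) {
  if (delta[i] == 1) return(1)
  p <- p.draw
  p[ft > time[i]] <- 0      # not at risk after exit
  1 - prod(1 - p)
}, numeric(1))

print(data.frame(failure.time = ft, cases, at.risk, p.draw))
print(round(incl, 3))
```

In this toy cohort, the subject censored at time 2 leaves before the first failure and so has inclusion probability 0, while the subject censored at time 5 is at risk only at the failure at time 3, where one control is drawn from three non-cases, giving 1/3.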

Using the two functions, you can compute the inclusion probabilities as follows:

t.entry.nm <- NULL                        #Entry time
t.exit.nm1 <- "time.to.X"                 #Time to  failure
  fail.nm1 <- "X.delta"                   #Indicator for failure
  match.nm <- c("GENDER","HeavyDrinking") #Matching variables
         m <- 2                           #Number of controls

design.strata <- as.vector(apply(dt[,match.nm],1,paste,collapse=":"))

risk.table.strata <- by(dt,design.strata, function(x){risk.table.f(fail.nm=fail.nm1, x, t.entry.nm=t.entry.nm, t.exit.nm=t.exit.nm1)})

inclusion.prob <- inclusion.prob.ncc.f(data=dt, t.entry.nm=t.entry.nm, t.exit.nm=t.exit.nm1, fail.nm=fail.nm1, controls=m, risk.table=risk.table.strata, match.nm=match.nm)

Take the reciprocal of each inclusion probability to obtain the weights, and add them as a column (‘wt’) to your nested case-control study data. For our illustration, I’m going to call the data frame ‘nccdata’. To fit the Cox model with the IPW method, use the following command. You can find justification for this method in my 2013 article.

library(survival)
fit<-coxph(formula=Surv(time.to.X, X.delta) ~ GENDER + HeavyDrinking + cluster(ID), data=nccdata, weights=wt)
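The wt column used in the command above could be built along these lines. This is only a sketch: the toy cohort, the sampled.ids vector, and the ID column are assumptions for illustration and are not part of the data shown earlier.

```r
# Hypothetical cohort-level results: subject ID and inclusion probability
cohort <- data.frame(ID = 1:5, inclusion.prob = c(0, 1, 1/3, 1, 1))

# Hypothetical IDs of the subjects selected into the nested case-control sample
sampled.ids <- c(2, 3, 4)

# Keep only the sampled subjects; a sampled subject cannot have probability 0
nccdata <- cohort[cohort$ID %in% sampled.ids, ]
stopifnot(all(nccdata$inclusion.prob > 0))

# Weight = reciprocal of the inclusion probability
nccdata$wt <- 1 / nccdata$inclusion.prob
print(nccdata)
```

Cases get weight 1, and a control with inclusion probability 1/3 gets weight 3, so it stands in for three cohort members like itself.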

If you are performing a secondary outcome analysis from the nested case-control study, use the following command. Notice that the weights are computed based on the primary outcome (X) while the risk sets and failure times are based on the secondary outcome (Y). You can find justification for this method in my 2014 article.

fit<-coxph(formula=Surv(time.to.Y, Y.delta) ~ GENDER + HeavyDrinking + cluster(ID), data=nccdata, weights=wt)

References

Samuelsen SO, A Pseudolikelihood Approach to Analysis of Nested Case-Control Studies, Biometrika, 1997: 84(2): 379-94

Kim RS, Analysis of Nested Case-Control Study Designs: Revisiting the Inverse Probability Weighting Method, Communications for Statistical Applications and Methods, 2013: 20(6): 455-66

Kim RS, Kaplan R, Analysis of Secondary Outcomes in Nested Case-Control Study Designs, Statistics in Medicine, 2014: 33(24): 4215-26

Kim RS. R code: computation of inclusion probabilities in nested case-control studies, 2014 Oct. Retrieved from http://missionalconsulting.com/methods
