Estimating R_0 and other parameters for the 2019-nCov epidemic

Summary

The epidemiology of the global 2019-nCov is poorly understood. Identifying the key processes that shape transmission and estimating the relevant model parameters is therefore an important task. This document presents arguments and analysis to support the estimation of a number of key quantities.

Findings are preliminary and subject to change, pending future changes in the underlying data. Results have not been peer-reviewed, but have been prepared to a professional standard with the intention of providing useful information about a rapidly developing event.

Key parameters investigated in this document include:

Data

Key resources used for this investigation include:

A record of cumulative confirmed case reports at the national level maintained by the Drake lab at the University of Georgia.
A record of case reports at the province level maintained by the Drake lab at the University of Georgia.
A ``line list’’ maintained by Mortiz Kraemer containing case level information, including start dates for key individual events (presentation of symptoms, hospitalization, case notification, etc.)
A ``line list’’ maintained by MOBS containing case level information, including start dates for key individual events (presentation of symptoms, hospitalization, case notification, etc.)

Note: There are known discrepancies among these data sets.

Epidemic curve

December 1, 2019 is taken to be the notional time of first infection for the purposes of this analysis.

The epidemic curve associated with the national level data is presented below.

New daily cases. On Feb 12, clinical cases were listed as confirmed cases. Clinical cases are cases that had signs of COVID but did not test positive on a COVID genetic test or were not yet tested.

The epidemic curve for the province level data, highlighting Hubei, is presented next.

New daily cases inside and outside of Hubei. On Feb 12, clinical cases were listed as confirmed cases. Clinical cases are cases that had signs of COVID but did not test positive on a COVID genetic test or were not yet tested.

Incubation period

Previous work on distribution of incubation period estimated the incubation period to vary from around 1-14 days and be estimated by gamma distribution with mean of 5.6 and shape parameter 6.2 (Backer pre-print).

We use the MK linelist to estimate incubation period. We assumed incubation period was the delay in between day of travel to Wuhan and the symptom onset.

Only cases outside Wuhan were used in this calculation since we used travel to Wuhan as the time of exposure.

We fit gamma and Erlang distributions to these intervals. The Erlang distribution is a special case of a gamma distribution with an integer shape parameter. The PDF of the Erlang distribution is

\[\begin{equation} f(x; k, \lambda) = \frac{\lambda^{k}x^{k-1}e^{-\lambda x}}{(k-1)!} \end{equation}\]

for rate parameter \(\lambda \geq 0\). We estimate the Erlang distribution of isolation intervals numerically and analytically by optimizing the negative log likelihood. The MLE of this distribution can be solved by finding the derivative of the log likelihood function:

\[\begin{equation} L = n k log(\lambda) + (k-1) \sum_{i=1}^{n}{x_i} - \lambda \sum_{i=1}^{n}{x_i} - n log(k-1)! \end{equation}\]

which has solution \(\hat{\lambda}=\frac{k}{\sum_{i=1}^{n}{x_i}}\).

Delay between travel to Wuhan and symptom onset of nCov cases.

Erlang and gamma distribution parameters for the incubation period.
Distribution	shape	rate	mean
Erlang	2.000	0.373	5.36
Gamma	1.519	0.283	5.36

Isolation rate: Interval between onset of symptoms and hospital admission

Here we investigate the interval between symptom onset and hospital admission (coinciding with isolation from the general population). The case notification rate is estimated from the Kraemer line list.

Negative log likelihood curves of Erlang shape parameter (k) given optimal rate parameter for the incubation period (left) and isolation interval (right). Blue line shows the MLE of the rate parameter (lambda) and the dashed grey line shows the estimates within 2 log units of the MLE.

We find that the best Erlang distribution for the isolation interval pre-January 19 is \(\text{Erlang}(k=2, \lambda=0.44)\).

Fit of Erlang and gamma distributions to isolation interval data from Kraemer line list.

Erlang and gamma distribution parameters for the isolation interval.
Distribution	shape	rate	mean
Erlang	2.000	0.445	4.495
Gamma	2.153	0.479	4.495

We see that the gamma and erlang distribution are nearly identical for the isolation interval.

However, we find the isolation interval does not appear to be constant over time. There is a decline in an approximately linear decline after a certain point. To obtain an estimate of the rate of change over time, we use a stepwise linear regression. For breakpoints, we used two values. On Jan 19, China increased testing in Wuhan suggesting a natural breakpoint when the interval to admission starts to decline. After plotting the data, we noticed another potential breakpoint that seemed to show a starker contrast between interval times: Jan 15th. We have shown results for both breakpoints.

Here we plot isolation rates over time inside and outside of Hubei.

Regression summaries of isolation rate linear models inside and outside of Hubei. Pre Jan 14, and Post Jan 14 data points and observations are shown for each model.
model	obs_pre	obs_post	slope_pre	slope_post
all ~ day	63	230	0	0.02
hubei ~ day	29	3	0.002	0.014
not_hubei ~ day	34	228	0	0.02

We see that there are few data points in Hubei for which symptom and admission data are available currently making it difficult to ascertain the trend in Hubei post Jan 14.

There was a non-significant trend in the isolation rate, as a result we will use the mean isolation rate for Pre Jan 14 and then the increasing trend after this date.

Rate of symptoms to removal.

Since there is no significant trend in isolation rates prior to day 45, for modeling purposes, our isolation rate is:

\[\begin{equation} \gamma(t) = \begin{cases} 0.157, & \text{if } t < 45\\ 0.157 + .02 t, & \text{otherwise} \end{cases} \end{equation}\]

Notification rate: Interval between hospital admission and case notification

Here we investigate the interval between hospital admission (coinciding with isolation from the general population) and case notification (at which time the case appears in the data set.) We denote the interval between admission and notification by \(\tau_1\) and define the case notification rate as \(\eta=1/\tau_1\), which is presumed to be a function of time. The case notification rate is estimated from the Kraemer line list. The decline in average notification interval is approximately linear. A linear regression is used to obtain an estimate of the rate of change over time. The inverse of this is a hyperbolic function for the increase in case notification rate, which diverges around Day 54.

## 
## Call:
## lm(formula = lag ~ day, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7289 -1.6276 -0.4249  1.1346  9.1873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.30181    1.00705   15.20   <2e-16 ***
## day         -0.22027    0.01824  -12.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.243 on 447 degrees of freedom
## Multiple R-squared:  0.246,  Adjusted R-squared:  0.2443 
## F-statistic: 145.9 on 1 and 447 DF,  p-value: < 2.2e-16

For modeling purposes, we assume that the maximum case notification rate is one (i.e. one day between admission and notification) since this is the reporting interval of almost all case reports. Thus, our model of the notification rate is

\[\begin{equation} \eta(t) = \begin{cases} (-0.28t+18.3)^{-1}, & \text{if } t \leq 55 \\ 1, & \text{otherwise} \end{cases} \end{equation}\]

Fatality rate: Interval between case notification and death reports

Here, we investigate the fatality rate of nCov in China.

We look at the interval between hospitalization and date of death in linelist data over time to see how much of a lag in death reports there may be. We use hospitalization as a proxy for date of confirmation since there seems to be more data. The lag between hospitalization and confirmation is small and decreasing over time.

Hospitalization to death.

Assuming that the period in mid-late January is representative of the last couple of days, the lag is about 5 days. Thus, we plot the cumulative cases and deaths reported over time and with a lag of 5 days.

Cumulative cases and deaths in China since Dec 29.

## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

Data summary of incubation, isolation, reporting, and fatality intervals. Red points show Hubei data points, grey points show other province data points or undisclosed locations. There are more points from outside Hubei mostly due to the ability to calculate exposure time and since Hubei hosptials and clinics may have had so many cases, it may have been difficult to collect and disseminate individual level data.

Interval distributions during and after key interventions.

Estimating \(R_0\) and \(R_{eff}\)

\(R_0\), the basic reproduction number, refers to the average number of secondary cases expected to arise from a single infected individual in a wholly susceptible population. \(R_{eff}\), the effective reproduction number, refers to the expected number of secondary cases to arise from an arbitrary case at any point in time. \(R_{eff}\) is expected to change over the course of an outbreak. Containment will occur when \(R_{eff}<1\).

Estimating \(R_0\) in this outbreak is challenging because: 1. There is little information from the first few infection generations 2. The distribution of incubation period and time from presentation of symptoms to hospitalization are not exponetially diastributed 3. Interventions and policies intended to curtail the outbreak have affected the unfolding process and are therefore reflected in the case notification data.

Takeoff estimators

Here we consider ‘’takeoff’’ estimators (e.g. Wearing & Rohani). Wearing & Rohani show that dynamics of an epidemic in the early phases are related to \(R_0\). First, we plot case notifications over time from the start of the outbreak on 1 December 2020. A regression line through these points (neglecting the first couple), clearly does not go through the origin, suggesting that the epidemic was already in its exponential phase by Day 47 (Jan 16). If \(\gamma\) were constant during this period after Day 47 (which it’s not!), the number of cases would be expected to grow according to the proportionality

\[\begin{equation} log(X_t) \propto (R_0-1)\gamma t, \end{equation}\]

implying that a linear fit of \(\log(X_t)\) against \(t\) provides sufficient information to estimate \(R_{eff}\) via the relation \(R_{eff} = \lambda_1 / \gamma + 1\), where \(\lambda_1\) is the slope coefficient of the regression. We therefore fit a simple linear regression with log-transformed case data.

## 
## Call:
## lm(formula = log(cases) ~ as.numeric(day), data = data.china[8:dim(data.china)[1], 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0354 -0.3739  0.0466  0.4744  0.7284 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7.39654    0.82287  -8.989 1.86e-09 ***
## as.numeric(day)  0.26040    0.01348  19.315  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5762 on 26 degrees of freedom
## Multiple R-squared:  0.9348, Adjusted R-squared:  0.9323 
## F-statistic: 373.1 on 1 and 26 DF,  p-value: < 2.2e-16

Although \(\gamma\) is changing, we can nevertheless use this method to provide upper and lower bounds on \(R_{eff}\) by looking at the plausible range of \(\gamma\). Recalling that \(R_{eff} = \lambda_1 / \gamma + 1\) and substituting the slope coefficient \(\lambda_1=\) 0.2603954, the upper bound is obtained by assuming \(\gamma =\) \(0.157\), yielding \(R_{eff} = coef(model)[2]) * (1/meanPreJan15) + 1 \approx\) \(2.7\). In recent days, with expanded testing, \(\gamma\) appears to be \(\approx 1\), yielding \(R_{eff} = coef(model)[2]) \times 1 + 1 \approx\) \(1.3\).

A second approach looks at the first significant report in these data (41 cases on Day 41, Jan 10). During the pre-exponential phase of the epidemic, the takeoff rate is given by

\[\begin{equation} R_0 = \lambda_2(\lambda_2/(\sigma m)+1)^m/(\gamma (1-(\lambda_2/(\gamma n)+1)^{-n})), \end{equation}\]

where \(\lambda_2\) is the slope of the takeoff at the beginning of the epidemic, \(1/\sigma\) is the average latent period, \(1/\gamma\) is the average infectious period and \(m\) and \(n\) are the parameters of Erlang distributions for the interval (Wearing and Rohani).

For assumed initial case on Day \(t_1=1\) and \(x=41\) cases on Day \(t=41\), we have \(\lambda_2 \approx \frac{\log x}{t-t_1} = \frac{\log 41}{41-1} = 0.0928\). Inserting into the formula yields the estimate:

## [1] "R0=2.08"

## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted
## from logarithmic plot

A similar analysis using the Wearning et al. estimation method can be performed using the province level data. For each province, we determine the period of exponential growth as the first date of 3 successive non-NA values. Then for each province, we estimate \(R_0\) in 3 ways:

Assuming the recovery rate is equal to the pre-intervention value (0.157)
Assuming the recovery rate is equal to the current recovery rate (\(1/1 = 1^-\text{day}\))
Assuming that the recovery rate is equal to the value obtained by linear regression at the date of each province’s start date.

Initial R0 by province plotted against outbreak start time for that province. Start time was estimated when that province had 3 continuous reports of cases begin to occur. The point estimate shows the recovery rate estimated with linear regression given the province start day.

Case detection rate

Here we present an argument for how to estimate the case detection rate of 2019-nCov infections in Wuhan during January 2020. The idea is to use outcome data on a cohort of patients from early in the epidemic to estimate the size of the population from which they were sampled by comparing with estimates of case fatality rate from later in the epidemic.

The logic is as follows.

As time \(s\) there are \(I_s\) infections in the population, of which \(D_s\) will result in death. The case fatality rate is therefore defined as

\[\begin{equation} \label{eq:cfr} CFR = \frac{D_s}{I_s}. \end{equation}\]

It is assumed that all deaths are known cases, but that there are an unknown number of infections that have gone undetected. The detection rate is \(q\) such that

\[\begin{equation} \label{eq:cases} C_s = qI_s \rightarrow I_s = C_s/q. \end{equation}\]

A fraction \(p\) of the population were are enrolled in cohort \(m\) at time \(s\). Since the outcome (death or recovery) isn’t known at the time of enrollment, we assume that patients with death and recovery outcomes are included in the cohort in proportion to their fequency in the population, i.e., \(D_m = pD_s\) is the number of patients in the cohort whose infections will result in death and \(C_m = pC_s\) is the number of patients in the cohort whose infections will result in recovery. Rearranging these equations, we have \(D_s = D_m/p\) and \(C_s=C_m/p\). Just making substitutions, we have

\[\begin{equation} CFR = \frac{D_s}{I_s} = q\frac{D_s}{C_s} = q \frac{pD_m}{pC_m} = q\frac{D_m}{C_m} = q (CFR_m). \end{equation}\]

where the ratio \(CFR_m = \frac{D_m}{C_m}\) is just the case fatality rate in cohort \(m\). Rearranging, we have a formula for \(q\).

\[\begin{equation} q = \frac{CFR}{CFR_m}. \end{equation}\]

Ghani et al have shown that, during the initial phase of an outbreak when the outcomes of all cases are not yet known, the simple case fatality estimator

\[\begin{equation} \label{eq:e1} e_1=D_m/C_m \end{equation}\]

typically provides an underestimate due to right censoring. A better estimator is provided by the formula

\[\begin{equation} \label{eq:e2} e_2(s) = \frac{D_m}{D_m+R_m}, \end{equation}\]

where \(D_m\) is understood to be the number of patients in cohort \(m\) that have died at the time of estimation and \(R_m\) is the number that have recovered, while the outcomes of the remaining \(X_m = C_m-D_m-R_m\) are unknown.

Chen et al. report on \(C_m=99\) patients treated in Wuhan Jinyintan Hospital from Jan 1 to Jan 20, 2020. All 2019-nCov patients were eligible for treatment in the hospital and all treated patients were enrolled in the study. The majority (possibly all) patients would have been infected prior to widespread testing which began on January 19 and before high case fatality rate was widely known. Thus, it is reasonable to assume that, at least for the purpose of statistical analysis, this cohort represents a random sample of the case detection process during this time.

Chen et al. report that, at the time of their writing, \(D_m=11\) of 99 patients died, \(R_m=31\) patients had been discharged, and \(C_m-D_m-R_m=57\) patients remained in the hospital.

Using equation , the cohort case fatality rate on January 25 (the final day in the study) may be estimated to be \(11/(11+31)=0.262\). Several news sources (https://www.nytimes.com/interactive/2020/world/asia/china-coronavirus-contain.html#virulence, https://www3.nhk.or.jp/news/html/20200124/k10012257631000.html) report that the actual case fatality rate may be around 3% (attributing this information to the WHO). It is unclear if this estimate is based estimator or an analysis that either excludes or accounts for right censoring. If the estimate of 3% is close to the truth, then we have a case detection rate for January 1 - January 20 of \(q=0.03/0.262 \approx 0.11\).

We can estimate the uncertainty in this number by randomly sampling from among the \(11+31=42\) cases for which outcome is known and computing a distribution.

##       2.5%      97.5% 
## 0.07411765 0.21000000

Transmissibility (\(\beta\))

Transmissibility (\(\beta\)) is defined in the usual way by rearranging the equation \(R_0 = \frac{\beta}{\gamma}\) to give \(\beta = R_0 \times \gamma\). Taking our estimate from the early stages of the outbreak and fixed infectious period (\(\gamma = 1/7\)) we conclude that \(\beta\) ranges from \(\beta = R_0 \times \gamma = 1.6/7 = 0.229\) to \(\beta = R_0 \times \gamma = 2.3/7 = 0.3299\). For comparison, if we use the Imperial College estimate of \(R_0=2.6\), we obtain \(\beta = 2.6/ 7 = 0.371\).

Additional parameters

Population size of Wuhan: \(N \approx 11,081,000\) (Wikipedia)

Population size of Hubei: \(N \approx 59,002,000\) (Wikipedia)

Estimating \(R_0\) and other parameters for the 2019-nCov epidemic

Paige Miller, Pej Rohani, John Drake

February 14, 2020