├── LICENSE.md ├── README.md ├── code ├── TCM_log_linear_reg_model.Rmd ├── TCM_log_linear_reg_model.pdf ├── TCM_poisson_reg_model.Rmd ├── TCM_poisson_reg_model.pdf ├── f1.m ├── f2.m ├── f3.m ├── f4.m ├── f5.m ├── fit_arrvial_model.R └── mcmcpredplot.m └── data ├── data_fig_2.xlsx ├── data_fig_S1A.xlsx ├── data_fig_S2.xlsx ├── data_fig_S4A.xlsx ├── data_fig_S4B.xlsx ├── fit_arrival_model.xlsx ├── health commission.xlsx ├── nCoV-data.xlsx └── nCoV-data_0323.xlsx /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Tian et al. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # COVID-19_TCM-50d_China 2 | [![DOI](https://zenodo.org/badge/249884240.svg)](https://zenodo.org/badge/latestdoi/249884240) 3 | 4 | Code for: An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China 5 | 6 | ## Citation 7 | 8 | An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China 9 | 10 | Huaiyu Tian, Yonghong Liu, Yidan Li, Chieh-Hsi Wu, Bin Chen, Moritz U.G. Kraemer, Bingying Li, Jun Cai, Bo Xu, Qiqi Yang, Ben Wang, Peng Yang, Yujun Cui, Yimeng Song, Pai Zheng, Quanyi Wang, Ottar N. Bjornstad, Ruifu Yang, Bryan T. Grenfell, Oliver G. Pybus, and Christopher Dye. 11 | 12 | DOI: 10.1126/science.abb6105 13 | 14 | ## Abstract 15 | 16 | Responding to an outbreak of a novel coronavirus (agent of COVID-19) in December 2019, China banned travel to and from Wuhan city on 23 January and implemented a national emergency response. We investigated the spread and control of COVID-19 using a unique data set including case reports, human movement and public health interventions. The Wuhan shutdown was associated with the delayed arrival of COVID-19 in other cities by 2.91 days (95%CI: 2.54-3.29). Cities that implemented control measures pre-emptively reported fewer cases, on average, in the first week of their outbreaks (13.0; 7.1-18.8) compared with cities that started control later (20.6; 14.5-26.8). Suspending intra-city public transport, closing entertainment venues and banning public gatherings were associated with reductions in case incidence. The national emergency response appears to have delayed the growth and limited the size of the COVID-19 epidemic in China, averting hundreds of thousands of cases by 19 February (day 50). 
17 | 18 | ## Notes on the code 19 | 20 | To run, you need a Matlab toolbox called "DRAM": 21 | DRAM is a combination of two ideas for improving the efficiency of Metropolis-Hastings type Markov chain Monte Carlo (MCMC) algorithms, Delayed Rejection and Adaptive Metropolis. The linked page explains the basic ideas behind DRAM and provides examples and Matlab code for the computations (see http://helios.fmi.fi/~lainema/dram/). 22 | 23 | The TCM regression analysis was performed using R 3.6.2 on macOS version 10.15.3 (19D76). 24 | 25 | ## Data 26 | 27 | ### Epidemiological data 28 | 29 | The COVID-19 case data and transmission control measures implemented between 31 December 2019 and 19 February 2020 were collected from the official reports of the health commissions of 34 provincial-level administrative units and 342 city-level units. The information was collected by Bingying Li. 30 | 31 | ### Mobility data 32 | 33 | Human movements were tracked with mobile phone data, through location-based services (LBS) employed by popular Tencent applications such as WeChat and QQ. Movement outflows from Wuhan City to other cities (i.e. records of the number of people leaving each day) by air, train and road were obtained from the migration flows database (https://heat.qq.com/) from 13 January 2017 to 21 February 2017 (Spring Festival travel 2017), from 1 February 2018 to 12 March 2018 (Spring Festival travel 2018), and from 1 January 2018 to 31 December 2018 (entire 2018). Tencent's LBS data were collected and processed by Dr. Bin Chen and Dr. Yimeng Song. 34 | 35 | To reconstruct the movement outflow from Wuhan during the 2020 Spring Festival (from 11 January to 25 January, before the Chinese Lunar New Year), mobile phone data (provided by the telecommunications operators) were used together with the Baidu migration index (http://qianxi.baidu.com/); using both data sources gave the most accurate measure of movement volume.
The expected movement outflows from Wuhan after the New Year festival cover the period from 26 January to 19 February. 36 | 37 | ## License 38 | 39 | (see LICENSE) 40 | 41 | Additional license, warranty, and copyright information 42 | 43 | We provide a license for our code (see LICENSE) and do not claim ownership of, nor the right to license, the data we have obtained nor any third-party software tools/code used in our analyses. Please cite the appropriate agency, paper, and/or individual in publications and/or derivatives using these data, contact them regarding the legal use of these data, and remember to pass forward any existing license/warranty/copyright information. THE DATA AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA AND/OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATA AND/OR SOFTWARE. 44 | -------------------------------------------------------------------------------- /code/TCM_log_linear_reg_model.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Log-linear regression analysis to investigate the associations between transmission 3 | control measures and the number of reported cases in the first seven days of the 4 | outbreaks in cities" 5 | author: "Chieh-Hsi Wu" 6 | output: 7 | pdf_document: 8 | extra_dependencies: ["xcolor"] 9 | toc: true 10 | toc_depth: 3 11 | html_notebook: default 12 | html_document: 13 | df_print: paged 14 | --- 15 | 16 | 17 | Here we investigate the associations between transmission control measures and the number of reported cases in the first week of the outbreaks in cities.
18 | 19 | ```{r load_library} 20 | library(readxl) 21 | library(caret) 22 | library(lmtest) 23 | library(boot) 24 | library(R330) 25 | ``` 26 | 27 | 28 | ```{r define_functions} 29 | # function to obtain regression coefficients 30 | bs = function(formula, data, indices) { 31 | # allows boot to select sample 32 | d <- data[indices,] 33 | fit <- lm(formula, data=d) 34 | return(coef(fit)) 35 | } 36 | 37 | 38 | getSgnChgIndex = function(ci = NULL, a = NULL){ 39 | colnames(ci) = a 40 | rownames(ci) = c("CI level", 41 | "lower index", "upper index", 42 | "95% CI lower bound", "95% CI upper bound") 43 | 44 | ci = t(ci[4:5,]) 45 | cprCIBndSgn = apply(ci,1, 46 | function(z){ 47 | sgn = z>0; 48 | return(sgn[1]==sgn[2]) 49 | }) 50 | return(max(which(cprCIBndSgn))) 51 | } 52 | 53 | ``` 54 | 55 | 56 | ```{r read_in_data} 57 | covid2019FilePath = ".../data/nCoV-data.xlsx" 58 | covid2019.df = read_excel(path = covid2019FilePath, sheet = "3resp-7days") 59 | covid2019Dist.df = read_excel(path = covid2019FilePath, sheet = "dist-296") 60 | 61 | covid2019v2.df = read_excel(path = ".../data/nCoV-data_0323.xlsx", 62 | sheet = "3resp-7days") 63 | ``` 64 | 65 | # Processing the data 66 | 67 | Some of the cities have an inflow from Wuhan recorded as 0, which causes calculations to run into an error when we use it as an offset variable. 68 | To resolve this issue, 0 values are changed to $10^{-6}$, which is equivalent to only one person arriving in a city from Wuhan. 69 | 70 | ```{r process_totalflow, echo = F} 71 | covid2019.df$new.totalflow_million = covid2019.df$totalflow_million 72 | covid2019.df$new.totalflow_million[covid2019.df$totalflow_million == 0] = 1e-6 73 | ``` 74 | 75 | The arrival time is processed so that 31 December 2019 is coded as day 0. 76 | 77 | ```{r process_arrival_time} 78 | covid2019.df$new.arr.time = covid2019.df$arr.time - 1 79 | ``` 80 | 81 | The timing of suspending intra-city public transport is processed so that 31 December 2019 is coded as day 0.
82 | 83 | ```{r process_TTCM_bus} 84 | bus.resp.tab = table(covid2019.df$Bus.resp) 85 | bus.resp.tab 86 | bus.date.tab1 = table(covid2019.df$Bus.date[which(covid2019.df$Bus.resp==1)]) 87 | bus.date.tab1 88 | 89 | covid2019.df$new.Bus.date = covid2019.df$Bus.date - 1 90 | new.bus.date.tab1 = table(covid2019.df$new.Bus.date[which(covid2019.df$Bus.resp==1)]) 91 | new.bus.date.tab1 92 | 93 | covid2019.df$new.Bus.date[which(covid2019.df$Bus.resp==0)] = 0 94 | new.bus.date.tab = table(covid2019.df$new.Bus.date) 95 | new.bus.date.tab 96 | 97 | ## Sanity check 98 | ## Should return 0(s). 99 | # bus.date.tab1 - new.bus.date.tab1 100 | # (as.numeric(names(bus.date.tab1)) - 1) - as.numeric(names(new.bus.date.tab1)) 101 | # bus.date.tab1 - new.bus.date.tab[-1] 102 | # (as.numeric(names(bus.date.tab1)) - 1) - as.numeric(names(new.bus.date.tab[-1])) 103 | # bus.resp.tab["0"] - new.bus.date.tab["0"] 104 | ## Sanity check complete 105 | ``` 106 | 107 | The timing of suspending inter-city passenger traffic is processed so that 31 December 2019 is coded as day 0. 108 | 109 | 110 | ```{r process_TTCM_railway, echo = F} 111 | rail.resp.tab = table(covid2019.df$Railway.resp) 112 | rail.resp.tab 113 | rail.date.tab1 = table(covid2019.df$Railway.date[which(covid2019.df$Railway.resp == 1)]) 114 | rail.date.tab1 115 | 116 | 117 | covid2019.df$new.Railway.date = covid2019.df$Railway.date - 1 118 | new.rail.date.tab1 = table(covid2019.df$new.Railway.date[which(covid2019.df$Railway.resp == 1)]) 119 | new.rail.date.tab1 120 | 121 | covid2019.df$new.Railway.date[which(covid2019.df$Railway.resp == 0)] = 0 122 | new.rail.date.tab = table(covid2019.df$new.Railway.date) 123 | new.rail.date.tab 124 | 125 | ## Sanity check 126 | ## Should return 0(s). 
127 | # rail.date.tab1 - new.rail.date.tab1 128 | # (as.numeric(names(rail.date.tab1)) - 1) - as.numeric(names(new.rail.date.tab1)) 129 | # rail.date.tab1 - new.rail.date.tab[-1] 130 | # (as.numeric(names(rail.date.tab1)) - 1) - as.numeric(names(new.rail.date.tab[-1])) 131 | # rail.resp.tab["0"] - new.rail.date.tab["0"] 132 | ## Sanity check complete 133 | 134 | ``` 135 | 136 | 137 | The timing of the closure of entertainment venues and banning of public gatherings is processed so that 31 December 2019 is coded as day 0. 138 | 139 | 140 | ```{r process_TTCM_enter, echo = F} 141 | enter.resp.tab = table(covid2019.df$Enter.resp) 142 | enter.resp.tab 143 | enter.date.tab1 = table(covid2019.df$Enter.date[which(covid2019.df$Enter.resp == 1)]) 144 | enter.date.tab1 145 | 146 | covid2019.df$new.Enter.date = covid2019.df$Enter.date - 1 147 | new.enter.date.tab1 = table(covid2019.df$new.Enter.date[which(covid2019.df$Enter.resp == 1)]) 148 | new.enter.date.tab1 149 | 150 | covid2019.df$new.Enter.date[which(covid2019.df$Enter.resp == 0)] = 0 151 | new.enter.date.tab = table(covid2019.df$new.Enter.date) 152 | new.enter.date.tab 153 | 154 | ## Sanity check 155 | ## Should return 0(s). 156 | # enter.date.tab1 - new.enter.date.tab1 157 | # (as.numeric(names(enter.date.tab1)) - 1) - as.numeric(names(new.enter.date.tab1)) 158 | # enter.date.tab1 - new.enter.date.tab[-1] 159 | # (as.numeric(names(enter.date.tab1)) - 1) - as.numeric(names(new.enter.date.tab[-1])) 160 | # enter.resp.tab["0"] - new.enter.date.tab["0"] 161 | ## Sanity check complete 162 | ``` 163 | 164 | 165 | 166 | # Fitting a log-linear regression model 167 | 168 | First we create a new response variable, which is the incidence per capita divided by the inflow from Wuhan.
169 | 170 | ```{r create_standardised_7day_cases} 171 | covid2019.df$std.sevendays.cucase = 172 | covid2019.df$sevendays.cucase/ 173 | (covid2019.df$Pop_million_2018*covid2019.df$new.totalflow_million) 174 | ``` 175 | 176 | After some exploration, we found that there is a non-linear relationship between the dependent variable and the timing of the closure of entertainment venues and banning of public gatherings. 177 | Therefore we have used a squared term of that timing variable to account for it. 178 | The log-linear model is summarised below. 179 | 180 | ```{r lm_add} 181 | yfc.resp.date.lm = lm(log(std.sevendays.cucase) ~ new.arr.time + log10.Dis.WH + 182 | Bus.resp + new.Bus.date + 183 | Railway.resp + new.Railway.date + 184 | Enter.resp + new.Enter.date + I(new.Enter.date^2), 185 | data = covid2019.df) 186 | summary(yfc.resp.date.lm) 187 | ``` 188 | 189 | The influence plots indicate an apparent influential point. 190 | 191 | 192 | ```{r lm_add_diagnostics} 193 | par(mfrow = c(2, 4)) 194 | influenceplots(yfc.resp.date.lm) 195 | ``` 196 | 197 | For interpretability, instead of including a squared term we discretise the timing variable, and re-fit the model with the discretised variable. 198 | 199 | ```{r lm1_add} 200 | covid2019.df$new.Enter.date.cat = cut(covid2019.df$new.Enter.date, c(-1, c(23, 24, 25, 36) -1)) 201 | covid2019.df$new.Enter.date.cat = as.numeric(covid2019.df$new.Enter.date.cat) -1 202 | 203 | yfc.resp.date.lm1 = lm(log(std.sevendays.cucase) ~ new.arr.time + log10.Dis.WH + 204 | Bus.resp + new.Bus.date + 205 | Railway.resp + new.Railway.date + 206 | Enter.resp + new.Enter.date.cat, 207 | data = covid2019.df) 208 | summary(yfc.resp.date.lm1) 209 | ``` 210 | 211 | After discretising the timing variable, the observation flagged as influential is no longer problematic. 212 | The Cook's distances do not indicate the presence of outliers. 213 | There are a number of observations indicated as having large hat values.
214 | However, they are not too far away from the cutoff, and their hat values are quite similar to each other. 215 | So it is hard to justify removing only a subset of those points, and there are quite a few of them, so we leave them in the model. 216 | 217 | 218 | ```{r lm1_infl_plots} 219 | par(mfrow = c(2,4)) 220 | influenceplots(yfc.resp.date.lm1) 221 | ``` 222 | 223 | The studentized Breusch-Pagan test provides evidence for heteroscedasticity in the residuals. 224 | 225 | ```{r yfc.resp.date.lm1_test} 226 | lmtest::bptest(yfc.resp.date.lm1) 227 | ``` 228 | 229 | After some exploration, we found that heteroscedasticity occurs when we include either the log10 of the distance to Wuhan, or the binary and timing variables for the closure of entertainment venues and banning of public gatherings. 230 | Coincidentally, the models above show that we have no evidence for associations between those three variables and the dependent variable. 231 | 232 | The conclusions regarding the rest of the variables do not change. 233 | 234 | ```{r lm2_fitted_vs_residuals, echo = F} 235 | yfc.resp.date.lm2 = lm(log(std.sevendays.cucase) ~ new.arr.time + 236 | Bus.resp + new.Bus.date + 237 | Enter.resp + new.Enter.date.cat, 238 | data = covid2019.df) 239 | summary(yfc.resp.date.lm2) 240 | ``` 241 | 242 | The plot of studentised residuals does not indicate evident non-linearity in the residuals. 243 | 244 | ```{r lm2_add_diagnostics, fig.height = 6, fig.width = 6.5} 245 | par(mfrow = c(2,2)) 246 | plot(yfc.resp.date.lm2) 247 | ``` 248 | 249 | The influence plots do not provide strong indication of any points that are particularly influential and need to be removed. 250 | 251 | ```{r yfc_resp_date_lm2_influ_plot} 252 | par(mfrow = c(2,4)) 253 | influenceplots(yfc.resp.date.lm2) 254 | ``` 255 | 256 | The studentized Breusch-Pagan test does not provide evidence for heteroscedasticity in the residuals.
257 | 258 | ```{r yfc.resp.date.lm2_bptest} 259 | lmtest::bptest(yfc.resp.date.lm2) 260 | ``` 261 | 262 | ```{r lm2_add_est} 263 | yfc.resp.date.lm2.est = coef(summary(yfc.resp.date.lm2)) 264 | yfc.resp.date.lm2.est.tab = cbind(yfc.resp.date.lm2.est[,"Estimate"], 265 | yfc.resp.date.lm2.est[,"Estimate"]-1.96*yfc.resp.date.lm2.est[,"Std. Error"], 266 | yfc.resp.date.lm2.est[,"Estimate"]+1.96*yfc.resp.date.lm2.est[,"Std. Error"]) 267 | colnames(yfc.resp.date.lm2.est.tab) = c("Coefficient", "95% CI lower bound", "95% CI upper bound") 268 | round(yfc.resp.date.lm2.est.tab, 2) 269 | 270 | ``` 271 | 272 | # Diagnostics 273 | 274 | ## Check for spatial correlation in the residuals 275 | 276 | Here we check whether cities that are closer together are going to have more similar residuals. 277 | 278 | ```{r create_dist_matrix_for_cities} 279 | covid2019DistMat = as.matrix(covid2019Dist.df[,-1], nrow = 296, ncol = 296) 280 | rownames(covid2019DistMat) = covid2019Dist.df$code 281 | colnames(covid2019DistMat) = covid2019Dist.df$code 282 | ``` 283 | 284 | 285 | ```{r check_the_order_of_the_matrix} 286 | all(covid2019.df$Code == rownames(covid2019DistMat)) 287 | all(covid2019.df$Code == colnames(covid2019DistMat)) 288 | ``` 289 | 290 | 291 | 292 | ```{r create_residual_dist_mat} 293 | yfc.resp.date.lm2.res = residuals(yfc.resp.date.lm2) 294 | res.dist = as.matrix(dist(yfc.resp.date.lm2.res)) 295 | cityDist = covid2019DistMat 296 | # 262 is NA by mistake in the input file 297 | diag(cityDist)[262] = 0 298 | ``` 299 | 300 | ```{r get_upper_dist_matrix} 301 | ## Only extract the values of the upper triangle of the matrix, as 302 | ## the lower triangle repeats the upper triangle values. 303 | res.dist.unique = res.dist[upper.tri(res.dist)] 304 | city.dist.unique = cityDist[upper.tri(cityDist)] 305 | length(res.dist.unique) == length(city.dist.unique) 306 | 307 | ## Sanity checks 308 | ## The codes below should all return TRUE(s).
309 | # res.dist[1,2] == res.dist.unique[1] 310 | # res.dist[1:2,3] == res.dist.unique[1+1:2] 311 | # res.dist[1:4,5] == res.dist.unique[6+1:4] 312 | # res.dist[1:5,6] == res.dist.unique[10+1:5] 313 | 314 | # cityDist[1,2] == city.dist.unique[1] 315 | # cityDist[1:2,3] == city.dist.unique[1+1:2] 316 | # cityDist[1:4,5] == city.dist.unique[6+1:4] 317 | # cityDist[1:5,6] == city.dist.unique[10+1:5] 318 | ## Sanity checks complete 319 | ``` 320 | 321 | There is no evident correlation between the pairwise residual differences and the pairwise city distances. 322 | 323 | ```{r calculate_correlation_between_geo_distance_and_residual_diff} 324 | cor((res.dist.unique), city.dist.unique) 325 | ``` 326 | 327 | Here we check whether the peak times of inflow from Wuhan correlate with the residuals. 328 | The peak inflow is calculated for the period from 11 January 2020 to 23 January 2020. 329 | 11 January is 15 days before the Chinese New Year, while 23 January is the day of the Wuhan shutdown. 330 | 331 | ```{r} 332 | all(covid2019.df$Code == covid2019v2.df$Code) 333 | par(mfrow = c(1,2)) 334 | plot(x = covid2019v2.df$peak_time2, 335 | y = yfc.resp.date.lm2.res) 336 | ``` 337 | 338 | We do not find evident correlation between the residuals and the peak times of inflow from Wuhan. 339 | 340 | ```{r} 341 | cor(x = covid2019v2.df$peak_time2, 342 | y = yfc.resp.date.lm2.res, use="pairwise.complete.obs") 343 | ``` 344 | 345 | 346 | 347 | ## Check for temporal correlation in the residuals 348 | 349 | Here we check whether there is some temporal autocorrelation in the data. 350 | To this end we evaluate the strength of evidence for the association between the mean residuals with arrival time on day j and the residuals with arrival time on day j+1.
351 | 352 | ```{r Calculate_the_mean_residual_value_for_each_day} 353 | ## Get the unique arrival times 354 | arr.time.level = sort(unique(covid2019.df$arr.time)) 355 | 356 | ## Calculate the mean residual for each arrival time value 357 | res.mean.by.time = vector(length = length(arr.time.level)) 358 | for(i in 1:length(arr.time.level)){ 359 | # get the residual values for a given arrival time 360 | res.per.arrT = yfc.resp.date.lm2.res[covid2019.df$arr.time == arr.time.level[i]] 361 | res.mean.by.time[i] = mean(res.per.arrT) 362 | } 363 | names(res.mean.by.time) = arr.time.level 364 | ``` 365 | 366 | ```{r plot_arr_time_res} 367 | plot(covid2019.df$arr.time, 368 | yfc.resp.date.lm2.res, 369 | xlab = "Time", ylab = "Residuals") 370 | lines(x = arr.time.level, y = res.mean.by.time, 371 | col = "red", lwd = 2) 372 | ``` 373 | 374 | 375 | ```{r get_valid_prev_days} 376 | arr.time.level.prev = arr.time.level[-length(arr.time.level)] 377 | arr.time.level.prev 378 | names(arr.time.level.prev) = arr.time.level[-1] 379 | ``` 380 | 381 | The earliest arrival time is 20 January. 382 | So the residuals with this arrival time will not have a covariate value. 383 | 384 | ```{r identify_the_imediate_previous_date} 385 | prev.day = arr.time.level.prev[as.character(covid2019.df$arr.time)] 386 | table(covid2019.df$arr.time - prev.day, useNA = "always") 387 | 388 | # Sanity check 389 | # table(covid2019.df$arr.time - 1)[-1] == table(prev.day) 390 | ``` 391 | 392 | There is no evidence of an association between the mean residuals with arrival time on day j and the residuals with arrival time on day j + 1. 393 | 394 | ```{r determine_Time_autocorrelation} 395 | prev.mean = res.mean.by.time[as.character(prev.day)] 396 | summary(lm(yfc.resp.date.lm2.res ~ prev.mean)) 397 | ``` 398 | 399 | ## Check the normality assumption 400 | 401 | The distribution of the residuals is fairly symmetric, albeit with a longer right tail.
402 | As a result, we see moderate departure from normality in the Q-Q plots. 403 | 404 | ```{r qq_plot} 405 | #pdf(file = "/Users/chwu/Documents/research/nCov-2019_TCM/logLinearNormality.pdf", 406 | # width = 9, height = 4.5) 407 | par(mfrow = c(1,2), mar = c(5,4,1,2)+0.2) 408 | 409 | hist(yfc.resp.date.lm2.res, 410 | xlab = "Residuals", main = "", 411 | nclass = 20, prob = T) 412 | yfc.resp.date.lm2.res2 = 413 | yfc.resp.date.lm2.res[which(yfc.resp.date.lm2.res < abs(min(yfc.resp.date.lm2.res)))] 414 | normFitDens = dnorm(-100:100/10, mean = -0.5642705, sd = 2.12) 415 | lines(-100:100/10, normFitDens) 416 | 417 | qqnorm(yfc.resp.date.lm2.res, main ="", col = "#00000077", pch = 16) 418 | qqline(yfc.resp.date.lm2.res, col="red") 419 | #dev.off() 420 | 421 | ``` 422 | 423 | The Shapiro-Wilk test provides evidence for departure from normality in the errors. 424 | 425 | ```{r yfc.resp.date.lm2_shapiro_test} 426 | shapiro.test(yfc.resp.date.lm2.res) 427 | ``` 428 | 429 | # Bootstrap analysis 430 | 431 | By the central limit theorem, the moderate departure from normality should not be a problem. 432 | However, just to be safe, we evaluate the bootstrap estimates of the regression coefficients, 95\% CIs and p-values. 433 | The bootstrap estimates do not assume a parametric distribution for the errors and are therefore a suitable alternative when departure from normality would be a problem. 434 | 435 | ## Simulate bootstrap replicates 436 | 437 | We simulated 10000 bootstrap replicates. 438 | 439 | 440 | ```{r get_boot_strap_rep} 441 | set.seed(777) 442 | # bootstrapping with 10000 replications 443 | results <- boot(data=covid2019.df, statistic=bs, 444 | R=10000, formula=log(std.sevendays.cucase) ~ new.arr.time + 445 | Bus.resp + new.Bus.date + 446 | Enter.resp + new.Enter.date.cat) 447 | ``` 448 | 449 | The bias is generally very small compared to the coefficients, which indicates the bootstrap estimates are very similar to the least squares estimates above.
450 | This confirms that the moderate departure from normality is not an issue in this case. 451 | 452 | 453 | ```{r bootstrap_results} 454 | # view results 455 | print(results) 456 | ``` 457 | 458 | The bootstrap estimates for the coefficients are 459 | 460 | ```{r bootstrap_estimate} 461 | round(colMeans(results$t), 2) 462 | ``` 463 | 464 | ## Normality of the bootstrap statistics 465 | 466 | ### Intercept and the adjusting variable 467 | 468 | ```{r normality_of_bootstrap_statistics_of_intercept_and_adjusted_variables} 469 | par(mfrow = c(2,2), mar = c(5, 5, 2, 2) + 0.2) 470 | hist(results$t[,1], nclass = 50, prob = T, 471 | xlab = "bootstrap replicates", main = "Intercept") 472 | qqnorm(results$t[,1], main = "Intercept", ylab = "Sampled quantiles\n(bootstrap)") 473 | 474 | hist(results$t[,2], nclass = 50, prob = T, 475 | xlab = "bootstrap replicates", main = "Arrival time") 476 | qqnorm(results$t[,2], main = "Arrival time", ylab = "Sampled quantiles\n(bootstrap)") 477 | ``` 478 | 479 | ### Transmission control measure variables 480 | 481 | ```{r normality_of_bootstrap_statistics_of_TCM_variables} 482 | par(mfrow = c(2,4), mar = c(5, 5, 2, 2) + 0.2) 483 | hist(results$t[,3], nclass = 50, prob = T, 484 | xlab = "bootstrap replicates", main = expression(M[S])) 485 | qqnorm(results$t[,3], main = expression(M[S]), 486 | ylab = "Sampled quantiles\n(bootstrap)") 487 | qqline(results$t[,3], col = "red") 488 | 489 | hist(results$t[,4], nclass = 50, prob = T, 490 | xlab = "bootstrap replicates", main = expression("T"[S])) 491 | qqnorm(results$t[,4], main = expression("T"[S]), 492 | ylab = "Sampled quantiles\n(bootstrap)") 493 | qqline(results$t[,4], col = "red") 494 | 495 | hist(results$t[,5], nclass = 50, prob = T, 496 | xlab = "bootstrap replicates", main = expression(M[B])) 497 | qqnorm(results$t[,5], main = expression(M[B]), 498 | ylab = "Sampled quantiles\n(bootstrap)") 499 | qqline(results$t[,5], col = "red") 500 | 501 | hist(results$t[,6], nclass = 50, prob = T, 502 |
xlab = "bootstrap replicates", main = expression("T"[B])) 503 | qqnorm(results$t[,6], main = expression("T"[B]), 504 | ylab = "Sampled quantiles\n(bootstrap)") 505 | qqline(results$t[,6], col = "red") 506 | ``` 507 | 508 | ## Confidence intervals 509 | 510 | The confidence intervals calculated using the adjusted bootstrap percentile (BCa) method for each coefficient are given by 511 | 512 | ```{r bootstrap_CI} 513 | # get 95% confidence intervals 514 | boot.ci.results.tab = sapply(c(1:6), 515 | function(varIndex = NULL){ 516 | boot.ci(results, type = "bca", index = varIndex)$bca 517 | }) 518 | boot.ci.tab = t(boot.ci.results.tab)[,4:5] 519 | rownames(boot.ci.tab) = names(coef(yfc.resp.date.lm2)) 520 | colnames(boot.ci.tab) = c("95% lower bound", "95% upper bound") 521 | round(boot.ci.tab,2) 522 | ``` 523 | 524 | ## P-values suggested by the confidence intervals 525 | 526 | The p-value is defined as 1 minus the highest confidence level that produces a confidence interval excluding 0. 527 | While we might be able to assume that the bootstrap statistics follow a normal distribution, this method is chosen so that the p-values and confidence intervals are consistent. 528 | The p-values are calculated with sufficient precision to round to 2 decimal places.
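The CI-inversion idea described above can be illustrated on simulated data. The sketch below is a toy example (a bootstrapped sample mean on simulated draws, not the study data or the `getSgnChgIndex` helper): it searches a grid of confidence levels for the highest level whose BCa interval still excludes 0, and reports 1 minus that level as the approximate p-value.

```r
library(boot)

set.seed(42)
## Toy data: 50 draws whose true mean (0.5) is non-zero.
x <- rnorm(50, mean = 0.5)
b <- boot(data = x, statistic = function(d, i) mean(d[i]), R = 2000)

## For each confidence level on the grid, check whether the
## BCa interval (elements 4:5 of $bca) excludes 0.
conf.levels <- seq(0.90, 0.999, by = 0.001)
excludes0 <- sapply(conf.levels, function(a) {
  ci <- boot.ci(b, conf = a, type = "bca")$bca[4:5]
  ci[1] > 0 || ci[2] < 0
})

## Approximate p-value: 1 minus the highest level still excluding 0.
p.boot <- 1 - max(conf.levels[excludes0])
p.boot
```

The grid resolution (0.001 here) sets the precision of the resulting p-value, which is why the main analysis narrows the grid per coefficient.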
529 | 530 | ### Intercept 531 | 532 | ```{r pval_by_boostrap_ci_intercept} 533 | a1 = c(500:520)/1000 534 | 535 | ci1 = sapply(a1, function(a){ 536 | boot.ci(results, type= c("bca"), 537 | conf = a, index=1)$bca 538 | }) 539 | 540 | round(1 - a1[getSgnChgIndex(ci = ci1, a = a1)],2) 541 | ``` 542 | 543 | ### Arrival time 544 | 545 | ```{r pval_by_boostrap_ci_arrival_time} 546 | boot.ci(results, type= c("bca"), 547 | conf = 0.999, index=2)$bca 548 | 1 - 0.999 549 | ``` 550 | 551 | ### Binary variable for suspension of the intra-city public transport 552 | 553 | ```{r pval_by_boostrap_ci_M_S} 554 | a3 = c(9980:9990)/10000 555 | ci3 = sapply(a3, function(a){ 556 | boot.ci(results, type= c("bca"), 557 | conf = a, index=3)$bca 558 | }) 559 | #ci3[,c(3:4)] 560 | #1-a3[3] 561 | 562 | 1 - a3[getSgnChgIndex(ci = ci3, a = a3)] 563 | ``` 564 | 565 | ### Timing of suspension of the intra-city public transport 566 | 567 | ```{r pval_by_boostrap_ci_T_S} 568 | a4 = c(9970:9980)/10000 569 | ci4 = sapply(a4, function(a){ 570 | boot.ci(results, type= c("bca"), 571 | conf = a, index=4)$bca 572 | }) 573 | 574 | 1 - a4[getSgnChgIndex(ci = ci4, a = a4)] 575 | 576 | ``` 577 | 578 | ### Binary variable for closure of entertainment venues and banning of public gatherings 579 | 580 | ```{r pval_by_boostrap_ci_M_B} 581 | a5 = c(9900:9910)/10000 582 | 583 | ci5 = sapply(a5, function(a){ 584 | boot.ci(results, type= c("bca"), 585 | conf = a, index=5)$bca 586 | }) 587 | 588 | 1 - a5[getSgnChgIndex(ci = ci5, a = a5)] 589 | ``` 590 | 591 | ### Timing of closure of entertainment venues and banning of public gatherings 592 | 593 | ```{r pval_by_boostrap_ci_T_B} 594 | a6 = c(990:999)/1000 595 | 596 | ci6 = sapply(a6, function(a){ 597 | boot.ci(results, type= c("bca"), 598 | conf = a, index=6)$bca 599 | }) 600 | 601 | 1 - a6[getSgnChgIndex(ci = ci6, a = a6)] 602 | ``` 603 | -------------------------------------------------------------------------------- /code/TCM_log_linear_reg_model.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/code/TCM_log_linear_reg_model.pdf -------------------------------------------------------------------------------- /code/TCM_poisson_reg_model.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: Poisson regression analysis to investigate the associations between transmission 3 | control measures and the number of reported cases in the first seven days of the 4 | outbreaks in cities 5 | author: "Chieh-Hsi Wu" 6 | output: 7 | pdf_document: default 8 | html_notebook: default 9 | --- 10 | 11 | Here we investigate the associations between transmission control measures and the number of reported cases in the first week of the outbreaks in cities. 12 | 13 | 14 | ```{r read_in_data} 15 | library(R330) 16 | library(readxl) 17 | library(car) 18 | library(lmtest) 19 | 20 | covid2019FilePath = ".../data/nCoV-data.xlsx" 21 | covid2019.df = read_excel(path = covid2019FilePath, sheet = "3resp-7days") 22 | 23 | ``` 24 | 25 | # Processing the data 26 | 27 | Some of the cities have an inflow from Wuhan recorded as 0, which causes calculations to run into an error when we use it as an offset variable. 28 | To resolve this issue, 0 values are changed to $10^{-6}$, which is equivalent to only one person arriving in a city from Wuhan. 29 | 30 | ```{r process_totalflow} 31 | covid2019.df$new.totalflow_million = covid2019.df$totalflow_million 32 | covid2019.df$new.totalflow_million[covid2019.df$totalflow_million == 0] = 1e-6 33 | ``` 34 | 35 | The arrival time is processed so that 31 December 2019 is coded as day 0. 36 | 37 | ```{r process_arr.time} 38 | covid2019.df$new.arr.time = covid2019.df$arr.time - 1 39 | ``` 40 | 41 | The timing of suspending intra-city public transport is processed so that 31 December 2019 is coded as day 0.
42 | 43 | ```{r process_TTCM_bus} 44 | bus.resp.tab = table(covid2019.df$Bus.resp) 45 | bus.resp.tab 46 | bus.date.tab1 = table(covid2019.df$Bus.date[which(covid2019.df$Bus.resp==1)]) 47 | bus.date.tab1 48 | 49 | covid2019.df$new.Bus.date = covid2019.df$Bus.date - 1 50 | new.bus.date.tab1 = table(covid2019.df$new.Bus.date[which(covid2019.df$Bus.resp==1)]) 51 | new.bus.date.tab1 52 | 53 | covid2019.df$new.Bus.date[which(covid2019.df$Bus.resp==0)] = 0 54 | new.bus.date.tab = table(covid2019.df$new.Bus.date) 55 | new.bus.date.tab 56 | 57 | ## Sanity check 58 | ## Codes below should return (0) 59 | # bus.date.tab1 - new.bus.date.tab1 60 | # (as.numeric(names(bus.date.tab1)) - 1) - as.numeric(names(new.bus.date.tab1)) 61 | # bus.date.tab1 - new.bus.date.tab[-1] 62 | # (as.numeric(names(bus.date.tab1)) - 1) - as.numeric(names(new.bus.date.tab[-1])) 63 | # bus.resp.tab["0"] - new.bus.date.tab["0"] 64 | ## Sanity check complete 65 | ``` 66 | 67 | The timing of suspending inter-city passenger traffic is processed so that 31 December 2019 is coded as day 0. 
68 | 69 | ```{r process_TTCM_railway} 70 | rail.resp.tab = table(covid2019.df$Railway.resp) 71 | rail.resp.tab 72 | rail.date.tab1 = table(covid2019.df$Railway.date[which(covid2019.df$Railway.resp == 1)]) 73 | rail.date.tab1 74 | 75 | 76 | covid2019.df$new.Railway.date = covid2019.df$Railway.date - 1 77 | new.rail.date.tab1 = table(covid2019.df$new.Railway.date[which(covid2019.df$Railway.resp == 1)]) 78 | new.rail.date.tab1 79 | 80 | covid2019.df$new.Railway.date[which(covid2019.df$Railway.resp == 0)] = 0 81 | new.rail.date.tab = table(covid2019.df$new.Railway.date) 82 | new.rail.date.tab 83 | 84 | ## Sanity check 85 | ## Code below should return 0(s) 86 | # rail.date.tab1 - new.rail.date.tab1 87 | # (as.numeric(names(rail.date.tab1)) - 1) - as.numeric(names(new.rail.date.tab1)) 88 | # rail.date.tab1 - new.rail.date.tab[-1] 89 | # (as.numeric(names(rail.date.tab1)) - 1) - as.numeric(names(new.rail.date.tab[-1])) 90 | # rail.resp.tab["0"] - new.rail.date.tab["0"] 91 | ## Sanity check complete 92 | ``` 93 | 94 | The timing of closing entertainment venues and banning public gatherings is processed so that 31 December 2019 is coded as day 0.
95 | 96 | ```{r process_TTCM_enter} 97 | enter.resp.tab = table(covid2019.df$Enter.resp) 98 | enter.resp.tab 99 | enter.date.tab1 = table(covid2019.df$Enter.date[which(covid2019.df$Enter.resp == 1)]) 100 | enter.date.tab1 101 | 102 | covid2019.df$new.Enter.date = covid2019.df$Enter.date - 1 103 | new.enter.date.tab1 = table(covid2019.df$new.Enter.date[which(covid2019.df$Enter.resp == 1)]) 104 | new.enter.date.tab1 105 | 106 | covid2019.df$new.Enter.date[which(covid2019.df$Enter.resp == 0)] = 0 107 | new.enter.date.tab = table(covid2019.df$new.Enter.date) 108 | new.enter.date.tab 109 | 110 | ## Sanity check 111 | ## Code below should return 0(s) 112 | # enter.date.tab1 - new.enter.date.tab1 113 | # (as.numeric(names(enter.date.tab1)) - 1) - as.numeric(names(new.enter.date.tab1)) 114 | # enter.date.tab1 - new.enter.date.tab[-1] 115 | # (as.numeric(names(enter.date.tab1)) - 1) - as.numeric(names(new.enter.date.tab[-1])) 116 | # enter.resp.tab["0"] - new.enter.date.tab["0"] 117 | ## Sanity check complete 118 | ``` 119 | 120 | # Regression analysis 121 | 122 | ## Poisson regression 123 | 124 | The analysis using a Poisson regression model is presented below. 125 | 126 | ```{r glm_add1} 127 | yfc.resp.date.glm1 = glm(sevendays.cucase ~ new.arr.time + log10.Dis.WH+ 128 | Bus.resp + new.Bus.date + 129 | Railway.resp + new.Railway.date + 130 | Enter.resp + new.Enter.date + 131 | offset(log(Pop_million_2018*new.totalflow_million)), 132 | family = "poisson", 133 | data = covid2019.df) 134 | summary(yfc.resp.date.glm1) 135 | ``` 136 | 137 | There appear to be three influential points. 138 | 139 | ```{r glm_add1_infl_pts, fig.height=5, fig.width=5} 140 | influencePlot(yfc.resp.date.glm1) 141 | ``` 142 | 143 | Below is a visualisation of the influential-point indicators. 144 | 145 | ```{r glm_add1_infl_plots, fig.height=6.5, fig.width=6} 146 | influenceplots(yfc.resp.date.glm1) 147 | ``` 148 | 149 | Removing the three influential points does not affect the conclusions.
150 | 151 | ```{r glm_add2} 152 | yfc.resp.date.glm2 = glm(sevendays.cucase ~ new.arr.time + log10.Dis.WH+ 153 | Bus.resp + new.Bus.date + 154 | Railway.resp + new.Railway.date + 155 | Enter.resp + new.Enter.date + 156 | offset(log(Pop_million_2018*new.totalflow_million)), 157 | family = "poisson", 158 | data = covid2019.df[-c(1, 150, 157),]) 159 | summary(yfc.resp.date.glm2) 160 | ``` 161 | 162 | There appear to be more observations with outstanding Cook's distances. 163 | 164 | ```{r glm_add2_infl_pts, fig.height=5, fig.width=5} 165 | influencePlot(yfc.resp.date.glm2) 166 | ``` 167 | 168 | The plots below show that the outstanding Cook's distances are substantially far apart from the rest. 169 | 170 | ```{r glm_add2_infl_plots, fig.height=6.5, fig.width=6} 171 | influenceplots(yfc.resp.date.glm2) 172 | ``` 173 | 174 | The conclusions are not affected by removing the observations with the outstanding Cook's distances. 175 | 176 | 177 | ```{r glm_add} 178 | yfc.resp.date.glm = glm(sevendays.cucase ~ new.arr.time + log10.Dis.WH+ 179 | Bus.resp + new.Bus.date + 180 | Railway.resp + new.Railway.date + 181 | Enter.resp + new.Enter.date + 182 | offset(log(Pop_million_2018*new.totalflow_million)), 183 | family = "poisson", 184 | data = covid2019.df[-c(1, 7, 150, 157, 226),]) 185 | summary(yfc.resp.date.glm) 186 | ``` 187 | 188 | There are still several data points with large Cook's distances. 189 | However, there are a large number of these and, by standard practice, it would be too many to remove. 190 | In general, we should only remove one or two influential points, as the aim is to check that the observed results are not purely driven by one or two points. 191 | If too many points are removed, this defies the purpose of the check.
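The influence diagnostics used throughout this section rest on leverages (hat values) and Cook's distances. As a hedged illustration of what these quantities measure, here is a from-scratch numpy sketch for an ordinary least-squares fit with one planted outlier; the R functions used above (`influencePlot`, `influenceplots`) compute the analogous quantities for the fitted GLMs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2                                   # observations, coefficients
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
y[0] += 25.0                                   # plant one influential point

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages (hat values)
mse = resid @ resid / (n - p)

# Cook's distance: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
D = resid**2 / (p * mse) * h / (1 - h) ** 2
# The planted point should dominate the Cook's distances.
```

Points combining a large residual with high leverage are exactly the ones flagged in the influence plots, which is why removing only the one or two most extreme of them is a meaningful robustness check.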
192 | 193 | ```{r glm_add_infl, fig.height=6.5, fig.width=6} 194 | influenceplots(yfc.resp.date.glm) 195 | ``` 196 | 197 | ```{r ci} 198 | yfc.resp.date.glm.est = coef(summary(yfc.resp.date.glm)) 199 | yfc.resp.date.glm.est.tab = cbind(yfc.resp.date.glm.est[,"Estimate"], 200 | yfc.resp.date.glm.est[,"Estimate"]-1.96*yfc.resp.date.glm.est[,"Std. Error"], 201 | yfc.resp.date.glm.est[,"Estimate"]+1.96*yfc.resp.date.glm.est[,"Std. Error"]) 202 | colnames(yfc.resp.date.glm.est.tab) = c("Estimate", "Lower 95% CI", "Upper 95% CI") 203 | round(yfc.resp.date.glm.est.tab, 2) 204 | ``` 205 | 206 | If the model is correct, the Pearson residuals should have constant spread across fitted values. 207 | However, the plot below clearly shows that this is not the case---there is evident heteroscedasticity in the Pearson residuals. 208 | 209 | 210 | ```{r glm_pearson_res, fig.height= 4.5, fig.width=5.5} 211 | plot((fitted(yfc.resp.date.glm)), 212 | residuals(yfc.resp.date.glm, type= "pearson"), 213 | xlab = "Fitted values", ylab = "Pearson residuals") 214 | ``` 215 | 216 | Due to these issues, we report the parameter estimates from the regression with the influential points excluded, as that gives us the most conservative estimates. 217 | 218 | ## Quasi-Poisson regression 219 | 220 | 221 | A quasi-Poisson regression model is fitted to see whether it can rectify this problem. 222 | This model uses a quasi-likelihood and suggests that there is no evidence that any of the coefficients are significant. 223 | This is dubious, as we demonstrate in the next step.
224 | 225 | ```{r yfc.resp.date.qp.glm} 226 | yfc.resp.date.qp.glm = glm(sevendays.cucase ~ new.arr.time + log10.Dis.WH+ 227 | Bus.resp + new.Bus.date + 228 | Railway.resp + new.Railway.date + 229 | Enter.resp + new.Enter.date + 230 | offset(log(Pop_million_2018*new.totalflow_million)), 231 | family = "quasipoisson", 232 | data = covid2019.df[-c(1, 7, 150, 157, 226),]) 233 | summary(yfc.resp.date.qp.glm) 234 | ``` 235 | 236 | We consider the arrival time, and plot it against the logarithm of the number of reported cases divided by the population size and the inflow from Wuhan. 237 | It is apparent that the arrival time increases as the number of reported cases increases. 238 | 239 | ```{r arrT_cumucase_plot, fig.height= 4.5, fig.width=5.5} 240 | par(mar = c(5, 6, 2, 1) + 0.2) 241 | log.std.seven.cucase = log(covid2019.df$sevendays.cucase/ 242 | (covid2019.df$Pop_million_2018*covid2019.df$new.totalflow_million)) 243 | plot(covid2019.df$arr.time,log.std.seven.cucase, 244 | xlab = "Arrival time", 245 | ylab = "The number of cases reported in the\nfirst seven days of the outbreaks in cities") 246 | ``` 247 | 248 | However, the quasi-Poisson regression again indicates no evidence for the coefficient of arrival time. 249 | This could be because the heteroscedasticity in the Pearson residuals is so severe that the quasi-Poisson regression over-compensates and is unable to detect any signal in the data efficiently. 250 | Furthermore, the plot below shows that the quasi-Poisson model provides no indication that the heteroscedasticity in the Pearson residuals has been rectified. 251 | (The Pearson residuals in the plot are from the full quasi-Poisson model including all the control-measure variables.) 252 | Therefore, we are also uncertain about the reliability of the quasi-Poisson regression analysis.
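For a Poisson model, the Pearson residual standardises each observation by the model's own standard deviation, $r_i = (y_i - \mu_i)/\sqrt{\mu_i}$, and the quasi-Poisson fit inflates all standard errors by an estimated dispersion $\hat\phi = \sum_i r_i^2/(n-p)$, which is why severe overdispersion can wash out every coefficient at once. A small numpy sketch of both quantities on simulated overdispersed counts (illustrative values, not the study data, and `mu` here is a stand-in for fitted means):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
mu = np.exp(0.5 + 0.1 * rng.normal(size=n))   # stand-in fitted Poisson means

# Overdispersed counts: Poisson rates jittered by a gamma multiplier,
# so Var(y) > E(y), unlike a pure Poisson.
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=n))

pearson = (y - mu) / np.sqrt(mu)              # Pearson residuals
phi_hat = (pearson**2).sum() / (n - p)        # dispersion estimate

# Pure Poisson data would give phi_hat near 1; overdispersion pushes it
# above 1, and quasi-Poisson multiplies the squared standard errors by
# this factor, widening every confidence interval.
```

The larger the estimated dispersion, the more the quasi-Poisson fit discounts apparent signal, which matches the behaviour described above.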
253 | 254 | ```{r yfc_arr_time_qp_glm, fig.height= 5, fig.width=5.5} 255 | yfc.arr.time.qp.glm = glm(sevendays.cucase ~ new.arr.time + 256 | offset(log(Pop_million_2018*new.totalflow_million)), 257 | family = "quasipoisson", 258 | data = covid2019.df[-c(1, 7, 150, 157, 226),]) 259 | summary(yfc.arr.time.qp.glm) 260 | ``` 261 | 262 | 263 | ```{r glm_arrTqp_pearson_res, fig.height = 4.5, fig.width = 5.5} 264 | par(mar = c(5, 5, 5, 2) + 0.2) 265 | plot((fitted(yfc.resp.date.qp.glm)), 266 | residuals(yfc.resp.date.qp.glm, type= "pearson"), 267 | xlab = "Fitted values", 268 | ylab = "Pearson residual") 269 | ``` 270 | 271 | ## Negative binomial regression 272 | 273 | A negative binomial regression model is fitted to see whether it can rectify the heteroscedasticity in the Pearson residuals. 274 | 275 | ```{r yfc.resp.date.nb.glm} 276 | library(MASS) 277 | yfc.resp.date.nb.glm = glm.nb(sevendays.cucase ~ new.arr.time + log10.Dis.WH+ 278 | Bus.resp + new.Bus.date + 279 | Railway.resp + new.Railway.date + 280 | Enter.resp + new.Enter.date + 281 | offset(log10(Pop_million_2018*new.totalflow_million)), 282 | data = covid2019.df[-c(1, 7, 150, 157, 226),]) 283 | summary(yfc.resp.date.nb.glm) 284 | 285 | ``` 286 | 287 | The plot below shows that the Pearson residuals of the negative binomial regression still display heteroscedasticity. 288 | Therefore, we are also uncertain about the reliability of the results from the negative binomial regression.
289 | 290 | ```{r glm_nb_pearson_res, fig.height = 4.5, fig.width = 5.5} 291 | par(mar = c(5, 5, 5, 2) + 0.2) 292 | plot((fitted(yfc.resp.date.nb.glm)), 293 | residuals(yfc.resp.date.nb.glm, type= "pearson"), 294 | xlab = "Fitted values", 295 | ylab = "Pearson residual") 296 | ``` -------------------------------------------------------------------------------- /code/TCM_poisson_reg_model.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/code/TCM_poisson_reg_model.pdf -------------------------------------------------------------------------------- /code/f1.m: -------------------------------------------------------------------------------- 1 | clear model data params options 2 | load data_model.mat 3 | load name_province.mat 4 | 5 | %% 6 | % The model sum-of-squares function is 7 | % given in the model structure. 8 | model.ssfun = @f2; 9 | 10 | %% 11 | % All parameters are constrained to be positive. The initial 12 | % number of infected cases in Wuhan is also unknown and is treated as an extra parameter. 13 | params1 = { 14 | 15 | 16 | {'Ro', 2.2, 1.2,3.2,2.2,1} %basic reproduction number 17 | {'sigma', 4.5, 4.1,7,4.7,0.51} %incubation period 18 | {'report', 0.002,0,1} % report rate 19 | {'controlh',0.97,0,1,0.8,0.1} %control intensity 20 | {'controlm',0.64, 0,1,0.6,0.1} 21 | {'controllow',0.33,0,1,0.3,0.1} 22 | {'control2nd',0.012,0,1,0.05,0.05}%2nd control 23 | {'gamma', 5.29, 1,15,5,3}%time period from onset to isolation 24 | {'Iw0', 2,1,11}%initial infected cases in Wuhan 25 | 26 | 27 | 28 | 29 | }; 30 | 31 | %% 32 | % We assume we have at least some prior information on the 33 | % repeatability of the observations and assign a rather non-informative 34 | % prior for the residual variances of the observed states. The default 35 | % prior distribution is sigma2 ~ invchisq(S20,N0), the inverse chi 36 | % squared distribution (see for example Gelman et al.). 37 | 38 | 39 | model.S20 = [4]; 40 | model.N0 = [1]; 41 | 42 | %% 43 | % First generate an initial chain. 44 | options.nsimu = 50000; 45 | options.stats = 1; 46 | [results, chain, s2chain]= mcmcrun(model,data,params1,options); 47 | 48 | 49 | % % %%generate chain after burn-in 50 | options.nsimu = 200000; 51 | options.stats = 1; 52 | [results2, chain2, s2chain2] = mcmcrun(model,data,params1,options,results); 53 | 54 | %% 55 | % Chain plots should reveal that the chain has converged and we can 56 | % % use the results for estimation and predictive inference. 57 | figure 58 | mcmcplot(chain2,[],results2); %,'pairs' 59 | figure 60 | mcmcplot(chain2,[],results2,'denspanel',2); 61 | 62 | 63 | %% 64 | % Function |chainstats| calculates mean and std from the chain and 65 | % estimates the Monte Carlo error of the estimates. Number |tau| is 66 | % the integrated autocorrelation time and |geweke| is a simple test 67 | % for a null hypothesis that the chain has converged. 68 | 69 | results2.sstype = 1; % needed for mcmcpred and sqrt transformation 70 | 71 | chainstats(chain2,results2) 72 | 73 | 74 | %% 75 | % In order to use the |mcmcpred| function we need a 76 | % function |modelfun| with input arguments given as 77 | % |modelfun(xdata,theta)|. We construct this as an anonymous function. 78 | 79 | modelfun = @(d,th) f3(d(:,1),th,th(end),d); 80 | 81 | 82 | % We sample 1000 parameter realizations from |chain| and |s2chain| 83 | % and calculate the predictive plots.
84 | nsample = 1000; 85 | results2.sstype = 1; 86 | out = mcmcpred(results2,chain2,s2chain2,data.xdata,modelfun,nsample);%data.ydata-->data 87 | 88 | figure 89 | mcmcpredplot(out,out.data,data,C); 90 | 91 | -------------------------------------------------------------------------------- /code/f2.m: -------------------------------------------------------------------------------- 1 | function ss = f2(theta,data) 2 | % SEIR model sum-of-squares function (on the square-root scale) 3 | 4 | time = data.ydata(:,1); 5 | ydata = data.ydata(:,2:32); 6 | xdata = data.xdata; 7 | 8 | % the last parameter is the initial state 9 | y0 = theta(end); 10 | 11 | ymodel = f3(time,theta,y0,xdata); 12 | 13 | ss =sum((sqrt(ymodel) - sqrt(ydata)).^2); 14 | -------------------------------------------------------------------------------- /code/f3.m: -------------------------------------------------------------------------------- 1 | function y=f3(time,theta,y0,xdata) 2 | 3 | y = f4(time,y0,theta,xdata); 4 | 5 | 6 | -------------------------------------------------------------------------------- /code/f4.m: -------------------------------------------------------------------------------- 1 | function ydot = f4(t,y,theta,xdata) 2 | 3 | 4 | gamma1(1:31)=theta(8); %time period from onset to isolation 5 | gamma=1./gamma1; 6 | sigma=1/theta(2);%Incubation period 7 | beta1=theta(1).*gamma; % transmission rate 8 | report=theta(3); %report rate 9 | control1(1:31)=theta(5);% control intensity 10 | control1([30,4,24,18,22,19,10,9,28,27,8,21,16])=theta(6); 11 | control1([31,29,25,11,23])=theta(4); 12 | control2nd(1:31)=theta(7);%second control 13 | 14 | 15 | Pwuhan=10892900;%population of Wuhan 16 | flow=xdata(ceil(t),[2:31,96]);%matrix, 1 column = 1 province 17 | HRwuhan=xdata(ceil(t),32);%Wuhan reported daily cases 18 | HRwuhan(1:5)=theta(9);% estimated initial infected cases in Wuhan from 1/11-1/15 19 | pop=xdata(ceil(1:31),33);%population of other provinces 20 | importI=HRwuhan(1)*flow(1,:)/report/Pwuhan;%imported cases 21 |
timelevel1=xdata(ceil(t),[35:64,97]); 22 | time2nd=xdata(ceil(t),[66:95,98]); 23 | 24 | beta(1:31)=beta1; 25 | HE(1,1:31) = importI; 26 | HI(1,1:31) = 0; 27 | HS(1,1:31) = pop'-HE(1,:)-HI(1,:); 28 | HR(1,1:31) = 0; 29 | I(1,1:31)=0; 30 | beta(1:31)=beta1; 31 | 32 | for i = 2:40 33 | 34 | %control since level 1 response 35 | beta(find(timelevel1(i,:)==1))=beta1(find(timelevel1(i,:)==1)).*control1(find(timelevel1(i,:)==1)); 36 | 37 | %%2nd control 38 | beta(find(time2nd(i,:)==1))=beta1(find(time2nd(i,:)==1)).*control1(find(time2nd(i,:)==1)).*control2nd(find(time2nd(i,:)==1)); 39 | 40 | %Imported Infected cases 41 | importI=HRwuhan(i)*flow(i,:)/report/Pwuhan;%1*30 42 | 43 | 44 | %SEIR model 45 | HS(i,:)=HS(i-1,:)-beta.*HI(i-1,:).*HS(i-1,:)./pop'; 46 | HE(i,:)=HE(i-1,:)+beta.*HI(i-1,:).*HS(i-1,:)./pop'+importI-sigma*HE(i-1,:); 47 | HI(i,:)=HI(i-1,:)+sigma*HE(i-1,:)-gamma.*HI(i-1,:); 48 | HR(i,:)=HR(i-1,:)+gamma.*HI(i-1,:); 49 | 50 | I(i,:)=gamma.*HI(i-1,:); 51 | 52 | end 53 | 54 | 55 | ydot=[I(:,:)]; 56 | 57 | -------------------------------------------------------------------------------- /code/f5.m: -------------------------------------------------------------------------------- 1 | %predict different scenarios 2 | load name_province.mat 3 | 4 | %predict without shutdown but with control 5 | load without_shutdown_with_control.mat %noshutcon 6 | modelfun = @(d,th) f3(d(:,1),th,th(end),d); 7 | nsample = 500; 8 | results2.sstype = 1; 9 | out1 = mcmcpred(results2,chain2,s2chain2,data.xdata,modelfun,nsample); 10 | figure 11 | mcmcpredplot(out1,out.data,data,C); 12 | 13 | %%%%%%predict with shutdown but without control 14 | load with_shutdown_without_control.mat 15 | modelfun = @(d,th) f3(d(:,1),th,th(end),d); 16 | nsample = 500; 17 | results2.sstype = 1; 18 | out2 = mcmcpred(results2,chain2,s2chain2,data.xdata,modelfun,nsample); 19 | 20 | %%%%%%% predict without shutdown and without control 21 | load without_shutdown_without_control.mat 22 | modelfun = @(d,th) 
f3(d(:,1),th,th(end),d); 23 | nsample = 500; 24 | results2.sstype = 1; 25 | out3 = mcmcpred(results2,chain2,s2chain2,data.xdata,modelfun,nsample); 26 | 27 | 28 | 29 | -------------------------------------------------------------------------------- /code/fit_arrvial_model.R: -------------------------------------------------------------------------------- 1 | rm(list = ls()) 2 | library(MASS) 3 | library(readxl) 4 | library(car) 5 | library(lmtest) # provides coeftest() and bptest() 6 | library(sandwich) # provides vcovHC() 7 | 8 | infected=read_xlsx(".../data/fit_arrival_model.xlsx",sheet="1") 9 | infectedW=infected[infected$arrivalday>=2,] 10 | 11 | step.model <- lm(arrivalday~lat+lon+log10(Pop2018)+ 12 | log10(Totalflow)+aftershutdown,data=infectedW) 13 | summary(step.model) 14 | 15 | ### Diagnosis ### 16 | par(mfrow = c(2,2)) 17 | plot(fitted(step.model), 18 | residuals(step.model, type = "pearson"), 19 | xlab = "Fitted values", ylab = "Pearson residuals") 20 | plot(fitted(step.model), 21 | residuals(step.model, type = "deviance"), 22 | xlab = "Fitted values", ylab = "Deviance residuals") 23 | influencePlot(step.model) 24 | fit1 <- lm(arrivalday~lat+lon+log10(Pop2018)+log10(Totalflow)+aftershutdown, 25 | data=infectedW[-c(64,72,255,257,258),]) 26 | 27 | summary(fit1) 28 | influencePlot(fit1) 29 | lmtest::bptest(fit1) 30 | coeftest(fit1, vcov = vcovHC(fit1)) 31 | qqnorm(fit1$residuals) 32 | qqline(fit1$residuals) 33 | shapiro.test(fit1$residuals) 34 | 35 | 36 | -------------------------------------------------------------------------------- /code/mcmcpredplot.m: -------------------------------------------------------------------------------- 1 | function h=mcmcpredplot(out,data,adddata,name) 2 | %MCMCPREDPLOT - predictive plot for mcmc results 3 | % Creates predictive figures for each batch in the data set using 4 | % mcmc chain. Needs input from the function mcmcpred.
5 | % Example: 6 | % out=mcmcpred(results,chain,s2chain,data,modelfun); 7 | % mcmcpredplot(out) 8 | % 9 | % If s2chain has been given to mcmcpred, then the plot shows 95% 10 | % probability limits for new observations and for model parameter 11 | % uncertainty. If s2chain is not used then the plot contains 50%, 12 | % 90%, 95%, and 99% predictive probability limits due to parameter uncertainty. 13 | 14 | % $Revision: 1.5 $ $Date: 2013/04/17 08:45:40 $ 15 | 16 | if nargin < 2 17 | data = out.data; 18 | end 19 | 20 | if nargin < 3 21 | adddata = 0; 22 | end 23 | 24 | nbatch = length(out.predlims); 25 | 26 | if ~iscell(data) 27 | d=data; data=[]; data{1}=d; clear d 28 | end 29 | 30 | np = size(out.predlims{1}{1},1); 31 | nn = (np+1)/2; % median 32 | np = nn-1; 33 | 34 | hh = zeros(nbatch,1); 35 | 36 | for i=1:nbatch 37 | if nbatch > 1; hh(i) = figure;else hh(1)=gcf; end; % create new figures 38 | plimi = out.predlims{i}; 39 | ny = size(plimi,2); 40 | 41 | datai = data{i}; 42 | 43 | if isnumeric(datai) 44 | time = datai(:,1); % time is the first column of data 45 | elseif isfield(datai,'ydata') 46 | time = datai.ydata(:,1); % first column of ydata 47 | elseif isfield(datai,'xdata') 48 | time = datai.xdata(:,1); % first column of xdata 49 | else 50 | error('dont know the x axis of the plots') 51 | end 52 | 53 | for j=1:ny 54 | dimc = [0.9 0.9 0.9]; % dimmest (lightest) color 55 | %yidanli changed to subplot the figure 2020/3/25 (new code line 57, original code line 56) 56 | %if ny>1;subplot(ny,1,j);end 57 | subplot(6,6,j) 58 | if ~isempty(out.obslims) 59 | fillyy(time,out.obslims{i}{j}(1,:),out.obslims{i}{j}(3,:),dimc); 60 | hold on 61 | dimc = [0.8,0.8,0.8]; 62 | end 63 | fillyy(time,plimi{j}(1,:),plimi{j}(2*nn-1,:),dimc); 64 | hold on 65 | for k=2:(nn-1) 66 | fillyy(time,plimi{j}(k,:),plimi{j}(2*nn-k,:),dimc.*0.9.^(k-1)); 67 | end 68 | plot(time,plimi{j}(nn,:),'-k'); 69 | %yidanli changed code to add real data in figures; original code is in 70 | %lines 72-74, changed code in lines 76-78
71 | 72 | % if adddata 73 | % plot(datai.ydata(:,1), datai.ydata(:,j+1),'sk'); 74 | % end 75 | 76 | if ~isempty(adddata) 77 | plot(adddata.ydata(:,1),adddata.ydata(:,j+1),'sk'); 78 | end 79 | 80 | %%add the real data in the figure yidanli changed on 2020/3/25 81 | 82 | hold off 83 | if nbatch > 1 84 | title(sprintf('Data set %d, y[%d]',i,j)); 85 | elseif ny > 1 86 | 87 | %yidanli changed code to produce the name of the province; original code 88 | %in line 90, new code in line 91 89 | 90 | %title(sprintf('y[%d]',j)); 91 | title(sprintf(char(name(j)),j),'FontSize',9,'Fontname','Arial'); 92 | end 93 | end 94 | 95 | end 96 | 97 | if nargout > 0 98 | h=hh; 99 | end 100 | -------------------------------------------------------------------------------- /data/data_fig_2.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/data_fig_2.xlsx -------------------------------------------------------------------------------- /data/data_fig_S1A.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/data_fig_S1A.xlsx -------------------------------------------------------------------------------- /data/data_fig_S2.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/data_fig_S2.xlsx -------------------------------------------------------------------------------- /data/data_fig_S4A.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/data_fig_S4A.xlsx
-------------------------------------------------------------------------------- /data/data_fig_S4B.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/data_fig_S4B.xlsx -------------------------------------------------------------------------------- /data/fit_arrival_model.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/fit_arrival_model.xlsx -------------------------------------------------------------------------------- /data/health commission.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/health commission.xlsx -------------------------------------------------------------------------------- /data/nCoV-data.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/nCoV-data.xlsx -------------------------------------------------------------------------------- /data/nCoV-data_0323.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huaiyutian/COVID-19_TCM-50d_China/cc3257fc99335922a8e9b8668eea9e10fbb0e752/data/nCoV-data_0323.xlsx --------------------------------------------------------------------------------