├── README.md
├── Restaurant_Slides.pdf
├── data_dianping
├── food_baoding_v2.csv.zip
├── food_beijing_v2.csv.zip
├── food_chengdu_v2.csv.zip
├── food_hengyang_v2.csv.zip
├── food_kunming_v2.csv.zip
├── food_shenyang_v2.csv.zip
├── food_shenzhen_v2.csv.zip
├── food_yueyang_v2.csv.zip
└── food_zhengzhou_v2.csv.zip
├── food-small.jpg
└── restaurant_replication.R
/README.md:
--------------------------------------------------------------------------------
1 | # Predicting neighborhoods' socioeconomic attributes using restaurant data
2 |
3 |
4 |

5 |
6 |
7 | Proceedings of the National Academy of Sciences (PNAS)
8 |
9 | by Lei Dong (MIT), Carlo Ratti (MIT), and Siqi Zheng (MIT)
10 |
11 |
12 | ### Abstract
13 |
14 | Accessing high-resolution, timely socioeconomic data such as data on population, employment, and enterprise activity at the neighborhood level is critical for social scientists and policy makers to design and implement location-based policies. However, in many developing countries or cities, reliable local-scale socioeconomic data remain scarce. Here, we show an easily accessible and timely updated location attribute—restaurant—can be used to accurately predict a range of socioeconomic attributes of urban neighborhoods. We merge restaurant data from an online platform with 3 microdatasets for 9 Chinese cities. Using features extracted from restaurants, we train machine-learning models to estimate daytime and nighttime population, number of firms, and consumption level at various spatial resolutions. The trained model can explain 90 to 95% of the variation of those attributes across neighborhoods in the test dataset. We analyze the tradeoff between accuracy, spatial resolution, and number of training samples, as well as the heterogeneity of the predicted results across different spatial locations, demographics, and firm industries. Finally, we demonstrate the cross-city generality of this method by training the model in one city and then applying it directly to other cities. The transferability of this restaurant model can help bridge data gaps between cities, allowing all cities to enjoy big data and algorithm dividends.
15 |
16 | [Web](http://senseable.mit.edu/tasty-data/) | [Paper](https://www.pnas.org/content/116/31/15447) | [Appendix](https://www.pnas.org/content/suppl/2019/07/09/1903064116.DCSupplemental) | [Slides](https://github.com/leiii/restaurant/blob/master/Restaurant_Slides.pdf)
17 |
18 | ### Replication data and code
19 |
20 | - restaurant_replication.R
21 | * R code to replicate the results of the paper
22 |
23 | - data_dianping
24 | * Dianping restaurant data of nine cities
25 | * Baoding, Beijing, Chengdu, Hengyang, Kunming, Shenyang, Shenzhen, Yueyang, and Zhengzhou
26 |
27 | - rst
28 | * Model training results
29 | * [Download](https://drive.google.com/open?id=1O8rIy6CDWjapFu1YOYmmOqte8WfHe4Q-)
30 |
31 | - feature
32 | * Feature for training
33 | * [Download](https://drive.google.com/open?id=1VbWKqrkNU6MIZb17xH8B1y-k0PWdHUVw)
34 |
35 |
36 | Contact: arch.dongl@gmail.com
37 |
--------------------------------------------------------------------------------
/Restaurant_Slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/Restaurant_Slides.pdf
--------------------------------------------------------------------------------
/data_dianping/food_baoding_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_baoding_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_beijing_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_beijing_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_chengdu_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_chengdu_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_hengyang_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_hengyang_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_kunming_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_kunming_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_shenyang_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_shenyang_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_shenzhen_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_shenzhen_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_yueyang_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_yueyang_v2.csv.zip
--------------------------------------------------------------------------------
/data_dianping/food_zhengzhou_v2.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_zhengzhou_v2.csv.zip
--------------------------------------------------------------------------------
/food-small.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/food-small.jpg
--------------------------------------------------------------------------------
/restaurant_replication.R:
--------------------------------------------------------------------------------
1 | # Lei Dong
2 | # Email: arch.dongl@gmail.com
3 | library(glmnet)
4 | library(tidyr)
5 | library(dplyr)
6 | library(ggplot2)
7 | library(ggpubr)
8 | library(stringr)
9 |
10 | setwd("~/downloads/restaurant/replicate")
11 |
12 | ########### Figure 1 ##############
13 | food <- read.table("data_dianping/food_hengyang_v2.csv", header = T, sep = ",")
14 | category <- data.frame(table(food$mainCategoryNameEn))
15 | category <- category[order(category$Freq), ]
16 |
17 | ggbarplot(tail(category,20), x = "Var1", y = "Freq",
18 | fill = "#4292c6", # change fill color by cyl
19 | color = "white", # Set bar border colors to white
20 | palette = "jco", # jco journal color palett. see ?ggpar
21 | sort.val = "desc", # Sort the value in dscending order
22 | sort.by.groups = FALSE, # Don't sort inside each group
23 | x.text.angle = 90, # Rotate vertically x axis texts
24 | xlab = "Hengyang"
25 | )
26 | #ggsave("hengyang_cnt_v2.pdf", width = 5, height = 3)
27 |
28 |
29 |
30 | ########### Figure 2 ##############
31 | ### Figure 2ABCD group size
32 | city <- c("beijing", "shenzhen", "chengdu",
33 | "shenyang", "zhengzhou", "kunming",
34 | "baoding", "yueyang", "hengyang")
35 |
36 | # read files
37 | r2 <- data.frame()
38 | for (i in city){
39 | dt <- read.table(paste0("rst/", i, "_lasso_r2_rolling_v2.csv"), header = TRUE, sep = ",")
40 | r2 <- rbind(r2, dt)
41 | }
42 |
43 | # filter, summarise, and plot
44 | for (i in city){
45 | d <- r2 %>%
46 | filter(city == i) %>%
47 | group_by(cellsize, y) %>%
48 | summarise(r2_test_mean = mean(r2_test),
49 | r2_test_sd = sd(r2_test))
50 |
51 | p <- ggplot(d, aes(cellsize/1000, r2_test_mean)) +
52 | geom_pointrange(aes(color = factor(y),
53 | ymin = r2_test_mean - r2_test_sd,
54 | ymax = r2_test_mean + r2_test_sd),
55 | alpha = 0.8, size = 0.8) +
56 | geom_line(aes(color = factor(y)), size = 1.5) +
57 | scale_color_manual(name = "legend",
58 | breaks = c("employment", "population", "firm", "consumption"),
59 | values = c("employment" = "red", "population" = "#E69F00",
60 | "firm" = "#56B4E9", "consumption" = "#999999")) +
61 | geom_vline(xintercept = 4.5, linetype = 2) +
62 | annotate("text", label = i, x = 4, y = 0.5) +
63 | ylim(c(0.3, 1.1)) +
64 | xlab("Grid size (km)") +
65 | ylab("Accuracy (R2)") +
66 | theme_classic(base_size = 18)
67 | p
68 |
69 | ggsave(paste0(i, "_r2_v2.pdf"), width = 6, height = 4)
70 | }
71 |
72 |
73 |
74 | ### Figure 2E group size
75 | big_city <- r2 %>%
76 | filter(city %in% "beijing" | city %in% "shenzhen" | city %in% "chengdu") %>%
77 | filter(cellsize == 5000 | cellsize == 1000) %>%
78 | group_by(cellsize, y) %>%
79 | summarise(r2_test_mean = mean(r2_test),
80 | r2_test_sd = sd(r2_test))
81 |
82 | mid_city <- r2 %>%
83 | filter(city %in% "shenyang" | city %in% "zhengzhou" | city %in% "kunming") %>%
84 | filter(cellsize == 5000 | cellsize == 1000) %>%
85 | group_by(cellsize, y) %>%
86 | summarise(r2_test_mean = mean(r2_test),
87 | r2_test_sd = sd(r2_test))
88 |
89 | sma_city <- r2 %>%
90 | filter(city %in% "baoding" | city %in% "yueyang" | city %in% "hengyang") %>%
91 | filter(cellsize == 5000 | cellsize == 1000) %>%
92 | group_by(cellsize, y) %>%
93 | summarise(r2_test_mean = mean(r2_test),
94 | r2_test_sd = sd(r2_test))
95 |
96 | # set x axis
97 | sma_city$x <- 0
98 | mid_city$x <- 1
99 | big_city$x <- 2
100 | rst <- as.data.frame(rbind(sma_city, mid_city, big_city))
101 | rst[rst$y == "employment", ]$x <- rst[rst$y == "employment", ]$x - 0.3
102 | rst[rst$y == "population", ]$x <- rst[rst$y == "population", ]$x - 0.15
103 | rst[rst$y == "consumption", ]$x <- rst[rst$y == "consumption", ]$x + 0.15
104 |
105 | # plot
106 | ggplot(rst, aes(x, r2_test_mean)) +
107 | geom_pointrange(aes(color = factor(y),
108 | ymin = r2_test_mean - r2_test_sd,
109 | ymax = r2_test_mean + r2_test_sd),
110 | size = 0.8) +
111 | scale_color_manual(name = "legend",
112 | breaks = c("employment", "population", "firm", "consumption"),
113 | values = c("employment" = "red", "population" = "#E69F00",
114 | "firm" = "#56B4E9", "consumption" = "#999999")) +
115 | #ylim(c(0.4, 1.2)) +
116 | xlab("City size group") +
117 | ylab("Accuracy (R2)") +
118 | theme_classic(base_size = 18)
119 | #ggsave("citygroup_r2_v2.pdf", width = 6, height = 4, useDingbats=FALSE)
120 |
121 |
122 |
123 | ### Figure 2F subsample
124 | bj <- read.table("rst/beijing_subsample_lasso_r2_rolling_v2.csv", header = TRUE, sep = ",")
125 | bj1 <- bj %>%
126 | group_by(smp_per, y) %>%
127 | summarise(r2_test_mean = mean(r2_test),
128 | r2_test_sd = sd(r2_test))
129 |
130 | ggplot(bj1, aes(smp_per*100, r2_test_mean)) +
131 | geom_pointrange(aes(color = factor(y),
132 | ymin = r2_test_mean - r2_test_sd,
133 | ymax = r2_test_mean + r2_test_sd),
134 | size = 0.8, alpha = 0.8) +
135 | scale_color_manual(name = "legend",
136 | breaks = c("employment", "population", "firm", "consumption"),
137 | values = c("employment" = "red", "population" = "#E69F00",
138 | "firm" = "#56B4E9", "consumption" = "#999999")) +
139 | ylim(c(0.6, 1.0)) +
140 | xlab("Percentage (%)") +
141 | ylab("Accuracy (R2)") +
142 | theme_classic(base_size = 18)
143 | #ggsave("beijing_subsample_r2_v2.pdf", width = 6, height = 4, useDingbats=FALSE)
144 |
145 |
146 |
147 | ########### Figure 3 ##############
148 | ### Figure 3A spatial distribution of erros
149 | dt0 <- read.table("feature/beijing_employment_3000_0_0.csv", header = TRUE, sep = ",")
150 | dt1 <- read.table("feature/beijing_employment_3000_0.5_0.csv", header = TRUE, sep = ",")
151 | dt2 <- read.table("feature/beijing_employment_3000_0_0.5.csv", header = TRUE, sep = ",")
152 | dt3 <- read.table("feature/beijing_employment_3000_0.5_0.5.csv", header = TRUE, sep = ",")
153 | dt <- rbind(dt1, dt2, dt3)
154 | dt <- separate(data = dt, col = X, into = c("x", "y"), sep = "_")
155 | dt_ <- separate(data = dt0, col = X, into = c("x", "y"), sep = "_")
156 | dt[is.na(dt)] <- 0
157 | dt_[is.na(dt_)] <- 0
158 |
159 | # data format
160 | train_x <- as.matrix(dt[,c(4:ncol(dt))])
161 | train_y <- as.matrix(dt[,c(3)])
162 | test_x <- as.matrix(dt_[,c(4:ncol(dt_))])
163 | test_y <- as.matrix(dt_[,c(3)])
164 |
165 | # LASSO fit (alpha=1: LASSO)
166 | cvfit <- cv.glmnet(train_x, train_y, alpha = 1, nfolds = 5)
167 | plot(cvfit)
168 |
169 | # r2 of training data
170 | mse <- cvfit$cvm[cvfit$lambda == cvfit$lambda.min]
171 | r2_train <- 1- mse/var(train_y)
172 | print(r2_train)
173 |
174 | # r2 of testing data
175 | pred_lasso <- predict(cvfit, newx = test_x, s = "lambda.min")
176 | mse <- sum((pred_lasso - test_y)^2) / length(test_y)
177 | r2_test <- 1- mse/var(test_y)
178 | print(r2_test)
179 |
180 | # calculate errors
181 | rst <- data.frame(cbind(dt_$x, dt_$y, dt_$employment, pred_lasso))
182 | colnames(rst) <- c("X", "Y", "observed", "predicted")
183 | rst <- mutate_all(rst, function(x) as.numeric(as.character(x)))
184 | rst$error <- (rst$predicted - rst$observed)/rst$observed
185 | rst$error[rst$error > 1] <- 1
186 | rst$error[rst$error < -1] <- -1
187 |
188 | # plot ground truth vs. prediction
189 | ggplot(aes(log10(observed), log10(predicted)),
190 | data = rst[rst$predicted > 100 & rst$observed > 100,]) +
191 | geom_point(color = "blue", alpha = 0.3) +
192 | geom_abline(slope = 1, intercept = 0) +
193 | ggtitle("Beijing daytime population (3km)") +
194 | xlab("Observed (log10)") +
195 | ylab("Predicted (log10)")
196 | #ggsave("beijing_employment_3000.pdf", width = 4, height = 3)
197 |
198 |
199 |
200 | # plot spatial distribution of errors
201 | library(rgdal) #coord. transform
202 | library(OpenStreetMap) #base map
203 |
204 | # convert projection to wgs84
205 | cord_dec <- SpatialPoints(cbind(rst$X, rst$Y), proj4string = CRS("+init=epsg:2436"))
206 | cord_wgs <- spTransform(cord_dec, CRS("+init=epsg:4326"))
207 | lonlat <- as.data.frame(cord_wgs@coords)
208 | colnames(lonlat) <- c("lon", "lat")
209 | rst <- cbind(rst, lonlat)
210 |
211 | # boundary of Beijing
212 | LAT1 = 39.45 ; LAT2 = 40.5
213 | LON1 = 115.8 ; LON2 = 117
214 |
215 | # boundary of Chengdu
216 | #LAT1 = 30.3; LAT2 = 30.96
217 | #LON1 = 103.73; LON2 = 104.47
218 |
219 | # boundary of Zhengzhou
220 | #LAT1 = 34.55; LAT2 = 34.97
221 | #LON1 = 113.46; LON2 = 113.95
222 |
223 | # boundary of Baoding
224 | #LAT1 = 38.6; LAT2 = 39.1
225 | #LON1 = 115.17; LON2 = 115.69
226 |
227 | map <- openmap(c(LAT2, LON1), c(LAT1, LON2), zoom = NULL,
228 | type = "stamen-toner",
229 | mergeTiles = TRUE)
230 | map_latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
231 |
232 | # with basemap
233 | osmmap <- autoplot(map_latlon)+
234 | #labs(x = "Lon", y = "Lat")+
235 | geom_point(data = rst[rst$observed > 2000,],
236 | aes(x = lon, y = lat, color = error*100, size = observed)) +
237 | scale_size(range = c(1, 5), name = "Daytime popu.") +
238 | scale_color_gradient2(low = "steelblue", high = "#f46d43", name = "Error(%)") +
239 | xlim(LON1, LON2) +
240 | ylim(LAT1, LAT2)
241 | osmmap
242 | #ggsave("beijing_employment_3000_error_spatial_v2.pdf", width = 5, height = 4, useDingbats=FALSE)
243 |
244 |
245 |
246 | ### Figure 3B and Table S3 jiedao demographic
247 | demo <- read.table("rst/beijing_demographic_r2_jiedao.csv", header = TRUE, sep = ",")
248 | demo_feature <- read.table("rst/beijing_demographic_feature_jiedao.csv", header = TRUE, sep = ",")
249 |
250 | demo1 <- demo %>%
251 | group_by(y) %>%
252 | summarise(r2_mean = mean(r2),
253 | r2_sd = sd(r2))
254 |
255 | ggplot(demo1, aes(y, r2_mean)) +
256 | geom_pointrange(aes(color = factor(y),
257 | ymin = r2_mean - r2_sd,
258 | ymax = r2_mean + r2_sd),
259 | size = 0.8, alpha = 0.8) +
260 | xlab("Group") +
261 | ylab("Accuracy (R2)") +
262 | ylim(0.4, 0.8) +
263 | theme_classic(base_size = 18)
264 | #ggsave("beijing_demographic_r2_jiedao.pdf", width = 4, height = 4, useDingbats=FALSE)
265 |
266 |
267 |
268 | # Table S3
269 | young <- demo_feature[demo_feature$y == "14",]
270 | middle <- demo_feature[demo_feature$y == "15-64",]
271 | old <- demo_feature[demo_feature$y == "65",]
272 | migrant <- demo_feature[demo_feature$y == "migrant",]
273 | wealth <- demo_feature[demo_feature$y == "wealth",]
274 |
275 | migrant$split <- as.data.frame(str_split_fixed(migrant$name, "_", 2))[[1]]
276 | wealth$split <- as.data.frame(str_split_fixed(wealth$name, "_", 2))[[1]]
277 | young$split <- as.data.frame(str_split_fixed(young$name, "_", 2))[[1]]
278 | middle$split <- as.data.frame(str_split_fixed(middle$name, "_", 2))[[1]]
279 | old$split <- as.data.frame(str_split_fixed(old$name, "_", 2))[[1]]
280 |
281 | migrant_sum <- as.data.frame(table(migrant$split))
282 | wealth_sum <- as.data.frame(table(wealth$split))
283 | young_sum <- as.data.frame(table(young$split))
284 | middle_sum <- as.data.frame(table(middle$split))
285 | old_sum <- as.data.frame(table(old$split))
286 |
287 |
288 |
289 | ### Figure 3C Firm
290 | firm <- read.table("rst/beijing_lasso_firm_r2_jiedao_v2.csv", header = TRUE, sep = ",")
291 | firm_feature <- read.table("rst/beijing_lasso_firm_feature_jiedao_v2.csv", header = TRUE, sep = ",")
292 |
293 | firm1 <- firm %>%
294 | group_by(y) %>%
295 | summarise(r2_train_mean = mean(r2_train),
296 | r2_train_sd = sd(r2_train))
297 |
298 | ggplot(firm1, aes(y, r2_train_mean)) +
299 | geom_pointrange(aes(color = factor(y),
300 | ymin = r2_train_mean - r2_train_sd,
301 | ymax =r2_train_mean + r2_train_sd)) +
302 | xlab("Percentage (%)") +
303 | ylab("Accuracy (R2)") +
304 | ylim(0.4, 0.6) +
305 | theme_classic(base_size = 18)
306 | #ggsave("beijing_firm_r2_jiedao.pdf", width = 6, height = 4, useDingbats=FALSE)
307 |
308 |
309 |
310 | # Table S3
311 | business_service <- firm_feature[firm_feature$y == "retail",]
312 | business_service$split <- as.data.frame(str_split_fixed(business_service$name, "_", 2))[[1]]
313 | business_service_sum <- as.data.frame(table(business_service$split))
314 |
315 |
316 |
317 | ########### Transfer ##############
318 | ### Figure 4
319 | city <- c("beijing", "shenzhen", "chengdu",
320 | "shenyang", "zhengzhou", "kunming",
321 | "baoding", "yueyang", "hengyang")
322 |
323 | r2_matrix <- data.frame()
324 | feature <- data.frame()
325 | for (i in city){
326 | print(i)
327 | dt <- read.table(paste0("rst/", i, "_lasso_r2_transfer_v2.csv"), header = TRUE, sep = ",")
328 | fe <- read.table(paste0("rst/" , i, "_lasso_feature_transfer_v2.csv"), header = TRUE, sep = ",")
329 | feature <- rbind(feature, fe)
330 | dt1 <- dt %>%
331 | group_by(city_b, y) %>%
332 | summarise(r2_test_mean = mean(r2_test),
333 | r2_test_sd = sd(r2_test))
334 | dt1 <- as.data.frame(dt1)
335 | dt1$city_a <- i
336 | r2_matrix <- rbind(r2_matrix, dt1)
337 | }
338 |
339 | # add diag of the matrix
340 | r2_matrix_diag <- data.frame()
341 | for (i in city){
342 | print(i)
343 | dt <- read.table(paste0("rst/", i, "_lasso_r2_rolling_v2.csv"), header = TRUE, sep = ",")
344 | dt1 <- dt %>%
345 | filter(cellsize == 3000) %>%
346 | group_by(y) %>%
347 | summarise(r2_test_mean = mean(r2_test),
348 | r2_test_sd = sd(r2_test))
349 | dt1 <- as.data.frame(dt1)
350 | dt1$city_a <- i
351 | dt1$city_b <- i
352 | r2_matrix_diag <- rbind(r2_matrix_diag, dt1)
353 | }
354 |
355 | r2_matrix_sum <- rbind(r2_matrix, r2_matrix_diag)
356 | r2_matrix_sum$city_a_order <- factor(r2_matrix_sum$city_a, levels = city)
357 | r2_matrix_sum$city_b_order <- factor(r2_matrix_sum$city_b, levels = city)
358 |
359 | # plot
360 | ggplot(data = r2_matrix_sum[r2_matrix_sum$y == "population", ],
361 | aes(city_a_order, city_b_order, fill = r2_test_mean)) +
362 | geom_tile(color = "white") +
363 | geom_text(aes(label = round(r2_test_mean, 2)), alpha = 0.8) +
364 | scale_fill_gradient(high = "steelblue", low = "white",
365 | limits = c(0.5, 1), name = "R2") +
366 | theme_minimal() +
367 | theme(axis.text.x = element_text(angle = 45, vjust = 1,
368 | size = 12, hjust = 1)) +
369 | coord_fixed()
370 | #ggsave("transfer_population_r2_v2.pdf", width = 6, height = 5, useDingbats=FALSE)
371 |
372 |
373 | transfer_feature <- feature[feature$y == "consumption",]
374 | transfer_feature$split <- as.data.frame(str_split_fixed(transfer_feature$name, "_", 2))[[1]]
375 | transfer_feature_sum <- as.data.frame(table(transfer_feature$split))
376 |
377 |
378 |
379 |
380 | ########### Supplementary Information ##############
381 | ### Figure S3-5
382 | city <- c("baoding", "yueyang", "hengyang")
383 | empl <- data.frame()
384 | popu <- data.frame()
385 | firm <- data.frame()
386 | cons <- data.frame()
387 | poi.cnt <- data.frame()
388 | price <- data.frame()
389 |
390 | for (i in city){
391 | empl_ <- read.table(paste0("feature/", i ,"_employment_3000_0_0.csv"), header = TRUE, sep = ",")
392 | popu_ <- read.table(paste0("feature/", i ,"_population_3000_0_0.csv"), header = TRUE, sep = ",")
393 | firm_ <- read.table(paste0("feature/", i ,"_firm_3000_0_0.csv"), header = TRUE, sep = ",")
394 | cons_ <- read.table(paste0("feature/", i ,"_consumption_3000_0_0.csv"), header = TRUE, sep = ",")
395 | empl <- rbind(empl, cbind(empl_$employment, rep(i, nrow(empl_))))
396 | popu <- rbind(popu, cbind(popu_$population, rep(i, nrow(popu_))))
397 | firm <- rbind(firm, cbind(firm_$firm, rep(i, nrow(firm_))))
398 | cons <- rbind(cons, cbind(cons_$consumption, rep(i, nrow(cons_))))
399 | poi.cnt <- rbind(poi.cnt, cbind(empl_$poi_cnt, rep(i, nrow(empl_))))
400 | price <- rbind(price, cbind(empl_$price_mean, rep(i, nrow(empl_))))
401 | }
402 |
403 | ggplot(empl, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) +
404 | geom_density(size = 1) +
405 | scale_color_brewer(palette="Paired") +
406 | xlab("Daytime population (log)") +
407 | theme_classic(base_size = 16)
408 | #ggsave("smallcity_employment.pdf", width = 5, height = 4, useDingbats=FALSE)
409 |
410 | ggplot(popu, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) +
411 | geom_density(size = 1) +
412 | scale_color_brewer(palette="Paired") +
413 | xlab("Nightime population (log)") +
414 | theme_classic(base_size = 16)
415 | #ggsave("smallcity_population.pdf", width = 5, height = 4, useDingbats=FALSE)
416 |
417 | ggplot(firm, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) +
418 | geom_density(size = 1) +
419 | scale_color_brewer(palette="Paired") +
420 | xlab("Firm (log)") +
421 | theme_classic(base_size = 16)
422 | #ggsave("smallcity_firm.pdf", width = 5, height = 4, useDingbats=FALSE)
423 |
424 | ggplot(cons, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) +
425 | geom_density(size = 1) +
426 | scale_color_brewer(palette="Paired") +
427 | xlab("Consumption (log)") +
428 | theme_classic(base_size = 16)
429 | #ggsave("smallcity_consumption.pdf", width = 5, height = 4, useDingbats=FALSE)
430 |
431 | ggplot(poi.cnt, aes(x=as.numeric(as.character(V1)), color=as.factor(V2))) +
432 | geom_density(size = 1) +
433 | scale_color_brewer(palette="Paired") +
434 | xlab("Restaurant number (log)") +
435 | theme_classic(base_size = 16)
436 | #ggsave("smallcity_restaurant.pdf", width = 5, height = 4, useDingbats=FALSE)
437 |
438 | ggplot(price, aes(x=as.numeric(as.character(V1)), color=as.factor(V2))) +
439 | geom_density(size = 1) +
440 | scale_color_brewer(palette="Paired") +
441 | xlab("Average price (log)") +
442 | theme_classic(base_size = 16)
443 | #ggsave("smallcity_price.pdf", width = 5, height = 4, useDingbats=FALSE)
444 |
445 |
446 |
447 | ### Figure S6
448 | city <- c("beijing")
449 | r2 <- data.frame()
450 | for (i in city){
451 | dt <- read.table(paste0("rst/", i, "_lasso_r2_norolling_v2.csv"), header = TRUE, sep = ",")
452 | r2 <- rbind(r2, dt)
453 | }
454 |
455 | d <- r2 %>%
456 | group_by(city, cellsize, y) %>%
457 | summarise(r2_test_mean = mean(r2_test),
458 | r2_test_sd = sd(r2_test))
459 |
460 | ggplot(d, aes(cellsize/1000, r2_test_mean)) +
461 | geom_pointrange(aes(colour = factor(y),
462 | ymin=r2_test_mean-r2_test_sd,
463 | ymax=r2_test_mean+r2_test_sd),
464 | alpha = 0.8, size = 0.8) +
465 | geom_line(aes(colour = factor(y)), size = 1.5) +
466 | scale_color_manual(name = "legend",
467 | breaks = c("employment", "population", "firm", "consumption"),
468 | values = c("employment" = "red", "population" = "#E69F00",
469 | "firm" = "#56B4E9", "consumption" = "#999999")) +
470 | geom_vline(xintercept = 4.5, linetype = 2) +
471 | annotate("text", label = i, x = 4, y = 0.5) +
472 | ylim(c(0.3, 1.1)) +
473 | xlab("Grid size (km)") +
474 | ylab("Accuracy (R2)") +
475 | theme_classic(base_size = 18)
476 | #ggsave("beijing_r2_norolling_v2.pdf", width = 6, height = 4)
477 |
478 |
479 |
480 | ### Table S2
481 | r2_rst <- expand.grid(
482 | city = c("beijing"),
483 | cellsize = seq(1000, 5000, by = 500),
484 | y = c("population", "employment", "firm", "consumption"),
485 | r2 = 0
486 | )
487 |
488 | for (i in 1:nrow(r2_rst)){
489 | print (i)
490 | # readfile
491 | basefilename <- paste(r2_rst[i, "city"], r2_rst[i, "y"], r2_rst[i, "cellsize"], sep = "_")
492 | dt <- read.table(paste0("feature/", basefilename, "_0_0.csv"), header = T, sep = ",")
493 | dt1 <- dt[,c(1,2)]
494 |
495 | light <- read.table(paste0("feature/beijing_light_", r2_rst[i, "cellsize"], "_0_0.csv"), header = TRUE, sep = ",")
496 | light1 <- light %>%
497 | group_by(idx) %>%
498 | summarise(light_sum = sum(light))
499 | m <- merge(dt1, light1, by.x = "X", by.y = "idx")
500 | model <- summary(lm(log(m[, 2]) ~ log(m$light_sum)))
501 | r2_rst[i, "r2"] <- model$r.squared
502 | }
503 |
504 |
505 |
--------------------------------------------------------------------------------