├── README.md ├── Restaurant_Slides.pdf ├── data_dianping ├── food_baoding_v2.csv.zip ├── food_beijing_v2.csv.zip ├── food_chengdu_v2.csv.zip ├── food_hengyang_v2.csv.zip ├── food_kunming_v2.csv.zip ├── food_shenyang_v2.csv.zip ├── food_shenzhen_v2.csv.zip ├── food_yueyang_v2.csv.zip └── food_zhengzhou_v2.csv.zip ├── food-small.jpg └── restaurant_replication.R /README.md: -------------------------------------------------------------------------------- 1 | # Predicting neighborhoods' socioeconomic attributes using restaurant data 2 | 3 |

4 |

5 |

6 | 7 | Proceedings of the National Academy of Sciences (PNAS) 8 | 9 | by Lei Dong (MIT), Carlo Ratti (MIT), and Siqi Zheng (MIT) 10 | 11 | 12 | ### Abstract 13 | 14 | Accessing high-resolution, timely socioeconomic data such as data on population, employment, and enterprise activity at the neighborhood level is critical for social scientists and policy makers to design and implement location-based policies. However, in many developing countries or cities, reliable local-scale socioeconomic data remain scarce. Here, we show an easily accessible and timely updated location attribute—restaurant—can be used to accurately predict a range of socioeconomic attributes of urban neighborhoods. We merge restaurant data from an online platform with 3 microdatasets for 9 Chinese cities. Using features extracted from restaurants, we train machine-learning models to estimate daytime and nighttime population, number of firms, and consumption level at various spatial resolutions. The trained model can explain 90 to 95% of the variation of those attributes across neighborhoods in the test dataset. We analyze the tradeoff between accuracy, spatial resolution, and number of training samples, as well as the heterogeneity of the predicted results across different spatial locations, demographics, and firm industries. Finally, we demonstrate the cross-city generality of this method by training the model in one city and then applying it directly to other cities. The transferability of this restaurant model can help bridge data gaps between cities, allowing all cities to enjoy big data and algorithm dividends. 15 | 16 | [Web](http://senseable.mit.edu/tasty-data/) | [Paper](https://www.pnas.org/content/116/31/15447) | [Appendix](https://www.pnas.org/content/suppl/2019/07/09/1903064116.DCSupplemental) | [Slides](https://github.com/leiii/restaurant/blob/master/Restaurant_Slides.pdf) 17 | 18 | ### Replication data and code 19 | 20 | - restaurant_replication.R 21 | * R code to replicate the results of the paper 22 | 23 | - data_dianping 24 | * Dianping restaurant data of nine cities 25 | * Baoding, Beijing, Chengdu, Hengyang, Kunming, Shenyang, Shenzhen, Yueyang, and Zhengzhou 26 | 27 | - rst 28 | * Model training results 29 | * [Download](https://drive.google.com/open?id=1O8rIy6CDWjapFu1YOYmmOqte8WfHe4Q-) 30 | 31 | - feature 32 | * Feature for training 33 | * [Download](https://drive.google.com/open?id=1VbWKqrkNU6MIZb17xH8B1y-k0PWdHUVw) 34 | 35 | 36 | Contact: arch.dongl@gmail.com 37 | -------------------------------------------------------------------------------- /Restaurant_Slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/Restaurant_Slides.pdf -------------------------------------------------------------------------------- /data_dianping/food_baoding_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_baoding_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_beijing_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_beijing_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_chengdu_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_chengdu_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_hengyang_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_hengyang_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_kunming_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_kunming_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_shenyang_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_shenyang_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_shenzhen_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_shenzhen_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_yueyang_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_yueyang_v2.csv.zip -------------------------------------------------------------------------------- /data_dianping/food_zhengzhou_v2.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/data_dianping/food_zhengzhou_v2.csv.zip -------------------------------------------------------------------------------- /food-small.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leiii/restaurant/6f94980c6d5de73263d0306974b6c8a9eac061de/food-small.jpg -------------------------------------------------------------------------------- /restaurant_replication.R: -------------------------------------------------------------------------------- 1 | # Lei Dong 2 | # Email: arch.dongl@gmail.com 3 | library(glmnet) 4 | library(tidyr) 5 | library(dplyr) 6 | library(ggplot2) 7 | library(ggpubr) 8 | library(stringr) 9 | 10 | setwd("~/downloads/restaurant/replicate") 11 | 12 | ########### Figure 1 ############## 13 | food <- read.table("data_dianping/food_hengyang_v2.csv", header = T, sep = ",") 14 | category <- data.frame(table(food$mainCategoryNameEn)) 15 | category <- category[order(category$Freq), ] 16 | 17 | ggbarplot(tail(category,20), x = "Var1", y = "Freq", 18 | fill = "#4292c6", # change fill color by cyl 19 | color = "white", # Set bar border colors to white 20 | palette = "jco", # jco journal color palett. see ?ggpar 21 | sort.val = "desc", # Sort the value in dscending order 22 | sort.by.groups = FALSE, # Don't sort inside each group 23 | x.text.angle = 90, # Rotate vertically x axis texts 24 | xlab = "Hengyang" 25 | ) 26 | #ggsave("hengyang_cnt_v2.pdf", width = 5, height = 3) 27 | 28 | 29 | 30 | ########### Figure 2 ############## 31 | ### Figure 2ABCD group size 32 | city <- c("beijing", "shenzhen", "chengdu", 33 | "shenyang", "zhengzhou", "kunming", 34 | "baoding", "yueyang", "hengyang") 35 | 36 | # read files 37 | r2 <- data.frame() 38 | for (i in city){ 39 | dt <- read.table(paste0("rst/", i, "_lasso_r2_rolling_v2.csv"), header = TRUE, sep = ",") 40 | r2 <- rbind(r2, dt) 41 | } 42 | 43 | # filter, summarise, and plot 44 | for (i in city){ 45 | d <- r2 %>% 46 | filter(city == i) %>% 47 | group_by(cellsize, y) %>% 48 | summarise(r2_test_mean = mean(r2_test), 49 | r2_test_sd = sd(r2_test)) 50 | 51 | p <- ggplot(d, aes(cellsize/1000, r2_test_mean)) + 52 | geom_pointrange(aes(color = factor(y), 53 | ymin = r2_test_mean - r2_test_sd, 54 | ymax = r2_test_mean + r2_test_sd), 55 | alpha = 0.8, size = 0.8) + 56 | geom_line(aes(color = factor(y)), size = 1.5) + 57 | scale_color_manual(name = "legend", 58 | breaks = c("employment", "population", "firm", "consumption"), 59 | values = c("employment" = "red", "population" = "#E69F00", 60 | "firm" = "#56B4E9", "consumption" = "#999999")) + 61 | geom_vline(xintercept = 4.5, linetype = 2) + 62 | annotate("text", label = i, x = 4, y = 0.5) + 63 | ylim(c(0.3, 1.1)) + 64 | xlab("Grid size (km)") + 65 | ylab("Accuracy (R2)") + 66 | theme_classic(base_size = 18) 67 | p 68 | 69 | ggsave(paste0(i, "_r2_v2.pdf"), width = 6, height = 4) 70 | } 71 | 72 | 73 | 74 | ### Figure 2E group size 75 | big_city <- r2 %>% 76 | filter(city %in% "beijing" | city %in% "shenzhen" | city %in% "chengdu") %>% 77 | filter(cellsize == 5000 | cellsize == 1000) %>% 78 | group_by(cellsize, y) %>% 79 | summarise(r2_test_mean = mean(r2_test), 80 | r2_test_sd = sd(r2_test)) 81 | 82 | mid_city <- r2 %>% 83 | filter(city %in% "shenyang" | city %in% "zhengzhou" | city %in% "kunming") %>% 84 | filter(cellsize == 5000 | cellsize == 1000) %>% 85 | group_by(cellsize, y) %>% 86 | summarise(r2_test_mean = mean(r2_test), 87 | r2_test_sd = sd(r2_test)) 88 | 89 | sma_city <- r2 %>% 90 | filter(city %in% "baoding" | city %in% "yueyang" | city %in% "hengyang") %>% 91 | filter(cellsize == 5000 | cellsize == 1000) %>% 92 | group_by(cellsize, y) %>% 93 | summarise(r2_test_mean = mean(r2_test), 94 | r2_test_sd = sd(r2_test)) 95 | 96 | # set x axis 97 | sma_city$x <- 0 98 | mid_city$x <- 1 99 | big_city$x <- 2 100 | rst <- as.data.frame(rbind(sma_city, mid_city, big_city)) 101 | rst[rst$y == "employment", ]$x <- rst[rst$y == "employment", ]$x - 0.3 102 | rst[rst$y == "population", ]$x <- rst[rst$y == "population", ]$x - 0.15 103 | rst[rst$y == "consumption", ]$x <- rst[rst$y == "consumption", ]$x + 0.15 104 | 105 | # plot 106 | ggplot(rst, aes(x, r2_test_mean)) + 107 | geom_pointrange(aes(color = factor(y), 108 | ymin = r2_test_mean - r2_test_sd, 109 | ymax = r2_test_mean + r2_test_sd), 110 | size = 0.8) + 111 | scale_color_manual(name = "legend", 112 | breaks = c("employment", "population", "firm", "consumption"), 113 | values = c("employment" = "red", "population" = "#E69F00", 114 | "firm" = "#56B4E9", "consumption" = "#999999")) + 115 | #ylim(c(0.4, 1.2)) + 116 | xlab("City size group") + 117 | ylab("Accuracy (R2)") + 118 | theme_classic(base_size = 18) 119 | #ggsave("citygroup_r2_v2.pdf", width = 6, height = 4, useDingbats=FALSE) 120 | 121 | 122 | 123 | ### Figure 2F subsample 124 | bj <- read.table("rst/beijing_subsample_lasso_r2_rolling_v2.csv", header = TRUE, sep = ",") 125 | bj1 <- bj %>% 126 | group_by(smp_per, y) %>% 127 | summarise(r2_test_mean = mean(r2_test), 128 | r2_test_sd = sd(r2_test)) 129 | 130 | ggplot(bj1, aes(smp_per*100, r2_test_mean)) + 131 | geom_pointrange(aes(color = factor(y), 132 | ymin = r2_test_mean - r2_test_sd, 133 | ymax = r2_test_mean + r2_test_sd), 134 | size = 0.8, alpha = 0.8) + 135 | scale_color_manual(name = "legend", 136 | breaks = c("employment", "population", "firm", "consumption"), 137 | values = c("employment" = "red", "population" = "#E69F00", 138 | "firm" = "#56B4E9", "consumption" = "#999999")) + 139 | ylim(c(0.6, 1.0)) + 140 | xlab("Percentage (%)") + 141 | ylab("Accuracy (R2)") + 142 | theme_classic(base_size = 18) 143 | #ggsave("beijing_subsample_r2_v2.pdf", width = 6, height = 4, useDingbats=FALSE) 144 | 145 | 146 | 147 | ########### Figure 3 ############## 148 | ### Figure 3A spatial distribution of erros 149 | dt0 <- read.table("feature/beijing_employment_3000_0_0.csv", header = TRUE, sep = ",") 150 | dt1 <- read.table("feature/beijing_employment_3000_0.5_0.csv", header = TRUE, sep = ",") 151 | dt2 <- read.table("feature/beijing_employment_3000_0_0.5.csv", header = TRUE, sep = ",") 152 | dt3 <- read.table("feature/beijing_employment_3000_0.5_0.5.csv", header = TRUE, sep = ",") 153 | dt <- rbind(dt1, dt2, dt3) 154 | dt <- separate(data = dt, col = X, into = c("x", "y"), sep = "_") 155 | dt_ <- separate(data = dt0, col = X, into = c("x", "y"), sep = "_") 156 | dt[is.na(dt)] <- 0 157 | dt_[is.na(dt_)] <- 0 158 | 159 | # data format 160 | train_x <- as.matrix(dt[,c(4:ncol(dt))]) 161 | train_y <- as.matrix(dt[,c(3)]) 162 | test_x <- as.matrix(dt_[,c(4:ncol(dt_))]) 163 | test_y <- as.matrix(dt_[,c(3)]) 164 | 165 | # LASSO fit (alpha=1: LASSO) 166 | cvfit <- cv.glmnet(train_x, train_y, alpha = 1, nfolds = 5) 167 | plot(cvfit) 168 | 169 | # r2 of training data 170 | mse <- cvfit$cvm[cvfit$lambda == cvfit$lambda.min] 171 | r2_train <- 1- mse/var(train_y) 172 | print(r2_train) 173 | 174 | # r2 of testing data 175 | pred_lasso <- predict(cvfit, newx = test_x, s = "lambda.min") 176 | mse <- sum((pred_lasso - test_y)^2) / length(test_y) 177 | r2_test <- 1- mse/var(test_y) 178 | print(r2_test) 179 | 180 | # calculate errors 181 | rst <- data.frame(cbind(dt_$x, dt_$y, dt_$employment, pred_lasso)) 182 | colnames(rst) <- c("X", "Y", "observed", "predicted") 183 | rst <- mutate_all(rst, function(x) as.numeric(as.character(x))) 184 | rst$error <- (rst$predicted - rst$observed)/rst$observed 185 | rst$error[rst$error > 1] <- 1 186 | rst$error[rst$error < -1] <- -1 187 | 188 | # plot ground truth vs. prediction 189 | ggplot(aes(log10(observed), log10(predicted)), 190 | data = rst[rst$predicted > 100 & rst$observed > 100,]) + 191 | geom_point(color = "blue", alpha = 0.3) + 192 | geom_abline(slope = 1, intercept = 0) + 193 | ggtitle("Beijing daytime population (3km)") + 194 | xlab("Observed (log10)") + 195 | ylab("Predicted (log10)") 196 | #ggsave("beijing_employment_3000.pdf", width = 4, height = 3) 197 | 198 | 199 | 200 | # plot spatial distribution of errors 201 | library(rgdal) #coord. transform 202 | library(OpenStreetMap) #base map 203 | 204 | # convert projection to wgs84 205 | cord_dec <- SpatialPoints(cbind(rst$X, rst$Y), proj4string = CRS("+init=epsg:2436")) 206 | cord_wgs <- spTransform(cord_dec, CRS("+init=epsg:4326")) 207 | lonlat <- as.data.frame(cord_wgs@coords) 208 | colnames(lonlat) <- c("lon", "lat") 209 | rst <- cbind(rst, lonlat) 210 | 211 | # boundary of Beijing 212 | LAT1 = 39.45 ; LAT2 = 40.5 213 | LON1 = 115.8 ; LON2 = 117 214 | 215 | # boundary of Chengdu 216 | #LAT1 = 30.3; LAT2 = 30.96 217 | #LON1 = 103.73; LON2 = 104.47 218 | 219 | # boundary of Zhengzhou 220 | #LAT1 = 34.55; LAT2 = 34.97 221 | #LON1 = 113.46; LON2 = 113.95 222 | 223 | # boundary of Baoding 224 | #LAT1 = 38.6; LAT2 = 39.1 225 | #LON1 = 115.17; LON2 = 115.69 226 | 227 | map <- openmap(c(LAT2, LON1), c(LAT1, LON2), zoom = NULL, 228 | type = "stamen-toner", 229 | mergeTiles = TRUE) 230 | map_latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") 231 | 232 | # with basemap 233 | osmmap <- autoplot(map_latlon)+ 234 | #labs(x = "Lon", y = "Lat")+ 235 | geom_point(data = rst[rst$observed > 2000,], 236 | aes(x = lon, y = lat, color = error*100, size = observed)) + 237 | scale_size(range = c(1, 5), name = "Daytime popu.") + 238 | scale_color_gradient2(low = "steelblue", high = "#f46d43", name = "Error(%)") + 239 | xlim(LON1, LON2) + 240 | ylim(LAT1, LAT2) 241 | osmmap 242 | #ggsave("beijing_employment_3000_error_spatial_v2.pdf", width = 5, height = 4, useDingbats=FALSE) 243 | 244 | 245 | 246 | ### Figure 3B and Table S3 jiedao demographic 247 | demo <- read.table("rst/beijing_demographic_r2_jiedao.csv", header = TRUE, sep = ",") 248 | demo_feature <- read.table("rst/beijing_demographic_feature_jiedao.csv", header = TRUE, sep = ",") 249 | 250 | demo1 <- demo %>% 251 | group_by(y) %>% 252 | summarise(r2_mean = mean(r2), 253 | r2_sd = sd(r2)) 254 | 255 | ggplot(demo1, aes(y, r2_mean)) + 256 | geom_pointrange(aes(color = factor(y), 257 | ymin = r2_mean - r2_sd, 258 | ymax = r2_mean + r2_sd), 259 | size = 0.8, alpha = 0.8) + 260 | xlab("Group") + 261 | ylab("Accuracy (R2)") + 262 | ylim(0.4, 0.8) + 263 | theme_classic(base_size = 18) 264 | #ggsave("beijing_demographic_r2_jiedao.pdf", width = 4, height = 4, useDingbats=FALSE) 265 | 266 | 267 | 268 | # Table S3 269 | young <- demo_feature[demo_feature$y == "14",] 270 | middle <- demo_feature[demo_feature$y == "15-64",] 271 | old <- demo_feature[demo_feature$y == "65",] 272 | migrant <- demo_feature[demo_feature$y == "migrant",] 273 | wealth <- demo_feature[demo_feature$y == "wealth",] 274 | 275 | migrant$split <- as.data.frame(str_split_fixed(migrant$name, "_", 2))[[1]] 276 | wealth$split <- as.data.frame(str_split_fixed(wealth$name, "_", 2))[[1]] 277 | young$split <- as.data.frame(str_split_fixed(young$name, "_", 2))[[1]] 278 | middle$split <- as.data.frame(str_split_fixed(middle$name, "_", 2))[[1]] 279 | old$split <- as.data.frame(str_split_fixed(old$name, "_", 2))[[1]] 280 | 281 | migrant_sum <- as.data.frame(table(migrant$split)) 282 | wealth_sum <- as.data.frame(table(wealth$split)) 283 | young_sum <- as.data.frame(table(young$split)) 284 | middle_sum <- as.data.frame(table(middle$split)) 285 | old_sum <- as.data.frame(table(old$split)) 286 | 287 | 288 | 289 | ### Figure 3C Firm 290 | firm <- read.table("rst/beijing_lasso_firm_r2_jiedao_v2.csv", header = TRUE, sep = ",") 291 | firm_feature <- read.table("rst/beijing_lasso_firm_feature_jiedao_v2.csv", header = TRUE, sep = ",") 292 | 293 | firm1 <- firm %>% 294 | group_by(y) %>% 295 | summarise(r2_train_mean = mean(r2_train), 296 | r2_train_sd = sd(r2_train)) 297 | 298 | ggplot(firm1, aes(y, r2_train_mean)) + 299 | geom_pointrange(aes(color = factor(y), 300 | ymin = r2_train_mean - r2_train_sd, 301 | ymax =r2_train_mean + r2_train_sd)) + 302 | xlab("Percentage (%)") + 303 | ylab("Accuracy (R2)") + 304 | ylim(0.4, 0.6) + 305 | theme_classic(base_size = 18) 306 | #ggsave("beijing_firm_r2_jiedao.pdf", width = 6, height = 4, useDingbats=FALSE) 307 | 308 | 309 | 310 | # Table S3 311 | business_service <- firm_feature[firm_feature$y == "retail",] 312 | business_service$split <- as.data.frame(str_split_fixed(business_service$name, "_", 2))[[1]] 313 | business_service_sum <- as.data.frame(table(business_service$split)) 314 | 315 | 316 | 317 | ########### Transfer ############## 318 | ### Figure 4 319 | city <- c("beijing", "shenzhen", "chengdu", 320 | "shenyang", "zhengzhou", "kunming", 321 | "baoding", "yueyang", "hengyang") 322 | 323 | r2_matrix <- data.frame() 324 | feature <- data.frame() 325 | for (i in city){ 326 | print(i) 327 | dt <- read.table(paste0("rst/", i, "_lasso_r2_transfer_v2.csv"), header = TRUE, sep = ",") 328 | fe <- read.table(paste0("rst/" , i, "_lasso_feature_transfer_v2.csv"), header = TRUE, sep = ",") 329 | feature <- rbind(feature, fe) 330 | dt1 <- dt %>% 331 | group_by(city_b, y) %>% 332 | summarise(r2_test_mean = mean(r2_test), 333 | r2_test_sd = sd(r2_test)) 334 | dt1 <- as.data.frame(dt1) 335 | dt1$city_a <- i 336 | r2_matrix <- rbind(r2_matrix, dt1) 337 | } 338 | 339 | # add diag of the matrix 340 | r2_matrix_diag <- data.frame() 341 | for (i in city){ 342 | print(i) 343 | dt <- read.table(paste0("rst/", i, "_lasso_r2_rolling_v2.csv"), header = TRUE, sep = ",") 344 | dt1 <- dt %>% 345 | filter(cellsize == 3000) %>% 346 | group_by(y) %>% 347 | summarise(r2_test_mean = mean(r2_test), 348 | r2_test_sd = sd(r2_test)) 349 | dt1 <- as.data.frame(dt1) 350 | dt1$city_a <- i 351 | dt1$city_b <- i 352 | r2_matrix_diag <- rbind(r2_matrix_diag, dt1) 353 | } 354 | 355 | r2_matrix_sum <- rbind(r2_matrix, r2_matrix_diag) 356 | r2_matrix_sum$city_a_order <- factor(r2_matrix_sum$city_a, levels = city) 357 | r2_matrix_sum$city_b_order <- factor(r2_matrix_sum$city_b, levels = city) 358 | 359 | # plot 360 | ggplot(data = r2_matrix_sum[r2_matrix_sum$y == "population", ], 361 | aes(city_a_order, city_b_order, fill = r2_test_mean)) + 362 | geom_tile(color = "white") + 363 | geom_text(aes(label = round(r2_test_mean, 2)), alpha = 0.8) + 364 | scale_fill_gradient(high = "steelblue", low = "white", 365 | limits = c(0.5, 1), name = "R2") + 366 | theme_minimal() + 367 | theme(axis.text.x = element_text(angle = 45, vjust = 1, 368 | size = 12, hjust = 1)) + 369 | coord_fixed() 370 | #ggsave("transfer_population_r2_v2.pdf", width = 6, height = 5, useDingbats=FALSE) 371 | 372 | 373 | transfer_feature <- feature[feature$y == "consumption",] 374 | transfer_feature$split <- as.data.frame(str_split_fixed(transfer_feature$name, "_", 2))[[1]] 375 | transfer_feature_sum <- as.data.frame(table(transfer_feature$split)) 376 | 377 | 378 | 379 | 380 | ########### Supplementary Information ############## 381 | ### Figure S3-5 382 | city <- c("baoding", "yueyang", "hengyang") 383 | empl <- data.frame() 384 | popu <- data.frame() 385 | firm <- data.frame() 386 | cons <- data.frame() 387 | poi.cnt <- data.frame() 388 | price <- data.frame() 389 | 390 | for (i in city){ 391 | empl_ <- read.table(paste0("feature/", i ,"_employment_3000_0_0.csv"), header = TRUE, sep = ",") 392 | popu_ <- read.table(paste0("feature/", i ,"_population_3000_0_0.csv"), header = TRUE, sep = ",") 393 | firm_ <- read.table(paste0("feature/", i ,"_firm_3000_0_0.csv"), header = TRUE, sep = ",") 394 | cons_ <- read.table(paste0("feature/", i ,"_consumption_3000_0_0.csv"), header = TRUE, sep = ",") 395 | empl <- rbind(empl, cbind(empl_$employment, rep(i, nrow(empl_)))) 396 | popu <- rbind(popu, cbind(popu_$population, rep(i, nrow(popu_)))) 397 | firm <- rbind(firm, cbind(firm_$firm, rep(i, nrow(firm_)))) 398 | cons <- rbind(cons, cbind(cons_$consumption, rep(i, nrow(cons_)))) 399 | poi.cnt <- rbind(poi.cnt, cbind(empl_$poi_cnt, rep(i, nrow(empl_)))) 400 | price <- rbind(price, cbind(empl_$price_mean, rep(i, nrow(empl_)))) 401 | } 402 | 403 | ggplot(empl, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) + 404 | geom_density(size = 1) + 405 | scale_color_brewer(palette="Paired") + 406 | xlab("Daytime population (log)") + 407 | theme_classic(base_size = 16) 408 | #ggsave("smallcity_employment.pdf", width = 5, height = 4, useDingbats=FALSE) 409 | 410 | ggplot(popu, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) + 411 | geom_density(size = 1) + 412 | scale_color_brewer(palette="Paired") + 413 | xlab("Nightime population (log)") + 414 | theme_classic(base_size = 16) 415 | #ggsave("smallcity_population.pdf", width = 5, height = 4, useDingbats=FALSE) 416 | 417 | ggplot(firm, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) + 418 | geom_density(size = 1) + 419 | scale_color_brewer(palette="Paired") + 420 | xlab("Firm (log)") + 421 | theme_classic(base_size = 16) 422 | #ggsave("smallcity_firm.pdf", width = 5, height = 4, useDingbats=FALSE) 423 | 424 | ggplot(cons, aes(x=log(as.numeric(as.character(V1))), color=as.factor(V2))) + 425 | geom_density(size = 1) + 426 | scale_color_brewer(palette="Paired") + 427 | xlab("Consumption (log)") + 428 | theme_classic(base_size = 16) 429 | #ggsave("smallcity_consumption.pdf", width = 5, height = 4, useDingbats=FALSE) 430 | 431 | ggplot(poi.cnt, aes(x=as.numeric(as.character(V1)), color=as.factor(V2))) + 432 | geom_density(size = 1) + 433 | scale_color_brewer(palette="Paired") + 434 | xlab("Restaurant number (log)") + 435 | theme_classic(base_size = 16) 436 | #ggsave("smallcity_restaurant.pdf", width = 5, height = 4, useDingbats=FALSE) 437 | 438 | ggplot(price, aes(x=as.numeric(as.character(V1)), color=as.factor(V2))) + 439 | geom_density(size = 1) + 440 | scale_color_brewer(palette="Paired") + 441 | xlab("Average price (log)") + 442 | theme_classic(base_size = 16) 443 | #ggsave("smallcity_price.pdf", width = 5, height = 4, useDingbats=FALSE) 444 | 445 | 446 | 447 | ### Figure S6 448 | city <- c("beijing") 449 | r2 <- data.frame() 450 | for (i in city){ 451 | dt <- read.table(paste0("rst/", i, "_lasso_r2_norolling_v2.csv"), header = TRUE, sep = ",") 452 | r2 <- rbind(r2, dt) 453 | } 454 | 455 | d <- r2 %>% 456 | group_by(city, cellsize, y) %>% 457 | summarise(r2_test_mean = mean(r2_test), 458 | r2_test_sd = sd(r2_test)) 459 | 460 | ggplot(d, aes(cellsize/1000, r2_test_mean)) + 461 | geom_pointrange(aes(colour = factor(y), 462 | ymin=r2_test_mean-r2_test_sd, 463 | ymax=r2_test_mean+r2_test_sd), 464 | alpha = 0.8, size = 0.8) + 465 | geom_line(aes(colour = factor(y)), size = 1.5) + 466 | scale_color_manual(name = "legend", 467 | breaks = c("employment", "population", "firm", "consumption"), 468 | values = c("employment" = "red", "population" = "#E69F00", 469 | "firm" = "#56B4E9", "consumption" = "#999999")) + 470 | geom_vline(xintercept = 4.5, linetype = 2) + 471 | annotate("text", label = i, x = 4, y = 0.5) + 472 | ylim(c(0.3, 1.1)) + 473 | xlab("Grid size (km)") + 474 | ylab("Accuracy (R2)") + 475 | theme_classic(base_size = 18) 476 | #ggsave("beijing_r2_norolling_v2.pdf", width = 6, height = 4) 477 | 478 | 479 | 480 | ### Table S2 481 | r2_rst <- expand.grid( 482 | city = c("beijing"), 483 | cellsize = seq(1000, 5000, by = 500), 484 | y = c("population", "employment", "firm", "consumption"), 485 | r2 = 0 486 | ) 487 | 488 | for (i in 1:nrow(r2_rst)){ 489 | print (i) 490 | # readfile 491 | basefilename <- paste(r2_rst[i, "city"], r2_rst[i, "y"], r2_rst[i, "cellsize"], sep = "_") 492 | dt <- read.table(paste0("feature/", basefilename, "_0_0.csv"), header = T, sep = ",") 493 | dt1 <- dt[,c(1,2)] 494 | 495 | light <- read.table(paste0("feature/beijing_light_", r2_rst[i, "cellsize"], "_0_0.csv"), header = TRUE, sep = ",") 496 | light1 <- light %>% 497 | group_by(idx) %>% 498 | summarise(light_sum = sum(light)) 499 | m <- merge(dt1, light1, by.x = "X", by.y = "idx") 500 | model <- summary(lm(log(m[, 2]) ~ log(m$light_sum))) 501 | r2_rst[i, "r2"] <- model$r.squared 502 | } 503 | 504 | 505 | --------------------------------------------------------------------------------