├── LICENSE
├── README.md
└── data
    └── census
        ├── census_city_2010-2020_v1.csv
        ├── census_county_2010-2020_v1.csv
        ├── census_county_2010_v1.xlsx
        ├── census_county_2020_v1.xlsx
        └── readme.md

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Lei Dong
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # China Census
2 |
3 | ## Background
4 |
5 | Most countries rely on census data for a comprehensive view of population dynamics. However, the records between census years at the city/county level in China are not comparable, as the country has made substantial changes to administrative levels and boundaries over the past decade.
Furthermore, these adjustments were not well documented in the released census datasets. Additionally, delays in releasing census data are frequent. To date, the National Bureau of Statistics of China still has not released the detailed city/county-level data from the 2020 Census. To overcome such data challenges, we manually collected and digitized 2,317 gazettes (~10,000 PDFs and web pages in total) from local official sources for the 2020 Census and merged them with 2010 Census data. To account for intercensal changes in county boundaries, we matched county-level data with 770 released administrative changes, e.g., changes in administrative status, city/county consolidation or division, and land reconfigurations. Finally, we built a population panel from 2010 to 2020 at both the county and prefectural-city levels, which covers 2,666 counties in 356 prefectural-cities. The dataset has been made publicly available in this repo.
6 |
7 | ### Data contributors:
8 |
9 | - Lei Dong (lead), Xiaohuan Wu (coordinator), Yunhan Yang, Yizhen Yang, Qianqian Yu, Qiushi Zhou, Lei Yu, Shuang Li, He Zhang, Shang Gao, Liang Zhao, Xinru Chen, and Yuxia Wang.
10 |
11 | ### Roadmap
12 |
13 | - Release V1 data (Mar 2022)
14 |     * 2020 county-level/prefectural-level population data (based on gazettes)
15 |     * 2010 county-level/prefectural-level population data (based on 6th census yearbook)
16 |     * 2010/2020 county-level/prefectural-level matched data with geographical boundaries
17 |
18 | - V2 data ~~(June 2023)~~ | Postponed
19 |     * 2000/2010/2020 county-level/prefectural-level matched data with geographical boundaries
20 |     * update 2020 county-level/prefectural-level population data (based on 7th census yearbook) [DONE]
21 |     * 2010 township-level population data (based on 6th census yearbook)
22 |     * 2000 county-level/prefectural-level population data (based on 5th census yearbook)
23 |
24 | - V3 data
25 |     * 2005/2015 1% national population sample survey
26 |
27 |
28 | We also plan to build a comprehensive database of Chinese cities by combining the economic census, the annual prefecture-level city statistical yearbook, and more socio-economic data (e.g., firms, nighttime light, remote sensing, mobile phone). If you are interested in contributing, feel free to send me an email.
29 |
30 | ### Reference
31 |
32 | If this dataset is helpful for your research, please cite the following paper:
33 | > Mapping Evolving Population Geography in China, Lei Dong, Rui Du, and Yu Liu, [Working Paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4049338), 2022.
34 |
35 |
36 |
--------------------------------------------------------------------------------
/data/census/census_county_2010_v1.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/census/0ba0f1efae092f4aa6569ef38e53b3bb4d2f0087/data/census/census_county_2010_v1.xlsx
--------------------------------------------------------------------------------
/data/census/census_county_2020_v1.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/leiii/census/0ba0f1efae092f4aa6569ef38e53b3bb4d2f0087/data/census/census_county_2020_v1.xlsx
--------------------------------------------------------------------------------
/data/census/readme.md:
--------------------------------------------------------------------------------
1 | # Readme
2 |
3 | ## Data description
4 |
5 | - census_county_2010_v1.xlsx : 2010 county-level census data from the 6th Census Yearbook (without boundaries).
6 | - census_county_2020_v1.xlsx : 2020 county-level census data from the 7th Census Yearbook (without boundaries).
7 | - census_county_2010-2020_v1.csv : Data for 2010 are from the 6th Census Yearbook, and data for 2020 are from the 7th Census Bulletin (see below for details, with administrative boundaries).
8 | - census_city_2010-2020_v1.csv : Same as the county-level data, only at the city level.
9 |
10 | ## Census data source
11 |
12 | The data sources are the statistical bulletins of the censuses published by local governments. We collected relevant information from official government websites, the census website, and the bureau of statistics website. We also included census statistics released by some local governments on their official WeChat Accounts.
13 |
14 | There are a handful of counties that did not release data from the above sources. For these counties, we contacted the local statistical bureaus by email or phone to obtain the data. Fig.
S1 shows a sample web page of a census statistical bulletin.
15 |
16 | Because most of the counties’ census documents are web pages, PDFs, or even images, we manually converted them into a structured dataset; each data point was cross-validated by at least two research assistants to ensure data quality. We corrected a small number of errors in the original information released by the governments.
17 |
18 | In addition to the resident population, we also collected data on sex ratio, average number of family members, and age structure (including four groups: 0-14, 15-59, >=60, and >=65) from the statistical bulletins.
19 |
20 | Note that other scholars have recently made valuable efforts to compile the 2020 county-level census data. However, those datasets include only population size and are not open-source. In comparison, our unique contributions are:
21 | - providing a rich set of variables (e.g., population, sex ratio, age groups, family size, etc.);
22 | - open-sourcing the full dataset;
23 | - adjusting for administrative boundary changes based on more accurate first-hand information we gathered.
24 |
25 | ## Administrative boundary adjustment
26 |
27 | - Step 1. For the 2020 Census data, most statistical bulletins report the change in population relative to the 2010 Census, which we use to derive the 2010 population as our baseline population records. In some counties and districts, the bulletins also indicate which administrative boundaries have been adjusted and provide adjusted 2010 population data based on the boundaries used in the 2020 Census.
28 | - Step 2. We compare the 2010 population data from the 2020 Census with those from the 2010 Census.
29 |     * If the records from these two sources are consistent, there is no change in the administrative boundary of the county.
30 |     * Otherwise, the administrative boundary of the county has been adjusted between 2010 and 2020.
31 |     * Note that some districts/counties did not publish the 2010 population data in the 2020 Census, or clearly stated that the boundaries have been adjusted and the 2010 data are unadjusted. We adjust these districts/counties in Step 3.
32 | - Step 3. For districts/counties with adjusted administrative boundaries, we further check the 2010 and 2020 statistical codes using the records published on the website of [the Bureau of Statistics](http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/). We then combine the codes with information from Baidu Baike (the Wikipedia of China) and government [websites](http://www.mca.gov.cn/article/sj/xzqh/1980/) to infer the exact adjustment procedure. Common adjustments include changes in administrative status (e.g., from county-level city to district), changes in administrative areas, etc.
33 | - Step 4. For specific boundary changes, we adjust the census data accordingly. For example:
34 |     * If county A is abolished to establish district A (no change of the administrative area), then only the statistical county code needs to be changed.
35 |     * If county A and county B are merged into county C, we add up the 2010 population data of county A and county B (merged proportional data such as population by age structure are weighted sums of the focal variable in the pre-merging counties).
36 |     * If county D is divided into counties E and F, then we add up the 2020 population data of counties E and F. These adjustments require the use of the township-level census data in certain cases.
37 |     * Finally, some intercensal changes could not be adjusted for due to a lack of information. For example, a portion of county G might be transferred to county H but the transferred area is not well documented in census data. In this case, we merge counties G and H into one county-level unit.
38 |     * The overall goal of the adjustment process is to make the data in 2010 and 2020 comparable.
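
The consistency check in Steps 1 and 2 and the merging rule in Step 4 can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to build the dataset: the `County` fields, the function names, the 0.5% tolerance, and the use of population weights for proportional variables are all assumptions made for the example, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class County:
    code: str        # statistical county code (hypothetical field name)
    pop: int         # resident population
    share_65: float  # proportion of residents aged >= 65, in [0, 1]


def implied_2010_pop(pop_2020: int, pct_change: float) -> float:
    """Step 1: back out the 2010 population from the 2020 count and the
    percentage change relative to 2010 reported in a bulletin."""
    return pop_2020 / (1 + pct_change / 100)


def boundary_changed(pop_2010_census: float, pop_2010_implied: float,
                     rel_tol: float = 0.005) -> bool:
    """Step 2: flag a county whose two 2010 figures disagree.
    rel_tol is an assumed allowance for rounding in the bulletins."""
    return abs(pop_2010_census - pop_2010_implied) > rel_tol * pop_2010_census


def merge_counties(parts: list[County], new_code: str) -> County:
    """Step 4: combine pre-merger counties into one unit.
    Counts are summed; proportions are averaged with population weights
    (an assumed reading of "weighted sums")."""
    total = sum(c.pop for c in parts)
    share = sum(c.pop * c.share_65 for c in parts) / total
    return County(new_code, total, share)


# Counties A and B (made-up numbers) are merged into county C.
a = County("A", 300_000, 0.10)
b = County("B", 100_000, 0.20)
c = merge_counties([a, b], "C")
print(c.pop, round(c.share_65, 3))
```

The same `merge_counties` helper would also cover the fallback case in which counties G and H are combined into one unit because a partial transfer between them is not documented.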
39 |
--------------------------------------------------------------------------------