├── MegaCOV_US_mobility.png
├── README.md
├── megaCOV_cities._geo.png
├── megaCOV_cities.png
├── megaCOV_geo.png
├── megaCOV_top_langs.png
└── tweet_ids
    └── readme.md


/MegaCOV_US_mobility.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UBC-NLP/megacov/b19781e48c7920ba1876c9bb4550287cdd0caeca/MegaCOV_US_mobility.png


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Mega-COV
This repository includes information about *Mega-COV*, the dataset first introduced in our recent paper ([Mega-COV: A Billion-Scale Dataset of 100+ Languages For COVID-19](https://arxiv.org/abs/2005.06012)), currently available on arXiv (to appear in EACL 2021). We have also published a [Medium post](https://medium.com/@mumageed/billion-scale-investigation-of-covid-19-impact-on-human-communication-in-104-languages-874b5a37beac) about this work. Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 234 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~32M tweets). We release tweet IDs from the dataset in the hope that they will be useful for studying various phenomena related to the ongoing pandemic and for accelerating viable solutions to associated problems.

*Note that the data here is the second version of the dataset and is therefore more recent than the initial release that accompanied the first arXiv manuscript. We will soon add an updated version of the paper reflecting the current data release.*

---

# World map coverage of Mega-COV
![World map coverage of Mega-COV](megaCOV_cities._geo.png)
**(a) Left: Cities.** Each dot is a city. Contiguous cities of the same color belong to the same country.
**(b) Right: Point coordinates.** Each dot is a point coordinate (longitude and latitude) from which at least one tweet was posted.

---

# Word clouds for hashtags of Mega-COV
Word clouds for hashtags in tweets from the top 10 languages in the data. Note that tweets in languages other than English can still carry English hashtags or use Latin script.
![Word clouds of Mega-COV](megaCOV_top_langs.png)

---
# Inter-state user mobility in the U.S. for Jan.-May, 2020
![MegaCOV_US_mobility](MegaCOV_US_mobility.png)

# Download the Data

## Data Usage Agreement
- **Mega-COV** is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License ([CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)). By using the dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s [Terms of Service](https://developer.twitter.com/en/developer-terms/agreement-and-policy), and cite the related paper ([Mega-COV: A Billion-Scale Dataset of 100+ Languages For COVID-19](https://arxiv.org/abs/2005.06012)). See the Citation section at the end of this README.

- `Data organization and download files`: [link](https://github.com/UBC-NLP/megacov/tree/master/tweet_ids)

# Ethical Considerations
We collect **Mega-COV** from the public domain (Twitter). In compliance with Twitter policy, we do not publish hydrated tweet content. Rather, we only publish publicly available tweet IDs. All Twitter policies, including respect for and protection of user privacy, apply. We encourage all researchers who decide to use **Mega-COV** to review the [Twitter policy](https://developer.twitter.com/en/developer-terms/agreement-and-policy) before they start working with the data.
For example, Twitter provides the following policy on the use of [sensitive information](https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases):


## Sensitive information

You should be careful about using Twitter data to derive or infer potentially sensitive characteristics about Twitter users. Never derive or infer, or store derived or inferred, information about a Twitter user’s:

- Health (including pregnancy)
- Negative financial status or condition
- Political affiliation or beliefs
- Racial or ethnic origin
- Religious or philosophical affiliation or beliefs
- Sex life or sexual orientation
- Trade union membership
- Alleged or actual commission of a crime

Aggregate analysis of Twitter content that does not store any personal data (for example, user IDs, usernames, and other identifiers) is permitted, provided that the analysis also complies with applicable laws and all parts of the Developer Agreement and Policy.

---

# Inquiries?
If you have any questions about this dataset, please contact us at *muhammad.mageed[at]ubc[dot]ca*.

---
# Citation
```
@inproceedings{mageed2020MegaCOV,
  title={Mega-COV: A Billion-Scale Dataset of 100+ Languages For COVID-19},
  author={Muhammad Abdul-Mageed and AbdelRahim Elmadany and El Moatez Billah Nagoudi and Dinesh Pabbi and Kunal Verma and Rannie Lin},
  booktitle={EACL},
  year={2021}
}
```

--------------------------------------------------------------------------------
/megaCOV_cities._geo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UBC-NLP/megacov/b19781e48c7920ba1876c9bb4550287cdd0caeca/megaCOV_cities._geo.png


--------------------------------------------------------------------------------
/megaCOV_cities.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UBC-NLP/megacov/b19781e48c7920ba1876c9bb4550287cdd0caeca/megaCOV_cities.png


--------------------------------------------------------------------------------
/megaCOV_geo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UBC-NLP/megacov/b19781e48c7920ba1876c9bb4550287cdd0caeca/megaCOV_geo.png


--------------------------------------------------------------------------------
/megaCOV_top_langs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UBC-NLP/megacov/b19781e48c7920ba1876c9bb4550287cdd0caeca/megaCOV_top_langs.png


--------------------------------------------------------------------------------
/tweet_ids/readme.md:
--------------------------------------------------------------------------------
# Data Organization
The **Mega-COV** *V0.2* files are organized as follows:
- Tweet IDs are grouped by year, with 2020 data further broken down into
months. The collection comprises 8 compressed files.
- For each tweet, the files below provide the following information:
  - `tweet_id`
  - `Twitter_lang`: the language assigned by Twitter
  - `LangID_tool`: for tweets whose language Twitter marks as undefined (`und`), the language predicted by the langid tool
  - `country`: the country the tweet was posted from (assigned by Twitter)
  - `date_year`
  - `date_month`
  - `date_weekOfYear`
  - `date`
- The data files are saved in JSON format.

---

# Download

- [MagaCOV_Jan_2020.tar.gz](https://drive.google.com/file/d/1Cu245viZuOem9izj81W-4DvJ72LV7o9M/view?usp=sharing)
- [MagaCOV_Feb_2020.tar.gz](https://drive.google.com/file/d/1FaTGLDM7BUDwbt5eFHD3e4F5qkZx32dg/view?usp=sharing)
- [MagaCOV_Mar_2020.tar.gz](https://drive.google.com/file/d/1bbah6100egayyWnyLfWztfL1G2gxQ7rz/view?usp=sharing)
- [MagaCOV_Apr_2020.tar.gz](https://drive.google.com/file/d/1b_OgzP93njQKERhNKszlHPPuLnrA3Ep3/view?usp=sharing)
- [MagaCOV_May_2020.tar.gz](https://drive.google.com/file/d/1qA7JlgjZiB7-w_hISMnzSlL8-5iUCD5-/view?usp=sharing)
- [MagaCOV_2019.tar.gz](https://drive.google.com/file/d/1-8knSEaXdLLnFPew_cNfq5MYqDXS5njH/view?usp=sharing)
- [MagaCOV_2018.tar.gz](https://drive.google.com/file/d/1bmzKLY7x95gfPqLJjOh1R3Pv78HGk1w4/view?usp=sharing)
- [MagaCOV_2007-2017.tar.gz](https://drive.google.com/file/d/1sCKfF-7ue2q4IwRfcAlW90rTFLWOhqY2/view?usp=sharing)



--------------------------------------------------------------------------------
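# Example: reading the records

The fields listed under Data Organization can be read with a few lines of Python. The sketch below is illustrative only and is not part of the release. It assumes that an extracted file holds either a JSON array of records or one JSON object per line; the exact layout is not stated here, so check the files after extraction. The helper names (`load_records`, `resolve_language`, `count_by_country_and_language`) and the sample values are hypothetical. The result keeps only aggregate counts and no tweet IDs.

```python
import json
from collections import Counter


def load_records(text):
    """Parse records from a JSON array or from JSON Lines (assumed layouts)."""
    text = text.strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def resolve_language(record):
    """Use Twitter's label unless it is "und"; then fall back to LangID_tool."""
    lang = record.get("Twitter_lang")
    if lang == "und":
        return record.get("LangID_tool") or "und"
    return lang


def count_by_country_and_language(records):
    """Return aggregate counts per (country, language); no IDs are retained."""
    return Counter((r.get("country"), resolve_language(r)) for r in records)


# Made-up sample records that reuse the field names from the list above.
sample = "\n".join(
    json.dumps(record)
    for record in [
        {"tweet_id": "1", "Twitter_lang": "en", "LangID_tool": None, "country": "Canada"},
        {"tweet_id": "2", "Twitter_lang": "und", "LangID_tool": "ar", "country": "Egypt"},
        {"tweet_id": "3", "Twitter_lang": "en", "LangID_tool": None, "country": "Canada"},
    ]
)

counts = count_by_country_and_language(load_records(sample))
for (country, language), n in sorted(counts.items()):
    print(country, language, n)
```

To use it on real data, read the text of an extracted file and pass it to `load_records`. Retrieving tweet content from the IDs requires the Twitter API and is subject to the policies described in the main README.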