├── README.md ├── faq.md └── publicdatasets.md /README.md: --------------------------------------------------------------------------------
## IMPORTANT NOTE!
We are preparing to move this project to a GitLab account due to planned changes in UX and CI/CD pipelines. This repository will still be used for two main purposes:
1. Storing and providing the most recent versions of the Decision Trees
2. Storing and providing access to image maps
We also plan to keep assets and icon sets here.
## All issues will still be tracked via this repository.

# albero
Public representation of the Albero Project. Helping you choose the right data backend technology on Azure.

Here is the main representation of the Decision Tree for data backend technologies on Azure. Please use this HTML file for simple navigation. Click on a drill-down to be redirected to the subsequent Decision Trees.

Below are some explanations and our comments on why we created it, how to use it, and how to submit requests for changes. Enjoy!

# How to Select the Proper Data Backend Technology on Azure

_Disclaimer. This article represents the personal experience and understanding of the authors. Please use it for reference only. This article does not represent the official position of Microsoft._

_Simplicity is the ultimate sophistication._
_-- Leonardo Da Vinci_

# Before We Begin
In this article we talk a lot about different methods of comparing and selecting databases. We also present an alternative approach to looking at and weighing different options. At the same time, we would like to highlight that this is just one viewpoint among many. Please use what follows as a reference rather than prescriptive guidance.

# Important Note: What This Document Is and Isn't
This Decision Tree is:
• A map of the Azure data services whose main goal is to help you navigate among them and understand their strengths and weaknesses.
• Supplementary material to the officially published Microsoft documentation, helping you define and shape your thought process around selecting certain data technologies and using them together in your solutions.
This Decision Tree is not:
• A definitive guide to selecting data technologies.
• A business- or politics-related document. All the criteria we used are purely technical.
• A pattern- or use-case-focused document.
• A competitive analysis of any kind.
We take some responsibility for maintaining this document for as long as we can, but we still recommend verifying its points against official Microsoft guidance and documentation.
Also, do not hesitate to apply common sense, and please check things before putting them into production. Not all situations are the same or even similar.

The article has four chapters:
Chapter 1: Read and Write Profiles – explains the premise of the decision tree.
Chapter 2: Navigating Through the Decision Tree – a guide to navigating the decision tree.
Chapter 3: Mapping Use Cases to the Decision Tree – examples of how the decision tree is used for different use cases.
Chapter 4: Getting Access and Providing Feedback – finally, do not hesitate to share your experience and feedback with us. We cover how to do this in this chapter.

# Chapter 1: Read and Write Profiles
Our data technologies were developed mainly for two major purposes. And guess what: these are not encryption and obfuscation, but rather reading and writing data. Mainly reading, actually, because (and we hope you agree) there is no point in writing data you cannot read later on.
Surprisingly, we rarely compare data technologies based on their actual read and write behavior. Typically, while comparing data technologies, we are (pick all that apply):

- Focusing on some subset of the requirements.
- Checking "similar" cases.
- Adding technologies to the design one by one.
- Using "Reference Architectures" and "Patterns" in search of forgotten tribal knowledge.
- Surfing the Internet late into the night, hoping that by tweaking the search query we will find something that makes sense.

Basically, we craft the design of our data estate based on experience, preferences, and beliefs. When our group first faced the need to compare different technologies and recommend one, our first thought was: it is impossible. How would you compare a NoSQL database to a ledger database?
Very simply – by using their fundamental read and write goals as the foundation for the comparison. The essence of a technology remains the same, as does the goal of its creation. A sheep cannot become a tiger 😉

Intuitively (and, hopefully, obviously), if some data has a write path, it should also have a read path, and it may or may not have one or more processing capabilities / tools / approaches.

Of course, plenty of technologies and vendors claim that one single solution can solve every possible problem, but the rise of data technologies over the last decade shows that this is surely no longer the case.

Well, it seems we have finished with WHY and already started with WHAT. Let's move on and show you one of these Decision Trees in more detail.

# Chapter 2: Navigating Through the Decision Tree
So, to help you navigate the ever-changing and rather complex Azure data landscape, we have built a set of decision trees based on the concept of read and write profiles. Conceptually, a Decision Tree looks very simple.

Well, it is obviously not that simple. The good thing is that it covers almost the entire Azure data portfolio in one diagram (more than 20 services, tens of SKUs, integrations, and important features). So it simply cannot be super simple. But we are trying 😉

To guide you through it, let's paste a small example (subset) of this decision tree here and demonstrate some of the main features and ways to navigate through it.

## Basic Navigation
The tree is composed of two main paths: write and read patterns. The write pattern runs from the top to the middle and is marked with blue boxes and lines; the read pattern runs from the bottom to the middle and is marked with greenish boxes and lines. This reflects some of the fundamental differences in the behavior of the various technologies.
In the grey boxes you can see either questions or workload descriptions. As mentioned, this approach is not strictly defined in the mathematical sense; rather, it follows industry practices and includes the specific features and technical aspects that differentiate one technology from another.
In case of doubt, simply follow the yes / no path. When you have to choose among descriptions, pick the one that fits best.
Below are the components of a simple navigation.

## Leaning
There are also some trickier parts, where you cannot say with certainty which workload will be a better fit. In such cases we use wide blue arrows representing the "leaning" concept, pretty much like in the example below.

There is one more style of "leaning", represented by the so-called "paradigm". In some cases a technology will be preferred when you are using a particular programming language or stack. In our decision tree this is represented by the notion of a "paradigm", as shown in the picture below.

Typically, within one paradigm more than one product is available. To distinguish the main goal of a product within a certain paradigm we use a code word, as in the example above. This goal is represented by a single word shown above the box with the service, in the same color as the paradigm.
## Default Path
In most of the technology patterns we also have a "default" path for reads and writes. Typically, for a greenfield project this is the easiest and richest path (in terms of functionality, new features, and, possibly, the overall happiness of the users).

## Drill Down
In some cases we have also implemented a drill-down approach to simplify navigation. Drill-downs lead to a separate diagram explaining details of the service offerings or SKUs for a particular product / service.

A drill-down brings you to a new Decision Tree specific to that particular technology (such as SQL Database on Azure, PostgreSQL, or others). These Decision Trees follow the same or similar patterns with a reduced number of possible read and write profiles (as shown in the diagram below). On these Decision Trees, SLAs and high-availability options, as well as storage and RAM limits, are defined on a per-SKU basis.

## SLAs & Limitations
Another cool feature of the Decision Tree is the depiction of the maximum achievable SLA, high-availability options, and storage / RAM limits (where that makes sense).
These are implemented as shown below. Please remember that they may differ from SKU to SKU, and only the maximum achievable values are shown on the main Decision Tree.

Please note that all / most (just in case we forgot something) of the icons with limits, HA, and SLA are clickable, so you will be redirected to the official Microsoft documentation.
## Developer View
One of the newest features is the Developer View. In this view we list all the procedural languages supported by the technology, as well as SDKs and some important limits on item or result-set sizes where applicable. We also depict the supported file types and formats.
We are planning to turn these into references to the official Microsoft documentation (much as was done with SLAs, storage, etc.).

## Read and Write Profiles Do Not Match
With two separate profiles for reads and writes there is a very important and frequently asked question: "What if the read and write profiles do not match?"
Let's answer with a question. What do you typically do when the technology used for writes is not suitable for reads with the required pattern / functionality? The answer is quite obvious – you introduce one more technology into your solution.
To help you find which components can be directly integrated with each other, we have introduced the concepts of "Integration In" and "Integration Out". An example of the notation is shown below.

In this example we can see that Azure Synapse Analytics can accept data from:
• Azure Cosmos DB using Azure Synapse Link
• Azure Data Lake Store Gen2 / Blob Storage using the CTAS functionality of PolyBase
• Azure Stream Analytics directly via an output
• Azure Databricks using the Azure Databricks Connector for Azure Synapse
And that it can export data to ADLS Gen2 using CETAS statements of PolyBase (a sketch of such an export follows at the end of this section).
On the Decision Tree itself you can only see that such an integration is possible; we do not specify the exact mechanism or its limitations. If you click on the icon, you will be redirected to the official Microsoft documentation.
One more important note. We do not show Azure Data Factory on this diagram, since it is a service meant to be used across the entire Azure portfolio, and adding it would make the diagram even messier. So we implicitly assume that Azure Data Factory can be used to integrate most of the services mentioned on the Decision Tree.
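To make the CTAS / CETAS integration mentioned above a bit more concrete, here is a minimal sketch of exporting a Synapse SQL table to ADLS Gen2 with CETAS, driven from Python over `pyodbc`. All names (server, database, the external data source `AdlsExportSource`, the file format `ParquetFileFormat`, the table `dbo.FactSales`) are hypothetical placeholders, and the external data source and file format are assumed to have been created beforehand; treat this as an illustration of the shape of the statement rather than copy-paste-ready code, and verify against the official CETAS documentation.

```python
# Sketch: export a Synapse SQL table to ADLS Gen2 via CETAS (hypothetical names).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:my-synapse-workspace.sql.azuresynapse.net,1433;"
    "Database=mydwh;UID=sqladminuser;PWD=<password>"
)

cetas = """
CREATE EXTERNAL TABLE ext.FactSalesExport
WITH (
    LOCATION = '/export/fact_sales/',   -- folder inside the external data source
    DATA_SOURCE = AdlsExportSource,     -- pre-created EXTERNAL DATA SOURCE pointing to ADLS Gen2
    FILE_FORMAT = ParquetFileFormat     -- pre-created EXTERNAL FILE FORMAT (Parquet)
)
AS
SELECT store_id, sale_date, SUM(amount) AS total_amount
FROM dbo.FactSales
GROUP BY store_id, sale_date;
"""

cursor = conn.cursor()
cursor.execute(cetas)   # writes the result set as Parquet files into ADLS Gen2
conn.commit()
conn.close()
```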
OK, let's take a look at how to apply this in practice. In the next chapter we will cover some examples of using the Decision Tree to craft an architecture and select the appropriate technology for your workload.
# Chapter 3: Mapping Use Cases to the Decision Tree
Why and How to Map Your Use Case to the Decision Tree
As you can see, these Decision Trees can be pretty complex, but at the same time they represent an almost complete set of data technologies. Industrial and technological use cases are still very relevant, especially when combined with the Decision Tree as a frame for discussion.
In that case one can clearly see not only the choices made but also the choices omitted. It can also immediately give you an idea of which alternatives you may consider, and when.
HOW? Just shade out everything that is not needed and add the relevant metrics for the decisions made (for instance, predicted throughput, data size, latency, etc.).
Let's take a closer look at how we can do this, starting with a small example.
## Use Case: Relational OLTP / HTAP
In this example, your business specializes in the retail industry and you're building a retail management system to track the inventory of the products in each store. You also want to add some basic reporting capabilities based on the geospatial data from the stores. To decide which database best fits these requirements, let's take our uber tree and start from the write pattern.
- The data for the orders, users, and products is stored as soon as it arrives, and it gets updated on an individual basis. The throughput of such a system is not high.
- The schema of the entities is expected to be the same, and a normalized model is preferred to make the updates simpler.
- Your store needs to support geospatial data and indexing.
This already narrows down our choices to the RDBMS space. Moving to the read profile:
- The queries will have different levels of complexity: a user might need to get the stock of a specific item in a single store, or even join the data from stores that are located within a certain distance of a specific store.
- The store manager will need a report showing on which days and at what times the most traffic is expected.
- HQ will need to identify the positive or negative factors that affect a zip code's total sales, in order to increase the sales coming from the retail channel.
Since the queries have some geo-related clauses, PostgreSQL could be a good candidate, and since some analysis and visualization is required, Azure SQL would be another option. Going further down, you could discuss the application stack with the development team, and more specifically the programming language. If the app is written in Node.js or Ruby, PostgreSQL will be a great choice; otherwise, with .NET, Azure SQL will be the perfect solution. Other factors to take into consideration would be the amount of data to be stored, how to scale out if the data grows, and the HA SLAs. A sketch of the kind of geospatial query such a system would need to serve follows below.

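As an illustration of the read profile above, here is a minimal sketch of the kind of geospatial query the store system would need to serve, written against Azure Database for PostgreSQL with the PostGIS extension enabled. The schema (`stores`, `inventory`), the connection details, and the parameter values are hypothetical; the point is simply that the chosen engine must support geospatial predicates such as `ST_DWithin` alongside ordinary relational joins.

```python
# Sketch: stock of an item in all stores within 5 km of a given store (hypothetical schema).
import psycopg2

conn = psycopg2.connect(
    host="my-retail-pg.postgres.database.azure.com",
    dbname="retail", user="pgadmin", password="<password>", sslmode="require",
)

query = """
SELECT s.store_id, s.name, i.quantity
FROM stores AS s
JOIN inventory AS i ON i.store_id = s.store_id
WHERE i.product_id = %(product_id)s
  AND ST_DWithin(
        s.location::geography,    -- geospatial column on stores
        (SELECT location::geography FROM stores WHERE store_id = %(origin)s),
        5000                      -- distance in meters
      );
"""

with conn, conn.cursor() as cur:
    cur.execute(query, {"product_id": 42, "origin": 7})
    for store_id, name, quantity in cur.fetchall():
        print(store_id, name, quantity)
conn.close()
```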
## Use Case: Mixing the Write and Read Patterns
The next example of how the Uber Tree can be used as a tool to produce a data architecture comes from the gaming space. Your team is building new features for a massively multiplayer online game, and they need to collect and store all actions of the players, analyze those that matter in near real time, and archive the history. As usual, we will start from the write profile.
• The events are captured and stored and are never updated.
• High throughput is expected, with hundreds of thousands of events per second.
For this specific use case it seems that there is a single path for the writes; Event Hubs answers those requirements. But the way we will process and read the data is not in sequential order. More specifically:
• The data needs to be read in a time-series manner, prioritizing the most recent events and aggregating based on time.
• We need to narrow down the analysis to the metrics that are relevant for a particular game and also enrich the data with data coming from different sources, so, basically, you need control over the schema.
On the read pattern, it looks like Azure Data Explorer would be the most suitable store.
In this case, where two different profiles for the write and the read are identified, we will leverage two solutions that are integrated. Azure Data Explorer natively supports ingestion from Event Hubs. So we can provide a queue interface to the event producer and an analytical interface to the team that will run the analysis on those events. A sketch of the producer side of this pipeline follows below.

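To make the write side of this pipeline concrete, here is a minimal sketch of a game service publishing player events to Event Hubs with the `azure-eventhub` Python SDK; the connection string, hub name, and event payload are hypothetical placeholders. On the read side, once Azure Data Explorer is configured to ingest from that hub, the analysis team would query the events in KQL (an assumed sample query is included as a comment).

```python
# Sketch: publish player-action events to Event Hubs (hypothetical hub and payload).
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="player-actions",
)

events = [
    {"player_id": "p-001", "game": "space-arena", "action": "loot_pickup", "ts": "2021-06-01T12:00:00Z"},
    {"player_id": "p-002", "game": "space-arena", "action": "level_up", "ts": "2021-06-01T12:00:01Z"},
]

with producer:
    batch = producer.create_batch()
    for event in events:
        batch.add(EventData(json.dumps(event)))   # events are appended, never updated
    producer.send_batch(batch)

# Once Azure Data Explorer ingests from the hub, a time-series style KQL query might look like:
#   PlayerActions
#   | where ts > ago(15m) and game == "space-arena"
#   | summarize actions = count() by action, bin(ts, 1m)
```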
## Use Case: Analytics
In this example, your business specializes in the energy industry and you're building an analytics platform for power plant operation and asset management. It would include all the necessary pieces, from condition monitoring and performance management to maintenance planning, risk management, and market scheduling. To decide what the best approach is for these requirements, let's start with the write patterns of our uber tree.
- Since operating a power plant generates a large amount of varying data, the platform must be able to process batch data coming in huge volumes.
- 70% of the data is structured.
- The data coming from meters needs to be processed in near real time and involves complex processing before it can be unified with the data from the performance, risk, and finance systems.
This already narrows down our choices to Azure Synapse and Azure Databricks, in combination with Azure Storage & ADLS Gen2 with PolyBase.
Moving to the read profile:
- The queries will have different levels of complexity.
- The platform takes the data and information from the source systems and merges them to create a unified view, making it possible to monitor the performance of the plants / assets through an executive dashboard.
- Machine learning is used to provide decision support.
Since the requirement is to unify huge volumes of data across different source systems where 70% of the data is structured, PolyBase would be the right choice to land the data in Azure Synapse storage and perform the transformations using Synapse SQL to create the dimensional model for historical analysis and dashboarding. There is also the 30% of unstructured data that needs to be processed before merging it into the dimensional model, where an optimized Spark engine like Databricks is a perfect fit for purpose and can also be extended to the ML use cases for decision support. Other factors to take into consideration would be the amount of data to be stored, how to scale out if the data grows, and the HA SLAs. A sketch of the landing-and-transform step follows below.

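To make the "land with PolyBase, transform with Synapse SQL" step a bit more tangible, here is a minimal sketch that builds a dimensional-model table from an external table over the raw meter files in ADLS Gen2, again driven from Python. The external table `ext.MeterReadingsRaw` (assumed to have been defined over the ADLS location with CREATE EXTERNAL TABLE) and all other names are hypothetical; the unstructured 30% would be handled separately in Databricks.

```python
# Sketch: CTAS from an external (PolyBase) table into a distributed Synapse table (hypothetical names).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:energy-synapse.sql.azuresynapse.net,1433;"
    "Database=energydwh;UID=sqladminuser;PWD=<password>"
)

ctas = """
CREATE TABLE dbo.FactMeterReadings
WITH (
    DISTRIBUTION = HASH(plant_id),   -- spread rows across distributions by plant
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT plant_id, meter_id, reading_ts, CAST(value AS FLOAT) AS reading_value
FROM ext.MeterReadingsRaw            -- external table over the ADLS Gen2 landing zone
WHERE reading_ts >= '2021-01-01';
"""

cursor = conn.cursor()
cursor.execute(ctas)
conn.commit()
conn.close()
```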
## Use Case: HTAP
In this example, your business specializes in the healthcare industry and you're building a platform for patient outreach and engagement. You are trying to build an advanced analytics solution that aims to take chronically unwell patients who make high use of emergency department / unplanned inpatient services and, through a more coordinated provision of ambulatory services, keep them well at home. To decide what the best approach is for these requirements, let's start with the write patterns of our uber tree.
- The write pattern itself is largely event-driven and completely serverless, aggregating messages from close to 200 data sources across the organization.
- There is also a large number of different EHRs and other sources of data (radiology, cardiology, surgery, lab systems, etc.), as well as millions of transactions per day.
- Laws require data for each patient to be kept for at least 7 years (28 years for newborns).
This already narrows down our choices to Azure Cosmos DB for capturing the patient data across the different systems. To decide how to build the analytics solution on the data captured in Cosmos DB, we now look at the read profile.
- Real-time data must be available as soon as the data is input, updated, or calculated within the Cosmos DB database.
- Complex analytical queries must report results within 900 seconds.
Since the requirement is to provide the ability to do advanced analytics on the data captured in Cosmos DB in near real time, a classic ETL approach cannot be leveraged here. Synapse Link in Azure Synapse Analytics, or Databricks, could be considered as the possible options. If the usage pattern is ad hoc or intermittent, you may achieve considerable savings by using a Synapse Link solution compared to a cluster-based solution. This is because serverless SQL (SQL on-demand) is charged per data processed and not per time a cluster is up and running, so you would not be paying for times when a cluster is idle or over-provisioned. A sketch of such a query over the analytical store follows below.

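To show what the near-real-time read side could look like, here is a minimal sketch that queries the Cosmos DB analytical store through Synapse Link from the serverless SQL pool, driven from Python. The workspace endpoint, database, container, account key, and document properties are hypothetical placeholders, and the container is assumed to have the analytical store enabled; check the Synapse Link documentation for the exact OPENROWSET options before relying on this.

```python
# Sketch: query the Cosmos DB analytical store via the Synapse serverless SQL pool (hypothetical names).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:health-synapse-ondemand.sql.azuresynapse.net,1433;"
    "Database=master;UID=sqladminuser;PWD=<password>"
)

query = """
SELECT TOP 10 c.patient_id, c.event_type, c.event_ts
FROM OPENROWSET(
         'CosmosDB',
         'Account=health-cosmos;Database=engagement;Key=<account-key>',
         PatientEvents                -- container with the analytical store enabled
     ) AS c
ORDER BY c.event_ts DESC;
"""

cursor = conn.cursor()
for row in cursor.execute(query):
    print(row.patient_id, row.event_type, row.event_ts)
conn.close()
```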
# Chapter 4: Getting Access and Providing Feedback
Here we are. Thank you for staying with us up to this point.
This will be the shortest chapter – we promise 😉
You can find the interactive Decision Tree on GitHub Pages by following this link: http://albero.cloud/
All the materials can be found in the public GitHub repository here: https://github.com/albero-azure/albero
You can provide your feedback, submit questions, and propose materials via the Issues of the GitHub repository.
Thank you and have a very pleasant day!
BTW, we just tested it from a smartphone and it also looks pretty nice 😉

-------------------------------------------------------------------------------- /faq.md: --------------------------------------------------------------------------------
# What is the goal of the project?
The goal of the project is to provide a common approach to selecting data technologies on Azure and (possibly) beyond. Selecting a data technology can be a challenging task due to the vast variety of available technologies and the super-specialization of some tools. We are helping to reduce the complexity of the choice and to simplify the search for and representation of data services and tools.

# Who is working on the project?
This Decision Tree is currently a project of an initiative group of people across the globe. We have a core group with equal rights to the repository and content; these are the admins of https://github.com/albero-azure. We also have a group of contributors from within and outside of Microsoft.

# Where are these recommendations coming from?
The core of the recommendations is the joint expertise of tech professionals across the globe. The vast majority of the recommendations come from experience with real projects (successful and unsuccessful). We are constantly reviewing these recommendations and adjusting them based on feedback.

# Where can I post my feedback?
The easiest way to provide feedback is to raise an issue in the repository: https://github.com/albero-azure/albero/issues -------------------------------------------------------------------------------- /publicdatasets.md: --------------------------------------------------------------------------------
|**Dataset Name**|**Purpose**|**Source**|**Format / Size**|**Application**|**License**|**Initiative**|
| :-: | :-: | :-: | :-: | :-: | :-: | :-: |
|[NOAA Integrated Surface Data (ISD)](https://azure.microsoft.com/services/open-datasets/catalog/noaa-integrated-surface-data/)|Worldwide hourly weather data from NOAA with the best spatial coverage in North America, Europe, Australia, and parts of Asia. Updated daily.|National Oceanic and Atmospheric Administration (NOAA)|Preprocessed Structured Data available via Azure Notebooks and Azure Databricks|**Natural Science.** Data cleansing and preparation|This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.|Azure Open Datasets|
|[NOAA Global Forecast System (GFS)](https://azure.microsoft.com/services/open-datasets/catalog/noaa-global-forecast-system/)|15-day U.S. hourly weather forecast data from NOAA. Updated daily.|National Oceanic and Atmospheric Administration (NOAA)|Preprocessed Structured Data available via Azure Notebooks and Azure Databricks|**Natural Science.** Data cleansing and preparation|This dataset is provided under the original terms that Microsoft received source data. 
The dataset may include data sourced from Microsoft.|Azure Open Datasets| 5 | |[Public Holidays](https://azure.microsoft.com/services/open-datasets/catalog/public-holidays/)|Worldwide public holiday data, covering 41 countries or regions from 1970 to 2099. Includes country and whether most people have paid time off.|Microsoft|Preprocessed Structured Data available via Azure Notebooks and Azure Databricks|**Economic Science.** Data cleansing and preparation|This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.|Azure Open Datasets| 6 | |[TartanAir: AirSim Simulation Dataset](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-tartanair-simulation)|AirSim Autonomous vehicle data generated to solve Simultaneous Localization and Mapping (SLAM).|AirLab: Carnegie Mellon University|PNG, NPY, TXT|**Robots and autonomous vehicles.** Visual Simultaneous Localization and Mapping (V-SLAM). |This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.|Azure Open Datasets| 7 | |[NYC Taxi & Limousine Commission - yellow taxi trip records](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow)|The yellow taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.|NYC Taxi and Limousine Commission (TLC)|Parquet format. There are about 1.5B rows (50 GB) in total as of 2018.|**Transportation.** Basic time-series data processing and analysis.|This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.|Azure Open Datasets| 8 | |[NYC Taxi & Limousine Commission - green taxi trip records](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-green)|The green taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.|NYC Taxi and Limousine Commission (TLC)|This dataset is stored in Parquet format. There are about 80M rows (2 GB) in total as of 2018.|**Transportation.** Basic time-series data processing and analysis.|This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.|Azure Open Datasets| 9 | |[NYC Taxi & Limousine Commission - For-Hire Vehicle (FHV) trip records](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-for-hire-vehicle)|The For-Hire Vehicle trip records include the dispatching base license number and the pick-up date, time, and taxi zone location ID.|NYC Taxi and Limousine Commission (TLC)|This dataset is stored in Parquet format. There are about 500M rows (5 GB) as of 2018.|**Transportation.** Basic time-series data processing and analysis.|This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.|Azure Open Datasets| 10 | |[Bing COVID-19 Data](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-bing-covid-19)|COVID-19 Data Lake collection is a collection of COVID-19 related datasets from various sources, covering testing and patient outcome tracking data, social distancing policy, hospital capacity, mobility, etc.|Microsoft Bing|All datasets are updated daily. 
As of May 11, 2020 they contained 125,576 rows (CSV 16.1 MB, JSON 40.0 MB, JSONL 39.6 MB, Parquet 1.1 MB).|**Life Science.** Basic aggregated data processing and analysis.|[Bing COVID-19 license](https://github.com/microsoft/Bing-COVID-19-Data/blob/master/LICENSE.txt)|Azure Open Datasets| 11 | |[COVID Tracking Project](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-covid-tracking)|The COVID Tracking Project dataset provides the latest numbers on tests, confirmed cases, hospitalizations, and patient outcomes from every US state and territory.|COVID Tracking Project at the Atlantic|All datasets are updated daily. As of May 13, 2020 they contained 4,100 rows (CSV 574 KB, JSON 1.8 MB, JSONL 1.8 MB, Parquet 334 KB).|**Life Science.** Basic aggregated data processing and analysis.|[Apache License 2.0](https://github.com/COVID19Tracking/covid-tracking-data/blob/master/LICENSE)|Azure Open Datasets| 12 | |[European Centre for Disease Prevention and Control (ECDC) Covid-19 Cases](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-ecdc-covid-cases)|The latest available public data on geographic distribution of COVID-19 cases worldwide from the European Center for Disease Prevention and Control (ECDC). Each row/entry contains the number of new cases reported per day and per country or region.|European Center for Disease Prevention and Control (ECDC)|As of May 28, 2020 they contained 19,876 rows (CSV 1.5 MB, JSON 4.9 MB, JSONL 4.9 MB, Parquet 54.1 KB).|**Life Science.** Basic aggregated data processing and analysis.|This data is made available and may be used as permitted under the ECDC copyright policy here. For any documents where the copyright lies with a third party, permission for reproduction must be obtained from the copyright holder.|Azure Open Datasets| 13 | |[Oxford COVID-19 Government Response Tracker](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-oxford-covid-government-response-tracker)|The Oxford Covid-19 Government Response Tracker (OxCGRT) dataset contains systematic information on which governments have taken which measures, and when.|Oxford Covid-19 Government Response Tracker (OxCGRT)|As of June 8, 2020 they contained 27,919 rows (CSV 4.9 MB, JSON 20.9 MB, JSONL 20.8 MB, Parquet 133.0 KB).|**Life Science.** Basic aggregated data processing and analysis.|This data is licensed under the [Creative Commons Attribution 4.0 International License](https://github.com/OxCGRT/covid-policy-tracker/blob/master/LICENSE.txt).|Azure Open Datasets| 14 | |[COVID-19 Open Research Dataset](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-covid-19-open-research)|A full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized for machine readability and made available for use by the global research community.|Allen Institute of AI and [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research)|This dataset is stored in JSON format and the latest release contains over 36,000 full text articles. Each paper is represented as a single JSON object.|**Life Science.** Full-text analysis and metadata extraction.|CORD-19 [Dataset License](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/COVID.DATA.LIC.AGMT.pdf)|Azure Open Datasets| 15 | |[Illumina Platinum Genomes](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-illumina-platinum-genomes)|Illumina has generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree. 
Illumina has called variants in each genome using a range of currently available algorithms.|` `[Illumina](https://www.illumina.com/platinumgenomes.html)|This dataset contains approximately 2 GB of data and is updated daily.|**Genomics.**|Data is available without restrictions. For more information and citation details, see the [official Illumina site](https://www.illumina.com/platinumgenomes.html).|Azure Open Datasets| 16 | |[Human Reference Genomes](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-human-reference-genomes)|This dataset includes two human-genome references assembled by the [Genome Reference Consortium](https://www.ncbi.nlm.nih.gov/grc): Hg19 and Hg38.|[Genome Reference Consortium](https://www.ncbi.nlm.nih.gov/grc)|This dataset contains approximately 10 GB of data and is updated daily.|**Genomics.**|Data is available without restrictions. For more information and citation details, see the [NCBI Reference Sequence Database site](https://www.ncbi.nlm.nih.gov/refseq/).|Azure Open Datasets| 17 | |[ClinVar Annotations](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-clinvar-annotations)|[ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. It facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation.|National Center for Biotechnology Information|This dataset contains approximately 56 GB of data and is updated daily.|**Genomics.**|Data is available without restrictions. More information and citation details, see [Accessing and using data in ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/).|Azure Open Datasets| 18 | |[SnpEff](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-snpeff)|Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).|National Center for Biotechnology Information|This dataset contains approximately 2 TB of data and is updated monthly.|**Genomics.**|Data is available without restrictions. More information and citation details, see [Accessing and using data in ClinVar](https://pcingola.github.io/SnpEff/se_introduction/).|Azure Open Datasets| 19 | |[gnomAD](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-gnomad)|The [Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org/) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects.|[Broadinstitute](https://gnomad.broadinstitute.org/about)|This dataset contains approximately 30 TB of data and is updated with each gnomAD release.|**Genomics.**|Data is available without restrictions. For more information and citation details, see the [gnomAD about page](https://gnomad.broadinstitute.org/about).|Azure Open Datasets| 20 | |[1000 Genomes](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-1000-genomes)|The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalog of human variation and genotype data. The final data set contains data for 2,504 individuals from 26 populations and 84 million identified variants. 
For more information, see the 1000 Genome Project website and the following publications:|[International Genome](https://www.internationalgenome.org/home)|This dataset contains approximately 815 TB of data and is updated daily.|**Genomics.**|Following the final publications, data from the 1000 Genomes Project is publicly available without embargo to anyone for use under the terms provided by the dataset [source](http://www.internationalgenome.org/data). |Azure Open Datasets| 21 | |[OpenCravat](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-open-cravat)|OpenCRAVAT is a python package that performs genomic variant interpretation including variant impact, annotation, and scoring. OpenCRAVAT has a modular architecture with a wide variety of analysis modules and annotation resources that can be selected and installed/run based on the needs of a given study.|[OpenCravat](https://opencravat.org/)|This dataset includes 500 GB of data, and is updated daily.|**Genomics.**|OpenCRAVAT is available with a GPLv3 license. Most data sources are free for non-commercial use. For commercial use, consult the institutional contacts for each data source.|Azure Open Datasets| 22 | |[ENCODE](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-encode)|ENCODE's goal is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.|National Human Genome Research Institute |This dataset includes approximately 756 TB of data, and is updated monthly during the first week of every month.|**Genomics.**|External data users may freely download, analyze, and publish results based on any ENCODE data without restrictions, regardless of type or size, and includes no grace period for ENCODE data producers, either as individual members or as part of the Consortium. Researchers using unpublished ENCODE data are encouraged to contact the data producers to discuss possible publications. |Azure Open Datasets| 23 | |[GATK Resource Bundle](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-gatk-resource-bundle)|The [GATK resource bundle](https://gatk.broadinstitute.org/hc/articles/360035890811-Resource-bundle) is a collection of standard files for working with human resequencing data with the GATK.|[Broadinstitute](https://gnomad.broadinstitute.org/about)|Over 6 TB. Datasets are updated monthly during the first week of every month.|**Genomics.**|Visit the [GATK resource bundle official site](https://gatk.broadinstitute.org/hc/articles/360035890811-Resource-bundle)|Azure Open Datasets| 24 | |[US Labor Force Statistics](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-labor-force)|US Labor Force Statistics provides Labor Force Statistics, labor force participation rates, and the civilian noninstitutional population by age, gender, race, and ethnic groups. in the United States.|[US Bureau of Labor Statistics (BLS)](https://www.bls.gov/)|CSV over 6 MB.|
**Economics.**
Basic aggregated data processing and analysis.
|[Linking and Copyright Information](https://www.bls.gov/bls/linksite.htm)|Azure Open Datasets| 25 | |[US National Employment Hours and Earnings](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-national-employment-earnings)|The Current Employment Statistics (CES) program produces detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States.|[US Bureau of Labor Statistics (BLS)](https://www.bls.gov/)||**Economics.**
Basic aggregated data processing and analysis.
|[Linking and Copyright Information](https://www.bls.gov/bls/linksite.htm)|Azure Open Datasets| 26 | |[US Local Area Unemployment Statistics](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-local-unemployment)|The US Local Area Unemployment Statistics datasets provides monthly and annual employment, unemployment, and labor force data for Census regions and divisions, States, counties, metropolitan areas, and many cities in the United States.|[US Bureau of Labor Statistics (BLS)](https://www.bls.gov/)||**Economics.**
Basic aggregated data processing and analysis.
|[Linking and Copyright Information](https://www.bls.gov/bls/linksite.htm)|Azure Open Datasets| 27 | |[US Consumer Price Index](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-consumer-price-index)|The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services.|[US Bureau of Labor Statistics (BLS)](https://www.bls.gov/)||**Economics.**
Basic aggregated data processing and analysis.
|[Linking and Copyright Information](https://www.bls.gov/bls/linksite.htm)|Azure Open Datasets| 28 | |[US Producer Price Index - Industry](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-producer-price-index-industry)|The Producer Price Index (PPI) is a measure of average change over time in the selling prices received by domestic producers for their output.|[US Bureau of Labor Statistics (BLS)](https://www.bls.gov/)||**Economics.**
Basic aggregated data processing and analysis.
|[Linking and Copyright Information](https://www.bls.gov/bls/linksite.htm)|Azure Open Datasets| 29 | |[US Producer Price Index - Commodities](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-producer-price-index-commodities)|The Producer Price Index (PPI) is a measure of average change over time in the selling prices received by domestic producers for their commodities.|[US Bureau of Labor Statistics (BLS)](https://www.bls.gov/)||**Economics.**<br>
Basic aggregated data processing and analysis.
|[Linking and Copyright Information](https://www.bls.gov/bls/linksite.htm)|Azure Open Datasets| 30 | |[US Population by County](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-population-county)|US population by gender and race for each US county sourced from 2000 and 2010 Decennial Census. This dataset is sourced from the United States Census Bureau.|United States Census Bureau|This dataset is stored in Parquet format and has data for the year 2000 and 2010.|**Demographics.**
Basic aggregated data processing and analysis.
|[Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html)|Azure Open Datasets| 31 | |[US Population by ZIP Code](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-us-population-zip)|US population by gender and race for each US ZIP code sourced from 2010 Decennial Census. This dataset is sourced from the United States Census Bureau.|United States Census Bureau|This dataset is stored in Parquet format and has data for the year 2010.|**Demographics.**
Basic aggregated data processing and analysis.
|[Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html)|Azure Open Datasets| 32 | |[Boston Safety Data](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-boston-safety)|Read data about 311 calls reported to the city of Boston. This dataset is stored in Parquet format and is updated daily.|City of Boston Government|This dataset is stored in Parquet format. It is updated daily and contains about 100-K rows (10 MB) in total as of 2019.|**Demographics.**
Basic aggregated data processing and analysis.
|[Open Data Commons Public Domain Dedication and License (ODC PDDL)](http://opendefinition.org/licenses/odc-pddl/)|Azure Open Datasets| 33 | |[Chicago Safety Data](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-chicago-safety)|Read data about 311 calls reported to the city of Chicago. This dataset is stored in Parquet format and is updated daily.|City of Chicago Government|This dataset is stored in Parquet format. It is updated daily, and contains about 1M rows (80 MB) in total as of 2018.|**Demographics.**
Basic aggregated data processing and analysis.
|[Open Data Commons Public Domain Dedication and License (ODC PDDL)](http://opendefinition.org/licenses/odc-pddl/)|Azure Open Datasets| 34 | |[New York City Safety Data](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-new-york-city-safety)|This dataset contains all New York City 311 service requests from 2010 to the present. It’s stored in Parquet format and updated daily.|City of New York City Government|This dataset is stored in Parquet format. It is updated daily, and contains about 12M rows (500 MB) in total as of 2019.|**Demographics.**
Basic aggregated data processing and analysis.
|[Open Data Commons Public Domain Dedication and License (ODC PDDL)](http://opendefinition.org/licenses/odc-pddl/)|Azure Open Datasets| 35 | |[San Francisco Safety Data](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-san-francisco-safety)|Fire department calls for service and 311 cases in San Francisco. This dataset contains historical records accumulated from 2015 to the present.|San Francisco City Government|This dataset is stored in Parquet format. It is updated daily with about 6M rows (400 MB) as of 2019.|**Demographics.**
Basic aggregated data processing and analysis.
|[Open Data Commons Public Domain Dedication and License (ODC PDDL)](http://opendefinition.org/licenses/odc-pddl/)|Azure Open Datasets| 36 | |[Seattle Safety Data](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-seattle-safety)|Seattle Fire Department 911 dispatches. This dataset is updated daily, and contains historical records accumulated from 2010 to the present.|Seattle City Government|This dataset is stored in Parquet format. It's updated daily, and contains about 800,000 rows (20 MB) in 2019.|**Demographics.**
Basic aggregated data processing and analysis.
|[Open Data Commons Public Domain Dedication and License (ODC PDDL)](http://opendefinition.org/licenses/odc-pddl/)|Azure Open Datasets| 37 | |[Diabetes](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-diabetes)|The Diabetes dataset has 442 samples with 10 features, making it ideal for getting started with machine learning algorithms.|Stanford University|Delimited Text (18 KB)|**Life Science.** Basic machine learning applications.|[Citation required](https://scikit-learn.org/stable/about.html#citing-scikit-learn)|Azure Open Datasets| 38 | |[OJ Sales Simulated Data](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-oj-sales-simulated)|This dataset is derived from the Dominick’s OJ dataset and includes extra simulated data with the goal of providing a dataset that makes it easy to simultaneously train thousands of models on Azure Machine Learning.|[Emmanuele Taufer](http://www.cs.unitn.it/~taufer/QMMA/L10-OJ-Data.html#\(1\))|CSV (4.6 MB)|**Economics.** Basic analysis of sales data.||Azure Open Datasets| 39 | |[MNIST database of handwritten digits](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-mnist)|The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.|[National Institute of Standards and Technology](https://www.nist.gov/)|Images|**Machine Learning.** Dataset suitable to train basic image recognition models.||Azure Open Datasets| 40 | |[Microsoft News recommendation dataset](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-microsoft-news)|Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It serves as a benchmark dataset for news recommendation, and facilitates research in news recommendation and recommender systems.|Microsoft|Delimited Text, Graph. MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.|**Machine Learning.** Serves as a benchmark for news recommendation engines.|[Microsoft Research License Terms](https://github.com/msnews/MIND/blob/master/MSR%20License_Data.pdf)|Azure Open Datasets| 41 | |[Hippocorpus](https://msropendata.com/datasets/0a83fb6f-a759-4a17-aaa2-fbac84577318)|6,854 English diary-like short stories about recalled and imagined events. Using a crowdsourcing framework, we first collect recalled stories and summaries from workers, then provide these summaries to other workers who write imagined stories.|Microsoft|CSV, TXT (12.33 MB)|**Machine Learning.** Text recognition training and text analysis. Sentiment Analysis.|CDLA Permissive 2.0|Microsoft Research Open Data| 42 | |[Indoor Location Dataset](https://msropendata.com/datasets/7bfdeb9f-ab53-40fe-97f5-457df6143f79)|The dataset was released along with Microsoft Indoor Location Competition 2.0. It consists of dense indoor signatures of WiFi, geomagnetic field, iBeacons etc., as well as ground truth (waypoint) locations collected by Android smartphones from hundreds of buildings in Chinese cities. 
We hope the dataset will be of great value to research and development of indoor space including localization and navigation.|Microsoft|CSV, JSON, PNG, TXT (28,855 files / 56.51 GB)|**Indoor Navigation.**|[Computational Use of Data Agreement v1.0](https://msropendata-web-api.azurewebsites.net/licenses/a889b26e-5149-4486-866e-ec896bb728c4/view)|Microsoft Research Open Data| 43 | |[Public Perception of Artificial Intelligence](https://msropendata.com/datasets/feed8996-d9d4-47b9-9938-44ed298628fc)|Analyses of text corpora over time can reveal trends in beliefs, interest, and sentiment about a topic. We focus on views expressed about artificial intelligence (AI) in the New York Times over a 30-year period.|Microsoft|CSV (11.10 MB)|**Machine Learning.** Text recognition training and text analysis. Sentiment Analysis.|CDLA Permissive 2.0|Microsoft Research Open Data| 44 | |[Dual Word Embeddings Trained on Bing Queries](https://msropendata.com/datasets/30a504b0-cff2-4d4a-864f-3bc9a66f9d7e)|This data is being released for research purposes only. The DESM Word Embeddings dataset may include terms that some may consider offensive, indecent or otherwise objectionable. Microsoft has not reviewed or modified the content of the dataset. Microsoft is providing this dataset as a convenience and is not responsible or liable for any inappropriate content resulting from your use of the dataset. Use of the dataset is at your own risk and discretion.|Microsoft|TXT (10.38 GB)|**Machine Learning.** Text recognition training and text analysis. Sentiment Analysis.|CDLA Permissive 2.0|Microsoft Research Open Data| 45 | |[GPS Trajectory](https://msropendata.com/datasets/d19b353b-7483-4db7-a828-b130f6d1f035)|This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). A GPS trajectory of this dataset is represented by a sequence of time-stamped points, each of which contains the information of latitude, longitude and altitude. This dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours.|Microsoft|PDF, PLT, TXT (18740 files / 1.67 GB)|**Machine Learning.** Path and trajectory analysis.|CDLA Permissive 2.0|Microsoft Research Open Data| 46 | 47 | --------------------------------------------------------------------------------