├── .gitignore ├── LICENSE ├── README.md ├── open-data-portal-api.R ├── open-data-portal-api.Rproj ├── open-data-portal-api.ipynb └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.csv 2 | .ipynb_checkpoints/ 3 | .Rproj.user 4 | .Rhistory 5 | .RData 6 | .Ruserdata -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2021 NHS Business Services Authority 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Open Data API 2 | A collection of examples of how to query the NHSBSA Open Data Portal API. 3 | 4 | ## Usage 5 | For R users, please use `open-data-portal-api.R` and follow the instructions. 
6 | 
7 | For Python users, please use `open-data-portal-api.ipynb` and install the
8 | required packages using:
9 | ```
10 | pip install -r requirements.txt
11 | ```
12 | If required, please export the notebook file to `.py`.
13 | 
--------------------------------------------------------------------------------
/open-data-portal-api.R:
--------------------------------------------------------------------------------
1 | # 1. Script details ------------------------------------------------------------
2 | 
3 | # Name of script: OpenDataAPIQuery
4 | # Description: Using R to query the NHSBSA open data portal API.
5 | # Created by: Matthew Wilson (NHSBSA)
6 | # Created on: 26-03-2020
7 | # Latest update by: Adam Ivison (NHSBSA)
8 | # Latest update on: 24-06-2021
9 | # Update notes: Updated endpoint in the script, refactored code and added async
10 | 
11 | # R version: created in 3.5.3
12 | 
13 | # 2. Load packages -------------------------------------------------------------
14 | 
15 | # List packages we will use
16 | packages <- c(
17 |   "jsonlite", # 1.6
18 |   "dplyr",    # 0.8.3
19 |   "crul"      # 1.1.0
20 | )
21 | 
22 | # Install packages if they aren't already
23 | if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
24 |   install.packages(setdiff(packages, rownames(installed.packages())))
25 | }
26 | 
27 | # 3. Define variables ----------------------------------------------------------
28 | 
29 | # Define the url for the API call
30 | base_endpoint <- "https://opendata.nhsbsa.net/api/3/action/"
31 | package_list_method <- "package_list" # List of data-sets in the portal
32 | package_show_method <- "package_show?id=" # List all resources of a data-set
33 | action_method <- "datastore_search_sql?" # SQL action method
34 | 
35 | # Send API call to get list of data-sets
36 | datasets_response <- jsonlite::fromJSON(paste0(
37 |   base_endpoint,
38 |   package_list_method
39 | ))
40 | 
41 | # Now let's have a look at the data-sets currently available
42 | datasets_response$result
43 | 
44 | # For this example we're interested in the English Prescribing Dataset (EPD).
45 | # We know the name of this data-set, so we can set it manually, or access it
46 | # from datasets_response.
47 | dataset_id <- "english-prescribing-data-epd"
48 | 
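# As a hedged aside (not part of the original workflow): rather than typing the
# id, you could pick it out of datasets_response programmatically. This sketch
# assumes the EPD id is the only dataset name containing "prescribing".
grep("prescribing", datasets_response$result, value = TRUE)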
49 | # 4. API calls for single month ------------------------------------------------
50 | 
51 | # Define the parameters for the SQL query
52 | resource_name <- "EPD_202001" # For EPD, resources are named EPD_YYYYMM
53 | pco_code <- "13T00" # Newcastle Gateshead CCG
54 | bnf_chemical_substance <- "0407010H0" # Paracetamol
55 | 
56 | # Build SQL query (WHERE criteria should be enclosed in single quotes)
57 | single_month_query <- paste0(
58 |   "
59 |   SELECT
60 |     *
61 |   FROM `",
62 |   resource_name, "`
63 |   WHERE
64 |     1=1
65 |     AND pco_code = '", pco_code, "'
66 |     AND bnf_chemical_substance = '", bnf_chemical_substance, "'
67 |   "
68 | )
69 | 
70 | # Build API call
71 | single_month_api_call <- paste0(
72 |   base_endpoint,
73 |   action_method,
74 |   "resource_id=",
75 |   resource_name,
76 |   "&",
77 |   "sql=",
78 |   URLencode(single_month_query) # Encode spaces in the url
79 | )
80 | 
81 | # Grab the response JSON as a list
82 | single_month_response <- jsonlite::fromJSON(single_month_api_call)
83 | 
84 | # Extract records in the response to a dataframe
85 | single_month_df <- single_month_response$result$result$records
86 | 
87 | # Let's have a quick look at the data
88 | str(single_month_df)
89 | head(single_month_df)
90 | 
91 | # You can use any of the fields listed in the data-set within the SQL query,
92 | # either in the SELECT or in the WHERE clause, in order to filter.
93 | 
94 | # Information on the fields present in a data-set and an accompanying data
95 | # dictionary can be found on the page for the relevant data-set on the Open Data
96 | # Portal.
97 | 
98 | # 5. API calls for data for multiple months ------------------------------------
99 | 
100 | # Now that you have extracted data for a single month, you may want to get the
101 | # data for several months, or a whole year.
102 | 
103 | # Firstly we need to get a list of all of the names and resource IDs for every
104 | # EPD file. We therefore extract the metadata for the EPD dataset.
105 | metadata_response <- jsonlite::fromJSON(paste0(
106 |   base_endpoint,
107 |   package_show_method,
108 |   dataset_id
109 | ))
110 | 
111 | # Resource names and IDs are kept within the resources table returned from the
112 | # package_show_method call.
113 | resources_table <- metadata_response$result$resources
114 | 
115 | # We only want data for one calendar year, so we need to look at the name of
116 | # each resource to identify the year. For this example we're looking at
117 | # 2020.
118 | resource_name_list <- resources_table$name[grepl("2020", resources_table$name)]
119 | 
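# As an aside, because EPD resources follow the EPD_YYYYMM naming convention,
# you could also build the year's resource names directly rather than filtering
# the metadata (a sketch that assumes the convention holds for every month):
# resource_name_list <- paste0(
#   "EPD_",
#   format(seq(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month"), "%Y%m")
# )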
120 | # 5.1. For loop ----------------------------------------------------------------
121 | 
122 | # We can do this with a for loop that makes all of the individual API calls for
123 | # you and combines the data into one dataframe.
124 | 
125 | # Initialise dataframe that data will be saved to
126 | for_loop_df <- data.frame()
127 | 
128 | # Because each individual month of EPD data is so large, it is unlikely that
129 | # your local system will have enough RAM to hold a full year's worth of data in
130 | # memory. Therefore we will only look at a single CCG and chemical substance,
131 | # as we did previously.
132 | 
133 | # Loop through resource_name_list and make call to API to extract data, then
134 | # bind each month together to make a single data-set
135 | for(month in resource_name_list) {
136 | 
137 |   # Build temporary SQL query
138 |   tmp_query <- paste0(
139 |     "
140 |     SELECT
141 |       *
142 |     FROM `",
143 |     month, "`
144 |     WHERE
145 |       1=1
146 |       AND pco_code = '", pco_code, "'
147 |       AND bnf_chemical_substance = '", bnf_chemical_substance, "'
148 |     "
149 |   )
150 | 
151 |   # Build temporary API call
152 |   tmp_api_call <- paste0(
153 |     base_endpoint,
154 |     action_method,
155 |     "resource_id=",
156 |     month,
157 |     "&",
158 |     "sql=",
159 |     URLencode(tmp_query) # Encode spaces in the url
160 |   )
161 | 
162 |   # Grab the response JSON as a temporary list
163 |   tmp_response <- jsonlite::fromJSON(tmp_api_call)
164 | 
165 |   # Extract records in the response to a temporary dataframe
166 |   tmp_df <- tmp_response$result$result$records
167 | 
168 |   # Bind the temporary data to the main dataframe
169 |   for_loop_df <- dplyr::bind_rows(for_loop_df, tmp_df)
170 | }
171 | 
172 | # 5.2. Async --------------------------------------------------------------------
173 | 
174 | # We can call the API asynchronously by vectorising our approach, which gives
175 | # an approximately 10x speed increase over the for loop when querying many resources.
176 | 
177 | # Construct the SQL query as a function
178 | async_query <- function(resource_name) {
179 |   paste0(
180 |     "
181 |     SELECT
182 |       *
183 |     FROM `",
184 |     resource_name, "`
185 |     WHERE
186 |       1=1
187 |       AND pco_code = '", pco_code, "'
188 |       AND bnf_chemical_substance = '", bnf_chemical_substance, "'
189 |     "
190 |   )
191 | }
192 | 
193 | # Create the API calls
194 | async_api_calls <- lapply(
195 |   X = resource_name_list,
196 |   FUN = function(x)
197 |     paste0(
198 |       base_endpoint,
199 |       action_method,
200 |       "resource_id=",
201 |       x,
202 |       "&",
203 |       "sql=",
204 |       URLencode(async_query(x)) # Encode spaces in the url
205 |     )
206 | )
207 | 
208 | # Use crul::Async to get the results
209 | dd <- crul::Async$new(urls = async_api_calls)
210 | res <- dd$get()
211 | 
212 | # Check that everything is a success
213 | all(vapply(res, function(z) z$success(), logical(1)))
214 | 
215 | # Parse the output into a list of dataframes
216 | async_dfs <- lapply(
217 |   X = res,
218 |   FUN = function(x) {
219 | 
220 |     # Parse the response
221 |     tmp_response <- x$parse("UTF-8")
222 | 
223 |     # Extract and return the records
224 |     jsonlite::fromJSON(tmp_response)$result$result$records
225 |   }
226 | )
227 | 
228 | # Concatenate the results
229 | async_df <- do.call(dplyr::bind_rows, async_dfs)
230 | 
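# Optional sanity check: the async approach should return the same records as
# the for loop output, so the row counts ought to match.
nrow(for_loop_df) == nrow(async_df)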
231 | # 6. Export the data -----------------------------------------------------------
232 | 
233 | # Use write.csv for ease
234 | write.csv(single_month_df, "single_month.csv")
235 | write.csv(for_loop_df, "for_loop.csv")
236 | write.csv(async_df, "async.csv")
--------------------------------------------------------------------------------
/open-data-portal-api.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 | 
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 | 
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
--------------------------------------------------------------------------------
/open-data-portal-api.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 1. Script details"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Name of script: OpenDataAPIQuery
\n", 15 | "Description: Using Python to query the NHSBSA open data portal API.
\n", 16 | "Created by: Ryan Leggett (NHSBSA)
\n", 17 | "Created on: 26-06-2022
\n",
18 | "Python version: created in 3.8"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "# 2. Load packages"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "List packages we will use"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "import grequests\n",
42 | "import pandas as pd\n",
43 | "import re\n",
44 | "import requests\n",
45 | "import warnings\n",
46 | "import urllib.parse\n",
47 | "\n",
48 | "warnings.simplefilter(\"ignore\", category=UserWarning)"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "Install packages if they aren't already installed using `pip install -r requirements.txt` (or the conda equivalent)"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "# 3. Define variables"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "Define the url for the API call"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "base_endpoint = 'https://opendata.nhsbsa.net/api/3/action/'\n",
79 | "package_list_method = 'package_list' # List of data-sets in the portal\n",
80 | "package_show_method = 'package_show?id=' # List all resources of a data-set\n",
81 | "action_method = 'datastore_search_sql?' # SQL action method"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "Send API call to get list of data-sets"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "datasets_response = requests.get(base_endpoint + package_list_method).json()"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "Now let's have a look at the data-sets currently available"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "print(datasets_response['result'])"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "For this example we're interested in the English Prescribing Dataset (EPD).\n",
121 | "We know the name of this data-set, so we can set it manually, or access it \n",
122 | "from datasets_response."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "dataset_id = \"english-prescribing-data-epd\""
132 | ]
133 | },
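{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged aside, you could pick the id out of datasets_response programmatically rather than typing it; this sketch assumes the EPD id is the only dataset name containing 'prescribing'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print([name for name in datasets_response['result'] if 'prescribing' in name])"
]
},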
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "# 4. API calls for single month"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "Define the parameters for the SQL query"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {},
152 | "outputs": [],
153 | "source": [
154 | "resource_name = 'EPD_202001' # For EPD, resources are named EPD_YYYYMM\n",
155 | "pco_code = '13T00' # Newcastle Gateshead CCG\n",
156 | "bnf_chemical_substance = '0407010H0' # Paracetamol"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "Build SQL query (WHERE criteria should be enclosed in single quotes)"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "single_month_query = \"SELECT * \" \\\n",
173 | "                     f\"FROM `{resource_name}` \" \\\n",
174 | "                     f\"WHERE pco_code = '{pco_code}' \" \\\n",
175 | "                     f\"AND bnf_chemical_substance = '{bnf_chemical_substance}'\""
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "Build API call"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "metadata": {},
189 | "outputs": [],
190 | "source": [
191 | "single_month_api_call = f\"{base_endpoint}\" \\\n",
192 | "                        f\"{action_method}\" \\\n",
193 | "                        \"resource_id=\" \\\n",
194 | "                        f\"{resource_name}\" \\\n",
195 | "                        \"&\" \\\n",
196 | "                        \"sql=\" \\\n",
197 | "                        f\"{urllib.parse.quote(single_month_query)}\" # Encode spaces in the url"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "Grab the response JSON as a dictionary"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "single_month_response = requests.get(single_month_api_call).json()"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Extract records in the response to a dataframe"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "metadata": {},
227 | "outputs": [],
228 | "source": [
229 | "single_month_df = pd.json_normalize(single_month_response['result']['result']['records'])"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "Let's have a quick look at the data"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "single_month_df.head()"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "You can use any of the fields listed in the data-set within the SQL query, \n",
253 | "either in the SELECT or in the WHERE clause, in order to filter.\n",
254 | "\n",
255 | "Information on the fields present in a data-set and an accompanying data \n",
256 | "dictionary can be found on the page for the relevant data-set on the Open Data \n",
257 | "Portal."
258 | ]
259 | },
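{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, you can list the fields returned for this resource straight from the dataframe, which is a quick way to see what could go in the SELECT or WHERE clause:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(single_month_df.columns.tolist())"
]
},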
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "# 5. API calls for data for multiple months"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "metadata": {},
270 | "source": [
271 | "Now that you have extracted data for a single month, you may want to get the \n",
272 | "data for several months, or a whole year.\n",
273 | "\n",
274 | "Firstly we need to get a list of all of the names and resource IDs for every \n",
275 | "EPD file. We therefore extract the metadata for the EPD dataset."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": null,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "metadata_response = requests.get(f\"{base_endpoint}\" \\\n",
285 | "                                 f\"{package_show_method}\" \\\n",
286 | "                                 f\"{dataset_id}\").json()"
287 | ]
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {},
292 | "source": [
293 | "Resource names and IDs are kept within the resources table returned from the \n",
294 | "package_show_method call."
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "metadata": {},
301 | "outputs": [],
302 | "source": [
303 | "resources_table = pd.json_normalize(metadata_response['result']['resources'])"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "We only want data for one calendar year, so we need to look at the \n",
311 | "name of each resource to identify the year. For this example we're looking at \n",
312 | "2020."
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {},
319 | "outputs": [],
320 | "source": [
321 | "resource_name_list = resources_table[resources_table['name'].str.contains('2020')]['name']"
322 | ]
323 | },
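{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is worth checking the list before querying; for a full calendar year you would expect one resource name per month:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(resource_name_list.tolist())"
]
},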
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "## 5.1. For loop"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "We can do this with a for loop that makes all of the individual API calls for \n",
336 | "you and combines the data into one dataframe.\n",
337 | "\n",
338 | "Initialise dataframe that data will be saved to"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {},
345 | "outputs": [],
346 | "source": [
347 | "for_loop_df = pd.DataFrame()"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "Because each individual month of EPD data is so large, it is unlikely that your \n",
355 | "local system will have enough RAM to hold a full year's worth of data in \n",
356 | "memory. Therefore we will only look at a single CCG and chemical substance, as \n",
357 | "we did previously.\n",
358 | "\n",
359 | "Loop through resource_name_list and make call to API to extract data, then \n",
360 | "bind each month together to make a single data-set"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": null,
366 | "metadata": {},
367 | "outputs": [],
368 | "source": [
369 | "for month in resource_name_list:\n",
370 | "    \n",
371 | "    # Build temporary SQL query\n",
372 | "    tmp_query = \"SELECT * \" \\\n",
373 | "                f\"FROM `{month}` \" \\\n",
374 | "                f\"WHERE pco_code = '{pco_code}' \" \\\n",
375 | "                f\"AND bnf_chemical_substance = '{bnf_chemical_substance}'\"\n",
376 | "    \n",
377 | "    # Build temporary API call\n",
378 | "    tmp_api_call = f\"{base_endpoint}\" \\\n",
379 | "                   f\"{action_method}\" \\\n",
380 | "                   \"resource_id=\" \\\n",
381 | "                   f\"{month}\" \\\n",
382 | "                   \"&\" \\\n",
383 | "                   \"sql=\" \\\n",
384 | "                   f\"{urllib.parse.quote(tmp_query)}\" # Encode spaces in the url\n",
385 | "    \n",
386 | "    # Grab the response JSON as a temporary dictionary\n",
387 | "    tmp_response = requests.get(tmp_api_call).json()\n",
388 | "    \n",
389 | "    # Extract records in the response to a temporary dataframe\n",
390 | "    tmp_df = pd.json_normalize(tmp_response['result']['result']['records'])\n",
391 | "    \n",
392 | "    # Bind the temporary data to the main dataframe\n",
393 | "    for_loop_df = pd.concat([for_loop_df, tmp_df])"
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "Let's have a quick look at the data"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": null,
406 | "metadata": {},
407 | "outputs": [],
408 | "source": [
409 | "for_loop_df.head()"
410 | ]
411 | },
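{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because each month's dataframe keeps its own row index, the combined dataframe will contain repeating index values; if that matters for your analysis, an optional tidy-up is to reset it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for_loop_df = for_loop_df.reset_index(drop=True)"
]
},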
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "## 5.2. Async"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "We can call the API asynchronously by vectorising our approach, which gives an \n",
424 | "approximately 10x speed increase over the for loop when querying many resources.\n",
425 | "\n",
426 | "Construct the SQL query as a function"
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": null,
432 | "metadata": {},
433 | "outputs": [],
434 | "source": [
435 | "def async_query(resource_name):\n",
436 | "    query = \"SELECT * \" \\\n",
437 | "            f\"FROM `{resource_name}` \" \\\n",
438 | "            f\"WHERE pco_code = '{pco_code}' \" \\\n",
439 | "            f\"AND bnf_chemical_substance = '{bnf_chemical_substance}'\"\n",
440 | "    return query"
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "Create the API calls"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": null,
453 | "metadata": {},
454 | "outputs": [],
455 | "source": [
456 | "async_api_calls = []\n",
457 | "for x in resource_name_list:\n",
458 | "    async_api_calls.append(\n",
459 | "        f\"{base_endpoint}\" \\\n",
460 | "        f\"{action_method}\" \\\n",
461 | "        \"resource_id=\" \\\n",
462 | "        f\"{x}\" \\\n",
463 | "        \"&\" \\\n",
464 | "        \"sql=\" \\\n",
465 | "        f\"{urllib.parse.quote(async_query(x))}\" # Encode spaces in the url \n",
466 | "    )"
467 | ]
468 | },
469 | {
470 | "cell_type": "markdown",
471 | "metadata": {},
472 | "source": [
473 | "Use grequests to get the results"
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": null,
479 | "metadata": {},
480 | "outputs": [],
481 | "source": [
482 | "dd = (grequests.get(u) for u in async_api_calls)\n",
483 | "res = grequests.map(dd)"
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "Check that everything is a success"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {},
497 | "outputs": [],
498 | "source": [
499 | "for x in res:\n",
500 | "    if x.ok:\n",
501 | "        print(True)\n",
502 | "    else:\n",
503 | "        print(False)"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "Parse the output into a list of dataframes and concatenate the results"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": null,
516 | "metadata": {},
517 | "outputs": [],
518 | "source": [
519 | "async_df = pd.DataFrame()\n",
520 | "\n",
521 | "for x in res:\n",
522 | "    # Grab the response JSON as a temporary dictionary\n",
523 | "    tmp_response = x.json()\n",
524 | "    \n",
525 | "    # Extract records in the response to a temporary dataframe\n",
526 | "    tmp_df = pd.json_normalize(tmp_response['result']['result']['records'])\n",
527 | "    \n",
528 | "    # Bind the temporary data to the main dataframe\n",
529 | "    async_df = pd.concat([async_df, tmp_df])"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "Let's have a quick look at the data"
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": null,
542 | "metadata": {},
543 | "outputs": [],
544 | "source": [
545 | "async_df.head()"
546 | ]
547 | },
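{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the async approach should return the same records as the for loop, so the row counts ought to match:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(len(for_loop_df) == len(async_df))"
]
},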
548 | {
549 | "cell_type": "markdown",
550 | "metadata": {},
551 | "source": [
552 | "# 6. Export the data"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": null,
558 | "metadata": {},
559 | "outputs": [],
560 | "source": [
561 | "single_month_df.to_csv('single_month.csv')\n",
562 | "for_loop_df.to_csv('for_loop.csv')\n",
563 | "async_df.to_csv('async.csv')"
564 | ]
565 | }
566 | ],
567 | "metadata": {
568 | "kernelspec": {
569 | "display_name": "Python 3",
570 | "language": "python",
571 | "name": "python3"
572 | },
573 | "language_info": {
574 | "codemirror_mode": {
575 | "name": "ipython",
576 | "version": 3
577 | },
578 | "file_extension": ".py",
579 | "mimetype": "text/x-python",
580 | "name": "python",
581 | "nbconvert_exporter": "python",
582 | "pygments_lexer": "ipython3",
583 | "version": "3.8.5"
584 | }
585 | },
586 | "nbformat": 4,
587 | "nbformat_minor": 4
588 | }
589 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | grequests>=0.6.0
2 | pandas>=1.1.3
3 | requests>=2.24.0
4 | urllib3>=1.25.11
--------------------------------------------------------------------------------