├── .gitignore
├── LICENSE
├── README.md
├── open-data-portal-api.R
├── open-data-portal-api.Rproj
├── open-data-portal-api.ipynb
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
2 | .ipynb_checkpoints/
3 | .Rproj.user
4 | .Rhistory
5 | .RData
6 | .Ruserdata
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright 2021 NHS Business Services Authority
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Open Data API
2 | A collection of examples of how to query the NHSBSA Open Data Portal API.
3 |
4 | ## Usage
5 | For R users, please use `open-data-portal-api.R` and follow the instructions.
6 |
7 | For Python users, please use `open-data-portal-api.ipynb` and install the
8 | required packages using:
9 | ```
10 | pip install -r requirements.txt
11 | ```
12 | If needed, please export the notebook to a `.py` script (for example with `jupyter nbconvert --to script open-data-portal-api.ipynb`).
13 |
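14 | Both scripts follow the same pattern: build a SQL `SELECT` against a named
15 | resource and send it to the portal's `datastore_search_sql` endpoint. As a
16 | quick orientation, here is a minimal sketch of a single call in Python using
17 | `requests` (the resource name is an example; see the scripts for full
18 | walkthroughs):
19 | ```
20 | import requests
21 |
22 | base_endpoint = "https://opendata.nhsbsa.net/api/3/action/"
23 |
24 | # Ask the SQL endpoint for a handful of rows from one monthly EPD resource
25 | response = requests.get(
26 |     base_endpoint + "datastore_search_sql",
27 |     params={
28 |         "resource_id": "EPD_202001",
29 |         "sql": "SELECT * FROM `EPD_202001` LIMIT 5",
30 |     },
31 | ).json()
32 |
33 | print(response["result"]["result"]["records"])
34 | ```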
--------------------------------------------------------------------------------
/open-data-portal-api.R:
--------------------------------------------------------------------------------
1 | # 1. Script details ------------------------------------------------------------
2 |
3 | # Name of script: OpenDataAPIQuery
4 | # Description: Using R to query the NHSBSA open data portal API.
5 | # Created by: Matthew Wilson (NHSBSA)
6 | # Created on: 26-03-2020
7 | # Latest update by: Adam Ivison (NHSBSA)
8 | # Latest update on: 24-06-2021
9 | # Update notes: Updated endpoint in the script, refactored code and added async
10 |
11 | # R version: created in 3.5.3
12 |
13 | # 2. Load packages -------------------------------------------------------------
14 |
15 | # List packages we will use
16 | packages <- c(
17 | "jsonlite", # 1.6
18 | "dplyr", # 0.8.3
19 | "crul" # 1.1.0
20 | )
21 |
22 | # Install any packages that aren't already installed
23 | if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
24 | install.packages(setdiff(packages, rownames(installed.packages())))
25 | }
26 |
27 | # 3. Define variables ----------------------------------------------------------
28 |
29 | # Define the url for the API call
30 | base_endpoint <- "https://opendata.nhsbsa.net/api/3/action/"
31 | package_list_method <- "package_list" # List of data-sets in the portal
32 | package_show_method <- "package_show?id=" # List all resources of a data-set
33 | action_method <- "datastore_search_sql?" # SQL action method
34 |
35 | # Send API call to get list of data-sets
36 | datasets_response <- jsonlite::fromJSON(paste0(
37 | base_endpoint,
38 | package_list_method
39 | ))
40 |
41 | # Now let's have a look at the data-sets currently available
42 | datasets_response$result
43 |
44 | # For this example we're interested in the English Prescribing Dataset (EPD).
45 | # We know the name of this data-set, so we can set it manually or access it
46 | # from datasets_response.
47 | dataset_id <- "english-prescribing-data-epd"
48 |
49 | # 4. API calls for single month ------------------------------------------------
50 |
51 | # Define the parameters for the SQL query
52 | resource_name <- "EPD_202001" # For EPD resources are named EPD_YYYYMM
53 | pco_code <- "13T00" # Newcastle Gateshead CCG
54 | bnf_chemical_substance <- "0407010H0" # Paracetamol
55 |
56 | # Build SQL query (WHERE criteria should be enclosed in single quotes)
57 | single_month_query <- paste0(
58 | "
59 | SELECT
60 | *
61 | FROM `",
62 | resource_name, "`
63 | WHERE
64 | 1=1
65 | AND pco_code = '", pco_code, "'
66 | AND bnf_chemical_substance = '", bnf_chemical_substance, "'
67 | "
68 | )
69 |
70 | # Build API call
71 | single_month_api_call <- paste0(
72 | base_endpoint,
73 | action_method,
74 | "resource_id=",
75 | resource_name,
76 | "&",
77 | "sql=",
78 |   URLencode(single_month_query) # Percent-encode the query for use in the url
79 | )
80 |
81 | # Grab the response JSON as a list
82 | single_month_response <- jsonlite::fromJSON(single_month_api_call)
83 |
84 | # Extract records in the response to a dataframe
85 | single_month_df <- single_month_response$result$result$records
86 |
87 | # Let's have a quick look at the data
88 | str(single_month_df)
89 | head(single_month_df)
90 |
91 | # You can use any of the fields listed in the data-set within the SQL query,
92 | # either in the SELECT clause or in the WHERE clause, in order to filter.
93 |
94 | # Information on the fields present in a data-set and an accompanying data
95 | # dictionary can be found on the page for the relevant data-set on the Open Data
96 | # Portal.
97 |
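98 | # For example, a hypothetical aggregate query. This is a sketch only -- the
99 | # field names here are illustrative, so check the data dictionary for the
100 | # exact names. It returns a few named fields and sums the items dispensed:
101 | example_query <- paste0(
102 |   "
103 |   SELECT
104 |       year_month,
105 |       practice_name,
106 |       SUM(items) AS total_items
107 |   FROM `",
108 |   resource_name, "`
109 |   WHERE
110 |       pco_code = '", pco_code, "'
111 |   GROUP BY
112 |       year_month,
113 |       practice_name
114 |   "
115 | )
116 |
117 | # This string can then be URL-encoded and submitted in exactly the same way as
118 | # single_month_query above.
119 |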
98 | # 5. API calls for data for multiple months ------------------------------------
99 |
100 | # Now that you have extracted data for a single month, you may want to get the
101 | # data for several months, or a whole year.
102 |
103 | # First, we need to get a list of all of the names and resource IDs for every
104 | # EPD file. We therefore extract the metadata for the EPD dataset.
105 | metadata_response <- jsonlite::fromJSON(paste0(
106 | base_endpoint,
107 | package_show_method,
108 | dataset_id
109 | ))
110 |
111 | # Resource names and IDs are kept within the resources table returned from the
112 | # package_show_method call.
113 | resources_table <- metadata_response$result$resources
114 |
115 | # We only want data for one calendar year. To do this we need to look at the
116 | # name of each resource to identify the year. For this example we're looking at
117 | # 2020.
118 | resource_name_list <- resources_table$name[grepl("2020", resources_table$name)]
119 |
120 | # 5.1. For loop ----------------------------------------------------------------
121 |
122 | # We can do this with a for loop that makes all of the individual API calls for
123 | # you and combines the data together into one dataframe
124 |
125 | # Initialise dataframe that data will be saved to
126 | for_loop_df <- data.frame()
127 |
128 | # Each individual month of EPD data is so large that your local system is
129 | # unlikely to have enough RAM to hold a full year's worth of data in memory.
130 | # Therefore we will only look at a single CCG and chemical substance, as we did
131 | # previously.
132 |
133 | # Loop through resource_name_list and make call to API to extract data, then
134 | # bind each month together to make a single data-set
135 | for(month in resource_name_list) {
136 |
137 | # Build temporary SQL query
138 | tmp_query <- paste0(
139 | "
140 | SELECT
141 | *
142 | FROM `",
143 | month, "`
144 | WHERE
145 | 1=1
146 | AND pco_code = '", pco_code, "'
147 | AND bnf_chemical_substance = '", bnf_chemical_substance, "'
148 | "
149 | )
150 |
151 | # Build temporary API call
152 | tmp_api_call <- paste0(
153 | base_endpoint,
154 | action_method,
155 | "resource_id=",
156 | month,
157 | "&",
158 | "sql=",
159 |     URLencode(tmp_query) # Percent-encode the query for use in the url
160 | )
161 |
162 | # Grab the response JSON as a temporary list
163 | tmp_response <- jsonlite::fromJSON(tmp_api_call)
164 |
165 | # Extract records in the response to a temporary dataframe
166 | tmp_df <- tmp_response$result$result$records
167 |
168 | # Bind the temporary data to the main dataframe
169 | for_loop_df <- dplyr::bind_rows(for_loop_df, tmp_df)
170 | }
171 |
172 | # 5.2. Async -------------------------------------------------------------------
173 |
174 | # We can call the API asynchronously, which gives roughly a 10x speed increase
175 | # over the for loop, since the requests are sent concurrently, not one at a time.
176 |
177 | # Construct the SQL query as a function
178 | async_query <- function(resource_name) {
179 | paste0(
180 | "
181 | SELECT
182 | *
183 | FROM `",
184 | resource_name, "`
185 | WHERE
186 | 1=1
187 | AND pco_code = '", pco_code, "'
188 | AND bnf_chemical_substance = '", bnf_chemical_substance, "'
189 | "
190 | )
191 | }
192 |
193 | # Create the API calls
194 | async_api_calls <- lapply(
195 | X = resource_name_list,
196 | FUN = function(x)
197 | paste0(
198 | base_endpoint,
199 | action_method,
200 | "resource_id=",
201 | x,
202 | "&",
203 | "sql=",
204 |         URLencode(async_query(x)) # Percent-encode the query for use in the url
205 | )
206 | )
207 |
208 | # Use crul::Async to get the results
209 | dd <- crul::Async$new(urls = unlist(async_api_calls))
210 | res <- dd$get()
211 |
212 | # Check that everything is a success
213 | all(vapply(res, function(z) z$success(), logical(1)))
214 |
215 | # Parse the output into a list of dataframes
216 | async_dfs <- lapply(
217 | X = res,
218 | FUN = function(x) {
219 |
220 | # Parse the response
221 | tmp_response <- x$parse("UTF-8")
222 |
223 |     # Extract and return the records
224 |     jsonlite::fromJSON(tmp_response)$result$result$records
225 | }
226 | )
227 |
228 | # Concatenate the results
229 | async_df <- do.call(dplyr::bind_rows, async_dfs)
230 |
231 | # 6. Export the data -----------------------------------------------------------
232 |
233 | # Use write.csv for ease
234 | write.csv(single_month_df, "single_month.csv", row.names = FALSE)
235 | write.csv(for_loop_df, "for_loop.csv", row.names = FALSE)
236 | write.csv(async_df, "async.csv", row.names = FALSE)
237 |
--------------------------------------------------------------------------------
/open-data-portal-api.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------
/open-data-portal-api.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 1. Script details"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Name of script: OpenDataAPIQuery
\n",
15 | "Description: Using Python to query the NHSBSA open data portal API.
\n",
16 | "Created by: Ryan Leggett (NHSBSA)
\n",
17 | "Created on: 26-06-2022
\n",
18 | "Python version: created in 3.8"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "# 2. Load packages"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "List packages we will use"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "import grequests\n",
42 | "import pandas as pd\n",
43 | "import re\n",
44 | "import requests\n",
45 | "import warnings\n",
46 | "import urllib.parse\n",
47 | "\n",
48 | "warnings.simplefilter(\"ignore\", category=UserWarning)"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "Install packages if they aren't already using `Pip/Conda install -r requirements.txt`"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "# 3. Define variablesDefine the url for the API call"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "Define the url for the API call"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "base_endpoint = 'https://opendata.nhsbsa.net/api/3/action/'\n",
79 | "package_list_method = 'package_list' # List of data-sets in the portal\n",
80 | "package_show_method = 'package_show?id=' # List all resources of a data-set\n",
81 | "action_method = 'datastore_search_sql?' # SQL action method"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "Send API call to get list of data-sets"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "datasets_response = requests.get(base_endpoint + package_list_method).json()"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "Now lets have a look at the data-sets currently available"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "print(datasets_response['result'])"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "For this example we're interested in the English Prescribing Dataset (EPD).\n",
121 | "We know the name of this data-set so can set this manually, or access it \n",
122 | "from datasets_response."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "dataset_id = \"english-prescribing-data-epd\""
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "# 4. API calls for single month"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "Define the parameters for the SQL query"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {},
152 | "outputs": [],
153 | "source": [
154 | "resource_name = 'EPD_202001' # For EPD resources are named EPD_YYYYMM\n",
155 | "pco_code = '13T00' # Newcastle Gateshead CCG\n",
156 | "bnf_chemical_substance = '0407010H0' # Paracetamol"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "Build SQL query (WHERE criteria should be enclosed in single quotes)"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "single_month_query = \"SELECT * \" \\\n",
173 | " f\"FROM `{resource_name}` \" \\\n",
174 | " f\"WHERE pco_code = '{pco_code}' \" \\\n",
175 | " f\"AND bnf_chemical_substance = '{bnf_chemical_substance}'\""
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "Build API call"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "metadata": {},
189 | "outputs": [],
190 | "source": [
191 | "single_month_api_call = f\"{base_endpoint}\" \\\n",
192 | " f\"{action_method}\" \\\n",
193 | " \"resource_id=\" \\\n",
194 | " f\"{resource_name}\" \\\n",
195 | " \"&\" \\\n",
196 | " \"sql=\" \\\n",
197 | " f\"{urllib.parse.quote(single_month_query)}\" # Encode spaces in the url"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "Grab the response JSON as a list"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "single_month_response = requests.get(single_month_api_call).json()"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Extract records in the response to a dataframe"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": null,
226 | "metadata": {},
227 | "outputs": [],
228 | "source": [
229 | "single_month_df = pd.json_normalize(single_month_response['result']['result']['records'])"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "Lets have a quick look at the data"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "single_month_df.head()"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "You can use any of the fields listed in the data-set within the SQL query as \n",
253 | "part of the select or in the where clause in order to filter.\n",
254 | "\n",
255 | "Information on the fields present in a data-set and an accompanying data \n",
256 | "dictionary can be found on the page for the relevant data-set on the Open Data \n",
257 | "Portal."
258 | ]
259 | },
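260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "For example, a hypothetical aggregate query. This is a sketch only; the field names here are illustrative, so check the data dictionary for the exact names. It returns a few named fields and sums the items dispensed with `GROUP BY`:"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "# Field names below are illustrative -- check the data dictionary on the portal\n",
274 | "example_query = \"SELECT year_month, practice_name, \" \\\n",
275 | "                \"SUM(items) AS total_items \" \\\n",
276 | "                f\"FROM `{resource_name}` \" \\\n",
277 | "                f\"WHERE pco_code = '{pco_code}' \" \\\n",
278 | "                \"GROUP BY year_month, practice_name\""
279 | ]
280 | },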
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "# 5. API calls for data for multiple months"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "metadata": {},
270 | "source": [
271 | "Now that you have extracted data for a single month, you may want to get the \n",
272 | "data for several months, or a whole year.\n",
273 | "\n",
274 | "Firstly we need to get a list of all of the names and resource IDs for every \n",
275 | "EPD file. We therefore extract the metadata for the EPD dataset."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": null,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "metadata_repsonse = requests.get(f\"{base_endpoint}\" \\\n",
285 | " f\"{package_show_method}\" \\\n",
286 | " f\"{dataset_id}\").json()"
287 | ]
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {},
292 | "source": [
293 | "Resource names and IDs are kept within the resources table returned from the \n",
294 | "package_show_method call."
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "metadata": {},
301 | "outputs": [],
302 | "source": [
303 | "resources_table = pd.json_normalize(metadata_repsonse['result']['resources'])"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "We only want data for one calendar year, to do this we need to look at the \n",
311 | "name of the data-set to identify the year. For this example we're looking at \n",
312 | "2020."
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {},
319 | "outputs": [],
320 | "source": [
321 | "resource_name_list = resources_table[resources_table['name'].str.contains('2020')]['name']"
322 | ]
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "## 5.1. For loop"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "We can do this with a for loop that makes all of the individual API calls for \n",
336 | "you and combines the data together into one dataframe\n",
337 | "\n",
338 | "Initialise dataframe that data will be saved to"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {},
345 | "outputs": [],
346 | "source": [
347 | "for_loop_df = pd.DataFrame()"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "As each individual month of EPD data is so large it will be unlikely that your \n",
355 | "local system will have enough RAM to hold a full year's worth of data in \n",
356 | "memory. Therefore we will only look at a single CCG and chemical substance as \n",
357 | "we did previously\n",
358 | "\n",
359 | "Loop through resource_name_list and make call to API to extract data, then \n",
360 | "bind each month together to make a single data-set"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": null,
366 | "metadata": {},
367 | "outputs": [],
368 | "source": [
369 | "for month in resource_name_list:\n",
370 | " \n",
371 | " # Build temporary SQL query\n",
372 | " tmp_query = \"SELECT * \" \\\n",
373 | " f\"FROM `{month}` \" \\\n",
374 | " f\"WHERE pco_code = '{pco_code}' \" \\\n",
375 | " f\"AND bnf_chemical_substance = '{bnf_chemical_substance}'\"\n",
376 | " \n",
377 | " # Build temporary API call\n",
378 | " tmp_api_call = f\"{base_endpoint}\" \\\n",
379 | " f\"{action_method}\" \\\n",
380 | " \"resource_id=\" \\\n",
381 | " f\"{month}\" \\\n",
382 | " \"&\" \\\n",
383 | " \"sql=\" \\\n",
384 | " f\"{urllib.parse.quote(tmp_query)}\" # Encode spaces in the url\n",
385 | " \n",
386 | " # Grab the response JSON as a temporary list\n",
387 | " tmp_response = requests.get(tmp_api_call).json()\n",
388 | " \n",
389 | " # Extract records in the response to a temporary dataframe\n",
390 | " tmp_df = pd.json_normalize(tmp_response['result']['result']['records'])\n",
391 | " \n",
392 | " # Bind the temporary data to the main dataframe\n",
393 | " for_loop_df = for_loop_df.append(tmp_df)"
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "Lets have a quick look at the data"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": null,
406 | "metadata": {},
407 | "outputs": [],
408 | "source": [
409 | "for_loop_df.head()"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "## 5.2. Async"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "We can call the API asynchronously and this will result in an approx 10x speed \n",
424 | "increase over a for loop for large resource_names by vectorising our approach.\n",
425 | "\n",
426 | "Construct the SQL query as a function"
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": null,
432 | "metadata": {},
433 | "outputs": [],
434 | "source": [
435 | "def async_query(resource_name):\n",
436 | " query = \"SELECT * \" \\\n",
437 | " f\"FROM `{resource_name}` \" \\\n",
438 | " f\"WHERE pco_code = '{pco_code}' \" \\\n",
439 | " f\"AND bnf_chemical_substance = '{bnf_chemical_substance}'\"\n",
440 | " return(query)"
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "Create the API calls"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": null,
453 | "metadata": {},
454 | "outputs": [],
455 | "source": [
456 | "async_api_calls = []\n",
457 | "for x in resource_name_list:\n",
458 | " async_api_calls.append(\n",
459 | " f\"{base_endpoint}\" \\\n",
460 | " f\"{action_method}\" \\\n",
461 | " \"resource_id=\" \\\n",
462 | " f\"{x}\" \\\n",
463 | " \"&\" \\\n",
464 | " \"sql=\" \\\n",
465 | " f\"{urllib.parse.quote(async_query(x))}\" # Encode spaces in the url \n",
466 | " )"
467 | ]
468 | },
469 | {
470 | "cell_type": "markdown",
471 | "metadata": {},
472 | "source": [
473 | "Use grequests to get the results"
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": null,
479 | "metadata": {},
480 | "outputs": [],
481 | "source": [
482 | "dd = (grequests.get(u) for u in async_api_calls)\n",
483 | "res = grequests.map(dd)"
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "Check that everything is a success"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {},
497 | "outputs": [],
498 | "source": [
499 | "for x in res:\n",
500 | " if x.ok:\n",
501 | " print(True)\n",
502 | " else:\n",
503 | " print(False)"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "Parse the output into a list of dataframes and concatenate the results"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": null,
516 | "metadata": {},
517 | "outputs": [],
518 | "source": [
519 | "async_df = pd.DataFrame()\n",
520 | "\n",
521 | "for x in res:\n",
522 | " # Grab the response JSON as a temporary list\n",
523 | " tmp_response = x.json()\n",
524 | " \n",
525 | " # Extract records in the response to a temporary dataframe\n",
526 | " tmp_df = pd.json_normalize(tmp_response['result']['result']['records'])\n",
527 | " \n",
528 | " # Bind the temporary data to the main dataframe\n",
529 | " async_df = async_df.append(tmp_df)"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "Lets have a quick look at the data"
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": null,
542 | "metadata": {},
543 | "outputs": [],
544 | "source": [
545 | "async_df.head()"
546 | ]
547 | },
548 | {
549 | "cell_type": "markdown",
550 | "metadata": {},
551 | "source": [
552 | "# 6. Export the data"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": null,
558 | "metadata": {},
559 | "outputs": [],
560 | "source": [
561 | "single_month_df.to_csv('single_month.csv')\n",
562 | "for_loop_df.to_csv('for_loop.csv')\n",
563 | "async_df.to_csv('async.csv')"
564 | ]
565 | }
566 | ],
567 | "metadata": {
568 | "kernelspec": {
569 | "display_name": "Python 3",
570 | "language": "python",
571 | "name": "python3"
572 | },
573 | "language_info": {
574 | "codemirror_mode": {
575 | "name": "ipython",
576 | "version": 3
577 | },
578 | "file_extension": ".py",
579 | "mimetype": "text/x-python",
580 | "name": "python",
581 | "nbconvert_exporter": "python",
582 | "pygments_lexer": "ipython3",
583 | "version": "3.8.5"
584 | }
585 | },
586 | "nbformat": 4,
587 | "nbformat_minor": 4
588 | }
589 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | grequests>=0.6.0
2 | pandas>=1.1.3
3 | requests>=2.24.0
4 | urllib3>=1.25.11
--------------------------------------------------------------------------------