├── .gitignore ├── LICENSE ├── README.md ├── notebooks ├── authors │ └── hirsch-index.ipynb ├── data_questions │ └── counts_within_country.ipynb ├── getting-started │ ├── README.md │ ├── api-webinar-apr2024 │ │ └── tutorial01.ipynb │ ├── get-random-entity.ipynb │ ├── paging.ipynb │ └── premium.ipynb ├── institutions │ ├── japan_sources.csv │ ├── japan_sources.ipynb │ ├── oa-percentage.ipynb │ ├── uw-collaborators copy.ipynb │ └── uw-collaborators.ipynb └── openalex_works │ └── openalex_works.ipynb ├── requirements.txt ├── resources └── img │ ├── OpenAlex-banner.png │ ├── OpenAlex-entities.png │ ├── OpenAlex-logo.png │ ├── notebooks │ ├── cursor-paging.png │ └── meta-object.png │ └── ui_vs_api.svg └── runtime.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Jupyter Notebook 2 | .ipynb_checkpoints 3 | 4 | # Environments 5 | .env 6 | .venv 7 | env/ 8 | venv/ 9 | ENV/ 10 | env.bak/ 11 | venv.bak/ 12 | 13 | data/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 OurResearch 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | OpenAlex logo 3 | 4 | 5 | # OpenAlex API tutorials 6 | 7 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ourresearch/openalex-api-tutorials/main) 8 | [![Open All Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ourresearch/openalex-api-tutorials) 9 | [![Deepnote](https://deepnote.com/buttons/launch-in-deepnote-small.svg)](https://www.deepnote.com/launch?url=https%3A%2F%2Fgithub.com%2Fourresearch%2Fopenalex-api-tutorials/) 10 | 11 | A collection of Jupyter notebooks, each walking you through a common example of bibliometric analysis 12 | using scholarly data from the [OpenAlex API](https://docs.openalex.org/). (:warning: Work In Progress). 13 | 14 | 15 | ## :bulb: What is OpenAlex? 16 | [OpenAlex](https://openalex.org/) is a fully-open index of scholarly works, authors, venues, institutions, and concepts 17 | — along with all the ways they're connected to one another. 18 | It's named after the ancient [Library of Alexandria](https://en.wikipedia.org/wiki/Library_of_Alexandria) 19 | and made by the nonprofit [OurResearch](https://ourresearch.org/). 20 | 21 |
22 | 23 | OpenAlex base model 24 | 25 |
26 | 27 | What makes OpenAlex stand out as a bibliographic data source is its openness: 28 | * The data is made available under the [CC0 license](https://creativecommons.org/publicdomain/zero/1.0/). 29 | That means it's in the public domain, and free to use in any way you like. 30 | 31 | * The primary way to access the data is the [API](https://docs.openalex.org/#access). 32 | It is free and requires no authentication. 33 | 34 | 35 | ## :notebook: What are Jupyter notebooks? 36 | Jupyter notebooks are documents that let you combine executable code snippets 37 | with explanatory text, formulas and visualizations. 38 | Weaving both of them together lets you craft a narrative around the *How?* and *Why?* 39 | of your programming work, which makes them especially useful for writing up documentation and tutorials. 40 | But not only that: 41 | you can also dive right in by modifying and re-running code snippets as needed. 42 | Therefore a notebook may serve as a starting point to prototype your own idea! 43 | 44 | 45 | ## :rocket: How do I run the notebooks? 46 | *Note: You can browse through and read the notebooks right here on GitHub. However, the code snippets won't be executable.* 47 | 48 | The easiest way to run Jupyter notebooks is via cloud services like [Binder](https://mybinder.org/), 49 | [Google’s Colaboratory](https://colab.research.google.com/) or [Deepnote](https://deepnote.com/). 50 | They provide you with a free execution environment that you can access directly in your browser - no setup needed. 51 | Just click on one of the badges at the top of this README and it will take you to the selected service. 52 | 53 | Alternatively you can set up a Jupyter server on your computer 54 | (for instructions please refer to the [official Jupyter docs](https://docs.jupyter.org/en/latest/install.html)). 55 | Many IDEs also support running Jupyter notebooks out of the box or via a plugin. 
If you have one installed, 56 | it may be a good idea to consult its docs or marketplace. 57 | If you go local, though, please remember to install the Python packages specified in the `requirements.txt` file. 58 | 59 | 60 | ## :book: Citation 61 | If you use OpenAlex in your research, please cite this paper: 62 | > Priem, J., Piwowar, H., & Orr, R. (2022). _OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts._ ArXiv. https://arxiv.org/abs/2205.01833 63 | 64 | and don't forget to [tell us](https://docs.openalex.org/#contact) about your project. We love to hear what you come up with using data from OpenAlex! 65 | -------------------------------------------------------------------------------- /notebooks/data_questions/counts_within_country.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "# Question: How is the number of works counted when looking at institutions within a country?" 
9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 13, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "import requests\n", 18 | "country_code = 'ES'\n", 19 | "url = f\"https://api.openalex.org/works\"\n", 20 | "params = {\n", 21 | " 'filter': f'institutions.country_code:{country_code}',\n", 22 | " 'group_by': 'institutions.id',\n", 23 | "}\n", 24 | "r = requests.get(url, params=params)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 14, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "counts_by_institutions_from_works_endpoint = r.json()['group_by']" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 16, 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "data": { 43 | "text/plain": [ 44 | "{'key': 'https://openalex.org/I71999127',\n", 45 | " 'key_display_name': 'University of Barcelona',\n", 46 | " 'count': 106711}" 47 | ] 48 | }, 49 | "execution_count": 16, 50 | "metadata": {}, 51 | "output_type": "execute_result" 52 | } 53 | ], 54 | "source": [ 55 | "counts_by_institutions_from_works_endpoint[0]" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 18, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "collected 1804 sources (using 74 api calls)\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "country_code = 'ES'\n", 73 | "# institutions endpoint; the page number is passed via `params`\n", 74 | "url = f\"https://api.openalex.org/institutions\"\n", 75 | "params = {\n", 76 | " 'filter': f'country_code:{country_code}',\n", 77 | " 'page': 1, # initialize `page` param to 1\n", 78 | "}\n", 79 | "\n", 80 | "has_more_pages = True\n", 81 | "fewer_than_10k_results = True\n", 82 | "\n", 83 | "institutions_data_from_institutions_endpoint = []\n", 84 | "\n", 85 | "# loop through pages\n", 86 | "loop_index = 0\n", 87 | "while has_more_pages and fewer_than_10k_results:\n", 88 | " \n", 89 | " page_with_results = 
requests.get(url, params=params).json()\n", 90 | " \n", 91 | " # loop through partial list of results\n", 92 | " results = page_with_results['results']\n", 93 | " for api_result in results:\n", 94 | " # # Collect the fields we are interested in, for this source\n", 95 | " # source = {field: api_result[field] for field in fields}\n", 96 | " # Append this institution record to our results list\n", 97 | " institutions_data_from_institutions_endpoint.append(api_result)\n", 98 | "\n", 99 | " # next page\n", 100 | " params['page'] += 1\n", 101 | " \n", 102 | " # end loop when either there are no more results on the requested page \n", 103 | " # or the next request would exceed 10,000 results\n", 104 | " per_page = page_with_results['meta']['per_page']\n", 105 | " has_more_pages = len(results) == per_page\n", 106 | " fewer_than_10k_results = per_page * params['page'] <= 10000\n", 107 | " loop_index += 1\n", 108 | "print(f\"collected {len(institutions_data_from_institutions_endpoint)} sources (using {loop_index+1} api calls)\")" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 24, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "keyed_counts = {item['key']: item for item in counts_by_institutions_from_works_endpoint}\n", 118 | "data = []\n", 119 | "for inst in institutions_data_from_institutions_endpoint:\n", 120 | " id = inst['id']\n", 121 | " c = keyed_counts.get(id)\n", 122 | " if c:\n", 123 | " data.append({\n", 124 | " 'id': id,\n", 125 | " 'count1': inst['works_count'],\n", 126 | " 'count2': c['count'],\n", 127 | " })" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 25, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/plain": [ 138 | "[{'id': 'https://openalex.org/I71999127', 'count1': 106775, 'count2': 106711},\n", 139 | " {'id': 'https://openalex.org/I121748325', 'count1': 101654, 'count2': 101658},\n", 140 | " {'id': 'https://openalex.org/I123044942', 'count1': 
88726, 'count2': 88466},\n", 141 | " {'id': 'https://openalex.org/I16097986', 'count1': 75688, 'count2': 74845},\n", 142 | " {'id': 'https://openalex.org/I173304897', 'count1': 70325, 'count2': 70273},\n", 143 | " {'id': 'https://openalex.org/I63634437', 'count1': 63111, 'count2': 61569},\n", 144 | " {'id': 'https://openalex.org/I79238269', 'count1': 58094, 'count2': 58138},\n", 145 | " {'id': 'https://openalex.org/I169108374', 'count1': 55330, 'count2': 55244},\n", 146 | " {'id': 'https://openalex.org/I9617848', 'count1': 51480, 'count2': 51501},\n", 147 | " {'id': 'https://openalex.org/I200284239', 'count1': 49358, 'count2': 49181},\n", 148 | " {'id': 'https://openalex.org/I255234318', 'count1': 47887, 'count2': 47898},\n", 149 | " {'id': 'https://openalex.org/I88060688', 'count1': 42417, 'count2': 42434},\n", 150 | " {'id': 'https://openalex.org/I60053951', 'count1': 42409, 'count2': 42438},\n", 151 | " {'id': 'https://openalex.org/I80180929', 'count1': 36961, 'count2': 36981},\n", 152 | " {'id': 'https://openalex.org/I165339363', 'count1': 36934, 'count2': 35901},\n", 153 | " {'id': 'https://openalex.org/I184999862', 'count1': 36332, 'count2': 36361},\n", 154 | " {'id': 'https://openalex.org/I82767444', 'count1': 33232, 'count2': 33261},\n", 155 | " {'id': 'https://openalex.org/I134820265', 'count1': 32576, 'count2': 31468},\n", 156 | " {'id': 'https://openalex.org/I6289922', 'count1': 29132, 'count2': 29159},\n", 157 | " {'id': 'https://openalex.org/I130194489', 'count1': 28886, 'count2': 28914},\n", 158 | " {'id': 'https://openalex.org/I170486558', 'count1': 28116, 'count2': 28094},\n", 159 | " {'id': 'https://openalex.org/I79189158', 'count1': 27491, 'count2': 27530},\n", 160 | " {'id': 'https://openalex.org/I50357001', 'count1': 26882, 'count2': 26880},\n", 161 | " {'id': 'https://openalex.org/I108103353', 'count1': 25897, 'count2': 25916},\n", 162 | " {'id': 'https://openalex.org/I189268942', 'count1': 25005, 'count2': 25014},\n", 163 | " {'id': 
'https://openalex.org/I4210115097', 'count1': 24678, 'count2': 24696},\n", 164 | " {'id': 'https://openalex.org/I158438070', 'count1': 24397, 'count2': 24402},\n", 165 | " {'id': 'https://openalex.org/I80606768', 'count1': 22496, 'count2': 22516},\n", 166 | " {'id': 'https://openalex.org/I88155538', 'count1': 21926, 'count2': 21929},\n", 167 | " {'id': 'https://openalex.org/I2801357902', 'count1': 21347, 'count2': 21351},\n", 168 | " {'id': 'https://openalex.org/I55952717', 'count1': 20759, 'count2': 20781},\n", 169 | " {'id': 'https://openalex.org/I13134134', 'count1': 20519, 'count2': 19919},\n", 170 | " {'id': 'https://openalex.org/I2800562746', 'count1': 20002, 'count2': 19995},\n", 171 | " {'id': 'https://openalex.org/I53110688', 'count1': 19820, 'count2': 19835},\n", 172 | " {'id': 'https://openalex.org/I2960094004', 'count1': 19486, 'count2': 19479},\n", 173 | " {'id': 'https://openalex.org/I182083151', 'count1': 18710, 'count2': 18728},\n", 174 | " {'id': 'https://openalex.org/I10902133', 'count1': 18113, 'count2': 18134},\n", 175 | " {'id': 'https://openalex.org/I39147953', 'count1': 17024, 'count2': 17029},\n", 176 | " {'id': 'https://openalex.org/I2961216182', 'count1': 16932, 'count2': 16922},\n", 177 | " {'id': 'https://openalex.org/I11019714', 'count1': 16732, 'count2': 16732},\n", 178 | " {'id': 'https://openalex.org/I178450904', 'count1': 15500, 'count2': 15508},\n", 179 | " {'id': 'https://openalex.org/I191420491', 'count1': 15161, 'count2': 15168},\n", 180 | " {'id': 'https://openalex.org/I251424209', 'count1': 15059, 'count2': 15072},\n", 181 | " {'id': 'https://openalex.org/I111262870', 'count1': 14954, 'count2': 14962},\n", 182 | " {'id': 'https://openalex.org/I4210161852', 'count1': 14401, 'count2': 14409},\n", 183 | " {'id': 'https://openalex.org/I119635470', 'count1': 14168, 'count2': 14181},\n", 184 | " {'id': 'https://openalex.org/I50441567', 'count1': 13880, 'count2': 13747},\n", 185 | " {'id': 'https://openalex.org/I52354020', 'count1': 
13729, 'count2': 13743},\n", 186 | " {'id': 'https://openalex.org/I4210135003', 'count1': 13702, 'count2': 13709},\n", 187 | " {'id': 'https://openalex.org/I2802050225', 'count1': 13507, 'count2': 13509},\n", 188 | " {'id': 'https://openalex.org/I4210153139', 'count1': 13325, 'count2': 13331},\n", 189 | " {'id': 'https://openalex.org/I4210101691', 'count1': 12792, 'count2': 12798},\n", 190 | " {'id': 'https://openalex.org/I4210127641', 'count1': 12724, 'count2': 12741},\n", 191 | " {'id': 'https://openalex.org/I11932220', 'count1': 11809, 'count2': 11554},\n", 192 | " {'id': 'https://openalex.org/I2801795740', 'count1': 11780, 'count2': 11787},\n", 193 | " {'id': 'https://openalex.org/I175051016', 'count1': 11301, 'count2': 11312},\n", 194 | " {'id': 'https://openalex.org/I4210153460', 'count1': 10731, 'count2': 10740},\n", 195 | " {'id': 'https://openalex.org/I95013407', 'count1': 10652, 'count2': 10655},\n", 196 | " {'id': 'https://openalex.org/I15766328', 'count1': 10543, 'count2': 10557},\n", 197 | " {'id': 'https://openalex.org/I78880903', 'count1': 10080, 'count2': 10089},\n", 198 | " {'id': 'https://openalex.org/I8833935', 'count1': 9528, 'count2': 9527},\n", 199 | " {'id': 'https://openalex.org/I4210130807', 'count1': 9359, 'count2': 9359},\n", 200 | " {'id': 'https://openalex.org/I110594554', 'count1': 9356, 'count2': 9245},\n", 201 | " {'id': 'https://openalex.org/I4210147680', 'count1': 9301, 'count2': 9306},\n", 202 | " {'id': 'https://openalex.org/I4210130874', 'count1': 8808, 'count2': 8817},\n", 203 | " {'id': 'https://openalex.org/I4210129357', 'count1': 8784, 'count2': 8796},\n", 204 | " {'id': 'https://openalex.org/I4210130498', 'count1': 8714, 'count2': 8721},\n", 205 | " {'id': 'https://openalex.org/I4210105637', 'count1': 8234, 'count2': 8239},\n", 206 | " {'id': 'https://openalex.org/I4210118429', 'count1': 7971, 'count2': 7973},\n", 207 | " {'id': 'https://openalex.org/I168974976', 'count1': 7912, 'count2': 7922},\n", 208 | " {'id': 
'https://openalex.org/I4210094406', 'count1': 7776, 'count2': 7783},\n", 209 | " {'id': 'https://openalex.org/I4210114530', 'count1': 7635, 'count2': 7634},\n", 210 | " {'id': 'https://openalex.org/I138847295', 'count1': 7576, 'count2': 7582},\n", 211 | " {'id': 'https://openalex.org/I4210150677', 'count1': 7321, 'count2': 7327},\n", 212 | " {'id': 'https://openalex.org/I3123212020', 'count1': 7197, 'count2': 7195},\n", 213 | " {'id': 'https://openalex.org/I46176106', 'count1': 7179, 'count2': 7185},\n", 214 | " {'id': 'https://openalex.org/I4210099858', 'count1': 7050, 'count2': 7054},\n", 215 | " {'id': 'https://openalex.org/I3019010403', 'count1': 7048, 'count2': 7061},\n", 216 | " {'id': 'https://openalex.org/I4210123675', 'count1': 7031, 'count2': 7037},\n", 217 | " {'id': 'https://openalex.org/I4210146061', 'count1': 6833, 'count2': 6833},\n", 218 | " {'id': 'https://openalex.org/I4210120109', 'count1': 6542, 'count2': 6550},\n", 219 | " {'id': 'https://openalex.org/I4210090436', 'count1': 6466, 'count2': 6469},\n", 220 | " {'id': 'https://openalex.org/I4210116170', 'count1': 6465, 'count2': 6464},\n", 221 | " {'id': 'https://openalex.org/I1307323311', 'count1': 6442, 'count2': 6446},\n", 222 | " {'id': 'https://openalex.org/I5593406', 'count1': 6304, 'count2': 6280},\n", 223 | " {'id': 'https://openalex.org/I4210133994', 'count1': 6252, 'count2': 6259},\n", 224 | " {'id': 'https://openalex.org/I4210105802', 'count1': 6159, 'count2': 6159},\n", 225 | " {'id': 'https://openalex.org/I4210113665', 'count1': 5795, 'count2': 5801},\n", 226 | " {'id': 'https://openalex.org/I4210096311', 'count1': 5746, 'count2': 5748},\n", 227 | " {'id': 'https://openalex.org/I4210137412', 'count1': 5685, 'count2': 5691},\n", 228 | " {'id': 'https://openalex.org/I4210159146', 'count1': 5644, 'count2': 5647},\n", 229 | " {'id': 'https://openalex.org/I136040515', 'count1': 5524, 'count2': 5529},\n", 230 | " {'id': 'https://openalex.org/I4210137674', 'count1': 5454, 'count2': 
5462},\n", 231 | " {'id': 'https://openalex.org/I4210127649', 'count1': 5345, 'count2': 5347},\n", 232 | " {'id': 'https://openalex.org/I4210151127', 'count1': 5344, 'count2': 5355},\n", 233 | " {'id': 'https://openalex.org/I4210135032', 'count1': 5099, 'count2': 5103},\n", 234 | " {'id': 'https://openalex.org/I2800401438', 'count1': 5026, 'count2': 5020},\n", 235 | " {'id': 'https://openalex.org/I96580804', 'count1': 4985, 'count2': 4989},\n", 236 | " {'id': 'https://openalex.org/I4210107147', 'count1': 4959, 'count2': 4962},\n", 237 | " {'id': 'https://openalex.org/I4210157802', 'count1': 4872, 'count2': 4876},\n", 238 | " {'id': 'https://openalex.org/I2184545', 'count1': 4707, 'count2': 4710},\n", 239 | " {'id': 'https://openalex.org/I4210089289', 'count1': 4610, 'count2': 4615},\n", 240 | " {'id': 'https://openalex.org/I47686490', 'count1': 4605, 'count2': 4610},\n", 241 | " {'id': 'https://openalex.org/I4210085921', 'count1': 4401, 'count2': 4403},\n", 242 | " {'id': 'https://openalex.org/I68763199', 'count1': 4335, 'count2': 4343},\n", 243 | " {'id': 'https://openalex.org/I2799803557', 'count1': 4308, 'count2': 4305},\n", 244 | " {'id': 'https://openalex.org/I4210160192', 'count1': 4181, 'count2': 4180},\n", 245 | " {'id': 'https://openalex.org/I179630473', 'count1': 4178, 'count2': 4180},\n", 246 | " {'id': 'https://openalex.org/I904013037', 'count1': 4044, 'count2': 4054},\n", 247 | " {'id': 'https://openalex.org/I4210100188', 'count1': 4033, 'count2': 4037},\n", 248 | " {'id': 'https://openalex.org/I118091203', 'count1': 4012, 'count2': 4019},\n", 249 | " {'id': 'https://openalex.org/I2802353815', 'count1': 3948, 'count2': 3945},\n", 250 | " {'id': 'https://openalex.org/I4210086614', 'count1': 3931, 'count2': 3935},\n", 251 | " {'id': 'https://openalex.org/I4210095952', 'count1': 3918, 'count2': 3925},\n", 252 | " {'id': 'https://openalex.org/I4210108548', 'count1': 3844, 'count2': 3851},\n", 253 | " {'id': 'https://openalex.org/I3123915005', 'count1': 
3842, 'count2': 3849},\n", 254 | " {'id': 'https://openalex.org/I4210129578', 'count1': 3842, 'count2': 3843},\n", 255 | " {'id': 'https://openalex.org/I4210157481', 'count1': 3806, 'count2': 3808},\n", 256 | " {'id': 'https://openalex.org/I4210140856', 'count1': 3772, 'count2': 3773},\n", 257 | " {'id': 'https://openalex.org/I4210106702', 'count1': 3763, 'count2': 3767},\n", 258 | " {'id': 'https://openalex.org/I4210092123', 'count1': 3663, 'count2': 3665},\n", 259 | " {'id': 'https://openalex.org/I4210101151', 'count1': 3658, 'count2': 3660},\n", 260 | " {'id': 'https://openalex.org/I2809107791', 'count1': 3572, 'count2': 3576},\n", 261 | " {'id': 'https://openalex.org/I4210097920', 'count1': 3558, 'count2': 3559},\n", 262 | " {'id': 'https://openalex.org/I105140100', 'count1': 3546, 'count2': 3549},\n", 263 | " {'id': 'https://openalex.org/I4210104045', 'count1': 3545, 'count2': 3548},\n", 264 | " {'id': 'https://openalex.org/I4210126200', 'count1': 3543, 'count2': 3547},\n", 265 | " {'id': 'https://openalex.org/I4210152826', 'count1': 3474, 'count2': 3473},\n", 266 | " {'id': 'https://openalex.org/I4210106572', 'count1': 3343, 'count2': 3343},\n", 267 | " {'id': 'https://openalex.org/I4210087295', 'count1': 3311, 'count2': 3313},\n", 268 | " {'id': 'https://openalex.org/I4210158107', 'count1': 3301, 'count2': 3302},\n", 269 | " {'id': 'https://openalex.org/I4210129022', 'count1': 3290, 'count2': 3290},\n", 270 | " {'id': 'https://openalex.org/I4210135619', 'count1': 3228, 'count2': 3230}]" 271 | ] 272 | }, 273 | "execution_count": 25, 274 | "metadata": {}, 275 | "output_type": "execute_result" 276 | } 277 | ], 278 | "source": [ 279 | "data" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": {}, 286 | "outputs": [], 287 | "source": [] 288 | } 289 | ], 290 | "metadata": { 291 | "kernelspec": { 292 | "display_name": "venv", 293 | "language": "python", 294 | "name": "python3" 295 | }, 296 | "language_info": { 297 | 
"codemirror_mode": { 298 | "name": "ipython", 299 | "version": 3 300 | }, 301 | "file_extension": ".py", 302 | "mimetype": "text/x-python", 303 | "name": "python", 304 | "nbconvert_exporter": "python", 305 | "pygments_lexer": "ipython3", 306 | "version": "3.9.12" 307 | }, 308 | "orig_nbformat": 4, 309 | "vscode": { 310 | "interpreter": { 311 | "hash": "271691dbc4cdb85f541c883090ff5a004cbd8b9c207c2cfed84437fce4e65fdb" 312 | } 313 | } 314 | }, 315 | "nbformat": 4, 316 | "nbformat_minor": 2 317 | } 318 | -------------------------------------------------------------------------------- /notebooks/getting-started/README.md: -------------------------------------------------------------------------------- 1 | # Getting started 2 | Notebooks explaining the basics of querying the [OpenAlex API](https://docs.openalex.org/) 3 | -------------------------------------------------------------------------------- /notebooks/getting-started/api-webinar-apr2024/tutorial01.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 8 | " \n", 9 | " \"OpenAlex\n", 10 | " \n", 11 | "
" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# OpenAlex API Webinar - Tutorial 01 - Getting data about papers that a university's research has cited\n", 19 | "\n", 20 | "Jason Portenoy\n", 21 | "\n", 22 | "April 25, 2024" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Welcome to the Jupyter Notebook accompanying part 2 of the [OpenAlex](https://openalex.org) webinar on using the API!\n", 30 | "\n", 31 | "Video recording of the webinar: [https://youtu.be/DLKUgbw7FV4](https://youtu.be/DLKUgbw7FV4)\n", 32 | "\n", 33 | "We will be using Python code to get data about a university's research works, and the works referenced by those works.\n", 34 | "\n", 35 | "* The [OpenAlex webinars](https://openalex.org/webinars) page is where you can find information on all webinars including this one, with dates for upcoming webinars, and links to video recordings of previous webinars.\n", 36 | "* If you aren't familiar with Jupyter notebooks, [you can learn more here](https://jupyter.org/try-jupyter/notebooks/?path=notebooks/Intro.ipynb)\n", 37 | "* To learn all about the OpenAlex API: [visit the technical documentation](https://docs.openalex.org)\n", 38 | "\n", 39 | "And of course, if you aren't yet familiar with OpenAlex, you can go to [https://openalex.org](https://openalex.org) right now and start exploring!" 
40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## API Basics\n", 47 | "\n", 48 | "[Part 1 of this OpenAlex API webinar series](https://www.youtube.com/watch?v=ycoHc8flx8U) was a basic introduction to an Application Programming Interface, or API: what it is, how OpenAlex's API works, and why it might be useful.\n", 49 | "\n", 50 | "Now, we dive in, and use Python code to get data from OpenAlex about a university's research!\n", 51 | "\n", 52 | "As a reminder, an API allows a program to interact with the data, instead of you (a person). When a person is using the data, the User Interface (UI) is more appropriate:\n", 53 | "\n", 54 | "![UI vs API](../../../resources/img/ui_vs_api.svg)\n", 55 | "\n", 56 | "The **free version** of the OpenAlex API:\n", 57 | "* Does not require authentication\n", 58 | " * But, add your e-mail address for faster and more consistent responses: “mailto=you@example.com”\n", 59 | "* 100k calls per day\n", 60 | "* 10 calls per second\n", 61 | "* _We raise limits for free to support research projects when possible_\n", 62 | "\n", 63 | "The **premium version** of the OpenAlex API:\n", 64 | "* API key tied to your account\n", 65 | "* API limits raised to meet your needs\n", 66 | "* Additional filters to support hourly data updates\n", 67 | " * `from_created_date`\n", 68 | " * `from_updated_date`" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "## Let's dive in!\n", 76 | "\n", 77 | "The OpenAlex API is very **powerful**, but it is also very **easy to use**. There is no authentication required. All your code needs to do is make standard HTTP GET requests.\n", 78 | "\n", 79 | "While there are some [good libraries you can use to access the API](https://docs.openalex.org/how-to-use-the-api/api-overview#client-libraries), we're going to start very simply by making API calls directly. We will import just two small libraries to help us." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 1, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "# setup: import libraries\n", 89 | "import requests\n", 90 | "import csv" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 2, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# IMPORTANT: Set your email here in order to use the API's \"polite pool\"\n", 100 | "# See: https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication#the-polite-pool\n", 101 | "\n", 102 | "# e.g., mailto=\"youremail@example.com\"\n", 103 | "# Go ahead, fill it out:\n", 104 | "mailto = \"\"" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 3, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "Success!\n" 117 | ] 118 | } 119 | ], 120 | "source": [ 121 | "url = \"https://api.openalex.org/works\"\n", 122 | "if not mailto:\n", 123 | " raise ValueError(\"You need to fill in your email address in the `mailto` variable above!\")\n", 124 | "params = {\n", 125 | " \"mailto\": mailto,\n", 126 | " \"filter\": \"authorships.author.id:a5086928770\", # Kyle Demes's author ID\n", 127 | "}\n", 128 | "response = requests.get(url, params=params)\n", 129 | "\n", 130 | "# A \"200\" status code means that the API query was successful\n", 131 | "if response.status_code == 200:\n", 132 | " print(\"Success!\")" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 4, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "name": "stdout", 142 | "output_type": "stream", 143 | "text": [ 144 | "Number of results: 24\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "results = response.json()['results']\n", 150 | "print(f\"Number of results: {len(results)}\")" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "We've retrieved the papers from the API, 
using a simple query:\n", 158 | "\n", 159 | "[`https://api.openalex.org/works?filter=authorships.author.id:a5086928770`](https://api.openalex.org/works?filter=authorships.author.id:a5086928770)\n", 160 | "\n", 161 | "You can follow that link in the browser to get the same result. But with the data now accessible by our code, we can save a CSV file with whichever fields we want.\n", 162 | "\n", 163 | "Instructions for how to write CSV files with Python are [here](https://docs.python.org/3/library/csv.html). Following these instructions:" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 5, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "# The Python documentation shows how to write data to a CSV file:\n", 173 | "# https://docs.python.org/3/library/csv.html\n", 174 | "with open('kdemes_works.csv', 'w', newline='') as f:\n", 175 | " # initialize the csv writer for this file\n", 176 | " writer = csv.writer(f)\n", 177 | "\n", 178 | " # write a header row at the top\n", 179 | " header = ['id', 'doi', 'publication_year', 'title']\n", 180 | " writer.writerow(header)\n", 181 | "\n", 182 | " # loop through the works and write each row\n", 183 | " for item in results:\n", 184 | " this_id = item['id']\n", 185 | " this_doi = item['doi']\n", 186 | " this_publication_year = item['publication_year']\n", 187 | " this_title = item['title']\n", 188 | " writer.writerow([this_id, this_doi, this_publication_year, this_title])" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "### University (Institution)\n", 196 | "\n", 197 | "Next, we'll try something a little more advanced. We're going to get the works from a certain institution, and then retrieve all of the references from those works (the works cited by the university's works).\n", 198 | "\n", 199 | "We'll start by collecting the university's works. 
We'll limit to just recently published papers so it doesn't take too long, but you could get all of the papers just as easily, if you're willing to wait.\n", 200 | "\n", 201 | "#### Cursor paging\n", 202 | "\n", 203 | "Each API query will only return a limited subset of the overall data, in what is known as a page. We need to make multiple queries to \"page through\" all of the data, collecting the data for each API query. We use a method called [\"cursor paging\"](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging) to do this." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 6, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "Done paging through results. We made 44 API queries, and retrieved 4259 results.\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "url = \"https://api.openalex.org/works\"\n", 221 | "if not mailto:\n", 222 | " raise ValueError(\"You need to fill in your email address in the `mailto` variable above!\")\n", 223 | "params = {\n", 224 | " \"mailto\": mailto,\n", 225 | " \"filter\": f\"authorships.institutions.lineage:i129801699,publication_year:>2022\", # University of Tasmania\n", 226 | " \"per-page\": 100,\n", 227 | " \"select\": \"id,doi,publication_year,title,primary_location,authorships,topics\",\n", 228 | "}\n", 229 | "\n", 230 | "# Initialize cursor\n", 231 | "cursor = \"*\"\n", 232 | "\n", 233 | "# Initialize an empty list to store our results as we get them\n", 234 | "all_results = []\n", 235 | "count_api_queries = 0\n", 236 | "\n", 237 | "# Loop through pages\n", 238 | "while cursor:\n", 239 | " params[\"cursor\"] = cursor\n", 240 | " response = requests.get(url, params=params)\n", 241 | " if response.status_code != 200:\n", 242 | " print(\"Oh no! Something went wrong during the live demo! 
How embarrassing!\")\n", 243 | " break\n", 244 | " this_page_results = response.json()['results']\n", 245 | " for result in this_page_results:\n", 246 | " # Store these results in the list we created before the loop we are currently in\n", 247 | " all_results.append(result)\n", 248 | " count_api_queries += 1\n", 249 | "\n", 250 | " # Update cursor, using the response's `next_cursor` metadata field\n", 251 | " cursor = response.json()['meta']['next_cursor']\n", 252 | "print(f\"Done paging through results. We made {count_api_queries} API queries, and retrieved {len(all_results)} results.\")" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "Our next step is to loop through each work collected above, and collect all of the referenced works." 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 7, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "# Let's make our cursor paging code above into a function, so we can reuse it easily.\n", 269 | "# This code just defines the function. We'll need to call the function later on to actually get it to run.\n", 270 | "def api_query_page_results(url, params):\n", 271 | " # Initialize cursor\n", 272 | " cursor = \"*\"\n", 273 | "\n", 274 | " # Loop through pages\n", 275 | " all_results = []\n", 276 | " while cursor:\n", 277 | " params[\"cursor\"] = cursor\n", 278 | " response = requests.get(url, params=params)\n", 279 | " if response.status_code != 200:\n", 280 | " print(\"Oh no! Something went wrong during the live demo! 
How embarrassing!\")\n", 281 | " response.raise_for_status()\n", 282 | " this_page_results = response.json()['results']\n", 283 | " for result in this_page_results:\n", 284 | " all_results.append(result)\n", 285 | "\n", 286 | " # Update cursor\n", 287 | " cursor = response.json()['meta']['next_cursor']\n", 288 | " return all_results" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 8, 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "name": "stdout", 298 | "output_type": "stream", 299 | "text": [ 300 | "Done collecting references. We retrieved 9157 works.\n" 301 | ] 302 | } 303 | ], 304 | "source": [ 305 | "# collect all of the works referenced by the works found above\n", 306 | "# This will be a dictionary mapping Citing Paper -> List of Cited Papers\n", 307 | "# We start by initializing an empty dictionary\n", 308 | "all_references = {}\n", 309 | "\n", 310 | "# Let's limit the results to loop through to only n=100, because this is a demo, and we don't want to wait for too long\n", 311 | "works_to_collect = all_results[:100]\n", 312 | "\n", 313 | "# We will keep track of the number of works retrieved from the API\n", 314 | "count_works_retrieved = 0\n", 315 | "\n", 316 | "for work in works_to_collect:\n", 317 | " # Get references for this work (i.e., works that have been cited by this work)\n", 318 | " this_work_id = work['id']\n", 319 | " url = \"https://api.openalex.org/works\"\n", 320 | " if not mailto:\n", 321 | " raise ValueError(\"You need to fill in your email address in the `mailto` variable above!\")\n", 322 | " params = {\n", 323 | " \"mailto\": mailto,\n", 324 | " \"filter\": f\"cited_by:{this_work_id}\",\n", 325 | " \"per-page\": 100,\n", 326 | " \"select\": \"id,doi,publication_year,title,primary_location,authorships,topics\",\n", 327 | " }\n", 328 | " this_work_references = api_query_page_results(url, params=params)\n", 329 | " # put this data into our dictionary:\n", 330 | " # The key for the dictionary is the 
citing work_id, and the value is the list of referenced Works\n", 331 | " all_references[this_work_id] = this_work_references\n", 332 | " count_works_retrieved += len(this_work_references)\n", 333 | "print(f\"Done collecting references. We retrieved {count_works_retrieved} works.\")" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "Now we have collected the referenced papers for each of the university's papers we collected in the first step. The next step is to save the data to a CSV file." 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 9, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "# Function to shorten the OpenAlex ID to make it better for display\n", 350 | "def make_short_id(long_id):\n", 351 | " short_id = long_id.replace(\"https://openalex.org/\", \"\")\n", 352 | " return short_id" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 10, 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "# Write each citing -> cited pair of works to a CSV file\n", 362 | "output_filename = \"tasmania_paper_references.csv\"\n", 363 | "with open(output_filename, 'w', newline='') as f:\n", 364 | " # initialize the csv writer for this file\n", 365 | " writer = csv.writer(f)\n", 366 | "\n", 367 | " # write a header row at the top\n", 368 | " header = ['citing_paper_id', 'cited_paper_id']\n", 369 | " writer.writerow(header)\n", 370 | "\n", 371 | " # loop through each citation, writing one row for each citation\n", 372 | " for citing_id, cited_works in all_references.items():\n", 373 | " citing_id_short = make_short_id(citing_id)\n", 374 | " for cited_work in cited_works:\n", 375 | " cited_id_short = make_short_id(cited_work['id'])\n", 376 | " writer.writerow([citing_id_short, cited_id_short])" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 11, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 
| "# We can keep track of how many times each work has been cited.\n", 386 | "# One way to do this is to use Python's collections.Counter\n", 387 | "from collections import Counter\n", 388 | "citation_counts = Counter()\n", 389 | "for citing_id, cited_works in all_references.items():\n", 390 | " citation_counts.update([w['id'] for w in cited_works])" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "We can also save another CSV file with detailed metadata about each of the referenced papers we found. We will include information about the source (journal), and the topics. But you can build out this code to include any information you like (just make sure you are collecting it from the API when you specify the `select` parameter in your API requests above)." 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 12, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "output_filename = \"tasmania_references_paper_metadata.csv\"\n", 407 | "seen_work_ids = set()\n", 408 | "with open(output_filename, 'w', newline='') as f:\n", 409 | " # initialize the csv writer for this file\n", 410 | " writer = csv.writer(f)\n", 411 | "\n", 412 | " # write a header row at the top\n", 413 | " header = ['work_id', 'title', 'doi', 'utasmania_citation_count', \n", 414 | " 'source_id', 'source_issn', 'source_display_name', \n", 415 | " 'primary_topic_id', 'primary_topic_display_name']\n", 416 | " writer.writerow(header)\n", 417 | "\n", 418 | " for cited_works in all_references.values():\n", 419 | " for w in cited_works:\n", 420 | " work_id = w['id']\n", 421 | " work_id_short = make_short_id(work_id)\n", 422 | " title = w['title']\n", 423 | " if work_id not in seen_work_ids and title != 'Deleted Work':\n", 424 | " # We will write a row to the CSV file for this work\n", 425 | " doi = w['doi']\n", 426 | " utasmania_citation_count = citation_counts[work_id]\n", 427 | "\n", 428 | " # Get source 
(journal)\n", 429 | " try:\n", 430 | " source = w['primary_location']['source']\n", 431 | " source_id = source['id']\n", 432 | " source_id_short = make_short_id(source_id)\n", 433 | " source_issn = source['issn_l']\n", 434 | " source_display_name = source['display_name']\n", 435 | " except (KeyError, TypeError):\n", 436 | " source_id_short = None\n", 437 | " source_issn = None\n", 438 | " source_display_name = None\n", 439 | "\n", 440 | " # Get primary_topic\n", 441 | " try:\n", 442 | " primary_topic = w['topics'][0]\n", 443 | " primary_topic_id = primary_topic['id']\n", 444 | " primary_topic_id_short = make_short_id(primary_topic_id)\n", 445 | " primary_topic_display_name = primary_topic['display_name']\n", 446 | " except (IndexError, KeyError, TypeError):\n", 447 | " primary_topic_id_short = None\n", 448 | " primary_topic_display_name = None\n", 449 | " \n", 450 | " # Write this work's row to the CSV file\n", 451 | " writer.writerow([work_id_short, title, doi, \n", 452 | " utasmania_citation_count, source_id_short, \n", 453 | " source_issn, source_display_name, \n", 454 | " primary_topic_id_short, primary_topic_display_name])\n", 455 | " \n", 456 | " seen_work_ids.add(work_id)\n" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "Now we have two CSV files:\n", 464 | "\n", 465 | "* `tasmania_paper_references.csv` has a two-column edge list of citing work -> cited work\n", 466 | "* `tasmania_references_paper_metadata.csv` has metadata about each cited work\n", 467 | "\n", 468 | "You could open these files in a spreadsheet program like Excel to do additional analysis, or continue working with the data in Python." 
469 | ] 470 | } 471 | ], 472 | "metadata": { 473 | "kernelspec": { 474 | "display_name": "venv", 475 | "language": "python", 476 | "name": "python3" 477 | }, 478 | "language_info": { 479 | "codemirror_mode": { 480 | "name": "ipython", 481 | "version": 3 482 | }, 483 | "file_extension": ".py", 484 | "mimetype": "text/x-python", 485 | "name": "python", 486 | "nbconvert_exporter": "python", 487 | "pygments_lexer": "ipython3", 488 | "version": "3.10.9" 489 | } 490 | }, 491 | "nbformat": 4, 492 | "nbformat_minor": 2 493 | } 494 | -------------------------------------------------------------------------------- /notebooks/getting-started/get-random-entity.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "230c97f7-1471-4c29-bf95-decb29c2a2ae", 6 | "metadata": {}, 7 | "source": [ 8 | "
\n", 9 | " \n", 10 | " \"OpenAlex\n", 11 | " \n", 12 | "
" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "7796f7f3-4cc9-4951-a9eb-bb2b6d3949cb", 18 | "metadata": {}, 19 | "source": [ 20 | "# 🔀 That's so random!\n", 21 | "\n", 22 | "Want to explore the OpenAlex API in a fun nerdy way? \n", 23 | "\n", 24 | "Simply add \"_/random_\" (-> [docs](https://docs.openalex.org/how-to-use-the-api/get-single-entities#random-entity)) to any of the API endpoints like this:\n", 25 | "\n", 26 | "* for a random author: http://api.openalex.org/authors/random\n", 27 | "* for a random concept: http://api.openalex.org/concepts/random\n", 28 | "* for a random institution: http://api.openalex.org/institutions/random\n", 29 | "* for a random venue: http://api.openalex.org/venues/random\n", 30 | "* for a random work: http://api.openalex.org/works/random\n", 31 | "\n", 32 | "Each time you call one of these URLs you'll get a different entity. \n", 33 | "`random` lets you dive right into the data and get a feel for what attributes are available for an entity type and how they are represented in JSON.\n", 34 | "***" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "id": "a3ff3c1f-acae-4c1b-b674-2c23a714d4af", 40 | "metadata": {}, 41 | "source": [ 42 | "Let's see for ourselves and query the OpenAlex API for a random work:" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 1, 48 | "id": "2b325b3d-450b-4c22-a31f-79c4b1d2d52d", 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "{'abstract_inverted_index': None,\n", 56 | " 'alternate_host_venues': [],\n", 57 | " 'authorships': [{'author': {'display_name': 'Gabriel Germain',\n", 58 | " 'id': 'https://openalex.org/A2794418983',\n", 59 | " 'orcid': None},\n", 60 | " 'author_position': 'first',\n", 61 | " 'institutions': [],\n", 62 | " 'raw_affiliation_string': None}],\n", 63 | " 'biblio': {'first_page': '223',\n", 64 | " 'issue': '384',\n", 65 | " 'last_page': '224',\n", 66 | " 'volume': 
'81'},\n", 67 | " 'cited_by_api_url': 'https://api.openalex.org/works?filter=cites:W2403480063',\n", 68 | " 'cited_by_count': 0,\n", 69 | " 'concepts': [{'display_name': 'Art',\n", 70 | " 'id': 'https://openalex.org/C142362112',\n", 71 | " 'level': 0,\n", 72 | " 'score': '0.406585',\n", 73 | " 'wikidata': 'https://www.wikidata.org/wiki/Q735'},\n", 74 | " {'display_name': 'Philosophy',\n", 75 | " 'id': 'https://openalex.org/C138885662',\n", 76 | " 'level': 0,\n", 77 | " 'score': '0.363935',\n", 78 | " 'wikidata': 'https://www.wikidata.org/wiki/Q5891'}],\n", 79 | " 'counts_by_year': [],\n", 80 | " 'created_date': '2016-06-24',\n", 81 | " 'display_name': '12. Arrighetti (Graziano). Cosmologia mitica di Omero e '\n", 82 | " 'Esiodo (Extr. des « Studi Classici e Orientali », v. XV)',\n", 83 | " 'doi': None,\n", 84 | " 'host_venue': {'display_name': 'Revue des Études Grecques',\n", 85 | " 'id': None,\n", 86 | " 'is_oa': None,\n", 87 | " 'issn': None,\n", 88 | " 'issn_l': None,\n", 89 | " 'license': None,\n", 90 | " 'publisher': 'Persée - Portail des revues scientifiques en SHS',\n", 91 | " 'type': None,\n", 92 | " 'url': 'https://www.persee.fr/doc/reg_0035-2039_1968_num_81_384_1022_t1_0223_0000_3',\n", 93 | " 'version': None},\n", 94 | " 'id': 'https://openalex.org/W2403480063',\n", 95 | " 'ids': {'mag': '2403480063', 'openalex': 'https://openalex.org/W2403480063'},\n", 96 | " 'is_paratext': None,\n", 97 | " 'is_retracted': False,\n", 98 | " 'mesh': [],\n", 99 | " 'open_access': {'is_oa': True, 'oa_status': None, 'oa_url': None},\n", 100 | " 'publication_date': '1968-01-01',\n", 101 | " 'publication_year': 1968,\n", 102 | " 'referenced_works': [],\n", 103 | " 'related_works': [],\n", 104 | " 'title': '12. Arrighetti (Graziano). Cosmologia mitica di Omero e Esiodo '\n", 105 | " '(Extr. des « Studi Classici e Orientali », v. 
XV)',\n", 106 | " 'type': None,\n", 107 | " 'updated_date': '2021-11-04'}\n" 108 | ] 109 | } 110 | ], 111 | "source": [ 112 | "RANDOM_WORKS_URL = 'http://api.openalex.org/works/random'\n", 113 | "\n", 114 | "import requests\n", 115 | "response = requests.get(url=RANDOM_WORKS_URL)\n", 116 | "response.raise_for_status()\n", 117 | "random_work = response.json()\n", 118 | "\n", 119 | "import pprint\n", 120 | "pprint.pprint(random_work)" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "id": "5e620a43-cfac-489e-8e48-642df57f0990", 126 | "metadata": {}, 127 | "source": [ 128 | "*** \n", 129 | "\n", 130 | "Run the notebook again (or if you are using the URL in your browser hit refresh) and voilà a new one!\n", 131 | "\n", 132 | "Happy exploring! 😎" 133 | ] 134 | } 135 | ], 136 | "metadata": { 137 | "kernelspec": { 138 | "display_name": "Python 3 (ipykernel)", 139 | "language": "python", 140 | "name": "python3" 141 | }, 142 | "language_info": { 143 | "codemirror_mode": { 144 | "name": "ipython", 145 | "version": 3 146 | }, 147 | "file_extension": ".py", 148 | "mimetype": "text/x-python", 149 | "name": "python", 150 | "nbconvert_exporter": "python", 151 | "pygments_lexer": "ipython3", 152 | "version": "3.8.10" 153 | } 154 | }, 155 | "nbformat": 4, 156 | "nbformat_minor": 5 157 | } 158 | -------------------------------------------------------------------------------- /notebooks/getting-started/paging.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "29720234-5a2d-48b7-82ee-eb1574e6275b", 6 | "metadata": {}, 7 | "source": [ 8 | "
\n", 9 | " \n", 10 | " \"OpenAlex\n", 11 | " \n", 12 | "
" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "f133f73b-c2df-4f5b-ab14-2ea668d47052", 18 | "metadata": {}, 19 | "source": [ 20 | "# Turn the page\n", 21 | "❓ Let's say we query OpenAlex for a [list of entities](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities). By default the API only gives us the first 25 results of the list. Why is that ❓\n", 22 | "\n", 23 | ">Just like books split large amounts of text and distribute it onto **pages**, the OpenAlex API does the same with a (potentially massive) list of entities.\n", 24 | "\n", 25 | "It makes the data more manageable for both sides: We get small amounts of data that fit into our computer's memory in a reasonable amount of time, while the OpenAlex API needs to process less data at once and can serve more requests to more users.\n", 26 | "\n", 27 | "\n", 28 | "👉 So, coming back to our question of why only 25 results: That is only the first page with a partial list of results! \n", 29 | "The API even tells us this. Every page includes a **meta section** with the following information:\n", 30 | "\n", 31 | "
\n", 32 | " \"meta\n", 33 | "
\n", 34 | "\n", 35 | "\n", 36 | "In order to get the complete list, we need to \"_leaf through_\" all the pages. But how do we do that? \n", 37 | "There are two techniques the OpenAlex API offers: **_🔢 basic paging_** and **_↪️ cursor paging_**. Let's get to know them!\n", 38 | "\n", 39 | "
\n", 40 | " 💡 Use the Polite Pool
\n", 41 | "While it is always a good idea to use the polite pool, this holds especially true for paging. The polite pool has much faster and more consistent response times, and because paging makes many requests, these gains add up and speed up your application!\n", 42 | "
" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "a9d7fae8-4ee2-40fa-9db1-dbd1002811f6", 48 | "metadata": {}, 49 | "source": [ 50 | "
\n", 51 | "\n", 52 | "## 🔢 Basic paging\n", 53 | "[Basic paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#basic-paging) is the simplest form of paging and works like this: \n", 54 | "* All pages are numbered **from 1 to n** \n", 55 | "*We can determine n by dividing meta's `count` by `per_page` and rounding the result up to the next integer.* \n", 56 | "* To request one of the pages, we add the **`page` parameter** to the URL and put the page number as its value, \n", 57 | "e.g. to request page 2, we use https://api.openalex.org/works?filter=author.id:A5048491430&page=2\n", 58 | "\n", 59 | "### Within limits\n", 60 | "
\n", 61 | " ⚠️ While basic paging is easy to use, it only works for the first 10,000 results of any list.
\n", 62 | " If we want to see more than 10,000 results, we'll need to use cursor paging.\n", 63 | "
\n", 64 | "\n", 65 | "### Example\n", 66 | "Let's look at an example, where we want to retrieve a complete list of all publications from an author and print their OpenAlex IDs. \n", 67 | "Given the OpenAlex ID for the author `A5048491430` the URL would be: https://api.openalex.org/works?filter=author.id:A5048491430.\n", 68 | "\n", 69 | "To loop through all pages, we start by setting `page=1` and then repeating:\n", 70 | "* request the specified page by adding the `page` parameter to the URL\n", 71 | "* print all of the OpenAlex IDs from the publications on this page in blocks of five\n", 72 | "* update `page` parameter to `page`+1\n", 73 | "\n", 74 | "until *either* there are no more results on the requested page *or* the next request would exceed 10,000 results." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 1, 80 | "id": "98e2d323-c7e2-4003-9e84-1a70551756c7", 81 | "metadata": {}, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "\n", 88 | "https://api.openalex.org/works?filter=author.id:A5048491430&page=1\n", 89 | "W2046766973\tW2741809807\tW2045657963\tW1572136682\tW2066415719\n", 90 | "W2170531319\tW1963524534\tW1553564559\tW2003014790\tW2051771537\n", 91 | "W1987881751\tW2980172586\tW2095083909\tW1528782725\tW2102613218\n", 92 | "W4235038322\tW2014140050\tW2109312864\tW3071882161\tW1501540670\n", 93 | "W2103827239\tW4229010617\tW2133737815\tW4366077396\tW2171848392\n", 94 | "\n", 95 | "https://api.openalex.org/works?filter=author.id:A5048491430&page=2\n", 96 | "W4245410681\tW3021154342\tW2017292130\tW2168771768\tW4213202391\n", 97 | "W4236031980\tW1945323029\tW2103382090\tW2105695765\tW3084168212\n", 98 | "W4211010643\tW1934573562\tW2050143895\tW2065622609\tW2110180658\n", 99 | "W2941875476\tW4237216357\tW4242907897\tW4244183537\tW4247478427\n", 100 | "W4287670050\tW104609242\tW1972136887\tW2005148091\tW2010883332\n", 101 | "\n", 102 | 
"https://api.openalex.org/works?filter=author.id:A5048491430&page=3\n", 103 | "W2108112433\tW2154768595\tW2255028491\tW2284153834\tW2307679124\n", 104 | "W2398849157\tW2402184614\tW2414739039\tW2613086963\tW2727815292\n", 105 | "W2740744046\tW2949915600\tW2951362513\tW2979437137\tW3084303366\n", 106 | "W3168937413\tW3206844309\tW4221043181\tW4230863633\tW4237614390\n", 107 | "W4240735862\tW4244937397\tW4246220990\tW4252159547\tW4252662598\n", 108 | "\n", 109 | "https://api.openalex.org/works?filter=author.id:A5048491430&page=4\n", 110 | "W4288680697\tW4299928665\tW4301303362\t" 111 | ] 112 | } 113 | ], 114 | "source": [ 115 | "import requests\n", 116 | "\n", 117 | "# url with a placeholder for page number\n", 118 | "example_url_with_page = 'https://api.openalex.org/works?filter=author.id:A5048491430&page={}'\n", 119 | "\n", 120 | "page = 1\n", 121 | "has_more_pages = True\n", 122 | "fewer_than_10k_results = True\n", 123 | "\n", 124 | "# loop through pages\n", 125 | "while has_more_pages and fewer_than_10k_results:\n", 126 | " \n", 127 | " # set page value and request page from OpenAlex\n", 128 | " url = example_url_with_page.format(page)\n", 129 | " print('\\n' + url)\n", 130 | " page_with_results = requests.get(url).json()\n", 131 | " \n", 132 | " # loop through partial list of results\n", 133 | " results = page_with_results['results']\n", 134 | " for i,work in enumerate(results):\n", 135 | " openalex_id = work['id'].replace(\"https://openalex.org/\", \"\")\n", 136 | " print(openalex_id, end='\\t' if (i+1)%5!=0 else '\\n')\n", 137 | "\n", 138 | " # next page\n", 139 | " page += 1\n", 140 | " \n", 141 | " # end loop when either there are no more results on the requested page \n", 142 | " # or the next request would exceed 10,000 results\n", 143 | " per_page = page_with_results['meta']['per_page']\n", 144 | " has_more_pages = len(results) == per_page\n", 145 | " fewer_than_10k_results = per_page * page <= 10000" 146 | ] 147 | }, 148 | { 149 | "cell_type": 
"markdown", 150 | "id": "8cc5ccb4-a896-48ed-9adc-1df242028b34", 151 | "metadata": {}, 152 | "source": [ 153 | "
\n", 154 | "\n", 155 | "## ↪️ Cursor paging\n", 156 | "[Cursor paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging) is a bit more complicated than basic paging, but it allows us to access as many records as we like. \n", 157 | "\n", 158 | "\n", 159 | "To use cursor paging,\n", 160 | "* we add the **`cursor` parameter** with a start value of `*` to our first query, \n", 161 | "e.g. https://api.openalex.org/works?filter=author.id:A5048491430&cursor=*\n", 162 | "\n", 163 | "* The response to our query will now include a `next_cursor` value in the response's `meta` section. \n", 164 | "To retrieve the next page, we **copy `meta.next_cursor`** into the cursor field of our URL.\n", 165 | "\n", 166 | "* To get all the results, we keep repeating the second step until `meta.next_cursor` is null.\n", 167 | "\n", 168 | "
\n", 169 | " \"cursor\n", 170 | "
\n", 171 | "\n", 172 | "### With great power comes great responsibility\n", 173 | "Cursor paging is very powerful, since there is no limit on the number of pages you can request. Please use it responsibly!\n", 174 | "
\n", 175 | " 🚫 Don't use cursor paging to download a very large or even the whole dataset\n", 176 | " \n", 180 | "\n", 181 | " Instead, download everything at once, using the data snapshot. It's free, easy, fast, and you get all the results in the same format you'd get from the API.\n", 182 | "
\n", 183 | "\n", 184 | "### Example\n", 185 | "Let's look at the same example as before, where we want to retrieve a complete list of all publications from an author and print their OpenAlex IDs. \n", 186 | "\n", 187 | "To loop through all pages, we start by setting `cursor=*` and then repeating:\n", 188 | "* request the specified page by adding the `cursor` parameter to the URL\n", 189 | "* print all of the OpenAlex IDs from the publications on this page in blocks of five\n", 190 | "* update `cursor` parameter to `meta.next_cursor`\n", 191 | "\n", 192 | "until `meta.next_cursor` is null and the list of results is empty." 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 2, 198 | "id": "0bd272c4-037b-4587-8b9e-2b8fa0a1b269", 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "name": "stdout", 203 | "output_type": "stream", 204 | "text": [ 205 | "\n", 206 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=*\n", 207 | "W2046766973\tW2741809807\tW2045657963\tW1572136682\tW2066415719\n", 208 | "W2170531319\tW1963524534\tW1553564559\tW2003014790\tW2051771537\n", 209 | "W1987881751\tW2980172586\tW2095083909\tW1528782725\tW2102613218\n", 210 | "W4235038322\tW2014140050\tW2109312864\tW3071882161\tW1501540670\n", 211 | "W2103827239\tW4229010617\tW2133737815\tW4366077396\tW2171848392\n", 212 | "\n", 213 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=Ils3LCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzIxNzE4NDgzOTInXSI=\n", 214 | "W4245410681\tW3021154342\tW2017292130\tW2168771768\tW4213202391\n", 215 | "W4236031980\tW1945323029\tW2103382090\tW2105695765\tW3084168212\n", 216 | "W4211010643\tW1934573562\tW2050143895\tW2065622609\tW2110180658\n", 217 | "W2941875476\tW4237216357\tW4242907897\tW4244183537\tW4247478427\n", 218 | "W4287670050\tW104609242\tW1972136887\tW2005148091\tW2010883332\n", 219 | "\n", 220 | 
"https://api.openalex.org/works?filter=author.id:A5048491430&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzIwMTA4ODMzMzInXSI=\n", 221 | "W2108112433\tW2154768595\tW2255028491\tW2284153834\tW2307679124\n", 222 | "W2398849157\tW2402184614\tW2414739039\tW2613086963\tW2727815292\n", 223 | "W2740744046\tW2949915600\tW2951362513\tW2979437137\tW3084303366\n", 224 | "W3168937413\tW3206844309\tW4221043181\tW4230863633\tW4237614390\n", 225 | "W4240735862\tW4244937397\tW4246220990\tW4252159547\tW4252662598\n", 226 | "\n", 227 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzQyNTI2NjI1OTgnXSI=\n", 228 | "W4288680697\tW4299928665\tW4301303362\t\n", 229 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzQzMDEzMDMzNjInXSI=\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "import requests\n", 235 | "\n", 236 | "# url with a placeholder for cursor\n", 237 | "example_url_with_cursor = 'https://api.openalex.org/works?filter=author.id:A5048491430&cursor={}'\n", 238 | "\n", 239 | "cursor = '*'\n", 240 | "\n", 241 | "# loop through pages\n", 242 | "while cursor:\n", 243 | " \n", 244 | " # set cursor value and request page from OpenAlex\n", 245 | " url = example_url_with_cursor.format(cursor)\n", 246 | " print(\"\\n\" + url)\n", 247 | " page_with_results = requests.get(url).json()\n", 248 | " \n", 249 | " # loop through partial list of results\n", 250 | " results = page_with_results['results']\n", 251 | " for i,work in enumerate(results):\n", 252 | " openalex_id = work['id'].replace(\"https://openalex.org/\", \"\")\n", 253 | " print(openalex_id, end='\\t' if (i+1)%5!=0 else '\\n')\n", 254 | "\n", 255 | " # update cursor to meta.next_cursor\n", 256 | " cursor = page_with_results['meta']['next_cursor']" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "id": "b3ea64d9-e5ab-4396-bbf3-41ec9318ff69", 262 | "metadata": {}, 263 | "source": [ 264 | "
\n", 265 | "\n", 266 | "What we covered in this notebook is quite technical and might be a bit much for beginners to take in, so\n", 267 | "please don't worry too much if you need to reread it or need additional clarification. \n", 268 | "The main concept to take away is that \n", 269 | "* the OpenAlex API distributes result lists into smaller chunks called pages \n", 270 | "* and thus to retrieve a complete result list, we have to manually or programmatically \"leaf\" through these pages.\n", 271 | "\n", 272 | "Happy paging! 😎" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "id": "25ec853d", 278 | "metadata": {}, 279 | "source": [] 280 | } 281 | ], 282 | "metadata": { 283 | "kernelspec": { 284 | "display_name": "Python 3 (ipykernel)", 285 | "language": "python", 286 | "name": "python3" 287 | }, 288 | "language_info": { 289 | "codemirror_mode": { 290 | "name": "ipython", 291 | "version": 3 292 | }, 293 | "file_extension": ".py", 294 | "mimetype": "text/x-python", 295 | "name": "python", 296 | "nbconvert_exporter": "python", 297 | "pygments_lexer": "ipython3", 298 | "version": "3.10.9" 299 | } 300 | }, 301 | "nbformat": 4, 302 | "nbformat_minor": 5 303 | } 304 | -------------------------------------------------------------------------------- /notebooks/getting-started/premium.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "
\n", 8 | " \n", 9 | " \"OpenAlex\n", 10 | " \n", 11 | "
" 12 | ] 13 | }, 14 | { 15 | "attachments": {}, 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Getting started with OpenAlex Premium\n", 20 | "\n", 21 | "In this tutorial, we're going to learn how to get started using [OpenAlex Premium](https://openalex.org/pricing). This subscription service provides some features beyond the free services. One of the most important of these features is **faster updates,** allowing you to keep your data fully synced with OpenAlex.\n", 22 | "\n", 23 | "The way we do this is by using the `from_created_date` [(doc)](https://docs.openalex.org/api-entities/works/filter-works#from_created_date) or the `from_updated_date` [(doc)](https://docs.openalex.org/api-entities/works/filter-works#from_updated_date) filters. These filters allow you to get the new works you need to keep your data updated, and they require a Premium API Key to work.\n", 24 | "\n", 25 | "We're going to set up the code to poll the OpenAlex API for newly updated works on a regular basis, once per day." 26 | ] 27 | }, 28 | { 29 | "attachments": {}, 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "First, we need to get the API key you received by signing up for OpenAlex Premium. (Don't have a key yet? [Contact us right now to learn more about getting premium!](https://openalex.org/pricing))\n", 34 | "\n", 35 | "We'll store our API key in a variable called `my_api_key`. There are several ways to do this. 
You could just put it into the code, but since it is sensitive information that we don't want others to see, we're going to get it from an [environment variable, which we'll store in a `.env` file.](https://towardsdatascience.com/the-quick-guide-to-using-environment-variables-in-python-d4ec9291619e)\n", 36 | "\n", 37 | "This is just a text file named `.env` that looks like this:\n", 38 | "```\n", 39 | "API_KEY=\n", 40 | "```\n", 41 | "Put your OpenAlex Premium API Key after the `=` sign.\n" 42 | ] 43 | }, 44 | { 45 | "attachments": {}, 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "Now, to set our `my_api_key` variable, we'll load the `.env` file using the [`python-dotenv`](https://pypi.org/project/python-dotenv/) library, then get the variable using the `os.getenv()` function." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 1, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "API key is set!\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "import os\n", 67 | "from dotenv import load_dotenv\n", 68 | "\n", 69 | "load_dotenv('.env')\n", 70 | "my_api_key = os.getenv('API_KEY')\n", 71 | "if my_api_key is None:\n", 72 | " print(\"No API key found!!!\")\n", 73 | "else:\n", 74 | " print(\"API key is set!\")" 75 | ] 76 | }, 77 | { 78 | "attachments": {}, 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "Our plan is to get all of the works that have been updated in the last four hours. So let's construct a URL that will request that information from the API:" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 4, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "import requests\n", 92 | "from datetime import datetime, timedelta" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "To use the API key, we have two options. 
The first option is to include it in the URL, as an `api_key` parameter:" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 9, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "Our formatted date-time string looks like this: 2023-11-07T19:22:46.848349\n", 112 | "Requesting newly updated works, including the API key as a URL query parameter...\n" 113 | ] 114 | }, 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "Retrieved 25 works, out of 1967482 works updated since 2023-11-07T19:22:46.848349\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "four_hours_ago = datetime.utcnow() - timedelta(hours=4)\n", 125 | "four_hours_ago_formatted_string = four_hours_ago.isoformat()\n", 126 | "print(f\"Our formatted date-time string looks like this: {four_hours_ago_formatted_string}\")\n", 127 | "# Construct a URL to request works from the last four hours, including our API key as a URL query parameter.\n", 128 | "url = f\"https://api.openalex.org/works?filter=from_updated_date:{four_hours_ago_formatted_string}&api_key={my_api_key}\"\n", 129 | "print(\"Requesting newly updated works, including the API key as a URL query parameter...\")\n", 130 | "r = requests.get(url)\n", 131 | "updated_works = r.json()\n", 132 | "\n", 133 | "count_works_retrieved = len(updated_works['results'])\n", 134 | "count_works_total = updated_works['meta']['count']\n", 135 | "print(f\"Retrieved {count_works_retrieved} works, out of {count_works_total} works updated since {four_hours_ago_formatted_string}\")\n" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "Success! We've used our API key to request the works updated in the last four hours. 
The response contains only the first page of results (25 works by default). To get all of the data, [you can page through the results](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging)." 
143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Alternatively, we can include the API key as a request header:" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 10, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "name": "stdout", 159 | "output_type": "stream", 160 | "text": [ 161 | "Requesting newly updated works, including the API key in the request headers...\n", 162 | "Retrieved 200 works, out of 1965510 works updated since 2023-11-07T19:22:46.848349\n" 163 | ] 164 | } 165 | ], 166 | "source": [ 167 | "# Requests works from the last four hours, including our API key in the request headers\n", 168 | "url = f\"https://api.openalex.org/works?filter=from_updated_date:{four_hours_ago_formatted_string}&per-page=200\"\n", 169 | "headers = {\"api_key\": my_api_key}\n", 170 | "print(f\"Requesting newly updated works, including the API key in the request headers...\")\n", 171 | "r = requests.get(url, headers=headers)\n", 172 | "updated_works = r.json()\n", 173 | "\n", 174 | "count_works_retrieved = len(updated_works['results'])\n", 175 | "count_works_total = updated_works['meta']['count']\n", 176 | "print(f\"Retrieved {count_works_retrieved} works, out of {count_works_total} works updated since {four_hours_ago_formatted_string}\")\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "You can retrieve up to 200 works per page. Again, to get all of the results, [you can page through the results](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging)." 184 | ] 185 | }, 186 | { 187 | "attachments": {}, 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "Keep in mind that this method will get works that have been updated with *any change at all*, including increases in various counts. 
If you're only interested in *new* works, you could use the [`from_created_date`](https://docs.openalex.org/api-entities/works/filter-works#from_created_date) filter instead of `from_updated_date`, which will give a much smaller number of works.\n", 192 | "\n", 193 | "You'll need to do two things to keep your data fresh:\n", 194 | "1. Set up a script that does something similar to what we did above, and that runs on schedule once per day (using a [cron job](https://en.wikipedia.org/wiki/Cron), for example).\n", 195 | "2. Update your database with the new data you've grabbed from our API (such as using a SQL script)." 196 | ] 197 | }, 198 | { 199 | "attachments": {}, 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "And that's it! Using this method, you can keep your data up to date with regular API requests, instead of waiting for new data snapshots.\n", 204 | "\n", 205 | "Enjoy!" 206 | ] 207 | } 208 | ], 209 | "metadata": { 210 | "kernelspec": { 211 | "display_name": "venv", 212 | "language": "python", 213 | "name": "python3" 214 | }, 215 | "language_info": { 216 | "codemirror_mode": { 217 | "name": "ipython", 218 | "version": 3 219 | }, 220 | "file_extension": ".py", 221 | "mimetype": "text/x-python", 222 | "name": "python", 223 | "nbconvert_exporter": "python", 224 | "pygments_lexer": "ipython3", 225 | "version": "3.10.10" 226 | }, 227 | "orig_nbformat": 4 228 | }, 229 | "nbformat": 4, 230 | "nbformat_minor": 2 231 | } 232 | -------------------------------------------------------------------------------- /notebooks/institutions/japan_sources.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "# What are the publication sources located in Japan?\n", 9 | "\n", 10 | "When it comes to geographic location data of works in OpenAlex, there are generally two ways to look at it. 
One is the location of the *authors* (or rather, the institutional affiliation of the authors), and the other is the location of the work's *source*. In OpenAlex, [sources are where works are hosted.](https://docs.openalex.org/api-entities/venues) Examples of sources include journals, conferences, and institutional repositories. In this tutorial, we are going to look into the sources located in Japan.\n", 11 | "\n", 12 | "Our questions are:\n", 13 | "1. How many sources of scholarly works are in Japan?\n", 14 | "2. What are the types of these sources? Journals? Repositories? Conferences?\n", 15 | " - For journals, what are the publishers?\n", 16 | " - For repositories, what are the host institutions?\n", 17 | "3. What are the names of these sources?\n", 18 | "4. How many works have Japanese sources? How does this vary over time?\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import requests" 28 | ] 29 | }, 30 | { 31 | "attachments": {}, 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "### Question 1: How many sources of scholarly works are in Japan?\n", 36 | "\n", 37 | "Let's start with the first question: How many sources of scholarly works are in Japan?\n", 38 | "\n", 39 | "To do this, we will query the `/sources` API endpoint, and use a **filter** to limit it to sources where the `country_code` is `JP`." 
40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "There are 2265 sources with country_code 'JP' (Japan)\n" 52 | ] 53 | } 54 | ], 55 | "source": [ 56 | "country_code = 'JP'\n", 57 | "url = f\"https://api.openalex.org/sources\"\n", 58 | "params = {\n", 59 | " 'filter': f'country_code:{country_code}',\n", 60 | "}\n", 61 | "r = requests.get(url, params=params)\n", 62 | "num_sources = r.json()['meta']['count']\n", 63 | "print(f\"There are {num_sources} sources with country_code 'JP' (Japan)\")" 64 | ] 65 | }, 66 | { 67 | "attachments": {}, 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Question 2: What are the types of these sources?\n", 72 | "The next question is: What are the *types* of these sources. Possible types are listed in the API docs on the [Source object](https://docs.openalex.org/api-entities/venues/venue-object#type): `journal`, `repository`, `conference`, `ebook platform`.\n", 73 | "\n", 74 | "To answer this question, we can add a `group_by` to our API query, grouping by the `type` field and counting the number of sources:" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 3, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "Number of sources in Japan for each *type* of source:\n", 87 | " \"journal\": 2158 sources\n", 88 | " \"repository\": 47 sources\n", 89 | " \"book series\": 40 sources\n", 90 | " \"conference\": 19 sources\n", 91 | " \"ebook platform\": 1 sources\n", 92 | " \"other\": 0 sources\n" 93 | ] 94 | } 95 | ], 96 | "source": [ 97 | "params = {\n", 98 | " 'filter': f'country_code:{country_code}',\n", 99 | " 'group_by': 'type',\n", 100 | "}\n", 101 | "r = requests.get(url, params=params)\n", 102 | "print(\"Number of sources in Japan for each *type* of source:\")\n", 103 | "for item 
in r.json()['group_by']:\n", 104 | " print(f' \"{item[\"key\"]}\": {item[\"count\"]} sources')" 105 | ] 106 | }, 107 | { 108 | "attachments": {}, 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "Most of the Japanese sources are of type `journal`, and there are a few other types of sources, including repositories, book series, and conferences." 113 | ] 114 | }, 115 | { 116 | "attachments": {}, 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "### Question 3: What are the names of these sources?\n", 121 | "The next question is: What are the *names* of these sources?\n", 122 | "\n", 123 | "To answer this, we need the API to give us all 2,265 sources. This means we will need to use the [paging technique](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging).\n", 124 | "\n", 125 | "We'll adapt the technique from the [paging notebook](../getting-started/paging.ipynb) to collect the names and ISSNs of the sources." 
126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 4, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "collected 2265 sources (using 92 api calls)\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "# page through to get all sources\n", 143 | "# use paging technique from `paging.ipynb`\n", 144 | "# the page number is passed as the `page` query parameter\n", 145 | "country_code = 'JP'\n", 146 | "url = f\"https://api.openalex.org/sources\"\n", 147 | "params = {\n", 148 | " 'filter': f'country_code:{country_code}',\n", 149 | " 'page': 1, # initialize `page` param to 1\n", 150 | "}\n", 151 | "\n", 152 | "has_more_pages = True\n", 153 | "fewer_than_10k_results = True\n", 154 | "\n", 155 | "# We will collect the data in a variable called `japanese_sources`.\n", 156 | "# Initialize this as an empty list, which we will append to\n", 157 | "japanese_sources = []\n", 158 | "\n", 159 | "# loop through 
pages\n", 160 | "loop_index = 0\n", 161 | "while has_more_pages and fewer_than_10k_results:\n", 162 | " \n", 163 | " page_with_results = requests.get(url, params=params).json()\n", 164 | " \n", 165 | " # loop through partial list of results\n", 166 | " results = page_with_results['results']\n", 167 | " for api_result in results:\n", 168 | " # # Collect the fields we are interested in, for this source\n", 169 | " # source = {field: api_result[field] for field in fields}\n", 170 | " # Append this source to our `japanese_sources` list\n", 171 | " japanese_sources.append(api_result)\n", 172 | "\n", 173 | " # next page\n", 174 | " params['page'] += 1\n", 175 | " \n", 176 | " # end loop when either there are no more results on the requested page \n", 177 | " # or the next request would exceed 10,000 results\n", 178 | " per_page = page_with_results['meta']['per_page']\n", 179 | " has_more_pages = len(results) == per_page\n", 180 | " fewer_than_10k_results = per_page * params['page'] <= 10000\n", 181 | " loop_index += 1\n", 182 | "print(f\"collected {len(japanese_sources)} sources (using {loop_index+1} api calls)\")" 183 | ] 184 | }, 185 | { 186 | "attachments": {}, 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Now would be a good time for us to put our sources into a Pandas dataframe. This is just a way to organize the data, make it more spreadsheet-like, and make it more convenient to work with." 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 5, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "name": "stdout", 200 | "output_type": "stream", 201 | "text": [ 202 | "Dataframe has 2265 rows and 7 columns.\n", 203 | "\n", 204 | "The first five sources are named:\n", 205 | " Journal of the Japan Society of Mechanical Engineers\n", 206 | " Nippon Hoshasen Gijutsu Gakkai Zasshi\n", 207 | " Journal of the Physical Society of Japan\n", 208 | " Nihon rinsho. 
Japanese journal of clinical medicine\n", 209 | " Bulletin of the Chemical Society of Japan\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "import pandas as pd\n", 215 | "\n", 216 | "# Each source in our list of `japanese sources` contains a lot of data, some of it complex and nested.\n", 217 | "# So let's limit our dataframe to include only some of the fields.\n", 218 | "\n", 219 | "# Define the fields that we are interested in collecting:\n", 220 | "fields = [\n", 221 | " 'id',\n", 222 | " 'issn_l',\n", 223 | " 'display_name',\n", 224 | " 'host_organization',\n", 225 | " 'works_count',\n", 226 | " 'cited_by_count',\n", 227 | " 'type',\n", 228 | "]\n", 229 | "\n", 230 | "# One way to limit the dataframe to include only our `fields` is to use the `from_records()` method\n", 231 | "# and specify only the columns we want.\n", 232 | "df_sources = pd.DataFrame.from_records(japanese_sources, columns=fields)\n", 233 | "\n", 234 | "num_rows, num_columns = df_sources.shape\n", 235 | "print(f\"Dataframe has {num_rows} rows and {num_columns} columns.\")\n", 236 | "print() # blank line\n", 237 | "print(\"The first five sources are named:\")\n", 238 | "for name in df_sources['display_name'].head(5):\n", 239 | " print(f\" {name}\")" 240 | ] 241 | }, 242 | { 243 | "attachments": {}, 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "With Pandas, it is very easy to save the data as a spreadsheet file, in case we want to work with it later." 
248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 6, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "df_sources.to_csv(\"japan_sources.csv\")" 257 | ] 258 | }, 259 | { 260 | "attachments": {}, 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "At this point, we can go back and answer a question we skipped over when we moved on to question 3: What are the [`host_organizations`](https://docs.openalex.org/api-entities/venues/venue-object#host_organization) for these sources? In the case of a source of type `journal`, a `host_organization` is a publisher, that is, the company or organization that distributes the works.\n", 265 | "\n", 266 | "The data we have collected from the API contains the field `host_organization`, which is a link that can be fed back into the API to get more information about the organization, such as name, parent companies, and number of works. It would not be hard to collect this data, but for now we will leave it to future work. However, we can use the data we already have to simply *count* the number of different publishers that host Japanese sources." 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 7, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "The Japanese sources are associated with 431 different host organizations (publishers).\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "num_host_orgs = df_sources['host_organization'].nunique()\n", 284 | "print(f\"The Japanese sources are associated with {num_host_orgs} different host organizations (publishers).\")" 285 | ] 286 | }, 287 | { 288 | "attachments": {}, 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "### Question 4: How many works have Japanese sources? 
How does this vary over time?\n", 293 | "Finally, let's look at how many works there are with Japanese sources, and the trends of these works over time.\n", 294 | "\n", 295 | "First, we'll just look at the number of works with Japanese sources in the OpenAlex dataset." 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 8, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "name": "stdout", 305 | "output_type": "stream", 306 | "text": [ 307 | "There are 3,827,227 works (articles) with Japanese sources.\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "num_works = df_sources['works_count'].sum()\n", 313 | "print(f\"There are {num_works:,} works (articles) with Japanese sources.\") # putting \":,\" after num_works tells the formatter to use commas as thousands separators" 314 | ] 315 | }, 316 | { 317 | "attachments": {}, 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "Next, we'll look at the number of works per year per source. This data is not in our dataframe (we excluded it when we specified only certain `fields`). However, we can go back to our collection of `japanese_sources`, which has the field `counts_by_year`. This field—as we can learn from [the docs](https://docs.openalex.org/api-entities/venues/venue-object#counts_by_year)—contains the source's counts of works by year, for the last ten years, organized as a list of dictionaries." 
322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 9, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "Created a dataframe counting works from year 2012 to year 2049.\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "# Put the data for `counts_by_year` into a Pandas dataframe.\n", 339 | "data = []\n", 340 | "for source in japanese_sources:\n", 341 | " for year_count in source['counts_by_year']:\n", 342 | " data.append({\n", 343 | " 'id': source['id'],\n", 344 | " 'year': int(year_count['year']),\n", 345 | " 'works_count': int(year_count['works_count']),\n", 346 | " 'cited_by_count': int(year_count['cited_by_count']),\n", 347 | " })\n", 348 | "df_counts_by_year = pd.DataFrame(data)\n", 349 | "\n", 350 | "# Each row in the dataframe represents one year of one source.\n", 351 | "# We can group by year and sum the number of works to get the\n", 352 | "# total counts by year.\n", 353 | "\n", 354 | "all_counts_japan = df_counts_by_year.groupby('year')['works_count'].sum()\n", 355 | "print(f\"Created a dataframe counting works from year {all_counts_japan.index.min()} to year {all_counts_japan.index.max()}.\")" 356 | ] 357 | }, 358 | { 359 | "attachments": {}, 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "We'll use the [seaborn](https://seaborn.pydata.org/index.html) library to plot graphs of the data. Seaborn is built on top of [matplotlib](https://matplotlib.org/), the standard visualization library in Python. It is not the only choice for visualization, but we'll use it now because it is widely used and not too difficult to get started with." 
364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 10, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "Text(0.5, 1.0, 'Number of works in Japanese journals')" 375 | ] 376 | }, 377 | "execution_count": 10, 378 | "metadata": {}, 379 | "output_type": "execute_result" 380 | }, 381 | { 382 | "data": { 383 | "image/png": "", 384 | "text/plain": [ 385 | "
" 386 | ] 387 | }, 388 | "metadata": {}, 389 | "output_type": "display_data" 390 | } 391 | ], 392 | "source": [ 393 | "# Import seaborn\n", 394 | "import seaborn as sns\n", 395 | "\n", 396 | "# Apply the default theme\n", 397 | "sns.set_theme()\n", 398 | "\n", 399 | "# Visualize the data\n", 400 | "g = sns.lineplot(all_counts_japan)\n", 401 | "g.set_ylim(bottom=0)\n", 402 | "g.set_xlim(2012, 2022)\n", 403 | "g.set_ylabel(\"number of works\")\n", 404 | "g.set_title(\"Number of works in Japanese journals\")" 405 | ] 406 | }, 407 | { 408 | "attachments": {}, 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "We have shown the number of works with Japanese sources in absolute terms, and that the number has been declining over recent years. But this only tells part of the story. Is this a general trend in the data set, or is it specific to Japanese works? To answer this, we want to look at the *relative* number of works, as a percentage of total number of works published in journals.\n", 413 | "\n", 414 | "We need to get the total number of works by year. One way to do this is to query the `/works` API endpoint, and [group by](https://docs.openalex.org/api-entities/works/group-works) the `publication_year`. To match the data we have about sources in Japan, we'll also limit the data to the last ten years, using the [`from_publication_date` convenience filter](https://docs.openalex.org/api-entities/works/filter-works#from_publication_date)." 
415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 11, 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "url = f\"https://api.openalex.org/works\"\n", 424 | "filters = [\n", 425 | " 'primary_location.source.type:journal',\n", 426 | " 'from_publication_date:2012-01-01',\n", 427 | "]\n", 428 | "params = {\n", 429 | " 'filter': \",\".join(filters),\n", 430 | " 'group_by': 'publication_year',\n", 431 | "}\n", 432 | "# make the API query\n", 433 | "r = requests.get(url, params=params)\n", 434 | "\n", 435 | "# Get the data into a pandas dataframe\n", 436 | "counts_data = []\n", 437 | "for row in r.json()['group_by']:\n", 438 | " counts_data.append({\n", 439 | " 'year': int(row['key']),\n", 440 | " 'works_count': int(row['count']),\n", 441 | " })\n", 442 | "all_counts_all_countries = pd.DataFrame(counts_data)\n", 443 | "# change the data into a series, with the year as index and the number of works as values.\n", 444 | "# this will match the `all_counts_japan` data\n", 445 | "all_counts_all_countries = all_counts_all_countries.set_index('year')['works_count']" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 12, 451 | "metadata": {}, 452 | "outputs": [ 453 | { 454 | "data": { 455 | "text/plain": [ 456 | "Text(0, 0.5, 'Relative number of works')" 457 | ] 458 | }, 459 | "execution_count": 12, 460 | "metadata": {}, 461 | "output_type": "execute_result" 462 | }, 463 | { 464 | "data": { 465 | "image/png": "", 466 | "text/plain": [ 467 | "
" 468 | ] 469 | }, 470 | "metadata": {}, 471 | "output_type": "display_data" 472 | } 473 | ], 474 | "source": [ 475 | "relative_japan = all_counts_japan / all_counts_all_countries\n", 476 | "g = sns.lineplot(relative_japan)\n", 477 | "g.set_ylim(bottom=0)\n", 478 | "g.set_xlim(2012, 2022)\n", 479 | "g.set_title(\"Number of works in Japanese journals relative to total number of works in journals\")\n", 480 | "g.set_ylabel(\"Relative number of works\")" 481 | ] 482 | }, 483 | { 484 | "attachments": {}, 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "We have shown that the relative number of works with Japanese sources has been declining over the last ten years. One possible explanation for this is that Japanese authors are publishing less in Japanese sources. Our next step could be to look at papers with Japanese *authors* (institutional affiliations in Japan), and see if they are increasingly publishing in non-Japanese sources. Looking at publications by language would also be interesting—this information [is now available in OpenAlex](https://docs.openalex.org/api-entities/works/work-object#language)." 
489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [] 495 | } 496 | ], 497 | "metadata": { 498 | "kernelspec": { 499 | "display_name": "venv", 500 | "language": "python", 501 | "name": "python3" 502 | }, 503 | "language_info": { 504 | "codemirror_mode": { 505 | "name": "ipython", 506 | "version": 3 507 | }, 508 | "file_extension": ".py", 509 | "mimetype": "text/x-python", 510 | "name": "python", 511 | "nbconvert_exporter": "python", 512 | "pygments_lexer": "ipython3", 513 | "version": "3.10.9" 514 | }, 515 | "orig_nbformat": 4, 516 | "vscode": { 517 | "interpreter": { 518 | "hash": "271691dbc4cdb85f541c883090ff5a004cbd8b9c207c2cfed84437fce4e65fdb" 519 | } 520 | } 521 | }, 522 | "nbformat": 4, 523 | "nbformat_minor": 2 524 | } 525 | -------------------------------------------------------------------------------- /notebooks/institutions/oa-percentage.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "cd151571-2976-4e81-a1e2-2cf716466271", 6 | "metadata": {}, 7 | "source": [ 8 | "
\n", 9 | " \n", 10 | " \"OpenAlex\n", 11 | " \n", 12 | "
" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "e141e5ff-f69d-4563-a556-ce1e746aef14", 18 | "metadata": {}, 19 | "source": [ 20 | "# Monitoring Open Access publications for a given institution\n", 21 | "\n", 22 | "
\n", 23 | " In this notebook we will query the OpenAlex API to answer the question:\n", 24 | "
\n", 25 | " How many of recent journal articles from a given institution are Open Access? And how many aren't?\n", 26 | "
\n", 27 | " To get to the bottom of this, we will use the following API functionalities: \n", 28 | " filtering and \n", 29 | " grouping\n", 30 | "
\n", 31 | "
\n", 32 | "\n", 33 | "Imagine you would like to track the University of Florida's progress in the transition towards Open Access (OA). How could you do that using OpenAlex?\n", 34 | "\n", 35 | "### Steps\n", 36 | "Let's start by dividing the process into smaller, more manageable steps:\n", 37 | "1. First we need to get all recent journal articles from the University of Florida\n", 38 | "2. Next we divide them into open and closed access\n", 39 | "3. Finally we count the publications in each category\n", 40 | "4. Additionally we can put the numbers into a plot to visualize our findings\n", 41 | "\n", 42 | "### Input\n", 43 | "The only input we need is an identifier for the institution and here we opted for its [ROR ID](https://ror.org/). \n", 44 | "If we look up the University of Florida in the ROR registry we find its ROR ID is https://ror.org/02y3ad647:" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "id": "558a64e5-2fea-44af-b0be-35d67eb23553", 51 | "metadata": { 52 | "tags": [] 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "#input\n", 57 | "ror = 'https://ror.org/02y3ad647'" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "f7216c04-3575-4d14-a4ed-5bb66401f5e4", 63 | "metadata": {}, 64 | "source": [ 65 | "All set, so let's dive in!\n", 66 | "\n", 67 | "
\n", 68 | "\n", 69 | "## 1. Get all recent journal articles from the University of Florida\n", 70 | "The first step in querying OpenAlex is always to build the URL to get exactly the data we need. We need to ask two things:\n", 71 | "1. About which entity type (author, concept, institution, venue, work) do we want data? \n", 72 | "* --> Since we want to query for metadata about \"_journal articles_\", the entity type should be `works`.\n", 73 | "\n", 74 | "2. What are the criteria the works need to fulfill to fit our purpose? \n", 75 | "* Here we need to look into the list of available [filters for works](https://docs.openalex.org/api-entities/works/filter-works) and select the appropriate ones. \n", 76 | "* --> We want to query for \"_all recent journal articles from the University of Florida_\", so we will filter for the works that:\n", 77 | " * were published in the last 10 years (=recent): `from_publication_date:2012-08-24`,\n", 78 | " * are specified as articles: `type:article`,\n", 79 | " * have at least one [authorship](https://docs.openalex.org/api-entities/works/work-object#authorships) affiliation with the University of Florida: `institutions.ror:https://ror.org/02y3ad647`,\n", 80 | " * are not [paratext](https://docs.openalex.org/api-entities/works/work-object#is_paratext): `is_paratext:false`\n", 81 | "\n", 82 | "
\n", 83 | "\n", 84 | "Now we need to **put the URL together** from these parts as follows: \n", 85 | "* Starting point is the base URL of the OpenAlex API: `https://api.openalex.org/`\n", 86 | "* We append the entity type to it: `https://api.openalex.org/works`\n", 87 | "* All criteria need to go into the query parameter `filter` that is added after a question mark: `https://api.openalex.org/works?filter=`\n", 88 | "* To construct the filter value we take the criteria we specified and concatenate them using commas as separators: \n", 89 | "`https://api.openalex.org/works?filter=institutions.ror:https://ror.org/02y3ad647,type:article,from_publication_date:2012-08-24,is_paratext:false`\n", 90 | "\n", 91 | "With this URL we can get all recent journal articles from the University of Florida!" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "98345207-4fa0-4799-ba03-5179491adbda", 98 | "metadata": { 99 | "tags": [] 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "def build_institution_works_url(ror):\n", 104 | " # specify endpoint\n", 105 | " endpoint = 'works'\n", 106 | "\n", 107 | " # build the 'filter' parameter\n", 108 | " filters = (\n", 109 | " f'institutions.ror:{ror}',\n", 110 | " 'is_paratext:false',\n", 111 | " 'type:article', \n", 112 | " 'from_publication_date:2012-08-24'\n", 113 | " )\n", 114 | " \n", 115 | " # put the URL together\n", 116 | " return f'https://api.openalex.org/{endpoint}?filter={\",\".join(filters)}'\n", 117 | "\n", 118 | "filtered_works_url = build_institution_works_url(ror)\n", 119 | "print(f'complete URL with filters:\\n{filtered_works_url}')" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "58b725da-3088-4e5b-b69a-2b3c18487956", 125 | "metadata": {}, 126 | "source": [ 127 | "
\n", 128 | "\n", 129 | "## 2. Divide them into open and closed access\n", 130 | "To get the number of open and closed works, we need to find an additional attribute that we can use to divide the retrieved works further into these categories. Fortunately OpenAlex includes information about the access status of a work in its metadata via the nested [OpenAccess object](https://docs.openalex.org/api-entities/works/work-object#the-openaccess-object). It is made up of the three attributes\n", 131 | "* `is_oa` _(Boolean): True if this work is Open Access._\n", 132 | "* `oa_status` _(String): The Open Access (OA) status of this work. Possible values are gold, green, hybrid, bronze, closed._\n", 133 | "* `oa_url` _(String): The best Open Access (OA) URL for this work._\n", 134 | "\n", 135 | "**-->`is_oa` seems to be exactly the criterion we are looking for!**\n", 136 | "\n", 137 | "\n", 138 | "#### Shortcut `group_by`\n", 139 | "So one way to get the number of open and closed works would be to add `is_oa` as an additional filter to our query and query OpenAlex for each value in its range `{true, false}` to get its resulting count of works, e.g.\n", 140 | "* `filter=...,is_oa:true`\n", 141 | "* `filter=...,is_oa:false`\n", 142 | "\n", 143 | "\n", 144 | "But wait! Isn't that exactly what `group_by` does? \n", 145 | "Yes, absolutely, the `group_by` parameter takes one attribute as input, divides the list of results based on the attribute's values and returns each of their counts. 
What a time saver!\n", 146 | "\n", 147 | "Let's add `group_by=is_oa` as an additional query parameter to the end of our URL:" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "id": "bc0569ad-cd73-460e-8472-27d92c310ff6", 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "group_by_param = 'group_by=is_oa'\n", 158 | "\n", 159 | "work_groups_url = f'{filtered_works_url}&{group_by_param}'\n", 160 | "print(f'complete URL with group_by:\\n{work_groups_url}')" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "id": "bc97ecd0-2605-4702-b48b-c83d9d090a95", 166 | "metadata": {}, 167 | "source": [ 168 | "
\n", 169 | "\n", 170 | "## 3. Count the number of works in each group\n", 171 | "\n", 172 | "After putting together the URL, we can query OpenAlex for the groups of publications and retrieve the following two groups:" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "a7768120-47af-4a3a-bfda-f5cd7fed3046", 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "import requests, json\n", 183 | "response = requests.get(work_groups_url).json()\n", 184 | "\n", 185 | "work_groups = response['group_by']\n", 186 | "print(json.dumps(work_groups, indent=2))" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "id": "8dc253db-a7b9-4cc3-93eb-5ab0d20f471f", 192 | "metadata": {}, 193 | "source": [ 194 | "Each group consists of a `key`, which holds the value of the `group_by` attribute (in our case `is_oa`), and a `count` of the entities belonging to the group. Given these data, we can already answer our initial question: \n", 195 | "> _How many of the recent journal articles from a given institution are Open Access? 
And how many aren't?_" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "id": "cbe98e13-7716-4967-ac93-a3a7b420c3a8", 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "def calculate_open_closed_counts(work_groups):\n", 206 | "    open_works_count = 0\n", 207 | "    closed_works_count = 0\n", 208 | "    for index, group in enumerate(work_groups):\n", 209 | "        print(f\"--> Group {index+1} includes all works where `is_oa` is {group['key']} and has a count of {group['count']} publications.\")\n", 210 | "\n", 211 | "        if group['key']==\"true\":\n", 212 | "            open_works_count += group['count']\n", 213 | "        else: \n", 214 | "            closed_works_count += group['count']\n", 215 | "    \n", 216 | "    return open_works_count, closed_works_count\n", 217 | "\n", 218 | "open_works_count, closed_works_count = calculate_open_closed_counts(work_groups)\n", 219 | "total_works_count = open_works_count + closed_works_count\n", 220 | "\n", 221 | "if total_works_count > 0:\n", 222 | "    print('That makes an OA percentage of %f' % (100 * open_works_count/total_works_count))\n", 223 | "else:\n", 224 | "    print(\"OA percentage can't be determined, no publications in result\")" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "id": "ead1c279-ffd5-41bf-bd66-dd1cf9f8b3ce", 230 | "metadata": {}, 231 | "source": [ 232 | "
\n", 233 | "\n", 234 | "## 4. Plot the data (optional)\n", 235 | "Last but not least, we can put the data into a visually appealing plot. How about a donut plot?" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "id": "9b1f4de3-3a74-4a30-98a9-597e27778f96", 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "def create_donut_plot(open_works_count, closed_works_count):\n", 246 | "    import matplotlib.pyplot as plt\n", 247 | "    plt.rcParams[\"figure.figsize\"] = (8,5.5)\n", 248 | "\n", 249 | "    # set labels and their respective values\n", 250 | "    groups = ['Open Access', 'Closed Access']\n", 251 | "    counts = [open_works_count, closed_works_count]\n", 252 | "\n", 253 | "    # some visual settings\n", 254 | "    colors = ['#23c552', '#f84f31']\n", 255 | "    explode = (0.01, 0.01)\n", 256 | "\n", 257 | "    # pie chart\n", 258 | "    plt.pie(counts, colors=colors, labels=groups,\n", 259 | "            autopct='%1.1f%%', pctdistance=0.85,\n", 260 | "            explode=explode, textprops={'fontsize': 14})\n", 261 | "\n", 262 | "    # make it a donut (draw circle in the middle)\n", 263 | "    centre_circle = plt.Circle((0, 0), 0.70, fc='white')\n", 264 | "    fig = plt.gcf()\n", 265 | "    fig.gca().add_artist(centre_circle)\n", 266 | "    \n", 267 | "    # display chart\n", 268 | "    plt.show()\n", 269 | "\n", 270 | "# create donut chart from open/closed counts\n", 271 | "create_donut_plot(open_works_count, closed_works_count)" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "403a64ee-56e5-41d6-b472-cd63765f7217", 277 | "metadata": {}, 278 | "source": [ 279 | "---\n", 280 | "Feel free to use this notebook to determine the percentage of Open Access works for your institution, or tweak the filters to fit your analysis. \n", 281 | "\n", 282 | "Happy exploring! 
😎" 283 | ] 284 | } 285 | ], 286 | "metadata": { 287 | "kernelspec": { 288 | "display_name": "Python 3 (ipykernel)", 289 | "language": "python", 290 | "name": "python3" 291 | }, 292 | "language_info": { 293 | "codemirror_mode": { 294 | "name": "ipython", 295 | "version": 3 296 | }, 297 | "file_extension": ".py", 298 | "mimetype": "text/x-python", 299 | "name": "python", 300 | "nbconvert_exporter": "python", 301 | "pygments_lexer": "ipython3", 302 | "version": "3.10.9" 303 | } 304 | }, 305 | "nbformat": 4, 306 | "nbformat_minor": 5 307 | } 308 | -------------------------------------------------------------------------------- /notebooks/openalex_works/openalex_works.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "# What impact has OpenAlex had so far?\n", 9 | "\n", 10 | "## Citation analysis\n", 11 | "\n", 12 | "Let's start by looking at the paper that OpenAlex asks researchers to cite:\n", 13 | "\n", 14 | "> Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. ArXiv. 
https://arxiv.org/abs/2205.01833" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 3, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import requests\n", 24 | "\n", 25 | "doi = '10.48550/arXiv.2205.01833'\n", 26 | "\n", 27 | "url = f'https://api.openalex.org/works?filter=doi:{doi}'\n", 28 | "r = requests.get(url)\n", 29 | "response_data = r.json()\n", 30 | "openalex_article = response_data['results'][0]" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 4, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | "Within the OpenAlex data, the OpenAlex paper has 3 (incoming) citations.\n" 43 | ] 44 | } 45 | ], 46 | "source": [ 47 | "print(f\"Within the OpenAlex data, the OpenAlex paper has {openalex_article['cited_by_count']} (incoming) citations.\")" 48 | ] 49 | }, 50 | { 51 | "attachments": {}, 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "The number of papers citing OpenAlex seems low. Let's try Semantic Scholar's data for the same article." 
56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 8, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "s2_api_endpoint = \"https://api.semanticscholar.org/graph/v1/paper\"\n", 65 | "fields = ['citationCount', 'citations.title', 'citations.year', 'citations.publicationDate', 'citations.citationCount', 'citations.externalIds']\n", 66 | "params = {\n", 67 | " 'fields': \",\".join(fields)\n", 68 | "}\n", 69 | "r = requests.get(f\"{s2_api_endpoint}/DOI:{doi}\", params=params)\n", 70 | "s2_article = r.json()" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 9, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/plain": [ 81 | "{'error': 'Paper with id DOI:10.48550/arXiv.2205.01833 not found'}" 82 | ] 83 | }, 84 | "execution_count": 9, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "s2_article" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 12, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "s2_api_endpoint = \"https://api.semanticscholar.org/graph/v1/paper\"\n", 100 | "fields = ['citationCount', 'citations.title', 'citations.year', 'citations.publicationDate', 'citations.citationCount', 'citations.externalIds']\n", 101 | "params = {\n", 102 | " 'fields': \",\".join(fields)\n", 103 | "}\n", 104 | "arxiv_id = '2205.01833'\n", 105 | "r = requests.get(f\"{s2_api_endpoint}/ARXIV:{arxiv_id}\", params=params)\n", 106 | "s2_article = r.json()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 14, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "Within the Semantic Scholar data, the OpenAlex paper has 16 (incoming) citations.\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "print(f\"Within the Semantic Scholar data, the OpenAlex paper has {s2_article['citationCount']} (incoming) citations.\")" 124 | ] 125 | 
}, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 16, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "citations_dois = [citing_article['externalIds'].get('DOI') for citing_article in s2_article['citations']]" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 17, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "['10.48550/arXiv.2302.02231',\n", 144 | " '10.48550/arXiv.2301.01502',\n", 145 | " '10.48550/arXiv.2210.14871',\n", 146 | " '10.1109/TVCG.2022.3209422',\n", 147 | " '10.1016/j.cosrev.2022.100531',\n", 148 | " '10.3389/frma.2022.1010504',\n", 149 | " '10.48550/arXiv.2211.04429',\n", 150 | " '10.48550/arXiv.2210.00356',\n", 151 | " '10.1108/jd-04-2022-0083',\n", 152 | " '10.48550/arXiv.2209.09246',\n", 153 | " '10.1007/978-3-031-16802-4_52',\n", 154 | " '10.48550/arXiv.2208.11065',\n", 155 | " '10.1007/s11192-022-04446-y',\n", 156 | " '10.5281/zenodo.6975102',\n", 157 | " '10.1162/qss_a_00222',\n", 158 | " '10.1162/qss_a_00200']" 159 | ] 160 | }, 161 | "execution_count": 17, 162 | "metadata": {}, 163 | "output_type": "execute_result" 164 | } 165 | ], 166 | "source": [ 167 | "citations_dois" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 21, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "url = f'https://api.openalex.org/works'\n", 177 | "citing_dois_str = \"|\".join(citations_dois)\n", 178 | "params = {\n", 179 | " 'filter': f\"doi:{citing_dois_str}\"\n", 180 | "}\n", 181 | "r = requests.get(url, params=params)\n", 182 | "response_data = r.json()" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 25, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "https://doi.org/10.1162/qss_a_00222 2022-11-07 0\n", 195 | "https://doi.org/10.1162/qss_a_00200 2021-09-01 20\n", 196 | "https://doi.org/10.5281/zenodo.6975102 
2022-06-28 0\n", 197 | "https://doi.org/10.1007/s11192-022-04446-y 2022-07-15 19\n", 198 | "https://doi.org/10.48550/arxiv.2208.11065 2022-08-23 0\n", 199 | "https://doi.org/10.1007/978-3-031-16802-4_52 2022-01-01 5\n", 200 | "https://doi.org/10.48550/arxiv.2209.09246 2022-09-19 0\n", 201 | "https://doi.org/10.1109/tvcg.2022.3209422 2022-01-01 21\n", 202 | "https://doi.org/10.1108/jd-04-2022-0083 2022-09-21 27\n", 203 | "https://doi.org/10.48550/arxiv.2210.00356 2022-10-01 0\n", 204 | "https://doi.org/10.48550/arxiv.2210.14871 2022-10-26 0\n", 205 | "https://doi.org/10.48550/arxiv.2211.04429 2022-11-08 0\n", 206 | "https://doi.org/10.3389/frma.2022.1010504 2022-11-10 17\n", 207 | "https://doi.org/10.1016/j.cosrev.2022.100531 2023-02-01 155\n", 208 | "https://doi.org/10.48550/arxiv.2301.01502 2023-01-04 0\n", 209 | "https://doi.org/10.48550/arxiv.2302.02231 2023-02-04 0\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "for result in response_data['results']:\n", 215 | " print(result['doi'], result['publication_date'], len(result['referenced_works']))" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 6, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "ename": "MaxTriesExceededException", 239 | "evalue": "Cannot Fetch from Google Scholar.", 240 | "output_type": "error", 241 | "traceback": [ 242 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 243 | "\u001b[0;31mMaxTriesExceededException\u001b[0m Traceback (most recent call last)", 244 | "Cell \u001b[0;32mIn[6], line 10\u001b[0m\n\u001b[1;32m 7\u001b[0m scholarly\u001b[39m.\u001b[39muse_proxy(pg)\n\u001b[1;32m 9\u001b[0m \u001b[39m# Now search Google Scholar 
from behind a proxy\u001b[39;00m\n\u001b[0;32m---> 10\u001b[0m search_query \u001b[39m=\u001b[39m scholarly\u001b[39m.\u001b[39;49msearch_pubs(\u001b[39mf\u001b[39;49m\u001b[39m'\u001b[39;49m\u001b[39mdoi:\u001b[39;49m\u001b[39m{\u001b[39;49;00mdoi\u001b[39m}\u001b[39;49;00m\u001b[39m'\u001b[39;49m)\n\u001b[1;32m 11\u001b[0m article \u001b[39m=\u001b[39m \u001b[39mnext\u001b[39m(search_query)\n", 245 | "File \u001b[0;32m~/code/ourresearch/openalex-api-tutorials/venv/lib/python3.9/site-packages/scholarly/_scholarly.py:160\u001b[0m, in \u001b[0;36m_Scholarly.search_pubs\u001b[0;34m(self, query, patents, citations, year_low, year_high, sort_by, include_last_year, start_index)\u001b[0m\n\u001b[1;32m 97\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Searches by query and returns a generator of Publication objects\u001b[39;00m\n\u001b[1;32m 98\u001b[0m \n\u001b[1;32m 99\u001b[0m \u001b[39m:param query: terms to be searched\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 155\u001b[0m \n\u001b[1;32m 156\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[1;32m 157\u001b[0m url \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_construct_url(_PUBSEARCH\u001b[39m.\u001b[39mformat(requests\u001b[39m.\u001b[39mutils\u001b[39m.\u001b[39mquote(query)), patents\u001b[39m=\u001b[39mpatents,\n\u001b[1;32m 158\u001b[0m citations\u001b[39m=\u001b[39mcitations, year_low\u001b[39m=\u001b[39myear_low, year_high\u001b[39m=\u001b[39myear_high,\n\u001b[1;32m 159\u001b[0m sort_by\u001b[39m=\u001b[39msort_by, include_last_year\u001b[39m=\u001b[39minclude_last_year, start_index\u001b[39m=\u001b[39mstart_index)\n\u001b[0;32m--> 160\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m__nav\u001b[39m.\u001b[39;49msearch_publications(url)\n", 246 | "File \u001b[0;32m~/code/ourresearch/openalex-api-tutorials/venv/lib/python3.9/site-packages/scholarly/_navigator.py:296\u001b[0m, in \u001b[0;36mNavigator.search_publications\u001b[0;34m(self, 
url)\u001b[0m\n\u001b[1;32m 288\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39msearch_publications\u001b[39m(\u001b[39mself\u001b[39m, url: \u001b[39mstr\u001b[39m) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m _SearchScholarIterator:\n\u001b[1;32m 289\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Returns a Publication Generator given a url\u001b[39;00m\n\u001b[1;32m 290\u001b[0m \n\u001b[1;32m 291\u001b[0m \u001b[39m :param url: the url where publications can be found.\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 294\u001b[0m \u001b[39m :rtype: {_SearchScholarIterator}\u001b[39;00m\n\u001b[1;32m 295\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 296\u001b[0m \u001b[39mreturn\u001b[39;00m _SearchScholarIterator(\u001b[39mself\u001b[39;49m, url)\n", 247 | "File \u001b[0;32m~/code/ourresearch/openalex-api-tutorials/venv/lib/python3.9/site-packages/scholarly/publication_parser.py:53\u001b[0m, in \u001b[0;36m_SearchScholarIterator.__init__\u001b[0;34m(self, nav, url)\u001b[0m\n\u001b[1;32m 51\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_pubtype \u001b[39m=\u001b[39m PublicationSource\u001b[39m.\u001b[39mPUBLICATION_SEARCH_SNIPPET \u001b[39mif\u001b[39;00m \u001b[39m\"\u001b[39m\u001b[39m/scholar?\u001b[39m\u001b[39m\"\u001b[39m \u001b[39min\u001b[39;00m url \u001b[39melse\u001b[39;00m PublicationSource\u001b[39m.\u001b[39mJOURNAL_CITATION_LIST\n\u001b[1;32m 52\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_nav \u001b[39m=\u001b[39m nav\n\u001b[0;32m---> 53\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_load_url(url)\n\u001b[1;32m 54\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mtotal_results \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_get_total_results()\n\u001b[1;32m 55\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mpub_parser \u001b[39m=\u001b[39m PublicationParser(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_nav)\n", 248 | "File 
\u001b[0;32m~/code/ourresearch/openalex-api-tutorials/venv/lib/python3.9/site-packages/scholarly/publication_parser.py:59\u001b[0m, in \u001b[0;36m_SearchScholarIterator._load_url\u001b[0;34m(self, url)\u001b[0m\n\u001b[1;32m 57\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_load_url\u001b[39m(\u001b[39mself\u001b[39m, url: \u001b[39mstr\u001b[39m):\n\u001b[1;32m 58\u001b[0m \u001b[39m# this is temporary until setup json file\u001b[39;00m\n\u001b[0;32m---> 59\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_soup \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_nav\u001b[39m.\u001b[39;49m_get_soup(url)\n\u001b[1;32m 60\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_pos \u001b[39m=\u001b[39m \u001b[39m0\u001b[39m\n\u001b[1;32m 61\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_rows \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_soup\u001b[39m.\u001b[39mfind_all(\u001b[39m'\u001b[39m\u001b[39mdiv\u001b[39m\u001b[39m'\u001b[39m, class_\u001b[39m=\u001b[39m\u001b[39m'\u001b[39m\u001b[39mgs_r gs_or gs_scl\u001b[39m\u001b[39m'\u001b[39m) \u001b[39m+\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_soup\u001b[39m.\u001b[39mfind_all(\u001b[39m'\u001b[39m\u001b[39mdiv\u001b[39m\u001b[39m'\u001b[39m, class_\u001b[39m=\u001b[39m\u001b[39m'\u001b[39m\u001b[39mgsc_mpat_ttl\u001b[39m\u001b[39m'\u001b[39m)\n", 249 | "File \u001b[0;32m~/code/ourresearch/openalex-api-tutorials/venv/lib/python3.9/site-packages/scholarly/_navigator.py:239\u001b[0m, in \u001b[0;36mNavigator._get_soup\u001b[0;34m(self, url)\u001b[0m\n\u001b[1;32m 237\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_get_soup\u001b[39m(\u001b[39mself\u001b[39m, url: \u001b[39mstr\u001b[39m) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m BeautifulSoup:\n\u001b[1;32m 238\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Return the BeautifulSoup for a page on scholar.google.com\"\"\"\u001b[39;00m\n\u001b[0;32m--> 239\u001b[0m html \u001b[39m=\u001b[39m 
\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_get_page(\u001b[39m'\u001b[39;49m\u001b[39mhttps://scholar.google.com\u001b[39;49m\u001b[39m{0}\u001b[39;49;00m\u001b[39m'\u001b[39;49m\u001b[39m.\u001b[39;49mformat(url))\n\u001b[1;32m 240\u001b[0m html \u001b[39m=\u001b[39m html\u001b[39m.\u001b[39mreplace(\u001b[39mu\u001b[39m\u001b[39m'\u001b[39m\u001b[39m\\xa0\u001b[39;00m\u001b[39m'\u001b[39m, \u001b[39mu\u001b[39m\u001b[39m'\u001b[39m\u001b[39m \u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m 241\u001b[0m res \u001b[39m=\u001b[39m BeautifulSoup(html, \u001b[39m'\u001b[39m\u001b[39mhtml.parser\u001b[39m\u001b[39m'\u001b[39m)\n", 250 | "File \u001b[0;32m~/code/ourresearch/openalex-api-tutorials/venv/lib/python3.9/site-packages/scholarly/_navigator.py:190\u001b[0m, in \u001b[0;36mNavigator._get_page\u001b[0;34m(self, pagerequest, premium)\u001b[0m\n\u001b[1;32m 188\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_get_page(pagerequest, \u001b[39mTrue\u001b[39;00m)\n\u001b[1;32m 189\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m--> 190\u001b[0m \u001b[39mraise\u001b[39;00m MaxTriesExceededException(\u001b[39m\"\u001b[39m\u001b[39mCannot Fetch from Google Scholar.\u001b[39m\u001b[39m\"\u001b[39m)\n", 251 | "\u001b[0;31mMaxTriesExceededException\u001b[0m: Cannot Fetch from Google Scholar." 
252 | ] 253 | } 254 | ], 255 | "source": [ 256 | "from scholarly import scholarly, ProxyGenerator\n", 257 | "\n", 258 | "# Set up a ProxyGenerator object to use free proxies\n", 259 | "# This needs to be done only once per session\n", 260 | "pg = ProxyGenerator()\n", 261 | "pg.FreeProxies()\n", 262 | "scholarly.use_proxy(pg)\n", 263 | "\n", 264 | "# Now search Google Scholar from behind a proxy\n", 265 | "search_query = scholarly.search_pubs(f'doi:{doi}')\n", 266 | "article = next(search_query)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 1, 286 | "metadata": {}, 287 | "outputs": [ 288 | { 289 | "name": "stdout", 290 | "output_type": "stream", 291 | "text": [ 292 | "complete URL with filters:\n", 293 | "https://api.openalex.org/works?search=openalex\n" 294 | ] 295 | } 296 | ], 297 | "source": [ 298 | "# specify endpoint\n", 299 | "endpoint = 'works'\n", 300 | "\n", 301 | "search_query = 'openalex'\n", 302 | "\n", 303 | "# put the URL together\n", 304 | "url = f'https://api.openalex.org/{endpoint}?search={search_query}'\n", 305 | "print(f'complete URL with filters:\\n{url}')" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 3, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "openalex_arxiv_paper = 'W4229010617'\n", 315 | "openalex_arxiv_paper_2 = 'W4288680697'\n", 316 | "url = f'https://api.openalex.org/{endpoint}?filter=cites:{openalex_arxiv_paper}'\n", 317 | "result = requests.get(url).json()" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 5, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "{'count': 3, 'db_response_time_ms': 59, 'page': 1, 
'per_page': 25}" 329 | ] 330 | }, 331 | "execution_count": 5, 332 | "metadata": {}, 333 | "output_type": "execute_result" 334 | } 335 | ], 336 | "source": [ 337 | "result['meta']" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 9, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/plain": [ 348 | "dict_keys(['container_type', 'source', 'bib', 'filled', 'gsrank', 'pub_url', 'author_id', 'url_scholarbib', 'url_add_sclib', 'num_citations', 'citedby_url', 'url_related_articles', 'eprint_url'])" 349 | ] 350 | }, 351 | "execution_count": 9, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "article.keys()" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 10, 363 | "metadata": {}, 364 | "outputs": [ 365 | { 366 | "data": { 367 | "text/plain": [ 368 | "19" 369 | ] 370 | }, 371 | "execution_count": 10, 372 | "metadata": {}, 373 | "output_type": "execute_result" 374 | } 375 | ], 376 | "source": [ 377 | "article['num_citations']" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [] 386 | } 387 | ], 388 | "metadata": { 389 | "kernelspec": { 390 | "display_name": "venv", 391 | "language": "python", 392 | "name": "python3" 393 | }, 394 | "language_info": { 395 | "codemirror_mode": { 396 | "name": "ipython", 397 | "version": 3 398 | }, 399 | "file_extension": ".py", 400 | "mimetype": "text/x-python", 401 | "name": "python", 402 | "nbconvert_exporter": "python", 403 | "pygments_lexer": "ipython3", 404 | "version": "3.9.12" 405 | }, 406 | "orig_nbformat": 4, 407 | "vscode": { 408 | "interpreter": { 409 | "hash": "271691dbc4cdb85f541c883090ff5a004cbd8b9c207c2cfed84437fce4e65fdb" 410 | } 411 | } 412 | }, 413 | "nbformat": 4, 414 | "nbformat_minor": 2 415 | } 416 | -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | requests==2.31.0 2 | pandas==2.1.1 3 | python-dotenv==1.0.0 4 | 5 | # Visualization libraries 6 | matplotlib==3.8.0 7 | plotly==5.17.0 8 | seaborn==0.13.0 9 | 10 | # Static visualization rendering 11 | kaleido==0.2.1 12 | 13 | # Geo 14 | country_converter==1.0.0 -------------------------------------------------------------------------------- /resources/img/OpenAlex-banner.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ourresearch/openalex-api-tutorials/1988d22c5499d6a1f68d85ef2902b600b555aaa5/resources/img/OpenAlex-banner.png -------------------------------------------------------------------------------- /resources/img/OpenAlex-entities.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ourresearch/openalex-api-tutorials/1988d22c5499d6a1f68d85ef2902b600b555aaa5/resources/img/OpenAlex-entities.png -------------------------------------------------------------------------------- /resources/img/OpenAlex-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ourresearch/openalex-api-tutorials/1988d22c5499d6a1f68d85ef2902b600b555aaa5/resources/img/OpenAlex-logo.png -------------------------------------------------------------------------------- /resources/img/notebooks/cursor-paging.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ourresearch/openalex-api-tutorials/1988d22c5499d6a1f68d85ef2902b600b555aaa5/resources/img/notebooks/cursor-paging.png -------------------------------------------------------------------------------- /resources/img/notebooks/meta-object.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ourresearch/openalex-api-tutorials/1988d22c5499d6a1f68d85ef2902b600b555aaa5/resources/img/notebooks/meta-object.png -------------------------------------------------------------------------------- /runtime.txt: -------------------------------------------------------------------------------- 1 | python-3.9 --------------------------------------------------------------------------------