├── .gitignore
├── LICENSE
├── README.md
├── notebooks
│   ├── authors
│   │   └── hirsch-index.ipynb
│   ├── data_questions
│   │   └── counts_within_country.ipynb
│   ├── getting-started
│   │   ├── README.md
│   │   ├── api-webinar-apr2024
│   │   │   └── tutorial01.ipynb
│   │   ├── get-random-entity.ipynb
│   │   ├── paging.ipynb
│   │   └── premium.ipynb
│   ├── institutions
│   │   ├── japan_sources.csv
│   │   ├── japan_sources.ipynb
│   │   ├── oa-percentage.ipynb
│   │   ├── uw-collaborators copy.ipynb
│   │   └── uw-collaborators.ipynb
│   └── openalex_works
│       └── openalex_works.ipynb
├── requirements.txt
├── resources
│   └── img
│       ├── OpenAlex-banner.png
│       ├── OpenAlex-entities.png
│       ├── OpenAlex-logo.png
│       ├── notebooks
│       │   ├── cursor-paging.png
│       │   └── meta-object.png
│       └── ui_vs_api.svg
└── runtime.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Jupyter Notebook
2 | .ipynb_checkpoints
3 |
4 | # Environments
5 | .env
6 | .venv
7 | env/
8 | venv/
9 | ENV/
10 | env.bak/
11 | venv.bak/
12 |
13 | data/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 OurResearch
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | # OpenAlex API tutorials
6 |
7 | [Launch on Binder](https://mybinder.org/v2/gh/ourresearch/openalex-api-tutorials/main)
8 | [Open in Colab](https://colab.research.google.com/github/ourresearch/openalex-api-tutorials)
9 | [Launch in Deepnote](https://www.deepnote.com/launch?url=https%3A%2F%2Fgithub.com%2Fourresearch%2Fopenalex-api-tutorials/)
10 |
11 | A collection of Jupyter notebooks, each walking you through a common example of bibliometric analysis
12 | using scholarly data from the [OpenAlex API](https://docs.openalex.org/). (:warning: Work In Progress).
13 |
14 |
15 | ## :bulb: What is OpenAlex?
16 | [OpenAlex](https://openalex.org/) is a fully-open index of scholarly works, authors, venues, institutions, and concepts
17 | — along with all the ways they're connected to one another.
18 | It's named after the ancient [Library of Alexandria](https://en.wikipedia.org/wiki/Library_of_Alexandria)
19 | and made by the nonprofit [OurResearch](https://ourresearch.org/).
20 |
21 |
"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# OpenAlex API Webinar - Tutorial 01 - Getting data about papers that a university's research has cited\n",
19 | "\n",
20 | "Jason Portenoy\n",
21 | "\n",
22 | "April 25, 2024"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "Welcome to the Jupyter Notebook accompanying part 2 of the [OpenAlex](https://openalex.org) webinar on using the API!\n",
30 | "\n",
31 | "Video recording of the webinar: [https://youtu.be/DLKUgbw7FV4](https://youtu.be/DLKUgbw7FV4)\n",
32 | "\n",
33 | "We will be using Python code to get data about a university's research works, and the works referenced by those works.\n",
34 | "\n",
35 | "* The [OpenAlex webinars](https://openalex.org/webinars) page is where you can find information on all webinars including this one, with dates for upcoming webinars, and links to video recordings of previous webinars.\n",
36 | "* If you aren't familiar with Jupyter notebooks, [you can learn more here](https://jupyter.org/try-jupyter/notebooks/?path=notebooks/Intro.ipynb)\n",
37 | "* To learn all about the OpenAlex API: [visit the technical documentation](https://docs.openalex.org)\n",
38 | "\n",
39 | "And of course, if you aren't yet familiar with OpenAlex, you can go to [https://openalex.org](https://openalex.org) right now and start exploring!"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## API Basics\n",
47 | "\n",
48 | "[Part 1 of this OpenAlex API webinar series](https://www.youtube.com/watch?v=ycoHc8flx8U) was a basic introduction to an Application Programming Interface, or API: what it is, how OpenAlex's API works, and why it might be useful.\n",
49 | "\n",
50 | "Now, we dive in, and use Python code to get data from OpenAlex about a university's research!\n",
51 | "\n",
52 | "As a reminder, an API allows a program to interact with the data, instead of you (a person). When a person is using the data, the User Interface (UI) is more appropriate:\n",
53 | "\n",
54 | "\n",
55 | "\n",
56 | "The **free version** of the OpenAlex API:\n",
57 | "* Does not require authentication\n",
58 | "  * But, add your e-mail address for faster and more consistent responses: `mailto=you@example.com`\n",
59 | "* 100k calls per day\n",
60 | "* 10 calls per second\n",
61 | "* _We raise limits for free to support research projects when possible_\n",
62 | "\n",
63 | "The **premium version** of the OpenAlex API:\n",
64 | "* API key tied to your account\n",
65 | "* API limits raised to meet your needs\n",
66 | "* Additional filters to support hourly data updates\n",
67 | "  * `from_created_date`\n",
68 | "  * `from_updated_date`"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "## Let's dive in!\n",
76 | "\n",
77 | "The OpenAlex API is very **powerful**, but it is also very **easy to use**. There is no authentication required. All your code needs to do is make standard HTTP GET requests.\n",
78 | "\n",
79 | "While there are some [good libraries you can use to access the API](https://docs.openalex.org/how-to-use-the-api/api-overview#client-libraries), we're going to start very simply by making API calls directly. We will import just two small libraries to help us."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 1,
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "# setup: import libraries\n",
89 | "import requests\n",
90 | "import csv"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 2,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "# IMPORTANT: Set your email here in order to use the API's \"polite pool\"\n",
100 | "# See: https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication#the-polite-pool\n",
101 | "\n",
102 | "# e.g., mailto=\"youremail@example.com\"\n",
103 | "# Go ahead, fill it out:\n",
104 | "mailto = \"\""
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 3,
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | "Success!\n"
117 | ]
118 | }
119 | ],
120 | "source": [
121 | "url = \"https://api.openalex.org/works\"\n",
122 | "if not mailto:\n",
123 | "    raise ValueError(\"You need to fill in your email address in the `mailto` variable above!\")\n",
124 | "params = {\n",
125 | "    \"mailto\": mailto,\n",
126 | "    \"filter\": \"authorships.author.id:a5086928770\",  # Kyle Demes's author ID\n",
127 | "}\n",
128 | "response = requests.get(url, params=params)\n",
129 | "\n",
130 | "# A \"200\" status code means that the API query was successful\n",
131 | "if response.status_code == 200:\n",
132 | "    print(\"Success!\")"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 4,
138 | "metadata": {},
139 | "outputs": [
140 | {
141 | "name": "stdout",
142 | "output_type": "stream",
143 | "text": [
144 | "Number of results: 24\n"
145 | ]
146 | }
147 | ],
148 | "source": [
149 | "results = response.json()['results']\n",
150 | "print(f\"Number of results: {len(results)}\")"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "We've retrieved the papers from the API, using a simple query:\n",
158 | "\n",
159 | "[`https://api.openalex.org/works?filter=authorships.author.id:a5086928770`](https://api.openalex.org/works?filter=authorships.author.id:a5086928770)\n",
160 | "\n",
161 | "You can follow that link in the browser to get the same result. But with the data now accessible by our code, we can save a CSV file with whichever fields we want.\n",
162 | "\n",
163 | "Instructions for how to write CSV files with Python are [here](https://docs.python.org/3/library/csv.html). Following these instructions:"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 5,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "# The Python documentation shows how to write data to a CSV file:\n",
173 | "# https://docs.python.org/3/library/csv.html\n",
174 | "with open('kdemes_works.csv', 'w', newline='') as f:\n",
175 | "    # initialize the csv writer for this file\n",
176 | "    writer = csv.writer(f)\n",
177 | "\n",
178 | "    # write a header row at the top\n",
179 | "    header = ['id', 'doi', 'publication_year', 'title']\n",
180 | "    writer.writerow(header)\n",
181 | "\n",
182 | "    # loop through the works and write each row\n",
183 | "    for item in results:\n",
184 | "        this_id = item['id']\n",
185 | "        this_doi = item['doi']\n",
186 | "        this_publication_year = item['publication_year']\n",
187 | "        this_title = item['title']\n",
188 | "        writer.writerow([this_id, this_doi, this_publication_year, this_title])"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "### University (Institution)\n",
196 | "\n",
197 | "Next, we'll try something a little more advanced. We're going to get the works from a certain institution, and then retrieve all of the references from those works (the works cited by the university's works).\n",
198 | "\n",
199 | "We'll start by collecting the university's works. We'll limit to just recently published papers so it doesn't take too long, but you could get all of the papers just as easily, if you're willing to wait.\n",
200 | "\n",
201 | "#### Cursor paging\n",
202 | "\n",
203 | "Each API query will only return a limited subset of the overall data, in what is known as a page. We need to make multiple queries to \"page through\" all of the data, collecting the data for each API query. We use a method called [\"cursor paging\"](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging) to do this."
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 6,
209 | "metadata": {},
210 | "outputs": [
211 | {
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "Done paging through results. We made 44 API queries, and retrieved 4259 results.\n"
216 | ]
217 | }
218 | ],
219 | "source": [
220 | "url = \"https://api.openalex.org/works\"\n",
221 | "if not mailto:\n",
222 | "    raise ValueError(\"You need to fill in your email address in the `mailto` variable above!\")\n",
223 | "params = {\n",
224 | "    \"mailto\": mailto,\n",
225 | "    \"filter\": \"authorships.institutions.lineage:i129801699,publication_year:>2022\",  # University of Tasmania\n",
226 | "    \"per-page\": 100,\n",
227 | "    \"select\": \"id,doi,publication_year,title,primary_location,authorships,topics\",\n",
228 | "}\n",
229 | "\n",
230 | "# Initialize cursor\n",
231 | "cursor = \"*\"\n",
232 | "\n",
233 | "# Initialize an empty list to store our results as we get them\n",
234 | "all_results = []\n",
235 | "count_api_queries = 0\n",
236 | "\n",
237 | "# Loop through pages\n",
238 | "while cursor:\n",
239 | "    params[\"cursor\"] = cursor\n",
240 | "    response = requests.get(url, params=params)\n",
241 | "    if response.status_code != 200:\n",
242 | "        print(\"Oh no! Something went wrong during the live demo! How embarrassing!\")\n",
243 | "        break\n",
244 | "    this_page_results = response.json()['results']\n",
245 | "    for result in this_page_results:\n",
246 | "        # Store these results in the list we created before the loop we are currently in\n",
247 | "        all_results.append(result)\n",
248 | "    count_api_queries += 1\n",
249 | "\n",
250 | "    # Update cursor, using the response's `next_cursor` metadata field\n",
251 | "    cursor = response.json()['meta']['next_cursor']\n",
252 | "print(f\"Done paging through results. We made {count_api_queries} API queries, and retrieved {len(all_results)} results.\")"
253 | ]
254 | },
255 | {
256 | "cell_type": "markdown",
257 | "metadata": {},
258 | "source": [
259 | "Our next step is to loop through each work collected above, and collect all of the referenced works."
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 7,
265 | "metadata": {},
266 | "outputs": [],
267 | "source": [
268 | "# Let's make our cursor paging code above into a function, so we can reuse it easily.\n",
269 | "# This code just defines the function. We'll need to call the function later on to actually get it to run.\n",
270 | "def api_query_page_results(url, params):\n",
271 | "    # Initialize cursor\n",
272 | "    cursor = \"*\"\n",
273 | "\n",
274 | "    # Loop through pages\n",
275 | "    all_results = []\n",
276 | "    while cursor:\n",
277 | "        params[\"cursor\"] = cursor\n",
278 | "        response = requests.get(url, params=params)\n",
279 | "        if response.status_code != 200:\n",
280 | "            print(\"Oh no! Something went wrong during the live demo! How embarrassing!\")\n",
281 | "            response.raise_for_status()\n",
282 | "        this_page_results = response.json()['results']\n",
283 | "        for result in this_page_results:\n",
284 | "            all_results.append(result)\n",
285 | "\n",
286 | "        # Update cursor\n",
287 | "        cursor = response.json()['meta']['next_cursor']\n",
288 | "    return all_results"
289 | ]
290 | },
291 | {
292 | "cell_type": "code",
293 | "execution_count": 8,
294 | "metadata": {},
295 | "outputs": [
296 | {
297 | "name": "stdout",
298 | "output_type": "stream",
299 | "text": [
300 | "Done collecting references. We retrieved 9157 works.\n"
301 | ]
302 | }
303 | ],
304 | "source": [
305 | "# collect all of the works referenced by the works found above\n",
306 | "# This will be a dictionary mapping Citing Paper -> List of Cited Papers\n",
307 | "# We start by initializing an empty dictionary\n",
308 | "all_references = {}\n",
309 | "\n",
310 | "# Let's limit the results to loop through to only n=100, because this is a demo, and we don't want to wait for too long\n",
311 | "works_to_collect = all_results[:100]\n",
312 | "\n",
313 | "# We will keep track of the number of works retrieved from the API\n",
314 | "count_works_retrieved = 0\n",
315 | "\n",
316 | "for work in works_to_collect:\n",
317 | "    # Get references for this work (i.e., works that have been cited by this work)\n",
318 | "    this_work_id = work['id']\n",
319 | "    url = \"https://api.openalex.org/works\"\n",
320 | "    if not mailto:\n",
321 | "        raise ValueError(\"You need to fill in your email address in the `mailto` variable above!\")\n",
322 | "    params = {\n",
323 | "        \"mailto\": mailto,\n",
324 | "        \"filter\": f\"cited_by:{this_work_id}\",\n",
325 | "        \"per-page\": 100,\n",
326 | "        \"select\": \"id,doi,publication_year,title,primary_location,authorships,topics\",\n",
327 | "    }\n",
328 | "    this_work_references = api_query_page_results(url, params=params)\n",
329 | "    # put this data into our dictionary:\n",
330 | "    # The key for the dictionary is the citing work_id, and the value is the list of referenced Works\n",
331 | "    all_references[this_work_id] = this_work_references\n",
332 | "    count_works_retrieved += len(this_work_references)\n",
333 | "print(f\"Done collecting references. We retrieved {count_works_retrieved} works.\")"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "Now we have collected the referenced papers for each of the university's papers we collected in the first step. The next step is to save the data to a CSV file."
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 9,
346 | "metadata": {},
347 | "outputs": [],
348 | "source": [
349 | "# Function to shorten the OpenAlex ID to make it better for display\n",
350 | "def make_short_id(long_id):\n",
351 | "    short_id = long_id.replace(\"https://openalex.org/\", \"\")\n",
352 | "    return short_id"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": 10,
358 | "metadata": {},
359 | "outputs": [],
360 | "source": [
361 | "# Write each citing -> cited pair of works to a CSV file\n",
362 | "output_filename = \"tasmania_paper_references.csv\"\n",
363 | "with open(output_filename, 'w', newline='') as f:\n",
364 | "    # initialize the csv writer for this file\n",
365 | "    writer = csv.writer(f)\n",
366 | "\n",
367 | "    # write a header row at the top\n",
368 | "    header = ['citing_paper_id', 'cited_paper_id']\n",
369 | "    writer.writerow(header)\n",
370 | "\n",
371 | "    # loop through each citation, writing one row for each citation\n",
372 | "    for citing_id, cited_works in all_references.items():\n",
373 | "        citing_id_short = make_short_id(citing_id)\n",
374 | "        for cited_work in cited_works:\n",
375 | "            cited_id_short = make_short_id(cited_work['id'])\n",
376 | "            writer.writerow([citing_id_short, cited_id_short])"
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": 11,
382 | "metadata": {},
383 | "outputs": [],
384 | "source": [
385 | "# We can keep track of how many times each work has been cited.\n",
386 | "# One way to do this is to use Python's collections.Counter\n",
387 | "from collections import Counter\n",
388 | "citation_counts = Counter()\n",
389 | "for citing_id, cited_works in all_references.items():\n",
390 | "    citation_counts.update([w['id'] for w in cited_works])"
391 | ]
392 | },
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "We can also save another CSV file with detailed metadata about each of the referenced papers we found. We will include information about the source (journal), and the topics. But you can build out this code to include any information you like (just make sure you are collecting it from the API when you specify the `select` parameter in your API requests above)."
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 12,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "output_filename = \"tasmania_references_paper_metadata.csv\"\n",
407 | "seen_work_ids = set()\n",
408 | "with open(output_filename, 'w', newline='') as f:\n",
409 | "    # initialize the csv writer for this file\n",
410 | "    writer = csv.writer(f)\n",
411 | "\n",
412 | "    # write a header row at the top\n",
413 | "    header = ['work_id', 'title', 'doi', 'utasmania_citation_count', \n",
414 | "              'source_id', 'source_issn', 'source_display_name', \n",
415 | "              'primary_topic_id', 'primary_topic_display_name']\n",
416 | "    writer.writerow(header)\n",
417 | "\n",
418 | "    for cited_works in all_references.values():\n",
419 | "        for w in cited_works:\n",
420 | "            work_id = w['id']\n",
421 | "            work_id_short = make_short_id(work_id)\n",
422 | "            title = w['title']\n",
423 | "            if work_id not in seen_work_ids and title != 'Deleted Work':\n",
424 | "                # We will write a row to the CSV file for this work\n",
425 | "                doi = w['doi']\n",
426 | "                utasmania_citation_count = citation_counts[work_id]\n",
427 | "\n",
428 | "                # Get source (journal)\n",
429 | "                try:\n",
430 | "                    source = w['primary_location']['source']\n",
431 | "                    source_id = source['id']\n",
432 | "                    source_id_short = make_short_id(source_id)\n",
433 | "                    source_issn = source['issn_l']\n",
434 | "                    source_display_name = source['display_name']\n",
435 | "                except (KeyError, TypeError):\n",
436 | "                    source_id_short = None  # reset the value that is written to the CSV\n",
437 | "                    source_issn = None\n",
438 | "                    source_display_name = None\n",
439 | "\n",
440 | "                # Get primary_topic\n",
441 | "                try:\n",
442 | "                    primary_topic = w['topics'][0]\n",
443 | "                    primary_topic_id = primary_topic['id']\n",
444 | "                    primary_topic_id_short = make_short_id(primary_topic_id)\n",
445 | "                    primary_topic_display_name = primary_topic['display_name']\n",
446 | "                except (IndexError, KeyError, TypeError):\n",
447 | "                    primary_topic_id_short = None  # reset the value that is written to the CSV\n",
448 | "                    primary_topic_display_name = None\n",
449 | "\n",
450 | "                # Write this work's row to the CSV file\n",
451 | "                writer.writerow([work_id_short, title, doi, \n",
452 | "                                 utasmania_citation_count, source_id_short, \n",
453 | "                                 source_issn, source_display_name, \n",
454 | "                                 primary_topic_id_short, primary_topic_display_name])\n",
455 | "\n",
456 | "                seen_work_ids.add(work_id)\n"
457 | ]
458 | },
459 | {
460 | "cell_type": "markdown",
461 | "metadata": {},
462 | "source": [
463 | "Now we have two CSV files:\n",
464 | "\n",
465 | "* `tasmania_paper_references.csv` has a two-column edge list of citing work -> cited work\n",
466 | "* `tasmania_references_paper_metadata.csv` has metadata about each cited work\n",
467 | "\n",
468 | "You could open these files in a spreadsheet program like Excel to do additional analysis, or continue working with the data in Python."
469 | ]
470 | }
471 | ],
472 | "metadata": {
473 | "kernelspec": {
474 | "display_name": "venv",
475 | "language": "python",
476 | "name": "python3"
477 | },
478 | "language_info": {
479 | "codemirror_mode": {
480 | "name": "ipython",
481 | "version": 3
482 | },
483 | "file_extension": ".py",
484 | "mimetype": "text/x-python",
485 | "name": "python",
486 | "nbconvert_exporter": "python",
487 | "pygments_lexer": "ipython3",
488 | "version": "3.10.9"
489 | }
490 | },
491 | "nbformat": 4,
492 | "nbformat_minor": 2
493 | }
494 |
--------------------------------------------------------------------------------
/notebooks/getting-started/get-random-entity.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "230c97f7-1471-4c29-bf95-decb29c2a2ae",
6 | "metadata": {},
7 | "source": [
8 | "
"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "id": "f133f73b-c2df-4f5b-ab14-2ea668d47052",
18 | "metadata": {},
19 | "source": [
20 | "# Turn the page\n",
21 | "❓ Let's say we query OpenAlex for a [list of entities](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities). By default the API only gives us the first 25 results of the list. Why is that ❓\n",
22 | "\n",
23 | ">Just like books split large amounts of text and distribute it onto **pages**, the OpenAlex API does the same with a (potentially massive) list of entities.\n",
24 | "\n",
25 | "It makes the data more manageable for both sides: We get small amounts of data that fit into our computer's memory in a reasonable amount of time, while the OpenAlex API needs to process less data at once and can serve more requests to more users.\n",
26 | "\n",
27 | "\n",
28 | "👉 So, coming back to our question of why only 25 results: That is only the first page with a partial list of results! \n",
29 | "The API even tells us this. Every page includes a **meta section** with the following information:\n",
30 | "\n",
31 | "\n",
32 | "\n",
33 | "\n",
34 | "\n",
35 | "\n",
36 | "In order to get the complete list, we need to \"_leaf through_\" all the pages. But how do we do that? \n",
37 | "There are two techniques the OpenAlex API offers: **_🔢 basic paging_** and **_↪️ cursor paging_**. Let's get to know them!\n",
38 | "\n",
39 | "\n",
40 | "💡 **Use the Polite Pool**  \n",
41 | "While it is always a good idea to use the polite pool, this holds especially true for paging. The polite pool has much faster and more consistent response times, so over many requests these gains add up and speed up your application!\n",
42 | ""
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "id": "a9d7fae8-4ee2-40fa-9db1-dbd1002811f6",
48 | "metadata": {},
49 | "source": [
50 | "\n",
51 | "\n",
52 | "## 🔢 Basic paging\n",
53 | "[Basic paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#basic-paging) is the simplest form of paging and works like this: \n",
54 | "* All pages are numbered **from 1 to n** \n",
55 | "*We can determine n by dividing meta's `count` by `per_page` and rounding the result up to the next integer.* \n",
56 | "* To request one of the pages, we add the **`page` parameter** to the URL and put the page number as its value, \n",
57 | "e.g. to request page 2, we use https://api.openalex.org/works?filter=author.id:A5048491430&page=2\n",
58 | "\n",
59 | "### Within limits\n",
60 | "\n",
61 | "⚠️ While basic paging is easy to use, it only works for the first 10,000 results of any list. \n",
62 | "If we want to see more than 10,000 results, we'll need to use cursor paging.\n",
63 | "\n",
64 | "\n",
65 | "### Example\n",
66 | "Let's look at an example, where we want to retrieve a complete list of all publications from an author and print their OpenAlex IDs. \n",
67 | "Given the OpenAlex ID for the author `A5048491430` the URL would be: https://api.openalex.org/works?filter=author.id:A5048491430.\n",
68 | "\n",
69 | "To loop through all pages, we start by setting `page=1` and then repeating:\n",
70 | "* request the specified page by adding the `page` parameter to the URL\n",
71 | "* print all of the OpenAlex IDs from the publications on this page in blocks of five\n",
72 | "* update `page` parameter to `page`+1\n",
73 | "\n",
74 | "until *either* there are no more results on the requested page *or* the next request would exceed 10,000 results."
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 1,
80 | "id": "98e2d323-c7e2-4003-9e84-1a70551756c7",
81 | "metadata": {},
82 | "outputs": [
83 | {
84 | "name": "stdout",
85 | "output_type": "stream",
86 | "text": [
87 | "\n",
88 | "https://api.openalex.org/works?filter=author.id:A5048491430&page=1\n",
89 | "W2046766973\tW2741809807\tW2045657963\tW1572136682\tW2066415719\n",
90 | "W2170531319\tW1963524534\tW1553564559\tW2003014790\tW2051771537\n",
91 | "W1987881751\tW2980172586\tW2095083909\tW1528782725\tW2102613218\n",
92 | "W4235038322\tW2014140050\tW2109312864\tW3071882161\tW1501540670\n",
93 | "W2103827239\tW4229010617\tW2133737815\tW4366077396\tW2171848392\n",
94 | "\n",
95 | "https://api.openalex.org/works?filter=author.id:A5048491430&page=2\n",
96 | "W4245410681\tW3021154342\tW2017292130\tW2168771768\tW4213202391\n",
97 | "W4236031980\tW1945323029\tW2103382090\tW2105695765\tW3084168212\n",
98 | "W4211010643\tW1934573562\tW2050143895\tW2065622609\tW2110180658\n",
99 | "W2941875476\tW4237216357\tW4242907897\tW4244183537\tW4247478427\n",
100 | "W4287670050\tW104609242\tW1972136887\tW2005148091\tW2010883332\n",
101 | "\n",
102 | "https://api.openalex.org/works?filter=author.id:A5048491430&page=3\n",
103 | "W2108112433\tW2154768595\tW2255028491\tW2284153834\tW2307679124\n",
104 | "W2398849157\tW2402184614\tW2414739039\tW2613086963\tW2727815292\n",
105 | "W2740744046\tW2949915600\tW2951362513\tW2979437137\tW3084303366\n",
106 | "W3168937413\tW3206844309\tW4221043181\tW4230863633\tW4237614390\n",
107 | "W4240735862\tW4244937397\tW4246220990\tW4252159547\tW4252662598\n",
108 | "\n",
109 | "https://api.openalex.org/works?filter=author.id:A5048491430&page=4\n",
110 | "W4288680697\tW4299928665\tW4301303362\t"
111 | ]
112 | }
113 | ],
114 | "source": [
115 | "import requests\n",
116 | "\n",
117 | "# url with a placeholder for page number\n",
118 | "example_url_with_page = 'https://api.openalex.org/works?filter=author.id:A5048491430&page={}'\n",
119 | "\n",
120 | "page = 1\n",
121 | "has_more_pages = True\n",
122 | "fewer_than_10k_results = True\n",
123 | "\n",
124 | "# loop through pages\n",
125 | "while has_more_pages and fewer_than_10k_results:\n",
126 | "\n",
127 | "    # set page value and request page from OpenAlex\n",
128 | "    url = example_url_with_page.format(page)\n",
129 | "    print('\\n' + url)\n",
130 | "    page_with_results = requests.get(url).json()\n",
131 | "\n",
132 | "    # loop through partial list of results\n",
133 | "    results = page_with_results['results']\n",
134 | "    for i, work in enumerate(results):\n",
135 | "        openalex_id = work['id'].replace(\"https://openalex.org/\", \"\")\n",
136 | "        print(openalex_id, end='\\t' if (i+1)%5!=0 else '\\n')\n",
137 | "\n",
138 | "    # next page\n",
139 | "    page += 1\n",
140 | "\n",
141 | "    # end loop when either there are no more results on the requested page \n",
142 | "    # or the next request would exceed 10,000 results\n",
143 | "    per_page = page_with_results['meta']['per_page']\n",
144 | "    has_more_pages = len(results) == per_page\n",
145 | "    fewer_than_10k_results = per_page * page <= 10000"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "id": "8cc5ccb4-a896-48ed-9adc-1df242028b34",
151 | "metadata": {},
152 | "source": [
153 | "\n",
154 | "\n",
155 | "## ↪️ Cursor paging\n",
156 | "[Cursor paging](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging) is a bit more complicated than basic paging, but it allows us to access as many records as we like. \n",
157 | "\n",
158 | "\n",
159 | "To use cursor paging,\n",
160 | "* we add the **`cursor` parameter** with a start value of `*` to our first query, \n",
161 | "e.g. https://api.openalex.org/works?filter=author.id:A5048491430&cursor=*\n",
162 | "\n",
163 | "* The response to our query will now include a `next_cursor` value in the response's `meta` section. \n",
164 | "To retrieve the next page, we **copy `meta.next_cursor`** into the cursor field of our URL.\n",
165 | "\n",
166 | "* To get all the results, we keep repeating the second step until `meta.next_cursor` is null.\n",
167 | "\n",
168 | "\n",
169 | "\n",
170 | "\n",
171 | "\n",
172 | "### With great power comes great responsibility\n",
173 | "Cursor paging is very powerful, since there is no limit on the number of pages you can request. Please use it responsibly!\n",
174 | "\n",
175 | "🚫 **Don't use cursor paging to download a very large dataset, or the whole dataset.**\n",
176 | "\n",
177 | "* It's bad for you because it will take many days to page through a long list like `/works` or `/authors`.\n",
178 | "* It's bad for the OpenAlex API (and other users!) because it puts a massive load on their servers.\n",
179 | "\n",
180 | "\n",
181 | "Instead, download everything at once, using the data snapshot. It's free, easy, fast, and you get all the results in the same format you'd get from the API.\n",
182 | "\n",
183 | "\n",
184 | "### Example\n",
185 | "Let's look at the same example as before, where we want to retrieve a complete list of all publications from an author and print their OpenAlex IDs. \n",
186 | "\n",
187 | "To loop through all pages, we start by setting `cursor=*` and then repeating:\n",
188 | "* request the specified page by adding the `cursor` parameter to the URL\n",
189 | "* print all of the OpenAlex IDs from the publications on this page in blocks of five\n",
190 | "* update `cursor` parameter to `meta.next_cursor`\n",
191 | "\n",
192 | "until `meta.next_cursor` is null and the list of results is empty."
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 2,
198 | "id": "0bd272c4-037b-4587-8b9e-2b8fa0a1b269",
199 | "metadata": {},
200 | "outputs": [
201 | {
202 | "name": "stdout",
203 | "output_type": "stream",
204 | "text": [
205 | "\n",
206 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=*\n",
207 | "W2046766973\tW2741809807\tW2045657963\tW1572136682\tW2066415719\n",
208 | "W2170531319\tW1963524534\tW1553564559\tW2003014790\tW2051771537\n",
209 | "W1987881751\tW2980172586\tW2095083909\tW1528782725\tW2102613218\n",
210 | "W4235038322\tW2014140050\tW2109312864\tW3071882161\tW1501540670\n",
211 | "W2103827239\tW4229010617\tW2133737815\tW4366077396\tW2171848392\n",
212 | "\n",
213 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=Ils3LCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzIxNzE4NDgzOTInXSI=\n",
214 | "W4245410681\tW3021154342\tW2017292130\tW2168771768\tW4213202391\n",
215 | "W4236031980\tW1945323029\tW2103382090\tW2105695765\tW3084168212\n",
216 | "W4211010643\tW1934573562\tW2050143895\tW2065622609\tW2110180658\n",
217 | "W2941875476\tW4237216357\tW4242907897\tW4244183537\tW4247478427\n",
218 | "W4287670050\tW104609242\tW1972136887\tW2005148091\tW2010883332\n",
219 | "\n",
220 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzIwMTA4ODMzMzInXSI=\n",
221 | "W2108112433\tW2154768595\tW2255028491\tW2284153834\tW2307679124\n",
222 | "W2398849157\tW2402184614\tW2414739039\tW2613086963\tW2727815292\n",
223 | "W2740744046\tW2949915600\tW2951362513\tW2979437137\tW3084303366\n",
224 | "W3168937413\tW3206844309\tW4221043181\tW4230863633\tW4237614390\n",
225 | "W4240735862\tW4244937397\tW4246220990\tW4252159547\tW4252662598\n",
226 | "\n",
227 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzQyNTI2NjI1OTgnXSI=\n",
228 | "W4288680697\tW4299928665\tW4301303362\t\n",
229 | "https://api.openalex.org/works?filter=author.id:A5048491430&cursor=IlswLCAnaHR0cHM6Ly9vcGVuYWxleC5vcmcvVzQzMDEzMDMzNjInXSI=\n"
230 | ]
231 | }
232 | ],
233 | "source": [
234 | "import requests\n",
235 | "\n",
236 | "# url with a placeholder for cursor\n",
237 | "example_url_with_cursor = 'https://api.openalex.org/works?filter=author.id:A5048491430&cursor={}'\n",
238 | "\n",
239 | "cursor = '*'\n",
240 | "\n",
241 | "# loop through pages\n",
242 | "while cursor:\n",
243 | "    \n",
244 | "    # set cursor value and request page from OpenAlex\n",
245 | "    url = example_url_with_cursor.format(cursor)\n",
246 | "    print(\"\\n\" + url)\n",
247 | "    page_with_results = requests.get(url).json()\n",
248 | "    \n",
249 | "    # loop through partial list of results\n",
250 | "    results = page_with_results['results']\n",
251 | "    for i,work in enumerate(results):\n",
252 | "        openalex_id = work['id'].replace(\"https://openalex.org/\", \"\")\n",
253 | "        print(openalex_id, end='\\t' if (i+1)%5!=0 else '\\n')\n",
254 | "\n",
255 | "    # update cursor to meta.next_cursor\n",
256 | "    cursor = page_with_results['meta']['next_cursor']"
257 | ]
258 | },
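The cursor loop above can also be wrapped in a reusable generator. What follows is a minimal sketch rather than part of the original notebook: the names `fetch_json` and `iter_cursor_pages` are hypothetical, `max_pages` is an added safety cap, and the standard library's `urllib` stands in for `requests` so the block has no third-party dependency.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode its JSON body (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_cursor_pages(base_url, fetch=fetch_json, max_pages=None):
    """Yield the `results` list of each page until `meta.next_cursor` is null.

    `base_url` must already contain a query string (for example a `filter`).
    `max_pages` is an optional cap, added here as a safeguard (assumption).
    """
    cursor = "*"
    pages_seen = 0
    while cursor and (max_pages is None or pages_seen < max_pages):
        # Percent-encode the cursor, but keep the initial "*" readable.
        encoded_cursor = urllib.parse.quote(cursor, safe="*")
        page = fetch(f"{base_url}&cursor={encoded_cursor}")
        yield page["results"]
        pages_seen += 1
        cursor = page["meta"]["next_cursor"]
```

Calling `iter_cursor_pages("https://api.openalex.org/works?filter=author.id:A5048491430")` would walk the same pages as the loop above. Passing a small `max_pages` while experimenting is one way to avoid paging through a very long list by accident.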
259 | {
260 | "cell_type": "markdown",
261 | "id": "b3ea64d9-e5ab-4396-bbf3-41ec9318ff69",
262 | "metadata": {},
263 | "source": [
264 | "\n",
265 | "\n",
266 | "What we covered in this notebook is quite technical and might be a lot for beginners to take in, so\n",
267 | "please don't worry if you need to reread it or need additional clarification. \n",
268 | "The main concept to take away is that \n",
269 | "* the OpenAlex API distributes result lists into smaller chunks called pages \n",
270 | "* and thus to retrieve a complete result list, we have to manually or programmatically \"leaf\" through these pages.\n",
271 | "\n",
272 | "Happy paging! 😎"
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "id": "25ec853d",
278 | "metadata": {},
279 | "source": []
280 | }
281 | ],
282 | "metadata": {
283 | "kernelspec": {
284 | "display_name": "Python 3 (ipykernel)",
285 | "language": "python",
286 | "name": "python3"
287 | },
288 | "language_info": {
289 | "codemirror_mode": {
290 | "name": "ipython",
291 | "version": 3
292 | },
293 | "file_extension": ".py",
294 | "mimetype": "text/x-python",
295 | "name": "python",
296 | "nbconvert_exporter": "python",
297 | "pygments_lexer": "ipython3",
298 | "version": "3.10.9"
299 | }
300 | },
301 | "nbformat": 4,
302 | "nbformat_minor": 5
303 | }
304 |
--------------------------------------------------------------------------------
/notebooks/getting-started/premium.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
"
12 | ]
13 | },
14 | {
15 | "attachments": {},
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "# Getting started with OpenAlex Premium\n",
20 | "\n",
21 | "In this tutorial, we're going to learn how to get started using [OpenAlex Premium](https://openalex.org/pricing). This subscription service provides some features beyond the free services. One of the most important of these features is **faster updates,** allowing you to keep your data fully synced with OpenAlex.\n",
22 | "\n",
23 | "The way we do this is by using the `from_created_date` [(doc)](https://docs.openalex.org/api-entities/works/filter-works#from_created_date) or the `from_updated_date` [(doc)](https://docs.openalex.org/api-entities/works/filter-works#from_updated_date) filters. These filters allow you to get the new works you need to keep your data updated, and they require a Premium API Key to work.\n",
24 | "\n",
25 | "We're going to set up the code to poll the OpenAlex API for newly updated works on a regular basis, once per day."
26 | ]
27 | },
28 | {
29 | "attachments": {},
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "First, we need to get the API key you received by signing up for OpenAlex Premium. (Don't have a key yet? [Contact us right now to learn more about getting premium!](https://openalex.org/pricing))\n",
34 | "\n",
35 | "We'll store our API key in a variable called `my_api_key`. There are several ways to do this. You could just put it into the code, but since it is sensitive information that we don't want others to see, we're going to get it from an [environment variable, which we'll store in a `.env` file.](https://towardsdatascience.com/the-quick-guide-to-using-environment-variables-in-python-d4ec9291619e)\n",
36 | "\n",
37 | "This is just a text file with the name `.env`, that looks like this:\n",
38 | "```\n",
39 | "API_KEY=\n",
40 | "```\n",
41 | "Put your OpenAlex Premium API Key after the `=` sign.\n"
42 | ]
43 | },
44 | {
45 | "attachments": {},
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "Now, to set our `my_api_key` variable, we'll set our environment using the [`python-dotenv`](https://pypi.org/project/python-dotenv/) library, then get the variable using the `os.getenv()` function."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 1,
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "name": "stdout",
59 | "output_type": "stream",
60 | "text": [
61 | "API key is set!\n"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "import os\n",
67 | "from dotenv import load_dotenv\n",
68 | "\n",
69 | "load_dotenv('.env')\n",
70 | "my_api_key = os.getenv('API_KEY')\n",
71 | "if my_api_key is None:\n",
72 | "    print(\"No API key found!!!\")\n",
73 | "else:\n",
74 | "    print(\"API key is set!\")"
75 | ]
76 | },
77 | {
78 | "attachments": {},
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "Our plan is to get all of the works that have been updated in the last four hours. So let's construct a URL that will request that information from the API:"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 4,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "import requests\n",
92 | "from datetime import datetime, timedelta"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "To use the API key, we have two options. The first option is to include it in the URL, as an `api_key` parameter:"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 9,
105 | "metadata": {},
106 | "outputs": [
107 | {
108 | "name": "stdout",
109 | "output_type": "stream",
110 | "text": [
111 | "Our formatted date-time string looks like this: 2023-11-07T19:22:46.848349\n",
112 | "Requesting newly updated works, including the API key as a URL query parameter...\n"
113 | ]
114 | },
115 | {
116 | "name": "stdout",
117 | "output_type": "stream",
118 | "text": [
119 | "Retrieved 25 works, out of 1967482 works updated since 2023-11-07T19:22:46.848349\n"
120 | ]
121 | }
122 | ],
123 | "source": [
124 | "four_hours_ago = datetime.utcnow() - timedelta(hours=4)\n",
125 | "four_hours_ago_formatted_string = four_hours_ago.isoformat()\n",
126 | "print(f\"Our formatted date-time string looks like this: {four_hours_ago_formatted_string}\")\n",
127 | "# Construct a URL to request works from the last four hours, including our API key as a URL query parameter.\n",
128 | "url = f\"https://api.openalex.org/works?filter=from_updated_date:{four_hours_ago_formatted_string}&api_key={my_api_key}\"\n",
129 | "print(f\"Requesting newly updated works, including the API key as a URL query parameter...\")\n",
130 | "r = requests.get(url)\n",
131 | "updated_works = r.json()\n",
132 | "\n",
133 | "count_works_retrieved = len(updated_works['results'])\n",
134 | "count_works_total = updated_works['meta']['count']\n",
135 | "print(f\"Retrieved {count_works_retrieved} works, out of {count_works_total} works updated since {four_hours_ago_formatted_string}\")\n"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "Success! We've used our API key to request the works updated in the last four hours. The response holds only the first page of results (25 works); to get all of the data, [you can page through the results](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging)."
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "Alternatively, we can include the API key as a request header:"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": 10,
155 | "metadata": {},
156 | "outputs": [
157 | {
158 | "name": "stdout",
159 | "output_type": "stream",
160 | "text": [
161 | "Requesting newly updated works, including the API key in the request headers...\n",
162 | "Retrieved 200 works, out of 1965510 works updated since 2023-11-07T19:22:46.848349\n"
163 | ]
164 | }
165 | ],
166 | "source": [
167 | "# Request works from the last four hours, including our API key in the request headers\n",
168 | "url = f\"https://api.openalex.org/works?filter=from_updated_date:{four_hours_ago_formatted_string}&per-page=200\"\n",
169 | "headers = {\"api_key\": my_api_key}\n",
170 | "print(f\"Requesting newly updated works, including the API key in the request headers...\")\n",
171 | "r = requests.get(url, headers=headers)\n",
172 | "updated_works = r.json()\n",
173 | "\n",
174 | "count_works_retrieved = len(updated_works['results'])\n",
175 | "count_works_total = updated_works['meta']['count']\n",
176 | "print(f\"Retrieved {count_works_retrieved} works, out of {count_works_total} works updated since {four_hours_ago_formatted_string}\")\n"
177 | ]
178 | },
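The timestamp and URL construction used in the two cells above can be collected into one helper. This is a sketch under stated assumptions: the name `updated_works_url` is hypothetical, and it uses `datetime.now(timezone.utc)` because `datetime.utcnow()` is deprecated as of Python 3.12. The API key is deliberately left out of the URL, so it would still have to be sent as shown above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"


def updated_works_url(hours, now=None, per_page=200):
    """Build a /works URL filtered to works updated in the last `hours` hours.

    `now` must be a timezone-aware datetime; it defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    since = (now - timedelta(hours=hours)).astimezone(timezone.utc)
    # Drop the UTC offset so the string has the same shape as the one
    # printed in the notebook (e.g. 2023-11-07T19:22:46.848349).
    since = since.replace(tzinfo=None)
    params = {
        "filter": f"from_updated_date:{since.isoformat()}",
        "per-page": per_page,
    }
    # urlencode percent-encodes the colons in the filter value.
    return f"{BASE_URL}?{urlencode(params)}"
```

Accepting `now` as a parameter keeps the helper testable: a scheduled job can pass the time of its previous successful run instead of relying on a fixed four-hour window.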
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "You can retrieve up to 200 works per page. Again, to get all of the results, [you can page through the results](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging)."
184 | ]
185 | },
186 | {
187 | "attachments": {},
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "Keep in mind that this method will get works that have been updated with *any change at all*, including increases in various counts. If you're only interested in *new* works, you could use the [`from_created_date`](https://docs.openalex.org/api-entities/works/filter-works#from_created_date) filter instead of `from_updated_date`, which will give a much smaller number of works.\n",
192 | "\n",
193 | "You'll need to do two things to keep your data fresh:\n",
194 | "1. Set up a script that does something similar to what we did above, and that runs on schedule once per day (using a [cron job](https://en.wikipedia.org/wiki/Cron), for example).\n",
195 | "2. Update your database with the new data you've grabbed from our API (such as using a SQL script)."
196 | ]
197 | },
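Step 2 in the list above, updating a database with the works retrieved from the API, could look roughly like the following. It is a minimal sketch using SQLite from the standard library: the table name `works`, its three columns, and the function `upsert_works` are all hypothetical, and the `updated_date` field is read with `.get()` in case a record lacks it.

```python
import json
import sqlite3


def upsert_works(conn, works):
    """Insert or overwrite one row per work, keyed by its OpenAlex ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS works ("
        "id TEXT PRIMARY KEY, updated_date TEXT, raw_json TEXT)"
    )
    rows = [
        (work["id"], work.get("updated_date"), json.dumps(work))
        for work in works
    ]
    # INSERT OR REPLACE overwrites the existing row when the id is already stored.
    conn.executemany(
        "INSERT OR REPLACE INTO works (id, updated_date, raw_json) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

A daily job would call `upsert_works(conn, page["results"])` once per page of results. Storing the raw JSON keeps the sketch short; a real schema would likely break out the fields you query most often.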
198 | {
199 | "attachments": {},
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "And that's it! Using this method, you can keep your data up to date with regular API requests, instead of waiting for new data snapshots.\n",
204 | "\n",
205 | "Enjoy!"
206 | ]
207 | }
208 | ],
209 | "metadata": {
210 | "kernelspec": {
211 | "display_name": "venv",
212 | "language": "python",
213 | "name": "python3"
214 | },
215 | "language_info": {
216 | "codemirror_mode": {
217 | "name": "ipython",
218 | "version": 3
219 | },
220 | "file_extension": ".py",
221 | "mimetype": "text/x-python",
222 | "name": "python",
223 | "nbconvert_exporter": "python",
224 | "pygments_lexer": "ipython3",
225 | "version": "3.10.10"
226 | },
227 | "orig_nbformat": 4
228 | },
229 | "nbformat": 4,
230 | "nbformat_minor": 2
231 | }
232 |
--------------------------------------------------------------------------------
/notebooks/institutions/japan_sources.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "attachments": {},
5 | "cell_type": "markdown",
6 | "metadata": {},
7 | "source": [
8 | "# What are the publication sources located in Japan?\n",
9 | "\n",
10 | "When it comes to geographic location data of works in OpenAlex, there are generally two ways to look at it. One is the location of the *authors* (or rather, the institutional affiliation of the authors), and the other is the location of the work's *source*. In OpenAlex, [sources are where works are hosted.](https://docs.openalex.org/api-entities/venues) Examples of sources include journals, conferences, and institutional repositories. In this tutorial, we are going to look into the sources located in Japan.\n",
11 | "\n",
12 | "Our questions are:\n",
13 | "1. How many sources of scholarly works are in Japan?\n",
14 | "2. What are the types of these sources? Journals? Repositories? Conferences?\n",
15 | " - For journals, what are the publishers?\n",
16 | " - For repositories, what are the host institutions?\n",
17 | "3. What are the names of these sources?\n",
18 | "4. How many works have Japanese sources? How does this vary over time?\n"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "import requests"
28 | ]
29 | },
30 | {
31 | "attachments": {},
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "### Question 1: How many sources of scholarly works are in Japan?\n",
36 | "\n",
37 | "Let's start with the first question: How many sources of scholarly works are in Japan?\n",
38 | "\n",
39 | "To do this, we will query the `/sources` API endpoint, and use a **filter** to limit it to sources where the `country_code` is `JP`."
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 2,
45 | "metadata": {},
46 | "outputs": [
47 | {
48 | "name": "stdout",
49 | "output_type": "stream",
50 | "text": [
51 | "There are 2265 sources with country_code 'JP' (Japan)\n"
52 | ]
53 | }
54 | ],
55 | "source": [
56 | "country_code = 'JP'\n",
57 | "url = f\"https://api.openalex.org/sources\"\n",
58 | "params = {\n",
59 | " 'filter': f'country_code:{country_code}',\n",
60 | "}\n",
61 | "r = requests.get(url, params=params)\n",
62 | "num_sources = r.json()['meta']['count']\n",
63 | "print(f\"There are {num_sources} sources with country_code 'JP' (Japan)\")"
64 | ]
65 | },
66 | {
67 | "attachments": {},
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "### Question 2: What are the types of these sources?\n",
72 | "The next question is: What are the *types* of these sources? Possible types are listed in the API docs on the [Source object](https://docs.openalex.org/api-entities/venues/venue-object#type): `journal`, `repository`, `conference`, `ebook platform`.\n",
73 | "\n",
74 | "To answer this question, we can add a `group_by` to our API query, grouping by the `type` field and counting the number of sources:"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "name": "stdout",
84 | "output_type": "stream",
85 | "text": [
86 | "Number of sources in Japan for each *type* of source:\n",
87 | " \"journal\": 2158 sources\n",
88 | " \"repository\": 47 sources\n",
89 | " \"book series\": 40 sources\n",
90 | " \"conference\": 19 sources\n",
91 | " \"ebook platform\": 1 sources\n",
92 | " \"other\": 0 sources\n"
93 | ]
94 | }
95 | ],
96 | "source": [
97 | "params = {\n",
98 | " 'filter': f'country_code:{country_code}',\n",
99 | " 'group_by': 'type',\n",
100 | "}\n",
101 | "r = requests.get(url, params=params)\n",
102 | "print(\"Number of sources in Japan for each *type* of source:\")\n",
103 | "for item in r.json()['group_by']:\n",
104 | "    print(f' \"{item[\"key\"]}\": {item[\"count\"]} sources')"
105 | ]
106 | },
107 | {
108 | "attachments": {},
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "Most of the Japanese sources are of type `journal`. A few sources have other types, including repositories, book series, and conferences."
113 | ]
114 | },
115 | {
116 | "attachments": {},
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "### Question 3: What are the names of these sources?\n",
121 | "The next question is: What are the *names* of these sources?\n",
122 | "\n",
123 | "To answer this, we need the API to give us all 2,265 sources. This means we will need to use the [paging technique](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging).\n",
124 | "\n",
125 | "We'll adapt the technique from the [paging notebook](../getting-started/paging.ipynb) to collect the names and ISSNs of the sources."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 4,
131 | "metadata": {},
132 | "outputs": [
133 | {
134 | "name": "stdout",
135 | "output_type": "stream",
136 | "text": [
137 | "collected 2265 sources (using 91 api calls)\n"
138 | ]
139 | }
140 | ],
141 | "source": [
142 | "# page through to get all sources\n",
143 | "# use the basic (page-number) paging technique from `paging.ipynb`\n",
144 | "# the page number is passed through the `page` query parameter\n",
145 | "country_code = 'JP'\n",
146 | "url = f\"https://api.openalex.org/sources\"\n",
147 | "params = {\n",
148 | "    'filter': f'country_code:{country_code}',\n",
149 | "    'page': 1, # initialize `page` param to 1\n",
150 | "}\n",
151 | "\n",
152 | "has_more_pages = True\n",
153 | "fewer_than_10k_results = True\n",
154 | "\n",
155 | "# We will collect the data in a variable called `japanese_sources`.\n",
156 | "# Initialize this as an empty list, which we will append to\n",
157 | "japanese_sources = []\n",
158 | "\n",
159 | "# loop through pages\n",
160 | "loop_index = 0\n",
161 | "while has_more_pages and fewer_than_10k_results:\n",
162 | "    \n",
163 | "    page_with_results = requests.get(url, params=params).json()\n",
164 | "    \n",
165 | "    # loop through partial list of results\n",
166 | "    results = page_with_results['results']\n",
167 | "    for api_result in results:\n",
168 | "        # # Collect the fields we are interested in, for this source\n",
169 | "        # source = {field: api_result[field] for field in fields}\n",
170 | "        # Append this source to our `japanese_sources` list\n",
171 | "        japanese_sources.append(api_result)\n",
172 | "\n",
173 | "    # next page\n",
174 | "    params['page'] += 1\n",
175 | "    \n",
176 | "    # end loop when either there are no more results on the requested page \n",
177 | "    # or the next request would exceed 10,000 results\n",
178 | "    per_page = page_with_results['meta']['per_page']\n",
179 | "    has_more_pages = len(results) == per_page\n",
180 | "    fewer_than_10k_results = per_page * params['page'] <= 10000\n",
181 | "    loop_index += 1 # one more request completed\n",
182 | "print(f\"collected {len(japanese_sources)} sources (using {loop_index} api calls)\")"
183 | ]
184 | },
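The loop above makes one request per page, so the number of API calls can be estimated before paging starts. The following is a small sketch with hypothetical helper names. It assumes the default page size of 25 that `meta.per_page` reports when no `per-page` parameter is sent, and it takes the 10,000-result ceiling from the loop's own stopping condition.

```python
import math

MAX_BASIC_PAGING_RESULTS = 10_000  # ceiling used by the loop's stopping condition


def pages_needed(total_count, per_page=25):
    """Number of requests required to read `total_count` results."""
    return math.ceil(total_count / per_page)


def within_basic_paging_limit(total_count):
    """True if page-number paging can reach every result."""
    return total_count <= MAX_BASIC_PAGING_RESULTS


print(pages_needed(2265))               # 91
print(pages_needed(2265, per_page=200)) # 12
print(within_basic_paging_limit(2265))  # True
```

Requesting 200 results per page cuts the 2,265 Japanese sources from 91 requests to 12. Result sets larger than the ceiling would need cursor paging instead.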
185 | {
186 | "attachments": {},
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "Now would be a good time for us to put our sources into a Pandas dataframe. This is just a way to organize the data, make it more spreadsheet-like, and make it more convenient to work with."
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 5,
196 | "metadata": {},
197 | "outputs": [
198 | {
199 | "name": "stdout",
200 | "output_type": "stream",
201 | "text": [
202 | "Dataframe has 2265 rows and 7 columns.\n",
203 | "\n",
204 | "The first five sources are named:\n",
205 | " Journal of the Japan Society of Mechanical Engineers\n",
206 | " Nippon Hoshasen Gijutsu Gakkai Zasshi\n",
207 | " Journal of the Physical Society of Japan\n",
208 | " Nihon rinsho. Japanese journal of clinical medicine\n",
209 | " Bulletin of the Chemical Society of Japan\n"
210 | ]
211 | }
212 | ],
213 | "source": [
214 | "import pandas as pd\n",
215 | "\n",
216 | "# Each source in our list of `japanese_sources` contains a lot of data, some of it complex and nested.\n",
217 | "# So let's limit our dataframe to include only some of the fields.\n",
218 | "\n",
219 | "# Define the fields that we are interested in collecting:\n",
220 | "fields = [\n",
221 | " 'id',\n",
222 | " 'issn_l',\n",
223 | " 'display_name',\n",
224 | " 'host_organization',\n",
225 | " 'works_count',\n",
226 | " 'cited_by_count',\n",
227 | " 'type',\n",
228 | "]\n",
229 | "\n",
230 | "# One way to limit the dataframe to include only our `fields` is to use the `from_records()` method\n",
231 | "# and specify only the columns we want.\n",
232 | "df_sources = pd.DataFrame.from_records(japanese_sources, columns=fields)\n",
233 | "\n",
234 | "num_rows, num_columns = df_sources.shape\n",
235 | "print(f\"Dataframe has {num_rows} rows and {num_columns} columns.\")\n",
236 | "print() # blank line\n",
237 | "print(\"The first five sources are named:\")\n",
238 | "for name in df_sources['display_name'].head(5):\n",
239 | "    print(f\" {name}\")"
240 | ]
241 | },
242 | {
243 | "attachments": {},
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "With Pandas, it is very easy to save the data as a spreadsheet file, in case we want to work with it later."
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 6,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "df_sources.to_csv(\"japan_sources.csv\")"
257 | ]
258 | },
259 | {
260 | "attachments": {},
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "At this point, we can go back and answer a question we skipped over when we moved on to question 3: What are the [`host_organizations`](https://docs.openalex.org/api-entities/venues/venue-object#host_organization) for these sources? In the case of a source of type `journal`, a `host_organization` is a publisher: the company or organization that distributes the works.\n",
265 | "\n",
266 | "The data we have collected from the API contains the field `host_organization`, which is a link that can be fed back into the API to get more information about the organization, such as name, parent companies, and number of works. It would not be hard to collect this data, but for now we will leave it to future work. However, we can use the data we already have to simply *count* the number of different host organizations (publishers, in the case of journals) that host Japanese sources."
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 7,
272 | "metadata": {},
273 | "outputs": [
274 | {
275 | "name": "stdout",
276 | "output_type": "stream",
277 | "text": [
278 | "The Japanese sources are associated with 431 different host organizations (publishers).\n"
279 | ]
280 | }
281 | ],
282 | "source": [
283 | "num_host_orgs = df_sources['host_organization'].nunique()\n",
284 | "print(f\"The Japanese sources are associated with {num_host_orgs} different host organizations (publishers).\")"
285 | ]
286 | },
287 | {
288 | "attachments": {},
289 | "cell_type": "markdown",
290 | "metadata": {},
291 | "source": [
292 | "### Question 4: How many works have Japanese sources? How does this vary over time?\n",
293 | "Finally, let's look at how many works there are with Japanese sources, and the trends of these works over time.\n",
294 | "\n",
295 | "First, we'll just look at the number of works with Japanese sources in the OpenAlex dataset."
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 8,
301 | "metadata": {},
302 | "outputs": [
303 | {
304 | "name": "stdout",
305 | "output_type": "stream",
306 | "text": [
307 | "There are 3,827,227 works (articles) with Japanese sources.\n"
308 | ]
309 | }
310 | ],
311 | "source": [
312 | "num_works = df_sources['works_count'].sum()\n",
313 | "print(f\"There are {num_works:,} works (articles) with Japanese sources.\") # putting \":,\" after num_works tells the formatter to use commas as thousands separators"
314 | ]
315 | },
316 | {
317 | "attachments": {},
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "Next, we'll look at the number of works per year per source. This data is not in our dataframe (we excluded it when we specified only certain `fields`). However, we can go back to our collection of `japanese_sources`, which has the field `counts_by_year`. This field—as we can learn from [the docs](https://docs.openalex.org/api-entities/venues/venue-object#counts_by_year)—contains the source's counts of works by year, for the last ten years, organized as a list of dictionaries."
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 9,
327 | "metadata": {},
328 | "outputs": [
329 | {
330 | "name": "stdout",
331 | "output_type": "stream",
332 | "text": [
333 | "Created a dataframe counting works from year 2012 to year 2049.\n"
334 | ]
335 | }
336 | ],
337 | "source": [
338 | "# Put the data for `counts_by_year` into a Pandas dataframe.\n",
339 | "data = []\n",
340 | "for source in japanese_sources:\n",
341 | "    for year_count in source['counts_by_year']:\n",
342 | "        data.append({\n",
343 | "            'id': source['id'],\n",
344 | "            'year': int(year_count['year']),\n",
345 | "            'works_count': int(year_count['works_count']),\n",
346 | "            'cited_by_count': int(year_count['cited_by_count']),\n",
347 | "        })\n",
348 | "df_counts_by_year = pd.DataFrame(data)\n",
349 | "\n",
350 | "# Each row in the dataframe represents one year of one source.\n",
351 | "# We can group by year and sum the number of works to get the\n",
352 | "# total counts by year.\n",
353 | "\n",
354 | "all_counts_japan = df_counts_by_year.groupby('year')['works_count'].sum()\n",
355 | "print(f\"Created a dataframe counting works from year {all_counts_japan.index.min()} to year {all_counts_japan.index.max()}.\")"
356 | ]
357 | },
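The group-by-year-and-sum step above does not depend on pandas. As an illustration only, the same aggregation can be sketched in plain Python; the function name `total_works_by_year` is hypothetical and the two sources in `toy_sources` are made-up data, not OpenAlex records.

```python
from collections import defaultdict


def total_works_by_year(sources):
    """Sum `works_count` per year across every source's `counts_by_year`."""
    totals = defaultdict(int)
    for source in sources:
        for year_count in source.get("counts_by_year", []):
            totals[int(year_count["year"])] += int(year_count["works_count"])
    # Return a plain dict ordered by year.
    return dict(sorted(totals.items()))


# Made-up example data with the same shape as the API's `counts_by_year` field.
toy_sources = [
    {"id": "S1", "counts_by_year": [
        {"year": 2021, "works_count": 10, "cited_by_count": 3},
        {"year": 2020, "works_count": 7, "cited_by_count": 5},
    ]},
    {"id": "S2", "counts_by_year": [
        {"year": 2021, "works_count": 4, "cited_by_count": 1},
    ]},
]

print(total_works_by_year(toy_sources))  # {2020: 7, 2021: 14}
```

Each inner dictionary contributes to exactly one year, which is what `groupby('year')['works_count'].sum()` computes on the dataframe.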
358 | {
359 | "attachments": {},
360 | "cell_type": "markdown",
361 | "metadata": {},
362 | "source": [
363 | "We'll use the [seaborn](https://seaborn.pydata.org/index.html) library to plot graphs of the data. Seaborn is built on top of [matplotlib](https://matplotlib.org/), the standard visualization library in Python. It is not the only choice for visualization, but we'll use it now because it is widely used and not too difficult to get started with."
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": 10,
369 | "metadata": {},
370 | "outputs": [
371 | {
372 | "data": {
373 | "text/plain": [
374 | "Text(0.5, 1.0, 'Number of works in Japanese journals')"
375 | ]
376 | },
377 | "execution_count": 10,
378 | "metadata": {},
379 | "output_type": "execute_result"
380 | },
381 | {
382 | "data": {
383 | "image/png": "",
384 | "text/plain": [
385 | ""
386 | ]
387 | },
388 | "metadata": {},
389 | "output_type": "display_data"
390 | }
391 | ],
392 | "source": [
393 | "# Import seaborn\n",
394 | "import seaborn as sns\n",
395 | "\n",
396 | "# Apply the default theme\n",
397 | "sns.set_theme()\n",
398 | "\n",
399 | "# Visualize the data\n",
400 | "g = sns.lineplot(all_counts_japan)\n",
401 | "g.set_ylim(bottom=0)\n",
402 | "g.set_xlim(2012, 2022)\n",
403 | "g.set_ylabel(\"number of works\")\n",
404 | "g.set_title(\"Number of works in Japanese journals\")"
405 | ]
406 | },
407 | {
408 | "attachments": {},
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "We have shown the number of works with Japanese sources in absolute terms, and that the number has been declining over recent years. But this only tells part of the story. Is this a general trend in the data set, or is it specific to Japanese works? To answer this, we want to look at the *relative* number of works, as a percentage of total number of works published in journals.\n",
413 | "\n",
414 | "We need to get the total number of works by year. One way to do this is to query the `/works` API endpoint, and [group by](https://docs.openalex.org/api-entities/works/group-works) the `publication_year`. To match the data we have about sources in Japan, we'll also limit the data to the last ten years, using the [`from_publication_date` convenience filter](https://docs.openalex.org/api-entities/works/filter-works#from_publication_date)."
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 11,
420 | "metadata": {},
421 | "outputs": [],
422 | "source": [
423 | "url = f\"https://api.openalex.org/works\"\n",
424 | "filters = [\n",
425 | " 'primary_location.source.type:journal',\n",
426 | " 'from_publication_date:2012-01-01',\n",
427 | "]\n",
428 | "params = {\n",
429 | " 'filter': \",\".join(filters),\n",
430 | " 'group_by': 'publication_year',\n",
431 | "}\n",
432 | "# make the API query\n",
433 | "r = requests.get(url, params=params)\n",
434 | "\n",
435 | "# Get the data into a pandas dataframe\n",
436 | "counts_data = []\n",
437 | "for row in r.json()['group_by']:\n",
438 | "    counts_data.append({\n",
439 | "        'year': int(row['key']),\n",
440 | "        'works_count': int(row['count']),\n",
441 | "    })\n",
442 | "all_counts_all_countries = pd.DataFrame(counts_data)\n",
443 | "# change the data into a series, with the year as index and the number of works as values.\n",
444 | "# this will match the `all_counts_japan` data\n",
445 | "all_counts_all_countries = all_counts_all_countries.set_index('year')['works_count']"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": 12,
451 | "metadata": {},
452 | "outputs": [
453 | {
454 | "data": {
455 | "text/plain": [
456 | "Text(0, 0.5, 'Relative number of works')"
457 | ]
458 | },
459 | "execution_count": 12,
460 | "metadata": {},
461 | "output_type": "execute_result"
462 | },
463 | {
464 | "data": {
465 | "image/png": "",
466 | "text/plain": [
467 | ""
468 | ]
469 | },
470 | "metadata": {},
471 | "output_type": "display_data"
472 | }
473 | ],
474 | "source": [
475 | "relative_japan = all_counts_japan / all_counts_all_countries\n",
476 | "g = sns.lineplot(relative_japan)\n",
477 | "g.set_ylim(bottom=0)\n",
478 | "g.set_xlim(2012, 2022)\n",
479 | "g.set_title(\"Number of works in Japanese journals relative to total number of works in journals\")\n",
480 | "g.set_ylabel(\"Relative number of works\")"
481 | ]
482 | },
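The division `all_counts_japan / all_counts_all_countries` above relies on pandas aligning the two series by year; a year present in only one of them produces `NaN`. The sketch below spells out the same idea in plain Python. The name `relative_share` is hypothetical and the numbers are made up for illustration.

```python
def relative_share(part_by_year, total_by_year):
    """Return part/total for every year present in both mappings."""
    shares = {}
    for year, part in part_by_year.items():
        total = total_by_year.get(year)
        if total:  # skip years that are missing or have a zero total
            shares[year] = part / total
    return shares


# Made-up counts: 2049 appears only in the first mapping, so it is skipped.
toy_japan = {2020: 50, 2021: 30, 2049: 1}
toy_all = {2020: 200, 2021: 150}

print(relative_share(toy_japan, toy_all))  # {2020: 0.25, 2021: 0.2}
```

Skipping unmatched years here plays the role that `NaN` plays in the pandas version: those years simply do not appear in the plot.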
483 | {
484 | "attachments": {},
485 | "cell_type": "markdown",
486 | "metadata": {},
487 | "source": [
488 | "We have shown that the relative number of works with Japanese sources has been declining over the last ten years. One possible explanation for this is that Japanese authors are publishing less in Japanese sources. Our next step could be to look at papers with Japanese *authors* (institutional affiliations in Japan), and see if they are increasingly publishing in non-Japanese sources. Looking at publications by language would also be interesting—this information [is now available in OpenAlex](https://docs.openalex.org/api-entities/works/work-object#language)."
489 | ]
490 | },
491 | {
492 | "cell_type": "markdown",
493 | "metadata": {},
494 | "source": []
495 | }
496 | ],
497 | "metadata": {
498 | "kernelspec": {
499 | "display_name": "venv",
500 | "language": "python",
501 | "name": "python3"
502 | },
503 | "language_info": {
504 | "codemirror_mode": {
505 | "name": "ipython",
506 | "version": 3
507 | },
508 | "file_extension": ".py",
509 | "mimetype": "text/x-python",
510 | "name": "python",
511 | "nbconvert_exporter": "python",
512 | "pygments_lexer": "ipython3",
513 | "version": "3.10.9"
514 | },
515 | "orig_nbformat": 4,
516 | "vscode": {
517 | "interpreter": {
518 | "hash": "271691dbc4cdb85f541c883090ff5a004cbd8b9c207c2cfed84437fce4e65fdb"
519 | }
520 | }
521 | },
522 | "nbformat": 4,
523 | "nbformat_minor": 2
524 | }
525 |
--------------------------------------------------------------------------------
/notebooks/institutions/oa-percentage.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "cd151571-2976-4e81-a1e2-2cf716466271",
6 | "metadata": {},
7 | "source": [
8 | "