├── results.png ├── results.parquet ├── make_plot.py ├── README.md ├── LICENSE.txt └── arxiv-aws.ipynb /results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mrocklin/arxiv-matplotlib/HEAD/results.png -------------------------------------------------------------------------------- /results.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mrocklin/arxiv-matplotlib/HEAD/results.parquet -------------------------------------------------------------------------------- /make_plot.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | import matplotlib.pyplot as plt 4 | from matplotlib.ticker import PercentFormatter 5 | 6 | import pandas as pd 7 | 8 | # read data 9 | by_month = pd.read_parquet("results.parquet").groupby("date").has_matplotlib.mean() 10 | 11 | 12 | # get figure 13 | fig, ax = plt.subplots(layout="constrained") 14 | # plot the data 15 | ax.plot(by_month, "o", color="k", ms=3) 16 | 17 | # override the default auto limits 18 | ax.set_xlim(left=datetime.date(2004, 1, 1)) 19 | ax.set_ylim(bottom=0) 20 | 21 | # turn on a horizontal grid 22 | ax.grid(axis="y") 23 | 24 | # remove the top and right spines 25 | ax.spines.right.set_visible(False) 26 | ax.spines.top.set_visible(False) 27 | 28 | # format y-ticks as a percent 29 | ax.yaxis.set_major_formatter(PercentFormatter(xmax=1)) 30 | 31 | # add title and labels 32 | ax.set_xlabel("date") 33 | ax.set_ylabel("% of all papers") 34 | ax.set_title("Matplotlib usage on arXiv") 35 | 36 | fig.savefig("fancy_plot.png") 37 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # arXiv Matplotlib Query 2 | 3 | Anecdotally the Matplotlib maintainers were told 4 | 5 | *"About 15% of arXiv papers use 
Matplotlib"* 6 | 7 | Unfortunately the original analysis of this data was lost. We reproduce it here. 8 | 9 | ## Watermark 10 | 11 | Starting in the early 2010s, Matplotlib began embedding the bytes `b"Matplotlib"` in every PNG and PDF that it produces. These bytes persist in the output PDFs stored on arXiv. As a result, it's pretty simple to check if a PDF contains a Matplotlib image. All we have to do is scan through every PDF and look for these bytes; no parsing required. 12 | 13 | ## Data 14 | 15 | The data is stored in a requester-pays bucket at s3://arxiv (more information at https://arxiv.org/help/bulk_data_s3 ) and also on GCS hosted by Kaggle (more information at https://www.kaggle.com/datasets/Cornell-University/arxiv). 16 | 17 | The data is about 1TB in size. We use Dask to process it in parallel. 18 | 19 | ## Contents 20 | 21 | This repository has the notebook used to generate the data, 22 | the data itself as a parquet file, 23 | and a nice image showing growing usage. 24 | 25 | 26 | ## Results 27 | 28 | ![A plot of Matplotlib usage on arXiv showing strong growth from 0% in 2004 to 17% in 2022](results.png?raw=true "Matplotlib usage on arXiv over time") 29 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2022, Matthew Rocklin 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 
15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /arxiv-aws.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "ae977bc3-d3cf-4492-a359-a95f8156fb52", 6 | "metadata": {}, 7 | "source": [ 8 | "# How Popular is Matplotlib?\n", 9 | "\n", 10 | "Anecdotally the Matplotlib maintainers were told \n", 11 | "\n", 12 | "*\"About 15% of arXiv papers use Matplotlib\"*\n", 13 | "\n", 14 | "arXiv is the preeminent repository for scholarly preprint articles. It stores millions of journal articles used across science. It's also publicly accessible, and so we can just scrape the entire thing given enough compute power.\n", 15 | "\n", 16 | "## Watermark\n", 17 | "\n", 18 | "Starting in the early 2010s, Matplotlib began embedding the bytes `b\"Matplotlib\"` in every PNG and PDF that it produces. 
These bytes persist in PDFs that contain Matplotlib plots, including the PDFs stored on arXiv. As a result, it's pretty simple to check if a PDF contains a Matplotlib image. All we have to do is scan through every PDF and look for these bytes; no parsing required.\n", 19 | "\n", 20 | "## Data\n", 21 | "\n", 22 | "The data is stored in a requester pays bucket at s3://arxiv (more information at https://arxiv.org/help/bulk_data_s3 ) and also on GCS hosted by Kaggle (more information at https://www.kaggle.com/datasets/Cornell-University/arxiv). \n", 23 | "\n", 24 | "The data is about 1TB in size. We're going to use Dask for this.\n", 25 | "\n", 26 | "This is a good example of writing plain vanilla Python code to solve a problem, running into issues of scale, and then using Dask to easily jump over those problems." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "6f0965a0-fa87-470b-bd3d-0b5b7ecaca99", 32 | "metadata": {}, 33 | "source": [ 34 | "### Get all filenames\n", 35 | "\n", 36 | "Our data is stored in a requester pays S3 bucket in the `us-east-1` region. Each file is a tar file which contains a directory of papers." 
37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 1, 42 | "id": "e62539ef-5e91-43c5-afa8-0c3fa51b8f11", 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "['arxiv/pdf/arXiv_pdf_0001_001.tar',\n", 49 | " 'arxiv/pdf/arXiv_pdf_0001_002.tar',\n", 50 | " 'arxiv/pdf/arXiv_pdf_0002_001.tar',\n", 51 | " 'arxiv/pdf/arXiv_pdf_0002_002.tar',\n", 52 | " 'arxiv/pdf/arXiv_pdf_0003_001.tar',\n", 53 | " 'arxiv/pdf/arXiv_pdf_0003_002.tar',\n", 54 | " 'arxiv/pdf/arXiv_pdf_0004_001.tar',\n", 55 | " 'arxiv/pdf/arXiv_pdf_0004_002.tar',\n", 56 | " 'arxiv/pdf/arXiv_pdf_0005_001.tar',\n", 57 | " 'arxiv/pdf/arXiv_pdf_0005_002.tar']" 58 | ] 59 | }, 60 | "execution_count": 1, 61 | "metadata": {}, 62 | "output_type": "execute_result" 63 | } 64 | ], 65 | "source": [ 66 | "import s3fs\n", 67 | "s3 = s3fs.S3FileSystem(requester_pays=True)\n", 68 | "\n", 69 | "directories = s3.ls(\"s3://arxiv/pdf\")\n", 70 | "directories[:10]" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "id": "4bc51124-c61e-4d7e-84ff-ffd551c7ca7f", 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "'arxiv/pdf/arXiv_pdf_1407_009.tar'" 83 | ] 84 | }, 85 | "execution_count": 2, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "directories[1000]" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "id": "92438ddf-7b02-462d-8a5d-2b2e760dd1a4", 97 | "metadata": {}, 98 | "source": [ 99 | "There are lots of these" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 3, 105 | "id": "9e5cb2b5-1ad5-4a21-b98d-4f0f615dacd6", 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "5373" 112 | ] 113 | }, 114 | "execution_count": 3, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "len(directories)" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 
125 | "id": "a4d3219f-a6ce-487f-b5ae-522a7415c014", 126 | "metadata": {}, 127 | "source": [ 128 | "## Process one file with plain Python\n", 129 | "\n", 130 | "Mostly we have to muck about with tar files. This wasn't hard. The `tarfile` module is in the standard library. It's not beautiful, but it's also not hard to use." 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 4, 136 | "id": "85146f13-5e5a-40e3-8d56-a79064f35ce4", 137 | "metadata": { 138 | "tags": [] 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "import tarfile\n", 143 | "import io\n", 144 | "\n", 145 | "def extract(filename: str):\n", 146 | " \"\"\" Extract and process one directory of arXiv data\n", 147 | " \n", 148 | " Returns\n", 149 | " -------\n", 150 | " filename: str\n", 151 | " contains_matplotlib: boolean\n", 152 | " \"\"\"\n", 153 | " out = []\n", 154 | " with s3.open(filename) as f:\n", 155 | " bytes = f.read()\n", 156 | " with io.BytesIO() as bio:\n", 157 | " bio.write(bytes)\n", 158 | " bio.seek(0)\n", 159 | " with tarfile.TarFile(fileobj=bio) as tf:\n", 160 | " for member in tf.getmembers():\n", 161 | " if member.isfile() and member.name.endswith(\".pdf\"):\n", 162 | " data = tf.extractfile(member).read()\n", 163 | " out.append((\n", 164 | " member.name, \n", 165 | " b\"matplotlib\" in data.lower()\n", 166 | " ))\n", 167 | " return out" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 5, 173 | "id": "de243ca8-bcd2-47b4-8574-bce3f0bda790", 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "name": "stdout", 178 | "output_type": "stream", 179 | "text": [ 180 | "CPU times: user 3.99 s, sys: 1.79 s, total: 5.78 s\n", 181 | "Wall time: 51.5 s\n" 182 | ] 183 | }, 184 | { 185 | "data": { 186 | "text/plain": [ 187 | "[('0011/cs0011019.pdf', False),\n", 188 | " ('0011/gr-qc0011017.pdf', False),\n", 189 | " ('0011/hep-ex0011095.pdf', False),\n", 190 | " ('0011/cond-mat0011373.pdf', False),\n", 191 | " ('0011/hep-ph0011035.pdf', 
False),\n", 192 | " ('0011/gr-qc0011082.pdf', False),\n", 193 | " ('0011/cond-mat0011202.pdf', False),\n", 194 | " ('0011/hep-ph0011209.pdf', False),\n", 195 | " ('0011/cond-mat0011038.pdf', False),\n", 196 | " ('0011/gr-qc0011014.pdf', False),\n", 197 | " ('0011/hep-ph0011118.pdf', False),\n", 198 | " ('0011/gr-qc0011095.pdf', False),\n", 199 | " ('0011/astro-ph0011090.pdf', False),\n", 200 | " ('0011/hep-ph0011162.pdf', False),\n", 201 | " ('0011/cs0011010.pdf', False),\n", 202 | " ('0011/cond-mat0011086.pdf', False),\n", 203 | " ('0011/hep-lat0011037.pdf', False),\n", 204 | " ('0011/astro-ph0011369.pdf', False),\n", 205 | " ('0011/astro-ph0011187.pdf', False),\n", 206 | " ('0011/astro-ph0011074.pdf', False)]" 207 | ] 208 | }, 209 | "execution_count": 5, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "%%time\n", 216 | "\n", 217 | "# See an example of its use\n", 218 | "extract(directories[20])[:20]" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "id": "9feae86b-4c46-455d-8e5b-09eb80ec3400", 224 | "metadata": {}, 225 | "source": [ 226 | "# Scale function to full dataset\n", 227 | "\n", 228 | "Great, we can get a record of each file and whether or not it used Matplotlib. Each of these takes about a minute to run on my local machine. Processing all ~5,400 files would take about 5,400 minutes, or around 90 hours. \n", 229 | "\n", 230 | "We can accelerate this in two ways:\n", 231 | "\n", 232 | "1. **Process closer to the data** by spinning up resources in the same region on the cloud (this also reduces data transfer costs)\n", 233 | "2. 
**Use hundreds of workers** in parallel\n", 234 | "\n", 235 | "We can do this easily with Dask (parallel computing) and Coiled (set up Dask infrastructure)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "id": "34748fc3-7f2f-442e-9829-626d88878234", 241 | "metadata": {}, 242 | "source": [ 243 | "## Create Dask Cluster\n", 244 | "\n", 245 | "We start a Dask cluster on AWS in the same region where the data is stored. \n", 246 | "\n", 247 | "We mimic the local software environment on the cluster with `package_sync=True`." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 6, 253 | "id": "245c269a-c432-4352-aa01-0d9110b63304", 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/html": [ 259 | "
╭───────────────────────────────────────── Package Issues ─────────────────────────────────────────╮\n",
260 |        "│                    ╷                                                           ╷                 │\n",
261 |        "│   Package           Issue                                                      Risk Level      │\n",
262 |        "│ ╶──────────────────┼───────────────────────────────────────────────────────────┼───────────────╴ │\n",
263 |        "│   libgfortran5     │ 11.3.0 has no install candidate for linux-64              │                 │\n",
264 |        "│   libgfortran      │ 5.0.0 has no install candidate for linux-64               │                 │\n",
265 |        "│   grpcio           │ 1.47.1 has no install candidate for linux-64              │                 │\n",
266 |        "│   grpc-cpp         │ 1.47.1 has no install candidate for linux-64              │                 │\n",
267 |        "│   arrow-cpp        │ 9.0.0 has no install candidate for linux-64               │                 │\n",
268 |        "│   openssl          │ Package ignored                                           │                 │\n",
269 |        "│   libabseil        │ Package ignored                                           │                 │\n",
270 |        "│                    ╵                                                           ╵                 │\n",
271 |        "╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
272 |        "
\n" 273 | ], 274 | "text/plain": [ 275 | "╭───────────────────────────────────────── \u001b[1;32mPackage Issues\u001b[0m ─────────────────────────────────────────╮\n", 276 | "│ ╷ ╷ │\n", 277 | "│ \u001b[1m \u001b[0m\u001b[1mPackage \u001b[0m\u001b[1m \u001b[0m│\u001b[1m \u001b[0m\u001b[1mIssue \u001b[0m\u001b[1m \u001b[0m│\u001b[1m \u001b[0m\u001b[1mRisk Level \u001b[0m\u001b[1m \u001b[0m │\n", 278 | "│ ╶──────────────────┼───────────────────────────────────────────────────────────┼───────────────╴ │\n", 279 | "│ libgfortran5 │ 11.3.0 has no install candidate for linux-64 │ │\n", 280 | "│ libgfortran │ 5.0.0 has no install candidate for linux-64 │ │\n", 281 | "│ grpcio │ 1.47.1 has no install candidate for linux-64 │ │\n", 282 | "│ grpc-cpp │ 1.47.1 has no install candidate for linux-64 │ │\n", 283 | "│ arrow-cpp │ 9.0.0 has no install candidate for linux-64 │ │\n", 284 | "│ openssl │ Package ignored │ │\n", 285 | "│ libabseil │ Package ignored │ │\n", 286 | "│ ╵ ╵ │\n", 287 | "╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n" 288 | ] 289 | }, 290 | "metadata": {}, 291 | "output_type": "display_data" 292 | }, 293 | { 294 | "data": { 295 | "application/vnd.jupyter.widget-view+json": { 296 | "model_id": "f9d8cf6d8cfa42f28fde31ea8131e061", 297 | "version_major": 2, 298 | "version_minor": 0 299 | }, 300 | "text/plain": [ 301 | "Output()" 302 | ] 303 | }, 304 | "metadata": {}, 305 | "output_type": "display_data" 306 | }, 307 | { 308 | "data": { 309 | "text/html": [ 310 | "
\n"
311 |       ],
312 |       "text/plain": []
313 |      },
314 |      "metadata": {},
315 |      "output_type": "display_data"
316 |     },
317 |     {
318 |      "name": "stdout",
319 |      "output_type": "stream",
320 |      "text": [
321 |       "CPU times: user 9.72 s, sys: 1.08 s, total: 10.8 s\n",
322 |       "Wall time: 1min 25s\n"
323 |      ]
324 |     }
325 |    ],
326 |    "source": [
327 |     "%%time\n",
328 |     "\n",
329 |     "import coiled\n",
330 |     "\n",
331 |     "cluster = coiled.Cluster(\n",
332 |     "    n_workers=100,\n",
333 |     "    name=\"arxiv\",\n",
334 |     "    package_sync=True, \n",
335 |     "    backend_options={\"region\": \"us-east-1\"},  # faster and cheaper\n",
336 |     ")"
337 |    ]
338 |   },
339 |   {
340 |    "cell_type": "code",
341 |    "execution_count": 7,
342 |    "id": "f46591e5-664f-4bcc-98f3-c89bbb826331",
343 |    "metadata": {},
344 |    "outputs": [
345 |     {
346 |      "name": "stderr",
347 |      "output_type": "stream",
348 |      "text": [
349 |       "/Users/mrocklin/mambaforge/envs/play/lib/python3.10/site-packages/distributed/client.py:1309: VersionMismatchWarning: Mismatched versions found\n",
350 |       "\n",
351 |       "+-------------+-----------+-----------+----------+\n",
352 |       "| Package     | Client    | Scheduler | Workers  |\n",
353 |       "+-------------+-----------+-----------+----------+\n",
354 |       "| dask        | 2022.10.0 | 2022.9.1  | 2022.9.1 |\n",
355 |       "| distributed | 2022.10.0 | 2022.9.1  | 2022.9.1 |\n",
356 |       "+-------------+-----------+-----------+----------+\n",
357 |       "  warnings.warn(version_module.VersionMismatchWarning(msg[0][\"warning\"]))\n"
358 |      ]
359 |     }
360 |    ],
361 |    "source": [
362 |     "from dask.distributed import Client, wait\n",
363 |     "client = Client(cluster)"
364 |    ]
365 |   },
366 |   {
367 |    "cell_type": "markdown",
368 |    "id": "f9c6042c-917f-46ca-9722-c99e02fb97cb",
369 |    "metadata": {
370 |     "tags": []
371 |    },
372 |    "source": [
373 |     "### Map function across every directory\n",
374 |     "\n",
375 |     "Let's scale up this work across all of the directories in our dataset\n",
376 |     "\n",
377 |     "Hopefully it will also be faster because the Dask workers are in the same region as the dataset itself."
378 |    ]
379 |   },
380 |   {
381 |    "cell_type": "code",
382 |    "execution_count": 8,
383 |    "id": "6ff8f366-4cdf-4ac3-a0c1-08545f6e07ac",
384 |    "metadata": {},
385 |    "outputs": [
386 |     {
387 |      "name": "stdout",
388 |      "output_type": "stream",
389 |      "text": [
390 |       "CPU times: user 11.7 s, sys: 1.77 s, total: 13.5 s\n",
391 |       "Wall time: 5min 58s\n"
392 |      ]
393 |     }
394 |    ],
395 |    "source": [
396 |     "%%time\n",
397 |     "\n",
398 |     "futures = client.map(extract, directories)\n",
399 |     "wait(futures)\n",
400 |     "\n",
401 |     "# We had one error in one file.  Let's just ignore and move on.\n",
402 |     "good = [future for future in futures if future.status == \"finished\"]\n",
403 |     "\n",
404 |     "lists = client.gather(good)"
405 |    ]
406 |   },
407 |   {
408 |    "cell_type": "markdown",
409 |    "id": "84af922e-4f6b-4f4b-97d5-a7097084aa1b",
410 |    "metadata": {},
411 |    "source": [
412 |     "Now that we're done with the large data problem we can turn off Dask and proceed with pure Pandas.  There's no reason to deal with scalable tools if we don't have to."
413 |    ]
414 |   },
415 |   {
416 |    "cell_type": "code",
417 |    "execution_count": 9,
418 |    "id": "f6f61c10-b892-446f-bf5e-c504209b412b",
419 |    "metadata": {},
420 |    "outputs": [],
421 |    "source": [
422 |     "# Scale down now that we're done\n",
423 |     "cluster.close()"
424 |    ]
425 |   },
426 |   {
427 |    "cell_type": "markdown",
428 |    "id": "299b5097-ff28-4036-b569-a18449cca0d9",
429 |    "metadata": {},
430 |    "source": [
431 |     "## Enrich Data\n",
432 |     "\n",
433 |     "Let's enhance our data a bit.  The filenames of each file include the year and month when they were published.  After extracting this data we'll be able to see a timeseries of Matplotlib adoption."
434 |    ]
435 |   },
436 |   {
437 |    "cell_type": "code",
438 |    "execution_count": 10,
439 |    "id": "85d74a56-e7c1-4614-86bd-6342e16d58fd",
440 |    "metadata": {},
441 |    "outputs": [
442 |     {
443 |      "data": {
444 |       "text/html": [
445 |        "
\n", 446 | "\n", 459 | "\n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | "
     filename                      has_matplotlib
0    0001/astro-ph0001477.pdf      False
1    0001/hep-th0001095.pdf        False
2    0001/astro-ph0001322.pdf      False
3    0001/cond-mat0001159.pdf      False
4    0001/astro-ph0001132.pdf      False
...  ...                           ...
630  9912/math9912098.pdf          False
631  9912/math9912251.pdf          False
632  9912/solv-int9912013.pdf      False
633  9912/hep-th9912254.pdf        False
634  9912/hep-th9912272.pdf        False
\n", 525 | "

2122438 rows × 2 columns

\n", 526 | "
" 527 | ], 528 | "text/plain": [ 529 | " filename has_matplotlib\n", 530 | "0 0001/astro-ph0001477.pdf False\n", 531 | "1 0001/hep-th0001095.pdf False\n", 532 | "2 0001/astro-ph0001322.pdf False\n", 533 | "3 0001/cond-mat0001159.pdf False\n", 534 | "4 0001/astro-ph0001132.pdf False\n", 535 | ".. ... ...\n", 536 | "630 9912/math9912098.pdf False\n", 537 | "631 9912/math9912251.pdf False\n", 538 | "632 9912/solv-int9912013.pdf False\n", 539 | "633 9912/hep-th9912254.pdf False\n", 540 | "634 9912/hep-th9912272.pdf False\n", 541 | "\n", 542 | "[2122438 rows x 2 columns]" 543 | ] 544 | }, 545 | "execution_count": 10, 546 | "metadata": {}, 547 | "output_type": "execute_result" 548 | } 549 | ], 550 | "source": [ 551 | "# Convert to Pandas\n", 552 | "\n", 553 | "import pandas as pd\n", 554 | "\n", 555 | "dfs = [\n", 556 | " pd.DataFrame(list, columns=[\"filename\", \"has_matplotlib\"]) \n", 557 | " for list in lists\n", 558 | "]\n", 559 | "\n", 560 | "df = pd.concat(dfs)\n", 561 | "\n", 562 | "df" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": 11, 568 | "id": "f98c54e7-fc46-4180-9586-c06eac6432e6", 569 | "metadata": {}, 570 | "outputs": [ 571 | { 572 | "data": { 573 | "text/plain": [ 574 | "Timestamp('2000-05-01 00:00:00')" 575 | ] 576 | }, 577 | "execution_count": 11, 578 | "metadata": {}, 579 | "output_type": "execute_result" 580 | } 581 | ], 582 | "source": [ 583 | "def date(filename):\n", 584 | " year = int(filename.split(\"/\")[0][:2])\n", 585 | " month = int(filename.split(\"/\")[0][2:4])\n", 586 | " if year > 80:\n", 587 | " year = 1900 + year\n", 588 | " else:\n", 589 | " year = 2000 + year\n", 590 | " \n", 591 | " return pd.Timestamp(year=year, month=month, day=1)\n", 592 | "\n", 593 | "date(\"0005/astro-ph0001322.pdf\")" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "id": "033451dc-344f-40e0-bd05-f2f6fd80d0c2", 599 | "metadata": {}, 600 | "source": [ 601 | "Yup. That seems to work. 
Let's map this function over our dataset." 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 12, 607 | "id": "315f8de5-c53a-49d9-8f8d-3ba62fcee727", 608 | "metadata": {}, 609 | "outputs": [ 610 | { 611 | "data": { 612 | "text/html": [ 613 | "
\n", 614 | "\n", 627 | "\n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | "
   filename                  has_matplotlib  date
0  0001/astro-ph0001477.pdf  False           2000-01-01
1  0001/hep-th0001095.pdf    False           2000-01-01
2  0001/astro-ph0001322.pdf  False           2000-01-01
3  0001/cond-mat0001159.pdf  False           2000-01-01
4  0001/astro-ph0001132.pdf  False           2000-01-01
\n", 669 | "
" 670 | ], 671 | "text/plain": [ 672 | " filename has_matplotlib date\n", 673 | "0 0001/astro-ph0001477.pdf False 2000-01-01\n", 674 | "1 0001/hep-th0001095.pdf False 2000-01-01\n", 675 | "2 0001/astro-ph0001322.pdf False 2000-01-01\n", 676 | "3 0001/cond-mat0001159.pdf False 2000-01-01\n", 677 | "4 0001/astro-ph0001132.pdf False 2000-01-01" 678 | ] 679 | }, 680 | "execution_count": 12, 681 | "metadata": {}, 682 | "output_type": "execute_result" 683 | } 684 | ], 685 | "source": [ 686 | "df[\"date\"] = df.filename.map(date)\n", 687 | "df.head()" 688 | ] 689 | }, 690 | { 691 | "cell_type": "markdown", 692 | "id": "e64781f1-0486-4eda-81b4-68911383be7a", 693 | "metadata": {}, 694 | "source": [ 695 | "## Plot\n", 696 | "\n", 697 | "Now we can just fool around with Pandas and Matplotlib." 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": 13, 703 | "id": "270ae46c-ac65-48f1-a7ca-c6f742591c7c", 704 | "metadata": {}, 705 | "outputs": [ 706 | { 707 | "data": { 708 | "image/png": "\n", 709 | "text/plain": [ 710 | "
" 711 | ] 712 | }, 713 | "metadata": {}, 714 | "output_type": "display_data" 715 | } 716 | ], 717 | "source": [ 718 | "df.groupby(\"date\").has_matplotlib.mean().plot(\n", 719 | " title=\"Matplotlib Usage in arXiv\", \n", 720 | " ylabel=\"Fraction of papers\"\n", 721 | ").get_figure().savefig(\"results.png\")" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "id": "bd3a2592-d153-4139-a4e2-a80f4a466c24", 727 | "metadata": {}, 728 | "source": [ 729 | "I made the plot above. Then Thomas Caswell (a Matplotlib maintainer) came by and, in true form, made something much better 🙂" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": 14, 735 | "id": "cf83b666-05c1-47f3-a2e3-55bbe8f5eeb8", 736 | "metadata": {}, 737 | "outputs": [ 738 | { 739 | "data": { 740 | "text/plain": [ 741 | "Text(0.5, 1.0, 'Matplotlib usage on arXiv')" 742 | ] 743 | }, 744 | "execution_count": 14, 745 | "metadata": {}, 746 | "output_type": "execute_result" 747 | }, 748 | { 749 | "data": { 750 | "image/png": "\n", 751 | "text/plain": [ 752 | "
" 753 | ] 754 | }, 755 | "metadata": {}, 756 | "output_type": "display_data" 757 | } 758 | ], 759 | "source": [ 760 | "import datetime\n", 761 | "import matplotlib.pyplot as plt\n", 762 | "from matplotlib.ticker import PercentFormatter\n", 763 | "\n", 764 | "import pandas as pd\n", 765 | "\n", 766 | "# read data\n", 767 | "by_month = pd.read_parquet(\"results.parquet\").groupby(\"date\").has_matplotlib.mean()\n", 768 | "\n", 769 | "\n", 770 | "# get figure\n", 771 | "fig, ax = plt.subplots(layout=\"constrained\")\n", 772 | "# plot the data\n", 773 | "ax.plot(by_month, \"o\", color=\"k\", ms=3)\n", 774 | "\n", 775 | "# override the default auto limits\n", 776 | "ax.set_xlim(left=datetime.date(2004, 1, 1))\n", 777 | "ax.set_ylim(bottom=0)\n", 778 | "\n", 779 | "# turn on a horizontal grid\n", 780 | "ax.grid(axis=\"y\")\n", 781 | "\n", 782 | "# remove the top and right spines\n", 783 | "ax.spines.right.set_visible(False)\n", 784 | "ax.spines.top.set_visible(False)\n", 785 | "\n", 786 | "# format y-ticks as a percent\n", 787 | "ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))\n", 788 | "\n", 789 | "# add title and labels\n", 790 | "ax.set_xlabel(\"date\")\n", 791 | "ax.set_ylabel(\"% of all papers\")\n", 792 | "ax.set_title(\"Matplotlib usage on arXiv\")" 793 | ] 794 | }, 795 | { 796 | "cell_type": "markdown", 797 | "id": "81198952-f6ba-4e1b-9792-36e54a5fe491", 798 | "metadata": {}, 799 | "source": [ 800 | "Yup. Matplotlib is used pretty commonly on arXiv. Go team." 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "id": "a59d3a1d-289c-405c-8330-fc249e376b70", 806 | "metadata": {}, 807 | "source": [ 808 | "## Save results\n", 809 | "\n", 810 | "This data was slightly painful to procure. Let's save the results locally for future analysis. That way other researchers can further analyze the results without having to muck about with parallelism or cloud stuff." 
811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "id": "3ae0d6ff-4471-4a45-bbe1-6bcd8a8d72a3", 817 | "metadata": {}, 818 | "outputs": [], 819 | "source": [ 820 | "df.to_csv(\"arxiv-matplotlib.csv\")" 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": null, 826 | "id": "623becf0-a84d-4cf7-b719-883bfe60eef4", 827 | "metadata": {}, 828 | "outputs": [], 829 | "source": [ 830 | "!du -hs arxiv-matplotlib.csv" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": null, 836 | "id": "1755b7a4-8b92-441c-9083-2bc0b51e2f81", 837 | "metadata": {}, 838 | "outputs": [], 839 | "source": [ 840 | "df.to_parquet(\"arxiv-matplotlib.parquet\", compression=\"snappy\")" 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": null, 846 | "id": "afc412d6-38b3-424e-8ca5-b4141d1b776f", 847 | "metadata": {}, 848 | "outputs": [], 849 | "source": [ 850 | "!du -hs arxiv-matplotlib.parquet" 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "id": "f637b60b-7051-4898-9c99-b3b436acd1ae", 856 | "metadata": {}, 857 | "source": [ 858 | "## Conclusion\n", 859 | "\n", 860 | "### Matplotlib + arXiv\n", 861 | "\n", 862 | "It's incredible to see the steady growth of Matplotlib across arXiv. It's worth noting that this is *all* papers, even from fields like theoretical mathematics that are unlikely to include computer-generated plots. Is this Matplotlib growing in popularity? Is it Python generally?\n", 863 | "\n", 864 | "For future work, we should break this down by subfield. The filenames actually contained the name of the field for a while, like \"hep-ex\" for \"high energy physics, experimental\", but it looks like arXiv stopped doing this at some point. My guess is that there is a list mapping filenames to fields somewhere though. 
The filenames are all in the Pandas dataframe / parquet dataset, so doing this analysis shouldn't require any scalable computing.\n", 865 | "\n", 866 | "### Dask + Coiled\n", 867 | "\n", 868 | "Dask and Coiled were built to make it easy to answer large questions. \n", 869 | "\n", 870 | "We started this notebook with some generic Python code. When we wanted to scale up we invoked Dask+Coiled, did some work, and then tore things down, all in about ten minutes. The problem of scale or \"big data\" didn't get in the way of us analyzing data and making a delightful discovery. \n", 871 | "\n", 872 | "This is exactly why these projects exist." 873 | ] 874 | } 875 | ], 876 | "metadata": { 877 | "kernelspec": { 878 | "display_name": "Python [conda env:play]", 879 | "language": "python", 880 | "name": "conda-env-play-py" 881 | }, 882 | "language_info": { 883 | "codemirror_mode": { 884 | "name": "ipython", 885 | "version": 3 886 | }, 887 | "file_extension": ".py", 888 | "mimetype": "text/x-python", 889 | "name": "python", 890 | "nbconvert_exporter": "python", 891 | "pygments_lexer": "ipython3", 892 | "version": "3.10.6" 893 | } 894 | }, 895 | "nbformat": 4, 896 | "nbformat_minor": 5 897 | } 898 | --------------------------------------------------------------------------------
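The subfield breakdown suggested in the notebook's conclusion can be prototyped from filenames alone, at least for old-style identifiers. The following is a minimal sketch and not part of the original repository: `FIELD_PATTERN`, `field_from_filename`, and `usage_by_field` are hypothetical names, the regular expression assumes old-style filenames of the form `YYMM/<field>YYMMNNN.pdf` as seen in the notebook output, and every row in `sample` is made up for illustration (including the assumed shape of a newer filename with no field prefix).

```python
import re
from collections import defaultdict

# Assumption: old-style arXiv filenames look like "0001/astro-ph0001477.pdf",
# i.e. a YYMM directory, a lowercase field prefix, then seven digits (YYMMNNN).
FIELD_PATTERN = re.compile(r"^\d{4}/([a-z-]+)\d{7}\.pdf$")


def field_from_filename(filename):
    """Return the field prefix (e.g. "hep-th"), or None if there is none."""
    match = FIELD_PATTERN.match(filename)
    return match.group(1) if match else None


def usage_by_field(rows):
    """Fraction of papers flagged as using Matplotlib, grouped by field.

    `rows` is an iterable of (filename, has_matplotlib) tuples, the same
    shape that the notebook's `extract` function returns.
    Filenames without a recognizable field prefix are skipped.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for filename, has_matplotlib in rows:
        field = field_from_filename(filename)
        if field is None:
            continue
        totals[field] += 1
        hits[field] += int(has_matplotlib)
    return {field: hits[field] / totals[field] for field in totals}


# Made-up rows for illustration only; these are not real results.
sample = [
    ("0001/astro-ph0001001.pdf", False),
    ("0001/astro-ph0001002.pdf", True),
    ("9912/solv-int9912001.pdf", False),
    ("1407/1407.1234.pdf", True),  # assumed newer-style name, no field prefix
]
print(usage_by_field(sample))
```

With the saved dataframe, the same helper could presumably be applied as `df.filename.map(field_from_filename)` followed by a `groupby`, so none of this needs a cluster. Papers whose filenames carry no field prefix would still need an external mapping, which this sketch does not attempt.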