├── .gitignore
├── LICENSE
├── NotebookScheduler.py
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
notebooks.log


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Joshuaek

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/NotebookScheduler.py:
--------------------------------------------------------------------------------
import os
import sys
import argparse
import fnmatch
import logging
import papermill as pm
from datetime import datetime
import time

# Notebook Scheduler
# ---------------------------------------
# This script helps with the automated processing of Jupyter Notebooks via papermill (https://github.com/nteract/papermill/)



snapshotDir = 'snapshots'

def findFiles(directory, pattern):
    # Lists all files in the specified directory that match the specified pattern
    for filename in os.listdir(directory):
        if fnmatch.fnmatch(filename.lower(), pattern):
            yield os.path.join(directory, filename)

def processNotebooks(notebookDirectory, days=[]):

    now = datetime.now()

    # For monthly tasks, we only run on the specified days (or for others if no days are specified)
    if (len(days) > 0 and now.day in days) or len(days) == 0:

        logging.info('Processing ' + notebookDirectory)

        # Each time a notebook is processed a snapshot is saved to a snapshot sub-directory
        # This checks the sub-directory exists and creates it if not
        if os.path.isdir(os.path.join(notebookDirectory,snapshotDir)) == False:
            os.mkdir(os.path.join(notebookDirectory,snapshotDir))

        for file in findFiles(notebookDirectory, '*.ipynb'):
            try:
                nb = os.path.basename(file)

                # Within the snapshot directory, each notebook output is stored in its own sub-directory
                notebookSnapshot = os.path.join(notebookDirectory, snapshotDir, nb.split('.ipynb')[0])

                if os.path.isdir(notebookSnapshot) == False:
                    os.mkdir(notebookSnapshot)

                # The output will be saved in a timestamp directory (snapshots/notebook/timestamp)
                runDir = os.path.join(notebookSnapshot, now.strftime("%Y-%m-%d %H.%M.%S.%f"))
                if os.path.isdir(runDir) == False:
                    os.mkdir(runDir)

                # The snapshot file includes a timestamp
                output_file = os.path.join(runDir, nb)

                # Execute the notebook and save the snapshot
                pm.execute_notebook(
                    file,
                    output_file,
                    parameters=dict(snapshotDir = runDir + os.sep)
                )
            except Exception:
                # If any errors occur with the notebook processing they will be logged to the log file
                logging.exception("Error processing notebook")



if __name__ == '__main__':

    # Ensure we're running in the same directory as the script
    os.chdir(os.path.dirname(os.path.abspath(__file__)))

    # Set up logger to display to screen and file
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S',
                        filename='notebooks.log')

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    logging.getLogger('').addHandler(console)

    # Check if the subfolders for notebooks exist, and create them if they don't
    for directory in ['daily','hourly','weekly', 'monthly']:
        if os.path.isdir(directory) == False:
            os.mkdir(directory)

    # Get optional directory passed in via command line. If this is specified, then we just process the requested directory.
    # This is useful if you're scheduling the processing with an external task scheduler
    # If directory is not specified, then we'll set up our own scheduler and process the tasks

    parser = argparse.ArgumentParser(description = "NotebookScheduler options")
    parser.add_argument("-d", "--directory", help = "Which set of notebooks to process - e.g. hourly", required = False, default = False)
    argument = parser.parse_args()

    if argument.directory:
        # If a directory has been specified, we'll just process that one directory now and exit
        processNotebooks(argument.directory)

    else:
        # Only require the schedule module if we're using the internal scheduler
        # Install this with pip install schedule
        import schedule

        print("Starting scheduler...")

        # If no directory has been specified, schedule the processing and execute
        schedule.every().hour.at(':40').do(processNotebooks, notebookDirectory='hourly')
        schedule.every().day.at('13:15').do(processNotebooks, notebookDirectory='daily')
        schedule.every().sunday.at('13:15').do(processNotebooks, notebookDirectory='weekly')
        schedule.every().day.at('14:15').do(processNotebooks, notebookDirectory='monthly', days=[1])

        # Run the scheduled tasks
        while True:
            schedule.run_pending()
            time.sleep(1)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# NotebookScheduler
A simple script to help schedule Jupyter Notebook execution and storing of the results using Papermill

Check out [this blog post](https://productmetrics.net/blog/schedule-jupyter-notebooks/) for more details.

## Introducing NotebookScheduler

[NotebookScheduler](https://github.com/Joshuaek/NotebookScheduler) is a simple Python script which uses [Papermill](https://github.com/nteract/papermill) to execute a directory of Jupyter Notebooks. Notebooks are arranged into subfolders for hourly, daily, weekly or monthly execution.
Each time a notebook is run, a snapshot is saved to a timestamped folder (along with any other outputs your notebook saves), giving you the ability to look back at past executions and to have a full audit of the analysis that has been done.

Once I've set up the notebook to provide whatever stats I want, scheduling its execution on a weekly basis is now as simple as a drag-and-drop into the weekly subfolder.

## Getting started

The code is available in this [GitHub repository](https://github.com/Joshuaek/NotebookScheduler); clone or download it to a folder on your PC. The first time you run the script, it will create a skeleton directory structure, with subdirectories for hourly, daily, weekly and monthly notebooks.

Simply move your notebook (`*.ipynb`) files into the relevant subdirectory and when the script is run they will be executed.

The directory structure is shown below:

```
/
├── NotebookScheduler.py
├── hourly/
│   ├── notebook1.ipynb
│   ├── notebook2.ipynb
│   └── snapshots/
│       ├── notebook1/
│       │   └── <timestamp>/
│       │       └── notebook1.ipynb
│       └── notebook2/
│           └── <timestamp>/
│               └── notebook2.ipynb
├── daily/
│   ├── notebook3.ipynb
│   ├── notebook4.ipynb
│   └── snapshots/
│       ├── notebook3/
│       │   └── <timestamp>/
│       │       └── notebook3.ipynb
│       └── notebook4/
│           └── <timestamp>/
│               └── notebook4.ipynb
├── weekly/
│   ├── notebook5.ipynb
│   ├── notebook6.ipynb
│   └── snapshots/
│       ├── notebook5/
│       │   └── <timestamp>/
│       │       └── notebook5.ipynb
│       └── notebook6/
│           └── <timestamp>/
│               └── notebook6.ipynb
└── monthly/
```

## Install the dependencies

The script has a few dependencies.

### Papermill

[Papermill](https://github.com/nteract/papermill) is the module that runs the Jupyter notebooks. You'll need to install Papermill and its dependencies first.

```
pip install papermill
```

### Schedule

If you want to use the built-in scheduler, then you'll need to install [Schedule](https://pypi.org/project/schedule/).

```
pip install schedule
```

If you're going to use Windows Task Scheduler or cron jobs to schedule the execution, then you don't need this.

## Running the script without an external scheduler

The simplest way to get started is to use the built-in scheduler. In this mode, you'll run the Python script in a terminal and leave it running. The script itself will loop and run the notebooks on the schedule determined by the subdirectory each notebook is in (hourly, daily, weekly or monthly).

To do this, once you have some notebooks in your folders, simply run the script from its root folder:

```
python NotebookScheduler.py
```

## Running the script with an external scheduler

An alternative is to use an external scheduler, such as the built-in Windows Task Scheduler or a cron job, to execute the script. In this mode, the external scheduler determines the frequency of execution. You just need to set the ```-d``` command line option to tell the script which directory to process. So, if you wanted to run your hourly and daily notebooks, you'd set up two tasks:

One job set to run hourly, with the script executed as follows:

```
python NotebookScheduler.py -d hourly
```

And another one set to run daily, with the script executed as follows:

```
python NotebookScheduler.py -d daily
```

When the directory is specified using the ```-d``` option, the notebooks in the specified directory are executed immediately.

## About monthly tasks
With the built-in scheduler, monthly tasks are run on the first day of the month.
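
The day-of-month check that gates monthly tasks can be sketched as below. This is a minimal illustration of the condition used in `processNotebooks`, not code from the script itself: the `should_run` helper and its `today` parameter are hypothetical names introduced here so the check can be tried with fixed dates.

```python
from datetime import datetime


def should_run(days, today=None):
    """Return True when a task restricted to `days` may run on `today`.

    Hypothetical helper for illustration only. An empty `days` list means
    "no restriction", which mirrors the condition in processNotebooks.
    """
    if today is None:
        today = datetime.now()
    return len(days) == 0 or today.day in days


# A task restricted to the 1st and 15th of the month
print(should_run([1, 15], datetime(2019, 3, 15)))  # True
print(should_run([1, 15], datetime(2019, 3, 16)))  # False

# No restriction: the task runs whenever it is triggered
print(should_run([], datetime(2019, 3, 16)))       # True
```

The built-in scheduler triggers the monthly job every day and relies on this check to skip the days that are not listed. When the script is called with ```-d monthly```, no days are passed, so the notebooks run immediately whatever the date.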

If you want to run on a different day, then change this line:

```
schedule.every().day.at('14:15').do(processNotebooks, notebookDirectory='monthly', days=[1])
```

## About the snapshots

Within each of the hourly/daily/weekly/monthly directories a "snapshots" directory will be created. This will have sub-folders for each notebook that is executed, and each execution will be stored in a timestamped folder. Whilst this is a lot of nesting, it makes it quick and easy to view the output of a particular notebook on a particular day. Once the notebook is executed, Papermill will save the output notebook to the snapshot directory.

## Saving other artifacts

Papermill can [pass parameters](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) to the notebooks it is executing. NotebookScheduler will set a ```snapshotDir``` parameter so that you can use this within your notebooks for saving files within the snapshot directory. Define the default value in a cell tagged ```parameters``` so that Papermill injects the real value after it. For example, the following code generates a random dataframe and then saves a .csv file into the snapshot directory. This means that each execution of the notebook has its .csv output right next to the output notebook in the timestamped folder.

```python
import pandas as pd
import numpy as np

# Default value; Papermill overrides it with the timestamped snapshot folder
snapshotDir = ""

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

df.to_csv(snapshotDir + 'output.csv')
```

Hopefully that helps keep everything neat and tidy!

## Logging

Logging is set up: once the script is run you'll see ```notebooks.log``` appear in the folder. All executions are logged here. If anything goes wrong with the execution (e.g. something's broken in your notebook) then a stack trace will be included in the log.
All actions are logged to a single log file so you only have one place to check to see if scripts have run or find out why they broke.

## Testing and feedback

I've only tested the script using Python 3.6 so far. If you encounter any bugs or strange behaviour then please raise an issue via the [repository](https://github.com/Joshuaek/NotebookScheduler).

--------------------------------------------------------------------------------