├── .gitignore
├── LICENSE
├── NotebookScheduler.py
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | notebooks.log
2 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2019 Joshuaek
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/NotebookScheduler.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import sys
  3 | import argparse
  4 | import fnmatch
  5 | import logging
  6 | import papermill as pm
  7 | from datetime import datetime
  8 | import time
  9 | 
 10 | # Notebook Scheduler
 11 | # ---------------------------------------
 12 | # This script helps with the automated processing of Jupyter Notebooks via papermill (https://github.com/nteract/papermill/)
 13 | 
 14 | 
 15 | 
 16 | snapshotDir = 'snapshots'
 17 | 
 18 | def findFiles(directory, pattern):
 19 |     # Lists all files in the specified directory that match the specified pattern
 20 |     for filename in os.listdir(directory):
 21 |         if fnmatch.fnmatch(filename.lower(), pattern):
 22 |             yield os.path.join(directory, filename)
 23 | 
 24 | def processNotebooks(notebookDirectory, days=None):
 25 |     
 26 |     now = datetime.now()
 27 |     
 28 |     # For monthly tasks, we only run on the specified days; if no days are specified, we always run
 29 |     if not days or now.day in days:
 30 | 
 31 |         logging.info('Processing ' + notebookDirectory)
 32 |         
 33 |         # Each time a notebook is processed a snapshot is saved to a snapshot sub-directory
 34 |         # This checks the sub-directory exists and creates it if not
 35 |         if os.path.isdir(os.path.join(notebookDirectory,snapshotDir)) == False:
 36 |             os.mkdir(os.path.join(notebookDirectory,snapshotDir))
 37 |         
 38 |         for file in findFiles(notebookDirectory, '*.ipynb'):
 39 |             try:
 40 |                 nb = os.path.basename(file)
 41 |                 
 42 |                 # Within the snapshot directory, each notebook output is stored in its own sub-directory
 43 |                 notebookSnapshot = os.path.join(notebookDirectory, snapshotDir, os.path.splitext(nb)[0])
 44 |                 
 45 |                 if os.path.isdir(notebookSnapshot) == False:
 46 |                     os.mkdir(notebookSnapshot)
 47 | 
 48 |                 # The output will be saved in a timestamp directory (snapshots/notebook/timestamp) 
 49 |                 runDir = os.path.join(notebookSnapshot, now.strftime("%Y-%m-%d %H.%M.%S.%f"))
 50 |                 if os.path.isdir(runDir) == False:
 51 |                     os.mkdir(runDir)
 52 | 
 53 |                 # The snapshot keeps the notebook's file name; the timestamp is part of the directory path
 54 |                 output_file = os.path.join(runDir, nb)
 55 |                 
 56 |                 # Execute the notebook and save the snapshot
 57 |                 pm.execute_notebook(
 58 |                     file,
 59 |                     output_file,
 60 |                     parameters=dict(snapshotDir = runDir + os.sep)
 61 |                 )
 62 |             except Exception:
 63 |                 # If any errors occur with the notebook processing they will be logged to the log file
 64 |                 logging.exception("Error processing notebook")
 65 | 
 66 | 
 67 | 
 68 | if __name__ == '__main__':
 69 | 
 70 |     # Ensure we're running in the same directory as the script
 71 |     os.chdir(os.path.dirname(os.path.abspath(__file__)))
 72 | 
 73 |     # Set up logger to display to screen and file
 74 |     logging.basicConfig(level=logging.INFO,
 75 |                         format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
 76 |                         datefmt='%Y-%m-%d %H:%M:%S',
 77 |                         filename='notebooks.log')
 78 | 
 79 |     console = logging.StreamHandler()
 80 |     console.setLevel(logging.INFO)
 81 |     logging.getLogger('').addHandler(console)
 82 | 
 83 |     # Check if the subfolders for notebooks exist, and create them if they don't
 84 |     for directory in ['daily','hourly','weekly', 'monthly']:
 85 |         if os.path.isdir(directory) == False:
 86 |             os.mkdir(directory)
 87 | 
 88 |     # Get optional directory passed in via command line. If this is specified, then we just process the requested directory. 
 89 |     # This is useful if you're scheduling the processing with an external task scheduler
 90 |     # If directory is not specified, then we'll set up our own scheduler and process the tasks
 91 | 
 92 |     parser = argparse.ArgumentParser(description = "NotebookScheduler options")
 93 |     parser.add_argument("-d", "--directory", help = "Which set of notebooks to process - e.g. hourly", required = False, default = False)
 94 |     argument = parser.parse_args()
 95 | 
 96 |     if argument.directory:
 97 |         # If a directory has been specified, we'll just process that one directory now and exit
 98 |         processNotebooks(argument.directory)    
 99 | 
100 |     else:
101 |         # Only require the schedule module if we're using the internal scheduler
102 |         # Install this with pip install schedule
103 |         import schedule
104 | 
105 |         print("Starting scheduler...")
106 | 
107 |         # If no directory has been specified, schedule the processing and execute
108 |         schedule.every().hour.at(':40').do(processNotebooks, notebookDirectory='hourly')
109 |         schedule.every().day.at('13:15').do(processNotebooks, notebookDirectory='daily')
110 |         schedule.every().sunday.at('13:15').do(processNotebooks, notebookDirectory='weekly')
111 |         schedule.every().day.at('14:15').do(processNotebooks, notebookDirectory='monthly', days=[1])
112 | 
113 |         # Run the scheduled tasks
114 |         while True:
115 |             schedule.run_pending()
116 |             time.sleep(1)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # NotebookScheduler
  2 | A simple script that schedules Jupyter Notebook execution and stores the results using Papermill.
  3 | 
  4 | Check out [this blog post](https://productmetrics.net/blog/schedule-jupyter-notebooks/) for more details.
  5 | 
  6 | ## Introducing NotebookScheduler
  7 | 
  8 | [NotebookScheduler](https://github.com/Joshuaek/NotebookScheduler) is a simple Python script which uses [Papermill](https://github.com/nteract/papermill) to execute a directory of Jupyter Notebooks. Notebooks are arranged into subfolders for hourly, daily, weekly or monthly execution. Each time a notebook is run, a snapshot is saved to a timestamped folder (along with any other outputs your notebook saves) giving you the ability to look back at past executions and to have a full audit of the analysis that has been done.
  9 | 
 10 | Once I've set up a notebook to provide whatever stats I want, scheduling its execution on a weekly basis is as simple as a drag-and-drop into the weekly subfolder.
 11 | 
 12 | ## Getting started
 13 | 
 14 | The code is available in this [GitHub repository](https://github.com/Joshuaek/NotebookScheduler); clone or download it to a folder on your PC. The first time you run the script, it will create a skeleton directory structure, with subdirectories for hourly, daily, weekly and monthly notebooks.
 15 | 
 16 | Simply move your notebook (*.ipynb) files into the relevant subdirectory and when the script is run they will be executed.
 17 | 
 18 | The directory structure is shown below:
 19 | 
 20 | ```
 21 |  <script_folder>/
 22 |  ├── NotebookScheduler.py 
 23 |  ├── hourly/
 24 |  │   ├── notebook1.ipynb
 25 |  │   ├── notebook2.ipynb
 26 |  │   └── snapshots/
 27 |  │       ├── notebook1/
 28 |  │       │   └── <timestamp>/
 29 |  │       │       └── notebook1.ipynb
 30 |  │       └── notebook2/
 31 |  │           └── <timestamp>/
 32 |  │               └── notebook2.ipynb
 33 |  ├── daily/
 34 |  │   ├── notebook3.ipynb
 35 |  │   ├── notebook4.ipynb
 36 |  │   └── snapshots/
 37 |  │       ├── notebook3/
 38 |  │       │   └── <timestamp>/
 39 |  │       │       └── notebook3.ipynb
 40 |  │       └── notebook4/
 41 |  │           └── <timestamp>/
 42 |  │               └── notebook4.ipynb
 43 |  └── weekly/
 44 |      ├── notebook5.ipynb
 45 |      ├── notebook6.ipynb
 46 |      └── snapshots/
 47 |          ├── notebook5/
 48 |          │   └── <timestamp>/
 49 |          │       └── notebook5.ipynb
 50 |          └── notebook6/
 51 |              └── <timestamp>/
 52 |                  └── notebook6.ipynb
 53 | ```
 54 | 
 55 | ## Install the dependencies
 56 | 
 57 | The script has a few dependencies.
 58 | 
 59 | ### Papermill
 60 | 
 61 | [Papermill](https://github.com/nteract/papermill) is the module that runs the Jupyter notebooks. You'll need to install Papermill and its dependencies first.
 62 | 
 63 | ``` 
 64 | pip install papermill 
 65 | ```
 66 | 
 67 | ### Schedule
 68 | 
 69 | If you want to use the built-in scheduler, then you'll need to install [Schedule](https://pypi.org/project/schedule/).
 70 | 
 71 | ``` 
 72 | pip install schedule 
 73 | ```
 74 | 
 75 | If you're going to use Windows Task Scheduler or Cron jobs to schedule the execution, then you don't need this. 
 76 | 
 77 | ## Running the script without an external scheduler
 78 | 
 79 | The simplest way to get started is to use the built-in scheduler. In this mode, you'll run the Python script in a terminal and leave it running. The script itself will loop and run each notebook on the schedule determined by the subdirectory it is in (hourly, daily, weekly or monthly).
 80 | 
 81 | To do this, once you have some notebooks in your folders, simply run the script from its root folder:
 82 | 
 83 | ``` 
 84 | python NotebookScheduler.py 
 85 | ```
 86 | 
 87 | ## Running the script with an external scheduler
 88 | 
 89 | An alternative is to use an external scheduler, such as the built-in Windows Task Scheduler or a cron job, to execute the script. In this mode, the external scheduler determines the frequency of execution. You just need to set the ```-d``` command line option to tell the script which directory to execute. So, if you wanted to run your hourly and daily notebooks, you'd set up two tasks:
 90 | 
 91 | One job set to run hourly, with the script executed as follows:
 92 | 
 93 | ```
 94 | python NotebookScheduler.py -d hourly
 95 | ```
 96 | 
 97 | And another one set to run daily, with the script executed as follows:
 98 | 
 99 | ```
100 | python NotebookScheduler.py -d daily
101 | ```
102 | 
103 | When the directory is specified using the ```-d``` option, the notebooks in the specified directory are executed immediately.
104 | 
105 | ## About monthly tasks
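106 | With the built-in scheduler, monthly tasks are run at 14:15 on the first day of the month. If you use an external scheduler with ```-d monthly```, the notebooks run every time the job fires, so set the day of the month in that scheduler instead.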
107 | 
108 | If you want to run on a different day, then change this line:
109 | 
110 | ```
111 | schedule.every().day.at('14:15').do(processNotebooks, notebookDirectory='monthly', days=[1])
112 | ```
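
The day-of-month check behind the `days` argument can be sketched as follows. This is a minimal illustration based on the condition in `processNotebooks`; the helper name `should_run` and the sample dates are hypothetical and not part of the script.

```python
from datetime import datetime


def should_run(days, now=None):
    """Return True when notebooks should be processed on the given date.

    Hypothetical helper mirroring the check in processNotebooks:
    an empty list means "always run", otherwise the current day of the
    month must appear in the list.
    """
    now = now or datetime.now()
    return not days or now.day in days


# Sample dates chosen purely for illustration
first_of_month = datetime(2024, 3, 1, 14, 15)
mid_month = datetime(2024, 3, 15, 14, 15)

print(should_run([1], first_of_month))   # first day is in the list
print(should_run([1], mid_month))        # 15th is not in the list
print(should_run([1, 15], mid_month))    # both the 1st and 15th are allowed
print(should_run([], mid_month))         # no restriction: always run
```

Under this reading, passing `days=[1, 15]` in the scheduled line would process the monthly notebooks on both the 1st and the 15th.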
113 | 
114 | ## About the snapshots
115 | 
116 | Within each of the hourly/daily/weekly/monthly directories a "snapshots" directory will be created. This will have sub-folders for each notebook that is executed, and each execution will be stored in a timestamped folder. Whilst this is a lot of nesting, it makes it quick and easy to view the output of a particular notebook on a particular day. Once the notebook is executed, Papermill will save the output notebook to the snapshot directory.
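
As a rough sketch of how that nesting is put together, the snippet below builds the path for a single run. The function name `snapshot_run_dir` and the example date are assumptions for illustration; the timestamp format is the one used in `NotebookScheduler.py`.

```python
import os
from datetime import datetime

SNAPSHOT_DIR = "snapshots"


def snapshot_run_dir(notebook_directory, notebook_path, now):
    """Build <directory>/snapshots/<notebook name>/<timestamp> for one run.

    Hypothetical helper; the script builds the same path inline.
    """
    # Drop the directory and the .ipynb extension to get the notebook name
    notebook_name = os.path.splitext(os.path.basename(notebook_path))[0]
    # Same timestamp format as the script (dots instead of colons in the time)
    timestamp = now.strftime("%Y-%m-%d %H.%M.%S.%f")
    return os.path.join(notebook_directory, SNAPSHOT_DIR, notebook_name, timestamp)


run_dir = snapshot_run_dir(
    "daily",
    os.path.join("daily", "notebook3.ipynb"),
    datetime(2019, 5, 1, 13, 15, 0),
)
print(run_dir)
```

The time is written with dots rather than colons, which keeps the folder name valid on Windows as well.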
117 | 
118 | ## Saving other artifacts
119 | 
120 | Papermill can [pass parameters](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) to the notebooks it is executing. NotebookScheduler sets a ```snapshotDir``` parameter that you can use within your notebooks to save files in the snapshot directory. For example, the following code generates a random dataframe and then saves a .csv file into the snapshot directory. This means that each execution of the notebook has its .csv output right next to the output notebook in the timestamped folder.
121 | 
122 | ```python
123 | import pandas as pd
124 | import numpy as np
125 | import random
126 | 
127 | snapshotDir = ""  # Default; keep this in its own cell tagged "parameters" so Papermill can override it
128 | 
129 | df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
130 | 
131 | df.to_csv(snapshotDir + 'output.csv')
132 | ```
133 | 
134 | Hopefully that helps keep everything neat and tidy!
135 | 
136 | ## Logging
137 | 
138 | Logging is set up for you: once the script is run you'll see ```notebooks.log``` appear in the folder. All executions are logged here. If anything goes wrong with the execution (e.g. something's broken in your notebook) then a stack trace will be included in the log. All actions are logged to a single log file so you only have one place to check to see if notebooks have run or find out why they broke.
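
To illustrate how a failure ends up in the log with its stack trace, here is a small self-contained sketch. It is not the script's own setup: the logger name, the temporary log location and the simulated error are all assumptions made for the example.

```python
import logging
import os
import tempfile


def make_logger(log_path):
    """Create a logger that writes to both a file and the console."""
    logger = logging.getLogger("notebook_scheduler_demo")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    formatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
    for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Write to a temporary folder so the example does not touch a real log
log_path = os.path.join(tempfile.mkdtemp(), "notebooks.log")
logger = make_logger(log_path)

try:
    # Stand-in for a notebook that fails during execution
    raise ValueError("simulated notebook failure")
except Exception:
    # logging.exception records the message plus the full traceback
    logger.exception("Error processing notebook")

with open(log_path) as log_file:
    log_text = log_file.read()
```

Because the error is caught and logged, one broken notebook does not stop the remaining notebooks in the directory from being processed.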
139 | 
140 | ## Testing and feedback
141 | 
142 | I've only tested the script using Python 3.6 so far. If you encounter any bugs or strange behaviour then please raise an issue via the [repository](https://github.com/Joshuaek/NotebookScheduler).


--------------------------------------------------------------------------------