├── 03_continuous_integration ├── iris-api │ ├── .travis.yml │ ├── tests │ │ ├── __init__.py │ │ └── resources │ │ │ └── prediction.py │ ├── resources │ │ ├── __init__.py │ │ ├── README.md │ │ └── IrisPredictorResource.py │ ├── service.yaml │ ├── models │ │ └── finalized_model.sav │ ├── bin │ │ └── docker_build_context.sh │ ├── LICENSE │ ├── Dockerfile │ ├── main.py │ ├── .gitignore │ └── README.md ├── Code_sharing_best_practices_workshop.pptx ├── 03_continuous_integration.py └── 03_continuous_integration.ipynb ├── img ├── ssh-remote.png ├── click_example_code.jpg ├── click_example_help.jpg └── jupyter_environments.png ├── Info Flyer ├── flyer.png ├── flyer-sketch.png ├── MUDS_Practical_Training_Flyer_2020.png ├── MUDS_Practical_Training_Flyer_2020_workshop.jpg └── ugly_code_numpy_linalg.py ├── MUDS resources ├── MUDS_Logofinal.png ├── MUDS_Logo_CMYK_final.eps ├── MUDS-Banner-Web-V1(1).jpg ├── presentation │ └── MUDS_Folienmaster.pptx └── Technical_University_of_Munich_emblem.svg ├── muds_practical_training_overview.pptx ├── 02_database_basics ├── photos │ ├── keyvalue_example.PNG │ └── cybernetics-1869205_1280.jpg └── 02_database_basics.py ├── 04_best_practices ├── Code_sharing_best_practices_workshop.pptx ├── 04_best_practices.py ├── slurm.ipynb └── 04_best_practices.ipynb ├── README.md ├── .gitignore ├── check_list_before_sharing.md ├── autogen_slidetype.py ├── 06_advanced_python ├── debugging.ipynb └── jupyter_addons.ipynb ├── 00_intro ├── 00_intro.py └── 00_intro.ipynb ├── 99_other_material ├── meme_treasury.ipynb └── complexity.ipynb └── 07_graphs └── 07_graphs.ipynb /03_continuous_integration/iris-api/.travis.yml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/resources/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /img/ssh-remote.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/img/ssh-remote.png -------------------------------------------------------------------------------- /Info Flyer/flyer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/Info Flyer/flyer.png -------------------------------------------------------------------------------- /img/click_example_code.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/img/click_example_code.jpg -------------------------------------------------------------------------------- /img/click_example_help.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/img/click_example_help.jpg -------------------------------------------------------------------------------- /Info Flyer/flyer-sketch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/Info 
Flyer/flyer-sketch.png -------------------------------------------------------------------------------- /img/jupyter_environments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/img/jupyter_environments.png -------------------------------------------------------------------------------- /MUDS resources/MUDS_Logofinal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/MUDS resources/MUDS_Logofinal.png -------------------------------------------------------------------------------- /MUDS resources/MUDS_Logo_CMYK_final.eps: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/MUDS resources/MUDS_Logo_CMYK_final.eps -------------------------------------------------------------------------------- /muds_practical_training_overview.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/muds_practical_training_overview.pptx -------------------------------------------------------------------------------- /MUDS resources/MUDS-Banner-Web-V1(1).jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/MUDS resources/MUDS-Banner-Web-V1(1).jpg -------------------------------------------------------------------------------- /02_database_basics/photos/keyvalue_example.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/02_database_basics/photos/keyvalue_example.PNG -------------------------------------------------------------------------------- /Info Flyer/MUDS_Practical_Training_Flyer_2020.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/Info Flyer/MUDS_Practical_Training_Flyer_2020.png -------------------------------------------------------------------------------- /MUDS resources/presentation/MUDS_Folienmaster.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/MUDS resources/presentation/MUDS_Folienmaster.pptx -------------------------------------------------------------------------------- /02_database_basics/photos/cybernetics-1869205_1280.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/02_database_basics/photos/cybernetics-1869205_1280.jpg -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/service.yaml: -------------------------------------------------------------------------------- 1 | 2 | # The service configuration used in the Dockerfile. 
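# Note: main.py loads this file via the IRIS_API_CONFIG environment variable, which the Dockerfile sets to /app/service.yaml.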
3 | 4 | # knn model 5 | model_path: /app/models/finalized_model.sav -------------------------------------------------------------------------------- /04_best_practices/Code_sharing_best_practices_workshop.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/04_best_practices/Code_sharing_best_practices_workshop.pptx -------------------------------------------------------------------------------- /Info Flyer/MUDS_Practical_Training_Flyer_2020_workshop.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/Info Flyer/MUDS_Practical_Training_Flyer_2020_workshop.jpg -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/models/finalized_model.sav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/03_continuous_integration/iris-api/models/finalized_model.sav -------------------------------------------------------------------------------- /03_continuous_integration/Code_sharing_best_practices_workshop.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Mu-DS/practical_training/HEAD/03_continuous_integration/Code_sharing_best_practices_workshop.pptx -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/tests/resources/prediction.py: -------------------------------------------------------------------------------- 1 | """Unit tests for iris-api""" 2 | 3 | from resources.IrisPredictorResource import predict_knn 4 | 5 | def test_predict_knn(): 6 | assert True # placeholder assertion: replace with a real check once predict_knn is implemented -------------------------------------------------------------------------------- /MUDS resources/Technical_University_of_Munich_emblem.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/resources/README.md: -------------------------------------------------------------------------------- 1 | # Resources 2 | 3 | Here you can find the different resources that are exposed through the HTTP API. 4 | 5 | # API 6 | 7 | 8 | ## /iris_api 9 | 10 | Detects the iris class from a list of features. 11 | 12 | Input: a JSON payload with a `features` list of four numbers 13 | 14 | Example: 15 | ```bash 16 | curl -d '{"features":[1,2,3,4]}' \ 17 | -H "Content-Type: application/json" \ 18 | -X POST http://localhost:8000/iris_api 19 | ``` 20 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/bin/docker_build_context.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"/..
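# The line above resolves the repository root: it cd's into the directory containing this script, prints the absolute path, and appends /.. to step one level up out of bin/.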
4 | 5 | rm -rf ${DIR}/build/docker 6 | 7 | mkdir -p ${DIR}/build/docker 8 | mkdir -p ${DIR}/build/docker/resources 9 | mkdir -p ${DIR}/build/docker/models 10 | 11 | 12 | cp ${DIR}/Dockerfile ${DIR}/build/docker 13 | 14 | cp ${DIR}/resources/*.py ${DIR}/build/docker/resources 15 | cp ${DIR}/*.py ${DIR}/build/docker 16 | cp ${DIR}/models/*.sav ${DIR}/build/docker/models 17 | cp ${DIR}/service.yaml ${DIR}/build/docker 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Practical Training for Researchers in Data Science and Software Development 2 | 3 | This repository contains the material of the 5-day lecture series developed for the Helmholtz graduate school for data science Munich (MUDS). For any questions or comments on the material covered here, please contact @the-rccg or @aliechoes. 4 | 5 | This repo uses [jupytext](https://github.com/mwouts/jupytext) to keep 6 | notebooks as text files. 7 | 8 | ## Official flyer for the (postponed) lecture series 9 | 10 | ![FlyerImage](https://github.com/Mu-DS/practical_training/blob/master/Info%20Flyer/MUDS_Practical_Training_Flyer_2020_workshop.jpg) 11 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Ali Boushehri 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | # Dockerfile for the Iris Prediction Service 3 | 4 | # Use an official Python runtime as a parent image 5 | FROM continuumio/miniconda3 6 | 7 | # Set the working directory to /app 8 | WORKDIR /app 9 | 10 | # Update Linux package lists 11 | RUN apt-get update 12 | 13 | # Install build tools (gcc etc.)
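# (build-essential supplies gcc and make in case any of the Python packages below need to compile C extensions)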
14 | RUN apt-get install -y build-essential 15 | 16 | 17 | # Install ops tools 18 | RUN apt-get install -y procps vim 19 | 20 | 21 | # Install any needed packages specified in requirements.txt 22 | # TODO: add correct python libraries 23 | RUN conda install -c conda-forge gevent 24 | RUN conda install -c conda-forge "gunicorn>=19.0" # quoted so the shell does not treat ">" as a redirect 25 | RUN conda install -c conda-forge "falcon>=2.0" 26 | 27 | 28 | # Copy the current directory contents into the container at /app 29 | COPY . /app 30 | RUN pwd 31 | 32 | # Make port 80 available to the world outside this container 33 | EXPOSE 80 34 | 35 | # Define environment variable 36 | ENV PYTHONUNBUFFERED TRUE 37 | ENV IRIS_API_CONFIG /app/service.yaml 38 | ENV NUM_WORKER 1 39 | 40 | # Run Gunicorn when the container launches 41 | CMD ["sh", "-c", "gunicorn --workers ${NUM_WORKER} --worker-class gevent --bind 0.0.0.0:80 main:app"] -------------------------------------------------------------------------------- /03_continuous_integration/03_continuous_integration.py: -------------------------------------------------------------------------------- 1 | # --- 2 | # jupyter: 3 | # jupytext: 4 | # formats: ipynb,py 5 | # text_representation: 6 | # extension: .py 7 | # format_name: light 8 | # format_version: '1.5' 9 | # jupytext_version: 1.5.2 10 | # kernelspec: 11 | # display_name: Python 3 12 | # language: python 13 | # name: python3 14 | # --- 15 | 16 | # # The Continuous Integration Pipeline 17 | 18 | # ## Motivation 19 | 20 | # ## Overview 21 | 22 | # 1. Git (Ali) 23 | # 2. Unit Tests (Ali) 24 | # 3. Docker (Ali) 25 | # 4. APIs (Ali) 26 | 27 | # # Git 28 | 29 | # ## Motivation 30 | 31 | # 32 | 33 | # 34 | 35 | # # Unit Tests 36 | 37 | # 38 | 39 | # Testing in Production 40 | # 41 | # 42 | 43 | # + jupyter={"outputs_hidden": true} 44 | 45 | # - 46 | 47 | 48 | # # Docker 49 | 50 | # ## Motivation 51 | 52 | # 53 | 54 | # + jupyter={"outputs_hidden": true} 55 | 56 | # - 57 | 58 | 59 | # # APIs 60 | 61 | # + jupyter={"outputs_hidden": true} 62 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/main.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import io 4 | import logging 5 | import falcon 6 | import yaml 7 | from resources.IrisPredictorResource import IrisPredictorResource 8 | 9 | def init_logging(): 10 | """Initialize logging to write to STDOUT.""" 11 | logger = logging.getLogger(__name__) 12 | logger.setLevel(logging.INFO) 13 | handler = logging.StreamHandler(sys.stdout) 14 | handler.setLevel(logging.INFO) 15 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 16 | handler.setFormatter(formatter) 17 | logger.addHandler(handler) 18 | return logger 19 | 20 | def ensure_if_path_exists(pth): 21 | # Create the target directory if it doesn't exist 22 | if not os.path.exists(pth): 23 | os.makedirs(pth) 24 | return None 25 | 26 | 27 | def load_yaml(file_path): 28 | with open(file_path, 'r') as stream: 29 | return yaml.safe_load(stream) # safe_load avoids executing arbitrary YAML tags 30 | 31 | 32 | 33 | """ 34 | In this part, the app is initialized to create the API.
There are multiple steps to be followed: 35 | 1) Initializing the API, loading the config file and loading the logger 36 | 2) Initializing the IrisPredictorResource 37 | 3) Adding the route 38 | """ 39 | 40 | 41 | ## Part 1 42 | app = falcon.API() 43 | 44 | config_path = os.environ.get('IRIS_API_CONFIG', None) 45 | if config_path is None: 46 | config_path = 'service.yaml' 47 | 48 | config = load_yaml(config_path) 49 | model_path = config['model_path'] 50 | 51 | # Start the logging 52 | logger = init_logging() 53 | logger.info('Service config: %s' % config) 54 | 55 | 56 | ## Part 2: Resources 57 | iris_api = IrisPredictorResource(model_path, logger) 58 | 59 | ## Part 3: Routes 60 | app.add_route("/iris_api", iris_api) -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g.
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/resources/IrisPredictorResource.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import sklearn 3 | from sklearn.neighbors import KNeighborsClassifier 4 | import json 5 | import numpy as np 6 | import falcon 7 | 8 | 9 | 10 | def predict_knn(features, model): 11 | 12 | """ 13 | This function gets the features and the model and predicts the output 14 | Args: 15 | features(list): list of features. It must contain four floating-point numbers 16 | model(sklearn model): knn sklearn model, loaded from the models folder 17 | Returns: 18 | predicted_class(str) 19 | """ 20 | 21 | classes = ['setosa', 'versicolor', 'virginica'] 22 | """ 23 | TODO: 24 | predict the class! 25 | """ 26 | return predicted_class 27 | 28 | ### resource 29 | class IrisPredictorResource(): 30 | """ 31 | TODO: Documentation 32 | """ 33 | def __init__(self, model_path, logger): 34 | """ 35 | TODO: Documentation 36 | """ 37 | self.logger = logger 38 | self.model = pickle.load(open(model_path, 'rb')) 39 | self.logger.info("Starting: IrisPredictor") 40 | 41 | def on_post(self, req, resp): 42 | """ 43 | TODO: Documentation 44 | """ 45 | try: 46 | self.logger.info("IrisPredictor: reading file") 47 | request_bytes = req.stream.read() 48 | 49 | try: 50 | request = json.loads(request_bytes.decode("utf-8")) 51 | 52 | except Exception as e: 53 | self.logger.error(e, exc_info=True) 54 | resp.status = falcon.HTTP_400 55 | resp.body = "Invalid JSON\n" 56 | return 57 | """ 58 | @TODO: 59 | check the quality of the input file. 60 | In case the quality of the input is not valid, 61 | send back the correct resp.body and resp.status 62 | """ 63 | features = request["features"] # the parsed JSON request carries the feature list 64 | ## In this part, you consider the input to be correct and 65 | ## just need to return the result 66 | prediction = predict_knn(features, self.model) 67 | 68 | self.logger.info('IrisPredictor: the prediction is %s' % prediction) 69 | response = {"predicted_class": prediction} 70 | 71 | self.logger.info('IrisPredictor: Sending the results \n') 72 | 73 | """ 74 | TODO: use the correct HTTP status code 75 | """ 76 | resp.status = ## FILL HERE!
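# hint: falcon defines status constants such as falcon.HTTP_200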
## 77 | resp.body = json.dumps(response) + '\n' 78 | 79 | except Exception as e: 80 | self.logger.error(e, exc_info=True) 81 | resp.status = falcon.HTTP_500 -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | .vscode/ 132 | 133 | data_loader/.Rhistory 134 | 135 | inputs/.Rhistory 136 | 137 | preprocessing/.Rhistory 138 | 139 | code.sh 140 | 141 | launch_code.sh 142 | 143 | configs/sample_config.json 144 | 145 | *.sh 146 | eval.sh 147 | launch_eval.sh 148 | 149 | 07_graphs/data -------------------------------------------------------------------------------- /03_continuous_integration/iris-api/README.md: -------------------------------------------------------------------------------- 1 | # Iris API 2 | 3 | This is an HTTP service that predicts the iris class from a list of features.
It includes: 4 | 5 | * [Iris Predictor](resources/IrisPredictorResource.py) 6 | 7 | To understand the APIs and resources, please refer to the folder [resources](resources) or the explanation [here](resources/README.md) 8 | 9 | ## Folders 10 | 11 | * [bin](bin): executable file creating the necessary folders and copying the models before building the Docker image 12 | * [models](models): sklearn KNN model 13 | * [resources](resources): resources for the different APIs 14 | 15 | 16 | ## Docker 17 | 18 | This part explains how to build and run the Docker image. 19 | 20 | ### Build a Docker container 21 | 22 | 23 | ```bash 24 | sudo bin/docker_build_context.sh 25 | sudo docker build --tag=iris_api:0.0.1 build/docker 26 | ``` 27 | 28 | 29 | ### Run Docker container 30 | 31 | Just CPU: 32 | 33 | ```bash 34 | sudo docker run -d -p 8000:80 iris_api:0.0.1 35 | ``` 36 | 37 | After starting the container, the service should listen on 127.0.0.1 port 8000. 38 | 39 | The number of Gunicorn workers can be configured by setting the `NUM_WORKER` environment variable when running the container, e.g. `-e NUM_WORKER=2`. 40 | 41 | ### Start service manually 42 | 43 | For debugging it can be helpful to start the service manually. Run the container but overwrite the entrypoint with a Bash shell (adjust the tag manually if your version differs from 0.0.1): 44 | 45 | ```bash 46 | docker run -it -p 8000:80 --entrypoint=/bin/bash iris_api:0.0.1 47 | ``` 48 | 49 | This starts the container and opens a shell but does not start the service. Start the service manually: 50 | 51 | ```bash 52 | cd /app 53 | gunicorn --workers 1 --worker-class gevent --bind 0.0.0.0:80 main:app 54 | ``` 55 | 56 | ### Login to the Docker container 57 | 58 | Look up the container ID: 59 | 60 | ```bash 61 | docker container ps 62 | CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 63 | 7932a3814453 friendlyhello "python app.py" 16 seconds ago Up 15 seconds 0.0.0.0:4000->80/tcp musing_robinson 64 | ``` 65 | 66 | Open a shell on the container: 67 | 68 | ```bash 69 | docker exec -it 7932a3814453 /bin/bash 70 | ``` 71 | 72 | ### Cleanup 73 | 74 | Remove all containers and images: 75 | 76 | ```bash 77 | sudo docker rm $(sudo docker ps -a -q) 78 | sudo docker rmi $(sudo docker images -q) 79 | ``` 80 | 81 | ## Gunicorn 82 | 83 | The Gunicorn configuration is described in [Gunicorn settings](http://docs.gunicorn.org/en/stable/settings.html). 84 | 85 | The most important Gunicorn configuration parameters are: 86 | 87 | * `--reload` - Restart workers when code changes. This should only be used during development 88 | * `--workers` - The number of worker processes for handling requests 89 | * `--worker-class` - The type of workers to use 90 | * `--bind` - The socket and port to bind 91 | * `--access-logfile` - Path of the access log file 92 | * `--error-logfile` - Path of the error log file 93 | * `--daemon` - Daemonize the Gunicorn process 94 | -------------------------------------------------------------------------------- /check_list_before_sharing.md: -------------------------------------------------------------------------------- 1 | # Check-List before sharing your code 2 | 3 | Thank you for wanting to share your code! 4 | And thank you even more for trying to make sure it is helpful rather than sending the person in circles! 5 | Research needs more people like you! 6 | 7 | ## Code is read way more frequently than it is written 8 | 9 | * [ ] Does your code abide by Python naming rules?
10 | 11 | - [ ] ALL functions are snake_case 12 | - [ ] ALL classes are UpperCamelCase 13 | - [ ] ALL variables are snake_case 14 | - [ ] ALL constants are UPPER_CASE 15 | - [ ] ALL packages are lowercase 16 | 17 | * [ ] No. There are no exceptions because you prefer them... 18 | * [ ] Are your custom data structures clearly described? 19 | 20 | - [ ] Shape? 21 | - [ ] Data Types? 22 | - [ ] Hierarchy? 23 | 24 | * [ ] Do ALL public functions have docstrings? 25 | 26 | - [ ] Does it include the inputs and their format? 27 | - [ ] Does it include the outputs and their format? 28 | - [ ] Does it include the exceptions that it throws? 29 | 30 | * [ ] Remember that function that took you ages to write? Read the code: if you can immediately understand it you are good - otherwise, rewrite it to be more explicit. 31 | * [ ] Do ANY functions have >2 bracket-pairs of any kind? Split them into more lines. 32 | * [ ] Have ALL your abbreviations been defined in the same file? 33 | * [ ] No, not everyone knows that abbreviation... 34 | * [ ] Have you set up an auto-formatter? 35 | * [ ] Has the formatter configuration been documented? 36 | * [ ] Okay. Now run the formatter again. 37 | 38 | ## Code is more often debugged than it is written 39 | 40 | * [ ] Do you have notebooks? 41 | 42 | - [ ] Restart Kernel 43 | - [ ] Run all 44 | - [ ] Repeat until no error shows up 45 | 46 | * [ ] Are ALL your private functions unit tested? 47 | 48 | - [ ] Does it guarantee ALL functionality the function allows? 49 | - [ ] Are your mathematical methods tested for convergence? 50 | - [ ] Have you included the edge cases in your tests? 51 | - [ ] Yes, even wrappers around standard libraries. 52 | 53 | * [ ] Have you set up a CI? 54 | 55 | - [ ] For the NEWEST versions of packages? 56 | - [ ] For the NEWEST versions of the language? 57 | - [ ] Have you checked that these are really the newest? 58 | - [ ] Does it run all notebooks? 59 | 60 | * [ ] Has it passed all tests for all versions? 61 | * [ ] Have you added a `requirements.txt` file to your repo? 62 | * [ ] Have you added the correct `.gitignore` file to your repo? 63 | * [ ] Does every folder contain a markdown with a correct and up-to-date explanation? 64 | * [ ] Does adding a Docker image help the reproducibility of your work? If so, have you implemented it? 65 | * [ ] Have you checked every box? Congratulations, you can now share the code :) 66 | 67 | ## Automating the boring stuff 68 | 69 | * [ ] Use **black** auto-formatting on save 70 | * [ ] Use **pylint** in CI to fail when things (e.g. docstrings) are missing 71 | * [ ] Use **codacy** to check the quality of the whole code 72 | * [ ] Use **TravisCI** to test against the newest versions from pip 73 | * [ ] Use **codecov** to ensure you are not missing unit tests 74 | * [ ] Use **mypy** for type hints and checking [link](http://mypy-lang.org/) 75 | 76 | ## Have more tips? 77 | 78 | Please do let us know or simply submit your own PR to this repo!
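As a minimal sketch of what the docstring boxes above ask for — the function, its argument names, and its behavior are invented purely for illustration:

```python
def normalize_features(features):
    """Scale a list of feature values to the [0, 1] range.

    Args:
        features (list of float): raw feature values; must be non-empty
            and must not be constant.

    Returns:
        list of float: the rescaled values, same length as the input.

    Raises:
        ValueError: if `features` is empty or all values are identical.
    """
    if not features:
        raise ValueError("features must not be empty")
    low, high = min(features), max(features)
    if low == high:
        raise ValueError("features must not be constant")
    return [(f - low) / (high - low) for f in features]
```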
79 | -------------------------------------------------------------------------------- /autogen_slidetype.py: -------------------------------------------------------------------------------- 1 | import json 2 | import re 3 | import os 4 | from tqdm import tqdm 5 | import click 6 | 7 | 8 | def get_num_hashtags(string): 9 | """count the number of hashtags at the beginning of a string 10 | 11 | :param string: string to count hashtags in 12 | :type string: str 13 | :return: number of consecutive hashtags followed by a white space 14 | :rtype: int 15 | """ 16 | num = 0 17 | match = re.match(r"^[#]{1,}[\s]", string) 18 | if match: 19 | num = len(match[0])-1 20 | return num 21 | 22 | 23 | def set_slide_type(metadata, celltype): 24 | """set the slideshow slide type in the given cell metadata and return it 25 | 26 | :param metadata: metadata dictionary of the cell to update 27 | :type metadata: dict 28 | :param celltype: type of slideshow type to designate this cell 29 | :type celltype: str 30 | """ 31 | if 'slideshow' not in metadata.keys(): 32 | metadata['slideshow'] = {} 33 | metadata['slideshow']['slide_type'] = celltype 34 | return metadata 35 | 36 | 37 | @click.command() 38 | @click.option("--in", "-i", "basename", default="", type=click.STRING, show_default=False, required=True, 39 | help="input file name to be loaded, autodetects jupytext") 40 | @click.option("--out", "-o", "outname", default="", type=click.STRING, show_default=False, required=True, 41 | help="output file name to be saved as, autodetects jupytext") 42 | @click.option("--order", "slide_order", default=2, type=click.IntRange(0,), show_default=True, required=False, 43 | help="Number of # above which all are done as subslides") 44 | @click.option("--INDENT", "indentation", default=1, type=click.IntRange(0,), show_default=True, required=False, 45 | help="Number of spaces to indent the json/ipynb output") 46 | def main(basename, outname, slide_order, indentation): 47 | """automatically generate slide_type metadata for ipynb files 48 | 49 | :param basename: input ipynb name without .ipynb 50 | :type basename: str 51 | :param outname: output ipynb name without .ipynb 52 | :type outname: str 53 | :param slide_order: number of #s above which sections are considered sub-slides 54 | :type slide_order: int > 0 55 | """ 56 | # Decoding jupyter notebooks as jsons 57 | with open(f"{basename}.ipynb", "r", encoding="utf-8") as infile: 58 | notebook = json.load(infile) 59 | 60 | # Adjusting metadata for each cell 61 | for cell_idx in tqdm(range(len(notebook["cells"]))): 62 | if len(notebook["cells"][cell_idx]["source"]): 63 | num_hashtags = max( 64 | map(get_num_hashtags, notebook["cells"][cell_idx]["source"])) 65 | metadata = notebook["cells"][cell_idx]['metadata'] 66 | if not isinstance(metadata, dict): 67 | print(metadata) 68 | metadata = {} 69 | if num_hashtags == 0: 70 | metadata = set_slide_type(metadata, "fragment") 71 | elif num_hashtags > slide_order: 72 | metadata = set_slide_type(metadata, "subslide") 73 | else: 74 | metadata = set_slide_type(metadata, "slide") 75 | else: 76 | metadata = set_slide_type(metadata, "skip") 77 | notebook["cells"][cell_idx]['metadata'] = metadata 78 | 79 | # Saving new file 80 | with open(f"{outname}.ipynb", "w", encoding="utf-8") as outfile: 81 | json.dump(notebook, fp=outfile, indent=indentation) 82 | # Keep the paired jupytext .py file in sync (it is plain text, so copy it verbatim) 83 | if f"{basename}.py" in os.listdir(): 84 | with open(f"{basename}.py", "r", encoding="utf-8") as infile: 85 | contents = infile.read() 86 | with open(f"{outname}.py", "w", encoding="utf-8") as
outfile: 87 | outfile.write(contents) 88 | 89 | 90 | if __name__ == "__main__": 91 | main() 92 | -------------------------------------------------------------------------------- /06_advanced_python/debugging.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "Collapsed": "false" 7 | }, 8 | "source": [ 9 | "# Debugging in Python" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "Collapsed": "false" 16 | }, 17 | "source": [ 18 | "" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "Collapsed": "false" 25 | }, 26 | "source": [ 27 | "## pdb - The Python Debugger" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "Collapsed": "false" 34 | }, 35 | "source": [ 36 | "part of the core library\n", 37 | "\n", 38 | "allows interactive as well as whole script runs\n", 39 | "\n", 40 | "```python\n", 41 | "import pdb\n", 42 | "import mymodule\n", 43 | "pdb.run('mymodule.some_function()')\n", 44 | "```\n", 45 | "\n", 46 | "```bash\n", 47 | "python -m pdb myscript.py\n", 48 | "```" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": { 54 | "Collapsed": "false" 55 | }, 56 | "source": [ 57 | "Python 3.2: `-c` option allows executing commands as if a `.pdbrc` config file was given\n", 58 | "\n", 59 | "Python 3.7: \n", 60 | "- built-in `breakpoint()` to set a trace instead of `import pdb; pdb.set_trace()`\n", 61 | "- `-m` option executes modules similar to the way `python -m` does" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": { 67 | "Collapsed": "false" 68 | }, 69 | "source": [ 70 | "Upon `breakpoint()`, execution enters `debug mode`" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": { 76 | "Collapsed": "false" 77 | }, 78 | "source": [ 79 | "### Debugger Commands" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": { 85 | "Collapsed": "false" 86 | }, 87 | "source": [] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": { 92 | "Collapsed": "false" 93 | }, 94 | "source": [ 95 | "## Callgraphs" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "Collapsed": "false" 102 | }, 103 | "source": [ 104 | "https://github.com/osteele/callgraph" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": { 110 | "Collapsed": "false" 111 | }, 112 | "source": [ 113 | "```\n", 114 | "pip install callgraph\n", 115 | "```" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": { 121 | "Collapsed": "false" 122 | }, 123 | "source": [ 124 | "### In Code (Decorator)\n", 125 | "\n", 126 | "```python\n", 127 | "from functools import lru_cache\n", 128 | "import callgraph.decorator as callgraph\n", 129 | "\n", 130 | "@callgraph()\n", 131 | "@lru_cache()\n", 132 | "def nchoosek(n, k):\n", 133 | " if k == 0:\n", 134 | " return 1\n", 135 | " if n == k:\n", 136 | " return 1\n", 137 | " return nchoosek(n - 1, k - 1) + nchoosek(n - 1, k)\n", 138 | "\n", 139 | "nchoosek(5, 2)\n", 140 | "\n", 141 | "nchoosek.__callgraph__.view()\n", 142 | "```" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": { 148 | "Collapsed": "false" 149 | }, 150 | "source": [ 151 | "### Callgraph Magic (Jupyter)\n", 152 | "\n", 153 | "```python\n", 154 | "from functools import lru_cache\n", 155 | "\n", 156 | "@lru_cache()\n", 157 | "def lev(a, b):\n", 158 | " if \"\" in (a, b):\n", 159 |
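"        # base case: the distance to an empty string is the length of the other string\n",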
" return len(a) + len(b)\n", 160 | "\n", 161 | " candidates = []\n", 162 | " if a[0] == b[0]:\n", 163 | " candidates.append(lev(a[1:], b[1:]))\n", 164 | " else:\n", 165 | " candidates.append(lev(a[1:], b[1:]) + 1)\n", 166 | " candidates.append(lev(a, b[1:]) + 1)\n", 167 | " candidates.append(lev(a[1:], b) + 1)\n", 168 | " return min(candidates)\n", 169 | "\n", 170 | "%callgraph -w10 lev(\"big\", \"dog\"); lev(\"dig\", \"dog\")\n", 171 | "```" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": { 177 | "Collapsed": "false" 178 | }, 179 | "source": [] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": { 184 | "Collapsed": "false" 185 | }, 186 | "source": [] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": { 191 | "Collapsed": "false" 192 | }, 193 | "source": [] 194 | } 195 | ], 196 | "metadata": { 197 | "kernelspec": { 198 | "display_name": "Python 3", 199 | "language": "python", 200 | "name": "python3" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | "name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.7.7" 213 | } 214 | }, 215 | "nbformat": 4, 216 | "nbformat_minor": 4 217 | } 218 | -------------------------------------------------------------------------------- /00_intro/00_intro.py: -------------------------------------------------------------------------------- 1 | # --- 2 | # jupyter: 3 | # jupytext: 4 | # formats: ipynb,py 5 | # text_representation: 6 | # extension: .py 7 | # format_name: light 8 | # format_version: '1.5' 9 | # jupytext_version: 1.5.2 10 | # kernelspec: 11 | # display_name: Python 3 12 | # language: python 13 | # name: python3 14 | # --- 15 | 16 | # # Why Practical Training is Crucial 17 | 18 | # ## Why bridging the gap is important 19 | 20 | # 21 | 22 | # 23 | 24 | # ### How to bridge the gap 25 | 26 | # There are plenty of expensive courses, thick books, and jaded postdocs telling you how to do things in theory - and that's great! 27 | 28 | # But... how do you get there? 29 | 30 | # Let's play a little game, can you tell me how to do each of these? 31 | 32 | # | Problem | Implementation | 33 | # |--------------------------------------------------------|----------------| 34 | # | Develop this new framework over the next 6 months | ? | 35 | # | Adding this feature will take a while | ? | 36 | # | The code you're writing is turnign into a monster | ? | 37 | # | Hmmm this Jupyter notebook is gettign too long | ? | 38 | # | Writing documentation is too troublesome | ? | 39 | # | The code of the other developer looks terrible | ? | 40 | # | "Why is there a super() in here???" | ? | 41 | # | "You know, you should make this script run with a CLI" | ? | 42 | # | "How are these objects related?" | ? | 43 | # | Chart the structure of your project | ? | 44 | # | Figure out which part is slowing down the code | ? | 45 | # | Speed up this NumPy code | ? | 46 | # | This loop is really slow... | | 47 | 48 | # Solution: 49 | 50 | # | Problem | Implementation | 51 | # |--------------------------------------------------------|----------------| 52 | # | Develop this new framework over the next 6 months | Agile, Sprint planning, etc. 
| 53 | # | Adding this feature will take a while | Sprint planning, review process | 54 | # | The code you're writing is turning into a monster | Code architecture, refactoring | 55 | # | Hmmm this Jupyter notebook is getting too long | module architecture | 56 | # | Writing documentation is too troublesome | AutoDoc, Docstring creator, etc | 57 | # | The code of the other developer looks terrible | code formatter, linting | 58 | # | "Why is there a super() in here???" | Java Developers | 59 | # | "What are these properties?" | Setters and Getters | 60 | # | "You know, you should make this script run with a CLI" | click | 61 | # | "How are these objects related?" | Code Analyzer | 62 | # | "Can you show me how this project is structured?" | UML, Code Analyzer | 63 | # | Figure out which part is slowing down the code | Dynamic Code Analyzer / Profiler | 64 | # | Speed up this NumPy code | Numba | 65 | # | This loop is really slow... | Map(), Numba, Dask | 66 | # | I should run this in parallel... | Multiprocessing, Dask | 67 | 68 | # # Is this Course for you? 69 | 70 | # 71 | 72 | # Have you ever gotten... 73 | 74 | # + [markdown] jupyter={"outputs_hidden": true} 75 | # - a shared project in your group, but couldn't figure out what 80\% of the functions or objects did? 76 | # - 77 | 78 | 79 | # - code from a previous student/PhD/postdoc and thought -- WHAT THE F- is this?! 80 | 81 | # - a bachelor student to use your code and only gotten stupid questions from them? 82 | 83 | # I hate to break it to you, but you also write bad code 84 | 85 | # We all write bad code, and the point is not to write perfect code, but to write less bad code. 86 | # 87 | # Just a world with less bad code. That's the dream. 88 | 89 | # # Exercise 90 | 91 | # - Pair up in groups of 2 or 3 92 | # - Show the other person the last Python code you wrote 93 | # - Spend 5 minutes trying to understand it 94 | # - Discuss the code 95 | 96 | # # Overview of the Course 97 | 98 | # 1. Fundamentals of Production Code 99 | # - Workflow Organization 100 | # - Environments 101 | # - Code Style and Formatters 102 | # - Design Patterns 103 | # - Thinking Functionally 104 | # - Module Architecture 105 | # - CLI Interfaces 106 | # 2. Data Management Fundamentals 107 | # - Pre-SQL 108 | # - SQL 109 | # - NoSQL 110 | # - Graph Databases 111 | # 3. Continuous Integration Pipeline 112 | # - Git 113 | # - Unit Tests 114 | # - Docker 115 | # - APIs 116 | 117 | # 4. Best Practices in Data Science 118 | # - Machine Learning 119 | # - Coding 120 | 121 | # 5.
Processing Data Efficiently 122 | # - TensorFlow 123 | # - Network Architectures & Applications 124 | # - Slurm 125 | # - Numba 126 | # - Dask 127 | -------------------------------------------------------------------------------- /99_other_material/meme_treasury.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Debugging" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "collapsed": "false" 14 | }, 15 | "source": [ 16 | "" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": { 23 | "collapsed": true, 24 | "jupyter": { 25 | "outputs_hidden": true 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "# Commenting your code is important\n", 36 | "\n", 37 | "" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": { 44 | "collapsed": true, 45 | "jupyter": { 46 | "outputs_hidden": true 47 | } 48 | }, 49 | "outputs": [], 50 | "source": [] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": { 55 | "collapsed": "false" 56 | }, 57 | "source": [ 58 | "Why this is needed\n", 59 | "\n", 60 | "" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": { 67 | "collapsed": true, 68 | "jupyter": { 69 | "outputs_hidden": true 70 | } 71 | }, 72 | "outputs": [], 73 | "source": [] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": { 78 | "collapsed": "false" 79 | }, 80 | "source": [ 81 | "why git is important\n", 82 | "\n", 83 | "" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": { 90 | "collapsed": true, 91 | "jupyter": { 92 | "outputs_hidden": true 93 | } 94 | }, 95 | "outputs": [], 96 | "source": [] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "collapsed": "false" 102 | }, 103 | "source": [ 104 | "Why containers are important\n", 105 | "\n", 106 | "" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": { 112 | "collapsed": "false" 113 | }, 114 | "source": [ 115 | "" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": { 121 | "collapsed": "false" 122 | }, 123 | "source": [ 124 | "" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": { 130 | "collapsed": "false" 131 | }, 132 | "source": [ 133 | "" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": { 139 | "collapsed": "false" 140 | }, 141 | "source": [ 142 | "" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": { 148 | "collapsed": "false" 149 | }, 150 | "source": [ 151 | "" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": { 157 | "collapsed": "false" 158 | }, 159 | "source": [ 160 | "" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": { 166 | "collapsed": "false" 167 | }, 168 | "source": [ 169 | "" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": { 175 | "collapsed": "false" 176 | }, 177 | "source": [ 178 | "" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": { 184 | "collapsed": "false" 185 | }, 186 | "source": [ 187 | "" 188 | ] 189 | } 190 | ], 191 | "metadata": { 192 | "kernelspec": { 193 | "display_name": "Python 3", 194 | "language": "python", 195 | "name": "python3" 196 | }, 197 |
"language_info": { 198 | "codemirror_mode": { 199 | "name": "ipython", 200 | "version": 3 201 | }, 202 | "file_extension": ".py", 203 | "mimetype": "text/x-python", 204 | "name": "python", 205 | "nbconvert_exporter": "python", 206 | "pygments_lexer": "ipython3", 207 | "version": "3.7.9" 208 | } 209 | }, 210 | "nbformat": 4, 211 | "nbformat_minor": 4 212 | } 213 | -------------------------------------------------------------------------------- /04_best_practices/04_best_practices.py: -------------------------------------------------------------------------------- 1 | # --- 2 | # jupyter: 3 | # jupytext: 4 | # formats: ipynb,py 5 | # text_representation: 6 | # extension: .py 7 | # format_name: light 8 | # format_version: '1.5' 9 | # jupytext_version: 1.6.0 10 | # kernelspec: 11 | # display_name: Python 3 12 | # language: python 13 | # name: python3 14 | # --- 15 | 16 | # + [markdown] slideshow={"slide_type": "slide"} 17 | # # Best Practices in Machine Learning and Code Organization 18 | 19 | # + [markdown] slideshow={"slide_type": "slide"} 20 | # ## Motivation 21 | 22 | # + [markdown] slideshow={"slide_type": "fragment"} 23 | # - What does best-practice even mean? 24 | # - How do I know something is a bad practice? 25 | 26 | # + [markdown] jupyter={"outputs_hidden": true} slideshow={"slide_type": "fragment"} 27 | # > It's not wrong, but it feels wrong. 28 | 29 | 30 | # + [markdown] slideshow={"slide_type": "slide"} 31 | # ## Overview 32 | 33 | # + [markdown] slideshow={"slide_type": "fragment"} 34 | # Best Pratices in: 35 | # - Machine Learning Code Bases and Versioning 36 | # - Code and Module organization and philosophies 37 | 38 | # + [markdown] slideshow={"slide_type": "slide"} 39 | # ## Bad vs. Best Practices in Python 40 | 41 | # + [markdown] slideshow={"slide_type": "subslide"} 42 | # ### Repetition 43 | 44 | # + [markdown] slideshow={"slide_type": "subslide"} 45 | # #### Python is not C - so do ***not*** copy-and-paste! 46 | 47 | # + [markdown] slideshow={"slide_type": "fragment"} 48 | # 49 | 50 | # + [markdown] slideshow={"slide_type": "subslide"} 51 | # #### Instead of copy & pasting: 52 | 53 | # + [markdown] slideshow={"slide_type": "fragment"} 54 | # - write functions! 55 | # - compose functions! 56 | # - create partial function! 57 | 58 | # + slideshow={"slide_type": "fragment"} 59 | def add(a, b): 60 | return a + b 61 | 62 | 63 | # + 64 | from functools import partial 65 | 66 | add2 = partial(add, 2) # Create a copy of add() with a=2 67 | 68 | add2(3) 69 | 70 | # + slideshow={"slide_type": "fragment"} 71 | add2 = lambda x: add(2, x) 72 | 73 | add2(3) 74 | 75 | 76 | # + 77 | def add2(x): 78 | return add(2, x) 79 | 80 | add2(3) 81 | 82 | 83 | # + [markdown] slideshow={"slide_type": "subslide"} 84 | # ### Switch Behavior 85 | 86 | # + [markdown] slideshow={"slide_type": "subslide"} 87 | # #### Python has no switch statements, but don't go around stacking if's: 88 | 89 | # + [markdown] slideshow={"slide_type": "fragment"} 90 | # 91 | 92 | # + [markdown] slideshow={"slide_type": "subslide"} 93 | # #### Instead of stacking if-else: 94 | 95 | # + [markdown] slideshow={"slide_type": "fragment"} 96 | # - map things with a dictionary! 97 | 98 | # + [markdown] slideshow={"slide_type": "fragment"} 99 | # Dictionaries are hashmaps, meaning the map a hash to an object. 100 | 101 | # + [markdown] slideshow={"slide_type": "fragment"} 102 | # Since Functions are first order objects in Python, they can be pointed to! 
103 | 104 | # + slideshow={"slide_type": "fragment"} 105 | def add(a, b): 106 | return a + b 107 | 108 | def add_sum(a, b): 109 | return sum([a, b]) 110 | 111 | math_functions = {'add': add_sum} 112 | 113 | math_functions['add'](2, 2) 114 | 115 | # + [markdown] slideshow={"slide_type": "subslide"} 116 | # ### Depth 117 | 118 | # + [markdown] slideshow={"slide_type": "subslide"} 119 | # #### Making too many layers - inheritance, nesting, etc. 120 | 121 | # + [markdown] slideshow={"slide_type": "fragment"} 122 | # 123 | 124 | # + [markdown] slideshow={"slide_type": "subslide"} 125 | # #### Instead keep things shallow 126 | 127 | # + [markdown] slideshow={"slide_type": "fragment"} 128 | # Ask yourself: 129 | # - Do I need this class? 130 | # - Will it be instantiated often? 131 | # - Are there many objects inheriting from it? 132 | # - Does it carry state? Otherwise it's a namespace! 133 | # - Does this need to be a submodule or a file? 134 | # - Are there many long functions? 135 | # - Are there a large number of private functions? 136 | 137 | # + [markdown] slideshow={"slide_type": "fragment"} 138 | # Singleton Pattern (Single global instance for an Object) 139 | # - If it does not carry state, it is a namespace 140 | # - In Python, any file is a namespace! No need for the Object or Instance! 141 | # - If it just carries state, you want a database 142 | # - Atomicity of operation can be guaranteed with a database 143 | # - Database outside of Global Interpreter Lock (GIL) 144 | # - Databases scale better! 145 | 146 | # + [markdown] slideshow={"slide_type": "subslide"} 147 | # ### Readability 148 | 149 | # + [markdown] slideshow={"slide_type": "subslide"} 150 | # #### Write code - but write it to be read! 151 | 152 | # + [markdown] slideshow={"slide_type": "fragment"} 153 | # 154 | 155 | # + [markdown] slideshow={"slide_type": "subslide"} 156 | # #### Code is written to be read 157 | 158 | # + [markdown] slideshow={"slide_type": "fragment"} 159 | # - Documentation 160 | # - Type Hinting 161 | # - Naming 162 | 163 | # + [markdown] slideshow={"slide_type": "subslide"} 164 | # ### Dependencies 165 | 166 | # + [markdown] slideshow={"slide_type": "subslide"} 167 | # #### Sometimes they're too tempting 168 | 169 | # + [markdown] slideshow={"slide_type": "fragment"} 170 | # 171 | 172 | # + [markdown] slideshow={"slide_type": "subslide"} 173 | # #### Why? 174 | 175 | # + [markdown] slideshow={"slide_type": "fragment"} 176 | # - Projects get abandoned 177 | # - Lack of security patches 178 | # - Forced to stay with old versions 179 | # - => Your project becomes ancient 180 | # - 181 | 182 | # Update regularly! 183 | # - Small bugs on a regular basis prevent abandonment 184 | # - Improved performance 185 | # - Additional functionality! 186 | 187 | # + [markdown] slideshow={"slide_type": "subslide"} 188 | # ### Keep things short 189 | 190 | # + [markdown] slideshow={"slide_type": "subslide"} 191 | # #### The first law of Software Quality 192 | 193 | # + [markdown] slideshow={"slide_type": "fragment"} 194 | # 195 | 196 | # + [markdown] slideshow={"slide_type": "subslide"} 197 | # #### Sometimes less functionality is more maintainability 198 | 199 | # + [markdown] slideshow={"slide_type": "fragment"} 200 | # > Each line of code is a credit you take on and interest is paid in time to maintain the base. Don't default on your code debt. 201 | 202 | # + [markdown] slideshow={"slide_type": "fragment"} 203 | # Finding non-critical code: 204 | # - Is this functionality used by many?
205 | # - Is this code still used or abandoned? 206 | # - Is it relevant to the larger goal? 207 | 208 | # + [markdown] slideshow={"slide_type": "fragment"} 209 | # Dealing with too much code: 210 | # - Spin out functionality into a different module 211 | # - Simplify the code 212 | # - Delete code 213 | # - No really, you should delete code 214 | 215 | # + [markdown] slideshow={"slide_type": "subslide"} 216 | # ### Use version control 217 | 218 | # + [markdown] slideshow={"slide_type": "fragment"} 219 | # 220 | -------------------------------------------------------------------------------- /06_advanced_python/jupyter_addons.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Useful tools for your workflow" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Spell Check!" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "https://github.com/ijmbarr/jupyterlab_spellchecker" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "`jupyter labextension install @ijmbarr/jupyterlab_spellchecker`" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "### Using LaTeX in PowerPoint" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "https://www.fast.ai/2019/06/17/latex-ppt/" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Table of Contents\n", 50 | "\n", 51 | "https://github.com/ian-r-rose/jupyterlab-toc" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "```bash\n", 59 | "jupyter labextension install @jupyterlab/toc\n", 60 | "```" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## Collapsible Headings\n", 68 | "\n", 69 | "https://github.com/aquirdTurtle/Collapsible_Headings" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "```bash\n", 77 | "jupyter labextension install @aquirdturtle/collapsible_headings\n", 78 | "```" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## Go-To-Definition\n", 86 | "\n", 87 | "https://github.com/krassowski/jupyterlab-go-to-definition\n", 88 | "\n", 89 | "```bash\n", 90 | "jupyter labextension install @krassowski/jupyterlab_go_to_definition\n", 91 | "```" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## Notifications for the completion of long-running Jupyter code\n", 99 | "\n", 100 | "https://github.com/ShopRunner/jupyter-notify" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "### Install JupyterNotify\n", 108 | "```bash\n", 109 | "pip install jupyternotify\n", 110 | "```" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Enable Notification " 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "#### Activate Javascript (only JupyterLab)\n", 125 | "```python\n", 126 | "%%javascript\n", 127 | "var jq = document.createElement('script');\n", 128 | "jq.src = \"https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js\";\n", 129 |
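"// the surrounding lines build a <script> tag and append it to <head>, loading jQuery for the notification widget\n",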
"document.getElementsByTagName('head')[0].appendChild(jq);\n", 130 | "```" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "#### Enable extension\n", 138 | "```python\n", 139 | "%load_ext jupyternotify\n", 140 | "``` " 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "### Using Notification\n", 148 | "```python\n", 149 | "%%notify\n", 150 | "print(\"hi, when this is written, you'll get a notification!\")\n", 151 | "```" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "#### Giving a message\n", 159 | "\n", 160 | "\n", 161 | "```python\n", 162 | "%%notify -m \"this is the notification message\"\n", 163 | "```\n", 164 | "\n", 165 | "```python\n", 166 | "%%notify -o\n", 167 | "time.sleep(4)\n", 168 | "'this is the notification messsage'\n", 169 | "```" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "## Bell Notification\n", 177 | "\n", 178 | "https://github.com/samwhitehall/ipython-bell" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## Visualizations" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "### Dash Plugin\n", 193 | "https://github.com/plotly/jupyterlab-dash\n", 194 | "```bash\n", 195 | "jupyter labextension install jupyterlab-dash@0.1.0-alpha.3\n", 196 | "```" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### Bokeh Plugin\n", 204 | "https://github.com/bokeh/jupyter_bokeh\n", 205 | "```bash\n", 206 | "conda install -c bokeh jupyter_bokeh\n", 207 | "jupyter labextension install @jupyter-widgets/jupyterlab-manager\n", 208 | "jupyter labextension install @bokeh/jupyter_bokeh\n", 209 | "```" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "## Enhanced Multiprocessing\n", 217 | "\n", 218 | "https://github.com/krassowski/enhanced-multiprocessing\n", 219 | "\n", 220 | "```bash\n", 221 | "pip install enhanced_multiprocessing\n", 222 | "```" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "## Helpers\n", 230 | "\n", 231 | "https://github.com/krassowski/jupyter-helpers\n", 232 | "\n", 233 | "```bash\n", 234 | "pip install jupyter_helpers\n", 235 | "```" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "## Jupytext for better git diffs" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "https://github.com/mwouts/jupytext" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "```bash\n", 257 | "pip install jupytext\n", 258 | "```" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "## Make presentations out of Notebooks!" 
261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "## Make presentations out of Notebooks!" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "https://github.com/damianavila/RISE" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "```bash\n", 280 | "pip install rise\n", 281 | "``` " 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "For automatically designating slide_types based on markdown headers, consider: https://github.com/the-rccg/ipynb_slidetype_generator" 289 | ] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.7.9" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 4 313 | } 314 | -------------------------------------------------------------------------------- /03_continuous_integration/03_continuous_integration.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "Collapsed": "false", 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "# The API & Docker course" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": { 18 | "Collapsed": "false", 19 | "slideshow": { 20 | "slide_type": "slide" 21 | }, 22 | "tags": [] 23 | }, 24 | "source": [ 25 | "In this session, we try to implement an API using Docker and Python. The code shall be documented and stored in git. We will talk about testing scenarios and continuous integration.\n", 26 | "\n", 27 | "Project: The goal of the project is to create an API which receives inputs in the form of lists, predicts the target value, and sends back the results. For example, given the input\n", 28 | "\n", 29 | "```bash\n", 30 | " curl -d '{\"features\":[1,2,3,4]}' \\\n", 31 | " -H \"Content-Type: application/json\" \\\n", 32 | " -X POST http://localhost:8000/iris_api\n", 33 | "```" 34 | ] 35 | },
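{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The same request can be sent from Python (a sketch for illustration only; it assumes the service from this session is already running locally and that `requests` is installed):\n", "\n", "```python\n", "import requests\n", "\n", "response = requests.post(\n", "    'http://localhost:8000/iris_api', json={'features': [1, 2, 3, 4]}\n", ")\n", "print(response.status_code, response.text)\n", "```" ] },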
36 | { 37 | "cell_type": "markdown", 38 | "metadata": { 39 | "Collapsed": "false", 40 | "slideshow": { 41 | "slide_type": "slide" 42 | } 43 | }, 44 | "source": [ 45 | "\n", 46 | "For this part, you need to have a GitHub account. I highly recommend using GitKraken as your UI for git commands.\n", 47 | "\n", 48 | "## Implementation Part 1: repository setup\n", 49 | "\n", 50 | "1. Create a GitHub repository named `iris-api`\n", 51 | "2. Add a .gitignore for Python code\n", 52 | "3. Clone the repository on your computer\n", 53 | "4. Copy the content of the iris-api folder from https://github.com/Mu-DS/practical_training/tree/master/03_continuous_integration/iris-api into your repository\n", 54 | "5. Commit and push\n", 55 | "\n", 56 | "\n", 57 | "https://github.github.com/training-kit/downloads/github-git-cheat-sheet.pdf" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "slideshow": { 71 | "slide_type": "slide" 72 | } 73 | }, 74 | "source": [ 75 | "## HTTP Request Methods\n", 76 | "\n", 77 | "\n", 78 | "What is HTTP?\n", 79 | "The Hypertext Transfer Protocol (HTTP) is designed to enable communications between clients and servers.\n", 80 | "\n", 81 | "HTTP works as a request-response protocol between a client and server.\n", 82 | "\n", 83 | "Example: A client (browser) sends an HTTP request to the server; then the server returns a response to the client. The response contains status information about the request and may also contain the requested content.\n", 84 | "\n", 85 | "\n", 86 | "https://www.w3schools.com/tags/ref_httpmethods.asp\n", 87 | "\n", 88 | "## HTTP Status codes\n", 89 | "\n", 90 | "- 200 OK\n", 91 | "- 400 Bad request\n", 92 | "- 500 Internal Server Error\n", 93 | "\n", 94 | "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": { 100 | "Collapsed": "false", 101 | "slideshow": { 102 | "slide_type": "slide" 103 | } 104 | }, 105 | "source": [ 106 | "## Web frameworks for open science access\n", 107 | "\n", 108 | "Frameworks:\n", 109 | "\n", 110 | "- Flask [https://flask.palletsprojects.com/en/1.1.x/tutorial/] [https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world]\n", 111 | "- Django\n", 112 | "- Bottle\n", 113 | "- Falcon (we use this one!) [https://falcon.readthedocs.io/en/stable/]" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": { 119 | "slideshow": { 120 | "slide_type": "slide" 121 | } 122 | }, 123 | "source": [ 124 | "## The GET Method\n", 125 | "\n", 126 | "GET is used to request data from a specified resource.\n", 127 | "\n", 128 | "GET is one of the most common HTTP methods.\n", 129 | "\n", 130 | "Note that the query string (name/value pairs) is sent in the URL of a GET request:\n", 131 | "\n", 132 | "    /test/demo_form.php?name1=value1&name2=value2\n", 133 | "\nSome other notes on GET requests:\n", 134 | "\n", 135 | "- GET requests can be cached\n", 136 | "- GET requests remain in the browser history\n", 137 | "- GET requests can be bookmarked\n", 138 | "- GET requests should never be used when dealing with sensitive data\n", 139 | "- GET requests have length restrictions\n", 140 | "- GET requests are only used to request data (not modify)\n" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": { 146 | "slideshow": { 147 | "slide_type": "slide" 148 | } 149 | }, 150 | "source": [ 151 | "## The POST Method\n", 152 | "\n", 153 | "POST is used to send data to a server to create/update a resource.\n", 154 | "\n", 155 | "The data sent to the server with POST is stored in the request body of the HTTP request:\n", 156 | "\n", 157 | "    POST /test/demo_form.php HTTP/1.1\n", 158 | "    Host: w3schools.com\n", 159 | "\n    name1=value1&name2=value2\n", 160 | "\nPOST is one of the most common HTTP methods.\n", 161 | "\n", 162 | "Some other notes on POST requests:\n", 163 | "\n", 164 | "- POST requests are never cached\n", 165 | "- POST requests do not remain in the browser history\n", 166 | "- POST requests cannot be bookmarked\n", 167 | "- POST requests have no restrictions on data length" 168 | ] 169 | },
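{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A minimal Falcon resource (sketch)\n", "\n", "Before the exercise, here is a rough sketch of what a Falcon resource handling a POST request looks like (illustrative only -- the names below are made up and this is not the actual IrisPredictorResource):\n", "\n", "```python\n", "import falcon\n", "\n", "class EchoResource:\n", "    def on_post(self, req, resp):\n", "        payload = req.media  # parsed JSON request body\n", "        resp.media = {'received': payload}  # serialized back to JSON\n", "        resp.status = falcon.HTTP_200\n", "\n", "api = falcon.API()  # falcon.App() in falcon >= 3.0\n", "api.add_route('/echo', EchoResource())\n", "```" ] },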
170 | { 171 | "cell_type": "markdown", 172 | "metadata": { 173 | "Collapsed": "false", 174 | "slideshow": { 175 | "slide_type": "slide" 176 | } 177 | }, 178 | "source": [ 179 | "## Implementation 2: Coding part\n", 180 | "\n", 181 | "There are missing parts in '/resources/IrisPredictorResource.py'.\n", 182 | "\n", 183 | "Finish the code with your teammate." 184 | ] 185 | },
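{ "cell_type": "markdown", "metadata": { "Collapsed": "false", "slideshow": { "slide_type": "fragment" } }, "source": [ "Once the missing parts are filled in, a unit test for the API could look like the following sketch (anticipating Implementation 4 below; the module and attribute names here are assumptions, not the actual exercise code):\n", "\n", "```python\n", "import json\n", "\n", "import falcon\n", "from falcon import testing\n", "\n", "from main import app  # assumes main.py exposes the falcon API object as 'app'\n", "\n", "def test_iris_api_post():\n", "    client = testing.TestClient(app)\n", "    result = client.simulate_post(\n", "        '/iris_api', body=json.dumps({'features': [1, 2, 3, 4]})\n", "    )\n", "    assert result.status == falcon.HTTP_200\n", "```" ] },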
186 | { 187 | "cell_type": "markdown", 188 | "metadata": { 189 | "Collapsed": "false", 190 | "slideshow": { 191 | "slide_type": "slide" 192 | } 193 | }, 194 | "source": [ 195 | "### API structure\n", 196 | "\n", 197 | "\n", 198 | "The structure of the root folder should look like this:\n", 199 | "\n", 200 | "```\n", 201 | "root/\n", 202 | " ├── resources/ \n", 203 | " │\n", 204 | " ├── tests/\n", 205 | " │\n", 206 | " ├── bin/ \n", 207 | " │\n", 208 | " ├── models/ \n", 209 | " │\n", 210 | " ├── .gitignore \n", 211 | " │\n", 212 | " ├── Dockerfile \n", 213 | " │\n", 214 | " ├── main.py \n", 215 | " │\n", 216 | " ├── LICENSE \n", 217 | " │\n", 218 | " ├── service.yaml \n", 219 | " │ \n", 220 | " └── README.md \n", 221 | "```\n", 222 | "\n" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": { 228 | "Collapsed": "false", 229 | "slideshow": { 230 | "slide_type": "slide" 231 | } 232 | }, 233 | "source": [ 234 | "## Docker\n", 235 | "\n", 236 | "Developing apps today requires so much more than writing code. Multiple languages, frameworks, architectures, and discontinuous interfaces between tools for each lifecycle stage creates enormous complexity. Docker simplifies and accelerates your workflow, while giving developers the freedom to innovate with their choice of tools, application stacks, and deployment environments for each project.\n", 237 | "\n", 238 | "https://www.docker.com/sites/default/files/d8/2019-09/docker-cheat-sheet.pdf" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": { 244 | "Collapsed": "false", 245 | "slideshow": { 246 | "slide_type": "slide" 247 | } 248 | }, 249 | "source": [ 250 | "## Implementation 3: Docker part\n", 251 | "\n", 252 | "We use Docker as our container. We also use gunicorn for handling the calls.\n", 253 | "https://gunicorn.org/\n", 254 | "\n", 255 | "1. Finalize the docker file '/Dockerfile'.\n", 256 | "2. Install Docker using the README\n", 257 | "3. Run the Docker container\n", 258 | "4. Test the Docker container using a curl command\n" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": { 264 | "slideshow": { 265 | "slide_type": "slide" 266 | } 267 | }, 268 | "source": [ 269 | "## GitHub Apps for CI\n", 270 | "\n", 271 | "You can use different GitHub apps to check your code quality\n", 272 | "\n", 273 | "Here we use Travis CI & Codacy\n", 274 | "\n", 275 | "Travis CI: https://travis-ci.org/\n", 276 | "\n", 277 | "Codacy: https://codacy.com" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": { 283 | "slideshow": { 284 | "slide_type": "slide" 285 | } 286 | }, 287 | "source": [ 288 | "## Implementation 4: Add unit testing\n", 289 | "\n", 290 | "Add different test scenarios for your functions to check that everything works\n", 291 | "\n", 292 | "Add these unit tests in the '/tests/' folder" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": { 298 | "slideshow": { 299 | "slide_type": "slide" 300 | } 301 | }, 302 | "source": [ 303 | "## Implementation 5: CI apps\n", 304 | "\n", 305 | "Log in to Travis and Codacy and connect them to your GitHub account" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": { 311 | "slideshow": { 312 | "slide_type": "slide" 313 | } 314 | }, 315 | "source": [ 316 | "## Implementation 6: Travis\n", 317 | "\n", 318 | "Add this content to the '.travis.yml' file\n", 319 | "\n", 320 | "```yaml\n", 321 | "os:\n", 322 | " - linux\n", 323 | "\n", 324 | "language: python\n", 325 | "\n", 326 | "python:\n", 327 | " - \"3.6\"\n", 328 | " - \"3.7\"\n", 329 | "\n", 330 | "script:\n", 331 | " - pytest\n", 332 | "```" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": {}, 339 | "outputs": [], 340 | "source": [] 341 | } 342 | ], 343 | "metadata": { 344 | "celltoolbar": "Slideshow", 345 | "jupytext": { 346 | "formats": "ipynb,py" 347 | }, 348 | "kernelspec": { 349 | "display_name": "Python 3", 350 | "language": "python", 351 | "name": "python3" 352 | }, 353 | "language_info": { 354 | "codemirror_mode": { 355 | "name": "ipython", 356 | "version": 3 357 | }, 358 | "file_extension": ".py", 359 | "mimetype": "text/x-python", 360 | "name": "python", 361 | "nbconvert_exporter": "python", 362 | "pygments_lexer": "ipython3", 363 | "version": "3.7.4" 364 | } 365 | }, 366 | "nbformat": 4, 367 | "nbformat_minor": 4 368 | } 369 | -------------------------------------------------------------------------------- /00_intro/00_intro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# Why Practical Training is Crucial" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "slideshow": { 18 | "slide_type": "slide" 19 | } 20 | }, 21 | "source": [ 22 | "## Why bridging the gap is important" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": { 28 | "slideshow": { 29 | "slide_type": "fragment" 30 | } 31 | }, 32 | "source": [ 33 | "" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": { 39 | "slideshow": { 40 | "slide_type": "fragment" 41 | } 42 | }, 43 | "source": [ 44 | "" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "slideshow": { 51 | "slide_type": "subslide" 52 | } 53 | }, 54 | "source": [ 55 | "### How to bridge the gap" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | 
"metadata": { 61 | "slideshow": { 62 | "slide_type": "fragment" 63 | } 64 | }, 65 | "source": [ 66 | "There are plenty of expensive courses, thick books, and jaded postdocs telling you how to do things in theory - and that's great! " 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": { 72 | "slideshow": { 73 | "slide_type": "fragment" 74 | } 75 | }, 76 | "source": [ 77 | "But... how do you get there?" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "slideshow": { 84 | "slide_type": "fragment" 85 | } 86 | }, 87 | "source": [ 88 | "Let's play a little game, can you tell me how to do each of these?" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "slideshow": { 95 | "slide_type": "fragment" 96 | } 97 | }, 98 | "source": [ 99 | "| Problem | Implementation |\n", 100 | "|--------------------------------------------------------|----------------|\n", 101 | "| Develop this new framework over the next 6 months | ? |\n", 102 | "| Adding this feature will take a while | ? |\n", 103 | "| The code you're writing is turnign into a monster | ? |\n", 104 | "| Hmmm this Jupyter notebook is gettign too long | ? |\n", 105 | "| Writing documentation is too troublesome | ? |\n", 106 | "| The code of the other developer looks terrible | ? |\n", 107 | "| \"Why is there a super() in here???\" | ? |\n", 108 | "| \"You know, you should make this script run with a CLI\" | ? |\n", 109 | "| \"How are these objects related?\" | ? |\n", 110 | "| Chart the structure of your project | ? |\n", 111 | "| Figure out which part is slowing down the code | ? |\n", 112 | "| Speed up this NumPy code | ? |\n", 113 | "| This loop is really slow... | |" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": { 119 | "slideshow": { 120 | "slide_type": "fragment" 121 | } 122 | }, 123 | "source": [ 124 | "Solution:" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": { 130 | "slideshow": { 131 | "slide_type": "fragment" 132 | } 133 | }, 134 | "source": [ 135 | "| Problem | Implementation |\n", 136 | "|--------------------------------------------------------|----------------|\n", 137 | "| Develop this new framework over the next 6 months | Agile, Sprint planning, etc. |\n", 138 | "| Adding this feature will take a while | Sprint planning, review process |\n", 139 | "| The code you're writing is turnign into a monster | Code architecture, refactoring |\n", 140 | "| Hmmm this Jupyter notebook is gettign too long | module architecture |\n", 141 | "| Writing documentation is too troublesome | AutoDoc, Docstring creator, etc |\n", 142 | "| The code of the other developer looks terrible | code formatter, linting |\n", 143 | "| \"Why is there a super() in here???\" | Java Developers |\n", 144 | "| \"What are these properties?\" | Setters and Getters |\n", 145 | "| \"You know, you should make this script run with a CLI\" | click |\n", 146 | "| \"How are these objects related?\" | Coda Analytzr |\n", 147 | "| \"Can you show me how this project is structured?\" | UML, Code Analyzer |\n", 148 | "| Figure out which part is slowing down the code | Dynamic Code Analyzer / Profiler |\n", 149 | "| Speed up this NumPy code | Numba |\n", 150 | "| This loop is really slow.... | Map(), Numba, Dask |\n", 151 | "| I should run this in parallel... 
| Multiprocessing, Dask |" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": { 157 | "slideshow": { 158 | "slide_type": "slide" 159 | } 160 | }, 161 | "source": [ 162 | "# Is this Course for you?" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": { 168 | "slideshow": { 169 | "slide_type": "fragment" 170 | } 171 | }, 172 | "source": [ 173 | "" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": { 179 | "slideshow": { 180 | "slide_type": "fragment" 181 | } 182 | }, 183 | "source": [ 184 | "Have you ever gotten..." 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": { 190 | "jupyter": { 191 | "outputs_hidden": true 192 | }, 193 | "lines_to_next_cell": 2, 194 | "slideshow": { 195 | "slide_type": "fragment" 196 | } 197 | }, 198 | "source": [ 199 | "- a shared project in your group, but couldn't figure out what 80\\% of the functions or objects did?" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": { 205 | "slideshow": { 206 | "slide_type": "fragment" 207 | } 208 | }, 209 | "source": [ 210 | "- code from a previous student/PhD/postdoc and thought -- WHAT THE F- is this?!" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": { 216 | "slideshow": { 217 | "slide_type": "fragment" 218 | } 219 | }, 220 | "source": [ 221 | "- a bachelor student to use your code and only gotten stupid questions from them?" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": { 227 | "slideshow": { 228 | "slide_type": "fragment" 229 | } 230 | }, 231 | "source": [ 232 | "I hate to break it to you, but you also write bad code" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": { 238 | "slideshow": { 239 | "slide_type": "fragment" 240 | } 241 | }, 242 | "source": [ 243 | "We all write bad code, and the point is not to write perfect code, but to write less bad code.\n", 244 | "\n", 245 | "Just a world with less bad code. That's the dream." 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": { 251 | "slideshow": { 252 | "slide_type": "slide" 253 | } 254 | }, 255 | "source": [ 256 | "# Exercise" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": { 262 | "slideshow": { 263 | "slide_type": "fragment" 264 | } 265 | }, 266 | "source": [ 267 | "- Pair up in groups of 2 or 3\n", 268 | "- Show the other person the last piece of Python code you wrote\n", 269 | "- Spend 5 minutes trying to understand it\n", 270 | "- Discuss the code" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": { 276 | "slideshow": { 277 | "slide_type": "slide" 278 | } 279 | }, 280 | "source": [ 281 | "# Overview of the Course" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": { 287 | "slideshow": { 288 | "slide_type": "fragment" 289 | } 290 | }, 291 | "source": [ 292 | "1. Fundamentals of Production Code\n", 293 | " - Workflow Organization\n", 294 | " - Environments\n", 295 | " - Code Style and Formatters\n", 296 | " - Design Patterns\n", 297 | " - Thinking Functionally\n", 298 | " - Module Architecture\n", 299 | " - CLI Interfaces" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": { 305 | "slideshow": { 306 | "slide_type": "fragment" 307 | } 308 | }, 309 | "source": [ 310 | "2. 
Data Management Fundamentals\n", 311 | " - Pre-SQL\n", 312 | " - SQL\n", 313 | " - NoSQL\n", 314 | " - Graph Databases" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": { 320 | "slideshow": { 321 | "slide_type": "fragment" 322 | } 323 | }, 324 | "source": [ 325 | "3. Continuous Integration Pipeline\n", 326 | " - Git\n", 327 | " - Unit Tests\n", 328 | " - Docker\n", 329 | " - APIs" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": { 335 | "slideshow": { 336 | "slide_type": "fragment" 337 | } 338 | }, 339 | "source": [ 340 | "4. Best Practices in Data Science\n", 341 | " - Machine Learning \n", 342 | " - Coding" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": { 348 | "slideshow": { 349 | "slide_type": "fragment" 350 | } 351 | }, 352 | "source": [ 353 | "5. Processing Data Efficiently\n", 354 | " - TensorFlow\n", 355 | " - Network Architectures & Applications\n", 356 | " - Slurm\n", 357 | " - Numba\n", 358 | " - Dask" 359 | ] 360 | } 361 | ], 362 | "metadata": { 363 | "jupytext": { 364 | "formats": "ipynb,py" 365 | }, 366 | "kernelspec": { 367 | "display_name": "Python 3", 368 | "language": "python", 369 | "name": "python3" 370 | }, 371 | "language_info": { 372 | "codemirror_mode": { 373 | "name": "ipython", 374 | "version": 3 375 | }, 376 | "file_extension": ".py", 377 | "mimetype": "text/x-python", 378 | "name": "python", 379 | "nbconvert_exporter": "python", 380 | "pygments_lexer": "ipython3", 381 | "version": "3.7.7" 382 | } 383 | }, 384 | "nbformat": 4, 385 | "nbformat_minor": 4 386 | } -------------------------------------------------------------------------------- /04_best_practices/slurm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# Slurm\n", 12 | "Slurm is a widely used cluster manager and job scheduling system. It's used to submit jobs in an HPC system. " 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": { 18 | "slideshow": { 19 | "slide_type": "subslide" 20 | } 21 | }, 22 | "source": [ 23 | "### First contact with Slurm\n", 24 | "Basic commands to communicate with the cluster are:\n", 25 | "- `srun` to directly run a command on a computing node. This is usually used to have interactive sessions with slurm\n", 26 | "- `sinfo` to get info on the cluster's partitions and nodes\n- `squeue` to get info on specific jobs (selected by jobid, user, etc.). Useful to monitor your jobs\n", 27 | "- `sbatch` to submit jobs to the cluster. This is useful if you want to submit scripts etc. 
" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "slideshow": { 34 | "slide_type": "slide" 35 | } 36 | }, 37 | "source": [ 38 | "## How does the cluster look like\n", 39 | "We can use `sinfo` to get a glimpse of the cluster structure:\n", 40 | "```bash\n", 41 | "PARTITION AVAIL TIMELIMIT NODES STATE NODELIST\n", 42 | "icb_cpu* up 7-00:00:00 15 mix ibis216-010-[022-023,034-035,051,064,071],ibis216-224-[010-011],icb-neu-[001-003],icb-rsrv[05-06,08]\n", 43 | "icb_cpu* up 7-00:00:00 22 alloc ibis-ceph-[002-006,008-019],ibis216-010-[011-012,020-021,033]\n", 44 | "icb_cpu* up 7-00:00:00 19 idle ibis216-010-[001-004,007,024-032,036-037,068-070]\n", 45 | "icb_gpu up 7-00:00:00 9 mix icb-gpusrv[02-08],supergpu02pxe,supergpu03pxe\n", 46 | "icb_gpu up 7-00:00:00 1 idle icb-gpusrv01\n", 47 | "icb_interactive up 12:00:00 9 down* clara,fonsi,heidi,hias,icb-lisa,icb-mona,icb-sarah,sepp,wastl\n", 48 | "icb_interactive up 12:00:00 1 mix icb-iris\n", 49 | "icb_rstrct up 5-00:00:00 1 mix icb-neu-003\n", 50 | "bcf up 12-00:00:0 1 mix ibis216-010-005\n", 51 | "bcf up 12-00:00:0 1 idle ibis216-010-006\n", 52 | "```" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "slideshow": { 59 | "slide_type": "slide" 60 | } 61 | }, 62 | "source": [ 63 | "## What are the running jobs?\n", 64 | "We can use `squeue` to get that info\n", 65 | "```bash\n", 66 | " 535882 icb_cpu nf-Veloc thomas.w R 1-00:59:00 1 ibis216-224-010\n", 67 | " 538003 icb_cpu rhapsody emilio.d R 22:16:26 1 ibis216-010-071\n", 68 | " 541083 icb_gpu EMBEDDIN leander. R 51:45 1 supergpu03pxe\n", 69 | " 541090 icb_gpu EMBEDDIN leander. R 42:29 1 supergpu03pxe\n", 70 | " 541091 icb_gpu EMBEDDIN leander. R 41:46 1 supergpu03pxe\n", 71 | "```" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": { 77 | "slideshow": { 78 | "slide_type": "slide" 79 | } 80 | }, 81 | "source": [ 82 | "## How about a specific job?\n", 83 | "We can look at specific jobs with `scontrol show jobid [JOBID]`\n", 84 | "```bash\n", 85 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sbatch submit_interactive.sh\n", 86 | "Submitted batch job 543650\n", 87 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sq\n", 88 | " 543650 icb_cpu interact giovanni R 0:00 1 ibis216-010-051\n", 89 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ scontrol show jobid 543650\n", 90 | "JobId=543650 JobName=interactive\n", 91 | " UserId=giovanni.palla(138707) GroupId=OG-ICB-User(20000) MCS_label=N/A\n", 92 | " Priority=4294048901 Nice=1000 Account=icb-user QOS=icb_stndrd\n", 93 | " JobState=RUNNING Reason=None Dependency=(null)\n", 94 | " Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0\n", 95 | " RunTime=00:00:12 TimeLimit=10:00:00 TimeMin=N/A\n", 96 | " SubmitTime=2020-09-10T12:01:00 EligibleTime=2020-09-10T12:01:00\n", 97 | " AccrueTime=2020-09-10T12:01:01\n", 98 | " StartTime=2020-09-10T12:01:01 EndTime=2020-09-10T22:01:01 Deadline=N/A\n", 99 | " SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-10T12:01:01\n", 100 | " Partition=icb_cpu AllocNode:Sid=vicb-submit-02.scidom.de:24925\n", 101 | " ReqNodeList=(null) ExcNodeList=(null)\n", 102 | " NodeList=ibis216-010-051\n", 103 | " BatchHost=ibis216-010-051\n", 104 | " NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*\n", 105 | " TRES=cpu=8,mem=8G,node=1,billing=8\n", 106 | " Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*\n", 107 | " MinCPUsNode=8 MinMemoryNode=8G MinTmpDiskNode=0\n", 108 | " 
74 | { 75 | "cell_type": "markdown", 76 | "metadata": { 77 | "slideshow": { 78 | "slide_type": "slide" 79 | } 80 | }, 81 | "source": [ 82 | "## How about a specific job?\n", 83 | "We can look at specific jobs with `scontrol show jobid [JOBID]`\n", 84 | "```bash\n", 85 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sbatch submit_interactive.sh\n", 86 | "Submitted batch job 543650\n", 87 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sq\n", 88 | " 543650 icb_cpu interact giovanni R 0:00 1 ibis216-010-051\n", 89 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ scontrol show jobid 543650\n", 90 | "JobId=543650 JobName=interactive\n", 91 | " UserId=giovanni.palla(138707) GroupId=OG-ICB-User(20000) MCS_label=N/A\n", 92 | " Priority=4294048901 Nice=1000 Account=icb-user QOS=icb_stndrd\n", 93 | " JobState=RUNNING Reason=None Dependency=(null)\n", 94 | " Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0\n", 95 | " RunTime=00:00:12 TimeLimit=10:00:00 TimeMin=N/A\n", 96 | " SubmitTime=2020-09-10T12:01:00 EligibleTime=2020-09-10T12:01:00\n", 97 | " AccrueTime=2020-09-10T12:01:01\n", 98 | " StartTime=2020-09-10T12:01:01 EndTime=2020-09-10T22:01:01 Deadline=N/A\n", 99 | " SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-10T12:01:01\n", 100 | " Partition=icb_cpu AllocNode:Sid=vicb-submit-02.scidom.de:24925\n", 101 | " ReqNodeList=(null) ExcNodeList=(null)\n", 102 | " NodeList=ibis216-010-051\n", 103 | " BatchHost=ibis216-010-051\n", 104 | " NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*\n", 105 | " TRES=cpu=8,mem=8G,node=1,billing=8\n", 106 | " Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*\n", 107 | " MinCPUsNode=8 MinMemoryNode=8G MinTmpDiskNode=0\n", 108 | " Features=xeon_6126|opteron_6234|opteron_6376|opteron_6378 DelayBoot=00:00:00\n", 109 | " OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)\n", 110 | " Command=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive/submit_interactive.sh\n", 111 | " WorkDir=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive\n", 112 | " StdErr=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive/interactive_543650.err\n", 113 | " StdIn=/dev/null\n", 114 | " StdOut=/storage/groups/ml01/workspace/giovanni.palla/cpu_interactive/interactive_543650.out\n", 115 | " Power=\n", 116 | " ```" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": { 122 | "slideshow": { 123 | "slide_type": "slide" 124 | } 125 | }, 126 | "source": [ 127 | "## Establish an interactive slurm session\n", 128 | "```bash\n", 129 | "srun -p icb_interactive -w ibis216-010-022 -c 1 -t 00:15:00 --mem=200 --pty bash\n", 130 | "```\n", 131 | "\n", 132 | "The `--pty` is used to run the given command in a pseudo-terminal. In this case, we just want to get a bash terminal. " 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": { 138 | "slideshow": { 139 | "slide_type": "fragment" 140 | } 141 | }, 142 | "source": [ 143 | "One way I often use this is\n", 144 | "```bash\n", 145 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ srun -p icb_gpu -w icb-gpusrv03 --pty nvidia-smi\n", 146 | "Thu Sep 10 13:11:07 2020\n", 147 | "+-----------------------------------------------------------------------------+\n", 148 | "| NVIDIA-SMI 440.31 Driver Version: 440.31 CUDA Version: 10.2 |\n", 149 | "|-------------------------------+----------------------+----------------------+\n", 150 | "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", 151 | "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", 152 | "|===============================+======================+======================|\n", 153 | "| 0 TITAN V Off | 00000000:65:00.0 Off | N/A |\n", 154 | "| 61% 83C P2 147W / 250W | 12005MiB / 12066MiB | 83% Default |\n", 155 | "+-------------------------------+----------------------+----------------------+\n", 156 | "| 1 TITAN V Off | 00000000:B3:00.0 Off | N/A |\n", 157 | "| 62% 83C P2 140W / 250W | 12005MiB / 12066MiB | 48% Default |\n", 158 | "+-------------------------------+----------------------+----------------------+\n", 159 | "\n", 160 | "+-----------------------------------------------------------------------------+\n", 161 | "| Processes: GPU Memory |\n", 162 | "| GPU PID Type Process name Usage |\n", 163 | "|=============================================================================|\n", 164 | "| 0 60099 C python 11993MiB |\n", 165 | "| 1 60098 C python 11993MiB |\n", 166 | "+-----------------------------------------------------------------------------+\n", 167 | "```\n", 168 | "Very useful if you want to get quick info on efficiency of gpu usage etc.\n", 169 | "\n", 170 | "In general, always be specific with the arguments, although it's true that slurm systems usually have sound default values." 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": { 176 | "slideshow": { 177 | "slide_type": "slide" 178 | } 179 | }, 180 | "source": [ 181 | "## More on sbatch\n", 182 | "`sbatch` is useful to submit scripts as jobs. Usually, you have interactive sessions with `srun` for prototyping, but then you want to use `sbatch` for the major computation. 
Arguments are exactly the same as `srun`, but specified differently.\n", 183 | "\n", 184 | "```bash\n", 185 | "#!/bin/bash\n", 186 | "\n", 187 | "#SBATCH -o slurm_output.txt\n", 188 | "#SBATCH -e slurm_error.txt\n", 189 | "#SBATCH -J MyFancyJobName\n", 190 | "#SBATCH -p icb_cpu\n", 191 | "#SBATCH --nodelist=ibis-ceph-002\n", 192 | "#SBATCH -c 1\n", 193 | "#SBATCH --mem=2G\n", 194 | "#SBATCH -t 00:15:00\n", 195 | "#SBATCH --nice=10000 \n", 196 | "\n", 197 | "echo \"Starting stuff at \`date\`\"\n", 198 | "# You can put arbitrary unix commands here, call other scripts, etc...\n", 199 | "sleep 10\n", 200 | "echo \"Computering...\"\n", 201 | "sleep 900\n", 202 | "echo \"Ending stuff at \`date\`\"\n", 203 | "```" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "slideshow": { 210 | "slide_type": "slide" 211 | } 212 | }, 213 | "source": [ 214 | "## Interactive session with sbatch\n", 215 | "In a typical data science workflow, you might want to start your coding with a jupyter instance. This is a way to do it. " 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": { 221 | "slideshow": { 222 | "slide_type": "fragment" 223 | } 224 | }, 225 | "source": [ 226 | "Create a script `submit_interactive.sh` that looks like this:\n", 227 | "```bash\n", 228 | "#!/bin/bash\n", 229 | "\n", 230 | "#SBATCH -o \"interactive_%j.out\"\n", 231 | "#SBATCH -e \"interactive_%j.err\"\n", 232 | "#SBATCH -J interactive\n", 233 | "#SBATCH -c 8 # default value is 2\n", 234 | "#SBATCH --constraint=\"xeon_6126|opteron_6234|opteron_6376|opteron_6378\"\n", 235 | "#SBATCH --mem=8GB\n", 236 | "#SBATCH -t 10:00:00\n", 237 | "#SBATCH --nice=10000\n", 238 | "\n", 239 | "./run_jupyter.bash -e myenv\n", 240 | "``` " 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": { 246 | "slideshow": { 247 | "slide_type": "fragment" 248 | } 249 | }, 250 | "source": [ 251 | "and another script `run_jupyter.bash`, that looks like this:\n", 252 | "```bash\n", 253 | "#!/bin/bash\n", 254 | "\n", 255 | "source ~/.bashrc\n", 256 | "\n", 257 | "while getopts \":e:\" opt; do\n", 258 | " case $opt in\n", 259 | " e) env=\"$OPTARG\"\n", 260 | " ;;\n", 261 | " \\?) 
echo \"Invalid option -$OPTARG\" >&2\n", 262 | " ;;\n", 263 | " esac\n", 264 | "done\n", 265 | "\n", 266 | "conda activate $env\n", 267 | "cd /storage/groups/ml01/workspace/giovanni.palla\n", 268 | "jupyter lab --no-browser --ip=0.0.0.0\n", 269 | "```" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": { 275 | "slideshow": { 276 | "slide_type": "slide" 277 | } 278 | }, 279 | "source": [ 280 | "## Interactive session with sbatch\n", 281 | "After ~30 seconds, you will read the link for the jupyter session in the `.err` file\n", 282 | "```bash\n", 283 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ cat interactive_543650.err\n", 284 | "[I 12:01:26.392 LabApp] JupyterLab extension loaded from /home/icb/giovanni.palla/miniconda3/envs/sfaira/lib/python3.8/site-packages/jupyterlab\n", 285 | "[I 12:01:26.392 LabApp] JupyterLab application directory is /home/icb/giovanni.palla/miniconda3/envs/sfaira/share/jupyter/lab\n", 286 | "[I 12:01:26.401 LabApp] Serving notebooks from local directory: /storage/groups/ml01/workspace/giovanni.palla\n", 287 | "[I 12:01:26.401 LabApp] The Jupyter Notebook is running at:\n", 288 | "[I 12:01:26.401 LabApp] http://ibis216-010-051.scidom.de:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690\n", 289 | "[I 12:01:26.401 LabApp] or http://127.0.0.1:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690\n", 290 | "[I 12:01:26.401 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\n", 291 | "[C 12:01:26.424 LabApp]\n", 292 | "\n", 293 | " To access the notebook, open this file in a browser:\n", 294 | " file:///mnt/home/icb/giovanni.palla/.local/share/jupyter/runtime/nbserver-1565-open.html\n", 295 | " Or copy and paste one of these URLs:\n", 296 | " http://ibis216-010-051.scidom.de:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690\n", 297 | " or http://127.0.0.1:8888/?token=ba33b814bc360beb21c803517adc53ade10da631ede21690\n", 298 | "```" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": { 304 | "slideshow": { 305 | "slide_type": "slide" 306 | } 307 | }, 308 | "source": [ 309 | "## Another command: sacct\n", 310 | "Useful to check all your recent jobs (finished, cancelled, etc)\n", 311 | "\n", 312 | "```bash\n", 313 | "(base) [giovanni.palla@vicb-submit-02 cpu_interactive]$ sacct\n", 314 | " JobID JobName Partition Account AllocCPUS State ExitCode\n", 315 | "------------ ---------- ---------- ---------- ---------- ---------- --------\n", 316 | "543650 interacti+ icb_cpu icb-user 8 RUNNING 0:0\n", 317 | "543650.batch batch icb-user 8 RUNNING 0:0\n", 318 | "543650.exte+ extern icb-user 8 RUNNING 0:0\n", 319 | "543818 nvidia-smi icb_gpu icb-user 2 COMPLETED 0:0\n", 320 | "543818.exte+ extern icb-user 2 COMPLETED 0:0\n", 321 | "543818.0 nvidia-smi icb-user 2 COMPLETED 0:0\n", 322 | "```" 323 | ] 324 | } 325 | ], 326 | "metadata": { 327 | "kernelspec": { 328 | "display_name": "Python 3", 329 | "language": "python", 330 | "name": "python3" 331 | }, 332 | "language_info": { 333 | "codemirror_mode": { 334 | "name": "ipython", 335 | "version": 3 336 | }, 337 | "file_extension": ".py", 338 | "mimetype": "text/x-python", 339 | "name": "python", 340 | "nbconvert_exporter": "python", 341 | "pygments_lexer": "ipython3", 342 | "version": "3.8.5" 343 | } 344 | }, 345 | "nbformat": 4, 346 | "nbformat_minor": 4 347 | } 348 | -------------------------------------------------------------------------------- /04_best_practices/04_best_practices.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# Best Practices in Machine Learning and Code Organization" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "slideshow": { 18 | "slide_type": "slide" 19 | } 20 | }, 21 | "source": [ 22 | "## Motivation" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": { 28 | "slideshow": { 29 | "slide_type": "fragment" 30 | } 31 | }, 32 | "source": [ 33 | "- What does best-practice even mean?\n", 34 | "- How do I know something is a bad practice?" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": { 40 | "jupyter": { 41 | "outputs_hidden": true 42 | }, 43 | "lines_to_next_cell": 2, 44 | "slideshow": { 45 | "slide_type": "fragment" 46 | } 47 | }, 48 | "source": [ 49 | "> It's not wrong, but it feels wrong." 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": { 55 | "slideshow": { 56 | "slide_type": "slide" 57 | } 58 | }, 59 | "source": [ 60 | "## Overview" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": { 66 | "slideshow": { 67 | "slide_type": "fragment" 68 | } 69 | }, 70 | "source": [ 71 | "Best Practices in:\n", 72 | "- Machine Learning Code Bases and Versioning\n", 73 | "- Code and Module organization and philosophies" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": { 79 | "slideshow": { 80 | "slide_type": "slide" 81 | } 82 | }, 83 | "source": [ 84 | "## Bad vs. Best Practices in Python" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": { 90 | "slideshow": { 91 | "slide_type": "subslide" 92 | } 93 | }, 94 | "source": [ 95 | "### Repetition" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "slideshow": { 102 | "slide_type": "subslide" 103 | } 104 | }, 105 | "source": [ 106 | "#### Python is not C - so do ***not*** copy-and-paste!" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": { 112 | "slideshow": { 113 | "slide_type": "fragment" 114 | } 115 | }, 116 | "source": [ 117 | "" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": { 123 | "slideshow": { 124 | "slide_type": "subslide" 125 | } 126 | }, 127 | "source": [ 128 | "#### Instead of copy & pasting:" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": { 134 | "slideshow": { 135 | "slide_type": "fragment" 136 | } 137 | }, 138 | "source": [ 139 | "- write functions!\n", 140 | "- compose functions!\n", 141 | "- create partial functions!"
142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 1, 147 | "metadata": { 148 | "slideshow": { 149 | "slide_type": "fragment" 150 | } 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "def add(a, b):\n", 155 | " return a + b" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 3, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/plain": [ 166 | "5" 167 | ] 168 | }, 169 | "execution_count": 3, 170 | "metadata": {}, 171 | "output_type": "execute_result" 172 | } 173 | ], 174 | "source": [ 175 | "from functools import partial\n", 176 | "\n", 177 | "add2 = partial(add, 2) # Create a copy of add() with a=2\n", 178 | "\n", 179 | "add2(3)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 5, 185 | "metadata": { 186 | "slideshow": { 187 | "slide_type": "fragment" 188 | } 189 | }, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "5" 195 | ] 196 | }, 197 | "execution_count": 5, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "add2 = lambda x: add(2, x)\n", 204 | "\n", 205 | "add2(3)" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 7, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "5" 217 | ] 218 | }, 219 | "execution_count": 7, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "def add2(x):\n", 226 | " return add(2, x)\n", 227 | "\n", 228 | "add2(3)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": { 234 | "slideshow": { 235 | "slide_type": "subslide" 236 | } 237 | }, 238 | "source": [ 239 | "### Switch Behavior" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": { 245 | "slideshow": { 246 | "slide_type": "subslide" 247 | } 248 | }, 249 | "source": [ 250 | "#### Python has no switch statements, but don't go around stacking if's:" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": { 256 | "slideshow": { 257 | "slide_type": "fragment" 258 | } 259 | }, 260 | "source": [ 261 | "" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": { 267 | "slideshow": { 268 | "slide_type": "subslide" 269 | } 270 | }, 271 | "source": [ 272 | "#### Instead of stacking if-else:" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": { 278 | "slideshow": { 279 | "slide_type": "fragment" 280 | } 281 | }, 282 | "source": [ 283 | "- map things with a dictionary!" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": { 289 | "slideshow": { 290 | "slide_type": "fragment" 291 | } 292 | }, 293 | "source": [ 294 | "Dictionaries are hashmaps, meaning they map a hash to an object." 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": { 300 | "slideshow": { 301 | "slide_type": "fragment" 302 | } 303 | }, 304 | "source": [ 305 | "Since functions are first-class objects in Python, they can be pointed to!"
306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 9, 311 | "metadata": { 312 | "slideshow": { 313 | "slide_type": "fragment" 314 | } 315 | }, 316 | "outputs": [ 317 | { 318 | "data": { 319 | "text/plain": [ 320 | "4" 321 | ] 322 | }, 323 | "execution_count": 9, 324 | "metadata": {}, 325 | "output_type": "execute_result" 326 | } 327 | ], 328 | "source": [ 329 | "def add(a, b):\n", 330 | " return a + b\n", 331 | "\n", 332 | "def add_sum(a, b):\n", 333 | " return sum([a, b])\n", 334 | "\n", 335 | "math_functions = {'add': add_sum}\n", 336 | "\n", 337 | "math_functions['add'](2, 2)" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": { 343 | "slideshow": { 344 | "slide_type": "subslide" 345 | } 346 | }, 347 | "source": [ 348 | "### Depth" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": { 354 | "slideshow": { 355 | "slide_type": "subslide" 356 | } 357 | }, 358 | "source": [ 359 | "#### Making too many layers - inheritance, nesting, etc." 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": { 365 | "slideshow": { 366 | "slide_type": "fragment" 367 | } 368 | }, 369 | "source": [ 370 | "" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": { 376 | "slideshow": { 377 | "slide_type": "subslide" 378 | } 379 | }, 380 | "source": [ 381 | "#### Instead keep things shallow" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": { 387 | "slideshow": { 388 | "slide_type": "fragment" 389 | } 390 | }, 391 | "source": [ 392 | "Ask yourself:\n", 393 | "- Do I need this class?\n", 394 | " - Will it be instantiated often?\n", 395 | " - Are there many objects inheriting from it?\n", 396 | " - Does it carry state? Otherwise it's a namespace!\n", 397 | "- Does this need to be a submodule or a file?\n", 398 | " - Are there many long functions?\n", 399 | " - Are there a large number of private functions?" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": { 405 | "slideshow": { 406 | "slide_type": "fragment" 407 | } 408 | }, 409 | "source": [ 410 | "Singleton Pattern (a single global instance of an object)\n", 411 | "- If it does not carry state, it is a namespace\n", 412 | " - In Python, any file is a namespace! No need for the Object or Instance!\n", 413 | "- If it just carries state, you want a database\n", 414 | " - Atomicity of operations can be guaranteed with a database\n", 415 | " - The database lives outside of the Global Interpreter Lock (GIL)\n", 416 | " - Databases scale better!" 417 | ] 418 | },
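{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A minimal sketch of the point above (illustrative only, not part of the original course code): the \"singleton\" below carries no state, so a plain module-level function does the same job with less machinery." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Singleton-style: a class that only groups stateless behavior\n", "class MathUtils:\n", "    @staticmethod\n", "    def add(a, b):\n", "        return a + b\n", "\n", "# Namespace-style: the module itself is the namespace, so a plain function is enough\n", "def add(a, b):\n", "    return a + b\n", "\n", "assert MathUtils.add(2, 3) == add(2, 3)" ] },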
419 | { 420 | "cell_type": "markdown", 421 | "metadata": { 422 | "slideshow": { 423 | "slide_type": "subslide" 424 | } 425 | }, 426 | "source": [ 427 | "### Readability" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": { 433 | "slideshow": { 434 | "slide_type": "subslide" 435 | } 436 | }, 437 | "source": [ 438 | "#### Write code - but write it to be read!" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": { 444 | "slideshow": { 445 | "slide_type": "fragment" 446 | } 447 | }, 448 | "source": [ 449 | "" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": { 455 | "slideshow": { 456 | "slide_type": "subslide" 457 | } 458 | }, 459 | "source": [ 460 | "#### Code is written to be read" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": { 466 | "slideshow": { 467 | "slide_type": "fragment" 468 | } 469 | }, 470 | "source": [ 471 | "- Documentation\n", 472 | "- Type Hinting\n", 473 | "- Naming" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": { 479 | "slideshow": { 480 | "slide_type": "subslide" 481 | } 482 | }, 483 | "source": [ 484 | "### Dependencies" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": { 490 | "slideshow": { 491 | "slide_type": "subslide" 492 | } 493 | }, 494 | "source": [ 495 | "#### Sometimes they're too tempting" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": { 501 | "slideshow": { 502 | "slide_type": "fragment" 503 | } 504 | }, 505 | "source": [ 506 | "" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": { 512 | "slideshow": { 513 | "slide_type": "subslide" 514 | } 515 | }, 516 | "source": [ 517 | "#### Why?" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": { 523 | "slideshow": { 524 | "slide_type": "fragment" 525 | } 526 | }, 527 | "source": [ 528 | "- Projects get abandoned\n", 529 | " - Lack of security patches\n", 530 | " - Forced to stay with old versions\n", 531 | " - => Your project becomes ancient" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "Update regularly!\n", 539 | "- Fixing small bugs on a regular basis prevents abandonment\n", 540 | "- Improved performance\n", 541 | "- Additional functionality!" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": { 547 | "slideshow": { 548 | "slide_type": "subslide" 549 | } 550 | }, 551 | "source": [ 552 | "### Keep things short" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": { 558 | "slideshow": { 559 | "slide_type": "subslide" 560 | } 561 | }, 562 | "source": [ 563 | "#### The first law of Software Quality" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": { 569 | "slideshow": { 570 | "slide_type": "fragment" 571 | } 572 | }, 573 | "source": [ 574 | "" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": { 580 | "slideshow": { 581 | "slide_type": "subslide" 582 | } 583 | }, 584 | "source": [ 585 | "#### Sometimes less functionality is more maintainability" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": { 591 | "slideshow": { 592 | "slide_type": "fragment" 593 | } 594 | }, 595 | "source": [ 596 | "> Each line of code is a credit you take on and interest is paid in time to maintain the base. Don't default on your code debt." 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": { 602 | "slideshow": { 603 | "slide_type": "fragment" 604 | } 605 | }, 606 | "source": [ 607 | "Finding non-critical code:\n", 608 | "- Is this functionality used by many?\n", 609 | "- Is this code still used or abandoned?\n", 610 | "- Is it relevant to the larger goal?"
611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": { 616 | "slideshow": { 617 | "slide_type": "fragment" 618 | } 619 | }, 620 | "source": [ 621 | "Solving too much code:\n", 622 | "- Spin out functionality into a different module\n", 623 | "- Simplify the code\n", 624 | "- Delete code\n", 625 | "- No really, you should delete code" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": { 631 | "slideshow": { 632 | "slide_type": "subslide" 633 | } 634 | }, 635 | "source": [ 636 | "### Use version control" 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": { 642 | "slideshow": { 643 | "slide_type": "fragment" 644 | } 645 | }, 646 | "source": [ 647 | "" 648 | ] 649 | } 650 | ], 651 | "metadata": { 652 | "jupytext": { 653 | "formats": "ipynb,py" 654 | }, 655 | "kernelspec": { 656 | "display_name": "Python 3", 657 | "language": "python", 658 | "name": "python3" 659 | }, 660 | "language_info": { 661 | "codemirror_mode": { 662 | "name": "ipython", 663 | "version": 3 664 | }, 665 | "file_extension": ".py", 666 | "mimetype": "text/x-python", 667 | "name": "python", 668 | "nbconvert_exporter": "python", 669 | "pygments_lexer": "ipython3", 670 | "version": "3.7.9" 671 | } 672 | }, 673 | "nbformat": 4, 674 | "nbformat_minor": 4 675 | } 676 | -------------------------------------------------------------------------------- /Info Flyer/ugly_code_numpy_linalg.py: -------------------------------------------------------------------------------- 1 | __all__ = ['matrix_power', 'solve', 'tensorsolve', 'tensorinv', 'inv', 2 | 'cholesky', 'eigvals', 'eigvalsh', 'pinv', 'slogdet', 'det', 3 | 'svd', 'eig', 'eigh', 'lstsq', 'norm', 'qr', 'cond', 'matrix_rank', 4 | 'LinAlgError', 'multi_dot']; import functools 5 | import operator 6 | import warnings; from numpy.core import ( 7 | array, asarray, zeros, empty, empty_like, intc, single, double, 8 | csingle, cdouble, inexact, complexfloating, newaxis, all, Inf, dot, 9 | add, multiply, sqrt, fastCopyAndTranspose, sum, isfinite, 10 | finfo, errstate, geterrobj, moveaxis, amin, amax, product, abs, 11 | atleast_2d, intp, asanyarray, object_, matmul, 12 | swapaxes, divide, count_nonzero, isnan, sign 13 | ); from numpy.core.multiarray import normalize_axis_index; from numpy.core.overrides import set_module; from numpy.core import overrides; from numpy.lib.twodim_base import triu, eye; from numpy.linalg import lapack_lite, _umath_linalg; array_function_dispatch = functools.partial( 14 | overrides.array_function_dispatch, module='numpy.linalg'); _N = b'N'; _V = b'V'; _A = b'A'; _S = b'S'; _L = b'L' fortran_int = intc @set_module('numpy.linalg') class LinAlgError(Exception): def _determine_error_states(): errobj = geterrobj() bufsize = errobj[0] with errstate(invalid='call', over='ignore', divide='ignore', under='ignore'): invalid_call_errmask = geterrobj()[1] return [bufsize, invalid_call_errmask, None]; _linalg_error_extobj = _determine_error_states(); del _determine_error_states; def _raise_linalgerror_singular(err, flag): raise LinAlgError("Singular matrix"); def _raise_linalgerror_nonposdef(err, flag): raise LinAlgError("Matrix is not positive definite"); def _raise_linalgerror_eigenvalues_nonconvergence(err, flag): raise LinAlgError("Eigenvalues did not converge"); def _raise_linalgerror_svd_nonconvergence(err, flag): raise LinAlgError("SVD did not converge"); def _raise_linalgerror_lstsq(err, flag): raise LinAlgError("SVD did not converge in Linear Least Squares"); def get_linalg_error_extobj(callback): 
extobj = list(_linalg_error_extobj); extobj[2] = callback; return extobj; def _makearray(a): new = asarray(a); wrap = getattr(a, "__array_prepare__", new.__array_wrap__); return new, wrap; def isComplexType(t): return issubclass(t, complexfloating); _real_types_map = {single: single,; double: double,; csingle: single,; cdouble: double}; _complex_types_map = {single: csingle,; double: cdouble,; csingle: csingle,; cdouble: cdouble}; def _realType(t, default=double): return _real_types_map.get(t, default); def _complexType(t, default=cdouble): return _complex_types_map.get(t, default); def _linalgRealType(t): """Cast the type t to either double or cdouble."""; return double; def _commonType(*arrays): result_type = single; is_complex = False; for a in arrays: if issubclass(a.dtype.type, inexact): if isComplexType(a.dtype.type): is_complex = True; rt = _realType(a.dtype.type, default=None); if rt is None: raise TypeError("array type %s is unsupported in linalg" %; (a.dtype.name,)); else: rt = double; if rt is double: result_type = double; if is_complex: t = cdouble; result_type = _complex_types_map[result_type]; else: t = double; return t, result_type; _fastCT = fastCopyAndTranspose; def _to_native_byte_order(*arrays): ret = []; for arr in arrays: if arr.dtype.byteorder not in ('=', '|'): ret.append(asarray(arr, dtype=arr.dtype.newbyteorder('='))); else: ret.append(arr); if len(ret) == 1: return ret[0]; else: return ret; def _fastCopyAndTranspose(type, *arrays): cast_arrays = (); for a in arrays: if a.dtype.type is type: cast_arrays = cast_arrays + (_fastCT(a),); else: cast_arrays = cast_arrays + (_fastCT(a.astype(type)),); if len(cast_arrays) == 1: return cast_arrays[0]; else: return cast_arrays; def _assert_2d(*arrays): for a in arrays: if a.ndim != 2: raise LinAlgError('%d-dimensional array given. Array must be '; 'two-dimensional' % a.ndim); def _assert_stacked_2d(*arrays): for a in arrays: if a.ndim < 2: raise LinAlgError('%d-dimensional array given. 
Array must be '; 'at least two-dimensional' % a.ndim); def _assert_stacked_square(*arrays): for a in arrays: m, n = a.shape[-2:]; if m != n: raise LinAlgError('Last 2 dimensions of the array must be square'); def _assert_finite(*arrays): for a in arrays: if not isfinite(a).all(): raise LinAlgError("Array must not contain infs or NaNs"); def _is_empty_2d(arr): return arr.size == 0 and product(arr.shape[-2:]) == 0; def transpose(a): return swapaxes(a, -1, -2); def tensorsolve(a, b, axes=None): a, wrap = _makearray(a); b = asarray(b); an = a.ndim; if axes is not None: allaxes = list(range(0, an)); for k in axes: allaxes.remove(k); allaxes.insert(an, k); a = a.transpose(allaxes); oldshape = a.shape[-(an-b.ndim):]; prod = 1; for k in oldshape: prod *= k; a = a.reshape(-1, prod); b = b.ravel(); res = wrap(solve(a, b)); res.shape = oldshape; return res; def _solve_dispatcher(a, b): return (a, b); @array_function_dispatch(_solve_dispatcher); def solve(a, b): a, _ = _makearray(a); _assert_stacked_2d(a); _assert_stacked_square(a); b, wrap = _makearray(b); t, result_t = _commonType(a, b); if b.ndim == a.ndim - 1: gufunc = _umath_linalg.solve1; else: gufunc = _umath_linalg.solve; signature = 'DD->D' if isComplexType(t) else 'dd->d'; extobj = get_linalg_error_extobj(_raise_linalgerror_singular); r = gufunc(a, b, signature=signature, extobj=extobj); return wrap(r.astype(result_t, copy=False)); def _tensorinv_dispatcher(a, ind=None): return (a,); @array_function_dispatch(_tensorinv_dispatcher); def tensorinv(a, ind=2): a = asarray(a); oldshape = a.shape; prod = 1; if ind > 0: invshape = oldshape[ind:] + oldshape[:ind]; for k in oldshape[ind:]: prod *= k; else: raise ValueError("Invalid ind argument."); a = a.reshape(prod, -1); ia = inv(a); return ia.reshape(*invshape); def _unary_dispatcher(a): return (a,); @array_function_dispatch(_unary_dispatcher); def inv(a): a, wrap = _makearray(a); _assert_stacked_2d(a); _assert_stacked_square(a); t, result_t = _commonType(a); signature = 'D->D' if isComplexType(t) else 'd->d'; extobj = get_linalg_error_extobj(_raise_linalgerror_singular); ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj); return wrap(ainv.astype(result_t, copy=False)); def _matrix_power_dispatcher(a, n): return (a,); @array_function_dispatch(_matrix_power_dispatcher); def matrix_power(a, n): a = asanyarray(a); _assert_stacked_2d(a); _assert_stacked_square(a); try: n = operator.index(n); except TypeError: raise TypeError("exponent must be an integer"); if a.dtype != object: fmatmul = matmul; elif a.ndim == 2: fmatmul = dot; else: raise NotImplementedError(; "matrix_power not supported for stacks of object arrays"); if n == 0: a = empty_like(a); a[...] 
= eye(a.shape[-2], dtype=a.dtype); return a; elif n < 0: a = inv(a); n = abs(n); if n == 1: return a; elif n == 2: return fmatmul(a, a); elif n == 3: return fmatmul(fmatmul(a, a), a); z = result = None; while n > 0: z = a if z is None else fmatmul(z, z); n, bit = divmod(n, 2); if bit: result = z if result is None else fmatmul(result, z); return result; @array_function_dispatch(_unary_dispatcher); def cholesky(a): extobj = get_linalg_error_extobj(_raise_linalgerror_nonposdef); gufunc = _umath_linalg.cholesky_lo; a, wrap = _makearray(a); _assert_stacked_2d(a); _assert_stacked_square(a); t, result_t = _commonType(a); signature = 'D->D' if isComplexType(t) else 'd->d'; r = gufunc(a, signature=signature, extobj=extobj); return wrap(r.astype(result_t, copy=False)); def _qr_dispatcher(a, mode=None): return (a,); @array_function_dispatch(_qr_dispatcher); def qr(a, mode='reduced'): if mode not in ('reduced', 'complete', 'r', 'raw'): if mode in ('f', 'full'): msg = "".join((; "The 'full' option is deprecated in favor of 'reduced'.\n",; "For backward compatibility let mode default.")); warnings.warn(msg, DeprecationWarning, stacklevel=3); mode = 'reduced'; elif mode in ('e', 'economic'): msg = "The 'economic' option is deprecated."; warnings.warn(msg, DeprecationWarning, stacklevel=3); mode = 'economic'; else: raise ValueError("Unrecognized mode '%s'" % mode); a, wrap = _makearray(a); _assert_2d(a); m, n = a.shape; t, result_t = _commonType(a); a = _fastCopyAndTranspose(t, a); a = _to_native_byte_order(a); mn = min(m, n); tau = zeros((mn,), t); if isComplexType(t): lapack_routine = lapack_lite.zgeqrf; routine_name = 'zgeqrf'; else: lapack_routine = lapack_lite.dgeqrf; routine_name = 'dgeqrf'; lwork = 1; work = zeros((lwork,), t); results = lapack_routine(m, n, a, max(1, m), tau, work, -1, 0); if results['info'] != 0: raise LinAlgError('%s returns %d' % (routine_name, results['info'])); lwork = max(1, n, int(abs(work[0]))); work = zeros((lwork,), t); results = lapack_routine(m, n, a, max(1, m), tau, work, lwork, 0); if results['info'] != 0: raise LinAlgError('%s returns %d' % (routine_name, results['info'])); if mode == 'r': r = _fastCopyAndTranspose(result_t, a[:, :mn]); return wrap(triu(r)); if mode == 'raw': return a, tau; if mode == 'economic': if t != result_t : a = a.astype(result_t, copy=False); return wrap(a.T); if mode == 'complete' and m > n: mc = m; q = empty((m, m), t); else: mc = mn; q = empty((n, m), t); q[:n] = a; if isComplexType(t): lapack_routine = lapack_lite.zungqr; routine_name = 'zungqr'; else: lapack_routine = lapack_lite.dorgqr; routine_name = 'dorgqr'; lwork = 1; work = zeros((lwork,), t); results = lapack_routine(m, mc, mn, q, max(1, m), tau, work, -1, 0); if results['info'] != 0: raise LinAlgError('%s returns %d' % (routine_name, results['info'])); lwork = max(1, n, int(abs(work[0]))); work = zeros((lwork,), t); results = lapack_routine(m, mc, mn, q, max(1, m), tau, work, lwork, 0); if results['info'] != 0: raise LinAlgError('%s returns %d' % (routine_name, results['info'])); q = _fastCopyAndTranspose(result_t, q[:mc]); r = _fastCopyAndTranspose(result_t, a[:, :mc]); return wrap(q), wrap(triu(r)); @array_function_dispatch(_unary_dispatcher); def eigvals(a): a, wrap = _makearray(a); _assert_stacked_2d(a); _assert_stacked_square(a); _assert_finite(a); t, result_t = _commonType(a); extobj = get_linalg_error_extobj(; _raise_linalgerror_eigenvalues_nonconvergence); signature = 'D->D' if isComplexType(t) else 'd->D'; w = _umath_linalg.eigvals(a, signature=signature, 
extobj=extobj); if not isComplexType(t): if all(w.imag == 0): w = w.real; result_t = _realType(result_t); else: result_t = _complexType(result_t); return w.astype(result_t, copy=False); def _eigvalsh_dispatcher(a, UPLO=None): return (a,); @array_function_dispatch(_eigvalsh_dispatcher); def eigvalsh(a, UPLO='L'): UPLO = UPLO.upper(); if UPLO not in ('L', 'U'): raise ValueError("UPLO argument must be 'L' or 'U'"); extobj = get_linalg_error_extobj(; _raise_linalgerror_eigenvalues_nonconvergence); if UPLO == 'L': gufunc = _umath_linalg.eigvalsh_lo; else: gufunc = _umath_linalg.eigvalsh_up; a, wrap = _makearray(a); _assert_stacked_2d(a); _assert_stacked_square(a); t, result_t = _commonType(a); signature = 'D->d' if isComplexType(t) else 'd->d'; w = gufunc(a, signature=signature, extobj=extobj); return w.astype(_realType(result_t), copy=False); def _convertarray(a): t, result_t = _commonType(a); a = _fastCT(a.astype(t)); return a, t, result_t; def eig(a): a, wrap = _makearray(a); _assert_stacked_2d(a); _assert_stacked_square(a); _assert_finite(a); t, result_t = _commonType(a); extobj = get_linalg_error_extobj(; _raise_linalgerror_eigenvalues_nonconvergence); signature = 'D->DD' if isComplexType(t) else 'd->DD'; w, vt = _umath_linalg.eig(a, signature=signature, extobj=extobj); if not isComplexType(t) and all(w.imag == 0.0): w = w.real; vt = vt.real; result_t = _realType(result_t); else: result_t = _complexType(result_t); vt = vt.astype(result_t, copy=False); return w.astype(result_t, copy=False), wrap(vt); @array_function_dispatch(_eigvalsh_dispatcher); def eigh(a, UPLO='L'): UPLO = UPLO.upper(); if UPLO not in ('L', 'U'): raise ValueError("UPLO argument must be 'L' or 'U'"); a, wrap = _makearray(a); _assert_stacked_2d(a); _assert_stacked_square(a); t, result_t = _commonType(a); extobj = get_linalg_error_extobj(; _raise_linalgerror_eigenvalues_nonconvergence); if UPLO == 'L': gufunc = _umath_linalg.eigh_lo; else: gufunc = _umath_linalg.eigh_up; signature = 'D->dD' if isComplexType(t) else 'd->dd'; w, vt = gufunc(a, signature=signature, extobj=extobj); w = w.astype(_realType(result_t), copy=False); vt = vt.astype(result_t, copy=False); return w, wrap(vt); def _svd_dispatcher(a, full_matrices=None, compute_uv=None, hermitian=None): return (a,); @array_function_dispatch(_svd_dispatcher); def svd(a, full_matrices=True, compute_uv=True, hermitian=False): a, wrap = _makearray(a); if hermitian: if compute_uv: s, u = eigh(a); s = s[..., ::-1]; u = u[..., ::-1]; vt = transpose(u * sign(s)[..., None, :]).conjugate(); s = abs(s); return wrap(u), s, wrap(vt); else: s = eigvalsh(a); s = s[..., ::-1]; s = abs(s); return s; _assert_stacked_2d(a); t, result_t = _commonType(a); extobj = get_linalg_error_extobj(_raise_linalgerror_svd_nonconvergence); m, n = a.shape[-2:]; if compute_uv: if full_matrices: if m < n: gufunc = _umath_linalg.svd_m_f; else: gufunc = _umath_linalg.svd_n_f; else: if m < n: gufunc = _umath_linalg.svd_m_s; else: gufunc = _umath_linalg.svd_n_s; signature = 'D->DdD' if isComplexType(t) else 'd->ddd'; u, s, vh = gufunc(a, signature=signature, extobj=extobj); u = u.astype(result_t, copy=False); s = s.astype(_realType(result_t), copy=False); vh = vh.astype(result_t, copy=False); return wrap(u), s, wrap(vh); else: if m < n: gufunc = _umath_linalg.svd_m; else: gufunc = _umath_linalg.svd_n; signature = 'D->d' if isComplexType(t) else 'd->d'; s = gufunc(a, signature=signature, extobj=extobj); s = s.astype(_realType(result_t), copy=False); return s; def _cond_dispatcher(x, p=None): return (x,); 
@array_function_dispatch(_cond_dispatcher); def cond(x, p=None): x = asarray(x); if _is_empty_2d(x): raise LinAlgError("cond is not defined on empty arrays"); if p is None or p == 2 or p == -2: s = svd(x, compute_uv=False); with errstate(all='ignore'): if p == -2: r = s[..., -1] / s[..., 0]; else: r = s[..., 0] / s[..., -1]; else: _assert_stacked_2d(x); _assert_stacked_square(x); t, result_t = _commonType(x); signature = 'D->D' if isComplexType(t) else 'd->d'; with errstate(all='ignore'): invx = _umath_linalg.inv(x, signature=signature); r = norm(x, p, axis=(-2, -1)) * norm(invx, p, axis=(-2, -1)); r = r.astype(result_t, copy=False); r = asarray(r); nan_mask = isnan(r); if nan_mask.any(): nan_mask &= ~isnan(x).any(axis=(-2, -1)); if r.ndim > 0: r[nan_mask] = Inf; elif nan_mask: r[()] = Inf; if r.ndim == 0: r = r[()]; return r; def _matrix_rank_dispatcher(M, tol=None, hermitian=None): return (M,); @array_function_dispatch(_matrix_rank_dispatcher); def matrix_rank(M, tol=None, hermitian=False): M = asarray(M); if M.ndim < 2: return int(not all(M==0)); S = svd(M, compute_uv=False, hermitian=hermitian); if tol is None: tol = S.max(axis=-1, keepdims=True) * max(M.shape[-2:]) * finfo(S.dtype).eps; else: tol = asarray(tol)[..., newaxis]; return count_nonzero(S > tol, axis=-1); def pinv(a, rcond=1e-15, hermitian=False): a, wrap = _makearray(a); rcond = asarray(rcond); if _is_empty_2d(a): m, n = a.shape[-2:]; res = empty(a.shape[:-2] + (n, m), dtype=a.dtype); return wrap(res); a = a.conjugate(); u, s, vt = svd(a, full_matrices=False, hermitian=hermitian); cutoff = rcond[..., newaxis] * amax(s, axis=-1, keepdims=True); large = s > cutoff; s = divide(1, s, where=large, out=s); s[~large] = 0; res = matmul(transpose(vt), multiply(s[..., newaxis], transpose(u))); return wrap(res); def slogdet(a): a = asarray(a); _assert_stacked_2d(a); _assert_stacked_square(a); t, result_t = _commonType(a); real_t = _realType(result_t); signature = 'D->Dd' if isComplexType(t) else 'd->dd'; sign, logdet = _umath_linalg.slogdet(a, signature=signature); sign = sign.astype(result_t, copy=False); logdet = logdet.astype(real_t, copy=False); return sign, logdet; @array_function_dispatch(_unary_dispatcher); def det(a): a = asarray(a); _assert_stacked_2d(a); _assert_stacked_square(a); t, result_t = _commonType(a); signature = 'D->D' if isComplexType(t) else 'd->d'; r = _umath_linalg.det(a, signature=signature); r = r.astype(result_t, copy=False); return r; def lstsq(a, b, rcond="warn"): a, _ = _makearray(a); b, wrap = _makearray(b); is_1d = b.ndim == 1; if is_1d: b = b[:, newaxis]; _assert_2d(a, b); m, n = a.shape[-2:]; m2, n_rhs = b.shape[-2:]; if m != m2: raise LinAlgError('Incompatible dimensions'); t, result_t = _commonType(a, b); real_t = _linalgRealType(t); result_real_t = _realType(result_t); if rcond == "warn": warnings.warn("`rcond` parameter will change to the default of "; "machine precision times ``max(M, N)`` where M and N "; "are the input matrix dimensions.\n"; "To use the future default and silence this warning "; "we advise to pass `rcond=None`, to keep using the old, "; "explicitly pass `rcond=-1`.",; FutureWarning, stacklevel=3); rcond = -1; if rcond is None: rcond = finfo(t).eps * max(n, m); if m <= n: gufunc = _umath_linalg.lstsq_m; else: gufunc = _umath_linalg.lstsq_n; signature = 'DDd->Ddid' if isComplexType(t) else 'ddd->ddid'; extobj = get_linalg_error_extobj(_raise_linalgerror_lstsq); if n_rhs == 0: b = zeros(b.shape[:-2] + (m, n_rhs + 1), dtype=b.dtype); x, resids, rank, s = gufunc(a, b, rcond, 
signature=signature, extobj=extobj); if m == 0: x[...] = 0; if n_rhs == 0: x = x[..., :n_rhs]; resids = resids[..., :n_rhs]; if is_1d: x = x.squeeze(axis=-1); if rank != n or m <= n: resids = array([], result_real_t); s = s.astype(result_real_t, copy=False); resids = resids.astype(result_real_t, copy=False); x = x.astype(result_t, copy=True); return wrap(x), wrap(resids), rank, s; def _multi_svd_norm(x, row_axis, col_axis, op): y = moveaxis(x, (row_axis, col_axis), (-2, -1)); result = op(svd(y, compute_uv=False), axis=-1); return result; def _norm_dispatcher(x, ord=None, axis=None, keepdims=None): return (x,); @array_function_dispatch(_norm_dispatcher); def norm(x, ord=None, axis=None, keepdims=False): x = asarray(x); if not issubclass(x.dtype.type, (inexact, object_)): x = x.astype(float); if axis is None: ndim = x.ndim; if ((ord is None) or; (ord in ('f', 'fro') and ndim == 2) or; (ord == 2 and ndim == 1)): x = x.ravel(order='K'); if isComplexType(x.dtype.type): sqnorm = dot(x.real, x.real) + dot(x.imag, x.imag); else: sqnorm = dot(x, x); ret = sqrt(sqnorm); if keepdims: ret = ret.reshape(ndim*[1]); return ret; nd = x.ndim; if axis is None: axis = tuple(range(nd)); elif not isinstance(axis, tuple): try: axis = int(axis); except Exception: raise TypeError("'axis' must be None, an integer or a tuple of integers"); axis = (axis,); if len(axis) == 1: if ord == Inf: return abs(x).max(axis=axis, keepdims=keepdims); elif ord == -Inf: return abs(x).min(axis=axis, keepdims=keepdims); elif ord == 0: return (x != 0).astype(x.real.dtype).sum(axis=axis, keepdims=keepdims); elif ord == 1: return add.reduce(abs(x), axis=axis, keepdims=keepdims); elif ord is None or ord == 2: s = (x.conj() * x).real; return sqrt(add.reduce(s, axis=axis, keepdims=keepdims)); else: try: ord + 1; except TypeError: raise ValueError("Invalid norm order for vectors."); absx = abs(x); absx **= ord; ret = add.reduce(absx, axis=axis, keepdims=keepdims); ret **= (1 / ord); return ret; elif len(axis) == 2: row_axis, col_axis = axis; row_axis = normalize_axis_index(row_axis, nd); col_axis = normalize_axis_index(col_axis, nd); if row_axis == col_axis: raise ValueError('Duplicate axes given.'); if ord == 2: ret =_multi_svd_norm(x, row_axis, col_axis, amax); elif ord == -2: ret = _multi_svd_norm(x, row_axis, col_axis, amin); elif ord == 1: if col_axis > row_axis: col_axis -= 1; ret = add.reduce(abs(x), axis=row_axis).max(axis=col_axis); elif ord == Inf: if row_axis > col_axis: row_axis -= 1; ret = add.reduce(abs(x), axis=col_axis).max(axis=row_axis); elif ord == -1: if col_axis > row_axis: col_axis -= 1; ret = add.reduce(abs(x), axis=row_axis).min(axis=col_axis); elif ord == -Inf: if row_axis > col_axis: row_axis -= 1; ret = add.reduce(abs(x), axis=col_axis).min(axis=row_axis); elif ord in [None, 'fro', 'f']: ret = sqrt(add.reduce((x.conj() * x).real, axis=axis)); elif ord == 'nuc': ret = _multi_svd_norm(x, row_axis, col_axis, sum); else: raise ValueError("Invalid norm order for matrices."); if keepdims: ret_shape = list(x.shape); ret_shape[axis[0]] = 1; ret_shape[axis[1]] = 1; ret = ret.reshape(ret_shape); return ret; else: raise ValueError("Improper number of dimensions to norm."); def multi_dot(arrays): n = len(arrays); if n < 2: raise ValueError("Expecting at least two arrays."); elif n == 2: return dot(arrays[0], arrays[1]); arrays = [asanyarray(a) for a in arrays]; ndim_first, ndim_last = arrays[0].ndim, arrays[-1].ndim; if arrays[0].ndim == 1: arrays[0] = atleast_2d(arrays[0]); if arrays[-1].ndim == 1: arrays[-1] = 
atleast_2d(arrays[-1]).T; _assert_2d(*arrays) 15 | if n == 3: result = _multi_dot_three(arrays[0], arrays[1], arrays[2]) 16 | else: order = _multi_dot_matrix_chain_order(arrays) 17 | result = _multi_dot(arrays, order, 0, n - 1) 18 | if ndim_first == 1 and ndim_last == 1: return result[0, 0] 19 | elif ndim_first == 1 or ndim_last == 1: return result.ravel() 20 | else: return result; def _multi_dot_three(A, B, C): a0, a1b0 = A.shape 21 | b1c0, c1 = C.shape; cost1 = a0 * b1c0 * (a1b0 + c1); cost2 = a1b0 * c1 * (a0 + b1c0); if cost1 < cost2: return dot(dot(A, B), C); else: return dot(A, dot(B, C)); def _multi_dot_matrix_chain_order(arrays, return_costs=False): n = len(arrays); p = [a.shape[0] for a in arrays] + [arrays[-1].shape[1]]; m = zeros((n, n), dtype=double); s = empty((n, n), dtype=intp); for l in range(1, n): for i in range(n - l): j = i + l; m[i, j] = Inf; for k in range(i, j): q = m[i, k] + m[k+1, j] + p[i]*p[k+1]*p[j+1]; if q < m[i, j]: m[i, j] = q; s[i, j] = k; return (s, m) if return_costs else s; def _multi_dot(arrays, order, i, j): if i == j: return arrays[i]; else: return dot(_multi_dot(arrays, order, i, order[i, j]),; _multi_dot(arrays, order, order[i, j] + 1, j)) 22 | -------------------------------------------------------------------------------- /99_other_material/complexity.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [] 7 | }, 8 | { 9 | "cell_type": "markdown", 10 | "metadata": {}, 11 | "source": [] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 3, 16 | "metadata": {}, 17 | "outputs": [ 18 | { 19 | "data": { 20 | "text/plain": [ 21 | "['OFFSETTEXTPAD',\n", 22 | " '__class__',\n", 23 | " '__delattr__',\n", 24 | " '__dict__',\n", 25 | " '__dir__',\n", 26 | " '__doc__',\n", 27 | " '__eq__',\n", 28 | " '__format__',\n", 29 | " '__ge__',\n", 30 | " '__getattribute__',\n", 31 | " '__getstate__',\n", 32 | " '__gt__',\n", 33 | " '__hash__',\n", 34 | " '__init__',\n", 35 | " '__init_subclass__',\n", 36 | " '__le__',\n", 37 | " '__lt__',\n", 38 | " '__module__',\n", 39 | " '__name__',\n", 40 | " '__ne__',\n", 41 | " '__new__',\n", 42 | " '__reduce__',\n", 43 | " '__reduce_ex__',\n", 44 | " '__repr__',\n", 45 | " '__setattr__',\n", 46 | " '__sizeof__',\n", 47 | " '__str__',\n", 48 | " '__subclasshook__',\n", 49 | " '__weakref__',\n", 50 | " '_agg_filter',\n", 51 | " '_alpha',\n", 52 | " '_animated',\n", 53 | " '_autolabelpos',\n", 54 | " '_axes',\n", 55 | " '_clipon',\n", 56 | " '_clippath',\n", 57 | " '_contains',\n", 58 | " '_copy_tick_props',\n", 59 | " '_get_clipping_extent_bbox',\n", 60 | " '_get_label',\n", 61 | " '_get_offset_text',\n", 62 | " '_get_tick',\n", 63 | " '_get_tick_bboxes',\n", 64 | " '_get_tick_boxes_siblings',\n", 65 | " '_get_ticks_position',\n", 66 | " '_gid',\n", 67 | " '_gridOnMajor',\n", 68 | " '_gridOnMinor',\n", 69 | " '_in_layout',\n", 70 | " '_label',\n", 71 | " '_major_tick_kw',\n", 72 | " '_minor_tick_kw',\n", 73 | " '_mouseover',\n", 74 | " '_oid',\n", 75 | " '_path_effects',\n", 76 | " '_picker',\n", 77 | " '_prop_order',\n", 78 | " '_propobservers',\n", 79 | " '_rasterized',\n", 80 | " '_remove_method',\n", 81 | " '_remove_overlapping_locs',\n", 82 | " '_scale',\n", 83 | " '_set_artist_props',\n", 84 | " '_set_gc_clip',\n", 85 | " '_set_scale',\n", 86 | " '_sketch',\n", 87 | " '_smart_bounds',\n", 88 | " '_snap',\n", 89 | " '_stale',\n", 90 | " '_sticky_edges',\n", 91 | " '_transform',\n", 92 | " '_transformSet',\n", 93 | " '_translate_tick_kw',\n", 94 | " 
'_update_axisinfo',\n", 95 | " '_update_label_position',\n", 96 | " '_update_offset_text_position',\n", 97 | " '_update_ticks',\n", 98 | " '_url',\n", 99 | " '_visible',\n", 100 | " 'add_callback',\n", 101 | " 'aname',\n", 102 | " 'axes',\n", 103 | " 'axis_date',\n", 104 | " 'axis_name',\n", 105 | " 'callbacks',\n", 106 | " 'cla',\n", 107 | " 'clipbox',\n", 108 | " 'contains',\n", 109 | " 'convert_units',\n", 110 | " 'convert_xunits',\n", 111 | " 'convert_yunits',\n", 112 | " 'converter',\n", 113 | " 'draw',\n", 114 | " 'eventson',\n", 115 | " 'figure',\n", 116 | " 'findobj',\n", 117 | " 'format_cursor_data',\n", 118 | " 'get_agg_filter',\n", 119 | " 'get_alpha',\n", 120 | " 'get_animated',\n", 121 | " 'get_children',\n", 122 | " 'get_clip_box',\n", 123 | " 'get_clip_on',\n", 124 | " 'get_clip_path',\n", 125 | " 'get_contains',\n", 126 | " 'get_cursor_data',\n", 127 | " 'get_data_interval',\n", 128 | " 'get_figure',\n", 129 | " 'get_gid',\n", 130 | " 'get_gridlines',\n", 131 | " 'get_in_layout',\n", 132 | " 'get_inverted',\n", 133 | " 'get_label',\n", 134 | " 'get_label_position',\n", 135 | " 'get_label_text',\n", 136 | " 'get_major_formatter',\n", 137 | " 'get_major_locator',\n", 138 | " 'get_major_ticks',\n", 139 | " 'get_majorticklabels',\n", 140 | " 'get_majorticklines',\n", 141 | " 'get_majorticklocs',\n", 142 | " 'get_minor_formatter',\n", 143 | " 'get_minor_locator',\n", 144 | " 'get_minor_ticks',\n", 145 | " 'get_minorticklabels',\n", 146 | " 'get_minorticklines',\n", 147 | " 'get_minorticklocs',\n", 148 | " 'get_minpos',\n", 149 | " 'get_offset_text',\n", 150 | " 'get_path_effects',\n", 151 | " 'get_picker',\n", 152 | " 'get_pickradius',\n", 153 | " 'get_rasterized',\n", 154 | " 'get_remove_overlapping_locs',\n", 155 | " 'get_scale',\n", 156 | " 'get_sketch_params',\n", 157 | " 'get_smart_bounds',\n", 158 | " 'get_snap',\n", 159 | " 'get_text_heights',\n", 160 | " 'get_tick_padding',\n", 161 | " 'get_tick_space',\n", 162 | " 'get_ticklabel_extents',\n", 163 | " 'get_ticklabels',\n", 164 | " 'get_ticklines',\n", 165 | " 'get_ticklocs',\n", 166 | " 'get_ticks_direction',\n", 167 | " 'get_ticks_position',\n", 168 | " 'get_tightbbox',\n", 169 | " 'get_transform',\n", 170 | " 'get_transformed_clip_path_and_affine',\n", 171 | " 'get_units',\n", 172 | " 'get_url',\n", 173 | " 'get_view_interval',\n", 174 | " 'get_visible',\n", 175 | " 'get_window_extent',\n", 176 | " 'get_zorder',\n", 177 | " 'grid',\n", 178 | " 'have_units',\n", 179 | " 'isDefault_label',\n", 180 | " 'isDefault_majfmt',\n", 181 | " 'isDefault_majloc',\n", 182 | " 'isDefault_minfmt',\n", 183 | " 'isDefault_minloc',\n", 184 | " 'is_transform_set',\n", 185 | " 'iter_ticks',\n", 186 | " 'label',\n", 187 | " 'label_position',\n", 188 | " 'labelpad',\n", 189 | " 'limit_range_for_scale',\n", 190 | " 'major',\n", 191 | " 'majorTicks',\n", 192 | " 'minor',\n", 193 | " 'minorTicks',\n", 194 | " 'mouseover',\n", 195 | " 'offsetText',\n", 196 | " 'offset_text_position',\n", 197 | " 'pan',\n", 198 | " 'pchanged',\n", 199 | " 'pick',\n", 200 | " 'pickable',\n", 201 | " 'pickradius',\n", 202 | " 'properties',\n", 203 | " 'remove',\n", 204 | " 'remove_callback',\n", 205 | " 'remove_overlapping_locs',\n", 206 | " 'reset_ticks',\n", 207 | " 'set',\n", 208 | " 'set_agg_filter',\n", 209 | " 'set_alpha',\n", 210 | " 'set_animated',\n", 211 | " 'set_clip_box',\n", 212 | " 'set_clip_on',\n", 213 | " 'set_clip_path',\n", 214 | " 'set_contains',\n", 215 | " 'set_data_interval',\n", 216 | " 'set_default_intervals',\n", 217 | " 'set_figure',\n", 
218 | " 'set_gid',\n", 219 | " 'set_in_layout',\n", 220 | " 'set_inverted',\n", 221 | " 'set_label',\n", 222 | " 'set_label_coords',\n", 223 | " 'set_label_position',\n", 224 | " 'set_label_text',\n", 225 | " 'set_major_formatter',\n", 226 | " 'set_major_locator',\n", 227 | " 'set_minor_formatter',\n", 228 | " 'set_minor_locator',\n", 229 | " 'set_path_effects',\n", 230 | " 'set_picker',\n", 231 | " 'set_pickradius',\n", 232 | " 'set_rasterized',\n", 233 | " 'set_remove_overlapping_locs',\n", 234 | " 'set_sketch_params',\n", 235 | " 'set_smart_bounds',\n", 236 | " 'set_snap',\n", 237 | " 'set_tick_params',\n", 238 | " 'set_ticklabels',\n", 239 | " 'set_ticks',\n", 240 | " 'set_ticks_position',\n", 241 | " 'set_transform',\n", 242 | " 'set_units',\n", 243 | " 'set_url',\n", 244 | " 'set_view_interval',\n", 245 | " 'set_visible',\n", 246 | " 'set_zorder',\n", 247 | " 'stale',\n", 248 | " 'stale_callback',\n", 249 | " 'sticky_edges',\n", 250 | " 'tick_bottom',\n", 251 | " 'tick_top',\n", 252 | " 'units',\n", 253 | " 'update',\n", 254 | " 'update_from',\n", 255 | " 'update_units',\n", 256 | " 'zoom',\n", 257 | " 'zorder']" 258 | ] 259 | }, 260 | "execution_count": 3, 261 | "metadata": {}, 262 | "output_type": "execute_result" 263 | } 264 | ], 265 | "source": [ 266 | "dir(ax.xaxis)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 14, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "data": { 276 | "text/plain": [ 277 | "[Text(0, 0, '1 person (you)'),\n", 278 | " Text(0, 0, '5 people (team)'),\n", 279 | " Text(0, 0, '100 people (group)'),\n", 280 | " Text(0, 0, '10,000 people (division)'),\n", 281 | " Text(0, 0, '1,000,000 people (world)')]" 282 | ] 283 | }, 284 | "execution_count": 14, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | }, 288 | { 289 | "data": { 290 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAwIAAAGqCAYAAAC4bPMSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3df9xtVV0n8M+Xi78CFZWrIT8CEyRUULmShk3YDwPrNUxKKTmKjsWgYk1Ok5hNVs40OdlkjgKSEWollWERkehoioQKl8QLKOgNSG5YgiL+SIfQ1R97P9xzD+f5de+53Mtd7/fr9byec/ZZe+/1nL3OOvuz1977qdZaAACAvuy2oysAAADc8wQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4tGwSq6uyq+nxVXb3I61VVb6yqjVW1oaqeNP9qAgAA87SSEYFzkhy7xOvHJTl4/Dk5yRnbXi0AAGB7WjYItNYuTvLFJYocn+TtbfDRJHtV1T7zqiAAADB/u89hGfsmuWni+aZx2uemC1bVyRlGDbLHHnsceeihh85h9QAAsLgrrrji1tba2h1dj53NPIJAzZjWZhVsrZ2V5KwkWbduXVu/fv0cVg8AAIurqn/Y0XXYGc3jrkGbkuw/8Xy/JDfPYbkAAMB2Mo8gcH6SF4x3D3pKkttba3c7LQgAANh5LHtqUFW9M8kxSfauqk1JXpPkPknSWjszyYVJnplkY5J/SfKi7VVZAABgPpYNAq21E5d5vSV52dxqBAAAbHf+szAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAhwQBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA4JAgAA0CFBAAAAOiQIAABAh1YUBKrq2Kq6rqo2VtVpM15/cFX9ZVV9oqquqaoXzb+qAADAvCwbBKpqTZI3JzkuyWFJTqyqw6aKvSzJJ1trRyQ5JslvVdV951xXAABgTlYyInBUko2ttetba3ckOTfJ8VNlWpIHVlUl2TPJF5PcOdeaAgAAc7OSILBvkpsmnm8ap016U5LvSnJzkquS/Gxr7VtzqSEAADB3KwkCNWNam3r+w0muTPLIJE9I8qaqetDdFlR1clWtr6r1t9xyy6orCwAAzMdKgsCmJPtPPN8vw5H/SS9Kcl4bbExyQ5JDpxfUWjurtbautbZu7dq1W1tnAABgG60kCFye5OCqOmi8APi5Sc6fKvPZJD+QJFX1iCSPSXL9PCsKAADMz+7LFWit3VlVpya5KMmaJGe31q6pqlPG189M8tok51TVVRlOJXpla+3W7VhvAABgGywbBJKktXZhkgunpp058fjmJM+Yb9UAAIDtxX8WBgCADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB1aURCoqmOr6rqq2lhVpy1S5piqurKqrqmqD823mgAAwDztvlyBqlqT5M1JfijJpiSXV9X5rbVPTpTZK8npSY5trX22qh6+vSoMAABsu5WMCByVZGNr7frW2h1Jzk1y/FSZn0xyXmvts0nSWvv8fKsJAADM00qCwL5Jbpp4vmmcNumQJA+pqg9W1RVV9YJZC6qqk6tqfVWtv+WWW7auxgAAwDZbSRCoGdPa1PPdkxyZ5EeS/HCS/15Vh9xtptbOaq2ta62tW7t27aorCwAAzMey1whkGAHYf+L5fklunlHm1tba15J8raouTnJEkk/PpZYAAMBcrWRE4PIkB1fVQVV13yTPTXL+VJm/SPK9VbV7VX1bku9O8qn5VhUAAJiXZUcEWmt3VtWpSS5KsibJ2a21a6rqlPH1M1trn6qq9yTZkORbSd7aWrt6e1YcAADYetXa9On+94x169a19evX75B1AwDQj6q6orW2bkfXY2fjPwsDAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdEgQAACADgkCAADQIUEAAAA6JAgAAECHBAEAAOiQIAAAAB0SBAAAoEOCAAAAdGhFQaCqjq2q66pqY1WdtkS5J1fVN6vqhPlVEQAAmLdlg0BVrUny5iTHJTksyYlVddgi5V6X5KJ5VxIAAJivlYwIHJVkY2vt+tbaHUnOTXL8jHIvT/JnST4/x/oBAADbwUqCwL5Jbpp4vmmcdpeq2jfJjyU5c6kFVdXJVbW+qtbfcsstq60rAAAwJysJAjVjWpt6/oYkr2ytfXOpBbXWzmqtrWutrVu7du1K6wgAAMzZ7i
sosynJ/hPP90ty81SZdUnOraok2TvJM6vqztban8+llgAAwFytJAhcnuTgqjooyT8meW6Sn5ws0Fo7aOFxVZ2T5AIhAAAAdl7LBoHW2p1VdWqGuwGtSXJ2a+2aqjplfH3J6wIAAICdz0pGBNJauzDJhVPTZgaA1toLt71aAADA9uQ/CwMAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0aEVBoKqOrarrqmpjVZ024/XnVdWG8efSqjpi/lUFAADmZdkgUFVrkrw5yXFJDktyYlUdNlXshiTf11o7PMlrk5w174oCAADzs5IRgaOSbGytXd9auyPJuUmOnyzQWru0tXbb+PSjSfabbzUBAIB5WkkQ2DfJTRPPN43TFvPiJH+9LZUCAAC2r91XUKZmTGszC1Y9PUMQeNoir5+c5OQkOeCAA1ZYRQAAYN5WMiKwKcn+E8/3S3LzdKGqOjzJW5Mc31r7wqwFtdbOaq2ta62tW7t27dbUFwAAmIOVBIHLkxxcVQdV1X2TPDfJ+ZMFquqAJOcleX5r7dPzryYAADBPy54a1Fq7s6pOTXJRkjVJzm6tXVNVp4yvn5nkl5M8LMnpVZUkd7bW1m2/agMAANuiWpt5uv92t27durZ+/fodsm4AAPpRVVc4SH13/rMwAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6JAgAAAAHRIEAACgQ4IAAAB0SBAAAIAOCQIAANAhQQAAADokCAAAQIcEAQAA6NCKgkBVHVtV11XVxqo6bcbrVVVvHF/fUFVPmn9VAQCAeVk2CFTVmiRvTnJcksOSnFhVh00VOy7JwePPyUnOmHM9AQCAOVrJiMBRSTa21q5vrd2R5Nwkx0+VOT7J29vgo0n2qqp95lxXAABgTnZfQZl9k9w08XxTku9eQZl9k3xuslBVnZxhxCBJvlpV162qtvRg7yS37uhKsNPRLphFu2AW7YJZHrOjK7AzWkkQqBnT2laUSWvtrCRnrWCddKqq1rfW1u3oerBz0S6YRbtgFu2CWapq/Y6uw85oJacGbUqy/8Tz/ZLcvBVlAACAncRKgsDlSQ6uqoOq6r5Jnpvk/Kky5yd5wXj3oKckub219rnpBQEAADuHZU8Naq3dWVWnJrkoyZokZ7fWrqmqU8bXz0xyYZJnJtmY5F+SvGj7VZldnFPHmEW7YBbtglm0C2bRLmao1u52Kj8AALCL85+FAQCgQ4IAAAB0SBAgVXV2VX2+qq7e0XVZrap6YlW9dc7LXFtV75nnMu8NqurGqrqqqq7ckbdZG+ux9yrneVdVPWp8/Ivbp2ZbrG+XaSOLff6r6qFV9b6q+sz4+yETr72qqjZW1XVV9cP3fK2TqjpwtX1WVT2gqj5UVWu2V72WWf99q+riqlrJrbvnud5Vb+OpcieNZT5TVSdNTD+oqj42Tv/j8YYiGW8c8saxjWyoqidt379wtqo6pqouWOU8+yw2T1V9sKrWjY8vrKq9lljOI6vqXcus69LV1G1q3nOr6uCtnHfZ7/yqut+4TTeO2/jAiddW1R5mLHtm/1FVR47fQRvH9lPL1eWeVFUvrKo3rXKe7bGPsmjfN9VG/99in+lJggBJck6SY7fXwrfzl+4vJvm/81xga+2WJJ+rqqPnudx7iae31p5wb7oHd1U9Nsma1tr146TtHgR2sTZyTmZ//k9L8v7W2sFJ3j8+T1UdluHucY8d5zt9R+1Yb4X/lOS81to3VzrDPP+21todGd7L58xrmSt0TlaxjSdV1UOTvCbDPxI9KslrJnYuXpfkt8f5b0vy4nH6cUkOHn9OTnLG3P6S7e8VSX53uUKttWe21r60xOs3t9ZOWGYZ37MV9VtwRpJf2Mp5z8ny3/kvTnJba+3RSX47w7be2vZwl2X6jzMytJeFtrNQx5l1uZeY6z7KKg8ivCPJS5ct1Vrz4ydJDkxy9RKvn5PkzCQfTvLpJD86Tl+T5Dcz3GZ2Q5L/PE4/JsnfJPmjJJ9MskeSv0ryiSRXJ3nOWO4Hknw8yVVJzk5yv3H6jUl+Ncnfja8dOqNOD0xy3fh4tySfSbJ24vnGDP9h8jsyfMltGH8fMPE3nTCxvK9OPD4+yek7ervcw23gxiR7L1Nmte2gxulXj9txYbsfk+TiJO8e28eZSXabrkeS/5jksiRXJnlLhh3+6Tr9epIXjo9/I8k3x/J/uNQyMnzprE9yTZJfnXoffj3JR8bXn5Thrml/n+SUXbGNZMbnP8l1SfYZH+8z8Vl7VZJXTZS7KMlTF2lPrxvf+8uSPHqcvjbJn41t5fIkR4/TH5rkz8f289Ekh4/TfyXDF9oHMnzGf3q6zou1vxl1ujTJgePj3ZKcPm7/CzLc/e6Eibr/cpJLMuy0PGGs04axzT5kLPfBJOvGx3snuXF8/MIkf5HkPeP7+JqJOhyR5MKdeRtPlTkxyVsmnr9lnFYZ/nvv7uP0pya5aLLMrPVMLfurSX4rQz///mzuv79zfO+uyNDXHDpOX6ovn9UvHZPkgvHxHhm+Yy7P8J1z/CLv0/XZ/D30gCTnjuv74yQfm9jeN47b/HVJXjox/68k+a/Zsn0+Npv7oA1JDl74+8ffS/WTH0zyriTXJvnDbL7Jy25Jblh4/+fRHqZev+tzneEOk7eO9Vx1e
5ha7sz+I0P7u3ZWu1usLjP+nmuTvG18j9+V5NvG145M8qGxPV2UzW1+qc/1GzL0F1cnOWric/2mpfqxqTrdtY8yPr8qyV7je/WFJC8Yp78jyQ8muX+S3x/LfTzDgbmF9f5pkr/M0A/ete2ydBt9yFLbeOHHiACrcWCS70vyI0nOrKr7Z0jqt7fWnpzkyUl+uqoOGssfleTVrbXDMiT7m1trR7TWHpfkPeP852To9B6f4QP+kon13dpae1KGHbafn1GfdRk+pGmtfSvJHyR53vjaDyb5RGvt1iRvSvL21trhGTrSN67gb12f5HtXUG5X0pK8t6quqKqTlyh3YFbeDp6VobM9IsM2+c2q2mdczlEZvjAfn+GL/1mTK6mq78pw5PTo1toTMuzgPy93d3SGDj6ttdOSfL0NoxrPW2YZr27DyMfhSb6vqg6fWOZNrbWnZtixOCfJCUmekuTXJsrs6m3kEW38fzDj74eP0/dNctNEuU3jtFm+3Fo7KsNn8A3jtN/JcNTwyUmenWRh2PxXk3x8/Jz+YpK3Tyzn8Azt7alJfrmqHjm1nqX6oSTDaTlJHtVau3Gc9KwMbfnxSX5qXPakb7TWntZaO3esyyvHul2V4Yjoco7K0NaekOTHF4brM/RZT17B/PeExbbxpMW298OSfKm1dufU9KXmmbZHkr8b+/kPZfP7elaSl7fWjszQ958+Tl+qLz8wd++XJr06yQfGNvL0DH3RHpMFxjZzW2vt/4+TXpLkX8b1/c8MO5TTzs2WIzw/kWGnbdIpSX5n7IPWZXg/Ji3VTz4xyX9JcliSR2Xo7xa+8zaO82wPd23DcRvfnmGbb017mLncqXL7Zsv3ZWZ7mqrLtMckOWvcXl9O8tKquk+GI/InjO3p7AzbMln6c71HG0ZsXjrOM22xfmzSXfsoo7/NsP0emyFwLnx/PCVDIHnZ+Dc+PkMQettEO35qkpNaa98/tY5F22hr7bYk96uqWe/VXe7R8xS51/uTsfP5TFVdn+TQJM9IcnhVLQyBPjjDkN4dSS5rrd0wTr8qyeur6nUZjtB8uKqOSHJDa+3TY5m3ZfggLOwwnDf+viJTO4mjfZLcMvH87AxH4d6Q4RSA3x+nP3Vi/nck+d8r+Fs/n2R6Z2NXd3Rr7eaqeniS91XVta21i2eUW007eFqSd7bhVIx/rqoPZdgJ+nKG9nF9klTVO8eyk+fU/kCGTu3y8VTRB2TYLtOm28GkpZbxE2Pg2X1cxmEZjqokm/9p4lVJ9mytfSXJV6rqG1W1VxtOCeixjSTD0axpi92H+p0Tv397fPyDSQ4bt0eSPKiqHphh+z87SVprH6iqh1XVg8cyf9Fa+3qSr1fV32TYyb5yYj2Ltb8bJsrsnWTyVI6nJfnTsS3/07jcSX+cJGMd9mqtfWic/rbcfUdvlve11r4wLuO8cX3rW2vfrKo7quqBY7va2S22vZdqByttI9/K+D5nOJBzXlXtmeR7kvzpRBu53/h7qb58Vr806RlJ/n1VLRxUun+SA5J8aqLMdF/y7zKGjdbahqrakCmttY9X1cPHcLo2Q5D47NR57B9J8uqq2i/DqWmfmVrMcv3kpiSpqiszBJ5LxvkW+qArpus1B6vd7ivd5tuzPd3UWvvb8fEfJPmZDCNLj8vwnZYMo4efW8Hn+p1J0lq7uKoeNON6kJn92NRnero9fThDm/qHjKdBVdW+Sb7YWvtqVT0t42lErbVrq+ofkhwyzvu+1toXZ/zNy7XRhTbyhRnzJhEEWJ3pD97Ch/flrbWLJl+oqmOSfO2ugq19uqqOzPCP5/5XVb03d/8P1dMWjsp8M7Pb6tczdOYL67ipqv65qr4/w/mLs44eT/4dd2a8Tma8KGnywqb7j8vvRmvt5vH356vq3Rl2tmYFgdW0g2cutcplnleSt7XWXrVM1bdoBytZxnjk7+eTPLm1dltVnTO1jIW2962JxwvPF9rirt5G/rmq9mmtfW48OrkQoDYl2X+i3H5Jbl5kGW3G490yDPNv8d7VxDfqjHlW0lbu1v6mTLeTWeub9LVlXk8m+pDcvQ0uVef7JfnGCpa/vS22jSdtynCKyoL9Mpw6cWuSvapq9/Eo7WQ7WE0bmdQyvJ9fGo+er6T8rMeznleSZ7fWrltiebP6ksVC7qR3ZRg1/PYMIwRbLqC1P6qqj2UYrbioqn6qtfaBqbotZrL/mf4u3J590MI23DSel/7gJF/M1rWHWcudnP/mcfp+M6YvVZdpi303XTOO8N5l4iDDYpZrTzP7sSnT7eniDAc7D8gwQvVjGdrNhxeqtcSyluqPlmqjy7YRpwaxGj9eVbtV1XdmGKK8LsP5di8Zh99SVYdMD7eO0x+ZYfjqD5K8PsN519cmObCqHj0We36G4eGV+lSSR09Ne2uGIwF/0jZfEHhphvN8kyEcLBxNuTGbh9GOT3KfieUcki2H9HZpVbXHeFQ24/Z7Rhb/+1fTDi5O8pyqWlNVazMcvbhsXM5RNdxlYrcMQ+uXTK3n/UlOGEcoFu5w8h0z6jPdDv51oR5LLONBGTrW26vqERkublytXb2NnJ/kpPHxSRlG2xamP7eGO3kclOHI+2Uz5k82nzLxnAxHRZPkvUlOXShQVQs7fBdnDO/jgYRbW2tfHl87vqruPw5xH5PhnNxJy/ZD4zD5momh9kuSPHtsy4/Iljs3k/PdnuS2qloYxp/sp27M5j5k+sLQHxrb2wOS/IcMpwVk/Btuaa3966z13cNmbuOq2req3j9OvyjJM6rqITVcFPqMDOd+twzXgZ0wPf+43BfU4CkZTtv63Iz17zYx/08muWTc5jdU1Y+Pdalx9DhZvC9PZvdLky5K8vKFwFlVT5xRn09nOOK+YLJNPi7DKWqznDvW64RsOaqZcd5HJbm+tfbGDO/N9HKW6ieXckiGa1zmoqpOraqFz+Zk2zghw2lVLVvRHqrqqKp6+8Ry79Z/jO3jK1X1lHEbvSBbtqdZdZl2QFUt7PCfmKF9XJdk7cL0qrpPVT12mc91MvZd41H628fykxbrxyZt8d3UWrspw8jkweNo+CUZDkgtBIHJ9nZIhsCwVHCdnmeLNjq+j9+eoZ9alCDAwmkZH0nymKraVFV3u9J/dF2GD8pfZ7ho8hsZdrw/meTvarid1Vsy++j945NcVsPQ5quT/I9x/hdlGAK+KsPR1jNXWu/W2rVJHrywAzs6P8me2XxaUDIMD75oHDJ7fpKfHaf/boZzwy/LMIIwmbifnuHi5l48IsklVfWJDF9Af9VaW+z2mKtpB+/OcLrNJzJc5PQLrbV/GpfzkQwX916d4RSOd0+upLX2ySS/lOG6hQ1J3pdhqHXaX2XLnbizkmyoqj9cbBmttU9kuBjrmgynlP1tVm+XaCNLfP5/I8PO7GeS/ND4PK21a5L8SYbt/Z4kL2uL34XnfuOR0J9N8nPjtJ9Jsq6G20p+MsP508lwkeW6cTv9RjZ/8Sdjm8xwHu1rF0avJqy0H3pvhtMwkuFCv00Z2t9bMlxkN/1lv+CkDOdtb8hwLvfCtSKvzxBALs3wBT/p
kgynr1yZ5M9aawu35H16hguT7zGr3cYZPmd3Jsl4OsJrs/miyF+bOEXhlUleUVUbM5yz/Xvj9AsznAO9MUM/u9idS76W5LFVdUWS78/m9/V5SV489kfXZDhQkyzelyez+6VJr81wsGfD2EZeO12Z1trXkvz9xMGpM5LsOa7vF7LIzvn4mXhgkn9cJPA8J8nV4/ffodny+pdk6X5ypjG8fn2R9S1pifZwaDafQvJ7SR42bttXZLyj1Fa2hwMyHpVepv94SYbP8sYMN2f466XqMsOnkpw0bq+HJjmjDXfqOiHJ68b2dGWGU8+SxT/XyRASLs2wTzJrn2ixfuwui+yjfCxD4EyGALBvNgfa0zMcrLgqwylzL5y4XmUxS7XRI5N8dOK6jZkWrj6HJdVw6sQFrbUl74t8T6uqn0vyldbaW8fn6zJcwLNNF3FW1cUZ7ipx2xyquV/Ryj0AAAFpSURBVMuYVzsYj/j+fGvtR+dQpwdkOBJ19BI7pHOnjSytqm7McPeKW7dxOb+S4e4qr59DnZ6Y5BWtteePz/ccz819WIYv0KOX2wFb4XpemOFvP3XGa+dluGvKckf6dpjxqPBnW2vLnb65rev5amttzzks55zM6fupqn4syZGttV/a1mVtT+N335dba7+3bOGVL/OCJM8ad57npqp+M8k7Wmt3u8Zijus4MEMbeNwclvXBDN9P2/z/dKb3Ue5JVfU7Sc5vrb1/qXKuEeDe7owkC0PIp2U4orDYtQErMg7N/h87ePcOrbWvV9VrMhxZ+ew9sU5t5N5pvLDzb6pqzRgaL6jhIsD7Zhhp2OYQsJQa7lz05ztzCEiS1tqq/mnSrqS19u5a5i4rO4kvZRhxmpt5HJhZZLn/bXss917irn2UHeDq5UJAYkQAAAC65BoBAADokCAAAAAdEgQAAKBDggAAAHRIEAAAgA79GwU6zSeTW8wdAAAAAElFTkSuQmCC\n", 291 | "text/plain": [ 292 | "
" 293 | ] 294 | }, 295 | "metadata": { 296 | "needs_background": "light" 297 | }, 298 | "output_type": "display_data" 299 | } 300 | ], 301 | "source": [ 302 | "import matplotlib.pyplot as plt\n", 303 | "import numpy as np\n", 304 | "\n", 305 | "vals = [1, 5, 100, 10000, 1000000]\n", 306 | "vals = np.log10(vals)\n", 307 | "vals = [1, 2, 3, 4, 5]\n", 308 | "\n", 309 | "fig, ax = plt.subplots(1, 1, figsize=(12, 7.2))\n", 310 | "ax.xaxis.set_ticks(vals)\n", 311 | "ax.xaxis.set_ticklabels([\n", 312 | " \"1 person (you)\",\n", 313 | " \"5 people (team)\",\n", 314 | " \"100 people (group)\",\n", 315 | " \"10,000 people (division)\",\n", 316 | " \"1,000,000 people (world)\"])\n", 317 | "ax.yaxis.set_ticks(vals)\n", 318 | "ax.yaxis.set_ticklabels([\n", 319 | " \"1 day\",\n", 320 | " \"1 week\",\n", 321 | " \"3 months\",\n", 322 | " \"1 year\",\n", 323 | " \"10 years\"])" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [] 360 | } 361 | ], 362 | "metadata": { 363 | "kernelspec": { 364 | "display_name": "Python 3", 365 | "language": "python", 366 | "name": "python3" 367 | }, 368 | "language_info": { 369 | "codemirror_mode": { 370 | "name": "ipython", 371 | "version": 3 372 | }, 373 | "file_extension": ".py", 374 | "mimetype": "text/x-python", 375 | "name": "python", 376 | "nbconvert_exporter": "python", 377 | "pygments_lexer": "ipython3", 378 | "version": "3.7.7" 379 | } 380 | }, 381 | "nbformat": 4, 382 | "nbformat_minor": 4 383 | } 384 | -------------------------------------------------------------------------------- /07_graphs/07_graphs.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "Collapsed": "false" 7 | }, 8 | "source": [ 9 | "# Machine Learning for Graphs\n", 10 | "This notebook contains:\n", 11 | "1. A very brief introduction for the library [`NetworkX`](https://networkx.github.io/).\n", 12 | "1. An introduction to Graph Neural Networks (GNNs) with [`PyTorch`](https://pytorch.org/).\n", 13 | "1. Some basic queries for the proprietary graph database [`Neo4j`](https://neo4j.com/)." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Introduction to NetworkX\n", 21 | "How to handle graph data in python?" 
22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import networkx as nx\n", 31 | "import pandas as pd\n", 32 | "import matplotlib.pyplot as plt\n", 33 | "\n", 34 | "print(f'NetworkX version used: {nx.__version__}')" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Load dataset 📞 ☎️" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "df = pd.read_csv('./log_of_calls.csv')\n", 51 | "df" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Convert to NetworkX graph\n", 59 | "It sometimes takes some work to get all the edge and node attributes into the desired format 🤬!" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "from_df = df[[c for c in df.columns if c.startswith('from_')]]\n", 69 | "from_df.columns = [c[5:] for c in from_df.columns]\n", 70 | "to_df = df[[c for c in df.columns if c.startswith('to_')]]\n", 71 | "to_df.columns = [c[3:] for c in to_df.columns]\n", 72 | "df_nodes = pd.concat((from_df, to_df), ignore_index=True)\n", 73 | "df_nodes = df_nodes.drop(columns='dt')\n", 74 | "df_nodes = df_nodes.drop_duplicates(subset='number')\n", 75 | "df_nodes = df_nodes.reset_index(drop=True)\n", 76 | "df_nodes" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "G = nx.MultiDiGraph()\n", 86 | "\n", 87 | "G.add_nodes_from(zip(\n", 88 | " df_nodes.number,\n", 89 | " df_nodes.drop(columns='number').to_dict('records')\n", 90 | "))\n", 91 | "G.nodes(data=True)['403-726-6587']" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "The addition of nodes preserves the insertion order:" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "list(G.nodes)[0]" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "G.add_edges_from(zip(\n", 117 | " df.from_number,\n", 118 | " df.to_number,\n", 119 | " df[['from_dt', 'to_dt']].to_dict('records')\n", 120 | "))\n", 121 | "list(G.edges(nbunch='403-726-6587', data=True))" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "### What does the Graph look like? 📊 📈 📉\n", 129 | "As mentioned, the answer strongly depends on the spatial layout of the nodes..."
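The next cell places the nodes on a circle; as a sketch of how strongly the layout choice matters, one could draw the same graph `G` with a force-directed layout next to it (the `seed` below is only an assumption to make the layout reproducible):

```python
import matplotlib.pyplot as plt
import networkx as nx

# the same graph under two different layouts can look like two different graphs
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
nx.draw(G, pos=nx.circular_layout(G), ax=axes[0], width=0.1, node_size=10)
nx.draw(G, pos=nx.spring_layout(G, seed=42), ax=axes[1], width=0.1, node_size=10)
```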
130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "plt.figure(figsize=(16,16))\n", 139 | "nx.draw_circular(G, width=0.1, node_size=10)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### Some Summary Statistics 📂\n", 147 | "Plenty of algorithms are available:" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": { 154 | "scrolled": false 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "from IPython.display import IFrame \n", 159 | "IFrame('https://networkx.github.io/documentation/networkx-2.4/reference', width=1000, height=650)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "nx.average_shortest_path_length(G)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "## Graph Neural Networks (GNNs) with PyTorch (aka Message Passing 📥 📤)\n", 176 | "Graph Neural Networks (GNNs) unraveled!" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "from collections import OrderedDict\n", 186 | "from typing import Tuple, List\n", 187 | "\n", 188 | "import numpy as np\n", 189 | "import scipy.sparse as sp\n", 190 | "from sklearn.model_selection import train_test_split\n", 191 | "from sklearn.preprocessing import OneHotEncoder, LabelEncoder\n", 192 | "import torch\n", 193 | "from torch.nn import functional as F\n", 194 | "from torch import nn\n", 195 | "from tqdm.notebook import tqdm " 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Loading the labels 🤷‍♂️ 🤷‍♀️\n", 203 | "_(This task is probably not politically correct, but the result is quite surprising!)_" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [ 212 | "le = LabelEncoder()\n", 213 | "y = le.fit_transform(df_nodes['gender'].values)\n", 214 | "y = torch.from_numpy(y)\n", 215 | "y" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### Loading the features\n", 223 | "I.e. 
the one-hot encoding of the names:\n", 224 | "\n", 225 | "**Observation:** \n", 226 | "🔎_The name is basically a unique identifier, so how should we be able to learn something (with partially labelled data)?_ 🔎" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "X = OneHotEncoder().fit_transform(df_nodes['name'].values[:, None]).toarray()\n", 236 | "X = torch.from_numpy(X).float()\n", 237 | "X.shape" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "### Convert the NetworkX graph into a sparse adjacency matrix\n", 245 | "Otherwise the space requirement is $O(n^2)$ in the number of nodes $n$." 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "def from_networkx_to_sparse_tensor(G: nx.Graph) -> torch.Tensor:\n", 255 | " if hasattr(G, 'to_undirected'):\n", 256 | " G = G.to_undirected()\n", 257 | " adjacency_matrix = nx.convert_matrix.to_scipy_sparse_matrix(G)\n", 258 | " adjacency_matrix += sp.diags(np.ones(len(G.nodes())))\n", 259 | " adjacency_matrix = adjacency_matrix.tocoo()\n", 260 | " row_index = torch.from_numpy(adjacency_matrix.row).to(torch.long)\n", 261 | " col_index = torch.from_numpy(adjacency_matrix.col).to(torch.long)\n", 262 | " A = torch.sparse.FloatTensor(\n", 263 | " torch.stack([row_index, col_index], dim=0),\n", 264 | " torch.ones_like(row_index, dtype=torch.float)\n", 265 | " ).coalesce()\n", 266 | " return A" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "A = from_networkx_to_sparse_tensor(G)\n", 276 | "A" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "### Implementation of a Graph Convolutional Network (GCN)\n", 284 | "For the graph convolutional layer we are going to use the following update scheme:\n", 285 | "\n", 286 | "$$H^{(l+1)}=\sigma\left(D^{-\frac{1}{2}} A D^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)=\sigma\left(\hat{A} H^{(l)} W^{(l)}\right)$$\n", 287 | "\n", 288 | "We use the ReLU as the activation function, except in the last layer, where we directly output the raw logits (i.e. no activation at all). With $H^{(0)}$ we denote the node features." 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": null, 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "class GraphConvolution(nn.Module):\n", 298 | " \"\"\"\n", 299 | " Graph Convolution Layer: as proposed in [Kipf et al. 
2017](https://arxiv.org/abs/1609.02907).\n", 300 | " \n", 301 | " Parameters\n", 302 | " ----------\n", 303 | " in_channels: int\n", 304 | " Dimensionality of input channels/features.\n", 305 | " out_channels: int\n", 306 | " Dimensionality of output channels/features.\n", 307 | " \"\"\"\n", 308 | "\n", 309 | " def __init__(self, in_channels: int, out_channels: int):\n", 310 | " super().__init__()\n", 311 | " self.linear = nn.Linear(in_channels, out_channels, bias=False)\n", 312 | "\n", 313 | " def forward(self, arguments: Tuple[torch.tensor, torch.sparse.FloatTensor]) -> torch.tensor:\n", 314 | " \"\"\"\n", 315 | " Forward method.\n", 316 | " \n", 317 | " Parameters\n", 318 | " ----------\n", 319 | " arguments: Tuple[torch.tensor, torch.sparse.FloatTensor]\n", 320 | " Tuple of feature matrix `X` and normalized adjacency matrix `A_hat`\n", 321 | " \n", 322 | " Returns\n", 323 | " ---------\n", 324 | " X: torch.tensor\n", 325 | " The result of the message passing step\n", 326 | " \"\"\"\n", 327 | " X, A_hat = arguments\n", 328 | " X = A_hat @ self.linear(X)\n", 329 | " return X" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "In the following we stack multiple such layers (with ReLU activation functions and dropout in between). Before we pass the adjacency matrix to the GCN, we calculate the normalized adjacency matrix: \n", 337 | "$$\hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [ 346 | "class GCN(nn.Module):\n", 347 | " \"\"\"\n", 348 | " Graph Convolution Network: as proposed in [Kipf et al. 2017](https://arxiv.org/abs/1609.02907).\n", 349 | " \n", 350 | " Parameters\n", 351 | " ----------\n", 352 | " n_features: int\n", 353 | " Dimensionality of input features.\n", 354 | " n_classes: int\n", 355 | " Number of classes for the semi-supervised node classification.\n", 356 | " hidden_dimensions: List[int]\n", 357 | " Internal number of features. 
`len(hidden_dimensions)` defines the number of hidden representations.\n", 358 | " activation: nn.Module\n", 359 | " The activation for each layer but the last.\n", 360 | " dropout: float\n", 361 | " The dropout probability.\n", 362 | " \"\"\"\n", 363 | " \n", 364 | " def __init__(self,\n", 365 | " n_features: int,\n", 366 | " n_classes: int,\n", 367 | " hidden_dimensions: List[int] = [80],\n", 368 | " activation: nn.Module = nn.ReLU(),\n", 369 | " dropout: float = 0.5):\n", 370 | " super().__init__()\n", 371 | " self.n_features = n_features\n", 372 | " self.n_classes = n_classes\n", 373 | " self.hidden_dimensions = hidden_dimensions\n", 374 | " self.layers = nn.ModuleList()\n", 375 | " self.layers.extend([\n", 376 | " nn.Sequential(OrderedDict([\n", 377 | " (f'gcn_{idx}', GraphConvolution(in_channels=in_channels,\n", 378 | " out_channels=out_channels)),\n", 379 | " (f'activation_{idx}', activation),\n", 380 | " (f'dropout_{idx}', nn.Dropout(p=dropout))\n", 381 | " ]))\n", 382 | " for idx, (in_channels, out_channels)\n", 383 | " in enumerate(zip([n_features] + hidden_dimensions[:-1], hidden_dimensions))\n", 384 | " ])\n", 385 | " self.layers.append(\n", 386 | " nn.Sequential(OrderedDict([\n", 387 | " (f'gcn_{len(hidden_dimensions)}', GraphConvolution(in_channels=hidden_dimensions[-1],\n", 388 | " out_channels=n_classes))\n", 389 | " ]))\n", 390 | " )\n", 391 | " \n", 392 | " def normalize(self, A: torch.sparse.FloatTensor) -> torch.tensor:\n", 393 | " \"\"\"\n", 394 | " For calculating $\\hat{A} = 𝐷^{−\\frac{1}{2}} 𝐴 𝐷^{−\\frac{1}{2}}$.\n", 395 | " \n", 396 | " Parameters\n", 397 | " ----------\n", 398 | " A: torch.sparse.FloatTensor\n", 399 | " Sparse adjacency matrix with added self-loops.\n", 400 | " \n", 401 | " Returns\n", 402 | " -------\n", 403 | " A_hat: torch.sparse.FloatTensor\n", 404 | " Normalized message passing matrix\n", 405 | " \"\"\"\n", 406 | " row, col = A._indices()\n", 407 | " edge_weight = A._values()\n", 408 | " deg = (A @ torch.ones(A.shape[0], 1, device=A.device)).squeeze()\n", 409 | " deg_inv_sqrt = deg.pow(-0.5)\n", 410 | " normalized_edge_weight = deg_inv_sqrt[row] * edge_weight * deg_inv_sqrt[col]\n", 411 | " A_hat = torch.sparse.FloatTensor(A._indices(), normalized_edge_weight, A.shape)\n", 412 | " return A_hat\n", 413 | "\n", 414 | " def forward(self, X: torch.Tensor, A: torch.sparse.FloatTensor) -> torch.tensor:\n", 415 | " \"\"\"\n", 416 | " Forward method.\n", 417 | " \n", 418 | " Parameters\n", 419 | " ----------\n", 420 | " X: torch.tensor\n", 421 | " Feature matrix `X`\n", 422 | " A: torch.tensor\n", 423 | " adjacency matrix `A` (with self-loops)\n", 424 | " \n", 425 | " Returns\n", 426 | " ---------\n", 427 | " X: torch.tensor\n", 428 | " The result of the last message passing step (i.e. 
the logits)\n", 429 | " \"\"\"\n", 430 | " A_hat = self.normalize(A)\n", 431 | " for layer in self.layers:\n", 432 | " X = layer((X, A_hat))\n", 433 | " return X" 434 | ] 435 | }, 436 | { 437 | "cell_type": "markdown", 438 | "metadata": {}, 439 | "source": [ 440 | "### Train/Validation/Test split 🎛" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [ 449 | "def split(labels: np.ndarray,\n", 450 | " train_size: float = 0.1,\n", 451 | " val_size: float = 0.1,\n", 452 | " test_size: float = 0.8,\n", 453 | " random_state: int = 42) -> List[np.ndarray]:\n", 454 | " \"\"\"Split the arrays or matrices into random train, validation and test subsets.\n", 455 | "\n", 456 | " Parameters\n", 457 | " ----------\n", 458 | " labels: np.ndarray [n_nodes]\n", 459 | " The class labels\n", 460 | " train_size: float\n", 461 | " Proportion of the dataset included in the train split.\n", 462 | " val_size: float\n", 463 | " Proportion of the dataset included in the validation split.\n", 464 | " test_size: float\n", 465 | " Proportion of the dataset included in the test split.\n", 466 | " random_state: int\n", 467 | " Random_state is the seed used by the random number generator;\n", 468 | "\n", 469 | " Returns\n", 470 | " -------\n", 471 | " split_train: array-like\n", 472 | " The indices of the training nodes\n", 473 | " split_val: array-like\n", 474 | " The indices of the validation nodes\n", 475 | " split_test array-like\n", 476 | " The indices of the test nodes\n", 477 | "\n", 478 | " \"\"\"\n", 479 | " idx = np.arange(labels.shape[0])\n", 480 | " idx_train_and_val, idx_test = train_test_split(idx,\n", 481 | " random_state=random_state,\n", 482 | " train_size=(train_size + val_size),\n", 483 | " test_size=test_size,\n", 484 | " stratify=labels)\n", 485 | "\n", 486 | " idx_train, idx_val = train_test_split(idx_train_and_val,\n", 487 | " random_state=random_state,\n", 488 | " train_size=(train_size / (train_size + val_size)),\n", 489 | " test_size=(val_size / (train_size + val_size)),\n", 490 | " stratify=labels[idx_train_and_val])\n", 491 | " \n", 492 | " return idx_train, idx_val, idx_test" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "### The training code... 
🎓" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "def train(model: nn.Module, \n", 509 | " X: torch.Tensor, \n", 510 | " A: torch.sparse.FloatTensor, \n", 511 | " labels: torch.Tensor, \n", 512 | " idx_train: np.ndarray, \n", 513 | " idx_val: np.ndarray,\n", 514 | " lr: float = 1e-2,\n", 515 | " weight_decay: float = 5e-4, \n", 516 | " patience: int = 50, \n", 517 | " max_epochs: int = 500, \n", 518 | " display_step: int = 10):\n", 519 | " \"\"\"\n", 520 | " Train a model using either standard or adversarial training.\n", 521 | " \n", 522 | " Parameters\n", 523 | " ----------\n", 524 | " model: nn.Module\n", 525 | " Model which we want to train.\n", 526 | " X: torch.Tensor [n, d]\n", 527 | " Dense attribute matrix.\n", 528 | " A: torch.sparse.FloatTensor [n, n]\n", 529 | " Sparse adjacency matrix.\n", 530 | " labels: torch.Tensor [n]\n", 531 | " Ground-truth labels of all nodes,\n", 532 | " idx_train: np.ndarray [?]\n", 533 | " Indices of the training nodes.\n", 534 | " idx_val: np.ndarray [?]\n", 535 | " Indices of the validation nodes.\n", 536 | " lr: float\n", 537 | " Learning rate.\n", 538 | " weight_decay : float\n", 539 | " Weight decay.\n", 540 | " patience: int\n", 541 | " The number of epochs to wait for the validation loss to improve before stopping early.\n", 542 | " max_epochs: int\n", 543 | " Maximum number of epochs for training.\n", 544 | " display_step : int\n", 545 | " How often to print information.\n", 546 | " seed: int\n", 547 | " Seed\n", 548 | " \n", 549 | " Returns\n", 550 | " -------\n", 551 | " trace_train: list\n", 552 | " A list of values of the train loss during training.\n", 553 | " trace_val: list\n", 554 | " A list of values of the validation loss during training.\n", 555 | " \"\"\"\n", 556 | " trace_train = []\n", 557 | " trace_val = []\n", 558 | " optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)\n", 559 | "\n", 560 | " best_loss = np.inf\n", 561 | " for it in tqdm(range(max_epochs), desc='Training...'):\n", 562 | " logits = model(X, A) \n", 563 | " loss_train = F.cross_entropy(logits[idx_train], labels[idx_train])\n", 564 | " loss_val = F.cross_entropy(logits[idx_val], labels[idx_val])\n", 565 | "\n", 566 | " optimizer.zero_grad()\n", 567 | " loss_train.backward()\n", 568 | " optimizer.step()\n", 569 | " \n", 570 | " trace_train.append(loss_train.detach().item())\n", 571 | " trace_val.append(loss_val.detach().item())\n", 572 | "\n", 573 | " if loss_val < best_loss:\n", 574 | " best_loss = loss_val\n", 575 | " best_epoch = it\n", 576 | " best_state = {key: value.cpu() for key, value in model.state_dict().items()}\n", 577 | " else:\n", 578 | " if it >= best_epoch + patience:\n", 579 | " break\n", 580 | "\n", 581 | " if display_step > 0 and it % display_step == 0:\n", 582 | " print(f'Epoch {it:4}: loss_train: {loss_train.item():.5f}, loss_val: {loss_val.item():.5f} ')\n", 583 | "\n", 584 | " # restore the best validation state\n", 585 | " model.load_state_dict(best_state)\n", 586 | " return trace_train, trace_val" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "### 🚧 Putting it all together 🚧" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [ 602 | "D, C = X.shape[1], y.max() + 1\n", 603 | "\n", 604 | "gcn = GCN(n_features=D, n_classes=C, hidden_dimensions=[64])\n", 605 | "\n", 
606 | "gcn" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "idx_train, idx_val, idx_test = split(y.numpy(), train_size=0.1, val_size=0.1, test_size=0.8)" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": null, 621 | "metadata": { 622 | "scrolled": false 623 | }, 624 | "outputs": [], 625 | "source": [ 626 | "trace_train, trace_val = train(gcn, X, A, y, idx_train, idx_val)\n", 627 | "\n", 628 | "plt.plot(trace_train, label='train')\n", 629 | "plt.plot(trace_val, label='validation')\n", 630 | "plt.xlabel('Epochs')\n", 631 | "plt.ylabel('Loss')\n", 632 | "plt.legend()\n", 633 | "plt.grid(True)" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": null, 639 | "metadata": {}, 640 | "outputs": [], 641 | "source": [ 642 | "gcn.eval()\n", 643 | "logits = gcn(X, A)\n", 644 | "accuracy = (torch.argmax(logits, dim=-1) == y)[idx_test].float().mean()\n", 645 | "print(f'We can predict the name with an accuracy of {100*accuracy:.2f} % ' \n", 646 | " f'based on non-informative features due to the graph stucture!!!')" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": { 652 | "Collapsed": "false" 653 | }, 654 | "source": [ 655 | "## 🗺 Graph Databases (Neo4j)\n", 656 | "Handle your graph data professionally!" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": { 662 | "Collapsed": "false" 663 | }, 664 | "source": [ 665 | "Switch to this folder: `cd 07_graphs`\n", 666 | "\n", 667 | "Start Neo4j server e.g. via docker (initial user: `neo4j`, pw: `neo4j`):\n", 668 | "```bash\n", 669 | "docker run \\\n", 670 | " --publish=7474:7474 --publish=7687:7687 \\\n", 671 | " --volume=$PWD/data:/data \\\n", 672 | " --volume=$PWD/import:/import \\\n", 673 | " --env 'NEO4JLABS_PLUGINS=[\"graph-data-science\"]' \\\n", 674 | " neo4j:4.1.1\n", 675 | "```\n", 676 | "\n", 677 | "Then connect to the Neo4j bowser via `http://localhost:7474/` (default user and pw is typically `neo4j`).\n", 678 | "\n", 679 | "### Load Graph\n", 680 | "\n", 681 | "```sql\n", 682 | "MATCH (n) DETACH DELETE n;\n", 683 | "LOAD CSV WITH HEADERS FROM 'file:///log_of_calls.csv' AS line\n", 684 | "MERGE (c1:City { name: line.from_city })\n", 685 | "MERGE (p1:Person { name: line.from_name, number: line.from_number, gender: line.from_gender })\n", 686 | "MERGE (p1)-[:FROM]->(c1)\n", 687 | "MERGE (c2:City { name: line.to_city })\n", 688 | "MERGE (p2:Person { name: line.to_name, number: line.to_number, gender: line.to_gender })\n", 689 | "MERGE (p2)-[:FROM]->(c2)\n", 690 | "CREATE (p1)-[c:Calls { \n", 691 | "\t\tfrom: datetime(line.from_dt),\n", 692 | "\t\tto: datetime(line.to_dt),\n", 693 | " duration: duration.between(datetime(line.from_dt), datetime(line.to_dt)).minutes\n", 694 | "\t}]->(p2)\n", 695 | "```\n", 696 | "\n", 697 | "### Visualize Graph\n", 698 | "\n", 699 | "For example we want to have a look at all persons from `Pattaya`:\n", 700 | "```sql\n", 701 | "MATCH p=()-[r:FROM]->({ name: 'Pattaya' }) \n", 702 | "RETURN p\n", 703 | "```\n", 704 | "or equivalently:\n", 705 | "```sql\n", 706 | "MATCH p=()-[r:FROM]->(c)\n", 707 | "WHERE c.name='Pattaya'\n", 708 | "RETURN p\n", 709 | "```\n", 710 | "\n", 711 | "### Explain\n", 712 | "Similarily to SQL, we can execute an `EXPLAIN` query for analyis of the execution plan:\n", 713 | "```\n", 714 | "EXPLAIN MATCH p=()-[r:FROM]->(c)\n", 715 | "WHERE c.name='Pattaya'\n", 716 | "RETURN p\n", 717 | "```\n", 718 | "\n", 
719 | "### Closeness centrality\n", 720 | "```\n", 721 | "CALL gds.alpha.closeness.stream({\n", 722 | " nodeProjection: 'Person',\n", 723 | " relationshipProjection: 'Calls'\n", 724 | "})\n", 725 | "YIELD nodeId, centrality\n", 726 | "RETURN gds.util.asNode(nodeId).name AS user, centrality\n", 727 | "ORDER BY centrality DESC\n", 728 | "```" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [] 737 | } 738 | ], 739 | "metadata": { 740 | "jupytext": { 741 | "formats": "ipynb,py" 742 | }, 743 | "kernelspec": { 744 | "display_name": "Python 3", 745 | "language": "python", 746 | "name": "python3" 747 | }, 748 | "language_info": { 749 | "codemirror_mode": { 750 | "name": "ipython", 751 | "version": 3 752 | }, 753 | "file_extension": ".py", 754 | "mimetype": "text/x-python", 755 | "name": "python", 756 | "nbconvert_exporter": "python", 757 | "pygments_lexer": "ipython3", 758 | "version": "3.7.6" 759 | } 760 | }, 761 | "nbformat": 4, 762 | "nbformat_minor": 4 763 | } 764 | -------------------------------------------------------------------------------- /02_database_basics/02_database_basics.py: -------------------------------------------------------------------------------- 1 | # --- 2 | # jupyter: 3 | # jupytext: 4 | # formats: ipynb,py 5 | # text_representation: 6 | # extension: .py 7 | # format_name: light 8 | # format_version: '1.5' 9 | # jupytext_version: 1.4.2 10 | # kernelspec: 11 | # display_name: Python 3 12 | # language: python 13 | # name: python3 14 | # --- 15 | 16 | # + [markdown] Collapsed="false" 17 | # # Data Management and Database Basics 18 | 19 | # + [markdown] Collapsed="false" 20 | # ## Motivation 21 | 22 | # + [markdown] Collapsed="false" 23 | # 24 | 25 | # + [markdown] Collapsed="false" 26 | # ## Overview 27 | 28 | # + [markdown] Collapsed="false" 29 | # 1. Pre-SQL (Robin) 30 | # 2. SQL databases (Ali/Emilio) 31 | # 3. Non-SQL databases (Ali) 32 | # 4. Simple graph database introduction (Robin?) 33 | 34 | # + [markdown] Collapsed="false" 35 | # # Pre-SQL 36 | 37 | # + [markdown] Collapsed="false" 38 | # - You kind of have data, but not really that much. 39 | # - You want to organize it better, but keep things lightweight to share. 

# + [markdown] Collapsed="false"
# ### Efficiently reading last lines

# + [markdown] Collapsed="false" slideshow={"slide_type": "slide"}
# # SQL

# + [markdown] slideshow={"slide_type": "slide"}
# ## Introduction
# - SQL is a declarative programming language to manipulate tables
#   - no functions or loops, just _declare_ what you need and the runtime will figure out how to compute it
# - SQL queries can be used to
#   - Insert new rows into a table
#   - Delete rows from a table
#   - Update one or more attributes of one or more rows in a table
#   - Retrieve and possibly transform rows coming from one or more tables
# - Relational Database Management System (RDBMS)
#   - Manages data in the tables
#   - Executes queries, returns results
# - This section will mostly focus on reading data (last point)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Main abstraction: Tables
# - A table is a _set_ of tuples (rows)
#   - No two rows are the same
# - Rows are distinguished by _primary keys_
#   - Primary key: smallest set of attributes that uniquely identifies a row
#   - Cannot have two rows with the same primary key
#   - Examples:
#     - Student ID (one attribute)
#     - First name, last name, birth date, place of birth (four attributes)
# - The primary key is a property of each table
#   - All rows in a table use the same attributes as primary key
#   - But different tables can have different primary keys

# + [markdown] slideshow={"slide_type": "slide"}
# ## Domain
# - Good database design has
#   - One table for each entity in the domain
#   - Relationships between two or more entities
# - _Foreign keys_ are used to refer to rows of other tables
#   - e.g. a table with grades will have foreign keys that point to the student and the course

# + [markdown] slideshow={"slide_type": "slide"}
# ### Example: University
# - Entities
#   - Students (ID, Name, Degree)
#   - Courses (ID, Title, Faculty, Semester)
#   - Professors (ID, Name, Chair)
# - Relationships
#   - One student can *Mentor* another student
#   - A student *Attends* several courses and obtains a grade for each of them
#   - Professors *Teach* courses

# + [markdown] slideshow={"slide_type": "slide"}
# ### ER diagram
# - Graphical form to represent entities and relationships
#   - Box: entity
#   - Diamond: relationship
#   - Circle: attribute
#
# ![](../img/sql_er_diagram.png)

# + [markdown] slideshow={"slide_type": "slide"}
# ### Which tables to create?
# - Until now, we separated entities from relationships
# - But in practice everything must be stored in tables
# - How to do this?
#   - One table per entity (students, courses, professors)
# - What about the relationships?
#   - Mentor: 1 to 1, three possibilities
#     1. Have a column "mentor"
#     2. Have a column "mentee"
#        (having both is not ideal: more work to ensure consistency)
#     3. Have a new table (mentor, mentee)
#   - Attends: M to N
#     - Requires a table (student, course)
#   - Teaches: 1 to N
#     - Store professor in course table or create separate table
# - General rule:
#   - Using a separate table is always possible, or:
#     - 1 to 1: can store in either entity
#     - 1 to N: store in entity with cardinality N
#     - M to N: must use separate table

# + [markdown] slideshow={"slide_type": "slide"}
# ### Final list of tables
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)
# - Which attributes are primary and foreign keys? (One possible answer below.)
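
# + [markdown] slideshow={"slide_type": "slide"}
# One way to spell out the answer is to write the schema down as DDL. A minimal
# sketch in SQLite (the column types are assumptions for illustration; the key
# placement follows the rules from the previous slide):

# + Collapsed="false"
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE Students (
        ID     INTEGER PRIMARY KEY,
        Name   TEXT,
        Degree TEXT,
        Mentor INTEGER REFERENCES Students(ID)      -- 1 to 1: stored in the entity
    );
    CREATE TABLE Professors (
        ID    INTEGER PRIMARY KEY,
        Name  TEXT,
        Chair TEXT
    );
    CREATE TABLE Courses (
        ID        INTEGER PRIMARY KEY,
        Title     TEXT,
        Faculty   TEXT,
        Semester  TEXT,
        Professor INTEGER REFERENCES Professors(ID) -- 1 to N: stored on the N side
    );
    CREATE TABLE Attends (
        Student INTEGER REFERENCES Students(ID),
        Course  INTEGER REFERENCES Courses(ID),
        Grade   REAL,
        PRIMARY KEY (Student, Course)               -- M to N: separate table
    );
''')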

# + [markdown] slideshow={"slide_type": "slide"}
# ## Purpose of SQL
# - SQL shines when "navigating" across relationships, for example:
#   - For each student, find the professor that gave them the highest grade
#   - For each professor, find courses taught last semester
# - Also used to modify data, tables, databases, etc.
#   - Not discussed in this course

# + [markdown] slideshow={"slide_type": "slide"}
# ## Anatomy of a SELECT query
# - SELECT queries are used to retrieve data from the database
# - The result is itself a table (not saved unless specified)
#
# ```
# SELECT <columns>
# FROM <tables>
# [WHERE <condition>]
# [GROUP BY <columns>
#   [HAVING <condition>]]
# [ORDER BY <columns> [ASC|DESC]];
# ```
#
# - Must have SELECT and FROM
# - WHERE and GROUP BY are optional
# - HAVING is optional, and must be used with GROUP BY
# - GROUP BY: in the end, each group must be collapsed into a single output row

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example
#
# Find all courses held in the Winter semester 2019/2020:
#
# ```sql
# SELECT *
# FROM Courses
# WHERE Semester = 'WiSE 19/20';
# ```
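
# + [markdown] slideshow={"slide_type": "slide"}
# The same example, runnable end-to-end with Python's built-in `sqlite3`
# module (the rows are made up for illustration):

# + Collapsed="false"
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Courses (ID INTEGER PRIMARY KEY, Title TEXT, Faculty TEXT, Semester TEXT, Professor INTEGER)')
conn.executemany('INSERT INTO Courses VALUES (?, ?, ?, ?, ?)', [
    (1, 'Databases',        'Informatics', 'WiSE 19/20', 1),
    (2, 'Machine Learning', 'Informatics', 'SoSE 20',    2),
])

for row in conn.execute("SELECT * FROM Courses WHERE Semester = 'WiSE 19/20'"):
    print(row)  # (1, 'Databases', 'Informatics', 'WiSE 19/20', 1)
# -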

# + [markdown] slideshow={"slide_type": "slide"}
# ## Select query untangled
# - Confusingly, the execution order is different from the writing order:
#   1. FROM: first, gather all input rows from all tables
#   2. WHERE: next, remove all rows not matching the predicate
#   3. GROUP BY: now, if needed, create groups of rows
#   4. HAVING: then, remove all groups that do not match the predicate
#   5. ORDER BY: sort the tuples by the value of a certain column
#   6. SELECT: finally, produce output columns

# + [markdown] slideshow={"slide_type": "slide"}
# ## Interactive SQL console
#
# An interactive SQL console with a few tables can be accessed at [w3schools.com](https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all)
#
# - Go to w3schools.com
# - Scroll down to SQL; on the left side there will be a query and a button "Try it Yourself"
# - I encourage you to fiddle around while I am explaining
# - They also have a (superficial) command reference
#
# ![](../img/w3trysql.png)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Interactive SQL console
#
# ![](../img/w3sqled.png)

# + [markdown] slideshow={"slide_type": "slide"}
# ## FROM: source tables
# - You can specify one or more tables in the FROM clause
# - FROM will do a cross-product of all tuples of all tables
#   - In almost all cases, you only want a small subset of the cross-product
#   - Use WHERE to remove tuples that do not make sense
# - Possible to give aliases to tables and use that alias in the rest of the query
#   - Useful to keep the query short and to disambiguate when the same table is used several times in the same query

# + [markdown] slideshow={"slide_type": "slide"}
# ## WHERE: tuple filter
# - Specify a boolean condition that is evaluated for each row produced by the FROM
# - All rows where this evaluates to false are discarded
# - Example: associate to each student all their grades (one per row)
#
# ```sql
# SELECT *
# FROM
#     Students AS s,
#     Attends AS a,
#     Courses AS c
# WHERE
#     s.ID = a.Student
#     AND a.Course = c.ID;
# ```

# + [markdown] slideshow={"slide_type": "slide"}
# ## WHERE: handling of NULL values
#
# - NULL is used for "undefined" values
# - Nothing is equal to NULL (not even NULL)
#   - `x = NULL` always evaluates to NULL, which is treated as false
#   - Use instead `x IS NULL` or `x IS NOT NULL`
# - Nasty example: `SELECT * FROM table WHERE x = 10 OR NOT x = 10` (see the check below)
#   - When `x` contains NULLs this equals `WHERE x IS NOT NULL`
#   - Dumb fix: `WHERE x = 10 OR NOT x = 10 OR x IS NULL`
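
# + [markdown] slideshow={"slide_type": "slide"}
# The nasty example is easy to reproduce. A quick check in `sqlite3` with a
# made-up single-column table:

# + Collapsed="false"
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(10,), (7,), (None,)])

# the NULL row satisfies neither x = 10 nor NOT x = 10, so it is dropped
print(conn.execute('SELECT x FROM t WHERE x = 10 OR NOT x = 10').fetchall())
# [(10,), (7,)]

print(conn.execute('SELECT x FROM t WHERE x IS NULL').fetchall())
# [(None,)]
# -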

# + [markdown] slideshow={"slide_type": "slide"}
# ## JOIN: a special case of FROM+WHERE
# - In most cases, we are not interested in the cross-product
#   - We actually want tuples that match primary/foreign keys
# - This operation is so common that it has a special name to distinguish it from the general case
#   - Other than the name, the two are completely equivalent
#   - Join makes your intentions clearer
# - The previous query becomes:
#
# ```sql
# SELECT *
# FROM
#     Students AS s
#     JOIN Attends AS a
#         ON s.ID = a.Student
#     JOIN Courses AS c
#         ON c.ID = a.Course;
# ```

# + [markdown] slideshow={"slide_type": "slide"}
# ## Non-matching rows in JOINs
# - Options to handle non-matches:
#   - Inner join: only keep matches
#   - Left join: keep matches and un-matched records from the _left_ table
#   - Right join: keep matches and un-matched records from the _right_ table
#   - Outer join: keep matches and un-matched records from _both_ tables
# - Other possibilities:
#   - Natural join (`ON` is missing): match all columns with the same name
#   - Self-join: a table with itself (e.g. to find a student's mentor)

# + [markdown] slideshow={"slide_type": "slide"}
# ### INNER JOIN
#
# ```sql
# FROM Students [INNER] JOIN Attends
#     ON Students.ID = Attends.Student
# ```
#
# ![](../img/sql_join_inner.svg)

# + [markdown] slideshow={"slide_type": "slide"}
# ### LEFT JOIN
#
# ```sql
# FROM Students LEFT JOIN Attends
#     ON Students.ID = Attends.Student
# ```
#
# ![](../img/sql_join_left.svg)

# + [markdown] slideshow={"slide_type": "slide"}
# ### RIGHT JOIN
#
# ```sql
# FROM Students RIGHT JOIN Attends
#     ON Students.ID = Attends.Student
# ```
#
# ![](../img/sql_join_right.svg)

# + [markdown] slideshow={"slide_type": "slide"}
# ### OUTER JOIN
#
# ```sql
# FROM Students OUTER JOIN Attends
#     ON Students.ID = Attends.Student
# ```
#
# ![](../img/sql_join_outer.svg)
#
# Warning: un-matched rows from both tables are kept, padded with NULLs!

# + [markdown] slideshow={"slide_type": "slide"}
# ### Retrieving un-matched rows only
#
# - Example: find all students who have not attended any course (see the check below)
#
# ```sql
# SELECT Students.ID
# FROM Students LEFT JOIN Attends
#     ON Students.ID = Attends.Student
# WHERE
#     Attends.Student IS NULL
# ```
#
# ![](../img/sql_join_unmatched_only.svg)
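
# + [markdown] slideshow={"slide_type": "slide"}
# The un-matched-rows trick, checked in `sqlite3` with made-up rows (student 3
# never attended anything):

# + Collapsed="false"
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE Students (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Attends (Student INTEGER, Course INTEGER, Grade REAL);
    INSERT INTO Students VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cleo');
    INSERT INTO Attends VALUES (1, 10, 1.3), (2, 10, 2.0);
''')

query = '''
    SELECT Students.ID
    FROM Students LEFT JOIN Attends
        ON Students.ID = Attends.Student
    WHERE Attends.Student IS NULL
'''
print(conn.execute(query).fetchall())  # [(3,)]
# -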

# + [markdown] slideshow={"slide_type": "slide"}
# ## GROUP BY: create groups of rows
# - must specify one or more columns, possibly with transformation
# - all rows that have the same values for all (transformed) column(s) end up in the same group

# + [markdown] slideshow={"slide_type": "slide"}
# ## HAVING: filter groups
# - A boolean condition applied to each group
# - Example: filter by group size, or by the min/max/average of something
# - Common case: counting
#   - `COUNT(*)`: number of rows in the group
#   - `COUNT(expr)`: number of rows where `expr` is not NULL
#   - `COUNT(DISTINCT expr)`: number of unique values of `expr` (excluding NULLs)

# + [markdown] slideshow={"slide_type": "slide"}
# ## ORDER BY: order tuples
#
# - Sort the tuples produced by the query
# - Sort by the value of one or more columns, possibly transformed
# - Possible to order by aggregations (count/min/max/sum/avg)

# + [markdown] slideshow={"slide_type": "slide"}
# ## SELECT: produce output columns
# - All the surviving groups/rows are transformed
# - Select only a subset of attributes, or transform values
# - Careful: each group must be collapsed into a row

# + [markdown] slideshow={"slide_type": "slide"}
# # Examples

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 1
#
# Find the ID of all students who failed at least one exam.
#
# ```sql
# SELECT ...
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 1
#
# Find the ID of all students who failed at least one exam.
#
# ```sql
# SELECT Student
# FROM Attends
# WHERE Grade > 5
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 2
#
# Find how many exams each student failed.
#
# ```sql
# SELECT ...
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 2
#
# Find how many exams each student failed.
#
# ```sql
# SELECT Student, COUNT(*)
# FROM Attends
# WHERE Grade > 5
# GROUP BY Student
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)
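
# + [markdown] slideshow={"slide_type": "slide"}
# Example 2 end-to-end in `sqlite3`, with made-up grades (following the
# convention of these slides that a grade above 5 counts as failed):

# + Collapsed="false"
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Attends (Student INTEGER, Course INTEGER, Grade REAL)')
conn.executemany('INSERT INTO Attends VALUES (?, ?, ?)',
                 [(1, 10, 2.0), (1, 11, 6.0), (2, 10, 6.0), (2, 11, 6.0)])

query = '''
    SELECT Student, COUNT(*)
    FROM Attends
    WHERE Grade > 5
    GROUP BY Student
'''
print(conn.execute(query).fetchall())  # [(1, 1), (2, 2)]
# -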

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 3
#
# Find how many exams each student failed, only for the students who failed at least 2.
#
# ```sql
# SELECT ...
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 3
#
# Find how many exams each student failed, only for the students who failed at least 2.
#
# ```sql
# SELECT Student, COUNT(*)
# FROM Attends
# WHERE Grade > 5
# GROUP BY Student
# HAVING COUNT(*) > 1
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 4
#
# Find how many courses each student failed, only for the students who failed at least 2 exams.
#
# ```sql
# SELECT ...
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Example 4
#
# Find how many courses each student failed, only for the students who failed at least 2 exams.
#
# ```sql
# SELECT Student, COUNT(DISTINCT Course)
# FROM Attends
# WHERE Grade > 5
# GROUP BY Student
# HAVING COUNT(*) > 1
# ```
#
# Tables:
# - Students(ID, Name, Degree, Mentor)
# - Professors(ID, Name, Chair)
# - Courses(ID, Title, Faculty, Semester, Professor)
# - Attends(Student, Course, Grade)

# + [markdown] slideshow={"slide_type": "slide"}
# # Transactions and ACID properties
#
# - When the data is read and modified by several clients at the same time, care must be taken
#   - Read/modify/write workflows are especially vulnerable
# - Transaction: a set of queries (reads and/or writes)
# - Atomicity: the sequence of operations appears as a single operation on the data
#   - Either all operations succeed, or all modifications are undone
# - Consistency: database invariants are always satisfied regardless of the outcome
#   - Invariants: uniqueness, non-empty values, primary/foreign keys, etc.
# - Isolation: different transactions cannot "see" each other
#   - The order of concurrent transactions does not matter
# - Durability: once completed, the modifications are permanent
#   - Useful in case of crashes
# - All of this is handled automatically by the DBMS
#   - Users only need to declare the start/end and outcome of the transaction (see the sketch below)
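
# + [markdown] slideshow={"slide_type": "slide"}
# From Python's `sqlite3`, declaring the outcome boils down to `commit()` or
# `rollback()`. A sketch with a made-up accounts table:

# + Collapsed="false"
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)')
conn.executemany('INSERT INTO accounts VALUES (?, ?)', [(1, 100.0), (2, 0.0)])
conn.commit()

try:
    # transfer 30 from account 1 to account 2, atomically
    conn.execute('UPDATE accounts SET balance = balance - 30 WHERE id = 1')
    conn.execute('UPDATE accounts SET balance = balance + 30 WHERE id = 2')
    conn.commit()       # success: make both updates permanent
except Exception:
    conn.rollback()     # failure: undo every step of the transaction

print(conn.execute('SELECT * FROM accounts').fetchall())  # [(1, 70.0), (2, 30.0)]
# -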

# + [markdown] slideshow={"slide_type": "slide"}
# # Interfacing to an RDBMS
#
# Three types of clients:
#
# 1. Command line clients
# 2. Graphical clients
# 3. Programmatic access

# + [markdown] slideshow={"slide_type": "slide"}
# ## Command line clients
#
# Enter SQL queries and administrative commands directly from the command line:
#
# ```
# $ sqlite3
# SQLite version 3.32.3 2020-06-18 14:00:33
# Enter ".help" for usage hints.
# Connected to a transient in-memory database.
# Use ".open FILENAME" to reopen on a persistent database.
# sqlite>
# ```
#
# ```
# $ psql -U user -h 10.0.6.12 -p 21334 -d database
# psql (11.1, server 11.0)
# Type "help" for help.
#
# postgres=#
# ```

# + [markdown] slideshow={"slide_type": "slide"}
# ## Graphical clients
#
# Database-specific:
# - pgAdmin (PostgreSQL)
# - SQLite Browser (SQLite)
# - MySQL Workbench (MySQL)
#
# General purpose:
# - [SQuirreL](http://squirrel-sql.sourceforge.net)
# - [SQLAdmin](http://sqladmin.sourceforge.net/)

# + [markdown] slideshow={"slide_type": "slide"}
# ### SQuirreL example: querying
# ![](http://squirrel-sql.sourceforge.net/screenshots/15_edit_result.png)

# + [markdown] slideshow={"slide_type": "slide"}
# ### SQuirreL example: visualizing tables
#
# ![](http://squirrel-sql.sourceforge.net/screenshots/7_graph.png)

# + [markdown] slideshow={"slide_type": "slide"}
# ## Programmatic Access
#
# Two types of APIs:
#
# 1. High-level: Object-relational mapping (ORM)
#     - Each table has a corresponding class in the code
#     - Operations on objects are automatically translated into queries
#     - These libraries can work with many SQL databases
# 2. Low-level: Directly write SQL queries as strings
#     - Usually tied to a specific type of SQL database

# + [markdown] slideshow={"slide_type": "slide"}
# ### SQLAlchemy: ORM in Python
#
# Example from [pythoncentral.io](https://www.pythoncentral.io/overview-sqlalchemys-expression-language-orm-queries/).
#
# Tables:
#
# ```python
# class Department(Base):
#     __tablename__ = 'department'
#     id = Column(Integer, primary_key=True)
#     name = Column(String)
#     employees = relationship('Employee', secondary='department_employee')
#
# class Employee(Base):
#     __tablename__ = 'employee'
#     id = Column(Integer, primary_key=True)
#     name = Column(String)
#     departments = relationship('Department', secondary='department_employee')
#
# class DepartmentEmployee(Base):
#     __tablename__ = 'department_employee'
#     department_id = Column(Integer, ForeignKey('department.id'), primary_key=True)
#     employee_id = Column(Integer, ForeignKey('employee.id'), primary_key=True)
# ```

# + [markdown] slideshow={"slide_type": "slide"}
# ### Inserting data in SQLAlchemy
#
# ```python
# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///')
#
# from sqlalchemy.orm import sessionmaker
# session = sessionmaker()
# session.configure(bind=engine)
# Base.metadata.create_all(engine)
#
# s = session()
# john = Employee(name='john')
# s.add(john)
# it_department = Department(name='IT')
# it_department.employees.append(john)
# s.add(it_department)
# s.commit()
# ```

# + [markdown] slideshow={"slide_type": "slide"}
# ### Querying in SQLAlchemy
#
# Find all employees who belong to more than one department:
#
# ```python
# find_marry = select([
#     Employee.id
# ]).select_from(
#     Employee.__table__.join(DepartmentEmployee)
# ).group_by(
#     Employee.id
# ).having(func.count(
#     DepartmentEmployee.department_id
# ) > 1)
#
# rs = s.execute(find_marry)
# rs.fetchall()  # result: [(2,)]
# ```
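
# + [markdown] slideshow={"slide_type": "slide"}
# ### Querying in SQLAlchemy (ORM style)
#
# The same query can also be phrased against the session directly. A sketch
# reusing `s`, `Employee` and `DepartmentEmployee` from the slides above
# (SQLAlchemy 1.x style; not the only way to write it):
#
# ```python
# from sqlalchemy import func
#
# result = (
#     s.query(Employee.id)
#      .join(DepartmentEmployee,
#            Employee.id == DepartmentEmployee.employee_id)
#      .group_by(Employee.id)
#      .having(func.count(DepartmentEmployee.department_id) > 1)
#      .all()
# )  # result: [(2,)], as before
# ```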

# + [markdown] slideshow={"slide_type": "slide"}
# ### Accessing SQLite in Python
#
# Example from [docs.python.org](https://docs.python.org/3/library/sqlite3.html)
#
# ```python
# import sqlite3
# conn = sqlite3.connect('example.db')
#
# c = conn.cursor()
#
# c.execute('''CREATE TABLE stocks
#              (date text, trans text, symbol text, qty real, price real)''')
#
# c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
#
# conn.commit()
#
# conn.close()
# ```

# + [markdown] slideshow={"slide_type": "slide"}
# ### Querying SQLite in Python
#
# ```python
# purchases = [('2006-03-28', 'BUY', 'IBM', 1000, 45.00),
#              ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00),
#              ('2006-04-06', 'SELL', 'IBM', 500, 53.00)]
#
# c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)
#
# for row in c.execute('SELECT * FROM stocks WHERE price < 50'):
#     print(row)
# ```

# + [markdown] slideshow={"slide_type": "slide"}
# ## Indices
# - create an index with `CREATE INDEX <name> ON <table> (<columns>)`
# - a table can have many indices
# - an index is always created automatically for primary keys
#   - all other unique keys must also have an index
#   - indices on foreign keys _might_ be useful
# - WHERE/JOIN are much faster when there is an index on one of the columns
# - if a query is slow and/or executed very frequently, consider adding an index on columns used in the WHERE/JOIN

# + [markdown] slideshow={"slide_type": "slide"}
# ## Main types of index
# - Tree-based: O(log N) access, can be used to quickly answer queries like `WHERE L < column < U`
#   - Branching factor in the order of 1000s
# - Hash-based: O(1) access, cannot answer range queries
# - Clustered index: table is physically sorted by the columns

# + [markdown] slideshow={"slide_type": "slide"}
# ## Query plans
# - understanding why a query is slow is not trivial
# - the query plan is produced by the optimizer and shows exactly what is done, and how, to execute the query
# - it contains an estimated cost and can be augmented with the actual cost measured when executing the query
# - estimated cost:
#   - computed from statistics about rows/values that the DBMS maintains internally
#   - these statistics can become inaccurate after lots of operations
#   - useful to periodically recompute these statistics
#   - also useful to periodically clear the space allocated to deleted rows and defragment table data
# - (show example of plans before/after adding an index)

# + [markdown] slideshow={"slide_type": "slide"}
# ### Example
#
# ![](../img/qplan.png)
#
# Image from [dba.stackexchange.com](https://dba.stackexchange.com/q/9234)

# + [markdown] Collapsed="false" slideshow={"slide_type": "slide"}
# # Non-SQL

# + Collapsed="false"


# + [markdown] Collapsed="false"
# # Graph Databases

# + [markdown] Collapsed="false"
# ## Graph Theory

# + Collapsed="false"


# + [markdown] Collapsed="false"
# ## Neo4j

# + Collapsed="false"

--------------------------------------------------------------------------------