├── Heinrich_tutorial
│   ├── Lecture2.ipynb
│   └── SUSY_small.csv
├── LICENSE
├── README.md
├── cf_notebooks
│   ├── 1.PartI-DifferentiableForwardModel.ipynb
│   ├── 2.PartII-GenerativeModels.ipynb
│   ├── 3.PartIII-VariationalInference.ipynb
│   ├── 4.MappingDarkMatterDataChallenge.ipynb
│   └── HSCDataPreparation.ipynb
├── if_projects
│   ├── .gitignore
│   ├── IF-Graph-Clustering.ipynb
│   ├── IF-Image-Classifier.ipynb
│   ├── IF-Image-Segmentation.ipynb
│   └── README.md
├── jet_notebooks
│   ├── 1.LHCJetDatasetExploration.ipynb
│   ├── 2.JetTaggingMLP.ipynb
│   ├── 3.JetTaggingConv2D.ipynb
│   ├── 4.JetTaggingConv1D.ipynb
│   ├── 5.JetTaggingRNN.ipynb
│   ├── 6.JetTaggingGCN.ipynb
│   ├── 7.JetTaggingTransformer.ipynb
│   ├── 8.JetAnomalyDetectionAE.ipynb
│   ├── 9.JetAnomalyDetectionVAE.ipynb
│   ├── ae.png
│   ├── conv1d.png
│   ├── conv2d.gif
│   ├── particle-net-arch.png
│   ├── rnn1.png
│   └── vae.png
├── python_advanced
│   ├── IMDB-Movie-Data.csv
│   ├── data.csv
│   ├── data.json
│   ├── example.mplstyle
│   ├── matplotlib_intro.ipynb
│   ├── numpy_intro.ipynb
│   ├── pandas_intro.ipynb
│   └── stockholm_td_adj.dat
├── python_basics
│   ├── demofile.txt
│   ├── helloworld.py
│   ├── mymodule.py
│   ├── python_intro_part1.ipynb
│   └── python_intro_part2.ipynb
├── pytorch_basics
│   ├── pytorch_NeuralNetworks.ipynb
│   └── pytorch_intro.ipynb
├── pytorch_geometric_intro
│   ├── 1.IntroToPyG.ipynb
│   ├── 2.KCNodeClassificationPyG.ipynb
│   └── 3.TUGraphClassification.ipynb
└── slides
    ├── GettingStarted.pdf
    └── LHCJetTaggingIntro.pdf
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Michael Kagan
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Across The Frontiers - SSI 2023 Projects
2 |
3 | Each HEP frontier presents its own Big Data challenges, inviting the use of AI/ML to tackle them.
4 | Here we choose three specific challenges, one from each of the Energy, Intensity, and Cosmic Frontiers, that can be tackled during the school by small project teams.
5 |
6 | Each has a dataset associated with it, which can be either downloaded to your local (or remote) computing resource, or imported to Google colab.
7 | Your team might then pick up one of the approaches described in the lectures and try to apply it.
8 | We provide a number of tutorial notebooks below that introduce the datasets and provide some possible starting points for you.
9 |
10 | On the last Thursday of the school, we will hear very short presentations from each project team in a common slide deck, and award various small prizes.
11 |
12 | For maximum community value, project teams should plan to submit their project notebook back to this repo via a pull request, so everyone can benefit from their hard work. Fork this repo and get to work!
13 |
14 | Have a look at the [`Getting Started`](https://github.com/makagan/SSI_Projects/blob/main/slides/GettingStarted.pdf) slides to get started with GitHub and Google Colab.
15 |
16 | ## The Challenges
17 |
18 | **Energy Frontier:** here, the challenge is to develop ML models for LHC jets.
19 | These could be for classification, or generative modeling.
20 | We provide a dataset of various boosted jets to explore, including high-level jet features, jet images, and per-particle features.
21 | Many thanks to SSI lecturer Jennifer Ngadiuba, from whose recent [`course`](https://github.com/jngadiub/ML_course_Pavia_23/blob/main/) the materials for this challenge are drawn!
22 |
23 | **Cosmic Frontier:** here, the challenge is to develop methods for mapping Dark Matter in the Universe from weak lensing data, after exploring some related inverse problems using LSST-like imaging data. We provide suitable weak lensing datasets.
24 | Many thanks to SSI Lecturer François Lanusse for the materials for this challenge, which are based on the materials used at the [Quarks2Cosmos conference](https://github.com/EiffL/Quarks2CosmosDataChallenge/tree/colab)!
25 |
26 | **Intensity Frontier:** here, the challenge is to analyze particle images and graphs: classifying the particle type in an image, segmenting images at the pixel level, or clustering graph nodes and edges (see the notebooks in the `if_projects` folder).
27 | Many thanks to SSI Organizer Kazu Terao for the materials for this challenge!
28 |
29 | ## SSI2023 Project Prerequisites
30 |
31 | Prerequisites for the course include basic knowledge of GitHub, Colab, and Python. Before the school, please go through [these](https://github.com/makagan/SSI_Projects/blob/main/slides/GettingStarted.pdf) slides as well as the following two Python basics notebooks:
32 |
33 | * [`python_intro_part1.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/python_basics/python_intro_part1.ipynb)
34 | * Quickstart
35 | * Indentation
36 | * Comments
37 | * Variables
38 | * Conditions and `if` statements
39 | * Arrays
40 | * Strings
41 | * Loops: `while` and `for`
42 | * Dictionaries
43 | * [`python_intro_part2.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/python_basics/python_intro_part2.ipynb)
44 | * Functions
45 | * Classes/Objects
46 | * Inheritance
47 | * Modules
48 | * JSON data format
49 | * Exception Handling
50 | * File Handling
51 |
52 | ## Tutorials
53 |
54 | We've organized a variety of tutorial notebooks below, grouped by Frontier (after some more general tutorials you may find helpful).
55 | Note that your project might well benefit from techniques you pick up by looking for tutorials _across the Frontiers..._
56 |
57 | ### General: Advanced Python
58 |
59 | * Intro to Numpy: [`numpy_intro.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/python_advanced/numpy_intro.ipynb)
60 | * Intro to Pandas: [`pandas_intro.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/python_advanced/pandas_intro.ipynb)
61 | * Intro to Matplotlib: [`matplotlib_intro.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/python_advanced/matplotlib_intro.ipynb)
62 |
63 | ### General: Introduction to PyTorch
64 |
65 | * Intro to PyTorch: [`pytorch_intro.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/pytorch_basics/pytorch_intro.ipynb) and [`pytorch_NeuralNetworks.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/pytorch_basics/pytorch_NeuralNetworks.ipynb)
66 |
67 |
68 | ### General: PyTorch Geometric (PyG)
69 | * Intro to PyTorch Geometric: [`1.IntroToPyG.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/pytorch_geometric_intro/1.IntroToPyG.ipynb)
70 | * Node classification with PyG on Cora citation dataset: [`2.KCNodeClassificationPyG.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/pytorch_geometric_intro/2.KCNodeClassificationPyG.ipynb)
71 | * Graph classification with PyG on molecular prediction dataset: [`3.TUGraphClassification.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/pytorch_geometric_intro/3.TUGraphClassification.ipynb)
72 |
73 | ### Energy Frontier: Basic NN with Keras for LHC jet tagging task
74 |
75 | * Introduction to dataset and tasks [slides: [LHCJetTaggingIntro.pdf](https://github.com/makagan/SSI_Projects/blob/main/slides/LHCJetTaggingIntro.pdf)]
76 | * Dataset exploration: [`1.LHCJetDatasetExploration.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/1.LHCJetDatasetExploration.ipynb)
77 | * MLP implementation with Keras: [`2.JetTaggingMLP.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/2.JetTaggingMLP.ipynb)
78 | * Conv2D implementation with Keras: [`3.JetTaggingConv2D.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/3.JetTaggingConv2D.ipynb)
79 | * Conv1D implementation with Keras: [`4.JetTaggingConv1D.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/4.JetTaggingConv1D.ipynb)
80 |
81 |
82 | ### Energy Frontier: RNN, GNN and Transformer implementations for LHC jet tagging task
83 |
84 | * GRU for LHC jet tagging task: [`5.JetTaggingRNN.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/5.JetTaggingRNN.ipynb)
85 | * Graph classification with PyG on LHC jet dataset: [`6.JetTaggingGCN.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/6.JetTaggingGCN.ipynb)
86 | * Transformer model for LHC jet tagging with tensorflow: [`7.JetTaggingTransformer.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/7.JetTaggingTransformer.ipynb)
87 |
88 | ### Energy Frontier: Anomaly Detection for LHC jets
89 | * Anomaly detection for LHC jets with AE [`8.JetAnomalyDetectionAE.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/8.JetAnomalyDetectionAE.ipynb)
90 | * Anomaly detection for LHC jets with VAE [`9.JetAnomalyDetectionVAE.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/jet_notebooks/9.JetAnomalyDetectionVAE.ipynb)
91 |
92 | ### Cosmic Frontier: Differentiable Forward Models, Generative Models, And Variational Inference
93 | * [`1.PartI-DifferentiableForwardModel.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/cf_notebooks/1.PartI-DifferentiableForwardModel.ipynb)
94 | - How to write a probabilistic forward model for galaxy images with Jax + TensorFlow Probability
95 | - How to optimize parameters of a Jax model
96 | - Write a forward model of ground-based galaxy images
97 | * [`2.PartII-GenerativeModels.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/cf_notebooks/2.PartII-GenerativeModels.ipynb)
98 | - Write an Auto-Encoder in Jax+Haiku
99 | - Build a Normalizing Flow in Jax+Haiku+TensorFlow Probability
100 | - Bonus: Learn a prior by Denoising Score Matching
101 | - Build a generative model of galaxy morphology from Space-Based images
102 | * [`3.PartIII-VariationalInference.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/cf_notebooks/3.PartIII-VariationalInference.ipynb)
103 | - Solve inverse problem by MAP
104 | - Learn how to sample from the posterior using Variational Inference
105 | - Bonus: Learn to sample with SDE
106 | - Recover high-resolution posterior images for HSC galaxies
107 | - Propose an inpainting model for masked regions in HSC galaxies
108 | - Bonus: Demonstrate single band deblending!
109 |
110 | ### Cosmic Frontier: Dark Matter Mass-Mapping using Real HSC Weak Gravitational Lensing Data
111 | * Open challenge [`4.MappingDarkMatterDataChallenge.ipynb`](https://github.com/makagan/SSI_Projects/blob/main/cf_notebooks/4.MappingDarkMatterDataChallenge.ipynb)
112 | - Use Jax to write a differentiable model for weak gravitational lensing
113 | - Use an analytic Gaussian prior to solve the inverse problem (Wiener Filtering)
114 | - Use Denoising Score Matching to learn the score of a prior distribution
115 | - Use Stochastic Differential Equations for sampling from the posterior
116 |
117 | ## Other Resources
118 |
119 | * Pattern Recognition and Machine Learning, Bishop (2006)
120 | * Deep Learning, Goodfellow et al. (2016) -- [`link`](https://www.deeplearningbook.org/)
121 | * Introduction to machine learning, Murray (2010) -- [`video lectures`](http://videolectures.net/bootcamp2010_murray_iml/)
122 | * Stanford ML courses -- [`link`](https://ai.stanford.edu/stanford-ai-courses/)
123 |
--------------------------------------------------------------------------------
/cf_notebooks/3.PartIII-VariationalInference.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "view-in-github",
7 | "colab_type": "text"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "id": "hq9pSZfBrT1x"
17 | },
18 | "source": [
19 | "# Guided Data Challenge Part III: Variational Posterior Inference\n",
20 | "\n",
21 | "Author:\n",
22 |         "  - [@EiffL](https://github.com/EiffL) (François Lanusse)\n",
23 | "\n",
24 | "## Overview\n",
25 | "\n",
26 |         "In this last notebook, we will use everything we have seen so far and try to perform posterior inference using Variational Inference.\n",
27 | "\n",
28 | "\n",
29 | "### Learning objectives:\n",
30 | "\n",
31 | "In this notebook we will put into practice:\n",
32 | " - Perform MAP inference\n",
33 | " - Variational inference"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "source": [
39 | "## Installing dependencies and accessing data"
40 | ],
41 | "metadata": {
42 | "id": "0ktLeF7bjnbw"
43 | }
44 | },
45 | {
46 | "cell_type": "code",
47 | "source": [
48 | "!pip install git+https://github.com/EiffL/Quarks2CosmosDataChallenge.git\n",
49 | "!echo \"deb http://packages.cloud.google.com/apt gcsfuse-bionic main\" > /etc/apt/sources.list.d/gcsfuse.list\n",
50 | "!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -\n",
51 | "!apt -qq update\n",
52 | "!apt -qq install gcsfuse\n",
53 | "!mkdir galsim\n",
54 | "\n",
55 | "import logging\n",
56 | "logger = logging.getLogger()\n",
57 | "class CheckTypesFilter(logging.Filter):\n",
58 | " def filter(self, record):\n",
59 | " return \"check_types\" not in record.getMessage()\n",
60 | "logger.addFilter(CheckTypesFilter())"
61 | ],
62 | "metadata": {
63 | "id": "p_56Uqv6h0QH"
64 | },
65 | "execution_count": null,
66 | "outputs": []
67 | },
68 | {
69 | "cell_type": "code",
70 | "source": [
71 | "# Authenticating and mounting cloud data storage\n",
72 | "from google.colab import auth\n",
73 | "auth.authenticate_user()\n",
74 | "!gcsfuse --implicit-dirs galsim galsim"
75 | ],
76 | "metadata": {
77 | "id": "rypc1fA8iK9p"
78 | },
79 | "execution_count": null,
80 | "outputs": []
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {
86 | "id": "CrCEhAmmrT1y"
87 | },
88 | "outputs": [],
89 | "source": [
90 | "%pylab inline\n",
91 | "import jax\n",
92 | "import jax.numpy as jnp"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {
98 | "id": "jpbKCpkkrT1z"
99 | },
100 | "source": [
101 | "## Step I: Load your generative model\n",
102 | "\n",
103 | "\n",
104 |         "Here I'm going to load an existing pretrained model; you should feel free to replace it with a model you might have trained yourself :-)"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {
111 | "id": "joDnjqHmrT10"
112 | },
113 | "outputs": [],
114 | "source": [
115 | "# Let's start with the imports\n",
116 | "import haiku as hk # NN library\n",
117 | "import optax # Optimizer library\n",
118 | "import pickle\n",
119 | "\n",
120 |         "# Utility function for tensorboard\n",
121 | "from flax.metrics import tensorboard\n",
122 | "\n",
123 | "# TensorFlow probability\n",
124 | "from tensorflow_probability.substrates import jax as tfp\n",
125 | "tfd = tfp.distributions\n",
126 | "tfb = tfp.bijectors\n",
127 | "\n",
128 | "# Specific models built by EiffL\n",
129 | "from quarks2cosmos.models.vae import Decoder\n",
130 | "from quarks2cosmos.models.flow import AffineFlow"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {
137 | "id": "cKBNDxIPrT10"
138 | },
139 | "outputs": [],
140 | "source": [
141 | "# Create a random sequence\n",
142 | "rng_seq = hk.PRNGSequence(42)"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": null,
148 | "metadata": {
149 | "id": "IUuGfYTzrT10"
150 | },
151 | "outputs": [],
152 | "source": [
153 | "# Restore model parameters\n",
154 | "import pickle\n",
155 | "with open('galsim/model-50000.pckl', 'rb') as file:\n",
156 | " params, state, _ = pickle.load(file)\n",
157 | "with open('galsim/model-20000.pckl', 'rb') as file:\n",
158 | " params_flow, _ = pickle.load(file)\n",
159 | "\n",
160 | "params = hk.data_structures.merge(params, params_flow)"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {
166 | "id": "HHxgnAVsrT10"
167 | },
168 | "source": [
169 | "#### Create a forward model combining latent flow with VAE"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {
176 | "id": "Uaz_xUR4rT11"
177 | },
178 | "outputs": [],
179 | "source": [
180 | "def generative_model_fn(z):\n",
181 | " # Transform from Gaussian space to VAE latent space\n",
182 | " z1 = AffineFlow()().bijector.forward(z)\n",
183 | "\n",
184 | " # Decode sample with decoder\n",
185 | " likelihood = Decoder()(z1, is_training=False)\n",
186 | "\n",
187 | " return likelihood.mean()\n",
188 | "\n",
189 | "generative_model = hk.without_apply_rng(hk.transform_with_state(generative_model_fn))"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": null,
195 | "metadata": {
196 | "id": "5OdkYNj1rT11"
197 | },
198 | "outputs": [],
199 | "source": [
200 | "# To sample from the model, we draw from a Gaussian...\n",
201 | "z = tfd.MultivariateNormalDiag(jnp.zeros(32)).sample(16, seed=next(rng_seq))\n",
202 | "# And we run it through the forward model\n",
203 | "x, _ = generative_model.apply(params, state, z)"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": null,
209 | "metadata": {
210 | "id": "5hcbS8TIrT11"
211 | },
212 | "outputs": [],
213 | "source": [
214 | "figure(figsize=(10,10))\n",
215 | "for i in range(4):\n",
216 | " for j in range(4):\n",
217 | " subplot(4,4,i+4*j+1)\n",
218 | " imshow(x[i+4*j],cmap='gray')\n",
219 | " axis('off')"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {
225 | "id": "gi-wiGMzrT12"
226 | },
227 | "source": [
228 | "Not too bad :-)"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {
234 | "id": "fgfGjVkDrT12"
235 | },
236 | "source": [
237 | "## Step II: Back to our inverse problems"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {
244 | "id": "E9hDrdLwrT12"
245 | },
246 | "outputs": [],
247 | "source": [
248 | "import quarks2cosmos.datasets\n",
249 | "import tensorflow_datasets as tfds\n",
250 | "from quarks2cosmos import galjax as gj"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {
257 | "id": "zVDI_aJPrT12"
258 | },
259 | "outputs": [],
260 | "source": [
261 | "dset_cosmos = tfds.load(\"Cosmos/23.5\", split=tfds.Split.TRAIN,\n",
262 | " data_dir='galsim/tensorflow_datasets') # Load the TRAIN split\n",
263 | "dset_cosmos = dset_cosmos.as_numpy_iterator() # Convert the dataset to numpy iterator\n",
264 | "\n",
265 | "dset_hsc = tfds.load(\"HSC\", split=tfds.Split.TRAIN,\n",
266 | " data_dir='galsim/tensorflow_datasets')\n",
267 | "dset_hsc = dset_hsc.as_numpy_iterator()"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "metadata": {
274 | "id": "-u-J_rjQrT12"
275 | },
276 | "outputs": [],
277 | "source": [
278 | "# Extract a new example from the dataset\n",
279 | "cosmos = next(dset_cosmos)\n",
280 | "\n",
281 | "figure(figsize=[10,5])\n",
282 | "subplot(121)\n",
283 | "imshow(cosmos['image'],cmap='gray')\n",
284 | "title('Galaxy')\n",
285 | "subplot(122)\n",
286 | "imshow(cosmos['psf'],cmap='gray')\n",
287 | "title('PSF');"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {
294 | "id": "cxxPvLzsrT12"
295 | },
296 | "outputs": [],
297 | "source": [
298 | "# Extract a new example from the dataset\n",
299 | "hsc = next(dset_hsc)\n",
300 | "\n",
301 | "figure(figsize=[20,5])\n",
302 | "subplot(141)\n",
303 | "imshow(hsc['image'],cmap='gray')\n",
304 | "title('Galaxy')\n",
305 | "subplot(142)\n",
306 | "imshow(hsc['psf'],cmap='gray')\n",
307 | "title('PSF')\n",
308 | "subplot(143)\n",
309 | "imshow(hsc['mask'] == 44,cmap='gray')\n",
310 | "title('Interpolated pixels')\n",
311 | "subplot(144)\n",
312 | "imshow(hsc['variance'],cmap='gray')\n",
313 | "title('Variance plane');"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {
320 | "id": "4t3A0b0rrT12"
321 | },
322 | "outputs": [],
323 | "source": [
324 | "def simulate_hsc(x, in_psf, out_psf):\n",
325 | " \"\"\" This function will simulate an image at HSC resolution given an image at HST resolution,\n",
326 | " accounting for input PSF and convolving by output PSF\n",
327 | " Args:\n",
328 | " x: HST resolution image (MUST BE ODD SIZE!!!!)\n",
329 | " in_psf: HST PSF\n",
330 | " out_psf: HSC PSF\n",
331 | " Returns:\n",
332 | " y: HSC simulated image of size [41,41]\n",
333 | " \"\"\"\n",
334 | " y = gj.deconvolve(x, in_psf) # Deconvolve by input PSF\n",
335 | " y = gj.kresample(y, 0.03, 0.168, 41) # Resample image to HSC grid\n",
336 | " y = gj.convolve(y, out_psf) # Reconvolve by HSC PSF\n",
337 | " return 2.587*y # Conversion factor for the flux"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {
344 | "id": "8tl8G06OrT12"
345 | },
346 | "outputs": [],
347 | "source": [
348 | "likelihood = tfd.Independent(tfd.Normal(loc=simulate_hsc(cosmos['image'], cosmos['psf'], hsc['psf']),\n",
349 | " scale=jnp.sqrt(hsc['variance'])),\n",
350 | " reinterpreted_batch_ndims=2) # This is to make sure TFP understand we have a 2d image"
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": null,
356 | "metadata": {
357 | "id": "jn7wUzQfrT12"
358 | },
359 | "outputs": [],
360 | "source": [
361 | "im_noise = likelihood.sample(seed=jax.random.PRNGKey(1))\n",
362 | "x_true = cosmos['image']\n",
363 | "cr_mask = 1.*(hsc['mask'] == 44)\n",
364 | "y_obs = im_noise * (1 - cr_mask)"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {
371 | "id": "o7Dv52l5rT13"
372 | },
373 | "outputs": [],
374 | "source": [
375 | "figure(figsize=[15,5])\n",
376 | "subplot(131)\n",
377 | "imshow(x_true)\n",
378 | "title('Hubble image to recover')\n",
379 | "subplot(132)\n",
380 | "imshow(y_obs)\n",
381 | "title('Observed image')\n",
382 | "subplot(133)\n",
383 | "imshow(cr_mask)\n",
384 | "title('Cosmic Ray mask');"
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {
390 | "id": "0CtujABIrT13"
391 | },
392 | "source": [
393 | "## Step III: MAP Inference\n",
394 | "\n",
395 |         "We now have all the tools for trying to perform Maximum A Posteriori (MAP) inference for our inverse problem, i.e.:\n",
396 | "\n",
397 | "$$\\hat{z} = \\arg \\max_{z} \\log p(y | z) + \\log p(z) $$\n",
398 | "\n",
399 | "In order to achieve this, you will need to put together the following elements:\n",
400 | "\n",
401 |         "- Combine the physical forward model with the generative model for an end-to-end forward model going from latent variable $z$ to HSC image.\n",
402 | "- Write a function that computes the log posterior for a given $z$\n",
403 |         "- Use the tools from day I to do the optimization and recover a solution\n",
404 | "\n",
405 | "Your turn :-)"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {
412 | "id": "kka4OISkrT13"
413 | },
414 | "outputs": [],
415 | "source": []
416 | },
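    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
      "# In case you get stuck: a minimal, hedged sketch of ONE possible MAP loop, not the\n",
      "# official solution. It assumes the end-to-end model is latent z -> generative_model\n",
      "# (flow + VAE decoder) -> simulate_hsc -> Gaussian likelihood, reusing params, state,\n",
      "# cosmos, hsc, y_obs and cr_mask defined above. Names like z_map are illustrative.\n",
      "prior = tfd.MultivariateNormalDiag(jnp.zeros(32))\n",
      "\n",
      "def log_posterior(z):\n",
      "  # Decode the latent vector into an HST-resolution image\n",
      "  x_dec, _ = generative_model.apply(params, state, z.reshape([1, -1]))\n",
      "  # Apply the physical forward model to get the predicted HSC image\n",
      "  y_model = simulate_hsc(x_dec[0], cosmos['psf'], hsc['psf'])\n",
      "  # Gaussian likelihood with the HSC variance plane, ignoring masked (cosmic-ray) pixels\n",
      "  lik = tfd.Independent(tfd.Normal(loc=y_model * (1 - cr_mask),\n",
      "                                   scale=jnp.sqrt(hsc['variance'])),\n",
      "                        reinterpreted_batch_ndims=2)\n",
      "  return lik.log_prob(y_obs) + prior.log_prob(z)\n",
      "\n",
      "loss_fn = jax.jit(jax.value_and_grad(lambda z: -log_posterior(z)))\n",
      "\n",
      "z_map = jnp.zeros(32)  # start from the prior mean\n",
      "optimizer = optax.adam(1e-2)\n",
      "opt_state = optimizer.init(z_map)\n",
      "for step in range(500):\n",
      "  loss, grads = loss_fn(z_map)\n",
      "  updates, opt_state = optimizer.update(grads, opt_state)\n",
      "  z_map = optax.apply_updates(z_map, updates)\n",
      "\n",
      "# MAP reconstruction at HST resolution\n",
      "x_map, _ = generative_model.apply(params, state, z_map.reshape([1, -1]))"
     ]
    },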
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {
420 | "id": "PpaMoWIlrT13"
421 | },
422 | "source": [
423 | "## Step IV: Variational Inference\n",
424 | "\n",
425 |         "In the previous section, we only recovered a single point estimate of the solution, but ideally we want access to the full posterior. In this section, we will try to use VI.\n",
426 | "\n",
427 | "\n",
428 |         "The idea of VI is to use a parametric model $q_\theta$ to approximate the posterior distribution $p(z | y)$. You need two things:\n",
429 |         "- a tractable and flexible parametric model $q_\theta$; we can use a Normalizing Flow, for instance ;-)\n",
430 |         "- a loss function measuring the distance between $p$ and $q_\theta$, which we minimize\n",
431 | "\n",
432 | "\n",
433 |         "The loss function typically used for VI is the Evidence Lower Bound (ELBO) (the same one as we used in the VAE ;-) ). The ELBO is the right-hand side of this expression:\n",
434 |         "\n",
435 |         "$$ \log p(y) \geq \mathbb{E}_{z \sim q_\theta}\left[ \log p(y | z) \right] - KL(q_\theta || p) $$\n",
436 |         "where $p$ in the KL divergence term is the latent space prior.\n",
437 |         "\n",
438 |         "In other words, maximizing the ELBO tries to maximize the likelihood of the data under the model, while keeping the variational distribution $q_\theta$ close to the prior.\n"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": null,
444 | "metadata": {
445 | "id": "eQf0UuDwrT13"
446 | },
447 | "outputs": [],
448 | "source": [
449 | "# We are going to need a normalizing flow to model the posterior then\n",
450 | "def sample_and_logp(N=1):\n",
451 | " flow = AffineFlow()()\n",
452 | " z = flow.sample(N, seed=hk.next_rng_key())\n",
453 | " log_p = flow.log_prob(z)\n",
454 | " return z, log_p"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": null,
460 | "metadata": {
461 | "id": "j3TDWHYorT13"
462 | },
463 | "outputs": [],
464 | "source": [
465 | "q_sample_logp = hk.transform(sample_and_logp)\n",
466 | "\n",
467 | "# We initialize the parameters for the variational distribution\n",
468 | "q_params = q_sample_logp.init(next(rng_seq), 1)\n",
469 | "\n",
470 | "# And here is our prior distribution\n",
471 | "p = tfd.MultivariateNormalDiag(jnp.zeros(32),\n",
472 | " scale_identity_multiplier=1.)"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": null,
478 | "metadata": {
479 | "id": "b35SCz_9rT13"
480 | },
481 | "outputs": [],
482 | "source": [
483 | "# Let's write a concrete ELBO\n",
484 | "def elbo(params, rng_key):\n",
485 | "\n",
486 |         "  # Sample from the variational distribution and evaluate its log-prob\n",
487 | " z, log_q = q_sample_logp.apply(params, rng_key, N=100)\n",
488 | "\n",
489 | " # KL term\n",
490 | " kl = log_q - p.log_prob(z)\n",
491 | "\n",
492 | " # You need to plug your forward model producing a likelihood object here\n",
493 |         "  likelihood = ...  # <-- your end-to-end forward model (a tfd distribution) goes here\n",
494 |         "\n",
495 |         "  log_likelihood = likelihood.log_prob(y_obs)\n",
496 | "\n",
497 | " # Form the ELBO\n",
498 | " elbo = jnp.mean(log_likelihood - kl)\n",
499 | "\n",
500 | " return -elbo"
501 | ]
502 | },
503 | {
504 | "cell_type": "markdown",
505 | "metadata": {
506 | "id": "LlaRjtK0rT13"
507 | },
508 | "source": [
509 | "The rest is now up to you :-) Use this ELBO to optimize the parameters of the posterior variational distribution $q_\\theta$. Once you have achieved a good solution, try to sample from that posterior."
510 | ]
511 | },
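    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
      "# A minimal, hedged sketch of ONE possible VI training loop, not the official solution.\n",
      "# It assumes you have completed the likelihood term inside elbo() above, and it reuses\n",
      "# q_params, q_sample_logp, rng_seq, optax, generative_model, params and state from earlier\n",
      "# cells. The number of steps and the learning rate are illustrative choices.\n",
      "elbo_and_grad = jax.jit(jax.value_and_grad(elbo))\n",
      "\n",
      "optimizer = optax.adam(1e-3)\n",
      "opt_state = optimizer.init(q_params)\n",
      "\n",
      "for step in range(2000):\n",
      "  loss, grads = elbo_and_grad(q_params, next(rng_seq))\n",
      "  updates, opt_state = optimizer.update(grads, opt_state)\n",
      "  q_params = optax.apply_updates(q_params, updates)\n",
      "  if step % 200 == 0:\n",
      "    print(step, loss)\n",
      "\n",
      "# Once trained, draw latent samples from q and push them through the generative model\n",
      "z_post, _ = q_sample_logp.apply(q_params, next(rng_seq), N=16)\n",
      "x_post, _ = generative_model.apply(params, state, z_post)"
     ]
    },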
512 | {
513 | "cell_type": "code",
514 | "execution_count": null,
515 | "metadata": {
516 | "id": "SSQOeYfLrT13"
517 | },
518 | "outputs": [],
519 | "source": []
520 | }
521 | ],
522 | "metadata": {
523 | "kernelspec": {
524 | "display_name": "Python 3 - AI",
525 | "language": "python",
526 | "name": "python3-ai"
527 | },
528 | "language_info": {
529 | "codemirror_mode": {
530 | "name": "ipython",
531 | "version": 3
532 | },
533 | "file_extension": ".py",
534 | "mimetype": "text/x-python",
535 | "name": "python",
536 | "nbconvert_exporter": "python",
537 | "pygments_lexer": "ipython3",
538 | "version": "3.7.9"
539 | },
540 | "colab": {
541 | "provenance": [],
542 | "include_colab_link": true
543 | }
544 | },
545 | "nbformat": 4,
546 | "nbformat_minor": 0
547 | }
--------------------------------------------------------------------------------
/cf_notebooks/HSCDataPreparation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Prepares dataset of HSC galaxies, PSFs and masks"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import h5py\n",
17 | "from astropy.table import Table\n",
18 | "import astropy.units as u\n",
19 | "from unagi import hsc\n",
20 | "from unagi import task"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {},
27 | "outputs": [
28 | {
29 | "name": "stderr",
30 | "output_type": "stream",
31 | "text": [
32 | "/local/home/flanusse/.local/lib/python3.8/site-packages/numpy/ma/core.py:2831: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
33 | " _data = np.array(data, dtype=dtype, copy=copy,\n"
34 | ]
35 | },
36 | {
37 | "name": "stdout",
38 | "output_type": "stream",
39 | "text": [
40 | "# Get table list from /local/home/flanusse/repo/unagi/unagi/data/pdr2_wide/pdr2_wide_tables.fits\n"
41 | ]
42 | }
43 | ],
44 | "source": [
45 | "# Define the HSC archive\n",
46 | "archive = hsc.Hsc(dr='pdr2', rerun='pdr2_wide')"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 3,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 |         "query_mask = '''\n",
56 | "-- Merge forced photometry and spectroscopic sample from HSC PDR 2 wide\n",
57 | "SELECT object_id, ra, dec, tract, patch,\n",
58 | "\t-- Absorption\n",
59 | "\ta_g, a_r, a_i, a_z, a_y,\n",
60 | "\t-- Extendedness\n",
61 | "\tg_extendedness_value, r_extendedness_value, i_extendedness_value, z_extendedness_value, y_extendedness_value,\n",
62 | " -- Background Information\n",
63 | " g_localbackground_flux, r_localbackground_flux, i_localbackground_flux, z_localbackground_flux, y_localbackground_flux,\n",
64 | "\t-- Magnitudes\n",
65 | "\tg_cmodel_mag, g_cmodel_magsigma, g_cmodel_exp_mag, g_cmodel_exp_magsigma, g_cmodel_dev_mag, g_cmodel_dev_magsigma,\n",
66 | "\tr_cmodel_mag, r_cmodel_magsigma, r_cmodel_exp_mag, r_cmodel_exp_magsigma, r_cmodel_dev_mag, r_cmodel_dev_magsigma,\n",
67 | "\ti_cmodel_mag, i_cmodel_magsigma, i_cmodel_exp_mag, i_cmodel_exp_magsigma, i_cmodel_dev_mag, i_cmodel_dev_magsigma,\n",
68 | "\tz_cmodel_mag, z_cmodel_magsigma, z_cmodel_exp_mag, z_cmodel_exp_magsigma, z_cmodel_dev_mag, z_cmodel_dev_magsigma,\n",
69 | "\ty_cmodel_mag, y_cmodel_magsigma, y_cmodel_exp_mag, y_cmodel_exp_magsigma, y_cmodel_dev_mag, y_cmodel_dev_magsigma\n",
70 | "\n",
71 | "FROM pdr2_wide.forced forced\n",
72 | " LEFT JOIN pdr2_wide.forced2 USING (object_id)\n",
73 | " LEFT JOIN pdr2_wide.forced3 USING (object_id)\n",
74 | "\n",
75 | "-- Applying some data quality cuts\n",
76 | "WHERE forced.isprimary\n",
77 | "AND forced.i_cmodel_mag < 23.5\n",
78 | "AND forced.i_cmodel_mag > 21\n",
79 | "-- Simple Full Depth Full Colour cuts: At least 3 exposures in each band\n",
80 | "AND forced.g_inputcount_value >= 3\n",
81 | "AND forced.r_inputcount_value >= 3\n",
82 | "AND forced.i_inputcount_value >= 3\n",
83 | "AND forced.z_inputcount_value >= 3\n",
84 | "AND forced.y_inputcount_value >= 3\n",
85 | "-- Remove objects affected by bright stars\n",
86 | "AND NOT forced.g_pixelflags_bright_objectcenter\n",
87 | "AND NOT forced.r_pixelflags_bright_objectcenter\n",
88 | "AND NOT forced.i_pixelflags_bright_objectcenter\n",
89 | "AND NOT forced.z_pixelflags_bright_objectcenter\n",
90 | "AND NOT forced.y_pixelflags_bright_objectcenter\n",
91 | "AND NOT forced.g_pixelflags_bright_object\n",
92 | "AND NOT forced.r_pixelflags_bright_object\n",
93 | "AND NOT forced.i_pixelflags_bright_object\n",
94 | "AND NOT forced.z_pixelflags_bright_object\n",
95 | "AND NOT forced.y_pixelflags_bright_object\n",
96 | "-- Remove objects intersecting edges\n",
97 | "AND NOT forced.g_pixelflags_edge\n",
98 | "AND NOT forced.r_pixelflags_edge\n",
99 | "AND NOT forced.i_pixelflags_edge\n",
100 | "AND NOT forced.z_pixelflags_edge\n",
101 | "AND NOT forced.y_pixelflags_edge\n",
102 | "-- Remove objects with saturated pixels\n",
103 | "AND NOT forced.g_pixelflags_saturatedcenter\n",
104 | "AND NOT forced.r_pixelflags_saturatedcenter\n",
105 | "AND NOT forced.i_pixelflags_saturatedcenter\n",
106 | "AND NOT forced.z_pixelflags_saturatedcenter\n",
107 | "AND NOT forced.y_pixelflags_saturatedcenter\n",
108 | "-- But force objects with interpolated pixels\n",
109 | "AND forced.i_pixelflags_interpolatedcenter\n",
110 | "-- Remove objects with generic cmodel fit failures\n",
111 | "AND NOT forced.g_cmodel_flag\n",
112 | "AND NOT forced.r_cmodel_flag\n",
113 | "AND NOT forced.i_cmodel_flag\n",
114 | "AND NOT forced.z_cmodel_flag\n",
115 | "AND NOT forced.y_cmodel_flag\n",
116 | "-- Sort by tract and patch for faster cutout query\n",
117 | "ORDER BY object_id\n",
118 | "LIMIT 10000\n",
119 | ";\n",
120 | "'''"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "name": "stdout",
130 | "output_type": "stream",
131 | "text": [
132 | "Waiting for query to finish... [Done]\n"
133 | ]
134 | }
135 | ],
136 | "source": [
137 | "catalog = archive.sql_query(query_mask, from_file=False, verbose=True)"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 10,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "catalog.write('catalog_masked_obj.fits')"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 12,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "# Defining size of cutouts\n",
156 | "img_len = 42 / 2 # Size of cutouts in pixels\n",
157 | "cutout_size = 0.168*(img_len) # Size of cutouts in Arcsecs\n",
158 | "\n",
159 | "# Which filter we care about\n",
160 | "filters = ['HSC-I']\n",
161 | "\n",
162 | "tmp_dir='tmp_dir'\n",
163 | "out_dir='./'\n",
164 | "!mkdir -p tmp_dir"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 13,
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "Starting download of 10 batches ...\n",
177 | "Download filter HSC-I for batch 1Download filter HSC-I for batch 0\n",
178 | "\n",
179 | "Download filter HSC-I for batch 2\n",
180 | "Download filter HSC-I for batch 1\n"
181 | ]
182 | },
183 | {
184 | "name": "stderr",
185 | "output_type": "stream",
186 | "text": [
187 | "WARNING: AstropyDeprecationWarning: tmp_dir/batch_HSC-I_1 already exists. Automatically overwriting ASCII files is deprecated. Use the argument 'overwrite=True' in the future. [astropy.io.ascii.ui]\n"
188 | ]
189 | },
190 | {
191 | "name": "stdout",
192 | "output_type": "stream",
193 | "text": [
194 | "Download filter HSC-I for batch 3\n",
195 | "Download filter HSC-I for batch 4\n",
196 | "Download filter HSC-I for batch 5\n",
197 | "Download filter HSC-I for batch 6\n",
198 | "Download filter HSC-I for batch 7\n",
199 | "Download filter HSC-I for batch 8\n",
200 | "Download filter HSC-I for batch 9\n",
201 | "Download finalized, aggregating cutouts.\n"
202 | ]
203 | }
204 | ],
205 | "source": [
206 | "# Extract the cutouts\n",
207 | "cutouts_filename = task.hsc_bulk_cutout(catalog, \n",
208 | " cutout_size=cutout_size * u.Unit('arcsec'), \n",
209 | " filters=filters, \n",
210 | " archive=archive, \n",
211 | " nproc=2, # Download using 2 parallel jobs\n",
212 | " tmp_dir=tmp_dir, \n",
213 | " mask=True, variance=True,\n",
214 | " output_dir=out_dir)"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 14,
220 | "metadata": {},
221 | "outputs": [
222 | {
223 | "name": "stdout",
224 | "output_type": "stream",
225 | "text": [
226 | "Starting download of 10 batches ...\n",
227 | "Download PSF for filter HSC-I for batch 0Download PSF for filter HSC-I for batch 1\n",
228 | "\n",
229 | "Found cutout file for batch file 1, skipping download\n",
230 | "Download PSF for filter HSC-I for batch 2\n",
231 | "Download PSF for filter HSC-I for batch 0\n"
232 | ]
233 | },
234 | {
235 | "name": "stderr",
236 | "output_type": "stream",
237 | "text": [
238 | "WARNING: AstropyDeprecationWarning: tmp_dir/batch_HSC-I_0 already exists. Automatically overwriting ASCII files is deprecated. Use the argument 'overwrite=True' in the future. [astropy.io.ascii.ui]\n"
239 | ]
240 | },
241 | {
242 | "name": "stdout",
243 | "output_type": "stream",
244 | "text": [
245 | "Download PSF for filter HSC-I for batch 3\n",
246 | "Download PSF for filter HSC-I for batch 4\n",
247 | "Download PSF for filter HSC-I for batch 5\n",
248 | "Download PSF for filter HSC-I for batch 6\n",
249 | "Download PSF for filter HSC-I for batch 7\n",
250 | "Download PSF for filter HSC-I for batch 8\n",
251 | "Download PSF for filter HSC-I for batch 9\n",
252 | "Download finalized, aggregating cutouts.\n"
253 | ]
254 | }
255 | ],
256 | "source": [
257 | "# Extract the PSFs for all these objects\n",
258 | "psfs_filename = task.hsc_bulk_psf(catalog, filters=filters, \n",
259 | " archive=archive, \n",
260 | " nproc=2, tmp_dir=tmp_dir,\n",
261 | " output_dir=out_dir)"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 15,
267 | "metadata": {},
268 | "outputs": [
269 | {
270 | "data": {
271 | "text/plain": [
272 | "('./cutouts_pdr2_wide_coadd.hdf', './psfs_pdr2_wide_coadd.hdf')"
273 | ]
274 | },
275 | "execution_count": 15,
276 | "metadata": {},
277 | "output_type": "execute_result"
278 | }
279 | ],
280 | "source": [
281 | "cutouts_filename, psfs_filename"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": null,
287 | "metadata": {},
288 | "outputs": [],
289 | "source": []
290 | }
291 | ],
292 | "metadata": {
293 | "kernelspec": {
294 | "display_name": "Python 3",
295 | "language": "python",
296 | "name": "python3"
297 | },
298 | "language_info": {
299 | "codemirror_mode": {
300 | "name": "ipython",
301 | "version": 3
302 | },
303 | "file_extension": ".py",
304 | "mimetype": "text/x-python",
305 | "name": "python",
306 | "nbconvert_exporter": "python",
307 | "pygments_lexer": "ipython3",
308 | "version": "3.8.3"
309 | }
310 | },
311 | "nbformat": 4,
312 | "nbformat_minor": 4
313 | }
314 |
--------------------------------------------------------------------------------
/if_projects/.gitignore:
--------------------------------------------------------------------------------
1 | *ipynb_checkpoints/
2 | *ssi_if
3 | *~
--------------------------------------------------------------------------------
/if_projects/README.md:
--------------------------------------------------------------------------------
1 | # Intensity frontier projects
2 |
3 | There are three projects based on image/graph analysis.
4 |
5 | 1. `IF-Image-Classifier` for classifying the type of a particle in an image (image classification)
6 | 2. `IF-Image-Segmentation` for classifying the type of a particle at the pixel level in an image (image segmentation)
7 | 3. `IF-Graph-Clustering` for a graph node/edge analysis
8 |
9 | Each notebook includes a brief description of the dataset and the challenge.
10 |
11 |
--------------------------------------------------------------------------------
/jet_notebooks/1.LHCJetDatasetExploration.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 |         "[Open In Colab badge]"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "id": "z2P7i7u9Q8mv"
14 | },
15 | "source": [
16 | "# Dataset Exploration\n",
17 | "\n",
18 | "---\n",
19 |         "In this notebook, we explore the input data file and the different datasets contained in it:\n",
20 | "- A set of physics-motivated high-level features \n",
21 | "- Jets represented as an image\n",
22 | "- Jets represented as a list of particles\n",
23 |         "These different representations will be used to train different kinds of networks while solving the same problem,\n",
24 | "a classification task aiming to distinguish jets originating from quarks, gluons, Ws, Zs, or top quarks.\n",
25 | "\n",
26 | "---\n",
27 | "\n",
28 | "We start loading the main libraries\n",
29 | "- h5py to read the input HDF5 file\n",
30 |         "- numpy to handle the datasets stored there\n",
31 | "- matplotlib for graphs\n",
32 | "---"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {
39 | "id": "XDuPVEm_Q8my"
40 | },
41 | "outputs": [],
42 | "source": [
43 | "import h5py\n",
44 | "import numpy as np\n",
45 | "import matplotlib.pyplot as plt\n",
46 | "%matplotlib inline"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {
52 | "id": "gNKuxZndQ8mz"
53 | },
54 | "source": [
55 | "## Reading the data\n",
56 | "\n",
57 | "---\n",
58 | "In order to import the dataset, we now\n",
59 | "- clone the dataset repository (to import the data in Colab)\n",
60 | "- load the h5 files in the data/ repository\n",
61 | "- extract the data we need: a target and jetImage \n",
62 | "\n",
63 | "To type shell commands, we start the command line with !\n",
64 | "\n",
65 | "**nb, if you are running locally and you have already downloaded the datasets you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {
72 | "id": "5B8ylbkIQ8mz"
73 | },
74 | "outputs": [],
75 | "source": [
76 | "! curl https://cernbox.cern.ch/s/zZDKjltAcJW0RB7/download -o Data-ML-Jet-Project.tar.gz\n",
77 | "! tar -xvzf Data-ML-Jet-Project.tar.gz \n",
78 | "! ls Data-MLtutorial/JetDataset/\n",
79 | "! rm Data-ML-Jet-Project.tar.gz "
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {
86 | "id": "t36Pjm1fQ8m1"
87 | },
88 | "outputs": [],
89 | "source": [
90 | "# let's open the file\n",
91 | "fileIN = 'Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5'\n",
92 | "f = h5py.File(fileIN)\n",
93 | "# and see what it contains\n",
94 | "print(list(f.keys()))"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {
100 | "id": "yudavOOhQ8m2"
101 | },
102 | "source": [
103 | "---\n",
104 | "- 'jetImage' contains the image representation of the jets (more later)\n",
105 | "- 'jetImageECAL' and 'jetImageHCAL' are the ECAL- and HCAL-only equivalent images. We will not use them (but you are more than welcome to play with it)\n",
106 |         "- 'jetConstituentList' is the list of particles contained in the jet. For each particle, a list of relevant quantities is stored\n",
107 | "- 'particleFeatureNames' is the list of the names corresponding to the quantities contained in 'jetConstituentList'\n",
108 | "- 'jets' is the dataset we consider for the moment\n",
109 | "- 'jetFeatureNames' is the list of the names corresponding to the quantities contained in 'jets'\n",
110 | "\n",
111 | "The first 100 highest-$p_T$ particles are considered for each jet\n",
112 | "\n",
113 | "---"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {
119 | "id": "xMFbHOZsQ8m3"
120 | },
121 | "source": [
122 | "## The physics-motivated high-level features\n",
123 | "\n",
124 | "We then open the input file and load the 'jet' data, containing\n",
125 | "- the discriminating quantities\n",
126 | "- the truth (which kind of jet we are dealing with)"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {
133 | "id": "uTbm8hcNQ8m3"
134 | },
135 | "outputs": [],
136 | "source": [
137 | "# These are the quantities we are dealing with\n",
138 | "featurenames = f.get('jetFeatureNames')\n",
139 | "print(featurenames[:])\n",
140 | "# the b is due to the byte vs utf-8 encoding of the strings in the dataset\n",
141 | "# just ignore them for the moment"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {
147 | "id": "241J64TfQ8m4"
148 | },
149 | "source": [
150 | "---\n",
151 |         "The ground truth is incorporated in the ['j_g', 'j_q', 'j_w', 'j_z', 'j_t'] vector of booleans, taking the form\n",
152 | "- [1, 0, 0, 0, 0] for gluons\n",
153 | "- [0, 1, 0, 0, 0] for quarks\n",
154 | "- [0, 0, 1, 0, 0] for Ws\n",
155 | "- [0, 0, 0, 1, 0] for Zs\n",
156 | "- [0, 0, 0, 0, 1] for tops\n",
157 | "\n",
158 |         "This is what is called 'one-hot' encoding of a discrete label (typical of the ground truth for classification problems)\n",
159 |         "\n",
160 |         "We define the 'target' of our problem as the set of these labels"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "metadata": {
167 | "id": "Z290lJYKQ8m5",
168 | "scrolled": true
169 | },
170 | "outputs": [],
171 | "source": [
172 | "jet_data = np.array(f.get('jets'))\n",
173 | "target = jet_data[:,-6:-1]\n",
174 | "# shape of the dataset\n",
175 | "print(\"Dataset shape:\")\n",
176 | "print(target.shape)\n",
177 | "print(\"First five entries:\")\n",
178 | "for i in range(5):\n",
179 | " print(target[i])\n",
180 | "print(\"Last 5 entries:\")\n",
181 | "for i in range(-5,0):\n",
182 | " print(target[i])"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {
188 | "id": "MlrlY4OgQ8m6"
189 | },
190 | "source": [
191 | "As you can see there are 10K examples in this file. We will need more for a meaningful training (more later)\n",
192 | "\n",
193 | "---"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {
199 | "id": "VTVP8yRrQ8m7"
200 | },
201 | "source": [
202 | "And now the feature dataset"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {
209 | "id": "px02-P2vQ8m7"
210 | },
211 | "outputs": [],
212 | "source": [
213 | "data = np.array(jet_data[:,:-6])\n",
214 | "print(data.shape)"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {
220 | "id": "jEdrmCo5Q8m8"
221 | },
222 | "source": [
223 |         "We have 53 features for each of the 10K jets in this file\n",
224 |         "We now make some plots\n",
225 | "--- "
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {
232 | "id": "g3E6wpS9Q8m8"
233 | },
234 | "outputs": [],
235 | "source": [
236 | "labelCat= [\"gluon\", \"quark\", \"W\", \"Z\", \"top\"]\n",
237 | "# this function makes the histogram of a given quantity for the five classes\n",
238 | "def makePlot(feature_index, input_data, input_featurenames):\n",
239 | " plt.subplots()\n",
240 | " for i in range(len(labelCat)):\n",
241 | " # notice the use of numpy masking to select specific classes of jets\n",
242 | " my_data = input_data[np.argmax(target, axis=1) == i]\n",
243 | " # then plot the right quantity for the reduced array\n",
244 | " plt.hist(my_data[:,feature_index], 50, density=True, histtype='step', fill=False, linewidth=1.5)\n",
245 | " plt.yscale('log') \n",
246 | " plt.legend(labelCat, fontsize=12, frameon=False)\n",
247 | " plt.xlabel(str(input_featurenames[feature_index], \"utf-8\"), fontsize=15)\n",
248 | " plt.ylabel('Prob. Density (a.u.)', fontsize=15)\n",
249 | " plt.show()\n",
250 | " #del fig, ax\n",
251 | " #return fig, ax"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": null,
257 | "metadata": {
258 | "id": "LKEI0vq_Q8m9"
259 | },
260 | "outputs": [],
261 | "source": [
262 | "# we now plot all the features\n",
263 | "for i in range(len(featurenames[:-6])):\n",
264 | " makePlot(i, data, featurenames)\n",
265 | " #fig.show()"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {
271 | "id": "GlNM5jgnQ8m9"
272 | },
273 | "source": [
274 | "More information on these quantities can be found in https://arxiv.org/pdf/1709.08705.pdf\n",
275 | "\n",
276 | "---"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {
282 | "id": "KFEnFDATQ8m-"
283 | },
284 | "source": [
285 | "# The image dataset\n",
286 | "\n",
287 |         "Jets can be converted to images by considering the (η, φ) plane, centered along the jet axis and binned.\n",
288 | "In our case, we consider a square of 1.6x1.6 in size (because the jet size is R=0.8) binned in 100x100 equal-size 'cells'"
289 | ]
290 | },
291 | {
292 | "cell_type": "code",
293 | "execution_count": null,
294 | "metadata": {
295 | "id": "mPfUOSLDQ8m-"
296 | },
297 | "outputs": [],
298 | "source": [
299 | "from matplotlib.colors import LogNorm\n",
300 | "labelCat= [\"gluon\", \"quark\", \"W\", \"Z\", \"top\"]\n",
301 | "image = np.array(f.get('jetImage'))\n",
302 | "image_g = image[np.argmax(target, axis=1) == 0]\n",
303 | "image_q = image[np.argmax(target, axis=1) == 1]\n",
304 | "image_W = image[np.argmax(target, axis=1) == 2]\n",
305 | "image_Z = image[np.argmax(target, axis=1) == 3]\n",
306 | "image_t = image[np.argmax(target, axis=1) == 4]\n",
307 |         "images = [image_g, image_q, image_W, image_Z, image_t] # same order as labelCat\n",
308 | "#plt.rc('text', usetex=True) #you can uncomment this if you have a latex installation\n",
309 | "plt.rc('font', family='serif')\n",
310 | "for i in range(len(images)):\n",
311 | " SUM_Image = np.sum(images[i], axis = 0)\n",
312 | " plt.imshow(SUM_Image/float(images[i].shape[0]), origin='lower',norm=LogNorm(vmin=0.01))\n",
313 | " plt.colorbar()\n",
314 | " plt.title(labelCat[i], fontsize=15)\n",
315 | " plt.xlabel(\"Delta_eta cell\", fontsize=15)\n",
316 | " plt.ylabel(\"Delta_phi cell\", fontsize=15)\n",
317 | " plt.show()"
318 | ]
319 | },
320 | {
321 | "cell_type": "markdown",
322 | "metadata": {
323 | "id": "sDqswz4yQ8m-"
324 | },
325 | "source": [
326 | "# The particle-list dataset\n",
327 | "\n",
328 | "In this case, we look at the particle-related features that we have stored for each jet constituent. The structure of the dataset is similar to that of the physics-motivated features, except for the fact that we have now a double-index dataset: (jet index, particle index).\n",
329 |         "The list is cut at 100 constituents per jet. If fewer are found, the dataset is completed by filling it with 0s (zero padding)"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": null,
335 | "metadata": {
336 | "id": "3RHM9xDOQ8m_"
337 | },
338 | "outputs": [],
339 | "source": [
340 | "p_featurenames = f.get(\"particleFeatureNames\")\n",
341 | "print(p_featurenames[:])"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {
348 | "id": "XTV65ABvQ8m_"
349 | },
350 | "outputs": [],
351 | "source": [
352 | "p_data = f.get(\"jetConstituentList\")\n",
353 | "print(p_data.shape)"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "metadata": {
360 | "id": "iDx0NgFFQ8m_"
361 | },
362 | "outputs": [],
363 | "source": [
364 | "labelCat= [\"gluon\", \"quark\", \"W\", \"Z\", \"top\"]\n",
365 | "# this function makes the histogram of a given quantity for the five classes\n",
366 | "def makePlot_p(feature_index, input_data, input_featurenames):\n",
367 | " plt.subplots()\n",
368 | " for i in range(len(labelCat)):\n",
369 | " my_data = input_data[:,:,feature_index]\n",
370 | " # notice the use of numpy masking to select specific classes of jets\n",
371 | " my_data = my_data[np.argmax(target, axis=1) == i]\n",
372 | " # then plot the right quantity for the reduced array\n",
373 |         "        plt.hist(my_data.flatten(), 50, density=True, histtype='step', fill=False, linewidth=1.5)\n",
374 | " plt.yscale('log') \n",
375 | " plt.legend(labelCat, fontsize=12, frameon=False) \n",
376 | " plt.xlabel(str(input_featurenames[feature_index], \"utf-8\"), fontsize=15)\n",
377 | " plt.ylabel('Prob. Density (a.u.)', fontsize=15)\n",
378 | " plt.show()\n",
379 | " #del fig, ax\n",
380 | " #return fig, ax"
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": null,
386 | "metadata": {
387 | "id": "f2JG4CsaQ8nA"
388 | },
389 | "outputs": [],
390 | "source": [
391 | "# we now plot all the features\n",
392 | "for i in range(len(p_featurenames)-1):\n",
393 | " makePlot_p(i, p_data, p_featurenames)\n",
394 | " #fig.show()"
395 | ]
396 | },
397 | {
398 | "cell_type": "code",
399 | "execution_count": null,
400 | "metadata": {
401 | "id": "8w87Zf7wQ8nA"
402 | },
403 | "outputs": [],
404 | "source": []
405 | }
406 | ],
407 | "metadata": {
408 | "colab": {
409 | "name": "Notebook1_ExploreDataset.ipynb",
410 | "provenance": []
411 | },
412 | "kernelspec": {
413 | "display_name": "Python 3 (ipykernel)",
414 | "language": "python",
415 | "name": "python3"
416 | },
417 | "language_info": {
418 | "codemirror_mode": {
419 | "name": "ipython",
420 | "version": 3
421 | },
422 | "file_extension": ".py",
423 | "mimetype": "text/x-python",
424 | "name": "python",
425 | "nbconvert_exporter": "python",
426 | "pygments_lexer": "ipython3",
427 | "version": "3.8.9"
428 | }
429 | },
430 | "nbformat": 4,
431 | "nbformat_minor": 1
432 | }
433 |
--------------------------------------------------------------------------------
/jet_notebooks/2.JetTaggingMLP.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 |         "[Open In Colab badge]"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "id": "Rf258xT0XIwV"
14 | },
15 | "source": [
16 |         "# Training a Jet Tagging Model with a **DNN**\n",
17 | "\n",
18 | "---\n",
19 | "In this notebook, we perform a Jet identification task using a multiclass classifier based on a \n",
20 |         "Dense Neural Network (DNN), also called a multi-layer perceptron (MLP). The problem consists of identifying a given jet as a quark, a gluon, a W, a Z, or a top,\n",
21 |         "based on a set of physics-motivated high-level features.\n",
22 | "\n",
23 | "For details on the physics problem, see https://arxiv.org/pdf/1804.06913.pdf \n",
24 | "\n",
25 | "For details on the dataset, see Notebook1\n",
26 | "\n",
27 | "---"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {
34 | "id": "4OMAZgtyXIwY"
35 | },
36 | "outputs": [],
37 | "source": [
38 | "import os\n",
39 | "import h5py\n",
40 | "import glob\n",
41 | "import numpy as np\n",
42 | "import matplotlib.pyplot as plt\n",
43 | "%matplotlib inline"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {
49 | "id": "2lbB-J3hXIwb"
50 | },
51 | "source": [
52 | "# Preparation of the training and validation samples\n",
53 | "\n",
54 | "---\n",
55 | "In order to import the dataset, we now\n",
56 | "- clone the dataset repository (to import the data in Colab)\n",
57 | "- load the h5 files in the data/ repository\n",
58 | "- extract the data we need: a target and jetImage \n",
59 | "\n",
60 | "To type shell commands, we start the command line with !\n",
61 | "\n",
62 | "**nb, if you are running locally and you have already downloaded the datasets you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "id": "jWjxFaRPXIwb"
70 | },
71 | "outputs": [],
72 | "source": [
73 | "! curl https://cernbox.cern.ch/s/zZDKjltAcJW0RB7/download -o Data-MLtutorial.tar.gz\n",
74 | "! tar -xvzf Data-MLtutorial.tar.gz \n",
75 | "! ls Data-MLtutorial/JetDataset/\n",
76 | "! rm Data-MLtutorial.tar.gz "
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "id": "cCGhrKdwXIwc"
84 | },
85 | "outputs": [],
86 | "source": [
87 | "target = np.array([])\n",
88 | "features = np.array([])\n",
89 | "# we cannot load all data on Colab. So we just take a few files\n",
90 | "datafiles = ['Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5',\n",
91 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5',\n",
92 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5',\n",
93 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5',\n",
94 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5']\n",
95 |         "# if you are running locally, you can use the full dataset by doing\n",
96 | "# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
97 | "for fileIN in datafiles:\n",
98 | " print(\"Appending %s\" %fileIN)\n",
99 | " f = h5py.File(fileIN)\n",
100 | " myFeatures = np.array(f.get(\"jets\")[:,[12, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 52]])\n",
101 | " mytarget = np.array(f.get('jets')[0:,-6:-1])\n",
102 | " features = np.concatenate([features, myFeatures], axis=0) if features.size else myFeatures\n",
103 | " target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
104 | " f.close()\n",
105 | "print(target.shape, features.shape)"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {
111 | "id": "6a333RYPXIwe"
112 | },
113 | "source": [
114 | "The dataset consists of 50000 jets, each represented by 16 features\n",
115 | "\n",
116 | "---\n",
117 | "\n",
118 | "We now shuffle the data, splitting them into a training and a validation dataset with 2:1 ratio"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": null,
124 | "metadata": {
125 | "id": "ZBqFs1eBXIwf"
126 | },
127 | "outputs": [],
128 | "source": [
129 | "from sklearn.model_selection import train_test_split\n",
130 | "X_train, X_val, y_train, y_val = train_test_split(features, target, test_size=0.33)\n",
131 | "print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)\n",
132 | "del features, target"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {
138 | "id": "GkNz5UAhXIwg"
139 | },
140 | "source": [
141 | "# DNN model building"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": null,
147 | "metadata": {
148 | "id": "tTSDOiEHXIwh"
149 | },
150 | "outputs": [],
151 | "source": [
152 | "# keras imports\n",
153 | "from tensorflow.keras.models import Model\n",
154 | "from tensorflow.keras.layers import Dense, Input, Dropout, Flatten, Activation\n",
155 | "from tensorflow.keras.utils import plot_model\n",
156 | "from tensorflow.keras import backend as K\n",
157 | "from tensorflow.keras import metrics\n",
158 | "from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {
165 | "id": "rAl0DZTxXIwi"
166 | },
167 | "outputs": [],
168 | "source": [
169 | "input_shape = X_train.shape[1]\n",
170 | "dropoutRate = 0.25"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {
177 | "id": "2l492G8BXIwj"
178 | },
179 | "outputs": [],
180 | "source": [
181 | "####\n",
182 | "inputArray = Input(shape=(input_shape,))\n",
183 | "#\n",
184 | "x = Dense(40, activation='relu')(inputArray)\n",
185 | "x = Dropout(dropoutRate)(x)\n",
186 | "#\n",
187 | "x = Dense(20)(x)\n",
188 | "x = Activation('relu')(x)\n",
189 | "x = Dropout(dropoutRate)(x)\n",
190 | "#\n",
191 | "x = Dense(10, activation='relu')(x)\n",
192 | "x = Dropout(dropoutRate)(x)\n",
193 | "#\n",
194 | "x = Dense(5, activation='relu')(x)\n",
195 | "#\n",
196 | "output = Dense(5, activation='softmax')(x)\n",
197 | "####\n",
198 | "model = Model(inputs=inputArray, outputs=output)"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {
205 | "id": "xu8rRUkhXIwj"
206 | },
207 | "outputs": [],
208 | "source": [
209 | "model.compile(loss='categorical_crossentropy', optimizer='adam')\n",
210 | "model.summary()"
211 | ]
212 | },
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {
216 | "id": "2HfKWoOtXIwk"
217 | },
218 | "source": [
219 | "We now train the model with these settings:\n",
220 | "\n",
221 | "- the **batch size** is a hyperparameter of gradient descent that controls the number of training samples to work through before the model internal parameters are updated\n",
222 | " - batch size = 1 results in fast computation but noisy training that is slow to converge\n",
223 | " - batch size = dataset size results in slow computation but faster convergence)\n",
224 | "\n",
225 | "- the **number of epochs** controls the number of complete passes through the full training dataset -- at each epoch gradients are computed for each of the mini batches and model internal parameters are updated.\n",
226 | "\n",
227 | "- the **callbacks** are algorithms used to optimize the training (full list [here](https://keras.io/api/callbacks/)):\n",
228 | " - *EarlyStopping*: stop training when a monitored metric (`monitor`) has stopped improving in the last N epochs (`patience`)\n",
229 | " - *ReduceLROnPlateau*: reduce learning rate when a metric (`monitor`) has stopped improving in the last N epochs (`patience`)\n",
230 | " - *TerminateOnNaN*: terminates training when a NaN loss is encountered"
231 | ]
232 | },
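233 | {
234 | "cell_type": "markdown",
235 | "metadata": {},
236 | "source": [
237 | "*(Added example, not part of the original tutorial)* As a quick sanity check of the batch-size / epoch relationship described above, the cell below computes how many gradient updates (steps) happen in one epoch: steps_per_epoch = ceil(N_train / batch_size). It only uses the `X_train` array defined earlier and numpy."
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "# Added illustration (assumes X_train and np from the cells above):\n",
247 | "# number of parameter updates per epoch for a given batch size\n",
248 | "for bs in [1, 128, X_train.shape[0]]:\n",
249 | "    steps = int(np.ceil(X_train.shape[0] / bs))\n",
250 | "    print(\"batch size %6d -> %6d gradient updates per epoch\" % (bs, steps))"
251 | ]
252 | },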
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "metadata": {
237 | "id": "KzO-lyLEXIwk"
238 | },
239 | "outputs": [],
240 | "source": [
241 | "batch_size = 128\n",
242 | "n_epochs = 50\n",
243 | "\n",
244 | "# train \n",
245 | "history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size, verbose = 2,\n",
246 | " validation_data=(X_val, y_val),\n",
247 | " callbacks = [\n",
248 | " EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
249 | " ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
250 | " TerminateOnNaN()])"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {
257 | "id": "044bCLqVXIwl"
258 | },
259 | "outputs": [],
260 | "source": [
261 | "# plot training history\n",
262 | "plt.plot(history.history['loss'])\n",
263 | "plt.plot(history.history['val_loss'])\n",
264 | "plt.yscale('log')\n",
265 | "plt.title('Training History')\n",
266 | "plt.ylabel('loss')\n",
267 | "plt.xlabel('epoch')\n",
268 | "plt.legend(['training', 'validation'], loc='upper right')\n",
269 | "plt.show()"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {
275 | "id": "oESSmNLxXIwm"
276 | },
277 | "source": [
278 | "# Building the ROC Curves"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {
285 | "id": "a_AROD6SXIwm"
286 | },
287 | "outputs": [],
288 | "source": [
289 | "labels = ['gluon', 'quark', 'W', 'Z', 'top']"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {
296 | "id": "gjKT7EjUXIwn"
297 | },
298 | "outputs": [],
299 | "source": [
300 | "import pandas as pd\n",
301 | "from sklearn.metrics import roc_curve, auc\n",
302 | "predict_val = model.predict(X_val)\n",
303 | "df = pd.DataFrame()\n",
304 | "fpr = {}\n",
305 | "tpr = {}\n",
306 | "auc1 = {}\n",
307 | "\n",
308 | "plt.figure()\n",
309 | "for i, label in enumerate(labels):\n",
310 | " df[label] = y_val[:,i]\n",
311 | " df[label + '_pred'] = predict_val[:,i]\n",
312 | "\n",
313 | " fpr[label], tpr[label], threshold = roc_curve(df[label],df[label+'_pred'])\n",
314 | "\n",
315 | " auc1[label] = auc(fpr[label], tpr[label])\n",
316 | "\n",
317 | " plt.plot(tpr[label],fpr[label],label='%s tagger, auc = %.1f%%'%(label,auc1[label]*100.))\n",
318 | "plt.semilogy()\n",
319 | "plt.xlabel(\"sig. efficiency\")\n",
320 | "plt.ylabel(\"bkg. mistag rate\")\n",
321 | "plt.ylim(0.000001,1)\n",
322 | "plt.grid(True)\n",
323 | "plt.legend(loc='lower right')\n",
324 | "plt.show()"
325 | ]
326 | },
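327 | {
328 | "cell_type": "markdown",
329 | "metadata": {},
330 | "source": [
331 | "*(Added example, not part of the original tutorial)* Besides the per-class ROC curves, a simple complementary check is the overall accuracy and the confusion matrix on the validation set. This is a minimal sketch using scikit-learn and the `predict_val` array computed above."
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {},
338 | "outputs": [],
339 | "source": [
340 | "# Added illustration (assumes predict_val, y_val, labels, np, pd from the cells above)\n",
341 | "from sklearn.metrics import accuracy_score, confusion_matrix\n",
342 | "y_true = np.argmax(y_val, axis=1)        # one-hot -> class index\n",
343 | "y_pred = np.argmax(predict_val, axis=1)  # most probable class\n",
344 | "print(\"accuracy: %.3f\" % accuracy_score(y_true, y_pred))\n",
345 | "print(pd.DataFrame(confusion_matrix(y_true, y_pred), index=labels, columns=labels))"
346 | ]
347 | },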
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "metadata": {
331 | "id": "lzbQ-d0RKVmV"
332 | },
333 | "outputs": [],
334 | "source": []
335 | }
336 | ],
337 | "metadata": {
338 | "colab": {
339 | "name": "Notebook2_JetID_DNN.ipynb",
340 | "provenance": []
341 | },
342 | "kernelspec": {
343 | "display_name": "Python 3 (ipykernel)",
344 | "language": "python",
345 | "name": "python3"
346 | },
347 | "language_info": {
348 | "codemirror_mode": {
349 | "name": "ipython",
350 | "version": 3
351 | },
352 | "file_extension": ".py",
353 | "mimetype": "text/x-python",
354 | "name": "python",
355 | "nbconvert_exporter": "python",
356 | "pygments_lexer": "ipython3",
357 | "version": "3.8.9"
358 | }
359 | },
360 | "nbformat": 4,
361 | "nbformat_minor": 1
362 | }
363 |
--------------------------------------------------------------------------------
/jet_notebooks/3.JetTaggingConv2D.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "id": "y7_esAc6K1Z4"
14 | },
15 | "source": [
16 | "# Training a Jet Tagging with **Conv2D** \n",
17 | "\n",
18 | "---\n",
19 | "In this notebook, we perform a Jet identification task using a Conv2D multiclass classifier.\n",
20 | "The problem consists in identifying a given jet as a quark, a gluon, a W, a Z, or a top,\n",
21 | "based on a jet image, i.e., a 2D histogram of the transverse momentum ($p_T$) deposited in each of 100x100\n",
22 | "bins of a square window of the ($\\eta$, $\\phi$) plane, centered along the jet axis.\n",
23 | "\n",
24 | "For details on the physics problem, see https://arxiv.org/pdf/1804.06913.pdf \n",
25 | "\n",
26 | "For details on the dataset, see Notebook1\n",
27 | "\n",
28 | "---"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {
35 | "id": "5pK5FCdMK1Z7"
36 | },
37 | "outputs": [],
38 | "source": [
39 | "import os\n",
40 | "import h5py\n",
41 | "import glob, pickle\n",
42 | "import numpy as np\n",
43 | "import pandas as pd\n",
44 | "import matplotlib.pyplot as plt\n",
45 | "%matplotlib inline"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {
51 | "id": "gVO0EQALK1Z8"
52 | },
53 | "source": [
54 | "# Preparation of the training and validation samples\n",
55 | "\n",
56 | "---\n",
57 | "In order to import the dataset, we now\n",
58 | "- clone the dataset repository (to import the data in Colab)\n",
59 | "- load the h5 files in the data/ repository\n",
60 | "- extract the data we need: a target and jetImage \n",
61 | "\n",
62 | "To type shell commands, we start the command line with !\n",
63 | "\n",
64 | "**nb, if you are running locally and you have already downloaded the datasets you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "metadata": {
71 | "id": "GlxKaXA8K1Z-"
72 | },
73 | "outputs": [],
74 | "source": [
75 | "! curl https://cernbox.cern.ch/s/zZDKjltAcJW0RB7/download -o Data-MLtutorial.tar.gz\n",
76 | "! tar -xvzf Data-MLtutorial.tar.gz \n",
77 | "! ls Data-MLtutorial/JetDataset/\n",
78 | "! rm Data-MLtutorial.tar.gz "
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "id": "_Bfy4kz2K1Z_"
86 | },
87 | "outputs": [],
88 | "source": [
89 | "target = np.array([])\n",
90 | "jetImage = np.array([])\n",
91 | "# we cannot load all data on Colab. So we just take a few files\n",
92 | "datafiles = ['Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5',\n",
93 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5',\n",
94 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5',\n",
95 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5',\n",
96 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5']\n",
97 | "# if you are running locally, you can use the full dataset doing\n",
98 | "# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
99 | "for fileIN in datafiles:\n",
100 | " print(\"Appending %s\" %fileIN)\n",
101 | " f = h5py.File(fileIN)\n",
102 | " myjetImage = np.array(f.get(\"jetImage\"))\n",
103 | " mytarget = np.array(f.get('jets')[0:,-6:-1])\n",
104 | " jetImage = np.concatenate([jetImage, myjetImage], axis=0) if jetImage.size else myjetImage\n",
105 | " target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
106 | " f.close()\n",
107 | "print(target.shape, jetImage.shape)"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {
113 | "id": "QTuZKlYlK1aA"
114 | },
115 | "source": [
116 | "The dataset consists of 50000 with up to 100 particles in each jet. These 100 particles have been used to fill the 100x100 jet images.\n",
117 | "\n",
118 | "---\n",
119 | "\n",
120 | "We now shuffle the data, splitting them into a training and a validation dataset with 2:1 ratio"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "metadata": {
127 | "id": "MZenCl_lK1aC"
128 | },
129 | "outputs": [],
130 | "source": [
131 | "from sklearn.model_selection import train_test_split\n",
132 | "X_train, X_val, y_train, y_val = train_test_split(jetImage, target, test_size=0.33)\n",
133 | "print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)\n",
134 | "del jetImage, target"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {
140 | "id": "NRONFD2bK1aD"
141 | },
142 | "source": [
143 | "In keras, images are representable as $n \\times m \\times k$ tensors, where $n \\times m$ are the pixel dimenions and $k$ is the number of channels (e.g., 1 in a black\\&while image, 3 for an RGB image). In our case, k=1. To comply to this, we add the channel index by reshaping the image dataset"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {
150 | "id": "JwzpamYuK1aE"
151 | },
152 | "outputs": [],
153 | "source": [
154 | "X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], X_train.shape[2], 1))\n",
155 | "X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], X_val.shape[2], 1))\n",
156 | "print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {
162 | "id": "-Cj7HnaAK1aF"
163 | },
164 | "source": [
165 | "# Conv 2D model building\n",
166 | "\n",
167 | "The main ingredients of a Conv2D layer are:\n",
168 | "\n",
169 | "- **filter**: a *k x k’* matrix of weights (orange matrix in the picture below) that scans the image and performs a scalar product of each image block (this is also called *kernel*)\n",
170 | "- **stride**: number of pixels the filter is shifted by (=1 in the image below)\n",
171 | "- **padding**: the amount of pixels added to an image when it is being processed by the filter of a CNN (helps keeping information on the boundaries of the original image by allowing border pixels to be at the center of the filter)\n",
172 | " - *valid* means no padding (default setting)\n",
173 | " - *same* results in padding with zeros evenly to the left/right or up/down of the input image as needed to ensure that the output has the same shape as the input\n",
174 | "\n",
175 | "
\n",
176 | "

\n",
177 | "
\n",
178 | "\n",
179 | "It is common practice to insert **pooling** layers in between Conv2D layers to progressively reduce the size of the representation and thus reduce the amount of parameters and computation in the network. Pooling also makes processing more robust to changes in the position of a feature in the image. Common types of pooling operations are:\n",
180 | "\n",
181 | "- **MaxPooling**: given an image and a pool of size *k x k’*, scans the image and replaces each *k x k’* patch with its *maximum* -- helps to extract the sharpest features on the image when the sharpest features are a best lower-level representation of the image\n",
182 | "- **AveragePooling**: given an image and a pool of size *k x k’*, scans the image and replaces each *k x k’* patch with its *average* -- helps to extract the smooth features when \"colours\" transition is smooth\n",
183 | "\n",
184 | "
"
185 | ]
186 | },
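187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "*(Added example, not part of the original tutorial)* A minimal, self-contained sketch of how kernel size, padding and pooling affect the output shape: a single Conv2D and a MaxPooling2D applied to a dummy 100x100 single-channel image, printing the resulting shapes. The layer parameters are chosen only for illustration."
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "# Added illustration of Conv2D / MaxPooling2D output shapes (self-contained)\n",
201 | "import numpy as np\n",
202 | "from tensorflow.keras.layers import Conv2D, MaxPooling2D\n",
203 | "dummy = np.zeros((1, 100, 100, 1), dtype='float32')  # (batch, rows, cols, channels)\n",
204 | "conv_same  = Conv2D(5, kernel_size=(5, 5), padding='same')(dummy)\n",
205 | "conv_valid = Conv2D(5, kernel_size=(5, 5), padding='valid')(dummy)\n",
206 | "pooled     = MaxPooling2D(pool_size=(5, 5))(conv_same)\n",
207 | "print('same   :', conv_same.shape)   # (1, 100, 100, 5): size preserved\n",
208 | "print('valid  :', conv_valid.shape)  # (1, 96, 96, 5): shrinks by kernel_size-1\n",
209 | "print('pooled :', pooled.shape)      # (1, 20, 20, 5): divided by the pool size"
210 | ]
211 | },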
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {
191 | "id": "RaUzTTuYK1aG"
192 | },
193 | "outputs": [],
194 | "source": [
195 | "# keras imports\n",
196 | "from tensorflow.keras.models import Model, model_from_json\n",
197 | "from tensorflow.keras.layers import Dense, Input, Conv2D, Dropout, Flatten\n",
198 | "from tensorflow.keras.layers import MaxPooling2D, BatchNormalization, Activation\n",
199 | "from tensorflow.keras.utils import plot_model\n",
200 | "from tensorflow.keras import backend as K\n",
201 | "from tensorflow.keras import metrics\n",
202 | "from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {
209 | "id": "evIemCVrK1aH"
210 | },
211 | "outputs": [],
212 | "source": [
213 | "img_rows = X_train.shape[1]\n",
214 | "img_cols = X_train.shape[2]\n",
215 | "dropoutRate = 0.25"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {
222 | "id": "xjQH0sMuK1aI"
223 | },
224 | "outputs": [],
225 | "source": [
226 | "image_shape = (img_rows, img_cols, 1)\n",
227 | "####\n",
228 | "inputImage = Input(shape=(image_shape))\n",
229 | "x = Conv2D(5, kernel_size=(5,5), data_format=\"channels_last\", strides=(1, 1), padding=\"same\")(inputImage)\n",
230 | "x = BatchNormalization()(x)\n",
231 | "x = Activation('relu')(x)\n",
232 | "x = MaxPooling2D( pool_size = (5,5))(x)\n",
233 | "x = Dropout(dropoutRate)(x)\n",
234 | "#\n",
235 | "x = Conv2D(3, kernel_size=(3,3), data_format=\"channels_last\", strides=(1, 1), padding=\"same\")(x)\n",
236 | "x = BatchNormalization()(x)\n",
237 | "x = Activation('relu')(x)\n",
238 | "x = MaxPooling2D( pool_size = (3,3))(x)\n",
239 | "x = Dropout(dropoutRate)(x)\n",
240 | "#\n",
241 | "x = Flatten()(x)\n",
242 | "#\n",
243 | "x = Dense(5, activation='relu')(x)\n",
244 | "#\n",
245 | "output = Dense(5, activation='softmax')(x)\n",
246 | "####\n",
247 | "model = Model(inputs=inputImage, outputs=output)"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": null,
253 | "metadata": {
254 | "id": "hXVPnzcTK1aI"
255 | },
256 | "outputs": [],
257 | "source": [
258 | "model.compile(loss='categorical_crossentropy', optimizer='adam')\n",
259 | "model.summary()"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {
265 | "id": "5leMyRciK1aJ"
266 | },
267 | "source": [
268 | "We now train the model. This takes really long time and processing power on common CPUs. **If you are running locally set TRAIN=False** such that a pre-trained model is loaded for the next evaluation steps. We live as homework to reproduce the results (suggest to use Colab with GPU)."
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": null,
274 | "metadata": {
275 | "id": "aj_vANK5K1aK"
276 | },
277 | "outputs": [],
278 | "source": [
279 | "TRAIN = False\n",
280 | "batch_size = 128\n",
281 | "n_epochs = 10\n",
282 | "\n",
283 | "if TRAIN: #train and save the model\n",
284 | " \n",
285 | " history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size, verbose = 2,\n",
286 | " validation_data=(X_val, y_val),\n",
287 | " callbacks = [\n",
288 | " EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
289 | " ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
290 | " TerminateOnNaN()])\n",
291 | "\n",
292 | " model_json = model.to_json()\n",
293 | " outputdir = './'\n",
294 | "\n",
295 | " with open(\"{OUTPUTDIR}/jetTagger_CNN.json\".format(OUTPUTDIR=outputdir), \"w\") as json_file:\n",
296 | " json_file.write(model_json)\n",
297 | " model.save_weights(\"{OUTPUTDIR}/jetTagger_CNN.h5\".format(OUTPUTDIR=outputdir))\n",
298 | " \n",
299 | " with open('{OUTPUTDIR}/jetTagger_CNN_history.h5'.format(OUTPUTDIR=outputdir), 'wb') as f:\n",
300 | " pickle.dump(history.history, f, protocol=pickle.HIGHEST_PROTOCOL) \n",
301 | " \n",
302 | "else: #load pretrained model\n",
303 | " \n",
304 | " ! curl https://cernbox.cern.ch/index.php/s/yYUgxxSnYN42qay/download -o jetTagger_CNN.tar.gz\n",
305 | " ! tar -xvzf jetTagger_CNN.tar.gz \n",
306 | " ! ls jetTagger_CNN/\n",
307 | " ! rm jetTagger_CNN.tar.gz\n",
308 | " \n",
309 | " with open('jetTagger_CNN/jetTagger_CNN.json', 'r') as json_file:\n",
310 | " model_json = json_file.read()\n",
311 | " model = model_from_json(model_json)\n",
312 | " model.load_weights(\"jetTagger_CNN/jetTagger_CNN.h5\")\n",
313 | " \n",
314 | " with open('jetTagger_CNN/history.h5', 'rb') as f: history = pickle.load(f)"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {
321 | "id": "z7E3JsM_K1aK"
322 | },
323 | "outputs": [],
324 | "source": [
325 | "# plot training history\n",
326 | "if TRAIN: history = pd.DataFrame(history.history)\n",
327 | "plt.plot(history['loss'])\n",
328 | "plt.plot(history['val_loss'])\n",
329 | "plt.yscale('log')\n",
330 | "plt.title('Training History')\n",
331 | "plt.ylabel('loss')\n",
332 | "plt.xlabel('epoch')\n",
333 | "plt.legend(['training', 'validation'], loc='upper right')\n",
334 | "plt.show()"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {
340 | "id": "FVNnL-1QK1aM"
341 | },
342 | "source": [
343 | "# Building the ROC Curves"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": null,
349 | "metadata": {
350 | "id": "FWbtbTrPK1aM"
351 | },
352 | "outputs": [],
353 | "source": [
354 | "labels = ['gluon', 'quark', 'W', 'Z', 'top']"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": null,
360 | "metadata": {
361 | "id": "N33vdzeQK1aM"
362 | },
363 | "outputs": [],
364 | "source": [
365 | "\n",
366 | "from sklearn.metrics import roc_curve, auc\n",
367 | "predict_val = model.predict(X_val)\n",
368 | "df = pd.DataFrame()\n",
369 | "fpr = {}\n",
370 | "tpr = {}\n",
371 | "auc1 = {}\n",
372 | "\n",
373 | "plt.figure()\n",
374 | "for i, label in enumerate(labels):\n",
375 | " df[label] = y_val[:,i]\n",
376 | " df[label + '_pred'] = predict_val[:,i]\n",
377 | "\n",
378 | " fpr[label], tpr[label], threshold = roc_curve(df[label],df[label+'_pred'])\n",
379 | "\n",
380 | " auc1[label] = auc(fpr[label], tpr[label])\n",
381 | "\n",
382 | " plt.plot(tpr[label],fpr[label],label='%s tagger, auc = %.1f%%'%(label,auc1[label]*100.))\n",
383 | "plt.semilogy()\n",
384 | "plt.xlabel(\"sig. efficiency\")\n",
385 | "plt.ylabel(\"bkg. mistag rate\")\n",
386 | "plt.ylim(0.000001,1)\n",
387 | "plt.grid(True)\n",
388 | "plt.legend(loc='lower right')\n",
389 | "plt.show()"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": null,
395 | "metadata": {
396 | "id": "w4zyw_ExK1aN"
397 | },
398 | "outputs": [],
399 | "source": []
400 | }
401 | ],
402 | "metadata": {
403 | "colab": {
404 | "provenance": []
405 | },
406 | "kernelspec": {
407 | "display_name": "Python 3 (ipykernel)",
408 | "language": "python",
409 | "name": "python3"
410 | },
411 | "language_info": {
412 | "codemirror_mode": {
413 | "name": "ipython",
414 | "version": 3
415 | },
416 | "file_extension": ".py",
417 | "mimetype": "text/x-python",
418 | "name": "python",
419 | "nbconvert_exporter": "python",
420 | "pygments_lexer": "ipython3",
421 | "version": "3.8.9"
422 | }
423 | },
424 | "nbformat": 4,
425 | "nbformat_minor": 1
426 | }
427 |
--------------------------------------------------------------------------------
/jet_notebooks/4.JetTaggingConv1D.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "id": "gejTja5eMIBK"
14 | },
15 | "source": [
16 | "# Training a Jet Tagging with **CNN 1D** \n",
17 | "\n",
18 | "---\n",
19 | "In this notebook, we perform a Jet identification task using a multiclass classifier with a network based on Conv1D layers.\n",
20 | "\n",
21 | "The problem consists in identifying a given jet as a quark, a gluon, a W, a Z, or a top,\n",
22 | "based on a jet sequence, i.e. a list of particles. Foe each particle, the four-momentum coordinates are given as features.\n",
23 | "For details on the physics problem, see https://arxiv.org/pdf/1804.06913.pdf and https://arxiv.org/pdf/1908.05318.pdf.\n",
24 | "\n",
25 | "For details on the dataset, see Notebook1\n",
26 | "\n",
27 | "---"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {
34 | "id": "hkwvFqoYMIBO"
35 | },
36 | "outputs": [],
37 | "source": [
38 | "import os\n",
39 | "import h5py\n",
40 | "import glob\n",
41 | "import numpy as np\n",
42 | "import matplotlib.pyplot as plt\n",
43 | "%matplotlib inline"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {
49 | "id": "n-mnPnKMMIBR"
50 | },
51 | "source": [
52 | "# Preparation of the training and validation samples\n",
53 | "\n",
54 | "---\n",
55 | "In order to import the dataset, we now\n",
56 | "- clone the dataset repository (to import the data in Colab)\n",
57 | "- load the h5 files in the data/ repository\n",
58 | "- extract the data we need: a target and jetImage \n",
59 | "\n",
60 | "To type shell commands, we start the command line with !\n",
61 | "\n",
62 | "nb, if you are running locally you can skip the step below and change the paths later to point to the folder with your previous download of the datasets.\n",
63 | "\n",
64 | "**nb, if you are running locally and you have already downloaded the datasets you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "metadata": {
71 | "id": "UFu00GI0MIBS"
72 | },
73 | "outputs": [],
74 | "source": [
75 | "! curl https://cernbox.cern.ch/s/zZDKjltAcJW0RB7/download -o Data-MLtutorial.tar.gz\n",
76 | "! tar -xvzf Data-MLtutorial.tar.gz \n",
77 | "! ls Data-MLtutorial/JetDataset/\n",
78 | "! rm Data-MLtutorial.tar.gz "
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "id": "P9v6UXQ_MIBU"
86 | },
87 | "outputs": [],
88 | "source": [
89 | "inputDir = \"Data-MLtutorial/JetDataset/\""
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {
96 | "id": "_KmP6oBlMIBV"
97 | },
98 | "outputs": [],
99 | "source": [
100 | "target = np.array([])\n",
101 | "jetList = np.array([])\n",
102 | "# we cannot load all data on Colab. So we just take a few files\n",
103 | "datafiles = ['%s/jetImage_7_100p_30000_40000.h5' %inputDir,\n",
104 | " '%s/jetImage_7_100p_60000_70000.h5' %inputDir,\n",
105 | " '%s/jetImage_7_100p_50000_60000.h5' %inputDir,\n",
106 | " '%s/jetImage_7_100p_10000_20000.h5' %inputDir,\n",
107 | " '%s/jetImage_7_100p_0_10000.h5' %inputDir]\n",
108 | "# if you are running locallt, you can use the full dataset doing\n",
109 | "# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
110 | "for i, fileIN in enumerate(datafiles):\n",
111 | " f = h5py.File(fileIN)\n",
112 | " if i == 0: print(f.get(\"particleFeatureNames\")[:])\n",
113 | " print(\"Appending %s\" %fileIN)\n",
114 | " myJetList = np.array(f.get(\"jetConstituentList\"))\n",
115 | " mytarget = np.array(f.get('jets')[0:,-6:-1])\n",
116 | " jetList = np.concatenate([jetList, myJetList], axis=0) if jetList.size else myJetList\n",
117 | " target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
118 | " del myJetList, mytarget\n",
119 | " f.close()\n",
120 | "print(target.shape, jetList.shape)"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {
126 | "id": "mzd2xPuBMIBX"
127 | },
128 | "source": [
129 | "The dataset consists of 50000 with up to 100 particles in each jet. For each particle, 16 features are given (see printout)\n",
130 | "\n",
131 | "---\n",
132 | "\n",
133 | "We now shuffle the data, splitting them into a training and a validation dataset with 2:1 ratio"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {
140 | "id": "Cf10MQWNMIBY",
141 | "scrolled": false
142 | },
143 | "outputs": [],
144 | "source": [
145 | "from sklearn.model_selection import train_test_split\n",
146 | "X_train, X_val, y_train, y_val = train_test_split(jetList, target, test_size=0.33)\n",
147 | "print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)\n",
148 | "del jetList, target"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {
154 | "id": "X5PrD8nHMIBZ"
155 | },
156 | "source": [
157 | "We interpret the last index (the particle feature) as the channel"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {
163 | "id": "LYAs9xtKMIBa"
164 | },
165 | "source": [
166 | "# Building the Conv1D model\n",
167 | "\n",
168 | "A Conv1D model is a special case of convolutional NN where the kernel is convolved with the input tensor over a single spatial (or temporal) dimension with some ordering. Only the number of \"time steps\" or elements of the sequence to be processed by each filter stride is specified, i.e. one specifies only one dimension of the kernel. The other dimension of the kernel is equal to the number of channels.\n",
169 | "\n",
170 | "In the jet tagging model below we are analyzing a sequence of particles in within the jet which are ordered by momentum (from the highest to the lowest). Note that this is not the only order possible.\n",
171 | "\n",
172 | "The drawing below illustrate the Conv1D processing concept. In this example, starting from 10 particles with 1 feature we are processing 4 particles per each stride of each of the four 4x1 kernels to create 4 new filtered sequences of 7 elements. A final pooling layer is applied to each of the filtered sequences to output the final feature vector that can be used as an input to a regular fully connected layer.\n",
173 | "\n",
174 | "\n",
175 | "\n",
176 | "

\n",
177 | "
"
178 | ]
179 | },
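180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "*(Added example, not part of the original tutorial)* A minimal, self-contained sketch of the point above: for a sequence of 100 particles with 16 features (channels), a `Conv1D(20, kernel_size=3)` layer has kernels of effective size 3x16, i.e. `3*16*20 + 20 = 980` trainable parameters, and with `padding='valid'` it turns the length-100 sequence into a length-98 sequence of 20 filtered features."
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "# Added illustration of the Conv1D kernel and output shapes (self-contained)\n",
194 | "import numpy as np\n",
195 | "from tensorflow.keras.layers import Conv1D\n",
196 | "dummy = np.zeros((1, 100, 16), dtype='float32')   # (batch, particles, features)\n",
197 | "layer = Conv1D(20, kernel_size=3, padding='valid')\n",
198 | "out = layer(dummy)\n",
199 | "print('output shape :', out.shape)                # (1, 98, 20)\n",
200 | "print('kernel shape :', layer.weights[0].shape)   # (3, 16, 20)\n",
201 | "print('n parameters :', layer.count_params())     # 3*16*20 + 20 = 980"
202 | ]
203 | },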
180 | {
181 | "cell_type": "code",
182 | "execution_count": null,
183 | "metadata": {
184 | "id": "PbO6t_ixMIBc"
185 | },
186 | "outputs": [],
187 | "source": [
188 | "# keras imports\n",
189 | "from tensorflow.keras.models import Model\n",
190 | "from tensorflow.keras.layers import Dense, Input, Conv1D, AveragePooling1D, Dropout, Flatten\n",
191 | "from tensorflow.keras.utils import plot_model\n",
192 | "from tensorflow.keras import backend as K\n",
193 | "from tensorflow.keras import metrics\n",
194 | "from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "id": "nwUxXdMkMIBd"
202 | },
203 | "outputs": [],
204 | "source": [
205 | "featureArrayLength = (X_train.shape[1],X_train.shape[2])\n",
206 | "dropoutRate = 0.25"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "metadata": {
213 | "id": "A9l_M9muMIBd"
214 | },
215 | "outputs": [],
216 | "source": [
217 | "####\n",
218 | "inputList = Input(shape=(featureArrayLength))\n",
219 | "x = Conv1D(20, kernel_size=3, data_format=\"channels_last\", strides=1, padding=\"valid\", activation='relu')(inputList)\n",
220 | "x = AveragePooling1D(pool_size=3)(x)\n",
221 | "#\n",
222 | "x = Conv1D(40, kernel_size=3, data_format=\"channels_last\", strides=1, padding=\"valid\", activation='relu')(x)\n",
223 | "x = AveragePooling1D(pool_size=3)(x)\n",
224 | "#\n",
225 | "x = Conv1D(60, kernel_size=2, data_format=\"channels_last\", strides=1, padding=\"valid\", activation='relu')(x)\n",
226 | "x = AveragePooling1D(pool_size=9)(x)\n",
227 | "#\n",
228 | "x = Flatten()(x)\n",
229 | "x = Dense(20, activation='relu')(x)\n",
230 | "x = Dropout(dropoutRate)(x)\n",
231 | "x = Dense(10, activation='relu')(x)\n",
232 | "x = Dropout(dropoutRate)(x)\n",
233 | "output = Dense(5, activation='softmax')(x)\n",
234 | "####\n",
235 | "model = Model(inputs=inputList, outputs=output)"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": null,
241 | "metadata": {
242 | "id": "zzfeZ2D1MIBe"
243 | },
244 | "outputs": [],
245 | "source": [
246 | "model.compile(loss='categorical_crossentropy', optimizer='adam')\n",
247 | "model.summary()"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {
253 | "id": "6zoWbOqdMIBf"
254 | },
255 | "source": [
256 | "We now train the model"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "id": "oNGw3a5OMIBf"
264 | },
265 | "outputs": [],
266 | "source": [
267 | "batch_size = 128\n",
268 | "n_epochs = 200"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": null,
274 | "metadata": {
275 | "id": "wjEQ_Ty8MIBf"
276 | },
277 | "outputs": [],
278 | "source": [
279 | "# train \n",
280 | "history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size, verbose = 2,\n",
281 | " validation_data=(X_val, y_val),\n",
282 | " callbacks = [\n",
283 | " EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
284 | " ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
285 | " TerminateOnNaN()])"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {
292 | "id": "9BKPby1UMIBg"
293 | },
294 | "outputs": [],
295 | "source": [
296 | "# plot training history\n",
297 | "plt.plot(history.history['loss'])\n",
298 | "plt.plot(history.history['val_loss'])\n",
299 | "plt.yscale('log')\n",
300 | "plt.title('Training History')\n",
301 | "plt.ylabel('loss')\n",
302 | "plt.xlabel('epoch')\n",
303 | "plt.legend(['training', 'validation'], loc='upper right')\n",
304 | "plt.show()"
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {
310 | "id": "tqIXA0OiMIBg"
311 | },
312 | "source": [
313 | "# Building the ROC Curves"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {
320 | "id": "qaGLi3FEMIBh"
321 | },
322 | "outputs": [],
323 | "source": [
324 | "labels = ['gluon', 'quark', 'W', 'Z', 'top']"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "metadata": {
331 | "id": "ilPbqXu-MIBh"
332 | },
333 | "outputs": [],
334 | "source": [
335 | "import pandas as pd\n",
336 | "from sklearn.metrics import roc_curve, auc\n",
337 | "predict_val = model.predict(X_val)\n",
338 | "df = pd.DataFrame()\n",
339 | "fpr = {}\n",
340 | "tpr = {}\n",
341 | "auc1 = {}\n",
342 | "\n",
343 | "plt.figure()\n",
344 | "for i, label in enumerate(labels):\n",
345 | " df[label] = y_val[:,i]\n",
346 | " df[label + '_pred'] = predict_val[:,i]\n",
347 | "\n",
348 | " fpr[label], tpr[label], threshold = roc_curve(df[label],df[label+'_pred'])\n",
349 | "\n",
350 | " auc1[label] = auc(fpr[label], tpr[label])\n",
351 | "\n",
352 | " plt.plot(tpr[label],fpr[label],label='%s tagger, auc = %.1f%%'%(label,auc1[label]*100.))\n",
353 | "plt.semilogy()\n",
354 | "plt.xlabel(\"true positive rate\")\n",
355 | "plt.ylabel(\"false positive rate\")\n",
356 | "plt.ylim(0.000001,1)\n",
357 | "plt.grid(True)\n",
358 | "plt.legend(loc='lower right')\n",
359 | "plt.show()"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": null,
365 | "metadata": {
366 | "id": "I6KMJTT-MIBh"
367 | },
368 | "outputs": [],
369 | "source": []
370 | }
371 | ],
372 | "metadata": {
373 | "colab": {
374 | "provenance": []
375 | },
376 | "kernelspec": {
377 | "display_name": "Python 3 (ipykernel)",
378 | "language": "python",
379 | "name": "python3"
380 | },
381 | "language_info": {
382 | "codemirror_mode": {
383 | "name": "ipython",
384 | "version": 3
385 | },
386 | "file_extension": ".py",
387 | "mimetype": "text/x-python",
388 | "name": "python",
389 | "nbconvert_exporter": "python",
390 | "pygments_lexer": "ipython3",
391 | "version": "3.8.9"
392 | }
393 | },
394 | "nbformat": 4,
395 | "nbformat_minor": 1
396 | }
397 |
--------------------------------------------------------------------------------
/jet_notebooks/5.JetTaggingRNN.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "id": "R8qKRk-1wCz8"
14 | },
15 | "source": [
16 | "# Training a Jet Tagging with **Recurrent Neural Network** \n",
17 | "\n",
18 | "---\n",
19 | "In this notebook, we perform a Jet identification task using a multiclass classifier with a GRU unit.\n",
20 | "Gated Recurrent Units are one kind of RNNs. \n",
21 | "\n",
22 | "The problem consists in identifying a given jet as a quark, a gluon, a W, a Z, or a top,\n",
23 | "based on a jet image, i.e., a 2D histogram of the transverse momentum ($p_T$) deposited in each of 100x100\n",
24 | "bins of a square window of the ($\\eta$, $\\phi$) plane, centered along the jet axis.\n",
25 | "\n",
26 | "For details on the physics problem, see https://arxiv.org/pdf/1804.06913.pdf \n",
27 | "\n",
28 | "For details on the dataset, see Notebook1\n",
29 | "\n",
30 | "---"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {
37 | "id": "PneHRgABwCz-"
38 | },
39 | "outputs": [],
40 | "source": [
41 | "import os\n",
42 | "import h5py\n",
43 | "import glob\n",
44 | "import numpy as np\n",
45 | "import matplotlib.pyplot as plt\n",
46 | "%matplotlib inline"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {
52 | "id": "Rj_ebDHYwCz_"
53 | },
54 | "source": [
55 | "# Preparation of the training and validation samples\n",
56 | "\n",
57 | "---\n",
58 | "In order to import the dataset, we now\n",
59 | "- clone the dataset repository (to import the data in Colab)\n",
60 | "- load the h5 files in the data/ repository\n",
61 | "- extract the data we need: a target and jetImage \n",
62 | "\n",
63 | "To type shell commands, we start the command line with !\n",
64 | "\n",
65 | "nb, if you are running locally you can skip the step below and change the paths later to point to the folder with your previous download of the datasets."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {
72 | "id": "6u1OqBz6wC0A"
73 | },
74 | "outputs": [],
75 | "source": [
76 | "! curl https://cernbox.cern.ch/s/zZDKjltAcJW0RB7/download -o Data-MLtutorial.tar.gz\n",
77 | "! tar -xvzf Data-MLtutorial.tar.gz \n",
78 | "! ls Data-MLtutorial/JetDataset/\n",
79 | "! rm Data-MLtutorial.tar.gz "
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {
86 | "id": "tRdVzVZawC0C"
87 | },
88 | "outputs": [],
89 | "source": [
90 | "target = np.array([])\n",
91 | "jetList = np.array([])\n",
92 | "# we cannot load all data on Colab. So we just take a few files\n",
93 | "datafiles = ['Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5',\n",
94 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5',\n",
95 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5',\n",
96 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5',\n",
97 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5']\n",
98 | "# if you are running locallt, you can use the full dataset doing\n",
99 | "# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
100 | "for fileIN in datafiles:\n",
101 | " print(\"Appending %s\" %fileIN)\n",
102 | " f = h5py.File(fileIN)\n",
103 | " myJetList = np.array(f.get(\"jetConstituentList\"))\n",
104 | " mytarget = np.array(f.get('jets')[0:,-6:-1])\n",
105 | " jetList = np.concatenate([jetList, myJetList], axis=0) if jetList.size else myJetList\n",
106 | " target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
107 | " del myJetList, mytarget\n",
108 | " f.close()\n",
109 | "print(target.shape, jetList.shape)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {
115 | "id": "5_jiNpnuwC0D"
116 | },
117 | "source": [
118 | "The dataset consists of 50000 with up to 100 particles in each jet. These 100 particles have been used to fill the 100x100 jet images.\n",
119 | "\n",
120 | "---\n",
121 | "\n",
122 | "We now shuffle the data, splitting them into a training and a validation dataset with 2:1 ratio"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {
129 | "id": "46JSP9FcwC0D"
130 | },
131 | "outputs": [],
132 | "source": [
133 | "from sklearn.model_selection import train_test_split\n",
134 | "X_train, X_val, y_train, y_val = train_test_split(jetList, target, test_size=0.33)\n",
135 | "print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)\n",
136 | "del jetList, target"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {
142 | "id": "y7raw_XjwC0E"
143 | },
144 | "source": [
145 | "# Building the RNN model\n",
146 | "\n",
147 | "A recurrent neural network (RNN) is a type of NN which processes sequential data or time series data. They are commonly used for ordinal or temporal problems, such as natural language processing (NLP). They are distinguished by their “memory” as they take information from prior inputs to influence the current input and output.\n",
148 | "\n",
149 | "\n",
150 | "

\n",
151 | "
\n",
152 | "\n",
153 | "In this notebook we treat the particles clustered by the jet algorithm as an ordered sequence processed through a type of RNN called [Gated Recurrent Units](https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be). GRUs are improved version of standard RNN that solves the solves the vanishing gradient problem. The update and reset gates decide what information should be passed to the output making the model able to keep information from long ago, without washing it through time or remove information which is irrelevant to the prediction. The main ingredients are:\n",
154 | "\n",
155 | "- number of hidden units: the size of the hidden state *ht*\n",
156 | "- gates activation function (typically a sigmoid between 0 and 1 to either let no flow or complete flow of information throughout the gates)\n",
157 | "- current state activation function (typically a tanh between -1 and 1 to allow for increases and decreases in the state)"
158 | ]
159 | },
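160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "*(Added example, not part of the original tutorial)* A minimal, self-contained sketch of the \"number of hidden units\" point above: a `GRU(units=40)` layer reads a sequence of 100 particles with 16 features each and, by default, returns only the final hidden state, i.e. one vector of length 40 per jet (set `return_sequences=True` to get the hidden state at every timestep instead)."
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": null,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "# Added illustration of the GRU hidden-state size (self-contained)\n",
174 | "import numpy as np\n",
175 | "from tensorflow.keras.layers import GRU\n",
176 | "dummy = np.zeros((1, 100, 16), dtype='float32')           # (batch, timesteps, features)\n",
177 | "print(GRU(units=40)(dummy).shape)                         # (1, 40): final hidden state only\n",
178 | "print(GRU(units=40, return_sequences=True)(dummy).shape)  # (1, 100, 40): hidden state at every step"
179 | ]
180 | },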
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "metadata": {
164 | "id": "hKw9zBUFwC0F"
165 | },
166 | "outputs": [],
167 | "source": [
168 | "# keras imports\n",
169 | "from tensorflow.keras.models import Model\n",
170 | "from tensorflow.keras.layers import Dense, Input, GRU, Dropout, Masking\n",
171 | "from tensorflow.keras.utils import plot_model\n",
172 | "from tensorflow.keras import backend as K\n",
173 | "from tensorflow.keras import metrics\n",
174 | "from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {
181 | "id": "dnBy3xU8wC0G"
182 | },
183 | "outputs": [],
184 | "source": [
185 | "featureArrayLength = (X_train.shape[1],X_train.shape[2])\n",
186 | "dropoutRate = 0.25"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {
193 | "id": "dTs6zTvAwC0G"
194 | },
195 | "outputs": [],
196 | "source": [
197 | "####\n",
198 | "inputList = Input(shape=(featureArrayLength))\n",
199 | "x = Masking(mask_value=0.0)(inputList)\n",
200 | "x = GRU(units=40, activation=\"tanh\", recurrent_activation='sigmoid')(x)\n",
201 | "x = Dropout(dropoutRate)(x)\n",
202 | "#\n",
203 | "x = Dense(20, activation='relu')(x)\n",
204 | "x = Dropout(dropoutRate)(x)\n",
205 | "#\n",
206 | "x = Dense(10, activation='relu')(x)\n",
207 | "x = Dropout(dropoutRate)(x)\n",
208 | "x = Dense(5, activation='relu')(x)\n",
209 | "#\n",
210 | "output = Dense(5, activation='softmax')(x)\n",
211 | "####\n",
212 | "model = Model(inputs=inputList, outputs=output)"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {
218 | "id": "BgQYEVdRQ7qA"
219 | },
220 | "source": [
221 | "The `Masking` layer in Keras is used to mask certain timesteps in a sequence input, so that they are effectively ignored during processing by subsequent layers. This can be useful when dealing with variable-length sequences of data, where some elements of the sequence may not be present for certain examples.\n",
222 | "\n",
223 | "The `mask_value` argument specifies the value to use as the mask. Any input timestep that has this value will be masked (i.e. ignored) by subsequent layers in the model. Here, the value chosen for the mask is 0.0."
224 | ]
225 | },
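226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "*(Added example, not part of the original tutorial)* A minimal, self-contained sketch of what `Masking(mask_value=0.0)` does: for a toy batch with one real timestep and one all-zero (padded) timestep, the layer's mask flags the padded timestep as False, so mask-aware downstream layers such as the GRU skip it."
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "metadata": {},
237 | "outputs": [],
238 | "source": [
239 | "# Added illustration of the Masking layer (self-contained)\n",
240 | "import numpy as np\n",
241 | "from tensorflow.keras.layers import Masking\n",
242 | "toy = np.array([[[1.0, 2.0],     # a real particle\n",
243 | "                 [0.0, 0.0]]])   # a zero-padded slot\n",
244 | "masking = Masking(mask_value=0.0)\n",
245 | "print(masking.compute_mask(toy))  # [[ True False]]: the padded timestep is ignored"
246 | ]
247 | },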
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "metadata": {
230 | "id": "A2L7JftSwC0H"
231 | },
232 | "outputs": [],
233 | "source": [
234 | "model.compile(loss='categorical_crossentropy', optimizer='adam')\n",
235 | "model.summary()"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {
241 | "id": "LSVNXKuFwC0I"
242 | },
243 | "source": [
244 | "We now train the model"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {
251 | "id": "PBbQioJqwC0I"
252 | },
253 | "outputs": [],
254 | "source": [
255 | "batch_size = 128\n",
256 | "n_epochs = 200"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "id": "ekZQBaN8wC0J"
264 | },
265 | "outputs": [],
266 | "source": [
267 | "# train \n",
268 | "history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size, verbose = 2,\n",
269 | " validation_data=(X_val, y_val),\n",
270 | " callbacks = [\n",
271 | " EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
272 | " ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
273 | " TerminateOnNaN()])"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {
280 | "id": "GGfPysYawC0K"
281 | },
282 | "outputs": [],
283 | "source": [
284 | "# plot training history\n",
285 | "plt.plot(history.history['loss'])\n",
286 | "plt.plot(history.history['val_loss'])\n",
287 | "plt.yscale('log')\n",
288 | "plt.title('Training History')\n",
289 | "plt.ylabel('loss')\n",
290 | "plt.xlabel('epoch')\n",
291 | "plt.legend(['training', 'validation'], loc='upper right')\n",
292 | "plt.show()"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {
298 | "id": "5CC6I3WWwC0K"
299 | },
300 | "source": [
301 | "# Building the ROC Curves"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {
308 | "id": "rwnriP9SwC0K"
309 | },
310 | "outputs": [],
311 | "source": [
312 | "labels = ['gluon', 'quark', 'W', 'Z', 'top']"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {
319 | "id": "o6mK4T75wC0L"
320 | },
321 | "outputs": [],
322 | "source": [
323 | "import pandas as pd\n",
324 | "from sklearn.metrics import roc_curve, auc\n",
325 | "predict_val = model.predict(X_val)\n",
326 | "df = pd.DataFrame()\n",
327 | "fpr = {}\n",
328 | "tpr = {}\n",
329 | "auc1 = {}\n",
330 | "\n",
331 | "plt.figure()\n",
332 | "for i, label in enumerate(labels):\n",
333 | " df[label] = y_val[:,i]\n",
334 | " df[label + '_pred'] = predict_val[:,i]\n",
335 | "\n",
336 | " fpr[label], tpr[label], threshold = roc_curve(df[label],df[label+'_pred'])\n",
337 | "\n",
338 | " auc1[label] = auc(fpr[label], tpr[label])\n",
339 | "\n",
340 | " plt.plot(tpr[label],fpr[label],label='%s tagger, auc = %.1f%%'%(label,auc1[label]*100.))\n",
341 | "plt.semilogy()\n",
342 | "plt.xlabel(\"sig. efficiency\")\n",
343 | "plt.ylabel(\"bkg. mistag rate\")\n",
344 | "plt.ylim(0.000001,1)\n",
345 | "plt.grid(True)\n",
346 | "plt.legend(loc='lower right')\n",
347 | "plt.show()"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": null,
353 | "metadata": {
354 | "id": "OpZGatsJwC0L"
355 | },
356 | "outputs": [],
357 | "source": []
358 | }
359 | ],
360 | "metadata": {
361 | "colab": {
362 | "provenance": []
363 | },
364 | "kernelspec": {
365 | "display_name": "Python 3 (ipykernel)",
366 | "language": "python",
367 | "name": "python3"
368 | },
369 | "language_info": {
370 | "codemirror_mode": {
371 | "name": "ipython",
372 | "version": 3
373 | },
374 | "file_extension": ".py",
375 | "mimetype": "text/x-python",
376 | "name": "python",
377 | "nbconvert_exporter": "python",
378 | "pygments_lexer": "ipython3",
379 | "version": "3.8.9"
380 | }
381 | },
382 | "nbformat": 4,
383 | "nbformat_minor": 1
384 | }
385 |
--------------------------------------------------------------------------------
/jet_notebooks/8.JetAnomalyDetectionAE.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Training an Anomalous Jet Detector with **AE** \n",
15 | "\n",
16 | "---\n",
17 | "In this notebook, we train an unsupervised algorithm capable of compressing a jet features into a low-dimension laten space and, from there, reconstruct the input data. This type of architecture is **autoencoder**:\n",
18 | "\n",
19 | "
\n",
20 | "\n",
21 | "The distance between the input and the output is used to identify rare jets. When trained on background QCD jets (quarks and gluons) it will learn to well reconstruct them yeilding a small reconstruction loss (mean squared error distance) whenever the trained model is evaluated on those. When the trained model sees a different \"anomalous\" jet it will yield a large loss. Applying a lower treshold on the loss, one can veto standard QCD jets and select a sample enriched in anomalous jets (W, Z, top, etc). \n",
22 | "\n",
23 | "---"
24 | ]
25 | },
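26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "*(Added example, not part of the original tutorial)* A toy-number sketch of the selection logic described above: given per-jet reconstruction losses, a cut keeping only jets with loss above some threshold rejects most well-reconstructed (QCD-like) jets and keeps the poorly-reconstructed (anomalous) ones. The numbers below are made up purely for illustration."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {},
37 | "outputs": [],
38 | "source": [
39 | "# Toy illustration of anomaly selection via a loss threshold (made-up numbers)\n",
40 | "import numpy as np\n",
41 | "toy_loss  = np.array([0.1, 0.2, 0.15, 3.0, 0.12, 2.5])  # per-jet reconstruction loss\n",
42 | "threshold = 1.0                                          # keep jets with loss above this value\n",
43 | "print(\"selected as anomalous:\", np.where(toy_loss > threshold)[0])  # indices 3 and 5"
44 | ]
45 | },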
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "import os\n",
33 | "import h5py\n",
34 | "import glob\n",
35 | "import numpy as np\n",
36 | "import matplotlib.pyplot as plt\n",
37 | "%matplotlib inline"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "# Preparation of the training and validation samples\n",
45 | "\n",
46 | "---\n",
47 | "In order to import the dataset, we now\n",
48 | "- clone the dataset repository (to import the data in Colab)\n",
49 | "- load the h5 files in the data/ repository\n",
50 | "- extract the data we need: a target and jetImage \n",
51 | "\n",
52 | "To type shell commands, we start the command line with !\n",
53 | "\n",
54 | "**nb, if you are running locally and you have already downloaded the datasets you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "! curl https://cernbox.cern.ch/s/zZDKjltAcJW0RB7/download -o Data-MLtutorial.tar.gz\n",
64 | "! tar -xvzf Data-MLtutorial.tar.gz \n",
65 | "! ls Data-MLtutorial/JetDataset/\n",
66 | "! rm Data-MLtutorial.tar.gz "
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "target = np.array([])\n",
76 | "features = np.array([])\n",
77 | "# we cannot load all data on Colab. So we just take a few files\n",
78 | "datafiles = ['Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5',\n",
79 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5',\n",
80 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5',\n",
81 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5',\n",
82 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5']\n",
83 | "# if you are running locallt, you can use the full dataset doing\n",
84 | "# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
85 | "for fileIN in datafiles:\n",
86 | " print(\"Appending %s\" %fileIN)\n",
87 | " f = h5py.File(fileIN)\n",
88 | " myFeatures = np.array(f.get(\"jets\")[:,[12, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 52]])\n",
89 | " mytarget = np.array(f.get('jets')[0:,-6:-1])\n",
90 | " features = np.concatenate([features, myFeatures], axis=0) if features.size else myFeatures\n",
91 | " target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
92 | " f.close()\n",
93 | "print(target.shape, features.shape)"
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "# we standardize the data, so that the mean is = 0 and rms = 1 \n",
103 | "from sklearn.preprocessing import StandardScaler\n",
104 | "print(np.mean(features[:,10]), np.var(features[:,10]))\n",
105 | "scaler = StandardScaler()\n",
106 | "scaler.fit(features)\n",
107 | "features = scaler.transform(features)\n",
108 | "print(np.mean(features[:,10]), np.var(features[:,10]))"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "We now separate the dataset in 4:\n",
116 | "- a training dataset, consisting of quarks and gluons\n",
117 | "- three 'anomalous jets' samples: W, Z, and top"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "features_standard = features[np.argmax(target,axis=1)<2]\n",
127 | "features_W = features[np.argmax(target,axis=1)==2]\n",
128 | "features_Z = features[np.argmax(target,axis=1)==3]\n",
129 | "features_t = features[np.argmax(target,axis=1)==4]\n",
130 | "print(features_standard.shape, features_W.shape, features_Z.shape, features_t.shape)"
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "Notice that this is an unsupervised algorithm, so we don't need the target array anymore.\n",
138 | "Nevertheless, we keep a part of it around, since it might be useful to test the response \n",
139 | "of the algorithm to quarks and gluons separetly"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "label_standard = target[np.argmax(target,axis=1)<2]\n",
149 | "print(label_standard)"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "We now shuffle the standard-jet data and its labels, splitting them into a training, a validation+test dataset with 2:1:1 ratio. \n",
157 | "\n",
158 | "Then we separate the validation+test in two halves (training and validation)"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "from sklearn.model_selection import train_test_split\n",
168 | "\n",
169 | "#split into training and test\n",
170 | "X_learn, X_test, label_learn, label_test = train_test_split(features_standard, label_standard, test_size=0.2)\n",
171 | "print(X_learn.shape, label_learn.shape, X_test.shape, label_test.shape)\n",
172 | "\n",
173 | "#split the training dataset into training and validation\n",
174 | "X_train, X_val, label_train, label_val = train_test_split(X_learn, label_learn, test_size=0.2)\n",
175 | "print(X_train.shape, label_train.shape, X_val.shape, label_val.shape, X_test.shape, label_test.shape)\n",
176 | "\n",
177 | "del features_standard, label_standard, features, target, X_learn, label_learn"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "# Building the AE model"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "# keras imports\n",
194 | "from tensorflow.keras.models import Model\n",
195 | "from tensorflow.keras.layers import Dense, Input, Flatten\n",
196 | "from tensorflow.keras.layers import BatchNormalization, Activation\n",
197 | "from tensorflow.keras.utils import plot_model\n",
198 | "from tensorflow.keras import backend as K\n",
199 | "from tensorflow.keras import metrics\n",
200 | "from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {},
207 | "outputs": [],
208 | "source": [
209 | "#---------\n",
210 | "# Enncoder\n",
211 | "#---------\n",
212 | "inputLayer = Input(shape=(16))\n",
213 | "#\n",
214 | "enc = Dense(10)(inputLayer)\n",
215 | "enc = Activation('elu')(enc)\n",
216 | "#\n",
217 | "enc = Dense(5)(enc)\n",
218 | "enc = Activation('elu')(enc)\n",
219 | "\n",
220 | "#---------\n",
221 | "# Decoder\n",
222 | "#---------\n",
223 | "dec = Dense(10)(enc)\n",
224 | "dec = Activation('elu')(dec)\n",
225 | "#\n",
226 | "dec = Dense(16)(dec)\n",
227 | "autoencoder = Model(inputs=inputLayer, outputs=dec)"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "autoencoder.compile(loss='mse', optimizer='adam')\n",
237 | "autoencoder.summary()"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "We now train the model. Notice the difference with respect to the supervised case\n",
245 | "- the input to the training is (X,X) and nor (X, y). Similarly for the validation dataset\n",
246 | "- the model has no dropout. It is difficult for an unsupervised model to overtran, so there is not really a need"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "metadata": {},
253 | "outputs": [],
254 | "source": [
255 | "batch_size = 128\n",
256 | "n_epochs = 200"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "# train \n",
266 | "history = autoencoder.fit(X_train, X_train, epochs=n_epochs, batch_size=batch_size, verbose = 2,\n",
267 | " validation_data=(X_val, X_val),\n",
268 | " callbacks = [\n",
269 | " EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
270 | " ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
271 | " TerminateOnNaN()])"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": null,
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "# plot training history\n",
281 | "plt.plot(history.history['loss'])\n",
282 | "plt.plot(history.history['val_loss'])\n",
283 | "plt.yscale('log')\n",
284 | "plt.title('Training History')\n",
285 | "plt.ylabel('loss')\n",
286 | "plt.xlabel('epoch')\n",
287 | "plt.legend(['training', 'validation'], loc='upper right')\n",
288 | "plt.show()"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "# Loss Distributions"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "labels = ['W', 'Z', 'top']"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "anomaly = [features_W, features_Z, features_t]\n",
314 | "predictedQCD = autoencoder.predict(X_test)\n",
315 | "predicted_anomaly = []\n",
316 | "for i in range(len(labels)):\n",
317 | " predicted_anomaly.append(autoencoder.predict(anomaly[i]))"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "def mse(data_in, data_out):\n",
327 | " mse = (data_out-data_in)*(data_out-data_in)\n",
328 | " # sum over features\n",
329 | " mse = mse.sum(-1)\n",
330 | " return mse "
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {},
337 | "outputs": [],
338 | "source": [
339 | "lossQCD = mse(X_test, predictedQCD)\n",
340 | "loss_anomaly = []\n",
341 | "for i in range(len(labels)):\n",
342 | " loss_anomaly.append(mse(anomaly[i], predicted_anomaly[i]))"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": [
351 | "maxScore = np.max(lossQCD)\n",
352 | "# plot QCD\n",
353 | "plt.figure()\n",
354 | "plt.hist(lossQCD, bins=100, label='QCD', density=True, range=(0, maxScore), \n",
355 | " histtype='step', fill=False, linewidth=1.5)\n",
356 | "plt.semilogy()\n",
357 | "plt.xlabel(\"AE Loss\")\n",
358 | "plt.ylabel(\"Probability (a.u.)\")\n",
359 | "plt.grid(True)\n",
360 | "plt.legend(loc='upper right')\n",
361 | "plt.show()"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "maxScore = np.max(lossQCD)\n",
371 | "# plot QCD\n",
372 | "plt.figure()\n",
373 | "plt.hist(lossQCD, bins=100, label='QCD', density=True, range=(0, maxScore), \n",
374 | " histtype='step', fill=False, linewidth=1.5)\n",
375 | "for i in range(len(labels)):\n",
376 | " plt.hist(loss_anomaly[i], bins=100, label=labels[i], density=True, range=(0, maxScore),\n",
377 | " histtype='step', fill=False, linewidth=1.5)\n",
378 | "plt.semilogy()\n",
379 | "plt.xlabel(\"AE Loss\")\n",
380 | "plt.ylabel(\"Probability (a.u.)\")\n",
381 | "plt.grid(True)\n",
382 | "plt.legend(loc='upper right')\n",
383 | "plt.show()"
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "# Building the ROC Curves"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": null,
396 | "metadata": {},
397 | "outputs": [],
398 | "source": [
399 | "from sklearn.metrics import roc_curve, auc\n",
400 | "plt.figure()\n",
401 | "targetQCD = np.zeros(lossQCD.shape[0])\n",
402 | "for i, label in enumerate(labels):\n",
403 | " print(loss_anomaly[i].shape, targetQCD.shape)\n",
404 | " trueVal = np.concatenate((np.ones(loss_anomaly[i].shape[0]),targetQCD))\n",
405 | " predVal = np.concatenate((loss_anomaly[i],lossQCD))\n",
406 | " print(trueVal.shape, predVal.shape)\n",
407 | " fpr, tpr, threshold = roc_curve(trueVal,predVal)\n",
408 | " auc1= auc(fpr, tpr)\n",
409 | " plt.plot(tpr,fpr,label='%s Anomaly Detection, auc = %.1f%%'%(label,auc1*100.))\n",
410 | "#plt.semilogy()\n",
411 | "plt.xlabel(\"sig. efficiency\")\n",
412 | "plt.ylabel(\"bkg. mistag rate\")\n",
413 | "plt.grid(True)\n",
414 | "plt.legend(loc='lower right')\n",
415 | "plt.show()"
416 | ]
417 | },
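418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
422 | "A possible next step (a minimal sketch, reusing the arrays defined above) is to turn the loss into an event selection by cutting at a fixed QCD mistag rate, e.g. keeping only the 1% of QCD jets with the largest loss:\n",
423 | "\n",
424 | "```python\n",
425 | "threshold = np.percentile(lossQCD, 99)          # keep the 1% of QCD jets with largest loss\n",
426 | "for i, label in enumerate(labels):\n",
427 | "    eff = (loss_anomaly[i] > threshold).mean()  # fraction of anomalous jets passing the cut\n",
428 | "    print('%s efficiency at 1%% QCD mistag rate: %.1f%%' % (label, eff * 100.))\n",
429 | "```"
430 | ]
431 | },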
418 | {
419 | "cell_type": "code",
420 | "execution_count": null,
421 | "metadata": {},
422 | "outputs": [],
423 | "source": []
424 | }
425 | ],
426 | "metadata": {
427 | "kernelspec": {
428 | "display_name": "Python 3 (ipykernel)",
429 | "language": "python",
430 | "name": "python3"
431 | },
432 | "language_info": {
433 | "codemirror_mode": {
434 | "name": "ipython",
435 | "version": 3
436 | },
437 | "file_extension": ".py",
438 | "mimetype": "text/x-python",
439 | "name": "python",
440 | "nbconvert_exporter": "python",
441 | "pygments_lexer": "ipython3",
442 | "version": "3.8.9"
443 | }
444 | },
445 | "nbformat": 4,
446 | "nbformat_minor": 2
447 | }
448 |
--------------------------------------------------------------------------------
/jet_notebooks/9.JetAnomalyDetectionVAE.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Training an Anomalous Jet Detector with **VAE** \n",
15 | "\n",
16 | "---\n",
17 | "In this notebook, we train an unsupervised algorithm capable of compressing a jet features into a low-dimension laten space and, from there, reconstruct the input data. This type of architecture is **autoencoder**:\n",
18 | "\n",
19 | "\n",
20 | "

\n",
21 | "
\n",
22 | "\n",
23 | "\n",
24 | "The distance between the input and the output is used to identify rare jets. When trained on background QCD jets (quarks and gluons) it will learn to well reconstruct them yeilding a small reconstruction loss (mean squared error distance) whenever the trained model is evaluated on those. When the trained model sees a different \"anomalous\" jet it will yield a large loss. Applying a lower treshold on the loss, one can veto standard QCD jets and select a sample enriched in anomalous jets (W, Z, top, etc). \n",
25 | "\n",
26 | "We will use below a special autoencoder called **variational autoencoder**, which is an autoencoder whose training is regularised to avoid overfitting and ensure that the latent space has good properties that enable generative process. Instead of encoding the inputs as a single point, we encode it as a multi-dimensional gaussian distribution over the latent space:\n",
27 | "\n",
28 | "\n",
29 | "

\n",
30 | "
\n",
31 | "\n",
32 | "The loss is now the sum of two terms:\n",
33 | "\n",
34 | "- the *MSE loss* that makes the encoding-decoding reconstruction scheme as performant as possible\n",
35 | "- the *Kullback-Leibler (KL) loss* which represents the distance between gaussian pdfs and acts as a regularization term on the latent space\n",
36 | "\n",
37 | "---"
38 | ]
39 | },
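40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "Schematically, the per-jet quantity minimised below is\n",
45 | "\n",
46 | "$$ \\mathcal{L}(x) = \\sum_{j} \\left(x_j - \\hat{x}_j\\right)^2 \\, + \\, D_{KL}\\big(\\, \\mathcal{N}(\\mu(x), \\sigma^2(x)) \\,\\|\\, \\mathcal{N}(0, I) \\,\\big), $$\n",
47 | "\n",
48 | "where $\\hat{x}$ is the reconstructed input and $\\mu(x)$, $\\sigma(x)$ are produced by the encoder. The explicit form of the KL term is given further below, next to its implementation."
49 | ]
50 | },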
40 | {
41 | "cell_type": "code",
42 | "execution_count": null,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "import os\n",
47 | "import h5py\n",
48 | "import glob\n",
49 | "import numpy as np\n",
50 | "import matplotlib.pyplot as plt\n",
51 | "%matplotlib inline"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "# Preparation of the training and validation samples\n",
59 | "\n",
60 | "---\n",
61 | "In order to import the dataset, we now\n",
62 | "- clone the dataset repository (to import the data in Colab)\n",
63 | "- load the h5 files in the data/ repository\n",
64 | "- extract the data we need: a target and jetImage \n",
65 | "\n",
66 | "To type shell commands, we start the command line with !\n",
67 | "\n",
68 | "**nb, if you are running locally and you have already downloaded the datasets you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "! curl https://cernbox.cern.ch/s/zZDKjltAcJW0RB7/download -o Data-MLtutorial.tar.gz\n",
78 | "! tar -xvzf Data-MLtutorial.tar.gz \n",
79 | "! ls Data-MLtutorial/JetDataset/\n",
80 | "! rm Data-MLtutorial.tar.gz "
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "target = np.array([])\n",
90 | "features = np.array([])\n",
91 | "# we cannot load all data on Colab. So we just take a few files\n",
92 | "datafiles = ['Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5',\n",
93 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5',\n",
94 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5',\n",
95 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5',\n",
96 | " 'Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5']\n",
97 | "# if you are running locallt, you can use the full dataset doing\n",
98 | "# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
99 | "for fileIN in datafiles:\n",
100 | " print(\"Appending %s\" %fileIN)\n",
101 | " f = h5py.File(fileIN)\n",
102 | " myFeatures = np.array(f.get(\"jets\")[:,[12, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 52]], dtype=np.float32)\n",
103 | " mytarget = np.array(f.get('jets')[0:,-6:-1])\n",
104 | " features = np.concatenate([features, myFeatures], axis=0) if features.size else myFeatures\n",
105 | " target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
106 | " f.close()\n",
107 | "print(target.shape, features.shape)"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "# we standardize the data, so that the mean is = 0 and rms = 1 \n",
117 | "from sklearn.preprocessing import StandardScaler\n",
118 | "print(np.mean(features[:,10]), np.var(features[:,10]))\n",
119 | "scaler = StandardScaler()\n",
120 | "scaler.fit(features)\n",
121 | "features = scaler.transform(features)\n",
122 | "print(np.mean(features[:,10]), np.var(features[:,10]))"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "We now separate the dataset in 4:\n",
130 | "- a training dataset, consisting of quarks and gluons\n",
131 | "- three 'anomalous jets' samples: W, Z, and top"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "features_standard = features[np.argmax(target,axis=1)<2]\n",
141 | "features_W = features[np.argmax(target,axis=1)==2]\n",
142 | "features_Z = features[np.argmax(target,axis=1)==3]\n",
143 | "features_t = features[np.argmax(target,axis=1)==4]\n",
144 | "print(features_standard.shape, features_W.shape, features_Z.shape, features_t.shape)"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "Notice that this is an unsupervised algorithm, so we don't need the target array anymore.\n",
152 | "Nevertheless, we keep a part of it around, since it might be useful to test the response \n",
153 | "of the algorithm to quarks and gluons separetly"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "label_standard = target[np.argmax(target,axis=1)<2]\n",
163 | "print(label_standard)"
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "We now shuffle the standard-jet data and its labels, splitting them into a training, a validation+test dataset with 2:1:1 ratio. \n",
171 | "\n",
172 | "Then we separate the validation+test in two halves (training and validation)"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "from sklearn.model_selection import train_test_split\n",
182 | "\n",
183 | "#split into training and test\n",
184 | "X_learn, X_test, label_learn, label_test = train_test_split(features_standard, label_standard, test_size=0.2)\n",
185 | "print(X_learn.shape, label_learn.shape, X_test.shape, label_test.shape)\n",
186 | "\n",
187 | "#split the training dataset into training and validation\n",
188 | "X_train, X_val, label_train, label_val = train_test_split(X_learn, label_learn, test_size=0.2)\n",
189 | "print(X_train.shape, label_train.shape, X_val.shape, label_val.shape, X_test.shape, label_test.shape)\n",
190 | "\n",
191 | "del features_standard, label_standard, features, target, X_learn, label_learn"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "# Building the VAE model"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "# keras imports\n",
208 | "from tensorflow.keras.models import Model\n",
209 | "from tensorflow.keras.layers import Dense, Input, Lambda, Layer\n",
210 | "from tensorflow.keras.layers import BatchNormalization, Activation\n",
211 | "from tensorflow.keras.utils import plot_model\n",
212 | "from tensorflow.keras.optimizers import Adam\n",
213 | "from tensorflow.keras import backend as K\n",
214 | "from tensorflow.keras import metrics\n",
215 | "from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
216 | ]
217 | },
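218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "The helper below implements the so-called *reparameterization trick*: instead of sampling $z$ directly from $\\mathcal{N}(\\mu, \\sigma^2)$, we sample $\\epsilon \\sim \\mathcal{N}(0, I)$ and compute\n",
223 | "\n",
224 | "$$ z = \\mu + \\sigma \\, \\epsilon, \\qquad \\sigma = e^{\\log\\sigma^2/2}, $$\n",
225 | "\n",
226 | "so that the sampling step stays differentiable with respect to $\\mu$ and $\\log\\sigma^2$."
227 | ]
228 | },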
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "def sample_z(args):\n",
225 | " z_mean, z_log_var = args\n",
226 | " batch = K.shape(z_mean)[0]\n",
227 | " dim = K.int_shape(z_mean)[1]\n",
228 | " eps = K.random_normal(shape=(batch, dim))\n",
229 | " return z_mean + K.exp(z_log_var / 2) * eps"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "Loss definition: The first block of code is just the reconstruction error which is given by the MSE. The second block of code calculates the KL-divergence analytically and adds it to the loss function with the line self.add_loss. It represents the KL-divergence as just another layer in the neural network with the inputs equal to the outputs (means and variances in latent space)"
237 | ]
238 | },
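239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "For reference, the quantity computed in `KLDivergenceLayer` is the KL divergence between the encoded Gaussian and a unit-Gaussian prior, which for diagonal Gaussians has the closed form\n",
244 | "\n",
245 | "$$ D_{KL}\\big(\\mathcal{N}(\\mu, \\sigma^2) \\,\\|\\, \\mathcal{N}(0, I)\\big) = -\\frac{1}{2} \\sum_{j} \\left( 1 + \\log\\sigma_j^2 - \\mu_j^2 - \\sigma_j^2 \\right), $$\n",
246 | "\n",
247 | "which is exactly the `kl_batch` expression in the code below (with `log_var` $= \\log\\sigma^2$)."
248 | ]
249 | },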
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "from tensorflow.keras import backend as K\n",
246 | "\n",
247 | "# Define loss\n",
248 | "def myloss(y_true, y_pred):\n",
249 | " # mse\n",
250 | " sum_sq = (y_true-y_pred)*(y_true-y_pred)\n",
251 | " return K.sum(sum_sq, axis=-1)\n",
252 | "\n",
253 | "class KLDivergenceLayer(Layer):\n",
254 | "\n",
255 | " \"\"\" Identity transform layer that adds KL divergence\n",
256 | " to the final model loss.\n",
257 | " \"\"\"\n",
258 | "\n",
259 | " def __init__(self, *args, **kwargs):\n",
260 | " self.is_placeholder = True\n",
261 | " super(KLDivergenceLayer, self).__init__(*args, **kwargs)\n",
262 | "\n",
263 | " def call(self, inputs):\n",
264 | " mu, log_var = inputs\n",
265 | " kl_batch = - .5 * K.sum(1 + log_var -\n",
266 | " K.square(mu) -\n",
267 | " K.exp(log_var), axis=-1)\n",
268 | " self.add_loss(K.mean(kl_batch), inputs=inputs)\n",
269 | " return inputs"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": [
278 | "def vae(input_dim, latent_dim, beta):\n",
279 | " #encoder\n",
280 | " input_encoder = Input(shape=(input_dim), name='encoder_input')\n",
281 | " x = Dense(10, activation='elu')(input_encoder)\n",
282 | " z_mu = Dense(latent_dim, name='latent_mu')(x)\n",
283 | " z_log_var = Dense(latent_dim, name='latent_logvar')(x)\n",
284 | " z_mu, z_log_var = KLDivergenceLayer()([z_mu, z_log_var])\n",
285 | " \n",
286 | " z = Lambda(sample_z, output_shape=(latent_dim, ), name='z')([z_mu, z_log_var])\n",
287 | " encoder = Model(inputs=input_encoder, outputs=[z_mu, z_log_var, z], name='encoder')\n",
288 | " encoder.summary()\n",
289 | " \n",
290 | " #decoder\n",
291 | " input_decoder = Input(shape=(latent_dim,), name='decoder_input')\n",
292 | " x = Dense(10, activation='elu')(input_decoder)\n",
293 | " dec = Dense(input_dim, activation='linear')(x) \n",
294 | " decoder = Model(inputs=input_decoder, outputs=dec, name='decoder')\n",
295 | " decoder.summary()\n",
296 | " \n",
297 | " #vae\n",
298 | " vae_outputs = decoder(encoder(input_encoder)[2])\n",
299 | " vae = Model(input_encoder, vae_outputs, name='vae')\n",
300 | " vae.summary()\n",
301 | " \n",
302 | " return vae, encoder"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "model, encoder = vae(16, 5, 1.0)"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": null,
317 | "metadata": {},
318 | "outputs": [],
319 | "source": [
320 | "model.compile(optimizer='adam', loss=myloss)"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": [
329 | "n_epochs = 200\n",
330 | "batch_size = 128\n",
331 | "# train \n",
332 | "history = model.fit(X_train, X_train, epochs=n_epochs, batch_size=batch_size, verbose = 2,\n",
333 | " validation_data=(X_val, X_val),\n",
334 | " callbacks = [\n",
335 | " EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
336 | " ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
337 | " TerminateOnNaN()])"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "# plot training history\n",
347 | "plt.plot(history.history['loss'])\n",
348 | "plt.plot(history.history['val_loss'])\n",
349 | "plt.yscale('log')\n",
350 | "plt.title('Training History')\n",
351 | "plt.ylabel('loss')\n",
352 | "plt.xlabel('epoch')\n",
353 | "plt.legend(['training', 'validation'], loc='upper right')\n",
354 | "plt.show()"
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "metadata": {},
360 | "source": [
361 | "# Loss Distributions"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "labels = ['W', 'Z', 'top']"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": null,
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "anomaly = [features_W, features_Z, features_t]\n",
380 | "predictedQCD = model.predict(X_test)\n",
381 | "predicted_anomaly = []\n",
382 | "for i in range(len(labels)):\n",
383 | " predicted_anomaly.append(model.predict(anomaly[i]))"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": null,
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "def mse(data_in, data_out):\n",
393 | " mse = (data_out-data_in)*(data_out-data_in)\n",
394 | " # sum over features\n",
395 | " mse = mse.sum(-1)\n",
396 | " return mse "
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": null,
402 | "metadata": {},
403 | "outputs": [],
404 | "source": [
405 | "lossQCD = mse(X_test, predictedQCD)\n",
406 | "loss_anomaly = []\n",
407 | "for i in range(len(labels)):\n",
408 | " loss_anomaly.append(mse(anomaly[i], predicted_anomaly[i]))"
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": null,
414 | "metadata": {},
415 | "outputs": [],
416 | "source": [
417 | "maxScore = np.max(lossQCD)\n",
418 | "# plot QCD\n",
419 | "plt.figure()\n",
420 | "plt.hist(lossQCD, bins=100, label='QCD', density=True, range=(0, maxScore), \n",
421 | " histtype='step', fill=False, linewidth=1.5)\n",
422 | "plt.semilogy()\n",
423 | "plt.xlabel(\"AE Loss\")\n",
424 | "plt.ylabel(\"Probability (a.u.)\")\n",
425 | "plt.grid(True)\n",
426 | "plt.legend(loc='upper right')\n",
427 | "plt.show()"
428 | ]
429 | },
430 | {
431 | "cell_type": "code",
432 | "execution_count": null,
433 | "metadata": {},
434 | "outputs": [],
435 | "source": [
436 | "maxScore = np.max(lossQCD)\n",
437 | "# plot QCD\n",
438 | "plt.figure()\n",
439 | "plt.hist(lossQCD, bins=100, label='QCD', density=True, range=(0, maxScore), \n",
440 | " histtype='step', fill=False, linewidth=1.5)\n",
441 | "for i in range(len(labels)):\n",
442 | " plt.hist(loss_anomaly[i], bins=100, label=labels[i], density=True, range=(0, maxScore),\n",
443 | " histtype='step', fill=False, linewidth=1.5)\n",
444 | "plt.semilogy()\n",
445 | "plt.xlabel(\"AE Loss\")\n",
446 | "plt.ylabel(\"Probability (a.u.)\")\n",
447 | "plt.grid(True)\n",
448 | "plt.legend(loc='upper right')\n",
449 | "plt.show()"
450 | ]
451 | },
452 | {
453 | "cell_type": "markdown",
454 | "metadata": {},
455 | "source": [
456 | "# Building the ROC Curves"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": null,
462 | "metadata": {},
463 | "outputs": [],
464 | "source": [
465 | "from sklearn.metrics import roc_curve, auc\n",
466 | "plt.figure()\n",
467 | "targetQCD = np.zeros(lossQCD.shape[0])\n",
468 | "for i, label in enumerate(labels):\n",
469 | " print(loss_anomaly[i].shape, targetQCD.shape)\n",
470 | " trueVal = np.concatenate((np.ones(loss_anomaly[i].shape[0]),targetQCD))\n",
471 | " predVal = np.concatenate((loss_anomaly[i],lossQCD))\n",
472 | " print(trueVal.shape, predVal.shape)\n",
473 | " fpr, tpr, threshold = roc_curve(trueVal,predVal)\n",
474 | " auc1= auc(fpr, tpr)\n",
475 | " plt.plot(tpr,fpr,label='%s Anomaly Detection, auc = %.1f%%'%(label,auc1*100.))\n",
476 | "#plt.semilogy()\n",
477 | "plt.xlabel(\"sig. efficiency\")\n",
478 | "plt.ylabel(\"bkg. mistag rate\")\n",
479 | "plt.grid(True)\n",
480 | "plt.legend(loc='upper left')\n",
481 | "plt.show()"
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": null,
487 | "metadata": {},
488 | "outputs": [],
489 | "source": []
490 | }
491 | ],
492 | "metadata": {
493 | "kernelspec": {
494 | "display_name": "Python 3 (ipykernel)",
495 | "language": "python",
496 | "name": "python3"
497 | },
498 | "language_info": {
499 | "codemirror_mode": {
500 | "name": "ipython",
501 | "version": 3
502 | },
503 | "file_extension": ".py",
504 | "mimetype": "text/x-python",
505 | "name": "python",
506 | "nbconvert_exporter": "python",
507 | "pygments_lexer": "ipython3",
508 | "version": "3.8.9"
509 | }
510 | },
511 | "nbformat": 4,
512 | "nbformat_minor": 2
513 | }
514 |
--------------------------------------------------------------------------------
/jet_notebooks/ae.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/jet_notebooks/ae.png
--------------------------------------------------------------------------------
/jet_notebooks/conv1d.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/jet_notebooks/conv1d.png
--------------------------------------------------------------------------------
/jet_notebooks/conv2d.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/jet_notebooks/conv2d.gif
--------------------------------------------------------------------------------
/jet_notebooks/particle-net-arch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/jet_notebooks/particle-net-arch.png
--------------------------------------------------------------------------------
/jet_notebooks/rnn1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/jet_notebooks/rnn1.png
--------------------------------------------------------------------------------
/jet_notebooks/vae.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/jet_notebooks/vae.png
--------------------------------------------------------------------------------
/python_advanced/data.csv:
--------------------------------------------------------------------------------
1 | Duration,Pulse,Maxpulse,Calories
2 | 60,110,130,409.1
3 | 60,117,145,479.0
4 | 60,103,135,340.0
5 | 45,109,175,282.4
6 | 45,117,148,406.0
7 | 60,102,127,300.0
8 | 60,110,136,374.0
9 | 45,104,134,253.3
10 | 30,109,133,195.1
11 | 60,98,124,269.0
12 | 60,103,147,329.3
13 | 60,100,120,250.7
14 | 60,106,128,345.3
15 | 60,104,132,379.3
16 | 60,98,123,275.0
17 | 60,98,120,215.2
18 | 60,100,120,300.0
19 | 45,90,112,
20 | 60,103,123,323.0
21 | 45,97,125,243.0
22 | 60,108,131,364.2
23 | 45,100,119,282.0
24 | 60,130,101,300.0
25 | 45,105,132,246.0
26 | 60,102,126,334.5
27 | 60,100,120,250.0
28 | 60,92,118,241.0
29 | 60,103,132,
30 | 60,100,132,280.0
31 | 60,102,129,380.3
32 | 60,92,115,243.0
33 | 45,90,112,180.1
34 | 60,101,124,299.0
35 | 60,93,113,223.0
36 | 60,107,136,361.0
37 | 60,114,140,415.0
38 | 60,102,127,300.0
39 | 60,100,120,300.0
40 | 60,100,120,300.0
41 | 45,104,129,266.0
42 | 45,90,112,180.1
43 | 60,98,126,286.0
44 | 60,100,122,329.4
45 | 60,111,138,400.0
46 | 60,111,131,397.0
47 | 60,99,119,273.0
48 | 60,109,153,387.6
49 | 45,111,136,300.0
50 | 45,108,129,298.0
51 | 60,111,139,397.6
52 | 60,107,136,380.2
53 | 80,123,146,643.1
54 | 60,106,130,263.0
55 | 60,118,151,486.0
56 | 30,136,175,238.0
57 | 60,121,146,450.7
58 | 60,118,121,413.0
59 | 45,115,144,305.0
60 | 20,153,172,226.4
61 | 45,123,152,321.0
62 | 210,108,160,1376.0
63 | 160,110,137,1034.4
64 | 160,109,135,853.0
65 | 45,118,141,341.0
66 | 20,110,130,131.4
67 | 180,90,130,800.4
68 | 150,105,135,873.4
69 | 150,107,130,816.0
70 | 20,106,136,110.4
71 | 300,108,143,1500.2
72 | 150,97,129,1115.0
73 | 60,109,153,387.6
74 | 90,100,127,700.0
75 | 150,97,127,953.2
76 | 45,114,146,304.0
77 | 90,98,125,563.2
78 | 45,105,134,251.0
79 | 45,110,141,300.0
80 | 120,100,130,500.4
81 | 270,100,131,1729.0
82 | 30,159,182,319.2
83 | 45,149,169,344.0
84 | 30,103,139,151.1
85 | 120,100,130,500.0
86 | 45,100,120,225.3
87 | 30,151,170,300.0
88 | 45,102,136,234.0
89 | 120,100,157,1000.1
90 | 45,129,103,242.0
91 | 20,83,107,50.3
92 | 180,101,127,600.1
93 | 45,107,137,
94 | 30,90,107,105.3
95 | 15,80,100,50.5
96 | 20,150,171,127.4
97 | 20,151,168,229.4
98 | 30,95,128,128.2
99 | 25,152,168,244.2
100 | 30,109,131,188.2
101 | 90,93,124,604.1
102 | 20,95,112,77.7
103 | 90,90,110,500.0
104 | 90,90,100,500.0
105 | 90,90,100,500.4
106 | 30,92,108,92.7
107 | 30,93,128,124.0
108 | 180,90,120,800.3
109 | 30,90,120,86.2
110 | 90,90,120,500.3
111 | 210,137,184,1860.4
112 | 60,102,124,325.2
113 | 45,107,124,275.0
114 | 15,124,139,124.2
115 | 45,100,120,225.3
116 | 60,108,131,367.6
117 | 60,108,151,351.7
118 | 60,116,141,443.0
119 | 60,97,122,277.4
120 | 60,105,125,
121 | 60,103,124,332.7
122 | 30,112,137,193.9
123 | 45,100,120,100.7
124 | 60,119,169,336.7
125 | 60,107,127,344.9
126 | 60,111,151,368.5
127 | 60,98,122,271.0
128 | 60,97,124,275.3
129 | 60,109,127,382.0
130 | 90,99,125,466.4
131 | 60,114,151,384.0
132 | 60,104,134,342.5
133 | 60,107,138,357.5
134 | 60,103,133,335.0
135 | 60,106,132,327.5
136 | 60,103,136,339.0
137 | 20,136,156,189.0
138 | 45,117,143,317.7
139 | 45,115,137,318.0
140 | 45,113,138,308.0
141 | 20,141,162,222.4
142 | 60,108,135,390.0
143 | 60,97,127,
144 | 45,100,120,250.4
145 | 45,122,149,335.4
146 | 60,136,170,470.2
147 | 45,106,126,270.8
148 | 60,107,136,400.0
149 | 60,112,146,361.9
150 | 30,103,127,185.0
151 | 60,110,150,409.4
152 | 60,106,134,343.0
153 | 60,109,129,353.2
154 | 60,109,138,374.0
155 | 30,150,167,275.8
156 | 60,105,128,328.0
157 | 60,111,151,368.5
158 | 60,97,131,270.4
159 | 60,100,120,270.4
160 | 60,114,150,382.8
161 | 30,80,120,240.9
162 | 30,85,120,250.4
163 | 45,90,130,260.4
164 | 45,95,130,270.0
165 | 45,100,140,280.9
166 | 60,105,140,290.8
167 | 60,110,145,300.0
168 | 60,115,145,310.2
169 | 75,120,150,320.4
170 | 75,125,150,330.4
171 |
--------------------------------------------------------------------------------
/python_advanced/data.json:
--------------------------------------------------------------------------------
1 | {
2 | "Duration":{
3 | "0":60,
4 | "1":60,
5 | "2":60,
6 | "3":45,
7 | "4":45,
8 | "5":60,
9 | "6":60,
10 | "7":45,
11 | "8":30,
12 | "9":60,
13 | "10":60,
14 | "11":60,
15 | "12":60,
16 | "13":60,
17 | "14":60,
18 | "15":60,
19 | "16":60,
20 | "17":45,
21 | "18":60,
22 | "19":45,
23 | "20":60,
24 | "21":45,
25 | "22":60,
26 | "23":45,
27 | "24":60,
28 | "25":60,
29 | "26":60,
30 | "27":60,
31 | "28":60,
32 | "29":60,
33 | "30":60,
34 | "31":45,
35 | "32":60,
36 | "33":60,
37 | "34":60,
38 | "35":60,
39 | "36":60,
40 | "37":60,
41 | "38":60,
42 | "39":45,
43 | "40":45,
44 | "41":60,
45 | "42":60,
46 | "43":60,
47 | "44":60,
48 | "45":60,
49 | "46":60,
50 | "47":45,
51 | "48":45,
52 | "49":60,
53 | "50":60,
54 | "51":80,
55 | "52":60,
56 | "53":60,
57 | "54":30,
58 | "55":60,
59 | "56":60,
60 | "57":45,
61 | "58":20,
62 | "59":45,
63 | "60":210,
64 | "61":160,
65 | "62":160,
66 | "63":45,
67 | "64":20,
68 | "65":180,
69 | "66":150,
70 | "67":150,
71 | "68":20,
72 | "69":300,
73 | "70":150,
74 | "71":60,
75 | "72":90,
76 | "73":150,
77 | "74":45,
78 | "75":90,
79 | "76":45,
80 | "77":45,
81 | "78":120,
82 | "79":270,
83 | "80":30,
84 | "81":45,
85 | "82":30,
86 | "83":120,
87 | "84":45,
88 | "85":30,
89 | "86":45,
90 | "87":120,
91 | "88":45,
92 | "89":20,
93 | "90":180,
94 | "91":45,
95 | "92":30,
96 | "93":15,
97 | "94":20,
98 | "95":20,
99 | "96":30,
100 | "97":25,
101 | "98":30,
102 | "99":90,
103 | "100":20,
104 | "101":90,
105 | "102":90,
106 | "103":90,
107 | "104":30,
108 | "105":30,
109 | "106":180,
110 | "107":30,
111 | "108":90,
112 | "109":210,
113 | "110":60,
114 | "111":45,
115 | "112":15,
116 | "113":45,
117 | "114":60,
118 | "115":60,
119 | "116":60,
120 | "117":60,
121 | "118":60,
122 | "119":60,
123 | "120":30,
124 | "121":45,
125 | "122":60,
126 | "123":60,
127 | "124":60,
128 | "125":60,
129 | "126":60,
130 | "127":60,
131 | "128":90,
132 | "129":60,
133 | "130":60,
134 | "131":60,
135 | "132":60,
136 | "133":60,
137 | "134":60,
138 | "135":20,
139 | "136":45,
140 | "137":45,
141 | "138":45,
142 | "139":20,
143 | "140":60,
144 | "141":60,
145 | "142":45,
146 | "143":45,
147 | "144":60,
148 | "145":45,
149 | "146":60,
150 | "147":60,
151 | "148":30,
152 | "149":60,
153 | "150":60,
154 | "151":60,
155 | "152":60,
156 | "153":30,
157 | "154":60,
158 | "155":60,
159 | "156":60,
160 | "157":60,
161 | "158":60,
162 | "159":30,
163 | "160":30,
164 | "161":45,
165 | "162":45,
166 | "163":45,
167 | "164":60,
168 | "165":60,
169 | "166":60,
170 | "167":75,
171 | "168":75
172 | },
173 | "Pulse":{
174 | "0":110,
175 | "1":117,
176 | "2":103,
177 | "3":109,
178 | "4":117,
179 | "5":102,
180 | "6":110,
181 | "7":104,
182 | "8":109,
183 | "9":98,
184 | "10":103,
185 | "11":100,
186 | "12":106,
187 | "13":104,
188 | "14":98,
189 | "15":98,
190 | "16":100,
191 | "17":90,
192 | "18":103,
193 | "19":97,
194 | "20":108,
195 | "21":100,
196 | "22":130,
197 | "23":105,
198 | "24":102,
199 | "25":100,
200 | "26":92,
201 | "27":103,
202 | "28":100,
203 | "29":102,
204 | "30":92,
205 | "31":90,
206 | "32":101,
207 | "33":93,
208 | "34":107,
209 | "35":114,
210 | "36":102,
211 | "37":100,
212 | "38":100,
213 | "39":104,
214 | "40":90,
215 | "41":98,
216 | "42":100,
217 | "43":111,
218 | "44":111,
219 | "45":99,
220 | "46":109,
221 | "47":111,
222 | "48":108,
223 | "49":111,
224 | "50":107,
225 | "51":123,
226 | "52":106,
227 | "53":118,
228 | "54":136,
229 | "55":121,
230 | "56":118,
231 | "57":115,
232 | "58":153,
233 | "59":123,
234 | "60":108,
235 | "61":110,
236 | "62":109,
237 | "63":118,
238 | "64":110,
239 | "65":90,
240 | "66":105,
241 | "67":107,
242 | "68":106,
243 | "69":108,
244 | "70":97,
245 | "71":109,
246 | "72":100,
247 | "73":97,
248 | "74":114,
249 | "75":98,
250 | "76":105,
251 | "77":110,
252 | "78":100,
253 | "79":100,
254 | "80":159,
255 | "81":149,
256 | "82":103,
257 | "83":100,
258 | "84":100,
259 | "85":151,
260 | "86":102,
261 | "87":100,
262 | "88":129,
263 | "89":83,
264 | "90":101,
265 | "91":107,
266 | "92":90,
267 | "93":80,
268 | "94":150,
269 | "95":151,
270 | "96":95,
271 | "97":152,
272 | "98":109,
273 | "99":93,
274 | "100":95,
275 | "101":90,
276 | "102":90,
277 | "103":90,
278 | "104":92,
279 | "105":93,
280 | "106":90,
281 | "107":90,
282 | "108":90,
283 | "109":137,
284 | "110":102,
285 | "111":107,
286 | "112":124,
287 | "113":100,
288 | "114":108,
289 | "115":108,
290 | "116":116,
291 | "117":97,
292 | "118":105,
293 | "119":103,
294 | "120":112,
295 | "121":100,
296 | "122":119,
297 | "123":107,
298 | "124":111,
299 | "125":98,
300 | "126":97,
301 | "127":109,
302 | "128":99,
303 | "129":114,
304 | "130":104,
305 | "131":107,
306 | "132":103,
307 | "133":106,
308 | "134":103,
309 | "135":136,
310 | "136":117,
311 | "137":115,
312 | "138":113,
313 | "139":141,
314 | "140":108,
315 | "141":97,
316 | "142":100,
317 | "143":122,
318 | "144":136,
319 | "145":106,
320 | "146":107,
321 | "147":112,
322 | "148":103,
323 | "149":110,
324 | "150":106,
325 | "151":109,
326 | "152":109,
327 | "153":150,
328 | "154":105,
329 | "155":111,
330 | "156":97,
331 | "157":100,
332 | "158":114,
333 | "159":80,
334 | "160":85,
335 | "161":90,
336 | "162":95,
337 | "163":100,
338 | "164":105,
339 | "165":110,
340 | "166":115,
341 | "167":120,
342 | "168":125
343 | },
344 | "Maxpulse":{
345 | "0":130,
346 | "1":145,
347 | "2":135,
348 | "3":175,
349 | "4":148,
350 | "5":127,
351 | "6":136,
352 | "7":134,
353 | "8":133,
354 | "9":124,
355 | "10":147,
356 | "11":120,
357 | "12":128,
358 | "13":132,
359 | "14":123,
360 | "15":120,
361 | "16":120,
362 | "17":112,
363 | "18":123,
364 | "19":125,
365 | "20":131,
366 | "21":119,
367 | "22":101,
368 | "23":132,
369 | "24":126,
370 | "25":120,
371 | "26":118,
372 | "27":132,
373 | "28":132,
374 | "29":129,
375 | "30":115,
376 | "31":112,
377 | "32":124,
378 | "33":113,
379 | "34":136,
380 | "35":140,
381 | "36":127,
382 | "37":120,
383 | "38":120,
384 | "39":129,
385 | "40":112,
386 | "41":126,
387 | "42":122,
388 | "43":138,
389 | "44":131,
390 | "45":119,
391 | "46":153,
392 | "47":136,
393 | "48":129,
394 | "49":139,
395 | "50":136,
396 | "51":146,
397 | "52":130,
398 | "53":151,
399 | "54":175,
400 | "55":146,
401 | "56":121,
402 | "57":144,
403 | "58":172,
404 | "59":152,
405 | "60":160,
406 | "61":137,
407 | "62":135,
408 | "63":141,
409 | "64":130,
410 | "65":130,
411 | "66":135,
412 | "67":130,
413 | "68":136,
414 | "69":143,
415 | "70":129,
416 | "71":153,
417 | "72":127,
418 | "73":127,
419 | "74":146,
420 | "75":125,
421 | "76":134,
422 | "77":141,
423 | "78":130,
424 | "79":131,
425 | "80":182,
426 | "81":169,
427 | "82":139,
428 | "83":130,
429 | "84":120,
430 | "85":170,
431 | "86":136,
432 | "87":157,
433 | "88":103,
434 | "89":107,
435 | "90":127,
436 | "91":137,
437 | "92":107,
438 | "93":100,
439 | "94":171,
440 | "95":168,
441 | "96":128,
442 | "97":168,
443 | "98":131,
444 | "99":124,
445 | "100":112,
446 | "101":110,
447 | "102":100,
448 | "103":100,
449 | "104":108,
450 | "105":128,
451 | "106":120,
452 | "107":120,
453 | "108":120,
454 | "109":184,
455 | "110":124,
456 | "111":124,
457 | "112":139,
458 | "113":120,
459 | "114":131,
460 | "115":151,
461 | "116":141,
462 | "117":122,
463 | "118":125,
464 | "119":124,
465 | "120":137,
466 | "121":120,
467 | "122":169,
468 | "123":127,
469 | "124":151,
470 | "125":122,
471 | "126":124,
472 | "127":127,
473 | "128":125,
474 | "129":151,
475 | "130":134,
476 | "131":138,
477 | "132":133,
478 | "133":132,
479 | "134":136,
480 | "135":156,
481 | "136":143,
482 | "137":137,
483 | "138":138,
484 | "139":162,
485 | "140":135,
486 | "141":127,
487 | "142":120,
488 | "143":149,
489 | "144":170,
490 | "145":126,
491 | "146":136,
492 | "147":146,
493 | "148":127,
494 | "149":150,
495 | "150":134,
496 | "151":129,
497 | "152":138,
498 | "153":167,
499 | "154":128,
500 | "155":151,
501 | "156":131,
502 | "157":120,
503 | "158":150,
504 | "159":120,
505 | "160":120,
506 | "161":130,
507 | "162":130,
508 | "163":140,
509 | "164":140,
510 | "165":145,
511 | "166":145,
512 | "167":150,
513 | "168":150
514 | },
515 | "Calories":{
516 | "0":409.1,
517 | "1":479.0,
518 | "2":340.0,
519 | "3":282.4,
520 | "4":406.0,
521 | "5":300.5,
522 | "6":374.0,
523 | "7":253.3,
524 | "8":195.1,
525 | "9":269.0,
526 | "10":329.3,
527 | "11":250.7,
528 | "12":345.3,
529 | "13":379.3,
530 | "14":275.0,
531 | "15":215.2,
532 | "16":300.0,
533 | "17":null,
534 | "18":323.0,
535 | "19":243.0,
536 | "20":364.2,
537 | "21":282.0,
538 | "22":300.0,
539 | "23":246.0,
540 | "24":334.5,
541 | "25":250.0,
542 | "26":241.0,
543 | "27":null,
544 | "28":280.0,
545 | "29":380.3,
546 | "30":243.0,
547 | "31":180.1,
548 | "32":299.0,
549 | "33":223.0,
550 | "34":361.0,
551 | "35":415.0,
552 | "36":300.5,
553 | "37":300.1,
554 | "38":300.0,
555 | "39":266.0,
556 | "40":180.1,
557 | "41":286.0,
558 | "42":329.4,
559 | "43":400.0,
560 | "44":397.0,
561 | "45":273.0,
562 | "46":387.6,
563 | "47":300.0,
564 | "48":298.0,
565 | "49":397.6,
566 | "50":380.2,
567 | "51":643.1,
568 | "52":263.0,
569 | "53":486.0,
570 | "54":238.0,
571 | "55":450.7,
572 | "56":413.0,
573 | "57":305.0,
574 | "58":226.4,
575 | "59":321.0,
576 | "60":1376.0,
577 | "61":1034.4,
578 | "62":853.0,
579 | "63":341.0,
580 | "64":131.4,
581 | "65":800.4,
582 | "66":873.4,
583 | "67":816.0,
584 | "68":110.4,
585 | "69":1500.2,
586 | "70":1115.0,
587 | "71":387.6,
588 | "72":700.0,
589 | "73":953.2,
590 | "74":304.0,
591 | "75":563.2,
592 | "76":251.0,
593 | "77":300.0,
594 | "78":500.4,
595 | "79":1729.0,
596 | "80":319.2,
597 | "81":344.0,
598 | "82":151.1,
599 | "83":500.0,
600 | "84":225.3,
601 | "85":300.1,
602 | "86":234.0,
603 | "87":1000.1,
604 | "88":242.0,
605 | "89":50.3,
606 | "90":600.1,
607 | "91":null,
608 | "92":105.3,
609 | "93":50.5,
610 | "94":127.4,
611 | "95":229.4,
612 | "96":128.2,
613 | "97":244.2,
614 | "98":188.2,
615 | "99":604.1,
616 | "100":77.7,
617 | "101":500.0,
618 | "102":500.0,
619 | "103":500.4,
620 | "104":92.7,
621 | "105":124.0,
622 | "106":800.3,
623 | "107":86.2,
624 | "108":500.3,
625 | "109":1860.4,
626 | "110":325.2,
627 | "111":275.0,
628 | "112":124.2,
629 | "113":225.3,
630 | "114":367.6,
631 | "115":351.7,
632 | "116":443.0,
633 | "117":277.4,
634 | "118":null,
635 | "119":332.7,
636 | "120":193.9,
637 | "121":100.7,
638 | "122":336.7,
639 | "123":344.9,
640 | "124":368.5,
641 | "125":271.0,
642 | "126":275.3,
643 | "127":382.0,
644 | "128":466.4,
645 | "129":384.0,
646 | "130":342.5,
647 | "131":357.5,
648 | "132":335.0,
649 | "133":327.5,
650 | "134":339.0,
651 | "135":189.0,
652 | "136":317.7,
653 | "137":318.0,
654 | "138":308.0,
655 | "139":222.4,
656 | "140":390.0,
657 | "141":null,
658 | "142":250.4,
659 | "143":335.4,
660 | "144":470.2,
661 | "145":270.8,
662 | "146":400.0,
663 | "147":361.9,
664 | "148":185.0,
665 | "149":409.4,
666 | "150":343.0,
667 | "151":353.2,
668 | "152":374.0,
669 | "153":275.8,
670 | "154":328.0,
671 | "155":368.5,
672 | "156":270.4,
673 | "157":270.4,
674 | "158":382.8,
675 | "159":240.9,
676 | "160":250.4,
677 | "161":260.4,
678 | "162":270.0,
679 | "163":280.9,
680 | "164":290.8,
681 | "165":300.4,
682 | "166":310.2,
683 | "167":320.4,
684 | "168":330.4
685 | }
686 | }
--------------------------------------------------------------------------------
/python_advanced/example.mplstyle:
--------------------------------------------------------------------------------
1 | figure.figsize : 6,4
2 |
3 | # Line Colors
4 | # These are based on cbrewer with some modifications
5 | axes.prop_cycle : cycler('color',['5c92be','e94749','70bf6e','ff9832','ac71b5','b77752','f89acb','adadad','d7c78f'])
6 | # Map colors
7 | image.cmap : Spectral_r
8 |
9 | # Set x axis
10 | xtick.direction : in
11 | xtick.major.size : 3
12 | xtick.major.width : 0.5
13 | xtick.minor.size : 1.5
14 | xtick.minor.width : 0.5
15 | xtick.minor.visible : True
16 | xtick.top : True
17 |
18 | # Set y axis
19 | ytick.direction : in
20 | ytick.major.size : 3
21 | ytick.major.width : 0.5
22 | ytick.minor.size : 1.5
23 | ytick.minor.width : 0.5
24 | ytick.minor.visible : True
25 | ytick.right : True
26 |
27 | # Set line widths
28 | axes.linewidth : 0.5
29 | grid.linewidth : 0.5
30 | lines.linewidth : 2
31 |
32 | # Always save as 'tight'
33 | savefig.bbox : tight
34 | savefig.pad_inches : 0.05
35 |
36 | # Use serif fonts
37 | # font.serif : Times New Roman
38 | # font.family : serif
39 | font.size : 12
40 |
41 | axes.formatter.limits : -3, 3 # When to use scientific notation
42 | axes.formatter.use_mathtext : True # False:1e6 vs True: 1 \times 10^6
43 |
--------------------------------------------------------------------------------
/python_basics/demofile.txt:
--------------------------------------------------------------------------------
1 | Hello! Welcome to demofile.txt
2 | This file is for testing purposes.
3 | Good Luck!
4 |
--------------------------------------------------------------------------------
/python_basics/helloworld.py:
--------------------------------------------------------------------------------
1 | print("Hello, World!")
2 |
--------------------------------------------------------------------------------
/python_basics/mymodule.py:
--------------------------------------------------------------------------------
1 | def greeting(name):
2 | print("Hello, " + name)
3 |
4 | person1 = {
5 | "name": "John",
6 | "age": 36,
7 | "country": "Norway"
8 | }
9 |
--------------------------------------------------------------------------------
/pytorch_basics/pytorch_intro.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "view-in-github"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "id": "xOl7MvFyyOOV"
17 | },
18 | "source": [
19 | "### Introduction to Pytorch\n",
20 | "\n",
21 | "Adapted from official [Pytorch tutorial](https://pytorch.org/tutorials/beginner/basics/intro.html) for dealing with PyTorch tensors, datasets, building neural networks etc., also has an accompanying video series."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {
27 | "id": "D6ltEl0nzUXI"
28 | },
29 | "source": [
30 | "## Tensors\n",
31 | "\n",
32 | "Tensors are a specialized data structure that are very similar to NumPy `ndarrays`. In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters. Tensors are also optimized for automatic differentiation."
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {
39 | "id": "lh3knnd-ahfZ"
40 | },
41 | "outputs": [],
42 | "source": [
43 | "import torch\n",
44 | "import numpy as np"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {
50 | "id": "OHiKPM0DapMU"
51 | },
52 | "source": [
53 | "### Initializing a Tensor\n",
54 | "\n",
55 | "Tensors can be initialized in various ways. Take a look at the following examples:"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {
61 | "id": "7VMS73xDatXQ"
62 | },
63 | "source": [
64 | "**Directly from data**\n",
65 | "\n",
66 | "Tensors can be created directly from data. The data type is automatically inferred."
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {
73 | "id": "akgMhYw8ayXS"
74 | },
75 | "outputs": [],
76 | "source": [
77 | "data = [[1, 2],[3, 4]]\n",
78 | "x_data = torch.tensor(data)"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {
84 | "id": "jqBZTAHda8BY"
85 | },
86 | "source": [
87 | "**From a NumPy array**\n",
88 | "\n",
89 | "Tensors can be created from NumPy arrays (and vice versa)."
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {
96 | "id": "EE2gC9rIao5v"
97 | },
98 | "outputs": [],
99 | "source": [
100 | "np_array = np.array(data)\n",
101 | "x_np = torch.from_numpy(np_array)"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {
107 | "id": "fnqjE__vbKEq"
108 | },
109 | "source": [
110 | "**From another tensor**\n",
111 | "\n",
112 | "The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden."
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "metadata": {
119 | "id": "nVhIJMk7bM-q"
120 | },
121 | "outputs": [],
122 | "source": [
123 | "x_ones = torch.ones_like(x_data) # retains the properties of x_data\n",
124 | "print(f\"Ones Tensor: \\n {x_ones} \\n\")\n",
125 | "\n",
126 | "x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data\n",
127 | "print(f\"Random Tensor: \\n {x_rand} \\n\")"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {
133 | "id": "pc86VvLZbePF"
134 | },
135 | "source": [
136 | "**With random or constant values**\n",
137 | "\n",
138 | "`shape` is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor."
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {
145 | "id": "Xoq55JAebjuG"
146 | },
147 | "outputs": [],
148 | "source": [
149 | "shape = (2,3,)\n",
150 | "rand_tensor = torch.rand(shape)\n",
151 | "ones_tensor = torch.ones(shape)\n",
152 | "zeros_tensor = torch.zeros(shape)\n",
153 | "\n",
154 | "print(f\"Random Tensor: \\n {rand_tensor} \\n\")\n",
155 | "print(f\"Ones Tensor: \\n {ones_tensor} \\n\")\n",
156 | "print(f\"Zeros Tensor: \\n {zeros_tensor}\")"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {
162 | "id": "m2jfWJYvb4tU"
163 | },
164 | "source": [
165 | "### Attributes of a Tensor\n",
166 | "\n",
167 | "Tensor attributes describe their shape, datatype, and the device on which they are stored."
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {
174 | "id": "j-lqGiF0cAhS"
175 | },
176 | "outputs": [],
177 | "source": [
178 | "tensor = torch.rand(3,4)\n",
179 | "\n",
180 | "print(f\"Shape of tensor: {tensor.shape}\")\n",
181 | "print(f\"Datatype of tensor: {tensor.dtype}\")\n",
182 | "print(f\"Device tensor is stored on: {tensor.device}\")"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {
188 | "id": "aUsjyaH7cKB4"
189 | },
190 | "source": [
191 | "### Operations on Tensors\n",
192 | "\n",
193 | "Over 100 tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing, indexing, slicing), sampling and more are comprehensively described [here](https://pytorch.org/docs/stable/torch.html).\n",
194 | "\n",
195 | "Each of these operations can be run on the GPU. By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using `.to` method (after checking for GPU availability)."
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "id": "QMkxXospchGt"
203 | },
204 | "outputs": [],
205 | "source": [
206 | "# We move our tensor to the GPU if available\n",
207 | "if torch.cuda.is_available():\n",
208 | " print(\"Found GPU\")\n",
209 | " tensor = tensor.to(\"cuda\")"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {
215 | "id": "GTKR8atYcr_F"
216 | },
217 | "source": [
218 | "**Standard numpy-like indexing and slicing**"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": null,
224 | "metadata": {
225 | "id": "rpdU44nMcu9Z"
226 | },
227 | "outputs": [],
228 | "source": [
229 | "tensor = torch.rand(4, 4)\n",
230 | "print(tensor)\n",
231 | "print(f\"First row: {tensor[0,:]}\")\n",
232 | "print(f\"First column: {tensor[:,0]}\")\n",
233 | "print(f\"Last column: {tensor[:,-1]}\")\n",
234 | "tensor[:,1] = 0\n",
235 | "print(tensor)"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {
241 | "id": "g5g0YabNdjXR"
242 | },
243 | "source": [
244 | "**Joining tensors**\n",
245 | "\n",
246 | "You can use `torch.cat` to concatenate a sequence of tensors along a given dimension."
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "metadata": {
253 | "id": "ll7u18V0dp7V"
254 | },
255 | "outputs": [],
256 | "source": [
257 | "t1 = torch.cat([tensor, tensor, tensor], dim=1)\n",
258 | "print(t1.shape)\n",
259 | "print(t1)"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {
265 | "id": "i86MON9seCYq"
266 | },
267 | "source": [
268 | "**Arithmetic operations**"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": null,
274 | "metadata": {
275 | "id": "e-GD0INYeGe6"
276 | },
277 | "outputs": [],
278 | "source": [
279 | "# This computes the matrix multiplication between two tensors. y1, y2, y3 will have the same value\n",
280 | "# ``tensor.T`` returns the transpose of a tensor\n",
281 | "y1 = tensor @ tensor.T\n",
282 | "y2 = tensor.matmul(tensor.T)\n",
283 | "\n",
284 | "y3 = torch.rand_like(y1)\n",
285 | "torch.matmul(tensor, tensor.T, out=y3)\n",
286 | "\n",
287 | "\n",
288 | "# This computes the element-wise product. z1, z2, z3 will have the same value\n",
289 | "z1 = tensor * tensor\n",
290 | "z2 = tensor.mul(tensor)\n",
291 | "\n",
292 | "z3 = torch.rand_like(tensor)\n",
293 | "torch.mul(tensor, tensor, out=z3)"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {
299 | "id": "Os1ZcNNrx3x_"
300 | },
301 | "source": [
302 | "**Single-element tensors**\n",
303 | "\n",
304 | "If you have a one-element tensor, for example by aggregating all values of a tensor into one value, you can convert it to a Python numerical value using `item()`:"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {
311 | "id": "K0YYD_QFx9hi"
312 | },
313 | "outputs": [],
314 | "source": [
315 | "agg = tensor.sum()\n",
316 | "agg_item = agg.item()\n",
317 | "print(agg_item, type(agg_item))"
318 | ]
319 | },
320 | {
321 | "cell_type": "markdown",
322 | "metadata": {
323 | "id": "3eF6poF7yZV9"
324 | },
325 | "source": [
326 | "**In-place operations**\n",
327 | "\n",
328 | "Operations that store the result into the operand are called in-place. They are denoted by a `_` suffix. For example: `x.copy_(y)`, `x.t_()`, will change `x`."
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {
335 | "id": "gGgGfsqVylCy"
336 | },
337 | "outputs": [],
338 | "source": [
339 | "print(f\"{tensor} \\n\")\n",
340 | "tensor.add_(5)\n",
341 | "print(tensor)"
342 | ]
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {
347 | "id": "v3UR8cWty1Bn"
348 | },
349 | "source": [
350 | "### Bridge with NumPy\n",
351 | "\n",
352 | "Tensors on the CPU and NumPy arrays can share their underlying memory locations, and changing one will change the other."
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {
358 | "id": "VV5b4q_dy6lC"
359 | },
360 | "source": [
361 | "**Tensor to NumPy array**"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {
368 | "id": "QnARXAHIy9Mf"
369 | },
370 | "outputs": [],
371 | "source": [
372 | "t = torch.ones(5)\n",
373 | "print(f\"t: {t}\")\n",
374 | "n = t.numpy()\n",
375 | "print(f\"n: {n}\")"
376 | ]
377 | },
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {
381 | "id": "Oml36m6kzDuU"
382 | },
383 | "source": [
384 | "A change in the tensor reflects in the NumPy array."
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": null,
390 | "metadata": {
391 | "id": "-ck0MgA_zEIu"
392 | },
393 | "outputs": [],
394 | "source": [
395 | "t.add_(1)\n",
396 | "print(f\"t: {t}\")\n",
397 | "print(f\"n: {n}\")"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {
403 | "id": "6QMiDGoszJ88"
404 | },
405 | "source": [
406 | "**NumPy array to Tensor**"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": null,
412 | "metadata": {
413 | "id": "D4gjyUhDzLhk"
414 | },
415 | "outputs": [],
416 | "source": [
417 | "n = np.ones(5)\n",
418 | "t = torch.from_numpy(n)"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {
424 | "id": "t7AIcf7LzNzl"
425 | },
426 | "source": [
427 | "Changes in the NumPy array reflects in the tensor."
428 | ]
429 | },
430 | {
431 | "cell_type": "code",
432 | "execution_count": null,
433 | "metadata": {
434 | "id": "0HSxoc7pzPYU"
435 | },
436 | "outputs": [],
437 | "source": [
438 | "np.add(n, 1, out=n)\n",
439 | "print(f\"t: {t}\")\n",
440 | "print(f\"n: {n}\")"
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {
446 | "id": "xsmgaDDazbiF"
447 | },
448 | "source": [
449 | "## Working with data\n",
450 | "\n",
451 | "PyTorch has [two primitives to work with data](https://pytorch.org/docs/stable/data.html): `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`:\n",
452 | "* **Dataset** stores the samples and their corresponding labels;\n",
453 | "* **DataLoader** wraps an iterable around the `Dataset` to enable easy access to the samples.\n",
454 | "\n",
455 | "PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that subclass `torch.utils.data.Dataset` and implement functions specific to the particular data. They can be used to prototype and benchmark your model. You can find them here: [Image Datasets](https://pytorch.org/vision/stable/datasets.html), [Text Datasets](https://pytorch.org/text/stable/datasets.html), and [Audio Datasets](https://pytorch.org/audio/stable/datasets.html)."
456 | ]
457 | },
458 | {
459 | "cell_type": "markdown",
460 | "metadata": {
461 | "id": "Chww-ZVMzqPv"
462 | },
463 | "source": [
464 | "### Loading a Dataset\n",
465 | "\n",
466 | "Here is an example of how to load the [Fashion-MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) from [TorchVision](https://pytorch.org/vision/stable/datasets.html). Fashion-MNIST is a dataset of Zalando’s article images consisting of 60,000 training examples and 10,000 test examples. Each example comprises a 28×28 grayscale image and an associated label from one of 10 classes.\n",
467 | "\n",
468 | "We load the [FashionMNIST Dataset](https://pytorch.org/vision/stable/datasets.html#fashion-mnist) with the following parameters:\n",
469 | "\n",
470 | "* `root` is the path where the train/test data is stored;\n",
471 | "* `train` specifies training or test dataset;\n",
472 | "* `download=True` downloads the data from the internet if it’s not available at `root`;\n",
473 | "* `transform` and `target_transform` specify the feature and label transformations.\n"
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": null,
479 | "metadata": {
480 | "id": "0P0z5TOdz2jf"
481 | },
482 | "outputs": [],
483 | "source": [
484 | "import torch\n",
485 | "from torch.utils.data import Dataset\n",
486 | "from torchvision import datasets\n",
487 | "from torchvision.transforms import ToTensor, Lambda\n",
488 | "import matplotlib.pyplot as plt\n",
489 | "\n",
490 | "training_data = datasets.FashionMNIST(\n",
491 | " root=\"data\",\n",
492 | " train=True,\n",
493 | " download=True,\n",
494 | " transform=ToTensor()\n",
495 | ")\n",
496 | "\n",
497 | "test_data = datasets.FashionMNIST(\n",
498 | " root=\"data\",\n",
499 | " train=False,\n",
500 | " download=True,\n",
501 | " transform=ToTensor()\n",
502 | ")"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {
508 | "id": "J__5cWXL1IOi"
509 | },
510 | "source": [
511 | "### Iterating and Visualizing the Dataset\n",
512 | "\n",
513 | "We can index `Datasets` manually like a list: `training_data[index]`. We use `matplotlib` to visualize some samples in our training data."
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "execution_count": null,
519 | "metadata": {
520 | "id": "zRq8kFBu1ox6"
521 | },
522 | "outputs": [],
523 | "source": [
524 | "labels_map = {\n",
525 | " 0: \"T-Shirt\",\n",
526 | " 1: \"Trouser\",\n",
527 | " 2: \"Pullover\",\n",
528 | " 3: \"Dress\",\n",
529 | " 4: \"Coat\",\n",
530 | " 5: \"Sandal\",\n",
531 | " 6: \"Shirt\",\n",
532 | " 7: \"Sneaker\",\n",
533 | " 8: \"Bag\",\n",
534 | " 9: \"Ankle Boot\",\n",
535 | "}\n",
536 | "figure = plt.figure(figsize=(8, 8))\n",
537 | "cols, rows = 3, 3\n",
538 | "for i in range(1, cols * rows + 1):\n",
539 | " sample_idx = torch.randint(len(training_data), size=(1,)).item()\n",
540 | " img, label = training_data[sample_idx]\n",
541 | " figure.add_subplot(rows, cols, i)\n",
542 | " plt.title(labels_map[label])\n",
543 | " plt.axis(\"off\")\n",
544 | " plt.imshow(img.squeeze(), cmap=\"gray\")\n",
545 | "plt.show()"
546 | ]
547 | },
548 | {
549 | "cell_type": "markdown",
550 | "metadata": {
551 | "id": "0ogeFMzZ15Qw"
552 | },
553 | "source": [
554 | "### Creating a Custom Dataset for your files\n",
555 | "\n",
556 | "A custom `Dataset` class must implement three functions: `__init__`, `__len__`, and `__getitem__`. Take a look at this implementation; the FashionMNIST images are stored in a directory `img_dir`, and their labels are stored separately in a CSV file `annotations_file`.\n",
557 | "\n",
558 | "In the next sections, we’ll break down what’s happening in each of these functions."
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": null,
564 | "metadata": {
565 | "id": "E5LyOQ-22Dum"
566 | },
567 | "outputs": [],
568 | "source": [
569 | "import os\n",
570 | "import pandas as pd\n",
571 | "from torchvision.io import read_image\n",
572 | "\n",
573 | "class CustomImageDataset(Dataset):\n",
574 | " def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):\n",
575 | " self.img_labels = pd.read_csv(annotations_file)\n",
576 | " self.img_dir = img_dir\n",
577 | " self.transform = transform\n",
578 | " self.target_transform = target_transform\n",
579 | "\n",
580 | " def __len__(self):\n",
581 | " return len(self.img_labels)\n",
582 | "\n",
583 | " def __getitem__(self, idx):\n",
584 | " img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])\n",
585 | " image = read_image(img_path)\n",
586 | " label = self.img_labels.iloc[idx, 1]\n",
587 | " if self.transform:\n",
588 | " image = self.transform(image)\n",
589 | " if self.target_transform:\n",
590 | " label = self.target_transform(label)\n",
591 | " return image, label"
592 | ]
593 | },
594 | {
595 | "cell_type": "markdown",
596 | "metadata": {
597 | "id": "eF3dbHy72PGU"
598 | },
599 | "source": [
600 | "**`__init__`**\n",
601 | "\n",
602 | "The `__init__` function is run once when instantiating the `Dataset` object. We initialize the directory containing the images, the annotations file, and both transforms (covered in more detail in the next section).\n",
603 | "\n",
604 | "The `labels.csv` file looks like:\n",
605 | "\n",
606 | "```\n",
607 | "tshirt1.jpg, 0\n",
608 | "tshirt2.jpg, 0\n",
609 | "......\n",
610 | "ankleboot999.jpg, 9\n",
611 | "```\n",
612 | "\n",
613 | "**`__len__`**\n",
614 | "\n",
615 | "The `__len__` function returns the number of samples in our dataset.\n",
616 | "\n",
617 | "**`__getitem__`**\n",
618 | "\n",
619 | "The `__getitem__` function loads and returns a sample from the dataset at the given index `idx`. Based on the index, it identifies the image’s location on disk, converts that to a tensor using `read_image`, retrieves the corresponding label from the csv data in `self.img_labels`, calls the `transform` functions on them (if applicable), and returns the tensor image and corresponding label in a tuple."
620 | ]
621 | },
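{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a quick sanity check (an addition to the original tutorial), the sketch below shows how such a custom dataset could be instantiated and indexed. The paths `labels.csv` and `images/` are hypothetical placeholders; point them at your own annotations file and image directory to try it out."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "import os\n",
  "\n",
  "# Hypothetical paths used only for illustration; replace them with your own data.\n",
  "annotations_file = \"labels.csv\"\n",
  "img_dir = \"images/\"\n",
  "\n",
  "if os.path.isfile(annotations_file) and os.path.isdir(img_dir):\n",
  "    custom_data = CustomImageDataset(annotations_file, img_dir)\n",
  "    print(f\"Number of samples: {len(custom_data)}\")  # calls __len__\n",
  "    image, label = custom_data[0]                     # calls __getitem__\n",
  "    print(image.shape, label)\n",
  "else:\n",
  "    print(\"Placeholder paths not found; supply a real CSV and image directory to run this example.\")"
 ]
},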
622 | {
623 | "cell_type": "markdown",
624 | "metadata": {
625 | "id": "MwYgbBlS3g_8"
626 | },
627 | "source": [
628 | "### Preparing your data for training with DataLoaders\n",
629 | "\n",
630 | "The `Dataset` retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in *minibatches*, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.\n",
631 | "\n",
632 | "`DataLoader` is an iterable that abstracts this complexity for us in an easy API."
633 | ]
634 | },
635 | {
636 | "cell_type": "code",
637 | "execution_count": null,
638 | "metadata": {
639 | "id": "MqqX62f73tu7"
640 | },
641 | "outputs": [],
642 | "source": [
643 | "from torch.utils.data import DataLoader\n",
644 | "\n",
645 | "train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)\n",
646 | "test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)"
647 | ]
648 | },
649 | {
650 | "cell_type": "markdown",
651 | "metadata": {
652 | "id": "nIp7wKup3xw3"
653 | },
654 | "source": [
655 | "### Iterate through the DataLoader\n",
656 | "\n",
657 | "We have loaded the dataset into the `DataLoader` and can iterate through the dataset as needed. Each iteration below returns a batch of `train_features` and `train_labels` (containing `batch_size=64` features and labels respectively). Because we specified `shuffle=True`, after we iterate over all batches the data is shuffled (for finer-grained control over the data loading order, take a look at [Samplers](https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler))."
658 | ]
659 | },
660 | {
661 | "cell_type": "code",
662 | "execution_count": null,
663 | "metadata": {
664 | "id": "mfMgStmm4iks"
665 | },
666 | "outputs": [],
667 | "source": [
668 | "# Display image and label.\n",
669 | "train_features, train_labels = next(iter(train_dataloader))\n",
670 | "print(f\"Feature batch shape: {train_features.size()}\")\n",
671 | "print(f\"Labels batch shape: {train_labels.size()}\")\n",
672 | "img = train_features[0].squeeze() #Returns a tensor with all the dimensions of input of size 1 removed\n",
673 | "label = train_labels[0]\n",
674 | "plt.imshow(img, cmap=\"gray\")\n",
675 | "plt.show()\n",
676 | "print(f\"Label: {label}\")"
677 | ]
678 | },
679 | {
680 | "cell_type": "markdown",
681 | "metadata": {
682 | "id": "jB4pE5hL57Ae"
683 | },
684 | "source": [
685 | "### Transforms\n",
686 | "\n",
687 | "Data does not always come in its final processed form that is required for training machine learning algorithms. We use **transforms** to perform some manipulation of the data and make it suitable for training.\n",
688 | "\n",
689 | "All TorchVision datasets have two parameters, `transform` to modify the features and `target_transform` to modify the labels, that accept callables containing the transformation logic. The [`torchvision.transforms`](https://pytorch.org/vision/stable/transforms.html) module offers several commonly-used transforms out of the box.\n",
690 | "\n",
691 | "The FashionMNIST features are in Python Imaging Library (PIL) format, and the labels are integers. For training, we need the features as normalized tensors, and the labels as one-hot encoded tensors. To make these transformations, we use `ToTensor` and `Lambda`."
692 | ]
693 | },
694 | {
695 | "cell_type": "code",
696 | "execution_count": null,
697 | "metadata": {
698 | "id": "kY7iscx96lR6"
699 | },
700 | "outputs": [],
701 | "source": [
702 | "ds = datasets.FashionMNIST(\n",
703 | " root=\"data\",\n",
704 | " train=True,\n",
705 | " download=True,\n",
706 | " transform=ToTensor(),\n",
707 | " target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1))\n",
708 | ")"
709 | ]
710 | },
711 | {
712 | "cell_type": "markdown",
713 | "metadata": {
714 | "id": "FcWupzLI7CNL"
715 | },
716 | "source": [
717 | "**ToTensor()**\n",
718 | "\n",
719 | "[`ToTensor`](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.ToTensor) converts a PIL image or NumPy `ndarray` into a `FloatTensor` and scales the image’s pixel intensity values to the range `[0., 1.]`.\n",
720 | "\n",
721 | "**Lambda Transforms**\n",
722 | "\n",
723 | "`Lambda` transforms apply any user-defined lambda function. Here, we define a function to turn the integer into a one-hot encoded tensor. It first creates a zero tensor of size 10 (the number of labels in our dataset) and calls [`scatter_`](https://pytorch.org/docs/stable/generated/torch.Tensor.scatter_.html) which assigns a `value=1` on the index as given by the label `y`."
724 | ]
725 | }
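,
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "To make the `Lambda` transform above concrete, the short check below (an addition to the original tutorial) prints one transformed sample from `ds` and then applies the same one-hot encoding to a raw integer label; the names `sample_image`, `sample_target`, and `y` are illustrative."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Inspect one sample after both transforms have been applied.\n",
  "sample_image, sample_target = ds[0]\n",
  "print(f\"Image tensor shape: {sample_image.shape}\")  # torch.Size([1, 28, 28])\n",
  "print(f\"One-hot target:     {sample_target}\")\n",
  "\n",
  "# The same one-hot encoding written out explicitly for a raw integer label.\n",
  "y = 3\n",
  "one_hot = torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)\n",
  "print(f\"Label {y} -> {one_hot}\")"
 ]
}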
726 | ],
727 | "metadata": {
728 | "accelerator": "GPU",
729 | "colab": {
730 | "authorship_tag": "ABX9TyN9y+zYt0XPYRKtiGRGNpAq",
731 | "include_colab_link": true,
732 | "provenance": []
733 | },
734 | "gpuClass": "standard",
735 | "kernelspec": {
736 | "display_name": "Python 3 (ipykernel)",
737 | "language": "python",
738 | "name": "python3"
739 | },
740 | "language_info": {
741 | "codemirror_mode": {
742 | "name": "ipython",
743 | "version": 3
744 | },
745 | "file_extension": ".py",
746 | "mimetype": "text/x-python",
747 | "name": "python",
748 | "nbconvert_exporter": "python",
749 | "pygments_lexer": "ipython3",
750 | "version": "3.8.9"
751 | }
752 | },
753 | "nbformat": 4,
754 | "nbformat_minor": 1
755 | }
756 |
--------------------------------------------------------------------------------
/pytorch_geometric_intro/2.KCNodeClassificationPyG.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "view-in-github"
8 | },
9 | "source": [
10 | ""
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "id": "hbnj1Bj96DRr"
17 | },
18 | "source": [
19 | "# Another Node Classification Example\n",
20 | "\n",
21 | "This tutorial is adapted from a tutorial by [Matthias Fey](https://rusty1s.github.io/#/)."
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {
28 | "id": "F1op-CbyLuN4"
29 | },
30 | "outputs": [],
31 | "source": [
32 | "# Install torch geometric\n",
33 | "# Install required packages.\n",
34 | "import os\n",
35 | "import torch\n",
36 | "os.environ['TORCH'] = torch.__version__\n",
37 | "print(torch.__version__)\n",
38 | "!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html\n",
39 | "!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html\n",
40 | "!pip install -q torch_cluster -f https://data.pyg.org/whl/torch-${TORCH}.html\n",
41 | "!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {
48 | "id": "2-cSha6f6Jxm"
49 | },
50 | "outputs": [],
51 | "source": [
52 | "# Helper function for visualization.\n",
53 | "%matplotlib inline\n",
54 | "import matplotlib.pyplot as plt\n",
55 | "from sklearn.manifold import TSNE\n",
56 | "\n",
57 | "def visualize(h, color):\n",
58 | " z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())\n",
59 | "\n",
60 | " plt.figure(figsize=(10,10))\n",
61 | " plt.xticks([])\n",
62 | " plt.yticks([])\n",
63 | "\n",
64 | " plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap=\"Set2\")\n",
65 | " plt.show()"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {
71 | "id": "dszt2RUHE7lW"
72 | },
73 | "source": [
74 | "# Node Classification with Graph Neural Networks\n",
75 | "\n",
76 | "In this tutorial we will look at **applying Graph Neural Networks (GNNs) to the task of node classification**.\n",
77 | "Here, we are given the ground-truth labels of only a small subset of nodes, and want to infer the labels for all the remaining nodes (*transductive learning*).\n",
78 | "\n",
79 | "This time we make use of the `Cora` dataset, which is a **citation network** where nodes represent documents.\n",
80 | "Each node is described by a 1433-dimensional bag-of-words feature vector.\n",
81 | "Two documents are connected if there exists a citation link between them.\n",
82 | "The task is to infer the category of each document (7 in total).\n",
83 | "\n",
84 | "This dataset was first introduced by [Yang et al. (2016)](https://arxiv.org/abs/1603.08861) as one of the datasets of the `Planetoid` benchmark suite.\n",
85 | "We can again make use of [PyTorch Geometric](https://github.com/rusty1s/pytorch_geometric) for easy access to this dataset via [`torch_geometric.datasets.Planetoid`](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.Planetoid):"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {
92 | "id": "imGrKO5YH11-"
93 | },
94 | "outputs": [],
95 | "source": [
96 | "from torch_geometric.datasets import Planetoid\n",
97 | "from torch_geometric.transforms import NormalizeFeatures\n",
98 | "\n",
99 | "dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())\n",
100 | "\n",
101 | "print()\n",
102 | "print(f'Dataset: {dataset}:')\n",
103 | "print('======================')\n",
104 | "print(f'Number of graphs: {len(dataset)}')\n",
105 | "print(f'Number of features: {dataset.num_features}')\n",
106 | "print(f'Number of classes: {dataset.num_classes}')\n",
107 | "\n",
108 | "data = dataset[0] # Get the first graph object.\n",
109 | "\n",
110 | "print()\n",
111 | "print(data)\n",
112 | "print('===========================================================================================================')\n",
113 | "\n",
114 | "# Gather some statistics about the graph.\n",
115 | "print(f'Number of nodes: {data.num_nodes}')\n",
116 | "print(f'Number of edges: {data.num_edges}')\n",
117 | "print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')\n",
118 | "print(f'Number of training nodes: {data.train_mask.sum()}')\n",
119 | "print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')\n",
120 | "print(f'Has isolated nodes: {data.has_isolated_nodes()}')\n",
121 | "print(f'Has self-loops: {data.has_self_loops()}')\n",
122 | "print(f'Is undirected: {data.is_undirected()}')"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {
128 | "id": "eqWR0j_kIx67"
129 | },
130 | "source": [
131 | "Overall, this dataset is quite similar to the previously used [`KarateClub`](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.KarateClub) network.\n",
132 | "We can see that the `Cora` network holds 2,708 nodes and 10,556 edges, resulting in an average node degree of 3.9.\n",
133 | "For training this dataset, we are given the ground-truth categories of 140 nodes (20 for each class).\n",
134 | "This results in a training node label rate of only 5%.\n",
135 | "\n",
136 | "In contrast to `KarateClub`, this graph holds the additional attributes `val_mask` and `test_mask`, which denote which nodes should be used for validation and testing.\n",
137 | "Furthermore, we make use of **[data transformations](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-transforms) via `transform=NormalizeFeatures()`**.\n",
138 | "Transforms can be used to modify your input data before inputting them into a neural network, *e.g.*, for normalization or data augmentation.\n",
139 | "Here, we [row-normalize](https://pytorch-geometric.readthedocs.io/en/latest/modules/transforms.html#torch_geometric.transforms.NormalizeFeatures) the bag-of-words input feature vectors.\n",
140 | "\n",
141 | "We can further see that this network is undirected, and that there are no isolated nodes (each document has at least one citation)."
142 | ]
143 | },
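{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As an optional check (an addition to the original tutorial), the cell below counts the nodes selected by each mask and looks at a few feature-row sums, which should be one after `NormalizeFeatures` has been applied."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Optional sanity check of the split masks and the row-normalized features.\n",
  "print(f'Training nodes:   {int(data.train_mask.sum())}')\n",
  "print(f'Validation nodes: {int(data.val_mask.sum())}')\n",
  "print(f'Test nodes:       {int(data.test_mask.sum())}')\n",
  "\n",
  "# After NormalizeFeatures, each bag-of-words feature vector sums to one.\n",
  "print(f'First five feature-row sums: {data.x.sum(dim=-1)[:5]}')"
 ]
},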
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {
147 | "id": "5IRdAELVKOl6"
148 | },
149 | "source": [
150 | "## Training a Multi-layer Perceptron (MLP)\n",
151 | "\n",
152 | "In theory, we should be able to infer the category of a document solely based on its content, *i.e.* its bag-of-words feature representation, without taking any relational information into account.\n",
153 | "\n",
154 | "Let's verify that by constructing a simple MLP that solely operates on input node features (using shared weights across all nodes):"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {
161 | "id": "afXwPCA3KNoC"
162 | },
163 | "outputs": [],
164 | "source": [
165 | "import torch\n",
166 | "from torch.nn import Linear\n",
167 | "import torch.nn.functional as F\n",
168 | "\n",
169 | "\n",
170 | "class MLP(torch.nn.Module):\n",
171 | " def __init__(self, hidden_channels):\n",
172 | " super().__init__()\n",
173 | " torch.manual_seed(12345)\n",
174 | " self.lin1 = Linear(dataset.num_features, hidden_channels)\n",
175 | " self.lin2 = Linear(hidden_channels, dataset.num_classes)\n",
176 | "\n",
177 | " def forward(self, x):\n",
178 | " x = self.lin1(x)\n",
179 | " x = x.relu()\n",
180 | " x = F.dropout(x, p=0.5, training=self.training)\n",
181 | " x = self.lin2(x)\n",
182 | " return x\n",
183 | "\n",
184 | "model = MLP(hidden_channels=16)\n",
185 | "print(model)"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {
191 | "id": "L_PO9EEHL7J6"
192 | },
193 | "source": [
194 | "Our MLP is defined by two linear layers and enhanced by [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html?highlight=relu#torch.nn.ReLU) non-linearity and [dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html?highlight=dropout#torch.nn.Dropout).\n",
195 | "Here, we first reduce the 1433-dimensional feature vector to a low-dimensional embedding (`hidden_channels=16`), while the second linear layer acts as a classifier that should map each low-dimensional node embedding to one of the 7 classes.\n",
196 | "\n",
197 | "Let's train our simple MLP. We again make use of the **cross entropy loss** and **Adam optimizer**.\n",
198 | "This time, we also define a **`test` function** to evaluate how well our final model performs on the test node set (whose labels have not been observed during training)."
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {
205 | "id": "0YgHcLXMLk4o"
206 | },
207 | "outputs": [],
208 | "source": [
209 | "from IPython.display import Javascript # Restrict height of output cell.\n",
210 | "display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))\n",
211 | "\n",
212 | "model = MLP(hidden_channels=16)\n",
213 | "criterion = torch.nn.CrossEntropyLoss() # Define loss criterion.\n",
214 | "optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4) # Define optimizer.\n",
215 | "\n",
216 | "def train():\n",
217 | " model.train()\n",
218 | " optimizer.zero_grad() # Clear gradients.\n",
219 | " out = model(data.x) # Perform a single forward pass.\n",
220 | " loss = criterion(out[data.train_mask], data.y[data.train_mask]) # Compute the loss solely based on the training nodes.\n",
221 | " loss.backward() # Derive gradients.\n",
222 | " optimizer.step() # Update parameters based on gradients.\n",
223 | " return loss\n",
224 | "\n",
225 | "def test():\n",
226 | " model.eval()\n",
227 | " out = model(data.x)\n",
228 | " pred = out.argmax(dim=1) # Use the class with highest probability.\n",
229 | " test_correct = pred[data.test_mask] == data.y[data.test_mask] # Check against ground-truth labels.\n",
230 | " test_acc = int(test_correct.sum()) / int(data.test_mask.sum()) # Derive ratio of correct predictions.\n",
231 | " return test_acc\n",
232 | "\n",
233 | "for epoch in range(1, 201):\n",
234 | " loss = train()\n",
235 | " print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {
241 | "id": "kG4IKy9YOLGF"
242 | },
243 | "source": [
244 | "After training the model, we can call the `test` function to see how well our model performs on unseen labels.\n",
245 | "Here, we are interested in the accuracy of the model, *i.e.*, the ratio of correctly classified nodes:"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {
252 | "id": "dBBCeLlAL0oL"
253 | },
254 | "outputs": [],
255 | "source": [
256 | "test_acc = test()\n",
257 | "print(f'Test Accuracy: {test_acc:.4f}')"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {
263 | "id": "_jjJOB-VO-cw"
264 | },
265 | "source": [
266 | "As one can see, our MLP performs rather poorly, with only about 59% test accuracy.\n",
267 | "But why does the MLP not perform better?\n",
268 | "The main reason is that the model suffers from heavy overfitting, since it only has access to a **small number of training nodes**, and therefore generalizes poorly to unseen node representations.\n",
269 | "\n",
270 | "It also fails to incorporate an important bias into the model: **Cited papers are very likely related to the category of a document**.\n",
271 | "That is exactly where Graph Neural Networks come into play and can help to boost the performance of our model.\n",
272 | "\n"
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "metadata": {
278 | "id": "_OWGw54wRd98"
279 | },
280 | "source": [
281 | "## Training a Graph Neural Network (GNN)\n",
282 | "\n",
283 | "We can easily convert our MLP to a GNN by swapping the `torch.nn.Linear` layers with PyG's GNN operators.\n",
284 | "\n",
285 | "Following up on [the first part of this tutorial](https://github.com/jngadiub/ML_course_Pavia_23_WIP/blob/main/gnn/1.Introduction.ipynb), we replace the linear layers with the [`GCNConv`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv) module.\n",
286 | "To recap, the **GCN layer** ([Kipf et al. (2017)](https://arxiv.org/abs/1609.02907)) is defined as\n",
287 | "\n",
288 | "$$\n",
289 | "\\mathbf{x}_v^{(\\ell + 1)} = \\mathbf{W}^{(\\ell + 1)} \\sum_{w \\in \\mathcal{N}(v) \\, \\cup \\, \\{ v \\}} \\frac{1}{c_{w,v}} \\cdot \\mathbf{x}_w^{(\\ell)}\n",
290 | "$$\n",
291 | "\n",
292 | "where $\\mathbf{W}^{(\\ell + 1)}$ denotes a trainable weight matrix of shape `[num_output_features, num_input_features]` and $c_{w,v}$ refers to a fixed normalization coefficient for each edge.\n",
293 | "In contrast, a single `Linear` layer is defined as\n",
294 | "\n",
295 | "$$\n",
296 | "\\mathbf{x}_v^{(\\ell + 1)} = \\mathbf{W}^{(\\ell + 1)} \\mathbf{x}_v^{(\\ell)}\n",
297 | "$$\n",
298 | "\n",
299 | "which does not make use of neighboring node information."
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "metadata": {
306 | "id": "fmXWs1dKIzD8"
307 | },
308 | "outputs": [],
309 | "source": [
310 | "from torch_geometric.nn import GCNConv\n",
311 | "\n",
312 | "\n",
313 | "class GCN(torch.nn.Module):\n",
314 | " def __init__(self, hidden_channels):\n",
315 | " super().__init__()\n",
316 | " torch.manual_seed(1234567)\n",
317 | " self.conv1 = GCNConv(dataset.num_features, hidden_channels)\n",
318 | " self.conv2 = GCNConv(hidden_channels, dataset.num_classes)\n",
319 | "\n",
320 | " def forward(self, x, edge_index):\n",
321 | " x = self.conv1(x, edge_index)\n",
322 | " x = x.relu()\n",
323 | " x = F.dropout(x, p=0.5, training=self.training)\n",
324 | " x = self.conv2(x, edge_index)\n",
325 | " return x\n",
326 | "\n",
327 | "model = GCN(hidden_channels=16)\n",
328 | "print(model)"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {
334 | "id": "XhO8QDgYf_Q8"
335 | },
336 | "source": [
337 | "Let's visualize the node embeddings of our **untrained** GCN network.\n",
338 | "For visualization, we make use of [**TSNE**](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to embed our 7-dimensional node embeddings onto a 2D plane."
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {
345 | "id": "ntt9qVFXlk6A"
346 | },
347 | "outputs": [],
348 | "source": [
349 | "model = GCN(hidden_channels=16)\n",
350 | "model.eval()\n",
351 | "\n",
352 | "out = model(data.x, data.edge_index)\n",
353 | "visualize(out, color=data.y)"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {
359 | "id": "Fpdscco5g6kG"
360 | },
361 | "source": [
362 | "We certainly can do better by training our model.\n",
363 | "The training and testing procedure is once again the same, but this time we make use of the node features `x` **and** the graph connectivity `edge_index` as input to our GCN model."
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": null,
369 | "metadata": {
370 | "id": "p3TAi69zI1bO"
371 | },
372 | "outputs": [],
373 | "source": [
374 | "from IPython.display import Javascript # Restrict height of output cell.\n",
375 | "display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))\n",
376 | "\n",
377 | "model = GCN(hidden_channels=16)\n",
378 | "optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)\n",
379 | "criterion = torch.nn.CrossEntropyLoss()\n",
380 | "\n",
381 | "def train():\n",
382 | " model.train()\n",
383 | " optimizer.zero_grad() # Clear gradients.\n",
384 | " out = model(data.x, data.edge_index) # Perform a single forward pass.\n",
385 | " loss = criterion(out[data.train_mask], data.y[data.train_mask]) # Compute the loss solely based on the training nodes.\n",
386 | " loss.backward() # Derive gradients.\n",
387 | " optimizer.step() # Update parameters based on gradients.\n",
388 | " return loss\n",
389 | "\n",
390 | "def test():\n",
391 | " model.eval()\n",
392 | " out = model(data.x, data.edge_index)\n",
393 | " pred = out.argmax(dim=1) # Use the class with highest probability.\n",
394 | " test_correct = pred[data.test_mask] == data.y[data.test_mask] # Check against ground-truth labels.\n",
395 | " test_acc = int(test_correct.sum()) / int(data.test_mask.sum()) # Derive ratio of correct predictions.\n",
396 | " return test_acc\n",
397 | "\n",
398 | "\n",
399 | "for epoch in range(1, 101):\n",
400 | " loss = train()\n",
401 | " print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {
407 | "id": "opBBGQHqg5ZO"
408 | },
409 | "source": [
410 | "After training the model, we can check its test accuracy:"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": null,
416 | "metadata": {
417 | "id": "8zOh6IIeI3Op"
418 | },
419 | "outputs": [],
420 | "source": [
421 | "test_acc = test()\n",
422 | "print(f'Test Accuracy: {test_acc:.4f}')"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {
428 | "id": "yhofzjaqhfY2"
429 | },
430 | "source": [
431 | "**There it is!**\n",
432 | "By simply swapping the linear layers with GNN layers, we can reach **81.5% test accuracy**!\n",
433 | "This is in stark contrast to the 59% test accuracy obtained by our MLP, indicating that relational information plays a crucial role in obtaining better performance.\n",
434 | "\n",
435 | "We can also verify that once again by looking at the output embeddings of our **trained** model, which now produces a far better clustering of nodes of the same category."
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {
442 | "id": "9r_VmGMukf5R"
443 | },
444 | "outputs": [],
445 | "source": [
446 | "model.eval()\n",
447 | "\n",
448 | "out = model(data.x, data.edge_index)\n",
449 | "visualize(out, color=data.y)"
450 | ]
451 | },
452 | {
453 | "cell_type": "markdown",
454 | "metadata": {
455 | "id": "S-q6Do4INLET"
456 | },
457 | "source": [
458 | "## (Optional) Exercises\n",
459 | "\n",
460 | "1. How does `GCN` behave when increasing the hidden feature dimensionality or the number of layers?\n",
461 | "Does increasing the number of layers help at all?\n",
462 | "\n",
463 | "2. You can try to use different GNN layers to see how model performance changes. What happens if you swap out all `GCNConv` instances with [`GATConv`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GATConv) layers that make use of attention? Try to write a 2-layer `GAT` model that makes use of 8 attention heads in the first layer and 1 attention head in the second layer, uses a `dropout` ratio of `0.6` inside and outside each `GATConv` call, and uses a `hidden_channels` dimension of `8` per head. A possible solution sketch is included after the template cell below, in case you get stuck."
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": null,
469 | "metadata": {
470 | "id": "pcr9joFQ6Mri"
471 | },
472 | "outputs": [],
473 | "source": [
474 | "from torch_geometric.nn import GATConv\n",
475 | "\n",
476 | "\n",
477 | "class GAT(torch.nn.Module):\n",
478 | " def __init__(self, hidden_channels, heads):\n",
479 | " super().__init__()\n",
480 | " torch.manual_seed(1234567)\n",
481 | " # TODO\n",
482 | "\n",
483 | " def forward(self, x, edge_index):\n",
484 | "        pass  # TODO\n",
485 | "\n",
486 | "model = GAT(hidden_channels=8, heads=8)\n",
487 | "print(model)\n",
488 | "\n",
489 | "optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)\n",
490 | "criterion = torch.nn.CrossEntropyLoss()\n",
491 | "\n",
492 | "def train():\n",
493 | " model.train()\n",
494 | " optimizer.zero_grad() # Clear gradients.\n",
495 | " out = model(data.x, data.edge_index) # Perform a single forward pass.\n",
496 | " loss = criterion(out[data.train_mask], data.y[data.train_mask]) # Compute the loss solely based on the training nodes.\n",
497 | " loss.backward() # Derive gradients.\n",
498 | " optimizer.step() # Update parameters based on gradients.\n",
499 | " return loss\n",
500 | "\n",
501 | "def test(mask):\n",
502 | " model.eval()\n",
503 | " out = model(data.x, data.edge_index)\n",
504 | " pred = out.argmax(dim=1) # Use the class with highest probability.\n",
505 | " correct = pred[mask] == data.y[mask] # Check against ground-truth labels.\n",
506 | " acc = int(correct.sum()) / int(mask.sum()) # Derive ratio of correct predictions.\n",
507 | " return acc\n",
508 | "\n",
509 | "\n",
510 | "for epoch in range(1, 201):\n",
511 | " loss = train()\n",
512 | " val_acc = test(data.val_mask)\n",
513 | " test_acc = test(data.test_mask)\n",
514 | " print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}')"
515 | ]
516 | },
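{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "If you get stuck, the cell below shows one possible way to fill in the template above. It is a sketch following the hints in the exercise (8 heads in the first layer, 1 head in the second, a dropout ratio of 0.6 inside and outside each `GATConv`), not the only valid solution; try the exercise yourself before looking."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# One possible completion of the exercise template (a sketch, not the unique answer).\n",
  "class GATSolution(torch.nn.Module):\n",
  "    def __init__(self, hidden_channels, heads):\n",
  "        super().__init__()\n",
  "        torch.manual_seed(1234567)\n",
  "        # First layer: 'heads' attention heads, each producing 'hidden_channels' features.\n",
  "        self.conv1 = GATConv(dataset.num_features, hidden_channels, heads=heads, dropout=0.6)\n",
  "        # Second layer: a single head mapping the concatenated head outputs to the classes.\n",
  "        self.conv2 = GATConv(hidden_channels * heads, dataset.num_classes, heads=1, dropout=0.6)\n",
  "\n",
  "    def forward(self, x, edge_index):\n",
  "        x = F.dropout(x, p=0.6, training=self.training)\n",
  "        x = F.elu(self.conv1(x, edge_index))\n",
  "        x = F.dropout(x, p=0.6, training=self.training)\n",
  "        x = self.conv2(x, edge_index)\n",
  "        return x\n",
  "\n",
  "print(GATSolution(hidden_channels=8, heads=8))"
 ]
},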
517 | {
518 | "cell_type": "code",
519 | "execution_count": null,
520 | "metadata": {
521 | "id": "dBOp5Dt5ANHt"
522 | },
523 | "outputs": [],
524 | "source": []
525 | }
526 | ],
527 | "metadata": {
528 | "accelerator": "GPU",
529 | "colab": {
530 | "include_colab_link": true,
531 | "provenance": []
532 | },
533 | "gpuClass": "standard",
534 | "kernelspec": {
535 | "display_name": "Python 3 (ipykernel)",
536 | "language": "python",
537 | "name": "python3"
538 | },
539 | "language_info": {
540 | "codemirror_mode": {
541 | "name": "ipython",
542 | "version": 3
543 | },
544 | "file_extension": ".py",
545 | "mimetype": "text/x-python",
546 | "name": "python",
547 | "nbconvert_exporter": "python",
548 | "pygments_lexer": "ipython3",
549 | "version": "3.8.9"
550 | }
551 | },
552 | "nbformat": 4,
553 | "nbformat_minor": 1
554 | }
555 |
--------------------------------------------------------------------------------
/slides/GettingStarted.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/slides/GettingStarted.pdf
--------------------------------------------------------------------------------
/slides/LHCJetTaggingIntro.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/makagan/SSI_Projects/127dd9b49ffdf9d37c763fe71bca172c1127599b/slides/LHCJetTaggingIntro.pdf
--------------------------------------------------------------------------------