├── 07_recurrent_neural_networks ├── recurrent_neural_networks.ipynb └── README.md ├── 08_transformers_and_attention_mechanisms ├── transformers_and_attention_mechanisms.ipynb └── README.md ├── 11_pytorch_lightning └── README.md ├── 09_generative_models └── README.md ├── 10_model_deployment └── README.md ├── 03_automatic_differentiation ├── automatic_differentiation.ipynb └── README.md ├── requirements.txt ├── LICENSE ├── 13_custom_extensions ├── README.md └── custom_extensions.py ├── 19_neural_architecture_search └── README.md ├── 17_model_optimization_techniques └── README.md ├── 14_performance_optimization ├── README.md └── performance_optimization.py ├── 15_advanced_model_architectures └── README.md ├── 16_reinforcement_learning └── README.md ├── 20_bayesian_deep_learning └── README.md ├── 18_meta_learning └── README.md ├── 21_advanced_research_topics └── README.md ├── 05_data_loading_preprocessing ├── data_loading_preprocessing.ipynb └── README.md ├── README.md ├── 01_pytorch_basics ├── pytorch_basics.py └── README.md ├── 12_distributed_training ├── README.md └── distributed_training.py ├── 06_convolutional_neural_networks └── README.md ├── 02_neural_networks_fundamentals └── README.md └── 04_training_neural_networks └── README.md /07_recurrent_neural_networks/recurrent_neural_networks.ipynb: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /08_transformers_and_attention_mechanisms/transformers_and_attention_mechanisms.ipynb: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /11_pytorch_lightning/README.md: -------------------------------------------------------------------------------- 1 | # PyTorch Lightning 2 | 3 | This section will cover PyTorch Lightning. 4 | 5 | ## Contents 6 | - Lightning modules 7 | - Trainers and callbacks 8 | - Multi-GPU training 9 | - Experiment logging -------------------------------------------------------------------------------- /09_generative_models/README.md: -------------------------------------------------------------------------------- 1 | # Generative Models 2 | 3 | This section will cover generative models in PyTorch. 4 | 5 | ## Contents 6 | - Autoencoders 7 | - Variational Autoencoders (VAEs) 8 | - Generative Adversarial Networks (GANs) 9 | - Diffusion models -------------------------------------------------------------------------------- /10_model_deployment/README.md: -------------------------------------------------------------------------------- 1 | # Model Deployment 2 | 3 | This section will cover model deployment in PyTorch. 4 | 5 | ## Contents 6 | - TorchScript and tracing 7 | - ONNX export 8 | - Quantization 9 | - Mobile deployment (PyTorch Mobile) 10 | - Web deployment (ONNX.js) -------------------------------------------------------------------------------- /03_automatic_differentiation/automatic_differentiation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Automatic Differentiation with PyTorch Autograd\n", 8 | "\n", 9 | "This notebook provides a detailed introduction to PyTorch's Autograd system, covering automatic differentiation concepts and practical implementation." 
10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "nbformat": 4, 19 | "nbformat_minor": 2 20 | } 21 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # Core PyTorch Dependencies 2 | torch>=2.0.0 3 | torchvision>=0.15.0 4 | torchaudio>=2.0.0 5 | 6 | # Data Science and Visualization 7 | numpy>=1.20.0 8 | matplotlib>=3.5.0 9 | seaborn>=0.12.0 10 | pandas>=1.3.0 11 | scikit-learn>=1.0.0 12 | 13 | # Jupyter Notebook Support 14 | jupyter>=1.0.0 15 | ipykernel>=6.0.0 16 | notebook>=6.4.0 17 | 18 | # PyTorch Ecosystem 19 | pytorch-lightning>=2.0.0 20 | torchmetrics>=0.10.0 21 | torchtext>=0.15.0 22 | 23 | # Deep Learning Libraries 24 | transformers>=4.20.0 25 | timm>=0.6.0 26 | 27 | # Computer Vision 28 | pillow>=9.0.0 29 | opencv-python>=4.6.0 30 | albumentations>=1.3.0 31 | 32 | # Natural Language Processing 33 | nltk>=3.7.0 34 | spacy>=3.4.0 35 | regex>=2022.3.15 36 | 37 | # Model Export and Deployment 38 | onnx>=1.12.0 39 | onnxruntime>=1.12.0 40 | 41 | # Training Utilities 42 | tqdm>=4.62.0 43 | tensorboard>=2.10.0 44 | wandb>=0.13.0 45 | 46 | # Additional Utilities 47 | requests>=2.25.0 48 | scipy>=1.7.0 -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Nicolai Høirup Nielsen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /13_custom_extensions/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 13: Custom Extensions (C++ and CUDA) 2 | 3 | ## Overview 4 | This tutorial covers how to extend PyTorch with custom C++ and CUDA operations for performance-critical applications. You'll learn how to write, compile, and integrate custom extensions into your PyTorch workflows. 
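As a quick preview of the workflow, the snippet below is a minimal sketch of a JIT-compiled C++ extension built with `torch.utils.cpp_extension.load_inline`. The operator (`scaled_add`) and the extension name are illustrative placeholders rather than code from this tutorial, and a working C++ toolchain with the PyTorch headers is assumed to be available.

```python
import torch
from torch.utils.cpp_extension import load_inline

# C++ source for a toy operator: out = a + scale * b.
cpp_source = """
#include <torch/extension.h>

torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double scale) {
    return a + scale * b;
}
"""

# load_inline compiles the source on first use and exposes the listed
# functions as a Python module (the bindings are generated automatically).
ext = load_inline(
    name="scaled_add_ext",
    cpp_sources=cpp_source,
    functions=["scaled_add"],
    verbose=True,
)

a, b = torch.randn(3), torch.randn(3)
out = ext.scaled_add(a, b, 2.0)
print(out)
print(torch.allclose(out, a + 2.0 * b))  # sanity-check against pure PyTorch
```

Ahead-of-time builds with `setuptools`, CUDA kernels, and autograd integration follow the same basic pattern and are covered in the sections below.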
5 | 6 | ## Contents 7 | - Understanding when to use custom extensions 8 | - Writing C++ extensions 9 | - Creating CUDA kernels 10 | - Building and packaging extensions 11 | - JIT compilation vs ahead-of-time compilation 12 | - Debugging custom extensions 13 | 14 | ## Learning Objectives 15 | - Write custom C++ operations for PyTorch 16 | - Create CUDA kernels for GPU acceleration 17 | - Build and integrate extensions into PyTorch 18 | - Debug and optimize custom operations 19 | - Understand memory management in extensions 20 | 21 | ## Prerequisites 22 | - Strong understanding of PyTorch fundamentals 23 | - Basic C++ knowledge 24 | - CUDA programming basics (for GPU extensions) 25 | - Understanding of PyTorch's autograd system 26 | 27 | ## Key Concepts 28 | 1. **PyTorch Extension API**: Interface for creating custom operations 29 | 2. **Tensor Memory Layout**: Understanding contiguous memory and strides 30 | 3. **Autograd Integration**: Making custom ops work with automatic differentiation 31 | 4. **CUDA Kernels**: Writing GPU-accelerated operations 32 | 5. **Build Systems**: Using setuptools and JIT compilation 33 | 34 | ## Practical Applications 35 | - Performance-critical operations 36 | - Novel layer implementations 37 | - Custom optimizers 38 | - Specialized data structures 39 | - Hardware-specific optimizations 40 | 41 | ## Next Steps 42 | After completing this tutorial, you'll be able to create high-performance custom operations that seamlessly integrate with PyTorch's ecosystem. -------------------------------------------------------------------------------- /19_neural_architecture_search/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 19: Neural Architecture Search 2 | 3 | ## Overview 4 | This tutorial explores Neural Architecture Search (NAS) techniques for automatically designing optimal neural network architectures. You'll learn about different search strategies, search spaces, and performance estimation methods, with practical implementations in PyTorch. 5 | 6 | ## Contents 7 | - Introduction to NAS concepts and motivation 8 | - Search space design 9 | - Random search and grid search 10 | - Evolutionary algorithms for NAS 11 | - Differentiable architecture search (DARTS) 12 | - Efficient NAS techniques 13 | - Performance estimation strategies 14 | 15 | ## Learning Objectives 16 | - Understand the NAS problem formulation 17 | - Design effective search spaces 18 | - Implement various search strategies 19 | - Build differentiable architecture search 20 | - Apply early stopping and performance prediction 21 | - Evaluate and compare architectures 22 | 23 | ## Prerequisites 24 | - Strong PyTorch and deep learning knowledge 25 | - Understanding of various network architectures 26 | - Basic knowledge of optimization algorithms 27 | - Familiarity with computational graphs 28 | 29 | ## Key Concepts 30 | 1. **Search Space**: Set of possible architectures 31 | 2. **Search Strategy**: Algorithm to explore the space 32 | 3. **Performance Estimation**: Evaluating architectures efficiently 33 | 4. **Supernet**: Weight-sharing approaches 34 | 5. 
**Architecture Encoding**: Representing architectures 35 | 36 | ## Practical Applications 37 | - AutoML systems 38 | - Hardware-specific optimization 39 | - Domain-specific architecture design 40 | - Model compression 41 | - Multi-objective optimization 42 | - Efficient model discovery 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to implement NAS techniques to automatically discover optimal architectures for your specific tasks and constraints. -------------------------------------------------------------------------------- /17_model_optimization_techniques/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 17: Model Optimization Techniques 2 | 3 | ## Overview 4 | This tutorial covers advanced model optimization techniques including quantization, pruning, knowledge distillation, and neural architecture search. You'll learn how to make models smaller, faster, and more efficient for deployment while maintaining accuracy. 5 | 6 | ## Contents 7 | - Model quantization (INT8, dynamic, static) 8 | - Network pruning (structured and unstructured) 9 | - Knowledge distillation 10 | - Model compression techniques 11 | - Efficient inference optimization 12 | - Hardware-aware optimization 13 | - Deployment considerations 14 | 15 | ## Learning Objectives 16 | - Implement various quantization schemes 17 | - Apply pruning to reduce model size 18 | - Use knowledge distillation for model compression 19 | - Optimize models for specific hardware 20 | - Balance accuracy vs efficiency trade-offs 21 | - Deploy optimized models effectively 22 | 23 | ## Prerequisites 24 | - Strong PyTorch fundamentals 25 | - Understanding of neural network architectures 26 | - Basic knowledge of computer architecture 27 | - Familiarity with model training 28 | 29 | ## Key Concepts 30 | 1. **Quantization**: Reducing numerical precision 31 | 2. **Pruning**: Removing unnecessary parameters 32 | 3. **Distillation**: Transferring knowledge to smaller models 33 | 4. **Compression**: Reducing model size and complexity 34 | 5. **Hardware Optimization**: Tailoring models for specific devices 35 | 36 | ## Practical Applications 37 | - Mobile and edge deployment 38 | - Real-time inference systems 39 | - Resource-constrained environments 40 | - Cloud cost optimization 41 | - Embedded AI systems 42 | - IoT applications 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to optimize PyTorch models for production deployment, significantly reducing their computational requirements while maintaining performance. -------------------------------------------------------------------------------- /14_performance_optimization/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 14: Performance Optimization 2 | 3 | ## Overview 4 | This tutorial covers comprehensive performance optimization techniques for PyTorch models, from basic profiling to advanced optimization strategies. You'll learn how to identify bottlenecks and apply various optimization techniques to improve training and inference speed. 
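As a quick preview, the snippet below is a minimal sketch of the profiling workflow this tutorial starts from, using `torch.profiler` on a small stand-in model; the layer sizes, batch size, and iteration count are arbitrary placeholders rather than code from this tutorial.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# A small stand-in model and batch; swap in your own workload.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
x = torch.randn(64, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

# Record a few forward passes and see which operators dominate runtime.
with profile(activities=activities, record_shapes=True) as prof:
    with record_function("forward_pass"):
        for _ in range(10):
            model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The sections below build on this kind of trace to target memory usage, data loading, mixed precision, and parallelism.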
5 | 6 | ## Contents 7 | - Profiling PyTorch models 8 | - Memory optimization techniques 9 | - Mixed precision training 10 | - Data loading optimization 11 | - Model parallelism and distributed training 12 | - Kernel fusion and graph optimization 13 | - Hardware-specific optimizations 14 | 15 | ## Learning Objectives 16 | - Profile and identify performance bottlenecks 17 | - Optimize memory usage and reduce memory fragmentation 18 | - Implement mixed precision training effectively 19 | - Speed up data loading pipelines 20 | - Apply model and data parallelism 21 | - Use TorchScript for production optimization 22 | 23 | ## Prerequisites 24 | - Strong understanding of PyTorch fundamentals 25 | - Experience training neural networks 26 | - Basic understanding of GPU architecture 27 | - Familiarity with Python profiling tools 28 | 29 | ## Key Concepts 30 | 1. **Profiling**: Using PyTorch profiler to identify bottlenecks 31 | 2. **Memory Management**: Efficient tensor allocation and deallocation 32 | 3. **Mixed Precision**: Using FP16/BF16 for faster computation 33 | 4. **Data Pipeline**: Optimizing data loading and preprocessing 34 | 5. **Parallelism**: Distributing computation across devices 35 | 36 | ## Practical Applications 37 | - Large-scale model training 38 | - Real-time inference systems 39 | - Mobile and edge deployment 40 | - Cloud-based ML services 41 | - Research experiments at scale 42 | 43 | ## Next Steps 44 | After completing this tutorial, you'll be equipped to optimize PyTorch models for various deployment scenarios and achieve significant performance improvements. -------------------------------------------------------------------------------- /15_advanced_model_architectures/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 15: Advanced Model Architectures 2 | 3 | ## Overview 4 | This tutorial explores cutting-edge neural network architectures including Graph Neural Networks (GNNs), Vision Transformers (ViT), Neural Architecture Search (NAS) results, and other state-of-the-art models. You'll learn to implement and train these advanced architectures for various tasks. 5 | 6 | ## Contents 7 | - Graph Neural Networks (GCN, GAT, GraphSAGE) 8 | - Vision Transformers and variants 9 | - EfficientNet and compound scaling 10 | - Neural ODEs 11 | - Capsule Networks 12 | - Self-supervised learning architectures 13 | - Multimodal architectures 14 | 15 | ## Learning Objectives 16 | - Implement Graph Neural Networks for graph-structured data 17 | - Build Vision Transformers from scratch 18 | - Understand and apply efficient model scaling 19 | - Work with continuous-time neural networks 20 | - Implement advanced attention mechanisms 21 | - Design architectures for multimodal learning 22 | 23 | ## Prerequisites 24 | - Strong understanding of CNNs and Transformers 25 | - Familiarity with graph theory basics 26 | - Experience with PyTorch modules and autograd 27 | - Understanding of attention mechanisms 28 | 29 | ## Key Concepts 30 | 1. **Graph Convolutions**: Learning on graph-structured data 31 | 2. **Patch Embeddings**: Converting images to sequences for transformers 32 | 3. **Compound Scaling**: Efficiently scaling model dimensions 33 | 4. **Neural ODEs**: Continuous-depth models 34 | 5. 
**Routing Mechanisms**: Dynamic computation paths 35 | 36 | ## Practical Applications 37 | - Social network analysis 38 | - Molecular property prediction 39 | - Large-scale image classification 40 | - Video understanding 41 | - Multimodal AI systems 42 | - Scientific computing 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be equipped to implement and adapt state-of-the-art architectures for your specific use cases and research. -------------------------------------------------------------------------------- /16_reinforcement_learning/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 16: Reinforcement Learning 2 | 3 | ## Overview 4 | This tutorial introduces reinforcement learning (RL) with PyTorch, covering fundamental algorithms and modern deep RL techniques. You'll learn to implement agents that learn through interaction with environments, from basic Q-learning to advanced policy gradient methods. 5 | 6 | ## Contents 7 | - RL fundamentals and terminology 8 | - Deep Q-Networks (DQN) and variants 9 | - Policy gradient methods (REINFORCE, A2C, PPO) 10 | - Actor-Critic architectures 11 | - Multi-agent reinforcement learning 12 | - Model-based RL approaches 13 | - Practical training tips and tricks 14 | 15 | ## Learning Objectives 16 | - Understand core RL concepts (MDPs, value functions, policies) 17 | - Implement DQN with experience replay and target networks 18 | - Build policy gradient algorithms from scratch 19 | - Create efficient Actor-Critic agents 20 | - Handle continuous action spaces 21 | - Debug and visualize RL training 22 | 23 | ## Prerequisites 24 | - Strong PyTorch fundamentals 25 | - Basic understanding of probability and statistics 26 | - Familiarity with neural network training 27 | - Knowledge of optimization techniques 28 | 29 | ## Key Concepts 30 | 1. **Markov Decision Processes**: Mathematical framework for RL 31 | 2. **Value Functions**: Estimating future rewards 32 | 3. **Policy Optimization**: Learning optimal behavior directly 33 | 4. **Exploration vs Exploitation**: Balancing learning and performance 34 | 5. **Experience Replay**: Efficient use of past experiences 35 | 36 | ## Practical Applications 37 | - Game playing AI 38 | - Robotics control 39 | - Resource management 40 | - Trading strategies 41 | - Autonomous navigation 42 | - Recommendation systems 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to implement and train RL agents for various tasks, understand the trade-offs between different algorithms, and apply RL to real-world problems. -------------------------------------------------------------------------------- /20_bayesian_deep_learning/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 20: Bayesian Deep Learning 2 | 3 | ## Overview 4 | This tutorial explores Bayesian approaches to deep learning, focusing on uncertainty quantification and probabilistic modeling. You'll learn how to build neural networks that can express uncertainty in their predictions, implement various Bayesian neural network techniques, and understand when and why to use them. 
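As a quick preview, the snippet below is a minimal sketch of Monte Carlo Dropout, the first uncertainty technique listed in the contents; the network, layer sizes, and number of stochastic samples are illustrative assumptions rather than code from this tutorial.

```python
import torch
import torch.nn as nn

# A small classifier with dropout; the sizes here are placeholders.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, num_samples=50):
    """Average several stochastic forward passes with dropout left active."""
    model.train()  # keep dropout enabled at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(num_samples)]
        )
    # The mean is the prediction; the spread across samples is an uncertainty proxy.
    return probs.mean(dim=0), probs.std(dim=0)

x = torch.randn(5, 20)
mean, std = mc_dropout_predict(model, x)
print("Predictive mean:\n", mean)
print("Per-class std (uncertainty proxy):\n", std)
```

Variational Bayesian layers and deep ensembles, covered later in this tutorial, provide alternative ways to estimate the same quantities at higher computational cost.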
5 | 6 | ## Contents 7 | - Introduction to Bayesian deep learning 8 | - Monte Carlo Dropout for uncertainty estimation 9 | - Bayesian Neural Networks with variational inference 10 | - Deep ensembles 11 | - Gaussian processes and neural networks 12 | - Uncertainty calibration 13 | - Applications and best practices 14 | 15 | ## Learning Objectives 16 | - Understand uncertainty in neural networks 17 | - Implement Monte Carlo Dropout 18 | - Build Bayesian neural networks with PyTorch 19 | - Create and train deep ensembles 20 | - Quantify and calibrate uncertainty 21 | - Apply Bayesian methods to real problems 22 | 23 | ## Prerequisites 24 | - Strong understanding of deep learning 25 | - Basic probability and statistics 26 | - Familiarity with Bayesian inference concepts 27 | - PyTorch fundamentals 28 | 29 | ## Key Concepts 30 | 1. **Epistemic Uncertainty**: Model uncertainty 31 | 2. **Aleatoric Uncertainty**: Data uncertainty 32 | 3. **Variational Inference**: Approximate Bayesian inference 33 | 4. **Posterior Distribution**: Distribution over model parameters 34 | 5. **Predictive Uncertainty**: Uncertainty in predictions 35 | 36 | ## Practical Applications 37 | - Medical diagnosis with confidence estimates 38 | - Autonomous driving safety 39 | - Financial risk assessment 40 | - Active learning 41 | - Out-of-distribution detection 42 | - Robust decision making 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to build neural networks that know what they don't know, crucial for safety-critical applications and informed decision-making. -------------------------------------------------------------------------------- /18_meta_learning/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 18: Meta-Learning and Few-Shot Learning 2 | 3 | ## Overview 4 | This tutorial explores meta-learning (learning to learn) and few-shot learning techniques in PyTorch. You'll learn how to build models that can quickly adapt to new tasks with minimal training data, including implementations of MAML, Prototypical Networks, and other state-of-the-art approaches. 5 | 6 | ## Contents 7 | - Introduction to meta-learning concepts 8 | - Model-Agnostic Meta-Learning (MAML) 9 | - Prototypical Networks 10 | - Matching Networks 11 | - Reptile algorithm 12 | - Few-shot classification and regression 13 | - Applications and best practices 14 | 15 | ## Learning Objectives 16 | - Understand the meta-learning paradigm 17 | - Implement MAML for fast adaptation 18 | - Build Prototypical Networks for few-shot classification 19 | - Create Matching Networks with attention 20 | - Apply meta-learning to real problems 21 | - Evaluate few-shot learning performance 22 | 23 | ## Prerequisites 24 | - Strong PyTorch and deep learning knowledge 25 | - Understanding of gradient-based optimization 26 | - Familiarity with classification tasks 27 | - Basic knowledge of attention mechanisms 28 | 29 | ## Key Concepts 30 | 1. **Meta-Learning**: Learning algorithms that improve with experience 31 | 2. **Few-Shot Learning**: Learning from very few examples 32 | 3. **Task Distribution**: Learning over distributions of tasks 33 | 4. **Fast Adaptation**: Quick learning on new tasks 34 | 5. 
**Episodic Training**: Training on task episodes 35 | 36 | ## Practical Applications 37 | - Medical diagnosis with limited data 38 | - Personalized recommendation systems 39 | - Robotics and control 40 | - Drug discovery 41 | - Rare event detection 42 | - Language understanding for low-resource languages 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to implement meta-learning algorithms for scenarios with limited data and build systems that can quickly adapt to new tasks. -------------------------------------------------------------------------------- /21_advanced_research_topics/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 21: Advanced Research Topics 2 | 3 | ## Overview 4 | This tutorial explores cutting-edge research topics in deep learning, including neural ODEs, implicit neural representations, self-supervised learning, contrastive learning, and other emerging areas. You'll learn about the latest advances in the field and how to implement them using PyTorch. 5 | 6 | ## Contents 7 | - Neural Ordinary Differential Equations (Neural ODEs) 8 | - Implicit Neural Representations (NeRF, SIREN) 9 | - Self-supervised learning methods 10 | - Contrastive learning (SimCLR, MoCo) 11 | - Diffusion models basics 12 | - Transformer variants and improvements 13 | - Emerging architectures and techniques 14 | 15 | ## Learning Objectives 16 | - Understand neural ODEs and continuous-depth models 17 | - Implement implicit neural representations 18 | - Master self-supervised learning techniques 19 | - Build contrastive learning systems 20 | - Explore diffusion models 21 | - Understand recent transformer innovations 22 | - Apply cutting-edge techniques to real problems 23 | 24 | ## Prerequisites 25 | - Strong foundation in deep learning 26 | - Experience with PyTorch 27 | - Understanding of advanced architectures 28 | - Familiarity with research papers 29 | - Mathematical maturity 30 | 31 | ## Key Concepts 32 | 1. **Neural ODEs**: Continuous-depth neural networks 33 | 2. **Implicit Representations**: Coordinate-based neural networks 34 | 3. **Self-Supervised Learning**: Learning without labels 35 | 4. **Contrastive Learning**: Learning representations through contrasts 36 | 5. **Diffusion Models**: Generative models via denoising 37 | 6. **Attention Mechanisms**: Advanced transformer techniques 38 | 39 | ## Practical Applications 40 | - 3D scene reconstruction 41 | - Representation learning 42 | - Few-shot learning 43 | - Image generation 44 | - Video understanding 45 | - Scientific computing 46 | - Robotics and control 47 | 48 | ## Next Steps 49 | After this tutorial, you'll be equipped to: 50 | - Read and implement research papers 51 | - Contribute to open-source projects 52 | - Conduct your own research 53 | - Apply state-of-the-art methods 54 | - Push the boundaries of deep learning -------------------------------------------------------------------------------- /05_data_loading_preprocessing/data_loading_preprocessing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Loading, Preprocessing, and Augmentation in PyTorch\n", 8 | "\n", 9 | "This notebook provides a comprehensive guide to efficiently loading, preprocessing, and augmenting data in PyTorch. Effective data handling is critical for any machine learning pipeline." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import torch\n", 19 | "import torch.nn as nn\n", 20 | "import torch.optim as optim\n", 21 | "import torchvision\n", 22 | "import torchvision.transforms as transforms\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "import numpy as np\n", 25 | "from torch.utils.data import Dataset, DataLoader, random_split\n", 26 | "from PIL import Image\n", 27 | "import os\n", 28 | "import pandas as pd\n", 29 | "from pathlib import Path\n", 30 | "import glob\n", 31 | "import time\n", 32 | "\n", 33 | "# Set random seed for reproducibility\n", 34 | "torch.manual_seed(42)\n", 35 | "np.random.seed(42)\n", 36 | "\n", 37 | "# Device configuration\n", 38 | "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", 39 | "print(f\"Using device: {device}\")\n", 40 | "print(f\"PyTorch version: {torch.__version__}\")\n", 41 | "\n", 42 | "# Create output directory\n", 43 | "output_dir = \"05_data_loading_preprocessing_outputs\"\n", 44 | "os.makedirs(output_dir, exist_ok=True)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## 1. Introduction to Data Handling\n", 52 | "\n", 53 | "Data loading and preprocessing are critical steps in any machine learning pipeline:\n", 54 | "\n", 55 | "- **Loading:** Reading data from various sources (files, databases)\n", 56 | "- **Preprocessing:** Cleaning, transforming, and structuring data\n", 57 | "- **Augmentation:** Artificially expanding the dataset for better generalization" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Demonstrate built-in datasets\n", 67 | "print(\"Using Built-in Datasets:\")\n", 68 | "print(\"-\" * 30)\n", 69 | "\n", 70 | "# Load MNIST dataset\n", 71 | "mnist_dataset = torchvision.datasets.MNIST(\n", 72 | " root='./data', \n", 73 | " train=True, \n", 74 | " download=True, \n", 75 | " transform=transforms.ToTensor()\n", 76 | ")\n", 77 | "\n", 78 | "print(f\"Dataset size: {len(mnist_dataset)}\")\n", 79 | "sample, label = mnist_dataset[0]\n", 80 | "print(f\"Sample shape: {sample.shape}\")\n", 81 | "print(f\"Sample dtype: {sample.dtype}\")\n", 82 | "print(f\"Label: {label}\")\n", 83 | "\n", 84 | "# Visualize a sample\n", 85 | "plt.figure(figsize=(8, 4))\n", 86 | "plt.subplot(1, 2, 1)\n", 87 | "plt.imshow(sample.squeeze(), cmap='gray')\n", 88 | "plt.title(f'MNIST Sample (Label: {label})')\n", 89 | "plt.axis('off')\n", 90 | "\n", 91 | "# Show multiple samples\n", 92 | "plt.subplot(1, 2, 2)\n", 93 | "fig, axes = plt.subplots(2, 3, figsize=(6, 4))\n", 94 | "for i, ax in enumerate(axes.flat):\n", 95 | " if i < 6:\n", 96 | " img, lbl = mnist_dataset[i]\n", 97 | " ax.imshow(img.squeeze(), cmap='gray')\n", 98 | " ax.set_title(f'Label: {lbl}')\n", 99 | " ax.axis('off')\n", 100 | "plt.tight_layout()\n", 101 | "plt.show()" 102 | ] 103 | } 104 | ], 105 | "metadata": { 106 | "language_info": { 107 | "name": "python" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PyTorch Tutorials 2 | 3 | A comprehensive collection of PyTorch tutorials from beginner to expert level. 
This repository aims to provide practical, hands-on examples and explanations for various PyTorch concepts and applications. 4 | 5 | ## 🚀 Quick Start 6 | 7 | ### Installation 8 | ```bash 9 | git clone https://github.com/niconielsen32/pytorch-tutorials.git 10 | cd pytorch-tutorials 11 | pip install -r requirements.txt 12 | ``` 13 | 14 | ### Running the Tutorials 15 | ```bash 16 | # Run Python scripts directly 17 | python 01_pytorch_basics/pytorch_basics.py 18 | 19 | # Or use Jupyter notebooks for interactive learning 20 | jupyter notebook 21 | # Then navigate to any tutorial folder and open the .ipynb file 22 | ``` 23 | 24 | ## 📚 Table of Contents 25 | 26 | ### **Fundamentals** 27 | 28 | #### Beginner Level 29 | 30 | 1. **[PyTorch Basics](01_pytorch_basics/)** 31 | - Tensors, operations, and computational graphs 32 | - NumPy integration 33 | - GPU acceleration 34 | - Basic autograd operations 35 | 36 | 2. **[Neural Networks Fundamentals](02_neural_networks_fundamentals/)** 37 | - Linear layers, activation functions, loss functions, optimizers 38 | - Building your first neural network 39 | - Forward and backward propagation 40 | - nn.Module and nn.Sequential 41 | 42 | 3. **[Automatic Differentiation](03_automatic_differentiation/)** 43 | - Autograd mechanics 44 | - Computing gradients 45 | - Custom autograd functions 46 | - Higher-order derivatives 47 | 48 | 4. **[Training Neural Networks](04_training_neural_networks/)** 49 | - Training loop implementation 50 | - Validation techniques 51 | - Hyperparameter tuning 52 | - Learning rate scheduling 53 | - Early stopping 54 | 55 | 5. **[Data Loading and Preprocessing](05_data_loading_preprocessing/)** 56 | - Dataset and DataLoader classes 57 | - Custom datasets 58 | - Data transformations and augmentation 59 | - Efficient data loading techniques 60 | - Batch processing 61 | 62 | ### **Computer Vision** 63 | 64 | #### Intermediate Level 65 | 66 | 6. **[Convolutional Neural Networks](06_convolutional_neural_networks/)** 67 | - CNN architecture components 68 | - Convolution, pooling, and fully connected layers 69 | - Image classification with CNNs 70 | - Transfer learning with pre-trained models 71 | - Feature visualization 72 | 73 | #### Advanced Computer Vision Applications 74 | - Object detection (YOLO, R-CNN) 75 | - Semantic segmentation 76 | - Instance segmentation 77 | - Image generation 78 | - Style transfer 79 | 80 | ### **Natural Language Processing** 81 | 82 | 7. **[Recurrent Neural Networks](07_recurrent_neural_networks/)** 83 | - RNN architecture 84 | - LSTM and GRU implementations 85 | - Sequence modeling 86 | - Text classification 87 | - Text generation 88 | - Time series forecasting 89 | 90 | 8. **[Transformers and Attention Mechanisms](08_transformers_and_attention_mechanisms/)** 91 | - Self-attention and multi-head attention 92 | - Transformer architecture 93 | - BERT and GPT model implementations 94 | - Fine-tuning pre-trained transformers 95 | - Positional encoding 96 | 97 | ### **Advanced Topics** 98 | 99 | #### Advanced Level 100 | 101 | 9. **[Generative Models](09_generative_models/)** 102 | - Autoencoders 103 | - Variational Autoencoders (VAEs) 104 | - Generative Adversarial Networks (GANs) 105 | - Diffusion models 106 | - Style transfer 107 | 108 | 10. **[Model Deployment](10_model_deployment/)** 109 | - TorchScript and tracing 110 | - ONNX export 111 | - Quantization techniques 112 | - Mobile deployment (PyTorch Mobile) 113 | - Web deployment (ONNX.js) 114 | - Model serving 115 | 116 | 11. 
**[PyTorch Lightning](11_pytorch_lightning/)** 117 | - Lightning modules 118 | - Trainers and callbacks 119 | - Multi-GPU training 120 | - Experiment logging 121 | - Hyperparameter tuning with Lightning 122 | 123 | 12. **[Distributed Training](12_distributed_training/)** 124 | - Data Parallel (DP) for single-machine multi-GPU 125 | - Distributed Data Parallel (DDP) for multi-node training 126 | - Model Parallel for large models 127 | - Pipeline Parallelism for deep networks 128 | - Fully Sharded Data Parallel (FSDP) for extreme scale 129 | 130 | ### **Additional Advanced Topics** 131 | 132 | 13. **[Custom Extensions](13_custom_extensions/)** 133 | - C++ extensions for custom operations 134 | - CUDA kernels for GPU acceleration 135 | - Custom autograd functions 136 | - JIT compilation with TorchScript 137 | - Binding C++/CUDA code to Python 138 | 139 | 14. **[Performance Optimization](14_performance_optimization/)** 140 | - Memory optimization techniques 141 | - Mixed precision training with AMP 142 | - Profiling and benchmarking 143 | - Data loading optimization 144 | - Gradient accumulation and checkpointing 145 | 146 | 15. **[Advanced Model Architectures](15_advanced_model_architectures/)** 147 | - Graph Neural Networks (GNNs) 148 | - Vision Transformers (ViT) 149 | - EfficientNet and compound scaling 150 | - Neural ODEs 151 | - Capsule Networks 152 | 153 | 16. **[Reinforcement Learning](16_reinforcement_learning/)** 154 | - Deep Q-Networks (DQN) 155 | - Policy gradient methods (REINFORCE) 156 | - Actor-Critic and A2C 157 | - Proximal Policy Optimization (PPO) 158 | - Integration with OpenAI Gym 159 | 160 | 17. **[Model Optimization Techniques](17_model_optimization_techniques/)** 161 | - Quantization (dynamic and static) 162 | - Pruning (structured and unstructured) 163 | - Knowledge distillation 164 | - Model compression 165 | - Hardware-aware optimization 166 | 167 | 18. **[Meta-Learning and Few-Shot Learning](18_meta_learning/)** 168 | - Model-Agnostic Meta-Learning (MAML) 169 | - Prototypical Networks 170 | - Matching Networks 171 | - Reptile algorithm 172 | - Few-shot classification tasks 173 | 174 | ### **Expert Level Topics** 175 | 176 | 19. **[Neural Architecture Search](19_neural_architecture_search/)** 177 | - Random search and grid search 178 | - Evolutionary algorithms 179 | - Differentiable Architecture Search (DARTS) 180 | - Efficient Neural Architecture Search (ENAS) 181 | - Performance prediction 182 | 183 | 20. **[Bayesian Deep Learning](20_bayesian_deep_learning/)** 184 | - Bayesian Neural Networks 185 | - Variational inference 186 | - Monte Carlo Dropout 187 | - Deep ensembles 188 | - Uncertainty quantification 189 | 190 | 21. 
**[Advanced Research Topics](21_advanced_research_topics/)** 191 | - Self-supervised learning (SimCLR, BYOL) 192 | - Contrastive learning methods 193 | - Diffusion models 194 | - Neural Radiance Fields (NeRF) 195 | - Implicit neural representations 196 | 197 | ## 📋 Each Tutorial Includes 198 | 199 | - **📖 README.md** - Detailed theory and concepts 200 | - **🐍 Python Script** - Complete runnable code with comments 201 | - **📓 Jupyter Notebook** - Interactive step-by-step learning 202 | 203 | ## 🛠️ Requirements 204 | 205 | - Python 3.8+ 206 | - PyTorch 2.0+ 207 | - torchvision 208 | - torchaudio (for audio tutorials) 209 | - matplotlib 210 | - numpy 211 | - pandas 212 | - scikit-learn 213 | - Jupyter Notebook/Lab 214 | 215 | You can install the required packages using: 216 | ```bash 217 | pip install -r requirements.txt 218 | ``` 219 | 220 | ## 📖 How to Use This Repository 221 | 222 | 1. **Sequential Learning**: Follow the tutorials in order for a comprehensive learning experience 223 | 2. **Topic-Based**: Jump to specific topics based on your interests and needs 224 | 3. **Practice**: Each tutorial contains exercises and examples 225 | 4. **Experiment**: Modify the code and experiment with different parameters 226 | 227 | ### Getting Started 228 | 229 | 1. **Start with the README** in each folder for theoretical background 230 | 2. **Run the Python script** to see the complete implementation 231 | 3. **Open the Jupyter notebook** for interactive learning and experimentation 232 | 233 | ## 🤝 Contributing 234 | 235 | Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change. 236 | 237 | ## 📄 License 238 | 239 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. 240 | 241 | ## 🙏 Acknowledgments 242 | 243 | - PyTorch team for the amazing framework 244 | - The deep learning community for continuous innovation 245 | - All contributors to this repository 246 | 247 | --- 248 | 249 | Perfect for both beginners starting their PyTorch journey and experts looking to deepen their understanding of advanced topics! -------------------------------------------------------------------------------- /08_transformers_and_attention_mechanisms/README.md: -------------------------------------------------------------------------------- 1 | # Transformers and Attention Mechanisms 2 | 3 | This tutorial delves into Transformers and Attention Mechanisms, pivotal concepts in modern deep learning, especially for natural language processing and beyond. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Attention Mechanisms](#introduction-to-attention-mechanisms) 7 | - What is Attention? 8 | - Types of Attention (Bahdanau, Luong) 9 | 2. [Self-Attention](#self-attention) 10 | - Concept and Motivation 11 | - Scaled Dot-Product Attention 12 | 3. [Multi-Head Attention](#multi-head-attention) 13 | - Purpose and Architecture 14 | - Implementation Details 15 | 4. [The Transformer Architecture](#the-transformer-architecture) 16 | - Encoder-Decoder Structure 17 | - Positional Encoding 18 | - Feed-Forward Networks 19 | - Layer Normalization and Residual Connections 20 | 5. [Building a Transformer Block](#building-a-transformer-block) 21 | - Encoder Block 22 | - Decoder Block 23 | 6. [Applications of Transformers](#applications-of-transformers) 24 | - Natural Language Processing (e.g., Translation, Summarization) 25 | - Vision Transformers (ViT) 26 | 7. 
[Implementing a Simple Transformer with PyTorch](#implementing-a-simple-transformer-with-pytorch) 27 | - Step-by-step guide 28 | 8. [Pre-trained Transformer Models (BERT, GPT)](#pre-trained-transformer-models-bert-gpt) 29 | - Overview of popular models 30 | - Using Hugging Face Transformers library 31 | 9. [Fine-tuning Pre-trained Transformers](#fine-tuning-pre-trained-transformers) 32 | - Concepts and techniques 33 | - Example: Text classification 34 | 35 | ## Introduction to Attention Mechanisms 36 | 37 | Attention mechanisms in deep learning are inspired by human visual attention – the ability to focus on specific parts of an image while perceiving the whole. In the context of neural networks, attention allows a model to dynamically focus on different parts of the input sequence when producing an output. 38 | 39 | - **What is Attention?** 40 | - A mechanism that allows the model to assign different weights (importance scores) to different parts of the input. 41 | - Helps in handling long sequences and capturing long-range dependencies. 42 | - **Types of Attention:** 43 | - **Bahdanau Attention (Additive Attention):** Uses a feed-forward network to compute alignment scores. 44 | - **Luong Attention (Multiplicative Attention):** Uses dot-product based alignment scores. 45 | 46 | ## Self-Attention 47 | 48 | Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It is a key component of Transformers. 49 | 50 | - **Concept and Motivation:** 51 | - Allows the model to weigh the importance of other words in the *same* sentence when encoding a particular word. 52 | - Example: "The animal didn't cross the street because **it** was too tired." Self-attention helps determine if "it" refers to "animal" or "street". 53 | - **Scaled Dot-Product Attention:** 54 | - The core of self-attention. 55 | - Queries (Q), Keys (K), and Values (V) are computed from the input embeddings. 56 | - Attention(Q, K, V) = `softmax((Q * K^T) / sqrt(d_k)) * V` 57 | - `d_k` is the dimension of the key vectors; scaling by `sqrt(d_k)` keeps large dot products from saturating the softmax, which would otherwise produce vanishingly small gradients. 58 | 59 | ## Multi-Head Attention 60 | 61 | Instead of performing a single attention function, Multi-Head Attention runs multiple attention mechanisms in parallel and concatenates their outputs. 62 | 63 | - **Purpose and Architecture:** 64 | - Allows the model to jointly attend to information from different representation subspaces at different positions. 65 | - Each "head" can learn different aspects of the input. 66 | - Input Q, K, V are linearly projected `h` times with different, learned linear projections. 67 | - Attention is applied by each head in parallel. 68 | - Outputs are concatenated and linearly projected again. 69 | 70 | ## The Transformer Architecture 71 | 72 | The Transformer model, introduced in "Attention Is All You Need," relies entirely on attention mechanisms, dispensing with recurrence and convolutions. 73 | 74 | - **Encoder-Decoder Structure:** 75 | - **Encoder:** Maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations (z1, ..., zn). Composed of a stack of N identical layers. 76 | - **Decoder:** Given z, generates an output sequence (y1, ..., ym) one symbol at a time. Also composed of a stack of N identical layers. The decoder incorporates an additional multi-head attention over the output of the encoder stack.
77 | - **Positional Encoding:** 78 | - Since Transformers contain no recurrence or convolution, positional encodings are added to the input embeddings to give the model information about the relative or absolute position of tokens in the sequence. 79 | - Sine and cosine functions of different frequencies are typically used. 80 | - **Feed-Forward Networks:** 81 | - Each layer in the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between. 82 | - **Layer Normalization and Residual Connections:** 83 | - Each sub-layer (self-attention, feed-forward network) in the encoder and decoder has a residual connection around it, followed by layer normalization. 84 | 85 | ## Building a Transformer Block 86 | 87 | - **Encoder Block:** 88 | - Multi-Head Self-Attention layer 89 | - Add & Norm (Residual Connection + Layer Normalization) 90 | - Position-wise Feed-Forward Network 91 | - Add & Norm 92 | - **Decoder Block:** 93 | - Masked Multi-Head Self-Attention layer (to prevent attending to future positions) 94 | - Add & Norm 95 | - Multi-Head Attention (over encoder output) 96 | - Add & Norm 97 | - Position-wise Feed-Forward Network 98 | - Add & Norm 99 | 100 | ## Applications of Transformers 101 | 102 | - **Natural Language Processing (NLP):** 103 | - Machine Translation (original application) 104 | - Text Summarization 105 | - Question Answering 106 | - Sentiment Analysis 107 | - Text Generation 108 | - **Vision Transformers (ViT):** 109 | - Apply Transformer architecture directly to sequences of image patches for image classification. 110 | 111 | ## Implementing a Simple Transformer with PyTorch 112 | 113 | This section will provide code examples for building the core components of a Transformer, such as Scaled Dot-Product Attention, Multi-Head Attention, Positional Encoding, and a basic Encoder-Decoder structure using PyTorch. 114 | 115 | ```python 116 | import torch 117 | import torch.nn as nn 118 | import math 119 | 120 | # Example: Scaled Dot-Product Attention 121 | class ScaledDotProductAttention(nn.Module): 122 | def forward(self, query, key, value, mask=None): 123 | dk = query.size(-1) 124 | scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(dk) 125 | if mask is not None: 126 | scores = scores.masked_fill(mask == 0, -1e9) 127 | attention = torch.softmax(scores, dim=-1) 128 | return torch.matmul(attention, value) 129 | 130 | # Further components like MultiHeadAttention, PositionalEncoding, EncoderLayer, DecoderLayer will be shown. 131 | ``` 132 | 133 | ## Pre-trained Transformer Models (BERT, GPT) 134 | 135 | - **BERT (Bidirectional Encoder Representations from Transformers):** 136 | - Developed by Google. 137 | - Designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. 138 | - Used for tasks like question answering, language inference. 139 | - **GPT (Generative Pre-trained Transformer):** 140 | - Developed by OpenAI. 141 | - Uses a decoder-only transformer architecture. 142 | - Excels at text generation tasks. 143 | - **Hugging Face Transformers Library:** 144 | - Provides thousands of pre-trained models for a wide range of tasks in NLP, vision, and audio. 145 | - Simplifies downloading and using state-of-the-art models. 
146 | 147 | ```python 148 | # Example using Hugging Face Transformers 149 | # from transformers import BertTokenizer, BertModel 150 | # tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 151 | # model = BertModel.from_pretrained('bert-base-uncased') 152 | # inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") 153 | # outputs = model(**inputs) 154 | # last_hidden_states = outputs.last_hidden_state 155 | ``` 156 | 157 | ## Fine-tuning Pre-trained Transformers 158 | 159 | - **Concepts:** Instead of training a large model from scratch, take a pre-trained model and adapt it to a specific downstream task using a smaller, task-specific dataset. 160 | - **Techniques:** 161 | - Add a task-specific layer (e.g., a classification head) on top of the pre-trained model. 162 | - Unfreeze some of the top layers of the pre-trained model and train them with the task-specific layer. 163 | - Or, unfreeze and train the entire model, but with a much smaller learning rate. 164 | 165 | ## Running the Tutorial 166 | 167 | To run the Python script associated with this tutorial: 168 | ```bash 169 | python transformers_and_attention_mechanisms.py 170 | ``` 171 | Alternatively, you can follow along with the Jupyter notebook `transformers_and_attention_mechanisms.ipynb` for an interactive experience. 172 | 173 | ## Prerequisites 174 | - Python 3.7+ 175 | - PyTorch 1.10+ 176 | - (Optionally) Hugging Face Transformers library: `pip install transformers` 177 | 178 | ## Next Steps 179 | Explore building and training a full Transformer model for a specific task, or dive deeper into the mathematics and variations of attention mechanisms. -------------------------------------------------------------------------------- /01_pytorch_basics/pytorch_basics.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | PyTorch Basics 6 | 7 | This script provides an introduction to PyTorch, covering tensors, operations, 8 | and basic computational graphs. 
9 | """ 10 | 11 | import torch 12 | import numpy as np 13 | import matplotlib.pyplot as plt 14 | 15 | # Set random seed for reproducibility 16 | torch.manual_seed(42) 17 | 18 | # Device configuration 19 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 20 | print(f"Using device: {device}") 21 | 22 | # ----------------------------------------------------------------------------- 23 | # Section 1: Introduction to PyTorch 24 | # ----------------------------------------------------------------------------- 25 | 26 | def intro_to_pytorch(): 27 | """Introduce basic PyTorch concepts and features.""" 28 | print("Introduction to PyTorch") 29 | print("-" * 50) 30 | print("PyTorch is an open-source machine learning library for Python.") 31 | print("Key features include:") 32 | print(" - Tensor computation with strong GPU acceleration") 33 | print(" - Dynamic neural networks") 34 | print(" - Automatic differentiation for deep learning") 35 | print(f"PyTorch version: {torch.__version__}") 36 | if torch.cuda.is_available(): 37 | print(f"CUDA version: {torch.version.cuda}") 38 | 39 | # ----------------------------------------------------------------------------- 40 | # Section 2: Tensors 41 | # ----------------------------------------------------------------------------- 42 | 43 | def demonstrate_tensors(): 44 | """Demonstrate tensor creation and properties.""" 45 | print("\nTensors in PyTorch") 46 | print("-" * 50) 47 | 48 | # Creating tensors 49 | tensor_1d = torch.tensor([1, 2, 3, 4, 5]) 50 | tensor_2d = torch.tensor([[1, 2, 3], [4, 5, 6]]) 51 | print("1D Tensor:", tensor_1d) 52 | print("2D Tensor:\n", tensor_2d) 53 | 54 | # Tensor properties 55 | print("\nTensor Properties:") 56 | print(f"Shape of 1D tensor: {tensor_1d.shape}") 57 | print(f"Shape of 2D tensor: {tensor_2d.shape}") 58 | print(f"Data type of 1D tensor: {tensor_1d.dtype}") 59 | print(f"Device of 1D tensor: {tensor_1d.device}") 60 | 61 | # Different initialization methods 62 | zeros_tensor = torch.zeros(3, 3) 63 | ones_tensor = torch.ones(2, 4) 64 | random_tensor = torch.randn(2, 3) 65 | print("\nInitialization Methods:") 66 | print("Zeros Tensor:\n", zeros_tensor) 67 | print("Ones Tensor:\n", ones_tensor) 68 | print("Random Tensor:\n", random_tensor) 69 | 70 | # Converting data types 71 | float_tensor = tensor_1d.float() 72 | int_tensor = tensor_1d.int() 73 | print("\nType Conversion:") 74 | print(f"Original dtype: {tensor_1d.dtype}") 75 | print(f"Float tensor dtype: {float_tensor.dtype}") 76 | print(f"Int tensor dtype: {int_tensor.dtype}") 77 | 78 | # ----------------------------------------------------------------------------- 79 | # Section 3: Tensor Operations 80 | # ----------------------------------------------------------------------------- 81 | 82 | def demonstrate_tensor_operations(): 83 | """Demonstrate various tensor operations.""" 84 | print("\nTensor Operations") 85 | print("-" * 50) 86 | 87 | # Create sample tensors 88 | a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float) 89 | b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float) 90 | 91 | # Element-wise operations 92 | add_result = a + b 93 | mul_result = a * b 94 | print("Element-wise Operations:") 95 | print("Addition:\n", add_result) 96 | print("Multiplication:\n", mul_result) 97 | 98 | # Matrix operations 99 | matmul_result = torch.matmul(a, b) 100 | transpose = a.t() 101 | print("\nMatrix Operations:") 102 | print("Matrix Multiplication:\n", matmul_result) 103 | print("Transpose of a:\n", transpose) 104 | 105 | # Reshaping 106 | reshape_result 
= a.view(4, 1) 107 | print("\nReshaping:") 108 | print("Original shape:", a.shape) 109 | print("Reshaped tensor shape:", reshape_result.shape) 110 | print("Reshaped tensor:\n", reshape_result) 111 | 112 | # Indexing 113 | element = a[0, 1] 114 | row = a[0, :] 115 | print("\nIndexing:") 116 | print("Element at [0,1]:", element) 117 | print("First row:", row) 118 | 119 | # Broadcasting 120 | scalar = torch.tensor(2.0) 121 | broadcast_result = a + scalar 122 | print("\nBroadcasting:") 123 | print("Original tensor:\n", a) 124 | print("After adding scalar 2.0:\n", broadcast_result) 125 | 126 | # ----------------------------------------------------------------------------- 127 | # Section 4: NumPy Integration 128 | # ----------------------------------------------------------------------------- 129 | 130 | def demonstrate_numpy_integration(): 131 | """Demonstrate integration between PyTorch tensors and NumPy arrays.""" 132 | print("\nNumPy Integration") 133 | print("-" * 50) 134 | 135 | # Convert NumPy array to tensor 136 | np_array = np.array([[1, 2], [3, 4]]) 137 | tensor_from_np = torch.from_numpy(np_array) 138 | print("NumPy array to Tensor:") 139 | print("NumPy array:\n", np_array) 140 | print("Tensor:\n", tensor_from_np) 141 | 142 | # Convert tensor to NumPy array 143 | tensor = torch.tensor([[5, 6], [7, 8]]) 144 | np_from_tensor = tensor.numpy() 145 | print("\nTensor to NumPy array:") 146 | print("Tensor:\n", tensor) 147 | print("NumPy array:\n", np_from_tensor) 148 | 149 | # Shared memory demonstration 150 | print("\nShared Memory Demonstration:") 151 | np_array[0, 0] = 99 152 | print("Modified NumPy array:\n", np_array) 153 | print("Tensor (shares memory):\n", tensor_from_np) 154 | 155 | # ----------------------------------------------------------------------------- 156 | # Section 5: GPU Acceleration 157 | # ----------------------------------------------------------------------------- 158 | 159 | def demonstrate_gpu_acceleration(): 160 | """Demonstrate GPU usage with PyTorch.""" 161 | print("\nGPU Acceleration") 162 | print("-" * 50) 163 | 164 | if torch.cuda.is_available(): 165 | # Create tensor on CPU 166 | cpu_tensor = torch.randn(1000, 1000) 167 | print("CPU tensor device:", cpu_tensor.device) 168 | 169 | # Move tensor to GPU 170 | gpu_tensor = cpu_tensor.to(device) 171 | print("GPU tensor device:", gpu_tensor.device) 172 | 173 | # Perform operation on GPU 174 | start_time = time.time() 175 | result_gpu = torch.matmul(gpu_tensor, gpu_tensor) 176 | gpu_time = time.time() - start_time 177 | 178 | # Perform same operation on CPU 179 | start_time = time.time() 180 | result_cpu = torch.matmul(cpu_tensor, cpu_tensor) 181 | cpu_time = time.time() - start_time 182 | 183 | print(f"Matrix multiplication time on CPU: {cpu_time:.4f} seconds") 184 | print(f"Matrix multiplication time on GPU: {gpu_time:.4f} seconds") 185 | print(f"Speedup: {cpu_time/gpu_time:.2f}x") 186 | else: 187 | print("CUDA is not available. 
GPU demonstration skipped.") 188 | print("To enable GPU acceleration, install CUDA and cuDNN.") 189 | 190 | # ----------------------------------------------------------------------------- 191 | # Section 6: Computational Graphs 192 | # ----------------------------------------------------------------------------- 193 | 194 | def demonstrate_computational_graphs(): 195 | """Demonstrate dynamic computational graphs and autograd.""" 196 | print("\nComputational Graphs") 197 | print("-" * 50) 198 | 199 | # Create tensors with gradient tracking 200 | x = torch.tensor(2.0, requires_grad=True) 201 | y = torch.tensor(3.0, requires_grad=True) 202 | 203 | # Define a simple computation 204 | z = x * y + x**2 205 | print("Forward computation: z = x * y + x^2") 206 | print(f"x = {x.item()}, y = {y.item()}") 207 | print(f"z = {z.item()}") 208 | 209 | # Compute gradients 210 | z.backward() 211 | print("\nGradients:") 212 | print(f"dz/dx = {x.grad.item()} (should be y + 2x = {y.item() + 2*x.item()})") 213 | print(f"dz/dy = {y.grad.item()} (should be x = {x.item()})") 214 | 215 | # Demonstrate a more complex graph 216 | a = torch.tensor(1.0, requires_grad=True) 217 | b = torch.tensor(2.0, requires_grad=True) 218 | c = a + b 219 | d = c * a 220 | e = d + b**2 221 | print("\nMore complex graph: e = (a + b) * a + b^2") 222 | e.backward() 223 | print("Gradients:") 224 | print(f"de/da = {a.grad.item()}") 225 | print(f"de/db = {b.grad.item()}") 226 | 227 | # ----------------------------------------------------------------------------- 228 | # Main function to run all sections 229 | # ----------------------------------------------------------------------------- 230 | 231 | import time 232 | 233 | def main(): 234 | """Main function to run all PyTorch basics tutorial sections.""" 235 | print("=" * 80) 236 | print("PyTorch Basics Tutorial") 237 | print("=" * 80) 238 | 239 | # Section 1: Introduction 240 | intro_to_pytorch() 241 | 242 | # Section 2: Tensors 243 | demonstrate_tensors() 244 | 245 | # Section 3: Tensor Operations 246 | demonstrate_tensor_operations() 247 | 248 | # Section 4: NumPy Integration 249 | demonstrate_numpy_integration() 250 | 251 | # Section 5: GPU Acceleration 252 | demonstrate_gpu_acceleration() 253 | 254 | # Section 6: Computational Graphs 255 | demonstrate_computational_graphs() 256 | 257 | print("\nTutorial complete!") 258 | 259 | if __name__ == '__main__': 260 | main() -------------------------------------------------------------------------------- /12_distributed_training/README.md: -------------------------------------------------------------------------------- 1 | # Distributed Training 2 | 3 | This tutorial covers distributed training techniques in PyTorch, enabling you to scale your models across multiple GPUs and machines for faster training and larger model capacity. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Distributed Training](#introduction-to-distributed-training) 7 | 2. [Data Parallel (DP)](#data-parallel-dp) 8 | 3. [Distributed Data Parallel (DDP)](#distributed-data-parallel-ddp) 9 | 4. [Model Parallel](#model-parallel) 10 | 5. [Pipeline Parallelism](#pipeline-parallelism) 11 | 6. [Fully Sharded Data Parallel (FSDP)](#fully-sharded-data-parallel-fsdp) 12 | 13 | ## Introduction to Distributed Training 14 | 15 | - Why distributed training? 
16 | - Types of parallelism 17 | - Communication backends 18 | - Hardware requirements 19 | 20 | ## Data Parallel (DP) 21 | 22 | - Single-machine multi-GPU training 23 | - Automatic gradient averaging 24 | - Limitations and performance considerations 25 | - When to use DP vs DDP 26 | 27 | ## Distributed Data Parallel (DDP) 28 | 29 | - Multi-GPU and multi-node training 30 | - Process groups and initialization 31 | - Gradient synchronization 32 | - Best practices for DDP 33 | 34 | ## Model Parallel 35 | 36 | - Splitting models across devices 37 | - Forward and backward pass coordination 38 | - Memory management 39 | - Use cases for very large models 40 | 41 | ## Pipeline Parallelism 42 | 43 | - Micro-batch processing 44 | - Pipeline stages 45 | - Bubble overhead optimization 46 | - Combining with data parallelism 47 | 48 | ## Fully Sharded Data Parallel (FSDP) 49 | 50 | - Sharding model parameters, gradients, and optimizer states 51 | - Memory efficiency for large models 52 | - Configuration options 53 | - Performance tuning 54 | 55 | ## Running the Tutorial 56 | 57 | To run this tutorial: 58 | 59 | ```bash 60 | # Single GPU example 61 | python distributed_training.py 62 | 63 | # Multi-GPU DDP example 64 | torchrun --nproc_per_node=2 distributed_training.py --distributed 65 | 66 | # Multi-node example 67 | torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=29500 distributed_training.py --distributed 68 | ``` 69 | 70 | Alternatively, you can follow along with the Jupyter notebook `distributed_training.ipynb` for an interactive experience. 71 | 72 | ## Prerequisites 73 | 74 | - Python 3.7+ 75 | - PyTorch 1.10+ 76 | - Multiple GPUs (for multi-GPU examples) 77 | - NCCL backend (for optimal performance) 78 | 79 | ## Related Tutorials 80 | 81 | 1. [Training Neural Networks](../04_training_neural_networks/README.md) 82 | 2. [PyTorch Lightning](../11_pytorch_lightning/README.md) 83 | 84 | ## Introduction to Distributed Training 85 | 86 | Distributed training is essential for modern deep learning, allowing you to: 87 | - **Reduce training time** by leveraging multiple GPUs/machines 88 | - **Train larger models** that don't fit on a single GPU 89 | - **Process larger batches** for better gradient estimates 90 | 91 | ### Types of Parallelism 92 | 93 | 1. **Data Parallelism**: Split data across devices, replicate model 94 | 2. **Model Parallelism**: Split model across devices 95 | 3. **Pipeline Parallelism**: Split model into stages processed sequentially 96 | 4. 
**Hybrid Approaches**: Combine multiple parallelism strategies 97 | 98 | ### Communication Backends 99 | 100 | PyTorch supports multiple backends for inter-process communication: 101 | - **NCCL** (recommended for GPUs): Optimized for NVIDIA GPUs 102 | - **Gloo**: CPU and GPU support, good for development 103 | - **MPI**: Message Passing Interface, requires separate installation 104 | 105 | ## Data Parallel (DP) 106 | 107 | DataParallel is the simplest way to use multiple GPUs on a single machine: 108 | 109 | ```python 110 | import torch 111 | import torch.nn as nn 112 | 113 | # Create model 114 | model = nn.Sequential( 115 | nn.Linear(10, 100), 116 | nn.ReLU(), 117 | nn.Linear(100, 10) 118 | ) 119 | 120 | # Wrap with DataParallel 121 | if torch.cuda.device_count() > 1: 122 | model = nn.DataParallel(model) 123 | model = model.to('cuda') 124 | 125 | # Forward pass automatically uses all GPUs 126 | input = torch.randn(32, 10).to('cuda') 127 | output = model(input) 128 | ``` 129 | 130 | ### Limitations of DP 131 | 132 | - Python GIL bottleneck 133 | - Imbalanced GPU memory usage 134 | - Lower performance compared to DDP 135 | - Single-machine only 136 | 137 | ## Distributed Data Parallel (DDP) 138 | 139 | DDP is the recommended approach for distributed training: 140 | 141 | ```python 142 | import torch 143 | import torch.distributed as dist 144 | import torch.multiprocessing as mp 145 | from torch.nn.parallel import DistributedDataParallel as DDP 146 | 147 | def setup(rank, world_size): 148 | """Initialize the distributed environment.""" 149 | dist.init_process_group("nccl", rank=rank, world_size=world_size) 150 | 151 | def cleanup(): 152 | """Clean up the distributed environment.""" 153 | dist.destroy_process_group() 154 | 155 | def train(rank, world_size): 156 | setup(rank, world_size) 157 | 158 | # Create model and move to GPU 159 | model = nn.Sequential( 160 | nn.Linear(10, 100), 161 | nn.ReLU(), 162 | nn.Linear(100, 10) 163 | ).to(rank) 164 | 165 | # Wrap with DDP 166 | ddp_model = DDP(model, device_ids=[rank]) 167 | 168 | # Create data loader with DistributedSampler 169 | dataset = YourDataset() 170 | sampler = torch.utils.data.distributed.DistributedSampler( 171 | dataset, num_replicas=world_size, rank=rank 172 | ) 173 | dataloader = torch.utils.data.DataLoader( 174 | dataset, batch_size=32, sampler=sampler 175 | ) 176 | 177 | # Training loop 178 | optimizer = torch.optim.Adam(ddp_model.parameters()) 179 | for epoch in range(num_epochs): 180 | sampler.set_epoch(epoch) # Ensure different shuffling per epoch 181 | for data, target in dataloader: 182 | optimizer.zero_grad() 183 | output = ddp_model(data.to(rank)) 184 | loss = loss_fn(output, target.to(rank)) 185 | loss.backward() 186 | optimizer.step() 187 | 188 | cleanup() 189 | 190 | # Launch distributed training 191 | if __name__ == "__main__": 192 | world_size = torch.cuda.device_count() 193 | mp.spawn(train, args=(world_size,), nprocs=world_size, join=True) 194 | ``` 195 | 196 | ### DDP Best Practices 197 | 198 | 1. **Use DistributedSampler** to ensure each process gets different data 199 | 2. **Set random seeds** per process for reproducibility 200 | 3. **Synchronize metrics** across processes when needed 201 | 4. **Save checkpoints** from only one process (usually rank 0) 202 | 5. 
**Use gradient accumulation** for large effective batch sizes
203 | 
204 | ## Model Parallel
205 | 
206 | For models too large to fit on a single GPU:
207 | 
208 | ```python
209 | class ModelParallelNet(nn.Module):
210 |     def __init__(self):
211 |         super().__init__()
212 |         # Place different parts on different GPUs
213 |         self.layer1 = nn.Linear(10, 100).to('cuda:0')
214 |         self.layer2 = nn.Linear(100, 100).to('cuda:1')
215 |         self.layer3 = nn.Linear(100, 10).to('cuda:1')
216 | 
217 |     def forward(self, x):
218 |         x = self.layer1(x.to('cuda:0'))
219 |         x = self.layer2(x.to('cuda:1'))
220 |         x = self.layer3(x)
221 |         return x
222 | ```
223 | 
224 | ### Challenges with Model Parallel
225 | 
226 | - Device idle time during forward/backward
227 | - Complex implementation for arbitrary models
228 | - Communication overhead between devices
229 | 
230 | ## Pipeline Parallelism
231 | 
232 | Pipeline parallelism addresses idle time by processing micro-batches:
233 | 
234 | ```python
235 | import torch.distributed.rpc as rpc
236 | from torch.distributed.pipeline.sync import Pipe
237 | 
238 | # Pipe relies on the RPC framework (set MASTER_ADDR/MASTER_PORT, then initialize once)
239 | rpc.init_rpc("worker", rank=0, world_size=1)
240 | 
241 | # Define a sequential model with each stage already placed on its device
242 | stage1 = nn.Sequential(nn.Linear(10, 100), nn.ReLU()).to('cuda:0')
243 | stage2 = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Linear(100, 10)).to('cuda:1')
244 | model = nn.Sequential(stage1, stage2)
245 | 
246 | # Create pipeline; each mini-batch is split into micro-batches (chunks)
247 | model = Pipe(model, chunks=8)
248 | 
249 | # Forward pass returns an RRef; fetch the result tensor
250 | output = model(input).local_value()
251 | ```
252 | 
253 | ### Pipeline Parallelism Benefits
254 | 
255 | - Better GPU utilization
256 | - Automatic micro-batch scheduling
257 | - Can combine with data parallelism
258 | - Suitable for deep networks
259 | 
260 | ## Fully Sharded Data Parallel (FSDP)
261 | 
262 | FSDP enables training of extremely large models by sharding parameters:
263 | 
264 | ```python
265 | from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
266 | from torch.distributed.fsdp.wrap import wrap
267 | 
268 | class FSDPModel(nn.Module):
269 |     def __init__(self):
270 |         super().__init__()
271 |         self.layer1 = wrap(nn.Linear(10, 100))
272 |         self.layer2 = wrap(nn.Linear(100, 100))
273 |         self.layer3 = wrap(nn.Linear(100, 10))
274 | 
275 |     def forward(self, x):
276 |         x = torch.relu(self.layer1(x))
277 |         x = torch.relu(self.layer2(x))
278 |         return self.layer3(x)
279 | 
280 | # Wrap entire model with FSDP
281 | model = FSDP(FSDPModel())
282 | 
283 | # Training works as normal
284 | optimizer = torch.optim.Adam(model.parameters())
285 | for data, target in dataloader:
286 |     optimizer.zero_grad()
287 |     output = model(data)
288 |     loss = loss_fn(output, target)
289 |     loss.backward()
290 |     optimizer.step()
291 | ```
292 | 
293 | ### FSDP Configuration
294 | 
295 | ```python
296 | from torch.distributed.fsdp import (
297 |     FullyShardedDataParallel as FSDP, CPUOffload,
298 |     MixedPrecision,
299 |     BackwardPrefetch,
300 |     ShardingStrategy,
301 | )
302 | 
303 | # Configure FSDP
304 | fsdp_config = {
305 |     "sharding_strategy": ShardingStrategy.FULL_SHARD,
306 |     "cpu_offload": CPUOffload(offload_params=True),
307 |     "mixed_precision": MixedPrecision(
308 |         param_dtype=torch.float16,
309 |         reduce_dtype=torch.float16,
310 |         buffer_dtype=torch.float16,
311 |     ),
312 |     "backward_prefetch": BackwardPrefetch.BACKWARD_PRE,
313 | }
314 | 
315 | model = FSDP(model, **fsdp_config)
316 | ```
317 | 
318 | ## Performance Optimization Tips
319 | 
320 | 1. **Profile your code** to identify bottlenecks
321 | 2. **Overlap computation and communication** when possible
322 | 3. **Use mixed precision training** for faster computation
323 | 4. 
**Tune batch sizes** for optimal GPU utilization 324 | 5. **Monitor GPU memory** and adjust accordingly 325 | 326 | ## Common Pitfalls and Solutions 327 | 328 | ### Hanging Processes 329 | - Ensure all processes execute the same number of collective operations 330 | - Use proper error handling and cleanup 331 | 332 | ### Gradient Synchronization Issues 333 | - Verify all processes have the same model architecture 334 | - Check for conditional logic that might cause divergence 335 | 336 | ### Memory Imbalance 337 | - Balance model partitioning for model parallel 338 | - Use gradient checkpointing for memory-intensive models 339 | 340 | ## Monitoring and Debugging 341 | 342 | ```python 343 | # Log only from main process 344 | if rank == 0: 345 | print(f"Epoch {epoch}, Loss: {loss.item()}") 346 | 347 | # Synchronize before timing 348 | dist.barrier() 349 | start_time = time.time() 350 | 351 | # Use distributed.all_reduce for metrics 352 | dist.all_reduce(loss, op=dist.ReduceOp.AVG) 353 | ``` 354 | 355 | ## Conclusion 356 | 357 | Distributed training is essential for modern deep learning. Key takeaways: 358 | - Use DDP for most multi-GPU scenarios 359 | - Consider FSDP for very large models 360 | - Combine strategies for optimal performance 361 | - Always profile and monitor your training 362 | 363 | The next tutorials will explore more advanced optimization techniques and deployment strategies. -------------------------------------------------------------------------------- /01_pytorch_basics/README.md: -------------------------------------------------------------------------------- 1 | # PyTorch Basics 2 | 3 | This tutorial covers the fundamental concepts of PyTorch, providing a foundation for deep learning applications. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to PyTorch](#introduction-to-pytorch) 7 | 2. [Tensors](#tensors) 8 | 3. [Tensor Operations](#tensor-operations) 9 | 4. [NumPy Integration](#numpy-integration) 10 | 5. [GPU Acceleration](#gpu-acceleration) 11 | 6. [Computational Graphs](#computational-graphs) 12 | 13 | ## Introduction to PyTorch 14 | 15 | - Overview of PyTorch as a deep learning framework 16 | - Key features and advantages 17 | - Installation and setup instructions 18 | 19 | ## Tensors 20 | 21 | - Creating tensors 22 | - Tensor types and shapes 23 | - Tensor initialization methods 24 | - Converting between data types 25 | 26 | ## Tensor Operations 27 | 28 | - Element-wise operations 29 | - Matrix operations 30 | - Reshaping and indexing 31 | - Broadcasting 32 | 33 | ## NumPy Integration 34 | 35 | - Converting between PyTorch tensors and NumPy arrays 36 | - Shared memory considerations 37 | - Practical examples of integration 38 | 39 | ## GPU Acceleration 40 | 41 | - Checking GPU availability 42 | - Moving tensors to GPU 43 | - Basic operations on GPU 44 | - Performance considerations 45 | 46 | ## Computational Graphs 47 | 48 | - Understanding dynamic computational graphs 49 | - Graph visualization 50 | - Basic autograd operations 51 | 52 | ## Running the Tutorial 53 | 54 | To run this tutorial: 55 | 56 | ```bash 57 | python pytorch_basics.py 58 | ``` 59 | 60 | Alternatively, you can follow along with the Jupyter notebook `pytorch_basics.ipynb` for an interactive experience. 61 | 62 | ## Prerequisites 63 | 64 | - Python 3.7+ 65 | - PyTorch 1.10+ 66 | 67 | ## Related Tutorials 68 | 69 | 1. [Neural Networks Fundamentals](../02_neural_networks_fundamentals/README.md) 70 | 2. 
[Automatic Differentiation](../03_automatic_differentiation/README.md) 71 | 72 | ## Introduction to PyTorch 73 | 74 | PyTorch is an open-source machine learning library developed by Facebook's AI Research lab. It provides a flexible and intuitive interface for building and training neural networks. PyTorch is known for its dynamic computational graph, which allows for more flexible model development compared to static graph frameworks. 75 | 76 | Key features of PyTorch include: 77 | - Dynamic computational graph (define-by-run) 78 | - Intuitive Python interface 79 | - Seamless integration with Python data science stack 80 | - GPU acceleration 81 | - Rich ecosystem of tools and libraries 82 | 83 | ## Tensors 84 | 85 | Tensors are the fundamental data structure in PyTorch, similar to NumPy arrays but with additional capabilities like GPU acceleration and automatic differentiation. Tensors can represent scalars, vectors, matrices, and higher-dimensional data. 86 | 87 | ### Creating Tensors 88 | 89 | ```python 90 | import torch 91 | 92 | # Create a tensor from a Python list 93 | x = torch.tensor([1, 2, 3, 4]) 94 | print(x) # tensor([1, 2, 3, 4]) 95 | 96 | # Create a 2D tensor (matrix) 97 | matrix = torch.tensor([[1, 2], [3, 4]]) 98 | print(matrix) 99 | # tensor([[1, 2], 100 | # [3, 4]]) 101 | 102 | # Create tensors with specific data types 103 | float_tensor = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32) 104 | int_tensor = torch.tensor([1, 2, 3], dtype=torch.int64) 105 | 106 | # Create tensors with specific shapes 107 | zeros = torch.zeros(3, 4) # 3x4 tensor of zeros 108 | ones = torch.ones(2, 3) # 2x3 tensor of ones 109 | rand = torch.rand(2, 2) # 2x2 tensor with random values from uniform distribution [0, 1) 110 | randn = torch.randn(2, 2) # 2x2 tensor with random values from standard normal distribution 111 | 112 | # Create a tensor with a specific range 113 | range_tensor = torch.arange(0, 10, step=1) # tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 114 | linspace = torch.linspace(0, 1, steps=5) # tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000]) 115 | 116 | # Create an identity matrix 117 | eye = torch.eye(3) # 3x3 identity matrix 118 | ``` 119 | 120 | ### Tensor Attributes 121 | 122 | ```python 123 | x = torch.randn(3, 4, 5) 124 | 125 | print(x.shape) # torch.Size([3, 4, 5]) 126 | print(x.size()) # torch.Size([3, 4, 5]) 127 | print(x.dim()) # 3 (number of dimensions) 128 | print(x.dtype) # torch.float32 129 | print(x.device) # device(type='cpu') 130 | ``` 131 | 132 | ### Tensor Indexing and Slicing 133 | 134 | ```python 135 | x = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) 136 | 137 | # Indexing 138 | print(x[0, 0]) # tensor(1) 139 | print(x[1, 2]) # tensor(6) 140 | 141 | # Slicing 142 | print(x[:, 0]) # First column: tensor([1, 4, 7]) 143 | print(x[1, :]) # Second row: tensor([4, 5, 6]) 144 | print(x[0:2, 1:3]) # Sub-matrix: tensor([[2, 3], [5, 6]]) 145 | 146 | # Advanced indexing 147 | indices = torch.tensor([0, 2]) 148 | print(x[indices]) # tensor([[1, 2, 3], [7, 8, 9]]) 149 | 150 | # Boolean indexing 151 | mask = x > 5 152 | print(mask) 153 | # tensor([[False, False, False], 154 | # [False, False, True], 155 | # [ True, True, True]]) 156 | print(x[mask]) # tensor([6, 7, 8, 9]) 157 | ``` 158 | 159 | ## Tensor Operations 160 | 161 | PyTorch provides a wide range of operations for manipulating tensors. 
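The table of contents above also lists broadcasting, which the subsections below do not demonstrate explicitly, so here is a quick, minimal illustration using standard PyTorch broadcasting rules (analogous to NumPy):

```python
import torch

# Broadcasting: a (3, 1) column tensor and a (2,) row tensor expand to a common (3, 2) shape
col = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1)
row = torch.tensor([10.0, 20.0])            # shape (2,)
print(col + row)
# tensor([[11., 21.],
#         [12., 22.],
#         [13., 23.]])
```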
162 | 
163 | ### Arithmetic Operations
164 | 
165 | ```python
166 | a = torch.tensor([1, 2, 3])
167 | b = torch.tensor([4, 5, 6])
168 | 
169 | # Addition
170 | print(a + b)            # tensor([5, 7, 9])
171 | print(torch.add(a, b))  # tensor([5, 7, 9])
172 | 
173 | # Subtraction
174 | print(a - b)            # tensor([-3, -3, -3])
175 | print(torch.sub(a, b))  # tensor([-3, -3, -3])
176 | 
177 | # Multiplication (element-wise)
178 | print(a * b)            # tensor([4, 10, 18])
179 | print(torch.mul(a, b))  # tensor([4, 10, 18])
180 | 
181 | # Division (element-wise)
182 | print(a / b)            # tensor([0.2500, 0.4000, 0.5000])
183 | print(torch.div(a, b))  # tensor([0.2500, 0.4000, 0.5000])
184 | 
185 | # In-place operations (modifies the tensor)
186 | a.add_(b)  # a becomes tensor([5, 7, 9])
187 | ```
188 | 
189 | ### Matrix Operations
190 | 
191 | ```python
192 | a = torch.tensor([[1, 2], [3, 4]])
193 | b = torch.tensor([[5, 6], [7, 8]])
194 | 
195 | # Matrix multiplication
196 | print(torch.matmul(a, b))
197 | # tensor([[19, 22],
198 | #         [43, 50]])
199 | 
200 | print(a @ b)  # @ operator for matrix multiplication
201 | # tensor([[19, 22],
202 | #         [43, 50]])
203 | 
204 | # Element-wise multiplication
205 | print(a * b)
206 | # tensor([[ 5, 12],
207 | #         [21, 32]])
208 | 
209 | # Transpose
210 | print(a.t())
211 | # tensor([[1, 3],
212 | #         [2, 4]])
213 | 
214 | # Determinant (det requires a floating-point tensor)
215 | print(torch.det(a.float()))  # tensor(-2.)
216 | 
217 | # Inverse (also requires a floating-point tensor)
218 | print(torch.inverse(a.float()))
219 | # tensor([[-2.0000,  1.0000],
220 | #         [ 1.5000, -0.5000]])
221 | ```
222 | 
223 | ### Reduction Operations
224 | 
225 | ```python
226 | x = torch.tensor([[1, 2, 3], [4, 5, 6]])
227 | 
228 | # Sum
229 | print(torch.sum(x))  # tensor(21)
230 | print(x.sum())       # tensor(21)
231 | print(x.sum(dim=0))  # Sum over dim 0 (down each column): tensor([5, 7, 9])
232 | print(x.sum(dim=1))  # Sum over dim 1 (across each row): tensor([6, 15])
233 | 
234 | # Mean
235 | print(torch.mean(x.float()))  # tensor(3.5000)
236 | print(x.float().mean())       # tensor(3.5000)
237 | 
238 | # Max and Min
239 | print(torch.max(x))  # tensor(6)
240 | print(x.max())       # tensor(6)
241 | print(x.max(dim=0))  # Max over dim 0: (values=tensor([4, 5, 6]), indices=tensor([1, 1, 1]))
242 | print(x.min())       # tensor(1)
243 | 
244 | # Product
245 | print(torch.prod(x))  # tensor(720)
246 | ```
247 | 
248 | ### Reshaping Operations
249 | 
250 | ```python
251 | x = torch.tensor([[1, 2, 3], [4, 5, 6]])
252 | 
253 | # Reshape
254 | print(x.reshape(3, 2))
255 | # tensor([[1, 2],
256 | #         [3, 4],
257 | #         [5, 6]])
258 | 
259 | # View (shares the same data with the original tensor)
260 | print(x.view(6, 1))
261 | # tensor([[1],
262 | #         [2],
263 | #         [3],
264 | #         [4],
265 | #         [5],
266 | #         [6]])
267 | 
268 | # Flatten
269 | print(x.flatten())  # tensor([1, 2, 3, 4, 5, 6])
270 | 
271 | # Permute dimensions
272 | y = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # Shape: (2, 2, 2)
273 | print(y.permute(2, 0, 1))  # Reorders dimensions; every dim is 2 here, so the shape stays (2, 2, 2) but the layout changes
274 | 
275 | # Squeeze and Unsqueeze
276 | z = torch.tensor([[[1], [2]]])  # Shape: (1, 2, 1)
277 | print(z.squeeze())   # Remove dimensions of size 1: tensor([1, 2])
278 | print(z.squeeze(0))  # Remove dimension 0 if it's size 1: tensor([[1], [2]])
279 | print(torch.unsqueeze(x, 0))  # Add dimension at position 0: shape becomes (1, 2, 3)
280 | ```
281 | 
282 | ## NumPy Integration
283 | 
284 | PyTorch provides seamless integration with NumPy, allowing you to convert between PyTorch tensors and NumPy arrays.
285 | 286 | ```python 287 | import numpy as np 288 | 289 | # Convert NumPy array to PyTorch tensor 290 | np_array = np.array([1, 2, 3]) 291 | tensor = torch.from_numpy(np_array) 292 | print(tensor) # tensor([1, 2, 3]) 293 | 294 | # Convert PyTorch tensor to NumPy array 295 | tensor = torch.tensor([4, 5, 6]) 296 | np_array = tensor.numpy() 297 | print(np_array) # array([4, 5, 6]) 298 | 299 | # Note: If the tensor is on CPU, the tensor and the NumPy array share the same memory 300 | # Changes to one will affect the other 301 | np_array = np.array([1, 2, 3]) 302 | tensor = torch.from_numpy(np_array) 303 | np_array[0] = 5 304 | print(tensor) # tensor([5, 2, 3]) 305 | 306 | # This doesn't work for tensors on GPU 307 | ``` 308 | 309 | ## GPU Acceleration 310 | 311 | One of the key features of PyTorch is its ability to leverage GPU acceleration for faster computations. 312 | 313 | ```python 314 | # Check if CUDA (NVIDIA GPU) is available 315 | print(torch.cuda.is_available()) # True if CUDA is available 316 | 317 | # Create a tensor on GPU 318 | if torch.cuda.is_available(): 319 | device = torch.device("cuda") 320 | x = torch.tensor([1, 2, 3], device=device) 321 | # or 322 | y = torch.tensor([4, 5, 6]).to(device) 323 | 324 | # Move tensor back to CPU 325 | z = y.cpu() 326 | else: 327 | device = torch.device("cpu") 328 | x = torch.tensor([1, 2, 3]) # Default is CPU 329 | 330 | # Check which device a tensor is on 331 | print(x.device) # device(type='cuda') or device(type='cpu') 332 | 333 | # Perform operations on GPU 334 | if torch.cuda.is_available(): 335 | a = torch.tensor([1, 2, 3], device=device) 336 | b = torch.tensor([4, 5, 6], device=device) 337 | c = a + b # Operation happens on GPU 338 | print(c) # tensor([5, 7, 9], device='cuda:0') 339 | ``` 340 | 341 | ## Computational Graphs 342 | 343 | PyTorch uses a dynamic computational graph, which means the graph is built on-the-fly as operations are executed. This is different from static graph frameworks where the graph is defined before execution. 344 | 345 | ```python 346 | # Create tensors with requires_grad=True to track operations 347 | x = torch.tensor(2.0, requires_grad=True) 348 | y = torch.tensor(3.0, requires_grad=True) 349 | 350 | # Build a computational graph 351 | z = x**2 + y**3 352 | 353 | # Compute gradients 354 | z.backward() 355 | 356 | # Access gradients 357 | print(x.grad) # tensor(4.) (dz/dx = 2*x = 2*2 = 4) 358 | print(y.grad) # tensor(27.) (dz/dy = 3*y^2 = 3*3^2 = 27) 359 | 360 | # Detach a tensor from the graph 361 | a = x.detach() # Creates a new tensor that shares data but doesn't require gradients 362 | ``` 363 | 364 | ### Gradient Accumulation 365 | 366 | By default, PyTorch accumulates gradients when `backward()` is called multiple times. 367 | 368 | ```python 369 | # Reset gradients 370 | x.grad.zero_() 371 | y.grad.zero_() 372 | 373 | # Compute gradients multiple times 374 | z = x**2 + y**3 375 | z.backward() 376 | print(x.grad) # tensor(4.) 377 | 378 | z = x**2 + y**3 379 | z.backward() 380 | print(x.grad) # tensor(8.) (gradients are accumulated) 381 | 382 | # To avoid accumulation, reset gradients before each backward pass 383 | x.grad.zero_() 384 | y.grad.zero_() 385 | ``` 386 | 387 | ## Conclusion 388 | 389 | This tutorial covered the basics of PyTorch, including tensors, operations, NumPy integration, GPU acceleration, and computational graphs. These concepts form the foundation for building and training neural networks with PyTorch. 
390 | 391 | In the next tutorial, we'll explore automatic differentiation and optimization in more detail. -------------------------------------------------------------------------------- /07_recurrent_neural_networks/README.md: -------------------------------------------------------------------------------- 1 | # Recurrent Neural Networks (RNNs) in PyTorch: A Comprehensive Guide 2 | 3 | This tutorial provides an in-depth guide to understanding, implementing, and applying Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) using PyTorch. These models are fundamental for processing sequential data such as text, time series, and audio. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Recurrent Neural Networks](#introduction-to-recurrent-neural-networks) 7 | - What are RNNs and Why Sequential Data? 8 | - The Concept of a Hidden State (Memory) 9 | - Basic RNN Cell Structure and Unrolling 10 | - Challenges: Vanishing and Exploding Gradients 11 | 2. [Core RNN Layer Implementations in PyTorch](#core-rnn-layer-implementations-in-pytorch) 12 | - **`nn.RNN`**: The basic Elman RNN. 13 | - Key Parameters: `input_size`, `hidden_size`, `num_layers`, `batch_first`, `bidirectional`. 14 | - Input and Output Shapes. 15 | - **`nn.LSTM` (Long Short-Term Memory)** 16 | - Addressing Vanishing Gradients with Gates (Forget, Input, Output Gates, Cell State). 17 | - Key Parameters and Shapes. 18 | - **`nn.GRU` (Gated Recurrent Unit)** 19 | - Simplified Gating Mechanism (Update, Reset Gates). 20 | - Key Parameters and Shapes. 21 | - Multi-layer (Stacked) RNNs 22 | - Bidirectional RNNs 23 | 3. [Sequence Modeling with RNNs](#sequence-modeling-with-rnns) 24 | - Many-to-One, One-to-Many, Many-to-Many Architectures (Conceptual) 25 | - **Sequence Classification (Many-to-One):** e.g., Sentiment Analysis. 26 | - Using the final hidden state or pooling outputs for classification. 27 | - Handling Variable-Length Sequences: Padding, Packing (`torch.nn.utils.rnn.pack_padded_sequence`, `pad_packed_sequence`). 28 | 4. [Application: Text Generation (Character-level RNN)](#application-text-generation-character-level-rnn) 29 | - Representing Text Data (Character Encoding). 30 | - Preparing Input-Target Sequences for Language Modeling. 31 | - Building a Character-level RNN/LSTM Model. 32 | - Training the Language Model. 33 | - Generating New Text (Sampling Strategies, Temperature). 34 | 5. [Application: Time Series Forecasting](#application-time-series-forecasting) 35 | - Preparing Time Series Data (Windowing/Sliding Windows). 36 | - Univariate vs. Multivariate Time Series. 37 | - Building an RNN/LSTM Model for Forecasting. 38 | - Sequence-to-Sequence vs. Sequence-to-Value Forecasting. 39 | 6. [Advanced RNN Techniques (Conceptual Overview)](#advanced-rnn-techniques-conceptual-overview) 40 | - **Attention Mechanisms:** Allowing the model to focus on relevant parts of the input sequence. 41 | - **Teacher Forcing:** Using ground truth outputs as inputs during training for faster convergence. 42 | - **Beam Search:** A more advanced decoding strategy for generation tasks. 43 | - **Encoder-Decoder Architecture (Seq2Seq):** For tasks like machine translation. 44 | 7. [Practical Tips for Training RNNs](#practical-tips-for-training-rnns) 45 | - Gradient Clipping to prevent exploding gradients. 46 | - Proper Initialization. 47 | - Choosing between RNN, LSTM, GRU. 48 | - Regularization (Dropout on non-recurrent connections). 
49 | 50 | ## Introduction to Recurrent Neural Networks 51 | 52 | - **What are RNNs and Why Sequential Data?** 53 | RNNs are a class of neural networks designed to recognize patterns in sequences of data, such as text, speech, time series, or genomes. Unlike feedforward networks, RNNs have loops, allowing information to persist from one step of the sequence to the next. 54 | - **The Concept of a Hidden State (Memory):** 55 | The core idea of an RNN is its hidden state, which acts as a form of memory. The hidden state at timestep `t` captures information from all previous timesteps up to `t-1`. This hidden state is updated at each step based on the current input and the previous hidden state. 56 | `h_t = f(W_hh * h_{t-1} + W_xh * x_t + b_h)` 57 | `output_t = g(W_hy * h_t + b_y)` 58 | - **Basic RNN Cell Structure and Unrolling:** An RNN can be thought of as multiple copies of the same network, each passing a message to a successor. Unrolling the RNN visualizes this chain-like structure. 59 | - **Challenges: Vanishing and Exploding Gradients:** Standard RNNs struggle to learn long-range dependencies due to the vanishing gradient problem (gradients shrink exponentially as they propagate back through time) or the exploding gradient problem (gradients grow exponentially). 60 | 61 | ## Core RNN Layer Implementations in PyTorch 62 | 63 | PyTorch provides optimized implementations for common recurrent layers. 64 | 65 | - **`nn.RNN`**: The basic Elman RNN. 66 | - **Key Parameters:** 67 | - `input_size`: The number of expected features in the input `x`. 68 | - `hidden_size`: The number of features in the hidden state `h`. 69 | - `num_layers`: Number of recurrent layers. Stacking RNNs can increase model capacity. 70 | - `nonlinearity`: `tanh` or `relu`. Default: `tanh`. 71 | - `batch_first (bool)`: If `True`, input and output tensors are provided as `(batch, seq, feature)` instead of `(seq, batch, feature)`. Default: `False`. 72 | - `dropout (float)`: If non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last layer. Default: 0. 73 | - `bidirectional (bool)`: If `True`, becomes a bidirectional RNN. Default: `False`. 74 | - **Input Shapes (if `batch_first=False`):** 75 | - `input`: `(seq_len, batch_size, input_size)` 76 | - `h_0` (initial hidden state): `(num_layers * num_directions, batch_size, hidden_size)` 77 | - **Output Shapes (if `batch_first=False`):** 78 | - `output`: `(seq_len, batch_size, num_directions * hidden_size)` (all hidden states from the last layer) 79 | - `h_n` (final hidden state): `(num_layers * num_directions, batch_size, hidden_size)` 80 | ```python 81 | import torch 82 | import torch.nn as nn 83 | 84 | # Example nn.RNN 85 | rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True) 86 | # input_tensor shape: (batch_size=5, seq_len=3, input_size=10) 87 | # input_tensor = torch.randn(5, 3, 10) 88 | # h0 shape: (num_layers*num_directions=2*1, batch_size=5, hidden_size=20) 89 | # h0 = torch.randn(2, 5, 20) 90 | # output, hn = rnn(input_tensor, h0) 91 | # print(f"RNN Output shape: {output.shape}") # (5, 3, 20) 92 | # print(f"RNN Hidden state shape: {hn.shape}") # (2, 5, 20) 93 | ``` 94 | 95 | - **`nn.LSTM` (Long Short-Term Memory)** 96 | LSTMs use a more complex cell structure with gates (input, forget, output) and a cell state (`c_t`) to better control information flow and capture long-range dependencies, mitigating vanishing gradients. 97 | - **Gates:** Sigmoid layers that control what information to keep or discard. 
98 | - **Cell State (`c_t`):** A separate memory stream that information can be added to or removed from, regulated by gates. 99 | - **Input/Output Shapes:** Similar to `nn.RNN`, but `h_0` and `h_n` are tuples `(hidden_state, cell_state)`. Each state has shape `(num_layers * num_directions, batch, hidden_size)`. 100 | ```python 101 | # lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True) 102 | # input_lstm = torch.randn(5, 3, 10) 103 | # h0_lstm = torch.randn(2, 5, 20) # Initial hidden state 104 | # c0_lstm = torch.randn(2, 5, 20) # Initial cell state 105 | # output_lstm, (hn_lstm, cn_lstm) = lstm(input_lstm, (h0_lstm, c0_lstm)) 106 | # print(f"LSTM Output shape: {output_lstm.shape}") 107 | # print(f"LSTM Hidden state shape: {hn_lstm.shape}") 108 | # print(f"LSTM Cell state shape: {cn_lstm.shape}") 109 | ``` 110 | 111 | - **`nn.GRU` (Gated Recurrent Unit)** 112 | GRUs are a simpler alternative to LSTMs, combining the cell state and hidden state. They use update and reset gates. 113 | - **Input/Output Shapes:** Same as `nn.RNN`. 114 | ```python 115 | # gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2, batch_first=True) 116 | # input_gru = torch.randn(5, 3, 10) 117 | # h0_gru = torch.randn(2, 5, 20) 118 | # output_gru, hn_gru = gru(input_gru, h0_gru) 119 | # print(f"GRU Output shape: {output_gru.shape}") 120 | # print(f"GRU Hidden state shape: {hn_gru.shape}") 121 | ``` 122 | 123 | - **Multi-layer (Stacked) RNNs:** Set `num_layers > 1`. The output of one layer becomes the input to the next. Dropout can be applied between layers. 124 | - **Bidirectional RNNs:** Set `bidirectional=True`. Processes the sequence in both forward and backward directions. The outputs are typically concatenated. Useful when context from both past and future is important. 125 | 126 | ## Sequence Modeling with RNNs 127 | 128 | - **Architectures:** RNNs can be used for various sequence tasks: 129 | - **Many-to-One:** Input sequence, single output (e.g., sentiment classification of a sentence). 130 | - **One-to-Many:** Single input, output sequence (e.g., image captioning). 131 | - **Many-to-Many (Synchronized):** Input and output sequences have same length (e.g., part-of-speech tagging). 132 | - **Many-to-Many (Delayed/Encoder-Decoder):** Input and output sequences can have different lengths (e.g., machine translation). 133 | - **Handling Variable-Length Sequences:** Real-world sequences often have different lengths. Techniques: 134 | - **Padding:** Pad shorter sequences to the length of the longest sequence in a batch using a special padding token. 135 | - **Packing (`torch.nn.utils.rnn.pack_padded_sequence`):** Before feeding padded sequences to an RNN, pack them to avoid computation on padding tokens. Use `torch.nn.utils.rnn.pad_packed_sequence` to unpack the output. 136 | 137 | ## Application: Text Generation (Character-level RNN) 138 | 139 | - **Representing Text Data:** Convert characters to numerical indices (character encoding). Create a vocabulary of all unique characters. 140 | - **Preparing Sequences:** For a sequence `s`, the input at timestep `t` is `s[t]` and the target is `s[t+1]`. The model learns to predict the next character. 141 | - **Training:** Use Cross-Entropy Loss to compare predicted character probabilities with the actual next character. 142 | - **Generating New Text:** Start with a seed character/sequence. Feed it to the model to get probabilities for the next character. Sample from this distribution (e.g., using `torch.multinomial` or `argmax`). 
Append the sampled character to the sequence and repeat. 143 | - **Temperature:** A hyperparameter to control the randomness of sampling. Higher temperature -> more random; lower temperature -> more deterministic. 144 | 145 | ## Application: Time Series Forecasting 146 | 147 | - **Preparing Data (Windowing):** Create input-output pairs by sliding a window over the time series. Input: `(x_t, x_{t+1}, ..., x_{t+N-1})`. Target: `x_{t+N}` (for one-step ahead) or `(x_{t+N}, ..., x_{t+N+M-1})` (for multi-step ahead). 148 | - **Univariate vs. Multivariate:** Forecasting a single variable vs. multiple interacting variables. 149 | - **Model Output:** Can be a single value (next step) or a sequence (multiple future steps). 150 | 151 | ## Advanced RNN Techniques (Conceptual Overview) 152 | 153 | - **Attention Mechanisms:** For long sequences, allows the model to selectively focus on important parts of the input sequence when producing an output at each timestep. Particularly useful in Seq2Seq models. 154 | - **Teacher Forcing:** During training, instead of feeding the model's own (potentially incorrect) previous prediction as input for the next step, the ground truth from the previous step is used. Helps stabilize training but can lead to exposure bias (discrepancy between training and inference). 155 | - **Beam Search:** A decoding algorithm used in generation tasks (like machine translation or text generation) that explores multiple hypotheses (beams) at each step, rather than just greedily picking the single best option. 156 | - **Encoder-Decoder Architecture (Seq2Seq):** Consists of two RNNs: an encoder that processes the input sequence into a context vector, and a decoder that generates the output sequence from this context vector. Widely used in machine translation and text summarization. 157 | 158 | ## Practical Tips for Training RNNs 159 | 160 | - **Gradient Clipping:** Crucial for RNNs/LSTMs/GRUs to prevent exploding gradients. Use `torch.nn.utils.clip_grad_norm_`. 161 | - **Initialization:** Proper weight initialization (e.g., Xavier, Kaiming, or specific heuristics for RNNs) can be important. 162 | - **Choice of Unit:** LSTMs and GRUs are generally preferred over vanilla RNNs for their ability to handle longer sequences. GRUs are simpler and sometimes faster than LSTMs with comparable performance. 163 | - **Dropout:** Apply dropout between stacked RNN layers (using the `dropout` parameter in `nn.RNN/LSTM/GRU`) or on the non-recurrent connections (e.g., before/after the RNN block or between the RNN output and fully connected layers). 164 | 165 | ## Running the Tutorial 166 | 167 | To run the Python script associated with this tutorial: 168 | ```bash 169 | python recurrent_neural_networks.py 170 | ``` 171 | This will execute demonstrations of RNN, LSTM, GRU layers, a character-level text generation example, and a time series forecasting example. 172 | 173 | ## Prerequisites 174 | - Python 3.7+ 175 | - PyTorch 1.10+ 176 | - NumPy 177 | - Matplotlib (for visualization) 178 | 179 | ## Related Tutorials 180 | 1. [Training Neural Networks](../04_training_neural_networks/README.md) 181 | 2. 
[Transformers and Attention Mechanisms](../08_transformers_and_attention_mechanisms/README.md) (Modern alternative/successor to RNNs for many sequence tasks) -------------------------------------------------------------------------------- /05_data_loading_preprocessing/README.md: -------------------------------------------------------------------------------- 1 | # Data Loading, Preprocessing, and Augmentation in PyTorch 2 | 3 | This tutorial provides a comprehensive guide to efficiently loading, preprocessing, and augmenting data in PyTorch. Effective data handling is a critical step in any machine learning pipeline, ensuring that your model receives data in the correct format and benefits from techniques that can improve generalization. 4 | 5 | ## Table of Contents 6 | 1. [Introduction: The Importance of Data Handling](#introduction-the-importance-of-data-handling) 7 | 2. [PyTorch `Dataset` Class](#pytorch-dataset-class) 8 | - Role and Purpose 9 | - Key Methods: `__init__`, `__len__`, `__getitem__` 10 | - Using Built-in Datasets (e.g., `torchvision.datasets.MNIST`, `CIFAR10`) 11 | 3. [Creating Custom `Dataset`s](#creating-custom-datasets) 12 | - For Image Data (e.g., from a folder of images, from a CSV file with paths) 13 | - For Text Data (e.g., loading text files, tokenization basics) 14 | - For Other Data Types (e.g., CSV, time series) 15 | 4. [PyTorch `DataLoader` Class](#pytorch-dataloader-class) 16 | - Purpose: Batching, Shuffling, Parallel Loading 17 | - Key Parameters: `dataset`, `batch_size`, `shuffle`, `num_workers`, `pin_memory` 18 | - Iterating Through a `DataLoader` 19 | 5. [Data Transformations (`torchvision.transforms`)](#data-transformations-torchvisiontransforms) 20 | - Common Transformations for Images: 21 | - `transforms.ToTensor()`: Converting PIL Images/NumPy arrays to Tensors. 22 | - `transforms.Normalize()`: Normalizing tensor images. 23 | - Resizing, Cropping (`transforms.Resize`, `transforms.CenterCrop`, `transforms.RandomResizedCrop`) 24 | - `transforms.Compose()`: Chaining multiple transformations. 25 | - Creating Custom Transformations 26 | 6. [Data Augmentation](#data-augmentation) 27 | - Why Augment Data? Improving Model Robustness and Generalization. 28 | - Image Augmentation Techniques (using `torchvision.transforms`): 29 | - Random Flips (`transforms.RandomHorizontalFlip`, `transforms.RandomVerticalFlip`) 30 | - Random Rotations (`transforms.RandomRotation`) 31 | - Color Jitter (`transforms.ColorJitter`) 32 | - Random Affine Transformations (`transforms.RandomAffine`) 33 | - Integrating Augmentations into the `Dataset` or `DataLoader` Flow 34 | - Advanced Augmentation Libraries (e.g., Albumentations - conceptual mention) 35 | 7. [Working with Different Data Types](#working-with-different-data-types) 36 | - **Image Data:** Loading, common formats, channel orders. 37 | - **Text Data:** Tokenization, padding, creating vocabulary, embedding lookups (conceptual). 38 | - **Tabular Data:** Loading from CSV/Pandas, feature engineering, encoding categorical features (conceptual). 39 | 8. [Efficient Data Loading Techniques](#efficient-data-loading-techniques) 40 | - `num_workers` in `DataLoader`: Parallelizing data loading. 41 | - `pin_memory=True` in `DataLoader`: Faster CPU-to-GPU data transfer. 42 | - Pre-fetching and Caching Strategies (Conceptual) 43 | - Considerations for Large Datasets that Don't Fit in Memory 44 | 9. 
[Practical Example: Image Classification Dataset](#practical-example-image-classification-dataset) 45 | - Setting up a custom image folder dataset. 46 | - Applying transformations and augmentations. 47 | - Using `DataLoader` for training. 48 | 49 | ## Introduction: The Importance of Data Handling 50 | 51 | Raw data is rarely in a format suitable for direct input into a neural network. Data loading and preprocessing involve several steps: 52 | - **Loading:** Reading data from various sources (files, databases). 53 | - **Preprocessing:** Cleaning, transforming, and structuring data (e.g., resizing images, tokenizing text, normalizing features). 54 | - **Augmentation:** Artificially expanding the dataset by creating modified versions of existing data (e.g., rotating images, paraphrasing text) to improve model generalization and reduce overfitting. 55 | Efficient data handling is crucial for training performance, as data loading can become a bottleneck if not optimized. 56 | 57 | ## PyTorch `Dataset` Class 58 | 59 | - **Role and Purpose:** `torch.utils.data.Dataset` is an abstract class representing a dataset. All datasets in PyTorch that interact with `DataLoader` should inherit from this class. 60 | - **Key Methods:** 61 | - `__init__(self, ...)`: Initializes the dataset (e.g., loads data paths, labels, performs initial setup). 62 | - `__len__(self)`: Returns the total number of samples in the dataset. 63 | - `__getitem__(self, idx)`: Loads and returns a single sample from the dataset at the given index `idx`. This is where transformations are often applied. 64 | - **Using Built-in Datasets:** `torchvision.datasets` provides many common datasets like MNIST, CIFAR10, ImageNet, which are subclasses of `Dataset`. 65 | 66 | ```python 67 | import torchvision 68 | import torchvision.transforms as transforms 69 | 70 | # Example: Using torchvision.datasets.MNIST 71 | mnist_train_raw = torchvision.datasets.MNIST(root='./data', train=True, download=True) 72 | sample_raw, label_raw = mnist_train_raw[0] 73 | print(f"MNIST raw sample type: {type(sample_raw)}, Label: {label_raw}") 74 | 75 | # Applying a transform to convert PIL Image to Tensor 76 | mnist_train_transformed = torchvision.datasets.MNIST( 77 | root='./data', 78 | train=True, 79 | download=True, 80 | transform=transforms.ToTensor() # Converts PIL Image to FloatTensor 81 | ) 82 | sample_tensor, label_tensor = mnist_train_transformed[0] 83 | print(f"MNIST transformed sample type: {type(sample_tensor)}, shape: {sample_tensor.shape}, Label: {label_tensor}") 84 | ``` 85 | 86 | ## Creating Custom `Dataset`s 87 | 88 | For most real-world applications, you'll need to create your own custom `Dataset`. 89 | 90 | - **For Image Data:** Often involves reading image files (e.g., JPEG, PNG) and their corresponding labels. 
91 | ```python 92 | from torch.utils.data import Dataset 93 | from PIL import Image # Pillow library for image manipulation 94 | import os 95 | 96 | class CustomImageDataset(Dataset): 97 | def __init__(self, img_dir, transform=None, target_transform=None): 98 | # Example: img_dir contains subfolders for each class (e.g., img_dir/cat/cat1.jpg) 99 | self.img_labels = [] # List of (image_path, class_index) 100 | self.classes = sorted(entry.name for entry in os.scandir(img_dir) if entry.is_dir()) 101 | self.class_to_idx = {cls_name: i for i, cls_name in enumerate(self.classes)} 102 | 103 | for class_name in self.classes: 104 | class_dir = os.path.join(img_dir, class_name) 105 | for img_name in os.listdir(class_dir): 106 | self.img_labels.append((os.path.join(class_dir, img_name), self.class_to_idx[class_name])) 107 | 108 | self.transform = transform 109 | self.target_transform = target_transform 110 | 111 | def __len__(self): 112 | return len(self.img_labels) 113 | 114 | def __getitem__(self, idx): 115 | img_path, label = self.img_labels[idx] 116 | image = Image.open(img_path).convert("RGB") # Ensure 3 channels 117 | if self.transform: 118 | image = self.transform(image) 119 | if self.target_transform: 120 | label = self.target_transform(label) 121 | return image, label 122 | ``` 123 | - **For Text Data:** Might involve reading lines from files, tokenizing text into numerical representations, and padding sequences. 124 | 125 | ## PyTorch `DataLoader` Class 126 | 127 | - **Purpose:** `torch.utils.data.DataLoader` takes a `Dataset` object and provides an iterable to easily access batches of data. It automates batching, shuffling, and can use multiple worker processes for parallel data loading. 128 | - **Key Parameters:** 129 | - `dataset`: The `Dataset` object from which to load the data. 130 | - `batch_size (int, optional)`: How many samples per batch to load (default: 1). 131 | - `shuffle (bool, optional)`: Set to `True` to have the data reshuffled at every epoch (default: `False`). 132 | - `num_workers (int, optional)`: How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 0). 133 | - `pin_memory (bool, optional)`: If `True`, the `DataLoader` will copy Tensors into CUDA pinned memory before returning them. Useful for faster CPU to GPU transfers. 134 | 135 | ```python 136 | from torch.utils.data import DataLoader 137 | 138 | # Assuming mnist_train_transformed is an instance of a Dataset 139 | # train_loader = DataLoader(mnist_train_transformed, batch_size=64, shuffle=True, num_workers=2) 140 | 141 | # Iterating through a DataLoader 142 | # for epoch in range(num_epochs): 143 | # for i, (inputs, labels) in enumerate(train_loader): 144 | # # inputs and labels are now batches of data 145 | # # Move to device: inputs, labels = inputs.to(device), labels.to(device) 146 | # # ... training logic ... 147 | # if i % 100 == 0: 148 | # print(f"Epoch {epoch}, Batch {i}, Input shape: {inputs.shape}") 149 | ``` 150 | 151 | ## Data Transformations (`torchvision.transforms`) 152 | 153 | `torchvision.transforms` provides common image transformations. They can be chained together using `transforms.Compose()`. 154 | 155 | - **Common Transformations:** 156 | - `transforms.ToTensor()`: Converts a PIL Image or `numpy.ndarray` (H x W x C) in the range [0, 255] to a `torch.FloatTensor` of shape (C x H x W) in the range [0.0, 1.0]. 157 | - `transforms.Normalize(mean, std)`: Normalizes a tensor image with mean and standard deviation. 
`output[channel] = (input[channel] - mean[channel]) / std[channel]`. 158 | - `transforms.Resize(size)`: Resizes the input PIL Image to the given size. 159 | - `transforms.CenterCrop(size)`: Crops the given PIL Image at the center. 160 | 161 | ```python 162 | # Example of composing transformations 163 | image_transforms = transforms.Compose([ 164 | transforms.Resize(256), 165 | transforms.CenterCrop(224), 166 | transforms.ToTensor(), 167 | transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) # ImageNet stats 168 | ]) 169 | 170 | # my_dataset = CustomImageDataset(..., transform=image_transforms) 171 | ``` 172 | 173 | ## Data Augmentation 174 | 175 | Data augmentation artificially increases the training set size by creating modified copies of its data. This helps the model become more robust to variations and reduces overfitting. 176 | 177 | - **Image Augmentation Techniques:** 178 | - `transforms.RandomHorizontalFlip(p=0.5)` 179 | - `transforms.RandomRotation(degrees)` 180 | - `transforms.ColorJitter(brightness=0, contrast=0, saturation=0, hue=0)` 181 | - `transforms.RandomResizedCrop(size)`: Crops a random part of an image and resizes it. 182 | 183 | ```python 184 | # Example augmentation pipeline for training 185 | train_transforms_augmented = transforms.Compose([ 186 | transforms.RandomResizedCrop(224), 187 | transforms.RandomHorizontalFlip(), 188 | transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1), 189 | transforms.RandomRotation(degrees=15), 190 | transforms.ToTensor(), 191 | transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) 192 | ]) 193 | # For validation/testing, typically only use non-random transformations like Resize, CenterCrop, ToTensor, Normalize. 194 | ``` 195 | 196 | ## Working with Different Data Types 197 | Conceptual overview; detailed implementations depend on the specific task. 198 | - **Image Data:** Use PIL/OpenCV for loading, `torchvision.transforms` for preprocessing/augmentation. Pay attention to channel order (e.g., RGB vs BGR) and normalization. 199 | - **Text Data:** Involves tokenization (splitting text into words/subwords), numericalization (mapping tokens to integers), padding sequences to the same length, and often using pre-trained embeddings or an `nn.Embedding` layer. 200 | - **Tabular Data:** Often loaded using Pandas. Numerical features might need scaling/normalization. Categorical features need encoding (e.g., one-hot encoding, label encoding, or embedding layers). 201 | 202 | ## Efficient Data Loading Techniques 203 | 204 | - **`num_workers > 0`:** Spawns multiple subprocesses to load data in parallel, preventing the main training process from waiting for data I/O. 205 | - **`pin_memory=True`:** If using GPUs, setting this to `True` in `DataLoader` tells PyTorch to put fetched data Tensors in pinned (page-locked) memory. This enables faster data transfer from CPU to GPU memory via Direct Memory Access (DMA). 206 | - **Caching/Pre-fetching:** For very large datasets or slow storage, caching frequently accessed data or pre-fetching next batches can help. 207 | 208 | ## Practical Example: Image Classification Dataset 209 | 210 | This section will be detailed in the accompanying Python script (`data_loading_preprocessing.py`) and Jupyter Notebook, showing an end-to-end example of loading an image dataset from folders, applying transformations, and using `DataLoader`. 
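As a preview, here is a minimal sketch of how the pieces above fit together, reusing the `CustomImageDataset` and `train_transforms_augmented` defined earlier in this tutorial (the folder path is a placeholder you would replace with your own dataset):

```python
from torch.utils.data import DataLoader

# Assumes a folder layout of train_dir/<class_name>/<image>.jpg, as expected by CustomImageDataset
train_dir = "data/train"  # placeholder path
train_dataset = CustomImageDataset(train_dir, transform=train_transforms_augmented)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=2, pin_memory=True)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([32, 3, 224, 224]) after RandomResizedCrop(224)
print(labels.shape)  # torch.Size([32])
```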
211 | 212 | ## Running the Tutorial 213 | 214 | To run the Python script associated with this tutorial: 215 | ```bash 216 | python data_loading_preprocessing.py 217 | ``` 218 | We recommend you manually create a `data_loading_preprocessing.ipynb` notebook and copy the code from the Python script into it for an interactive experience. 219 | 220 | ## Prerequisites 221 | - Python 3.7+ 222 | - PyTorch 1.10+ 223 | - Torchvision (for built-in datasets and transforms) 224 | - Pillow (PIL Fork, usually a dependency of Torchvision: `pip install Pillow`) 225 | - NumPy 226 | 227 | ## Related Tutorials 228 | 1. [PyTorch Basics](../01_pytorch_basics/README.md) 229 | 2. [Training Neural Networks](../04_training_neural_networks/README.md) 230 | 3. [Convolutional Neural Networks](../06_convolutional_neural_networks/README.md) -------------------------------------------------------------------------------- /06_convolutional_neural_networks/README.md: -------------------------------------------------------------------------------- 1 | # Convolutional Neural Networks (CNNs) in PyTorch 2 | 3 | This tutorial provides a comprehensive guide to understanding and implementing Convolutional Neural Networks (CNNs) using PyTorch. CNNs are a class of deep neural networks most commonly applied to analyzing visual imagery, but also effective for other types of data like audio and text. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Convolutional Neural Networks](#introduction-to-convolutional-neural-networks) 7 | - What are CNNs and Why Use Them for Images? 8 | - Key Concepts: Local Receptive Fields, Shared Weights, Pooling 9 | 2. [Core CNN Layers and Components](#core-cnn-layers-and-components) 10 | - **Convolutional Layers (`nn.Conv2d`)** 11 | - Kernels (Filters): Size, Stride, Padding, Dilation 12 | - Input and Output Channels 13 | - Feature Maps 14 | - 2D Convolution Operation Explained 15 | - **Activation Functions (ReLU)** 16 | - Role in CNNs 17 | - **Pooling Layers (`nn.MaxPool2d`, `nn.AvgPool2d`)** 18 | - Purpose: Down-sampling, Dimensionality Reduction, Invariance 19 | - Max Pooling vs. Average Pooling 20 | - Kernel Size and Stride 21 | - **Fully Connected Layers (`nn.Linear`)** 22 | - Role in Classification/Regression after Convolutional Base 23 | - Flattening Feature Maps 24 | - **Batch Normalization (`nn.BatchNorm2d`)** 25 | - Normalizing Activations in CNNs 26 | - **Dropout (`nn.Dropout2d`, `nn.Dropout`)** 27 | - Regularization in CNNs 28 | 3. [Building a Basic CNN Architecture](#building-a-basic-cnn-architecture) 29 | - Stacking Convolutional, Activation, and Pooling Layers 30 | - Adding Fully Connected Layers for Classification 31 | - Example CNN for MNIST or CIFAR-10 32 | 4. [Training CNNs for Image Classification](#training-cnns-for-image-classification) 33 | - Data Preparation: Image Transforms and Augmentation specific to CNNs 34 | - Loss Function (e.g., `nn.CrossEntropyLoss`) 35 | - Optimizer (e.g., Adam, SGD) 36 | - The Training Loop (Revisiting with CNN context) 37 | 5. [Understanding and Implementing Famous CNN Architectures (Conceptual Overview)](#understanding-and-implementing-famous-cnn-architectures-conceptual-overview) 38 | - **LeNet-5:** A pioneering CNN. 39 | - **AlexNet:** Deepened the architecture, used ReLUs and Dropout. 40 | - **VGGNets:** Simplicity with deeper stacks of small (3x3) convolutions. 41 | - **GoogLeNet (Inception):** Introduced Inception modules for efficiency and multi-scale processing. 
42 | - **ResNet (Residual Networks):** Introduced residual connections to train very deep networks. 43 | - (Implementation of one simple architecture like LeNet-5 will be in the .py script) 44 | 6. [Transfer Learning with Pre-trained CNN Models](#transfer-learning-with-pre-trained-cnn-models) 45 | - What is Transfer Learning? 46 | - Benefits: Reduced training time, better performance with less data. 47 | - Using Pre-trained Models from `torchvision.models` (e.g., ResNet, VGG). 48 | - **Feature Extraction:** Using the pre-trained CNN as a fixed feature extractor by freezing its weights and replacing the classifier head. 49 | - **Fine-tuning:** Unfreezing some of the later layers of the pre-trained model and training them with a smaller learning rate on the new dataset. 50 | 7. [Visualizing What CNNs Learn (Feature Visualization - Conceptual)](#visualizing-what-cnns-learn-feature-visualization---conceptual) 51 | - Understanding intermediate feature maps. 52 | - Visualizing Convolutional Filters (first layer). 53 | - Techniques like Saliency Maps, Class Activation Maps (CAM), Grad-CAM (Conceptual Overview). 54 | 8. [Practical Tips for Training CNNs](#practical-tips-for-training-cnns) 55 | - Data Augmentation is Key 56 | - Appropriate Learning Rates and Schedulers 57 | - Choosing Batch Size (considering GPU memory) 58 | - Regularization (Dropout, Weight Decay) 59 | - Monitoring Validation Performance 60 | 61 | ## Introduction to Convolutional Neural Networks 62 | 63 | - **What are CNNs and Why Use Them for Images?** 64 | CNNs are specialized neural networks designed to process data with a grid-like topology, such as images (2D grid of pixels) or audio (1D grid of time samples). They are highly effective for image-related tasks because they can automatically and adaptively learn spatial hierarchies of features from low-level edges and textures to high-level object parts and concepts. 65 | - **Key Concepts:** 66 | - **Local Receptive Fields:** Each neuron in a convolutional layer is connected to only a small region of the input volume (its local receptive field), allowing it to learn local features. 67 | - **Shared Weights (Parameter Sharing):** The same set of weights (kernel/filter) is used across different spatial locations in the input. This drastically reduces the number of parameters and makes the model equivariant to translations of features. 68 | - **Pooling:** Summarizes features in a neighborhood, providing a degree of translation invariance and reducing dimensionality. 69 | 70 | ## Core CNN Layers and Components 71 | 72 | - **Convolutional Layers (`nn.Conv2d`)** 73 | The core building block of a CNN. It performs a convolution operation, sliding a learnable filter (kernel) over the input. 74 | - **Kernels (Filters):** Small matrices of learnable parameters. Each kernel is responsible for detecting a specific feature (e.g., an edge, a texture). The depth of the kernel matches the depth (number of channels) of its input. 75 | - **Input and Output Channels:** `in_channels` is the number of channels in the input volume (e.g., 3 for RGB images). `out_channels` is the number of filters applied, determining the depth of the output feature map. 76 | - **Feature Maps:** The output of a convolutional layer. Each channel in the output feature map corresponds to the response of a specific filter across the input. 77 | - **Parameters:** 78 | - `kernel_size (int or tuple)`: Size of the filter (e.g., 3 for 3x3, (3,5) for 3x5). 
79 | - `stride (int or tuple, optional)`: Step size with which the filter slides over the input (default: 1). 80 | - `padding (int or tuple, optional)`: Amount of zero-padding added to the borders of the input (default: 0). Padding can help control the spatial size of the output feature map and preserve border information. 81 | - `dilation (int or tuple, optional)`: Spacing between kernel elements (default: 1). 82 | ```python 83 | import torch 84 | import torch.nn as nn 85 | 86 | # Example: Conv2d layer 87 | # Input: Batch of 16 images, 3 channels (RGB), 32x32 pixels 88 | # Output: 32 feature maps (output channels), spatial size depends on kernel, stride, padding 89 | conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1) 90 | # input_tensor = torch.randn(16, 3, 32, 32) # Batch, Channels, Height, Width 91 | # output_feature_map = conv1(input_tensor) 92 | # print(f"Output feature map shape: {output_feature_map.shape}") # e.g., [16, 32, 32, 32] 93 | ``` 94 | 95 | - **Activation Functions (ReLU)** 96 | Typically, a non-linear activation function like ReLU (`nn.ReLU()`) is applied element-wise after each convolutional operation to introduce non-linearity. 97 | 98 | - **Pooling Layers (`nn.MaxPool2d`, `nn.AvgPool2d`)** 99 | Reduce the spatial dimensions (height and width) of the feature maps, reducing computation and parameters, and providing a form of translation invariance. 100 | - `nn.MaxPool2d(kernel_size, stride=None)`: Selects the maximum value from each patch of the feature map covered by the pooling window. 101 | - `nn.AvgPool2d(kernel_size, stride=None)`: Computes the average value. 102 | ```python 103 | # pool = nn.MaxPool2d(kernel_size=2, stride=2) # Reduces H and W by factor of 2 104 | # pooled_output = pool(output_feature_map) # Assuming output_feature_map from conv1 105 | # print(f"Pooled output shape: {pooled_output.shape}") # e.g., [16, 32, 16, 16] 106 | ``` 107 | 108 | - **Fully Connected Layers (`nn.Linear`)** 109 | After several convolutional and pooling layers, the high-level features are typically flattened and fed into one or more fully connected layers for classification or regression. 110 | - **Flattening:** Converting the 3D feature maps (Channels x Height x Width) into a 1D vector. 111 | 112 | - **Batch Normalization (`nn.BatchNorm2d`)** 113 | Applied after convolutional layers (and before or after activation) to normalize the activations across the batch. Helps stabilize training, allows higher learning rates, and can act as a regularizer. 114 | 115 | - **Dropout (`nn.Dropout2d`, `nn.Dropout`)** 116 | `nn.Dropout2d` randomly zeros out entire channels during training. `nn.Dropout` (1D dropout) is used for fully connected layers. Helps prevent overfitting. 
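Flattening, batch normalization, and dropout are easiest to see together in a short snippet. Below is a minimal sketch (shapes and dropout rate chosen purely for illustration, not taken from the tutorial script) of a conv block that combines `nn.BatchNorm2d` and `nn.Dropout2d`, followed by flattening into a fully connected layer:

```python
import torch
import torch.nn as nn

# A small conv block: Conv -> BatchNorm -> ReLU -> Dropout2d, then flatten for a Linear layer
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),      # normalizes each of the 16 channels across the batch
    nn.ReLU(),
    nn.Dropout2d(p=0.25),    # randomly zeroes whole channels during training
)

x = torch.randn(8, 3, 32, 32)                      # batch of 8 RGB 32x32 images
features = block(x)                                # shape: [8, 16, 32, 32]
flattened = features.view(features.size(0), -1)    # shape: [8, 16*32*32]
fc = nn.Linear(16 * 32 * 32, 10)
logits = fc(flattened)                             # shape: [8, 10]
print(features.shape, flattened.shape, logits.shape)
```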
117 | 118 | ## Building a Basic CNN Architecture 119 | 120 | A typical CNN architecture pattern: 121 | `INPUT -> [[CONV -> ACT -> POOL] * N -> FLATTEN -> [FC -> ACT] * M -> FC (Output)]` 122 | 123 | ```python 124 | class SimpleCNN(nn.Module): 125 | def __init__(self, num_classes=10): 126 | super(SimpleCNN, self).__init__() 127 | self.conv_block1 = nn.Sequential( 128 | nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2), # MNIST: 1 channel 129 | nn.ReLU(), 130 | nn.MaxPool2d(kernel_size=2, stride=2) # Output: 16 x 14 x 14 (for 28x28 input) 131 | ) 132 | self.conv_block2 = nn.Sequential( 133 | nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2), 134 | nn.ReLU(), 135 | nn.MaxPool2d(kernel_size=2, stride=2) # Output: 32 x 7 x 7 136 | ) 137 | # After two max pooling layers of stride 2, a 28x28 image becomes 7x7. 138 | # So, the flattened size is 32 (channels) * 7 * 7. 139 | self.fc = nn.Linear(32 * 7 * 7, num_classes) 140 | 141 | def forward(self, x): # Input x shape: [batch_size, 1, 28, 28] for MNIST 142 | x = self.conv_block1(x) 143 | x = self.conv_block2(x) 144 | x = x.view(x.size(0), -1) # Flatten the feature maps: [batch_size, 32*7*7] 145 | x = self.fc(x) 146 | return x # Raw logits for classification 147 | 148 | # model_cnn = SimpleCNN(num_classes=10) # For MNIST (10 digits) 149 | # print(model_cnn) 150 | ``` 151 | 152 | ## Training CNNs for Image Classification 153 | 154 | Training involves the same general steps as other neural networks, but with data and augmentations tailored for images. 155 | - **Data Preparation:** Use `torchvision.transforms` for normalization, resizing, and data augmentation (random flips, rotations, crops, color jitter, etc.). 156 | - **Loss Function:** `nn.CrossEntropyLoss` is standard for multi-class image classification. 157 | - **Optimizer:** Adam or SGD with momentum are common choices. 158 | 159 | ## Understanding and Implementing Famous CNN Architectures (Conceptual Overview) 160 | 161 | - **LeNet-5:** One of the earliest successful CNNs, designed for digit recognition. 162 | - **AlexNet:** Won the ImageNet LSVRC-2012. Deeper than LeNet, used ReLU, Dropout, and data augmentation extensively. 163 | - **VGGNets:** Showed that depth is critical. Used very small (3x3) convolutional filters stacked deeply. 164 | - **GoogLeNet (Inception):** Introduced the "Inception module," which performs convolutions at multiple scales in parallel and concatenates their outputs, improving performance and computational efficiency. 165 | - **ResNet (Residual Networks):** Enabled training of extremely deep networks (hundreds of layers) by introducing "residual connections" (skip connections) that allow gradients to propagate more easily. 166 | 167 | ## Transfer Learning with Pre-trained CNN Models 168 | 169 | Leveraging models pre-trained on large datasets (like ImageNet) can significantly boost performance on smaller, related datasets. 170 | 171 | - **`torchvision.models`:** Provides access to many pre-trained models (ResNet, VGG, Inception, MobileNet, etc.). 172 | ```python 173 | import torchvision.models as models 174 | # resnet18_pretrained = models.resnet18(pretrained=True) # PyTorch < 0.13 175 | # resnet18_pretrained = models.resnet18(weights=models.ResNet18_Weights.DEFAULT) # PyTorch >= 0.13 176 | ``` 177 | - **Feature Extraction:** Freeze the weights of the convolutional base of the pre-trained model and replace its final classification layer with a new one suited to your task. Train only the new classifier. 
178 | - **Fine-tuning:** Unfreeze some of the top layers of the pre-trained model in addition to training the new classifier. Use a small learning rate to avoid catastrophically forgetting the learned features. 179 | 180 | ## Visualizing What CNNs Learn (Feature Visualization - Conceptual) 181 | 182 | Understanding the internal workings of CNNs can be aided by visualizing: 183 | - **Filters:** Especially in the first layer, filters often learn to detect simple patterns like edges, corners, and color blobs. 184 | - **Feature Maps (Activations):** Show which regions of an image activate certain filters/channels at different layers, revealing the hierarchical feature extraction process. 185 | - **Saliency Maps/Class Activation Maps (CAM/Grad-CAM):** Highlight the image regions most influential in a model's prediction for a specific class. 186 | 187 | ## Practical Tips for Training CNNs 188 | - Start with a standard architecture (e.g., ResNet variant) and pre-trained weights if applicable. 189 | - Aggressive data augmentation is often very beneficial. 190 | - Use appropriate learning rates, often starting higher and decaying (e.g., with a scheduler). 191 | - Batch Normalization is generally helpful. 192 | - Monitor training and validation metrics closely. 193 | 194 | ## Running the Tutorial 195 | 196 | To run the Python script associated with this tutorial: 197 | ```bash 198 | python convolutional_neural_networks.py 199 | ``` 200 | We recommend you manually create a `convolutional_neural_networks.ipynb` notebook and copy the code from the Python script into it for an interactive experience. 201 | 202 | ## Prerequisites 203 | - Python 3.7+ 204 | - PyTorch 1.10+ 205 | - Torchvision 206 | - NumPy 207 | - Matplotlib (for visualization) 208 | 209 | ## Related Tutorials 210 | 1. [Data Loading and Preprocessing](../05_data_loading_preprocessing/README.md) 211 | 2. [Training Neural Networks](../04_training_neural_networks/README.md) 212 | 3. [Recurrent Neural Networks](../07_recurrent_neural_networks/README.md) (for sequence data) -------------------------------------------------------------------------------- /02_neural_networks_fundamentals/README.md: -------------------------------------------------------------------------------- 1 | # Neural Networks Fundamentals in PyTorch 2 | 3 | This tutorial provides a comprehensive introduction to the fundamental concepts of neural networks and their implementation using PyTorch. We will cover the building blocks of neural networks, how they learn, and how to construct your first neural network. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Neural Networks](#introduction-to-neural-networks) 7 | - What is a Neural Network? 8 | - Biological Inspiration 9 | - Basic Components: Neurons, Weights, Biases, Layers 10 | - Types of Neural Networks (Brief Overview) 11 | 2. [The Perceptron: The Simplest Neural Network](#the-perceptron-the-simplest-neural-network) 12 | - Single-Layer Perceptron 13 | - Linear Separability 14 | 3. [Activation Functions](#activation-functions) 15 | - Purpose: Introducing Non-linearity 16 | - Common Activation Functions: 17 | - Sigmoid 18 | - Tanh (Hyperbolic Tangent) 19 | - ReLU (Rectified Linear Unit) and its variants (Leaky ReLU, ELU) 20 | - Softmax (for output layers in classification) 21 | - Choosing an Activation Function 22 | - PyTorch Implementation 23 | 4. 
[Multi-Layer Perceptrons (MLPs)](#multi-layer-perceptrons-mlps) 24 | - Architecture: Input, Hidden, and Output Layers 25 | - The Power of Hidden Layers: Universal Approximation Theorem (Concept) 26 | - Forward Propagation in an MLP 27 | 5. [Defining a Neural Network in PyTorch (`nn.Module`)](#defining-a-neural-network-in-pytorch-nnmodule) 28 | - The `nn.Module` Class 29 | - Defining Layers (`nn.Linear`, etc.) 30 | - Implementing the `forward` method 31 | - Example: A Simple MLP for Classification 32 | 6. [Loss Functions: Measuring Model Error](#loss-functions-measuring-model-error) 33 | - Purpose of Loss Functions 34 | - Common Loss Functions: 35 | - Mean Squared Error (MSE) (`nn.MSELoss`): For Regression 36 | - Cross-Entropy Loss (`nn.CrossEntropyLoss`): For Multi-class Classification 37 | - Binary Cross-Entropy Loss (`nn.BCELoss`, `nn.BCEWithLogitsLoss`): For Binary Classification 38 | - Choosing the Right Loss Function 39 | 7. [Optimizers: How Neural Networks Learn](#optimizers-how-neural-networks-learn) 40 | - Gradient Descent (Concept) 41 | - Stochastic Gradient Descent (SGD) 42 | - SGD with Momentum 43 | - Adam Optimizer (`torch.optim.Adam`) 44 | - Learning Rate 45 | - Linking Optimizers to Model Parameters 46 | 8. [The Training Loop: Forward and Backward Propagation](#the-training-loop-forward-and-backward-propagation) 47 | - Overview of the Training Process 48 | - **Forward Propagation:** Calculating Predictions and Loss 49 | - **Backward Propagation (Backpropagation):** Calculating Gradients (`loss.backward()`) 50 | - **Optimizer Step:** Updating Weights (`optimizer.step()`) 51 | - Zeroing Gradients (`optimizer.zero_grad()`) 52 | - Iterating over Data (Epochs and Batches) 53 | 9. [Building and Training Your First Neural Network in PyTorch](#building-and-training-your-first-neural-network-in-pytorch) 54 | - Step 1: Prepare the Data (e.g., a simple synthetic dataset) 55 | - Step 2: Define the Model (using `nn.Module`) 56 | - Step 3: Define Loss Function and Optimizer 57 | - Step 4: Implement the Training Loop 58 | - Step 5: Evaluate the Model (Conceptual) 59 | 60 | ## Introduction to Neural Networks 61 | 62 | - **What is a Neural Network?** 63 | An Artificial Neural Network (ANN) is a computational model inspired by the structure and function of biological neural networks in the human brain. It consists of interconnected processing units called neurons (or nodes) organized in layers. 64 | - **Biological Inspiration:** Neurons in the brain receive signals, process them, and transmit signals to other neurons. ANNs attempt to mimic this behavior mathematically. 65 | - **Basic Components:** 66 | - **Neurons (Nodes):** Basic computational units that receive inputs, perform a calculation (typically a weighted sum followed by an activation function), and produce an output. 67 | - **Weights:** Parameters associated with each input to a neuron, representing the strength or importance of that input. 68 | - **Biases:** Additional parameters added to the weighted sum, allowing the neuron to be activated even when all inputs are zero, or shifting the activation function. 69 | - **Layers:** Neurons are organized into layers: an input layer, one or more hidden layers, and an output layer. 70 | - **Types of Neural Networks:** Feedforward Neural Networks (FNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transformers, etc. This tutorial focuses on FNNs (specifically MLPs). 
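To make these components concrete, here is a minimal sketch (with arbitrarily chosen values) of a single artificial neuron computed by hand: a weighted sum of the inputs plus a bias, passed through a sigmoid activation.

```python
import torch

# A single neuron: weighted sum of inputs, plus bias, then activation
inputs = torch.tensor([0.5, -1.0, 2.0])           # 3 input features
weights = torch.tensor([0.1, 0.4, -0.2])          # one weight per input
bias = torch.tensor(0.3)

weighted_sum = torch.dot(weights, inputs) + bias  # z = w.x + b
output = torch.sigmoid(weighted_sum)              # activation squashes z into (0, 1)
print(f"Weighted sum: {weighted_sum.item():.3f}, Neuron output: {output.item():.3f}")
```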
71 | 72 | ## The Perceptron: The Simplest Neural Network 73 | 74 | - **Single-Layer Perceptron:** The simplest form of a neural network, consisting of a single layer of output neurons. Inputs are fed directly to the outputs via a series of weights. It performs a weighted sum of inputs and applies an activation function (often a step function). 75 | `output = activation(sum(weights_i * input_i) + bias)` 76 | - **Linear Separability:** A single-layer perceptron can only solve linearly separable problems. 77 | 78 | ## Activation Functions 79 | 80 | - **Purpose:** Activation functions introduce non-linearity into the network. Without non-linearity, a multi-layer network would behave like a single-layer linear network, severely limiting its ability to model complex relationships. 81 | - **Common Activation Functions:** 82 | - **Sigmoid:** `f(x) = 1 / (1 + exp(-x))`. Squashes values between 0 and 1. Used in older networks, can suffer from vanishing gradients. 83 | - **Tanh (Hyperbolic Tangent):** `f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))`. Squashes values between -1 and 1. Also prone to vanishing gradients but often preferred over sigmoid in hidden layers as it's zero-centered. 84 | - **ReLU (Rectified Linear Unit):** `f(x) = max(0, x)`. Computationally efficient, helps alleviate vanishing gradients. Most popular choice for hidden layers. 85 | - **Leaky ReLU:** `f(x) = max(0.01*x, x)`. Addresses the "dying ReLU" problem by allowing a small, non-zero gradient when the unit is not active. 86 | - **Softmax:** `f(x_i) = exp(x_i) / sum(exp(x_j))`. Used in the output layer of multi-class classification tasks to convert raw scores (logits) into probabilities that sum to 1. 87 | 88 | ```python 89 | import torch 90 | import torch.nn as nn 91 | import torch.nn.functional as F 92 | 93 | # Examples of activation functions 94 | sigmoid = nn.Sigmoid() 95 | relu = nn.ReLU() 96 | tanh = nn.Tanh() 97 | softmax = nn.Softmax(dim=1) # Apply softmax across a specific dimension 98 | 99 | input_tensor = torch.randn(2, 3) # Batch of 2, 3 features each 100 | print("Input:\n", input_tensor) 101 | print("Sigmoid output:\n", sigmoid(input_tensor)) 102 | print("ReLU output:\n", relu(input_tensor)) 103 | print("Tanh output:\n", tanh(input_tensor)) 104 | # For softmax, let's assume these are logits for 2 samples, 3 classes 105 | print("Softmax output:\n", softmax(input_tensor)) 106 | ``` 107 | 108 | ## Multi-Layer Perceptrons (MLPs) 109 | 110 | MLPs are feedforward neural networks with one or more hidden layers between the input and output layers. Each layer is fully connected to the next. 111 | 112 | - **Architecture:** 113 | - **Input Layer:** Receives the raw input data. 114 | - **Hidden Layer(s):** Perform intermediate computations. The number of hidden layers and neurons per layer are hyperparameters. 115 | - **Output Layer:** Produces the final prediction. 116 | - **Universal Approximation Theorem:** (Conceptual) An MLP with at least one hidden layer and a non-linear activation function can approximate any continuous function to an arbitrary degree of accuracy, given enough neurons. 117 | - **Forward Propagation:** The process of passing input data through the network layer by layer to compute the output. 118 | `h1 = activation1(W1*x + b1)` 119 | `h2 = activation2(W2*h1 + b2)` 120 | `output = activation_out(W_out*h2 + b_out)` 121 | 122 | ## Defining a Neural Network in PyTorch (`nn.Module`) 123 | 124 | PyTorch provides the `nn.Module` class as a base for all neural network modules. 
125 | 126 | - **The `nn.Module` Class:** 127 | - Your custom network should inherit from `nn.Module`. 128 | - Layers are defined as attributes in the `__init__` method. 129 | - The `forward` method defines how input data flows through the network. 130 | - **Defining Layers:** PyTorch offers various predefined layers in `torch.nn`: 131 | - `nn.Linear(in_features, out_features)`: Applies a linear transformation (fully connected layer). 132 | - `nn.Conv2d`, `nn.RNN`, etc., for other network types. 133 | 134 | ```python 135 | class SimpleMLP(nn.Module): 136 | def __init__(self, input_size, hidden_size, num_classes): 137 | super(SimpleMLP, self).__init__() 138 | self.fc1 = nn.Linear(input_size, hidden_size) # Input layer to hidden layer 139 | self.relu = nn.ReLU() # Activation function 140 | self.fc2 = nn.Linear(hidden_size, num_classes) # Hidden layer to output layer 141 | 142 | def forward(self, x): 143 | # x is the input tensor 144 | out = self.fc1(x) 145 | out = self.relu(out) 146 | out = self.fc2(out) 147 | # No softmax here if using nn.CrossEntropyLoss, as it combines Softmax and NLLLoss 148 | return out 149 | 150 | # Example usage 151 | input_dim = 784 # e.g., for flattened 28x28 MNIST images 152 | hidden_dim = 128 153 | output_dim = 10 # e.g., for 10 digit classes 154 | model_mlp = SimpleMLP(input_dim, hidden_dim, output_dim) 155 | print(model_mlp) 156 | ``` 157 | 158 | ## Loss Functions: Measuring Model Error 159 | 160 | Loss functions (or cost functions) quantify how far the model's predictions are from the actual target values. 161 | 162 | - **Common Loss Functions:** 163 | - **`nn.MSELoss` (Mean Squared Error):** For regression tasks. `loss = (1/N) * sum((y_true - y_pred)^2)`. 164 | - **`nn.CrossEntropyLoss`:** For multi-class classification. It conveniently combines `nn.LogSoftmax` and `nn.NLLLoss`. Expects raw logits as model output. 165 | - **`nn.BCELoss` (Binary Cross-Entropy Loss):** For binary classification. Expects model output to be probabilities (after a Sigmoid activation). 166 | - **`nn.BCEWithLogitsLoss`:** For binary classification. More numerically stable than `nn.BCELoss` as it combines Sigmoid and BCE. Expects raw logits. 167 | 168 | ```python 169 | # Example Loss Functions 170 | loss_mse = nn.MSELoss() 171 | loss_ce = nn.CrossEntropyLoss() 172 | loss_bce_logits = nn.BCEWithLogitsLoss() 173 | 174 | # For MSE (Regression) 175 | predictions_reg = torch.randn(5, 1) # 5 samples, 1 output value 176 | targets_reg = torch.randn(5, 1) 177 | mse = loss_mse(predictions_reg, targets_reg) 178 | print(f"MSE Loss: {mse.item()}") 179 | 180 | # For CrossEntropy (Multi-class classification) 181 | predictions_mc = torch.randn(5, 3) # 5 samples, 3 classes (logits) 182 | targets_mc = torch.tensor([0, 1, 2, 0, 1]) # True class indices 183 | ce = loss_ce(predictions_mc, targets_mc) 184 | print(f"CrossEntropy Loss: {ce.item()}") 185 | 186 | # For BCEWithLogits (Binary classification) 187 | predictions_bc = torch.randn(5, 1) # 5 samples, 1 output logit 188 | targets_bc = torch.rand(5, 1) # True probabilities (or 0s and 1s) 189 | bce_wl = loss_bce_logits(predictions_bc, targets_bc) 190 | print(f"BCEWithLogits Loss: {bce_wl.item()}") 191 | ``` 192 | 193 | ## Optimizers: How Neural Networks Learn 194 | 195 | Optimizers implement algorithms to update the model's weights based on the gradients computed during backpropagation, aiming to minimize the loss function. 196 | 197 | - **Gradient Descent:** Iteratively moves in the direction opposite to the gradient of the loss function. 
198 | - **Stochastic Gradient Descent (SGD):** Uses a single training example or a small batch to compute the gradient and update weights, making it faster and often able to escape local minima. 199 | `optimizer = torch.optim.SGD(model.parameters(), lr=0.01)` 200 | - **SGD with Momentum:** Adds a fraction of the previous update vector to the current one, helping accelerate SGD in the relevant direction and dampening oscillations. 201 | `optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)` 202 | - **Adam (Adaptive Moment Estimation):** An adaptive learning rate optimization algorithm that computes individual learning rates for different parameters. Often a good default choice. 203 | `optimizer = torch.optim.Adam(model.parameters(), lr=0.001)` 204 | - **Learning Rate (lr):** A crucial hyperparameter that controls the step size during weight updates. 205 | 206 | ## The Training Loop: Forward and Backward Propagation 207 | 208 | The core process of training a neural network involves repeatedly feeding data to the model and adjusting its weights. 209 | 210 | - **Forward Propagation:** Input data is passed through the network to generate predictions. The loss function then compares these predictions to the true targets to compute the loss. 211 | `outputs = model(inputs)` 212 | `loss = criterion(outputs, labels)` 213 | - **Backward Propagation (Backpropagation):** The `loss.backward()` call computes the gradients of the loss with respect to all model parameters (weights and biases) that have `requires_grad=True`. 214 | - **Optimizer Step:** The `optimizer.step()` call updates the model parameters using the computed gradients and the optimizer's update rule (e.g., SGD, Adam). 215 | - **Zeroing Gradients:** Before each `loss.backward()` call in a new iteration, it's crucial to clear old gradients using `optimizer.zero_grad()`. Otherwise, gradients would accumulate across iterations. 216 | - **Epochs and Batches:** 217 | - **Epoch:** One complete pass through the entire training dataset. 218 | - **Batch:** A subset of the training dataset processed in one iteration of the training loop. 219 | 220 | ## Building and Training Your First Neural Network in PyTorch 221 | 222 | This section will be detailed in the accompanying Python script (`neural_networks_fundamentals.py`) and Jupyter Notebook, showing a complete end-to-end example. 223 | 224 | **Conceptual Steps:** 225 | 1. **Prepare Data:** Load and preprocess your dataset. PyTorch uses `Dataset` and `DataLoader` classes. 226 | 2. **Define Model:** Create your neural network class inheriting from `nn.Module`. 227 | 3. **Define Loss and Optimizer:** Instantiate your chosen loss function and optimizer, linking the optimizer to your model's parameters. 228 | 4. **Training Loop:** 229 | ```python 230 | # num_epochs = ... 231 | # for epoch in range(num_epochs): 232 | # for i, (inputs, labels) in enumerate(train_loader): 233 | # # Move tensors to the configured device (CPU/GPU) 234 | # inputs = inputs.to(device) 235 | # labels = labels.to(device) 236 | # 237 | # # Forward pass 238 | # outputs = model(inputs) 239 | # loss = criterion(outputs, labels) 240 | # 241 | # # Backward and optimize 242 | # optimizer.zero_grad() 243 | # loss.backward() 244 | # optimizer.step() 245 | # 246 | # if (i+1) % 100 == 0: 247 | # print (f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}') 248 | ``` 249 | 5. **Evaluate Model:** Assess performance on a separate test dataset. 
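For step 5, a minimal evaluation sketch could look like the following, assuming the `model` and `device` from the training loop above and a `test_loader` built analogously to `train_loader`. Gradient tracking is disabled and classification accuracy is computed:

```python
model.eval()                      # switch to evaluation mode (affects Dropout/BatchNorm)
correct, total = 0, 0
with torch.no_grad():             # no gradients needed for evaluation
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        predicted = outputs.argmax(dim=1)          # class with the highest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f"Test accuracy: {100 * correct / total:.2f}%")
```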
250 | 251 | ## Running the Tutorial 252 | 253 | To run the Python script associated with this tutorial: 254 | ```bash 255 | python neural_networks_fundamentals.py 256 | ``` 257 | Alternatively, you can follow along with the Jupyter notebook `neural_networks_fundamentals.ipynb` for an interactive experience. We recommend manually creating the notebook and copying code from the script if direct creation fails. 258 | 259 | ## Prerequisites 260 | - Python 3.7+ 261 | - PyTorch 1.10+ 262 | - NumPy 263 | - Matplotlib (for visualization) 264 | - Scikit-learn (for generating sample data or splitting) 265 | 266 | ## Related Tutorials 267 | 1. [PyTorch Basics](../01_pytorch_basics/README.md) 268 | 2. [Automatic Differentiation](../03_automatic_differentiation/README.md) 269 | 3. [Training Neural Networks](../04_training_neural_networks/README.md) -------------------------------------------------------------------------------- /03_automatic_differentiation/README.md: -------------------------------------------------------------------------------- 1 | # Automatic Differentiation with PyTorch Autograd 2 | 3 | This tutorial provides a detailed explanation of PyTorch's automatic differentiation system, known as Autograd. Understanding Autograd is crucial for training neural networks as it automates the computation of gradients. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Automatic Differentiation](#introduction-to-automatic-differentiation) 7 | - What is Differentiation? 8 | - Manual vs. Symbolic vs. Automatic Differentiation 9 | - Why Automatic Differentiation for Deep Learning? 10 | 2. [PyTorch Autograd: The Basics](#pytorch-autograd-the-basics) 11 | - Tensors and `requires_grad` 12 | - The `grad_fn` (Gradient Function) 13 | - Computing Gradients: `backward()` 14 | - Accessing Gradients: `.grad` attribute 15 | 3. [The Computational Graph](#the-computational-graph) 16 | - Dynamic Computational Graphs in PyTorch 17 | - How Autograd Constructs the Graph 18 | - Nodes and Edges: Tensors and Operations 19 | - Leaf Nodes vs. Non-Leaf Nodes 20 | 4. [Gradient Accumulation](#gradient-accumulation) 21 | - How Gradients Accumulate by Default 22 | - Zeroing Gradients: `optimizer.zero_grad()` or `tensor.grad.zero_()` 23 | - Use Cases for Gradient Accumulation (e.g., simulating larger batch sizes) 24 | 5. [Excluding Tensors from Autograd (`torch.no_grad()`, `detach()`)](#excluding-tensors-from-autograd-torchnograd-detach) 25 | - `torch.no_grad()`: Context manager to disable gradient computation. 26 | - `.detach()`: Creates a new tensor that shares the same data but is detached from the computation history. 27 | - Use cases: Inference, freezing layers, modifying tensors without tracking. 28 | 6. [Gradients of Non-Scalar Outputs (Vector-Jacobian Product)](#gradients-of-non-scalar-outputs-vector-jacobian-product) 29 | - `backward()` on a non-scalar tensor requires a `gradient` argument. 30 | - Understanding the Vector-Jacobian Product (JVP) concept. 31 | - Practical examples. 32 | 7. [Higher-Order Derivatives](#higher-order-derivatives) 33 | - Computing gradients of gradients. 34 | - Using `torch.autograd.grad()` for more control. 35 | - `create_graph=True` in `backward()` or `torch.autograd.grad()`. 36 | 8. [In-place Operations and Autograd](#in-place-operations-and-autograd) 37 | - Potential issues with in-place operations (ending with `_`). 38 | - Autograd's need for original values for gradient computation. 39 | - When they might be problematic and when they are safe. 40 | 9. 
[Custom Autograd Functions (`torch.autograd.Function`)](#custom-autograd-functions-torchautogradfunction) 41 | - When to use: Implementing novel operations, non-PyTorch computations. 42 | - Subclassing `torch.autograd.Function`. 43 | - Defining `forward()` and `backward()` static methods. 44 | - `ctx` (context) object for saving tensors for backward pass. 45 | - Example: A custom ReLU or a simple custom operation. 46 | 10. [Practical Considerations and Tips](#practical-considerations-and-tips) 47 | - Checking if a tensor requires gradients: `tensor.requires_grad`. 48 | - Checking if a tensor is a leaf tensor: `tensor.is_leaf`. 49 | - Memory usage: Autograd stores intermediate values for backward pass. 50 | - `retain_graph=True` in `backward()`: When needed and its implications. 51 | 52 | ## Introduction to Automatic Differentiation 53 | 54 | - **What is Differentiation?** Finding the rate of change of a function with respect to its input variables (i.e., its derivatives or gradients). 55 | - **Manual Differentiation:** Deriving gradients by hand. Tedious and error-prone for complex functions like neural networks. 56 | - **Symbolic Differentiation:** Using computer algebra systems to manipulate mathematical expressions and find derivatives (e.g., Wolfram Alpha, SymPy). Can lead to complex and inefficient expressions. 57 | - **Automatic Differentiation (AD):** A set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD decomposes the computation into a sequence of elementary operations (addition, multiplication, sin, exp, etc.) and applies the chain rule repeatedly. 58 | - **Reverse Mode AD:** What PyTorch uses. Computes gradients by traversing the computational graph backward from output to input. Efficient for functions with many inputs and few outputs (like neural network loss functions). 59 | - **Why AD for Deep Learning?** Neural networks are complex functions with millions of parameters. AD (specifically reverse mode) provides an efficient and accurate way to compute the gradients of the loss function with respect to all these parameters, which is essential for gradient-based optimization (like SGD). 60 | 61 | ## PyTorch Autograd: The Basics 62 | 63 | PyTorch's `autograd` package provides automatic differentiation for all operations on Tensors. 64 | 65 | - **Tensors and `requires_grad`:** 66 | - If a `Tensor` has its `requires_grad` attribute set to `True`, PyTorch tracks all operations on it. This is typically done for learnable parameters (weights, biases) or tensors that are part of a computation leading to a value for which gradients are needed. 67 | - You can set `requires_grad=True` when creating a tensor or later using `tensor.requires_grad_(True)` (in-place). 68 | - **The `grad_fn` (Gradient Function):** 69 | - When an operation is performed on tensors that require gradients, the resulting tensor will have a `grad_fn` attribute. This function knows how to compute the gradient of that operation during the backward pass. 70 | - Leaf tensors (created by the user, not as a result of an operation) with `requires_grad=True` will have `grad_fn=None` initially, but their `.grad` attribute will be populated after `backward()`. 71 | - **Computing Gradients: `backward()`:** 72 | - To compute gradients, you call `.backward()` on a scalar tensor (e.g., the loss). If the tensor is non-scalar, you need to provide a `gradient` argument (see Section 6). 
73 | - This initiates the backward pass, computing gradients for all tensors in the computational graph that have `requires_grad=True`. 74 | - **Accessing Gradients: `.grad` attribute:** 75 | - After `loss.backward()` is called, the gradients are accumulated in the `.grad` attribute of the leaf tensors (those for which `requires_grad=True`). 76 | 77 | ```python 78 | import torch 79 | 80 | # Example 1: Basic gradient computation 81 | x = torch.tensor(2.0, requires_grad=True) 82 | y = torch.tensor(3.0, requires_grad=True) 83 | z = x**2 + y**3 # z = 2^2 + 3^3 = 4 + 27 = 31 84 | 85 | # Compute gradients 86 | z.backward() # Computes dz/dx and dz/dy 87 | 88 | print(f"x: {x}, Gradient dz/dx: {x.grad}") # dz/dx = 2*x = 2*2 = 4 89 | print(f"y: {y}, Gradient dz/dy: {y.grad}") # dz/dy = 3*y^2 = 3*3^2 = 27 90 | 91 | # grad_fn example 92 | print(f"z.grad_fn: {z.grad_fn}") # Should show 93 | print(f"x.grad_fn: {x.grad_fn}") # Leaf tensor, no grad_fn from previous op 94 | ``` 95 | 96 | ## The Computational Graph 97 | 98 | - **Dynamic Computational Graphs:** PyTorch builds the computational graph on-the-fly as operations are executed (define-by-run). This allows for more flexibility in model architecture (e.g., using standard Python control flow like loops and conditionals). 99 | - **How Autograd Constructs the Graph:** Each operation on tensors with `requires_grad=True` creates a new node in the graph. Tensors are nodes, and operations (`grad_fn`) are edges that define how to compute gradients. 100 | - **Leaf Nodes:** Tensors created directly by the user (e.g., `torch.tensor(...)`, model parameters). Their gradients are accumulated in `.grad`. 101 | - **Non-Leaf Nodes (Intermediate Tensors):** Tensors resulting from operations. They have a `grad_fn`. By default, their gradients are not saved to save memory, but can be retained using `tensor.retain_grad()`. 102 | 103 | ## Gradient Accumulation 104 | 105 | - **How Gradients Accumulate:** When `backward()` is called multiple times (e.g., in a loop without zeroing gradients), gradients are summed (accumulated) in the `.grad` attribute of leaf tensors. 106 | - **Zeroing Gradients:** It's crucial to zero out gradients before each new backward pass in a typical training loop using `optimizer.zero_grad()` or by manually setting `tensor.grad.zero_()` for each parameter. Otherwise, gradients from previous batches/iterations will interfere. 107 | - **Use Cases for Accumulation:** Deliberate gradient accumulation can be used to simulate a larger effective batch size when GPU memory is limited. You perform several forward/backward passes accumulating gradients and then perform an optimizer step. 108 | 109 | ```python 110 | x = torch.tensor(1.0, requires_grad=True) 111 | y1 = x * 2 112 | y2 = x * 3 113 | 114 | # First backward pass 115 | y1.backward(retain_graph=True) # retain_graph needed if y2.backward() follows on same graph portion 116 | print(f"After y1.backward(), x.grad: {x.grad}") # dy1/dx = 2 117 | 118 | # Second backward pass (gradients accumulate) 119 | y2.backward() 120 | print(f"After y2.backward(), x.grad: {x.grad}") # 2 (from y1) + 3 (from y2) = 5 121 | 122 | # Zeroing gradients 123 | x.grad.zero_() 124 | print(f"After x.grad.zero_(), x.grad: {x.grad}") 125 | ``` 126 | 127 | ## Excluding Tensors from Autograd (`torch.no_grad()`, `detach()`) 128 | 129 | - **`torch.no_grad()`:** A context manager that disables gradient computation within its block. 
Useful for inference (when you don't need gradients) or when modifying model parameters without tracking these changes (e.g., during evaluation). 130 | - **`.detach()`:** Creates a new tensor that shares the same data as the original tensor but is detached from the current computational graph. It won't require gradients, and no operations on it will be tracked. Useful if you need to use a tensor in a computation that shouldn't be part of the gradient calculation, or to copy a tensor without its history. 131 | 132 | ```python 133 | a = torch.tensor([1.0, 2.0], requires_grad=True) 134 | b = a * 2 135 | 136 | with torch.no_grad(): 137 | c = a * 3 # Operation inside no_grad block 138 | print(f"c.requires_grad inside no_grad: {c.requires_grad}") # False 139 | 140 | d = b.detach() # d shares data with b but is detached 141 | print(f"b.requires_grad: {b.requires_grad}") # True 142 | print(f"d.requires_grad: {d.requires_grad}") # False 143 | ``` 144 | 145 | ## Gradients of Non-Scalar Outputs (Vector-Jacobian Product) 146 | 147 | - If `backward()` is called on a tensor `y` that is not a scalar (e.g., a vector or matrix), PyTorch expects a `gradient` argument. This argument should be a tensor of the same shape as `y` and represents the vector `v` in the vector-Jacobian product `v^T * J`. 148 | - **Vector-Jacobian Product:** Autograd is designed to compute Jacobian-vector products efficiently. If `y = f(x)` and `L` is a scalar loss computed from `y` (i.e., `L = g(y)`), then `dL/dx = (dL/dy) * (dy/dx)`. Here, `dL/dy` is the vector `v` you pass to `y.backward(v)`. 149 | - If you just want the full Jacobian matrix, you'd have to call `backward()` multiple times with one-hot vectors for `gradient`, which is inefficient. `torch.autograd.functional.jacobian` can be used for this if needed. 150 | 151 | ```python 152 | x = torch.randn(3, requires_grad=True) 153 | y = x * 2 # y is a vector 154 | # y.backward() # This would raise an error 155 | 156 | # Provide gradient argument for non-scalar output 157 | # This is equivalent to if we had a scalar loss L = sum(y*v) 158 | # and then called L.backward(). The gradient for x would be 2*v. 159 | v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float) 160 | y.backward(gradient=v) 161 | print(f"x.grad after y.backward(v): {x.grad}") # Expected: 2*v = [0.2, 2.0, 0.002] 162 | ``` 163 | 164 | ## Higher-Order Derivatives 165 | 166 | - PyTorch can compute gradients of gradients (and so on). 167 | - **`torch.autograd.grad()`:** A more flexible way to compute gradients. It takes the output tensor(s) and input tensor(s) and returns the gradients of outputs with respect to inputs. 168 | - **`create_graph=True`:** To compute higher-order derivatives, you need to set `create_graph=True` when calling `backward()` or `torch.autograd.grad()`. This tells Autograd to build a computational graph for the backward pass itself, allowing subsequent differentiation. 169 | 170 | ```python 171 | x = torch.tensor(2.0, requires_grad=True) 172 | y = x**3 173 | 174 | # First derivative (dy/dx) 175 | grad_y_x = torch.autograd.grad(outputs=y, inputs=x, create_graph=True)[0] 176 | print(f"dy/dx = 3*x^2 = {grad_y_x}") # 3 * 2^2 = 12 177 | 178 | # Second derivative (d^2y/dx^2) 179 | grad2_y_x2 = torch.autograd.grad(outputs=grad_y_x, inputs=x)[0] 180 | print(f"d^2y/dx^2 = 6*x = {grad2_y_x2}") # 6 * 2 = 12 181 | ``` 182 | 183 | ## In-place Operations and Autograd 184 | 185 | - In-place operations (e.g., `x.add_(1)`, `y.relu_()`) modify tensors directly without creating new ones. This can save memory. 
186 | - **Potential Issues:** Autograd needs the original values of tensors involved in the forward pass to compute gradients correctly during the backward pass. If an in-place operation overwrites a value that's needed, it can lead to errors or incorrect gradients. 187 | - PyTorch will often raise an error if an in-place operation that would cause issues is detected (e.g., modifying a leaf variable or a variable needed by `grad_fn`). 188 | 189 | ## Custom Autograd Functions (`torch.autograd.Function`) 190 | 191 | - For operations not natively supported by PyTorch, or if you want to define a custom gradient computation (e.g., for a layer written in C++ or CUDA, or to implement a non-differentiable function with a surrogate gradient). 192 | - **Subclass `torch.autograd.Function`:** Implement `forward()` and `backward()` as static methods. 193 | - `forward(ctx, input1, input2, ...)`: Performs the operation. `ctx` (context) is used to save tensors or any other objects needed for the backward pass using `ctx.save_for_backward(tensor1, tensor2)`. It must return the output tensor(s). 194 | - `backward(ctx, grad_output1, grad_output2, ...)`: Computes the gradients of the loss with respect to the inputs of the forward function. It receives the gradients of the loss with respect to the outputs of forward (`grad_output`). It must return as many tensors as there were inputs to `forward`, or `None` for inputs that don't need gradients. 195 | 196 | ```python 197 | class MyCustomReLU(torch.autograd.Function): 198 | @staticmethod 199 | def forward(ctx, input_tensor): 200 | # ctx is a context object that can be used to stash information 201 | # for backward computation 202 | ctx.save_for_backward(input_tensor) 203 | return input_tensor.clamp(min=0) 204 | 205 | @staticmethod 206 | def backward(ctx, grad_output): 207 | # We return as many input gradients as there were arguments. 208 | # Gradients of non-Tensor arguments to forward must be None. 209 | input_tensor, = ctx.saved_tensors 210 | grad_input = grad_output.clone() 211 | grad_input[input_tensor < 0] = 0 212 | return grad_input 213 | 214 | # Usage: 215 | my_relu_fn = MyCustomReLU.apply # Get the function to use 216 | x = torch.tensor([-1.0, 2.0, -0.5], requires_grad=True) 217 | y = my_relu_fn(x) 218 | print(f"Custom ReLU Output: {y}") 219 | y.backward(torch.tensor([1.0, 1.0, 1.0])) # Example upstream gradients 220 | print(f"Gradients for x after custom ReLU: {x.grad}") # Expected: [0., 1., 0.] 221 | ``` 222 | 223 | ## Practical Considerations and Tips 224 | - **`tensor.requires_grad`**: Check if a tensor is tracking history. 225 | - **`tensor.is_leaf`**: Check if a tensor is a leaf node in the graph. 226 | - **Memory Usage**: Autograd stores intermediate activations for the backward pass. For very large models or long sequences, this can lead to high memory usage. Techniques like gradient checkpointing can help. 227 | - **`retain_graph=True`**: Use in `backward()` if you need to perform another backward pass from the same part of the graph. Be mindful of memory implications. 228 | 229 | ## Running the Tutorial 230 | 231 | To run the Python script associated with this tutorial: 232 | ```bash 233 | python automatic_differentiation.py 234 | ``` 235 | We recommend you manually create an `automatic_differentiation.ipynb` notebook and copy the code from the Python script into it for an interactive experience. 236 | 237 | ## Prerequisites 238 | - Python 3.7+ 239 | - PyTorch 1.10+ 240 | - NumPy 241 | 242 | ## Related Tutorials 243 | 1. 
[PyTorch Basics](../01_pytorch_basics/README.md) 244 | 2. [Neural Networks Fundamentals](../02_neural_networks_fundamentals/README.md) 245 | 3. [Training Neural Networks](../04_training_neural_networks/README.md) -------------------------------------------------------------------------------- /14_performance_optimization/performance_optimization.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tutorial 14: Performance Optimization 3 | ===================================== 4 | 5 | This tutorial covers comprehensive performance optimization techniques 6 | for PyTorch models, from profiling to advanced optimization strategies. 7 | """ 8 | 9 | import torch 10 | import torch.nn as nn 11 | import torch.nn.functional as F 12 | from torch.utils.data import DataLoader, Dataset 13 | import torchvision 14 | import torchvision.transforms as transforms 15 | import time 16 | import numpy as np 17 | from torch.profiler import profile, record_function, ProfilerActivity 18 | import torch.cuda.amp as amp 19 | from torch.nn.parallel import DataParallel, DistributedDataParallel 20 | import torch.distributed as dist 21 | import torch.multiprocessing as mp 22 | import os 23 | import psutil 24 | import gc 25 | 26 | # Set device 27 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 28 | print(f"Using device: {device}") 29 | print() 30 | 31 | # Example 1: Basic Profiling 32 | print("Example 1: PyTorch Profiler") 33 | print("=" * 50) 34 | 35 | class SimpleModel(nn.Module): 36 | def __init__(self): 37 | super().__init__() 38 | self.conv1 = nn.Conv2d(3, 64, 3, padding=1) 39 | self.conv2 = nn.Conv2d(64, 128, 3, padding=1) 40 | self.fc1 = nn.Linear(128 * 8 * 8, 256) 41 | self.fc2 = nn.Linear(256, 10) 42 | 43 | def forward(self, x): 44 | x = F.relu(self.conv1(x)) 45 | x = F.max_pool2d(x, 2) 46 | x = F.relu(self.conv2(x)) 47 | x = F.max_pool2d(x, 2) 48 | x = x.view(x.size(0), -1) 49 | x = F.relu(self.fc1(x)) 50 | x = self.fc2(x) 51 | return x 52 | 53 | # Profile the model 54 | model = SimpleModel().to(device) 55 | inputs = torch.randn(32, 3, 32, 32).to(device) 56 | 57 | # Use PyTorch profiler 58 | with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], 59 | record_shapes=True, 60 | profile_memory=True, 61 | with_stack=True) as prof: 62 | with record_function("model_inference"): 63 | for _ in range(10): 64 | model(inputs) 65 | 66 | # Print profiler results 67 | print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) 68 | print() 69 | 70 | # Example 2: Memory Optimization 71 | print("Example 2: Memory Optimization") 72 | print("=" * 50) 73 | 74 | def get_memory_usage(): 75 | if torch.cuda.is_available(): 76 | return torch.cuda.memory_allocated() / 1024**2 # MB 77 | else: 78 | return psutil.Process().memory_info().rss / 1024**2 # MB 79 | 80 | # Memory-efficient gradient checkpointing 81 | class CheckpointedModel(nn.Module): 82 | def __init__(self): 83 | super().__init__() 84 | self.layers = nn.ModuleList([ 85 | nn.Sequential( 86 | nn.Linear(1024, 1024), 87 | nn.ReLU(), 88 | nn.Dropout(0.1) 89 | ) for _ in range(10) 90 | ]) 91 | self.final = nn.Linear(1024, 10) 92 | 93 | def forward(self, x): 94 | for layer in self.layers: 95 | # Use checkpoint to trade compute for memory 96 | x = torch.utils.checkpoint.checkpoint(layer, x) 97 | return self.final(x) 98 | 99 | # Compare memory usage 100 | print("Memory usage comparison:") 101 | x = torch.randn(128, 1024).to(device) 102 | 103 | # Without checkpointing 104 | regular_model = 
nn.Sequential(*[ 105 | nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.1)) 106 | for _ in range(10) 107 | ] + [nn.Linear(1024, 10)]).to(device) 108 | 109 | mem_before = get_memory_usage() 110 | y1 = regular_model(x) 111 | loss1 = y1.sum() 112 | loss1.backward() 113 | mem_regular = get_memory_usage() - mem_before 114 | print(f"Regular model: {mem_regular:.2f} MB") 115 | 116 | # With checkpointing 117 | checkpointed_model = CheckpointedModel().to(device) 118 | optimizer = torch.optim.Adam(checkpointed_model.parameters()) 119 | optimizer.zero_grad() 120 | 121 | mem_before = get_memory_usage() 122 | y2 = checkpointed_model(x) 123 | loss2 = y2.sum() 124 | loss2.backward() 125 | mem_checkpoint = get_memory_usage() - mem_before 126 | print(f"Checkpointed model: {mem_checkpoint:.2f} MB") 127 | print(f"Memory saved: {(1 - mem_checkpoint/mem_regular)*100:.1f}%") 128 | print() 129 | 130 | # Example 3: Mixed Precision Training 131 | print("Example 3: Mixed Precision Training") 132 | print("=" * 50) 133 | 134 | # Create a more complex model for mixed precision demo 135 | class MixedPrecisionModel(nn.Module): 136 | def __init__(self): 137 | super().__init__() 138 | self.features = nn.Sequential( 139 | nn.Conv2d(3, 64, 3, padding=1), 140 | nn.BatchNorm2d(64), 141 | nn.ReLU(), 142 | nn.Conv2d(64, 128, 3, padding=1), 143 | nn.BatchNorm2d(128), 144 | nn.ReLU(), 145 | nn.AdaptiveAvgPool2d(1) 146 | ) 147 | self.classifier = nn.Linear(128, 10) 148 | 149 | def forward(self, x): 150 | x = self.features(x) 151 | x = x.view(x.size(0), -1) 152 | x = self.classifier(x) 153 | return x 154 | 155 | # Training with mixed precision 156 | def train_with_amp(model, dataloader, epochs=2): 157 | model = model.to(device) 158 | optimizer = torch.optim.Adam(model.parameters()) 159 | scaler = amp.GradScaler() 160 | 161 | model.train() 162 | total_time = 0 163 | 164 | for epoch in range(epochs): 165 | epoch_start = time.time() 166 | for i, (inputs, targets) in enumerate(dataloader): 167 | if i >= 10: # Limit iterations for demo 168 | break 169 | 170 | inputs, targets = inputs.to(device), targets.to(device) 171 | 172 | optimizer.zero_grad() 173 | 174 | # Mixed precision forward pass 175 | with amp.autocast(): 176 | outputs = model(inputs) 177 | loss = F.cross_entropy(outputs, targets) 178 | 179 | # Scaled backward pass 180 | scaler.scale(loss).backward() 181 | scaler.step(optimizer) 182 | scaler.update() 183 | 184 | epoch_time = time.time() - epoch_start 185 | total_time += epoch_time 186 | 187 | return total_time / epochs 188 | 189 | # Create dummy dataset 190 | class DummyDataset(Dataset): 191 | def __init__(self, size=1000): 192 | self.size = size 193 | 194 | def __len__(self): 195 | return self.size 196 | 197 | def __getitem__(self, idx): 198 | return torch.randn(3, 32, 32), torch.randint(0, 10, (1,)).item() 199 | 200 | dataset = DummyDataset() 201 | dataloader = DataLoader(dataset, batch_size=64, num_workers=0) 202 | 203 | # Compare training times 204 | model_fp32 = MixedPrecisionModel() 205 | model_amp = MixedPrecisionModel() 206 | 207 | print("Training with FP32...") 208 | time_fp32 = train_with_amp(model_fp32, dataloader) 209 | print(f"Average epoch time: {time_fp32:.3f}s") 210 | 211 | print("\nTraining with AMP...") 212 | time_amp = train_with_amp(model_amp, dataloader) 213 | print(f"Average epoch time: {time_amp:.3f}s") 214 | print(f"Speedup: {time_fp32/time_amp:.2f}x") 215 | print() 216 | 217 | # Example 4: Data Loading Optimization 218 | print("Example 4: Data Loading Optimization") 219 | print("=" * 50) 
220 | 221 | # Optimized dataset with caching 222 | class OptimizedDataset(Dataset): 223 | def __init__(self, size=1000, cache_size=100): 224 | self.size = size 225 | self.cache_size = cache_size 226 | self.cache = {} 227 | self.transform = transforms.Compose([ 228 | transforms.RandomHorizontalFlip(), 229 | transforms.RandomCrop(32, padding=4), 230 | transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) 231 | ]) 232 | 233 | def __len__(self): 234 | return self.size 235 | 236 | def __getitem__(self, idx): 237 | # Simple caching mechanism 238 | if idx in self.cache: 239 | return self.cache[idx] 240 | 241 | # Simulate data loading 242 | image = torch.randn(3, 32, 32) 243 | label = torch.randint(0, 10, (1,)).item() 244 | 245 | # Cache recent items 246 | if len(self.cache) < self.cache_size: 247 | self.cache[idx] = (image, label) 248 | 249 | return image, label 250 | 251 | # Compare data loading performance 252 | def benchmark_dataloader(dataset, num_workers, pin_memory=False): 253 | dataloader = DataLoader( 254 | dataset, 255 | batch_size=128, 256 | num_workers=num_workers, 257 | pin_memory=pin_memory, 258 | persistent_workers=(num_workers > 0) 259 | ) 260 | 261 | start_time = time.time() 262 | for i, (data, target) in enumerate(dataloader): 263 | if i >= 50: # Limit iterations 264 | break 265 | # Simulate processing 266 | data = data.to(device, non_blocking=True) 267 | 268 | total_time = time.time() - start_time 269 | return total_time 270 | 271 | dataset = OptimizedDataset(5000) 272 | 273 | print("Data loading benchmark:") 274 | for num_workers in [0, 2, 4]: 275 | for pin_memory in [False, True]: 276 | time_taken = benchmark_dataloader(dataset, num_workers, pin_memory) 277 | print(f"Workers: {num_workers}, Pin memory: {pin_memory} - Time: {time_taken:.3f}s") 278 | print() 279 | 280 | # Example 5: Model Optimization with TorchScript 281 | print("Example 5: TorchScript Optimization") 282 | print("=" * 50) 283 | 284 | # Create a model for scripting 285 | class ScriptableModel(nn.Module): 286 | def __init__(self): 287 | super().__init__() 288 | self.conv1 = nn.Conv2d(3, 32, 3) 289 | self.conv2 = nn.Conv2d(32, 64, 3) 290 | self.fc = nn.Linear(64 * 6 * 6, 10) 291 | 292 | def forward(self, x): 293 | x = F.relu(self.conv1(x)) 294 | x = F.max_pool2d(x, 2) 295 | x = F.relu(self.conv2(x)) 296 | x = F.max_pool2d(x, 2) 297 | x = torch.flatten(x, 1) 298 | x = self.fc(x) 299 | return x 300 | 301 | # Compare scripted vs regular model 302 | model = ScriptableModel().to(device) 303 | model.eval() 304 | 305 | # Script the model 306 | scripted_model = torch.jit.script(model) 307 | 308 | # Benchmark 309 | x = torch.randn(100, 3, 32, 32).to(device) 310 | 311 | # Regular model 312 | torch.cuda.synchronize() if torch.cuda.is_available() else None 313 | start = time.time() 314 | for _ in range(100): 315 | _ = model(x) 316 | torch.cuda.synchronize() if torch.cuda.is_available() else None 317 | regular_time = time.time() - start 318 | 319 | # Scripted model 320 | torch.cuda.synchronize() if torch.cuda.is_available() else None 321 | start = time.time() 322 | for _ in range(100): 323 | _ = scripted_model(x) 324 | torch.cuda.synchronize() if torch.cuda.is_available() else None 325 | scripted_time = time.time() - start 326 | 327 | print(f"Regular model: {regular_time:.3f}s") 328 | print(f"Scripted model: {scripted_time:.3f}s") 329 | print(f"Speedup: {regular_time/scripted_time:.2f}x") 330 | print() 331 | 332 | # Example 6: Tensor Operations Optimization 333 | print("Example 6: Tensor Operations Optimization") 334 | 
print("=" * 50) 335 | 336 | # Inefficient operations 337 | def inefficient_operation(x): 338 | result = torch.zeros_like(x) 339 | for i in range(x.shape[0]): 340 | for j in range(x.shape[1]): 341 | result[i, j] = x[i, j] * 2 + 1 342 | return result 343 | 344 | # Efficient vectorized operation 345 | def efficient_operation(x): 346 | return x * 2 + 1 347 | 348 | # Benchmark 349 | x = torch.randn(1000, 1000).to(device) 350 | 351 | start = time.time() 352 | _ = inefficient_operation(x) 353 | inefficient_time = time.time() - start 354 | 355 | start = time.time() 356 | _ = efficient_operation(x) 357 | efficient_time = time.time() - start 358 | 359 | print(f"Inefficient operation: {inefficient_time:.4f}s") 360 | print(f"Efficient operation: {efficient_time:.4f}s") 361 | print(f"Speedup: {inefficient_time/efficient_time:.0f}x") 362 | print() 363 | 364 | # Example 7: Memory-Efficient Attention 365 | print("Example 7: Memory-Efficient Attention") 366 | print("=" * 50) 367 | 368 | class EfficientAttention(nn.Module): 369 | def __init__(self, dim, num_heads=8, chunk_size=256): 370 | super().__init__() 371 | self.num_heads = num_heads 372 | self.chunk_size = chunk_size 373 | self.scale = (dim // num_heads) ** -0.5 374 | 375 | self.qkv = nn.Linear(dim, dim * 3) 376 | self.proj = nn.Linear(dim, dim) 377 | 378 | def forward(self, x): 379 | B, N, C = x.shape 380 | qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4) 381 | q, k, v = qkv[0], qkv[1], qkv[2] 382 | 383 | # Chunked attention computation 384 | attn_chunks = [] 385 | for i in range(0, N, self.chunk_size): 386 | end_idx = min(i + self.chunk_size, N) 387 | q_chunk = q[:, :, i:end_idx] 388 | 389 | # Compute attention for this chunk 390 | attn = (q_chunk @ k.transpose(-2, -1)) * self.scale 391 | attn = attn.softmax(dim=-1) 392 | attn_chunk = attn @ v 393 | attn_chunks.append(attn_chunk) 394 | 395 | # Concatenate chunks 396 | x = torch.cat(attn_chunks, dim=2) 397 | x = x.transpose(1, 2).reshape(B, N, C) 398 | x = self.proj(x) 399 | return x 400 | 401 | # Test memory-efficient attention 402 | seq_len = 1024 403 | dim = 512 404 | batch_size = 8 405 | 406 | attention = EfficientAttention(dim).to(device) 407 | x = torch.randn(batch_size, seq_len, dim).to(device) 408 | 409 | mem_before = get_memory_usage() 410 | output = attention(x) 411 | mem_used = get_memory_usage() - mem_before 412 | print(f"Memory used by efficient attention: {mem_used:.2f} MB") 413 | print(f"Output shape: {output.shape}") 414 | print() 415 | 416 | # Example 8: Custom Memory Allocator 417 | print("Example 8: Custom Memory Management") 418 | print("=" * 50) 419 | 420 | class TensorPool: 421 | """Simple tensor pool for reusing allocations""" 422 | def __init__(self): 423 | self.pool = {} 424 | 425 | def get(self, shape, dtype=torch.float32, device='cpu'): 426 | key = (tuple(shape), dtype, device) 427 | if key in self.pool and len(self.pool[key]) > 0: 428 | return self.pool[key].pop() 429 | return torch.empty(shape, dtype=dtype, device=device) 430 | 431 | def release(self, tensor): 432 | key = (tuple(tensor.shape), tensor.dtype, tensor.device) 433 | if key not in self.pool: 434 | self.pool[key] = [] 435 | self.pool[key].append(tensor) 436 | 437 | def clear(self): 438 | self.pool.clear() 439 | 440 | # Example usage 441 | pool = TensorPool() 442 | 443 | # Simulate multiple allocations 444 | print("Using tensor pool:") 445 | tensors = [] 446 | for i in range(5): 447 | t = pool.get((100, 100), device=device) 448 | tensors.append(t) 449 | 450 | # 
Release some tensors back to pool 451 | for t in tensors[:3]: 452 | pool.release(t) 453 | 454 | # Reuse from pool 455 | print(f"Pool size before reuse: {sum(len(v) for v in pool.pool.values())}") 456 | new_tensors = [] 457 | for i in range(3): 458 | t = pool.get((100, 100), device=device) 459 | new_tensors.append(t) 460 | print(f"Pool size after reuse: {sum(len(v) for v in pool.pool.values())}") 461 | print() 462 | 463 | # Best Practices Summary 464 | print("Performance Optimization Best Practices") 465 | print("=" * 50) 466 | print("1. Profile First: Always profile before optimizing") 467 | print("2. Memory Management: Use gradient checkpointing for large models") 468 | print("3. Mixed Precision: Use AMP for faster training") 469 | print("4. Data Loading: Use multiple workers and pin_memory") 470 | print("5. Batch Size: Find optimal batch size for your GPU") 471 | print("6. TorchScript: Script models for production deployment") 472 | print("7. Operator Fusion: Use fused operations when available") 473 | print("8. Distributed Training: Scale across multiple GPUs") 474 | print() 475 | 476 | # Performance Checklist 477 | print("Performance Optimization Checklist") 478 | print("-" * 30) 479 | checklist = [ 480 | "Profile with torch.profiler", 481 | "Enable mixed precision training", 482 | "Optimize data loading pipeline", 483 | "Use gradient checkpointing for memory", 484 | "Apply model quantization", 485 | "Enable CUDNN benchmarking", 486 | "Use TorchScript for inference", 487 | "Implement custom CUDA kernels for bottlenecks", 488 | "Use distributed training for large models", 489 | "Monitor GPU utilization" 490 | ] 491 | 492 | for item in checklist: 493 | print(f"- [ ] {item}") 494 | 495 | print("\nRemember: Premature optimization is the root of all evil!") 496 | print("Always measure and profile before optimizing.") -------------------------------------------------------------------------------- /13_custom_extensions/custom_extensions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tutorial 13: Custom Extensions (C++ and CUDA) 3 | ============================================ 4 | 5 | This tutorial demonstrates how to create custom C++ and CUDA extensions 6 | for PyTorch to achieve better performance for specialized operations. 7 | """ 8 | 9 | import torch 10 | import torch.nn as nn 11 | import torch.nn.functional as F 12 | import numpy as np 13 | import time 14 | import os 15 | import sys 16 | from torch.utils.cpp_extension import load_inline 17 | 18 | # First, let's understand why we might need custom extensions 19 | print("Why Custom Extensions?") 20 | print("=" * 50) 21 | print("1. Performance: C++/CUDA can be much faster than Python") 22 | print("2. Memory efficiency: Better control over memory allocation") 23 | print("3. Novel operations: Implement operations not available in PyTorch") 24 | print("4. 
Hardware optimization: Leverage specific hardware features")
25 | print()
26 | 
27 | # Example 1: Simple C++ Extension (Inline JIT Compilation)
28 | print("Example 1: Simple C++ Extension")
29 | print("-" * 30)
30 | 
31 | # C++ source code for a custom ReLU implementation
32 | cpp_source = '''
33 | #include <torch/extension.h>
34 | #include <vector>
35 | 
36 | // Forward pass
37 | torch::Tensor custom_relu_forward(torch::Tensor input) {
38 |     auto output = torch::zeros_like(input);
39 |     output = torch::where(input > 0, input, output);
40 |     return output;
41 | }
42 | 
43 | // Backward pass
44 | torch::Tensor custom_relu_backward(torch::Tensor grad_output, torch::Tensor input) {
45 |     auto grad_input = torch::zeros_like(grad_output);
46 |     grad_input = torch::where(input > 0, grad_output, grad_input);
47 |     return grad_input;
48 | }
49 | 
50 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
51 |     m.def("forward", &custom_relu_forward, "Custom ReLU forward");
52 |     m.def("backward", &custom_relu_backward, "Custom ReLU backward");
53 | }
54 | '''
55 | os.makedirs('./cpp_build', exist_ok=True)  # load_inline expects the build directory to exist
56 | # Load the extension
57 | custom_relu_cpp = load_inline(
58 |     name='custom_relu_cpp',
59 |     cpp_sources=[cpp_source],
60 |     # bindings come from the PYBIND11_MODULE block above, so functions= is not needed here
61 |     verbose=True,
62 |     build_directory='./cpp_build'
63 | )
64 | 
65 | # Create a custom autograd Function
66 | class CustomReLUFunction(torch.autograd.Function):
67 |     @staticmethod
68 |     def forward(ctx, input):
69 |         ctx.save_for_backward(input)
70 |         return custom_relu_cpp.forward(input)
71 | 
72 |     @staticmethod
73 |     def backward(ctx, grad_output):
74 |         input, = ctx.saved_tensors
75 |         return custom_relu_cpp.backward(grad_output, input)
76 | 
77 | # Wrap it in a module
78 | class CustomReLU(nn.Module):
79 |     def forward(self, input):
80 |         return CustomReLUFunction.apply(input)
81 | 
82 | # Test the custom ReLU
83 | x = torch.randn(10, 10, requires_grad=True)
84 | custom_relu = CustomReLU()
85 | y = custom_relu(x)
86 | loss = y.sum()
87 | loss.backward()
88 | 
89 | print(f"Input shape: {x.shape}")
90 | print(f"Output shape: {y.shape}")
91 | print(f"Gradient computed: {x.grad is not None}")
92 | print()
93 | 
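# A quick, illustrative way to validate the custom Function is torch.autograd.gradcheck,
# which compares the analytical backward pass against finite differences. gradcheck expects
# double-precision inputs; for random inputs away from ReLU's kink at zero it should pass.
gradcheck_input = torch.randn(8, 8, dtype=torch.double, requires_grad=True)
print(f"gradcheck passed: {torch.autograd.gradcheck(CustomReLUFunction.apply, (gradcheck_input,), eps=1e-6, atol=1e-4)}")
print()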
94 | # Example 2: CUDA Extension for Matrix Operations
95 | print("Example 2: CUDA Extension")
96 | print("-" * 30)
97 | 
98 | # Check if CUDA is available
99 | if torch.cuda.is_available():
100 |     # CUDA kernel source code
101 |     cuda_source = '''
102 | #include <torch/extension.h>
103 | #include <cuda.h>
104 | #include <cuda_runtime.h>
105 | #include <vector>
106 | 
107 | template <typename scalar_t>
108 | __global__ void custom_matmul_kernel(
109 |     const scalar_t* __restrict__ a,
110 |     const scalar_t* __restrict__ b,
111 |     scalar_t* __restrict__ c,
112 |     int m, int n, int k) {
113 | 
114 |     int row = blockIdx.y * blockDim.y + threadIdx.y;
115 |     int col = blockIdx.x * blockDim.x + threadIdx.x;
116 | 
117 |     if (row < m && col < n) {
118 |         scalar_t sum = 0;
119 |         for (int i = 0; i < k; i++) {
120 |             sum += a[row * k + i] * b[i * n + col];
121 |         }
122 |         c[row * n + col] = sum;
123 |     }
124 | }
125 | 
126 | torch::Tensor custom_matmul_cuda(torch::Tensor a, torch::Tensor b) {
127 |     const int m = a.size(0);
128 |     const int k = a.size(1);
129 |     const int n = b.size(1);
130 | 
131 |     auto c = torch::zeros({m, n}, a.options());
132 | 
133 |     const dim3 threads(16, 16);
134 |     const dim3 blocks((n + threads.x - 1) / threads.x,
135 |                       (m + threads.y - 1) / threads.y);
136 | 
137 |     AT_DISPATCH_FLOATING_TYPES(a.scalar_type(), "custom_matmul_cuda", ([&] {
138 |         custom_matmul_kernel<scalar_t><<<blocks, threads>>>(
139 |             a.data_ptr<scalar_t>(),
140 |             b.data_ptr<scalar_t>(),
141 |             c.data_ptr<scalar_t>(),
142 |             m, n, k
143 |         );
144 |     }));
145 | 
146 |     return c;
147 | }
148 | '''
149 | 
150 |     cpp_source_cuda = '''
151 | #include <torch/extension.h>
152 | 
153 | torch::Tensor custom_matmul_cuda(torch::Tensor a, torch::Tensor b);
154 | 
155 | torch::Tensor custom_matmul(torch::Tensor a, torch::Tensor b) {
156 |     // Check inputs
157 |     TORCH_CHECK(a.dim() == 2, "Matrix A must be 2D");
158 |     TORCH_CHECK(b.dim() == 2, "Matrix B must be 2D");
159 |     TORCH_CHECK(a.size(1) == b.size(0), "Matrix dimensions must match for multiplication");
160 | 
161 |     if (a.is_cuda()) {
162 |         return custom_matmul_cuda(a, b);
163 |     } else {
164 |         // CPU implementation
165 |         return torch::matmul(a, b);
166 |     }
167 | }
168 | 
169 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
170 |     m.def("matmul", &custom_matmul, "Custom matrix multiplication");
171 | }
172 | '''
173 | 
174 |     # Note: building this requires nvcc and a proper CUDA toolchain
175 |     print("CUDA extension example (shown for illustration; not compiled by this script)")
176 |     print("In practice, you would compile this with setuptools or torch.utils.cpp_extension")
177 | else:
178 |     print("CUDA not available, skipping CUDA example")
179 | print()
180 | 
181 | # Example 3: Custom Linear Layer with Fused Operations
182 | print("Example 3: Fused Linear Layer")
183 | print("-" * 30)
184 | 
185 | # C++ code for fused linear layer (bias + activation)
186 | fused_cpp_source = '''
187 | #include <torch/extension.h>
188 | #include <vector>
189 | 
190 | torch::Tensor fused_linear_relu_forward(
191 |     torch::Tensor input,
192 |     torch::Tensor weight,
193 |     torch::Tensor bias) {
194 | 
195 |     // Perform linear transformation
196 |     auto output = torch::matmul(input, weight.t());
197 | 
198 |     // Add bias and apply ReLU in one pass
199 |     output = torch::clamp_min(output + bias, 0);
200 | 
201 |     return output;
202 | }
203 | 
204 | std::vector<torch::Tensor> fused_linear_relu_backward(
205 |     torch::Tensor grad_output,
206 |     torch::Tensor input,
207 |     torch::Tensor weight,
208 |     torch::Tensor output) {
209 | 
210 |     // ReLU backward
211 |     auto relu_grad = torch::where(output > 0, grad_output, torch::zeros_like(grad_output));
212 | 
213 |     // Linear backward
214 |     auto grad_input = torch::matmul(relu_grad, weight);
215 |     auto grad_weight = torch::matmul(relu_grad.t(), input);
216 |     auto grad_bias = relu_grad.sum(0);
217 | 
218 |     return {grad_input, grad_weight, grad_bias};
219 | }
220 | 
221 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
222 |     m.def("forward", &fused_linear_relu_forward, "Fused Linear-ReLU forward");
223 |     m.def("backward", &fused_linear_relu_backward, "Fused Linear-ReLU backward");
224 | }
225 | '''
226 | 
227 | # Load the fused operation
228 | fused_linear_relu = load_inline(
229 |     name='fused_linear_relu',
230 |     cpp_sources=[fused_cpp_source],
231 |     # bindings come from the PYBIND11_MODULE block above, so functions= is not needed here
232 |     verbose=True,
233 |     build_directory='./cpp_build'
234 | )
235 | 
236 | class FusedLinearReLUFunction(torch.autograd.Function):
237 |     @staticmethod
238 |     def forward(ctx, input, weight, bias):
239 |         output = fused_linear_relu.forward(input, weight, bias)
240 |         ctx.save_for_backward(input, weight, output)
241 |         return output
242 | 
243 |     @staticmethod
244 |     def backward(ctx, grad_output):
245 |         input, weight, output = ctx.saved_tensors
246 |         grad_input, grad_weight, grad_bias = fused_linear_relu.backward(
247 |             grad_output, input, weight, output
248 |         )
249 |         return grad_input, grad_weight, grad_bias
250 | 
251 | class FusedLinearReLU(nn.Module):
252 |     def __init__(self, in_features, out_features):
253 |         super().__init__()
254 |         self.weight = nn.Parameter(torch.randn(out_features, in_features))
255 |         self.bias = nn.Parameter(torch.zeros(out_features))
256 | 
257 |     def forward(self, input):
258 |         return FusedLinearReLUFunction.apply(input, self.weight, self.bias)
259 | 
260 | # Test the fused layer
261 | fused_layer = FusedLinearReLU(100, 50)
262 | x = torch.randn(32, 100)
263 | y = fused_layer(x)
264 | print(f"Fused layer output shape: {y.shape}")
265 | print()
266 | 
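# Illustrative sanity check: the fused extension should agree with the equivalent
# unfused PyTorch ops (a linear transform followed by ReLU).
reference = F.relu(F.linear(x, fused_layer.weight, fused_layer.bias))
print(f"Matches F.linear + F.relu: {torch.allclose(y, reference, atol=1e-5)}")
print()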
267 | # Example 4: Custom Optimizer in C++
268 | print("Example 4: Custom Optimizer")
269 | print("-" * 30)
270 | 
271 | custom_optimizer_source = '''
272 | #include <torch/extension.h>
273 | #include <vector>
274 | 
275 | void custom_sgd_step(
276 |     torch::Tensor param,
277 |     torch::Tensor grad,
278 |     torch::Tensor momentum_buffer,
279 |     float lr,
280 |     float momentum,
281 |     float weight_decay) {
282 | 
283 |     if (weight_decay != 0) {
284 |         grad = grad + weight_decay * param;
285 |     }
286 | 
287 |     if (momentum != 0) {
288 |         momentum_buffer.mul_(momentum).add_(grad);
289 |         param.add_(momentum_buffer, -lr);
290 |     } else {
291 |         param.add_(grad, -lr);
292 |     }
293 | }
294 | 
295 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
296 |     m.def("step", &custom_sgd_step, "Custom SGD step");
297 | }
298 | '''
299 | 
300 | custom_sgd = load_inline(
301 |     name='custom_sgd',
302 |     cpp_sources=[custom_optimizer_source],
303 |     # bindings come from the PYBIND11_MODULE block above, so functions= is not needed here
304 |     verbose=True,
305 |     build_directory='./cpp_build'
306 | )
307 | 
308 | class CustomSGD:
309 |     def __init__(self, params, lr=0.01, momentum=0.9, weight_decay=0):
310 |         self.params = list(params)
311 |         self.lr = lr
312 |         self.momentum = momentum
313 |         self.weight_decay = weight_decay
314 |         self.momentum_buffers = {}
315 | 
316 |         for p in self.params:
317 |             self.momentum_buffers[p] = torch.zeros_like(p)
318 | 
319 |     def step(self):
320 |         for p in self.params:
321 |             if p.grad is not None:
322 |                 custom_sgd.step(
323 |                     p.data,
324 |                     p.grad.data,
325 |                     self.momentum_buffers[p],
326 |                     self.lr,
327 |                     self.momentum,
328 |                     self.weight_decay
329 |                 )
330 | 
331 |     def zero_grad(self):
332 |         for p in self.params:
333 |             if p.grad is not None:
334 |                 p.grad.zero_()
335 | 
336 | # Example 5: Performance Comparison
337 | print("Example 5: Performance Comparison")
338 | print("-" * 30)
339 | 
340 | def benchmark_operation(name, func, *args, num_runs=1000):
341 |     # Warmup
342 |     for _ in range(10):
343 |         func(*args)
344 | 
345 |     # Benchmark
346 |     if torch.cuda.is_available():
347 |         torch.cuda.synchronize()
348 | 
349 |     start_time = time.time()
350 |     for _ in range(num_runs):
351 |         result = func(*args)
352 | 
353 |     if torch.cuda.is_available():
354 |         torch.cuda.synchronize()
355 | 
356 |     end_time = time.time()
357 |     avg_time = (end_time - start_time) / num_runs * 1000  # Convert to ms
358 | 
359 |     return avg_time, result
360 | 
361 | # Compare custom ReLU with PyTorch ReLU
362 | x = torch.randn(1000, 1000)
363 | pytorch_relu = nn.ReLU()
364 | custom_relu = CustomReLU()
365 | 
366 | pytorch_time, _ = benchmark_operation("PyTorch ReLU", pytorch_relu, x)
367 | custom_time, _ = benchmark_operation("Custom ReLU", custom_relu, x)
368 | 
369 | print(f"PyTorch ReLU: {pytorch_time:.4f} ms")
370 | print(f"Custom ReLU: {custom_time:.4f} ms")
371 | print(f"Speedup: {pytorch_time/custom_time:.2f}x")
372 | print()
373 | 
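# For more robust timings than the hand-rolled helper above, torch.utils.benchmark
# handles warmup, CUDA synchronization, and per-run statistics. A minimal, illustrative use:
from torch.utils import benchmark as torch_benchmark

_timer_ref = torch_benchmark.Timer(stmt="relu(x)", globals={"relu": pytorch_relu, "x": x})
_timer_custom = torch_benchmark.Timer(stmt="relu(x)", globals={"relu": custom_relu, "x": x})
print(f"torch.utils.benchmark -- PyTorch ReLU: {_timer_ref.timeit(100).mean * 1e3:.4f} ms, "
      f"Custom ReLU: {_timer_custom.timeit(100).mean * 1e3:.4f} ms")
print()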
374 | # Example 6: Building Extensions with setuptools
375 | print("Example 6: Building with setuptools")
376 | print("-" * 30)
377 | 
378 | setup_py_content = '''
379 | import torch
380 | from setuptools import setup
381 | from torch.utils import cpp_extension
382 | 
383 | ext_modules = [
384 |     cpp_extension.CppExtension(
385 |         'custom_ops',
386 |         ['custom_ops.cpp'],
387 |         extra_compile_args=['-O3']
388 |     ),
389 | ]
390 | if torch.cuda.is_available():
391 |     ext_modules.append(cpp_extension.CUDAExtension(
392 |         'custom_cuda_ops',
393 |         ['custom_cuda_ops.cpp', 'custom_cuda_ops_kernel.cu'],
394 |         extra_compile_args={'cxx': ['-O3'], 'nvcc': ['-O3', '--use_fast_math']}
395 |     ))
396 | setup(
397 |     name='custom_ops',
398 |     ext_modules=ext_modules,
399 |     cmdclass={'build_ext': cpp_extension.BuildExtension}
400 | )
401 | '''
402 | 
403 | print("Example setup.py for building extensions:")
404 | print(setup_py_content)
405 | print()
406 | 
407 | # Example 7: Memory Management in Extensions
408 | print("Example 7: Memory Management")
409 | print("-" * 30)
410 | 
411 | memory_cpp_source = '''
412 | #include <torch/extension.h>
413 | #include <vector>
414 | 
415 | // Efficient memory pooling example
416 | class MemoryPool {
417 | private:
418 |     std::vector<torch::Tensor> pool;
419 |     std::vector<bool> in_use;
420 | 
421 | public:
422 |     torch::Tensor allocate(std::vector<int64_t> shape, torch::TensorOptions options) {
423 |         // Try to find a suitable tensor in the pool
424 |         for (size_t i = 0; i < pool.size(); i++) {
425 |             if (!in_use[i] && pool[i].sizes() == shape && pool[i].dtype() == options.dtype() && pool[i].device() == options.device()) {
426 |                 in_use[i] = true;
427 |                 return pool[i];
428 |             }
429 |         }
430 | 
431 |         // Allocate new tensor
432 |         auto tensor = torch::empty(shape, options);
433 |         pool.push_back(tensor);
434 |         in_use.push_back(true);
435 |         return tensor;
436 |     }
437 | 
438 |     void release(torch::Tensor tensor) {
439 |         for (size_t i = 0; i < pool.size(); i++) {
440 |             if (pool[i].data_ptr() == tensor.data_ptr()) {
441 |                 in_use[i] = false;
442 |                 break;
443 |             }
444 |         }
445 |     }
446 | };
447 | 
448 | // Global memory pool
449 | MemoryPool global_pool;
450 | 
451 | torch::Tensor pooled_operation(torch::Tensor input) {
452 |     auto shape = input.sizes().vec();
453 |     auto output = global_pool.allocate(shape, input.options());
454 | 
455 |     // Perform operation
456 |     output.copy_(input);
457 |     output.mul_(2.0);
458 | 
459 |     return output;
460 | }
461 | 
462 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
463 |     m.def("pooled_operation", &pooled_operation, "Operation with memory pooling");
464 | }
465 | '''
466 | 
467 | print("Memory pooling example shown above")
468 | print("This technique can significantly reduce memory allocation overhead")
469 | print()
470 | 
471 | # Best Practices and Tips
472 | print("Best Practices for Custom Extensions")
473 | print("=" * 50)
474 | print("1. Profile First: Ensure the operation is actually a bottleneck")
475 | print("2. Use Existing Ops: Check if PyTorch already has what you need")
476 | print("3. Memory Layout: Ensure tensors are contiguous when needed")
477 | print("4. Error Handling: Use TORCH_CHECK for input validation")
478 | print("5. Gradient Testing: Always verify gradients with gradcheck")
479 | print("6. Documentation: Document tensor shapes and assumptions")
480 | print("7. Platform Support: Test on different platforms and CUDA versions")
481 | print()
482 | 
483 | # Debugging Tips
484 | print("Debugging Custom Extensions")
485 | print("-" * 30)
486 | print("1. Use print statements in C++ (std::cout)")
487 | print("2. Enable verbose mode in load_inline")
488 | print("3. Use cuda-gdb for CUDA kernels")
489 | print("4. Check tensor contiguity with .is_contiguous()")
490 | print("5. Verify shapes and strides match expectations")
491 | print("6. 
Use torch.autograd.gradcheck for gradient verification") 492 | print() 493 | 494 | # Summary 495 | print("Summary") 496 | print("=" * 50) 497 | print("Custom extensions allow you to:") 498 | print("- Achieve better performance for specialized operations") 499 | print("- Implement novel algorithms not available in PyTorch") 500 | print("- Leverage hardware-specific optimizations") 501 | print("- Create memory-efficient implementations") 502 | print("\nRemember: Only use custom extensions when necessary!") 503 | print("PyTorch's built-in operations are highly optimized and sufficient for most use cases.") -------------------------------------------------------------------------------- /12_distributed_training/distributed_training.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Distributed Training 6 | 7 | This script demonstrates various distributed training techniques in PyTorch, 8 | including Data Parallel, Distributed Data Parallel, Model Parallel, and FSDP. 9 | """ 10 | 11 | import os 12 | import time 13 | import argparse 14 | import torch 15 | import torch.nn as nn 16 | import torch.nn.functional as F 17 | import torch.distributed as dist 18 | import torch.multiprocessing as mp 19 | from torch.nn.parallel import DataParallel, DistributedDataParallel as DDP 20 | from torch.utils.data import Dataset, DataLoader 21 | from torch.utils.data.distributed import DistributedSampler 22 | import matplotlib.pyplot as plt 23 | import numpy as np 24 | 25 | # Set random seed for reproducibility 26 | torch.manual_seed(42) 27 | np.random.seed(42) 28 | 29 | # ----------------------------------------------------------------------------- 30 | # Section 1: Introduction to Distributed Training 31 | # ----------------------------------------------------------------------------- 32 | 33 | def intro_to_distributed_training(): 34 | """Introduce distributed training concepts.""" 35 | print("\nSection 1: Introduction to Distributed Training") 36 | print("-" * 50) 37 | print("Distributed training enables:") 38 | print(" - Faster training with multiple GPUs/nodes") 39 | print(" - Training larger models that don't fit on single GPU") 40 | print(" - Processing larger batch sizes") 41 | print("\nTypes of parallelism:") 42 | print(" - Data Parallel: Split data, replicate model") 43 | print(" - Model Parallel: Split model across devices") 44 | print(" - Pipeline Parallel: Split model into stages") 45 | print(f"\nCUDA available: {torch.cuda.is_available()}") 46 | print(f"Number of GPUs: {torch.cuda.device_count()}") 47 | 48 | # ----------------------------------------------------------------------------- 49 | # Section 2: Sample Dataset and Model 50 | # ----------------------------------------------------------------------------- 51 | 52 | class SyntheticDataset(Dataset): 53 | """A synthetic dataset for demonstration.""" 54 | def __init__(self, size=10000, input_dim=784, num_classes=10): 55 | self.size = size 56 | self.input_dim = input_dim 57 | self.num_classes = num_classes 58 | 59 | def __len__(self): 60 | return self.size 61 | 62 | def __getitem__(self, idx): 63 | # Generate random data 64 | data = torch.randn(self.input_dim) 65 | label = torch.randint(0, self.num_classes, (1,)).item() 66 | return data, label 67 | 68 | class SimpleNet(nn.Module): 69 | """A simple neural network for demonstration.""" 70 | def __init__(self, input_dim=784, hidden_dim=256, num_classes=10): 71 | super().__init__() 72 | self.fc1 = 
nn.Linear(input_dim, hidden_dim) 73 | self.fc2 = nn.Linear(hidden_dim, hidden_dim) 74 | self.fc3 = nn.Linear(hidden_dim, num_classes) 75 | self.dropout = nn.Dropout(0.2) 76 | 77 | def forward(self, x): 78 | x = F.relu(self.fc1(x)) 79 | x = self.dropout(x) 80 | x = F.relu(self.fc2(x)) 81 | x = self.dropout(x) 82 | x = self.fc3(x) 83 | return x 84 | 85 | # ----------------------------------------------------------------------------- 86 | # Section 3: Data Parallel (DP) 87 | # ----------------------------------------------------------------------------- 88 | 89 | def demonstrate_data_parallel(): 90 | """Demonstrate Data Parallel training.""" 91 | print("\nSection 2: Data Parallel (DP)") 92 | print("-" * 50) 93 | 94 | if torch.cuda.device_count() < 2: 95 | print("Data Parallel requires at least 2 GPUs. Simulating with CPU...") 96 | return 97 | 98 | # Create model and wrap with DataParallel 99 | model = SimpleNet() 100 | model = DataParallel(model) 101 | model = model.cuda() 102 | 103 | # Create dataset and dataloader 104 | dataset = SyntheticDataset(size=1000) 105 | dataloader = DataLoader(dataset, batch_size=64, shuffle=True) 106 | 107 | # Loss and optimizer 108 | criterion = nn.CrossEntropyLoss() 109 | optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 110 | 111 | # Training loop 112 | print("Training with DataParallel...") 113 | start_time = time.time() 114 | 115 | for epoch in range(2): 116 | total_loss = 0 117 | for batch_idx, (data, target) in enumerate(dataloader): 118 | data, target = data.cuda(), target.cuda() 119 | 120 | optimizer.zero_grad() 121 | output = model(data) 122 | loss = criterion(output, target) 123 | loss.backward() 124 | optimizer.step() 125 | 126 | total_loss += loss.item() 127 | 128 | if batch_idx % 5 == 0: 129 | print(f" Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}") 130 | 131 | avg_loss = total_loss / len(dataloader) 132 | print(f"Epoch {epoch} - Average Loss: {avg_loss:.4f}") 133 | 134 | elapsed_time = time.time() - start_time 135 | print(f"Training time: {elapsed_time:.2f} seconds") 136 | 137 | # ----------------------------------------------------------------------------- 138 | # Section 4: Distributed Data Parallel (DDP) 139 | # ----------------------------------------------------------------------------- 140 | 141 | def setup_ddp(rank, world_size): 142 | """Initialize the distributed environment.""" 143 | os.environ['MASTER_ADDR'] = 'localhost' 144 | os.environ['MASTER_PORT'] = '12355' 145 | 146 | # Initialize process group 147 | dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo", 148 | rank=rank, world_size=world_size) 149 | 150 | def cleanup_ddp(): 151 | """Clean up the distributed environment.""" 152 | dist.destroy_process_group() 153 | 154 | def train_ddp(rank, world_size, num_epochs=2): 155 | """Training function for DDP.""" 156 | print(f"\nProcess {rank}: Initializing DDP training...") 157 | setup_ddp(rank, world_size) 158 | 159 | # Create model and move to device 160 | device = torch.device(f'cuda:{rank}' if torch.cuda.is_available() else 'cpu') 161 | model = SimpleNet().to(device) 162 | 163 | # Wrap model with DDP 164 | if torch.cuda.is_available(): 165 | ddp_model = DDP(model, device_ids=[rank]) 166 | else: 167 | ddp_model = DDP(model) 168 | 169 | # Create dataset with DistributedSampler 170 | dataset = SyntheticDataset(size=1000) 171 | sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank) 172 | dataloader = DataLoader(dataset, batch_size=64, sampler=sampler) 173 | 174 | # Loss and optimizer 
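    # Note: each DDP process sees roughly 1/world_size of the data per step, so the
    # effective global batch size grows with the number of processes; a common heuristic
    # (not applied here) is to scale the learning rate with world_size and warm it up.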
175 | criterion = nn.CrossEntropyLoss() 176 | optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001) 177 | 178 | # Training loop 179 | start_time = time.time() 180 | 181 | for epoch in range(num_epochs): 182 | sampler.set_epoch(epoch) # Important for proper shuffling 183 | total_loss = 0 184 | 185 | for batch_idx, (data, target) in enumerate(dataloader): 186 | data, target = data.to(device), target.to(device) 187 | 188 | optimizer.zero_grad() 189 | output = ddp_model(data) 190 | loss = criterion(output, target) 191 | loss.backward() 192 | optimizer.step() 193 | 194 | total_loss += loss.item() 195 | 196 | if rank == 0 and batch_idx % 5 == 0: 197 | print(f" Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}") 198 | 199 | # Synchronize and compute average loss 200 | avg_loss = total_loss / len(dataloader) 201 | if rank == 0: 202 | print(f"Epoch {epoch} - Average Loss: {avg_loss:.4f}") 203 | 204 | elapsed_time = time.time() - start_time 205 | if rank == 0: 206 | print(f"DDP Training time: {elapsed_time:.2f} seconds") 207 | 208 | cleanup_ddp() 209 | 210 | def demonstrate_ddp(): 211 | """Demonstrate Distributed Data Parallel training.""" 212 | print("\nSection 3: Distributed Data Parallel (DDP)") 213 | print("-" * 50) 214 | 215 | world_size = min(torch.cuda.device_count(), 2) if torch.cuda.is_available() else 2 216 | 217 | if world_size < 2: 218 | print("DDP demonstration requires at least 2 processes.") 219 | print("Simulating with 2 CPU processes...") 220 | 221 | mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True) 222 | 223 | # ----------------------------------------------------------------------------- 224 | # Section 5: Model Parallel 225 | # ----------------------------------------------------------------------------- 226 | 227 | class ModelParallelNet(nn.Module): 228 | """A model split across multiple devices.""" 229 | def __init__(self, input_dim=784, hidden_dim=256, num_classes=10): 230 | super().__init__() 231 | 232 | # Determine devices 233 | self.device1 = torch.device('cuda:0' if torch.cuda.device_count() > 0 else 'cpu') 234 | self.device2 = torch.device('cuda:1' if torch.cuda.device_count() > 1 else 'cpu') 235 | 236 | # Split model across devices 237 | self.fc1 = nn.Linear(input_dim, hidden_dim).to(self.device1) 238 | self.fc2 = nn.Linear(hidden_dim, hidden_dim).to(self.device2) 239 | self.fc3 = nn.Linear(hidden_dim, num_classes).to(self.device2) 240 | 241 | def forward(self, x): 242 | x = x.to(self.device1) 243 | x = F.relu(self.fc1(x)) 244 | 245 | x = x.to(self.device2) 246 | x = F.relu(self.fc2(x)) 247 | x = self.fc3(x) 248 | 249 | return x 250 | 251 | def demonstrate_model_parallel(): 252 | """Demonstrate Model Parallel training.""" 253 | print("\nSection 4: Model Parallel") 254 | print("-" * 50) 255 | 256 | if torch.cuda.device_count() < 2: 257 | print("Model Parallel requires at least 2 GPUs.") 258 | print("Demonstrating concept with CPU...") 259 | 260 | # Create model parallel network 261 | model = ModelParallelNet() 262 | 263 | # Create small dataset for demonstration 264 | dataset = SyntheticDataset(size=200) 265 | dataloader = DataLoader(dataset, batch_size=32, shuffle=True) 266 | 267 | # Loss and optimizer 268 | device2 = torch.device('cuda:1' if torch.cuda.device_count() > 1 else 'cpu') 269 | criterion = nn.CrossEntropyLoss() 270 | optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 271 | 272 | # Training loop 273 | print("Training with Model Parallel...") 274 | start_time = time.time() 275 | 276 | for epoch in range(2): 277 | 
total_loss = 0 278 | for batch_idx, (data, target) in enumerate(dataloader): 279 | target = target.to(device2) 280 | 281 | optimizer.zero_grad() 282 | output = model(data) 283 | loss = criterion(output, target) 284 | loss.backward() 285 | optimizer.step() 286 | 287 | total_loss += loss.item() 288 | 289 | if batch_idx % 5 == 0: 290 | print(f" Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}") 291 | 292 | avg_loss = total_loss / len(dataloader) 293 | print(f"Epoch {epoch} - Average Loss: {avg_loss:.4f}") 294 | 295 | elapsed_time = time.time() - start_time 296 | print(f"Training time: {elapsed_time:.2f} seconds") 297 | 298 | # ----------------------------------------------------------------------------- 299 | # Section 6: Pipeline Parallel (Conceptual Demo) 300 | # ----------------------------------------------------------------------------- 301 | 302 | def demonstrate_pipeline_parallel(): 303 | """Demonstrate Pipeline Parallel concepts.""" 304 | print("\nSection 5: Pipeline Parallel") 305 | print("-" * 50) 306 | print("Pipeline Parallelism splits the model into stages and processes") 307 | print("micro-batches in a pipeline fashion to improve GPU utilization.") 308 | print("\nKey concepts:") 309 | print(" - Model is split into sequential stages") 310 | print(" - Each stage processes micro-batches") 311 | print(" - Reduces bubble (idle) time") 312 | print(" - Can be combined with data parallelism") 313 | 314 | # Simple visualization of pipeline scheduling 315 | print("\nPipeline Schedule Visualization:") 316 | print("Time →") 317 | print("GPU0: [F1][F2][F3][F4][B4][B3][B2][B1]") 318 | print("GPU1: [F1][F2][F3][F4][B4][B3][B2][B1]") 319 | print("GPU2: [F1][F2][F3][F4][B4][B3][B2][B1]") 320 | print("GPU3: [F1][F2][F3][F4][B4][B3][B2][B1]") 321 | print("\nF=Forward, B=Backward, Numbers=Micro-batch IDs") 322 | 323 | # ----------------------------------------------------------------------------- 324 | # Section 7: Fully Sharded Data Parallel (FSDP) Demo 325 | # ----------------------------------------------------------------------------- 326 | 327 | def demonstrate_fsdp_concepts(): 328 | """Demonstrate FSDP concepts.""" 329 | print("\nSection 6: Fully Sharded Data Parallel (FSDP)") 330 | print("-" * 50) 331 | print("FSDP enables training of extremely large models by:") 332 | print(" - Sharding model parameters across GPUs") 333 | print(" - Sharding optimizer states") 334 | print(" - Sharding gradients") 335 | print(" - Optional CPU offloading") 336 | print("\nMemory savings example:") 337 | print(" Standard DDP: Each GPU stores full model") 338 | print(" FSDP: Each GPU stores 1/N of model (N = number of GPUs)") 339 | 340 | # Calculate memory savings 341 | model_size_gb = 7 # Example: 7B parameter model 342 | num_gpus = 8 343 | 344 | print(f"\nExample with {model_size_gb}B parameter model on {num_gpus} GPUs:") 345 | print(f" DDP memory per GPU: {model_size_gb} GB") 346 | print(f" FSDP memory per GPU: {model_size_gb/num_gpus:.2f} GB") 347 | print(f" Memory reduction: {(1 - 1/num_gpus)*100:.1f}%") 348 | 349 | # ----------------------------------------------------------------------------- 350 | # Section 8: Performance Comparison 351 | # ----------------------------------------------------------------------------- 352 | 353 | def plot_performance_comparison(): 354 | """Create a performance comparison visualization.""" 355 | print("\nSection 7: Performance Comparison") 356 | print("-" * 50) 357 | 358 | # Simulated performance data 359 | methods = ['Single GPU', 'DP (4 GPUs)', 'DDP (4 GPUs)', 
'FSDP (4 GPUs)'] 360 | throughput = [100, 320, 380, 350] # Images/second 361 | memory_usage = [16, 64, 64, 20] # GB 362 | 363 | fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5)) 364 | 365 | # Throughput comparison 366 | ax1.bar(methods, throughput, color=['blue', 'green', 'orange', 'red']) 367 | ax1.set_ylabel('Throughput (samples/sec)') 368 | ax1.set_title('Training Throughput Comparison') 369 | ax1.set_ylim(0, 400) 370 | 371 | # Memory usage comparison 372 | ax2.bar(methods, memory_usage, color=['blue', 'green', 'orange', 'red']) 373 | ax2.set_ylabel('Memory Usage (GB)') 374 | ax2.set_title('Memory Usage Comparison') 375 | ax2.set_ylim(0, 70) 376 | 377 | plt.tight_layout() 378 | plt.savefig('distributed_training_comparison.png') 379 | print("Performance comparison saved to 'distributed_training_comparison.png'") 380 | 381 | # ----------------------------------------------------------------------------- 382 | # Section 9: Best Practices 383 | # ----------------------------------------------------------------------------- 384 | 385 | def print_best_practices(): 386 | """Print distributed training best practices.""" 387 | print("\nSection 8: Best Practices") 388 | print("-" * 50) 389 | print("1. Data Loading:") 390 | print(" - Use DistributedSampler for DDP") 391 | print(" - Pin memory for GPU training") 392 | print(" - Use multiple workers for data loading") 393 | print("\n2. Gradient Synchronization:") 394 | print(" - Use gradient accumulation for large batches") 395 | print(" - Consider gradient compression for bandwidth") 396 | print("\n3. Checkpointing:") 397 | print(" - Save checkpoints only from rank 0") 398 | print(" - Use torch.save with map_location for loading") 399 | print("\n4. Debugging:") 400 | print(" - Set TORCH_DISTRIBUTED_DEBUG=DETAIL") 401 | print(" - Use torch.distributed.barrier() for synchronization") 402 | print(" - Monitor GPU utilization and memory") 403 | print("\n5. 
Performance:") 404 | print(" - Profile with torch.profiler") 405 | print(" - Use mixed precision training") 406 | print(" - Overlap computation and communication") 407 | 408 | # ----------------------------------------------------------------------------- 409 | # Main Function 410 | # ----------------------------------------------------------------------------- 411 | 412 | def main(): 413 | """Main function to run all demonstrations.""" 414 | parser = argparse.ArgumentParser(description='Distributed Training Tutorial') 415 | parser.add_argument('--distributed', action='store_true', 416 | help='Run distributed training examples') 417 | args = parser.parse_args() 418 | 419 | print("=" * 70) 420 | print("Distributed Training Tutorial") 421 | print("=" * 70) 422 | 423 | # Run demonstrations 424 | intro_to_distributed_training() 425 | 426 | if torch.cuda.device_count() >= 2: 427 | demonstrate_data_parallel() 428 | else: 429 | print("\nSkipping Data Parallel demo (requires 2+ GPUs)") 430 | 431 | if args.distributed: 432 | demonstrate_ddp() 433 | else: 434 | print("\nSkipping DDP demo (use --distributed flag to run)") 435 | 436 | if torch.cuda.device_count() >= 2: 437 | demonstrate_model_parallel() 438 | else: 439 | print("\nSkipping Model Parallel demo (requires 2+ GPUs)") 440 | 441 | demonstrate_pipeline_parallel() 442 | demonstrate_fsdp_concepts() 443 | plot_performance_comparison() 444 | print_best_practices() 445 | 446 | print("\n" + "=" * 70) 447 | print("Tutorial completed!") 448 | print("=" * 70) 449 | 450 | if __name__ == "__main__": 451 | main() -------------------------------------------------------------------------------- /04_training_neural_networks/README.md: -------------------------------------------------------------------------------- 1 | # Training Neural Networks in PyTorch: A Comprehensive Guide 2 | 3 | This tutorial provides an in-depth guide to training neural networks effectively using PyTorch. We will cover everything from the fundamental training loop to advanced techniques for optimization, regularization, and monitoring to help you build robust and high-performing models. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Neural Network Training](#introduction-to-neural-network-training) 7 | - The Goal: Learning from Data 8 | - Core Components Revisited: Model, Data, Loss, Optimizer 9 | - The Iterative Process: Epochs and Batches 10 | 2. [Preparing Your Data with `Dataset` and `DataLoader`](#preparing-your-data-with-dataset-and-dataloader) 11 | - `torch.utils.data.Dataset` Customization 12 | - `torch.utils.data.DataLoader` for Batching and Shuffling 13 | - Data Augmentation and Transformation 14 | 3. [The Essential Training Loop](#the-essential-training-loop) 15 | - Setting the Model to Training Mode (`model.train()`) 16 | - Iterating Through Data Batches 17 | - Zeroing Gradients (`optimizer.zero_grad()`) 18 | - Forward Pass: Getting Predictions 19 | - Calculating the Loss 20 | - Backward Pass: Computing Gradients (`loss.backward()`) 21 | - Optimizer Step: Updating Weights (`optimizer.step()`) 22 | - Tracking Metrics (Loss, Accuracy) 23 | 4. [Validation: Evaluating Model Performance](#validation-evaluating-model-performance) 24 | - Importance of a Validation Set 25 | - Train-Validation-Test Splits 26 | - Setting the Model to Evaluation Mode (`model.eval()`) 27 | - Disabling Gradient Computation (`torch.no_grad()`) 28 | - Implementing a Validation Loop 29 | - K-Fold Cross-Validation (Concept and Use Case) 30 | 5. 
[Saving and Loading Models](#saving-and-loading-models) 31 | - Saving/Loading Entire Model vs. State Dictionary (`state_dict`) 32 | - Saving `state_dict` (Recommended) 33 | - Loading `state_dict` 34 | - Saving Checkpoints During Training (for Resuming) 35 | 6. [Hyperparameter Tuning Strategies](#hyperparameter-tuning-strategies) 36 | - What are Hyperparameters? 37 | - Common Hyperparameters: Learning Rate, Batch Size, Network Architecture, Regularization Strength 38 | - Manual Search vs. Grid Search vs. Random Search 39 | - Advanced Tools: Optuna, Ray Tune, Weights & Biases Sweeps (Conceptual Overview) 40 | 7. [Learning Rate Scheduling](#learning-rate-scheduling) 41 | - Why Adjust Learning Rate During Training? 42 | - Common Schedulers in `torch.optim.lr_scheduler`: 43 | - `StepLR`: Decay by gamma every step_size epochs. 44 | - `MultiStepLR`: Decay by gamma at specified milestones. 45 | - `ExponentialLR`: Decay by gamma every epoch. 46 | - `CosineAnnealingLR`: Cosine-shaped decay. 47 | - `ReduceLROnPlateau`: Reduce LR when a metric stops improving. 48 | - Integrating Schedulers into the Training Loop 49 | 8. [Regularization Techniques to Prevent Overfitting](#regularization-techniques-to-prevent-overfitting) 50 | - What is Overfitting? 51 | - L1 and L2 Regularization (Weight Decay in Optimizers) 52 | - Dropout (`nn.Dropout`) 53 | - Early Stopping 54 | - Data Augmentation (as a form of regularization) 55 | 9. [Gradient Clipping](#gradient-clipping) 56 | - Problem: Exploding Gradients 57 | - `torch.nn.utils.clip_grad_norm_` 58 | - `torch.nn.utils.clip_grad_value_` 59 | - When and How to Use It 60 | 10. [Weight Initialization Strategies](#weight-initialization-strategies) 61 | - Importance of Proper Initialization 62 | - Common Methods in `torch.nn.init`: 63 | - Xavier/Glorot Initialization (`nn.init.xavier_uniform_`, `nn.init.xavier_normal_`) 64 | - Kaiming/He Initialization (`nn.init.kaiming_uniform_`, `nn.init.kaiming_normal_`) 65 | - Initializing Biases (e.g., to zero or small constants) 66 | - Applying Initialization to a Model 67 | 11. [Batch Normalization (`nn.BatchNorm1d`, `nn.BatchNorm2d`)](#batch-normalization-nnbatchnorm1d-nnbatchnorm2d) 68 | - How it Works: Normalizing Activations within a Batch 69 | - Benefits: Faster Convergence, Regularization Effect, Reduced Sensitivity to Initialization 70 | - Usage: `model.train()` vs. `model.eval()` behavior 71 | 12. [Monitoring Training with TensorBoard](#monitoring-training-with-tensorboard) 72 | - `torch.utils.tensorboard.SummaryWriter` 73 | - Logging Scalars: Loss, Accuracy, Learning Rate 74 | - Logging Histograms: Weights, Gradients 75 | - Logging Images, Model Graphs (Conceptual) 76 | 13. [A Complete Training Pipeline Example](#a-complete-training-pipeline-example) 77 | - Structuring the Code: Setup, Data Loading, Model, Training, Evaluation 78 | - Putting It All Together (Conceptual Flow) 79 | 80 | ## Introduction to Neural Network Training 81 | 82 | - **The Goal: Learning from Data** 83 | The primary objective of training a neural network is to enable it to learn patterns and relationships from a given dataset. This learned knowledge allows the model to make accurate predictions or classifications on new, unseen data. 84 | - **Core Components Revisited:** 85 | - **Model:** The neural network architecture (e.g., an MLP, CNN) defined using `nn.Module`. 86 | - **Data:** Input features and corresponding target labels, typically split into training, validation, and test sets. 
87 | - **Loss Function:** A function that measures the discrepancy between the model's predictions and the true target values (e.g., `nn.CrossEntropyLoss` for classification, `nn.MSELoss` for regression). 88 | - **Optimizer:** An algorithm (e.g., SGD, Adam from `torch.optim`) that adjusts the model's parameters (weights and biases) to minimize the loss function. 89 | - **The Iterative Process: Epochs and Batches** 90 | - **Epoch:** One complete pass through the entire training dataset. 91 | - **Batch:** The training dataset is often divided into smaller subsets called batches. The model's weights are updated after processing each batch. This makes training more computationally manageable and can lead to faster convergence. 92 | 93 | ## Preparing Your Data with `Dataset` and `DataLoader` 94 | 95 | PyTorch provides convenient utilities for handling data. 96 | 97 | - **`torch.utils.data.Dataset`:** An abstract class for representing a dataset. You can create custom datasets by subclassing it and implementing `__len__` (to return the size of the dataset) and `__getitem__` (to support indexing and return a single sample). 98 | - **`torch.utils.data.DataLoader`:** Wraps a `Dataset` and provides an iterable over the dataset. It handles batching, shuffling, and parallel data loading. 99 | - **Data Augmentation and Transformation:** Often applied within the `Dataset` or via `torchvision.transforms` to increase data diversity and improve model generalization. 100 | 101 | ```python 102 | from torch.utils.data import Dataset, DataLoader 103 | import torchvision.transforms as transforms 104 | 105 | class MyCustomDataset(Dataset): 106 | def __init__(self, data, targets, transform=None): 107 | self.data = data 108 | self.targets = targets 109 | self.transform = transform 110 | 111 | def __len__(self): 112 | return len(self.data) 113 | 114 | def __getitem__(self, idx): 115 | sample = self.data[idx] 116 | target = self.targets[idx] 117 | if self.transform: 118 | sample = self.transform(sample) 119 | return sample, target 120 | 121 | # Example usage: 122 | # train_data, train_targets = ... 123 | # train_dataset = MyCustomDataset(train_data, train_targets, transform=transforms.ToTensor()) 124 | # train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True) 125 | ``` 126 | 127 | ## The Essential Training Loop 128 | 129 | The core of neural network training. Here's a breakdown of a typical single epoch: 130 | 131 | ```python 132 | # Assume model, train_loader, criterion, optimizer, device are defined 133 | # model = YourModel().to(device) 134 | # criterion = nn.CrossEntropyLoss() 135 | # optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 136 | 137 | def train_one_epoch(model, train_loader, criterion, optimizer, device): 138 | model.train() # 1. Set model to training mode 139 | running_loss = 0.0 140 | correct_predictions = 0 141 | total_samples = 0 142 | 143 | # 2. Iterate through data batches 144 | for batch_idx, (inputs, targets) in enumerate(train_loader): 145 | inputs, targets = inputs.to(device), targets.to(device) 146 | 147 | # 3. Zeroing gradients 148 | optimizer.zero_grad() 149 | 150 | # 4. Forward pass: Getting predictions 151 | outputs = model(inputs) 152 | 153 | # 5. Calculating the loss 154 | loss = criterion(outputs, targets) 155 | 156 | # 6. Backward pass: Computing gradients 157 | loss.backward() 158 | 159 | # 7. Optimizer step: Updating weights 160 | optimizer.step() 161 | 162 | # 8. 
Tracking metrics 163 | running_loss += loss.item() * inputs.size(0) 164 | _, predicted_classes = outputs.max(1) 165 | total_samples += targets.size(0) 166 | correct_predictions += predicted_classes.eq(targets).sum().item() 167 | 168 | epoch_loss = running_loss / total_samples 169 | epoch_accuracy = correct_predictions / total_samples 170 | return epoch_loss, epoch_accuracy 171 | ``` 172 | 173 | ## Validation: Evaluating Model Performance 174 | 175 | Validation helps monitor overfitting and assess how well the model generalizes to unseen data. 176 | 177 | - **Train-Validation-Test Splits:** 178 | - **Training Set:** Used to train the model. 179 | - **Validation Set:** Used to tune hyperparameters and make decisions about the training process (e.g., early stopping). 180 | - **Test Set:** Used for a final, unbiased evaluation of the trained model. Should only be used once. 181 | - **`model.eval()`:** Sets the model to evaluation mode. This is important for layers like Dropout and BatchNorm, which behave differently during training and evaluation. 182 | - **`torch.no_grad()`:** A context manager that disables gradient computation, reducing memory usage and speeding up inference during validation/testing. 183 | 184 | ```python 185 | # Assume model, val_loader, criterion, device are defined 186 | def validate_one_epoch(model, val_loader, criterion, device): 187 | model.eval() # 1. Set model to evaluation mode 188 | running_loss = 0.0 189 | correct_predictions = 0 190 | total_samples = 0 191 | 192 | with torch.no_grad(): # 2. Disable gradient computation 193 | for inputs, targets in val_loader: 194 | inputs, targets = inputs.to(device), targets.to(device) 195 | outputs = model(inputs) 196 | loss = criterion(outputs, targets) 197 | 198 | running_loss += loss.item() * inputs.size(0) 199 | _, predicted_classes = outputs.max(1) 200 | total_samples += targets.size(0) 201 | correct_predictions += predicted_classes.eq(targets).sum().item() 202 | 203 | epoch_loss = running_loss / total_samples 204 | epoch_accuracy = correct_predictions / total_samples 205 | return epoch_loss, epoch_accuracy 206 | ``` 207 | 208 | - **K-Fold Cross-Validation:** For smaller datasets, split the data into K folds. Train on K-1 folds and validate on the remaining fold. Repeat K times, averaging the performance metrics. Provides a more robust estimate of model performance. 209 | 210 | ## Saving and Loading Models 211 | 212 | It's essential to save your trained model for later use or to resume training. 213 | 214 | - **Saving/Loading `state_dict` (Recommended):** This saves only the model's learnable parameters (weights and biases). 215 | ```python 216 | # Saving 217 | # torch.save(model.state_dict(), 'model_weights.pth') 218 | 219 | # Loading 220 | # model_architecture = YourModel(*args, **kwargs) # Recreate model instance first 221 | # model_architecture.load_state_dict(torch.load('model_weights.pth')) 222 | # model_architecture.to(device) # Don't forget to move to device 223 | # model_architecture.eval() # Set to eval mode if using for inference 224 | ``` 225 | - **Saving Checkpoints:** Save model `state_dict`, optimizer `state_dict`, epoch, loss, etc., to resume training. 
226 | ```python 227 | # checkpoint = { 228 | # 'epoch': epoch, 229 | # 'model_state_dict': model.state_dict(), 230 | # 'optimizer_state_dict': optimizer.state_dict(), 231 | # 'loss': loss, 232 | # # any other metrics 233 | # } 234 | # torch.save(checkpoint, 'checkpoint.pth') 235 | ``` 236 | 237 | ## Hyperparameter Tuning Strategies 238 | 239 | - **Common Hyperparameters:** Learning rate, batch size, number of epochs, optimizer choice, hidden layer sizes, activation functions, dropout rate, weight decay. 240 | - **Manual Search:** Experimenting based on intuition and observation. 241 | - **Grid Search:** Defining a grid of hyperparameter values and trying all combinations. Computationally expensive. 242 | - **Random Search:** Randomly sampling hyperparameter combinations. Often more efficient than grid search. 243 | - **Advanced Tools:** Libraries like Optuna, Ray Tune, or services like Weights & Biases Sweeps automate the search process using more sophisticated algorithms (e.g., Bayesian optimization). 244 | 245 | ## Learning Rate Scheduling 246 | 247 | Dynamically adjusting the learning rate can lead to better performance and faster convergence. 248 | 249 | - **`torch.optim.lr_scheduler`:** Provides various schedulers. 250 | ```python 251 | from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau 252 | 253 | # optimizer = torch.optim.Adam(model.parameters(), lr=0.01) 254 | # scheduler_steplr = StepLR(optimizer, step_size=10, gamma=0.1) # Reduce LR by factor of 0.1 every 10 epochs 255 | # scheduler_plateau = ReduceLROnPlateau(optimizer, 'min', patience=5, factor=0.5) # Reduce if val_loss plateaus 256 | 257 | # In training loop, after optimizer.step(): 258 | # if isinstance(scheduler, ReduceLROnPlateau): 259 | # scheduler.step(validation_loss) # For ReduceLROnPlateau 260 | # else: 261 | # scheduler.step() # For most other schedulers 262 | ``` 263 | 264 | ## Regularization Techniques to Prevent Overfitting 265 | 266 | Overfitting occurs when a model learns the training data too well, including its noise, and performs poorly on unseen data. 267 | 268 | - **L1 and L2 Regularization (Weight Decay):** Add a penalty to the loss function based on the magnitude of model weights. L2 regularization (weight decay) is common and can be added directly in PyTorch optimizers: 269 | `optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)` 270 | - **Dropout (`nn.Dropout`):** Randomly zeros out a fraction of neuron outputs during training, forcing the network to learn more robust features. 271 | - **Early Stopping:** Monitor validation loss and stop training if it doesn't improve for a certain number of epochs. 272 | - **Data Augmentation:** Artificially increasing the size and diversity of the training dataset. 273 | 274 | ## Gradient Clipping 275 | 276 | Helps prevent exploding gradients (gradients becoming very large), which can destabilize training, especially in RNNs. 277 | 278 | - **`torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`:** Clips the L2 norm of all gradients together. 279 | - **`torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)`:** Clips individual gradient values to be within `[-clip_value, clip_value]`. 280 | Call this *after* `loss.backward()` and *before* `optimizer.step()`. 281 | 282 | ## Weight Initialization Strategies 283 | 284 | Proper initialization helps prevent vanishing or exploding gradients and speeds up convergence. 285 | 286 | - **`torch.nn.init`:** Contains various initialization functions. 
287 |   - **Xavier/Glorot:** Good for layers with Sigmoid/Tanh activations. (`nn.init.xavier_uniform_`, `nn.init.xavier_normal_`)
288 |   - **Kaiming/He:** Good for layers with ReLU activations. (`nn.init.kaiming_uniform_`, `nn.init.kaiming_normal_`)
289 | 
290 | ```python
291 | # def initialize_weights(m):
292 | #     if isinstance(m, nn.Linear):
293 | #         nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
294 | #         if m.bias is not None:
295 | #             nn.init.constant_(m.bias, 0)
296 | # model.apply(initialize_weights)
297 | ```
298 | 
299 | ## Batch Normalization (`nn.BatchNorm1d`, `nn.BatchNorm2d`)
300 | 
301 | Normalizes a layer's outputs across the current batch by subtracting the batch mean and dividing by the batch standard deviation. This speeds up training and has a mild regularizing effect.
302 | Remember to use `model.train()` and `model.eval()` appropriately, as BatchNorm layers behave differently in the two modes (batch statistics vs. running statistics).
303 | 
304 | ## Monitoring Training with TensorBoard
305 | 
306 | TensorBoard is a powerful visualization toolkit for inspecting and understanding your model's training process.
307 | 
308 | - **`torch.utils.tensorboard.SummaryWriter`:** The main class for logging data to TensorBoard.
309 | ```python
310 | # from torch.utils.tensorboard import SummaryWriter
311 | # writer = SummaryWriter('runs/my_experiment_name')
312 | # writer.add_scalar('Training Loss', epoch_loss, global_step=epoch)
313 | # writer.add_scalar('Validation Accuracy', epoch_accuracy, global_step=epoch)
314 | # writer.add_histogram('fc1.weights', model.fc1.weight, global_step=epoch)
315 | # writer.close()
316 | ```
317 | 
318 | ## A Complete Training Pipeline Example
319 | 
320 | A full pipeline integrates data loading, model definition, the training loop, validation, schedulers, saving, and monitoring. The accompanying Python script (`training_neural_networks.py`) will provide a concrete example of these components working together.
321 | 
322 | ## Running the Tutorial
323 | 
324 | To run the Python script associated with this tutorial:
325 | ```bash
326 | python training_neural_networks.py
327 | ```
328 | For an interactive experience, you can copy the code from the Python script into a Jupyter notebook (e.g. `training_neural_networks.ipynb`) and run it cell by cell.
329 | 
330 | ## Prerequisites
331 | - Python 3.7+
332 | - PyTorch 2.0+ (matching the repository's `requirements.txt`)
333 | - NumPy
334 | - Matplotlib (for visualization)
335 | - Scikit-learn (optional, for utilities like KFold or datasets)
336 | - TensorBoard (optional, for advanced monitoring: `pip install tensorboard`)
337 | 
338 | ## Related Tutorials
339 | 1. [PyTorch Basics](../01_pytorch_basics/README.md)
340 | 2. [Neural Networks Fundamentals](../02_neural_networks_fundamentals/README.md)
341 | 3. [Automatic Differentiation](../03_automatic_differentiation/README.md)
342 | 4. [Data Loading and Preprocessing](../05_data_loading_preprocessing/README.md)
--------------------------------------------------------------------------------